Speech recognition of Czech—inclusion of rare words helps


Brief Introduction of the 2018 ASME PVP Conference
Abstract: The ASME (American Society of Mechanical Engineers) Pressure Vessels and Piping (PVP) Conference is the leading international conference in the field of pressure vessels and piping. This paper introduces the topics and sub-topics of the 2018 ASME PVP Conference held in Prague, Czech Republic; compares the numbers of accepted papers and the research areas of the major participating countries, including the United States, Japan and China, at the PVP conferences from 2014 to 2017; points out development trends and hot issues in the pressure vessel and piping field; and reviews the papers published by Chinese authors and the student paper competition from 2008 to 2018. Keywords: pressure vessel; pressure piping; development trend; student paper competition
CLC number: TH49; T-2    Document code: B    Article ID: 1001-4837(2018)09-0070-08    doi: 10.3969/j.issn.1001-4837.2018.09.012
Brief Introduction of ASME 2018 PVP Conference
QIN Yin-kang1, SHI Jian-feng1, DENG Gui-de2, LIU Ying-hua3, CHENG Guang-xu4, CHEN Xue-dong5, FAN Zhi-chao5, JIA Guo-dong6, TU Shan-tung7, ZHENG Jin-yang1 (1. College of Energy Engineering, Zhejiang University, Hangzhou 310027, China; 2. China Special Equipment Inspection and Research Institute, Beijing 100029, China; 3. Department of Engineering Mechanics, Tsinghua University, Beijing 100084, China; 4. School of Chemical Engineering and Technology, Xi'an Jiaotong University, Xi'an 710049, China; 5. Hefei General Machinery Research Institute Co., Ltd., Hefei 230031, China; 6. State Administration for Market Regulation of the People's Republic of China, Beijing 100088, China; 7. School of Mechanical and Power Engineering, East China University of Science and Technology, Shanghai 200237, China)

East China Normal University Songjiang Experimental Senior High School (Shanghai): 2023 Senior Three First-Semester Mid-Term English Online Test (Complete Version)

Read the passage below and fill in each blank with one appropriate word or the correct form of the word given in brackets.

Success Stories

Mercy Cherono is one of many very successful young athletes from Kenya. She was born in 1991 in the village of Kipajit. She is the oldest of six children, and some of the other children in her family are also athletes. Her father, John Koech, runs a training camp in the village. During the school holidays, the camp 【1】attract over 50 trainees. Cherono started running in primary school and continued when she went to secondary school in the nearby town of Sotik. At the age of 16, she participated in the 2007 International Association of Athletics Federations (IAAF) World Cross Country Championships 【2】(hold) in Mombasa, Kenya. It was her first international event. 【3】the fact that she finished 23rd in the junior race, she had launched herself into international athletics. In the same year, at the World Youth Championships at Ostrava in the Czech Republic, she won a gold medal in the 3,000m.

Special Lecture: Speech Recognition and Voiceprint (Speaker) Recognition
3) The resonant frequencies of the vocal tract show up as dark bands. Voiced segments are characterized by a striped pattern, because the time-domain waveform is periodic there; during unvoiced intervals the display is denser instead.
4) The "voiceprint" is used for speaker recognition.
Figure: spectrogram of a female voice saying "他去无锡市，我去黑龙江" ("He goes to Wuxi, I go to Heilongjiang").
1 Fundamentals of Speech Signal Processing
1.1 Production of the Speech Signal
load mtlb                               % example speech recording shipped with MATLAB (variables mtlb and Fs)
specgram(mtlb,512,Fs,kaiser(500,5),475) % 512-point FFT, Kaiser(500,5) window, 475-sample overlap
title('Spectrogram')
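For readers without MATLAB, an equivalent spectrogram can be computed with SciPy. This snippet is an illustrative stand-in rather than part of the original notes; the WAV file name "speech.wav" is a placeholder for any speech recording.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs, x = wavfile.read("speech.wav")                    # placeholder file name
f, t, Sxx = spectrogram(x, fs=fs, window=("kaiser", 5.0),
                        nperseg=500, noverlap=475, nfft=512)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))      # plot on a dB scale
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.title("Spectrogram")
plt.show()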
The smallest unit of a word is the syllable, and the smallest unit of a sentence is the word.
1 Fundamentals of Speech Signal Processing
1.1 Production of the Speech Signal
2) Characteristics of Chinese (Mandarin) speech: The sound system is simple. In Chinese, one character is one syllable, generally made up of 2-3 phonemes, so the language has few phonemes and few syllables. In English, a word consists of several syllables (generally 2-3), and a syllable consists of several phonemes (generally 1-4). Chinese has many unvoiced consonants, which sound clear and bright as well as soft and comfortable to the ear. It has distinct stressed and unstressed syllables and retroflex (erhua) finals, so words are clearly separated and the language can be expressed accurately and richly.
1 Fundamentals of Speech Signal Processing
Research on Speech
Research on speech covers two aspects:
1) The arrangement of the individual sounds in speech is governed by a set of rules; the study of these rules and their meaning is called linguistics. Linguistics is a foundation of speech signal processing. For example, syntactic and semantic information can be used to narrow the search and matching space in speech recognition and thus raise the recognition accuracy.
2) The study of the physical properties and the classification of the individual sounds of speech is called phonetics. It is concerned with processes such as speech production and speech perception, and with the features and classification of the individual sounds.
Figure: anatomy of the larynx (anterior direction, thyroid cartilage, glottis, vocal folds, cricoid cartilage).
1 Fundamentals of Speech Signal Processing
1.1 Production of the Speech Signal
When we speak, the cartilages bring the vocal folds close together without letting them close completely, so that the glottis becomes a narrow slit. As air from the trachea passes through the larynx, the tensed vocal folds are set into vibration by the airflow, repeatedly opening and closing and sending a train of air puffs up through the glottis.

An Overview of Speech Recognition Activities at LIMSI

An Overview of Speech Recognition Activities at LIMSI Jean-Luc Gauvain,Gilles Adda,Martine Adda-Decker,Claude Barras, Langzhou Chen,Mich`e le Jardino,Lori Lamel,Holger SchwenkSpoken Language Processing Group(http://www.limsi.fr/tlp)LIMSI-CNRS,B.P.133,91403Orsay cedex,FranceABSTRACTThis paper provides an overview of recent activities at L IMSI in multilingual speech recognition and its applications.The main goal of speech recognition is to provide a transcription of the speech signal as a sequence of words.Speech recognition is a core technology for most applications involving voice technol-ogy.The two main classes of applications currently addressed are transcription and indexation of broadcast data and spoken lan-guage dialog systems for information access.Speaker-independent,large vocabulary,continuous speech recognition systems for different European languages(French, German and British English)and for American English and Man-darin Chinese have been developed.These systems rely on sup-porting research in acoustic-phonetic modeling,lexical modeling and language modeling.1.INTRODUCTIONSpeech recognition and related application areas have been a long term research topic at LIMSI,going back to the early1980’s.Our aim is to develop basic speech rec-ognizer technology that is speaker-independent and task-independent,which at least means that using very large (ideally unlimited)recognition vocabularies.The systems also have to be robust with respect to background and channel noise and changes in microphone and microphone placement.They must have the ability to deal with sponta-neous speech,including well-known disfluencies. Speech recognition performance is acknowledged to be heavily dependent on the correctness of the acoustic and language models used,and the recognition lexicon.These are areas of active research in the group.The different sources of variability present in the speech signal is ad-dressed by the use of different modeling techniques.How-ever,what is considered non-pertinent for word recognition can be quite relevant from another perspective.A variable number of labels can be associated with the acoustic mod-els(phoneme,talker’s gender,identity,dialect,language, ...)and depending upon the application,the recognition system can eventually identify the acoustic or channel con-ditions,the speaker,and the language along with the lin-guistic content encoded in the speech signal.In addition to our main research activites,substantial ef-fort is devoted to system evaluation and to the develop-ment of speech corpora.Concerning evaluation,L IMSIwas one of thefirst non-American labs to participate in annual DARPA evaluations of speech recognition technol-ogy and has regularly participated in benchmark tests since 1992,on tasks ranging from Resource Management to Wall Street journal and most recently,broadcast news tran-scription.These evaluations permit the comparison of dif-ferent systems on the same test data,using common train-ing materials and test protocols.The LIMSI system consis-tently obtained top level performance(always among the top3sites),achieving the highest word accuracy on four of the baseline tests.Evaluation of multilingual systems wascarried out in the context of the LRE project SQALE,and the LIMSI recognition system achieved the lowest word error rate in the1997Aupelf sponsored evaluation of read-speech in French.In1998and1999we participated in the 1999TREC-8and2000TREC-9SDR evaluation for re-trieval of audio documents.Research contracts cover most of the groups 
activities, the most recent European projects being:A LERT,C ORE-TEX,E CHO,O LIVE,as well as French National projects from the DGA(D´e l´e gation G´e n´e rale de l’Armement),andRNRT(V ocadis and Theoreme).The group also partici-pates or has participated in several projects related to the development and distribution of linguistic resources and evaluation.In the following sections we give a brief overview of ourresearch activities related to developing core speech recog-nition technology and applying this technolgoy to vari-ous languages,and mention some of our recent research projects.Recent advances have been in techniques for robust acoustic feature extraction and normalization;im-proved training techniques which can take advantage of very large audio and textual corpora;algorithms for audio segmentation;unsupervised acoustic model adaptation;ef-ficient decoding with long span language models;ability to use very large vocabularies.Much of recent progress can be linked to the availability of large speech and text corpora and simultaneous advances made in computational means and storage,which have facilitated the implementation of more complex models and algorithms.France-China Speech Processing Workshop,Beijing,Gauvain et al712.CORE TECHNOLOGYSpeech recognition is principally concerned with the problem of transcribing the speech signal as a sequence of words.The L IMSI system,in common with most of today’s state-of-the-art systems,makes use of statistical models of speech generation.From this point of view, message generation is represented by a language model which provides an estimate of the probability of any given word string,and the encoding of the message in the acous-tic signal is represented by a probability density function (HMM).The speech decoding problem then consists of maximizing the a posteriori probability of the word string given the observed acoustic signal.One of our primary objectives is to improve core tech-nology for speaker-independent,continuous speech recog-nition so as to minimize the expected performance degra-dation under mismatched training/testing conditions.Our research thus focuses on extending the capabilities of the system to deal with unlimited-vocabulary speech in a va-riety of acoustical conditions,with different background environmental noise,unknown channels and microphones. The LIMSI speaker-independent,large vocabulary,con-tinuous speech recognizer makes use of continuous density hidden Markov models(HMMs)with Gaussian mixtures for acoustic modeling.The acoustic and language models are trained on large,representative corpora for each task and language[8,12].Each word is represented by one or more sequences of context-dependent phone models(intra and interword)as determined by its lexical transcription. The acoustic parameterization is based on a cepstral repre-sentation of the speech signal.Word recognition is carried out in one or more decoding passes with more accurate acoustic and language models used in successive passes.For many applications there are limitations on the response time and the available compu-tational resources,which in turn can significantly affect the design of the acoustic and language models.For each op-erating point,the right balance between model complexity and search speed must be found to optimize performance. 
A4-gram single pass decoder which uses cross-word mod-els has been implemented.The decoder makes use of a variety of techniques to reduce the search space and com-putation time,such as LM state conditioned lexicon trees, acoustic and language model lookahead,predictive prun-ing and fast Gaussian likelihood computation[7].We have recently carried out some experiments to inves-tigate the use of voting schemes to combine transcriptions produced by different speech recognizers.An extended ROVER algorithm has been developed that incorporates language model information and is of particular interest when the outputs from only a few recognizers are com-bined[26].Speaker-independence is achieved by estimating the pa-rameters of the acoustic models on large speech corpora containing data from a large speaker population(sev-eral hundreds to thousands of speakers).Local contex-tual variation is modeled by using large sets of context-dependent phone models,and the variability associated with the acoustic environment,the microphone and trans-mission channel are accounted for by adapting the acoustic models to the particular conditions or by using an explicit model of the channel.Regularities and local syntactic constraints are captured via-gram models,which attempt to account for the syn-tactic and semantic constraints by estimating the proba-bility of a word in a sentence given the preceding n-1 words.Given a large corpus of texts(or transcriptions) it may seem relatively straightforward to construct n-gram language models.The main considerations are the choice of the vocabulary and the definition of words,such as the treatment of compound words or acronyms,and the choice of the backoff strategy.There is,however,a significant amount of effort needed to process the texts before they can be used.One motivation for normalization is to reduce lexical variability so as to increase the coverage for afixed size task vocabulary.Normalization decisions are gener-ally language-specific.For example,some standard pro-cessing steps include the expansion of numerical expres-sions,treatment of isolated letters and letter sequences,and optionally elimination of case distinction.Further semi-automatic processing is necessary to correct frequent errors inherent in the texts,and the expansion of abbreviations and acronyms.The error correction consists primarily of correcting obvious misspellings.Better language models are obtained by using texts transformed to be closer to the observed reading style,where the transformation rules and corresponding probabilities are automatically derived by aligning prompt texts with the transcriptions of the acous-tic data[13].An essential component of the transcription system is the recognition lexicon which provides the link between the lexical entries(usually words)used by the language model and the acoustic models.Lexical modeling consists of defining the recognition vocabulary and associating one or more phonetic transcriptions for each word in the vocab-ulary.Each lexical entry is described as a sequence of el-ementary units,usually phonemes.We explicitly represent silence,breath noise,and afiller words with specific sym-bols.The American English pronunciations are based on a48phone set,whereas for French and German sets of37 and49phonemes are used,respectively.A pronunciation graph is associated with each word so as to allow for alter-nate pronunciations.For the Mandarin language a set of40 phones is used.Since Mandarin is a tone-based language, withfive different tones associated 
with the syllables,we investigate explicitly modeling tone in the pronunciation lexicon.For pronunciations with tone,we chose to distin-guish only3tones:flat(tones1and5),rising(tones2and 3),falling(tone4).3.PROCESSING AUDIO STREAMSA major advance in speech recognition technology is the ability of todays systems to deal with non-homogeneous data as is exemplified by broadcast data.With the rapid ex-pansion of different media sources for information dissem-ination,there is a pressing need for automatic processing of the audio data stream.A variety of near-term applications are possible such as audio data mining,selective dissemi-nation of information,media monitoring services,disclo-sure of the information content and content-based index-ation for digital libraries,etc.Broadcast news shows are challenging to transcribe as they contain signal segments of various acoustic and linguistic nature,with abrupt or gradual transitions between segments.The signal may be of studio quality or have been transmitted over a telephone or other noisy channel(ie.,corrupted by additive noise and nonlinear distorsions),as well as speech over music and pure music segments.The speech is produced by a wide variety of speakers:news anchors and talk show hosts,re-porters in remote locations,interviews with politicians and common people,unknown speakers,speakers with strong regional accents,non-native speakers,etc.The linguistic style ranges from prepared speech to spontaneous speech. Two principle types of problems are encountered in tran-scribing broadcast news data:those relating to the varied acoustic properties of the signal,and those related to the linguistic properties of the speech.Problems associated with the acoustic signal properties are handled using ap-propriate signal analyses,by classifying the signal accord-ing to segment type and by training specific acoustic mod-els for the different acoustic conditions.Transcribing such data requires significantly higher pro-cessing power than what is needed to transcribe read speech data in a controlled environment,such as for speaker adapted dictation.Although it is usually assumed that processing time is not a major issue since computer power has been increasing continuously,it is also known that the amount of data appearing on information channels is increasing at a close rate.Therefore processing time is an important factor in making a speech transcription sys-tem viable for audio data mining and other related applica-tions.The L IMSI broadcast news automatic indexation sys-tem[11]consists of an audio partitioner[9],a speech rec-ognizer[10,12]and an indexation module[6].The tran-scription components are shown in Figure1. 
Partitioning the Audio streamThe goal of audio partitioning is to divide the acoustic signal into homogeneous segments,removing non-speech segments,and labeling and structuring the acoustic con-tent of the data.The result of the partitioning process is a set of speech segments with cluster,gender and tele-phone/wideband labels,which can be used to generate metadata annotations.While it is possible to transcribe the continuous stream of audio data without any prior seg-mentation,partitioning offers several advantages over this straight-forward solution.First,in addition to the tran-scription of what was said,other interesting information can be extracted such as the division into speaker turns and the speaker identities,and background acoustic con-ditions.Second,by clustering segments from the same speaker,acoustic model adaptation can be carried out on a per cluster basis,as opposed to on a single segment ba-sis,thus providing more adaptation data.Third,prior seg-mentation can avoid problems caused by linguistic discon-tinuity at speaker changes.Fourth,by using acoustic mod-els trained on particular acoustic conditions(such as wide-band or telephone band),overall performance can be sig-nificantly improved.Finally,eliminating non-speech seg-ments and dividing the data into shorter segments(which can still be several minutes long),substantially reduces the computation time and simplifies decoding.The LIMSI partitioning approach relies on an audio stream mixture model[9].Each component audio source, representing a speaker in a particular background and channel condition,is in turn modeled by a GMM.The seg-ment boundaries and labels are jointly identified by an iter-ative maximum likelihood segmentation/clustering proce-dure using GMMs and agglomerative clustering. Speaker and Language RecognitionSpeaker and language recognition are a logical exten-sion of continuous speech recognition as basically the same speech models can be used.The basic idea is to con-struct feature-specific model sets for each non-linguistic speech feature to be identified,and to process the unknown speech by all model sets in parallel.Instead of retaining the recognized string(as is done in recognition),what is of in-terest is which of the model sets has the highest likelihood. 
The feature associated with that model set is then attributed to the signal[17].This is usually using phonotactic mod-els,i.e.without the use of lexical information.Another such feature which is commonly identified is the sex of the speaker,which is used to select sex-dependent models prior to word recognition.For high quality speech,identification of the speaker’s sex is essen-tially perfect,and in the few cases where errors are made better recognition results are usually obtained with the cho-sen models(of the opposite sex).Some references to our work on speaker and language identification can be found in[3,18,24].Word transcriptionFor each speech segment,the word recognizer deter-mines the sequence of words in the segment,associating start and end times and an optional confidence measure with each word.Word recognition is usually performed in two or more steps to allow unsupervised acoustic model adaptation.For audio stream data,the initial hypotheses are used in cluster-based acoustic model adaptation usingLexicon Acoustic modelsLanguage modelspeech acoustic modelsMusic, noise and telephone/non-tel modelsMale/female models Figure 1:Overview of transcription system for audio stream.the MLLR technique [22]prior to word graph generation,and all subsequent decoding passes.Due to the availability of large corpora of audio and tex-tual data for broadcast news in American English,most efforts in transcription have focused on this language.To evaluate the generalizability of the underlying methods,we have applied the partitioning and transcription algorithms developed for American English broadcast news data to three other languages.The development of the French and German systems has been partially financed by the the EC LE4-O LIVE project and by the French Ministry of De-fense.The Mandarin language was chosen because it is quite different from the other languages (tone and syllable-based),and Mandarin resources are available via the LDC as well as reference performance results.Current state-of-the-art laboratory systems can tran-scribe unrestricted American English broadcast news data with word error rates under 20%.Our transcription sys-tems for French and German have comparable error rates for news broadcasts [1].The character error rate for Man-darin is also about 20%[2].Based on our experience,it appears that with appropriately trained models,recognizer performance is more dependent upon the type and source of data,than on the language.For example,we found that foreign documentaries are particularly challenging to tran-scribe,as the audio quality is often not too high,and there is a large proportion of voice over.Spoken Document RetrievalThe automatically generated partition and word tran-scription can be used to for indexation and information retrieval purposes.Spoken document retrieval (SDR)can support random access to relevant portions of audio or video documents,reducing the time needed to identify recordings in large multimedia databases.For such ap-plications,processing time is an important factor,and im-poses constraints on the development of acoustic and lan-guage models.The L IMSI spoken document indexing and retrieval system combines a state-of-the-art speech recog-nizer [12]with a text-based IR system [6].The same techniques commonly applied to automatic text indexation have been applied to automatic transcriptions of broad-cast news radio and TV documents.Such techniques areclassically based on document term frequencies,where the terms are obtained after standard text 
processing,such as text normalization,tokenization,stopping,stemming and named-entity identification.We have been investigat-ing two IR approaches,one based on Okapi term weight-ing and the other on a Markovian term weighting,com-bined with query expansion via Blind Relevance Feedback (BRF)using the audio corpus and a parallel text corpus.As part of the SDR’99TREC-8evaluation,500hours of unpartitioned,unrestricted American English broadcast data was indexed using both state-of-the-art speech rec-ognizer outputs and manually generated closed captioning .The average word error measured on a representative 10hour subset of this data was around 20%.The two IR approaches were shown to yield comparable results [14].Only small differences in information retrieval performance as given by the mean average precision were observed for automatic and manual transcriptions when the story boundaries are known.These results indicate that the transcription quality may not be the limiting factor on IR performance for current IR techniques.4.SPOKEN DIALOG SYSTEMS FORINFORMATION RETRIEV ALSpoken language systems (SLSs)aim to help a user ac-complish a task via interactive dialog.Task and domain knowledge must be used to define the vocabulary and the concepts specific to the application in order to construct appropriate acoustic,language and semantic models.Mod-elization of spontaneous speech effects,such as hesita-tions,false starts,and reparations,is particularly important for these systems.In contrast to dictation and transcrip-tion tasks where it is relatively straight-forward to select a recognition vocabulary from large written corpora,for spoken language systems there usually are no application-specific training data (acoustic or textual)available.A commonly adopted approach for data collection is to start with an initial system (that may involve a Wizard of Oz configuration)and to collect a set of data which can be used to start an iterative development cycle.In addition to a speaker-independent,continuous speech recognizer,a spoken language dialog system also includesFigure 2:Overview of a spoken language dialog system for information access (left).The MASK kiosk (right).components for natural spoken language understanding,dialog management,history management,database access,response generation,and speech synthesis.An overview of the L IMSI SLDS architecture is shown in Figure 2.The speech recognizer transforms the input signal into the most probable word sequence (or optionally a word graph),and forwards it to the rule-based natural language under-standing component,which generates a semantic frame.A mixed-initiative dialog manager prompts the user to supply any missing information needed for database access and then generates a database query.The retrieved informa-tion is transformed into natural language by the response generator (taking into account the dialog context)and pre-sented to the user.Synthesis by waveform concatenation is used to ensure high quality speech output,where dictio-nary units are put together according to the generated text string.It is becoming increasing clear that dialog man-agement and response generation play an important role in system design and user satisfaction [25].At L IMSI prototype systems to provide vocal access to train travel information have been developed in the con-text of several European projects,E SPRIT -M ASK [5]and LE R AIL T EL ,A RISE [19,21].Development of systems for more general tourist information (A UPELF )and control of household appliances 
(Tide H OME )are also underway.These systems have been tested in field trials with naive users.5.RECENT RESEARCH PROJECTSThe I DEAL project (CNET 1994-1998)on languageidentification over the telephone.This project explored different approaches to auto-matic language identification,and evaluated them us-ing a large,multilingual (French,English,German and Spanish)corpus of telephone speech designed for the task [15,23,24].E SPRIT M ASK Multimodal-multimedia Automated Service Kiosk project (1994-1998)(http://www.limsi.fr/tlp/mask.html)The aim of the M ASK project was to explore the use of a spoken language understanding system as part of an advanced user interface for a public service kiosk.A prototype information kiosk,designed to en-able the coordinated use of multimodal inputs (speech and touch)and multimedia output (sound,video,text,and graphics)was developed and evaluated in the zare train station in Paris [5,16].LE-3A RISE Automatic Railway Information Systems for Europe (1996-1999).The A RISE project developed speech recognition and spoken dialog technologies that were used in pro-totype services providing train schedule informa-tion [21].A UPELF -U REF (1995-1998)projects on “Linguis-tique,Informatique et Corpus Oraux”.Six projects covered the following areas:“Evaluation des syst`e mes de synth`e se,”“Evaluation des syst`e mes de reconnaissance,”“Evaluation des mod`e les de lan-gage,”“Evaluation des syst`e mes de dialogue,”“Cor-pus de textes,”“Corpus de parole”Project under the auspices of the D´e l´e gation G´e n´e rale de l’Armement (DGA)on the Automatic Indexationof Multilingual Broadcasts(1999-2002).This fol-lows a study concerned with the transcription of radio broadcasts and language identification(1997-1998). T IDE H OME-AOM project Home application Opti-mum Multimedia/multimodal system for Environ-ment control(1997-2000).http://www.swt.iao.fhg.de/home/index.htmlThe aim of this project was to develop a single,easy-to-use and coherent usage concept for all household appliances,with one of the communication modes be-ing a natural spoken language interface between the user and system[27].The E SPRIT-LTR D ISC(1997-1998)and D ISC2 (1999)projects Spoken Language Dialogue Systems and Components Best Practice in Development and Evaluationhttp://www.disc2.dk/The DISC projects aimed at identifying what consi-tutes best practice in spoken language dialog systems development and evaluation,with the objective of de-veloping reference methodologies[4,20].The LE-4O LIVE project A Multilingual Indexing Tool for Broadcast Material Based on Speech Recog-nition(1998-2000).http://twentyone.tpd.tno.nl/olive/The O LIVE project addressed methods to automate the disclosure of the information content of broadcast data thus allowing content-based indexation.Speech recognition was used to produce a time-linked tran-script of the audio channel of a broadcast,which was then used to produce a concept index for retrieval. 
The RNRT V ocadis project Reconnaissance de parole distribu´e e(1998-2000).http://www.telecom.gouv.fr/rnrt/projets/pvocadis.htm This project investigates the concept of”distributed speech recognition”,which aims to combine the com-puting power of a central server for speech recogni-tion with local acoustic parameter estimation to en-sure high quality analysis with low cost communica-tion.Language Engineering LE-5A LERT Alert system for selective dissemination project(2000-2002).http://www.fb9-ti.uni-duisburg.de/alertThe ALERT project aims to associate state-of-the-art speech recognition with audio and video segmenta-tion and automatic topic indexing to develop an auto-matic media monitoring demonstrator and evaluate it in the context of real world applications.The targeted languages are French,German and Portuguese.The RNRT Theoreme project Th´e matisation par re-connaissance vocale des m´e dias(1999-2001). http://www.telecom.gouv.fr/rnrt/indexREFERENCES[1]M.Adda-Decker,G.Adda,mel,“Investigatingtext normalization and pronunciation variants for Ger-man broadcast transcription,”Proc.ICSLP’2000,Beijing, China,October2000.[2]L.Chen,mel,G.Adda and J.L.Gauvain,“BroadcastNews Transcription in Mandarin,”Proc.ICSLP’2000,Bei-jing,China,October2000.[3] C.Corredor-Ardoy,mel,M.Adda-Decker,J.L.Gau-vain,“Multilingual Phone Recognition of Spontaneous Telephone Speech”,Proc.IEEE ICASSP-98,I,pp.413–416,Seattle,WA,mai1998.[4]L.Dybkjaer et al.“The DISC Approach to Developmentand Evaluation,”Proc.First International Conference on Language Resources and Evaluation,LREC’98,I,pp.185-189,Granada,Spain,May1998.[5]J.L.Gauvain,S.K.Bennacef,L.Devillers,mel,S.Rosset,“The Spoken Language Component of the Mask Kiosk,”in Human Comfort and Security of Information Sys-tems,Advanced Interfaces for the Information Society,ed-itors K.C.Varghese S.Pfleger,Springer Verlag,1997,pp 93–103.[6]J.L.Gauvain,Y.de Kercadio,mel,G.Adda“TheL IMSI SDR system for TREC-8,”Proc.of the8th Text Retrieval Conference TREC-8,pp.405-412,Gaithersburg, MD,November1999.[7]J.L.Gauvain,mel,“Fast Decoding for Indexation ofBroadcast Data,”Proc.ICSLP’2000,Beijing,China,Octo-ber2000.[8]J.L.Gauvain,mel,G.Adda,”The L IMSI1997Hub-4E Transcription System”,Proc.D ARPA Broadcast News Transcription&Understanding Workshop,pp.75-79, Landsdowne,V A February1998.[9]J.L.Gauvain,mel,G.Adda,“Partitioning and Tran-scription of Broadcast News Data,”ICSLP’98,5,pp.1335-1338,Dec.1998.[10]J.L.Gauvain,mel,G.Adda,“Recent Advances inTranscribing Television and Radio Broadcasts,”Proc.Eu-rospeech’99,2,pp.655-658,Budapest,Sept.1999. 
[11]J.L.Gauvain,mel,and G.Adda,“Transcribing broad-cast news for audio and video indexing,”Communications of the ACM,43(2),Feb2000.[12]J.L.Gauvain,mel,G.Adda and M.Jardino,”TheL IMSI1998Hub-4E Transcription System”,Proc.D ARPA Broadcast News Workshop,pp.99-104,Herndon,V A, February1999.[13]J.L.Gauvain,mel,M.Adda-Decker,“Develop-ments in continuous speech dictation using the ARPA WSJ task,”Proc.IEEE-ICASSP,pp.65-68,Detroit,May1995.[14]J.L.Gauvain,mel,Y.Kercadio et G.Adda,“Tran-scription and Indexation of Broadcast Data,Proc.IEEE ICASSP’00,Istanbul,June2000.[15]mel,G.Adda,M.Adda-Decker,C.Corredor-Ardoy,J.J.Gangolf,J.L.Gauvain,“A multilingual corpus for lan-guage identification,”Proc.First International Conference on Language Resources and Evaluation,LREC’98,II,pp.1115-1122,Granada,Spain,May1998.[16]mel,S.Bennacef,J.L.Gauvain,H.Dartigues,J.N.Temem,“User Evaluation of the Mask Kiosk,”Proc.IC-SLP’98,7,pp.2875-2878,Sydney,Australie,decembre 1998.[17]mel,J.L.Gauvain,“A phone-based approach to non-linguistic speech feature identification,”Computer Speech and Language,9(1):87-103,January1995.[18]mel,J.L.Gauvain,“SpeakerVerification over the Tele-phone,”Speech Communication,31(2-3),pp.141-154,June 2000.[19]mel,J.L.Gauvain,S.K.Bennacef,L.Devillers,S.Foukia,J.J.Gangolf,S.Rosset,“Field Trials of a Tele-phone Service for Rail Travel Information,”Speech Com-munication,23,pp.67–82,October1997.[20]mel,W.Minker,P.Paroubek,“Towards Best Practicein the Development and Evaluation of Speech Recognition Components of a Spoken Language Dialogue System,”Nat-ural Language Engineering“special issue on Best Practice in Spoken LanguageDialogue Systems Engineering”,to ap-pear2000.[21]mel,S.Rosset,J.L.Gauvain,S.Bennacef,M.Garnier-Rizet,B.Prouts“The L IMSI ARISE System,”Speech Com-munication,31(4),pp.339–354,August2000.[22] C.J.Leggetter,P.C.Woodland,“Maximum likelihood linearregression for speaker adaptation of continuous density hid-den Markov models,”Computer Speech&Language,9(2), pp.171-185,1995.[23] D.Matrouf,M.Adda-Decker,J.L.Gauvain,and mel,“Comparing different model configurations for language identification using a phonotactic approach,”Proc.ESCA EuroSpeech’99,pp.387-390,Budapest,sep1999.[24] D.Matrouf,M.Adda-Decker,mel,J.L.Gauvain,“Language identification incorporating lexical informa-tion”,Proc.ICSLP-98,2,pp.181-184,Sydney,Australie, decembre1998.[25]S.Rosset,S.Bennacef and mel,“Design strategiesfor spoken language dialog systems”,Proc.ESCA Eu-rospeech’99,pp.1535-1538,Budapest,sep1999.[26]H.Schwenk,J.L.Gauvain,“Combining Multiple SpeechRecognizers using V oting&Language Model Information,”Proc.ICSLP’2000,Beijing,China,October2000.[27]J.Shao,N.E.Tazine,mel,B.Prouts,and S.Schr¨oter,“An Open System Architecture for a Multimedia and Mul-timodal User Interface,”Proceedings of the3rd TIDE Congress,Helsinki,June23-251998.。

Speech Recognition

Speech RecognitionIn computer technology, Speech Recognition refers to the recognition of human speech by computers for the performance of speaker-initiated computer-generated functions (e.g., transcribing speech to text; data entry; operating electronic and mechanical devices; automated processing of telephone calls) —a main element of so-called natural language processing through computer speech technology.The Challenge of Speech RecognitionWriting systems are ancient, going back as far as the Sumerians of 6,000 years ago. The phonograph, which allowed the analog recording and playback of speech, dates to 1877. Speech recognition had to await the development of computer, however, due to multifarious problems with the recognition of speech.First, speech is not simply spoken text--in the same way that Miles Davis playing So What can hardly be captured by a note-for-note rendition as sheet music. What humans understand as discrete words, phrases or sentences with clear boundaries are actually delivered as a continuous stream of sounds: Iwenttothestoreyesterday, rather than I went to the store yesterday. Words can also blend, with Whaddayawa? representing What do you want? Second, there is no one-to-one correlation between the sounds and letters. In English, there are slightly more than five vowel letters--a, e, i, o, u, and sometimes y and w. There are more than twenty different vowel sounds, though, and the exact count can vary depending on the accent of the speaker. The reverse problem also occurs, where more than one letter can represent a given sound. The letter c can have the same sound as the letter k, as in cake, or as the letter s, as in citrus.History of Speech RecognitionDespite the manifold difficulties, speech recognition has been attempted for almost as long as there have been digital computers. As early as 1952, researchers at Bell Labs had developed an Automatic Digit Recognizer, or "Audrey". Audrey attained an accuracy of 97 to 99 percent if the speaker was male, and if the speaker paused 350 milliseconds between words, and if the speaker limited his vocabulary to the digits from one to nine, plus "oh", and if the machine could be adjusted to the speaker's speech profile. Results dipped as low as 60 percent if the recognizer was not adjusted.Audrey worked by recognizing phonemes, or individual sounds that were considered distinct from each other. The phonemes were correlated to reference models of phonemes that were generated by training the recognizer. Over the next two decades, researchers spent large amounts of time and money trying to improve upon this concept, with little success. Computer hardware improved by leaps and bounds, speech synthesis improved steadily, and Noam Chomsky's idea of generative grammar suggested that language could be analyzed programmatically. None of this, however, seemed to improve the state of the art in speech recognition.In 1969, John R. Pierce wrote a forthright letter to the Journal of the Acoustical Society of America, where much of the research on speech recognition was published. Pierce was one of the pioneers in satellite communications, and an executive vice president at Bell Labs, which was a leader in speech recognition research. Pierce said everyone involved was wasting time and money.It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. . . 
.The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn't attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamor.Pierce's 1969 letter marked the end of official research at Bell Labs for nearly a decade. The defense research agency ARPA, however, chose to persevere. In 1971 they sponsored a research initiative to develop a speech recognizer that could handle at least 1,000 words and understand connected speech, i.e., speech without clear pauses between each word. The recognizer could assume a low-background-noise environment, and it did not need to work in real time. By 1976, three contractors had developed six systems. The most successful system, developed by Carnegie Mellon University, was called Harpy. Harpy was slow—a four-second sentence would have taken more than five minutes to process. It also still required speakers to 'train' it by speaking sentences to build up a reference model. Nonetheless, it did recognize a thousand-word vocabulary, and it did support connected speech.Research continued on several paths, but Harpy was the model for future success. It used hidden Markov models and statistical modeling to extract meaning from speech. In essence, speech was broken up into overlapping small chunks of sound, and probabilistic models inferred the most likely words or parts of words in each chunk, and then the same model was appliedagain to the aggregate of the overlapping chunks. The procedure is computationally intensive, but it has proven to be the most successful. Throughout the 1970s and 1980s research continued. By the 1980s, most researchers were using hidden Markov models, which are behind all contemporary speech recognizers. In the latter part of the 1980s and in the 1990s, DARPA (the renamed ARPA) funded several initiatives. The first initiative was similar to the previous challenge: the requirement was still a one-thousand word vocabulary, but this time a rigorous performance standard was devised. This initiative produced systems that lowered the word error rate from ten percent to a few percent. Additional initiatives have focused on improving algorithms and improving computational efficiency.In 2001, Microsoft released a speech recognition system that worked with Office XP. It neatly encapsulated how far the technology had come in fifty years, and what the limitations still were. The system had to be trained to a specific user's voice, using the works of great authors that were provided, Even after training ,the system was fragile enough that a warning was provided, "If you change the room in which you use Microsoft Speech Recognition and your accuracy drops, run the Microphone Wizard again." On the plus side, the system did work in real time, and it did recognize connected speech.Speech Recognition TodayTechnologyCurrent voice recognition technologies work on the ability to mathematically analyze the sound waves formed by our voices through resonance and spectrum analysis. Computer systems first record the sound waves spoken into a microphone through a digital to analog converter. The analog or continuous sound wave that we produce when we say a word is sliced up into small time fragments. These fragments are then measured based on their amplit ude levels, the level of compression of air released from a person’s mouth. 
To measure the amplitudes and convert a sound wave to digital format the industry has commonly used the Nyquist-Shannon Theorem.Nyquist-Shannon TheoremThe Nyquist –Shannon theorem was developed in 1928 to show that a given analog frequency is most accurately recreated by a digital frequency that is twice the original analog frequency. Nyquist proved this was true because an audible frequency must be sampled once for compression and once for rarefaction. For example, a 20 kHz audio signal can be accurately represented as a digital sample at 44.1 kHz.Recognizing CommandsThe most important goal of current speech recognition software is to recognize commands. This increases the functionality of speech software. Software such as Microsost Sync is built into many new vehicles, supposedly allowing users to access all of the car’s electronic accessories, hands-free. This software is adaptive. It asks the user a series of questions and utilizes the pronunciation of commonly used words to derive speech constants. These constants are then factored into the speech recognition algorithms, allowing the application to provide better recognition in the future. Current tech reviewers have said the technology is much improved from the early 1990’s but will not be replacing hand controls any time soon.DictationSecond to command recognition is dictation. Today's market sees value in dictation software as discussed below in transcription of medical records, or papers for students, and as a more productive way to get one's thoughts down a written word. In addition many companies see value in dictation for the process of translation, in that users could have their words translated for written letters, or translated so the user could then say the word back to another party in their native language. Products of these types already exist in the market today.Errors in Interpreting the Spoken WordAs speech recognition programs process your spoken words their success rate is based on their ability to minimize errors. The scale on which they can dothis is called Single Word Error Rate (SWER) and Command Success Rate (CSR). A Single Word Error is simply put, a misunderstanding of one word in a spoken sentence. While SWERs can be found in Command Recognition Programs, they are most commonly found in dictation software. Command Success Rate is defined by an accurate interpretation of the spoken command. All words in a command statement may not be correctly interpreted, but the recognition program is able to use mathematical models to deduce the command the user wants to execute.Future Trends & ApplicationsThe Medical IndustryFor years the medical industry has been touting electronic medical records (EMR). Unfortunately the industry has been slow to adopt EMRs and some companies are betting that the reason is because of data entry. There isn’t enough people to enter the multitude of current patient’s data into electronic format and because of that the paper record prevails. A company called Nuance (also featured in other areas here, and developer of the software called Dragon Dictate) is betting that they can find a market selling their voice recognition software to physicians who would rather speak patients' data than handwrite all medical information into a person’s file.The MilitaryThe Defense industry has researched voice recognition software in an attempt to make complex user intense applications more efficient and friendly. 
Currently, voice recognition is being experimented with in aircraft cockpit displays, on the premise that the pilot could access needed data faster and more easily. Command centers are also looking to use voice recognition technology to search and access the vast amounts of database data under their control in a quick and concise manner during a crisis. In addition, the military has also adopted EMR for patient care, and has voiced its commitment to using voice recognition software for entering data into patients' records.

RECENT ADVANCES IN CANTONESE SPEECH RECOGNITION
1 Department of Electronic Engineering, 2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. Tel: 852 2609 8270  Fax: 852 2603 5558  Email: pcching@.hk
ABSTRACT
1 INTRODUCTION
2 ABOUT CANTONESE
Owing to its monosyllabic nature, Cantonese syllable recognition is considered an important basic task. It requires parallel identification of the base syllable as well as the lexical tone. These two sub-problems can, in general, be decoupled from each other and tackled separately and independently (Figure 1). As a matter of fact, recognition of the nine Cantonese tones relies on pitch-related features and duration, while base syllable recognition utilizes parameterized spectral information, such as LPC cepstral coefficients. In our work, tone recognition is performed by a three-layer feedforward neural network. Each output neuron represents a particular tone. The network input is a 5-dimensional feature vector extracted from the voiced portion of the syllable. Simulation experiments have shown a recognition accuracy of 87.6% in a multi-speaker application [6]. As for the recognition of base syllables, recurrent neural networks are employed due to their capability of modeling and memorizing temporal information. The base syllable recognizer consists of a large number of small RNNs, each being trained dedicatedly to model a specific base syllable. To cover the entire Cantonese syllabary, about 580 RNN syllable models are required and each of them contains 8-10 neurons. There are two main reasons for using such a modular structure instead of putting all syllables into a single network. Firstly, training of a large RNN is very time consuming and may have stability problems. By decomposing it into smaller sub-networks, the training efficiency is improved and the training process becomes more tractable, so that unstable cases can be identified and handled. Secondly, the recognition vocabulary can be expanded in a fairly straightforward manner by adding new syllable models. It does not require re-configuration and re-training of the whole system. This advantage is not so trivial for other languages which are not monosyllabic. To attain better discrimination among syllable models, a discriminative training algorithm has been devised based on generalized probabilistic descent (GPD) of the classification error. Furthermore, duration characteristics of phonetic segments are also utilized in the recognition process [7]. As shown in Figure 1, the N-best outputs of the tone recognizer and the base syllable recognizer are integrated to produce the final recognition result; a minimal sketch of this integration step is given after this paragraph. Phonological rules governing the combination of base syllables and lexical tones are embedded in the integrated decision criterion. Initially, a multi-speaker recognition system was built for only 40 commonly used Cantonese syllables. The vocabulary has then been progressively expanded to cover 80 and 120 syllables by taking advantage of the modular structure described above. The system performance is shown in Table 1. Further work is under way to both expand the vocabulary and include more speaker variations.
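The following is a minimal sketch of that N-best integration step, not the authors' code: it assumes the two recognizers emit (hypothesis, log-score) lists, that the scores of the decoupled sub-problems can simply be added, and that the phonological constraints are available as a set of legal (base syllable, tone) pairs; all names and values are illustrative.

def integrate_nbest(base_nbest, tone_nbest, legal_syllables):
    """base_nbest / tone_nbest: lists of (hypothesis, log_score) pairs;
    legal_syllables: set of (base, tone) pairs allowed by Cantonese phonology."""
    best, best_score = None, float("-inf")
    for base, base_score in base_nbest:
        for tone, tone_score in tone_nbest:
            if (base, tone) not in legal_syllables:   # phonological rule filter
                continue
            score = base_score + tone_score           # decoupled sub-problems: scores add
            if score > best_score:
                best, best_score = (base, tone), score
    return best, best_score

# Example: only ("si", 6) and ("zi", 2) are assumed to be legal combinations here.
print(integrate_nbest([("si", -3.1), ("zi", -3.4)],
                      [(6, -0.7), (2, -1.2)],
                      {("si", 6), ("zi", 2)}))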

Patriotic Recitation Speech (English Version) Courseware


Controlled pronunciation
Ensure accurate pronunciation of each word, especially difficult or easily confused sounds, to avoid misunderstanding and maintain the integrity of the text.
Inflection
Use vivid language and imagery to convey the emotional content of the text, painting a mental picture for listeners.
Vivid description
Strive to understand and convey the feelings and ideas of the poem, connecting with the audience through shared values and experiences.
The concept of patriotism has evolved over time, shaped by historical events and social movements.
In ancient times, patriotism was often linked to the concept of loyalty to the king or the head of state.
The anthem has been a source of inspiration for many Americans through history, representing the country's spirit and resilience. It was officially recognized as the national anthem in 1931.

recognition

RecognitionRecognition is the process of acknowledging or identifying something or someone. It plays a vital role in various aspects of our lives, including communication, security, and machine learning. In this document, we will discuss different types of recognition, their applications, and the technologies behind them.Facial RecognitionFacial recognition is a biometric technology that identifies or verifies individuals by analyzing their facial features. It has gained significant popularity and application in rec ent years. Facial recognition systems capture an image or video of a person’s face and analyze unique facial landmarks, such as the size and shape of the eyes, nose, and mouth.Applications of facial recognition are diverse. One of the most common uses is in security systems, where it can be used to grant or deny access to restricted areas based on facial recognition. It is also used in mobile devices and social media platforms for user identification and authentication.Facial recognition technology has been the subject of debate due to privacy concerns. Organizations and governments need to ensure that the data collected through facial recognition systems is properly secured and used within legal boundaries.Speech RecognitionSpeech recognition is a technology that converts spoken language into written text. It enables interaction between humans and machines through voice commands. This technology has improved significantly in recent years, especially with the development of deep learning algorithms.The applications of speech recognition are widespread. Virtual assistants like Siri, Alexa, and Google Assistant rely on this technology to understand and respond to user commands. It is also used in transcription services, where audio files are automatically converted into text, saving time and effort.Speech recognition technology has also found applications in healthcare, where it is used to transcribe medical records, facilitate communication with patients with speech impairments, and assist in language translation.Object RecognitionObject recognition is the process of identifying and classifying objects or entities within an image or video. It involves extracting meaningful information from visual input and mapping it to known objects or categories.One of the key applications of object recognition is in autonomous driving. Self-driving cars utilize object recognition to identify and track other vehicles, pedestrians, traffic signs, and obstacles to navigate safely on the road.Object recognition is also used in the field of augmented reality, where virtual objects are overlaid onto the real world. This technology enables various interactive experiences, such as gaming, visualization, and shopping.Pattern RecognitionPattern recognition is a branch of machine learning that focuses on the automatic discovery of regularities or patterns within data. It involves the extraction of features from input data and the use of algorithms to identify similarities or anomalies.Pattern recognition has a wide range of applications in diverse fields. In finance, it is used to predict stock market trends or detect fraudulent activities. In healthcare, it helps in the diagnosis of diseases based on symptoms or medical images. 
It is also widely used in image and speech recognition systems. With the advancement of deep learning algorithms and the availability of massive amounts of data, pattern recognition has become an essential tool for data analysis and decision-making processes.

Conclusion

Recognition technologies have revolutionized various aspects of our lives. Whether it is facial recognition for security, speech recognition for virtual assistants, object recognition for autonomous driving, or pattern recognition for data analysis, these technologies have made our lives more convenient and efficient. However, it is important to address the ethical and privacy concerns associated with these technologies. Stringent regulations and safeguards should be in place to protect individuals' rights and ensure responsible use of recognition systems. In conclusion, recognition technologies continue to evolve, and we can expect even more innovative and impactful applications in the future. As technology advances, it is crucial to strike a balance between the benefits and risks associated with recognition systems.

An English Essay on Robot Speech Recognition

Title: The Impact of Speech Recognition Technology on Writing

Introduction

In today's digital age, speech recognition technology has revolutionized the way we interact with computers and devices. This advancement has particularly influenced the process of writing, enabling users to dictate their thoughts and ideas rather than typing them out manually. This essay explores the impact of speech recognition technology on writing, discussing its benefits, challenges, and implications for the future.

Benefits of Speech Recognition Technology in Writing

One of the primary benefits of speech recognition technology in writing is its efficiency. By allowing users to speak their thoughts, ideas can be transcribed at a much faster pace compared to traditional typing. This is especially advantageous for individuals who may have difficulty typing quickly or accurately. Additionally, speech recognition technology can improve accessibility for those with physical disabilities, enabling them to express themselves more easily through writing.

Furthermore, speech recognition technology can enhance productivity by enabling multitasking. Users can dictate their ideas while performing other tasks, such as driving or cooking, allowing them to make the most of their time. This seamless integration into daily activities can lead to increased creativity and output in writing.

Moreover, speech recognition technology can help overcome writer's block. When faced with a blank page, speaking aloud can stimulate ideas and facilitate the writing process. The act of verbalizing thoughts can often lead to a flow of ideas that may not have emerged through traditional typing.

Challenges of Speech Recognition Technology in Writing

Speech Recognition Techniques in Oral English Learning

A dissertation submitted to Shanghai Jiao Tong University for the degree of Master. Title: Speech Recognition Techniques in Oral English Learning. Specialty: Computer Application Technology. Candidate: Chen Yining (Yining Chen). Advisor: Prof. Shen Ruimin (Ruimin Shen). School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, December 2009.

Abstract

With the development of computer technology, computer-assisted instruction is being used more and more widely in education. With its help, people can now learn languages much more conveniently, and the rich graphics and sound processing capabilities of computers have greatly improved the effectiveness of language learning. Current research in this area focuses on exploring effective language learning methods that combine speech recognition with multimedia technology, and developing teaching software capable of recognizing and judging speech has become a hot topic in this kind of language teaching. Although language teaching software is developing rapidly, most products lack good evaluation of, and feedback on, the learner's spoken output; yet in pronunciation learning, effective feedback is one of the greatest helps to the learner, and its absence has become a major bottleneck restricting the development of intelligent learning software.

To address this problem, this thesis first introduces the fundamentals of speech recognition. Based on the English pronunciation habits of Chinese speakers, a targeted speech corpus is built and, in line with the needs of learners of spoken English whose native language is Chinese, HMM-based speech recognition is used to perform Viterbi decoding of the speech and to score the recognition result by means of posterior probabilities. Finally, an expert knowledge base is used for phoneme-level error correction, resulting in a system that integrates the three functional modules of decoding, scoring and error correction. The main work concentrates on pronunciation scoring and error-correction algorithms built on top of speech recognition, solving a series of practical difficulties in applying speech recognition technology to spoken-language learning, especially in pronunciation error correction.
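As a rough illustration of the posterior-probability scoring mentioned above (not the thesis implementation), the sketch below computes a goodness-of-pronunciation-style score for one forced-aligned phone segment: the acoustic log-likelihood of the phone the learner was supposed to say, normalized against the best competing phone and the segment length. The per-frame likelihood function, the phone inventory and the frame sequence are all assumptions made for the sake of the example.

def segment_loglik(loglik_fn, phone, frames):
    """Sum per-frame log-likelihoods of `frames` under the acoustic model of `phone`."""
    return sum(loglik_fn(phone, f) for f in frames)

def gop_score(loglik_fn, target_phone, frames, phone_inventory):
    """Posterior-style pronunciation score for one aligned phone segment.

    Approximates log P(target | frames) by the target log-likelihood minus the
    log-likelihood of the best-matching phone, normalized by segment length.
    """
    target = segment_loglik(loglik_fn, target_phone, frames)
    best = max(segment_loglik(loglik_fn, p, frames) for p in phone_inventory)
    return (target - best) / max(len(frames), 1)   # 0 when the target is also the best match

A score near zero then means the learner's realization matches the intended phone about as well as any phone in the inventory, while strongly negative scores flag candidate mispronunciations to be checked against the expert rule base.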


Speech Recognition of Czech - Inclusion of Rare Words Helps

Petr Podveský and Pavel Machek
Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
podvesky,machek@ufal.mff.cuni.cz

Abstract

Large vocabulary continuous speech recognition of inflective languages, such as Czech, Russian or Serbo-Croatian, is heavily deteriorated by an excessive out-of-vocabulary rate. In this paper, we tackle the problem of vocabulary selection, language modeling and pruning for inflective languages. We show that by explicit reduction of the out-of-vocabulary rate we can achieve significant improvements in recognition accuracy while almost preserving the model size. Reported results are on Czech speech corpora.

1 Introduction

Large vocabulary continuous speech recognition of inflective languages is a challenging task for mainly two reasons. Rich morphology generates a huge number of forms which are not captured by limited-size dictionaries and therefore leads to worse recognition results. Relatively free word order admits an enormous number of word sequences and thus impoverishes n-gram language models. In this paper we are concerned with the former issue.

Previous work dealing with excessive vocabulary growth goes mainly along two lines: authors have either decided to break words into sub-word units or to adapt dictionaries in a multi-pass scenario. On Czech data, (Byrne et al., 2001) suggest using linguistically motivated recognition units. Words are broken down into stems and endings and used as the recognition units in the first recognition phase; in the second phase, stems and endings are concatenated. On Serbo-Croatian, (Geutner et al., 1998) also tested morphemes as the recognition units. Both groups of authors agreed that this approach is not beneficial for speech recognition of inflective languages. Vocabulary adaptation, however, brought considerable improvement. Both (Ircing and Psutka, 2001) on Czech and (Geutner et al., 1998) on Serbo-Croatian reported substantial reductions of word error rate. Both followed the same procedure: in the first pass, they used a dictionary composed of the most frequent words; the generated lattices were then processed to get a list of all words which appeared in them; this list served as the basis for a new adapted dictionary into which morphological variants were added.

It can be concluded that large corpora contain a host of words which are ignored during estimation of the language models used in the first pass, despite the fact that these rare words can bring substantial improvement. Therefore, it is desirable to explore how to incorporate rare or even unseen words into a language model which can be used in a first pass.

2 Language Model

Language models used in the first pass of current speech recognition systems are usually built in the following way. First, a text corpus is acquired; in the case of broadcast news, a newspaper collection or news transcriptions are a good source. Second, the most frequent words are picked out to form a dictionary. Dictionary size is typically in the tens of thousands of words; for English, for example, dictionaries of 60k words sufficiently cover common domains. (Of course, for recognition of entries listed in the Yellow Pages, such limited dictionaries are clearly inappropriate.) Third, an n-gram language model is estimated.
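The dictionary-selection step and the out-of-vocabulary (OOV) rate it implies can be sketched in a few lines of Python. This is an illustrative sketch only; the function names and the 60k default are not part of the paper.

```python
from collections import Counter

def build_dictionary(corpus_tokens, size=60000):
    """Keep the most frequent word forms as the first-pass recognition dictionary."""
    counts = Counter(corpus_tokens)
    return {w for w, _ in counts.most_common(size)}

def oov_rate(test_tokens, dictionary):
    """Share of running test tokens that fall outside the dictionary."""
    misses = sum(1 for w in test_tokens if w not in dictionary)
    return misses / len(test_tokens)
```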
In the case of a Katz back-off model, the conditional bigram word probability is estimated as

$$
P(w_i \mid w_{i-1}) =
\begin{cases}
\hat{P}(w_i \mid w_{i-1}) & \text{if } c(w_{i-1} w_i) > 0 \\
\beta(w_{i-1}) \, \hat{P}(w_i) & \text{otherwise}
\end{cases}
\qquad (1)
$$

where $\hat{P}$ represents a smoothed probability distribution, $\beta$ stands for the back-off weight, and $c(\cdot)$ denotes the count of its argument. A back-off model can also be nicely viewed as a finite-state automaton, as depicted in Figure 1.

[Figure 1: A fragment of a bigram back-off model represented as a finite-state automaton.]

To alleviate the problem of a high OOV rate, we suggest gathering supplementary words and adding them into the model in the following way:

$$
P'(w_i \mid w_{i-1}) =
\begin{cases}
P(w_i \mid w_{i-1}) & \text{if } w_i \in D \\
\beta(w_{i-1}) \, P_S(w_i) & \text{if } w_i \in S
\end{cases}
\qquad (2)
$$

where $P$ refers to the regular back-off model, $D$ denotes the regular dictionary from which the back-off model was estimated, and $S$ is the supplementary dictionary, which does not overlap with $D$.

Several sources can be exploited to obtain supplementary dictionaries. Morphology tools can derive words which are close to those observed in the corpus. In such a case, $P_S$ can be set to a constant function and estimated on held-out data to maximize recognition accuracy:

$$
P_S(w) = C \quad \text{for } w \text{ generated by morphology} \qquad (3)
$$

Having prior domain knowledge, new words which are expected to appear in the audio recordings might be collected and added into $S$. Consider the example of transcribing an ice-hockey match: the names of new players are desirably in the vocabulary. Another source of $S$ are the words which fell below the selection threshold of $D$. In large corpora, there are hundreds of thousands of words which are omitted from the estimated language model; we suggest putting them into $S$. As it turned out, the unigram probability of these words is very low, so it is suitable to increase their score to make them competitive with other words in $D$ during recognition. $P_S$ is then computed as

$$
P_S(w) = f(w) \cdot \text{shift} \qquad (4)
$$

where $f(w)$ refers to the relative frequency of $w$ in a given corpus and shift denotes a shifting factor which should be tuned on some held-out data.

[Figure 2: A fragment of a bigram back-off model injected by a supplementary dictionary.]

Note that the probability of a word given its history is no longer a proper probability; it does not add up to one. We decided not to normalize the model for two reasons. First, we used a decoder which searches for the best path using the Viterbi criterion, so there is no need for normalization. Second, normalization would have involved recomputing all back-off model weights and could also enforce re-tuning of the language model scaling factor. To rule out any variation which the re-tuning of the scaling factor could bring, we decided not to normalize the new model.

In the finite-state representation, injection of a new dictionary was implemented as depicted in Figure 2: supplementary words form a loop in the back-off state.
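A minimal sketch of the injected model of Eqs. (1)-(4), assuming log-domain scores stored in plain Python dictionaries. This is not the authors' finite-state implementation (they compile the supplementary loop into the back-off state of an automaton), and the multiplicative reading of the shift factor in Eq. (4) is an assumption of the sketch.

```python
import math

class InjectedBackoffBigram:
    """Katz back-off bigram with a supplementary-word loop in the back-off state.

    All scores are natural-log probabilities:
      bigram[(v, w)] = log P^(w | v) for bigrams seen in training
      unigram[w]     = log P^(w) for words in the regular dictionary D
      backoff[v]     = log beta(v), the back-off weight of history v
      supp[w]        = log P_S(w) for supplementary words S (disjoint from D)
    """

    def __init__(self, bigram, unigram, backoff, supp):
        self.bigram = bigram
        self.unigram = unigram
        self.backoff = backoff
        self.supp = supp

    def log_score(self, w, v):
        """Log-score of word w after word v; unnormalized once supp words are added."""
        if (v, w) in self.bigram:           # seen bigram, upper branch of Eq. (1)
            return self.bigram[(v, w)]
        beta = self.backoff.get(v, 0.0)     # log beta(v); 0.0 means weight 1 for unseen history
        if w in self.unigram:               # regular word: back off to its unigram score
            return beta + self.unigram[w]
        if w in self.supp:                  # supplementary word reached via the loop, Eq. (2)
            return beta + self.supp[w]
        return float("-inf")                # still out of vocabulary


def constant_supp_scores(words, log_c):
    """Eq. (3): one tuned constant score for all morphology-generated forms."""
    return {w: log_c for w in words}


def shifted_unigram_supp_scores(counts, total, log_shift):
    """Eq. (4) as reconstructed here: relative frequency boosted by a shift factor."""
    return {w: math.log(c / total) + log_shift for w, c in counts.items()}
```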
3 Experiments

We have evaluated our approach on two corpora, Czech Broadcast News and the Czech portion of the MALACH data.

3.1 Czech Broadcast News Data

The Czech Broadcast News corpus (Radová et al., 2004) is a collection of both radio and TV news in Czech. Weather forecasts, traffic announcements and sport news were excluded from this corpus. Our training portion comprises 22 hours of speech. To tune the language model scaling factor and additional LM parameters, we set aside 100 sentences. The test set consists of 2500 sentences. We used the HTK toolkit (Young et al., 1999) to extract acoustic features from the sampled signal and to estimate acoustic models. As acoustic features we used 12 Mel-Frequency Cepstral Coefficients plus energy, with delta and delta-delta features. We trained a triphone acoustic model with tied mixtures of continuous density Gaussians.

As the LM training corpus we exploited a collection of newspaper articles from the Lidové Noviny (LN) newspaper. This collection was published as a part of the Prague Dependency Treebank by LDC (Hajič et al., 2001). The corpus contains 33 million tokens and its vocabulary contains more than 650k word forms. OOV rates are displayed in Table 1.

Table 1: OOV rate on Czech Broadcast News for dictionaries of different sizes.

  Dict. size   OOV
  60k          6.92%
  124k         2.23%
  658k         —

[The remaining Czech Broadcast News results survive only as table fragments: WER 18.89% and 18.40%, WER 19.52% and 17.91%, oracle WER 29.17%, 27.44%, 26.12% and 25.21%, and CLG sizes of 106MB (Baseline 60k), 115MB (60k+Uniform), 115MB (60k+Unigram) and 441MB.]

3.2 MALACH Data

To obtain a supplementary vocabulary, we used the Czech morphology tools (Hajič and Vidová-Hladká, 1998). Out of 41k words we generated 416k words which are the inflected forms of the words observed in the corpus. Note that we posed restrictions on the generation procedure to avoid obsolete, archaic and uncommon expressions. To do so, we ran a Czech tagger on the transcriptions and thus obtained a list of all morphological tags of the observed forms; morphological generation was then confined to this set of tags (a sketch of this restriction follows at the end of this section). Since there is no corpus to train unigram scores of the generated words on, we set the LM score of the generated forms to a constant.

The transcriptions are not the only source of text data in the MALACH project: (Psutka et al., 2004) searched the Czech National Corpus (CNC) for sentences which are similar to the transcriptions. This additional corpus contains almost 16 million words and 330k word forms. The CNC vocabulary overlaps to a large extent with the TR vocabulary, which is not surprising since the selection criterion was based on a lemma unigram probability. Table 6 summarizes the OOV rates of several dictionaries.

Table 6: OOV for several dictionaries. TR and CNC denote the transcriptions and the Czech National Corpus, respectively. Morph refers to the dictionary generated by the morphology tools from TR. Numbers in the dictionary names represent the dictionary size.

  Dictionary     Size   OOV
  TR             41k    5.07%
  TR41k+Morph    416k   2.74%
  TR41k+CNC      60k    3.04%
  TR41k+CNC      100k   2.62%
  TR41k+CNC      160k   2.25%
  TR41k+CNC      329k   1.76%
  All together          1.46%

We estimated several language models. The baseline models are pruned bigram back-off models with Kneser-Ney smoothing. The baseline word error rate of the model built solely from the transcriptions was 37.35%. We injected a constant loop of morphological variants into this model. In terms of text coverage, this reduced the OOV rate from 5.07% to 2.74%; in terms of recognition word error rate, we observed a relative improvement of 3.5%.

In the next experiment we took as the baseline LM a linear interpolation of the LM built from the transcriptions and a model estimated from the CNC corpus. Into this model, we injected a unigram loop of all the available words, that is, the rest of the words from the CNC corpus with unigram scores and the words provided by morphology which were not already in the model. Table 7 summarizes the achieved WER and oracle WER. Given that the injection only slightly reduced the OOV rate, a small relative reduction of 2.3% matched our expectations.

[Table 7: Word error rate and oracle WER for baseline and injected models; only fragments survive (37.35%, Morph 12.48%, 100k 11.95%, TR41k+CNC 33.67%, 160k 11.65%).]

[Table 8: Disk usage of tested models; only fragments survive (TR41k 5.6MB, TR41k+Morph 11MB, 100k 53MB, TR41k+CNC 307MB, 160k 59MB). G refers to a language model compiled into an automaton; CLG denotes the triphone-to-word transducer. TR and CNC refer to a LM estimated from the transcriptions and the Czech National Corpus, respectively. Morph represents the loop of words generated by morphology. Inj is the loop of all words from the CNC which were not included in the CNC language model; moreover, Inj also contains the words generated by morphology.]
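The tag-restricted generation procedure of Section 3.2 could look roughly as follows; `analyze` and `generate` are hypothetical stand-ins for the cited Czech tagger and morphology tools, so this is a sketch of the restriction logic rather than of the tools' actual interfaces.

```python
def tag_restricted_expansion(observed_words, analyze, generate):
    """Expand observed word forms into inflectional variants, keeping only
    variants whose morphological tag was itself observed in the transcriptions.

    analyze(word)  -> set of morphological tags observed for this form
    generate(word) -> iterable of (form, tag) pairs over the word's paradigm
    """
    observed_tags = set()
    for w in observed_words:
        observed_tags |= analyze(w)

    supplementary = set()
    for w in observed_words:
        for form, tag in generate(w):
            # Discard obsolete or uncommon variants: keep a generated form only
            # if its tag was actually seen in the transcriptions.
            if tag in observed_tags and form not in observed_words:
                supplementary.add(form)
    return supplementary
```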
4 Conclusion

In this paper, we have suggested injecting a loop of supplementary words into the back-off state of a first-pass language model. As it turned out, the addition of rare or morphology-generated words into a language model can considerably decrease both the recognition word error rate and the oracle WER in a single recognition pass. In the recognition of Czech Broadcast News, we achieved a 13.6% relative improvement in terms of word error rate; in terms of oracle error rate, we observed more than 30% relative improvement. On the MALACH data, we attained only a marginal word error rate reduction: since the text corpora already covered the transcribed speech relatively well, a smaller OOV reduction translated into a smaller word error rate reduction. In the near future, we would like to test our approach on agglutinative languages, where the problems with high OOV are even more challenging. We would also like to experiment with more complex language models.

5 Acknowledgements

We would like to thank our colleagues from the University of Western Bohemia for providing us with acoustic models. This work has been done under the support of the project of the Ministry of Education of the Czech Republic No. MSM0021620838 and the grant of the Grant Agency of the Charles University (GAUK) No. 375/2005.

References

W. Byrne, J. Hajič, P. Ircing, F. Jelinek, S. Khudanpur, P. Krbec, and J. Psutka. 2001. On large vocabulary continuous speech recognition of highly inflectional language - Czech. In Eurospeech 2001.

P. Geutner, M. Finke, and P. Scheytt. 1998. Adaptive Vocabularies for Transcribing Multilingual Broadcast News. In ICASSP, Seattle, Washington.

Jan Hajič and Barbora Vidová-Hladká. 1998. Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset. In Proceedings of the Conference COLING-ACL '98, pages 483-490, Montreal, Canada.

Jan Hajič, Eva Hajičová, Petr Pajas, Jarmila Panevová, Petr Sgall, and Barbora Vidová-Hladká. 2001. Prague Dependency Treebank 1.0. Linguistic Data Consortium (LDC), catalog number LDC2001T10.

P. Ircing and J. Psutka. 2001. Two-Pass Recognition of Czech Speech Using Adaptive Vocabulary. In TSD, Železná Ruda, Czech Republic.

M. Mohri, F. Pereira, and M. Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16:69-88.

J. Psutka, P. Ircing, V. Radová, and J. V. Psutka. 2004. Issues in annotation of the Czech spontaneous speech corpus in the MALACH project. In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal.

Vlasta Radová, Josef Psutka, Luděk Müller, William Byrne, J. V. Psutka, Pavel Ircing, and Jindřich Matoušek. 2004. Czech Broadcast News Speech. Linguistic Data Consortium (LDC), catalog number LDC2004S01.

A. Stolcke. 1998. Entropy-based pruning of backoff language models. In Proceedings of the ARPA Workshop on Human Language Technology.

S. Young et al. 1999. The HTK Book. Entropic Inc.
