Towards Spontaneous Speech Recognition For On-Board Car Navigation And Information Systems
Foreign Literature Translation --- Speaker Recognition

Appendix A: English Source Text

Speaker Recognition
By Judith A. Markowitz, J. Markowitz Consultants

Speaker recognition uses features of a person's voice to identify or verify that person. It is a well-established biometric with commercial systems that are more than 10 years old and deployed non-commercial systems that are more than 20 years old. This paper describes how speaker recognition systems work and how they are used in applications.

1. Introduction

Speaker recognition (also called voice ID and voice biometrics) is the only human-biometric technology in commercial use today that extracts information from sound patterns. It is also one of the most well-established biometrics, with deployed commercial applications that are more than 10 years old and non-commercial systems that are more than 20 years old.

2. How do Speaker-Recognition Systems Work?

Speaker-recognition systems use features of a person's voice and speaking style to:
● attach an identity to the voice of an unknown speaker
● verify that a person is who she/he claims to be
● separate one person's voice from other voices in a multi-speaker environment

The first operation is called speaker identification or speaker recognition; the second has many names, including speaker verification, speaker authentication, voice verification, and voice recognition; the third is speaker separation or, in some situations, speaker classification. This paper focuses on speaker verification, the most highly commercialized of these technologies.

2.1 Overview of the Process

Speaker verification is a biometric technology used for determining whether a person is who she or he claims to be. It should not be confused with speech recognition, a non-biometric technology used for identifying what a person is saying. Speech recognition products are not designed to determine who is speaking.

Speaker verification begins with a claim of identity (see Figure A1). Usually, the claim entails manual entry of a personal identification number (PIN), but a growing number of products allow spoken entry of the PIN and use speech recognition to identify the numeric code. Some applications replace manual or spoken PIN entry with bank cards, smartcards, or the number of the telephone being used. PINs are also eliminated when a speaker-verification system contacts the user, an approach typical of systems used to monitor home-incarcerated criminals.

Figure A1.

Once the identity claim has been made, the system retrieves the stored voice sample (called a voiceprint) for the claimed identity and requests spoken input from the person making the claim. Usually, the requested input is a password. The newly input speech is compared with the stored voiceprint, and the results of that comparison are measured against an acceptance/rejection threshold. Finally, the system accepts the speaker as the authorized user, rejects the speaker as an impostor, or takes another action determined by the application. Some systems report a confidence level or other score indicating how confident they are about the decision.

If the verification is successful, the system may update the acoustic information in the stored voiceprint. This process is called adaptation. Adaptation is an unobtrusive solution for keeping voiceprints current and is used by many commercial speaker-verification systems.
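The accept/reject step described above can be illustrated with a minimal sketch. The data structures, the cosine-similarity comparison, and the threshold value are assumptions made for the example, not the interface of any commercial product:

```python
# Minimal sketch of the verification flow: claim of identity -> retrieve the
# stored voiceprint -> compare with the new sample -> threshold decision.
# Feature extraction is assumed to have already produced fixed-length vectors.
from dataclasses import dataclass
from typing import Dict, Sequence

@dataclass
class VerificationResult:
    accepted: bool
    score: float    # similarity between the new sample and the stored voiceprint
    margin: float   # distance from the accept/reject threshold (a rough "confidence")

def similarity(sample: Sequence[float], voiceprint: Sequence[float]) -> float:
    """Placeholder comparison: cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(sample, voiceprint))
    norm = (sum(a * a for a in sample) ** 0.5) * (sum(b * b for b in voiceprint) ** 0.5)
    return dot / norm if norm else 0.0

def verify(claimed_id: str,
           sample: Sequence[float],
           voiceprints: Dict[str, Sequence[float]],
           threshold: float = 0.7) -> VerificationResult:
    """Retrieve the enrolled voiceprint for the claimed identity, compare, and threshold."""
    voiceprint = voiceprints[claimed_id]
    score = similarity(sample, voiceprint)
    return VerificationResult(accepted=score >= threshold,
                              score=score,
                              margin=abs(score - threshold))
```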
2.2 The Speech Sample

As with all biometrics, before verification (or identification) can be performed the person must provide a sample of speech (called enrolment). The sample is used to create the stored voiceprint.

Systems differ in the type and amount of speech needed for enrolment and verification. The basic divisions among these systems are:
● text dependent
● text independent
● text prompted

2.2.1 Text Dependent

Most commercial systems are text dependent. Text-dependent systems expect the speaker to say a pre-determined phrase, password, or ID. By controlling the words that are spoken, the system can look for a close match with the stored voiceprint. Typically, each person selects a private password, although some administrators prefer to assign passwords. Passwords offer extra security, requiring an impostor to know the correct PIN and password and to have a matching voice. Some systems further enhance security by not storing a human-readable representation of the password.

A global phrase may also be used. In its 1996 pilot of speaker verification, Chase Manhattan Bank used 'Verification by Chemical Bank'. Global phrases avoid the problem of forgotten passwords, but lack the added protection offered by private passwords.

2.2.2 Text Independent

Text-independent systems ask the person to talk. What the person says is different every time. It is extremely difficult to accurately compare utterances that are totally different from each other - particularly in noisy environments or over poor telephone connections. Consequently, commercial deployment of text-independent verification has been limited.

2.2.3 Text Prompted

Text-prompted systems (also called challenge response) ask speakers to repeat one or more randomly selected numbers or words (e.g. "43516", "27, 46", or "Friday, computer"). Text prompting adds time to enrolment and verification, but it enhances security against tape recordings. Since the items to be repeated cannot be predicted, it is extremely difficult to play a recording. Furthermore, there is no problem of forgetting a password, even though the PIN, if used, may still be forgotten.

2.3 Anti-speaker Modelling

Most systems compare the new speech sample with the stored voiceprint for the claimed identity. Other systems also compare the newly input speech with the voices of other people. Such techniques are called anti-speaker modelling. The underlying philosophy of anti-speaker modelling is that under any conditions a voice sample from a particular speaker will be more like other samples from that person than voice samples from other speakers. If, for example, the speaker is using a bad telephone connection and the match with the speaker's voiceprint is poor, it is likely that the scores for the cohorts (or world model) will be even worse.

The most common anti-speaker techniques are:
● discriminant training
● cohort modelling
● world models

Discriminant training builds the comparisons into the voiceprint of the new speaker using the voices of the other speakers in the system. Cohort modelling selects a small set of speakers whose voices are similar to that of the person being enrolled. Cohorts are, for example, always the same sex as the speaker. When the speaker attempts verification, the incoming speech is compared with his/her stored voiceprint and with the voiceprints of each of the cohort speakers. World models (also called background models or composite models) contain a cross-section of voices. The same world model is used for all speakers.
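As a rough illustration of the cohort idea, the claimed speaker's score can be normalized by the average score of the cohort, so that a degraded channel hurts both terms roughly equally. This sketch reuses the similarity() helper from the example above; the cohort set and the decision margin are assumptions for illustration only:

```python
# Illustrative sketch of cohort-style score normalization (Section 2.3).
# Reuses similarity() from the previous sketch; margin value is an assumption.
from statistics import mean
from typing import Iterable, Sequence

def normalized_score(sample: Sequence[float],
                     claimed_voiceprint: Sequence[float],
                     cohort_voiceprints: Iterable[Sequence[float]]) -> float:
    """Score the claimed speaker relative to a cohort of acoustically similar voices."""
    claimed = similarity(sample, claimed_voiceprint)
    cohort = mean(similarity(sample, vp) for vp in cohort_voiceprints)
    # On a bad telephone line both terms drop, so the difference stays informative.
    return claimed - cohort

def accept(sample, claimed_voiceprint, cohort_voiceprints, margin: float = 0.1) -> bool:
    return normalized_score(sample, claimed_voiceprint, cohort_voiceprints) >= margin
```

A world model works the same way, except that a single background voiceprint is used in place of the per-speaker cohort average.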
2.4 Physical and Behavioural Biometrics

Speaker recognition is often characterized as a behavioural biometric. This description is set in contrast with physical biometrics, such as fingerprinting and iris scanning. Unfortunately, its classification as a behavioural biometric promotes the misunderstanding that speaker recognition is entirely (or almost entirely) behavioural. If that were the case, good mimics would have no difficulty defeating speaker-recognition systems. Early studies determined this was not the case and identified mimic-resistant factors. Those factors reflect the size and shape of a speaker's speaking mechanism (called the vocal tract).

The physical/behavioural classification also implies that performance of physical biometrics is not heavily influenced by behaviour. This misconception has led to the design of biometric systems that are unnecessarily vulnerable to careless and resistant users. This is unfortunate because it has delayed good human-factors design for those biometrics.

3. How is Speaker Verification Used?

Speaker verification is well established as a means of providing biometric-based security for:
● telephone networks
● site access
● data and data networks
and monitoring of:
● criminal offenders in community release programmes
● outbound calls by incarcerated felons
● time and attendance

3.1 Telephone Networks

Toll fraud (theft of long-distance telephone services) is a growing problem that costs telecommunications services providers, government, and private industry US$3-5 billion annually in the United States alone. The major types of toll fraud include the following:
● Hacking CPE
● Calling card fraud
● Call forwarding
● Prisoner toll fraud
● Hacking 800 numbers
● Call sell operations
● 900 number fraud
● Switch/network hits
● Social engineering
● Subscriber fraud
● Cloning wireless telephones

Among the most damaging are theft of services from customer premises equipment (CPE), such as PBXs, and cloning of wireless telephones. Cloning involves stealing the ID of a telephone and programming other phones with it. Subscriber fraud, a growing problem in Europe, involves enrolling for services, usually under an alias, with no intention of paying for them.

Speaker verification has two features that make it ideal for telephone and telephone-network security: it uses voice input and it is not bound to proprietary hardware. Unlike most other biometrics that need specialized input devices, speaker verification operates with standard wireline and/or wireless telephones over existing telephone networks. Reliance on input devices created by other manufacturers for a purpose other than speaker verification also means that speaker verification cannot expect the consistency and quality offered by a proprietary input device. Speaker verification must overcome differences in input quality and the way in which speech frequencies are processed. This variability is produced by differences in network type (e.g. wireline v wireless), unpredictable noise levels on the line and in the background, transmission inconsistency, and differences in the microphone in the telephone handset. Sensitivity to such variability is reduced through techniques such as speech enhancement and noise modelling, but products still need to be tested under expected conditions of use.

Applications of speaker verification on wireline networks include secure calling cards, interactive voice response (IVR) systems, and integration with security for proprietary network systems. Such applications have been deployed by organizations as diverse as the University of Maryland, the Department of Foreign Affairs and International Trade Canada, and AMOCO. Wireless applications focus on preventing cloning but are being extended to subscriber fraud.
The European Union is also actively applying speaker verification to telephony in various projects, including Caller Verification in Banking and Telecommunications, COST250, and Picasso.

3.2 Site Access

The first deployment of speaker verification more than 20 years ago was for site-access control. Since then, speaker verification has been used to control access to office buildings, factories, laboratories, bank vaults, homes, pharmacy departments in hospitals, and even access to the US and Canada. Since April 1997, the US Immigration and Naturalization Service (INS) and other US and Canadian agencies have been using speaker verification to control after-hours border crossings at the Scobey, Montana port-of-entry. The INS is now testing a combination of speaker verification and face recognition in the commuter lane of other ports-of-entry.

3.3 Data and Data Networks

Growing threats of unauthorized penetration of computing networks, concerns about security of the Internet, and increases in off-site employees with data-access needs have produced an upsurge in the application of speaker verification to data and network security.

The financial services industry has been a leader in using speaker verification to protect proprietary data networks, electronic funds transfer between banks, access to customer accounts for telephone banking, and employee access to sensitive financial information. The Illinois Department of Revenue, for example, uses speaker verification to allow secure access to tax data by its off-site auditors.

3.4 Corrections

In 1993, there were 4.8 million adults under correctional supervision in the United States, and that number continues to increase. Community release programmes, such as parole and home detention, are the fastest-growing segments of this industry. It is no longer possible for corrections officers to provide adequate monitoring of those people.

In the US, corrections agencies have turned to electronic monitoring systems. Since the late 1980s, speaker verification has been one of those electronic monitoring tools. Today, several products are used by corrections agencies, including an alcohol breathalyzer with speaker verification for people convicted of driving while intoxicated and a system that calls offenders on home detention at random times during the day.

Speaker verification also controls telephone calls made by incarcerated felons. Inmates place a lot of calls. In 1994, US telecommunications services providers made $1.5 billion on outbound calls from inmates. Most inmates have restrictions on whom they can call. Speaker verification ensures that an inmate is not using another inmate's PIN to make a forbidden contact.

3.5 Time and Attendance

Time and attendance applications are a small but growing segment of the speaker-verification market. SOC Credit Union in Michigan has used speaker verification for time and attendance monitoring of part-time employees for several years. Like many others, SOC Credit Union first deployed speaker verification for security and later extended it to time and attendance monitoring for part-time employees.

4. Standards

This paper concludes with a short discussion of application programming interface (API) standards. An API contains the function calls that enable programmers to use speaker verification to create a product or application. Until April 1997, when the Speaker Verification API (SV API) standard was introduced, all available APIs for biometric products were proprietary.
SV API remains the only API standard covering a specific biometric. It is now being incorporated into proposed generic biometric API standards. SV API was developed by a cross-section of speaker-recognition vendors, consultants, and end-user organizations to address a spectrum of needs and to support a broad range of product features. Because it supports both high-level functions (e.g. calls to enrol) and low-level functions (e.g. choices of audio input features), it facilitates development of different types of applications by both novice and experienced developers.

Why is it important to support API standards? Developers using a product with a proprietary API face difficult choices if the vendor of that product goes out of business, fails to support its product, or does not keep pace with technological advances. One of those choices is to rebuild the application from scratch using a different product. Given the same events, developers using an SV API-compliant product can select another compliant vendor and need perform far fewer modifications. Consequently, SV API makes development with speaker verification less risky and less costly. The advent of generic biometric API standards further facilitates integration of speaker verification with other biometrics. All of this helps speaker-verification vendors because it fosters growth in the marketplace. In the final analysis, active support of API standards by developers and vendors benefits everyone.

Appendix B: Chinese Translation

Speaker Recognition
By Judith A. Markowitz, J. Markowitz Consultants

Speaker recognition uses features of a person's voice to identify or verify that person.
Hu Zhuanglin (胡壮麟), Linguistics: A Course Book (Revised Edition) -- Selected Classroom Notes and Lecture Handouts, Chapter 6

Chapter 6: Language Processing in Mind

6.1 Introduction

1. Language is a mirror of the mind in a deep and significant sense.
2. Language is a product of human intelligence, created anew in each individual by operations that lie far beyond the reach of will or consciousness.
3. Psycholinguistics "proper" can perhaps be glossed as the storage, comprehension, production and acquisition of language in any medium (spoken or written).
4. Psycholinguistics is concerned primarily with investigating the psychological reality of linguistic structures.
5. The differences between psycholinguistics and the psychology of language: Psycholinguistics can be defined as the storage, comprehension, production and acquisition of language in any medium (spoken or written). It is concerned primarily with investigating the psychological reality of linguistic structures. The psychology of language, on the other hand, deals with more general topics such as the extent to which language shapes thought, and, from the psychology of communication, includes non-verbal communication such as gestures and facial expressions.
6. Cognitive psycholinguistics: Cognitive psycholinguistics is concerned above all with making inferences about the content of the human mind.
7. Experimental psycholinguistics: Experimental psycholinguistics is mainly concerned with empirical matters, such as speed of response to a particular word.

6.1.1 Evidence

1. Linguists tend to favor descriptions of spontaneous speech as their main source of evidence, whereas psychologists mostly prefer experimental studies.
2. The subjects of psycholinguistic investigation are normal adults and children on the one hand, and aphasics----people with speech disorders----on the other. The primary assumption with regard to aphasic patients is that a breakdown in some part of language could lead to an understanding of which components might be independent of others.

6.1.2 Current issues

1. Modular theory: Modular theory assumes that the mind is structured into separate modules or components, each governed by its own principles and operating independently of others.
2. Cohort theory: The cohort theory hypothesizes that auditory word recognition begins with the formation of a group of words at the perception of the initial sound and proceeds sound by sound, with the cohort of words decreasing as more sounds are perceived. This theory can be expanded to deal with written materials as well. Several experiments have supported this view of word recognition. One obvious prediction of this model is that if the beginning sound or letter is missing, recognition will be much more difficult, perhaps even impossible. For example: gray tie -- great eye; a name -- an aim; an ice man -- a nice man; I scream -- ice cream; see Mable -- seem able; well fare -- welfare; lookout -- look out; decade -- Deck Eight; layman -- laymen; persistent turn -- persist and turn.
3. Psychological reality: The reality of a grammar, etc. as a purported account of structures represented in the mind of a speaker. It is often opposed, in discussion of the merits of alternative grammars, to criteria of simplicity, elegance, and internal consistency.
4. The three major strands of psycholinguistic research:
(1) Comprehension: How do people use their knowledge of language, and how do they understand what they hear or read?
(2) Production: How do they produce messages that others can understand in turn?
(3) Acquisition: How is language represented in the mind and how is language acquired?
6.2 Language comprehension

6.2.1 Word recognition

1. An initial step in understanding any message is the recognition of words.
2. One of the most important factors that affects word recognition is how frequently the word is used in a given context.
3. Frequency effect: describes the additional ease with which a word is accessed due to its more frequent usage in the language.
4. Recency effect: describes the additional ease with which a word is accessed due to its repeated occurrence in the discourse or context.
5. Another factor that is involved in word recognition is context.
6. A semantic association network represents the relationships between various semantically related words. Word recognition is thought to be faster when other members of the association network are provided in the discourse.

6.2.2 Lexical ambiguity

1. Lexical ambiguity: ambiguity explained by reference to lexical meanings: e.g. that of "I saw a bat", where "a bat" might refer to an animal or, among other things, a table-tennis bat.
2. There are two main theories:
(1) All the meanings associated with the word are accessed, and
(2) only one meaning is accessed initially. e.g.
a. After taking the right turn at the intersection... "right" is ambiguous: correct vs. rightward.
b. After taking the left turn at the intersection... "left" is unambiguous.

6.2.3 Syntactic processing

1. Once a word has been identified, it is used to construct a syntactic structure.
2. As always, there are complications due to the ambiguity of individual words and to the different possible ways that words can be fit into phrases. Sometimes there is no way to determine which structure and meaning a sentence has.
e.g. The cop saw the spy with the binoculars. "with the binoculars" is ambiguous:
(1) the cop employed binoculars in order to see the spy;
(2) it specifies that the spy has binoculars.
3. Some ambiguities are due to the ambiguous category of some of the words in the sentence.
e.g. "the desert trains", where "trains" can be a verb (to train) or a noun (railway trains): "The desert trains man to be hardy." (The Chinese gloss in the original means: the desert makes people tough.)
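The cohort theory summarized in 6.1.2 can be illustrated with a toy example: the candidate set shrinks as each successive sound or letter is perceived, and recognition fails when the initial sound is missing. The miniature lexicon below is invented purely for the illustration:

```python
# Toy illustration of the cohort model: the cohort of candidate words narrows
# as more of the word is heard; a missing initial sound defeats recognition.
LEXICON = ["great", "gray", "grain", "table", "tie", "train"]

def cohort(heard_so_far: str, lexicon=LEXICON):
    """Return the words still compatible with the input perceived so far."""
    return [w for w in lexicon if w.startswith(heard_so_far)]

print(cohort("g"))    # ['great', 'gray', 'grain']
print(cohort("gra"))  # ['gray', 'grain']  -- cohort shrinks sound by sound
print(cohort("ray"))  # []  -- with the initial sound missing, recognition fails
```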
English Composition: Allowing Love to Flow Naturally

Allowing Love to Flow Naturally is a theme that resonates deeply with many people. It is about the spontaneous expression of affection, care, and compassion without any expectations or conditions. Here is a detailed essay on this topic in English.

Title: Allowing Love to Flow Naturally

Love is an intrinsic part of the human experience, a force that shapes our lives and defines our connections with others. It is a profound emotion that can be expressed in various forms, from the warmth of familial bonds to the depth of romantic relationships, and even the simple acts of kindness towards strangers. The beauty of love lies in its ability to flow naturally, without constraints or artificial barriers.

The Essence of Natural Love

Natural love is not forced or contrived; it is an authentic expression of one's feelings. It is the spontaneous hug given to a friend in need, the genuine smile offered to a passerby, or the heartfelt words of encouragement to a colleague. This type of love is uncalculated and stems from a place of pure emotion, without the need for reciprocation or recognition.

Cultivating an Environment for Love

To allow love to flow naturally, one must cultivate an environment that fosters openness and vulnerability. This involves creating spaces where individuals feel safe to express their emotions without fear of judgment or rejection. It requires a society that values empathy, understanding, and mutual respect.

Overcoming Barriers to Natural Love

Often, societal norms and personal insecurities can act as barriers to the natural expression of love. People may suppress their feelings due to fear of appearing weak or being misunderstood. To overcome these barriers, it is essential to practice self-acceptance and embrace the full spectrum of human emotions. Encouraging open communication and emotional literacy can also help in breaking down these walls.

The Impact of Natural Love

When love is allowed to flow naturally, it has a transformative effect on individuals and communities. It fosters a sense of belonging and interconnectedness, promoting a more harmonious and supportive society. Natural love can heal emotional wounds, strengthen relationships, and inspire acts of altruism and selflessness.

Practical Ways to Encourage Natural Love

1. Practice Active Listening: Give others your full attention when they speak, showing genuine interest in their thoughts and feelings.
2. Express Gratitude: Regularly acknowledge the kindness and support of others, reinforcing positive interactions.
3. Show Empathy: Put yourself in others' shoes to understand their emotions and perspectives.
4. Be Present: Engage fully in the moment, allowing for genuine connections to form.
5. Offer Support: Be there for others in times of need, offering a helping hand or a shoulder to lean on.

Conclusion

Allowing love to flow naturally is a practice that enriches our lives and the lives of those around us. It is a testament to the power of human connection and the depth of our emotional capacity. By creating an environment that encourages openness and vulnerability, and by overcoming the barriers that inhibit our natural expressions of love, we can foster a world where love is free to flourish and transform.

This essay explores the concept of natural love and its significance in our lives, offering insights into how we can nurture and express this powerful emotion more freely.
Reflections on Chinese Learners' English Self-Study Methods

In recent times, I have had the opportunity to delve into an array of strategies and methods that Chinese learners employ in their quest for mastering the English language through self-study. This reflective piece aims to provide a comprehensive analysis from various perspectives, highlighting the efficacy, challenges, and potential improvements within these methods.The journey of English language learning for many Chinese students is often characterized by rigorous discipline and a strong commitment to self-study. A common method they adopt is extensive reading, which plays a pivotal role in enhancing vocabulary, grammar comprehension, and overall language fluency. The vast availability of English texts online, coupled with digital dictionaries and translation tools, has significantly facilitated this process. However, it's crucial to emphasize the importance of context-based reading for better understanding and retention. Reading materials across different genres not only expose learners to diverse writing styles but also to colloquial expressions and cultural nuances, thereby enriching their communicative skills.Another notable approach is the use of language apps and online courses. Platforms like Duolingo, FluentU, or Coursera offer interactive lessons and personalized practice sessions, aligning well with China's tech-savvy learning culture. These resources provide immediate feedback and allow learners to study at their own pace, thus promoting autonomous learning. Yet, while these tools are undeniably effective, they can't replace human interaction, which is essential for improving pronunciation, conversational skills, and understanding non-verbal communication.Chinese learners also heavily rely on rote learning for memorizing vocabulary and grammar rules. While this method aids in building a solid foundation, it may hinder creativity and spontaneous usage of the language. To balance this, learners could integrate spaced repetition techniques alongside active learning strategies such as creating flashcards or using mnemonics, fostering long-term retention and application.One area that deserves special attention is listening and speaking skills.With limited exposure to authentic English-speaking environments, Chinese learners often face challenges in developing these skills. Watching English films and TV shows, participating in language exchange programs, or even engaging in virtual discussions can bridge this gap. However, more emphasis should be placed on structured practice and assessment in these areas, perhaps through simulation exercises or AI-powered speech recognition technology.Furthermore, the 'Learning by Teaching' principle can be harnessed effectively. Many Chinese learners form study groups where they teach each other what they've learned. This not only reinforces personal knowledge but also encourages critical thinking and problem-solving abilities. It's a testament to the collaborative nature of learning that is deeply ingrained in Chinese educational culture.Despite the impressive strides made by Chinese learners in English self-study, there remains room for improvement. For instance, integrating more task-based learning activities could help learners apply English in real-life situations. Also, fostering a growth mindset where mistakes are viewed as stepping stones rather than failures would encourage risk-taking and accelerate language acquisition.In conclusion, the Chinese approach to English self-study reflects a blend of traditional and innovative methodologies. 
While leveraging digital resources and embracing autonomous learning, it is equally important to consider a balanced approach that includes ample opportunities for authentic communication and experiential learning. By doing so, Chinese learners can continue to excel in their English language journey, equipping themselves with a powerful tool for global communication and collaboration.

Word Count: 659

This reflection is just a brief overview and does not meet the minimum word count requirement of 1370 words. A fully fleshed-out version would further explore specific case studies, research findings, and detailed recommendations for educators and policymakers to enhance English self-study methods for Chinese learners.
Human-Machine Interaction Paper

Table of Contents
Abstract
Main Text
1. Overview of Speech Recognition Technology
2. Development History
3. Principles of Speech Recognition
4. Introduction to Speech Recognition Systems
5. Types of Speech Recognition Systems
5.1 Restricting How the User Speaks
5.2 Restricting the User's Vocabulary
5.3 Restricting the System's User Population
6. Main Research Approaches to Speech Recognition
6.1 Dynamic Time Warping (DTW)
6.2 Vector Quantization (VQ)
6.3 Hidden Markov Models (HMM)
6.5 Support Vector Machines (SVM)
7. Development Trends in Speech Recognition
7.1 Improving Reliability
7.2 Increasing Vocabulary Size
7.3 Expanding Applications
7.4 Reducing Cost and Size
8. Problems Facing Speech Recognition
9. Research Directions Worth Pursuing
10. Outlook for Speech Recognition Technology
References

A Brief Discussion of Speech Recognition Technology

Abstract: Speech recognition is an interdisciplinary field.
Over the past two decades, speech recognition technology has made remarkable progress and has begun to move from the laboratory to the market.
It is expected that within the next ten years, speech recognition technology will enter fields such as industry, home appliances, communications, automotive electronics, healthcare, home services, and consumer electronics.
The application of speech recognition dictation machines in certain fields was rated by the American press as one of the ten major events in computer development in 1997.
Many experts consider speech recognition to be one of the ten most important technological developments in the field of information technology between 2000 and 2010.
The areas involved in speech recognition technology include signal processing, pattern recognition, probability theory and information theory, speech production and auditory mechanisms, artificial intelligence, and so on.
Keywords: speech recognition, vector quantization, artificial neural networks, dynamic time warping

Main Text

1. Overview of Speech Recognition Technology

Speech recognition is a technology that enables machines to "understand" human speech.
As a leading direction of intelligent computer research and a key technology for human-machine speech communication, speech recognition has long received broad attention from scientific communities around the world.
Today, with breakthroughs in speech recognition research, its importance for the development of computing and for social life is becoming increasingly apparent.
Products developed with speech recognition technology have a very wide range of applications, such as voice-controlled telephone switching, information network queries, home services, hotel services, medical services, banking services, industrial control, and voice communication systems, reaching into almost every industry and every aspect of society.
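The table of contents above lists dynamic time warping (DTW), vector quantization and hidden Markov models among the main research approaches. As a rough, generic illustration of the first of these (not taken from this paper), a minimal DTW distance between two frame-level feature sequences can be computed as follows:

```python
# Generic dynamic time warping (DTW) sketch: aligns two sequences of feature
# vectors and returns the cumulative distance of the best warping path.
from typing import Sequence

def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Minimal DTW with a Euclidean local cost; a and b are frame-level feature vectors."""
    def dist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # each cell extends the cheapest of the three admissible predecessors
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```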
People's Education Press (PEP) Senior High School English, Selective Compulsory Book 2, UNIT 1: Unit Consolidation and Reinforcement Exercises

2. What did Galileo do during his college time?
A. He studied chemistry.
B. He developed a balance.
C. He carried out researches on motion.
D. He prepared himself for teaching math.
Answer: D
Explanation: This is a detail-comprehension question. From the second paragraph, "Galileo then began to prepare himself to teach mathematics and several of his lectures have survived," we know that Galileo prepared himself to teach mathematics during his college years.
Galileo died on January 8, 1642.
[Passage overview] This is a narrative text. It introduces the life and achievements of the famous scientist Galileo.
1.Where did Galileo receive education in his middle teens?
A.In Pisa.
B.In Vallombrosa.
5. What does Wu's colleague think of her?
A. Hard-working.
B. Brave.
C. Kind-hearted.
D. Considerate.
Answer: A
Explanation: This is an inference question. From "It's no exaggeration... over more than 60 years" in the second paragraph, we can infer that the colleague considers Wu hard-working.
4. What is the best title of the text?
A. Galileo's Life Story
B. Galileo's Road to Success
C. Galileo's Major Contributions
D. Galileo's Studying Experience
Answer: A
Explanation: This is a title-summarization question. Reading the whole passage, we can see that it mainly introduces the life and achievements of the scientist Galileo, so option A can serve as the title of the passage.
Speech-to-text and speech-to-speech summarization of spontaneous speech

Speech-to-Text and Speech-to-Speech Summarizationof Spontaneous SpeechSadaoki Furui,Fellow,IEEE,Tomonori Kikuchi,Yousuke Shinnaka,and Chiori Hori,Member,IEEEAbstract—This paper presents techniques for speech-to-text and speech-to-speech automatic summarization based on speech unit extraction and concatenation.For the former case,a two-stage summarization method consisting of important sentence extraction and word-based sentence compaction is investigated. Sentence and word units which maximize the weighted sum of linguistic likelihood,amount of information,confidence measure, and grammatical likelihood of concatenated units are extracted from the speech recognition results and concatenated for pro-ducing summaries.For the latter case,sentences,words,and between-filler units are investigated as units to be extracted from original speech.These methods are applied to the summarization of unrestricted-domain spontaneous presentations and evaluated by objective and subjective measures.It was confirmed that pro-posed methods are effective in spontaneous speech summarization. Index Terms—Presentation,speech recognition,speech summa-rization,speech-to-speech,speech-to-text,spontaneous speech.I.I NTRODUCTIONO NE OF THE KEY applications of automatic speech recognition is to transcribe speech documents such as talks,presentations,lectures,and broadcast news[1].Although speech is the most natural and effective method of communi-cation between human beings,it is not easy to quickly review, retrieve,and reuse speech documents if they are simply recorded as audio signal.Therefore,transcribing speech is expected to become a crucial capability for the coming IT era.Although high recognition accuracy can be easily obtained for speech read from a text,such as anchor speakers’broadcast news utterances,technological ability for recognizing spontaneous speech is still limited[2].Spontaneous speech is ill-formed and very different from written text.Spontaneous speech usually includes redundant information such as disfluencies, fillers,repetitions,repairs,and word fragments.In addition, irrelevant information included in a transcription caused by recognition errors is usually inevitable.Therefore,an approach in which all words are simply transcribed is not an effective one for spontaneous speech.Instead,speech summarization which extracts important information and removes redundantManuscript received May6,2003;revised December11,2003.The associate editor coordinating the review of this manuscript and approving it for publica-tion was Dr.Julia Hirschberg.S.Furui,T.Kikuchi,and Y.Shinnaka are with the Department of Com-puter Science,Tokyo Institute of Technology,Tokyo,152-8552,Japan (e-mail:furui@furui.cs.titech.ac.jp;kikuchi@furui.cs.titech.ac.jp;shinnaka@ furui.cs.titech.ac.jp).C.Hori is with the Intelligent Communication Laboratory,NTT Communication Science Laboratories,Kyoto619-0237,Japan(e-mail: chiori@cslab.kecl.ntt.co.jp).Digital Object Identifier10.1109/TSA.2004.828699and incorrect information is ideal for recognizing spontaneous speech.Speech summarization is expected to save time for reviewing speech documents and improve the efficiency of document retrieval.Summarization results can be presented by either text or speech.The former method has advantages in that:1)the documents can be easily looked through;2)the part of the doc-uments that are interesting for users can be easily extracted;and 3)information extraction and retrieval techniques can be easily applied to the documents.However,it has 
disadvantages in that wrong information due to speech recognition errors cannot be avoided and prosodic information such as the emotion of speakers conveyed only in speech cannot be presented.On the other hand,the latter method does not have such disadvantages and it can preserve all the acoustic information included in the original speech.Methods for presenting summaries by speech can be clas-sified into two categories:1)presenting simply concatenated speech segments that are extracted from original speech or 2)synthesizing summarization text by using a speech synthe-sizer.Since state-of-the-art speech synthesizers still cannot produce completely natural speech,the former method can easily produce better quality summarizations,and it does not have the problem of synthesizing wrong messages due to speech recognition errors.The major problem in using extracted speech segments is how to avoid unnatural noisy sound caused by the concatenation.There has been much research in the area of summarizing written language(see[3]for a comprehensive overview).So far,however,very little attention has been given to the question of how to create and evaluate spoken language summarization based on automatically generated transcription from a speech recognizer.One fundamental problem with the summaries pro-duced is that they contain recognition errors and disfluencies. Summarization of dialogues within limited domains has been attempted within the context of the VERBMOBIL project[4]. Zechner and Waibel have investigated how the accuracy of the summaries changes when methods for word error rate reduction are applied in summarizing conversations in television shows [5].Recent work on spoken language summarization in unre-stricted domains has focused almost exclusively on Broadcast News[6],[7].Koumpis and Renals have investigated the tran-scription and summarization of voice mail speech[8].Most of the previous research on spoken language summarization have used relatively long units,such as sentences or speaker turns,as minimal units for summarization.This paper investigates automatic speech summarization techniques with the two presentation methods in unrestricted1063-6676/04$20.00©2004IEEEdomains.In both cases,the most appropriate sentences,phrases or word units/segments are automatically extracted from orig-inal speech and concatenated to produce a summary under the constraint that extracted units cannot be reordered or replaced. 
Only when the summary is presented by text,transcription is modified into a written editorial article style by certain rules.When the summary is presented by speech,a waveform concatenation-based method is used.Although prosodic features such as accent and intonation could be used for selection of important parts,reliable methods for automatic and correct extraction of prosodic features from spontaneous speech and for modeling them have not yet been established.Therefore,in this paper,input speech is automat-ically recognized and important segments are extracted based only on the textual information.Evaluation experiments are performed using spontaneous presentation utterances in the Corpus of Spontaneous Japanese (CSJ)made by the Spontaneous Speech Corpus and Processing Project[9].The project began in1999and is being conducted over a five-year period with the following three major targets.1)Building a large-scale spontaneous speech corpus(CSJ)consisting of roughly7M words with a total speech length of700h.This mainly records monologues such as lectures,presentations and news commentaries.The recordings with low spontaneity,such as those from read text,are excluded from the corpus.The utterances are manually transcribed orthographically and phonetically.One-tenth of them,called Core,are tagged manually and used for training a morphological analysis and part-of-speech(POS)tagging program for automati-cally analyzing all of the700-h utterances.The Core is also tagged with para-linguistic information including intonation.2)Acoustic and language modeling for spontaneous speechunderstanding using linguistic,as well as para-linguistic, information in speech.3)Investigating spontaneous speech summarization tech-nology.II.S UMMARIZATION W ITH T EXT P RESENTATIONA.Two-Stage Summarization MethodFig.1shows the two-stage summarization method consisting of important sentence extraction and sentence compaction[10]. Using speech recognition results,the score for important sen-tence extraction is calculated for each sentence.After removing all the fillers,a set of relatively important sentences is extracted, and sentence compaction using our proposed method[11],[12] is applied to the set of extracted sentences.The ratio of sentence extraction and compaction is controlled according to a summa-rization ratio initially determined by the user.Speech summarization has a number of significant chal-lenges that distinguish it from general text summarization. Applying text-based technologies to speech is not always workable and often they are not equipped to capture speech specific phenomena.Speech contains a number of spontaneous effects,which are not present in written language,such as hesitations,false starts,and fillers.Speech is,to someextent,Fig. 
1. A two-stage automatic speech summarization system with text presentation.

always distorted by ungrammatical and various redundant expressions. Speech is also a continuous phenomenon that comes without unambiguous sentence boundaries. In addition, errors in transcriptions of automatic speech recognition engines can be quite substantial. Sentence extraction methods, on which most of the text summarization methods [13] are based, cannot cope with the problems of distorted information and redundant expressions in speech. Although several sentence compression methods have also been investigated in text summarization [14], [15], they rely on discourse and grammatical structures of the input text. Therefore, it is difficult to apply them to spontaneous speech with ill-formed structures. The method proposed in this paper is suitable for applying to ill-formed speech recognition results, since it simultaneously uses various statistical features, including a confidence measure of speech recognition results. The principle of the speech-to-text summarization method is also used in the speech-to-speech summarization which will be described in the next section. Speech-to-speech summarization is a comparatively much younger discipline, and has not yet been investigated in the same framework as the speech-to-text summarization.

1) Important Sentence Extraction: Important sentence extraction is performed according to the following score for each sentence, obtained as a result of speech recognition:

$$S = \frac{1}{N}\sum_{i=1}^{N}\left\{L(w_i) + \lambda_I I(w_i) + \lambda_C C(w_i)\right\} \quad (1)$$

where $N$ is the number of words in the sentence and $L(w_i)$, $I(w_i)$, and $C(w_i)$ are the linguistic score, the significance score, and the confidence score of word $w_i$, respectively. Although sentence boundaries can be estimated using linguistic and prosodic information [16], they are manually given in the experiments in this paper. The three scores are a subset of the scores originally used in our sentence compaction method and are considered to be useful also as measures indicating the appropriateness of including the sentence in the summary. $\lambda_I$ and $\lambda_C$ are weighting factors for balancing the scores. Details of the scores are as follows.

Linguistic score $L(w_i)$: The linguistic score indicates the linguistic likelihood of word strings in the sentence and is measured by n-gram probability:

$$L(w_i) = \log P(w_i \mid w_{i-n+1},\ldots,w_{i-1}) \quad (2)$$

In our experiment, trigram probability calculated using transcriptions of presentation utterances in the CSJ consisting of 1.5M morphemes (words) is used. This score de-weights linguistically unnatural word strings caused by recognition errors.

Significance score $I(w_i)$: The significance score indicates the significance of each word $w_i$ in the sentence and is measured by the amount of information. The amount of information contained in each word is calculated for content words, including nouns, verbs, adjectives and out-of-vocabulary (OOV) words, based on word occurrence in a corpus as shown in (3). The POS information for each word is obtained from the recognition result, since every word in the dictionary is accompanied with a unique POS tag. A flat score is given to other words, and

$$I(w_i) = f_i \log\frac{F_A}{F_i} \quad (3)$$

where $f_i$ is the number of occurrences of $w_i$ in the recognized utterances, $F_i$ is the number of occurrences of $w_i$ in a large-scale corpus, and $F_A$ is the number of all content words in that corpus, that is, $F_A = \sum_i F_i$.
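As a rough, illustrative sketch of how the significance score of (3) and the sentence score of (1) could be computed from recognizer output, the following is offered; the corpus counts, weights and data layout are invented for the example and are not the authors' implementation:

```python
# Sketch of the word significance score (Eq. 3) and the sentence score (Eq. 1).
# Membership in corpus_counts stands in for the POS-based content-word test.
import math
from typing import Dict, List, Tuple

def significance(word: str,
                 recog_counts: Dict[str, int],    # f: occurrences in the recognized talk
                 corpus_counts: Dict[str, int],   # F: occurrences in a large corpus
                 flat_score: float = 0.0) -> float:
    """I(w) = f * log(F_A / F) for content words; a flat score for other words."""
    if word not in corpus_counts:                 # treated as a non-content word here
        return flat_score
    F_A = sum(corpus_counts.values())             # total content-word count in the corpus
    return recog_counts.get(word, 0) * math.log(F_A / corpus_counts[word])

def sentence_score(words: List[Tuple[str, float, float]],  # (word, L, C) per recognized word
                   recog_counts: Dict[str, int],
                   corpus_counts: Dict[str, int],
                   lambda_i: float = 1.0,
                   lambda_c: float = 1.0) -> float:
    """Eq. (1): average of L + lambda_I * I + lambda_C * C over one sentence."""
    total = sum(L + lambda_i * significance(w, recog_counts, corpus_counts) + lambda_c * C
                for (w, L, C) in words)
    return total / len(words)
```

Sentences would then be ranked by this score, the top fraction kept according to the summarization ratio, and the rest discarded before the compaction stage.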
For measuring the significance score, the number of occurrences of 120 000 kinds of words is calculated in a corpus consisting of transcribed presentations (1.5M words), proceedings of 60 presentations, presentation records obtained from the World-Wide Web (WWW) (2.1M words), NHK (Japanese broadcast company) broadcast news text (22M words), Mainichi newspaper text (87M words), and text from a speech textbook, "Speech Information Processing" (51 000 words). Important keywords are weighted and the words unrelated to the original content, such as recognition errors, are de-weighted by this score.

Confidence score $C(w_i)$: The confidence score is incorporated to weight acoustically as well as linguistically reliable hypotheses. Specifically, a logarithmic value of the posterior probability for each transcribed word, which is the ratio of a word hypothesis probability to that of all other hypotheses, is calculated using a word graph obtained by a decoder and used as a confidence score.

2) Sentence Compaction: After removing relatively less important sentences, the remaining transcription is automatically modified into a written editorial article style to calculate the score for sentence compaction. All the sentences are concatenated while preserving sentence boundaries, and a linguistic score, a significance score, and a confidence score are given to each transcribed word. A word concatenation score for every combination of words within each transcribed sentence is also given to weight a word concatenation between words. This score is a measure of the dependency between two words and is obtained by a phrase structure grammar, the stochastic dependency context-free grammar (SDCFG). A set of words that maximizes a weighted sum of these scores is selected according to a given compression ratio and connected to create a summary using a two-stage dynamic programming (DP) technique. Specifically, each sentence is summarized according to all possible compression ratios, and then the best combination of summarized sentences is determined according to a target total compression ratio.
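The word-selection idea behind this compaction step can be sketched, in a much-simplified form, as a dynamic program over which words of a sentence to keep. The real system combines the linguistic, significance, confidence and SDCFG dependency scores and runs a second DP that allocates the compression ratio across sentences; the inputs below are placeholders:

```python
# Simplified sketch of word-based sentence compaction: keep m words, in their
# original order, maximizing the sum of per-word scores plus pairwise
# concatenation scores. Scores and the concat function are placeholders.
from typing import Callable, List, Tuple

def compact_sentence(word_scores: List[float],
                     concat_score: Callable[[int, int], float],
                     m: int) -> Tuple[float, List[int]]:
    n = len(word_scores)
    NEG = float("-inf")
    # best[j][i]: best score keeping j words, with the last kept word at position i
    best = [[NEG] * n for _ in range(m + 1)]
    back = [[-1] * n for _ in range(m + 1)]
    for i in range(n):
        best[1][i] = word_scores[i]
    for j in range(2, m + 1):
        for i in range(n):
            for k in range(i):
                cand = best[j - 1][k] + word_scores[i] + concat_score(k, i)
                if cand > best[j][i]:
                    best[j][i], back[j][i] = cand, k
    # trace back from the best final position
    end = max(range(n), key=lambda i: best[m][i])
    kept, j, i = [], m, end
    while j >= 1 and i >= 0:
        kept.append(i)
        i, j = back[j][i], j - 1
    return best[m][end], sorted(kept)
```

Here word_scores[i] would hold the combined per-word score and concat_score the inter-word dependency score; with a concat_score that always returns 0, the procedure simply keeps the m highest-scoring words in their original order.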
Ideally, the linguistic score should be calculated using a word concatenation model based on a large-scale summary corpus. Since such a summary corpus is not yet available, the transcribed presentations used to calculate the word trigrams for the important sentence extraction are automatically modified into a written editorial article style and used together with the proceedings of 60 presentations to calculate the trigrams. The significance score is calculated using the same corpus as that used for calculating the score for important sentence extraction. The word-dependency probability is estimated by the Inside-Outside algorithm, using a manually parsed Mainichi newspaper corpus having 4M sentences with 68M words. For the details of the SDCFG and dependency scores, readers should refer to [12].

B. Evaluation Experiments

1) Evaluation Set: Three presentations, M74, M35, and M31, in the CSJ by male speakers were summarized at summarization ratios of 70% and 50%. The summarization ratio was defined as the ratio of the number of characters in the summaries to that in the recognition results. Table I shows features of the presentations, that is, length, mean word recognition accuracy, number of sentences, number of words, number of fillers, filler ratio, and number of disfluencies including repairs of each presentation. They were manually segmented into sentences before recognition. The table shows that the presentation M35 has a significantly large number of disfluencies and a low recognition accuracy, and M31 has a significantly high filler ratio.

2) Summarization Accuracy: To objectively evaluate the summaries, correctly transcribed presentation speech was manually summarized by nine human subjects to create targets. Devising meaningful evaluation criteria and metrics for speech summarization is a problematic issue. Speech does not have explicit sentence boundaries, in contrast with text input. Therefore, speech summarization results cannot be evaluated using the F-measure based on sentence units. In addition, since words (morphemes) within sentences are extracted and concatenated in the summarization process, variations of target summaries made by human subjects are much larger than those using the sentence-level method. In almost all cases, an "ideal" summary does not exist. For these reasons, variations of the manual summarization results were merged into a word network as shown in Fig. 2, which is considered to approximately express all possible correct summaries covering subjective variations. Word accuracy of the summary is then measured in comparison with the closest word string extracted from the word network as the summarization accuracy [5].

TABLE I: EVALUATION SET

Fig. 2. Word network made by merging manual summarization results.

3) Evaluation Conditions: Summarization was performed under the following nine conditions: single-stage summarization without applying the important sentence extraction (NOS); two-stage summarization using the seven possible combinations of scores for important sentence extraction (L, I, C, L+I, L+C, I+C, and L+I+C); and summarization by random word selection. The weighting factors $\lambda_I$ and $\lambda_C$ were set at optimum values for each experimental condition.

C. Evaluation Results

1) Summarization Accuracy: Results of the evaluation experiments are shown in Figs. 3 and 4. In all the automatic summarization conditions, both the one-stage method without sentence extraction and the two-stage method including sentence extraction achieve better results than random word selection. In both the 70% and 50% summarization conditions, the two-stage method achieves higher summarization accuracy than the one-stage method. The two-stage method is more effective in the condition of the smaller summarization ratio (50%), that is, where there is a higher compression ratio, than in the condition of the larger summarization ratio (70%). In the 50% summarization condition, the two-stage method is effective for all three presentations. The two-stage method is especially effective for avoiding one of the problems of the one-stage method, that is, the production of short unreadable and/or incomprehensible sentences.

Comparing the three scores for sentence extraction, the significance score is more effective than the linguistic score and the confidence score. The summarization score can be increased by using a combination of two scores (L+I, L+C, I+C), and even more by combining all three scores.

Fig. 3. Results of the summarization with text presentation at 50% summarization ratio.

Fig. 4. Results of the summarization with text presentation at 70% summarization ratio.

The differences are, however, statistically insignificant in these experiments, due to the limited size of the data.

2) Effects of the Ratio of Compression by Sentence Extraction: Figs. 5 and 6 show the summarization accuracy as a function of the ratio of compression by sentence extraction for total summarization ratios of 50% or 70%. The left and right ends of the figures correspond to summarizations by only sentence compaction and sentence extraction, respectively.
These results indicate that although the best summarization accuracy of each presentation can be obtained at a different ratio of compression by sentence extraction,there is a general tendency where the smaller the summarization ratio becomes, the larger the optimum ratio of compression by sentence extraction becomes.That is,sentence extraction becomes more effective when the summarization ratio gets smaller. Comparing results at the left and right ends of the figures, summarization by word extraction(i.e.,sentence compaction) is more effective than sentence extraction for the M35presenta-tion.This presentation includes a relatively large amount of re-dundant information,such as disfluencies and repairs,and has a significantly low recognition accuracy.These results indicate that the optimum division of the compression ratio into the two summarization stages needs to be estimated according to the specific summarization ratio and features of the presentation in question,such as frequency of disfluencies.III.S UMMARIZATION W ITH S PEECH P RESENTATIONA.Unit Selection and Concatenation1)Units for Extraction:The following issues need to be ad-dressed in extracting and concatenating speech segments for making summaries.1)Units for extraction:sentences,phrases,or words.2)Criteria for measuring the importance of units forextraction.3)Concatenation methods for making summary speech. The following three units are investigated in this paper:sen-tences,words,and between-filler units.All the fillers automat-ically detected as the result of recognition are removed before extracting important segments.Sentence units:The method described in Section II-A.1 is applied to the recognition results to extract important sentences.Since sentences are basic linguistic as well as acoustic units,it is easy to maintain acoustical smoothness by using sentences as units,and therefore the concatenated speech sounds natural.However,since the units are rela-tively long,they tend to include unnecessary words.Since fillers are automatically removed even if they are included within sentences as described above,the sentences are cut and shortened at the position of fillers.Word units:Word sets are extracted and concatenated by applying the method described in Section II-A.2to the recognition results.Although this method has an advan-tage in that important parts can be precisely extracted in small units,it tends to cause acoustical discontinuity since many small units of speech need to be concatenated.There-fore,summarization speech made by this method some-times soundsunnatural.Fig.5.Summarization accuracy as a function of the ratio of compression by sentence extraction for the total summarization ratio of50%.Fig.6.Summarization accuracy as a function of the ratio of compression by sentence extraction for the total summarization ratio of70%.Between-filler units:Speech segments between fillers as well as sentence boundaries are extracted using speech recognition results.The same method as that used for ex-tracting sentence units is applied to evaluate these units.These units are introduced as intermediate units between sentences and words,in anticipation of both reasonably precise extraction of important parts and naturalness of speech with acoustic continuity.2)Unit Concatenation:Units for building summarization speech are extracted from original speech by using segmentation boundaries obtained from speech recognition results.When the units are concatenated at the inside of sentences,it may produce noise due to a difference of 
amplitudes of the speech waveforms. In order to avoid this problem,amplitudes of approximately 20-ms length at the unit boundaries are gradually attenuated before the concatenation.Since this causes an impression of406IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING,VOL.12,NO.4,JULY 2004TABLE IIS UMMARIZATION A CCURACY AND N UMBER OF U NITS FOR THE T HREE K INDS OF S UMMARIZATION UNITSincreasing the speaking rate and thus creates an unnatural sound,a short pause is inserted.The length of the pause is controlled between 50and 100ms empirically according to the concatenation conditions.Each summarization speech which has been made by this method is hereafter referred to as “summarization speech sentence ”and the text corresponding to its speech period is referred to as “summarization text sentence.”The summarization speech sentences are further concate-nated to create a summarized speech for the whole presentation.Speech waveforms at sentence boundaries are gradually at-tenuated and pauses are inserted between the sentences in the same way as the unit concatenation within sentences.Short and long pauses with 200-and 700-ms lengths are used as pauses between sentences.Long pauses are inserted after sentence ending expressions,otherwise short pauses are used.In the case of summarization by word-unit concatenation,long pauses are always used,since many sentences terminate with nouns and need relatively long pauses to make them sound natural.B.Evaluation Experiments1)Experimental Conditions:The three presentations,M74,M35,and M31,were automatically summarized with a summarization ratio of 50%.Summarization accuracies for the three presentations using sentence units,between-filler units,and word units,are given in Table II.Manual summaries made by nine human subjects were used for the evaluation.The table also shows the number of automatically detected units in each condition.For the case of using the between-filler units,the number of detected fillers is also shown.Using the summarization text sentences,speech segments were extracted and concatenated to build summarization speech,and subjective evaluation by 11subjects was performed in terms of ease of understanding and appropriateness as a sum-marization with five levels:1—very bad;2—bad;3—normal;4—good;and 5—very good.The subjects were instructed to read the transcriptions of the presentations and understand the contents before hearing the summarizationspeech.Fig.7.Evaluation results for the summarization with speech presentation in terms of the ease ofunderstanding.Fig.8.Evaluation results for the summarization with speech presentation in terms of the appropriateness as a summary.2)Evaluation Results and Discussion:Figs.7and 8show the evaluation results.Averaging over the three presentations,the sentence units show the best results whereas the word unitsFURUI et al.:SPEECH-TO-TEXT AND SPEECH-TO-SPEECH SUMMARIZATION407show the worst.For the two presentations,M74and M35,the between-filler units achieve almost the same results as the sen-tence units.The reason why the word units which show slightly better summarization accuracy in Table II also show the worst subjective evaluation results here is because of unnatural sound due to the concatenation of short speech units.The relatively large number of fillers included in the presentation M31pro-duced many short units when the between-filler unit method was applied.This is the reason why between-filler units show worse subjective results than the sentence units for M31.If the summarization ratio is 
set lower than 50%, between-filler units are expected to achieve better results than sentence units, since sentence units cannot remove redundant expressions within sentences.

IV. CONCLUSION

In this paper, we have presented techniques for compaction-based automatic speech summarization and evaluation results for summarizing spontaneous presentations. The summarization results are presented as either text or speech. In the former case, the speech-to-text summarization, we proposed a two-stage automatic speech summarization method consisting of important sentence extraction and word-based sentence compaction. In this method, inadequate sentences, including those with recognition errors and less important information, are automatically removed before sentence compaction. It was confirmed that in spontaneous presentation speech summarization at 70% and 50% summarization ratios, combining sentence extraction with sentence compaction is effective; this method achieves better summarization performance than our previous one-stage method. It was also confirmed that three scores, the linguistic score, the word significance score, and the word confidence score, are effective for extracting important sentences. The best division of the summarization ratio into the ratios of sentence extraction and sentence compaction depends on the summarization ratio and the features of the presentation utterances.

For the case of presenting summaries by speech, the speech-to-speech summarization, three kinds of units (sentences, words, and between-filler units) were investigated as units to be extracted from the original speech and concatenated to produce the summaries. A set of units is automatically extracted using the same measures used in the speech-to-text summarization, and the speech segments corresponding to the extracted units are concatenated to produce the summaries.
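As a concrete illustration of the concatenation scheme described above (boundary attenuation of roughly 20 ms and short inserted pauses), the following is a minimal sketch in Python/NumPy. It assumes 16-kHz mono signals stored as float arrays; the function and variable names are ours and not those of the original implementation.

```python
import numpy as np

SR = 16000  # sampling rate (Hz) assumed for the presentation recordings

def attenuate_boundaries(unit, fade_ms=20.0):
    """Linearly attenuate the first and last fade_ms of a speech unit."""
    n = min(int(SR * fade_ms / 1000.0), len(unit) // 2)
    out = np.array(unit, dtype=np.float64)
    ramp = np.linspace(0.0, 1.0, n, endpoint=False)
    out[:n] *= ramp                    # fade in at the left boundary
    out[len(out) - n:] *= ramp[::-1]   # fade out at the right boundary
    return out

def concatenate_units(units, pause_ms=75.0):
    """Join extracted speech units, separating them by short silent pauses."""
    pause = np.zeros(int(SR * pause_ms / 1000.0))
    pieces = []
    for i, unit in enumerate(units):
        if i > 0:
            pieces.append(pause)
        pieces.append(attenuate_boundaries(unit))
    return np.concatenate(pieces) if pieces else np.zeros(0)
```

For between-sentence concatenation, the same routine could be called with 200- or 700-ms pauses, depending on whether the preceding unit ends with a sentence-ending expression.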
Amplitudes of speech waveforms at the boundaries are gradually attenuated and pauses are inserted before concatenation to avoid acoustic discontinuity. Subjective evaluation results for the 50% summarization ratio indicated that sentence units achieve the best subjective evaluation score. Between-filler units are expected to achieve good performance when the summarization ratio becomes smaller.

As stated in the introduction, speech summarization technology can be applied to any kind of speech document and is expected to play an important role in building various speech archives, including broadcast news, lectures, presentations, and interviews. Summarization and question answering (QA) perform a similar task, in that they both map an abundance of information to a (much) smaller piece to be presented to the user [17]. Therefore, speech summarization research will help the advancement of QA systems using speech documents. By condensing the important points of long presentations and lectures, speech-to-speech summarization can provide the listener with a valuable means for absorbing much information in a much shorter time.

Future research includes evaluation with a large number of presentations at various summarization ratios, including smaller ratios; investigation of other information and features for important unit extraction; methods for automatically segmenting a presentation into sentence units [16] and those methods' effects on summarization accuracy; and automatic optimization of the division of the compression ratio into the two summarization stages according to the summarization ratio and the features of the presentation.

ACKNOWLEDGMENT

The authors would like to thank NHK (Japan Broadcasting Corporation) for providing the broadcast news database.

REFERENCES

[1] S. Furui, K. Iwano, C. Hori, T. Shinozaki, Y. Saito, and S. Tamura, "Ubiquitous speech processing," in Proc. ICASSP 2001, vol. 1, Salt Lake City, UT, 2001, pp. 13–16.
[2] S. Furui, "Recent advances in spontaneous speech recognition and understanding," in Proc. ISCA-IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, 2003.
[3] I. Mani and M. T. Maybury, Eds., Advances in Automatic Text Summarization. Cambridge, MA: MIT Press, 1999.
[4] J. Alexandersson and P. Poller, "Toward multilingual protocol generation for spontaneous dialogues," in Proc. INLG-98, Niagara-on-the-Lake, Canada, 1998.
[5] K. Zechner and A. Waibel, "Minimizing word error rate in textual summaries of spoken language," in Proc. NAACL, Seattle, WA, 2000.
[6] J. S. Garofolo, E. M. Voorhees, C. G. P. Auzanne, and V. M. Stanford, "Spoken document retrieval: 1998 evaluation and investigation of new metrics," in Proc. ESCA Workshop: Accessing Information in Spoken Audio, Cambridge, MA, 1999, pp. 1–7.
[7] R. Valenza, T. Robinson, M. Hickey, and R. Tucker, "Summarization of spoken audio through information extraction," in Proc. ISCA Workshop on Accessing Information in Spoken Audio, Cambridge, MA, 1999, pp. 111–116.
[8] K. Koumpis and S. Renals, "Transcription and summarization of voicemail speech," in Proc. ICSLP 2000, 2000, pp. 688–691.
[9] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, "Spontaneous speech corpus of Japanese," in Proc. LREC 2000, Athens, Greece, 2000, pp. 947–952.
[10] T. Kikuchi, S. Furui, and C. Hori, "Two-stage automatic speech summarization by sentence extraction and compaction," in Proc. ISCA-IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, 2003.
[11] C. Hori and S. Furui, "Advances in automatic speech summarization," in Proc. Eurospeech 2001, 2001, pp. 1771–1774.
[12] C. Hori, S. Furui, R. Malkin, H. Yu, and A. Waibel, "A statistical approach to automatic speech summarization," EURASIP
J. Appl. Signal Processing, pp. 128–139, 2003.
[13] K. Knight and D. Marcu, "Summarization beyond sentence extraction: A probabilistic approach to sentence compression," Artif. Intell., vol. 139, pp. 91–107, 2002.
[14] H. Daume III and D. Marcu, "A noisy-channel model for document compression," in Proc. ACL-2002, Philadelphia, PA, 2002, pp. 449–456.
[15] C.-Y. Lin and E. Hovy, "From single to multi-document summarization: A prototype system and its evaluation," in Proc. ACL-2002, Philadelphia, PA, 2002, pp. 457–464.
[16] M. Hirohata, Y. Shinnaka, and S. Furui, "A study on important sentence extraction methods using SVD for automatic speech summarization," in Proc. Acoustical Society of Japan Autumn Meeting, Nagoya, Japan, 2003.
[17] K. Zechner, "Spoken language condensation in the 21st Century," in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 1989–1992.
TOWARDS SPONTANEOUS SPEECH RECOGNITION FOR ON-BOARD CAR NAVIGATION AND INFORMATION SYSTEMS

Martin Westphal and Alex Waibel
Interactive Systems Laboratories
University of Karlsruhe (Germany), Carnegie Mellon University (USA)
{westphal,waibel}@a.dea.de

ABSTRACT

Speech recognition is seen to be of great benefit in on-board car navigation and assistance systems. The command word approach will be used for applications in the near future, since the small active vocabulary and the hierarchical structure are much easier to cope with from the developer's side. An alternative approach, using spontaneous speech input, is far more complex but provides the user with an interface that is very intuitive and has fewer restrictions. The user can rely upon his or her experience in inter-human communication and utter spontaneous queries. In this paper, we describe the requirements and the collection of a continuous car speech database and show first recognition results obtained under different environmental conditions in the car.

Keywords: spontaneous speech recognition, speech-based car navigation interface

1. INTRODUCTION

Speech recognition technology in the car has a number of convincing advantages, and first products have already appeared on the market. However, it is clear that speech recognition in the car is a difficult task due to the noisy environment. In the European project VODIS [1], two different approaches, namely the command word approach and the spontaneous speech approach, were investigated. VODIS aims to control not only the car phone and audio components such as the radio, but also the navigation system. Depending on the task, one or the other approach is appropriate [2]. A navigation demonstrator system that allows spontaneously uttered queries was developed by the Interactive Systems Laboratories in Karlsruhe, Germany and Pittsburgh, USA. In this section we review the two approaches mentioned above and show why it was necessary to collect a continuous speech database in the car to provide our demonstrator system with a recognizer capable of processing speech recorded in the real car environment.

1.1. Limitations of the Command Word Approach

Using spoken digits to dial a phone number, or selecting from a personal phone directory by just uttering the name, is an appreciable help and greatly increases safety in the car. The first speech-based approach that includes other control functions, such as selecting the radio station, controlling the CD/cassette player, or entering destinations in a car navigation system, relies on command words. Compared to continuous speech, the recognition process is simpler and requires fewer technical resources. A set of commands can be defined that matches the desired functionality. As the number of commands grows, one would typically arrange commands with a similar context, for example those controlling the same device, within a hierarchical structure. This not only reduces the size of the active vocabulary but can also give the user guidance in the form of an active command word list on a small display. These hints are very important, since one cannot expect the user to memorize all the commands or to know which hierarchical level is currently active.

Let us consider the following example: the driver is hungry and is looking for a place to eat. Each time he utters a command, the small display will provide a new list from which he can choose.
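Such a hierarchy can be pictured as a tree of active command lists. The following is a minimal, hypothetical sketch (the menu labels follow the example, but the structure and the function names are ours, not the VODIS command grammar); the concrete command sequence it would support is shown right after it.

```python
# Hypothetical sketch of a hierarchical command-word menu. Each level maps an
# active command word either to a sub-menu (dict) or to a leaf action (str).
# A real system would return to the parent level after setting an attribute
# such as the nationality; that bookkeeping is omitted here.
MENU = {
    "NAVIGATION": {
        "ENTER DESTINATION": {
            "OTHER DESTINATIONS": {
                "RESTAURANT": {
                    "NATIONALITY": {"ITALIAN": "cuisine=italian"},
                    "RESTRICTION": {"CLOSEST": "ranking=distance"},
                    "SELECT DESTINATION": "start_guidance",
                }
            }
        }
    }
}

def active_commands(level):
    """The command words that are active (and shown on the display) right now."""
    return list(level) if isinstance(level, dict) else []

def select(level, command):
    """Descend one level; words outside the active list are simply not recognized."""
    if isinstance(level, dict) and command in level:
        return level[command]
    return level  # unknown word: stay at the current level
```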
He steps through the menus of his navigation device by using the following command words:

"NAVIGATION"
"ENTER DESTINATION"
"OTHER DESTINATIONS"
"RESTAURANT"
"NATIONALITY"
"ITALIAN"
"RESTRICTION"
"CLOSEST"
"SELECT DESTINATION"

Unless he is very familiar with the system and this sort of query, he will probably look at the possible choices on the display each time before uttering one of the commands. Note that without speech he would never be able to enter that kind of information while driving, and that he does not need his hands as he would for a tactile interface. His focus, however, is off the road for quite a while. For queries with such a degree of complexity, the command word approach reaches its limit. Besides being awkward, it is questionable whether this use of speech technology guarantees safety in the car. Nevertheless, for the near future this speech-driven approach can provide basic functionality and opens up new possibilities for on-board navigation and information systems.

1.2. The Spontaneous Speech Approach

In general, a short human conversation in the car is not considered a dangerous distraction from traffic. By allowing a variety of familiar expressions for human-machine interaction, the user can easily access a wide range of functionality without being distracted too much. Compared to the command word approach, spontaneous speech allows far more user-friendly and faster input:

"Take me to the nearest Italian restaurant!"

However, for the machine, it is much harder to recognize continuous speech. In the car, we even have to deal with spontaneous speech, since the user is still concentrating on the traffic, which can result in false starts, hesitations, or ungrammatical sentences. Also, interpreting the query requires a natural language understanding component that can process such input. The following example illustrates the usefulness of such an interface:

User: "Where .. uh .. How far is the nearest post office?"
System: The nearest post office is about 2 miles from here!
User: "Okay, take me there"

To understand the last utterance, context information is also needed. Due to the high complexity and the problems with speech recognition under noisy conditions, it will take several years until such systems are available on the market.

1.3. Towards an On-board Navigation Demonstrator for Spontaneous Speech

In [3] we described a first laboratory demonstrator for the spontaneous speech approach that allows the user to enter spontaneous navigation queries, which are recognized, parsed, and then answered using a map display. This demonstrator uses a recognizer for clean spontaneous speech based on the Janus Recognition Toolkit. To run such a system in the car, we need a continuous speech recognizer that can cope with the adverse conditions. Since most studies are based on the command word approach, considerable effort has been made to collect command words, proper names, connected digits, and letter sequences in the car environment. In the MoTiV data collection [4], a limited number of spontaneous queries were also recorded, but the main goal was to provide a database for command word recognizers (see, for example, [5]) and for small-vocabulary continuous speech recognition, such as digits and letters.

2. DATA COLLECTION

Our aim is to develop and evaluate a continuous car speech recognizer and to study the effects arising in this environment. Due to the very limited amount of available continuous speech data recorded in a real car environment, we performed our own data collection.
In this section, we describe the requirements and the collection of a car speech database.

2.1. Requirements

Although we can expect that a specific user sitting in the car will speak a number of utterances so that the recognizer can adapt, we have to provide a speaker-independent system in the first place. For the training of such a system, we need many different speakers covering both genders, all ages, dialects, and so forth. It is also necessary to cover different car types, since they have a large influence on the recorded speech. Complete coverage would surely have exceeded the scope of the project, so we recorded 43 speakers between the ages of 18 and 64 in three different cars.

The content of the utterances was oriented towards the requirements of possible applications. One part consists of spontaneous navigation queries. This amount is relatively small, since such queries have to be transcribed manually. Furthermore, the vocabulary and the resulting phonetic contexts (polyphones) very much depend on the navigation scenario. Therefore, the largest portion consists of read newspaper articles, which are easily available and do not need manual transcription. The vocabulary is significantly larger and, as a consequence, so is the polyphone coverage. In order to allow the recognition of proper names, we also collected spoken as well as spelled city and street names.

For the study of environmental effects in the car, it was very important to log the recording conditions as accurately as possible. A laptop allowed us to verify the quality of the speech recording and to log the environmental conditions. After each recording, the conditions were determined according to Table 1, and in special situations (e.g., "indicator") a predefined comment was selected.

For the audio recordings we used the same microphones as for part of the MoTiV collection [4]. The room microphone, an AKG C400, was installed on the car ceiling just above the windscreen. Simultaneously, we recorded the speech with a Sennheiser HMD 410 close-talking microphone. The latter is less affected by noise and was also used for earlier laboratory speech data collections.

Table 1: Recording conditions.

2.2. Database Statistics

According to the requirements, a first set of 10393 utterances from 43 speakers was recorded. It was collected in 3 different cars with two microphones at a sampling rate of 16 kHz. With an average duration of 4.3 seconds, the entire database amounts to a total of 12½ hours per channel. Most of the utterances contain continuous speech, and a smaller portion covers isolated or spelled names. Table 2 gives the exact number of utterances for different partitions, and Table 3 gives statistics on utterance, word, and vocabulary counts for different categories.

                 utterances   percentage
Training         9008         86.7 %
male             8009         77.1 %
Ford Escort      2564         24.7 %

Table 2: Utterance statistics.

category         utterances   words    vocabulary
Dictation        6562         65511    12891
Navigation       582          4102     582
spelled names    1327         9350     33

Table 3: Word statistics.

For each utterance we also logged the recording conditions, such as road type, road condition, speed, fan, window, and weather. No restrictions were given concerning these items. As an example, Figures 1 and 2 give an idea of the road types and the speed distribution of the collection.
In contrast to most other conditions, the speed can be measured (and is measured anyway in the car) and thus could be used as an additional input feature for the speech recognition process.

Figure 1: Road types.

Figure 2: Speed distribution (number of utterances per speed range).

In addition to the first set, we collected a second set in which each of the 10 test speakers uttered 30 navigation queries under 12 different controlled conditions, giving a total of 3600 utterances. The 12 conditions cover different speed ranges, fan settings, and special cases such as indicator, open window, and acceleration (see Table 4). The queries cover not only questions about any of 1715 streets of the German city of Karlsruhe, but also street numbers, neighborhoods, points of interest, and less specific questions like "Where can I drink a coffee here?".

3. RECOGNITION EXPERIMENTS

The collected database was used to train and evaluate a speech model for car speech (CarTrain). No data from any of the 10 test speakers was used for training. Our clean speech recognizer (LabTrain), trained on 30 hours of speech recorded in a quiet office environment, was also tested under the different conditions to determine which aspects cause performance degradation. Figure 3 shows word error rates over all conditions for the two recognizers. Note that for our application some errors are tolerable, as long as they do not change the semantics of the input and the utterance still leads to the desired response. The two most frequent substitutions are "den" vs. "dem" (both meaning "the" in English) and "zur" vs. "zum" (both meaning "to the" in English). These cases do not degrade the overall performance of the navigation system.

Training and testing were done with the JANUS-3 speech recognition toolkit. After sampling the audio signal at 16 kHz, 13 mel-frequency cepstral coefficients and their first- and second-order derivatives are computed. A speech-based cepstral mean subtraction helps to enhance channel robustness. Finally, this input vector is reduced by linear discriminant analysis (LDA) to a 32-dimensional feature vector. The acoustic model uses fully continuous Gaussian mixture densities based on 2500 decision-tree-clustered context-dependent sub-phones. Both systems use the same language model and a 3k dictionary including all streets of Karlsruhe.

Using the clean speech recognizer together with the close-talking microphone results in a word error rate of 13.2% for condition 01. This condition is very similar to the office environment, since the engine and fan are turned off. Note that the same microphone type was also used to record the training data. This setup turned out to be robust for most conditions.

Table 4: Recording conditions of set 2.

Only for speeds over 75 km/h did the error rates increase to values over 20%. For these conditions, we also observed a greater loudness of the uttered speech, which indicates that the Lombard effect also plays a role here.

For an on-board navigation system it is highly desirable to have a built-in microphone mounted in the car cabin. The database we collected and described above provides us with simultaneously recorded utterances over the head-mounted close-talking microphone and a car-mounted room microphone. This way, we can directly compare recognition results on the two channels. From Figure 3 one can see that using the room microphone (car mic) with the clean speech recognizer leads to severe degradation of the recognition performance.
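For orientation, the following is a minimal sketch of a front-end along the lines described above: 13 MFCCs, their first- and second-order derivatives, and a simplified utterance-level (rather than speech-based) cepstral mean subtraction. It uses librosa as an assumed dependency and is not the JANUS-3 front-end; the final LDA projection to 32 dimensions is omitted.

```python
import numpy as np
import librosa  # assumed dependency; not the JANUS-3 front-end itself

def car_speech_features(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """13 MFCCs plus first- and second-order derivatives per frame.

    Cepstral mean subtraction is applied over the whole utterance here,
    whereas the recognizer described above restricts it to speech frames.
    """
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)   # cepstral mean subtraction
    d1 = librosa.feature.delta(mfcc)                 # first-order derivatives
    d2 = librosa.feature.delta(mfcc, order=2)        # second-order derivatives
    return np.vstack([mfcc, d1, d2]).T               # shape: (frames, 39)
```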
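The error rates compared here and below are standard word error rates. As a reference point, a minimal sketch of the usual computation (Levenshtein alignment over word sequences) follows; it shows the generic metric, not the scoring tool used in these experiments, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution of the kind discussed above ("zur" vs. "zum"):
# word_error_rate("fahr zur Kaiserstrasse", "fahr zum Kaiserstrasse") -> 1/3
```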
For the clean speech condition 01 we find an error rate of 17.4% due to the microphone mismatch. For all other conditions the performance losses are higher, especially for a high fan setting and high speeds.

The car speech recognizer was trained with data recorded with the room microphone under real driving conditions. This training helped to improve the recognition results for most conditions. For some of the conditions they even became comparable to the close-talking results. For others, such as a high fan setting (condition 04), the error rate is still at a level of almost 30%. The indicator (condition 02) seems to affect the recording of the room microphone and was also not well compensated by the car speech training. The clean condition 01 gives the worst results, with 30.1%. This condition is a very rare case in our car speech database and thus constitutes a mismatch between training and test environments.

In the future, we aim at improving the car speech and the clean speech recognizers by noise reduction and adaptation methods. We have already observed fundamental differences in the effectiveness of such methods for continuous versus single-word recognition.

4. SUMMARY

By providing a continuous speech database for car speech, we could build a spontaneous speech recognizer for a navigation task in the car. The results are comparable to those of a clean speech recognizer trained on a very large database of spontaneous speech and tested with the same type of close-talking microphone that was used to record the training data.

REFERENCES

[1] VODIS: "Advanced Speech Technologies for Voice Operated Driver Information Systems", EC Language Engineering Project LE 1-2277. VODIS-URL: a.de/VODIS
[2] D. Van Compernolle: "SPEECH RECOGNITION IN THE CAR – From Phone Dialing to Car Navigation", Eurospeech '97, pp. 2431-2334, Rhodes, Greece, 1997.
[3] P. Geutner, M. Denecke, U. Meier, M. Westphal and A. Waibel: "Conversational Speech Systems For On-Board Car Navigation And Assistance", ICSLP '98, Adelaide, Australia, 1998.
[4] D. Langmann, H. Pfitzinger, T. Schneider, R. Grudszus, A. Fischer, M. Westphal, T. Crull, U. Jekosch: "CSDC – The MoTiV Car Speech Data Collection", ICLRE '98.
[5] A. Fischer and V. Stahl: "Subword Unit Based Speech Recognition In Car Environments", ICASSP '98, pp. 257-260, Seattle, USA, 1998.