Content-based Language Models for Spoken Document Retrieval
Hsin-min Wang and Berlin Chen
Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
Email: whm@.tw, berlin@.tw

Abstract

Spoken document retrieval (SDR) has been extensively studied in recent years because of its potential use in navigating large multimedia collections in the near future. This paper presents a novel concept of applying content-based language models to spoken document retrieval. In an example task for retrieval of Mandarin Chinese broadcast news data, the content-based language models, either trained on automatic transcriptions of spoken documents or adapted from baseline language models using automatic transcriptions of spoken documents, were used to create more accurate recognition results and indexing terms from both spoken documents and speech queries. We report on some interesting findings obtained in this research.

Keywords: spoken document retrieval (SDR); content-based language models; speech recognition.

1. Introduction

Massive quantities of spoken audio are becoming available on the web. Therefore, intelligent and efficient information retrieval techniques that allow easy access to spoken documents, such as radio and television shows, are highly desired and have been extensively studied in recent years [1-6]. Several different approaches have been developed for spoken document retrieval (SDR). Word-based retrieval approaches [1-3] have been very popular and successful, though with the potential problems of either having to know the query words in advance, or requiring a lexicon large enough to cover growing, dynamic contents such as diverse broadcast news. Other researchers [4-6] proposed subword-based approaches, because subword units can provide complete phonological coverage for spoken documents and circumvent the out-of-vocabulary (OOV) problem in audio indexing and retrieval.

In either approach, applying language models to speech recognition can definitely improve recognition accuracy as well as retrieval performance. Lin et al. [7] proved, in their evaluation of retrieving a very large collection of Mandarin Chinese text documents with Mandarin speech queries, that applying language models trained on the target text database to the recognition of speech queries could significantly improve both recognition accuracy and retrieval performance. However, for spoken document retrieval, the real issue is how to collect text corpora adequate for training the language models. For example, in automatic transcription of broadcast news, many experimental results have shown that using language models trained on broadcast news text achieves higher recognition accuracy than using language models trained on newswire text [8]. Therefore, there is reason to believe that language models adapted from baseline language models using the automatic transcriptions of the large collection of spoken documents may be helpful for both speech recognition and information retrieval. However, for some tasks, such as retrieval of recordings of meetings or voice mails, adequate corpora may be very difficult to collect or even not available at all. In these cases, only language models trained on the automatic transcriptions of the large collection of spoken documents are available, and they may be helpful as well. Accordingly, this paper presents a concept of applying content-based language models to spoken document retrieval.
The content-based language models can be either trained on automatic transcriptions of spoken documents or adapted from baseline language models using automatic transcriptions of spoken documents. The recognition of spoken documents can be iteratively executed and, thus, more accurate recognition results can be obtained and used to create the indexing terms. Such content-based language models can be used to improve the recognition accuracy of speech queries as well. Hopefully, retrieval performance can be improved correspondingly.

Considering the monosyllabic structure of the Chinese language, a whole class of indexing features for retrieval of Mandarin Chinese broadcast news using syllable-level statistical characteristics has been previously investigated [6]. The syllable-based approach is a special case of subword-based approaches. The feasibility of the proposed approach has been tested on the same task.

The rest of this paper is organized as follows. The methodology of the proposed approach to spoken document retrieval is introduced in Section 2. The speech recognition process and the retrieval method are presented in Sections 3 and 4, respectively. Then, the target database to be retrieved is introduced in Section 5, and all the experimental results are discussed in Section 6. Finally, the concluding remarks are made in Section 7.

2. Methodology

The block diagram of the proposed approach to spoken document retrieval is shown in Figure 1. During the phase of database preparation, the speech recognition module first transcribes every spoken document in the database to a word (or subword) string based on the acoustic models. Then, the automatic transcriptions of all the spoken documents are used to train the so-called content-based language models, and the speech recognition module repeats recognition based on the acoustic models and the content-based language models. For a bi-gram model, the bi-gram probabilities are estimated using the following equation:

    p(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})},    (1)

where c(w_{i-1}, w_i) and c(w_{i-1}) respectively denote the occurrence counts of the word pair (w_{i-1}, w_i) and the word w_{i-1} in the spoken document collection. The details of speech recognition based on statistical acoustic models and N-gram language models can be found in many books and papers [9]. The language model training and speech recognition processes can iterate several times, and higher recognition accuracy can be achieved.

If the baseline language models are available, the automatic transcriptions of the spoken documents are used to adapt the baseline language models to the content-based language models. For a bi-gram model, the adapted bi-gram probabilities can be obtained using the following equation:

    p(w_i \mid w_{i-1}) = \frac{c_0(w_{i-1}, w_i) + \alpha \cdot c(w_{i-1}, w_i)}{c_0(w_{i-1}) + \alpha \cdot c(w_{i-1})},    (2)

where c_0(w_{i-1}, w_i) and c_0(w_{i-1}) respectively denote the occurrence counts of the word pair (w_{i-1}, w_i) and the word w_{i-1} in the text corpora on which the baseline language models were trained, and α is a weighting factor. In this paper, α is simply set to 1. Again, the language model adaptation and speech recognition processes can iterate several times, and higher recognition accuracy can be achieved. A toy sketch of both estimation formulas follows.
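To make Equations (1) and (2) concrete, here is a minimal Python sketch of count collection, maximum-likelihood bi-gram estimation, and count-merging adaptation. It is an illustration under our own assumptions rather than the authors' implementation: the function names (bigram_counts, bigram_prob, adapted_bigram_prob) and the toy data are ours, and smoothing of unseen bi-grams is omitted.

```python
from collections import Counter

def bigram_counts(transcriptions):
    """Collect unigram and bigram counts from token (e.g. syllable) sequences."""
    uni, bi = Counter(), Counter()
    for sent in transcriptions:
        for i, w in enumerate(sent):
            uni[w] += 1
            if i > 0:
                bi[(sent[i - 1], w)] += 1
    return uni, bi

def bigram_prob(uni, bi, prev, cur):
    """Equation (1): p(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    return bi[(prev, cur)] / uni[prev] if uni[prev] else 0.0

def adapted_bigram_prob(uni0, bi0, uni, bi, prev, cur, alpha=1.0):
    """Equation (2): merge baseline counts (c_0) with spoken-document counts (c),
    weighted by alpha (set to 1 in the paper)."""
    num = bi0[(prev, cur)] + alpha * bi[(prev, cur)]
    den = uni0[prev] + alpha * uni[prev]
    return num / den if den else 0.0

# Usage with toy data standing in for the baseline corpus and the transcriptions.
uni0, bi0 = bigram_counts([["a", "b", "a"], ["b", "a"]])   # baseline corpus
uni, bi = bigram_counts([["a", "b"], ["a", "b", "b"]])     # automatic transcriptions
print(bigram_prob(uni, bi, "a", "b"))                      # content-based model, Eq. (1)
print(adapted_bigram_prob(uni0, bi0, uni, bi, "a", "b"))   # adapted model, Eq. (2)
```

In the paper's setting, the sequences would be syllable strings produced by the recognizer, and α in Equation (2) would trade off the 50M-character baseline corpus against the much smaller spoken document collection.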
Finally, the feature vector extraction module constructs the feature vectors from these automatic transcriptions.

When a user enters a speech query into the retrieval system, the speech recognition module first transcribes the speech query to a word (or subword) string based on the acoustic models for speech queries and the content-based language models obtained in the database preparation phase. Then, the feature vector extraction module constructs the feature vector from the word (or subword) string. Finally, the retrieval module evaluates the similarity measures between the feature vector of the speech query and all the feature vectors of the spoken documents, and selects the set of documents with the highest similarity measures as the retrieval output.

In this paper, a Mandarin spoken document or a speech query is transcribed to a syllable string, and the feature vector consists of overlapping syllable N-grams (uni-gram, bi-gram, and tri-gram) and syllable pairs separated by n syllables (n = 1, 2, 3). The details of speech recognition and information retrieval are given in Sections 3 and 4, respectively.

3. Speech Recognition

3.1 A brief introduction to Mandarin Chinese

In the Chinese language, there are more than 10,000 characters. A Chinese word is composed of one to several characters. The combination of one to several such characters gives an almost unlimited number of words, which can be found in different versions of dictionaries and texts on different subjects. Two nice features of the language are that all characters are monosyllabic, and that the total number of phonologically allowed syllables is only 1,345. Furthermore, Mandarin Chinese is a tonal language, in which each syllable is assigned a tone, and there are a total of 4 lexical tones plus 1 neutral tone. If the differences among syllables caused by tones are disregarded, only 416 "base syllables" (i.e., the tone-independent syllable structures) instead of 1,345 different "tonal syllables" (i.e., including the tone features) are required to cover all the pronunciations of Mandarin Chinese [10]. Base syllable recognition is, thus, believed to be the first key problem in speech recognition as well as in spoken document retrieval for Mandarin Chinese.

3.2 Signal processing

For speech recognition, a speech signal is typically divided into a number of overlapping time frames, and a speech parametric vector is computed to represent each time frame. In our speech recognizer, spectral analysis is applied to a 20 msec frame of speech waveform every 10 msec. For each speech frame, 12 mel-frequency cepstral coefficients (MFCC) and the logarithmic energy are extracted, and these 13 coefficients along with their first and second time derivatives are combined to form a 39-dimensional feature vector. In addition, to compensate for channel noise, utterance-based cepstral mean subtraction (CMS) is applied to all the training sentences, spoken documents, and speech queries. A minimal sketch of this front-end processing is given below.
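As a concrete illustration of the front end, the sketch below applies utterance-based CMS to a matrix of static features and appends time-derivative coefficients. It is a minimal sketch under our own assumptions (NumPy, features stored as a frames-by-coefficients array, derivatives approximated with np.gradient); practical systems often compute deltas with a windowed linear regression instead.

```python
import numpy as np

def cepstral_mean_subtraction(feats):
    """Utterance-based CMS: subtract the per-coefficient mean over all frames
    of one utterance, compensating for stationary channel effects."""
    return feats - feats.mean(axis=0, keepdims=True)

def add_deltas(feats):
    """Append first and second time derivatives computed by central differencing,
    turning 13 static coefficients into a 39-dimensional vector per frame."""
    delta = np.gradient(feats, axis=0)    # first derivative over frames
    delta2 = np.gradient(delta, axis=0)   # second derivative
    return np.hstack([feats, delta, delta2])

# Usage: 200 frames of 12 MFCCs + log energy (toy random data).
static = np.random.randn(200, 13)
features = add_deltas(cepstral_mean_subtraction(static))
assert features.shape == (200, 39)
```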
3.3 Acoustic modeling

Although the base syllable is a very natural recognition unit for Mandarin Chinese due to the monosyllabic structure of the Chinese language, using it leads to inefficient utilization of the training data in the training phase and high computational requirements in the recognition phase. Thus, the acoustic units chosen here are 112 context-dependent INITIALs and 38 context-independent FINALs, based on the monosyllabic nature of Mandarin Chinese and the INITIAL-FINAL structure of Mandarin base syllables. Here, INITIAL means the initial consonant of the base syllable, and FINAL means the vowel (or diphthong) part, including an optional medial or nasal ending.

Each INITIAL is represented by an HMM with 3 states, while each FINAL is represented by one with 4 states. The number of Gaussian mixtures per state ranges from 2 to 16, depending on the amount of training data. Therefore, every syllable unit is represented by a 7-state HMM. The silence model is a 1-state HMM with 32 Gaussian mixtures trained on the non-speech segments.

3.4 Language modeling

The baseline syllable language models are trained on a newswire text corpus consisting of 50M Chinese characters collected from the Central News Agency (CNA) from January to July 1999. The training materials are word-identified, with phonetic spellings assigned using a pronunciation lexicon consisting of around 62,000 frequently used Chinese words.

3.5 Syllable recognition

If the language models are not available, the syllable recognizer performs only free syllable decoding without any grammar constraints. On the other hand, in a more complicated syllable recognizer, a two-pass search strategy is used. In the first pass, a Viterbi search is performed based on the acoustic models and the syllable bi-gram language models, and the score at every time index is stored. In the second pass, a backward time-asynchronous A* tree search [11] generates the best syllable string based on the heuristic scores obtained from the first-pass search and the syllable bi-gram language models.

4. Information Retrieval

4.1 Vector space models

Vector space models, widely used in many text information retrieval systems, are used here, in which each document or query is represented by a feature vector:

    V = (w_1 \cdot f(t_1) \cdot \mathrm{idf}(t_1), \ldots, w_K \cdot f(t_K) \cdot \mathrm{idf}(t_K)),    (3)

where w_i, f(t_i), and idf(t_i) are respectively the weight, frequency, and inverse document frequency (IDF) of the indexing term t_i, while K is the total number of distinct indexing terms. Here, the term frequency is defined as follows:

    f(t_i) = \ln\Big(1 + \sum_{j=1}^{n_{t_i}} cm_{t_i}(j)\Big),    (4)

where cm_{t_i}(j), ranging from 0 to 1, is the normalized acoustic confidence measure of the j-th occurrence of the indexing term t_i in the document or query, and n_{t_i} is the total number of occurrences of the indexing term t_i in the document or query.

The weight w_i differs for different classes of indexing terms, as explained below. The Cosine measure, widely used in text information retrieval, is used to estimate the similarity between a document and a query.

4.2 Syllable-level indexing terms

In this paper, the syllable-level indexing terms are composed of overlapping syllable segments with length N (S(N), N = 1~3) and syllable pairs separated by n syllables (P(n), n = 1~3). Here, the syllable segment with length N corresponds to the overlapping syllable N-gram. Considering a syllable string of 10 syllables S1 S2 S3 ... S10, examples of the former are listed in the upper half of Table 1, while examples of the latter are in the lower half of Table 1. The combination of these indexing terms has been shown to be very effective for Mandarin Chinese SDR [6]. For example, the overlapping syllable segments with length N can capture the information of polysyllabic words or phrases, while the syllable pairs separated by n syllables can tackle the problems arising from flexible wording structure, abbreviation, and speech recognition errors. In the following experiments, the weight in Equation (3) is 0.1 for S(1) and 1.0 for all the remaining classes of indexing terms. A toy sketch of this indexing and matching scheme follows.
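The sketch below illustrates the indexing and matching scheme of Equations (3) and (4): extraction of S(N) and P(n) terms from a syllable string, confidence-weighted term frequencies, class-dependent weights, and the Cosine measure. It is a toy reconstruction, not the authors' code; in particular, the assumption that a multi-syllable term occurrence inherits the minimum of its syllables' confidence measures is ours, and the IDF table is taken as given.

```python
import math
from collections import defaultdict

def extract_terms(syllables, conf):
    """Yield (term, confidence) for overlapping segments S(N), N=1..3, and
    syllable pairs separated by n syllables P(n), n=1..3. Assumption (ours):
    an occurrence's confidence is the minimum per-syllable confidence it spans."""
    for N in (1, 2, 3):                                  # S(N) terms
        for i in range(len(syllables) - N + 1):
            yield ("S", N, tuple(syllables[i:i + N])), min(conf[i:i + N])
    for n in (1, 2, 3):                                  # P(n) terms
        for i in range(len(syllables) - n - 1):
            j = i + n + 1
            yield ("P", n, (syllables[i], syllables[j])), min(conf[i], conf[j])

def build_vector(syllables, conf, idf):
    """Equations (3)-(4): component = w * f(t) * idf(t), where
    f(t) = ln(1 + sum of the confidences of t's occurrences),
    with w = 0.1 for S(1) terms and 1.0 for all other classes."""
    sums = defaultdict(float)
    for term, cm in extract_terms(syllables, conf):
        sums[term] += cm
    return {t: (0.1 if t[:2] == ("S", 1) else 1.0) * math.log(1 + s) * idf.get(t, 1.0)
            for t, s in sums.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy usage: perfect confidences, uniform IDF.
doc = build_vector(["tai", "bei", "xin", "wen"], [1.0] * 4, idf={})
qry = build_vector(["tai", "bei"], [1.0] * 2, idf={})
print(cosine(qry, doc))
```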
5. Data collection

In this paper, the radio news was recorded using a wizard FM radio connected to a PC, and digitized at a sampling rate of 16 kHz with 16-bit resolution. The data were collected from several radio stations, all located in Taipei, from December 1998 to July 1999. All the recordings were manually segmented into stories and transcribed into text.

The target database to be retrieved in the following experiments consists of 757 recordings (about 10.5 hours of speech materials). They were collected from the Broadcasting Corporation of China (BCC). Each recording is a short news abstract (about 50 seconds of speech materials) produced by a news announcer. Hence, each recording may contain several news items. Some recordings in the database contain background music. 40 simple queries were manually created to support the retrieval experiments. Each query has on average 23.3 relevant documents among the 757 documents in the database, with the exact number ranging from 1 to 75. Two speakers (one male and one female) were each asked to record the 40 queries. At first glance, this SDR task seems easy, since the retrieval target contains only anchors' speech. However, this SDR task is in fact very difficult, because almost all the query terms appear only once in each of their relevant documents and the queries are often very short. On average, each query contains roughly only 4 characters (or syllables).

A different broadcast news speech database, consisting of 453 stories (about 4.0 hours of speech materials) collected from Police Radio Station (PRS), UFO Station (UFO), Voice of Han (VOH), and Voice of Taipei (VOT), is used for training the speaker-independent HMMs for the recognition of broadcast news speech. Another read speech database, including 5.3 hours of speech materials for phonetically balanced sentences and isolated words produced by roughly 80 male and 40 female speakers, is used for training the speaker-independent HMMs for the recognition of speech queries.

6. Experiments

We have evaluated the proposed approach in two different ways. In the first way, we iteratively perform syllable recognition on all the spoken documents. In the second way, we perform syllable recognition on all the spoken documents just once, and every spoken document is first represented as a syllable lattice. Then, we iteratively perform a re-scoring process on every syllable lattice, based on the acoustic recognition scores and the content-based language models, to generate a new best syllable string. The experimental results show that applying content-based language models to speech recognition in either way not only improves the recognition accuracies of both the spoken documents and the speech queries but also improves retrieval performance.

6.1 Experimental results based on the standard approach

As described in Section 2, in the database preparation phase, we can first compute the content-based language models from the automatic transcriptions of all the spoken documents using Equation (1) or (2) and then re-perform the syllable recognition process on all the spoken documents. This process can be performed iteratively and, hopefully, both recognition accuracy and retrieval performance can be improved.

6.1.1 Syllable recognition results

Table 2 summarizes the syllable recognition accuracies obtained in this manner. In the first experiment, we assume that the baseline language models are not available.
In the second row of Table 2, the test using free syllable decoding gives a recognition accuracy for spoken documents of only 56.11%. With the content-based syllable bi-gram language models applied, the recognition accuracy is improved to 60.77%, as shown in the third row. That is, with the content-based language models applied, the recognition accuracy improves by about 8% (relative). These results indicate that applying content-based language models to the recognition of spoken documents is useful for improving recognition accuracy. However, the system has to perform speech recognition twice, which is very time consuming. We have also tested on the speech queries. The test using free syllable decoding gives a recognition accuracy of only 55.56%. With the content-based language models applied to the recognition of speech queries, the recognition accuracy is improved to 70.68%. These results show that the content-based language models are even more useful for the recognition of speech queries: the recognition accuracy improves by about 27% (relative).

If we apply the baseline language models to the initial recognition of spoken documents and speech queries, the recognition accuracies are 64.68% and 72.84%, respectively, as shown in the fourth row of Table 2. With the content-based language models, adapted from the baseline language models using automatic transcriptions of spoken documents, applied to speech recognition, the recognition accuracies are 64.52% and 72.84%, respectively. The accuracies remain almost unchanged, probably because the weighting factor α used in Equation (2) was not carefully chosen; α is set to 1.0 for simplicity at the moment. The text corpus for training the baseline language models contains about 50M characters (syllables), while the spoken document collection contains fewer than 200K characters (syllables), so the occurrence counts of syllables and syllable pairs contributed by the relatively small spoken document collection can hardly influence the total occurrence counts in Equation (2). Hence, designing a sophisticated formula or procedure to set the value of α, according to the size of the text corpus used for training the baseline language models, the size of the spoken document collection, and the speech recognition accuracy or the acoustic recognition scores of words (we used syllables in this paper), is worthy of further study. A more sophisticated equation for language model adaptation may help as well. Furthermore, in typical speech recognition systems, speaker adaptation techniques are widely used to customise the HMMs to the characteristics of a particular speaker using a small amount of enrollment or adaptation data. Hence, other possible directions include applying speaker adaptation techniques to language model adaptation and integrating the language model adaptation techniques with the speaker adaptation techniques.

6.1.2 Spoken document retrieval

Retrieval performance in terms of non-interpolated average precision [12] is shown in Table 3, where SD/SQ and SD/TQ represent the results of spoken document retrieval obtained using speech queries and text queries, respectively. That is, the erroneous syllable strings of documents and queries obtained from speech recognition are denoted as SD (Spoken Documents) and SQ (Speech Queries), respectively, while the query transcripts, which are exactly correct, are denoted as TQ (Text Queries). Again, we first assume the baseline language models are not available.
The non-interpolated average precisions for SD/SQ and SD/TQ based on the transcriptions obtained by free syllable decoding are 0.2524 and 0.4366, respectively, as shown in the second row of Table 3. With the content-based syllable bi-gram language models trained on automatic transcriptions of spoken documents applied to speech recognition, they are improved to 0.3617 and 0.4732, respectively, as shown in the third row.

If we apply the baseline language models to the recognition of spoken documents and speech queries, the non-interpolated average precisions for SD/SQ and SD/TQ are 0.4264 and 0.5541, respectively, as shown in the fourth row of Table 3. With the content-based language models, adapted from the baseline language models using automatic transcriptions of spoken documents, applied to speech recognition, the average precisions are 0.4264 and 0.5562, respectively. Again, like the recognition accuracies, the retrieval performance remained almost unchanged when the content-based language models were applied to speech recognition. The language model adaptation case is worthy of further study, as mentioned above.

6.2 Experimental results based on the fast approach

Because iteratively applying the syllable recognition process to all the spoken documents in the large collection is very time consuming, we have designed an alternative way to save recognition time. First of all, all the spoken documents are transcribed to syllable lattices. Then, in each iteration, the content-based syllable bi-gram language models are used to select a new best syllable string from the syllable lattice of each spoken document. The content-based language models can be either adapted from the baseline syllable bi-gram language models using the best syllable strings of all the spoken documents obtained in the previous iteration, or directly trained on the best syllable strings obtained in the previous iteration. This iterative process for finding a new best syllable string from a syllable lattice, based on the acoustic recognition scores and the content-based syllable bi-gram language models, is very fast, because only a relatively simple re-scoring process, instead of a complicated speech recognition process, is executed. The new syllable string obtained after each iteration is not an optimal result based on the acoustic models and the content-based language models, because it is obtained under the constraints of the segmentation obtained from free syllable decoding and the syllable candidates contained in the syllable lattice.

The syllable lattice generation process is as follows. First of all, the syllable recognizer, which performs free syllable decoding without any grammar constraints or performs syllable recognition based on both the acoustic models and the baseline language models, is used to transcribe all the spoken documents. Then, based on the state likelihood scores calculated in the first search and the syllable boundaries of the best syllable string, the syllable recognizer further performs a Viterbi search on each utterance segment that may include a syllable, and outputs the most likely syllable candidates with their corresponding acoustic recognition scores. Finally, a syllable lattice is constructed. In this study, each syllable segment contains 25 syllable candidates. A sketch of the re-scoring step is given below.
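The following sketch shows one way the re-scoring step can be realized as dynamic programming over the lattice: each segment holds a list of syllable candidates with acoustic log scores, and a Viterbi pass picks the path maximizing the combined acoustic and bi-gram language model score. This is our own illustrative reconstruction under stated assumptions (log-domain scores, a caller-supplied bigram_logprob function, a language model scale factor lm_scale), not the authors' implementation.

```python
import math

def rescore_lattice(lattice, bigram_logprob, lm_scale=1.0):
    """Select the best syllable string from a lattice by Viterbi over segments.

    lattice: list over segments; each segment is a list of
             (syllable, acoustic_logscore) pairs (25 candidates per
             segment in the paper's setup).
    bigram_logprob(prev, cur): log p(cur | prev) under the content-based model.
    """
    # best[c] = (total score, path) for candidate index c of the current segment
    best = {c: (ac, [syl]) for c, (syl, ac) in enumerate(lattice[0])}
    for seg in lattice[1:]:
        new_best = {}
        for c, (syl, ac) in enumerate(seg):
            # choose the predecessor maximizing acoustic + scaled LM score
            score, path = max(
                (s + ac + lm_scale * bigram_logprob(p[-1], syl), p + [syl])
                for s, p in best.values()
            )
            new_best[c] = (score, path)
        best = new_best
    return max(best.values())[1]

# Toy usage with a uniform bigram model.
toy = [[("tai", -1.0), ("dai", -1.2)], [("bei", -0.5), ("pei", -0.9)]]
print(rescore_lattice(toy, lambda p, c: math.log(0.5)))
```

Because the segmentation and the candidate sets are fixed, each iteration only re-ranks existing hypotheses, which is why this re-scoring loop is much cheaper than repeating full recognition.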
6.2.1 Syllable recognition results

Table 4 summarizes the syllable recognition accuracies obtained in this manner. We first evaluated training the content-based language models on the automatic transcriptions of the spoken documents; that is, we assume that the baseline language models are not available. As shown in the second row, the test using free syllable decoding gives recognition accuracies for spoken documents and speech queries of only 56.11% and 55.56%, respectively. With the syllable bi-gram language models trained on the automatic transcriptions of the spoken documents obtained by free syllable decoding applied to re-scoring, the recognition accuracies are improved to 57.53% and 56.48%, respectively, as shown in the third row. With the syllable bi-gram language models trained on the new transcriptions applied to re-scoring, the recognition accuracies are improved to 58.81% and 61.11%, respectively, as shown in the fourth row. After an additional iteration, the recognition accuracies are further improved to 58.91% and 62.04%, respectively, as shown in the fifth row.

We have also evaluated adapting the baseline language models using automatic transcriptions of spoken documents. With the baseline language models applied to the initial syllable recognition, the recognition accuracies of spoken documents and speech queries are 64.68% and 72.84%, respectively, as shown in the sixth row of Table 4. With the content-based syllable bi-gram language models, adapted from the baseline language models using automatic transcriptions of spoken documents, applied to re-scoring, the recognition accuracies are 64.51% and 72.84%, respectively, as shown in the seventh row. After one more iteration, the recognition accuracies are 64.49% and 73.15%, respectively, as shown in the last row. The recognition accuracies remain almost unchanged, probably for the same reasons discussed in Section 6.1.1.

The detailed recognition results for the first 10 iterations are depicted in Figure 2. In the case where the baseline language models are not used, the recognition accuracies improve at first, but subsequently the improvements are not obvious, and sometimes the recognition accuracy even degrades slightly. In the case with the baseline language models applied, the curves are almost flat.

6.2.2 Spoken document retrieval

Retrieval performance in terms of non-interpolated average precision with respect to the number of iterations is depicted in Figure 3. In general, Figure 3 reveals trends very similar to those in Figure 2, which displayed the speech recognition accuracies, because retrieval performance depends strongly on the recognition accuracies of both spoken documents and speech queries. In the case without the baseline language models applied, the performance improvements for the SD/TQ case are less significant than those for the SD/SQ case. This is because the retrieval performance for the SD/SQ case depends not only on the recognition accuracy of spoken documents but also on the recognition accuracy of speech queries. According to Table 4 and Figure 2, it is obvious that the improvements in query recognition accuracy are more significant than the improvements in document recognition accuracy in this task. In the case with the baseline language models applied, the curves for both SD/TQ and SD/SQ are almost flat, just like the curves for the recognition accuracies depicted in Figure 2(b).
The best non-interpolated average precisions for SD/SQ and SD/TQ based on the transcriptions obtained by applying the content-based syllable bi-gram language models trained on automatic transcriptions of spoken documents to speech recognition are 0.3340 and 0.4474, respectively, while they are 0.2524 and 0.4366 based on the automatic transcriptions obtained by free syllable decoding. The non-interpolated average precisions for SD/SQ and SD/TQ are 0.4264 and 0.5541 based on the transcriptions obtained by applying the baseline language models to speech recognition. The best non-interpolated average precisions for SD/SQ and SD/TQ based on the transcriptions obtained by applying the content-based syllable bi-gram language models, adapted from the baseline language models using transcriptions of spoken documents, to speech recognition are 0.4323 and 0.5518, respectively. The retrieval performances after the first several iterations are listed in Table 5.

7. Conclusions

In this paper, we have proposed a novel approach to spoken document retrieval, which applies content-based language models to the recognition of spoken documents and speech queries. The content-based language models can be either trained on automatic transcriptions of spoken documents or adapted from baseline language models using such transcriptions.