DEALING WITH PHRASE LEVEL CO-ARTICULATION (PLC) IN SPEECH RECOGNITION: A FIRST APPROACH
2024 College Entrance Examination English Final Mock Paper (National Paper) (Annotated Edition)

2024 College Entrance Examination English Final Mock Paper [National Paper]
English (Time allowed: 120 minutes; full score: 150 points)
Notes:
1. Before answering, candidates must write their name, candidate number, etc. in the designated places on the answer sheet and the test paper.
2. For multiple-choice questions, after choosing an answer, use a pencil to blacken the corresponding answer label on the answer sheet. To change an answer, erase it cleanly first, then mark another label. For non-multiple-choice questions, write the answers on the answer sheet; answers written on this test paper are invalid.
3. After the examination, hand in both this test paper and the answer sheet.

Part I Listening Comprehension (two sections, 30 points)
Section A (5 questions; 1.5 points each, 7.5 points)
Listen to the following 5 conversations. Each conversation is followed by one question; choose the best answer from the three options A, B and C. After each conversation, you will have 10 seconds to answer the question and to read the next one. Each conversation is played only once.
1. Where are the old cases?
A. In the boxes.  B. In the bookcase.  C. In the drawers.
[Answer] C
[Transcript] M: Do you like the way I organized the files in the bookcase, Ms. Stanford? W: Yes, you've done a good job of organizing them, but what did you do with the old cases in the yellow boxes? M: I moved those into the drawers, since we don't use them very often.
2. What animal does the woman own?
A. A mouse.  B. A dog.  C. A cat.
[Answer] B
[Transcript] W: Watch what happens when I place some cheese on the edge of the wall just here… M: Oh, my goodness! Is that a mouse that just grabbed it? W: Yes! And the dog kept watching from his bed! M: Maybe it's time to get a cat!
3. Where does this conversation take place?
A. In a house.  B. In a park.  C. In a forest.
[Answer] C
[Transcript] M: How much further is this walk? W: Not long. We just have to walk past that big house and then through a park. M: We've been walking through the forest for ages now.
4. What can we learn about the woman?
A. She found a great job.  B. She is popular in college.  C. She won the student election.
[Answer] B
[Transcript] W: I'm putting my name forward for the upcoming student election. I'm hoping to be the first student union president from Asia at the university. M: That's fantastic news, and you'd do a great job. I think you have a great chance of winning as everybody likes you!
5. What are the speakers talking about?
A. Making a birthday cake.  B. Going to a birthday party.  C. Repairing the broken clock.
[Answer] B
[Transcript] M: I thought I'd set my alarm clock, but I didn't hear it ring! W: Oh, no. And Ashley's birthday party is going to start in a few minutes. M: I better get there in a hurry before everyone eats all the birthday cake. W: You and cake?! Let's not forget whose birthday it is.
Section B (15 questions; 1.5 points each, 22.5 points)
Listen to the following 5 conversations or monologues.
Robust speech recognition based on combining atan-LMS with PLPC

…atan-LMS performs better for speech recognition than the sigmoid-function variable step-size LMS (SVSLMS) combined with PLPC. Moreover, the recognition rate increases faster as the signal-to-noise ratio decreases.
Keywords: PLPC; adaptive filtering; LMS algorithm

…applied to recognition based on speech features that are insensitive to noise, such as perceptual linear prediction, a technique that models the auditory characteristics of the human ear. Many experiments have shown that speech recognition systems using this technique achieve some improvement in recognition rate, but the gain is rather limited.
An adaptive filter based on LMS minimizes the mean square error between the output signal and the desired response, and its performance is strongly affected by the adaptive algorithm. The variable step-size algorithm (SVSLMS) proposed in reference [4] has good system tracking ability, but its step size changes drastically when the error approaches zero.
Given these shortcomings of traditional adaptive filtering and PLP, the author proposes a…
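The LMS recursion described above (minimising the mean square error between the filter output and the desired response) reduces to the update w ← w + μ·e·x. Below is a minimal NumPy sketch of the standard fixed-step LMS algorithm on a synthetic system-identification task; it is not the paper's atan-LMS or SVSLMS variant, and the filter length, step size and test signal are illustrative assumptions:

```python
import numpy as np

def lms_filter(x, d, n_taps=4, mu=0.05):
    """Standard LMS: adapt weights w so the filter output tracks d,
    minimising the mean square error of e = d - y."""
    w = np.zeros(n_taps)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(n_taps - 1, len(x)):
        window = x[n - n_taps + 1:n + 1][::-1]   # [x[n], x[n-1], ...]
        y[n] = w @ window                        # filter output
        e[n] = d[n] - y[n]                       # error vs. desired response
        w += mu * e[n] * window                  # stochastic-gradient step
    return w, e

# System identification: recover an unknown FIR response h from input/output
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
h = np.array([0.6, -0.3, 0.1, 0.05])             # "unknown" system
d = np.convolve(x, h)[:len(x)]                   # desired response
w, e = lms_filter(x, d)
```

Variable step-size schemes such as SVSLMS replace the fixed `mu` with a function of the error; the complaint quoted above is precisely that the sigmoid-based schedule changes too sharply near zero error.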
Introduction to MFCC

In speech recognition and speaker recognition, the most commonly used speech features are Mel-scale Frequency Cepstral Coefficients (MFCC).
Research on the human auditory mechanism shows that the human ear has different sensitivity to sound waves of different frequencies. Speech signals between 200 Hz and 5000 Hz have the greatest influence on speech intelligibility.
When two sounds of unequal loudness reach the ear, the presence of the louder frequency components affects the perception of the quieter frequency components, making them harder to notice; this phenomenon is called the masking effect. Because a lower-frequency sound travels a greater distance along the basilar membrane of the cochlea than a higher-frequency one, low tones tend to mask high tones, while it is harder for high tones to mask low tones. The critical bandwidth of masking is smaller at low frequencies than at high frequencies.
Therefore, a bank of band-pass filters, spaced densely at low frequencies and sparsely at high frequencies according to the critical bandwidth, is used to filter the input signal. The output energy of each band-pass filter is taken as a basic feature of the signal; after further processing, these features can serve as the input features for speech.
Because these features do not depend on the nature of the signal, make no assumptions or restrictions on the input, and exploit findings from auditory modelling, they are more robust than the vocal-tract-model-based LPCC, better match the auditory characteristics of the human ear, and retain good recognition performance when the signal-to-noise ratio drops.
Mel-scale Frequency Cepstral Coefficients (MFCC) are cepstral parameters extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear frequency perception of the human ear, and its relation to frequency can be approximated by:

Mel(f) = 2595 · log10(1 + f / 700)
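The Mel mapping and its inverse are straightforward to compute. The sketch below also derives the edge frequencies of a triangular filterbank spaced uniformly on the Mel scale, which yields exactly the dense-at-low, sparse-at-high arrangement of band-pass filters described above; the 0–8000 Hz range and 26 filters are illustrative assumptions:

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(f_low, f_high, n_filters):
    """Edges of n_filters triangular filters, equally spaced on the Mel
    scale: dense at low frequencies, sparse at high frequencies in Hz."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    return mel_to_hz(mels)

edges = mel_band_edges(0.0, 8000.0, 26)
```

Because `mel_to_hz` is convex, equally spaced Mel points produce band widths in Hz that grow monotonically with frequency, mirroring the critical-band behaviour described in the text.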
English Test for the February 2024 Diagnostic Examination of Senior High School Graduates, Wuhan, Hubei Province (with answers)

Wuhan 2024 Senior High School Graduates February Diagnostic Examination, English
Developing good answering habits is one of the decisive factors in college entrance English exam success. Before working on a question, read the requirements, the stem and the options carefully, and make a reasonable prediction of the answer. While answering, do not simply follow your gut; it is best to work through the questions in order, marking those you cannot do or are unsure about; learn to spot the key point of each question; answer in the required format with neat handwriting. When finished, check your work carefully, fill in anything missed and correct any errors.
Set and reviewed by: Wuhan Academy of Educational Sciences
Part I Listening Comprehension (two sections, 30 points)
While listening, first mark your answers on the test paper. After the recording ends, you will have two minutes to transfer your answers to the answer sheet.
Section A (5 questions; 1.5 points each, 7.5 points)
Listen to the following 5 conversations. Each conversation is followed by one question; choose the best answer from the three options A, B and C, and mark it in the corresponding place on the test paper. After each conversation, you will have 10 seconds to answer the question and to read the next one. Each conversation is played only once.
1. What are the speakers probably doing?
A. Discussing at work.  B. Talking on the phone.  C. Driving on the way.
2. What will the man do next?
A. Have a dessert.  B. Pay the check.  C. Ask for a beer.
3. What do we know about the hamburger?
A. It might go bad.  B. It's good-looking.  C. It looked funny.
4. What are the speakers mainly talking about?
A. The scenery.  B. The transport.  C. The weather.
5. How does the woman sound in the end?
A. Glad.  B. Surprised.  C. Impatient.
Section B (15 questions; 1.5 points each, 22.5 points)
Listen to the following 5 conversations or monologues.
TCN-Transformer-CTC for end-to-end speech recognition

Received: 2021-08-14; revised: 2021-10-08. Funding: National Natural Science Foundation of China General Program (61672263). About the authors: Xie Xukang (1998- ), male, from Shaoyang, Hunan, master's student; research interests: speech recognition, machine learning. Chen Ge (1996- ), female, from Xinyang, Henan, master's student; research interests: speech recognition, speech enhancement. Sun Jun (1971- ), male (corresponding author), from Wuxi, Jiangsu, professor, doctoral supervisor, Ph.D.; research interests: artificial intelligence, computational intelligence, machine learning, big data analytics, bioinformatics (junsun@jiangnan.edu.cn). Chen Qidong (1992- ), male, from Huzhou, Zhejiang, Ph.D.; research interests: evolutionary computation, machine learning.

TCN-Transformer-CTC for end-to-end speech recognition
Xie Xukang, Chen Ge, Sun Jun, Chen Qidong (School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122)

Abstract: End-to-end speech recognition systems based on the Transformer have gained wide adoption, but the multi-head self-attention mechanism in the Transformer is insensitive to the positional information of the input sequence, and its flexible alignment generalizes poorly on noisy speech. To address these problems, this paper first proposes using a temporal convolutional network (TCN) to strengthen the neural network model's capture of positional information, and on this basis fuses connectionist temporal classification (CTC) to obtain the TCN-Transformer-CTC model. Without using any language model, experimental results on the open-source Mandarin speech corpus AISHELL-1 show that TCN-Transformer-CTC achieves a relative character error rate reduction of 10.91% compared with the Transformer, with the model's final character error rate reduced to 5.31%, verifying the merit of the proposed model. Keywords: end-to-end speech recognition; Transformer; temporal convolutional network; connectionist temporal classification. CLC number: TN912.34; Document code: A; Article number: 10013695(2022)03009069905; doi: 10.19734/j.issn.10013695.2021.08.0323

TCN-Transformer-CTC for end-to-end speech recognition
Xie Xukang, Chen Ge, Sun Jun, Chen Qidong (School of Artificial Intelligence & Computer Science, Jiangnan University, Wuxi Jiangsu 214122, China)
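The TCN mentioned in the abstract is built from causal, dilated 1-D convolutions, which inject position-dependent context over past frames only. A framework-free NumPy sketch of a single causal dilated convolution follows; it is illustrative only, since an actual TCN block adds residual connections, normalisation and nonlinearities:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """y[t] = sum_k w[k] * x[t - k*dilation], with left zero-padding,
    so y[t] never depends on inputs after time t (causality)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])      # pad the past only
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
# A difference filter with dilation 2: y[t] = x[t] - x[t-2]
y = causal_dilated_conv1d(x, w=np.array([1.0, -1.0]), dilation=2)
```

Stacking such layers with exponentially growing dilations (1, 2, 4, …) gives a receptive field that covers long spans of past frames while keeping every output strictly causal.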
Multi-type spoofed speech detection based on ResNet-LSTM

Digital Technology and Application, Vol. 41

1 Research status of spoofed speech detection
In recent years, automatic speaker verification (ASV) systems, a low-cost biometric technology, have been widely applied… Existing research on spoofed speech detection mainly focuses on three different spoofing types:
(1) Voice conversion (VC) and speech synthesis (SS): F. Hassan et al. proposed…
(3) Voice transformation (VT): most studies use features such as spectrograms, modified group delay (MGD) and Mel-frequency cepstral coefficients (MFCC), and then use classifiers such as support vector machines (SVM), …, to reach a decision. Experimental results show that the method achieves over 90% detection accuracy on multiple types of spoofed speech and can cope with spoofing attacks of different durations and types.

2 Multi-type spoofed speech detection system
…
o_t = σ(U_o x_t + W_o h_{t-1} + b_o)    (7)
h_t = o_t ⊗ tanh(c_t)    (8)
where the U, W and b are the weights and biases of the linear transformations, σ is the sigmoid activation function, and ⊗ denotes the Hadamard (element-wise) product.

[Fig. 4 LSTM memory unit structure diagram: a memory cell with input gate i_t, forget gate f_t and output gate o_t, each fed by the input x_t; cell state c_t and hidden output h_t]

2.3 ResNet-LSTM network
The ResNet-LSTM structure proposed in this paper is shown in Table 1. The network consists of…
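Equation (8), together with the standard gate equations, amounts to one LSTM time step. The NumPy sketch below uses the generic textbook form; the weight shapes and initialisation are illustrative, not the paper's exact parameterisation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates i, f, o and candidate g from x_t and h_{t-1};
    c_t = f ⊙ c_{t-1} + i ⊙ g;  h_t = o ⊙ tanh(c_t)  (eq. (8))."""
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(params["Wi"] @ z + params["bi"])   # input gate
    f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate
    o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate
    g = np.tanh(params["Wg"] @ z + params["bg"])   # candidate cell state
    c_t = f * c_prev + i * g                       # Hadamard products
    h_t = o * np.tanh(c_t)                         # equation (8)
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
params = {f"W{k}": 0.1 * rng.standard_normal((n_hid, n_in + n_hid)) for k in "ifog"}
params.update({f"b{k}": np.zeros(n_hid) for k in "ifog"})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), params)
```

Because h_t is a sigmoid-gated tanh, every component of the hidden output lies strictly inside (-1, 1), which keeps the recurrent signal bounded across time steps.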
Research on speaker recognition based on deep learning

Speaker recognition technology is widely used in speech recognition, speech generation, human-computer interaction and other fields. In practical scenarios such as telephony, voice-based social applications and voice assistants, the speaker's identity needs to be recognized. Traditional speaker recognition techniques mainly analyze frequency-domain, time-domain and power-spectrum features of the speech signal. However, acoustic features are subject to many sources of variability that affect accuracy, and speaker recognition of insufficient accuracy cannot meet the needs of practical applications. In recent years, the powerful processing capability of deep learning in speech signal processing has attracted wide attention. Starting from deep-learning-based speaker recognition, this article discusses the applications and advantages of deep learning in this field.

I. Traditional speaker recognition models
Traditional speaker recognition models extract and analyze features such as MFCC, PLP and MFCC_Delta. These features usually fall into three groups: basic speech features, such as fundamental frequency and formant frequencies; time-domain features, such as short-time energy and zero-crossing rate; and frequency-domain features, such as Mel-frequency cepstral coefficients and mean cepstral values. Extracting these features yields a feature vector for a speech signal, which can then be classified with traditional models such as GMM and SVM.
However, traditional speaker recognition models have several problems. The first is feature extraction: traditional methods require hand-crafted feature functions, which are prone to overfitting or underfitting. The second is adaptability to variability such as noise and speaking rate, which directly affects the speech signal. As a result, traditional methods cannot capture these details well enough for accurate speaker recognition.
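Two of the time-domain features listed above, short-time energy and zero-crossing rate, can be computed per frame in a few lines. The frame length and hop below are illustrative values for 16 kHz audio:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples per frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign changes, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

t = np.arange(1600) / 16000.0
tone = np.sin(2 * np.pi * 100 * t)   # a 100 Hz tone sampled at 16 kHz
frames = frame_signal(tone)
ste = short_time_energy(frames)
zcr = zero_crossing_rate(frames)
```

A low-frequency tone yields a low zero-crossing rate; noisy or fricative segments cross zero far more often, which is why ZCR is a cheap voicing cue in the traditional pipelines described here.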
II. Speaker recognition methods based on deep learning
In recent years, deep learning has reached impressive recognition performance in speech signal processing, for example in speech recognition and speaker recognition, and deep-learning-based speaker recognition methods have attracted great attention. Deep learning has strong advantages in feature extraction and modelling, and can address the problems of traditional methods.
1. Extracting speaker features with deep learning
When deep learning methods extract features from speech signals, there is no need for hand-designed feature functions, because the model learns the extraction automatically. In particular, with models such as deep convolutional neural networks (CNN) and recurrent neural networks (RNN), the raw frequency-domain or time-domain signal can be fed directly into the model, which learns to extract features on its own.
Audio-Visual-Oral Course: Original Texts and Translations (Complete Version)
Unit 1
As the owner of a small business selling software, I find it hard to recruit good people in today's tight labor market, and having got people on board, there is an equally difficult, if not more difficult, task of keeping them happy. Staff turnover is a real problem. Two years ago our staff turnover at Epmus plc was out of control. We were consistently losing staff across the spectrum, from clerical workers to senior managers, but our real worry was the skilled technical people who were leaving us. They comprised the bulk of our work force, so we brought in a group of consultants to help us figure out why they were leaving. It wasn't too difficult to see what had gone wrong. Getting new recruits to deal with clients without any specialist training wasn't a good idea. We were putting our staff in an unfair position, especially when they had to reach sales targets. Nor was the system of evaluating employee performance only once a year a good idea. It meant we wouldn't pick up potential problems early enough. So, having conducted our assessment, we established a formal plan to retain the people we had worked so hard to recruit and hire. We laid out specific steps for communicating with our staff.
DEALING WITH PHRASE LEVEL CO-ARTICULATION (PLC) IN SPEECH RECOGNITION: A FIRST APPROACH
Roeland J. F. Ordelman#, Arjan J. van Hessen#, David A. van Leeuwen*
# University of Twente, Enschede, The Netherlands
* TNO - Human Factors Research Institute, Soesterberg, The Netherlands
ABSTRACT

Whereas nowadays within-word co-articulation effects are usually sufficiently dealt with in automatic speech recognition, this is not always the case with phrase level co-articulation effects (PLC). This paper describes a first approach in dealing with phrase level co-articulation by applying these rules on the reference transcripts used for training our recogniser and by adding a set of temporary PLC phones that later on will be mapped on the original phones. In fact, we temporarily break down acoustic context into a general and a PLC context. With this method, more robust models could be trained because phones that are confused due to PLC effects, like for example /v/-/f/ and /z/-/s/, receive their own models. A first attempt to apply this method is described.
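The transcript-rewriting step the abstract describes, substituting temporary PLC phones during training and mapping them back onto the original phone set afterwards, can be sketched as follows. The devoicing rule shown (word-final /v/ or /z/ before a voiceless onset, a common Dutch cross-word assimilation) and the phone labels are illustrative assumptions, not the actual rule set used in the paper:

```python
# Sketch of rewriting reference transcripts with temporary PLC phones and
# mapping them back after training. Rules and phone labels are hypothetical.
VOICELESS = {"p", "t", "k", "f", "s", "x"}
PLC_PAIRS = {"v": "v_plc", "z": "z_plc"}          # temporary PLC phones
BACK_MAP = {tmp: orig for orig, tmp in PLC_PAIRS.items()}

def apply_plc_rules(words):
    """words: list of per-word phone lists. Replace a word-final /v/ or /z/
    with its temporary PLC phone when the next word starts voiceless."""
    out = [list(w) for w in words]
    for cur, nxt in zip(out, out[1:]):
        if cur and nxt and cur[-1] in PLC_PAIRS and nxt[0] in VOICELESS:
            cur[-1] = PLC_PAIRS[cur[-1]]
    return out

def map_back(words):
    """Collapse the temporary PLC phones onto the original phone set."""
    return [[BACK_MAP.get(p, p) for p in w] for w in words]

transcript = [["h", "E", "v"], ["t", "@"]]        # hypothetical phone strings
plc = apply_plc_rules(transcript)                 # /v/ before /t/ devoices
restored = map_back(plc)
```

Training on the rewritten transcripts gives the confusable pairs their own temporary models (the "PLC context"), while `map_back` restores the original phone inventory for recognition output.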
1. INTRODUCTION

The DRUID(1) project (Document Retrieval Using Intelligent Disclosure), a collaboration of CTIT(2)/University of Twente, TNO(3) and CWI(4), aims at the development of tools for the indexing of multimedia content. For the Spoken Document Retrieval (SDR) part of this project, we use ABBOT, the hybrid connectionist-hidden Markov model large vocabulary speech recognition system [1,2] developed for English by Cambridge University, Sheffield University and SoftSound. TNO already participates in the annual English TREC SDR tracks with this system [3], but since the DRUID project focuses on Dutch SDR, we are currently developing a Dutch version of ABBOT. ABBOT uses a recurrent neural net (RNN) for acoustic modelling and a Markov process for language modelling. Since the RNN is able to capture temporal acoustic context, very good recognition results can be achieved using context-independent phone models. Although language modelling often makes it possible to transform sets of erroneously recognised phones into well-recognised words, better phone recognition undoubtedly leads to better word recognition.

Our first target was training the phone models in a baseline training, which eventually performed a 33.3% Phone Error Rate (PER) on the test data. Next steps should involve improving acoustic modelling and starting language model training in order to be able to do word recognition. Following on a more detailed description of our methods to improve acoustic modelling in the next sections, this paper reflects our first attempt of improving acoustic modelling by applying phrase level co-articulation rules on the reference transcripts used for training the phone models.

Footnotes:
(1) http://www.seti.cs.utwente.nl/Parlevink/Projects/druid.html
(2) Centre for Telematics and Information Technology
(3) Institute of Applied Physics, departments Multimedia Technology (Soesterberg) and Human Factors (Delft)
(4) Centre for Mathematics and Computer Science, Amsterdam

2. ACOUSTIC MODEL

2.1. Acoustic Training Data

The baseline training material consisted of about 7 hours of speech material of 52 (26 male - 26 female) speakers reading 66 sentences from a newspaper text database, recorded in a noise free room (TNO-NRC-0 database). PLP feature vectors (12th order cepstral coefficients derived using perceptual linear prediction, and log energy) were presented at the input of the RNN, which contained 256 state units. Our phone set consisted of 44 context-independent phones plus silence. Obviously, we need far more and also different types of training data to build robust phone models for speaker independent continuous speech recognition in typical SDR tasks, but it is quite an effort to collect large annotated speech corpora for Dutch. Currently we are collecting and annotating speech material from Dutch radio shows and recordings of sessions of parliament.

2.2. Annotations

From some of the speech material we are collecting, text auto cues (text to read for the newsreader) or annotations (recording and annotation is in special cases a statutory requirement) are available that could reduce at least some of the hard labour. More importantly, these can provide additional context-specific training data for language modelling. Also, CEEFAX documents of the recorded news broadcasts are collected in