Speech Signal Processing: Chinese-English Translations


Translated Literature on Speech Signal Processing

Emotional Speech Synthesis Using Vowel Features of a Speaker
Kanu Boku, Taro Asada, Yasunari Yoshitomi, Masayoshi Tabuse

Abstract: Recently, methods for synthesizing emotional speech have received considerable attention in speech synthesis research.

We previously proposed a case-based method for generating emotional synthetic speech by exploiting the maximum amplitude and utterance duration of vowels, together with the fundamental frequency of emotional speech.

In the present study, we propose a method that further improves the reported method by controlling the fundamental frequency of the emotional synthetic speech.

As a preliminary investigation, we adopted the utterance of a semantically neutral Japanese name.

Using the method, emotional synthetic speech made from the emotional speech of one male subject was discriminated with a mean accuracy of 83.9% when 18 subjects listened to the emotional synthetic utterances of "angry", "happy", "neutral", "sad", or "surprised", the uttered word being the Japanese name "Taro" or "Hiroko".

The further adjustment of fundamental frequency in the proposed method made the emotional synthetic speech clearer.

Keywords: emotional speech; feature parameter; synthetic speech; emotional synthetic speech; vowel

© ISAROB 2013

1. Introduction

Recently, methods for synthesizing emotional speech have received considerable attention in speech synthesis research.

To generate emotional synthetic speech, it is necessary to control the prosodic features of the utterance.

Natural language consists mainly of vowels and consonants.

Japanese has five vowels.

Vowels leave a stronger impression on the listener than consonants, mainly because vowels are longer in duration and larger in amplitude than consonants.

We previously proposed a case-based method for generating emotional synthetic speech that exploits the maximum amplitude and utterance duration of vowels, both obtainable through a speech recognition system, together with the fundamental frequency of emotional speech.

In the present study, we propose a method that further improves the reported method by controlling the fundamental frequency of the emotional synthetic speech.

The advantage of our study over the reported studies is the use of vowel features in emotional speech to generate emotional synthetic speech.

2. The Proposed Method

In the first stage, we obtain the audio data of emotional speech as WAV files, recorded while the subject speaks with the intentional emotions "angry", "happy", "neutral", "sad", and "surprised".

Then, for each emotional utterance, we measure the utterance duration and the maximum waveform amplitude of each vowel, as well as the fundamental frequency of the emotional speech.

In the second stage, we synthesize the subject's utterance as a phoneme sequence.
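The paper does not give its measurement code; the following is a minimal sketch, assuming a 16-bit mono WAV file and hand-labeled vowel boundaries (`start` and `end`, in seconds, are hypothetical inputs). It measures the three features the method uses: utterance duration, maximum amplitude, and fundamental frequency (estimated here by a simple autocorrelation peak search, one common choice rather than the paper's stated tool).

```python
import numpy as np
from scipy.io import wavfile

def vowel_features(path, start, end, f0_range=(70.0, 400.0)):
    """Duration, max amplitude, and F0 of one hand-labeled vowel segment."""
    fs, x = wavfile.read(path)                  # 16-bit mono WAV assumed
    seg = x[int(start * fs):int(end * fs)].astype(float)
    duration = len(seg) / fs                    # utterance time of the vowel
    max_amp = np.max(np.abs(seg))               # maximum waveform amplitude
    # Autocorrelation-based F0 estimate (an assumed method, not the paper's).
    ac = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    lo = int(fs / f0_range[1])                  # shortest plausible period
    hi = int(fs / f0_range[0])                  # longest plausible period
    period = lo + np.argmax(ac[lo:hi])
    return duration, max_amp, fs / period       # F0 in Hz

# Hypothetical usage: the vowel /a/ in "Taro", labeled at 0.12-0.31 s.
# d, a, f0 = vowel_features("taro_angry.wav", 0.12, 0.31)
```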

Speech Signal Processing

Important events in the history of speech recognition:
1950s: the first speech recognition machines are built.
1960s: the acoustic theory of speech production; dynamic programming is applied to speech recognition.
1970s: LPC is applied to speech recognition; the DTW algorithm appears.
1980s: HMMs are applied to speech recognition.
1990s: speaker-independent, large-vocabulary continuous speech recognition matures.
Characteristics that future speech recognition technology must have:
Now assume an average speaking rate of ten phonemes per second, and ignore the correlations between adjacent phonemes; the average information rate of speech can then be estimated at 60 bits/s. In other words, at a normal speaking rate, the written text equivalent to the speech carries 60 bit/s of information. Of course, the lower bound on the "actual" information in speech is far above this rate, because the estimate above leaves many factors out of consideration, such as the speaker's identity and emotional state, the speaking rate, and the loudness of the speech.
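The 60 bit/s figure follows from the stated assumptions if one further assumes an inventory of roughly 64 equiprobable phonemes (about 6 bits each); the slide does not state the inventory size, so that value is an assumption here:

$$R \approx 10\ \tfrac{\text{phonemes}}{\text{s}} \times \log_2 64\ \tfrac{\text{bits}}{\text{phoneme}} = 60\ \text{bit/s}$$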
Speech Signal Processing --- Principles and Practice
Topics: fundamental theory, acoustic principles, speech coding, speech enhancement, speech recognition.
Chapter 1: Introduction
Content: the significance of speech signal processing, its fundamental theory and algorithms, processing hardware and practical systems, and an overview of its development history and applications.
Requirements: gain an overall understanding of speech signal processing technology.

15_Speech Signal Processing

General Approaches
Time Domain Coders and Linear Prediction

Linear Predictive Coding (LPC) is a modeling technique that has seen widespread application among time-domain speech coders, largely because it is computationally simple and applicable to the mechanisms involved in speech production. In LPC, general spectral characteristics are described by a parametric model based on estimates of autocorrelations or autocovariances. The model of choice for speech is the all-pole or autoregressive (AR) model. This model is particularly suited for voiced speech because the vocal tract can be well modeled by an all-pole transfer function. In this case, the estimated LPC model parameters correspond to an AR process which can produce waveforms very similar to the original speech segment. Differential Pulse Code Modulation (DPCM) coders (i.e., ITU-T G.721 ADPCM [CCITT, 1984]) and LPC vocoders (i.e., U.S. Federal Standard 1015 [National Communications System, 1984]) are examples of this class of time-domain predictive architecture. Code Excited Coders (i.e., ITU-T G.728 [Chen, 1990] and U.S. Federal Standard 1016 [National Communications System, 1991]) also utilize LPC spectral modeling techniques.

Based on the general spectral model, a predictive coder formulates an estimate of a future sample of speech based on a weighted combination of the immediately preceding samples. The error in this estimate (the prediction residual) typically comprises a significant portion of the data stream of the encoded speech. The residual contains information that is important in speech perception and cannot be modeled in a straightforward fashion. The most familiar form of predictive coder is the classical Differential Pulse Code Modulation (DPCM) system shown in Fig. 15.1. In DPCM, the predicted value at time instant k, ŝ(k | k − 1), is subtracted from the input signal at time k, s(k), to produce the prediction error signal e(k). The prediction error is then approximated (quantized) and the quantized prediction error, e_q(k), is coded (represented as a binary number) for transmission to the receiver. Simultaneously with the coding, e_q(k) is summed with ŝ(k | k − 1) to yield a reconstructed version of the input sample, ŝ(k). Assuming no channel errors, an identical reconstruction, distorted only by the effects of quantization, is accomplished at the receiver. At both the transmitter and receiver, the predicted value at time instant k + 1 is derived using reconstructed values up through time k, and the procedure is repeated.

The first DPCM systems had $\hat{B}(z) = 0$ and $\hat{A}(z) = \sum_{i=1}^{N} a_i z^{-i}$, where $\{a_i, i = 1 \ldots N\}$ are the LPC coefficients and $z^{-1}$ represents unit delay, so that the predicted value was a weighted linear combination of previous reconstructed values.
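The text describes DPCM only at block-diagram level; the following is a minimal first-order sketch, assuming a single fixed predictor coefficient `a1` and a uniform scalar quantizer (both hypothetical simplifications of the general N-tap predictor above). Note that the encoder predicts from the *reconstructed* samples, exactly as the text requires, so encoder and decoder stay in lockstep.

```python
import numpy as np

def dpcm_encode(x, a1=0.9, step=0.01):
    """First-order DPCM: predict from the previous reconstructed sample."""
    codes = np.empty(len(x), dtype=int)
    s_rec = 0.0                             # reconstructed s(k-1)
    for k, s in enumerate(x):
        pred = a1 * s_rec                   # s_hat(k | k-1)
        e = s - pred                        # prediction error e(k)
        codes[k] = int(round(e / step))     # uniform quantization + coding
        s_rec = pred + codes[k] * step      # s_hat(k), as the receiver sees it
    return codes

def dpcm_decode(codes, a1=0.9, step=0.01):
    """Receiver: the identical predictor loop reconstructs the signal."""
    x = np.empty(len(codes))
    s_rec = 0.0
    for k, c in enumerate(codes):
        s_rec = a1 * s_rec + c * step
        x[k] = s_rec
    return x

# Round trip on a toy signal; the error stays within the quantizer step.
t = np.linspace(0, 1, 8000)
sig = 0.5 * np.sin(2 * np.pi * 120 * t)
assert np.max(np.abs(dpcm_decode(dpcm_encode(sig)) - sig)) < 0.05
```

Because quantization happens inside the prediction loop, the reconstruction error does not accumulate: it is bounded per sample by half the quantizer step.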

Common English-Chinese Terms in Audio Mixing

Mixing is one of the important stages of audio post-production: it involves combining multiple audio signals to produce a richer, more complex result.

Common English-Chinese pairs include the following:

1. Mixing - 混音
2. Audio signal - 音频信号
3. Blend - 混合
4. Equalization (EQ) - 均衡
5. Panning - 平移
6. Reverb - 混响
7. Compression - 压缩
8. Delay - 延迟
9. Phaser - 相位器
10. Flanger - 波纹效果器
11. Chorus - 合唱
12. Fader - 混音台滑块
13. Bus - 总线
14. Master channel - 主通道
15. Monitor - 监听
16. Stereo - 立体声
17. Surround sound - 环绕声

Reference sentences:

1. Mixing is the process of combining audio signals together to create a harmonious blend. - 混音是将音频信号混合在一起以创造和谐的混合声音的过程。

2. Equalization, or EQ, is a tool used in mixing to control the frequencies of audio signals. - 均衡是混音中用于控制音频信号频率的工具。

3. Panning refers to the technique of placing sounds in different positions within the stereo field. - 平移是指在立体声领域中将声音放置在不同位置的技术。

4. Reverb is an effect that simulates the natural reverberation of sound in different environments. - 混响是一种模拟不同环境下声音自然混响的效果。
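To make the panning technique in sentence 3 above concrete, here is a minimal sketch; the constant-power (equal-power) law is an assumed common choice, not something this glossary specifies.

```python
import numpy as np

def pan(mono, position):
    """Place a mono signal in the stereo field.

    position: -1.0 = hard left, 0.0 = center, +1.0 = hard right.
    Constant-power law: left/right gains are cosine/sine of the pan
    angle, so perceived loudness stays roughly constant across the field.
    """
    theta = (position + 1.0) * np.pi / 4.0   # map [-1, 1] -> [0, pi/2]
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=1)   # (n_samples, 2) stereo

# Usage: pan a 440 Hz tone halfway to the right.
t = np.linspace(0, 1, 44100)
stereo = pan(0.2 * np.sin(2 * np.pi * 440 * t), position=0.5)
```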

201116910524 Miao Yunlong: Translation of Foreign-Language Material

Graduation Project (Thesis) Foreign-Language Material Translation
Title: Speech Communication and Speech Signal Processing
School: School of Information
Class: Telecommunications 1105
Student: Miao Yunlong
Student ID: 201116910524
Supervisor: Qiao Lihong (Associate Professor)
Start and end dates:
Location:
Attachments: 1. translated text; 2. original text.

Attachment 1: Translated Text

Speech Communication and Speech Signal Processing

Preface

Communicating with a machine in a natural mode such as speech is not only a technological challenge; it is also limited by our understanding of how people communicate with each other so effortlessly. The key point is to understand the difference between speech processing, seen as the way people communicate, and speech signal processing, seen as a mechanism.

When people hear speech, they apply their accumulated knowledge of a language to capture the information.

In this process, it is interesting to note that the input speech signal is processed selectively, using knowledge sources acquired over a long period, such as basic sound units, acoustic phonetics, prosody, the lexicon, syntax, semantics, and pragmatics. This processing differs from person to person, and it is very difficult for any individual to state precisely what principles he or she uses in processing the input speech signal.

This makes it difficult to write a program that enables a machine to perform the task of extracting the information in a speech signal.

It should be noted that a machine receives only the speech signal, in the form of a sequence of samples; identifying the other knowledge sources implicit in the input signal, and invoking them, remains a scientific challenge.

The processing of speech signals is thus one of many fascinating challenges, and it has attracted the curiosity of many different scientific communities, including linguists, phoneticians, psychologists and acousticians, electrical engineers, computer scientists, and application engineers.

The editorial committee of SADHANA has appropriately identified this topic as one deserving a special issue.

They asked me to take the initiative in gathering the views of leading research groups, in the form of articles for this special issue.

I have indeed been very fortunate to be able to persuade many highly accomplished scientists to contribute articles on their fields to this special issue.

Speech Recognition: English-Chinese Translated Literature (the document contains the English original and the Chinese translation)

Speech Recognition

1 Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in a later section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment---a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.

Table: Typical parameters used to characterize the capability of speech recognition systems.

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear; at word boundaries, contextual variations can be quite dramatic---making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian. Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

The figure shows the major components of a typical speech recognition system.
The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10--20 msec (see sections on signal representation and 11.3 on digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure: Components of a typical speech recognition system.

Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context-dependent acoustic modeling.

Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.

The dominant recognition paradigm in the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in sections on HMMs and 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.

An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.

2 State of the Art

Comments about the state of the art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit.
Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error rate E, defined as

$$E = \frac{S + I + D}{N} \times 100\%$$

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.

One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits.
For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High-perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited-vocabulary, speaker-independent continuous dictation capability is realized.

3 Future Directions

In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology, and the infrastructure needed to support the work. The key research challenges are summarized in the workshop report. Research in the following areas for speech recognition was identified:

Robustness: In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained.
Differences in channel characteristics and acoustic environment should receive particular attention.

Portability: Portability refers to the goal of rapidly designing, developing and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expensive.

Adaptation: How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.

Language Modeling: Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.

Confidence Measures: Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.

Out-of-Vocabulary Words: Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.

Spontaneous Speech: Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.

Prosody: Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.

Modeling Dynamics: Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.
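The word error rate defined in section 2 requires aligning the hypothesis against the reference transcript; a minimal sketch using Levenshtein alignment over words (a standard choice, though the text does not prescribe one):

```python
def word_error_rate(ref, hyp):
    """E = (S + I + D) / N, computed by Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: minimum edits needed to turn r[:i] into h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        d[i][0] = i                        # i deletions
    for j in range(1, len(h) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[-1][-1] / len(r)

# One substitution ("sea") and one deletion ("the"): E = 2/5 = 0.4.
print(word_error_rate("we can see the words", "we can sea words"))
```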

Signal Processing: English-Chinese Translated Literature

(The document contains the English original and the Chinese translation.)

Translation:

1. Significance and Background of Wavelet Research

In practical applications, finding the best processing method to reduce noise for signals and interference of different natures has long been an important and widely discussed problem in the field of signal processing.

At present, many methods can be used for signal denoising, such as median filtering, low-pass filtering, and the Fourier transform, but they all filter out useful parts of the signal's detail.

Traditional signal denoising methods presuppose that the signal is stationary, and give only statistically averaged results from the time domain or the frequency domain separately.

They remove noise according to the time-domain or frequency-domain characteristics of the useful signal, and cannot attend to both the local and the global behavior of the signal in the time and frequency domains at once.

A growing body of practice shows that classical filtering based on the Fourier transform cannot effectively analyze and process non-stationary signals, and its denoising performance no longer satisfies the requirements of developing engineering applications.

The commonly used hard-thresholding and soft-thresholding rules remove noise from the signal by setting high-frequency wavelet coefficients to zero.

Practice has shown that these wavelet-threshold denoising methods have near-optimal properties and perform well on non-stationary signals.
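A minimal sketch of the soft-thresholding scheme just described, using the PyWavelets library (pywt); the universal threshold and the MAD noise estimate are assumed common choices, not prescriptions from this text:

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    # Decompose the signal into approximation and detail coefficients.
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Robust noise estimate from the finest detail coefficients (MAD).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal (Donoho-Johnstone) threshold; an assumed choice here.
    thr = sigma * np.sqrt(2 * np.log(len(x)))
    # Soft-threshold the detail coefficients; keep the approximation.
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft")
                              for c in coeffs[1:]]
    # Reconstruct the signal from the thresholded coefficients.
    return pywt.waverec(denoised, wavelet)

# Usage: clean a noisy ramp-plus-sine test signal.
t = np.linspace(0, 1, 1024)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(1024)
clean = wavelet_denoise(noisy)
```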

Wavelet theory was developed on the basis of the Fourier transform and the short-time Fourier transform. It has the character of multiresolution analysis and can characterize the local features of a signal in both the time and frequency domains, making it an excellent tool for time-frequency analysis of signals.

The wavelet transform combines multiresolution, time-frequency localization, and fast computation, which has given it wide application in geophysics.

As the technology developed, wavelet packet analysis emerged and matured. Wavelet packet analysis is an extension of wavelet analysis, with very broad application value.

It provides a more refined method of signal analysis: the frequency band is divided into multiple levels, the high-frequency portions that the discrete wavelet transform leaves unsubdivided are decomposed further, and the appropriate frequency bands can be selected adaptively to match the spectrum of the analyzed signal, thereby improving the time-frequency resolution.

When wavelet packet analysis is used for signal denoising, an intuitive and effective approach is to threshold the wavelet packet decomposition coefficients directly, choose the appropriate filtering factors, and reconstruct the signal from the retained coefficients, thereby achieving the denoising goal.
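A minimal sketch of that wavelet packet variant, again with PyWavelets; thresholding every leaf node with the same (assumed) universal threshold stands in for the "filtering factor" selection the text leaves open:

```python
import numpy as np
import pywt

def wp_denoise(x, wavelet="db4", level=3):
    # Full wavelet packet decomposition down to `level`.
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    nodes = wp.get_level(level, order="natural")
    # One global threshold from all leaf coefficients (an assumed choice).
    all_coeffs = np.concatenate([n.data for n in nodes])
    thr = (np.median(np.abs(all_coeffs)) / 0.6745
           * np.sqrt(2 * np.log(len(x))))
    # Soft-threshold every leaf, then reconstruct in place.
    for n in nodes:
        n.data = pywt.threshold(n.data, thr, mode="soft")
    return wp.reconstruct(update=True)
```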

Windows 7 Speech Recognition: English-Chinese Voice Commands
delete ... 删除…
select when through Phrases 选择从 when 到 Phrases
new line / new paragraph 另起一行／另起一段
go to the end of the document 到文档末尾
go to the start of the document 到文档开头
accessories 附件
double click ... 双击…
right click ... 右击…
show numbers 显示编号
scroll down 向下滚动
scroll up 向上滚动
scroll down 10 向下滚动 10
maximize 最大化
restore down 还原
mousegrid 鼠标格子
press capital b 按下大写 B
press c as in close 按下 c（如 close 中的 c）
backspace 退格
press control home 按下 Ctrl+Home
press y 3 times 按下 y 三次
start 开始
all programs 所有程序
show speech recognition 显示语音识别
what can I say 我能说什么
period 句号
exclamation mark 感叹号
question mark 问号
correct ... 改正…（改正某个单词）
undo / undo that / delete that 撤销
scroll up 20 向上滚动 20

Appendix (English original)

15 Speech Signal Processing
15.3 Analysis and Synthesis
Jesse W. Fussell

After an acoustic speech signal is converted to an electrical signal by a microphone, it may be desirable to analyze the electrical signal to estimate some time-varying parameters which provide information about a model of the speech production mechanism. Speech analysis is the process of estimating such parameters. Similarly, given some parametric model of speech production and a sequence of parameters for that model, speech synthesis is the process of creating an electrical signal which approximates speech. While analysis and synthesis techniques may be done either on the continuous signal or on a sampled version of the signal, most modern analysis and synthesis methods are based on digital signal processing.

A typical speech production model is shown in Fig. 15.6. In this model the output of the excitation function is scaled by the gain parameter and then filtered to produce speech. All of these functions are time-varying.

FIGURE 15.6 A general speech production model.
FIGURE 15.7 Waveform of a spoken phoneme /i/ as in beet.

For many models, the parameters are varied at a periodic rate, typically 50 to 100 times per second. Most speech information is contained in the portion of the signal below about 4 kHz.

The excitation is usually modeled as either a mixture or a choice of random noise and periodic waveform. For human speech, voiced excitation occurs when the vocal folds in the larynx vibrate; unvoiced excitation occurs at constrictions in the vocal tract which create turbulent air flow [Flanagan, 1965]. The relative mix of these two types of excitation is termed "voicing." In addition, the periodic excitation is characterized by a fundamental frequency, termed pitch or F0. The excitation is scaled by a factor designed to produce the proper amplitude or level of the speech signal. The scaled excitation function is then filtered to produce the proper spectral characteristics. While the filter may be nonlinear, it is usually modeled as a linear function.

Analysis of Excitation

In a simplified form, the excitation function may be considered to be purely periodic, for voiced speech, or purely random, for unvoiced. These two states correspond to voiced phonetic classes such as vowels and nasals and unvoiced sounds such as unvoiced fricatives. This binary voicing model is an oversimplification for sounds such as voiced fricatives, which consist of a mixture of periodic and random components. Figure 15.7 is an example of a time waveform of a spoken /i/ phoneme, which is well modeled by only periodic excitation. Both time domain and frequency domain analysis techniques have been used to estimate the degree of voicing for a short segment or frame of speech.
One time domain feature, termed the zero crossing rate, is the number of times the signal changes sign in a short interval. As shown in Fig. 15.7, the zero crossing rate for voiced sounds is relatively low. Since unvoiced speech typically has a larger proportion of high-frequency energy than voiced speech, the ratio of high-frequency to low-frequency energy is a frequency domain technique that provides information on voicing.

Another measure used to estimate the degree of voicing is the autocorrelation function, which is defined for a sampled speech segment, S, as

$$\Phi(m) = \frac{1}{N} \sum_{n=0}^{N-1} s(n)\, s(n+m)$$

where s(n) is the value of the nth sample within the segment of length N. Since the autocorrelation function of a periodic function is itself periodic, voicing can be estimated from the degree of periodicity of the autocorrelation function. Figure 15.8 is a graph of the nonnegative terms of the autocorrelation function for a 64-ms frame of the waveform of Fig. 15.7. Except for the decrease in amplitude with increasing lag, which results from the rectangular window function which delimits the segment, the autocorrelation function is seen to be quite periodic for this voiced utterance.

FIGURE 15.8 Autocorrelation function of one frame of /i/.

If an analysis of the voicing of the speech signal indicates a voiced or periodic component is present, another step in the analysis process may be to estimate the frequency (or period) of the voiced component. There are a number of ways in which this may be done. One is to measure the time lapse between peaks in the time domain signal. For example, in Fig. 15.7 the major peaks are separated by about 0.0071 s, for a fundamental frequency of about 141 Hz. Note, it would be quite possible to err in the estimate of fundamental frequency by mistaking the smaller peaks that occur between the major peaks for the major peaks. These smaller peaks are produced by resonance in the vocal tract which, in this example, happen to be at about twice the excitation frequency. This type of error would result in an estimate of pitch approximately twice the correct frequency.

The distance between major peaks of the autocorrelation function is a closely related feature that is frequently used to estimate the pitch period. In Fig. 15.8, the distance between the major peaks in the autocorrelation function is about 0.0071 s. Estimates of pitch from the autocorrelation function are also susceptible to mistaking the first vocal tract resonance for the glottal excitation frequency.

The absolute magnitude difference function (AMDF), defined as

$$\mathrm{AMDF}(m) = \frac{1}{N} \sum_{n=0}^{N-1} \left| s(n) - s(n+m) \right|$$

is another function which is often used in estimating the pitch of voiced speech. An example of the AMDF is shown in Fig. 15.9 for the same 64-ms frame of the /i/ phoneme. However, the minima of the AMDF are used as an indicator of the pitch period. The AMDF has been shown to be a good pitch period indicator [Ross et al., 1974] and does not require multiplications.
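A minimal sketch of the two pitch estimators just described, picking the autocorrelation peak and the AMDF minimum over a plausible pitch-lag range (the 50-400 Hz search range is an assumption, not from the text):

```python
import numpy as np

def pitch_estimates(frame, fs, fmin=50.0, fmax=400.0):
    """Pitch period via autocorrelation peak and via AMDF minimum."""
    n = len(frame)
    lags = np.arange(int(fs / fmax), int(fs / fmin))   # candidate periods
    ac = np.array([np.sum(frame[:n - m] * frame[m:]) for m in lags])
    amdf = np.array([np.mean(np.abs(frame[:n - m] - frame[m:]))
                     for m in lags])
    t_ac = lags[np.argmax(ac)] / fs       # autocorrelation: pick the peak
    t_amdf = lags[np.argmin(amdf)] / fs   # AMDF: pick the deepest minimum
    return t_ac, t_amdf

# A 64-ms frame of a 141 Hz sawtooth-like test tone; both estimates
# should come out near 1/141 = 0.0071 s.
fs = 8000
t = np.arange(int(0.064 * fs)) / fs
frame = ((141 * t) % 1.0) - 0.5
print(pitch_estimates(frame, fs))
```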
Fourier Analysis

One of the more common processes for estimating the spectrum of a segment of speech is the Fourier transform [Oppenheim and Schafer, 1975]. The Fourier transform of a sequence is mathematically defined as

$$S(e^{j\omega}) = \sum_{n=-\infty}^{\infty} s(n)\, e^{-j\omega n}$$

where s(n) represents the terms of the sequence. The short-time Fourier transform of a sequence is a time-dependent function, defined as

$$S_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} w(m-n)\, s(n)\, e^{-j\omega n}$$

where the window function w(n) is usually zero except for some finite range, and the variable m is used to select the section of the sequence for analysis. The discrete Fourier transform (DFT) is obtained by uniformly sampling the short-time Fourier transform in the frequency dimension. Thus an N-point DFT is computed using Eq. (15.14),

$$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi n k / N}, \qquad k = 0, 1, \ldots, N-1$$

where the set of N samples, s(n), may have first been multiplied by a window function. An example of the magnitude of a 512-point DFT of the waveform of the /i/ from Fig. 15.7 is shown in Fig. 15.10. Note for this figure, the 512 points in the sequence have been multiplied by a Hamming window, defined by

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1, \qquad w(n) = 0 \ \text{otherwise}$$

FIGURE 15.9 Absolute magnitude difference function of one frame of /i/.
FIGURE 15.10 Magnitude of 512-point FFT of Hamming windowed /i/.

Since the spectral characteristics of speech may change dramatically in a few milliseconds, the length, type, and location of the window function are important considerations. If the window is too long, changing spectral characteristics may cause a blurred result; if the window is too short, spectral inaccuracies result. A Hamming window of 16 to 32 ms duration is commonly used for speech analysis.

Several characteristics of a speech utterance may be determined by examination of the DFT magnitude. In Fig. 15.10, the DFT of a voiced utterance contains a series of sharp peaks in the frequency domain. These peaks, caused by the periodic sampling action of the glottal excitation, are separated by the fundamental frequency, which is about 141 Hz in this example. In addition, broader peaks can be seen, for example at about 300 Hz and at about 2300 Hz. These broad peaks, called formants, result from resonances in the vocal tract.
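A minimal sketch of the windowed DFT analysis just described: a Hamming-windowed frame and its magnitude spectrum, with numpy's FFT standing in for the DFT of Eq. (15.14):

```python
import numpy as np

def dft_magnitude(frame):
    """Magnitude of the DFT of a Hamming-windowed frame."""
    n = len(frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))  # Hamming
    return np.abs(np.fft.fft(frame * w))

# 512-point analysis of a toy voiced-like signal (141 Hz plus a weaker
# component near a typical /i/ formant at 2300 Hz):
fs = 16000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 141 * t) + 0.3 * np.sin(2 * np.pi * 2300 * t)
mag = dft_magnitude(frame)
bin_hz = fs / len(frame)               # frequency resolution: 31.25 Hz
print(np.argmax(mag[:256]) * bin_hz)   # strongest peak, within a bin of 141 Hz
```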
Linear Predictive Analysis

Given a sampled (discrete-time) signal s(n), a powerful and general parametric model for time series analysis is

$$s(n) = -\sum_{k=1}^{p} a(k)\, s(n-k) + G \sum_{l=0}^{q} b(l)\, u(n-l)$$

where s(n) is the output and u(n) is the input (perhaps unknown). The model parameters are a(k) for k = 1, ..., p; b(l) for l = 1, ..., q; and G. b(0) is assumed to be unity. This model, described as an autoregressive moving average (ARMA) or pole-zero model, forms the foundation for the analysis method termed linear prediction. An autoregressive (AR) or all-pole model, for which all of the "b" coefficients except b(0) are zero, is frequently used for speech analysis [Markel and Gray, 1976].

In the standard AR formulation of linear prediction, the model parameters are selected to minimize the mean-squared error between the model and the speech data. In one of the variants of linear prediction, the autocorrelation method, the minimization is carried out for a windowed segment of data. In the autocorrelation method, minimizing the mean-square error of the time domain samples is equivalent to minimizing the integrated ratio of the signal spectrum to the spectrum of the all-pole model. Thus, linear predictive analysis is a good method for spectral analysis whenever the signal is produced by an all-pole system. Most speech sounds fit this model well.

One key consideration for linear predictive analysis is the order of the model, p. For speech, if the order is too small, the formant structure is not well represented. If the order is too large, pitch pulses as well as formants begin to be represented. Tenth- or twelfth-order analysis is typical for speech. Figures 15.11 and 15.12 provide examples of the spectrum produced by eighth-order and sixteenth-order linear predictive analysis of the /i/ waveform of Fig. 15.7. Figure 15.11 shows there to be three formants, at frequencies of about 300, 2300, and 3200 Hz, which are typical for an /i/.

Homomorphic (Cepstral) Analysis

For the speech model of Fig. 15.6, the excitation and filter impulse response are convolved to produce the speech. One of the problems of speech analysis is to separate, or deconvolve, the speech into these two components. One such technique is called homomorphic filtering [Oppenheim and Schafer, 1968]. The characteristic system for homomorphic deconvolution converts a convolution operation to an addition operation. The output of such a characteristic system is called the complex cepstrum. The complex cepstrum is defined as the inverse Fourier transform of the complex logarithm of the Fourier transform of the input. If the input sequence is minimum phase (i.e., the z-transform of the input sequence has no poles or zeros outside the unit circle), the sequence can be represented by the real portion of the transforms. Thus, the real cepstrum can be computed by calculating the inverse Fourier transform of the log-spectrum of the input.

FIGURE 15.11 Eighth-order linear predictive analysis of an /i/.
FIGURE 15.12 Sixteenth-order linear predictive analysis of an /i/.

Figure 15.13 shows an example of the cepstrum for the voiced /i/ utterance from Fig. 15.7. The cepstrum of such a voiced utterance is characterized by relatively large values in the first one or two milliseconds as well as ...
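The source text breaks off at this point. For completeness, a minimal sketch of the real cepstrum computation it defines, used here for pitch estimation; the 2-ms search floor (to skip the vocal tract contribution in the low quefrencies) is an assumed heuristic:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude of the FFT."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)   # log-spectrum
    return np.fft.ifft(log_mag).real

# For voiced speech, the glottal excitation shows up as a cepstral peak
# at the pitch period; searching above ~2 ms skips the vocal tract part.
fs = 8000
t = np.arange(512) / fs
frame = np.hamming(512) * (((141 * t) % 1.0) - 0.5)   # 141 Hz toy excitation
c = real_cepstrum(frame)
q0 = int(0.002 * fs)                                  # 2-ms lower bound
pitch_period = (q0 + np.argmax(c[q0:len(c) // 2])) / fs
print(pitch_period)                                   # about 0.0071 s
```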
