Speech enhancement based on temporal processing
ENHANCEMENT OF NOISY SPEECH BASED ON STATISTICAL SPEECH AND NOISE MODELS

Patent title: ENHANCEMENT OF NOISY SPEECH BASED ON STATISTICAL SPEECH AND NOISE MODELS
Inventor: Jensen, Jesper
Application number: EP16175779.4
Filing date: 2016-06-22
Publication number: EP3118851A1
Publication date: 2017-01-18

Abstract: The disclosure relates to a method and a system for enhancement of noisy speech. The system according to the disclosure comprises an analysis filterbank (24) configured to subdivide the spectrum of the input signal into a plurality of frequency sub-bands and to provide time-frequency coefficients for a sequence of observable noisy signal samples in each of said frequency sub-bands. The system further comprises an enhancement processing unit configured to receive the time-frequency coefficients and to provide enhanced time-frequency coefficients, and a synthesis filterbank configured to receive the enhanced time-frequency coefficients and to provide an enhanced time-domain signal. The system further comprises a covariance estimator configured to estimate and provide the covariance matrix of the time-frequency coefficients for successive samples in each respective sub-band, and an optimal-linear-combination optimiser configured to receive this covariance matrix, the covariance matrix of the noise-free signal derived from a statistical speech model, and the covariance matrix of the noise signal derived from a statistical noise model, and to provide a linear combination of the noise-free-signal and noise-signal covariance matrices that explains the noisy observation as closely as required. The system further comprises storage for the statistical model(s) of speech and for the statistical model(s) of noise.
The optimal-linear-combination optimiser provides this linear combination to the enhancement processing unit, thereby enabling the enhancement processing unit to determine the enhanced time-frequency coefficients from the time-frequency coefficients of each frequency sub-band.
Applicant: Oticon A/S. Address: Kongebakken 9, 2765 Smørum, DK. Nationality: DK. Agent: William Demant.
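For illustration only, the optimiser's "linear combination that explains the noisy observation" can be read as a weighted covariance fit. In the sketch below, the function name, the Frobenius-norm least-squares criterion, and the nonnegativity clipping are all assumptions made for this example, not the patent's actual claims:

```python
import numpy as np

def fit_covariance_combination(C_s, C_n, C_y):
    """Find nonnegative weights (a, b) such that a*C_s + b*C_n
    approximates the observed noisy covariance C_y in the
    least-squares (Frobenius-norm) sense.

    This is an illustrative reading of the 'optimal linear
    combination' idea only, not the patent's actual method."""
    # Stack the vectorised model covariances as regressors.
    A = np.column_stack([C_s.ravel(), C_n.ravel()])
    w, *_ = np.linalg.lstsq(A, C_y.ravel(), rcond=None)
    a, b = np.maximum(w, 0.0)   # clip to keep both weights nonnegative
    return a, b

# Toy example: speech modelled as correlated, noise as white.
C_s = np.array([[1.0, 0.8], [0.8, 1.0]])
C_n = np.eye(2)
C_y = 2.0 * C_s + 0.5 * C_n      # synthetic "observed" covariance
a, b = fit_covariance_combination(C_s, C_n, C_y)
```

With the synthetic observation above, the fit recovers the generating weights exactly, since the observed covariance lies in the span of the two model covariances.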
Model Based Speech Enhancement and Coding

Thesis for the degree of Doctor of Philosophy
Model Based Speech Enhancement and Coding
David Yuheng Zhao (赵羽珩)
Sound and Image Processing Laboratory, School of Electrical Engineering, KTH (Royal Institute of Technology), Stockholm, 2007

Zhao, David Yuheng: Model Based Speech Enhancement and Coding. Copyright © 2007 David Y. Zhao except where otherwise stated. All rights reserved. ISBN 978-91-628-7157-4, TRITA-EE 2007:018, ISSN 1653-5146. Sound and Image Processing Laboratory, School of Electrical Engineering, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden. Telephone +46 (0)8-790 7790.

Abstract

In mobile speech communication, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the received speech, and increase the listening effort. This thesis focuses on countermeasures based on statistical signal processing techniques. The main body of the thesis consists of three research articles, targeting two specific problems: speech enhancement for noise reduction and flexible source coder design for unreliable networks.

Papers A and B consider speech enhancement for noise reduction. New schemes based on an extension to the auto-regressive (AR) hidden Markov model (HMM) for speech and noise are proposed. Stochastic models for speech and noise gains (the excitation variance of an AR model) are integrated into the HMM framework in order to improve the modeling of energy variation. The extended model is referred to as a stochastic-gain hidden Markov model (SG-HMM). The speech gain describes the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain improves the tracking of the time-varying energy of non-stationary noise, e.g., due to movement of the noise source. In Paper A, it is assumed that prior knowledge of the noise environment is available, so that a pre-trained noise model can be used. In Paper B, the noise model is adaptive and the
model parameters are estimated on-line from the noisy observations using a recursive estimation algorithm. Based on the speech and noise models, a novel Bayesian estimator of the clean speech is developed in Paper A, and an estimator of the noise power spectral density (PSD) in Paper B. It is demonstrated that the proposed schemes achieve more accurate models of speech and noise than traditional techniques, and, as part of a speech enhancement system, provide improved speech quality, particularly for non-stationary noise sources.

In Paper C, a flexible entropy-constrained vector quantization scheme based on a Gaussian mixture model (GMM), lattice quantization, and arithmetic coding is proposed. The method allows the average rate to be changed in real time, and facilitates adaptation to the currently available bandwidth of the network. A practical solution to the classical issue of indexing and entropy-coding the quantized code vectors is given. The proposed scheme has a computational complexity that is independent of rate, and quadratic with respect to vector dimension. Hence, the scheme can be applied to the quantization of source vectors in a high-dimensional space. The theoretical performance of the scheme is analyzed under a high-rate assumption. It is shown that, at high rate, the scheme approaches the theoretically optimal performance if the mixture components are located far apart. The practical performance of the scheme is confirmed through simulations on both synthetic and speech-derived source vectors.

Keywords: statistical model, Gaussian mixture model (GMM), hidden Markov model (HMM), noise reduction, speech enhancement, vector quantization.

List of Papers

The thesis is based on the following papers:

[A] D. Y. Zhao and W. B. Kleijn, "HMM-based gain-modeling for enhancement of speech in noise," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 3, pp. 882–892, Mar. 2007.
[B] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, "On-line noise estimation using stochastic-gain HMM for speech enhancement," IEEE Trans. Audio, Speech and Language Processing, submitted, Apr. 2007.
[C] D. Y. Zhao, J. Samuelsson, and M. Nilsson, "Entropy-constrained vector quantization using Gaussian mixture models," IEEE Trans. Communications, submitted, Apr. 2007.

In addition to Papers A–C, the following papers have also been produced in part by the author of the thesis:

[1] D. Y. Zhao, J. Samuelsson, and M. Nilsson, "GMM-based entropy-constrained vector quantization," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Apr. 2007, pp. 1097–1100.
[2] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn, "Low complexity, non-intrusive speech quality assessment," IEEE Trans. Audio, Speech and Language Processing, vol. 14, pp. 1948–1956, Nov. 2006.
[3] ——, "Non-intrusive speech quality assessment with low computational complexity," in Interspeech-ICSLP, Sep. 2006.
[4] D. Y. Zhao and W. B. Kleijn, "HMM-based speech enhancement using explicit gain modeling," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, May 2006, pp. 161–164.
[5] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, "Method and apparatus for improved estimation of non-stationary noise for speech enhancement," 2005, provisional patent application filed by GN ReSound.
[6] D. Y. Zhao and W. B. Kleijn, "On noise gain estimation for HMM-based speech enhancement," in Proc. Interspeech, Sep. 2005, pp. 2113–2116.
[7] ——, "Multiple-description vector quantization using translated lattices with local optimization," in Proc. IEEE Global Telecommunications Conference, vol. 1, 2004, pp. 41–45.
[8] J. Plasberg, D. Y. Zhao, and W. B. Kleijn, "The sensitivity matrix for a spectro-temporal auditory model," in Proc. 12th European Signal Processing Conf. (EUSIPCO), 2004, pp. 1673–1676.

Acknowledgements

Approaching the end of my Ph.D. study, I would like to thank my supervisor, Prof. Bastiaan Kleijn, for introducing me to the world of academic research.
Your creativity, dedication, professionalism and hard-working nature have greatly inspired and influenced me. I also appreciate your patience with me, and that you spent many hours of your valuable time correcting my English mistakes.

I am grateful to all my current and past colleagues at the Sound and Image Processing lab. Unfortunately, the list of names is too long to be given here. Thank you for creating a pleasant and stimulating work environment. I particularly enjoyed the cultural diversity and the feeling of an international "family" atmosphere. I am thankful to my co-authors Dr. Jonas Samuelsson, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Jan Plasberg, Dr. Jonas Lindblom, Dr. Alexander Ypma, and Dr. Bert de Vries. Thank you for our fruitful collaborations. Furthermore, I would like to express my gratitude to Prof. Bastiaan Kleijn, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Tiago Falk and Anders Ekman for proofreading the introduction of the thesis.

During my Ph.D. study, I was involved in projects financially supported by GN ReSound and the Foundation for Strategic Research. I would like to thank Dr. Alexander Ypma, Dr. Bert de Vries, Prof. Mikael Skoglund, and Prof. Gunnar Karlsson for your encouragement and friendly discussions during the projects. Thanks to my Chinese fellow students and friends at KTH, particularly Xi, Jing, Lei, and Jinfeng. I feel very lucky to have gotten to know you. The fun we have had together is unforgettable.

I would also like to express my deep gratitude to my family: my wife Lili, for your encouragement, understanding and patience; my beloved kids, Andrew and the one yet to be named, for the joy and hope you bring. I thank my parents, for your sacrifice and your endless support; my brother, for all the adventures and joy we have together; and my mother-in-law, for taking care of […]. Last but not least, I am grateful to our host family, Harley and Eva, for your kind hearts and unconditional friendship.

David Zhao, Stockholm, May 2007

Contents

Abstract
List of Papers
Acknowledgements
Contents
Acronyms
Part I: Introduction

1 Statistical models
1.1 Short-term models
1.2 Long-term models
1.3 Model-parameter estimation
2 Speech enhancement for noise reduction
2.1 Overview
2.2 Short-term model based approach
2.3 Long-term model based approach
3 Flexible source coding
3.1 Source coding
3.2 High-rate theory
3.3 High-rate quantizer design
4 Summary of contributions
References

Part II: Included papers

A. HMM-based Gain-Modeling for Enhancement of Speech in Noise
1 Introduction
2 Signal model
2.1 Speech model
2.2 Noise model
2.3 Noisy signal model
3 Speech estimation
4 Off-line parameter estimation
5 On-line parameter estimation
6 Experiments and results
6.1 System implementation
6.2 Experimental setup
6.3 Evaluation of the modeling accuracy
6.4 Objective evaluation of MMSE waveform estimators
6.5 Objective evaluation of sample spectrum estimators
6.6 Perceptual quality evaluation
7 Conclusions
Appendix I. EM based solution to (12)
Appendix II. Approximation of f_{s̄}(ḡ_n | x_n)
References

B. On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement
1 Introduction
2 Noise model parameter estimation
2.1 Signal model
2.2 On-line parameter estimation
2.3 Safety-net state strategy
2.4 Discussion
3 Noise power spectrum estimation
4 Experiments and results
4.1 System implementation
4.2 Experimental setup
4.3 Evaluation of noise model accuracy
4.4 Evaluation of noise PSD estimate
4.5 Evaluation of speech enhancement performance
4.6 Evaluation of safety-net state strategy
5 Conclusions
Appendix I. Derivation of (13)–(14)
References

C. Entropy-Constrained Vector Quantization Using Gaussian Mixture Models
1 Introduction
2 Preliminaries
2.1 Problem formulation
2.2 High-rate ECVQ
2.3 Gaussian mixture modeling
3 Quantizer design
4 Theoretical analysis
4.1 Distortion analysis
4.2 Rate analysis
4.3 Performance bounds
4.4 Rate adjustment
5 Implementation
5.1 Arithmetic coding
5.2 Encoding for the Z-lattice
5.3 Decoding for the Z-lattice
5.4 Encoding for arbitrary lattices
5.5 Decoding for arbitrary lattices
5.6 Lattice truncation
6 The algorithm
7 Complexity
8 Experiments and results
8.1 Artificial source
8.2 Speech derived source
9 Summary
References

Acronyms

3GPP – The Third Generation Partnership Project
AMR – Adaptive Multi-Rate
AR – Auto-Regressive
AR-HMM – Auto-Regressive Hidden Markov Model
CCR – Comparison Category Rating
CDF – Cumulative Distribution Function
DFT – Discrete Fourier Transform
ECVQ – Entropy-Constrained Vector Quantizer
EM – Expectation Maximization
EVRC – Enhanced Variable Rate Codec
FFT – Fast Fourier Transform
GMM – Gaussian Mixture Model
HMM – Hidden Markov Model
IEEE – Institute of Electrical and Electronics Engineers
I.I.D. – Independent and Identically Distributed
IP – Internet Protocol
KLT – Karhunen-Loève Transform
LL – Log-Likelihood
LP – Linear Prediction
LPC – Linear Prediction Coefficient
LSD – Log-Spectral Distortion
LSF – Line Spectral Frequency
MAP – Maximum A-Posteriori
ML – Maximum Likelihood
MMSE – Minimum Mean Square Error
MOS – Mean Opinion Score
MSE – Mean Square Error
MVU – Minimum Variance Unbiased
PCM – Pulse-Code Modulation
PDF – Probability Density Function
PESQ – Perceptual Evaluation of Speech Quality
PMF – Probability Mass Function
PSD – Power Spectral Density
RCVQ – Resolution-Constrained Vector Quantizer
REM – Recursive Expectation Maximization
SG-HMM – Stochastic-Gain Hidden Markov Model
SMV – Selectable Mode Vocoder
SNR – Signal-to-Noise Ratio
SSNR – Segmental Signal-to-Noise Ratio
SS – Spectral Subtraction
STSA – Short-Time Spectral Amplitude
VAD – Voice Activity Detector
VoIP – Voice over IP
VQ – Vector Quantizer
WF – Wiener Filtering
WGN – White Gaussian Noise

Part I: Introduction

Introduction

The advance of modern information technology is revolutionizing the way we communicate. Global deployment of mobile communication systems has enabled speech communication from and to nearly every corner of the world.
Speech communication systems based on the Internet protocol (IP), so-called voice over IP (VoIP), are rapidly emerging. The new technologies allow for communication over the wide range of transport media and protocols that is associated with the increasingly heterogeneous communication network environment. In such a complex communication infrastructure, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the perceived speech, and increase the required listening effort for speech communication. The demand for better quality in such situations has given rise to new problems and challenges. In this thesis, two specific problems are considered: enhancement of speech in noisy environments and source coder design for unreliable network conditions.

Figure 1: A one-direction example of speech communication in a noisy environment.

Figure 1 illustrates an adverse speech communication scenario that is typical of mobile communications. The speech signal is recorded in an environment with a high level of ambient noise, such as on a street, in a restaurant, or on a train. In such a scenario, the recorded speech signal is contaminated by additive noise and has degraded quality and intelligibility compared to the clean speech. Speech quality is a subjective measure of how pleasant the signal sounds to listeners. It can be related to the amount of effort required by listeners to understand the speech. Speech intelligibility can be measured objectively based on the percentage of sounds that are correctly understood by listeners. Naturally, both speech quality and intelligibility are important factors in the subjective rating of a speech signal by a listener. The level of background noise may be suppressed by a speech enhancement system. However, suppression of the background noise may lead to reduced speech intelligibility, due to possible speech distortion associated with
excessive noise suppression. Therefore, most practical speech enhancement systems for noise reduction are designed to improve the speech quality while preserving the speech intelligibility [6, 60, 61]. Besides speech communication, speech enhancement for noise reduction may be applied to a wide range of other speech processing applications in adverse situations, including speech recognition, hearing aids and speaker identification.

Another limiting factor for the quality of speech communication is the network condition. It is a particularly relevant issue for wireless communication, as the network capacity may vary depending on interference from the physical environment. The same issue occurs in heterogeneous networks with a large variety of different communication links. Traditional source coder design is often optimized for a particular network scenario, and may not perform well in another network. To allow communication with the best possible quality in the new network, it is necessary to adapt the coding strategy, e.g., the coding rate, depending on the network condition.

This thesis concerns statistical signal processing techniques that facilitate better solutions to the aforementioned problems. Formulated in a statistical framework, speech enhancement is an estimation problem. Source coding can be seen as a constrained optimization problem that optimizes the coder under certain rate constraints. Statistical modeling of speech (and noise) is a key element for solving both problems.

The techniques proposed in this thesis are based on a hidden Markov model (HMM) or a Gaussian mixture model (GMM). Both are generic models capable of modeling complex data sources such as speech. In Papers A and B, we propose an extension to an HMM that allows for more accurate modeling of speech and noise. We show that the improved models also lead to improved speech enhancement performance. In Paper C, we propose a flexible source coder design for entropy-constrained vector quantization. The proposed coder
design is based on Gaussian mixture modeling of the source.

The introduction is organized as follows. In Section 1, an overview of statistical modeling for speech is given. We discuss short-term models for the local statistics of a speech frame, and long-term models (such as an HMM or a GMM) that describe the long-term statistics spanning multiple frames. Parameter estimation using the expectation-maximization (EM) algorithm is discussed. In Section 2, we introduce speech enhancement techniques for noise reduction, with a focus on statistical model-based methods. Different approaches using short-term and long-term models are reviewed and discussed. Finally, in Section 3, an overview of theory and practice for flexible source coder design is given. We focus on high-rate theory and coders derived from it.

1 Statistical models

Figure 2: Histogram of time-domain speech waveform samples. Generated using 100 utterances from the TIMIT database, down-sampled to 8 kHz, normalized between -1 and 1, with 2048 bins.

Statistical modeling of speech is a relevant topic for nearly all speech processing applications, including coding, enhancement, and recognition. A statistical model can be seen as a parameterized set of probability density functions (PDFs). Once the parameters are determined, the model provides a PDF of the speech signal that describes the stochastic behavior of the signal. A PDF is useful for designing optimal quantizers [64, 91, 128], estimators [35, 37–39, 90], and recognizers [11, 105, 108] in a statistical framework.

One can make statistical models of speech in different representations.
The simplest one applies directly to the digital speech samples in the waveform domain (analog speech is outside the scope of this thesis). For instance, a histogram of speech waveform samples, as shown in Figure 2, provides the long-term averaged distribution of the amplitudes of speech samples. A similar figure is shown in [135]. The distribution is useful for designing an entropy code for a pulse-code modulation (PCM) system. However, in such a model, statistical dependencies among the speech samples are not explicitly modeled.

It is more common to model the speech signal as a stochastic process that is locally stationary within 20-30 ms frames. The speech signal is then processed on a frame-by-frame basis. Statistical models in speech processing can be roughly divided into two classes, short-term and long-term signal models, depending on their scope of operation. A short-term model describes the statistics of the vector for a particular frame. As the statistics change over the frames, the parameters of a short-term model must be determined for each frame. On the other hand, a model that describes the statistics over multiple signal frames is referred to as a long-term model.

1.1 Short-term models

It is essential to model the sample dependencies within a speech frame of 20-30 ms. This short-term dependency determines the formants of the speech frame, which are essential for recognizing the associated phoneme. Modeling of this dependency is therefore of interest to many speech applications.
Auto-regressive model

An efficient tool for modeling the short-term dependencies in the time domain is the auto-regressive (AR) model. Let x[i] denote the discrete-time speech signal; the AR model of x[i] is defined as

x[i] = -\sum_{j=1}^{p} \alpha[j] x[i-j] + e[i],    (1)

where the \alpha are the AR model parameters, also called the linear prediction coefficients, p is the model order, and e[i] is the excitation signal, modeled as white Gaussian noise (WGN). An AR process is thus modeled as the output of filtering WGN through an all-pole AR model filter.

For a signal frame of K samples, the vector x = [x[1], ..., x[K]] can be seen as a linear transformation of a WGN vector e = [e[1], ..., e[K]], by considering the filtering process as a convolution formulated as a matrix transform. Consequently, the PDF of x can be modeled as a multi-variate Gaussian PDF with zero mean and a covariance matrix parameterized by the AR model parameters. The AR Gaussian density function is defined as

f(x) = \frac{1}{(2\pi\sigma^2)^{K/2}} \exp\left(-\frac{1}{2} x^T D^{-1} x\right),    (2)

with the covariance matrix D = \sigma^2 (A^H A)^{-1}, where A is a K \times K lower triangular Toeplitz matrix whose first column has as its first p+1 elements the AR coefficients [\alpha[0], \alpha[1], \alpha[2], ..., \alpha[p]]^T, with \alpha[0] = 1 and the remaining elements equal to zero; ^H denotes the Hermitian transpose, and \sigma^2 denotes the excitation variance. Due to the structure of A, the covariance matrix D has determinant \sigma^{2K}.

Figure 3: Schematic diagram of a source-filter speech production model (labels: pitch period, gain, speech signal).

Applying the AR model to speech processing is often motivated and supported by the physical properties of speech production [43]. A commonly used model of speech production is the source-filter model shown in Figure 3. The excitation signal, modeled using a pulse train for a voiced sound or Gaussian noise for an unvoiced sound, is filtered through a time-varying all-pole filter that models the vocal tract. Due to the absence of a pitch model, the AR Gaussian model is more accurate for unvoiced sounds.

Frequency-domain models

The spectral properties of a speech signal can be analyzed in the frequency domain through the short-time discrete Fourier transform (DFT). On a frame-by-frame basis, each speech segment is transformed into the frequency domain by applying a sliding window to the samples and a DFT to the windowed signal.

A commonly used model is the complex Gaussian model, motivated by the asymptotic properties of the DFT coefficients. Assuming that the DFT length approaches infinity and that the span of correlation of the signal frame is short compared to the DFT length, the DFT coefficients may be considered a weighted sum of weakly dependent random samples [51, 94, 99, 100]. Using the central limit theorem for weakly dependent variables [12, 130], the DFT coefficients across frequency can be considered as independent zero-mean Gaussian random variables. For k \notin \{1, K/2+1\}, the PDF of the DFT coefficient of the k'th frequency bin, X[k], is given by

f(X[k]) = \frac{1}{\pi \lambda^2[k]} \exp\left(-\frac{|X[k]|^2}{\lambda^2[k]}\right),    (3)

where \lambda^2[k] = E[|X[k]|^2] is the power spectrum. For k \in \{1, K/2+1\}, X[k] is real and has a Gaussian distribution with zero mean and variance E[|X[k]|^2].
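Returning to the time-domain AR model of (1) and (2): its parameters can be estimated from a frame by solving the Yule-Walker equations with the Levinson-Durbin recursion. The following is a minimal numpy sketch, using a synthetic AR(2) signal in place of a real 20-30 ms speech frame; the thesis itself does not prescribe this particular routine:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for the AR coefficients of Eq. (1)
    from the autocorrelation sequence r[0..p].  Returns (alpha, sigma2),
    with alpha[0] = 1 and sigma2 the excitation variance."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient from the order-(m-1) solution.
        k = -(r[m] + np.dot(a[1:m], r[m-1:0:-1])) / err
        a_prev = a.copy()
        a[1:m] = a_prev[1:m] + k * a_prev[m-1:0:-1]
        a[m] = k
        err *= (1.0 - k * k)
    return a, err

# Synthetic AR(2) signal standing in for a speech frame:
# x[i] = 0.9 x[i-1] - 0.4 x[i-2] + e[i], i.e. alpha = [1, -0.9, 0.4].
rng = np.random.default_rng(0)
alpha_true = np.array([1.0, -0.9, 0.4])
N = 20000
e = rng.standard_normal(N)
x = np.zeros(N)
for i in range(2, N):
    x[i] = -alpha_true[1] * x[i-1] - alpha_true[2] * x[i-2] + e[i]

# Biased autocorrelation estimates r[0..p], then the recursion.
p = 2
r = np.array([np.dot(x[:N-l], x[l:]) / N for l in range(p + 1)])
alpha_hat, sigma2_hat = levinson_durbin(r, p)
```

With this much data, the estimated coefficients and excitation variance land close to the generating values; for a real 160-sample frame the sampling error is of course much larger.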
Due to the conjugate symmetry of the DFT for real signals, only half of the DFT coefficients need to be considered. One consequence of the model is that each frequency bin may be analyzed and processed independently. Although, for typical speech processing DFT lengths, the independence assumption does not always hold, it is widely used, since it significantly simplifies the algorithm design and reduces the associated computational complexity. For instance, the complex Gaussian model has been successfully applied in speech enhancement [38, 39, 92].

The frequency-domain model of an AR process can be derived under the asymptotic assumption. Assuming that the frame length approaches infinity, A is well approximated by a circulant matrix (neglecting the frame boundary), and is approximately diagonalized by the discrete Fourier transform. Therefore, the frequency-domain AR model is of the form of (3), and the power spectrum \lambda^2[k] is given by

\lambda^2_{AR}[k] = \frac{\sigma^2}{\left| \sum_{n=0}^{p} \alpha[n] e^{-j 2\pi n k / K} \right|^2}.    (4)

It has been argued [89, 103, 118] that supergaussian density functions, such as the Laplace density and the two-sided Gamma density, fit speech DFT coefficients better. However, the experimental studies were based on the long-term averaged behavior of speech, and may not necessarily hold for short-time DFT coefficients. Speech enhancement methods based on supergaussian density functions have been proposed in [19, 85, 86, 89, 90].
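The AR power spectrum (4) can be evaluated on the K-point DFT grid by zero-padding the coefficient vector and taking an FFT. A small sketch with assumed toy coefficients:

```python
import numpy as np

def ar_power_spectrum(alpha, sigma2, K):
    """Evaluate the AR power spectrum of Eq. (4):
    lambda2[k] = sigma2 / |sum_n alpha[n] e^{-j 2 pi n k / K}|^2,
    for k = 0 .. K-1, via a zero-padded FFT of the coefficients."""
    A = np.fft.fft(alpha, n=K)   # denominator polynomial on the DFT grid
    return sigma2 / np.abs(A) ** 2

# Toy AR(2) model; alpha[0] = 1 by convention.
alpha = np.array([1.0, -0.9, 0.4])
psd = ar_power_spectrum(alpha, sigma2=1.0, K=512)
```

At k = 0 the denominator is |1 - 0.9 + 0.4|^2 = 0.25, and the spectrum is conjugate-symmetric, as expected for a real AR process.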
The comprehensive evaluations in [90] suggest that supergaussian methods achieve consistent but small improvements in segmental SNR relative to the Wiener filtering approach (which is based on the Gaussian model). However, in terms of perceptual quality, the Ephraim-Malah short-time spectral amplitude estimators [38, 39] based on the Gaussian model were found to provide more pleasant residual noise in the processed speech [90].

1.2 Long-term models

By a long-term model, we mean a model that describes the long-term statistics spanning multiple signal frames of 20-30 ms. The short-term models in Section 1.1 apply to one signal frame only, and the model parameters obtained for one frame may not generalize to another frame. A long-term model, on the other hand, is generic and flexible enough to allow statistical variations across signal frames. To model different short-term statistics in different frames, a long-term model comprises multiple states, each containing a sub-model describing the short-term statistics of a frame. Applied to speech modeling, each sub-model represents a particular class of speech frames that are considered statistically similar. Some of the most well-known long-term models for speech processing are the hidden Markov model (HMM) and the Gaussian mixture model (GMM).

Hidden Markov model

The hidden Markov model (HMM) was originally proposed by Baum and Petrie [13] as probabilistic functions of finite-state Markov chains. Applications of HMMs in speech processing can be found in automatic speech recognition [11, 65, 105] and speech enhancement [35, 36, 111, 147]. An HMM models the statistics of an observation sequence x_1, ..., x_N.
The model consists of a finite number of states that are not observable, and the transition from one state to another is controlled by a first-order Markov chain. The observation variables are statistically dependent on the underlying states, and may be considered as the states observed through a noisy channel. Due to the Markov assumption on the hidden states, the observations are statistically interdependent. For a realization of the state sequence, let s = [s_1, ..., s_N], where s_n denotes the state of frame n; let a_{s_{n-1} s_n} denote the transition probability from state s_{n-1} to state s_n, with a_{s_0 s_1} being the initial state probability; and let f_{s_n}(x_n) denote the output probability of x_n for a given state s_n. The PDF of the observation sequence x_1, ..., x_N of an HMM is then given by

f(x_1, ..., x_N) = \sum_{s \in S} \prod_{n=1}^{N} a_{s_{n-1} s_n} f_{s_n}(x_n),    (5)

where the summation is over the set S of all possible state sequences. The joint probability of a state sequence s and the observation sequence x_1, ..., x_N consists of the product of the transition probabilities and the output probabilities. Finally, the PDF of the observation sequence is the sum of the joint probabilities over all possible state sequences.

Applied to speech modeling, each sub-model of a state represents the short-term statistics of a frame. Using the frequency-domain Gaussian model (3), each state consists of a power spectral density representing a particular class of speech sounds (with similar spectral properties). The Markov chain of states models the temporal evolution of short-term speech power spectra.
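Evaluating (5) by enumerating state sequences is exponential in N; the standard forward recursion computes the same sum in O(N·S²) time for S states. Below is a small numpy sketch with an assumed two-state, discrete-output HMM (the thesis uses continuous AR/Gaussian output densities; discrete outputs merely keep the example short), checked against direct enumeration of (5):

```python
import itertools
import numpy as np

def hmm_likelihood(a0, A, B, obs):
    """Forward algorithm for Eq. (5): returns f(x_1, ..., x_N)
    without enumerating all state sequences.

    a0  : initial state probabilities, shape (S,)
    A   : transition matrix, A[i, j] = a_{ij}, shape (S, S)
    B   : output PMFs, B[i, k] = f_i(symbol k), shape (S, K)
    obs : observed symbol indices, length N"""
    alpha = a0 * B[:, obs[0]]          # alpha_1(i) = a0_i f_i(x_1)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # sum_i alpha_{n-1}(i) a_ij f_j(x_n)
    return alpha.sum()

def brute_force_likelihood(a0, A, B, obs):
    """Direct evaluation of Eq. (5) over all state sequences."""
    S, N = A.shape[0], len(obs)
    total = 0.0
    for seq in itertools.product(range(S), repeat=N):
        p = a0[seq[0]] * B[seq[0], obs[0]]
        for n in range(1, N):
            p *= A[seq[n-1], seq[n]] * B[seq[n], obs[n]]
        total += p
    return total

a0 = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.2, 0.8]])
B  = np.array([[0.9, 0.1], [0.3, 0.7]])
obs = [0, 1, 1, 0]
```

Both evaluations agree to machine precision, which is a convenient sanity check when moving from the definition (5) to the recursive implementation.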
A CASA-Based Single-Channel Speech Enhancement Method

A CASA-Based Single-Channel Speech Enhancement Method
Yu Shijing; Li Dongmei; Liu Runsheng

Abstract: Building on the ideas of computational auditory scene analysis (CASA), a single-channel speech enhancement method is proposed. By analyzing the spectral characteristics of three typical noise types (white noise, wind noise and periodic noise) and of general speech signals, suitable signal features are constructed as cues for identifying the dominant component of each time-frequency unit; each unit is then multiplied by a corresponding attenuation coefficient to mask the noise components. Objective tests and informal listening tests on simulation results show that, compared with the commonly used multi-band spectral subtraction and Wiener filtering methods, the proposed algorithm suppresses background noise such as white noise, wind noise and periodic noise more effectively.

Journal: Audio Engineering (电声技术), 2014, vol. 38, no. 2, pp. 50-54 (5 pages)
Keywords: speech enhancement; computational auditory scene analysis; cues; masking
Authors: Yu Shijing; Li Dongmei; Liu Runsheng (Institute of Circuits and Systems, Department of Electronic Engineering, Tsinghua University, Beijing 100084)
Language: Chinese. CLC number: TN912.35

1 Introduction

Single-channel speech enhancement can be used to suppress background noise in real-time communication, and can also serve as a front end for systems such as speech compression and speech recognition.
At present, the most commonly used single-channel speech enhancement algorithms include spectral subtraction, multi-band spectral subtraction, and Wiener filtering.

Spectral subtraction [1] first estimates the noise power spectrum from the noisy speech power spectrum by some means, and then subtracts it from the noisy speech power spectrum. Multi-band spectral subtraction [2-3] improves on conventional spectral subtraction: it divides the frequency-domain signal into several sub-bands and performs the subtraction per sub-band, which yields a larger improvement in signal-to-noise ratio.

A major problem of spectral subtraction and multi-band spectral subtraction is residual "musical noise"; in particular, at low signal-to-noise ratios (below 5 dB), the residual noise severely degrades the perceived quality.

Wiener filtering [4-5] can suppress "musical noise" to some extent, but at low signal-to-noise ratios it still leaves considerable residual noise, and its SNR improvement is unsatisfactory.

Both spectral subtraction and Wiener filtering require an estimate of the noise power spectrum, and their enhancement performance depends to a large extent on the accuracy of the estimated noise spectral magnitude. When the interfering noise is non-stationary, e.g., wind noise, it is very difficult to estimate the noise power spectrum accurately, so both methods suppress non-stationary noise poorly.
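The spectral-subtraction pipeline summarized above can be sketched in a few lines. This is a minimal single-band illustration, not the cited multi-band variant [2-3]; the noise power spectrum is assumed to be estimated from a leading noise-only segment, which is exactly the assumption that fails for non-stationary noise:

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=10, floor=0.01):
    """Basic magnitude-domain spectral subtraction.

    The noise PSD is estimated by averaging the first `noise_frames`
    frames, assumed speech-free; subtracted spectra are floored to
    limit musical noise."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    # Noise PSD from the leading (assumed noise-only) frames.
    noise_psd = np.mean(
        [np.abs(np.fft.rfft(window * noisy[i*hop : i*hop+frame_len]))**2
         for i in range(noise_frames)], axis=0)
    out = np.zeros_like(noisy)
    for i in range((len(noisy) - frame_len) // hop + 1):
        seg = window * noisy[i*hop : i*hop+frame_len]
        spec = np.fft.rfft(seg)
        power = np.abs(spec)**2
        # Subtract the noise PSD; floor to avoid negative power.
        clean_power = np.maximum(power - noise_psd, floor * power)
        gain = np.sqrt(clean_power / np.maximum(power, 1e-12))
        out[i*hop : i*hop+frame_len] += np.fft.irfft(gain * spec)
    return out

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)           # stand-in for speech
noisy = np.concatenate([np.zeros(2560), clean]) + 0.5 * rng.standard_normal(10560)
enhanced = spectral_subtraction(noisy)
```

In the leading noise-only region the output power drops well below the input power; the spectral floor is what trades residual noise against the musical-noise artifacts discussed above.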
Recent Advancements in Speech Enhancement

enhancement. Thus, no reference noise source is assumed available. This problem is of great interest, and has attracted significant research effort for over fifty years. A successful algorithm may be useful as a preprocessor for speech coding and speech recognition of noisy signals.

The perception of a speech signal is usually measured in terms of its quality and intelligibility. Quality is a subjective measure which reflects individual preferences of listeners. Intelligibility is an objective measure which predicts the percentage of words that can be correctly identified by listeners. The two measures are not correlated. In fact, it is well known that intelligibility can be improved if one is willing to sacrifice quality. This can be achieved, for example, by emphasizing high frequencies of the noisy signal [35]. It is also well known that improving the quality of the noisy signal does not necessarily elevate its intelligibility. On the contrary, quality improvement is usually associated with loss of intelligibility relative to that of the noisy signal. This is due to the distortion that the clean signal undergoes in the process of suppressing the input noise. From a purely information-theoretic point of view, such loss of "information" is predicted by the data processing theorem [10]. Loosely speaking, this theorem states that one can never learn more about the clean signal from the enhanced signal than from the noisy signal.

A speech enhancement system must perform well for all speech signals. Thus, from the speech enhancement system's point of view, its input is a random process whose sample functions are randomly selected by the user. The noise is naturally a random process. Hence, the speech enhancement problem is a statistical estimation problem: estimating one random process from the sum of that process and the noise.
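As a concrete instance of this statistical estimation problem, the frequency-domain Wiener filter estimates each clean spectral coefficient by scaling the noisy coefficient with a gain computed from assumed speech and noise power spectra. This is a textbook sketch with toy PSD values, not the spectral-magnitude estimator developed later in this paper:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Per-bin Wiener gain H[k] = S[k] / (S[k] + N[k]) for the
    additive model Y = X + D with uncorrelated speech and noise."""
    return speech_psd / (speech_psd + noise_psd)

# Assumed toy PSDs for two frequency bins.
speech_psd = np.array([4.0, 1.0])
noise_psd  = np.array([1.0, 1.0])
H = wiener_gain(speech_psd, noise_psd)
X_hat = H * np.array([2.0 + 1.0j, -1.0j])  # apply to noisy DFT coefficients
```

The gain approaches 1 where speech dominates and 0 where noise dominates, which is the basic trade-off between noise suppression and speech distortion discussed above.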
Estimation theory requires statistical models for the signal and noise, and a distortion measure which quantifies the similarity of the clean signal and its estimated version. These two essential ingredients of estimation theory are not explicitly available for speech signals. The difficulties lie in the lack of a precise model for the speech signal and of a perceptually meaningful distortion measure. In addition, speech signals are not strictly stationary. Hence, adaptive estimation techniques, which do not require an explicit statistical model for the signal, often fail to track the changes in the underlying statistics of the signal. In this paper we survey some of the main ideas in the area of speech enhancement from a single microphone. We begin in Section 2 by describing some of the most promising statistical models and distortion measures which have been used in designing speech enhancement systems. In Section 3 we present a detailed design example for a speech enhancement system which is based on minimum mean square error estimation of the speech spectral magnitude. This approach integrates several key ideas from Section 2, and has
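As a simplified illustration of this statistical-estimation view (not the survey's MMSE spectral-magnitude estimator itself), the sketch below tracks a per-bin a priori SNR with the well-known decision-directed rule and applies the corresponding Wiener gain; the smoothing constant and the noise PSD are assumed known here.

```python
import numpy as np

def wiener_gains(noisy_power, noise_psd, alpha=0.98):
    """Frame-by-frame Wiener gains G = xi / (1 + xi) per frequency bin.

    The a priori SNR xi is tracked with the decision-directed rule:
      xi(k) = alpha * G(k-1)^2 * gamma(k-1) + (1 - alpha) * max(gamma(k) - 1, 0),
    where gamma = noisy_power / noise_psd is the a posteriori SNR.

    noisy_power : (T, F) short-term power spectra of the noisy speech
    noise_psd   : (F,)   noise power spectral density (assumed known)
    """
    gamma = noisy_power / noise_psd
    gains = np.empty_like(gamma)
    xi = np.maximum(gamma[0] - 1.0, 0.0)
    for k in range(len(gamma)):
        if k > 0:
            xi = (alpha * gains[k - 1] ** 2 * gamma[k - 1]
                  + (1.0 - alpha) * np.maximum(gamma[k] - 1.0, 0.0))
        gains[k] = xi / (1.0 + xi)
    return gains

# Toy run: unit noise floor, with frames 20-29 carrying strong "speech".
T, F = 50, 8
noisy_power = np.ones((T, F))
noisy_power[20:30] = 11.0
G = wiener_gains(noisy_power, np.ones(F))
```

The recursion trades responsiveness for smoothness: noise-only frames are driven toward zero gain, while sustained speech frames ramp up toward unity over a few frames.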
A Speech Enhancement Algorithm Based on Dual-Channel Neural-Network Time-Frequency Masking

DOI:10.13245/j.hust.210609
华中科技大学学报(自然科学版) / J. Huazhong Univ. of Sci. & Tech. (Natural Science Edition)
Vol.49 No.6 Jun. 2021
The method estimates the direction vector and identifies, on each microphone signal, the speech-dominated time-frequency (T-F) units used for localization, so that accurate direction-vector estimates are still obtained under strong noise and reverberation. Finally, the signals are fed into a convolutional beamformer based on the weighted power minimization distortionless response (WPD) optimization criterion for speech enhancement, so that denoising and reverberation suppression are jointly optimized. Compared with several other speech enhancement methods, the proposed algorithm removes both the background noise coming from the same direction as the speech and the interference from other directions, and the enhanced speech has high intelligibility and clarity. Moreover, because the approach relies on a trained neural-network model, it requires no prior knowledge of the microphone array and is robust in noisy environments.
First, preliminary single-channel neural-network speech enhancement is applied to each of the two microphone signals, so as to make full use of the nonlinear characteristics of speech and improve perceived quality. Next, an adaptive-mask direction-vector localization method is proposed, which accurately computes the spatial covariance matrices and direction vectors of speech and noise and precisely localizes the target source in noisy and reverberant environments. Finally, the signals are fed into a convolutional beamformer for further denoising and reverberation suppression. Experimental results show
Received 1 September 2020. About the author: JIA Hairong (b. 1977), female, professor, E-mail: helenjia722@. Funding: National Natural Science Foundation of China (12004275); Shanxi Province merit-based funding program for scientific activities of returned overseas scholars (20200017); Shanxi Province returned-scholar
A Speech Enhancement Algorithm for a Dual Miniature-Microphone Array

A Speech Enhancement Algorithm for a Dual Miniature-Microphone Array
MAO Wei; ZENG Qingning; LONG Chao
Abstract: Considering the limitations of traditional single-channel speech enhancement algorithms for noise suppression, a dual micro-array consisting of two miniature microphone arrays is employed. The noisy speech is processed using the temporal and spatial characteristics of the array's spatial structure, and a speech enhancement algorithm suited to this dual micro-array is proposed. The algorithm first applies the logarithmic minimum mean square error (LogMMSE) estimator to raise the signal-to-noise ratio (SNR) of the noisy speech collected by each channel. Then the signal in the target source direction is obtained and retained using a frequency-domain wideband minimum variance distortionless response (MVDR) beamformer, which also suppresses interfering signals from other directions. Finally, a Wiener filter that combines an improved-intelligibility gain with improved minima controlled recursive averaging (IMCRA) noise estimation removes residual noise to improve speech quality. Simulation results show that, compared with traditional single-channel speech enhancement algorithms, the proposed algorithm has good noise suppression performance.
Journal: Science Technology and Engineering, 2018, 18(10), pp. 245-249.
Keywords: dual miniature-microphone array; speech enhancement; logarithmic minimum mean square error; minimum variance distortionless response; improved minima controlled recursive averaging; Wiener filtering
Affiliation: School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China. Language: Chinese. CLC: TP391.42.
In recent years, artificial intelligence has developed rapidly, achieving revolutionary breakthroughs in fields such as computer vision, big-data analytics, autonomous driving, machine translation, and spam filtering. As the interface for intelligent interaction, speech plays an important role in improving and optimizing the performance of such systems.
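As a rough illustration of the MVDR stage (not the paper's implementation), a narrowband MVDR beamformer computes, for each frequency bin, the weights w = R⁻¹d / (dᴴR⁻¹d), which pass the target direction with unit gain while minimizing output power. The two-microphone geometry, frequency, and covariance below are made-up demonstration values.

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    R : (M, M) noise-plus-interference spatial covariance matrix
    d : (M,)   steering vector toward the target source
    """
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Two-microphone example: target at broadside, interferer at 60 degrees.
M, spacing, c, f = 2, 0.05, 343.0, 1000.0   # mics, 5 cm spacing, m/s, Hz

def steering(theta):
    """Far-field steering vector for a uniform linear array."""
    delays = spacing * np.arange(M) * np.sin(theta) / c
    return np.exp(-2j * np.pi * f * delays)

d_target = steering(0.0)
d_interf = steering(np.pi / 3)
# Covariance dominated by the interferer, plus a small diagonal loading.
R = np.outer(d_interf, d_interf.conj()) + 0.01 * np.eye(M)
w = mvdr_weights(R, d_target)

print(abs(w.conj() @ d_target))   # distortionless constraint, equals 1
print(abs(w.conj() @ d_interf))   # interferer strongly attenuated
```

The distortionless constraint wᴴd = 1 holds by construction for Hermitian R; the interferer response is driven down as far as the diagonal loading allows.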
A Speech Enhancement Method Incorporating a Multi-Head Self-Attention Mechanism

JOURNAL OF XIDIAN UNIVERSITY, Vol. 47, No. 1, Feb. 2020. doi: 10.19665/j.issn1001-2400.2020.01.015
A Speech Enhancement Method Based on the Multi-Head Self-Attention Mechanism
CHANG Xinxu, ZHANG Yang, YANG Lin, KOU Jinqiao, WANG Xin, XU Dongdong (Beijing Institute of Computer Technology and Application, Beijing 100854, China)
Abstract: Because of the masking effect in human auditory perception, a high-energy signal masks other signals of lower energy. Inspired by this phenomenon, and combining the self-attention and multi-head attention methods, a speech enhancement method based on the multi-head self-attention mechanism is proposed. By applying multi-head self-attention computation to the input noisy speech features, the clean-speech part and the noise part of the features can be distinguished more clearly, enabling subsequent processing to suppress noise more effectively. Experimental results show that, compared with a speech enhancement method based on gated recurrent neural networks, the proposed method achieves better enhancement performance, with higher quality and intelligibility of the enhanced speech.
Keywords: speech enhancement; deep neural network; self-attention; multi-head attention; gated recurrent unit
CLC: TN912.35; Document code: A; Article ID: 1001-2400(2020)01-0104-07
Speech enhancement is an important research direction in signal processing. Its main purpose is to improve the quality and intelligibility of speech corrupted by noise [1], and it has broad application prospects in speech recognition, mobile communications, artificial hearing, and many other fields [2]. In recent years, with the rise of deep learning, supervised speech enhancement methods based on deep neural networks (DNNs) have achieved great success. Especially at low SNRs and in non-stationary noise environments, such methods show clear advantages over traditional approaches.
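As a generic illustration of the mechanism described above (not the paper's network), multi-head self-attention over a sequence of spectral frames can be sketched as follows; the dimensions and random projection weights are placeholders.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention over a sequence of feature frames.

    X  : (T, D) input features, e.g. T noisy-speech spectral frames
    Wq, Wk, Wv, Wo : (D, D) projection matrices
    Each head attends over all T frames within a D // n_heads subspace.
    """
    T, D = X.shape
    dh = D // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)        # (T, T)
        # Numerically stable row-wise softmax.
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        heads.append(weights @ V[:, s])                   # (T, dh)
    return np.concatenate(heads, axis=1) @ Wo             # (T, D)

rng = np.random.default_rng(0)
T, D, H = 50, 64, 4                       # 50 frames, 64-dim features, 4 heads
X = rng.standard_normal((T, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, H)
```

Each output frame is a weighted mixture of all input frames, so frames dominated by clean speech can, after training, pool evidence from similar frames elsewhere in the utterance.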
Design of Spatial Filter Banks for Speech Enhancement Beamforming Applications

Blekinge Institute of Technology School of Engineering 37225 Ronneby, Sweden
ABSTRACT In this paper, a new spatial filter bank design method for speech enhancement beamforming applications is presented. The aim of this design is to construct a set of different filter banks that include the constraint of signal passage at one position (and closing in other positions corresponding to known disturbing sources). By performing the directional opening towards the desired location in the fixed filter bank structure, the beamformer is left with the task of tracking and suppressing the continuously emerging noise sources. This algorithm has been implemented in MATLAB and tested on real speech recordings conducted in a car hands-free communication situation. Results show that a reduction of the total complexity can be achieved while maintaining the noise suppression performance and reducing the speech distortion.
SPEECH ENHANCEMENT BASED ON TEMPORAL PROCESSING
Hynek Hermansky, Eric A. Wan, and Carlos Avendano
Oregon Graduate Institute of Science & Technology, Department of Electrical Engineering and Applied Physics, P.O. Box 91000, Portland, OR 97291

ABSTRACT
Finite Impulse Response (FIR) Wiener-like filters are applied to time trajectories of the cubic-root compressed short-term power spectrum of noisy speech recorded over cellular telephone communications. Informal listenings indicate that the technique brings a noticeable improvement to the quality of processed noisy speech while not causing any significant degradation to clean speech. Alternative filter structures are being investigated, as well as other potential applications in cellular channel compensation and narrowband-to-wideband speech mapping.

1. INTRODUCTION
The need for enhancement of noisy speech in telecommunications increases with the spread of cellular telephony. Calls may originate from rather noisy environments such as moving cars or crowded public places. The corrupting noise is often relatively stationary, or at least changing rather slowly. This leads us to investigate the use of RelAtive SpecTrAl (RASTA) processing of speech (Hermansky, Morgan, et al., 1991, 1994; Hirsch et al., 1991), which was originally designed to alleviate the effects of convolutional and additive noise in automatic speech recognition (ASR). RASTA does this by band-pass filtering time trajectories of parametric representations of speech in a domain in which the disturbing noise components are additive. Recently, RASTA was also applied to direct enhancement of noisy speech (Hermansky, Morgan, and Hirsch, 1993). In that case, RASTA filtering was applied to the magnitude (or cubic-root compressed power) of the short-term spectrum of speech while keeping the phase of the original noisy speech. However, applying the rather aggressive fixed ARMA RASTA filters (designed for suppression of convolutional distortions in ASR) yields results similar to spectral subtraction (see, e.g., Boll (1979)); i.e., enhanced speech often contains musical noise and the technique typically degrades clean speech.

(This work was sponsored in part by USWEST Advanced Technologies under contract number 9069-111. Appeared in ICASSP 95, Detroit, MI, 1995.)

2. CURRENT TECHNIQUE
RASTA involves nonlinear filtering of the trajectory of the short-term power spectrum (estimated using a 256-point FFT applied with a 64-point-overlap Hamming window at an 8 kHz sampling rate). Currently, we use linear filters applied to the cubic root of the estimated power spectrum (Hermansky et al. (1994)). In this study, we have substituted the ad hoc designed and fixed RASTA filters with a bank of non-causal FIR Wiener-like filters. Each filter is designed to optimally map a time window of the noisy speech spectrum at a specific frequency to a single estimate of the short-term magnitude spectrum of clean speech. For a 256-point FFT we require 129 unique filters.

[Figure 1: Block diagram of the system: short-term analysis of noisy speech, magnitude and phase extraction, compression with exponent a = 2/3, a bank of 129 FIR filters, expansion with exponent b = 3/2, and synthesis using the noisy phase.]

Let $p^n_i(k)$ be the cubic-root estimate of the short-term power spectrum of noisy speech in frequency bin $i$ ($i = 1$ to $129$; $k$ corresponds to an 8 ms step). The output of each filter is

$$\hat p_i(k) = \sum_{j=-M}^{M} w_i(j)\, p^n_i(k - j), \qquad (1)$$

where $\hat p_i(k)$ is the estimate of the clean-speech cubic-root spectrum. The FIR filter coefficients $w_i(j)$ are found such that $\hat p_i$ is the least-squares estimate of the clean signal $p_i$ for each frequency bin $i$. The current design has $M = 10$, corresponding to 21-tap non-causal filters. As in the spectral subtraction technique, any negative spectral values after RASTA filtering are substituted by zeros, and the noisy speech phase is used for the synthesis. The technique is illustrated in Fig. 1.
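A minimal sketch of the per-bin trajectory filtering of Eq. (1); the filter weights here are illustrative placeholders, not the least-squares coefficients trained in the paper.

```python
import numpy as np

def rasta_fir_trajectories(p_noisy, weights):
    """Apply a non-causal FIR filter along each bin's time trajectory.

    Implements p_hat_i(k) = sum_{j=-M}^{M} w_i(j) * p_noisy_i(k - j)  (Eq. 1).

    p_noisy : (K, B) cubic-root compressed power spectra, K frames, B bins
    weights : (B, 2M + 1) per-bin filter taps; column 0 corresponds to j = -M
    Negative outputs are floored at zero, as in spectral subtraction.
    """
    K, B = p_noisy.shape
    M = (weights.shape[1] - 1) // 2
    padded = np.pad(p_noisy, ((M, M), (0, 0)))   # zero-pad the time axis
    out = np.zeros_like(p_noisy)
    for i in range(B):                           # one filter per frequency bin
        for j in range(-M, M + 1):
            # padded[M - j : M - j + K, i] is the trajectory shifted by j frames
            out[:, i] += weights[i, j + M] * padded[M - j:M - j + K, i]
    return np.maximum(out, 0.0)

# Sanity demo: a unit impulse at tap j = 0 must act as the identity filter.
K, B, M = 20, 5, 3
rng = np.random.default_rng(1)
p = rng.random((K, B))
w = np.zeros((B, 2 * M + 1))
w[:, M] = 1.0
out_identity = rasta_fir_trajectories(p, w)
```

In the paper's setting, K would index 8 ms analysis steps, B = 129 FFT bins, and M = 10 (21 taps); the weights would come from a least-squares fit against clean-speech trajectories.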
Initial filters were designed on approximately 2 minutes of speech of one male talker recorded at 8 kHz sampling over a public analog cellular line from a relatively quiet laboratory. The speech was artificially corrupted by additive noise recorded over a second cellular channel from a) a car driving on a freeway with the window closed, b) a car driving on a freeway with the window open, and c) a busy shopping mall.

3. RESULTING OPTIMAL RASTA FILTERS
Filter frequency responses are shown in Fig. 2 (darker shades represent larger values). The filters are approximately symmetric (i.e., near zero phase). Filters for different frequency channels differ. The whole frequency band between 0 and 4 kHz appears to be subdivided into several regions, each characterized by its own RASTA processing. The highest-gain RASTA filters are applied in the frequency bands of about 300-2300 Hz. Frequency responses of typical filters in this region (slice A in Fig. 2) are shown in Fig. 3(A). Typically, filters in this frequency band have a band-pass character, emphasizing modulation frequencies around 6-8 Hz. Compared to our original ad hoc designed RASTA filter (Hermansky, Morgan, and Hirsch, 1993), the low-frequency band-stop is much milder, being at most only 10 dB down from the maximum. Filters for very low frequencies (0-100 Hz) are high-gain filters with a rather flat frequency response (slice C in Fig. 2 and Fig. 3(C)). Filters in the 150-250 Hz and 2700-4000 Hz regions are low-gain low-pass filters (slice B in Fig. 2) with at least 10 dB attenuation for modulation frequencies above 5 Hz. The low-frequency pass-band of these filters is typically below the pass-band of the high-gain band-pass filters of Fig. 3(A). The frequency response of a 129-tap Wiener filter designed on noisy data is shown by the solid line in Fig. 4. For comparison, the center taps of the RASTA filters designed on the magnitude spectrum (i.e., a = b = 1 in Fig. 1) are shown in the figure by the dashed line. Informal comparisons between the quality of speech processed by the time-domain Wiener filter and the RASTA filters indicate some advantage of the additional filter taps and compressed spectrum used with the new RASTA filters.

[Figure 2: Frequency response of RASTA filters, modulation frequency versus center frequency of the analysis channel (0-4 kHz). Figure 3: Frequency responses of the filter types in slices (A), (B), and (C). Figure 4: Wiener filter response and RASTA center taps, magnitude (dB) versus frequency (kHz).]

4. EXPERIMENTS
The RASTA processing using the filters described above has been tested on several recordings of actual noisy cellular communications (outside the training set). We have informally observed that the processing brings a noticeable improvement in the subjective quality of the recordings. Noise seems to be less disturbing while the quality of speech is not impaired. Further, we have also observed that the processing does not seem to impair the quality of clean speech recordings. A visible reduction of the noise after processing is apparent from the spectrograms shown in Fig. 5. Several tests done on clean cellular communications have resulted in a noticeable improvement; however, formal comparisons with other techniques have yet to be done.

[Figure 5: Spectrogram showing noisy speech followed by the same noisy segment after processing.]

5. WORK IN PROGRESS

5.1. Data Collection
A database that includes the effects of actual ambient noise on cellular speech communications is being collected. A training set of 330 sentences taken from the TIMIT database (complete dr8 set) is played through a portable PC over a public cellular line from different locations. A test set generated with 123 randomly chosen TIMIT sentences (dr1-dr7 only) is recorded under the same conditions for later evaluation of our algorithms.

5.2. Adjacent Channels
Adjacent-channel information has been used to train each channel, resulting in a set of Multi-Input-Single-Output (MISO) filters. Initial results with 6 adjacent channels (3 on each side of the frequency of interest) show an improvement in the reduction of perceived noise as well as a reduction of the residual error at each frequency bin. However, the additional improvement obtained with this structure is noticeable only within the training set (the original 2-minute source), while the processing generalizes poorly over different speakers and noise types. The effect of training on large data sets and the effect of different architectures need to be investigated before any conclusions can be drawn.

5.3. Cellular to Wideband Speech Recovery
Availability of the original 16 kHz sampled speech data has raised the question as to the possibility of recovering a good-quality wideband signal from a degraded, 8 kHz sampled, cellular speech recording. Two steps have been taken towards the investigation of this problem.

5.3.1. Cellular Channel Equalization
The original 16 kHz sampled speech is downsampled to 8 kHz and used as the desired response in our training scheme. The input data correspond to the same speech segments recorded over the cellular channel. Additional white Gaussian noise is added to the input in order to avoid excessive gains at the band edges where cellular speech is not present.

5.3.2. Wideband Speech Reconstruction
An approximate wideband-from-narrowband speech map has been produced by extending the adjacent-channel concept. Now the 129 trajectories derived from the 8 kHz sampled speech analysis are fed to MISO filters whose outputs correspond to frequencies from 4 kHz to 8 kHz (i.e., bins 130 to 257), and the signal reconstruction is done at twice the original rate. Training was performed with 9 TIMIT sentences (same speaker) with a time window of 40 ms (M = 2 in (1)). In Fig. 6, we show an out-of-training-set original wideband speech spectrogram, the downsampled version of it, and the recovered spectrogram. Within a given talker the high-frequency recovery appears to be viable. However, for time-signal reconstruction an estimate of the phase of the recovered high frequencies needs to be performed. This is discussed in the following section.

[Figure 6: a) Spectrogram of the original speech, b) spectrogram of the downsampled signal, and c) recovered spectrogram.]

5.4. Phase Reconstruction
Modification of the short-term magnitude spectrum of a signal and synthesis with the original phase do not, in general, yield a time signal with the desired magnitude spectrum (see, e.g., Allen (1977)). To avoid distortion in the synthesized signal, especially when the power spectrum has been severely modified (see subsection 5.6), we are investigating an iterative algorithm (see, e.g., Griffin and Lim (1984)) in which the mean squared error between the desired spectrum and the spectrum produced by the synthesized signal is minimized. In noise reduction applications, we have used the noisy signal's phase to perform the initial step in the reconstruction. In the case of wideband speech mapping, a linear map from the available low-frequency phase components to the higher frequencies is taken as a first approximation.

5.5. Postprocessing with Envelope Enhancement
First proposed for enhancement of ADPCM speech coding (see Ramamoorthy et al. (1988)), a prefiltering technique for spectral envelope enhancement has been applied in our system. Spectral-domain filtering is done by designing a filter at each 8 ms step with an untilted and smoothed 8th-order all-pole model spectrum of the particular frame, giving the effect of enhancing frequencies around formants and suppression elsewhere. Best results have been achieved by prefiltering the compressed power spectrum prior to applying the RASTA filters to its trajectories.

5.6. Non-linear Filters
Non-linear RASTA filters have also been implemented with 3-layer artificial neural networks. Preliminary results on small training sets show more aggressive filtering and greater noise reduction. However, musical-like residual noise as well as greater degradation of clean speech is also observed. The iterative synthesis method described above has been used with some success to alleviate the phase distortion introduced by the large amount of spectral modification. Current efforts are focused on reducing the adverse effects introduced by the networks.

6. CONCLUSIONS
We have presented a new technique for enhancement of noisy speech in cellular communications, which utilizes Wiener-like RASTA filters on the cubic-root compressed short-term spectrum of speech. While no formal subjective perceptual tests were carried out, our initial experience indicates a possible advantage of the proposed technique over conventional speech enhancement techniques. Future work will investigate generalization of the technique for additional noise sources and larger training sets. In addition, more formal perceptual evaluations will be undertaken. Related work outlined in Section 5 will also be pursued.

7. ACKNOWLEDGEMENTS
We would like to thank Mahesh K. Kumashikar for the neural networks training, and CONACYT (Mexico) for partial support of the third author.

8. REFERENCES
[1] J.B. Allen: Short term spectral analysis, synthesis, and modification by discrete Fourier transform, IEEE Trans. ASSP-25, pp. 235-238, Jun. 1977.
[2] S.F. Boll: Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. ASSP-27, pp. 113-120, Apr. 1979.
[3] V. Ramamoorthy, N.S. Jayant, R.V. Cox, and M.M. Sondhi: Enhancement of ADPCM speech coding with backward-adaptive algorithms for postfiltering and noise feedback, IEEE J. Selec. Areas in Comm., Vol. 6, No. 2, pp. 364-382, Feb. 1988.
[4] H. Hermansky: Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 87, pp. 1738-1752, Apr. 1990.
[5] H.G. Hirsch, P. Meyer, and H. Ruehl: Improved speech recognition using high-pass filtering of subband envelopes, Proc. EUROSPEECH '91, pp. 413-416, Genova, 1991.
[6] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn: Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP), Proc. EUROSPEECH '91, pp. 1367-1370, Genova, 1991.
[7] H. Hermansky, N. Morgan, and H.G. Hirsch: Recognition of speech in additive and convolutional noise based on RASTA spectral processing, Proc. ICASSP-93, pp. II-83 to II-86, 1993.
[8] H. Hermansky, E. Wan, and C. Avendano: Noise suppression in cellular communications, Proc. IVTTA '94, pp. 85-88, Kyoto, 1994.
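As a supplementary illustration of the iterative phase reconstruction discussed in subsection 5.4 (in the style of Griffin and Lim (1984)): the loop below alternates between resynthesis and re-analysis, keeping the newly obtained phase and restoring the target magnitude at each pass. Frame length, hop, iteration count, and the random initial phase are illustrative choices, not values from the paper.

```python
import numpy as np

def stft(x, frame_len=256, hop=64):
    """Short-time Fourier transform with a Hanning analysis window."""
    w = np.hanning(frame_len)
    idx = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.fft.rfft(x[i:i + frame_len] * w) for i in idx])

def istft(S, frame_len=256, hop=64):
    """Weighted overlap-add inverse STFT (least-squares synthesis)."""
    w = np.hanning(frame_len)
    n = (len(S) - 1) * hop + frame_len
    out, norm = np.zeros(n), np.zeros(n)
    for k, spec in enumerate(S):
        frame = np.fft.irfft(spec, frame_len)
        out[k * hop:k * hop + frame_len] += frame * w
        norm[k * hop:k * hop + frame_len] += w ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(target_mag, n_iter=30, frame_len=256, hop=64):
    """Estimate a phase consistent with a target magnitude spectrogram:
    resynthesize, re-analyze, keep the new phase, restore the magnitude."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(target_mag.shape))
    for _ in range(n_iter):
        x = istft(target_mag * phase, frame_len, hop)
        phase = np.exp(1j * np.angle(stft(x, frame_len, hop)))
    return istft(target_mag * phase, frame_len, hop)

# Demo: rebuild a pure tone from its magnitude spectrogram alone.
t = np.arange(2048)
x = np.sin(2 * np.pi * 0.05 * t)
target = np.abs(stft(x))
y = griffin_lim(target)
rel_err = np.linalg.norm(np.abs(stft(y)) - target) / np.linalg.norm(target)
```

Each iteration cannot increase the squared error between the target magnitude and the magnitude of the synthesized signal, which is the monotonicity property that motivates the method for heavily modified spectra.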