Combining Acoustic Features for Improved Emotion
Recognition in Mandarin Speech
Tsang-Long Pao, Yu-Te Chen, Jun-Heng Yeh, and Wen-Yuan Liao
Department of Computer Science and Engineering, Tatung University
tlpao@ttu.edu.tw
d8906005@ttu.edu.tw
Abstract. Combining different feature streams to obtain a more accurate experimental result is a well-known technique. The basic argument is that if the recognition errors of systems using the individual streams occur at different points, there is at least a chance that a combined system will be able to correct some of these errors by reference to the other streams. In an emotional speech recognition system, there are many ways in which this general principle can be applied. In this paper, we propose using feature selection and feature combination to improve speaker-dependent emotion recognition in Mandarin speech. Five basic emotions are investigated: anger, boredom, happiness, neutral and sadness. Combining multiple feature streams proves highly beneficial in our system. The best accuracy in recognizing the five emotions, 99.44%, is achieved by combining the MFCC, LPCC, RastaPLP and LFPC feature streams with the nearest class mean classifier.
1 Introduction
Speech signals, the most natural form of communication for humans, are air vibrations produced when air exhaled from the lungs is modulated and shaped by the vibrations of the vocal cords and the vocal tract as it is pushed out through the lips and nose. Speech sounds have a rich temporal-spectral variation; in contrast, animals can only produce a relatively small repertoire of basic sound units.
Just as written language is a sequence of elementary alphabet symbols, speech is a sequence of elementary acoustic symbols. Speech signals convey more than the spoken words: the additional information carried in speech includes gender, age, accent, the speaker's identity, health, prosody and emotion [1].
Recently, acoustic investigation of emotions expressed in speech has gained increased attention, partly due to the potential value of emotion recognition for spoken dialogue management [2-4]. For instance, complaints or anger about unsatisfied service requests could be handled smoothly by transferring the user to a human operator. However, in order to reach such a level of performance, we need to extract a reliable acoustic feature set that is largely immune to inter- and intra-speaker variability in emotion expression. The aim of this paper is to use feature combination (FC) to concatenate different features to improve emotion recognition in Mandarin speech.
280 T.-L. Pao et al.
This paper is organized as follows. In Section 2, the extracted acoustic features used in the system are described. In Section 3, we will describe feature selection and combination in more detail. Experimental results are reported in Section 4. Finally, conclusions are given in Section 5.
2 Extracted Acoustic Features
The speech signal contains different kinds of information. From the viewpoint of the automatic emotion recognition task, it is useful to think of the speech signal as a sequence of features that characterize both the emotion and the speech. How to extract sufficient information for good discrimination, in a form and size amenable to effective modeling, is a crucial problem.
All studies in the field point to pitch as the main vocal cue for emotion recognition. The other acoustic variables contributing to vocal emotion signaling are vocal energy, frequency spectral features, formants and temporal features [5]. Another approach to feature extraction is to enrich the feature set with derivative features, such as the MFCC (Mel-Frequency Cepstral Coefficient) parameters of the signal or features of the smoothed pitch contour and its derivatives. It is well known that the MFCCs are among the best acoustic features for automatic speech recognition: they are robust, contain much information about the vocal tract configuration regardless of the source of excitation, and can be used to represent all classes of speech sounds. In [6], the authors proposed using MFCC coefficients and Vector Quantization to perform speaker-dependent emotion recognition. In their experimental results, the correct rate using energy or pitch alone is about 36%, compared with 76% for MFCC, and the influence of these features when combined with MFCC is unclear.
Instead of extracting pitch and energy features, in this paper we estimated the following: formants (F1, F2 and F3), Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), the first derivative of MFCC (dMFCC), the second derivative of MFCC (ddMFCC), Log Frequency Power Coefficients (LFPC), Perceptual Linear Prediction (PLP) and RelAtive SpecTrAl PLP (Rasta-PLP). Formants are used in a wide range of applications, such as parameter extraction for TTS systems, acoustic features for speech recognition, the analysis component of formant-based vocoders, and speaker identification, adaptation and normalization. There are at least three methods for estimating formant positions: solving for the roots of the LPC polynomial, peak-picking of the LPC envelope, and peak-picking of the cepstrally smoothed spectrum. In this paper, we adopted the second method.
For many years, LPC has been considered one of the most powerful techniques for speech analysis. In fact, this technique is the basis of other, more recent and sophisticated algorithms that are used for estimating speech parameters, e.g., pitch, formants, spectra, vocal tract shape and low bit-rate representations of speech. The LPC coefficients can be calculated by either the autocorrelation method or the covariance method [1], and the order of linear prediction used here is 16. After the 16 LPC coefficients are obtained, the LPCC parameters are derived using the following equation:
c_k = a_k + (1/k) * Σ_{i=1}^{k-1} i c_i a_{k-i},   1 ≤ k ≤ p   (1)

where a_k are the LPC coefficients, c_k are the cepstral coefficients and p is the LPC analysis order.
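The recursion in Eq. (1) is simple to implement directly. The sketch below (plain Python; the function name is ours, not from the paper) converts the LPC coefficients a_1 through a_p into the corresponding LPCC parameters:

```python
def lpc_to_lpcc(a):
    """Convert LPC coefficients a (a[0] holds a_1, ..., a[p-1] holds a_p)
    to LPCC parameters via the recursion of Eq. (1)."""
    p = len(a)
    c = []
    for k in range(1, p + 1):
        # c_k = a_k + (1/k) * sum_{i=1}^{k-1} i * c_i * a_{k-i}
        acc = sum(i * c[i - 1] * a[k - i - 1] for i in range(1, k))
        c.append(a[k - 1] + acc / k)
    return c
```

Note that the recursion only uses cepstral coefficients already computed, so a single forward pass over k = 1..p suffices.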
Mel-Frequency Cepstral Coefficients (MFCCs) are currently among the most frequently used feature representations in automatic speech recognition systems, as they convey information about the short-time energy distribution in the frequency domain and yield satisfactory recognition results for a number of tasks and applications. The MFCCs are extracted from the speech signal, which is framed and then windowed, usually with a Hamming window. A set of 20 Mel-spaced filter banks is then applied to obtain the mel spectrum. The natural logarithm is taken to transform into the cepstral domain, and the Discrete Cosine Transform is finally computed to get the MFCCs. Since spectral transitions play an important role in human speech perception, the first and second order derivatives of the MFCC features are also adopted in this paper.
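As an illustration of the extraction pipeline just described (framing, Hamming windowing, 20 Mel-spaced filters, logarithm, DCT), the following NumPy sketch computes MFCCs. The frame length, hop size and number of cepstral coefficients are our own illustrative choices for an 8 kHz signal, not values specified in the paper:

```python
import numpy as np

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filters=20, n_ceps=12):
    # frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular Mel-spaced filter bank
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # log Mel energies, then DCT-II to reach the cepstral domain
    logmel = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return logmel @ dct.T        # shape: (n_frames, n_ceps)
```

The dMFCC and ddMFCC streams are then obtained by differencing these coefficients over successive frames.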
LFPC can be regarded as a model that follows the varying auditory resolving power of the human ear across frequencies. In [7], it is found that short-time LFPC gives better performance for speech emotion recognition than LPCC and MFCC. A possible reason is that LFPC better preserves fundamental frequency information in the lower-order filters. The computation of short-time LFPC can be found in [7]. For the set of 20 bandpass filters, the bandwidth and the center frequency of the first filter are set at 54 Hz and 127 Hz, respectively.
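The paper fixes only the first filter (54 Hz bandwidth, 127 Hz center) for the 20-filter bank; the exact spacing rule is given in [7]. As a hedged illustration, the sketch below assumes a constant bandwidth growth ratio between adjacent filters (an assumption of ours, not a value from [7]) and solves for that ratio numerically so that the 20 filters exactly span up to the 4 kHz Nyquist frequency of the 8 kHz corpus:

```python
def lfpc_bank(n_filters=20, f1=127.0, b1=54.0, nyquist=4000.0):
    """Center frequencies and bandwidths of a log-spaced LFPC filter bank.
    ASSUMPTION: each filter's bandwidth is a constant ratio `a` times the
    previous one; `a` is found by bisection so the bank ends at `nyquist`."""
    def upper_edge(a):
        f, b = f1, b1
        for _ in range(n_filters - 1):
            nb = b * a
            f += (b + nb) / 2.0      # adjacent filters touch at band edges
            b = nb
        return f + b / 2.0
    lo, hi = 1.0, 2.0                # upper_edge is monotonic in a
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if upper_edge(mid) < nyquist:
            lo = mid
        else:
            hi = mid
    a = (lo + hi) / 2.0
    centers, bws = [f1], [b1]
    for _ in range(n_filters - 1):
        bws.append(bws[-1] * a)
        centers.append(centers[-1] + (bws[-2] + bws[-1]) / 2.0)
    return centers, bws
```

The per-filter log power coefficients are then the logarithm of the signal energy falling inside each of these bands.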
Perceptual linear prediction (PLP) is a combination of DFT and LP techniques. The analysis steps for PLP are critical-band warping and averaging, equal-loudness pre-emphasis, transformation according to the intensity-loudness power law, and all-pole modeling. The all-pole model parameters are converted to cepstral coefficients, which are liftered to approximately whiten the features. For a model of order p we use p = 20 cepstral coefficients.
The word RASTA stands for RelAtive SpecTrAl technique [8]. This technique is an improvement of the traditional PLP method and consists of a special filtering of the different frequency channels of a PLP analyzer. This filtering makes speech analysis less sensitive to slowly changing or steady-state factors in speech. The RASTA method replaces the conventional critical-band short-term spectrum in PLP with a less sensitive spectral estimate.
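The core of RASTA, suppressing slowly varying components, can be sketched as a band-pass IIR filter applied to each log-spectral trajectory over time. The numerator taps and pole below follow the commonly cited form of the RASTA filter [8], but treat the exact constants as an assumption rather than the authors' configuration:

```python
def rasta_filter(x, pole=0.98):
    """Causal sketch of the RASTA band-pass filter on one log-spectral
    trajectory x (one value per frame). Because the numerator taps sum
    to zero, constant (steady-state) components are filtered out."""
    num = [0.2, 0.1, 0.0, -0.1, -0.2]   # FIR part: a smoothed differentiator
    y, prev_y = [], 0.0
    for n in range(len(x)):
        acc = sum(num[k] * x[n - k] for k in range(5) if n - k >= 0)
        val = acc + pole * prev_y        # single-pole integration
        y.append(val)
        prev_y = val
    return y
```

Feeding a constant trajectory through the filter yields an output that decays toward zero, which is exactly the desired insensitivity to steady-state spectral factors such as a fixed channel.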
3 Feature Selection and Combination
Despite the successful deployment of speech recognition applications, some circumstances present severe challenges to current recognizers, for instance background noise, reverberation, fast or slow speech, and unusual accents. In the huge body of published research there are many reports of success in mitigating individual problems, but fewer techniques that help in multiple different conditions. What is needed is a way to combine the strengths of several different approaches into a single system.
Fig. 1. Feature ranking

Fig. 2. The block diagram of the proposed system
Feature combination is a well-known technique [9]. During feature extraction in a speech recognition system, we typically find that each feature type has particular circumstances in which it excels, and this motivated our investigation of combining separate feature streams in a single emotional speech recognition system. For feature combination, all candidate feature streams are concatenated. Due to the highly redundant information in the concatenated feature vector, forward feature selection (FFS) or backward feature selection (BFS) should be carried out to retain only the most representative features, thereby orthogonalizing the feature vector and reducing its dimensionality. Figure 1 is a two-dimensional plot of the 9 features ranked with the nearest class mean classifier by forward selection and backward elimination; features near the origin are considered more important. The resulting feature vector
is then processed as for a regular full-band recognizer. The block diagram of the Mandarin emotional speech recognition system with feature combination is shown in Fig. 2.
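A minimal sketch of the forward feature selection step, assuming a caller-supplied `score` function (for example, leave-one-out accuracy of the nearest class mean classifier on a candidate combination). The interface is hypothetical, not the authors' implementation:

```python
def forward_select(streams, score):
    """Greedy forward selection over named feature streams.
    `streams`: list of stream names (e.g. "MFCC", "LPCC", ...).
    `score(subset)`: returns the evaluation score for that combination.
    Streams are added one at a time as long as the score improves."""
    selected, best = [], score([])
    remaining = list(streams)
    while remaining:
        # pick the stream whose addition gives the highest score
        cand = max(remaining, key=lambda s: score(selected + [s]))
        cand_score = score(selected + [cand])
        if cand_score <= best:
            break                     # no further improvement: stop
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best
```

Backward elimination works symmetrically, starting from the full concatenated set and removing the stream whose deletion hurts the score least.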
4 Experimental Results
In order to carry out the experiments, we constructed an acted speech corpus recorded from two Chinese speakers, one male and one female, in 8-bit PCM with a sampling frequency of 8 kHz. To provide reference data for the automatic classification experiments, we performed human listening tests with two other listeners, and only the utterances on which both listeners completely agreed were chosen for the experiments reported in this paper. From the evaluation results, anger is the most difficult category to discriminate in our database, with the lowest accuracy rate. Finally, we obtained 637 utterances: 95 angry, 115 bored, 113 happy, 150 neutral, and 164 sad.
The Mandarin emotion recognition system was implemented in MATLAB on a desktop PC. The correct recognition rate was evaluated using leave-one-out (LOO) cross-validation, a standard method for estimating the predictive accuracy of a classifier.
The task of the classifier component of a full system is to use the feature vector provided by the feature extractor to assign the object to a category. To recognize emotions in speech we tried the following approaches: the minimum distance classifier with Euclidean distance, the minimum distance classifier with Manhattan distance, and the nearest class mean classifier.
Table 1. Experimental results (%) with each feature stream

Feature stream   Min. distance (Euclidean)   Min. distance (Manhattan)   Nearest class mean
F1~F3                32.81                       32.52                       36.44
LPC                  63.67                       60.90                       59.18
LPCC                 78.71                       78.00                       82.05
MFCC                 71.44                       73.50                       92.62
dMFCC                65.70                       65.03                       46.13
ddMFCC               52.37                       49.92                       29.56
LFPC                 62.75                       64.96                       80.64
PLP                  67.00                       68.58                       77.24
RASTA-PLP            56.65                       60.13                       78.43
Table 2. The best accuracy and the combined feature streams in different classifiers

Classifier                   Best accuracy (%)   Combined feature streams
Min. distance (Euclidean)        79.98           LPCC, MFCC, LPC, LFPC, RASTA-PLP
Min. distance (Manhattan)        81.03           LPCC, MFCC, LPC, LFPC, RASTA-PLP
Nearest class mean               99.44           MFCC, LPCC, RASTA-PLP, LFPC
The nearest class mean classifier is a simple classification method that assigns an unknown sample to a class according to the distance between the sample and each class's mean. The class mean, or centroid, is calculated as follows:
m_i = (1/n_i) * Σ_{j=1}^{n_i} x_{i,j}   (2)
where x_{i,j} is the j-th sample from class i and n_i is the number of samples in class i. An unknown sample with feature vector x is classified as class i if it is closer to the mean vector of class i than to any other class's mean vector. Rather than calculating the distance between the unknown sample and the mean of every class, the minimum-distance classification computes the Euclidean or Manhattan distance between the unknown sample and each individual training sample.
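Putting the centroid rule of Eq. (2) together with leave-one-out evaluation, a compact NumPy sketch (our own illustrative implementation, not the authors' MATLAB code) is:

```python
import numpy as np

def nearest_class_mean_loo(X, y):
    """Leave-one-out accuracy of the nearest class mean classifier:
    each held-out sample is assigned to the class whose centroid,
    computed without that sample (Eq. (2)), is nearest in Euclidean
    distance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    correct = 0
    for i in range(len(X)):
        means = []
        for c in classes:
            mask = (y == c)
            mask[i] = False                 # leave the test sample out
            means.append(X[mask].mean(axis=0))
        d = [np.linalg.norm(X[i] - m) for m in means]
        if classes[int(np.argmin(d))] == y[i]:
            correct += 1
    return correct / len(X)
```

Swapping `np.linalg.norm` for `np.abs(...).sum()` gives the Manhattan-distance variant; the minimum-distance classifiers instead compare against every training sample rather than the class centroids.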
The experimental results for each individual feature stream are given in Table 1, and the best accuracy of feature combination for each classification method is given in Table 2.
5 Conclusions
We express our emotions in three main ways: the words we use, facial expression and the intonation of the voice. Whereas research on automated recognition of emotion in facial expressions is now very rich, research dealing with the speech modality, both for automated production and for recognition by machines, has been active for only a few years and has dealt almost exclusively with English.
Some state-of-the-art ASR systems employ multi-stream processing, in which several data streams are processed in parallel before their information is recombined at a later point. Combination of the different streams can be carried out either before or after acoustic modeling, i.e., at the feature level or at the decision level.
In this paper, feature selection and feature combination were used to perform an exhaustive search for the optimal feature vector combination. For each feature combination, the classifier performance was tested by means of the leave-one-out method. Five basic emotions, anger, boredom, happiness, neutral and sadness, were investigated. The experimental results show that MFCC and LPCC play the role of primary features, and the best feature combination in our proposed system is MFCC, LPCC, RastaPLP and LFPC. The highest accuracy of 99.44% is achieved with the nearest class mean classifier. Contrary to [7], LPCC and MFCC achieve better performance for speech emotion recognition than short-time LFPC in our experiments.
In the future, it will be necessary to collect more acted and spontaneous speech sentences to test the robustness of the best feature combination. Furthermore, it might be useful to measure the confidence of each decision after classification: based on a confidence threshold, a classification result could be marked as reliable or unreliable, and unreliable cases could, for example, be further processed by a human. Combining the information of different features and diverse classifiers in an emotion recognition system also remains a challenge for our future work.
Acknowledgement
The authors would like to thank the National Science Council (NSC) for financially supporting this research under NSC project no. NSC 93-2213-E-036-023.

References
1. Rabiner, L.R. and Juang, B.H., Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ, 1993.
2. Lee, C.M. and Narayanan, S., "Towards detecting emotion in spoken dialogs," IEEE Trans. on Speech and Audio Processing, in press.
3. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J., "Emotion Recognition in Human-Computer Interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32-80, Jan. 2001.
4. Litman, D. and Forbes, K., "Recognizing Emotions from Student Speech in Tutoring Dialogues," in Proceedings of ASRU'03, 2003.
5. Banse, R. and Scherer, K.R., "Acoustic profiles in vocal emotion expression," Journal of Personality and Social Psychology, pp. 614-636, 1996.
6. Le, X.H., Quenot, G., and Castelli, E., "Recognizing emotions for the audio-visual document indexing," in Proceedings of Computers and Communications (ISCC 2004), pp. 580-584.
7. Nwe, T.L., Foo, S.W., and De Silva, L.C., "Speech Emotion Recognition using Hidden Markov Models," Speech Communication, 2003.
8. Hermansky, H. and Morgan, N., "RASTA Processing of Speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, October 1994.
9. Ellis, D.P.W., "Stream combination before and/or after the acoustic model," in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2000), 2000.