
Combining Acoustic Features for Improved Emotion Recognition in Mandarin Speech

Tsang-Long Pao, Yu-Te Chen, Jun-Heng Yeh, and Wen-Yuan Liao

Department of Computer Science and Engineering, Tatung University


Abstract. Combining different feature streams to obtain a more accurate experimental result is a well-known technique. The basic argument is that if the recognition errors of systems using the individual streams occur at different points, there is at least a chance that a combined system will be able to correct some of these errors by reference to the other streams. In an emotional speech recognition system, there are many ways in which this general principle can be applied. In this paper, we propose using feature selection and feature combination to improve speaker-dependent emotion recognition in Mandarin speech. Five basic emotions are investigated: anger, boredom, happiness, neutral and sadness. Combining multiple feature streams is clearly highly beneficial in our system. The best accuracy in recognizing the five emotions, 99.44%, is achieved by combining the MFCC, LPCC, RastaPLP and LFPC feature streams with the nearest class mean classifier.

1 Introduction

Speech signals are air vibrations produced when air exhaled from the lungs is modulated and shaped by the vibrations of the glottal cords and by the vocal tract as it is pushed out through the lips and nose, and they are the most natural form of communication for humans and many animal species. Speech sounds have a rich temporal-spectral variation. In contrast to human speech, other animals can only produce a relatively small repertoire of basic sound units.

Just as written language is a sequence of elementary alphabetic symbols, speech is a sequence of elementary acoustic symbols. Speech signals convey more than spoken words. The additional information conveyed in speech includes gender, age, accent, the speaker's identity, health, prosody and emotion [1].

Recently, acoustic investigation of emotions expressed in speech has gained increased attention, partly due to the potential value of emotion recognition for spoken dialogue management [2-4]. For instance, complaints or anger about unsatisfied user requests could be handled smoothly by transferring the user to a human operator. However, in order to reach such a level of performance, we need to extract a reliable acoustic feature set that is largely immune to inter- and intra-speaker variability in emotion expression. The aim of this paper is to use feature combination (FC) to concatenate different features to improve emotion recognition in Mandarin speech.


This paper is organized as follows. In Section 2, the extracted acoustic features used in the system are described. In Section 3, we will describe feature selection and combination in more detail. Experimental results are reported in Section 4. Finally, conclusions are given in Section 5.

2 Extracted Acoustic Features

The speech signal contains different kinds of information. From the point of view of the automatic emotion recognition task, it is useful to think of the speech signal as a sequence of features that characterize both the emotions and the speech. How to extract sufficient information for good discrimination, in a form and size amenable to effective modeling, is a crucial problem.

All studies in the field point to pitch as the main vocal cue for emotion recognition. The other acoustic variables contributing to vocal emotion signaling are vocal energy, frequency spectral features, formants and temporal features [5]. Another approach to feature extraction is to enrich the set of features by considering derivative features such as MFCC (Mel-Frequency Cepstral Coefficient) parameters of the signal, or features of the smoothed pitch contour and its derivatives. It is well known that the MFCCs are among the best acoustic features used in automatic speech recognition. The MFCCs are robust, contain much information about the vocal tract configuration regardless of the source of excitation, and can be used to represent all classes of speech sounds. In [6], the authors proposed using MFCC coefficients and Vector Quantization to perform speaker-dependent emotion recognition. Their experimental results show a correct rate of about 36% for energy or pitch, compared with 76% for MFCC, and the influence of these features when combined with MFCC is unclear.

Instead of extracting pitch and energy features, in this paper we estimate the following: formants (F1, F2 and F3), Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), the first derivative of MFCC (dMFCC), the second derivative of MFCC (ddMFCC), Log Frequency Power Coefficients (LFPC), Perceptual Linear Prediction (PLP) and RelAtive SpecTrAl PLP (Rasta-PLP). Formants are used in a wide range of applications, such as parameter extraction for a TTS system, as acoustic features for speech recognition, as an analysis component for a formant-based vocoder, and for speaker identification, adaptation and normalization. There are at least three methods for estimating formant positions: LPC root finding, peak-picking of the LPC envelope, and peak-picking of the cepstrally smoothed spectrum. In this paper, we adopt the second method.
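As an illustration of the second method, the following minimal Python sketch peak-picks the LPC spectral envelope of a single analysis frame. It assumes librosa and SciPy are available; the pre-emphasis coefficient, LPC order and frequency resolution are illustrative choices, not values taken from the paper.

    import numpy as np
    import librosa
    from scipy.signal import freqz, find_peaks

    def estimate_formants(frame, sr, lpc_order=12, n_formants=3):
        # Pre-emphasis and Hamming window on one analysis frame
        frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
        frame = frame * np.hamming(len(frame))
        # All-pole (LPC) model of the frame
        a = librosa.lpc(frame.astype(float), order=lpc_order)
        # LPC spectral envelope: frequency response of 1 / A(z)
        freqs, h = freqz([1.0], a, worN=512, fs=sr)
        envelope_db = 20 * np.log10(np.abs(h) + 1e-10)
        # Peaks of the envelope approximate the formant positions F1, F2, F3
        peaks, _ = find_peaks(envelope_db)
        return freqs[peaks][:n_formants]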

For many years, LPC has been considered one of the most powerful techniques for speech analysis. In fact, this technique is the basis of other more recent and sophisticated algorithms that are used for estimating speech parameters, e.g., pitch, formants, spectra, vocal tract shape and low bit-rate representations of speech. The LPC coefficients can be calculated by either the autocorrelation method or the covariance method [1], and the order of linear prediction used is 16. After the 16 LPC coefficients are obtained, the LPCC parameters are derived using the following equation:

c_k = a_k + \frac{1}{k} \sum_{i=1}^{k-1} i \, c_i \, a_{k-i}, \qquad 1 \le k \le p    (1)

where a_k are the LPC coefficients and p is the LPC analysis order.
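A direct transcription of Eq. (1) into code is given below (a minimal Python sketch; the paper uses p = 16, and the sign convention of the a_k depends on the LPC implementation used).

    import numpy as np

    def lpc_to_lpcc(a):
        # a holds the LPC coefficients a_1 .. a_p (without the leading 1 of A(z))
        p = len(a)
        c = np.zeros(p)
        for k in range(1, p + 1):
            # c_k = a_k + (1/k) * sum_{i=1}^{k-1} i * c_i * a_{k-i}
            acc = sum(i * c[i - 1] * a[k - i - 1] for i in range(1, k))
            c[k - 1] = a[k - 1] + acc / k
        return c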

Mel-Frequency Cepstral Coefficients (MFCCs) are currently one of the most frequently used feature representations in automatic speech recognition systems, as they convey information about short-time energy migration in the frequency domain and yield satisfactory recognition results for a number of tasks and applications. The MFCCs are extracted from the speech signal, which is framed and then windowed, usually with a Hamming window. A set of 20 Mel-spaced filter banks is then applied to obtain the mel spectrum. The natural logarithm is taken to transform into the cepstral domain, and the Discrete Cosine Transform is finally computed to get the MFCCs. Spectral transitions play an important role in human speech perception, so the first and second order derivatives of the MFCC features are also adopted in this paper.
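This pipeline can be sketched with librosa as follows (an illustrative sketch only: librosa's internal log scaling differs slightly from the natural-logarithm step described above, and the frame length, hop size and number of cepstral coefficients are assumptions, not the paper's exact settings).

    import numpy as np
    import librosa

    def extract_mfcc_features(wav_path, n_mfcc=13):
        y, sr = librosa.load(wav_path, sr=8000)        # corpus is sampled at 8 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=20,
                                    n_fft=256, hop_length=128, window='hamming')
        d_mfcc = librosa.feature.delta(mfcc)           # dMFCC (first derivative)
        dd_mfcc = librosa.feature.delta(mfcc, order=2) # ddMFCC (second derivative)
        return np.vstack([mfcc, d_mfcc, dd_mfcc])      # (3 * n_mfcc) x n_frames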

LFPC can be regarded as a model that follows the varying auditory resolving power of the human ear at different frequencies. In [7], it is found that short-time LFPC gives better performance for recognition of speech emotion than LPCC and MFCC. A possible reason is that LFPC is more suitable for preserving fundamental frequency information in the lower-order filters. The computation of short-time LFPC can be found in [7]. For a set of 20 bandpass filters, the bandwidth and the center frequency of the first filter are set to 54 Hz and 127 Hz, respectively.
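One possible construction of such a filter bank is sketched below in Python. The geometric growth of the bandwidths and the way the growth factor is chosen so that the last band reaches the Nyquist frequency are assumptions made for illustration, not the exact filter design of [7].

    import numpy as np
    from scipy.optimize import brentq

    def lfpc_band_edges(sr=8000, n_filters=20, first_bw=54.0, first_center=127.0):
        # First filter: 54 Hz wide, centred at 127 Hz; later bandwidths grow
        # geometrically so that the last band ends at the Nyquist frequency.
        low = first_center - first_bw / 2.0
        span = sr / 2.0 - low
        grow = lambda a: first_bw * (a ** n_filters - 1.0) / (a - 1.0) - span
        alpha = brentq(grow, 1.0001, 2.0)              # geometric growth factor
        widths = first_bw * alpha ** np.arange(n_filters)
        return low + np.concatenate(([0.0], np.cumsum(widths)))

    def lfpc(power_spectrum, freqs, edges):
        # Log power of one analysis frame in each band; power_spectrum and freqs
        # can come from, e.g., np.abs(np.fft.rfft(frame))**2 and np.fft.rfftfreq.
        coeffs = [np.log(power_spectrum[(freqs >= lo) & (freqs < hi)].sum() + 1e-10)
                  for lo, hi in zip(edges[:-1], edges[1:])]
        return np.array(coeffs)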

Perceptual linear prediction (PLP) is a combination of DFT and LP techniques. The analysis steps for PLP are critical-band warping and averaging, equal-loudness pre-emphasis, transformation according to the intensity-loudness power law, and all-pole modeling. The all-pole model parameters are converted to cepstral coefficients, which are liftered to approximately whiten the features. For a model of order p we use p = 20 cepstral coefficients.

The word RASTA stands for RelAtive SpecTrAl technique [8]. This technique is an improvement of the traditional PLP method and consists of a special filtering of the different frequency channels of a PLP analyzer. This filtering is done to make speech analysis less sensitive to slowly changing or steady-state factors in speech. The RASTA method replaces the conventional critical-band short-term spectrum in PLP with a less sensitive spectral estimate.
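The core of this filtering step can be sketched as follows. The band-pass coefficients follow the widely circulated rastamat reference implementation of [8]; the reference code also handles filter start-up more carefully than this minimal Python sketch.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_bands):
        # log_bands: compressed critical-band spectrum, shape (n_bands, n_frames).
        # Each band trajectory is band-pass filtered along time to suppress
        # slowly varying (e.g., channel) components.
        numer = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # ~ time derivative (FIR part)
        denom = np.array([1.0, -0.94])                 # leaky integration (IIR part)
        return lfilter(numer, denom, log_bands, axis=1)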

3 Feature Selection and Combination

Despite the successful deployment of speech recognition applications, there are circumstances that present severe challenges to current recognizers, for instance background noise, reverberation, fast or slow speech, unusual accents, etc. In the huge body of published research there are many reports of success in mitigating individual problems, but fewer techniques that are of help in multiple different conditions. What is needed is a way to combine the strengths of several different approaches into a single system.


Fig. 1. Features ranking

Fig. 2. The block diagram of the proposed system

Feature combination is a well-known technique [9]. During feature extraction in a speech recognition system, we typically find that each feature type has particular circumstances in which it excels, and this has motivated our investigation of combining separate feature streams into a single emotional speech recognition system. For feature combination, all possible feature streams are concatenated. Due to the highly redundant information in the concatenated feature vector, forward feature selection (FFS) or backward feature selection (BFS) should be carried out to extract only the most representative features, thereby orthogonalizing the feature vector and reducing its dimensionality. Figure 1 is a two-dimensional plot of the 9 features ranked with the nearest class mean classifier by forward selection and backward elimination; features near the origin are considered to be more important. The resulting feature vector is then processed as for a regular full-band recognizer. The block diagram of the Mandarin emotional speech recognition system with feature combination is shown in Fig. 2.
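The greedy forward selection over feature streams can be sketched as follows (a Python sketch using scikit-learn in place of the original MATLAB implementation; streams is assumed to be a dictionary mapping a stream name such as 'MFCC' to one fixed-length vector per utterance, e.g., per-utterance statistics of the frame-level features).

    import numpy as np
    from sklearn.neighbors import NearestCentroid
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def forward_feature_selection(streams, labels):
        # At each step, add the stream that most improves leave-one-out accuracy
        # of a nearest class mean classifier; stop when no stream helps.
        selected, best_score = [], 0.0
        remaining = list(streams)
        while remaining:
            scored = []
            for name in remaining:
                feats = np.hstack([streams[s] for s in selected + [name]])
                acc = cross_val_score(NearestCentroid(), feats, labels,
                                      cv=LeaveOneOut()).mean()
                scored.append((acc, name))
            acc, name = max(scored)
            if acc <= best_score:
                break
            selected.append(name)
            remaining.remove(name)
            best_score = acc
        return selected, best_score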

4 Experimental Results

In order to carry out the experiments, we constructed an acted speech corpus recorded from two Chinese speakers, one male and one female, in 8-bit PCM with a sampling frequency of 8 kHz. To provide reference data for the automatic classification experiments, we performed human listening tests with two other listeners. Only those data that had complete agreement between them were chosen for the experiments reported in this paper. From the evaluation result, anger is the most difficult category to discriminate in our database, having the lowest accuracy rate. Finally, we obtained 637 utterances: 95 angry, 115 bored, 113 happy, 150 neutral, and 164 sad.

The Mandarin emotion recognition system was implemented in MATLAB on a desktop PC. The correct recognition rate was evaluated using leave-one-out (LOO) cross-validation, a method for estimating the predictive accuracy of a classifier.
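In leave-one-out cross-validation, each utterance is classified by a model trained on all remaining utterances, and the accuracy is the fraction classified correctly. A minimal sketch follows (scikit-learn and NumPy are assumed; as noted above, the original system was implemented in MATLAB).

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import NearestCentroid

    def loo_accuracy(features, labels, make_classifier=NearestCentroid):
        # features: (n_utterances, n_dims) array, labels: (n_utterances,) array
        correct = 0
        for train_idx, test_idx in LeaveOneOut().split(features):
            clf = make_classifier().fit(features[train_idx], labels[train_idx])
            correct += int(clf.predict(features[test_idx])[0] == labels[test_idx][0])
        return correct / len(labels)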

The task of the classifier component of a full system is to use the feature vector provided by the feature extractor to assign the object to a category. To recognize emotions in speech, we tried the following approaches: a minimum distance classifier with Euclidean distance, a minimum distance classifier with Manhattan distance, and a nearest class mean classifier.

Table 1. Experimental results (%) with each feature stream

Feature     Minimum distance   Minimum distance   Nearest class
            (Euclidean)        (Manhattan)        mean
F1~F3            32.81              32.52             36.44
LPC              63.67              60.90             59.18
LPCC             78.71              78.00             82.05
MFCC             71.44              73.50             92.62
dMFCC            65.70              65.03             46.13
ddMFCC           52.37              49.92             29.56
LFPC             62.75              64.96             80.64
PLP              67.00              68.58             77.24
RASTA-PLP        56.65              60.13             78.43

Table 2. The best accuracy and the combined feature streams in different classifiers

Classifier          Minimum distance (Euclidean)       Minimum distance (Manhattan)       Nearest class mean
Best accuracy (%)   79.98                              81.03                              99.44
Combined features   LPCC, MFCC, LPC, LFPC, RASTA-PLP   LPCC, MFCC, LPC, LFPC, RASTA-PLP   MFCC, LPCC, RASTA-PLP, LFPC


The nearest class mean classifier is a simple classification method that assigns an unknown sample to a class according to the distance between the sample and each class's mean. The class mean, or centroid, is calculated as follows:

m_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{i,j}    (2)

where x_{i,j} is the j-th sample from class i and n_i is the number of training samples in class i. An unknown sample with feature vector x is classified as class i if it is closer to the mean vector of class i than to any other class's mean vector. Rather than calculating the distance between the unknown sample and the mean of every class, the minimum distance classifier computes the Euclidean or Manhattan distance between the unknown sample and each training sample.
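Both decision rules can be written compactly with NumPy (a minimal sketch; train_x, train_y and test_x are assumed to be NumPy arrays of per-utterance feature vectors and labels).

    import numpy as np

    def nearest_class_mean(train_x, train_y, test_x):
        # Assign each test vector to the class whose mean (Eq. 2) is closest.
        classes = np.unique(train_y)
        means = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
        dists = np.linalg.norm(test_x[:, None, :] - means[None, :, :], axis=2)
        return classes[dists.argmin(axis=1)]

    def minimum_distance(train_x, train_y, test_x, metric='euclidean'):
        # Assign each test vector the label of its nearest training sample.
        diff = test_x[:, None, :] - train_x[None, :, :]
        if metric == 'euclidean':
            dists = np.sqrt((diff ** 2).sum(axis=2))
        else:                                  # 'manhattan' (city-block) distance
            dists = np.abs(diff).sum(axis=2)
        return train_y[dists.argmin(axis=1)]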

The experimental results for the individual feature streams are given in Table 1, and the best accuracy of feature combination for each classification method is given in Table 2.

5 Conclusions

We express our emotions in three main ways: the words that we use, facial expression, and intonation of the voice. Whereas research on automated recognition of emotions in facial expressions is now very rich, research dealing with the speech modality, both for automated production and recognition by machines, has been active for only a few years and has dealt almost exclusively with English.

Some of the state-of-the-art ASR systems employ multi-stream processing, in which several data streams are processed in parallel before their information is recombined at a later point. Combination of the different streams can be carried out either before or after acoustic modeling, i.e., on the feature level or on the decision level.

In this paper, feature selection and feature combination were used to perform an exhaustive search for the best feature vector combination. For each feature combination, the classifier performance was tested by means of the leave-one-out method. Five basic emotions, anger, boredom, happiness, neutral and sadness, are investigated. The experimental results show that MFCC and LPCC play the role of primary features, and the best feature combination in our proposed system is MFCC, LPCC, RastaPLP and LFPC. The highest accuracy of 99.44% is achieved with the nearest class mean classifier. Contrary to [7], LPCC and MFCC achieve better performance for recognition of speech emotion than short-time LFPC.

In the future, it will be necessary to collect more acted and spontaneous speech to test the robustness of the best feature combination. Furthermore, it might be useful to measure the confidence of the decision after performing classification; based on a confidence threshold, a classification result could be labeled as reliable or not, and unreliable results could, for example, be further processed by a human. Finally, combining the information of different features and diverse classifiers in an emotion recognition system remains a challenge for our future work.

Acknowledgement

The authors would like to thank the National Science Council (NSC) for financially supporting this research under NSC project No. NSC 93-2213-E-036-023.

References

1. Rabiner, L.R. and Juang, B.H., Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ, 1993.
2. Lee, C.M. and Narayanan, S., "Towards detecting emotion in spoken dialogs," IEEE Trans. on Speech and Audio Processing, in press.
3. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J., "Emotion Recognition in Human-Computer Interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32-80, Jan. 2001.
4. Litman, D. and Forbes, K., "Recognizing Emotions from Student Speech in Tutoring Dialogues," in Proceedings of ASRU'03, 2003.
5. Banse, R. and Scherer, K.R., "Acoustic profiles in vocal emotion expression," Journal of Personality and Social Psychology, pp. 614-636, 1996.
6. Le, X.H., Quenot, G., and Castelli, E., "Recognizing emotions for the audio-visual document indexing," in Proceedings of the IEEE Symposium on Computers and Communications (ISCC), 2004, pp. 580-584.
7. Nwe, T.L., Foo, S.W., and De Silva, L.C., "Speech Emotion Recognition using Hidden Markov Models," Speech Communication, 2003.
8. Hermansky, H. and Morgan, N., "RASTA Processing of Speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, October 1994.
9. Ellis, D.P.W., "Stream combination before and/or after the acoustic model," in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2000), 2000.
