IEEE Transactions on Consumer Electronics, Vol. 55, No. 3, AUGUST 2009
Feature Vector Classification based Speech Emotion Recognition for Service Robots

Jeong-Sik Park, Ji-Hwan Kim, Member, IEEE, and Yung-Hwan Oh, Member, IEEE

Abstract — This paper proposes an efficient feature vector classification for Speech Emotion Recognition (SER) in service robots. Since service robots interact with diverse users who are in various emotional states, two important issues should be addressed: acoustically similar characteristics between emotions and variable speaker characteristics due to different user speaking styles. Each of these issues may cause a substantial amount of overlap between emotion models in feature vector space, thus decreasing SER accuracy. In order to reduce the effects caused by such overlaps, this paper proposes an efficient feature vector classification for SER. The conventional feature vector classification applied to speaker identification categorizes feature vectors as overlapped and non-overlapped. Because this method discards all of the overlapped vectors in model reconstruction, it has limitations in constructing robust models when the number of overlapped vectors is significantly increased, such as in emotion recognition. The method proposed herein classifies overlapped vectors in a more sophisticated manner, selecting discriminative vectors among the overlapped vectors and adding those vectors in model reconstruction. In SER experiments using an emotional speech corpus, the proposed classification approach exhibited superior performance to conventional methods and displayed almost human-level performance. In particular, we achieved commercially applicable performance for two-class (negative vs. non-negative) emotion recognition.

Index Terms — Speech emotion recognition, Feature vector classification, Service robot.

I. INTRODUCTION

Service robots are robots that operate autonomously to perform services useful to the well-being of humans. Unlike industrial robots, which are typically found in manufacturing environments, service robots interact with a great number of users in a variety of places, from hospitals to homes and from restaurants to offices. As design and implementation breakthroughs in the field of service robotics follow one another rapidly, consumers are beginning to take an interest in these robots. A great variety of service robots are being developed to perform human tasks such as assisting in the care of people. In order to coexist in a human's daily life and offer services in accordance with a person's intentions, service robots should be able to understand human emotional states. The ability to recognize human emotions promptly and correctly would allow service robots to provide users with different services or responses according to the user's emotional state. For example, if a nursing robot could continuously monitor the emotional state of its patients, it could better provide them with proper health-care services and even cope with unexpected emergencies more rapidly.

Human emotional information can be obtained from speech, facial expressions, gestures, and so forth. Among these indicators, human speech provides a natural and intuitive interface for interaction with robots [1][2]. For this reason, Speech Emotion Recognition (SER) technology is essential for human-robot interaction, along with speech recognition technology, which has already been employed for human-robot interaction as in [3]. While speech recognition tasks obtain the explicit message ('what the speaker said') from speech, SER tasks focus on the implicit message ('how the speaker spoke').

Many researchers have tried to create an exact definition of emotion, but unfortunately, the only conclusion that has been drawn is that emotion is not easy to define or to understand [4][5]. Because of this ambiguity in defining emotion, SER is not an easy task even for human listeners. Several speech emotion studies have performed human listening tests to see how accurately humans perceive the emotion in speech [6]-[8]. Most results have shown that participants frequently fail to perceive the emotional state of the speaker. According to [7], human classification accuracy for six kinds of emotional states was no higher than 70%, even when the speech data contained emotional context information. As will be presented in Section IV, the human listening test conducted for this research showed similar results. These poor results of human listening tests indicate that each emotion has acoustically similar characteristics to those of other emotions.

Speaker and language characteristics can also affect SER [9]. While experiencing the same emotion, speakers may speak differently according to their own speaking styles. In addition, the expression of emotion in speech can be affected by language characteristics, since national customs or culture can create distinctive speaking styles, such as in tonal languages. Since public service robots, such as nursing robots or guidance robots, interact and communicate with people of different accents, ages, genders, education, and so forth, they should be able to reduce the effects caused by the above characteristics.

This work was supported in part by a Research Grant from the Defense Acquisition Program Administration and the Agency for Defense Development under contract, as well as by the Sogang University Research Grant of 200701117.01. Jeong-Sik Park and Yung-Hwan Oh are with the Computer Science Division, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea (e-mail: {dionpark, yhoh}@speech.kaist.ac.kr). Ji-Hwan Kim is with the Dept. of Computer Science and Engineering, Sogang University, Seoul, Korea (e-mail: [email protected]). Manuscript received July 8, 2009.


This paper proposes a robust SER technique that minimizes two important effects reducing system performance: the acoustic similarity between emotions due to the ambiguity in the definition of emotions, and the performance dependency on speaker characteristics. We apply a feature vector classification scheme to a GMM (Gaussian Mixture Model)-based SER system. A feature vector classification scheme normally categorizes feature vectors into two groups: vectors that are acoustically discriminative between emotion GMMs and non-discriminative vectors. The GMMs are then reconstructed using only the discriminative vectors, based on the assumption that the non-discriminative vectors are generated by non-speech segments, such as silence and environmental noise, or by speech segments whose corresponding models have acoustically similar characteristics to other models.

The conventional feature vector classification was applied to speaker identification, where feature vectors are categorized as overlapped and non-overlapped vectors [10]. Because this method discards all of the overlapped vectors in model reconstruction, it has limitations in constructing robust models when the number of overlapped vectors increases significantly, as in emotion recognition. Due to this sparseness of training vectors, the reconstructed models may be trained insufficiently. Therefore, a more sophisticated feature vector classification method is needed in order to construct reliable speech emotion models.

This paper is organized as follows. In Section II, previous work related to this study is presented. Section III describes the conventional feature vector classification and the proposed classification method. In Section IV, the experimental setup and results are presented and discussed. The paper concludes in Section V.

II. SPEECH EMOTION RECOGNITION

Most research on SER has concentrated on feature-based and classification-based approaches [11]-[13]. Feature-based approaches aim at analyzing speech signals and effectively estimating feature parameters that represent human emotional states. Many kinds of acoustic features have been introduced and verified. According to [11] and [12], short-term acoustic features such as pitch, energy, duration, and MFCCs (Mel-Frequency Cepstral Coefficients) are frequently used in SER systems. Classification-based approaches focus on designing a classifier that determines distinctive boundaries between emotions. The boundary is used as a criterion to decide the emotional state of a given utterance. Various techniques already in use for conventional pattern classification problems are likewise used for such emotion classifiers. Representative classifiers are the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM), the Support Vector Machine (SVM), and the Artificial Neural Network (ANN). Other researchers have investigated the relationship between languages and emotional characteristics. They emphasize the necessity of multilingual emotional data collection, since each language has a different way of expressing emotional states. Most current research on SER is performed on distinct language databases [11][14].


Each of the issues explained above can influence the performance of a SER system. A number of studies have compared the efficiency of various acoustic features along with classification techniques. In [12] and [15], GMM-based classifiers were found to be better suited to the task of classifying emotions using short-term acoustic features than other static discriminative classifiers such as the SVM. Based on those results, this study concentrates on a GMM-based SER approach using short-term acoustic features.

A. GMM-based Speech Emotion Recognition for Service Robots

GMM-based speech emotion recognition is composed of an emotion model training phase and an emotion recognition phase. As shown in Fig. 1, a service robot receives robot-directed speech as input for its internal SER system as it communicates with its user. The speech signal is passed through a pre-processing stage that normalizes amplitude, reduces noise, and extracts feature vectors.

Fig. 1. Speech emotion recognition procedure for a service robot (speech communication → pre-processing: volume normalization, noise processing, feature vector extraction → speech emotion recognition: log-likelihood estimation based on GMMs, emotion decision → provide services according to the emotional state of the user)
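As a rough sketch of the pre-processing stage in Fig. 1, the fragment below performs only the volume-normalization step (peak normalization); the function name and the target level of 0.9 are our own illustrative choices, and noise processing and feature extraction are handled separately.

```python
import numpy as np

def normalize_volume(waveform: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Scale the input waveform so that its absolute peak equals target_peak."""
    peak = np.max(np.abs(waveform))
    if peak == 0.0:
        return waveform  # silent input: nothing to scale
    return waveform * (target_peak / peak)
```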

The next modules perform emotion recognition based on GMMs. The GMMs try to identify the emotion type of input utterances during the recognition phase according to the following procedures. For a sequence of feature vectors X = {x_1, ..., x_T}, which are extracted from an input utterance, the log-likelihood on each GMM model λ_i (i = 1,...,E, if there are E emotions) is computed as (1).

    log P(X | λ_i) = Σ_{t=1}^{T} log P(x_t | λ_i)    (1)

Then, the model with the maximum log-likelihood for the given input utterance is selected as the recognition result. Based on the type of the recognized emotion, the robot provides users with different services or responses. The purpose of the training phase is to construct a reliable GMM for each emotion. Once feature vectors are extracted from the training data (i.e., speech data collected for each emotional state), the model parameters (the mean and the variance of each Gaussian distribution) are estimated using the iterative Expectation-Maximization (EM) algorithm [16]. Fig. 2 illustrates the training procedure for the standard GMM approach.


Fig. 2. Training procedure for the standard GMM-based SER (speech data for each emotion → feature vector extraction → statistical modeling using the EM algorithm → emotion GMM)
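The training and recognition phases above can be sketched as follows; this is an illustration, not the authors' code, and it assumes scikit-learn's GaussianMixture as the GMM/EM implementation and frame-level feature arrays of shape (number of frames, feature dimension). The number of mixture components is a free parameter here; the paper does not report the value it used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_emotion_gmms(train_data: dict, n_mixtures: int = 8) -> dict:
    """Training phase: fit one GMM per emotion with the EM algorithm."""
    models = {}
    for emotion, frames in train_data.items():      # frames: (num_frames, dim)
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag')
        models[emotion] = gmm.fit(frames)
    return models

def recognize_emotion(models: dict, utterance: np.ndarray) -> str:
    """Recognition phase: pick the emotion whose GMM maximizes eq. (1)."""
    # log P(X | lambda_i) = sum over frames of log P(x_t | lambda_i)
    scores = {e: m.score_samples(utterance).sum() for e, m in models.items()}
    return max(scores, key=scores.get)
```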

III. FEATURE VECTOR CLASSIFICATION FOR SPEECH EMOTION RECOGNITION

In the standard GMM approach, the interrelation between models critically affects the final decision. In particular, factors such as acoustically similar characteristics between emotions or environmental noise may generate overlaps between emotion models and thus induce decision errors. Usually, the more emotions a recognition system includes, the greater the amount of overlap. Hence, it is important to mitigate the effects of these overlaps.

A. Conventional Feature Vector Classification for Speaker Identification

A feature vector classification method was proposed to reduce overlap effects in speaker identification [10]. The method classifies training vectors (i.e., feature vectors used for model training) into two categories: overlapped and non-overlapped vectors. Once the standard GMMs are built, each training vector is checked to see whether it attains its maximum log-likelihood on the corresponding model. If a vector attains its maximum log-likelihood on one of the other models, it is regarded as an overlapped vector; otherwise, it is regarded as a non-overlapped vector. This approach assumes that overlapped vectors are generated by non-speech segments, such as silence and environmental noise, or by speech segments whose corresponding speaker models have acoustically similar characteristics to other speaker models. After classification, the speaker models are reconstructed using only the non-overlapped vectors.
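A minimal sketch of this conventional classification, reusing the hypothetical helpers from the previous sketch: a training vector is kept only if its own class model yields the highest log-likelihood, and each model is then re-estimated from the kept vectors alone.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def conventional_classification(models: dict, train_data: dict) -> dict:
    """Keep only non-overlapped training vectors for each class."""
    kept = {}
    labels = list(models)
    for label, frames in train_data.items():
        # per-frame log-likelihoods under every model: (num_models, num_frames)
        ll = np.stack([models[m].score_samples(frames) for m in labels])
        best = np.argmax(ll, axis=0)
        kept[label] = frames[best == labels.index(label)]   # overlapped vectors dropped
    return kept

def reconstruct_models(kept: dict, n_mixtures: int = 8) -> dict:
    """Rebuild each model using only its non-overlapped vectors."""
    return {label: GaussianMixture(n_components=n_mixtures,
                                   covariance_type='diag').fit(vectors)
            for label, vectors in kept.items()}
```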

Fig. 3. Speaker model training based on the conventional feature vector classification (speaker data → feature vector extraction → speaker GMM construction → feature vector classification → GMM reconstruction using non-overlapped vectors → reconstructed GMM)

Fig. 3 shows the procedure for speaker model training based on the conventional feature vector classification. The conventional method showed good experimental results in speaker identification.

B. Problems of the Conventional Classification in Speech Emotion Recognition

In SER, emotion decisions for speech data may be ambiguous in many cases. This domain-oriented ambiguity substantially increases the number of overlapped vectors. We carried out an experiment to measure the composition ratio of overlapped vectors, in order to verify whether the conventional feature vector classification is applicable to emotion recognition. Using emotional speech utterances, we constructed ten emotion GMMs, each with one state and one mixture, based on the type of the corresponding emotion data. We then investigated the number of overlapped vectors for five different emotions in two cases: 'five models' and 'ten models'. In the former case, five models represent five different emotions (boredom, anger, happy, neutral, and sadness); in the latter case, another five emotion models (anxiety, despair, panic, shame, and pride) are added.

Fig. 4 presents the composition ratio of overlapped vectors when the conventional classification method is applied to SER. The result shows that increasing the number of emotion models increases the number of overlapped vectors considerably. For example, more than 50% of the feature vectors from the 'boredom' and 'neutral' emotions were regarded as overlapped vectors when the number of emotion models was increased to ten. In other words, less than 50% of the feature vectors are used in model training. This sparseness of training vectors degrades the recognition accuracy: the more emotions are included, the more vectors are discarded in GMM reconstruction. As a result, the reconstructed emotion models are trained insufficiently. Therefore, it is necessary to devise a more sophisticated feature vector classification method in order to construct robust emotion models.

Fig. 4. Composition ratio (%) of overlapped vectors when the conventional feature vector classification is applied to emotion recognition (x-axis: emotion type (boredom, anger, happy, neutral, sadness); y-axis: ratio (%); bars: 'five models' vs. 'ten models')
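The measurement reported in Fig. 4 can be reproduced with the conventional-classification sketch given earlier: build the five- or ten-emotion GMM set, classify each emotion's training vectors, and report the discarded fraction. The helper names are the hypothetical ones introduced above.

```python
def overlapped_ratio(models: dict, train_data: dict) -> dict:
    """Percentage of each emotion's training vectors regarded as overlapped."""
    kept = conventional_classification(models, train_data)
    return {emotion: 100.0 * (1.0 - len(kept[emotion]) / len(frames))
            for emotion, frames in train_data.items()}
```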

C. Advanced Feature Vector Classification for Model Construction in Speech Emotion Recognition

Several previous studies on emotion recognition show that when an emotion is misrecognized, it tends to be recognized as another specific emotion [12][17]. In general, some attributes of emotional speech seem to be associated with characteristics common to several emotions rather than with individual categories. For example, anger, fear, joy, and surprise are associated with more highly activated characteristics in the acoustic domain than other emotions such as disgust, sadness, and boredom. This similarity of acoustic characteristics for certain emotions implies that they can easily be mistaken for one another. Such a tendency can be considered in relation to overlapped vectors.


As already mentioned, a large number of vectors are regarded as overlapped vectors in the speech emotion domain due to this domain-oriented ambiguity. We believe that a substantial portion of the overlapped vectors are caused by acoustically similar characteristics between two models that form such frequently confused pairs. Thus, we assume that such overlapped vectors retain their discriminative emotional information. In the conventional method, only non-overlapped vectors are treated as discriminative vectors and all of the overlapped vectors are discarded. In our method, referred to as 'advanced feature vector classification', we carefully expand the set of discriminative vectors to include those overlapped vectors that, according to our assumption, preserve discriminative information.

Fig. 5. Procedures for model construction based on advanced feature vector classification (feature vectors of training data → N-best log-likelihood estimation on the emotion models (standard GMMs) → 1st classification: non-overlapped feature vector? If yes, emotion model reconstruction; if no → 2nd classification: non-discriminative feature vector? If no, emotion model reconstruction; if yes, overlap model construction)

Fig. 5 illustrates the procedure of our advanced feature vector classification. First, we build each emotion model based on the standard GMM approach. Next, every feature vector is checked to determine whether it is correctly recognized as its corresponding emotion, based on the log-likelihood estimated on each emotion model. If the vector is closer to its corresponding model than to the other models, we designate it as a discriminative vector as well as a non-overlapped vector. The procedure then classifies the overlapped vectors. As mentioned above, some of the overlapped vectors still preserve discriminative information if their corresponding models have acoustically similar characteristics to a specific emotion model. We designate such vectors as discriminative vectors of a second type, whereas the remaining overlapped vectors are regarded as non-discriminative vectors. Finally, the vectors classified as discriminative are used for emotion model reconstruction, while the non-discriminative vectors are used to construct overlap models.

The above mechanism, which classifies feature vectors into three categories, can be expressed as follows. Suppose there are E standard emotion GMMs λ_e (e = 1,...,E), and let x_{e,t} (t = 1,...,T_e) be one of the T_e feature vectors used to construct the standard GMM λ_e.

After constructing the standard GMMs, we obtain the N-best log-likelihood results of each x_{e,t} over the E emotion models. Let R_r(x_{e,t}) denote the emotion model index (ranging from 1 to E) at the r-th rank of the N-best log-likelihood list obtained from all E emotion models for a given x_{e,t}. Clearly, x_{e,t} is a non-overlapped vector (and hence a discriminative vector) if R_1(x_{e,t}) = e. Otherwise, x_{e,t} is an overlapped feature vector. If R_2(x_{e,t}) = e, we still need to verify whether this overlapped vector is discriminative. We consider that x_{e,t} still preserves the characteristics of its corresponding emotion if the difference between the log-likelihoods from R_1(x_{e,t}) and R_2(x_{e,t}) is smaller than the difference between the log-likelihood from R_2(x_{e,t}) and the average log-likelihood from the remaining models, as stated in (2):

    log P(x_{e,t} | λ_{R_1(x_{e,t})}) − log P(x_{e,t} | λ_{R_2(x_{e,t})}) < log P(x_{e,t} | λ_{R_2(x_{e,t})}) − ε    (2)

where ε = (1/(E−2)) Σ_{k=3}^{E} log P(x_{e,t} | λ_{R_k(x_{e,t})}). If x_{e,t} satisfies (2), it is regarded as a discriminative feature vector. Based on this mechanism, the feature vectors extracted from the training data are classified into three categories, summarized as follows:

• x_{e,t}: the t-th feature vector used in constructing the standard GMM for the e-th emotion, t = 1,...,T_e
• R_r(x_{e,t}): the emotion model index at the r-th rank of the N-best log-likelihood list obtained from all E emotion models for a given x_{e,t}, r = 1,...,E

i) If R_1(x_{e,t}) = e, then x_{e,t} → discriminative feature vector.
ii) Otherwise, if R_2(x_{e,t}) = e and (2) is satisfied, then x_{e,t} → discriminative feature vector.
iii) Otherwise, x_{e,t} → non-discriminative feature vector.

N-best and log-likelihood based verification such as (2) is also widely applied to confidence measures for utterance verification [18]. In the case of only two emotions, ε in (2) is undefined (E − 2 = 0), so the right-hand side of (2) is determined empirically according to the pair of emotions.
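A sketch of the three-way rule above for the training vectors of a single emotion; it assumes the per-frame log-likelihoods under all E standard GMMs have already been computed (e.g., with score_samples as in the earlier sketches) and that E > 2, so that ε in (2) is defined. The vectors flagged True would re-train the emotion GMM, and the rest would train the overlap model of Fig. 5.

```python
import numpy as np

def advanced_classification(loglik: np.ndarray, own_index: int) -> np.ndarray:
    """Flag each training vector of one emotion as discriminative (True) or not.

    loglik[t, i] = log P(x_t | lambda_i) for the t-th vector of this emotion
    under the i-th of the E standard emotion GMMs; own_index is this emotion.
    """
    num_vectors, E = loglik.shape
    ranked = np.argsort(-loglik, axis=1)            # model indices, best first
    discriminative = np.zeros(num_vectors, dtype=bool)
    for t in range(num_vectors):
        r = ranked[t]
        if r[0] == own_index:                       # case i): non-overlapped
            discriminative[t] = True
        elif r[1] == own_index:                     # case ii): apply criterion (2)
            top1, top2 = loglik[t, r[0]], loglik[t, r[1]]
            eps = loglik[t, r[2:]].mean()           # average over the remaining E-2 models
            discriminative[t] = (top1 - top2) < (top2 - eps)
        # case iii): all other overlapped vectors stay non-discriminative
    return discriminative
```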



D. Emotion Recognition based on Advanced Feature Vector Classification

After the feature vectors are classified, we construct two models for each emotion. One is a reconstructed emotion model built from the discriminative vectors, and the other is an overlap model built from the non-discriminative vectors. We use the overlap models as garbage models to exclude non-discriminative vectors contained in a test utterance. Fig. 6 demonstrates the procedure for emotion recognition based on the reconstructed emotion models and overlap models. In the recognition phase, every feature vector extracted from a given test utterance is classified into two categories based on the reconstructed emotion models and the overlap models. If a vector attains its maximum log-likelihood on one of the overlap models, the vector is regarded as a non-discriminative vector and discarded from the recognition process. Otherwise, the vector is a discriminative vector that preserves the corresponding emotion characteristics. Next, we estimate the Maximum Log-Likelihood (MLL) using the set of discriminative feature vectors and decide an emotion ê as follows:

    ê = argmax_e log P(D | λ'_e),  e = 1,...,E    (3)

where log P(D | λ'_e) = (1/F) Σ_{f=1}^{F} log P(d_f | λ'_e), λ'_e denotes the reconstructed GMM for the e-th emotion, D is the set of feature vectors regarded as discriminative for the given test utterance, and F is the number of vectors in D.

Fig. 6. Procedures for emotion recognition based on advanced feature vector classification (feature vectors of a test utterance → log-likelihood estimation on the reconstructed emotion models and overlap models → maximum log-likelihood on one of the reconstructed models? If yes, discriminative vector, used in MLL estimation; if no, non-discriminative vector, discarded → emotion decision)
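Under the same assumptions as the earlier sketches (reconstructed and overlap GMMs per emotion, each exposing score_samples), the recognition rule in (3) can be outlined as follows; the fallback for the case where every vector is discarded is our own guard and is not part of the paper.

```python
import numpy as np

def recognize_with_overlap_models(reconstructed: dict, overlap: dict,
                                  utterance: np.ndarray) -> str:
    """Discard non-discriminative test vectors, then decide the emotion via eq. (3)."""
    emotions = list(reconstructed)
    rec_ll = np.stack([reconstructed[e].score_samples(utterance) for e in emotions])
    ovl_ll = np.stack([overlap[e].score_samples(utterance) for e in emotions])
    # a vector is kept only if its best score comes from a reconstructed model
    keep = rec_ll.max(axis=0) >= ovl_ll.max(axis=0)
    D = rec_ll[:, keep] if keep.any() else rec_ll   # guard: keep all if none survive
    avg_ll = D.mean(axis=1)                         # (1/F) * sum_f log P(d_f | lambda'_e)
    return emotions[int(np.argmax(avg_ll))]
```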

IV. SPEECH EMOTION RECOGNITION EXPERIMENTS

We performed emotion recognition experiments on emotional speech data obtained from the Emotional Prosody Speech and Transcripts corpus of the LDC [19]. This corpus consists of speech data recorded in a clean environment by seven professional actors who express emotions while reading short phrases of dates and numbers. Speech data from four speakers were used for model training and speech data from three speakers were used for testing. We used pitch, log energy, zero-crossing rate, and 12-dimensional MFCCs as feature vectors. All vectors were computed over 40 ms frames with a Hamming window shifted by 10 ms. Our experiments confirmed that 40 ms is the minimum duration required to estimate reliable emotion characteristics, particularly pitch information. This duration was also used in [20].

Table I lists the sets of emotions. We chose 'anger' and 'neutral' as the most representative negative and non-negative emotions, respectively. Then, 'happy' was added for the three-class emotion recognition tests, because 'happy' is the emotion most opposite to 'anger'. Next, we composed a five-class emotion set with the addition of 'boredom' and 'sadness', as these are the categories generally used for five-class emotion recognition tasks [12][15][20]. Finally, we appended 'anxiety', 'despair', 'panic', 'shame', and 'pride' to form a ten-class emotion set.

TABLE I
EMOTION TYPES ACCORDING TO THE NUMBER OF CLASSIFICATION CATEGORIES

No. of emotions | Emotion types
2  | anger, neutral
3  | anger, neutral, happy
5  | anger, neutral, happy, boredom, sadness
10 | anger, neutral, happy, boredom, sadness, anxiety, despair, panic, shame, pride
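A sketch of the frame-level feature extraction described above (40 ms Hamming-windowed frames with a 10 ms shift; pitch, log energy, zero-crossing rate, and 12 MFCCs). It assumes librosa is available and a 16 kHz sampling rate; the paper does not name a toolkit, and the pitch search range is our own choice.

```python
import numpy as np
import librosa

def extract_features(wav_path: str) -> np.ndarray:
    """Return a (num_frames, 15) array: pitch, log energy, ZCR, 12 MFCCs."""
    y, sr = librosa.load(wav_path, sr=16000)            # assumed sampling rate
    frame, hop = int(0.040 * sr), int(0.010 * sr)       # 40 ms window, 10 ms shift

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=frame, hop_length=hop, window='hamming')
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
    log_energy = np.log(rms + 1e-10)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame, hop_length=hop)  # assumed 60-400 Hz pitch range

    n = min(mfcc.shape[1], zcr.shape[1], rms.shape[1], len(f0))
    feats = np.vstack([f0[None, :n], log_energy[:, :n], zcr[:, :n], mfcc[:, :n]])
    return feats.T                                       # frames as rows
```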

A. Experimental Results and Discussion

First, we performed the two-class (negative vs. non-negative) recognition experiments. In this experiment, we attempted to categorize the speech data using two emotions, anger and neutral. Table II shows the results. The proposed method, the Advanced Feature Vector Classification (AFVC), achieved a 96.9% recognition rate, which represents relative improvements of 51.6% and 40.4% (relative reductions in error rate; 3.3% and 2.1% in absolute terms) over the standard GMM approach and the Conventional Feature Vector Classification (CFVC), respectively. Based on these results, we conclude that our method can be applied to service robots as a commercialized negative vs. non-negative SER system.

TABLE II
PERFORMANCE (%) ON TWO-CLASS SER EXPERIMENTS

Emotion types                                | GMM  | GMM+CFVC | GMM+AFVC
anger (negative) vs. neutral (non-negative)  | 93.6 | 94.8     | 96.9

In order to verify the efficiency of our classification method, we investigated speech emotion recognition accuracy according to the number of emotions. Fig. 7 shows the results. As the number of emotions increased, so did the error rates. The AFVC method demonstrated superior performance to that of the CFVC method over all experimental sets. On the three-class experiment, the AFVC presented 20.1% and 10.7% relative improvements over the standard GMM approach and CFVC, respectively. On the five-class experiment, the AFVC method showed 11.9% and 9.4% relative improvements, respectively. It is interesting to observe that on the ten-class recognition experiment, the reconstructed GMM with CFVC yielded even lower accuracy than the standard GMM. Such a result was caused by an undesirable feature vector classification which discarded so many training vectors that the reconstructed emotion models were trained inadequately. On the other hand, our method provides a more sophisticated classification in the selection of discriminative vectors and their use in model reconstruction.

Fig. 7. Recognition accuracy (%) according to the number of emotions (x-axis: number of emotions (2, 3, 5, 10); y-axis: accuracy (%); curves: standard GMM, reconstructed GMM with CFVC, reconstructed GMM with AFVC)

We analyzed the composition ratio of non-discriminative feature vectors classified by CFVC and AFVC. We repeated the experiment for ten emotion models, using the procedure described in Section III.B. As presented in Fig. 8, for each emotion, about 40% of the vectors regarded as overlapped vectors by CFVC were determined to be discriminative vectors by AFVC. Therefore, the proposed approach maintains a relatively sufficient number of vectors for emotion model reconstruction.

Fig. 8. Composition ratio (%) of non-discriminative feature vectors classified based on CFVC and AFVC in ten-class emotion recognition (x-axis: emotion type (boredom, anger, happy, neutral, sadness); y-axis: ratio (%); bars: CFVC vs. AFVC)

Tables III and IV show the confusion matrices of five-class emotion recognition for CFVC and AFVC, respectively. In this experiment, the average recognition accuracy was 55.2% for the reconstructed GMM with CFVC and 59.4% with AFVC. The performance improved for all types of emotions, especially for 'happy' and 'anger' (in terms of relative improvement). Our experimental results demonstrate that the proposed method solves two problems that arise when the conventional classification method is applied to SER: sparseness of training vectors and insufficient model training. By selecting discriminative feature vectors among the overlapped vectors and including these vectors in model reconstruction, it is possible to construct more robust speech emotion models, thus reducing recognition errors due to acoustically similar characteristics between models.

TABLE III
CONFUSION MATRIX OF FIVE-CLASS EMOTION RECOGNITION BASED ON CFVC

        | boredom | anger | happy | neutral | sadness
boredom | 0.52    | 0.03  | 0.03  | 0.24    | 0.16
anger   | 0.02    | 0.62  | 0.33  | 0.07    | 0.04
happy   | 0.07    | 0.28  | 0.54  | 0.09    | 0.05
neutral | 0.24    | 0.06  | 0.06  | 0.48    | 0.15
sadness | 0.15    | 0.01  | 0.04  | 0.12    | 0.60

TABLE IV
CONFUSION MATRIX OF FIVE-CLASS EMOTION RECOGNITION BASED ON AFVC

        | boredom | anger | happy | neutral | sadness
boredom | 0.55    | 0.03  | 0.02  | 0.22    | 0.18
anger   | 0.03    | 0.69  | 0.29  | 0.06    | 0.04
happy   | 0.06    | 0.22  | 0.60  | 0.09    | 0.04
neutral | 0.20    | 0.05  | 0.05  | 0.52    | 0.13
sadness | 0.16    | 0.01  | 0.04  | 0.11    | 0.61
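As an aside (our own illustration, not from the paper), per-class confusion rates like those in Tables III and IV, and the averaged accuracy quoted in the text, can be computed from per-utterance reference and predicted labels as follows; normalization per reference class is one common convention.

```python
import numpy as np

def confusion_matrix(y_true: list, y_pred: list, labels: list):
    """Return the per-class-normalized confusion matrix and the mean of its diagonal."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for ref, hyp in zip(y_true, y_pred):
        counts[index[ref], index[hyp]] += 1
    row_sums = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    rates = counts / row_sums                                    # normalize each reference row
    return rates, float(np.mean(np.diag(rates)))                 # e.g., 0.594 for Table IV
```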

For service robots, the capability to recognize negative and non-negative emotions is important, because they will need to provide users with different services or responses depending on whether the user is experiencing a negative or non-negative emotional state. Based on our experimental results, it can be concluded that the proposed method effectively improves SER accuracy and that our SER system can be commercialized for two-class (negative vs. non-negative) emotion recognition. However, it remains to be considered whether this system is applicable to service robots for tasks with three or more emotion classes, which showed accuracies lower than 80% in our experiments.

We used only speech data as the information source for our experiments. Moreover, each speech sample is brief and consists only of numbers and dates, which do not include words conveying emotional content. Therefore, it is worthwhile to organize a human listening test to estimate the upper limit of the proposed automatic SER system. In the human listening test, 20 participants (native speakers of English) listened to samples of emotional speech chosen randomly from the test data and labeled each sample with one emotion from the provided set. Table V summarizes the performance comparison between the tested automatic SER systems and the human listening test. According to this table, listeners correctly recognize only about 97%, 80%, and 65% of emotions for the two-class, three-class, and five-class emotion recognition problems, respectively. In all cases, there is only a small difference between the results of the human listening test and those of our system based on the AFVC method.

TABLE V
PERFORMANCE (%) COMPARISON OF AUTOMATIC SER EXPERIMENTS AND HUMAN LISTENING TEST

Method               | Two-class SER | Three-class SER | Five-class SER
GMM                  | 93.6          | 70.6            | 53.9
GMM+CFVC             | 94.8          | 73.7            | 55.2
GMM+AFVC             | 96.9          | 76.5            | 59.4
Human Listening Test | 97.3          | 80.0            | 65.3



Although this experiment was performed with limited emotional speech data and a small number of listeners, the results show that our SER system performs nearly as well as the tested human participants, even for three-class and five-class emotion recognition. This experiment gives a general idea of the task-oriented difficulty of SER, and also indicates a strong possibility of applying our automatic SER system to human-machine interaction systems such as service robots.

V. CONCLUSIONS

This paper proposed an efficient feature vector classification scheme that selects discriminative feature vectors in a more sophisticated manner than the conventional method. We performed SER experiments using an LDC speech emotion corpus. The experimental results indicate that the proposed approach maintains a relatively sufficient number of vectors for emotion model reconstruction. Our classification approach exhibits superior performance to that of the conventional methods, achieving almost human-level discrimination performance. In particular, we achieved commercially applicable performance for two-class (negative vs. non-negative) emotion recognition from speech.

The experiments in this research used only short speech samples pronouncing words without any intrinsic emotional content. In future work, we will apply the proposed method to longer human-robot conversations with dialogue understanding, in order to achieve applicable performance for SER with three or more classes. In addition, using our SER system, we will investigate hybrid architectures for multimodal emotion recognition. Finally, we expect that the proposed feature vector classification will be applicable to other pattern recognition problems such as language identification or speaker identification.

REFERENCES

[1] B. Adams, C. Breazeal, R. Brooks, and B. Scassellati, "Humanoid robots: A new kind of tool," IEEE Intelligent Systems and Their Applications, vol. 15, no. 4, pp. 25-31, Jul. 2000.
[2] E. Kim, K. Hyun, S. Kim, and Y. Kwak, "Emotion interactive robot focus on speaker independently emotion recognition," Proc. Int. Conf. on Advanced Intelligent Mechatronics, pp. 1-6, Sep. 2007.
[3] Y. Oh, J. Yoon, J. Park, M. Kim, and H. Kim, "A name recognition based call-and-come service for human robots," IEEE Transactions on Consumer Electronics, vol. 54, no. 2, pp. 247-253, May 2008.
[4] M. Richins, "Measuring emotions in the consumption experience," Journal of Consumer Research, vol. 24, pp. 127-146, 1997.
[5] R. Cowie, E. Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32-80, 2001.
[6] M. Shami and W. Verhelst, "An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech," Speech Communication, vol. 49, no. 3, pp. 201-212, Mar. 2007.
[7] T. Nwe, S. Foo, and L. Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603-623, Nov. 2003.
[8] P. Oudeyer, "The production and recognition of emotions in speech: Features and algorithms," Int. Journal of Human-Computer Studies, vol. 59, pp. 157-183, 2003.
[9] L. Bosche, "Emotions, speech and the ASR framework," Speech Communication, vol. 40, no. 1, pp. 213-225, Apr. 2003.
[10] S. Kwon and S. Narayanan, "Robust speaker identification based on selective use of feature vectors," Pattern Recognition Letters, vol. 28, pp. 85-89, 2007.
[11] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162-1181, Sep. 2006.
[12] O. Kwon, K. Chan, J. Hao, and T. Lee, "Emotion recognition by speech signals," Proc. Eurospeech, pp. 125-128, 2003.
[13] R. Tato, R. Santos, R. Kompe, and J. Pardo, "Emotional space improves emotion recognition," Proc. ICSLP, pp. 2029-2032, 2002.
[14] E. Cowie, N. Campbell, R. Cowie, and P. Roach, "Emotional speech: Towards a new generation of databases," Speech Communication, vol. 40, no. 1, pp. 33-60, Apr. 2003.
[15] R. Huang and C. Ma, "Toward a speaker-independent real time affect detection system," Proc. Int. Conf. on Pattern Recognition (ICPR), pp. 1204-1207, 2006.
[16] J. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian Mixture and Hidden Markov Models," Technical Report ICSI-TR-97-021, University of California, Berkeley, 1997.
[17] H. Hu, M. Xu, and W. Wu, "GMM supervector based SVM with spectral features for speech emotion recognition," Proc. ICASSP, pp. 413-416, 2007.
[18] G. Guo, C. Huang, and H. Jiang, "A comparative study on various confidence measures in large vocabulary speech recognition," Proc. ISCSLP, pp. 9-12, 2004.
[19] M. Liberman, K. Davis, M. Grossman, N. Martey, and J. Bell, Emotional Prosody Speech and Transcripts, Linguistic Data Consortium (LDC), University of Pennsylvania, PA, USA, Jul. 2002.
[20] V. Sethu, E. Ambikairajah, and J. Epps, "Speaker normalization for speech-based emotion detection," Proc. Int. Conf. on Digital Signal Processing (DSP), pp. 611-614, 2007.

Jeong-Sik Park received his B.E. degree in Computer Science from Ajou University, South Korea, in 2001 and his M.E. degree in Computer Science from KAIST (Korea Advanced Institute of Science and Technology) in 2003. He is currently pursuing the Ph.D. degree in Computer Science at KAIST. His research interests include speech emotion recognition, speech recognition, speech enhancement, utterance verification, and music information retrieval.

Ji-Hwan Kim (M'09) received the B.E. and M.E. degrees in Computer Science from KAIST (Korea Advanced Institute of Science and Technology) in 1996 and 1998, respectively, and the Ph.D. degree in Engineering from the University of Cambridge in 2001. From 2001 to 2007, he was a chief research engineer and a senior research engineer at the LG Electronics Institute of Technology, where he was engaged in the development of speech recognizers for mobile devices. In 2005, he was a visiting scientist at the MIT Media Lab. Since 2007, he has been an assistant professor in the Department of Computer Science and Engineering, Sogang University. His research interests include spoken multimedia content search, speech recognition for embedded systems, and dialogue understanding.

Yung-Hwan Oh (M'84) received his B.S. and M.S. degrees from Seoul National University, South Korea, and his Ph.D. degree from the Tokyo Institute of Technology, Japan, in 1972, 1974, and 1980, respectively. From 1981 to 1985, he was an assistant professor with the Computer Engineering Department of Chungbuk National University. He was a visiting research staff member at the University of California, Davis, from 1983 to 1984, and a visiting research professor at Carnegie Mellon University from 1995 to 1996. He is now a professor with the Computer Science Division of KAIST, Daejeon, South Korea. His research interests include speech recognition, speech synthesis, speech coding, and speech enhancement.