Enhancing Emotion Recognition from Speech through Feature Selection

Theodoros Kostoulas, Todor Ganchev, Alexandros Lazaridis, and Nikos Fakotakis

Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, 26500 Rion-Patras, Greece
[email protected], [email protected], {alaza,fakotaki}@upatras.gr

Abstract. In the present work we aim at performance optimization of a speaker-independent emotion recognition system through a speech feature selection process. Specifically, relying on the speech feature set defined in the Interspeech 2009 Emotion Challenge, we studied the relative importance of the individual speech parameters and, based on their ranking, selected a subset of speech parameters that offers advantageous performance. The affect/emotion recognizer utilized here relies on a GMM-UBM-based classifier. In all experiments we followed the experimental setup defined by the Interspeech 2009 Emotion Challenge, utilizing the FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech. The experimental results indicate that the correct choice of speech parameters can lead to performance better than the baseline.

Key words: affect recognition, emotion recognition, feature selection, real-world data

1 Introduction

The progress of technology and the increasing use of spoken dialogue systems raise the need for more effective and user-friendly human-machine interaction [1]. Awareness of the emotional state of the user can contribute towards more successful interaction experiences [2]. One of the greatest challenges in the task of emotion recognition from speech is dealing with real-life data and addressing speaker-independent emotion recognition. Real-world speech data differ considerably from acted speech, since they are characterized by spontaneous speech and genuine formulations [3]. Thus, results reported for acted speech corpora (accuracy of up to 100%) cannot be transferred to realistic conditions, where the reported performance is below 80% for two-class classification problems and below 60% for four-class classification problems [4]. To this end, various approaches for emotion recognition have been reported. In [5], Callejas et al. studied the impact of contextual information on the annotation of emotions. They carried out experiments on a corpus extracted from
the interaction of humans with a spoken dialogue system. Their results show that both humans and machines are affected by the contextual information. In [6], Iliou and Anagnostopoulos statistically selected a feature set for studying speaker-dependent and speaker-independent emotion recognition on an acted speech corpus. In [7], Seppi et al. reported classification results obtained with acoustic and linguistic features on the FAU Aibo Emotion Corpus [8], [9]. In [10], Ververidis and Kotropoulos optimized the execution time and accuracy of the sequential floating forward selection (SFFS) method in speech emotion recognition. In [11], Brendel et al. described research efforts towards emotion detection for monitoring an artificial agent by voice. Despite the great effort in the area of emotion recognition, most of the work conducted is not directly comparable, since no universally accepted experimental setup has been widely used so far. However, recent research efforts tend towards the establishment [12], [13] and utilization [14], [15] of such a setup. The present work reports on-going research activity on affect/emotion recognition within the experimental setup defined by the Interspeech 2009 Emotion Challenge [12]. Specifically, we examine the performance of an emotion recognition system in relation to the speech parameters selected for representing the emotional information over five emotion classes. The results indicate that the performance of the system exceeds the baseline performance [12] and is close to the highest performance achieved in the Challenge [14]. The remainder of this work is organized as follows: Section 2 details the architecture of the emotion recognition system. Section 3 describes the emotional speech data utilized. The experiments performed and the results they lead to are presented in Section 4.

2 System architecture

The block diagram of the GMM-UBM-based emotion recognition system is shown in Fig. 1. The upper part of the figure summarizes the training of the speaker-independent emotion models, and the bottom part outlines the operational mode of the system. During both the training and the operational phases, the speech data are subject to speech parameterization, which results in a 384-dimensional feature vector [12]. The following 16 low-level speech descriptors are computed: zero-crossing rate (ZCR) of the time signal, root mean square (RMS) frame energy, pitch frequency (normalised to 500 Hz), harmonics-to-noise ratio (HNR) computed by the autocorrelation function, and twelve Mel-frequency cepstral coefficients (MFCC, excluding the 0-th) computed as in the standard HTK setup. For the resulting descriptor set the delta coefficients are also computed. Next, twelve functionals (mean, standard deviation, kurtosis, skewness, minimum and maximum value with their relative positions, range, as well as two linear regression coefficients with their mean square error (MSE)) are applied at the sentence level.
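For illustration, the following sketch applies the twelve functionals to each low-level descriptor contour and its delta; it is a simplified NumPy/SciPy approximation (simple first-order frame differences are assumed for the deltas), not the configuration distributed with the Challenge.

```python
# Illustrative sketch of the chunk-level feature computation (not the Challenge's
# reference configuration): twelve functionals applied to each of the 16 LLD
# contours and their deltas, giving 32 * 12 = 384 features per chunk.
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(contour):
    """Map one frame-level descriptor contour to the 12 statistical functionals."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)                # linear regression coefficients
    mse = np.mean((offset + slope * t - x) ** 2)       # regression error (MSE)
    return np.array([
        x.mean(), x.std(), kurtosis(x), skew(x),       # mean, std, kurtosis, skewness
        x.min(), x.max(),                              # extreme values
        np.argmin(x) / len(x), np.argmax(x) / len(x),  # their relative positions
        x.max() - x.min(),                             # range
        offset, slope, mse,                            # regression offset, slope, MSE
    ])

def chunk_features(lld_matrix):
    """lld_matrix: (n_frames, 16) array of LLDs; returns the 384-dimensional vector."""
    deltas = np.diff(lld_matrix, axis=0, prepend=lld_matrix[:1])  # simple frame deltas
    both = np.hstack([lld_matrix, deltas])                        # (n_frames, 32)
    return np.concatenate([functionals(both[:, i]) for i in range(both.shape[1])])
```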

Fig. 1. Block diagram of the emotion recognition component.

As the figure shows, during the training phase two types of data (labelled and unlabelled) are utilized for the creation of the emotion models of interest. Specifically, unlabelled speech recordings from speakers different from those involved in the testing of the emotion models are utilized for the creation of a large Gaussian mixture model, referred to as the Universal Background Model (UBM). This model is sufficiently general not to interfere with any of the emotion categories of interest, and not to represent accurately the individual characteristics of the speakers whose speech was used in its creation. Thus, the UBM is considered to represent an emotion-independent distribution of the feature vectors [16], [17]. Next, a category-specific set of labelled speech recordings is used for deriving the model of each emotion category of interest. This is done by the Bayesian adaptation technique, also known as maximum a posteriori (MAP) adaptation of the UBM [18]. During MAP adaptation only the means were adapted. The emotion models built during training are utilized in the operational phase for the classification of unlabelled speech recordings into one of the predefined emotion categories. In brief, the feature vectors resulting from the speech parameterization stage are fed to the GMM classification stage, where the log-likelihoods of the input data belonging to each of the category-specific models are computed. Next, these log-likelihoods are subject to the Bayes optimal decision rule, which selects the emotion category with the highest probability.
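A minimal sketch of this training and classification scheme is given below, assuming scikit-learn's GaussianMixture, diagonal covariances, and a relevance factor of 16 borrowed from the speaker-verification literature [18]; equal class priors are assumed, so the Bayes rule reduces to picking the model with the highest log-likelihood. It is not the authors' implementation.

```python
# Sketch of UBM training, means-only MAP adaptation and Bayes-rule scoring.
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats, n_components=1, seed=0):
    """Fit the emotion-independent UBM on pooled, unlabelled 384-dim feature vectors."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(background_feats)

def map_adapt_means(ubm, class_feats, relevance=16.0):
    """Derive one emotion model from the UBM by adapting only the Gaussian means."""
    post = ubm.predict_proba(class_feats)                          # responsibilities (N, M)
    n_k = post.sum(axis=0)                                         # soft counts per mixture
    ex_k = post.T @ class_feats / np.maximum(n_k, 1e-10)[:, None]  # data means per mixture
    alpha = (n_k / (n_k + relevance))[:, None]                     # adaptation coefficients
    model = copy.deepcopy(ubm)                                     # weights/covariances kept
    model.means_ = alpha * ex_k + (1.0 - alpha) * ubm.means_       # interpolate the means
    return model

def classify(emotion_models, feat_vec):
    """Return the label of the emotion model with the highest log-likelihood."""
    labels = list(emotion_models)
    scores = [emotion_models[lab].score(np.atleast_2d(feat_vec)) for lab in labels]
    return labels[int(np.argmax(scores))]
```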

3 Emotional Speech Data

The present study utilizes the FAU Aibo Emotion Corpus ([8], Chap. 5; [9]). The speech corpus results from the interaction of fifty-one children, aged 10 to 12 years, with Sony's pet robot Aibo, and thus consists of spontaneous, emotionally coloured, German speech. The data were collected at two different schools, Ohm (26 children) and Mont (25 children). The robot was controlled by a human operator, causing Aibo to perform a fixed, predetermined sequence of actions, sometimes provoking emotional reactions. The recordings were segmented automatically into turns and annotated at the word level by five labellers ([8], Chap. 5). Within the Interspeech 2009 Emotion Challenge, chunks were defined manually based on syntactic-prosodic criteria ([8], Chap. 5.3.5). The whole corpus consists of 18,216 chunks. The present work focuses on the five-class
classification problem, which considers the following classes: Anger (angry, touchy, reprimanding), Emphatic, Neutral, Positive (motherese, joyful), and Rest. The training dataset consists of the data collected at the Ohm school and the test dataset of the data recorded at the Mont school. In the training set the chunks are given in sequential order, with the chunk name indicating the speaker identity. In the test set, the chunks are presented in random order, without explicit information about the speaker.

4 Experiments and Results

We split the training dataset specified in Section 3 into two parts: a development set and a validation set. In each of these parts we preserved speakers' age and gender distributions similar to those of the entire training dataset. The development set was utilized for performing the feature ranking and feature selection experiments. The validation set was utilized for identifying the most favourable subset of speech parameters and the optimal complexity of the GMM emotion models. The test dataset, as defined in Section 3, was used for measuring the accuracy of the emotion recognition system in the setup defined by the Interspeech 2009 Emotion Challenge. The selection of the features was performed considering the predictive value of each feature individually, along with the degree of redundancy among them, and for that purpose we relied on the BestFirst search method [19]. A 10-fold cross-validation protocol was followed. Within this work we retained any speech feature that was selected five or more times in the 10 evaluations (the 10 splits of data corresponding to the 10 folds). Table 1 shows the 56 speech parameters that were selected, sorted in descending order according to their rank. As can be seen in the table, speech parameters derived from the RMS frame energy and the MFCCs dominate the top-56 subset selected here. These results are in agreement with previous research conducted in the field of emotion recognition [20]. In order to examine the improvement in emotion recognition accuracy contributed by adding more parameters to the speech vector, the following procedure was used: emotion models were subsequently trained with feature vectors of increasing size, starting with only the first (top-ranked) speech feature and adding one feature at a time. In order to identify the most appropriate settings of the GMM model, for each size of the feature vector we experimented with different numbers of mixture components, i.e. {1, 2, 4, 8, 16}. The optimal accuracy for every feature set, in terms of the unweighted average (UA) recall computed on the validation set, was observed for a GMM with one mixture component. The UA recall obtained for each of the 56 feature subsets that were evaluated is summarized in Fig. 2. As the figure shows, the maximum UA recall was achieved for the feature vector composed of the first 49 speech parameters. With the selection of the top-49 speech parameters for the feature vector, and with the size of the GMM model set to one mixture component, the optimization of the emotion recognition system was completed.
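As an illustration of the fold-stability criterion described above (keep any feature selected in at least five of the ten folds), the sketch below counts selection votes across a 10-fold split; a univariate ANOVA selector and a hypothetical per-fold budget stand in for the BestFirst subset search [19] actually used, so the resulting subset is only indicative.

```python
# Sketch of the voting scheme over 10 cross-validation folds; SelectKBest/f_classif
# is a stand-in for the subset search used in the paper, k_per_fold is hypothetical,
# so the selected set will not match Table 1 exactly.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold

def stable_features(X, y, k_per_fold=60, min_votes=5, n_folds=10, seed=0):
    """X: (n_chunks, 384) feature array, y: class labels; returns stable feature indices."""
    votes = np.zeros(X.shape[1], dtype=int)
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, _ in folds.split(X, y):
        selector = SelectKBest(f_classif, k=k_per_fold).fit(X[train_idx], y[train_idx])
        votes += selector.get_support().astype(int)    # one vote per selected feature
    order = np.argsort(-votes)                         # most frequently selected first
    return [int(i) for i in order if votes[i] >= min_votes]
```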

Table 1. Selected speech parameters.

No.  Feature                    No.  Feature
 1   rmsEnergy-max              29   mfcc1-De-stddev
 2   rmsEnergy-min              30   mfcc2-De-max
 3   rmsEnergy-range            31   mfcc2-De-min
 4   rmsEnergy-mean             32   mfcc2-De-skewness
 5   rmsEnergy-stddev           33   mfcc2-De-kurtosis
 6   rmsEnergy-kurtosis         34   mfcc4-De-kurtosis
 7   rmsEnergy-linregc1         35   mfcc5-De-skewness
 8   rmsEnergy-De-max           36   zcr-mean
 9   rmsEnergy-De-min           37   zcr-De-stddev
10   rmsEnergy-De-range         38   HNR-max
11   rmsEnergy-De-stddev        39   mfcc3-mean
12   rmsEnergy-De-skewness      40   mfcc11-mean
13   rmsEnergy-De-linregc1      41   mfcc12-mean
14   rmsEnergy-De-linregc2      42   mfcc6-De-maxPos
15   rmsEnergy-De-linregerrQ    43   mfcc8-De-stddev
16   F0freq-De-max              44   mfcc1-kurtosis
17   HNR-De-kurtosis            45   mfcc12-skewness
18   mfcc1-max                  46   mfcc4-De-max
19   mfcc1-min                  47   F0freq-De-skewness
20   mfcc1-mean                 48   mfcc8-kurtosis
21   mfcc1-linregc1             49   mfcc7-De-maxPos
22   mfcc1-linregerrQ           50   rmsEnergy-linregerrQ
23   mfcc3-kurtosis             51   mfcc3-maxPos
24   mfcc4-min                  52   mfcc5-max
25   mfcc4-range                53   mfcc9-mean
26   mfcc4-stddev               54   mfcc4-De-linregc1
27   mfcc7-kurtosis             55   F0freq-skewness
28   mfcc1-De-max               56   F0freq-linregerrQ

In the next step we evaluated the accuracy of the emotion recognition system on the original split of training and test datasets, as formulated in the Interspeech 2009 Emotion Challenge. In detail, utilizing the training set, we evaluated the system's performance on the test set using the best-performing feature set (top-49) and one mixture component for the GMM emotion models. This resulted in 41.99% weighted average (WA) recall (accuracy) and 39.45% UA recall. The confusion matrix for this experiment is shown in Table 2. As Table 2 presents, the broad category Rest (8.6%) was the most difficult to recognize. This can be explained by the large diversity of the data belonging to this category, i.e. this class contains all the data that could not be assigned to one of the other four classes. The category Positive is mostly confused with Neutral, since both subcategories of Positive, i.e. motherese and joyful, are closer to Neutral than to any other category. This can also be observed in the accuracy for Neutral, which is mostly confused with the category Positive.

Fig. 2. The UA recall computed on the validation set for feature vectors composed of 1, 2, ..., 56 speech parameters.

Table 2. Accuracy in percentages of the optimized emotion recognition system in the setup defined in the Interspeech 2009 Emotion Challenge (rows: true class, columns: predicted class).

            Anger   Emphatic   Neutral   Positive   Rest
Anger        39.1       22.8      19.6       13.1    5.4
Emphatic     14.8       44.3      31.0        5.6    4.3
Neutral      12.9       17.0      44.3       22.1    3.7
Positive      3.3        2.3      27.0       60.9    6.5
Rest         13.4        9.9      36.6       31.5    8.6

Moreover, the relatively high accuracy for the Neutral affective state can be explained by the large number of instances, which allows successful adaptation of the universal background model towards building the neutral model. In general, the emotion recognition component shows significant capability in recognizing emotions in all classes but Rest, which is related to the number of available instances in the training set: Anger (881), Emphatic (2,093), Neutral (5,590), Positive (674), Rest (721). The emotion recognition system achieves 10% and 3.3% relative improvement over the dynamic and static modelling baselines provided in [15], respectively, and performs close to the highest result reported in the Challenge (41.3% UA recall) [14].
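For reference, the two Challenge metrics can be computed as sketched below: the unweighted average recall is the mean of the per-class recalls (the diagonal of Table 2, (39.1 + 44.3 + 44.3 + 60.9 + 8.6)/5 ≈ 39.4%, consistent with the reported 39.45% up to rounding), while the weighted average recall is the overall accuracy.

```python
# Sketch of the evaluation metrics: unweighted average (UA) recall treats every
# class equally, weighted average (WA) recall is plain accuracy.
import numpy as np

def ua_wa_recall(y_true, y_pred, classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]  # recall per class
    ua = float(np.mean(per_class))                  # mean of the per-class recalls
    wa = float(np.mean(y_pred == y_true))           # overall accuracy
    return ua, wa
```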

5 Conclusion

The present work reported research efforts towards speaker-independent speech emotion recognition, utilizing the setup defined by the Interspeech 2009 Emotion Challenge. Specifically, we addressed the five-class emotion recognition problem, outperforming
the baseline accuracy and coming close to the highest result reported in the Challenge. The UBM-GMM-based emotion recognition system shows significant capability to overcome the class-imbalance problem, yielding a high value of unweighted average recall. The correct choice of speech parameters leads to an improvement of the system's performance, demonstrating the importance of feature selection in the demanding task of recognizing spontaneous, emotionally coloured speech and genuine formulations.

Acknowledgments. This work was supported by the PlayMancer project (FP7-ICT-215839-2007), which is funded by the Seventh Framework Programme of the European Commission.

References

1. Pantic, M., Rothkrantz, L.: Toward an affect-sensitive multi-modal human-computer interaction. Proceedings of the IEEE, vol. 91, pp. 1370–1390 (2003)
2. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.G.: Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80 (2001)
3. Batliner, A., Fischer, K., Huber, R., Spilker, J., Nöth, E.: How to find trouble in communication. Speech Communication, vol. 40, pp. 117–143 (2003)
4. Batliner, A., Burkhardt, F., van Ballegooy, M., Nöth, E.: A taxonomy of applications that utilize emotional awareness. In: Erjavec, T., Gros, J. (eds.) Language Technologies, IS-LTC 2006, pp. 246–250 (2006)
5. Callejas, Z., Lopez-Cozar, R.: Influence of contextual information in emotion annotation for spoken dialogue systems. Speech Communication, pp. 416–433 (2008)
6. Iliou, T., Anagnostopoulos, C.N.: Comparison of Different Classifiers for Emotion Recognition. In: 13th Panhellenic Conference on Informatics, pp. 102–106 (2009)
7. Seppi, D., Batliner, A., Schuller, B., Steidl, S., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Amir, N., Aharonson, V.: Patterns, prototypes, performance: classifying emotional user states. In: Interspeech 2008, pp. 601–604 (2008)
8. Steidl, S.: Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech. Logos Verlag, Berlin (2009)
9. Batliner, A., Steidl, S., Hacker, C., Nöth, E.: Private Emotions vs. Social Interaction – a Data-driven Approach towards Analysing Emotion in Speech. User Modeling and User-Adapted Interaction (UMUAI), vol. 18, no. 1-2, pp. 175–206 (2008)
10. Ververidis, D., Kotropoulos, C.: Fast and accurate feature subset selection applied to speech emotion recognition. Signal Processing, vol. 88, issue 12, pp. 2956–2970 (2008)
11. Brendel, M., Zaccarelli, R., Devillers, L.: Building a System for Emotions Detection from Speech to Control an Affective Avatar. In: Proceedings of LREC 2010, pp. 2205–2210 (2010)
12. Schuller, B., Steidl, S., Batliner, A.: The Interspeech 2009 Emotion Challenge. In: Interspeech 2009, ISCA, Brighton, UK, pp. 312–315 (2009)
13. Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Mueller, C., Narayanan, S.: The Interspeech 2010 Paralinguistic Challenge. In: Interspeech 2010, ISCA, Makuhari, Japan (2010)
14. Kockmann, M., Burget, L., Cernocky, J.: Brno University of Technology System for Interspeech 2009 Emotion Challenge. In: Interspeech 2009, ISCA, Brighton, UK, pp. 348–351 (2009)
15. Steidl, S., Schuller, B., Seppi, D., Batliner, A.: The Hinterland of Emotions: Facing the Open-Microphone Challenge. In: Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction (ACII 2009), vol. I, pp. 690–697 (2009)
16. Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, vol. 3, pp. 72–83 (1995)
17. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, vol. 39, pp. 1–38 (1977)
18. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, vol. 10, pp. 19–41 (2000)
19. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
20. Schuller, B., Batliner, A., Seppi, D., Steidl, S., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Amir, N., Kessous, L., Aharonson, V.: The relevance of feature type for the automatic classification of emotional user states: Low level descriptors and functionals. In: Interspeech 2007, ISCA, Antwerp, Belgium, pp. 2253–2256 (2007)