Bimodal Emotion Recognition by Man and Machine

Thomas S. Huang, Lawrence S. Chen & Hai Tao
Beckman Institute & CSL, University of Illinois
405 N. Mathews, Urbana, IL 61801, U.S.A.
{huang,lchen,[email protected]

Tsutomu Miyasato & Ryohei Nakatsu
ATR Media Integration and Communication Laboratories
2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
{miyasato,[email protected]

Abstract

In this paper, we report preliminary results on machine recognition of emotions from both speech and video. We applied automatic feature analysis and extraction algorithms to the same data used by De Silva et al. [5] in their human subjective studies. We compare machine recognition results to human performance in three tests: (1) video only, (2) audio only, and (3) combined audio and video. In machine-assisted analyses, we found the two modalities to be complementary: emotion categories that have similar features in one modality have very different features in the other. By using both, we show it is possible to achieve higher recognition rates than with either modality alone.

1 Introduction

Recognizing human facial expression and emotion by computer is an interesting yet challenging problem. In particular, for natural human-computer interaction (HCI) it is important for the computer to recognize the emotion and facial expression of the user in order to respond properly. In her recent book Affective Computing, Picard [15] listed several applications in which computers would recognize the emotions of human users or even have emotions of their own.

There have been a number of studies on machine recognition of human emotions from facial expressions or emotional speech. However, most of these studies use only a single modality, either speech alone [1, 3, 4, 11, 13, 17, 18] or video alone [6, 8, 14, 12, 16, 19, 21]. Relatively little work has been done on combining the two modalities for machine recognition of human emotions [10, 19]. In this paper, we investigate the integration of visual and acoustic information for machine recognition of human emotions.

One comment frequently encountered is "How do you know the expression reflects the true emotion?" Human beings are capable of hiding their true emotions. They may choose not to reveal their emotion, or to show an expression that is not the true "felt emotion." With video cameras and microphones, we can only hope to recognize the "apparent emotions," which may or may not match the person's true inner emotion. Often, however, the expression does reflect the inner emotional state, and emotion is also communicated via the vocal channel. We assume that the user is willingly showing his or her emotion through facial expression and speech as a means to communicate.

De Silva et al. [5] studied humans' ability to recognize six basic emotions by means of subjective evaluation. They found that humans recognize some emotions better from video and others better from audio. We analyzed the same audio/visual data that De Silva et al. showed to their human subjects, to see how well machines can recognize emotions from the audio and video; it is useful to have the human performance results as a benchmark. We also study whether the bimodal approach offers better recognition. In the study by De Silva et al. [5], the subjects consistently confused certain emotion classes. We discovered feature clusterings similar to the human confusions, which suggests that the features we use may be similar to those humans use in the recognition process. Furthermore, we found that some of these confusions are mutually exclusive in the two modalities, which means we can use one modality to resolve confusions in the other. We will show that this complementary relationship makes it possible to obtain higher emotion recognition accuracy than with either modality alone.

The next section briefly reviews related work on recognition of human emotions from speech and from video. In Section 3, we review some results of De Silva et al. [5]. In Section 4, we describe the audio and video feature extraction. Experimental results of machine recognition are presented in Section 5. A comparison of human and machine performance is discussed in Section 6. Finally, conclusions and future directions are given in Section 7.

2 Review of Related Work

2.1 Recognition of emotion from speech

Most recent studies of emotional content in speech [1, 3, 4, 11, 13, 17, 18] have used "prosodic" information, which includes the "pitch, duration and intensity of the utterance" [9]. Murray and Arnott [13] reviewed findings on human vocal emotions. We use similar features in our speech analysis.

2.2 Recognition of emotion from video

Ekman [6, 7] showed concrete evidence that there exist "universal facial expressions": happiness, sadness, anger, fear, surprise, and disgust. Recent work on facial expression recognition [8, 14, 12, 16, 19, 21] has used these "basic expressions" or a subset of them. The approaches differ in the features extracted from the video images or in the processing used to classify emotions. The video processing falls into two broad categories. The first is "feature-based," in which one detects and tracks specific features such as the corners of the mouth and the eyebrows; the other is "region-based," in which windows are defined around the eyes, eyebrows, and mouth, and average movements in these regions are tracked. Our method is feature-based.

2.3 Bimodal emotion recognition

Sakaguchi et al. [19] built a system that recognized Japanese vowels in order to compensate for mouth movements caused by speech. However, the emotion recognition was video-based, in that no acoustic information such as pitch or intensity was used to recognize vocal emotion. Iwano et al. [10] attempted to recognize a speaker's "feelings" from both video and speech. They used 22 video features but only 2 audio features, applied to "filler" words that do not carry definite meanings.

3 Human Performance Results of De Silva et al.

De Silva et al. [5] reported results on human subjects' ability to recognize emotions. They presented video clips of facial expressions and corresponding, synchronized emotional speech clips to human subjects who were not familiar with the languages (Spanish and Sinhala) used in the data. They compared human performance using (1) video only, (2) audio only, and (3) corresponding audio and video of the same emotion. The subjects indicated their impression of each audio/video clip as one or more of the six basic emotion categories (disgust was replaced by dislike). Based on the subjective evaluation, they found that "Sadness and Fear emotions are audio dominant, Happiness, Surprise and Anger are video dominant, while Dislike gave mixed results."

The overall performance is summarized in Table 1. The subjects recognized the emotions of the Spanish speaker more accurately from his facial expressions, whereas for the Sinhala speaker the subjects did better from the audio. The subjects showed confusions between some classes in both modalities; for example, they consistently labeled some Dislike video as Anger, and some Anger video as Dislike. This explains why some of the recognition rates are low. In one case (Sinhala), the bimodal data enabled the subjects to achieve better recognition rates, while in the other case (Spanish) the bimodal result is a fraction of a percent worse than video only.

Table 1. Overall Human Recognition Rates.

             Video-Only   Audio-Only   Audio/Video
  Spanish      53.81%       41.71%       53.44%
  Sinhala      26.77%       32.29%       39.95%

Table 2 shows the confusion matrix of human performance on the Spanish audio-only test. Note the confusions between Happy and Anger, Sadness and Dislike, and Anger and Surprise. Table 3 shows the Spanish video-only test, and Table 4 the bimodal results for Spanish.

Table 2. Human Performance Confusion Matrix: Spanish Audio-Only (rows: detected emotion; columns: desired emotion).

            Hap     Sad     Ang     Dis     Sur     Fea
  Hap     42.69    3.77   10.78    7.39   12.31    0.42
  Sad      7.50   36.73    3.30   23.43    6.57   21.08
  Ang     19.72   13.21   43.86   15.02   23.61    0.90
  Dis     10.46   36.54    7.86   39.91    8.80    0.04
  Sur     13.43    4.51   27.38    7.69   44.54   34.63
  Fea      6.20    5.25    6.82    6.57    4.17   42.93

Table 3. Human Performance Confusion Matrix: Spanish Video-Only (rows: detected emotion; columns: desired emotion).

            Hap     Sad     Ang     Dis     Sur     Fea
  Hap     84.57    3.49    0.60    0.88    8.58    0.83
  Sad      2.16   16.08    2.38    1.17    9.88    5.46
  Ang      3.46   14.04   66.45   31.47    9.88   15.28
  Dis      1.60   55.71   18.23   63.21   10.25    2.69
  Sur      6.05    7.19    9.78    1.54   56.17   39.35
  Fea      2.16    3.49    2.56    1.73    5.25   36.39

Table 4. Human Performance Confusion Matrix: Spanish Video and Audio (rows: detected emotion; columns: desired emotion).

            Hap     Sad     Ang     Dis     Sur     Fea
  Hap     74.85    1.82    0.26    0.81    9.51    1.40
  Sad      4.29   22.01    0.93    4.07    7.84   13.36
  Ang      3.36   16.64   62.77   26.89    7.84    4.10
  Dis      3.73   53.86   22.33   65.74    7.65    0.48
  Sur     11.33    3.30    9.44    0.81   60.80   46.18
  Fea      2.44    2.38    4.26    1.67    6.36   34.48

4 Feature Extraction

4.1 Bimodal data

We performed computer-assisted analyses on the same data sets used by De Silva et al. This is one of the very few bimodal data sets available for studying emotion recognition, as most studies of human emotions use either speech-only or video-only data. The Spanish and Sinhala data sets each contain 36 clips of synchronized video and audio, six clips for each basic emotion. The speakers in the data were asked to speak with vocal emotion and to portray facial expressions.

4.2 Audio processing

Following the traditional analysis of vocal affect, we extract "prosodic features" from the speech. We first compute the pitch contour and RMS energy envelope of each sentence. Then, from the voiced parts of the speech, we extract statistics of the pitch contour, the energy envelope, and their derivatives. Of the sixteen features we extracted, we selected only five useful ones: (1) average pitch, (2) maximum pitch, (3) standard deviation of pitch, (4) average derivative of the pitch, and (5) average RMS energy.

4.3 Video processing

Conventional "expression-only" recognizers are not suitable for our data, because mouth movements from speech act as noise for pure expression recognition. We therefore applied a general-purpose tracking algorithm developed by Tao et al. [20] that tracks the entire face as it deforms throughout the sequence; the details of the tracking algorithm are described in [20]. The basic idea is to place several meshes over the face in the video images, and these meshes are free to deform. The meshes deform with the face through local image template matching around the mesh nodes, and additional constraints connect the meshes so that the mesh plates do not drift too far apart. The algorithm has been shown to track the facial features very well. We tracked the following features: (1) the horizontal and vertical positions of the eyebrows, (2) cheek lifting, and (3) the horizontal and vertical size of the mouth opening. From the tracking results we measure feature positions and movements in pixels, relative to landmark features. Figure 1 shows four frames of the tracking result with the meshes overlaid on the face.

Figure 2 shows the minimum and maximum vertical eyebrow positions in terms of their distance from the eye (in pixels), plotted by class. We see that Anger and Dislike have low eyebrows, while the other classes have high eyebrows. The difference between the maximum and minimum also indicates the amount of eyebrow movement: Happy, Anger, and Dislike have small eyebrow movements, while the other three classes have large movements.
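To make the feature extraction of Sections 4.2 and 4.3 concrete, the sketch below shows one way the per-clip feature vectors could be assembled. It assumes frame-level pitch and RMS-energy contours and per-frame tracked distances (eyebrow-to-eye distance, cheek lifting, mouth width) are already available from some pitch tracker and from the mesh tracker of [20]. The function names, the convention that unvoiced frames carry zero pitch, and the use of absolute differences for the pitch derivative are our assumptions, not specifications from the paper; the video summary statistics follow the three features named in Section 5.2.

import numpy as np

def prosodic_features(pitch, energy):
    """Per-clip audio statistics in the spirit of Section 4.2.

    `pitch` (Hz) and `energy` (RMS) are frame-level contours; how they are
    computed is not specified here, so any pitch tracker could supply them.
    Unvoiced frames are assumed to carry pitch == 0.
    """
    pitch, energy = np.asarray(pitch, float), np.asarray(energy, float)
    voiced = pitch > 0
    p = pitch[voiced]
    return np.array([
        p.mean(),                    # (1) average pitch
        p.max(),                     # (2) maximum pitch
        p.std(),                     # (3) standard deviation of pitch
        np.abs(np.diff(p)).mean(),   # (4) average pitch derivative (absolute
                                     #     differences; sign handling is assumed)
        energy[voiced].mean(),       # (5) average RMS energy
    ])

def video_features(brow_dist, cheek_lift, mouth_width):
    """Per-clip geometric statistics in the spirit of Section 4.3, computed
    from tracked per-frame measurements (pixels, relative to landmarks)."""
    return np.array([
        np.max(brow_dist),    # maximum vertical eyebrow distance to the eye
        np.max(cheek_lift),   # maximum cheek lifting
        np.min(mouth_width),  # minimum mouth-opening width
    ])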

Figure 2. Vertical eyebrow position relative to the eye for the Spanish speaker: maximum (circle) and minimum (triangle), measured as distance to the eye in pixels, grouped by class (Happy, Sad, Anger, Dislike, Surprise, Fear). [Plot omitted.]

Figure 1. Video tracking results.

5 Experimental Results

In each of the following subsections, we report recognition results and briefly compare them to human performance; more discussion is presented in Section 6. We employ supervised classification with leave-one-out cross validation. We first set aside one sample from each category as test data and use the remaining samples as training data. For testing, we simply assign each test sample to the class whose mean is nearest. This is done systematically for all samples. For classification, all training samples are scaled so that, for each feature, the minimum sample value is zero and the maximum is one; the test samples are scaled by the same training-data range.
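As a concrete illustration, the following is a minimal sketch of this protocol: per-feature min-max scaling estimated on the training folds, class means as templates, and leave-one-out-per-class evaluation. The function name and the NumPy implementation details are ours; the paper gives no code.

import numpy as np

def loo_nearest_class_mean(X, y):
    """Leave-one-out-per-class evaluation with a nearest-class-mean rule.

    X: (n_samples, n_features) feature matrix; y: integer class labels.
    In each fold, one sample per class is held out; the remaining samples
    supply the per-feature min/max scaling and the class means, and each
    held-out sample is assigned to the nearest class mean (Euclidean).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    per_class_idx = [np.flatnonzero(y == c) for c in classes]
    n_folds = min(len(idx) for idx in per_class_idx)   # e.g. 6 clips per emotion
    correct = total = 0
    for fold in range(n_folds):
        test_idx = np.array([idx[fold] for idx in per_class_idx])
        train_mask = np.ones(len(y), bool)
        train_mask[test_idx] = False

        # Scale each feature to [0, 1] using the training samples only.
        mn = X[train_mask].min(axis=0)
        rng = X[train_mask].max(axis=0) - mn
        rng[rng == 0] = 1.0
        scale = lambda A: (A - mn) / rng

        Xtr, ytr = scale(X[train_mask]), y[train_mask]
        means = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])

        # Nearest class mean for each held-out sample.
        d = np.linalg.norm(scale(X[test_idx])[:, None, :] - means[None], axis=2)
        pred = classes[d.argmin(axis=1)]
        correct += int((pred == y[test_idx]).sum())
        total += len(test_idx)
    return correct / total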

Table 5. Machine Performance Confusion Matrix for Spanish Audio Only (rows: detected emotion; columns: desired emotion).

            Hap     Sad     Ang     Dis     Sur     Fea
  Hap      66.7     0      16.7     0       0       0
  Sad       0      83.3     0      33.3     0       0
  Ang      33.3     0      66.7     0       0       0
  Dis       0      16.7     0      66.7     0       0
  Sur       0       0      16.7     0      83.3    16.7
  Fea       0       0       0       0      16.7    83.3

5.1 Audio-only recognition results

Audio-only machine recognition yielded an overall performance of 75%, using only two features (pitch average and standard deviation of pitch). The confusion matrix is shown in Table 5. Note that the confusion patterns (Happy and Anger, Sad and Dislike, Anger and Surprise, Surprise and Fear) are very similar to those of the human performance (Table 2). For the Sinhala data, the machine also achieved 75% using three features (pitch average, standard deviation of pitch, and average RMS energy). By comparison, the human subjects in De Silva et al. [5] achieved overall performance of only 41.71% for Spanish and 32.29% for Sinhala from the audio information alone. For the Sinhala audio, the subjects confused Surprise and Anger, and we observed a similar confusion between these two classes in machine recognition.

5.2 Video-only recognition results

The machine performance for Spanish video-only is shown in Table 6. A strong confusion exists between Anger and Dislike, which is also seen in the human performance in Table 3; the other confusions are different. For the Sinhala speaker, the presence of eyeglasses makes tracking difficult, and his movements are much more subtle than the Spanish speaker's; the human subjects were able to distinguish his emotions better from his audio than from his video. The human subjects achieved a 53.81% recognition rate on the Spanish video, while the overall machine recognition rate for Spanish video is 69.4%, based on three features: (1) maximum eyebrow position, (2) maximum cheek lifting, and (3) minimum mouth width.

Table 6. Machine Performance Confusion Matrix for Spanish Video Only (rows: detected emotion; columns: desired emotion).

            Hap     Sad     Ang     Dis     Sur     Fea
  Hap      83.3    16.7     0       0       0       0
  Sad       0      33.3     0       0       0      16.7
  Ang       0       0      66.7    33.3     0       0
  Dis      16.7     0      33.3    66.7     0       0
  Sur       0      16.7     0       0      83.3     0
  Fea       0      33.3     0       0      16.7    83.3


Figure 3. Spanish speaker: (a) Sadness, (b) Dislike, (c) Anger, (d) Surprise.

5.3 Bimodal Approach

The video and audio analyses showed some confusions similar to those of the human performance. In the Spanish audio there is a two-way confusion between Sadness and Dislike. In the video there are two-way confusions between Sadness and Surprise (see Figure 3(a),(d)) and between Anger and Dislike (Figure 3(b),(c)). However, in the video, Sadness and Dislike are quite distinct (Figure 3(a),(b)). Using the video features, we can therefore resolve the audio confusion between Sadness and Dislike, and using the audio features we can resolve some of the video confusions. This gives an intuitive reason to expect that combining the two modalities would resolve some confusions and increase performance. As a first test, we simply concatenate the two audio and three video features to form a bimodal feature vector. The same nearest-neighbor leave-one-out cross validation used for the unimodal experiments is used here. The results show a tremendous improvement with the bimodal approach (see Table 7; overall performance of 97.2%). Even with only two features (the best audio feature and the best video feature), the machine recognition gave 91.7% correct.
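The fusion step itself is just feature concatenation. As an illustration, and reusing the hypothetical loo_nearest_class_mean helper sketched in Section 5, a bimodal run could look like the following; the file names and array shapes are assumptions for illustration only.

import numpy as np

# Hypothetical per-clip features for the 36 Spanish clips:
# 2 audio features, 3 video features, and integer emotion labels 0..5.
audio_feats = np.load("spanish_audio_feats.npy")   # shape (36, 2), assumed file
video_feats = np.load("spanish_video_feats.npy")   # shape (36, 3), assumed file
labels = np.load("spanish_labels.npy")             # shape (36,),   assumed file

# Bimodal feature vector = simple concatenation of audio and video features.
bimodal = np.hstack([audio_feats, video_feats])    # shape (36, 5)

print("bimodal accuracy:", loo_nearest_class_mean(bimodal, labels))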

Table 7. Bimodal Machine Recognition Results for Spanish (rows: detected emotion; columns: desired emotion).

            Hap     Sad     Ang     Dis     Sur     Fea
  Hap      100      0       0       0       0       0
  Sad       0      100      0       0       0       0
  Ang       0       0      100      0       0       0
  Dis       0       0       0      100      0       0
  Sur       0       0       0       0      83.3     0
  Fea       0       0       0       0      16.7    100

6 Discussion

6.1 Comparison of human and machine performance

Before we compare numbers, we note that the human subjects were not trained on the data at all; instead, they are assumed to be able to distinguish between different emotions, an ability that is either innate or acquired through life experience. We do not expect the computer to perform much better than human beings, but since the classifier is trained on training samples, it is capable of discovering useful features and taking advantage of them, provided that the input data are consistent. Indeed, the machine performance turned out to be much better (audio-only: 75%, video-only: 69.4%, bimodal: 97.2%) than the human performance, but the comparison is not entirely fair. The performance also appears to be speaker-dependent.

Another comparison to be made is between the confusion patterns. Similarities in the confusion patterns of human and machine performance may reveal what features human beings use to recognize emotion.

Does the bimodal approach perform better simply because we use more features? Not necessarily. We tried larger numbers of features in each modality, but the performance actually decreased. We believe the improvement in performance is attributable to the complementary property of the audio and video channels.

6.2 Other issues

There are several interesting questions we can ask. (1) It has been shown that there are universal facial expressions; are there universal characteristics for vocal emotion as well? (2) How much data do we need before making a decision? (3) Are there other modalities to explore, such as gesture or physiological data? (4) How much better is the bimodal approach, and how can it be used in conjunction with single-modal methods? (5) What are other ways of combining these modalities? Some of these questions can only be answered after further investigation with more bimodal data.

7 Conclusions and Future Directions

We reported preliminary results on machine recognition of emotions and compared them to human performance on the same data. There are two significant findings: (1) the bimodal approach does offer a higher machine recognition rate because of the complementary property between audio and video (audio-only: 75%, video-only: 69.4%, bimodal: 97.2%); and (2) machine analysis reveals some confusion patterns that are similar to confusions made by humans. In the future, we would like to do the following. (1) Improve the video tracking algorithm: currently the tracking algorithm has problems with eyeglasses and large global head movements, and we would like to improve it to handle these cases. (2) Collect more bimodal data for a large-scale study: in this paper there was only a small number of input samples, which means statistical methods do not work very well and it is difficult to draw general conclusions.

Acknowledgment

We would like to thank Dr. L. C. De Silva of the National University of Singapore for providing the bimodal data as well as the detailed results of the human performance study.

References

[1] C. C. Chiu, Y. L. Chang, and Y. J. Lai, "The Analysis and Recognition of Human Vocal Emotions," in Proc. International Computer Symposium, NCTU, Hsinchu, Taiwan, R.O.C., December 12-15, 1994.
[2] L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu, "Multimodal Human Emotion/Expression Recognition," in Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG'98), Nara, Japan, April 14-16, 1998.
[3] R. Cowie and E. Douglas-Cowie, "Automatic Statistical Analysis of the Signal and Prosodic Signs of Emotion in Speech," in Proc. International Conf. on Spoken Language Processing, Philadelphia, PA, USA, pp. 1989-1992, October 3-6, 1996.
[4] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing Emotion in Speech," in Proc. International Conf. on Spoken Language Processing, Philadelphia, PA, USA, pp. 1970-1973, October 3-6, 1996.
[5] L. C. De Silva, T. Miyasato, and R. Nakatsu, "Facial Emotion Recognition Using Multimodal Information," in Proc. IEEE Int. Conf. on Information, Communications and Signal Processing (ICICS'97), Singapore, pp. 397-401, September 1997.
[6] P. Ekman (ed.), Emotion in the Human Face, Cambridge: Cambridge University Press, 1982.
[7] P. Ekman, "Strong Evidence for Universals in Facial Expressions: A Reply to Russell's Mistaken Critique," Psychological Bulletin, vol. 115, no. 2, pp. 268-287, 1994.
[8] I. A. Essa and A. P. Pentland, "Coding, Analysis, Interpretation, and Recognition of Facial Expressions," IEEE Trans. PAMI, vol. 19, no. 7, pp. 757-763, July 1997.
[9] H. Fujisaki, "Prosody, Models, and Spontaneous Speech," in Computing Prosody, Y. Sagisaka, N. Campbell, and N. Higuchi, Eds., New York: Springer-Verlag, 1997.
[10] Y. Iwano et al., "Extraction of Speaker's Feeling using Facial Image and Speech," in Proc. IEEE International Workshop on Robot and Human Communication (RO-MAN '95), Tokyo, Japan, pp. 101-106, July 5-7, 1995.
[11] T. Johnstone, "Emotional Speech Elicited Using Computer Games," in Proc. International Conf. on Spoken Language Processing, Philadelphia, PA, USA, pp. 1985-1988, October 3-6, 1996.
[12] K. Mase, "Recognition of Facial Expression from Optical Flow," IEICE Trans., vol. E74, no. 10, pp. 3474-3483, October 1991.
[13] I. R. Murray and J. L. Arnott, "Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion," Journal of the Acoustical Society of America, vol. 93, no. 2, pp. 1097-1108, February 1993.
[14] T. Otsuka and J. Ohya, "Recognizing Multiple Persons' Facial Expressions Using HMM Based on Automatic Extraction of Significant Frames from Image Sequences," in Proc. Int. Conf. on Image Processing (ICIP-97), Santa Barbara, CA, USA, pp. 546-549, October 26-29, 1997.
[15] R. W. Picard, Affective Computing, Cambridge, MA: MIT Press, 1997.
[16] M. Rosenblum, Y. Yacoob, and L. S. Davis, "Human Expression Recognition from Motion Using a Radial Basis Function Network Architecture," IEEE Trans. Neural Networks, vol. 7, no. 5, pp. 1121-1138, September 1996.
[17] J. Sato and S. Morishima, "Emotion Modeling in Speech Production Using Emotion Space," in Proc. IEEE Int. Workshop on Robot and Human Communication, Tsukuba, Japan, pp. 472-477, November 1996.
[18] K. R. Scherer, "Adding the Affective Dimension: A New Look in Speech Analysis and Synthesis," in Proc. International Conf. on Spoken Language Processing, Philadelphia, PA, USA, October 3-6, 1996.
[19] T. Sakaguchi, "Facial Feature Extraction Based on the Wavelet Transform for Dynamic Expression Recognition," submitted to IEEE Trans. PAMI.
[20] H. Tao and T. S. Huang, "Connected Vibrations: A Modal Analysis Approach to Non-rigid Motion Tracking," to appear in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'98), 1998.
[21] N. Ueki, S. Morishima, H. Yamada, and H. Harashima, "Expression Analysis/Synthesis System Based on Emotion Space Constructed by Multilayered Neural Network," Systems and Computers in Japan, vol. 25, no. 13, pp. 95-103, November 1994.