Automatic Stress Detection from Speech by Using Support Vector Machines and Discrete Wavelet Transforms

Firoz Shah A., Raji Sukumar A., Babu Anto P.
School of Information Science and Technology, Kannur University, Kerala, India
[email protected], [email protected], [email protected]

Abstract: Automatic Speech Recognition (ASR) and Automatic Emotion Recognition (AER) from speech are pivotal areas in affective computing. Automatic detection of stress from speech means enabling machines to recognize the stress expressed in an utterance. We used the Discrete Wavelet Transform (DWT) for feature extraction and Support Vector Machines (SVMs) for training and testing. For this experiment we created a speaker- and gender-independent dataset of 450 utterances in Malayalam (one of the south Indian languages), divided in an 80:20 proportion between training and testing of the SVM. We obtained an overall stress detection accuracy of 89.95%.

Keywords: Automatic stress detection, Discrete Wavelet Transform (DWT), Support Vector Machines (SVM)

I. INTRODUCTION

Human-computer interaction is an interesting domain in modern scientific research, and the accurate recognition of emotions from speech is a major area in affective computing. The human perceptual system understands the emotions conveyed through speech efficiently and almost effortlessly. The human auditory system and its neural processing mechanisms can handle complex patterns of data because of their collective behavior and fault tolerance. For machines this is far less easy [1]: recognizing speech, or emotions from speech, is a genuinely hard problem. Automatic Emotion Recognition (AER) from speech means making a machine able to recognize the exact emotional content of speech by using machine learning algorithms. AER can be considered a complex pattern classification problem in which each emotional category is a different class of patterns. As the number of emotional categories increases, the difficulty of classifying and recognizing the pattern classes also increases for machines. We have successfully used SVMs to make a machine able to recognize stress from speech. Automatic stress detection from speech finds applications in robotics, security, call centre and automotive applications.

II. DATASET

We created a dataset for Malayalam (one of the south Indian languages) consisting of a total of 450 speech utterances. The words used to create the dataset and their IPA transcriptions are given in Table 1. The recognition accuracy of speech-based studies depends heavily on the dataset created. Emotional speech datasets can be broadly classified as speaker dependent, speaker independent and context dependent; speaker-dependent datasets can be further classified as gender dependent or gender independent. Context-dependent databases fall into three types: natural datasets, where all samples are collected from real-life situations; acted datasets, where the samples are recorded by professionals; and elicited datasets, where the emotions are induced.

Table 1: Emotional speech database and IPA transcriptions

Words in English    IPA format
amme                /æ m m æ/
acha                /æ tʃʰ ɑː/
mole                /m ɒ l ɛ/
mone                /m ɒ n ɛ/
eda                 /ɛ d ɑː/
lethe               /l ɛ θ ɛ/
devi                /d ɛ v ɪ/
njano               /n dʒ ɑː n ɒ/
kutty               /k ʊ t t i/
maye                /m ɑː j ɛ/
ayyo                /æ aɪ ɒ/
chetta              /tʃʰ t(ʰ) ɑː/
venda               /v iː n d ɑː/
kandu               /k(ʰ) ɔː n d juː/
poyi                /p ɒ i/
poda                /p o d ɑː/
pode                /p o d ɪ/
ede                 /ɛ d iː/
vave                /v ɑː v æ/
neeyo               /n ɛ ɔɪ ɒ/

III. FEATURE EXTRACTION

Feature extraction is the process of converting the input speech signals into ordered parametric representations used to train and test the classifier. In this experiment we used the Discrete Wavelet Transform (DWT) for feature extraction.

A. Discrete Wavelet Transform (DWT)

The DWT is a mathematical transformation that provides both time and frequency information about a signal. It is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, using digital filter banks to construct a multi-resolution time-frequency plane [2]. A discrete signal x[k] is filtered by a high-pass filter and a low-pass filter, which separate it into high-frequency and low-frequency components; to reduce the number of samples in the resulting outputs, a downsampling factor of ↓2 is applied. Performing the DWT on a signal yields a series of valuable coefficients.

The DWT is defined by the following equation:

W(j, k) = Σₙ x(n) 2^(−j/2) Ψ(2^(−j) n − k)        (1)

where Ψ(t) is the basic analyzing function, called the mother wavelet.

At each level, the decomposition of the input signal produces two kinds of outputs: the low-frequency components a[n], called the approximations, and the high-frequency components d[n], called the details. With this approach, the time resolution becomes arbitrarily good at high frequencies, while the frequency resolution becomes arbitrarily good at low frequencies. The filtering and decimation process is continued until the desired level is reached; the DWT of the original signal is then obtained by concatenating all the coefficients a[n] and d[n], starting from the last level of decomposition [3]. The successive high-pass and low-pass filtering of the signal can be described by the following equations:

Y_high[k] = Σₙ x[n] g[2k − n]        (2)

Y_low[k] = Σₙ x[n] h[2k − n]        (3)

where Y_high and Y_low are the outputs of the high-pass filter g and the low-pass filter h after subsampling by 2 [4].
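As an illustration of this filter-bank view of the DWT, the sketch below implements one analysis level in plain Python. It uses the Haar filters for simplicity rather than the Daubechies-4 filters used later in the experiment, and adopts the pairing convention (x[2k], x[2k+1]), one of several equivalent index conventions for the downsampled filtering.

```python
import math

# One level of DWT analysis: split a signal into approximation
# (low-pass) and detail (high-pass) coefficients, downsampled by 2.
# Haar filters are used for simplicity; longer filters such as
# Daubechies-4 follow the same structure.
H_LOW = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # low-pass filter h
G_HIGH = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # high-pass filter g

def analyze(x):
    """Return (approximations a[k], details d[k]) for an even-length signal."""
    a = [sum(H_LOW[i] * x[2 * k + i] for i in range(2))
         for k in range(len(x) // 2)]
    d = [sum(G_HIGH[i] * x[2 * k + i] for i in range(2))
         for k in range(len(x) // 2)]
    return a, d

signal = [4.0, 4.0, 2.0, 6.0, 1.0, 3.0, 0.0, 0.0]
approx, detail = analyze(signal)
# A constant pair of samples produces a zero detail coefficient, and
# the orthonormal Haar filters preserve energy:
# sum(a^2) + sum(d^2) == sum(x^2).
```

Applying `analyze` repeatedly to the approximation output gives the multilevel decomposition described above; concatenating the resulting a[n] and d[n] coefficients yields the feature vector.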

IV. FEATURE VECTOR CLASSIFICATION AND RECOGNITION

In this experiment we used Support Vector Machines (SVMs) for training and testing on the extracted feature vectors.

A. Support Vector Machines

Support Vector Machines (SVMs) are among the most successful machine learning algorithms based on statistical learning theory for two-class problems. The SVM is a supervised learning algorithm with good generalization from a limited number of training patterns [5]. SVMs were introduced for the classification of linearly separable classes of objects; they apply a simple linear method to the patterns, but in a high-dimensional feature space [6]. An SVM performs classification by separating the data into two categories with an n-dimensional hyperplane, determining the hyperplane that maximizes the margin between the classes. For any particular two-class dataset, an SVM finds the unique hyperplane of maximum margin, and represents the solution in terms of support vectors, the training patterns that determine this maximum-margin hyperplane. SVMs can also classify classes that cannot be separated by a linear classifier in the input space: the feature space is a high-dimensional space in which the two classes can be separated with a linear classifier [7]. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum-margin classifiers. A support vector machine for pattern classification is built by mapping the input pattern x into a high-dimensional feature vector v using a nonlinear transformation g(x), and then constructing an optimal hyperplane in the feature space. The nonlinear transformation g(x) should be such that the pattern classes are linearly separable in the feature space [8]. The architecture of a support vector machine for a two-class pattern classification problem is given in Fig. 1.
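To make the maximum-margin idea concrete, here is a minimal linear SVM trained by subgradient descent on the regularized hinge loss (a Pegasos-style sketch). This is a toy illustration on synthetic 2-D data, not the classifier configuration used in the experiment; the data, step sizes and regularization constant are illustrative choices.

```python
# Minimal linear SVM via subgradient descent on the hinge loss.
# A bias term is folded in by appending a constant 1 feature.

def train_linear_svm(data, lam=0.01, epochs=200):
    """data: list of (features, label) pairs with label in {+1, -1}."""
    dim = len(data[0][0]) + 1                 # +1 for the bias feature
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)             # decreasing step size
            xb = list(x) + [1.0]
            margin = y * sum(wi * xi for wi, xi in zip(w, xb))
            # Shrink w (regularization); add the example if it
            # violates the unit margin.
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else -1

# Two linearly separable clusters as a stand-in for two pattern classes.
data = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([2.0, 3.0], 1),
        ([-2.0, -2.0], -1), ([-1.0, -3.0], -1), ([-3.0, -2.0], -1)]
w = train_linear_svm(data)
```

Separating non-linearly-separable classes, as described above, would additionally require mapping x through a nonlinear transformation (or using a kernel) before this linear step.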

V. EXPERIMENT AND RESULTS

We conducted the experiment to make a machine able to recognize stress from Malayalam speech. For this purpose we created an elicited, speaker- and gender-independent dataset of 450 speech utterances, recorded by 8 male and 6 female speakers, all under the age of 30. A high-quality studio recording microphone was used, and the speech samples were recorded at a sampling frequency of 8 kHz (4 kHz band-limited). The samples were recorded in sessions spread over many days, and the speakers were trained well before the samples were collected. The recorded speech samples were processed, labeled and stored in the dataset. For feature extraction we used the Daubechies-4 wavelet, performing successive decomposition of the speech signals to obtain a good feature vector of reduced size at the 13th level of decomposition. We divided the database in an 80:20 proportion for training and testing of the classifier respectively. After training the SVM with 80% of the database, the classifier was tested on the remaining 20%, achieving a recognition accuracy of 89.95%.

Figure 1: SVM for a two-class pattern classification problem
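The 80:20 train/test split used in the experiment can be sketched as follows; the utterance labels and the fixed shuffle seed are illustrative choices, and feature extraction is assumed to happen separately.

```python
import random

def split_80_20(samples, seed=42):
    """Shuffle deterministically, then split 80% train / 20% test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

# With 450 utterances this yields 360 training and 90 test samples,
# matching the proportions used for the SVM in the experiment.
dataset = [f"utterance_{i:03d}" for i in range(450)]
train, test = split_80_20(dataset)
```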

VI. CONCLUSION

We have obtained a recognition accuracy of 89.95% in detecting stress from speech. We conclude that the Discrete Wavelet Transform (DWT) is well suited to feature extraction for stress detection from speech, and that Support Vector Machines (SVMs) train well on the resulting stressed-speech patterns. The efficiency of the approach can be evaluated further with different feature extraction methods and soft computing techniques. Using a feature vector of small size also reduces the computational complexity of the machine learning stage.

REFERENCES
[1] K. Oatley and J. M. Jenkins, Understanding Emotions. Oxford, UK: Blackwell, 1996.
[2] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Communications on Pure and Applied Mathematics, Vol. 41, 1988, pp. 909-996.
[3] R. Kronland-Martinet, J. Morlet and A. Grossmann, "Analysis of sound patterns through wavelet transforms," International Journal of Pattern Recognition and Artificial Intelligence, Vol. 1(2), 1987, pp. 237-301.
[4] G. Tzanetakis, G. Essl and P. Cook, "Audio analysis using the Discrete Wavelet Transform," Computer Science and Music Departments, Princeton University, 35 Olden Street, Princeton, NJ 08544.
[5] A. Karatzoglou, D. Meyer and K. Hornik, "Support Vector Machines in R," Journal of Statistical Software, Vol. 15, Issue 9, April 2006.
[6] K. B. Lipkowitz and T. R. Cundari (Eds.), Reviews in Computational Chemistry, Vol. 23. Wiley-VCH / John Wiley & Sons, Inc., 2007.
[7] L. Khan, M. Awad and B. Thuraisingham, "A new intrusion detection system using support vector machines and hierarchical clustering," The VLDB Journal, Vol. 16, 2007, pp. 507-521, DOI 10.1007/s00778-006-0002-5.
[8] C. Chandra Sekhar, W. F. Lee, K. Takeda and F. Itakura, "Acoustic modeling of subword units using Support Vector Machines," Workshop on Spoken Language Processing, TIFR, Mumbai, India.