Mixture of Support Vector Machines for HMM based Speech Recognition

Sven E. Krüger, Martin Schafföner, Marcel Katz, Edin Andelic, Andreas Wendemuth
Otto-von-Guericke University of Magdeburg, IESK, Cognitive Systems, PF 4120, 39016 Magdeburg, Germany
[email protected]

Abstract

Speech recognition is usually based on Hidden Markov Models (HMMs), which represent the temporal dynamics of speech very efficiently, and Gaussian mixture models, which classify speech into single speech units (phonemes) sub-optimally. In this paper we use parallel mixtures of Support Vector Machines (SVMs) for classification by integrating this method into an HMM-based speech recognition system. SVMs are very appealing due to their foundation in statistical learning theory and have already shown good results in pattern recognition and in continuous speech recognition. They suffer, however, from a training effort that scales at least quadratically with the number of training vectors. The SVM mixtures need only nearly linear training time, making it easier to deal with the large amount of speech data. In our hybrid system we use the SVM mixtures as acoustic models in an HMM-based decoder. We train and test the hybrid system on the DARPA Resource Management (RM1) corpus, showing better performance than an HMM-based decoder using Gaussian mixtures.

1. Introduction

Support Vector Machines (SVMs) [14, 2, 12] have become quite popular for many applications of pattern classification, mainly due to their great ability to generalize, which often results in better performance than traditional techniques such as artificial neural networks. They suffer, however, from a training effort that scales at least quadratically [4] with the number of training vectors. SVMs have also been successfully integrated [5, 13, 7] into continuous speech recognition based on Hidden Markov Models (HMMs). In HMM speech recognizers the classification of single speech units (phonemes) is usually done with Gaussian mixture models (GMMs), which do not discriminate well. In a hybrid SVM/HMM system such as [7] these Gaussian mixtures are replaced with SVMs. In these hybrid SVM/HMM systems the problem of quadratic training time for the SVMs becomes especially severe because of the large amount of training data in speech recognition. In order to overcome this drawback, we propose to integrate parallel mixtures of SVMs (as introduced in [4]) into speech recognition. We do this by using the SVM mixtures as acoustic models in an HMM framework, following the successful hybrid approach with single SVMs [7] and also earlier approaches with artificial neural networks [1]. Mixtures of SVMs have also been used [8] in the related field of speaker recognition. Since the mixtures of SVMs need only nearly linear training time [4], we could use (contrary to [7]) a one-vs-rest approach for multi-class classification. This approach has recently been defended [11] as being as accurate as any other approach (e.g., one-vs-one as used in [7]).

2. Speech Recognition using Hidden Markov Models

We give a short overview of Automatic Speech Recognition (ASR) (see Ref. [10] for more details), showing where SVM mixtures can be used.

2.1. Hidden Markov Models

Predominantly, Hidden Markov Models (HMMs) are used in ASR. An HMM is a stochastic finite state automaton built from a finite set of possible states Q = {q1, ..., qK}, with instantaneous transitions between these states occurring with certain probabilities. Each state is associated with a specific emission probability distribution p(xn|qk). Thus, HMMs can be used to model a sequence X of feature vectors as a piecewise stationary process where each stationary segment is associated with a specific HMM state. This approach defines two concurrent stochastic processes: the sequence of HMM states modeling the temporal dynamics of speech, and a set of state output processes modeling the locally stationary property of the speech signal.
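To make the interplay of these two processes concrete, the following is a minimal, self-contained Viterbi decoding sketch in Python (not from the paper); the uniform transition matrix and random emission scores are toy assumptions standing in for real acoustic models:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for one utterance, all scores in the log
    domain: log_init (K,), log_trans (K, K) with log_trans[i, j] = log P(j | i),
    and log_emit (T, K) holding log p(x_t | q_k) for every frame."""
    T, K = log_emit.shape
    delta = log_init + log_emit[0]           # best score of any path ending in state k at t = 0
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (K, K): previous state x current state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    state = int(delta.argmax())
    path = [state]
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        state = int(back[t, state])
        path.append(state)
    return path[::-1]

# toy example: 3 states, 5 frames with random emission scores
rng = np.random.default_rng(0)
print(viterbi(np.log(np.full(3, 1 / 3)),
              np.log(np.full((3, 3), 1 / 3)),
              np.log(rng.dirichlet(np.ones(3), size=5))))
```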

In speech recognition we have to find the HMM M* which maximizes the posterior probability p(M|X) of the hypothesized HMM M given a sequence X of feature vectors. Since this probability cannot be computed directly, it is usually split using Bayes' rule into the acoustic model (likelihood) p(X|M) and a prior p(M) representing the language model: p(M|X) ∝ p(X|M)p(M).
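In practice the decoder applies this factorization in the log domain. The sketch below is only an illustration of that ranking; the hypothesis strings, scores, and the lm_scale parameter are made-up assumptions, not values from the paper:

```python
import math

def combined_score(acoustic_loglik, lm_logprob, lm_scale=1.0):
    """Bayes' rule in the log domain: rank hypotheses M by
    log p(X|M) + lm_scale * log p(M)."""
    return acoustic_loglik + lm_scale * lm_logprob

hypotheses = {
    "show all flights": (-1234.5, math.log(0.08)),    # (log p(X|M), log p(M))
    "show tall flights": (-1233.9, math.log(0.002)),
}
best = max(hypotheses, key=lambda h: combined_score(*hypotheses[h]))
print(best)   # the hypothesis with the slightly worse acoustic score wins via the language model
```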

2.2. Emission probabilities

The acoustic model p(X|M) is usually realized using mixtures of Gaussian probability density functions as emission probabilities of the HMMs. However, these Gaussian mixtures are trained by likelihood maximization, which assumes correctness of the models, and they thus suffer from poor discrimination [1]. Artificial neural networks (ANNs) and SVMs, which discriminate better than Gaussian mixtures, have been proposed as models of emission probabilities in hybrid systems (see [1, 7] and references therein). We follow this approach of replacing Gaussian mixtures with better classifiers and use mixtures of SVMs to compute emission probabilities, creating a hybrid SVM-mixture/HMM system.

3. Emission probability estimation with mixture of SVMs

3.1. Support Vector Machines

Support Vector Machines (SVMs) were first introduced by Vapnik and developed from the theory of structural risk minimization (SRM) [14]. We give a short overview and refer to [12, 2] for more details and further references. SVMs are linear classifiers (i.e., the classes are separated by hyperplanes), but they can be used for non-linear classification via the so-called kernel trick. Instead of applying the SVM directly to the input space R^n, it is applied to a higher-dimensional feature space F which is non-linearly related to the input space: Φ : R^n → F. The kernel trick can be used because the SVM algorithms use the training vectors only in the form of Euclidean dot products (x · y). It is then only necessary to calculate the dot product in feature space (Φ(x) · Φ(y)), which is equal to the so-called kernel function k(x, y) if k(x, y) fulfills Mercer's condition. An important kernel function which fulfills this condition is the Gaussian radial basis function (RBF) kernel

k(x, y) = \exp(-\gamma \|x - y\|^2).    (1)
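As a small illustration of Eq. (1), the kernel value can be computed directly from the input vectors without ever forming Φ(x); this is a minimal sketch with an arbitrary toy value for γ:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel of Eq. (1): k(x, y) = exp(-gamma * ||x - y||^2).
    Its value equals the implicit feature-space dot product Phi(x) . Phi(y),
    which is all the SVM ever needs."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))   # a value in (0, 1]; gamma is a toy choice
```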

Given a training set {x_i, y_i} of N training vectors x_i ∈ R^n and corresponding labels y_i ∈ {−1, +1}, a kernel function k(x, y), and a parameter C (see below), the SVM finds an optimal separating hyperplane in F (optimal in the sense of the SRM principle, i.e., by optimizing the upper bound on the actual risk of misclassification through controlling the complexity of the classifier and minimizing the empirical error on the training data). This is done by solving the following quadratic programming problem: find the vector α which maximizes

W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (2)

under the constraints \sum_{i=1}^{N} \alpha_i y_i = 0 and 0 ≤ α_i ≤ C ∀i. The parameter C > 0 allows us to specify how strictly we want the classifier to fit to the training data (a larger C meaning more strictly). The output of the SVM is

s(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b,    (3)

where s(x) > 0 means that x is classified to class +1. The training vectors xi for which the αi are greater than zero are called support vectors. We use the software library Torch [3] for the training of the SVMs.
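The paper trains its SVMs with the Torch library [3]; purely as an illustration of Eqs. (2) and (3), the same quantities can be obtained with scikit-learn's SVC on toy data (the data, γ, and C below are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data standing in for 39-dimensional speech feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 39))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.0, 1, -1)

# RBF-kernel SVM; C controls how strictly the classifier fits the training data.
clf = SVC(kernel="rbf", gamma=0.05, C=10.0).fit(X, y)

# decision_function(x) corresponds to s(x) in Eq. (3); its sign gives the class,
# and the training vectors with non-zero alpha_i are the support vectors.
print(clf.decision_function(X[:3]))
print(len(clf.support_), "support vectors out of", len(X))
```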

3.2. Mixture of SVMs

The idea of mixture models, such as the mixture of experts [6], is to use a set of experts (e.g., ANNs) and a gating network that weights the outputs of the single experts to form the final output. Here we use parallel mixtures of SVMs as introduced in [4]. We give a short overview of this method and refer to [4] for more details and references. The output of the mixture for an input vector x is

f(x) = \sum_{m=1}^{M} w_m(x) s_m(x)    (4)

where M is the number of experts (here single SVMs), s_m(x) is the output of the mth SVM (i.e., expert), and w_m(x) is the weight for the mth SVM. The w_m(x) form the gating network; they are defined by a Multi-Layer Perceptron (MLP) and trained using a backpropagation algorithm to minimize the cost function

\sum_{i=1}^{N} [f(x_i) - y_i]^2.    (5)

The mixture model is trained [4] by first dividing the training set into M random subsets of size near N/M. Then the single SVMs are trained separately, each over one of these subsets. Next, the SVMs are kept fixed and the weights w_m(x) are trained to minimize (5). In the following step the examples (x_i, y_i) are reassigned to the experts: for each example the experts are sorted in descending order according to the values of w_m(x_i), and the example is then assigned to the first expert in the list which has fewer than (N/M + 1) examples already assigned. The experts (SVMs) are then trained again over the new subsets. This can be repeated several times. If the number M of SVMs grows with the total number of training vectors N such that N/M remains constant, the total training time for the mixture of SVMs scales nearly linearly with N [4].
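The following is a rough, self-contained sketch of this procedure, not the authors' implementation: random partitioning, per-subset SVM training, gate training on Eq. (5), and capacity-limited reassignment. For brevity the gate is a single softmax layer trained by plain gradient descent rather than the MLP-with-backpropagation gate of [4], scikit-learn stands in for Torch [3], and all hyperparameters (M, γ, C, learning rate) are arbitrary toy choices:

```python
import numpy as np
from sklearn.svm import SVC

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_svm_mixture(X, y, M=4, rounds=3, gamma=0.05, C=10.0, lr=0.01, gate_epochs=5):
    """Sketch of the parallel SVM mixture: random subsets -> per-subset SVMs ->
    gate minimizing sum_i (f(x_i) - y_i)^2 -> reassignment -> retraining."""
    N, d = X.shape
    cap = N // M + 1                                    # at most (N/M + 1) examples per expert
    rng = np.random.default_rng(0)
    assign = rng.integers(M, size=N)                    # random initial subsets
    W, b = np.zeros((M, d)), np.zeros(M)                # softmax gate parameters (simplified)

    for _ in range(rounds):
        experts = [SVC(kernel="rbf", gamma=gamma, C=C).fit(X[assign == m], y[assign == m])
                   for m in range(M)]
        S = np.column_stack([e.decision_function(X) for e in experts])  # s_m(x_i), shape (N, M)

        for _ in range(gate_epochs):                    # gradient descent on Eq. (5)
            for i in range(N):
                w = softmax(W @ X[i] + b)
                f = float(w @ S[i])                     # mixture output of Eq. (4)
                dz = 2.0 * (f - y[i]) * w * (S[i] - f)  # gradient through the softmax gate
                W -= lr * np.outer(dz, X[i])
                b -= lr * dz

        counts = np.zeros(M, dtype=int)                 # reassign each example to its
        for i in range(N):                              # best-weighted expert with room left
            for m in np.argsort(-softmax(W @ X[i] + b)):
                if counts[m] < cap:
                    assign[i], counts[m] = m, counts[m] + 1
                    break
    return experts, (W, b)
```

In the real system each expert is a full SVM and the gate an MLP, but the control flow (partition, train experts, train gate, reassign) is the same.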


3.3. Estimation of posterior probabilities from the SVM mixture output


There is no clear relationship between the mixture output f(x) (see Eq. (4)) and the posterior class probability p(y = +1|x) that the pattern x belongs to the class y = +1. Although the output of an MLP trained to a global error minimum (such as the mean square error of Eq. (5)) can, under certain conditions, be a posterior probability p(y = +1|x) [1], we cannot use this fact here, because the MLP outputs w_m(x) serve only as weights for the SVM outputs (which, being distance measures, are not probabilities). We estimate the posterior class probability with the method of Platt [9] by modeling the distributions p(f|y = +1) and p(f|y = −1) of the SVM mixture output f(x) with Gaussians of equal variance and then computing the probability of the class given the output by using Bayes' rule. This yields the ansatz

p(y = +1|x) = g(f(x), A, B) \equiv \frac{1}{1 + \exp(A f(x) + B)}.    (6)

The parameters A and B of Eq. (6) are found [9] using maximum likelihood estimation from a training set (p_i, t_i), with p_i = g(f(x_i), A, B) and target probabilities t_i = (y_i + 1)/2; this training set may, but need not, be the same as the training set of the mixture of SVMs. This is done by minimizing the cross-entropy error function

-\sum_{i} [t_i \log(p_i) + (1 - t_i) \log(1 - p_i)]    (7)

using the algorithm of Platt [9], which is based on the Levenberg-Marquardt algorithm.
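As a concrete illustration of Eqs. (6) and (7), the sigmoid parameters can be fitted as sketched below; this sketch uses plain gradient descent instead of the Levenberg-Marquardt-based routine of Platt [9], and the synthetic mixture outputs are an assumption:

```python
import numpy as np

def fit_platt_sigmoid(f, y, lr=1e-3, epochs=5000):
    """Fit A, B of Eq. (6), p = 1 / (1 + exp(A*f + B)), by gradient descent on
    the cross-entropy of Eq. (7) with targets t_i = (y_i + 1) / 2."""
    t = (y + 1) / 2.0
    A, B = 0.0, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        g = t - p                       # d(cross-entropy)/d(A*f_i + B)
        A -= lr * np.sum(g * f)
        B -= lr * np.sum(g)
    return A, B

# Synthetic mixture outputs f and labels y on a held-out set (toy assumption).
rng = np.random.default_rng(1)
f = rng.normal(size=500)
y = np.where(f + 0.5 * rng.normal(size=500) > 0.0, 1, -1)
A, B = fit_platt_sigmoid(f, y)
posteriors = 1.0 / (1.0 + np.exp(A * f + B))   # estimates of p(y = +1 | x)
```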

3.4. Multi-class classification

Since the SVM mixtures are binary classifiers, we have to use a suitable scheme for more than two classes. We use the one-versus-rest approach (see [11] for an overview), where a mixture of SVMs learns to discriminate one class from all other classes. Having K classes means we need to train K mixtures of SVMs. Finally, the posteriors p(q|x) have to be transformed into the emission probabilities p(x|q) needed for the HMMs by using Bayes' rule,

p(x|q) \propto \frac{p(q|x)}{p(q)}    (8)

where the a-priori probability p(q) of class q is estimated by the relative frequency of the class in the training data.
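A minimal sketch of this conversion follows; the state counts and posteriors are placeholders, and the computation is done in the log domain as is usual in HMM decoders:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """Eq. (8) in the log domain: log p(x|q) = log p(q|x) - log p(q) + const,
    with the prior p(q) taken as the relative frequency of state q in the
    training alignment."""
    priors = state_counts / state_counts.sum()
    return log_posteriors - np.log(np.maximum(priors, floor))

# 145 HMM states, one frame of (made-up) posteriors from the SVM mixtures
rng = np.random.default_rng(4)
counts = rng.integers(100, 10_000, size=145).astype(float)
posteriors = rng.dirichlet(np.ones(145))
print(scaled_log_likelihoods(np.log(posteriors), counts)[:5])
```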

4. Experiments

4.1. Experimental setup

Our implementation of a hybrid HMM system using acoustic models other than Gaussian mixture models is based on the Hidden Markov Toolkit (HTK) [16]. We use the DARPA Resource Management (RM1) corpus [15]. Preprocessing is performed by a short-time FFT with a frame width of 25 ms and a frame shift of 10 ms, followed by a cepstral analysis. The resulting feature vectors contain 12 cepstral coefficients, the frame energy, and the first and second order time differences (altogether 39 components). We use the standard 72-speaker training set, and our systems were evaluated on the speaker-independent Feb'89 test set [15]. We use the standard RM word-pair grammar. First we compute a time alignment using a standard Gaussian mixture HMM decoder to obtain the state (label) for each feature vector. Using monophone models with three HMM states each and an optional pause model sp (to model the pauses between words), we have 3 × 48 + 1 = 145 different states altogether.
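For orientation, a comparable 39-dimensional front end can be sketched with librosa; the paper itself uses HTK's feature extraction, so this is only an approximation, the file path is hypothetical, and librosa's c0 coefficient stands in for the frame energy:

```python
import numpy as np
import librosa

def features_39(wav_path, sr=16000):
    """25 ms frames, 10 ms shift, 13 cepstral coefficients (c0 ~ energy) plus
    first- and second-order deltas -> (num_frames, 39) feature matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms analysis window
                                hop_length=int(0.010 * sr))  # 10 ms frame shift
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T

# X = features_39("some_rm1_utterance.wav")   # hypothetical file path
```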

4.2. Results

The mixtures of SVMs are trained either with the full training set (about N = 988,000 feature vectors) or with a diluted set (as an approximation, using every tenth feature vector). The posteriors are estimated using the same set of feature vectors as used for the training. The baseline for our experiments is a three-state HMM system trained on the full training set with GMMs of 8 mixture components, resulting in a word accuracy of 91.96%, comparable with the usual results of monophone models [15]. In our experiments we use the one-vs-rest approach. Having 145 classes (HMM states), we have to train 145 mixtures of SVMs. For each mixture we use 30 SVMs (these are the experts, each trained with N/30 training vectors on average). The gating network (w_m(x) in Eq. (4)) is an MLP with 90 hidden neurons. The ratio between the number of hidden neurons and the number of SVMs per mixture is similar to the optimal one used in [4].

  Classifier (in a one-state HMM)      Word accuracy
  Baseline (GMM with 8 mixtures)       91.96 %
  Mixture of SVMs, full set            92.23 %
  Mixture of SVMs, diluted (1/10)      85.94 %

Table 1. Results for the mixture of SVMs (one-vs-rest approach) trained with the full set and with a diluted set, compared with the baseline.

As given in Tab. 1, the mixture of SVMs trained on the full set is slightly better than the baseline with Gaussian mixtures, whereas the diluted system falls clearly behind.

5. Conclusion

We use mixtures of SVMs in a hybrid SVM-mixture/HMM system for continuous speech recognition, which uses HMMs to model the temporal dynamics of speech and mixtures of SVMs for the classification of speech frames into single speech units (phonemes). Using a one-versus-rest multi-class classification approach, the output of the SVM mixtures serves to estimate the emission probabilities in an HMM-based decoder. Our system was tested on the speaker-independent DARPA RM1 corpus, and we obtained an improvement compared to the baseline of Gaussian mixtures. Although the improvement is only small, the result is nevertheless interesting, because we would expect a better-generalizing classifier (such as our mixture of SVMs) to be particularly good under noisy conditions (as indicated, e.g., in [13]), so that any improvement on a clean-speech corpus such as RM1 is promising. Our approach yields slightly worse results than the one-vs-one approach (using single SVMs) of [7], but it has more potential for application to more state-of-the-art systems: not only does its training time scale only nearly linearly with the number of feature vectors, but our one-vs-rest approach also makes the use of triphone systems easier (contrary to one-vs-one: having, e.g., 2000 triphone classes, the one-vs-one approach would need to train 2 million pairs of classes). Further work will concentrate on incorporating triphone models into our system, applying it to larger problems, and testing it under noisy acoustic conditions.

References

[1] H. Bourlard and N. Morgan. Hybrid HMM/ANN systems for speech recognition: Overview and new research directions. In C. L. Giles and M. Gori, editors, Adaptive Processing of Sequences and Data Structures, Lecture Notes in Artificial Intelligence (1387), pages 389-417. Springer Verlag, 1998.

[2] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
[3] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. The Journal of Machine Learning Research, 1:143-160, 2001.
[4] R. Collobert, S. Bengio, and Y. Bengio. Parallel mixture of SVMs for very large scale problems. Neural Computation, 14:1105-1114, 2002.
[5] S. E. Golowich and D. X. Sun. A support vector/hidden Markov model approach to phoneme recognition. In ASA Proceedings of the Statistical Computing Section, pages 125-130, 1998.
[6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
[7] S. E. Krüger, M. Schafföner, M. Katz, E. Andelic, and A. Wendemuth. Speech recognition with support vector machines in a hybrid system. In Proc. INTERSPEECH-2005, pages 993-996, 2005.
[8] Z. Lei, Y. Yang, and Z. Wu. Mixture of support vector machines for text-independent speaker recognition. In Proc. INTERSPEECH-2005, pages 2041-2044, 2005.
[9] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
[10] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, 1993.
[11] R. Rifkin and A. Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101-141, Dec. 2004.
[12] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[13] J. Stadermann and G. Rigoll. A hybrid SVM/HMM acoustic modeling approach to automatic speech recognition. In Proc. INTERSPEECH-2004 ICSLP, pages 661-664, 2004.
[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[15] P. Woodland and S. Young. The HTK tied-state continuous speech recogniser. In Proceedings Eurospeech, pages 2207-2210, 1993.
[16] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book. Cambridge University Engineering Department, 2002.