Multiple Regression Hidden Markov Models

MULTIPLE-REGRESSION HIDDEN MARKOV MODEL

Katsuhisa Fujinaga, Mitsuru Nakai, Hiroshi Shimodaira and Shigeki Sagayama

Japan Advanced Institute of Science and Technology
Tatsu-no-Kuchi, Ishikawa, 923-1292 Japan
{kfujina,mit,sim,sagayama}@jaist.ac.jp

ABSTRACT

This paper proposes a new class of hidden Markov model (HMM), called the multiple-regression HMM (MR-HMM), that utilizes auxiliary features such as fundamental frequency ($F_0$) and speaking styles that affect spectral parameters to better model the acoustic features of phonemes. Although such auxiliary features are usually regarded as factors that degrade the performance of speech recognizers, the proposed MR-HMM adapts its model parameters, i.e. the mean vectors of the output probability distributions, depending on this auxiliary information to improve recognition accuracy. The formulation of parameter reestimation for the MR-HMM based on the EM algorithm is given in the paper. Experiments on speaker-dependent isolated word recognition demonstrated that MR-HMMs using $F_0$-based auxiliary features reduced the error rates by more than [...] compared with conventional HMMs.

1. INTRODUCTION

Spectral parameters of phonemes are influenced by a number of factors: not only gender, speaker and context, but also speaking style, fundamental frequency ($F_0$) and so on. The challenge of improving the recognition accuracy of HMMs can be regarded as the problem of how to neutralize the influence of those factors that degrade recognition performance. So far, a number of efforts have been made towards speaker adaptation (MAP [1], VFS [2], MLLR [3]) and context-dependent modeling (HMNet [4]), while only a few have addressed speaking-style and $F_0$ adaptation or normalization.

In spoken dialogue systems, even a single speaker may speak in many different styles. For example, when the system misrecognizes the speech, the user tends to speak more clearly and slowly to emphasize the misrecognized words. Speaking styles that differ from the normal utterance style lower the accuracy of the recognizer [5]. A first step towards solving this problem is to use separate acoustic models for specific speaking styles [6].
The next step, which is discussed in this paper, is to explore adaptation or normalization techniques, ideally online or frame-synchronous adaptation of the HMM to utterance variations. Based on the knowledge that spectral features are correlated with $F_0$, Singer and Sagayama [7] showed that spectrum normalization by a phoneme-wise linear regression model between $F_0$ and cepstral features could improve phoneme recognition accuracy. This approach assumed that the regression coefficients did not change within a phoneme. It would be more natural, however, to let the coefficients vary with the change of spectral features even within a phoneme. To realize such dynamic processing, spectrum normalization by $F_0$ and phoneme recognition should be carried out at the same time. This can be achieved by embedding the adaptation or normalization operation into the HMM formulation, in other words, by developing a new class of HMM that adapts its model parameters depending on $F_0$ or other auxiliary features. Among this class of HMMs, the multiple-regression HMM (MR-HMM), which employs multiple regression to modify the model parameters, i.e. the mean vectors of normal distributions, is discussed here. It should be noted that the proposed MR-HMM is completely different from the existing autoregressive HMM (AR-HMM) [8], which assumes that observation vectors are drawn from an autoregressive process.

This paper is organized as follows. The next section describes the basic formulation of the MR-HMM and the EM-based parameter reestimation algorithm. The third section presents experimental results of speaker-dependent isolated word recognition. The last section is devoted to conclusions.

2. MULTIPLE-REGRESSION HMM

2.1. Outline of MR-HMM

Fig. 1 shows the correlation between $F_0$ and the 7th mel-cepstral coefficient (MCEP) of a phoneme sample s/e/i (/e/ preceded by /s/ and followed by /i/). It can be seen from the figure that the 7th MCEP has a negative correlation with $F_0$. This sort of influence of $F_0$ has been observed on formant frequencies of vowels, and it has been explained from the biomechanical and phonological points of view [9]. This evidence implies that $F_0$ can help to recover the original spectral features from the observed spectral parameters.
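The negative correlation seen in Fig. 1, and the phoneme-wise linear regression normalization mentioned in the introduction, can be illustrated with a small Python sketch. It is only a sketch under assumptions: the frame values are synthetic and not taken from the paper, the helper names `fit_line` and `normalize` are hypothetical, and the exact normalization procedure of [7] is not given in this text, so a plain least-squares line is assumed.

```python
from statistics import fmean

def fit_line(x, y):
    """Least-squares fit y ~ intercept + slope * x for the frames of one phoneme."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def normalize(x, y, slope, x_ref):
    """Remove the F0-dependent part of y by mapping every frame to a reference log F0."""
    return [yi - slope * (xi - x_ref) for xi, yi in zip(x, y)]

# Synthetic (made-up) frames for one phoneme: log F0 and the 7th MCEP.
log_f0 = [1.8, 1.9, 2.0, 2.1, 2.2]
mcep7 = [0.40, 0.15, -0.10, -0.35, -0.60]

intercept, slope = fit_line(log_f0, mcep7)          # slope is negative for this data
normalized = normalize(log_f0, mcep7, slope, x_ref=2.0)
```

With a single slope per phoneme, the normalization is fixed for all frames of that phoneme, which is the limitation that motivates moving the regression inside the HMM.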
[Fig. 1. Correlation between $F_0$ and MCEP. Horizontal axis: log $F_0$; vertical axis: 7th MCEP; phoneme sample s/e/i.]

[Fig. 2. Upper three panels: Mean plotted against $F_0$. Bottom panel: the $i$-th MCEP and $F_0$ over time, with the state means of the MR-HMM and the HMM.]

Since the correlation between the spectral features and $F_0$ varies depending on contexts, phonemes or sub-phonemes, normalization of spectral parameters by $F_0$ should be done simultaneously with speech recognition. The MR-HMM provides such a framework by changing its parameters according to $F_0$ and other possible features. Fig. 2 shows the basic idea of the MR-HMM. The bottom box in the figure illustrates the $i$-th MCEP coefficient as an input feature to the HMM, $F_0$ as an auxiliary feature, and the mean values of the output probability distributions of each state of both the MR-HMM and the standard HMM. In the framework of conventional HMMs, the mean value of each state does not change, whereas in the MR-HMM the mean value at a given time instant changes along the regression line (the upper three boxes in the figure) given as a function of $F_0(t)$, i.e. the fundamental frequency at time $t$. Let $\mu$ be an element of a mean vector; then $\mu$ at time $t$, written $\mu(t)$, is modeled as

$$\mu(t) = \beta_0 + \beta_1 F_0(t) \qquad (1)$$

where $\beta_0$ and $\beta_1$ are the regression coefficients. In the general case where $p$ auxiliary features $\xi_1(t), \ldots, \xi_p(t)$ (predictor variables in terms of multiple regression) are given, the above formulation is rewritten as

$$\mu(t) = \beta_0 + \beta_1 \xi_1(t) + \cdots + \beta_p \xi_p(t)$$

... to the point that MLLR uses the same feature parameters as those used for recognition, while the MR-HMM utilizes auxiliary features that are not used directly for recognition but for adapting the model parameters on-line.

2.2. Probability evaluation in MR-HMM

Since the MR-HMM differs from the standard HMM only in that the former uses auxiliary features to calculate probability distributions while the latter does not, most of its formulation is the same as that of the HMM. The notations $a_{ij}$ (state transition probability) and $\pi_i$ (initial state probability) used here therefore have the same meanings as in the HMM. Let $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ be the mean vector and the covariance matrix of a Gaussian distribution, respectively. The output probability density function $b_i(\mathbf{x}_t \mid \boldsymbol{\xi}_t)$ of state $i$, given an $N$-dimensional observation vector $\mathbf{x}_t = (x_1(t), x_2(t), \ldots, x_N(t))^{\top}$ and a $p$-dimensional auxiliary vector $\boldsymbol{\xi}_t = (\xi_1(t), \xi_2(t), \ldots, \xi_p(t))^{\top}$, is defined as

$$b_i(\mathbf{x}_t \mid \boldsymbol{\xi}_t) = \ldots$$

where $\boldsymbol{\mu}$ ...
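As an illustration of the multiple-regression mean, and of evaluating a Gaussian output density around it, the following Python sketch may help. It rests on stated assumptions rather than on the paper's full definition: the density is taken to be an ordinary Gaussian whose mean is replaced by the regression mean, the covariance is assumed diagonal, and the function names `regression_mean` and `log_output_density`, the coefficient layout and all numeric values are hypothetical.

```python
import math

def regression_mean(coeffs, aux):
    """Time-varying mean vector: for each dimension, b0 + b1*xi_1 + ... + bp*xi_p.

    coeffs: one list [b0, b1, ..., bp] per feature dimension (assumed layout).
    aux:    auxiliary vector [xi_1, ..., xi_p] observed at the current frame.
    """
    return [row[0] + sum(b * xi for b, xi in zip(row[1:], aux)) for row in coeffs]

def log_output_density(obs, aux, coeffs, variances):
    """Log of a diagonal-covariance Gaussian whose mean depends on aux (assumed form)."""
    mean = regression_mean(coeffs, aux)
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - m) ** 2 / var)
        for x, m, var in zip(obs, mean, variances)
    )

# Made-up parameters for one state: 2 feature dimensions, 1 auxiliary feature (log F0).
coeffs = [[4.9, -2.5], [1.0, 0.5]]
variances = [0.04, 0.09]

mean_low = regression_mean(coeffs, [1.8])    # state mean at a low log F0
mean_high = regression_mean(coeffs, [2.2])   # state mean at a high log F0

# The same observation scores differently depending on the auxiliary value.
ll_match = log_output_density(mean_low, [1.8], coeffs, variances)
ll_mismatch = log_output_density(mean_low, [2.2], coeffs, variances)
```

In this sketch the auxiliary vector never enters the observation itself; it only shifts the state mean frame by frame, which matches the role the text assigns to auxiliary features.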