IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001


Transformation-Based Bayesian Predictive Classification Using Online Prior Evolution

Jen-Tzung Chien, Member, IEEE, and Guo-Hong Liao

Abstract—The mismatch between training and testing environments makes it necessary for speech recognizers to be adaptive in both acoustic modeling and decision making. Accordingly, the speech hidden Markov models (HMMs) should be able to incrementally capture the evolving statistics of environments using online available data. It is also necessary for speech recognizers to exploit a robust decision strategy that takes the uncertainty of parameters into account. This paper presents a transformation-based Bayesian predictive classification (TBPC) in which the uncertainty of the transformation parameters of the HMM mean vector and precision matrix is adequately represented by a joint multivariate prior density of normal-Wishart form belonging to the conjugate family. The formulation of the TBPC decision is correspondingly constructed. Owing to the benefit of the conjugate density, we generate a reproducible prior/posterior pair such that the hyperparameters of the prior density can evolve successively to new environments using online test/adaptation data. The evolved hyperparameters suitably describe the parameter uncertainty for the TBPC decision. Therefore, a novel framework of TBPC geared with online prior evolution (OPE) capability is developed for robust speech recognition. This framework is shown to be both effective and efficient on the recognition task of connected Chinese digits in hands-free car environments.

Index Terms—Hidden Markov model, multivariate t distribution, online prior evolution, speech recognition, transformation-based Bayesian predictive classification.

I. INTRODUCTION

THERE is no doubt that the robustness issue is crucial in the area of speech recognition because the mismatch between training and testing data always exists and degrades the recognition performance considerably. The mismatch sources may arise from the variabilities of inter- and intra-speakers, transducers/channels and surrounding noises. For example, when the speech recognizer is designed for hands-free control of in-car equipment including the cellular phone, air conditioner, global positioning system, etc., the surrounding noises of engine, music, and babble under different driving speeds will deteriorate the performance of the recognizer. The change of the speaker's voice caused by an abrupt alteration of the car noise level (known as the Lombard effect) will also damage the recognizer [23]. As a result, several problems need to be solved to compensate the performance loss in statistical speech recognition.

Manuscript received January 14, 2000; revised September 7, 2000. This work was supported in part by the National Science Council, Taiwan, R.O.C., under contract NSC89-2213-E-006-078. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dirk van Compernolle. The authors are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan, R.O.C. (e-mail: [email protected]). Publisher Item Identifier S 1063-6676(01)02740-7.

Generally, acoustic modeling and the decision strategy are the two directions to be dealt with to achieve recognition robustness. In this study, we develop an adaptive speech recognition system where the varying statistics of environments are continuously traced to parameterize the acoustic mismatch and, simultaneously, the inevitable uncertainty of parameters is compensated to build a robust decision strategy [13], [24], [25].

Traditionally, the acoustic modeling using the hidden Markov model (HMM) technique is done in the training phase and the trained HMMs are kept unchanged for speech recognition in spite of the varying environments. Practically, the speech HMMs should be adapted to new environments using environment-specific data. In the literature, the maximum a posteriori (MAP) adaptation of HMMs [10] and maximum likelihood linear regression (MLLR) [26] provided two approaches to adaptive acoustic modeling. The MAP adaptation directly adjusted the HMM parameters using MAP estimation. It achieved a good asymptotic property in the case of sufficient adaptation data. The MLLR indirectly adapted the HMM parameters through cluster-dependent transformation functions in which the parameters were obtained via maximum likelihood (ML) estimation. This transformation-based adaptation could perform well for any amount of adaptation data if the sharing of transformation parameters was dynamic according to a tree structure of HMMs [30]. Basically, these two approaches are feasible for batch adaptation. However, realistic environments are nonstationary because the environmental statistics evolve naturally as time goes on. It is usually difficult to capture the newest statistics using batch data. Instead of waiting for long batch data, it is preferable to perform the adaptation incrementally using online available data. Hence, online/incremental adaptation has attracted intensive research interest [3], [9], [12], [14], [17]. Huo and Lee [17] applied the quasi-Bayes (QB) principle to design a recursive algorithm for incrementally updating the HMM parameters and the hyperparameters of an approximate posterior distribution. This algorithm can be viewed as an incremental variant of MAP adaptation [10]. Recently, we presented an online transformation-based adaptation (known as online transformation) to achieve desirable performance on incremental adaptation for sparse as well as sufficient adaptation data [3].

On the other hand, the robust decision rule plays an important role in robust speech recognition. This is because the conventional plug-in MAP rule searches the optimal word sequence of a test utterance assuming that the pretrained parameters are true values, which are constant during MAP decoding. In fact, the actual parameters are random and disturbed due to the mismatches



and errors in parameter modeling and estimation. The robust decision strategy is intended to incorporate the uncertainty of parameters into the speech recognizer. In previous works, the minimax classification was presented to minimize the worst-case probability of error [27]. This approach was exploited to protect against the possibility of the worst mismatch. In addition, the Bayesian predictive classification (BPC) provided a fascinating approach to characterize the uncertainty of parameters using a general prior probability density function (pdf) [19], [20]. Using BPC, the decision of the pattern classifier was based on the predictive distribution [2], [11], [28]. This predictive pdf was averaged over the uncertainty of parameters instead of riskily using the point estimate of parameters. In the HMM framework, the predictive distribution was approximated via a Viterbi decoding algorithm where a less informative prior pdf of the HMM mean component using the constrained uniform distribution was adopted [21]. Further, a conjugate prior using a Gaussian density was applied to model the uncertainty of the mean component [22]. More recently, Surendran and Lee [33] applied the BPC technique to the scheme of transformation-based adaptation where the uncertainty of the transformation parameter of the HMM mean component was described by a Gaussian pdf. This method was also extended to tackle the disturbance of the transformation parameters of individual HMM mean and precision (or inverse variance) components through a joint normal-gamma distribution [32]. In their implementation, the BPC algorithm acted as a variant of model adaptation, called Bayesian predictive adaptation.

In this paper, we would like to combine the strengths of online transformation-based adaptation and the BPC-based decision strategy to accomplish a higher degree of robustness in speech recognition. Our goal is to develop a robust statistical decision theory where the prior statistics of the parameters evolve incrementally to fit the nonstationary environments using online test/adaptation data [16], [22]. After refreshing the prior statistics, the recognizer discards the current data and waits for new data for the next refreshment. The original contribution of this work is the development of transformation-based BPC (TBPC) equipped with the online prior evolution (OPE) capability, referred to as the TBPC-OPE strategy. Instead of modeling the parameter uncertainty of individual HMM mean and precision components, we universally represent the joint multivariate pdf of the transformation parameters of the HMM mean vector and precision matrix by a normal-Wishart density belonging to the natural conjugate prior family. The choice of a conjugate prior pdf also results in a pooled posterior pdf corresponding to the same normal-Wishart density. Using the repeated prior evolution and posterior pooling, we exploit a recursive algorithm to incrementally refresh the hyperparameters for the TBPC decision. The formulation of the TBPC-OPE strategy is theoretically established within a joint quasi-Bayes (QB) estimation of word transcriptions and transformation parameters. On the recognition of connected Chinese digits in car environments, we find that the proposed algorithm obtains better performance than previous algorithms for various driving conditions.

This paper is organized as follows. In the next section, we present the general approach to the TBPC-OPE strategy by means of a joint QB estimation algorithm. In Sections III and IV, we respectively address the formulations of the TBPC decision and OPE processing.

A Bayesian predictive likelihood measure averaging over the parameter uncertainty is therein derived to generate the TBPC decision. In Section V, the databases for car speech recognition are introduced. A number of experiments are conducted to evaluate the asymptotic property and the effects of hyperparameter sensitivity, transformation functions and multiple transformation clusters for the TBPC-OPE strategy. Comparisons with other methods are also reported. Finally, the conclusions drawn from this study are given in Section VI.

II. GENERAL FORMULATION

For the sake of reinforcing speech recognition performance, it is necessary to cope with the robustness of the decision strategy as well as the learning capability of HMMs. Huo and Lee [16] jointly applied the online adaptation scheme for learning HMM parameters and the BPC rule for robust statistical decision. The link between online adaptation and the BPC rule was not detailed. In [22], the BPC algorithm was improved via sequential Bayesian learning of continuous density HMMs (CDHMMs). The prior pdf of the HMM mean component was assigned a Gaussian function so that the performance of BPC could be sequentially raised using the improved prior understanding of the HMM mean. In this study, the transformation-based BPC is explored to tackle the uncertainty of the transformation parameters of the full HMM mean vector and precision matrix. Due to the usage of a conjugate prior pdf in parameter modeling, we solidly build a hybrid TBPC-OPE strategy where the same prior pdf is applied to the TBPC decision to model parameter uncertainty and to the OPE processing to trace environmental statistics.

In a practical human-machine interaction system, the speech data of a test speaker are usually abundant and observed in an incremental fashion. Let $X^n = \{X_1, X_2, \ldots, X_n\}$ be i.i.d. and incrementally collected data, which are used to adjust the existing speaker-independent (SI) HMMs to fit the nonstationary environments. Using the online transformation [3], the incremental data $X^n$ are employed to estimate the parameters $\Lambda$ of the transformation function so as to compensate the acoustic variabilities. After observing the current sentence $X_n$ with unknown transcription $W_n$, the approximate MAP (or QB) estimates of the word transcription $\hat{W}_n$ and transformation parameters $\hat{\Lambda}_n$ are obtained by [3], [17]

$(\hat{W}_n, \hat{\Lambda}_n) = \arg\max_{W_n, \Lambda}\ p(X_n \mid W_n, \Lambda)\, p(W_n)\, g(\Lambda \mid \varphi^{(n-1)})$   (1)

where the posterior pdf of the previous data $p(\Lambda \mid X^{n-1})$ is approximated by the closest tractable prior pdf $g(\Lambda \mid \varphi^{(n-1)})$ with the hyperparameters $\varphi^{(n-1)}$ refreshed according to the history data $X^{n-1}$. Because of the undetermined word sequence $W_n$, the unsupervised learning of parameters is completed by jointly estimating the optimal word sequence and transformation parameters within a Bayesian framework [4]. In essence,


Fig. 1. Block diagram of the TBPC-OPE strategy for speech recognition. A switch is used to select whether the estimated or the actual data transcription is employed for prior evolution. Online adaptive HMMs can be optionally generated for traditional plug-in MAP decision.

the parameters $W_n$ and $\Lambda$ are independent such that the estimation in (1) can be separated into two stages

$\hat{W}_n = \arg\max_{W_n}\ p(X_n \mid W_n, \hat{\Lambda}_{n-1})\, p(W_n)$   (2)

$\hat{\Lambda}_n = \arg\max_{\Lambda}\ p(X_n \mid \hat{W}_n, \Lambda)\, g(\Lambda \mid \varphi^{(n-1)})$   (3)

where $p(W_n)$ is the prior knowledge of $W_n$ corresponding to the language model and $g(\Lambda \mid \varphi^{(n-1)})$ denotes the prior pdf of $\Lambda$. The first stage of the QB estimation is engaged to search the most likely word transcription $\hat{W}_n$ of the current data $X_n$. In this stage, the transformation parameters are supposed to be exact and plugged in the MAP decoder. Given $\hat{W}_n$, the HMM parameters are transformed by using the parameters estimated by (3). Based on this joint QB estimate with initial hyperparameters $\varphi^{(0)}$, we can estimate the transformation parameters by substituting $X_n$ and its associated optimal transcription $\hat{W}_n$ into (3). The hyperparameters are then refreshed and stored for the estimation of the next parameters $\hat{\Lambda}_{n+1}$. Accordingly, a recursive QB estimation of the parameter sequences $\{\hat{W}_n\}$ and $\{\hat{\Lambda}_n\}$ can be established to incrementally trace the environments. In [3], [17], the incremental adaptation skills were presented for supervised speaker adaptation where the word transcriptions were provided beforehand. In this study, we generalize the learning strategy to be not only incremental but also unsupervised. The decision of the optimal word sequence in (2) is robustly fulfilled using the TBPC rule.

Conventionally, the transformation parameters $\Lambda$ are estimated using ML [26], [29], MAP [5], [6], [31], or QB [3] principles and plugged in the MAP decoder of (2) to determine the optimal word sequence. The point estimate is embedded in the computation of the likelihood. Without loss of generality, we may express the likelihood by $p(X_n \mid W_n, \Lambda)$. According to the TBPC decision, the uncertainty of parameters is instead merged into the decision of the word sequence. The MAP decoder in (2) is then modified to produce the following TBPC rule

$\hat{W}_n = \arg\max_{W_n}\ \tilde{p}(X_n \mid W_n)\, p(W_n)$   (4)

where the likelihood is determined by the predictive distribution

$\tilde{p}(X_n \mid W_n) = \int p(X_n \mid W_n, \Lambda)\, g(\Lambda \mid \varphi^{(n-1)})\, d\Lambda.$   (5)
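To make the two-stage recursion of (1)-(3) and the TBPC rule of (4), (5) concrete, a minimal sketch of the online loop follows. The routines `decode_tbpc` and `refresh_hyperparameters` are hypothetical stand-ins for the decoder of (4) and the QB update of (3); this is an illustration of the control flow, not the authors' implementation.

```python
# Skeleton of the recursive QB estimation in (1)-(5): decode each incoming
# sentence with the TBPC rule, then refresh the prior hyperparameters.
# decode_tbpc() and refresh_hyperparameters() are hypothetical stand-ins.

def tbpc_ope_loop(sentences, hyper, decode_tbpc, refresh_hyperparameters):
    """Run TBPC decoding with online prior evolution over a data stream.

    sentences: iterable of feature matrices X_1, X_2, ... (one per utterance)
    hyper:     initial hyperparameters phi^(0) of the prior g(Lambda | phi)
    Returns the decoded transcriptions W_1, W_2, ...
    """
    transcriptions = []
    for X_n in sentences:
        # Stage 1, Eq. (4): predictive decoding with the current prior.
        W_n = decode_tbpc(X_n, hyper)
        # Stage 2, Eq. (3): QB refreshment of the transformation prior given
        # the decoded transcription (unsupervised prior evolution).
        hyper = refresh_hyperparameters(X_n, W_n, hyper)
        transcriptions.append(W_n)
        # X_n is then discarded; only the refreshed hyperparameters are kept.
    return transcriptions
```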

Notably, the merit of this algorithm is revealed in the prior pdf of the parameters $g(\Lambda \mid \varphi^{(n-1)})$, which is commonly and simultaneously applied to the TBPC rule in (4), (5) for speech recognition and to the online transformation scheme in (3) to attain the OPE capability. Therefore, based on the joint QB estimates in (4) and (3), we construct the universal decision strategy depicted in Fig. 1. The input incremental data could be either adaptation or recognition data, provided with or without word transcription. A switch is used to control whether supervision is available for prior evolution. Generally, there are several realizations released from this universal strategy. In [16], [22], the adaptation data collected in off-line and supervised manners were applied to OPE processing. The evolved hyperparameters were kept unchanged for BPC-based speech recognition. Such a realization is viewed as a special case of our universal strategy. In fact, the most fascinating realization is designed to incrementally apply the TBPC decision to recognize online test data and simultaneously utilize the recognized transcription to perform unsupervised prior evolution. As shown in Fig. 1, the online adaptive HMMs can be optionally produced and plugged in the MAP decoder to obtain the online transformation scheme.

III. TRANSFORMATION-BASED BAYESIAN PREDICTIVE CLASSIFICATION

A. Background of Transformation-Based Adaptation

Consider CDHMMs with $N$ states and $K$ mixture components. For state $i$, $1 \le i \le N$, the state observation pdf of sample $x_t$ of utterance $X$ is defined by a mixture of multivariate Gaussian pdfs

$p(x_t \mid i) = \sum_{k=1}^{K} \omega_{ik}\, \mathcal{N}(x_t \mid \mu_{ik}, r_{ik})$   (6)

where $\omega_{ik}$ is the mixture gain subject to the constraint $\sum_{k=1}^{K} \omega_{ik} = 1$ and $\mathcal{N}(x_t \mid \mu_{ik}, r_{ik})$ is the Gaussian pdf with $d$-dimensional mean vector $\mu_{ik}$ and precision matrix $r_{ik}$. When the transformation-based adaptation is applied to resolve environmental mismatch, we use environment-specific utterances to extract the cluster-dependent transformation functions $G_c$ with parameters $\eta_c = \{b_c, A_c\}$ to adapt/transform $C$ clusters of CDHMMs to new environments. In [6], the HMM mean vectors were transformed by a regression function and the prior pdf of transformation parameters was chosen from a family of elliptically symmetric matrix variate distributions. In this study, the transformation function is constrained to adding the bias vector $b_c$ to the HMM mean vector and multiplying the scaling matrix $A_c$ to the HMM precision matrix. The HMM mixture gain $\omega_{ik}$ is assumed to be unadapted. The transformation function is defined by [3]

$G_c(\mu_{ik}, r_{ik}) = (\mu_{ik} + b_c,\ A_c r_{ik})$   (7)

where the HMM unit with indices $i$ and $k$ is hypothetically attributed to the $c$th cluster, i.e., $c_{ik} = c$. The likelihood of $x_t$ given HMM unit $(i, k)$ and transformation parameters $\eta_c$ becomes

$p(x_t \mid \mu_{ik}, r_{ik}, \eta_c) = \mathcal{N}(x_t \mid \mu_{ik} + b_c,\ A_c r_{ik}).$   (8)

Furthermore, we carefully choose the prior pdf of the transformation parameters to be a multivariate normal-Wishart density corresponding to the conjugate distribution family [3], [7]

$g(b_c, A_c \mid \varphi_c) \propto |A_c|^{(\alpha_c - d)/2} \exp\left\{-\frac{\tau_c}{2}(b_c - \mu_c)^T A_c (b_c - \mu_c)\right\} \exp\left\{-\frac{1}{2}\operatorname{tr}(U_c A_c)\right\}$   (9)

where $\varphi_c = \{\mu_c, \tau_c, \alpha_c, U_c\}$ are the hyperparameters of the $c$th cluster such that $\tau_c > 0$, $\alpha_c > d - 1$, $\mu_c$ is a $d$-dimensional vector, and $U_c$ and $A_c$ are symmetric positive definite matrices. The notation $\operatorname{tr}(\cdot)$ represents the trace of a matrix.

B. Approach to TBPC Rule

In the TBPC rule, the decision on an unknown test pattern is made according to the predictive distribution, which is calculated by averaging over the uncertainty of parameters. However, the calculation of the predictive pdf in (5) is very difficult in TBPC decision. Huo et al. [19] used the Laplace method for integrals to approximate the predictive pdf. They also computed the predictive pdf by considering the nature of the missing data problem in HMMs. Accordingly, the unobserved state sequence $\mathbf{s}$ and mixture component sequence $\mathbf{k}$ corresponding to the observation data $X_n$ are incorporated to yield

$\tilde{p}(X_n \mid W_n) = \sum_{\mathbf{s}}\sum_{\mathbf{k}} \int p(X_n, \mathbf{s}, \mathbf{k} \mid W_n, \Lambda)\, g(\Lambda \mid \varphi^{(n-1)})\, d\Lambda$   (10)

where the summation over all possible $(\mathbf{s}, \mathbf{k})$ is approximated by substituting the most likely state sequence $\hat{\mathbf{s}}$ and mixture component sequence $\hat{\mathbf{k}}$. In [21], an approximate Viterbi Bayesian algorithm was presented to search the optimal unobserved sequences $(\hat{\mathbf{s}}, \hat{\mathbf{k}})$ and find the approximated predictive pdf. However, the Viterbi Bayesian search algorithm was still unable to exactly accomplish the Viterbi BPC rule. Alternatively, a straightforward approach to approximating the predictive pdf is to directly calculate the predictive pdf of each observation frame instead of using the conventional Gaussian pdf. The predictive pdf then serves as the new observation pdf and is applied to the optimal MAP decision in (2) so as to approximate the BPC rule. In [21], this approach was termed Bayesian predictive density based model compensation (BP-MC). In their experiments, the BP-MC and Viterbi BPC rules achieved comparable performance. Therefore, the BP-MC approach is herein adopted to realize the TBPC decision in (4), (5). Using this approach, the state observation pdf in (6) is altered by

$\tilde{p}(x_t \mid i) = \sum_{k=1}^{K} \omega_{ik}\, \tilde{p}(x_t \mid i, k)$   (11)

where the new observation pdf of a single frame $x_t$ is defined by

$\tilde{p}(x_t \mid i, k) = \int p(x_t \mid \mu_{ik}, r_{ik}, \eta_c)\, g(\eta_c \mid \varphi_c^{(n-1)})\, d\eta_c.$   (12)
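As a concrete reading of (11) and (12), the following sketch assembles the predictive state observation pdf by mixing per-component predictive densities in the log domain. The callable `log_bplm` is a hypothetical per-component routine of the kind derived in Section III-C; the object layout is illustrative only.

```python
import numpy as np

def log_predictive_state_pdf(x, state, log_bplm):
    """Predictive state observation pdf of (11): a mixture of the
    per-component predictive pdfs of (12), computed in the log domain.

    state.weights:    (K,) mixture gains omega_{ik}
    state.components: list of K per-component parameter bundles
    log_bplm(x, comp): per-frame predictive log-density, Eq. (12)
    """
    log_terms = [np.log(w) + log_bplm(x, comp)
                 for w, comp in zip(state.weights, state.components)]
    # log-sum-exp over mixture components for numerical stability
    m = max(log_terms)
    return m + np.log(sum(np.exp(t - m) for t in log_terms))
```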

Our TBPC decision is therefore established by substituting the predictive state observation pdf of (11) into the plug-in MAP decoder. It turns out that the kernel of the TBPC decision is the derivation of the Bayesian predictive likelihood measure in (12).

C. Derivation of Bayesian Predictive Likelihood Measure

In the Bayesian predictive likelihood measure (denoted by BPLM), the integral is operated over the admissible region of the parameters. By expressing the joint pdf of the dual parameters $\eta_c = \{b_c, A_c\}$ as $g(b_c \mid A_c)\, g(A_c)$, the BPLM becomes the following double integrals

$\tilde{p}(x_t \mid i, k) = \int_{A_c > 0} \int p(x_t \mid \mu_{ik}, r_{ik}, b_c, A_c)\, g(b_c, A_c \mid \varphi_c)\, db_c\, dA_c.$   (13)

To solve the double integrals, the first step is to determine the marginal pdf of $A_c$ derived by

$g(A_c) = \int g(b_c, A_c \mid \varphi_c)\, db_c \propto |A_c|^{(\alpha_c - d - 1)/2} \exp\left\{-\frac{1}{2}\operatorname{tr}(U_c A_c)\right\}.$   (14)

The marginal pdf has the form of a Wishart density with $\alpha_c$ degrees of freedom and precision matrix $U_c$. This is obtained using the property that a Gaussian pdf integrates to unity over the entire parameter space of $b_c$. When the joint prior pdf of (9) is divided by the marginal pdf of $A_c$ in (14), we yield the conditional pdf

$g(b_c \mid A_c) = \mathcal{N}(b_c \mid \mu_c, \tau_c A_c)$   (15)

which is a Gaussian function with mean vector $\mu_c$ and precision matrix $\tau_c A_c$. Having (15) and (8), the inner integral in (13) is expressed to be proportional to

$|A_c r_{ik}|^{1/2}\, |\tau_c A_c|^{1/2} \int \exp\left\{-\frac{1}{2}\left[(x_t - \mu_{ik} - b_c)^T A_c r_{ik} (x_t - \mu_{ik} - b_c) + \tau_c (b_c - \mu_c)^T A_c (b_c - \mu_c)\right]\right\} db_c.$   (16)

Because the exponent in (16) can be verified by [3], [7]

$(u - b_c)^T P (u - b_c) + (b_c - \mu_c)^T Q (b_c - \mu_c) = (b_c - w)^T (P + Q)(b_c - w) + (u - \mu_c)^T (P^{-1} + Q^{-1})^{-1} (u - \mu_c)$   (17)

where $u = x_t - \mu_{ik}$, $P = A_c r_{ik}$, $Q = \tau_c A_c$ and $w = (P + Q)^{-1}(P u + Q \mu_c)$, and the integral involving the Gaussian functional form of the first term of (17) equals unity, (16) becomes

$\tau_c^{d/2}\, |A_c|^{1/2}\, |r_{ik}|^{1/2}\, |r_{ik} + \tau_c I|^{-1/2}\, \exp\left\{-\frac{1}{2}(x_t - \mu_{ik} - \mu_c)^T\, \bar{r}_{ik}^{1/2} A_c\, \bar{r}_{ik}^{1/2}\, (x_t - \mu_{ik} - \mu_c)\right\}$   (18)

where $\bar{r}_{ik} = \tau_c\, r_{ik}\, (r_{ik} + \tau_c I)^{-1}$ and $A_c$ and $r_{ik}$ are taken to commute (e.g., both diagonal, as adopted in Section V) so that $(P^{-1} + Q^{-1})^{-1} = \bar{r}_{ik}^{1/2} A_c\, \bar{r}_{ik}^{1/2}$. By substituting (14) and (18) into (13), the BPLM can be arranged to be

$\tilde{p}(x_t \mid i, k) \propto |r_{ik}|^{1/2}\, |r_{ik} + \tau_c I|^{-1/2} \int_{A_c > 0} |A_c|^{((\alpha_c + 1) - d - 1)/2} \exp\left\{-\frac{1}{2}\operatorname{tr}(\tilde{U}_{ik} A_c)\right\} dA_c$   (19)

with $I$ being an identity matrix and $\tilde{U}_{ik}$ being defined by

$\tilde{U}_{ik} = U_c + \bar{r}_{ik}^{1/2}\, (x_t - \mu_{ik} - \mu_c)(x_t - \mu_{ik} - \mu_c)^T\, \bar{r}_{ik}^{1/2}.$   (20)

In (19), the integrand is proportional to a Wishart density of $A_c$ with $\alpha_c + 1$ degrees of freedom and precision matrix $\tilde{U}_{ik}$ [7]. The integral of this density equals unity, so the integral in (19) contributes only the corresponding normalization term. Thus, the BPLM has the form of

$\tilde{p}(x_t \mid i, k) \propto |r_{ik}|^{1/2}\, |r_{ik} + \tau_c I|^{-1/2}\, |\tilde{U}_{ik}|^{-(\alpha_c + 1)/2}.$   (21)

Applying the identity $|V + x x^T| = |V|\,(1 + x^T V^{-1} x)$ with nonsingular matrix $V$ and vector $x$ [7], we can finally show that

$\tilde{p}(x_t \mid i, k) \propto |r_{ik}|^{1/2}\, |r_{ik} + \tau_c I|^{-1/2}\, |U_c|^{-(\alpha_c + 1)/2} \left[1 + (x_t - \mu_{ik} - \mu_c)^T\, \bar{r}_{ik}^{1/2} U_c^{-1} \bar{r}_{ik}^{1/2}\, (x_t - \mu_{ik} - \mu_c)\right]^{-(\alpha_c + 1)/2}.$   (22)

If the frame-independent determinants on the right-hand side of (22) are skipped, the BPLM of observation frame $x_t$ is proportional to a $d$-dimensional multivariate t distribution with $v = \alpha_c - d + 1$ degrees of freedom, location vector $\mu_{ik} + \mu_c$ and precision matrix $(\alpha_c - d + 1)\, \bar{r}_{ik}^{1/2} U_c^{-1} \bar{r}_{ik}^{1/2}$. The BPLM serves as a novel observation pdf embedding the prior knowledge of the uncertainty of transformation parameters. Because the derived BPLM has a multivariate t distribution form instead of the popular Gaussian distribution, it becomes interesting to know the difference between the univariate Gaussian and t distributions. In Fig. 2, we plot the standardized Gaussian distribution and the t distribution

$f(z \mid v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{z^2}{v}\right)^{-(v+1)/2}$   (23)

where $\Gamma(\cdot)$ is the gamma function and $v$ is the degree of freedom. Herein, the cases of $v$ being 1, 2, 10, and 30 are investigated. We can see that the t distribution is symmetric about zero and similar to the Gaussian distribution in shape. Indeed, the t distribution approaches the Gaussian distribution when $v \to \infty$ [1]. For smaller $v$, the t distribution is flatter with thicker tails, and it equals the Cauchy distribution when $v = 1$.

Fig. 2. Comparison of Gaussian distribution and t distribution with different degrees of freedom $v$.
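The following sketch evaluates the per-frame BPLM of (22) under the diagonal simplification adopted in Section V. The variable names are illustrative, and the normalizing constant follows the multivariate t form identified above; this is a reading of the reconstructed formula, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def log_bplm(x, mu_ik, r_ik, mu_c, tau_c, alpha_c, U_c):
    """Per-frame Bayesian predictive log-likelihood of (22), assuming
    diagonal r_ik and U_c (the simplification adopted in Section V).
    All matrix quantities are passed as 1-D arrays of diagonal entries."""
    d = x.shape[0]
    e = x - mu_ik - mu_c                       # residual about the location
    r_bar = tau_c * r_ik / (r_ik + tau_c)      # \bar{r}_ik defined in (18)
    quad = np.sum(r_bar * e * e / U_c)         # e^T rbar^{1/2} U^{-1} rbar^{1/2} e
    v = alpha_c - d + 1                        # degrees of freedom, as in (22)
    # log of the multivariate t kernel plus its normalizing terms
    log_kernel = -0.5 * (alpha_c + 1) * np.log1p(quad)
    log_const = (gammaln(0.5 * (alpha_c + 1)) - gammaln(0.5 * v)
                 - 0.5 * d * np.log(np.pi)
                 + 0.5 * np.sum(np.log(r_bar) - np.log(U_c)))
    return log_const + log_kernel
```

With the experimental settings of Section V ($d = 26$, $\alpha_c = 55$), `v` evaluates to 30, consistent with the degrees of freedom reported there.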


IV. ONLINE PRIOR EVOLUTION

In addition to the robustness of the TBPC decision, another crucial feature of the proposed strategy is the capability of OPE. This capability provides the TBPC decision with the hyperparameters estimated from the incremental data such that the newest environmental statistics can be continuously traced. In [15], the sensitivity of the prior pdf was studied for BPC-based speech recognition. They evaluated how significantly the change of hyperparameters affects the performance of BPC. In [16], [22], the online adaptive HMMs were learnt in the adaptation session and their corresponding hyperparameters were applied to BPC decision in the test session. Huo and Ma [18] further proposed a technique of multiple-stream prior evolution and posterior pooling where the streams of prior pdfs obtained from QB [17] and incremental ML [9] were pooled together. This pooled posterior pdf gave better performance for online adaptive learning compared to the individual prior pdfs. In this study, we only use a single stream of prior pdf to establish the OPE scheme of transformation parameters. The OPE processing is accomplished via the online transformation of HMMs in (3) [3]. Different from previous works, the OPE scheme is rigidly coupled with the newly proposed TBPC decision and the coupled strategy can be efficiently completed using the test data.

In the QB estimate of (3), the transformation parameters $\Lambda_n$ at the $n$th incremental interval are estimated by maximizing the product of the current observation pdf and the prior pdf $g(\Lambda \mid \varphi^{(n-1)})$ with hyperparameters learnt from the history data $X^{n-1}$. Because the most likely state sequence $\hat{\mathbf{s}}$ and mixture component sequence $\hat{\mathbf{k}}$ of $X_n$ are obtained while the optimal word sequence $\hat{W}_n$ is extracted by the TBPC decision, the missing data problem in the parameter estimation of (3) can be correspondingly conquered using the following Viterbi approximation:

$\hat{\Lambda}_n = \arg\max_{\Lambda}\ p(X_n, \hat{\mathbf{s}}, \hat{\mathbf{k}} \mid \hat{W}_n, \Lambda)\, g(\Lambda \mid \varphi^{(n-1)}).$   (24)

This approximation is a variant of the expectation-maximization (EM) algorithm [8] where the expectation step over the entire space of missing data is approximated by replacing it with the optimal state sequence $\hat{\mathbf{s}}$ and mixture component sequence $\hat{\mathbf{k}}$, as explained earlier in (10). After the prior pdf of (9) is substituted into (24), the auxiliary function of cluster $c$, multiplied by a normalization term, can also be expressed by a normal-Wishart density [3] with new hyperparameters $\{\tilde{\mu}_c, \tilde{\tau}_c, \tilde{\alpha}_c, \tilde{U}_c\}$

$R(b_c, A_c \mid X_n) \propto g(b_c, A_c \mid \tilde{\mu}_c, \tilde{\tau}_c, \tilde{\alpha}_c, \tilde{U}_c)$   (25)

$\tilde{\mu}_c = \frac{\tau_c \mu_c + n_c \bar{x}_c}{\tau_c + n_c}$   (26)

$\tilde{\tau}_c = \tau_c + n_c, \qquad \tilde{\alpha}_c = \alpha_c + n_c$   (27)

$\tilde{U}_c = U_c + S_c + \frac{\tau_c\, n_c}{\tau_c + n_c}\, (\bar{x}_c - \mu_c)(\bar{x}_c - \mu_c)^T$   (28)

where the sufficient statistics are accumulated over the frames aligned to cluster $c$ [see (29)-(31)]

$n_c = \sum_{t} \sum_{i,k} \zeta_t(i,k)\, \delta(c_{ik}, c)$   (29)

$\bar{x}_c = \frac{1}{n_c} \sum_{t} \sum_{i,k} \zeta_t(i,k)\, \delta(c_{ik}, c)\, (x_t - \mu_{ik})$   (30)

$S_c = \sum_{t} \sum_{i,k} \zeta_t(i,k)\, \delta(c_{ik}, c)\, (x_t - \mu_{ik} - \bar{x}_c)(x_t - \mu_{ik} - \bar{x}_c)^T$   (31)

with $\zeta_t(i,k)$ denoting the occupancy of frame $x_t$ in state $i$ and mixture component $k$ along $(\hat{\mathbf{s}}, \hat{\mathbf{k}})$. The notation $\delta(\cdot,\cdot)$ is the Kronecker delta function. By maximizing the normal-Wishart density with respect to $(b_c, A_c)$, we can easily derive the optimal parameters

$\hat{b}_c = \tilde{\mu}_c, \qquad \hat{A}_c = (\tilde{\alpha}_c - d)\, \tilde{U}_c^{-1}.$   (32)
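A minimal sketch of the refreshment in (26)-(31) follows, assuming the Viterbi-aligned residuals $x_t - \mu_{ik}$ and their occupancies have already been collected for one cluster; all names are illustrative rather than the authors' code.

```python
import numpy as np

def refresh_normal_wishart(residuals, gamma, mu_c, tau_c, alpha_c, U_c):
    """Refresh the cluster-c hyperparameters as in (26)-(31).

    residuals: (T, d) array of aligned residuals x_t - mu_{s_t k_t}
    gamma:     (T,) occupancy weights of the frames assigned to cluster c
    Returns the refreshed (mu, tau, alpha, U) and the mode of (32).
    """
    n_c = gamma.sum()                                          # (29)
    x_bar = (gamma[:, None] * residuals).sum(0) / n_c          # (30)
    centered = residuals - x_bar
    S_c = (gamma[:, None] * centered).T @ centered             # (31)
    mu_new = (tau_c * mu_c + n_c * x_bar) / (tau_c + n_c)      # (26)
    tau_new = tau_c + n_c                                      # (27)
    alpha_new = alpha_c + n_c                                  # (27)
    dev = (x_bar - mu_c)[:, None]
    U_new = U_c + S_c + (tau_c * n_c / (tau_c + n_c)) * (dev @ dev.T)  # (28)
    d = mu_c.shape[0]
    b_hat = mu_new                                             # (32): joint mode
    A_hat = (alpha_new - d) * np.linalg.inv(U_new)
    return (mu_new, tau_new, alpha_new, U_new), (b_hat, A_hat)
```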


Optionally, the derived parameters are feasible for obtaining the online adaptive HMMs by applying the transformation function of (7). But the highlight of the proposed strategy is focused on the merit of the reproducible prior/posterior pair with normal-Wishart density, such that the new hyperparameters can straightforwardly serve as the refreshed hyperparameters

$\varphi_c^{(n)} = \{\mu_c^{(n)}, \tau_c^{(n)}, \alpha_c^{(n)}, U_c^{(n)}\} = \{\tilde{\mu}_c, \tilde{\tau}_c, \tilde{\alpha}_c, \tilde{U}_c\}.$   (33)

They are applied to TBPC-based recognition of the upcoming data $X_{n+1}$. As a result, a mechanism of OPE is constructed to incrementally provide the newest knowledge of parameter uncertainty for the TBPC decision. Notably, all of our formulas are derived in universal and multivariate fashions. In the Appendix, we also formulate the TBPC-OPE strategy where only the transformation of the HMM mean vector is considered.

V. EXPERIMENTS

A. Speech Databases and Baseline Recognition System

We conduct a series of speech recognition experiments to examine the performance of the proposed TBPC-OPE strategy. Our experiments are aimed at the recognition task of connected Chinese digits. Two severely mismatched speech databases were used. One is the training database consisting of 1000 utterances by 100 speakers (50 males and 50 females). This database was sampled from a TCC300 corpus of 300 speakers jointly collected by the Speech Recognition Groups at National Cheng Kung University, National Taiwan University, and National Chiao Tung University, Taiwan. It was recorded in ordinary office environments via four close-talking microphones. We applied this database to generate the SI HMMs. The second database was sampled from a CARNAV98 corpus jointly collected by National Cheng Kung University and the Industrial Technology Research Institute, Taiwan. This database contained the utterances of ten speakers (five males and five females) recorded in two medium-class cars: a Toyota Corolla 1.8 and a Yulon Sentra 1.6. These utterances were collected using a high-quality MD Walkman of type MZ-R55 via a hands-free far-talking SONY ECM-717 microphone different from those in the training database. Speech was digitized by a Sound Blaster A/D card at 16-bit accuracy and 8 kHz sampling rate. Three materials of standby condition, downtown condition and freeway condition, with averaged car speeds respectively being 0 km/h, 50 km/h and 90 km/h, were recorded. During recording, we kept the engine on, the air-conditioner on, the music off and the windows rolled up. The numbers of testing utterances were 50, 150, and 250, and the corresponding numbers of digits were 324, 964, and 1593 for the driving conditions of standby, downtown and freeway, respectively. The word error rate (WER) was averaged over the ten test speakers. Each speaker optionally provided five adaptation utterances with supervision to carry out different realizations of TBPC-OPE. All training/testing utterances contained three to eleven random digits. In our experiments, we modeled each Chinese digit using a left-to-right seven-state CDHMM without state skipping. There were 73 HMM states in total (70 for ten digits, one for presilence, one for postsilence, and one for silence within connected digits). Each HMM state was composed of four mixture components. For a speech frame, we applied Hamming windowing and computed a feature vector of 12-order LPC-derived cepstral coefficients, 12-order delta cepstral coefficients, one delta log energy and one delta-delta log energy ($d = 26$). Our baseline system (i.e., plug-in MAP decoding using SI HMMs) reports word error rates of 25.6%, 55%, and 62.3% for the driving conditions of standby, downtown and freeway, respectively.

B. Implementation Issues

We have presented a general TBPC-OPE decision strategy. This strategy is applicable to a large vocabulary continuous speech recognition (LVCSR) system where a language model has been defined. In the experiments, we ignored the contribution of the language model during the TBPC decision of (4). The case of TBPC-OPE using one transformation cluster shared by all HMM pdfs is adopted in most experiments. We also compare the recognition results obtained by extending the number of transformation clusters. To start the OPE capability, we need to have the initial hyperparameters $\varphi^{(0)}$ beforehand. In this study, the initial hyperparameters were estimated from the SI training data using the technique in [3]. To prevent performance fluctuation induced by the refreshment of $\tau_c$ and $\alpha_c$, we heuristically fix these two hyperparameters for all incremental intervals. The HMM precision matrix is simplified to be diagonal. From (25) and (32), we know that the precision matrix of the Gaussian density in (15) is guaranteed to be symmetric and positive definite. In the following, we realize several variants of the TBPC-OPE strategy and perform systematic evaluation of the convergence property, hyperparameter sensitivity, transformation functions and multiple transformation clusters. The performance of the TBPC and plug-in MAP decisions is also compared. Moreover, we examine the proposed TBPC decision against other BPC decisions [22], [32], [33].

C. Evaluation of Hyperparameter Sensitivity in TBPC-OPE

The hyperparameters are crucial to characterize the uncertainty of transformation parameters for the TBPC-OPE strategy. We would like to know the sensitivity of the fixed hyperparameter $\alpha_c$ with relation to speech recognition performance. In Fig. 3, the word error rates of the baseline plug-in MAP and TBPC-OPE with different $\alpha_c$ are compared under the case of freeway driving. In this case, each speaker uses five supervised adaptation sentences to incrementally learn his/her own hyperparameters on a sentence-by-sentence basis. Those derived hyperparameters are fixed for TBPC-based speech recognition. This realization is referred to as TBPC with OPE processing on the adaptation phase. We can see that TBPC-OPE is superior to the baseline plug-in MAP. Better performance is obtained when the hyperparameter $\alpha_c$ ranges between 50 and 70. Accordingly, we roughly specify $\alpha_c$ to be 55 in the subsequent experiments. In the case of $\alpha_c = 55$, the BPLM of (22) has a multivariate t distribution form with $v = \alpha_c - d + 1 = 30$ degrees of freedom, and a WER of 47.7% is obtained.


Fig. 3. Comparison of word error rates of the baseline system and the TBPC-OPE strategy using various values of the hyperparameter $\alpha_c$ under the freeway driving condition.

Also, we compare the performance of the baseline plug-in MAP and TBPC with fixed initial hyperparameters estimated from the SI training data [3]. Obviously, these hyperparameters are shared by all testing speakers. This realization is also referred to as TBPC without OPE processing. As shown in Fig. 4, the TBPC without OPE processing is significantly better than the baseline plug-in MAP under various driving conditions. For the case of downtown driving, the WER is greatly reduced from 55% of the baseline plug-in MAP to 46% of TBPC without OPE processing. Such results are promising because no extra adaptation data or computational overhead is needed in this realization. To further explore the hyperparameter sensitivity, we implement two other realizations for comparison. The first realization is the TBPC with OPE processing on the adaptation phase mentioned earlier. The second realization is referred to as TBPC with OPE processing on both the adaptation and test phases, where the hyperparameters are not only learnt using five supervised adaptation sentences but also refreshed during recognition on a sentence-by-sentence basis. From Fig. 4, we learn that TBPC with OPE processing outperforms that without OPE processing under various driving conditions. If OPE processing is done on both the adaptation and test phases, the best recognition results are obtained. In the case of downtown driving, the word error rates are reduced to 41.3% for OPE processing on the adaptation phase and 38.1% for OPE processing on both the adaptation and test phases. These results indicate that the hyperparameters realistically learnt from test data do improve the TBPC decision. Notably, the hyperparameter learning is partially supervised in this realization.

Fig. 4. Comparison of word error rates of the baseline system, TBPC without OPE processing, TBPC with OPE processing on the adaptation phase and TBPC with OPE processing on both the adaptation and test phases under various driving conditions.

D. Comparison of Online Transformation and TBPC-OPE

In general, the TBPC-OPE strategy is developed to combine the works of online transformation [3] and BPC decision [19]. The online transformation technique serves as a special realization of TBPC-OPE where OPE processing is performed to refresh the hyperparameters using adaptation data and calculate the optimal transformation parameters using (32). Given these parameters, the HMMs are transformed and plugged in the MAP decoder. Online transformation can thus be viewed as plug-in MAP decision using online adaptive HMMs. In Fig. 5, we compare the recognition results of online transformation and TBPC with OPE processing on both the adaptation and test phases. It is found that the proposed TBPC-OPE is always better than online transformation for different driving conditions. For downtown driving, the WER of 38.2% using TBPC-OPE is smaller than that of 41.1% using online transformation. This illustrates the superiority of the TBPC decision over the traditional plug-in MAP decision.

Fig. 5. Comparison of word error rates of the baseline system, online transformation and TBPC with OPE processing on both the adaptation and test phases under various driving conditions.

E. Evaluation of TBPC-OPE versus Different Transformation Functions

In this study, we exploit a general TBPC-OPE where the uncertainty of the transformation parameters of both the HMM mean vector and precision matrix is considered in the decision criterion.


Fig. 6. Word error rates versus the number of adaptation sentences using TBPC with OPE processing on the adaptation phase under various driving conditions. Dotted and solid lines respectively denote the results of transformation functions considering only the mean and both the mean and precision.

The derived BPLM has the form of a multivariate t distribution. In the Appendix, however, we formulate the TBPC-OPE considering only the transformation of the mean vector, and the resulting BPLM has a Gaussian distribution form. The two kinds of BPLM own their associated hyperparameters and OPE schemes. It is of interest to evaluate the recognition performance of TBPC-OPE using the two different transformation functions. Fig. 6 plots the recognition comparison of TBPC-OPE with transformation functions considering only the mean and both the mean and precision under various driving conditions. In this set of experiments, the case of TBPC with OPE processing on the adaptation phase is performed. The number of adaptation sentences is varied for evaluation. From this figure, we know that the word error rates are consistently reduced as more adaptation data are involved, under various driving speeds and transformation functions. This shows the convergence property of the TBPC-OPE strategy with respect to the adaptation data length. Furthermore, we observe that the recognition results of TBPC-OPE with the transformation considering both mean and precision are not consistently better than those considering only the mean. This is because reliable modeling of the transformation uncertainty of the precision matrix is difficult to achieve, and further techniques need to be explored to improve the performance.

F. Evaluation of Different BPC Rules

On the other hand, the proposed TBPC-OPE is further investigated by comparing it with other BPC rules. In [22], Jiang et al. presented a BPC rule in which the uncertainty of the HMM mean component itself was modeled by a Gaussian pdf. Some incremental adaptation data were applied to estimate the hyperparameters. These hyperparameters were fixed for BPC-based speech recognition. To objectively compare TBPC-OPE and Jiang's BPC, we perform the realization of TBPC with OPE processing on the adaptation phase and considering only the transformation of the HMM mean. As demonstrated in Fig. 7, Jiang's BPC does outperform the baseline plug-in MAP. But Jiang's BPC is obviously inferior to our TBPC-OPE for different driving conditions.


Fig. 7. Comparison of word error rates of baseline system, Jiang’s BPC and TBPC-OPE under various driving conditions. Case of TBPC with OPE processing on adaptation phase and transformation function considering only mean is performed.

In the case of freeway driving, the WER of Jiang's BPC is 56.9%, which is significantly worse than the 45.6% obtained using TBPC-OPE. In addition, Surendran and Lee [33] introduced a TBPC technique where the transformation parameter of the HMM mean component was assumed to be Gaussian distributed. In their implementation, the most likely state and mixture component sequences should first be decoded to estimate the hyperparameters for each sentence. Using the estimated hyperparameters, the HMMs were adapted and applied to the second-pass TBPC decision. In contrast, the proposed method adopts the online evolved hyperparameters for the TBPC decision. After recognition of the current sentence, the hyperparameters are refreshed for recognizing the next sentence. Accordingly, we only perform one pass of recognition, which is different from the two passes used in Surendran's method. The processing time of TBPC-OPE should be considerably reduced. To compare these two methods, we carry out another realization referred to as TBPC with OPE processing on the test phase, where the TBPC decision utilizes the hyperparameters refreshed on the test phase in unsupervised and incremental manners. Since no adaptation data are needed, this realization is the most flexible among all realizations. In Table I, we show the word error rates and recognition speeds of the baseline system, Surendran's method and TBPC-OPE for different driving conditions. Herein, the results of TBPC-OPE considering only mean transformation are given. Averaged recognition speeds are measured in seconds per utterance by simulating the algorithms on a Pentium III 450 PC. We can see that Surendran's method does raise the recognition performance a great deal. However, TBPC-OPE obtains extra recognition improvement. In the case of the standby condition, the WER of 25.6% of the baseline system is greatly reduced to 12.7% by Surendran's method and further dropped to 11.3% by TBPC-OPE. Moreover, the computational cost of Surendran's method is almost double that of the proposed TBPC-OPE. The TBPC-OPE incurs only a small computational overhead compared to the baseline system.


TABLE I WORD ERROR RATES (%) AND RECOGNITION SPEEDS (SECOND/UTTERANCE) OF BASELINE SYSTEM, SURENDRAN’S METHOD AND TBPC-OPE UNDER VARIOUS DRIVING CONDITIONS. CASE OF TBPC WITH OPE PROCESSING ON TEST PHASE AND TRANSFORMATION FUNCTION CONSIDERING ONLY MEAN IS PERFORMED

Also, Surendran and Lee have extended their work to handle the transformation uncertainty of both mean and precision components [32]. It is interesting to compare Surendran's method and TBPC-OPE under the same condition of considering the transformation function of both mean and precision. As shown in Fig. 8, our TBPC-OPE still outperforms Surendran's method for different driving conditions. This is because TBPC-OPE can accumulate the environmental statistics for the TBPC decision, whereas Surendran's method only relies on the statistics obtained from the current test utterance.

G. Evaluation of TBPC-OPE with Multiple Transformation Clusters

To further demonstrate the effectiveness of TBPC-OPE, we implement another set of experiments where the number of transformation clusters is increased from one to two. The case of two transformation clusters is simply defined by grouping the HMM states into a speech part (70 states) and a silence part (three states) [29]. In this case, two sets of hyperparameters for the BPLM calculation of speech frames and silence frames are separately estimated and evolved. In Fig. 8, the recognition comparison of TBPC-OPE with one and two clusters is illustrated. Herein, we keep the experimental conditions of TBPC with OPE processing on the test phase and considering the transformation of both mean and precision. The figure reveals that the recognition results are significantly improved by increasing the number of transformation clusters. In the case of the freeway condition, the WER of 44.2% using two transformation clusters is apparently lower than the 48.3% using a single transformation cluster. All of these results show that the proposed TBPC-OPE strategy is not only effective but also efficient for robust speech recognition in car environments.

Fig. 8. Comparison of word error rates of the baseline system, Surendran's method and TBPC-OPE under various driving conditions. The case of TBPC with OPE processing on the test phase and the transformation function considering both mean and precision is performed. TBPC-OPE results with one and two transformation clusters are also compared.

VI. CONCLUSION

This paper has presented a novel TBPC decision strategy equipped with OPE capability for robust speech recognition. The TBPC decision was established by taking the parameter uncertainty into consideration so that the resulting decision rule of transformation-based adaptation could achieve better robustness than the conventional plug-in MAP rule. In this study, we adequately modeled the transformation parameters of the HMM mean vector and precision matrix using a prior normal-Wishart density associated with a set of hyperparameters. It turned out that the TBPC decision depends on the derived BPLM, which takes a multivariate t distribution form. Due to the statistical attractiveness of the selected prior density family, the OPE capability was developed to incrementally trace the newest environmental hyperparameters for the TBPC decision. Such a joint TBPC-OPE strategy was tightly constructed based on the QB estimation of word transcriptions and transformation parameters. Also, we contributed several realizations released from the TBPC-OPE strategy. A flexible realization of TBPC-OPE operated only on test data was exploited to use the online evolved hyperparameters for TBPC-based speech recognition. The estimated transcription of the test sentence was applied to obtain new hyperparameters for the TBPC decision of the next sentence. This realization was entirely online and unsupervised. After a series of experiments on connected Chinese digit recognition in car environments, some points are concluded as follows.

1) The TBPC decision is superior to the traditional plug-in MAP decision. The asymptotic property of TBPC-OPE versus the number of adaptation data is observed.
2) Variation of the hyperparameters representing the environmental statistics affects the TBPC decision substantially. Using fixed and general hyperparameters properly estimated from training data could improve the speech recognition performance.
3) Provided some adaptation data are used for hyperparameter refreshment, the resulting environment-specific hyperparameters outperform the general hyperparameters. If the hyperparameters are further updated during recognition, the performance of the TBPC decision could be raised accordingly.
4) The TBPC-OPE strategy achieves better results than the online transformation algorithm built by the plug-in MAP decision using online adaptive HMMs.
5) TBPC-OPE with the transformation function concerning both mean and precision obtains similar results compared to that concerning only mean transformation.
6) The proposed TBPC-OPE attains better performance than Jiang's BPC [22] and Surendran's method [32], [33]. The efficiency of TBPC-OPE is verified as well.


7) TBPC-OPE is improved by increasing the number of transformation clusters.

Because the proposed approach is established on a basis of multivariate analysis of transformation-based adaptation, this approach should be easily extended to the simplified case where the HMM precision matrix and its transformation matrix are diagonal. Correspondingly, the univariate variant of TBPC-OPE considering the uncertainty of individual transformation components could be produced. Furthermore, this approach could be referred to in developing a new decision strategy geared with OPE capability, in which the uncertainty of the HMM mean vector and precision matrix themselves is characterized and incorporated in the decision rule.


APPENDIX
TBPC-OPE CONSIDERING ONLY TRANSFORMATION OF MEAN VECTOR

When the transformation-based adaptation only transforms the HMM mean vector $\mu_{ik}$ using the bias vector $b_c$ and keeps the HMM precision matrix $r_{ik}$ unchanged, the TBPC decision is built by deriving a new BPLM of (12), where the transformation parameter $\eta_c$ is replaced by the bias vector $b_c$. Herein, we choose the corresponding prior pdf of $b_c$ to be a Gaussian function (also belonging to the conjugate family) with mean vector $\mu_c$ and precision matrix $\Phi_c$. The new BPLM involves the following single integral

$\tilde{p}(x_t \mid i, k) = \int \mathcal{N}(x_t \mid \mu_{ik} + b_c,\, r_{ik})\, \mathcal{N}(b_c \mid \mu_c, \Phi_c)\, db_c.$   (34)

Similar to the derivation in (16)-(18), the right-hand side of (34) can be further arranged by

$\tilde{p}(x_t \mid i, k) \propto \exp\left\{-\frac{1}{2}(x_t - \mu_{ik} - \mu_c)^T (r_{ik}^{-1} + \Phi_c^{-1})^{-1} (x_t - \mu_{ik} - \mu_c)\right\} \int \exp\left\{-\frac{1}{2}(b_c - w)^T (r_{ik} + \Phi_c)(b_c - w)\right\} db_c$   (35)

where $w = (r_{ik} + \Phi_c)^{-1}[\,r_{ik}(x_t - \mu_{ik}) + \Phi_c \mu_c\,]$. Because the Gaussian functional form of $b_c$ in (35) is integrated to unity (up to a constant), the new BPLM is proportional to

$\tilde{p}(x_t \mid i, k) \propto \exp\left\{-\frac{1}{2}(x_t - \mu_{ik} - \mu_c)^T (r_{ik}^{-1} + \Phi_c^{-1})^{-1} (x_t - \mu_{ik} - \mu_c)\right\}.$   (36)

This BPLM is then plugged in the MAP decoder to generate the TBPC decision. We can see that the BPLM serves as a Gaussian density of $x_t$ with mean vector $\mu_{ik} + \mu_c$ and precision matrix $(r_{ik}^{-1} + \Phi_c^{-1})^{-1}$ if $r_{ik}$ and $\Phi_c$ are both diagonal matrices. Furthermore, the online evolution of the hyperparameters is done through the QB estimate in (24). In this case, we can easily derive the resulting auxiliary function proportional to a Gaussian function of $b_c$ with the refreshed mean vector and precision matrix given by [7]

$\tilde{\mu}_c = \tilde{\Phi}_c^{-1}\left[\Phi_c \mu_c + \sum_{t}\sum_{i,k} \zeta_t(i,k)\, \delta(c_{ik}, c)\, r_{ik}\, (x_t - \mu_{ik})\right]$   (37)

$\tilde{\Phi}_c = \Phi_c + \sum_{t}\sum_{i,k} \zeta_t(i,k)\, \delta(c_{ik}, c)\, r_{ik}.$   (38)

If online adaptive HMMs are aimed to be produced, we may correspondingly estimate the optimal transformation parameter to be $\hat{b}_c = \tilde{\mu}_c$ and apply it to obtain the new HMMs

$\hat{\mu}_{ik} = \mu_{ik} + \hat{b}_c.$   (39)

However, in our TBPC rule, we concern ourselves with the OPE processing finished by the following procedure of hyperparameter refreshment

$\varphi_c^{(n)} = \{\mu_c^{(n)}, \Phi_c^{(n)}\} = \{\tilde{\mu}_c, \tilde{\Phi}_c\}.$   (40)
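For the mean-only case, both the BPLM of (36) and the refreshment of (37), (38) reduce to simple diagonal Gaussian computations. The sketch below assumes, for brevity, a single shared diagonal precision for the aligned frames; all names are illustrative and not the authors' code.

```python
import numpy as np

def log_bplm_mean_only(x, mu_ik, r_ik, mu_c, phi_c):
    """Per-frame BPLM of (36) for the bias-only case with diagonal
    precisions: a Gaussian with mean mu_ik + mu_c and covariance
    r_ik^{-1} + phi_c^{-1} (all arrays hold diagonal entries)."""
    var = 1.0 / r_ik + 1.0 / phi_c            # predictive variances
    e = x - mu_ik - mu_c
    return -0.5 * np.sum(np.log(2 * np.pi * var) + e * e / var)

def refresh_gaussian_prior(residuals, gamma, r_ik, mu_c, phi_c):
    """Refreshment of (37)-(38): conjugate Gaussian update of the bias
    prior from aligned residuals x_t - mu_{s_t k_t} (diagonal case,
    assuming one shared precision r_ik for the aligned frames)."""
    # (38): posterior precision accumulates the occupancy-weighted
    # observation precisions of the frames assigned to the cluster.
    phi_new = phi_c + gamma.sum() * r_ik
    # (37): precision-weighted combination of prior mean and data.
    num = phi_c * mu_c + r_ik * (gamma[:, None] * residuals).sum(0)
    mu_new = num / phi_new
    return mu_new, phi_new
```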

ACKNOWLEDGMENT

The authors acknowledge the helpful discussions with Prof. Q. Huo of the University of Hong Kong and Dr. A. C. Surendran of Bell Laboratories. The valuable comments of the anonymous reviewers, which considerably improved the quality of this paper, are appreciated as well.

REFERENCES

[1] L. J. Bain and M. Engelhardt, Introduction to Probability and Mathematical Statistics, 2nd ed. Boston, MA: PWS, 1992.
[2] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd ed. Berlin, Germany: Springer-Verlag, 1985.
[3] J.-T. Chien, "Online hierarchical transformation of hidden Markov models for speech recognition," IEEE Trans. Speech Audio Processing, vol. 7, pp. 656–667, Nov. 1999.
[4] J.-T. Chien and J.-C. Junqua, "Unsupervised hierarchical adaptation using reliable selection of cluster-dependent parameters," Speech Commun., vol. 30, pp. 235–253, Apr. 2000.
[5] J.-T. Chien and H.-C. Wang, "Telephone speech recognition based on Bayesian adaptation of hidden Markov models," Speech Commun., vol. 22, no. 4, pp. 369–384, 1997.
[6] W. Chou, "Maximum a posteriori linear regression with elliptically symmetric matrix variate priors," in Proc. Eur. Conf. Speech Communication Technology (EUROSPEECH), vol. 1, 1999, pp. 1–4.
[7] M. H. DeGroot, Optimal Statistical Decisions. New York: McGraw-Hill, 1970.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
[9] V. V. Digalakis, "Online adaptation of hidden Markov models using incremental estimation algorithms," IEEE Trans. Speech Audio Processing, vol. 7, pp. 253–261, May 1999.


[10] J. L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Processing, vol. 2, pp. 291–298, 1994.
[11] S. Geisser, Predictive Inference: An Introduction. London, U.K.: Chapman & Hall, 1993.
[12] Y. Gotoh, M. M. Hochberg, and H. F. Silverman, "Efficient training algorithms for HMM's using incremental estimation," IEEE Trans. Speech Audio Processing, vol. 6, pp. 539–548, Nov. 1998.
[13] Q. Huo, "Adaptive learning and compensation of hidden Markov model for robust speech recognition," in Proc. Int. Symp. Chinese Spoken Language Processing (ISCSLP), Singapore, 1998, pp. 31–43.
[14] Q. Huo and C.-H. Lee, "On-line adaptive learning of the correlated continuous density hidden Markov models for speech recognition," IEEE Trans. Speech Audio Processing, vol. 6, pp. 386–397, July 1998.
[15] ——, "A study of prior sensitivity for Bayesian predictive classification based robust speech recognition," Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 2, pp. 741–744, 1998.
[16] ——, "Combined on-line model adaptation and Bayesian predictive classification for robust speech recognition," Proc. Eur. Conf. Speech Communication Technology (EUROSPEECH), vol. 4, pp. 1847–1850, 1997.
[17] ——, "On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate," IEEE Trans. Speech Audio Processing, vol. 5, pp. 161–172, Mar. 1997.
[18] Q. Huo and B. Ma, "On-line adaptive learning of CDHMM parameters based on multiple-stream prior evolution and posterior pooling," Proc. Eur. Conf. Speech Communication Technology (EUROSPEECH), vol. 6, pp. 2721–2724, 1999.
[19] Q. Huo, H. Jiang, and C.-H. Lee, "A Bayesian predictive classification approach to robust speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1997, pp. 1547–1550.
[20] H. Jiang, K. Hirose, and Q. Huo, "Robust speech recognition based on Viterbi Bayesian predictive classification," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1997, pp. 1551–1554.
[21] ——, "Robust speech recognition based on a Bayesian prediction approach," IEEE Trans. Speech Audio Processing, vol. 7, pp. 426–440, July 1999.
[22] ——, "Improving Viterbi Bayesian predictive classification via sequential Bayesian learning in robust speech recognition," Speech Commun., vol. 28, no. 4, pp. 313–326, 1999.
[23] J.-C. Junqua, "The influence of acoustics on speech production: a noise-induced stress phenomenon known as the Lombard reflex," Speech Commun., vol. 20, pp. 13–22, 1996.
[24] C.-H. Lee, "On stochastic feature and model compensation approaches to robust speech recognition," Speech Commun., vol. 25, no. 1–3, pp. 29–47, 1998.
[25] C.-H. Lee and Q. Huo, "Adaptive classification and decision strategies for robust speech recognition," in Proc. Workshop Robust Methods Speech Recognition Adverse Conditions, Tampere, Finland, 1999, pp. 45–52.
[26] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., vol. 9, pp. 171–185, 1995.
[27] N. Merhav and C.-H. Lee, "A minimax classification approach with application to robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 1, no. 1, pp. 90–100, 1993.
[28] B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1996.

[29] A. Sankar and C.-H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 190–202, 1996.
[30] K. Shinoda and T. Watanabe, "Speaker adaptation with autonomous control using tree structure," in Proc. Eur. Conf. Speech Communication Technology (EUROSPEECH), 1995, pp. 1143–1146.
[31] O. Siohan, C. Chesta, and C.-H. Lee, "Hidden Markov model adaptation using maximum a posteriori linear regression," in Proc. Workshop Robust Methods Speech Recognition Adverse Conditions, Tampere, Finland, 1999, pp. 147–150.
[32] A. C. Surendran and C.-H. Lee, "Bayesian predictive approach to adaptation of HMM's," in Proc. Workshop Robust Methods Speech Recognition Adverse Conditions, Tampere, Finland, 1999, pp. 155–158.
[33] ——, "Predictive adaptation and compensation for robust speech recognition," Proc. Int. Conf. Spoken Language Processing, vol. 2, pp. 463–466, 1998.

Jen-Tzung Chien (S'97–A'98–M'99) was born in Taipei, Taiwan, R.O.C., on August 31, 1967. He received the B.S. degree in electrical engineering from Feng-Chia University, Taichung, Taiwan, in 1989, the M.S. degree in computer science and information engineering from Tamkang University, Taipei, in 1991, and the Ph.D. degree in electrical engineering from the National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 1997. He was an Instructor with the Department of Electrical Engineering, NTHU, in 1995. Since August 1997, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, where he is currently an Assistant Professor. During the summer of 1998, he was a Visiting Scholar at the Speech Technology Laboratory, Panasonic Technologies, Inc., Santa Barbara, CA, working on unsupervised speaker adaptation. He is also a Senior Consultant with Century Semiconductor, Inc., Taiwan. His current research interests include statistical and array signal processing, speech recognition, speaker adaptation, natural language processing, face recognition, neural networks, and multimedia human-machine communication. Dr. Chien is a member of the IEEE Computer and Signal Processing Societies, the International Speech Communication Association, and the Association for Computational Linguistics and Chinese Language Processing.

Guo-Hong Liao received the B.S. degree from Feng-Chia University, Taichung, Taiwan, R.O.C., in 1998, and the M.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2000. His M.S. thesis involved research on Bayesian predictive classification for car speech recognition. He is now fulfilling military duty in Taiwan.