MICROPHONE-ARRAY SPEECH RECOGNITION VIA INCREMENTAL MAP TRAINING

John E. Adcock, Yoshihiko Gotoh, Daniel J. Mashao, Harvey F. Silverman

LEMS, Division of Engineering, Brown University, Providence, RI 02912, USA
e-mail: {jea,yg,djm,hfs}@lems.brown.edu

ABSTRACT

For a hidden Markov model (HMM) based speech recognition system it is desirable to combine enhancement of the acoustical signal and statistical representation of model parameters, ensuring both a high-quality speech signal and an appropriately trained HMM. In this paper the incremental variant of maximum a posteriori (MAP) estimation is used to adjust the parameters of a talker-independent HMM-based speech recognition system to accurately recognize speech data acquired with a microphone array. The approach is novel for a microphone-array speech recognition task in that a robust talker-independent model is derived from a baseline system using a relatively small amount of training data. The results show that (1) MAP training significantly improves recognition performance compared to the baseline, and (2) beamforming signal enhancement outperforms single-channel enhancement both before and after the adaptive MAP training.

1. INTRODUCTION

Microphone-array systems have been an area of active research at LEMS for several years [1, 2, 3, 4, 5]. The potential for high-quality hands-free speech acquisition in noisy environments makes microphone arrays an attractive alternative to conventional head-mounted or desktop microphones. The signal-enhancement and source-location capabilities of microphone arrays make them applicable to a variety of tasks including video teleconferencing and speech recognition. Several experiments have been reported for microphone-array speech recognition [2, 6, 7, 8]. These experiments focus on the acoustical processing rather than on the representation of the underlying HMM parameters. For an HMM-based system, it is desirable to combine acoustical enhancement with accurate statistical modeling so that not only are quality features presented to the HMM but the model also represents the array data appropriately for the recognition task. In the approach presented in this paper, the HMM parameters are adjusted to the beamformed data using an incremental variant of maximum a posteriori (MAP) estimation. The objective is to realize a talker-independent speech recognition system using an array of microphones to pick up a remote talker in a natural environment. Scenarios for microphone-array applications vary significantly; it is generally not practical to prepare enough data to train a talker-independent task from scratch. A compromise solution is to evolve a new model from a baseline system using a relatively small amount of data acquired in the new environment. The series of experiments presented here use speech recorded from talkers with a 16-element microphone-array system.

* This work was funded by NSF grants MIP-9120843, MIP-9314625, and MIP-9509505.

data set    female  male  # utterances
training    5       5     320
testing     5       5     320

Table 1. Breakdown of the experimental database.

The results show that the combination of beamforming and MAP adaptation significantly improves recognition performance, bringing it close to the performance achieved with speech recorded with a headset microphone.

2. EXPERIMENTAL DATABASE

A microphone-array speech database was collected from 20 talkers of American English. The vocabulary comprises the American English alphabet (A-Z), the digits (0-9), "space" and "period". A typical utterance includes about 12 vocabulary items and is about 4 seconds long. Each talker contributed the same number of utterances. Table 1 shows the data sets broken down by gender. Four distinct data sets are used in the experiments:

- close-talk headset microphone data,
- single remote microphone data from one of the 16 array microphones,
- single remote microphone data with spectral-subtraction post-processing, and
- beamformed data processed from the array data.

Each of these data sets was obtained from simultaneous recordings by the headset microphone and by the microphone-array system (see section 2.1), and all contain an identical set of utterances. None of the talkers in this database are in the training set for the baseline system (see section 4.1).

2.1. Recording

The microphone-array environment used in this experiment is depicted in Figure 1. It comprises 16 pressure gradient microphones, 8 on each of two orthogonal walls of a 3.5 m x 4.8 m enclosure, placed horizontally at a height of 1.6 m. Within each 8-microphone sub-array, the microphones are uniformly spaced at 16.5 cm intervals. The microphone array is in a partially walled-off area of an acoustically untreated workstation lab. Approximately 70% of the surface area of the enclosure walls is covered with 7.5 cm acoustic foam, the 3 m ceiling is untreated plaster, and the floor is carpeted. The reverberation time within the enclosure is approximately 200 ms. The utterances were recorded with the talker standing approximately 2 m away from each of the microphone sub-arrays. The microphone-array recording was performed with a custom-built 12-bit, 20 kHz multichannel acquisition system [5]. Multirate signal processing was used to achieve the 16 kHz sampling rate used by the HMM system. During recording, the talker also wore a close-talking headset microphone.

Figure 1. Layout of the LEMS microphone-array system using 16 pressure gradient microphones (dimensions in cm).

This is the same microphone used to collect the high-quality speech data for training the baseline HMM system (see section 4.1). Using the analog-to-digital conversion unit of a Sparc10 workstation, the signal from the headset microphone was digitized at 16 bits and 16 kHz, simultaneously with the 16 remote microphones of the array system. Both the headset and the array recordings were segmented by hand to remove leading and trailing silence. The measured peak SNR is approximately 15 dB at each array microphone and approximately 40 dB at the headset microphone.
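As a concrete illustration of the multirate conversion mentioned above, a 20 kHz capture can be brought to 16 kHz with a rational-factor (4/5) polyphase resampler. The following is only a sketch of the general technique using scipy; the filtering details of the original acquisition system are not described in the paper.

```python
import numpy as np
from scipy.signal import resample_poly

def to_16khz(x_20k: np.ndarray) -> np.ndarray:
    """Resample a 20 kHz signal to 16 kHz (rational factor 4/5).

    resample_poly upsamples by 4, applies an anti-aliasing FIR
    filter, and downsamples by 5 in a single polyphase pass.
    """
    return resample_poly(x_20k, up=4, down=5)

# Example: one second of a 1 kHz tone sampled at 20 kHz.
t = np.arange(20000) / 20000.0
y = to_16khz(np.sin(2 * np.pi * 1000 * t))
print(len(y))  # -> 16000 samples, i.e., one second at 16 kHz
```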

2.2. Beamforming

The multichannel recordings were processed with a delay-and-sum beamformer to generate the "beamformed" data set [9]. For each utterance, the location of the talker is first determined by choosing a high-SNR segment of data and finding the point in the room that minimizes the total spectral phase-fit error for a point source at that position [10]. This is equivalent to finding the focal point that maximizes the output power of the beamformer [9]. The room is searched on a 10 cm grid, and a gradient descent method is then used to refine this initial location estimate. The initial location is used to delay each channel by an integer number of samples, forming a coarse beam. To determine each, presumably small, residual inter-microphone delay, a delay estimator capable of inter-sample precision [4] is then applied to estimate the delay between each pair of adjacent microphones. Each received signal is shifted by this estimated delay to perform the fine time alignment, and finally the channels are summed to produce the beamformer output. The delay estimation is performed on a 25 ms frame with a 12 ms advance. The high update rate and accurate delay estimation allow the beamformer to compensate for head and body movements of the talker, thereby maintaining a very accurate focus on the talker over the course of the utterance. This steering accuracy is essential since the beam width afforded by an orthogonal array of this type is very small, as shown in Figure 2. The high update rate also lends itself well to a moving talker [4]. The beamforming results in a peak SNR of approximately 30 dB, 10-15 dB greater than that of a single microphone.
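Below is a minimal sketch of the two-stage alignment just described, assuming the coarse integer-sample shifts and residual fractional delays have already been produced by the localization and delay-estimation steps of [10] and [4]; the function names and the phase-shift convention are illustrative, not the authors' implementation.

```python
import numpy as np

def delay_and_sum(channels, coarse_shifts, residual_delays, fs=16000.0):
    """Delay-and-sum beamformer sketch.

    channels        : (M, N) array, one microphone signal per row
    coarse_shifts   : integer sample advances from the initial location fix
    residual_delays : remaining fractional delays in seconds, one per channel
    """
    M, N = channels.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    out = np.zeros(N)
    for m in range(M):
        x = np.roll(channels[m], -coarse_shifts[m])  # coarse integer alignment
        # Apply the fine (sub-sample) alignment as a linear phase shift.
        X = np.fft.rfft(x) * np.exp(2j * np.pi * freqs * residual_delays[m])
        out += np.fft.irfft(X, n=N)
    return out / M  # average the time-aligned channels
```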

2.3. Spectral Subtraction

An implementation of the well-known spectral subtraction speech enhancement technique [11] was used on the single-microphone data set to generate a new data set. While spectral subtraction introduces distracting tonal artifacts, the objective measurement used here is recognition performance rather than listening scores.

Figure 2. Simulated output power of the microphone array of Figure 1 as a function of source location (in meters). The array is steered to (2,2) and the source is an 800 Hz sinusoid.

Spectral subtraction provides a convenient single-channel speech enhancement technique to compare with the far more costly multi-channel enhancement of the beamformer. The peak SNR of the enhanced speech is approximately 40 dB, roughly equal to the SNR of the headset data. Those familiar with spectral subtraction will realize that this is not a very meaningful measurement, since spectral subtraction incorporates speech detection and essentially squelches non-speech portions of the signal [11].
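For reference, here is a bare-bones magnitude-domain spectral subtraction in the spirit of Boll [11]; the frame length, hop, oversubtraction factor, and spectral floor are illustrative choices, not the parameters used in the paper.

```python
import numpy as np

def spectral_subtract(x, noise, frame=400, hop=160, alpha=1.0, floor=0.02):
    """Magnitude spectral subtraction sketch (16 kHz: 25 ms frames, 10 ms hop).

    x     : noisy speech samples
    noise : noise-only samples used to estimate the noise magnitude spectrum
    """
    win = np.hanning(frame)
    # Average the noise magnitude spectrum over noise-only frames.
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise[i:i + frame] * win))
         for i in range(0, len(noise) - frame + 1, hop)], axis=0)

    out = np.zeros(len(x))
    for i in range(0, len(x) - frame + 1, hop):
        X = np.fft.rfft(x[i:i + frame] * win)
        mag = np.abs(X) - alpha * noise_mag         # subtract the noise estimate
        mag = np.maximum(mag, floor * np.abs(X))    # keep a spectral floor
        # Resynthesize with the noisy phase and overlap-add.
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(X)), n=frame)
    return out
```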

3. INCREMENTAL MAP TRAINING

It was reported in [12] that substantial speed improvements could be obtained using incremental MAP estimation. The significance of this approach is that it loses no recognition performance while speeding convergence. The learning technique presented is a variation on the recursive Bayes approach for performing sequential estimation of model parameters given incremental data. Let $\vec{x}_1, \ldots, \vec{x}_T$ be i.i.d. observations and let $\theta$ be a random variable such that $f(\vec{x}_t \mid \theta)$ is the likelihood of $\theta$ given $\vec{x}_t$. The posterior distribution of $\theta$ is

$$f(\theta \mid \vec{x}_1, \ldots, \vec{x}_t) \propto f(\vec{x}_t \mid \theta)\, f(\theta \mid \vec{x}_1, \ldots, \vec{x}_{t-1}) \qquad (1)$$

where $f(\theta \mid \vec{x}_1) \propto f(\vec{x}_1 \mid \theta) f(\theta)$ and $f(\theta)$ is the prior distribution on the parameters. The recursive Bayes approach results in a sequence of MAP estimates of $\theta$,

$$\hat{\theta}_t = \operatorname*{argmax}_{\theta} f(\theta \mid \vec{x}_1, \ldots, \vec{x}_t). \qquad (2)$$

There is a corresponding sequence of posterior parameters which act as the memory for previously observed data. If the likelihood $f(\vec{x}_t \mid \theta)$ is from the exponential family (i.e., a sufficient statistic of fixed dimension exists) and $f(\theta)$ is the conjugate prior, then the posterior $f(\theta \mid \vec{x}_1, \ldots, \vec{x}_t)$ is a member of the same distribution family as the prior regardless of the sample size $t$. This implies that the representation of the posterior remains fixed as additional data are observed. In the case of missing-data problems (e.g., HMMs), the expectation-maximization (EM) algorithm can be used to provide an iterative solution for estimation of the MAP parameters [13]. The iterative EM MAP estimation process can be combined with the recursive Bayes approach. The approach that incorporates (1) and (2) with the incremental EM method [14] (i.e., randomly selecting a subset of data from the training set and immediately applying the updated model) is fully described in [12]. Gauvain and Lee have presented the expressions for computing the posterior distributions and MAP estimates of continuous observation density HMM (CD-HMM) parameters [15]. Because the posterior is from the same family as the prior, (1) and (2) are equivalent to the update expressions in [15] and are not repeated here.
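The recursion in (1) and (2) is easiest to see in a scalar conjugate case. The sketch below is not the CD-HMM update of [15]; it is a normal likelihood with known variance and a conjugate normal prior whose strength tau is counted in pseudo-observations, showing how the posterior parameters act as the memory for previously observed data.

```python
import numpy as np

def incremental_map_mean(batches, mu0=0.0, tau=5.0):
    """Recursive Bayes MAP estimation of a normal mean (known variance).

    With prior N(mu0, sigma^2 / tau), each incremental batch updates the
    posterior parameters -- eq. (1) -- and the MAP estimate, eq. (2), is
    simply the posterior mode mu0. The posterior then serves as the prior
    for the next batch, so no past data need be stored.
    """
    for x in batches:
        mu0 = (tau * mu0 + np.sum(x)) / (tau + len(x))  # posterior mode
        tau += len(x)                                    # posterior strength
        yield mu0

# Example: three incremental batches drawn around a true mean of 2.0.
rng = np.random.default_rng(0)
batches = [rng.normal(2.0, 1.0, size=10) for _ in range(3)]
for est in incremental_map_mean(batches):
    print(f"MAP estimate of the mean: {est:.3f}")
```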

4. EXPERIMENTS

The experiments presented here were carried out on a talker-independent, connected-alphadigit speech recognition system [16]. No language model was used. Standard signal processing was used for the front end, and three sets of feature vectors were generated from DFT-based cepstral coefficients and energy. A CD-HMM was used for the experiments, and the training was performed in two stages: a baseline talker-independent model was created from headset-microphone data, then incremental MAP training was performed to adjust the model parameters for the new database (beamformed, single-microphone, or spectral-subtraction data).
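The exact front end is specified in [16]; the following is only a generic sketch of a DFT-based cepstrum-plus-energy feature of the kind referred to, with the frame length, window, and coefficient count assumed.

```python
import numpy as np

def dft_cepstra(frame, n_cep=12):
    """DFT-based real cepstral coefficients plus log energy for one frame.

    The real cepstrum is the inverse DFT of the log magnitude spectrum;
    its low-order coefficients summarize the spectral envelope.
    """
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spec = np.log(np.maximum(spec, 1e-10))   # guard against log(0)
    cep = np.fft.irfft(log_spec, n=len(frame))[:n_cep]
    log_energy = np.log(np.sum(frame ** 2) + 1e-10)
    return np.append(cep, log_energy)

# Example: a 25 ms frame (400 samples at 16 kHz) yields a 13-dim vector.
print(dft_cepstra(np.random.default_rng(0).normal(size=400)).shape)  # (13,)
```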

4.1. Baseline Model and Prior Generation

A baseline talker-independent CD-HMM was obtained by a conventional maximum likelihood (ML) training scheme. The training set was high-quality data from a headset microphone and contained 3484 utterances from 80 talkers. The initial parameters of the CD-HMM were derived from a discrete observation hidden semi-Markov model (HSMM) which used a Poisson distribution to model state duration. This model was converted to a tied-mixture HSMM by simply replacing each discrete symbol with a multivariate normal distribution; normal means and full covariances were estimated from the training data. The initial prior distributions were also derived from the training data set used to train the baseline HMM. The priors employed were the normal-Wishart distribution for the parameters of the normal distributions and the Dirichlet distribution for the rest of the model parameters. The parameters describing the priors were set such that the mode of each distribution corresponded to the initial CD-HMM. The strength of the prior (i.e., the amount of observed data required for the posterior to differ significantly from the prior) was determined empirically, using a subjective scale on which a very weak prior is (almost) equivalent to a non-informative prior and a very strong prior (almost) corresponds to impulses at the initial parameter values.
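To make the notion of prior strength concrete: for the mean of a single normal with a conjugate prior, the MAP estimate is a count-weighted interpolation between the prior mode and the data. The symbols below are illustrative, not the full normal-Wishart expressions of [15]:

$$\hat{\mu} = \frac{\tau \mu_0 + \sum_{t=1}^{T} x_t}{\tau + T}$$

Here $\tau$ behaves like a number of pseudo-observations: $\tau \to 0$ recovers the ML estimate (a non-informative prior), while $\tau \to \infty$ pins the estimate at the initial value $\mu_0$, matching the subjective weak-to-strong scale described above.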

4.2. Model Parameter Adjustment

The second stage continued from the baseline model and an initial prior distribution set to enforce a moderately strong belief. Incremental MAP training was then performed to adjust the model parameters for the new database (Table 1); 10 utterances were randomly chosen at each iteration. The experiments were repeated for the four training sets: headset microphone, single remote microphone, single microphone with spectral subtraction, and beamformed data. It should be noted that the training data size for the second stage (320 utterances from 10 talkers for each set) was an order of magnitude smaller than that used for creating the baseline model. Table 2 shows the recognition performance when different types of data sets were used for model parameter adjustment. Several interesting observations may be made from the table. First, the performance on the headset-microphone data is included to indicate the system's baseline level. The training improved recognition performance only slightly (0.9%).

data set              before training  after training  % change in error
headset microphone    91.2             92.1            10.2
single-microphone     44.6             75.9            56.5
spectral subtraction  67.8             76.5            27.0
beamformed            75.4             83.8            34.1

Table 2. Recognition performance for different data sets, measured in % words correct. The third column shows the percentage by which the error was reduced. 10 utterances were randomly chosen at each iteration for training. Performance for each data set was measured before and after the incremental MAP training (i.e., for the baseline and the adjusted models). The performance of each of the adjusted models was measured after 100 iterations (1000 utterances processed in total).
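The third column of Table 2 is the relative reduction in word error rate. For the single-microphone set, for example:

$$\frac{(100 - 44.6) - (100 - 75.9)}{100 - 44.6} = \frac{55.4 - 24.1}{55.4} \approx 56.5\%$$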

There was not much new information to learn from the new data set because the two training sets (one for the baseline model generation and the other for the model parameter adjustment) were both collected with the same hardware (the same close-talk headset microphone) and processed by the identical method. The only difference between them is that for the baseline system the talker was seated at a workstation, while for the microphone-array database the talker was standing in the middle of the room. Unlike the headset case, the MAP training substantially improved the performance for the single-microphone data set (44.6% to 75.9%), the spectral-subtraction data set (67.8% to 76.5%), and the beamformed data set (75.4% to 83.8%). This amounts to a 27% to 56% reduction of the error rate. The results demonstrate the effectiveness of incremental MAP training for determining better HMM parameters. The best performance achieved by the MAP-adjusted model with the beamformed data was 84.2%, accomplished when the training was continued for 400 iterations. This is close, though admittedly not as close as is desirable, to the result achieved with the high-quality headset data. Both before and after MAP training, the beamformed data set outperformed both the remote-microphone and the spectral-subtraction data sets by a fair margin. This suggests that the signal improvement performed by the beamforming, while resulting in only a modest gain in SNR, enhances the information content of the speech signal in a way not achieved by the spectral subtraction.

4.3. Effect of the Amount of Training Data

Figure 3 compares the improvement of the log-likelihood and the recognition performance for different numbers of training talkers. The beamformed data set was used for both training and testing, and the training set contained 2 to 10 talkers. When the number of talkers was smaller, the log-likelihood improved faster and reached a higher level; however, it did not take many iterations before the model overfit the small training set. Higher recognition performance was obtained when the number of talkers was larger. Although this result is not very surprising, it sheds some light on the approach. As noted earlier, the objective is to achieve talker-independent speech recognition using the microphone-array system. It would be ideal if a large amount of data could be collected for each task, but this is not always practical because so many scenarios exist for microphone-array applications. This experiment suggests that the compromise solution (i.e., deriving a dedicated model from a baseline system by incremental MAP estimation) is a practical alternative requiring a relatively small amount of data. The approach works even for a very small training set (e.g., 2 talkers), although the model is then susceptible to overfitting. In fact, the very small training set case may be considered a "talker adaptation" problem [15] rather than a talker-independent speech recognition task.


Figure 3. Improvement of the log-likelihood and the recognition performance with incremental MAP training, plotted against the total number of utterances processed. The beamformed data set was used for both training and testing. Training sets of (1) 10, (2) 6, and (3) 2 talkers are compared. The same 10 talkers were used for testing. The beamformed data set contained an equal number of female and male talkers, and each talker contributed 32 utterances.

5. SUMMARY

Earlier work [12] has shown that the incremental estimation approach improves training efficiency significantly. The series of experiments presented here serves as a validation of the MAP training method for the microphone-array speech recognition application. Incremental estimation provides a solution that derives a dedicated model from a baseline system using a relatively small amount of data. This paper has presented experiments on a combined beamforming/incremental MAP estimation approach for talker-independent microphone-array speech recognition. The experiments show that:

- beamformed data from the microphone-array system outperforms single-microphone data before, but especially after, MAP adaptation;
- beamforming outperforms spectral subtraction as a signal-enhancement technique for this task;
- when the baseline model is adjusted to the beamformed data through MAP adaptation, performance improves to a level close to that of the high-quality headset-microphone data; and
- the MAP adaptation approach works even for a very small "adjustment" training set, but care must be taken to avoid overfitting.

The important feature of the approach is that a model for any type of (microphone-array) environment can be derived from a relatively small amount of data.

REFERENCES

[1] Harvey F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(12):1699-1712, October 1987.

[2] Harvey F. Silverman, Stuart E. Kirtman, John E. Adcock, and Paul C. Meuse. Experimental results for baseline speech recognition performance using input acquired from a linear microphone array. In DARPA Workshop on Speech and Natural Language, pages 285-290, Harriman, NY, February 1992.

[3] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A closed-form method for finding source locations from microphone-array time-delay estimates. In ICASSP-95 [17], pages 3019-3022.

[4] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A practical time-delay estimator for localizing speech sources with a microphone array. Computer Speech and Language, 9:153-169, 1995.

[5] Stuart E. Kirtman and Harvey F. Silverman. A user-friendly system for microphone-array research. In ICASSP-95 [17], pages 3015-3018.

[6] Dirk Van Compernolle, Weiye Ma, Fei Xie, and Marc Van Diest. Speech recognition in noisy environments with the aid of microphone arrays. Speech Communication, 9:433-442, 1990.

[7] Richard M. Stern, Fu-Hua Liu, Yoshiaki Ohshima, Thomas M. Sullivan, and Alejandro Acero. Multiple approaches to robust speech recognition. In DARPA Workshop on Speech and Natural Language, pages 274-279, Harriman, NY, February 1992.

[8] B. de Vries, C. Che, R. Crane, J. L. Flanagan, Q. Lin, and J. Pearson. Neural network speech enhancement for noise robust speech recognition. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, Stockholm, Sweden, May 1995.

[9] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko. Computer-steered microphone arrays for sound transduction in large rooms. Journal of the Acoustical Society of America, 78(5):1508-1518, November 1985.

[10] John E. Adcock, Joseph H. DiBiase, Michael S. Brandstein, and Harvey F. Silverman. Practical issues in the use of a frequency-domain delay estimator for microphone-array applications. In Proceedings of the Acoustical Society of America Meeting, Austin, TX, November 1994.

[11] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2):113-120, April 1979.

[12] Yoshihiko Gotoh, Michael M. Hochberg, Daniel J. Mashao, and Harvey F. Silverman. Incremental MAP estimation of HMMs for efficient training and improved performance. In ICASSP-95 [17], pages 457-460.

[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[14] Radford M. Neal and Geoffrey E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. Submitted to Biometrika, 1993.

[15] Jean-Luc Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291-298, April 1994.

[16] Harvey F. Silverman and Yoshihiko Gotoh. On the implementation and computation of training an HMM recognizer having explicit state durations and multiple-feature-set tied-mixture observation probabilities. Technical Report, LEMS Monograph Series 1-1, Division of Engineering, Brown University, 1994.

[17] Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, May 1995.