Automatic Speech Recognition System Using Acoustic and Visual Signals

Marcus E. Hennecke
Electrical Engineering Dept.
Stanford University
Stanford, CA 94305
[email protected]

K. Venkatesh Prasad and David G. Stork
Ricoh California Research Center
2882 Sand Hill, Suite 115
Menlo Park, CA 94025
prasad|[email protected]

Abstract

Automatic speechreading systems use both acoustic and visual signals to perform speech recognition. In previous work, we have shown how visual speech can improve the recognition accuracy of automatic speech recognition [10] and have described an algorithm based on deformable templates that accurately infers lip dynamics [1]. In this paper we present a complete speechreading system, which is able to record an utterance using a standard color video camera, preprocess both the audio and video signals, and perform speech recognition. The system is based on new algorithms for finding the talker's face and mouth and on an improved template algorithm for tracking the lips. We also compare the results from our new system with our previous work and discuss various strategies for integration of the two modalities.

1 Introduction

Information complementary to that in the traditional acoustic signal is available in the visual image of the talker, and several groups have successfully implemented speech recognition systems which make use of such visual information [4, 11, 3, 7, 10]. It is now clear that, whether by Hidden Markov Models, neural networks, or statistical pattern recognition approaches, visual information can improve speech recognition accuracy, especially in noisy environments. For any practical speechreading system, it is essential that the relevant visual features be extracted automatically. Previous research in most labs has employed special markers on the face or chromatic lipstick to facilitate the extraction of such features; of course, such approaches are not applicable to real-world systems. In this paper, we describe a complete and robust system that finds the face and the mouth and tracks the lips without the need for artificial alteration of the talker. We also describe various integration strategies for the acoustic and visual information and compare recognition accuracies.


2 Video processing

For video processing, we made use of deformable templates in order to infer the dynamics of the lip contours throughout an image sequence (see [1]). In broad overview, our system first finds the face and mouth in video frame t = 1. A mouth template is then placed close to the mouth and allowed to adapt to it. For each subsequent frame, the template is initialized with the previous optimal parameter set and then allowed to adapt again, thereby tracking the lips through the entire sequence. The parameters describing the mouth model are then fed to the recognizer. Our system is currently able to process about 5 frames per second on a SPARCstation 2.
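As a rough illustration only, the frame-by-frame loop can be sketched as follows; find_mouth and fit_template are hypothetical placeholders for the routines described in Sections 2.2 and 2.3, not our actual implementation.

# Minimal sketch of the tracking loop (Python).  The caller supplies
# find_mouth and fit_template, which stand in for the algorithms of
# Sections 2.2 and 2.3.
def track_lips(frames, find_mouth, fit_template):
    params = find_mouth(frames[0])              # initial placement near the mouth
    trajectory = []
    for frame in frames:
        params = fit_template(frame, params)    # warm-start from previous optimum
        trajectory.append(params)               # one parameter vector per frame
    return trajectory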

2.1 Finding the Face

In order to find the face, our system uses a simple algorithm which finds blobs of skin-colored pixels (Fig. 1) based on the RGB values of each individual pixel. Our first experiments used a simple 3-6-1 sigmoidal neural network [5] with the RGB values as inputs. However, subsequent experiments showed that similar results can be obtained by transforming the RGB values to a suitable color space and then using a simple Gaussian classifier; we have obtained good results with the YIQ color space. The face is then estimated to be the largest blob of skin-colored pixels in the image. Surprisingly, the chromaticity of the skin is remarkably constant across races; it is only the intensity that varies significantly. However, the color of the illumination does have an effect on the color registered by the video camera. While our system works well for standard office illumination, sunlight adds a slight bluish tint which can cause the system to fail. Although we have not yet implemented them, standard algorithms for discounting the color of the overall illumination should solve this problem [9].

Figure 1: A simple skin detector based on color, such as a neural network, marks pixels as either skin or no-skin. The grayscale value at the right corresponds to the relative probability that the pixel is skin-colored, and hence comprises the face. The detector works well for Caucasians, Asians, Indians, and Africans, and is largely independent of lighting.
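To make the classifier concrete, the following sketch transforms RGB pixels to YIQ and evaluates a Gaussian skin model on the chromatic (I, Q) components; the mean and covariance are assumed to have been estimated beforehand from labelled skin pixels, and the function name is ours, not part of the original system.

import numpy as np

# Sketch of the color-based skin detector.  RGB is linearly mapped to YIQ
# and a Gaussian density over the chromatic (I, Q) components gives a
# per-pixel skin likelihood; the intensity Y is deliberately ignored.
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])

def skin_probability(rgb, mean_iq, cov_iq):
    """rgb: (H, W, 3) float array; returns an (H, W) likelihood map."""
    iq = (rgb.reshape(-1, 3) @ RGB_TO_YIQ.T)[:, 1:]          # drop intensity Y
    d = iq - mean_iq
    mahal = np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov_iq), d)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov_iq))
    return (np.exp(-0.5 * mahal) / norm).reshape(rgb.shape[:2])

The face is then taken to be the largest connected blob of pixels whose skin likelihood exceeds a threshold.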

2.2 Finding the Mouth

Within the face, the mouth is usually characterized by low intensity values and by strong edges (lips, teeth). Figure 2 shows how these features are used to find the location of the mouth. After combining the inverted image with the output of the edge detector, the mouth, the eyes, and sometimes the nostrils give rise to large, white blobs. As an additional heuristic, the mouth is assumed to be the lowest blob larger than a certain size, a measure that works well as long as the face is not tilted by more than about 45 degrees. This approach works fairly well under typical lighting conditions, provided the additional constraint is used that the mouth must lie within the face.
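A minimal sketch of this heuristic is given below; the blur width, percentile thresholds, and minimum blob size are illustrative placeholders rather than the values used in our system.

import numpy as np
from scipy import ndimage

# Sketch of the mouth-finding heuristic: combine the inverted (valley)
# image with the edge image, threshold, and take the lowest sufficiently
# large blob.  Thresholds and min_size are placeholders.
def find_mouth(gray_face, min_size=50):
    valleys = ndimage.gaussian_filter(gray_face.max() - gray_face, sigma=3)
    edges = np.hypot(ndimage.sobel(gray_face, axis=0),
                     ndimage.sobel(gray_face, axis=1))
    edges = ndimage.gaussian_filter(edges, sigma=3)
    combined = (valleys > np.percentile(valleys, 90)) & \
               (edges > np.percentile(edges, 90))
    labels, n = ndimage.label(combined)
    candidates = [(i, s) for i, s in enumerate(ndimage.find_objects(labels), 1)
                  if np.sum(labels == i) >= min_size]
    # the mouth is assumed to be the lowest (largest row index) large blob
    return max((s for _, s in candidates), key=lambda s: s[0].stop, default=None)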

2.3 Tracking the Lips

Deformable templates are essentially parameterized descriptions of an object to be tracked [12]. In our work, the object to be tracked is the talker's mouth, and the template consists of parabolas and quartics which follow the outer and inner edges of the upper and lower lips. In order to keep the deformable template computations to a minimum, we used a very simple model of the mouth and lips consisting of two symmetric parabolas and three quartics (Fig. 3). The inner parabolas share the same (possibly varying) width, as do the outer quartics; each parabola and quartic has its own height. Given these continuity constraints, 9 parameters are required to describe the shape of the parabolas and quartics. These parameters can directly (or after normalization) be entered into the speech recognition engine. In addition, the x-y position of the template and the head rotation angle are determined, but they are not used for speech recognition. Once an optimal set of parameters has been found, the average graylevel intensity inside the mouth (the dark area in Fig. 3) is measured and also passed on to the recognizer; this gives a rough estimate of the visibility of the tongue. In all, there are a total of 10 visual parameters: 9 describe the shape of the lips and one estimates the visibility of the tongue.


Figure 2: The mouth detection algorithm combines the inverted grayscale image with the output of the edge detector; each of the two images is Gaussian blurred and thresholded before they are combined. The mouth region is characterized by a strong response in both images.

The fitting process makes use of two potential fields, an edge field and a valley field. The edge field is the output of an edge detector, blurred with a circularly symmetric exponential mask; the valley field is found by inverting and blurring the image.

Figure 3: The mouth template consists of two parabolas for the inside edges and three quartics for the outside edges. Including the center coordinates and the tip angle, there are a total of 12 parameters describing the shape, location, and orientation of the template, nine of which are used for recognition.

Adapting the template to the image involves adjusting the template parameters according to a gradient descent algorithm which minimizes a cost function E. The cost function consists of an integral of the edge field along the template curves, an integral of the valley field over the area inside the mouth (the dark area in Fig. 3), and certain penalty terms:

    E = -\sum_i c_i \int_{C_i} \Phi_e \, dl \;-\; c_v \int_A \Phi_v \, dA \;+\; \underbrace{\sum_k (\text{penalty terms})}_{\text{heuristics}}

where the c_i are constants associated with the four curves C_i, \Phi_e is the edge potential field, c_v is a constant associated with the area A that marks the inside of the mouth, and \Phi_v is the valley potential field. The penalty terms make it possible to incorporate heuristics that rule out impossible parameter combinations, such as the upper lip lying beneath the lower lip. During the gradient descent we can roughly distinguish two phases, corresponding to coarse and fine alignment. In the beginning, the template is located close to the mouth. At this point there is little overlap between the template curves and the edges of the mouth, so the edge terms contribute little to the gradient descent; instead, the valley term pulls the template in and centers it on the mouth. Once the template is aligned with the mouth, the valley term is at a minimum and contributes little to the gradient descent, and the edge field begins to interact with the template curves, deforming the template to match the shape of the mouth. During the entire adaptation process the penalty terms ensure that the parameters stay within a reasonable range. One improvement over our previous work is the incorporation of the valley field. Experiments have shown that the edge field alone is not enough to allow robust tracking of the lips over long image sequences or when the talker moves the face; the valley field ensures accurate tracking even under extreme lighting conditions.
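As an illustration of the adaptation step, the sketch below performs plain gradient descent on a user-supplied cost function E(p) using finite-difference gradients; the cost function, step size, and iteration count are placeholders for this sketch, not a description of our actual optimizer.

import numpy as np

# Sketch of template adaptation by gradient descent on the cost E(p).
# cost(p) is assumed to evaluate the edge and valley integrals plus the
# penalty terms for a template parameter vector p; it is a placeholder.
def fit_template(cost, p0, step=0.1, eps=1e-3, iters=200):
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros_like(p)
        for j in range(p.size):                 # central finite differences
            dp = np.zeros_like(p)
            dp[j] = eps
            grad[j] = (cost(p + dp) - cost(p - dp)) / (2.0 * eps)
        p -= step * grad                        # descend on E
    return p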

3 Recognition

In speechreading, recognition generally follows the same principles as in traditional acoustic-only recognition. Most systems use similar recognition algorithms, such as Hidden Markov Models (HMMs) and neural networks (NNs). One problem that is particular to speechreading is the integration of the visual and acoustic information. Most speechreading systems integrate either early or late. In early integration, the feature vectors are simply concatenated to form one combined vector. A late integration system uses two recognizers to compute likelihoods and then combines them in some suitable way. In some sense, late integration is just a special case of early integration with a particular recognition algorithm. The advantage of early integration is that the recognizer can make full use of all the available information, including time dependencies; for example, the voice onset time (VOT) can easily be measured. One should therefore expect a recognizer to perform better with an early integration scheme. However, due to the larger size of the input, recognizers will usually require more training data, and in order to find relevant timing information the audio and video must be accurately synchronized. But even if sufficient training data are available and synchrony is ensured, a recognizer may not be capable of exploiting such information: the rate of visual speech is usually slower than that of acoustic speech, and HMMs cannot model the two rates very well. In late integration, two recognizers (possibly using different recognition algorithms) operate on the separate data streams. The acoustic and visual channels need not be synchronized very precisely, and the contribution from each can be weighted according to its reliability. However, late integration quickly becomes impractical for large-vocabulary systems; in such cases, integration must occur at some intermediate stage. Also, all timing information between the acoustic and the visual channel is lost and cannot be used by the recognizer.
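For illustration, early integration can be sketched as a simple per-frame concatenation; here the visual features are repeated up to the acoustic frame rate, which is an assumption made for this sketch rather than a description of our preprocessing.

import numpy as np

# Sketch of early integration: align the (slower) visual features to the
# acoustic frame rate by repetition and concatenate per frame.
def early_integration(acoustic, visual):
    """acoustic: (Ta, Da) array, visual: (Tv, Dv) array -> (Ta, Da + Dv)."""
    idx = np.minimum(np.arange(len(acoustic)) * len(visual) // len(acoustic),
                     len(visual) - 1)
    return np.concatenate([acoustic, visual[idx]], axis=1)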

                     Accuracy   Error
audio only             43%       57%
video only             38%       62%
audio and video        59%       41%
improvement            16%       28%

Table 1: Preliminary recognition results show a significant improvement of the combined system over the acoustic-only system. The improvement of the error rate is shown as a relative quantity.

4 Results

For our experiments we used a commercially available speech recognizer based on Hidden Markov Models. The system allowed us to arbitrarily configure the HMMs and gave us access to all log-likelihoods computed by the Viterbi decoder. Thus, we were able to run a range of experiments and implement both early and late integration strategies.

4.1 Experiment 1

In a preliminary experiment, we recorded the four consonant-vowel syllables /da/, /fa/, /la/, and /ma/ at 10 dB signal-to-noise ratio (SNR) from 10 native and non-native, male and female English talkers, each repeated five times, resulting in a total of 200 utterances. The acoustic parameters were fourteen log Mel power spectrum coefficients, and the visual parameters were the nine deformable template parameters. For the combined system, the feature vectors were simply concatenated (early integration). Of the 200 utterances, 40 (four distinct utterances from each of the ten talkers) were set aside as the test set. For each of the four syllables we used a standard continuous HMM. Each model had 4 states, including one input and one output state. The states were connected so that the model could only traverse the states from left to right, without skipping a state. The output distribution was a single-mixture multivariate Gaussian probability density function with a full covariance matrix. Table 1 compares the performance of the audio-only, video-only, and combined systems. Clearly, the visual information improved the performance of the recognizer significantly. It should be noted, however, that this task is particularly difficult for an acoustic recognizer: the signal-to-noise ratio is very low and the utterances offer only little context information.
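The left-to-right, no-skip topology of these models corresponds to a transition matrix of the following form (a sketch only; the self-loop probability shown is a placeholder that would be re-estimated during training).

import numpy as np

# Sketch of a left-to-right transition matrix with no state skipping:
# each state either repeats or advances to the next state.
def left_to_right_transmat(n_states=4, self_loop=0.6):
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s] = self_loop
        A[s, s + 1] = 1.0 - self_loop
    A[-1, -1] = 1.0          # the final (output) state is absorbing
    return A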

                              Accuracy   Error
audio only                      93%        7%
video only                      17%       83%
audio and video (early)         79%       21%
audio and video (late)          94%        6%
improvement (early)            -14%     -300%
improvement (late)               1%       14%

Table 2: If the data are more favorable for acoustic speech recognition systems, the visual information hurts the performance in the case of early integration. For late integration we still see some improvement of the combined system over the acoustic-only system.

4.2 Experiment 2

We were particularly interested in how a system would perform in situations that are more favorable for acoustic speech recognizers. In a second experiment, we recorded 11 Shinkansen train station names from 10 speakers at 24 dB SNR. Each name was uttered twice; the first utterance was used for training, the second for testing. We used 12 log Mel power spectrum coefficients as acoustic parameters, and three visual parameters were derived by computing the height and width of the mouth opening and the thickness of the upper lip. In this experiment we also compared early and late integration. In order to find the optimal configuration for each of the 11 word models, we used a brute-force search strategy: each model was trained with 4, 5, 6, up to 22 states and with one or two Gaussian mixtures, and we then kept those models that produced the highest average log-likelihoods for the correct words. The covariance matrices of the mixtures were forced to be diagonal. Our late integration strategy consisted of multiplying the log-likelihoods of the acoustic recognizer by a constant λ and those of the visual recognizer by (1 - λ). This strategy is explained in more detail in [6]. The purpose of the parameter λ is to introduce a weighting which takes into account the relative reliability of the two channels. After some experiments, we found λ = 0.8 to be optimal. Table 2 shows the results for this second experiment. The video data are clearly less accurate and reliable than the audio data; however, the accuracy of the video-alone system is better than chance (9%). As expected, the improvement of the combined systems over the acoustic-only system is now much smaller. In fact, for early integration the system performs worse: the HMMs are not able to model the two rates of speech very well, and they weight both channels equally, regardless of their reliability. The late integration system does not have these disadvantages and is able to improve the recognition accuracy of the acoustic-only system.
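For clarity, the late integration rule can be written out as follows; the dictionary-based interface and function name are our own illustration, not the recognizer's actual API.

# Sketch of late integration: weight the per-word log-likelihoods of the
# two recognizers by lambda and pick the best-scoring word.
def late_integration(loglik_audio, loglik_video, lam=0.8):
    """loglik_*: dicts mapping word -> Viterbi log-likelihood."""
    combined = {w: lam * loglik_audio[w] + (1.0 - lam) * loglik_video[w]
                for w in loglik_audio}
    return max(combined, key=combined.get)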

5 Conclusions

We have described a system that is able to record a talker's utterances with a standard video camera and preprocess the image sequence in a way suitable for speech recognition. We have shown that simple and computationally efficient algorithms are capable of locating the face and mouth and of inferring lip dynamics throughout an entire image sequence. The estimated parameters are suitable for speech recognition and can improve recognition accuracy even in low-noise situations.

Acknowledgements

We would like to thank Greg Wolff for useful discussions and support.

References

[1] Marcus E. Hennecke, K. Venkatesh Prasad & David G. Stork. (1994) "Using Deformable Templates to Infer Visual Speech Dynamics," Proc. 28th Asilomar Conference on Signals, Systems, and Computers, pp. 578-582.

[2] Michael Kass, Andrew Witkin & Demetri Terzopoulos. (1988) "Snakes: Active contour models," International Journal of Computer Vision, pp. 321-331.

[3] Alexander Pentland & Kenji Mase. (1989) "Lip reading: Automatic visual recognition of spoken words," Proc. Image Understanding and Machine Vision, Optical Society of America, June 12-14.

[4] Eric D. Petajan, B. Bischoff & D. Bodoff. (1988) "An improved automatic lipreading system to enhance speech recognition," ACM SIGCHI-88, pp. 19-25.

[5] David E. Rumelhart, James L. McClelland et al. (1986) Parallel Distributed Processing. MIT Press, Cambridge, Massachusetts.

[6] Peter L. Silsbee. (1993) Computer Lipreading for Improved Accuracy in Automatic Speech Recognition. PhD thesis, University of Texas at Austin.

[7] David G. Stork, Greg Wolff & Earl Levine. (1992) "Neural network lipreading system for improved speech recognition," Proc. IJCNN-92, Vol. II, pp. 154-159.

[8] David G. Stork (ed). (1996) Speechreading by Humans and Machines. Springer. (In press)

[9] Brian A. Wandell. (1995) Foundations of Vision. Sinauer Associates, Sunderland, MA.

[10] Greg Wolff, K. Venkatesh Prasad, David G. Stork & Marcus Hennecke. (1994) "Lipreading by neural networks: Visual preprocessing, learning and sensory integration," Proceedings of Neural Information Processing Systems-6, J. D. Cowan, G. Tesauro and J. Alspector (eds.), Morgan Kaufmann, pp. 1027-1034.

[11] Benjamin P. Yuhas, M. H. Goldstein, Jr., Terrence J. Sejnowski & R. E. Jenkins. (1988) "Neural network models of sensory integration for improved vowel recognition," Proc. IEEE 78(10), pp. 1658-1668.

[12] Alan L. Yuille, David S. Cohen & Peter W. Hallinan. (1989) "Feature extraction from faces using deformable templates," Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington DC, IEEE Computer Society Press, pp. 104-109.