A COMPARISON OF HMM AND NEURAL NETWORK APPROACHES TO REAL WORLD TELEPHONE SPEECH APPLICATIONS

Pieter Vermeulen, Etienne Barnard, Yonghong Yan, Mark Fanty and Ronald Cole

Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology
20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000, USA
Tel: +1 503-690-1484, E-mail: [email protected]

ABSTRACT

We compare a standard HMM-based and a neural-network-based approach to speech recognition. The application is speaker-independent recognition of a small vocabulary over the telephone. While the recognition results are comparable, we argue that the neural network system is the better choice for implementation.

1. Introduction

Hidden Markov model (HMM) based speech recognition systems are widely accepted as the method of choice for speaker-independent speech recognition [1]. For reasons described in Section 3, we believe that neural network technology will eventually be preferable for large-scale (thousands of ports), small-vocabulary (on the order of 100 phrases), speaker-independent telephone applications. It is, however, crucial to evaluate this less mature technology against HMMs from the perspectives of both classification accuracy and cost of implementation.

In this paper we discuss a specific application: voice access to networked telephone services. In the USA the local telephone service providers offer, in addition to basic telephone services, optional services such as call waiting, call forwarding, speed dialing, three-way calling, etc. The subscriber interacts with these services using DTMF access codes and interactive DTMF menus. Voice recognition offers a simpler interface to these services using a limited number of phrases. For example, to switch off the call waiting feature a subscriber typically has to dial "*" followed by a two-digit code. This can be replaced with a system where the user dials "*" and then utters the phrase "cancel call waiting". To switch it back on, the subscriber would dial "*" and say "switch on call waiting".

The system described here had a vocabulary of 58 such phrases. We implemented both HMM and hybrid neural network versions of the recognizer and compared them. A standard continuous HMM architecture was implemented using a mixture of triphone and monophone phone models. The input features were 12 LPC cepstral coefficients, energy, and their delta vector, and the models were trained with the Baum-Welch algorithm. The reader is referred to [1] for a complete description of this system. In the next section we describe the lesser-known hybrid neural network algorithm. We then compare the two systems on the telephone voice access application (Section 3), and conclude with a discussion of the relative merits of these algorithms.

2. Overview of the neural network speech recognition architecture

[Figure 1 shows the processing pipeline: Data Capture, Barge-In/End-of-Utterance Detection, RASTA-PLP Computation, Energy Prenormalization, Feature Collection, Frame-Based Phonetic Classification, Viterbi Search.]

Figure 1. General architecture of a speech recognition system based on neural networks

Conceptually, our speech recognition system based on neural networks consists of three stages [2], as shown in Figure 1:



• The incoming speech (in our case, mu-law encoded digital samples at an 8 kHz sampling rate, transmitted over a telephone line) is first converted to a representation more suitable for recognition. We use Perceptual Linear Predictive (PLP) analysis [3]; this is a modification of linear predictive coding which takes into account some of the properties of human hearing. We compute seventh-order PLP coefficients in a 10 msec window, and advance the window 6 msec at a time. We thus obtain seven PLP coefficients (and the energy) every 6 msec; we refer to this as a "frame" of speech.

[Figure 2 shows the frames selected as classifier input: the current frame plus frames at offsets of -84 to -72 ms, -48 to -36 ms, -18 to -6 ms, 6 to 18 ms, 36 to 48 ms and 72 to 84 ms; each selected frame contributes seven PLP coefficients plus energy.]

Figure 2. Frame selection for classification

• The computed PLP coefficients are used as input to a neural network which performs phonetic classification (see the sketch following this list). This network has 57 inputs: eight feature values (seven PLP coefficients + energy) at each of seven sampling offsets surrounding the frame to be classified, and a bias neuron. The surrounding frames selected are shown in Figure 2. We represent all English words in terms of 39 phonemes, and each phoneme is modeled in three parts: left, middle and right. The left and right parts of each phoneme are further modeled by a number of output nodes depending on the adjacent phoneme. The middle part of each phoneme is modeled by a single context-independent output node. The classifier estimates the probability that each of these phoneme classes is present.

• Finally, a Viterbi search is used to combine this matrix of classification probabilities so as to decide which word (or sequence of words) was spoken. Thus, each

word is expressed in terms of the sequence of phonemes that is expected when that word is uttered. To compute the likelihood that each word was spoken, one typically assumes that the acoustic vectors in different time frames are independent, so that the likelihood of a phoneme occurring in a sequence of time frames is the product of the likelihoods in each of the individual time frames.
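To make these two stages concrete, here is a minimal numpy sketch, written under stated assumptions rather than as the authors' implementation: the 45-hidden/356-output layer sizes are taken from Section 3, while the tanh hidden activation, the softmax output, and the particular frame chosen within each offset range of Figure 2 (13, 7 and 2 frames of 6 ms, i.e. 78, 42 and 12 ms) are our guesses.

```python
# Minimal sketch of the frame-based classifier and word scoring (assumptions
# flagged inline; this is a reconstruction, not the authors' code).
import numpy as np

# One frame every 6 ms. Offsets chosen inside the ranges of Figure 2:
# +/-13 frames = 78 ms (72..84), +/-7 = 42 ms (36..48), +/-2 = 12 ms (6..18).
CONTEXT = (-13, -7, -2, 0, 2, 7, 13)

def input_vector(frames: np.ndarray, t: int) -> np.ndarray:
    """57-dim input for frame t: 8 features (7 PLP + energy) at 7 offsets,
    plus a bias neuron. `frames` is a (T, 8) array."""
    T = len(frames)
    ctx = [frames[min(max(t + d, 0), T - 1)] for d in CONTEXT]  # clamp at ends
    return np.concatenate(ctx + [np.ones(1)])

def phone_posteriors(frames: np.ndarray, W_hid: np.ndarray, W_out: np.ndarray):
    """Per-frame phoneme-class probabilities from a one-hidden-layer network.
    W_hid is (57, 45), W_out is (45, 356); tanh and softmax are assumptions."""
    X = np.stack([input_vector(frames, t) for t in range(len(frames))])
    H = np.tanh(X @ W_hid)
    Z = H @ W_out
    Z -= Z.max(axis=1, keepdims=True)          # numerical safety for softmax
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

def word_log_likelihood(post, phone_ids, segmentation):
    """Score one word given a phoneme segmentation. Frames are treated as
    independent, so a phoneme's likelihood over its segment is the product
    of its per-frame probabilities (a sum of logs); the real system uses a
    Viterbi search to find the best segmentation and the best word."""
    total = 0.0
    for phone, (start, end) in zip(phone_ids, segmentation):
        total += np.log(post[start:end, phone] + 1e-12).sum()
    return total
```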

3. Comparison

To compare the two approaches we collected a corpus of 1200 digital calls in which the caller repeated each of the 58 phrases. Each utterance was verified by a human labeler. The corpus was divided into a training set, a development test set, and a final test set, as shown in Table 1. The systems were trained on the training data and tuned using the development data; a single run on the final-test data produced the results reported here.

Train   Dev   Final   Total
  731   233     236    1200

Table 1. Number of calls and division into training, development and final test sets.

The HMM system was initialized with a general English front end and trained on these data using the Baum-Welch algorithm. In total 274 phone models (a mixture of triphone and monophone models) were used. Each phone model was a three-state left-to-right HMM, and each state was modeled by a three-component Gaussian mixture probability density function with diagonal covariance matrices. The total number of trained parameters in this architecture was 130,698.

To train our neural network probability estimator we need phonetically labeled data. We used a previously trained general English recognizer to produce forced alignments on these utterances (that is, an automatic phonetic time alignment of the known word string to the utterance). The resulting time-aligned segments were used as training data for a new task-dependent network. The neural network consisted of 57 inputs, 45 hidden nodes and 356 output nodes, i.e. 18,585 parameters to train.
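Both quoted parameter counts can be checked with simple arithmetic. The neural network count follows directly from the layer sizes; the HMM breakdown below (26-dimensional features: 12 cepstra + energy plus their deltas) is our reconstruction rather than something stated explicitly in the text.

```python
# Sanity check of the parameter counts quoted above.

# Neural network: 57 inputs (bias included) -> 45 hidden -> 356 outputs.
nn_params = 57 * 45 + 45 * 356
print(nn_params)                      # 18585, as quoted

# HMM (our reconstruction): 26-dim features (12 cepstra + energy, + deltas);
# per state, 3 diagonal Gaussians (26 means + 26 variances) + 3 mixture weights.
per_state = 3 * (26 + 26) + 3         # 159 parameters per state
hmm_params = 274 * 3 * per_state      # 274 models x 3 states each
print(hmm_params)                     # 130698, as quoted
```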

The recognition results are shown in Table 2.

System   Recognition
HMM      97.0%
NN       95.5%

Table 2. Relative performance of the HMM and neural network (NN) speech recognition systems on the 58-word task.

Although the HMM system performed somewhat better on this test (3.0% vs. 4.5% error rate), we believe that these results are comparable for a real-world system and that the neural-network-based system will eventually be preferable for implementation. This belief is based on three facts:

• Theoretically, the density function implemented by the neural network is more general than the Gaussian mixture used by the HMM; thus, the performance of the neural network should eventually exceed that of the HMM. Although HMM technology is much more mature than that of neural networks, the performance differential has already decreased significantly, and should be reversed in the near future.

• The neural network system is both more compact and more regular than the HMM system, making for a more efficient implementation. The density estimator in our best HMM system has approximately 130,000 parameters, whereas the neural network has only 20,000; evaluation of the HMM probability densities requires around 175,000 mathematical operations (adds, multiplies, multiply-accumulates, etc.), whereas the neural network requires around 40,000.

• By casting our whole system in a neural network framework, we are able to unify several functions such as speaker verification, speech recognition and confidence evaluation. These same functions can obviously be obtained with an HMM system, but less naturally and thus less efficiently.

The neural network system is very suitable for hardware implementation: the algorithms are compact and efficient, and are mostly highly regular, thus executing very efficiently on a pipelined DSP architecture. We have demonstrated a near-real-time implementation of this algorithm on a Linkon FC3000 board. This PC-based telephony board is built around an 88 MHz DSP32C and can be configured with up to 12 telephone ports, with a DSP per port. The algorithm easily fits on the DSP32C (Table 3) and runs in 106% of real time, as detailed in Table 4.
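The operation counts in the second point can likewise be bracketed with a back-of-the-envelope estimate; the authors' exact accounting is not given, so the figures below are only a plausibility check under our assumptions.

```python
# Rough per-frame operation counts (plausibility check only).

# NN: a multiply and an add per weight.
nn_ops = 2 * (57 * 45 + 45 * 356)     # about 37,000 -- near the quoted 40,000

# HMM: per diagonal-Gaussian dimension, roughly a subtract, a multiply
# and an accumulate; 274 models x 3 states x 3 components = 2466 Gaussians.
hmm_ops = 274 * 3 * 3 * 26 * 3        # about 192,000 -- same order as 175,000
print(nn_ops, hmm_ops)
```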

Module               Code    Data
End-of-utterance      2708      36
DC-Offset Removal      256      44
PLP                   2704    3052
Pre-normalization      564     576
Feature Collection     564    1068
Neural Network        5840   74340
Viterbi Search        4328   67400
Shared Parameters       --   87431
Total                25272  233947

Table 3. Memory requirements (bytes) for DSP implementation.

Module               % real time
End-of-utterance         3%
DC-Offset Removal        0.2%
PLP                     10%
Pre-normalization        0.04%
Feature Collection       0.1%
Neural Network          30%
Viterbi Search          63%
Total                  106%

Table 4. Timing measurement for DSP implementation

This implementation is described in more detail by Schalkwyk et al. [4].

4. Conclusion

Although the HMM system produced slightly more accurate recognition, the results were comparable, and we conclude that the neural network system is the preferred solution for two reasons. First, it is clearly superior in terms of computational requirements. Second, it allows for a unified system with additional functions such as speaker verification and confidence evaluation.

References

[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.

[2] E. Barnard, R. Cole, M. Fanty, and P. Vermeulen, "Real-world speech recognition with neural networks," in Applications of Neural Networks to Telecommunications 2 (J. Alspector, R. Goodman, and T. X. Brown, eds.), pp. 186-193, Hillsdale, NJ: Lawrence Erlbaum Associates (IWANNT*95), 1995.

[3] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.

[4] J. Schalkwyk, P. Vermeulen, M. Fanty, and R. Cole, "Embedded implementation of a hybrid neural-network telephone speech recognition system," in Proc. Int. Conf. on Neural Networks and Signal Processing, Nanjing, P.R. China, Dec. 1995.