INTEGRATION OF PARSING AND INCREMENTAL SPEECH RECOGNITION

Sven Wachsmuth, Gernot A. Fink, Gerhard Sagerer

University of Bielefeld, Technical Faculty, P.O. Box 100131, 33501 Bielefeld, Germany Tel.: +49 521 106 2935, Fax: +49 521 106 2992 e-Mail: [email protected]

ABSTRACT

In this paper we propose a new approach to integrating a parser into a statistical speech recognizer. The method is able to incrementally apply grammatical restrictions and to combine them robustly with the statistical acoustic and language models. On spontaneous speech data an 11.6% reduction in word error rate could be achieved compared to the baseline system applying statistical models only.

1 INTRODUCTION

From a statistical point of view, n-gram language models such as bi- or tri-grams are the method of choice to integrate knowledge about plausible word sequences of a given domain into a statistical speech recognition system. However, n-grams provide useful restrictions only when trained on a sufficient amount of data, which might not be readily available for the application in question. Additionally, the developers of a speech recognizer usually have a priori knowledge about the structure of utterances that will be observed by the system, especially when it is planned to interact with a speech understanding module that uses declarative domain knowledge. A convenient way to express such declarative expectations is to design, e.g., a context-free grammar for the language of the domain considered. As the development of a grammar is usually done by an expert and does not require large amounts of data to be collected beforehand, it might be the only way to formulate word-level restrictions when switching from one domain to a completely new one.

However, most current recognizers cannot make use of grammatical restrictions during the recognition process. In order to achieve this goal, non-trivial problems have to be solved. First of all, an effective and efficient interaction scheme between parser and recognizer has to be developed. When feedback to the user is to be generated as soon as possible, the parsing component must not be limited to processing complete hypothesis chains but must be able to generate results incrementally as the statistical recognizer processes speech input. Secondly, the scoring mechanism used must be able to combine statistical and declarative constraints in an advantageous way. This implies that violations of grammatical restrictions have to be handled as robustly as mismatches between the statistical models and the actual data.

In the following sections we will briefly review some important approaches dealing with the problem of integrating parsing technology into a statistical recognizer. Then we will propose a new approach that applies grammatical restrictions in a strictly time-synchronous way, thus making it possible to generate recognition results incrementally. Finally we will give some experimental results and a short conclusion.

(This work has been partly supported by the German Research Foundation (DFG) in the project SFB 360.)

2 RELATED WORK

Most approaches to coupling or integrating parsing and acoustic recognition proposed in the literature fall into one of the following categories:

Two-stage: In these systems the parser acts as a filter or postprocessor of results generated by a classical recognizer based on HMMs and n-grams [13]. Mostly n-best hypotheses are generated by the recognition system and passed along to the parsing component, which only rescores the results. Obviously, an incremental generation of results is not possible with this approach.

Compiled: Another way to exploit at least part of the constraints imposed by a grammar is to compile it into a weaker representation, usually finite-state machines, that can be used directly by the recognizer [12, 2, 11]. These approaches suffer from a complete loss of grammatical structure during the compilation phase and therefore cannot generate structured hypotheses.

Word-prediction: Many parsing algorithms, most easily LR-parsers, can be used to predict allowed successor words for a given word sequence. These can be used to dynamically extend the word-level search space of the recognizer [6, 5, 8, 7]. The complex computations frequently involved in the prediction phase and the high bandwidth of the interaction between parser and recognizer limit the efficiency of this approach.

Word-verification: Using the parser in a "word-synchronous" mode for potentially many search paths extended in parallel leads to a strategy where every extension of a hypothesis chain is scored by the parsing component [7, 9]. This approach does not suffer from the deficiencies mentioned earlier; an important prerequisite, however, is the successful combination of grammatical restrictions and statistical scores.

Figure 1: A part of the search space of word hypotheses and the corresponding tree of parse stacks for the German sentence "Ich muß heute morgen nach Mannheim." (I have to go to Mannheim this morning.). Nodes in the tree-organized parse stack are linked to the corresponding word hypotheses and carry pointers to the next symbol on the LR stack and to the last action of the parser; the labels S, R, A, $, F denote the parser actions shift, reduce, accept, abort, and skip fault. (a) The parser skips the ungrammatical word "Bus" (bus). (b) The parser aborts the structure currently built up and is reset to its initial state. (c) The parser has accepted the structure (morgens:TIME) and is re-initialized.

3 INCREMENTAL GRAMMAR-BASED SPEECH RECOGNITION

We propose a new system architecture for integrating parsing and speech recognition that solves several problems simultaneously. A purely declarative grammar, which is required to be LR(1), can be used to enhance recognition performance. This is especially useful for domains for which not enough data is yet available to train statistical models successfully. Furthermore, structures useful for higher-level processing stages can be modelled in the grammar and are output as hierarchically structured hypotheses and not merely as flat word chains. The grammatical restrictions are applied during a verification step "word-synchronously" in the same way as the language model scores are computed. The language model scores are applied time-synchronously for all word hypotheses ending at the current frame, as our recognizer uses a tree-organized lexicon. The parsing component thus merely acts as a special kind of language model and calculates a combined n-gram and grammar score for every word hypothesis considered. In order to accomplish this, a tree of parse stacks is built up in parallel with the tree of hypothesized words (fig. 1), using the grammar shown in figure 2. Most important for the success of this type of interaction is the choice of the scoring scheme.

S      → ich muß TIME TARGET
S      → TIME | TARGET
TIME   → heute morgen | morgens
TARGET → nach CITY
CITY   → Mannheim

Figure 2: Grammar used in figure 1.
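To make the shared data structure of figure 1 concrete, the following sketch shows one possible representation of a node in the tree-organized parse stack; all class, field and function names are illustrative assumptions, not the original implementation. Each node links a word hypothesis to its position on the LR stack and records the last parser action.

    from dataclasses import dataclass
    from typing import Optional

    # Parser actions as used in figure 1.
    SHIFT, REDUCE, ACCEPT, ABORT, SKIP_FAULT = "S", "R", "A", "$", "F"

    @dataclass
    class StackNode:
        symbol: str                    # grammar symbol on the LR stack
        prev: Optional["StackNode"]    # pointer to the next symbol on the LR stack
        word_hyp: Optional[str]        # word hypothesis that created this node
        last_action: str               # last action of the parser (S, R, A, $ or F)

    def extend(node: StackNode, word: str, action: str, symbol: str) -> StackNode:
        """Create a new stack node when a word hypothesis extends a search path.

        Nodes only point backwards, so alternative extensions of the same
        hypothesis share their common prefix; the result is a tree of parse
        stacks rather than one separate stack per search path."""
        return StackNode(symbol=symbol, prev=node, word_hyp=word, last_action=action)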

3.1 Domain Independent Scoring Scheme

Obviously, a mere binary grammar score either allowing or disallowing the combination of word hypotheses is not suitable. Violations of grammatical restrictions could never be compensated by the language model or acoustic scores, and the grammar could never boost valid hypotheses scored low by the statistical models. Therefore, we need a more sophisticated scoring scheme. Mainly two different approaches are described in the literature.

The first is based on a statistical foundation: a probability is associated with each rule of the grammar and estimated on a training corpus. When partial parsers are used they are often combined with a category-extended bi-gram [12, 14]. In [4] a homogeneous model is presented which integrates these two probabilities by calculating n-grams over the stack symbols of a shift-reduce parser. The main drawback of these probabilistic scoring schemes is the need for a large amount of training data. Our aim is the use of declarative and statistical models to complement one another; therefore, the declarative model should not depend on training data.

The second scoring scheme is based on penalty scores punishing ungrammatical word sequences and is independent of the particular grammar. A first attempt to combine an LR(1)-parser with a speech recognizer was realized in the SPHINX-LR and SPHINX-LR II systems using penalty scores [9]. Our approach is based on this second scheme but differs from SPHINX-LR in several aspects. Firstly, we use a partial LR(1)-parser which can build up partial structures even when there is no complete parse of the spoken sentence; therefore, grammatical restrictions can be applied to the whole utterance. Secondly, we distinguish three fundamental cases in which an LR(1)-parser can fail:

- aborting errors: the parser has started to build up a structure but the following word cannot extend it (fig. 1b).
- initializations: the state of the parser has been initialized because of an abortion or acceptance (fig. 1c).
- ungrammatical words: the parser is in its initial state but there is no structure that is started by the following word (fig. 1a).

In all cases different penalty scores are applied. Thirdly, we use a dynamic weighting of the statistical language model and grammar scores, reducing the language model weight when there has been an extension of the current parsing structure. Combining these simply defined grammar scores we obtain a more complex scoring scheme which models structural dependencies in the language. The values of the different penalty scores can be adapted in a few test runs.
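A minimal sketch of such a penalty-based combination is given below. The concrete penalty values, the weighting factors and all names are assumptions made for illustration only; they are not the values adapted in our test runs.

    # Illustrative penalty scores in the log domain; the actual values were
    # adapted in a few test runs and are not reproduced here.
    PENALTY = {
        "extend": 0.0,           # word extends the structure currently parsed
        "abort": -8.0,           # aborting error: structure cannot be extended
        "init": -4.0,            # parser re-initialized after abortion/acceptance
        "ungrammatical": -6.0,   # no structure is started by the hypothesized word
    }

    def combined_score(bigram_logprob: float, parser_event: str,
                       lm_weight: float = 1.0, reduced_lm_weight: float = 0.5) -> float:
        """Combine the bi-gram language model score with the grammar score.

        If the parser could extend the structure currently built up, the weight
        of the statistical language model is reduced; otherwise the full weight
        and the penalty for the respective failure case apply."""
        grammar_score = PENALTY[parser_event]
        weight = reduced_lm_weight if parser_event == "extend" else lm_weight
        return weight * bigram_logprob + grammar_score

For instance, a word hypothesis that extends the currently parsed constituent would be scored as combined_score(-2.3, "extend"), whereas an ungrammatical word receives the full bi-gram weight plus its penalty.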

3.2 Generating Results Incrementally

Due to the time-synchronous nature of the whole integrated parsing and acoustic matching process we are able to partially trace back hypothesis paths and generate results incrementally. We assume that most search paths active at the current frame t will merge into a single path with a certain delay Δt, i.e. at frame t − Δt. For efficiency reasons we consider only the best scoring path at frame t. From the corresponding hypothesis we start a trace-back until a hypothesis is found ending at most at frame t − Δt. If no grammar is used, incremental output starts at this hypothesis [3]. With a grammar, however, we have to go back one step further and check whether a constituent boundary was crossed. In this case the incremental output is activated for the predecessor hypothesis. Otherwise the constituent currently parsed has not been finished at t − Δt and no output is activated for this time step. While in the first case the output is generated word by word, the grammar-based recognizer generates results constituent by constituent. Thereby, the time delay Δt adapts to the occurrence of grammatical structures in a very flexible way. An interpretation of the recognized sentence is normally based on basic grammatical structures like noun phrases, e.g. "the last train"; therefore, there is no need to fix the chain of recognized words until some basic grammatical structures have been found. Furthermore, the time delay Δt can be reduced if the parser has accepted a constituent and extended if the parser generated an aborting error or skipped an ungrammatical word. As we are currently only interested in first-best solutions, an additional pruning step is necessary that eliminates all paths from the search space that might lead to results competing with the hypotheses created already.
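The following sketch outlines this trace-back decision. The hypothesis attributes and the constituent-boundary test are assumed interfaces introduced for illustration; they are not the original implementation.

    def incremental_output(best_hyp, t, delta_t, crosses_constituent_boundary):
        """Return the hypothesis up to which the best path at frame t may be fixed.

        best_hyp is the best-scoring hypothesis ending at frame t; hypotheses are
        assumed to expose .end_frame and .predecessor. The assumed predicate
        crosses_constituent_boundary(hyp) tells whether a constituent boundary
        lies directly after hyp."""
        # Trace back until a hypothesis is found ending at most at frame t - delta_t.
        hyp = best_hyp
        while hyp.predecessor is not None and hyp.end_frame > t - delta_t:
            hyp = hyp.predecessor
        if hyp.predecessor is None:
            return None            # nothing old enough to fix yet
        # With a grammar, go one step further back: output is only activated if a
        # constituent boundary was crossed; otherwise the constituent is unfinished.
        predecessor = hyp.predecessor
        if crosses_constituent_boundary(predecessor):
            return predecessor     # results are emitted constituent by constituent
        return None                # no output for this time step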

4 RESULTS

We evaluated our approach in the framework of a dialogue system for train time-table inquiries. The statistical recognizer [3] uses mel-cepstral features, semi-continuous HMMs, context-dependent acoustic units, and a bi-gram language model. The acoustic models were trained on over 12 hours of read-aloud speech from 85 different talkers. The grammar used was designed to generate initial requests to such a dialogue system [10]. The test set consisted of 115 spontaneous utterances of naive users of the dialogue system. This test set yields a perplexity of 24.9 and is covered to 63.7% by the grammar.

    in %    HMM     HMM+Bi.   HMM+Grm.   HMM+Bi.+Grm.
    WER     41.0    22.4      25.7       19.8
    SR      10.43   36.52     29.57      40.00
    CER     44.2    21.7      21.4       17.8

Table 1: Results for different configurations showing word error rate (WER), sentence recognition rate (SR) and constituent error rate (CER).

The base-line system yields a word error rate of 41.0% on this test set without a language model and 22.4% if a bi-gram model is used (tab. 1). With only grammatical restrictions a similar performance gain can be achieved, with 25.7% word error rate. This result shows that if no statistical language model were available, an acceptable improvement of recognition performance could be achieved using a grammar alone. Combining grammar and bi-gram model further improves the performance down to 19.8% word error rate, a reduction in word error rate of 11.6%. When evaluating the results on the level of grammar constituents an even better improvement from 21.7% constituent errors to 17.8% with grammatical constraints added can be achieved, i.e. an 18% reduction of constituent errors, which are defined analogously to the word error rate [1].
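The relative reductions quoted above follow directly from table 1, e.g.:

    # Relative error-rate reductions derived from table 1.
    wer_bigram, wer_combined = 22.4, 19.8
    cer_bigram, cer_combined = 21.7, 17.8
    print((wer_bigram - wer_combined) / wer_bigram)  # 0.116 -> 11.6% relative WER reduction
    print((cer_bigram - cer_combined) / cer_bigram)  # 0.180 -> ~18% relative CER reduction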

In a second experiment we tested various settings of the grammar-based speech recognizer and the base-line system in incremental processing mode (fig. 3). In tests with the base-line system we measured the word accuracy while varying the time delay Δt after which the word hypotheses are fixed. Using the grammar-based recognizer we ran the whole test series twice. In the first run only the dynamic variation of Δt was activated, without using grammar scores; the time delay Δt was increased by 400 ms if there was an abortion of a grammatical structure or a following ungrammatical word. In the second run the grammar scores were used as well. Results of the base-line system show that the word error rate rises steeply when the time delay goes below 700 ms, exceeding 35% at 400 ms. The word error rate of the grammar-based recognizer, in contrast, rises only very slowly; at a time delay of 400 ms we still obtain a word error rate below 25%. The shorter the time delay, the lower the rise in word error rate when grammar scores are used. This observation corresponds to the central idea of the grammar: grammatical structures model dependencies longer than those covered by a bi-gram.

Figure 3: Incremental results showing word error rate [%] while varying the average time delay Δt [ms] for different settings. The three curves correspond to constituents using bi-gram and grammar score, constituents using bi-gram score, and words using bi-gram score.

5 CONCLUSION

We presented a new approach to integrating grammatical constraints into a statistical speech recognizer. Our method differs from related approaches reported in the literature in several ways: First of all, a purely declarative grammar is used and grammatical structures are preserved in the recognized hypotheses. Secondly, grammatical and statistical constraints can be combined robustly using a flexible scoring mechanism. Finally, as constraints are applied time-synchronously, results can be generated incrementally and the time delay can be adapted to the occurrence of basic grammatical structures. We demonstrated the effectiveness of our approach on spontaneous speech data. Further research is concerned with the integration of the grammar-based recognizer into a speech understanding system which directly uses the grammatical structures generated by the recognizer.

References

[1] M. Boros, W. Eckert, F. Gallwitz, G. Görz, G. Hanrieder, and H. Niemann. Towards understanding spontaneous speech: Word accuracy vs. concept accuracy. In International Conference on Spoken Language Processing, volume 2, pages 1005-1008, 1996.
[2] W. Eckert, F. Gallwitz, and H. Niemann. Combining stochastic and linguistic language models for recognition of spontaneous speech. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, volume 1, pages 423-426, 1996.
[3] G. A. Fink, C. Schillo, F. Kummert, and G. Sagerer. Incremental speech recognition for multimodal interfaces. In IECON, Aachen, 1998. To appear.
[4] D. Goddeau. Using probabilistic shift-reduce parsing in speech recognition systems. In International Conference on Spoken Language Processing, pages 321-324, 1992.
[5] D. Goddeau and V. Zue. Integrating probabilistic LR-parsing into speech understanding systems. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, volume 1, pages 181-184, 1992.
[6] D. Goodine, S. Seneff, L. Hirschman, and M. Phillips. Full integration of speech and language understanding in the MIT spoken language system. In Proc. European Conf. on Speech Communication and Technology, pages 845-848, 1991.
[7] A. Hauenstein and H. Weber. An investigation of tightly coupled time synchronous speech language interface using a unification grammar. In Proceedings of the Workshop on Integration of Natural Language and Speech Processing at AAAI 94, pages 42-49, 1994.
[8] A. Kai and S. Nakagawa. A frame-synchronous continuous speech recognition algorithm using a top-down parsing of context-free grammar. In International Conference on Spoken Language Processing, pages 257-260, 1992.
[9] K. Kita and W. Ward. Incorporating LR-parsing into SPHINX. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pages 269-272, 1991.
[10] T. Kuhn. Die Erkennungsphase in einem Dialogsystem, volume 80 of Dissertationen zur Künstlichen Intelligenz. infix, Sankt Augustin, 1995.
[11] Q. Lin, D. Lubensky, M. Picheny, and P. S. Rao. CMU's robust spoken language understanding system. In Proc. European Conf. on Speech Communication and Technology, pages 255-258, Rhodes, 1997.
[12] M. Meteer and J. R. Rohlicek. Statistical language modeling combining n-gram and context-free grammars. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 37-40, 1993.
[13] M. Woszczyna, N. Aoki-Waibel, F. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. Rose, T. Schultz, B. Suhm, M. Tomita, and A. Waibel. JANUS 93: Towards spontaneous speech translation. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pages 345-348, 1994.
[14] J. Wright, G. Jones, and H. Lloyd-Thomas. A robust language model incorporating a substring parser and extended n-grams. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pages 361-368, 1994.