Measuring the Quality of Pronunciation Dictionaries

Matthias Wolff, Matthias Eichner and Rüdiger Hoffmann
Dresden University of Technology, Laboratory of Acoustics and Speech Communication, D-01062 Dresden, Germany
Matthias.Wolff|Matthias.Eichner|[email protected]

Abstract

In this paper we investigate measures for the evaluation of pronunciation dictionaries that can be used independently of the type of lexicon, the language, a specific recognizer, and how the dictionary was generated. We describe statistical measures, measures based on information theory, and performance measures, and give examples of how these measures can be practically applied in the supervision of data-driven dictionary training, the selection of pronunciation variants, and the evaluation of the consistency of different dictionaries. Although the introduced measures are independent of the type of dictionary, we only report results obtained with data-driven dictionary generation and do not address measures specific to rule-based approaches.

1. Introduction

Over the last decade, much effort has been put into incorporating knowledge about pronunciation variation into speech recognizers. Researchers have developed a great variety of methods to obtain this knowledge (see [1] for a detailed overview). Unfortunately, in most cases pronunciation modeling yields rather little success in terms of reduction of the word error rate. Indeed, many researchers report that their recognition systems tend to get confused when they use a significant number of pronunciation variants. One common conclusion drawn from this fact is that modeling pronunciation variants, except for a couple of multiwords and liaisons, is rather counterproductive in practical applications. However, every single work in this field seems to show one point very clearly: pronunciation variation is a non-negligible phenomenon in natural speech! So how can we "improve" speech recognizers by ignoring this phenomenon? Perhaps it is too simple just to blame the variant lexicon, particularly when we consider the fact that many errors of recognizers can hardly be located [11]. This paper aims to support the thesis that relevant and consistent pronunciation dictionaries can be built, and that this can be demonstrated by suitable evaluation measures independently of any recognizer problems. We will describe a number of such measures which
• do not depend on the language, the vocabulary, the organization and the generation method of the dictionary,
• allow a direct comparison of different variant dictionaries,
• help to supervise dictionary generation and to select pronunciation variants from over-generated dictionaries,
• provide information about the consistency of the phonetic transcription.
We will demonstrate the capabilities and limitations of these measures on variant dictionaries obtained by a data-driven training procedure [4].

2. Evaluation Measures for Variant Dictionaries

For the calculation of the evaluation measures described in the following sections we define a general representation of a pronunciation dictionary. The dictionary L is defined as a set of word models W:

L = {W}   (1)

A word model consists of an orthographic string O and a directed acyclic graph G which represents a set of pronunciation variants A(W). The nodes of G carry phoneme symbols a, the edges carry transition probabilities:

W = {O, G} = {O, U, Ψ_UU}   (2)

where U = {u_1, …, u_N} denotes the node set of G and Ψ_UU : U × U → {∅, 1} denotes the incidence relation. We do not allow parallel edges; hence the edge set is implicitly defined by the incidence relation. A prior unigram probability p(W), which can be estimated from a text corpus, is assigned to each word model. Further, we define a pronunciation variant A as a consecutive path of length N through G:

A = a_1 ∘ … ∘ a_N   where a_i ∈ U   (3)

The probability of A is the product of the transition probabilities along the path:

p(A) = ∏_{i=1…N−1} p(ψ_{a_i a_{i+1}})   with   ∑_{A∈A(W)} p(A) = 1   (4)

Figure 1: Example of the word model "heute" (today)

This general form of a word pronunciation model is suitable for both phoneme network and list forms. The calculation of most of the evaluation measures requires that either transition or variant likelihoods are known. If the variants are generated by a rule set, rule application probabilities can be employed to estimate these likelihoods.
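To make the representation concrete, the variant enumeration of Eqs. (2)-(4) can be sketched in code. This is an illustrative sketch only, not the authors' implementation; the toy graph (loosely modelled on the "heute" example of Figure 1) and its transition probabilities are invented:

```python
# A word model as a phoneme DAG: each node carries a phoneme symbol,
# each edge a transition probability. A pronunciation variant is a
# path through the graph; its probability is the product of the
# transition probabilities along the path (Eq. 4).

def enumerate_variants(start_nodes, graph):
    """Enumerate all (path, probability) pairs of a phoneme DAG.

    `graph` maps a node to a list of (successor, transition_prob);
    an empty list marks a final node. `start_nodes` carries the
    initial transition probabilities.
    """
    variants = []

    def walk(node, path, prob):
        succs = graph[node]
        if not succs:                      # final node: path complete
            variants.append((path, prob))
            return
        for nxt, p in succs:
            walk(nxt, path + [nxt], prob * p)

    for node, p in start_nodes:
        walk(node, [node], p)
    return variants

# Invented toy model with two variants of "heute" (/hOYt@/ vs /hOt@/).
graph = {
    "h": [("OY", 0.7), ("O", 0.3)],
    "OY": [("t", 1.0)],
    "O": [("t", 1.0)],
    "t": [("@", 1.0)],
    "@": [],
}
variants = enumerate_variants([("h", 1.0)], graph)
total = sum(p for _, p in variants)   # Eq. (4): variant probabilities sum to 1
```

Because every path ends in a final node and each node's outgoing probabilities sum to one, the enumerated variant probabilities satisfy the normalization constraint of Eq. (4).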

2.1. Statistical Measures

2.1.1. Simple Measures

A simple but commonly used property of a dictionary is the average number of pronunciation variants per word:

N_A = (1/|L|) ∑_{W∈L} |A(W)|   (5)

In most LVASR systems N_A is chosen not greater than 1.5 (see e.g. [1][7] and section 1); only a few rather research-oriented approaches use significantly more variants (e.g. [8]). A related simple value is the average number of nodes per model,

n_A = (1/|L|) ∑_{W∈L} |W|   (6)

which we use to control the growth of the dictionary during the automatic training process [4].

2.1.2. Relevance of Dictionary Parts

Most scenarios of automatically obtaining pronunciation variants have to deal with the problem of over-generation. This is particularly, but not exclusively, the case for rule-based approaches. Technical restrictions often require using only a part of the pronunciation variants in the application. There are several strategies for limiting the number of pronunciation variants in a dictionary [1]. Actually, in many practical applications most of the pronunciations are removed. This raises the question of how relevant the remaining part of the pronunciation dictionary is. We can estimate the relevance of a dictionary part L′ by the global probability mass that it represents in L:

r(L′) = ∑_{W∈L} ( p(W) · ∑_{A∈A′(W)} p(A) )   (7)

where A′(W) ⊆ A(W) denotes the pronunciations which are part of L′. In section 3.2 we will show how to use the relevance for the selection of variants from an over-generated dictionary.

2.2. Information Theoretical Measures

This class of measures is based on considering the word models as stochastic automata which describe ergodic processes. There are two interpretations of a pronunciation model as an information source: one emitting phonemes and one emitting pronunciations as symbols.

2.2.1. Phonemes as Symbols

First we consider the word model as an information source emitting phoneme symbols. The phoneme entropy of such a source is defined by:

H_a(W) = −(1/Q) ld p(a_1 ∘ … ∘ a_Q)   (8)

where a_1 ∘ … ∘ a_Q denotes a typical emitted phoneme sequence of length Q. From the phoneme entropy we derive the phoneme perplexity of the word model:

P_a(W) = 2^{H_a(W)}   (9)

which describes the average uncertainty of selecting a transition originating from a particular node. Thus the phoneme perplexity can be interpreted as a measure of the consistency of a word model. The practical application of a phoneme-level description for controlling a data-driven dictionary training approach [4] showed some disadvantages (see section 4, also [10]).

2.2.2. Pronunciation Variants as Symbols

The second possible interpretation of the word model as an information source emits entire pronunciation variants as symbols. The variant entropy of the model is then given by:

H_A(W) = − ∑_{A∈A(W)} p(A) ld p(A)   (10)

For comparison of different dictionaries it is necessary to calculate a single value for the entire dictionary rather than separate ones for each model. This is done by normalizing the entropy to the (maximal) entropy of an equally distributed model of the same size. We call this normalized entropy the variant consistency of the word model:

C_A^N(W) = 1 + ( ∑_{A∈A(W)} p(A) ld p(A) ) / ld |A(W)| ,   C_A^1(W) = 1   (11)

where the superscript denotes the number of variants |A(W)|; for a model with a single variant the consistency is defined as 1. Then we calculate the average variant consistency of all word models:

C_A(L) = (1/|L|) ∑_{W∈L} C_A(W)   (12)

The consistency value varies from 0 ("inconsistent") to 1 ("consistent") and is a crucial measure for the supervision of a data-driven dictionary training [4]. In a successful training session the consistency passes through a minimum and then, with an increasing number of learning examples, stabilizes at a value which is significantly greater than the minimum (ref. Figure 7). Since the consistency is derived from the entropy, the entropy correspondingly passes through a maximum (ref. Figure 6). Because the number of variants represented by a word model increases throughout the training process, a rising consistency can only be explained by an increasing polarization of the probabilities of the variants: a few variants cover the major part of the probability mass while the majority of the variants have negligible probabilities. We call this effect consolidation of word models. This consolidation effect is the basis for the selection of pronunciation variants.
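A minimal sketch of how the variant entropy and consistency of Eqs. (10)-(12) could be computed (the word labels and probability values below are invented for illustration; `ld` denotes the binary logarithm):

```python
import math

def variant_entropy(probs):
    """H_A(W) = -sum p(A) ld p(A), with ld = log2 (Eq. 10)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def variant_consistency(probs):
    """C_A(W) per Eq. (11); defined as 1 for a single-variant model."""
    if len(probs) <= 1:
        return 1.0
    return 1.0 - variant_entropy(probs) / math.log2(len(probs))

# Invented variant distributions: a "consolidated" model (polarized
# probabilities -> consistent) vs. an equally distributed one.
dictionary = {
    "consolidated": [0.85, 0.05, 0.05, 0.05],
    "uniform":      [0.25, 0.25, 0.25, 0.25],
}

# Eq. (12): average variant consistency over all word models.
C_avg = sum(variant_consistency(p) for p in dictionary.values()) / len(dictionary)
```

The equally distributed model reaches the maximal entropy and thus consistency 0, while the polarized model lies strictly between 0 and 1, mirroring the consolidation effect described above.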

2.2.3. Grapheme to Phoneme Consistency

The information theoretical measures described above do not provide information on how consistent the phonetic transcriptions are across word models. In order to measure this kind of consistency we perform a grapheme-to-phoneme alignment for all pronunciation variants in the dictionary. An approach which performs this alignment is described in [2]. Except for a few technical details it resembles the joint multigram cosegmentation approach described in [12]. The cosegmentation yields an alphabet combined from minimal phoneme strings ã and minimal grapheme strings õ. We call the symbols of this alphabet graphones. The size of this alphabet, its entropy

H_{õ,ã} = − ∑_{õ∈Õ, ã∈Ã} p(õ, ã) ld p(õ, ã)   (13)

and its mutual information

I_{õ,ã} = ∑_{õ∈Õ} ∑_{ã∈Ã} p(õ, ã) ld [ p(õ, ã) / (p(õ) p(ã)) ]   (14)

are measures for the "overall" consistency of the grapheme-to-phoneme mappings in the dictionary. From (13) and (14) we derive a grapheme-to-phoneme (G2P) consistency measure (similar to C_A, see equation 11):

C_{õ,ã} = I_{õ,ã} / H_{õ,ã}   (15)

Section 3.3 shows how the G2P consistency can be used to compare dictionaries of different type and language.

2.3. Performance Measures

Of course, the goal of lexicon optimization is to decrease the word error rate (WER) of the recognizer. So the WER seems to be an appropriate evaluation criterion for pronunciation dictionaries. However, in our experiments we found that adding variants to the lexicon is problematic because the recognition system cannot easily cope with an arbitrary number of pronunciation variants. Therefore we were looking for alternative measures to evaluate lexica. This section describes criteria for measuring the performance of a pronunciation dictionary that can be measured on the dictionary itself and which do not require a recognizer pass. There are two competing aspects of the performance: the completeness and the confusability.

2.3.1. Completeness

By the term completeness we denote the ability of a dictionary to predict pronunciations in a defined test set that was not used for training. More precisely, we define the completeness as the relative number of word pronunciations contained in a test set which are correctly predicted by the dictionary. Let A_T be the set of pronunciations contained in a test set T and let A_L be the set of pronunciation variants contained in the dictionary L. Then we define the completeness of L related to T as follows:

N_c(L, T) = |A_T ∩ A_L| / |A_T|   (16)¹

When supervising a dictionary generation process, the completeness can be utilized to measure the quality of the dictionary by evaluating the prediction of pronunciations contained in a test set. We will give an example of a practical application of the completeness in section 3.1.

2.3.2. Confusability

A better completeness will in general result in a more confusable dictionary [4]. A certain risk of confusion between the word models is introduced whenever a pronunciation variant occurs in more than one model. To quantify this effect we can employ the overall risk criterion. It is defined as the sum, over all pronunciation variants, of the likelihoods of deciding for the wrong word model given an ambiguous pronunciation variant (provided no further source of knowledge, e.g. a language model, is used):

p_conf(L) = ∑_{A∈A(L)} ( ∑_{W_i∈L} p(A|W_i) p(W_i) − p(A|W_a) p(W_a) )   where   W_a = argmax_{W_i∈L} p(A|W_i) p(W_i)   (18)

A is a certain variant from the global set of variants A(L) and W_i is a certain word model in the dictionary L. p_conf is the sum of the confusion probabilities of all variants in the dictionary and thus the overall confusion risk. In section 3.1 the confusability measure will be applied to the selection of pronunciation variants.

2.4. Other Measures

A commonly used criterion for measuring the quality of a dictionary is the word error rate or the word accuracy [13]. There exist other approaches for dictionary evaluation not mentioned in this paper. Williams and Renals [9] investigated confidence measures for automatically trained pronunciation dictionaries on the acoustic level in order to measure the quality of the trained variants. Lüngen [6] uses a rule-based approach for building lexica and defines the consistency of a dictionary in terms of coverage for a given test corpus and with respect to multilingual word lists. Another possibility is to compute rule application probabilities and use them for the selection of variants [14].
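Before moving on, the two performance measures can be sketched in code. The pronunciations, probabilities and the uniform prior p(W) below are illustrative assumptions, not data from the paper; per the authors' footnote to Eq. (16), the test set may contain repeated pronunciations, hence the multiset counting:

```python
from collections import Counter

def completeness(test_pronunciations, lexicon_variants):
    """N_c(L,T) = |A_T ∩ A_L| / |A_T|, with A_T treated as a multiset (Eq. 16)."""
    test = Counter(test_pronunciations)
    hits = sum(n for a, n in test.items() if a in lexicon_variants)
    return hits / sum(test.values())

def confusability(lexicon):
    """p_conf(L): probability mass lost to wrong-word decisions.

    `lexicon` maps word -> {variant: p(A|W)}; p(W) is assumed uniform here.
    """
    p_w = 1.0 / len(lexicon)
    variants = {a for model in lexicon.values() for a in model}
    risk = 0.0
    for a in variants:
        scores = [model.get(a, 0.0) * p_w for model in lexicon.values()]
        risk += sum(scores) - max(scores)   # mass of all but the best word
    return risk

# Invented two-word lexicon in which the variant "hOYt" is ambiguous.
lexicon = {
    "heute": {"hOYt@": 0.8, "hOYt": 0.2},
    "heut":  {"hOYt": 1.0},
}
nc = completeness(["hOYt@", "hOYt@", "hUt@"], {"hOYt@", "hOYt"})
pc = confusability(lexicon)
```

Only the shared variant "hOYt" contributes to the confusion risk; a lexicon without shared variants would yield p_conf = 0.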

3. Some Dictionary Anatomy

In this section we will give examples of how the introduced measures can be applied to various tasks. In the experiments described below we use the following databases:

¹ The equation was simplified for better legibility. A_T may contain multiple identical instances of a particular pronunciation. The ∩ operator preserves the individuality of those instances.

Database     Description
Verbmobil    Spontaneous speech with orthographic transcription (4639 words; 168930 training samples; avg. 18.3 variants per word)
Phondat II   Read speech with orthographic transcription (194 words; 7130 training samples; avg. 19.3 variants per word)
Celex        Pronunciation dictionaries without variants for several languages, including
             Dutch (NL; 117939 words; canonical),
             English (EN; 45680 words; canonical),
             German (DE; 51729 words; canonical)

Table 1: Databases used in experiments

3.1. Dictionary evaluation – completeness and variant consistency

First we take a closer look at dictionaries generated using our data-driven approach (see [4] for details). They were trained using the Verbmobil training database (see Table 1). To get an impression of which word models perform well and which do not, we measured the completeness depending on the length of the word in graphones (ref. 2.2.3) and the number of samples in the training set (Figure 2). Every sample in the training set adds variants to the word model. As one would expect, the more samples, the better the completeness. This effect can be observed independently of the length of the word. The completeness is high as long as a certain number of training samples is available!

Figure 2: Completeness (N_c) depending on length in graphones and number of training samples

Besides the completeness, the consistency is an important measure for the quality of a dictionary (Figure 3). A high consistency is achieved for long words having many training samples and for short words having only a few samples. In the first case the word models consolidate during the training (ref. 2.2.2), that is, only a few variants cover most of the probability mass. In the latter case the few samples generate only a few variants, which leads to a high variant consistency, but the performance of those models is poor. Since both measures are defined between zero and one, completeness and variant consistency can be multiplied. This logical "and" combination isolates models having a high consistency and good performance. As we expect, long words which are frequent perform best under this condition. However, most long words are less frequent in natural speech and we run into the typical sparse data problem of data-driven dictionary training. Short words are problematic even if they have a high number of samples, due to the high number of equally probable variants. In [2] we suggested a way out of this dilemma by using an adapted word list. The goal is to collect short words into multiwords and to split long ones into parts of words. Doing so increases the number of word models which can be successfully trained.

Figure 3: Consistency (C_A) depending on length in graphones and number of training samples

3.2. Selection of pronunciation variants – completeness and relevance versus confusability

Now we focus on the selection of pronunciation variants from over-generated dictionaries. In [4] we showed that only a significant number of pronunciation variants results in an improvement of the completeness of a dictionary.

Figure 4: Completeness (N_c) and confusability (p_conf) for different degrees of variant selection (variants per word, N_A)

But, as one could expect, more pronunciation variants lead to a higher confusability of the word models (see 2.3.2). This effect is often used as an explanation of why speech recognizers get confused by numerous pronunciation variants in the lexicon. But are completeness and confusability actually inseparably related? In an experiment with a variant dictionary trained on the Verbmobil database (see Table 1) we removed all confusable pronunciations from the dictionary (rightmost bars in Figure 4). Even though the confusability is zero, the completeness is still equivalent to that of a dictionary with 2.5 pronunciations per word on average. However, a fairly high number of variants per model is then required to achieve a good completeness. So the experiment also shows that confusable variants are a non-negligible phenomenon of natural speech. Once again: ignoring statistically relevant information should not be a means to improve a recognizer. If desired, confusable variants can be removed from a variant dictionary without a tremendous loss of completeness, but the lexicon is not the right place to cope with pronunciation confusability.


Figure 5: Relevance (r) of a dictionary depending on the number of selected variants

Figure 5 shows the influence of the selection of pronunciation variants on the relevance (ref. 2.1.2) of the remaining dictionary in the same experiment. The values in the boxes are the average numbers of variants per word.

3.3. Consistency across dictionary types and languages – G2P consistency

Now we have a look at the overall G2P consistency of different dictionaries. Table 2 shows the consistency C_{õ,ã} calculated from the entropy H_{õ,ã} and the mutual information I_{õ,ã} (ref. 2.2.3) for dictionaries of different languages. The German dictionaries Phondat II and Verbmobil are listed in two forms. The first column (can.) is the canonical dictionary. The second column (var.) is the variant dictionary generated using the training procedure mentioned above. For canonical dictionaries (without any pronunciation variants), the G2P consistency can be interpreted as a language-specific property which quantifies the regularity of the "standard" word pronunciations. Of course this value may vary a bit depending on the vocabulary. For German (DE) and Dutch (NL), the canonical G2P consistency is about 0.75; for English (EN) it is significantly less (about 0.65). When a dictionary contains pronunciation variants, the G2P consistency indicates the uniformity of pronunciation

variation. For instance we observed a significant difference between a dictionary trained with read speech (col. 3 – Phondat II / var.) and one trained with spontaneous speech (col. 5 – Verbmobil / var.).

           Phondat II (DE)    Verbmobil (DE)    Celex (DE)  Celex (EN)  Celex (NL)
           can.     var.      can.     var.     can.        can.        can.
H_{õ,ã}    5.55     6.69      5.54     7.90     5.34        6.11        5.56
I_{õ,ã}    4.28     4.04      4.05     2.93     4.30        4.04        4.16
C_{õ,ã}    0.77     0.60      0.73     0.37     0.80        0.66        0.75

Table 2: G2P regularity for different dictionaries (ASCII grapheme set, intl. SAMPA phoneme alphabet)
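As a cross-check, the consistency values in Table 2 follow directly from Eq. (15), C = I/H; the entropy and mutual information figures below are transcribed from the table:

```python
# Reproducing the G2P consistency C = I/H (Eq. 15) from the
# entropy H and mutual information I tabulated in Table 2.
table2 = {
    #  column               (H,    I)
    "Phondat II DE can.": (5.55, 4.28),
    "Phondat II DE var.": (6.69, 4.04),
    "Verbmobil DE can.":  (5.54, 4.05),
    "Verbmobil DE var.":  (7.90, 2.93),
    "Celex DE can.":      (5.34, 4.30),
    "Celex EN can.":      (6.11, 4.04),
    "Celex NL can.":      (5.56, 4.16),
}
consistency = {name: i / h for name, (h, i) in table2.items()}
```

The ratios reproduce the tabulated consistencies, e.g. the drop from 0.73 (Verbmobil canonical) to 0.37 for the spontaneous-speech variant dictionary.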

4. Supervising data-driven dictionary training – entropy, perplexity and consistency

As mentioned above, our original intention in developing quality measures for dictionaries was to control a data-driven dictionary training process. In the following we give an example of how these measures can be applied for supervision of the training process. The training algorithm is based on incremental graph training techniques and is described in [4] and [2]. The working principle of the training procedure is to enlarge an initial canonical dictionary with pronunciation variants found in the speech corpus. As mentioned above, our training corpora consist of speech signals of utterances and their orthographic transcriptions. The algorithm does not only list seen pronunciation variants but is also capable of extrapolating unseen variants and estimating probabilities for them. During the training, we analyze the dictionary using the introduced measures.


Figure 6: Entropy during dictionary training using read speech

In Figure 6 the development of phoneme entropy, phoneme perplexity and variant entropy (ref. 2.2.2) during a training experiment using read speech (Phondat II) is shown. The consolidation of word models can clearly be seen in the variant entropy, whereas the phoneme-based measures do not indicate this effect. From this we conclude that a consistent set of pronunciations is not necessarily connected with a pronunciation network of low perplexity. For the supervision of our dictionary training algorithm we use variant-based measures only. The consolidation of variants cannot be observed in the


average over all word models when we are using spontaneous speech (Verbmobil data base). To find the reason we took a closer look at the trained dictionaries. Although the method works fine for frequent words, it does not generate satisfying pronunciation models for infrequent words. So we decided to exclude all word models from training that have less than a minimal number of samples in the training set and keep the canonical pronunciation for those models.


Figure 7: Progression of consistency during training depending on the minimal number of required samples per word model

Figure 7 illustrates the progression of consistency during training using the Verbmobil database. One curve shows the consistency with no restriction, the other the consistency with a limitation to 40 or more training samples. For a minimal number of 10 samples the consolidation effect can already be observed.
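A supervision criterion of the kind described above could be sketched as follows; the stabilization test, its thresholds, and the sample curve are illustrative assumptions, not the authors' procedure:

```python
def training_consolidated(consistency_curve, margin=0.05, window=3):
    """True if the average-consistency curve fell to a minimum and then
    stabilised at a level clearly above it (the consolidation pattern
    of Figure 7)."""
    if len(consistency_curve) < window + 1:
        return False
    minimum = min(consistency_curve)
    tail = consistency_curve[-window:]
    stable = max(tail) - min(tail) < margin / 2   # plateau reached
    recovered = min(tail) > minimum + margin      # clearly above the minimum
    return stable and recovered

# Invented curve shaped like a successful run: drop, minimum, recovery, plateau.
curve = [0.9, 0.6, 0.4, 0.35, 0.45, 0.55, 0.60, 0.61, 0.61]
```

A monotonically falling curve, by contrast, would indicate that the word models have not (yet) consolidated, e.g. because too few samples per model are available.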

5. Conclusion

Experiments with a number of evaluation criteria for variant dictionaries show that, even though obtaining relevant and consistent pronunciations can be problematic, we are able to generate such dictionaries. Evaluation measures can successfully be employed for the supervision of dictionary generation and for post-processing (selection of variants). Our experiments show: the information in a variant dictionary is not per se inconsistent or confusing! So why exactly does a significant amount of pronunciation variants compromise the performance of speech recognizers? We cannot give a satisfying answer to this question, but we have collected plenty of evidence that the variant dictionary is most likely not the only reason. In our current research we focus on systematically tracking recognizer errors in the framework of an integrated, SMG-based speech recognizer and synthesizer [5]. This approach provides strict modularization and special capabilities (such as speech synthesis as a "reverse" pass of the recognizer) which will hopefully help us to better understand failures in the recognition process.

6. Acknowledgements

This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG) under grants no. HO1674/3 and HO1674/7. The responsibility for the text lies with the authors.

7. References

[1] Strik, H.: "Pronunciation adaptation at the lexical level", Proc. of the ISCA Workshop on Adaptation 2002, Sophia Antipolis (France).
[2] Wolff, M.; Eichner, M.; Hoffmann, R.: "Improved Data-Driven Generation of Pronunciation Dictionaries Using an Adapted Word List", Proc. Eurospeech 2001, Aalborg (Denmark), pp. 1433-1436.
[3] Waibel, A.; et al.: "Multilingual Speech Recognition", in: Wahlster, W. (ed.): Verbmobil: Foundations of Speech-to-Speech Translation, Berlin etc.: Springer, 2000, pp. 33-45.
[4] Eichner, M.; Wolff, M.: "Data driven generation of pronunciation dictionaries in the German Verbmobil project – discussion of experimental results", Proc. ICASSP 2000, Istanbul, pp. 1687-1690.
[5] Eichner, M.; Wolff, M.; Hoffmann, R.: "A unified approach for speech synthesis and speech recognition using Stochastic Markov Graphs", Proc. ICSLP 2000, Beijing, vol. 1, pp. 701-704.
[6] Gibbon, D.; Lüngen, H.: "Speech Lexica and Consistent Multilingual Vocabularies", in: Wahlster, W. (ed.): Verbmobil: Foundations of Speech-to-Speech Translation, Berlin etc.: Springer, 2000, pp. 296-310.
[7] Beulen, K.; et al.: "Pronunciation Modelling in the RWTH Large Vocabulary Speech Recognizer", Proc. of the Workshop on Pronunciation Modelling for ASR, Rolduc (The Netherlands), 1998, pp. 13-16.
[8] Yang, Q.; Martens, J.-P.: "Data-driven Lexical Modeling of Pronunciation Variations for ASR", Proc. ICSLP 2000, Beijing, paper 00750.
[9] Williams, G.; Renals, S.: "Confidence Measures for Evaluating Pronunciation Models", Proc. of the Workshop on Modeling Pronunciation Variation for ASR, Rolduc (The Netherlands), 1998, pp. 151-155.
[10] Westendorf, C.-M.; Wolff, M.: "Automatische Generierung von Aussprachewörterbüchern aus Signaldaten" (Automatic generation of pronunciation dictionaries from signal data), Proc. of Konvens 98, Verlag P. Lang, 1998, pp. 213-225.
[11] Chase, L.: "Blame assignment for errors made by large vocabulary speech recognizers", Proc. Eurospeech 1997, Rhodes (Greece), pp. 1563-1566.
[12] Deligne, S.; Yvon, F.; Bimbot, F.: "Variable-Length Sequence Matching for Phonetic Transcription Using Joint Multigrams", Proc. Eurospeech 1995.
[13] Sloboda, T.: "Dictionary learning: Performance through consistency", Proc. ICASSP 1995.
[14] Lehtinen, G.; Safra, S.: "Generation and selection of pronunciation variants for a flexible word recognizer", Proc. of the ESCA Workshop on Modeling Pronunciation Variation for ASR, Rolduc (The Netherlands), May 1998, pp. 67-71.