PASHTO SPEECH RECOGNITION WITH LIMITED PRONUNCIATION LEXICON

Rohit Prasad, Stavros Tsakalidis, Ivan Bulyko, Chia-lin Kao, Prem Natarajan
BBN Technologies, 10 Moulton St., Cambridge, MA 02138
{rprasad,stavros,ibulyko,ckao,prem}@bbn.com

ABSTRACT

Automatic speech recognition (ASR) for low-resource languages continues to be a difficult problem. In particular, colloquial dialects of Arabic, Farsi, and Pashto pose significant challenges in pronunciation dictionary creation. Therefore, most state-of-the-art ASR engines rely on the grapheme-as-phoneme approach for creating pronunciation dictionaries in these languages. While the grapheme approach simplifies ASR training, it performs significantly worse than a system trained with a high-quality phonetic dictionary. In this paper, we explore two techniques for bridging the performance gap between the grapheme and phonetic approaches without requiring manual pronunciations for all the words in the training data. The first approach learns letter-to-sound rules from a small set of manual pronunciations in Pashto, and the second uses a hybrid phoneme/grapheme representation for recognition. Through experimental results on colloquial Pashto, we demonstrate that both techniques perform as well as a full phonetic system while requiring manual pronunciations for only a small fraction of the words in the acoustic training data.

Index Terms— speech recognition, Pashto, grapheme-as-phoneme, HMM, decision tree

1. INTRODUCTION

Many dialog applications that use automatic speech recognition (ASR) must deal with colloquial dialects rather than the formal form(s) of a spoken language. One such application is speech-to-speech translation [1,2,3,4], which, to be broadly useful, must work for the local population in different countries. However, most of the local population converses in colloquial dialects that are usually very different from the formal form of the language. Such colloquial dialects pose several challenges for developing an ASR engine. First, there is often no standard written form for these dialects, which leads to ambiguities in the orthography of words during audio transcription. The second challenge is the creation of pronunciation dictionaries for these dialects.

Most ASR engines use phones as the units for acoustic modeling, and each word in the recognition lexicon is manually spelled with these phones. However, skilled acoustic-phoneticians for low-resource dialects are few, so manual creation of phonetic spellings for large-vocabulary ASR is usually impractical. Phonetic dictionary creation is even more difficult for languages whose writing system uses the Arabic script. In most colloquial dialects of such languages, which include Arabic, Farsi, Dari, and Pashto, short vowels are usually not written in the orthography, which makes the pronunciation of words ambiguous.

Given these challenges, most state-of-the-art ASR systems use the "grapheme-as-phoneme" [5] approach to dictionary creation for such languages. In this approach, the pronunciation of a word is derived directly from its orthography by using the constituent characters/graphemes as phones. The grapheme approach has several advantages: (1) it automates the dictionary creation process, thereby simplifying ASR training; (2) it does not suffer from inter-annotator differences in manual pronunciation creation; and (3) the addition of new vocabulary at runtime can be fully automated. For Arabic dialects in particular, the grapheme approach has emerged as a promising way to mitigate the inherent ambiguity introduced by the absence of short vowels. Researchers have also explored automatic diacritization based on morphological analysis [6]. However, most of these morphological rules are created for the formal form of the language and break down on colloquial dialects. Therefore, the grapheme approach is usually still better than automatic diacritization for colloquial dialects.

Although the grapheme approach has several advantages, recognition performance with graphemes-as-phonemes is significantly worse than with a high-quality, manually created phonetic dictionary. In this paper, we explore techniques that reduce the gap between grapheme and full phonetic systems while using manual pronunciations for only a small fraction of words. Specifically, we investigate two different techniques for developing a recognizer for colloquial Pashto. The first technique uses a modified version of the text-to-phoneme (T2P) tool [7], a decision-tree approach that learns letter-to-sound rules from a small set of manual pronunciations.

The standard version of T2P has serious limitations for languages in which the number of letters/graphemes exceeds the number of phones. Here, we describe a novel approach for extending T2P to handle such languages, which include Pashto. The second technique uses a hybrid phoneme/grapheme recognition approach, similar to the one described in [8].

2. PASHTO DATA AND PHONETIC REPRESENTATION

2.1. Pashto Corpus Description

The corpus used in our experiments is a collection of two-way monolingual Pashto dialogs and interpreter-mediated dialogs between a Pashto speaker (respondent) and an English speaker (interviewer). The audio data spans a wide range of scenarios, including checkpoint patrols, civil affairs, medical interviews, and facility inspections. The audio was segmented and transcribed by Appen Pty Ltd. We preprocessed the audio data and transcriptions from Appen to eliminate segments with transcriptions unsuitable for acoustic or language model training, e.g. unintelligible speech, long pauses, overlapping speech, or foreign speech. Next, we divided the speakers and data into two sets: a 34-hour training set (400K total, 10K unique words) and a 2-hour test set (26K total words). We use the Pashto speech from both respondents and interpreters for acoustic training, but only the respondents' speech in the test set.

2.2. Phonetic Representation for Pashto ASR

Pashto is an Indo-European language spoken primarily in Afghanistan and northwestern Pakistan, and is one of the two official national languages of Afghanistan. The Pashto alphabet is a modified form of the Arabic alphabet with extra letters added for Pashto-specific sounds; it therefore contains several letters that do not appear in any other Arabic-script language. In total, the Pashto alphabet consists of 46 letters. Also, just as in Arabic dialects, many diacritics are omitted in most standard Pashto writing.

In addition to the acoustic training data, Appen also provided us with manual pronunciations for approximately 10K words. These pronunciations were created using the SAMPA representation [9]. A set of 42 phonemes, including 9 vowels, was used for creating the manual pronunciation dictionary.

3. APPROACHES FOR DICTIONARY CREATION

3.1. Grapheme-as-Phoneme

We developed a grapheme-as-phoneme [5] mapping based on the orthography of the words in the Pashto data described earlier. We used a modified Buckwalter transliteration system to create romanized forms of the Pashto letters. A total of 34 phones were derived from the graphemes after romanization. Because several letters map to the same sound, the total number of graphemes is smaller than the total number of letters in the Pashto alphabet. In Table 1, we compare the phone sets used for the phonetic and grapheme representations; a minimal sketch of the grapheme mapping follows the table.

Representation   Pashto sounds   Nonspeech   Total phonemes
Phonetic              42             3             45
Grapheme              34             3             37

Table 1. Pashto phoneme and grapheme representations.
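To make the grapheme-as-phoneme approach of Section 3.1 concrete, the following is a minimal Python sketch of grapheme dictionary creation. The romanization table is a small hypothetical excerpt, not the paper's actual modified Buckwalter mapping; the point is that pronunciation derivation is fully automatic, and that merging several letters into one grapheme symbol is what reduces 46 letters to 34 grapheme phones.

    # Minimal sketch of grapheme-as-phoneme dictionary creation (Section 3.1).
    # The romanization table is a small hypothetical excerpt; the actual
    # modified Buckwalter mapping covers all 46 Pashto letters.
    ROMANIZATION = {
        "\u0627": "A",  # alef
        "\u0628": "b",  # be
        "\u062A": "t",  # te
        "\u067C": "t",  # Pashto tte; merged with te here (hypothetical)
        "\u0633": "s",  # sin
        "\u062B": "s",  # se; merged with sin (hypothetical)
    }

    def grapheme_pronunciation(word):
        """Derive a pronunciation directly from the orthography, using one
        grapheme 'phone' per letter and skipping unmapped characters."""
        return [ROMANIZATION[ch] for ch in word if ch in ROMANIZATION]

    def build_grapheme_lexicon(words):
        """Fully automatic dictionary creation: no manual phonetic
        spellings and no inter-annotator variability."""
        return {w: grapheme_pronunciation(w) for w in words}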

3.2. Learning Text-to-Phoneme Mappings

Our approach to text-to-phoneme (T2P) conversion is based on the set of public-domain tools from CMU [7]. Training T2P models with the CMU tools is performed in three steps:

1. Align letters to phonemes in the training dictionary.
2. Extract contextual features from the alignments.
3. Train a decision tree using the contextual features.

We discovered serious limitations in the alignment step of the standard T2P tool. Specifically, the standard alignment process can only handle word-pronunciation pairs in which the number of letters is greater than or equal to the number of phones, allowing no more than one phone to be aligned to a given letter. While this may be acceptable for most English words, it does not work for many other languages, including Pashto. Therefore, we implemented a new alignment algorithm that overcomes the limitations of the standard T2P tool. The algorithm uses iterative, expectation-maximization (EM) style optimization to find the alignments that best describe the training dictionary. It has the following key steps (a sketch in code follows the list):

1. Initialization
   a. Set P(phone = p | letter = l) = (number of dictionary pairs containing both l and p) / (number of words containing l)
   b. Set P(deletion | l) = 0.1
2. Iterate until convergence
   a. Find the best path (according to the current model) through the [letters × phones] grid using dynamic programming, allowing any number of phones per letter
   b. Update P(p | l) = (number of aligned pairs (p, l)) / (total number of l)
   c. Update P(deletion | l) = (number of unaligned l) / (total number of l)
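The paper gives this algorithm only in outline; the Python sketch below is one straightforward reading of those steps, producing the alignments consumed by the feature-extraction and decision-tree steps of the CMU pipeline. The smoothing floor FLOOR, which keeps unseen letter-phone pairs alignable and the logarithms defined, is our own addition.

    import math
    from collections import defaultdict

    FLOOR = 1e-6  # assumed smoothing floor (our addition) so log() stays defined

    def initialize(dictionary):
        """Step 1: P(phone|letter) = (number of dictionary pairs containing
        both l and p) / (number of words containing l); P(deletion|l) = 0.1.
        Each dictionary entry is a (letters, phones) pair of symbol lists."""
        cooc = defaultdict(lambda: defaultdict(float))
        words_with = defaultdict(int)
        for letters, phones in dictionary:
            for l in set(letters):
                words_with[l] += 1
                for p in set(phones):
                    cooc[l][p] += 1.0
        emit = {l: {p: c / words_with[l] for p, c in ps.items()}
                for l, ps in cooc.items()}
        return emit, {l: 0.1 for l in words_with}

    def best_alignment(letters, phones, emit, delete):
        """Step 2a: dynamic program over the [letters x phones] grid. Cell
        (i, j) holds the best log-probability of aligning the first i letters
        to the first j phones; a letter may take zero (deletion) or any
        number of phones. Returns one phone span per letter."""
        L, P = len(letters), len(phones)
        score = [[float("-inf")] * (P + 1) for _ in range(L + 1)]
        back = [[0] * (P + 1) for _ in range(L + 1)]
        score[0][0] = 0.0
        for i in range(1, L + 1):
            l = letters[i - 1]
            for j in range(P + 1):
                # Option 1: the letter is deleted and consumes no phones.
                best, back[i][j] = score[i - 1][j] + math.log(delete.get(l, FLOOR)), j
                # Option 2: the letter consumes phones k..j-1 (one or more).
                s = 0.0
                for k in range(j - 1, -1, -1):
                    s += math.log(emit.get(l, {}).get(phones[k], FLOOR))
                    if score[i - 1][k] + s > best:
                        best, back[i][j] = score[i - 1][k] + s, k
                score[i][j] = best
        spans, j = [], P  # trace back the per-letter phone spans
        for i in range(L, 0, -1):
            spans.append(phones[back[i][j]:j])
            j = back[i][j]
        return list(reversed(spans))

    def em_align(dictionary, iterations=10):
        """Step 2: alternate best-path alignment with count-based updates of
        P(p|l) and P(deletion|l) until (near) convergence."""
        emit, delete = initialize(dictionary)
        for _ in range(iterations):
            pair = defaultdict(lambda: defaultdict(float))
            unaligned, total = defaultdict(float), defaultdict(float)
            for letters, phones in dictionary:
                for l, span in zip(letters, best_alignment(letters, phones, emit, delete)):
                    total[l] += 1.0
                    if span:
                        for p in span:
                            pair[l][p] += 1.0
                    else:
                        unaligned[l] += 1.0
            emit = {l: {p: c / total[l] for p, c in ps.items()}
                    for l, ps in pair.items()}
            delete = {l: max(unaligned[l] / total[l], FLOOR) for l in total}
        return emit, delete

Because a letter may consume any number of phones, this formulation handles words with more phones than letters, which the standard T2P aligner cannot.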
3.3. Hybrid Phoneme/Grapheme

In the hybrid phoneme/grapheme approach, we model each word during recognition with two different phone sequences. The first is manually created using the phonetic representation described in Section 2.2; the other uses the grapheme-as-phoneme representation. In training, we assume independence between the phoneme representation, 'P', and the grapheme representation, 'G', and train two different sets of context-dependent HMMs. Words that do not have any manually created pronunciation are spelled with just the grapheme-derived phones. During recognition, we can also use pronunciation probabilities to weight the grapheme and phoneme pronunciations differently. The pronunciation probabilities can be estimated using the approaches described in [10]; in this paper, we used unigram pronunciation probabilities.
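As an illustration, here is a minimal sketch of how such a hybrid recognition dictionary might be assembled. The 'g_' prefix keeping the two phone inventories disjoint and the add-one fallback for unseen variants are our own assumptions; the unigram pronunciation-probability estimation of [10] is more elaborate.

    from collections import Counter

    def hybrid_lexicon(manual_prons, all_words, grapheme_pron, variant_counts):
        """Sketch of hybrid phoneme/grapheme dictionary assembly (Section 3.3).

        manual_prons   : word -> manually created phonetic spelling
                         (available for only a fraction of the words)
        grapheme_pron  : function deriving a grapheme spelling from the
                         orthography (as in Section 3.1)
        variant_counts : Counter of (word, tag) occurrences in the training
                         transcripts, tag in {'P', 'G'}, used for unigram
                         pronunciation probabilities
        """
        lexicon = {}
        for word in all_words:
            # The grapheme spelling exists for every word; the 'g_' prefix
            # keeps the G phone set disjoint from the P phone set so that
            # two independent sets of context-dependent HMMs can be trained.
            variants = [("G", ["g_" + g for g in grapheme_pron(word)])]
            if word in manual_prons:
                variants.append(("P", manual_prons[word]))
            # Unigram pronunciation probabilities as relative frequencies,
            # with an add-one fallback for unseen variants (our assumption).
            total = sum(variant_counts[(word, t)] + 1 for t, _ in variants)
            lexicon[word] = [(pron, (variant_counts[(word, t)] + 1) / total)
                             for t, pron in variants]
        return lexicon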
4. EXPERIMENTAL RESULTS

In this section, we present recognition experiments on the Pashto corpus that compare the different methods for pronunciation dictionary creation. All recognition experiments used a three-pass recognition strategy in the BBN Byblos recognizer [10]. The first pass, referred to as the forward pass, uses context-dependent triphones with state-tied mixture (STM) parameter tying and a bigram language model (LM). The second pass, referred to as the backward pass, operates on the lattice from the forward pass using context-dependent quinphones with a state-clustered tied mixture (SCTM) configuration for the acoustic models and a trigram LM. The output of the backward pass is a lattice or an n-best list. The third and final recognition pass, referred to as the rescoring pass, uses SCTM models trained with cross-word quinphones to re-rank the n-best list produced by the backward pass. All acoustic models in the results described in this section were trained using maximum likelihood estimation (MLE). LM training used a total of 700K words from the Pashto transcriptions and translations available in the corpus provided by Appen.

4.1. Assessment of Text-to-Phoneme Conversion

Our first experiments were designed to compare the quality of the pronunciations produced by the standard T2P tool and by the modified version with the improved alignment algorithm. We used a set of 10K manually created word pronunciations to perform the comparison under two operating conditions. In the first condition, we used 1K manually created word pronunciations for training and 9K for testing. In the second, we divided the 10K words equally into two sets of 5K each. Table 2 shows the percentage of words for which the predicted pronunciation is identical to the corresponding reference, i.e. the manual pronunciation. From Table 2, we conclude that our updates to the T2P tool outperform the standard tool by a factor of 2 to 3 in prediction accuracy. On analysis of the pronunciation errors made by the modified T2P tool, we found that most errors are single-phone variations in the phonetic string.

Therefore, we adopted the improved approach for the subsequent experiments that rely on automatically created phonetic pronunciations.

            Train                        Test
#Wds    T2P    Mod. T2P     #Wds    T2P    Mod. T2P
1K      36%    98%          9K      12%    22%
5K      29%    96%          5K      14%    42%

Table 2. Percentage of words for which the predicted pronunciation is identical to the reference pronunciation.
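The metric behind Table 2 is plain exact-match accuracy over the held-out words; a minimal sketch, assuming each pronunciation is a list of phone symbols:

    def exact_match_accuracy(predicted, reference):
        """Percentage of words whose predicted pronunciation is identical,
        phone for phone, to the manual reference (the Table 2 metric)."""
        words = list(reference)
        hits = sum(predicted.get(w) == reference[w] for w in words)
        return 100.0 * hits / len(words)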

4.2. Recognition Results

In this section we present experimental results comparing the approaches described in Section 3. For a comprehensive investigation, we explored three different scenarios that vary the number of manually created spellings: (1) a low-resource case that simulates having pronunciations for only the 1K most frequent words in the training data, (2) a medium-resource case in which the top 5K words have manual pronunciations, and (3) a full-resource case in which every word has a manual pronunciation. Using this setup, we trained the following systems:

P: Three phoneme-based systems, each estimated from the fraction of the audio transcripts for which every word has a manually created pronunciation.

G: A single grapheme-based system trained over the entire training set.

P+G: As described in Section 3.3, the hybrid phoneme/grapheme recognition dictionary uses two pronunciations (phonetic and graphemic) for words that have a manual pronunciation and just the grapheme-based pronunciation for words that do not. During training, we train two sets of context-dependent HMMs. The first set uses the phonetic representation and is trained from the fraction of the audio transcripts for which every word has a manually created pronunciation. The second set uses the grapheme representation and is trained over the entire training set. Thus, the grapheme-based HMMs for all three training scenarios are estimated from the same amount of data, which ensures that they exploit all available training data.

P+T2P: In this approach, as described in Section 3.2, there is a single set of HMMs that uses only the phonetic representation. For words that do not have manual pronunciations, the letter-to-sound rules learned by the modified T2P tool are used to create pronunciations automatically. Therefore, the HMMs are trained over the entire training set; the only difference between the three P+T2P systems is the fraction of words with manual versus automatic pronunciations.

Table 3 compares the performance, in terms of word error rate (WER), of the systems trained from the various dictionary configurations, as evaluated on the test set.

All results are reported with unsupervised constrained maximum likelihood linear regression (CMLLR) speaker adaptation [11]. Decoding was performed with the same 10K vocabulary, except for system P, where the vocabulary is restricted to the words with manual pronunciations. The out-of-vocabulary (OOV) rate on the test set with the 10K vocabulary is 4%, whereas for system P the OOV rate is 5% with the 5K dictionary and 12% with the 1K dictionary.

As one would expect, the grapheme system (system G, shown in parentheses in the P+G row of Table 3) yields the worst performance (WER of 47.3%) among the systems with the same vocabulary. In contrast, the phoneme system (system P), which uses manual pronunciations for every word, yields a WER of 45.2%, 2.1% absolute lower than the grapheme system.

             # words with manual pronunciations
System     0K           1K       5K       10K (all)
P          -            53.2     46.2     45.2
P+T2P      -            45.7     45.3     45.2
P+G        47.3 (G)     46.8     45.5     45.1

Table 3. %WER of systems trained from various dictionary configurations, as evaluated on the test set. Decoding used the same 10K vocabulary, except for system P, where the vocabulary is restricted to the words with manual pronunciations.

For the low-resource (1K) and medium-resource (5K) scenarios, the P+T2P and P+G systems yield better performance than the phoneme system. In particular, the P+T2P system significantly outperforms the P+G system in the low-resource scenario (WER of 45.7% vs. 46.8%). In the medium-resource scenario, the two systems perform comparably. Note that the P+G system uses pronunciation probabilities to assign different weights to the grapheme and phoneme pronunciations. Figure 1 plots the WER for all the systems.

Figure 1: Plot of WER for different dictionary configurations.

5. CONCLUSIONS AND FUTURE WORK

We presented two innovative techniques for developing an ASR engine for Pashto using limited pronunciation resources. Both approaches significantly reduce the number of words for which manual pronunciations are required, without degrading recognition performance. In particular, the modified algorithm for automatic text-to-phoneme rule creation needs a pronunciation dictionary of only 1K words to be competitive with the full phonetic system, despite a relatively high pronunciation prediction error rate. Analysis of the pronunciation prediction errors shows that most errors are single-phone variations in the complete phonetic string. Therefore, future work will focus on improving the quality of text-to-phoneme prediction and on extending the approach to produce multiple pronunciations.

6. REFERENCES

[1] D. Stallard, et al., "The BBN 2007 Displayless English/Iraqi Speech-to-Speech Translation System," Proc. of Interspeech, Antwerp, Belgium, 2007.
[2] B. Zhou, S. Chen, and Y. Gao, "Constrained Phrase-Based Translation Using Weighted Finite-State Transducers," Proc. of IEEE ICASSP, Philadelphia, PA, 2005.
[3] T. Schultz and A. Black, "Challenges with Rapid Adaptation of Speech Translation Systems to New Language Pairs," Proc. of IEEE ICASSP, Toulouse, France, 2006.
[4] A. Kathol, et al., "Speech Translation for Low-Resource Languages: The Case of Pashto," Proc. of ISCA Interspeech, Lisbon, Portugal, 2005.
[5] J. Billa, et al., "Audio Indexing of Arabic Broadcast News," Proc. of IEEE ICASSP, Orlando, FL, 2002.
[6] B. Xiang, et al., "Morphological Decomposition for Arabic Broadcast News Transcription," Proc. of IEEE ICASSP, Toulouse, France, 2006.
[7] A. Black, K. Lenzo, and V. Pagel, "Issues in Building General Letter to Sound Rules," Proc. of ESCA, 1998.
[8] M. Magimai-Doss, S. Bengio, and H. Bourlard, "Joint Decoding for Phoneme-Grapheme Continuous Speech Recognition," Proc. of IEEE ICASSP, Montreal, Canada, 2004.
[9] SAMPA, http://en.wikipedia.org/wiki/SAMPA
[10] S. Tsakalidis, R. Prasad, and P. Natarajan, "Context-Dependent Pronunciation Modeling for Iraqi ASR," Proc. of IEEE ICASSP, Taipei, Taiwan, 2009.
[11] M. J. F. Gales, "Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition," Computer Speech and Language, vol. 12, pp. 75-98, 1998.