The use of artificial neural networks in the Speech Understanding Model - SUM

Daniel Nehme Müller, Mozart Lemos de Siqueira, and Philippe O. A. Navaux
Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
{danielnm,mozart,navaux}@inf.ufrgs.br

Abstract. Recent neurocognitive research demonstrates how the natural processing of auditory sentences occurs. There is still no adequate human-computer speech interaction, and this constitutes a computational challenge to be overcome. In this direction, we propose a speech comprehension software architecture that represents the flow of this neurocognitive model. In this architecture, the first step is the processing of the speech signal into written words and prosody coding. Afterwards, this coding is used as input for the syntactic and prosodic-semantic analyses. Both analyses are carried out concomitantly and their outputs are matched to verify the best result. The computational implementation applies wavelet transforms to speech signal codification and prosodic data extraction, and connectionist models to syntactic parsing and prosodic-semantic mapping.

1 Introduction

The research on Spoken Language Understanding (SLU) software derives from two joint technologies: Automatic Speech Recognition (ASR) and Natural Language Processing (NLP). These two technologies are complementary: natural language can help speech recognition through syntactic and semantic information, and speech recognition can improve language understanding with contextual information, such as the intonation of the words (prosody) [1]. This work argues that it is possible to unify several computational systems to represent the speech understanding process. Thus, we propose a software architecture, SUM (Speech Understanding Model), based on a neurocognitive model of auditory sentence processing (section 3). Through SUM, we seek a computational representation for speech signal codification, prosody, and syntactic and semantic analysis. SUM is illustrated in figure 1. The wavelet transform is used for signal processing and prosodic codification. The wavelet codification, as well as the connectionist subsystems used for syntactic parsing and for the definition of semantic contexts, is fully described in section 4. Finally, section 5 describes the simulation of the integrated subsystems, and section 6 presents the conclusions of this work.

Fig. 1. The Speech Understanding Model - SUM.

2 Related Work

SLU is the first part of spoken dialog systems [2]. In our work, we apply two parts of SLU systems: the speech recognizer and the language parser. Moreover, to improve this process, we build on recent neurocognitive research that demonstrates the relevance of prosody to the natural processing of auditory sentences. Below we present some related work on speech recognition, language parsing and prosody. Wavelet transforms have been proposed as an alternative to traditional speech analysis methods such as MFCC (Mel Frequency Cepstral Coefficients), which applies the Fourier transform. Some limitations of MFCC have been observed, such as its susceptibility to noise corruption and a signal frame representation that can hold more than one phoneme [3]. Wavelets extract speech signal characteristics by sub-band division [4]. Software to determine semantic context has been the focus of research, despite some work on handmade sentence labeling [5]. This handmade method is expensive and laborious; however, it permits a full and deep semantic analysis [6]. On the other hand, shallow semantic analysis is fully statistical and more robust than deep analysis [6]. Several methods have been developed using the shallow semantic analysis approach. All of them perform classification by distance clustering, based on context determination from sentence structure. Neural networks have been applied to thematic indexing of spoken documents [7]. Kompe [8] pointed out the importance of accent analysis for word recognition and focus in speech context determination. In this direction, Zhang et al. [9] used Kompe's concepts to create an automatic focus kernel extraction and to differentiate pairs of words with similar linguistic structure but different meanings.

3 Neurocognitive Model

Angela Friederici proposes a neurocognitive model of auditory sentence processing that identifies which parts of the brain are activated at each moment, according to the different tests applied. She divided the processing of auditory sentences into four large phases [10]. Indeed, the most recent research indicates that a description of prosody processing must be added to the neurocognitive model [11]. This model is illustrated in figure 2.

Fig. 2. Improved sequence of the neurocognitive model of auditory sentence processing.

In the first phase (phase 0), acoustic characteristic extraction and signal codification are carried out. In this process, the pitch is isolated in the right hemisphere and the affective signal (emotions) is distinguished from the linguistic signal. Thus, the pitch variation (affective signal) defines the prosodic characteristics, which determine the processing segmentation. The linguistic characteristics will be analyzed at the syntactic level by the left hemisphere of the brain during the second phase [10]. The second phase (phase 1) performs the syntactic analysis and occurs only in the left hemisphere of the brain. The syntactic analysis is not affected by prosody or semantics, and it is processed in an independent manner [12]. The syntactic evaluation process occurs where sentence structure errors require corrections without semantic analysis [13]. This means that syntactic structure errors must be corrected before semantic analysis. The semantic analysis is performed in the third phase (phase 2), where there is a query in the word category memory, which can be observed in tests where the sentences had been organized to produce conflicts related to category and gender [12]. The semantic analysis apparently awaits the output of the syntactic analysis in order to solve interpretation problems brought about mainly by the contextualization of word categories. If the sentences are well structured, they will be evaluated by gender, category and semantic context of the involved words [10][13]. In the fourth and last phase (phase 3), the integration among syntax, semantics and prosody takes place, which is necessary to review problems not resolved in the previous phases. The correction of the syntactic structure is necessary when there are problems in the organization of lexical terms [10].

4 The implementation of the Speech Understanding Model - SUM

We have proposed a software architecture based on natural auditory sentence processing to represent Friederici's neurocognitive model [14]. From the four phases described in the neurocognitive model, we propose the architecture of SUM, as illustrated in figure 1. In SUM, the first phase extracts coefficients from the speech signal, obtained through the application of mathematical transforms. These coefficients provide information about the speech signal and are used in the subsequent phases. The second computational phase applies the speech coefficients to carry out the syntactic parsing. In the third phase, the coefficients are used to define semantic contexts: the linguistic and prosodic information embedded in the coefficients is used to verify the similarity with predefined semantic patterns. The fourth phase receives the outputs of the second and third phases; in this phase, the most likely context is indicated as the answer to each analyzed sentence.
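To make the flow of the four phases concrete, the sketch below chains them as plain Python functions. It is only a hypothetical outline: the stub functions and their return values are placeholders, not the SUM implementation detailed in the following subsections.

# Hypothetical outline of SUM's four computational phases; all functions are
# stubs standing in for the wavelet, SARDSRN-RAAM and GHSOM subsystems.
def extract_coefficients(signal):
    # Phase 1: wavelet codification yielding phonetic and prosodic coefficients.
    return [0.12, 0.08, 0.31], [0.02]

def syntactic_parse(phonetic):
    # Phase 2: SARDSRN-RAAM parsing; returns (sentence, error rate).
    return "the boy saw the cat", 0.10

def prosodic_semantic_map(phonetic, prosodic):
    # Phase 3: chained GHSOM maps; returns (sentence, distance to trained pattern).
    return "the boy saw the cat", 1.5

def understand(signal):
    phonetic, prosodic = extract_coefficients(signal)          # phase 1
    syntactic = syntactic_parse(phonetic)                      # phase 2
    semantic = prosodic_semantic_map(phonetic, prosodic)       # phase 3
    return syntactic, semantic                                 # phase 4 matches both outputs

print(understand(None))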

Speech signal processing

In this work, wavelet multiresolution analysis is used to extract the characteristics of the speech signal. We divide these signal characteristics into prosodic and semantic ones. The wavelet transform permits wave decompositions in scale (dilation) and temporal displacement (translation). The scale enables the differentiation of the signal between frequency levels, whereas the translation defines the width of the band under analysis [15]. If we vary only the scale, the wavelets can work as filterbanks with multiresolution analysis [16]. This means that it is possible to obtain more details of the signal at different frequency levels. The scale is defined by

\phi(x) = \sqrt{2} \sum_k h_k \, \phi(2x - k)

where h_k is a low-pass filter and k is the scale index. The mother wavelet is inserted in this scale by

\psi(x) = \sqrt{2} \sum_k g_k \, \phi(2x - k)

where g_k is a high-pass filter. These parameters permit high frequencies to be analyzed in narrow windows and low frequencies in wide windows. In this SUM implementation, we use multiresolution analysis for speech signal codification. We define this process in two ways: the phonetic and the prosodic approaches (figure 3). The phonetic coefficients are obtained from a single decomposition of wavelet coefficients. The prosodic coefficients are extracted by wavelets from the F0 variation (pitch).
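As a hedged illustration of this multiresolution codification (independent of the authors' Matlab implementation), the Python sketch below uses the PyWavelets library to decompose a signal with the db4 wavelet and to derive a simple sub-band energy vector; the feature choice is an assumption for illustration only.

import numpy as np
import pywt

fs = 16000                                    # assumed sampling rate (Hz)
t = np.linspace(0, 1, fs, endpoint=False)
signal = np.sin(2 * np.pi * 220 * t)          # placeholder for a recorded word

# Multiresolution analysis with the Daubechies wavelet of eight filter
# coefficients ('db4'), using three decomposition levels as in the phonetic approach.
coeffs = pywt.wavedec(signal, 'db4', level=3)
approximation, details = coeffs[0], coeffs[1:]

# One possible phonetic feature vector: the mean energy of each sub-band.
phonetic_features = np.array([np.mean(c ** 2) for c in coeffs])
print(phonetic_features)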

Fig. 3. Coefficient extraction from the speech signal.

We can extract the spectral density in each sub-band for phoneme identification [17]; the phonetic coefficients are calculated from the leaves of the wavelet multiresolution tree. These coefficients are used as word patterns in the connectionist systems that process language. The identification of prosodic characteristics is done through F0 analysis. According to Kadambe, it is necessary to detect the wavelet maximum points to acquire information on the variations of the speech F0, which correspond to the glottal closure instants (GCI) [18]. The prosodic coefficients are the variance of the F0 detected by wavelet multiresolution analysis and are applied to the semantic and prosodic analyses.
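The sketch below is a rough, assumption-laden Python reading of this idea: local maxima of a wavelet detail band stand in for GCIs, F0 is estimated from their spacing, and the variance of F0 is returned as the prosodic coefficient. It is neither Kadambe's algorithm nor the authors' code.

import numpy as np
import pywt
from scipy.signal import find_peaks

def prosodic_coefficient(signal, fs, wavelet='db4', level=2):
    # Two-level decomposition; keep the coarsest detail band, where peaks
    # related to glottal closure instants are assumed to appear.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    detail = np.abs(coeffs[1])
    # Candidate GCIs: local maxima separated by at least ~2 ms.
    min_gap = max(1, int(0.002 * fs / 2 ** level))
    peaks, _ = find_peaks(detail, distance=min_gap)
    if len(peaks) < 3:
        return 0.0
    # Peak spacing in sub-band samples converted to seconds, then to F0 (Hz).
    periods = np.diff(peaks) * (2 ** level) / fs
    f0 = 1.0 / periods
    return float(np.var(f0))      # variance of F0 as the prosodic coefficient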

Syntactic analysis

The syntactic analysis is processed by the SARDSRN-RAAM system, developed by Mayberry and Miikkulainen [19]. The phonetic coefficients are trained by the RAAM (Recursive Auto-Associative Memory) net, whose activation enables SARDSRN-RAAM to sequence the words of the phrase. In this training, the phonetic coefficients are associated with words. Afterwards, the temporal sequence of the component words is initiated, and the sentence pattern presented in the input layer is distributed to the hidden layer and to the SARDNET (Sequential Activation Retention and Decay Network). This net also feeds the hidden layer which, in turn, transfers its codes to a context layer, characterizing the SRN (Simple Recurrent Network) within SARDSRN-RAAM. In the end, the output layer generates a sentence pattern that is decoded by the RAAM net (see figure 4).
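The recurrent part of this architecture can be pictured with the minimal SRN step sketched below, using random untrained weights; it only shows how the hidden layer is copied to a context layer and fed back with the next word pattern, not the full SARDSRN-RAAM of Mayberry and Miikkulainen.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 16, 32, 16            # illustrative layer sizes
W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))

def srn_step(word_pattern, context):
    # Hidden activation depends on the current word and the previous context.
    hidden = np.tanh(W_in @ word_pattern + W_ctx @ context)
    output = np.tanh(W_out @ hidden)
    return output, hidden                     # hidden becomes the next context

context = np.zeros(n_hidden)
for word in rng.normal(size=(4, n_in)):       # four word patterns of a sentence
    output, context = srn_step(word, context)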

Prosodic-semantic analysis

The prosodic-semantic analysis system receives the phonetic and the prosodic coefficients from the wavelet transform. The semantic processing is composed of four chained Growing Hierarchical Self-Organizing Maps (GHSOM) [20] (see figure 5). In the first GHSOM net, the input is provided by the prosodic coefficients from the wavelet transform applied to the speech signal. In

Fig. 4. SARDSRN schema.

the second GHSOM net, the phonetic map organizes groups of words according to their linguistic structure and pronunciation. The net that forms the prosodic-semantic map uses the output information from the activated neuron in the phonetic map and the activation in the prosodic map. The training of this composition enables the semantic clustering of words in the map. Finally, the last map is responsible for grouping the sentences informed by the user. The codification of each component word of the sentence supplied by the system user is made by the activation of the preceding maps. The composition of the semantic map outputs for each word is the input of the sentences map. The recognition of speech patterns is performed by the sentences map, which indicates the most likely sentence.
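The chaining of the maps can be pictured with the toy Python sketch below: the winner coordinates of the prosodic and phonetic maps form the input of a following map. Real GHSOMs grow and add layers during training; fixed random codebooks and the chosen map sizes are assumptions used here only to show how activations propagate from one map to the next.

import numpy as np

rng = np.random.default_rng(1)

def winner(codebook, x):
    # codebook: (rows, cols, dim) array; returns coordinates of the best-matching unit.
    distances = np.linalg.norm(codebook - x, axis=2)
    return np.unravel_index(np.argmin(distances), distances.shape)

prosodic_map = rng.normal(size=(4, 4, 1))     # assumed: 1 prosodic coefficient per word
phonetic_map = rng.normal(size=(6, 6, 4))     # assumed: 4 phonetic coefficients per word
semantic_map = rng.normal(size=(5, 5, 4))     # input: winner coordinates of both maps

prosodic_vec = rng.normal(size=1)
phonetic_vec = rng.normal(size=4)

w_pros = winner(prosodic_map, prosodic_vec)
w_phon = winner(phonetic_map, phonetic_vec)
semantic_input = np.array([*w_pros, *w_phon], dtype=float)
print(winner(semantic_map, semantic_input))   # position of the semantic winner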

Fig. 5. Organization of the maps for phonetic, prosodic, semantic and sentence clustering.

Evaluation

The sentences resulting from the systems' recognitions are evaluated after the syntactic and prosodic-semantic processing. The SARDSRN-RAAM system indicates an error rate for each sentence output, and the semantic maps system points to the winner neuron in the sentences map. We elaborated an algorithm to perform this evaluation: if the syntactic error rate is higher than 0.5, the syntactic structure is rejected; if the distance from the trained patterns in the sentences map is higher than 2, the semantic context is rejected; if the failure is only in the syntactic analysis, the system points out the best sentence of the prosodic-semantic analysis; if the failure is only in the semantics, the system returns the sentence found in the syntactic analysis; if both analyses fail, the sentence is considered incomprehensible; if both are recognized successfully, both are indicated to the user as good results. The result of this algorithm is the output of our SUM implementation. For each spoken sentence, the system generates the most reasonable written sentence as a result.
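A direct sketch of this decision procedure is given below; the thresholds (0.5 for the syntactic error rate and 2 for the map distance) come from the text, while the assumption that a higher error rate means a worse parse is ours.

def evaluate(syntactic_sentence, syntactic_error, semantic_sentence, map_distance):
    # Reject the parse when the error rate is too high (assumed direction), and
    # reject the semantic context when the sentences-map distance is too large.
    syntactic_ok = syntactic_error <= 0.5
    semantic_ok = map_distance <= 2
    if syntactic_ok and semantic_ok:
        return syntactic_sentence, semantic_sentence    # both presented as good results
    if syntactic_ok:
        return syntactic_sentence                       # semantics failed: keep the parse
    if semantic_ok:
        return semantic_sentence                        # syntax failed: keep the map's best sentence
    return None                                         # both failed: incomprehensible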

5 Simulation of the SUM

Training

We apply wavelet transforms to obtain the patterns used to train the neural networks. We recorded 8 sentences, spoken by three Brazilian Portuguese speakers, from which 13 words were segmented by hand. The recordings were made in Brazilian Portuguese, but here we present only the translation of the sentences. The trained spoken sentences were (with acronyms marked in small letters):

acronym  sentence                 acronym  sentence
tbstc    the boy saw the cat      tdctc    the dog chased the cat
tcltg    the cat liked the girl   tcstd    the cat saw the dog
tdbtb    the dog bit the boy      tbbtg    the boy bit the girl
tgltd    the girl liked the dog   tgctb    the girl chased the boy

We used Matlab to apply the wavelet transform to the speech signal. The wavelet used was the Daubechies wavelet with eight filter coefficients (the db4 filter in Matlab). The phonetic coefficients were obtained with three decomposition levels and the prosodic coefficients with two decomposition levels. Both the prosodic and the phonetic coefficients are obtained directly from the wavelet transform (figure 6). In the SARDSRN-RAAM system, the order of the training input is guided by a grammar definition with written sentences. The training is stopped when the error rate of each sentence is lower than 0.01. The simulation of the maps system was carried out with the GHSOM Matlab toolbox [20]. All GHSOM maps were chained as described in section 4.

Recognition

For recognition, the same wavelet codification process was applied. We selected three spoken sentences with each verb of the lexicon. The sentences were spoken by a single speaker. The sentences selected for recognition were (with acronyms marked in capital letters):

Fig. 6. a) Original wave and phonetic coefficients; b) wavelet coefficients with zero-crossing marks and prosodic coefficients.

acronym  sentence                 acronym  sentence
TBSTG    the boy saw the girl     TCBTB    the cat bit the boy
TGSTC    the girl saw the cat     TDBTC    the dog bit the cat
TDSTB    the dog saw the boy      TGBTC    the girl bit the cat
TDLTG    the dog liked the girl   TBCTC    the boy chased the cat
TGLTB    the girl liked the boy   TDCTG    the dog chased the girl
TBLTC    the boy liked the cat    TCCTD    the cat chased the dog

In the SARDSRN-RAAM recognition, for each sentence the system presents the most likely sentence. For example, the sentence TBSTG was recognized as the trained sentence tbbtg, TGSTC as tgctb, TDSTB as tdbtb, TBLTC as tbstc, and so on. It is important to point out that we recognized (by estimation) unknown sentences. As an illustration of the map estimation, the untrained sentence TDLTG matched the trained sentence tcltg, TBCTC matched tdctc, and so on. Each sentence corresponds to a recognized neuron, as shown in figure 7, which illustrates the resulting grouping of sentences in the sentences map. We chose the sentences TDBTB (the dog bit the boy) and TCCTD (the cat chased the dog), which are not in the training set. For the first sentence, we got a prosodic-semantic match with tcltg (the cat liked the girl); the syntactic subsystem returned the same written sentence (the dog bit the boy), although not trained. For the second sentence, the syntactic subsystem presented the wrong sentence "the cat boy the dog", while the sentences map returned a distance of 0 from the pattern tdctc (the dog chased the cat) (see figure 7). These two examples mean that, in the case of the first sentence, the syntactic subsystem got the correct sentence but the maps system made a bad choice, whereas in the case of the second sentence the syntactic system failed but the maps system returned the most likely sentence.

Fig. 7. The final sentences map.

6 Conclusion

The SUM model is a software architecture to guide the computational implementation of the auditory sentence process. The proposed implementation of SUM consists in the codification of the speech signal through wavelet coefficients. The results obtained from the wavelet transform allowed an appropriate codification of the speech signal and prosody for use in the connectionist systems for syntactic and semantic representation. The resulting codification demonstrates that there is an interface between existing connectionist linguistic parsing systems that analyze text and speech. This opens a new implementation method for systems originally developed for written language. The use of artificial neural nets in the syntactic and prosodic-semantic processing was presented as a facilitator in the language modeling process. The training through examples provided by connectionist systems simplifies the work necessary to define grammars and contexts of the language. The use of hierarchies of maps for the definition of semantic contexts can use prosodic information as a guide for linguistic parsing. As analyzed, this notion has a biological inspiration in the presented neurocognitive model. Finally, the computational prototype that demonstrates the processing of SUM resulted in a system of complementary analyses. Therefore, when the syntactic analysis does not offer reliability, it is possible to evaluate the prosodic-semantic analysis, as in human speech understanding.

References

1. P. Price, "Spoken language understanding," in Survey of the State of the Art in Human Language Technology, R. A. Cole et al., Ed. Cambridge University Press, Stanford, 1996.
2. Ryuichiro Higashinaka et al., "Incorporating discourse features into confidence scoring of intention recognition results in spoken dialogue systems," Speech Communication, vol. 48, pp. 417-436, 2006.
3. Z. Tufekci and J. N. Gowdy, "Feature extraction using discrete wavelet transform for speech recognition," Proc. IEEE Southeastcon 2000, pp. 116-123, 2000.
4. Kevin M. Indrebo et al., "Sub-banded reconstructed phase spaces for speech recognition," Speech Communication, vol. 48, pp. 760-774, 2006.
5. Ye-Yi Wang and Alex Acero, "Rapid development of spoken language understanding grammars," Speech Communication, vol. 48, pp. 390-416, 2006.
6. H. Erdogan et al., "Using semantic analysis to improve speech recognition performance," Computer Speech and Language, vol. 19, pp. 321-343, Elsevier, 2005.
7. Mikko Kurimo, "Thematic indexing of spoken documents by using self-organizing maps," Speech Communication, vol. 38, pp. 29-45, 2002.
8. R. Kompe, Prosody in Speech Understanding Systems, Springer-Verlag, Berlin, 1997.
9. Tong Zhang, Mark Hasegawa-Johnson, and Stephen Levinson, "A hybrid model for spontaneous speech understanding," in Proceedings of the AAAI Workshop on Spoken Language Understanding, pp. 60-67, Pittsburgh, 2005.
10. Angela D. Friederici and Kai Alter, "Lateralization of auditory language functions: A dynamic dual pathway model," Brain and Language, vol. 89, pp. 267-276, 2004.
11. Korinna Eckstein and Angela D. Friederici, "Late interaction of syntactic and prosodic processes in sentence comprehension as revealed by ERPs," Cognitive Brain Research, vol. 25, pp. 130-143, 2005.
12. S. Heim et al., "Distributed cortical networks for syntax processing: Broca's area as the common denominator," Brain and Language, vol. 85, pp. 402-408, 2003.
13. Sonja Rossi et al., "When word category information encounters morphosyntax: An ERP study," Neuroscience Letters, vol. 384, pp. 228-233, 2005.
14. Daniel N. Müller, Mozart Lemos de Siqueira, and Philippe O. A. Navaux, "A connectionist approach to speech understanding," in Proceedings of the 2006 International Joint Conference on Neural Networks - IJCNN'2006, pp. 7181-7188, 2006.
15. Ingrid Daubechies, Ten Lectures on Wavelets, SIAM, 1992.
16. S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp. 674-693, July 1989.
17. L. P. Ricotti, "Multitapering and a wavelet variant of MFCC in speech recognition," IEE Proc. Vision, Image and Signal Processing, vol. 152, pp. 29-35, Feb. 2005.
18. Shubha Kadambe and G. Faye Boudreaux-Bartels, "Application of the wavelet transform for pitch detection of speech signals," IEEE Trans. Information Theory, vol. 38, pp. 917-924, 1992.
19. M. R. Mayberry III and Risto Miikkulainen, "SARDSRN: A neural network shift-reduce parser," in Proceedings of IJCAI-99, pp. 820-825, Kaufmann, 1999.
20. Alvin Chan and Elias Pampalk, "Growing Hierarchical Self Organising Map (GHSOM) toolbox: Visualisations and enhancements," in Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), 2002, vol. 5, pp. 2537-2541.