MULTILEVEL ANNOTATION OF SPEECH SIGNALS USING WEIGHTED FINITE STATE TRANSDUCERS

Sérgio Paulo, Luís Oliveira



Spoken Language Systems Lab., INESC-ID/IST, Rua Alves Redol 9, 1000-029 Lisbon, Portugal



{spaulo,lco}@l2f.inesc-id.pt

ABSTRACT

The purpose of this work was the development of a set of tools to automate the process of multilevel annotation of speech signals, preserving the alignment between the different levels of the utterance's linguistic representation. Our goal is to build speech databases, using speech from non-professional speakers, with multilevel relational annotations that can be used for the development of concatenative text-to-speech synthesizers or for training and testing statistical models. The method is based on the linguistic analysis of the transcription of the spoken material, performed by a TTS system. The predicted phone sequence is then compared with the sequence produced by the speaker. The problem of aligning these two sequences is solved in a language-independent way using Weighted Finite State Transducers. After the alignment, a re-synchronization procedure is applied to the remaining levels to put them in agreement with the spoken utterance.

1. INTRODUCTION

The increased availability of storage capacity in current machines allows concatenative text-to-speech systems to use much larger inventories of natural speech than before. The extended spoken material allows a better selection of the units needed to produce a given utterance, thus increasing the quality of the synthetic signal. Also, much of the linguistic knowledge contained in current systems is derived from databases of speech signals with multi-level annotations, built using tools like the EMU System [1]. The creation of speech databases for these purposes involves the specification of the materials to record (prompt texts), the selection of the speaker, the recording and, finally, the annotation of the speech signal at the different levels (segmental, syllable, prosodic, etc.). If the database is large, the annotation process can be costly if performed entirely by human annotators. Several techniques have been used to automate the annotation, namely by using a text-to-speech system to predict the several levels of annotation from the input text. A speech recognizer, in forced alignment mode, can then be used to align the predicted sequence of phonetic segments with the actual spoken utterance.

Using this alignment, the remaining levels of annotation can be synchronized with the timing of the segmental level.

One of the problems of this fully automated approach is that the spoken utterance may differ from what was predicted from the prompt text. One solution is careful monitoring of the recording procedure, to verify that the speaker is producing the expected acoustic realization of the prompt text. Even with professional speakers, this usually requires re-recording much of the spoken material. With less trained and cooperative speakers, like children, the process can rapidly become a nightmare, and the constant repetition can ruin the naturalness of the spoken utterances. In other cases, the recordings are already available, with no possibility of further corrections. Another solution is to accept variations in the recorded material and to manually correct the observed deviations during the annotation process. These corrections must be performed at all annotation levels, taking care to avoid creating inconsistencies between them.

In this paper we propose an approach that copes with this problem in four steps, depicted in figure 1. First, a speech synthesizer is used to predict all levels of the expected utterance; this synthesizer also establishes all the meaningful inter-level connections. Next, a speech recognizer is used in forced alignment mode, modified to allow multiple pronunciations that take co-articulation and reduction hypotheses into account. The resulting segmental annotation can then be manually corrected. The third step is the alignment between the observed and predicted sequences of phonetic segments, for which we use a WFST-based approach. Finally, we use the resulting alignment to automatically synchronize the higher levels of annotation.

2. PROPOSED METHOD

In this work, we follow the hierarchical architecture used in the Festival Speech Synthesis System [2], using the default relations and relation names defined in that system.

2.1. Building the utterance tree

As shown in figure 1, the first step is to build a predicted utterance tree, using the text of the recording prompts and a TTS system for the language.


Fig. 1. Schematic form of the method presented here.

Figure 2 shows a very simplified diagram of the utterance tree. At the top level, the input text is split into a list of tokens. Each token can be linked to a sequence of one or more words, each word to its syllables, and each syllable to a sequence of phones. The syllables can also be marked with intonation events, predicted by the prosody models, and each phone can be assigned targets for the fundamental frequency contour. All these connections have to be updated in the synchronization phase, the final step of the method. The phone sequence predicted by the TTS system is then used as input to the forced alignment phase.
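As an illustration of this structure, the sketch below models the tree of figure 2 in Python. The class and field names are ours, chosen to mirror the Festival-style relations described above; they are not the data structures of Festival or of our tools.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Target:            # F0 target attached to a phone
    time: float
    f0: float

@dataclass
class Phone:
    name: str
    start: float = 0.0   # filled in later by forced alignment
    end: float = 0.0
    targets: List[Target] = field(default_factory=list)

@dataclass
class Syllable:
    phones: List[Phone] = field(default_factory=list)
    int_event: Optional[str] = None   # intonation event label, if any

@dataclass
class Word:
    text: str
    syllables: List[Syllable] = field(default_factory=list)

@dataclass
class Token:
    text: str
    words: List[Word] = field(default_factory=list)

@dataclass
class Utterance:
    tokens: List[Token] = field(default_factory=list)

    def phones(self) -> List[Phone]:
        """Flatten the tree into the predicted phone sequence."""
        return [p for t in self.tokens for w in t.words
                for s in w.syllables for p in s.phones]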

Fig. 2. The most relevant utterance levels, and their connections.

2.2. Forced alignment

One of the most important aspects of a speech database is the correct alignment of the phones in each utterance. Several approaches have been proposed to perform this task. One approach, described in [3], requires a synthesized speech signal, based on the expected phonetic sequence of the natural speech signal under analysis. The alignment is obtained using Dynamic Time Warping (DTW) techniques, based on the frame-by-frame spectral distance between the two speech signals. This segmentation method places the phone boundaries with good accuracy, but it requires a speech synthesizer for the language, with a voice resembling the speaker's.

Another solution is the use of an HMM-based speech recognizer in forced alignment mode. This method is based on phoneme models derived from large speech databases (as in any HMM-based speech recognizer). The predicted phonetic sequence is used, together with the Viterbi algorithm, to choose the path that maximizes a likelihood function. One problem with this approach is that, although the phoneme models provide a reasonable amount of knowledge about the phones, the information they carry about phone transitions is very poor, so the phone boundaries may not be very accurate. To counterbalance this weakness, the models are reasonably speaker-independent.

The observed acoustic realization of a sentence differs from speaker to speaker, due to several effects, like speaking style or even the speaker's cultural background. Most of the variation in the acoustic realization of a sentence is due to phonological phenomena like vowel reduction, schwa deletion, schwa insertion, and co-articulation between phones at word boundaries. With this knowledge, rules can be added to the forced alignment procedure to allow multiple pronunciations of the sentence [5]. A similar approach to finding the correct phonetic sequence of an utterance, described in [4], has been shown to perform comparably to human listeners. Forced alignment with alternative pronunciations is more efficiently implemented with an HMM recognizer than with the DTW aligner: the Viterbi algorithm has more possibilities to find a path that maximizes the likelihood function. The DTW aligner, if available, can later be used to refine the location of the phone boundaries.

Even with these rule-based multiple pronunciations, there are cases that may be missed in the alignment, for example when a speaker says "tfone" instead of "telefone" (telephone), or "árve" instead of "árvore" (tree). This can be solved either by adding a lexicon with common pronunciation mistakes or by human correction of the forced alignment results.
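The following toy sketch illustrates the idea of rule-generated pronunciation variants. The two rules shown (optional deletion of unstressed vowels and vowel reduction) are simplified stand-ins of our own, not the actual rule set of [5]:

from itertools import product
from typing import List, Tuple

# Toy rule tables (assumed, for illustration only):
OPTIONAL_DELETE = {"@", "u"}        # unstressed vowels a speaker may drop
REDUCTIONS = {"o": ("o", "u")}      # unstressed o may surface as u

def variants(phones: List[str]) -> List[Tuple[str, ...]]:
    """Expand a predicted phone sequence into pronunciation variants."""
    options = []
    for p in phones:
        alts = set(REDUCTIONS.get(p, (p,)))
        if p in OPTIONAL_DELETE:
            alts.add("")            # the phone may be absent entirely
        options.append(sorted(alts))
    seen = set()
    for combo in product(*options):
        seen.add(tuple(x for x in combo if x))  # drop deleted slots
    return sorted(seen)

# A "telefone"-like input: schwa-deleted variants such as
# ('t', 'l', 'f', 'o', 'n') appear among the alternatives
# offered to the forced aligner.
print(variants(["t", "@", "l", "@", "f", "o", "n", "@"]))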

Fig. 3. A simplified version of the transducer used as a filter to implement the alignment costs defined in section 2.3.1.

2.3. Phone alignment

WFSTs (Weighted Finite State Transducers [7]) can be used to find the best alignment between the predicted and observed phones. The alignment problem can be viewed as the composition of the sequence of input symbols (the predicted phone sequence) with all possible symbol transformations (insertion, deletion or substitution) that can produce the output symbol sequence (the observed phone sequence). The input sequence can be represented as a Finite State Automaton (FSA) that is transformed by a Finite State Transducer to produce all possible output sequences; from these, we are only interested in the ones equal to the observed phone sequence. Since the input FSA can also be represented as an FST with the same input and output labels, the whole process can be represented as the composition

$A = P \circ T \circ O$

where $P$ is the FST of the predicted sequence, $T$ is the transformation transducer and $O$ is the FST of the observed sequence. Since we are not interested in all possible solutions, but only in the one that minimizes the number of symbol modifications, weights must be assigned to the transducer $T$. Figure 3 shows a very simplified representation of such a transducer: replacing an a by an a has no cost, but inserting an a has cost $C_1$, deleting an a has cost $C_2$, and replacing an a by ã has cost $C_3$. The problem of finding the best alignment can now be expressed as

$\hat{A} = \operatorname{bestpath}(P \circ T \circ O)$

The input and output labels of the edges along the best path form an alignment of pairs $(p, o)$, where $p$ is a predicted phone and $o$ is an observed phone. Whenever a deletion occurs, $o$ is assigned the value (eps); for insertions, the same value is assigned to $p$.

Fig. 4. Results of the alignment of predicted and observed phones of part of the Portuguese phrase "... chegou a Portugal ...".
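The minimum-cost alignment selected by bestpath can equivalently be computed with a dynamic-programming (weighted edit distance) recursion. The sketch below is this equivalent formulation, written by us; it is not the FSM-toolkit implementation used in the paper, and the default costs are assumed values. With equal costs, ties between distinct minimum-cost alignments are possible, which is one motivation for the feature-based costs of section 2.3.1.

from typing import Callable, List, Tuple

EPS = "(eps)"

def align(pred: List[str], obs: List[str],
          c_ins: float = 1.0, c_del: float = 1.0,
          # assumed default: a substitution costs more than one
          # insertion or deletion, but less than one of each
          c_sub: Callable[[str, str], float] = lambda o, p: 1.5,
          ) -> Tuple[List[str], List[str]]:
    """Minimum-cost alignment of predicted vs. observed phones,
    returned as two rows padded with EPS (cf. figures 4 and 5)."""
    n, m = len(pred), len(obs)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * c_del, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * c_ins, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if pred[i-1] == obs[j-1] else c_sub(obs[j-1], pred[i-1])
            best = min((cost[i-1][j-1] + sub, 0, "sub"),   # prefer the
                       (cost[i-1][j] + c_del, 1, "del"),   # diagonal on
                       (cost[i][j-1] + c_ins, 2, "ins"))   # cost ties
            cost[i][j], _, back[i][j] = best
    pr, ob, i, j = [], [], n, m                 # trace back the best path
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "sub":
            pr.append(pred[i-1]); ob.append(obs[j-1]); i -= 1; j -= 1
        elif op == "del":
            pr.append(pred[i-1]); ob.append(EPS); i -= 1
        else:
            pr.append(EPS); ob.append(obs[j-1]); j -= 1
    return pr[::-1], ob[::-1]

if __name__ == "__main__":
    predicted = "S @ g o 6 p u r t u g a l~".split()
    observed = "S @ g o p r u t g a l~".split()
    for row in align(predicted, observed):
        # prints one minimum-cost alignment, like the one in figure 5
        print(" ".join(row))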




Phrase "... chegou a Portugal, ..."

Predicted phones:                 S @ g o 6 p u r t u g a l~
Predicted phones plus insertions: S @ g o 6 p u r (eps) t u g a l~
Observed phones plus deletions:   S @ g o (eps) p (eps) r u t (eps) g a l~


Fig. 5. On the left, the alignment of the predicted and observed phones of part of the Portuguese phrase "... chegou a Portugal, ..."; on the right, the utterance tree for this example after the alignment of predicted and observed phones.
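Read as a pair of rows, the alignment above is an edit script. The small helper below (ours, purely illustrative) makes the insertions, deletions and matches explicit, in the form consumed by the synchronization step of section 2.4:

EPS = "(eps)"

def edit_script(pred_row, obs_row):
    """Turn two EPS-padded alignment rows into edit operations."""
    ops = []
    for p, o in zip(pred_row, obs_row):
        if p == EPS:
            ops.append(("insert", o))      # speaker added a phone
        elif o == EPS:
            ops.append(("delete", p))      # predicted phone not realized
        elif p == o:
            ops.append(("match", p))
        else:
            ops.append(("substitute", p, o))
    return ops

pred_row = "S @ g o 6 p u r (eps) t u g a l~".split()
obs_row = "S @ g o (eps) p (eps) r u t (eps) g a l~".split()
print(edit_script(pred_row, obs_row))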

2.3.1. Alignment costs

In general, the alignment solution is not unique; it depends on the assignment of the different alignment costs. For this process, the following costs were established:

- $C_1$, the cost of inserting a new phone between two existing ones;
- $C_2$, the cost of deleting a predicted phone;
- $C_3(o, p)$, the cost of aligning two different phones, where the cost is a function of the phones involved: $o$ is the observed phone and $p$ is the predicted phone.

The reason for this dependency is that, for example, some phones are more likely to be deleted than others. In general, the values of each cost function are language-dependent and must be assigned by human experts. If that knowledge is not available, a simple solution can be used, making $C_1$ and $C_2$ equal and $C_3$ a constant. A more sophisticated solution can take the phone features into consideration when computing the replacement costs; for example, the cost of replacing a vowel by another vowel is lower than that of replacing a vowel by a consonant. The features currently used are: vowel/consonant, voiced/unvoiced, front/back, long/short, stop/non-stop, etc. A special case was added specifically for the Portuguese language, where the stop consonants in the onset of a syllable are usually neither deleted nor replaced. This required the addition of six more phone symbols to represent the stop consonants in that position. The cost of replacing these symbols is much higher than for stop consonants in other syllable positions. Similar modifications may be required for other languages.
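A possible shape for such a feature-based substitution cost is sketched below. The feature inventory follows the text, but the phone table, the symmetric-difference scoring and the onset penalty value are our own illustrative assumptions:

# Tiny Portuguese-flavoured feature table (assumed, for illustration):
FEATURES = {
    "a":  {"vowel", "voiced", "back"},
    "6":  {"vowel", "voiced", "back"},
    "@":  {"vowel", "voiced"},
    "u":  {"vowel", "voiced", "back"},
    "o":  {"vowel", "voiced", "back"},
    "p":  {"consonant", "stop"},
    "p1": {"consonant", "stop", "onset"},   # onset stop, special symbol
    "t":  {"consonant", "stop"},
    "t1": {"consonant", "stop", "onset"},
    "g":  {"consonant", "stop", "voiced"},
    "g1": {"consonant", "stop", "voiced", "onset"},
    "r":  {"consonant", "voiced"},
}

ONSET_PENALTY = 10.0   # onset stops are rarely replaced (assumed value)

def c_sub(observed: str, predicted: str) -> float:
    """Cost grows with the number of differing features."""
    fo = FEATURES.get(observed, set())
    fp = FEATURES.get(predicted, set())
    cost = len(fo ^ fp)                      # symmetric feature difference
    if "onset" in fp and observed != predicted:
        cost += ONSET_PENALTY                # protect syllable-onset stops
    return float(cost)

print(c_sub("6", "a"))   # vowel for vowel: cheap
print(c_sub("p", "a"))   # consonant for vowel: expensive
print(c_sub("t", "t1"))  # replacing an onset stop: heavily penalized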

2.3.2. A simple example of using WFSTs to find the best alignment

To exemplify the use of WFSTs for the alignment problem, we take a segment of an utterance in Portuguese, "... chegou a Portugal ...". Figure 4 shows three of the transducers generated with the AT&T FSM toolkit [6] when aligning the spoken and predicted phones of that piece of the utterance. The first transducer corresponds to the predicted phones, the second to the observed phones, and the last is the transducer resulting from the composition, where the matching phones can be seen. In the first and last transducers, the stop consonants in the onset of a syllable are marked with the special labels (g1, p1 and t1).
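For concreteness, the sketch below writes a predicted-phone transducer in the AT&T FSM toolkit's textual format (one arc per line: source state, destination state, input label, output label and optional cost), the form that the toolkit compiles before composition and best-path extraction. The file names and the exact phone sequence are our assumptions, and command-line details are omitted since they vary between toolkit versions.

# Phone sequence of the running example, with the onset stops marked
# by the special symbols g1, p1, t1 (assumed, following the text).
phones = "S @ g1 o 6 p1 u r t1 u g1 a l~".split()

with open("predicted.txt", "w") as f:
    for i, p in enumerate(phones):
        f.write(f"{i} {i + 1} {p} {p}\n")  # arc: src dest ilabel olabel
    f.write(f"{len(phones)}\n")            # final state, default cost

# Symbol table mapping labels to integer ids; id 0 is the empty label:
with open("phones.syms", "w") as f:
    f.write("(eps) 0\n")
    for i, p in enumerate(sorted(set(phones)), start=1):
        f.write(f"{p} {i}\n")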


Fig. 6. Scheme of some levels of the utterance tree after linking the segments to the Syllable-level nodes.


Fig. 7. Final view of some levels in the utterance tree, after the cleaning step.

2.4. Synchronization

The next step is the alignment of the higher-level features. This process takes as input the result of the phone alignment block: a list of predicted and a list of observed phones, with (eps) symbols, representing insertions and deletions, between their elements. The first of these two lists is compared with the list of predicted phones; when there is a mismatch, a new segment is inserted in the utterance tree, just before the predicted phone under analysis, as shown in figure 5. The next step is the removal of the deleted segments (the (eps) symbols in the observed list). We then start by assuming that the observed phones are not very different from the predicted ones, so that the same syllable structure predicted from the text can be used, and all new segments are attached to existing syllables. A re-syllabification can then be performed, eventually creating new syllables. Sometimes a syllable appears with only its onset phone and no phone in the nucleus; this can be seen in figure 6, in the initial syllable "t.u", after the unobserved phone "u" has been deleted. It was decided to keep such syllables without a nucleus. After all these steps, the utterance tree can be cleaned by deleting all intermediate nodes that have no child nodes. In the example shown in figure 7, the word "a" was removed, because the phone of its only syllable was deleted. At this stage, the word and syllable levels are synchronized with the timing of the segmental level. It is then necessary to copy into the utterance structure the fundamental frequency values observed in the spoken utterance: a standard pitch tracking tool is used to produce F0 values at a fixed rate, which can be loaded and synchronized with the phone sequence. Optionally, other levels of annotation (manual or automatic) can also be synchronized at this time.
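The sketch below reproduces the core of this re-synchronization on a deliberately simplified structure, where syllables are plain lists of phone names. The function name and data layout are ours; the real tool operates on the full utterance tree with all its relations.

EPS = "(eps)"

def synchronize(syllables, pred_row, obs_row):
    """Apply the EPS-padded alignment rows to a simplified
    syllable structure (a list of lists of phone names)."""
    # syllable index of each predicted phone, in order
    syl_of = [si for si, syl in enumerate(syllables) for _ in syl]
    out = [[] for _ in syllables]
    k, last = 0, 0                 # next predicted phone, current syllable
    for p, o in zip(pred_row, obs_row):
        if p == EPS:               # insertion: attach the new phone to
            out[last].append(o)    # the syllable of the preceding phone
        else:
            last = syl_of[k]; k += 1
            if o != EPS:           # match or substitution keeps a phone;
                out[last].append(o)  # o == EPS means the phone was deleted
    # cleaning step: prune syllables left without any phone (cf. figure 7)
    return [syl for syl in out if syl]

syllables = [["S", "@"], ["g", "o"], ["6"], ["p", "u", "r"],
             ["t", "u"], ["g", "a", "l~"]]
pred_row = "S @ g o 6 p u r (eps) t u g a l~".split()
obs_row = "S @ g o (eps) p (eps) r u t (eps) g a l~".split()
print(synchronize(syllables, pred_row, obs_row))

On the example of figure 5 this prints [['S', '@'], ['g', 'o'], ['p', 'r', 'u'], ['t'], ['g', 'a', 'l~']], matching figures 6 and 7: the inserted "u" joins the syllable of the preceding phone, the onset-only syllable "t" is kept, and the emptied syllable of the word "a" is pruned.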

3. CONCLUSIONS

In this paper we explored the use of WFSTs for the alignment of phone sequences predicted by a TTS system with phone sequences observed in spoken utterances. The observed sequence of phones can be determined using an HMM-based speech recognizer that allows multiple pronunciations, possibly with manual correction. The WFST formalism makes the alignment process easily configurable to take into account language-specific phenomena like schwa deletion, vowel reduction, schwa insertion, co-articulation between phones at word boundaries, etc. The tool is particularly useful for building speech databases with multi-level annotations, where the phone alignment procedure can serve as the starting point for the synchronization of the full utterance tree. The process was illustrated with a short segment of an utterance in Portuguese.

4. ACKNOWLEDGEMENTS

The research reported here was conducted within the DIXI+ project, supported by the Portuguese Foundation for Science and Technology (FCT), POSI program. We would like to thank Diamantino Caseiro for the speech recognizer with multiple pronunciations using WFSTs.

5. REFERENCES

[1] S. Cassidy and J. Harrington, "EMU: an Enhanced Hierarchical Speech Data Management System," in Proceedings of the Australian Speech Science and Technology Conference, Adelaide, 1996.

[2] A. Black, P. Taylor and R. Caley, The Festival Speech Synthesis System. System documentation, Edition 1.4, for Festival version 1.4.0, June 1999.

[3] F. Malfrère and T. Dutoit, "High-Quality Speech Synthesis for Phonetic Speech Segmentation," in Proceedings of Eurospeech '97, Rhodes, Greece, 1997.

[4] J. Kessens, M. Wester, C. Cucchiarini and H. Strik, "The Selection of Pronunciation Variants: Comparing the Performance of Man and Machine," in Proceedings of the 5th International Conference on Spoken Language Processing, Sydney, 1998.

[5] D. Caseiro, H. Meinedo, A. Serralheiro, I. Trancoso and J. Neto, "Spoken Book Alignment Using WFSTs," in Proceedings of HLT 2002, Human Language Technology Conference, 2002.

[6] M. Mohri, F. Pereira and M. Riley, "A Rational Design for a Weighted Finite-State Transducer Library," Lecture Notes in Computer Science, vol. 1436, 1998.

[7] M. Mohri, "Finite-State Transducers in Language and Speech Processing," Computational Linguistics, 23(2), 1997.