Formal Prosodic Structures and their Application in NLP⋆

Jan Romportl, Jindřich Matoušek

University of West Bohemia, Department of Cybernetics, Univerzitní 8, 306 14 Plzeň, Czech Republic
[email protected], [email protected]

Abstract. A formal prosody description framework is introduced together with its relation to language semantics and NLP. The framework incorporates deep prosodic structures based on a generative grammar of abstract, functionally involved prosodic units. For each sentence, this grammar creates a structure of immediate prosodic constituents in the form of a tree. A speech corpus manually annotated with such prosodic structures is presented and its quantitative characteristics are discussed.

1 Introduction

Each sentence (i.e. an utterance as an abstract concept) can be described at various levels of abstraction. These levels form a mostly hierarchical scale ranging from an acoustic and phonetic/orthographic description to a semantic and logical analysis. Suprasegmental features of speech (i.e. prosody), however, do not fit into this linear ordering; rather, they run in parallel to it, connecting the acoustic level with semantics and pragmatics in a mutual interaction that is particularly nontrivial. Even the transformation of a sentence from one level of language description to a neighbouring one is noncanonical: once we have a phonetic representation of a sentence, we cannot precisely recover its original acoustic form, nor can we often exactly disambiguate its meaning unless we consider its context or even human “knowledge-about-world”. Generally, the closer two description levels are (in terms of the aforementioned scale), the easier the transition from one to the other is (yet it is still noncanonical). Obviously, this is the reason for adopting such a stratificational language description: a transition from the acoustic form of a sentence directly to its semantic or logical representation (omitting the intermediate levels) is almost impossible (a more formal analysis of this phenomenon from the cognitive point of view, using the Alternative Set Theory, is given in [1]).

We propose a formal theory describing the interaction of prosodic phenomena with other levels of the language system. Its aim is to bridge the theoretical gap between suprasegmental acoustic speech features and semantics, while still being suitable for NLP purposes such as TTS (text-to-speech) or ASR (automatic speech recognition) tasks. This framework involves the description of sentences by abstract prosodic structures (in the form of trees generated by a special prosodic phrase grammar) and the linkage of these structures both to the surface realization of prosodic phenomena (especially F0, intensity and duration) and to the deep semantic structure (tectogrammatics). The current stage of the research involves a partial implementation of a TTS prosody model based on this concept (see [2]), the development of specially annotated corpora and an appropriate prosodic grammar parser, which is crucially needed for the full implementation of the TTS prosody model.

⋆ This research was supported by the Grant Agency of the Czech Republic, project no. GAČR 102/05/0278.

2 Prosodic structures

Let us suppose that each sentence can be fully semantically described, within the scope of linguistics (i.e. not logic, paralinguistics, etc.), by a tectogrammatical representation (a tree structure similar to the surface syntactic structure, see [5]) which takes into account various contextual “inter-sentence” links and dependencies (e.g. topic-focus articulation). A sentence (considering only its surface representation) can often have several different tectogrammatical representations, especially when it is in written form and without further context. This semantic ambiguity is caused by homonymy principally bound to the syntax and can be eliminated either by analysing the sentence context or by suprasegmental acoustic means (prosody). A frequent and well-known example is: “I saw an airplane flying to Chicago.” One can utter this sentence with many distinct intonations, yet each will eventually be understood in one of two ways corresponding to two different readings (i.e. interpretations, tectogrammatical structures) of the sentence. This supports the assumption that there are two levels of prosody: the surface level (the actual realization of intonation and timing using F0, intensity and duration modulations) and the underlying deep level, which is one of the constitutive components of the sentence meaning. We propose a framework where the structures of this underlying prosody level consist of the following abstract entities (non-terminal symbols for a generative grammar are parenthesised):

Prosodic sentence (PS): A prosodic sentence is the prosodic manifestation of a sentence as a syntactically consistent unit, yet it can also be unfinished or grammatically incorrect.

Prosodic clause (PC): A prosodic clause is a linear unit of a prosodic sentence which is delimited by pauses.

Prosodic phrase (PP): A prosodic phrase is a segment of speech where a certain intonation scheme is realized continuously. A single prosodic clause often contains several prosodic phrases.

Prosodeme (P0), (Px): A prosodeme is an abstract unit established in a certain communication function within the language system. We postulate that every prosodic phrase consists of two prosodemes: a so-called “null prosodeme” and a “functionally involved prosodeme” (where (Px) stands for a type of prosodeme chosen from the list below), depending on the communication function the speaker intends the sentence to have. In the present research we distinguish the following prosodemes (for the Czech language; other languages may need some modifications):

P0 – null prosodeme;
P1 – prosodeme terminating satisfactorily (P1-1 no indication; P1-2 indicating emphasis; P1-3 indicating imperative; P1-4 indicating interjection; P1-5 indicating wish; P1-6 specific);
P2 – prosodeme terminating unsatisfactorily (P2-1 no indication; P2-2 indicating emphasis; P2-3 indicating “wh-” question; P2-4 indicating emphasised “wh-” question; P2-5 specific);
P3 – prosodeme nonterminating (P3-1 no indication; P3-2 indicating emphasis; P3-3 specific).

Prosodic word (PW): A prosodic word (sometimes also called a phonemic word) is a group of words subordinated to one word accent (stress). Languages with a non-fixed stress position would need a stress position indicator too.

Semantic accent (SA): A semantic accent is a prosodic word attribute indicating that the word is emphasised (by acoustic means) by the speaker.

2.1 Prosodic grammar

The well-formed prosodic structures (trees) are determined by the generative prosodic phrase grammar. This grammar uses two more terminal symbols (“$” and “#”) which stand for pauses differing in their length. The symbol (w_i) stands for a concrete word from a lexicon and ∅ means an empty terminal symbol. The rules should be understood this way: (PC) → (PP){1+} #{1} means that the symbol (PC) (prosodic clause) generates one or more (PP) symbols (prosodic phrases) followed by one # symbol (pause).

(S) → (PC){1+} ${1}    (1)
(PC) → (PP){1+} #{1}    (2)
(PP) → (P0){1} (Px){1}    (3)
(P0) → ∅    (4)
(P0) → (PW){1+}    (5)
(P0) → (SA){1} (PW){1+}    (6)
(Px) → (PW){1}    (7)
(Px) → (SA){1} (PW){2+}    (8)
(PW) → (w_i){1+}    (9)
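As an illustration only, rules (1)–(9) can be read as a well-formedness check on trees. The following minimal Python sketch is not the parser discussed in this paper; the nested-tuple tree encoding, the function names and the label pattern for (Px) are assumptions introduced here, and the segmentation of the sample sentence into prosodic words is likewise only illustrative.

```python
import re

# Assumed encoding: a node is (label, child, child, ...). Leaves are plain
# strings: words, the pause symbols "$" and "#", or the marker "SA".
PX_LABEL = re.compile(r"P[123](-\d)?")  # assumed labels for (Px), e.g. "P1-1"
SPECIAL_LEAVES = {"$", "#", "SA"}


def kind(child):
    """Node label for subtrees, the symbol itself for "$", "#", "SA", else "w"."""
    if isinstance(child, tuple):
        return child[0]
    return child if child in SPECIAL_LEAVES else "w"


def well_formed(node):
    """Check a node and its whole subtree against rules (1)-(9)."""
    label, children = node[0], node[1:]
    kinds = [kind(c) for c in children]
    if label == "S":                      # rule (1): (PC){1+} followed by "$"
        ok = len(kinds) >= 2 and kinds[-1] == "$" and all(k == "PC" for k in kinds[:-1])
    elif label == "PC":                   # rule (2): (PP){1+} followed by "#"
        ok = len(kinds) >= 2 and kinds[-1] == "#" and all(k == "PP" for k in kinds[:-1])
    elif label == "PP":                   # rule (3): (P0) then (Px)
        ok = len(kinds) == 2 and kinds[0] == "P0" and bool(PX_LABEL.fullmatch(kinds[1]))
    elif label == "P0":                   # rules (4)-(6)
        body = kinds[1:] if kinds[:1] == ["SA"] else kinds
        ok = not kinds or (len(body) >= 1 and all(k == "PW" for k in body))
    elif PX_LABEL.fullmatch(label):       # rules (7) and (8)
        if kinds[:1] == ["SA"]:
            ok = len(kinds) >= 3 and all(k == "PW" for k in kinds[1:])
        else:
            ok = kinds == ["PW"]
    elif label == "PW":                   # rule (9): one or more words
        return len(kinds) >= 1 and all(k == "w" for k in kinds)
    else:
        return False
    return ok and all(well_formed(c) for c in children if isinstance(c, tuple))


# Illustrative tree for "Ukradli nám auto." with explicit pauses.
tree = ("S",
        ("PC",
         ("PP",
          ("P0", ("PW", "ukradli", "nám")),
          ("P1-1", ("PW", "auto"))),
         "#"),
        "$")
print(well_formed(tree))
```

A checker of this kind only accepts or rejects a given tree; choosing the right tree for an input sentence is the task of the prosodic parser and is not addressed by the sketch.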

The grammar can be transformed into Chomsky normal form, yet the “intuitive” form shown above is more explanatory. Rule (1) rewrites a sentence into one or more prosodic clauses followed by an inter-sentence pause. Rule (2) analogously rewrites a prosodic clause into prosodic phrases followed by an inter-clause (i.e. intra-sentence) pause. The important rule (3) represents the structure of a prosodic phrase: a prosodic phrase consists of two prosodemes, a “null” prosodeme and a “functionally involved” prosodeme. The “null” prosodeme can either be empty or consist of any number of prosodic words, according to rules (4) and (5), or it can comprise a semantic accent at the beginning followed by at least one prosodic word, as in rule (6), which is used for a specific “wh-” question intonation. Rules (7) and (8) represent the structure of a “functionally involved” prosodeme: it either consists of a single prosodic word (the common case with the automatic “unmarked” position of the intonation centre at the end of a sentence), or it starts with a semantic accent followed by at least two prosodic words (this is a “marked” scheme reflecting the topic-focus articulation).

The relation between semantic distinctions and prosodic features can be seen in the following examples. The question “Why are you so sad?” can easily have the answer “They stole our car.” (in Czech: “Ukradli nám auto.”). Such a situation is represented by this structure:

((S) ((PC) ((PP) ((P0) ukradli nám) ((P1-1) auto))))

On the contrary, if one asks “Did they steal your motorbike again?”, the reply can be “CAR it is, what they stole.” (in Czech, more flexibly: “AUTO nám ukradli.”):

((S) ((PC) ((PP) ((P0) ∅) ((P1-1) (SA) auto nám ukradli))))

Another example demonstrates different prosodic structures of the “same” sentence (the same on the surface level, not in the tectogrammatics) even without changing the word order:

((S) ((PC) ((PP) ((P0) I saw) ((P3-1) an airplane)) ((PP) ((P0) flying) ((P1-1) to Chicago))))

((S) ((PC) ((PP) ((P0) I saw) ((P1-1) (SA) an airplane flying to Chicago))))

2.2 Prosody-Semantics Relation

Now let us suppose we have a sentence \(S\) with one and only one corresponding tectogrammatical structure \(T_S\) (possible homonymy is thus disambiguated). Let \(\mathcal{P}_S\) be the set of all conceivable prosodic structures of the sentence \(S\) without considering \(T_S\). Then \(T_S\) determines a subset \(\mathcal{P}_{T_S} \subseteq \mathcal{P}_S\) of the prosodic structures “allowed” for this particular \(T_S\), i.e. for the given semantic interpretation.

The different structures from \(\mathcal{P}_{T_S}\) may vary slightly in the deployment of prosodic phrases or other units, but only to an extent that does not contradict the given \(T_S\). The relation between the prosodic structures from \(\mathcal{P}_{T_S}\) and the appropriate surface prosodic forms of the uttered sentence \(S\) (i.e. F0, intensity and segmental duration) is crucial for prosody modelling in TTS systems and is analysed in [2].

The opposite process (e.g. suitable for ASR and natural language understanding) assumes we have an uttered sentence \(S_u\) with particular (measurable) surface prosodic features (intonation, etc.) \(i_{S_u}\). The sentence \(S_u\) can be assigned a set \(\mathcal{T}_S\) of all its possible tectogrammatical representations. Each element \(T_S^j \in \mathcal{T}_S\) (the j-th possible tectogrammatical structure, i.e. the j-th meaning) determines its own \(\mathcal{P}_{T_S}^j\) as described above. The surface prosody \(i_{S_u}\) is assigned (theoretically using an appropriate classifier) a set \(\mathcal{P}_i\) of all its possible deep prosodic interpretations (i.e. homonymous prosodic tree structures having the same allowed surface form, again see [2]). As a result we obtain the set of correct semantic interpretations of the uttered sentence \(S_u\): it is a subset \(\mathcal{T}_C \subseteq \mathcal{T}_S\) such that

\[
\forall T_S^j \in \mathcal{T}_S : \; T_S^j \in \mathcal{T}_C \Leftrightarrow \exists p : p \in \mathcal{P}_{T_S}^j \wedge p \in \mathcal{P}_i \qquad (10)
\]

Informally: \(T_S^j\) is a correct semantic interpretation of \(S_u\) provided that at least one prosodic structure determined by \(T_S^j\) is also allowed by (or is “underlying”) the surface prosody \(i_{S_u}\). If the set \(\mathcal{T}_C\) has more than one element, the meaning of the uttered sentence cannot be fully disambiguated through prosody alone, and further context of the sentence must be taken into account to prune the set \(\mathcal{T}_S\). Nevertheless, the aforementioned relations capture well the mutual functioning of prosody, semantics and the deliberate acting of a speaker.
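Formula (10) amounts to keeping every candidate meaning whose set of allowed prosodic structures shares at least one element with the set of structures compatible with the measured surface prosody. A minimal Python sketch follows; the function name, the dictionary representation and the identifiers `reading_1`, `reading_2`, `p_a`, `p_b`, `p_c` are hypothetical placeholders introduced here, not data from the corpus.

```python
def correct_interpretations(allowed_by_meaning, allowed_by_surface):
    """Return the set T_C of formula (10).

    allowed_by_meaning: dict mapping each candidate tectogrammatical
        structure (meaning) to the set of prosodic structures it allows.
    allowed_by_surface: set P_i of deep prosodic structures compatible
        with the observed surface prosody.
    """
    return {
        meaning
        for meaning, structures in allowed_by_meaning.items()
        if structures & allowed_by_surface  # non-empty intersection
    }


# Hypothetical example: two readings of one sentence, three prosodic structures.
allowed = {
    "reading_1": {"p_a"},
    "reading_2": {"p_b", "p_c"},
}
print(correct_interpretations(allowed, {"p_b"}))
```

If the result contains more than one reading, prosody alone does not disambiguate the utterance, which corresponds to the case described above where further context is needed.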

3 Prosodic Data

Since this paper does not deal with the relation between the prosodic structures and the surface prosodic form (concrete F0, intensity and duration), we leave the question of gathering such data open. The issue of generating the surface prosody from the prosodic structures for a TTS system is discussed in [2]. If there is to be a prosodic parser suitable for TTS systems (i.e. a parser producing correct prosodic structures for input sentences, perhaps with partially or fully unknown tectogrammatical structures), a sufficient amount of suitable training and testing data is needed.

3.1 Prosodic Corpus Annotations

We chose the 3 most frequent radio news announcers from the Czech TV & Radio Broadcast News Corpus recorded at the Department of Cybernetics, University of West Bohemia in Pilsen (see [3], [4]) and set up a project of manual prosodic annotation of their speech (applying additional constraints and criteria, approx. 1,500 sentences altogether were selected from the total of 16,483). Since this was the first attempt to apply the presented prosodic theory to a real corpus, the amount of data was relatively small, and so was the number of annotators involved. Another drawback is that the corpus consists overwhelmingly of declarative sentences.

The first stage of the corpus processing involved three annotators, each annotating one speaker. The second stage is still in progress and involves independent re-annotation of all three speakers (each by an annotator other than the one in the first stage), which allows the inter-annotator consensus to be assessed. We do not expect this consensus to be very high because prosodic annotation often requires subjective judgement; rather, we expect each annotator to show a specific style of “prosody understanding”, and we presuppose this style to be more or less constant across the annotations of different speakers.

During the annotation process each annotator has available the waveforms of all uttered sentences together with their transcriptions and F0, filtered F0 and intensity curves. Moreover, a manual and a set of pre-annotated examples are at their disposal to set a general standard for the annotation and to provide precedents for arguable or ambiguous cases.

The F0 curve was extracted from the waveforms using the RAPT algorithm [6] implemented in the Snack toolkit [7]. Although very robust, this algorithm is still susceptible to doubling and halving errors. It has been shown [8] that the estimated pitch affected by halving and doubling follows the lognormal tied mixture (LTM) probabilistic distribution with 3 parameters

\[
\log(\hat{F}_0) \sim \mathrm{LTM}(\mu, \sigma, \lambda_1, \lambda_2, \lambda_3) = \lambda_1 \cdot N(\mu - \log(2), \sigma^2) + \lambda_2 \cdot N(\mu, \sigma^2) + \lambda_3 \cdot N(\mu + \log(2), \sigma^2) \qquad (11)
\]

and the constraint \(\sum_{i=1}^{3} \lambda_i = 1\). The parameters \(\lambda_i\) can be estimated by the Expectation-Maximisation (EM) algorithm. Once the parameters are estimated, it is possible to exclude those pitch values that are more likely to be halved or doubled than correct. Since the LTM-fitted F0 curve still incorporates microprosody features irrelevant to the suprasegmental scope of the prosodic structures, it is useful to remove them by modelling (i.e. filtering) each voiced region by a piecewise linear function with nodes \(\{x_k, g(x_k)\}_{k=0}^{K}\): g(x) =

K X

(ak x + bk ) I[xk−1