Resolving Ambiguities in Sentence Boundary Detection in Russian Spontaneous Speech

Anton Stepikhov

Department of Russian, St. Petersburg State University,
11 Universitetskaya emb., 199034 St. Petersburg, Russia
[email protected]

Abstract. The paper analyses inter-labeller agreement within manual annotations of transcribed spontaneous speech and suggests a way to resolve ambiguities in expert labelling. It argues that the number of controversial sentence boundaries may be reduced if some of them are regarded as "zones". We describe a technique for detecting these zones and analyse which syntactic structures are the most likely to appear in them. Though the approach is based on Russian language material, it may be applied to oral texts in other languages.

Keywords: sentence boundary detection, manual annotation, segmentation, spontaneous speech, oral text, monologue, Russian language resources.

1 Introduction

The problem of sentence boundary detection in spontaneous speech has been known for decades. As [1] points out, though any text can be segmented into sentences, not every text can be segmented unambiguously. At the same time, information about sentence boundaries is a key requirement for natural language processing: it improves linguistic analysis in text mining systems and information retrieval, as well as text processing techniques such as part-of-speech tagging, parsing and summarisation [2]. The presence of sentence boundaries in speech recognition output also enhances its human readability [3].

Information about sentence boundaries in unscripted speech is acquired through sentence boundary labelling, which can be performed manually or automatically. There are various models of automatic sentence boundary detection ([2,3,4,5,7]), and most of them are focused on reproducing expert manual annotation. It is expert annotation that is the subject of this paper.

It is known that the extent of inter-labeller agreement varies (e.g. [3]). This paper is an in-depth exploration of the ambiguity of manual annotation. Understanding its nature may suggest ways of resolving ambiguities and, consequently, improve the human reference standard that automatic sentence boundary detection models aim to reproduce.

2 Data and Method Description

2.1 Corpus

The study is based on a corpus of spontaneous monologues. The corpus is balanced with respect to the speakers' age, gender and profession (linguists and non-linguists). It consists of 160 texts obtained from 32 speakers. The speakers were well acquainted with the person making the recording, which made their speech maximally natural.

Each speaker performed five different tasks: two story retellings, two types of picture description and a topic-based story. For the retellings, the stories were read and subsequently retold from memory. For the picture descriptions, the speakers examined and described the pictures simultaneously. The stories and pictures were the same for each speaker. For the topic-based story, speakers commented on one of two themes: "My leisure time" or "My way of life".

The recordings were made either in a soundproof room at the Department of Phonetics of St. Petersburg University or in a quiet space in field conditions. The overall duration of the recorded texts is about 9 hours.

The corpus was annotated according to the technique described below. The corpus and annotation were released as part of the Russian National Corpus [9] (spoken corpora).

2.2 Method

Recordings of unscripted speech were transcribed orthographically by the author. The transcription did not contain any punctuation. To make reading and perception of the text easier, graphic symbols of hesitation (like eh, uhm) and other comments (e.g. [sigh], [laughter]) were also excluded.

The transcripts were then manually segmented into sentences by a group of experts consisting of 20 professors and students of the Philological Faculty of St. Petersburg University (Russia) and of the Faculty of Philosophy of the University of Tartu (Estonia). All were native speakers of Russian. The experts were asked to mark sentence boundaries using conventional full stops or any other symbol of their choice (e.g. a slash).¹ The experts were presumed to have a native intuition of what a sentence is, so the notion was left undefined. They were not time-constrained while performing the annotation.

Sentence boundary identification can be based on textual and prosodic information. The interaction between the two is not yet fully understood. For example, [6] showed that the influence of the semantic factor on segmentation outweighs that of the tone factor. In addition, the analysis of sentence boundary detection in a Russian ASR system reveals that in Russian spontaneous speech it is difficult to detect boundaries based on prosodic cues alone [7]. In our experiment, the experts had no access to the actual recordings. This approach allows us to focus on semantic and syntactic factors only and to separate them from prosodic factors. At the same time, reading a text implies mentally reproducing it and, hence, segmenting it in inner speech. Thus, the lack of information about the speaker's intonation is to some extent compensated by the reader's prosodic competence, which allows him or her to feel the rhythm and melodic texture of sentences without their physical conversion into sound [8].

As a result of this experiment, 20 versions of the syntactic segmentation of the proposed texts into sentences were obtained (3,200 marked texts in total). They were then subjected to further statistical analysis.

¹ A similar approach to the analysis of Russian spontaneous speech has been used by some researchers earlier (e.g. [6]). The method suggested here, however, allows ambiguities in sentence boundary detection to be predicted and resolved, as will be shown below.
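For concreteness, each expert's marked transcript can be reduced to a set of inter-word positions before analysis. The following sketch (illustrative Python; the paper publishes no code, and the function name and position convention are our assumptions) shows one way to do this:

```python
def boundary_positions(marked_text, boundary_chars=".|/"):
    """Turn one expert's marked transcript into a set of inter-word
    positions: position i means "a boundary after the (i+1)-th word"."""
    positions, n_words = set(), 0
    for token in marked_text.split():
        word = token.rstrip(boundary_chars)
        if word:                      # a word, possibly with a trailing mark
            n_words += 1
            if word != token:
                positions.add(n_words - 1)
        elif n_words > 0:             # a free-standing boundary symbol
            positions.add(n_words - 1)
    return positions

# boundary_positions("пошел в модный салон . примерил шляпу /")  ->  {3, 5}
```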

3 Data Analysis

3.1 Analysis of Inter-Labeller Agreement

The analysis of inter-labeller agreement shows that the experts disagree in their marking of sentence boundaries. For each position in the text, the number of experts who marked a boundary at that position was computed. This number can be interpreted as a "boundary confidence score" (BCS), which ranges from 0 (no boundary marked by any expert) to 20 (a boundary marked by all experts, i.e. 100% confidence). The distribution of BCS values in the corpus is illustrated in Fig. 1.
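Computed over all experts, the BCS is simply a per-position count. A minimal sketch, continuing the set-of-positions representation assumed above (illustrative only, not the author's code):

```python
from collections import Counter

def boundary_confidence_scores(annotations, n_positions):
    """BCS for every inter-word position in one text.

    `annotations` holds one set of marked positions per expert, so the
    score of a position is the number of experts who marked it
    (0 .. len(annotations))."""
    counts = Counter(pos for expert in annotations for pos in expert)
    return [counts[i] for i in range(n_positions)]

# Toy example: three experts, five inter-word positions.
experts = [{1, 4}, {1}, {1, 2}]
print(boundary_confidence_scores(experts, 5))  # [0, 3, 1, 0, 1]
```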

[Figure 1 is a bar chart; x-axis: boundary confidence score (1–20), y-axis: the number of positions with a certain BCS. The counts fall from 2,203 at BCS = 1 through 1,133, 761, 601, 497 and 416 for BCS 2–6, stay in the range 254–358 for BCS 7–19, and rise to 428 at BCS = 20.]

Fig. 1. The number of positions with a certain BCS in the corpus

A reasonable approach would be to accept as a sentence end any position with a BCS of no less than 12 (i.e. 60% of the experts). With this approach, we found that almost 70% of the positions marked as sentence boundaries should not be taken into consideration for sentence boundary detection, because their BCS fails to reach the threshold. It is worth mentioning that the largest group of positions with BCS > 0 have BCS = 1 (22.1% of all such positions). This fact reveals the high degree of variability in the detection of possible boundaries.
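Applied to the scores above, the threshold rule is a one-line filter (again an illustrative sketch; only the 60% cut-off itself is taken from the paper):

```python
THRESHOLD = 12  # 60% of the 20 experts

def accepted_boundaries(bcs):
    """Positions whose confidence score reaches the threshold."""
    return [i for i, score in enumerate(bcs) if score >= THRESHOLD]
```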

3.2 Resolving Ambiguities

Using the threshold approach is, however, just the first step in detecting sentence boundaries. Examination of the data reveals labelled positions which require more careful consideration, as in the example below, where the figures show the BCS at each marked position:

Пошел в модный салон 10 вот 6 примерил шляпу 13
'He went to a boutique 10 so 6 tried on a hat 13'.

Based on the threshold defined above (BCS = 12), the boundary marks flanking the word вот (BCS 10 and 6) should be ignored. On the other hand, some experts could have detected the boundary before this word and others after it. In that case, as many as 16 (10 + 6) experts might have associated the sentence boundary with the word вот ('so'). If the experts' estimation is ignored, two sentences are merged into one, which distorts the linguistic analysis of the text. Thus, in such situations the simple threshold approach routinely underestimates the number of sentences in the data and misses possible boundaries.

Another situation is, however, possible: the same experts might have marked the boundary twice, on both sides of the word. In that case the word would be treated as a sentence boundary by only 10 experts rather than 16; that is, the final confidence score for this boundary falls below the threshold.

To resolve this ambiguity, the following method was developed. We identified single words and short segments that met the following conditions: 1) a boundary was marked both before and after them, 2) each of the two positions had a confidence score below the established threshold, but 3) the two scores could reach it in sum, for example: 7 вот 9 ('7 so 9'). The next step was to ascertain how many experts had marked a sentence boundary within such a zone. Here, only one sentence boundary marker per expert was considered relevant: if a given segment had been labelled twice by the same expert, before and after it, one of the two markers was ignored. As a result, cases of a "double boundary" within a zone were excluded, and the exact number of experts who had associated a boundary between sentences with the zone was obtained.

We propose that if the final confidence score passes the threshold, the word or segment should be treated as a special "boundary zone". Such zones indicate a boundary between two sentences without specifying its exact location. After the described processing, the example given above looks like this²:

Пошел в модный салон [вот] (12). Примерил шляпу (13).

'He went to a boutique [so] (12). Tried on a hat (13).'

² Sentence boundary zones are enclosed in square brackets; the final boundary confidence score is shown in round brackets. The same notation is used in the examples below and in the Appendix.

The final confidence score of 12 rather than 16 means that 4 annotators had marked the boundary twice, and in these cases one of the labels was ignored; the other 8 experts had labelled the boundary either before or after the word вот ('so'). On the basis of this analysis of the manual annotation, the transcribed texts were segmented into sentences. With this approach, we identified 284 such zones in the corpus. This increased the total number of sentences from 2,690 to 2,974, with zone boundaries accounting for almost 10% of all boundaries.
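The whole zone-detection procedure can be summarised in a short sketch. This is an illustrative reconstruction under the same assumed data layout as above, not the author's implementation; the zone length limit follows the 1–3 word range mentioned in the conclusions:

```python
from collections import Counter

def find_boundary_zones(annotations, threshold=12, max_span=3):
    """Find boundary zones: pairs of nearby positions that are each below
    the threshold but are jointly marked by enough distinct experts.

    An expert who marked both flanking positions is counted only once,
    which excludes "double boundaries" within a zone."""
    bcs = Counter(p for expert in annotations for p in expert)
    zones = []
    for left in sorted(bcs):
        for right in range(left + 1, left + max_span + 1):
            if right not in bcs:
                continue
            if bcs[left] >= threshold or bcs[right] >= threshold:
                continue  # a flank passing on its own is a plain boundary
            # count distinct experts supporting either flank of the zone
            support = sum(1 for e in annotations if left in e or right in e)
            if support >= threshold:
                zones.append((left, right, support))
    return zones
```

For the example above (BCS 10 and 6 on the two sides of вот, with 4 experts marking both), the zone's support is 10 + 6 − 4 = 12, which passes the threshold.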

3.3 Syntactic Content of Sentence Boundary Zones

To understand the variability in sentence segmentation, the syntactic content of these zones was examined.

Table 1. Syntactic content of sentence boundary zones

     Syntactic content of the boundary zone                      Count  Percentage
 1   Discourse markers / filler words, parenthesis, function      156     54.9%
     words (particles, conjunctions), interjections,
     onomatopoeia
 1a    Discourse marker / filler word                             111     39.1%
 1b    Parenthesis                                                 14      4.9%
 1c    Combination of two or three discourse markers and / or      13      4.6%
       function words, parenthetic words, interjection
 1d    Particle, conjunction                                       12      4.2%
 1e    Interjection, onomatopoeia                                   6      2.1%
 2   Subordinate parts of the sentence                              39     13.7%
 3   Disfluencies                                                   38     13.4%
 4   Syntactically ambiguous content                                19      6.7%
 5   Principal parts of the sentence                                16      5.6%
 6   Other cases                                                    16      5.6%
     TOTAL AMBIGUOUS CASES                                         284    100%

Table 1 reveals that most of the boundary zones are filled with semantically and syntactically weakened elements: discourse markers, parenthesis, function words, interjections, onomatopoeia and disfluencies such as breaks, repetitions and revisions. Types 1 and 3 together constitute almost 70% of all cases. The largest share of sentence boundary zones (39.1%) contains discourse markers / filler words (see 3.2 and the Appendix for examples).

Boundary zones rather often contain subordinate parts of the sentence (type 2; see the Appendix for an example), mostly adverbial modifiers of various types (10.9%). The appearance of these zones may be explained by the fact that such units can be associated with both the preceding and the following segment. Another reason may be language-specific features such as the rich inflectional morphology of Russian and, hence, its relatively free word order.

Type 4 is functionally close to type 2. The distinction between them is that the content of sentence boundary zones of type 4 can perform various syntactic functions depending on whether it is associated with the preceding or the following sentence. In example 4 (see Appendix), the content of the boundary zone may be either a predicate (within the left context) or parenthesis (within the right one). Besides the common explanation, types 5 and 6 are also conditioned by the possibility of asyndetic connection between homogeneous parts of the sentence and between clauses.

4 Discussion and Conclusions

We have presented a corpus of Russian spontaneous monologues and a possible approach to sentence boundary detection in oral speech. The approach is based on the analysis of expert manual annotation of transcriptions. The proposed method involves a linguistic experiment and can be used without drawing on recorded speech. Enlisting a sufficient number of experts makes it possible to acquire data without explicit information about text prosody. It should also be mentioned that manual annotation of sound recordings by a comparably large group of experts is, by comparison, a rather demanding if not impossible task.

Our approach provides a confidence score for each position that was interpreted as a sentence boundary by the experts (the boundary confidence score, BCS). The BCS captures the unequal status of annotated boundaries and is therefore a valuable resource for developing and/or adjusting models of automatic sentence boundary detection in unscripted speech.

The analysis of inter-labeller agreement reveals the difficulty of segmenting spontaneous text into sentences. Only about 30% of all labelled positions in the corpus reached the threshold and thus proved relevant for sentence boundary detection. This figure indicates a high degree of variability in the detection of possible boundaries, which may be explained by the widespread asyndetic and paratactic connection between clauses in spontaneous speech. The possibility for subordinating conjunctions to begin a sentence may also be a contributing factor.

We demonstrate that the ambiguity of expert annotation may be resolved by considering the sentence boundary in spontaneous speech not only as a certain point between two sentences but also as a zone of relatively small length (as a rule, 1–3 words). The phenomenon of sentence boundary zones may be explained by language-specific features such as the rich inflectional morphology of the Russian language and its relatively free word order; alternatively, these may be features general to all languages that merely await identification. Detecting these zones yields more precise data for the syntactic analysis of unscripted speech. In our corpus, almost 10% of sentence boundaries were detected as zones, with about 70% of them filled with semantically weakened elements.

It is worth mentioning that some discourse markers were interpreted by the experts as separate sentences, with boundary marks on both sides of the word. This segmentation reflects the real prosodic characteristics of such utterances: in oral speech, their tone contour corresponds to that of a declarative sentence. This fact reveals the reality of human prosodic competence and shows that information about sentence boundaries may be obtained without explicit information about text intonation.

In some cases the exact place of a sentence boundary can be disambiguated by the prosodic characteristics of recorded speech. Therefore, in future work we plan to collect expert annotations of recorded speech and to compare the results of the two types of labelling (based on prosodic and on textual information). This study is also intended as groundwork for developing a model to predict sentence boundaries in oral speech.

5 Appendix

The examples below illustrate some of the syntactic content of sentence boundary zones. They are limited to the cases most illustrative in terms of translation. The numbering corresponds to the types of syntactic content given in Table 1.

1b. В общем-то изображено на нее на ней изображен какой-то крестьянский хутор ну такой довольно основательный [как я понимаю] (18). Это каменные здания каменный забор какие-то люди ходят во дворе (12).
'In fact it depicts a pretty well-built peasant farm [as I understand] (18). These are stone buildings stone wall some people are walking around the yard (12).'

1e. Из-за этого меня иногда путали с пугалом думали что это я в огороде стою окликали [эй] (16). Нет я не откликался вернее я хэ я вообще не слышал [пуг...] (13).
'Because of this, I was sometimes mistaken for a scarecrow they thought that it was me who was standing in the garden and called [hey] (16). No, I didn't reply that is I ha I didn't hear at all [sca...] (13).'

2. Ну вообще люблю например Зоологический музей кроме того ну и чего-то почитать на эту тему тоже не прочь съездить на природу [зимой] (13). Катаюсь на горных лыжах
'Well in general I like for example The Museum of Zoology too and to read something on the topic as well I would not mind a trip to the country [in winter] (13). I go downhill skiing'

3. Значит речь идет это самое дело происходит в селе [в селе] (19). Это осень и это время когда поспели яблоки (17).
'Well it is about well it happens in the village [in the village] (19). It is autumn and it's the time when the apples have ripened (17).'

Рассказчик вспоминает раннюю осень когда собирают урожай яблок [и ему] (15). Урожай очень большой (16).
'The narrator recalls early autumn when people are harvesting apples [and he] (15). The harvest is very large (16).'

Такой ненастный день видимо какой-то [хотя вроде бы нет] (13). Ну нет дождя нет конечно но ветер там не знаю дым дым из трубы идет
'Nasty day, it looks like [though maybe not] (13). Well there's no rain of course but wind there I don't know smoke smoke is coming out of the chimney'

4. Нашли человека и поняли человек не мог пешком далеко-то уйти значит где-то не так далеко и жилье [может быть] (16). Это и спасло новоселковских мужиков
'They found a man and realised the man couldn't walk a long distance on foot it means that somewhere not far away a house [maybe] (16). It saved the men from Novosyolki'

5. Сделал себе покушать [поел] (12). Вот пошел естественно к компьютеру проверил почту посмотрел кто тебе позвонил кто не позвонил (14).
'You cook some food [eat it] (12). You go of course to the computer check the mail look who has given you a call who hasn't (14).'

6. Возможно это сумерки [тени ложатся] (12). И вот хорошо тепло
'It may be twilight [shadows are falling] (12). It's pleasant warm'

Acknowledgments. The paper has benefited greatly from the valuable comments and questions of Dr. Anastassia Loukina and Dr. Walker Trimble. The author also thanks all the speakers and experts who took part in the experiment.

References

1. Skrebnev, Yu. M.: Vvedenie v kollokvialistiku. Izdatel'stvo Saratovskogo universiteta, Saratov (1985). (in Russian)
2. Nasukawa, T., Punjani, D., Roy, S., Subramaniam, L. V., Takeuchi, H.: Adding Sentence Boundaries to Conversational Speech Transcriptions Using Noisily Labelled Examples. In: AND 2007, pp. 71–78 (2007)
3. Liu, Y., Chawla, N. V., Harper, M. P., Shriberg, E., Stolcke, A.: A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech. Computer Speech and Language 20(4), 468–494 (2006)
4. Gotoh, Y., Renals, S.: Sentence Boundary Detection in Broadcast Speech Transcripts. In: Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW), Paris, France, September 18–20, 2000, pp. 228–235 (2000)
5. Kolář, J., Liu, Y.: Automatic Sentence Boundary Detection in Conversational Speech: A Cross-Lingual Evaluation on English and Czech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2010), Dallas, Texas, USA, March 14–19, 2010, pp. 5258–5261 (2010)
6. Vannikov, Yu., Abdalyan, I.: Eksperimental'noe issledovanie chleneniya razgovornoj rechi na diskretnye intonacionno-smyslovye edinicy (frazy). In: Sirotinina, O. B., Barannikova, L. I., Serdobintsev, L. Ja. (eds.) Russkaya razgovornaya rech, pp. 40–46. Saratov (1973). (in Russian)
7. Chistikov, P., Khomitsevich, O.: Online Automatic Sentence Boundary Detection in a Russian ASR System. In: Potapova, R. K. (ed.) SPECOM 2011: The 14th International Conference "Speech and Computer", Kazan, Russia, September 27–30, 2011, pp. 112–117 (2011)
8. Gasparov, B. M.: Yazyk, pamyat', obraz. Lingvistika yazykovogo sushchestvovaniya. Novoe literaturnoe obozrenie, Moscow (1996). (in Russian)
9. Russian National Corpus, http://www.ruscorpora.ru/en/index.html