Parsing transcriptions and searching for phonotactic patterns

Sofia Strömbergsson [email protected]

Master's Thesis, Spring Term 2000
Language Engineering Programme
Department of Linguistics, Uppsala University

Academic supervisor: Lars Borin
Industrial supervisor: Kolbjørn Slethei


Abstract

This thesis describes the development of a phonotactic toolkit consisting of, on the one hand, a phonotactic parser and, on the other, a phonotactic search engine. The phonotactic parser is used at Nordisk Språkteknologi as a proof-reading tool for manually written transcriptions: it automatically finds violations of the transcription conventions. One possible use of the phonotactic search engine is as an aid in formulating new transcription conventions; by automatically extracting all occurrences of a specified phonotactic pattern, it becomes easier to survey the consequences of a new convention. Both the parser and the search engine handle the three Scandinavian languages Swedish, Norwegian and Danish. The structure of the tools is modular, in the sense that the parsing and searching routines are general and shared by all three languages, while the language-specific lexicons are interchangeable.

Acknowledgements

This work has been funded by Nordisk Språkteknologi (NST). There are many people who have supported me during the work. My supervisor from Uppsala University, Lars Borin, has given me valuable feedback and new angles of approach to problems. Many thanks to Kolbjørn Slethei, my industrial supervisor, who has given me guidance as well as encouragement through this work. Lars Warmenius and Catharina Kylander at NST have been extremely patient with my questions concerning everything from transcription conventions to requirements on the developed tools.


Contents

Abstract
Acknowledgements
Contents
1 Introduction
  1.1 Purpose
  1.2 Outline of this thesis
    1.2.1 Notation conventions
2 Background
  2.1 Transcription syntax
  2.2 Transcription errors
  2.3 Earlier work
3 The phonotactic toolkit
  3.1 Description of ParseTranscriptions and PhonotacticSearch
    3.1.1 The output
  3.2 Functionality of ParseTranscriptions and PhonotacticSearch
    3.2.1 Lexical analysis
    3.2.2 The syntactic analysis in ParseTranscriptions
      3.2.2.1 Parsing at the symbol level
      3.2.2.2 Stress parsing
      3.2.2.3 Syllable parsing
    3.2.3 The matching procedures in PhonotacticSearch
      3.2.3.1 matchingSyllSpecs
  3.3 Implementation
    3.3.1 Representation of transcription conventions
      3.3.1.1 Forbidden sequences
      3.3.1.2 Unusual sequences
      3.3.1.3 Transcription correspondence to some decomposition subsequence
      3.3.1.4 Transcription sequences allowed only if origin code is equal to X
      3.3.1.5 Stress patterns allowed only if conditions X, Y and/or Z are satisfied
      3.3.1.6 Geminate consonants over syllable boundaries allowed or not
  3.4 Results
4 Discussion
  4.1 Future prospects
References
Appendix 1: Terminology
Appendix 2: Overview of XSAMPA
Appendix 3: Examples of output from ParseTranscriptions 1.0
Appendix 4: Parsing results for Norwegian and Danish
  Norwegian parsing results
  Danish parsing results

1 Introduction

An important part of the development of the speech recognition and speech synthesis products at Nordisk Språkteknologi (henceforth: NST) is the work of transcribing1. In many computational linguistic products, transcriptions are an important link between text and speech, and this is the case at NST as well. For such an application to work as expected, the transcriptions must be correct and consistent. As an example, say that the transcription of a word contains some symbol that is not allowed in the transcription language. Then, in a program that converts the user's speech signal to text, the computer would not be able to make the connection between the user's pronunciation of the word and the transcription. This association has to be made in order for the program to choose the correct orthographic output string from the transcription. If the error occurs in a program that produces synthesised speech from text, the program would either have to guess how the word should be pronounced, or signal that a pronunciation of the word could not be produced. Such errors are of course not acceptable.

However, there are several reasons why correctness and consistency are hard to achieve. First, the transcriptions are written and corrected by hand, and there is therefore a great risk that they contain errors and inconsistencies. Second, the transcriptions are written in a phonetic alphabet that is initially unfamiliar to the transcribers, which is an important source of error. Third, as the transcribing work is done by several different people, consistency is hard to achieve. Accordingly, one has to assume that the transcriptions contain errors and inconsistencies, and one needs a way to find them. Looking for the errors manually would be time-consuming and tiresome work, and there would still be no guarantee that all errors and inconsistencies were found. An automatic transcription checker – a parser – is a better solution. By performing the proofreading automatically, both time and labour are saved, and this kind of tool therefore has considerable economic potential. The development of a transcription parser is described in this thesis.

In order to avoid inconsistencies in the transcriptions, transcription conventions have been formulated and are still being formulated. In this work, it is often difficult to survey the consequences of a suggested convention. A phonotactic search engine, which automatically extracts all transcriptions matching a phonotactic pattern specified by the user, would clearly be a useful tool in this connection.

When formulating explicit rules and conventions about natural language, as is necessary when parsing transcriptions, one discovers that the translation from the intuitive knowledge that humans possess about language to a machine-readable form is not straightforward. These difficulties restrict which errors and inconsistencies can be found automatically at all. The problem of transferring intuitive knowledge to a machine-readable form is obviously not restricted to parsing transcriptions, but is encountered in all NLP and AI applications.

1.1 Purpose

I was given the assignment to develop a parser for transcriptions, such that illegal – or "non-grammatical" – transcriptions would be identified and diagnosed given the "grammar" governing the transcription syntax. The goal was to find as many errors in the input transcriptions as possible, at as many levels as possible, for example at the syllable level, the symbol level2 and the stress structure level. The parser should be modular, in the sense that language-specific conditions and rules should be kept separate from the general module. Such a structure facilitates replacing, for example, the Swedish conditions with Norwegian conditions. As a next step, the parser should be transformed into a phonotactic search engine that could identify simple characters and combinations of characters in transcriptions.

1 A transcription is a written representation of the pronunciation of some string. One might look at the transcriptions as an intermediate step in the transformation process from text to speech or vice versa. Most often an automatically generated transcription (generated by a grapheme-to-phoneme converter) is presented to the transcribers at NST, together with the orthographic representation of the word and its morphological decomposition. From that information the transcribers correct the transcription, and in the cases where no transcription has been proposed, the transcribers produce a transcription themselves.
2 The transcription alphabet consists of different symbols, and obviously there are constraints regulating what is a legal symbol and what is a legal sequence of symbols.

The search engine should allow searches for patterns in transcription strings, but also for correspondences between different aspects of a lexical entry, for instance between the transcription string and the orthographic form of the word. Since the future users' expectations of the search engine were only vaguely formulated, the search engine should be as flexible and powerful as possible. The parser and the search engine constitute a phonotactic toolkit that has been constructed to handle Swedish, Norwegian and Danish transcriptions. The work has been carried out within the LDB project at NST.

1.2 Outline of this thesis

This report describes the development of a phonotactic toolkit, consisting of a transcription parser and a phonotactic search engine. In Section 2, some background is presented, and the need for correct and consistent transcriptions is discussed. The transcription syntax, i.e. the rules that the transcriptions must conform to, is discussed in 2.1, and common violations of the syntax are presented in 2.2. Earlier work on phonotactic parsers and search engines is discussed in Section 2.3. In Section 3, the phonotactic toolkit is presented; a description of the input and output of the programs, and of what the tools demand from the user, is given in Section 3.1. A functional description of the tools is found in Section 3.2. In Section 3.3 an overview of the implementation is presented, and in 3.4 the results of running the programs are discussed. Possible improvements of the tools are discussed in Section 3.5. In Section 4 the overall results are discussed, as well as the use of the developed tools; some planned adaptations of the tools are also mentioned.

1.2.1 Notation conventions

The transcriptions are written in L&H+, which is a company-internal transcription language and, as such, confidential within Lernout & Hauspie and associates. As a consequence, L&H+ could not be used or referred to in this document, and transcriptions are therefore written in XSAMPA3, a public phonetic alphabet which also has the essential feature of being machine-readable. See Appendix 2 for an overview of XSAMPA. In XSAMPA there is no character for indicating syllable boundary, as there is in L&H+. As the character /]/ does not bear any other meaning in XSAMPA, it has been used in this document to indicate syllable boundaries. XSAMPA has no means of expressing the distinction between toneme 1 and toneme 2, which is necessary when dealing with Norwegian and Swedish. In this document, the sequence /"1/ is used to denote main stress in a toneme 1 word and the sequence /"2/ denotes main stress in a toneme 2 word. The notation conventions used throughout this document are listed in Table 1.

3 http://www.phon.ucl.ac.uk/home/sampa/x-sampa

<…> (italic)      Angle brackets on both sides of a string in italic are used to indicate a variable string of the named type.
<…> (not italic)  Angle brackets on both sides of a string not in italic are used to indicate the orthographic form of a lexical entry (as opposed to the phonemic or phonetic form).
/ /               Slashes are used as delimiters of a phonemic transcription string.
[ ]               Square brackets are used as delimiters of a string of phones, or an allophonic transcription string.
*                 The asterisk in front of a string (of any kind) indicates that the string is not syntactically4 correct.

Table 1: Notation conventions used throughout this document.

Important terms that the reader might be unfamiliar with are listed in Appendix 1. The first time such a term occurs in the document, it is underlined.

4 See Section 2.1.

2 Background

When the personal computer entered the everyday life of people in the West, there were probably not many who expected to meet an alternative to the keyboard-and-screen interaction between user and computer in the near future. Today, some ten or fifteen years later, when most people are as familiar with the mouse as with the keyboard5, the development of alternative user interfaces has progressed far. Systems with alternative interfaces include those where the user communicates with the computer by physically pointing at the screen. In other systems, particularly designed for severely disabled people, the user might control the computer by eye-tracking (Bergman & Johnson, 1995). The most highly developed alternative interface today is perhaps the voice interface, where the user dictates text or controls the computer orally, and where the computer responds with synthesised speech. Some useful voice-based interfaces are already on the market, such as L&H Voice Xpress and L&H RealSpeak6, developed by Lernout & Hauspie. However, these products are not yet commercially available for the Scandinavian languages Swedish, Norwegian and Danish, but the work of developing such products is in progress at NST.

What is the role of transcriptions in text-to-speech (TTS) and Automatic Speech Recognition (ASR) systems? In TTS systems the transcription is one out of many intermediate representations of the text being transformed into speech. Figure 1 illustrates the major steps in the conversion from text to speech; the transcription is one type of abstract underlying linguistic representation. In ASR systems, the conversion is reversed, except that the output format of the incoming speech signal is not necessarily text in the usual sense, but text in any format that the computer – the listener – understands. Nevertheless, the speech signal is somewhere along the process likely to have the shape of a phonemic transcription, in order to enable e.g. a dictionary lookup to find a semantic interpretation of the assumed transcription string.

[Figure 1 is a flow chart whose boxes include: Text; Need for normalisation? (Normalisation rules7); Phonetic transcription?; Found in lexicon? (Lexicon); Rules for phonetic transcription; Sequencing and concatenation -> transcription strings; Predefined units; Coarticulation rules; Synthesiser.]

Figure 1: Schematic overview of the steps in text-to-speech conversion. (Slethei, 1988)

5 In fact, many people are perhaps too familiar with the mouse, and repetitive stress injuries (RSIs) due to frequent and static mouse-control movements are becoming a common health problem.
6 L&H Voice Xpress is a program that enables you to dictate into and control almost any Windows-based program, i.e. to use your voice instead of your keyboard in e.g. word processing. L&H RealSpeak reads aloud any text with a synthesised voice that is often hard to distinguish from a human voice.
7 Normalisation rules convert strings like <…>, <…> and <…> to a form that is closer to the pronunciation: <…>, <…> and <…>, respectively.

What different types of constraints must the transcriptions conform to? There are constraints at different levels: syntactic constraints dictated by the transcription language govern which symbols are allowed, and in what order, without deeper linguistic consideration; other constraints govern which sequences are allowed according to phonotactic or phonetic conventions for the given language. Nevertheless, these types of constraints are treated similarly and are therefore discussed in the same section, Section 2.1.

What different types of deviations from the syntax are found in the transcriptions, and why are some errors more frequent than others? The transcription files reveal that there are different types of errors and inconsistencies; some errors are due to simple mistyping, others to inconsistent application of transcription conventions, or even to the lack of transcription conventions. These different types of deviations and inconsistencies are discussed in Section 2.2.

How much work has been done in the field of parsing transcriptions and searching for phonotactic patterns? It will become clear that these are rather unexplored fields, but some existing relevant works are considered in Section 2.3.

2.1 Transcription syntax

There are hundreds of rules that a text in a natural language, such as English, should comply with. Most often, the writer wants his or her text to be understandable to others. In order to achieve this the writer should, at least, use the English alphabet and obey the grammatical rules of English, assuming that the language is English.8 However, not only natural languages have an alphabet and follow a given grammar, but also artificial ones, such as programming languages or logical languages. The set of symbols allowed in a language is often referred to as the lexicon, and the ways in which the symbols are allowed to combine are referred to as the syntax. One might look at transcriptions as phrases of a "transcription language", where the phonetic or phonemic alphabet corresponds to a lexicon, and the ways in which the symbols of the alphabet can be combined correspond to the syntax. In real life, the distinction between utterances that follow the syntax and those that do not might be quite diffuse, but when formulating the syntax for the language used in L&H+ transcriptions, one has to state explicitly the intuitive knowledge that humans possess about which phonotactic sequences are legal and which are illegal in the given language. The difficulty of formulating intuitive knowledge in a declarative manner is obviously not restricted to automatic phonotactic analysis, but is the main problem within all AI science.

The L&H+ transcriptions that the parser and the search engine were developed for are more phonemic than phonetic. This means that the transcription symbols correspond to the units in the pronunciation of a word which have a linguistic function, i.e. the phonemes. In a phonetic transcription, units are transcribed regardless of their linguistic function (if they have one at all), and according to their articulatory or auditory features. However, in some ways the transcriptions are indeed phonetic; for instance, assimilation is sometimes reflected in the transcription, and this must be considered a phonetic phenomenon, since it does not serve a linguistic function.

There are different levels of syntactic conditions9 that the L&H+ transcriptions should satisfy in order to be grammatical. At the symbol level, each transcription should be divisible into a number of legal symbols, and the order of these symbols should not violate any phonetic or phonological restrictions. An example of a phonetic restriction on the ordering of symbols is the Swedish convention stating that the sequence /nk/ is not legal, since assimilation causes it to transform into the sequence /Nk/ (Kylander, 2000b). At a subsymbolic level, each L&H+ symbol should have the following structure (where parentheses indicate an optional element):

<L&H+ symbol> = <…>(<diacritic>)(<diacritic>)

8 Grammaticality alone is not a sufficient condition for a sentence to be understandable, as is illustrated by Chomsky's (1957) famous example of a grammatical but semantically obscure sentence: "Colorless green ideas sleep furiously". Grammaticality is not even a necessary condition for a sentence to be understandable, as there are sentences that are not grammatical but are nevertheless understandable, such as "*John have a knife". However, in everyday language use, as opposed to constructed examples of the kind mentioned above, grammaticality tends to facilitate comprehension.
9 Including both rules and lexicon.

A symbol might contain one or more diacritic characters, and these must conform to a specified pattern. (I will not go into details here, since L&H+, as mentioned earlier, is a company-confidential notation system.) At the syllable level, each sequence in the transcription that is marked as a syllable should actually be a legal syllable, i.e. conform to the following structure (where parentheses indicate an optional element):

<syllable> = (<onset>)<nucleus>(<coda>)

(For definitions of the terms onset, nucleus and coda, see Appendix 1.) A necessary condition for the syllable to be legal is that all its constituents, i.e. the onset, nucleus and coda, are legal according to the grammar. At a level higher than the syllable level, where the conditions range over more than one syllable, every syllable delimiter should be inserted at the right place, as dictated by the so-called Maximum Onset Principle (MOP). Many syntactic constraints in the L&H+ transcription language depend on features of the word being transcribed, for example the morphological decomposition of the word, and thus violations are not predictable from the transcription string alone. As an example: according to the transcription conventions for Swedish, Norwegian and Danish, MOP is overruled by compound boundaries, so that a word such as the Swedish word <bandinspelning> (tape recording) should be transcribed /"2band]In]%spe:l]nIN/ and not */"2ban]dIn]%spe:l]nIN/, because of the compound boundary between <band> and <inspelning>. Some syntactic constraints are captured by transcription conventions, and these are discussed further in Section 3.3.1.
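As a concrete illustration of the syllable-level check just described, the following is a minimal Perl sketch – not the actual NST implementation; the symbol inventories and messages are invented for the example – that splits a transcription on the /]/ delimiter and tests each syllable against the (onset) nucleus (coda) structure:

    use strict;
    use warnings;

    # Illustrative (hypothetical) fragments of a language-specific lexicon;
    # the real inventories are of course much larger.
    my @nuclei = qw(a e E I O u o: e: A:);
    my @onsets = ('', qw(b n r sp st spr));    # '' = an empty onset is allowed
    my @codas  = ('', qw(n l nd N rt));        # '' = an empty coda is allowed

    # Try longer nuclei first so that e.g. /e:/ is not split into /e/ + ':'.
    my $nuc_re = join '|', map { quotemeta } sort { length $b <=> length $a } @nuclei;

    # Check one syllable against the (onset) nucleus (coda) structure.
    sub check_syllable {
        my ($syll) = @_;
        return "syllable $syll contains no nucleus"
            unless $syll =~ /^(.*?)($nuc_re)(.*)$/;
        my ($onset, $coda) = ($1, $3);
        return "syllable $syll contains more than one nucleus" if $coda =~ /$nuc_re/;
        return "$onset is not a legal onset" unless grep { $_ eq $onset } @onsets;
        return "$coda is not a legal coda"   unless grep { $_ eq $coda  } @codas;
        return;    # no message: the syllable looks fine
    }

    # Strip stress markers and check every /]/-delimited syllable.
    for my $trc ('"2band]In]%spe:l]nIN', '"1o:]bE]rj') {
        (my $bare = $trc) =~ s/["%12]//g;
        my @msgs = grep { defined } map { check_syllable($_) } split /\]/, $bare;
        print "$trc: ", (@msgs ? join('; ', @msgs) : 'no syllable errors'), "\n";
    }

Run on the two example strings above (the second is a deliberately ill-formed string), the first passes and the second is reported as containing a syllable without a nucleus.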

2.2 Transcription errors

Producing, for example, an English text with absolutely no errors is hard; no matter how familiar the writer is with the language, the text may still contain errors of different types: spelling errors, grammatical errors or stylistic errors. Some errors might be typing errors, while others might be due to the writer's misinterpretation of, or deficient knowledge of, English. In other words: errors might be caused by deficiencies in – using Chomsky's (1965) terminology – either performance or competence. These two factors play an important part in the transcribing work as well. Transcribing is a demanding task and a real challenge for the people performing it; even though they are all used to transcribing, they have to learn to use new transcription symbols – symbols that are familiar but are used in a new context. The transcribers are de facto learning a new language, and this might explain some of the errors found in the transcription files. Certain errors are probably due to simple mistyping, as the risk of mistyping is several times greater in transcribing than in writing texts in natural language; this is a consequence of the fact that the transcription language is new to the transcribers. As always when several different people are involved in this kind of work, there is also an obvious risk that the transcriptions contain inconsistencies due to misuse, or even absence, of transcription conventions. Some examples of these different types of errors and inconsistencies are given in Table 2.

A transcription that does not violate the transcription syntax or the transcription conventions might still be incorrect. For example, if the word <viska> (to whisper) is transcribed as /"2vI.skar/, the error will not violate the transcription syntax or any transcription convention for Swedish. In order to find such errors one has to declare correspondences between the orthographic form and the transcription form, and such correspondences are not always obvious. For example, <v> does not always correspond to /v/; an exception is the <v> in the word <havsluft> (ocean air), which corresponds to /f/.
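One way to make such orthography-to-transcription correspondences operational – shown here only as a rough sketch with invented variable names and an invented transcription for <havsluft>, not the mechanism actually used in the toolkit – is to pair an orthographic pattern with the transcription pattern it normally produces, and flag entries where the expectation fails:

    use strict;
    use warnings;

    # Hypothetical correspondence: orthographic <v> is normally transcribed /v/,
    # so an entry whose orthography contains "v" but whose transcription lacks /v/
    # is worth flagging for manual inspection (it may still be correct, as in <havsluft>).
    my %expected = ( 'v' => 'v' );   # orthographic pattern => expected transcription pattern

    sub flag_suspicious {
        my ($orth, $trc) = @_;
        my @warnings;
        while (my ($o, $t) = each %expected) {
            push @warnings, "correspondence /$t/ to <$o> not found"
                if $orth =~ /$o/ && $trc !~ /$t/;
        }
        return @warnings;
    }

    print "$_\n" for flag_suspicious('viska',    '"2vI.skar');    # nothing flagged
    print "$_\n" for flag_suspicious('havsluft', '"2hafs]l8ft');  # flagged for inspection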


Error type: Mistyping
Example: /"1ø:]bE]rj/
Explanation: The sequence /rj/ is not a legal syllable; it contains no nucleus.

Error type: Mistyping
Example: /sta]bI]li:"1se:]rar/
Explanation: The sequence /li:"1se:/ should be divided into two syllables.

Error type: Mistyping
Example: /ba]"2sEng]ka]pa]sI]%te:t/
Explanation: Careless transfer of the orthographic bigram <ng> to the transcription, where it should correspond to the simple phoneme /N/.

Error type: Use of a convention that has been revised.
Example: /"1sn]bo:d`/
Explanation: /…/ used to be a legal Swedish diphthong, but the symbol was removed from the alphabet.

Error type: Difficulties in applying rules that contradict one's linguistic intuition.
Example: /"2mIs]%trU]de/
Explanation: MOP violation in a word that is not marked as a compound. (Should be /"2mI]%strU]de/.)

Error type: Absence of or misuse of transcription conventions in difficult cases, i.e. where more than one pronunciation is possible.
Example: /a]gI]"2tA:]tOr/ vs. /a]ren]"2dA:]tUr/
Explanation: Use of different phonemes in the same context (the different correspondences to orthographic <or>).

Table 2: Some examples of different types of errors and inconsistencies found in the Swedish transcription file ALL_se_col1_ci_sorted.

2.3 Earlier work

The task for which the parser described in this document was developed consists in finding as many errors in the input transcriptions as possible, at as many levels as possible (see Section 3.2.2). There do not seem to be any publicly available transcription parsers or checkers with the same ambition, although according to Slethei (personal communication, 09-05-2000), there are two publicly unavailable transcription parsers in Norway alone. However, there is a great number of parsers analysing strings of various kinds, and there are also a few applications that have been designed to perform a phonotactic parse at some level, such as the syllable parser for English and French developed by Hammond (1995). Hammond's parser takes a plain orthographic string as input and inserts syllable boundaries, while the parser described in this thesis takes a phonemic string as input, with syllable boundaries already inserted. A rule formalism that can be used to perform both production and parsing of word forms is Two-level Phonology (Antworth, 1996), implemented for instance in PC-KIMMO (cf. Antworth, 1990). However, Two-level Phonology operates on linear orthographic input, which, as mentioned earlier, is an obvious difference from the parser described in this document. Currently there does not seem to exist any phonotactic search engine.

3 The phonotactic toolkit

The phonotactic toolkit described in this document consists of a parser, ParseTranscriptions 1.0, and a search engine, PhonotacticSearch 1.0. There are two main types of intended users of the toolkit (or above all of the transcription parser): one group is the transcribers and the other is the developers. The transcribers might input a transcription, or rather a whole lexical entry (see Section 3.1), and the parser outputs a – possibly empty – set of warnings and error messages. The second group of users, henceforward the advanced users, add and modify the representations of transcription conventions in the parser (see Section 3.3.1) in order to improve the behaviour of the parser. Both the parser and the search engine are written in Perl, and since they also share many other features, they are both described in this section.

3.1 Description of ParseTranscriptions and PhonotacticSearch

ParseTranscriptions 1.0 is a parser for Swedish, Norwegian or Danish transcriptions. The parser can be run in two different modes: either a single transcription is parsed, or an entire file of transcriptions. The input is not just one or many transcriptions, but rather one or many lexical entries, where each lexical entry contains much useful information apart from the transcription, e.g. the morphological decomposition10, the part-of-speech and the language origin11 of the lexical unit. This is because whether a transcription is correct or not is partly predictable from aspects of the lexical unit other than the transcription itself – information that to some extent is found in the lexical entry. The output from the parser is, in single-transcription mode, a specification of all real and potential errors found in the lexical entry, and, in multiple-transcription mode, specifications of all certain and potential errors found in each lexical entry in the input file. An overview of the structure of ParseTranscriptions 1.0 is presented in Figure 2. As the figure suggests, the parser is able to parse its own output, in the sense that corrections can be made directly in the output file, which is then used as input file and parsed again, and so on, until no further corrections are needed.

Any modern parser consists of two main parts, a set of rules and a lexicon, and this is true for ParseTranscriptions 1.0 as well. The rules are few and general, and when adjusting the grammar to a new language or to new conditions within the same language, the advanced user makes changes in the lexicon; the parsing routines are supposed to be left untouched. This approach makes the parsing routines reusable and saves a lot of programming effort. The third box in Figure 2 contains the parsing parameters, and by modifying these the advanced user can change the behaviour of the parser, for example which file to use as input file, the name of the output file and which errors the parser should look for. By keeping the parsing parameters and the parser separate, the user never has to make changes in the parsing routines, and this limits the risk of making unwanted changes in these general routines.

[Figure 2 is a diagram with the components: parsing parameters, lexical entry/entries, lexicon, parsing routines, parsed entry/entries, and corrections.]

Figure 2: Overview of the structure of ParseTranscriptions 1.0.

10 Each lexical entry has been automatically decomposed into compound constituents. Generally, words have been marked as compounds when each constituent has the same or similar meaning alone as in the compound word. Compound boundaries have not been inserted after or in front of constituents that cannot occur alone (i.e. in derived words). (From Kylander, 2000b.)
11 For example, the word <…> is an English loanword in Swedish, and this origin is marked in the lexical entry for <…>.
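The modularity described above – general routines plus an interchangeable language-specific lexicon – might, purely as an illustration and not as a description of the actual NST files, be organised so that each language contributes only data and the checking routine is shared:

    use strict;
    use warnings;

    # Hypothetical sketch of the modular design: the language-specific part is
    # pure data, so swapping Swedish for Norwegian means selecting (or, in a
    # file-based setup, loading) another lexicon, while the general checking
    # routine stays untouched.
    my %lexicon_for = (
        swe => { unwanted_sequences => ['nk'] },   # e.g. /nk/ assimilates to /Nk/
        nor => { unwanted_sequences => ['nk'] },
        dan => { unwanted_sequences => [] },
    );

    # General routine: report every unwanted sequence found in a transcription.
    sub check_unwanted {
        my ($trc, $lexicon) = @_;
        return map  { "unwanted sequence /$_/" }
               grep { $trc =~ /\Q$_\E/ } @{ $lexicon->{unwanted_sequences} };
    }

    my $language = 'swe';
    print "$_\n" for check_unwanted('"1bank]en', $lexicon_for{$language});
    # prints: unwanted sequence /nk/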

The three main parts of ParseTranscriptions – the parsing routines, the lexicon and the parsing parameters – are kept in three different files. This structure is chosen not only as a precaution; it also makes the program easier to read. In order to set the values of the parsing parameters, the advanced user is assumed to possess some knowledge of Perl. The parameters are:

- The name of the input file
- The name of the output file (default is "inputfile.parsed")
- The output format; either a clean, easy-to-read format or a complete format (see 3.1.1)
- Selection of lexicons, i.e. the names of the lexicons from which the entries should be taken
- The value of the "garbage flag" for transcriptions to be parsed (0, 1 or unspecified)12
- Selection of which deviations the parser should look for
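A parameter file along these lines might look as sketched below; the variable names are invented for illustration (only the PhonotacticSearch parameter file is quoted verbatim later in this section), so treat this as an assumption about the general shape rather than the actual file:

    # Hypothetical sketch of a ParseTranscriptions parameter file; only the
    # general shape is implied by the list above -- the variable names here
    # are invented, not the program's actual ones.
    $inFile       = "ALL_se_col1_ci_sorted";
    $outFile      = ">$inFile.parsed";              # default: "inputfile.parsed"
    $outputFormat = "complete";                     # or "clean" (see 3.1.1)
    $lexSelection = 'enter_se|spd_se';              # which sublexicons to include
    $garbageFlag  = "";                             # 0, 1 or "" for unspecified
    %checksToRun  = (stress => 1, syllable => 1, syntax => 1);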

In order to make changes in the lexicon, not only programming knowledge is assumed, but also knowledge of the phonetic and phonological constraints of the given language and of the transcription conventions for the language.

The program is executed with the following call:

perl ParseTranscriptions.pl

If the user wants to parse a single lexical entry, this lexical entry is given as an argument to the call. The output, i.e. possible error and warning messages, is in this case written to the terminal window from which the program was called. A lexical entry, regardless of parsing mode, is supposed to have a certain format, illustrated by the following example:

abdikera;abdikera;abdikera;verb;verb;;swe.SWE;0;0;0;;ab]dI]"1ke:]ra;1;STD;swe.SWE;;;;;;/;;;;;;;;;;;;;;;;;;;;20254

Each lexical entry contains more than 35 information fields, separated by ';' characters. The Danish lexicon file format differs from the Swedish and the Norwegian; it contains less information and thus fewer checks can be made.

PhonotacticSearch 1.0 is a search engine that searches a Swedish, Norwegian or Danish lexicon for certain phonotactic patterns. In accordance with the specification of this thesis, the transcription parser, ParseTranscriptions, was used as a starting point in the development of PhonotacticSearch. Certain parts of ParseTranscriptions have been reused in the phonotactic search engine, while other parts were not reusable; most of the procedures in PhonotacticSearch had to be written specifically for the search engine. The structure of PhonotacticSearch is illustrated in Figure 3. As the figure shows, the input to the search engine is the search specification, in the form of search parameters. The output from the search engine, the search results, is a list of relevant matches to the search specification. The structural overview in Figure 3 shows that a phoneme lexicon for the specific language is used both in the search specification and by the search engine itself; this phoneme lexicon is almost identical to the lexicon used by ParseTranscriptions 1.0. The separation of the three parts of PhonotacticSearch – the lexicon, the search parameters and the search procedures – provides a more manageable structure, one that is easier to overview.

12 If the word should be removed from the lexicon, the value of garbage flag is 1; otherwise 0. The reason why these entries haven’t already been “physically” removed from the lexicon is that this would cause an unwanted gap in the lexicon, since all entries are indexed.


[Figure 3 is a diagram with the components: search parameters, phoneme lexicon for the specific language, search engine, lexicon file (get next entry / new lexical entry), and search results.]

Figure 3: Overview of the structure of PhonotacticSearch 1.0.

A large part of the phoneme lexicon in PhonotacticSearch is the same as the lexicon used in ParseTranscriptions 1.0, but some modifications have been made. Some information has been added to the lexicon of the search engine, such as the further specification of consonants and vowels into natural classes. These classes have been formed on the basis of, for the consonants, manner of articulation, place of articulation and voicing13 and, for the vowels, rounding, advancement and height. By grouping phonemes together by features, the user is able to refer to natural classes of phonemes when specifying a search pattern. However, in the development of the search engine, there are difficulties in deciding how the parameters should be formulated. One could either choose a redundant and large lexicon, with groups like labial fricatives and dental fricatives and concise parameter formulations, or a concise and non-redundant lexicon, where labials, dentals and fricatives are separate groups, and longer and more complex parameter formulations. The latter option has been chosen here and, as a consequence, the demands on the user setting the values of the search parameters are high; not only knowledge of phonetics and phonology is assumed, but also some knowledge of Perl. The same qualifications are assumed for the user making changes in the lexicon.

The degree of specificity of a search depends on the information accessible in the input file. If the input file contains only transcriptions, the user can search only for substrings and patterns in the transcription strings. If the input file, in addition to the transcriptions, contains, say, information about morphological structure, the user can narrow the search further by e.g. searching only among non-compound words. However, there is a lot of information to be found in a simple transcription string. As we have already seen, since certain sequences of characters form certain phonemes, each transcription can easily be divided into phonemes. Thus it is possible to search for certain sequences of phonemes. The transcriptions are divided into syllables by syllable delimiters, and this makes it possible to refer to the syllable and to define a search pattern for a syllable at an explicit position, or at any position, in the transcription string. Since one can define the structure of the Nth syllable in the transcription, it is of course also possible to define the structure of the (N+1)th syllable, and so on, and this makes it possible to refer to sequences ranging over syllable boundaries. Tonal patterns are also extractable from the transcription string, and it is possible to refer to e.g. "the mainly stressed syllable" of some transcription.14 If one combines these parameters the search can be very specific, even though the only source of information is the transcription string. As an example, it is possible to extract all transcriptions with primary stress on the Nth syllable where the phonemes of the (N+1)th syllable match a given pattern. An example of a transcription matching such a pattern is the transcription of the word <styvson> (stepson), which is /"2sty:v]%so:n/. Although a lot of information can be extracted from just a transcription string, there are of course limits to how much, and these limits are mainly due to the level of transcription and the transcription conventions used15.

13 Direction of airflow has not been considered in the lexicon, since it is assumed that the search engine will only be used for languages where this dimension is irrelevant, i.e. where all consonants are pulmonic.
14 It is not always possible to refer to "the syllable bearing secondary stress", since secondary stress, according to the transcription conventions used for Swedish, should only be marked in compound and derived words.

The values of the search parameters are set in a separate file. The search parameters in PhonotacticSearch 1.0 are presented in Table 3:

- Input file name
- Output file name (<input file name>.results)
- Selection of transcriptions (name of the sublexicon of which the transcription is a member)
- Language of transcription (= origin code)
- Dialect
- Tonal pattern
- Number of syllables
- Position of mainly stressed syllable
- Position of secondary stressed syllable
- Structure pattern X for syllable N
- Part-of-speech
- Compound or non-compound
- Correspondence C in transcription to orthographic O
- Substrings O1, O2 … On in orthographic string

Table 3: Search parameters in PhonotacticSearch 1.0.
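The natural classes mentioned above surface in the parameter file as Perl variables such as $SWE_CONSONANT and $SWE_VOWEL, used in the example below. The following is a rough sketch – with a deliberately tiny, invented symbol inventory, not the real lexicon – of how such class variables could be built up as regular-expression alternations from feature groups:

    use strict;
    use warnings;

    # Illustrative (and heavily abridged) feature groups; the real lexicon
    # contains the full Swedish phoneme inventory.
    my @SWE_LABIALS   = qw(p b m f v);
    my @SWE_DENTALS   = qw(t d n s l);
    my @SWE_VELARS    = qw(k g N h);
    my @SWE_FRONT_VOW = qw(i: I y: e: E);
    my @SWE_BACK_VOW  = qw(u: U o: O A: a);

    # Build an alternation pattern; longer symbols first so that /e:/ is not
    # mistaken for /e/ followed by a stray ':'.
    sub alternation {
        return join '|', map { quotemeta } sort { length $b <=> length $a } @_;
    }

    my $SWE_CONSONANT = alternation(@SWE_LABIALS, @SWE_DENTALS, @SWE_VELARS);
    my $SWE_VOWEL     = alternation(@SWE_FRONT_VOW, @SWE_BACK_VOW);

    # A syllable beginning with consonant + vowel, as specified for syllable X+1
    # in the parameter example below:
    my $cv_start = qr/^($SWE_CONSONANT)($SWE_VOWEL)/;
    print(('so:n' =~ $cv_start) ? "match\n" : "no match\n");   # the syllable /so:n/ matches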

The more search parameters the user is able to control, the more the results can be specified, and, as a consequence, the more flexible the search engine will be. But the greater the number of adjustable parameters, the greater the demands on the user's accuracy, as the risk of making mistakes increases. This is an example of how the parameter values might be set when searching in a Swedish lexicon file:

# names of input file and output file:
$inFile = "ALL_se_col1_ci_sorted";
$outFile = ">$inFile.results";

$trcSelection = 'enter_se|spd_se';  # if unspecified, replace with ""
$dialectSpec  = "STD";              # if unspecified, replace with ""
$originSpec   = 'swe\.SWE';         # if unspecified, replace with ""

$toneme        = "";
$nofSylls      = "";
$mainStressPos = "X";
$secStressPos  = "";

$STRUCTURE_X_FOR_SYLLABLE_Y{'X+1'} = "($SWE_CONSONANT)($SWE_VOWEL)";

$posWord  = "";
$compWord = "";

$TRC_X_CORRESPONDS_TO_ORTO_Y{'(tion|sion)'} = "-x\u:($SWE_NOT_SOUND)*n";

@SUBSTRING_IN_ORT = ("");

15 One limitation imposed by the transcription conventions for Swedish is that two different sounds (in IPA notation) have been transcribed with the same L&H+ symbol, /s`/. There is thus no possibility of searching for words containing the one sound without also including words containing the other.
16 <≤digit> is interpreted as "less than or equal to digit" and <≥digit> is interpreted as "greater than or equal to digit".

These parameter values will make PhonotacticSearch 1.0 look in the file ALL_se_col1_ci_sorted for entries from sublexicons whose names contain the substring enter_se or spd_se. Only transcriptions where the dialect specification code matches the string STD and the origin specification code is swe.SWE are relevant to the search. The values for toneme type, number of syllables and position of secondary stress are left unspecified. The syllable bearing main stress is the syllable at position X, and the structure of syllable X+1 is specified as CV. The part-of-speech of the word is not considered, and the word may or may not be a compound. The engine will look for entries where the morphological decomposition string contains one of the substrings <tion> or <sion> and the transcription does not match the pattern /x\u:($SWE_NOT_SOUND)*n/, i.e. the most common way of transcribing the sequences <tion> and <sion>. No other substring in the morphological decomposition string (than <tion> and <sion>) is expected.

A graphical user interface (GUI) is not only nicer to look at than a DOS dialogue or a Perl script; it can also be used to limit the number of possible typing errors. For example, with a list of possible part-of-speech tags, organised e.g. in a drop-down list, there is no possibility for the user to input a non-existent part-of-speech tag and thus perform a search in an empty set of transcriptions. A graphical interface has been developed for PhonotacticSearch, written in HTML and calling the search engine as a cgi script. Figure 4 shows this GUI. However, the intended users of the search engine are language engineers with thorough knowledge of both Perl and phonetics, and they would probably rather set the parameters directly in the parameter file. Thus, the GUI should be seen as an alternative way of calling PhonotacticSearch, not the only way. One drawback of the HTML format is that search patterns, which often tend to be long, might be too long to fit in the text fields. The search patterns are easier to overview in a text-editor window.

Figure 4: The graphical user interface to PhonotacticSearch 1.0.


When executed from a terminal window, the call to the search engine is:

perl PhonotacticSearch.pl

The lines in the input file, i.e. the entries in the lexicon file to search in, are supposed to be of the same format as in the transcription parser17.

3.1.1 The output

ParseTranscriptions 1.0 can generate 21 different kinds of error and warning messages. These messages are presented together with their interpretations in Tables 4-6. (The Swedish messages are given as an example; in the Norwegian and Danish output files, "Swedish" is obviously replaced with "Norwegian" and "Danish", respectively.) The advanced user can choose between two output formats (specified in the parsing parameter file): one where the entire lexical entry is presented together with the error and warning messages:

kurfursten;kurfursten;kurfursten;noun;noun;u;swe.SWE;0;0;0;;"2k}:r]%f8s`]t`en;1;STD;swe.SWE;;;;;;/;;;;;;;;;;;;;;;;;;;;58583
e = MOP-violation, syllables %f8s` and t`en
w = possible non-compound with %

or one where only the most important information from the lexical entry is presented, together with the error and warning messages:

"2k}:r]%f8s`]t`en;58583;1;noun;kurfursten;swe.SWE;;
e = MOP-violation, syllables %f8s` and t`en
w = possible non-compound with %

The advanced user would probably prefer the former alternative, since corrections can then be made directly in the lexical entry and run through the parser again, which is not possible with the latter output format. However, the reduced output format is more attractive and preferable when one does not want to parse the output file again. An excerpt of the output from parsing the file ALL_se_col1_ci_sorted is presented in Appendix 3.

Stress errors

Message: <…> contains illegal stress pattern
Interpretation: This error message will be printed if the transcription string does not follow the stress pattern(s) specified as legal for the given language.

Stress warnings

Message: possible non-compound with %
Interpretation: This warning will be printed if a given transcription is not a compound, and could not be considered a derived word, but still contains a secondary stress marking. (See Section 3.3.1.)

Table 4: Stress warnings.

17 Here, as in the case of the parser, the Danish lexicon format differs from the Swedish and Norwegian, and therefore a special file-reading procedure has been designed for this Danish file format.

Syntax errors

Message: <…> contains illegal diacritic
Interpretation: If a given symbol contains diacritics that do not comply with the diacritic syntax, this error message will be printed.

Message: illegal Swedish symbol (<…>)
Interpretation: Each transcription string containing one or more symbols not found in the list of allowed symbols for Swedish receives this error message.

Message: the number of <…>:s in transcription is not correct
Interpretation: If the number of word-delimiting characters found in the transcription string is not equal to the number of word-delimiting characters found in the morphological decomposition string, this error message will be printed.

Message: unwanted sequence of <…> and <…>
Interpretation: The language-specific list of unwanted sequences is found in the lexicon, and if a violation against these patterns is found, this message will be printed.

Message: illegal position for syllable delimiter
Interpretation: This error message will be printed if a syllable delimiter is found at the end of a transcription string.

Message: probably unwanted duplication of <…>
Interpretation: If any type of character occurs successively, the parser warns that this might be a typing error.

Syntax warnings

Message: probably unwanted sequence of <…> and <…>
Interpretation: The language-specific list of probably unwanted sequences is found in the lexicon, and if a violation against these patterns is found, this warning message will be printed.

Message: correspondence /<…>/ to <…> not found
Interpretation: If an expected correspondence to some sequence in the morphological decomposition string is not found in the transcription string, this warning message is printed.

Message: geminate consonant over syllable boundary
Interpretation: If the convention that no consonant should appear on both sides of a syllable boundary is violated, this warning will be printed.

Message: /<…>/ allowed only in words with language code <…>
Interpretation: This warning will, at the current stage, be printed only if a "nynorsk" transcription pattern is found in a transcription whose language code is not the code for "nynorsk".

Table 5: Syntax error and warning messages.
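As a small illustration of how a check like the geminate-consonant warning in Table 5 can be expressed, the sketch below (simplified to single-character symbols, and not the actual implementation; the example word is invented) looks for the same character immediately on both sides of a syllable delimiter:

    use strict;
    use warnings;

    # Simplified sketch: warn if the same (single-character) symbol appears
    # immediately before and after a syllable delimiter, e.g. /...t]t.../.
    # A real check would also have to handle multi-character symbols and
    # stress markers following the delimiter.
    sub geminate_over_boundary {
        my ($trc) = @_;
        return $trc =~ /([^\]%"12])\]\1/
            ? "geminate consonant /$1/ over syllable boundary"
            : ();
    }

    print geminate_over_boundary('"2hat]ta'), "\n";   # hypothetical example: reports /t/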


Syllable errors

Message: syllable <…> contains no nucleus
Interpretation: If a given syllable does not contain any of the possible nuclei specified for the given language, this error message will be printed.

Message: syllable <…> contains more than one nucleus
Interpretation: If the number of nuclei in a sequence marked as a syllable is greater than 1, this error message will be printed.

Message: <…> is not a Swedish onset
Interpretation: If a given onset, at any position in the transcription string, is not a member of the list of rare onsets, not a member of the list of onsets not allowed word-initially, and not a member of the list of legal onsets for the specific language, this error message is printed.

Message: <…> is not a Swedish coda
Interpretation: If a given coda is not a member of the list of legal codas and not a member of the list of rare codas specified for the given language, this error message will be printed.

Message: Onset <…> is not allowed word-initially
Interpretation: If a given word-initial onset is a member of the list of onsets not allowed word-initially in the current language, this error message will be printed. This list is found in the lexicon.

Message: MOP-violation, syllables <…> and <…>
Interpretation: If the given transcription is not marked as a compound word, and the word is not an abbreviation (see 3.3.1.5), but the word still violates MOP, this error message is printed.

Syllable warnings

Message: <…> is a rare onset
Interpretation: If a given onset is a member of the list of rare onsets for the current language, this warning is printed.

Message: <…> is a rare coda
Interpretation: If a given coda is a member of the list of rare codas for the current language, this warning is printed.

Message: possible MOP-violation, syllables <…> and <…>
Interpretation: This warning is printed if the transcription violates MOP and the word is marked as a compound, or contains one of the derivations possible for the given language. As the algorithm has no means of finding where in the transcription the compound boundary is, there is no way to separate legal MOP violations such as /"2slOts]%bY]gnad/ from illegal MOP violations such as */"2slOts]%bYg]nad/.

Table 6: Syllable error and warning messages.

Figure 5 illustrates how the output from PhonotacticSearch might look (the results of a search with the example parameter values mentioned above). The first lines in the output file simply contain a listing of what the engine has searched for, i.e. what search parameter values the user has specified. (This might or might not be what the user intended to search for. If the results do not correspond to the user's expectations, this way of presenting the results makes it easy for the user to understand what the engine searched for and how he or she should modify the search parameter values in order to perform the intended search.) The number of transcriptions in the input file is printed, as well as the number of transcriptions relevant to the specific search.

The lexical entries in the input file that were considered relevant to the search are sorted by the morphological decomposition string and printed in a list. Each line in the list consists of an index number for the lexical entry, a transcription index indicating whether the transcription is the default or an alternative transcription, and the transcription itself followed by the morphological decomposition string.

Search performed in file ALL_se_col1_ci_sorted
Transcriptions from lexicons whose name contains the substring /enter_se|spd_se/
Transcriptions with dialect code STD
Transcriptions with origin code swe.SWE
Words with unspecified stress pattern
Unspecified number of syllables
Main stress on the Xth syllable
Unspecified position of secondary stress
The structure for syllable X+1: ($SWE_CONSONANT)($SWE_VOWEL)
Part-of-speech: unspecified
The word could be a compound or a non-compound
Transcriptions NOT containing correspondence /-x\u:(|)*n/ to orthography

Total number of transcriptions in file: 150454 (100%)
Number of relevant hits: 95 (0.06%)

52069: 1: In]tø]"1n`E]s`ø]nels               Internationals
67322: 1: nat]x\U]"2nA:l]%gA:]d`et           National+gardet
23462: 1: a_Uk]x\U]"1ne:]ra]des              auktionerades
23461: 1: a_Uk]x\U]"1ne:]ra]des              auktionerades
29851: 1: "2br8]tU]nat]x\O]nA:l]prO]%d8kt    brutto+!national+produkt
29852: 1: "2br8]tU]nat]x\O]nA:l]prO]%d8k]ten brutto+!national+produkten
32712: 1: de]sI]l8]x\U]"1ne:]ra]de           desillusionerade
35461: 1: e]mO]s`O]"1ne]la                   emotionella
35505: 1: "2e:n]dI]men]x\O:]%nel             en+dimensionell
35506: 1: "2e:n]dI]men]x\O]%ne]la            en+dimensionella

Figure 5: An example of output from PhonotacticSearch 1.0.

3.2 Functionality of ParseTranscriptions and PhonotacticSearch

The syntactic analysis must be preceded by a lexical analysis, i.e. the division of the input string into manageable and classifiable units, possibly together with other secondary tasks. In ParseTranscriptions, this corresponds, at least on one level, to the process of extracting useful information from the input file.18 An overview of the interaction between the lexical analyser and the parser in ParseTranscriptions 1.0 is found in Figure 6. The lexical analysis phase is described in Section 3.2.1 and the parsing procedures in ParseTranscriptions are described in Section 3.2.2.

18 For example, the process of dividing each transcription into syllables is also a kind of lexical analysis. This process is discussed in Section 3.2.2.3.

[Figure 6 is a diagram showing: lexical entry/entries going into the lexical analyser; useful information from the current entry passed to the parser; error messages returned (get next lexical entry); statistics printed to STDOUT; error messages written to the output file.]

Figure 6: Overview of the interaction between the two parts of the parsing routine in ParseTranscriptions 1.0, the lexical analyser and the parser. (The figure is analogous to Fig. 3.1 in Aho et al., 1986.)
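The interaction in Figure 6 can be pictured as a simple driver loop; the sketch below is an illustration of that flow with invented procedure names and stand-in routines, not the actual ParseTranscriptions code:

    use strict;
    use warnings;

    # Hypothetical driver loop illustrating Figure 6: the lexical analyser reads
    # entries, hands the useful fields to the parser, and writes the returned
    # messages to the output file. (The real program additionally sorts entries
    # with errors first, as described in Section 3.2.1.)
    sub extract_fields { return (raw => $_[0]) }   # stand-in for the real lexical analysis
    sub parse_entry    { return () }               # stand-in for the checks in Section 3.2.2

    my $inFile = 'ALL_se_col1_ci_sorted';
    open my $in,  '<', $inFile          or die "Cannot open $inFile: $!";
    open my $out, '>', "$inFile.parsed" or die "Cannot open output file: $!";

    my ($n_entries, $n_errors) = (0, 0);
    while (my $entry = <$in>) {
        chomp $entry;
        $n_entries++;
        my %fields   = extract_fields($entry);     # lexical analysis
        my @messages = parse_entry(\%fields);      # parsing proper
        $n_errors   += grep { /^e = / } @messages;
        print {$out} "$entry\n", map { "$_\n" } @messages;
        print "$entry\n" if $n_entries % 250 == 0; # progress indication on STDOUT
    }
    print "$n_entries entries parsed, $n_errors errors found\n";   # statistics to STDOUT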

When formulating the algorithm of the phonotactic search engine, one must keep in mind two questions:

- What information in the input file is the user interested in finding?
- What information can be extracted from the input file at all?

The answer to the first question obviously depends on the purpose of using the search engine. As mentioned earlier, one possible use of the search engine is as a tool for checking the consequences of a potential transcription convention. Since a transcription convention can put restrictions on almost anything in the lexical entry, the search engine should be able to refer to almost everything in the lexical entry in order to fit the user's sometimes unpredictable needs. Therefore, focus was put on the second question, i.e. as much information as possible should be extracted, and the aim was set to develop a flexible and versatile search engine.

The lexical entries in the input file are alphabetically ordered by orthographic form, and this structure does not simplify the search for phonotactic patterns. It means that every lexical entry must be compared with the search specification. For each lexical entry, the engine looks for mismatches against the patterns specified by regular expressions in the search parameters. When such a mismatch occurs, the search within the lexical entry is interrupted, as the entry is irrelevant to the search specification, and the engine proceeds to the next entry in the lexicon file. The matching procedure makes use of the built-in matching functions in Perl.

The syntactic analysis in a parser had to be preceded by a lexical analysis, and this is an important part of PhonotacticSearch 1.0 as well. The lexical analysis in the search engine is almost identical to the lexical analysis in ParseTranscriptions, at least in the way its main task, extraction of useful information from the input, is performed. The lexical analyser interacts with the matching procedures of PhonotacticSearch; an overview of this interaction is shown in Figure 7. The lexical analyser is presented in Section 3.2.1 and the matching procedures are described in Section 3.2.3.
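A rough sketch of this per-entry filtering, using Perl's built-in matching as described above but with invented field positions and a condition set modelled on the earlier parameter example, might look like this:

    use strict;
    use warnings;

    # Hypothetical sketch of the early-rejection strategy: test the restrictive
    # conditions first and skip to the next entry on the first mismatch. The
    # field positions are invented for the illustration.
    my @relevant;
    open my $in, '<', 'ALL_se_col1_ci_sorted' or die "Cannot open lexicon file: $!";
    ENTRY: while (my $entry = <$in>) {
        chomp $entry;
        my @f = split /;/, $entry;
        my ($decomp, $trc, $origin) = ($f[1] // '', $f[11] // '', $f[15] // '');

        next ENTRY unless $origin =~ /swe\.SWE/;       # origin code must match
        next ENTRY unless $decomp =~ /(tion|sion)/;    # decomposition must contain <tion> or <sion>
        next ENTRY if     $trc    =~ /x\\u:/;          # ... but the usual /x\u:/ transcription is absent

        push @relevant, $entry;                        # every query succeeded
    }
    printf "Number of relevant hits: %d\n", scalar @relevant;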


[Figure 7 shows the lexical analyser reading lexical entries and the search pattern, sending queries of the form "Does entry X match pattern Y?" to the matching procedures, which answer yes or no; the search specification and the list of relevant entries are written to the output file.]

Figure 7: An overview of the interaction between the lexical analyser and the search procedures in PhonotacticSearch 1.0. (The figure is analogous to Fig. 3.1 in Aho et al., 1986.)

The illustration in Figure 7 suggests that the lexical analyser sends only one question, henceforth referred to as a query, to the matching module. This may be the case, but the number of queries from the lexical analyser can also be either greater or smaller. To clarify: the lexical entry might be sorted out as irrelevant already in the lexical analysis phase, e.g. if the language code of the transcription does not match the language code specified in the search parameters. In that case, the lexical analyser will not have to send any query to the matching module, but will proceed to the next lexical entry. The more matches that succeed, the more subsequent queries are passed to the matching module, and this explains why the number of queries might be greater than one. (Of course there is a maximum number of queries that can be passed to the matching module; if the last match is successful, the lexical entry will be considered relevant and the lexical analyser will look for the next lexical entry in the lexicon file.) Which queries these are, and how they are ordered, is discussed in Sections 3.2.1 and 3.2.3.

3.2.1 Lexical analysis

As Aho et al. (1986: 84) state, the main task of the lexical analyser is to "read the input characters and produce as output a sequence of tokens that the parser uses for syntactic analysis". However, they also point out that the lexical analyser can "perform certain secondary tasks at the user interface". This is true for the lexical analysers in both ParseTranscriptions and PhonotacticSearch; as the input file format is the same, the lexical analyser in ParseTranscriptions could mostly be reused in the search engine.

One secondary task the lexical analyser in ParseTranscriptions performs is to count the number of lexical entries in the file and the number of errors found, and to print these numbers to standard output. Another secondary, but still essential, task is to receive error and warning messages from the parser and to print these to the output file. One secondary task performed by the lexical analyser in PhonotacticSearch is counting the total number of entries in the input file and the number of relevant entries in the file. This lexical analyser also performs the task of listing the relevant entries in the output file together with the search specification found in the parameter file.

The information extracted from each lexical entry by the lexical analyser in ParseTranscriptions is the following:

1. Transcription(s)[19]
2. Morphological decomposition
3. "Garbage flag"
4. Name of sublexicon of which the entry is a member
5. Part-of-speech information
6. Language code
7. Expansion (of acronyms)
8. Index of the lexical entry

[19] In some cases a word has an alternative transcription besides the required default transcription.

The most important information considered in the parsing process is the transcription string(s) and the morphological decomposition string. The expansion is examined and reveals if the lexical entry is an acronym[20], and this information is also used in the parsing procedure. Information of types 3 and 4 can be used to limit the number of transcriptions to parse. As an example: it is not necessary to parse a transcription of an entry that should be removed from the lexicon later, i.e. where the value of the "garbage flag" is 1. The rest of the fields extracted from each entry are not considered in the parsing process, but simply passed on to the resulting output, as information for the people interpreting the error and warning messages and correcting the transcriptions.

As the parsing proceeds, the lexical analyser counts the lexical entries being parsed, the number of transcriptions in the file[21] and the number of errors found. This statistical information is printed to the terminal window from which the program was called. During the parsing process, every 250th lexical entry is printed to standard output, so that the advanced user can follow the parsing procedure and be assured that it has not been interrupted or become stuck in an infinite loop.

For every lexical entry passed from the lexical analyser to the parsing procedures, these procedures return the collected error and warning messages for the given entry. (See illustration in Figure 6.) If no deviating patterns are found in the entry, the collection of error and warning messages is empty, and the entry will be collected in an array of correct entries, @correctOut. If the lexical entry does contain some deviating pattern, the error and warning descriptions for the entry will be sorted, so that error messages precede warnings, reflecting the order of correction priority. As the next step, the sorted error and warning collection of the current entry is put into a hash table, %errOut, with the whole information cluster, i.e. the lexical entry together with the message collection, as a key. By using a hash table, it is easy to use the built-in sort functions in Perl in order to produce a well-structured output: all lexical entries that received error messages are listed in the beginning of the file, and the lexical entries that received only warning messages are listed below. The lexical entries in which no deviations were found, i.e. the entries collected in @correctOut, are listed at the end of the file.

The information extracted from each lexical entry by the lexical analyser in PhonotacticSearch is the following:

1. Transcription(s)
2. Morphological decomposition
3. Dialect code (i.e. standard pronunciation for the given language or with some dialect)[22]
4. Origin code (i.e. if the word is transcribed with a native/nativised or foreign pronunciation)
5. Name of sublexicon of which the entry is a member
6. Part-of-speech information
7. Index of lexical entry
8. Index of transcription

[20] If a lexical entry has an expansion one can conclude that the word is an acronym, but if the lexical entry does not have an expansion one cannot conclude that the word is not an acronym.
[21] The number of transcriptions in the file is not necessarily equal to the number of lexical entries, since a lexical entry might contain an alternative to the default transcription.
[22] At the moment there is no other dialect code than 'STD' (standard Swedish).

Information of types 1 and 2 is used when matching patterns of syllable and stress structure. These matching procedures might be quite complex, and the number of potentially relevant transcriptions is therefore reduced at an early stage, by using information of types 2 to 6 to remove transcriptions not matching the specifications for those parameters. Information of types 7 and 8 is not considered in the search process, but simply passed on to the resulting output, as information for the user, for fast retrieval in the lexicon file.

In PhonotacticSearch, some matching attempts are performed within the lexical analyser so as to sort out irrelevant lexical entries at an early stage, before more complex patterns are searched for. It is examined whether the name of the sublexicon of which the entry is a member matches the user's specification for it in the parameter file, whether the dialect code and the origin code match the user's specifications, and whether the word's being or not being a compound matches the user's specification. The queries that are passed to the matching procedures are the following:

- Does the part-of-speech of the lexical entry match the specification for part-of-speech?[23]
- Does the stress pattern of the transcription[24] match the stress pattern specified?
- Is the number of syllables of the transcription equal to the number specified?
- Are the patterns specified as substrings in the morphological decomposition string really found in the morphological decomposition string?
- Are the correspondence patterns relating some substring in the transcription string to some substring in the morphological decomposition string really found in the transcription string and the morphological decomposition string?
- Do the patterns for syllable structure match the syllable structure in the transcription string?

[23] This might seem like a simple check that could be handled in the lexical analysis, like the question of compounds. However, it can be a little more complicated, since the part-of-speech specification might be negatively specified, such as "search for lexical entries where the part-of-speech is not X". But it is certainly the simplest question passed to the matching module.
[24] A lexical entry might contain two transcriptions; in those cases both transcriptions are checked together with the rest of the information in the lexical entry.

The queries are ordered so that the simpler ones are executed before the more complex ones; in the program this is stated as a conditional saying: "if the simple checks have succeeded, perform more complex checks" (a sketch of this ordering is given below). This ordering implies that if a simple match attempt fails, the more complex matching procedures need not be considered; thus, it assures that the complex matching procedures are not invoked unnecessarily, which would be a waste of execution time. For every lexical entry passed from the lexical analyser to the matching procedures, an indication is returned of whether the entry matches the search pattern or not. If the lexical entry matches the search pattern it will be put in an array of relevant entries, or rather, the transcription, the morphological decomposition string, the index of the lexical entry and the index of the transcription will be put in an array of similar clusters. The array is sorted and printed to the output file when the whole input file has been searched through.
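A minimal sketch of this cheapest-first ordering is given here. The field names in the entry and specification hashes, the part-of-speech tag and the toy data are assumptions for illustration; PhonotacticSearch 1.0 reads its specification from a separate parameter file.

    use strict;
    use warnings;

    sub entry_is_relevant {
        my ($entry, $spec) = @_;

        # Simple checks first: a single failure makes the entry irrelevant and
        # the more complex checks below are never invoked.
        return 0 if defined $spec->{pos}
                 && $entry->{pos} ne $spec->{pos};
        return 0 if defined $spec->{stress_re}
                 && $entry->{trans} !~ /$spec->{stress_re}/;

        # Word delimiters are ignored in this sketch; only syllable delimiters
        # (']') are considered.
        my @syllables = split /\]/, $entry->{trans};
        return 0 if defined $spec->{n_syll}
                 && @syllables != $spec->{n_syll};

        # The most complex check, matching syllable structure patterns, is only
        # reached when all the simpler checks have succeeded.
        for my $pos (keys %{ $spec->{syll_structure} || {} }) {
            return 0 if $pos <= @syllables
                     && $syllables[$pos - 1] !~ /$spec->{syll_structure}{$pos}/;
        }
        return 1;
    }

    # Toy data: a five-syllable word with secondary stress on the last syllable.
    my $entry = { pos => 'NN', trans => '"2ar]be:]tar]fa]%mIlj' };
    my $spec  = { pos => 'NN', stress_re => '"2', n_syll => 5,
                  syll_structure => { 5 => '^%' } };
    print entry_is_relevant($entry, $spec) ? "relevant\n" : "irrelevant\n";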

3.2.2 The syntactic analysis in ParseTranscriptions

The goal in the development of the parser was to cover as many analysis levels as possible, in order to find as many potential errors as possible. Figure 8 illustrates the concept of "level" referred to in this context.

[Figure 8 shows the same transcription described at five levels: a suprasyllabic level, the syllable level (the string divided into syllables by ']'), a subsyllabic level (onsets, nuclei and codas), the stress pattern level (the stress markers) and the symbol level (the individual L&H+ symbols).]

Figure 8: Possible levels of description of the transcription of the Swedish word (transcription).

At the symbol level the parser not only checks for illegal symbols; violations against transcription conventions that operate on this level are also found. This procedure is described in Section 3.2.2.1. In Section 3.2.2.2 the procedure of finding incorrect stress patterns is described. In the syllable parse, syllable structure errors are found, as well as syllable boundaries inserted at the wrong place; this is described in Section 3.2.2.3.

3.2.2.1 Parsing at the symbol level

First, the transcription string is scanned for the following common but, regardless of language specific conventions, illegal sequences of characters:

- Duplications: Duplications of any character, possibly on a subsymbolic level, are rarely allowed. This is an assumed error[25] of a non-linguistic nature, often due to careless typing.[26]

- Syllable delimiter string-finally: If the last character of the transcription is a syllable delimiter, an error message will be printed to the output file.

- Illegal number of word delimiters: If the number of word delimiters found in the transcription string is not equal to the number of word delimiters found in the morphological decomposition string, an error message will be printed to the output file.

The second step is to divide the transcription string into symbols. (Thus, this is a sort of lexical analysis, even if it is performed within the parsing routines.) All legal symbols in a language are collected in a list in the language specific lexicons, and all subsequences in some transcription that do not match any of the legal symbols listed are considered illegal; if such a sequence is found, an error message will be printed.

Currently, there is no need to be able to refer to the internal structure of a symbol. However, at some time in the future one might want to have that opportunity, and therefore the procedure to extract symbol constituents is implemented. Access to the constituents of a symbol is accomplished in the following steps:

1. ';'-characters are inserted before every allowed symbol found in the string.
2. Since an erroneous transcription might begin with a non-head character, it is not certain that a ';'-character has been inserted string-initially. Therefore this is done explicitly: if the string does not already start with a ';'-character, a ';'-character is inserted at the beginning of the string.

When ';'-characters have been inserted before all symbols in the transcription string, the string is split at the ';'-characters (a sketch of this segmentation step is given below) and one (potential) symbol at a time is parsed in the following steps:

1. It is checked whether the symbol is a legal symbol for the current language. This is done simply by searching for the current symbol in the list of legal symbols found in the lexicon.
2. If the symbol contains diacritics (indicated by the diacritic initial character), the diacritic initial character and everything following it is removed from the symbol string and parsed separately. The diacritic should not violate the L&H+ diacritics syntax.
3. If the symbol now contains tail-characters (i.e. the string does not consist of only head-characters and/or a diphthong separator), the first non-head character (and non-diphthong separator) and everything following it is removed from the string, and thus the tail is extracted.

The next step is to check for violations against language specific transcription conventions; this procedure is described separately in Section 3.3.1.
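The segmentation into symbols might be sketched as follows. The symbol list below is a small made-up sample; in ParseTranscriptions 1.0 the legal symbols are read from the language specific lexicon.

    use strict;
    use warnings;

    # A small made-up sample of legal L&H+ symbols, for illustration only.
    my @legal_symbols = ('a', 'e:', 'I', 'u:', 'n', 'p', 'r', 's', 't', 'k', 'x\\', '"1', '"2', '%');

    # Longest symbols first, so that e.g. 'u:' is preferred over a shorter match.
    my $symbol_re = join '|', map { quotemeta } sort { length $b <=> length $a } @legal_symbols;

    sub split_into_symbols {
        my ($trans) = @_;
        # Step 1: insert ';' before every allowed symbol found in the string.
        $trans =~ s/($symbol_re)/;$1/g;
        # Step 2: make sure the string starts with ';', even if it begins with a
        # character that is not the head of any legal symbol.
        $trans = ";$trans" unless $trans =~ /^;/;
        # Split at the ';'-characters; the leading empty field is dropped.
        return grep { length } split /;/, $trans;
    }

    # Example: the syllable /skrIp/ is split into its symbols.
    print join(' ', split_into_symbols('skrIp')), "\n";   # prints: s k r I p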

[25] Since duplication of a character does not have to violate the L&H+ syntax, and since there might be cases where duplication is justified, the output message is merely a warning, not an error message.
[26] However, in Norwegian, the sequences 'nn' and 'll' are allowed, for example in the transcription /"1fi:]kE]nn=/ of the word (the figs). To handle this, an exception list has been added to each lexicon. (In Swedish and Danish these exception lists are empty.)

3.2.2.2 Stress parsing

The first step in the stress parsing procedure is to check if the string, in accordance with the transcription conventions for Swedish, is marked with secondary stress without being marked as a compound and without being a derived word. The definition of derived word in the parser is straightforward and perhaps somewhat crude: a word is derived if the morphological decomposition string contains one of the derivational affixes listed in the lexicon. When collecting derivational affixes into a list, one has to decide whether to give priority to precision or to recall.[27] For example: if one adds the prefix to the list of derivational affixes, one increases the recall value by (legitimately) considering a word such as (report) a derived word, but at the same time the precision value decreases, since e.g. the (non-derived) proper name would then be considered a derived word. If one decides to exclude it from the list, as is the case at the moment, the situation is the opposite, leading to a lowered precision value for the parser's output in the form of an overgeneration of warning messages.

In each of the three lexicons, there are specifications of legal stress patterns for transcriptions of words in the given language, formulated as regular expressions (a sketch is given below). For example, the legal stress patterns for Swedish are:

- Toneme 1: one and only one main stress marking, and no other stress marking.
- Toneme 2: one and only one main stress marking, optionally followed by one and only one secondary stress marking.

If the transcription does not match the pattern, and thus contains an illegal stress pattern, an error message will be printed. If the transcription contains one or more word delimiters, and thus two or more words, each word has to match one of the stress patterns described above. In Norwegian, there are rare stress patterns which are allowed only under certain circumstances; these will be discussed further in Section 3.3.1.5.
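A sketch of how such patterns can be stated as regular expressions is given here. The stress markers '"1' (toneme 1 main stress), '"2' (toneme 2 main stress) and '%' (secondary stress) are those used in the transcriptions in this document; the actual expressions in the Swedish lexicon may differ.

    use strict;
    use warnings;

    my @legal_stress_patterns = (
        # Toneme 1: exactly one '"1' and no other stress marking.
        qr/^[^"%]*"1[^"%]*$/,
        # Toneme 2: exactly one '"2', optionally followed by exactly one '%'.
        qr/^[^"%]*"2[^"%]*(?:%[^"%]*)?$/,
    );

    sub stress_pattern_ok {
        my ($word_trans) = @_;                    # transcription of one word
        for my $re (@legal_stress_patterns) {
            return 1 if $word_trans =~ $re;
        }
        return 0;
    }

    print stress_pattern_ok('"2fIsk]%rEt') ? "ok\n" : "illegal\n";   # ok
    print stress_pattern_ok('%fIsk]"2rEt') ? "ok\n" : "illegal\n";   # illegal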

3.2.2.3 Syllable parsing

The initial step of the syllable parsing process, i.e. the lexical analysis, is to split the transcription string at syllable and word delimiters[28], so that each sequence marked as a syllable in the string is extracted and can be parsed separately. First of all, the following two conditions must be satisfied:

1. The potential syllable has to contain a nucleus, i.e. a vowel or a syllabic consonant[29].
2. The potential syllable must not contain more than one nucleus.

If the sequence marked as a syllable contains one and only one nucleus, the rest of the structure of the syllable is checked. First, the syllable is split at the nucleus, and in this way an onset sequence and a coda sequence are extracted. Both the onset sequence and the coda sequence are compared to the lists of allowed and rare onsets and codas in the lexicon, and if the current onset sequence and/or coda sequence is not found in such a list, it is considered illegal and an error message will be printed.

[27] A precision value of 1, in e.g. searching for some pattern in a text, means that all hits are relevant, while a recall value of 1 means that all relevant instances in the text were found. In the ideal case both precision and recall are 1.
[28] If a word delimiter is inserted it implies a syllable boundary, and a syllable delimiter should not be inserted as well.
[29] In the Swedish transcriptions there are no, and should not be any, syllabic consonants and therefore the list of syllabic consonants is empty, while in e.g. Norwegian at least the symbol 'n=' (syllabic /n/) should be included.

In the next step of the syllable parsing process, the current syllable – if it is not string-initial – and the previous one are checked for MOP-violations according to the following procedure (a sketch is given below):

1. The coda of the first syllable is concatenated with the onset of the second syllable into a string, cons_cluster.
2. The onset of the second syllable, syll2_onset, is extracted.
3. If cons_cluster forms a legal onset:
   a. and the word is not marked as a compound and is not an acronym, an error message will be printed;
   b. if not a), a warning message will be printed.

If no MOP-violation has yet been found, then while cons_cluster is not equal to syll2_onset and cons_cluster is not empty, the initial consonant in cons_cluster is removed and the procedure is repeated from step 3.
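A minimal sketch of this check is given here. The onset list and the split_syllable() helper are simplified stand-ins for the language specific lexicon data and routines in ParseTranscriptions 1.0, and are assumptions made for illustration only.

    use strict;
    use warnings;

    my %legal_onset = map { $_ => 1 } ('r', 't', 'tr', 'k', 'sk', 'skr');

    sub split_syllable {
        # Very rough: everything before the first vowel is the onset,
        # everything after it is the coda (long vowels written with ':').
        my ($syll) = @_;
        $syll =~ s/["12%]+//g;                    # strip stress markers
        my ($onset, $nucleus, $coda) =
            $syll =~ /^([^aeiouEIOU8y]*)([aeiouEIOU8y]:?)(.*)$/;
        return ($onset // '', $nucleus // '', $coda // '');
    }

    sub mop_check {
        my ($syll1, $syll2, $is_compound_or_acronym) = @_;
        my (undef, undef, $coda1) = split_syllable($syll1);
        my ($onset2)              = split_syllable($syll2);
        my $cons_cluster = $coda1 . $onset2;

        while (1) {
            if ($legal_onset{$cons_cluster}) {
                return $is_compound_or_acronym
                    ? "warning: possible MOP-violation, syllables $syll1 and $syll2"
                    : "error: MOP-violation, syllables $syll1 and $syll2";
            }
            last if $cons_cluster eq $onset2 or $cons_cluster eq '';
            $cons_cluster = substr $cons_cluster, 1;   # drop the initial consonant
        }
        return '';                                     # no MOP-violation found
    }

    # /"2fIsk]%rEt/ is marked as a compound and therefore only gets a warning;
    # the same syllabification in a non-compound would yield an error.
    print mop_check('"2fIsk', '%rEt', 1), "\n";
    print mop_check('"2fIsk', '%rEt', 0), "\n";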

3.2.3 The matching procedures in PhonotacticSearch

As mentioned in the previous section, some lexical entries are sorted out as irrelevant to the search specification already in the lexical analysis phase. Some of the matching procedures called from the lexical analyser are, although more complex than the matching attempts performed within the lexical analysis, rather trivial and will not be described here. However, the most complex matching procedure is worth a description and this is found in Section 3.3.2.

3.3 Implementation

As described earlier, both the parser and the search engine involve three files: a phoneme lexicon, a set of parse or search parameters and the actual parsing or search routines. In the phoneme lexicon of the parser, the parsing conventions are represented as values in predefined templates. It is here that the advanced user of the transcription parser is supposed to make changes in order to improve the behaviour of the parser. The templates are described in Section 3.3.1. Most of the matching procedures in PhonotacticSearch are simple and need not be described within the frame of this document. However, the most complex matching procedure deserves to be commented on, and a description of this procedure is found in Section 3.3.2.

3.3.1 Representation of transcription conventions

There are several language specific conditions that should be satisfied for the transcriptions to be considered correct. In order to make the parser understand these conditions, one has to formulate them in a declarative and non-ambiguous manner. It should be noted that the conventions used for transcriptions in Swedish, Norwegian and Danish are in the first place formulated as guidelines for the transcribers, so as to gain consistency; they are not necessarily linguistically uncontroversial. An example of a convention that is not linguistically obvious is the distinction between derived words that should be marked with secondary stress and derived words that should not:

In transcriptions of derived words where one of the constituents is a potentially stress bearing derivational affix in Swedish [and the other(s) is/are real Swedish words in isolation], secondary stress should be marked; while in transcriptions of other derived words it should not. (Kylander, personal communication 02-02-2000)

For example, the derived word (insolvent), where is a real Swedish word in isolation and is a derivational affix potentially bearing stress, should be transcribed /kON]"2k8]%s`mE]sIg/[30] and not without a secondary stress marking, as in */kON]"2k8]s`mE]sIg/. As a contrast, the derived word (lecture) should be transcribed /"2fø:]re]lE:]snIN/, although there is a morphological boundary between /s/ and /n/. This is motivated by the fact that is not a derivational affix of the same quality as for example . However, this last example might be considered linguistically controversial, since the difference between an affix such as and an affix such as is quite vague. The tendency to bear stress does not seem to suffice to make the distinction between the two types of affixes, since might also bear stress, e.g. in (raising).

[30] As seen, the transcription does not violate MOP, although probably many phoneticians and other linguists would argue that MOP should be overruled by the morphological boundary between and , so that the syllable boundary should be between /s`/ and /m/, instead of between /8/ and /s`/. However, the Swedish transcription conventions say that MOP should only be violated in words that are marked as compounds, such as (fish dish) which is transcribed as /"2fIsk]%rEt/.

This convention is reformulated in a declarative form in ParseTranscriptions 1.0 as follows (a sketch of the corresponding check is given at the end of this section):

For each transcription where a secondary stress marker is found:
- IF the word is not a compound (i.e. if the morphological decomposition string does not contain a '+'-character)
- AND IF the morphological decomposition string does not contain an affix of the type (affixes of this quality, a restricted set, are collected in a list)
=> THEN there should not be a secondary stress marker in the given transcription.

There are general phonotactic constraints in Swedish, Norwegian and Danish that have been overruled by the transcription conventions to the benefit of some other constraint. As an example: according to Sigurd (1965: 142), long vowels only occur in stressed syllables in Swedish. This is indeed a declarative statement and it could easily be implemented in the parser. However, there are two reasons why the Swedish transcription conventions make this infeasible. First, as discussed earlier, the conventions say that secondary stress should only be marked in compound words and in some derived words. This implies that a long vowel can occur in what might look like an unstressed syllable, but in fact is a stressed syllable, as is the case with e.g. the phoneme /o:/ in /"2al]kU]ho:l/ (alcohol). The decision to mark secondary stress only in compound words and some derived words has the consequence that many phonotactic constraints referring to the concept of "unstressed syllable" cannot be implemented in the parser. Secondly, the Swedish conventions explicitly state that long vowels are allowed in unstressed syllables, if this reflects the quality of the vowel better than its short counterpart, as with the phoneme /e:/ in /"2ar]be:]tar]fa]%mIlj/ (working family). Because of these two conventions, the implementation of Sigurd's statement would lead to massive overgeneration of error messages.

In the language specific lexicon of ParseTranscriptions 1.0 there are five templates into which conventions can be put; each one of these templates is described below.
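The declarative reformulation of the secondary-stress convention above might be checked along the following lines. The affix list and the example data are made-up samples for illustration; the real list of potentially stress bearing derivational affixes is kept in the Swedish lexicon of the parser.

    use strict;
    use warnings;

    my @stress_bearing_affixes = ('över', 'under');   # illustrative only

    sub check_secondary_stress {
        my ($trans, $decomp) = @_;

        return '' unless $trans =~ /%/;               # no secondary stress marker found
        return '' if $decomp =~ /\+/;                 # compound: secondary stress is expected
        for my $affix (@stress_bearing_affixes) {
            return '' if index($decomp, $affix) >= 0; # derived with a stress bearing affix
        }
        # Neither a compound nor a relevant derivation: the secondary stress
        # marker is probably not wanted here.
        return "warning: possible non-compound with secondary stress ($trans)";
    }

    # Toy data for illustration only (not taken from the NST lexicons):
    for my $pair (['"2fIsk]%rEt', 'fisk+rätt'], ['"2lE:]%sa', 'läsa']) {
        my $msg = check_secondary_stress(@$pair);
        print $msg ? "$msg\n" : "ok: /$pair->[0]/\n";
    }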

3.3.1.1 Forbidden sequences

If there are sequences that are not legal in the transcription strings of some language, these can be filled into this template. Violations of these conditions yield error messages, and therefore one has to be sure that there are no contexts where the sequence actually is legal, since this would make precision less than 1, in which case a warning message would be more appropriate. An example of a convention that is assumed to have no exceptions is the Danish convention putting restrictions on the contexts where "stød"[31] is allowed:

Stød is applied to either a long vowel or a short vowel followed by a voiced consonant. In case of a long vowel followed by a voiced consonant, the stød is assigned after the long vowel – not after the consonant. In case of stød to a diphthongised sequence of two vowels, the stød is placed after the second vowel, and neither of the vowels is marked as being long: bur /"1buO?/. Stød can never be assigned to a stress-less syllable. (Bjerg-Nielsen, 2000: 18)

[31] "Stød" is the name for the glottal plosive /!/ that occurs in the specified context in Danish.

This implies that the following sequences are forbidden in Danish transcriptions, and can thus be put into the template for forbidden sequences:

1) A syllable delimiter (or the beginning of the transcription string) + some number of Danish consonants and vowels + the stød-symbol. (Since the syllable is not marked as stressed.[32])
2) A long vowel + anything that is not the stød-symbol + the stød-symbol. (Since the stød-symbol must follow the vowel directly.)
3) A short vowel + zero or more occurrences of anything that is not a voiced consonant + stød. (Since stød can only occur in a syllable with a short vowel as nucleus if the vowel is followed by a voiced consonant.)[33]

[32] According to the Danish conventions, primary and secondary stress is always marked, regardless of whether the word is a compound, a derived word or neither. Therefore one might refer to the concept of "unstressed syllable" in Danish, but not in Swedish.
[33] Apart from a voiced consonant, the character following the short vowel should not be the length mark (since that would make the vowel long) or a syllable boundary (since this would imply the end of the syllable, and the sequence should only be searched for within one syllable).

3.3.1.2 Unusual sequences

There is a template for sequences that are unusual in the given language, such that transcriptions where an unusual sequence is found will yield a warning message. An example of how this template is used is the formulation of the context of devoicing in Swedish. The devoicing convention says:

Voiced plosives [and /v/] shall be devoiced when preceded by a short vowel and followed by an unvoiced consonant [that is a plosive or /f/].[34] (Kylander, 2000b: 1)

To fit this condition into the unusual-sequence template, one has to reformulate the rule as: a sequence consisting of a short vowel, one of the consonants /b/, /d/, /g/ or /v/ and an optional syllable boundary should not be followed by one of the consonants /p/, /t/, /k/ or /f/ (a sketch of this pattern is given below). As there are exceptions to this convention, such as (merchandiser), which is transcribed as /a]"2fE:s`]%Id]ka]re/ (/d/ preceding /k/), it cannot be put into the template for forbidden sequences.

[34] The modifications to the convention, i.e. the parts marked with square brackets, were made 06-04-2000 after a discussion with Kylander.
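The reformulated devoicing context might be stated as a warning pattern along these lines. The vowel inventory below is a made-up sample; the parser takes its character classes from the Swedish lexicon.

    use strict;
    use warnings;

    my $SWE_VOWEL = '[aeiouyEIOU8]';      # made-up inventory, for illustration only

    # A short vowel (here: a vowel not followed by the length mark ':'), one of
    # /b d g v/, an optional syllable boundary ']' and one of /p t k f/.
    my $devoicing_context = qr/$SWE_VOWEL(?!:)[bdgv]\]?[ptkf]/;

    for my $trans ('a]"2fE:s`]%Id]ka]re', '"2al]kU]ho:l') {
        print $trans =~ $devoicing_context
            ? "warning: probably unwanted sequence in /$trans/\n"
            : "ok: /$trans/\n";
    }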

3.3.1.3 Transcription correspondence to some decomposition subsequence

There is much regularity in the relation between the pronunciation of a word and its spelling. The parser has a template into which one may put conventions saying that if some substring is found in the morphological decomposition string, some substring is expected in the transcription string. For example, there is a Swedish convention saying that if the string is found in the decomposition string, then the phoneme /N/ is expected in the transcription string. (Note that there is no morphological boundary between and as there is in for example !)

3.3.1.4 Transcription sequences allowed only if origin code is equal to X

According to the Norwegian transcription conventions (Nygaard-Øverland, 2000), the transcriptions should "reflect the spoken language variant of Norwegian, Ålesund bymål". However, there are 'nynorsk' words in the Norwegian lexicon that are not found in Ålesund bymål ('bokmål'), for example (parish) and (helped). These words are, or at least should be, marked with the origin code nno.NOR, while the origin code of 'bymål' words is nor.NOR. As there are sequences that are allowed in 'nynorsk' but not in 'bymål', the template for stating

The sequence X is allowed in the transcription string if and only if the origin code of the lexical entry is equal to 

has been added to each language specific lexicon of the parser. As an example: the only possible transcription of the 'nynorsk' word is /"1sOkn/, but since /kn/ is not an allowed coda in 'bokmål', this transcription would receive an error message. However, one could choose to use the template described for saying

The sequence /kn]/ is allowed in the transcription string if and only if the origin code of the lexical entry is equal to nno.NOR

and this measure would block the generation of an error message. (/kn/ is an allowed coda in Norwegian.) This would probably be the desired behaviour of the parser, at least in cases like these, where there is no possibility of transcribing the word in a way that is allowed in 'bokmål'. In Danish and Swedish, where the check is irrelevant, the template is left empty.

3.3.1.5 Stress patterns allowed only if conditions X, Y and/or Z are satisfied

In some cases one might want to put further restrictions on the contexts where a specific stress pattern is allowed; there might be several conditions that all have to be satisfied, or one out of several conditions that has to be satisfied. As an example, there is a transcription convention used for Norwegian, specifying the contexts where secondary stress may precede main stress:

Acronyms […] may receive secondary stress also in front of the main stress when pronounced letter by letter. This also applies to acronym derivations. (Nygaard-Øverland, 2000: 12)

This pattern is also allowed in proper names (Atterås, personal communication 07-04-2000). One might reformulate this as: IF a transcription is found where secondary stress occurs in front of the main stress, THEN the word has to be either an acronym, an acronym derivation or a proper noun. Thus, one has to know the definition of an acronym. Given the file format assumed for the input file to ParseTranscriptions 1.0, there are two independent sufficient indications that a lexical entry is an acronym, namely:

1. The lexical entry has an expansion.
2. The morphological decomposition string contains (at least) two juxtaposed capital characters.

This suggests that there might be lexical entries for which none of these conditions are satisfied that still are acronyms. There are actually such cases, but they are most often due to errors in the decomposition string[35] or to the fact that no appropriate expansion has been found; therefore they have not been given any further consideration. Given these conditions, the following rule can be stipulated (a sketch of the check is given below):

Secondary stress preceding main stress is allowed in Norwegian transcriptions if and only if at least one of the following conditions is satisfied:
1) The morphological decomposition string of the lexical entry contains at least two juxtaposed capital characters.
2) The expansion field of the lexical entry is not empty.
3) The part-of-speech of the lexical entry is that of a proper name.

The implementation of the convention described above has a disjunctive form, in that at least one of the conditions must be satisfied for the transcription pattern to be allowed. However, there might be other cases where several conditions have to be satisfied in order to make a specific transcription pattern allowed; such rules would have a conjunctive form. At the moment there are no such cases for any of the three languages. As a conclusion, the following parameters must be set in the template of combined conditions (apart from the transcription string for which the combined conditions must be satisfied):

1) Restrictions on the morphological decomposition string
2) Restrictions on the expansion
3) Restrictions on the part-of-speech
4) A connective joining restrictions 1, 2 and 3: 'AND' for a conjunction, 'OR' for a disjunction.

[35] For example, the morphological decomposition string for the Swedish word (the UN headquarters) is . This, and the fact that the entry has no expansion, has the consequence that the parser will consider the (correct) transcription /"2ef]en]hø:g]kva]%t`e:]ret/ erroneous.
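The disjunctive rule above might be checked along the following lines. The field names, the proper-name tag 'PM' and the toy data are assumptions made for illustration only.

    use strict;
    use warnings;

    sub secondary_before_main_allowed {
        my ($entry) = @_;

        # Is the restricted pattern present at all (a '%' before the main stress)?
        return 1 unless $entry->{trans} =~ /%.*"[12]/;

        # Disjunction ('OR'): at least one of the three conditions must hold.
        return 1 if $entry->{decomp} =~ /[A-Z]{2}/;     # two juxtaposed capitals (simplified to A-Z)
        return 1 if length($entry->{expansion} // '');  # expansion field not empty
        return 1 if ($entry->{pos} // '') eq 'PM';      # proper name (tag assumed)

        return 0;                                       # not allowed: an error message is due
    }

    # Toy data for illustration only (not taken from the NST lexicons):
    my $acronym = { trans => '%a:]"1be:', decomp => 'AB',  expansion => '', pos => 'NN' };
    my $word    = { trans => '%a:]"1be:', decomp => 'abe', expansion => '', pos => 'NN' };
    print secondary_before_main_allowed($acronym) ? "allowed\n" : "error\n";   # allowed
    print secondary_before_main_allowed($word)    ? "allowed\n" : "error\n";   # error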

3.3.1.6 Geminate consonants over syllable boundaries allowed or not

In all three languages there is a convention stating that the same consonant should not occur on both sides of a syllable boundary in non-compound and non-derived words (see the definition above). However, since the conventions are frequently modified, the advanced user can choose to allow or to disallow this kind of duplication.

3.3.2 matchingSyllSpecs

In this section, the implementation of the most complex matching procedure in PhonotacticSearch is described. As a first step, a secondary stress marking is inserted in disyllabic toneme 2 words which do not already contain such a marking. This is done e.g. in a word like the Swedish (read), which should be transcribed as /"2lE:]sa/, since there is no morphological boundary[36] between the syllables /"2lE:/ and /sa/. In a search for lexical entries where the second syllable should bear secondary stress, /"2lE:]%sa/ will be considered relevant, while /"2lE:]sa/ would not.

[36] See the discussion in Section 3.3.1.

As a next step, the transcription is divided by syllable boundaries into syllables. If the main stress of the word is not found at the syllable where it should be found according to the search specification, the procedure is interrupted, as the transcription is irrelevant. If the position of the syllable expected to bear main stress is specified by a variable, all occurrences of this particular variable are replaced with the position of the syllable actually bearing main stress. For example, if the following parameter values are specified:

$mainStressPos = "X";
$STRUCTURE_X_FOR_SYLLABLE_Y{'X'} = "($SWE_VOWEL)($SWE_CONSONANT)";
$STRUCTURE_X_FOR_SYLLABLE_Y{'X+1'} = "($SWE_CONSONANT)($SWE_VOWEL)";

and the syllable with position 3 was found to bear main stress, the result of the procedure would be:

$mainStressPos = "3";
$STRUCTURE_X_FOR_SYLLABLE_Y{'3'} = "($SWE_VOWEL)($SWE_CONSONANT)";
$STRUCTURE_X_FOR_SYLLABLE_Y{'3+1'} = "($SWE_CONSONANT)($SWE_VOWEL)";

The same steps are taken if the position of the syllable bearing secondary stress is specified by a variable. Next, all syllable structure specifications where the position of the syllable is explicit are searched for in the transcription, and if a mismatch is found, the procedure is interrupted. If there are syllable structure specifications left where the position of the syllable is not explicit, but represented by a variable, the following steps will be taken (a sketch is given below):

1. The positions of all syllables matching the specified pattern are collected in an array, @matchingPositions.
2. If @matchingPositions is empty, i.e. if no syllable was found that matched the pattern, the procedure is interrupted.
3. Else, for every position N in the @matchingPositions array:
   - If there is no specification for a syllable at position N+1, the procedure is terminated successfully.
   - If there is a specification for a syllable at position N+1, the syllable at that position in the transcription is examined. If a mismatch is found, or if N+1 is greater than the total number of syllables in the transcription, N is removed from @matchingPositions.

If the @matchingPositions array is empty at this stage, the lexical entry will be considered irrelevant to the search.
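A sketch of the variable-position part of this procedure is given here. The hash mirrors the %STRUCTURE_X_FOR_SYLLABLE_Y parameter shown above; the character classes and everything else are simplified assumptions for illustration.

    use strict;
    use warnings;

    my $SWE_VOWEL     = '[aeiouyEIOU8](?::)?';    # made-up class, for illustration only
    my $SWE_CONSONANT = '[bdfghjklmnprstvNCSx]';  # made-up class, for illustration only

    my %STRUCTURE_X_FOR_SYLLABLE_Y = (
        'X'   => "($SWE_VOWEL)($SWE_CONSONANT)",
        'X+1' => "($SWE_CONSONANT)($SWE_VOWEL)",
    );

    sub matches_variable_specs {
        my ($trans) = @_;
        my @syllables = split /\]/, $trans;

        # Step 1: collect the (1-based) positions of all syllables matching the
        # pattern given for the variable position 'X'.
        my @matchingPositions =
            grep { $syllables[$_ - 1] =~ /$STRUCTURE_X_FOR_SYLLABLE_Y{'X'}/ }
            (1 .. scalar @syllables);

        # Step 2: no matching syllable at all makes the entry irrelevant.
        return 0 unless @matchingPositions;

        # Step 3: keep only the positions for which the syllable at N+1 also
        # matches its specification (if there is one).
        if (my $next_spec = $STRUCTURE_X_FOR_SYLLABLE_Y{'X+1'}) {
            @matchingPositions = grep {
                $_ + 1 <= @syllables && $syllables[$_] =~ /$next_spec/
            } @matchingPositions;
        }
        return @matchingPositions ? 1 : 0;
    }

    # /"2ar]be:]tar]fa]%mIlj/: the syllable /ar/ (vowel+consonant) is followed by
    # /be:/ (consonant+vowel), so the entry is considered relevant.
    print matches_variable_specs('"2ar]be:]tar]fa]%mIlj') ? "relevant\n" : "irrelevant\n";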

3.4 Results

At this stage it should be clear that the output messages are divided into error messages and warning messages. For the errors, precision is maximised, i.e. the number of irrelevant messages is minimised, while for the warnings, recall is maximised, i.e. there should be no errors left unfound in the file (although some warnings might be irrelevant). The number of errors and warnings generated and the distribution of the separate error and warning types are presented in Table 7 and Table 8.

The parser was tested 19-04-2000 at NST, Voss, on the file ALL_se_col1_ci_sorted, which contains 149 049 entries and 150 454 transcriptions (an entry might contain two alternative transcriptions). 35 831, or 23.81%, of the total number of transcriptions received at least one warning or error message. (All transcriptions in the file, except for the 1774 transcriptions whose value for "garbage flag" was equal to 1, were parsed.) When running the program on a computer with a Pentium III 500 MHz processor, with 128 MB of 100 MHz synchronous dynamic random-access memory (SDRAM), the parsing speed is about 250 lexical entries per second. Examples from the parsed file are presented in Appendix 3.

 #  Error message                                Number of occurrences        %
 1  MOP-violation, syllables and                                  2377   62.06%
 2  Syllable contains more than one nucleus ()                     378    9.87%
 3  Not a Swedish onset ()                                         364    9.50%
 4  Not a Swedish coda ()                                          266    6.95%
 5  Onset is not allowed word-initially ()                         178    4.65%
 6  The number of is incorrect                                     111    2.90%
 7  Syllable contains no nucleus ()                                 89    2.32%
 8  Illegal position for syllable delimiter                         38    0.99%
 9  Illegal Swedish symbol ()                                       18    0.47%
10  Contains illegal stress pattern                                 11    0.29%
11  Contains ill-formed diacritic ()                                 0       0%
12  Unwanted sequence of and                                         0       0%
    TOTAL NUMBER OF ERROR MESSAGES                                3830  100.00%

Table 7: Distribution of error messages in output file

When looking through the distribution of error messages in Table 7, one might notice that there are two error messages, numbers 11 and 12, which were never generated. Error message number 11 does not occur since there are, at the moment, no diacritics in the Swedish transcriptions. The reason why error message number 12 is never generated is that there is no specification in the lexicon of sequences that must never occur successively in Swedish.


 #  Warning message                              Number of occurrences        %
 1  possible MOP-violation, syllables and                        18670   58.34%
 2  possible non-compound with '2                                13265   41.45%
 3  probably unwanted sequence of and                             2429    7.59%
 4  correspondence // to not found                                1424    4.45%
 5  is a rare onset                                                305    0.95%
 6  geminate consonant X over syllable boundary                    223    0.70%
 7  is a rare coda                                                 208    0.65%
 8  probably unwanted duplication of X                              14    0.04%
 9  // allowed only in words with language code                      0       0%
    TOTAL NUMBER OF WARNING MESSAGES                             32001  100.00%

Table 8: Distribution of warning messages in output file

The results in Table 8 show that the number of warning messages is several times greater than the number of error messages. This is consistent with, or rather a consequence of, the fact that the precision value is smaller for warnings than for errors. The large percentage of warning 1 is a consequence of maximising the recall value. If this check is activated, the parser will find illegal MOP-violations of the type illustrated by the incorrect transcription /"2fIsk]%rEt]er/ of the word (fish dishes). (The second syllable boundary should precede the /t/.) These types of illegal MOP-violations will not generate an error message; the only MOP-violations that generate an error message are those that occur in non-compound words and non-acronyms. For all other MOP-violations it is not possible to state which are allowed and which are not, since the parser cannot know the position of the morphological boundary in the transcription. For example, the correct transcription /"2fIsk]%rE]ter/ would receive a warning message, since there is no way for the parser to know that the morphological boundary in requires a MOP-violation between the first and the second syllable. The large percentage of warning message 2 is also a consequence of maximising the recall value; a discussion about this is found in Section 3.2.2.2. The user might find the overgeneration of messages 1 and 2 annoying, and the advanced user might want to set the parser not to generate these warning messages. The corresponding results for Norwegian and Danish are found in Appendix 4.

An earlier version of the parser[37] was used in a production setting with satisfactory results at NST on 02-03-2000, with the following values for the parsing parameters (the processes generating messages marked with '*' had not yet been implemented at the time):

Infile: seLex06.txt (a Swedish lexicon file)
Selection of transcriptions: lexical entries from lexicons with names containing the substrings enter_se or spd_se
Value for "garbage flag": 0

Error messages to generate:
- MOP-violation, syllables and
- Syllable contains more than one nucleus ()
- Not a Swedish onset ()
- Not a Swedish coda ()
- Onset is not allowed word-initially ()
- The number of is incorrect
- Syllable contains no nucleus ()
- Contains illegal stress pattern
- Illegal Swedish symbol ()
- Contains ill-formed diacritic ()
- Unwanted sequence of and

Error messages not generated:
- Illegal position for syllable delimiter*

Warning messages to generate: (none)

Warning messages not generated:
- possible non-compound with %
- probably unwanted sequence of X and Y
- correspondence /X/ to not found
- X is a rare onset
- X is a rare coda
- probably unwanted duplication of X
- possible MOP-violation, syllables X and Y
- geminate consonant X over syllable boundary*

[37] The lexicon of the earlier version of the parser has been modified for version 1.0. The affixes , , and have been added to the list of derivational affixes, together with their inflected forms. Some conventions in the earlier version of the parser, those checking velarisation, have been modified, since the parser did not distinguish between /n/ and /n`/ (the non-retroflex /n/ and the retroflex /n/, respectively) when looking for velarisation in the transcription string where the orthographic sequence (or ) was found. The convention is that no retroflex /n/ can be velarised. The earlier version of the parser assumed that all orthographic sequences , preceded by a character that is not and possibly separated by a morphological boundary, should correspond to /Nk/ in the transcription string. A word for which this is not true is (the road leading to Enköping), where the sequence corresponds to /nC/. This formulation has been modified as well. The convention saying that /e/ is not allowed preceding a retroflex in a stressed syllable has also been implemented in version 1.0.

The results of running PhonotacticSearch 1.0 are satisfactory, as the search engine does what is expected. The search engine was tested 19-04-2000 at NST, Voss, on the Swedish lexicon file ALL_se_col1_ci_sorted, which contains 149 049 entries and 150 454 transcriptions (an entry might contain two alternative transcriptions). One rather simple query and one more complex query were given as input to the search engine, to examine the difference in processing time. As shown in Table 9, the complex search takes almost twice as long as the simple one. The program was run on a computer with a Pentium III 500 MHz processor, with 128 MB of 100 MHz synchronous dynamic random-access memory (SDRAM).

Search specification 1 (processing time: 60 seconds):
- Transcriptions from lexicons whose name contains the substring //
- Transcriptions with dialect code
- Transcriptions with origin code
- Toneme 2 words
- Number of syllables: 4
- Main stress on the first syllable
- Secondary stress on the 4th syllable
- The structure for syllable 4: []*x\u:n
- Part-of-speech: unspecified
- The word could be a compound or a non-compound
- The following substring should/should not be found in orthographic form:

Search specification 2 (processing time: 117 seconds):
- Transcriptions from lexicons whose name contains the substring //
- Transcriptions with dialect code
- Transcriptions with origin code
- Toneme 2 words
- Number of syllables:
- Main stress on the Xth syllable
- Secondary stress on the Zth syllable
- The structure for syllable Y: [%]*($SWE_FRICATIVE)($SWE_LONG_VOWEL)($SWE_NASAL)
- The structure for syllable Z: [%]*($SWE_FRICATIVE)($SWE_LONG_VOWEL)($SWE_NASAL)
- The structure for syllable X+1: ($SWE_FRICATIVE)($SWE_LONG_VOWEL)($SWE_PLOSIVE)?($SWE_FRICATIVE)?
- Part-of-speech: unspecified
- The word could be a compound or a non-compound
- The following substring should/should not be found in orthographic form:

Table 9: The search specifications and the processing times for searches in the Swedish lexicon file ALL_se_col1_ci_sorted. The second search (2) is more complex than the first (1).


4 Discussion

The parser described in this document diverges from conventional parsers. Obviously, this statement needs to be substantiated, and as a first step, the notion of "conventional parser" has to be straightened out. In this context, "conventional parser" refers to a device that analyses the structure of a text either in an artificial language such as C++, or in a natural language such as English. The following discussion will show why the parser described in this thesis is different from these types of "conventional" parsers.

The syntax of a programming language such as C++ might be divided into two parts: the identifier (or symbol) syntax and the syntax of identifier combination, or how the symbols are organised. The identifier syntax regulates the lexical analysis in that it decides what sequences of characters should be considered one symbol. For example, the sequence var1+var2 will be divided into three symbols: var1, + and var2, since each of these (sub)sequences is identified as a legal symbol, or identifier. The other syntax governs the combination of identifiers and analyses the hierarchical structure of sequences of identifiers. For example, the hierarchical structure of the phrase var3=var1+var2 is illustrated in Figure 9.

[Figure 9 shows a tree with '=' at the root, dominating var3 and a '+' node, which in turn dominates var1 and var2.]

Figure 9: The hierarchical structure of the C++ phrase var3=var1+var2.

C++ is a constructed language: both of its syntaxes are imposed by formal rule, and the distinction between syntactically correct and incorrect phrases is clear.

The syntax of a written natural language such as Swedish can also be divided into several parts. One part is the traditional orthography, governing which sequences of characters constitute a word and, without defining it further, the spelling of a word. This syntax regulates the lexical analysis. (Of course, the morphological structure of a word can be analysed as well, according to the morphological syntax of the given language. However, for our purposes it is not necessary to consider this syntax further.) Another syntax of written natural language is the syntax governing how the words combine into phrases and the hierarchical structure of a phrase. Although a natural language grammar is a formal system, it is regulated by something outside the formal system; in other words, the syntaxes are not imposed by formal rule, as is the case with programming languages.

The input to a transcription parser is transcriptions, i.e. orthographic representations of the pronunciation of words or phrases in a natural language. The transcription syntax can also be divided into parts. The syntax of the transcription language, for example XSAMPA or L&H+[38], governs the lexical analysis, i.e. what sequences of characters constitute a symbol, and also, in some way, how these symbols are organised. A transcription language is an artificial language and thus imposed by formal rule. However, the transcription language has been constructed to describe natural language, and the ways in which the symbols combine are governed by natural language syntax, in addition to the transcription language syntax. As a consequence, the transcriptions must satisfy both syntactical constraints set by the constructed transcription language and syntactical constraints set by the natural language being described. The main difference between a transcription language and written natural language is that the orthography of transcriptions is constructed, while the orthography of written natural language is dictated by natural language constraints. However, regardless of this seemingly small difference, the way a transcription parser has to merge artificial and natural constraints makes it unique.

[38] XSAMPA and L&H+ are most often referred to as phonetic alphabets, and as an alphabet is a part of the description of a language, it might seem inappropriate to talk about them as transcription languages. This is true on one level, but not on another. To clarify: each symbol in XSAMPA or L&H+ represents one phoneme in the given natural language, and the ways the phonemes combine are governed by the natural language syntax. In this respect, phonetic alphabet is a more appropriate term than transcription language. However, the symbols in XSAMPA and L&H+ are composite, and the ways single characters combine into a symbol, and how symbols combine, are governed by the syntax of XSAMPA and L&H+, respectively. In this respect, transcription language is an appropriate description. It might be confusing to talk both of phonetic alphabets and transcription languages when referring to the same thing (XSAMPA or L&H+), and therefore only one term, transcription language, will be used henceforward.

According to the specification of the task, found in Section 1.1, the transcription parser should be able to identify as many errors and possible errors as possible, at as many levels as possible. The developed parser achieves that goal; the power of ParseTranscriptions 1.0 is limited mostly by the number of conventions implemented in the language specific lexicon part of the parser and the degree of specificity of these conventions. If the advanced user does not specify any transcription conventions in the lexicon, the parser will still find errors. One type of error is found in the syllable structure, e.g. if a sequence marked as a syllable does not contain one and only one nucleus, or if the position of a syllable boundary causes a MOP-violation. The parser will also find errors in the symbol structure, for example illegal symbols or unwanted duplications of characters. These types of errors violate conventions that are not as linguistically oriented as the deviations described earlier, but rather formulated by the L&H+ syntax. This merging of constraints of the constructed language L&H+ and constraints set by the natural language makes the parser unique.

If the advanced user chooses to use the transcription convention templates in the lexicon for the given language, the power of the parser will increase. There is no limit to how many conventions can be put in the lexicon, and there is no limit to how specific the conventions can be. Examples of transcription conventions that might be put in the lexicons are the phonological rules of devoicing and velarisation. As the parser is expandable, the work of improving the parser's performance can proceed as long as there is use for the tool.

The goal in the development of the phonotactic search engine, to construct a powerful and flexible tool for finding phonotactic patterns, has also been reached. Of course, PhonotacticSearch 1.0 might be used simply for extracting phonotactic patterns in transcriptions, but, as described in Section 6, the user might choose to search for more complex patterns by referring for example to the orthographic form of the word. The search engine will be a useful tool when improving the performance of the parser as discussed above. The phonotactic toolkit, including both the transcription parser and the phonotactic search engine, can be used (and has been used) industrially to speed up the work of transcribing. This acceleration of the transcribing process saves a lot of work, which is obviously advantageous from an economic point of view.

4.1 Future prospects

Version 1.0 of the parser as a whole could be considered final, but one might want to add or modify a few details. As the structure is rather clear and straightforward, with a clear difference between lexicon and rules, this should not be a problem; the templates, i.e. the lexical entries, are already there and the rules are already formulated.

The listing of allowed and rare Swedish codas should be taken with caution and needs to be checked. It is originally based on the listing in Sigurd (1965), but has been modified to some extent. For example, Sigurd's list contains the codas 'rt' and 'rd', which have been "translated" into [t`] and [d`]. Sigurd's idea of so-called "morphological filters" has been adopted here as well, but to the parser these filters are of course nothing more than strings of characters. All codas that are not listed by Sigurd but still occur in the file ALL_se_col1_ci_sorted have been listed, and some of them – the ones that are common and uncontroversial – have been added to the parser's list of legal codas. The set of legal codas in Swedish is not so much a result of linguistic considerations as it is a result of trying to maximise both precision and recall, even though, in this case, both approaches would lead to the same result. As a conclusion: the establishment of a boundary between codas that are allowed and codas that are rare is not simple, and the list as it is now is merely a rough sketch.

The list of possible derivational affixes in the lexicon should be treated with the same caution as the list of legal codas. The selection of derivational affixes is based on the same principle as the selection of codas: to maximise precision and recall. An additional problem arising when the set of derivational affixes is limited has already been discussed in Section 3.2.2.2: the issue of including or excluding affixes like e.g. , resulting either in a low precision value or a low recall value.

As mentioned above, the search engine works as expected, but some improvements might nevertheless be made in a later version. The processing time of complex queries might perhaps be annoying, but this is hard to change. If the complex search in Table 9 had been limited to e.g. words with four syllables, the processing time would have been just a little more than 60 seconds (67 seconds); but since the engine must search through all toneme 2 words, the search necessarily takes more time. One might also want the possibility of choosing how the results should be presented: either as a list of relevant hits, as at the moment, or as statistics, or perhaps as both. For example, one might want to formulate a search such as "What phoneme sequences X can follow the sequence Y in a syllable?". For that question, a way of presenting the result would be to list all possible instances of X together with the absolute and relative number of occurrences.

There are plans at NST to use ParseTranscriptions 1.0 as the engine of a graphical transcription tool (Warmenius, personal communication 12-04-2000). With this tool, each transcription would be parsed directly when written, so that no erroneous transcriptions would ever be inserted in a lexicon file and, as a consequence, no lexicon files would ever have to be parsed and corrected. This method would speed up the transcribing process considerably.

References

Aho, A. V., Sethi, R. and Ullman, J. D. (1986): Compilers. Principles, techniques, and tools. Reading, Massachusetts: Addison-Wesley.

Antworth, E. L. (1990): PC-KIMMO: a two-level processor for morphological analysis. Dallas, Texas: Summer Institute of Linguistics.

Bjerg-Nielsen, K. (2000): Transcription conventions for Danish v0.4. Company internal document within Nordisk Språkteknologi.

Chomsky, N. (1957): Syntactic structures. The Hague: Mouton.

Chomsky, N. (1965): Aspects of the theory of syntax. Cambridge: The M.I.T. Press.

Gussenhoven, C. and Jacobs, H. (1998): Understanding Phonology. Great Britain: Arnold (co-published by Oxford University Press, Inc.).

Hammond, M. (draft: 1995): Syllable parsing in English and French. University of Arizona.

Kylander, C. (2000a): Regler för delning av sammansatta ord (dekomponering), version 2.00. Company internal document within Nordisk Språkteknologi.

Kylander, C. (2000b): Checklist for corrections of the Enterlist. Company internal document within Nordisk Språkteknologi.

Nygaard-Øverland, H. (2000): Transcription conventions for Norwegian v1.5. Company internal document within Nordisk Språkteknologi.

Prince, A. and Smolensky, P. (1993): Optimality theory, ms., Rutgers University and University of Colorado.

Slethei, K. (1988): Kort innføring i taleteknologi. Department of phonetics and linguistics, University of Bergen.

Sigurd, B. (1965): Phonotactic structures in Swedish. Lund: Berlingska boktryckeriet.

Online sources

Antworth, E. L. (1996?): Introduction to Two-level Phonology. Summer Institute of Linguistics. [Referred to 17-04-2000]. Accessible online at http://www.sil.org/pckimmo/two-level_phon.html

Bergman, E. and Johnson, E. (1994-2000): Towards Accessible Human-Computer Interaction. Sun Microsystems. (A reprint of an excerpt from Nielsen, J. (ed.) (1995): Advances in Human-Computer Interaction, Volume 5. Ablex Publishing Corporation.) [Referred to 24-04-2000]. Accessible online at http://www.sun.com/tech/access/updt.HCI.advance.html

Wells, J. C. (04-02-1999 [date of last revision]): Computer-coding the IPA: a proposed extension of SAMPA. London: University College London. [Referred to 11-04-2000]. Accessible online at http://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf


Appendix 1: Terminology

A list of comprehensive definitions of some of the most important terms used in this document is presented below. Definitions marked with an asterisk (*) are collected from Nygaard-Øverland (2000); my own modifications or supplements within these definitions are given within square brackets.

*Allophonic transcription: See Transcription.

Assimilation: When we speak, there is a balance between underarticulation and overarticulation, i.e. we do not want to spend more effort on articulation than is necessary in order for the listener to understand what we are saying. When we underarticulate, the sounds affect each other in our striving to make smooth transitions between the sounds. One manifestation of underarticulation is assimilation, when one sound colours another sound or other sounds. An example is the word 'assimilation' itself; the word has its origin in the Latin morphemes 'ad' (to) and 'similis' (alike), and when put together, to 'ad+similis', the 'd' assimilates into 's'.

Coda: The optional closing segment of a syllable (consisting of only consonants).

Maximum Onset Principle (MOP): When inserting a syllable boundary between two syllables according to the MOP, the onset of the second syllable should first be maximised; then a legal coda of the first syllable is formed. (Gussenhoven & Jacobs, 1998) (A small code sketch illustrating MOP-based syllabification is given at the end of this appendix.)

Nuclei: The plural form of nucleus.

*Nucleus: The obligatory centre of a syllable, [most often] a vowel.

Onset: The optional opening segment of a syllable, only consisting of consonants.

*Phonemic transcription: See Transcription.

*Rhyme: The nucleus and coda together make up the rhyme of the syllable. E.g. in the monosyllabic word 'dog', phonemically /dog/, the onset is /d/ whereas /og/ makes up the rhyme. (/o/ acts as nucleus and /g/ as coda.)

*Syllable: A unit of pronunciation typically larger than a single sound and smaller than a word. The notion of syllable is very real to native speakers. A precise definition of the syllable does not exist. There are several theories in both phonetics and phonology. A phonological approach is taken in this document, since we focus on the way sounds combine in an individual language to produce typical syllables. Two classes of sounds emerge: those sounds which occur on their own, or at the centre of a syllable, and those sounds which cannot occur on their own, but rather at the edges of a syllable. The former are referred to as vowels or centres [or nuclei] and the latter as consonants or edges [or onsets and codas]. All syllables must have a vowel/centre, while consonants/edges need not always be present.

*Syllabic consonant: In Norwegian, a nasal or lateral consonant which in exceptional cases functions as the vowel/centre [or nucleus] of a syllable. This occurs as a result of a phonological process of surface reduction/elimination of an underlying vowel.

*Transcription: The method of writing down speech sounds in a systematic and consistent way.

*Phonemic transcription: The units to be transcribed are those which have linguistic function, i.e. the phonemes (as opposed to a phonetic transcription, which transcribes units according to their articulatory/auditory identity, regardless of their function).

*Allophonic transcription: Adds functional phonetic details. E.g.: syllabic consonants.
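As an illustration of the Maximum Onset Principle defined above, the following Python sketch syllabifies a sequence of segments by maximising the onset of each following syllable against a set of legal onsets. The vowel inventory and the set of legal onsets are small, hypothetical examples, and the legality of the resulting codas is not checked, so this is an illustration of the principle rather than the parser's actual algorithm.

# Minimal illustration of MOP-based syllabification over a list of segments.
# The vowel set and the set of legal onsets are hypothetical toy examples.
VOWELS = {"a", "e", "i", "o", "u", "E", "I", "O", "U"}
LEGAL_ONSETS = {"", "p", "t", "k", "s", "st", "str", "pl", "n", "d", "v"}

def syllabify(segments):
    """Split a list of segments into syllables: between two nuclei, as many
    consonants as possible go to the onset of the second syllable, and the
    rest become the coda of the first (coda legality is not checked here)."""
    nuclei = [i for i, seg in enumerate(segments) if seg in VOWELS]
    if not nuclei:
        return [segments]                      # no nucleus: nothing to split
    syllables = []
    start = 0
    for this_v, next_v in zip(nuclei, nuclei[1:]):
        cluster = segments[this_v + 1:next_v]  # consonants between two nuclei
        split = 0
        for i in range(len(cluster) + 1):      # maximise the second onset
            if "".join(cluster[i:]) in LEGAL_ONSETS:
                split = i
                break
        boundary = this_v + 1 + split
        syllables.append(segments[start:boundary])
        start = boundary
    syllables.append(segments[start:])
    return syllables

# Example: syllabify(list("ekstra")) -> [['e', 'k'], ['s', 't', 'r', 'a']]
print(syllabify(list("ekstra")))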


Appendix 2: Overview of XSAMPA

Below follows an overview of the symbols in the extended version of the Speech Assessment Methods Phonetic Alphabet (SAMPA), called XSAMPA, that are used in this document. The IPA symbols belonging to the ordinary Roman lower-case alphabet (e.g. u, x) remain the same and are not listed below. More information can be found online at http://www.phon.ucl.ac.uk/home/sampa/x-sampa (Wells, 1999).

XSAMPA               IPA
t` (` = ASCII 096)   ʈ    retroflex plosive, voiceless
d`                   ɖ    retroflex plosive, voiced
s`                   ʂ    retroflex fricative, voiceless
C                    ç    palatal fricative, voiceless
n`                   ɳ    retroflex nasal
N                    ŋ    velar nasal
l`                   ɭ    retroflex lateral approximant

Table I: Consonants in XSAMPA that deviate from IPA.

XSAMPA               IPA
}                    ʉ    close central rounded
8                    ɵ    close-mid central rounded
E                    ɛ    open-mid front unrounded

Table II: Vowels in XSAMPA that deviate from IPA.

XSAMPA               IPA
x\                   ɧ    simultaneous S and x
_                    ‿    tie bar

Table III: Other symbols.

XSAMPA               IPA
"                    ˈ    primary stress
%                    ˌ    secondary stress

Table IV: Suprasegmentals.

Appendix 3: Examples of output from ParseTranscriptions 1.0

An extract from the file ALL_se_col1_ci_sorted.parsed is presented below. In the file 35 883 transcriptions received at least one warning or error message and since all these could not be presented here, only a representative selection from the file is shown. (Omissions are indicated by the sequence '…'.)

"1]pens;70703;1;verb;opens;swe.SWE;;
e = illegal Swedish symbol ()
e = syllable contains more than one nucleus ("1)
...
"2se:d]%ven]jUr;14577;1;noun;sedvänjor;SWE;;
e = MOP-violation, syllables %vEn and jUr
e = MOP-violation, syllables "2se:d and %vEn
w = possible non-compound with %
...
"2tu:r]%val]a;94485;1;ngo;Torvalla;swe.SWE;;
e = MOP-violation, syllables %val and a
w = possible non-compound with %
...
"2jE:n`]%Ol]der;147360;1;noun;;swe.SWE;;
e = MOP-violation, syllables "2jE:n` and %Ol
w = possible non-compound with %
...
de]fen]"1si:v]spe]sI]a]%lI]ste]n`a;110643;1;noun;defensiv+specialisterna;swe.SWE;;
e = contains illegal stress pattern
...
pe]s`U]"1nA:]lI]a];127461;1;noun;personalia;swe.SWE;;
e = illegal position for syllable delimiter
...
"2fø]d`e:%]lak]tI]ga;41420;1;adj;fördelaktiga;swe.SWE;;
e = not a Swedish coda (%)
...
"1rIC]ter;77650;1;npl;Richter;swe.SWE;;
e = not a Swedish coda (C)
...
"2ny:]plan]%te:]rInk]ar;125997;1;noun;ny+planteringar;swe.SWE;;
e = not a Swedish coda (nk)
w = correspondence /N/ to not found
w = possible MOP-violation, syllables rInk and ar
...
"2drOt]nIN]In]ter]%vj}:]er;111188;1;noun;drottning+intervjuer;swe.SWE;;
e = not a Swedish onset (vj)
...
"1t`aj]brejk;93059;2;noun;tiebreak;eng.GBR;;
e = onset is not allowed word-initially (t`)
...
"2lInd]%Cø:pIN;10136;1;ngi;lind+köping;SWE;;
e = syllable contains more than one nucleus (%Cø:pIN)
...
"2ny:]bør]ja]%]s`ku:;125926;1;noun;ny+!börjar+sko;swe.SWE;;
e = syllable contains no nucleus (%)
...
dA:ls]"1ro:]stOk;144671;1;ngo;Dals_Rostock;swe.SWE;;
e = the number of '_':s in transcription is not correct
...
"2va]ten]%ka]na;100821;1;noun;vatten+kanna;swe.SWE;;
w = correspondence /(Nk|nC)/ to not found
...
aN]"2tre:]%drIn]ken;111885;1;noun;entré+drinken;swe.SWE;;
w = correspondence /(Nk|nC)/ to not found
...
pU]"2li:s]}:t]%re:]da]n`a;127954;1;noun;polis+ut+redningarna;swe.SWE;;
w = correspondence /N/ to not found
w = possible MOP-violation, syllables "2li:s and u0:t
w = possible MOP-violation, syllables u0:t and %re:
...
"2A:v]ve]klINs]%kOst]na]der;107599;1;noun;avveckling+s+kostnader;swe.SWE;;
w = geminate consonant v over syllable boundary
w = possible MOP-violation, syllables klINs and %kOst
...
spe]sI]"2A:l]%CA:t`]rat;86653;1;adj;special+chartrat;swe.SWE;;
w = possible MOP-violation, syllables %CA:t` and rat
"2lands]an]tIk]%vA:r]I]e;59763;1;noun;land+s+antikvarie;swe.SWE;;
w = possible MOP-violation, syllables %vA:r and I
w = possible MOP-violation, syllables "2lands and an
w = possible MOP-violation, syllables tIk and %vA:r
w = probably unwanted sequence of d and s
...
"2by:]%ro:]e]n`a;30509;1;noun;byråerna;swe.SWE;;
w = possible non-compound with %
"2u:]be]%fu:]gad;69238;1;adj;obefogad;swe.SWE;;
w = possible non-compound with %
"2tem]pel]%bEr]jet;148539;1;noun;;swe.SWE;;
w = possible non-compound with %
...
"2eld]%ka]sta]re;111587;1;noun;eld+kastare;swe.SWE;;
w = probably unwanted sequence of d and k
"28nd]%kO]ma;97395;1;verb;!und+komma;swe.SWE;;
w = probably unwanted sequence of d and k
...
"1drajv;33818;1;noun;drive;eng.GBR;;
w = rare Swedish coda (jv)
"1ajl;52601;1;noun;isle;eng.GBR;;
w = rare Swedish coda (jl)
...
pU]"2li:]sfør]%hø:r;13222;1;noun;polis+förhör;SWE;;
w = rare Swedish onset (sf)
"1sfE:]re]n`as;148241;1;noun;;swe.SWE;;
w = rare Swedish onset (sf)


Appendix 4: Parsing results for Norwegian and Danish

The parser was tested 19-04-2000 at NST, Voss, on a Swedish, a Norwegian and a Danish lexicon file. The results for the Swedish parsing were presented in Section 4.4, and the corresponding results for the Norwegian and Danish files are presented below.

Norwegian parsing results

The Norwegian lexicon file all_lists_rev82 contains 143 942 entries and 145 809 transcriptions. 49 700 transcriptions, or 34.09% of the total number, received at least one warning or error message. (All transcriptions in the file were parsed, except for the 1130 transcriptions whose value for the “garbage flag” was equal to 1.)

 #   Error message                                 Number of occurrences        %
 1   MOP-violation, syllables and                                    450   26.64%
 2   Syllable contains more than one nucleus ()                      438   25.93%
 3   Not a Norwegian onset ()                                        194   11.49%
 4   Not a Norwegian coda ()                                         182   10.78%
 5   Onset is not allowed word-initially ()                          149    8.82%
 6   The number of is incorrect                                      126    7.46%
 7   Syllable contains no nucleus ()                                 109    6.45%
 8   Illegal position for syllable delimiter                          27    1.60%
 9   Illegal Norwegian symbol ()                                       9    0.53%
10   Contains illegal stress pattern                                   5    0.30%
11   Contains ill-formed diacritic ()                                  0       0%
12   Unwanted sequence of and                                          0       0%
     TOTAL NUMBER OF ERROR MESSAGES                                 1689  100.00%

Table V: Distribution of error messages in output file in Norwegian parsing.

 #   Warning message                               Number of occurrences        %
 1   possible MOP-violation, syllables and                         23007   42.52%
 2   possible non-compound with '2                                 16844   31.13%
 3   probably unwanted sequence of and                             11110   20.53%
 4   correspondence // to not found                                 2816    5.20%
 5   is a rare onset                                                 232    0.43%
 6   geminate consonant X over syllable boundary                      50    0.09%
 7   is a rare coda                                                   25    0.05%
 8   // allowed only in words with language code                      19    0.04%
 9   probably unwanted duplication of X                                0       0%
     TOTAL NUMBER OF WARNING MESSAGES                              54103  100.00%

Table VI: Distribution of warning messages in output file in Norwegian parsing.

Danish parsing results

The Danish lexicon file da_100K_05_eft_korr.txt contains 11 259 entries and 11 259 transcriptions. 6245 transcriptions, or 55.47% of the total number, received at least one warning or error message. The format of the Danish lexicon file differs from that of the Swedish and Norwegian lexicon files; the entries do not contain as much information, and some checks are therefore impossible to perform.

 #   Error message                                 Number of occurrences        %
 1   MOP-violation, syllables and                                   5398   98.50%
 2   Syllable contains more than one nucleus ()                       32    0.58%
 3   Not a Danish onset ()                                            19    0.35%
 4   Not a Danish coda ()                                             16    0.29%
 5   Onset is not allowed word-initially ()                            7    0.13%
 6   The number of is incorrect                                        6    0.11%
 7   Syllable contains no nucleus ()                                   2    0.04%
 8   Illegal position for syllable delimiter                           0       0%
 9   Illegal Danish symbol ()                                          0       0%
10   Contains illegal stress pattern                                   0       0%
11   Contains ill-formed diacritic ()                                  0       0%
12   Unwanted sequence of and                                          0       0%
     TOTAL NUMBER OF ERROR MESSAGES                                 5480  100.00%

Table VII: Distribution of error messages in output file in Danish parsing.

 #   Warning message                               Number of occurrences        %
 1   possible MOP-violation, syllables and                          1409   62.62%
 2   possible non-compound with '2                                   835   37.11%
 3   probably unwanted sequence of and                                 6    0.27%
 4   correspondence // to not found                                    0       0%
 5   is a rare onset                                                   0       0%
 6   geminate consonant X over syllable boundary                       0       0%
 7   is a rare coda                                                    0       0%
 8   // allowed only in words with language code                       0       0%
 9   probably unwanted duplication of X                                0       0%
     TOTAL NUMBER OF WARNING MESSAGES                               2250  100.00%

Table VIII: Distribution of warning messages in output file in Danish parsing.
