PARSING AND TAGGING SENTENCES CONTAINING LEXICALLY AMBIGUOUS AND UNKNOWN TOKENS

A Thesis Submitted to the Faculty of Purdue University by Scott M. Thede

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 1999


To Denise. Thanks for sticking with me all these years.


ACKNOWLEDGMENTS

Writing a Ph.D. thesis is a tremendous amount of work, and it is not accomplished in a vacuum. I would like to thank all the people who were most important to me during the journey. First, I would like to thank my committee for their help. Special thanks go to Mary Harper, my advisor, for her many hours of proofreading, editing and re-editing, and her many helpful suggestions. Carla Brodley, Avi Kak, and Victor Raskin have also all helped make this possible.

I would also like to thank the many people who have shared the Speech Lab with me while I have been at Purdue: Chris, Dan, Dave, Mike, and Steve. You guys helped me keep my sanity during the long years(!) of work, and coincidentally helped improve my backgammon and chess games as well. Thanks, guys!

Finally, many thanks to my parents, who have always supported me in everything I have done. And last, but definitely not least, I thank Denise, my wife, who has waited longer than anyone should be expected to wait for me to start my career. Thank you for your support and patience.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT
1 INTRODUCTION
  1.1 Natural Language Processing
  1.2 Sentence Parsing
  1.3 Part-of-Speech Tagging
  1.4 Ambiguous and Unknown Words
    1.4.1 Parsing with ambiguous and unknown words
    1.4.2 Tagging with ambiguous and unknown words
  1.5 Goals of this Research
  1.6 Summary of the Document
2 PARSING
  2.1 An Introduction to Parsing
    2.1.1 Grammars
    2.1.2 How parsers work
  2.2 Parsing Unknown Words
    2.2.1 Other work with unknown words
    2.2.2 Information sources for unknown words
  2.3 Parsing with a Morphological Recognizer
    2.3.1 A hand-constructed morphological recognizer
    2.3.2 The post-mortem algorithm
    2.3.3 Experimental design
    2.3.4 Results
    2.3.5 Comments
  2.4 Developing a Computer-Generated Recognizer
    2.4.1 Constructing the recognizer
    2.4.2 Discussion of experimental results
  2.5 Attempts to Refine the Recognizer
    2.5.1 The ratio construction method
    2.5.2 The frequency list post-mortem parsing method
    2.5.3 The new post-mortem algorithm
    2.5.4 Experiments
    2.5.5 Results
  2.6 Final Thoughts on Parsing
3 PART-OF-SPEECH TAGGING
  3.1 Introduction
  3.2 Preliminary Investigation on Information Sources
    3.2.1 Creating the predictor
    3.2.2 Refining the predictions
    3.2.3 The experiment
    3.2.4 Conclusions
  3.3 Part-of-Speech Tagging
    3.3.1 Past work in tagging
    3.3.2 Tagging unknown words
  3.4 Implementation Issues
    3.4.1 The hidden Markov model
    3.4.2 Using an HMM for part-of-speech tagging
  3.5 The Second-Order Model for Part-of-Speech Tagging
    3.5.1 Defining new probability distributions
    3.5.2 The new Viterbi algorithm
    3.5.3 Calculating probabilities for unknown words
    3.5.4 Smoothing the probabilities
    3.5.5 Experiment and conclusions
  3.6 Smoothing Experiments
    3.6.1 Changing the coefficients
    3.6.2 The experiment
  3.7 Integrating the Tagger and Parser
    3.7.1 Investigating N-best tagging
    3.7.2 The VDT and PAT systems
    3.7.3 Results of the VDT and PAT systems
  3.8 Final Thoughts on Tagging
4 CONCLUSIONS AND COMMENTARY
  4.1 Contributions
    4.1.1 Parsing
    4.1.2 Part-of-speech tagging
  4.2 Future Work
    4.2.1 Investigations into errors in training data
    4.2.2 Statistical parsing
    4.2.3 Integrating the tagger and parser
LIST OF REFERENCES
A THE FIRST-ORDER HIDDEN MARKOV MODEL
  A.1 Stochastic Processes and Markov Processes
    A.1.1 The state transition matrix
    A.1.2 The initial state distribution
    A.1.3 Working with a Markov process
  A.2 The Hidden Markov Model
    A.2.1 Finding the most likely state sequence
  A.3 An Example: The Ball-and-Urn Problem
    A.3.1 Using the maximum likelihood algorithm
    A.3.2 Using the Viterbi algorithm
  A.4 Conclusions
B DATA TABLES FOR PARSING EXPERIMENTS
VITA


LIST OF FIGURES

1.1 An Abstract NLP System
1.2 Parse Tree for The boy watched the bird.
2.1 A Simple Parsing System
2.2 The Post-Mortem Algorithm
2.3 Average Number of Deletions
2.4 Average Number of Insertions
2.5 Average Number of Matches
2.6 The Frequency List Post-Mortem Algorithm
2.7 Results using the Maximum Method
2.8 Results using the Average Method
2.9 Results using the Suffix-Only Method
3.1 Results of Various Methods
3.2 The First-Order Viterbi Algorithm
3.3 The Second-Order Viterbi Algorithm
3.4 Three Kinds of Smoothing Coefficients
A.1 A Markov Process
A.2 The Maximum Likelihood Algorithm
A.3 The First-Order Viterbi Algorithm


LIST OF TABLES

2.1 Deletion and Insertion Data
2.2 Top Ten Occurring Affixes
2.3 Ratio Calculation Example
3.1 Results using Various Methods
3.2 Comparison between Taggers on Brown Corpus
3.3 Comparison between Taggers on WSJ Corpus
3.4 Comparison between Full Second-Order HMM and Other Taggers
3.5 Comparison of Unknown Word Accuracies
3.6 Results using Various Smoothing Coefficients
3.7 Results using Various Methods
3.8 Results on Timit Corpus
A.1 Probabilities of Initial Three States for Markov Process Example
B.1 Results using the Maximum Method with Dictionary Distribution Frequencies for No-Affix Words
B.2 Results using the Maximum Method with Compromise Frequencies for No-Affix Words
B.3 Results using the Maximum Method with Original Frequencies for No-Affix Words
B.4 Results using the Average Method with Dictionary Distribution Frequencies for No-Affix Words
B.5 Results using the Average Method with Compromise Frequencies for No-Affix Words
B.6 Results using the Average Method with Original Frequencies for No-Affix Words
B.7 Results using the Suffix-First Method with Dictionary Distribution Frequencies for No-Affix Words
B.8 Results using the Suffix-First Method with Compromise Frequencies for No-Affix Words
B.9 Results using the Suffix-First Method with Original Frequencies for No-Affix Words


ABSTRACT

Thede, Scott, Ph.D., Purdue University, December, 1999. Parsing and Tagging Sentences Containing Lexically Ambiguous and Unknown Tokens. Major Professor: Mary P. Harper.

We present a parsing system designed to parse sentences containing unknown words as accurately as possible. Our post-mortem parsing algorithm combines syntactic parsing rules, morphological recognition, and a closed-class lexicon with a method that attempts to parse a sentence first with a limited prediction for unknown words, and later reparses the sentence with a broader prediction if the first attempts fail. This allows great flexibility while parsing, and can offer improved accuracy and efficiency for parsing sentences that contain unknown words. Experiments involving hand-created and computer-generated morphological recognizers are performed.

We also develop a part-of-speech tagging system designed to accurately tag sentences, including sentences containing unknown words. The system is based on a basic hidden Markov model, but uses second-order approximations for the probability distributions (instead of first-order). The second-order approximations give increased tagging accuracy without increasing asymptotic running time over traditional trigram taggers. A dynamic smoothing technique is used to address sparse data by attaching more weight to events that occur more frequently. Unknown words are predicted using statistical estimation from the training corpus based on word endings only. Information from different length suffixes is included in a weighted voting scheme, smoothed in a fashion similar to that used for the second-order HMM. This tagging model achieves state-of-the-art accuracies.

Finally, the use of syntactic parsing rules to increase tagging accuracy is considered. By allowing a parser to veto possible tag sequences due to violation of syntactic rules, tagging errors were reduced by 28% on the Timit corpus. This enhancement is useful for corpora that have rule sets defined.


1. INTRODUCTION

1.1 Natural Language Processing

Artificial intelligence (AI) is currently a popular field of study, evoking images from laymen (and some researchers) of intelligent robots and computers, such as Data from Star Trek or HAL from 2001: A Space Odyssey. While there is a long way to go before an android such as Data or a computer such as the HAL 9000 can be built, great strides have been made since the inception of the field. The research area of artificial intelligence has grown and developed since its beginnings, and numerous fields within AI have been developed. One of these fields is natural language processing (NLP), which is concerned, broadly speaking, with developing a computer that understands and generates human language.

Like most areas of artificial intelligence, natural language processing is very challenging. As most researchers in the AI field learn, it is extremely difficult to get a computer to perform tasks that humans can do quite easily. In the case of natural language processing, that task is understanding and producing language, in our case English. One reason for the difficulty computers have with NLP tasks is the fact that there are multiple levels of knowledge required to process English. Consider the diagram of an abstract NLP system shown in Figure 1.1. After taking a sentence as input, such a system often determines part-of-speech (POS) information about each word and generates parse trees for the sentence on its way to determining the meaning of the sentence.

[Fig. 1.1. An Abstract NLP System: sentence input → POS tagger (consulting a dictionary for POS information) → parser → parse trees → semantic analyzer → sentence meanings]

Thus, there are (at least) three levels of knowledge needed to reach the sentence's meaning:

- Lexical knowledge: This knowledge concerns information about individual words, including part of speech and other information such as tense, number, person, and usage. This information is typically stored in a lexicon that is available to the system. One difficulty is due to the fact that some words can be used in different ways. For example, the word can may be used as a noun, a tensed verb, or a modal verb. Determining which sense is being used in a sentence often requires more knowledge than simply the lexical information for each word in the sentence.

- Syntactic knowledge: This knowledge concerns how words can be put together to form units such as phrases or sentences. The problem is that English is a fairly complicated language, and rules for its structure can be complex and varying. Ambiguity can appear at the syntactic level, just as at the lexical level. For instance, one type of syntactic ambiguity is prepositional phrase attachment. As an example, consider the sentence:

The boy saw the bird with his binoculars.

One ambiguity here is whether the prepositional phrase with his binoculars attaches to the noun bird or to the verb saw. Disambiguating this attachment requires deeper knowledge than is available at the syntactic level; each choice is equally possible without additional information (e.g., world knowledge).

- Semantic knowledge: This knowledge concerns the assignment of meaning to words and phrase structures in a sentence. This sort of knowledge is difficult to encode in a computer, because it requires the development of an internal representation as well as access to contextual information and world knowledge. Consider again the sentence:

The boy saw the bird with his binoculars.

The prepositional phrase would normally be attached to the verb saw, since human experience tells us that binoculars are typically used to aid in seeing, and birds typically do not own items. Encoding this sort of human experience for use by the computer is challenging. In comparison, the sentence:

The bird saw the boy with his binoculars.

would tend to be interpreted with the opposite prepositional phrase attachment, even though it is syntactically identical. These two different sentences are mapped to very different meanings (and syntactic structures). Unfortunately, there can even be ambiguity at the semantic level. For example, consider the sentence:

He helped her carry his books.

To whom do the words "He", "her", and "his" refer? There is no way to know this without access to extra-sentential information. Additionally, do the words "He" and "his" refer to the same person or not? This is ambiguous, although most people would say that they do refer to the same person (without access to prior sentences or other context). Hence, even higher levels of knowledge can be required.

Understanding English is clearly a difficult task for a computer that does not have world knowledge or some basic understanding of the language. In addition, the interlocking nature of the knowledge required to understand language makes the task non-modular in some ways. The ultimate goal of research in the NLP area is to develop computer systems able to interact with a human using human (or natural) language.

This research attempts to improve computer language processing at the lexical and syntactic levels. The two tasks we focus on here are parsing and part-of-speech

tagging. Parsing provides syntactic-level information, such as breaking the sentence into its phrase structure. Part-of-speech tagging identifies the usage of a word in a sentence based on the words that appear in the sentence and a dictionary. Each of these tasks is described briefly here and in more detail in later chapters.

1.2 Sentence Parsing

Sentence parsing is the act of breaking a sentence into its proper syntactic constituents. This is done using lexical information for each word, whether obtained from a lexicon or from the output of a part-of-speech tagger, together with grammatical rules, in order to determine how the lexical items can be combined into structures to form a sentence. For example, consider the following sentence:

The boy watched the bird.

A reasonable parse for this sentence is:

(S (NP (<det> The) (<noun> boy))
   (VP (<verb> watched)
       (NP (<det> the) (<noun> bird)))
   (<punc> .))

This is a parse tree in list notation. In the parse tree, the highest level symbol is S (standing for sentence), made up of an NP (noun phrase) followed by a VP (verb phrase) and a <punc> (punctuation mark). The NP is made up of a <det> (determiner) and a <noun> (noun), while the VP is made up of a <verb> (verb) followed by another NP. Figure 1.2 shows the same sentence parse in a tree notation. One goal of parsing research is to create a computer system that can automatically determine the valid parses for a sentence.
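The list notation maps naturally onto nested data structures. As a purely illustrative aid (this representation is an editorial sketch, not part of the thesis software), the parse can be written as nested Python tuples, from which the part-of-speech sequence is easily recovered:

    # Illustrative only: the parse tree in list notation as nested Python tuples.
    # Leaves pair a lexical category with a word; interior nodes pair a phrase
    # label with child subtrees.
    parse = ("S",
             ("NP", ("det", "The"), ("noun", "boy")),
             ("VP", ("verb", "watched"),
                    ("NP", ("det", "the"), ("noun", "bird"))),
             ("punc", "."))

    def leaves(tree):
        """Return the (category, word) pairs at the fringe of the tree."""
        label, *children = tree
        if len(children) == 1 and isinstance(children[0], str):
            return [(label, children[0])]          # a leaf: (category, word)
        result = []
        for child in children:
            result.extend(leaves(child))
        return result

    print(leaves(parse))
    # [('det', 'The'), ('noun', 'boy'), ('verb', 'watched'),
    #  ('det', 'the'), ('noun', 'bird'), ('punc', '.')]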

1.3 Part-of-Speech Tagging

Part-of-speech tagging is the act of assigning each word in a sentence its proper lexical information. This lexical information is encoded in a symbol called a tag.

[Fig. 1.2. Parse Tree for The boy watched the bird. (the tree form of the parse shown above in list notation)]

A tag usually indicates at least the word's part of speech, but it can also contain other feature information, such as number (singular or plural) or verb tense. The Penn Treebank documentation [1] describes a commonly used set of tags. Considering the previous sentence (The boy watched the bird.), output from a tagger should be:

The/DT boy/NN watched/VBD the/DT bird/NN ./.

where DT signifies determiner, NN signifies singular common noun, VBD signifies past tense verb, and "." signifies ending punctuation. Creating an automatic part-of-speech tagger is simpler than creating a parser, since syntactic structure is not necessary to build a robust tagger. Typically, lexical information for each word is looked up in a dictionary, and the tagger's task is to determine the correct choice between lexical possibilities for each word given its usage in the sentence. This disambiguation is accomplished by using information provided by the surrounding words and their tags.
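To make the disambiguation task concrete, the following toy sketch shows the dictionary lookup step that produces the candidate tags a tagger must choose among. The lexicon entries are illustrative inventions, not drawn from an actual dictionary:

    # A toy lexicon mapping words to their possible Penn Treebank tags.
    # The entries are illustrative, not taken from the thesis's dictionary.
    LEXICON = {
        "the":     ["DT"],
        "boy":     ["NN"],
        "watched": ["VBD", "VBN"],      # past tense or past participle
        "bird":    ["NN"],
        "can":     ["MD", "NN", "VB"],  # modal, noun, or verb
        ".":       ["."],
    }

    def candidate_tags(sentence):
        """Look up each word; the tagger's job is to pick one tag per word."""
        return [(w, LEXICON.get(w.lower(), ["UNKNOWN"])) for w in sentence]

    print(candidate_tags(["The", "boy", "watched", "the", "bird", "."]))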

Part-of-speech taggers are frequently used as pre-processors for parsing systems. By using a tagger, the parser does not need to consider all possible lexical alternatives provided by a dictionary lookup. Since parsing has an O(n³) running time, where n is the number of lexical alternatives per word times the number of words in the sentence, reduction of lexical ambiguity has the potential to reduce running time. Because of its benefits as a preprocessor, the accuracy of the tagger is very important to prevent cascading errors in later systems.

1.4 Ambiguous and Unknown Words

So, why are parsing and part-of-speech tagging challenging research problems? The answer lies in the fact that words in English are frequently ambiguous or even unknown. An ambiguous word has alternative lexical choices that need to be resolved, while an unknown word has no known lexical information available at all. If every word in English could only be used in one way, parsing and part-of-speech tagging (and in fact nearly every NLP task) would be trivial. However, there are many words in English that can be used in multiple ways. Consider the lexically ambiguous word runs, which can be used as a verb or a noun, or the word like, which can be used as a verb, adjective, adverb, or preposition! Syntactic ambiguity is also possible, as illustrated in the prepositional phrase attachment problem mentioned previously. In addition, semantic ambiguity (ambiguity in meaning, not syntactic structure) is also possible. For example, consider the word booked in the two sentences "He booked the hotel room" and "He booked the criminal"; which meaning of booked should be used in each sentence? All these forms of ambiguity cause NLP tasks to become difficult.

Unknown words are an even bigger problem for NLP systems. As NLP systems are applied to more tasks, the amount of text available to be processed has grown. Lexicons cannot reasonably be expected to cover every possible word that can occur, so unknown word handling has become an important issue. The problem of unknown words is one that will only become more important to solve as natural language processing systems are used for more on-line computer applications (such as information retrieval, database querying, and speech recognition). Determining the best way to

accurately parse or tag sentences containing unknown words is not trivial. There are many questions about what information is useful and necessary for identifying the function of unknown words in a sentence.

1.4.1 Parsing with ambiguous and unknown words

Parsers typically return a set of all possible parses for a given sentence as a parse forest (a structure that compactly stores even an exponential number of parses; examples of its use include Tomita [2] and Earley [3]). Errors while parsing involve the exclusion of the correct parse from the candidate parse forest, or the inclusion of spurious parses; these errors may result from ambiguous and/or unknown words in the sentence. Selection of the correct parse from the forest requires much information, some of which cannot be encoded in a grammar. Although grammar rules can help to select the correct lexical class for an ambiguous word, sometimes additional information (e.g., semantics and pragmatics) is required to decide on the correct parse. Success can be measured by returning the single "correct" parse from the parse forest or by comparing the parse forest returned by the parser to a target parse forest.

The possibility of encountering unknown words is a problem every parsing system should be equipped to deal with. Even if a comprehensive lexicon is available, unexpected inputs could occur. New words are constantly being coined and added to the language, either coming from a subculture or from advances in technology or science. Current words can also be used in new ways, for example "dog" or "xerox"¹. There are a number of ways a parsing system can respond when it encounters an unknown word, such as:

- Catastrophic Failure: The system simply fails to parse a sentence containing an unknown word.

- Human Intervention: The system prompts a human user for information if an unknown word is encountered.

- Syntactic Rules Only: The system allows unknown words to assume any possible part of speech, then disambiguates based on the parser's syntactic rules.

- Semantically/Contextually Enhanced Systems: The system disambiguates unknown words based on knowledge outside of the syntactic level.

¹ As Calvin from Calvin and Hobbes says: "Verbing weirds language." [4]

Of these, catastrophic failure and human intervention are clearly poor alternatives for an automatic parsing system. Using syntactic rules alone offers adequate performance, but it has drawbacks. It is believed that a fusion of methods and knowledge from several sources will offer the best performance.

1.4.2 Tagging with ambiguous and unknown words

A part-of-speech tagger can be used to generate tagged corpora or it can serve as a preprocessor for other NLP processes. In both cases, accurate performance is important. Performance is typically limited by the fact that most tagging systems use only lexical information locally available in the sentence, as opposed to parsing systems, which make use of both lexical and structural information. Much research has been done to improve tagging accuracy, and several different models and methods have been used, which can be broadly classified into the following areas:

- Statistical Methods: This is currently one of the most popular methods for part-of-speech tagging. Statistical taggers are generally trained on a set of correctly tagged sentences first. The lexical information provided by the correctly tagged sentences allows the tagger to extract implicit rules to disambiguate words based on the surrounding words in the sentence. Specific models using this method include hidden Markov models, maximum-entropy models, and memory-based models. (A minimal sketch of this approach appears after this list.)

- Rule-based Methods: Rule-based systems perform part-of-speech tagging by applying a sequence of rules to determine the "best" set of tags for the sentence. "Best" is in quotes here because no maximization of probabilities is done while tagging². Specific models of rule-based tagging include the transformational approach, decision tree models, and path-voting constraint models.

- Other Methods: Neural networks have also been used, whether with a linear separator model or a full neural network. Other models use multiple single-model taggers to tag a sentence, then combine results to assign a single tag sequence.

² Maximization may be used while constructing the rule set, however, as is done for the decision tree model.
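To make the statistical approach concrete, the sketch below implements a first-order hidden Markov model tagger with the Viterbi algorithm. All probabilities are invented toy values for illustration; a real tagger estimates them from a tagged corpus, and the tagger developed in this thesis (Chapter 3) uses second-order approximations instead:

    # A minimal first-order HMM tagger (Viterbi search) over toy probabilities.
    # All numbers below are invented for the example; a real tagger estimates
    # them from a tagged training corpus.
    TAGS = ["DT", "NN", "VB"]
    P_INIT  = {"DT": 0.6, "NN": 0.3, "VB": 0.1}               # P(tag at start)
    P_TRANS = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},     # P(tag | prev tag)
               "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
               "VB": {"DT": 0.7, "NN": 0.2, "VB": 0.1}}
    P_EMIT  = {"DT": {"the": 0.9}, "NN": {"dog": 0.4, "barks": 0.1},
               "VB": {"barks": 0.5}}                          # P(word | tag)

    def viterbi(words):
        emit = lambda t, w: P_EMIT[t].get(w, 1e-6)  # tiny floor for unseen pairs
        trellis = [{t: (P_INIT[t] * emit(t, words[0]), [t]) for t in TAGS}]
        for w in words[1:]:
            column = {}
            for t in TAGS:
                prob, path = max(
                    (trellis[-1][p][0] * P_TRANS[p][t] * emit(t, w),
                     trellis[-1][p][1] + [t]) for p in TAGS)
                column[t] = (prob, path)
            trellis.append(column)
        return max(trellis[-1].values())[1]

    print(viterbi(["the", "dog", "barks"]))   # expected: ['DT', 'NN', 'VB']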

Like parsing, tagging systems must also cope with the presence of unknown words in a sentence. Frequently, a statistical method of predicting possible parts of speech for the unknown word is used. Most tagging systems make certain assumptions that affect their ability to robustly tag unknown words, such as assuming that the root word is known or that the sentence is found in a specific semantic domain. It is believed that a statistical model should be able to achieve good results in predicting parts of speech for unknown words without making these assumptions.

1.5 Goals of this Research

Our research has two main goals. First, we want to develop a parsing system that can accurately parse sentences containing unknown words. This will allow the parsing system to be used in many different applications without loss of performance. The second goal is to create a part-of-speech tagging system with superior overall performance and superior performance on unknown words. In addition to these goals, we would like both systems to rely on as little additional human effort as possible. The systems should operate using only the training data available to them; linguistically sophisticated information supplied by human users requires significant development time. Also, eliminating hand-crafted information allows the systems to remain robust and applicable to different tagsets, domains, and possibly languages.

The parsing system described in Chapter 2 uses three basic concepts to accurately parse sentences containing unknown words: the closed-class/open-class distinction,

syntactic information, and morphological information. It combines these information sources with a post-mortem algorithm, allowing the parser to back off if it fails a parse and try again with more liberal guesses. This allows precise guesses to be made for unknown words without fear of failing to parse the sentence.

The tagging system presented in Chapter 3 uses a new statistical model, the second-order hidden Markov model, which uses second-order approximations for the lexical probabilities and the contextual probabilities in a traditional HMM framework³. This new model offers better accuracy without increasing the asymptotic running time of the algorithm⁴. Unknown word prediction is done using statistical word-ending information together with a method of smoothing the probabilistic information from all the matching suffixes of a word, weighted based on length. This method assumes nothing about the root word and nothing about the domain of the sentence.

³ See Appendix A for a review of the hidden Markov model.
⁴ The running time of a standard trigram tagger using the Viterbi algorithm is O(NT³), where N is the number of words in the sentence and T is the number of possible tags. A standard first-order HMM has a running time of O(NT²), but is no longer commonly used in tagging systems due to inferior accuracy rates.
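The following sketch conveys the flavor of the suffix-based prediction just described. It is an illustrative simplification, not the exact model of Chapter 3: here each matching suffix votes for tags with a weight that simply grows with suffix length, whereas the actual system smooths the suffix distributions with corpus-derived coefficients:

    from collections import Counter, defaultdict

    # Toy reconstruction of suffix-based tag prediction for unknown words.
    # The weighting scheme and data are invented for illustration.
    def train_suffix_tables(tagged_words, max_len=4):
        tables = defaultdict(Counter)          # suffix -> Counter of tags
        for word, tag in tagged_words:
            for k in range(1, min(max_len, len(word)) + 1):
                tables[word[-k:]][tag] += 1
        return tables

    def predict_tags(word, tables, max_len=4):
        votes = Counter()
        for k in range(1, min(max_len, len(word)) + 1):
            suffix = word[-k:]
            if suffix not in tables:
                continue
            total = sum(tables[suffix].values())
            for tag, count in tables[suffix].items():
                votes[tag] += k * count / total   # longer suffixes vote harder
        return votes.most_common()

    training = [("walked", "VBD"), ("talked", "VBD"), ("wanted", "VBD"),
                ("painted", "VBN"), ("red", "JJ"), ("tables", "NNS")]
    tables = train_suffix_tables(training)
    print(predict_tags("jogged", tables))      # '-ed' evidence favors VBD/VBN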

1.6 Summary of the Document

This section outlines the contents of the remaining chapters of this dissertation. The second chapter details the parsing component of the research. After opening with an introduction to parsing, the problem of parsing unknown words is discussed. This is followed by an explanation of our parsing system, the post-mortem parser with morphological recognition, and a discussion of its performance in various experiments. The third chapter discusses the tagging component of this research. After introducing part-of-speech tagging and an investigation of statistical morphological recognition, we discuss our new tagging system, the second-order hidden Markov model, and illustrate its superiority over other tagging systems. Methods of smoothing sparse data are also examined, and a new tagging system is considered that is augmented with a parser to eliminate invalid tag sequences. The last chapter recaps the work and its contributions, as well as detailing future

directions of our work in this area. Finally, the first appendix offers a brief tutorial and introduction to hidden Markov models, and the second appendix provides tables of parsing data that are displayed graphically in Section 2.5.5.


2. PARSING

2.1 An Introduction to Parsing

Parsing is the act of breaking down a sentence into its constituent parts. There are many ways to parse a sentence using a computer, but most of them involve the use of rewrite rules. A rewrite rule defines the possible constructions of each possible element of a sentence. The rewrite rules consist of symbols drawn from a set V ∪ C, where V is the set of variables and C is the set of constants. Constants are the final symbols of the language; for parsing, the constants are either the words of the language or the possible lexical categories (parts of speech) of the language. Variables (or non-terminals) are symbols that must be replaced by other symbols (variables or constants) in the rewrite rules. For example, consider this sentence:

The boy watched the bird.

As shown in Section 1.2, a reasonable parse for this sentence is:

(S (NP (<det> The) (<noun> boy))
   (VP (<verb> watched)
       (NP (<det> the) (<noun> bird)))
   (<punc> .))

To parse this sentence, the set of variables V = {S, NP, VP} and the set of constants C = {<det>, <noun>, <verb>, <punc>}¹ must be defined. The rewrite rules are defined as:

S  → NP VP <punc>
NP → <det> <noun>
VP → <verb> NP

These rewrite rules can be used to generate a legal sentence in terms of the parts of speech or to recognize a sentence that is legally constructed according to the set of rules. To recognize a sentence, we must begin with the S symbol. Then the S symbol can be replaced by the right-hand side of any rewrite rule containing an S on the left-hand side. We continue replacing variables until only constants remain. Then, the lexicon must be consulted to determine whether the words in the sentence are consistent with the parts of speech generated by the rules. If a word's part of speech matches the lexical category provided by the rewrite rules, the sentence is in the grammar. For example, consider the following application of the rewrite rules:

S → NP VP <punc>
  → <det> <noun> VP <punc>
  → <det> <noun> <verb> NP <punc>
  → <det> <noun> <verb> <det> <noun> <punc>
  → The boy watched the bird .

¹ These symbols could be defined as variables, and the set of constants could be set to the words of the sentence; this would needlessly complicate the system, and the two ideas are functionally equivalent.

Therefore, the sentence The boy watched the bird. is a legal sentence according to this set of rules. In fact, the string of parts of speech depicted above is the only string that can be generated by these rules. To cover the English language, much more complicated sets of rules would need to be constructed.
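To make the recognition procedure concrete, the following minimal sketch encodes the three rewrite rules and a toy lexicon, and checks a sentence by the substitution process just described. It is an illustrative aid only, not the parser used in this research:

    # Illustrative recognizer for the toy grammar above (not the thesis's parser).
    RULES = {                       # variable -> list of right-hand sides
        "S":  [["NP", "VP", "<punc>"]],
        "NP": [["<det>", "<noun>"]],
        "VP": [["<verb>", "NP"]],
    }
    LEXICON = {"the": "<det>", "boy": "<noun>", "bird": "<noun>",
               "watched": "<verb>", ".": "<punc>"}

    def matches(symbols, cats):
        """Can the symbol string expand to exactly this category string?"""
        if not symbols:
            return not cats
        head, rest = symbols[0], symbols[1:]
        if head not in RULES:                   # a constant: must match directly
            return bool(cats) and cats[0] == head and matches(rest, cats[1:])
        return any(matches(rhs + rest, cats) for rhs in RULES[head])

    def recognize(sentence):
        cats = [LEXICON[w.lower()] for w in sentence]
        return matches(["S"], cats)

    print(recognize(["The", "boy", "watched", "the", "bird", "."]))  # True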

2.1.1 Grammars

To parse sentences in a language, a set of rewrite rules is designed to determine what constitutes a valid sentence. A set of rewrite rules is called a grammar. There are several classifications for grammars. Each classification places restrictions on what type of rewrite rules can be used in return for guaranteeing an upper bound on running time for sentence recognition. In the following descriptions, upper-case letters (such as V) are variables, and lower-case letters (such as c) are constants.

Regular grammars: These are the most restrictive forms of grammars. There are actually two types of regular grammars: right linear and left linear. In right linear grammars, rewrite rules must be of the form:

V → cW
V → c

For left linear grammars, the rewrite rules must be of the form:

V → Wc
V → c

One of the strengths of a regular grammar is its running time. A regular grammar only takes O(n) time to recognize a string as belonging to the grammar, where n is the length of the string in symbols. Unfortunately, a regular grammar is not very expressive. For example, a regular grammar cannot model a string of the form aⁿbⁿ, which is required to model bracket matching for a compiler. Therefore, regular grammars are incomplete models of language structure (they can be useful as a loose set of constraints, but other grammar models are more restrictive).
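To see why recognition is linear, note that every right linear rule consumes exactly one constant, so a recognizer can sweep once through the string while tracking the set of reachable variables. The grammar in this sketch is invented purely for illustration:

    # Illustrative O(n) recognition for a right linear grammar: each rewrite
    # step consumes one constant, so recognition is one left-to-right scan.
    # Toy grammar (invented): S -> aS | bA,  A -> bA | b
    RULES = {
        "S": [("a", "S"), ("b", "A")],
        "A": [("b", "A"), ("b", None)],   # None marks a finishing rule V -> c
    }

    def recognize(string, start="S"):
        states = {start}                  # variables reachable so far
        for i, ch in enumerate(string):
            last = (i == len(string) - 1)
            next_states = set()
            for v in states:
                for const, var in RULES[v]:
                    if const != ch:
                        continue
                    if var is not None:
                        next_states.add(var)
                    elif last:
                        return True       # finished exactly at the last symbol
            states = next_states
        return False

    print(recognize("aabb"))   # True:  S -> aS -> aaS -> aabA -> aabb
    print(recognize("abab"))   # False: no derivation yields this string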

Context-Free Grammars: A context-free grammar (CFG) is required to have rewrite rules only of the form V → α, where α is any string (possibly empty) of constants or variables. Thus, the only restriction on the rules of the grammar is that the left-hand side must consist of a single variable. Another slightly more restrictive form is called a positive context-free grammar, which requires that the string α be non-empty.

Context-free grammars can be structured into a normal form, that is, a grammar in which the rules are structured in a fixed number of ways (e.g., Greibach Normal Form or Chomsky Normal Form). For example, the Chomsky Normal Form is simply a context-free grammar that has rewrite rules consisting only of the forms V → W X or V → c. Any context-free grammar can be rewritten in Chomsky Normal Form (a proof of this fact can be found in Davis, et al. [5]).

There are algorithms available that can recognize a string from a context-free grammar in O(n³) time. The expressive power of a CFG is more than that of the regular grammar, but it is still somewhat limited (e.g., aⁿbⁿ can be recognized by a context-free grammar, but aⁿbⁿcⁿ cannot).

Context-Sensitive Grammars: A context-sensitive grammar restricts its rewrite rules to be of the form α → β, where α and β are sequences of symbols (variables or constants) such that |α| ≤ |β|. In other words, the right-hand side of the rewrite rule must be at least as long as the left-hand side.

Context-sensitive grammars are in general very expressive but extremely computationally difficult. Recognizing a string with a context-sensitive grammar is a PSPACE-complete problem. Any problem that is PSPACE-complete is as hard or harder than any problem that can be solved in polynomial space on a Turing machine. This is typically considered to be a "more" intractable set of problems than NP-complete. See Garey and Johnson [6] for a complete definition of PSPACE and PSPACE-complete.

Type 0 Grammars: The type 0, or recursively enumerable, grammar² has absolutely no restrictions on its rewrite rules. They are, of course, the most expressive grammars but are computationally intractable.

In natural language processing, a context-free grammar (CFG) is normally used for parsing. This is due to its polynomial running time and reasonable expressive power (all but the most complicated English sentences can be expressed with a CFG, and this limitation is accepted to make parsing computationally feasible). By using a context-free grammar instead of a context-sensitive grammar, the main risk is the acceptance of an ungrammatical sentence or possibly the production of ambiguous parses where no ambiguity exists.

² The name Type 0 comes from the nomenclature of the Chomsky hierarchy [7], which defines recursively enumerable grammars as type 0, context-sensitive grammars as type 1, context-free grammars as type 2, and regular grammars as type 3.

2.1.2 How parsers work

We have described the way sets of rewrite rules can be combined to form grammars, but have not described how those grammars are to be used to perform parsing. For different grammars, there are different kinds of parsers. The parser used in our experiments utilizes a context-free grammar, so these will be considered first. There are several processing strategies for a context-free grammar parser to use. Some of them are listed here:

- Top-Down parsers begin with the single start symbol S. From this point, the top-down parser expands the symbol S into a string of symbols using any applicable rules from the grammar. The parser continues to expand each variable in the string until the string consists of only constants. Then, this string is compared to the possible lexical categories of the words of the sentence; if they match, the sentence is a legal construct. If not, the parser backtracks and tries new expansions of variables until the sentence parses or there are no more possible expansions. An example of this kind of procedure was shown in Section 2.1.

- Bottom-Up parsers begin with a string that consists of the words of the sentence. Each word is replaced with its possible parts of speech. Then, these parts of speech are combined into constituents using the production rules; if the right-hand side of a production rule matches some sequence of symbols in the string, then those symbols can be replaced with the left-hand side of the appropriate rule. This continues until the S symbol is obtained or until every possible combination of symbols is exhausted.

- Chart parsers attempt to combine the advantages of bottom-up and top-down parsers. They essentially use a bottom-up approach, but they do so in a single pass through the sentence by using memoization or dynamic programming. The parser keeps track of all possible constituents that can be constructed at a single point in the sentence, using all previous words. This list of possible constituents is the chart, and it makes it easier to determine what constituents can be built

that include the next word of the sentence. By using a chart, the context-free grammar parser has an O(n³) running time for parsing of sentences, where n is the total number of lexical categories for all the words of the sentence. (A minimal chart-style recognizer is sketched at the end of this section.)

These three methods are explained in more detail in Allen's book [8]. There are also other methods for parsing a sentence that do not use production rules directly. For example, transition networks model the grammar as a set of state-based networks [9, 10]. Constraint dependency parsers model the production rules as a set of constraints to be satisfied. Constraint dependency grammars were introduced by Maruyama [11, 12], and further work has been done by Harper and associates [13, 14, 15]. Parsers can also be implemented as statistical models of context-free grammars, where each production rule is assigned a probability of occurrence. This allows each possible parse to be assigned a probability of being correct. In this way, a single parse can be selected as the most likely correct parse for the sentence. A simple stochastic parser is described in Allen's text [8], and research in this area has been done by Collins [16] and Magerman [17]. Blaheta and Charniak [18] have refined their statistical parser to give current state-of-the-art performance.

The parser used in the experiments detailed in the rest of this chapter is a chart parser, which is an extension of the parser originally developed by Tomita³ [2]. It stores the possible parse trees for a sentence in a parse forest, which allows a large (possibly exponential) number of possible parse trees to be stored in a polynomial amount of space. It was chosen as the base parsing system for our parser because it is a good basic system with which to interface an unknown word recognizer.
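The dynamic programming idea behind chart parsing can be illustrated with the classic CYK recognizer, which assumes a grammar in Chomsky Normal Form (Section 2.1.1). This is a generic textbook sketch over a toy grammar, not the Tomita-style parser used in these experiments:

    from itertools import product

    # Generic CYK recognizer (grammar in Chomsky Normal Form), shown only to
    # illustrate the O(n^3) chart idea; the grammar is a toy invention.
    BINARY = {("NP", "VP"): {"S"}, ("DT", "NN"): {"NP"}, ("VB", "NP"): {"VP"}}
    UNARY  = {"the": {"DT"}, "boy": {"NN"}, "bird": {"NN"}, "watched": {"VB"}}

    def cyk(words):
        n = len(words)
        # chart[i][j] holds the variables deriving words[i:j+1]
        chart = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            chart[i][i] = set(UNARY.get(w, ()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):               # try every split point
                    for a, b in product(chart[i][k], chart[k + 1][j]):
                        chart[i][j] |= BINARY.get((a, b), set())
        return "S" in chart[0][n - 1]

    print(cyk(["the", "boy", "watched", "the", "bird"]))   # True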

2.2 Parsing Unknown Words

A simplified parsing system is diagrammed in Figure 2.1, focusing on the interaction with the lexicon. What happens when a word is not available in the lexicon? The response of a parsing system when it encounters an unknown word varies from system to system.

³ This parser was used in EE 669: Natural Language Processing at Purdue.


[Fig. 2.1. A Simple Parsing System: the parsing system takes a sentence to be parsed, performs word lookup against the lexicon, and returns a parsed sentence; when a word is not found, what should be returned?]

Some parsers cannot parse sentences containing unknown words at all. Any practical parsing system must be capable of coping with unknown words. Even if a comprehensive lexicon is available, unexpected input could easily be received if the parsing system is expected to interact often at some level with humans. New words are constantly being added to the language, and word usage evolves with the language. For example, "To dog someone" uses dog as a verb, a usage that is not typical. Keeping a lexicon up to date for all possible situations is virtually impossible.

2.2.1 Other work with unknown words

We wish to design a parser that can parse sentences containing unknown words. Before we describe our approach, it is useful to consider how other systems address the problem of unknown words.

Catastrophic failure

Some parsers cannot analyze sentences containing unknown words at all. When an unknown word is encountered, the system fails to return any parse at all for the sentence. This response is unacceptable for any system that is to be considered reasonably robust. However, many parsing systems have no way to parse sentences containing unknown words. For example, Weischedel, et al. [19] mention that the system that performed best at the Second Message Understanding Conference (MUC-2) would simply stop parsing after encountering an unknown word. This response may be acceptable for systems operating under the restrictions of MUC-2, but it is inappropriate for a robust parser that expects to encounter a wide variety of text. There should be a way to design a parser to cope with unknown words with only moderate increases in operating costs.

Human intervention

Another way of parsing sentences that contain unknown words is to rely on human interaction with the system. When an unknown word is encountered by such a system, the human user is prompted by the system to define the word. This technique is not discussed in the literature, but limited human intervention can be of use in some applications. For example, in a database interface, it may be helpful for users to identify the lexical information about a particular object. Human intervention can be especially important when several unknown words occur in a single sentence, or there is a critical misidentification by the system while working with one of the unknown words.

There are problems with the human intervention approach. It is limited to applications that support human supervision, eliminating those applications for which human intervention would be prohibitively expensive or impossible (e.g., parsing a very large corpus or searching through a large number of documents). Another problem is that the user may not know enough information about the word. If the user is prompted for information that he or she cannot supply to the system or supplies incorrectly, the system could fail catastrophically. This sort of problem is most likely when users are not linguistically sophisticated (e.g., many users of computer systems).


Syntactic rules

Many parsers are written using syntactic rules as the only guidelines for determining the structure of the sentence. The syntactic rules are encoded as rewrite rules, usually forming a context-free grammar. These syntactic rules can be used to restrict the possible lexical alternatives for an unknown word by eliminating syntactically impossible alternatives. Tomita's parser [2] handles unknown words by assigning them every possible lexical category, then using the syntactic rules of the grammar to filter out the incompatible assignments.

There are significant problems with this approach. Since the parser assigns every possible lexical category to all unknown words, there can be a massive over-generation of parses by the system if the word is in a position that allows several possible lexical categories. This is especially problematic when there is more than one unknown word in the sentence. Use of all possible lexical alternatives for unknown words should be limited to occasions where the lexicon is guaranteed to be nearly complete.

Contextually enhanced systems

The FOUL-UP system [20] is a parsing system that copes with unknown words by focusing on contextual and semantic information. It uses a strong top-down context guide called a script to provide the expected attributes of unknown words. For example, if the following input is given to the FOUL-UP system:

Friday, a car swerved off Route 69. The car struck a kleeb.

and all words except kleeb are known and stored in the lexicon, then the script "One vehicle accident" is activated by the first sentence and is used to identify a kleeb as a physical object that can act as an obstruction to a vehicle. This works well within the limits of the one-vehicle accident script. The FOUL-UP system requires the use of scripts and assumes that most words will be known. Domain-specific information limits the applicability of the parser to a specific situation and could have ramifications for the extensibility of the system.

The LINK system [21] uses the context of the sentence to determine semantic information about an unknown word. Grammatical constraints are used to determine the possible syntactic categories of the word, then inferences are made about the semantic information based on those categories and the semantic domain of the sentence. The system uses a hierarchical organization of semantic categories to determine the semantic attributes of the unknown words. The LINK system is limited by its semantic domain dependence. All information that is generated about an unknown word is based on specific rules designed for one semantic domain (e.g., actions to be performed on an assembly line). Without this domain-specific semantic information, inferences would be difficult, if not impossible, to draw. In addition, the authors assume that the vast majority of words are already known, limiting the wider applicability of the system.

Proper noun recognition

There are many people working on the problem of identifying unknown proper nouns while parsing. McDonald [22] establishes the idea of using both internal and external evidence to identify and categorize unknown proper nouns (specifically names). Internal evidence corresponds to morphological and capitalization information, while external information incorporates the semantics and syntax of the sentence. Mani and MacMillian [23] further address proper nouns, specifically name identification in newswire text. Paik, et al. [24] address the identification of proper nouns using 29 semantic categories, such as company, person, or date.

All these systems make use of information that is specific to proper nouns: capitalization and information about how compound nouns are constructed (for example, "Three Mile Island is a reactor." versus "Three April showers came this week."). These information sources will not help in the general problem of predicting unknown words. However, these systems could be used in conjunction with another system to provide systematic guidance on proper nouns within a larger parsing system.


Combined methods

The SCISOR system [25] was developed to acquire lexical information from text during parsing. It uses a combination of context, syntax, morphology, and semantic information sources to determine information about an unknown word. The purpose of this system is to allow the parser to augment the lexicon as it parses, dealing with unknown words as they occur. The SCISOR system assumes that a large dictionary of general root words is available, and that the morphological information is obtained only by using this set of roots. However, if an unknown word is encountered, the root of that word may also be unknown. In addition, SCISOR assumes that unknown words have specialized meanings related to the domain of the base lexicon. So again, the parsing system is limited by its reliance on a specific semantic domain.

2.2.2 Information sources for unknown words

We believe that a method that uses as many information sources as possible, while avoiding making assumptions about the domain or structure of the sentence, should perform well in a wide variety of applications. A combined methods approach, by using a variety of knowledge sources to deduce information about unknown words, should be more accurate than a system using only one source. If the restrictive assumptions discussed in the previous section (e.g., reliance on a complete lexicon of root words, the usage of specialized meanings, or the assumption of a specific domain) can be avoided, a parsing system using combined methods should be both accurate and robust.

There are three major sources of information for predicting this lexical information about unknown words that are discussed in this section: the use of closed-class words and their value in the lexicon, the ability of syntactic rules to limit the possible lexical categories of unknown words, and the utility of morphological recognition. These information sources should enable the accurate prediction of lexical information about unknown words.


Closed-class versus open-class words

A useful source of information for English parsing systems is the distinction between closed-class and open-class words. Closed-class parts of speech are those parts of speech that are not normally assigned to new words. Closed-class words are words with a closed-class part of speech. For example, pronouns and determiners are members of the closed-class set of parts of speech; it is very rare that a new determiner or pronoun is added to English. An advantage of closed-class parts of speech is that those words can be enumerated completely. In addition, closed-class words do not generally take on other parts of speech; for example, it is rare to find a determiner that can also act as a noun. This statement ignores possible meta-linguistic use of a word, such as the sentence: "The" is a determiner.

Open-class parts of speech are those that can be assigned to new words with little difficulty. Nouns, verbs, adjectives, and adverbs are examples of open-class parts of speech. Noun modifier is a part of speech that is used in the parsing experiments to indicate those words that can be used to modify nouns; this eliminates extraneous parses that occur when a word defined as both a noun and adjective is used to modify a head noun. This part of speech is also used by Cardie [26] in her experiments.

The existence of a set of closed-class words allows the construction of a dictionary for parsing that will facilitate the detection and analysis of unknown words. Closed-class words, since they typically can act as only one part of speech, offer "anchor" points in a sentence that give the system more contextual information. The increased contextual information, along with the limitation that unknown words must be open-class parts of speech, allows the system to more accurately predict the possible parts of speech for the unknown words. This distinction can be used to develop a small core dictionary of closed-class words that can greatly ease the task of identifying unknown words in a sentence.

There are a number of closed-class parts of speech, including determiners, prepositions, conjunctions, predeterminers, and quantifiers. Pronouns and auxiliary verbs (be, have, and do) are also closed-class parts of speech. Irregular verbs are designated

as closed-class words for this research since they are a static part of the language. The irregular verbs are enumerated by Quirk, et al. [27]. New verbs in a language are not typically coined as irregular verbs. By assuming that the irregular verbs are known, it becomes easier to identify unknown verb forms. For example, all past tense regular verbs end in -ed, third person singular regular verbs end in -s, and so on. For similar reasons, irregular noun plurals are also included in the set of closed-class words. If unknown words are encountered that are known to be nouns or verbs, their forms are easily determined. For example, in the sentence:

The zzzax built a car.

assume that the word zzzax is unknown. The words The and a are determiners, a closed-class part of speech, and should be known by the system. Similarly, the word built is an irregular verb form, which is also considered to be closed-class. Three of the five words in this sentence are known, and the basic structure of the sentence is laid out. For the sentence to conform to proper English rules of grammar, zzzax (and incidentally car) must be a noun (or at least something acting as a noun in this sentence).
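The anchoring idea can be sketched as follows. The tiny closed-class dictionary and the list of open classes below are illustrative stand-ins for the actual lexicon used in the experiments:

    # Illustrative sketch: closed-class words anchor the sentence, and any
    # unknown word is restricted to the open classes. The dictionary here is
    # a toy stand-in for the real closed-class lexicon.
    CLOSED_CLASS = {"the": "det", "a": "det", "built": "verb-irregular"}
    OPEN_CLASSES = ["noun", "verb", "adjective", "adverb", "noun-modifier"]

    def initial_categories(sentence):
        analysis = []
        for word in sentence:
            if word.lower() in CLOSED_CLASS:
                analysis.append((word, [CLOSED_CLASS[word.lower()]]))
            else:
                # Unknown: cannot be closed-class, so guess among open classes;
                # syntactic rules would then prune this list further.
                analysis.append((word, list(OPEN_CLASSES)))
        return analysis

    for word, cats in initial_categories(["The", "zzzax", "built", "a", "car"]):
        print(word, cats)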

Morphology

Morphology has much to offer for the prediction of information about unknown words. Morphology is the study of the construction of words from more basic units (morphemes) that roughly correspond to units of "meaning". These meaning units are normally root words, prefixes, and suffixes. The study of affixes (affix refers to a prefix or a suffix) can offer important information for identifying the lexical class of an unknown word. The part-of-speech tagger used by Weischedel, et al. [19] reduced its error rate on unknown words by a factor of three by using only simple morphology, and by a factor of five when combining morphology with other word structure information, illustrating the usefulness of morphological information. Research

Research on morphology can be categorized into three areas: morphological generation, morphological reconstruction, or morphological recognition.

Morphological generation research focuses on the development of morphological affixation rules and their use in generating new words from a lexicon of base roots. Theoretical linguists and psychologists use these rules in order to develop linguistic theories or to understand how people learn language. For example, Muysken [28] classified affixes according to their order of application in a theoretical discussion on morphology. Badecker and Caramazza [29] discussed the distinction between inflectional and derivational morphology as it applies to acquired language deficit disorder and in general to the theory of language learning. Baayen and Lieber [30] studied the productivity of certain English affixes in the CELEX lexical database in an effort to study the differences between frequency of appearance and productivity. Morphological generation is of great use in lexical generation and encoding (for example, in Golding and Thompson's Lexicrunch [31] system that represents a lexicon as compactly as possible), but since it involves the construction of new word forms by applying rules of affixation to base forms, it is only indirectly helpful in the identification of new words.

Morphological reconstruction researchers analyze an unknown word by using knowledge of the root stem and affixes of that word. Studies about morphological reconstruction can offer information about the identification of unknown lexical items. For example, Milne [32] makes use of morphological reconstruction to resolve lexical ambiguity while parsing. Light [33] uses morphological cues to determine semantic information about unknown words by using various semantic restrictions and knowledge. Jacobs and Zernik [25] make use of morphology in the SCISOR system, their case study of lexical acquisition, in which they attempt to augment the lexicon using a variety of knowledge sources. However, all these systems assume a large dictionary of general roots is available, and that the unknown words tend to have specialized meanings. Since an unknown word could be completely unknown, with no root word information, morphological reconstruction alone is of somewhat limited value for coping

with unknown words. A method to cope with unknown words cannot be based solely on knowledge of the root if the root is also unknown.

The third type of morphology research, morphological recognition, uses knowledge about affixes to determine the possible parts of speech and other features of a word, without utilizing any explicit information about the word's stem. There is much information available in the beginning and ending of a word that can be used without requiring knowledge of the word's root. Also, a learning system that does not rely on hand-crafted rules specific to English can be used on any morphologically rich language.

There is one caveat concerning the morphological recognition approach. Since the root of an unknown word is assumed to be unknown, the recognizer can only consider whether an affix matches the word. This can lead to an interesting type of error. For example, while checking the word butterfly, the affix -ly matches as a "suffix". Typically, the -ly suffix is attached to adjectives to form adverbs (e.g., happy → happily) or to nouns to form adjectives (e.g., beast → beastly); however, the word butterfly was not formed by this process, but rather by compounding the words butter and fly. Since no assumptions are made about the quality of the word *butterf, and without some notion of legal word structure, there is no way to determine that butterfly was not formed by applying -ly. So butterfly would be mistakenly assumed to have the -ly suffix and be identified as an adverb or an adjective. If additional information is available, a morphological recognizer can circumvent this problem. For example, in the part-of-speech predictor detailed in section 3.2, the suffixes -fly and -rfly are used first to identify the part of speech of butterfly. These suffixes would presumably not be as likely to be associated with adjectives and adverbs.

There has been little research done in morphological recognition in English. Some work of note includes work by Mikheev [34, 35], who includes root word information along with word ending information (i.e., combines morphological reconstruction and recognition), and work by Thede [36], which investigates statistical prediction without using any root word information.

Morphology has also been investigated in other languages including Dutch [37], Finnish [38], French [39], German [40], Spanish [41], and Turkish [42]. Many of these languages are more morphologically rich than English, and so morphological research is central to the building of an NLP system for them. For many of these languages, the morphological processes operate in somewhat different ways than in English (for example, much more compounding and prefixation), so these techniques may not apply well to English. Also, research on English has shown that the use of statistical information alone offers good results, without the weaknesses that come with reliance on outside information or assumptions. The same may not be true for these other languages.

Syntactic knowledge

Syntactic knowledge is used implicitly by a parser when an unknown word is encountered. The location of the unknown word in a sentence can be of great importance when determining its part of speech. For example, a word directly following a determiner is typically a noun or noun modifier. Consider the unknown word smolked used in this sentence:

The cat smolked the dog.

Assuming that all the other words in the sentence are found in the lexicon, then based on purely syntactic knowledge the unknown word must be a finite tense verb, either past or present tense. The use of morphological recognition can refine this information. The -ed ending on smolked usually indicates either a past tense verb or a past participle. By combining the syntactic and morphological information, the word smolked is identified as a past tense verb; furthermore, it is a verb which takes a noun phrase object. Thus, syntactic information can augment morphological information, and vice versa. Obviously, the more words in a sentence that are defined in the lexicon, the more the syntactic knowledge can limit the possible parts of speech of unknown words.


Our approach

It is unlikely that syntactic and contextual rules alone will provide a tight enough set of constraints on the possible parts of speech for an unknown word, especially for a sentence containing a large number of unknown words. By combining closed-class word information, morphological information, and contextual information, we will construct a system that improves performance over a system using any of the three sources individually. By combining these information sources with the post-mortem parsing algorithm described in the following sections, the parser becomes a powerful tool for parsing texts that contain unknown words.

We have developed a post-mortem parsing algorithm that allows the parser to reattempt a sentence parse if initial attempts fail. The post-mortem approach is similar to the iterative-deepening search mechanism: a limited attempt is made to find a parse, and the search space for unknown words is gradually broadened until a parse is found. It is this post-mortem approach utilizing morphological and syntactic knowledge sources that allows the efficient and accurate parsing of sentences that contain unknown words. By using this approach, all sentences containing unknown words should eventually parse, unless the rule coverage or closed-class lexicon is inadequate.

A morphological recognizer is added to the parsing system and forms a module that interacts with the parser and lexicon when the word to be parsed is not found in the lexicon. The use of this module, when combined with the post-mortem approach of the parser, avoids the catastrophic failure inherent in some systems by offering lexical possibilities for every word it encounters, regardless of whether or not this word is in the lexicon, while avoiding most of the over-generation inherent in assigning all possible lexical categories to the word. This system also parses sentences with no human intervention. If necessary, the ability to interact with a user could be added; this ability should be activated only after a certain performance threshold is exceeded, possibly based on the number of unknown words in the sentence.


2.3 Parsing with a Morphological Recognizer

This section empirically investigates the performance of a parsing system using a post-mortem algorithm, along with a dictionary of closed-class words, syntactic parsing rules, and a morphological recognizer for unknown words. This is a preliminary investigation, using a hand-created morphological recognizer, done to test the effectiveness of our post-mortem algorithm and the unknown word recognizer. In this section, we describe the post-mortem algorithm and how the morphological recognizer is integrated with it.

The parser is built from a Tomita [2] style parser implemented in Common Lisp. The parser's lexicon defines a word with its spelling, part(s) of speech, and a small set of features:

- Number (singular / plural)
- Person (first / second / third)
- Verb form (infinitive / present / past / past participle / -ing participle)
- Noun type (common / proper)
- Noun case (subjective / objective / possessive / reflexive)
- Verb sub-categorization (12 different possibilities)
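As a rough illustration, such an entry could be represented by a structure like the following sketch; the slot names and value sets are illustrative, not the actual representation used in the thesis parser:

    ;; A sketch of a lexicon entry carrying the features listed above.
    (defstruct lex-entry
      spelling          ; e.g. "acts"
      parts-of-speech   ; e.g. (:noun :verb)
      number            ; :singular or :plural
      person            ; :first, :second, or :third
      verb-form         ; :infinitive, :present, :past, :past-participle,
                        ; or :ing-participle
      noun-type         ; :common or :proper
      noun-case         ; :subjective, :objective, :possessive, or :reflexive
      subcat)           ; one of the 12 verb sub-categorization possibilities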

This is one set of features; other sets of features could be used. Our parser has been altered to use a morphological recognizer to predict parts of speech for unknown words instead of using Tomita's method of guessing all possibilities. We believe that the post-mortem approach combined with the morphological recognizer will parse sentences containing unknown words accurately while avoiding the overgeneration of parses inherent in Tomita's method.

The test corpus is a set of 356 sentences from the Timit corpus [43], a corpus of sentences that has no underlying semantic theme. It was originally designed to test phoneme recognition in speech processing. This corpus was specifically chosen for our experiments because it has no semantic theme and offers a wide range of

sentence types. This should allow the parser to be used for processing sentences in other corpora and should guarantee that the experimental results are not skewed by any inherent domain of the test corpora.

The grammar rule set for this parsing system was designed first to properly parse the test corpus, and second to be as general as possible. It is hoped that these rules could deal with a wider variety of English sentences than those that are found in the test corpus, though they have not been tested on other corpora. The rule set is not an overly critical element of the experiment; any rule set should offer similar results as long as it fully covers the corpus. The size and coverage of the grammar is not currently an issue in this experiment. Regardless of the size of the grammar or the semantic domain of the corpus, identifying possible parts of speech for unknown words will benefit a parsing system. Furthermore, there are many applications that use grammars that do not cover an extensive range of English sentences. The experiments in this chapter will demonstrate the benefit of our mechanisms for dealing with unknown words.

There are four separate files that constitute the lexicon for this parser and corpus:

- nouns - This lexicon contains all irregular noun plurals.
- verbs - This lexicon contains all forms of all irregular verbs.
- closed - This lexicon contains all other closed-class words.

- dict - This lexicon is actually a set of files, named dict1 to dict10. Dict10 contains all the words from the corpus that are not contained in the other three files; these are all open-class words. Dict10 along with the first three files forms a complete lexicon for the test corpus. Dict9 contains 90% of the words from dict10, dict8 contains 80%, and so on down to dict1, which contains 10% of these words. All these files were created randomly and independently using the words from dict10 (in other words, the set of words contained in dict7, for example, is not necessarily a subset of the words in dict8, and so on). The percentages for each file are based on a word count, not on a definition count; many words have more than one definition. For example, acts is one word, but it has two definitions, as a noun and a verb.

The first three files mentioned (nouns, verbs, and closed) comprise the closed-class lexicon. The files nouns and verbs are included in the closed-class lexicon as discussed in section 2.2.2. The set of open-class dict files is used in the experiment to simulate missing words from the lexicon.
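The random, independent sub-dictionaries can be generated with a few lines of code. The sketch below samples by word count, as described above; SHUFFLE and MAKE-DICT are illustrative helpers, not the actual tooling used in the experiments:

    ;; A sketch of building one reduced dictionary from the dict10 word list.
    (defun shuffle (list)
      "Fisher-Yates shuffle of a fresh copy of LIST."
      (let ((v (coerce list 'vector)))
        (loop for i from (1- (length v)) downto 1
              do (rotatef (aref v i) (aref v (random (1+ i)))))
        (coerce v 'list)))

    (defun make-dict (words fraction)
      "Return a random FRACTION of WORDS, e.g. 0.7 to simulate dict7.
    Each call samples independently, so dict7 need not be a subset of dict8."
      (subseq (shuffle words) 0 (round (* fraction (length words)))))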

2.3.1 A hand-constructed morphological recognizer

The morphological recognizer is a critical component of the parsing system. The recognizer is used to predict parts of speech of unknown words using information about affixes that match the word. We say an affix matches a word when the beginning or ending letters of the word are the same as a prefix or suffix, respectively. For example, the word unnatural matches the prefix un- and the suffix -al. The affixes used in this morphological recognizer are morphologically significant affixes.

Using linguistic information, such as from Quirk, et al. [27], a set of parts of speech that are typically associated with each affix is created. From this set of parts of speech, two lists are created: the first-choice and second-choice lists. These two lists are used to allow multiple parse attempts with more information. The first-choice list indicates those parts of speech that are strongly associated with the given affix. For example, the suffix -ly is strongly associated with adverbs. The second-choice list indicates parts of speech that are associated with the given affix but not as strongly as parts of speech on the first-choice list. Also included on the second-choice list are all parts of speech from the first-choice list. For example, for the suffix -ly, the first-choice list consists of adverb, and the second-choice list consists of adverb and adjective as well as noun modifier.

Words that have neither a prefix nor a suffix are specially treated. Since no conclusions can be drawn about these words morphologically, the first-choice list for such words contains noun, verb, adjective, and modifier, and the second-choice list contains noun, verb, adjective, modifier, and adverb. These parts of speech are chosen because they constitute the open-class parts of speech. In the absence of

morphological information, the system uses all possible open-class parts of speech for unknown words. Adverb is omitted from the first-choice list for words with no affix since relatively few words in English are adverbs, and many of those that are carry identifying morphological information.
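A sketch of the matching test and the two-list lookup follows; the table holds only the -ly entry for illustration, whereas the full recognizer covers the whole affix list:

    ;; A minimal sketch of suffix matching and first-/second-choice lookup.
    (defparameter *suffix-table*
      '(("ly" :first (:adverb) :second (:adverb :adjective :modifier))))

    (defun suffix-match-p (word suffix)
      "True when WORD ends in SUFFIX and is strictly longer than it."
      (let ((lw (length word)) (ls (length suffix)))
        (and (> lw ls) (string= suffix word :start2 (- lw ls)))))

    (defun guess-parts-of-speech (word choice)
      "CHOICE is :first or :second. Falls back to the open-class lists
    described above for words that match no affix."
      (dolist (entry *suffix-table*
               (if (eq choice :first)
                   '(:noun :verb :adjective :modifier)
                   '(:noun :verb :adjective :modifier :adverb)))
        (when (suffix-match-p word (first entry))
          (return (getf (rest entry) choice)))))

    ;; (guess-parts-of-speech "happily" :first)  => (:ADVERB)
    ;; (guess-parts-of-speech "zzzax" :second)
    ;;   => (:NOUN :VERB :ADJECTIVE :MODIFIER :ADVERB)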

2.3.2 The post-mortem algorithm

A major focus of this research is the new post-mortem approach to parsing. This approach allows the parser to perform a back-off procedure, trying various predictions until a parse is found. The algorithm for the post-mortem approach for this experiment is as follows:

1. Combined Lexicon and First-Choice Morphological Recognizer
The lexicon is consulted for each word in the sentence. If the word is defined in the lexicon, its definition (consisting of the word, its part of speech, and various features affecting its use) is used to parse the sentence. If it is not in the lexicon, the system assumes that it can only be an open-class part of speech, and it could possibly be any of them. In this pass, the morphological recognizer is used to reduce the number of possible parts of speech for the word. Consulting a list of affixes, the recognizer determines which affixes, if any, match the word. Then the recognizer assigns all the parts of speech from the first-choice list for that affix to that word. If both a prefix and a suffix match the word, the lists for the suffix are used; an inspection of the affix lists suggests that suffixes are more indicative of part of speech than prefixes. For example, an unknown word ending in -ly is assumed to be an adverb. A parse forest for the sentence is generated. If the parse fails, the parser moves to pass two.

2. Combined Lexicon and Second-Choice Morphological Recognizer
If the sentence fails to parse in pass one, it is reparsed using the procedure from pass one, except that parts of speech from the second-choice list for the affix are assigned to the word, instead of the first-choice list. This assigns a more liberal set of parts of speech to the word based on its affix. For example, an unknown word ending in -ly is now assumed to be an adverb, an adjective, and a modifier. Again, the parse forest for the sentence is returned. If the sentence fails to parse, the parser goes on to pass three.

3. All Open-Class Variants for Unknown Words
If the sentence fails to parse in pass two, it is reparsed, and all open-class lexical categories are assigned to every unknown word. For example, an unknown word ending in -ly is now assumed to be a noun, verb, adverb, adjective, and modifier. The parse forest for the sentence is returned. If the sentence fails to parse in this pass, the parser moves on to pass four. At this point, the performance of the parser is equivalent to that of Tomita's parser, since we are assigning every unknown word all possible open-class parts of speech.

4. All Morphological Variants for All Open-Class Words
If the sentence fails to parse in pass three, it is reparsed using the morphological recognizer on all open-class words. It assigns a set of parts of speech to each word, based on the second-choice list for that word's affix. Note that only the closed-class lexicon is consulted during this attempt to parse. The morphological recognizer is used to determine definitions for all open-class words. This approach allows the parsing system to find new definitions for words that are already in the dictionary. For example, any open-class word ending in -ly is now assumed to be an adverb, an adjective, and a modifier. The parse forest for the sentence is returned. If the sentence fails to parse in this pass, the parser moves on to pass five.

5. All Open-Class Variants
If the sentence fails to parse in pass four, all possible open-class parts of speech are assigned to all open-class words in the sentence. Again, only the closed-class lexicon is consulted. For example, any word ending in -ly is now assumed to be a noun, verb, modifier, adjective, and adverb. The parse forest is returned, or the sentence fails to parse completely.

A flowchart version of this algorithm can be seen in Figure 2.2.

Fig. 2.2. The Post-Mortem Algorithm

Notice the post-mortem approach: if (and only if) the sentence fails to parse, the parser tries to parse it again, using more possible parts of speech in its word look-up algorithm. This limits the possibilities for unknown words in the beginning to those that are most likely, and broadens the search space later if the first attempts fail. The use of this approach should offer improved performance over other parsers, since the number of possible parses is limited at first by making restrictive lexical hypotheses about unknown words, and yet a parse is guaranteed if the rule set allows it.
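The control structure of the algorithm is captured by the following sketch, where PARSE-WITH is a hypothetical hook that parses a sentence under one unknown-word policy and returns a parse forest or NIL:

    ;; A sketch of the five-pass post-mortem loop; SOME returns the first
    ;; non-NIL parse forest, so later passes run only after earlier failures.
    (defun post-mortem-parse (sentence)
      (some (lambda (policy) (parse-with sentence policy))
            '(:lexicon+first-choice     ; pass 1
              :lexicon+second-choice    ; pass 2
              :lexicon+all-open-class   ; pass 3
              :morphology-everywhere    ; pass 4 (closed-class lexicon only)
              :all-open-class)))        ; pass 5 (closed-class lexicon only)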

2.3.3 Experimental design

For this preliminary experiment, two separate data runs are performed. The first run uses a variation of Tomita's method of assigning unknown words all possible parts of speech. The system assigns unknown words all possible open-class parts of speech (noun, verb, adjective, adverb, and modifier). This data run is called the baseline run. The second data run assigns parts of speech based on the output of the morphological recognizer, in accordance with the post-mortem algorithm. This data run is called the experimental run.

For each test run, all the sentences in the corpus are parsed by the system eleven times. For each pass through the corpus, all the closed-class words are loaded into the lexicon. In each separate pass, a different open-class dictionary is used. For the first pass, the full dictionary found in dict10 is used. This pass is called the control pass, since all words in the corpus are defined in the lexicon for this pass. It will be used to determine the performance of each subsequent pass in the run. For each successive pass, the next dictionary is used, from dict9 down to dict1. Finally, the eleventh pass is performed without loading any extra dictionary files; only the three closed-class files are used. Thus, data are generated for 0% of the open-class dictionary missing, 10% missing, and so on, up to 100% of the open-class dictionary missing.

After all of the parse trees have been generated, each individual pass is compared to the control pass, and three numbers are calculated for each sentence in each pass: the number of matches, deletions, and insertions.

- A match occurs when the parser has generated a parse for the sentence that occurs within the control parse forest for that sentence. This could occur if the unknown word predictor predicts a part of speech that is included in the word's definition.
- A deletion occurs when the parser has failed to produce a parse for the sentence that occurs in the control parse forest for that sentence. This could occur if the unknown word predictor fails to predict a part of speech that is included in the word's definition.
- An insertion occurs when the parser produces a parse for the sentence that does not occur in the control parse forest for that sentence. This could occur if the unknown word predictor predicts a part of speech that is not included in the word's definition.

For example, assume sentence #16 produces parses A, B, C, and D in the control pass. In a later pass, sentence #16 produces parses A, C, E, F, and G. Then, for sentence #16 in that pass, there are two matches (A and C), two deletions (B and D), and three insertions (E, F, and G). By using these measurements, the precision and recall of the parsing system can be determined when parsing sentences with unknown words.
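Treating each parse forest as a set, the three counts reduce to set operations; a minimal sketch:

    ;; A sketch of the match/deletion/insertion bookkeeping; parses are
    ;; compared as whole structures with EQUAL.
    (defun score-pass (control-forest test-forest)
      "Return (values matches deletions insertions) for one sentence."
      (values
       (length (intersection test-forest control-forest :test #'equal))
       (length (set-difference control-forest test-forest :test #'equal))
       (length (set-difference test-forest control-forest :test #'equal))))

    ;; The sentence #16 example:
    ;; (score-pass '(a b c d) '(a c e f g)) => 2, 2, 3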

The issue of disambiguation of the control parse forest is not explicitly dealt with in this experiment. The morphological recognizer is being tested to determine how well it can replicate the performance of a parser with a full dictionary. This is why we use the match, insertion, and deletion counts. The number of matches is important, since the recognizer should return all possible parses that occur when the full dictionary is used. The issue of which of these parses is the correct one would require that the recognizer utilize semantic, pragmatic, and contextual (possibly statistical) information to select the correct parse, a topic beyond the scope of the experiment. A typical parser will return all syntactically correct parses for a given sentence without determining which is "correct". We assume that all parses returned by the baseline run, done with a full lexicon, are correct.

2.3.4 Results

This experiment was run several times, changing the morphological recognizer each time to test the effect of the changes. The results listed here are the results using the recognizer that performed the best. The optimization of the recognizer to this corpus offers a chance to determine an upper bound on the capability of the system and also gives a benchmark performance that can be used in further experiments. These later experiments will use automatically constructed recognizers.

Two sets of data were collected in this experiment, one set for the baseline run and one set for the experimental run. Figure 2.3 compares the deletion rates of the experimental and baseline runs, Figure 2.4 shows the insertion rates, and Figure 2.5 shows the match rates. Table 2.1 shows the total number of deletions and insertions for the two runs, as well as the percentage of the total number of expected parses for all the sentences in the corpus in the control run (1137 parses for 356 sentences).

Fig. 2.3. Average Number of Deletions

In Figure 2.3, the deletion rate for the baseline data is zero, as expected. Since every possible part of speech is assigned to each unknown word, all the original parses should be generated. The deletion rate for the experimental run shows that some parses are being deleted when there are one or more unknown words. With 10% of the open-class dictionary missing, there are 22 deletions out of 1137 total possible parse matches. This is an average of only 0.062 deletions per sentence, or 1.9% of the total parses, as shown in Figure 2.3 and Table 2.1. With 100% of the open-class dictionary missing (in other words, using only closed-class words), there are 225 deletions, an average of 0.63 per sentence, or 19.8% of the total parses. In other words, over 80% of the original parses are produced, even with only closed-class words in the lexicon. The deletions arise in part because many words in the complete dictionary are lexically ambiguous, whereas the morphological recognizer and post-mortem parser often assign a smaller set of parts of speech to unknown words; this can result in a correct parse being generated, but not the entire control pass parse forest.

Fig. 2.4. Average Number of Insertions

The true value of the experimental approach can be seen when comparing the insertion rates. Figure 2.4 and Table 2.1 show that the insertion rate in the baseline data is enormous. This is due to the fact that the baseline system assigns every open-class part of speech to every unknown word. Baseline data are only available up to the point where 30% of the open-class dictionary is missing, because after this point the program runs out of memory storing all of the spurious parses. Since the baseline parser assigns every open-class part of speech to an unknown word, a combinatorial explosion of parses occurs. In the baseline run with 30% of the open-class words missing, there is a total of 110 942 insertions, or an average of 311.6 per sentence. At the same point in the experimental run (30% missing), there are only 2.5 insertions per sentence on average, less than one one-hundredth of the baseline value. With 100% of the open-class dictionary missing, there are 69.3 insertions per sentence when using the morphological recognizer in the experimental run; in terms of insertions, the experimental system with 100% of the open-class dictionary missing is comparable to the baseline with only 20% of the open-class dictionary removed. Clearly, the experimental data show that the insertion rate has been cut drastically by using the morphological recognizer together with our post-mortem approach.

Figure 2.5 shows the match rate for the two parsing systems of the experiment. This is the number of sentence parses that were produced that match the correct

parses.

Fig. 2.5. Average Number of Matches

Considering the definitions of deletion and match (a deletion occurs when a parse from the original forest is not produced, and a match occurs when a parse from the original forest is produced), the number of deletions plus the number of matches is constant across trials: it equals the total number of parses produced in the original forests. Figure 2.5 demonstrates this fact.

Table 2.1
Deletion and Insertion Data

% of Open-Class   Experimental Data                    Baseline Data
Dict Removed      Deletions       Insertions           Deletions     Insertions
                  Total  Percent  Total    Percent     Total Percent Total     Percent
  0                 0     0%          0      0%          0    0%          0       0%
 10                22     1.9%      315     27.7%        0    0%       7248     636.8%
 20                30     2.6%      533     46.9%        0    0%     24 095    2118.6%
 30                57     5.0%      892     78.5%        0    0%    110 942    9757.4%
 40                44     3.9%     1718    151.1%        -    -           -       -
 50                93     8.2%     2598    228.5%        -    -           -       -
 60               115    10.1%     4065    357.5%        -    -           -       -
 70               137    12.0%     3866    340.0%        -    -           -       -
 80               197    17.3%     9340    821.5%        -    -           -       -
 90               215    18.9%   11 545   1015.4%        -    -           -       -
100               225    19.8%   24 663   2169.1%        -    -           -       -

2.3.5 Comments

This experiment has demonstrated that Tomita's method of guessing all parts of speech for unknown words is unworkable in practice. It does reproduce all parses (there are no deletions), but at the cost of massive overgeneration of parses. Therefore, our hypothesis that this method would be inadequate is confirmed. Our post-mortem algorithm has addressed the problem of overgeneration nicely. The results of the experiment show that morphological recognition combined with post-mortem application shows promise as a method to parse sentences containing unknown words. This method will be further investigated in the subsequent experiments.

2.4 Developing a Computer-Generated Recognizer

While the performance of the post-mortem parser in section 2.3 is quite promising, the hand-created nature of the morphological recognizer violates our goal of automatic operation. We wish to construct a system that is as automated as possible, using as little information as possible that is not available in the training data. The hand-crafted nature of the recognizer limits the parser not only to English but also to corpora similar to the Timit corpus. In an attempt to automate the process of creating a good morphological recognizer, we develop a method to create the morphological recognizer automatically from an on-line dictionary. While we still use a given list of affixes (obtained from the dictionary), we now create the post-mortem recognition lists automatically. This section investigates the performance of this computer-generated recognizer, which is designed automatically using the part-of-speech data given in the on-line dictionary.


2.4.1 Constructing the recognizer

Several different sources were used to construct morphological recognizers. One source, an electronic dictionary from the MRC Psycholinguistic Database [44], contains over 150 000 word entries. Of the 150 000 word entries there are over 120 000 distinct word strings; the remaining 30 000 entries are due to the fact that some words have more than one part of speech listed (each entry gives the word and only one of its possible parts of speech). Also included in the MRC Psycholinguistic Database is a listing of 1604 affixes, consisting of both prefixes and suffixes. The Timit corpus was also used to construct some of the recognizers in these experiments.

To construct a recognizer, the program needs to determine the most likely parts of speech associated with words that match a given affix. A word and an affix are said to match if the word contains that affix as a substring, either at the beginning (for prefixes) or end (for suffixes) of the word. For example, if -ity is a suffix compared against the word ability, then ability and -ity have a suffix match, and ability is considered to have -ity as a suffix. The distribution of parts of speech for words that match -ity is then used to predict possible parts of speech for an unknown word that matches -ity. Note that this definition of matching an affix allows possible misidentification, as with butterfly in section 2.2.2. However, it is expected that the matching is accurate for the majority of words.

The MRC dictionary is used to construct the affix information for this morphological recognition module. Each word in the MRC dictionary is compared against the list of affixes created from the MRC dictionary. For each word, all suffixes and prefixes that match the word are used by the program. If no affix is present, the word falls into the none category, indicating that there are no affixes that match it. Only up to one level of prefixation or suffixation is identified for each word; multiple prefixation or suffixation is beyond the scope of the current system. For example, the word multinationalization would match the prefix multi- and the suffix -tion; the additional levels of suffixation are ignored.

After the words are processed by the program, the total number of words, plus the total number of nouns, verbs, adjectives, adverbs, and other parts of speech, is calculated for each affix using the part-of-speech data found in the MRC dictionary. Any part of speech other than noun, verb, adjective, or adverb is classified as "other", to simplify the reporting of results. The open-class parts of speech of noun, verb, adjective, and adverb are traditionally morphologically rich; there are many affixes available to form words with these parts of speech. The "other" parts of speech tend to be morphologically poor; they are not generally formed using affixation. Also, the other parts of speech are mostly closed-class words and should not require morphological identification.

The totals for each affix give a picture of which parts of speech tend to be associated with which affixes. The top ten most frequently occurring affixes in words from the MRC dictionary are listed in Table 2.2, along with the number of words in which each occurred, and the percentages of those occurrences that were nouns, verbs, adjectives, adverbs, and other parts of speech.

Table 2.2
Top Ten Occurring Affixes
Affix   Words   Nouns  Verbs  Adjectives  Adverbs  Others
-s     21 106   73.9%  24.0%    0.6%       0.3%     1.2%
-ed      7510    2.6%  63.7%   20.7%       0.2%    12.8%
-ing     5807    9.9%  69.7%    6.3%       0.4%    13.7%
-er      4311   76.8%  11.2%    8.8%       1.6%     1.6%
-ion     2714   94.5%   2.5%    2.2%       0.1%     0.7%
-ly      2415    6.1%   1.6%   11.7%      78.1%     2.6%
-ate     2214   19.8%  47.3%   24.2%       0.3%     8.4%
-ic      2068   23.6%   0.7%   73.9%       0.0%     1.8%
-y       1982   51.3%  10.5%   33.5%       2.0%     2.7%
-ous     1809    3.0%   0.3%   94.6%       0.8%     1.2%

Looking at this data, there are few surprises as to which affixes are expected to occur most often. The important fact to notice is the distinctiveness of most of these affixes. Only one of the top ten affixes (-ate) does not have a single part of speech that occurs in more than 50% of the words with that affix. Some of them are especially effective at signifying a certain part of speech, especially -ion (94.5% nouns) and -ous (94.6% adjectives). Suffixes like these help to make morphological recognition a very promising source of information for the identification of unknown words, since the predictions given will be accurate much of the time.

Using the part-of-speech totals for all the affixes, two lists of parts of speech were constructed for each affix, as in section 2.3.1. The first-choice list indicates those parts of speech that most commonly occur in words with that affix. This list contains the parts of speech that occurred in over 30% of the words matching that affix. The second-choice list indicates those parts of speech that occur less frequently in words with that affix. This list is composed of all parts of speech that occurred in over 10% of the words, which includes those contained in the first-choice list. The "other" part of speech is ignored in constructing these lists, since this class actually consists of several parts of speech. The "other" parts of speech are closed-class, so they are assumed not to occur in unknown words. The first- and second-choice lists support the multiple-phase post-mortem parsing approach described in Section 2.3, and these lists should offer more precise guesses about the parts of speech associated with unknown words (instead of guessing them all).

Words that have neither a prefix nor a suffix are specially treated. Since no conclusions can be drawn morphologically about these words, the first-choice list for such words contains noun, verb, and adjective, and the second-choice list contains noun, verb, adjective, and adverb. In the absence of morphological information, the system uses all possible open-class parts of speech for unknown words. Adverb is omitted from the first-choice list for words with no affix since relatively few words in English are adverbs, and many of those that are carry identifying morphological information.

As an example of this construction process, consider the suffix -ic as listed in Table 2.2. This affix matches 2068 words in this dictionary. Of those words, 73.9% are adjectives, 23.6% nouns, 1.8% others, 0.7% verbs, and 0.0% adverbs. For -ic, the first-choice list contains only the adjective part of speech, since it is the only part of speech that occurs over 30% of the time. The second-choice list contains adjective and noun, since both occur over 10% of the time. Any list that contains adjective or noun will also contain noun modifier.
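A sketch of this flat-cutoff construction follows; the raw counts in the example comment are derived from the -ic percentages in Table 2.2 and are illustrative:

    ;; A sketch of the 30%/10% list construction. COUNTS is a plist of raw
    ;; per-affix word counts by part of speech; TOTAL is the affix's word count.
    (defun cutoff-lists (counts total)
      "Return (values first-choice second-choice)."
      (let (first second)
        (loop for (pos count) on counts by #'cddr
              for freq = (/ count total)
              when (> freq 3/10) do (push pos first)
              when (> freq 1/10) do (push pos second))
        (values (nreverse first) (nreverse second))))

    ;; For -ic, roughly (:adjective 1528 :noun 488 :verb 14 :adverb 0) of 2068:
    ;; first-choice = (:ADJECTIVE), second-choice = (:ADJECTIVE :NOUN).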

2.4.2 Discussion of experimental results

Several experiments have been run using the computer-generated recognizer (CGR). These experiments are fully detailed in our paper 45], but some overall perceptions are discussed here. While the computer-generated recognizer seemed to perform well when tested on individual words, when it was used in the post-mortem parser, results were not as good as those reported in section 2.3. The recognizer that was trained on the Timit corpus, which performed the best of all the computer-generated recognizers that were tested, produced 240 deletions and 32 979 insertions with 100% of the open-class lexicon removed. As seen in Table 2.1, the hand-generated recognizer produced 225 deletions and 24 663 insertions at the same point. The CGR never performed better than the hand-generated recognizer. The main cause for the poor performance is that computer-generated recognizers were often too specic: if the CGR obtained a sentence parse using a too-specic prediction, it would likely miss parses from the control pass, thus increasing deletions. On the other hand, if it failed to parse the sentence initially, it would attempt and succeed with a larger set of alternatives, increasing insertions. We believe that there are two possible reasons for this failure of the CGR to perform adequately: 1. The automatic method used to construct the recognizer is not producing an accurate recognizer, or 2. The post-mortem approach using rst- and second-choice lists is not ne-grained enough to oer adequate results. We chose to address these points in the following section.

2.5 Attempts to Refine the Recognizer

To improve the computer-generated recognizer (CGR), the computer's method of assigning parts of speech to the first- and second-choice lists was altered. As described in section 2.4.1, the computer program used a flat 30%/10% cutoff method. If over 30% of the words with a certain affix had a specific part of speech, that part of speech was assigned to the first-choice and second-choice lists for that affix. If over 10% of the words with an affix had a specific part of speech, that part of speech was assigned to the second-choice list for that affix. The part-of-speech lists for each affix were constructed in this fashion automatically by the computer.

One problem with this method of list construction is that it fails to take into account the overall distribution of the words in the dictionary. Since certain parts of speech occur less frequently than others in the corpus, affixes that contain these parts of speech should take that distribution into account when creating the first- and second-choice lists. For example, the part-of-speech data for the MRC dictionary shows that only 2.9% of the words in the MRC dictionary are adverbs, but according to the flat cutoff method, over 30% of the words for a given affix must be adverbs to assign adverb to the first-choice list. This method ignores an important fact about the distribution of parts of speech over the entire dictionary.

2.5.1 The ratio construction method

In an attempt to improve the performance of the CGR, a new method of creating the lists was devised. This method attempts to incorporate the statistics of the overall corpus into the process of list creation. To create the lists, a ratio r is formed, where r is defined as:

    r = (a_pos / a_tot) / (n_pos / n_tot)

with:
- a_pos = the number of words with the given affix that have the given part of speech,
- a_tot = the total number of words with the given affix,
- n_pos = the number of words from the corpus that have the given part of speech, and
- n_tot = the total number of words in the corpus.

Table 2.3
Ratio Calculation Example
Part of Speech   Ratio
noun             (10/100) / (500/1000) = 0.20
verb             (15/100) / (300/1000) = 0.50
adjective        (50/100) / (150/1000) = 3.33
adverb           (25/100) / (50/1000)  = 5.00
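The calculation in Table 2.3 is a one-liner; as a minimal sketch:

    ;; A sketch of the ratio r defined above. Integer arguments give exact
    ;; rationals; FLOAT is applied only for display.
    (defun affix-ratio (a-pos a-tot n-pos n-tot)
      "r = (a-pos / a-tot) / (n-pos / n-tot)."
      (float (/ (/ a-pos a-tot) (/ n-pos n-tot))))

    ;; The adverb row of Table 2.3:
    ;; (affix-ratio 25 100 50 1000) => 5.0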

This ratio r is a unitless number that represents how the distribution of parts of speech over words containing the affix differs from the distribution over all words in the corpus. If the ratio r is equal to 1, that part of speech occurs in the affixed words at the same rate that it occurs in all words of the corpus. If r is less than 1, the part of speech occurs less frequently in the affixed words, and if it is greater than 1, it occurs more frequently in the affixed words.

An example should make this concept clearer. Assume that there are exactly 1000 words in the training corpus, and that there are exactly 100 words with the -ly affix. Assume that the distribution of parts of speech for the corpus is 500 nouns, 300 verbs, 150 adjectives, and 50 adverbs, and that the distribution of parts of speech for words matching the -ly affix is 10 nouns, 15 verbs, 50 adjectives, and 25 adverbs. (These numbers are used for example purposes only.) The ratios for the various parts of speech are calculated and displayed in Table 2.3. These ratios show that words with the affix -ly are adverbs and adjectives more frequently than other words, while they are nouns and verbs less frequently. Notice that if the flat-cutoff construction method had been used for this -ly affix, adverb would not have been added to the first-choice list, as it only occurs in 25% of the words containing -ly. Clearly, this new statistical method uses more information in the construction process.

Once the ratios are calculated for each affix, the first-choice and second-choice lists for that affix can be constructed. Instead of setting a cutoff in terms of percentages, the cutoff is now set in terms of the calculated ratios. After extensive testing of different ratio combinations, the best results occurred when using a cutoff of 1.2 for the first-choice list and 0.4 for the second-choice list. In other words, any part of speech with a ratio of 1.2 or more for a given affix was placed on the first-choice and second-choice lists for that affix, and any part of speech with a ratio of 0.4 or more was placed on the second-choice list only.

Unfortunately, while this construction method did improve the performance of the CGR while parsing in most cases, it did not improve it to a level near the hand-generated recognizer. The new method produced 237 deletions and 27 465 insertions with 100% of the open-class lexicon removed. Whereas this is an improvement over the earlier construction method, it is still not close to the performance of the hand-generated recognizer. The CGR was apparently still too specific in the first attempt and not specific enough for the second attempt. Unfortunately, no alteration of the ratios used during list construction offered better performance. Another point is that the ratio levels used during list construction seem somewhat unintuitive: using 0.4 as a lower limit would seem to include parts of speech in the affix's lists that should not be included.

Changing the method of construction of the recognizer, while improving performance, did not offer enough of an improvement to the parsing system. Therefore, the second possibility for failure mentioned in section 2.4.2 is considered: the lack of granularity of the post-mortem algorithm.

2.5.2 The frequency list post-mortem parsing method

The lack of granularity in the post-mortem algorithm stems from the fact that there are only two lists available to the recognizer: if the parser fails the first parse attempt, it moves to the second list. In addition, as soon as the parse fails once, all unknown words use their second-choice list in attempting the parse again. There is not enough granularity to allow multiple levels of information to be used. There should be a way to allow a more gradual broadening of the coverage of the recognizer and allow different words to change levels at different times. A new method for integrating the morphological recognition module into the post-mortem algorithm is

designed here. This new recognition method discards the idea of two separate lists of parts of speech for each affix. Instead, each affix has one list associated with it called a frequency list. This frequency list contains four numbers, one for each open-class part of speech: noun, verb, adjective, and adverb. Each element of the list for an affix is a number that reflects the frequency of that part of speech in words containing that affix. The more frequent the part of speech, the higher the number for that part of speech in the frequency list. For the present experiment, a scale from 0 to 10 (integers only) is used for elements of the frequency list.

The frequency list for each affix is constructed in the following manner (a sketch of the computation follows the list):

1. The frequency for each part of speech is calculated as a fraction of the total words with that affix. For example, returning to the example using the affix -ly from section 2.5.1, it occurs in 100 words total: 10 nouns, 15 verbs, 50 adjectives, and 25 adverbs. So, the fractional frequencies would be 0.10 for nouns, 0.15 for verbs, 0.50 for adjectives, and 0.25 for adverbs.

2. The fractional frequencies are multiplied by 10. In our example, we get 1 for nouns, 1.5 for verbs, 5 for adjectives, and 2.5 for adverbs.

3. The entire list is scaled by multiplication so that it ranges from 0 to 10. In other words, every member of the list is divided by the largest member of the list, then multiplied by 10. Also, the numbers are truncated so that they are all integers. In our example, the scaling is done by multiplying each member of the list by two, producing 2 for nouns, 3 for verbs, 10 for adjectives, and 5 for adverbs. This method guarantees that at least one element of the frequency list will be equal to 10.

4. The frequency list is constructed for the affix. In our example, the frequency list associated with the affix -ly is (Noun: 2, Verb: 3, Adjective: 10, Adverb: 5).
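As a sketch, note that steps 1-3 collapse into a single integer computation, floor(10 * count / max):

    ;; A sketch of frequency-list construction (steps 1-4 above). COUNTS is
    ;; an alist of raw per-affix word counts by part of speech.
    (defun frequency-list (counts)
      "Scale COUNTS so the most frequent part of speech is rated 10."
      (let ((biggest (reduce #'max counts :key #'cdr)))
        (mapcar (lambda (pair)
                  ;; floor(10 * count / max) performs steps 1-3 in one operation
                  (cons (car pair) (floor (* 10 (cdr pair)) biggest)))
                counts)))

    ;; The -ly example:
    ;; (frequency-list '((:noun . 10) (:verb . 15)
    ;;                   (:adjective . 50) (:adverb . 25)))
    ;; => ((:NOUN . 2) (:VERB . 3) (:ADJECTIVE . 10) (:ADVERB . 5))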

The parsing system has been altered to use these lists. Instead of using the five-step system outlined in section 2.3, the new post-mortem parsing method uses a twelve-step process. The algorithm is described in the next section, and shown in Figure 2.6.

Fig. 2.6. The Frequency List Post-Mortem Algorithm

2.5.3 The new post-mortem algorithm

1. Level 10 Parse Attempt
The lexicon is consulted for each word in the sentence. If the word is defined in the lexicon, its definition (consisting of the word, its part of speech, and various features affecting its use) is used to parse the sentence. If it is not in the lexicon, the system assumes that it can only be an open-class part of speech, and it could possibly be any of them. In this pass, the morphological recognizer is used to reduce the number of possible parts of speech for the word. Consulting a list of affixes, the recognizer determines which affix, if any, matches the word. Then the recognizer assigns to that word all the parts of speech from the frequency list for that affix that have a rating of 10. A parse forest for the sentence is generated. If the parse fails, the parser moves to the next level.

2. Level 9 Parse Attempt
The lexicon is consulted for each word, as above. If the word is not in the lexicon, the recognizer assigns all parts of speech to the word that have a rating of 9 or above in the frequency list for that word. If the parse fails, the parser moves to the next level.

3 ... 10. Levels 8 ... 1 Parse Attempts
The lexicon is consulted for each word, as above. If the word is not in the lexicon, the recognizer assigns all parts of speech to the word that have a rating appropriate to the level or above in the frequency list for that word. If the parse fails, the parser moves to the next level.

11. Level 0 Parse Attempt
The lexicon is consulted for each word, as in previous attempts. If the word is not in the lexicon, the recognizer assigns all parts of speech to the word that have a rating of 0 or above in the frequency list for that word. This, in fact, assigns all possible open-class parts of speech to all unknown words. If the parse fails, the parser moves to the next level.

12. All Open-Class Variants
All open-class parts of speech are assigned to all open-class words in the sentence. Again, only the closed-class lexicon is consulted. For example, any word ending in -ly is now assumed to be a noun, verb, modifier, adjective, and adverb. The parse forest is returned, or the sentence fails to parse completely.
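The control loop is again simple. In the sketch below, TRY-PARSE is a hypothetical hook that parses the sentence while admitting, for each unknown word, only the parts of speech rated at or above the threshold; NIL stands for the final all-open-class-variants pass (step 12):

    ;; A sketch of the twelve-level post-mortem loop; THEREIS returns the
    ;; first non-NIL parse forest.
    (defun frequency-post-mortem-parse (sentence)
      (loop for threshold in '(10 9 8 7 6 5 4 3 2 1 0 nil)
            thereis (try-parse sentence threshold)))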

2.5.4 Experiments

In this experiment, the results of the post-mortem parser when using a CGR with frequency lists are compared with the results from the parser with the hand-generated recognition module from section 2.3. The new post-mortem algorithm is used for all parsing tests, and the experimental data runs are executed in the same way as the experiment in section 2.3. Since the hand-generated module was created using the original two-list method, it was altered to use the frequency lists described in section 2.5.2. This should create a fairer comparison with the CGR.

There are two questions about how to parse with a CGR using frequency lists that should be considered. The first is how to parse words that match no affix in the recognizer. Unknown words that match no affix must be assigned some frequency list to allow the parser to parse the sentence. The frequency list that is assigned to no-affix words is an important one, and could drastically affect the performance of the parser. For that reason, three different possibilities were considered and tested separately:

Dictionary Distribution Frequencies: In this method, the frequency list (Noun: 10, Verb: 6, Adjective: 3, Adverb: 1) is assigned to words that match no affix. This frequency list uses the frequencies that match the distribution of all open-class words in the MRC dictionary.

Original Frequencies: In this method, the frequency list (Noun: 10, Verb: 10, Adjective: 10, Adverb: 1) is assigned to words that match no affix. This method is analogous to the method that the original post-mortem algorithm used when assigning parts of speech to no-affix words.

Compromise Frequencies: In this method, the frequency list (Noun: 10, Verb: 10, Adjective: 6, Adverb: 1) is assigned to words that match no affix. This method is considered a compromise between the other two methods.

The second question about a frequency list CGR is how to deal with words that match both a prefix and a suffix. There are three methods considered for reconciling the frequency lists of a prefix and a suffix in a single word (a sketch of all three follows the list):

Maximum Method: In this method, if a word has both a prefix and a suffix, a new frequency list is formed by taking the maximum of each element of the two lists. For example, say a word has a prefix with a frequency list of (Noun: 10, Verb: 3, Adjective: 0, Adverb: 1), and a suffix with a list of (Noun: 0, Verb: 5, Adjective: 10, Adverb: 1). This word will have a frequency list of (Noun: 10, Verb: 5, Adjective: 10, Adverb: 1) while using the maximum method. Results for this method are found in Figure 2.7.

Average Method: In this method, the frequency list for a word is formed by averaging the two frequency lists of the prefix and suffix, then rescaling the list to a maximum of 10. Using the same example above, the word's frequency list will be averaged to (Noun: 5, Verb: 4, Adjective: 5, Adverb: 1), which scales to (Noun: 10, Verb: 8, Adjective: 10, Adverb: 2). Results for this method are found in Figure 2.8.

Suffix-First Method: In this method, the frequency list for the suffix is used if the word has both a suffix and a prefix. Using the example above, the word's frequency list would be (Noun: 0, Verb: 5, Adjective: 10, Adverb: 1). Results for this method are in Figure 2.9.
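A sketch of the three reconciliation strategies, using alists of (part-of-speech . rating) pairs that are assumed to list the categories in the same order:

    ;; A sketch of the three prefix/suffix combination methods.
    (defun combine-lists (prefix-list suffix-list method)
      (ecase method
        (:maximum        ; element-wise maximum
         (mapcar (lambda (p s) (cons (car p) (max (cdr p) (cdr s))))
                 prefix-list suffix-list))
        (:average        ; element-wise mean, rescaled so the largest is 10
         (let* ((avg (mapcar (lambda (p s)
                               (cons (car p) (/ (+ (cdr p) (cdr s)) 2)))
                             prefix-list suffix-list))
                (biggest (reduce #'max avg :key #'cdr)))
           (mapcar (lambda (pair)
                     (cons (car pair) (floor (* 10 (cdr pair)) biggest)))
                   avg)))
        (:suffix-first   ; ignore the prefix entirely
         suffix-list)))

    ;; With the prefix ratings (10 3 0 1) and suffix ratings (0 5 10 1) from
    ;; the text, :maximum gives (10 5 10 1) and :average gives (10 8 10 2).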

The experiment tests three different morphological recognition modules: the hand-generated recognizer altered to use frequency lists (using a frequency of 10 for parts of speech on the first-choice list, 5 for parts of speech on the second-choice list that are not on the first-choice list, and 0 for all other parts of speech), a CGR created by training on the Timit corpus dictionary, and a CGR created by training on the MRC dictionary. Each experiment was performed using the same procedure as in section 2.3. The Timit test corpus was parsed eleven times, and the number of deletions, insertions, and matches was calculated for each pass. An experiment was run for each possible combination of methods and frequency lists for no-affix words discussed above, for a total of nine separate experimental runs for each recognizer: a total of twenty-seven sets of data in all. The results from each of these runs are found in the following section.

2.5.5 Results

The results from parsing with the three different morphological recognition modules are found in Figure 2.7, Figure 2.8, and Figure 2.9. This data is also available in a tabular format in Appendix B.

Fig. 2.7. Results using the Maximum Method

From these results, several interesting conclusions can be drawn. First, the deletion performance of the hand-generated recognizer has become worse after it was translated to the new post-mortem algorithm. However, its insertion rate has drastically improved. With 100% of the open-class dictionary missing, there are 225 deletions and 24 663 insertions with the old post-mortem method (Table 2.1). Using the most similar new method (Suffix-First Method with Original Frequencies for no-affix words), there are 319 deletions and 8164 insertions. This behavior in the hand-created recognizer can be attributed to the different methods for combining prefix and suffix information, as well as the different frequency lists that were used for no-affix words. It is felt that the remarkable decrease in insertions more than offsets the increase in deletions, and that the comparisons from this experiment are still valid.

The Timit-created CGR out-performs the hand-generated recognizer in almost all cases. The MRC-created CGR does not perform as well as the Timit-created CGR, but still manages to out-perform the hand-generated version in both deletions and insertions most of the time. However, the MRC-created CGR should not be expected to perform at a level equal to the Timit-created CGR; after all, the Timit CGR is in a sense testing on the same data it has been trained on. If all the lexical information from the corpus were available (as was required for the construction of the Timit recognizer), there would be a full lexicon for that corpus, and there would be no need for a recognizer in the first place. Thus, using a CGR created from a machine-readable dictionary is more indicative of what could be achieved for real-world applications.

In comparing the different methods for processing words with two affixes, the maximum method seems to perform the best. The suffix-first method is roughly equivalent to the maximum method in performance at early levels with few unknown words, but as more of the open-class lexicon is removed, the insertion rate grows very large. The average method does not perform as consistently well as the maximum method either. This could be due to the compression of scale that occurs when the

elements are averaged.

Fig. 2.8. Results using the Average Method

Considering the various frequency lists tested for no-affix words, the results here demonstrate the inherent trade-off between deletion rate and insertion rate when using a morphological recognizer. Using the compromise frequencies for no-affix words seems to offer the best overall performance. For the data from the CGR trained on the MRC2 dictionary, using the Maximum Method with 100% of the open-class dictionary removed, the number of deletions when using the compromise frequency lists is 18.5% lower, and the insertion rate is only 5.2% higher, than the rates when using the dictionary distribution frequency lists. However, the deletion rate for the compromise frequency lists is only 3.4% higher than the deletion rate for the original frequency lists, and the insertion rate for the compromise frequency lists is 16.4% lower than that of the original frequency lists. Therefore, using compromise frequency lists for no-affix words seems to offer the best overall performance while balancing deletions and insertions. Similar results were found using the Timit-created CGR and the hand-generated recognizer. Therefore, the Maximum Method using Compromise Frequency lists for no-affix words (in Figure 2.7) appears to be the best choice, and it is used in all the following calculations.

With 100% of the open-class lexicon removed, the Timit-created recognizer produces 271 deletions and 1896 insertions: the deletion rate is 18.9% lower than that of the hand-generated recognizer (334 deletions), and the insertion rate is 18.1% lower (2316 insertions) than that of the hand-generated recognizer. The MRC-created recognizer at the same point gives a 2.7% lower deletion rate (325 deletions for the MRC CGR vs. 334 deletions for the hand-generated recognizer), but a 9.4% higher insertion rate (2533 vs. 2316) than the hand-generated recognizer. In general, the Timit-created CGR out-performs the hand-generated model. This demonstrates the utility of the new algorithm for the post-mortem approach, as well as the worth of a corpus-based morphological recognizer. The MRC-created model performs very near the hand-generated model as well, indicating that dictionary-based, computer-generated morphological recognition can be useful in real-world parsers.


Fig. 2.9. Results using the Suffix-Only Method

Looking at the results from section 2.3, the Timit-created CGR has 271 deletions and 1896 insertions; the hand-created recognizer previously resulted in 225 deletions and 24 663 insertions. The CGR has a 20% increase in the number of deletions and a 92% decrease in the number of insertions. This seems like a good trade-off for sentence parsing. Keep in mind that the hand-created recognizer was optimized over many experiments for the Timit corpus.

2.6 Final Thoughts on Parsing

We have shown that morphological recognition, the distinction between closed-class and open-class words, and syntactic knowledge are powerful tools in handling unknown words when combined with a post-mortem method of parsing. These knowledge sources allow us to determine parts of speech for unknown words while using a corpus that does not have a specific domain. The insertion rate can be drastically reduced with only a moderate increase in the deletion rate.

Obviously, there is a trade-off between the deletion rate and the insertion rate. This trade-off can be manipulated by altering the morphological rules to place more importance on a low deletion rate or a low insertion rate (done by altering the parse attempt level at which new part-of-speech predictions are added), by modifying our post-mortem approach to obtain finer control over the process of handling unknown words, or by considering additional knowledge sources. This issue should be of interest for any researcher developing a parsing system that will need to deal with unknown words.

The performance of our post-mortem parser using an automatically created recognizer in section 2.5 has been improved to a level near or above that of the parsing system using a hand-created recognizer from section 2.3. By modifying the post-mortem algorithm to have higher granularity and by constructing fine-grained frequency lists for use in the morphological recognizer, we were able to obtain results that were close or superior to those of the hand-constructed system. Experiments demonstrate that an automatically constructed morphological recognizer can be used to obtain good performance while parsing unknown words, thus limiting the need for human design in recognizer construction.

The use of a frequency list has moved the morphological recognizer closer to a statistical model. Rather than consider statistical parsing, we have chosen to continue our study of unknown words by using part-of-speech tagging. This allows us to investigate statistical models for the recognizer without developing new grammars. The following chapter of this document concerns itself with the use of a morphological recognizer together with a high-quality statistical part-of-speech tagger.


3. PART-OF-SPEECH TAGGING

3.1 Introduction

The experiments detailed in Chapter 2 use an unknown word recognizer constructed using a specific list of significant affixes. Those experiments used a variety of techniques: hand-created or computer-generated first-choice and second-choice lists, ratio lists, and frequency lists. One thing that all of the methods had in common was the fact that they each utilized a specific list of affixes specially created for the task. This forces the unknown word recognizer to use the given set of affixes, something that limits the recognizer to English and possibly ignores other significant affixes.

We will now investigate the performance of a statistical unknown word recognizer that is created automatically without using a hand-created list of affixes. This recognizer creates a probability distribution over all possible parts of speech for an unknown word instead of simple lists. This is done by creating a probabilistic lexicon from a large tagged corpus and using that data to estimate distributions for words with a given "prefix" or "suffix". Prefix and suffix here indicate substrings that come at the beginning and end of a word respectively, and are not necessarily morphologically meaningful. This predictor will offer a probability distribution of possible tags for an unknown word based solely on the statistical data available in the training corpus.

This statistical recognizer is created for eventual use in a part-of-speech tagger. This shift to part-of-speech tagging is done for many reasons:

1. Part-of-speech tagging offers a more flexible research platform for the study of statistical unknown word recognition. Accuracy rates for statistical parsers are generally significantly lower than rates achieved by taggers. By focusing on part-of-speech tagging, with its relatively high accuracy rate, we should be able to more effectively evaluate the impact of the recognizer.

2. We want to predict the single most likely tag for the unknown word in context. The predictors used in the parsing experiments in Chapter 2 returned lists of parts of speech. By using a part-of-speech tagger, we will get the single most likely tag, which can be compared to the correct tag, making accuracy calculations straightforward measures of performance.

3. By using a part-of-speech tagger, there is no need to develop a grammar. This allows us to meet our goal of human-independence by avoiding the need for a hand-created grammar, and it also avoids the use of corpus-specific knowledge.

4. By using part-of-speech tagging, it is easier to compare our work with that of other researchers in the field. Word accuracy rates are well-defined in the tagging research community, as opposed to the parsing community. This allows results to be more easily measured and compared across researchers.

3.2 Preliminary Investigation on Information Sources

Before working with a part-of-speech tagger, we would like to determine how accurate a statistical morphological recognizer without a preset list of affixes can be. Assuming that the recognizer works well, it will be integrated into a complete part-of-speech tagger.

3.2.1 Creating the predictor

This predictor is constructed based on the assumption that new words in a language are created using a well-defined morphological process. We wish to use suffixes and prefixes to predict possible tags for unknown words. For example, a word ending in -ed is likely to be a past tense verb or a past participle verb. This rough stemming is a simple technique, but it avoids the need for hand-crafted morphological information.

To automatically build the unknown word predictor, a lexicon was created from the training corpus (the Brown corpus, consisting of about 1.2 million words). The training corpus is made up of 90% of the full corpus. The entry for a word consists of a list of all tags assigned to that word, along with the number of times that each tag was assigned to that word over the entire training corpus. For example, the lexicon entry for the word advanced would be

(advanced ((VBN 31) (JJ 12) (VBD 8)))

indicating that the word advanced appeared a total of 51 times in the corpus: 31 as a past participle (VBN), 12 as an adjective (JJ), and 8 as a past tense verb (VBD).¹ This lexicon is used as a preliminary source to construct the unknown word predictor.

After creating the lexicon, each word in it is processed to determine which affixes match the word. Every affix up to four characters long, or up to two characters less than the length of the word, whichever is smaller, is considered. Only open-class tags are considered when constructing the distributions. For each affix, all the lexical information for every word that matches that affix is totaled and used to create a probability distribution for all affixes that appear in the corpus. For example, assume that the affix -ed occurs in 20 000 words such that 10 000 of those words are past tense verbs, 8000 are past participles, 1500 are adjectives, and 500 are nouns. Then the probability distribution is created such that there is a 50% chance a word ending in -ed is a past tense verb, a 40% chance it is a past participle, a 7.5% chance it is an adjective, and a 2.5% chance it is a noun.
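As a concrete illustration of this construction, consider the following minimal sketch in Python. It is not our implementation; the corpus representation (a list of (word, tag) pairs) and the function name are assumptions for illustration, and only the counting and normalization described above are shown.

from collections import defaultdict

def build_affix_distributions(tagged_words, open_class_tags, max_len=4):
    # Count tag occurrences for every prefix and suffix of length one to
    # max_len (and at most two characters shorter than the word itself).
    prefix_counts = defaultdict(lambda: defaultdict(int))
    suffix_counts = defaultdict(lambda: defaultdict(int))
    for word, tag in tagged_words:
        if tag not in open_class_tags:
            continue  # only open-class tags enter the distributions
        for k in range(1, min(max_len, len(word) - 2) + 1):
            prefix_counts[word[:k]][tag] += 1
            suffix_counts[word[-k:]][tag] += 1
    def normalize(counts):
        # Turn the raw counts into a probability distribution per affix.
        return {affix: {t: n / sum(tags.values()) for t, n in tags.items()}
                for affix, tags in counts.items()}
    return normalize(prefix_counts), normalize(suffix_counts)

For the -ed example above, the suffix entry for "ed" would hold the distribution {VBD: 0.5, VBN: 0.4, JJ: 0.075, NN: 0.025}.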

3.2.2 Refining the predictions

There are several techniques that can be used to refine the distributions of possible tags for unknown words. After creating the predictor, as illustrated in the previous section, we have a set of various affixes along with probabilities associating each affix with certain parts of speech. However, in many cases when an unknown word is encountered, there is a question of which affix to use for prediction if a word matches several alternative prefixes and suffixes. How to choose the best tag in the face of several available probability distributions is an important issue.

For this preliminary experiment, we chose to use the longest matching prefix/suffix found in the recognizer. For example, consider the word ability. If the suffix -lity appears in the recognizer data, it will be used as the suffix contribution, ignoring any information available from -ity, -ty, and -y. This is done because it is hypothesized that longer suffixes (or prefixes) will be more predictive and specific than shorter ones. They will certainly occur less frequently than the shorter versions of that affix. The issue of combining suffixes of various lengths is considered later in section 3.5.4.

Another important decision concerns how to combine information from a prefix and a suffix in a single word. If the word does not match any suffix or prefix at all, the overall distribution of words is used. Otherwise, there are several methods of combining the two pieces of information, some of which are considered here.

¹ This example illustrates an interesting problem with the Brown corpus, as well as other available tagged corpora. Past participles can frequently be used as adjectives in English sentences. So, how do we know when advanced is acting as an adjective, or as a past participle being used as an adjective? Words in the corpus typically (but not always) are tagged according to how they are acting in the specific sentence. This can cause problems with tagging and unknown word prediction. Similar problems occur between NN (common noun) and JJ (adjective) in the position of noun modifier.

The Basic Method: When an unknown word is encountered, the longest suffix and prefix that match the word are determined. If no prefix matches, the suffix distribution is used; if no suffix is found, the prefix distribution is used. Otherwise, a simple heuristic method of selecting the distribution that allows the fewest possible tags is used. For instance, if the prefix has a distribution over three possible tags, and the suffix has a distribution over five possible tags, the distribution from the prefix is used.

The Entropy Method: The next method considered uses the entropy measure of the prefix and suffix distributions to determine which should be more useful. Entropy, used in some part-of-speech tagging systems [46], is a measure of how much information is necessary to separate data. The concept of entropy is developed in Jaynes [47] and Good [48]. The entropy of a tag distribution is determined by the following equation:

$$\text{Entropy of the } i\text{-th affix} = -\sum_j \frac{n_{ij}}{N_i} \log_2 \frac{n_{ij}}{N_i}$$

where

$n_{ij}$ = the number of times a word with the $i$-th affix is tagged with the $j$-th tag
$N_i$ = the total occurrences of words containing the $i$-th affix

The affix/tag distribution with the smallest entropy is used, as this is the distribution that offers the most predictive information.
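A small sketch of the entropy computation and the resulting choice between the two distributions (Python; the dictionary-based distributions and function names are assumptions for illustration, not our implementation):

import math

def entropy(dist):
    # Entropy of an affix's tag distribution, given as {tag: count}.
    total = sum(dist.values())
    return -sum((n / total) * math.log2(n / total) for n in dist.values())

def entropy_method(prefix_dist, suffix_dist):
    # Use whichever affix distribution has the smaller entropy, i.e.,
    # the one offering the most predictive information.
    if prefix_dist is None:
        return suffix_dist
    if suffix_dist is None:
        return prefix_dist
    return prefix_dist if entropy(prefix_dist) < entropy(suffix_dist) else suffix_dist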

The Suffix-Only Method: The initial parsing experiments in section 2.3 chose to use suffix information before prefix information. Suffix information was more predictive than prefix information during the early experiments in parsing. To determine if results from a statistical approach will follow this pattern, a predictor that only uses suffix information is also tested.

These three methods are in some ways analogous to the three possible methods of combining prefix and suffix distribution information for frequency lists in section 2.5. The suffix-only methods for parsing and tagging are exactly the same. The other two methods differ, since the basic and entropy methods choose one distribution to use, while the averaging and maximum methods from section 2.5 combine information from both distributions. The averaging method used in section 2.5 is appropriate for frequency list parsing, but averaging two probability distributions would remove much information; besides, the averaging method did not perform that well in the parsing experiments. The maximum method worked well in the parsing experiments, but performing a maximization on two probability distributions would require rescaling, which would distort the distribution too much.

After obtaining a distribution of tags for an unknown word, that distribution is smoothed using the overall distribution of tags from the training data. In other words, define:

$p(x)$ = the distribution of tags returned by the predictor for a given unknown word
$q(x)$ = the distribution of tags for all words in the corpus
$p'(x)$ = the actual distribution used to predict the part of speech of the unknown word
$$p'(x) = \lambda p(x) + (1 - \lambda) q(x)$$

The value used for $\lambda$ is 0.9. This value was chosen to limit the effect of the overall distribution.

In addition to choosing how to use information from the prefix and suffix data, we can also combine information from other sources to improve the accuracy rate. This is done not only to attack the problem with as many weapons as possible, but also to emulate the performance of a part-of-speech tagger in some way, which allows accuracies to be more comparable.

One of the outside information sources that can be used is the concept of open-class words. As described above, the distributions produced by the predictor are smoothed with the overall distribution of tags. Instead of using the overall distribution for $q(x)$ in the smoothing equation, we hypothesize that smoothing using only the open-class tag distribution (instead of the distribution of all words' tags) as $q(x)$ will offer better results. We call this approach open-class smoothing.

Another source of information available to tag predictors is contextual information; in other words, the words and tags directly before the word can offer helpful information about the correct tag for an unknown word. Each probability $P(t_i \mid t_{i-1})$, that is, the probability of a tag $t_i$ following a tag $t_{i-1}$, is determined from the training data and smoothed with the unknown word's distribution. Part-of-speech tagging makes extensive use of contextual information, so that information should also be helpful in our system.
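The interpolation itself is a one-liner; a sketch (Python, names illustrative):

def smooth(p, q, lam=0.9):
    # p'(x) = lambda * p(x) + (1 - lambda) * q(x), where p is the
    # predictor's distribution and q the overall tag distribution.
    tags = set(p) | set(q)
    return {t: lam * p.get(t, 0.0) + (1 - lam) * q.get(t, 0.0) for t in tags}

Open-class smoothing simply passes the open-class tag distribution as q instead of the overall one.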

Table 3.1 Results using Various Methods

Extra Information               Method    1-best   2-best   3-best
No open-class and no context    Basic     57.6%    73.2%    79.5%
                                Entropy   62.2%    77.6%    83.4%
                                Endings   67.1%    83.5%    91.4%
Open-class and no context       Basic     57.6%    73.6%    83.2%
                                Entropy   62.2%    78.1%    86.9%
                                Endings   67.1%    83.6%    92.2%
No open-class and context       Basic     61.5%    75.0%    81.7%
                                Entropy   65.7%    78.9%    85.1%
                                Endings   70.9%    86.5%    92.6%
Open-class and context          Basic     61.3%    78.2%    87.0%
                                Entropy   65.4%    81.8%    89.6%
                                Endings   70.9%    87.6%    93.8%

3.2.3 The experiment

This experiment was performed using the Brown corpus [49]. A 10-fold cross-validation technique was used. The sentences from the corpus were split into ten files; each of the ten files is used as a test set, with the other nine files used as training data to determine the probability distributions for each affix. All unknown words in the test set (in other words, those that did not occur in the training set) were assigned a tag distribution by the unknown word predictor. Then, that distribution is ordered from most probable tag to least probable, and the correct tag is checked to see if it falls in the n-best tags for various values of n. The results for all ten test files were combined to determine the overall performance for the experiment. The results from the experiment are shown in Table 3.1 and Figure 3.1.

Fig. 3.1. Results of Various Methods

Some trends can be seen in this data. For example, choosing between the prefix distribution and the suffix distribution using entropy calculations clearly improves the performance over using the basic method (about 4-5% overall for 1-best values, which is an over 10% decrease in the number of errors), and using only suffix distributions improves it another 4-5% over the entropy method (about a 15% decrease in the number of errors). These improvements are fairly consistent regardless of whether open-class or contextual smoothing is used. The suffix-only method is clearly the most accurate.

The use of contextual information improves the likelihood that the correct tag is in the n-best predicted for small values of n (an increase of nearly 4% for 1-best), but it offers less improvement for larger values of n. Thus, context is helpful in selecting among a small number of tags ranked highly by the predictor, but less helpful at moving lower-ranked tags into the top n. On the other hand, smoothing the distributions with open-class tag distributions offers no improvement for the 1-best results, but improves the n-best performance for larger values of n. This seems to indicate that open-class smoothing will move tags up at lower levels of probability but will not disambiguate between top tags. This makes sense given that open-class tags are more likely to occur in the top levels of the predictor; thus, the closed-class tags are filtered down in the distribution. This behavior may also be due to the choice of $\lambda = 0.9$ made for the smoothing equation. Smaller values of $\lambda$ may let the open-class smoothing have more of an effect. However, smaller values would also limit the predictive power of the suffix, so the value of $\lambda$ was kept relatively high.

Overall, the system offering the best accuracy uses both context and open-class smoothing, and relies only on the suffix information. The system predicts the correct tag as its most likely (1-best) tag just over 70% of the time, and the correct tag is in the top three returned by the predictor almost 94% of the time. While it is difficult to compare these results to those of the parser in Chapter 2, the statistical recognizer seems to offer very good results. Compared to the computer-generated recognizer from our work [45], the fully automatic recognizer has a 1-best accuracy of 70.9%, in contrast to the computer-generated recognizer's exact accuracy of 49.3%. This is not a very fair comparison, however, since the computer-generated recognizer creates a list of parts of speech that must match exactly to a pregenerated list; still, the results from this automatic recognizer are certainly promising.
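The n-best scoring used above amounts to a few lines; a sketch (Python, assuming each prediction is a {tag: probability} dictionary; the names are illustrative):

def n_best_accuracy(predictions, gold_tags, n):
    # Fraction of unknown words whose correct tag is among the n most
    # probable tags in the predictor's distribution.
    hits = 0
    for dist, gold in zip(predictions, gold_tags):
        ranked = sorted(dist, key=dist.get, reverse=True)
        if gold in ranked[:n]:
            hits += 1
    return hits / len(gold_tags)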


3.2.4 Conclusions

This section has considered the construction of an automatically constructed statistical unknown word recognizer, trained from an annotated training corpus. Different methods of combining prefix and suffix information were considered, as well as combining the morphological information with other knowledge sources. Using a suffix-only method, combined with contextual and open-class information, offers the best accuracy of the systems tested here.

This recognizer seems to offer an improvement over those considered in Chapter 2. The performance of the suffix-first method was quite similar to the maximum method in our parsing experiments, at least until 40% of the dictionary was missing. In part-of-speech tagging, there are generally few unknown words, since the tagger is trained on a large corpus. This, along with the fact that tagging accuracy is not judged based on precision and recall of a set of parses, suggests that the suffix-first method should be the most appropriate for tagging of the three methods tested in our parsing experiments.

Because of the promising performance of the statistical unknown word recognizer here, we next investigate its performance in a part-of-speech tagging system. The rest of this chapter will detail our experiments in part-of-speech tagging.

3.3 Part-of-Speech Tagging

Part-of-speech tagging is the act of assigning each word in a sentence its proper lexical information. This lexical information is encoded in a symbol called a tag. A tag usually indicates at least the word's part of speech, but it can also contain other feature information, such as number (singular or plural) or verb tense. The Penn Treebank documentation [1] describes a commonly used set of tags. A tagger can be used to create tagged corpora for NLP applications, as well as serve as a preprocessor for more sophisticated NLP systems.


3.3.1 Past work in tagging

Part-of-speech tagging is an important research topic in natural language processing (NLP). Taggers are often preprocessors in NLP systems, making accurate performance especially important. Much research has been done to improve tagging accuracy. Most taggers implement one of the following approaches:

Hidden Markov Models (HMMs): HMMs are used frequently in speech recognition, but are also used for a variety of tasks in natural language processing. HMMs are statistical models, and as such are useful for part-of-speech tagging when large training corpora are available. Rabiner [50] offers an introduction to hidden Markov models, including their use in speech recognition systems. See Appendix A for a brief tutorial on HMMs.

HMMs have been used successfully for part-of-speech tagging. Kupiec [51] describes a part-of-speech tagger implemented with an HMM, while Charniak, et al. [52] discuss the equations for HMM taggers, and how they can be modified and extended (for example, to handle unknown words). Merialdo [53] uses a hidden Markov model based system while testing the effectiveness of annotated training data versus unannotated data.² DeRose [54] uses a statistical model similar to the HMM in an early (pre-HMM) attempt at statistical tagging.

Rule-based Systems: Rule-based systems are another common method for part-of-speech tagging. These systems use a series of transformational rules that are applied to word/tag pairs, changing the word's tag based on the word's context and its characters. Rule-based systems include those developed by Brill [55, 56], Klein and Simmons [57], Brodda [58], and Paulussen and Martin [59].

² He shows that annotated training data is much more useful, and in fact the Baum-Welch (or EM) algorithm does not improve results of HMM-based taggers.


Memory-based Systems: The MBT tagger [60] uses a form of case-based learning to tag words based on matching the word and its context to similar cases held in memory. Cardie [26] demonstrates a more general approach for using case-based learning in NLP, learning domain-specific information about open-class words to help classify unknown words. The MBMA system [37] uses a memory-based approach for morphological recognition of Dutch.

Memory-based (and case-based) systems are basically a form of k-nearest-neighbor system: the training data are kept in memory, and each test sample uses a distance metric to determine which training samples are closest. The test sample is then assigned the same class as those training samples (usually using a weighted voting scheme).

Decision-Tree Systems: Decision trees have also been used to implement part-of-speech taggers, along with other NLP tasks. Schmid [61] defines a probabilistic decision tree used for tagging, while Cardie [62] uses decision trees to improve her case-based learning of the definitions of unknown words. Magerman [17] uses a statistical decision tree to perform parsing. A decision tree is a tree in which each internal node is a feature test and the leaves are classes to be assigned to the tested individual. The trees are constructed using statistical information (see Quinlan [63] for an example of decision tree creation).

Maximum-Entropy Systems: The maximum-entropy tagger [46] learns a set of features from the training data, then chooses the tag that maximizes the entropy function for the data. The maximum-entropy tagger has been a consistently good performer in part-of-speech tagging, and the approach has been used to solve several other NLP problems, such as prepositional phrase attachment [64], sentence parsing [65], and adaptive language modeling [66].

Path Voting Constraint Systems: This tagging system, described by Tür and Oflazer [67], uses constraint rules to develop several possible tag sequences, which are then voted on by the set of rules. This system uses a combination of hand-crafted and automatically-generated rules.

Neural Network Systems: The SNOW tagger [68] uses a network of linear separator units to perform part-of-speech tagging. An interesting feature of this system is that it allows online updating, which can change the classifier as it tags a set of sentences. Additional neural network systems for part-of-speech tagging were developed by Benello, Mackie, and Anderson [69] and by Nakamura and Shikano [70].

Multiple Tagger Majority Voting System: This system, developed by van Halteren, Zavrel, and Daelemans [71], is made up of several types of taggers, with each tagger independently tagging a sentence and then voting on individual tags. The tag with the largest vote is assigned to that word. This system provides very good accuracy, but requires several different accurate taggers to work effectively.

Very recently, researchers have begun applying current machine learning techniques to the part-of-speech tagging problem with promising results [72, 73]. Most of the techniques applied to tagging have offered similar accuracies, although standard HMMs are somewhat less accurate than the others. Because of this, taggers implemented with HMMs are now typically implemented as trigram models, where the context (transition probabilities) depends on the two previous tags, not just the immediately previous one. This concept is extended even further by our second-order HMM (described in section 3.5), resulting in improved accuracy.

3.3.2 Tagging unknown words

An important problem for tagging systems is coping with the presence of unknown words in a sentence. Some of the effects of unknown words on part-of-speech tagging have been studied in the literature. Charniak, et al. [52] state in their work that morphology can be of use in determining parts of speech for unknown words encountered by the tagger. They indicate that morphology is only a minor part of their tagging mechanism, and that their tagger works best on words whose roots have already been encountered in the test corpus. The use of morphological information was not fully developed in their system. Relaxing the requirement that roots must be present in the lexicon is a focus of our work.

Weischedel, et al. [19] study the problem of encountering unknown words using probabilistic methods. Their results show that by using a part-of-speech tagger with morphological and other information, the error rate of their tagger on unknown words decreases by a factor of five. However, for their experiments they assume that there is only one unknown word present per sentence.

Kupiec [51] and Brill [55] make use of morphology to handle unknown words during part-of-speech tagging. By training on a tagged corpus, Brill's tagger learns transformational rules for words based on their affixes (an affix is either a prefix or a suffix). When an unknown word is encountered, it is assumed to be a proper noun if capitalized, a common noun if not. The tagger then applies the transformational rules to the unknown word to tag it with the appropriate part-of-speech information. Kupiec's hidden Markov model uses a set of suffixes to assign probabilities and state transformations to unknown words. Both these methods work well with part-of-speech tagging, but they ignore some of the syntactic content of the sentence. Extending the contextual information available for part-of-speech tagging is an issue addressed in this work.

Mikheev [34, 35] has also done work on tagging unknown words. His system is statistically based and uses prefixes, suffixes, and word endings to identify appropriate tags for unknown words. He differentiates between suffixes and word endings by the fact that suffixes, when removed from a word, leave a word that is in the lexicon; word endings are simply letters at the end of the word. His research reports good results in tagging unknown words in the Brown corpus. We will demonstrate better results using only statistical information on word endings to identify unknown words.


3.4 Implementation Issues

In section 3.5 we will introduce a new statistical model called a second-order hidden Markov model. Before this new model is presented, however, it is useful to consider the basic model, a (first-order) hidden Markov model (HMM).

3.4.1 The hidden Markov model

A hidden Markov model is a state machine. The model consists of a set of distinct states; the model can be in any one of these states at a given time. We assume that time is discrete, and the model changes between states (or transitions) at each unit of time. The transitions are dictated by a set of transition probabilities stored in the state transition matrix A. The probability that the model moves from state i to state j at any time is denoted as $a_{ij}$.

For a standard Markov process, the output of the process would be the sequence of states visited by the process. For a hidden Markov model, the exact states visited are hidden. Instead, as the model moves between states, it emits an output symbol after each transition. Exactly which output symbol is emitted depends on the output symbol distribution B, which defines the probability of emitting a particular symbol while in a particular state. The probability that the k-th output symbol is emitted from state j is denoted as $b_j(k)$. The output from the hidden Markov model (HMM) is a sequence of output symbols. A common use of the HMM is to determine the most likely sequence of states visited given the sequence of output symbols. The hidden Markov model is described in more detail in Appendix A. In the next section, we describe how the HMM can be used in part-of-speech tagging.

3.4.2 Using an HMM for part-of-speech tagging

Part-of-speech tagging is the act of assigning each word in a sentence a tag that describes how that word is used in the sentence. Typically, these tags indicate syntactic categories, such as noun or verb, and usually include additional feature information, such as number (singular or plural) and verb tense. This section recaps the important components of an HMM, and how those components are used for part-of-speech tagging. We also discuss how these elements of the HMM can be trained from an annotated corpus.

An HMM can be used to perform part-of-speech tagging by modeling the tags as the states of the model and the words of the sentence as the output symbols of the model. Since we are given the sentence, we can use the Viterbi algorithm (see Figure 3.2) to determine the most likely sequence of tags that would have "generated" that sentence. There are a number of definitions that are fundamental to the description of the HMM and how it is used for tagging.

1. $\mathcal{T}$ is the set of possible tags that can be assigned to words. Each $t_i \in \mathcal{T}$ is a tag that can be assigned to a word while part-of-speech tagging. Each $t_i$ is a state for the model; thus, $T = |\mathcal{T}|$ is the number of states in the model. The set of states is easily determined by the set of all distinct tags that appear in the training data.

2. $\mathcal{W}$ is the set of words that can be emitted by the model. Each $w_k \in \mathcal{W}$ is a word that is found in the training data. These words are the output symbols of the model, and $W = |\mathcal{W}|$ is the number of possible output symbols in the model. Again, the set of words $\mathcal{W}$ can easily be determined from the training corpus.

3. $A = \{a_{ij}\}$ is the state transition probability distribution. The probability $a_{ij}$ is the probability that the process will move from state $t_i$ to state $t_j$ in one transition. For part-of-speech tagging, the states represent the tags, so $a_{ij}$ is the probability that tag $t_j$ follows $t_i$. This probability can be estimated using data from the training corpus, using the equation:

$$a_{ij} = \frac{c(\text{tag } t_j \text{ follows tag } t_i)}{c(t_i)}$$

where $c(x)$ is a count of the number of times the event $x$ occurs.

4. $B = \{b_j(k)\}$ is the observation symbol probability distribution. The probability $b_j(k)$ is the probability that the k-th output symbol will be emitted when the model is in state j. For part-of-speech tagging, this is the probability that the word $w_k$ will be emitted as an output symbol when the system enters tag $t_j$. This probability can be estimated using data from a training corpus with the equation:

$$b_j(k) = \frac{c(\text{word } w_k \text{ is tagged with tag } t_j)}{c(t_j)}$$

5. $\Pi = \{\pi_i\}$ is the initial state distribution. $\pi_i$ is the probability that the model will start in state i. For part-of-speech tagging, this is the probability that the sentence will begin with a word tagged with tag $t_i$. This can be simply estimated by:

$$\pi_i = \frac{\text{number of sentences beginning with } t_i}{\text{total sentences}}$$

6. $N$ is the number of words in the sentence to be tagged.

7. $O = o_1 o_2 \ldots o_N$ is the output sequence of words that occurs in the sentence to be tagged.

8. $U = \tau_1 \tau_2 \ldots \tau_N$ is the sequence of tags that are assigned to the sentence during tagging.

9. $\delta_n(j)$ is the maximum probability of a sequence of the first n states, assuming that the n-th state is $t_j$, given the first n output symbols. In mathematical terms,

$$\delta_n(j) = \max_{\tau_1 \ldots \tau_{n-1}} P(\tau_1, \ldots, \tau_{n-1}, \tau_n = t_j \mid o_1 \ldots o_n)$$

$\delta_n(j)$ can also be recursively defined as $\max_{1 \le i \le T} [\delta_{n-1}(i) a_{ij}] b_j(o_n)$. $\delta_n(j)$ is needed in the Viterbi algorithm.

10. $\psi_n(j)$ is the tag $\tau_{n-1}$ that maximizes $\delta_n(j)$. $\psi_n(j)$ is also used in the Viterbi algorithm to allow the maximum probabilistic sequence of tags to be retrieved.

11. The Viterbi algorithm maximizes the probability of the entire sequence of states, instead of maximizing each individual state in sequence. More formally, given a sequence $O$ of output symbols, the Viterbi algorithm finds a sequence of states $U$ that maximizes $P(O \mid U)$. To calculate the tag sequence $U^*$ that maximizes this probability to a value $P^*$ with the tag sequence $\tau_1^* \ldots \tau_N^*$ using the Viterbi algorithm, we perform the steps outlined in Figure 3.2.
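Before turning to the algorithm itself, here is a sketch of how the distributions $A$, $B$, and $\Pi$ above can be estimated from an annotated corpus (Python; the corpus is assumed to be a list of sentences, each a list of (word, tag) pairs, an illustrative format rather than our implementation):

from collections import Counter

def estimate_hmm(tagged_sentences):
    # Relative-frequency estimates of the three HMM distributions:
    #   a[(ti, tj)] = c(tag tj follows tag ti) / c(ti)
    #   b[(tj, wk)] = c(word wk is tagged with tj) / c(tj)
    #   pi[ti]      = (sentences beginning with ti) / (total sentences)
    tag_count, pair_count = Counter(), Counter()
    emit_count, init_count = Counter(), Counter()
    for sentence in tagged_sentences:
        init_count[sentence[0][1]] += 1
        for n, (word, tag) in enumerate(sentence):
            tag_count[tag] += 1
            emit_count[(tag, word)] += 1
            if n > 0:
                pair_count[(sentence[n - 1][1], tag)] += 1
    a = {(ti, tj): c / tag_count[ti] for (ti, tj), c in pair_count.items()}
    b = {(tj, wk): c / tag_count[tj] for (tj, wk), c in emit_count.items()}
    pi = {ti: c / len(tagged_sentences) for ti, c in init_count.items()}
    return a, b, pi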



3.5 The Second-Order Model for Part-of-Speech Tagging

The hidden Markov model described in section 3.4 is an example of a first-order hidden Markov model. It is called a first-order model because the approximations are made such that each probability value is dependent only on one other value (i.e., $a_{ij} = P(t_j \mid t_i)$ or $b_j(k) = P(w_k \mid t_j)$). In part-of-speech tagging, the first-order HMM is generally called a bigram tagger, since it relies on bigram probabilities for its contextual and lexical information. This model works reasonably well for part-of-speech tagging, but captures a more limited amount of the contextual information than is available. Most statistical taggers currently use a trigram model, which replaces the contextual bigram transition probability $a_{ij} = P(\tau_n = t_j \mid \tau_{n-1} = t_i)$ with a trigram probability $a_{ijk} = P(\tau_n = t_k \mid \tau_{n-1} = t_j, \tau_{n-2} = t_i)$. In other words, the probability of entering a new state depends on the two previous states. This section describes a new type of tagger that uses trigrams not only for the contextual probabilities but also for the lexical probabilities. We refer to this new model as a full second-order hidden Markov model. This section describes the implementation of this model, some issues regarding its application to part-of-speech tagging, and results of experiments using this tagger.

- 79 1. Initialize variables The values for each of the variables must be initialized. The value for each 1 (i) is just the probability of ti occurring at the beginning of the process (i ) times the probability of ti emitting the rst output symbol (bi (o1)). The initial value for 1(i) is 0, since there is no maximization done. 1 (i) 1 (i) 

= =

1iT 1iT

i i (o1)

 b

0

2. Calculate values recursively The values for  and  are calculated recursively for each successive output symbol in the sequence. n (j )

= 1max  (i)aij ]bj (on ) i T n 1

2  n  N 1  i j  T

n (j )

=

2  n  N 1  i j  T



;

 



arg 1max  (i)aij ] i T n 1 ;

 

3. Calculate maximum probability Once the values for the last output symbol in the sequence have been calculated, the maximum probability P and the nal state value of the state sequence U = 1 : : : N that gives it can be determined: 







= 1max  (i) i T N N = arg 1max  (i) i T N

P



 



 

4. Determine best tag sequence The  values are \unrolled" to determine the proper state sequence: n





=



n+1 (n+1 ) 

n

= N ; 1 N ; 2 : : :  2 1

Fig. 3.2. The First-Order Viterbi Algorithm
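A direct transcription of Figure 3.2 as a sketch (Python; the dictionary-based a, b, and pi are as in the estimation sketch above, with missing entries treated as probability 0):

def viterbi(words, tags, a, b, pi):
    # Step 1: initialize delta and psi for the first word.
    delta = [{t: pi.get(t, 0.0) * b.get((t, words[0]), 0.0) for t in tags}]
    psi = [{t: None for t in tags}]
    # Step 2: recursive calculation over the remaining words.
    for word in words[1:]:
        d, p = {}, {}
        for tj in tags:
            best = max(tags, key=lambda ti: delta[-1][ti] * a.get((ti, tj), 0.0))
            d[tj] = delta[-1][best] * a.get((best, tj), 0.0) * b.get((tj, word), 0.0)
            p[tj] = best
        delta.append(d)
        psi.append(p)
    # Steps 3 and 4: find the most probable final tag, then unroll psi.
    tag = max(tags, key=lambda t: delta[-1][t])
    sequence = [tag]
    for n in range(len(words) - 1, 0, -1):
        tag = psi[n][tag]
        sequence.append(tag)
    return list(reversed(sequence))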

However, the definitions of the state transition matrix A and the output probability distribution B must be modified in light of the new second-order approximations. These more precise approximations enable the full second-order HMM to use more contextual information to model part-of-speech tagging.

Contextual Probabilities: The A matrix defines the contextual transition probabilities for the part-of-speech tagger. Instead of limiting the context to a first-order approximation, the A matrix for the second-order HMM is defined as:

$$A = \{a_{ijk}\} \text{ where } a_{ijk} = P(\tau_n = t_k \mid \tau_{n-1} = t_j, \tau_{n-2} = t_i)$$

Lexical Probabilities: The output probability distribution B denes the lexical probabilities for the part-of-speech tagger. These are the probabilities that a given state will emit a given output symbol. In a fashion similar to the secondorder extension to the A matrix, the approximation for the lexical probabilities can also be modied to include second-order information as follows:

B = fbij (k)g where bij (k) = P (on = wk jn = tj  n 1 = ti) ;

In these equations, the probability of the model emitting a given word depends not only on the current state but also on the previous state. To our knowledge, this approach has not been used before in tagging. In cases where n = 1 and n = 2 for each of the probability calculations, the special tag symbols SOS (start-of-sentence) and NONE (pre-start-of-sentence) are used to handle boundary conditions. For example, to tag the rst word of the sentence, the two preceding tags are assumed to be NONE and SOS, in that order.

- 81 -

3.5.2 The new Viterbi algorithm Modication of the lexical and contextual probabilities is only the rst step in dening a full second-order HMM. These probabilities must also be combined to select the most likely sequence of tags that generated the sentence. This requires modication of the Viterbi algorithm. First, the variables and must be redened from their previous denition in Section 3.4: n(i j ) = 1max 1nN :::n;2 P (1  : : : n 2  n 1 = ti  n = tj jo1  : : :on )

n(i j ) = arg 1max :::n;2 P (1  : : : n 2  n 1 = ti  n = tj jo1  : : :on ) 1  n  N ;

;

;

;

These variables take into account the added dependencies of the distributions of A and B and make use of the special tags SOS and NONE as needed. The most likely tag sequence can now be calculated using a modication of the Viterbi algorithm, as shown in Figure 3.3. This second-order version of the Viterbi algorithm has a running time of O(NT 3). This can be seen by the following breakdown of each step: 1. Step 1 is visited once by the algorithm and executes O(T 2) operations. 2. Step 2 is visited N ; 1 times by the algorithm, and there are T 2 values of to calculate. Each calculation require a maximization over T values, so the running time for each visit is O(T 3), making the total number of operations for this step O((N ; 1)T 3) = O(NT 3). 3. Step 3 is visited once, and can be completed in O(T 2) operations, since P can be calculated in T 2 operations, and N and N 1 can be determined during the calculation of P . 





;



4. Step 4 is visited once, and requires $O(N-2) = O(N)$ operations.

Thus, the total time to run the algorithm is $O(T^2) + O(NT^3) + O(T^2) + O(N) = O(NT^3)$. By increasing the use of context over the standard HMM, the running time has increased by a (constant) factor of T. However, the algorithm is still linear in N, the number of words in the sentence, which is the standard measure of input size.


1. Initialize variables. The values are initialized for each of the variables.

$$\delta_1(i, j) = \begin{cases} \pi_j\, b_{ij}(o_1) & t_i = \text{SOS} \\ 0 & t_i \ne \text{SOS} \end{cases} \qquad 1 \le i, j \le T$$
$$\psi_1(i, j) = 0, \qquad 1 \le i, j \le T$$

2. Calculate values recursively. The values for $\delta$ and $\psi$ are calculated recursively for each successive word in the sentence. Note that the maximum is taken only over i.

$$\delta_n(j, k) = \max_{1 \le i \le T} [\delta_{n-1}(i, j)\, a_{ijk}]\, b_{jk}(o_n), \qquad 2 \le n \le N,\ 1 \le i, j, k \le T$$
$$\psi_n(j, k) = \arg\max_{1 \le i \le T} [\delta_{n-1}(i, j)\, a_{ijk}], \qquad 2 \le n \le N,\ 1 \le i, j, k \le T$$

3. Calculate maximum probability. Once the values are calculated for the last word in the sentence, the maximum probability and the two final tag values of the tag sequence $U^* = \tau_1^* \ldots \tau_N^*$ that gives it can be determined. Notice here that the max for $P^*$ is over both i and j; $\tau_N^*$ is the value of j that achieves this maximum, and $\tau_{N-1}^*$ is the value of i.

$$P^* = \max_{1 \le i, j \le T} \delta_N(i, j)$$

4. Determine best tag sequence. The $\psi$ values are "unrolled" to determine the proper tag sequence:

$$\tau_n^* = \psi_{n+2}(\tau_{n+1}^*, \tau_{n+2}^*), \qquad n = N-2, N-3, \ldots, 2, 1$$

Fig. 3.3. The Second-Order Viterbi Algorithm
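A sketch of Figure 3.3 in the same style as the first-order version (Python; a and b are the trigram dictionaries a[(ti, tj, tk)] and b[(ti, tj, wk)], and the NONE boundary tag is folded into pi for brevity, so this is an illustration rather than our exact implementation):

from itertools import product

def viterbi2(words, tags, a, b, pi, sos="SOS"):
    states = [sos] + list(tags)
    # Step 1: only pairs whose first element is SOS can start a sentence.
    delta = {(i, j): (pi.get(j, 0.0) * b.get((i, j, words[0]), 0.0)
                      if i == sos else 0.0)
             for i, j in product(states, tags)}
    psi = []
    # Step 2: recursion; the maximization is only over the oldest tag i.
    for word in words[1:]:
        d, p = {}, {}
        for j, k in product(states, tags):
            best_i = max(states,
                         key=lambda i: delta.get((i, j), 0.0) * a.get((i, j, k), 0.0))
            d[(j, k)] = (delta.get((best_i, j), 0.0) * a.get((best_i, j, k), 0.0)
                         * b.get((j, k, word), 0.0))
            p[(j, k)] = best_i
        psi.append(p)
        delta = d
    # Step 3: the two final tags come from the most probable pair.
    sequence = list(max(delta, key=delta.get))
    # Step 4: unroll psi to recover the earlier tags (plus the SOS pad).
    for p in reversed(psi):
        sequence.insert(0, p[(sequence[0], sequence[1])])
    return sequence[-len(words):]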


Also, it should be pointed out that traditional trigram taggers have an $O(NT^3)$ running time as well, assuming that they maximize the probability of the tag sequence over the entire sentence, as with the Viterbi algorithm. Adding the use of second-order statistics for lexical entries does not increase the asymptotic running time over standard trigram taggers.

3.5.3 Calculating probabilities for unknown words

One of the main purposes of part-of-speech tagging is the disambiguation between multiple tags for a word. However, some words may be encountered that are not found in the lexicon; these unknown words require special information. This section discusses the methods used for tagging unknown words with a second-order HMM. This topic is not often discussed on its own (exceptions include Mikheev's work [34, 35] as well as Weischedel, et al. [19]), but it is often included in general work on part-of-speech tagging, such as Daelemans, et al. [60], Ratnaparkhi [46], and Brill [56].

To create a second-order HMM that can accurately tag unknown words, it is necessary to determine an estimate of the probability $b_{ij}(k) = P(o_n = w_k \mid \tau_n = t_j, \tau_{n-1} = t_i)$ for use in the tagger. If a word does not occur in the training data, the emit probability for that word is 0.0 in the B matrix (i.e., $b_{ij}(k) = 0.0$ for all i and j if $w_k$ is unknown).³ This behavior requires adding another probability distribution to the HMM to estimate the probability that the current tag will emit the given unknown word. Most methods devised to estimate these probabilities rely on affixation information (e.g., Weischedel, et al. [19] and a component of Mikheev's system [35]). The system in this work relies solely on "suffix" information (a more appropriate term is word ending information, since the "suffixes" do not necessarily have any morphological significance, but that term is unwieldy, so the word suffix will be used). The decision to use only suffixes comes from the experiments on unknown word recognition performed and described in section 3.2, as well as in our papers [36, 45]. This type of system enables the tagger to be automated, learning most of what it needs from the annotated training corpus. This allows the tagger to be tagset-independent as well as more language-independent,⁴ since hand-crafted rules are not used.

³ Strictly speaking, $b_{ij}(k)$ does not even exist for an unknown word, since $w_k$ is defined to be a member of $\mathcal{W}$, and $b_{ij}(k)$ is written in terms of $w_k$.
⁴ The effectiveness of the system would depend on the morphological richness as well as the morphological rules of the language.

The methodology

To allow the second-order HMM to tag unknown words, a new probability distribution relating to unknown words must be created. The new distribution will be referred to as the suffix probability distribution C, and will be similar to the B distribution for lexical probabilities. However, instead of basing the probabilities on tags and words, the C probabilities are based on tags and suffixes. Therefore, C is defined as follows:

$$C = \{c_{ij}(k)\} \text{ where } c_{ij}(k) = P(o_n \text{ contains } s_k \mid \tau_n = t_j, \tau_{n-1} = t_i)$$

Here, $\mathcal{S}$ is a set containing all suffixes encountered during the training phase, and $s_k \in \mathcal{S}$ is some suffix found in the word $o_n$. Every word in the training corpus is considered while training the predictor. The key points in training the unknown word predictor are listed here:

- Suffixes of length one to four characters are considered while training. For any given word, a suffix from that word can be no longer than the length of the word minus two characters to be used to train the predictor. It is felt that this number of characters offers significant predictive information without making the probability matrix unwieldy. Experiments performed with length-five suffixes offered inconsequential increases in accuracy while requiring more system resources, most notably training time and memory.

- Words that are four characters or less in length and words that are not tagged with open-class tags are ignored while training the unknown word predictor. Short words are not likely to contain much significant morphological information, and unknown words can be assumed to be open-class, so closed-class words can be safely ignored while training.

- Words that are capitalized, hyphenated, or contain a numeric digit are assumed to have dramatically different distributions than other unknown words. This distinction is used by Mikheev [34], although comparisons are not documented. These types of words are used to train separate probability distributions, along with a distribution for general unknown words. Each combination of features is also allowed, so, for example, capitalized, hyphenated, numeric words have their own distribution. Thus, in practice, there are eight C matrices: those for words containing only one of the three features (three matrices), those for each combination of two of the three features (three matrices), one for words with all three features, and finally one for all other words with none of the features. Capitalized unknown words that begin a sentence are assumed to be capitalized due to their position, so if they are not in the lexicon in their capitalized form, they are looked up in non-capitalized form, then treated as non-capitalized unknown words if they are still not in the lexicon. This approach was chosen to limit the number of sentence-initial unknown words misidentified as proper nouns due to capitalization, which could affect the entire tag sequence of the sentence.

To initialize the predictor, a count for each tag/tag/suffix triple is set to zero. Using the procedure above, every word that is contained in the training corpus is processed and its tag/tag/suffix triples are incremented in the system. For example, the word government tagged as a common noun (NN) and preceded by a determiner (DT) would increase the counts for the tag/tag/suffix triples (DT NN -t), (DT NN -nt), (DT NN -ent), and (DT NN -ment). The word star-crossed tagged as an adjective (JJ) and preceded by a verb (VB) would increase the tag/tag/suffix triples (VB JJ -d), (VB JJ -ed), (VB JJ -sed), and (VB JJ -ssed) for hyphenated words. Once counts are calculated for all tag/tag/suffix triples by processing the entire training corpus, probabilities are created for each triple by dividing the number of times that triple occurred by the number of times the corresponding tag pair appeared in the corpus. The issue of data sparseness with respect to these estimates (and the estimates for A and B) is addressed in Section 3.5.4.

Now that a C matrix is available, it can be used in the second-order HMM tagger. Whenever a word that has not been encountered in training is seen (i.e., the word is not available in the lexicon), the value $b_{ij}(k)$ is replaced with $c_{ij}(m)$, where $s_m$ is a suffix found in that word. The method of determining exactly which suffix is used for the word is somewhat of an open question. Initial experiments simply used the longest suffix that was available in the training data. Current results indicate that using a voting method over all suffixes offers better results. This will be discussed further in section 3.5.4.

Prefixes are not used in the predictor at all. As shown in section 3.2, using a combination of prefix and suffix information can result in poorer accuracies than simply using suffixes. It is possible that prefixes could be incorporated into the predictor if a reasonable method of using them were found, but for this work only suffix information is used. Integrating prefix information into the predictor is an area of future research for this system.
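A sketch of this training pass (Python, same assumed corpus format as before; the separate distributions for capitalized, hyphenated, and numeric words are omitted for brevity):

from collections import defaultdict

def train_suffix_predictor(tagged_sentences, open_class_tags):
    # Count tag/tag/suffix triples for suffixes of length one to four
    # (at most len(word) - 2), then divide by the tag-pair count.
    triple_count = defaultdict(int)
    pair_count = defaultdict(int)
    for sentence in tagged_sentences:
        prev = "SOS"
        for word, tag in sentence:
            pair_count[(prev, tag)] += 1
            if tag in open_class_tags and len(word) > 4:
                for k in range(1, min(4, len(word) - 2) + 1):
                    triple_count[(prev, tag, word[-k:])] += 1
            prev = tag
    return {(ti, tj, s): c / pair_count[(ti, tj)]
            for (ti, tj, s), c in triple_count.items()}

Running this over the government example above increments exactly the four triples (DT, NN, -t) through (DT, NN, -ment) described in the text.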

3.5.4 Smoothing the probabilities

While the second-order HMM is a more precise approximation of the underlying probabilities for the language, a problem can arise from sparseness of data, especially with the lexical estimations. For example, the size of the B matrix is $T^2 W$, which for the Wall Street Journal corpus is approximately 125 000 000 possible tag/tag/word combinations. To help avoid sparse data estimation problems, the probability estimates for each distribution must be smoothed.


Previous smoothing procedures

There are several methods of smoothing discussed in the literature. Some of these methods are discussed below.

Additive smoothing: This is the simplest method, and one which generally performs poorly (as argued by Gale and Church [76, 77]). Additive smoothing essentially assumes that every possible item occurs $\delta$ times more often than it actually did. Generally, $\delta = 1$ for this sort of calculation (thus assuming that each possible item occurs one more time than it actually appears in the training data), but it can take on any value.

Good-Turing smoothing: The Good-Turing smoothing method [48] states that an event that occurs r times should be treated as if it had occurred $r^* = (r+1)\frac{n_{r+1}}{n_r}$ times, where $n_r$ is the number of events that occur exactly r times in the training data. This has the effect of adjusting the number of perceived times an item occurs. This re-estimated value is used to calculate the smoothed probability $\hat{P}$. This method (along with additive smoothing) is of limited use for part-of-speech tagging, since it does not provide interpolation of unigram and bigram statistics into the trigram approximation.

Jelinek-Mercer smoothing: The Jelinek-Mercer smoothing method [78] performs linear interpolation of the item with lower-order models. The smoothed probability is a weighted sum of the non-smoothed estimated probability and the smoothed probability using the next lower-order probability. For example, a smoothed estimate is recursively defined below:

$$\hat{P}(x_n \mid x_{n-1}, \ldots, x_{n-k}) = \lambda \frac{c(x_{n-k}, \ldots, x_n)}{c(x_{n-k}, \ldots, x_{n-1})} + (1 - \lambda) \hat{P}(x_n \mid x_{n-1}, \ldots, x_{n-(k-1)})$$

where the base case is:

$$\hat{P}(x_n \mid x_{n-1}) = P(x_n \mid x_{n-1})$$

The weights attached to each term (i.e., the $\lambda$ values) are trained from training data, which requires additional held-out data.
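A recursive sketch of this interpolation (Python; the representation of counts as a map from item tuples to corpus counts, and of lambdas as weights indexed by history length, are assumptions for illustration):

def jelinek_mercer(x, history, counts, lambdas):
    # P^(x | history) = lambda * c(history + x) / c(history)
    #                 + (1 - lambda) * P^(x | history shortened by one),
    # with the plain bigram estimate as the base case.
    num = counts.get(history + (x,), 0)
    den = counts.get(history, 0)
    ml = num / den if den else 0.0
    if len(history) == 1:
        return ml
    lam = lambdas[len(history)]
    return lam * ml + (1 - lam) * jelinek_mercer(x, history[1:], counts, lambdas)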


Katz smoothing: Katz [79] describes a smoothing technique used mainly in speech recognition and other NLP tasks, which smoothes items based on how many times they appear. If an item occurs more than k times (Katz sets k = 5), the probability is unchanged; if the item occurs less than k times but at least once, the smoothed probability is calculated by multiplying the non-smoothed probability by a discount ratio; if the item never occurs, the smoothed probability is based on the smoothed probability of a lower-order term occurring. For example, if the trigram being smoothed had never occurred in the training data, the smoothed probability would be based on the probability of the underlying bigram.

These are all useful smoothing algorithms; however, none of them is completely appropriate for our needs. Since we are smoothing trigram probabilities, the additive and Good-Turing methods are of limited usefulness, since neither takes into account bigram or unigram probabilities. Katz smoothing seems a little too granular to be effective: the broad spectrum of possibilities is reduced to three options. We believe that smoothing in our application should be based on a function of the number of occurrences. Jelinek-Mercer allows this concept by using different $\lambda$ values, but this requires holding out training data for the $\lambda$'s and still forms buckets for various levels of occurrences.

There are three distributions that need to be smoothed in our second-order hidden Markov model. Each distribution has a different level of sparseness, as well as implementation-specific reasons for different types of smoothing. We have implemented a model that smoothes with lower-order information using coefficients calculated from the number of occurrences in our training data. The methods for smoothing the three distributions are explained here.

State Transition Probabilities: To estimate the state transition trigram probabilities, the counts for the number of times each trigram occurs could be used. However, what if that trigram never occurs? We would like to estimate the probability using bigram or unigram information when higher-order information is not available. Thus, we can estimate $P(\tau_n = t_k \mid \tau_{n-1} = t_j, \tau_{n-2} = t_i)$ with:

$$\hat{P}(\tau_n = t_k \mid \tau_{n-1} = t_j, \tau_{n-2} = t_i) = f(N_3)\frac{N_3}{C_2} + (1 - f(N_3))f(N_2)\frac{N_2}{C_1} + (1 - f(N_3))(1 - f(N_2))\frac{N_1}{C_0}$$

which depends on the following numbers:

$N_1$ = the number of times $t_k$ occurs
$N_2$ = the number of times the sequence $t_j t_k$ occurs
$N_3$ = the number of times the sequence $t_i t_j t_k$ occurs
$C_0$ = the total number of tags that appear
$C_1$ = the number of times $t_j$ occurs
$C_2$ = the number of times the sequence $t_i t_j$ occurs

and the function:

$$f(x) = \frac{\log(x+1) + 1}{\log(x+1) + 2}$$

(log here denotes the base-10 logarithm, consistent with the values of f quoted below). The function f is chosen so that the weighting for each element in the equation for $\hat{P}$ changes based on how often that element occurs in the training data. As a given element occurs more often, the weighting for that element increases. Interesting facts about f include $f(0) = 0.5$, $f(1) = 0.5654$, and $\lim_{x \to \infty} f(x) = 1.0$. Notice that the coefficients of the probabilities in the equation for $\hat{P}$ sum to one. This guarantees that the value returned for $\hat{P}$ is a valid probability. After this value is calculated for all tag triples, the values are normalized so that $\sum_{t_k \in \mathcal{T}} \hat{P} = 1$. This creates a valid probability distribution for the transition matrix.

The value of this smoothing technique becomes clear when the triple in question occurs very infrequently, if at all. Consider calculating $\hat{P}$ for the tag triple CD RB VB. Assume that the information for this triple is:

$N_1$ = 33277 (the number of times VB appears in the training corpus)
$N_2$ = 4335 (the number of times RB VB appears in the training corpus)
$N_3$ = 0 (the number of times CD RB VB appears in the training corpus)
$C_0$ = 1056892 (the total number of tags in the training corpus)
$C_1$ = 46994 (the number of times RB appears in the training corpus)
$C_2$ = 160 (the number of times CD RB appears in the training corpus)

Using these values, we calculate the probability $\hat{P}$:

$$\hat{P} = f(N_3)\frac{N_3}{C_2} + (1 - f(N_3))f(N_2)\frac{N_2}{C_1} + (1 - f(N_3))(1 - f(N_2))\frac{N_1}{C_0}$$
$$= 0.500 \cdot \frac{0}{160} + (1.0 - 0.500) \cdot 0.823 \cdot \frac{4335}{46994} + (1.0 - 0.500) \cdot (1.0 - 0.823) \cdot \frac{33277}{1056892}$$
$$= 0.500 \cdot 0.000 + 0.412 \cdot 0.092 + 0.088 \cdot 0.031 = 0.041$$

If this smoothing were not applied, the probability $P(\tau_n = \text{VB} \mid \tau_{n-1} = \text{RB}, \tau_{n-2} = \text{CD})$ would have been 0.000, which would create problems for tagger generalization. Smoothing allows tag triples that were not encountered in the training data to be assigned a probability of occurrence, while not drastically changing the probabilities for triples encountered a reasonable number of times.
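The calculation is easy to check; a sketch (Python; the base-10 logarithm matches the quoted values of f):

import math

def f(x):
    # f(x) = (log(x + 1) + 1) / (log(x + 1) + 2), base-10 logarithm
    return (math.log10(x + 1) + 1) / (math.log10(x + 1) + 2)

def smoothed_transition(N1, N2, N3, C0, C1, C2):
    # P^ = f(N3)*N3/C2 + (1-f(N3))*f(N2)*N2/C1 + (1-f(N3))*(1-f(N2))*N1/C0
    return (f(N3) * N3 / C2
            + (1 - f(N3)) * f(N2) * N2 / C1
            + (1 - f(N3)) * (1 - f(N2)) * N1 / C0)

# The CD RB VB example above:
print(round(smoothed_transition(33277, 4335, 0, 1056892, 46994, 160), 3))  # 0.041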

Lexical Probabilities: For the lexical probabilities, a somewhat different approach is used than the one used for the context probabilities. Initial experiments that used a formula similar to that used for the contextual estimates performed poorly. This poor performance was traced to the fact that smoothing allowed a chance for any word to be tagged with any tag, even if it never occurs with that tag in the training data, which caused overgeneralization in the model. This was due to the inclusion of the unigram probabilities in the smoothing equation. As an alternative, the smoothed probability $\hat{P}(o_n = w_k \mid \tau_n = t_j, \tau_{n-1} = t_i)$ was calculated for words as follows:

$$\hat{P} = f(N_3)\frac{N_3}{C_2} + (1 - f(N_3))\frac{N_2}{C_1}$$

where f is the same function defined in the previous section and

$N_2$ = the number of times the word $w_k$ occurs with tag $t_j$
$N_3$ = the number of times the word $w_k$ occurs with tag $t_j$ preceded by tag $t_i$
$C_1$ = the number of times $t_j$ occurs
$C_2$ = the number of times the sequence $t_i t_j$ occurs

Notice that this method assigns a probability of 0.0 to a tag/tag/word triple when the tag/word pair was not seen in the training data. This prevents the tagger from allowing every possible combination of word and tag, something which both increases running time (by forcing the tagger to consider more options) and decreases the accuracy (by allowing words too much probability to occur with improper tags). A better smoothing approach for lexical information could possibly be created by using some sort of word class idea, such as the genotype idea used by Tzoukermann and Radev [80], to improve our ability to estimate the lexical probabilities.
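A corresponding sketch of the lexical estimate, reusing f and ratio from the previous sketch (names again illustrative):

def smoothed_lexical(N2, N3, C1, C2):
    # P-hat(o_n = w_k | tau_n = t_j, tau_{n-1} = t_i).
    # No unigram term: a word never seen with tag t_j (N2 = N3 = 0)
    # keeps probability 0.0, which is exactly the behavior described above.
    return f(N3) * ratio(N3, C2) + (1 - f(N3)) * ratio(N2, C1)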

Suffix Probabilities: For the suffix probabilities, there are two possibilities to apply smoothing. The first involves the set of probabilities $c_{ij}(m)$, for which there is a situation analogous to the lexical probabilities discussed above. In fact, these $c_{ij}(m)$ are smoothed in exactly the same fashion as the lexical probabilities are smoothed, smoothing the trigram suffix probabilities with the underlying bigram information, giving a set of probabilities $\hat{c}_{ij}(s_m)$. However, there is also the issue of which suffix should be used to predict the part of speech for the unknown word if multiple suffixes match the word. There are many possible answers, some of which are considered in section 3.2: use the longest matching suffix, use an entropy measure to determine the "best" affix to use, or use an average. We decided to use a smoothing technique that is similar to that used for contextual smoothing, but based on using the information provided by the different length suffixes. This approach gives better results on unknown words than simply choosing the longest suffix.

Let $x_4$ be the length four suffix of the given word. Define $x_3$, $x_2$, and $x_1$ to be the length three, two, and one suffixes, respectively. If the length of the word is n, only choose suffixes of length $n - 2$ or below. Determine the longest suffix of these that matches a suffix in the training data, and calculate the new smoothed probability for that suffix:

$$\hat{P}_{ij}(x_k) = f(N_k)\hat{c}_{ij}(m) + (1 - f(N_k))\hat{P}_{ij}(x_{k-1})$$

where

$f(x) = \frac{\log(x+1)+1}{\log(x+1)+2}$,
$N_k$ = the number of times the suffix $x_k$ occurs in the training data, and
$\hat{c}_{ij}(m)$ = the estimate of $c_{ij}(m)$ from the previous lexical smoothing, where $s_m = x_k$.

For the recursion base case, $\hat{P}_{ij}(x_1)$ is set to $\hat{c}_{ij}(m)$, where $s_m = x_1$. After calculating $\hat{P}$ for the given word, it is normalized by calculating

$$\hat{P}_{ij}(s_m) = \frac{\hat{P}_{ij}(s_m)}{\sum_m \hat{P}_{ij}(s_m)}$$

Thus, longer suffixes are given the most weight, and each suffix receives more weight the more times that suffix appears. All suffixes of length one to four are used in estimating the probabilities, however. For example, consider the following sentence from the Wall Street Journal corpus:

After the race, Fortune 500 executives drooled like schoolboys over the cars and drivers.

In this sentence, the word schoolboys is unknown. For this word, the suffixes are $s_1$ = -s, $s_2$ = -ys, $s_3$ = -oys, and $s_4$ = -boys. The following information is available from our training data:

Suffix   N        f(N)     Tag pair   $\hat{c}_{ij}$
-s       88,419   0.8560   IN NNS     0.929
-ys      3,684    0.8204   IN NNS     0.009
-oys     33       0.7168   IN NNS     0.001
-boys    0        0.5000   IN NNS     0.000

So, the smoothed probabilities $\hat{P}_{ij}$ are calculated:

$\hat{P}_{\text{IN,NNS}}(\text{-s})$ = 0.929
$\hat{P}_{\text{IN,NNS}}(\text{-ys})$ = 0.8204 · 0.009 + 0.1796 · 0.929 = 0.174
$\hat{P}_{\text{IN,NNS}}(\text{-oys})$ = 0.7168 · 0.001 + 0.2832 · 0.174 = 0.050
$\hat{P}_{\text{IN,NNS}}(\text{-boys})$ = 0.5000 · 0.000 + 0.5000 · 0.050 = 0.025

These probabilities will now be normalized with all other suffixes associated with the IN and NNS pair. This smoothing allows information from shorter suffixes to be used more in the prediction if the longer suffixes occur only infrequently (compare the $\hat{P}$ values to the $\hat{c}_{ij}$ values in the table).
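The suffix recursion above can be sketched as a loop from the shortest suffix up to the longest usable one. The dictionaries c_hat and counts are assumptions about how the trained statistics might be stored (keyed by raw suffix strings); they are not the actual data structures of the tagger:

def smoothed_suffix(word, i, j, c_hat, counts):
    # P-hat_ij(x_k) for tag pair (i, j), built up from the length-1 suffix.
    # c_hat maps (i, j, suffix) -> smoothed lexical estimate c-hat_ij(s_m);
    # counts maps suffix -> number of occurrences in the training data.
    max_len = min(4, len(word) - 2)          # only suffixes of length n - 2 or below
    p = c_hat.get((i, j, word[-1:]), 0.0)    # base case: P-hat_ij(x_1) = c-hat_ij(x_1)
    for k in range(2, max_len + 1):
        s = word[-k:]
        Nk = counts.get(s, 0)
        p = f(Nk) * c_hat.get((i, j, s), 0.0) + (1 - f(Nk)) * p
    return p

Run on the schoolboys example with the IN NNS statistics above, the loop produces the successive values 0.929, 0.174, 0.050, and 0.025.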

3.5.5 Experiment and conclusions

The new second-order HMM tagging model is tested in several different ways. The basic experimental technique is a 10-fold cross validation. The corpus in question is randomly split into ten sections, with nine of the sections combined to train the tagger and the tenth used for testing. The results of the ten possible training/testing combinations are merged to give an overall accuracy measure. The tagger was tested on two corpora: the Brown corpus (from the Treebank II CD-ROM [1]) and the Wall Street Journal corpus (from the same source). Comparing results for taggers can be difficult, especially across different researchers. Care has been taken in these experiments that, when comparing two systems, the comparisons are from experiments that were as similar as possible and that differences are highlighted in the comparison.

First, we compare the results on each corpus of four different versions of an HMM tagger prepared by us: a standard (bigram) HMM tagger, an HMM using second-order lexical probabilities, an HMM using second-order contextual probabilities (a standard trigram tagger), and a full second-order HMM tagger. This was done to compare the results between the various approximations. The Brown corpus results for each tagger are shown in Table 3.2, while results on the Wall Street Journal corpus are given in Table 3.3. The column for error reduction refers to reduction in error rate from the standard bigram model.

Table 3.2: Comparison between Taggers on Brown Corpus

Tagger Type                    Known    Unknown   Overall   Error Reduction
Standard Bigram                95.94%   80.61%    95.60%    --
Second-Order Lexical only      96.23%   81.42%    95.90%    6.8%
Second-Order Contextual only   96.41%   82.69%    96.11%    11.6%
Full Second-Order HMM          96.62%   83.38%    96.32%    16.4%

Table 3.3: Comparison between Taggers on WSJ Corpus

Tagger Type                    Known    Unknown   Overall   Error Reduction
Standard Bigram                96.52%   82.40%    96.25%    --
Second-Order Lexical only      96.80%   83.63%    96.54%    7.7%
Second-Order Contextual only   96.90%   84.10%    96.65%    10.7%
Full Second-Order HMM          97.11%   85.22%    96.87%    16.0%

As might be expected, the full second-order HMM had the highest accuracy levels. The model using only second-order contextual information (a standard trigram model) was second best, the model using only second-order lexical information was third, and the standard bigram HMM had the lowest accuracies. The full second-order HMM reduced the number of errors on known words by around 16% over bigram taggers (raising the accuracy about 0.6%), and by around 6% over conventional trigram taggers (an accuracy increase of about 0.2%). Similar results were seen in the overall accuracies. Unknown word error rates were reduced by around 14% over bigrams.

The full second-order HMM tagger is also compared to other researchers' taggers in Table 3.4. It is important to note that both SNOW, a linear separator model [68], and the voting constraint tagger [67] used training data that contained full lexical information (i.e., no unknown words), as well as training and testing data that did not cover the entire WSJ corpus. This use of a full lexicon may have increased their accuracy beyond what it would have been if the model were tested with unknown words. The standard trigram tagger data is from Weischedel, et al. [19]. MBT [60] did not include numbers in the lexicon, which accounts for the inflated accuracy on unknown words. Table 3.4 compares the accuracies of the taggers on known words, unknown words, and overall accuracy (when the data is available). The table also contains two additional pieces of information. The first indicates if the corresponding tagger was tested using a closed lexicon (one in which all words appearing in the testing data are known to the tagger) or an open lexicon (not all words are known to the system). The second indicates whether a hold-out method (such as cross-validation) was used, and whether the tagger was tested on the entire WSJ corpus or a reduced corpus.

Table 3.4: Comparison between Full Second-Order HMM and Other Taggers

Tagger Type               Known    Unknown   Overall   Open/Closed Lexicon?   Testing Method
Standard Trigram [19]     96.7%    85.0%     96.3%     open                   full WSJ, fixed^a
MBT [60]                  96.7%    90.6%^b   96.4%     open                   WSJ, cross-validation
Rule-based [55]           --       82.2%     96.6%     open                   full WSJ, fixed^c
Maximum-Entropy [46]      --       85.6%     96.6%     open                   full WSJ, fixed^c
Full Second-Order HMM     97.1%    85.2%     96.9%     open                   full WSJ, cross-validation
SNOW [68]                 97.2%    --        --        closed                 subset of WSJ, fixed^d
Voting Constraints [67]   97.5%    --        --        closed                 subset of WSJ, cross-validation^e
Full Second-Order HMM     98.05%   --        --        closed                 full WSJ, cross-validation

^a The full WSJ is used, but only one unknown word per sentence was allowed. It is unknown whether a cross-validation was performed, although that would seem impossible based on the limitation on unknown words.
^b MBT did not place numbers in the lexicon, so all numbers were treated as unknown words.
^c Both the rule-based and maximum-entropy models use the full WSJ for training/testing with only a single test set.
^d SNOW used a fixed subset of WSJ for training and testing with no cross-validation.
^e The voting constraints tagger used a subset of WSJ for training and testing with cross-validation.

Two cross-validation tests with the full second-order HMM were run: the first with an open lexicon (created from the training data), and the second where the entire WSJ lexicon was used for each test set. These two tests allow more direct comparisons between our system and the others. As shown in the table, the full second-order HMM has improved overall accuracies on the WSJ corpus to state-of-the-art levels: 96.9% is the greatest accuracy reported on the full WSJ for an experiment using an open lexicon. Similarly, the known word accuracy of 97.1% is the highest for an open lexicon experiment covering the entire WSJ corpus. Finally, using a closed lexicon,

the full second-order HMM achieved an accuracy of 98.05%, the highest reported for the WSJ corpus for this type of experiment.

The accuracy of our system on unknown words is 85.2%. This accuracy was achieved by creating separate classifiers for capitalized, hyphenated, and numeric digit words: tests on the Wall Street Journal corpus with the full second-order HMM show that the accuracy rate on unknown words without separating these types of words is only 80.2%.^5 Two of the systems have reported higher unknown word accuracies than the second-order HMM. One of these, the MBT tagger, used a lexicon in which numbers had been removed. Our tagger was rerun on the WSJ corpus, with numbers removed from the lexicon. The results of the experiment showed an unknown word accuracy of 92.8%, well above MBT's 90.6%. The second system is the Maximum-Entropy tagger (MxPost), which is freely available on the web.^6 In the interests of a fair comparison, MxPost was run on precisely the same training and test data cross-validation sets as the second-order HMM. This resulted in an unknown word accuracy of 84.8%, below our 85.2% result.

To compare accuracies of unknown word predictors, we set up an experiment as similar as possible to that of Mikheev [34]. He tested his unknown word predictor and achieved very good results. To compare our systems, we ran an experiment using similar conditions to his: he tested on the Brown corpus, using a cross-validation method. Since his tagger uses a full general purpose lexicon, using word frequencies from the Brown corpus, we used a lexicon induced from the entire Brown corpus. From this full lexicon, we removed all hapax words (words that occur only once), and all capitalized words that occur less than twenty times. These words are treated as the unknown words, as Mikheev did in his experiments. Any errors between NN and NNP (common and proper nouns) are ignored, as is done by Mikheev. The results of this experiment are shown in Table 3.5. As can be seen, the second-order HMM achieves the highest accuracy. All the other data in the table are taken from Mikheev's paper [34].

Table 3.5: Comparison of Unknown Word Accuracies

System                    Accuracy
HMM (Xerox guesser)       82.15%
Brill (Brill guesser)     85.31%
HMM (Mikheev guesser)     87.62%
Brill (Mikheev guesser)   88.67%
Second-Order HMM          89.48%

^5 Mikheev [34] also separates suffix probabilities into different estimates, but does not provide any data illustrating the accuracy increase.
^6 Adwait Ratnaparkhi's home page is found at http://www.cis.upenn.edu/adwait, and the tagger itself is at ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz.

3.6 Smoothing Experiments

The experiments in section 3.5 have shown that the second-order HMM is a very accurate tagging system. An important element of the second-order HMM's accuracy is the fact that the second-order approximations allow more use of context in the probability distributions. One important issue when using second order (trigram) approximations is the possibility of sparse training data. Some of these approximations have millions of possible variations, many of which occur infrequently, if at all, in the training data. Because of this sparseness, some sort of probabilistic smoothing of the trigrams is necessary.

The methods of smoothing used in section 3.5.4 can all be written using the following recursive definition (with minor variations):

$$\hat{p}_i(x) = \begin{cases} f(n_i)\,p_i(x) + (1 - f(n_i))\,\hat{p}_{i-1}(x), & i > 1 \\ p_1(x), & i = 1 \end{cases}$$

where $p_i(x)$ is the i-gram estimate of occurrence x from the training data, $\hat{p}_i(x)$ is the smoothed i-gram probability, $n_i$ is the number of times the i-gram x occurs in the training data, and

$$f(n) = \frac{\log(n+1)+1}{\log(n+1)+2}$$

is the smoothing coefficient function. Variations of this method are used for all types of smoothing in the second-order HMM. For example, in lexical smoothing, the recursion base case is $p_2(x)$, since unigram information is omitted from the smoothing function. This method of smoothing was chosen for many reasons:

• It allows the inclusion of lower levels of statistical information (e.g., it uses bigrams and unigrams when smoothing trigrams). This allows all statistical information to be used, unlike additive [76] or Good-Turing [48] smoothing.

• It reinforces the idea that the more frequently a given n-gram occurs, the more weight that information should carry in a smoothing calculation. Since f(n) was chosen to be monotonically increasing in n, approaching 1 as n → ∞, the desired effect is achieved. Our approach provides a smoother transition than Katz [79] or Jelinek-Mercer [78] smoothing, both of which use a "bucketing" scheme to break down coefficients.

• It avoids the need to train coefficients, which saves time and training data. Since the coefficient f(n) is pre-defined, calculations are simple and require no holdout of training data, unlike Jelinek-Mercer smoothing.

The smoothing function was chosen for all the above reasons. The exact form of the function was chosen to satisfy these three criteria:

• As n → ∞, f(n) → 1.
• 0 < f(0) < 1.
• f(n) is monotonically increasing.

The coefficient should approach 1 as n grows large. We also want f(0) to be nonzero, since the fact that an item never occurs in the training data is itself offering negative information. f(n) should be monotonically increasing so that the more times an item appears, the more weight it is given. Furthermore, the exact form of f(n) was chosen to satisfy our intuition about how the function should grow and behave.
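A minimal sketch of this recursive scheme, with illustrative names (the lists are ordered from unigram upward; for lexical smoothing one would simply start the lists at the bigram):

import math

def f(n):
    return (math.log10(n + 1) + 1) / (math.log10(n + 1) + 2)

def smooth(estimates, counts):
    # p-hat_i(x) = f(n_i) p_i(x) + (1 - f(n_i)) p-hat_{i-1}(x); base case p_1(x).
    # estimates[i] holds the (i+1)-gram estimate p_{i+1}(x) and counts[i]
    # the (i+1)-gram occurrence count n_{i+1}.
    p_hat = estimates[0]
    for p, n in zip(estimates[1:], counts[1:]):
        p_hat = f(n) * p + (1 - f(n)) * p_hat
    return p_hat

Expanding the recursion for trigrams recovers the contextual formula of section 3.5.4, with coefficients f(n_3), (1 - f(n_3))f(n_2), and (1 - f(n_3))(1 - f(n_2)) that sum to one.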


3.6.1 Changing the coefficients

Our choice of f(n) was made intuitively, with a desire that specific performance requirements be met by the function. Specifically, the logarithm was chosen as a central element of the function so that f would grow slowly as n tended to infinity. However, other logarithm-based functions may offer superior performance. With that in mind, several tests were performed to determine the effect of the smoothing coefficients on tagging accuracy, implementing several possible combinations of smoothing coefficients. Three different smoothing coefficients were considered:

• Base-10: $f(n) = \frac{\log(n+1)+1}{\log(n+1)+2}$. This is the original smoothing coefficient described in section 3.5.

• Base-e: $f(n) = \frac{\ln(n+1)+1}{\ln(n+1)+2}$. This function uses the natural log (ln, or $\log_e$) instead of $\log_{10}$. This function will attach a larger weight to smaller values of n, as well as approach 1 faster as n → ∞.

• Type-3: This smoothing method is nearly identical to Base-e, except that the coefficient for the bigram only is calculated using $f(n) = \frac{\ln(n+1)+2}{\ln(n+2)+3}$. This has the effect of reducing the weight given the unigram in favor of the bigram. We anticipate that this smoothing method will be useful in situations where data sparseness is not as large an issue.

Figure 3.4 graphically depicts the differences between the three smoothing coefficients (the Type-3 entry is for the coefficient used for the bigram component). The Base-10 smoothing is very slow growing, requiring large values of n to achieve high coefficients (at $n = 10^8 - 1$, f(n) = 0.9). The other two methods grow faster, allowing items that occur less frequently to have higher weight. Type-3 shifts the curve for the bigram probabilities up a bit, attaching less weight to the unigram probabilities. If data is not too sparse, the unigram probabilities should not contribute much in the data.

[Fig. 3.4: Three Kinds of Smoothing Coefficients]

We expect that the Base-e and Type-3 methods should work well for the contextual smoothing, since context is the portion of the second-order HMM that suffers least from data sparseness. There are fewer than 125,000 possible combinations of trigrams for context, and over one million trigrams available in the training data. The lexical smoothing is the most sparse, so we anticipate less improvement from the Base-e methods, if any improvement at all.
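The three coefficient functions are easy to compare numerically; this small sketch (names ours, and the Type-3 formula as reconstructed above) tabulates them for a few counts:

import math

def base10(n): return (math.log10(n + 1) + 1) / (math.log10(n + 1) + 2)
def base_e(n): return (math.log(n + 1) + 1) / (math.log(n + 1) + 2)
def type3(n):  return (math.log(n + 1) + 2) / (math.log(n + 2) + 3)  # bigram coefficient only

for n in (0, 1, 10, 1000, 10**8 - 1):
    print(n, round(base10(n), 4), round(base_e(n), 4), round(type3(n), 4))
# base10(10**8 - 1) = 0.9, the value quoted above; base_e and type3 approach 1
# faster, and type3(0) > 0.5, which shifts weight from the unigram to the bigram.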

3.6.2 The experiment

We tested the new smoothing coefficients in a manner quite similar to the experiment in section 3.5. The only difference (other than the use of the different smoothing coefficients) is that words that occur only once in the training data are removed from the dictionary for that test set. This has the effect of reducing the dictionary size from over fifty thousand words to around thirty thousand, which speeds up the running of the algorithm and reduces the system load, allowing us to run experiments more quickly. There should be little difference in terms of measuring the differential effects of changing the smoothing coefficients.

The results of the test runs are shown in Table 3.6. This table shows some interesting results. First of all, the type-3 smoothing does in fact offer the best results when used for contextual smoothing. The base-e method works best for the suffix smoothing, and the original base-10 method works best for lexical smoothing. Type-3 was not used for lexical or suffix smoothing, for differing reasons. For lexical smoothing, the unigram numbers are not used anyway (see section 3.5.4), so there would be no difference between the base-e and type-3 methods. And for suffix smoothing, the items being smoothed are not true unigrams, bigrams, and so on, but suffix lengths, so it was felt that changing the coefficients for different length suffixes would not be appropriate.

The best overall accuracy, 96.84%, is achieved using type-3 for context, base-10 for lexicon, and base-e for suffix smoothing. The worst (using base-10, base-e, and base-10 for context, lexical, and suffix smoothing respectively) is 96.79%, a difference of 0.05%. This corresponds to a 1.5% decrease in the number of errors. A larger absolute difference is found in unknown word accuracy, with a spread from 84.81% to 85.25%. This is a 2.9% reduction in errors. So, the smoothing coefficients can make a difference in accuracy rates.

We also tested the new smoothing coefficients using the standard test procedure from section 3.5, without removing the singleton words from the dictionary. The results given in section 3.5 (97.11% accuracy on known words, 85.22% accuracy on unknown words, and 96.87% accuracy overall) are results from using type-3 contextual smoothing, base-10 lexical smoothing, and base-e suffix smoothing. Using the standard base-10 smoothing for each, we get accuracies of 96.86% overall, 97.10% for known words, and 84.84% for unknown words. The new smoothing coefficients made modest improvements in the accuracy, most notably in the unknown word category.

3.7 Integrating the Tagger and Parser

The final experiment of this chapter involves the integration of the two systems developed in this document: the parser and the tagger. This integration was motivated by examining some of the errors that occurred while tagging. These errors could be avoided if even a cursory consultation of the syntactic rules of English were made.

Table 3.6: Results using Various Smoothing Coefficients

Contextual   Lexical   Suffix      Known    Unknown   Overall
base-10      base-10   base-10     97.20%   84.96%    96.81%
base-10      base-10   base-e      97.20%   85.18%    96.81%
base-10      base-e    base-10     97.18%   84.89%    96.79%
base-10      base-e    base-e      97.18%   85.14%    96.80%
base-e       base-10   base-10     97.21%   84.92%    96.82%
base-e       base-10   base-e      97.21%   85.24%    96.83%
base-e       base-e    base-10     97.20%   84.81%    96.80%
base-e       base-e    base-e      97.20%   85.20%    96.82%
type-3       base-10   base-10     97.22%   84.90%    96.82%
type-3       base-10   base-e      97.22%   85.25%    96.84%
type-3       base-e    base-10     97.20%   84.83%    96.81%
type-3       base-e    base-e      97.20%   85.21%    96.83%

For example, several sentences are tagged with a sequence of tags that does not contain a tensed verb. While some of the WSJ "sentences" do not contain tensed verbs (mainly headlines and bylines), around 95% of them do. We consider two types of augmented tagging systems. The first is a tagging system that rejects tag sequences that do not contain tensed verbs (or infinitives used as commands). This is accomplished by modifying the tagger to examine the N best (the N most likely) tag sequences and reject those that do not contain tensed verbs, instead of simply returning the most likely sequence. We call this tagging system the Verb-Discrimination Tagger (VDT). The second tagging system integrates a parser into the tagging system. It works similarly to the VDT, but rejects tag sequences if they fail to parse. If the parser returns a parse tree, the tagged sentence is assumed to be correct and returned; otherwise, the next most likely tag sequence for that sentence is considered. We call this system a Parser-Augmented Tagger (PAT).

Since the parser being used is the one developed in Chapter 2, we test the tagging systems on the Timit corpus. Before we can evaluate a tagging system on the

Timit corpus, we have to determine the correct tagging of all the sentences in the corpus. The sentences were tagged with the second-order part-of-speech tagger (from Section 3.5, trained on the WSJ corpus), then two linguistically sophisticated individuals manually corrected any errors in tagging. This gives us a correctly tagged set of sentences to be used to calculate the error rate of the tagger.

3.7.1 Investigating N-best tagging

Before testing our tagging systems, it is necessary to modify the tagger to calculate the N-best sequences of tags. This is harder than it might seem, since the Viterbi algorithm works by taking the maximum probability at each point in the sentence and ignoring the rest of the information. We require the tagger to keep the N-best tag sequences, not just the single best. To modify the Viterbi algorithm to keep the N-best tag sequences, we must modify the $\delta$ variable. Instead of using $\delta_n(i, j)$, which is the maximum probability up to word n given that tag $\tau_n = t_j$ and $\tau_{n-1} = t_i$, we define a new variable $\delta_{n,m}(i, j)$, which is the m-th largest probability of the sequence up to word n given $\tau_n = t_j$ and $\tau_{n-1} = t_i$. This allows us to keep track of the m-best values at any given point. The validity of this approach is shown by Young, et al. [81]. To calculate the value of $\delta_{n,m}(i, j)$ given $\delta_{n-1,m}(i, j)$, we have to consider all the possible values of m in $\delta_{n-1,m}(i, j)$. In formal terms,

$$\delta_{n,1}(j, k) = \max_{1 \le i \le T,\; 1 \le m \le M} \left[ \delta_{n-1,m}(i, j)\, a_{ijk}\, b_{jk}(o_n) \right]$$

the value of $\delta_{n,2}(j, k)$ is the second largest value, and so on. Each of the m-best values at a given word must be considered when calculating the values for the succeeding word. Notice that this multiplies the running time by a constant factor M, which is the number of tag sequences we consider while tagging.
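One step of this N-best recurrence might be sketched as follows; delta_prev maps a tag pair (i, j) to its list of up to M best probabilities at the previous word, and a and b stand in for the transition and emission parameters (back-pointers for recovering the actual sequences are omitted, and all names are illustrative):

import heapq

def nbest_step(delta_prev, a, b, o_n, T, M):
    # For each tag pair (j, k), keep the M largest values of
    # delta_{n-1,m}(i, j) * a_{ijk} * b_{jk}(o_n) over all i and m.
    delta = {}
    for j in range(T):
        for k in range(T):
            emit = b(j, k, o_n)
            scores = [p * a[i][j][k] * emit
                      for i in range(T)
                      for p in delta_prev[(i, j)]]
            delta[(j, k)] = heapq.nlargest(M, scores)
    return delta

Each cell now considers T * M predecessors instead of T, which is where the constant factor M in the running time comes from.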

In an attempt to determine how much effect consulting the N-best tag sequences could have, we ran a couple of preliminary experiments. For each, the N-best tag sequences were generated for each sentence in the Timit corpus (for these experiments, N = 20). Then, for each sentence, two accuracies are calculated:

- 104 

We determine the rst n at which the n-best tag sequence is the correct tag sequence for the entire sentence. Thus, if the best tag sequence is the correct sequence for the sentence, then the 1-best sequence is correct. For each value of n, the percentage of sentences for which a correct tag sequence occurs in the m-best position, for any m  n, is calculated.



We also determine if each individual word of a sentence has been correctly tagged in any tag sequence up to the n-th sequence. Thus, if a word is tagged correctly in the 3rd sequence, but not in the rst two, it is considered correct for the 3rd best and larger sequences.

From these two gures, we can determine two accuracy rates: the sentence accuracy rate, and the word accuracy rate. The data for each of these is shown in Table 3.7. Considering the data in this table, N -best post-processing appears to have the potential to improve tagging accuracy. 61.24% of the most likely tag sequences are entirely correct for their sentences. However, 76.40% of the sentences have entirely correct tag sequences that are either the most likely or second most likely. Word accuracy similarly jumps from 94.55% to 96.84%. A couple of points should be made here. The word accuracy rates are going to be larger than similar accuracies from taggers, since accuracies here are only considering if the word itself is correctly tagged in any of the N -best sequences. Taggers must work with accuracies based on a single tag sequence. Additionally, the largest improvement is seen at low levels of n, which is promising. Finally, we must consider how to decide when to reject the best sequence for a lower probability sequence that is correct.

3.7.2 The VDT and PAT systems Our rst tagging system, the Verb-Discrimination Tagger (VDT), uses the N -best tagger and rejects tag sequences that do not contain a tensed or imperative verb. This is a particularly good constraint for the Timit corpus, since each correct tag sequence for every Timit sentence should contain one of these types of verbs. We hope to test

- 105 Table 3.7 Results using Various Methods

Sentence Word Tag Sequence Accuracy Accuracy 1-best 61.24% 94.55% 2-best 76.40% 96.84% 3-best 79.78% 97.67% 4-best 81.74% 97.93% 5-best 83.71% 98.17% 6-best 85.11% 98.36% 7-best 86.23% 98.51% 8-best 86.80% 98.61% 9-best 87.92% 98.79% 10-best 87.92% 98.79%

Sentence Word Tag Sequence Accuracy Accuracy 11-best 87.92% 98.79% 12-best 88.20% 98.85% 13-best 88.76% 98.91% 14-best 89.33% 98.95% 15-best 90.17% 98.95% 16-best 90.17% 98.95% 17-best 90.73% 98.95% 18-best 90.73% 98.95% 19-best 90.73% 98.95% 20-best 90.73% 98.95%

this system on other corpora (for example, the WSJ corpus) eventually. Our second system, the Parser-Augmented Tagger (PAT), uses the N-best tagger to reject tag sequences that do not parse given our parsing system. Before we can implement the PAT, we need to create some sort of system to translate the tags from the part-of-speech tagger into dictionary entries for the parser to use. This is handled by running the tagged sentence through a Perl script that automatically creates a Lisp-readable dictionary file for the parser. This dictionary file is the only dictionary file used by the parser for this sentence. For the most part, the translation from tags to parts of speech for the parser is straightforward. The feature information is somewhat more difficult to translate. A partial list of the tags and how they are translated is given here (a sketch of this mapping in code follows the list):

• DT is translated to the parser's determiner category, and is assumed to have all forms of agreement (1s, 2s, 3s, 1p, 2p, 3p).

• JJ, JJR, and JJS are all translated to the parser's adjective category. Similarly, RB, RBR, and RBS are all translated to the adverb category.

• NN, NNP, NNS, and NNPS are all translated to the noun category, with the agreement set to 3s (3rd person singular) for NN and NNP, or 3p (3rd person plural) for NNS and NNPS, and the noun type set to either common (NN, NNS) or proper (NNP, NNPS).

• VB, VBD, VBG, VBN, VBP, and VBZ are all translated to the verb category. The verb form feature is easily translated (VB = bare, VBD = past tense, VBG = -ing form, VBN = past participle, VBP = present tense with non-3s agreement, VBZ = present tense with 3s agreement). The verbs are assumed to have all possible subcats, since subcat information is impossible to determine from just the word and tag without additional information.
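A hypothetical version of this translation table in Python; the category names and feature encoding here are illustrative stand-ins, not the actual Lisp dictionary format emitted by the Perl script:

AGREEMENTS = ("1s", "2s", "3s", "1p", "2p", "3p")
VERB_FORMS = {"VB": "bare", "VBD": "past", "VBG": "-ing",
              "VBN": "past participle", "VBP": "present non-3s",
              "VBZ": "present 3s"}

def translate(word, tag):
    # Map a Penn Treebank tag to a (word, category, features) lexical entry.
    if tag == "DT":
        return (word, "det", {"agr": AGREEMENTS})
    if tag in ("JJ", "JJR", "JJS"):
        return (word, "adj", {})
    if tag in ("RB", "RBR", "RBS"):
        return (word, "adv", {})
    if tag in ("NN", "NNS", "NNP", "NNPS"):
        return (word, "noun",
                {"agr": "3s" if tag in ("NN", "NNP") else "3p",
                 "type": "proper" if tag.startswith("NNP") else "common"})
    if tag in VERB_FORMS:
        # Subcategorization cannot be determined from the tag alone,
        # so all subcats are allowed.
        return (word, "verb", {"form": VERB_FORMS[tag], "subcat": "all"})
    return (word, tag.lower(), {})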

There are a few exceptions to the direct translation performed for the dictionary. The Penn Treebank tagging convention tags auxiliary verbs such as be, do, or have as regular verbs (using VB, VBP, and so on), even when they are being used in an auxiliary fashion, while the parser treats them as a special auxiliary part of speech when they are used as auxiliary verbs. Therefore, the auxiliary verbs are also added to the dictionary for the parser as auxiliaries every time. This allows correct tag sequences for sentences to parse that otherwise would not.

Using the N-best tagging system and the automatic translation from tags to lexical entries, we have successfully integrated the parser into the PAT system. It, and the VDT system, are tested on the Timit corpus. Each system follows the same basic approach:

1. Calculate the N-best tag sequences for the given sentence. For this experiment, we chose N = 10, which gives a good range of possible tag sequences without bogging down the system too much. Initially, the first-best tag sequence is considered the current best sequence for step 2.

2. Accept or reject the current-best sequence. The current best sequence is considered by the tagger. A sequence is accepted if it contains a tensed or imperative verb (for VDT) or if the sequence generates a parse tree (for PAT). Notice that the tag sequence must be translated into a dictionary file for PAT before it can be accepted or rejected. If a sequence is accepted, the tags are applied to the words of the sentence, and the tagged sentence is output. If it is rejected, then that tag sequence is discarded, the next best sequence is considered the current best, and this step is repeated. If there are no more tag sequences to try, the initial best tag sequence is returned (since none of the sequences met the constraints, we might as well return the most likely one).
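The control loop shared by VDT and PAT can be sketched generically; nbest and accept are placeholders for the N-best tagger and the constraint test (tensed/imperative verb for VDT, successful parse for PAT):

def tag_with_constraints(sentence, nbest, accept, N=10):
    # nbest(sentence, N) returns up to N tag sequences in decreasing
    # order of probability; accept(sentence, seq) applies the constraint.
    sequences = nbest(sentence, N)
    for seq in sequences:
        if accept(sentence, seq):
            return seq
    # No sequence met the constraint: fall back to the most likely one.
    return sequences[0]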

3.7.3 Results of the VDT and PAT systems

The tagger used for these experiments is trained on the entire Wall Street Journal corpus. Since it is used on the Timit corpus, a degradation of performance from results elsewhere in this chapter would be expected. The tagger was first tested on the Timit corpus using the original second-order tagger (in other words, the most likely tag sequence was returned every time). Then, the VDT and PAT systems were tested. The results from these three tests are shown in Table 3.8.

Table 3.8: Results on Timit Corpus

Tagging System                     Known    Unknown   Overall
Second-Order HMM Tagger            96.22%   82.66%    94.55%
Verb-Discrimination Tagger (VDT)   96.43%   84.42%    94.95%
Parser-Augmented Tagger (PAT)      97.28%   87.44%    96.07%

Combining the parser with the tagger resulted in an increased accuracy, especially for unknown words. The accuracy on known words increased over 1%, a 28% reduction in errors over the normal tagger. The overall accuracy increased over 1.5%, also a 28% reduction in errors. The unknown word accuracy increased nearly 5%, again a 28% reduction in errors. These results show that the PAT system offers substantial increases in accuracy over a standard tagger. One problem with this approach is that the parser slows down the tagger drastically. It also requires that a grammar for the

given corpus be available. We did not test this tagging scheme on the Wall Street Journal for this reason. More dramatic improvements could have been possible with more precise feature information from a lexicon of the known words. The VDT system increased the accuracy of our tagger on the Timit corpus as well, although not as dramatically as the parser-augmented system. Overall accuracy was increased 0.4%, a decrease of 7% in the number of errors. We would like to consider other easily implemented syntactic constraints such as this in future work on this system. In summary, systems that consider N-best tag sequences and impose additional constraints in our second-order HMM tagger have been presented, and they show great promise as tools for highly accurate tagging (when speed is not an issue).

3.8 Final Thoughts on Tagging

In conclusion, a new statistical model, the full second-order HMM, has been shown to improve part-of-speech tagging accuracies over current models. This model makes use of second-order approximations for a hidden Markov model and improves the state of the art for taggers with no asymptotic increase in running time over traditional trigram taggers based on the hidden Markov model. A new smoothing model is also explained, which allows the use of second-order statistics while avoiding sparse data problems. Various smoothing coefficient functions are tested. Finally, tagging systems that use the N-best tag sequences along with constraints upon a correct tag sequence are tested and show improvement over traditional tagging systems.


4. CONCLUSIONS AND COMMENTARY

4.1 Contributions

4.1.1 Parsing

We have presented a new parsing system designed to parse ambiguous and unknown words accurately and robustly. This has been accomplished using two main ideas: the post-mortem algorithm and morphological recognition. Post-mortem parsing allows the parser to attempt to parse a sentence more than once. This allows the system to guess a limited number of parts of speech for unknown words, then reparse the sentence if initial attempts fail. This eliminates the "all or nothing" aspect associated with deterministic parsing systems. Morphological recognition allows us to use morphology to improve the predictive power of our system on unknown words. This allows us to guess the parts of speech of unknown words more accurately, especially when additional information such as context and closed-class dictionaries is used. When tested on the Timit corpus, using the frequency list post-mortem method (section 2.5.2), we achieved 62 deletions and 99 insertions when 10% of the dictionary was missing. There are 1,137 total parses from the full dictionary control parse attempts, so less than 5.5% of the parses were missed. This illustrates the effectiveness of morphology in unknown word recognition, as well as the post-mortem approach to parsing.

4.1.2 Part-of-speech tagging

We have presented a new part-of-speech tagging system that has achieved state-of-the-art accuracy levels. This system was created using a new statistical model called the second-order hidden Markov model. This model allows the system to make use of more contextual information while not increasing the asymptotic running time over traditional trigram taggers.

In addition to the new model, a new unknown word recognizer system is developed. The construction of the recognizer is completely automated, using only the information from a tagged corpus. No language or affix information is included. The recognizer uses statistical information gathered from suffixes of a training set and a unique smoothing approach that smooths using all suffixes found in the word.

Smoothing methods for the second-order HMM are also investigated. These smoothing methods allow the increased contextual information to be used while avoiding the problems with sparse data.

The second-order HMM tagger gives an accuracy of 96.9% on the Wall Street Journal corpus while using an open dictionary. Using a closed dictionary, it achieves an accuracy of 98.05%. Both of these are state-of-the-art results. In addition, an accuracy of 85.2% is obtained on unknown words.

We also investigate improving tagging accuracy by integrating a parser with the tagger. The parser is used to eliminate potential tag sequences that are syntactically impossible sentences. In tests on the Timit corpus, we improve tagging accuracy from 94.55% to 96.07% over the standard second-order HMM tagger. This is an impressive improvement over stand-alone part-of-speech tagging.

4.2 Future Work

4.2.1 Investigations into errors in training data

While performing our experiments in part-of-speech tagging, we noticed that the tagged training corpora contained inconsistent tags in similar situations, as well as several tags that were simply incorrect. Many of the inconsistencies and errors in the training data could compromise the ability to accurately construct a tagger. We would like to repair as many of these errors as possible. However, there are over 1.2 million words in the Wall Street Journal corpus; to check all of these words manually would take an inordinate amount of time. Therefore, we will develop a method to automatically detect errors and inconsistent tags in any tagged corpus, in order to create a more consistent corpus for training and testing purposes.

4.2.2 Statistical parsing

This document details experiments involving a deterministic parser and a statistical part-of-speech tagger. A next logical step in this research would be to combine a statistical parser with a morphological recognizer to improve the parsing accuracy of sentences containing unknown words. In addition, techniques used for the statistical model of part-of-speech tagging might be useful in statistical parsing as well. Notably, the smoothing methods developed in section 3.5.4 may benefit a statistical parser.

4.2.3 Integrating the tagger and parser

Our experiments in section 3.7 have shown that using a parser to check the output of a tagger can help the tagging accuracy of a system. This experiment is fairly preliminary; there are several directions in which this research could go. We would like to test this type of system on the Wall Street Journal corpus. This would involve deriving a set of rules from the corpus for our parser. We would also like to experiment with other types of syntactic constraints, besides parsing, that may improve accuracy while avoiding the long running time of parsing. The tensed verb constraint seems to work well on the Timit corpus; it will be tested on the WSJ corpus as well. In addition, we would like to find other constraints that are syntactically motivated that could improve accuracy.


LIST OF REFERENCES

[1] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.
[2] Masaru Tomita. Efficient Parsing for Natural Language. PhD thesis, Carnegie-Mellon University, 1986.
[3] Jay Earley. An Efficient Context-Free Parsing Algorithm. PhD thesis, Carnegie-Mellon University, 1968.
[4] Bill Watterson. Homicidal Psycho Jungle Cat: A Calvin and Hobbes Collection, page 53. Andrews and McMeel: A Universal Press Syndicate Company, 1994.
[5] Martin D. Davis, Ron Sigal, and Elaine J. Weyuker. Computability, Complexity, and Languages, pages 285-287. Academic Press, 1994.
[6] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness, pages 170-178. W. H. Freeman and Company, 1979.
[7] Noam Chomsky. Three models for the description of language. International Radio Engineers (IRE) Transactions of the Professional Group on Information Theory (PGIT), pages 113-124, 1956.
[8] James Allen. Natural Language Understanding, Second Edition, pages 47-60, 189-222. The Benjamin/Cummings Publishing Company, Inc., 1995.
[9] W. A. Woods. Transition network grammars for natural language analysis. Communications of the Association of Computing Machinery, 13:591-606, 1970.
[10] W. A. Woods. An Experimental Parsing System for Transition Network Grammars. New York: Algorithmics Press, 1973.
[11] Hiroshi Maruyama. Constraint dependency grammar and its weak generative capacity. Computer Software, 1990.
[12] Hiroshi Maruyama. Structural disambiguation with constraint propagation. Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL-90), pages 31-38, 1990.
[13] Carla H. Zoltowski, Mary P. Harper, Leah H. Jamieson, and Randall A. Helzerman. PARSEC: A constraint-based framework for spoken language understanding. Proceedings of the 1992 International Conference on Spoken Language Processing, pages 249-252, 1992.
[14] Mary P. Harper and Randall A. Helzerman. Extensions to constraint dependency parsing for spoken language processing. Computer Speech and Language, 9(3):187-234, 1995.
[15] Mary P. Harper, Leah H. Jamieson, Carla H. Zoltowski, and Randall A. Helzerman. Semantics and constraint parsing of word graphs. Institute of Electrical and Electronics Engineers (IEEE) Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 63-66, 1993.
[16] Michael John Collins. A new statistical parser based on bigram lexical dependencies. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), pages 184-191, 1996.
[17] David M. Magerman. Statistical decision tree models for parsing. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pages 276-283, 1995.
[18] Don Blaheta and Eugene Charniak. Automatic compensation for parser figure-of-merit flaws. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 513-518, 1999.
[19] Ralph Weischedel, Marie Meeter, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19:359-382, 1993.
[20] Henry Granger, Jr. FOUL-UP: A program that figures out meanings of words from context. Proceedings of the Fifth International Joint Conference on Artificial Intelligence, pages 172-178, 1977.
[21] Peter Hastings, Steven Lytinen, and Robert Lindsay. Learning words from context. Proceedings of the Eighth International Conference on Machine Learning, pages 55-59, 1991.
[22] David McDonald. Internal and external evidence in the identification and semantic categorization of proper names. In Corpus Processing for Lexical Acquisition, pages 21-40. A Bradford Book, 1996.
[23] Inderjeet Mani and T. Richard MacMillan. Identifying unknown proper names in newswire text. In Corpus Processing for Lexical Acquisition, pages 41-60. A Bradford Book, 1996.
[24] Woojin Paik, Elizabeth Liddy, Edmund Yu, and Mary McKenna. Categorizing and standardizing proper nouns for efficient information retrieval. In Corpus Processing for Lexical Acquisition, pages 61-76. A Bradford Book, 1996.
[25] Paul Jacobs and Uri Zernik. Acquiring lexical knowledge from text: A case study. Proceedings of the Seventh National Conference on Artificial Intelligence, pages 739-744, 1988.
[26] Claire Cardie. A case-based approach to knowledge acquisition for domain-specific sentence analysis. Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 798-803, 1993.
[27] Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. A Grammar of Contemporary English, Twelfth Impression, pages 109-122, 976-1008. Longman Group, Ltd., 1986.
[28] Pieter Muysken. Approaches to affix order. Linguistics, 24:629-643, 1986.
[29] William Badecker and Alfonso Caramazza. A lexical distinction between inflection and derivation. Linguistic Inquiry, 20(1):108-116, 1989.
[30] Harald Baayen and Rochelle Lieber. Productivity and English derivation: A corpus-based study. Linguistics, 29:801-843, 1991.
[31] Andrew Golding and Henry Thompson. A morphology component for language programs. Linguistics, 23:263-284, 1985.
[32] Robert Milne. Resolving lexical ambiguity in a deterministic parser. Computational Linguistics, 12(1):1-12, 1986.
[33] Marc Light. Morphological cues for lexical semantics. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), pages 25-31, 1996.
[34] Andrei Mikheev. Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3):405-423, 1997.
[35] Andrei Mikheev. Unsupervised learning of word-category guessing rules. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), pages 327-334, 1996.
[36] Scott M. Thede. Predicting part-of-speech information about unknown words using statistical methods. Proceedings of the 1998 Joint International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics (COLING-ACL '98), pages 1505-1507, 1998.
[37] Antal van den Bosch and Walter Daelemans. Memory-based morphological analysis. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 285-292, 1999.
[38] Harri Jappinen and Matti Ylilammi. Associative model of morphological analysis: An empirical inquiry. Computational Linguistics, 12(4):257-272, 1986.
[39] Jean-Pierre Chanod and Pasi Tapanainen. Creating a tagset, lexicon, and guesser for a French tagger. From Text to Tags: Issues in Multilingual Language Analysis, Association for Computational Linguistics SIGDAT Workshop, pages 58-64, 1995.
[40] Wolfgang Lezius, Reinhard Rapp, and Manfred Wettler. A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. Proceedings of the 1998 Joint International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics (COLING-ACL '98), pages 743-748, 1998.
[41] Antonio Moreno and José M. Goñi. GRAMPAL: A morphological processor for Spanish implemented in Prolog. Proceedings of GULP-PRODE95: Joint Conference on Declarative Programming, 1995.
[42] Kemal Oflazer and Gökhan Tür. Morphological disambiguation by voting constraints. Joint Proceedings of the 35th Meeting of the Association for Computational Linguistics and the 8th Conference of the European Association for Computational Linguistics (ACL/EACL-97), pages 222-229, 1997.
[43] Victor Zue, Stephanie Seneff, and James Glass. Speech database development at MIT: Timit and beyond. Speech Communication, 9(4):351-356, 1990.
[44] M. Coltheart. The MRC psycholinguistic database. Quarterly Journal of Experimental Psychology, 33A:497-505, 1981.
[45] Scott Thede and Mary Harper. Identifying unknown lexical items using morphological and syntactic information. Unpublished, 1997.
[46] Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133-142, 1996.
[47] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
[48] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.
[49] W. Nelson Francis and Henry Kucera. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, 1981.
[50] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, pages 257-286, 1989.
[51] Julian Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6(3):225-242, 1992.
[52] Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. Equations for part-of-speech tagging. Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 784-789, 1993.
[53] Bernard Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155-172, 1994.
[54] Steven J. DeRose. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31-39, 1988.
[55] Eric Brill. A report of recent progress in transformation-based error-driven learning. Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 722-727, 1994.
[56] Eric Brill. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565, 1995.
[57] S. Klein and R. F. Simmons. A computational approach to grammatical coding of English words. Journal of the Association for Computing Machinery (JACM), 10:334-347, 1963.
[58] Benny Brodda. Problems with tagging and a solution. Nordic Journal of Linguistics, pages 93-116, 1982.
[59] H. Paulussen and W. Martin. Dilemma-2: A lemmatizer-tagger for medical abstracts. Proceedings of the Third Conference on Applied Natural Language Processing, pages 141-146, 1992.
[60] Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. MBT: A memory-based part of speech tagger-generator. Proceedings of the Fourth Workshop on Very Large Corpora (WVLC-4), pages 14-27, 1996.
[61] Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. International Conference on New Methods in Language Processing, 1994.
[62] Claire Cardie. Using decision trees to improve case-based learning. Proceedings of the Tenth International Conference on Machine Learning, pages 25-32, 1993.
[63] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
[64] Adwait Ratnaparkhi. A maximum entropy model for prepositional phrase attachment. Proceedings of the ARPA Workshop on Human Language Technology, 1994.
[65] Adwait Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997.
[66] R. Lau, R. Rosenfeld, and S. Roukos. Adaptive language modeling using the maximum entropy principle. Proceedings of the Human Language Technology Workshop, pages 108-113, 1993.
[67] Gökhan Tür and Kemal Oflazer. Tagging English by path voting constraints. Proceedings of the 1998 Joint International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics (COLING-ACL '98), pages 1277-1281, 1998.
[68] Dan Roth and Dmitry Zelenko. Part of speech tagging using a network of linear separators. Proceedings of the 1998 Joint International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics (COLING-ACL '98), pages 1136-1142, 1998.
[69] J. Benello, A. W. Mackie, and J. A. Anderson. Syntactic category disambiguation with neural networks. Computer Speech and Language, 3:203-217, 1989.
[70] M. Nakamura and K. Shikano. A study of English word category prediction based on neural networks. Institute of Electrical and Electronics Engineers (IEEE) Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 731-734, 1989.
[71] Hans van Halteren, Jakub Zavrel, and Walter Daelemans. Improving data driven wordclass tagging by system combination. Proceedings of the 1998 Joint International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics (COLING-ACL '98), pages 491-497, 1998.
[72] Steven Abney, Robert E. Schapire, and Yoram Singer. Boosting applied to tagging and PP attachment. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 38-45, 1999.
[73] Lluís Màrquez, Horacio Rodríguez, Josep Carmona, and Josep Montolio. Improving POS tagging using machine-learning techniques. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 53-61, 1999.
[74] Scott M. Thede and Mary P. Harper. Analysis of unknown lexical items using morphological and syntactic information with the TIMIT corpus. Proceedings of the Fifth Workshop on Very Large Corpora, pages 261-272, 1997.
[75] Scott M. Thede and Mary P. Harper. A second-order hidden Markov model for part-of-speech tagging. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 175-182, 1999.
[76] William A. Gale and Kenneth W. Church. Estimation procedures for language context: Poor estimates are worse than none. COMPSTAT, Proceedings in Computational Statistics, 9th Symposium, pages 69-74, 1990.
[77] William A. Gale and Kenneth W. Church. What's wrong with adding one? In Corpus-Based Research into Language. Rodopi, Amsterdam, 1994.
[78] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. Proceedings of the Workshop on Pattern Recognition in Practice, 1980.
[79] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. Institute of Electrical and Electronics Engineers (IEEE) Transactions on Acoustics, Speech and Signal Processing, 35(3):400-401, 1987.
[80] Evelyne Tzoukermann and Dragomir R. Radev. Using word class for part-of-speech disambiguation. Proceedings of the Fourth Workshop on Very Large Corpora (WVLC-4), pages 1-13, 1996.
[81] S. J. Young, N. H. Russell, and J. H. S. Thornton. Token passing: A conceptual model for connected speech recognition systems. Technical Report CUED F INFENG/TR38, Cambridge University, 1989.


A. THE FIRST-ORDER HIDDEN MARKOV MODEL

A.1 Stochastic Processes and Markov Processes

A Markov process is a specialized form of a stochastic process. A stochastic process is defined by a set T of states $t_1, \ldots, t_T$ (the choice of t to signify states should be clear given Chapter 3, in which the model is applied to part-of-speech tagging). Thus, there are T possible states for the process. The process moves randomly from state to state in a discrete manner. In other words, time proceeds in integer steps $n = 1, 2, \ldots$, and we can define the state at time n to be $q_n = t_i$ for some $1 \le i \le T$. At time n + 1, the process transitions to state $q_{n+1} = t_j$ for $1 \le j \le T$. Notice that it is possible for j = i, in which case the process transitions to the same state.

A.1.1 The state transition matrix

It has been stated that the process transitions between states. How does the process determine to which state it will transition? This is an important question, and is determined in part by the problem being modeled. In general, the probability of transitioning to a given state $q_{n+1} = t_j$ must depend on the entire history of states visited by the process. In other words, the probability that $q_{n+1} = t_j$ depends on the values of all $q_k$ for $k = 1, 2, \ldots, n$. Thus, to determine which state $t_j$ the process will transition to at time n + 1, the probability $P(q_{n+1} = t_j \mid q_n = t_{i_n}, q_{n-1} = t_{i_{n-1}}, \ldots, q_1 = t_{i_1})$ must be determined. As n gets large, this probability requires an extremely large number of dependencies. To use a stochastic process to model a real life problem, it is necessary to estimate these probabilities. Estimating a probability requiring this many dependencies is extremely difficult, since $T^{n+1}$ values are possible. To make the estimation of parameters more manageable, two critical assumptions are made.

• The probability of a transition is assumed to depend not on the entire history of states of the process, but only on the previous state. This is the first-order assumption, since the probability is assumed to depend on only the prior state. In other words,

$$P(q_{n+1} = t_j \mid q_n = t_{i_n}, q_{n-1} = t_{i_{n-1}}, \ldots, q_1 = t_{i_1}) \approx P(q_{n+1} = t_j \mid q_n = t_i)$$

• The transition probabilities are also assumed to be time-invariant: the probability of a transition does not depend on n.

$$\forall n, \quad P(q_{n+1} = t_j \mid q_n = t_i) = a_{ij}$$

A stochastic process using these two assumptions is called a first-order Markov process. The effect of these two assumptions is to make the estimation of transition probabilities manageable; they depend only on the two states transitioned between (and the order of the transition). Since there are T states, the state transition matrix $A = \{a_{ij}\}$, a $T \times T$ matrix, can now be defined, where $a_{ij}$ is the probability of transitioning from state $t_i$ to $t_j$.

A.1.2 The initial state distribution

The method of specifying the likelihood of transitions between states has been provided, but the state in which the process is most likely to begin must also be determined. This question can be handled in a couple of different ways. One way is to define a new state, $t_0$, as the start state, and define $q_0 = t_0$. Then transition probabilities from $t_0$ to all other states are needed. This is a reasonable solution, but it complicates the model. Instead, an initial state distribution is often defined. This is a vector $\Pi$ such that $\pi_i$ is the probability that the process will begin with state $t_i$. These two methods are functionally identical (notice that $a_{0i} = \pi_i$ for the two methods), and the initial state distribution frees us from adding another state to the model.

Now that the transition probabilities and the initial state distribution have been defined, the Markov process has been completely defined. The output from the process will be a sequence of states that the process enters.

Given the set of transition probabilities, the Markov process can be viewed as a directed graph. The nodes of the graph correspond to the states of the model, and the edges are related to the transition probabilities such that the weight of the edge directed from t_i to t_j is equal to a_ij. An example of a Markov process is shown in Figure A.1. Notice that edges exist between all pairs of states, although they are often omitted if a_ij = 0. In the diagram, the weighted edges coming from outside the graph denote the initial state distribution. The transition probabilities were fabricated for this example.

Fig. A.1. A Markov Process. The four states A, B, C, and D have initial probabilities π = (0.2, 0.2, 0.2, 0.4) and the following transition probabilities:

              to A    to B    to C    to D
    from A     0.2     0.2     0.4     0.2
    from B     0.4     0.0     0.2     0.4
    from C     0.2     0.2     0.5     0.1
    from D     0.2     0.6     0.1     0.1
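To make these definitions concrete, the short Python sketch below encodes the process of Figure A.1 as an initial distribution pi, a transition table A, and a sampling routine. The code is illustrative only (the names sample_sequence, pi, and A are ours, not from any library), but later sketches in this appendix reuse these tables.

    import random

    states = ["A", "B", "C", "D"]
    pi = {"A": 0.2, "B": 0.2, "C": 0.2, "D": 0.4}     # initial state distribution
    A = {                                             # A[i][j] = a_ij
        "A": {"A": 0.2, "B": 0.2, "C": 0.4, "D": 0.2},
        "B": {"A": 0.4, "B": 0.0, "C": 0.2, "D": 0.4},
        "C": {"A": 0.2, "B": 0.2, "C": 0.5, "D": 0.1},
        "D": {"A": 0.2, "B": 0.6, "C": 0.1, "D": 0.1},
    }

    def sample_sequence(n):
        """Randomly generate a state sequence of length n from the process."""
        seq = random.choices(states, weights=[pi[s] for s in states])
        while len(seq) < n:
            row = A[seq[-1]]                          # depends only on the prior state
            seq += random.choices(states, weights=[row[s] for s in states])
        return seq

    print(sample_sequence(6))                         # e.g. ['D', 'B', 'A', 'C', 'C', 'D']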

A.1.3 Working with a Markov process

Consider the Markov process shown in Figure A.1. Answering some questions about this process will illustrate how the process can be used.

1. What is the probability of generating the sequence A B A C D A?

This question can be answered in a straightforward way using the initial state distribution and the transition probabilities. The probability that this sequence is generated by this process is just the probability of the process starting in state A, times the probability of transitioning from A to B, times the probability of transitioning from B to A, and so on. In mathematical terms:

    P(ABACDA) = π_A a_AB a_BA a_AC a_CD a_DA = (0.2)(0.2)(0.4)(0.4)(0.1)(0.2) = 0.000128

Notice that as the sequence gets longer, the probability of any specific sequence becomes very small. Log-probabilities are often used to support calculations with such small numbers.

2. Are there any state sequences that are impossible?

Yes, any sequence containing a transition from B to B is impossible, since a_BB = 0.

3. What is the most likely initial sequence of three states?

This question is a bit more challenging. One way to solve it is to enumerate the possibilities, as shown in Table A.1. The highest-probability sequences are D B A and D B D, with a probability of 0.096 apiece. Although exhaustive enumeration allows us to correctly identify the most likely sequence of three states, it is intractable for problems larger than this toy example: it requires O(T^N) time, where T is the number of states and N is the length of the state sequence. Algorithms that calculate the most probable state sequence more quickly are discussed in the following section.
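Both question 1 and question 3 can also be checked mechanically. The sketch below reuses the pi and A tables from the previous listing (sequence_probability is an illustrative name): it scores a single sequence and brute-force enumerates every three-state sequence.

    from itertools import product

    def sequence_probability(seq):
        """P(q_1 ... q_n) = pi(q_1) times the product of a_ij over successive states."""
        p = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= A[prev][cur]
        return p

    print(sequence_probability("ABACDA"))             # 0.000128, as computed above

    best = max(product(states, repeat=3), key=sequence_probability)
    print(best, sequence_probability(best))           # ('D', 'B', 'A') with probability 0.096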

Table A.1
Probabilities of Initial Three States for Markov Process Example

    Sequence   Prob    Sequence   Prob    Sequence   Prob    Sequence   Prob
    AAA       0.008    BAA       0.016    CAA       0.008    DAA       0.016
    AAB       0.008    BAB       0.016    CAB       0.008    DAB       0.016
    AAC       0.016    BAC       0.032    CAC       0.016    DAC       0.032
    AAD       0.008    BAD       0.016    CAD       0.008    DAD       0.016
    ABA       0.016    BBA       0.000    CBA       0.016    DBA       0.096
    ABB       0.000    BBB       0.000    CBB       0.000    DBB       0.000
    ABC       0.008    BBC       0.000    CBC       0.008    DBC       0.048
    ABD       0.016    BBD       0.000    CBD       0.016    DBD       0.096
    ACA       0.016    BCA       0.008    CCA       0.020    DCA       0.008
    ACB       0.016    BCB       0.008    CCB       0.020    DCB       0.008
    ACC       0.040    BCC       0.020    CCC       0.050    DCC       0.020
    ACD       0.008    BCD       0.004    CCD       0.010    DCD       0.004
    ADA       0.008    BDA       0.016    CDA       0.004    DDA       0.008
    ADB       0.024    BDB       0.048    CDB       0.012    DDB       0.024
    ADC       0.004    BDC       0.008    CDC       0.002    DDC       0.004
    ADD       0.004    BDD       0.008    CDD       0.002    DDD       0.004

A.2 The Hidden Markov Model

A Markov process has been defined that outputs the series of states visited by the process. But what if the process does not report the states that it visits? Many types of problems can be modeled in such a way that the current state of the system is in doubt.

To handle this, we can use a hidden Markov model. A hidden Markov model (HMM) is essentially a Markov process with an additional layer between the states and the output. Instead of outputting a sequence of states, the model emits a sequence of output symbols. These symbols come from an alphabet W of output symbols {w_1, ..., w_W}. (Again, the use of w for the output symbols will be clear when the model is applied to part-of-speech tagging; see Chapter 3.) A symbol is emitted after each transition, and the probability of a given symbol being emitted depends on the state the system is in. To be more precise, the HMM has a probability function, defined as b_j(k), which equals the probability that the symbol w_k is emitted when the system is in state t_j. When an HMM is run, it generates a sequence of output symbols O = o_1 ... o_N drawn from the alphabet W of symbols. Therefore, there is no direct information available to the outside world about what state the HMM is in at any given time, only possibilities given the output symbols.
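In code, the only change from the Markov process sketch in Section A.1 is that the observer sees emitted symbols instead of states. A minimal sketch, assuming an emission table b with b[state][symbol] = b_state(symbol) (a concrete table appears in the ball-and-urn example below):

    def sample_outputs(n, b):
        """Run the hidden process for n steps, reporting only emitted symbols."""
        hidden = sample_sequence(n)                   # the observer never sees this
        return [random.choices(list(b[t]), weights=list(b[t].values()))[0]
                for t in hidden]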

A.2.1 Finding the most likely state sequence

The most common question asked when using a hidden Markov model is "What is the state sequence that is most likely to have generated the given output sequence?" In other words, assume that the HMM outputs a sequence O = o_1 ... o_N of outputs. We wish to determine the sequence of states U = u_1 ... u_N that has the largest probability of occurring and emitting the output sequence O. There are two accepted methods to answer this question:

The Maximum Likelihood Algorithm: This algorithm maximizes the probability of each state in the sequence individually. The algorithm begins with the first output symbol o_1 and chooses as u_1 the state t_i that maximizes π_i b_i(o_1), the probability that the model begins in state t_i and emits o_1. A similar calculation is performed for each subsequent output symbol. The full algorithm is shown in Figure A.2, where u_n is the value chosen by the algorithm for the n-th state.

1. Calculate the initial value:

       u_1 = arg max over t_i of [π_i b_i(o_1)]

2. Calculate each successive value:

       u_n = arg max over t_j of [a_ij b_j(o_n)],   where u_{n-1} = t_i

3. Return the sequence U = u_1 ... u_N.

Fig. A.2. The Maximum Likelihood Algorithm
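A direct Python rendering of Figure A.2 might look as follows; this is a sketch in the style of the earlier listings (assuming the same pi and A tables and an emission table b), not code from the thesis experiments:

    def maximum_likelihood(obs, states, pi, A, b):
        """Greedily choose the best state for each output symbol in turn."""
        path = [max(states, key=lambda t: pi[t] * b[t][obs[0]])]   # u_1
        for o in obs[1:]:
            prev = path[-1]                                        # u_{n-1} = t_i
            path.append(max(states, key=lambda t: A[prev][t] * b[t][o]))
        return path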

The Viterbi Algorithm: This algorithm maximizes the probability of the entire sequence of states, instead of maximizing each individual state in sequence. More formally, given a sequence O of output symbols, the Viterbi algorithm finds the sequence of states U that maximizes P(U|O). The Viterbi algorithm makes use of the variables δ and ψ, where:

    δ_n(i) = max over u_1, ..., u_{n-1} of P(u_1 ... u_{n-1}, u_n = t_i | o_1 ... o_n)
    ψ_n(i) = the state at time n-1 in the sequence that achieves the maximum in δ_n(i)

In simpler terms, δ_n(i) is the probability of the maximum-probability sequence of the first n states that ends in state t_i, given the first n output symbols, and ψ_n(i) is the predecessor state in that sequence. To calculate the state sequence U = u_1 ... u_N that maximizes P(U|O), achieving a maximum probability P*, we perform the steps outlined in Figure A.3.





These are the two main algorithms used to calculate the "best" state sequence for a given output sequence. The word "best" is in quotes because the two algorithms can, and frequently do, give different results; thus, "best" is a relative term. The maximum likelihood algorithm is much faster in practice, running in O(TN) time, while the Viterbi algorithm requires O(T²N) time. However, the sequence returned by the Viterbi algorithm is always at least as probable as the one returned by the maximum likelihood algorithm, and is usually more probable. There may still be applications where the maximum likelihood algorithm offers good enough results (especially when the probability of the entire sequence does not need to be maximized), and in some cases speed may be more important than improved accuracy.
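For comparison, here is a sketch of the first-order Viterbi algorithm (summarized in Figure A.3 below) in the same style as the earlier listings; delta and psi correspond to δ and ψ:

    def viterbi(obs, states, pi, A, b):
        """Return the most probable state sequence and its probability."""
        delta = {t: pi[t] * b[t][obs[0]] for t in states}          # delta_1(i)
        psi = []                                                   # backpointers
        for o in obs[1:]:
            step, back = {}, {}
            for j in states:
                i = max(states, key=lambda s: delta[s] * A[s][j])  # best predecessor
                back[j] = i                                        # psi_n(j)
                step[j] = delta[i] * A[i][j] * b[j][o]             # delta_n(j)
            delta, psi = step, psi + [back]
        last = max(states, key=lambda t: delta[t])                 # arg max delta_N(i)
        path = [last]
        for back in reversed(psi):                                 # unroll the psi values
            path.append(back[path[-1]])
        return list(reversed(path)), delta[last]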

1. Initialize variables. The value for each δ_1(i) is just the probability of t_i occurring at the beginning of the process (π_i) times the probability of t_i emitting the first output symbol (b_i(o_1)). The initial value for ψ_1(i) is 0, since there is no prior state to maximize over.

       δ_1(i) = π_i b_i(o_1),    1 ≤ i ≤ T
       ψ_1(i) = 0,               1 ≤ i ≤ T

2. Calculate values recursively. The values for δ and ψ are calculated recursively for each successive output symbol in the sequence.

       δ_n(j) = max over 1 ≤ i ≤ T of [δ_{n-1}(i) a_ij] b_j(o_n),    2 ≤ n ≤ N, 1 ≤ j ≤ T
       ψ_n(j) = arg max over 1 ≤ i ≤ T of [δ_{n-1}(i) a_ij],         2 ≤ n ≤ N, 1 ≤ j ≤ T

3. Calculate the maximum probability. Once the values for the last output symbol in the sequence have been calculated, the maximum probability P* and the final state u*_N of the state sequence U* = u*_1 ... u*_N that achieves it can be determined:

       P*   = max over 1 ≤ i ≤ T of δ_N(i)
       u*_N = arg max over 1 ≤ i ≤ T of δ_N(i)

4. Determine the best state sequence. The ψ values are "unrolled" to determine the proper state sequence:

       u*_n = ψ_{n+1}(u*_{n+1}),    n = N-1, N-2, ..., 2, 1

Fig. A.3. The First-Order Viterbi Algorithm

A.3 An Example: The Ball-and-Urn Problem

To illustrate the operation of a hidden Markov model, we will use an example called the ball-and-urn problem. In this problem, we assume that we are in a room with a curtain. Behind the curtain is a set of T urns, and each urn contains a number of colored balls (chosen from W different colors). There is an assistant behind the curtain who randomly chooses an urn, pulls a ball at random from the urn, and reports its color. He then moves to another urn, according to some probability, and repeats the process. The goal is, given how many urns there are, how many balls of each color are in each urn, and the probabilities of the assistant moving between urns, to determine the most likely sequence of urns visited given the sequence of colors reported to us.

This problem can be easily modeled with an HMM. Each urn represents a state of the model, while the different colors are the possible output symbols. The transition probabilities of the HMM are determined by the probabilities of the assistant moving to a new urn, and the output probabilities are determined by the distribution of balls in each urn. Let us assume that we have four urns: A, B, C, and D, and that the probabilities of the assistant moving between urns are given by the transition and initial probabilities shown in Figure A.1. Each urn contains the following balls:

- Urn A contains 3 blue balls, 4 green balls, and 3 red balls.
- Urn B contains 8 blue balls and 2 green balls.
- Urn C contains 5 green balls and 5 red balls.
- Urn D contains 10 blue balls.

Now assume that the assistant reports an initial sequence of blue, green, and red. We will determine the most likely sequence of urns to have produced this sequence of balls, using both algorithms.

A.3.1 Using the maximum likelihood algorithm

Using the maximum likelihood algorithm, we can calculate the probability that the first urn is A, B, C, or D, given that the first ball is blue:

    π_A b_A(blue) = (0.2)(0.3) = 0.06
    π_B b_B(blue) = (0.2)(0.8) = 0.16
    π_C b_C(blue) = (0.2)(0.0) = 0.00
    π_D b_D(blue) = (0.4)(1.0) = 0.40

So, the first urn is urn D. Next, we consider the transitions out of D and the green ball:

    a_DA b_A(green) = (0.2)(0.4) = 0.08
    a_DB b_B(green) = (0.6)(0.2) = 0.12
    a_DC b_C(green) = (0.1)(0.5) = 0.05
    a_DD b_D(green) = (0.1)(0.0) = 0.00

The second urn is urn B. Finally, we consider the transitions out of B and the red ball:

    a_BA b_A(red) = (0.4)(0.3) = 0.12
    a_BB b_B(red) = (0.0)(0.0) = 0.00
    a_BC b_C(red) = (0.2)(0.5) = 0.10
    a_BD b_D(red) = (0.4)(0.0) = 0.00

The last urn is A. Therefore, the best sequence according to the maximum likelihood algorithm is D B A, with a total probability of

    π_D b_D(blue) a_DB b_B(green) a_BA b_A(red) = (0.40)(0.12)(0.12) = 0.00576


A.3.2 Using the Viterbi algorithm

Using the Viterbi algorithm, the initial values for δ and ψ are calculated first:

    δ_1(A) = π_A b_A(blue) = (0.2)(0.3) = 0.06
    δ_1(B) = π_B b_B(blue) = (0.2)(0.8) = 0.16
    δ_1(C) = π_C b_C(blue) = (0.2)(0.0) = 0.00
    δ_1(D) = π_D b_D(blue) = (0.4)(1.0) = 0.40
    ψ_1(A) = ψ_1(B) = ψ_1(C) = ψ_1(D) = 0

Now, we calculate the possible values for δ_2(i) and ψ_2(i):

    δ_2(A) = max_i(δ_1(i) a_iA) b_A(green) = max(0.012, 0.064, 0.000, 0.080)(0.4) = 0.032,   ψ_2(A) = D
    δ_2(B) = max_i(δ_1(i) a_iB) b_B(green) = max(0.012, 0.000, 0.000, 0.240)(0.2) = 0.048,   ψ_2(B) = D
    δ_2(C) = max_i(δ_1(i) a_iC) b_C(green) = max(0.024, 0.032, 0.000, 0.040)(0.5) = 0.020,   ψ_2(C) = D
    δ_2(D) = max_i(δ_1(i) a_iD) b_D(green) = max(0.012, 0.064, 0.000, 0.040)(0.0) = 0.000,   ψ_2(D) = B

Finally, we calculate the values for δ_3(i) and ψ_3(i):

    δ_3(A) = max_i(δ_2(i) a_iA) b_A(red) = max(0.0064, 0.0192, 0.0040, 0.0000)(0.3) = 0.00576,   ψ_3(A) = B
    δ_3(B) = max_i(δ_2(i) a_iB) b_B(red) = max(0.0064, 0.0000, 0.0040, 0.0000)(0.0) = 0.000,     ψ_3(B) = A
    δ_3(C) = max_i(δ_2(i) a_iC) b_C(red) = max(0.0128, 0.0096, 0.0100, 0.0000)(0.5) = 0.0064,    ψ_3(C) = A
    δ_3(D) = max_i(δ_2(i) a_iD) b_D(red) = max(0.0064, 0.0192, 0.0020, 0.0000)(0.0) = 0.000,     ψ_3(D) = B

Now the maximum probability is simply max_i δ_3(i) = δ_3(C) = 0.0064, and the sequence of states can be determined by unrolling the ψ values:

    u_3 = arg max_i δ_3(i) = C
    u_2 = ψ_3(C) = A
    u_1 = ψ_2(A) = D

Therefore, the best sequence as calculated by the Viterbi algorithm is D A C, with a probability of 0.0064. Notice that to calculate this value using brute force methods, we would have had to calculate the probability of emitting the given sequence for every possible combination of urns. There are 4³ = 64 possible combinations of urns, so 64 calculations would have been required. Using the Viterbi algorithm, we only had to calculate 12 values (δ_n(i) for four states and three output symbols) to find the highest-probability sequence. Larger problems result in similar savings: Viterbi is O(T²N), while enumeration is O(T^N). Notice also that the Viterbi and maximum likelihood algorithms produced different results, and that the Viterbi algorithm's result has the higher probability (0.0064 versus 0.00576). The Viterbi algorithm produces results that are more probable at the cost of additional running time.
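Both hand calculations can be reproduced by running the earlier sketches on this model; only the emission table b, read directly off the urn contents above, is new here:

    b = {                                             # b[urn][color]
        "A": {"blue": 0.3, "green": 0.4, "red": 0.3},
        "B": {"blue": 0.8, "green": 0.2, "red": 0.0},
        "C": {"blue": 0.0, "green": 0.5, "red": 0.5},
        "D": {"blue": 1.0, "green": 0.0, "red": 0.0},
    }
    obs = ["blue", "green", "red"]

    print(maximum_likelihood(obs, states, pi, A, b))  # ['D', 'B', 'A']
    print(viterbi(obs, states, pi, A, b))             # (['D', 'A', 'C'], 0.0064)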

A.4 Conclusions

The hidden Markov model, a model commonly used in part-of-speech tagging and speech recognition, has been introduced and discussed. This will facilitate discussion of the new second-order hidden Markov model that is introduced in Section 3.5.


B. DATA TABLES FOR PARSING EXPERIMENTS

This appendix contains the data from the experiments in Section 2.5, which were presented in graphical form in that section.


Table B.1
Results using the Maximum Method with Dictionary Distribution Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             65      89       70      75       70      82
          20             96     198       80     169       90     195
          30            124     322       96     294      118     282
          40            163     368      112     287      131     313
          50            211     647      160     492      178     609
          60            249     500      136     512      197     506
          70            292    1422      207    1271      239    1306
          80            373    1308      284    1082      332    1214
          90            383    1723      311    1427      334    1914
         100            448    1933      349    1644      385    2401

Table B.2
Results using the Maximum Method with Compromise Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             57     133       62      99       62     106
          20             79     241       70     196       80     221
          30             89     387       67     313       91     285
          40             97     457       93     326      112     346
          50            151     768      138     537      156     643
          60            185     651      108     574      169     563
          70            233    1651      155    1450      187    1417
          80            284    1524      229    1253      281    1312
          90            293    2057      245    1666      286    2036
         100            334    2316      271    1896      325    2533


Table B.3
Results using the Maximum Method with Original Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             51     214       58     135       58     141
          20             59     449       62     321       72     346
          30             77     538       65     357       89     329
          40             80     753       93     442      112     463
          50            110    1172      127     670      145     774
          60            148    1209      101     838      162     824
          70            182    2384      146    1764      180    1699
          80            245    2678      220    1618      276    1647
          90            252    3359      234    2054      281    2386
         100            282    3987      254    2345      314    2949

Table B.4
Results using the Average Method with Dictionary Distribution Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             71      70       71      84       71      90
          20             97     145       84     174      103     189
          30            136     251      112     296      125     317
          40            171     285      119     260      141     372
          50            220     478      182     564      202     667
          60            265     437      181     500      242     640
          70            303    1378      253    1481      270    1632
          80            387    1188      328    1027      379    1377
          90            394    1608      338    1664      377    2184
         100            465    1766      387    1796      432    2814


Table B.5
Results using the Average Method with Compromise Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             57     123       62     108       62     113
          20             81     191       73     194       92     213
          30             90     311       86     315      100     301
          40            100     388       97     290      121     401
          50            158     619      161     599      176     679
          60            201     616      166     540      217     691
          70            233    1639      190    1635      212    1709
          80            285    1425      281    1170      325    1382
          90            294    1896      289    1832      324    2359
         100            337    2115      330    1987      368    2892

Table B.6
Results using the Average Method with Original Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             51     195       58     143       58     149
          20             59     369       65     321       84     339
          30             78     395       84     364       98     351
          40             87     577       96     411      121     541
          50            139     816      150     721      165     811
          60            177     951      158     824      189     935
          70            202    1741      180    1956      197    1996
          80            283    1915      273    1564      299    1744
          90            293    2284      278    2342      298    2750
         100            325    2594      313    2549      336    3334


Table B.7
Results using the Suffix-First Method with Dictionary Distribution Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             65      76       70      83       70      83
          20             96     149       81     169       90     179
          30            125     265      109     290      119     272
          40            164     292      135     301      140     358
          50            221     460      189     587      191     597
          60            271     407      173     720      220     697
          70            307    1305      225    1563      247    2164
          80            385    6335      300    1276      336    5364
          90            395    6749      313    1758      336   22396
         100            463    6853      358    2083      389   23187

Table B.8
Results using the Suffix-First Method with Compromise Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             57     120       62     107       62     107
          20             79     190       71     196       80     205
          30             90     311       80     309       92     275
          40             98     371      116     338      121     390
          50            177     556      167     628      169     628
          60            207     534      147     776      192     752
          70            240    1519      175    1751      195    2250
          80            304    6525      247    1444      285    5438
          90            313    7042      252    1982      291   22484
         100            357    7173      285    2326      332   23280


Table B.9
Results using the Suffix-First Method with Original Frequencies for No-Affix Words

    % Open-Class         Hand             Timit             MRC2
    Lexicon Removed  Delete  Insert   Delete  Insert   Delete  Insert
          10             51     192       58     143       58     142
          20             59     370       63     321       72     330
          30             78     403       78     357       90     323
          40             85     561      116     451      121     500
          50            142     798      156     743      158     745
          60            174     957      140    1027      185    1001
          70            199    2058      166    2038      188    2490
          80            279    7271      242    1786      280    5731
          90            286    7882      245    2345      286   22799
         100            319    8164      272    2738      321   23642


VITA

Scott Matthew Thede

Addresses

Work:
DePauw University
Department of Computer Science
Julian Center #249
Greencastle, IN 46135
(765) 658-4736

Home:
2005 Lakeview Drive
Greencastle, IN 46135
(765) 653-0514

Teaching Interests

I am interested in teaching courses in the following areas: Natural Language Processing, Machine Learning, Artificial Intelligence, Control Systems and Robotics, Systems and Signals Analysis, Statistical and Computational Models, Algorithms, and Programming Languages (C/C++/Lisp/Prolog), as well as core mathematics, computer science, and engineering classes.

Research Interests

I plan on doing research in natural language processing, namely investigating how to deal with out-of-lexicon words in parsing, part-of-speech tagging, and natural language understanding. I also intend to study statistical models and their use in natural language processing, machine learning, and artificial intelligence, especially in statistical parsing and part-of-speech tagging.

Education

Currently pursuing Ph.D. in Electrical and Computer Engineering (ECE), Purdue University
  Expected Graduation Date: August, 1999
  Current GPA: 3.78
  Advisor: Dr. Mary P. Harper
  Thesis Title: Methods for Properly Parsing and Tagging Sentences Containing Unknown and Ambiguous Lexical Tokens

M.S. in Electrical Engineering, Purdue University
  Graduation Date: December, 1994
  GPA: 3.83

B.S. in Electrical Engineering, University of Toledo
  Graduation Date: June, 1993
  GPA: 3.85

Teaching Experience

Instructor

School of Electrical and Computer Engineering, Purdue University (Spring 1998)
Instructor for EE 473, "Introduction to Artificial Intelligence". Duties included:
- Preparing and presenting all lectures
- Writing and grading assignments, programming projects, and exams
- Individual instruction for students who needed outside help

Teaching Assistant

School of Electrical and Computer Engineering, Purdue University (Spring 1999)
Assisted in teaching EE 473, "Introduction to Artificial Intelligence", including writing homework assignments, grading exams, and holding individual tutoring sessions with students.

School of Electrical and Computer Engineering, Purdue University (Spring 1999)
Teaching Assistant for EE 608, "Computational Models and Methods", the core course for the EE graduate school computer engineering area. Duties included helping with homework assignments, grading exams, and offering individual tutoring for students.

School of Electrical and Computer Engineering, Purdue University (Fall 1998)
Assisted in teaching EE 373, "Programming Languages for Artificial Intelligence", including writing homework assignments, grading exams, holding individual tutoring sessions with students, and giving one lecture for the class.

School of Electrical and Computer Engineering, Purdue University (Fall 1997)
Teaching Assistant for EE 608, "Computational Models and Methods". Duties included writing homework assignments, grading exams, and offering individual tutoring for students.

Grader

School of Electrical and Computer Engineering, Purdue University (Fall 1996)
Graded homework assignments for EE 608.

School of Electrical and Computer Engineering, Purdue University (Summer 1995)
Duties included solving and grading all homework assignments, assisting in exam grading, and delivering lectures for two class periods.


Publications and Documents

"Familiarity and Pronounceability of Nouns and Names", by Aimee M. Surprenant, Susan L. Hura, Mary P. Harper, Leah H. Jamieson, Glenis Long, Scott M. Thede, Ayasakanta Rout, Tsung-Hsiang Hsueh, Stephen A. Hockema, Michael T. Johnson, Pramila N. Srinivasan, and Christopher M. White. To appear in Behavior Research Methods, Instruments, and Computers.

"A Second-Order Hidden Markov Model for Part-of-Speech Tagging", by Scott M. Thede and Mary P. Harper. Published in the Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), June 1999, pages 175-182.

"Predicting Part-of-Speech Information about Unknown Words using Statistical Methods", by Scott M. Thede. Published in the Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL '98), August 1998, pages 1505-1507.

"Familiarity and Pronounceability of Nouns and Names: the Purdue Proper Name Database", by A. M. Surprenant, S. L. Hura, M. P. Harper, L. H. Jamieson, G. Long, S. M. Thede, A. Rout, T.-H. Hsueh, S. A. Hockema, M. T. Johnson, J. B. Laflen, P. Srinivasan, and C. M. White. Published in the Proceedings of the 16th International Congress on Acoustics and 135th Meeting of the Acoustical Society of America, Seattle, WA, June 1998.

"Post-Mortem Parsing with Unknown Lexical Items using Morphological Recognition, Syntactic Information, and a Closed Class Lexicon", by Scott Thede. Doctoral Research Proposal, Purdue University, September 1997.

"Analysis of Unknown Lexical Items using Morphological and Syntactic Information with the TIMIT Corpus", by Scott M. Thede and Mary P. Harper. Published in the Proceedings of the Fifth Workshop on Very Large Corpora, August 1997, pages 261-272.

Presentations

"A Second-Order Hidden Markov Model for Part-of-Speech Tagging", by Scott M. Thede and Mary P. Harper. Presented at ACL '99, College Park, MD, USA, June 1999.

"Predicting Part-of-Speech Information about Unknown Words using Statistical Methods", by Scott M. Thede. Presented at COLING-ACL '98, Montreal, Canada, August 1998.

"Analysis of Unknown Lexical Items using Morphological and Syntactic Information with the TIMIT Corpus", by Scott M. Thede and Mary P. Harper. Presented at the Fifth Workshop on Very Large Corpora, Beijing, China, August 1997.


Coursework

Computer Engineering Area
- Artificial Intelligence
- Computational Models and Methods
- Computer Vision
- Formal Languages, Computability, and Complexity
- Introduction to Neural Networks
- Natural Language Processing
- Pattern Recognition and Decision Making Processes
- Programming Languages for Artificial Intelligence

Automatic Controls Area
- Introduction to Robotic Systems
- Lumped System Theory
- Optimization Methods for Systems and Controls
- State Estimation and Parameter Identification for Stochastic Processes

Mathematics and Other Areas
- Advanced Math for Engineers and Physicists I
- Advanced Math for Engineers and Physicists II
- Elements of Stochastic Processes
- Linear Algebra with Applications
- Mathematical Logic I
- Random Variables and Signals

Research Experience

Research Assistant

School of ECE, Purdue University (May - August 1998)
Continued thesis research into the improvement of a hidden Markov model for a part-of-speech tagger, increasing accuracy on both known and unknown words.

School of ECE, Purdue University (August 1996 - December 1997)
Conducted thesis research into the improvement of a hidden Markov model for a part-of-speech tagger, increasing accuracy on both known and unknown words.

National Science Foundation Fellow

School of ECE, Purdue University (August 1993 - August 1996) Conducted thesis research and attended classes.


Undergraduate Research Assistant, National Science Foundation

University of Maine, Orono, Maine (June - September 1992)
Performed research in the areas of computer vision, robotic control, and artificial intelligence. Assisted in the design of a computer/robot system to play chess.

Professional Development

Paper Review for ACL '99 (Spring 1999)
Reviewed papers for the student session of the Association for Computational Linguistics conference in Baltimore, 1999.

Teaching Workshop, Purdue University (Fall 1997)
Attended a ten-week session of a teaching workshop offered for current and prospective instructors to improve their teaching skills. Topics included lecture techniques, design and evaluation of instruction, evaluation of student achievement, and student-teacher relationships, among others.

Related Work Experience

Communications Technician, Ohio Northern University (May - August 1993)
Worked on setting up and wiring the campus for computer network connections. Troubleshot telephone and computer systems.

Computer Lab Technician, University of Toledo (January 1991 - May 1993)
Worked with students using the mainframe and personal computers.

Peer Advisor in the College of Engineering, University of Toledo (September - December 1990)
Worked with incoming freshman engineering students, helping them become acquainted with college life and their engineering studies.

Lab Technician, Ohio Northern University (May - August 1990)
Worked in the EE labs on campus performing maintenance.

Technician, TPT Electronics (May 1984 - September 1989)
Assembled circuit boards and cases for bio-medical equipment.


Honors and Service

Awarded the Magoon Outstanding Teaching Assistant award at Purdue University, April 1999.

Awarded the University of Toledo Presidential Scholarship for undergraduate study.

Member of Triangle Fraternity, University of Toledo chapter.
- Served as Recording Secretary.
- Served as Scholarship Chairman.
- Served as Fund-Raising Chairman.

Member of Mortar Board National Honorary Society, University of Toledo chapter.
- Served as Recording Secretary.

Member of Eta Kappa Nu (Electrical Engineering Honorary Society).

Member of Tau Beta Pi (Engineering Honorary Society).

Member of Phi Eta Sigma (Freshman Honorary Society).

Professional Societies

Member of the Institute of Electrical and Electronics Engineers (IEEE).
Member of the IEEE Computer Society (IEEE-CS).
Member of the Association for Computational Linguistics (ACL).
Member of the Association for Computing Machinery (ACM).
Member of the ACM Special Interest Groups on Artificial Intelligence (SIGART) and Computer Science Education (SIGCSE).

References available on request.