AUTOMATIC WORD SENSE DISAMBIGUATION (WSD) SYSTEM

Kevin Indrebo
Department of Electrical and Computer Engineering, Marquette University
Knowledge and Information Discovery Laboratory, Olin Engineering 523
Milwaukee, WI 53201-1881
(414) 288-6046
[email protected]

Jidong Tao
Department of Electrical and Computer Engineering, Marquette University
Speech and Signal Processing Laboratory, Olin Engineering 518
Milwaukee, WI 53201-1881
(414) 288-7451
[email protected]

Marek Trawicki
Department of Electrical and Computer Engineering, Marquette University
Speech and Signal Processing Laboratory, Olin Engineering 518
Milwaukee, WI 53201-1881
(414) 288-7451
marek.trawicki@marquette.edu

ABSTRACT
This paper presents an automatic word sense disambiguation (WSD) system that uses Part-of-Speech (POS) tags along with word classes as discrete features. The word classes are derived by a Word Class Assigner using the Word Exchange Algorithm from statistical language processing. A Naïve-Bayes classifier from Weka is employed in both the training and testing phases to perform supervised learning on the standard Senseval-3 data set. Experiments were performed using 10-fold cross-validation on the training set and using the training and testing sets for training the model and evaluating it, respectively. In both experiments, the features were used either separately or combined together to produce the accuracies. Results indicate that the word class features did not provide any discrimination for word sense disambiguation, the POS tag features produced a small improvement over the baseline, and the combination of both word class and POS tag features did not increase accuracy. Overall, further study is needed to improve the implementation of the word class features in the system.


Keywords Natural Language Processing (NLP), Text Mining, Word Sense Disambiguation (WSD), Supervised Learning, Naïve Bayes Classifier and Model, Part-of-Speech (POS) Tagging, Collocation Features, Word Exchange Algorithm, Senseval Data Set.

1. INTRODUCTION
Word Sense Disambiguation (WSD) has been of great interest and concern to the natural language and text processing community for the past fifty years [1]. Fundamentally, WSD deals with choosing the correct sense (i.e., meaning) of a word in a given text from a list of possible senses [2]. The WSD task is best illustrated with an example. Consider the target word bank, which has three distinct meanings shown in Table 1 [3].

Table 1: Example of Word Sense Disambiguation for Bank

Although it is a relatively easy task for humans to distinguish between the various senses of the target word bank in the given sentences, it is actually a difficult task for computers. The challenge arises from the fact that computers need the context of the target word, along with possibly external knowledge sources such as lexical, encyclopedic, and hand-devised information, to select the correct sense [1]. Since WSD is needed to accomplish most other natural language processing tasks, it is often called an "intermediate task" [4]. In other words, WSD serves as a means for other applications, not an end in itself. Besides being essential for language understanding and man-machine communication, WSD has several other applications [1]:

Machine Translation: WSD is essential for the proper translation of words.

Information Retrieval (IR): WSD is vital for eliminating occurrences of a specific target word (i.e., keyword) used in an inappropriate sense (e.g., Miami Dolphins versus bottlenose dolphins).

Content and Thematic Analysis: WSD is recognized as a way to include only those instances of words in their proper sense.

Grammatical Analysis: WSD is useful for part-of-speech (POS) tagging.

Speech Processing: WSD is required for correct phonetization of words in speech synthesis [5], [6] and for word segmentation and homophone discrimination in speech recognition [7], [8].

Text Processing: WSD is necessary for tasks such as spelling correction, case changes, and lexical access to Semitic languages (i.e., the subgroup of Afro-Asiatic languages that includes Arabic, Hebrew, Amharic, Aramaic, etc.).

To date, WSD systems have not fully solved the problem of disambiguating words in a given text. Supervised machine-learning approaches such as Naïve-Bayes classification have produced only about 75% accuracy on English data sets, far lower than the roughly 95% accuracy reported for POS tagging algorithms [9]. Thus, the goal of this work is to develop, evaluate, and compare methods for performing WSD in order to improve accuracy on the gold-standard Senseval-3 data set [10].

The paper is structured in the following way: Section 2 (Background), Section 3 (Related Work), Section 4 (Data), Section 5 (WSD System), Section 6 (Evaluation), Section 7 (Conclusions), and Section 8 (Future Work).

2. BACKGROUND
The WSD system implemented in this work uses POS tags and collocation features as input to a Naïve-Bayes classifier. Details are given for both the features and the model to aid understanding of the remainder of the paper.

2.1 Features
Collocation and POS tags were used as the features for the Naïve-Bayes classifier. Previous work on WSD systems has shown that these features are effective at encoding both the local lexical and grammatical information that can often accurately isolate a given sense [2].

2.1.1 Collocation
Collocation refers to the positions of words surrounding a given target word. Collocation features encode information about the lexical units at specific positions immediately to the left and right of the target word whose sense is to be disambiguated by the WSD system. The target word and the words on both sides of it are grouped together as a window. Research has indicated that 1-3 words around the target word produce the best accuracy results [1]. Consider the following example, which deals with the target word bass [2].

An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.

For this sentence, the collocation feature vector consists of the target word bass along with its neighbors, namely guitar and and (left) and player and stand (right). Although the window could be expanded to include electric (left) and off (right), the system should generally not proceed beyond those words or it risks reduced accuracy.

2.1.2 Parts-of-Speech (POS) Tags
Parts-of-Speech (POS) tagging is the process of labeling words in a data set with their possible tags [2]. The significance of POS tagging for language processing is that it provides important word-category information about the target word and its neighbors. It is needed to distinguish between major word categories such as nouns vs. verbs, and even many finer distinctions. For example, knowing that a word is a personal pronoun (e.g., I, you, he, me) can determine which words are likely to occur in its vicinity (e.g., personal pronouns are most likely followed by a verb, whereas possessive pronouns are followed by a noun) [2]. Figure 1 illustrates the process of tagging words in a data set using Python 2.4 [11] and NLTK [12].

Figure 1: POS Tagging

The text is tokenized into individual word and punctuation units separated by white space using regular expressions. Next, the Brown corpus, a collection of about one million words sampled from 500 written texts in different genres (e.g., newspapers, novels, non-fiction, academic, etc.) and assembled at Brown University [13], was chosen as the source of the tagset. The Brill tagger [14], a Transformation-Based Learning (TBL) approach to machine learning, then combines the stochastic [15] and rule-based [16], [17], [18] taggers: it first determines the most likely tag for a given word and then verifies the corresponding tag using TBL rules. As output, the POS tagger produces the tagged text.
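As a rough modern equivalent of this pipeline (a sketch only: current NLTK releases differ considerably from the 2005 version used in this work, and nltk.pos_tag applies a Penn Treebank tagger rather than the Brown-corpus Brill tagger described above), tokenization and tagging might look like the following.

# Sketch of the tokenize-then-tag pipeline with a current NLTK release.
# Requires the 'punkt' and 'averaged_perceptron_tagger' resources
# (via nltk.download); tags come from the Penn Treebank tagset, so they
# will not match the Brown tags used in the paper.
import nltk

sentence = "The horse is expected to race tomorrow"
tokens = nltk.word_tokenize(sentence)   # individual word/punctuation units
tagged = nltk.pos_tag(tokens)           # list of (word, tag) pairs
print(tagged)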

In order to illustrate the process of POS tagging, consider the following example [2]. After tokenization of the sentence The

horse is expected to race tomorrow into its individual words and punctuation, a stochastic tagger trained on the Brown corpus would most likely label the word race as a noun, based on the higher probability of that tag given the word:

P(NN | race) = 0.98
P(VB | race) = 0.02

Since NN (i.e., noun) is the most likely POS tag for race, the word would be tagged with it. In this sentence, however, that tag is actually a mistake. The stochastic tagger would produce The/DT horse/NN is/VBZ expected/VBN to/TO race/NN tomorrow/NN, where the tag follows the slash.

With the most likely tag NN (i.e., noun) selected by the stochastic tagger, the Brill tagger applies its TBL rules to the mistagging of race:

Change NN to VB when the previous tag is TO.

Because the previous tagged word is to/TO, the rule changes race/NN to race/VB. Thus, the Brill tagger properly corrects the mistagged word in the sentence.

2.2 Naïve-Bayes
Bayesian classifiers [19] are supervised learning approaches that have been applied to a wide variety of problems. In particular, the Naïve-Bayes classifier has been a very popular approach for WSD tasks, with good performance [20]. The basic idea is to select the most likely sense, s, from all possible senses, S, of a word given the input vector V, which includes the context (i.e., collocation) and POS tag features [2]. In essence, the best sense is defined as

\hat{s} = \arg\max_{s \in S} P(s | V).    (1)

Bayes' formula can be applied to (1) to rewrite the equation as

\hat{s} = \arg\max_{s \in S} \frac{P(V | s) P(s)}{P(V)},    (2)

where P(V|s), P(s), and P(V) represent the likelihood, prior, and evidence in (2). While the data set has sparse information relating entire feature vectors V to senses, it has an abundance of tagged training information about individual feature-value pairs in the context of specific senses [2]. As a result, the independence assumption can be employed to naively assume that the features in V are independent of each other and to rewrite P(V|s) as

P(V | s) \approx \prod_{j=1}^{n} P(v_j | s),    (3)

which estimates the probability of an entire feature vector V given a particular sense by the product of the probabilities of its individual features for that sense. After substituting (3) and removing the constant P(V) term, which has the same value across all senses, (2) can be rewritten as

\hat{s} = \arg\max_{s \in S} P(s) \prod_{j=1}^{n} P(v_j | s),    (4)

where P(s) stays outside the product in (4) since it does not depend on the index j. With (4), the usual issues involving zero counts and smoothing apply to the Naïve-Bayes classifier [2].
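To make the decision rule in (4) concrete, the following minimal Python sketch collects the counts needed for P(s) and P(v_j | s) and picks the argmax sense. This is illustrative only; the actual system used Weka's Naïve-Bayes implementation. Add-one smoothing handles the zero-count problem noted above, and log probabilities avoid underflow.

# Minimal Naïve-Bayes sense classifier over discrete feature vectors.
from collections import defaultdict
from math import log

def train(examples):
    # examples: list of (features, sense) pairs; features is a tuple of
    # discrete values (e.g., POS tags and word-class IDs).
    sense_counts = defaultdict(int)      # C(s)
    feature_counts = defaultdict(int)    # C(j, v_j, s) by position j
    for features, sense in examples:
        sense_counts[sense] += 1
        for j, v in enumerate(features):
            feature_counts[(sense, j, v)] += 1
    return sense_counts, feature_counts

def classify(features, sense_counts, feature_counts, n_values):
    # Implements s_hat = argmax_s P(s) * prod_j P(v_j | s) in log space,
    # with add-one smoothing; n_values is the number of possible values
    # a feature position can take.
    total = float(sum(sense_counts.values()))
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = log(count / total)                        # log P(s)
        for j, v in enumerate(features):
            c = feature_counts[(sense, j, v)]
            score += log((c + 1.0) / (count + n_values))  # log P(v_j | s)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense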

3. RELATED WORK
Over the years, there have been several robust, stand-alone WSD systems designed to operate with minimal assumptions about the type of information available from other processes [2]. These systems have employed several common WSD approaches, including AI-based [21], [22], [23], [24], knowledge-based [25], [26], and corpus-based [27], [28], [29] methods, to perform the word sense disambiguation task. Table 2 lists the details of each of the approaches and methods.

Table 2: WSD Approaches

Since the project concerns itself only with the knowledge-based and corpus-based approaches, each is explained in further detail below.

3.1 Knowledge-Based Approaches
In knowledge-based approaches, machine-readable dictionaries (MRDs) provide both the means for constructing a sense tagger and the target senses employed in the system [2]. In 1986, Lesk [30] first implemented an approach in which all of the sense definitions of the word to be disambiguated were retrieved from the dictionary. Each sense was compared to the dictionary definitions of all the remaining words in the context, and the sense with the highest overlap (i.e., words in common) with these context-word definitions was chosen as the correct sense. The primary problem in Lesk's WSD system was that the dictionary entries for the target words were relatively short and might not provide enough material to create adequate classifiers. Essentially, the words used in the context and their associated definitions must directly overlap with the words contained in the appropriate sense definition to be of any use. Lesk reported accuracies of 50-70% on short samples of text, and his method serves as the basis for most knowledge-based WSD approaches.

3.2 Corpus-Based Approaches

In corpus-based approaches, the systems are actually trained to perform the WSD task [2]. After the models have been trained on numerous examples, they are tested on unseen examples to determine the effectiveness of the trained classifier. As illustrated in Table 2, there are three major corpus-based approaches: supervised learning, bootstrapping, and unsupervised learning. In supervised learning, the WSD system is constructed from a set of unbiased labeled instances drawn from the same distribution as the test set. The most common supervised learning approaches include Naïve-Bayes classifiers, decision lists, decision trees, artificial neural networks (ANNs), logic learning systems, and nearest neighbor. Mooney [31] deployed a Naïve-Bayes classifier and an ANN for word disambiguation, with accuracies of 73% in assigning one of six senses to a corpus of examples of the word line. Bootstrapping is a combination of supervised and unsupervised methods that requires far fewer resources: an initial classifier is constructed from a small number of labeled instances using any of the supervised methods and is then employed to extract a larger training set from the unlabeled instances. Yarowsky reported an average performance of 96.5% on sense assignments involving 12 words [32]. In unsupervised learning, the WSD system is developed from a clustering-based idea that attempts to discover representations of the word senses from unlabeled text. On a data set similar to Yarowsky's, Schütze [33], [34] implemented unsupervised techniques to achieve accuracies approaching 90% on a small sample of words. With all of these corpus-based approaches, the major issues are the need for large sense-tagged training sets and the great deal of effort required to create a model for each ambiguous entry in the lexicon.

4. DATA
The Senseval-3 data set [10], one of the three gold standards along with the earlier Senseval-1 and Senseval-2, was employed for both training and testing the WSD system. It consists of 57 different English words (32 verbs, 20 nouns, and 5 adjectives) with senses tagged from WordNet [35]. There are approximately 140 training examples and 70 testing examples for each of the words, where an individual example consists of a paragraph of text that contains the target word. On average, the target words have about 8 senses, with a range of 3 to 23 different senses.
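For illustration, instances can be pulled out of a Senseval-3 lexical-sample training file with a short parser along the following lines. The element and attribute names here (lexelt, instance, answer/senseid, context) are assumptions based on the usual Senseval-3 distribution and should be verified against the actual files.

# Hypothetical loader for a Senseval-3 lexical-sample training file.
import xml.etree.ElementTree as ET

def load_senseval(path):
    examples = []   # (target word, sense id, context text) triples
    root = ET.parse(path).getroot()
    for lexelt in root.iter("lexelt"):
        word = lexelt.get("item")                   # e.g. "activate.v"
        for instance in lexelt.iter("instance"):
            answer = instance.find("answer")
            sense = answer.get("senseid") if answer is not None else None
            context = instance.find("context")
            text = "".join(context.itertext())      # paragraph with target word
            examples.append((word, sense, text))
    return examples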

5. WSD SYSTEM
The WSD system consists of two major phases: training and testing. While the POS tag and collocation features are commonly implemented for Naïve-Bayes classifiers, the novel portion of the system consists of the Word Class Assigner and its corresponding class definitions. Details are given for both phases of the WSD system, along with an example of each block from the Senseval-3 data set.

5.1 Training
The training phase of the WSD system builds the Naïve-Bayes classifier model that is used for classification in the testing phase. Figure 2 shows the basic overview of the training phase.

Figure 2: WSD Training Phase

5.1.1 Word Extraction
The 140 training examples from the Senseval-3 data set were each partitioned into windows of size 9 around the target word. Figure 3 shows the word extraction process.

Figure 3: Word Extraction
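A minimal sketch of this windowing step is shown below. How the actual system tokenized the paragraphs and treated windows that run past a paragraph boundary is not specified, so the truncation behavior here is an assumption.

def extract_window(tokens, target_index, size=9):
    # Return `size` tokens centered on the target word (with size 9,
    # up to 4 neighbors on each side); the window is truncated at the
    # start or end of the example paragraph.
    half = size // 2
    lo = max(0, target_index - half)
    hi = min(len(tokens), target_index + half + 1)
    return tokens[lo:hi]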

In terms of a specific example from the Senseval-3 data set, Table 3 shows the windowing process using 3 words to the left and right of the target word activate, together with its labeled sense identification number of 38201 from WordNet, in a given sentence.

Table 3: Activate from Senseval-3 with Window Extraction

The target word activate has five senses, each with a different identification number from WordNet. In the sentence from Table 3, activate has the meaning to initiate action in and make active. Table 4 shows the correct sense along with the other four senses for the target word.

Table 4: 5 Senses of Word Activate from WordNet

5.1.2 POS Tagger
The POS tagger created a tag feature for the words and punctuation neighboring the target word in the Senseval-3 data set. Figure 4 shows the POS tagging of the windowed words of size 9.

Figure 4: POS Tagger on Window

The Brill tagger was built using transformation-based learning (TBL) rules and trained on the Brown corpus tagset. Table 5 shows the corresponding tagged words from the paragraph containing the word activate.

Table 5: Activate with POS Tags

5.1.3 Word Class Assigner
The Word Class Assigner used the Word Exchange Algorithm [36] to determine which words were assigned to individual classes. Figure 5 displays the process by which the neighboring words are assigned to 1 of 10 pre-defined classes.

Figure 5: Word Class Assigner

The neighboring words in the sentence containing the target word activate were each assigned to a particular class, where multiple words can be in the same class. Table 6 shows the output from the Word Class Assigner.

Table 6: Sample Classes for Windowed Words with 10 Classes

5.1.3.1 Word Exchange Algorithm
The Word Exchange Algorithm [36] implemented in the WSD system is an adaptation of Kneser and Ney's work on building class maps in statistical language modeling. It is an iterative algorithm that explores possible class maps by moving each word through all of the classes and evaluating an objective function. Pseudo-code for the algorithm is given as

function wordExchange(W, S, N, maxIts)
    createNClassMaps(N) → C
    randomize(C)
    for i = 1 to maxIts
        for each w in W
            for each c in C
                c → w                        (tentatively place w in class c)
                objectiveFunction(S, C) → f
            w → max(f)                       (keep the class that maximized f)
    return C

function objectiveFunction(S, C)
    return H(C) − H(C | S)

Here, W is the set of words, S is the training data with labeled senses, N is the number of classes desired, and maxIts is the number of iterations to perform. The algorithm returns the class map C, which maps each word to a class. The objective function used is Information Gain, I, which is defined as

I(X; Y) = H(X) - H(X | Y),    (5)

where X and Y are random variables, H(X) is the entropy of X, and H(X|Y) is the conditional entropy of X given Y. The entropies are given by

H(X) = -\sum_{x} p(x) \log p(x)    (6)

and

H(X | Y) = -\sum_{x,y} p(x, y) \log p(x | y).    (7)
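The following Python sketch renders the pseudo-code above with the information-gain objective of (5)-(7) computed from (class, sense) co-occurrence counts. The function and variable names are hypothetical, not the authors' implementation, and the objective is naively recomputed for every candidate move, which is exactly the inefficiency noted in Section 8.

# Sketch of the Word Exchange Algorithm with an information-gain objective.
import random
from collections import Counter, defaultdict
from math import log

def entropy(counts):
    total = float(sum(counts.values()))
    return -sum((c / total) * log(c / total) for c in counts.values() if c)

def objective(S, cmap):
    # Information gain I(C; sense) = H(C) - H(C | sense), as in (5)-(7).
    # S is a list of (word, sense) occurrences; cmap maps word -> class.
    h_c = entropy(Counter(cmap[w] for w, _ in S))
    by_sense = defaultdict(Counter)
    for w, s in S:
        by_sense[s][cmap[w]] += 1
    total = float(len(S))
    h_c_given_s = sum((sum(cnt.values()) / total) * entropy(cnt)
                      for cnt in by_sense.values())
    return h_c - h_c_given_s

def word_exchange(W, S, N, max_its):
    cmap = dict((w, random.randrange(N)) for w in W)  # random initial class map
    for _ in range(max_its):
        for w in W:
            # Try w in every class; keep the assignment with the best score.
            scores = []
            for c in range(N):
                cmap[w] = c
                scores.append((objective(S, cmap), c))
            cmap[w] = max(scores)[1]
    return cmap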

5.1.4 Feature Extractor
The POS-tagged, windowed sentences, along with the word classes for the neighboring words (i.e., excluding the target word), were combined as discrete values in the feature vector. Figure 6 shows the feature vector that is used as input to the Naïve-Bayes classifier.

Figure 6: Feature Vector for Naïve-Bayes Classifier

In this configuration, the surrounding words were given one of the 87 POS tags from the Brown corpus and a specific class number, where the number of classes was varied among 10, 20, and 30 for training the WSD system. This feature vector contained the information necessary to construct the model employed for classification in the testing phase.

5.1.5 Model Trainer
The Naïve-Bayes model was trained on the Senseval-3 data set by collecting counts of the individual discrete feature values given the sense of the target word; these counts were stored for classification in the testing phase of the WSD system.

5.2 Testing
The testing phase operated in a manner very similar to the training phase of the WSD system. Each of the 70 testing example paragraphs from the Senseval-3 data set was partitioned into a window of size 9 containing the target word. The words in the window (i.e., excluding the target word) were tagged with POS information and assigned a word class based on the word class definitions from the training phase. These discrete features were used as input to the Naïve-Bayes classifier, which selected the best sense for a given word from the feature vector through (4). Figure 7 illustrates the basic overview of the testing phase.
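Both phases thus reduce an example to the same discrete feature vector before invoking the classifier. A minimal sketch of that assembly follows; the names are hypothetical, and giving unseen words a reserved class ID is an assumption, since the paper does not say how unknown words were handled.

def make_feature_vector(window, target_index, pos_tags, cmap, unknown=-1):
    # Combine the POS tags and word-class IDs of the target word's
    # neighbors (the target itself is excluded) into one feature vector.
    features = []
    for i, word in enumerate(window):
        if i == target_index:
            continue                              # skip the target word
        features.append(pos_tags[i])              # one of the 87 Brown POS tags
        features.append(cmap.get(word, unknown))  # word-class ID
    return tuple(features)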

Figure 7: WSD Testing Phase

6. EVALUATION
Experiments were performed using the Weka machine learning software package [37]. Two sets of experiments were run using various feature configurations. In the first, 10-fold cross-validation was executed over the training set. In the second, the models were trained on the training data and evaluated on the test data. The Naïve-Bayes classifier was used for all experiments except the baseline, which used the ZeroR classifier; ZeroR assigns the same class to every example, namely the most commonly occurring class in the training set. Three different feature sets were used for the evaluation: the first was based on the word class features described previously, the second used POS tags, and the third combined the two. The results for these feature types in both experiments are shown in Table 7 and Table 8.
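The ZeroR baseline is simple enough to state in a few lines; the experiments used Weka's weka.classifiers.rules.ZeroR rather than this code, which is shown only to make the baseline precise.

from collections import Counter

def zero_r_accuracy(train_senses, test_senses):
    # ZeroR: always predict the most frequent sense in the training set;
    # accuracy is the fraction of test labels equal to that sense.
    majority = Counter(train_senses).most_common(1)[0][0]
    return sum(1 for s in test_senses if s == majority) / float(len(test_senses))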

Table 7: Training Set Accuracy (10-fold Cross-Validation)

Baseline    Word Classes    Tags     Combined
53.8%       61.7%           55.5%    59.6%

Table 8: Training / Testing Set Accuracy

Baseline    Word Classes    Tags     Combined
41.5%       41.3%           44.6%    44.6%

The results of the two experiments disagree on the usefulness of the word class features. The accuracy for these features in the cross-validation experiment was substantially higher than the baseline, but this was not the case when the test set was used for evaluation. This is likely because the word classes were trained over the training set, which caused the cross-validation accuracy using the word classes to behave more like a training accuracy. The words that appear in common with specific senses in the training set evidently do not appear in similar contexts in the test set.

The accuracies shown for the test set were significantly lower than those reported in the literature. This is because of the way examples with multiple senses were handled. Instead of counting a classification as correct if the classifier chose any one of the senses, the examples with multiple senses were split up so that every example had exactly one sense. Consequently, at most one of the split examples could be classified correctly, since identical data carried different labels. This differs from the technique for handling multiple sense labels commonly used in other WSD systems, and was done because it simplified the implementation of the experiments in Weka.

7. CONCLUSIONS
The results obtained in these experiments did not indicate that the derived word classes possessed discriminating power for word sense disambiguation. This may indicate that the words that tend to correlate with specific word senses are not consistent across different data sets. Further study is needed to determine whether this is true or whether the implementation presented here was simply unable to capture trends between words and senses. The tag features improved only a small amount over the baseline. This could be because the tagset used contained too many tags; a smaller number of tags may have allowed for better parameter estimation.

The combination of the two feature sets did not improve accuracy over either feature set alone. This was not very surprising, especially considering that the word class features did not appear to carry much information by themselves.


8. FUTURE WORK
The word classes obtained from the Word Exchange Algorithm did not provide discriminating power. One problem was that a single class tended to contain a disproportionately large number of words compared with the other classes. Other algorithms for assigning class maps could be explored, along with alternative objective functions; possible functions may include stricter constraints on the number of words per class. Larger numbers of classes could also be explored.

Given the number of classes chosen for experimentation, the number of words per class was quite large (on the order of 1,000 per class). Therefore, it was difficult to view any meaningful relationships among words within a class. Running the algorithm with many more classes (several hundred, perhaps) might allow an examination of meaning in some classes. The main obstacle to implementing this was the complexity of the algorithm for this number of words and classes; a more efficient class assignment algorithm is needed.

9. ACKNOWLEDGMENTS
The authors would like to thank Dr. Craig Struble, assistant professor at Marquette University, for his reviews and feedback on the project.

10. REFERENCES
[1] N. Ide and J. Veronis, "Word Sense Disambiguation: The State of the Art," Computational Linguistics, vol. 24, pp. 1-41, 1998.
[2] D. Jurafsky and J. H. Martin, Speech and Language Processing. Upper Saddle River, NJ: Prentice-Hall, Inc., 2000.
[3] "Dictionary.com."
[4] Y. Wilks and M. Stevenson, "The grammar of sense: Is word sense tagging much more than part-of-speech tagging?," University of Sheffield, Sheffield, United Kingdom, 1996.
[5] R. Sproat, J. Hirschberg, and D. Yarowsky, "A corpus-based synthesizer," presented at the Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, 1992.
[6] D. Yarowsky, "Homograph disambiguation in text-to-speech synthesis," in Progress in Speech Synthesis. New York, NY: Springer-Verlag, 1997, pp. 157-172.
[7] C. Connine, "Effects of sentence context and lexical knowledge in speech processing," in Cognitive Models in Speech Processing. Cambridge, MA: The MIT Press, 1990.
[8] S. Seneff, "TINA: A natural language system for spoken language applications," Computational Linguistics, vol. 18, pp. 61-86, 1992.
[9] Wikipedia, the free encyclopedia, "Word sense disambiguation," 2005.
[10] R. Mihalcea, "Senseval web page," University of North Texas, 2004.
[11] G. van Rossum, "Python Programming Language," 2005.
[12] S. Bird and E. D. Loper, "Natural Language Toolkit."
[13] H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English. Providence, RI: Brown University Press, 1967.
[14] E. Brill, "Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging," Computational Linguistics, vol. 21, pp. 543-566, 1995.
[15] W. S. Stolz and P. H. Tannenbaum, "A stochastic approach to the grammatical coding of English," Communications of the ACM, vol. 8, pp. 399-405, 1965.
[16] Z. S. Harris, String Analysis of Sentence Structure. The Hague: Mouton, 1962.
[17] S. Klein and R. F. Simmons, "A computational approach to grammatical coding of English words," Journal of the Association for Computing Machinery, vol. 10, pp. 334-347, 1963.
[18] B. B. Greene and G. M. Rubin, Automatic Grammatical Tagging of English. Providence, RI: Department of Linguistics, Brown University, 1971.
[19] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York, NY: John Wiley and Sons, 1973.
[20] T. Pedersen, "Learning Probabilistic Models of Word Sense Disambiguation," School of Engineering and Applied Science, Southern Methodist University, 1998.
[21] M. Masterman, "Semantic message detection for machine translation using interlingua," presented at the International Conference on Machine Translation of Languages and Applied Language Analysis, Her Majesty's Stationery Office, London, 1961.
[22] A. M. Collins and E. F. Loftus, "A spreading activation theory of semantic processing," Psychological Review, vol. 82, pp. 407-428, 1975.
[23] J. R. Anderson, Language, Memory, and Thought. Hillsdale, NJ, 1976.
[24] J. R. Anderson, "A spreading activation theory of memory," Journal of Verbal Learning and Verbal Behavior, vol. 22, pp. 261-295, 1983.
[25] R. A. Amsler, "The structure of the Merriam-Webster Pocket Dictionary," University of Texas at Austin, Austin, TX, 1980.
[26] A. Michiels, "Exploiting a large dictionary data base," Universite de Liege, Liege, Belgium, 1982.
[27] W. A. Gale, K. W. Church, and D. Yarowsky, "A method for disambiguating word senses in a large corpus," Computers and the Humanities, vol. 26, pp. 415-439, 1993.
[28] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," presented at the Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, 1995.
[29] M. A. Hearst, "Noun homograph disambiguation using local context in large corpora," presented at the Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research, Oxford, United Kingdom, 1991.
[30] M. E. Lesk, "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone," presented at the Proceedings of the Fifth International Conference on Systems Documentation, Toronto, Canada, 1986.
[31] R. J. Mooney, "Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning," presented at the Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, 1996.
[32] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of ACL-95. Cambridge, MA, 1995, pp. 189-196.
[33] H. Schütze, "Dimensions of meaning," presented at the Proceedings of Supercomputing '92, 1992.
[34] H. Schütze, "Automatic word sense discrimination," Computational Linguistics, vol. 21, pp. 97-124, 1998.
[35] "WordNet," Princeton University Cognitive Science Laboratory, 2005.
[36] R. Kneser and H. Ney, "Improved clustering techniques for class-based statistical language modelling," presented at the Proceedings of the European Conference on Speech Communication and Technology, 1993.
[37] "Weka 3 - Data Mining with Open Source Machine Learning Software in Java," The University of Waikato, 2005.

About the authors:

Kevin Indrebo is a doctoral student and GAANN Fellow in the Department of Electrical and Computer Engineering at Marquette University, where he received both his Bachelor's (2002) and Master's (2004) degrees. He works in the Knowledge and Information Discovery Lab, and his research focus is on robust speech recognition.

Jidong Tao is currently a doctoral student in the Department of Electrical and Computer Engineering at Marquette University. His research interests are speech and language processing.

Marek Trawicki received his B.S. and M.S. degrees in electrical engineering from Marquette University in 2001 and 2002 and an M.S. degree in mathematics from the University of Wisconsin-Milwaukee in 2005. He is currently a GAANN Fellow and doctoral student in the Department of Electrical and Computer Engineering at Marquette University, working on the NSF-funded Dr. Dolittle project. His primary research interests are mathematical models, bioacoustics, speech recognition, and language processing.