
This is a contribution from Vitas D., Krstev C. (Eds.) Proceedings of the 29th International Conference on Lexis and Grammar/LGC, pp. 97-105, 2010. © 2010 Faculty of Mathematics at the University of Belgrade. This electronic file may not be altered or reproduced in any way.

WORD SENSE DISAMBIGUATION USING ID TAGS – IDENTIFYING MEANING IN POLYSEMOUS WORDS IN ENGLISH

Nikola Dobrić
Alpen-Adria Universität Klagenfurt

Abstract: Polysemy remains as complicated a problem in contemporary linguistic research as it has been throughout the history of consolidated linguistic research. It is a theoretical issue, as there is still no definite direction in dealing with the problems of sense distinctiveness and of vagueness vs. polysemy, as the insufficiencies of various polysemy tests indicate. It is, however, also a practical issue in contemporary NLP applications, as WSD suffers from significant drawbacks when dealing with highly polysemous words. The paper presents a procedure called behavioral profiling (BP), designed to address both the theoretical and the practical issues.

Key words: behavioral profiling, corpus, word sense disambiguation, ID tag

1. INTRODUCTION

One of the persisting issues in modern lexicography¹ is the identification of the prototypical senses of (especially highly polysemous) words, the degree of sense distinctness and the structure of a given lexical network. The issue arises because polysemous words, or their respective senses, are (as is admittedly the case with all lexical items) strongly influenced by the surrounding syntactic, semantic and discourse context. This causes great difficulties in identifying and listing their exact meanings as used in a particular communicative situation. Given the many problems and dictionary insufficiencies in the treatment of polysemy, which range from the variety of approaches applied to sense identification (Geeraerts, 1993; Cruse, 1986; Kilgarriff, 1997) to differing editorial practices (as influenced by the factors described above), the broader goal of this paper (and of the more encompassing ongoing research project from which it stems) is to explore the nature of meaning and the possible theories about its coding and decoding. A more practical goal of the paper (and of the project) is to explore a method of solving such problems by bringing together lexicography and computational linguistics, as proposed by some authors, most notably Gries (2006), a method that would find applications in natural language processing (NLP) and word sense disambiguation (WSD). The emphasis of the task is on the cognitive approach to sense identification (Geeraerts, 2001; Zima-Smith, 1993; Taylor, 1995; Palmer, 1996; Dirven and Verspoor, 1998), accompanied by a strong background in corpus linguistics (Biber, 1993; Dobrić, 2009a) and, at a later stage, computational linguistics (Vitas, 2007). The major linguistic issue is that of the prototypicality of meaning and the construction of lexical network structures, with proper connections between the senses within them.
A further problem, still not fully addressed in the literature, is the alignment of such lexical networks with the corresponding factors which influence their given interpretations. With the help of a comprehensive corpus-based analysis, the paper presents one of the steps in such an analysis, namely a cognitive and sociosemantic (Teubert, 2010) breakdown of the many senses of polysemous words using the lexeme’s immediate context in the identification procedure. The paper exemplifies a system which would investigate the problem of polysemy from the theoretical side, examining possible solutions for prototypicality and sense distinctiveness, but also from a very practical side, offering tentative steps towards the ultimate solution of WSD for the purposes of NLP.

¹ As well as in the development of machine translation tools, information retrieval, information extraction, intelligent human-computer interfaces, question answering, bioinformatics and applied linguistics.

2. BEHAVIORAL PROFILING

The major reason for applying the approach presented in the paper is the author’s belief that the main path to fully disambiguating words, both in our mental processes and in NLP, must lie in taking account of all the relevant factors that constitute meaning (including encyclopedic, cognitive and contextual factors), both practically and theoretically. The impetus for exploring alternative means of disambiguating polysemous words and of clearly distinguishing polysemy from vagueness also comes from the unreliability of various polysemy tests (such as the logical test or the definitional test²), which produce different results when testing the ambiguity of the same lexical item (Geeraerts, 2010: 196–199). The procedure proposed in this paper, which could solve such issues, is called behavioral profiling (BP), and it is achieved by the implementation of quantitative corpus methodology (Berez and Gries, 2008) on both the theoretical and the practical plane. It combines semantic and syntactic information, supports the structuring of meaning with sound quantitative data, and considers the distinction of senses a matter of (quantitatively) measurable degree (at least in linguistic units of measurement). In essence, behavioral profiling combines corpus methodology as its practical orientation with a cognitive and sociosemantic theoretical orientation.
The theoretical background of the procedure can be found in the combination of what is basically a (manual) cognitive approach to sense identification (in the initial steps of the procedure) and a sociosemantic conception of meaning, in its more diluted form³ (regarding the observation of the context as generating sense recognition). The reference to cognitive semantics comes from the apparent link between cognitive and corpus means of sense analysis, stemming from the idea that different senses are motivated not only experientially and frame-semantically but also with respect to the lexical networks they appear in, thus justifying the inclusion of lexico-grammatical structures in linking the senses of polysemous words (Gibbs and Matlock, 2001). The choice of lexico-grammatical constructions as a basis for sense disambiguation is further supported by the stated reliance on the sociosemantic approach to meaning, which insists on the context being the single most important factor in generating any given individual meaning⁴. The practical, methodological orientation stems above all from the need for an empirically sound base for polysemy testing, but also from the need for quantitatively significant, representative data, both for the procedures of ID tags and for their relevant computational representations. By combining these methodological and theoretical considerations we can (hopefully) reach the desired effects of full and reliable (automatic) polysemy testing and disambiguation.

² For references on polysemy tests see Fillmore and Atkins, 2000.
³ This discourse-centred approach has been applied to behavioral profiling in its diluted form, seeing that the analysis does not look at the entirety of human discourse, as contemporary sociosemantic and discourse analysis approaches propose, but rather at the more immediate and computationally manageable context of the sentence or utterance in which a given lexical item appears.
⁴ It is important to note here that even though the author will mostly be referring to contemporary papers and research, the ideas of both the cognitive conception of meaning and the importance of context in sense identification stem from the very beginnings of consolidated lexical semantic research (cf., for instance, Bréal, 1897; Erdmann, 1910; Esnault, 1925; Paul, 1920; Meillet, 1921 [1906]).

The procedure of behavioral profiling of lexemes is carried out in several steps (Gries, 2006). First, we extract a full KWIC (key word in context) concordance for a given lexical item from a representative corpus (the context consisting of at least the whole sentence in which the item appears) and, when researching polysemy, identify all of the possible senses of the given word⁵. Then comes the tedious process of (so far) manual analysis of the morphosyntactic and semantic features of the given lexical item and its use⁶, each of the analyzed features representing an ID tag. The next step is to create a table that gives a clear insight into the frequency of each ID tag and the frequency of its co-occurrence with each sense (when investigating polysemy), the percentages of which then constitute a sense’s (and, taken together, the word’s) behavioral profile. The table needs to be tested by suitable statistical procedures⁷ before a final judgment on the whole procedure can be made. The last step, which will not be presented in this paper, is the ultimate practical goal of the procedure: it would consist of tagging a given corpus for the attested ID tags, which should (in a sufficiently high percentage of cases) make it possible to identify senses on the basis of the co-occurring ID tags previously processed. Further research and more comprehensive tagging procedures should ensure such sense identification for the lexical item in question in any corpus tagged for the given ID tags. Published work on BP has mostly been applied to synonymy and antonymy (Divjak, 2006; Gries and Divjak, 2009); far less of it deals with polysemy and, moreover, the studies published on polysemy deal predominantly with verbs and/or prepositions (Gries, Hampe, and Schönefeld, 2005; Divjak and Gries, 2006). This paper, by contrast, presents just a very tentative example of behavioral profiling applied to the noun bachelor (a frequent example in lexical semantics).
Even though the given word is not very polysemous (having only five senses in its noun form as listed in NODE, and only two attested in the small corpus sample), it was chosen for its symbolic, recurrent appearance in lexical semantic research (Katz and Fodor, 1963; Bolinger, 1965; Fillmore, 1982; Lakoff, 1987; Geeraerts, 2010) and should be seen exclusively as an example of the BP procedure being applied rather than as conclusive research on the full BP of the lexeme.

3. DATA

⁵ The sense identification procedure in the initial stage is done manually; in order to ensure objectivity, native speaker intuition is combined with existing dictionary and WordNet data. The completed BP can, however, provide us with a different layout of senses.
⁶ The features that can be looked into in this step are as follows:
(1) morphological features of the given word form: (a) for verbs: tense, aspect, and voice; (b) for nouns: singular, plural; countable, uncountable, collective; possessive form; abstract vs. concrete; human, animate, concrete countable objects, concrete mass nouns, machines, abstract entities, organizations/institutions, locations, quantities, events, processes etc.; (c) for adjectives: compositionality vs. non-compositionality; derivational (denominal, deverbal, derived) vs. non-derivational; (d) for adverbs: time, manner, frequency, degree, comment; (e) for prepositions: place and position, direction, time, manner, agent, accompaniment, purpose, association, measure, similarity;
(2) the syntactic properties of (a) the given word: for verbs: ; for nouns: definite vs. indefinite reference; subject vs. direct object vs. indirect object; for adjectives: basic vs. event vs. object; for adverbs: complementing verbs, complementing prepositions, modifying verbs, modifying adjectives, modifying adverbs, modifying nouns; for prepositions: transitive (NP or PP complement), intransitive, ditransitive; (b) the clause the given word form occurs in: intransitive vs. transitive vs. complex transitive use of verbs, declarative vs. interrogative vs. imperative sentence form, main clause vs. subordinate clause (e.g. regular subordinate clause with or without subordinator, relative clause with or without relative pronoun);
(3) semantic characteristics of (a) the given word: idiomatic use or not; adverbs modifying speech acts, modifying sentences, modifying subjects, modifying VPs, modifying nouns; for verbs: state verbs, process verbs, action verbs, action-process verbs, experiential vs. benefactive vs. locative; (b) the referents of the elements co-occurring with the given word: its subjects/heads, objects and complements (which were coded, e.g., as human, animate, concrete countable objects, concrete mass nouns, machines, abstract entities, organizations/institutions, locations, quantities, events, processes etc.);
(4) the instance’s collocates in the same clause;
(5) a paraphrase of the given word’s meaning in the citation and the semantic roles of the surrounding words.
⁷ In this case the statistical procedure used was hierarchical cluster analysis, but the procedures may vary depending on the data and the corpus at hand.

The small study presented in this paper is based on 167 randomly selected⁸ instances (out of a total of 3721 found in the Corpus of Contemporary American English) of bachelor in all of its inflectional forms⁹. Every instance is observed within its sentence context. The first procedure to be implemented, as described above, is to identify and categorize the senses. The basis for sense identification was the definition of bachelor provided by the New Oxford Dictionary of English (NODE¹⁰). The senses identified in NODE are presented in Figure 1 below, while the ones attested in the sample are accompanied by an example¹¹.

Figure 1. Attested senses of bachelor¹².
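The sampling step described above can be illustrated with a minimal sketch. It assumes that the 3721 concordance hits are simply numbered from 1 to 3721 and, unlike the study itself (which drew on published random number tables), it uses Python's pseudo-random generator; the function name `draw_sample` and the seed value are hypothetical.

```python
import random


def draw_sample(total_hits: int, sample_size: int, seed: int = 2010) -> list[int]:
    """Return the sorted 1-based indices of the concordance lines to keep."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # Sampling without replacement: no concordance line is picked twice.
    return sorted(rng.sample(range(1, total_hits + 1), sample_size))


# Figures from the paper: 167 instances out of 3721 hits.
sample_indices = draw_sample(total_hits=3721, sample_size=167)
print(len(sample_indices), sample_indices[:5])
```

Fixing the seed is a design choice made here for replicability; any other documented source of randomness would serve the same purpose.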

The clearly prototypical sense of the word is ‘a person who holds a first degree from a university or other academic institution’, as it is the earliest attested, the most frequent and (apart from the archaic and almost non-existent ‘a young knight serving under another’s banner’) etymologically the oldest (OED). It is also the one that can be zero-derived into an adjective and, apart from being the most frequent (113 instances to 54), it also exhibits the highest variety of ID attributes (1935 ID instances to 922). Having seen how BP can help in resolving prototypicality, we must consider the other theoretical problem, the distinctiveness of senses (lumping vs. splitting or, more generally, vagueness vs. polysemy). The BP procedure that follows may demonstrate how the issue can be clearly and empirically resolved (even though in the case of ‘bachelor’ it may not pose a significant problem). Hence the next stage, both the most important and the most difficult, is identifying all the possible ID tags. The ones used in the analysis were as follows:
− morphological features: plural/singular; possessive/of genitive/ellipsis; simple/compound; countable/uncountable;
− semantic characteristics of the lexeme: human/animal/concrete/machine/location/quantity;
− syntactic properties of the clause it appears in: subject/object; transitive/intransitive verb; tense/voice of the verb; main/subordinate;

− morpho-semantic characteristics of co-occurring elements: subject/object of the clause as being plural/singular; possessive/of genitive/ellipsis; simple/compound; countable/uncountable; human/animal/concrete/countable/uncountable/machine/location/quantity; and
− R1 and L1 collocates.

Each data point would then be available for tagging for 33 morphological, semantic and syntactic ID tags and 63 R1 and L1 collocates. All of the ID tags mentioned are presented in Table 1.

⁸ The sample was randomly selected using the tables in Kolmogorov, 1963.
⁹ The reason why only a small number of randomly selected instances was chosen is that the paper does not aim to present an exhaustive account of the analysis of bachelor (due to space and time limitations) but rather to illustrate the procedure and its applicability.
¹⁰ This dictionary was chosen as the reference due to its close connection to the theoretical background of the procedure, being based on a cognitive approach to meaning.
¹¹ Compare the sense analysis to the discussion of bachelor by Katz and Fodor (1963), Bolinger (1965), Fillmore (1982), Lakoff (1987) and Geeraerts (2010).
¹² The given figure should primarily be seen only as a representational semantic network rather than an attempt to show accurately how one sense stems from another.

Morphological features
1. Lexeme: singular 157; plural 10; possessive 87; of genitive 13; ellipsis 3; simple 75; compound 92; countable 167; uncountable 0
2. Co-occurring S/O: singular 95; plural 26; possessive 0; of genitive 0; ellipsis 0; simple 90; compound 12; complex 18; countable 120; uncountable 9; expletive 1

Semantic features
1. Lexeme: human 52; animate 52; inanimate 115; concrete 52; abstract 115; location 0
2. Co-occurring S/O: human 105; animate 405; inanimate 20; pronoun 41; location 2; proper 25; concrete 114; abstract 5

Syntactic features
1. Lexeme: subject 16; object 101; complement of PP 33; complement of clause 7
2. Co-occurring S/O: subject 113; object 7
3. Clause: verb transitive 101; verb intransitive 12; tense Present Simple 35; tense Past Simple 63; tense Present Perfect 13; tense Past Perfect 3; tense Future Simple 2; tense Future Prog. 1; tense Pres. Progressive 2; imperative 2; tense would+infinitive 1; present participle 13; copula 29; infinitive 6; tense modal present 2; tense modal past 1; voice active 131; voice passive 6; clause main 90; clause subordinate 55

Collocates
1. R1 collocates: degree(s) 81; PP 17; level 1; thesis 1; apartment 1; larder 1; auction 1; remember 1; fantasy 1; meal 1
2. L1 collocates: balding 1; unemployed 1; be 2; unbecoming 1; this 1; perennial 1; receive 11; eternal 1; earn 23; award 1; take 1; have 9; year 2; offer 1; obtain 1; attain 1; eccentric 2; certified 2; kind 1; wealthy 1; happy 1; committed 1; energetic 1; my 6; his 8; her 4; their 1; PP 18; get 3; complete 2; achieve 1; hold 5; pursue 1; education 10; eligible 3; young 1; old 6; remain 1; attain 1; life-long 3; shy 1; devout 1; divorced 1; hot 2

Table 1. Attested ID tags.
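Table 1 reports only the overall frequency of each ID tag. The per-sense percentages that make up a behavioral profile could be derived from the tagged instances roughly as sketched below. The annotated rows, the sense labels and the function name `behavioral_profiles` are hypothetical and serve only to illustrate the data format; they are not the data of this study.

```python
from collections import Counter, defaultdict

# Hypothetical annotated concordance lines: (sense label, set of ID tags).
# These toy rows are NOT the paper's data; they only illustrate the format.
instances = [
    ("degree", {"singular", "possessive", "compound", "abstract"}),
    ("degree", {"singular", "possessive", "compound", "abstract"}),
    ("degree", {"singular", "compound", "abstract"}),
    ("degree", {"plural", "simple", "abstract"}),
    ("unmarried", {"singular", "simple", "human"}),
    ("unmarried", {"plural", "simple", "human"}),
]


def behavioral_profiles(rows):
    """Map each sense to {ID tag: percentage of that sense's instances}."""
    sense_totals = Counter(sense for sense, _ in rows)
    tag_counts = defaultdict(Counter)
    for sense, tags in rows:
        tag_counts[sense].update(tags)  # count each tag once per instance
    all_tags = sorted({tag for _, tags in rows for tag in tags})
    return {
        sense: {tag: 100.0 * tag_counts[sense][tag] / total for tag in all_tags}
        for sense, total in sense_totals.items()
    }


profiles = behavioral_profiles(instances)
for sense, profile in profiles.items():
    print(sense, {tag: round(pct, 1) for tag, pct in profile.items()})
```

Each resulting row is a vector of percentages over the same set of tags, which is the form of input a subsequent statistical procedure such as cluster analysis would expect.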

4. FINDINGS

The results obtained were then analyzed quantitatively. The co-occurrence table presented above was entered into a hierarchical cluster analysis¹³. Even though the results are purely illustrative and very noisy (even more so than a normal corpus sample would be, due to the sample’s small size and poor representativeness), the dendrogram does point to some interesting findings. The first major branching (though indicating a loose interaction structure) shows a point of division along the simple vs. compound ID tag. Next, on the right-hand side, the second major branching shows grouping along the main vs. subordinate tag, while the next fork on the same side is indicative of the possessive vs. non-possessive ID tag. On the left-hand side, which groups mostly the ‘a man who is not and has never been married’ senses, we can see a significant division at the point of the singular vs. plural ID tags, followed by smaller (and more strongly interacting) tags. Other observations made on Figure 2 can also be of use. The features which are most indicative and most relevant to WSD (i.e. compound vs. simple) show the strongest and widest branching, while the less frequent ones form smaller and more specific branching nodes. This kind of clustering of attested ID tags is very insightful, as it points to category structure and to the respective predictive strength(s) of the given ID tags. The dendrogram also sheds light on the theoretical issue mentioned earlier as very relevant, that of sense distinctness. By analyzing the interaction and grouping of ID tags around a particular sense node we can clearly distinguish between a unique sense and vagueness of meaning (see Gries, 2006: 77–79). It is important to reiterate that, bearing in mind the size of the corpus sample, the results can only be seen as illustrative, and more detailed and representative studies (such as the one planned in the ongoing project) will surely show somewhat different and more accurate results (especially where sense distinctness is concerned).

¹³ The hierarchical cluster analysis was performed using SPSS V.16.
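For readers without access to SPSS, the kind of analysis described above can be approximated with a short agglomerative clustering routine. The sketch below is an assumption-laden illustration: the source does not state which distance measure or linkage method was used, so Euclidean distance and average linkage are chosen here arbitrarily, and the tag vectors (percentages of co-occurrence with two senses) are invented.

```python
from itertools import combinations
from math import dist

# Hypothetical input: each ID tag described by its relative frequency (in %)
# with two senses. The numbers are invented for illustration only.
tag_vectors = {
    "compound": (80.0, 2.0),
    "possessive": (75.0, 5.0),
    "simple": (20.0, 98.0),
    "human": (0.0, 96.0),
    "plural": (4.0, 10.0),
}


def average_linkage(a, b, vectors):
    """Mean Euclidean distance between all cross-cluster pairs of tags."""
    return sum(dist(vectors[x], vectors[y]) for x in a for y in b) / (len(a) * len(b))


def agglomerate(vectors):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        height = average_linkage(a, b, vectors)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, round(height, 2)))
    return merges


merges = agglomerate(tag_vectors)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

The merge history corresponds to the forks of a dendrogram read from the bottom up: tags that merge at a small distance behave alike across senses, while late merges mark the major branchings.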


Figure 2. Cluster analysis dendrogram¹⁴.

¹⁴ Only the ID tags with a frequency over 10 were included due to their perceived statistical significance (Dobrić, 2009b).

5. FURTHER STEPS

The practical application of the procedures presented in the paper is found in the reversal of the perspective of the approach. The idea behind WSD here is no longer to look at ID tags as conditionally dependent on word senses but at word senses as dependent on the conditional probability of ID tags (Gries, 2006). In practice, a given piece of sense recognition software would be provided with the frequency of a given sense and equipped with the computational ability to recognize ID tags in an appropriately tagged corpus. By doing so it would identify the distinctive sense(s) based on the input correlation data. The problems with this approach would be the less frequent and, by default, less distinctive ID tags, which would not help a great deal with sense identification. Another issue concerns the ID tags which are shared by more than one sense, whose individual indicating strength is reduced regardless of their frequency. But with sufficiently frequent and distinctive ID tags the case might appear quite different. If we consider the example of the ‘a man who is not and has never been married’ sense of bachelor, we can see how the model would work:
− 32% (54) of all the attested instances of ‘bachelor’ are of this sense, which automatically gives the would-be software a 32% probability of hitting the right sense in any number of citations;
− 31% (52) of all the instances of the word are instances of this sense tagged as animate, concrete and human, which raises the predictability level to 63%;
− 30% (50) of all the examples of ‘bachelor’ are simple, non-possessive instances, bringing the software to 93% sense recognition accuracy.

It is true that the procedure becomes more complicated with more polysemous words (exhibiting dozens of senses and sharing more ID tags, making them less distinctive), but a system based on the combination of the most frequent ID tags with the less frequent ones, together constituting a context from which the software could draw its computational conclusions, would present a significant improvement on the AWSD procedures employed in NLP today, as well as bring definite answers to the theoretical issues still surrounding polysemy.

References:
Atkins, Beryl T. Sue 1987. Semantic ID tags: corpus evidence for dictionary senses. Proceedings of the Third Annual Conference of the UW Centre for the New Oxford English Dictionary, pp. 17–36.
Berez, Andrea and Gries, Stefan Th. 2008. In defense of corpus-based methods: A behavioral profile analysis of polysemous get in English. Proceedings of the 24th NWLC, pp. 157–167.
Biber, Douglas 1993. Co-occurrence Patterns Among Collocations: A Tool for Corpus-based Lexical Knowledge Acquisition.
Computational Linguistics 19, pp. 531–538.
Bolinger, Dwight 1965. The atomization of meaning. Language 41, pp. 555–73.
Bréal, Michel 1897. Essai de sémantique: science des significations. Paris: Hachette.
Cruse, Alan 1986. Lexical semantics. Cambridge: Cambridge University Press.
Dirven, René and Verspoor, Marjolijn 1998. Cognitive exploration of language and linguistics. Amsterdam/Philadelphia: John Benjamins.
Divjak, Dagmar 2006. Ways of intending: Delineating and structuring near synonyms. In Gries, S. and Stefanowitsch, A. (eds.) Corpora in cognitive linguistics: corpus-based approaches to syntax and lexis, pp. 19–56, Berlin/New York: Mouton de Gruyter.
Dobrić, Nikola 2009a. Korpusni pristup kao nova paradigma istraživanja jezika. Philologia 6, pp. 31–41.
Dobrić, Nikola 2009b. Savremene tendencije u lingvostatistici. Statistička revija 58 (1-2), pp. 45–50.
Erdmann, Karl 1910. Die Bedeutung des Wortes: Aufsätze aus dem Grenzgebiet der Sprachpsychologie und Logik. Leipzig: Avenarius.
Esnault, Gaston 1925. Métaphores occidentales: essai sur les valeurs imaginative concrètes du français parlé en Basse-Bretagne comparé avec les patois, parlers techniques et argots français. Paris: Presses Universitaires de France.
Fillmore, Charles 1982. Towards a descriptive framework for spatial deixis. In Jarvella, R. J. and Klein, W. (eds.) Speech, Place, and Action: Studies of Deixis and Related Topics, pp. 31–59, Chichester: Wiley.
Fillmore, Charles and Atkins, Beryl T. Sue 2000. Describing polysemy: the case of crawl. In Ravin, Y. and Leacock, C. (eds.) Polysemy: Theoretical and Computational Approaches, pp. 91–110, Oxford: Oxford University Press.
Geeraerts, Dirk 1993. Vagueness’s puzzles, polysemy’s vagaries. Cognitive Linguistics 4, pp. 223–272.
Geeraerts, Dirk 2001. The definitional practice of dictionaries and the Cognitive Semantic conception of polysemy. Lexicographica 17, pp. 6–21.
Geeraerts, Dirk 2010. Theories of lexical semantics. New York: Oxford University Press.
Gibbs, Raymond and Matlock, Teenie 2001. Psycholinguistic perspectives on polysemy. In Cuyckens, H. and Zawada, B. (eds.) Polysemy in Cognitive Linguistics, pp. 213–239, Amsterdam/Philadelphia: John Benjamins.
Gries, Stefan Th., Hampe, Beate and Schönefeld, Doris 2005. Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics 16, pp. 635–676.
Gries, Stefan Th. 2006. Corpus-based methods and cognitive semantics: The many senses of to run. In Gries, S. and Stefanowitsch, A. (eds.) Corpora in cognitive linguistics: corpus-based approaches to syntax and lexis, pp. 57–99, Berlin/New York: Mouton de Gruyter.
Gries, Stefan Th. and Divjak, Dagmar 2009. Behavioral Profiles: a corpus-based approach to cognitive semantic analysis. In Evans, V. and Pourcel, S. (eds.) New Directions in Cognitive Linguistics, pp. 57–75, Amsterdam: John Benjamins.
Katz, Jerrold and Fodor, Jerry 1963. The structure of a semantic theory. Language 39, pp. 170–210.
Kilgarriff, Adam 1997. I don’t believe in word senses. Computers and the Humanities 31, pp. 91–113.
Kolmogorov, Andrey 1963. On tables of random numbers. Sankhyā: The Indian Journal of Statistics 25, pp. 369–376.
Lakoff, George 1987. Women, Fire and Dangerous Things: What Categories Reveal about the Mind. Chicago: University of Chicago Press.
Meillet, Antoine 1921 [1906]. Comment les mots changent de sens. In Linguistique historique et linguistique générale. Paris: Champion.
Palmer, Gary 1996. Toward a Theory of Cultural Linguistics. Austin: University of Texas Press.
Paul, Hermann 1920. Prinzipien der Sprachgeschichte. Halle: Niemeyer.
Simpson, John and Weiner, Edmund 1989. Oxford English Dictionary, 2nd Edition. Oxford: Oxford University Press.
Soanes, Catherine and Stevenson, Angus 2001. New Oxford Dictionary of English. Oxford: Oxford University Press.
Taylor, John 1995. Linguistic categorization. Prototypes in linguistic theory. Oxford: Oxford University Press.
Teubert, Wolfgang 2010. Meaning, Discourse and Society. Cambridge: Cambridge University Press.
Vitas, Dusko et al. 2007. Towards a Complex Model for Morpho-Syntactic Annotation. In Paskaleva, E. and Slavcheva, M. (eds.) Proceedings of the Workshop on a Common Natural Language Processing Paradigm for Balkan Languages, pp. 65–71, Bulgaria: Borovets.
Zima-Smith, Veronica 1993. Retrieving words from memory. In Delanoy, W., Köberl, J. and Tschachler, H. (eds.) Experiencing a Foreign Culture, pp. 239–256, Tübingen: Narr Francke Attempto Verlag GmbH + Co. KG.

Sources:
Corpus of Contemporary American English [accessed 10th June 2010] http://www.americancorpus.org/
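As a closing illustration of the reversed perspective outlined in Section 5 (senses conditioned on ID tags), the following sketch shows one possible way of turning sense frequencies and tag probabilities into a sense decision. It deliberately swaps in a naive Bayes estimate for the additive percentages of the worked example, which is an assumption of this sketch and not the model proposed in the paper. The sense priors are the sample frequencies reported above (54 and 113 out of 167); the tag likelihoods and the function name `sense_posteriors` are hypothetical.

```python
from math import prod

# Sense priors taken from the sample reported in the paper (54 vs. 113 of 167).
priors = {"unmarried man": 54 / 167, "degree holder": 113 / 167}

# P(tag | sense): hypothetical values chosen for illustration, not measured.
likelihoods = {
    "unmarried man": {"human": 0.95, "simple": 0.90, "possessive": 0.05},
    "degree holder": {"human": 0.02, "simple": 0.20, "possessive": 0.75},
}


def sense_posteriors(observed_tags):
    """Naive Bayes: P(sense | tags) is proportional to P(sense) * product of P(tag | sense)."""
    scores = {
        sense: priors[sense] * prod(likelihoods[sense][tag] for tag in observed_tags)
        for sense in priors
    }
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {sense: score / total for sense, score in scores.items()}


posteriors = sense_posteriors(["human", "simple"])
best_sense = max(posteriors, key=posteriors.get)
print(best_sense, round(posteriors[best_sense], 3))
```

With no observed tags the function falls back on the sense frequencies alone, which mirrors the first step of the worked example; each additional distinctive tag then sharpens the decision.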