
Word Sense Disambiguation as a Traveling Salesman Problem

Kiem-Hieu Nguyen · Cheol-Young Ock

Cheol-Young Ock
School of Computer Engineering and Information Technology, University of Ulsan, Korea
E-mail: [email protected]


Abstract Word Sense Disambiguation is a difficult problem in Computational Linguistics, mostly because of the commitment to a fixed sense inventory and the fine granularity of senses. This paper formulates Word Sense Disambiguation as a variant of the Traveling Salesman Problem, maximizing the overall semantic relatedness of the context to be disambiguated. Ant Colony Optimization, a robust nature-inspired algorithm, is used in a reinforcement learning manner to solve the formulated Traveling Salesman Problem. We propose a novel measure based on the Lesk algorithm and the Vector Space Model to calculate semantic relatedness. Our approach to Word Sense Disambiguation is comparable to state-of-the-art knowledge-based and unsupervised methods on benchmark datasets. In addition, we show that a combination of knowledge-based methods beats the most frequent senses heuristic and significantly shortens the distance to the best-performing supervised methods. The proposed approach could be customized for other lexical disambiguation tasks, such as Lexical Substitution or Word Domain Disambiguation.

Keywords Traveling Salesman Problem · Ant Colony Optimization · Word Sense Disambiguation · semantic relatedness · Lesk algorithm

1 Introduction

Word Sense Disambiguation (WSD) in Computational Linguistics is the task of automatically assigning senses to words in a context using a predefined sense inventory. The sources of ambiguity are homonymy, i.e. words having the same spelling and pronunciation but different senses, and polysemy, i.e. words having multiple senses, usually with subtle differences. Homonymy is trivial to disambiguate because the domains of the different senses are distinct. E.g., the noun “bank” can refer to “the sloping raised land, especially along the sides of
a river” in the sentence “By the time we reached the opposite bank, the boat was sinking fast”, or to “an organization where people and businesses can invest or borrow money, change it to foreign money, etc. or building where these services are offered” in “I had to take out a bank loan to start my business” (Cambridge Advanced Learner's Dictionary). Polysemy is far more difficult because of the subtle differences and the common origin of the senses. E.g., the noun “cold” can refer to “a mild viral infection involving the nose and respiratory passages” in the question “Will they never find a cure for the common cold?”, to “the absence of heat” in “Cold is a vasoconstrictor”, or to “the sensation produced by low temperatures” in “The cold helped clear his head” (WordNet 3.1). One could consult Ide and Véronis (1998) and Navigli (2009) for surveys of the WSD literature. Supervised methods, boosted by classification and feature optimization techniques from machine learning, achieve state-of-the-art performance (Snyder and Palmer (2004), Navigli et al (2007)). Training the models, meanwhile, necessitates labor-intensive data annotation. On the other hand, as supervised methods are based on word experts, i.e. each word expert disambiguates the instances of an individual word, their scalability for domain adaptation and practical applications is limited. Knowledge-based methods overcome these two drawbacks by using no or little training data and by processing the ambiguous text in an all-words fashion. However, because no training data are used, knowledge-based methods in general perform worse than supervised methods and have not yet fully beaten the most frequent senses heuristic (Snyder and Palmer (2004), Navigli et al (2007)). By exploiting relations from WordNet and Wikipedia, Ponzetto and Navigli (2010) showed that, when disambiguating ambiguous nouns separately, a knowledge-based method beats the most frequent senses heuristic and rivals supervised methods in a specific domain.

Most supervised and knowledge-based methods cast WSD as a ranking problem. The rank of each candidate sense is calculated against the context surrounding the ambiguous word, and the sense with the highest rank is finally assigned to the ambiguous word. Another approach, which has received less attention in contemporary work, is to cast WSD as a combinatorial optimization problem. Each combination of senses of all the words in the context is scored against a cost function, and the combination with the best cost value is selected as the final result. In general, the cost function is chosen based on the assumption that the overall semantic relatedness of the word senses in the context should be maximized. Different strategies have been applied to optimize this cost function, from brute-force search (Pedersen et al (2005)) to nature-inspired methods like Simulated Annealing (Cowie et al (1992)), Genetic Algorithms (Zhang et al (2008)), or Ant Colony algorithms (Schwab and Guillaume (2011)).

The first contribution of this article is the formulation of WSD as a variant of the Traveling Salesman Problem (TSP). Since its proposition in 1930 as a mathematical problem, TSP has been intensively studied, with a profound theoretical background and various approaches; Laporte (1992) provides a comprehensive survey of the problem. We applied the Ant Colony Optimization (ACO) algorithm proposed in Dorigo and Gambardella (1997), with some modifications, to solve the variant of TSP. Dorigo and Gambardella (1997) showed that, among heuristics and nature-inspired approximation methods, ACO is robust and suitable for TSP.


ACO, in addition, naturally fits the parallel paradigm, which speeds up execution in large-scale applications. The second contribution is a novel Lesk-based semantic relatedness measure using the Vector Space Model (Buckley (1985)). Whereas previously published works in knowledge-based WSD suffer from insufficient semantic relatedness measures, our work was motivated by the positive results achieved in Ponzetto and Navigli (2010) and Schwab and Guillaume (2011) and by the availability of rich knowledge sources. Last but not least, we empirically show that the combination of knowledge-based methods beats the most frequent senses heuristic, which has been a difficult baseline for knowledge-based methods, and approaches the best-performing supervised methods.

The paper is organized as follows. This first section has introduced related works in the literature and our motivation. The next section loosely presents TSP and the use of ACO to solve it. In the third section, we describe our formulation of WSD as a TSP. Experimental results are shown and discussed in the fourth section. The paper ends with some conclusions and future works.

2 Traveling Salesman Problem using Ant Colony Optimization In this section, we first introduce the basic idea of TSP. We next describe how natural ants search for food sources mainly based on the laying down and evaporation of pheromone. We then present the formulation of ACO proposed in Dorigo and Gambardella (1997) to solve this problem. We made some modifications and extensions to the TSP and ACO to fit our WSD model.

2.1 Traveling Salesman Problem

TSP was first formally proposed as a mathematical problem in 1930. In brief, the problem is stated as: “Given a finite set of cities and their positions on a map, find the shortest tour that visits each city once”. When the number of cities is small, a brute-force search algorithm, i.e. evaluating all possible combinations of cities, works well in an acceptable time. However, the number of solutions increases exponentially as the number of cities increases. This motivates heuristics and approximation methods, including ones inspired by natural phenomena. An important characteristic of these methods is that they attempt to find a “good enough” solution instead of the globally optimal solution, in favor of processing time and computational complexity.
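To make the growth concrete, the following Python sketch (illustrative only, not part of the original work) enumerates all tours by brute force; the (n − 1)! candidate tours make this approach impractical beyond a handful of cities.

import random
from itertools import permutations
from math import dist, factorial

def brute_force_tsp(cities):
    """Return the shortest closed tour over a small list of (x, y) cities."""
    start, rest = cities[0], cities[1:]
    best_tour, best_len = None, float("inf")
    for order in permutations(rest):                    # (n-1)! candidate tours
        tour = (start,) + order + (start,)
        length = sum(dist(a, b) for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

cities = [(0, 0), (1, 5), (4, 2), (6, 6), (3, 1)]
print(brute_force_tsp(cities)[1])                       # fine for 5 cities ...
print(factorial(29))                                    # ... hopeless for 30 cities: (30-1)! tours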

2.2 Ant Colony Phenomenon

Fig. 1 illustrates the phenomenon. a) Assume that ants follow two paths from the nest to the food source; path A is shorter than path B. b) Initially, ants follow the two paths with the same probability. Ants come back to the nest after reaching the food source and lay down pheromone trails along their path. c) At the decision points (i.e. the nest and the food source), an individual ant decides to follow the path with the higher pheromone density.


Fig. 1 Ants find the shortest path from a nest to a food source using indirect communication based on the laying down and evaporation of pheromone trails.

Since path A is shorter than path B, within a given period of time t, path A will be traveled more times than path B, hence the higher pheromone density on path A. As t increases, ants tend to follow path A more and more, while the pheromone trails on path B evaporate. d) Finally, ants only follow path A.

2.3 Ant Colony Optimization for the Traveling Salesman Problem

In comparison with Dorigo and Gambardella (1997), our formulation differs in two main aspects, one concerning ACO and the other concerning TSP itself. Firstly, pheromone is laid down at the vertices instead of the edges of the graph. This modification was aimed at using TSP to calculate vertex centrality, besides the main purpose of finding the shortest tour. Secondly, a tour does not visit each city exactly once. This modification results from the WSD assumption that a word can have only one sense in a context: if an ant visits a sense of a word w, it is also considered to have visited the other senses of w. The effects of these modifications on the results of WSD are further discussed in Sections 4.3 and 4.5.

Definition 1 (Labeled graph) Given a graph G = (V, E) and a weighting function w : E → R∗ that assigns a positive real weight to each edge in G, the pair

(G, w)        (1)

is called a labeled (or weighted) graph.

Definition 2 (Artificial ant) An artificial ant has the following properties:
1. Global memory storing information about the current path. Global memory is reset every time an artificial ant completes a route (we use route instead of tour because, in our formulation of WSD, an ant does not visit all the vertices of the graph to complete a solution) from the nest to the food source or vice versa.
2. Local memory storing information about the adjacent vertices.
and the following behaviors:


1. Move from a vertex to an adjacent vertex based on artificial pheromone density and edge weight. Two or more artificial ants can visit a vertex at the same time.
2. Lay down artificial pheromone every time it visits a vertex (in this work, we assume that artificial pheromone is laid down at a vertex, not at an edge).

Definition 3 (Artificial pheromone)
1. Artificial pheromone is laid down by an artificial ant every time it visits a vertex.
2. Artificial pheromone is represented by pheromone density.
3. Artificial pheromone evaporates with evaporation rate λ.

Definition 4 (Artificial environment)
1. In an artificial environment, ants move in a discrete and synchronous way: at each time step t, all ants move one step from a vertex to an adjacent vertex.
2. In an artificial environment, ants move in a collision-free way, i.e. two or more ants can visit a vertex at the same time.
3. In an artificial environment, the laying down and evaporation of artificial pheromone are synchronous at all the vertices.

In the following sections, ant, pheromone, and environment stand for artificial ant, artificial pheromone, and artificial environment, respectively.

Given a labeled graph G = (V, E), M (M ≤ |V|) ants are initially placed at M different vertices. At time step t, an ant k located at vertex u_t selects the next vertex u_{t+1} to visit as follows:

u_{t+1} = argmax_{v ∈ A_k(u_t)} [w(e_{u_t,v})] [p_t(v)]^β    if q ≤ q_0
u_{t+1} = S                                                   if q > q_0        (2)

where
1. A_k(u_t) is the set of adjacent vertices of u_t that have not been visited by ant k.
2. w(e_{u_t,v}) is the weight value assigned to e_{u_t,v}.
3. p_t(v) is the pheromone density at v at time step t.
4. β is the smoothing factor between edge weight and pheromone density.

If we consider ACO as a population-based approximation method, the trade-off between exploration and exploitation is maintained by the value of q_0 ∈ [0, 1]. q is drawn from a uniform distribution. Ants select the vertex biased toward a short path with probability q_0 (exploitation) and move “randomly” with probability 1 − q_0 (exploration). If q_0 ≈ 1, ants quickly converge to an optimal solution, with the risk of premature convergence and being trapped in a locally optimal solution. If q_0 ≈ 0, ants might never converge and their movement becomes essentially random. To maintain convergence for every possible value of q_0, Dorigo and Gambardella (1997) proposed that the random movement of ants should follow a probability distribution that emphasizes short edges and vertices with high pheromone density.


That is, the probability that an ant k located at u_t moves to an adjacent vertex v agrees with the weight of e_{u_t,v} and the pheromone density at v:

S = argmax_{v ∈ A_k(u_t)} [s_v] [w(e_{u_t,v})] [p_t(v)]^β

(The selection of S should not be confused with the selection of u_{t+1} when q ≤ q_0: the former follows a probability distribution, whereas the latter is based on absolute confidence; the two rules differ in the random variable s_v, drawn from a uniform distribution.)
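As an illustration, the sketch below (hypothetical names and data structures, not the authors' Java implementation) applies the transition rule: with probability q_0 the ant exploits the highest-scoring neighbor as in (2), and otherwise it explores by sampling a neighbor in proportion to edge weight times pheromone density, which plays the role of the random factor s_v in the rule for S.

import random

def next_vertex(u, unvisited_neighbors, weight, pheromone, beta=2.0, q0=0.9):
    """Choose the next vertex for an ant located at u, following rule (2)."""
    # score of a candidate v = edge weight to v times pheromone density at v raised to beta
    scores = {v: weight[(u, v)] * pheromone[v] ** beta for v in unvisited_neighbors}
    if not scores:
        return None                                    # dead end: no unvisited neighbor
    if random.random() <= q0:                          # exploitation: deterministic argmax
        return max(scores, key=scores.get)
    total = sum(scores.values())                       # exploration: biased random move
    r = random.uniform(0, total)
    acc = 0.0
    for v, s in scores.items():
        acc += s
        if r <= acc:
            return v
    return v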

Pheromone density is updated locally after each time step and globally after an ant completes a route. At time step t, the pheromone density at each vertex v is locally updated:

p_{t+1}(v) = (1 − λ) p_t(v) + λ |M_t(v)| τ        (3)

where
1. M_t(v) is the set of ants at v at time step t.
2. λ is the evaporation rate of pheromone.
3. τ is defined by empirical experience or heuristics. In Dorigo and Gambardella (1997), τ is inversely proportional to the length of the tour found by the nearest neighbor heuristic.

With the mechanism of global update, ACO can be classified as a reinforcement learning algorithm: when an ant k completes a route, the solution r_k created by its path is compared to the current best solution r_best of the population. If r_k is better than r_best, the best solution is updated and all the vertices on the route r_k are rewarded with a pheromone deposit inversely proportional to the length of the path, ∆p = ||r_k||^{-1}:

p_t(v) = (1 − λ) p_t(v) + λ ∆p    ∀v ∈ r_k        (4)
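A minimal sketch of the two pheromone updates, assuming pheromone is stored in a dict keyed by vertex (names are illustrative, not the authors' code):

def local_update(pheromone, ants_at_vertex, lam=0.1, tau=0.01):
    """Eq. (3): evaporate, then deposit in proportion to how many ants sit on each vertex."""
    for v in pheromone:
        pheromone[v] = (1 - lam) * pheromone[v] + lam * len(ants_at_vertex.get(v, [])) * tau

def global_update(pheromone, best_route, best_length, lam=0.1):
    """Eq. (4): reward every vertex on the new best route with delta_p = 1 / route length."""
    delta_p = 1.0 / best_length
    for v in best_route:
        pheromone[v] = (1 - lam) * pheromone[v] + lam * delta_p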

The movement of ants and the update of pheromone continue iteratively until all the ants follow a unique route or a maximum number of iterations is reached. The latest best solution is the output of the algorithm. This is sufficient as long as finding the shortest path is the only concern. In the following subsection, we propose using ACO to calculate vertex centrality in a labeled graph.

2.4 Traveling Salesman Problem and Vertex Centrality

Definition 5 (Vertex centrality) Given a labeled graph G = (V, E), vertex centrality is a function

f : V → R∗        (5)

that assigns a positive real value to every vertex in G.

Vertex centrality depicts the rank (or importance) of vertices and reflects the structure of the graph. Given a labeled graph, vertex centrality can be defined as vertex degree, vertex betweenness, vertex closeness, or the principal eigenvector of the adjacency matrix. The first three are calculated directly from the graph, whereas the last can be calculated using a random walk algorithm like PageRank (Brin and Page (1998)) or HITS (Kleinberg (1999)).


To use ACO to calculate vertex centrality, we intuitively assume that a vertex is more important if it is visited more often by ants within a given period. At time step N, we define vertex centrality as the pheromone density at that vertex:

f ≡ p_N : V → R∗

3 Application to Word Sense Disambiguation

In the preceding section, we described the formulation of ACO with some modifications that enable it to calculate vertex centrality. In this section, we start out by introducing WSD as a sequence labeling problem. We initially consider WSD as a combinatorial optimization problem. By introducing pheromone density, hence vertex centrality (Section 2.4), we recast it as a ranking problem and show that the two approaches are equivalent regarding the final output. We then present the pseudo code of the algorithms to create the disambiguation graph and to run ACO on this graph.

Given a domain of words W, a domain of concepts C, and a domain of senses S, WSD is a sequence labeling problem where the input is a sequence of words w = w0 w1 · · · wn and the output is a sequence of the appropriate senses s = s0 s1 · · · sn:

g ∘ f : W → S
w = w0 w1 · · · wn ↦ s = s0 s1 · · · sn

A dictionary D defines all the assignable concepts of an individual word wi:

D : W → C
wi ↦ {ci0, ci1, · · ·, cik}

As senses are defined in a fixed sense inventory (dictionary, thesaurus, or ontology), WSD actually consists of two subsequent phases. In the first phase, the input word sequence w = w0 w1 · · · wn is mapped to a concept sequence c = c0 c1 · · · cm:

f : W → C
w = w0 w1 · · · wn ↦ c = c0 c1 · · · cm        (6)

and in the second phase, the concept sequence c = c0 c1 · · · cm is mapped to a sense sequence s = s0 s1 · · · sn:

g : C → S
c = c0 c1 · · · cm ↦ s = s0 s1 · · · sn

The former is the difficult part of WSD, whereas the mapping from a concept to a sense, given the appropriate word, is trivial and can be done directly using the sense inventory. It should be noted that the length of the concept sequence is not necessarily equal to the length of the word sequence and the sense sequence (m ≤ n), because one concept may be mapped to two or more senses, i.e. two or more words in a context may share the same concept (this phenomenon rarely happens in a small context like a sentence, which was the disambiguation context in this work; usually m = n).
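For illustration, the dictionary D can be instantiated with any machine-readable sense inventory; the sketch below (our own illustration, not part of the paper) uses NLTK's WordNet interface to list the assignable concepts of a lemma.

from nltk.corpus import wordnet as wn

def assignable_concepts(lemma, pos):
    """D(w): the set of WordNet synsets (concepts) assignable to a lemma with a given POS."""
    return set(wn.synsets(lemma, pos=pos))

# e.g. candidate concepts for a few content words
for lemma, pos in [("research", "n"), ("stop", "v"), ("make", "v")]:
    print(lemma, pos, len(assignable_concepts(lemma, pos)), "candidate concepts")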


One of the most common approaches to WSD is to consider (6) as a ranking function r: given an input word sequence w = w0 w1 · · · wn, the output is the ranks of all the (assignable) concepts:

r : W → R∗^{|C|}
w = w0 w1 · · · wn ↦ r_{c0} r_{c1} · · · r_{c|C|−1}        (7)

In supervised learning, the ranking function is learned from training data (e.g. Lee et al (2004)). In unsupervised learning, it is estimated from large unlabeled text corpora, e.g. by using an n-gram language model as in Yuret and Yatbaz (2010), or clustering techniques as in Pedersen and Bruce (1997) and Popescu and Hristea (2011) (clustering techniques are actually applied to word sense discrimination, or induction; a post-processing step is needed to map sense clusters to their appropriate senses in a sense inventory). In both supervised and unsupervised methods, the rank of a concept c in a context w is p(c|w), the probability of assigning c given w. In knowledge-based methods, the rank of a concept is calculated using Lexical Knowledge Bases (LKB) like dictionaries, thesauri, taxonomies, and ontologies. WordNet (Miller (1995)) is the de facto sense inventory and the most commonly used LKB. The WordNet LKB consists of concepts (also called synsets) and lexical and semantic relations between them. As the natural representation of WordNet is a graph, the rank of a concept can be defined as its vertex centrality as in (5). Answering the question of which graph vertex centrality is calculated on leads to two approaches. The first is to use the whole LKB graph containing all the concepts in the sense inventory, possibly with a personalization adapted to the context (Agirre and Soroa (2009)). The second, used in our work, is to create an instance of the disambiguation graph for each context, containing the concepts assignable to words in the context (Veronis and Ide (1990), Tsatsaronis et al (2007), Sinha and Mihalcea (2007), Ponzetto and Navigli (2010)). In this work, the context was treated as a bag of words, without considering the word order or the syntactic structure of the context. One could consult Segond et al (1997), Molina et al (2002), Hatori et al (2008), and Tran et al (2010) for the use of word order and syntactic structure along with Hidden Markov Models or Conditional Random Fields.

3.1 Disambiguation Graph

Disambiguation graphs are context sensitive. An instance is created for each context using WordNet as the sense inventory. The context is first labeled with part-of-speech tags. Only content words, including nouns, verbs, adjectives, and adverbs, i.e. words that exist in the sense inventory, are disambiguated. Inflected word forms are then lemmatized. All the assignable senses of the lemmas are extracted from the sense inventory. These senses are mapped to their concepts in WordNet. Finally, the disambiguation graph contains as vertices all the assignable concepts of the context and as edges the semantic relatedness between all pairs of concepts. E.g., consider a sentence occurring in a Wall Street Journal document (document d001 used in the Semeval-2007 coarse-grained English all-words task):

“Your research stopped when a convenient assertion could be made.”

This sentence is labeled with part-of-speech tags to filter out content words.


Algorithm 1 Disambiguation graph creation
Input: Word sequence w = w0 w1 · · · wn.
Input: Dictionary D: mapping each word wi to the set of its assignable concepts.
Output: Undirected labeled graph G = (V, E) where vertices are concepts and edge weights are the semantic relatedness of all pairs of concepts.
1: V = ∅ // Initialize graph vertices.
2: for all wi ∈ w do
3:   for all ci ∈ D(wi) do
4:     if ci ∉ V then
5:       V = V ∪ {ci} // Add a concept to the graph.
6:     end if
7:   end for
8: end for
9: E = ∅ // Initialize the edges of the graph.
10: for all u ∈ V do
11:   for all v ∈ V \ ({u} ∪ D(D^{-1}(u))) do // Exclude u and concepts of the same word(s).
12:     e_{u,v} = semantic_relatedness(u, v) // Calculate the semantic relatedness.
13:     E = E ∪ {e_{u,v}} // Add an edge to the graph.
14:   end for
15: end for
16: return G = (V, E)

“Your/PRP$ research/NN stopped/VBD when/WRB a/DT convenient/JJ assertion/NN could/MD be/VB made/VBN ./.”

Fine-grained part-of-speech tags are mapped to coarse-grained part-of-speech tags understandable by the sense inventory as follows: NN, NNP, NNPS, NNS → n; VB, VBD, VBG, VBZ → v; JJ, JJR, JJS → a; RB, RBR, RBS → r. Here n, v, a, and r stand for noun, verb, adjective, and adverb, respectively. The various word forms are converted to lemmas recognizable by the sense inventory:

“research#n”, “stop#v”, “convenient#a”, “assertion#n”, “make#v”

The set of all assignable senses contains 2 senses of “research#n”, 11 senses of “stop#v”, 2 senses of “convenient#a”, 2 senses of “assertion#n”, and 49 senses of “make#v”. Finally, the word senses are mapped to the appropriate concepts in WordNet. The edges of the disambiguation graph are the semantic relatedness between all pairs of concepts.
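A compact Python sketch of Algorithm 1 follows (helper names are assumed and the relatedness function is supplied externally); in line with the path definition in the next subsection, concepts assignable to the same word are not connected by an edge.

from itertools import combinations

def build_disambiguation_graph(words, dictionary, relatedness):
    """Vertices are all assignable concepts; edge weights are pairwise semantic relatedness."""
    concept_words = {}                                   # concept -> set of words it can express
    for w in words:
        for c in dictionary.get(w, []):
            concept_words.setdefault(c, set()).add(w)
    vertices = set(concept_words)
    edges = {}
    for u, v in combinations(vertices, 2):
        if concept_words[u] & concept_words[v]:          # skip concepts of the same word
            continue
        edges[(u, v)] = edges[(v, u)] = relatedness(u, v)  # store both orientations
    return vertices, edges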

3.2 Disambiguation Algorithm

Definition 6 (Visited) A word wi is visited if one of its assignable concepts cj in the disambiguation graph is visited: cj ∈ D(wi).

Definition 7 (Path) A path p = c0 c1 · · · cm is an ordered chain of concepts in the disambiguation graph where two consecutive concepts are adjacent and no word is visited more than once:
1. e_{ci,ci+1} ≠ 0  ∀i = 0, . . . , m − 1
2. D^{-1}(ci) ∩ D^{-1}(cj) ∩ w = ∅  ∀ci, cj ∈ p

Definition 8 (Route) A path is a route if and only if it visits all the words in the word sequence: ∀wi ∈ w ∃cj ∈ p : cj ∈ D(wi).


Algorithm 2 WSD using TSP-ACO
Input: Disambiguation graph G = (V, E), V = {v0, v1, · · ·, vm}.
Input: A set A of artificial ants, A = {a0, a1, · · ·, am}.
Input: Initial pheromone density: θ ≥ 0
Input: Evaporation rate: 0 ≤ λ ≤ 1
Input: Fitness factor: τ ≥ 0
Output: Vertex centrality f : V → R∗.
1: for i = 0 → m do
2:   ai → current = vi // Put the ants at the vertices, one per vertex.
3:   f(vi) = θ // Initialize pheromone density.
4: end for
5: repeat
6:   for all a ∈ A do
7:     a → move() // Move to an adjacent vertex.
8:     if a → finished_a_route() then
9:       a → reset() // Reset for a new route.
10:      if a → route is better than the_best_solution then
11:        the_best_solution = a → route
12:        global_update(a → route)
13:      end if
14:    end if
15:  end for
16:  for all v ∈ V do
17:    f(v) = (1 − λ) f(v) // Pheromone evaporates.
18:  end for
19:  for all a ∈ A do
20:    f(a → current) = f(a → current) + λτ // Pheromone is laid down.
21:  end for
22: until termination condition
23: return f

As WSD is a mapping from a word sequence to a concept sequence (6), TSP in the disambiguation graph is stated as: “Given a labeled graph that consists of all the assignable concepts of the words in the context and the semantic relatedness between all pairs of concepts, find the shortest possible route that visits each word once”. If an assignable concept of a word wi is visited, all the other assignable concepts of wi are also considered visited. This guarantees that a word is only visited once in a path (and route). The restriction is based on the assumption that an individual word can have only one sense in a context. In the case of homonymy, this assumption is straightforward because of the clear distinction of the different senses in different domains, whereas in the case of polysemy it is naive and seems to oversimplify the overlap of subtly different senses of a word. E.g., in the sentence “The cold helped clear his head.”, the noun “cold” is consequently understood as “the sensation produced by low temperatures” without mentioning “a mild viral infection involving the nose and respiratory passages” or “the absence of heat”. In Section 4, we will show that this naive assumption works well and will discuss some suggestions to deal with its shortcoming.

The algorithm starts by locating an ant at each vertex of the disambiguation graph. Then, the movement of an ant must satisfy the requirements of a path and follow the decision rule in (2). At each time step, an ant checks whether it has completed a route. If a route is completed, the ant resets its global memory and restarts for a new route. The route is compared with the best route of the population, and pheromone is globally updated (4) if necessary. Pheromone at all the vertices is locally updated (3).


The algorithm terminates after a maximum number of iterations. Finally, the output is returned either as the best solution of the population or as the vertices with the highest pheromone density. If the algorithm converges, i.e. all the ants follow a unique route, these two solutions are identical.
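The following condensed Python sketch (not the authors' Java implementation; pheromone handling is simplified relative to Algorithm 2, e.g. there is no separate best-route reward) shows the overall disambiguation loop: ants repeatedly build routes under the one-sense-per-word restriction, pheromone evaporates and is deposited along routes, and the highest-pheromone concept of each word is returned.

import random

def disambiguate(words, dictionary, edges, iters=100, q0=0.9, lam=0.1, beta=2.0, theta=0.01):
    # edges: dict mapping concept pairs (u, v) to relatedness; assumed to contain both orientations
    vertices = {c for w in words for c in dictionary[w]}
    pheromone = {v: theta for v in vertices}
    word_of = {c: w for w in words for c in dictionary[w]}     # assumes one word per concept

    def step(u, visited_words):
        cand = [v for v in vertices if (u, v) in edges and word_of[v] not in visited_words]
        if not cand:
            return None
        score = lambda v: edges[(u, v)] * pheromone[v] ** beta
        if random.random() <= q0:
            return max(cand, key=score)
        return random.choices(cand, weights=[score(v) + 1e-12 for v in cand])[0]

    for _ in range(iters):
        for start in vertices:                                 # one ant per concept
            route, visited = [start], {word_of[start]}
            while len(visited) < len(set(words)):
                nxt = step(route[-1], visited)
                if nxt is None:
                    break
                route.append(nxt)
                visited.add(word_of[nxt])
            for v in route:                                    # deposit along the route
                pheromone[v] = (1 - lam) * pheromone[v] + lam
        for v in pheromone:                                    # evaporation
            pheromone[v] *= (1 - lam)
    return {w: max(dictionary[w], key=pheromone.get) for w in set(words) if dictionary[w]}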

3.3 Lesk-Based Semantic Relatedness

Semantic relatedness between concepts can be calculated based on dictionaries, ontologies, sense-tagged corpora, or even large untagged text corpora. Lesk (1986) simply counts the overlap of dictionary definitions. Other works use the information content estimated from sense-tagged or large untagged text corpora (Resnik (1995)), the topology of ontologies (Leacock and Chodorow (1998), Wu and Palmer (1994)), or both (Lin (1998), Jiang and Conrath (1997)). Ponzetto and Strube (2011) compare semantic relatedness induced from the WordNet taxonomy and the Wikipedia taxonomy. Comparative studies (e.g. Budanitsky and Hirst (2006), Ponzetto and Strube (2011)) showed that these measures are competitive on benchmark datasets. But when applied to all-words WSD, they have two common shortcomings. Firstly, semantic relatedness can only be calculated between words of the same part of speech, because only words of the same part of speech are reasonably structured as a taxonomy. Secondly, the textual representation of concepts, which is an important clue, is not exploited. Lesk-based measures overcome these shortcomings as they are part-of-speech independent and are based on the definitions of concepts. Banerjee and Pedersen (2002), Vasilescu et al (2004), Pedersen et al (2005), and Gelbukh et al (2005) enrich the textual representation of concepts by leveraging semantic and lexical relations from WordNet and topical relations induced from sense-disambiguated glosses. Ponzetto and Navigli (2010) extend WordNet glosses using associative relations from Wikipedia and apply them to disambiguate nouns. Ruiz-Casado et al (2005) calculate the overlap of definitions in a term-frequency-based fashion and apply it to map Wikipedia entries to WordNet concepts.

We propose a novel Lesk-based semantic relatedness measure using the Vector Space Model. WordNet glosses were first enriched with various types of relation. Then, by indexing glosses as vectors in the Vector Space Model, we integrated the distribution of vocabulary into the measurement of semantic relatedness. Finally, the semantic relatedness of a pair of concepts was the dot product of their vectors. In detail, the representation of concepts consisted of five steps:
1. Glosses were extracted from WordNet.
2. The gloss of each concept was enriched with the glosses of its related concepts in WordNet through semantic and lexical relations, e.g. hyponymy, hypernymy, antonymy, meronymy, and other types of relation available in WordNet.
3. The enriched gloss of step 2 was further enriched with the glosses of topically related concepts available in eXtended WordNet (Mihalcea and Moldovan (2001)).
4. From the mappings of Wikipedia entries to WordNet concepts (Ponzetto and Navigli (2010)), the enriched gloss of step 3 was finally enriched with the glosses of its associative concepts in Wikipedia.


5. Steps 2, 3, and 4 could be done cumulatively in any order. The final enriched glosses were preprocessed through stop-word removal and Porter stemming. At last, they were indexed as terms and documents using the Vector Space Model.

Each concept (also called a synset) in WordNet is defined by a short gloss containing a brief explanation of the concept and/or running examples of its usage. Lesk-based semantic relatedness returns zero, which is wrong, when two concepts belong to the same topic but the topical information is not explicitly displayed in their glosses and the two glosses have no common word. E.g., given the glosses of three senses of “bank”, “stock”, and “money” in the FINANCIAL topic:
– bank#n#2 (the second sense in WordNet of the word “bank” as a noun): (a financial institution that accepts deposits and channels the money into lending activities) “he cashed a check at the bank”, “that bank holds the mortgage on my home”.
– money#n#1: (the most common medium of exchange; functions as legal tender) “we tried to collect the money he owed us”.
– stock#n#1: (the capital raised by a corporation through the issue of shares entitling holders to an ownership interest (equity)) “he owns a controlling share of the company's stock”.
In this example, the glosses of “bank#n#2” and “money#n#1” share only the common word “money”, and there is no overlap between the glosses of “bank#n#2” and “stock#n#1”, or “stock#n#1” and “money#n#1” (“he”, “us”, “a” are treated as stop words; only content words, including nouns, verbs, adjectives, and adverbs, are counted).

The most straightforward extension is to use the semantic and lexical relations available in WordNet. The limitation is that most relations in WordNet are between concepts of the same part of speech, particularly nouns and verbs, hence the lack of cross-part-of-speech relations. To overcome this, the main idea realized in eXtended WordNet is to disambiguate WordNet glosses and then extract relations from that resource. Morphological derivations between word forms are available in conventional dictionaries (e.g. “derive”, “derivative”, and “derivation”), but those between word senses are not. In eXtended WordNet, morphological derivation relations are extracted from sense-disambiguated glosses. This type of relation crosses parts of speech and can augment the topical information of concepts. The third type of relation under consideration is associative relations between Wikipedia concepts. The term associative refers to hierarchical relations between topical concepts, or to interlinks from a concept A to other concepts mentioned in the article of A in Wikipedia. The problem lies in the quality of the mapping between Wikipedia concepts and WordNet concepts. One successful treatment of this problem is the mapping algorithm in Ponzetto and Navigli (2010), whose results were used in this study.

Each gloss was a vector in the Vector Space Model with the term frequency-inverse document frequency (tfidf) weighting scheme. Tfidf highlights terms that frequently occur in some specific documents but rarely occur in other documents:

tfidf_{t,d} = tf_{t,d} · log(N / df_t)


where
1. tfidf_{t,d} is the term frequency-inverse document frequency of term t in document d.
2. tf_{t,d} is the frequency with which term t occurs in document d.
3. df_t is the number of documents in the corpus in which term t occurs.
4. N is the total number of documents in the corpus.

Semantic relatedness between two concepts c1 and c2 was the dot product of their vectors v1 and v2:

sr(c1, c2) = v1 · v2 = Σ_{i=1}^{M} tfidf_{i,1} · tfidf_{i,2}        (8)

where M is the total number of terms in the vocabulary. We preferred the dot product to other measures like cosine similarity or Euclidean distance for two reasons. Firstly, the dot product essentially relies on the overlap of text, which stays close to the original Lesk algorithm. Other measures, like cosine similarity or Euclidean distance, relying on the closeness of angles or spatial positions, have been studied in large-scale problems of information retrieval or text classification but have not been well studied in lexical disambiguation. Secondly, the other measures perform normalization and thus equalize long and short documents. However, the distribution of concepts in WordNet is skewed. Assuming that concepts and their relations form a network, frequently used concepts, called strong concepts, are related to many other concepts through various types of relation. Some concepts, called weak concepts, in particular adjectives and adverbs, connect only weakly to the network. Other concepts, called neutral concepts, are neither strong nor weak. The glosses of strong concepts are consequently rich, whereas the glosses of weak concepts remain sparse. As a consequence, if the lengths of glosses were normalized, the semantic relatedness between a strong concept and a weak concept would not be significantly different from the semantic relatedness between a neutral concept and a weak concept. Again, common practices of the Vector Space Model used in large-scale problems should be applied with care here.
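A small self-contained sketch of the measure (with naive whitespace tokenization standing in for the Lucene indexing used in the paper; names are illustrative):

import math
from collections import Counter

def tfidf_vectors(glosses):
    """glosses: dict concept -> enriched gloss text. Returns concept -> {term: tfidf weight}."""
    tokenized = {c: g.lower().split() for c, g in glosses.items()}
    n_docs = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    vectors = {}
    for c, toks in tokenized.items():
        tf = Counter(toks)
        vectors[c] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return vectors

def relatedness(v1, v2):
    """Eq. (8): un-normalized dot product of two sparse tfidf vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())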

4 Experimental Results and Discussions

In this section, we describe and interpret the experimental results of the proposed method, namely TSP-ACO (which stands for Traveling Salesman Problem-Ant Colony Optimization). Firstly, various representations of glosses are evaluated. Next, the proposed method is compared with other graph centrality methods based on the same disambiguation graph. We then present two combination approaches to improve the overall performance. Finally, we compare our best systems with contemporary related works. Experiments were conducted on the Semeval-2007 coarse-grained (Navigli et al (2007)), Senseval-2 (Edmonds and Cotton (2002)), and Senseval-3 (Snyder and Palmer (2004)) English all-words datasets. The Semeval-2010 all-words task (Agirre et al (2010)) evaluates WSD in a specific domain, which is not one of the focuses of this work. Semantic and lexical relations and glosses were extracted from WordNet (http://wordnet.princeton.edu/) 1.7.


Table 1 Default parameter values of TSP-ACO (* values from Dorigo and Gambardella (1997)).

Parameter   Description                                               Value
q0*         Exploitation vs. exploration                              0.9
λ*          Evaporation rate                                          0.1
β*          Smoothing factor for evaporation rate and edge weight     2
τ*          Inverse of the length of the nearest neighbor heuristic   -
t*          Maximum number of iterations                              1250
θ           Initial pheromone density                                 0.01
m           Number of ants: equal to the number of concepts           -

Topical relations from WordNet glosses and associative relations from Wikipedia (http://www.cl.uni-heidelberg.de/~ponzetto/wikitax2wn/) are publicly provided by the authors of Mihalcea and Moldovan (2001) and Ponzetto and Navigli (2010). UKB (http://ixa2.si.ehu.es/ukb/) 0.1.6rl was used to yield the output of the state-of-the-art Personalized PageRank algorithm (Agirre and Soroa (2009)). Lucene (http://lucene.apache.org) 3.1.0 was used to index the glosses and to implement the tfidf weighting scheme. The parameter values of TSP-ACO were taken from Dorigo and Gambardella (1997), except for the initial pheromone density and the number of ants (Table 1). To avoid random movement of ants in the first time step, pheromone was initialized at all the vertices with the same small value. The number of ants was set equal to the number of concepts in the disambiguation graph, so that all concepts have an equal probability of starting a new route. TSP-ACO was implemented in Java with multithreading support. Execution on the Senseval-3 dataset, containing 301 sentences and 2041 instances of content words, took 316s (Gloss + WN + XWN + Wiki in Table 2) on an Intel Core i5 2.80GHz processor; the online calculation of Lesk-based semantic relatedness was the most time-consuming part.

Experimental results were compared using attempted, precision, recall, and F-measure. Attempted is the proportion of instances for which a prediction is made. Precision is the proportion of predicted results that are correct. Recall is the proportion of all instances that are predicted correctly. Precision focuses on the correctness of the results, whereas recall complementarily focuses on their completeness. F-measure (also F1 score or F-score) is the harmonic mean of precision and recall, which balances the two. When comparing with related works, as not all of the above measures were available, we chose recall and focused on completeness. The most frequent senses heuristic (MFS), i.e. simply selecting the most frequent senses occurring in a sense-tagged corpus, is based on statistics drawn from SemCor (Miller et al (1993)). This information is implicitly embedded in WordNet in the form of the order of word senses.

A = attempted = predicted instances / total instances
P = precision = correct predicted instances / total predicted instances
R = recall = correct predicted instances / total instances
F = F-measure = 2 · P · R / (P + R)
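For reference, the four measures can be computed as follows (illustrative numbers only, not results from the paper):

def scores(total, predicted, correct):
    """Attempted, precision, recall, and F-measure from raw counts."""
    A = predicted / total
    P = correct / predicted
    R = correct / total
    F = 2 * P * R / (P + R)
    return A, P, R, F

print(scores(total=2000, predicted=1980, correct=1150))   # hypothetical counts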


Table 2 The effect of WordNet glosses enrichment.

Senseval-2 English all-words
Enrichment                 P     R     F
Gloss                      48.7  48.5  48.6
Gloss + WN                 55.5  55.4  55.5
Gloss + WN + XWN           62.8  62.7  62.7
Gloss + WN + XWN + Wiki    63.0  62.8  62.9

Senseval-3 English all-words
Enrichment                 P     R     F
Gloss                      38.5  38.1  38.3
Gloss + WN                 51.9  51.4  51.6
Gloss + WN + XWN           57.5  56.9  57.2
Gloss + WN + XWN + Wiki    57.8  57.2  57.5

Semeval-2007 coarse-grained English all-words
Enrichment                 P     R     F
Gloss                      65.8  65.5  65.6
Gloss + WN                 73.9  73.6  73.7
Gloss + WN + XWN           78.5  78.2  78.3
Gloss + WN + XWN + Wiki    78.5  78.1  78.3

4.1 WordNet Glosses Representation

In the first experiment, we evaluated the effect of the different types of relation on the enrichment of glosses. Because the evaluation of semantic relatedness was not the main scope of this work, we did not conduct experiments on datasets of sample pairs of words (Miller and Charles (1991)), as is conventional. We instead used WSD performance as an indirect evaluation. The baseline system was built from the original glosses (Gloss). The other systems were built cumulatively from the original glosses enriched with WordNet relations (WN), topical relations (XWN), and Wikipedia relations (Wiki), in that order. The average improvement over the datasets is 15.8%. WN relations improve the results by 9.4%. XWN relations add a further 5.8%. Wiki relations finally add 0.16%. This does not imply that Wiki relations are not as good as WN relations or XWN relations. In fact, the slight improvement from Wiki relations suggests that the three types of relation may be partly redundant, or that the improvement may have reached an upper bound (e.g. the most frequent senses heuristic). Further studies on Lesk-based semantic relatedness could evaluate on datasets of pairs of words and possibly use human judgments as an upper bound to investigate these two assumptions. (Table 2)

In the second experiment, we compared the use of tfidf and BOW for semantic relatedness. In BOW, semantic relatedness is simply the number of common terms counted in the glosses of two concepts. Tfidf improves the results by 1.4% on average. The improvement on Senseval-2, Senseval-3, and Semeval-2007 is 1.1%, 1.3%, and 1.7%, respectively. Semeval-2007 is a coarse-grained evaluation, which explains the highest improvement, 1.7%, on this dataset. (Table 3)


Table 3 Comparison of tfidf and BOW.

                  BOW                  Tfidf
Dataset           P     R     F        P     R     F
Senseval-2        61.9  61.8  61.9     63.0  62.8  62.9
Senseval-3        56.5  55.9  56.2     57.8  57.2  57.5
Semeval-2007      76.8  76.5  76.6     78.5  78.1  78.3

4.2 Graph Centrality

In this experiment, we compared our proposed method with two graph centrality methods, namely Degree and PageRank. Sinha and Mihalcea (2007) showed that these methods are more suitable for the disambiguation graph than other methods.
– Degree: The centrality of a vertex v is the sum of the weights of all the edges between v and its adjacent vertices A(v). Degree is parameter-free and can be calculated directly from the graph.

Degree(v) = Σ_{u∈A(v)} w_{u,v}

– PageRank: Page et al (1999) proposed PageRank to rank web pages on the World Wide Web under the assumption that a web page is important if it is linked from other important web pages. PageRank is a random walk algorithm that computes the principal eigenvector of a graph. The algorithm starts by initializing a rank value for each vertex of the graph. Ranks are updated iteratively until convergence.

r(v) = (1 − d) + d Σ_{u∈A(v)} w_{u,v} r(u)

The damping factor d controls the convergence speed and the balance between the random page selection behavior of users, i.e. random walking (the first term), and the importance of the network topology (the second term). d = 0.25 emphasizes the random walking; d = 0.5 equalizes the two factors; d = 0.85, used in Sinha and Mihalcea (2007), Agirre and Soroa (2009), and this work, emphasizes the network topology.

Degree, a simple and parameter-free algorithm, interestingly outperforms TSP-ACO and PageRank. However, simplicity is not always best: from a machine learning perspective, the performance of Degree cannot be improved because no parameter is available for optimization. TSP-ACO is comparable to Degree on Senseval-3 and Semeval-2007 and outperforms PageRank on the three datasets. These results encourage further parameter optimization, given that the parameter values were taken from Dorigo and Gambardella (1997) and that our formulation of TSP was modified for the problem of sense disambiguation. (Table 4)
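The two baseline centralities can be sketched as follows (a graph stored as a dict of dicts with graph[u][v] = relatedness, assumed symmetric; the PageRank here normalizes outgoing weights per vertex, the usual practice for the weighted variant, whereas the formula above omits that normalization):

def degree_centrality(graph):
    """Degree(v): sum of the weights of edges incident to v."""
    return {v: sum(nbrs.values()) for v, nbrs in graph.items()}

def pagerank(graph, d=0.85, iters=50):
    """Weighted PageRank with per-vertex normalization of outgoing weights."""
    out = {u: sum(nbrs.values()) or 1.0 for u, nbrs in graph.items()}
    rank = {v: 1.0 for v in graph}
    for _ in range(iters):
        rank = {v: (1 - d) + d * sum(graph[u].get(v, 0.0) / out[u] * rank[u]
                                     for u in graph)
                for v in graph}
    return rank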

4.3 Combination of Knowledge-Based Methods

This experiment was aimed at improving the performance of WSD by combining purely knowledge-based methods. We were motivated by the fact that methods based on the disambiguation graph, including our proposed method, do not reach 100% attempted, because the disambiguation graph cannot be created if there is only one content word in a context (e.g. in a short sentence).


Table 4 Comparison with methods based on graph centrality.

Senseval-2 English all-words
System      Description                P     R     F
TSP-ACO     Shortest path finding      63.0  62.8  62.9
Degree      Static vertex centrality   64.1  64.0  64.0
PageRank    Random walk                59.7  59.6  59.6

Senseval-3 English all-words
System      Description                P     R     F
TSP-ACO     Shortest path finding      57.8  57.2  57.5
Degree      Static vertex centrality   58.0  57.4  57.7
PageRank    Random walk                55.5  54.9  55.2

Semeval-2007 coarse-grained English all-words
System      Description                P     R     F
TSP-ACO     Shortest path finding      78.5  78.1  78.3
Degree      Static vertex centrality   78.8  78.5  78.6
PageRank    Random walk                76.8  76.5  76.6

Three factors that affect the efficiency of the combination were considered. Firstly, the weak classifiers used for the combination should have varied characteristics; precisely, their assumptions and disambiguation models should not be duplicated. Combining similar methods based on the disambiguation graph and semantic relatedness (e.g. TSP-ACO, Degree, and PageRank) will not significantly improve the performance. In this work, we combined a method based on semantic relatedness (TSP-ACO or Degree), an enhancement of the Lesk algorithm (Lesk), and a method based on the structure of the WordNet ontology (PPR). Secondly, the performance gaps between the weak classifiers should be reasonably small. The ultimate goal of the combination is to exploit all the weak classifiers and finally yield performance better than the best weak classifier; if the gaps are too large, the performance of the combination will settle somewhere between the worst and the best weak classifiers. Thirdly, the strategy of combination is important. In this work, we evaluated two strategies: binary voting and weighted voting.
– Personalized PageRank (PPR): PPR (Agirre and Soroa (2009)) differs from PageRank (Section 4.2) in the underlying graph. The disambiguation graph in PageRank contains the assignable concepts of the words in the context and the semantic relatedness between these concepts; an instance of the disambiguation graph is created for each context. In PPR, the whole LKB graph contains all the concepts in WordNet and the actually existing relations between these concepts. For each context, the LKB graph is personalized by adding the words in the context and directed edges from these words to their assignable concepts. In this work, WordNet relations, eXtended WordNet relations, and Wikipedia relations were all included in the LKB graph. PPR always reaches 100% attempted because the LKB graph contains all the concepts in WordNet.
– Lesk: In the Lesk algorithm, a word in the context is disambiguated by comparing the context with the glosses of its assignable concepts. The concept with the highest similarity is assigned. In this work, the glosses were enriched and represented as document vectors in the Vector Space Model with the tfidf weighting scheme (Section 3.3). Each context was treated as a query vector. Similarity between the query vector and a document vector was calculated by their dot product (8). The Lesk algorithm usually does not reach 100% attempted because it is not applicable when the context is too short.


Table 5 Comparison of combination approaches.

Senseval-2 English all-words
System                Description                A      P     R     F
TSP-ACO               Shortest path finding      99.7   63.0  62.8  62.9
Degree                Static vertex centrality   99.7   64.1  64.0  64.0
PPR                   Personalized PageRank      100.0  61.1  61.1  61.1
Lesk                  Glosses overlap            95.8   65.0  62.3  63.6
TSP-ACO+PPR+Lesk-b    Binary voting              -      64.1  64.1  64.1
Degree+PPR+Lesk-b     Binary voting              -      64.8  64.8  64.8
Degree+PPR+Lesk-w     Weighted voting            -      65.5  65.5  65.5
MFS                   The most frequent senses   -      66.5  66.5  66.5

Senseval-3 English all-words
System                Description                A      P     R     F
TSP-ACO               Shortest path finding      98.9   57.8  57.2  57.5
Degree                Static vertex centrality   98.9   58.0  57.4  57.7
PPR                   Personalized PageRank      100.0  56.9  56.9  56.9
Lesk                  Glosses overlap            96.3   61.1  58.9  60.0
TSP-ACO+PPR+Lesk-b    Binary voting              -      60.5  60.5  60.5
Degree+PPR+Lesk-b     Binary voting              -      60.3  60.3  60.3
Degree+PPR+Lesk-w     Weighted voting            -      61.4  61.4  61.4
MFS                   The most frequent senses   -      64.3  64.3  64.3

Semeval-2007 coarse-grained English all-words
System                Description                A      P     R     F
TSP-ACO               Shortest path finding      99.8   78.5  78.1  78.3
Degree                Static vertex centrality   99.8   78.8  78.5  78.6
PPR                   Personalized PageRank      100.0  75.6  75.6  75.6
Lesk                  Glosses overlap            96.6   80.6  77.7  79.1
TSP-ACO+PPR+Lesk-b    Binary voting              -      79.1  79.1  79.1
Degree+PPR+Lesk-b     Binary voting              -      79.3  79.3  79.3
Degree+PPR+Lesk-w     Weighted voting            -      81.1  81.1  81.1
MFS                   The most frequent senses   -      78.5  78.5  78.5

– Binary voting: The final prediction is the one predicted by the majority of the weak classifiers. In case of a tie, the final prediction is a random selection among the tied candidates.
– Weighted voting: The final prediction is the one receiving the most votes, where each weak classifier's vote is weighted by its prediction weight. In case of a tie, the final prediction is a random selection. In Degree, the prediction weight is the normalized degree of the vertex. In PPR, the prediction weight is the normalized rank of the vertex. In Lesk, the prediction weight is the normalized similarity between the context and a gloss.

Because the attempted score of the combinations is always 100%, both binary voting and weighted voting improve recall and F-measure. Binary voting fails to improve precision; weighted voting therefore outperforms binary voting on the three datasets. TSP-ACO is comparable to Degree on Senseval-3 and Semeval-2007, and their corresponding binary voting systems (i.e. TSP-ACO+PPR+Lesk-b and Degree+PPR+Lesk-b) perform similarly on these two datasets. This suggests that if a prediction weight were available in TSP-ACO, its weighted voting (a hypothetical TSP-ACO+PPR+Lesk-w) would be on par with Degree+PPR+Lesk-w. (Table 5)
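A sketch of the two voting schemes (hypothetical data layout: each weak classifier provides a mapping from instances to predicted senses and, for weighted voting, a parallel mapping to normalized prediction weights):

import random
from collections import defaultdict

def combine(predictions, weights=None):
    """predictions: list of dicts {instance: sense}; weights: parallel list of dicts, or None for binary voting."""
    instances = set().union(*[p.keys() for p in predictions])
    result = {}
    for x in instances:
        votes = defaultdict(float)
        for i, pred in enumerate(predictions):
            if x not in pred:
                continue
            w = 1.0 if weights is None else weights[i].get(x, 0.0)
            votes[pred[x]] += w
        best = max(votes.values())
        result[x] = random.choice([s for s, v in votes.items() if v == best])  # ties: random
    return result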


Why is a prediction weight not available in TSP-ACO? TSP-ACO converges when all the ants follow the shortest route. The default maximum number of iterations was 1250, while convergence is usually reached after about 100 iterations. After 1250 iterations, the pheromone at the vertices on the shortest route is therefore very dense, while the pheromone at the other vertices has almost completely evaporated. As a result, if we use pheromone density to calculate the rank of a vertex, the output will essentially be: the rank of the vertices on the shortest route is “one” and the rank of the remaining vertices is “zero”. This is a binary, not a weighted, prediction. If TSP-ACO is combined with other methods in a weighted voting scheme, the result will be biased toward TSP-ACO and lead to wrong predictions. Future works could deal with this shortcoming in two possible ways. Firstly, the definition of visited (Definition 6) could be relaxed so that a route actually visits all the vertices of the graph. This would dramatically change the problem of WSD from a combinatorial optimization problem to a pure TSP; if so, certain definitions and arguments in our formulation of TSP and ACO would have to be carefully reconsidered. Secondly, the algorithm could terminate sooner so that the pheromone density reflects the rank of vertices better. The termination condition could be a smaller number of iterations or the moment when all the ants follow a unique route. It should be noted that a premature convergence might occur and lead to a locally optimal solution.

4.4 Related Works

In these experiments, TSP-ACO and Degree+PPR+Lesk-w, i.e. our proposed method and our best combination, were compared with contemporary related works. The results of the related systems were taken from their published papers. MFS was calculated based on statistics from SemCor (Miller et al (1993)).
– SAN-Tsatsaronis et al (2007): A spreading activation network method. SAN revisits the connectionist idea from the early days of WSD (Hirst (1984), Cottrell (1989)). Assuming that concepts are organized in the brain as an activation network, this idea, which originates from psychology, states in brief that the sense of a word is activated by the other words in its context.
– Sinha07-Sinha and Mihalcea (2007): A combination of graph centrality methods and semantic relatedness measures.
– NCM-Yuret and Yatbaz (2010): A noisy channel model drawn from a large untagged text corpus for unsupervised WSD. The model is first formulated in a coarse-grained framework and then disambiguates fine-grained senses using the most frequent senses heuristic.
– SVM-Lee et al (2004): A supervised method using state-of-the-art Support Vector Machines. The training and prediction are based on various kinds of features extracted from the context of the word to be disambiguated.
– GAMBL-Decadt et al (2004): A memory-based supervised method.
– SSI-Navigli and Velardi (2005): A state-of-the-art graph-based method. Parameters are trained using a sense-tagged corpus.
– ACA-Schwab and Guillaume (2011): A combinatorial optimization method using an Ant Colony Algorithm and Lesk-based semantic relatedness.
– ExtLesk-Ponzetto and Navigli (2010): An extended version of the Lesk algorithm augmented with rich knowledge from Wikipedia.


Table 6 Comparison with related works using recall.

Senseval-2 English all-words
System               Description        N     V     Adj.   Adv.   All
TSP-ACO              Knowledge-based    69.2  41.3  64.7   76.1   62.8
Degree+PPR+Lesk-w    Voting             72.4  43.9  66.5   79.4   65.5
SAN                  Knowledge-based    -     -     -      -      49.3
Sinha07              Voting             65.6  32.3  61.4   60.2   56.4
NCM                  Unsupervised+MFS   77.7  -     -      -      -
MFS                  Heuristic          72.0  45.3  68.6   83.1   66.5
SVM                  Supervised         -     -     -      -      65.6

Senseval-3 English all-words
System               Description        N     V     Adj.   Adv.   All
TSP-ACO              Knowledge-based    61.3  46.6  66.9   92.3   57.2
Degree+PPR+Lesk-w    Voting             66.6  51.4  67.7   100.0  61.4
Sinha07              Voting             60.5  40.6  54.1   100.0  52.4
NCM                  Unsupervised+MFS   72.9  -     -      -      -
MFS                  Heuristic          70.7  53.6  68.5   100.0  64.3
GAMBL                Supervised         70.8  59.3  65.3   100.0  65.2
SSI                  Knowledge-based    -     -     -      -      60.4

Semeval-2007 coarse-grained English all-words
System               Description        N     V     Adj.   Adv.   All
TSP-ACO              Knowledge-based    81.0  68.9  80.4   84.2   78.1
Degree+PPR+Lesk-w    Voting             83.5  75.1  81.9   83.7   81.1
ACA                  Knowledge-based    -     -     -      -      74.4
ExtLesk              Knowledge-based    79.4  -     -      -      -
TreeMatch            Unsupervised       -     -     -      -      73.6
MFS                  Heuristic          77.2  75.3  82.2   87.5   78.5
NUS-PT               Supervised         -     -     -      -      82.5
SSI                  Knowledge-based    84.1  -     -      -      83.2

– TreeMatch-Tran et al (2010): A hybrid of unsupervised and knowledge-based methods. TreeMatch uses dependency knowledge extracted from a large untagged text corpus to calculate similarity between the context and the examples in the glosses.
– NUS-PT-Chan et al (2007): Similar to SVM but augmented with knowledge extracted from parallel texts.

Our proposed method is novel in comparison with related methods, in particular ACA and ExtLesk, in important ways. Firstly, Word Sense Disambiguation was cast as a variant of the Traveling Salesman Problem; besides ACO, other techniques applicable to TSP could alternatively be investigated. Secondly, a novel semantic relatedness measure was proposed by combining the Lesk algorithm and the Vector Space Model. TSP-ACO outperforms both ACA and ExtLesk. On the coarse-grained evaluation, TSP-ACO beats MFS in the disambiguation of nouns. Our best combination system, Degree+PPR+Lesk-w, beats MFS on the fine-grained evaluation in the disambiguation of nouns (Senseval-2) and fully beats MFS on the coarse-grained evaluation. Degree+PPR+Lesk-w approaches the best supervised methods on both the fine-grained and coarse-grained evaluations. This result is very promising for knowledge-based and unsupervised methods, as no training was involved. (Table 6)


4.5 Future Works

TSP-ACO is comparable to state-of-the-art knowledge-based and unsupervised methods. Improving the performance and flexibility of TSP-ACO necessitates further work. Firstly, TSP-ACO could be trained in a weakly supervised way by optimizing its parameters with a sense-tagged corpus. Secondly, to be combined with other methods in a weighted-voting scheme, TSP-ACO must be able to yield ranked results, not binary decisions. Thirdly, the performance on verbs is remarkably lower than on the other parts of speech: either Lesk-based semantic relatedness is not the most suitable measure for verbs, or knowledge sources that are rich in relations between verbs should be employed. Last but not least, multiple knowledge sources were integrated into the proposed model, including ontology relations, dictionary glosses, and the distribution of vocabulary. Future works could consider the structure of the context and employ nearly free but valuable untagged text corpora, as in TreeMatch (Tran et al (2010)).

In this work, WSD was cast as a variant of TSP. On the one hand, we see that other lexical disambiguation tasks like Lexical Substitution or Word Domain Disambiguation could be similarly cast as a TSP; we intend to conduct experiments on these tasks. On the other hand, we applied the algorithm of Dorigo and Gambardella (1997) to solve TSP. It would be interesting to observe the results of contemporary studies on TSP applied to lexical disambiguation tasks in which the relatedness between lexical entities is calculable. As shown in this work, knowledge-based methods approach the best supervised methods on general datasets. In particular, performance above 80% in the coarse-grained evaluation facilitates the application of WSD to higher-level tasks, like Information Retrieval, Machine Translation, or Text Summarization. However, this is not the case in specific domains, such as environment (Agirre et al (2010)) or biomedicine (Yepes and Aronson (2010)). To improve the performance in a specific domain, as general-purpose breakthrough algorithms or highly customized sophisticated systems are not easy to accomplish, the integration of multiple knowledge sources is a simple yet efficient approach.

5 Conclusions

Word Sense Disambiguation was formulated as a variant of the Traveling Salesman Problem, and Ant Colony Optimization was successfully used to solve the proposed formulation. Various aspects of knowledge-based Word Sense Disambiguation, including context sensitivity, semantic relatedness, knowledge integration and representation, and the combination of methods, were discussed. Our method achieves state-of-the-art performance. Further improvements of the proposed method in particular, and future work on knowledge-based Word Sense Disambiguation in general, were suggested. This work paves the way for applying not only Ant Colony Optimization but also related studies on the Traveling Salesman Problem to Word Sense Disambiguation as well as other lexical disambiguation tasks.

Acknowledgements This work is partly supported by a scholarship of the Korean Government IT Scholarship Program (KGSP) from March 2008 to March 2012.


References

Agirre E, Soroa A (2009) Personalizing PageRank for Word Sense Disambiguation. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), Association for Computational Linguistics, Athens, Greece, pp 33–41
Agirre E, López de Lacalle O, Fellbaum C, Hsieh SK, Tesconi M, Monachini M, Vossen P, Segers R (2010) SemEval-2010 Task 17: All-Words Word Sense Disambiguation on a Specific Domain. In: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Uppsala, Sweden, pp 75–80
Banerjee S, Pedersen T (2002) An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, Springer-Verlag, London, UK, CICLing '02, pp 136–145
Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30:107–117
Buckley C (1985) Implementation of the SMART Information Retrieval System. Tech. rep., Ithaca, NY, USA
Budanitsky A, Hirst G (2006) Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Comput Linguist 32:13–47
Chan YS, Ng HT, Zhong Z (2007) NUS-PT: Exploiting Parallel Texts for Word Sense Disambiguation in the English All-Words Tasks. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association for Computational Linguistics, Prague, Czech Republic, pp 253–256
Cottrell GW (1989) A connectionist approach to word sense disambiguation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA
Cowie J, Guthrie J, Guthrie L (1992) Lexical disambiguation using simulated annealing. In: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT '91, pp 238–242
Decadt B, Hoste V, Daelemans W, Van den Bosch A (2004) GAMBL, genetic algorithm optimization of memory-based WSD. In: Mihalcea R, Edmonds P (eds) Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Association for Computational Linguistics, Barcelona, Spain, pp 108–112
Dorigo M, Gambardella LM (1997) Ant Colonies for the Traveling Salesman Problem
Edmonds P, Cotton S (2002) SENSEVAL-2: Overview. In: Proceedings of SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, 5-6 July 2001, Toulouse, France, pp 1–7
Gelbukh AF, Sidorov G, Han SY (2005) On Some Optimization Heuristics for Lesk-Like WSD Algorithms. In: NLDB, pp 402–405
Hatori J, Miyao Y, Tsujii J (2008) Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields. In: Coling 2008: Companion Volume: Posters, Coling 2008 Organizing Committee, Manchester, UK, pp 43–46
Hirst GJ (1984) Semantic interpretation against ambiguity. PhD thesis, Providence, RI, USA, AAI8422435
Ide N, Véronis J (1998) Introduction to the special issue on word sense disambiguation: the state of the art. Comput Linguist 24:2–40
Jiang JJ, Conrath DW (1997) Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: International Conference Research on Computational Linguistics (ROCLING X), pp 9008+
Kleinberg JM (1999) Hubs, authorities, and communities. ACM Comput Surv 31
Laporte G (1992) The traveling salesman problem: An overview of exact and approximate algorithms. European Journal of Operational Research 59(2):231–247
Leacock C, Chodorow M (1998) Combining local context and WordNet similarity for word sense identification. In: Fellbaum C (ed), MIT Press, pp 305–332
Lee YK, Ng HT, Chia TK (2004) Supervised Word Sense Disambiguation with Support Vector Machines and multiple knowledge sources. In: Mihalcea R, Edmonds P (eds) Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Association for Computational Linguistics, Barcelona, Spain, pp 137–140
Lesk M (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the 5th Annual International Conference on Systems Documentation, ACM, New York, NY, USA, SIGDOC '86, pp 24–26
Lin D (1998) An Information-Theoretic Definition of Similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML '98, pp 296–304
Mihalcea R, Moldovan DI (2001) eXtended WordNet: progress report. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, pp 95–100
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38:39–41
Miller GA, Charles WG (1991) Contextual correlates of semantic similarity. Language and Cognitive Processes 6:1–28
Miller GA, Leacock C, Tengi R, Bunker RT (1993) A semantic concordance. In: Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT '93, pp 303–308
Molina A, Pla F, Segarra E (2002) A Hidden Markov Model Approach to Word Sense Disambiguation. In: Proceedings of the 8th Ibero-American Conference on AI: Advances in Artificial Intelligence, Springer-Verlag, London, UK, IBERAMIA 2002, pp 655–663
Navigli R (2009) Word sense disambiguation: A survey. ACM Comput Surv 41:10:1–10:69
Navigli R, Velardi P (2005) Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation. IEEE Trans Pattern Anal Mach Intell 27:1075–1086
Navigli R, Litkowski KC, Hargraves O (2007) SemEval-2007 Task 07: Coarse-Grained English All-Words Task. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association for Computational Linguistics, Prague, Czech Republic, pp 30–35
Page L, Brin S, Motwani R, Winograd T (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab
Pedersen T, Banerjee S, Patwardhan S (2005) Maximizing Semantic Relatedness to Perform Word Sense Disambiguation
Pedersen T, Bruce R (1997) Distinguishing Word Senses in Untagged Text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-97), Providence, RI
Ponzetto SP, Navigli R (2010) Knowledge-Rich Word Sense Disambiguation Rivaling Supervised Systems. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Uppsala, Sweden, pp 1522–1531
Ponzetto SP, Strube M (2011) Taxonomy induction based on a collaboratively built knowledge repository. Artif Intell 175:1737–1756
Popescu M, Hristea F (2011) State of the art versus classical clustering for unsupervised word sense disambiguation. Artif Intell Rev 35:241–264
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 448–453
Ruiz-Casado M, Alfonseca E, Castells P (2005) Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets. pp 380–386, DOI 10.1007/11495772_59
Schwab D, Guillaume N (2011) A Global Ant Colony Algorithm for Word Sense Disambiguation Based on Semantic Relatedness. In: Pérez J, Corchado J, Moreno M, Julián V, Mathieu P, Canada-Bago J, Ortega A, Caballero A (eds) Highlights in Practical Applications of Agents and Multiagent Systems, Advances in Intelligent and Soft Computing, vol 89, Springer Berlin/Heidelberg, pp 257–264
Segond F, Schiller A, Grefenstette G, Chanod JP (1997) An Experiment in Semantic Tagging using Hidden Markov Model Tagging. In: ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pp 78–81
Sinha R, Mihalcea R (2007) Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity. In: Proceedings of the International Conference on Semantic Computing, IEEE Computer Society, Washington, DC, USA, pp 363–369
Snyder B, Palmer M (2004) The English all-words task. In: Mihalcea R, Edmonds P (eds) Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Association for Computational Linguistics, Barcelona, Spain, pp 41–43
Tran A, Bowes C, Brown D, Chen P, Choly M, Ding W (2010) TreeMatch: A fully unsupervised WSD system using dependency knowledge on a specific domain. In: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Stroudsburg, PA, USA, SemEval '10, pp 396–401
Tsatsaronis G, Vazirgiannis M, Androutsopoulos I (2007) Word sense disambiguation with spreading activation networks generated from thesauri. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 1725–1730
Vasilescu F, Langlais P, Lapalme G (2004) Evaluating Variants of the Lesk Approach for Disambiguating Words. In: Proceedings of LREC 2004, Lisbon, pp 633–636
Véronis J, Ide NM (1990) Word sense disambiguation with very large neural networks extracted from machine readable dictionaries. In: Proceedings of the 13th Conference on Computational Linguistics - Volume 2, Association for Computational Linguistics, Morristown, NJ, USA, COLING '90, pp 389–394
Wu Z, Palmer M (1994) Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA, ACL '94, pp 133–138
Yepes AJ, Aronson A (2010) Knowledge-based biomedical word sense disambiguation: comparison of approaches. BMC Bioinformatics 11(1):569+, DOI 10.1186/1471-2105-11-569
Yuret D, Yatbaz MA (2010) The noisy channel model for unsupervised word sense disambiguation. Comput Linguist 36:111–127
Zhang C, Zhou Y, Martin T (2008) Genetic Word Sense Disambiguation Algorithm. In: Proceedings of the 2008 Second International Symposium on Intelligent Information Technology Application - Volume 01, IEEE Computer Society, Washington, DC, USA, IITA '08, pp 123–127