Technical Report CS-2010-06 - CS Technion


Journal of Artificial Intelligence Research X (YYYY) 1-1

Submitted MM/YY; published MM/YY

Wikipedia-based Compact Hierarchical Semantics with Application to Semantic Relatedness

Sonya Liberman
Shaul Markovitch

[email protected] [email protected]

Department of Computer Science, Technion - Israel Institute of Technology, 32000 Haifa, Israel

Abstract

A proper semantic representation of words and texts underlies many text processing tasks. In this paper, we present a novel representation of semantics based on a hierarchical ontology of natural concepts derived from Wikipedia's articles and category system. Our method, called Compact Hierarchical Explicit Semantic Analysis (CHESA), generates compact hierarchical representations of unrestricted natural language texts. In comparison with previous methods for semantic representation, CHESA generates intuitive and comprehensible representations that allow deep semantic reasoning and understanding. CHESA representations are flexible with regard to their level of abstraction and compactness. We present a methodology for computing semantic relatedness using CHESA representations and evaluate CHESA on the task of semantic relatedness assessment of words and texts. Empirical results show that for compact representations, CHESA is superior to the previous state of the art.

1. Introduction

Digital publicly available textual data is vast. The World Wide Web consists of tens of billions of web pages, most of which contain textual content in the form of natural language. Intelligent access, manipulation and processing of such textual data is of great importance to a variety of tasks such as text categorization, word sense disambiguation and machine translation. One of the core issues of this challenge is representing language semantics in a way that can be manipulated by computers.

Statistical approaches for semantic representation of natural language perform an empirical analysis of textual data and construct statistical and probabilistic models of language. Most of these approaches do not utilize external textual data. They view a given text as a Bag of Words (BOW), namely, a text is represented as an unordered set of the words it contains. The BOW approach has been widely used in various information retrieval tasks such as search and text categorization (Baeza-Yates & Ribeiro-Neto, 1999; Sebastiani, 2002). Nevertheless, it performs sub-optimally on short texts, does little to address polysemy and synonymy, and is not able to perform high-level semantic reasoning.

It has long been recognized that in order to process textual data at a deeper level, computers require access to common-sense and domain-specific world knowledge (Buchanan & Feigenbaum, 1982; Lenat, Guha, Pittman, Pratt, & Shepherd, 1990). Indeed, considerable research efforts have been invested in developing methods for incorporating external knowledge into statistical NLP.
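As a point of reference for the BOW representation just mentioned, here is a minimal sketch in Python; the tokenizer and the example sentence are purely illustrative:

```python
from collections import Counter
import re

def bag_of_words(text):
    # Lowercase and split on letter runs; a real system would also remove
    # stop words and possibly apply stemming.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

print(bag_of_words("Money for nothing and the money supply"))
# Counter({'money': 2, 'for': 1, 'nothing': 1, 'and': 1, 'the': 1, 'supply': 1})
```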


Corpus-based methods (Landauer & Dumais, 1997; Rehder, Schreiner, Wolfe, Laham, Landauer, & Kintsch, 1998; Landauer, Foltz, & Laham, 1998; Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch, & Landauer, 1998; Foltz, 1996; Burgess, Livesay, & Lund, 1998; Lund & Burgess, 1996; Turney, 2001) leverage large external textual corpora and use word-document co-occurrence information from these corpora. For example, the Latent Semantic Analysis (LSA) approach (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990) applies singular value decomposition to this co-occurrence data, retrieving "latent concepts" and representing words and texts in the space defined by these concepts.

Various approaches (Budanitsky, Hirst, & Hirst, 2006; Jarmasz & Szpakowicz, 2003; Lee, Kim, & Lee, 1993; Jiang & Conrath, 1997; Leacock & Chodorow, 1998; Lin, 1998; Pedersen, Patwardhan, & Michelizzi, 2004; Wan & Peng, 2005) use lexicographic knowledge to represent semantics and assess semantic similarity. This knowledge is derived from lexical databases such as WordNet (Fellbaum, 1998) or Roget's Thesaurus (Roget, 1852), which encode important relations between words such as synonymy, hypernymy, and meronymy. Other approaches utilize more extensive knowledge repositories to represent semantics and compute semantic relatedness. Some of these approaches (Strube & Ponzetto, 2006; Schonhofen, 2006; Milne, 2007; Kaiser, Schwarz, & Jakob, 2009) use Wikipedia, while others (Danushka, Yutaka, & Mitsuru, 2007; Matsuo, Sakaki, Uchiyama, & Ishizuka, 2006; Chen, Lin, & Wei, 2006; Sahami & Heilman, 2006; Maguitman, Menczer, Erdinc, Roinestad, & Vespignani, 2006; Lakkaraju, Gauch, & Speretta, 2008; Gabrilovich & Markovitch, 2007a, 2005) utilize Web content and the Open Directory Project (ODP) repository.

The method of Explicit Semantic Analysis (ESA), introduced by Gabrilovich and Markovitch (Gabrilovich & Markovitch, 2006, 2007b, 2009), uses Wikipedia as a flat ontology, where each article represents an atomic concept. The semantics of a word is represented as a vector in the multi-dimensional space defined by Wikipedia articles. The weight of each dimension is inferred from the article text by computing the TF.IDF of the word within that text. A document is represented as the centroid of the representation vectors of its words. Experimental evaluation of ESA on semantic relatedness as well as text categorization tasks showed considerable improvements over previous approaches. Nevertheless, ESA has three notable drawbacks.

1. ESA representation is excessive. ESA represents word semantics as a weighted combination of all Wikipedia concepts, which can amount to tens of thousands of concepts. Many text processing tasks, e.g., commercial search engines and real-time categorization systems, need to process and index enormous amounts of textual content and then operate fast on these large corpora. The compactness of the semantic representations directly affects their performance, turning infeasible tasks into practicable and efficient ones. While it is possible to make ESA compact by selecting the top highly associated concepts (as done when using ESA for text categorization (Gabrilovich & Markovitch, 2006, 2007a)), this trimmed representation is far from satisfactory. For example, consider the top twenty concepts generated by ESA for the word money: 1. Money 2. Money, Money, Money 3. Money creation 4. Money for Nothing/Beverly Hillbillies 5. Make Money Fast 6. Money supply 7. Moneyness 8. Cash Money Records 9. Eddie Money 10. Money laundering


11. Money Mark 12. The Color of Money 13. Making Money 14. Take the Money and Run 15. Money market 16. I Get Money 17. The Money Programme 18. Money (That's What I Want) 19. Electronic money 20. $1 Money Wars. Observe that eleven out of the top twenty concepts simply do not express the semantic meaning of the word money, as they refer to novels (The Color of Money), songs (I Get Money), television programs (The Money Programme), etc. As the association score for every concept is determined by the textual content of the corresponding article, the most highly associated concepts are roughly those that contain the word "money" many times.

2. ESA is noisy. Even in its compact form, ESA contains many redundant and over-specific concepts. ESA is not able to differentiate between concepts representing the primal gists of the word (such as Money for the word money in the example above) and over-specific, noisy concepts (such as Money, Money, Money, which refers to an ABBA song), as the association scores for both types of concepts are high. As ESA generates each concept independently of the other concepts in the representation, redundancies often occur. For example, the top-ten concepts generated by ESA for the word car contain seven different car types (Concept car, Sports car, Armored car, Executive car, City car, Compact car and Full-size car) and two types of car number plates (Polish car number plates and Greek car number plates).

3. ESA is flat. ESA views Wikipedia as a flat conceptual ontology and constructs flat semantic representations, thus disregarding the inner structure and inter-dependencies of concepts. First, this is a notable weakness when addressing semantic relatedness of words and texts. When humans perform this task, they use their innate ability to generalize. For example, a human would easily determine that the words money and wealth are related, as they both trigger high-level concepts related to economics and sociology. Observe the top twenty concepts generated by ESA for the word wealth: 1. Wealth in the United States 2. Wealth 3. Distribution of wealth 4. Wealth management 5. Wealth (economics) 6. Sovereign wealth fund 7. The Wealth of Nations 8. Land of Wealth 9. Common Wealth Party 10. Share Our Wealth 11. Income redistribution 12. Millionaire 13. Wealth condensation 14. Wealth tax 15. Lakshmi 16. Adam Smith 17. Pareto distribution 18. Physiocrats 19. Per capita income 20. Nouveau riche. Note that there are no intersecting concepts between the ESA representations of money and wealth, as the information which allows generalization is absent. Secondly, the ability to reason about semantics at varying abstraction levels is highly beneficial for many tasks. For example, given a text categorization task of assigning scientific papers to either Computer Science or History categories, the semantic representation of the texts can be captured at a high level. For instance, it is sufficient to determine whether the paper is in the field of applied sciences or humanities. Given a different task of categorizing papers to either supervised learning or unsupervised learning topics, the semantic representation of the texts should be much more specific


[Figure 1: The compact hierarchical semantic representation generated by CHESA for the word money. Concepts shown: Main topic classifications, Social sciences, Economics, Macroeconomics and monetary economics, Financial economics, Political economy, Society, Sociology, Social institutions, Culture, Entertainment, Drama, Gambling, Gaming, Scandals, Leisure, Accountancy, Business, Finance, Business-related television channels.]

and fine-grained. Such multi-level, flexible reasoning cannot be achieved with a flat modeling of knowledge.

In this work we present a novel representation of semantics that overcomes these problems. Our method is called Compact Hierarchical Explicit Semantic Analysis (CHESA). It represents semantics as a compact hierarchical structure of pre-defined natural concepts, capturing semantics at different abstraction levels and constructing representations of any given size, depending on the task at hand. Figure 1 shows the CHESA semantic representation for the word money. The representation expresses the primal gist of the word, showing both specific (e.g., Macroeconomics and monetary economics) and high-level (e.g., Society) concepts triggered by it.

Numerous studies in the cognitive sciences address the question of how knowledge is represented in memory. Many theoretical models assume that cognitive processes and representations have a hierarchical structure (Cohen, 2000). Behavioral (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976; Bower, Clark, Lesgold, & Winzenz, 1969; Potter & Faulconer, 1975) as well as neurological (Dennis, 1976; Hart, Berndt, & Caramazza, 1985) evidence supports the notion of conceptual hierarchical structures in long-term memory. Studies on retention patterns of knowledge (Bahrick, 1984; Naveh-Benjamin, 1988; Conway, Cohen, & Stanhope, 1991; Cohen, Stanhope, & Conway, 1992; Stanhope, Cohen, & Conway, 1993) support the idea of inner hierarchical organization, as they reveal differences between the retention of higher-level information and fine-grained specific details. Moreover, humans' tendency to organize knowledge within hierarchies can be deduced from their handling of Web content. Yahoo! Directory^1, the Open Directory Project^2 (ODP), the Wikipedia category system^3 and numerous domain-specific repositories are hierarchical ontologies, assembled by humans in their attempt to organize information in a natural manner, easily perceived by them.

1. http://dir.yahoo.com
2. http://www.dmoz.org
3. http://www.wikipedia.org


Our approach leverages structured encyclopedic knowledge encoded within Wikipedia articles and categories, and uses the conceptual hierarchy inferred from this knowledge resource to represent text semantics. We start by defining a criterion for measuring the marginal contribution of a concept to the representation of a given word or text. We continue by presenting two greedy algorithms that construct compact hierarchical semantic representations of words based on that notion, and two extended algorithms that handle semantic representation of texts. Finally, we define a similarity metric between hierarchical semantic representations, evaluate CHESA on the task of semantic relatedness assessment, and show that it is superior to the previous state of the art for compact representations.

The rest of this paper is organized as follows: Section 2 describes the methodology of CHESA and presents the algorithms for representing the semantics of words and texts. Section 3 discusses the application of CHESA to automatic assessment of semantic relatedness. We evaluate CHESA and report on experimental results in Section 4. Related work is discussed in Section 5. We conclude in Section 6.

2. Compact Hierarchical Explicit Semantic Representation

Assume a pre-defined global hierarchical ontology, where nodes represent natural concepts and edges represent hyponymy or meronymy relations between these concepts. Assume that each node in that ontology is associated with some textual content. We represent the semantics of an input text t as a weighted sub-hierarchy within this global ontology. Namely, we draw a virtual separating curve on top of the hierarchy that determines which concepts are included in the semantic representation (those above the separating curve) and which are excluded (those below it). Moreover, every concept within the representation is assigned a weight which represents the strength of association between its textual content and the input text.

In the rest of this section we address the two aspects that arise from this formulation: what knowledge repository can be used as a source for such a global hierarchical ontology, and how to automatically construct a compact, comprehensive hierarchical semantic representation based on that ontology.

2.1 Wikipedia as a Conceptual Hierarchy

Wikipedia is the largest encyclopedia in the world. Its English version contains almost three million articles^4 and a vast system of categories. It exceeds the size of the next largest English-language encyclopedia, Encyclopedia Britannica, by more than 25 times.^5 Wikipedia provides a comprehensive source of world knowledge, organized within a taxonomy-like structure determined by its articles and categories. Every article in Wikipedia contains a textual description of a single topic. We thus view each Wikipedia article as defining a concept corresponding to that topic. For example, the article titled Dog, which describes the dog species, corresponds to the concept Dog. Articles in Wikipedia are usually classified under one or more categories.

4. http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia, visited on August 2009
5. http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, visited on August 2009


[Figure 2: A partial view of Wikipedia's conceptual graph. Concepts shown: Computer science, Cognitive science, Interdisciplinary fields, Vision, Artificial intelligence, Information science, Computer vision, Robotics, Knowledge representation, Ontology, Computational science, Linguistics, Computational linguistics, Speech recognition, Machine translation, WordNet.]

Categories are also often classified under other categories. For example, the article Cat is classified under the category Domesticated animals, which is, in turn, classified under the category Animals. We view Wikipedia categories as defining high-level concepts, which are more general than the concepts defined by the articles they categorize. Wikipedia categories are typically established based on hyponymy and meronymy relations. Figure 2 shows a small subgraph of the Wikipedia category system. However, as a taxonomic category structure is not strictly enforced, cycles or disconnected components are possible (though rare).

To simplify the original conceptual graph derived from Wikipedia and eliminate cycles, we remove a small part of the edges in the graph. Specifically, if a node can be reached by more than one path from the root, we retain only the shortest path, using a breadth-first-search traversal of the graph. If several paths have the same minimal length, we retain all of them. The specific algorithm can be found in Appendix ??. At the end of this procedure we obtain a rooted directed acyclic graph where inner nodes correspond to Wikipedia categories and leaf nodes correspond to Wikipedia articles.

We define the textual content of each concept v as follows:

TC(v) = \begin{cases} \{A_v\} & \text{if } v \text{ is a leaf} \\ \bigcup_{v' \in Children(v)} TC(v') & \text{otherwise} \end{cases}

where A_v is the article corresponding to the leaf concept v and Children(v) is the set of the direct child concepts of v.
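A minimal sketch of this construction, under the assumption that the category graph is given as a child-adjacency dictionary and that each leaf maps to its article (the data structures here are hypothetical simplifications of the real Wikipedia dump processing):

```python
from collections import deque

def prune_to_shortest_paths(children, root):
    """Keep only edges that lie on a shortest path from the root, following the
    BFS-based simplification described in Section 2.1."""
    depth = {root: 0}
    kept = {}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for child in children.get(v, ()):
            if child not in depth:                 # first (i.e., shortest) path found
                depth[child] = depth[v] + 1
                queue.append(child)
            if depth[child] == depth[v] + 1:       # keep every minimal-length path
                kept.setdefault(v, set()).add(child)
    return kept

def textual_content(v, kept, article_of, cache=None):
    """TC(v): the single article for a leaf, the union of the children's
    textual contents otherwise."""
    if cache is None:
        cache = {}
    if v not in cache:
        kids = kept.get(v, ())
        if not kids:
            cache[v] = {article_of[v]}
        else:
            cache[v] = set().union(*(textual_content(c, kept, article_of, cache) for c in kids))
    return cache[v]
```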


2.2 The Conditional Over-representation Criterion

Once the hierarchical ontology has been defined, the main question that needs to be addressed is which concepts from the ontology should be part of the semantic representation of a given text. We begin by addressing the simpler case of constructing a semantic representation of a single word w_i. Potentially, every concept containing w_i in one of the articles that constitute its textual content can be part of the representation, as it is associated with the word to some extent. But once a certain concept is part of the representation, does it affect the marginal contribution of other concepts, specifically its direct descendants, to the representation? For instance, if the concept Vehicles is part of the representation of the word car, how much does its child concept Automobiles contribute to that representation? And what is the contribution of its child concepts Spacecraft or Amphibious vehicles? Intuitively, the concept Automobiles provides substantial value to the semantic interpretation of the word car even in the presence of the concept Vehicles. The concept Spacecraft is unsuitable for describing the semantics of car. The concept Amphibious vehicles is weakly associated with the word car, but it does not contribute to grasping the meaning of car beyond the interpretation provided by the concept Vehicles.

Based on these intuitions, we define the conditional over-representation criterion Φ_{w_i}(v_1, v_2) on an edge in the conceptual hierarchy. Its value quantifies the marginal contribution of the child concept v_2 given that its parent concept v_1 is included in the semantic representation. We wish to formalize this marginal contribution through a statistical test and assess its significance. Recall that the textual content TC(v_1) of some concept v_1 is a superset of the articles that constitute the textual content TC(v_2) of its child concept v_2. We distinguish between two types of articles: those that contain the word w_i in their text and those that do not. We then model TC(v_2) as a sample from the population defined by TC(v_1), with marked items being the articles containing the word w_i. Our null hypothesis postulates that TC(v_2) is a random sample from the set of articles defined by TC(v_1). We then use a hypergeometric test to compute the probability of falsely rejecting the null hypothesis. As this probability decreases, it is an indication that w_i is over-represented in v_2 with respect to v_1.

A hypergeometric test assumes a population of size N in which n items are marked, and a random sample of size M drawn from that population. The number of marked items in the sample is described by a random variable X ~ HG(N, n, M) that follows a hypergeometric distribution with a probability mass function defined as

\Pr(X = k) = \frac{\binom{n}{k}\binom{N-n}{M-k}}{\binom{N}{M}}

The cumulative distribution function for the upper tail is

\Pr(X \geq k) = \sum_{i=k}^{M} \Pr(X = i)

Note that as k increases, the probability of observing k marked items or more in a random sample decreases.


Denote by TC_w(v) ⊆ TC(v) the subset of articles in the textual content of v that contain the word w. Given an input word w_i and a pair of concepts (v_1, v_2) which constitutes an edge in the conceptual hierarchy, we define the conditional over-representation criterion as

\Phi_{w_i}(v_1, v_2) = 1 - \Pr(X_{w_i,v_1,v_2} \geq |TC_{w_i}(v_2)|)

given that

X_{w_i,v_1,v_2} \sim HG(|TC(v_1)|, |TC_{w_i}(v_1)|, |TC(v_2)|)

Φ_{w_i}(v_1, v_2) is high when there is a low probability that |TC_{w_i}(v_2)| or more marked articles are found in a random sample. We can set a significance threshold on Φ_{w_i}(v_1, v_2) values, e.g., 0.95, to decide whether the concept v_2 should be included in the representation, given that v_1 is in the representation. We can also use Φ_{w_i} to rank concepts according to their marginal contribution to the representation.

For the aforementioned example of representing the semantics of the word car, the concept Vehicles has 10,529 articles overall, out of which 2,232 contain the word car. The concept Automobiles has 1,936 articles, out of which 1,577 contain the word car. Even without performing a statistical test, it is evident that obtaining this sample by chance is highly improbable, and indeed Φ_car(Vehicles, Automobiles) is very close to 1.0 for these concepts. On the other hand, the concept Spacecraft contains the word car in only one out of its 705 articles. Consequently, the value of Φ_car(Vehicles, Spacecraft) is very close to zero. The concept Amphibious vehicles contains the word car in 8 out of 51 articles. In this case, the value of Φ_car is slightly higher (Φ_car(Vehicles, Amphibious vehicles) = 0.217).

Note that the textual content of a leaf concept v is based on exactly one article, A_v, namely, |TC(v)| = 1. Thus, we define the conditional over-representation criterion for a leaf concept and its parent concept as a function of the raw number of the input word occurrences within the textual content of the concepts.^6 Let count(w, v) indicate the number of times a word w occurs in the textual content of a concept v. Then, given an input word w_i and a pair of concepts (v_1, v_2) where v_2 is a leaf, we define the conditional over-representation criterion as

\Phi_{w_i}(v_1, v_2) = 1 - \Pr(X_{w_i,v_1,v_2} \geq count(w_i, v_2))

given that

X_{w_i,v_1,v_2} \sim HG\left(\sum_j count(w_j, v_1),\; count(w_i, v_1),\; \sum_j count(w_j, v_2)\right)

6. We also tried this approach for the internal nodes of the graph, but the results did not change significantly. Therefore, for the sake of efficiency, we decided to keep the original method.
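A minimal sketch of this test using SciPy's hypergeometric distribution (the article counts below are the ones quoted in the text; the function and variable names are ours):

```python
from scipy.stats import hypergeom

def over_representation(tc_parent, marked_parent, tc_child, marked_child):
    """Phi_w(v1, v2) = 1 - Pr(X >= k) with X ~ HG(N, n, M), where
    N = |TC(v1)|, n = |TC_w(v1)|, M = |TC(v2)| and k = |TC_w(v2)|."""
    # hypergeom.sf(k - 1, N, n, M) gives the upper-tail probability Pr(X >= k)
    return 1.0 - hypergeom.sf(marked_child - 1, tc_parent, marked_parent, tc_child)

# The "car" examples from the text:
print(over_representation(10529, 2232, 1936, 1577))  # Vehicles -> Automobiles: close to 1.0
print(over_representation(10529, 2232, 705, 1))      # Vehicles -> Spacecraft: close to 0.0
print(over_representation(10529, 2232, 51, 8))       # Vehicles -> Amphibious vehicles: roughly 0.2
```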

2.3 Automatic Construction of Compact Hierarchical Semantic Representations

In this section we present two algorithms for constructing compact hierarchical representations of words. Both algorithms are based on the conditional over-representation criterion Φ. Their modus operandi, and consequently the semantic representations they generate, however, differ.

Both algorithms allow us to control the level of compactness of the representation. To achieve compactness, semantics is described at a more abstract level, using general concepts that appear high within the hierarchical ontology. As this constraint loosens, concepts located deeper within the hierarchy and expressing more specific topics become part of the representation.

2.3.1 The Top-Down CHESA Algorithm

The Top-Down CHESA algorithm is an iterative greedy algorithm that traverses Wikipedia's^7 conceptual hierarchy in a top-down manner and expands the representation by adding more specific concepts located deeper in the hierarchy. The process begins by representing the semantics of the input word w_i by the root concept. In each iteration the algorithm investigates all the edges (v_1, v_2) in the original hierarchy such that v_1 is in the representation and v_2 has not yet been added. It then selects the edge (v_1^max, v_2^max) with the maximal value of Φ_{w_i}. This edge, and the concept v_2^max, are then added to the representation. The algorithm terminates when the representation reaches a pre-defined size of k concepts. It can also receive a significance threshold for Φ_{w_i} as an input parameter. In that case, the algorithm terminates when all the edges considered in a given iteration have a Φ_{w_i} value below this threshold. The formal description of the procedure appears in Algorithm 1.

7. For convenience, in this section, we will refer to Wikipedia as the global conceptual hierarchy. However, this can be any hierarchy as described in the beginning of this section.

Algorithm 1 Top-Down CHESA
Input: An input word w_i, a rooted directed acyclic graph G(V, E), a root node r ∈ V and a representation size k
Output: A rooted directed acyclic subgraph G'(V', E') of k nodes
1: V' ← {r}
2: E' ← ∅
3: while |V'| < k do
4:     (v_1^max, v_2^max) ← argmax_{(v_1,v_2) ∈ E, (v_1,v_2) ∉ E', v_1 ∈ V'} Φ_{w_i}(v_1, v_2)
5:     E' ← E' ∪ {(v_1^max, v_2^max)}
6:     V' ← V' ∪ {v_2^max}
7: end while
8: return G'(V', E')

Note that the conceptual hierarchy may contain multiple inheritance. For this algorithm, it is sufficient for a concept to have a high Φ_{w_i} value with respect to at least one of its incoming edges in order to be included in the representation. Figure 3 shows the first five iterations of the top-down algorithm for the word money, as well as its compact representation of size ten.

2.3.2 The Bottom-Up CHESA Algorithm

The way the top-down algorithm expands the representation makes it susceptible to local minima. Specifically, when a concept has a low Φ_{w_i} value for all its outgoing edges, its descendant concepts will not be added to the representation, even if the Φ_{w_i} values of their outgoing edges are high.
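Before turning to the bottom-up variant, here is a minimal sketch of the top-down expansion of Algorithm 1, assuming the criterion is available as a function phi(word, parent, child) and the hierarchy as a collection of (parent, child) edges; both names are ours:

```python
def top_down_chesa(word, edges, root, k, phi):
    """Greedy top-down construction of a compact representation (Algorithm 1)."""
    nodes = {root}
    chosen = set()
    while len(nodes) < k:
        # Candidate edges: the parent is already represented, the edge is not yet chosen.
        candidates = [(p, c) for (p, c) in edges if p in nodes and (p, c) not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda e: phi(word, e[0], e[1]))
        chosen.add(best)
        nodes.add(best[1])
    return nodes, chosen
```

A significance-threshold variant can be obtained by filtering out candidate edges whose Φ value falls below the threshold before taking the maximum.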


[Figure 3: Top-Down construction of the semantic representation of the word money. Panels (a)-(e) show iterations 1-5, successively adding concepts such as Social sciences, Culture, Society and Economics under the root Main topic classifications; panel (f) shows the representation after iteration 10, containing Main topic classifications, Social sciences, Economics, Sociology, Culture, Society, Entertainment, Scandals, Leisure and Social institutions.]

One possible solution to this problem is to perform a lookahead. However, it is not clear how deep this lookahead should be. Another solution is to reverse the representation construction process and operate in a bottom-up manner, by removing concepts rather than adding them. The Bottom-Up CHESA algorithm is an iterative greedy algorithm that handles this procedure. As opposed to the Top-Down CHESA algorithm, it begins by representing the input word w_i with all Wikipedia concepts, namely the full conceptual hierarchy. At each iteration, the algorithm investigates all the incoming edges of the leaf nodes in the current representation. It then selects the edge which is the least element under the ordering relation ⪰_Φ defined below, and removes it. If, as a consequence, the leaf node connected through that edge has zero incoming edges in the representation, it is removed as well. The formal description of the procedure appears in Algorithm 2.

Definition 1 Let e = (v_1, v_2) and e' = (v_1', v_2') be edges in the representation and r be the root node. We define the ordering ⪰_Φ as: e ⪰_Φ e' if Φ_{w_i}(r, v_2) ≥ 0.95 and Φ_{w_i}(r, v_2') < 0.95, or Φ_{w_i}(v_1, v_2) ≥ Φ_{w_i}(v_1', v_2').

We do not simply select the edge with the minimal Φ_{w_i}, since its value only indicates the marginal contribution of the leaf node over its parent. Namely, w_i can be weakly associated with the leaf node, but even more weakly associated with its parent, resulting in a high Φ_{w_i}. We therefore also consider the "absolute" contribution of the leaf by computing its conditional over-representation with respect to the root.

Algorithm 2 Bottom-Up CHESA
Input: An input word w_i, a rooted directed acyclic graph G(V, E), a root node r ∈ V and a representation size k
Output: A rooted directed acyclic subgraph G'(V', E') of k nodes
1: V' ← V
2: E' ← E
3: while |V'| > k do
4:     Let (v_1^min, v_2^min) be the least element in {(v_1, v_2) ∈ E' : OutDegree_{G'}(v_2) = 0} under the ordering relation ⪰_Φ
5:     E' ← E' \ {(v_1^min, v_2^min)}
6:     if InDegree_{G'}(v_2^min) = 0 then
7:         V' ← V' \ {v_2^min}
8:     end if
9: end while
10: return G'(V', E')

It is evident that in many cases the top-down algorithm and the bottom-up algorithm produce different representations for the same input word and the same pre-defined representation size. In general, the top-down algorithm focuses on the primary meanings of the word and efficiently selects its most prominent gists within the conceptual hierarchy. However, the effect of local minima may lead to a situation where important concepts are absent from the representation. This may also be problematic when the topic coverage in Wikipedia is non-uniform. In that case, the top-down strategy sometimes focuses on only one out of the several interpretations of the input word. The bottom-up algorithm aims to deal with these problems by trimming the representation at places where a generalization is suitable, but keeping concepts (and automatically their ancestors) which have a high Φ_{w_i} value with respect to their parent and the root concept.

Nevertheless, the bottom-up algorithm has some drawbacks with respect to the top-down technique. First, the complexity of generating a bottom-up representation is proportional to the size of the full conceptual hierarchy, while the complexity of the top-down strategy is proportional to the pre-defined representation size. Moreover, all ancestors of a node that is not pruned from the representation are present in the representation, regardless of their strength of association with the input word. Thus, the representations may contain many concepts which are weakly associated with the input word. The latter drawback of the bottom-up representation motivates an assignment of weights to the concepts, which quantify the strength of their association with the input word. We present our weighting scheme in the following section.

Figures 22 and 23, in Appendix B, show the top-down and the bottom-up representations of size 50 for the word tiger, respectively. Observe that the top-down representation is dominated by sports-related concepts, while the biological context of tiger is absent. This is due to the vast coverage of sports topics in Wikipedia, including numerous sports teams and clubs, many of which have the word tiger in their title.

This non-uniform coverage results in a low Φ_tiger(Main topic classifications, Natural sciences) score, and thus concepts from this part of the hierarchy are not added to the representation. The bottom-up representation, on the other hand, contains the concepts Natural sciences, Biology, Zoology, Mammalogy, Mammals and Carnivores, revealing the additional meaning of the word. Sports-related concepts are also present, but they are mostly high-level, and are not as abundant as in the top-down representation.

2.4 Assignment of Association Scores

Once a semantic representation is constructed, every concept in the representation is assigned a weight, indicating the strength of its association with the word being represented. This association score is a function of the frequency of the word in the textual content of the concept, and its global frequency in the conceptual hierarchy. We define the frequency of the word w_i in the concept v as

tf(w_i, v) = \frac{count(w_i, v)}{\sum_j count(w_j, v)}

We define the association score of an input word w_i and a concept v as follows:

s(w_i, v) = \max\left(0, \log \frac{tf(w_i, v)}{tf(w_i, r)}\right)

where r is the root concept, which contains the complete textual content of the hierarchy. Note that the association score is zero when the frequency of w_i in the concept v is equal to or lower than its frequency in the conceptual hierarchy, namely in the root.

2.5 Compact Hierarchical Semantic Representation of Texts

In the previous sections we described two algorithms for generating semantic representations of words. In the following, we present two extended algorithms for generating compact hierarchical semantic representations of texts.

2.5.1 The Union Algorithm

The Union CHESA algorithm represents an input text as the union of the compact representations of its words. Formally, given a text t = ⟨w_1, ..., w_n⟩, let G_i(V_i, E_i) be a rooted directed acyclic graph, and s(w_i, v), v ∈ V_i a weighting function, that together constitute the CHESA representation of the word w_i. We define the representation of the text t as a rooted weighted directed acyclic graph G(V, E) such that

V = \bigcup_{i=1}^{n} V_i \qquad E = \bigcup_{i=1}^{n} E_i

We define the association score of the text t and a concept v as follows:

s(t, v) = \sum_{i=1}^{n} s(w_i, v)
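A sketch of the association score of Section 2.4 and of the union step above; the shapes of the per-word representations are hypothetical:

```python
import math
from collections import defaultdict

def association_score(tf_in_concept, tf_in_root):
    """s(w, v) = max(0, log(tf(w, v) / tf(w, r)))."""
    if tf_in_concept <= 0.0 or tf_in_root <= 0.0:
        return 0.0
    return max(0.0, math.log(tf_in_concept / tf_in_root))

def union_representation(word_reprs):
    """Union CHESA: merge the per-word subgraphs and sum the per-word scores.
    word_reprs: iterable of (nodes, edges, scores) triples, one per word of the text."""
    nodes, edges = set(), set()
    text_scores = defaultdict(float)
    for w_nodes, w_edges, w_scores in word_reprs:
        nodes |= w_nodes
        edges |= w_edges
        for concept, score in w_scores.items():
            text_scores[concept] += score          # s(t, v) = sum_i s(w_i, v)
    return nodes, edges, dict(text_scores)
```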

Obviously, the union method can use both the top-down and the bottom-up algorithms for the intermediate step of representing single words.

2.5.2 The Direct Representation Algorithm

The Direct method represents the semantics of a text in a direct manner, by defining a new conditional over-representation criterion Φ. Specifically, we modify the notion of marked items within the textual content of a concept, which previously referred to articles containing the target word to be represented. Denote by TC_{t,i}(v) ⊆ TC(v) the set of articles in the textual content of a concept v that contain at least i words from the input text t, where i is a pre-defined parameter. We define the conditional over-representation criterion for a text t and a pair of concepts (v_1, v_2) as follows:

\Phi_t(v_1, v_2) = 1 - \Pr(X_{t,v_1,v_2} \geq |TC_{t,i}(v_2)|)

given that

X_{t,v_1,v_2} \sim HG(|TC(v_1)|, |TC_{t,i}(v_1)|, |TC(v_2)|)

Both the top-down and the bottom-up algorithms can be used to directly construct the semantic representation using the new definition of Φ. Figure 4 shows the compact hierarchical representation of twenty concepts for the text fragment "A recently discovered type of star follows some wild orbits around the Milky Way".^8 We define the association score of an input text t = ⟨w_1, ..., w_n⟩ and a concept v as follows:

s(t, v) = \max\left(0, \log \frac{\sum_{i=1}^{n} tf(w_i, v)}{\sum_{i=1}^{n} tf(w_i, r)}\right)

where r is the root concept, which contains the complete textual content of the hierarchy.

8. The representation was constructed using the top-down algorithm and i = 3.
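A sketch of the modified marking used by the Direct method, assuming article texts are pre-tokenized into sets of words; these counts can then be plugged into the same hypergeometric test as in Section 2.2:

```python
def marked_article_count(article_token_sets, text_words, i):
    """|TC_{t,i}(v)|: the number of articles in TC(v) whose text contains at least
    i distinct words of the input text t."""
    text_words = set(text_words)
    return sum(1 for tokens in article_token_sets if len(text_words & tokens) >= i)
```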

3. Using Compact Hierarchical Semantic Representation for Computing Semantic Relatedness

In this section we discuss the application of our novel hierarchical semantic representation methodology to the automatic assessment of semantic relatedness. Computing semantic relatedness between texts underlies many fundamental tasks in the field of text processing, such as text categorization, information retrieval and word sense disambiguation. We present a two-phase, semantic approach for computing semantic relatedness. Given two words or texts, we generate their compact hierarchical semantic representations using the CHESA algorithms presented in Section 2. These representations are then compared to deduce the semantic relatedness of the original texts.


[Figure 4: A semantic representation of text generated by the Direct Top-Down CHESA algorithm. Concepts shown: Main topic classifications, Natural sciences, Belief, Astronomy, Astronomical objects, Supermassive black holes, Constellations, Astronomical catalogues, Reticulum constellation, Sagitta constellation, Nature, Universe, Space, Extrasolar planet, Science, Astronomical naming conventions, Large-scale structure of the cosmos, Dark matter, Asterism (astronomy).]

We define the semantic relatedness between two input texts, t_1 and t_2, as the cosine similarity between their linearized CHESA representations. Formally,

rel(t_1, t_2) = \frac{\sum_{v \in V} s(t_1, v)\, s(t_2, v)}{\sqrt{\sum_{v \in V} s^2(t_1, v)}\ \sqrt{\sum_{v \in V} s^2(t_2, v)}}    (1)

where V is the set of all the concepts in the global hierarchy.

It may seem that our definition of relatedness does not exploit the hierarchical structure of CHESA representations. Indeed, there exist other measures that use hierarchical information explicitly. Such measures, e.g., generalized cosine similarity (Ganesan, Garcia-Molina, & Widom, 2003) and Earth-mover's distance (Rubner, Tomasi, & Guibas, 2000; Wan & Peng, 2005), compute similarity between collections whose elements correspond to leaf nodes in some pre-defined global hierarchy. They then consider the relative proximity of the elements in that hierarchy to compute similarity. For example, generalized cosine similarity extends the cosine similarity measure by dropping the assumption that different elements correspond to orthogonal dimensions. It defines a positive inner product between different dimensions, which is a function of the proximity of the corresponding elements in the hierarchy.

We illustrate the contribution of explicitly considering hierarchical information in the following example. Assume a three-level hierarchy as shown in Figure 5. Now, consider the following two vector pairs, where the coordinates correspond to the leaves (a, b, c, d) in that hierarchy.

[Figure 5: A three level hierarchy. The root A has children B and C; B has the leaves a and b, and C has the leaves c and d.]

• v_1 = (1, 0, 1, 0), v_2 = (0, 1, 0, 1) (Figure 6(a))
• u_1 = (1, 1, 0, 0), u_2 = (0, 0, 1, 1) (Figure 6(b))

The cosine similarity for both vector pairs is zero. However, the generalized cosine similarity, which considers the hierarchy, computes a positive similarity score for both pairs. Moreover, it assigns a higher score to the first pair, as a and b (in v_1 and v_2, respectively) are located at greater proximity than a and c (in u_1 and u_2, respectively). The same occurs for c and d (in v_1 and v_2, respectively) compared to b and d (in u_1 and u_2, respectively).

[Figure 6: Example of measuring similarity between collections whose elements correspond to leaves of a hierarchy. Panel (a) shows the first pair of collections ({a, c} and {b, d}); panel (b) shows the second pair ({a, b} and {c, d}).]

Although it seems natural to use the aforementioned measures to compute similarity between CHESA representations, these measures are unsuitable for two reasons:

1. These measures assume a model where the collections they compare correspond to leaf nodes of some underlying hierarchical structure. CHESA representations do not coincide with such a model. Internal nodes of the Wikipedia conceptual hierarchy are explicitly present within the representations.


In fact, CHESA representations (in particular the compact ones) are mostly dominated by internal concepts and not by the leaf nodes of the conceptual hierarchy.

2. CHESA algorithms ensure that a concept cannot be part of the representation unless its ancestors are part of it. By considering both internal and leaf nodes, cosine similarity is capable of implicitly capturing the hierarchical structure.

As an example, observe the following two vector pairs, where the coordinates correspond to all the nodes (A, B, C, a, b, c, d) in the hierarchy shown in Figure 5.

• v_1' = (1, 1, 1, 1, 0, 1, 0), v_2' = (1, 1, 1, 0, 1, 0, 1) (Figure 7(a))
• u_1' = (1, 1, 0, 1, 1, 0, 0), u_2' = (1, 0, 1, 0, 0, 1, 1) (Figure 7(b))

Note that the elements defined by v_1', v_2', u_1' and u_2' are a combination of the elements defined by v_1, v_2, u_1 and u_2, respectively, and their appropriate ancestors. The cosine similarity for both pairs is positive, and it is higher for the first pair, which coincides with the results achieved by the generalized cosine similarity measure in the previous example.
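The claim is easy to verify numerically; a small sketch with the vectors above:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Coordinates correspond to the nodes (A, B, C, a, b, c, d) of Figure 5.
v1p, v2p = (1, 1, 1, 1, 0, 1, 0), (1, 1, 1, 0, 1, 0, 1)
u1p, u2p = (1, 1, 0, 1, 1, 0, 0), (1, 0, 1, 0, 0, 1, 1)

print(cosine(v1p, v2p))   # 0.6  -- positive, and higher than the second pair
print(cosine(u1p, u2p))   # 0.25 -- positive as well
```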

[Figure 7: Example of measuring similarity between collections whose elements correspond to leaves of a hierarchy and their ancestor inner nodes. Panel (a) shows the first pair of collections; panel (b) shows the second pair.]

Using cosine similarity for assessing semantic relatedness is also advantageous in terms of computational time complexity, as it is linear in the size of the representations, while using measures that exploit proximities within a hierarchy amounts to at least quadratic complexity. In terms of space complexity, there is no need to store the complete Wikipedia conceptual hierarchy, but only the CHESA representations of the texts, meaning that compact representations amount to fast computation of semantic relatedness and low storage costs.
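A minimal sketch of Equation 1 over sparse representations stored as concept-to-score dictionaries, which is what makes the computation linear in the representation size:

```python
import math

def relatedness(scores1, scores2):
    """Cosine similarity between two linearized CHESA representations (Equation 1)."""
    small, large = (scores1, scores2) if len(scores1) <= len(scores2) else (scores2, scores1)
    dot = sum(s * large.get(concept, 0.0) for concept, s in small.items())
    norm1 = math.sqrt(sum(s * s for s in scores1.values()))
    norm2 = math.sqrt(sum(s * s for s in scores2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```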

4. Empirical Evaluation

In this section we evaluate CHESA by performing both a quantitative and a qualitative analysis of the method on the task of semantic relatedness assessment, and compare it to ESA, the previous state of the art, as well as to other previous work. Note that semantic relatedness and semantic similarity are different notions. It is possible that two terms are related but not similar; e.g., Obama and Washington are not similar but are considered to be related. We address the issue of semantic similarity in Section 4.6.

4.1 Experimentation Procedure

Assessing semantic relatedness of words and texts is a difficult task for computers but a natural, everyday task for humans. Moreover, studies have shown that the inter-agreement between different individuals on such tasks is very high (Budanitsky et al., 2006; Jarmasz & Szpakowicz, 2003; Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman, & Ruppin, 2002). This is not surprising, as it is this agreement that allows us to understand each other and communicate. Therefore, we consider human judgements as a gold standard and evaluate our method in terms of the correlation of the computed relatedness scores with the scores assigned by humans.

We use two benchmarks for assessing semantic relatedness of words and texts. The WordSimilarity353 collection (Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman, & Ruppin, 2001; Finkelstein et al., 2002) consists of 353 noun pairs. Each pair is judged by 13-16 judges, who are university graduates with fluent knowledge of the English language. The relatedness scores are given on a scale of 0 (completely unrelated words) to 10 (very much related or identical words), and they are averaged for each word pair to produce a single score. The Lee collection (Lee, Pincombe, & Welsh, 2005; Pincombe & Surveillance, 2004) consists of 50 documents from the Australian Broadcasting Corporation's news mail service. The documents are paired in all possible ways and judged for semantic relatedness by 83 university students. The judgements are on a five-point scale, with 1 indicating highly unrelated texts and 5 indicating highly related texts.

To compute correlation with human judgements on the WordSim353 benchmark, we use Spearman's rank-order correlation coefficient. Being non-parametric, it is considered to be more robust than Pearson's linear correlation coefficient. Note, however, that the calculations in Spearman's measure become inaccurate as the number of tied ranks increases. Thus, Spearman's correlation is not appropriate for the Lee benchmark, where there are only 67 distinct values among the 1,225 relatedness scores. Therefore, we use Pearson's linear correlation to evaluate CHESA on this benchmark. The aforementioned collections are, to the best of our knowledge, the largest publicly available collections for assessment of semantic relatedness, and are widely used to evaluate methods for assessing semantic relatedness (Jarmasz & Szpakowicz, 2003; Finkelstein et al., 2002; Strube & Ponzetto, 2006; Hughes & Ramage, 2007; Gabrilovich & Markovitch, 2007b; Lee et al., 2005).
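The correlation computation itself is standard; a sketch with SciPy, where the score lists are placeholders rather than actual benchmark data:

```python
from scipy.stats import spearmanr, pearsonr

human_scores  = [7.5, 1.2, 9.0, 4.3]      # averaged human judgements per pair (placeholders)
system_scores = [0.62, 0.05, 0.88, 0.41]  # relatedness scores computed by the system

rho, _ = spearmanr(human_scores, system_scores)   # rank-order correlation (WordSim353)
r, _ = pearsonr(human_scores, system_scores)      # linear correlation (Lee collection)
print(rho, r)
```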


Number of nodes: 557,114
Number of inner nodes: 79,886
Number of leaf nodes: 477,228
Number of distinct terms: 730,489

                                        Average    Minimum    Maximum
Out-degree                              17.16      1          5,041
In-degree                               2.46       1          51
Depth                                   6.59       1          15
Number of distinct terms per article    265.85     100        4,216
Number of terms per article             668.12     115        462,076

Table 1: Properties of Wikipedia's conceptual hierarchy based on the October 18, 2007 snapshot

Additional benchmarks exist which assess semantic similarity rather than semantic relatedness, e.g., the Li sentence collection (Li, McLean, Bandar, O'Shea, & Crockett, 2006). These benchmarks will be discussed in Section 4.6.

We implemented our CHESA approach using a Wikipedia snapshot as of October 18, 2007. Wikipedia is publicly available for download in an XML format.^9 We parsed its XML dump using the Wikipedia Processor (WikiPrep) tool (Gabrilovich & Markovitch, 2006, 2007b) and obtained 6.8 gigabytes of text. Our data includes over 2 million Wikipedia articles and almost 300,000 categories. We follow in the footsteps of Gabrilovich and Markovitch (Gabrilovich & Markovitch, 2006) and discard overly small articles having fewer than 100 non-stop words, and articles having fewer than 5 incoming or outgoing links. We also exclude other types of articles and categories that are unlikely to be useful for describing semantics, for example articles and categories referring to a specific date or an event occurring in a specific year (e.g., 1961 plays, 1820s in fashion), or categories which clearly do not express hypernymy relations (e.g., Writers by nationality and British musicians by instrument). We also disregard lists and stubs. At the end of this process our knowledge base contained 497,153 articles and 125,542 categories. We then used the procedure described in Section 2.1 to construct a conceptual hierarchy rooted at the Main topic classifications concept. Table 1 summarizes the properties of the resulting conceptual hierarchy. To compute the statistics on the out-degree we considered only the internal nodes. To compute the statistics on node depth, we considered only the leaf nodes.

9. http://download.wikimedia.org/enwiki/

4.2 Evaluating CHESA on Semantic Relatedness of Words

In this section we report the results of an experimental evaluation of CHESA on semantic relatedness of words. We evaluate the performance of CHESA on the WordSimilarity353 collection and perform two experiments. In the first experiment, we use the top-down CHESA algorithm (Section 2.3.1) to generate compact representations, given different pre-defined values of representation size k. In the second experiment we use the bottom-up CHESA algorithm (Section 2.3.2) to generate the representations.


[Figure 8: Using CHESA for semantic relatedness of words with varying representation size. The plot shows correlation with human judgements (y-axis) against representation size (x-axis) for Top-Down CHESA, Bottom-Up CHESA and ESA.]

For both experiments we use Equation 1 to compute semantic relatedness. For comparison, we compute semantic relatedness using ESA compact representations by taking the top-k most highly associated articles from the ESA interpretation vector.^10

Figure 8 shows the performance of the algorithms with varying values of representation size. We see that both the top-down and the bottom-up CHESA algorithms significantly outperform ESA for compact representations.^11 We believe that the advantage of CHESA is the hierarchical structure of its representations, which captures semantics at different abstraction levels, thus better revealing the various relations between the input words. In Section 4.5 we perform a qualitative analysis that demonstrates this phenomenon. The results also show that with tighter constraints on representation size, the top-down CHESA algorithm is superior to the bottom-up algorithm. As representation size increases, the bottom-up algorithm outperforms the top-down algorithm, although the difference is not statistically significant.

In addition to experimenting with varying values of k, we evaluate the CHESA algorithms in a setting of unlimited representation size, namely we represent semantics with the full hierarchical representation. Note that the top-down and the bottom-up CHESA algorithms produce identical semantic representations of words when representation size is unlimited.

10. Both CHESA and ESA representations are constructed based on the same Wikipedia snapshot from October 18, 2007.
11. The difference is statistically significant with p-value < 0.05 for k ≤ 100 for top-down CHESA and k ≤ 200 for bottom-up CHESA.


Algorithm                                          Correlation
WordNet (Jarmasz & Szpakowicz, 2003)               0.35
Roget's Thesaurus (Jarmasz & Szpakowicz, 2003)     0.55
LSA (Finkelstein et al., 2002)                     0.56
WikiRelate! (Strube & Ponzetto, 2006)              0.50
MarkovLink (Hughes & Ramage, 2007)                 0.55
CHESA                                              0.72
ESA (Gabrilovich & Markovitch, 2007b)              0.74

Table 2: Computing semantic relatedness of words when representation size is unlimited

For comparison, we compute semantic relatedness using ESA without trimming its interpretation vectors. Table 2 shows the results of both CHESA algorithms, ESA and five additional previous methods.^12 We see that when representation size is unlimited, ESA slightly outperforms both CHESA algorithms; however, the difference is not statistically significant. ESA and CHESA yield significant improvements^13 over the other methods. Note that CHESA compact representations of size above 100 concepts outperform these previous methods.

4.3 Evaluating CHESA on Semantic Relatedness of Texts

To evaluate CHESA on the assessment of semantic relatedness of texts, we use the Lee document collection and conduct four experiments. In the first two, we generate compact representations of the texts using the Union algorithm (Section 2.5.1) based on the top-down and the bottom-up representations of words. In the other two experiments we generate CHESA representations using the Direct algorithm (Section 2.5.2), again using both the top-down and the bottom-up techniques. For all experiments we use Equation 1 to compute semantic relatedness.

Recall that the Union algorithm generates a representation for an input text by first representing each word of the text with a compact representation of size k, and then unifying these representations. The size of the final representation of the text is thus between k and nk, where n is the number of words in the text. The Direct algorithm, on the other hand, constructs the semantic representation directly, without an intermediate step of representing the single words of the text. It uses a modified Φ function, and applies either the top-down or the bottom-up algorithm to construct a representation of a pre-defined size k.

For comparison, we also compute semantic relatedness using ESA representations of the texts. Note that ESA represents a text as the centroid of the interpretation vectors of its words. It first represents the semantics of each word within the text, and then averages the vectors generated for each word. To generate compact representations using ESA, we can either prune the interpretation vectors to a pre-defined size k at the level of single words, or at the level of the final text interpretation vector (using the full vectors when representing each word).

12. The results of the five additional previous methods are taken from (Gabrilovich & Markovitch, 2007b).
13. The difference is statistically significant with p-value < 0.05.


[Figure 9: Using Union-CHESA for semantic relatedness of texts with varying representation size. The plot shows correlation with human judgements (y-axis) against representation size (x-axis, up to roughly 3 x 10^4 concepts) for Top-Down CHESA, Bottom-Up CHESA and ESA.]

The Union algorithm is more similar to the former version of ESA, and thus we use this version when comparing ESA to Union CHESA. The Direct algorithm is more similar to the latter version, and is therefore compared to it.

Figure 9 shows the results for evaluating the Union CHESA and ESA algorithms. Note that when comparing the Union algorithms to ESA, we use an inner parameter k that specifies the size of the representations of each word in the text. The final text representations for both Union CHESA and ESA are of size between k and nk (where n is the number of words in the text). The results show the performance of the algorithms for k = 10, 50, 100, 200, 300, 500, 1000, where for each k we compute the average representation size for all the documents in the Lee collection, as well as the correlation score with human judgements. We see that in general, both Union CHESA algorithms outperform ESA and that the top-down algorithm is superior to the bottom-up algorithm. We note that for very small representation sizes ESA outperforms CHESA. We hypothesize that this phenomenon occurs since CHESA achieves compactness by using more abstract concepts. While this very compact description is often sufficient to express the semantics of single words (CHESA outperforms ESA for very compact representations in the task of word relatedness), it is not able to efficiently capture the semantics of a complex text.

Figure 10 shows the results for evaluating the Direct CHESA and the ESA algorithms.^14 We see that the bottom-up algorithm outperforms ESA for compact representations of 50 concepts,^15 and the gap between them diminishes as representation size increases.

14. For the Direct CHESA algorithm we set i = 5.
15. The difference is significant at p-value < 0.05 for k = 50.


[Figure 10: Using Direct-CHESA for semantic relatedness of texts with varying representation size. The plot shows correlation with human judgements (y-axis) against representation size (x-axis) for Top-Down CHESA, Bottom-Up CHESA and ESA.]

Algorithm                                  Correlation
Bag of words (Lee et al., 2005)            0.35
LSA (Lee et al., 2005)                     0.55
Union CHESA                                0.60
Direct CHESA                               0.70
ESA (Gabrilovich & Markovitch, 2007b)      0.72

Table 3: Computing semantic relatedness of texts when representation size is unlimited

The performance of the top-down algorithm is comparable to that of ESA for representations above the size of 50 concepts. For very compact representations of size 10, ESA outperforms both Direct CHESA algorithms.

In addition, we evaluate the Union and the Direct CHESA algorithms in a setting of unlimited representation size. The results are shown in Table 3. We see that ESA slightly outperforms the Direct CHESA algorithm; however, the difference is not statistically significant. Both algorithms outperform the Union CHESA algorithm and previous methods (p-value < 0.05).


4.4 The Effect of Different Distance Measures on CHESA Performance

In Section 3 we discussed the various possibilities for computing the distance between two CHESA representations, and claimed that the cosine similarity measure captures the hierarchical inner structure of the representations. To test this hypothesis we compared cosine similarity to the generalized cosine similarity (Ganesan et al., 2003). The generalized cosine similarity measure takes into account only the leaf concepts of the representations, and computes a similarity score based on these concepts and their proximities in the global hierarchy. We also tested the importance of using the association scores of the concepts by comparing cosine similarity to the Jaccard measure (Jaccard, 1901), which disregards weighting. To assess semantic relatedness using the cosine similarity and Jaccard measures, we consider both the internal and the leaf concepts within the CHESA representations. When using the generalized cosine similarity, we consider only the leaf concepts (the concepts for which none of the children is in the representation).

Figures 11(a) and 11(b) show the results for assessing relatedness of words with the three similarity measures, using the top-down and bottom-up algorithms respectively. Figures 11(c) and 11(d) show the corresponding results for assessing relatedness of texts. The graphs show that the cosine similarity and Jaccard measures are superior to the generalized cosine similarity measure for both words and texts. Cosine similarity significantly outperforms Jaccard on assessing semantic relatedness of words with the bottom-up algorithm, and is slightly superior to it in the other three experiments. These results support our initial choice of cosine similarity as the measure for comparing CHESA representations. They indicate that applying cosine similarity to the full hierarchical representations is advantageous over explicitly using the underlying hierarchical structure while considering the leaf nodes only. Moreover, they illustrate the importance of assigning concept association scores.
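To make the compared ingredients concrete, the following is a minimal sketch of the leaf-concept extraction used by the generalized cosine similarity and of the unweighted Jaccard measure. It assumes a representation is a dict mapping concept identifiers to association scores, and that children_of is a hypothetical lookup returning the children of a concept in the global hierarchy; the generalized cosine score itself, which also uses concept proximities, is omitted.

def leaf_concepts(representation, children_of):
    # Leaf concepts of a CHESA representation: concepts none of whose
    # children in the global hierarchy appear in the representation.
    concepts = set(representation)
    return {c for c in concepts
            if not any(child in concepts for child in children_of(c))}

def jaccard_similarity(u, v):
    # Jaccard coefficient over the concept sets; unlike the cosine
    # measure, it disregards the association scores.
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if (a | b) else 0.0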

Figure 11: The effect of different similarity measures on CHESA performance for semantic relatedness (correlation with human judgements as a function of representation size, for cosine similarity, generalized cosine similarity and Jaccard). (a) Top-down representation of words; (b) bottom-up representation of words; (c) top-down representation of texts; (d) bottom-up representation of texts.

4.5 Qualitative Analysis of CHESA

To better understand the specific properties of CHESA and to illustrate its strength in generating compact, comprehensible representations and assessing semantic relatedness, we present a qualitative analysis of our method using real examples.

1. Money vs. Wealth
Consider the words money and wealth from the WordSimilarity353 collection. This pair was rated by human judges as highly related (among the top 40 related pairs). Figure 12 shows the CHESA top-down representations of size 10 for these words. We see that in a limited space of ten concepts, the CHESA representations capture the core semantic gists of the words on several abstraction levels. The two representations share several concepts (marked in bold) such as Economics and Sociology, indicating the semantic relatedness between the two words. For comparison, Table 4 shows the top-10 concepts generated by ESA for each of the words. We can see that the ESA interpretation vectors for these two words do not share any concepts, and thus the relatedness score computed using cosine similarity is zero.

2. Vodka vs. Brandy
Another pair that is judged by humans as very related in the WordSimilarity353 collection is vodka and brandy. Figure 13 shows the top-down representations of size 10 of the two words. The two representations share many concepts, such as Alcoholic beverages, Mixed drinks and their ancestors. The concepts Vodka and Brandy are unique to each representation and indicate the differences in the semantics of the words. Table 5 shows the top-10 concepts generated by ESA for the two words. We see that the generated concepts are overly specific and that the compact representations have no intersecting concepts. Thus, ESA is unable to capture the relation between the words. Moreover, the representation of the word brandy has only one concept associated with its main context of an alcoholic beverage, while the other concepts refer to people.


       Money                                     Wealth
1      Money                                     Wealth in the United States
2      Money, Money, Money                       Wealth
3      Money creation                            Distribution of wealth
4      Money for Nothing/Beverly Hillbillies     Wealth management
5      Make Money Fast                           Wealth (economics)
6      Money supply                              Sovereign wealth fund
7      Moneyness                                 The Wealth of Nations
8      Cash Money Records                        Land of Wealth
9      Eddie Money                               Common Wealth Party
10     Money laundering                          Share Our Wealth

Table 4: ESA top-10 generated concepts for the words money and wealth

       Vodka                                     Brandy
1      Vodka                                     Brandy
2      Absolut Vodka                             Albrecht Brandi
3      Smirnoff                                  George Brandis
4      SKYY vodka                                Kristina Brandi
5      Bloody Mary (cocktail)                    Brandy (entertainer)
6      Gin (Case Closed)                         Brandi Chastain
7      Mixed drink shooters and drink shots      Jonathan Brandis
8      Vodka Belt                                Brandy & Mr. Whiskers
9      Grey Goose (vodka)                        Brandi Carlile
10     Beer cocktail                             Tom Brandi

Table 5: ESA top-10 generated concepts for the words vodka and brandy


Figure 12: Comparison of the compact top-down CHESA representations of the words money and wealth. (a) Compact representation of money; (b) compact representation of wealth.

3. Synonyms
The former examples showed cases of words which are semantically related. To see how CHESA handles synonyms, we present the representations of the words car and automobile in Figure 14. We see that both representations share the concept Vehicles and its ancestors, which are indeed the connecting link between the two words. However, while car and automobile are considered synonyms16, their CHESA representations differ. For the word car, concepts related to sports are prominent, while the word automobile is represented exclusively with concepts related to technology. These differences reflect the differences in the language usage of the two words, as the word automobile is used less in the context of sports and car racing. Table 6 shows the top-10 concepts generated by ESA for both words. We see that ESA generates overly specific concepts for the two words. For example, among the top-10 concepts generated for car, seven refer to different car types and two refer to kinds of car number plates. ESA is thus unable to capture the relation between the two words.
16. http://thesaurus.reference.com/browse/car


Figure 13: Comparison of the compact top-down CHESA representations of the words vodka and brandy. (a) Compact representation of vodka; (b) compact representation of brandy.

4. Smile vs. Chord
The aforementioned examples stress the importance of generalization when representations are compact. As ESA is unable to generalize, salient semantic relations are often missed. Nevertheless, in some cases, generalization may lead to unexpected results. Figure 15 shows the top-down representations of size 10 for the words smile and chord. This pair of words was taken from the WordSimilarity353 collection, where it was judged by humans as being very unrelated. CHESA, on the other hand, finds the two words to be very related, and consequently fails when compared to human judgements. We see that although the representation of smile contains the concepts Society and Communication, which are related to the main gists of the word, it also contains many music-related concepts. The reason is the vast number of music-related Wikipedia articles, such as songs and bands, that contain the word.

5. Idioms
The word pair (death, row) appears in the WordSimilarity353 collection and is judged by humans as very related. The reason is most probably the fact that this pair is recognized as the idiom "death row".


Figure 14: Comparison of the compact top-down CHESA representations of the words car and automobile. (a) Compact representation of car; (b) compact representation of automobile.

Figure 16 shows the top-down representations of size 10 for each of the words. We see that CHESA does not identify these two words as related. The semantic interpretation of the word row contains concepts such as Linear algebra and Principal component analysis, while the representation of the word death contains concepts such as Life, Death and Men. Indeed, an idiom is defined as "an expression whose meaning is not predictable from the usual meanings of its constituent elements"17. CHESA's assessment of the semantic relatedness is correct in this example, although it does not coincide with human judgements, as the relation between the two words is not semantic but rather associative.
17. http://dictionary.reference.com/browse/idiom


       Car                                       Automobile
1      Polish car number plates                  Automotive industry
2      Concept car                               Automobile
3      Sports car                                Federation Internationale de l'Automobile
4      Armored car                               Automotive design
5      Greek car number plates                   American Automobile Association
6      Executive car                             California Automobile Association Building
7      City car                                  Monte Carlo Rally
8      Compact car                               Automobile Club de l'Ouest
9      Full-size car                             Car accident
10     Car bomb                                  Automobile Magazine

Table 6: ESA top-10 generated concepts for the words car and automobile

ESA, on the other hand, often identifies associative relations, since words which are part of an idiom co-occur very often. For example, the ESA top-10 representations of the words death and row both contain the concept Death row.

6. Semantic relatedness of texts
Figure 17 shows the CHESA compact representations of size 20 for the texts "A recently discovered type of star follows some wild orbits around the Milky Way" and "Astronomers measured the distance to a faraway galaxy, to investigate universe expansion"18. First, we see that both representations capture the astronomy-related primal gists of the text fragments, at different levels of abstraction. Some concepts, such as Astronomy, Universe and Space, are shared between the representations. These concepts express well the relatedness between the two text fragments. Other concepts are unique to each representation and stress the semantic differences between the two texts. For example, the concept Extrasolar planet appears only in the representation of the first text, while the concept Relativity appears only in the representation of the second text. Table 7 shows the top-20 concepts generated by ESA for each text fragment. Although these representations share several concepts, such as Milky Way, Supernova and Andromeda Galaxy, the intersecting concepts are overly specific and only weakly express the relation between the two text fragments.

7. Word Sense Disambiguation
Figure 18 shows the compact top-down representation of size 20 of the ambiguous word bank. We see that the representation contains concepts expressing both meanings of the word (e.g., the concept Financial economics for the financial context and the concept Landforms for the geographical context). Nevertheless, the financial context is more dominant than the geographical one, which results from the relative commonness of each context in language.
18. Representation is generated using the top-down Direct CHESA with i = 3.


Text 1: "A recently discovered type of star follows some wild orbits around the Milky Way"
Text 2: "Astronomers measured the distance to a faraway galaxy, to investigate universe expansion"

       Text 1                                    Text 2
1      Milky Way                                 Metric expansion of space
2      Andromeda Galaxy                          Hubble's law
3      Globular cluster                          Cosmic distance ladder
4      Star cluster                              Parsec
5      Sagittarius Dwarf Elliptical Galaxy       Physical cosmology
6      Binary star                               Redshift
7      Extrasolar planet                         Astronomy
8      Solar System (1945 play)                  Universe
9      Canis Major Dwarf Galaxy                  Andromeda Galaxy
10     Astrometry                                Light-year
11     Spiral galaxy                             Parallax
12     Red dwarf                                 Big Bang
13     Supernova                                 Star Wars galaxy
14     Comet                                     Edwin Hubble
15     Variable star                             Triangulum Galaxy
16     Meteoroid                                 Supernova
17     Parsec                                    Comoving distance
18     Black hole (1945 play)                    Planetary nebula
19     Space science                             Milky Way
20     Dial-Home Device                          Open cluster

Table 7: ESA top-20 generated concepts for two related texts


Figure 15: Comparison of the compact top-down CHESA representations of the words smile and chord. (a) Compact representation of smile; (b) compact representation of chord.

In Figures 19 and 20 we show that when the word bank is put in context, CHESA resolves the ambiguity. Figure 19 shows the top-down representation of size 20 for the phrase "bank deposit". We see that all the concepts in this representation are related to the financial meaning of the word. Figure 20 shows the top-down representation of size 20 for the phrase "river bank". We can see that the concepts in this representation are related to the geographical context of the word.

4.6 Semantic Relatedness vs. Semantic Similarity

In Sections 4.2 and 4.3 we evaluated CHESA on the semantic relatedness task. A related, but different, task is computing semantic similarity between words and texts. Semantic relatedness is a more general notion than semantic similarity (Budanitsky et al., 2006), as words can be related in many different ways. While similar words are also related, the opposite is not always true. For example, while (midday, noon), (cock, rooster) or (gem, jewel) are pairs of words which are both similar and related, (headache, neurology) or (Obama, Washington) are considered related but not similar.


Figure 16: Comparison of the compact top-down CHESA representations of the words death and row. (a) Compact representation of death; (b) compact representation of row.

CHESA algorithms assess semantic relatedness of words and texts. For example, Figure 21 shows the CHESA representations of the words headache and neurology, which share several concepts. Therefore, we hypothesize that CHESA performance will degrade on semantic similarity tasks, as it will disagree with human judgements in the case of related, but not similar, text pairs.

To test our hypothesis we evaluate CHESA on two additional benchmarks that assess semantic similarity rather than semantic relatedness. The first benchmark, introduced by Li et al. (2006), consists of a collection of 65 sentence pairs that correspond to the Rubenstein and Goodenough (R&G) (Rubenstein & Goodenough, 1965) list of 65 word pairs. Each word from the R&G data set was replaced with its definition from the Collins Cobuild dictionary (Sinclair, 2001). The sentence pairs were ranked by 32 volunteers, all native English speakers and graduates, and the average score of their judgements was assigned to each pair.


Figure 17: Comparison of the compact top-down CHESA representations of two related texts. (a) Compact representation of "A recently discovered type of star follows some wild orbits around the Milky Way"; (b) compact representation of "Astronomers measured the distance to a faraway galaxy, to investigate universe expansion".

As the distribution of the similarity scores assessed by the judges was heavily skewed towards low similarity scores, a subset of 30 pairs was selected to preserve a more uniform distribution of scores (Li et al., 2006).

The second benchmark is the Microsoft paraphrase corpus (Dolan, Quirk, & Brockett, 2004). This benchmark consists of pairs of texts, collected over a period of 18 months from thousands of news sources on the Web. The pairs were assigned binary labels by two human judges, indicating whether the sentences are semantically equivalent paraphrases or not. The collection consists of a training set of 4,076 text pairs and a test set of 1,725 text pairs.

On the Li benchmark, CHESA obtains a Pearson's correlation coefficient of 0.87 with human judgements19. These results slightly outperform previous work (Islam & Inkpen, 2008; Li et al., 2006), although the difference is not statistically significant.

19. Using the union bottom-up algorithm for k = 200. We conducted experiments with k = 100, 300, 400 and achieved similar results.
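As an illustration of the evaluation protocol, the following is a minimal sketch of computing Pearson's correlation coefficient between the relatedness scores produced by an algorithm and the averaged human judgements for the same pairs; paired lists of floats are assumed.

import math

def pearson_correlation(system_scores, human_scores):
    # Pearson's r between system relatedness scores and human judgements.
    n = len(system_scores)
    mx = sum(system_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(system_scores, human_scores))
    sx = math.sqrt(sum((x - mx) ** 2 for x in system_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in human_scores))
    return cov / (sx * sy)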


Figure 18: Compact top-down CHESA representation of the word bank

Figure 19: Compact top-down CHESA representation of the text "bank deposit"


Figure 20: Compact top-down CHESA representation of the text "river bank"

On the Microsoft paraphrase benchmark, the top-down CHESA algorithm obtains an F-measure of 79.6 and the bottom-up CHESA algorithm obtains an F-measure of 79.720. The similarity threshold (0.3) was selected using the training set. Previous methods (Islam & Inkpen, 2008; Corley & Mihalcea, 2005; Mihalcea, Corley, & Strapparava, 2006) achieved slightly higher F-measures of 81.2-81.3.

According to the results on the Li benchmark, our hypothesis ought to be rejected. However, a further examination of the Li collection revealed that it mostly contains text fragments that are either similar, or neither similar nor related. This explains why CHESA performs well on this dataset. The Microsoft paraphrase corpus, on the other hand, contains many pairs of texts which are highly related but are not paraphrases. For example, the text "Ballmer has been vocal in the past warning that Linux is a threat to Microsoft" is very related to the text "In the memo, Ballmer reiterated the open-source threat to Microsoft", although they are not considered semantically equivalent paraphrases. Indeed, CHESA's performance on this dataset is inferior to that of methods that assess semantic similarity.
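The paraphrase evaluation reduces to thresholding the relatedness score. The following is a minimal sketch, assuming paired lists of CHESA scores and gold binary labels, of the threshold-based decision and the resulting F-measure.

def f_measure_at_threshold(scores, gold_labels, threshold=0.3):
    # Predict "paraphrase" when the relatedness score reaches the
    # threshold (0.3 was selected on the training set), then compute
    # precision, recall and F-measure against the gold labels.
    predictions = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(predictions, gold_labels))
    fp = sum(p and not g for p, g in zip(predictions, gold_labels))
    fn = sum(not p and g for p, g in zip(predictions, gold_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)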

5. Related Work

Many statistical approaches have been introduced in the past for representing word and text semantics and assessing semantic relatedness. The most common statistical approach is representing a text as a bag-of-words (BOW), namely as the collection of its words. Texts are compared in the vector space model (Salton & McGill, 1986; Baeza-Yates & Ribeiro-Neto, 1999), where they are represented as vectors in the hyperspace defined by single words. Semantic relatedness is then assessed by measuring the distance between these vectors.

20. For k = 200. Similar results were obtained with k = 100, 300, 400.


Figure 21: Comparison of the compact top-down CHESA representations of the words neurology and headache. (a) Compact representation of neurology; (b) compact representation of headache.

Although widely applied, this technique is inherently unable to compare single words, and it performs sub-optimally on texts that use different vocabularies to describe related topics. Due to the limitations of the BOW approach, many methods began exploiting external knowledge to represent semantics and to assess the semantic relatedness of words and texts.


Lexical resources such as WordNet (Fellbaum, 1998) and Roget's Thesaurus (Roget, 1852) have been utilized by many previous methods for semantic representation and semantic relatedness assessment. Most of these methods represent an input word with a node in the semantic graph defined by these lexical resources. Semantic relatedness is then computed using various metrics. Some methods use the path length between the corresponding nodes (Lee et al., 1993; Rada, Mili, Bicknell, & Blettner, 1989; Leacock & Chodorow, 1998), the information content (Resnik, 1995, 1999; Lin, 1998; Jiang & Conrath, 1997; Seco, Veale, & Hayes, 2004; Budanitsky et al., 2006) or the depth of the lowest common ancestor (Wu & Palmer, 1994). Others (Hughes & Ramage, 2007; Wan & Peng, 2005) perform random walks on the graph or use the Earth Mover's distance (Rubner et al., 2000) to compute semantic distance. Finally, some techniques use a combination of different lexical resources (Li, Bandar, & McLean, 2003; Li et al., 2006; Mihalcea & Corley, 2006; Rodriguez & Egenhofer, 2003).

The methods that use these lexical resources have several limitations. First, the resources encode lexical information but lack comprehensive world knowledge. The semantic representation of an input word is essentially its mapping to a specific node, which does not allow a deep interpretation of its semantics. Moreover, these resources are limited in terms of vocabulary, as they only contain nouns, verbs, adjectives and adverbs. In the absence of information regarding proper names, neologisms, domain-specific terms, etc., the vocabulary of words that can be semantically represented is very restricted. Finally, the aforementioned methods are limited to representing single words, while representing and comparing texts requires an extra level of sophistication (Mihalcea et al., 2006; Corley & Mihalcea, 2005; Islam & Inkpen, 2008).

Word-text co-occurrence data gathered from unlabeled textual corpora is another resource of external knowledge for some semantic representation methods (Lund & Burgess, 1996; Landauer & Dumais, 1997; Foltz, 1996). One approach that uses such data is Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer & Dumais, 1997; Rehder et al., 1998; Wolfe et al., 1998; Landauer et al., 1998), which is based on singular value decomposition (SVD). LSA represents the meaning of texts as vectors in a hyperspace of "latent concepts" retrieved by performing SVD on the co-occurrence matrix. Semantic relatedness is assessed by comparing the representations in the latent space.

Retrieving latent concepts or topics to describe text semantics was also addressed by probabilistic approaches. The Probabilistic Latent Semantic Analysis (PLSA) method (Hofmann, 1999) proposes a generative probabilistic model. According to this model, each document postulates a probability distribution over a latent multinomial random variable that represents a set of latent topics. The words of the document are a sample from a mixture model defined by this multinomial random variable. The Expectation Maximization algorithm is used to compute the conditional posterior probability distribution of the latent variable given a document. Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003) is another probabilistic approach. The difference between this approach and PLSA is that the former assigns Dirichlet priors to the topics.
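For reference, LSA can be sketched in a few lines: a truncated SVD of the term-document co-occurrence matrix yields word vectors in the latent space, and relatedness is the cosine between them. The following is a minimal sketch using NumPy; the construction and weighting of the co-occurrence matrix are omitted.

import numpy as np

def lsa_word_vectors(term_doc_matrix, k=100):
    # Truncated SVD of the term-document matrix; rows of U * S are the
    # word vectors in the k-dimensional space of latent concepts.
    u, s, _ = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def lsa_relatedness(word_vectors, i, j):
    # Relatedness of words i and j as the cosine of their latent vectors.
    a, b = word_vectors[i], word_vectors[j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0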
The Hierarchical Latent Dirichlet Allocation (hLDA) method (Blei, Griffiths, Jordan, & Tenenbaum, 2004) proposes an extension of LDA in which the latent topics lie in a hierarchy. Both LSA and the aforementioned probabilistic methods represent words and texts using "latent concepts". These concepts are extracted, using SVD or Bayesian inference, from a corpus of training documents and are hard to interpret, as they cannot be mapped to knowledge concepts that are clear to humans. CHESA, on the other hand, represents meaning explicitly through natural concepts that are directly defined by humans and represent world knowledge. In addition, CHESA leverages the explicit hierarchical inner structure of those concepts, encoded in the Wikipedia categories, as opposed to hierarchical LDA, which learns the hierarchy as part of the inference process.


Web search engines have been exploited by several methods to compute semantic similarity. Some methods leverage the search results returned for word and phrase queries to assess their semantic relationships (Danushka et al., 2007; Matsuo et al., 2006; Chen et al., 2006; Sahami & Heilman, 2006), while others (Maguitman et al., 2006) compare web pages using their organization under topical ontologies such as the Open Directory Project, using graph-based metrics. CHESA differs from these techniques in that it is able to process arbitrary texts and is not limited to existing Web pages or to short texts that can constitute search queries.

Some methods (Lakkaraju et al., 2008; Gabrilovich & Markovitch, 2007a, 2005) map documents into hierarchical taxonomies, such as the Open Directory Project. These techniques are similar to CHESA in the sense that they use a pre-defined hierarchical ontology and represent documents as hierarchical structures within that ontology. Nevertheless, there are several substantial differences between CHESA and these methods. CHESA generates balanced representations of words and texts, capturing their semantic gists on a high level of abstraction when the representation is compact, and generating more specific concepts as size limitations loosen. The aforementioned methods, on the other hand, primarily detect the concepts that are highly associated with the input word or text (e.g., by measuring the TF.IDF of the input word or text in the text of the concept). This generation process often yields overly specific concepts, usually located deep within the hierarchy. The algorithms then proceed by adding all the ancestors of the generated concepts to the representation, which results in representations having a small number of long paths from the root. Figure ?? shows an illustration of the typical structure of compact representations generated by these methods vs. representations generated by CHESA. Moreover, as opposed to CHESA, these methods can control neither the final size of the representations nor their level of abstraction, as they select concepts regardless of their location in the hierarchy. Finally, these algorithms cannot manage redundancies. For instance, if a parent and a child category are very similar and are both highly associated with the input word or text, both will be part of the semantic representation. CHESA refrains from such redundancies by using the conditional over-representation criterion.

CHESA is not the first methodology that uses Wikipedia to represent semantics and compare words and texts. WikiRelate! (Strube & Ponzetto, 2006) represents a word as a short list of Wikipedia articles containing the word in their title, and computes semantic relatedness by comparing the texts of the corresponding articles, computing the path lengths between these articles in the Wikipedia category hierarchy, or using their information content.
As opposed to CHESA, this method is limited to representing words only, and its vocabulary is bounded to terms appearing in the titles of Wikipedia articles. Note that the method's usage of the Wikipedia category graph is different from ours: while CHESA uses the Wikipedia conceptual hierarchy for generating the semantics of text (regardless of the specific application), WikiRelate! uses Wikipedia categories only for computing the semantic distance between words and texts.


Wikipedia categories have also been used for document topic identification (Schonhofen, 2006), where documents are classified to Wikipedia categories according to the titles of the articles in those categories. Other methods (Milne, 2007; Kaiser et al., 2009) utilize the hyperlink structure of Wikipedia to compute semantic relatedness of words and texts. Explicit Semantic Analysis (ESA) (Gabrilovich & Markovitch, 2006, 2007b, 2009) represents the semantics of words and texts as a flat weighted vector of Wikipedia articles, where the weights are assigned using a TF.IDF metric between the represented text and the textual content of the Wikipedia article. We have extensively described and illustrated the differences between CHESA and ESA throughout this paper.

There is a certain connection between hierarchical semantic representation of texts based on a global conceptual hierarchy and hierarchical classification of texts. These methods (e.g., Koller & Sahami, 1997; Dumais & Chen, 2000; Ruiz & Srinivasan, 2002) classify documents to one node in a topic hierarchy by a multi-phase process in which a classifier is built for every set of siblings in the hierarchy. Namely, a chain of classifiers is applied to the input text, classifying it to a more specific topic in each iteration. The local classification can be performed with a variety of classifiers, such as Bayesian classifiers and Support Vector Machines. Although hierarchical classification makes local decisions within a topical hierarchy and is able to classify a document to topics at alternating levels of abstraction, it is essentially a mapping of a document to a single node within the topical hierarchy, rather than a hierarchical representation which constitutes a sub-hierarchy of the global hierarchical ontology.
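The chain-of-classifiers scheme described above can be sketched as follows; children_of and classifier_for are hypothetical lookups for the topic hierarchy and for the local classifiers trained over each node's children.

def hierarchical_classify(document, root, children_of, classifier_for):
    # Walk down the topic hierarchy: at each node, the local classifier
    # (e.g. a Naive Bayes classifier or an SVM over the node's children)
    # selects the next, more specific topic, until a leaf topic is reached.
    node = root
    while children_of(node):
        local_classifier = classifier_for(node)
        node = local_classifier.predict(document)
    # The result is a single node, not a sub-hierarchy as in CHESA.
    return node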

6. Discussion

Researchers recognized long ago that many natural language processing tasks require a comprehensive semantic interpretation of text. The efforts to find an appropriate semantic representation, with an associated interpretation method, have followed two main directions. One direction, best exemplified by the CYC project (Lenat & Guha, 1989), uses extensive manual effort to construct rich and deep semantics. The other direction, exemplified by Latent Semantic Analysis (Landauer & Dumais, 1997), automatically generates semantics based on statistical analysis of large text collections. The generated semantics, however, are usually not appropriate for deep natural language understanding.

The CHESA framework presented in this paper attempts to bridge these two approaches. On one hand, the semantics is generated automatically. On the other hand, the generated representation is structured and coincides with the human conceptual hierarchy, thus having the potential to be used for deep NLP. The CHESA representation is flexible with respect to size and level of abstraction. Even when compact, the representation is powerful and expressive. Since CHESA uses Wikipedia as a resource of world knowledge, the rapid growth of the online encyclopedia directly improves the richness of the semantic representations.

We developed a methodology to compute semantic relatedness based on our representation and showed that CHESA is superior to previous methods when representations are compact.


For unlimited representation sizes, empirical analysis showed that CHESA is comparable to ESA and superior to other methods.

We intend to expand our study of CHESA and design methods for using CHESA in additional text processing tasks, such as word sense disambiguation, text categorization and information retrieval. We believe that the novel method introduced in this paper is a step forward in making statistical semantic representation methods useful for deep natural language understanding.

References Baeza-Yates, R. A., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. AddisonWesley Longman Publishing Co., Inc., Boston, MA, USA. Bahrick, H. (1984). Semantic memory content in permastore: Fifty years of memory for spanish learned in school.. Journal of Experimental Psychology: General,, 113, 1–35. Blei, D. M., Griffiths, T. L., Jordan, M. I., & Tenenbaum, J. B. (2004). Hierarchical topic models and the nested chinese restaurant process. In Advances in Neural Information Processing Systems. MIT Press. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022. Bower, G. H., Clark, M. C., Lesgold, A. M., & Winzenz, D. (1969). Hierarchical retrieval schemes in recall of categorized word lists.. Journal of Verbal Learning and Verbal Behavior, 8 (3), 323–343. Buchanan, B. G., & Feigenbaum, E. (1982). Forward. McGraw-Hill, McGraw-Hill. Budanitsky, A., Hirst, G., & Hirst, E. B. G. (2006). Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32, 13–47. Burgess, C., Livesay, K., & Lund, K. (1998). Explorations in context space: Words, sentences, discourse. Discourse Processes. Chen, H.-H., Lin, M.-S., & Wei, Y.-C. (2006). Novel association measures using web search with double checking. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 1009–1016, Morristown, NJ, USA. Association for Computational Linguistics. Cohen, G., Stanhope, N., & Conway, M. (1992). Age differences in the retention of knowledge in young and elderly students. British Journal of Developmental Psychology, 10, 153– 164. Cohen, G. (2000). Hierarchical models in cognition: Do they have psychological reality?. European Journal of Cognitive Psychology, 12 (1), 1–36. Conway, M., Cohen, G., , & Stanhope, N. (1991). On the very long term retention of knowledge: Twelve years of cognitive psychology.. Journal of Experimental Psychology: General, 120, 395–409. Corley, C., & Mihalcea, R. (2005). Measures of text semantic similarity. In ACL workshop on Empirical Modeling of Semantic Equivalence. 40


Danushka, B., Yutaka, M., & Mitsuru, I. (2007). Measuring semantic similarity between words using web search engines. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pp. 757–766, New York, NY, USA. ACM. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407. Dennis, D. (1976). Dissociated naming and locating of body parts after left anterior temporal lobe resection: An experimental case study.. Brain and Language, 3, 147–163. Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In COLING ’04: Proceedings of the 20th international conference on Computational Linguistics, Morristown, NJ, USA. Association for Computational Linguistics. Dumais, S., & Chen, H. (2000). Hierarchical classification of web content. In SIGIR ’00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 256–263, New York, NY, USA. ACM. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: the concept revisited. In WWW ’01: Proceedings of the 10th international conference on World Wide Web, pp. 406–414, New York, NY, USA. ACM. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20 (1), 116–131. Foltz, P. W. (1996). Latent semantic analysis for text-based research. Behavior Research Methods, Instruments and Computers, 28, 197–202. Gabrilovich, E., & Markovitch, S. (2005). Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1048–1053. Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, pp. 1301–1306. Gabrilovich, E., & Markovitch, S. (2007a). Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. Journal of Machine Learning Research, 8, 2297–2345. Gabrilovich, E., & Markovitch, S. (2007b). Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. Journal of Machine Learning Research, 8, 2297–2345. Gabrilovich, E., & Markovitch, S. (2009). Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research, 443–498. 41


Ganesan, P., Garcia-Molina, H., & Widom, J. (2003). Exploiting hierarchical domain structure to compute similarity. ACM Trans. Inf. Syst., 21 (1), 64–93. Hart, J., Berndt, R. S., & Caramazza, A. (1985). Category-specific naming deficit following cerebral infarction. Nature, 316, 439–440. Hofmann, T. (1999). Probabilistic latent semantic analysis. In In Proc. of Uncertainty in Artificial Intelligence, UAI99, pp. 289–296. Hughes, T., & Ramage, D. (2007). Lexical semantic relatedness with random graph walks. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 581–589. Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discoverty from Data, 2 (2), 1–25. ´ Jaccard, P. (1901). Etude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin del la Soci´et´e Vaudoise des Sciences Naturelles, 37, 547–579. Jarmasz, M., & Szpakowicz, S. (2003). Roget’s thesaurus and semantic similarity. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03), pp. 212–219, Borovets, Bulgaria. Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference Research on Computational Linguistics (ROCLING X). Kaiser, F., Schwarz, H., & Jakob, M. (2009). Using wikipedia-based conceptual contexts to calculate document similarity. In ICDS ’09: Proceedings of the 2009 Third International Conference on Digital Society, pp. 322–327, Washington, DC, USA. IEEE Computer Society. Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Fisher, D. H. (Ed.), Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 170–178, Nashville, US. Morgan Kaufmann Publishers, San Francisco, US. Lakkaraju, P., Gauch, S., & Speretta, M. (2008). Document similarity based on concept tree distance. In HT ’08: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, pp. 127–132, New York, NY, USA. ACM. Landauer, T. K., & Dumais, S. T. (1997). Solution to plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 259–284. Leacock, C., & Chodorow, M. (1998). Combining Local Context and WordNet Similarity for Word Sense Identification, chap. 11, pp. 265–283. The MIT Press. Lee, J. H., Kim, M. H., & Lee, Y. J. (1993). Information retrieval based on conceptual distance in is-a hierarchies. Journal of Documentation, 49 (2), 188–207. 42


Lee, M. D., Pincombe, B., & Welsh, M. (2005). An Empirical Evaluation of Models of Text Document Similarity, pp. 1254–1259. Erlbaum, Mahwah, NJ. Lenat, D. B., & Guha, R. V. (1989). Building Large Knowledge-Based Systems; Representation and Inference in the Cyc Project. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. Lenat, D. B., Guha, R. V., Pittman, K., Pratt, D., & Shepherd, M. (1990). Cyc: toward programs with common sense. Commun. ACM, 33 (8), 30–49. Li, Y., Bandar, Z. A., & McLean, D. (2003). An approach for measuring semantic similarity between words using multiple information sources. IEEE Trans. on Knowl. and Data Eng., 15 (4), 871–882. Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18 (8), 1138–1150. Lin, D. (1998). An information-theoretic definition of similarity. In ICML ’98: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296–304, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 203–20. Maguitman, A. G., Menczer, F., Erdinc, F., Roinestad, H., & Vespignani, A. (2006). Algorithmic computation and approximation of semantic similarity. World Wide Web, 9 (4), 431–456. Matsuo, Y., Sakaki, T., Uchiyama, K., & Ishizuka, M. (2006). Graph-based word clustering using web search engine. In Proc. of EMNLP 2006. Mihalcea, R., & Corley, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In In AAAI06, pp. 775–780. Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In AAAI. Milne, D. (2007). Computing semantic relatedness using wikipedia link structure. In Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC), Hamilton, New Zealand. Naveh-Benjamin, M. (1988). Retention of cognitive structures learned in university courses.. Practical aspects of memory: Current research and issues, 2, 383–388. Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). Wordnet: : Similarity - measuring the relatedness of concepts.. In AAAI, pp. 1024–1025. Pincombe, B., & Surveillance, I. (2004). Comparison of human and latent semantic analysis (lsa) judgments of pairwise document similarities for a news corpus. defence science and technology organisation research report dstorr0278.. Potter, M. C., & Faulconer, B. A. (1975). Time to understand pictures and words.. Nature, 253, 437–438. 43


Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems Management and Cybernetics, 19 (1), 17–30. Rehder, B., Schreiner, M. E., Wolfe, B. W., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337–354. Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453. Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, 95–130. Rodriguez, M. A., & Egenhofer, M. J. (2003). Determining semantic similarity among entity classes from different ontologies. Knowledge and Data Engineering, IEEE Transactions on, 15 (2), 442–456. Roget, P. (1852). Rogets thesaurus of english words and phrases. Longman Group Ltd. Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories.. Cognitive Psychology, 8 (3), 382–439. Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8 (10), 627–633. Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40 (2), 99–121. Ruiz, M. E., & Srinivasan, P. (2002). Hierarchical text categorization using neural networks. Information Retrieval, 5 (1), 87–118. Sahami, M., & Heilman, T. D. (2006). A web-based kernel function for measuring the similarity of short text snippets. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pp. 377–386, New York, NY, USA. ACM. Salton, G., & McGill, M. J. (1986). Introduction to Modern Information Retrieval. McGrawHill, Inc., New York, NY, USA. Schonhofen, P. (2006). Identifying document topics using the wikipedia category network. In WI ’06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 456–462, Washington, DC, USA. IEEE Computer Society. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Comput. Surv., 34 (1), 1–47. A comprehensive survey of Text Categorization techniques and approaches. Includes dimentionality reduction by term selection and term extraction, inductive construction of text classifiers and their evaluation. Seco, N., Veale, T., & Hayes, J. (2004). An intrinsic information content metric for semantic similarity in wordnet. In ECAI’2004, the 16th European Conference on Artificial Intelligence. Sinclair, J. (Ed.). (2001). Collins Cobuild English Dictionary for Advanced Learners. Harper Collins Pub. 44


Stanhope, N., Cohen, G., & Conway, M. (1993). Very long term retention of a novel.. Applied Cognitive Psychology, 7, 239–256. Strube, M., & Ponzetto, S. P. (2006). Wikirelate! computing semantic relatedness using wikipedia.. In AAAI 2006. AAAI Press. Turney, P. D. (2001). Mining the web for synonyms: Pmi-ir versus lsa on toefl. In EMCL ’01: Proceedings of the 12th European Conference on Machine Learning, pp. 491–502, London, UK. Springer-Verlag. Wan, X., & Peng, Y. (2005). The earth mover’s distance as a semantic measure for document similarity. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 301–302, New York, NY, USA. ACM. Wolfe, M. B., Schreiner, M. E., Rehder, B., Laham, D., Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). Learning from text: Matching readers and texts by latent semantic analysis. Discourse Processes, 25, 309–336. Wu, Z., & Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pp. 133–138, Morristown, NJ, USA. Association for Computational Linguistics.


Appendix A. Building a Conceptual Hierarchy from the Wikipedia Concept Graph

Algorithm 3 describes the procedure of constructing a rooted directed acyclic graph of concepts from the Wikipedia graph of categories and articles. The graph is processed in breadth-first-search order, and paths leading from the root to concepts that can be reached by shorter paths are removed. Moreover, concepts that cannot be reached from the root are not part of the final hierarchy.

Algorithm 3 Build Conceptual Hierarchy
Input: A directed graph G(V, E), a root node vr ∈ V
Output: A multiple-inheritance hierarchy G′(V′, E′) (a rooted directed acyclic graph)
1: V′ ← ∅
2: E′ ← ∅
3: Q ← an empty First-In-First-Out queue
4: for all v ∈ V do
5:   v.depth ← 0
6: end for
7: Q.add(vr)
8: while Q is not empty do
9:   v ← Q.getFirst()
10:  Q.removeFirst()
11:  V′ ← V′ ∪ {v}
12:  for all v′ such that (v, v′) ∈ E do
13:    if (v′.depth = 0) or (v′.depth = v.depth + 1) or (v′ is a leaf) then
14:      E′ ← E′ ∪ {(v, v′)}
15:      v′.depth ← v.depth + 1
16:      Q.add(v′)
17:    end if
18:  end for
19: end while
20: return V′, E′
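A runnable sketch of Algorithm 3 in Python, with the graph given as a node list and an edge list; as in the pseudocode, a depth of 0 doubles as the "not yet reached" marker.

from collections import deque

def build_conceptual_hierarchy(nodes, edges, root):
    # Algorithm 3 (sketch): turn the Wikipedia category/article graph into
    # a rooted multiple-inheritance hierarchy (a rooted DAG). The graph is
    # traversed breadth-first from the root; an edge (v, w) is kept only if
    # it reaches w along a shortest path from the root or w is a leaf, and
    # nodes unreachable from the root are dropped.
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
    depth = {v: 0 for v in nodes}
    kept_nodes, kept_edges = set(), set()
    queue = deque([root])
    while queue:
        v = queue.popleft()
        kept_nodes.add(v)
        for w in children[v]:
            if depth[w] == 0 or depth[w] == depth[v] + 1 or not children[w]:
                kept_edges.add((v, w))
                depth[w] = depth[v] + 1
                queue.append(w)
    return kept_nodes, kept_edges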

Appendix B. Top-Down vs. Bottom-Up Representation

Figures 22 and 23 show the top-down and bottom-up CHESA representations of 50 concepts for the word tiger.

Figure 22: The top-down representation of size 50 generated by CHESA for the word tiger

Figure 23: The bottom-up representation of size 50 generated by CHESA for the word tiger