Hybrid Method for Computing Word-Pair Similarity based on Web Content

Imen Akermi
University of Tunis - ISG, LARODEC, 2000 Bardo, Tunisia
[email protected]

Rim Faiz
University of Carthage - IHEC, LARODEC, 2016 Carthage Présidence, Tunisia
[email protected]

ABSTRACT


With the development of the Web, a great amount of information has become available on Web pages, opening new perspectives for sharing information and allowing the collaborative construction of content. In our work, we take advantage of this shared information to measure the similarity between words. Our measure uses, on one hand, an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research (CNRS) and, on the other hand, a page-count-based metric computed from a social website.


Categories and Subject Descriptors
H.3.4 [Semantic Web]: Web Content; H.2.4 [Approach]: Similarity Measures; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Measurement, Performance, Experimentation.

Keywords
Web search, Semantic similarity, User-generated content, Information Retrieval.

1. INTRODUCTION
The ability to accurately judge the semantic similarity between words is critical to the performance of several applications such as Information Retrieval and Natural Language Processing. Indeed, one of the main goals of these applications is to facilitate user access to relevant information. According to Baeza-Yates et al. [2], an information retrieval system "should provide the user with easy access to the information in which he is interested". Measures of the semantic similarity of words have long been used in Natural Language Processing and related areas, such as the automatic creation of thesauri [14], text classification [7], and information extraction and retrieval [4], [24]. Various methods have therefore been proposed: corpus-based methods using online dictionaries such as WordNet and the Brown corpus [10], [14], and Web-based metrics [15], [22].

In this context, we propose a method that uses Web content to measure semantic similarity between a pair of words. The rest of the document is organized as follows: Section 2 introduces related work on semantic similarity methods. In Section 3, we present our approach for measuring semantic similarity between words. In Section 4, we evaluate our method in order to demonstrate its effectiveness. In Section 5, we conclude with a few remarks and some perspectives.


2. RELATED WORK
Many previous works on semantic similarity have used manually compiled taxonomies such as WordNet1, the Brown corpus2, and large text corpora [10], [14]. Li et al. [13] combined structural semantic information from a lexical taxonomy and information content from a corpus in a nonlinear model; they proposed a similarity measure that uses shortest path length, depth, and local density in the taxonomy. Lin [14] defined the similarity between two concepts in terms of the information common to both concepts and the information contained in each individual concept. However, one major issue with taxonomy- and corpus-oriented approaches is that they do not necessarily capture similarity between proper names such as named entities (e.g., personal names, location names, product names) or new uses of existing words.

Other Web-based approaches use Web content for measuring semantic similarity between words. Turney [22] defined a measure called point-wise mutual information (PMI-IR), using the number of hits returned by a web search engine, to recognize synonyms. A similar approach was proposed by Matsuo et al. [15], who used web hits for the extraction of communities on the Web. They measured the association between two personal names using the Simpson coefficient, calculated from the number of web hits for each individual name and for their conjunction (i.e., the AND query of the two names). However, page counts alone are not a sufficient measure of similarity, since two words may appear in the same page without being related [3]. A classic example of this problem: the page count for apple conflates pages about apple the fruit with pages about Apple the company. Moreover, given the scale and noise of the Web, some words co-occur on some pages purely by chance. For these reasons, page counts alone are unreliable for measuring semantic similarity.

1 http://wordnet.princeton.edu/.
2 http://icame.uib.no/brown/.
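For concreteness, Turney's PMI-IR score is commonly formulated from page counts as follows (a sketch of the standard formulation rather than a quotation from [22]; H(·) denotes a page count and N the number of pages indexed by the search engine):

$$ \mathrm{PMI\mbox{-}IR}(P, Q) = \log_2 \frac{H(P \wedge Q)/N}{\bigl(H(P)/N\bigr)\,\bigl(H(Q)/N\bigr)} $$

Two words score highly when they co-occur on more pages than independence would predict.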

Sahami et al. [20] measured semantic similarity between two queries using the snippets returned for those queries by a search engine. For each query, they collect snippets from a search engine and represent each snippet as a TF-IDF-weighted term vector. Each vector is L2-normalized, and the centroid of the set of vectors is computed. The semantic similarity between two queries is then defined as the inner product between the corresponding centroid vectors (a sketch of this computation follows at the end of this section).

Chen et al. [6] developed a double-checking model that uses text snippets returned by a web search engine to compute semantic similarity between words. For two words P and Q, they collect snippets for each word from a web search engine, then count the occurrences of word P in the snippets for word Q and the occurrences of word Q in the snippets for word P. These values are combined nonlinearly to compute the similarity between P and Q. However, this method depends heavily on the search engine's ranking algorithm: although two words P and Q might be very similar, there is no guarantee that Q appears in the snippets for P, or vice versa.

Bollegala et al. [3] developed a hybrid semantic similarity measure that combines a WordNet-based metric with information retrieved from text snippets returned by a Web search engine. They automatically discover lexico-syntactic templates for semantically related and unrelated words using WordNet and train a support vector machine (SVM) classifier; the learned templates are used for extracting information from the text fragments returned by the search engine.

In this paper, we propose a method that uses both a page-count-based similarity returned by a social website and a dictionary-based metric.
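The following minimal Python sketch illustrates the snippet-centroid idea of Sahami et al. [20] (our illustration, not their implementation; the snippet lists stand in for real search-engine results):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def centroid_similarity(snippets_p, snippets_q):
    # One TF-IDF space over the snippets collected for both queries.
    tfidf = TfidfVectorizer().fit_transform(snippets_p + snippets_q).toarray()
    # L2-normalize each snippet vector (guarding against empty snippets).
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    tfidf = tfidf / np.where(norms == 0.0, 1.0, norms)
    # Centroid of each query's snippet vectors; similarity = inner product.
    c_p = tfidf[: len(snippets_p)].mean(axis=0)
    c_q = tfidf[len(snippets_p):].mean(axis=0)
    return float(np.dot(c_p, c_q))

# Toy usage with stand-in snippets:
print(centroid_similarity(
    ["apple fruit orchard", "apple pie recipe"],
    ["orange fruit juice", "orange orchard harvest"]))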

3. THE PROPOSED METHOD
Our method for measuring semantic similarity between words uses, on one hand, an online English dictionary provided by the Semantic Atlas project (SA) [17] and, on the other hand, page counts returned by a social website whose content is generated by users (see Figure 1).

[Figure 1. The proposed method for measuring semantic similarity between a given pair of words: the word pair (w1, w2) feeds a dictionary-based similarity module, which compares the synonym sets of w1 and w2 with the Jaccard coefficient, and a page-count-based similarity module, which computes the WebJaccard coefficient; the two scores are combined into the overall similarity Sim(w1, w2).]

We proceed in three phases:
1. Calculation of the similarity (SD) between the two words based on the online dictionary provided by the Semantic Atlas project.
2. Calculation of the similarity (SP) between the two words based on the page counts returned by the social website Digg.com3.
3. Integration of the two similarity measures SD and SP.

3.1 Phase 1: The Semantic Atlas based similarity measure
In this phase, we extract synonyms for each word from the online English dictionary provided by the Semantic Atlas project [17] of the French National Centre for Scientific Research (CNRS). What drew our attention is that the SA is composed of several dictionaries and thesauri (including Roget's thesaurus), thus offering a wide range of senses for a given word. In fact, the SA is used for the automatic treatment of polysemy and for semantic disambiguation [23]. The SA is currently available in French and English versions and can be consulted online via a flexible interface allowing interactive navigation at http://dico.isc.cnrs.fr4. The current online version allows users to set the search type for English words to (1) standard (narrow synonymy) or (2) enriched (broad synonymy). In this paper, we use the enriched search type for better precision. Once the two sets of synonyms are collected, we calculate the degree of similarity between them, which we call SD, using the Jaccard coefficient:

$$ S_D(w_1, w_2) = \frac{C}{N_1 + N_2 - C} \qquad (1) $$

where C is the number of common words between the two synonym sets, N1 is the number of words contained in the w1 synonym set, and N2 is the number of words contained in the w2 synonym set.
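As an illustration, the following minimal Python sketch computes SD as in Equation (1); the get_synonyms function is a hypothetical stand-in for a lookup against the Semantic Atlas, whose real interface is not modeled here:

def jaccard_similarity(syn1, syn2):
    # Equation (1): C common words over N1 + N2 - C.
    common = len(syn1 & syn2)
    denom = len(syn1) + len(syn2) - common
    return common / denom if denom else 0.0

def get_synonyms(word):
    # Hypothetical stand-in for an enriched-search query against the
    # Semantic Atlas; returns a set of synonyms for `word`.
    raise NotImplementedError

# Usage with hand-written synonym sets, for illustration:
print(jaccard_similarity({"car", "auto", "machine"},
                         {"auto", "vehicle", "machine"}))  # 2/4 = 0.5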

3.2 Phase 2: The page-count-based similarity measure


In this phase, we calculate the degree of similarity between the two words w1 and w2 using the WebJaccard coefficient [3], whose parameters are the numbers of pages returned by the social website Digg.com for the queries w1, w2, and (w1 AND w2). Barlow [1] identifies Digg.com as one of the most popular aggregators of articles published on the Web whose content is generated by users. We use this social site to obtain the page count for a given query. Once the page counts for the queries w1, w2, and (w1 AND w2) are obtained, we calculate the WebJaccard coefficient for the given pair of words:

$$ S_P(w_1, w_2) = \frac{H(w_1 \cap w_2)}{H(w_1) + H(w_2) - H(w_1 \cap w_2)} \qquad (2) $$

where H(w1 ∩ w2) is the page count for the conjunctive query (w1 AND w2), H(w1) is the page count for the query w1, and H(w2) is the page count for the query w2.


3 http://www.digg.com.

4 This site is the most consulted address in the domain of the French National Centre for Scientific Research (CNRS), one of the major research bodies in France.

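A minimal Python sketch of Equation (2); the page_count function is a hypothetical stand-in for a query against the Digg.com API, whose real call is not modeled here:

def web_jaccard(h_w1, h_w2, h_both):
    # Equation (2): page-count Jaccard over H(w1), H(w2), H(w1 AND w2).
    denom = h_w1 + h_w2 - h_both
    return h_both / denom if denom else 0.0

def page_count(query):
    # Hypothetical stand-in for the number of results returned by the
    # Digg.com API for `query`.
    raise NotImplementedError

# Usage with made-up page counts, for illustration only:
print(web_jaccard(1200, 800, 300))  # 300 / 1700 ≈ 0.176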

3.3 Phase 3: The overall similarity measure
In this last phase, we combine the two previously calculated measures through a weighted sum controlled by a parameter α:

$$ Sim(w_1, w_2) = \alpha \, S_D(w_1, w_2) + (1 - \alpha) \, S_P(w_1, w_2) \qquad (3) $$

First experiments on the Miller-Charles [16] and Rubenstein-Goodenough [19] datasets have shown that our measure performs best with α = 0.6. We are performing further experiments on the WordSimilarity-353 Test Collection [8] in order to stabilize α.
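A minimal sketch of the overall pipeline, assuming the weighted combination reconstructed in Equation (3) and reusing the hypothetical helpers sketched in the previous phases (the conjunctive query format passed to page_count is also an assumption):

def sim_fa(w1, w2, alpha=0.6):
    # Phase 1: dictionary-based similarity over Semantic Atlas synonyms.
    s_d = jaccard_similarity(get_synonyms(w1), get_synonyms(w2))
    # Phase 2: page-count-based similarity from Digg.com.
    s_p = web_jaccard(page_count(w1), page_count(w2),
                      page_count(w1 + " " + w2))  # assumed conjunctive query
    # Phase 3: weighted combination (Equation 3), alpha = 0.6 as suggested
    # by the first experiments.
    return alpha * s_d + (1.0 - alpha) * s_p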

4. EXPERIMENTS

4.1 The Benchmark Datasets
We evaluated the proposed method against the Miller-Charles dataset [16], a dataset of 30 word pairs, and against Rubenstein and Goodenough's original dataset of 65 word pairs [19]. These two datasets are considered a reliable benchmark for evaluating semantic similarity measures. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy).
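Evaluation follows the usual protocol of correlating a measure's scores with the human ratings (Pearson correlation, as is standard in this line of work). A minimal sketch, with illustrative stand-in values rather than the actual dataset:

import numpy as np

def pearson(scores, ratings):
    # Pearson correlation between a measure's scores and human ratings.
    x = np.asarray(scores, dtype=float)
    y = np.asarray(ratings, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Illustrative stand-in values only (not the actual dataset):
human_ratings = [3.92, 3.05, 0.84]   # ratings on the 0-4 scale
measure_scores = [0.95, 0.70, 0.20]  # measure scores in [0, 1]
print(pearson(measure_scores, human_ratings))  # scale-invariant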

4.1.1 The Miller-Charles dataset
Miller and Charles [16] chose 30 pairs from Rubenstein and Goodenough's original 65 word pairs [19], taking 10 from the "high level (between 3 and 4), 10 from the intermediate level (between 1 and 3), and 10 from the low level (0 to 1) of semantic similarity", and then obtained similarity judgments on those 30 pairs from 38 subjects, given the same instructions as above [5]. In Table 1, we compare the results of our measure on the Miller-Charles dataset [16] with Chen's Co-occurrence Double Checking (CODC) measure [6], with the metric of Sahami et al. [20], and with four popular co-occurrence measures modified by [3]: WebJaccard, WebSimpson, WebDice, and WebPMI (point-wise mutual information). These four measures use page counts returned by the search engine Google5.

Table 1. The SimFA similarity measure compared to baselines on the Miller-Charles dataset.

Method              Correlation
WebJaccard          0.614
WebSimpson          0.244
WebDice             0.220
WebPMI              0.455
Sahami et al.       0.579
Chen-CODC           0.693
Bollegala et al.    0.834
Proposed SimFA      0.836

5 http://www.google.com.

The first four measures in Table 1 are based on association ratios between words, computed from the frequency of co-occurrence of words in documents. The main hypothesis of this approach is that two words are semantically related if their association ratio is high. All figures in Table 1, except the original Miller-Charles ratings, are normalized into the interval [0, 1] to facilitate comparison. According to Table 1, our proposed measure SimFA achieves the highest correlation, 0.836.

4.1.2 The Rubenstein-Goodenough dataset
We evaluate our method on Rubenstein and Goodenough's dataset against the following measures: Hirst and St-Onge's [9], Jiang and Conrath's [10], Leacock and Chodorow's [12], Lin's [14], Resnik's [18], Li et al.'s [13], and Bollegala et al.'s [3]. As summarized in Table 2, our proposed measure evaluated on the Rubenstein-Goodenough dataset [19] gives a correlation coefficient of 0.866, which we consider promising. Our measure outperforms simple WordNet-based approaches such as edge-counting and information-content measures, and it is comparable with the other methods. Unlike the WordNet-based methods, our proposed method does not require a hierarchical taxonomy of concepts or sense-tagged definitions of words. Therefore, in principle, the proposed method could be used to calculate semantic similarity between named entities, which are not fully covered by WordNet or other manually compiled thesauri. However, considering the high correlation between human subjects (0.901), there is still room for improvement.

Table 2. The SimFA similarity measure compared to baselines on the Rubenstein-Goodenough dataset.

Method                     Correlation
Human replication          0.901
Resnik                     0.779
Lin                        0.819
Li et al. (2003)           0.891
Leacock & Chodorow         0.838
Hirst & St-Onge            0.786
Jiang & Conrath            0.781
Bollegala et al. (2007)    0.812
Proposed SimFA             0.866

4.2 Discussion
To calculate the co-occurrence measures WebJaccard, WebSimpson, WebDice, and WebPMI (point-wise mutual information), Bollegala et al. [3] used the web search engine Google. However, the Google API only allows 100 automatic queries per day; exceeding 100 requests requires a paid plan. We did not encounter this issue with the API provided by Digg.com. We went further in our experiments by evaluating our method on the two datasets mentioned earlier using the page counts returned by the web search engine Google instead of those returned by the social website Digg.com, so as to verify that the results obtained with Digg.com were as accurate as those obtained with Google. According to Table 3, the SimFA measure performs slightly better when page counts are taken from the social website Digg.com. This confirms that the social website Digg.com can be used instead of the search engine Google to obtain page counts for a given dataset automatically, free of charge, and beyond the Google limit.

Table 3. The SimFA measure evaluated with page counts from Google.com and Digg.com.

Page counts returned by    Miller-Charles    Rubenstein-Goodenough
Google.com                 0.828             0.865
Digg.com                   0.836             0.866

5. CONCLUSION AND FUTURE WORK
Semantic similarity between words is fundamental to various fields such as Cognitive Science, Natural Language Processing, and Information Retrieval. Relying on a robust semantic similarity measure is therefore crucial. In this paper, we introduced a new similarity measure between words combining, on one hand, the use of an online English dictionary provided by the Semantic Atlas project of the French National Centre for Scientific Research (CNRS) and, on the other hand, page counts returned by the social website Digg.com. Experimental results on the Miller-Charles and Rubenstein-Goodenough datasets show that the proposed measure is promising. Our proposed measure lays the foundation for several lines of future work. We will incorporate this measure into other similarity-based applications to determine its ability to improve tasks such as text classification and clustering; in fact, we are working on integrating our metric into an opinion classification application [7]. We also intend to go further in this domain by developing new semantic similarity measures for sentences and documents.

6. REFERENCES
[1] A. Barlow: Blogging America: The New Public Sphere. Praeger, 2008.
[2] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval. Addison Wesley, 1999.
[3] D. Bollegala, Y. Matsuo, and M. Ishizuka: Measuring semantic similarity between words using web search engines. In Proceedings of the International Conference on the World Wide Web (WWW 2007), pages 757-766, 2007.
[4] C. Buckley, G. Salton, J. Allan, and A. Singhal: Automatic query expansion using SMART: TREC 3. In Proceedings of the 3rd Text Retrieval Conference, pages 69-80, 1994.
[5] A. Budanitsky and G. Hirst: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13-47, 2006.
[6] H. Chen, M. Lin, and Y. Wei: Novel association measures using web search with double checking. In Proceedings of COLING/ACL, pages 1009-1016, 2006.
[7] A. Elkhlifi, R. Bouchlaghem, and R. Faiz: Opinion extraction and classification based on semantic similarities. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference (FLAIRS'11), AAAI Press, 2011.
[8] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin: Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116-131, 2002.
[9] G. Hirst and D. St-Onge: Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305-332, 1998.
[10] J. J. Jiang and D. W. Conrath: Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X), 1998.
[11] H. Kozima and T. Furugori: Similarity between words computed by spreading activation on an English dictionary. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics (EACL '93), Stroudsburg, PA, USA, pages 232-239, 1993.
[12] C. Leacock and M. Chodorow: Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265-283, 1998.
[13] Y. Li, Z. Bandar, and D. McLean: An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 15(4):871-882, 2003.
[14] D. Lin: An information-theoretic definition of similarity. In Proceedings of the 15th ICML, pages 296-304, 1998.
[15] Y. Matsuo, T. Sakaki, K. Uchiyama, and M. Ishizuka: Graph-based word clustering using a web search engine. In Proceedings of EMNLP, 2006.
[16] G. A. Miller and W. G. Charles: Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28, 1991.
[17] S. Ploux, A. Boussidan, and H. Ji: The Semantic Atlas: an interactive model of lexical representation. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010.
[18] P. Resnik: Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[19] H. Rubenstein and J. Goodenough: Contextual correlates of synonymy. Communications of the ACM, 8:627-633, 1965.
[20] M. Sahami and T. Heilman: A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International World Wide Web Conference (WWW), 2006.
[21] S. Patwardhan, S. Banerjee, and T. Pedersen: Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, pages 241-257, 2003.
[22] P. D. Turney: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML, pages 491-502, 2001.
[23] F. Venant: Utiliser des classes de sélection distributionnelle pour désambiguïser les adjectifs. In Proceedings of TALN, Toulouse, France, 2007.
[24] J. Xu and B. Croft: Improving the effectiveness of information retrieval. ACM Transactions on Information Systems, 18(1):79-112, 2000.