Term Similarity and Weighting Framework for Text Representation

Sadiq Sani, Nirmalie Wiratunga, Stewart Massie, and Robert Lothian

School of Computing, The Robert Gordon University, Aberdeen AB25 1HG, Scotland, UK

Abstract. The expressiveness of natural language is a challenge for text representation since the same idea can be expressed in many different ways. Terms in a document should therefore not be treated independently of one another, since together they help to disambiguate and establish meaning. Term-similarity measures are often used to improve representation by capturing semantic relationships between terms. Another consideration for representation involves the importance of terms. Feature selection techniques address this by using statistical measures to quantify feature usefulness for retrieval-related tasks. In this paper we present a framework that combines term similarity and weighting for text representation. This allows us to comparatively study the impact of term similarity, term weighting and any synergistic effect that may exist between them. Our study of term similarity is based on approaches that exploit term co-occurrences within document and sentence contexts, whilst term weighting uses the popular Chi-squared test. Our results on text classification tasks show that the combined effect of similarity and weighting is far superior to either used independently, and that this synergistic effect is obtained regardless of the co-occurrence context granularity. We also introduce a novel term-similarity mining approach based on lexical co-occurrence profiles, which consistently outperforms both the standard co-occurrence approaches to similarity mining and SVM.

1 Introduction

While unstructured, natural language text is convenient for human consumption, computers still find it difficult to process such information with satisfactory precision. This is because the lexical content of natural language text can be quite different from its intended meaning due to inherent ambiguities in natural language such as synonymy (different terms having similar meaning) and polysemy (the same term having multiple different meanings). Representation of text documents is of interest to many research fields such as Information Retrieval, Natural Language Processing and Textual CBR. The standard Bag of Words (BOW) representation is a naive approach in that it operates at the lexical level, treating terms as independent features [15]. Such a strategy is well suited to domains where vocabulary usage remains consistent.

However, in the presence of ambiguities, approaches that rely on lexical features alone remain ignorant of the latent semantics needed to disambiguate the text. To address such limitations the semantic relatedness between terms must be taken into account, and one approach to achieving this is through the acquisition of term-term similarity knowledge. The notion of similarity can be ascertained on the basis of term co-occurrence in a corpus. The context within which co-occurrence is mined is very important. Most approaches for extracting term co-occurrence statistics do so in the context of the whole document [6, 18, 4]. It can be argued that at the document level every term can possibly co-occur with every other term, and thus document contexts do not accurately capture semantic relatedness. A sentence, however, is a more linguistically justified context as it expresses one complete idea [14]. Mining co-occurrence at the sentence level is therefore likely to be better at capturing the semantic relatedness between terms. An alternative approach to term-similarity extraction involves maintaining a profile of the terms (lexical co-occurrents) that co-occur with a given term within a predefined window of text. Accordingly, similarity between a term pair can be determined based on the similarity of their corresponding lexical co-occurrence profiles [16].

In this paper we study co-occurrence based term similarity extracted from document and sentence contexts for text representation. We compare these with a new lexical co-occurrence approach with a sentence-based window, where two terms are similar if they have similar term co-occurrence profiles. We present a framework to combine term similarity knowledge and weighting, and study the synergistic effect of these on document representation.

The rest of the paper is organised as follows: in Section 2 we provide an overview of different sources of term-similarity knowledge. In particular we look at extrospective (using background knowledge) and introspective approaches to similarity knowledge acquisition. We introduce our term similarity and weighting framework in Section 3, explaining issues related to normalisation and weighting. Section 4 presents three introspective term similarity acquisition approaches based on co-occurrence mining. We present results from five text classification tasks in Section 5, followed by conclusions in Section 6.

2 Term Similarity Extraction

A solution commonly used for overcoming variation in natural vocabulary is to find measures of semantic relatedness between terms in a corpus. This provides a mapping from different lexical expressions of the same idea into conceptual groups that capture the inherent meanings of documents. The result is the extraction of high-level features that represent the underlying semantic concepts in a document. Achieving this, however, requires a source of similarity knowledge. Techniques for extracting similarity knowledge range from extrospective (knowledge rich) sources, which include lexical databases and the World Wide Web (WWW), to introspective (knowledge light) techniques that use statistics of term co-occurrences in a corpus.

2.1 Extrospective Sources of Similarity Knowledge

WordNet, a lexical database for the English language [12], has been used extensively for extracting term-similarity knowledge. Words within WordNet are grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. Synsets are further grouped based on their grammatical function into noun, verb, adjective and adverb dictionaries. Synsets within the same dictionary are inter-connected through links representing the semantic and lexical relationships between them. This structure can be viewed as a graph where synsets are nodes and semantic links are edges. Such a graph allows relatedness between terms to be measured by combining the shortest path between term pairs with information about the depth of nodes in the graph [19], or by using information content [13, 9, 11]. Despite its popularity, WordNet has recently been criticised for having limited coverage and scope of applications [8]. It also suffers from sparsity in its synset connections [2]: the different dictionaries within WordNet are mostly independent, with very limited inter-connections between them. It has also been observed that WordNet hierarchies fail to accurately capture semantic distance between concepts, i.e. when two concepts share a common subsumer, one of the concepts may be more related to the parent concept than the other.

Unlike WordNet, Wikipedia, a free online encyclopaedia, boasts coverage orders of magnitude greater than that of lexical databases and thesauri. Gabrilovich and Markovitch [7] use Wikipedia to explicitly represent the meaning of natural language by representing text documents in a high-dimensional space of Wikipedia concepts. Wikipedia is particularly attractive as a source of semantic knowledge because each Wikipedia page provides a comprehensive description of a single topic or concept and can thus be seen as a representation of that concept. Other researchers go beyond this and exploit the entire WWW as a means to extract semantic knowledge. For instance Cilibrasi and Vitanyi [5] developed the Normalised Google Distance metric, which uses page counts to determine the semantic relatedness of two terms. The page count of documents returned in response to a search engine query provides useful evidence of relatedness between the terms in the query. This can be quantified as a similarity metric, i.e. the higher the proportion of documents that contain both terms, the more related the two terms are and therefore the more likely they are to be similar. However page counts can be misleading, as they do not consider the intended sense of terms and the semantics within which they are used in the result pages. More sophisticated approaches resolve these ambiguities by exploiting lexical syntactic patterns in text snippets (small pieces of text extracted by the search engine around the query term) [1].

The major downside of extrospective techniques is the demand on maintenance and processing power. It is certainly a non-trivial task to develop and maintain a lexical database like WordNet for a new domain or language. While online resources like Wikipedia and the WWW partly address this problem, they have the added problems of network availability (and latency), and demand for storage and memory when content is alternatively downloaded and processed locally.
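For readers who wish to experiment with such WordNet-based measures, the following is a minimal sketch (not part of the original work) using NLTK's WordNet interface. It assumes NLTK and its WordNet corpus are installed, and the word pairs are arbitrary examples.

```python
# Minimal sketch: graph-based WordNet relatedness via NLTK.
# Assumes the 'wordnet' corpus has been downloaded with nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def max_wup_similarity(word1, word2):
    """Return the highest Wu-Palmer similarity over all noun synset pairs."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            sim = s1.wup_similarity(s2)  # depth/shortest-path based measure
            if sim is not None and sim > best:
                best = sim
    return best

print(max_wup_similarity('car', 'automobile'))  # synonyms -> high similarity
print(max_wup_similarity('car', 'banana'))      # unrelated -> low similarity
```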

Such issues make corpus statistics a popular option for term-similarity extraction.

2.2 Introspective Sources of Similarity

The general idea is that co-occurrence patterns of terms in a corpus can be used to infer semantic relatedness between terms. Co-occurrence is also helpful for estimating domain specific relationships between terms [4]. Latent Semantic Indexing (LSI) [6] is a popular technique in this category that uses singular value decomposition (SVD) to exploit co-occurrence patterns of terms and documents to create a semantic concept space in which related documents are brought closer together. In this way, LSI brings out the underlying latent semantic structure in texts. Kontostathis and Pottenger [10] have shown that LSI implicitly exploits higher-order co-occurrences between terms in a collection. When two terms t1 and t2 co-occur within the same document they are said to have a first-order association. However, when t1 and t2 do not co-occur but both co-occur with a common term t3, then t1 and t2 are said to share a second-order co-occurrence through t3, and so on for higher orders. Unlike LSI, which implicitly exploits higher-order associations between terms using SVD, an explicit approach that combines co-occurrence paths up to the third order has been successfully used to extract similarity arcs for a Case Retrieval Net (CRN) applied to text classification [4].

The standard approach to term-similarity mining is to extract term co-occurrence counts within a specific context, where the context can range from whole documents to paragraphs, sentences and even word sequences [17]. An alternative approach is to obtain a profile of the co-occurrents of a term within the entire corpus. The co-occurrents of a term are the other terms that co-occur with it within a predetermined range (or window). A window can span from about a hundred terms to just a few terms on either side of the focus term. By passing such a window over the entire corpus, we can obtain for each term a list of its co-occurrents and represent that in a vector space, where the dimensions of this space are the set of unique terms in the corpus. The similarity between any two terms is then obtained by comparing their term vectors. Such an approach has been employed to construct a thesaurus, where synonyms are identified based on vector similarity [16].
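To make the LSI idea concrete, here is a minimal illustrative sketch (not the approach evaluated in this paper) that applies a truncated SVD to a tiny term-document matrix using numpy; the matrix values and the rank k are assumptions chosen purely for illustration.

```python
# Minimal LSI sketch: truncated SVD of a term-document matrix (rows = terms,
# columns = documents). Values and rank k are illustrative only.
import numpy as np

D = np.array([[1, 1, 0, 0],    # term 'car'
              [1, 0, 1, 0],    # term 'engine'
              [0, 1, 1, 0],    # term 'wheel'
              [0, 0, 0, 1]],   # term 'banana'
             dtype=float)

k = 2                                          # number of latent dimensions
U, s, Vt = np.linalg.svd(D, full_matrices=False)
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction

# Terms that never co-occur directly can still end up with non-zero
# associations in D_k through higher-order co-occurrence.
print(np.round(D_k, 2))
```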

3 Text Representation Framework

The first step in our framework is to obtain all pairwise term similarity values to populate a term-term similarity matrix T, where the row and column dimensions of T represent the corpus terms. Each entry in T gives the similarity of the corresponding row and column terms, and all entries in T are normalised so that a value of 1 indicates identical term pairs and 0 indicates dissimilar ones. As the similarity mining step is not interested in calculating the similarity of a term with itself, the leading diagonal of T contains 0 entries. However, any term can be at most similar to itself, so all entries on the leading diagonal need to be set to 1. Formally this is achieved by adding the identity matrix I to T, giving a new matrix T':

T' = T + I    (1)

Next, T' is used to obtain the new document representation capturing term similarity knowledge. This is done by multiplying the term-document matrix D with T' to obtain a new term-document matrix D':

D' = T' × D    (2)

Documents are represented as column vectors in the term-document matrix D, whilst the row dimensions correspond to terms in the corpus. Any standard text representation scheme, such as a binary vector, a term frequency vector or a tf-idf vector, can be used for D's column vectors. Intuitively, the impact of equation 2 is to boost the presence of related terms that were not contained in the original document, which in turn has the beneficial effect of bringing similar documents closer together. This effect has been shown to be equivalent to document retrieval using a Case Retrieval Net, where entries in matrix T' represent the weights of the similarity arcs between terms and entries in matrix D represent the relevance arcs between terms and documents [3].

The purpose of T' can also be viewed as providing evidence in favour of certain (usually related) terms in the vector representation of documents. Essentially it expands the document representation, or in other words leads to a more generalised representation. Therefore, it is desirable to be able to control the strength of the contribution from this new evidence. We achieve this through a weighting parameter, α, with which the influence of terms introduced into document representations through term similarity knowledge can be controlled. Accordingly, equation 2 can be rewritten as:

D' = αT' × D    (3)
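The steps in equations 1-3 translate directly into matrix operations. The following is a minimal numpy sketch, assuming an illustrative term-term similarity matrix T, a small term-document matrix D and an arbitrary choice of α; it is not the authors' implementation.

```python
# Sketch of equations (1)-(3): T' = T + I and D' = alpha * T' x D.
# T and D values, and alpha, are illustrative only.
import numpy as np

T = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])          # pairwise term similarities, zero diagonal

D = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)      # term-document matrix (terms x documents)

alpha = 0.5                              # strength of the similarity evidence

T_prime = T + np.eye(T.shape[0])         # equation (1)
D_prime = alpha * T_prime @ D            # equations (2)-(3)

print(D_prime)   # related terms absent from a document now receive some weight
```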

3.1 Normalisation

In our approach we apply an L2 normalisation function to the rows of T' and the columns of D before the matrix multiplication in equation 2. This is done to prevent frequent terms and longer documents from dominating representations. The L2 normalisation function is given in equation 4:

L2(v_i) = v_i / |v|    (4)

where v is any vector and

|v| = sqrt( Σ_{i=1}^{n} v_i^2 )    (5)

A further benefit of enforcing normalisation in this manner is that equation 2 now amounts to taking the cosine similarity between the term vectors in T' and the document vectors in D.
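A minimal sketch of this normalisation step, again with illustrative matrices, might look as follows; the helper name l2_normalise_rows is our own.

```python
# Sketch of equations (4)-(5): L2-normalise the rows of T' and the columns of D
# so that T' x D amounts to cosine similarity between term and document vectors.
import numpy as np

def l2_normalise_rows(M):
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0              # guard against all-zero rows
    return M / norms

T_prime = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
D = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)

T_norm = l2_normalise_rows(T_prime)      # normalise the term rows of T'
D_norm = l2_normalise_rows(D.T).T        # normalise the document columns of D

D_prime = T_norm @ D_norm                # each entry is now a cosine similarity
print(np.round(D_prime, 3))
```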

3.2 Term Weighting

Typically, not all terms in a corpus have equal importance; usually a select few terms have higher discerning power than others. Thus, a measure of term importance needs to be introduced into document representations such that more important terms have a higher influence on document similarity. A natural ordering for combining feature importance and feature similarity extraction is to first select a subset of high quality features and then infer term similarity knowledge [18]. This should however be done with caution, because a severely reduced feature space is not suited to obtaining T: the more orthogonal sub-feature space alone may not be sufficient to infer useful relationships that would otherwise be discovered with the aid of the eliminated terms. Accordingly, we propose to apply feature weighting after obtaining T, instead of selecting a subset of features before generating T. This allows us to obtain useful term relationships and still end up with highly discriminatory features. The feature weights also allow us to capture the relative importance of terms, compared with feature selection which only seeks to eliminate non-informative terms. Once term weights have been obtained, they can be used to populate a diagonal matrix W which has the same row and column dimensions as T. W assigns a weight to each term in D corresponding to the significance of that term in the domain. This step is presented in equation 6:

D' = W × (αT' × D)    (6)

Feature weights can easily be obtained using a feature selection technique [20]. We expect any effective feature selection technique to work well with our framework and the choice of any one technique should be based on how suitable it is for the task at hand.
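Equation 6 can be sketched in the same way; the χ2 scores below are placeholder values standing in for weights computed by a feature selection technique.

```python
# Sketch of equation (6): D' = W x (alpha * T' x D), where W is a diagonal
# matrix of term weights (e.g. chi-squared scores). Scores are illustrative.
import numpy as np

chi2_scores = np.array([2.5, 0.4, 1.7])   # placeholder feature weights per term
W = np.diag(chi2_scores)                  # same row/column dimensions as T'

T_prime = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
D = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)
alpha = 0.5

D_prime = W @ (alpha * T_prime @ D)       # important terms now dominate D'
print(D_prime)
```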

4 Modeling Term Similarity Knowledge

Central to the framework presented in Section 3 is the presence of term-term similarity knowledge to populate matrix T'. Any of the approaches presented in Section 2 can be utilised for this purpose. Here we present two straightforward introspective approaches for extracting term-similarity knowledge: the Context Co-occurrence Approach (CCA) and the Lexical Co-occurrence Approach (LCA). CCA measures similarity between terms based on the strength of co-occurrence. Two terms are said to co-occur if they appear within a specified context; here we consider two possible contexts, the document and the sentence. LCA, on the other hand, measures similarity between terms with respect to their association patterns with other terms. A term pair that has similar co-occurrence patterns with other terms within a predetermined context window is deemed to be similar. Each term thus has a lexical co-occurrence profile in the corpus, and the similarity between any two terms is determined by the similarity between their co-occurrence profiles.

4.1 Context Co-occurrence Approach (CCA)

Documents are considered similar in the vector space model (VSM) if they contain a similar set of terms. The similarity of two documents can be determined by finding the distance between their vector representations in the term-document space defined by D. In the same way, terms can be considered similar if they appear in a similar set of documents. If document vectors are defined by the columns of matrix D, then the rows of the matrix define the term vectors. Thus, the similarity between two terms can be determined by finding the similarity of their corresponding term vectors. This model can be extended to a general term-context matrix, where the co-occurrence context can range from whole documents to sentences or even word sequences of a predefined length. In this work we consider both the document and sentence contexts for extracting co-occurrence based similarity. We demonstrate the process of creating a first-order term-term similarity matrix using an approach similar to that presented in [4], but without restricting ourselves to a document-only context. Starting with a term-context matrix C, where the context can be either at the document or the sentence level, we obtain a term-term similarity matrix T by multiplying C with its transpose C^T:

T = C × C^T    (7)

We observe that the dot product of term vectors is not a robust measure of similarity as it only gives the number of co-occurrence paths between terms. This results in highly frequent terms having higher values in the resulting T. Highly general terms are usually not very discriminatory and thus less useful for establishing semantics, compared with other terms that have relatively fewer and more specialised relationships. Consider for instance a domain about cars which contains the popular term 'car'. If the task is to classify documents into different car brands (like Toyota, Ford and Honda), then query documents containing 'car' will most likely be incorrectly classified because most documents will be strongly related to the concept 'car'. The main reason for this is that 'car' will dominate semantic relationships unless its high frequency is taken into account in the representation. One way to achieve this is through vector normalisation, which uses relative frequency instead of raw frequency counts. We demonstrate this further using a document-level context representation with three example matrices in Figure 1. Here t1 has exactly the same document distribution as t3, yet the similarity of t1 to t3 (2.0) in T is the same as its similarity to t4 and t5, simply because t4 and t5 are highly frequent terms. Also, the similarity of t4 to t5 is higher (3.0) than its similarity to t3 (2.0), even though the document distributions of t3 and t4 differ in just a single document (compared with a difference of two documents between t4 and t5). Moreover, frequent terms like t5 are likely to be general terms that are not very discriminatory. We apply the L2 normalisation (as in Section 3.1) to the rows of C and the columns of C^T before the matrix multiplication in equation 7. This invariably amounts to calculating the cosine similarity between term vectors (normalised to unit length). Normalisation in this way contrasts with the frequency count based approach presented in [4].

Fig. 1: Example of context co-occurrence term similarity

         d1  d2  d3  d4  d5
    t1    0   1   1   0   0
    t2    1   0   0   1   0
C = t3    0   1   1   0   0
    t4    1   1   1   0   0
    t5    1   1   1   1   1

           t1  t2  t3  t4  t5
      d1    0   1   0   1   1
      d2    1   0   1   1   1
C^T = d3    1   0   1   1   1
      d4    0   1   0   0   1
      d5    0   0   0   0   1

          t1   t2   t3   t4   t5
    t1   2.0  0.0  2.0  2.0  2.0
    t2   0.0  2.0  0.0  1.0  2.0
T = t3   2.0  0.0  2.0  2.0  2.0
    t4   2.0  1.0  2.0  3.0  3.0
    t5   2.0  2.0  2.0  3.0  5.0

Importantly, normalisation helps to mitigate any undue influence from highly frequent terms on semantic relatedness.
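As an illustration, the following sketch computes the normalised CCA similarity for the C matrix of Figure 1; the similarity values it prints are cosine similarities rather than the raw counts shown in the figure.

```python
# Sketch of the normalised CCA similarity: T = norm(C) x norm(C)^T, using the
# term-document matrix C from Figure 1. Rows of C are terms, columns documents.
import numpy as np

C = np.array([[0, 1, 1, 0, 0],     # t1
              [1, 0, 0, 1, 0],     # t2
              [0, 1, 1, 0, 0],     # t3
              [1, 1, 1, 0, 0],     # t4
              [1, 1, 1, 1, 1]],    # t5
             dtype=float)

C_norm = C / np.linalg.norm(C, axis=1, keepdims=True)   # L2-normalise term rows
T = C_norm @ C_norm.T                                    # cosine similarity of terms

print(np.round(T, 2))
# After normalisation t1 and t3 (identical distributions) have similarity 1.0,
# while the very frequent term t5 no longer dominates the similarities.
```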

4.2 Lexical Co-occurrence Approach (LCA)

In LCA, we begin with a term-term co-occurrence matrix P, which we call the lexical co-occurrence matrix. The dimensions of the rows and columns of P are the unique terms in our corpus. P is thus a square matrix, and an entry p_ij in P indicates the co-occurrence, or otherwise, of terms ti and tj within a context window. Our window size is a whole sentence instead of a document. Thus terms that co-occur within a sentence will have a non-zero value (a co-occurrence count) in their corresponding cell in P, while terms that never co-occur within the same sentence will have 0. The term-term similarity matrix T can be obtained from P by calculating the cosine similarity of all pairwise combinations of the term vectors of P. This is achieved by multiplying P with its transpose P^T after the rows and columns of P have been L2 normalised (as before):

T = P × P^T    (8)

Equation 8 appears similar to the approach presented in [4] for creating a second-order co-occurrence matrix, with the exception that the sentence context is used here instead of whole documents. However, there are important differences between the two approaches. For instance, to obtain second-order co-occurrences, the entries of P need to be converted into binary values such that the matrix multiplication (dot product of term vectors) of P and P^T produces integer values that are counts of second-order paths between term pairs. In contrast, the cosine similarity in the lexical approach produces normalised, real values that are a measure of similarity between term pairs based on their first-order lexical co-occurrences (see the resulting T matrix in Figure 2), and the term vectors in P need not be binary. Consequently the values produced by the cosine similarity are not counts of second-order co-occurrence paths, nor can they be interpreted as such. Equally, by capturing similarity in this way we avoid the need to have separate matrices for the first and second orders.

Fig. 2: Example of lexical co-occurrence term similarity

         t1  t2  t3  t4  t5
    t1    0   1   2   0   0
    t2    1   0   2   0   0
P = t3    2   2   0   0   1
    t4    0   0   0   0   1
    t5    0   0   1   1   0

           t1  t2  t3  t4  t5
      t1    0   1   2   0   0
      t2    1   0   2   0   0
P^T = t3    2   2   0   0   1
      t4    0   0   0   0   1
      t5    0   0   1   1   0

          t1   t2   t3   t4   t5
    t1   1.0  0.8  0.3  0.0  0.6
    t2   0.8  1.0  0.3  0.0  0.6
T = t3   0.3  0.3  1.0  0.3  0.0
    t4   0.0  0.0  0.3  1.0  0.0
    t5   0.6  0.6  0.0  0.0  1.0

Figure 2 provides an example to illustrate the creation of a term-similarity matrix from a lexical co-occurrence matrix. Notice that the terms t1 and t2 both co-occur twice with the term t3 and thus have a resulting similarity of 0.8 in T. Conversely, t1 and t4 do not co-occur with any common term and thus have a similarity of 0.
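The LCA computation can be sketched in the same way, starting from the P matrix of Figure 2; building P from a real corpus would additionally require sentence splitting, which is omitted here.

```python
# Sketch of the LCA similarity: T = norm(P) x norm(P)^T, using the lexical
# co-occurrence matrix P from Figure 2 (co-occurrence counts within sentences).
import numpy as np

P = np.array([[0, 1, 2, 0, 0],     # co-occurrence profile of t1
              [1, 0, 2, 0, 0],     # t2
              [2, 2, 0, 0, 1],     # t3
              [0, 0, 0, 0, 1],     # t4
              [0, 0, 1, 1, 0]],    # t5
             dtype=float)

P_norm = P / np.linalg.norm(P, axis=1, keepdims=True)   # L2-normalise profiles
T = P_norm @ P_norm.T                                    # cosine of co-occurrence profiles

print(np.round(T, 1))
# t1 and t2 share the co-occurrent t3 and score 0.8; t1 and t4 share no
# co-occurrent and score 0.0, matching Figure 2.
```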

5 Evaluation

The aim of our experiments is to examine the effect of combining term similarity knowledge with term weighting for document representation. Two different co-occurrence based approaches for extracting term-term similarity knowledge introspectively were presented in Section 4. Using these, we further explore the impact of document versus sentence context on the quality of term similarity knowledge. Accordingly we use the following algorithms to mine term-similarity knowledge for matrix T:

– CCAdoc: term co-occurrences at the document context level (see Section 4.1);
– CCAsent: term co-occurrences at the sentence context level (see Section 4.1);
– LCA: lexical co-occurrences within a sentence window (see Section 4.2).

Once T is obtained we use the framework in Section 3 to generate the document representations. We also study the influence of feature weighting on these representations, using the Chi-squared (χ2) measure to obtain feature weights for matrix W (the final χ2 score of a term t is taken as its maximum χ2 value over all classes, i.e. χ2max(t) = max_{i=1..n} χ2(t, ci)). Our baseline technique, Base, is a representation with no similarity knowledge. Equally, the individual influence of feature weighting alone is investigated by adding this component into Base (W+). We applied standard text preprocessing operations such as stop-word removal and stemming, and also eliminated rare terms (terms with document frequency less than 3). We used binary vectors to represent documents in D. The different representation schemes were compared on text classification tasks using a weighted kNN as our primary algorithm (with k=3), with the cosine similarity metric used to identify the neighbourhood. kNN's classification accuracy is also compared to that of an Svm classifier. Accordingly we restrict our study to binary classification datasets. Results are obtained by averaging over 10 repeated stratified ten-fold cross validations.
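For completeness, the sketch below illustrates the classification step with scikit-learn's weighted kNN and a cosine metric; the document representations and labels are random placeholders, and this is an approximation of, not a reproduction of, the authors' exact experimental setup.

```python
# Minimal sketch of the classification step: weighted kNN (k=3) with a cosine
# metric over transformed documents. X stands in for the documents-by-features
# matrix (i.e. the transpose of D') and y for class labels; both are random.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 50))                 # placeholder document representations
y = rng.integers(0, 2, size=100)          # placeholder binary class labels

knn = KNeighborsClassifier(n_neighbors=3, metric='cosine', weights='distance')
scores = cross_val_score(knn, X, y, cv=10)   # 10-fold CV (stratified for classifiers)
print(scores.mean())
```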

5.1 Datasets

The first dataset used for the experiments was formed from the 20 Newsgroups corpus, a collection of approximately 20,000 documents partitioned into 20 newsgroups. The particular version of the 20 Newsgroups dataset we used contains 18,846 documents from the original collection, arranged by date and with duplicates removed. From these groups we created the Hardware dataset, which contains documents from the comp.sys.ibm.pc.hardware and comp.sys.mac.hardware newsgroups. This dataset contains 1000 documents selected randomly from the respective newsgroups. Documents in the comp.sys.ibm.pc.hardware category contain discussions on PC hardware while those in comp.sys.mac.hardware contain discussions on Apple hardware.

The second group of data was formed from the OHSUMED corpus. OHSUMED is a subset of MEDLINE, an online database of medical literature, and comprises a collection of 348,566 medical references from medical journals covering the period from 1987 to 1991. We obtained a subset of the OHSUMED collection which consists of all references from the year 1991. From this we created two new 1000-document datasets: NervImmuno, with 500 documents from each of the Nervous and Immunological diseases categories, and BacterialParasitic, with 500 documents from each of the Bacterial and Parasitic diseases categories. The OHSUMED corpus used here contains medical abstracts on diseases and thus we expect significant overlap in vocabulary across all categories. However, we expect the Bacterial and Parasitic diseases categories to be more similar, making the BacterialParasitic dataset harder for classification than the NervImmuno dataset.

The last group of data was formed from the Reuters-21578 corpus, a collection of newswire stories. The corpus contains 5 category sets: EXCHANGES, ORGS, PEOPLE, PLACES and TOPICS. Each category set contains a number of different categories. The TOPICS category set contains economic subject categories, e.g. Cocoa, Trade, Grain and Money-fx. From the TOPICS categories we created two 1000-document datasets: InterestFX, which contains 500 documents from each of the Interest and Money-fx categories, and TradeGrain, which contains all 486 documents from the Trade category and 514 documents from the Grain category. The Interest and Money-fx categories contain newswires on interest rates and foreign exchange rates, which are expected to be very similar. The Trade and Grain categories, on the other hand, contain reports on miscellaneous international trade and on international trade in grain respectively. These two categories are also expected to be similar.

Table 1: Characteristics of datasets.

                    Vocabulary Size   No of Documents   Ave Document Length
Hardware                       3824              1000                    64
NervImmuno                     3213              1000                    57
BacterialParasitic             3047              1000                    58
InterestFX                     3172              1000                    66
TradeGrain                     3177              1000                    78

A summary of the 5 datasets is given in Table 1. They all have a comparable vocabulary size, with Hardware having the largest at 3,824. Average document length was calculated after documents were processed for stopwords and rare terms. Here TradeGrain has the largest average document length of 78 unique terms whilst NervImmuno has the smallest with 57 unique terms.

5.2 Results

Overall results show that incorporating term similarity and weighting has, as expected, outperformed Base on all datasets and Svm on 4 out of the 5 datasets. When used independently of each other, we can see that weighting on its own achieves higher accuracies than representations with term similarity knowledge. For instance in Table 2, Base (W+), which is the results column for weighting only, is consistently better than all columns with no weighting (W-). However, the combination of the two techniques produces the best results, confirming their synergistic effect. Co-occurrence mining within a sentence context gives the best results in the absence of feature weighting (e.g. compare the W- columns for CCAsent, CCAdoc and LCA). However, when weighting is injected into the final representations, a sentence context does not provide any obvious gain over a document context.

In terms of the different term-similarity combinations with weighting, we found LCA and CCAdoc to consistently outperform Svm on 4 of the 5 datasets (i.e. excepting the TradeGrain dataset). The TradeGrain dataset is the easiest dataset, with a Base accuracy of 93.7%. This means that the two classes are largely orthogonal, and mining term similarity is likely to introduce more spurious term relationships which may adversely impact document similarity, leading to poor classifier accuracy. Comparison of CCAdoc with its counterpart at the sentence level (CCAsent) shows that the larger document context seems to be better on 3 of the datasets, with a ∼3-4% improvement. One dataset where CCAsent performs well is InterestFX, which is the most difficult of the 5 datasets with a Base accuracy of just 67.1%, apparently due to the similarity between the documents in the 'interest' and 'money-fx' classes. In fact, close examination of the dataset revealed a number of very similar documents that were categorised into different classes. Accordingly, representations are more prone to learning many arbitrary relationships that are likely to result in lower classification accuracy. As the CCAsent technique mines co-occurrence relationships within a restricted context size, this might have given it an edge over CCAdoc on this particular dataset.

Table 2: Accuracies on the different datasets with (W+) and without (W-) term weights

                             Base         CCAsent       CCAdoc         LCA
                    Svm     W-    W+     W-    W+      W-    W+      W-    W+
Hardware           90.7   86.9  88.8   84.3  91.5    83.4  94.6    86.3  93.9
NervImmuno         88.0   81.5  88.0   84.1  86.8    82.8  89.6    84.0  90.0
BacterialParasitic 78.56  75.0  79.5   77.9  79.0    76.7  84.1    73.9  81.9
InterestFX         74.59  67.1  71.6   68.5  72.8    67.2  70.3    67.0  75.3
TradeGrain         97.5   93.7  95.5   94.4  96.5    94.4  95.6    93.9  95.89

6 Conclusion

The main contribution of this paper is a framework that combines term similarity knowledge with term weighting to represent textual content. We have also discussed how three different approaches for extracting term-similarity knowledge from corpus co-occurrence can be utilised within this framework. We have demonstrated how the common approach of using matrix multiplications for co-occurrence based term similarity is biased towards highly frequent terms, and how that can be addressed through normalisation.

Results from a comparative study on text classification tasks clearly demonstrate the synergistic effect of term similarity and weighting compared with using either independently of the other. Our results also show kNN with our augmented representations to outperform Svm on a majority of the datasets. Although we have tested our framework on text classification tasks, it is worth noting that none of the proposed term-similarity mining approaches were specifically adapted for supervised tasks. Therefore further benefits are likely to be realised if co-occurrence mining were biased by class knowledge. In future work we plan to conduct a more comprehensive study involving both extrospective and introspective sources of knowledge for term-term similarity computation. Our initial findings on document versus sentence context for co-occurrence mining remain inconclusive; therefore it will be very useful to study this further on many more datasets and also with increasing sentence window sizes.

References

1. Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. Measuring semantic similarity between words using web search engines. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 757-766. ACM, 2007.
2. Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. Adding dense, weighted connections to WordNet. In Proceedings of the Third International WordNet Conference, 2006.
3. Sutanu Chakraborti, Robert Lothian, Nirmalie Wiratunga, Amandine Orecchioni, and Stuart Watt. Fast case retrieval nets for textual data. In Advances in Case-Based Reasoning, volume 4106, pages 400-414. Springer Berlin / Heidelberg, 2006.
4. Sutanu Chakraborti, Nirmalie Wiratunga, Robert Lothian, and Stuart Watt. Acquiring word similarities with higher order association mining. In Proceedings of the 7th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development, pages 61-76, Berlin, Heidelberg, 2007. Springer-Verlag.
5. Rudi L. Cilibrasi and Paul M. B. Vitanyi. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19:370-383, March 2007.
6. Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
7. Evgeniy Gabrilovich and Shaul Markovitch. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research, 34:443-498, March 2009.
8. Jorge Gracia and Eduardo Mena. Web-based measure of semantic relatedness. In Web Information Systems Engineering - WISE 2008, volume 5175, pages 136-150. Springer Berlin / Heidelberg, 2008.
9. J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, pages 19-33, 1997.
10. April Kontostathis and William M. Pottenger. A framework for understanding latent semantic indexing (LSI) performance. Information Processing and Management, 42(1):56-73, 2006.
11. Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pages 296-304, 1998.
12. George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.
13. Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, pages 448-453, 1995.
14. Magnus Sahlgren. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, 2005.
15. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:613-620, November 1975.
16. Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing and Management, 33:307-318, May 1997.
17. Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141-188, January 2010.
18. Nirmalie Wiratunga, Ivan Koychev, and Stewart Massie. Feature selection and generalisation for retrieval of textual cases. In Proceedings of the 7th European Conference on Case-Based Reasoning (ECCBR 2004), volume 3155, pages 806-820. Springer-Verlag, 2004.
19. Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.
20. Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 412-420, 1997.