Calculating Statistical Similarity between Sentences

Calculating Statistical Similarity between Sentences Junsheng Zhang, Yunchuan Sun, Huilin Wang, Yanqing He Journal of Convergence Information Technology, Volume 6, Number 2. February 2011

1Junsheng Zhang, *2Yunchuan Sun, 3Huilin Wang, 4Yanqing He
1, First Author: IT Support Center, Institute of Scientific and Technical Information of China, Beijing, China, 100038, [email protected]
*2, Corresponding Author: School of Economics and Business Administration, Beijing Normal University, Beijing, China, 100875, [email protected]
3,4: IT Support Center, Institute of Scientific and Technical Information of China, Beijing, China, 100038, [email protected], [email protected]

doi:10.4156/jcit.vol6.issue2.3

Abstract

Sentence similarity plays an important role in text-related research and applications, and it is closely related to word similarity and document similarity. Statistical similarity measures, based on the symbolic characteristics and structural information of sentences, can measure the similarity between sentences without any prior knowledge, using only the statistical information of the sentences themselves. This paper presents several approaches to calculating statistical similarity between sentences and evaluates them on a test corpus of 40 sentences. These measures can be used in short-text applications such as corpus construction and title/abstract-based document recommendation. The evaluation results show how the measures differ.

Keywords: Sentence Similarity, Statistic, Calculating

1. Introduction

Sentence similarity plays an important role in many text-related research areas and applications, such as information retrieval, information recommendation, natural language processing, machine translation and translation memory. Calculating the similarity between sentences is the basis of measuring the similarity between texts, which is the key to document classification and clustering. Sentence similarity is also one of the key issues in sentence alignment, sentence clustering and question answering.

Natural languages have different granularities of meaning, such as word, phrase, sentence and document. A word can be regarded as the minimum unit of meaning in natural language, while a sentence can be regarded as the minimum unit that communicates a complete meaning. Separators such as the period and the comma are symbols that split word sequences into meaningful chunks. A sequence of words and separators forms a sentence. Accordingly, there are several levels of similarity in natural languages. Words can be classified into synonyms and antonyms based on the similarity between words and phrases. Similarities between documents are the basis of text classification and clustering. Sentence similarity lies between word similarity and document similarity. The methods for calculating similarity vary across these levels.

• Word similarity. Similarity between words can be calculated from the spelling of the words or from their meaning. Edit distance can be used to measure the similarity between words from their spelling. Intuitively, two words that are similar in spelling are likely to be similar in meaning. A lexical database such as WordNet can be used to calculate the similarity in meaning between words, as WordNet::Similarity does [2].
• Sentence similarity. The similarities between the words in different sentences have a great influence on the similarity between two sentences. The words and their order in the sentences are two important factors in calculating sentence similarity.
• Document similarity. The similarity between sentences has a great influence on the similarity between documents. Commonly used approaches are based on the similarity between keyword sets (e.g., Dice similarity) or the similarity between keyword vectors (e.g., cosine similarity). These methods seldom consider the semantic meaning of words.

Sentences are the intermediate level of meaning units between words and documents. Sentences connect words and documents in the meaning space of natural language. Most existing approaches ignore
sentence similarity because of its time cost. However, similarity at the sentence level is interesting and important. Moreover, the study of sentence similarity can advance research on word similarity, especially statistical approaches based on large corpora.

In this paper, we focus on measures of sentence similarity computed from symbols. We propose two measures of sentence similarity based on word orders and word distances. Together with sentence similarities based on word sets, word vectors and edit distance, we use six approaches to calculate the statistical similarity between sentences. We then evaluate and compare the six approaches on a corpus of 40 sentences selected from the NIST05 BLEU corpus. The evaluation results show that different measures of sentence similarity reflect different statistical characteristics of sentences. Approaches that calculate sentence similarity from statistical characteristics are helpful for applications of natural language processing.

2. Related Work

Similarity analysis between long texts has been widely used in document retrieval. It mainly considers the statistical information of keywords in long texts. Keywords are generally selected using weight-assigning schemes such as TF-IDF [12]. Sentence similarity has been used in the fields of text-related knowledge representation and discovery, such as machine translation [13], translation memory, text summarization [4], text categorization, question answering and even image search on the Web [3].

Similarity analysis between sentences differs from that between long texts in the statistical information available. A weight-assigning approach such as TF-IDF is not suitable for similarity analysis between sentences, and a corpus is needed to calculate sentence similarity statistically. Refs. [6, 7] present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm.

Sentence similarity partially depends on word similarity. Ref. [11] investigates several methods for computing word similarity from two perspectives: dictionary-based methods using WordNet, Roget's thesaurus or other resources, and corpus-based methods using frequencies of co-occurrence in corpora (cosine method, latent semantic indexing, mutual information, etc.). These methods can be used in several fields, such as solving TOEFL-style synonym questions, detecting words that do not fit their context in order to detect speech recognition errors, and choosing synonyms in context for writing aid tools. Ref. [14] presents an approach for measuring the semantic relatedness between words based on implicit semantic links. The authors introduce Omiotis, a measure of semantic relatedness between texts, which capitalizes on the word-to-word semantic relatedness measure and extends it to measure the relatedness between texts.

There are three categories of similarity calculation between texts: word co-occurrence methods, corpus-based methods, and methods based on descriptive features [10]. Word co-occurrence methods calculate similarity with vectors. Corpus-based methods can find the semantic associations between words. However, high-dimensional and sparse sentence vectors lead to poor performance in similarity calculation. A method for measuring text semantic similarity based on corpus and knowledge is discussed in [11], and its experiments show that semantic similarity measures outperform methods based on simple lexical matching. Word-to-word similarity based on prior knowledge is discussed in [1, 9]. Pointwise mutual information [15] and latent semantic analysis [8] are used to measure word-to-word similarity. Ref. [18] presents vector-based models for semantic composition. A similarity computing system model was proposed for Chinese sentence similarity computation [19]. A modified method was proposed for concept similarity calculation [20]. Ref. [21] designed dissimilarity measures to increase their utility for instance-based learning methods.

In this paper, sentence similarity analysis depends on word similarity and the structural information of sentences. Here, word similarity considers only the spellings of words and ignores their semantic meanings. Structural information considers the orders of words and the distances between words, and ignores the syntactic information of sentences.

3. Sentence Similarity Measures

In this paper, a sentence is defined as a sequence of words and separators, denoted by s. In the preprocessing phase, a sentence is segmented into words and phrases; this is especially necessary for Chinese sentences, which have no blanks between neighboring words. Separators are also important for representing the semantics of a sentence. The length of sentence s, denoted by |s|, is defined as the number of words and separators in s.

There are three types of sentence similarity measures:

• Symbolic similarity. The more similar two sentences are, the more similar their words are, and vice versa. Here, two words are considered similar if they are similar symbolically or semantically.
• Semantic similarity. Two sentences with different symbolic and structural information can convey the same or a similar meaning. The semantic similarity of sentences is based on the meanings of the words and the syntax of the sentences.
• Structure similarity. If two sentences are similar, the structural relations between their words are similar, and vice versa. Structural relations include the relations between words and the distances between words. If the structures of two sentences are similar, the sentences are more likely to convey similar meanings.

Words form sentences, while sentences form documents. Much research focuses on word similarity and document similarity. Word similarity has two types: symbolic similarity and semantic similarity. The symbolic similarity of words can be calculated by edit distance, and the semantic similarity of words can be calculated by WordNet::Similarity. Document similarity is often calculated with the Vector Space Model (VSM): documents are represented as bags of words, the meanings of documents are represented by vectors, and the document similarity is the cosine of the vectors. If the weights of words are ignored, the document similarity can be calculated from the sets of keywords by using Dice similarity or the Jaccard coefficient.

Sentence similarity is closely related to both word similarity and document similarity. Words are the components of sentences, while sentences are the components of documents. If the words in two sentences are similar, the two sentences are more likely to be similar. If the sentences in two documents are similar, the two documents are more likely to be similar. Sentence similarity is partially based on word similarity, but the relations between words should also be considered. Word similarity cannot replace sentence similarity, because word similarity reflects the closeness of two discrete words or concepts, while sentence similarity reflects the closeness of two sequences of words and separators, and the meaning of a sentence is closely related to the order of its words.

The similarity of two documents depends on the similarity of their sentences. Document similarity is calculated from document vectors, and the VSM is efficient for calculating document similarities in large-scale document sets. To understand the meanings of documents more deeply, it is necessary to concentrate on the similarities of sentences.

Semantic sentence similarity is difficult to calculate on a large-scale corpus, because a taxonomy can hardly cover all domains. Statistical similarity between sentences considers only the words in the two sentences, without any prior knowledge such as a lexical dictionary or syntactic parsing. The cost of calculating statistical similarity is lower than that of calculating semantic similarity. Statistical similarity combines symbolic characteristics (such as word sets and vectors) and structural characteristics (such as word orders and distances).

4. Calculating Statistic Similarity between Sentences

In this section, we present six measures of statistical similarity between sentences.

• Sentence similarity based on word set is calculated from the sets of words in the two sentences.
• Sentence similarity based on vector is calculated from the vectors representing the two sentences. The weights of words are assigned in two ways: one is to assign equal (average) weights to all words; the other is to assign weights by the TF-IDF approach.
• Sentence similarity based on edit distance is calculated from the edit distance between the two sentences.
• Sentence similarity based on word order is calculated from the orders between word pairs in the sentences.
• Sentence similarity based on word distance is calculated from the distances between word pairs within the same sentence.

The first four sentence similarity measures are symbolic similarities, while the last two are structural similarities. Symbolic similarity between sentences considers only the spelling of words and ignores their meanings. There are two symbolic representations: the set of words and the bag of words. The set of words represents the meaning of a sentence with a word set, while the bag of words can be transformed into a vector that represents the meaning of the sentence. The structural information of a sentence includes word orders, word distances and the syntactic structure of the sentence. Here, we consider only word orders and word distances.

Before calculating the statistical similarity between sentences, suppose $s_a$ is a sentence of length $m$ and $s_b$ is a sentence of length $n$:

$$s_a = w_{a1} w_{a2} w_{a3} \ldots w_{am} \quad (m \ge 2)$$
$$s_b = w_{b1} w_{b2} w_{b3} \ldots w_{bn} \quad (n \ge 2)$$

where $w_{ai}$ ($i \in [1, m]$) and $w_{bj}$ ($j \in [1, n]$) are the words or separators in $s_a$ and $s_b$. Let $w(s_a)$ be the set containing all the words $w_{ai}$ ($i \in [1, m]$), and $w(s_b)$ be the set containing all the words $w_{bj}$ ($j \in [1, n]$).

To distinguish the different sentence similarities, we use the following two simple example sentences.

s_a = That is the old file .
s_b = This is the new file .

4.1. Sentence Similarity based on Word Set

To calculate sentence similarity based on word set, the word sets of the sentences are constructed first. Since sentences may differ in tense and voice, there are two ways to calculate word-based sentence similarity: one uses the words as they appear in the sentences; the other uses the stemmed words. Here, we choose the original words, because stemming would lose the tense and voice information of the sentence.

After the word sets of the two sentences are formed, the Jaccard similarity between the sentences can be calculated by

$$\mathrm{Sim}_{Jaccard}(s_a, s_b) = \frac{|w(s_a) \cap w(s_b)|}{|w(s_a) \cup w(s_b)|}.$$

Another similarity based on the word set is the Dice similarity:

$$\mathrm{Sim}_{Dice}(s_a, s_b) = \frac{2\,|w(s_a) \cap w(s_b)|}{|w(s_a)| + |w(s_b)|}.$$

The word sets of the two example sentences are

w(s_a) = {"That", "is", "the", "old", "file", "."}
w(s_b) = {"This", "is", "the", "new", "file", "."}

The two sets share four elements ("is", "the", "file", ".") and their union contains eight elements, so the word-set-based sentence similarities are

$$\mathrm{Sim}_{Jaccard}(s_a, s_b) = 4/8 = 1/2, \qquad \mathrm{Sim}_{Dice}(s_a, s_b) = 8/12 = 2/3.$$
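The two word-set measures can be illustrated with a short Python sketch. This is not the authors' implementation: it assumes the sentences are already tokenized with tokens separated by single spaces, and the function names are hypothetical. For the two example sentences the word sets share four tokens ("is", "the", "file", ".") out of eight distinct tokens.

```python
def word_set(sentence):
    """Return the set of tokens (words and separators) of a space-tokenized sentence."""
    return set(sentence.split())


def jaccard_similarity(s_a, s_b):
    """|w(s_a) & w(s_b)| / |w(s_a) | w(s_b)|"""
    a, b = word_set(s_a), word_set(s_b)
    return len(a & b) / len(a | b)


def dice_similarity(s_a, s_b):
    """2 * |w(s_a) & w(s_b)| / (|w(s_a)| + |w(s_b)|)"""
    a, b = word_set(s_a), word_set(s_b)
    return 2 * len(a & b) / (len(a) + len(b))


s_a = "That is the old file ."
s_b = "This is the new file ."
print(jaccard_similarity(s_a, s_b))  # 4 shared tokens / 8 distinct tokens
print(dice_similarity(s_a, s_b))     # 2 * 4 / (6 + 6)
```

Both functions assume non-empty sentences, in line with the requirement that each sentence has length at least 2.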

4.2. Sentence Similarity based on Word Vector

Words in a sentence have different semantic roles, so their weights should differ. In calculating document similarity, the weights of words can be calculated by TF-IDF [12]. However, it is difficult to use TF-IDF to represent the weights of words in a sentence, because the frequency is 1 for almost all words. A different approach is needed to assign different weights to different words. The semantic roles of words can be analyzed by NLP techniques, and different roles can be given different weights [5]. We adopt two strategies for vector-based sentence similarity: one assigns equal (average) weights to words based on the number of all words and separators; the other assigns weights by TF-IDF based on a corpus.

To calculate sentence similarity based on word vectors, the word vectors of the sentences are constructed first. If the words in $w(s_a)$ and $w(s_b)$ are assigned weights, $s_a$ and $s_b$ can be represented by bags of words:

$$s_a = \langle (w_{a1}, v_{a1}), (w_{a2}, v_{a2}), (w_{a3}, v_{a3}), \ldots, (w_{am}, v_{am}) \rangle$$
$$s_b = \langle (w_{b1}, v_{b1}), (w_{b2}, v_{b2}), (w_{b3}, v_{b3}), \ldots, (w_{bn}, v_{bn}) \rangle$$

where $v_{ai}$ and $v_{bj}$ denote the weights of $w_{ai}$ and $w_{bj}$. The cosine similarity between the sentences can then be calculated by

$$\mathrm{Sim}_{cos}(s_a, s_b) = \frac{\sum_{k} v_{ak} v_{bk}}{\sqrt{\sum_{k} v_{ak}^2}\,\sqrt{\sum_{k} v_{bk}^2}}$$

where $k$ ranges over the words in $w(s_a) \cup w(s_b)$ and the weight of a word absent from a sentence is 0.

For the two example sentences, the vectors are as follows:

s_a = <(That, 1), (is, 1), (the, 1), (old, 1), (file, 1), (., 1)>
s_b = <(This, 1), (is, 1), (the, 1), (new, 1), (file, 1), (., 1)>

We set the weights of all words in the sentences to 1. If a word occurs two or more times in one sentence, its weight is accumulated. The sentence similarity based on the vectors is then 4/6 = 2/3.

Sentence similarity based on word vector considers the words and their weights in the two sentences. Vector-based similarity is popular in information retrieval. However, it does not consider the orders of and distances between words. It is also a symbolic similarity, and different word weights distinguish the importance of words.

Stop words are words that carry little significance in a sentence. They occur across many different sentences and are therefore less important for distinguishing the meanings of sentences. In similarity measures for long texts based on word sets and vectors, stop words are usually filtered out before the word sets and vectors are formed, because too many stop words introduce noise. In sentences, however, stop words are important, for they imply the structural information of sentences. If all the stop words are eliminated from a sentence, part of the structural information may be lost when the sentence similarity is calculated.
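The equal-weight variant of the vector-based measure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: sentences are assumed to be space-tokenized, every occurrence of a token contributes a weight of 1 (so repeated tokens accumulate), and the function names are hypothetical. The TF-IDF variant would only change how the weights are assigned.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Map each token to its accumulated weight (1 per occurrence)."""
    return Counter(sentence.split())


def cosine_similarity(s_a, s_b):
    """Cosine of the angle between the two weighted bag-of-words vectors."""
    v_a, v_b = bag_of_words(s_a), bag_of_words(s_b)
    # Only tokens present in both sentences contribute to the dot product.
    dot = sum(v_a[w] * v_b[w] for w in v_a.keys() & v_b.keys())
    norm_a = math.sqrt(sum(v * v for v in v_a.values()))
    norm_b = math.sqrt(sum(v * v for v in v_b.values()))
    return dot / (norm_a * norm_b)


s_a = "That is the old file ."
s_b = "This is the new file ."
print(cosine_similarity(s_a, s_b))  # 4 / (sqrt(6) * sqrt(6)), about 0.667
```

Because stop words and separators are kept as ordinary tokens, they contribute to the vectors in the same way as content words.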

4.3. Sentence Similarity based on Edit Distance

According to the spellings of the words in two sentences, the edit distance between the two sentences can be calculated. There are several kinds of edit distance and related algorithms: Hamming distance, Levenshtein distance, Damerau-Levenshtein distance, Jaro-Winkler distance, Wagner-Fischer edit distance, Ukkonen and Hirschberg. We choose the Levenshtein distance as the edit distance between sentences.

Definition 4.1 (Levenshtein Edit Distance). The edit distance of two strings is the minimal cost of a sequence of symbol insertions, deletions, or substitutions transforming one string into the other.

For the two example sentences $s_a$ and $s_b$, the edit distance between $s_a$ and $s_b$ is 5: there are 2 substitutions from "That" to "This", and 3 substitutions from "old" to "new".

Here, we calculate the sentence similarity based on the edit distance by

$$\mathrm{Sim}_{edit}(s_a, s_b) = \frac{1}{1 + d_{edit}(s_a, s_b)}$$

where $d_{edit}(s_a, s_b)$ is the edit distance between $s_a$ and $s_b$. The sentence similarity values are thereby mapped onto the interval (0, 1].

Edit-distance-based similarity is also a symbolic similarity, and it is popular for calculating the similarity of sequences such as strings, languages and biological sequences.
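The edit-distance measure can be sketched in Python as follows. The sketch assumes character-level Levenshtein distance with unit costs (which reproduces the distance of 5 for the example sentences) and assumes the similarity is obtained as 1 / (1 + d), consistent with the stated range (0, 1]; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Unit-cost Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s_a, s_b):
    """Map the edit distance onto (0, 1]; identical sentences give 1."""
    return 1 / (1 + levenshtein(s_a, s_b))


s_a = "That is the old file ."
s_b = "This is the new file ."
print(levenshtein(s_a, s_b))      # 2 substitutions in "That" + 3 in "old"
print(edit_similarity(s_a, s_b))  # 1 / (1 + 5)
```

The same function also accepts token lists, for example `levenshtein(s_a.split(), s_b.split())`, which would count word-level edits instead of character-level ones.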

4.4. Word Order based Sentence Similarity

According to the positions of the words in a sentence, order relations between word pairs, such as before and after, can be established. Figure 1 shows the sequential network of a sentence.

Figure 1. The sequential network of a sentence

The sequential relations between words form a sequential network of words, which can be used to discover frequent patterns. The commonly used n-gram language model uses n adjacent words to describe the sequential characteristics of words, whereas the sequential network captures sequential relations beyond a span of n words: the distances between words vary from 1 to |s|-1.

$$L(s_a) = \{(w_{a1}, w_{a2}), (w_{a1}, w_{a3}), \ldots, (w_{a(m-1)}, w_{am})\}$$
$$L(s_b) = \{(w_{b1}, w_{b2}), (w_{b1}, w_{b3}), \ldots, (w_{b(n-1)}, w_{bn})\}$$

where $(w_x, w_y) \in L(s_a) \cup L(s_b)$ means that $w_x$ is before $w_y$. The similarity between $s_a$ and $s_b$ can be calculated from the orders of words by

$$\mathrm{Sim}_{order}(s_a, s_b) = \frac{|L(s_a) \cap L(s_b)|}{|L(s_a) \cup L(s_b)|}.$$

For the two example sentences, the sequential relations between the words in the two sentences are listed in Table 1.

Table 1. Sequential relations between words in the two example sentences

L(s_a): (That, is), (That, the), (That, old), (That, file), (That, .), (is, the), (is, old), (is, file), (is, .), (the, old), (the, file), (the, .), (old, file), (old, .), (file, .)
L(s_b): (This, is), (This, the), (This, new), (This, file), (This, .), (is, the), (is, new), (is, file), (is, .), (the, new), (the, file), (the, .), (new, file), (new, .), (file, .)

The two sets share 6 pairs out of 24 distinct pairs, so the similarity between $s_a$ and $s_b$ based on word orders is

$$\mathrm{Sim}_{order}(s_a, s_b) = 6/24 = 1/4.$$
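The word-order measure can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes space-tokenized sentences and treats the order relations as a set of (earlier word, later word) pairs, so a pair produced more than once by repeated words is counted only once. The function names are hypothetical.

```python
from itertools import combinations


def order_pairs(sentence):
    """Set of (w_x, w_y) pairs such that w_x occurs before w_y in the sentence."""
    tokens = sentence.split()
    # combinations() keeps the original token order inside each pair.
    return set(combinations(tokens, 2))


def order_similarity(s_a, s_b):
    """|L(s_a) & L(s_b)| / |L(s_a) | L(s_b)|"""
    l_a, l_b = order_pairs(s_a), order_pairs(s_b)
    return len(l_a & l_b) / len(l_a | l_b)


s_a = "That is the old file ."
s_b = "This is the new file ."
print(len(order_pairs(s_a)))       # 15 ordered pairs for a 6-token sentence
print(order_similarity(s_a, s_b))  # 6 shared pairs / 24 distinct pairs
```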

Sequential relations between words are instances of connections between words. The sequential relations are therefore also connection characteristics of words.

4.5. Word Distance based Sentence Similarity

To consider the distance between word pairs, it is necessary to calculate the distances between each pair of words in the same sentence. The distance between two words takes the form $(w_1, w_2, d)$, where $w_1$ and $w_2$ are two words, and $d$ is the distance between $w_1$ and $w_2$. Figure 2 shows the distances between the words in sentence $s_a$.

Figure 2. The distance network of a sentence

The distances between the word pairs in the same sentence are collected into a list of such triples, one list for $s_a$ and one for $s_b$.

Let $W(a, b) = \{(w_{ax}, w_{ay}) \mid 1 \le x \le y \le m\} \cap \{(w_{bp}, w_{bq}) \mid 1 \le p \le q \le n\}$. The similarity between $s_a$ and $s_b$ is then calculated from the distances, within $s_a$ and within $s_b$, of the word pairs $(w_i, w_j) \in W(a, b)$, summed over all such pairs:

[Equation: $\mathrm{Sim}_{distance}(s_a, s_b)$, a sum over the pairs in $W(a, b)$]

where $w_{ai} = w_{bi}$ and $w_{aj} = w_{bj}$, and $1 \le i, j \le |W(a, b)|$.

For the two example sentences, the similarity between $s_a$ and $s_b$ based on the distances between words is

$$\mathrm{Sim}_{distance}(s_a, s_b) = 40/141.$$

The meaning of a sentence is relevant not only to the orders of words but also to the distances between words. The distances between words are also a structural characteristic of sentences.
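The inputs to the word-distance measure can be sketched in Python as follows. This sketch only builds the (word, word, distance) information and the shared pair set W(a, b); it does not implement the similarity formula itself. It assumes space-tokenized sentences, takes the distance between two words to be the difference of their positions, and assumes no word is repeated within a sentence (a repeated pair would keep only its last distance). All function names are hypothetical.

```python
def pair_distances(sentence):
    """Map each ordered word pair (w_x, w_y), x < y, to its positional distance y - x."""
    tokens = sentence.split()
    return {(tokens[x], tokens[y]): y - x
            for x in range(len(tokens))
            for y in range(x + 1, len(tokens))}


def shared_pair_distances(s_a, s_b):
    """For each pair in W(a, b), return (distance in s_a, distance in s_b)."""
    d_a, d_b = pair_distances(s_a), pair_distances(s_b)
    return {pair: (d_a[pair], d_b[pair]) for pair in d_a.keys() & d_b.keys()}


s_a = "That is the old file ."
s_b = "This is the new file ."
shared = shared_pair_distances(s_a, s_b)
for pair, (dist_a, dist_b) in sorted(shared.items()):
    print(pair, dist_a, dist_b)
# Sum of the products of the paired distances over W(a, b).
print(sum(dist_a * dist_b for dist_a, dist_b in shared.values()))
```

For the example sentences the final sum is 40, which coincides with the numerator of the reported value 40/141; the normalization used in the denominator is not reproduced here.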

5. Evaluation and Discussion

5.1. Evaluation

During translation between different natural languages, a source sentence may be translated into different sentences in the target language. These target sentences are semantically similar in meaning. We use different sentence similarity methods to find the top-k statistically similar sentences.

We choose 40 sentences as a test corpus to compare the different statistical similarity measures. The 40 sentences belong to 10 groups, and each group contains 4 sentences with similar meaning. These
sentences are selected from the NIST05 corpus for BLEU tests in machine translation. The sentence groups are listed in the Appendix.

Based on the sentence similarity measures, we can compute the precision rate and the recall rate. The F-measure is used to evaluate the sentence similarity by considering both the precision rate and the recall rate.

Sentence similarity is used to find the top-4 similar sentences of a sentence among the 40 sentences. The fetched sentences are compared with the sentences in the same group, and then the precision rate, recall rate and F-measure are calculated. The formulas for the precision rate p, recall rate r and F-measure F are as follows:

$$p = \frac{|S_f \cap S_g|}{|S_f|}, \qquad r = \frac{|S_f \cap S_g|}{|S_g|}, \qquad F = \frac{2pr}{p + r}$$

where $S_f$ is the set of fetched sentences and $S_g$ is the set of sentences in the same group as the query sentence.
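The evaluation procedure can be sketched in Python as follows. This is an illustrative sketch, not the authors' evaluation code: the four toy sentences and their group labels are made up for the example (the real test corpus has 40 sentences in 10 groups and uses k = 4), Jaccard similarity stands in for any of the six measures, and all names are hypothetical. As in the paper's setup, the fetched set includes the query sentence itself.

```python
def precision_recall_f(fetched, relevant):
    """Precision, recall and F-measure of a fetched set against the relevant set."""
    hits = len(fetched & relevant)
    p = hits / len(fetched)
    r = hits / len(relevant)
    f = 2 * p * r / (p + r) if hits else 0.0
    return p, r, f


def top_k(query, sentences, similarity, k):
    """Indices of the k sentences most similar to sentences[query] (query included)."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: similarity(sentences[query], sentences[i]),
                    reverse=True)
    return set(ranked[:k])


def jaccard(s_a, s_b):
    a, b = set(s_a.split()), set(s_b.split())
    return len(a & b) / len(a | b)


# Hypothetical toy corpus: two groups of two sentences each.
sentences = ["the cat sat .", "the cat sat down .",
             "stocks fell today .", "stocks fell sharply today ."]
groups = [0, 0, 1, 1]

scores = []
for q in range(len(sentences)):
    fetched = top_k(q, sentences, jaccard, k=2)
    relevant = {i for i, g in enumerate(groups) if g == groups[q]}
    scores.append(precision_recall_f(fetched, relevant))

# Average precision, recall and F-measure over all query sentences.
avg = [sum(column) / len(scores) for column in zip(*scores)]
print(avg)
```

When the number of fetched sentences equals the group size, precision and recall coincide for every query, so the F-measure of each query equals both.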

We choose the top-4 most similar sentences for each sentence according to the different sentence similarity measures. The top-4 sentences include the sentence itself. We compare the set of 4 fetched sentences with the set of 4 sentences in the original group, and calculate the precision rate, recall rate and F-measure for each fetched set; there are 40 such sets, one per sentence. We then take the averages as the precision rate, recall rate and F-measure of the sentence similarity measure on the test corpus.

Table 2 lists the experimental results of the different sentence similarity measures. Figure 3 shows the precision rate, recall rate and F-measure of the different measures. The evaluation results show that sentence similarity based on word set and sentence similarity based on word order perform better than the other measures. Sentence similarity based on edit distance has the lowest precision rate, recall rate and F-measure. Sentence similarity based on TF-IDF also has a relatively low precision rate and recall rate. Although the evaluation results depend closely on the test corpus, they still show the differences between the sentence similarity measures. The evaluation results can be explained as follows:

• Sentence similarity based on word set and sentence similarity based on word order capture more local information about sentence pairs. The statistical similarity between two sentences depends more heavily on the local information of the two sentences than on statistical information acquired from the whole corpus.
• Sentence similarity based on edit distance considers only the insertions, deletions and substitutions of characters and separators. It can hardly capture the meaning of words, so it has the worst performance in the evaluation.
• Word orders between word pairs are more important than the distances between word pairs in calculating sentence similarity.
• Sentence similarity based on the TF-IDF weighted vector performs worse than sentence similarity based on the average weighted vector. This means that the local information of the two sentences is more important than the global information in the corpus.

Table 2. Precision rate, recall rate and F-measure of six kinds of statistical similarity measures between sentences on the test corpus

Similarity based on        Precision rate   Recall rate   F-measure
word set                   1                1             1
average weighted vector    0.9797           0.9797        0.9797
edit distance              0.8786           0.8786        0.8786
word order                 0.9932           0.9932        0.9932
word distance              0.9730           0.9730        0.9730
TF-IDF weighted vector     0.8952           0.8952        0.9851

Figure 3. Precision rate, recall rate and F-measure of different sentence similarity measures on the test corpus

The pairwise sentence similarities for the 40 sentences are also calculated. Figure 4 shows the gray color maps of the pairwise similarity between sentences calculated by the different sentence similarity methods. Each sub-figure is symmetric, and there are seven parallel bright lines in these sub-figures, which show the information of similar sentence groups. The longest bright line contains 40 grid cells representing the similarity of each of the 40 sentences with itself. The other bright lines show the similar sentences that are in the same sentence groups. The gray color maps show that word set based sentence similarity, word order based sentence similarity, word distance based sentence similarity and TF-IDF weighted vector based similarity have clear bright lines, whereas the bright lines of average weighted vector based similarity and edit distance based sentence similarity are not clear.

5.2. Discussion

Similarity is a cognitive concept representing the similarity between two objects in attributes (such as size, color and texture) or in semantics in the concept space. The similarity between two objects is closely tied to human cognition, and the degree of similarity between two objects may differ from person to person. Similarities in natural languages play an important role in information access, such as organization, retrieval and recommendation.

(a) based b on worrd set

(b) based on average weigghted vector

(c) baased on edit distance

(d) bassed on word oorder

(e) bassed on word d distance (f) based on T TF-IDF weighted vector ure 4. Gray coolor map for ppairwise similaarity between sentences calcculated by diffferent similaritties. Figu In tthe color graphh, the color off the small gridd shows the similarity betweeen the two seentences of x-aaxis and y-axis. The brighter is thee color the morre similar two sentences aree. (a) sentence similarity bassed on word set (b) ssentence similarity based onn average weigghted vector (cc) sentence sim milarity basedd on edit distance (d) ssentence similaarity based onn word order (ee) sentence sim milarity basedd on word distaance (f) senteence similarityy based on TF--IDF weightedd vector. A Approaches too calculating similarity beetween long ttexts cannot bbe used to calculate c senteence simiilarity, and thee reasons are aas follows:  Similarity beetween long teexts is calculaated based on the t vectors, w which comes frrom bag of woords. However, inn two sentencees, the keywords are hard to be chosen by using TF-IDF F, and most words in the two seentence occurss with few tim mes, and even some words oonly occur oncce. So the weiights of words aree hard to be asssigned. Wordds in the vectorrs representingg the two senttences formulaate a word set conntaining all thee words in thee two sentencees respectivelyy. If words in the two sentennces are very diff fferent, the twoo vectors reprresenting the two t sentencess are sparse, which w leads too the similarity caalculation resuults to be very small.  During the similarity callculation, the function worrds such as “tthe”, “a”, “annd”, and “of”” are ignored as thhe stop wordss. However, thhese function words play im mportant roles in sentencess for describing ssyntactic inforrmation of senntences. So thhe syntactic innformation off sentences shoould combine thee function woords. This is different from m the vector based similaarity by assignning weights withh TF-IDF.

• Sentence semantic similarity can be calculated by considering the semantics of words. Sentences with similar meanings may share no common words, so it is difficult to detect semantic similarity between sentences using only the lexical and statistical information of words.

• Sentence similarity can be used to cluster sentences by syntactic patterns [16] and to analyze patterns in languages [17].
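
The first two points above can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term counts instead of TF-IDF weights, a stop list holding only the four function words named in the text, and two made-up example sentences; the function names are hypothetical and this is not the implementation evaluated in the paper.

```python
import math
from collections import Counter

# Assumption: only the four function words named in the text are stop words.
STOP_WORDS = {"the", "a", "and", "of"}


def term_counts(sentence, drop_stop_words=False):
    """Count whitespace-separated tokens, optionally dropping stop words."""
    tokens = sentence.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)


def cosine_similarity(counts1, counts2):
    """Cosine of two bag-of-words vectors over the union of their words."""
    vocabulary = set(counts1) | set(counts2)
    # A Counter returns 0 for a missing word, so absent words add nothing.
    dot = sum(counts1[w] * counts2[w] for w in vocabulary)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def sentence_similarity(s1, s2, drop_stop_words=False):
    """Bag-of-words cosine similarity between two sentences."""
    return cosine_similarity(
        term_counts(s1, drop_stop_words),
        term_counts(s2, drop_stop_words),
    )


if __name__ == "__main__":
    # Hypothetical sentences chosen for illustration only.
    s1 = "the pair met in the city"
    s2 = "the couple married in the chapel"
    print(sentence_similarity(s1, s2))
    print(sentence_similarity(s1, s2, drop_stop_words=True))
```

With the function words kept, the two example sentences score about 0.625, almost entirely because both contain “the” and “in”. With the stop list applied the score falls to 0.25, since nearly every remaining word occurs once and in only one of the two sentences.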

6. Conclusion

Sentence similarity is important in information organization and retrieval, and it is closely related to both word similarity and document similarity. Statistical similarity between sentences can be calculated from the symbolic characteristics and structural information of sentences; such measures need no prior knowledge, only the statistical information of the sentences themselves. This paper presents six approaches to calculating statistical similarity between sentences and evaluates them on a test corpus of 40 sentences. The evaluation results show the differences between these measures. The proposed measures can be used in short-text applications such as corpus construction and title/abstract-based document recommendation.

7. Acknowledgment

This research has been partially supported by ISTIC research foundation projects YY-2010018, ZD2010-3-3, XK2010-6 and Beijing Excellent Talent Fostering Foundation 2010D009012000001.

8. References

[1] Wu, Z. and Palmer, M. “Verbs semantics and lexical selection”, Association for Computational Linguistics, 1994.
[2] Pedersen, T., Patwardhan, S. and Michelizzi, J. “WordNet::Similarity: measuring the relatedness of concepts”, Association for Computational Linguistics, 2004.
[3] T. Coelho, P. Calado, L. Souza, B. Ribeiro-Neto, and R. Muntz. “Image retrieval using multiple evidence ranking”. IEEE Transactions on Knowledge and Data Engineering, 16(4):408–417, 2004.
[4] G. Erkan and D. Radev. “LexRank: Graph-based lexical centrality as salience in text summarization”. Journal of Artificial Intelligence Research, 22(1):457–479, 2004.
[5] D. Gildea and D. Jurafsky. “Automatic labeling of semantic roles”. Computational Linguistics, 28(3):245–288, 2002.
[6] A. Islam and D. Inkpen. “Semantic text similarity using corpus-based word similarity and string similarity”. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2):1–25, 2008.
[7] A. Islam and D. Inkpen. “Semantic similarity of short texts”. Recent Advances in Natural Language Processing V: Selected Papers from RANLP 2007, page 227, 2009.
[8] T. Landauer, P. Foltz, and D. Laham. “An introduction to latent semantic analysis”. Discourse Processes, 25(2):259–284, 1998.
[9] C. Leacock and M. Chodorow. “Combining local context and WordNet similarity for word sense identification”. WordNet: An electronic lexical database, 49(2):265–283, 1998.
[10] Y. Li, D. McLean, Z. Bandar, J. O’Shea, and K. Crockett. “Sentence similarity based on semantic nets and corpus statistics”. IEEE Transactions on Knowledge and Data Engineering, pages 1138–1150, 2006.
[11] Mihalcea, R., Corley, C. and Strapparava, C. “Corpus-based and knowledge-based measures of text semantic similarity”, in Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, pages 775–780, AAAI Press, 2006.
[12] G. Salton and M. McGill. Introduction to modern information retrieval. McGraw-Hill, New York, 1983.
[13] H. Somers. “Review article: Example-based machine translation”. Machine Translation, 14(2):113–157, 1999.
[14] G. Tsatsaronis, I. Varlamis, and M. Vazirgiannis. “Text relatedness based on a word thesaurus”. Journal of Artificial Intelligence Research, 37:1–39, 2010.
[15] P. Turney. “Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL”. Machine Learning: ECML 2001, pages 491–502, 2001.
[16] Fu, K. and Lu, S. “A clustering procedure for syntactic patterns”, IEEE Transactions on Systems, Man and Cybernetics, 7:734–742, 1977.
[17] Lu, S. and Fu, K. “A sentence-to-sentence clustering procedure for pattern analysis”, IEEE Transactions on Systems, Man and Cybernetics, 8:381–389, 1978.
[18] Mitchell, J. and Lapata, M. “Vector-based models of semantic composition”, in Proceedings of ACL-08: HLT, pages 236–244, 2008.
[19] Cheng Xian-yi, Sun Ping, Zhu Qian, Cai Yue-hong. “The Research of Chinese Semantic Similarity Calculation Introduced Punctuations”, JCIT, Vol. 5, No. 7, pp. 17–23, 2010.
[20] Zhang Ye, Cai GuoQiang, Jia DongYue. “A Modified Method for Concepts Similarity Calculation”, JCIT, Vol. 6, No. 1, pp. 34–40, 2011.
[21] Lluis A. Belanche, Jorge Orozco. “On the design of metric relations”, JCIT, Vol. 3, No. 3, pp. 70–81, 2008.

Appendix: Ten groups of sentences

Ten groups of sentences selected from NIST05 BLEU corpus are as follows.

1.
(a) meriam abandoned her husband a year ago .
(b) meriam abandoned her husband a year ago .
(c) meriam abandoned her husband a year ago .
(d) meriam abandoned her husband a year ago .

2.
(a) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .
(b) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .
(c) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .
(d) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .

3.
(a) this week ’s eurozone economic indicators to show continuous economic weakening in europe
(b) this week ’s eurozone economic indicators to show continuous economic weakening in europe
(c) this week ’s eurozone economic indicators to show continuous economic weakening in europe
(d) this week ’s eurozone economic indicators to show continuous economic weakening in europe

4.
(a) “ she ’s gone off the deep end , “ johnson says .
(b) “ she ’s gone off the deep end , “ johnson says .
(c) “ she ’s gone off the deep end , “ johnson says .
(d) “ she ’s gone off the deep end , “ johnson says .

5.
(a) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .
(b) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .
(c) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .
(d) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .

6.
(a) the pair met in 1999 when career military man johnson was stationed in bahrain .
(b) the pair met in 1999 when career military man johnson was stationed in bahrain .
(c) the pair met in 1999 when career military man johnson was stationed in bahrain .
(d) the pair met in 1999 when career military man johnson was stationed in bahrain .

7.
(a) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .
(b) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .
(c) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .
(d) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .

8.
(a) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .
(b) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .
(c) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .
(d) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .

9.
(a) however , johnson says that their star - crossed union comes to an end because of the temptations of the “ sin city “ of las vegas , the constant tensions with meriam ’s rich and powerful family and rumors of an assassination plot against him .
(b) however , johnson says that their star - crossed union comes to an end because of the temptations of the “ sin city “ of las vegas , the constant tensions with meriam ’s rich and powerful family and rumors of an assassination plot against him .
(c) however , johnson says that their star - crossed union comes to an end because of the temptations of the “ sin city “ of las vegas , the constant tensions with meriam ’s rich and powerful family and rumors of an assassination plot against him .
(d) however , johnson says that their star - crossed union comes to an end because of the temptations of the “ sin city “ of las vegas , the constant tensions with meriam ’s rich and powerful family and rumors of an assassination plot against him .

10.
(a) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .
(b) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .
(c) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .
(d) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .
