Semantic Textual Similarity between sentences

Sanketh Kokkodu Balakrishna, Daniel Sam Pete Thiyagu
College of Information and Computer Sciences
University of Massachusetts Amherst
{skokkodubala, dthiyagu}@cs.umass.edu

Abstract— In this project, we present a comparative study of methods for rating the similarity of two sentences. Our methods range from well-known corpus-based methods to lesser-known strategies for computing the semantic similarity of two given sentences. This includes the computation of semantic similarity using semantic knowledge bases like WordNet; supervised learning methods like Linear Regression, Support Vector Machines, and the Multilayer Perceptron; and methods utilizing Word2Vec. We also use various similarity metrics like the Jaccard Coefficient and Containment Coefficient to aid the computation of semantic similarity. Some of our methods performed better than the standard baseline algorithm.


I. INTRODUCTION


Semantic Textual Similarity (STS) between two pieces of text is a core discipline in the fields of Natural Language Processing (NLP) and Information Extraction. There are many methods based on corpus statistics, as well as methods using well-known knowledge bases like WordNet [12] and Wikipedia [6]. Other strategies revolve around word alignment [17], Machine Learning [4], Deep Learning methods like the Tree LSTM [18], and ensemble methods [16]. The amount of research in this domain is itself an indicator of how important the problem is. Through this project, one question we try to answer is whether a machine can be taught to understand the notion of semantic similarity. Similarity measures can be applied to word phrases, sentences, paragraphs, or documents. The aim of this project is to compute the similarity between very short texts, primarily of sentence length, by utilizing the SemEval datasets, which consist of pairs of sentences along with their gold tags. The problem is stated as follows: given two sentences, predict a semantic similarity score in the range 0-5, where 5 means the sentences are very similar and 0 means they are very dissimilar. For example, take the English pair of sentences "The man is exercising" and "A man is doing pull-ups", for which the gold annotated score is 3.2. The sentences are very similar because they share a lot of information. The score in this case also raises concerns about the subjective nature of this problem, which makes it a hard problem to solve. We explored various methods, such as using semantic and word order vectors as explained in [12], supervised learning methods using various similarity metrics and features as described in [4], and utilizing Word2Vec to find the semantic composition of a sentence.

II. MOTIVATION AND APPLICATIONS

Motivation for this project arises from the fact that Semantic Textual Similarity is a core part of any web search engine and plagiarism detector. It lies at the core of many other language processing tasks, including paraphrase detection, question answering, and query ranking. The STS task has applications in information retrieval, i.e., web query retrieval, as well as in summarization modules. In the case of web page retrieval, Semantic Textual Similarity between sentences has been used to improve retrieval effectiveness [15]. For the case of retrieving images from the web given a textual description, relevant work has been done in [1]. In addition, the incorporation of short-text similarity is beneficial to applications such as text summarization [5], text mining [2], and text categorization [10].

III. RELATED WORK

In this section we outline other published work in the domain of Semantic Textual Similarity.

A. Semantic Textual Similarity using Lexical, Syntactic, and Semantic Information

This paper [4] relates the computation of semantic similarity to the lexical, syntactic, and semantic information in sentences, establishing lexical and syntactic features in the form of POS n-gram, character n-gram, and lemma n-gram overlaps. The authors exploit the semantic information in a sentence using word embedding methods, paragraph vectors, Tree LSTMs, and word alignment strategies. With these features, they compute similarity scores using supervised learning methods like Linear Regression, Support Vector Machines, Gaussian Process Regression, Decision Tree Regression, and Perceptron Regression. They also compare against unsupervised methods based on weighted word alignment strategies as described in Section III-C. A tree model [18] helps represent sentence structure and is described briefly in Section III-B. They also utilize Paragraph2Vec [11], an unsupervised method for learning representations of text. They were able to achieve excellent scores using SVM and the features described. We implemented some of the methodologies presented in this paper.

B. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

The authors [18] utilize order-sensitive models like tree models, which help in finding semantic relatedness. For example, "cats climb trees" and "trees climb cats" would exhibit two different tree structures, which helps capture syntactic structure. While the standard LSTM computes its hidden state from the input at the current time step and the hidden state from the previous time step, the Tree-LSTM computes its state from an input vector and the hidden states of arbitrarily many child units. They propose the Child-Sum Tree-LSTM and the N-ary Tree-LSTM, where the branching factor is at most N and the children are ordered. In the former, the hidden state of a node is the sum of the hidden states of the node's children at the previous time step. For the computation of semantic relatedness, they turn it into a classification task where the output is $\hat{y} \in \{1, 2, ..., K\}$. They compute sentence representations $h_L$ and $h_R$ for sentences L and R using a Tree-LSTM over the parse tree of each sentence, and predict the output from $h_L$ and $h_R$ with a neural network. They achieve excellent results using a Dependency Tree-LSTM, which is a Child-Sum Tree-LSTM applied to a dependency tree. The definitions and the models are given in [18].

C. Sentence Similarity from Word Alignment

This paper [17] presents an algorithm based on the hypothesis that two sentences are more or less similar depending on the number of semantic units they have in common, as well as whether those semantic units occur in similar contexts in the sentences. The algorithm aligns word sequences and named entities, and aligns content words using dependencies and using surrounding words. For finding similar words, they use ParaphraseDB. Alignment of content words using surrounding words is done with a span window of three words to the left and right. Equivalence of dependency structures uses dependency parses from Stanford's parser, and the method is described with an example as follows. Consider the two sentences $S^{(1)}$ and $S^{(2)}$ in Figure 1. We see that $w_2^{(1)}$ and $w_6^{(2)}$ are 'wrote', and $w_4^{(1)}$ and $w_4^{(2)}$ are 'book'. Consider the typed dependencies $dobj(w_2^{(1)}, w_4^{(1)})$ in $S^{(1)}$ and $rcmod(w_4^{(2)}, w_6^{(2)})$ in $S^{(2)}$. Both represent the relation "thing that was written" between the verb 'wrote' and its argument 'book'. So they establish a possible alignment between the pair $(w_2^{(1)}, w_6^{(2)})$ ('wrote') given the aligned pair $(w_4^{(1)}, w_4^{(2)})$ ('book'), by treating dependency types like dobj and rcmod as equivalent, as shown here. They achieve the best performance in SemEval 2014 based on the correlation metric.

Fig. 1: Dependency equivalence, based on [17].

D. Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures

This paper [3] uses a log-linear regression model along with multiple text similarity measures like the Jaccard Coefficient and Containment Coefficient, and the Lin, Resnik, and Jiang-Conrath measures. The features used for finding semantic similarity range from character and word n-grams and common subsequences to Explicit Semantic Analysis vector comparisons, a Distributional Thesaurus, and pairwise word similarity.

For features relating to structure and syntax, they utilized stopword n-grams, POS n-grams, and word pair order (which indicates whether two words occur in the same order in both texts). They also used measures that capture statistical properties, like the type-token ratio (TTR) and sequential TTR.

E. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis

This paper [6] uses machine learning techniques to build a semantic interpreter that translates text into a weighted sequence of Wikipedia concepts ordered by their relevance, called an interpretation vector. The meaning of a text is then interpreted in terms of its affinity with a host of Wikipedia concepts, and semantic relatedness is computed using the cosine similarity metric. ESA has shown good results in terms of correlation with human judgments, from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts, where r is the correlation coefficient.

F. Text Comparison Using Soft Cardinality

This paper [9] defines a soft cardinality method that treats the affinity between elements of a set separately from the weight imposed on an element of the set. The authors tested this using bi-gram intersection as well as cosine similarity, and showed that their text comparison outperformed the standard baselines when it utilized a soft TF-IDF weight.

G. Combining Soft Cardinality Features for Semantic Textual Similarity, Relatedness and Entailment

This paper [8] defines a soft cardinality approach for extracting features for machine learning methods. The approach is extensible and supports plugging in various similarity measures for finding semantic and syntactic relatedness, yielding richer information about the similarity measure. The additional features utilized in this paper include, but are not limited to, antonymy, hypernymy, negation of words, features from dependencies, features for each part of speech, ESA features [6], and string matching features. The regression models used were the reduced-error pruning tree (REPtree) and Linear Regression. This method ranked among the top systems in the SemEval task.

H. Attention-Based Multi-Perspective Convolutional Neural Networks for Textual Similarity Measurement

This paper [7] uses an attention-based convolutional neural network for predicting semantic similarity, utilizing Paragram-Phrase word embeddings. Their attention-based interaction layer converts two independent input sentences into an inter-related sentence pair; applied over the raw word embeddings of the input sentences, it generates re-weighted word embeddings that are fed into their Multi-Perspective CNN (MPCNN) model. The attention-based re-weighting guides the MPCNN model toward the important words in each sentence that are most similar to the other sentence. Their model outperforms the winning entry in the STS SemEval 2015 task.

I. A Deep Ensemble System for Semantic Textual Similarity

The authors [16] utilize features like the alignment ratio [17], cosine similarity of Word2Vec centroids, cosine similarity of bags of words, and machine translation metrics like BLEU, METEOR, NIST, BADGER, TER, and TERp. They used the feature-based methods indicated above as well as a standard LSTM and a Tree LSTM [18], and also created an ensemble method that predicts based on the scores given by the above methods. They were also among the top systems in the STS task.

J. Sentence Similarity Based on Semantic Nets and Corpus Statistics

This paper [12] introduces the use of semantic vectors and word order vectors, and makes use of the lexical database WordNet to compute the similarity metric used in designing the semantic vector. They also use a word order vector, which captures differences in meaning caused by differences in word order. They derive information content from the Brown corpus, which is also used in computing the semantic vector. The overall similarity is then computed as a weighted combination of the semantic similarity and the word order similarity.

IV. DATASET AND SOFTWARE

The dataset used in this project is obtained from the STS SemEval tasks. We have the datasets from 2012 to 2016, comprising multiple files. Each file consists of pairs of sentences along with the corresponding gold tags. The gold tags are scores from 0 to 5, based on the STS evaluation methods. We have a total of 14,778 pairs of sentences along with gold tags in the data.

The problem we are trying to solve is hard in the sense that the scores are subjective, and each person may rate a pair of sentences differently. In this project, we assume the scores given in the dataset to be correct, and build our models to match these scores.

The software and tools used for this project are:
• Anaconda Spyder
• Lexical database
• Brown Corpus
• WordNet, Stanford dependency parser, Stanford Named Entity Recognizer, ParaphraseDB
• STS evaluation data

V. TECHNICAL APPROACH AND MODELS

A. Metrics and Similarity Measures

We use various metrics and semantic similarity measures as features for the supervised learning methods described from Section V-D onward.

Jaccard Coefficient: The Jaccard Coefficient of two sets s1 and s2, each comprising character n-gram, POS-tagged n-gram, or lemmatized n-gram tokens, is given by:

$$JC = \frac{count(s_1 \cap s_2)}{count(s_1 \cup s_2)}$$

A weighted count can also be used to give less weight to frequently occurring words like 'a', 'the', etc. The formula then changes to

$$JC = \frac{weighted\_count(s_1 \cap s_2)}{weighted\_count(s_1 \cup s_2)}$$

$$weighted\_count(s) = \sum_{j=1}^{k} \sum_{i=1}^{n} IDF[word_{ji}]$$

where n is the n-gram size, s is a set of n-grams, k is the cardinality of the set, and $word_{ji}$ is the word in the $i$-th position of the n-gram for the $j$-th element of the set.

Containment Coefficient: The Containment Coefficient of two sets s1 and s2, comprising the same kinds of n-gram tokens, is given below. If we do not take the weights into account, weighted_count is replaced by count in the following formulas. For an n-gram weighting, the corresponding equations are

$$Containment(s_2\ in\ s_1) = \frac{weighted\_count(s_1 \cap s_2)}{weighted\_count(s_1)}$$

$$Containment(s_1\ in\ s_2) = \frac{weighted\_count(s_1 \cap s_2)}{weighted\_count(s_2)}$$

We define the Containment Coefficient as the average of these two values:

$$Containment\ Coefficient = \frac{Containment(s_1\ in\ s_2)}{2} + \frac{Containment(s_2\ in\ s_1)}{2}$$

IDF: The train-test split is 80-20%. The IDF score for a word is given by

$$IDF[word] = \frac{N}{1 + appearance(word)}$$

where N is the number of documents and appearance(word) returns the number of documents in which the word appears. The one in the denominator is a pseudo-count that avoids division by zero. These IDF scores weight the words when computing similarity measures such as cosine similarity, the Jaccard Coefficient, and the Containment Coefficient.
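To make the above concrete, here is a minimal Python sketch of the IDF table and the IDF-weighted Jaccard and Containment Coefficients; the function names and toy documents are illustrative, not taken from our code base.

```python
# Minimal sketch: IDF table plus IDF-weighted Jaccard and Containment
# Coefficients over sets of n-gram tuples. Assumes the sets are non-empty.
from collections import Counter

def idf_table(documents):
    """IDF[word] = N / (1 + number of documents containing the word)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {w: n_docs / (1 + df) for w, df in doc_freq.items()}

def weighted_count(ngrams, idf):
    # Sum the IDF of every word in every n-gram of the set
    # (unseen words get a neutral weight of 1.0).
    return sum(idf.get(w, 1.0) for gram in ngrams for w in gram)

def weighted_jaccard(s1, s2, idf):
    return weighted_count(s1 & s2, idf) / weighted_count(s1 | s2, idf)

def containment_coefficient(s1, s2, idf):
    inter = weighted_count(s1 & s2, idf)
    # Average of containment(s1 in s2) and containment(s2 in s1).
    return (inter / weighted_count(s2, idf)) / 2 + (inter / weighted_count(s1, idf)) / 2

docs = [["a", "man", "is", "exercising"], ["a", "man", "is", "doing", "pull-ups"]]
idf = idf_table(docs)
s1 = {("man",), ("is",), ("exercising",)}   # unigram sets for the two sentences
s2 = {("man",), ("is",), ("doing",)}
print(weighted_jaccard(s1, s2, idf), containment_coefficient(s1, s2, idf))
```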

B. Features

As described in [4], we require various features to help capture the similarity between sentences. They are defined as follows.

1) Lexical and Syntactic Similarity: Lexical and syntactic similarity helps in getting a better interpretation and understanding of similarity measures. We use the following features to exploit the lexical and syntactic information in the text.

Lemma n-gram overlap: We used the nltk library to lemmatize tokens and calculated similarity scores for the lemma n-grams of both sentences using the Jaccard Coefficient (JC) for orders n = 1, 2, 3, 4. We also compute the Containment Coefficient for the same orders of n, which gives a value for how much of one sentence is contained in the other. We also weight the words by IDF scores collected during our training phase; TF scores do not give an accurate picture here because the sentences are relatively short. The similarity score thus gives higher weight to more important words and lower weight to stopwords.

POS n-gram overlap: The nltk library is used to obtain POS tags from the tokens, and the similarity scores for the POS-tag n-grams of both sentences are calculated using the Jaccard Coefficient (JC) and Containment Coefficient for orders n = 1, 2, 3, 4, with the same IDF weighting computed during the training phase as for the lemma n-gram overlap. These features exploit the syntactic similarity of the sentences.

Character n-gram overlap: We compute the character n-grams of the sentences and compute similarity scores using the same Jaccard Coefficient and Containment Coefficient. Here, the IDF weights are computed at the character n-gram level, and n-gram weighting is used for orders n = 2, 3, 4.
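As an illustration of these overlap features, the following sketch extracts (unweighted) lemma, POS, and character n-gram overlaps with nltk; it assumes the relevant nltk data packages (punkt, the perceptron tagger, wordnet) have been downloaded, and the helper names are our own.

```python
# Sketch of lemma, POS, and character n-gram overlap features via nltk.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

lemmatizer = WordNetLemmatizer()

def lemma_ngrams(sentence, n):
    tokens = nltk.word_tokenize(sentence.lower())
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return set(ngrams(lemmas, n))

def pos_ngrams(sentence, n):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return set(ngrams(tags, n))

def char_ngrams(sentence, n):
    # ngrams() over a string yields character tuples.
    return set(ngrams(sentence.lower(), n))

def overlap_features(sent1, sent2):
    feats = []
    for n in (1, 2, 3, 4):                     # lemma and POS n-grams
        for grams in (lemma_ngrams, pos_ngrams):
            a, b = grams(sent1, n), grams(sent2, n)
            feats.append(len(a & b) / len(a | b) if a | b else 0.0)  # plain JC
    for n in (2, 3, 4):                        # character n-grams
        a, b = char_ngrams(sent1, n), char_ngrams(sent2, n)
        feats.append(len(a & b) / len(a | b) if a | b else 0.0)
    return feats

print(overlap_features("A man is exercising", "A man is doing pull-ups"))
```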

Fig. 2: Sentence similarity computation.

2) Semantic Similarity: We explore techniques for finding the semantic similarity of the sentences in the following section.

Semantic Composition: As referenced in [4], we explore techniques based on the Distributional Hypothesis, which states that, in some sense, the meaning of words can be induced from their distribution in text. Frege's principle of compositionality states that the meaning of a complex expression is determined by the composition of its parts. So we define the semantic meaning of a sentence to be a linear combination of the semantic meanings of its parts (words). A Word2Vec model is used to obtain vectors for the words of each sentence; the vectors are then summed to form the semantic composition of the sentence.

C. Methods

Different methods are used in our approach to obtain similarity scores. Some of them are described below.

1) Semantic and Word Order Similarity: This proposed method uses two similarity measures, semantic similarity and word order similarity, to compute the similarity score. The method derives text similarity from the semantic and syntactic information contained in the compared texts. A text is considered to be a sequence of words, each of which carries useful information; the words, along with their combination structure, make a text convey a specific meaning. Texts considered here are assumed to be of sentence length.

Figure 2 shows the procedure for computing the sentence similarity between two candidate sentences.

Fig. 3: Hierarchical semantic knowledgebase.

Unlike existing methods that use a fixed set of vocabulary, the method used in [12] dynamically forms a joint word set using only the distinct words in the pair of sentences. For each sentence, a raw semantic vector is derived with the assistance of a lexical database; Figure 3 shows the lexical database used, the WordNet corpus with its predefined synsets. A word order vector is formed for each sentence, again using information from the lexical database. Since each word in a sentence contributes differently to the meaning of the whole sentence, the significance of a word is weighted using information content derived from a corpus. By combining the raw semantic vector with information content from the corpus, a semantic vector is obtained for each of the two sentences. Semantic similarity is computed from the two semantic vectors, and word order similarity from the two word order vectors. Finally, the sentence similarity is derived by combining semantic similarity and word order similarity.

The sentence similarity is

$$S(T_1, T_2) = \delta S_s + (1 - \delta) S_r$$

with semantic similarity

$$S_s = \frac{s_1 \cdot s_2}{\|s_1\| \, \|s_2\|}$$

where each entry of a semantic vector is $s_i = \tilde{s} \cdot I(w_i) \cdot I(\tilde{w}_i)$,

$$I(w) = \frac{-\log p(w)}{\log(N + 1)} = 1 - \frac{\log(n + 1)}{\log(N + 1)}$$

and $\tilde{s} = \max_{w_j \in T} \mathrm{path\_similarity}(w_i, w_j)$ over the joint word set $T$. The word order similarity is

$$S_r = 1 - \frac{\|r_1 - r_2\|}{\|r_1 + r_2\|}$$

Also, $I(w) \in [0, 1]$. Here $s_1$ and $s_2$ are the semantic vectors; $\tilde{s}$ is the lexical semantic value computed for each word using its maximum similarity (path similarity in WordNet) against the words $w_j$ of the joint word set of the two sentences; $S_s$ stands for semantic similarity and $S_r$ for word order similarity; $n$ denotes the number of occurrences of the word in the Brown corpus and $N$ the total count of words in the corpus; $p(w)$ is the probability of seeing the word; $I(w)$ is the information content of word $w$; and $r_1$ and $r_2$ are the word order vectors of the sentences, defined in the succeeding paragraphs. For gathering information content we use the Brown Corpus, which comprises 1,014,000 words and was compiled from standard texts in 1961.
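The following is a small sketch of how the information content $I(w)$ can be computed from the Brown corpus via nltk, under the assumption that the 'brown' corpus data is available locally.

```python
# Sketch of I(w) = 1 - log(n+1)/log(N+1) from the Brown corpus via nltk;
# assumes the corpus data has been fetched with nltk.download('brown').
import math
from collections import Counter
from nltk.corpus import brown

words = [w.lower() for w in brown.words()]
N = len(words)                      # total number of words in the corpus
freq = Counter(words)

def information_content(word):
    n = freq.get(word.lower(), 0)   # occurrences of the word in the corpus
    return 1.0 - math.log(n + 1) / math.log(N + 1)

# Frequent words carry little information, rare words carry a lot.
print(information_content("the"), information_content("exercising"))
```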

Let us consider a pair of sentences, T1 and T2, that contain exactly the same words in the same order, with the exception of two words from T1 that occur in reverse order in T2. For example:
• T1: A quick brown dog jumps over the lazy fox.
• T2: A quick brown fox jumps over the lazy dog.
Since these two sentences contain the same words, any method based on "bag of words" will decide that T1 and T2 are exactly the same, yet it is clear to a human interpreter that T1 and T2 are only similar to some extent. The dissimilarity between T1 and T2 is the result of the different word order, so a computational method for sentence similarity should take the impact of word order into account. For this pair of sentences, the joint word set is:
• T = A quick brown dog jumps over the lazy fox.
We assign a unique index number to each word in T1 and T2; the index number is simply the order in which the word appears in the sentence. For example, the index number is 4 for 'dog' and 6 for 'over' in T1. In computing the word order similarity, a word order vector r is formed for each of T1 and T2 based on the joint word set T. Taking T1 as an example, for each word $w_i$ in T we try to find the same or the most similar word $\tilde{w}_i$ in T1 as follows:
1. If the same word is present in T1, we fill the entry for this word in $r_1$ with the corresponding index number from T1. Otherwise, we try to find the most similar word $\tilde{w}_i$ in T1.
2. If the similarity between $w_i$ and $\tilde{w}_i$ is greater than a preset threshold, the entry of $w_i$ in $r_1$ is filled with the index number of $\tilde{w}_i$ in T1.
3. If both searches fail, the entry of $w_i$ in $r_1$ is 0.
The word order vectors for T1 and T2 are $r_1$ and $r_2$, respectively; for the example pair we have:
$r_1$ = {1 2 3 4 5 6 7 8 9}
$r_2$ = {1 2 3 9 5 6 7 8 4}
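A minimal sketch of this word order vector construction and the resulting similarity $S_r$ is given below; for brevity it only handles exact word matches, omitting the most-similar-word fallback and threshold from steps 2 and 3.

```python
# Sketch of the word order vectors r1, r2 and Sr = 1 - ||r1-r2|| / ||r1+r2||.
import numpy as np

def word_order_vector(joint_words, sentence_tokens):
    r = []
    for w in joint_words:
        # Entry = 1-based index of the word in the sentence, 0 if absent.
        r.append(sentence_tokens.index(w) + 1 if w in sentence_tokens else 0)
    return np.array(r, dtype=float)

t1 = "A quick brown dog jumps over the lazy fox".lower().split()
t2 = "A quick brown fox jumps over the lazy dog".lower().split()
joint = list(dict.fromkeys(t1 + t2))     # joint word set, insertion-ordered

r1 = word_order_vector(joint, t1)        # [1 2 3 4 5 6 7 8 9]
r2 = word_order_vector(joint, t2)        # [1 2 3 9 5 6 7 8 4]
s_r = 1 - np.linalg.norm(r1 - r2) / np.linalg.norm(r1 + r2)
print(r1, r2, s_r)
```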

Thus, a word order vector captures the basic structural information a sentence carries. To deal with word order, we then need to measure how similar the word order of the two sentences is.

D. Linear Regression

Linear regression is an approach for predicting a quantitative response using one or more features (also called "predictors" or "input variables"). It takes the following form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \dots + \beta_n x_n$$

where y is the response/output, $x_i$ is a feature, $\beta_0$ is the intercept, and $\beta_i$ is the coefficient for feature $x_i$. Linear models rely on the assumption that the features are independent; if that assumption is violated, the model becomes less reliable. We use the features described in Section V-B, i.e., the Jaccard Coefficient and Containment Coefficient similarity metrics for each of the n-gram overlaps (POS n-grams, character n-grams, and lemma n-grams), as input to the linear regression model. The concatenation of the Word2Vec vectors of the two sentences is also passed as input for each training and test sample. We then normalize each non-zero sample using the L2 norm.
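A sketch of this setup with scikit-learn is shown below; the feature matrix here is a random placeholder standing in for our actual similarity features and concatenated Word2Vec vectors.

```python
# Sketch of the linear regression setup: X holds per-pair features, y the
# 0-5 gold scores. Placeholder data is used purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.random((100, 20))        # placeholder feature matrix (pairs x features)
y = rng.random(100) * 5          # placeholder gold similarity scores

X = normalize(X, norm="l2")      # L2-normalize each non-zero sample
model = LinearRegression().fit(X, y)
print(model.predict(X[:3]))
```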

E. Support Vector Machine


We use Support Vector Machines (SVM) for regression, with the radial basis function (RBF) kernel. As with linear regression, we use the features described above as input to the SVM, and we concatenate the Word2Vec features of both sentences into each training sample. The model needs to find dependencies between features to produce better decisions; an SVM with an RBF kernel is non-linear and can capture such dependencies by operating in a higher-dimensional feature space, which this kernel helps achieve. As with linear regression, we normalize each non-zero sample using the L2 norm. Tuning our hyperparameters with 5-fold cross-validation, we found the best values of C (a parameter that trades off misclassification of training examples against simplicity of the decision boundary) and γ (a parameter that defines how far the influence of a single training example reaches) to be 10000 and 0.001, respectively.
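A sketch of this tuning procedure with scikit-learn's GridSearchCV is given below; the parameter grid and placeholder data are illustrative.

```python
# Sketch of SVR with an RBF kernel tuned by 5-fold cross-validated grid search.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((100, 20))        # placeholder features, as in the LR sketch
y = rng.random(100) * 5

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 100, 10000], "gamma": [0.1, 0.01, 0.001]},
    cv=5,
)
grid.fit(X, y)
# In our experiments the search selected C=10000 and gamma=0.001.
print(grid.best_params_)
```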


F. Multi Layer Perceptron

Multilayer Perceptron (MLP) regression is a feedforward neural network that uses back-propagation to compute a real-valued output. The features used by the MLP are the same as those used by the linear regression and SVM models. We utilized an MLP with two hidden layers of 100 neurons each, a ReLU activation function, and Adam, a stochastic gradient-based optimizer. We were unable to achieve very good results, as seen in Figure 4. As with linear regression and SVM, we normalize each non-zero sample using the L2 norm.

G. Word2Vec Approach

In this approach we use the word vectors published by Google as referenced in [13], which cover a vocabulary of 3 million words and phrases trained on roughly 100 billion words from a Google News dataset; each vector has 300 dimensions. Given a sentence, we apply a linear summation of the word vectors of each word in the sentence, yielding a semantic representation of that sentence. The cosine similarity of the two semantic vector representations is then computed for the given pair of sentences.
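A sketch of this Word2Vec approach is given below, assuming gensim and a local copy of the pretrained GoogleNews vectors; the file name is the one Google distributes, but the helper functions are our own illustration.

```python
# Sketch: sum the 300-d Word2Vec vectors of each sentence, then compare
# the sums with cosine similarity. Assumes the GoogleNews binary is local.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(sentence):
    # Linear summation of the vectors of the in-vocabulary words.
    vecs = [w2v[w] for w in sentence.split() if w in w2v]
    return np.sum(vecs, axis=0)

def similarity(sent1, sent2):
    v1, v2 = sentence_vector(sent1), sentence_vector(sent2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(similarity("The man is exercising", "A man is doing pull-ups"))
```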

n-gram Size   LR Error  LR PC    SVM Error  SVM PC   MLP Error  MLP PC
1             2.3       -0.01    1.5        0.08     1.6        0.06
2             1.7       0.05     1.5        0.08     1.9        -0.009
3             1.5       0.15     1.5        0.08     1.62       -0.001
4             1.5       0.14     1.5        0.08     1.7        -0.07
1,2,3,4       1.7       0.07     1.4        0.19     1.5        0.05

TABLE I: Custom error rate score (Error) and Pearson's coefficient (PC) for Linear Regression (LR), Support Vector Machine (SVM), and Multi Layer Perceptron (MLP) by n-gram size.

VI. EXPERIMENT

A. Baseline Algorithm

The baseline algorithm we used is the cosine similarity metric. Given a sentence, we decompose it into a bag-of-words vector representation. After decomposing the two input sentences into vectors, we compute their dot product to get a similarity measure:

$$\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A_i$ and $B_i$ are the components of vectors A and B.

B. Preprocessing

We did not remove stopwords like 'a' and 'the' because they also carry information that can be used. For example, when comparing "A man is sitting on the bench" to "One man is sitting on the bench", words like "A" become important in determining the semantic similarity score.

C. Feature Selection

Feature selection is an important part of any machine learning algorithm. We experimented with different n-gram sizes and report the results in Table I. Our results indicate that no specific n-gram size resulted in a substantial improvement in accuracy. We included n-grams of POS tags, lemma tags, and characters as features; the increase in dimensionality resulted in no real improvement in the performance of the system.
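Returning to the baseline of Section VI-A, a minimal sketch with scikit-learn follows; note that we override CountVectorizer's default token pattern so that one-letter stopwords like "a" are kept, in line with the preprocessing choice above.

```python
# Sketch of the baseline: bag-of-words vectors compared with cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def baseline_score(sent1, sent2):
    # Default token_pattern drops single-character tokens; keep them instead.
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    bags = vectorizer.fit_transform([sent1, sent2])   # 2 x vocab count matrix
    return cosine_similarity(bags[0], bags[1])[0, 0]  # in [0, 1]

print(baseline_score("A man is sitting on the bench",
                     "One man is sitting on the bench"))
```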

D. Evaluation Metric

Custom Error Rate Score: The error for each pair of sentences is the absolute difference between the gold-tag score and the predicted score. The average error over the data is given by

$$Error = \frac{\sum_{i=1}^{N} |predicted_i - gold_i|}{N}$$

where N denotes the number of sentence pairs.

Pearson's Correlation Score:

$$r = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left(N \sum x^2 - \left(\sum x\right)^2\right)\left(N \sum y^2 - \left(\sum y\right)^2\right)}}$$

where N denotes the number of pairs of scores, $\sum xy$ the sum of the products of the paired scores, $\sum x$ and $\sum y$ the sums of the x and y scores, and $\sum x^2$ and $\sum y^2$ the sums of the squared x and y scores. Here x represents the gold standard scores and y represents our predicted scores.
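Both metrics are straightforward to compute; a sketch with numpy and scipy (whose pearsonr implements the formula above) on made-up toy scores:

```python
# Sketch of the custom (mean absolute) error and Pearson's correlation.
import numpy as np
from scipy.stats import pearsonr

gold = np.array([3.2, 0.0, 5.0, 2.5])   # x: illustrative gold standard scores
pred = np.array([2.9, 0.8, 4.6, 2.2])   # y: illustrative predicted scores

error = np.mean(np.abs(pred - gold))    # custom error rate
r, _ = pearsonr(gold, pred)             # Pearson's correlation coefficient
print(error, r)
```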

Fig. 4: Custom Error Rates for Test Data Set

E. Quantitative Analysis

We also explored whether IDF weighting helps our similarity scores for the above features (POS n-grams, lemma n-grams, and character n-grams); the analysis is given in Table II and Table III. We note that the Jaccard Coefficient and Containment Coefficient using character tri-grams alone achieved a lower custom error rate (Table II, Table III, and Figure 4) than the other models we approached in this project. The custom error rate and Pearson's correlation coefficient of our methods on the test data set are shown in Figures 4 and 5, respectively. The Word2Vec approach and the semantic and word order similarity model obtained better scores based on Pearson's coefficient (see Figure 5) than the baseline algorithm, and performed quite well.

F. Qualitative Analysis

The predicted scores for a few sentence pairs are evaluated, and our comments and scores are presented in Tables IV, V, and VI.

Fig. 5: Pearson's Correlation Coefficient for the Test Data Set

1) Shortcomings: Judging the similarity of sentences solely from semantics can lead to subtle errors, as described below.

Phrase error: The sentence pair "He is a Bachelor" / "He is an unmarried man" gets a similarity score of 2.3/5 from the semantic and word order model, 2.6 from the Word2Vec approach, 3.4 from Linear Regression (LR), 2.9 from the Support Vector Machine (SVM), and 2 from the Multilayer Perceptron (MLP). These sentences are very similar in meaning but get relatively low scores. For the semantic and word order model, this is because the phrase "unmarried man" and the word "Bachelor" are not matched in the synsets, since only one word is matched at a time. The other supervised methods also suffer in such cases.

Word sense disambiguation: The sentence pair "It's an Orange" / "Its Orange" gets a similarity score of 3.91 from the semantic and word order model, 4.5 from the Word2Vec approach, 3 from Linear Regression (LR) and the Support Vector Machine (SVM), and 1 from the Multilayer Perceptron (MLP). In the first sentence, "Orange" is a fruit; in the second, it denotes a color. Word sense is not taken into account in the semantic and word order model or the Word2Vec model, and hence they give a high score to this pair of sentences. The MLP, however, was able to distinguish them.

Jaccard Coefficient, IDF-weighted counts:

Type              N=1 (train/test)  N=2 (train/test)  N=3 (train/test)  N=4 (train/test)
lemma             1.4 / 1.1         2.1 / 1.6         2.4 / 1.9         2.6 / 2.0
parts of speech   1.5 / 1.2         2.1 / 1.7         2.4 / 1.9         2.6 / 2.1
character         1.2 / 1.3         1.1 / 1.7         1.2 / 0.9         1.3 / 0.9

Jaccard Coefficient, unweighted counts:

Type              N=1 (train/test)  N=2 (train/test)  N=3 (train/test)  N=4 (train/test)
lemma             1.4 / 1.1         2.0 / 1.6         2.4 / 1.9         2.6 / 2.0
parts of speech   1.5 / 1.2         2.1 / 1.6         2.4 / 1.9         2.6 / 2.1
character         1.1 / 1.6         1.0 / 0.9         1.2 / 0.9         1.3 / 1.0

TABLE II: Evaluation results: custom error rates in the range 0-5 for the Jaccard Coefficient with and without IDF-weighted counts.

Containment Coefficient, IDF-weighted counts:

Type              N=1 (train/test)  N=2 (train/test)  N=3 (train/test)  N=4 (train/test)
lemma             1.0 / 1.0         1.7 / 1.3         2.1 / 1.7         2.4 / 1.9
parts of speech   1.1 / 1.0         1.7 / 1.4         2.1 / 1.7         2.4 / 2.0
character         1.1 / 1.5         0.9 / 0.9         0.9 / 0.9         1.0 / 0.9

Containment Coefficient, unweighted counts:

Type              N=1 (train/test)  N=2 (train/test)  N=3 (train/test)  N=4 (train/test)
lemma             1.1 / 1.1         1.7 / 1.3         2.1 / 1.7         2.3 / 1.9
parts of speech   1.1 / 1.1         1.7 / 1.4         2.1 / 1.7         2.4 / 1.9
character         1.3 / 1.9         0.9 / 1.1         0.9 / 0.9         0.9 / 0.9

TABLE III: Evaluation results: custom error rates in the range 0-5 for the Containment Coefficient with and without IDF-weighted counts.

Method                               Score   Comments
Semantic and word order similarity   5       It was able to understand the semantics of this sentence pair.
Linear Regression (LR)               1.8     It was unable to find a dependence between features.
Support Vector Machine (SVM)         2.9     It achieved a better output than LR and MLP, possibly because the RBF kernel lets it operate in a higher-dimensional feature space.
Multi Layer Perceptron (MLP)         1.9     The MLP learner was unable to generalize given the data it was trained on.
Word2Vec                             5       As expected, the linear combinations of the word vectors of both sentences are the same, and hence it gives the correct output.

TABLE IV: Comparison for the sentence pair "The group is good" / "The group is good".

Method                               Score   Comments
Semantic and word order similarity   2.72    The sentence pair may seem similar to a human reader, but the model struggles to find a similarity between the word "very" and any word in the first sentence. Also, the path between "simple" and "easy" in the WordNet synsets is not very short. These factors contribute to the predicted score.
Linear Regression (LR)               5       This model was able to deduce a function that fits this particular example given the training data it was exposed to, which is surprising, since linear regression is only a linear weighting of the features and cannot capture dependence between two features.
Support Vector Machine (SVM)         2.9     This model found it difficult to predict an appropriate score, possibly because it was unable to generalize from the training data it had seen.
Multi Layer Perceptron (MLP)         2.7     The complexity of the features it learned from the training data could have been high, and hence it was unable to give an accurate measure.
Word2Vec                             4.3     The linear combinations of the distributed word representations of the two sentences are nearly identical, which may have helped this model obtain a good score where the other models apart from LR failed.

TABLE V: Comparison for the sentence pair "The problem is simple" / "The problem is very easy".

Method                               Score   Comments
Semantic and word order similarity   0.69    The sentence pair is totally unrelated, and the score reflects this. The small similarity score is due to the common word "is" and some residual similarity between other words.
Linear Regression (LR)               2.2     This model struggled to recognize the dissimilarity.
Support Vector Machine (SVM)         2.9     This model was unable to recognize the dissimilarity between the two sentences.
Multi Layer Perceptron (MLP)         2.2     It was unable to generalize from the limited training data it had seen; the limited data leads to cases where the MLP cannot score appropriately.
Word2Vec                             2.3     It was also unable to recognize the dissimilarity, while the semantic and word order similarity model found the correct score.

TABLE VI: Comparison for the sentence pair "Today is a Friday" / "That person is not related to me."

G. Additional Methods Investigated

We also used ParaphraseDB, the Stanford NER, and the Stanford dependency parser for alignment strategies as cited in [17]. First, we tried obtaining word alignments with a grid-search method that finds contiguous sequences of similar words, identified using ParaphraseDB (XXXL version). We also align named entities and location variables with the values output by the Stanford NER parser. The paper [17] additionally aligns equivalent dependencies, since specific dependency structures aid alignment; this is beyond the scope of this project but is described in Section III-C.

This is something we hope to extend in the future, apart from finding better alignment strategies.

VII. CONCLUSION

In our project, we explored various approaches for finding the semantic similarity between two sentences. Some of our approaches were based on corpus statistics and lexical knowledge bases like WordNet; others utilized word embeddings. We also investigated simple supervised regression models: linear regression, support vector machines, and the Multilayer Perceptron (MLP). On the whole, we found that character tri-grams work best for similarity measures like the Jaccard Coefficient and Containment Coefficient, and they are relatively less complex than some of the other models we implemented. We also find that the semantic and word order similarity method and the Word2Vec approach beat the baseline method in terms of Pearson's correlation coefficient. Among the supervised methods, Support Vector Machines work reasonably well, while the linear regression model was unable to handle certain inputs. The Multilayer Perceptron required more training data, and data that would help it generalize better; we felt that the amount of data available for the supervised learning approaches was inadequate. We hope to build on this project with new word alignment strategies and other neural models like the Tree LSTM. We also continue to look for other means of identifying similar phrases and words, and plan to look into sentence embedding models like Paragram-Phrase.

The source code for this project is available at https://github.com/danielsamfdo/semantic_similarity.

REFERENCES

[1] Tatiana Almeida Souza Coelho, Pável Pereira Calado, Lamarque Vieira Souza, Berthier Ribeiro-Neto, and Richard Muntz. Image retrieval using multiple evidence ranking. IEEE Trans. on Knowl. and Data Eng., 16(4):408-417, April 2004.
[2] J. Atkinson-Abutridy, C. Mellish, and S. Aitken. Combining information extraction with genetic algorithms for text mining. IEEE Intelligent Systems, 19(3):22-30, May 2004.
[3] Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch. UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 435-440, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[4] Tomás Brychcín and Lukás Svoboda. UWB at SemEval-2016 Task 1: Semantic textual similarity using lexical, syntactic, and semantic information. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, pages 588-594, 2016.
[5] Günes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457-479, December 2004.
[6] Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606-1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[7] Hua He, John Wieting, Kevin Gimpel, Jinfeng Rao, and Jimmy J. Lin. UMD-TTIC-UW at SemEval-2016 Task 1: Attention-based multi-perspective convolutional neural networks for textual similarity measurement. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, pages 1103-1108, 2016.
[8] Sergio Jiménez, George Dueñas, Julia Baquero, and Alexander F. Gelbukh. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In SemEval@COLING, 2014.
[9] Sergio Jimenez, Fabio Gonzalez, and Alexander Gelbukh. Text comparison using soft cardinality. In Proceedings of the 17th International Conference on String Processing and Information Retrieval, SPIRE'10, pages 297-302, Berlin, Heidelberg, 2010. Springer-Verlag.
[10] Youngjoong Ko, Jinwoo Park, and Jungyun Seo. Improving text categorization using the importance of sentences. Inf. Process. Manage., 40(1):65-79, January 2004.
[11] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. CoRR, abs/1405.4053, 2014.
[12] Yuhua Li, David McLean, Zuhair Bandar, James O'Shea, and Keeley A. Crockett. Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng., 18(8):1138-1150, 2006.
[13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, United States, pages 3111-3119, 2013.
[15] Eui-Kyu Park, Dong-Yul Ra, and Myung-Gil Jang. Techniques for improving web retrieval effectiveness. Inf. Process. Manage., 41(5):1207-1223, September 2005.
[16] Peter Potash, William Boag, Alexey Romanov, Vasili Ramanishka, and Anna Rumshisky. SimiHawk at SemEval-2016 Task 1: A deep ensemble system for semantic textual similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, pages 741-748, 2016.
[17] Md. Arafat Sultan, Steven Bethard, and Tamara Sumner. Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence. TACL, 2:219-230, 2014.
[18] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015.
[19] Tomas Mikolov, Scott Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT 2013. Association for Computational Linguistics, May 2013.
[20] Wei Xu, Chris Callison-Burch, and Bill Dolan. SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1-11, Denver, Colorado, June 2015. Association for Computational Linguistics.