An Unsupervised Sentiment Classifier on Summarized or Full Reviews

Maria Soledad Pera, Rani Qumsiyeh, and Yiu-Kai Ng
Computer Science Department, Brigham Young University, Provo, Utah, U.S.A.

Abstract. These days web users searching for opinions expressed by others on a particular product or service PS can turn to review repositories, such as Epinions.com or Imdb.com. While these repositories often provide a large number of reviews on PS, browsing through archived reviews to locate different opinions expressed on PS is a time-consuming, tedious, and in most cases very labor-intensive task. To simplify the task of identifying reviews expressing positive, negative, and neutral opinions on PS, we introduce a simple, yet effective sentiment classifier, denoted SentiClass, which categorizes reviews on PS using the semantic, syntactic, and sentiment content of the reviews. To speed up the classification process, SentiClass summarizes each review to be classified using eSummar, a single-document, extractive, sentiment summarizer proposed in this paper, based on various sentence scores and anaphora resolution. SentiClass (eSummar, respectively) is domain and structure independent and does not require any training for performing the classification (summarization, respectively) task. Empirical studies conducted on two widely-used datasets, Movie Reviews and Game Reviews, in addition to a collection of Epinions.com reviews, show that SentiClass (i) is highly accurate in classifying summarized or full reviews and (ii) outperforms well-known classifiers in categorizing reviews.

1 Introduction

The rapid growth of social search websites, such as Imdb.com and Epinions.com, which allow users to express their opinions on products and services, yields large review repositories. As a side effect of this growth, finding diverse sentiment information on a particular product or service PS in these review repositories is a real challenge for web users, as well as for web search engine designers. To facilitate the task of identifying reviews on PS that share the same polarity, i.e., positive, negative, or neutral, we introduce a simple, yet effective sentiment classifier, denoted SentiClass. Given a review R on PS, which can be extracted from existing review repositories using a simple keyword-based query, SentiClass relies on the SentiWordNet (sentiwordnet.isti.cnr.it) scores of each non-stopword in R, which are numerical values that quantify the positive, negative, and neutral connotation of a word, and considers the presence of intensifiers (e.g., "very", "extremely", "least"), connectors (e.g., "although", "but", "however"), reported speech (sentences that report what someone else has said), and negation terms (e.g., "not", "except", "without") in R to precisely capture the sentiment on PS expressed in R and categorize R according to its polarity. (Stopwords are commonly occurring words, such as articles, prepositions, and conjunctions, which carry little meaning. From now on, unless stated otherwise, whenever we mention (key)word(s), we mean non-stopword(s).)

To reduce the overall classification time of SentiClass and shorten the length of the classified reviews a user is expected to examine, we summarize the reviews to be classified using eSummar, a single-document, extractive, sentiment summarizer introduced in this paper. eSummar pre-processes a review R using anaphora resolution, which identifies successive references to the same discourse entity in R to eliminate ambiguity when interpreting the content of R. eSummar then computes for each sentence S in R (i) the similarity of the words in S and in the remaining sentences of R, which determines how representative S is in capturing the content of R, (ii) the word significance factor, which quantifies the significance of each word in S in representing the content of R, and (iii) the sentiment score, which reflects the degree of sentiment on PS expressed in S and is calculated using the linguistic type (such as adjective, adverb, or noun) and the SentiWordNet score of each word in S. A number of sentences in R with high combined scores yield the summary of R.

Our proposed sentiment classifier (summarizer, respectively) is unique, since SentiClass (eSummar, respectively) does not require any training for performing the categorization (summarization, respectively) task, which simplifies and shortens the classification (summarization, respectively) process. In addition, SentiClass (eSummar, respectively) is domain independent and thus can classify (summarize, respectively) reviews with diverse structures and contents.

We proceed to present our work as follows. In Section 2, we discuss existing (sentiment) classification and summarization approaches. In Section 3, we introduce SentiClass and eSummar. In Section 4, we present the performance evaluation of SentiClass based on widely-used benchmark datasets and metrics, in addition to verifying the merit of using eSummar, as compared with other (sentiment) summarizers, for summarizing reviews to be classified by SentiClass. In Section 5, we give concluding remarks.

2 Related Work

Pang et al. [18] evaluate three different machine learning approaches for (sentiment) classification: Naive Bayes, Support Vector Machines (SVM), and Maximum Entropy. Kennedy and Inkpen [6] compare two sentiment classification approaches: the first identifies positive and negative terms in a document and labels the document positive (negative, respectively) if it contains more positive (negative, respectively) than negative (positive, respectively) terms, whereas the second trains an SVM (using single words as features) to determine the sentiment expressed in a document. As opposed to [6], the sentiment classifier in [4] trains an SVM using only a subset of the words in a document, which indicate a positive or negative intent and are determined using a maximum entropy model. Zhao et al. [22] categorize movie reviews using a probabilistic approach based on Conditional Random Fields that captures the context of a sentence to infer its sentiment. Unlike SentiClass, all of these classification methods rely on training data, which (i) disallows classification on the fly and (ii) ignores the semantics of a sentence, such as negation terms, reported speech, and words in different contexts, which affect the polarity of words and thus their sentiment in a review.

Given an opinion-rich document D, Ku et al. [9] first employ a manually-created set of seed words with pre-determined orientation to generate a list L of positive and negative words using synsets from WordNet, a widely-used English lexical database. Thereafter, the authors label each sentence S in D as positive (negative, respectively) if the majority of words in S are positive (negative, respectively) according to the sentiment information of the words in L. D is classified as positive (negative, respectively) if the majority of sentences in D are labeled as positive (negative, respectively). While SentiClass uses SentiWordNet to determine the polarity of a word, the approach in [9] relies on the synsets extracted from WordNet, which are purely based on the original choice of seed words and thus are restricted in word usage.

The polarity of a review is determined in [21] by identifying fixed sequences of words (stems) which, when they appear together, tend to have a polarity (i.e., a negative or positive orientation). Unlike SentiClass, this method excludes word types, such as connectors (i.e., "but", "although", "however", etc.), which could eliminate or reverse the polarity of a sentiment in a review.

For single-document summarization, Radev et al. [20] claim that the most promising approach is extraction, which identifies and retains sentences that capture the content of a text as its summary. Techniques commonly used for generating extractive, single-document summaries include the Hidden Markov Model (HMM), Latent Semantic Analysis (LSA), Word Significance, and Support Vector Machines (SVM). Unlike eSummar, SVM is a semi-supervised method that relies on training data for generating a summary, whereas HMM, LSA, and Word Significance are unsupervised methods that (i) fail to capture the sentiments expressed in a document, because they are not sentiment summarizers, and (ii) do not consider the relative degree of significance of a sentence in capturing the content of a document D when selecting sentences for the summary of D.

Given a review R on a product P, Hu and Wu [3] adopt a scoring algorithm that considers the positive or negative orientation of each word in a sentence S (in R) to determine the sentiment of S. Thereafter, their summarizer extracts key phrases in each sentence that represent positive and negative opinions on P and groups the phrases according to their polarity to generate the summary of R. While the summarizer in [3] simply lists positive and negative phrases describing P, eSummar creates a complete summary of R that reflects the sentiment on P expressed by the author of R. Zhuang et al. [23] rely on WordNet, statistical analysis, and movie knowledge to generate extractive, feature-based summaries of movie reviews. Unlike eSummar, the summarizer in [23] is domain dependent, since it employs a pre-defined set of movie features, such as screenplay, and thus cannot be generalized. (See in-depth discussions on existing sentiment summarizers and classifiers in [17].)


Fig. 1. Processing steps of the proposed classifier, SentiClass

Fig. 2. The SentiWordNet scores of the words "Excellent", "Movie", and "Poor", where P, O, and N denote Positive, Objective (i.e., neutral), and Negative, respectively

3 Sentiment-based Classification

In this section, we present our proposed classifier for categorizing reviews based on their polarity, i.e., positive, negative, or neutral. The overall process of the classifier is illustrated in Figure 1.

3.1 Sentiment Classification

In classifying a review R, SentiClass, the proposed sentiment classifier, first determines the polarity of each word w in R such that w is positive (negative, respectively) if its positive (negative, respectively) SentiWordNet score is higher than its negative (positive, respectively) counterpart. (The SentiWordNet scores of three sample words are shown in Figure 2.) Thereafter, SentiClass calculates the overall sentiment score of R, denoted SentiScore(R), by subtracting the sum of its negative words' scores from the sum of its positive words' scores, which reflects the sentiment orientation, i.e., positive, negative, or neutral, of R. Since the longer R is, the more sentiment words it contains and thus the higher its sentiment score, we normalize the score by dividing it by the number of sentiment words in R, which yields SentiScore(R). If the normalized SentiScore(R) is higher (lower, respectively) than a pre-defined range (as given in Section 3.2), then R is labeled as positive (negative, respectively). Otherwise, R is neutral.

We support the claim made in [19] that the sentiment expressed in a review R cannot be properly captured by analyzing solely individual words in R. Thus, prior to computing the SentiScore of R, we consider intensifiers, connectors, negation terms, and reported speech in R, if there are any, all of which affect the polarity of the keywords in R. Intensifiers, which can be extracted from wjh.harvard.edu/~inquirer/homecat.htm, are words, such as "extremely" and "barely", that either weaken or strengthen the polarity of their adjacent words in R. SentiClass doubles (halves, respectively) the SentiWordNet score of a word w in R if an intensifier strengthens (weakens, respectively) the polarity of w. In addition, we consider connectors, such as "however", "but", and "although", which can mitigate the polarity of words in a sentence S and can be compiled using WordNet. SentiClass assigns zero as the SentiWordNet score of any word following a connector in S. In another initial step, SentiClass identifies words in R that are preceded by negation terms, such as "not" and "never", which are listed in [15], and labels them as "Negated". During the classification process, SentiClass first inverts the polarity of a word w (in R) affected by a preceding negation term in the same sentence, so that if w has a positive (negative, respectively) polarity (based on the SentiWordNet score of w) and is labeled as "Negated" in the initial step, SentiClass treats w as a negative (positive, respectively) word in R and assigns to w the corresponding negative (positive, respectively) SentiWordNet score.

Polanyi et al. [19] further suggest that reported speech sentences can have a detrimental effect on adequately determining the polarity of a text, and thus it is a common computational linguistic treatment to ignore them. Reported speech, also known as indirect speech, refers to sentences reporting what someone else has said. Consider the sentence S, "My friend said that he did not enjoy the movie", in a review R. Since S does not reflect the opinion expressed by the author of R, it should not be considered when determining the polarity of R. SentiClass removes reported speech sentences from R (using the algorithm in [8]) prior to detecting intensifiers, connectors, and negation terms in R.
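To make these word-level adjustments concrete, the following minimal sketch (in Python) mirrors the steps above under stated assumptions: the SentiWordNet scores, intensifier behaviors, and word lists are small illustrative stand-ins for the real resources, and zeroing every word after a connector is one reading of the connector rule.

```python
# Illustrative stand-ins for the resources named above (not the real lists).
INTENSIFIERS = {"very": "strengthen", "extremely": "strengthen", "barely": "weaken"}
CONNECTORS = {"but", "although", "however"}
NEGATIONS = {"not", "never", "except", "without"}
# (positive, negative) SentiWordNet-style scores; values are made up here.
SENTI_SCORES = {"great": (0.75, 0.0), "poor": (0.0, 0.625)}

def senti_score(sentences):
    """Normalized SentiScore of a review given as tokenized sentences."""
    total, num_sentiment_words = 0.0, 0
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word in CONNECTORS:
                break  # assumption: all words after a connector get score zero
            pos, neg = SENTI_SCORES.get(word, (0.0, 0.0))
            if i > 0 and sentence[i - 1] in NEGATIONS:
                pos, neg = neg, pos  # negation inverts the polarity
            if i > 0 and sentence[i - 1] in INTENSIFIERS:
                factor = 2.0 if INTENSIFIERS[sentence[i - 1]] == "strengthen" else 0.5
                pos, neg = pos * factor, neg * factor  # double or halve the score
            if pos != neg:  # treat words with unequal scores as sentiment words
                total += pos - neg
                num_sentiment_words += 1
    return total / num_sentiment_words if num_sentiment_words else 0.0
```

For example, `senti_score([["the", "movie", "was", "very", "great"]])` returns 1.5, since the intensifier "very" doubles the positive score of "great".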

3.2 Classification Ranges

In establishing the pre-defined sentiment range for determining whether a review R should be treated as positive, negative, or neutral, we conducted an empirical study using a set of 200 neutral reviews extracted from Gamespot.com, along with a set of 800 (randomly selected) reviews on Books, Cars, Computers, and Hotels, which were extracted from Epinions.com such that each review comes with a three-star (out of five) rating and is treated as neutral. (None of these reviews was used in Section 4 for assessing the performance of SentiClass.) Using the 1,000 reviews, we first computed the SentiScore (see Section 3.1) of each review. Thereafter, we calculated the mean and standard deviation of the SentiScores of the reviews, which are 0.02695 and 0.01445, respectively, yielding the range [0.0125, 0.0414]. If the SentiScore of R falls into this range, then SentiClass classifies R as neutral. Since, as previously stated, SentiScore is a normalized value that is not affected by the length of a review, the range established for classifying R can also be used by SentiClass for classifying the summarized version of R.
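A small sketch of how the neutral range follows from the numbers above: the band is the mean of the neutral reviews' SentiScores plus or minus one standard deviation (the scores listed in the code are placeholders, not the actual 1,000 tuning values).

```python
from statistics import mean, stdev

# Placeholder SentiScores standing in for the 1,000 neutral tuning reviews;
# the paper reports mean 0.02695 and standard deviation 0.01445.
neutral_scores = [0.021, 0.035, 0.0125, 0.041, 0.028]

mu, sigma = mean(neutral_scores), stdev(neutral_scores)
low, high = mu - sigma, mu + sigma  # the paper's range: [0.0125, 0.0414]

def classify(senti_score):
    """Label a review by comparing its normalized SentiScore to the range."""
    if senti_score > high:
        return "positive"
    if senti_score < low:
        return "negative"
    return "neutral"
```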


3.3 SentiClass Using Sentiment Summarization

To further enhance the efficiency of SentiClass, i.e., to minimize its overall processing time, we introduce our single-document sentiment summarizer, called eSummar. Instead of classifying an entire review R, SentiClass can apply eSummar to R to reduce the length of R for categorization while preserving the main content and polarity of R, which as a side effect reduces the overall classification time of SentiClass. In designing eSummar, we rely on sentence similarity, significance factor, and sentiment scores, in addition to applying anaphora resolution on R, to (i) first determine the expressiveness of each sentence in R in capturing the content of R and (ii) then choose sentences in R to be included in the summary of R.

Sentence Similarity. The sentence-similarity score of sentence Si in a review R, denoted SimR(Si), indicates the relative degree of significance of Si in reflecting the overall content of R. SimR(Si) is computed using the word-correlation factors [7] of words in Si and in each remaining sentence Sj in R to determine the degree of resemblance of Si and Sj. (Word-correlation factors indicate the degree of similarity of any two words, calculated using the frequency of word co-occurrence and word distances in a Wikipedia dump of 880,000 documents on various subjects written by more than 89,000 authors.) The higher SimR(Si) is with respect to the remaining sentences in R, the more promising Si is in capturing the content of R to a certain degree. To compute SimR(Si), we adopt the odds ratio p/(1-p) [5], where p denotes the strength of an association between a sentence Si and the remaining sentences in R, and 1-p reflects its complement.

$$Sim_R(S_i) = \frac{\sum_{j=1, j \neq i}^{|S|} \sum_{k=1}^{n} \sum_{l=1}^{m} wcf(w_k, w_l)}{1 - \sum_{j=1, j \neq i}^{|S|} \sum_{k=1}^{n} \sum_{l=1}^{m} wcf(w_k, w_l)} \quad (1)$$

where |S| is the number of sentences in R, n (m, respectively) is the number of words in Si (Sj, respectively), wk (wl, respectively) is a word in Si (Sj, respectively), and wcf(wk, wl) is the word-correlation factor of wk and wl.

Sentence Significance Factor. As a summary of a review R reflects the content of R, it should contain sentences that include significant words in R, i.e., words that capture the main content of R. We compute the significance factor [12] of each sentence S in R, denoted SFR(S), based on the number of significant words in S:

$$SF_R(S) = \frac{|significant\ words|^2}{|S|} \quad (2)$$

where |S| is the number of words in S and |significant words| is the number of significant words in S, determined using pre-defined high- and low-frequency cutoff values as follows. A word w in R is significant in R if

$$f_{R,w} \geq \begin{cases} 7 - 0.1 \times (25 - Z) & \text{if } Z < 25 \\ 7 & \text{if } 25 \leq Z \leq 40 \\ 7 + 0.1 \times (Z - 40) & \text{otherwise} \end{cases} \quad (3)$$

where f_{R,w} is the frequency of w in R, Z is the number of sentences in R, and 25 (40, respectively) is the low- (high-, respectively) frequency cutoff value.
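The following sketch is a non-authoritative reading of Eqs. (1)-(3); the word-correlation function wcf and the tokenized sentences are assumed inputs, and the caller passes the other sentences of the review with Si already excluded.

```python
from collections import Counter

def sentence_similarity(si, others, wcf):
    """Odds-ratio similarity of sentence si to the other sentences (Eq. 1).
    wcf(w1, w2) is the word-correlation factor of two words (assumed given)."""
    p = sum(wcf(wk, wl) for sj in others for wk in si for wl in sj)
    return p / (1 - p)  # Eq. (1) as stated; assumes the summed factor stays below 1

def frequency_cutoff(num_sentences):
    """Minimum in-review frequency for a word to be significant (Eq. 3)."""
    if num_sentences < 25:
        return 7 - 0.1 * (25 - num_sentences)
    if num_sentences <= 40:
        return 7
    return 7 + 0.1 * (num_sentences - 40)

def significance_factor(sentence, review_sentences):
    """Significance factor of a sentence (Eq. 2): |significant words|^2 / |S|."""
    freq = Counter(w for s in review_sentences for w in s)
    cutoff = frequency_cutoff(len(review_sentences))
    significant = [w for w in sentence if freq[w] >= cutoff]
    return len(significant) ** 2 / len(sentence)
```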


Sentence Sentiment. Since the eSummar-generated summary of R captures the sentiment expressed in R, we measure to what extent a sentence S in R reflects its sentiment, i.e., polarity, during the summarization process. In defining the sentiment score of S, denoted SentiR(S), we determine the polarity of each word w in S by multiplying its (sentiment) weight (i.e., the SentiWordNet score of w) by its linguistic score, denoted LiScore, as defined in [3], which reflects the strength of the sentiment expressed by w based on its type, i.e., adverb, adjective, verb, or conjunction, the types commonly used to express sentiment.

$$Senti_R(S) = \sum_{i=1}^{n} StScore(w_i) \times LiScore(w_i) \quad (4)$$

where n is the number of words in S, wi is a word in S, StScore(wi) is the highest SentiWordNet score of wi (among the three possible SentiWordNet scores for wi), and LiScore(wi) is the linguistic score of wi, which is the multiplicative factor assigned to the linguistic type of wi. As defined in [3], the linguistic scores of adjectives and adverbs, verbs, and conjunctions are 8, 4, and 2, respectively.

Anaphora Resolution. Prior to computing the sentence scores previously introduced, we adopt anaphora resolution to eliminate the ambiguity that arises in interpreting the content of R by assigning a discourse entity to proper nouns, acronyms, or pronouns in R [10]. Consider the sentence S, "Shrek was great and its animation was very well-done." By applying anaphora resolution to S, "its" is transformed into the referenced entity, i.e., "Shrek". We identify anaphoric chains, which are entities and their various discourse referents in R, using the anaphora resolution system Guitar (cswww.essex.ac.uk/Research/nle/GuiTAR/) and replace all discourse referents by their corresponding entities in R.

Ranking Sentences and Summary Size. eSummar computes a ranking score for each sentence S in R, denoted RankR(S), which reflects the relative degree of content and sentiment in R captured by S, using SimR(S), SFR(S), and SentiR(S). Thereafter, the sentences in R with the highest Rank scores are included in the summary of R. Rank applies the Stanford Certainty Factor [11] to SimR(S), SFR(S), and SentiR(S), which is a measure that integrates different assessments, i.e., sentence scores in our case, to determine the strength of a hypothesis, i.e., the content in R captured by S in our case. (Since SimR(S), SFR(S), and SentiR(S) are on different numerical scales, prior to computing Rank we normalize the range of the scores using a logarithmic scale.)

$$Rank_R(S) = \frac{Sim_R(S) + SF_R(S) + Senti_R(S)}{1 - Min\{Sim_R(S), SF_R(S), Senti_R(S)\}} \quad (5)$$
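A compact sketch of Eqs. (4) and (5) under stated assumptions: the part-of-speech tags and the per-word maximum SentiWordNet scores are assumed inputs, and the log1p squashing is one plausible reading of the unspecified logarithmic normalization.

```python
from math import log1p

# Linguistic multipliers from [3]: adjectives/adverbs 8, verbs 4, conjunctions 2.
LISCORE = {"ADJ": 8, "ADV": 8, "VERB": 4, "CONJ": 2}

def sentence_sentiment(tagged_words, st_score):
    """Sentiment score of a sentence (Eq. 4); tagged_words pairs each word with
    a part-of-speech tag, and st_score(w) returns the word's highest
    SentiWordNet score (both assumed inputs)."""
    return sum(st_score(w) * LISCORE.get(tag, 0) for w, tag in tagged_words)

def rank(sim, sf, senti):
    """Stanford-Certainty-Factor combination of the three scores (Eq. 5).
    The log1p squashing stands in for the paper's logarithmic normalization
    (exact scaling unspecified) and assumes the squashed scores stay below 1."""
    sim, sf, senti = (log1p(x) for x in (sim, sf, senti))
    return (sim + sf + senti) / (1 - min(sim, sf, senti))
```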


In choosing the size of eSummar-generated summaries, we adopt the length of 100 words determined by DUC (www-nlpir.nist.gov/projects/duc/index.html) and TAC (www.nist.gov/tac/data/forms/index.html), which provide benchmark datasets for assessing the performance of a summarizer. Since eSummar is an extractive summarization approach, it creates the summary of R by including the maximum number of sentences in R in the order of their Rank values, from highest to lowest, excluding the ranked sentence (and successive ones) in R that causes the total word count to exceed 100.

Weights of Sentence Scores. Since the SimR(S), SFR(S), and SentiR(S) values of a sentence S in R provide different measures in determining the Rank value of S, their weights in computing Rank should differ. To adequately determine their weights in Rank, we apply the multi-class perceptron algorithm [14], which establishes the weights through an iterative process. (The training of the multi-class perceptron occurs only once, as a pre-processing step prior to performing the summarization task by eSummar.) We constructed a training set TS with 1,023 sentences in 100 blog posts (an average of 10 sentences per post) extracted from the TAC-2008 dataset (nist.gov/tac/data/forms/index.html), which includes posts and their respective expert-created reference summaries. Each sentence S (in TS) is represented as an input vector with four different values associated with S, i.e., SimR(S), SFR(S), SentiR(S), and Rouge-1. While SimR(S), SFR(S), and SentiR(S) are as defined earlier, Rouge-1 quantifies the degree to which a sentence S in R captures the content and sentiment of the corresponding expert-created summary, Expert-Sum, of R. Rouge-1 is determined by the overlapping unigrams between S and Expert-Sum. The higher the Rouge-1 score is, the more representative S is of the content of Expert-Sum (and thus R). Based on the conducted experiment, the weights of SentiR(S), SimR(S), and SFR(S) are set to 0.55, 0.25, and 0.20, respectively, and Equation 5 is modified to include the weights as follows:

$$Enhanced\ Rank_R(S) = \frac{0.25 \times Sim_R(S) + 0.20 \times SF_R(S) + 0.55 \times Senti_R(S)}{1 - Min\{0.25 \times Sim_R(S), 0.20 \times SF_R(S), 0.55 \times Senti_R(S)\}} \quad (6)$$
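Putting the pieces together, the sketch below combines the weighted ranking with the greedy 100-word selection described above (the sentence scores are assumed to be pre-normalized as in Eq. (5)):

```python
def enhanced_rank(sim, sf, senti):
    """Weighted Stanford-Certainty-Factor score of a sentence (Eq. 6)."""
    ws, wf, wt = 0.25 * sim, 0.20 * sf, 0.55 * senti
    return (ws + wf + wt) / (1 - min(ws, wf, wt))

def extract_summary(sentences, scores, max_words=100):
    """Greedy extractive summary: take sentences in decreasing Enhanced Rank
    order, stopping at the first sentence that would push the total past
    max_words (the paper also excludes all successive ones)."""
    chosen, total = [], 0
    for sent, _ in sorted(zip(sentences, scores), key=lambda p: -p[1]):
        if total + len(sent.split()) > max_words:
            break
        chosen.append(sent)
        total += len(sent.split())
    return chosen
```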

4 Experimental Results

In this section, we first introduce the datasets and metrics used for assessing the performance of SentiClass (in Sections 4.1 and 4.2, respectively). Thereafter, we present the accuracy of SentiClass in sentiment classification (in Section 4.3) and the effectiveness of using eSummar (compared with other summarizers) to generate summaries of reviews, which enhances the classification process of SentiClass (in Section 4.4).

4.1 The Datasets

To evaluate the effectiveness of SentiClass, we have chosen two widely-used datasets, Movie Reviews (cs.cornell.edu/people/pabo/movie-review-data/) and Game Reviews (cswiki.cs.byu.edu/cs679/index.php/Game Spot). The former includes 2,000 reviews extracted from Imdb.com, of which 1,000 reviews are positive and the others negative, whereas the latter consists of reviews that were downloaded from Gamespot.com between April 2005 and January 2007. Each game review includes (i) a score (between 1.0 and 10.0 inclusively, rounded to one decimal point), which denotes the rating of a reviewed video game given by an author, (ii) the author's name, and (iii) the text of the review. Reviews with a score up to 6, between 6 and 8 exclusively, and 8 or higher are labeled as negative, neutral, and positive, respectively, which yields a set of 2,044 game reviews: 548 negative, 1,067 neutral, and 429 positive. Besides movie and game reviews, we created a new dataset, denoted Epinions-DS, with 1,811 reviews extracted from Epinions.com, a well-known public and free product review source from which test datasets can be created for sentiment classification studies [17]; 940 of the reviews are positive and the remaining ones are negative. The reviews in Epinions-DS are uniformly distributed over four different subject areas, Books, Cars, Computers, and Hotels, which are diverse in content and structure.
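For reference, the Game Reviews labeling rule above amounts to the following (a trivial sketch; the boundary handling follows the stated inclusive/exclusive ranges):

```python
def label_game_review(score):
    """Map a Gamespot score in [1.0, 10.0] to a sentiment label."""
    if score >= 8.0:
        return "positive"
    if score > 6.0:       # strictly between 6 and 8
        return "neutral"
    return "negative"     # up to and including 6
```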

4.2 Evaluation Metrics

To evaluate the performance of SentiClass in sentiment classification on either entire reviews or eSummar-generated summaries, we use the classification accuracy measure defined below.

$$Accuracy = \frac{\text{Number of Correctly Classified (Summarized) Reviews}}{\text{Total Number of (Summarized) Reviews in a Collection}} \quad (7)$$

4.3 Performance Evaluation of SentiClass

In this section, we analyze the performance of SentiClass in classifying (eSummar-generated summaries of) reviews in Movie Reviews, Game Reviews, and Epinions-DS. Thereafter, we compare the effectiveness of SentiClass with existing sentiment classifiers in categorizing reviews.

Classification Accuracy. SentiClass is highly accurate in classifying (eSummar-generated summaries of) reviews. As shown in Figure 3, the average accuracy achieved by SentiClass using entire reviews in the three test datasets is 93%, as opposed to 88% using eSummar-generated summaries. Classification accuracy on entire and summarized reviews in Game Reviews yields the largest difference due to the inclusion of the third class, neutral. Identifying neutral reviews is more difficult than categorizing positive or negative ones, since a mixed number of positive and negative terms often co-occur in neutral reviews. A significant benefit of using eSummar-generated summaries, instead of entire reviews, is that the overall average classification time of SentiClass is reduced by close to 50%, from an average of 112 minutes to 59 minutes, as shown in Figure 3.


Fig. 3. (Average) Accuracy and overall processing time of SentiClass in classifying eSummar-generated summaries and their entire reviews in the test datasets

SentiClass and Other Classifiers. To further assess the effectiveness of SentiClass, we compare its classification accuracy with well-known (sentiment) classifiers, as presented in [18], which include Multinomial Naive Bayes (MNB), Maximum Entropy (ME), and Support Vector Machines (SVM), along with Linear/Log Pooling [13] and two other sentiment classifiers by design, i.e., Hybrid and Extract-SVM, on Movie Reviews and Epinions-DS. We do not consider Game Reviews, since most of the sentiment classifiers to be compared are designed for two-class, i.e., positive and negative, categorization.

MNB is a simple and efficient probabilistic classifier that relies on a conditional word independence assumption to compute the probability of word occurrence in a pre-defined class, which dictates to which class a review R should be assigned according to the occurrences of its words. ME, a classification technique that has been shown effective in several natural language processing applications, estimates the conditional distribution of the pre-defined class labels. The classifier represents a review R as a set of word-frequency counts and relies on labeled training data to estimate the expected values of the word counts in R on a class-by-class basis to assign R to its class. The premise of SVM [18], a highly effective text classifier, is to conduct a training procedure that finds a hyperplane which accurately separates reviews (represented as word vectors) in one class from those in another. Extract-SVM [16], another SVM-based classifier, first categorizes sentences in a review as subjective or objective and then applies the SVM classifier to only the subjective sentences in performing sentiment classification, since removing objective sentences prevents the classifier from considering irrelevant or even potentially misleading text [16]. Hybrid [6], on the other hand, is a weighted, voted classifier that combines the classification score on a review R computed by a support vector machine with the score on R generated by a term-counting approach, which considers context valence shifters, such as negations, intensifiers, and diminishers, to classify reviews based on their sentiments.


Fig. 4. Sentiment classification accuracy achieved by various sentiment classifiers on Movie Reviews and Epinions-DS

The authors in [13] propose a Linear/Log Pooling method, denoted LPool, which combines lexical knowledge and supervised learning for text classification. LPool might overfit (i.e., over-train) its model, which increases training time and lowers its accuracy, so that its classification performance on unseen data degrades. As shown in Figure 4, SentiClass outperforms MNB, ME, SVM, Hybrid, Extract-SVM, and LPool in categorizing reviews by 9% to 15% on Movie Reviews and by 4% to 15% on Epinions-DS.

4.4 Assessment of SentiClass Using Summarization

Having demonstrated the effectiveness of SentiClass in Section 4.3, we proceed to assess and compare the classification accuracies achieved by SentiClass using summarized reviews generated by eSummar and other summarizers on the test datasets introduced in Section 4.1.

The Top-N summarizer, a naive summarization approach, assumes that introductory sentences in a review R contain the overall gist of R and extracts the first N (≥ 1) sentences in R as its summary. We treat Top-N as a baseline for summarization. Gong [2], in contrast, applies Latent Semantic Analysis (LSA) to create single-document summaries. LSA analyzes relationships among documents and their terms to compile a set of topics related to the documents, which are described by word-combination patterns recurring in the documents. LSA selects sentences with a high frequency of recurring word-combination patterns in a document D as the summary of D. LRSum [1] creates summaries of reviews by first training a logistic regression model on sentences in reviews represented by a set of features, which include the position of a sentence S within a paragraph, its location in a review, and the frequency of occurrence of words in S. Using the trained model, LRSum identifies sentiment sentences based on their features and selects the one in a review R with the maximal conditional likelihood as the summary of R.

Fig. 5. (Average) Accuracy achieved by SentiClass using summaries created by Top-N, LSA, LRSum, and eSummar on the test datasets, respectively

As shown in Figure 5, performing the classification task in SentiClass using eSummar-generated summaries is more accurate, by an average of at least 20%, than using summaries generated by Top-N, LSA, or LRSum. Note that we set the length of the summaries generated by Top-N, LSA, LRSum, and eSummar to 100 words, as discussed in Section 3.3. As opposed to eSummar, which computes sentence sentiment scores, neither Top-N nor LSA considers the sentiment expressed in a review in creating its summary, although LRSum does. Unlike eSummar, LRSum requires (i) training, which increases the complexity and processing time of the summarizer, and (ii) labeled data, which may not always be available nor easy to compile.

Figure 6 shows the (average) processing time of SentiClass in creating and classifying the summarized versions of the reviews in the test datasets (introduced in Section 4.1) using Top-N, LSA, LRSum, and eSummar, respectively. Not only is the processing time of SentiClass in classifying eSummar-generated summaries (slightly) shorter than that achieved by using LRSum-generated summaries, but its classification accuracy is also higher, by 7%, 30%, and 23% on Movie Reviews, Game Reviews, and Epinions-DS, respectively, as shown in Figure 5. Although the average processing time using Top-N- and LSA-generated summaries is on average 40% shorter than that achieved by using eSummar-generated summaries, the average classification accuracy using summaries created by eSummar is significantly higher than the ones using summaries created by Top-N or LSA, by an average of 27% and 24%, respectively (see Figure 5).


Fig. 6. (Average) Processing time of SentiClass in creating summaries using Top-N, LSA, LRSum, or eSummar on the test datasets and classifying the summaries

Since SentiClass (i) achieves high accuracy and (ii) does not significantly prolong its classification time by using summaries created by eSummar, as compared with summaries created by alternative summarizers, eSummar is an ideal choice for summarizing reviews to be classified by SentiClass.

5 Conclusions

With the increasing number of reviews posted on social search websites, such as Epinions.com, a heavier burden is imposed on web users who browse through archived reviews to locate different opinions expressed on a product or service PS. To assist web users in identifying reviews on PS that share the same polarity, i.e., positive, negative, or neutral, we have developed a simple, yet effective sentiment classifier, denoted SentiClass. SentiClass considers the SentiWordNet scores of words, intensifiers, connectors, reported speech, and negation terms in reviews to accurately categorize the reviews according to the sentiment reflected on PS. To (i) reduce the processing time required by SentiClass for categorizing reviews and (ii) shorten the length of the (classified) reviews a user is expected to browse through, we have introduced eSummar, a single-document, extractive, sentiment summarizer, which considers word-correlation factors, sentiment words, and significance factors to capture the opinions on PS expressed in a review R and generate the summary of R. Empirical studies conducted using the well-known datasets Movie Reviews and Game Reviews, along with a set of reviews extracted from Epinions.com, denoted Epinions-DS, show that SentiClass is highly effective in categorizing (eSummar-generated summaries of) reviews according to their polarity. Using Movie Reviews and Epinions-DS, we have verified that SentiClass outperforms other classifiers in accomplishing the sentiment classification task. We have also demonstrated that eSummar is an ideal summarizer for SentiClass by comparing eSummar with other existing (sentiment) summarizers in creating summaries to be classified.


References

1. P. Beineke, T. Hastie, C. Manning, and S. Vaithyanathan. An Exploration of Sentiment Summarization. In Proc. of AAAI, pages 12–15, 2003.
2. Y. Gong. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proc. of ACM SIGIR, pages 19–25, 2001.
3. X. Hu and B. Wu. Classification and Summarization of Pros and Cons for Customer Reviews. In Proc. of IEEE/WIC/ACM WI-IAT, pages 73–76, 2009.
4. S. Jie, F. Xin, S. Wen, and D. Quan-Xun. BBS Sentiment Classification Based on Word Polarity. In Proc. of ICCET, Vol. 1, pages 352–356, 2009.
5. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
6. A. Kennedy and D. Inkpen. Sentiment Classification of Movie Reviews Using Contextual Valence Shifters. Computational Intelligence, 22(2):110–125, 2006.
7. J. Koberstein and Y.-K. Ng. Using Word Clusters to Detect Similar Web Documents. In Proc. of KSEM, LNAI 4092, pages 215–228, 2006.
8. R. Krestel, S. Bergler, and R. Witte. Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles. In Proc. of LREC, pages 2823–2828, 2008.
9. L. Ku, Y. Liang, and H. Chen. Opinion Extraction, Summarization and Tracking in News and Blog Corpora. In Proc. of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 100–107, 2006.
10. S. Lappin and H. Leass. An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20(4):535–561, 1994.
11. G. Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 6th Ed. Addison Wesley, 2009.
12. H. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.
13. P. Melville, W. Gryc, and R. Lawrence. Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification. In Proc. of KDD, pages 1275–1284, 2009.
14. M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. The MIT Press, 1972.
15. J. Na, C. Khoo, and P. Wu. Use of Negation Phrases in Automatic Sentiment Classification of Product Reviews. Library Collections, Acquisitions, and Technical Services, 29(2):180–191, 2005.
16. B. Pang and L. Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proc. of ACL, pages 271–278, 2004.
17. B. Pang and L. Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.
18. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment Classification Using Machine Learning Techniques. In Proc. of EMNLP, pages 79–86, 2002.
19. L. Polanyi and A. Zaenen. Contextual Valence Shifters. In Computing Attitude and Affect in Text: Theory and Applications, pages 1–10. Springer, 2006.
20. D. Radev, E. Hovy, and K. McKeown. Introduction to the Special Issue on Summarization. Computational Linguistics, 28(4):399–408, 2002.
21. J. Wiebe, T. Wilson, R. Bruce, M. Bell, and M. Martin. Learning Subjective Language. Computational Linguistics, 30:277–308, 2004.
22. J. Zhao, K. Liu, and G. Wang. Adding Redundant Features for CRFs-based Sentence Sentiment Classification. In Proc. of EMNLP, pages 117–126, 2008.
23. L. Zhuang, F. Jing, and X. Zhu. Movie Review Mining and Summarization. In Proc. of ACM CIKM, pages 43–50, 2006.