
Sentiment Intensity: Is It a Good Summary Indicator?

Mijail Kabadjov¹, Alexandra Balahur², and Ester Boldrini²

¹ Joint Research Centre, European Commission, Via E. Fermi 2749, Ispra (VA), Italy
[email protected]
² Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante,
Apartado de Correos 99, E-03080, Alicante, Spain
{abalahur,eboldrini}@dlsi.ua.es

Abstract. In this paper we address the question of whether “very positive” or “very negative” sentences from the perspective of sentiment analysis are “good” summary sentences from the perspective of text summarisation. We operationalise the concepts of very positive and very negative sentences by using the output of a sentiment analyser and evaluate how good a sentence is for summarisation by making use of standard text summarisation metrics and a corpus annotated for both salience and sentiment. In addition, we design and execute a statistical test to evaluate the aforementioned hypothesis. We conclude that the hypothesis does not hold, at least not based on our corpus data, and argue that summarising sentiment and summarising text are two different tasks which should be treated separately.

1 Introduction

Recent years have marked the birth and expansion of the Social Web, the web of interaction and communication. Within this context, people express opinions on a wide variety of topics – from economics and politics to science and technology to cooking and gardening – that are of interest to a large community of users. People propose and debate these topics on forums, review sites and blogs, a series of newly emerging textual genres that are growing daily in volume and in the number of topics addressed. Moreover, such texts can offer a genuine image of the opinions expressed on different topics of interest by all social categories and age groups, in all regions of the world. This can be useful for market research, reputation tracing, product assessment, company prospecting, research on public image and social media analysis, or it can simply satisfy the curiosity of finding out what people think of a certain issue. However, the large quantities of such data available cannot be read in their entirety. Therefore, automatic systems that are able to detect opinion, classify it and summarise it have to be built. Such a system would, for example, given a blog thread (the sequence of texts containing the post on a subject and the subsequent comments on it made by different “bloggers”),


analyse its content as far as opinion is concerned and subsequently summarise the classes of opinions expressed (i.e., arguments for and against the topic). While opinion mining and text summarisation have each been intensively researched, up until now there has been little work at the intersection of the two, namely opinion summarisation. Therefore, the aim of the work presented herein is to study the manner in which opinion can be summarised. In this paper we address the question of whether very positive or very negative sentences from the perspective of sentiment analysis are good summary sentences from the perspective of text summarisation. We test our hypotheses on a corpus of blog threads from different domains and discuss our findings. The remainder of the paper is organised as follows: in section §2 we summarise the state of the art in sentiment analysis and summarisation; in section §3 we describe our approach to gauging the usefulness of sentiment intensity for summarisation; next, in section §4, we discuss our experimental results; and finally, we draw conclusions from this work and give pointers to future work.

2 Related Work

Whilst there is abundant literature on text summarisation [16,14,11] and sentiment analysis [3,20,10], there is still limited work at the intersection of these two areas [23], in particular on the relationship between sentiment intensity and summarisation prominence. In 2008, the Text Analysis Conference organised by the US National Institute of Standards and Technology (NIST) included for the first time a Summarisation Opinion Pilot track. Most approaches to the problem were underpinned by already existing summarisation systems: some added new features (sentiment, pos/neg sentiment, pos/neg opinion) to account for positive or negative opinions (CLASSY [7], CCNU [13], LIPN [4] and IIITSum08 [26]); others proposed efficient polarity-based methods focusing on the retrieval and filtering stage (DLSIUAES [1]) or on separating information-rich clauses (Italica [8]). Overall, previous work on opinion mining can be broadly classified into two categories: sentiment classification at the document level and at the sentence level. Research on document-level sentiment classification has used supervised [6] and unsupervised methods [25], rating scales [20] and scoring of features [9]. Work on sentence-level sentiment classification has used bootstrapping techniques [21], finding the strength of opinions [27], summing up the orientations of opinion words in a sentence [17] and identifying opinion holders [23]. Finally, finer-grained, feature-based opinion summarisation was put forward in [15].

3 Sentiment Intensity and Summarisation

Our approach follows a simple intuition: when people express very negative or very positive sentiment, for example in blogs, they might also be conveying important and valuable information that is somewhat more salient than other comments. The sub-area of Natural Language Processing concerned with identifying salient information in text documents is Text Summarisation; hence, we


decided to formalise our intuition in the context of text summarisation and make use of standard methodology from that area. In addition, we cast the above intuition as a statistical hypothesis test where the null hypothesis we seek to reject is the opposite of our intuition, that is, that the sentiment intensity of salient blog comments is no different from the sentiment intensity of non-salient comments. In order to carry out experiments studying in a quantitative manner whether sentiment intensity is a useful summary indicator, three things are needed: a sentiment analysis system capable of producing a sentiment intensity score for a given blog comment, a summarisation algorithm exploiting this sentiment intensity score, and a reference corpus annotated for both sentiment and salience (i.e., gold standard data). Next, we describe each of these components and the design of the hypothesis test.

3.1 Sentiment Annotated Corpus

Annotation Process. The corpus we employed in this study is a collection of 51 blogs extracted from the Web, a limited dataset which allows for a preliminary study in the field. The blog posts are written in English and share the same structure: there is an initial post created by the author, containing a piece of news and their opinion on it; subsequently, bloggers reply expressing their opinions on the topic. In most cases, the commenting posts are the most subjective texts, even though the authors, in their initial intervention, may also express their point of view on the topic of the post. Blogs can also contain multi-modal information; however, since the aim of our study is to summarise opinions expressed in text, we only annotated the textual content of the blogs. The annotation we performed on this blog corpus contains several elements: first, we indicated the URL from which the thread was extracted; we then included the annotated initial piece of news and the user comments labelled with EmotiBlog, an opinion annotation scheme designed for blogs. To delimit our work, we selected only five major topics that we considered relevant – economy, science and technology, cooking, society and sport – giving priority to the most relevant threads, those containing a large number of posts, in order to have a considerable amount of data. Due to space constraints, specific details on the corpus are omitted here (see [2] for more details).

3.2 Sentiment Analysis

The first step we took in our approach (see figure 1) was to determine the opinionated sentences and assign each of them a polarity (positive or negative) and a numerical value corresponding to the polarity strength (the higher the negative score, the more negative the sentence, and vice versa). Given that we are faced with the task of classifying opinion in a general context, we employed the simple yet efficient approach presented in [3]. At present, there are several lexicons for affect detection and opinion mining. In order to have a more


Fig. 1. The Sentiment Analysis Process

extensive database of affect-related terms, in the following experiments we used WordNet Affect [24], SentiWordNet [12] and MicroWNOp [5]. Each of the employed resources was mapped to four categories, which were given different scores: positive (1), negative (-1), high positive (4) and high negative (-4). As shown in [3], these values performed better than the usual assignment of only positive (1) and negative (-1) values. First, the score of each of the blog posts was computed as the sum of the values of the affective words identified; a positive total leads to the classification of the post as positive, whereas a negative total leads to the system classifying the post as negative. Subsequently, we performed sentence splitting using LingPipe and classified the obtained sentences according to their polarity, by adding the individual scores of the affective words identified.
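As an illustration, the following is a minimal sketch of this scoring scheme in Python, under the assumption that the three resources have been merged into a single dictionary mapping words to the four values above; the toy lexicon entries, the whitespace tokenisation and the function names are illustrative placeholders rather than our actual implementation (which uses the full resources and LingPipe for sentence splitting).

```python
# A sketch of the word-counting scheme, assuming a merged lexicon mapping
# words to the four category values {1, -1, 4, -4}. Toy entries only.
AFFECT_LEXICON = {
    "good": 1, "excellent": 4,   # positive / high positive
    "bad": -1, "horrible": -4,   # negative / high negative
}

def sentence_score(sentence: str) -> int:
    """Sum the lexicon values of the affective words found in the sentence."""
    return sum(AFFECT_LEXICON.get(tok, 0) for tok in sentence.lower().split())

def sentence_polarity(sentence: str) -> str:
    """A positive total classifies the sentence as positive, a negative
    total as negative; the magnitude is the polarity strength."""
    score = sentence_score(sentence)
    return "positive" if score > 0 else ("negative" if score < 0 else "neutral")
```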

3.3 Summarisation Algorithm Based on Sentiment Intensity

A straightforward summarisation algorithm that exploits sentiment intensity can be defined as follows:

1. Rank all comments according to their intensity for a given polarity.
2. Select the highest-ranked n comments.

It is important to point out here that positive and negative polarity comments are treated separately, that is, we produce one summary for all positive comments and one for all negative comments of a given blog thread (see figure 2).

Fig. 2. The Summarisation Process


We ran this algorithm at two commonly used compression rates: 15% and 30%. That is, we produce two summaries for each polarity for each thread, one choosing the top 15% and one the top 30% of all comments; a sketch of the procedure is given below. The results are presented and discussed in the next section, §4.
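For concreteness, here is a minimal sketch of the ranking-and-selection step, under the assumption that the sentiment analyser delivers each comment as a (text, score) pair; the tie-breaking and rounding behaviour are arbitrary choices of the sketch, not specified in the description above.

```python
def sisumm(comments, polarity="negative", rate=0.15):
    """Sentiment-intensity-based summary: keep the most intense comments
    of one polarity. `comments` is a list of (text, score) pairs and
    `rate` is the compression rate (0.15 or 0.30 in our experiments)."""
    if polarity == "positive":
        pool = [c for c in comments if c[1] > 0]
    else:
        pool = [c for c in comments if c[1] < 0]
    # Rank by intensity, i.e. the absolute value of the sentiment score.
    pool.sort(key=lambda c: abs(c[1]), reverse=True)
    n = max(1, round(rate * len(pool)))
    return [text for text, _ in pool[:n]]
```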

3.4 Hypothesis Test for Sentiment Intensity Usefulness

In addition to a standard summarisation evaluation, we evaluate the hypothesis that very positive or very negative comments are good choices to be included in a summary by casting the problem as a statistical hypothesis test.

Student's t-test. We define the following setting in order to execute an independent two-sample, one-tailed t-test with unequal sample sizes and equal variance:

1. Null hypothesis, $H_0: \bar{X}_1 - \bar{X}_2 = 0$; alternative hypothesis, $H_1: \bar{X}_1 > \bar{X}_2$
2. Level of significance: $\alpha = 0.05$
3. t statistic:

   $t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_{X_1 X_2} \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$   (1)

   where

   $S_{X_1 X_2} = \sqrt{\dfrac{(n_1 - 1) S_{X_1}^2 + (n_2 - 1) S_{X_2}^2}{n_1 + n_2 - 2}}$   (2)

4. Criterion: reject the null hypothesis in favour of the alternative hypothesis if $t > t_{\nu,\alpha}$, where $\nu = n_1 + n_2 - 2$ (degrees of freedom) and $t_{\infty,0.05} = 1.645$

In equations 1 and 2, $n_i$ is the number of sample points in group $i$, where the subscripts 1 and 2 denote group one and group two. More specifically, in our case group one is composed of all the comments annotated as salient in our corpus (i.e., gold summary comments) and group two is composed of all the comments that were not annotated (i.e., gold non-summary comments). Furthermore, we slice the data by polarity (as produced by the sentiment analysis tool), so we have two samples (i.e., group one and group two) for the case of positive comments and two samples for the case of negative comments. For example, out of all the comments that were assigned a positive score by the sentiment analysis tool, there are those that were also annotated as positive by the annotators – these constitute group one for the positive polarity case – and those that were not annotated at all – these constitute group two for the positive polarity case.[3] The same reasoning applies for the negative polarity case.
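The t statistic of equations 1 and 2 is simple to compute directly; a sketch in plain Python follows, with variable names mirroring the notation above.

```python
import math

def pooled_t(group1, group2):
    """Two-sample t statistic with pooled variance (equations 1 and 2):
    unequal sample sizes, equal variance assumed."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    # Unbiased sample variances.
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    # Pooled standard deviation, equation (2).
    s12 = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    # t statistic, equation (1).
    return (m1 - m2) / (s12 * math.sqrt(1 / n1 + 1 / n2))

# One-tailed test at alpha = 0.05 with large degrees of freedom:
# reject H0 if pooled_t(salient_scores, non_salient_scores) > 1.645.
```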

[3] Certainly, in order to use gold polarity alongside the score produced by the sentiment analysis tool as we do, we first had to automatically align the automatically identified sentences with the annotated comments. The alignment criterion we used was that at least 70% of the words in an automatically identified sentence must be contained in an annotated comment for the sentence to inherit the gold polarity of that comment (and by virtue of that to be considered a gold summary sentence).
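A sketch of this containment criterion follows; the tokenisation and case normalisation shown are assumptions, as these details are implementation-specific.

```python
def inherits_gold_polarity(sentence: str, comment: str, threshold=0.7) -> bool:
    """True if at least 70% of the sentence's words occur in the annotated
    comment, in which case the sentence inherits the comment's gold polarity."""
    sent_tokens = sentence.lower().split()
    comment_vocab = set(comment.lower().split())
    contained = sum(1 for tok in sent_tokens if tok in comment_vocab)
    return bool(sent_tokens) and contained / len(sent_tokens) >= threshold
```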

4 Experimental Results

We first discuss the performance of the sentiment recognition system, followed by that of the sentiment-intensity-based summariser.

4.1 Sentiment Analysis

Performance results of the sentiment analysis are shown in table 1.

Table 1. Sentiment analysis performance

System    Precision  Recall  F1
Sent_neg  0.98       0.54    0.69
Sent_pos  0.07       0.69    0.12

The first thing to note in table 1 is that the sentiment analysis tool does a much better job at identifying negative comments (F1 = 0.69) than positive ones (F1 = 0.12), the main problem with the latter being a very low precision (P = 0.07). One possible reason for this is an insufficient number of annotated positive examples (there were many more negative examples than positive ones in the corpus). We discuss in the next section whether this substantial difference in performance between the negative and positive cases has an impact on the subsequent analysis.

4.2 Summarisation

Performance results of the summariser are shown in table 3. We used the standard ROUGE evaluation [19], also used for the Text Analysis Conferences. We include the usual ROUGE metrics: R1 is the maximum number of co-occurring unigrams, R2 is the maximum number of co-occurring bigrams, RSU4 is the skip-bigram measure with the addition of unigrams as counting unit, and RL is the longest common subsequence measure [19]. In all cases we present the average F1 score for the given metric. There are six rows in table 3: the first (SISumm_neg at 15%) is the performance of the sentiment-intensity-based summariser (SISumm) on the negative posts at 15% compression rate; the second (SISumm_pos at 15%) presents the performance of SISumm on the positive posts at 15% compression rate; the third (SISumm_neg at 30%) is the performance of SISumm on the negative posts at 30% compression rate; the fourth (SISumm_pos at 30%) presents the performance of SISumm on the positive posts at 30% compression rate; and finally, the fifth and the sixth rows correspond to the official scores of the top and bottom performing summarisers at the 2008 Text Analysis Conference Summarisation track (TAC08), respectively. The latter scores are included to provide some context for the other results.[4]
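For illustration, a minimal sketch of the ROUGE-1 F1 computation against a single model summary is given below; the official ROUGE package [19] additionally supports stemming, stopword handling and multiple references, so this is only an approximation of the actual scorer.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: overlapping unigrams (clipped by reference counts)
    scored against a single model summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```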


Table 3. Summarisation performance in terms of ROUGE scores

System             R1    R2     RSU4   RL
SISumm_neg at 15%  0.07  0.03   0.03   0.07
SISumm_pos at 15%  0.22  0.03   0.03   0.19
SISumm_neg at 30%  0.17  0.06   0.06   0.16
SISumm_pos at 30%  0.19  0.03   0.03   0.17
TopSumm_TAC08      –     0.111  0.142  –
BottomSumm_TAC08   –     0.069  0.081  –

From table 3 it is evident that the ROUGE scores obtained are low (at least in the context of TAC08). This suggests that sentiment intensity alone is not a sufficiently representative feature of the importance of comments for summarisation purposes. Thus, using it in combination with other features that have proven useful for summarisation, such as entities mentioned in a given comment [16], certain cue phrases and surface features [18], or features capturing the relevance of blog posts to the main topic[5], is likely to yield better results. In particular, incorporating topic detection features would be crucial, since at the moment off-topic, but very negative or very positive, comments are clearly bad choices for a summary, and currently we employ no means for filtering these out.

There is also an alternative interpretation of the attained results. These results were obtained using a methodology from text summarisation research, so it is possible that the method is not particularly well suited for the task at hand, that of producing sentiment-rich summaries. Hence, the reason for the low results may be that we addressed the problem in the context of a slightly different task, suggesting that the task of producing content-based summaries and that of producing sentiment-based summaries are two distinct tasks which require different treatment.

In addition to the above results, we performed the statistical hypothesis test explained in section §3.4. The necessary ingredients and the resulting t-statistic values are shown in table 2.

Table 2. Ingredients for a two-sample t-test; unequal sample sizes, equal variance

Polarity  X̄1     X̄2     n1    n2    S²(X1)  S²(X2)  t statistic
Negative  −3.95  −4.04  1092  1381  10.13   10.5    0.021
Positive  4.37   4.26   48    1268  9.3     28.03   0.036

In both cases, negative and positive polarity, the t values obtained are not large enough for us to reject the null hypothesis in favour of the alternative hypothesis. That is, we do not have empirical evidence to reject the null hypothesis that the sentiment intensity of salient blog comments is no different from the sentiment intensity of non-salient comments, in favour of our alternative hypothesis that sentiment intensity in summary blog comments is indeed different from that of non-summary blog comments.

[4] We note, however, that the results on our corpus are not directly comparable with those of TAC08, since the data sets are different.
[5] Blog posts in our corpus were annotated as important with respect to the main topic of the respective blog threads.


We conclude that, based on our annotated corpus, the hypothesis that very positive or very negative sentences are also good summary sentences does not hold. But, once again, we point out that these results are meaningful in the context of text summarisation, that is, the task of producing content-based summaries. Hence, the observation made above, that producing content-based summaries is different from producing sentiment-based summaries and that these tasks should therefore be treated differently, also applies in this case.

5 Conclusions

In this paper we addressed the question of whether very positive or very negative blog comments from the perspective of sentiment analysis are good summary sentences from the perspective of text summarisation. We used a sentiment analysis tool capable of producing a score in the range [−10, +10] for every sentence to process a corpus of blog threads annotated for both salience and sentiment. We proposed a simple summarisation algorithm that sorts all sentences of the same polarity by their sentiment score in decreasing order and chooses the top n (e.g., the top 15% or 30%) as the resulting summary. All such summaries were evaluated against the model summaries derived from the annotation, using four ROUGE metrics commonly employed in text summarisation evaluations: R1, R2, RSU4 and RL (see section §4.2 for details). In addition, we designed and carried out a hypothesis test to evaluate statistically the hypothesis that very positive or very negative blog comments are good summary sentences. In the light of the low summarisation results attained and the results of the statistical test, we concluded that the aforementioned hypothesis does not hold, at least not based on our corpus data, and argued that summarising sentiment and summarising text are two different tasks which should be treated separately. In future work we intend to explore in more depth the problem of summarising opinions expressed in blogs, adopting an approach to summarisation that combines statistical information with information about persons and organisations, such as the one proposed in [16], since in blogs it is very common for a thread to be elaborated around an important person or organisation.

References

1. Balahur, A., Lloret, E., Ferrández, O., Montoyo, A., Palomar, M., Muñoz, R.: The DLSIUAES team's participation in the TAC 2008 tracks. In: National Institute of Standards and Technology [22]
2. Balahur, A., Lloret, E., Boldrini, E., Montoyo, A., Palomar, M., Martínez-Barco, P.: Summarizing threads in blogs using opinion polarity. In: Proceedings of the Workshop on Events in Emerging Text Types at RANLP, Borovetz, Bulgaria (September 2009)
3. Balahur, A., Steinberger, R., van der Goot, E., Pouliquen, B.: Opinion mining from newspaper quotations. In: Proceedings of the Workshop on Intelligent Analysis and Processing of Web News Content at the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) (2009)


4. Bossard, A., Généreux, M., Poibeau, T.: Description of the LIPN systems at TAC 2008: Summarizing information and opinions. In: National Institute of Standards and Technology [22]
5. Cerini, S., Compagnoni, V., Demontis, A., Formentelli, M., Gandini, G.: MicroWNOp: A gold standard for the evaluation of automatically compiled lexical resources for opinion mining. In: Sansò, A. (ed.) Language Resources and Linguistic Theory: Typology, Second Language Acquisition, English Linguistics. Franco Angeli, Milano, IT (2007)
6. Chaovalit, P., Zhou, L.: Movie review mining: a comparison between supervised and unsupervised classification approaches. In: Proceedings of HICSS 2005, the 38th Hawaii International Conference on System Sciences (2005)
7. Conroy, J., Schlesinger, S.: CLASSY at TAC 2008 metrics. In: National Institute of Standards and Technology [22]
8. Cruz, F., Troyano, J., Ortega, J., Enríquez, F.: The Italica system at TAC 2008 opinion summarization task. In: National Institute of Standards and Technology [22]
9. Dave, K., Lawrence, S., Pennock, D.: Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the World Wide Web Conference (2003)
10. Riloff, E., Wiebe, J., Phillips, W.: Exploiting subjectivity classification to improve information extraction. In: Proceedings of the 20th National Conference on Artificial Intelligence (AAAI) (2005)
11. Erkan, G., Radev, D.R.: LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR) (2004)
12. Esuli, A., Sebastiani, F.: SentiWordNet: A publicly available resource for opinion mining. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, Italy (May 2006)
13. He, T., Chen, J., Gui, Z., Li, F.: CCNU at TAC 2008: Proceeding on using semantic method for automated summarization yield. In: National Institute of Standards and Technology [22]
14. Hovy, E.H.: Automated text summarization. In: Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, pp. 583–598. Oxford University Press, Oxford (2005)
15. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2004)
16. Kabadjov, M.A., Steinberger, J., Pouliquen, B., Steinberger, R., Poesio, M.: Multilingual statistical news summarisation: Preliminary experiments with English. In: Proceedings of the Workshop on Intelligent Analysis and Processing of Web News Content at the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) (2009)
17. Kim, S., Hovy, E.: Determining the sentiment of opinions. In: Proceedings of the International Conference on Computational Linguistics (COLING) (2004)
18. Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, pp. 68–73 (1995)
19. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain (2004)
20. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1–135 (2008)


21. Riloff, E., Wiebe, J.: Learning extraction patterns for subjective expressions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2003)
22. National Institute of Standards and Technology (eds.): Proceedings of the Text Analysis Conference, Gaithersburg, MD (November 2008)
23. Stoyanov, V., Cardie, C.: Toward opinion summarization: Linking the sources. In: Proceedings of the COLING-ACL Workshop on Sentiment and Subjectivity in Text. Association for Computational Linguistics, Sydney (July 2006)
24. Strapparava, C., Valitutti, A.: WordNet-Affect: an affective extension of WordNet. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, pp. 1083–1086 (May 2004)
25. Turney, P.: Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2002)
26. Varma, V., Pingali, P., Katragadda, R., Krisha, S., Ganesh, S., Sarvabhotla, K., Garapati, H., Gopisetty, H., Reddy, V., Bysani, P., Bharadwaj, R.: IIIT Hyderabad at TAC 2008. In: National Institute of Standards and Technology [22]
27. Wilson, T., Wiebe, J., Hwa, R.: Just how mad are you? Finding strong and weak opinion clauses. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2004)