Advances in Information Technology and Applied Computing (ISSN 2251-3418), Volume (1)

A Context-Based Word Sense Disambiguation Method for TV Shows

Ana C. E. S. Lima
Natural Computing Laboratory, Mackenzie Presbyterian University, São Paulo, Brazil
[email protected]

Leandro N. de Castro
Natural Computing Laboratory, Mackenzie Presbyterian University, São Paulo, Brazil
[email protected]

Abstract - Twitter is a microblogging service that has gained popularity as an instant communication platform. Many companies see this service as an important place for monitoring their brands. In this direction, TV stations have adopted Twitter to shorten the distance between them and their viewers, and use the information as a feedback mechanism for their shows. However, monitoring Twitter messages is difficult because of the short texts, the informality and, mainly, the ambiguity encountered when retrieving messages. Thus, there is a strong demand for tools, named disambiguation systems, which provide real-time responses about the context of a message. This paper proposes a real-time context-based disambiguation system for Twitter messages that mention TV shows of Brazilian stations. To assess the performance of the proposed system, tweets related to a Brazilian TV show were captured over a 24-hour interval and fed into the system. The proposed technique achieved an average accuracy of 93%.

Keywords - Twitter; Word Sense Disambiguation; Classification; Context Dictionary

I. INTRODUCTION

Twitter® is a microblogging service designed for simplified communication through short messages. Posts are shown in reverse chronological order, so readers receive news as quickly as it is posted; in essence, the service transmits information in real time [1]. Within Twitter, the focus is the information, not the user interaction. Reciprocity between users is therefore not required: one user can follow another without being followed back [2]. The simplified communication and connection accelerate the process of message update. About 250 million messages are posted daily [3], which makes this service an important repository for data analysis. As a result, many companies use it to monitor their brands. Monitoring messages helps companies understand a specific environment and its constant changes, based on one principle: if something is said in social media, then it can be qualified and quantified [4]. Monitoring tools such as Twitter Grader®, Twitalyzer® and Twazzup® are increasingly used by communication professionals for market analysis.

Like companies and brands, TV stations have realized that Twitter can provide real-time feedback about their shows. According to recent research, 50.6% of Internet users surf the web while watching TV [5]. Thus, by monitoring such messages, TV stations can detect aspects of their viewers' preferences and improve their shows, design marketing campaigns better targeted at their audience, and make their TV shows more interactive [6].

However, one problem with monitoring Twitter messages is the ambiguity of some results. That is, although a message contains the monitored term, the term does not necessarily belong to the desired context. Because the messages, known as tweets, are short and may be written in informal language, with slang and even a mix of languages, the potential for ambiguity is huge [8]. Ambiguity can complicate automatic message analysis and significantly modify quantitative and qualitative results [9]. Therefore, every monitoring system has to include a technique for identifying which messages are in the correct context of application.

The determination of the correct sense is called disambiguation. Disambiguation is a classical natural language processing problem and is essential for systems that perform automatic translation, information retrieval, and search, among others [10]. The disambiguation task can take two forms [11]: Named Entity Disambiguation (NED) and Word Sense Disambiguation (WSD). Named entity disambiguation identifies whether a word refers to a certain object, or entity. Word sense disambiguation defines the proper sense of ambiguous words in a text [11]. Although there are many algorithms for removing ambiguity in static data sets, novel communication technologies, such as social networks, present a new challenge: real-time disambiguation [12].

Therefore, this work proposes and evaluates a system for the automatic, real-time, context-based disambiguation of Twitter messages that mention Brazilian TV shows. The paper is organized as follows. Section II introduces the concept of Word Sense Disambiguation, and Section III introduces the proposed disambiguation model. Section IV presents and discusses the results obtained. Finally, Section V presents the conclusion.


II. WORD SENSE DISAMBIGUATION

In general, languages are naturally ambiguous. Many words can be interpreted differently depending on their context. Such words are called polysemic [13]. The existence of polysemic words increases the difficulty of automatic analysis and affects several areas, such as information retrieval, link analysis, knowledge discovery, machine translation and automatic summarization, among others. Consequently, there is a major need for effective disambiguation methods [14]. The task of assigning the proper sense to a polysemic word is called Word Sense Disambiguation (WSD) [14], whose main components are:

• Selection of word senses (Class): organizes the possible meanings of the words in order to create a class set [13];

• Representation of context: processes the texts in order to structure them, because texts are unstructured sources of information full of uncertainty and ambiguity, which makes interpretation even more difficult [15];

• Selection of a classification method: word sense disambiguation can be seen as a classification task in which the senses are the classes and a classification method is used to assign a sense to all or some of the words in a text [13]. Two types of techniques are usually adopted: machine learning and knowledge-based systems [12].

The methods based on machine learning can be either supervised or unsupervised. Supervised approaches rely on a pre-classified dataset, which is used either to build a classification model or simply to infer the class of a new object from the data whose classes are known [16]. The main supervised algorithms used for WSD are Decision Lists, Decision Trees, Naive Bayes, Neural Networks and SVMs [16]. Unsupervised classifiers do not use pre-classified samples; instead, they try to find statistical regularities within the data in order to determine the class of the objects in the dataset [16]. The main unsupervised algorithms used in WSD are Context Clustering, Word Clustering and Co-occurrence Graphs [16].

Knowledge-based classification methods use a set of information about the selected senses to perform the classification. This knowledge can be represented as a dictionary, thesaurus, text collection or ontology, among others [13]. The present paper proposes a knowledge-based method to create a context-based classifier that can perform the automatic disambiguation of Twitter messages.

Twitter generates about 250 million messages daily [3], thus creating a huge demand for real-time disambiguation systems. Despite this need, there is relatively little research on tweet disambiguation. In [17] the authors proposed a method based on custom rules for disambiguating tweets about companies. The average accuracy achieved by the system was 75%, with 75% precision for the positive class and 74% precision for the negative class. In [18] the authors used keywords extracted automatically from the company website to create a knowledge base for classifying tweets as related or unrelated to the company. In [19] a system was developed for real-time disambiguation of tweets about the Brazilian football championship. The average accuracy obtained was 70%.

III. CBWSD: A CONTEXT-BASED WSD METHOD

The disambiguation system proposed in this paper, called Context-Based Word Sense Disambiguation (CBWSD), uses a context dictionary to determine whether a message is within the scope of the application. We chose a context-based approach because the associated knowledge gives this kind of model increased coverage [12]. A context dictionary is a set of words that refer to the situation and environment in which they occur. For instance, the context of a movie would include the actors and their roles.

CBWSD is a text classifier in which, given a set of N tweets $T = \{t_1, t_2, \ldots, t_N\}$, a set W of m polysemic words $W = \{w_1, w_2, \ldots, w_m\}$, and a set C of k senses $C = \{c_1, c_2, \ldots, c_k\}$, the disambiguation maps each word $w \in W$ to one sense in set C: $W \rightarrow C$.

In the application used in this paper to assess the proposed system, a polysemic word corresponds to a TV show of a Brazilian TV station that has an ambiguous name. For instance, the program named “Estrelas” (which means “Stars” in English), from Rede Globo, can have several meanings, such as stars in the sense of something or someone special, the stars that shine in the sky, and the star polygon. In the CBWSD system proposed here, a tweet can be classified into one of two classes (senses): related to TV (RTV), or not related to TV (NRTV). Therefore, $C = \{RTV, NRTV\}$.

The proposed system operates in two phases (Fig. 1):

1) Polysemic Set Membership (PSM) Analysis: the first step verifies whether a tweet contains a polysemic word, that is, whether the tweet mentions a program that belongs to the list of ambiguous programs. If it does, the message goes to the classification stage; otherwise, it does not belong to the polysemic set and does not need to be classified as RTV or NRTV.

2) Word Sense Disambiguation (WSD): the second step is the disambiguation itself, that is, the classification of a tweet as RTV or NRTV based on its context. Only tweets selected in the PSM analysis go through the WSD. At this stage, the tweet is examined against the context dictionaries. To speed up classification, a generic dictionary named ontoTV, which contains the context of watching TV, was created. If the tweet class cannot be determined from the ontoTV dictionary, the system looks up the specific set of words that define the TV program in a dictionary named ontoProgram. If neither dictionary yields a term related to TV, the tweet is classified as NRTV.
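The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the lowercase substring and token matching, and the contents of `AMBIGUOUS_PROGRAMS`, `ONTO_TV` and `ONTO_PROGRAM` are hypothetical. Only the class labels RTV and NRTV, the dictionary names ontoTV and ontoProgram, and the terms “Agora é Tarde”, “Estrelas”, “assisti” and “Danilo Gentili” come from the paper.

```python
import re
from typing import Optional

# Hypothetical dictionary contents, chosen only for illustration.
AMBIGUOUS_PROGRAMS = ("agora é tarde", "estrelas")
# ontoTV: generic vocabulary about watching TV (assumed entries).
ONTO_TV = {"assisti", "assistir", "programa", "tv"}
# ontoProgram: program-specific terms keyed by program title (assumed entries).
ONTO_PROGRAM = {"agora é tarde": ("danilo gentili",)}


def find_polysemic_term(tweet: str) -> Optional[str]:
    """Phase 1 (PSM analysis): return the first ambiguous program mentioned, if any."""
    text = tweet.lower()
    for program in AMBIGUOUS_PROGRAMS:
        if program in text:
            return program
    return None


def classify_tweet(tweet: str) -> Optional[str]:
    """Phase 2 (WSD): return 'RTV' or 'NRTV', or None if the tweet is not in the polysemic set."""
    program = find_polysemic_term(tweet)
    if program is None:
        return None  # no ambiguous program mentioned, nothing to disambiguate
    # Drop the program title so that its own words do not count as context.
    context = tweet.lower().replace(program, " ")
    words = set(re.findall(r"\w+", context))
    # First try the generic TV-watching dictionary.
    if words & ONTO_TV:
        return "RTV"
    # Then fall back to the program-specific dictionary.
    if any(term in context for term in ONTO_PROGRAM.get(program, ())):
        return "RTV"
    return "NRTV"
```

With these example dictionaries, a tweet containing “assisti” next to the program title is labelled RTV by the ontoTV lookup, a tweet mentioning “Danilo Gentili” is labelled RTV by the ontoProgram lookup, and a tweet that only uses the expression “agora é tarde” in its everyday sense is labelled NRTV. Checking the generic dictionary first mirrors the order described above, so the program-specific lookup is only needed for tweets the generic one cannot resolve.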


Figure 1. Proposed classifier architecture.

B. An Illustration of CBWSD Operation

To illustrate the operation of CBWSD, messages about the program titled “Agora é Tarde” were captured. This Portuguese title means “Now it is late” and, as in English, the expression may have different senses. In such a case, CBWSD works as follows, depending on the received message:

1) “Ontem eu nem assisti Agora é tarde...=/, mas hoje eu vou assistir =)”. In English: "Yesterday I didn't even watch Now it is Late, but today I'll watch".
PSM Analysis: this step verifies that the ambiguous term “Agora é tarde” is present.
WSD Classification: the first context dictionary used is ontoTV and, at this point, the word “assisti” ("watched") is found within the context, so the message is classified as RTV.

2) “Agora é tarde (10/07): Danilo Gentili conversa com Mari Paraiba http://t.co/VDFW2iNJ”. In English: "Now it is Late (10/07): Danilo Gentili talks with Mari Paraiba".
PSM Analysis: in this tweet the polysemic set membership analysis also indicates that the tweet contains a polysemic word.
WSD Classification: the tweet goes through the ontoTV dictionary, which returns an empty set, and then proceeds to the ontoProgram dictionary, which finds the term “Danilo Gentili” as context related to TV (RTV).

3) “agora é tarde demais, ta na hora de assumir meus erros e seguir em frente :)”. In English: "Now it is too late, it's time to own up to my mistakes and move on".
PSM Analysis: as in the previous examples, the tweet contains a polysemic expression.
WSD Classification: after the tweet passes the PSM analysis, both dictionaries, ontoTV and ontoProgram, fail to identify terms related to TV, so the tweet is classified as negative (NRTV).

IV. PERFORMANCE EVALUATION

This section starts with a brief overview of the Twitter microblogging system and then presents the materials and methods used to assess the performance of the proposed word sense disambiguation system. Results are then presented and discussed.

A. A Briefing on Twitter

Microblogs are web services designed for simplified communication. Their messages are shorter than those found in blogs and, as in blogs, posts are shown in reverse chronological order. In these services, information is updated faster because of the simplified communication and short texts [20]. Tweets have a maximum length of 140 characters and can be written informally, with slang and special characters. Therefore, the automatic analysis of Twitter messages has a different difficulty level than that of more formal texts with longer character limits [2].

Posted messages can be monitored through the Twitter Application Programming Interface (API), which provides several methods for data retrieval and access to user information. Internally, the Twitter API is divided into the Search API and the Streaming API. The Search API provides access to a limited set of recent tweets. The Streaming API allows access to the flow of messages in real time. There are libraries [21] available in several programming languages to facilitate access to the Twitter API. Among these, we highlight Twitter4J, a library for Java applications developed by Yusuke Yamamoto. This library was used in the development of this work.

B. Materials and Methods

The Twitter4J [22] library was used to capture tweets, and a tweet search script was written in Java to run queries in real time from 6 to 7 July 2012, totaling 24 hours of captured tweets. The queried term was the program “Agora é Tarde”, and 6,030 tweets were collected. The test was run on a Dell computer with an Intel Core 2 Duo 2.8 GHz processor, 4 GB of RAM and a 250 GB hard disk.

For this problem, the results can be presented as a confusion matrix, with a row and a column for each class. The confusion matrix contains information about the correct and predicted classifications made by a classifier [23]. The performance of such systems is evaluated using the data in the confusion matrix. Table I shows an example of a confusion matrix for a two-class classifier. The correct class is placed in the rows and the predicted class in the columns. TP is the number of correct predictions for the Positive class, TN is the number of correct predictions for the Negative class, FP is the number of Negative class objects predicted as Positive, and FN is the number of Positive class objects predicted as Negative.

TABLE I. CONFUSION MATRIX.

Correct      Predicted
             Positive   Negative
Positive     TP         FN
Negative     FP         TN
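The four cells of Table I can be tallied directly from pairs of correct and predicted labels. The following Python sketch shows one way to do it. The function name `confusion_counts`, the choice of RTV as the positive class by default, and the five example pairs are assumptions made for illustration; they are not data from the experiments.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confusion_counts(pairs: Iterable[Tuple[str, str]], positive: str = "RTV") -> Dict[str, int]:
    """Tally TP, FN, FP and TN from (correct, predicted) label pairs."""
    counts = Counter()
    for correct, predicted in pairs:
        if correct == positive:
            # Row "Positive" of Table I: hit (TP) or miss (FN).
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            # Row "Negative" of Table I: false alarm (FP) or correct rejection (TN).
            counts["FP" if predicted == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FN", "FP", "TN")}


# Made-up (correct, predicted) pairs, for illustration only.
example_pairs = [
    ("RTV", "RTV"),
    ("RTV", "RTV"),
    ("RTV", "NRTV"),
    ("NRTV", "RTV"),
    ("NRTV", "NRTV"),
]
print(confusion_counts(example_pairs))
# prints {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 1}
```

Passing `positive="NRTV"` swaps the roles of the two classes, which is how per-class results for a two-class problem can be obtained from the same label pairs.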


The performance measures used were Precision (Pr), Eq. (1), Recall (Re), Eq. (2), and the F-measure (F), Eq. (3). These measures evaluate how satisfactory the answers retrieved by an information retrieval system are. Precision is the fraction of retrieved documents that are relevant, and Recall is the fraction of all relevant documents available in the database that are retrieved. The F-measure is the harmonic mean of precision and recall [24].

$$Pr = \frac{TP}{TP + FP} \qquad (1)$$

$$Re = \frac{TP}{TP + FN} \qquad (2)$$

$$F = \frac{2 \cdot Pr \cdot Re}{Pr + Re} \qquad (3)$$
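As a worked illustration of Eqs. (1) to (3), the Python sketch below computes the three measures from confusion-matrix counts, assuming the standard definitions Pr = TP/(TP+FP), Re = TP/(TP+FN) and F as the harmonic mean of the two. The function names, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for this example, not values from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. (1): fraction of objects predicted as positive that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 on empty denominator (assumed convention)


def recall(tp: int, fn: int) -> float:
    """Eq. (2): fraction of truly positive objects that were predicted as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 on empty denominator (assumed convention)


def f_measure(pr: float, rec: float) -> float:
    """Eq. (3): harmonic mean of precision and recall."""
    return 2 * pr * rec / (pr + rec) if (pr + rec) else 0.0


# Made-up counts, for illustration only.
tp, fp, fn = 90, 10, 30
pr = precision(tp, fp)   # 90 / 100 = 0.9
rec = recall(tp, fn)     # 90 / 120 = 0.75
f = f_measure(pr, rec)   # 1.35 / 1.65, roughly 0.818
```

Because the harmonic mean is dominated by the smaller of its two arguments, F stays low whenever either precision or recall is low, which is why it is reported alongside both.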

The validation method used in this paper was k-fold cross-validation, which divides the database into k partitions, uses one partition for testing and the remaining k - 1 for training, and repeats this process until all partitions have been used for training and testing [25]. In the experiments performed here, k = 10 [16].

C. Results and Discussion

Table II summarizes the results of Pr, Re and F calculated for each class.

TABLE II. CBWSD PERFORMANCE.

Measure   RTV               NRTV
Pr        0.95 (± 0.007)    0.75 (± 0.07)
Re        0.97 (± 0.008)    0.65 (± 0.06)
F         0.96 (± 0.004)    0.69 (± 0.05)

The precision of the positive class is important because it indicates the rate of tweets correctly classified as related to TV. The goal of the proposed disambiguation method is to recover the largest possible number of correct tweets. The precision obtained for the positive class was 0.95 with a standard deviation of ± 0.007, indicating that 95% of the tweets the classifier labels as RTV are in fact related to the monitored program. The recall of 0.97 for the same class indicates that the classifier recovers 97% of the tweets related to the program.

To measure the overall classifier performance we used Accuracy, Eq. (4), which represents the success rate of the classification algorithm and corresponds to the number of correct classifications divided by the number of documents [16]:

$$ACC = \frac{TP + TN}{TP + FN + FP + TN} \qquad (4)$$

In addition to the accuracy, the false positive rate (FPR), Eq. (5), corresponds to the rate of negative-class objects incorrectly classified as positive by the algorithm [16]:

$$FPR = \frac{FP}{FP + TN} \qquad (5)$$

where FP (false positives) is the number of NRTV tweets classified as RTV, and TN (true negatives) is the number of NRTV messages classified as NRTV. Table III shows the FPR and the accuracy (ACC) obtained by the proposed classifier.

TABLE III. PERFORMANCE OF CLASS NRTV.

Measure   Results
FPR       0.35 (±0.06)
ACC       0.93 (±0.008)

The FPR value indicates the rate of NRTV tweets understood by the system as RTV. This measure should be low because NRTV tweets should be ignored during the monitoring; otherwise, they would compromise any quantitative analysis of the base. The classifier obtained an average FPR of 0.35 with a standard deviation of ±0.06. The average accuracy of the classifier was 0.93 with a standard deviation of ±0.008.

Another important aspect considered was the processing time of each tweet. Because the disambiguation must keep pace with message production, that is, provide a real-time response, an important concern in the preparation of this proposal was the time required for a tweet to go through all the analysis and classification steps in CBWSD. The average processing time was 20 milliseconds per tweet, which corresponds to a throughput of 50 tweets per second, or 3,000 tweets per minute.

V. CONCLUSION AND FUTURE TRENDS

Polysemic words occur in separate groups of settings according to their different senses, which opens the possibility of using information about their context for disambiguation [26]. This paper proposed a context-based disambiguation method for tweets, called CBWSD, and showed as an example application the disambiguation of a Brazilian TV program. CBWSD consists of a text classifier that uses context as the knowledge base in order to improve the classification process for a specific topic. In addition, the classifier does not require previous training on a sample base, thus eliminating a usually costly step of classification algorithms. The disambiguation used in monitoring Twitter messages must be not only effective but also provide a response in near real time. The proposed method showed an average accuracy of 93% and a response time of around 20 milliseconds per tweet.


Among the many investigations still to be performed are the assessment of the system with a much larger set of tweets, including the simultaneous disambiguation of different terms, the comparison with other approaches from the literature, and the automatic design of the ontoTV and ontoProgram dictionaries.

ACKNOWLEDGMENTS

The authors thank Mackenzie University, Mackpesquisa, CNPq, Capes and FAPESP for the financial support.

REFERENCES

[1] D. G. S. L. G. Boyd, "Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter," Proceedings of the 2010 43rd Hawaii International Conference on System Sciences, pp. 1-10, 2010.
[2] H. Kwak, "What is Twitter, a social network or a news media?," Proceedings of the 19th international conference on World wide web, pp. 591-600, 2010.
[3] Datasift, "Browse Data Sources - Twitter," 2012. [Online]. Available: http://datasift.com/source/6/twitter. [Accessed 29 April 2012].
[4] S. Salustiano, "O Profissional Analista," in Para entender o Monitoramento de Mídias Sociais, 2012, pp. 34-40.
[5] Elife, "Estudo Hábitos 2012," Elife, 2012. [Online]. Available: http://elife.com.br/cadastropapers/?c=estudo-habitos-2012. [Accessed August 2012].
[6] J. Marasanapalle, T. Vignesh, P. Srinivasan and A. Saha, "Business Intelligence From Twitter For The Television Media: A Case Study," Business Applications of Social Network Analysis BASNA 2010, 2010.
[7] T. Lake, "Twitter Sentiment Analysis," Kalamazoo, 2011.
[8] L. Sarmento, "Agrupamento de contextos de palavras polissêmicas," [Online]. Available: http://paginas.fe.up.pt/~las/conteudo/pub/pln/prodei/ec_luis_sarmento.pdf. [Accessed 27 April 2012].
[9] C. Zavaglia, "Base de Conhecimento Léxico-Ontológico para o Português do Brasil: uma proposta de modelo," 2003. [Online]. Available: http://www.lbd.dcc.ufmg.br/colecoes/til/2003/004.pdf. [Accessed 9 August 2012].
[10] H. Anaya-Sánchez, A. Pons-Porrata and R. Berlanga-Llavori, "Using Sense Clustering for the Disambiguation of Words," Polibits, pp. 23-28, 2009.
[11] E. Palta, "Word Sense Disambiguation," 2007.
[12] R. Navigli, "Word Sense Disambiguation: A Survey," ACM Comput. Surv., pp. 10:1--10:69, 2009.
[13] X. Z. J. Han, "Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge," Proceedings of the 18th ACM conference on Information and knowledge management, pp. 215-224, 2009.
[14] I. H. Witten, "Text mining," in Practical handbook of internet computing, Florida: Chapman & Hall/CRC Press, 2005, pp. 14-1 to 14-22.
[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufman, 2001.
[16] M. M. S. O. S. S. I. N. H. Yoshida, "ITC-UT: Tweet Categorization by Query Categorization of On-line Reputation Management," in CLEF (Notebook Papers/LABs/Workshops), 2010.
[17] D. Valenti, "Filter keywords and majority class strategies for company name disambiguation in Twitter," 2011.
[18] A. S. W. V. A. M. J. W. L. A. D. S. A. Davis, "RT-NED: Real-time named entity disambiguation on Twitter streams," XXVI Simpósio Brasileiro de Banco de Dados - Sessão de Demos, pp. 43-48, 2011.
[19] A. N. F. Serafimo, C. C. D. D. Cunha and M. P. D. B. Silva, "REDES SOCIAIS E MICROBLOGS EM UNIDADES DE INFORMAÇÃO: explorando o potencial do twitter, do ning e do foursquare como ferramentas para promoção de serviços de informação," Anais 33º ENEBD, 2010.
[20] Twitter, "Twitter Libraries," 6 July 2006. [Online]. Available: https://dev.twitter.com/docs/twitter-libraries. [Accessed 10 July 2012].
[21] Twitter4J, "Twitter4J - A Java library for the Twitter API," [Online]. Available: http://twitter4j.org/en/index.html. [Accessed 20 July 2012].
[22] J. Wainer, "Confusion Matrix," Unicamp, 2 May 2008. [Online]. Available: http://www.ic.unicamp.br/~wainer/cursos/1s2012/mc906/Confusion.pdf. [Accessed 13 August 2012].
[23] I. H. Witten, "Text mining," in Practical handbook of internet computing, Florida: Chapman & Hall/CRC Press, 2005, pp. 14-1 to 14-22.
[24] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in International Joint Conference on Artificial Intelligence, vol. 14, pp. 1137-1145, 1995.
[25] T. e. N. M. Pardo, "Segmentação Textual Automática: Uma Revisão Bibliográfica," 2002.
[26] P. Chagas, "UM OLHO NA TV E OUTRO NO COMPUTADOR: repercussão de produtos televisivos no twitter," Revista Científica do Departamento de Comunicação Social da Universidade Federal do Maranhão - UFMA, pp. 149-160, 2010.
[27] J. T. V. S. P. S. A. Marasanapalle, "Business Intelligence From Twitter For The Television Media: A Case Study," Business Applications of Social Network Analysis BASNA 2010, 2010.
[28] R. Kohavi and F. Provost, "Glossary of terms," Machine Learning, vol. 30, pp. 271-274, 1998.
