Contributions to Topic Extraction and Tracking


Julien Velcin 1, Jean-Hugues Chauchat 2

1 ERIC Lab, University of Lyon, [email protected]
2 ERIC Lab, University of Lyon, [email protected]

Abstract. In this paper, we present our recent contributions in the field of text mining, especially topic extraction and tracking. After a brief overview of the state of the art, we present a complete system for extracting topics and finding understandable keyphrases to label them, together with a platform for fetching information from forums (either RSS feeds or Web sites) and for analyzing online discussions. We also report current work and preliminary results on tracking topics through various information sources and on modeling the evolution of topics over time. The crucial point of validating topic models is discussed, and an important part of the paper is devoted to the future work in which we are interested.

Keywords: text mining, topic extraction, machine learning

1. Introduction

Blogs, RSS feeds, news wires: all these media transmit the information produced every day by news agencies, official entities, or simple Web users. To deal with such a huge amount of (social) information, the process must be made automatic. Data mining is precisely the research field that provides technologies to address this issue. In particular, Web and text mining provide different but effective tools to tackle the great issues of the "semantic Web 2.0" (Stavrianou et al., 2007): structuring and summarizing the information flow at a high level (that is, readable by human beings), making fast and reliable inferences, and making accurate and well-motivated decisions. Researchers of the ERIC Lab address these exciting issues by drawing on their experience in both computer science and statistics. In this paper, we detail our recent work related to text and Web mining.

Much effort is being put worldwide into developing topic models that extract the underlying meaning of textual collections, and it is difficult to know which model will fit a specific problem and how to tune it. Our first contribution therefore deals with two hot topics: evaluating the outputs of such models and automatically labeling the extracted topics. It has been used in an application that analyzes Web discussions. Our second contribution addresses the problem of tracking the evolution of topics through time and across sources: we try to take into account both the temporal dimension of the textual corpora and the sources that generate these texts. This work is still in progress, but preliminary results and reflection suggest interesting tracks to investigate.

Apart from this introduction, the paper is organized as follows. Section 2 gives an overview of modern models for extracting topics from textual datasets, in both static and dynamic contexts. Section 3 presents our previous contributions, especially a complete system for topic extraction. Section 4 presents current work and preliminary results, and highlights perspectives that could lead to novel research projects in collaboration with academic and industrial partners.

2. Background

2.1 Extracting Topics from Textual Datasets

In the recent literature, many approaches to topic extraction are based upon probabilistic models. Well-spread methods include pLSI (probabilistic Latent Semantic Indexing) (Hofmann, 1999) and LDA (Latent Dirichlet Allocation) (Blei et al., 2003). The core idea of probabilistic models lies in the assumption that the observed texts are derived from a generative model in which unseen latent variables (the topics) generate words and documents. The latent variables are represented as random variables over the set of words. Thereby, given a dataset, these models estimate the probability distributions of the latent variables using maximum likelihood or Bayesian inference. On top of this basic approach, many other generative models have been developed to address topic extraction in more complex cases, such as dynamic data (Wang et al., 2008), correlated topics (Blei and Lafferty, 2007), n-gram based approaches (Wang et al., 2007), social networks (Chang et al., 2009), opinion mining (Mei et al., 2007), and so on. Many applications can benefit greatly from such topic models; in particular, they can be highly useful for automatic ontology building and evolution (Rizoiu and Velcin, 2011).

Apart from these probabilistic approaches, other kinds of models have been developed for extracting topics from texts. For instance, matrix factorization models have generated many research contributions. This family of techniques is based on the spectral analysis of the observed occurrence (document/term) matrices, using tools such as singular value decomposition or eigendecomposition. One example of this type of model is Latent Semantic Indexing (Berry et al., 1994). Another approach to topic extraction worth mentioning here is AGAPE (A General Approach for concePt Extraction) (Velcin and Ganascia, 2007).
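To make the probabilistic approach concrete, the following minimal sketch, assuming the gensim library is installed, estimates an LDA model on a toy corpus and prints the word distributions of the inferred topics; the corpus, parameter values and variable names are ours, chosen purely for illustration.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of tokens (a real pipeline would
# add tokenization, stop-word removal, lemmatization, etc.).
docs = [
    ["bank", "loan", "credit", "interest", "rate"],
    ["loan", "bank", "credit", "money"],
    ["match", "goal", "team", "player"],
    ["team", "player", "coach", "goal"],
]

dictionary = corpora.Dictionary(docs)            # word <-> id mapping
bow = [dictionary.doc2bow(doc) for doc in docs]  # sparse word counts

# Fit a 2-topic LDA model; each topic is a distribution over words.
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=50, random_state=0)

for tid, words in lda.show_topics(num_topics=2, num_words=4,
                                  formatted=False):
    print(tid, [(w, round(p, 2)) for w, p in words])

# Each document is, in turn, a distribution over topics.
print(lda.get_document_topics(bow[0]))
```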

2.2 Evolution of Topics over Time

Regarding prior art related to event detection in streams of text data, many research works in text mining have investigated this task in different contexts (Brants and Chen, 2003; Margineantu et al., 2010). In this regard, we can quote the following communities: Topic Detection and Tracking (TDT), Emerging Trend Detection (ETD) and Temporal Text Mining (TTM).

The TDT community (Allan, 2002) focused its research efforts on broadcast news understanding. More particularly, the core aim was to develop systems able to monitor streams of broadcast news. To this end, the paradigm followed by the participants was to provide an event-based organization of the news articles. In this perspective, they defined an event as something happening in the world at a given place and a given time. The systems are designed to automatically detect new events in the stream of texts and to track all upcoming texts related to those new events.

Next, in the ETD approach (Kontostathis et al., 2003), one important concept is that of a trend: "a topic area that is growing in interest and utility over time". A system called Technology Opportunities Analysis (TOA) uses a semi-automatic approach that relies on queries designed by experts of the studied domain. The retrieved data are then analyzed with a bibliometric approach based on word occurrences and co-occurrences, citations, associated publications, and so on. Other systems worth mentioning are Cimel (Constructive, Collaborative Inquiry-based Multimedia E-Learning), PatentMiner, UnexpectedMiner, and so on.

Finally, the TTM approach (Mei and Zhai, 2005) argued that the aforementioned techniques did not emphasize the temporal information sufficiently. Its authors were interested in discovering, extracting and summarizing theme patterns and their evolution over time, also called Evolutionary Theme Patterns (ETP). In TTM, themes are seen as distributions over words that have a semantic coherence, as in pLSI and LDA. However, the novelty of their proposal lies in the definition of a so-called theme span, which associates each theme with a time window. A divergence measure is then used to characterize evolutionary transitions between two consecutive time windows.

3. A Whole System for Topic Extraction

Through the different projects in which we are involved, we have set up a platform linked to a relational database. The objective is to fetch the information flow that passes through a specified list of blogs, forums, RSS feeds, etc. This platform gives us access to the datasets on which we can run our machine learning algorithms; in particular, we are able to extract and label topics to sum up a whole collection of texts. We detail these two aspects in the following.

3.1 Setting up a Platform for Fetching Information from Forums

In the ERIC Lab, we have designed a relational database and a full process for loading and exploiting it (see Fig. 1). The application is based on a Tomcat server and Java technology; the mediation format is XML. The first step (1) is to fetch the data from the Internet into the database. The sources can be either RSS feeds or Web sites. RSS feeds are regularly crawled (say, each night), whereas each forum needs a specific dedicated parser, which can become very expensive depending on the task. The second step (2) consists in extracting the dataset we are interested in; this is easily done using relational queries. In the third step (3), we run our machine learning algorithms to extract useful information.

Fig. 1. The whole process from the Web to useful views on the data.
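The platform itself is built on Tomcat, Java and XML; purely as an illustration of steps (1) and (2), the sketch below, assuming the third-party feedparser library, crawls a list of RSS feeds into a small relational table so that dataset extraction reduces to SQL queries. The feed URL and the schema are placeholders of ours.

```python
import sqlite3
import feedparser  # third-party RSS/Atom parser, assumed installed

FEEDS = ["http://example.com/news.rss"]  # placeholder list of sources

def init_db(path="platform.db"):
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS post (
                       link      TEXT PRIMARY KEY,
                       feed      TEXT,
                       title     TEXT,
                       published TEXT,
                       summary   TEXT)""")
    return con

def crawl(con):
    """Step (1): fetch every registered feed (run nightly, e.g. by cron);
    the PRIMARY KEY on link makes re-crawling idempotent."""
    for url in FEEDS:
        for e in feedparser.parse(url).entries:
            con.execute("INSERT OR IGNORE INTO post VALUES (?, ?, ?, ?, ?)",
                        (e.get("link"), url, e.get("title"),
                         e.get("published", ""), e.get("summary", "")))
    con.commit()

con = init_db()
crawl(con)
# Step (2): extracting a dataset of interest is then a relational query.
rows = con.execute("SELECT title, summary FROM post WHERE feed = ?",
                   (FEEDS[0],)).fetchall()
```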

3.2 Our Proposed System for Extracting Topics

Suppose we have a set of texts dealing with the economic consequences of a political decision. We would like a tool capable of extracting the key topics associated with reactions to the decision, automatically and with minimal prior knowledge of the texts. This tool should group the texts into categories and then propose one or more names for each category. The topics could then be: "government policy", "opposition proposals", "union responses", "economic efficiency", "social justice", etc. Since natural-language texts are naturally associated with several topics, we chose to develop a system capable, if necessary, of classifying the same text into several categories. In (Rizoiu et al., 2010) we proposed a comprehensive modular solution covering the whole processing chain, from raw textual data to overlapping classes of texts (the "topics") and a characterization of each class.

In order to obtain overlapping classes, we use Overlapping K-Means (OKM), an extension of the classical K-Means algorithm proposed by (Cleuziou, 2007). It shares the same principles and aims at minimizing a distortion-based criterion function; the main difference from K-Means is that a document may be assigned to multiple clusters (see the sketch below). The difference with approaches such as LDA is that every document is associated with exactly a subset of topics, not with a probabilistic mixture over topics. We have performed several experiments that show the effectiveness of our approach; in particular, we ran additional experiments in which we varied the measure used to describe the textual data.

Although grouping documents is already a good way to organize a collection of texts (Pons-Porrata et al., 2007), it becomes necessary to synthesize the information when the collection grows too large. One solution is to provide users with an understandable description of the categories. A good description for a category is a pattern containing consecutive words whose meaning goes beyond the single words (e.g. "data mining"). These meaningful word sequences are usually called keyphrases. Note that such a description may include prepositions and articles that are meaningful to the human reader (e.g. "system of information"). A keyphrase is then "a sequence of one or more words deemed relevant when taken together", while a keyword is "a single word very relevant" (Hammouda et al., 2005). To provide the user with a readable description of the categories, we extract a set of candidate names from the untreated corpus of documents. Our approach is quite similar to (Osinski, 2003), except for the clustering paradigm considered.
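Returning to the clustering step, below is a minimal sketch of the OKM assignment heuristic on vectors (e.g., tf-idf representations of documents): each point is greedily assigned to the subset of centroids whose mean, its "image", is closest to it. The update step shown here is a simplification of (Cleuziou, 2007), which weights documents by their number of assignments; variable names are ours.

```python
import numpy as np

def okm_assign(x, centroids):
    """Greedily add centroids, nearest first, as long as the point
    gets closer to the mean ("image") of its assigned centroids."""
    order = np.argsort(np.linalg.norm(centroids - x, axis=1))
    assigned = [order[0]]
    best = np.linalg.norm(x - centroids[order[0]])
    for c in order[1:]:
        image = centroids[assigned + [c]].mean(axis=0)
        if np.linalg.norm(x - image) < best:
            best = np.linalg.norm(x - image)
            assigned.append(c)
        else:
            break
    return set(assigned)

def okm(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = [okm_assign(x, centroids) for x in X]
        # Simplified update: each centroid becomes the mean of all
        # documents assigned to it (a document may count in several).
        for j in range(k):
            members = [x for x, a in zip(X, labels) if j in a]
            if members:
                centroids[j] = np.mean(members, axis=0)
    return labels, centroids
```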

The output topic descriptions, when presented to human judgment, are much more understandable than word sets alone.

Comparing the proposed topic extraction models remains rather difficult, especially for models built on different theoretical bases, and a lot of effort is currently put into evaluating topic models. Recent work has shown that numerical measures alone cannot solve this issue; human judgment can be of great benefit in the task of topic evaluation (Chang et al., 2009). Very recent work (Newman et al., 2010) uses external resources, especially the Web, as an alternative evaluation measure, and deems working with semantic resources unfeasible. No previous work has successfully used semantic sources, especially ontologies, to evaluate the meaning of topic models. We are currently working with a lab of the Universitatea Politehnica Bucuresti (Romania) to propose an automatic evaluation of topic models using prior ontological knowledge (here, WordNet). We have designed an evaluation model based on a new notion of conceptual relevance. A first run of experiments involving almost 40 people shows a correlation between our model and human judgment. This very recent work has led to a paper submitted to a high-level international conference (Musat et al., 2011).
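The conceptual-relevance measure itself is defined in the submitted paper; simply to illustrate the general idea of scoring a topic against prior ontological knowledge, the toy function below, assuming NLTK's WordNet interface, averages the best pairwise path similarities between a topic's top words. A coherent topic should score higher than a random word set.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def wordnet_topic_score(top_words):
    """Toy proxy for ontology-based evaluation: mean over word pairs of
    the best path similarity between any of their WordNet synsets."""
    scores = []
    for w1, w2 in combinations(top_words, 2):
        pairs = ((s1, s2) for s1 in wn.synsets(w1) for s2 in wn.synsets(w2))
        best = max((s1.path_similarity(s2) or 0.0 for s1, s2 in pairs),
                   default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

print(wordnet_topic_score(["bank", "loan", "credit", "money"]))     # coherent
print(wordnet_topic_score(["bank", "guitar", "senate", "enzyme"]))  # mixed
```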

4. Tracking Topics through Various Information Sources

4.1 Preliminary Work for Tracking Topics

In the context of TDT and TTM, we have done preliminary work on studying the temporal evolution of topics in textual corpora. We proposed a formal framework for modeling topic evolution over time (Forestier et al., 2009). This descriptive framework is based on set theory. Given published news articles pre-classified into chronological periods, we defined five operators for topic evolution: simple evolution, fusion, split, appearance and disappearance (see the sketch below). For information-watch purposes, these operators are the basis of specific alerts, such as "loss of interest" and "fusion by absorption". We experimented with this framework on a corpus from AFP (the French press agency), and the exhibited alerts were qualitatively validated. These first and highly interesting results are just a beginning; they raise many perspectives. Our current interest is mainly in taking into account the various sources the information comes from, as discussed later in this paper.
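The framework of (Forestier et al., 2009) is defined formally; the sketch below is only a simplified, set-based reading of the five operators, in which topics are word sets per period, matched across consecutive periods by a Jaccard threshold. The threshold value and all names are ours.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def evolution_events(prev, curr, threshold=0.3):
    """prev, curr: dicts {topic_id: set of words} for two consecutive
    periods; returns (operator, detail) pairs for the five operators."""
    links = [(p, c) for p in prev for c in curr
             if jaccard(prev[p], curr[c]) >= threshold]
    events = []
    for c in curr:
        sources = [p for p, c2 in links if c2 == c]
        if not sources:
            events.append(("appearance", c))
        elif len(sources) > 1:
            events.append(("fusion", (sources, c)))  # several topics merge
    for p in prev:
        targets = [c for p2, c in links if p2 == p]
        if not targets:
            events.append(("disappearance", p))
        elif len(targets) > 1:
            events.append(("split", (p, targets)))   # one topic divides
        elif len([p2 for p2, c in links if c == targets[0]]) == 1:
            events.append(("simple_evolution", (p, targets[0])))
    return events

before = {"t1": {"bank", "crisis", "loan"}, "t2": {"election", "vote"}}
after_ = {"u1": {"bank", "crisis", "bailout"}, "u2": {"strike", "union"}}
print(evolution_events(before, after_))
```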

4.2 Analysis of Online Discussions

With the development of the interactive Internet, a multitude of debates and forums are being established at the initiative of private citizens, traditional media, companies or public institutions. But the organizers of these forums have few tools to exploit them, and users who enter a forum have a hard time grasping the topics already discussed and the structure of the debate; the current analysis tools are poor and incomplete. The abundance and popularity of online discussion systems, which come in the form of forums, blogs or newsgroups, has pointed out the need for analyzing and mining such systems. It is significant to monitor how users behave and interact with each other, their ideas and opinions on certain subjects, their preferences and general beliefs. Our aim is to capture the dynamics of a debate by using multiple types of information and modeling: the content of the texts, the topics, the chronology, the embedded social network, etc.

Most existing works view an online discussion as a network in which users meet and contact each other, form communities and acquire certain roles. Forums are usually modeled by a graph whose vertices represent users, connected according to who speaks to whom; such graphs are analyzed with social network techniques (Carrington et al., 2005). A. Stavrianou's PhD thesis (Lyon, 2010) presents theoretical work carried out with the purpose of looking at the social network developed in a forum from a semantic and opinion-oriented point of view. Her contribution is a new model, complementary to the social network model, that can be applied to an online discussion together with the social network model in order to enrich the information extracted from the discussion. The proposed model, named Post-Reply Opinion Graph (PROG), goes beyond the exploitation of the developed user network: it emphasizes the structure of the discussion while including content-oriented techniques (topic extraction, opinion mining), thus combining text and opinion mining techniques with social network analysis concepts. The main novelty of this proposal is that it integrates into the model the structural information of the discussion and the opinion content of the exchanged forum postings, information that is lost when a forum is represented by a social network model alone. Stavrianou defines measures that provide information regarding the opinion flow and the general attitude of users, and towards users, throughout the whole forum. The application of the proposed model to real forums shows the additional information that can be extracted and the interest of combining the social network and PROG models. Future work will pass from the theoretical to the experimental state by performing large-scale experiments with real forums.

The information extracted by the PROG model can be used in many ways. We have experimented with using it to rank forum messages from the most to the least interesting, combining several criteria such as how many reactions a message causes, whether it receives reactions that contain opinions, and whether these opinions have the same strength or not. Initial results are promising but more extensive experiments are needed. One future objective is to better take into account both the time dimension and the topics in our model. This will permit monitoring how opinion changes over time: we could observe whether a product improves as time passes, whether people become more satisfied with certain services, or even whether people are finally convinced after a long discussion in a forum. Furthermore, an interesting future issue is to combine the social network and PROG models for an improved discussion analysis. For example, we could extract the experts (Zhang et al., 2007) of the discussion domain through the social network representation, and then use this information to extract from the PROG model their attitude or the discussion chains in which they have participated.
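As a rough illustration only of the kind of structure the PROG model manipulates (the actual model and its opinion-flow measures are defined in the thesis), the sketch below stores posts with a single reply link and an opinion polarity, and ranks messages by how many reactions they trigger and how opinionated those reactions are. The dataclass fields and the scoring rule are ours.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Post:
    pid: int
    author: str
    polarity: float          # opinion score in [-1, 1], e.g. from a lexicon
    reply_to: Optional[int]  # single parent: the model's current restriction

def rank_posts(posts: List[Post]) -> List[Post]:
    """Toy interestingness: one point per reply received, plus the
    absolute opinion strength carried by each of those replies."""
    replies = {}
    for p in posts:
        if p.reply_to is not None:
            replies.setdefault(p.reply_to, []).append(p)
    def score(p):
        rs = replies.get(p.pid, [])
        return len(rs) + sum(abs(r.polarity) for r in rs)
    return sorted(posts, key=score, reverse=True)

thread = [
    Post(1, "ann", 0.0, None),   # opening message
    Post(2, "bob", -0.8, 1),     # strongly negative reaction to post 1
    Post(3, "cal", 0.6, 1),
    Post(4, "dee", 0.1, 2),
]
print([p.pid for p in rank_posts(thread)])  # post 1 ranks first
```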

The structure of the Post-Reply Opinion Graph allows the extraction of discussion threads and chains. This knowledge could be used in the future to improve the confidence of topic identification algorithms. For example, two messages that appear in the same discussion chain or thread have a higher probability of belonging to the same discussion topic. As a result, a topic identification algorithm could give a higher probability of belonging to the same topic to messages that not only have similar content but are also linked in the same chain or thread. In the future, we intend to work more on the automatic extraction of these relations between postings and the population of the graph with appropriate links. Additionally, our model currently captures cases where one message replies to one and only one message. In some cases, though, one message may respond to several messages; future work needs to adapt the model and the measures in order to capture this particularity.

4.3 Tracking the Information Life Cycles

Together with a political science research team, the ERIC laboratory is currently working on the analysis of public controversies as discussed in texts published in traditional media (newspapers, television) but also on the Web (blogs, forums). Two public controversies have been selected as sufficiently distant in their themes, their media coverage, and the nature of the groups of speakers and audiences they mobilize: the debate about "national identity" organized by the French government, and the debate about the new French law against the illegal copying of music and movies from the Web. We want to establish a diachronic mapping (2007-2010) of their media presence in order to observe the decisive moments of each public controversy and how each medium presents these issues. The main hypothesis of this research concerns the argumentative universe and the ways in which traditional media operate, as well as new collective users (media, blogs, forums, etc.). An internship will soon begin that uses this hypothesis to detect and track the life cycles of information.

5. Conclusion

This paper gave us the opportunity to present an overview of our recent contributions in text mining, especially topic extraction and tracking. To sum up, we have built a whole system designed to fetch information from the Web and to apply machine learning algorithms to the extracted information. We are currently working on integrating two crucial dimensions associated with the texts: their timestamps and their different origins. This could lead to viewing the information as a flow running through a kind of (social) network. These first contributions have been mainly developed within the University of Lyon 2, but we are highly interested today in collaborations that would give such a system a broader scope.

References

Allan, J. (2002). Topic Detection and Tracking: Event-Based Information Organization. Kluwer Academic Publishers.

Berry, M.W., Dumais, S.T., and O'Brien, G.W. (1994). Using linear algebra for intelligent information retrieval. Technical Report UT-CS-94-270.

Blei, D.M., Ng, A.Y., and Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Blei, D.M. and Lafferty, J.D. (2007). A correlated topic model of science. Annals of Applied Statistics, 1(1), 17-35.

Brants, T. and Chen, F. (2003). A system for new event detection. In SIGIR, pp. 330-337. ACM.

Carrington, P., Scott, J., and Wasserman, S. (2005). Models and Methods in Social Network Analysis. New York: Cambridge University Press.

Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., and Blei, D.M. (2009). Reading tea leaves: How humans interpret topic models. In Proceedings of the Conference on Neural Information Processing Systems.

Chang, J., Boyd-Graber, J., and Blei, D.M. (2009). Connections between the lines: Augmenting social networks with text. In Proceedings of Knowledge Discovery and Data Mining.

Cleuziou, G. (2007). OKM : une extension des k-moyennes pour la recherche de classes recouvrantes. In M. Noirhomme-Fraiture and G. Venturini (Eds.), EGC, Volume RNTI-E-9 of Revue des Nouvelles Technologies de l'Information, pp. 691-702. Cépaduès-Éditions.

Forestier, M., Velcin, J., and Ganascia, J.G. (2009). Un cadre formel pour la veille numérique sur la presse en ligne. In Atelier Veille Numérique (EGC-VN 09), Strasbourg, January 2009.

Hammouda, K.M., Matute, D.N., and Kamel, M.S. (2005). CorePhrase: Keyphrase extraction for document clustering. In MLDM 2005, pp. 265-274.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57.

Kontostathis, A., Galitsky, L.M., Pottenger, W.M., Roy, S., and Phelps, D.J. (2003). A survey of emerging trend detection in textual data mining. In Survey of Text Mining: Clustering, Classification, and Retrieval.

Margineantu, D.D., Wong, W.K., and Dash, D. (2010). Machine learning algorithms for event detection. Machine Learning, 79(3), 257-259.

McCallum, A., Corrada-Emmanuel, A., and Wang, X. (2005). Topic and role discovery in social networks. In Proceedings of IJCAI.

Mei, Q. and Zhai, C.X. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 198-207. ACM, New York, NY, USA.

Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C.X. (2007). Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of the 16th International Conference on World Wide Web. ACM.

Musat, C., Velcin, J., and Trausan-Matu, S. (2011). Improving topic evaluation using conceptual knowledge. Just submitted; the name of the conference is omitted because of blind review.

Newman, D., Lau, J.H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic coherence. In The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100-108, June 2010.

Osiński, S. (2003). An Algorithm for Clustering of Web Search Results. Master's thesis, Department of Computing Science, Poznań University of Technology.

Pons-Porrata, A., Berlanga-Llavori, R., and Ruiz-Shulcloper, J. (2007). Topic discovery based on text mining techniques. Information Processing and Management, 43(3), 752-768.

Rizoiu, M.A., Velcin, J., and Chauchat, J.H. (2010). Regrouper les données textuelles et nommer les groupes à l'aide de classes recouvrantes. In Actes des 10èmes journées francophones en Extraction et Gestion des Connaissances (EGC), Hammamet, Tunisie.

Rizoiu, M.A. and Velcin, J. (2011). Topic extraction for ontology learning. In Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances. IGI Global. To appear.

Stavrianou, A., Andritsos, P., and Nicoloyannis, N. (2007). Overview and semantic issues of text mining. SIGMOD Record, 36(3), 23-34.

Velcin, J. and Ganascia, J.G. (2007). Topic extraction with AGAPE. In Proceedings of the International Conference on Advanced Data Mining and Applications (ADMA).

Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery with an application to information retrieval. In Proceedings of ICDM.

Wang, C., Blei, D., and Heckerman, D. (2008). Continuous time dynamic topic models. In The 23rd Conference on Uncertainty in Artificial Intelligence.

Zhang, J., Ackerman, M., and Adamic, L. (2007). Expertise networks in online communities: Structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web, pp. 221-230.