Topic Retrospection with Storyline-based Summarization on News ...

National Sun Yat-sen University
Department of Information Management
Master's Thesis

Topic Retrospection with Storyline-based Summarization on News Reports

Graduate student: 梁家豪
Advisor: Dr. 林福仁

July 2005

Abstract
Electronic newspapers have become a main source for online news readers. Faced with numerous stories, news readers need support in order to review a topic in a short time. Because previous research in TDT (Topic Detection and Tracking) considered only how to identify events and presented the results as news titles and keywords, a summarized text that presents event evolution is needed for general news readers to retrospect events under a news topic. This thesis proposes a topic retrospection process and implements the SToRe system, which identifies the various events under a news topic and constructs the relationships among them to compose a summary that gives readers a sketch of event evolution in the topic. The system consists of three main functions: event identification, main storyline construction, and storyline-based summarization. The constructed main storyline removes irrelevant events and presents a main theme. The summarization extracts representative sentences and takes the main theme as the template for composing the summary. The summary not only provides enough information to comprehend the development of a topic, but can also serve as an index that helps readers find more detailed information. A lab experiment is conducted to evaluate the SToRe system in a question-and-answer (Q&A) setting. The experimental results show that the SToRe system helps news readers capture the development of a topic more effectively and efficiently.
Keywords: topic retrospection, Topic Detection and Tracking (TDT), event threading, summarization

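Chinese Abstract
As electronic news databases have matured, they have become an important information source for online news readers. Yet when users face a large number of news reports, there is still no adequate mechanism to help them review a past topic in a short time. Previous research on Topic Detection and Tracking (TDT) considered only how to detect events and presented the results as lists of news titles and keywords. This study holds that a summarization mechanism based on the event storyline can more effectively help readers grasp how events developed in a short time.
This study therefore proposes a mechanism that detects the events in a topic, constructs the relationships among them, and uses those relationships to summarize a topic retrospection report that lets users quickly understand the topic's development. The mechanism has three main parts: event identification, main storyline construction, and storyline-based summarization. The constructed main storyline provides the backbone of the topic's development and excludes events of low relevance. The summary, composed by finding representative sentences and using the main storyline as a template, not only provides enough information to understand the topic's development but can also serve as an index that helps users find more detailed information.
This study uses a laboratory experiment with a question-and-answer format to validate the proposed mechanism. The experimental results show that the mechanism lets news readers learn how events developed more effectively and efficiently.
Keywords: topic retrospection, event detection and tracking, event threading, summarization mechanism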

Chapter 1 Introduction
1.1 Research Background
The prevalence of Internet technologies such as the World Wide Web (WWW) eases the aggregation and dissemination of information. For example, online news spreads across the Internet because of its responsiveness and customization features. News readers can query past news by accessing document bases and can edit individual clipboards. Even with these easy-to-query document bases and personal clips, online news readers still face information overload. Research efforts have been devoted to coping with this problem, such as personal recommendation (Brown and Duguid, 2000). The amount of information that people can effectively consume is limited. With the continuing rapid expansion of online information, it has become increasingly important to provide improved mechanisms to find textual information efficiently and present it effectively. Many technologies have been proposed to address information overload, including search engines (information retrieval), information agency, and information customization (Berghel, 1997). Information retrieval systems use keyword search or query expansion (Hsueh, 2002) with ranking mechanisms to find related documents and sort them by rank.

In addition, text mining techniques have been used to discover interesting patterns in unstructured text. Classification techniques distinguish relevant documents within document sets. Clustering mechanisms can group related documents that correspond to the same topic. The Topic Detection and Tracking (TDT) project sponsored by DARPA provides benchmarks for comparing systems that address topic detection and tracking tasks for manipulating and organizing broadcast news and newswire stories (NIST, 2004). Topic detection is the task of grouping together the articles that correspond to the same topic. Moreover, topic tracking is the process of monitoring a stream of news stories to find those reports that track (or discuss) the same topic specified by a user. Although many mechanisms have been proposed to handle news, they are not sufficient to resolve information overload.

1.2 Research Motivation
Ho and Tang (2001) indicate that information overload is associated not only with the quantity of information but also, just as importantly, with its format and quality. Search engines, although fast and comprehensive, still present their results as lists of hits that the user must read through to distinguish the rare relevant items from garbage. Search engines focus on only one dimension, information quantity, while the mechanisms developed in TDT research focus only on information quality. People cannot quickly grasp the sketch of a topic from clustered news sets. Most studies ignore information format and present their results as groups of documents that lack an overview presentation for news readers, i.e., a storyline or main theme. Storyline-based summarization, which presents the main theme of a news topic and guides a reader through how the topic evolves, is a better information format. Through the summary, news readers can quickly get the sketch of the topic. Without the main theme, readers who are not familiar with the target topic have no idea which event in the grouped results they should read first. For instance, suppose a person wants to understand the development of the 2004 Taiwan presidential election. Facing the clustered news sets of “Referendum Act”, “The New Ten Major Construction Projects”, “KMT’s Asset”, “March 19 Shooting”, etc., s/he does not know which event is the most important, how it started, or what its turning point and consequence were. The person also does not know which events are related to a given event and should be read together with it. Consequently, to review a series of events under one topic, storyline-based summarization is necessary to help users in topic retrospection.

Besides, clustered news sets do not provide a meaningful abstract for users. Most studies (Swan & Allan, 2000; Azcarraga & Yap, 2001; Smith, 2002; Doran, Stokes, Newman, Dunnion & Carthy, 2004; Shih, Chang, Chen, Ho & Kao, 2004) use a set of keywords to highlight the themes of a clustered set. In most cases, however, users need to look into individual documents to grasp the whole picture of the events. Moreover, keywords present fragmentary information, and people must compose meaningful sentences from them using their own domain knowledge. For instance, keywords from “March 19 Shooting” may be “Pan-blue”, “Shoot” and “President Chen”. Many different readings can be inferred from these keywords, and some of them will mislead users. Therefore, summarization mechanisms that support meaningful topic retrospection are still lacking.

1.3 Research Objectives
In this research, we propose a news topic retrospection mechanism that provides a systematic approach to reviewing the events reported in clustered news sets under a topic. With the retrospection mechanism, online news readers can easily learn the main theme, i.e., the evolution of a news topic and its interdependent events. In addition, news readers are more familiar with reading text than with reading a graph that presents topic development. Simply representing the main theme as a directed graph does not meet readers’ demand. Hence, this study combines the main theme with a summarization mechanism that extracts representative sentences from clustered news sets and composes a summary to enrich topic retrospection. News readers can thus gain a general perspective before reading the news articles. If the topic retrospection system is successfully implemented, it will give news readers a storytelling capability for understanding the context of topic development and, in turn, reduce information overload.


1.4 Thesis Organization
Chapter 2 reviews current approaches to topic retrospection and the related techniques that inform the proposed mechanism. Chapter 3 describes the problem modeling and assumptions in more detail. Chapter 4 explains the topic retrospection process in detail. Chapter 5 presents the implementation of the SToRe system and the experimental design. Chapter 6 analyzes the experimental results. Finally, Chapter 7 concludes the research with its limitations and future work.


Chapter 2 Literature Review
This chapter reviews the literature related to current research on topic retrospection and to the development of the proposed mechanism. Specifically, it reviews topic detection and tracking, self-organizing maps, event threading, and text summarization.

2.1 Topic Detection and Tracking
Topic Detection and Tracking (TDT) research has been pursued under the DARPA Translingual Information Detection, Extraction, and Summarization (TIDES) program (NIST, 2004). It aims to provide a variety of automatic techniques for discovering and threading together topically related material in streams of data such as newswire and broadcast news. TDT integrates research in information retrieval, information management, and data mining to devise powerful, broadly useful, fully automatic algorithms for determining the topical structure of human language data. The TDT Pilot Study in 1997 laid the essential groundwork by conducting three main tasks: segmentation, tracking, and detection. In 2004, the TDT program defined five research applications: (1) story segmentation: detect changes between topically cohesive sections; (2) topic tracking: keep track of stories similar to a set of example stories; (3) topic detection: build clusters of stories that discuss the same topic; (4) first story detection: detect whether a story is the first story of a new, unknown topic; and (5) link detection: detect whether or not two stories are topically linked. In TDT studies, a topic is defined as a seminal event or activity, along with all directly related events and activities (Franz & McCarley, 2001). Further, an event is defined as something (non-trivial) that happens at a particular time and place (Allan, Papka & Lavrenko, 1998). For instance, terrorist activity is a topic, and many events are related to it, such as the Oklahoma City bombing and the September 11 attacks.
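As a rough illustration of the link detection task listed above, the following Python sketch decides whether two stories are topically linked by comparing bag-of-words vectors with cosine similarity. It is only a sketch under stated assumptions: the function names, the simple tokenizer, and the threshold of 0.3 are hypothetical choices made for this example and are not taken from the TDT evaluation or from any system cited in this chapter.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count word occurrences (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty story is similar to nothing
    return dot / (norm_a * norm_b)


def topically_linked(story_a, story_b, threshold=0.3):
    """Hypothetical decision rule: linked if the similarity reaches the threshold.

    The threshold value is an assumption for illustration only.
    """
    return cosine_similarity(term_vector(story_a), term_vector(story_b)) >= threshold
```

Under these assumptions, two stories that share most of their vocabulary are judged as linked, while stories with no words in common are not. A practical system would normally weight terms (for example by inverse document frequency) and tune the threshold on labeled story pairs.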

Automatic discovery and threading have high potential in many applications where people need timely and efficient access to large quantities of information. Systems could alert users to newly occurring events or to new information about old events. By examining one or two stories in a grouped event, a user can decide whether to pay attention to the rest of an evolving thread. Similarly, a user can go to a large archive, find all the stories about a particular event, and learn how it evolved (Wayne, 2000).

2.1.1 Topic Detection
Topic detection is the task of grouping articles that correspond to the same topic. In contrast to topic tracking, topic detection has no prior information about or description of the topic. It therefore learns what the topic is by means of an unsupervised clustering algorithm (Ku, 2000; Franz et al., 2001). More precisely, topic detection consists of two tasks: retrospective detection and online detection. The former entails the discovery of previously unidentified events in an accumulated collection. It goes through a pre-collected corpus and groups the stories into clusters, each of which stands for an event. The latter strives to identify the onset of new events from live news feeds in real time (Yang, Pierce & Carbonell, 1998). Event detection can be regarded as a discovery problem of mining the data stream for new patterns in document content. It analyzes the features of documents and uses intra-cluster similarity as the criterion for assigning documents to the same group. A bottom-up clustering approach appears to be a natural solution, because it can reveal the hierarchical structure of a topic. Viewed from the top down, a topic can be divided into events, and an event can be split into sub-events. A higher level gives an aggregated overview of the topic, whereas a lower level gives a more detailed description. Furthermore, the feature co-occurrence approach faces some problems; for example, different reporters use different vocabularies to report the same event because they hold different views, and different events can share similar features. Researchers have adopted event-based and linguistic features to tackle these problems (Hatzivassiloglou, Gravano & Maganti, 2000; Wei & Lee, 2003). These approaches parse and tag phrases more precisely with part-of-speech (word class) and named-entity information. The former labels words as nouns, verbs, and adjectives, while the latter identifies noun phrases as people, places, and organizations. Adopting combined features can enhance system performance.
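The bottom-up clustering idea described above can be sketched in Python as follows. This is a minimal illustration, not the algorithm of any cited system: the group-average linkage, the cosine similarity over raw term counts, the stopping threshold of 0.2, and all function names are assumptions introduced for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_similarity(c1, c2, vectors):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def bottom_up_cluster(stories, threshold=0.2):
    """Merge the two most similar clusters until none reaches the threshold.

    Returns a list of clusters, each a list of story indices.
    The threshold is a hypothetical value chosen for illustration.
    """
    vectors = [term_vector(s) for s in stories]
    clusters = [[i] for i in range(len(stories))]  # start with one story per cluster
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = cluster_similarity(clusters[x], clusters[y], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Each merge corresponds to one level of the hierarchy: early merges join stories about the same sub-event, and later merges join sub-events into events. Recording the order of merges instead of stopping at a threshold would yield the full topic hierarchy.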

2.1.2 Topic Tracking
Topic tracking is the process of monitoring a stream of news stories to find those that track (or discuss) the same event as one specified by a user. When a new story arrives, the system judges whether it belongs to an existing topic or to a new topic. Generally speaking, a topic tracking system assigns stories to specific topics by supervised classification (Allan et al., 1998; Franz et al., 2001). First, it trains models that adopt different classification algorithms, e.g., decision tree induction, Bayesian networks, and nearest-neighbor matching, on a pre-labeled corpus. The trained model is then used to track newly incoming news. Among other mechanisms, Swan and Allan (1999) addressed the extraction of significant events using a simple statistical model of the frequency of feature occurrence. They used the χ² method to identify bursts of feature terms that appear more frequently at one time than at other times. Tracing of an event starts when the corresponding χ² value is greater than 7.879 (p