Topic Retrospection with Storyline-based Summarization on News ...

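National Sun Yat-sen University
Department of Information Management
Master's Thesis

Topic Retrospection with Storyline-based Summarization on News Reports

Graduate student: Chia-Hao Liang (梁家豪)
Advisor: Dr. Fu-ren Lin (林福仁)

July 2005 (Republic of China year 94)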

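Acknowledgments

The acknowledgments are the last part of a thesis to be written, and the hardest. Every stage of life begins with hesitation and unease, yet with the support, encouragement, and help of many people I was able to find my footing and keep going. The completion of this thesis marks the end of one such stage; without the help of so many people, I could not have finished the first written work of my life.

If doing research is like finding Nemo, then my advisor, Professor Fu-ren Lin, is like Dory in the film. He does not hand us the correct answer; instead he hopes that we will grasp the essence of research by working and thinking for ourselves. When help is needed, of course, he points out a feasible direction with a timely and humorous analogy. I thank him for guiding me along the path of research, for teaching me how to discover, analyze, and solve problems, and for showing me how to see the essence of research in small details. His dedication to his students is plain to see: even at 11 p.m. or midnight he would discuss research details and directions with us over net meeting, and he often traveled south from National Tsing Hua University to advise us in person. I am grateful for his care and guidance over these two years, which taught me a great deal and allowed me to complete my degree.

Although our lab has no junior students to succeed us, its members looked after one another like a big family. Without my classmates 建培, 鴻杰, 宗銘, 江釗, 建民, 宣龍, and 嘉慶, my graduate life would have been far less colorful. I will always remember the nights we worked until dawn, when nonsensical conversations relieved much of the pressure and our mutual encouragement carried us to the very end. The corrections and suggestions offered at group meetings by the doctoral students 祿適, 盛程, 旭立, 勢敏, and 仁德 cleared up many of my doubts and helped me adjust the direction and content of my research, so that it progressed more smoothly.

I thank my oral defense committee members, Professor 魏志平 and Professor 盧文祥, for taking time out of their busy schedules to review this thesis carefully and for their valuable suggestions and corrections, which made it more complete. I also thank Professor 郭峰淵 and Professor 廖達琪 for their guidance and support during my graduate studies; they offered many fresh ideas and insights and opened my mind to perspectives beyond the information field.

Last and most important, I thank my family. The youngest child of the family has finally graduated. I thank my parents for the hard work of raising me for more than twenty years, for giving me the freedom to choose, and for fully supporting every decision I made. When I faced difficulties and setbacks, their support and care let me regroup and set out again. Mom and Dad, thank you for all your hard work over these years; in the days to come, I will take over the baton from you. I dedicate this thesis to my dearest family.

Chia-Hao
July 2005 (Yiyou year), Sizihwan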


Thesis Summary

Academic year: 93
Semester: 2
University: National Sun Yat-sen University
Department: Department of Information Management
Thesis title: Topic Retrospection with Storyline-based Summarization on News Reports
Degree: Master's
Language: English
Student ID: M924020002
Summary open for use: Yes
Pages: 75
Graduate student: Liang, Chia-Hao (梁家豪)
Advisor: Lin, Fu-ren (林福仁)
Keywords: topic retrospection, Topic Detection and Tracking (TDT), event threading, summarization


Abstract

Electronic newspapers have become a main source of information for online news readers. Faced with numerous stories, news readers need support in order to review a topic in a short time. Previous research in Topic Detection and Tracking (TDT) considers only how to identify events and presents the results as news titles and keywords, so a summarized text that presents event evolution is necessary for general news readers to retrospect the events under a news topic. This thesis proposes a topic retrospection process and implements the SToRe system, which identifies the various events under a news topic and constructs the relationships among them to compose a summary that gives readers a sketch of event evolution in the topic. The system consists of three main functions: event identification, main storyline construction, and storyline-based summarization. The constructed main storyline removes irrelevant events and presents a main theme. The summarization function extracts representative sentences and takes the main theme as the template for composing the summary. The summary not only provides enough information to comprehend the development of a topic, but also serves as an index that helps readers find more detailed information. A lab experiment was conducted to evaluate the SToRe system in a question-and-answer (Q&A) setting. The experimental results show that the SToRe system helps news readers capture the development of a topic more effectively and efficiently.

Keywords: topic retrospection, Topic Detection and Tracking (TDT), event threading, summarization


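Chinese Abstract

As electronic news databases have matured, they have become an important information source for online news readers. Yet when users face a large number of news reports, there is still no adequate mechanism to help them review a past topic in a short time. Previous research in Topic Detection and Tracking (TDT) considered only how to detect events and presented the results as lists of news titles and keywords. This study holds that a summarization mechanism based on the main storyline of events can more effectively help readers grasp how events developed in a short time.

This study therefore proposes a mechanism that detects the events in a topic and constructs the relationships among them, and then uses these relationships to compose a topic retrospection report that lets users quickly understand how the topic developed. The mechanism has three main parts: event identification, main storyline construction, and storyline-based summarization. The constructed main storyline provides the backbone of the topic's development and excludes less relevant events. The summary, composed of representative sentences and using the main storyline as its template, not only provides enough information to understand the topic's development but also serves as an index that helps users find more detailed information.

This study uses a laboratory experiment with a question-and-answer setting to validate the proposed mechanism. The experimental results show that the mechanism enables news readers to grasp the course of events more effectively and efficiently.

Keywords: topic retrospection, Topic Detection and Tracking (TDT), event threading, summarization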


Table of Contents

Acknowledgments
Thesis Summary
Abstract
Chinese Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
    1.1 Research Background
    1.2 Research Motivation
    1.3 Research Objectives
    1.4 Thesis Organization
Chapter 2 Literature Review
    2.1 Topic Detection and Tracking
        2.1.1 Topic Detection
        2.1.2 Topic Tracking
    2.2 Self-organizing Maps
        2.2.1 SOM Algorithm
        2.2.2 Growing Hierarchical Self-Organizing Map (GHSOM)
        2.2.3 WEBSOM
        2.2.4 The labeling method applied with SOM
    2.3 Event Threading
    2.4 Text Summarization
        2.4.1 Summarization Method and Evaluation
        2.4.2 Chinese Summarization
Chapter 3 Research Framework
    3.1 Definition
    3.2 Problem Modeling
Chapter 4 Topic Retrospection
    4.1 Preprocess
    4.2 Event identification
    4.3 Main storyline construction
    4.4 Storyline-based Summarization
Chapter 5 System Implement and Experimental Design
    5.1 Data source
    5.2 System Implementation
        5.2.1 Preliminary Result
    5.3 Experimental Design
        5.3.1 Subjects
        5.3.2 Experimental Procedure
        5.3.3 Examinational Questions
Chapter 6 Experimental Results
    6.1 Subject Profile
    6.2 Evaluating SToRe System
Chapter 7 Conclusion and Future Work
    7.1 Conclusion
    7.2 Research Limitation
    7.3 Future Work
References
Appendix A Summarization Result
Appendix B Snapshot of the User Interface in Experimentation
Appendix C Questionnaires
Appendix D The Example of Questions


List of Figures

Figure 2.1 Abstracted SOM
Figure 2.2 Example of topological neighborhood (T1 < T2 < T3)
Figure 2.3 SOM algorithms (Shih et al., 2004)
Figure 2.4 The example of insertion of units
Figure 2.5 Architecture of a GHSOM
Figure 2.6 Architecture of a WEBSOM
Figure 2.7 Architecture of a text summarization system (Mani et al., 1999)
Figure 3.1 The example of the topic structure
Figure 4.1 The pseudo code of MST
Figure 4.2 The example of main storyline
Figure 5.1 udndata.com clipping folder system
Figure 5.2 SToRe system architecture
Figure 5.3 Event revolution process of “HP’s Acquisition of Compaq”
Figure 5.4 Experimental procedure
Figure B.1 Main page of the experiment
Figure B.2 Snapshot of user interface of the control group
Figure B.3 Snapshot of user interface of the experimental group
Figure B.4 Snapshot of user interface of a news story


List of Tables

Table 2.1 The summary of related works of event threading
Table 4.1 Topic retrospection process
Table 5.1 The summary of parameters in SToRe
Table 5.2(a) Types of questions in formal examination
Table 5.2(b) Format of questions in formal examination
Table 6.1 Chi-square tests of reading frequency
Table 6.2 Chi-square tests of reading category
Table 6.3 Group statistics of reading time
Table 6.4 Independent samples test of reading time
Table 6.5 Gender of subjects
Table 6.6 Paired samples test of question types
Table 6.7 Group statistics of measurements
Table 6.8 Independent samples test of measurements
Table 6.9 Statistics of hyperlink clicks
Table 6.10 Paired samples test of hyperlink clicks
Table 6.11 Group statistics of answering questions in different types
Table 6.12 Independent samples test of answering questions in different types
Table 6.13 Statistics of survey in the experimental group


Chapter 1 Introduction

1.1 Research Background

The prevalence of Internet technologies such as the World Wide Web (WWW) eases the aggregation and dissemination of information. For example, online news spreads across the Internet because of its responsiveness and customization features. News readers can query past news by accessing document bases and can edit individual clipboards. Even with these easy-to-query archives and personal clips, online news readers still face information overload. Research efforts have addressed this problem, for example through personal recommendation (Brown and Duguid, 2000).

The amount of information that people can effectively consume is limited. With the continuing rapid expansion of online information, it has become increasingly important to provide improved mechanisms for finding textual information efficiently and presenting it effectively. Many technologies have been proposed to solve the information overload problem, including search engines (information retrieval), information agency, and information customization (Berghel, 1997). Information retrieval systems use keyword search or query expansion (Hsueh, 2002) with ranking mechanisms to find related documents and sort them by rank.

In addition, text mining techniques have been used to discover interesting patterns in unstructured text. Classification techniques distinguish relevant documents within document sets. Clustering mechanisms group related documents that correspond to the same topic. The Topic Detection and Tracking (TDT) project, sponsored by DARPA, provides benchmarks for comparing systems that address topic detection and tracking tasks for manipulating and organizing broadcast news and newswire stories (NIST, 2004). Topic detection is the task of grouping together the articles that correspond to the same topic. Topic tracking is the process of monitoring a stream of news stories to find those reports that track (or discuss) the same topic specified by a user. Although many mechanisms have been proposed to handle news, they are not sufficient to resolve information overload.

1.2 Research Motivation

Ho and Tang (2001) indicate that information overload is associated not only with the quantity of information but also, just as importantly, with its format and quality. Search engines, although fast and comprehensive, still present their results as lists of hits that the user must read through to distinguish the rare relevant items from garbage. Search engines focus on only one dimension, information quantity, while the mechanisms developed in TDT research focus only on information quality. People cannot quickly grasp the sketch of a topic from clustered news sets. Most studies ignore information format and present their results as groups of documents that lack an overview for news readers, i.e., the main theme of the storyline. A storyline-based summary that presents the main theme of a news topic guides a reader through how the topic evolves and is therefore a better information format. Through the summary, news readers can quickly get the sketch of the topic. Without the main theme, readers who are not familiar with the target topic have no idea which event they should read first in the grouped results.

For instance, suppose a person wants to understand the development of the 2004 Taiwan presidential election. Faced with the clustered news sets of “Referendum Act”, “The New Ten Major Construction Projects”, “KMT’s Asset”, “March 19 Shooting”, etc., s/he does not know which event is the most important, how it started, or what its turning point and consequences were. Nor does the person know which events are related and should be read together. Consequently, to review a series of events under one topic, storyline-based summarization is necessary to help users in topic retrospection.

Moreover, clustered news sets do not provide a meaningful abstract for users. Most studies (Swan & Allan, 2000; Azcarraga & Yap, 2001; Smith, 2002; Doran, Stokes, Newman, Dunnion & Carthy, 2004; Shih, Chang, Chen, Ho & Kao, 2004) use a set of keywords to highlight the themes of a clustered set. In most cases, however, users still need to look into individual documents to grasp the whole view of the events. Keywords also present fragmentary information, so people must compose meaningful sentences from them using their own domain knowledge. For instance, the keywords for “March 19 Shooting” may be “Pan-blue”, “Shoot” and “President Chen”. Many different interpretations can be inferred from these keywords, and some of them will mislead users. Therefore, summarization mechanisms that support meaningful topic retrospection are still lacking.

1.3 Research Objectives

In this research, we propose a news topic retrospection mechanism that provides a systematic approach to reviewing the events reported in clustered news sets under a topic. With the retrospection mechanism, online news readers can easily learn the main theme, i.e., the evolution of a news topic and its interdependent events. In addition, news readers are more accustomed to reading text than to reading a graph that presents topic development, so simply representing the main theme as a directed graph does not meet readers’ needs. Hence, this study combines the main theme with a summarization mechanism that extracts representative sentences from clustered news sets and composes a summary to enrich topic retrospection. News readers can thus gain a general perspective before reading the news articles. If the topic retrospection system is successfully implemented, it will provide a storytelling capability that helps news readers understand the context of topic development and, in turn, reduces information overload.


1.4 Thesis Organization

Chapter 2 reviews current approaches related to topic retrospection and the techniques that underpin the proposed mechanism. Chapter 3 describes the problem model and its assumptions in more detail. Chapter 4 explains the topic retrospection process in detail. Chapter 5 presents the implementation of the SToRe system and the experimental design. Chapter 6 analyzes the experimental results. Finally, Chapter 7 concludes the research with its limitations and future work.


Chapter 2 Literature Review

This chapter reviews the literature related to current research in topic retrospection and to the development of the proposed mechanism. Specifically, it reviews existing work on topic detection and tracking, self-organizing maps, event threading, and text summarization.

2.1 Topic Detection and Tracking

Topic Detection and Tracking research has been pursued under the DARPA Translingual Information Detection, Extraction, and Summarization (TIDES) program (NIST, 2004). It aims to provide a variety of automatic techniques for discovering and threading together topically related material in streams of data such as newswire and broadcast news. TDT integrates research in information retrieval, information management, and data mining to devise powerful, broadly useful, fully automatic algorithms for determining the topical structure of human language data. The TDT Pilot Study in 1997 laid the essential groundwork by conducting three main tasks: segmentation, tracking, and detection. In 2004, the TDT program defined five research applications: (1) Story segmentation: detect changes between topically cohesive sections; (2) Topic tracking: keep track of stories similar to a set of example stories; (3) Topic detection: build clusters of stories that discuss the same topic; (4) First story detection: detect whether a story is the first story of a new, unknown topic; and (5) Link detection: detect whether or not two stories are topically linked.

In TDT studies, a topic is defined as a seminal event or activity, along with all directly related events and activities (Franz & McCarley, 2001). An event, in turn, is defined as something (non-trivial) that happens at a particular time and place (Allan, Papka & Lavrenko, 1998). For instance, terrorist activity is a topic, and many events belong to it, such as the Oklahoma City Bombing and the September 11 Attacks.

Automatic discovery and threading have high potential in many applications where people need timely and efficient access to large quantities of information. Systems could alert users to newly occurring events or to new information about old events. By examining one or two stories in a grouped event, a user can decide whether to pay attention to the rest of an evolving thread. Similarly, a user can go to a large archive, find all the stories about a particular event, and learn how it evolved (Wayne, 2000).
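As a rough illustration of link detection, the fifth application listed above, the following Python sketch decides whether two stories are topically linked by comparing their word-count vectors with cosine similarity. The bag-of-words representation, the 0.3 threshold, and all function names are assumptions made for illustration only; they are not taken from the TDT evaluations or from the system described in this thesis.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens (a crude bag of words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty story is similar to nothing
    return dot / (norm_a * norm_b)


def topically_linked(story_a, story_b, threshold=0.3):
    """Hypothetical decision rule: linked if similarity reaches the threshold.

    The threshold value is an assumption; a real system would tune it.
    """
    sim = cosine_similarity(term_counts(story_a), term_counts(story_b))
    return sim >= threshold
```

A practical system would likely weight terms (for example with TF-IDF) and tune the threshold on labeled story pairs; raw counts are used here only to keep the sketch self-contained.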

2.1.1 Topic Detection

Topic detection is the task of grouping articles that correspond to the same topic. Compared with topic tracking, topic detection has no prior information about or description of the topic. It therefore learns what the topic is by means of an unsupervised clustering algorithm (Ku, 2000; Franz et al., 2001). More precisely, topic detection consists of two tasks: retrospective detection and online detection. The former entails the discovery of previously unidentified events in an accumulated collection. It goes through a pre-collected corpus and groups the stories into clusters, each of which stands for an event. The latter strives to identify the onset of new events from live news feeds in real time (Yang, Pierce & Carbonell, 1998).

Event detection can be regarded as a discovery problem: mining the data stream for new patterns in document content. It analyzes document features and uses intra-cluster similarity as the criterion for assigning documents to the same group. A bottom-up clustering approach appears to be a natural solution, because it can reveal the hierarchical structure of a topic. Viewed from the top down, a topic can be divided into events, and an event can be split into sub-events. A higher level gives an aggregated overview of the topic, whereas a lower level gives a more detailed description.

The feature co-occurrence approach, however, faces some problems. For example, different reporters use different vocabularies to report the same event because they hold different views, and different events may share similar features. Researchers have adopted event-based and linguistic features to tackle these problems (Hatzivassiloglou, Gravano & Maganti, 2000; Wei & Lee, 2003). These approaches parse and tag phrases more precisely with part-of-speech (word class) and named-entity information. The former labels words as nouns, verbs, and adjectives, while the latter identifies noun phrases as people, places, and organizations. Adopting the combined features can enhance system performance.
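The bottom-up clustering idea described in this section can be sketched as follows. This is a minimal illustration, assuming raw word counts as features, cosine similarity, average linkage, and a hypothetical stopping threshold of 0.2; none of these choices or names come from the cited systems. The recorded merge history reflects the hierarchy mentioned above: early merges correspond to fine-grained events, later ones to broader groupings.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a story into a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerate(stories, threshold=0.2):
    """Merge the most similar pair of clusters until none reaches the threshold.

    Returns the final clusters (lists of story indices) and the merge
    history as (left cluster, right cluster, similarity) tuples.
    The threshold is an illustrative assumption.
    """
    vectors = [vectorize(s) for s in stories]
    clusters = [[i] for i in range(len(stories))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = best_pair
        merges.append((clusters[x], clusters[y], best_sim))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges
```

For example, calling `agglomerate` on a handful of short headlines returns index groups that can be read as candidate events; lowering the threshold merges those groups further, giving the coarser, topic-level view.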

2.1.2 Topic Tracking

Topic tracking is the process of monitoring a stream of news stories to find those that track (or discuss) the same event as one specified by a user. When a new story arrives, the system judges whether it belongs to an existing topic or to a new one. Generally speaking, a topic tracking system assigns stories to specific topics by supervised classification (Allan et al., 1998; Franz et al., 2001). First, models based on different classification algorithms, e.g., decision tree induction, Bayesian networks, and nearest neighbor matching, are trained on a pre-labeled corpus. The trained model is then used to trace newly incoming news. Among other mechanisms, Swan and Allan (1999) addressed the extraction of significant events with a simple statistical model of the frequency of feature occurrence. They used the χ² method to identify bursts of feature terms that appear more frequently at one time than at other times. It starts tracing an event when the corresponding χ² value is greater than 7.879 (p