An Iterative Approach to Text Segmentation

Fei Song1, William M. Darling1, Adnan Duric1, and Fred W. Kroon2

1 School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, Canada
{fsong,wdarling,aduric}@uoguelph.ca
2 PryLynx Corporation, 21 Oneida Place, Kitchener, Ontario, N2A 3G2, Canada
[email protected]

Abstract. We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one is based on language modeling of text segmentation and the other on the interconnectivity between two segments. Our solution produces a deep and narrow binary tree, a dynamic object that describes the structure of a text and that is fully adaptable to a user's segmentation needs. We treat it as a separate task to flatten the tree into a broad and shallow hierarchy, either through supervised learning on a document set or through explicit input of how a text should be segmented. The rich structure of the created tree further allows us to segment documents at varying levels such as topic, subtopic, etc. We evaluated our new solution on a set of 265 articles from Discover magazine where the topic structures are unknown and need to be discovered. Our experimental results show that the iterative approach has the potential to generate better segmentation results than several leading baselines, and that the separate flattening step allows us to adapt the results to different levels of detail and user preferences.

Keywords: Text Segmentation; Language Modeling

1 Introduction

A natural language document typically has an underlying hierarchical organization where one or a few common themes are supported by a set of interrelated topics, which are in turn made up of subtopics, sub-subtopics, etc. Text segmentation is the task of dividing a document into a sequence of segments that correspond to its constituent parts. Traditionally, segmentation has been seen as a linear task where breaks are inserted into a flat text. Full text segmentation, however, would allow segments to be further divided into smaller segments, or "subtopics", to the point where the complete underlying hierarchical organization of a document could be discovered.

Text segmentation is useful for many applications, including information retrieval, text summarization, and information visualization. In information retrieval, for example, a query is matched with either documents or passages of the documents. The drawback is that the user may get too many passages or have to examine long documents to find the relevant information contained within. By breaking documents into topics and/or subtopics, we can return the related segments as search results that provide greater precision. Text segmentation is also an important part of topic detection and tracking (TDT) research. TDT aims to discover and correlate related topics in streams of data, such as broadcast news, for eventual archiving and other uses; an overview is available in [3].

There are three major approaches to text segmentation in the literature: similarity curves ([7]), dotplots ([9]), and language models ([11]). These methods are similar in that they all use word frequency metrics to measure the similarity between two regions of text so that a document can be separated at points where the connections between regions are weak but the connections within regions are strong. They differ, however, in how the similarity between portions of the text is used to find such regions. In [7], Hearst computes the similarity within a sliding window and follows the peaks and valleys of the similarity curve to determine where to segment a text. In [9], Reynar calculates a similarity matrix between all pairs of sentences and identifies patterns along the diagonal for separating segments. In [11], Utiyama and Isahara generate all possible segment partitions through dynamic programming and use the probability distribution of the words to rank and select the best segment partition.

In this paper, we propose a new method for text segmentation that iteratively partitions a portion of the text in order to build a hierarchical organization of the input. Similar to Utiyama and Isahara's language model, we generate multiple partitions for a document, but instead of enumerating all possible partitions with a different number of segments, we examine all the two-way partitions for a portion of the text, select the weakest point, and split the text into two parts. By performing such two-way partitions iteratively, we build a deep and narrow binary tree as the output. We leave it as a separate step to flatten the tree into a broad and shallow hierarchy, since there are considerable variations among different users in producing such hierarchical organizations ([7]) and many approaches take the number of desired segments for each document as an input from the user ([4] and [5]).

We demonstrate our new method's segmentation ability on a set of 265 articles from Discover magazine. The topic structures of the articles are unknown; it is therefore desirable to re-discover the underlying hierarchical organizations. The experimental results demonstrate the effectiveness of our new method.

The rest of the paper is organized as follows. In section 2, we introduce related work that helps us measure the connectivity strength between two adjacent segments. In section 3, we describe the detailed steps of our iterative approach to text segmentation. Our solution produces a deep and narrow hierarchy, and we offer two possible methods for evaluating hierarchical organizations of text segmentation in section 4. In section 5, we discuss our experimental results on the Discover dataset, and finally, section 6 concludes the paper with some future research directions.

2 Related Work

Lexical cohesion refers to the connectivity between two portions of text in terms of word relationships ([6]). Although there can be different kinds of relationships between two words, including synonymy (the same meaning) and hyponymy (where one word is a more specific instance of another), the simplest form is when two words are identical or share the same morphological root. Lexical cohesion is commonly modeled by the interconnectivity between sentences in terms of word overlaps or similarities ([10]). Based on the sentence-level similarity, [12] defines a general similarity measure for the interconnectivity between two adjacent segments in a document as

\[ sim_{between} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} sim_{ij}}{m \times n} \tag{1} \]

where m and n are the numbers of sentences in the two adjacent segments, and \(sim_{ij}\) is the similarity between sentences i and j. Clearly, the similarity between two sentences that are close to each other should weigh more than that between two sentences that are further apart. Accordingly, [12] also offers a distance-based interconnectivity measure between two segments:

\[ sim_{between} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij}\, sim_{ij}}{m \times n} \tag{2} \]

where \(w_{ij} = 1\) for \(|i - j| \le 2\), and \(w_{ij} = 1/\sqrt{|i - j| - 1}\) otherwise.
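For illustration, equation (2) can be computed directly from a precomputed sentence-similarity matrix, as in the following minimal Python sketch. The function name and matrix representation are illustrative choices, not part of the original implementation; indexing follows the equation as written.

    from math import sqrt

    def sim_between(sim, m, n):
        # sim[i][j]: similarity between sentence i+1 of the first segment
        # and sentence j+1 of the second (the equation is 1-based).
        total = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d = abs(i - j)
                # w_ij = 1 for |i - j| <= 2, 1/sqrt(|i - j| - 1) otherwise
                w = 1.0 if d <= 2 else 1.0 / sqrt(d - 1)
                total += w * sim[i - 1][j - 1]
        return total / (m * n)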

Alternatively, we can follow Utiyama and Isahara's approach in [11] and use a cost function to measure the strength of a segment partition of a document. Let \(W = w_1 w_2 \ldots w_n\) be a document of n words and \(S = S_1 S_2 \ldots S_m\) be a partition of m segments. The cost for this partition is defined as follows:

\[
\begin{aligned}
C(S) &= -\log P(W|S)P(S) \\
     &= -\sum_{i=1}^{m} \sum_{j=1}^{n_i} \log P(w_j^i|S_i) + m \log n \\
     &= -\sum_{i=1}^{m} \sum_{j=1}^{n_i} \log \frac{f_i(w_j^i) + 1}{n_i + k} + m \log n \\
     &= \sum_{i=1}^{m} \sum_{j=1}^{n_i} \log \frac{n_i + k}{f_i(w_j^i) + 1} + m \log n
\end{aligned} \tag{3}
\]

Here, \(n_i\) denotes the length of segment \(S_i\) and \(w_j^i\) is the jth word in segment \(S_i\). In addition, \(f_i(w_j^i)\) stands for the frequency of word \(w_j^i\) in \(S_i\), while \(k\) is the total number of unique words in the given document.

Intuitively, the cost function measures the strength of a segment partition: the lower the value, the weaker the interconnectivity between the segments, and therefore, the better the choice for text segmentation. Since we are only interested in two-way partitions in our iterative approach, i.e., splitting a portion of text into two segments at a time, the term \(m \log n\) is constant (m = 2 and n is fixed for the portion being split) and can be dropped, so we can simplify equation (3) to the following:

\[ C(S) = \sum_{i=1}^{2} \sum_{j=1}^{n_i} \log \frac{n_i + k}{f_i(w_j^i) + 1} \tag{4} \]
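As an illustration, equation (4) can be computed from the word counts of the two candidate segments. The following Python sketch uses hypothetical names; it assumes each segment is given as a list of word tokens and that k is the document's vocabulary size.

    from math import log

    def two_way_cost(seg1, seg2, k):
        cost = 0.0
        for seg in (seg1, seg2):
            n_i = len(seg)                 # length of segment S_i
            freq = {}
            for w in seg:                  # f_i(w): frequency of w within S_i
                freq[w] = freq.get(w, 0) + 1
            for w in seg:
                cost += log((n_i + k) / (freq[w] + 1))
        return cost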

3 Iterative Text Segmentation

3.1 Motivation for an Iterative Approach

Most existing text segmentation methods are aimed at finding a suitable linear segment partition for a given document, at either a topic or a subtopic level. To recover the underlying hierarchical organization of a naturally occurring document, these methods may need to be applied recursively: first, by breaking a document into a sequence of topics, and then, by breaking large segments into sequences of subtopics. One problem with this approach is that each application of the method can be quite expensive. For example, in both [11] and [12], one needs to enumerate all possible partitions for a different number of segments in order to select the best possible segment partition. Such a process is expensive even when dynamic programming techniques are used to save the intermediate results.

Another problem is that a system needs to be tuned to model a typical or average topic or subtopic partition across human editors. As observed by Hearst in [7], wide variations typically exist among human editors regarding the underlying hierarchical organization of a document. Some are fine-grained while others are coarse-grained, and even for the same editor, some parts of the organization can be detailed while other parts are brief, depending on the editor's background knowledge and interest. As a result, it is difficult for one size to fit all variations.

In this paper, we propose a new method for text segmentation that iteratively splits a portion of the text at its weakest point in terms of the interconnectivity between the two parts. Such two-way partitions are easy to implement and efficient to compute. In addition, we build a complete binary tree of the partitioned segments as the output of this iterative process. We treat it as a separate task to flatten the tree into a broad and shallow hierarchy so that we can adapt the results to different levels of detail and user preferences or needs.

3.2 Detailed Steps

Our method performs two-way partitions on a document or a portion of the text. Given a range of sentences, it tries different split points using the interconnectivity measures described in equations (2) and (4) above, so that we can find the weakest point at which to split the text into two segments. This is implemented as the InterSim function used below. Intuitively, it should be easier, and perhaps more reliable, to break a portion of the text at its weakest point than to find multiple split points at once, as required by most existing methods for text segmentation.

Based on the InterSim function (either equation (2) or (4)), we iteratively partition a portion of the text until it is too small to be split (less than or equal to the minimum segment size). We summarize the related steps in the following algorithm for clarity.

Input: start and end sentence positions, and the document similarity index
Output: the partitioned segments as a binary tree
begin
    set tree to empty;
    if (end − start) > MinSegmentSize then
        set bestSim to MaxValue;
        set bestSplit to −1;
        foreach position i between start and end do
            set sim to the result of InterSim(start, i, end, index);
            if sim < bestSim then
                bestSim = sim;
                bestSplit = i;
            end
        end
        if bestSplit > 0 then
            set tree to a new node covering sentences start to end;
            add SplitSegment(start, bestSplit, index) to the children list of tree;
            add SplitSegment(bestSplit + 1, end, index) to the children list of tree;
        end
    end
    return tree;
end

Algorithm 1: SplitSegment Function
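For concreteness, the recursion can also be sketched in Python. The Node class, the MIN_SEGMENT_SIZE constant, and the inter_sim callback (an implementation of equation (2) or (4)) are illustrative names introduced here, not part of the original implementation.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        start: int                  # first sentence position covered
        end: int                    # last sentence position covered
        score: float = 0.0          # interconnectivity at the chosen split point
        children: list = field(default_factory=list)

    MIN_SEGMENT_SIZE = 3            # smallest splittable range, in sentences

    def split_segment(start, end, index, inter_sim):
        node = Node(start, end)
        if end - start <= MIN_SEGMENT_SIZE:
            return node             # too small to split further
        best_sim, best_split = float("inf"), -1
        for i in range(start, end):                  # candidate split points
            sim = inter_sim(start, i, end, index)    # equation (2) or (4)
            if sim < best_sim:
                best_sim, best_split = sim, i
        if best_split >= 0:
            node.score = best_sim
            node.children.append(split_segment(start, best_split, index, inter_sim))
            node.children.append(split_segment(best_split + 1, end, index, inter_sim))
        return node

Each call performs a linear scan over the candidate split points, and every split shrinks the ranges passed to the recursive calls, which is the divide-and-conquer behavior discussed below.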

As can be seen from the SplitSegment function detailed in Algorithm 1, two-way partitions are both easy to implement and efficient to compute, since each split makes the remaining segments smaller. This is a typical application of the divide-and-conquer principle. Our output is a binary tree of all the partitioned segments, which can later be flattened into a desirable hierarchical organization, given the user's segmentation needs (see subsection 4.2 for details). A portion of an example binary tree is shown in Figure 1, where each node contains the start and end sentence positions for the portion of text it covers, along with a score giving the interconnectivity strength for splitting it into the two segments at the children level.

    0-174 (0.0088)
    ├── 0-53 (0.0104)
    │   ├── 0-24 (0.0098)
    │   └── 25-53 (0.0098)
    └── 54-174 (0.0103)
        ├── 54-106 (0.0172)
        └── 107-174 (0.0079)

Fig. 1. A Binary Tree of the Segmented Document Structure Created by divSeg

For example, the left child of the root node covers sentences 0 to 53, and the score for splitting it into two segments (sentences 0 to 24 and sentences 25 to 53) is 0.0104.

3.3 Comparisons with Related Work

Our iterative approach to text segmentation relies on the interconnectivity measures between two segments defined in [11] and [12], but we differ in how these measures are used for text segmentation. Both [11] and [12] use dynamic programming to enumerate all possible partitions for a different number of segments and then apply an interconnectivity measure to select the best possible segment partition. In [11], the prior probability of a segmentation S, modeled as \(P(S) = n^{-m}\), helps obtain a reasonable number of segments. Given the variations among human editors regarding the level of detail and preferences for the underlying hierarchical organizations of segments, it is difficult to see how such a general factor can be optimized for the segmentation process.

Although [12] considers many factors for text segmentation, such as within-segment similarities and between-segment similarities along with segment lengths and sentence distances, it is not clear what the rationale is for the way these factors are combined in their implementation. Their cost function, where the α and β parameters have to be optimized on training data, is shown in equation (5) below:

\[ C(S) = \alpha\, Sim_{within} - (1 - \alpha)\, Sim_{between} + \beta\, Ratio_{length} \tag{5} \]

In our approach, we iteratively break a portion of the text at its weakest points and record all the partitioned segments in a binary tree structure. We treat it as a post-processing step to flatten the binary tree into a broad and shallow hierarchy, which can be tuned to different levels of detail and user preferences. In fact, the binary tree itself can simply be used as a visual illustration of the underlying hierarchical structure of a document, or it can be usefully applied to many other tasks such as information retrieval and text summarization.

4 Evaluating Hierarchical Organizations

Our new method for text segmentation produces a binary tree of all the partitioned segments along with their connectivity scores. A human-labeled segmentation structure, on the other hand, is also hierarchical, typically made up of topics and subtopics. We take a two-step approach to evaluating the results of text segmentation. First, given two hierarchical organizations, we find the best possible match by comparing all possible linear partitions of the two hierarchical organizations. Second, we treat it as a separate step to flatten a binary tree into a suitable linear partition so that the result can be compared directly with a human-labeled structure, at either a topic or a subtopic level.

4.1 Best Possible Match

Most current systems for text segmentation are evaluated by the Pk and/or WindowDiff measures. Both compute a degree of mismatch between two linear partitions of a text. Pk, originally proposed by Beeferman et al. in [1], measures the rate of mismatched segment boundaries via a moving window of size k, which is usually set to half of the average segment length in the reference partition. WindowDiff is an improved version of Pk proposed by Pevzner and Hearst in [8]. For each window of size k, WindowDiff not only decides whether there is a mismatch between the hypothesized partition and the reference partition, it also compares the number of segment boundaries that each partition places in the given window. Thus, the results of WindowDiff are generally higher than those of Pk.

To measure the degree of mismatch between two hierarchical organizations, we propose the following method to find the best possible match between the two structures. First, we generate all linear partitions, from coarse levels to refined levels, for each of the hierarchical organizations, such that each partition covers the same portion of the text, as illustrated in Figure 2. Then, by comparing all pairs of linear partitions from the two structures, we find the best possible match, i.e., the pair with the lowest Pk or WindowDiff value between the corresponding linear partitions.

The best possible match represents the ideal case where a machine-generated partition matches a human-labeled partition. It establishes an upper bound for the results that could be achieved given a perfect converter from the binary tree representation to a flatly segmented document. In addition to providing us with a "best-case" view of our segmenter for a given human labeling of a dataset, this also allows us to measure how consistent the human-labeled organizations are for a given document as compared to their statistical interconnectivity.
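For reference, both measures are straightforward to implement. In the following sketch (our own illustrative code, with a segmentation represented as one segment id per sentence), pk probes whether positions k apart fall in the same segment, while window_diff compares boundary counts within each window; window_diff charges a penalty whenever the counts differ.

    def pk(ref, hyp, k):
        # Error whenever ref and hyp disagree on whether positions i
        # and i + k lie in the same segment.
        n = len(ref)
        errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                     for i in range(n - k))
        return errors / (n - k)

    def window_diff(ref, hyp, k):
        # Error whenever the two partitions place a different number of
        # boundaries inside the window [i, i + k].
        def boundaries(seg, i):
            return sum(seg[t] != seg[t + 1] for t in range(i, i + k))
        n = len(ref)
        errors = sum(boundaries(ref, i) != boundaries(hyp, i)
                     for i in range(n - k))
        return errors / (n - k)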

    A
    ├── B
    │   ├── D
    │   └── E
    └── C
        ├── F
        └── G

Possible partitions: [A], [B, C], [B, F, G], [D, E, C], [D, E, F, G]

Fig. 2. All Possible Partitions for a Hierarchical Structure
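Generating all linear partitions of a hierarchy amounts to enumerating the frontiers of the tree: each node is either kept whole or replaced by a partition of each of its children. A minimal Python sketch, assuming the illustrative Node class from subsection 3.2:

    from itertools import product

    def all_partitions(node):
        if not node.children:
            return [[node]]
        results = [[node]]                      # keep this node as one segment
        child_options = [all_partitions(c) for c in node.children]
        for combo in product(*child_options):   # one choice per child subtree
            results.append([seg for part in combo for seg in part])
        return results

Applied to the tree in Figure 2, this yields exactly the five partitions listed there.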

4.2 Flattened Linear Partition Match

Although our binary tree result captures the partitioned segments at different levels of detail, many applications such as information retrieval and text summarization may require topic-level and/or subtopic-level partitions that are comparable to human-labeled structures. We treat it as a separate step to flatten a deep and narrow tree into a broad and shallow hierarchy.

Our example flattener traverses a binary tree, such as the one shown in Figure 1, in depth-first order and examines the score at each node. If the score at a particular node is below a pre-determined threshold, the flattener cuts the tree at that point, creates a segment for the left branch, and continues the search down the right branch. Here, the score represents the interconnectivity strength at a split point for the two segments at the children level and can be computed by either equation (2) or (4). With a threshold of 0.009, for example, the tree in Figure 1 will be flattened into two segments: sentences 0 to 53 and sentences 54 to 174, as sketched below.

To find appropriate threshold values for the flattening step, we use a training set of documents with human-labeled segment structures and estimate the threshold values for topic and subtopic partitions, respectively. For flattening, we generate topic partitions first, and then, for each segment, we continue generating subtopic partitions if needed. By treating the flattening of a binary tree into a linear partition as a separate step, we can adapt our results to different levels of detail (topics, subtopics, or both) and user preferences (some may be detailed while others may be general).
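A minimal Python sketch of this example flattener, reusing the illustrative Node class from subsection 3.2 and following the description above literally (the left branch is emitted whole and the search continues down the right branch):

    def flatten(node, threshold):
        segments = []
        # Cut while the current node's split score is below the threshold.
        while node.children and node.score < threshold:
            left, right = node.children
            segments.append((left.start, left.end))  # left branch becomes a segment
            node = right                              # continue down the right branch
        segments.append((node.start, node.end))       # the remainder is one segment
        return segments

On the tree of Figure 1 with a threshold of 0.009, this returns [(0, 53), (54, 174)], matching the example above.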

5 Experimental Results

To demonstrate the effectiveness of our iterative method for text segmentation, we conducted experiments on a dataset of real-world magazine articles and compared our results with two existing methods for text segmentation.

5.1 Discover Magazine Dataset

For our experiments, we used a corpus of 265 articles collected from Discover magazine between the years 2000 and 2009. Each document was annotated by an independent human editor for both topic and subtopic boundaries. The independent editor (who was unaware of our research) was given instructions for segmenting the documents, where the objective was to place major topic boundaries only when one prominent topic under discussion changes to another. Among other directives, a topic transition was defined as a major change in the subject matter. Subtopics were defined as supporting the discussion of a topic; subtopics can only be nested within a major topic.

We chose Discover magazine articles because each article has a reasonable length (generally between 1,000 and 2,000 words) and there are no clearly marked sections and subsections; there is therefore a need for automatic text segmentation to break the articles into topics and/or subtopics.

Table 1. Statistics for the editor-segmented Discover Magazine dataset

    Average number of topics per document        4.33
    Average topic length in words              355.48
    Average number of subtopics per document    10.31
    Average subtopic length in words            115.40
    Total number of word types in the dataset  20,491

All documents are preprocessed by removing non-alphabetic tokens such as numbers and punctuation marks. We also remove stopwords using a common stopword list. Table 1 shows the statistics of this corpus after these preprocessing steps.
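This preprocessing can be sketched in a few lines of Python; the tokenization details here are our own assumptions for illustration, not a specification of the original pipeline.

    import re

    def preprocess(text, stopwords):
        # Keep alphabetic tokens only (drops numbers and punctuation),
        # lowercase them, and remove stopwords.
        tokens = re.findall(r"[A-Za-z]+", text.lower())
        return [t for t in tokens if t not in stopwords]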

5.2 Hierarchical Evaluation

To evaluate the performance of our text segmentation method, we compared our results with those generated by two existing methods for text segmentation: C99, based on dotplots ([2]), and U00, based on language models ([11]). We chose these two systems because they are representative of existing text segmentation methods and their implementations are freely available on the Internet.

There are two versions of our iterative text segmentation method: divSeg (IC) uses equation (2) for computing the interconnectivity between two adjacent segments, while divSeg (LM), based on language models, uses equation (4). We record the results from both the Pk and WindowDiff measures so that we can get different perspectives on the performance.

Since our method produces a binary tree of all partitioned segments and the reference structures marked by human editors contain both topics and subtopics, we began by following the methodology outlined in subsection 4.1 and compared the best possible matches between the two hierarchical organizations. Although C99 and U00 produce linear partitions as output, we also applied the best possible match method to these algorithms against the reference structure, which is typically hierarchical with both topics and subtopics.

Table 2. Hierarchical Evaluation Results

    Algorithm   C99      U00      divSeg (IC)   divSeg (LM)
    Avg Pk      .32748   .37265   .2650         .20148
    Avg WD      .41979   .39896   .2803         .22014

As seen in Table 2, our segmentation method exceeds C99 and U00 considerably; both the Pk and WindowDiff scores are decreased by an appreciable amount. Between the two versions of our own implementation, divSeg (LM) outperforms divSeg (IC) significantly, making it the overall winner in the comparisons. While we concede that this comparison benefits divSeg because our binary tree is more amenable to a "best-case" study, these results are given as an upper bound and serve simply to demonstrate the potential of our method in a text segmentation application. Furthermore, the fact that our method is more amenable to this type of study is a benefit in itself; it demonstrates the dynamic nature of the fully adaptable object that our algorithm outputs. This flexibility allows for a diverse range of flattening algorithms, each highlighting a distinct level and granularity of segmentation for a user's needs. A direct comparison of our example flattener to C99 and U00 in a flat topic-level segmentation scenario follows.

5.3 Topic-Level Evaluation

In this subsection, we present a real-world evaluation of our system by automatically flattening a deep and narrow binary tree into a suitable topic-level partition. While the Discover dataset includes both topic and subtopic annotations, we concentrate on topic-level segmentation in this work since the other segmentation systems are aimed at topic-level partitions.

To train our flattener for an appropriate threshold, we randomly separated our dataset into a training set of 160 documents and a testing set of 105 documents. The flattener was then run on the training documents to find the threshold value that achieves the best Pk or WindowDiff scores. For topic-level partitions, this particular editor, and this particular dataset, the threshold value was found to be approximately 0.008. The average Pk and WindowDiff topic-level scores obtained on the testing set for our implementation and for C99 and U00 are shown in Table 3.

Table 3. Topic-Level Evaluation Results

    Algorithm   C99     U00     divSeg (IC)   divSeg (LM)
    Avg Pk      .4846   .5190   .4312         .4013
    Avg WD      .7092   .6121   .4446         .4178

While not as impressive as the scores obtained in the best possible match evaluation, these results again show the advantage of our method. For both the Pk and WindowDiff metrics, our algorithm shows substantial, statistically significant improvements (as determined by Student's t-test) over the C99 and U00 algorithms. With a more technically advanced flattening procedure, we are confident that these scores can be improved even further.

6 Conclusions and Future Work

In this paper, we proposed a novel method for text segmentation that iteratively splits a portion of the text until a binary tree of all partitioned segments is fully built. As illustrated in subsection 3.2, such a process is both easy to implement and efficient to compute. We followed with a separate step to flatten the binary tree into a broad and shallow hierarchy in order to model human-labeled segment structures. Our experiments on the documents in the Discover dataset showed that the new iterative approach has the potential to generate much better segmentation results, and that the separate flattening step gives us the flexibility to adapt our results to different levels of detail and user preferences in the final segment structures.

There are several directions in which we can take our method for text segmentation. First, it will be desirable to generalize the Pk and WindowDiff measures to determine the degree of mismatch between hierarchical organizations, since the best possible match only establishes an upper-bound estimate in hierarchical experiments. Second, we can explore different ways of flattening a binary tree into a broad and shallow hierarchy so that we can better model human-labeled structures with different levels of detail and user preferences. Finally, we will extend our testing to other domains and datasets, such as the Wikipedia corpus, which should allow for straightforward testing due to its pre-segmented format.

Acknowledgements. The authors would like to acknowledge the financial support from Ontario Centres of Excellence (OCE) through the OCE/Precarn Alliance Program. We would also like to thank the editor Juliette Zhang for labeling the topic and subtopic structures for the Discover magazine dataset.

References

1. Doug Beeferman, Adam Berger, and John Lafferty. Statistical models for text segmentation. Machine Learning, 34(1-3):177–210, 1999.
2. Freddy Y. Y. Choi. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
3. Christopher Cieri, David Graff, Mark Liberman, Nii Martey, and Stephanie Strassel. Large, multilingual, broadcast news corpora for cooperative research in topic detection and tracking: The TDT-2 and TDT-3 corpus efforts. In Proceedings of the Language Resources and Evaluation Conference, 2000.
4. Jacob Eisenstein. Hierarchical text segmentation from multi-scale lexical cohesion. In NAACL '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 353–361, Morristown, NJ, USA, 2009. Association for Computational Linguistics.
5. Jacob Eisenstein and Regina Barzilay. Bayesian unsupervised topic segmentation. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 334–343, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
6. M. A. K. Halliday and Ruqaiya Hasan. Cohesion in English. Longman, 1976.
7. Marti A. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 9–16, Morristown, NJ, USA, 1994. Association for Computational Linguistics.
8. Lev Pevzner and Marti A. Hearst. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36, 2002.
9. Jeffrey C. Reynar. Topic Segmentation: Algorithms and Applications. PhD thesis, University of Pennsylvania, 1998.
10. E. F. Skorochod'ko. Adaptive method of automatic abstracting and indexing. In Proceedings of the IFIP Congress, volume 71, pages 1179–1182, 1972.
11. Masao Utiyama and Hitoshi Isahara. A statistical model for domain-independent text segmentation. In ACL '01: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 499–506, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
12. Na Ye, Jingbo Zhu, Yan Zheng, Matthew Y. Ma, Huizhen Wang, and Bin Zhang. A dynamic programming model for text segmentation based on min-max similarity. In AIRS '08: Proceedings of the 4th Asia Information Retrieval Conference on Information Retrieval Technology, pages 141–152, Berlin, Heidelberg, 2008. Springer-Verlag.