Text Entailment for Logical Segmentation and Summarization

Doina Tatar, Andreea Diana Mihis, and Dana Lupsa

University “Babes-Bolyai”, Cluj-Napoca, Romania
{dtatar,mihis,dana}@cs.ubbcluj.ro

Abstract. Summarization is the process of condensing a source text into a shorter version while preserving its information content ([2]). This paper presents some original methods for single-document summarization by extraction, based on a particular intuition which has not been explored until now: the logical structure of the text. The summarization relies on an original linear segmentation algorithm which we call logical segmentation (LTT), because the score of a sentence is the number of sentences of the text which are entailed by it. The summary is obtained by three methods: selecting the first sentence(s) from each segment, selecting the best-scored sentence(s) from each segment, and selecting the most informative sentence(s) (relative to the previously selected ones) from each segment. Moreover, our methods permit dynamically adjusting the length of the derived summary, independently of the number of segments. Alternatively, a Dynamic Programming (DP) method, based on the continuity principle and applied to the sentences logically scored as above, is proposed. This method proceeds by first obtaining the summary and then determining the segments. Our segmentation methods are applied and evaluated against the segmentation of the text “I spent the first 19 years” of Morris and Hirst ([17]). The original text is reproduced at [26]. Some statistics about the informativeness of summaries of different lengths obtained with the above methods, relative to the original (summarized) text, are given. These statistics show that segmentation preceding summarization can improve the quality of the obtained summaries.

1 Introduction

Text summarization has become the subject of intense research in recent years and it is still an emerging field. The research is carried out in two areas, extracts and abstracts. Extracts are summaries created by reusing portions of the input verbatim, while abstracts are created by regenerating the extracted content ([13]). However, there is as yet no theory of how a good summary is constructed.


The most important task of summarization is to identify the most informative (salient) parts of a text compared with the rest. Usually the salient parts are determined on the following assumptions [16]: they contain words that are used frequently; they contain words that are used in the title and headings; they are located at the beginning or end of sections; they use key phrases which emphasize the importance in the text; they are the most highly connected with the other parts of the text. While the first four characteristics are easy to achieve and verify, the last one is more difficult to establish. For example, connectedness may be measured by the number of shared words, synonyms, or anaphora [18],[20]. On the other hand, if the last assumption is fulfilled, the cohesion of the resulting summary is expected to be higher than if it is missing. In this respect, our methods assure a high cohesion for the obtained summary, while connectedness is measured by the number of logical entailments between sentences.

Another idea proposed in this work is that a process of logical segmentation before summarization can be beneficial to the quality of the summaries. This feature is new compared with [23]. We propose here a method of text segmentation with high precision and recall (as compared with human performance). The method is called logical segmentation because the score of a sentence is the number of sentences of the text which are entailed by it. The scores form a structure which indicates how the most important sentences alternate with less important ones and which organizes the text according to its logical content. Due to some similarities with the TextTiling algorithm for topic shift detection of Hearst ([12,11]), we call this method Logical TextTiling (LTT).

A drawback of LTT is that the number of segments is fixed for a given text, as it results from its logical structure. In this respect, we present in this work a method to dynamically correlate the number of logical segments obtained by LTT with the required length of the summary. The method is applied to Morris and Hirst's text in [17] and the results are compared with the structure they obtained with Lexical Chains.

The methods presented in this paper are fully implemented in Java and C++. We used our own systems for Text Entailment verification, LTT segmentation, and summarization with the Sum_i (i = 1, 2, 3) and AL methods. For the DP algorithm we used part of the OpenNLP tools ([25,24]) to identify the sentences and the tokens in the text. OpenNLP defines a set of Java interfaces and implements some basic infrastructure for NLP components. The DP algorithm verifies the continuity principle by checking that the set of common words between two consecutive sentences is not empty.

The paper is structured as follows: in Section 2, some notions about textual entailment and the logical segmentation of discourse by the Logical TextTiling method are discussed. Summarization by logical segmentation with an arbitrary length of the summary (algorithm AL) is the topic of Section 3. An original DP method for summarization and segmentation is presented in Section 4. The application of all the above methods to the text from [17] and some statistics of the results are presented in Section 5. We finish the article with conclusions and possible directions for further work.

2 Segmentation by Logical TextTiling

2.1 Text Entailment

Text entailment is an autonomous field of Natural Language Processing and has been the subject of several recent PASCAL Challenges ([7]). As we established in an earlier paper ([22]), a text T entails a hypothesis H, denoted by T → H, iff H is less informative than T. A method to prove T → H is presented in ([22]) and consists in verifying the relation sim(T,H)_T ≤ sim(T,H)_H. Here sim(T,H)_T and sim(T,H)_H are the text-to-text similarities introduced in [6]. Another method in [22] calculates the similarity between T and H by cosine and verifies that cos(T,H)_T ≤ cos(T,H)_H. The text entailment verification system used in this paper relies on this second method, treating entailment as a directional relation, because it provided the best results among the methods compared in [22].
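To make the directional check concrete, the sketch below approximates the two asymmetric similarities by word overlap normalized by the size of T and of H, respectively; this is only a minimal illustration of the inequality sim(T,H)_T ≤ sim(T,H)_H, and not the exact weighting scheme of [6] or of our implemented system.

```java
import java.util.*;

/** Minimal sketch of a directional entailment test: T entails H when T
 *  covers H better than H covers T (i.e., H is less informative than T). */
public class EntailmentSketch {

    /** Bag of lower-cased word tokens of a sentence. */
    static Set<String> words(String sentence) {
        return new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\W+")));
    }

    /** Overlap between T and H, normalized by the size of the 'normalizer' side. */
    static double directionalSim(Set<String> t, Set<String> h, Set<String> normalizer) {
        Set<String> common = new HashSet<>(t);
        common.retainAll(h);
        return normalizer.isEmpty() ? 0.0 : (double) common.size() / normalizer.size();
    }

    /** T -> H iff sim(T,H)_T <= sim(T,H)_H under this overlap interpretation. */
    static boolean entails(String t, String h) {
        Set<String> wt = words(t), wh = words(h);
        return directionalSim(wt, wh, wt) <= directionalSim(wt, wh, wh);
    }

    public static void main(String[] args) {
        String t = "The cat sat quietly on the old mat in the kitchen";
        String h = "The cat sat on the mat";
        System.out.println(entails(t, h)); // expected: true (H is less informative)
    }
}
```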

2.2 Logical Segmentation

A segment is a contiguous piece of text (a sequence of sentences) that is linked internally but disconnected from the adjacent text. The local cohesion of a logical segment is assured by the flow of relevance information from the less important to the most important sentences and back to the less important ones. Similarly to topic segmentation ([10,12,11]), in logical segmentation we concentrate on describing what constitutes a shift in the relevance information of the discourse. Simply put, a valley (a local minimum) in the obtained logical structure is a boundary between two logical segments (see Section 5, Fig. 1). This is in accordance with the definition of a boundary as a perceptible discontinuity in the text structure ([4]), in our case a perceptible discontinuity in the connectedness between sentences.

The main differences between TextTiling (TT) and our Logical TextTiling (LTT) method are presented in [23]. The most important ones concern the way the sentences are scored and grouped. As stated in [17], cohesion relations are relations among elements in a text (references, conjunctions, lexical cohesion), while coherence relations are relations between clauses and sentences. From this point of view, logical segmentation is an indicator of the coherence of a text. The LTT segmentation algorithm has the following specification:

INPUT: A list of sentences S_1, ..., S_n and a list of scores score(S_1), ..., score(S_n)
OUTPUT: A list of segments Seg_1, ..., Seg_N (see Section 5)

The obtained logical segments can be used effectively in summarization. In this respect, the method of summarization falls into the discourse-based category. In contrast with other theories of discourse segmentation, such as the Rhetorical Structure Theory (RST) of Mann and Thompson (1988), the attentional/intentional structure of Grosz and Sidner (1986), or the parsed RST tree of Daniel Marcu (1996), our Logical TextTiling method (like the TextTiling method [10]) performs a linear segmentation (versus a hierarchical segmentation), which is an advantage from a computational viewpoint.
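As an illustration of the segmentation step, the following sketch places a boundary after every local minimum of the logical scores; the treatment of plateaus and of the first and last sentences is a simplification of ours and may differ from the implemented LTT algorithm.

```java
import java.util.*;

/** Minimal sketch of LTT-style linear segmentation: score[i] is the number of
 *  sentences entailed by sentence i, and a new segment starts after every
 *  local minimum ("valley") of the score sequence. */
public class LogicalTextTilingSketch {

    /** Returns, for each segment, the list of sentence indices it contains. */
    static List<List<Integer>> segment(int[] score) {
        List<List<Integer>> segments = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < score.length; i++) {
            current.add(i);
            boolean localMinimum = i > 0 && i < score.length - 1
                    && score[i] < score[i - 1] && score[i] < score[i + 1];
            if (localMinimum) {                 // valley => segment boundary
                segments.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) segments.add(current);
        return segments;
    }

    public static void main(String[] args) {
        int[] score = {3, 5, 2, 4, 6, 1, 3, 2};
        System.out.println(segment(score));
        // prints [[0, 1, 2], [3, 4, 5], [6, 7]]: boundaries after the valleys at positions 2 and 5
    }
}
```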

3 Summarization by Segmentation

3.1 Scoring the Segments

Given a set of N segments, we need a criterion to select the sentences from a segment which will be introduced in the summary. Thus, after the score of a sentence is calculated, we calculate a score for each segment. The final score, Score_final, of a sentence is weighted by the score of the segment which contains it. The summary is generated by selecting from each segment a number of sentences proportional to the score of the segment. The method has some advantages when a desired level of granularity of the summarization is imposed. The notations used in this calculation are listed below (a short code sketch of the scoring follows the list):



– Score(S_i) = the number of sentences implied by S_i
– Score(Seg_j) = ( Σ_{S_i ∈ Seg_j} Score(S_i) ) / |Seg_j|
– Score_final(S_i) = Score(S_i) × Score(Seg_j), where S_i ∈ Seg_j
– Score(Text) = Σ_{j=1}^{N} Score(Seg_j)
– Weight of a segment: C_j = Score(Seg_j) / Score(Text), with 0 < C_j < 1
– N = the number of segments obtained by the LTT algorithm
– X = the desired length of the summary
– NSenSeg_j = the number of sentences selected from the segment Seg_j
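The sketch below shows how these quantities can be computed from the per-sentence logical scores and a given segmentation; variable names mirror the notation above, and the code is only an illustration of the formulas, not the actual implementation.

```java
import java.util.*;

/** Minimal sketch of the segment-level scoring used before summary extraction. */
public class SegmentScoringSketch {

    /** Score(Seg_j) = average logical score of the sentences in the segment. */
    static double segmentScore(List<Integer> segment, int[] sentenceScore) {
        double sum = 0.0;
        for (int i : segment) sum += sentenceScore[i];
        return sum / segment.size();
    }

    /** C_j = Score(Seg_j) / Score(Text), where Score(Text) is the sum of all segment scores. */
    static double[] segmentWeights(List<List<Integer>> segments, int[] sentenceScore) {
        double[] segScore = new double[segments.size()];
        double textScore = 0.0;
        for (int j = 0; j < segments.size(); j++) {
            segScore[j] = segmentScore(segments.get(j), sentenceScore);
            textScore += segScore[j];
        }
        double[] weight = new double[segments.size()];
        for (int j = 0; j < segments.size(); j++) weight[j] = segScore[j] / textScore;
        return weight;
    }

    /** Score_final(S_i) = Score(S_i) * Score(Seg_j), for S_i in Seg_j. */
    static double finalScore(int i, List<Integer> segment, int[] sentenceScore) {
        return sentenceScore[i] * segmentScore(segment, sentenceScore);
    }

    public static void main(String[] args) {
        int[] score = {3, 5, 2, 4, 6, 1, 3, 2};          // logical scores per sentence
        List<List<Integer>> segs = Arrays.asList(
                Arrays.asList(0, 1, 2), Arrays.asList(3, 4, 5), Arrays.asList(6, 7));
        System.out.println(Arrays.toString(segmentWeights(segs, score))); // weights sum to 1
    }
}
```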

The summarization algorithm with an Arbitrary Length of the summary (AL) is the following:

INPUT: The segments Seg_1, ..., Seg_N, the length of the summary X (as a parameter), and Score_final(S_i) for each sentence S_i;
OUTPUT: A summary SUM of length X, in which NSenSeg_j sentences are selected from each segment Seg_j. The method of selecting the sentences is given by the definitions Sum1, Sum2, Sum3 (Section 3.2).

Calculate the weights of the segments (C_j), rank them in decreasing order, and rank the segments Seg_j accordingly. Calculate NSenSeg_j: while the number of selected sentences does not equal X, set NSenSeg_j = min(|Seg_j|, Integer(X × C_j)) if Integer(X × C_j) ≥ 1, or NSenSeg_j = 1 if Integer(X × C_j) < 1. Finally, reorder the selected sentences as in the initial text.

Remarks: Some segments Seg_j may have NSenSeg_j > 1. If X < N, then some segments Seg_j must have NSenSeg_j = 0.
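The allocation step of AL can be sketched as follows. The single pass over the segments ranked by weight, with the remaining quota capped at X, is a simplification of ours for the "while" loop described above.

```java
import java.util.*;

/** Minimal sketch of the AL allocation: decide how many sentences
 *  (nSenSeg[j]) to take from each segment for a summary of length X. */
public class ALAllocationSketch {

    static int[] allocate(int[] segmentSize, double[] weight, int X) {
        int n = segmentSize.length;
        Integer[] order = new Integer[n];
        for (int j = 0; j < n; j++) order[j] = j;
        // Rank segments by decreasing weight C_j.
        Arrays.sort(order, (a, b) -> Double.compare(weight[b], weight[a]));

        int[] nSenSeg = new int[n];
        int selected = 0;
        for (int idx : order) {
            if (selected >= X) break;            // if X < N, remaining segments get 0
            int quota = (int) (X * weight[idx]); // Integer(X * C_j)
            if (quota < 1) quota = 1;            // at least one sentence if the segment is reached
            quota = Math.min(quota, segmentSize[idx]);
            quota = Math.min(quota, X - selected);
            nSenSeg[idx] = quota;
            selected += quota;
        }
        return nSenSeg;
    }

    public static void main(String[] args) {
        int[] size = {3, 3, 2};                  // |Seg_1|, |Seg_2|, |Seg_3|
        double[] weight = {0.5, 0.3, 0.2};       // C_1, C_2, C_3
        System.out.println(Arrays.toString(allocate(size, weight, 4)));
        // prints [2, 1, 1]: two sentences from the heaviest segment, one from each of the others
    }
}
```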

3.2 Strategies for Summary Calculus

The method of extracting sentences from the segments is decisive for the quality of the summary. The deletion of an arbitrary amount of source material between two sentences which are adjacent in the summary has the potential of losing essential information. We propose and compare some simple strategies for efficiently including sentences in the summary.

Our first strategy is to include in the summary the first sentence from each segment, as this is of special importance for the segment. The corresponding method is:

Definition 1. Given a segmentation of the initial text, T = {Seg_1, ..., Seg_N}, the summary is calculated as

Sum1 = {S'_1, ..., S'_X}

where, for each segment Seg_i, i = 1, ..., N, the first NSenSeg_i sentences are selected.

The second idea is that, for each segment, the sentence(s) which imply a maximal number of other sentences are considered the most important for this segment, and hence they are included in the summary. The corresponding method is:

Definition 2. Given a segmentation of the initial text, T = {Seg_1, ..., Seg_N}, the summary is calculated as

Sum2 = {S'_1, ..., S'_X}

where S'_k = argmax_j {score(S_j) | S_j ∈ Seg_i}. For each segment Seg_i, a number NSenSeg_i of different sentences S'_k are selected.

The third way of reasoning is that from each segment the most informative sentence(s) (the least similar) relative to the previously selected sentences are picked up. The corresponding method is:

Definition 3. Given a segmentation of the initial text, T = {Seg_1, ..., Seg_N}, the summary is calculated as

Sum3 = {S'_1, ..., S'_X}

where S'_k = argmin_j {sim(S_j, Seg_{i-1}) | S_j ∈ Seg_i}. Again, for each segment Seg_i, a number NSenSeg_i of different sentences S'_k are selected. The value sim(S_j, Seg_{i-1}) represents the similarity between S_j and the last sentence selected from Seg_{i-1}. The similarity between two sentences S, S' is calculated in this work by cos(S, S').
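The three selection strategies can be sketched as follows; the cosine in Sum3 is reduced here to an overlap-based cosine over word sets, and Sum3 is shown for the case NSenSeg_i = 1, both simplifications of ours.

```java
import java.util.*;

/** Minimal sketch of the three per-segment selection strategies. */
public class SelectionStrategiesSketch {

    /** Sum1: the first nSen sentences of the segment. */
    static List<Integer> sum1(List<Integer> segment, int nSen) {
        return new ArrayList<>(segment.subList(0, Math.min(nSen, segment.size())));
    }

    /** Sum2: the nSen best-scored sentences of the segment. */
    static List<Integer> sum2(List<Integer> segment, int[] score, int nSen) {
        List<Integer> sorted = new ArrayList<>(segment);
        sorted.sort((a, b) -> Integer.compare(score[b], score[a]));
        return sorted.subList(0, Math.min(nSen, sorted.size()));
    }

    /** Sum3: the sentence least similar (cosine over word sets) to the last
     *  sentence selected from the previous segment. Shown here for nSen = 1. */
    static int sum3(List<Integer> segment, String[] sentences, String previousSelected) {
        int best = segment.get(0);
        double bestSim = Double.MAX_VALUE;
        for (int i : segment) {
            double sim = cosine(sentences[i], previousSelected);
            if (sim < bestSim) { bestSim = sim; best = i; }
        }
        return best;
    }

    static double cosine(String a, String b) {
        Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> common = new HashSet<>(wa);
        common.retainAll(wb);
        return common.size() / (Math.sqrt(wa.size()) * Math.sqrt(wb.size()));
    }

    public static void main(String[] args) {
        String[] sentences = {"The cat sat on the mat", "Dogs chase cats in the park", "Birds sing at dawn"};
        int[] score = {2, 3, 1};
        List<Integer> seg = Arrays.asList(0, 1, 2);
        System.out.println(sum1(seg, 1) + " " + sum2(seg, score, 1) + " "
                + sum3(seg, sentences, "The cat sat on the mat")); // [0] [1] 2
    }
}
```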

4 Dynamic Programming Algorithm

4.1 Summarization by Dynamic Programming

In order to meet the coherence requirement for the summary, our method selects a chain of sentences with the property that two consecutive sentences have at least one common word. This corresponds to the continuity principle in centering theory, which requires that two consecutive units of discourse have at least one entity in common ([18]). We make the assumption that each sentence is associated with a score that reflects how representative that sentence is. In these experiments the scores are the logical scores used in the previous sections. The score of a selected summary is the sum of the individual scores of the contained sentences, and the summary is selected such that its score is maximal.

The idea of the Dynamic Programming algorithm for summarization is the following: let δ_i^k be the score of the best summary of length k that begins with the sentence S_i. This score is calculated as

δ_i^k = max_{j, i<j} ( score(S_i) + δ_j^{k-1} ),

where the maximum is taken over those sentences S_j (j > i) that have common words with S_i.
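A minimal sketch of this recurrence is given below; the exclusion of sentence pairs without common words (via a minus-infinity score), the backtracking step, and the absence of stop-word filtering are choices of ours and not necessarily those of the implemented system.

```java
import java.util.*;

/** Minimal sketch of the DP summarizer: pick a chain of k sentences,
 *  consecutive ones sharing at least one word, with maximal total score. */
public class DPSummarySketch {

    static final double NEG_INF = Double.NEGATIVE_INFINITY;

    /** Returns the indices of the selected sentences (in text order), or null if no valid chain exists. */
    static List<Integer> summarize(String[] sentences, int[] score, int k) {
        int n = sentences.length;
        double[][] delta = new double[n][k + 1];    // delta[i][len]: best chain of 'len' starting at i
        int[][] next = new int[n][k + 1];           // backtracking pointers

        for (int i = 0; i < n; i++) {
            delta[i][1] = score[i];
            Arrays.fill(next[i], -1);
        }
        for (int len = 2; len <= k; len++) {
            for (int i = 0; i < n; i++) {
                delta[i][len] = NEG_INF;
                for (int j = i + 1; j < n; j++) {
                    if (delta[j][len - 1] == NEG_INF || !shareWord(sentences[i], sentences[j])) continue;
                    double candidate = score[i] + delta[j][len - 1];   // continuity principle holds for (i, j)
                    if (candidate > delta[i][len]) { delta[i][len] = candidate; next[i][len] = j; }
                }
            }
        }
        // Best starting sentence, then follow the pointers.
        int start = -1;
        for (int i = 0; i < n; i++)
            if (delta[i][k] != NEG_INF && (start == -1 || delta[i][k] > delta[start][k])) start = i;
        if (start == -1) return null;
        List<Integer> summary = new ArrayList<>();
        for (int i = start, len = k; i != -1; i = next[i][len], len--) summary.add(i);
        return summary;
    }

    static boolean shareWord(String a, String b) {
        Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        wa.retainAll(Arrays.asList(b.toLowerCase().split("\\W+")));
        return !wa.isEmpty();
    }
}
```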