Baseline Keyphrase Extraction Methods from Hebrew News HTML Documents

Yaakov HaCohen-Kerner, Ittay Stern, David Korkus
Department of Computer Science, Jerusalem College of Technology (Machon Lev)
21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel

Abstract: Most documents do not include keyphrases. There are a few keyphrase extraction systems for documents written in English; however, there is no such system for the Hebrew language. In this ongoing work, we investigate baseline methods that extract keyphrases from Hebrew news HTML documents. These methods have been tested on a set of documents, each accompanied by a file containing keyphrases extracted by students who read the original documents. The two best baseline methods were found to be Term Frequency (TF) and First N Terms (FN). These results are similar to those reported for documents written in English.

Key-Words: Extraction, HTML Documents, Keyphrases, Keywords, Text Summarization, Hebrew

1 Introduction

The explosion of information is hard to handle, and reading everything may be very time-consuming. Various kinds of summaries (e.g.: headlines, abstracts and conclusions) enable people to decide whether they are willing to read the whole text or not. Keyphrases, which can be regarded as very short summaries, may help even more; for instance, keyphrases can serve as an initial filter when retrieving documents. Unfortunately, most documents do not include keyphrases. There are a few automatic keyphrase extraction systems for documents written in English; however, there is no such system for the Hebrew language. In this ongoing work, we investigate baseline methods that extract keyphrases from Hebrew news HTML documents. This paper is organized as follows. Section 2 gives background concerning the extraction of keyphrases, the Hebrew language, and baseline extraction methods. Section 3 describes the proposed model. Section 4 presents the experiments we have carried out. Section 5 concludes and proposes future directions for research.

2 Background

2.1 Extraction of Keyphrases

A keyphrase is an important concept, presented either as a single word (unigram), e.g.: 'learning', or as a collocation, i.e., a meaningful group of two or more words, e.g.: 'machine learning' and 'natural language processing'. Keyphrases provide general information about the contents of a document and can be seen as an additional kind of document abstraction. The basic idea of keyphrase extraction for a given article is to build a list of words and collocations sorted in descending order of frequency, while filtering general terms and normalizing similar terms (e.g., "similar" and "similarity"). The filtering is done using a stop-list of closed-class words such as articles, prepositions and pronouns. The most frequent terms are selected as keyphrases, since we assume that the author repeats important words as he advances and elaborates. One system that applies this method, among other basic methods, was developed by HaCohen-Kerner [8]; it extracts keyphrases for academic papers written in English from their abstracts and titles. Three other keyphrase extraction systems that deal with whole English documents are discussed below. Turney [17] developed a keyphrase extraction system that uses a few baseline extraction methods, e.g.: TF (term frequency), FA (first appearance of a phrase from the beginning of its document, normalized by the number of words in the document) and TL (length of a phrase in number

of words). The best results were achieved by a genetic algorithm called GenEx. Subjective human evaluation suggests that about 80% of the extracted keyphrases are acceptable to human readers. Frank et al. [5] propose another keyphrase extraction system called Kea. They use two baseline extraction methods: TFxIDF (how important a phrase is to its document) and distance (the distance of the first appearance of a phrase from the beginning of its document, in number of words). In addition, they apply the naive Bayes learning scheme. They show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited, and that Kea achieves results similar to those of GenEx. Moreover, their naive Bayes learning method is much quicker than the genetic algorithm applied in GenEx. Humphreys [11] proposes a keyphrase extractor for HTML documents. Her method finds important HTML tokens and phrases, determines a weight for each word in the document (biasing in favor of words in the introductory text), and uses a harmonic mean measure called RatePhrase to rank phrases. Her conclusion is that RatePhrase performs as well as GenEx. However, no quantitative evaluation results are presented, and no machine learning method is used.
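To make the frequency-based idea described above concrete, the following minimal Python sketch counts candidate unigrams after stop-list filtering and returns the most frequent ones. It is our own illustration, not the implementation of any of the systems discussed; the toy English stop list and the simple regular-expression tokenizer are simplifying assumptions.

import re
from collections import Counter

# Toy stop list; a real system uses a language-specific list of a few
# hundred closed-class words (articles, prepositions, pronouns, ...).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "are", "for"}

def top_frequent_terms(text, n=9):
    """Return the n most frequent non-stop-list unigrams (the TF idea)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]

if __name__ == "__main__":
    sample = ("Machine learning and natural language processing are used "
              "for keyphrase extraction. Learning improves extraction.")
    print(top_frequent_terms(sample, n=3))
    # e.g. ['learning', 'extraction', 'machine']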

2.2 The Hebrew Language

The issue of Hebrew features is important to our task - extraction of keyphrases - for the following reasons: (1) all extraction methods should be adapted to the Hebrew language, and (2) we aim to normalize Hebrew words in order to count repetitions of the same keyphrases. There are various kinds of features, e.g.: conjugations, verb types, subjects, prepositions, belonging, objects and terminal letters. In this model, we have used most of them. Detailed discussions concerning these features and their application are given in HaCohen-Kerner et al. [10]. In Hebrew, it is impossible to find the declensions of a certain stem without an exact morphological analysis based on the features mentioned above. While the English language is richer in its vocabulary than Hebrew (English has about 40,000 stems while Hebrew has only about 4,000, and the number of lexical entries in the English dictionary is 150,000 compared with only 40,000 in the Hebrew dictionary), the Hebrew language is richer in its morphological forms.

For example, the single Hebrew word vkhsyochlhu is translated into the following sequence of six English words: "and when they will eat it". Whereas the Hebrew verb takes on many different forms, the English verb hardly changes. In Hebrew, there are up to seven thousand declensions for a single stem, while in English there are only a few. For example, the English word eat has only four declensions (eats, eating, eaten and ate), whereas the relevant Hebrew stem ‘khl ("eat") has thousands of declensions. (The Hebrew transliteration used in this paper is taken from the web site of the Princeton University library: http://infoshare1.princeton.edu/katmandu/hebrew/trheb.html.) Seven of them are presented below: (1) ‘khlty ("I ate"), (2) ‘khlt ("you ate"), (3) ‘khlnv ("we ate"), (4) ‘khvl ("he eats"), (5) ‘khvlym ("they eat"), (6) ‘tkhl ("she will eat") and (7) l‘khvl ("to eat"). For a more detailed discussion of Hebrew grammar from the viewpoint of computational linguistics, refer to Wintner [19]; for Hebrew grammar, refer either to [6 and 18] in English or to [20] in Hebrew. There are various information retrieval systems for Hebrew texts, for instance: (1) Morfix [14] - a search engine whose central task is to retrieve Hebrew web sites that include declensions of the input word, (2) the Responsa Project [2, 3 and 16] - an information retrieval system that provides access to religious Jewish writings, and (3) the Hebrew Google site [7] - which enables the use of Hebrew words within the Google search engine. All these systems enable retrieval based on either a given word or a keyphrase. However, none of them can extract keyphrases from a given document.
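To illustrate why such normalization matters when counting keyphrases, the following toy Python sketch merges a few transliterated declensions of ‘khl under one stem. The small lookup table is only a stand-in for a real morphological analyzer and is not part of the system described in this paper.

from collections import Counter

# Toy lookup table: a few transliterated declensions of 'khl ("eat")
# mapped to their stem. A real system requires full morphological analysis.
STEM_OF = {
    "'khlty": "'khl",   # "I ate"
    "'khlt":  "'khl",   # "you ate"
    "'khvl":  "'khl",   # "he eats"
    "l'khvl": "'khl",   # "to eat"
}

def normalize(token):
    """Map a surface form to its stem when known, otherwise keep it as is."""
    return STEM_OF.get(token, token)

def stem_counts(tokens):
    """Count tokens after normalization so declensions of one stem merge."""
    return Counter(normalize(t) for t in tokens)

print(stem_counts(["'khlty", "'khvl", "l'khvl", "dbr"]))
# Counter({"'khl": 3, 'dbr': 1})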

2.3 Baseline Methods for Selecting the Most Important Keyphrases

In this section, we introduce the baseline methods we use for selecting the most important Hebrew terms as keyphrases. It is important to mention that all these methods have been adapted to the Hebrew language. In all methods, words and terms that have only a grammatical role in the language are excluded from the keyword list according to a ready-made stop-list. This stop-list contains approximately 300 high-frequency closed-class Hebrew words (e.g.: we, this, and, when, in, usually, also, near). Some of the methods are based on similar summarization methods used mostly for selecting the most important sentences (e.g., [9]); other methods were formalized by us. (A short code sketch of a few of these methods appears after the list.)

1) Term Frequency (TF): rates a term according to the number of its occurrences in the text [4 and 12].

2) Term Frequency & Term Length (TF-TL): extends the TF method by taking the Term Length (TL) property into consideration. TL rates a term according to the number of words included in the term. We did not use TL as an independent method because its results were very poor. We tried two combinations for the definition of TF-TL: (1) TL*TF [17] and (2) TF*log2(TL) [15]. The second combination yields better results; the idea is that, since TL is a weaker indicator than TF, TF is multiplied by log2(TL) rather than by TL itself.

3) First N Terms (FN): only the first N terms in the document are selected. The assumption is that the most important keyphrases are found at the beginning of the document, because people tend to place important information at the beginning. This simple method provides a relatively strong baseline for the performance of any text-summarization method [1].

4) Last N Terms (LN): only the last N terms in the document are selected. The assumption is that the most important keyphrases are found at the end of the document, because people tend to place their important keyphrases in their conclusions, which usually appear near the end.

5) First and Last N Terms (FLN): only the first N/2 terms and the last N/2 terms in the document are selected. The assumption is that the most important keyphrases are found both at the beginning and at the end of the document.

6) Paragraph Importance (PI): rates a term according to the importance of its paragraph within the article: the highest value is given to the title, medium values to sub-titles and then to the first paragraph, and the lowest value to a regular paragraph. The assumption is that the most important keyphrases are likely to be found in the article in the order mentioned above.

7) Term Importance (TI): rates a term according to its own importance within the article: the highest values are given to bold and cited terms, and the lowest values to regular words and words in brackets. The assumption is that the most important keyphrases are likely to be emphasized.

8) Sentence at the Beginning (SB): rates a term according to the relative position of its sentence within the article. The assumption is that the most important keyphrases are likely to be found within the first sentences of the document.

9) Sentence at the End (SE): rates a term according to the relative position of its sentence within the article. The assumption is that the most important keyphrases are likely to be found within the last sentences of the document.

10) Paragraph at the Beginning (PB): rates a term according to the relative position of its paragraph within the article. The assumption is that the most important keyphrases are likely to be found within the first paragraphs of the document.

11) Paragraph at the End (PE): rates a term according to the relative position of its paragraph within the article. The assumption is that the most important keyphrases are likely to be found within the last paragraphs of the document.

12) At the Beginning of its Paragraph (BP): rates a term according to its relative position within its paragraph. The assumption is that the most important keyphrases are likely to be found close to the beginning of their paragraphs.

13) At the End of its Paragraph (EP): rates a term according to its relative position within its paragraph. The assumption is that the most important keyphrases are likely to be found close to the end of their paragraphs.
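As a minimal illustration, the following Python sketch shows how the TF, FN and FLN scores might be computed, assuming the candidate terms have already been stop-list filtered and normalized as described above. It is illustrative only, not the paper's implementation; the example terms are invented.

from collections import Counter

def tf_scores(terms):
    """1) TF: rate each candidate term by its number of occurrences."""
    return Counter(terms)

def first_n_terms(terms, n=9):
    """3) FN: keep only the first n candidate terms of the document."""
    return terms[:n]

def first_and_last_n_terms(terms, n=9):
    """5) FLN: keep the first n//2 and the last n//2 candidate terms."""
    half = n // 2
    return terms[:half] + terms[-half:]

# The candidates are assumed to be already filtered and normalized.
candidates = ["election", "budget", "election", "minister", "vote",
              "debate", "coalition", "budget", "election"]
print(tf_scores(candidates).most_common(3))
# [('election', 3), ('budget', 2), ('minister', 1)]
print(first_n_terms(candidates, n=4))
# ['election', 'budget', 'election', 'minister']
print(first_and_last_n_terms(candidates, n=4))
# ['election', 'budget', 'budget', 'election']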

3 The Model

Our model is composed of the following steps: (1) implementing an HTML file cleaner that creates consistent and clean HTML files; (2) implementing baseline methods that extract keyphrases from Hebrew HTML documents; (3) testing the model on a set of documents that have keyphrases composed by humans; and (4) comparing the extracted keyphrases with the human-composed keyphrases and analyzing the results.

3.1 HTML File Cleaner

The Hyper Text Markup Language (HTML) is the coding language for web pages on the World Wide Web (WWW). An HTML document is composed of textual content placed in tags that indicate how the text appears and what role it plays in the document. Tags are enclosed in less-than and greater-than marks.

Some common tags are: <p>, which represents a paragraph start; <h1>, which marks a first-level heading; <br>, which adds a hard break to a line and starts a new one; and <table>, which starts a set of tabular data. At present, many tags are used for design purposes only and give no clue about the contextual role of their inner text or about the document structure. Moreover, many of the tags in use are either misused or used for other purposes. Therefore, we propose two methods for cleaning HTML files: (1) the file is tested against signatures of known sites. The signatures are strings extracted from specific sites, where each string is unique to its site. If a fitting signature is found, a set of matching filters is used to locate the headings and the text, and the extracted data becomes a clean and simple new HTML file. (2) If no signature matches, we throw away all irrelevant information attached to the article: all management tags are stripped out, some tags are replaced with basic equivalent tags, and duplicate sequential tags are reduced.
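As an illustration of the second, signature-free cleaning strategy, the following Python sketch keeps only heading and paragraph text and discards everything else. It uses the standard-library html.parser module and is our simplification, not the cleaner actually used in the system; the choice of tags to keep is an assumption.

from html.parser import HTMLParser

# Tags whose inner text we keep; everything else (scripts, navigation,
# design markup) is simply discarded.
KEEP = {"title", "h1", "h2", "h3", "p"}

class SimpleCleaner(HTMLParser):
    """Collect (tag, text) pairs for heading and paragraph content only."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] in KEEP and data.strip():
            self.chunks.append((self.stack[-1], data.strip()))

cleaner = SimpleCleaner()
cleaner.feed("<html><head><script>x()</script></head>"
             "<body><h1>Title</h1><p>First paragraph.</p></body></html>")
print(cleaner.chunks)   # [('h1', 'Title'), ('p', 'First paragraph.')]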

3.2 The Basic Algorithm

In order to measure the success of the extraction methods mentioned above, we use the popular recall and precision measures. Recall is defined as the number of keyphrases that appear both among the system's keyphrases and among the keyphrases extracted by the students, divided by the number of keyphrases extracted by the students. Precision is defined as the number of keyphrases that appear both among the system's keyphrases and among the keyphrases extracted by the students, divided by the number of keyphrases extracted by the system. A full match is a repetition of the same keyphrase, that is, a repetition of all the words included in the keyphrase. A partial match between two different keyphrases occurs when both keyphrases share at least one word. All other pairs of keyphrases are regarded as failures. Using each of the baseline methods for selecting the most important keyphrases (Section 2.3), our system chooses the N most highly weighted words or collocations. The value of N has been set at 9 for the following reasons: (1) This is the maximal number of items that an average person is able to remember without apparent effort, according to the cognitive rule called "7 ± 2" [13]. This means that an average person is capable of remembering approximately

between 5 and 9 information items over a relatively short term. (2) The value of 9 was also chosen by Humphreys [11] as the number of keyphrases to include in her system's summary. (3) Except for one document, 9 is greater than the number of keyphrases extracted by the students for the checked documents. (4) In general, our goal is to achieve a rather high recall rate. The reason we prefer the recall measure is that the main purpose of our system is to help the user obtain a rather short list of keyphrases that still includes as many as possible of the keyphrases appearing in the list composed by humans. We assume that a user would prefer to have most of the relevant keyphrases with a few unnecessary ones, rather than a highly precise list from which many important keyphrases are missing. We have also taken into consideration the fact that meaningless keyphrases can easily be filtered out by a human. In our model, only keyphrases that are unigrams, 2-grams or 3-grams are treated, because in our database only about 1% of the keyphrases consist of 4 words or more. The full statistics concerning the word length of keyphrases are presented in the next section.
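A small Python sketch of this evaluation scheme is given below. It counts matches at the "partial matches and up" level only; the paper reports full and partial matches separately, and its exact handling of repeated matches is not specified, so this is an approximation with invented example phrases.

def match_type(system_phrase, human_phrase):
    """'full' if both phrases contain exactly the same words, 'partial' if
    they share at least one word, otherwise None (a failure)."""
    s, h = set(system_phrase.split()), set(human_phrase.split())
    if s == h:
        return "full"
    if s & h:
        return "partial"
    return None

def recall_precision(system_phrases, human_phrases):
    """Recall: matched human phrases / all human phrases.
    Precision: matched system phrases / all system phrases."""
    matched_human = sum(
        any(match_type(sp, hp) for sp in system_phrases)
        for hp in human_phrases)
    matched_system = sum(
        any(match_type(sp, hp) for hp in human_phrases)
        for sp in system_phrases)
    return (matched_human / len(human_phrases),
            matched_system / len(system_phrases))

system = ["machine learning", "israel", "budget vote"]
human = ["machine learning", "budget", "elections"]
print(recall_precision(system, human))   # 2 of 3 matched on each side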

4 Experiments

4.1 Data Sets

We have constructed a dataset containing 100 Hebrew HTML documents taken from http://www.ynet.co.il (the web site of Yedioth Ahronoth, the most widespread daily newspaper in Israel). Each document contains on average 241 words. Each document has an accompanying file containing between 2 and 11 keyphrases extracted by two students who read the original documents. The keyphrases have been extracted according to the following three guidelines: (1) keyphrases are not included in the stop-list, (2) keyphrases are extracted only from terms that exist in the document, and (3) they are relevant to the main topics mentioned in the document. The documents include a total of 480 keyphrases, that is, 4.8 keyphrases per document on average. Tables 1-4 present various statistics concerning these documents. The distribution of domains is described in Table 1, and the size of documents in words in Table 2. Statistics computed in the pre-processing stage give the following distribution: 257 out of the 480 keyphrases (53%) were unigrams, 187 keyphrases (39%) were 2-grams, 32 keyphrases (6%) were 3-grams and 4 keyphrases (1%) were 4 words or more. The distribution of the number of keyphrases per document is described in Table 3, and the distribution of the number of words per keyphrase in Table 4.

Table 1. Distribution of domains.

Domain                        # of documents
Architecture & Environment    29
Motorcars                     24
Politics                      23
World News                    12
Sport & Tourism               12

Table 2. Size of documents in words.

# of words per document    # of documents
Until 150                  29
150-199                    24
200-299                    23
300 and up                 28

Table 3. Distribution of # keyphrases per document.

# of keyphrases per document    # of documents
2                               3
3                               16
4                               32
5                               21
6                               12
7                               10
8 and up                        6

Table 4. Distribution of # words per keyphrase.

# of words per keyphrase    # of keyphrases
1                           257
2                           187
3                           32
4 and up                    4

4.2 General Results

For each method, the 9 keyphrases extracted by the system are compared to the keyphrases extracted by the students. Tables 5 and 6 show the recall and the precision results, respectively.

Table 5. Summarization of recall results.

#    Method    Full Matches in %    Partial Matches in %    Partial Matches and up in %
1    TF        55.6                 11.9                    67.5
2    TF-TL     54.6                  8.5                    63.1
3    FN        44.4                 15.2                    59.6
4    LN        16.0                  7.1                    23.1
5    FLN       31.5                 11.5                    42.9
6    PI        38.1                  7.1                    45.2
7    TI        37.3                  8.1                    45.4
8    SB        37.9                 10.0                    47.9
9    SE        19.4                  4.8                    24.2
10   PB        24.6                  9.2                    33.7
11   PE        26.3                 10.6                    36.9
12   BP        36.9                  8.8                    45.6
13   EP        25.6                  6.9                    32.5

Table 6. Summarization of precision results.

#    Method    Full Matches in %    Partial Matches in %    Partial Matches and up in %
1    TF        29.7                  6.3                    36.0
2    TF-TL     29.1                  4.6                    33.7
3    FN        23.7                  8.1                    31.8
4    LN         8.6                  3.8                    12.3
5    FLN       16.8                  6.1                    22.9
6    PI        13.1                  4.9                    18.0
7    TI        14.0                  5.7                    19.7
8    SB        19.7                  4.7                    24.3
9    SE        20.3                  3.8                    24.1
10   PB        19.9                  4.3                    24.2
11   PE        13.7                  3.7                    17.3
12   BP        20.2                  5.3                    25.6
13   EP        10.3                  2.6                    12.9

The best baseline method was found to be Term Frequency (TF). For partial and full matches together it achieves a recall rate of about 67.5% and a precision rate of about 36%. Most of TF's success is due to its full matches (267 out of 324 matches at the partial-and-up level). In contrast to our hypothesis that TF-TL would achieve better results than TF, we found that the TF method was slightly better; therefore, TF-TL, at least in its current version, is unnecessary. Another promising method is First N Terms (FN), which achieves a recall rate of about 60% and a precision rate of about 32%. These results are similar to those discovered by Frank et al. [5] for documents written in English.

5 Conclusions and Future Work

Our basic system is the first to extract keyphrases from Hebrew files. The two best baseline methods were found to be Term Frequency (TF) and First N Terms (FN). These results are similar to those reported for documents written in English. Future directions for research are: (1) developing unique Hebrew methods and methods based on domain-dependent cue phrases for keyphrase extraction, (2) checking the baseline methods on different domains, (3) applying machine-learning techniques in order to find the most effective combination of these baseline methods, and (4) conducting more experiments using additional documents from additional domains.

References:
[1] Brandow, B., Mitze, K., Rau, L.F.: Automatic Condensation of Electronic Publications by Sentence Selection. Information Processing and Management, Vol. 31, No. 5, 1994, pp. 675-685.
[2] Choueka, Y.: Full-Text Systems and Research in the Humanities. Computers and the Humanities, Vol. 14, 1980, pp. 153-169.
[3] Choueka, Y., Klein, S.T., Neuvitz, E.: Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal of the Association for Literary and Linguistic Computing, Vol. 4, 1983, pp. 34-38.
[4] Edmundson, H.P.: New Methods in Automatic Extraction. Journal of the ACM, Vol. 16, No. 2, 1969, pp. 264-285.
[5] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-Specific Keyphrase Extraction. Proc. IJCAI. Morgan Kaufmann, 1999, pp. 668-673.
[6] Glinert, L.: Hebrew - An Essential Grammar. Routledge, London, 1994.
[7] Hebrew Google. http://www.google.co.il, 2004.

[8] HaCohen-Kerner, Y.: Automatic Extraction of Keywords from Abstracts. Proceedings of the Seventh International Conference on Knowledge-Based Intelligent Information & Engineering Systems, Vol. 1, Lecture Notes in Artificial Intelligence 2773, Springer-Verlag, Berlin Heidelberg New York, 2003, pp. 843-849.
[9] HaCohen-Kerner, Y., Malin, E., Chasson, I.: Summarization of Jewish Law Articles in Hebrew. Proceedings of the 16th International Conference on Computer Applications in Industry and Engineering, Las Vegas, Nevada, USA. Cary, NC: International Society for Computers and Their Applications (ISCA), 2003, pp. 172-177.
[10] HaCohen-Kerner, Y., Badlov, A., Filgut, A.: Finding the Correct Stem of a Hebrew Word Using Contexts and Declensions. Submitted for review to WSEAS Transactions on Computers, 2004.
[11] Humphreys, K.J.B.: Phraserate: An HTML Keyphrase Extractor. Technical report, University of California, Riverside, Riverside, California, 2002.
[12] Luhn, H.P.: The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, Vol. 2, No. 2, 1958, pp. 159-165.
[13] Miller, G.A.: The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review, Vol. 63, 1956, pp. 81-97.
[14] Morfix. http://www.morfix.co.il, 2004.
[15] Neto, J.L., Freitas, A.A., Kaestner, C.A.A.: Automatic Text Summarization Using a Machine Learning Approach. Proc. SBIA, 2002, pp. 205-215.
[16] The Responsa Project. http://www.biu.ac.il/ICJI/Responsa/index.html, 2004.
[17] Turney, P.: Learning Algorithms for Keyphrase Extraction. Information Retrieval Journal, Vol. 2, No. 4, 2000, pp. 303-336.
[18] Wartski, I.: Hebrew Grammar and Explanatory Notes. The Linguaphone Institute, London, 1900.
[19] Wintner, S.: Hebrew Computational Linguistics: Past and Future. Artificial Intelligence Review, Vol. 21, No. 2, 2004, pp. 113-138.
[20] Yelin, D.: Dikduk HaLason HaIvrit (Hebrew Grammar, in Hebrew), Jerusalem, 1970.

Acknowledgment:

We thank Avishai Badlov, Adi Filgut, Ari Cirota, Asaf Masa and Zuriel Gross for their help.