Information Processing and Management 41 (2005) 1263–1276 www.elsevier.com/locate/infoproman

Multiple sets of features for automatic genre classification of web documents

Chul Su Lim a,*, Kong Joo Lee b,1, Gil Chang Kim a,2

a Division of Computer Science, Department of EECS, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea
b School of Computer and Information Technology, KyungIn Women's College, 101 Kyesan-dong, Gyeyang-gu, Incheon 407-740, South Korea

Received 7 December 2003; accepted 11 June 2004
Available online 10 July 2004

Abstract

With the increase of information on the Web, it is difficult to find desired information quickly among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. A genre or style is another view of a document, different from its subject or topic, and it is also a criterion for classifying documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the frequency of a certain word, etc. However, web documents differ from purely textual documents in that they contain URLs and HTML tags within their pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work evaluates the performance of the proposed sets of features and discusses their characteristics. Finally, we conclude which sets of features are appropriate for automatic genre classification of web documents.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Automatic genre classification; Web documents; URL; HTML tags

* Corresponding author. Tel.: +82 42 869 3551; fax: +82 42 869 3510.
E-mail addresses: [email protected] (C.S. Lim), [email protected] (K.J. Lee), [email protected] (G.C. Kim).
1 Tel.: +82 32 540 0138.
2 Tel.: +82 42 869 3551; fax: +82 42 869 3510.

0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2004.06.004


1. Introduction

In the past several years, there has been increasing interest in how to present the results of a search engine to users. The majority of conventional search systems return a huge ranked list of the web documents that match a user's query. The high recall and low precision of a search engine, coupled with this huge list, make it difficult for users to find the information they are looking for. Clustering is currently becoming one of the most important techniques for dealing with this problem on the Web. This approach organizes the documents into clusters divided by a variety of criteria, such as a subject, a page's URL, or a title.

Subject-based clustering for web documents has been studied extensively, and a wide range of approaches has been explored over the past years. Term-based clustering and link-based clustering are the two major approaches to automatic subject clustering (Wang & Kitsuregawa, 2002). Term-based clustering is based on common terms shared among documents, while a link-based approach is based on the assumption that a link can provide an objective opinion about the subject of the pages it points to. Several studies have applied clustering techniques to implementing the user interface of a web search engine (Cutting, Karger, & Pedersen, 1993; Karlgren, Bretan, Dewe, Hallberg, & Wolkert, 1998; Zamir & Etzioni, 1999). By using such an interface, users can browse the retrieved documents according to their subjects.

Although documents can be grouped successfully according to their subjects, there are big differences in style among the documents in a cluster (Lee & Myaeng, 2002). For instance, the documents grouped under the subject 'golf games' vary widely in their styles, such as homepages, news articles, collections of images, and so on. When users want to find documents related to 'golf games', they are looking for homepages sometimes, and news articles at other times. Therefore, the style or genre of a document can be considered a second view for representing a document, besides its subject. Consequently, this genre information can help users judge relevance when browsing the results of a search engine, even in a topical clustering system. For this reason, automatic genre classification for web documents has begun to attract considerable attention in recent years.

Indeed, studies of automatic genre classification have a long history (Biber, 1986, 1992, 1995; Karlgren et al., 1998; Karlgren & Cutting, 1994; Kessler, Nunberg, & Schütze, 1997; Lee & Myaeng, 2002; Michos, Stamatatos, Fakotakis, & Kokkinakis, 1996; Stamatatos, Fakotakis, & Kokkinakis, 2000a, 2000b), but most of them deal with textual documents, not web documents. Hence, those studies are based on genres suitable for textual documents and employ features appropriate for textual documents. Only a few studies (Karlgren et al., 1998; Lee & Myaeng, 2002) dealt with web documents structured with HTML tags and adopted features specific to web documents, and no detailed evaluation of these features has been published.

In automatic genre classification, a document is represented by the values of features that are expected to express the attributes of a genre. A classifier can guess the genre of a new document based upon these values, which it has learned from a training corpus. Therefore, selecting features that can make a clear distinction among the genres is the core of automatic genre classification.
In this work, we introduce as many features as possible to classify genres of web documents. First, we employ features borrowed from previous studies. These features are all extracted from general textual sources and indicate the textual properties of documents, such as the number of sentences, the frequency of a certain word, lexical ambiguity, and so on. Web documents differ from general textual documents in carrying URLs and HTML tags, from which the style of a document can be predicted. For instance, a document is likely to be a homepage if it has 'index.html' in its URL. In this paper, we newly suggest sets of features specific to web documents, extracted from URLs and HTML tags. Furthermore, we demonstrate the contribution of not only content-text features but also meta-text features. To begin with, we propose genre categories for web documents on the basis of the results obtained by Dewe, Bretan, and Karlgren (1998). Based on these


genre categories, we classify web documents by using the proposed features, evaluate the performance of these features, and discuss their characteristics. Finally, we conclude which features are appropriate for automatic genre classification of web documents.

This paper is organized as follows. First, we describe related work on automatic genre classification. The genres for web documents are introduced in Section 3, and the sets of features used in classifying genres are proposed in Section 4. The experimental results and discussion are described in Section 5. Finally, we conclude by summarizing our contributions and giving directions for future work.

2. Previous approaches to automatic web genre classification

Various types of features have been proposed for the automatic classification of text genres. Table 1 summarizes the previously used features and the new features used in this study.

Karlgren and Cutting (1994) adopted twenty simple features for genre classification: lexical counts (e.g. "therefore", "me", or "which"), POS (part-of-speech) counts (e.g. adverb, noun or present verb), and textual counts (e.g. character count, long word count, characters per word, and words per sentence). Four years later, they proposed genre categories for web documents and built a balanced corpus based on these categories (Karlgren et al., 1998). They use lexical terms, POS tags and general textual counts as features. Additionally, the number of images and the number of HREF links used in a document are adopted as features for web documents. They mention using approximately forty features, but there are no reports on their performance.

Stamatatos and his colleagues have conducted a series of studies on automatic genre classification. In Michos et al. (1996), they define five genres of documents: public affairs, scientific, journalistic, everyday communication and literary. They represent a document by four main features, formality, elegance, syntactic complexity and verbal complexity, and each of the main features is then encoded as several style markers such as the number of words per sentence, the verb-noun ratio, idiomatic expressions, formal words, etc. In Stamatatos et al. (2000b), they implement a text genre detector using common word frequencies only. They collect the 50 most frequent words from the BNC (British National Corpus), which consists of written and spoken texts, and evaluate them on the WSJ (Wall Street Journal) corpus. They also report that the most frequently used punctuation marks play an important role in discriminating a text genre. In Stamatatos et al. (2000a), they employ natural language tools such as a syntactic parser for extracting features. Unlike the previous studies, they use features extracted at the phrasal and analysis levels, such as the ratio of NPs (noun phrases) to the total number of chunks, the average number of words included in an NP, and morphological or syntactic ambiguities. They report that the result using these features is better than the one using the most frequently used words.

Table 1
The previously used features and the features used in this study

                                          Karlgren et al.    Michos et al.   Stamatatos et al.     Lee and        This
Feature class            Feature          1994      1998     1996            2000b      2000a      Myaeng 2002    study
Token information        POS              O         O        O                          O                         O
                         Textual count    O         O        O                          O                         O
Lexical information                       O         O        O               O                     O              O
Structural information                                                                  O                         O
URL                                                                                                               O
HTML tags                Images, links              O                                                             O
                         Other tags                                                                               O


They use ten textual genres and construct a corpus by downloading documents from WWW sites. However, their documents are general textual ones consisting of spoken and written language.

More recently, Lee and Myaeng (2002) present a method of automatic genre classification based on word statistics only. They use both a genre-classified training corpus and a subject-classified corpus in order to filter out from the feature set the words that are more subject-dependent than genre-dependent. They build a genre corpus with seven classes: reportages, editorials, research articles, reviews, homepages, Q&A, and Spec.

3. Web genre

3.1. Classes of web genre

It is not easy to find a well-established genre category for web documents. The only study available in the published literature on this area is Dewe et al. (1998). They classify web genres into two large categories, textual and non-textual, and then break them further into 11 categories: personal homepages, public/commercial homepages, interactive pages, link collections, other listings and tables, and error messages for non-textual documents; journalistic materials, reports, other running text, FAQs, and discussions for textual documents. Their work is our basis for defining the categories of web genres. We slightly refine their categories by adding new ones and subdividing some of them into more elaborate categories.

First of all, we would like to define the two hyper-categories more precisely: textual and non-textual. A textual category is defined as the hyper-genre in which a document has more words included in sentences (finished with final periods) than words in non-sentences such as items, lists and phrases. A non-textual category is defined as the opposite. Table 2 shows the categories of the genres we employ in this work. For the convenience of comparing our categories with those in Dewe et al. (1998), we append their corresponding genres in the last column of Table 2. The genres marked with an asterisk (*) are newly added and those with a dagger (†) are subdivided in this work.

As the number of commercial homepages such as online shopping malls continues to grow steadily, users sometimes might want to filter them out or selectively sort them from the pages retrieved by a search engine. Therefore, we subdivide the public/commercial homepages of Dewe et al. (1998) into (B) public homepages and (C) commercial homepages, even though a classifier cannot easily differentiate between them. The genre (D) is confined to pages that contain collections of links pointing to pages of various opinions, questions and answers. With the rapidly growing number of pages containing collections of multimedia files such as images and sound, we newly define the genre (F). With respect to the textual genres, we subdivide "reports" of Dewe et al. (1998) into the three classes (J), (K) and (L). The genre (J) is created for research papers with a fixed format. From a user's point of view, it is very useful to differentiate between the two genres (K) and (L), even though their documents have many properties in common. The growth of on-line shopping requires a genre for product specifications (O).

3.2. Construction of genre corpus

As we were not able to acquire a public set of documents classified by genre, we had to build a genre corpus ourselves. The corpus was constructed by two graduate students in Computer Science. The collected documents are restricted to those written in Korean in this work, even though the documents on the Web span a wide range of languages. We extract the 10 most frequently used content words from the KAIST Corpus (1996-1997) and acquire the 10 most popular search terms from a portal site (http://www.naver.com). We send these 20 queries to the Google search service and collect 1000 documents (50 documents for each query, ranked from the 30th to the 80th in Google's results).


Table 2
The web genres proposed in this paper

Web genres | Samples | Genre in Dewe et al. (1998)

Non-textual
(A) Personal homepages | Resume | Personal homepages
(B)† Public homepages | Homepages of government, institution, organization, hospital | Public/commercial homepages
(C)† Commercial homepages | Shopping mall sites | Public/commercial homepages
(D)† Bulletin collections | BBS, complaint board, notice board | Discussions
(E) Link collections | Collection of links | Link collections
(F)* Image collections | Collection of images, pictures | N/A
(G) Simple tables/lists | Simple yellow pages, mapping tables | Other listings and tables
(H) Input pages | Login page, interactive pages, search page | Interactive pages

Textual
(I) Journalistic materials | Press reportage, editorial, review | Journalistic materials
(J)† Research reports | Electronic journals, magazine, thesis | Reports
(K)† Official materials | Corporation info., classified ad., legal info., copyright materials, contact info. | Reports
(L)† Informative materials | Recipes, lecture notes, encyclopedic information | Reports
(M) FAQs | FAQ | FAQs
(N) Discussions | Pages in news groups, pages for private opinions, questions and answers | Discussions
(O)* Product specifications | Advertising pages for various products | N/A
(P) Others (informal texts) | Poem, private diary, memo, Internet fiction | Other running text

As Google ranks documents by their popularity, most of the top 30 retrieved documents tend to be assigned to the genre public/commercial homepages. Hence we exclude the top 30 retrieved documents from Google's results for every query in order not to bias the distribution of genres in the corpus. We remove 28 inappropriate pages, such as adult-only pages and error pages, out of the 1000 documents and assign genres to the rest.

Fewer than five documents were assigned to each of six genres ((A), (G), (H), (M), (N) and (P)). We cannot extract reliable features for representing a genre from such a small number of documents. Therefore, we conducted a simple survey asking people which terms come to mind when they think of these six genres, and collected three terms for each of them. For example, the terms 'education' and 'affiliation' were collected for the genre 'personal homepages', and the terms 'opinion' and 'idea' for the genre 'discussions'. By querying these terms to Google, we supplement the documents of the six genres. The total number of web documents collected is 1224.

Since the URL of a document is one of the features adopted in this work, not only the content of a document but also the domain from which it is fetched is important. Hence, we instructed the collectors not to gather the bulk of the documents from a single domain. When we merge the documents collected by the two collectors, we exclude a page if it already exists in the merged corpus. Besides, when more than five documents are fetched from the same domain and fall into the same genre, we randomly remove documents one by one until their number does not exceed five. After merging the documents, we double-check the assigned genres of the collected documents: two additional students besides the collectors annotate the genres of the collected documents by hand. When the genres annotated by the three students (including the collector) disagree for a document, we adjust the genre of the document to the majority one. Only eight documents had their genres adjusted by this double-check.
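The merging and filtering procedure just described can be illustrated with a short sketch. It is not the authors' script; the record layout (a dict with 'url' and 'genre' keys), the function name and the random tie-breaking are assumptions made for illustration only.

import random
from urllib.parse import urlparse

def merge_and_filter(collections, max_per_domain_genre=5, seed=0):
    """Merge documents from several collectors, drop exact duplicates, and cap
    the number of documents that share both a source domain and a genre.
    Each document is assumed to be a dict with 'url' and 'genre' keys."""
    rng = random.Random(seed)
    merged, seen_urls = [], set()
    for docs in collections:
        for doc in docs:
            if doc["url"] not in seen_urls:      # skip pages already in the merged corpus
                seen_urls.add(doc["url"])
                merged.append(doc)
    # group by (domain, genre) and randomly drop documents above the cap
    groups = {}
    for doc in merged:
        key = (urlparse(doc["url"]).netloc, doc["genre"])
        groups.setdefault(key, []).append(doc)
    kept = []
    for docs in groups.values():
        rng.shuffle(docs)
        kept.extend(docs[:max_per_domain_genre])
    return kept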


Table 3
The number of documents and the number of their source domains

Non-textual                                           Textual
Web genres                 # of doc.  # of domain     Web genres                   # of doc.  # of domain
(A) Personal homepages        37          23          (I) Journalistic materials      117         43
(B) Public homepages          92          92          (J) Research reports             97         41
(C) Commercial homepages      73          71          (K) Official materials          150        107
(D) Bulletin collections      74          69          (L) Informative materials       123         97
(E) Link collections          61          54          (M) FAQs                         54         52
(F) Image collections         60          48          (N) Discussions                  53         19
(G) Simple tables/lists       32          28          (O) Product specifications      114         62
(H) Input pages               48          46          (P) Others (informal texts)      39         37

Total number of documents: 1224
Total number of unique domains: 729

A document sometimes consists of two or three frames, which are separate pages with their own URLs. In our corpus, documents consist of between one and four separate frames (pages). Therefore, the total number of documents is 1224 while the total number of pages that we gathered is 1328. Furthermore, 45 documents out of the 1224 are PDF/PS files, and they all belong to the genre 'research reports'; the rest are HTML files. Table 3 shows the number of documents and the number of their source domains for each genre. The total sum of the column '# of domain' in this table is 888, but the total number of unique domains regardless of genre is 729.

4. Sets of features for classifying web genres

We use five distinct sets of features to classify the genres. They are extracted from the URL, HTML tags, token information, lexical information, and structural information of documents, respectively. Among them, URL and HTML tags, which we newly introduce as sources of features in this work, are properties that only web documents contain. Token, lexical and structural information are the sources of features widely used for genre classification in previous research, and they are common to both textual documents and web documents.

In order to extract the sets of features, we process each document with the steps shown in Fig. 1. We keep the URL information with the original web page. The pre-processor extracts the text from a web document if it is an application file such as pdf, doc, or ppt. 3 In addition, a document with multiple frames is handled in this step. After HTML parsing, we can extract the set of features related to HTML tags, such as the number of links or the number of input text boxes. Token information such as the number of sentences and the average number of words per sentence can be gathered from the output of the sentence-boundary detector. The features related to POS tokens are extracted after morphological analysis. In Korean, a word is composed of a content word and function word(s) without delimiters such as spaces; a morphological analyzer can separate a content word from its function words. That is why the features related to lexical information are also extracted after morphological analysis. By using a syntactic parser, phrase and chunk information can be collected. In the following subsections, we explain the sets of features in detail.

3 We use the IFilter interface of Microsoft Windows to extract text from doc files (Microsoft Word) and ppt files (Microsoft PowerPoint).


Fig. 1. Steps for extracting sets of features.

4.1. URL

An HTTP URL defines the location of a document on the Web. It mainly consists of a host name (domain), a directory path and a filename. The general form of a URL is as follows (Berners-Lee, Masinter, & McCahill, 1994):

  http://<host>:<port>/<path>?<searchpart>

<port> can be omitted, and <path>, '?', and <searchpart> are optional. <searchpart> is stripped from the URL in this work, as it can vary depending on the search query sent to a database. We define the depth of a URL as the number of directories included in <path>. The URL can indicate characteristics of a document. For example, the documents of entrance sites are often located right under the root host (Kraaij, Westerveld, & Hiemstra, 2002), so their URL depth is zero. Table 4 shows the URL-related features and their descriptions.

The feature U3 tells us whether the document belongs to a user of a UNIX system. The feature U4 checks the filename of a document; the listed filenames are typically used as the filename of an entrance site. The feature U5 identifies the domain area of the document according to the domain name of the host. The features UL1 to UL35 are determined by collecting lexical terms that occur more than three times in the URL strings of the training corpus.

Table 4
The set of features related to URL

Feature     Description
U1          Depth of URL
U2          Document type a (document extension): the value is one of {HTML, SCRIPT, DOC, OUTPUT, MIX}
U3          Is '~' used in the URL?
U4          Is the filename in {index, default, main, home, main_default}, or is the filename omitted?
U5          Domain area: com, org, edu, net, gov, ac.kr, co.kr, go.kr, re.kr, ne.kr, or.kr, pe.kr, etc.
U6          Number of URLs (= number of frames in a document)
UL1-UL35    Is the term used in the URL? for 35 lexical terms: faq, news, board, detail, list, qna, index, shop, data, go, view, front, main, company, item, paper, bbslist, product, read, papers, start, file, gallery, introduction, info, login, search, research, bbs, link, intro, people, profile, photoi, photo

a HTML stands for the extensions html, htm, and xml. SCRIPT stands for jsp, asp, php, etc. DOC stands for an application file such as pdf, doc, ppt, etc. OUTPUT stands for an output page of a script. MIX is a mixture of other extensions (only in the case of a document with several frames).
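As an illustration of how the features of Table 4 might be computed, the following Python sketch derives U1-U5 from a URL with the standard urllib.parse module. The exact value encodings (for instance how U2 and U5 are coded) are not specified in the paper, so the choices below are assumptions, and the UL1-UL35 term indicators are omitted.

from urllib.parse import urlparse

ENTRANCE_NAMES = {"index", "default", "main", "home", "main_default"}   # U4 list from Table 4
HTML_EXT, SCRIPT_EXT = {"html", "htm", "xml"}, {"jsp", "asp", "php"}
DOC_EXT = {"pdf", "doc", "ppt", "ps"}

def url_features(url: str) -> dict:
    """Sketch of features U1-U5 of Table 4 (value encodings are assumptions)."""
    parsed = urlparse(url)                 # <searchpart> (parsed.query) is ignored, as in Section 4.1
    segments = [s for s in parsed.path.split("/") if s]
    filename = segments[-1] if segments and "." in segments[-1] else ""
    stem, _, ext = filename.rpartition(".")
    if ext in HTML_EXT:
        doc_type = "HTML"
    elif ext in SCRIPT_EXT:
        doc_type = "SCRIPT"
    elif ext in DOC_EXT:
        doc_type = "DOC"
    else:
        doc_type = "OUTPUT"                # e.g. a script's output page with no recognised extension
    host_parts = (parsed.hostname or "").split(".")
    domain_area = ".".join(host_parts[-2:]) if host_parts[-1] == "kr" else host_parts[-1]
    return {
        "U1_depth": len(segments) - (1 if filename else 0),   # number of directories in <path>
        "U2_doc_type": doc_type,
        "U3_tilde": "~" in parsed.path,                       # UNIX user directory
        "U4_entrance_filename": stem in ENTRANCE_NAMES or filename == "",
        "U5_domain_area": domain_area,                        # e.g. 'ac.kr', 'com'
    }

# url_features("http://www.example.ac.kr/~lab/papers/index.html")
# -> depth 2, type HTML, tilde True, entrance filename True, domain area 'ac.kr'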


4.2. HTML tags

The HREF attribute of an anchor tag is one of the most widely used notions in subject clustering of web documents (Wang & Kitsuregawa, 2002); concretely, the number of links shared between documents is adopted as one of the features. Instead of using the number of common links, we adopt the proportion of the frequency of links to the total number of characters in a document (feature H3 in Table 5). With regard to features H1 and H2 in Table 5, we subdivide the links into those pointing to the same domain and those pointing to a different domain. Whether more links point to the same domain or to a different domain can provide a clue to the characteristics of a genre. For instance, the documents in the genre 'bulletin collections' may have more links to the same domain than to different domains, while the opposite may hold in the genre 'link collections'. For the 72 HTML tags used in the training corpus, we employ the proportion of their frequencies to the total frequency of tags used in a document as a set of features. The features and their descriptions are given in Table 5.

4.3. Token information

Token information includes basic frequencies of text tokens and POS tokens, which most previous works employed. In general, letters, digits, punctuation, and symbols are mingled in a document. Moreover, a Korean document consists of heterogeneous character sets such as hangul (the Korean alphabet), hanja (Chinese characters), and the Latin alphabet. The features F8 to F13 are adopted based on the assumption that two distinct styles of documents may differ in the ratios of usage of different character sets. After morphological analysis, we can extract the features related to POS and dictionary information (T1 to T15 in Table 6). The feature T10 was first proposed in Stamatatos et al. (2000a). The features T11 to T15 can be extracted from a dictionary.

4.4. Lexical information

Lexical information is the most commonly used kind of feature for classifying genres in previous studies. Since Korean has a great number of function words, which play an important role in a sentence, we want to verify their performance in classifying genres. We therefore separate the features of function words from those of content words. The features S1 and S2 in Table 7 were introduced by Stamatatos et al. (2000a). They report that these features are valuable markers that indicate the percentage of high-frequency words and the percentage of unusual words included in a document, respectively.

Table 5
The set of features related to HTML tags

Feature     Description
H1          Frequency of links to the same domain / total frequency of tags used in a document
H2          Frequency of links to a different domain / total frequency of tags used in a document
H3          Frequency of links / total number of characters in a document
H4-H75      Frequency of tag / total frequency of tags used in a document, for 72 HTML tags: col, textarea, input, frame, iframe, select, img, area, etc.
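A sketch of how the Table 5 ratios could be computed with Python's built-in html.parser module follows. It is illustrative only (the paper does not describe its HTML parser at this level of detail), and it simply emits a ratio for every tag it observes instead of fixing the 72-tag inventory learned from the training corpus.

from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class TagCounter(HTMLParser):
    """Collect tag frequencies, link targets and text length for Table 5."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.links = []
        self.text_chars = 0
    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
    def handle_data(self, data):
        self.text_chars += len(data)

def html_tag_features(html: str, base_url: str) -> dict:
    """Sketch of H1-H3 plus per-tag ratios (H4-H75 in the paper)."""
    parser = TagCounter()
    parser.feed(html)
    total_tags = sum(parser.tags.values()) or 1
    base_domain = urlparse(base_url).netloc
    same = sum(urlparse(urljoin(base_url, h)).netloc == base_domain for h in parser.links)
    feats = {
        "H1_same_domain_links": same / total_tags,
        "H2_diff_domain_links": (len(parser.links) - same) / total_tags,
        "H3_links_per_char": len(parser.links) / max(parser.text_chars, 1),
    }
    feats.update({f"tag_ratio[{t}]": n / total_tags for t, n in parser.tags.items()})
    return feats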


Table 6
The set of features for frequencies/ratios of textual and POS tokens

Feature     Description
F1          Number of characters
F2          Number of words
F3          Number of candidate sentences
F4          Number of detected sentences / number of candidate sentences
F5          Average number of words per sentence
F6          Average number of characters per word
F7          Number of candidate sentences / number of characters
F8-F13      Number of TYPE words / total number of words, for TYPE: hangul, hanja, alphabet, digit, punctuation, symbol
T1-T9       Number of POS words / total number of words, for 9 POSs: noun, pronoun, adjective, verb, adverb, interjection, modifier, postposition, verbal-ending
T10         Average number of morphological analyses per word (morphological ambiguity)
T11-T15     Number of DICTINFO words / total number of words, for DICTINFO: a sino, foreign, proper, onomatopoeic/mimetic, title

a sino is a loanword from Chinese transcribed into Korean; foreign is another kind of foreign word transcribed into Korean; title is a noun for a person's title such as 'professor' or 'chief'.
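The character-set ratios F8-F13 and the simple counts of Table 6 can be approximated directly from Unicode character properties, as in the sketch below. The sentence-boundary rule and the class mapping are rough stand-ins for the paper's sentence-boundary detector and character classification, and the POS/dictionary features T1-T15 are omitted because they require a Korean morphological analyzer.

import re
import unicodedata
from collections import Counter

def char_class(ch: str) -> str:
    """Map a character to one of the classes used by features F8-F13."""
    name = unicodedata.name(ch, "")
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("CJK UNIFIED"):
        return "hanja"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isdigit():
        return "digit"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "symbol"

def simple_token_features(text: str) -> dict:
    """Sketch of F1-F3, F5-F7 and the character-set ratios of Table 6."""
    words = text.split()
    candidate_sentences = [s for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    classes = Counter(char_class(c) for c in text if not c.isspace())
    n_chars = sum(classes.values()) or 1
    feats = {
        "F1_chars": n_chars,
        "F2_words": len(words),
        "F3_candidate_sentences": len(candidate_sentences),
        "F5_words_per_sentence": len(words) / max(len(candidate_sentences), 1),
        "F6_chars_per_word": n_chars / max(len(words), 1),
        "F7_sentences_per_char": len(candidate_sentences) / n_chars,
    }
    for cls in ("hangul", "hanja", "alphabet", "digit", "punctuation", "symbol"):
        feats[f"ratio_{cls}"] = classes[cls] / n_chars
    return feats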

Table 7
The set of features related to lexical entries

Feature       Description
MC1-MC50      Frequency of CONTENT word / total frequency of content words, for the 50 most frequently used content words
MF1-MF50      Frequency of FUNCTION word / total frequency of function words, for the 50 most frequently used function words
MP1-MP32      Frequency of PUNCTUATION / total frequency of punctuation, for the 32 most frequently used punctuation marks
S1            Number of usual words / total number of words (frequency of a usual word > 1000 in the training corpus)
S2            Number of unusual words / total number of words (frequency of an unusual word = 1 in the training corpus)
V1            Number of unique words / total number of words (vocabulary richness)
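The Table 7 ratios are straightforward to compute once the word lists have been fixed on the training corpus. In the sketch below, top_content, top_function and usual_words stand for those training-corpus lists (they are parameters here, not values from the paper), separating content from function words is assumed to have been done by the morphological analyzer, and document-internal hapax legomena are used as a stand-in for the 'unusual word' list.

from collections import Counter

def lexical_features(words, top_content, top_function, usual_words):
    """Sketch of the Table 7 features over an already-tokenised document."""
    counts = Counter(words)
    total = len(words) or 1
    feats = {f"MC[{w}]": counts[w] / total for w in top_content}       # simplification: denominator is all words
    feats.update({f"MF[{w}]": counts[w] / total for w in top_function})
    feats["S1_usual"] = sum(counts[w] for w in usual_words) / total
    feats["S2_unusual"] = sum(n for n in counts.values() if n == 1) / total   # hapax words as a proxy
    feats["V1_vocabulary_richness"] = len(counts) / total
    return feats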

4.5. Structural information

Using NLP tools, we can analyze the syntactic structure of a sentence. From the results of the syntactic analysis, we gather the number of phrases and the average number of words in a phrase, as shown in Table 8. Chunks (multi-word expressions) such as dates, times and addresses are also adopted as useful features in this work.

Table 8
The set of features related to structural information

Feature     Description
P1          Number of declarative sentences / number of candidate sentences
P2          Number of imperative sentences / number of candidate sentences
P3          Number of question sentences / number of candidate sentences
P4          Number of sentences with parsing failure / number of candidate sentences in a document
P5          Average number of syntactic trees per sentence (syntactic ambiguity)
P6-P22      Number of PHRASE / total number of phrases in a document, for 17 phrase types: NP, VP, AJP, AUXP, AVP, CONJP, SENT, IMPR, etc.
P23-P39     Average number of words per PHRASE, for 17 phrase types: NP, VP, AJP, AUXP, AVP, CONJP, SENT, IMPR, etc.
C1-C11      Number of chunks, for 11 expression types: date, time, postal address, telephone number, money, unit, copyright, e-mail, personal name, abbreviation, numeric
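The chunk counts C1-C11 can be approximated with regular expressions, as in the following sketch. The patterns below are illustrative assumptions (they cover only some of the eleven expression types and are not the chunker actually used in the paper), and the phrase features P1-P39 are omitted because they require a syntactic parser.

import re

# Rough patterns for a few of the C1-C11 chunk types of Table 8 (assumptions).
CHUNK_PATTERNS = {
    "date": re.compile(r"\b\d{4}[./-]\d{1,2}[./-]\d{1,2}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),
    "telephone": re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "copyright": re.compile(r"(?i)copyright|\(c\)|©"),
    "numeric": re.compile(r"\b\d+(,\d{3})*(\.\d+)?\b"),
}

def chunk_features(text: str) -> dict:
    """Count occurrences of each chunk type in the document text."""
    return {f"chunks[{name}]": len(pat.findall(text)) for name, pat in CHUNK_PATTERNS.items()}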


5. Experimental results and discussion

5.1. Classifier and experimental environment

We use TiMBL version 4.0 (Daelemans, Zavrel, van der Sloot, & van den Bosch, 2001) as the classifier in the experiments. TiMBL is based on memory-based learning, which is a direct descendant of the classical k-nearest neighbor approach to classification. As the value 1 is chosen for k in these experiments, the genre of a test document is decided to be the genre of the training document that is most similar to it. For evaluation, we adopt the leave-one-out cross-validation that TiMBL 4.0 supports.

Generally speaking, the textual representation of a web document consists of an anchor text 4 and a body text (Kraaij et al., 2002). The title and meta content are also useful texts, as one would expect. In this paper, we divide the text of a web document into three separate segments: title & meta content, 5 anchor text, and body text. Depending on which segments of a document are dealt with, the experiments are carried out on seven cases: TM (title and meta content), ANCH (anchor text), BODY (body text), and their four combinations TM + ANCH, ANCH + BODY, TM + BODY, and TM + ANCH + BODY. The purpose of combining the segments is to find out which text segments are the most useful for automatic genre classification.

5.2. Experimental results on each feature set

The results summarized in Table 9 show the performance of each set of features for the seven combinations of textual segments. The accuracy in the table is the ratio of documents correctly identified by the classifier. As URL and HTML tags are independent of the textual segments, their results remain the same across the seven textual segments.

The first interesting finding in Table 9 is that the results for the segments including BODY (body text) outperform those without BODY for most sets of features, except S (usual/unusual) and V (vocabulary richness). In subject clustering of web documents, body texts are sometimes excluded from the source of features (Wang & Kitsuregawa, 2002); moreover, Pierre (2000) reports that including body texts in the source of features deteriorates the accuracy of automatic clustering. For genre classification, on the other hand, the overall text including the body is essential, because statistics acquired by scanning a document from beginning to end can reflect the properties of a genre.

Comparing rows in Table 9, we find that the results of all sets of features except H (HTML tags) are below 50% when used exclusively. H ranks best, and MP (most frequently used punctuation marks) is the second best. MP has already been known as one of the useful features for genre classification (Stamatatos et al., 2000b). The result of MF (most frequently used function words) is considerably better than that of MC (most frequently used content words), which means that function words play a more important role than content words in identifying a genre. The result of V (vocabulary richness) is the worst. Tweedie and Baayen (1998) report that vocabulary richness is unstable for texts shorter than 1000 words; the average number of words in our training documents is 1623, but it is only 650 when the documents in the genre 'research reports' are excluded.

Now, we take a close look at the performance of the feature sets U (URL), UL (URL lexical) and H (HTML tags), which we focus on in this work.
First, with respect to the feature set U, the genres 'research reports', 'product specifications' and 'public/commercial homepages' are classified well by using this set.

An anchor text is the underlined hypertext that is visible in a link to other web document. Among several attributes of a meta tag, we use the texts related with only an attribute ÔkeywordsÕ, and ÔdescriptionÕ in this experiment. 5
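Section 5.1 evaluates a memory-based classifier with k = 1 under leave-one-out cross-validation. The following is a minimal sketch of that protocol with a plain Euclidean nearest-neighbour stand-in (TiMBL itself uses feature weighting and overlap metrics), assuming each document has already been converted into a numeric feature vector.

import numpy as np

def leave_one_out_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Hold out each document in turn and label it with the genre of its single
    nearest neighbour (k = 1) among the remaining documents."""
    correct = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                  # exclude the held-out document itself
        correct += y[int(np.argmin(dists))] == y[i]
    return correct / len(X)

# e.g. leave_one_out_accuracy(np.random.rand(50, 10), np.random.randint(0, 3, 50))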


Table 9
The accuracy (%) of each set of features

Feature set            # feat.  TM     ANCH   BODY   TM+ANCH  ANCH+BODY  TM+BODY  TM+ANCH+BODY
U (Url)                   6     39.8   39.8   39.8   39.8     39.8       39.8     39.8
UL (Url Lexical)         35     43.5   43.5   43.5   43.5     43.5       43.5     43.5
H (Html tags)            75     55.1   55.1   55.1   55.1     55.1       55.1     55.1
F (Frequency)            13     38.3   43.4   46.4   44.5     43.2       43.4     43.1
T (Token)                15     31.6   44.6   36.9   36.1     38.4       38.2     39.1
MC (MostFreqCont)        50     16.3   28.8   38.7   30.0     37.2       37.1     37.4
MF (MostFreqFuncw)       50     21.6   29.5   42.9   31.3     44.2       43.6     44.2
MP (MostFreqPunct)       32     25.0   35.9   45.6   38.1     46.9       45.8     45.8
S (Usual/Unusual)         2     15.0   18.1   17.4   18.6     16.4       15.2     16.4
V (VocabularyRich)        1     22.6   18.6   12.7   16.1     13.6       11.3     11.5
P (Phrase)               39     29.1   33.3   38.6   35.0     38.2       38.1     37.3
C (Chunk)                11     20.4   33.2   37.0   35.0     40.6       37.3     43.4

(The values for U, UL and H are independent of the textual segments and are therefore identical across the seven columns.)

We can interpret this in the following way. Half of the documents in 'research reports' in our training corpus have the value DOC for the feature U2, and many documents in this genre have the value 'ac.kr' for the feature U5. Most of the documents in 'product specifications' are characterized by having the value OUTPUT for the feature U2 and the deepest URL depth compared with the other genres. This can be inferred from the fact that a document containing a product specification is usually retrieved by running a script with the user's query. Most of the documents in both 'public homepages' and 'commercial homepages' have zero URL depth and use pre-defined filenames for U4, so it is quite difficult for a classifier to distinguish the documents of the former from those of the latter.

The feature set UL can identify the genres 'FAQs' and 'product specifications' very well because 36 documents out of the 54 in 'FAQs' include the term 'faq' in their URLs, which means that many authors tend to name a FAQ page using the term 'faq'. In the case of 'product specifications', 47 documents out of 114 contain one of the terms 'detail', 'item', and 'product' in their URLs.

Here, we would like to know why the feature set H performs outstandingly better than the others. When two documents of the same genre are collected from the same domain, we assume that their HTML tag usage is very likely to be almost identical even though other features, such as the number of words or the most frequently used punctuation marks, are quite different. In other words, the distance between the HTML-tag feature values of two such documents is far smaller than the distance between their other feature values. In our training corpus, 711 out of the 1224 documents are fetched from a domain from which other documents are also fetched; that is, 711 documents share their domain with other documents while 513 documents do not. Let us investigate the performance of the features on these 711 documents only.

Table 10 compares the performance of the feature set H with those of U, UL and MP on these 711 documents. The feature sets U and UL are also related to the domain, and MP ranks next best after H. In the case of H, 496 out of the 711 documents are correctly classified. Among these 496 documents, 369 (74%) have the same domain as their nearest documents determined by the classifier among the training corpus. This is strikingly higher than the results for MP, U and UL. In addition, 52 (24%) of the 215 documents misclassified out of the 711 are deemed nearest to documents collected from the same domain; this is also considerably higher than the others. These results are compatible with the assumption that the usage patterns of HTML tags are very similar among documents collected from the same domain.


Table 10
The performance of the feature sets on the 711 documents that share their source domains

Feature set           Nearest doc. with the same domain /    Nearest doc. with the same domain /
                      correctly classified doc.              incorrectly classified doc.
U (Url)               113/357 (0.31)                         7/354 (0.02)
UL (Url Lexical)      59/345 (0.17)                          9/366 (0.02)
H (Html tags)         369/496 (0.74)                         52/215 (0.24)
MP (MostFreqPunct)    139/402 (0.34)                         7/309 (0.02)

This fact can be both an advantage and a disadvantage of using the HTML-tag features.

5.3. Experimental results on all feature sets

Table 11 shows the results of selective experiments, which we carried out only on the text segments ANCH + BODY and TM + ANCH + BODY. When we compare the performance of the web-specific features with the common textual features, we find that the former (features (1) in Table 11) are slightly better than the latter (features (2) in Table 11). When using all sets of features, a total of 329 features, we obtain approximately 74% accuracy. Applying a feature selection method, forward sequential selection 6 (Caruana & Freitag, 1994), we can decide the best combination of feature sets (features (4) in Table 11). The accuracy of the best combination improves to 75.7%. We can reasonably expect that the accuracy could be improved further if feature selection were applied to every single feature rather than to whole feature sets, but we leave this as future work. The sets excluded from this combination are F (frequency), P (phrase), MC (most frequently used content words) and V (vocabulary richness).

Table 12 depicts the confusion matrix for the result on ANCH + BODY using the feature set (4) in Table 11. The figures in column 'A' of Table 12 indicate the number of documents guessed by the classifier to be of genre 'A', while the figures in row 'A' indicate the actual numbers of documents included in genre 'A'. The accuracies of the textual genres are better than those of the non-textual ones on the whole. The precision/recall of the genres 'research reports' (J), 'journalistic materials' (I) and 'discussions' (N) rank best. The documents in the genre 'research reports' have distinctive values for most feature sets, so it is easy to distinguish them from the documents in other genres. Most news articles have a stereotyped format, so the documents in the genre 'journalistic materials' can also be classified well. In the case of 'discussions', we found that many documents in this genre happen to have a very similar format even though they are gathered from different domains; we presume that this is the reason why 'discussions' is one of the best-classified genres.

The genres 'input pages' (H), 'simple tables/lists' (G) and 'others' (P) occupy the lowest positions. In particular, the recalls of 'input pages' and 'others' are very poor, which means there are no distinguishing properties in the documents of these genres. Indeed, many web documents contain tables and input windows within their pages by default. Consequently, we must examine more carefully whether or not these classes are indispensable as web genres.

6 The most common sequential search algorithm for feature selection is forward sequential selection (FSS). FSS begins with zero features, evaluates all feature subsets with exactly one feature, and selects the one with the best performance. It then adds to this subset the feature that yields the best performance for subsets of the next larger size. This cycle repeats until no improvement is obtained by extending the current subset.
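The forward sequential selection procedure described in the footnote can be sketched as a short greedy loop over feature sets. In the sketch below, evaluate stands for any scoring function, for example the leave-one-out accuracy obtained with the chosen feature sets; the function is an illustration rather than the authors' implementation.

def forward_sequential_selection(candidate_sets, evaluate):
    """Greedy FSS over sets of features: start from the empty selection and
    repeatedly add the feature set that most improves the score, stopping when
    no addition helps. 'evaluate' maps a list of feature-set names to a score."""
    selected, best_score = [], float("-inf")
    remaining = list(candidate_sets)
    while remaining:
        scored = [(evaluate(selected + [fs]), fs) for fs in remaining]
        score, best_fs = max(scored)
        if score <= best_score:
            break                          # no improvement: stop
        best_score = score
        selected.append(best_fs)
        remaining.remove(best_fs)
    return selected, best_score

Applied to the eleven feature-set names of Table 9 with classification accuracy as the score, this kind of loop yields a combination like the one reported as (4) in Table 11.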


Table 11
The selective experimental results

Used feature sets                                                        # feat.   ANCH+BODY (%)   TM+ANCH+BODY (%)
(1) Only U (Url) + UL (Url Lexical) + H (Html tags)                        116        64.9             64.9
(2) All features except U (Url) + UL (Url Lexical) + H (Html tags)         213        61.4             60.0
(3) All feature sets: (1) + (2)                                            329        73.9             74.3
(4) Best combination: U + UL + H + T (Token) + MF (MostFreqFuncw)
    + MP (MostFreqPunct) + S (Usual/Unusual) + C (Chunk)                   226        75.7             75.6

Table 12
The confusion matrix for the best result (P/R: precision/recall). Rows are the actual genres; columns are the genres guessed by the classifier. Columns A-H are non-textual genres, I-P textual genres.

        A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P
A       23    2     0     0     1     3     1     0     0     0     2     1     1     0     3     0
B       0     63    20    0     1     0     0     1     4     0     2     0     1     0     0     0
C       0     13    54    0     1     3     0     0     0     0     2     0     0     0     0     0
D       0     2     1     58    1     2     2     1     1     0     0     0     2     2     2     0
E       1     3     6     1     38    2     2     0     1     0     3     3     0     0     1     0
F       1     1     2     3     1     42    4     1     0     0     2     1     0     0     2     0
G       0     0     0     3     1     0     21    1     0     0     3     0     0     0     3     0
H       0     3     0     2     1     4     8     21    1     0     3     3     0     0     2     0
I       0     1     0     0     0     0     0     0     111   0     0     3     0     0     2     0
J       0     0     0     0     0     0     0     0     0     93    0     3     1     0     0     0
K       3     4     0     1     1     1     2     4     3     0     107   15    2     1     6     0
L       0     2     0     0     1     1     0     0     1     4     8     99    3     1     1     2
M       0     1     0     2     1     0     0     0     0     0     4     5     41    0     0     0
N       0     0     0     2     0     0     0     0     0     0     0     0     0     51    0     0
O       0     0     0     5     1     2     2     4     2     0     1     2     0     4     91    0
P       0     0     0     1     0     0     1     1     1     1     2     11    5     1     2     13

P/R   82/62 66/68 65/74 74/78 78/62 70/70 49/66 62/44 89/95 95/96 77/71 68/81 73/76 85/96 79/80 87/33

A: personal homepages; B: public homepages; C: commercial homepages; D: bulletin collections; E: link collections; F: image collections; G: simple tables/lists; H: input pages; I: journalistic materials; J: research reports; K: official materials; L: informative materials; M: FAQs; N: discussions; O: product specifications; P: others.

The genres 'public homepages' and 'commercial homepages' are easily confused, as we expected, and so are the genres 'official materials' and 'informative materials'. More elaborate features have to be devised to differentiate these confusable pairs; this is left for further study.

6. Conclusion

In this paper, we propose multiple sets of features for classifying the genres of web documents. Contrary to subject clustering, the overall features of the text, including the body, are essential for genre classification. In addition, punctuation marks and function words are shown to be adequate features for genre classification. Web documents differ from textual documents in carrying URLs and HTML tags, and the features extracted from a document's URL and HTML tags are appropriate for identifying its genre. Through the experimental results, we have grasped the general idea of their


contribution to genre classification. The best combination of feature sets includes URL, HTML tags, token information, the most frequently used function words, the most frequently used punctuation marks, and chunks. The features suggested in this paper are more or less language-independent; therefore, we can apply them to other languages with little modification.

The following issues are left as future work. As web documents often exhibit a wide range of styles, it is not easy to confine a document to a single genre. For this reason, a classifier should be able to assign multiple genres to a document when appropriate. In addition, the efficiency of on-line genre classification of web documents should be taken into account.

References

Berners-Lee, T., Masinter, L., & McCahill, M. (1994). Uniform resource locators (URL). Internet RFC 1738.
Biber, D. (1986). Spoken and written textual dimensions in English: Resolving the contradictory findings. Language, 62(2), 384-413.
Biber, D. (1992). The multidimensional approach to linguistic analyses of genre variation: An overview of methodology and findings. Computers and the Humanities, 26(5-6), 331-347.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge, England: Cambridge University Press.
Caruana, R., & Freitag, D. (1994). Greedy attribute selection. In Proceedings of the international conference on machine learning (pp. 28-36).
Cutting, D. R., Karger, D. R., & Pedersen, J. O. (1993). Constant interaction-time scatter/gather browsing of very large document collections. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval (pp. 126-134).
Daelemans, W., Zavrel, J., van der Sloot, K., & van den Bosch, A. (2001). TiMBL: Tilburg Memory-Based Learner, version 4.0, reference guide.
Dewe, J., Bretan, I., & Karlgren, J. (1998). Assembling a balanced corpus from the Internet. In Proceedings of the 11th Nordic computational linguistics conference, Copenhagen.
KAIST corpus (1996-1997). Korea Advanced Institute of Science and Technology. Available: http://korterm.org.
Karlgren, J., Bretan, I., Dewe, J., Hallberg, A., & Wolkert, N. (1998). Iterative information retrieval using fast clustering and usage-specific genres. In Proceedings of the eighth DELOS workshop on user interfaces in digital libraries (pp. 85-92).
Karlgren, J., & Cutting, D. (1994). Recognizing text genres with simple metrics using discriminant analysis. In Proceedings of the 15th international conference on computational linguistics (pp. 1071-1075).
Kessler, B., Nunberg, G., & Schütze, H. (1997). Automatic detection of text genre. In Proceedings of the 35th annual meeting of the ACL (pp. 32-38).
Kraaij, W., Westerveld, T., & Hiemstra, D. (2002). The importance of prior probabilities for entry page search. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (pp. 27-34).
Lee, Y.-B., & Myaeng, S. H. (2002). Text genre classification with genre-revealing and subject-revealing features. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (pp. 145-150).
Michos, S., Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (1996). An empirical text categorizing computational model based on stylistic aspects. In Proceedings of the eighth international conference on tools with artificial intelligence (pp. 71-77).
Pierre, J. M. (2000). Practical issues for automated categorization of web sites. In ECDL 2000 workshop on the Semantic Web.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000a). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471-495.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000b). Text genre detection using common word frequencies. In Proceedings of the international conference on computational linguistics (COLING 2000) (pp. 808-814).
Tweedie, F., & Baayen, H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323-352.
Wang, Y., & Kitsuregawa, M. (2002). Evaluating contents-link coupled web page clustering for web search results. In Proceedings of the 11th international conference on information and knowledge management (pp. 499-506).
Zamir, O., & Etzioni, O. (1999). Grouper: A dynamic clustering interface to web search results. Computer Networks, 31(11-16), 1361-1374.