Adapting Web Information Extraction Knowledge via Mining Site-Invariant and Site-Dependent Features

TAK-LAM WONG, City University of Hong Kong

WAI LAM, The Chinese University of Hong Kong

We develop a novel framework that aims at automatically adapting previously learned information extraction knowledge from a source Web site to a new unseen target site in the same domain. Two kinds of features related to the text fragments from the Web documents are investigated. The first type of feature is called a site-invariant feature. These features likely remain unchanged in Web pages from different sites in the same domain. The second type of feature is called a site-dependent feature. These features differ in Web pages collected from different Web sites, while they are similar in Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and the items previously collected or extracted from the source Web site. The derived site-invariant features are exploited to automatically seek a new set of training examples in the new unseen target site. Both the site-dependent features and the site-invariant features of these automatically discovered training examples are considered in the learning of new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0% respectively without any further manual intervention.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; I.2.6 [Artificial Intelligence]: Learning—Induction

The work described in this article is substantially supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Nos. CUHK 4179/03E and CUHK 4193/04E) and the Direct Grant of the Faculty of Engineering, CUHK (Project Codes 2050363 and 2050391). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

Authors' addresses: T. L. Wong, Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong; email: [email protected]; W. Lam, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong; email: [email protected].


General Terms: Algorithms, Design

Additional Key Words and Phrases: Wrapper adaptation, Web mining, text mining, machine learning

ACM Reference Format: Wong, T.-L. and Lam, W. 2007. Adapting Web information extraction knowledge via mining site-invariant and site-dependent features. ACM Trans. Intern. Tech. 7, 1, Article 6 (February 2007), 40 pages. DOI = 10.1145/1189740.1189746

1. INTRODUCTION

The vast amount of online documents on the World Wide Web provides a good resource for users to search for information. One common practice is to make use of search engines. For example, a potential customer may browse different bookstore Web sites with the aid of a search engine, hoping to gather precise information such as the title, authors, and selling price of some books. A major problem of online search engines is that the unit of the search results is an entire Web document. Human effort is required to examine each of the returned entries to extract the precise information. Automatic information extraction systems can automate this task by effectively identifying the relevant text fragments within the document. The extracted data can also be utilized in many intelligent applications such as an online comparison-shopping agent [Doorenbos et al. 1997] or an automated travel assistant [Ambite et al. 2002].

Unlike free texts (e.g., newswire articles) and structured texts (e.g., texts in rigid format), online Web documents are semi-structured text documents with a variety of formats containing a mix of short, weakly grammatical text fragments, mark-up tags, and free texts. For example, Figure 1 depicts a portion of an example of a Web book catalog.1 To automatically extract precise data from semi-structured documents, a commonly used technique is to make use of wrappers. A wrapper normally consists of a set of extraction rules that can identify the text fragments in the documents. In the past, human experts analyzed the documents and constructed the set of extraction rules manually. This approach is costly, time-consuming, tedious, and error-prone. Wrapper induction aims at automatically constructing wrappers by learning a set of extraction rules from manually annotated training examples. For instance, a user can specify, through a GUI, some training examples containing the title, authors, and price of the book records in the Web document shown in Figure 1. The wrapper induction system can automatically learn the wrapper from these training examples, and the learned wrapper is able to effectively extract data from the documents of the same Web site. Different techniques have been proposed, which demonstrate that wrapper induction achieves very good extraction performance [Califf and Mooney 2003; Ciravegna 2001; Downey et al. 2005; Freitag and McCallum 1999; Muslea et al. 2001; Soderland 1999].

One major limitation of existing wrapper induction techniques is that a wrapper learned from a particular Web site cannot be applied to a new unseen Web site, even in the same domain. For instance, suppose we have learned

1 The URL of this Web site is http://www.powells.com.


Fig. 1. A portion of a sample Web page of a book catalog.

the wrapper for the Web site shown in Figure 1. In this article, it is called a source Web site. Figure 2 shows another book catalog collected from a Web site different from the one shown in Figure 1.2 Although both the source site (Figure 1) and the new Web site (Figure 2) contain information about book records, the learned wrapper for the source site cannot be applied directly to this new Web site because their layouts are typically quite different. To automatically extract data from the new site, we must construct another wrapper customized to it. Hence, a separate human effort is necessary to collect training examples from the new site and invoke the wrapper induction process separately. In this article, we develop a novel framework that can fully automate the adaptation and eliminate this human effort. This problem is called wrapper adaptation: it aims at automatically adapting a previously learned wrapper from a source Web site to a new unseen site in the same domain. Under our model, if some attributes in a domain share a certain amount of similarity among different sites, the wrapper learned from the source Web site can be adapted to the new unseen sites without any human intervention. As a result, the manual effort for preparing training examples in the overall process is reduced.

We have previously developed an algorithm called WrapMA [Wong and Lam 2002] for solving the wrapper adaptation problem. WrapMA can adapt previously learned extraction knowledge from one Web site to another new unseen Web site in the same domain. The main drawback of WrapMA is that human effort is still required to scrutinize the intermediate data during the adaptation phase.

2 The URL of the Web site is http://www.halfpricecomputerbooks.com.


Fig. 2. A portion of a sample Web page about a book catalog collected from a different Web site than the one in Figure 1.

In this article, we present a novel method called Information Extraction Knowledge Adaptation (IEKA) for solving the wrapper adaptation problem. IEKA is a fully automatic method without the need for manual effort. The idea of IEKA is to analyze the site-dependent features and the site-invariant features of the Web pages in order to automatically seek a new set of training examples from the new unseen site. A preliminary version was reported in Wong and Lam [2004b]. In this article, we substantially enhance the framework by modeling the dependence among various kinds of knowledge and site-specific features for the Web environment. An information-theoretic approach to analyzing the DOM (document object model) structure of Web pages is also incorporated for seeking training examples in the new site more effectively. The performance of our new approach, IEKA, is very promising, as demonstrated in the extensive experiments described in Section 9. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0% respectively without any further manual intervention.

2. PROBLEM DEFINITION AND MOTIVATION

Consider the Web site shown in Figure 1. Figure 3 depicts an excerpt of the HTML texts of this page. Suppose we want to automatically extract information such as the title, authors, and prices of the books from this Web site. We can construct a wrapper for this Web site to achieve this task. To learn the wrapper, we first manually annotate some training examples similar to the one depicted in Table I via a GUI. We employ our wrapper learning system (HISER), which considers the text fragments of the data item as well as the


Fig. 3. An excerpt of the HTML texts for the Web page shown in Figure 1.

Table I. Sample of Manually Annotated Training Examples from the Web Page Shown in Figure 1

Item          Item value
Book Title:   Programming Microsoft Visual Basic 6.0 with CDROM (Programming)
Author:       Francesco Balena
Final Price:  59.99

surrounding text fragments [Lin and Lam 2000]. The learned wrapper is composed of a hierarchical record structure and extraction rules. For example, Figure 4 shows the learned hierarchical record structure and Table II shows one of the learned extraction rules for the book title. A hierarchical record structure is a tree-like structure representing the relationship of the items of interest. The root node represents a record that consists of one or more items. An internal node represents a certain fragment of its parent node. An internal node can be a repetition, which may consist of other subtrees or leaf nodes. A repetition node specifies that its child can appear repeatedly in the record. A leaf node represents an attribute item of interest. Each node of the hierarchical record structure is associated with a set of extraction rules. An extraction rule contains three components: the left and right pattern components contain the left and right delimiters of the items, and the target pattern component consists of the semantic meaning of the items.

After obtaining the wrapper, we can apply it to the other pages from the same Web site to automatically extract items. The learned wrapper can effectively extract items from the Web pages of the same site. However, it cannot extract any item if we directly apply it to a new unseen site in the same domain, such as the one shown in Figure 2. Figure 5 depicts an excerpt of the HTML texts for this page. The failure of extraction is due to the difference between the layout formats of the two Web sites; the learned wrapper becomes inapplicable to the new site. In order to automatically extract information from the new site, one could learn a wrapper for the new site by manually collecting another set of training examples. Instead, we propose and develop our IEKA framework to automatically tackle this problem.
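To make this representation concrete, the following is a minimal sketch (our own illustration in Python, not the authors' code; all names are hypothetical) of a hierarchical record structure whose nodes carry three-component extraction rules, mirroring Figure 4 below:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExtractionRule:
    # Left/right pattern components hold the delimiters surrounding an
    # item; the target pattern component holds its semantic constraints.
    left_pattern: List[str]
    target_pattern: List[str]
    right_pattern: List[str]

@dataclass
class RecordNode:
    # A node of the hierarchical record structure: the root is a record,
    # leaves are attribute items, and a repetition node marks a child
    # that may appear repeatedly (e.g., a list of authors).
    name: str
    is_repetition: bool = False
    children: List["RecordNode"] = field(default_factory=list)
    rules: List[ExtractionRule] = field(default_factory=list)

# The structure of Figure 4: a record with a title, a price,
# and a repeated author item.
book_record = RecordNode("root", children=[
    RecordNode("title"),
    RecordNode("price"),
    RecordNode("repetition", is_repetition=True,
               children=[RecordNode("author")]),
])
```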


Fig. 4. The learned hierarchical record structure for the Web page shown in Figure 1: a root record with child nodes title, price, and a repetition node over author.

Fig. 5. An excerpt of the HTML texts for the Web page shown in Figure 2.

There are several characteristics of IEKA. The first is that IEKA utilizes the previously learned extraction knowledge contained in the wrapper of the source Web site. For example, the extraction rule depicted in Table II shows that the majority of the book titles contain alphabets and words starting with a capital letter. Such knowledge provides useful evidence for information extraction in a new unseen site in the same domain. However, it is not directly applicable to the new site due to the difference between the contexts of the two Web sites. We refer to such knowledge as weak extraction knowledge. The second characteristic of IEKA is that it makes use of the items previously extracted or collected from the source site. These items can contribute to deriving training examples for the new unseen site. One major difference between this kind of training example and ordinary training examples is that the former only consists of information about the item content, while the latter contains information about both the content and context of the Web pages. We call this property partially specified.

Based on the weak extraction knowledge and the partially specified training examples, IEKA first derives those site-invariant features that remain largely unchanged across different sites. For example, one kind of site-invariant feature is the patterns, such as capitalization, of the attributes. Another kind of site-invariant feature is the orthographic information of the attributes. Next, a set of training example candidates is selected by analyzing the DOM structures of the Web documents of the new unseen site based on an information-theoretic approach. Machine learning methods are then employed to automatically discover some machine-labeled training examples from the set of candidates, based on the site-invariant features. Table III depicts samples of the automatically discovered machine-labeled training examples from the new unseen site shown in Figure 2. Both site-invariant and site-dependent features of the machine-labeled


Table II. A Sample of a Learned Extraction Rule for the Book Title for the Web Page Shown in Figure 1

Left pattern component:   Scan Until(, SEMANTIC), Scan Until("", TOKEN), Scan Until("", SEMANTIC), Scan Until("", TOKEN).
Target pattern component: Contain(), Contain()
Right pattern component:  Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until(, SEMANTIC).

Table III. Samples of Machine Labeled Training Examples Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 to the Web Site Shown in Figure 2 Using Our IEKA Framework

Example 1
  Book Title:   C++ Weekend Crash Course, 2nd edition
  Author:       Stephen Randy Davis
  Final Price:  23.99
Example 2
  Author:       Steve Oualline
  Final Price:  31.96
training examples will then be considered in the learning of the new wrapper for the new target site. The newly discovered hierarchical record structure for the new site is the same as the one shown in Figure 4. Table IV shows the set of adapted extraction rules for the book title. The newly learned wrapper can be applied to extract items from the Web pages of this new site.

3. RELATED WORK

Research efforts on information extraction from various kinds of textual documents, ranging from free texts to structured documents, have been investigated [Chawathe et al. 1994; Srihari and Li 1999]. Among different extraction approaches, a wrapper is a common technique for extracting information from semistructured documents such as Web pages [Kushmerick and Thomas 2002]. In the past few years, many wrapper learning systems that aim at constructing wrappers by learning from a set of training examples have been proposed [Blei et al. 2002; Ciravegna 2001; Cohen et al. 2002; Freitag and McCallum 1999; Hogue and Karger 2005; Hsu and Dung 1998; Kushmerick 2000a; Lin and Lam 2000; Muslea et al. 2001; Soderland 1999]. These approaches can automatically learn wrappers from a set of training examples and the learned wrapper can effectively extract items from the Web sites. However, they suffer from two common drawbacks. First, as the layout of a Web site changes, the learned wrapper typically becomes obsolete and useless. This is referred to as the wrapper maintenance problem. Second, the learned wrapper can only be applied to the Web site from which the training examples come. In order to learn a wrapper

Table IV. The Set of Extraction Rules for Extracting the Book Title from the Web Page Shown in Figure 2 Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 Using Our IEKA Framework

Rule 1
Left pattern component:   Scan Until(, TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", SEMANTIC).
Target pattern component: Contain(), Contain()
Right pattern component:  Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN).

Rule 2
Left pattern component:   Scan Until(, TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", SEMANTIC).
Target pattern component: Contain(), Contain()
Right pattern component:  Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN), Scan Until("", TOKEN).

for a different Web site, a separate manual effort is required to prepare a new set of training examples.

Wrapper maintenance aims at relearning the wrapper if it is found to be no longer applicable. Several approaches have been developed to address the wrapper maintenance problem. RAPTURE [Kushmerick 2000b] was developed to verify the validity of a wrapper using a regression technique. A probabilistic model is built from the items extracted while the wrapper is known to operate correctly. After the system operates for a period of time, the items extracted are compared against the model. If the extracted items are found to be largely different, the wrapper is believed to be invalid and needs to be relearned. However, RAPTURE can only partially solve the wrapper maintenance problem since it cannot learn a new wrapper automatically. Lerman et al. [2003] developed the DataPro algorithm to address the problem. It learns some patterns from the extracted items. For example, a pattern representing a word containing alphabets only, followed by another word starting with a capital letter, is one of the patterns learned from business names such as "Cajun Kitchen." When the layout of


the Web site is changed, the DataPro algorithm will automatically label a new set of training examples by matching the learned patterns in the new Web page. The patterns are mainly composed of display format information such as the lower case and upper case of the items. However, it is doubtful that the items have the same display format in the old and new layouts of the Web site.

Several approaches have been designed to reduce the human effort in preparing training examples. These approaches have an objective similar to wrapper adaptation. Bootstrapping algorithms [Ghani and Jones 2002; Riloff and Jones 1999] are well-known methods for reducing the number of training examples. They normally initiate the training process by using a set of seed words and incorporate the unlabeled examples in the training phase. However, bootstrapping algorithms assume that those seed words must be present in the training data, which can lead to ineffective training. For example, the word "Shakespeare" may appear in the title, or as the author, of a book.3

3 The word "Shakespeare" appears in the title of the book "Shakespeare – by Michael Wood" and in the author of the book "Romeo And Juliet – by William Shakespeare".

DIPRE [Brin 1998] attempts to find the occurrence of some concept pairs such as title/author in the documents to obtain training examples by finding text fragments exactly matched with the user inputs. Once sufficient training examples are obtained, it learns extraction patterns from these training examples. DIPRE can reduce the effect of incorrect initiation in bootstrapping. However, it can only work on site-independent concept pairs such as title/author. It cannot extract site-dependent concept pairs such as title/price. The reason is that it assumes that the prices of a particular book are the same in different Web sites and that the prices from different sites are known in advance. Moreover, quite a number of concept pairs are required to be prepared in advance in order to obtain sufficient training examples.

Cotesting [Muslea et al. 2000] is a semi-automatic approach for reducing the number of examples in the training phase. The idea of cotesting is to learn different wrappers from a few labeled training examples. One may learn a wrapper by processing the Web page forward and learn another by processing the same Web page backward. These wrappers are then applied to the unlabeled examples. If the wrappers label the examples differently, users are asked to manually label those inconsistent examples. The newly labeled examples are then added to the training set and the process iterates until convergence. However, such an active learning approach can only partially reduce the human work.

ROADRUNNER [Crescenzi et al. 2001], DeLa [Wang and Lochovsky 2003], and MDR [Liu et al. 2003] are approaches developed for completely eliminating the human effort in extracting items from Web sites. The idea of ROADRUNNER is to compare the similarities and differences of the Web pages. If two different strings occur in the same corresponding positions of two Web pages, they are believed to be the items to be extracted. DeLa discovers repeated patterns of the HTML tags within a Web page and expresses these repeated patterns with regular expressions. The items are then extracted in a table format by parsing the Web page against the discovered regular patterns. MDR first discovers the data regions in the Web page by building the HTML tag tree and making use of


string comparison techniques. The data records in each data region are extracted by applying some heuristic knowledge of how people commonly present data objects in Web pages. These three approaches do not require any human involvement in training and extraction. However, they suffer from one common shortcoming: they do not consider the type of the information extracted, and hence the items extracted by these systems require human effort to interpret their meaning. For example, if the extracted string is "Shakespeare," it is not known whether this string refers to a book title or a book author.

Wrapper adaptation aims at automatically adapting the previously learned extraction knowledge to a new unseen site in the same domain. This can significantly reduce the human work in labeling training examples for learning wrappers. In principle, wrapper adaptation can solve the wrapper maintenance problem. It can also be applied to other intelligent tasks [Lam et al. 2003; Wong and Lam 2005]. Golgher and da Silva [2001] proposed to solve the wrapper adaptation problem by applying a bootstrapping technique and a query-like approach. This approach searches for exact matches of items in the new unseen Web page. However, their approach shares the same shortcomings as bootstrapping. In essence, it assumes that the seed words, which refer to the elements in the source repository in their framework, must appear in the new Web page. Cohen and Fan [1999] designed a method for learning page-independent heuristics for extracting items from Web pages. Their approach is able to extract items in different domains. However, a major disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules. KNOWITALL [Etzioni et al. 2005] is a domain-independent information extraction system. Its idea is to make use of online search engines and bootstrap from a set of domain-independent and generic patterns from the Web. It can extract the relation between instances and classes, and the relation between superclasses and subclasses. However, one limitation of KNOWITALL is that its generic patterns cannot solve the multi-slot extraction problem, which aims at extracting records containing one or more attribute items.

The machine-labeled training example discovery component of our proposed framework is related to the research area of object identification, or duplicate detection, which aims at identifying matching objects from different information sources. Tejada et al. [2001, 2002] developed a system called Active Atlas to solve the object identification problem. They designed a method for learning the weights of different string transformations. The identification of matching objects is then achieved by computing the similarity score between the attributes of the objects. MARLIN [Bilenko and Mooney 2003] is another object identification system based on a generative model for computing the string distance with affine gaps, which applies SVM to compute the vector-space similarity between strings. Cohen defined the similarity join of tables in a database containing free text data [Cohen 1999]. The database may be constructed by extracting data from Web sites. The idea is to consider the importance of the terms contained in the attributes and compute the cosine similarity between the attributes of the tuples. However, one major difference between our machine-labeled training example discovery and these object identification methods is that


Fig. 6. Dependence model of text data for Web sites for a particular domain, relating content knowledge α (domain dependent and site invariant), item knowledge β, context knowledge γ (site dependent), site-invariant features f_I, and site-dependent features f_D of a Web page.

machine-labeled training example discovery identifies the text fragments, which likely belong to the items of interest, within the Web page collected from the new unseen site, while object identification determines the similarity between records that are obtained or extracted in advance. Moreover, the goal of machine-labeled training example discovery is to identify the text fragments belonging to the items of interest, not to integrate information from different information sources. The techniques used in object identification are not applicable since it is common that the source Web site and the new unseen site do not contain shared records.

4. OVERVIEW OF IEKA

4.1 Dependence Model

Our proposed adaptation framework is called IEKA (Information Extraction Knowledge Adaptation). It is designed based on a dependence model of the text data contained in Web sites. Figure 6 shows the dependence model for a particular domain. Typically, there are different Web sites containing data records. Within a particular Web site, there is a set of Web pages containing some data items. For example, in the book domain, there are many bookstore Web sites. Each of these Web sites contains a set of Web pages and each page displays some items such as title, authors, price, and so on. Sometimes, a Web page is obtained by supplying a keyword to the internal search engine provided by the Web site.

Associated with each domain, there exists some content knowledge, denoted by α. This content knowledge contains the general information about the data items of the domain. For example, in the book domain, α refers to the knowledge that each book consists of items such as title, authors, and price. Within α, there is more specific knowledge, called item knowledge, associated with the items to be extracted. For instance, the item title is associated with particular item knowledge denoted by β, which refers to knowledge about the title: for example, a title normally consists of a few words and some of the words may start with a capital letter. It is obvious that α and β are domain dependent. For example, the knowledge for the book domain and the consumer electronics appliance domain are different. α and β are also regarded as site-invariant since


such knowledge does not change with different Web sites. There is another kind of knowledge, called context knowledge, denoted by γ. Context knowledge refers to context information such as the layout format of the Web sites. Different Web sites have different contexts γ. For example, in the book domain, the book title is displayed after the token "Title:" in one Web site, whereas the book title is displayed at the beginning of a line in another.

In a particular Web page, we differentiate two types of features. The first type is called the site-invariant feature, denoted by f_I. f_I is mainly related to the item content within the Web page and is dependent on α and β. For example, f_I can represent the text fragments regarding the title of a book. Due to the dependence on α and β, f_I remains largely unchanged in the Web pages from different Web sites in the same domain. The other type of feature is called the site-dependent feature, denoted by f_D. For example, f_D can represent the text fragments regarding the layout format of the title of a book in a Web page. Specifically, the titles of the books shown in Figure 1 are bolded and underlined. f_D is dependent on the context knowledge γ associated with a particular Web site. f_D is also dependent on β because each item may have a different context. As the context knowledge γ of different Web sites is different, the resulting f_D are also different for the Web pages collected from different sites. However, the f_D of Web pages originating from the same site are likely unchanged because they depend on the same γ.

In wrapper induction, we attempt to learn the wrapper by manually annotating some training examples in the Web site. These training examples consist of the site-invariant features and the site-dependent features of the Web pages. Wrapper induction is a process of learning information extraction knowledge from the site-invariant and site-dependent features of the pages from the Web site. The learned wrapper can effectively extract information from the other pages of the same Web site because the site-invariant and site-dependent features of these Web pages depend on the same α and γ respectively. However, the wrapper learned from a source Web site cannot be directly applied to a new unseen Web site because the site-dependent features of the Web pages in the new unseen site depend on a different γ.

4.2 IEKA Framework Description

Our IEKA framework tackles the problem by making use of the site-invariant features as clues to solve the wrapper adaptation problem. IEKA first identifies the site-invariant features of the Web pages of the new unseen site. This is achieved by exploiting two pieces of information from the source Web site to derive the site-invariant features. The first piece of information is the extraction knowledge contained in the previously learned wrapper. The other piece of information is the items collected or extracted from the source Web site. To perform information extraction for a new Web site, the existing extraction knowledge contained in the previously learned wrapper is useful since the site-invariant features are likely applicable. However, the site-dependent features cannot be used since they are different in the new site. As mentioned in Section 2, we call such knowledge weak extraction knowledge. The items previously extracted or collected from the source Web site embody rich information about the item

Fig. 7. The major stages of IEKA: potential training text fragment identification (DOM analysis and a modified K-nearest neighbours classification over the previously learned extraction knowledge), machine-labeled training example discovery (a content classification model and lexicon approximate matching over the items previously extracted or collected from the source site), and the wrapper learning component that produces the new wrapper for the unseen target Web site.

content. For example, these extracted items contain some characteristics and orthographic information about the item content. These items can be viewed as training examples for the new site. However, they are different from ordinary training examples because the former only contain information about the site-invariant features, while the latter contain information about both the site-invariant features and the site-dependent features. As mentioned in Section 2, we call this property partially specified. By deriving the site-invariant features from the weak extraction knowledge and the partially specified training examples, IEKA employs machine-learning methods to automatically discover some training examples from the new Web site. These newly discovered training examples are called machine-labeled training examples. The next step is to analyze both the site-invariant features and site-dependent features of those machine-labeled training examples of the new site. IEKA then learns the new information extraction knowledge tailored to the new site using a wrapper learning component.

Figure 7 depicts the major stages of our IEKA framework. IEKA consists of three stages employing machine-learning methods to tackle the adaptation problem. The first stage of IEKA is potential training text fragment identification. In this stage, we employ an information-theoretic approach to analyze the DOM structures of the Web pages of the unseen Web site. The informative nodes in the DOM structure can be effectively identified. Next, the weak extraction knowledge contained in the wrapper from the source site is utilized to identify appropriate text fragments in these informative nodes as the potential training text fragments for the new unseen site. This stage considers the site-dependent features of the Web pages as discussed above. Some auxiliary example pages are automatically fetched for the analysis of the site-dependent features. A modified K-nearest neighbours classification model is developed for effectively identifying the potential training text fragments.

The second stage is machine-labeled training example discovery. It aims at scoring the potential training text fragments. Those "good" potential training text fragments will become the machine-labeled training examples for learning the new wrapper for the new site. This stage considers the site-invariant features of the partially specified training examples. An automatic text


Fig. 8. The hierarchical record structure for the book information shown in Figure 2: a root record with book_title, a repetition node over author, and price, whose children are list_price and final_price.

fragment-classification model is developed to score the potential training text fragments. The classification model consists of two components. The first component is the content classification component. It considers several features to characterize the item content. The second component is the approximate matching component, which analyzes the orthographical information of the potential training text fragments. In the third stage, based on the automatically generated machine-labeled training examples, a new wrapper for the new Web site is learned using the wrapper learning component. The wrapper learning component in IEKA is derived from our previous work [Lin and Lam 2000], a brief summary of which is given in the following.

4.3 Wrapper Learning Component

A wrapper learning component discovers information extraction knowledge from training text fragments. We employ a wrapper learning algorithm called HISER, described in our previous work [Lin and Lam 2000]. In this article, we only present a brief summary of HISER. HISER is a two-stage learning algorithm. The first stage induces a hierarchical representation of the structure of the records. This hierarchical record structure is a tree-like structure that can model the relationship between the items of the records. It can model records with missing items, multi-valued items, and items arranged in unrestricted order. For example, Figure 8 depicts a sample hierarchical record structure representing the records in the Web site shown in Figure 2. The record structure in this example contains a book title, a list of authors, and a price. The price consists of a list price and a final price. There is no restriction on the order of the nodes under the same parent. A record can also have any item missing. The multiple occurrence property of author is modeled by a special internal node called repetition.

Each node in the hierarchical record structure is associated with a set of extraction rules. These extraction rules are automatically learned in the second stage of HISER. An extraction rule consists of three parts: the left pattern component, the right pattern component, and the target pattern component. Table V depicts one of the extraction rules for the final price for the Web document in Figure 2. Both the left and right pattern components make use

Table V. A Sample of an Extraction Rule for the Final Price for the Web Document Shown in Figure 2

Left pattern component:   Scan Until("Our", TOKEN), Scan Until("Price", TOKEN), Scan Until(":", TOKEN), Scan Until("", SEMANTIC).
Target pattern component: Contain()
Right pattern component:  Scan Until("&nbsp;", TOKEN), Scan Until("&nbsp;", TOKEN), Scan Until("", TOKEN), Scan Until("", SEMANTIC).
Fig. 9. Examples of semantic classes organized in a hierarchy (ALL at the root, with classes such as TEXT, DIGIT, PUNCT, FLOAT, HTML_TAG, HTML_OBJECTS_TAG, HTML_SPACE, HTML_FONT_TAG, and HTML_IMG_TAG), covering the tokens of the text fragment "Our Price: 39.99" and its surrounding HTML tokens.

of a token scanning instruction, Scan Until(), to identify the left and right delimiters of the item. The token scanning instruction instructs the wrapper to scan and consume any token until a particular matching token is found. The argument of the instruction can be a token or a semantic class. The target pattern component makes use of an instruction, Contain(), to represent the semantic class of the item content. The extraction rule-learning algorithm is developed based on a covering-based learning algorithm.

HISER first tokenizes the Web document into a sequence of tokens. A token can be a word, number, punctuation, date, HTML tag, a specific ASCII entity such as "&nbsp;", which represents a space in HTML documents, or some domain-specific content such as manufacturer names. Each token is associated with a set of semantic classes organized in a hierarchy. For example, Figure 9 depicts the semantic class hierarchy for the text fragment "Our Price: 39.99" from Figure 5 after tokenization.
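As a rough illustration (our own sketch; the regular expressions and class names are simplifications in the spirit of Figure 9, not HISER's actual implementation):

```python
import re

def semantic_classes(token: str) -> list:
    # Map a token to semantic classes organized under the root class ALL.
    if re.fullmatch(r"<[^>]+>|&\w+;", token):      # HTML tag or entity
        return ["ALL", "HTML_TAG"]
    if re.fullmatch(r"\d+\.\d+", token):           # e.g., "39.99"
        return ["ALL", "DIGIT", "FLOAT"]
    if re.fullmatch(r"\d+", token):
        return ["ALL", "DIGIT"]
    if re.fullmatch(r"[^\w\s]", token):            # punctuation such as ":"
        return ["ALL", "PUNCT"]
    return ["ALL", "TEXT"]

def tokenize(text: str) -> list:
    # Split into HTML tags, entities, numbers, words, and punctuation.
    tokens = re.findall(r"<[^>]+>|&\w+;|\d+\.\d+|\w+|[^\w\s]", text)
    return [(t, semantic_classes(t)) for t in tokens]

print(tokenize("Our Price: <b>39.99</b>&nbsp;"))
# [('Our', ['ALL', 'TEXT']), ('Price', ['ALL', 'TEXT']),
#  (':', ['ALL', 'PUNCT']), ('<b>', ['ALL', 'HTML_TAG']),
#  ('39.99', ['ALL', 'DIGIT', 'FLOAT']), ('</b>', ['ALL', 'HTML_TAG']),
#  ('&nbsp;', ['ALL', 'HTML_TAG'])]
```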


HISER learns the extraction rules by performing lexical and semantic generalization, until effective extraction rules are discovered. The details of HISER can be found in our previous work [Lin and Lam 2000].

5. POTENTIAL TRAINING TEXT FRAGMENT IDENTIFICATION

In our IEKA framework, the first stage is the potential training text fragment identification component. This stage shares some resemblance with the research area of object identification or duplicate detection, which aims at identifying matching objects from different information sources [Bilenko and Mooney 2003; Cohen 1999; Tejada et al. 2001, 2002]. However, it differs from object identification or duplicate detection in three aspects. The first aspect is that IEKA identifies text fragments within a Web page collected from the new unseen site. On the contrary, object identification determines the similarity between records that are obtained or extracted in advance. The second aspect is that IEKA identifies the text fragments belonging to the items of interest in the new site, while the aim of object identification is to integrate data objects from different information sources. The third aspect is that the source Web site and the new unseen site may not contain any common object. For instance, in the object identification task, one determines if the records "Art's Deli" and "Art's Delicatessen", collected from two different restaurant information sources, refer to the same restaurant [Tejada et al. 2002]. These two records are stored in a database in advance. In contrast, our approach identifies the text fragment "Practical C++ Programming, 2nd Edition," which is a substring of the entire HTML text document, in the Web page shown in Figure 2. Moreover, the source site may not simultaneously contain this book in its Web pages. Therefore, the techniques developed for object identification are not applicable.

In our IEKA framework, potential training text fragments refer to the text fragments, collected from the Web pages of the new unseen Web site, that likely belong to one of the items of interest. Notice that the potential training text fragments identified at this stage are not yet classified as any particular item of interest. In the next stage, some of the potential training text fragments will be classified as different items and used as the machine-labeled training examples for learning the new wrapper in the last stage of IEKA. The idea of this stage is to analyze the site-dependent features and the site-invariant features of the new site. The DOM structure representation of the Web pages is utilized to identify the useful text fragments in the new site. A modified K-nearest neighbours method is employed to select the potential training text fragments.

5.1 Auxiliary Example Pages

IEKA will automatically generate some machine-labeled training examples from one of the Web pages in the new unseen Web site. We call the Web page from which the machine-labeled training examples are to be automatically collected the main example page M. Relative to a main example page, auxiliary example pages A(M) are Web pages from the same Web site, but containing different categories of item contents. For example, in the book domain, M may contain items about


Fig. 10. A portion of a sample Web page about networking books.

programming books, while A(M) may contain items about networking books. Note that the main and auxiliary example pages are collected from the same site, and hence the site-dependent features f_D of these Web pages depend on the same context knowledge γ as described in Section 4.1. As the main example page and the auxiliary example pages contain different item contents, the text fragments regarding the item content differ across the pages, while the text fragments regarding the layout format are very similar. This observation gives a good indication for locating the potential training text fragments.

Auxiliary example pages can easily be obtained automatically from different pages in a Web site. One typical method is to supply different keywords or queries automatically to the internal search engine provided by the Web site. For instance, consider the book catalog associated with the Web page shown in Figure 2. This Web page is generated by automatically supplying the keyword "PROGRAMMING" to the search engine provided by the Web site. If a different keyword such as "NETWORKING" is automatically supplied to the search engine, a new Web page, as shown in Figure 10, is returned. Only a few keywords are needed for a domain and they can easily be chosen in advance. The Web page in Figure 10 can be regarded as an auxiliary example page relative to the Web page in Figure 2. Figures 5 and 11 show excerpts of the HTML text documents associated with the Web pages shown in Figures 2 and 10 respectively. The bolded text fragments are related to the item content, while the remaining text fragments are related to the layout format. The text fragments related to the item content are very different in different Web pages, whereas the text fragments related to the layout format are very similar.
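A minimal sketch of this page-gathering step (the search URL pattern and keyword list here are hypothetical; each site exposes its internal search engine differently):

```python
import urllib.parse
import urllib.request

# Hypothetical internal-search URL template of the target site.
SEARCH_URL = "http://www.example-bookstore.com/search?q={query}"

# A few pre-chosen domain keywords are enough (see the text above).
KEYWORDS = ["PROGRAMMING", "NETWORKING", "DATABASE"]

def fetch_result_page(keyword: str) -> str:
    url = SEARCH_URL.format(query=urllib.parse.quote(keyword))
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# The page for the first keyword serves as the main example page M;
# the pages for the other keywords act as auxiliary example pages A(M).
main_page = fetch_result_page(KEYWORDS[0])
auxiliary_pages = [fetch_result_page(k) for k in KEYWORDS[1:]]
```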




Fig. 11. An excerpt of the HTML texts for the Web page shown in Figure 10.



Fig. 12. Part of the DOM structure representation for the Web page shown in Figure 2: element nodes for the HTML tags, with text nodes such as "Author:", "Curtis Frye", "Published:", "2004", "List Price:", "49.99", and "Microsoft Excel ...".

5.2 DOM Structure Analysis

A Web page can be represented by a DOM (Document Object Model)4 structure. A DOM structure is an ordered tree consisting of two types of nodes. The first type of node is called an element node, which represents HTML tag information. These nodes are labeled with the name of the corresponding HTML element. The other type of node is called a text node, which holds the text displayed in the browser and is labeled simply with the corresponding text. Figure 12 shows part of the DOM structure representation for the Web page shown in Figure 2.

4 The details of the Document Object Model can be found at http://www.w3.org/DOM/.

We develop an algorithm that can effectively locate the informative text nodes in the DOM structure. For each text node in the DOM structure, we define the path as the string created by concatenating the node labels from the first ancestor to the n-th ancestor, where n is a predefined value. For example, in Figure 12, the text nodes labeled with "Published:" and "List Price:" share the same path, while the text node labeled with "Microsoft Excel 2003 Programming inside Out" has a different path, when n is set to 4. Note that each path may locate more than one text node in the DOM structure. We define the probability that


Fig. 13. An outline of the DOM structure path-finding algorithm.

the term w_i occurs in the text nodes located by the path p as:

P(w_i, p) = \frac{N(w_i, p)}{\sum_j N(w_j, p)}

where N(w_i, p) is the number of occurrences of w_i in all the text nodes located by p. Next, we define the path entropy, E(p), as follows:

E(p) = -\sum_i P(w_i, p) \log P(w_i, p).    (1)

Note that E(p) can be calculated from more than one DOM structure by treating all the DOM structures as a forest; each P(w_i, p) is then calculated by considering all the text nodes located by p in the forest. Figure 13 shows an outline of our path-finding algorithm. The objective of this algorithm is to identify the paths that can locate informative text nodes in the DOM structure. It first creates the DOM structures for the main example page M and all the auxiliary example pages A(M). Next, all the paths in the DOM structure, dom_M, for the main example page M are identified. For each of these paths, the path entropy computed from dom_M alone and the path entropy computed over the forest of dom_M together with the DOM structures of the auxiliary example pages are calculated. If the latter exceeds the former by a threshold, δ, the path will be included in the return path set. The rationale of Step 8 in the algorithm is that entropy is a measure of the randomness of a distribution. Recall that the main and auxiliary example pages consist of different site-invariant features. If the underlying path locates text nodes consisting of site-invariant features, the term distribution under this path becomes more complex when more pages are considered. On the other hand, if the underlying path locates only text nodes corresponding to site-dependent features, the term distribution under this path will likely remain unchanged when more pages are considered, because the site-dependent features largely remain unchanged across different Web pages of a Web site. Hence, the resulting return path set will contain the paths that locate "complex" text nodes that highly likely consist of site-invariant features.
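A minimal sketch of this step, under our reading that the entropy of a path computed from the main page alone is compared with its entropy over the forest of main and auxiliary pages (helper names are ours; a real implementation would build the DOM structures with an HTML parser):

```python
import math
from collections import Counter

def path_entropy(term_counts: Counter) -> float:
    # E(p) of Equation (1): entropy of the term distribution under path p.
    total = sum(term_counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total)
                for n in term_counts.values())

def find_informative_paths(main_terms: dict, forest_terms: dict,
                           delta: float) -> list:
    # main_terms / forest_terms map each path p to a Counter over the
    # terms of the text nodes it locates (main page only, respectively
    # the forest of main plus auxiliary pages).
    informative = []
    for p, counts in main_terms.items():
        e_main = path_entropy(counts)
        e_forest = path_entropy(forest_terms.get(p, counts))
        # Site-invariant content: the term distribution grows more
        # complex (entropy rises) once auxiliary pages are added.
        if e_forest - e_main > delta:
            informative.append(p)
    return informative
```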


For example, our path-finding algorithm returns one such path for the Web page shown in Figure 2.

5.3 Modified K-Nearest Neighbour Classification Model

The text fragments within the text nodes located by the paths returned by the above algorithm become useful text fragments. Although the paths found by our algorithm can effectively identify useful text nodes containing site-invariant features, these paths may also incorrectly locate some other text nodes, because each path may locate more than one text node at the same time. We develop a modified K-nearest neighbours classification model for filtering out these incorrect text fragments.

Recall that the previously learned wrapper from the source Web site consists of the left pattern component, the right pattern component, and the target pattern component. This wrapper is not fully applicable to the new unseen target site due to the difference between the site-dependent features f_D of the source site and the target site. However, the target pattern component, which contains the semantic classes of the items and is regarded as the weak extraction knowledge of the source site, can be utilized for discovering useful text fragments in the new target site. Based on the weak extraction knowledge, we can obtain the set UTF(M) from the main example page M of the new target site. UTF(M) is the set of useful text fragments in M; each such text fragment contains the same set of semantic classes as the one contained in the target pattern component of the previously learned wrapper. From an auxiliary example page A(M), we can also obtain the set UTF(A(M)). As explained in Section 5.1, the text fragments regarding the item content in the main example page are less likely to appear in the auxiliary example pages, while the text fragments regarding the layout format will probably appear in both. Note that our objective is to retain the text fragments corresponding to the site-invariant features in M. Hence, all the elements in UTF(A(M)) are treated as negative instances.

Each instance in the modified K-nearest neighbours classification model is represented by a set, t_i, containing the unique words in the text fragment. Suppose we have two text fragments t_1 and t_2. We define the similarity between these two text fragments, sim(t_1, t_2), as follows5:

sim(t_1, t_2) = \frac{|t_1 \cap t_2|}{\max(|t_1|, |t_2|)}    (2)

where t_1 ∩ t_2 denotes the intersection of the sets t_1 and t_2, and |t| denotes the number of elements in the set t.

5 We have also tried different similarity measurements such as cosine similarity. We found that the similarity measurement described in Equation 2 has slightly better performance.

Some existing methods for object identification make use of the term frequency-inverse document frequency (TF-IDF) method to assign a weight to


each term of the attributes of the objects [Cohen 1999; Bilenko and Mooney 2003; Tejada et al. 2001, 2002]. TF-IDF assigns higher weights to the important terms that are frequent in the document and infrequent in the whole corpus. Therefore, matching of important terms gives a higher degree of confidence in the matching of the objects. However, TF-IDF is not suitable for our classification problem. As distinct from ordinary classification problems, our goal is to identify the negative instances in UTF(M), that is, the incorrect or useless text fragments. Moreover, we only have a set of negative instances as training examples. These incorrect or useless text fragments normally contain unimportant terms that repeatedly appear in the main example page and the auxiliary example pages. As a result, the TF-IDF weighting method is not suitable for our problem.

The goal of our classification model is to select the potential training text fragments from UTF(M). To achieve this task, for each element in UTF(M), we first find the K nearest neighbours in UTF(A(M)), based on our similarity measure given in Equation 2. If the average similarity between the element in UTF(M) and its K nearest neighbours in UTF(A(M)) exceeds a threshold, θ, it is classified as a negative instance. This is because a useful text fragment is unlikely to appear repeatedly in both the main example page and the auxiliary example pages. On the other hand, if the similarity is below θ, it is classified as a potential training text fragment.

Once the potential training text fragments for an item are identified, they will be processed by the text fragment-classification model in the machine-labeled training example discovery stage. Those "good" text fragments become the machine-labeled training examples for the new site.
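A minimal sketch of this filtering step (our own illustration; K and θ are the parameters named above, and fragments are represented as sets of their unique words):

```python
def sim(t1: set, t2: set) -> float:
    # Equation (2): word-set overlap normalized by the larger set.
    return len(t1 & t2) / max(len(t1), len(t2))

def select_potential_fragments(utf_m: list, utf_a: list,
                               k: int, theta: float) -> list:
    # utf_m: word sets of the useful text fragments of the main page M.
    # utf_a: word sets from the auxiliary pages A(M); all are treated
    #        as negative instances. Assumed non-empty.
    kept = []
    for frag in utf_m:
        # Average similarity to the K most similar negative instances.
        nearest = sorted((sim(frag, neg) for neg in utf_a),
                         reverse=True)[:k]
        if sum(nearest) / len(nearest) < theta:
            kept.append(frag)   # unlikely to be repeated layout text
    return kept
```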

6. DISCOVERY OF MACHINE LABELED TRAINING EXAMPLES

As mentioned in Section 4, the partially specified training examples refer to the items previously extracted or collected from the source Web site. The rationale for using the partially specified training examples is that the item content can be represented by the site-invariant features. The partially specified training examples are used to train a text fragment-classification model that can select "good" text fragments from the potential training text fragments. This text fragment-classification model consists of two components, which consider two different aspects of the item content. One aspect is the characteristics of the item content. For example, in the consumer electronics domain, a model number of a DVD player usually contains tokens mixed with alphabets and digits and starts with a capital letter. The content classification component considers several such features and can be trained; the trained model can effectively characterize the item content. The second aspect is the orthographic information of the item content. For example, the model numbers of the products may share some characters with those found in the training examples. The approximate matching component makes use of the orthographic information of the item content to help classify the machine-labeled training examples.





6.1 Content Classification Component

We identify some features for characterizing the content of the items. A classification model can then be learned to identify the "good" potential training text fragments. The features used are as follows:

F1: the number of characters in the content
F2: the number of tokens in the content
F3: the average number of characters per token
F4: the proportion of the number of digit numbers to the number of tokens
F5: the proportion of the number of floating point numbers to the number of tokens
F6: the proportion of the number of alphabet characters to the number of characters
F7: the proportion of the number of upper case characters to the number of characters
F8: the proportion of the number of lower case characters to the number of characters
F9: the proportion of the number of punctuation marks to the number of characters
F10: the proportion of the number of HTML tags to the number of tokens
F11: the proportion of the number of tokens starting with a capital letter to the number of tokens
F12: whether the content starts with a capital letter
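For illustration, a small Python helper that computes these twelve features for a text fragment is given below. The whitespace tokenization, the regular expressions, and the punctuation set are simplifying assumptions of ours rather than the exact definitions used in IEKA.

```python
import re
import string

def content_features(text):
    """Compute F1-F12 for a text fragment (a simplified sketch)."""
    tokens = text.split()
    chars = [c for c in text if not c.isspace()]
    n_tok, n_chr = max(1, len(tokens)), max(1, len(chars))
    is_digit_num = lambda t: re.fullmatch(r"\d+", t) is not None
    is_float_num = lambda t: re.fullmatch(r"\d+\.\d+", t) is not None
    is_html_tag = lambda t: re.fullmatch(r"</?\w+[^>]*>", t) is not None
    return [
        len(chars),                                          # F1
        len(tokens),                                         # F2
        len(chars) / n_tok,                                  # F3
        sum(map(is_digit_num, tokens)) / n_tok,              # F4
        sum(map(is_float_num, tokens)) / n_tok,              # F5
        sum(c.isalpha() for c in chars) / n_chr,             # F6
        sum(c.isupper() for c in chars) / n_chr,             # F7
        sum(c.islower() for c in chars) / n_chr,             # F8
        sum(c in string.punctuation for c in chars) / n_chr, # F9
        sum(map(is_html_tag, tokens)) / n_tok,               # F10
        sum(t[:1].isupper() for t in tokens) / n_tok,        # F11
        1.0 if text[:1].isupper() else 0.0,                  # F12
    ]
```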

These features attempt to characterize the format of the items. Some of the features were also used in Kushmerick [2000b]. With the above feature design, a classification model can be learned from a set of training examples. The content classification model returns a score, f1, which indicates the degree of confidence that a fragment is a "good" potential training text fragment; f1 is normalized to a value between 0 and 1. The content classification model is trained from a set of training examples composed of a set of positive item content examples and a set of negative item content examples. The positive item content examples are the partially specified training examples from the source site. In the main page of the source Web site, Ms, we can obtain UTF(Ms) as described in Section 5; those elements of UTF(Ms) which are not in the set of positive item content examples are collected to become the negative item content examples. Next, the values of the features Fi (1 ≤ i ≤ 12) of each positive and negative item content example are computed. To learn the content classification model, we employ Support Vector Machines (SVM) [Vapnik 1995].
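A sketch of this training step, assuming scikit-learn and the hypothetical `content_features` helper above; enabling probability estimates (Platt scaling) is one plausible way to obtain the normalized score f1 in [0, 1], since the paper does not state which normalization is used.

```python
from sklearn.svm import SVC

def train_content_classifier(positive_texts, negative_texts):
    """Train the content classification component on item content examples."""
    X = [content_features(t) for t in positive_texts + negative_texts]
    y = [1] * len(positive_texts) + [0] * len(negative_texts)
    clf = SVC(kernel="rbf", probability=True)  # probability=True -> scores in [0, 1]
    clf.fit(X, y)
    return clf

def content_score(clf, text):
    """f1: confidence that a fragment is a "good" potential training fragment."""
    return clf.predict_proba([content_features(text)])[0][1]
```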





6.2 Approximate Matching Component

French et al. [1997] discussed the effectiveness of approximate word matching in information retrieval. As mentioned above, the objective of the approximate matching component is to boost the confidence of those potential text fragments that share some textual similarity with the previously collected items. To enhance robustness, we make use of edit distance [Gusfield 1997] and design a two-level approximate matching algorithm to compare the similarity between two strings. At the lower level, we compute the character-level edit distance of a given pair of tokens. At the upper level, we compute the token-level edit distance of a given pair of text fragments.

We illustrate our algorithm with an example. Suppose we obtain a potential training text fragment of the model number "PANASONIC DVDCV52" and a particular previously collected item content "PAN DVDRV32K". (These two model numbers are actually obtained from two different Web sites in our consumer electronics domain experiment; they refer to the same brand of products, but to different models.) At the lower level, we compute the character-level edit distance between two tokens, with the cost of insertion, deletion, and modification of a character all equal to one. The character-level edit distances are then normalized by the length of the longer token. For example, the normalized character-level edit distance between "PAN" and "PANASONIC" is 0.667. At the upper level, we compute the token-level edit distance between a potential training text fragment and a partially specified training example, with the cost of insertion and deletion of a token equal to one, and the cost of modification of a token equal to the character-level edit distance between the tokens. The token-level edit distance is then normalized by the larger number of tokens between the potential training text fragment and the partially specified training example. For instance, the normalized token-level edit distance between "PANASONIC DVDCV52" and "PAN DVDRV32K" is 0.521. Both the character-level and token-level edit distances can be efficiently computed by dynamic programming. The score, f2, of a potential training text fragment is then computed as follows:

    f2 = max_i { D'(c, l_i) }                                          (3)

where D'(c, l_i) = 1 − D(c, l_i) and D(c, l_i) is the normalized token-level edit distance between the potential training text fragment, c, and the i-th partially specified training example.
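The two-level computation can be sketched with standard dynamic programming as follows; this is an illustrative version under the unit-cost assumptions just described, not the authors' implementation. On the running example, `char_edit("PAN", "PANASONIC")` yields 0.667 and `token_edit("PANASONIC DVDCV52", "PAN DVDRV32K")` yields 0.521, matching the values above.

```python
def char_edit(a, b):
    """Normalized character-level edit distance between two tokens."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n] / max(m, n, 1)

def token_edit(s, t):
    """Normalized token-level edit distance; substitutions cost char_edit."""
    s, t = s.split(), t.split()
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + char_edit(s[i - 1], t[j - 1]))
    return d[m][n] / max(m, n, 1)

def approximate_match_score(c, collected_items):
    """f2 = max over previously collected items of 1 - token_edit (Equation 3)."""
    return max(1.0 - token_edit(c, l_i) for l_i in collected_items)
```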

7. NEW WRAPPER LEARNING FOR THE UNSEEN WEB SITE

In the machine-labeled training example discovery stage, the scores from the content classification component and the approximate matching component are computed. The final score, Score(c), of each potential training text fragment c is given by:

    Score(c) = λ f1 + (1 − λ) f2                                       (4)

where f1 and f2 are the scores obtained from the content classification component and the approximate matching component respectively, and λ is a parameter controlling the relative weight of the two components, with 0 < λ < 1. After the scores of the potential training text fragments are computed, IEKA selects the "good" potential training text fragments as machine-labeled training examples for the new unseen site: the N best-scoring potential training text fragments are selected as the machine-labeled training examples.
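Combining the pieces, a schematic version of the final scoring and selection step; the function names are ours, and λ and N would be set per item and domain as in Table X.

```python
def select_machine_labeled_examples(fragments, clf, collected_items,
                                    lam=0.5, n_best=5):
    """Rank potential training text fragments by Equation 4; keep the N best."""
    def score(c):
        f1 = content_score(clf, c)                        # content classification
        f2 = approximate_match_score(c, collected_items)  # approximate matching
        return lam * f1 + (1 - lam) * f2
    return sorted(fragments, key=score, reverse=True)[:n_best]
```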





Users could optionally scrutinize the discovered training examples to improve their quality; however, in our experiments we did not conduct any manual intervention, and the adaptation was conducted in a fully automatic way.

After obtaining a set of machine-labeled training examples, IEKA makes use of the wrapper learning component HISER, derived from our previous work [Lin and Lam 2000] and briefly presented in Section 4.3, to learn a wrapper tailored to the new Web site. A small refinement of HISER is performed to suit the new requirement. The set of machine-labeled training examples is different from a set of user-labeled training examples because the former may contain inaccurate training examples. The noise in the training example set tends to cause overgeneralization of the extraction rules, due to the scoring criteria of the extraction rule-learning algorithm in Lin and Lam [2000]. To cope with this effect, we introduce a metarule restricting the number of token generalizations performed by the extraction rule induction algorithm. Each extraction rule may then cover fewer training examples because of the restriction on its generalization power. This does not degrade the extraction performance of the learned wrapper, as each extraction rule set in the wrapper may contain more extraction rules to broaden its coverage, while the metarule avoids the overgeneralization effect. The newly learned wrapper is tailored to the new Web site, and it can be applied to the remaining pages in the new site for information extraction.

8. CASE STUDY

In this section, we present a complete case study to illustrate the steps in the process of adapting the wrapper from the source site to a new unseen site using IEKA. Refer to the scenario mentioned in Section 2. We first train a wrapper for the source site shown in Figure 1 using training examples similar to the one depicted in Table I. The trained wrapper is then applied to other pages from the same Web site to automatically extract items. The extraction performance of the learned wrapper is very promising. Both precision and recall for extracting the title from other pages of the Web site are 99.0%. The precision and recall for extracting the author are 80.2% and 97.0% respectively. Both precision and recall for extracting the price are 100%. Though the learned wrapper can effectively extract items from the Web pages of the same site, it cannot extract any item when we directly apply it to a new site, shown in Figure 2.

We apply our IEKA framework to adapt the extraction knowledge to this new site. At the potential training text fragment identification stage, IEKA first analyzes the DOM structure of the page, as presented in Section 5.2, to generate the useful text fragments from the Web page of the new site. This is achieved by identifying the paths in the DOM structure that locate the informative text nodes in the Web page of the new site. Table VI shows samples of the paths discovered and the associated useful text fragments generated.
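As a rough illustration of this DOM analysis, the sketch below groups the text nodes of a page by their root-to-node tag path using Python's standard html.parser. The real system additionally scores the paths against the threshold δ of Section 5.2 to decide which paths yield informative text nodes; that scoring is omitted here.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Group the text nodes of a page by their root-to-node tag path."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.by_path = {}
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        while self.stack:                 # pop back to the matching open tag
            if self.stack.pop() == tag:
                break
    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/" + "/".join(self.stack)
            self.by_path.setdefault(path, []).append(text)

def text_fragments_by_path(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.by_path          # e.g. {"/html/body/table/tr/td": [...]}
```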





Table VI. Samples of the Paths Discovered and the Associated Useful Text Fragments Generated for the Web Page Shown in Figure 2

Path:
Associated useful text fragments: Stephen Randy Davis; Herb Schildt; Steve Oualline; 2003; 2002

Path:
Associated useful text fragments: Published:; List Price; You Save; 6.00; 23.99

Path:
Associated useful text fragments: C++ Weekend Crash Course, 2nd edition; C++ from the Ground Up; Practical C++ Programming, 2nd Edition; Sams Teach Yourself C in 21 Days; Practical C++ Programming, 2nd Edition

Table VII. Samples of Potential Training Text Fragments Obtained by IEKA for the Web Page Shown in Figure 2

Stephen Randy Davis
Herb Schildt
Steve Oualline
6.00
23.99
C++ Weekend Crash Course, 2nd edition
C++ from the Ground Up
Practical C++ Programming, 2nd Edition

After that, the modified K-nearest neighbours classification method is applied to identify the potential training text fragments from the pool of useful text fragments. Table VII depicts samples of the potential training text fragments of the new site. Notice that the potential training text fragments discovered are not labeled and not grouped into records. At the machine-labeled training example discovery stage, the machine-labeled training examples are automatically discovered from the set of potential training text fragments. Table VIII depicts two samples of the discovered machine-labeled training examples. From the machine-labeled training examples, IEKA is able to learn the information extraction knowledge for the new site. The discovered hierarchical record structure for the new site is the same as the one shown in Figure 4, and Table IV shows the set of rules for extracting a book title. This wrapper adaptation is carried out in a fully automatic manner.

The newly learned wrapper is tailored to the new unseen site and can be applied to extract items from the Web pages of this new site. A very promising result is achieved. The precision and recall for the title are 100.0% and 90.0% respectively. The precision and recall for the author are 85.7% and 90.0% respectively. Both precision and recall for the price are 93.0%. We attempt to compare the wrapper adaptation results using IEKA with the





Table VIII. Samples of Machine-Labeled Training Examples Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 to the Web Site Shown in Figure 2 with Our IEKA Framework

Example 1
  Item          Item value                               Score
  Book Title:   C++ Weekend Crash Course, 2nd edition    0.756
  Author:       Stephen Randy Davis                      0.998
  Final Price:  23.99                                    1.000

Example 2
  Item          Item value                               Score
  Author:       Steve Oualline                           0.600
  Final Price:  31.96                                    1.000

results using manually annotated training examples in learning the wrapper for the unseen site. With manually annotated training examples, the precision and recall for the title are 100.0% and 90.0% respectively, the precision and recall for the author are 100.0% and 90.0% respectively, and both precision and recall for the price are 93.0%. The results of wrapper adaptation using IEKA are thus comparable with the results obtained by learning the wrapper from manually annotated training examples. Note that when IEKA is applied, no manual annotation is needed for the new site, eliminating a great deal of manual effort.

9. EXPERIMENTAL RESULTS

We conducted experiments on a number of real-world Web sites in three different domains, namely the book domain, the consumer electronics appliance domain, and the job advertisement domain, to demonstrate the effectiveness of IEKA. Table IX depicts the Web sites used in our experiments. The first column shows the Web site labels. The second column shows the names of the Web sites and the corresponding Web site addresses. The third and fourth columns depict the number of pages and the number of records collected from the Web sites for evaluation. T1, T2, and S1–S10 are the Web sites from the book domain; T1 and T2 are used for parameter tuning, while S1–S10 are used for testing. T3, T4, and S11–S20 are the Web sites from the consumer electronics appliance domain; T3 and T4 are used for parameter tuning, while S11–S20 are used for testing. Similarly, T5, T6, and S21–S30 are the Web sites from the job advertisement domain; T5 and T6 are used for parameter tuning, while S21–S30 are used for testing.

In IEKA, some parameters need to be determined, namely the threshold δ in the path-finding algorithm described in Section 5.2; the parameters K and θ in the modified K-nearest neighbours classification model described in Section 5.3; and the weight λ in the text fragment-classification model and the parameter N of the N-best potential training text fragments described in Section 7. We used T1 and T2, which were randomly selected from the book domain sites, to tune the parameters for the book domain. Similarly, we used T3–T4 and T5–T6 to conduct parameter tuning for the consumer electronics appliance and the job advertisement domains respectively. In each domain, we used different sets of parameters for adapting the wrapper from one of the tuning Web sites to the other and vice versa. For example, in the book domain, we first learned a wrapper from T1, which was then adapted to T2 using our IEKA framework. The extraction performance of the adapted wrapper in T2





Table IX. Information Sources for Experiments ("pp. #" and "rec. #" refer to the number of pages and the number of records collected in the Web site respectively)

      Web site (URL)                                                                    pp. #  rec. #
T1    Powell's Books (www.powells.com)                                                    12     302
T2    Barnes & Noble.com (www.barnesandnoble.com)                                         30     300
T3    Zones.com (www.zones.com)                                                           28     367
T4    Micro Warehouse Inc. (www2.warehouse.com)                                            6     120
T5    The Job Spider (www.jobspider.com)                                                   6     300
T6    The New York Times (jobmarket.nytimes.com/pages/jobs/)                              12     300
S1    Amazon.com (www.amazon.com)                                                         30     302
S2    Discount-PCBooks.com (www.discount-pcbooks.com)                                     14     110
S3    Jim's Computer Books (www.vstore.com/cgi-bin/pagegen/vstorecomputers/jimsbooks/)     8     141
S4    1Bookstreet.com (www.1bookstreet.com)                                                8     200
S5    Bookpool.com (www.bookpool.com)                                                     30     306
S6    Border, Inc. (www.bordersstores.com)                                                15     302
S7    Half Price Computer Books (www.halfpricecomputerbooks.com)                          13     259
S8    Abebooks.com (www.abebooks.com)                                                     12     362
S9    Harvard Book Store (www.harvard.com)                                                27     300
S10   WHSmith (www.whsmith.co.uk)                                                         12     300
S11   BestBuy.com (www.bestbuy.com)                                                        4     123
S12   Cambridge SoundWorks (www.hifi.com)                                                 12     157
S13   Circuit City (www.circuitcity.com)                                                   6     120
S14   DVD Overseas Electronics (www.dvdoverseas.com)                                      13     110
S15   4till9.com (www.4till9.com)                                                         14     189
S16   Etronics.com (www.etronics.com)                                                     12     107
S17   J & R Electronics Inc. (www.jandr.com)                                               3      92
S18   PC Mall (www.pcmall.com)                                                             4     160
S19   SoundCity.com (www.soundcity.com)                                                   10     100
S20   World Trade Electronics (www.worldtradeelectronics.com)                              5     216
S21   Career.com (www.career.com)                                                         30     300
S22   CareerBuilder.com (www.careerbuilder.com)                                           15     371
S23   Career Magazine (www.careermag.com)                                                 15     373
S24   CareerNET (www.careernet.com)                                                       15     375
S25   Job.com (www.job.com)                                                               15     300
S26   Jobs.com (www.jobs.com)                                                             12     300
S27   Vault (www.jobvault.com)                                                            15     285
S28   NowHiring.com (www.nowhiring.com)                                                   15     300
S29   Yahoo! hotjobs (hotjobs.yahoo.com)                                                  15     750
S30   Dice.com (www.dice.com)                                                             12     351

was recorded. Next, we learned the wrapper from T2 and adapted it to T1; the extraction performance of the adapted wrapper in T1 was also recorded. Different sets of parameters were tried, and the set producing the best average extraction performance was chosen for the remaining experiments on the testing data. Table X summarizes the range and interval of the parameters used in the parameter tuning process, and the parameter values obtained from the tuning process for use in the testing phase in each domain.
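The tuning procedure amounts to a grid search over the ranges and intervals in Table X. A schematic version is shown below; `adapt_and_evaluate` is a hypothetical routine that adapts a wrapper between the two tuning sites with the given parameters and returns the resulting average extraction performance.

```python
from itertools import product

def tune_parameters(adapt_and_evaluate, site_a, site_b):
    """Grid search over (delta, K, theta, lambda, N) on a tuning pair, e.g. T1/T2."""
    deltas = [round(0.05 * i, 2) for i in range(1, 11)]  # 0.05 .. 0.5, step 0.05
    ks     = list(range(1, 11))                          # 1 .. 10, step 1
    thetas = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1 .. 1.0, step 0.1
    lams   = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1 .. 0.9, step 0.1
    ns     = [5, 10, 15]
    best, best_score = None, -1.0
    for params in product(deltas, ks, thetas, lams, ns):
        # adapt in both directions and average, as in the paper's protocol
        score = (adapt_and_evaluate(site_a, site_b, params)
                 + adapt_and_evaluate(site_b, site_a, params)) / 2
        if score > best_score:
            best, best_score = params, score
    return best
```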





Table X. The Range and Interval of the Parameters Used in the Parameter Tuning Process, and the Parameter Values Obtained from the Tuning Process for Use in the Testing Phase in Each Domain. Book, Electronic, and Job Refer to the Book Domain, the Consumer Electronics Appliance Domain, and the Job Advertisement Domain Respectively

Parameter  Range      Interval  Book                Electronic            Job
δ          0.05–0.5   0.05      0.05                0.05                  0.1
K          1–10       1         3                   3                     3
θ          0.1–1.0    0.1       0.2                 0.3                   0.2
λ          0.1–0.9    0.1       0.5 (book title)    0.4 (model number)    0.3 (job title)
                                0.6 (author)        0.8 (description)     0.7 (company)
                                0.3 (price)         0.3 (price)           0.5 (location)
N          5, 10, 15  -         5 (book title)      5 (model number)      5 (job title)
                                10 (author)         10 (description)      5 (company)
                                5 (price)           10 (price)            5 (location)

We conducted several sets of experiments in each domain. For each domain, we first provide a few training examples in each Web site to learn the wrapper. After obtaining the wrapper for a particular Web site, we apply different wrapper adaptation approaches to adapt the learned wrapper to all the remaining sites. For example, the wrapper learned from S1 is adapted to S2–S10 to extract items in the book domain.

The first set of experiments is called Baseline0. In this set of experiments, we simply apply the wrapper learned from a particular Web site directly to all the remaining Web sites for information extraction, without adaptation. This set of experiments is treated as the first baseline for comparison.

The second set of experiments, namely Baseline1, is a simple wrapper adaptation approach. Baseline1 makes use of the DOM structure of the Web page to identify a set of text fragments in the target site. These text fragments are then classified into the different items of interest by a Support Vector Machines (SVM) classification model trained on the items extracted from the source Web site, considering several characteristics of the text fragments such as their length and number of digits. Compared with IEKA, Baseline1 does not differentiate between the site-invariant and site-dependent features. Hence it considers neither the orthographic information of the items nor the layout-format information about the text fragments obtained from other Web pages in the same Web site.

The third set of experiments employs our IEKA framework for adaptation.

The fourth set of experiments is a variant of ROADRUNNER (available at http://www.dia.uniroma3.it/db/roadRunner/software.html); its purpose is to compare IEKA with existing unsupervised information extraction methods. We first apply the wrapper learned from the source Web site to extract information in the same site. The extracted data are treated as lexicons of the items. Next, we apply ROADRUNNER to extract information in the target site. Since ROADRUNNER extracts information in an unsupervised manner, the extracted data are not labeled. We achieve the labeling task by first computing the edit distance between each of the extracted items and each entry of the lexicons; each extracted item is then assigned the label of its nearest neighbour in the lexicons. We denote this set of experiments as





ROADRUNNER+NN. Another set of experiments makes use of the existing wrapper learning package called WIEN [Kushmerick and Grace 1998] to perform the same task without adaptation (WIEN is available at http://www.cs.ucd.ie/staff/nick/research/research/wrappers/wien/). Three more sets of experiments focusing on the evaluation of each component of IEKA are also conducted.

The extraction performance is evaluated by two commonly used metrics called precision and recall. Precision is defined as the number of items which the system correctly identified divided by the total number of items it extracts. Recall is defined as the number of items which the system correctly identified divided by the total number of actual items.
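In code form, these two metrics reduce to a few lines; treating the extracted and actual items as multisets is a simplifying assumption of ours.

```python
from collections import Counter

def precision_recall(extracted, actual):
    """Precision = correct / |extracted|; recall = correct / |actual|."""
    correct = sum((Counter(extracted) & Counter(actual)).values())
    return (correct / max(1, len(extracted)),
            correct / max(1, len(actual)))
```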

9.1 Book Domain

In the book domain, the items of interest are the book title, the author, and the price. Table XI summarizes the average extraction performance on title, author, and price for the experiments on Baseline1, IEKA, and ROADRUNNER+NN when training examples of one particular Web site are provided. The first column shows the Web sites where training examples are given. Each row summarizes the results obtained by adapting the learned wrapper of the Web site shown in the first column to all other sites for extraction. For example, the wrapper learned from S1 is adapted to S2–S10, and the precision and recall for extracting the book title are 85.4% and 85.7% respectively. The last row depicts the average extraction performance of the experiments. Note that Baseline0 as well as WIEN obtain zero precision and recall in extracting all the items in all of the new unseen sites under the same experimental setup; hence they are not shown in Table XI. This illustrates the problem of typical wrapper induction and the need for wrapper adaptation.

The experimental results demonstrate that our IEKA framework can greatly improve the extraction performance in adapting the wrapper from a particular site to new unseen sites. The average precision and recall of IEKA are 82.1% and 90.8% respectively for the book title; 65.7% and 81.3% respectively for the author; and 67.8% and 79.9% respectively for the price. The overall average precision and recall for all items are 71.9% and 84.0% respectively. The extraction performance is better than Baseline1, whose average precision and recall are 32.8% and 34.7% respectively for the book title; 27.6% and 24.9% respectively for the author; and 42.5% and 66.6% respectively for the price. The overall average precision and recall of Baseline1 for all items are 34.3% and 42.1% respectively. Note that Baseline1 cannot extract any book title from other Web sites when the wrapper is learned from S1 or S4; after applying IEKA, the precision and recall in extracting the book title are 85.4% and 85.7% respectively for S1, and 72.2% and 96.0% respectively for S4. The improvement is especially significant for the book title and the author because IEKA considers the orthographic information of the items, which provides very useful clues for identifying these items in the new unseen site.

Table XI. Average Extraction Performance on Book Title, Author, and Price for the Book Domain for the Cases of Using Baseline1, IEKA, and ROADRUNNER+NN When Training Examples of One Particular Information Source Are Provided. Note that Baseline0 and WIEN Obtain Zero Precision and Recall for All Items in All Sites; Hence They Are Not Shown. (R.+NN denotes ROADRUNNER+NN; entries are precision/recall in percentage.)

       Book title                           Author                               Price
       Baseline1  IEKA       R.+NN          Baseline1  IEKA       R.+NN          Baseline1  IEKA       R.+NN
S1     0.0/0.0    85.4/85.7  42.8/79.3      10.3/21.1  70.9/87.0  29.3/62.3      46.1/66.8  72.5/78.2  16.2/45.7
S2     81.1/84.2  90.7/95.3  36.3/79.1      42.5/37.5  80.3/80.6  26.2/50.4      45.3/67.1  70.1/78.5  21.0/56.8
S3     0.1/0.1    87.9/95.5  31.5/81.2      5.6/10.0   60.6/74.6  31.2/39.2      35.5/55.8  64.3/76.9  15.6/45.7
S4     0.0/0.0    72.2/96.0  30.4/74.5      14.9/13.6  63.8/83.1  31.5/55.3      43.0/69.5  69.4/80.9  10.4/13.7
S5     67.5/74.9  86.3/96.5  40.3/79.4      27.6/23.6  74.5/86.5  25.4/50.4      37.9/65.6  63.8/77.0  19.0/45.7
S6     2.2/11.1   78.1/85.4  45.5/79.2      20.9/17.7  61.0/76.1  22.4/53.2      43.0/65.6  63.8/76.9  17.6/45.8
S7     69.7/64.2  78.6/86.4  40.2/78.2      51.9/44.3  62.2/81.8  30.6/63.2      46.8/69.5  65.8/80.9  15.7/3.2
S8     18.8/20.1  85.4/96.0  42.2/79.7      11.1/10.0  59.2/74.6  26.0/52.5      46.1/66.8  74.9/88.0  23.0/56.8
S9     41.5/42.0  92.1/85.6  45.6/89.8      31.8/26.8  64.7/81.8  31.0/60.8      43.2/73.4  69.6/84.7  21.1/56.8
S10    47.5/50.7  64.2/85.5  40.7/77.9      59.8/44.5  60.1/86.5  15.9/52.8      38.1/65.9  63.8/76.9  21.7/56.8
Ave.   32.8/34.7  82.1/90.8  39.5/79.8      27.6/24.9  65.7/81.3  27.0/54.0      42.5/66.6  67.8/79.9  18.1/42.7







The extraction performance of IEKA is also better than ROADRUNNER+NN, which achieves average precision and recall of 39.5% and 79.8% respectively in extracting the book title, 27.0% and 54.0% respectively in extracting the author, and 18.1% and 42.7% respectively in extracting the price. We observe that one major problem of ROADRUNNER+NN is that ROADRUNNER extracts data by finding repeated patterns among Web pages and frequently extracts irrelevant information. For example, the text fragments "Agriculture," "Architecture," and so on, which appear on the left-hand side of the Web page shown in Figure 1, are hyperlinks to different book categories. Such text fragments are not related to the book records and become noise for the subsequent classification process. Consequently, ROADRUNNER+NN obtains a less satisfactory performance.

Table XII shows the detailed results of the experiments for adapting wrappers from the source site to new unseen sites using IEKA in the book domain. The first column shows the Web sites (source sites) from which the wrappers are learned with given training examples, and the first row shows the Web sites (new unseen sites) to which the learned wrapper of a particular Web site is adapted. Each cell contains three subrows recording the extraction performance (precision/recall) on the book title, the author, and the price respectively. The results indicate that the extraction performance is very satisfactory. When the wrapper from S1 is adapted to S7 using our IEKA framework, the precision and recall are 86.3% and 91.1% respectively for the book title, 99.2% and 96.1% respectively for the author, and 100.0% and 96.1% respectively for the price. These results are comparable to the case in which manually annotated training examples are provided for wrapper learning in S7. IEKA cannot extract the price from S8. The reason is that the price of every book in S8 is "1.00". Since the text fragment "1.00" repeats a number of times in the new unseen site, it is considered uninformative by the modified K-nearest neighbours classification model. Nevertheless, it is also uninteresting to extract the price if all the books are listed at such a hypothetical price in the Web site.

We carried out another three sets of experiments to evaluate the effectiveness of each component of IEKA. In each set of these experiments, one of the modified K-nearest neighbours classification model described in Section 5.3, the content classification component described in Section 6.1, and the approximate matching component described in Section 6.2 is removed from IEKA to investigate its effectiveness. Table XIII depicts the results of the experiments. The first column shows the Web sites (source sites) from which the wrappers are learned with given training examples. The columns denoted by "IEKA\KNN," "IEKA\Con.," and "IEKA\App." give the extraction performance of IEKA without the modified K-nearest neighbours classification model, without the content classification component, and without the approximate matching component respectively, while the column "IEKA" gives the performance of the full version of IEKA. Each cell in Table XIII shows the average extraction performance for adapting the wrapper from the Web site in column one to the other remaining sites; the meaning of the values shown in each cell is similar to that in Table XII. The results illustrate that each component of IEKA improves the extraction performance. For example, the approximate matching


Table XII. Experimental Results of Adapting a Learned Wrapper Using Our IEKA Framework from One Information Source (Rows) to the Remaining Information Sources (Columns) in the Book Domain. The Three Subrows of Each Cell Record the Extraction Performance on the Book Title, the Author, and the Price Respectively; Entries Are Precision/Recall in Percentage

From  Item    S1          S2          S3          S4          S5          S6           S7          S8          S9          S10
S1    Title   -           83.9/100.0  95.2/99.3   97.9/95.5   98.2/90.0   44.7/100.0   86.3/91.1   65.7/6.4    96.4/100.0  100.0/89.3
      Author  -           38.1/100.0  60.6/85.2   100.0/96.0  50.0/90.0   50.6/60.0    99.2/96.1   100.0/96.6  89.4/69.0   50.0/90.2
      Price   -           42.7/85.5   95.2/99.3   67.8/96.0   99.7/98.3   100.0/99.3   100.0/96.1  0.0/0.0     47.3/29.2   100.0/100.0
S2    Title   49.5/97.3   -           98.6/99.3   97.4/95.5   98.5/90.0   99.7/100.0   79.9/91.1   100.0/95.6  92.8/100.0  100.0/89.3
      Author  77.8/73.9   -           75.2/85.2   100.0/66.0  99.6/90.0   53.1/60.0    51.9/94.9   76.1/96.6   89.4/69.0   100.0/90.2
      Price   21.3/88.0   -           95.2/99.3   67.8/96.0   99.7/98.3   100.0/99.3   100.0/96.1  0.0/0.0     47.3/29.2   100.0/100.0
S3    Title   83.0/97.3   82.5/100.0  -           97.4/95.5   99.6/90.0   99.7/100.0   86.5/92.2   49.7/95.6   93.1/100.0  100.0/89.3
      Author  79.8/73.9   0.0/0.0     -           50.0/96.0   49.9/90.0   50.6/60.0    50.9/96.1   99.7/96.6   89.2/69.0   75.3/90.2
      Price   21.3/88.0   42.7/85.5   -           67.8/96.0   99.7/98.3   100.0/99.3   100.0/96.1  0.0/0.0     47.3/29.2   100.0/100.0
S4    Title   49.5/97.3   42.5/100.0  28.3/99.3   -           100.0/90.0  100.0/100.0  86.5/92.2   49.7/95.6   93.4/100.0  100.0/89.3
      Author  38.8/73.9   37.5/100.0  44.3/60.2   -           49.9/90.0   50.7/60.0    53.6/96.1   99.7/96.6   100.0/80.7  99.6/90.2
      Price   21.3/88.0   42.7/85.5   95.2/99.3   -           99.7/98.3   100.0/99.3   100.0/96.1  0.0/0.0     65.5/61.7   100.0/100.0
S5    Title   83.0/97.3   84.7/100.0  95.2/99.3   49.5/95.5   -           100.0/100.0  95.9/91.1   75.3/95.6   92.8/100.0  100.0/89.3
      Author  72.5/73.9   80.3/100.0  60.6/85.2   100.0/96.0  -           53.1/60.0    53.7/96.1   100.0/96.6  100.0/80.7  50.0/90.2
      Price   21.3/88.0   42.7/85.5   95.2/99.3   67.8/96.0   -           100.0/99.3   100.0/96.1  0.0/0.0     47.3/29.2   100.0/100.0
S6    Title   83.0/97.3   83.2/100.0  95.2/99.3   58.8/95.5   97.8/90.0   -            86.3/91.1   6.1/6.4     92.8/100.0  100.0/89.3
      Author  38.0/73.9   0.0/0.0     71.0/80.5   52.9/96.0   49.8/90.0   -            47.8/94.9   99.7/90.1   89.4/69.0   100.0/90.2
      Price   21.3/88.0   42.7/85.5   95.2/99.3   67.8/96.0   99.7/98.3   -            100.0/96.1  0.0/0.0     47.3/29.2   100.0/100.0
S7    Title   89.8/97.3   83.9/100.0  95.2/99.3   96.0/95.5   98.5/90.0   44.7/100.0   -           6.1/6.4     92.8/100.0  100.0/89.3
      Author  38.8/73.9   37.5/100.0  44.3/60.2   50.0/96.0   99.3/90.0   50.7/60.0    -           100.0/96.6  89.4/69.0   50.0/90.2
      Price   21.3/88.0   42.7/85.5   95.2/99.3   67.8/96.0   99.7/98.3   100.0/99.3   -           0.0/0.0     65.5/61.7   100.0/100.0
S8    Title   90.7/97.3   42.5/100.0  49.3/99.3   97.0/95.5   98.5/90.0   100.0/100.0  95.2/92.2   -           95.8/100.0  100.0/89.3
      Author  36.8/73.9   41.6/100.0  60.6/85.2   100.0/96.0  0.7/0.7     50.7/60.0    53.6/96.1   -           89.4/69.0   99.6/90.2
      Price   21.3/88.0   42.7/85.5   95.2/99.3   67.8/96.0   99.7/98.3   100.0/99.3   100.0/96.1  -           47.3/29.2   100.0/100.0
S9    Title   90.7/97.3   84.7/100.0  95.2/99.3   99.0/95.5   98.5/90.0   99.7/100.0   95.6/92.2   65.7/6.4    -           100.0/89.3
      Author  38.0/73.9   28.5/100.0  71.0/80.5   100.0/66.0  49.9/90.0   42.5/49.4    52.1/96.1   100.0/90.1  -           100.0/90.2
      Price   21.3/88.0   42.7/85.5   95.2/99.3   67.8/96.0   99.7/98.3   100.0/99.3   100.0/96.1  0.0/0.0     -           100.0/100.0
S10   Title   49.5/97.3   0.0/0.0     50.0/99.3   49.4/95.5   98.9/90.0   100.0/100.0  86.5/92.2   49.7/95.6   93.4/100.0  -
      Author  65.8/73.9   42.7/100.0  60.6/85.2   50.0/96.0   49.9/90.0   50.7/60.0    98.8/96.1   49.3/96.6   73.5/80.7   -
      Price   21.3/88.0   42.7/85.5   95.2/99.3   67.8/96.0   99.7/98.3   100.0/99.3   100.0/96.1  0.0/0.0     47.3/29.2   -

Table XIII. Experimental Results of Adapting a Learned Wrapper from One Information Source to the Remaining Information Sources in the Book Domain, Using IEKA Without the Modified K-Nearest Neighbours Classification Model (IEKA\KNN), IEKA Without the Content Classification Component (IEKA\Con.), IEKA Without the Approximate Matching Component (IEKA\App.), and the Full Version of IEKA. The Three Subrows in a Cell Record the Extraction Performance on the Book Title, the Author, and the Price Respectively; Entries Are Precision/Recall in Percentage

            IEKA\KNN    IEKA\Con.   IEKA\App.   IEKA
S1   Title  63.7/64.6   94.4/95.8   22.0/19.9   85.4/85.7
     Author 75.9/79.5   64.2/88.3   14.3/24.0   70.9/87.0
     Price  46.1/66.8   39.2/45.1   39.2/45.1   72.5/78.2
S2   Title  70.5/53.2   74.2/84.3   79.7/94.8   90.7/95.3
     Author 77.2/70.9   26.0/74.6   67.2/66.3   80.3/80.6
     Price  45.3/67.1   26.5/34.2   36.8/45.4   70.1/78.5
S3   Title  85.2/54.7   74.6/95.5   24.0/42.5   87.9/95.5
     Author 65.4/59.4   34.7/64.6   18.0/31.1   60.6/74.6
     Price  35.5/55.8   31.2/43.7   31.0/43.9   64.3/76.9
S4   Title  83.9/64.8   65.9/84.9   0.0/0.1     72.2/96.0
     Author 68.5/66.8   50.7/76.4   49.3/61.7   63.8/83.1
     Price  43.0/69.5   25.5/36.8   23.5/33.2   69.4/80.9
S5   Title  54.4/65.2   86.4/96.5   67.7/84.8   86.3/96.5
     Author 86.3/84.7   64.4/86.5   52.9/67.4   74.5/86.5
     Price  48.5/76.6   41.6/54.9   31.0/43.9   63.8/77.0
S6   Title  59.7/43.6   74.4/95.3   10.7/20.3   78.1/85.4
     Author 65.2/65.4   58.4/65.5   44.1/41.9   61.0/76.1
     Price  43.0/65.6   41.6/54.9   31.0/43.9   63.8/76.9
S7   Title  67.5/75.8   77.7/86.4   53.1/74.0   78.6/86.4
     Author 69.0/70.1   61.6/85.8   64.2/74.8   62.2/81.8
     Price  46.8/69.5   32.5/47.8   32.5/47.8   65.8/80.9
S8   Title  46.4/42.8   63.8/84.9   36.0/52.4   85.4/96.0
     Author 70.4/64.4   49.9/63.9   8.7/14.0    59.2/74.6
     Price  46.1/66.8   31.2/43.7   41.6/54.9   74.9/88.0
S9   Title  56.4/43.3   84.5/95.5   80.3/94.8   92.1/85.6
     Author 69.0/71.7   30.0/85.9   52.2/61.7   64.7/81.8
     Price  43.2/73.4   36.5/51.5   36.3/51.7   69.6/84.7
S10  Title  82.9/74.5   54.6/75.0   38.0/62.5   64.2/85.5
     Author 41.6/38.3   25.0/64.8   61.0/69.2   60.1/86.5
     Price  38.1/65.9   31.0/43.9   41.6/54.9   63.8/76.9
Ave. Title  67.1/58.3   75.1/89.4   41.2/54.6   82.1/90.8
     Author 68.9/67.1   46.5/75.6   43.2/51.2   65.7/81.3
     Price  43.6/67.7   33.7/45.7   34.5/46.5   67.8/79.9

component can increase the average precision and recall from 41.2% and 54.6% to 82.1% and 90.8% respectively for the book title; from 43.2% and 51.2% to 65.7% and 81.3% respectively for the author; and from 34.5% and 46.5% to 67.8% and 79.9% respectively for the price. Similarly, the content classification component improves the precision and recall in extracting all the items. The modified K-nearest neighbours classification model improves the average precision and recall from 67.1% and 58.3% to 82.1% and 90.8% respectively for the book title, and from 43.6% and 67.7% to 67.8% and 79.9% respectively for the price. Though the precision for extracting the author degrades slightly, the





recall is significantly increased, by nearly 15.0%, when the modified K-nearest neighbours classification model is applied.

9.2 Consumer Electronics Appliance Domain and Job Advertisement Domain

In the consumer electronics appliance domain, the items of interest are the model number, the description, and the price. In the job advertisement domain, the items of interest are the job title, the company, and the location. Tables XIV and XV summarize the average extraction performance on the different items in the consumer electronics appliance domain and the job advertisement domain respectively. The tables contain the experimental results of Baseline1, IEKA, and ROADRUNNER+NN when training examples of one particular Web site are provided. Note that the experiments of Baseline0 and WIEN obtain zero precision and recall in all cases; hence they are not shown in Tables XIV and XV.

In the consumer electronics appliance domain, the extraction performance is very satisfactory. The average precision and recall of Baseline1 are 37.6% and 33.5% for the model number; 44.3% and 39.4% for the description; and 58.7% and 69.2% for the price. On the other hand, the average precision and recall of IEKA are 61.5% and 59.7% for the model number; 63.8% and 61.3% for the description; and 78.8% and 86.6% for the price. Note that the description contains a large portion of free text and varies greatly across different Web sites, which poses more difficulty for extraction. The extraction performance of ROADRUNNER+NN is less satisfactory: the average precision and recall are 15.1% and 46.3% for the model number; 23.5% and 72.2% for the description; and 42.5% and 79.5% for the price. As in the book domain, ROADRUNNER extracts some irrelevant information in the new site, which harms the overall performance. Moreover, the data contained in different Web sites are very dissimilar, and simply using edit distance cannot precisely predict the labels of the extracted data.

Different from the book domain and the consumer electronics appliance domain, all three attributes in the job advertisement domain are mainly text fragments; wrapper adaptation in the job advertisement domain is therefore a more challenging task. Our IEKA framework achieves a very satisfactory extraction performance: the average precision and recall are 66.1% and 71.6% in extracting the job title; 44.0% and 40.0% in extracting the company; and 58.3% and 56.0% in extracting the location. The performance of IEKA is better than Baseline1, whose average precision and recall are 15.6% and 13.8% in extracting the job title; 17.6% and 14.7% in extracting the company; and 27.6% and 21.6% in extracting the location, as well as ROADRUNNER+NN, whose average precision and recall are 28.1% and 63.6% in extracting the job title; 28.6% and 63.4% in extracting the company; and 27.8% and 39.7% in extracting the location.

The results show that our adaptation approach achieves very satisfactory extraction performance. Compared with Baseline0, Baseline1, WIEN, and ROADRUNNER+NN, IEKA shows a great improvement in extracting items in the new unseen sites. We also conducted experiments in the consumer electronics appliance domain and the job advertisement domain on evaluating the effectiveness

Table XIV. Average Extraction Performance on Model Number, Description, and Price for the Consumer Electronics Appliance Domain for the Experiments of Baseline1, IEKA, and ROADRUNNER+NN When Training Examples of One Particular Information Source Are Provided. Note that Baseline0 and WIEN Obtain Zero Precision and Recall for All Items in All Sites; Hence They Are Not Shown. (R.+NN denotes ROADRUNNER+NN; entries are precision/recall in percentage.)

       Model number                         Description                          Price
       Baseline1  IEKA       R.+NN          Baseline1  IEKA       R.+NN          Baseline1  IEKA       R.+NN
S11    16.6/17.0  66.6/57.0  13.9/53.0      54.4/41.1  66.2/66.9  31.7/57.6      62.1/70.7  83.1/89.2  21.3/78.0
S12    33.2/32.3  61.0/73.0  18.3/71.4      48.6/41.1  65.2/67.8  30.9/64.1      68.0/81.6  78.0/90.8  40.0/78.7
S13    43.2/37.9  70.7/67.8  13.9/25.1      45.3/45.3  75.6/73.6  14.7/68.3      65.6/74.0  86.6/92.6  45.1/78.0
S14    44.4/40.9  61.0/58.0  14.2/58.4      43.6/31.3  70.4/63.7  26.5/65.3      56.9/71.8  78.7/85.3  39.4/89.1
S15    27.7/27.8  33.6/37.6  17.1/32.2      53.3/50.8  81.5/73.0  15.9/92.2      50.6/50.3  82.1/78.1  65.7/85.1
S16    44.4/37.7  55.6/46.7  14.1/51.5      44.1/40.1  70.8/62.7  23.1/63.0      57.0/70.5  78.1/89.1  38.2/78.0
S17    44.3/42.3  77.6/69.5  15.6/57.6      54.4/41.1  64.6/64.2  30.3/62.5      49.2/62.3  80.1/90.5  33.4/77.7
S18    33.4/25.7  66.8/67.7  16.6/58.4      45.3/51.2  67.4/78.4  26.6/69.4      54.3/60.6  75.2/79.1  40.1/78.0
S19    55.5/46.6  72.1/67.5  14.4/32.9      43.3/41.0  65.5/51.3  17.7/89.4      66.8/79.8  78.0/90.3  55.0/74.8
S20    33.0/27.3  49.9/52.4  13.1/22.2      10.4/10.9  10.4/10.9  17.6/90.1      56.9/70.8  68.0/81.2  46.7/77.7
Ave.   37.6/33.5  61.5/59.7  15.1/46.3      44.3/39.4  63.8/61.3  23.5/72.2      58.7/69.2  78.8/86.6  42.5/79.5




Table XV. Average Extraction Performance on Job Title, Company, and Location for the Job Advertisement Domain for the Cases of Using Baseline1, IEKA, and ROADRUNNER+NN When Training Examples of One Particular Information Source Are Provided. Note that Baseline0 and WIEN Obtain Zero Precision and Recall for All Items in All Sites; Hence They Are Not Shown. (R.+NN denotes ROADRUNNER+NN; entries are precision/recall in percentage.)

       Job title                            Company                              Location
       Baseline1  IEKA       R.+NN          Baseline1  IEKA       R.+NN          Baseline1  IEKA       R.+NN
S21    3.7/1.3    68.5/67.8  24.7/52.7      9.7/4.7    57.1/54.2  22.1/68.0      22.0/21.8  40.2/47.8  19.3/29.2
S22    0.0/0.0    52.7/74.2  32.3/60.0      19.3/13.6  48.2/35.9  28.7/56.0      16.6/12.5  70.2/55.5  31.4/57.8
S23    7.2/3.1    57.4/71.9  24.6/63.5      13.9/18.3  33.7/27.6  36.7/61.4      6.4/9.2    75.6/63.4  32.3/48.2
S24    24.3/22.5  73.2/66.2  24.8/63.6      25.2/15.0  33.5/24.1  36.7/60.7      8.2/9.5    67.5/62.8  32.8/48.0
S25    36.4/34.1  66.5/74.2  23.8/63.5      13.3/16.5  61.5/57.3  34.5/53.0      0.1/0.0    8.6/17.9   39.5/35.5
S26    58.1/47.1  75.9/74.7  35.8/71.0      41.1/27.9  43.9/40.4  19.8/72.4      50.9/49.9  70.0/64.0  18.7/19.3
S27    0.0/0.0    62.7/63.0  23.8/72.9      0.0/0.0    23.3/24.8  27.1/54.2      25.4/17.8  49.5/49.0  0.0/0.0
S28    19.1/22.3  67.9/71.4  32.0/61.3      7.9/0.6    43.5/37.6  22.8/62.6      58.9/37.4  68.7/63.3  31.5/45.5
S29    0.2/0.1    62.0/81.7  34.8/73.9      27.3/32.9  63.4/60.6  30.8/76.1      36.2/33.6  69.1/74.7  46.7/62.4
S30    7.0/7.2    74.3/70.8  24.0/53.1      18.3/17.5  31.9/37.5  26.5/69.2      50.8/24.6  63.5/61.8  25.7/50.9
Ave.   15.6/13.8  66.1/71.6  28.1/63.6      17.6/14.7  44.0/40.0  28.6/63.4      27.6/21.6  58.3/56.0  27.8/39.7







of each component of IEKA, in a manner similar to the experiments conducted in the book domain; the results demonstrate a similar trend and support the same conclusion as in the book domain.

9.3 Discussion

In the experiments, Baseline0 and WIEN are conducted by directly applying the wrapper learned from the source site to the new unseen sites for extraction without adaptation. The experimental results show that the wrapper learned from the source Web site cannot extract any item from the new unseen sites without adaptation. This illustrates the problem of existing wrapper learning methods. As explained in Section 4.1, it is due to the difference in the site-dependent features between the source Web site and the new unseen sites. IEKA solves the wrapper adaptation problem fully automatically by analyzing the site-invariant and site-dependent features, and achieves a very promising performance in adapting the wrapper from the source Web site to new unseen sites.

IEKA achieves especially good performance in extracting items with domain-specific text string characteristics, such as the book title in the book domain and the job title in the job advertisement domain. This is due to the fact that the site-invariant features embodied in the learned wrapper and the extracted items from the source site provide very useful clues for identifying these items in the new unseen site. However, for items that are less domain dependent or involve more free text, such as the descriptions in the consumer electronics appliance domain, the performance is only satisfactory. One reason is that it is more difficult to derive site-invariant features for these items.

Unsupervised approaches such as ROADRUNNER can effectively extract information from Web pages without any manually labeled training examples. However, one disadvantage is that the extracted items are not labeled, and hence human effort is needed to interpret the meaning of the extracted data. On the contrary, IEKA can make use of the information extraction knowledge previously learned, and the items previously collected or extracted in the source Web site, to automatically label the extracted items in the target site. This can significantly reduce the human effort involved. Experimental results also demonstrate the effectiveness of IEKA.

10. CONCLUSIONS AND FUTURE WORK

We develop a framework called IEKA, which is capable of adapting previously learned extraction knowledge from a source Web site to a new unseen site in the same domain. We analyze the problem by identifying two kinds of features, namely site-invariant features and site-dependent features. Under a particular domain, the site-invariant features, such as the text fragments regarding the item content, remain largely unchanged, while the site-dependent features, such as the text fragments regarding the layout format, are different for Web pages originating from different Web sites. This provides very useful clues for extracting information from Web pages.





Some site-invariant features can be derived from the previously learned extraction knowledge and the partially specified training examples, which refer to the items previously extracted or collected in the source Web site. IEKA considers the DOM structure of the Web pages and automatically generates the machine-labeled training examples. The machine-labeled training examples are utilized for learning a wrapper tailored to the unseen Web site; both site-invariant features and site-dependent features are considered for these machine-labeled training examples. Finally, IEKA can learn the new information extraction knowledge for the new unseen site. We conducted extensive experiments on several real-world Web sites to demonstrate the effectiveness of IEKA. Empirical results show that IEKA is able to automatically extract information from new unseen Web sites.

We intend to extend this research in several directions. One possible direction is to investigate a more formal model based on Bayesian learning; a preliminary model along this line is reported in Wong and Lam [2004a]. Another direction is to incorporate domain-specific knowledge from users. Very often, users may already have some background information or knowledge about the domain. For example, users may have prior knowledge about the format of the items. We intend to develop a mechanism through which users can easily incorporate their domain-specific knowledge.

REFERENCES

AMBITE, J., BARISH, G., KNOBLOCK, C., MUSLEA, M., OH, J., AND MINTON, S. 2002. Getting from here to there: Interactive planning and agent execution for optimizing travel. In Proceedings of the Fourteenth Innovative Applications of Artificial Intelligence Conference, 862–869.
BILENKO, M. AND MOONEY, R. 2003. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 39–48.
BLEI, D., BAGNELL, J., AND MCCALLUM, A. 2002. Learning with scope, with application to information extraction and classification. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), 53–60.
BRIN, S. 1998. Extracting patterns and relations from the World Wide Web. In Proceedings of the International Workshop on the Web and Databases, 172–183.
CALIFF, M. E. AND MOONEY, R. J. 2003. Bottom-up relational learning of pattern matching rules for information extraction. J. Mach. Learn. Res. 4, 177–210.
CHAWATHE, S., GARCIA-MOLINA, H., HAMMER, J., IRELAND, K., PAPAKONSTANINOU, Y., ULLMAN, J., AND WIDOM, J. 1994. The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of the Information Processing Society of Japan, 7–18.
CIRAVEGNA, F. 2001. (LP)2, an adaptive algorithm for information extraction from Web-related texts. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI), 1251–1256.
COHEN, W. 1999. Reasoning about textual similarity in a Web-based information access system. Autonomous Agents and Multi-Agent Systems 2(1), 65–86.
COHEN, W. AND FAN, W. 1999. Learning page-independent heuristics for extracting data from Web pages. Comput. Netw. 31(11-16), 1641–1652.
COHEN, W. W., HURST, M., AND JENSEN, L. 2002. A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the Eleventh International World Wide Web Conference (WWW), 232–241.
CRESCENZI, V., MECCA, G., AND MERIALDO, P. 2001. ROADRUNNER: Towards automatic data extraction from large Web sites. In Proceedings of the 27th Very Large Databases Conference (VLDB), 109–118.





DOORENBOS, R. B., ETZIONI, O., AND WELD, D. S. 1997. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the First International Conference on Autonomous Agents, 39–48.
DOWNEY, D., ETZIONI, O., AND SODERLAND, S. 2005. A probabilistic model of redundancy in information extraction. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (IJCAI), 1034–1041.
ETZIONI, O., CAFARELLA, M., DOWNEY, D., POPESCU, A. M., SHAKED, T., SODERLAND, S., AND WELD, D. 2005. Unsupervised named-entity extraction from the Web: An experimental study. Artif. Intell. 165(1), 91–134.
FREITAG, D. AND MCCALLUM, A. 1999. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.
FRENCH, J. C., POWELL, A. L., AND SCHULMAN, E. 1997. Applications of approximate word matching in information retrieval. In Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM), 9–15.
GHANI, R. AND JONES, R. 2002. A comparison of efficacy and assumptions of bootstrapping algorithms for training information extraction systems. In Proceedings of the Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Data at the Linguistic Resources and Evaluation Conference.
GOLGHER, P. AND DA SILVA, A. 2001. Bootstrapping for example-based data extraction. In Proceedings of the Tenth ACM International Conference on Information and Knowledge Management (CIKM), 371–378.
GUSFIELD, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
HOGUE, A. AND KARGER, D. 2005. Thresher: Automating the unwrapping of semantic content from the World Wide Web. In Proceedings of the Fourteenth International World Wide Web Conference (WWW), 86–95.
HSU, C. AND DUNG, M. 1998. Generating finite-state transducers for semi-structured data extraction from the Web. J. Info. Sys., Special Issue on Semistructured Data 23(8), 521–528.
KUSHMERICK, N. 2000a. Wrapper induction: Efficiency and expressiveness. Artif. Intell. 118(1-2), 15–68.
KUSHMERICK, N. 2000b. Wrapper verification. W.W.W. J. 3(2), 79–94.
KUSHMERICK, N. AND GRACE, B. 1998. The wrapper induction environment. In Proceedings of the Workshop on Software Tools for Developing Agents (AAAI), 131–132.
KUSHMERICK, N. AND THOMAS, B. 2002. Adaptive information extraction: Core technologies for information agents. In Intelligent Information Agents R&D in Europe: An AgentLink Perspective, 79–103.
LAM, W., WANG, W., AND YUE, C. W. 2003. Web discovery and filtering based on textual relevance feedback learning. Computational Intell. 19(2), 136–163.
LERMAN, K., MINTON, S., AND KNOBLOCK, C. 2003. Wrapper maintenance: A machine-learning approach. J. Artif. Intell. Res. 18, 149–181.
LIN, W. Y. AND LAM, W. 2000. Learning to extract hierarchical information from semi-structured documents. In Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM), 250–257.
LIU, B., GROSSMAN, R., AND ZHAI, Y. 2003. Mining data records in Web pages. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 601–606.
MUSLEA, I., MINTON, S., AND KNOBLOCK, C. 2000. Selective sampling with redundant views. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI), 621–626.
MUSLEA, I., MINTON, S., AND KNOBLOCK, C. 2001. Hierarchical wrapper induction for semistructured information sources. J. Autonomous Agents and Multi-Agent Systems 4(1-2), 93–114.
RILOFF, E. AND JONES, R. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI), 1044–1049.
SODERLAND, S. 1999. Learning information extraction rules for semi-structured and free text. Mach. Learn. 34(1-3), 233–272.





SRIHARI, R. AND LI, W. 1999. Question answering supported by information extraction. In Proceedings of the Eighth Text REtrieval Conference (TREC-8), 185–196.
TEJADA, S., KNOBLOCK, C., AND MINTON, S. 2001. Learning object identification rules for information integration. Info. Syst. 26(8), 607–635.
TEJADA, S., KNOBLOCK, C., AND MINTON, S. 2002. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 350–359.
VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer.
WANG, J. AND LOCHOVSKY, F. H. 2003. Data extraction and label assignment for Web databases. In Proceedings of the Twelfth International World Wide Web Conference (WWW), 187–196.
WONG, T. L. AND LAM, W. 2002. Adapting information extraction knowledge for unseen Web sites. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM), 506–513.
WONG, T. L. AND LAM, W. 2004a. A probabilistic approach for adapting information extraction wrappers and discovering new attributes. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 257–264.
WONG, T. L. AND LAM, W. 2004b. Text mining from site-invariant and dependent features for information extraction knowledge adaptation. In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM), 45–56.
WONG, T. L. AND LAM, W. 2005. Learning to refine ontology for a new Web site using a Bayesian approach. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM), 298–309.

Received July 2004; revised August 2005, February 2006; accepted July 2006
