Creating Personal Information Assistants for Targeted Navigation and Extraction via a Web Browser

Nikeeta Julasana, Akshat Khandelwal, Anupama Lolage, Prabhdeep Singh, Priyanka Vasudevan, Hasan Davulcu, I.V. Ramakrishnan
Department of Computer Science, SUNY Stony Brook, Stony Brook, NY 11794
[email protected]

ABSTRACT

A Personal Information Assistant (PIA) is a software robot that a user deploys to retrieve targeted data from Web sources that he/she is interested in. A PIA automatically navigates to relevant sites, locates the correct Web pages (which can be reached either directly, by traversing appropriate links, or by filling out HTML forms), and extracts, structures, and organizes the data from these pages into presentation formats specified by the user. In this paper we describe techniques for empowering the end user (who is not necessarily trained in computing) to create PIAs using just a Web browser. Essentially, the user need only highlight in the browser examples of the data of interest in a Web page and the links to be followed and/or forms to be filled to reach this page. From the highlighted examples the system creates a PIA by "learning" navigation and extraction expressions. Runtime deployment of a PIA amounts to navigating and extracting targeted data from a Web site by applying these expressions at the appropriate Web pages in the site. A unique aspect of our learning-based extraction algorithms is their resilience to slight structural variations in a Web page. Finally, we also describe a generic graphical user interface framework via which a user can build a specification for transforming the extracted data into any desired document presentation format, such as WML, VoiceXML, etc.

1. INTRODUCTION

The browsing patterns of users who use the Web regularly tend to become fairly fixed over time. For example, a lay person's daily Web browsing activity may include reading the headline news at www.nytimes.com, a financial planner may read market headlines at www.wsj.com daily, a retail vendor's activity may include finding out the prices of competing products being promoted at competitors' web sites, a life sciences researcher may frequently search for scientific articles at www.pubmed.com, and so on. A characteristic common to all these kinds of browsing activities is their repetitive nature, i.e., the user regularly visits the same site and views the same kind of information (such as headline news or stock prices) provided at the site. Automating these repetitive browsing activities would provide a significant degree of convenience to the user.

One approach towards automating these activities is to build software robots that will automatically navigate to user-specified Web sites (such as www.nytimes.com) and extract the specific content of interest to the user (such as the headline news of the day) on a periodic basis. Of course, one can write scripts to implement such robots. But an interesting question is this: can the end user create such robots without programming? Further, can the extracted content be presented in formats that are amenable to interaction via different modalities such as voice and different devices such as wireless phones and PDAs? In this paper we describe techniques for creating such robots. We call them Personal Information Assistants (PIAs) since they locate and extract specialized, targeted content that is of personal interest to the user. This content may be buried deep within a Web site (e.g., behind form-based interfaces) and hence not directly accessible via traditional keyword-based search engines. PIAs navigate to relevant sites, locate the correct Web pages by traversing appropriate links or filling out HTML forms, and extract, structure, and organize the data from these pages into user-specified presentation formats such as HTML, XML, text, WML, VoiceXML, etc. More importantly, the end user can create and execute PIAs using just a Web browser. Essentially, the user need only highlight in the browser examples of the data of interest in a Web page and the links to be followed and/or forms to be filled to reach this page. From the highlighted examples the system creates a PIA by "learning" navigation and extraction expressions. The paper is organized thus: Section 2 describes the learning algorithm for creating PIAs from highlighted examples. Section 3 discusses the implementation of WinAgent, our system for creating and executing PIAs with a Web browser (for a demo please visit http://www.lmc.cs.sunysb.edu/~winagent). Section 4 describes a generic GUI framework for transforming the extracted raw Web data into appropriate user-specified presentation formats. Experimental evaluation appears in Section 5 and related work in Section 6.

Figure 1

2. CREATING PIAs

2.1 Overview

Our goal is to be able to create PIAs, through a browser, to extract information from pages such as those shown in Figures 1 and 2. The HTML source is parsed into a DOM (Document Object Model) tree. When the user highlights examples in the browser, the PIA builder learns XPath expressions; XPath is the language for addressing elements of XML/HTML DOM trees. For example, in the DOM tree shown in Fig. 3, the XPath expression for the rightmost text() element in the subtree rooted under the second TABLE node is HTML/BODY/TABLE[2]/TR[3]/TD[2]/A/text(). Figure 1 shows a page from Zappos.com containing twelve shoes. The user highlights a small example of the data to be extracted from the page (e.g., the middle shoe in the last row in Fig. 1). Based on this positive example, the PIA builder learns an XPath expression, applies it to the DOM tree of the page, and displays the items and attributes from the page that match the expression. If the matches correspond exactly to all the items of interest (e.g., all 12 shoes in Fig. 1), the correct XPath expression has been learned for this page. If the matches cover more items than desired, the user marks one or more of the erroneously matched items as negative examples; the PIA builder uses these to learn a more discriminating XPath expression. It is also possible that the pattern does not match all of the desired items, in which case the user supplies a sample of the missed items as additional positive examples. The builder is able to learn the correct extraction pattern within a few such iterations. The user can also traverse links, navigate to an inner page, and continue extraction there. All the extractions and traversals of a user are recorded into a navigation map for future playback. On playback, the navigation map is interpreted by the PIA interpreter and the targeted data is extracted.
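To make this apply-and-refine loop concrete, the following is a minimal sketch, in Python with the lxml library, of what "applying a learned XPath expression to a page's DOM" amounts to. It is not part of WinAgent (which runs inside Internet Explorer); the file name is hypothetical, and lxml's HTML parser lower-cases tag and attribute names, whereas the examples in the text use the upper-case names of Internet Explorer's DOM.

from lxml import html

tree = html.parse("zappos_results.html")   # hypothetical saved copy of the page in Fig. 1

# Absolute XPath of the kind shown above, addressing one highlighted element:
highlighted = tree.xpath("/html/body/table[2]/tr[3]/td[2]/a/text()")

# An overly general candidate expression learned from a single positive example:
matches = tree.xpath("//td")
print(len(highlighted), len(matches))      # if //td matches more than the 12 items of
                                           # interest, the user flags negatives and the
                                           # expression is specialized (Section 2.2),
                                           # e.g. to //td[@valign]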

Figure 2

2.2 Learning XPaths

To begin with, the user highlights the smallest enclosing region containing all the items of interest. In Fig. 1 this corresponds to the region enclosing all the shoe items. In the DOM tree this region is identified by the regionLCA, the least common ancestor of all the nodes corresponding to the HTML elements in the region. For example, the regionLCA of the region containing all the shoe items in Fig. 1 is the TABLE node enclosed in the dotted circle in Fig. 3, which is a simplified fragment of the DOM corresponding to the HTML page in Fig. 1. Confining the applicability of the XPath expression to the subtree rooted at the regionLCA node considerably enhances extraction precision. Now observe in Fig. 3 that the HTML nodes corresponding to the attributes of an item are all encapsulated within a subtree rooted at the LCA of these nodes (e.g., the subtree shown enclosed within the big circle in Fig. 3). We denote this LCA node as the itemLCA. In Fig. 3 the itemLCA nodes are the TD nodes labeled in bold font. We next learn two XPath expressions, an isolatorXPath and an attributeXPath; the former identifies all the itemLCA nodes while the latter extracts the attributes of an item. The algorithm operates in two phases. In the first phase we construct the isolatorXPath. We proceed as follows: the user highlights an example of an item within the region (such as the shoe item shown in Fig. 1). We determine its itemLCA and initialize the isolatorXPath to this node. In Fig. 3 the initial isolatorXPath is //TD. Applying this expression to the DOM identifies all the TD nodes in Fig. 3. Observe that this XPath is overly general, resulting in the identification of many erroneous items.

Figure 3: Fragment of the DOM tree of the HTML page shown in Figure 1. The circled subtree represents the example item.

To eliminate all such erroneous items, we specialize our isolatorXPath expression. Our first goal is to eliminate all the nodes that are not inside the subtree rooted at the regionLCA. The specialization works iteratively through a process of scoring, as follows. We begin with the selected item's LCA; this is our starting pivot node. We then pick a combination of attributes and attribute-value pairs of this node that, when incorporated into the isolatorXPath expression, eliminates the maximum number of nodes outside the region. In our example, the TD nodes not in bold font do not have the 'vAlign' attribute (coincidentally, they are all located outside the subtree rooted under the regionLCA node). We therefore specialize our isolatorXPath to //TD[@vAlign], which eliminates all the TD nodes without the 'vAlign' attribute. This specialized XPath may still fail to eliminate all the nodes outside the region; in Fig. 3, the rightmost two TD nodes in the subtree rooted at the third TABLE node are still identified by it. In such a case we repeat the specialization by extending the isolatorXPath expression to include the parent of the current pivot node, viz., //TR/TD[@vAlign]. The parent becomes the new pivot for the next iteration. Observe that as long as the specialization is done with respect to pivot nodes within the region, the nodes in the XPath expression are free of any location-dependent information, namely node indices. This gives our expressions a degree of resilience. However, sometimes we may need to use nodes that are ancestors of the regionLCA as our pivots, and in such cases we introduce indices into the isolatorXPath expression. For example, the isolatorXPath //TR/TD[@vAlign] generated in the 2nd iteration still does not eliminate the rightmost two TD nodes outside the region, so in the 3rd iteration we use the regionLCA as the pivot and specialize the isolatorXPath to //TABLE[2]/TR/TD[@vAlign]. With this expression we identify all the itemLCA nodes within the region and present them to the user. This set may still include erroneous items within the region; these are flagged by the user, and we use them as negative examples to further specialize the isolatorXPath, using the same scoring process, until the erroneous nodes within the region are eliminated.

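A single specialization step of this scoring process can be sketched as follows. This is a simplified illustration in Python with lxml, not the WinAgent code; it tries only attribute and attribute-value predicates on one pivot, and omits moving the pivot up to ancestor nodes and adding indices above the regionLCA.

from lxml import html

def out_of_region(tree, xpath, region_lca):
    """Count nodes matched by `xpath` that lie outside the highlighted region."""
    region = set(region_lca.iter())
    return sum(1 for node in tree.xpath(xpath) if node not in region)

def specialize_once(tree, isolator, pivot, region_lca):
    """Extend `isolator` with the attribute or attribute-value predicate of `pivot`
    that eliminates the most out-of-region matches (simplified scoring)."""
    best, best_out = isolator, out_of_region(tree, isolator, region_lca)
    for name, value in pivot.attrib.items():
        for predicate in (f"[@{name}]", f"[@{name}='{value}']"):
            candidate = isolator + predicate
            out = out_of_region(tree, candidate, region_lca)
            if out < best_out:
                best, best_out = candidate, out
    return best

# Illustrative use, mirroring the running example (lxml lower-cases tag/attribute names):
#   specialize_once(tree, "//td", item_lca, region_lca)  ->  "//td[@valign]"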
In the second phase of our algorithm we generate attributeXPaths to retrieve the attributes of the items of interest. This is a straightforward process in which we generate an XPath expression for each leaf of the subtree rooted at the itemLCA. In Fig. 3 these expressions are //A/text(), //IMG[@src] and //FONT/text(), which, when applied to the itemLCA nodes, retrieve the name, image and price attributes of an individual shoe item.
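Putting the two phases together, extraction on a page amounts to evaluating the isolatorXPath and then evaluating each attributeXPath relative to every matched itemLCA. A minimal sketch, again in Python with lxml; the labels, file name and the relative .// forms of the attribute expressions are illustrative choices, not WinAgent's internal representation.

from lxml import html

# Expressions of the kind learned above (lxml lower-cases HTML tag and attribute names):
ISOLATOR = "//table[2]/tr/td[@valign]"
ATTRIBUTES = {
    "Product_Name":  ".//a/text()",
    "Product_Image": ".//img/@src",
    "Price":         ".//font/text()",
}

tree = html.parse("zappos_results.html")       # hypothetical saved copy of the page in Fig. 1

records = []
for item_lca in tree.xpath(ISOLATOR):          # one match per shoe item
    records.append({label: item_lca.xpath(expr) for label, expr in ATTRIBUTES.items()})

print(len(records), "items extracted")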

3. IMPLEMENTATION

We have developed the WinAgent system for creating and executing PIAs based on the algorithms discussed in Section 2. It consists of a PIA builder through which a user creates a navigation map for a web site. This map encodes information about how to navigate to pages of interest in a Web site and how to extract the data needed from these pages. At run time the PIA map interpreter automatically navigates to the web site, follows the specified links, fills out forms and extracts targeted data from the pages specified in the map. The extracted data is presented in XML.

3.1 WinAgent Interface

The PIA builder is embedded as a tool bar in Internet Explorer (see the horizontal bar beginning with the label "WinAgent" below the URL address box in Fig. 4). Users interact with the system through five VCR-like buttons ("Get Item", "Get Region", "Record", "Stop", "Play") for creating and interpreting navigation maps.

Figure 4: PIA Builder

The PIA creation is initiated when the "Record" button is clicked. From then on, until the "Stop" button is clicked, the user's actions are monitored and recorded in the navigation map. Using the "Get Region" button the user specifies the region of interest by highlighting it on the page. The user uses the "Get Item" button to select an example item of interest. Finally, the "Play" button allows the user to specify a navigation map and launch the PIA interpreter on it.

3.2 Building a PIA: An Illustration

We illustrate how a PIA is built to extract attributes of shoes from www.zappos.com. We begin the PIA creation from a page containing images of the products along with a few attributes including a link (see Fig. 5), which leads to the next level page from which we will extract a more detailed description of each product and some additional attributes (see Fig. 9). The user highlights the region containing all the shoe items in Fig. 5 after clicking the “Get Region” button.

Figure 5: Select the Region

Next the user highlights an example of an item (the shaded item in Fig. 6) by clicking the "Get Item" button. The PIA builder invokes the learning algorithm (described in Section 2) to recognize all the shoe items within the previously selected region and displays all of the extracted attributes to the user, who can now supply the appropriate labels for them using the input text box (see the display panel labeled "WinAgent Results" in Fig. 7).

Figure 6: Selecting a sample item

In the XML output the extracted attributes will have these labels. The user can also drop any attribute from the output by checking the "Drop Column" checkbox. Two or more columns can be merged into one column by giving the columns identical names.

Figure 7: Phase-1 Result for annotation

The user can also specify that an attribute is a link to be traversed by checking the "Follow Link" option, or that it should merely be extracted without traversal using the "Extract Link" option. In our example we check the "Follow Link" option for the "Product_Image" attribute, so that we can follow the link to extract the description and other attributes of that product. In Fig. 7 the product names are split across columns 2 and 3, so we merge the two columns by naming both of them "Product_Name". Attributes that are not needed, such as "Attribute 3", are dropped by checking the "Drop Column" box. At this stage the user is done and clicks the submit button. WinAgent then displays the final result for this page, as shown in Fig. 8. The user navigates to the next level, highlights the description (see Fig. 9) and chooses the text format (Fig. 10) for displaying the extracted result (Fig. 11).

Figure 10: Options for extraction

Figure 11: Extracting the text from a region

The user can repeat the process of selecting regions and items as many times as desired to extract items from different regions on the page. For example, the user can select the region containing the SKU_Number, sizes and width attributes and repeat the process (Fig. 12).

Figure 8: Final WinAgent Result

Figure 12: Selection from multiple regions

Figure 9: Selecting Region and Item on Page 2

Finally the user clicks “Stop” and a navigation map for Zappos gets created and stored in a folder.

3.3 Navigation Map

The navigation map consists of a collection of pagemaps (see Fig. 13). In this figure, PAGE1 and PAGE2 are the pagemaps for extracting the attributes and link data of Figs. 8 and 11 respectively. Each pagemap has one or more 'WA_TABLE' elements, corresponding to the multiple regions on that page from which items are extracted. Each 'WA_TABLE' consists of an isolatorXPath within the 'ISOLATOR' tag and a set of attributeXPaths enclosed within 'MAGNET' tags. The map has three kinds of 'MAGNET' tags, viz., 'TEXT', 'IMAGE' and 'ANCHOR', corresponding to the three attribute types: text, image and link.
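For concreteness, a pagemap with this structure might look roughly as follows. Only the tag names WA_TABLE, ISOLATOR and MAGNET come from the description above; the nesting, the attributes and the concrete XPaths in this sketch are assumptions.

from lxml import etree

# Hypothetical sketch of one pagemap of the navigation map.
PAGEMAP = etree.fromstring("""<PAGE1>
  <WA_TABLE>
    <ISOLATOR>//TABLE[2]/TR/TD[@vAlign]</ISOLATOR>
    <MAGNET type="ANCHOR" name="Product_Image" follow="yes">//A</MAGNET>
    <MAGNET type="TEXT" name="Product_Name">//A/text()</MAGNET>
    <MAGNET type="TEXT" name="Price">//FONT/text()</MAGNET>
  </WA_TABLE>
</PAGE1>""")

# Reading the pieces back, e.g. for an interpreter:
isolator = PAGEMAP.findtext("WA_TABLE/ISOLATOR")
magnets = [(m.get("name"), m.get("type"), m.text) for m in PAGEMAP.iter("MAGNET")]
print(isolator, magnets)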

Figure 14: Launching the PIA using the ’PIA Manager’

Figure 15: XML output of the interpreter

Figure 13: Navigation Map

3.4 Interpreter - Launching the PIA

The PIA interpreter interprets this navigation map by automatically navigating to the web site, following the specified links, filling out forms and extracting the targeted data from the pages as specified in the pagemaps. The PIA can be launched either by using the "Play" button on the toolbar or through the independent PIA Manager (Fig. 14).
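A very simplified sketch of such an interpreter loop is shown below, assuming Python with the requests and lxml libraries. Form filling, error handling and WinAgent's actual pagemap format are omitted; the dictionary fields used here are illustrative.

import requests
from lxml import html
from urllib.parse import urljoin

def interpret(url, pagemaps):
    """Apply pagemaps[0] to the page at `url`; for each extracted item, follow its
    link attribute (if any) and apply the remaining pagemaps to the next-level page."""
    if not pagemaps:
        return []
    page, rest = pagemaps[0], pagemaps[1:]
    tree = html.fromstring(requests.get(url).content)
    records = []
    for item in tree.xpath(page["isolator"]):
        record = {name: item.xpath(xp) for name, xp in page["magnets"].items()}
        follow = page.get("follow_link")    # XPath selecting the href to traverse, if any
        if follow and rest:
            links = item.xpath(follow)
            if links:
                record["details"] = interpret(urljoin(url, links[0]), rest)
        records.append(record)
    return records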

The output of the interpreter is an XML document with attribute names that were supplied by the user when the PIA was created (see Fig. 15).

4. PRESENTATION OF EXTRACTED DATA

Figure 16

The raw data extracted by a PIA is organized as a hierarchically structured XML document. At level one of the output tree there are objects corresponding to the first level of navigation. These objects contain their extracted attributes as children. They also contain level-two objects corresponding to the second level of traversals and extractions, and so on (see Fig. 15). Thus, a sample output looks like:

L1_Obj1
|- L1_Attribute1
|- L1_Attribute2
|- L2_Obj1
   |- L2_Attribute1
   |- L2_Attribute2

Our goal is to transform this nested XML into one of many desired presentation formats, such as comma-separated records, VoiceXML, HTML, WML, etc. Such transformations on XML can be expressed using XSLT stylesheets. We provide a generic GUI framework for rapidly creating such XSLTs, with an interface for users to plug in a specific formatting module for the transformation. The principle underlying these generic transformations is that any XSLT over our output will have a standard structure containing loops, variable declarations and variable uses. Since our output consists of nested objects, there will always be a need to iterate over these objects (possibly in a nested fashion) and output the values of attributes. In addition, there will be a need to insert strings between loops and between attribute outputs (formatting strings). Accordingly, we provide a UI component for selecting iterations over objects (see Fig. 16). Within these generated loops, the user can then select attributes from the current object or any parent object for output. This phase results in the generation of a skeletal XSLT of the form shown in the left tree of Figure 17. All that is left now is to fill in the various placeholders based on the formatting module.

Figure 17

Let us take the example of generating output for loading into a database (comma-separated attributes and newline-separated records). On Zappos' pages the extracted output has the nested structure illustrated above, with Product_Image, Product_Name and Price at the first level and Sizes, Width, Colors and Description at the second level.

Now, if we want a transformation to a comma-separated file, then we want a sequence of records of the form: Product_Image, Product_Name, Price, Sizes, Width, Colors, Description

Such a transformation requires us to create a loop for Object2's nested within Object1's, with all the attributes being written out in this inner loop. Hence, in the UI, the user would choose Object2 for loop generation, creating:

foreach Object1
  foreach Object2

Now, in this loop, the user would choose to print all seven attributes, creating the skeleton:

foreach Object1
  var vProduct_Image := ./Product_Image
  var vProduct_Name  := ./Product_Name
  var vPrice         := ./Price
  foreach Object2
    print ./Sizes
    print ./Width
    print ./Colors
    print ./Description
    print vProduct_Image
    print vProduct_Name
    print vPrice

This is accomplished by simple clicks in the UI shown in Fig. 17. In this skeletal XSLT, additional substitutions can then be performed by a format-specific module. We provide a standard interface for such modules, which can be created by anyone for a particular presentation format. A sample is shown in Fig. 17, where a database conversion module provides options for substitution. The user simply drags the appropriate strings into the placeholders to produce the final XSLT.
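For illustration only, a stylesheet of the shape produced by this process, for the comma-separated format above, might look as follows. The element names mirror the nested output sketched earlier and are assumptions, and lxml is used here merely to run the transformation on a small made-up instance.

from lxml import etree

stylesheet = etree.XSLT(etree.fromstring("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/Output">
    <xsl:for-each select="Object1">
      <xsl:variable name="vImage" select="Product_Image"/>
      <xsl:variable name="vName"  select="Product_Name"/>
      <xsl:variable name="vPrice" select="Price"/>
      <xsl:for-each select="Object2">
        <xsl:value-of select="concat($vImage, ',', $vName, ',', $vPrice, ',',
                                     Sizes, ',', Width, ',', Colors, ',',
                                     Description, '&#10;')"/>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""))

# A tiny, made-up instance of the nested output, just to exercise the stylesheet:
doc = etree.fromstring("""\
<Output>
  <Object1>
    <Product_Image>img123.jpg</Product_Image>
    <Product_Name>Trail Runner</Product_Name>
    <Price>59.95</Price>
    <Object2>
      <Sizes>8-12</Sizes><Width>M</Width><Colors>black</Colors>
      <Description>Lightweight trail shoe</Description>
    </Object2>
  </Object1>
</Output>""")

print(str(stylesheet(doc)))
# -> img123.jpg,Trail Runner,59.95,8-12,M,black,Lightweight trail shoe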

We have also experimented with creating audio-browsable content from extracted Web pages, using a VoiceXML translator module.

5. EXPERIMENTAL EVALUATION

WinAgent has been in operation for the last six months, and we have been experimenting with it extensively. We have created PIAs for several dozen websites. The number of records extracted ranged from 20 to over 600, and the number of pages traversed to reach any such record ranged from 2 to 4. The time taken to create these PIAs ranged from 1 to 4 minutes, and the time to execute them ranged from 0.5 to 20 minutes. The precision and recall of extraction were almost always 100%. Finally, as far as the usability of our system is concerned, we have observed that even naïve users require very little time to create PIAs.

6. RELATED WORK

Using a string representation of HTML pages, [1] describes a learning algorithm for constructing wrappers. Such a representation is sensitive to changes in the order in which attributes are presented. [3, 6] introduce algorithms that can handle such variations but require multiple passes over the documents to extract each type of attribute. The WinAgent system provides a very intuitive and easy-to-use GUI based on the user's regular browsing activity, coupled with novel uses of simple highlighting actions. In particular, the notion of "highlighting the region of interest" in our system allows users to rapidly train a robust "isolator" expression that can accurately learn to identify all item instances from a single highlighted representative example item. This usually eliminates a long sequence of training with negative examples [5] that would require tedious inspection and matching by the user. Our XPath expressions are resilient to changes in how the attributes are ordered. They are robust in the presence of missing or multiply occurring attribute instances and do not require additional training with such items. W4F [8] allows specification of retrieval, extraction and mapping rules. It produces an agent that can be executed as a Java program. It does not support true navigation by extracting and exploring links from within a page. Once a page is retrieved, extraction rules are applied in the form of paths in the document hierarchy. These are displayed for individual pieces of data on a page; however, the user then has to create the rules by observing these "document paths" and generalizing them whenever possible. In XWRAP [4], objects and elements of interest on a page are derived automatically by heuristics. Different pre-set heuristics may be used and can be further fine-tuned by modifying characteristics such as element data types, object sizes and element tag separators. XWRAP's search interface extraction allows users to compose GET and POST requests, while WinAgent has full-fledged navigation capability, including form processing.

7. REFERENCES

[1] N. Ashish and C. Knoblock. Wrapper Generation for Semi-Structured Internet Sources. ACM SIGMOD Record, Vol. 26, No. 4, 1997.
[2] H. Davulcu, J. Freire, M. Kifer, and I.V. Ramakrishnan. A Layered Architecture for Querying Dynamic Web Content. ACM Conference on Management of Data (SIGMOD), Philadelphia, PA, June 1999.
[3] Chun-Nan Hsu and Ming-Tzung Dung. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Information Systems Journal, Vol. 23, No. 8, pp. 521-538, 1998.
[4] Ling Liu, Calton Pu, and Wei Han. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, CA, February 28 - March 3, 2000.
[5] Steven N. Minton and Sorinel I. Ticrea. Trainability: Developing a Responsive Learning System. IJCAI'03 Workshop on Information Integration on the Web, 2003.
[6] Ion Muslea, Steve Minton, and Craig Knoblock. A Hierarchical Approach to Wrapper Induction. Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pp. 190-197, 1999.
[7] M. Perkowitz, R.B. Doorenbos, O. Etzioni, and D.S. Weld. Learning to Understand Information on the Internet: An Example-Based Approach. Journal of Intelligent Information Systems, 8(2):133-153, March 1997.
[8] Arnaud Sahuguet and Fabien Azavant. Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F. International Conference on Very Large Databases (VLDB), 1999.