Querying Structured Web Resources

Ee-Peng Lim, Cheng-Hai Tan, Boon-Wan Lim, Wee-Keong Ng
Centre for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Nanyang Avenue, Singapore 639798
Email: [email protected] Fax: (65)792-6559 Tel: (65)799-4802

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Digital Libraries 98, Pittsburgh, PA, USA. Copyright ACM 1998 0-89791-965-3/98/6...$5.00

INTRODUCTION

To provide query facilities over Web resources, several web search engines such as Yahoo, Altavista, and Infoseek have been developed. Most prominently, web search engines have been used to index all web pages on the Internet. The queries supported by these search engines are mainly designed to reveal web pages that meet the search criteria specified by the users. In this way, the search engines serve as discovery tools, and the queries they support are known as discovery queries.

On the other hand, a web search engine can also be used to index and support queries over a specific web collection on the Internet or an intranet. Such a web collection is usually well known to the users who query it, and is properly maintained by some organization or individual(s). Queries on such web collections are known as retrieval queries.

When existing web search engines are used to support retrieval queries over well-organized web collections, they often fail to exploit the structuredness of the collections being queried. In these collections, web pages of the same category often share a common intra-document structure. Moreover, well-defined inter-document structures exist as links between web pages. Consider the National Cancer Institute's web site. It contains a directory of web pages (http://rex.nci.nih.gov/PATIENTS/SITES_TYPES.html) for cancer patients and members of the general public who want to know about the different types of cancer. The home page for each cancer type contains links to the web pages describing the symptoms, diagnosis, treatment, and other properties of that cancer type.

For well-organized web collections, it would be useful to give users the means to specify structured queries that exploit both the intra- and inter-document structures. Moreover, this structural information should also be retained in the query results returned. The following example illustrates the need.

Example 1: Tom, who is contemplating quitting smoking, would like to know what types of cancer can be caused by smoking, and to check whether he already has one of them. To collect the required information, Tom has to browse through the web pages of different cancer types and identify their causes and symptoms. In this context, normal web search engines can help only in a limited way, since they can only identify pages that contain keywords such as “smoking”, “symptom” and “cause”. Tom must still go through the result pages to identify their semantics and their inter-relationships.

In our ongoing WebIR (Web Information Retrieval) research, we are looking into how web search engines can be extended to exploit the structuredness of web collections for retrieval queries. Our overall goal is to create a new breed of web search engines that handle retrieval queries involving both intra- and inter-document structures. To achieve this goal, the following issues have to be addressed:

• How do we obtain the structural information about a web collection?
• What is the appropriate query model? In other words, what should the new retrieval queries look like? How should the query results be represented?
• What are the appropriate indexing and query evaluation strategies?
• What should be the ranking formula for the query results?
• What is the appropriate framework to measure the performance of the new search engines?

In the remaining sections of our paper, we will present our approaches to address the first two issues in the WebIR research project.

STRUCTURING WEB DOCUMENTS WITH SEMANTIC TAGS

All web pages are written in HTML, which includes only presentation tags. To capture the intra-document structures of web pages in a well-organized web collection, we propose to add semantic tags to the web pages. Similar to the HTML presentation tags, semantic tags come in pairs: the begin and end tags mark up the portion of web content that carries the meaning identified by the semantic tag name. For example, <symptom> and </symptom> can be used to mark the beginning and end of the portion of a web page describing the symptoms of some cancer type. By allowing nesting among semantic tags, the hierarchical intra-document structure of a web page can be represented. When the entire content of every web page is enclosed by some pair of semantic tags, one effectively classifies the web pages into different categories. Semantic tags can also be used to indicate the meaning of links between web pages. Since existing web browsers ignore semantic tags in a web page, the appearance of the web page remains unchanged in the browsers.

Having discussed the representation of intra- and inter-document structures using semantic tags, one has to address the problem of tagging a web collection. Instead of manually adding semantic tags to existing web pages, one could explore the following alternatives:

• For web collections that are automatically generated by some web authoring and maintenance tools, one could modify the tool to add semantic tags as the web pages are generated [1]. For example, a newspaper web site may have its news web pages generated from news files prepared for the printed newspaper. The web page generation process can be slightly modified to incorporate the semantic tags.
• For web collections that are created manually, one may have to write specialized computer programs to tag the pages. To automatically add semantic tags to a web collection, one has to be familiar with the logical layout of the web pages and be able to identify the correct places to insert the tags, perhaps guided by the HTML presentation tags in the web pages.
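The tagging scheme described above can be illustrated with a small sketch. It is not the WebIR implementation: the tag names (cancer_type, symptom, cause), the sample page, and the helper extract_regions are assumptions made for illustration. The sketch reads a page in which semantic tags are nested around ordinary HTML and collects the text enclosed by each semantic tag, keyed by its nesting path.

```python
from html.parser import HTMLParser

# Hypothetical semantic tag vocabulary; the paper does not fix one.
SEMANTIC_TAGS = {"cancer_type", "symptom", "cause"}

# Hypothetical page: semantic tags wrapped around ordinary presentation tags.
PAGE = """
<cancer_type>
<h1>Lung Cancer</h1>
<symptom><p>Persistent cough and chest pain.</p></symptom>
<cause><a href="lung_cause.html">Causes</a></cause>
</cancer_type>
"""


class SemanticTagExtractor(HTMLParser):
    """Collects the text enclosed by each semantic tag, keyed by nesting path."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open semantic tags, outermost first
        self.regions = {}  # "outer/inner" path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.stack.append(tag)
            self.regions.setdefault("/".join(self.stack), [])

    def handle_endtag(self, tag):
        # Presentation tags such as </p> are ignored here.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Text is attributed to the innermost open semantic tag.
            self.regions["/".join(self.stack)].append(text)


def extract_regions(page):
    parser = SemanticTagExtractor()
    parser.feed(page)
    parser.close()
    return {path: " ".join(parts) for path, parts in parser.regions.items()}


regions = extract_regions(PAGE)
for path, text in regions.items():
    print(path, "->", text)
```

Because the presentation tags are left untouched, the same page still renders normally in a browser, while a program that knows the semantic vocabulary can recover the hierarchical structure.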

Clearly, the second alternative will be difficult to implement, and may require some web mining techniques. In this paper, we do not attempt to address the problem of tagging web pages. We will, however, focus on the idea of structured web queries and their result representation.

STRUCTURED RETRIEVAL QUERIES

Although a web collection may contain very well structured web pages, different web users will possess different levels of knowledge about these structures. For example, for the same web collection about cancer types, a user may only know that there are several categories of web pages: home pages of cancer types and pages that describe symptoms, treatment, side effects, etc. However, he/she may not know that web pages about side effects of cancer treatment are structured according to different cancer treatment methods. Hence, the structured query model to be defined should give users the flexibility to formulate structured queries based on their knowledge about the web collection, as long as their knowledge is consistent with the structure.
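One possible reading of "knowledge consistent with the structure" is sketched below. It is an assumption, not a definition from the WebIR project: a user's partial tag nesting is treated as consistent if its tags appear, in the same order, along some actual nesting path of the collection, so that unknown intermediate levels may be skipped. The tag names and the paths in COLLECTION_PATHS are hypothetical.

```python
# Hypothetical nesting of semantic tags in one collection,
# written as root-to-leaf tag paths.
COLLECTION_PATHS = [
    ["cancer_type", "symptom"],
    ["cancer_type", "treatment", "treatment_method", "side_effect"],
]


def is_subsequence(partial, full):
    """True if the tags of `partial` occur in `full` in the same order."""
    remaining = iter(full)
    # `tag in remaining` advances the iterator, so order is enforced.
    return all(tag in remaining for tag in partial)


def is_consistent(user_path):
    """True if the user's partial nesting fits at least one real path."""
    return any(is_subsequence(user_path, path) for path in COLLECTION_PATHS)


# The user does not know about the treatment_method level, yet the query fits.
print(is_consistent(["cancer_type", "side_effect"]))   # True
# Reversed nesting contradicts the structure.
print(is_consistent(["side_effect", "cancer_type"]))   # False
```

Under this reading, a user who only knows that side effects belong to a cancer type can still pose a structured query, while a query that contradicts the actual nesting is rejected.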

In the WebIR project, we define a structured retrieval query to be a directed acyclic graph, denoted by (q1, q2, ..., qn), where the qi's represent nodes¹. Each node captures the text predicates, link predicates and other complex predicates associated with different portions of the web content enclosed by semantic tags. A text predicate specifies the keyword(s) to be found within the portion of a web page identified by a tag. A link predicate specifies the keyword(s) to be found within an anchor text and possibly a link to another query node. A complex predicate allows text and link predicates to be nested together to form a more complicated predicate for the hierarchical intra-document structure. The BNF specification of the query node is given below:

QNode    := QNodeName = TextPred | QNodeName = TextPred.(Pred,...,Pred)
Pred     := TextPred | LinkPred | TextPred.(Pred,...,Pred)
TextPred := TagName | TagName("Term,...,Term")
LinkPred := HREF("Term,...,Term",QNodeName) | HREF(QNodeName)

¹ The links between nodes have been included in the link predicates associated with the nodes.

Example 2: The query in Example 1 can be formulated as a structured retrieval query denoted by (q1, q2, q3), where:

q1 = cancer_type.(cause.(HREF(NULL,q2)), symptom.(HREF(NULL,q3)))
q2 = html("smoking")
q3 = html.(NULL)

In order to satisfy the above structured web query, a web page tagged by <cancer_type> must contain two links, tagged by <cause> and <symptom>, leading to the web pages containing the cause and symptom information respectively. The term “smoking” has to be found in the web page containing the cause information, since q2 is the node reached through the cause link. In other words, a structured web query result consists of a set of directed acyclic graphs, each containing a set of web pages satisfying the structural and keyword criteria specified in the query. In this way, the inter-document structure is also retained in the query result, allowing the user to interpret the result in a more meaningful manner.

CONCLUSIONS

In this paper, we identified several research issues to be addressed for querying web collections that contain well-organized intra- and inter-document structures. We have introduced a structured retrieval query model that is particularly useful when the users have some knowledge about the structures of web collections, and this knowledge need not be complete. As part of our ongoing WebIR research, we have developed a simple search engine based on the proposed query model and experimented with it on some web collections. A new indexing technique for both intra- and inter-document structures has been developed. We are currently enhancing our search engine with a better query optimization module and experimenting with different ranking techniques. Incidentally, a web page containing semantic tags looks very much like an XML document. XML (Extensible Markup Language) [2] is a standard proposed by the World Wide Web Consortium (W3C) to introduce structure into the web. Although the XML standard incorporates more features into the structures of web pages, we believe that our proposed query model can be easily extended to handle XML documents.

REFERENCES

1. P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and Structured Data in the Web: Going Back and Forth. SIGMOD Record, 26(4):8-15, 1997.
2. World Wide Web Consortium. Extensible Markup Language (XML). W3C: PR-xml-971208, 1997.
