Distributed Search for Structured Documents

Nikita Schmidt and Ahmed Patel
Computer Networks and Distributed Systems Research Group
Department of Computer Science
University College Dublin
Ireland

13 May 2002

Abstract

The quality of search performed by a search engine can be improved by letting the user specify conditions on the information structure associated with document content. Some centralised search engines have implemented structure-aware search. However, in a distributed heterogeneous search system, structured search is more difficult. The difficulties are associated with the different structural organisation of documents in different databases. This complicates presentation of query forms to the user and the adaptation of queries for propagation to different target systems. Selection of targets for query propagation also needs to be structure-aware. This paper presents an approach taken in the distributed search system being developed by the authors.

Keywords: information structure, hierarchy, distributed search, attribute set

1 Introduction

Driven by the rapidly growing World Wide Web, today’s information retrieval applications strive to provide their users with facilities aimed at improving search quality. One such facility is “structured,” or, perhaps more precisely, structure-aware search.

Increasingly many documents on the Web are structured, that is, composed of parts that are syntactically distinguishable. For instance, most HTML documents contain a title, headings, and normal text, denoted by tags. Some also have keywords, metadata descriptions, scripts, and so on. Bibliographic records held in libraries are very finely structured, containing such fields as author, title, publisher, and many others. XML supports arbitrary document structures and is being increasingly used on the Web. Even e-mail messages are divided into header and body, with the header further sub-divided into individual fields (‘From’, ‘To’, ‘Subject’, ‘Received’, etc.).

The idea of structured search is to allow the user to utilise document structures to improve search quality. A structure-aware engine can perform searches for keywords in user-specified parts of documents, achieving greater search precision. Compared to traditional information retrieval techniques, which try to guess from a handful of keywords what the user wants, structured search allows users to be more precise in their search queries.

Many private search engines on the Web already offer structured search facilities. However, structured search has not yet taken off in larger, heterogeneous search systems. This is due to the multiplicity of different document structures in such systems. The structure of a data record depends on the kind of information the record carries: for example, flight schedule, asset list, book, e-mail message, and so on.
Unlike a small private engine holding only a few controlled record types, a large search engine cannot easily find a universal format for structured queries that would work with its heterogeneous document collection. Distributed search engines (Patel et al., 1999; Craswell, 2000; Waterhouse, 2001; Bowman et al., 1995), which have the potential to provide more scalable and complete search services than their centralised counterparts, also experience difficulties in implementing structured search. The heterogeneity of individual search engines in a distributed system usually implies a variety of document and query structures: a global distributed search system must be able to work with different types of documents and search engines. This raises such issues as query and document adaptation and efficient propagation of structured queries. No universal distributed search system today supports structure-aware search.

This paper describes an approach to combining the distributed search paradigm with the power of structure-aware search. This work builds on the results of international research and development projects, presented in the next section.

2 Background

This work builds on top of a distributed search system for the Internet being developed by the ADSA project (Khoussainov et al., 2001). It grows out of ideas and results from previous projects in which the authors participated, namely EUROPAGATE, DALI, OASIS, UNIVERSE, and PRIDE. These were international projects, funded by the European Commission in the Libraries Programme.

The work described in this paper has been based directly on the results of the PRIDE project (PRIDE project team, 1999), which was successfully completed in August 2000. The project involved 13 partner organisations from Australia, UK, Ireland, Hungary, Germany, and France. PRIDE was an interoperability project developing software to create an international distributed directory of library services and resources. PRIDE experience has demonstrated the viability of using an X.500-style distributed directory for providing one-stop access to dispersed information about library services, resources, and users. An extensive user requirements and scenarios survey, prepared by Macquarie University, Sydney and the London and South Eastern Library Region (LASER), has enabled project partners to build a comprehensive directory schema and a mix of services on top of the standard directory, tuned to the needs of information providers (libraries) and users, in particular in Australia and Europe.

The ADSA project pushes PRIDE results towards the wider world. ADSA is different in two major aspects:
• provision for information placement (advertising) along with information discovery (search);
• search for and advertisement of generic information services (not restricted to the libraries domain), such as World Wide Web search engines and documents.

That means that in essence, ADSA is developing a distributed system of search and advertisement engines for the Internet, enriched with the ability to recognise structural conditions and metadata associated with documents and services. However, unlike PRIDE, ADSA can no longer afford to use static knowledge of the schemata involved, due to the heterogeneous nature of the World Wide Web, which represents most of the Internet-accessible content. The challenge is therefore to become dynamically adaptable to different types of information and services.

The next section gives a brief introduction to the state of the art in distributed search engines and outlines the ADSA architecture in this context. Subsequent sections describe the challenges of structured search in a heterogeneous distributed world, and how they are addressed by this research for further validation on top of a pilot ADSA service.

3 Distributed Search Engines

Distributed search architectures use many search engines to act as one search system. Such a system can be viewed as
• a number of scattered independent document databases with a “search engine-like” interface; and
• a search infrastructure that locates appropriate document databases, distributes search requests and merges results.

Each document database in a distributed search system is, in fact, a small search engine. It may only index local document storage, or it may have its own Web robot or some other information harvester. A request to a document database is typically formulated as search keywords, possibly in a Boolean formula. The search infrastructure may simply distribute all requests to all databases, or it may apply some sort of forward knowledge to route requests to the most promising targets only.

Distributed search systems intend to address scalability and coverage problems inherent to centralised architectures by:
• removing the economic bottleneck on system size (the total cost of the system can be divided among many independent service providers); and
• allowing the providers of privately indexable information to provide their own search service (avoiding the need for access and licensing arrangements with external search service providers). (Khoussainov et al., 2001)

This paper discusses the implementation of structured search facilities in global distributed search systems. The ADSA system, described below, has been chosen to provide a validating platform for the design decisions presented in this paper. These decisions are, therefore, viewed in the context of the ADSA architecture.

3.1 ADSA search system

The goal of the ADSA (Adaptive Distributed Search and Advertising) project is to develop an effective search system for the Web. The general requirements of the ADSA system are that it provide:
• search users with complete, up-to-date, and high quality search services for very large quantities of on-line information;
• content providers with a cost effective means of advertising their content; and
• service providers with low cost access to the global information search and advertising market.

To meet these requirements, the ADSA project is developing a distributed search system that can scale with Internet growth, support distributed search-based document advertising, and provide management support tools to enable efficient network and hardware resource usage.

The ADSA architecture consists of three basic types of inter-operating components: Document Databases, Service Directories, and Clients (Figure 1). Of these, the latter two form the search infrastructure described above. A Search Client interfaces with the end user. For each search request, the Client first queries a Service Directory, which returns a list of the best Document Databases for that request. The request is then sent to each database. The results returned by the databases are presented to the user. In case of document advertising (submission of documents to databases), a similar procedure is undertaken by an Advertisement Client.

4 Structured Search in a Distributed Context

This section discusses typical approaches adopted by search systems to provide structured search capabilities, and request models employed. These models are then considered in the context of distributed systems. Problems associated with these approaches are identified.

Figure 1: ADSA architecture overview and usage scenario. (Diagram: the User sends a request to the Client; the Client sends the request to the Service Directory and receives a list of databases; the Client then sends the request to each Document Database and receives results.)

4.1 Request models

Traditional request models (such as keyword queries used in today’s general purpose search engines) only specify content-based conditions. Queries themselves do not take document structure into account. While information retrieval algorithms might use document structures internally— for example, to compute term weights (Brin and Page, 1998)—there is usually no provision for including structural conditions in the query. A structured request model must allow users to specify structural conditions in their queries. In the simplest structured search example, the user may want to restrict certain keywords to certain parts of documents. Assuming that the document structure is hierarchical or can be represented as such (which is true for most document types, including HTML, XML, PDF, e-mail, Usenet news (Schmidt and Patel, 2001)), a request model should be appropriate for hierarchical, or tree-like, structures. A sophisticated query language (based, for instance, on XPath (W3C, 1999) or query automata (Neven and Schwentick, 1999)) can offer flexibility in locating and targeting particular structure elements, in context-dependent querying, and so on. However, this flexibility comes with significant complexity associated with query formulation on the user’s side, and query adaptation and propagation to different target databases. For that reason, simpler attribute-based queries are traditionally used in information retrieval applications and “private” search engines on the Web. In an attribute-based search query, each keyword is associated with an attribute that determines which part of a document this keyword should belong to. These attribute-annotated keywords can, as usual, be combined into Boolean queries to provide greater flexibility. For example: (Title:VMS OR Title:OpenVMS) AND All:architecture (here, ‘Title’ and ‘All’ are attribute names, whereas ‘VMS’, ‘OpenVMS’, and ‘architecture’ are search keywords). 
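The example query above can be made concrete with a small sketch. The Python fragment below is illustrative only: the nested-tuple query representation, the `matches` function, and the sample documents are assumptions made for this example, and whole-word, case-insensitive matching is just one possible reading of what it means for a keyword to belong to a part of a document.

```python
import re

def words(text):
    """Lower-cased word set of a text (a deliberately crude tokeniser)."""
    return set(re.findall(r"\w+", text.lower()))

def matches(query, doc):
    """Evaluate a Boolean attribute-based query against one document.

    query is either a leaf (attribute, keyword) or a 3-tuple
    (operator, left, right) with operator "AND" or "OR".
    doc maps attribute names to the text of that part of the document.
    """
    if len(query) == 3:
        operator, left, right = query
        if operator == "AND":
            return matches(left, doc) and matches(right, doc)
        return matches(left, doc) or matches(right, doc)
    attribute, keyword = query
    return keyword.lower() in words(doc.get(attribute, ""))

# (Title:VMS OR Title:OpenVMS) AND All:architecture
QUERY = ("AND",
         ("OR", ("Title", "VMS"), ("Title", "OpenVMS")),
         ("All", "architecture"))

vms_doc = {"Title": "OpenVMS internals",
           "All": "OpenVMS internals: a guide to the architecture"}
unix_doc = {"Title": "Unix internals",
            "All": "Unix internals: a guide to the architecture"}
```

With these sample records, `matches(QUERY, vms_doc)` holds while `matches(QUERY, unix_doc)` does not, because the second title contains neither keyword even though both documents mention ‘architecture’ in their full text.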
Information retrieval systems may introduce special attributes which are not associated with parts of document content, but influence the search process itself. For instance, an attribute ‘Substring Search’ with two allowed values, ‘Yes’ and ‘No’, can be used to specify whether keywords should be present in the documents as complete words or as sub-words. In fact, three categories of attributes can be identified:
• Content attributes match structural elements of a document (e.g., Title, Body).
• Metadata attributes are associated with the “envelope information” for a document, such as location (URL), time of last modification, time of entry into database, and so on. Usually, values of metadata attributes are not contained inside documents.
• Special attributes affect the way search is conducted; their values are not associated with documents. Examples may include the ‘Substring Search’ attribute mentioned above and an ‘Example’ attribute which gives a document URL to find more documents “like this one.”

An example of an information retrieval protocol that does structure-aware search is Z39.50 (ANSI/NISO, 1995). This protocol is primarily employed to provide access to library catalogues worldwide. Although most catalogue record formats are fixed two-level hierarchies, Z39.50 provides for unlimited recursively hierarchical document structures. Search queries are attribute-based (simple or Boolean). The standard defines a few attribute sets that can be used with queries. These sets contain both content-related and special attributes. The database back-end of a Z39.50 server is responsible for implementing the mapping between “flat” attribute-based queries and hierarchical documents in the database.

4.2 Attribute sets

Attribute-based request models, which specify resources (e.g., documents) by associating search attributes with keywords, are by far the most widely used structured search models. The set of attributes supported by a system is largely dependent on the application (type of data). As an example, Table 1 lists actual attribute sets used in four search systems. The first three are Web search engines found on the following Web sites: the Internet Movie Database (IMDB, www.imdb.com); the Macquarie University Online Directory (www.directory.mq.edu.au); and a car sales search system (carsales.com.au). The fourth column shows an excerpt from the Bib-1 attribute set, employed by libraries worldwide using the Z39.50 Information Retrieval protocol (ANSI/NISO, 1995).

Table 1: Example attribute sets

IMDB          Directory              Car sales             Bib-1 ‘Use’
All           All                    Make                  Personal name
Titles        Person                 Model                 Corporate name
My Movies     Person (Exact Word)    Body Type             Conference name
People        Unit                   Lifestyle             Title
Characters    Service                Price Range           Title series
Quotes        Phone Number           Keywords              Title uniform
Bios          Job Title              plus additionally:    ISBN
Plots         Room                   Doors                 ISSN
                                     Seats                 Abstract
                                     Fuel                  Author
                                     Engine Capacity       Any
                                     Transmission          etc. (99 attributes in total)

Most systems provide an all-encompassing attribute, usually called ‘All’ or ‘Any’, which corresponds to the union of all other content attributes. Search by this attribute is the same as the traditional unstructured search. Apart from that, as can be seen from the table, attributes vary greatly between different systems.

4.3 Problems with distributed search

In a distributed Internet search engine that intends to cover Web servers as well as privately held information databases, structural diversity is so high that a query language more sophisticated than the “flat” attribute-based model described above would make query adaptation very difficult and would be a significant entry obstacle for the implementation of the architecture. Thus, in this paper we only consider the attribute-based request model.

Distributed Z39.50 search systems have been developed (Pettman and Ward, 1997). Standardisation of attribute sets in Z39.50 made it relatively easy to implement such systems, because little or no attribute set or request conversion is needed. Thus, the standardisation of query formats and attribute sets allows databases of differently structured documents to be efficiently queried in a distributed search operation.

However, in a general purpose distributed Internet search system there is a large variety of incompatible attribute sets. In many cases the difference between attribute sets naturally follows from the diversity of application domains (e.g., cars and books). This makes it impractical to standardise on an attribute set: different types of data (bibliographical, geospatial, historical, engineering, etc.) require different query attribute sets that are inherently semantically diverse. A generic attribute set such as Dublin Core (footnote 1) provides a good common ground in that many applications can efficiently use some of those attributes. However, domain specific attributes (e.g., ‘Engine Capacity’), not covered by generic sets, will also need to be supported.

This identifies the following issues related to structure-aware distributed search:
• If the system is to scale, it must support arbitrary attribute sets.
• The user must be able to use the attribute set that matches the application domain of his or her request, and the system must know how to route requests to find the best results.
• The system should be able to provide unstructured search, for example through the use of attributes that map to entire document text, such as the ‘Any’ or ‘All’ attributes described above.
• Request and attribute set adaptation might be needed in order to achieve the best coverage for a given request. A request directed at a certain domain may have relevant results in other, similar domains, with different (albeit similar) attribute sets. Moreover, different information providers in the same domain may also have different attribute sets.
• The user should know which attributes and/or attribute sets to use.

The following sections describe how these issues are tackled by the ADSA system.

5 The ADSA Request Model

An ADSA query builds upon attribute-value pairs. The values of attributes are typically content keywords; the attribute names determine the context in which these keywords should be evaluated. Currently, a query consists of a list of attribute-value pairs in the AltaVista-like style, where each pair is designated as optional, mandatory, or negative (footnote 2).

The architecture does not define any topic- or domain-specific attributes. The only rules regarding attributes are that:
• attribute names are strings of printable characters;
• attribute name comparisons are case-insensitive;
• attributes with the same name have the same semantics.

This is done in the hope that database administrators will select attribute names that best represent their meaning, and that semantically similar elements in different databases will get the same attribute name.

The ADSA request model defines a few reserved attribute names, which are marked below as CA for content attributes, MA for metadata attributes, and SA for special attributes:
• All (CA): a union of all content attributes. Values of the ‘All’ attribute (that is, keywords qualified with it) are searched for in all document text, thereby providing the traditional unstructured search facility. All document databases are required to support this attribute.
• Location (MA): document URL.
• Time Inserted (MA): the time the document was inserted into the database, expressed as a decimal number of seconds elapsed since 1 January 2001. When this attribute is used in a search query, it selects documents inserted at or after the time specified. Its negation selects the documents inserted before that time, which allows time intervals to be specified. This attribute is useful for implementing “standing queries,” whereby a request is repeatedly (say, daily) submitted to a database to discover newly appeared documents matching certain criteria.
• Example (SA): URL of a document example. This attribute selects similar documents in the database.

Thus, the attribute-based structured search model is employed beyond its original purpose to provide flexible architectural support for search engine features (search by location or example, or standing queries), which would otherwise require special treatment. New features can simply be added by introducing new attributes.

1 http://dublincore.org/
2 http://au.altavista.com/help/tips?t=20
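As an illustration of the request model, the sketch below builds a standing query that uses ‘Time Inserted’ twice, once as a mandatory pair and once negated, to select a one-day interval. It is not the ADSA implementation: the `(modality, attribute, value)` triples, the function names, the record layout, and the choice of midnight UTC as the starting instant of 1 January 2001 are all assumptions of this example. Optional pairs would normally influence ranking, which is not modelled here.

```python
from datetime import datetime, timezone

# Assumption: the epoch is taken to be midnight UTC on 1 January 2001.
ADSA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def time_inserted(moment):
    """Seconds elapsed since 1 January 2001, as a decimal string."""
    return str(int((moment - ADSA_EPOCH).total_seconds()))

def pair_holds(attribute, value, record):
    """True if one attribute-value pair is satisfied by a document record."""
    name = attribute.lower()          # attribute names are case-insensitive
    if name == "time inserted":
        return record["time inserted"] >= float(value)   # at or after
    return value.lower() in record.get(name, "").lower().split()

def accepts(query, record):
    """query: list of (modality, attribute, value) triples.

    modality is "+" (mandatory), "-" (negative) or "" (optional).
    """
    for modality, attribute, value in query:
        holds = pair_holds(attribute, value, record)
        if modality == "+" and not holds:
            return False              # a mandatory pair is missing
        if modality == "-" and holds:
            return False              # a negative pair is present
    return True                       # optional pairs never reject here

day_start = time_inserted(datetime(2002, 5, 1, tzinfo=timezone.utc))
day_end = time_inserted(datetime(2002, 5, 2, tzinfo=timezone.utc))
standing_query = [("+", "All", "architecture"),
                  ("+", "Time Inserted", day_start),
                  ("-", "Time Inserted", day_end)]
```

A record whose insertion time falls between `day_start` and `day_end` and whose text contains ‘architecture’ is accepted; a record inserted on or after `day_end` is rejected by the negated pair.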

6 Document Database

A Document Database responds to search requests by supplying links to the matching documents. Requests follow the attribute-value model described in Section 5. Typically, a database would allow insertion or removal of document records. New documents can be added externally by sending advertisement requests and/or internally by a robot. Each database can be independently owned and maintained. In ADSA, databases are usually topic-specific, where the topic is set up by the database administrator. Databases make themselves known to the world by means of their Service Descriptions. These descriptions include topic and other parameters important for efficient query routing. Figure 2 illustrates an ADSA document database, showing what happens during document search and database population. This is explained below in detail.

Figure 2: Document database: structure-aware search and population. (Diagram: a Client requests the attribute set from the Service description, e.g. All, Title, ..., Links; Example, Location, and sends attribute-based queries that are answered with search results from the Document Index. A Document passes through the Document parser, guided by the Structure specification, to produce a Document Digest that maps All to full text, Title to title, ..., Links to anchors etc.)

6.1 Service description

Each Document Database in ADSA has an attribute set that enumerates the names of the attributes that the database supports. This attribute set, which includes content, metadata, and special attributes, is a part of the database’s Service Description. The smallest possible attribute set consists of the ‘All’ attribute. The attributes supported by a database are known to others by their name only. There are no value types (number, date, etc.) or attribute descriptions in attribute sets. Attribute sets help ADSA Clients and Directories select databases and adapt search queries. They can also be used by Clients to generate search forms tailored to particular databases. This will be described in more detail in Sections 7 and 8.

6.2 Document parser

An ADSA database is populated with documents through a document parser. The document parser processes documents that need to be indexed and supplies their content and structure to the database’s indexer. The parser is provided with a structure specification, which describes how to convert the structure of the source document to the structure of the target database.

6.2.1 Structure specification

The ADSA database structure is 2-level, whereby each document is represented as a set of attribute-value pairs. The structure of source documents is assumed to be tree-like (hierarchical), where documents consist of tagged text elements with an unlimited level of tagging. This matches both HTML and PDF document organisation, as well as that of many other types of documents. The task of the Document parser is therefore to extract words from the source documents, preserving their order and proximity information, and to label each word with a set of target database document attributes. These attributes are assigned according to the structure specification associated with the database.

A structure specification is a set of pairs of the form attribute-name: tag-path-pattern, one pair per database attribute. A tag-path-pattern is an expression that consists of tags, combined using three binary operations: concatenation, conjunction, and disjunction. A tag is a word representing a single tag of the source document. Tag names and their semantics depend on the document type. Examples of HTML tags are ‘title’, ‘body’, ‘a’, whereas MARC (ISO, 1996) tags are 3-digit numbers (e.g., ‘245’, ‘100’) and single letters or digits (e.g., ‘f’). Consequently, document types with different tag sets require different structure specifications.

A tag selects document elements which contain the given tag in their tag path, that is, the sequence of tags between the document root and the element. Concatenation of two tag path patterns selects elements whose tag paths contain the first pattern and then the second one, starting from the root. In the document, the two matching patterns may have an arbitrary number of tags between them. Conjunction and disjunction of two tag path patterns select the intersection and union, respectively, of the elements that match the patterns. For example, the pattern ‘html body (em | i)’ selects all emphasised portions of an HTML document’s body.
An external representation of a structure specification utilises the following syntax of tag path patterns:

tag-path ::= term | tag-path ‘|’ term
term     ::= sequence | term ‘&’ sequence
sequence ::= primary | sequence primary
primary  ::= tag | ‘(’ tag-path ‘)’
tag      ::= sequence of printable ASCII characters, excluding space, ‘&’, ‘|’, ‘(’, ‘)’

Spaces and horizontal tabulation characters between components of tag path patterns are ignored. They are useful as tag name delimiters. HTML tag attributes are specified as if they were subordinate tags with the symbol ‘@’ prepended to their name. For instance, the pattern ‘a @href’ selects HTML anchor links.
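The grammar and matching rules above can be sketched as a small recursive-descent parser and matcher. This is a minimal illustration rather than the ADSA parser: the function names, the tuple-based syntax tree, and the sample specification `SPEC` are hypothetical, tag names are compared exactly, and the treatment of a conjunction nested inside a concatenation (continuing from the later of the two match ends) is an assumption, since the text only defines conjunction at the level of whole elements.

```python
import re

TOKEN = re.compile(r"[()&|]|[^\s()&|]+")   # operators, or a run of tag characters

def parse(pattern):
    """Parse a tag path pattern into a nested tuple tree."""
    tokens = TOKEN.findall(pattern)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def primary():                          # primary ::= tag | '(' tag-path ')'
        nonlocal pos
        tok = peek()
        if tok is None or tok in ("|", "&", ")"):
            raise ValueError(f"tag or '(' expected in {pattern!r}")
        pos += 1
        if tok != "(":
            return ("tag", tok)
        node = tag_path()
        if peek() != ")":
            raise ValueError(f"')' expected in {pattern!r}")
        pos += 1
        return node

    def sequence():                         # sequence ::= primary | sequence primary
        node = primary()
        while peek() not in (None, "|", "&", ")"):
            node = ("seq", node, primary())
        return node

    def binary(operator, kind, operand):    # left-associative operator chain
        nonlocal pos
        node = operand()
        while peek() == operator:
            pos += 1
            node = (kind, node, operand())
        return node

    def term():                             # term ::= sequence | term '&' sequence
        return binary("&", "and", sequence)

    def tag_path():                         # tag-path ::= term | tag-path '|' term
        return binary("|", "or", term)

    tree = tag_path()
    if pos != len(tokens):
        raise ValueError(f"unexpected {tokens[pos]!r} in {pattern!r}")
    return tree

def ends(node, path, start):
    """Indices just past each place where node can match at or after start."""
    if node[0] == "tag":
        return {i + 1 for i in range(start, len(path)) if path[i] == node[1]}
    kind, left, right = node
    if kind == "seq":                       # first pattern, then the second
        return {e for mid in ends(left, path, start) for e in ends(right, path, mid)}
    a, b = ends(left, path, start), ends(right, path, start)
    if kind == "or":
        return a | b
    return {max(x, y) for x in a for y in b}   # "and": both must match (assumed)

def selects(pattern, path):
    """True if the element with this root-to-element tag path is selected."""
    return bool(ends(parse(pattern), path, 0))

def labels(spec, path):
    """Attributes that a structure specification assigns to a tag path."""
    return {name for name, pattern in spec.items() if selects(pattern, path)}

# Hypothetical structure specification for HTML documents.
SPEC = {"All": "html", "Title": "html head title", "Links": "a @href"}
```

For instance, `selects("html body (em | i)", ["html", "body", "p", "em"])` holds because intervening tags are allowed between the matched parts, and `labels(SPEC, ["html", "head", "title"])` yields the ‘All’ and ‘Title’ attributes.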

6.3 Document digest

The document digest is the output of the document parser. It is supplied to the indexer when a new document is being placed in the database. A digest is split into sections, one section per content attribute. Each section (attribute digest) contains a list of phrases for the appropriate attribute, and their positions. Phrases are continuous sequences of words. Document digests therefore preserve information that is needed to perform phrase searches, if they are implemented in the database. The position of a phrase is the offset of that phrase in the full document text (that is, in the digest of the ‘All’ attribute). Positions are measured in characters. Position information facilitates proximity searches and other proximity-based algorithms (e.g., topic-specific Web crawling (O’Meara and Patel, 2001)). Although typically attribute digests are contained in the ‘All’ attribute digest, they do not have to be. For example, HTML anchor links (“hrefs”) do not belong to the ‘All’ attribute, because they are not present in the plain text. Positions for such attributes are recorded according to their projection onto the plain text of the document. HTML anchor links would, for instance, share their positions with their corresponding anchor phrases (text under the tag).
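The layout of a digest can be pictured with a short sketch. Everything here is an assumption made for illustration: the input format (phrases in document order, each with its content attributes and any values that are not part of the plain text), the single space assumed between consecutive phrases in the full text, and the representation of each attribute digest as a list of (position, phrase) pairs.

```python
def build_digest(elements):
    """Build a digest: attribute name -> list of (position, phrase).

    elements is a list of (phrase, attributes, hidden) in document order.
    attributes lists the content attributes of the phrase other than 'All';
    hidden maps an attribute name to a value that is absent from the plain
    text (such as an href) and is recorded at the position of its phrase.
    """
    digest = {"All": []}
    offset = 0                              # character offset in the full text
    for phrase, attributes, hidden in elements:
        for name in ["All", *attributes]:
            digest.setdefault(name, []).append((offset, phrase))
        for name, value in hidden.items():
            digest.setdefault(name, []).append((offset, value))
        offset += len(phrase) + 1           # +1 for the assumed separating space
    return digest

sample = build_digest([
    ("OpenVMS Architecture", ["Title"], {}),
    ("See the", [], {}),
    ("user guide", [], {"Links": "http://example.org/guide"}),
])
```

In `sample`, the ‘Links’ entry shares position 29 with its anchor phrase ‘user guide’, which is the projection onto the plain text described above.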

7 Service Directory

Service Directories are queried by Search Clients to find the best Document Databases for each incoming request. A directory is presented with the same request that is later sent to databases. To make database selection efficient, databases in the system are assumed to be topic-specific. Two criteria guide the directory’s decision:
• similarity of topics; and
• similarity of attribute sets.

Topical similarity is determined by inferring the topic from the query keywords sent in the search request and from the service descriptions downloaded from various ADSA databases. There are well-known database selection algorithms based on topical similarity, such as GlOSS (Glossary-of-Servers Server) (Gravano and Garcia-Molina, 1995; Gravano and Garcia-Molina, 1999), CORI (Collection Retrieval Inference Network) (Callan et al., 1995), and CVV (Yuwono and Lee, 1996). Database selection methods in a Web environment are investigated in (Craswell et al., 2000). The ADSA architecture does not impose any particular algorithms to perform topic similarity computations.

Attribute set similarity is determined in a straightforward way by measuring the intersection of the set of attribute names used in the query with the attribute set of each database. A high similarity score increases confidence that the database has documents of the type the user is looking for. Also, the closer the attribute sets are, the better the search precision that can be achieved.

A service directory simply indexes all known databases by their content words and by attribute names (see Figure 3), which makes initial database pre-selection quick. Databases are then ranked and returned to the Client. This is the same technique that is used for document selection and ranking in most information retrieval systems. Note that the details of which attribute is used with which query keyword are not used by the Directory. Selection is done independently by keywords and by attribute names.
In a sense, databases are evaluated along two orthogonal axes: topic and structure. This decreases significantly the complexity of the Directory and the sizes of database descriptions, compared to having full structure-aware content knowledge for each database in the Directory. It is believed that the latter approach will not bring worthwhile benefits, if any at all, because different document elements in a topic-oriented document collection are likely to have very similar topics, providing no basis for discrimination.
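The two-axis evaluation can be sketched as follows. Only the use of the attribute-name intersection comes from the text; normalising it by the number of query attributes, the keyword-overlap stand-in for topical similarity (a real Directory might use GlOSS or CORI instead), and combining the two scores by multiplication are assumptions of this example, as are all names and the sample databases.

```python
def attribute_similarity(query_attributes, database_attributes):
    """Share of the query's attribute names that the database supports."""
    wanted = {a.lower() for a in query_attributes}
    offered = {a.lower() for a in database_attributes}
    return len(wanted & offered) / len(wanted) if wanted else 0.0

def topic_similarity(keywords, content_words):
    """Stand-in topical score: share of query keywords in the description."""
    wanted = {k.lower() for k in keywords}
    known = {w.lower() for w in content_words}
    return len(wanted & known) / len(wanted) if wanted else 0.0

def rank_databases(query, databases):
    """Rank databases for a query given as a list of (attribute, keyword).

    databases maps a name to (attribute set, content words). Keywords and
    attribute names are scored independently, as two orthogonal axes.
    """
    attributes = [attribute for attribute, _ in query]
    keywords = [keyword for _, keyword in query]
    scored = []
    for name, (attribute_set, content_words) in databases.items():
        score = (topic_similarity(keywords, content_words)
                 * attribute_similarity(attributes, attribute_set))
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties are broken by name for a stable order.
    return [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]

DATABASES = {
    "movies": ({"All", "Title", "People"}, {"film", "actor", "director"}),
    "cars": ({"All", "Make", "Model"}, {"engine", "sedan", "diesel"}),
}
```

A query such as `[("Title", "film"), ("All", "director")]` is routed to the hypothetical ‘movies’ database only, because the ‘cars’ description shares no keywords with it.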

Figure 3: Service directory. (Diagram: Document Databases supply their attribute set and content description to the Service (Database) Index of the Service Directory; a Client sends an attribute-based query to the Service Directory and receives a list of databases.)

8 Search Client

A Search Client is a front-end for the ADSA system. In the basic scenario, for each search request, a Client
1. accepts the user’s search query,
2. submits the query to a Service Directory and receives a ranked list of Document Databases,
3. adapts and propagates the query to the Databases,
4. retrieves and merges search results, and
5. presents the results to the user.

The Client may allow the user to intervene in the Database selection in step 3 above. The user may also want to select Databases manually before submitting the query, in which case step 2 is skipped.

The Client has two responsibilities with respect to structured search:
• presentation of attribute sets (query forms) to the user; and
• adaptation of the user’s query to the attribute sets of target databases.

The ability of the system to perform structured search has no effect on result merging. Search results simply contain references to and relevant excerpts from the documents found. They are independent of the structure or format of the documents themselves.

8.1 Presentation of attributes to the user

The user needs to know attribute names in order to formulate a structured query. However, attributes are not known until target databases have been selected, and selection of target databases cannot be done without a search query. Four scenarios are envisaged here.
• The query is submitted in two steps. Initially, some generic attribute set (or simply the ‘All’ attribute) is used for the query. This query is sent to the Service Directory. After the target databases are known, the client retrieves their attribute sets, merges them, and presents them to the user. The user then re-formulates the query.
• The user may want to target the query to some database(s) already known to the user. In this case, the Service Directory is not queried, and the target attribute set can be obtained before formulating the query. A Client may be pre-programmed to offer some databases and their attribute sets straight away (for example, a Client operated by an organisation may want to offer direct access to the organisation’s local databases).
• The Client can offer a selection of generic attribute sets to its users. A generic attribute set contains popular attributes pertaining to a particular information domain (e.g., movies). All clients should offer a simple set consisting of just one attribute, ‘All’, to facilitate simple unstructured searches. A query formulated in a generic attribute set is then adapted to each target by the Client.
• The user can simply enter desired attribute names instead of or in addition to selecting attributes from a list provided by the Client. These user-provided attributes may have been obtained by the user from previous searches or outside sources, or may simply be guesswork.

When the attribute set is known, the client may use it to present the user with a table-like search form that contains an input field for each attribute. Such forms are frequently used in existing structure-aware search engines.

8.2

Query adaptation

Query adaptation in ADSA is only performed to adjust attribute names according to the sets supported by target databases. No other changes are needed, because all databases use the same query format (at least currently). The following simple rules are used:

1. If a query attribute is listed in the target attribute set, it is passed unchanged.
2. Otherwise, if it is a well-known metadata or special attribute, it is removed from the query (although it may still be processed by the Client).
3. Otherwise, the attribute is “promoted” by changing its name to ‘All’.

The last rule may need to be adjusted, though. The action performed on an unsupported content attribute should generally aim at increasing search quality (primarily precision). If the attribute is mandatory (e.g., combined with AND) and not negated, then discarding it and its value will lower precision compared with attribute promotion (changing to ‘All’). However, promoting a negated attribute may cause good documents to be excluded, thus hurting recall. On the other hand, discarding a negated attribute could decrease precision. Finding a good balance here is a topic for further research. Alternatively, the Client may choose to simply pass unsupported attributes on to the databases, in which case the decision on how to deal with them is left to the database implementations.
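The three renaming rules can be sketched as follows. This is an illustration under simplifying assumptions: a query is modelled as a flat list of (attribute, value) pairs, ignoring Boolean operators and negation, and the contents of `WELL_KNOWN` are hypothetical placeholders for the well-known metadata and special attributes.

```python
ALL = "All"

# Hypothetical stand-ins for well-known metadata/special attribute names
WELL_KNOWN = frozenset({"Language", "Date"})


def adapt_query(query, target_attributes, well_known=WELL_KNOWN):
    """Rename or drop query attributes to fit one target's attribute set.

    query is a list of (attribute, value) pairs (a simplified query model).
    """
    adapted = []
    for attribute, value in query:
        if attribute in target_attributes:
            # Rule 1: supported attributes pass through unchanged
            adapted.append((attribute, value))
        elif attribute in well_known:
            # Rule 2: unsupported well-known attributes are removed
            continue
        else:
            # Rule 3: other unsupported attributes are promoted to 'All'
            adapted.append((ALL, value))
    return adapted


query = [("Title", "lava lamp"), ("Director", "Kubrick"), ("Language", "en")]
adapted = adapt_query(query, {"All", "Title"})
print(adapted)
```

A refinement along the lines discussed above would inspect whether a term is negated before choosing between promotion and removal; the flat query model used here cannot express that.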

9

Evaluation

Although the ADSA project is still under development at the time of writing, we have been able to obtain some preliminary estimates of the benefits of the search approach proposed in this paper. We compared results produced by conventional unstructured search with those obtained using structured techniques with the same information retrieval algorithm. This experiment was conducted in a non-distributed environment with the test collection WT10g (http://www.ted.cmis.csiro.au/TRECWeb/) from the TREC-9 conference. This collection contains 1,692,096 Web pages and a number of search topics, each with a list of best matching documents.

Structured search depends on user collaboration in order to deliver on its promises. Therefore, traditional information retrieval metrics (such as recall and precision) are not immediately applicable. For example, human judgement is required to form a structured query from WT10g’s free-text topic specifications. Certain scenarios enabled by structured techniques, such as search for books written by a particular author, are not covered at all by the sample topics that come with the collection. This suggests that a usability or user satisfaction study would be more appropriate, possibly complemented by conventional evaluation techniques. The multitude and heterogeneity of documents in WT10g make it very suitable for such studies.

For this preliminary evaluation, we searched for ten simple phrases using substring matching. Some of these phrases were inspired by actual searches recently conducted by the authors; others were taken from the WT10g topic list. The intent was to discover documents that provide information on the specific subjects defined by those phrases, rather than anything that simply mentions them. In terms of structured search, this intent was formulated as a search in the HTML titles of Web pages: general search experience suggests that if a Web page is devoted to discussing a certain topic, the topic keywords are very likely to appear in the title of the page, and vice versa.

From each Web page of the WT10g collection, a simple two-attribute document digest was generated. Attribute ‘All’ contained all plain text of the page, and attribute ‘Title’ received only the text inside the <title> tag. Since a title is a nearly universal characteristic of a text document, it can be used as a search attribute in very heterogeneous collections, even beyond the traditional HTML-based World Wide Web. Attribute ‘All’ was used to simulate unstructured search for comparison purposes.

In the collection, 1,608,227 Web pages have titles, which represents 95% of the total number of documents. This suggests that although search in ‘Title’ may exhibit loss of recall due to non-discovery of relevant pages with no title, this loss would most likely amount to no more than 5%. Table 2 below shows the numbers of documents that contained our sample search phrases, separately for the ‘All’ and ‘Title’ attributes.
The first phrase was compared in uppercase, while the remaining nine queries were case insensitive.

Table 2: Comparison of document frequencies between unstructured search and title search

Search Query                      All       Title    Ratio
RS-232 or RS232                  2124          19    0.89%
monoid                             51           2    3.92%
context-free                      101           1    0.99%
structured search                  13           0    0.00%
ASN.1                             185           1    0.54%
Bengal cat                         20           2   10.00%
Parkinson’s disease               591          22    3.72%
Chevrolet trucks                   11           1    9.09%
fasting                          1110          25    2.25%
lava lamp                         140           4    2.86%
Total number of bytes   6,669,890,459  60,295,921    0.90%
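The digest generation and substring counting behind Table 2 can be approximated with the short sketch below. It is not the code used in the experiment: the class `DigestBuilder`, the functions `make_digest` and `document_frequency`, the decision to skip script and style content, and the three sample pages are all illustrative assumptions, and ‘All’ is assumed to include the title text.

```python
from html.parser import HTMLParser


class DigestBuilder(HTMLParser):
    """Collect page text for 'All' and <title> text for 'Title'."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_skipped = False  # inside <script> or <style>
        self.all_parts = []
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("script", "style"):
            self.in_skipped = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("script", "style"):
            self.in_skipped = False

    def handle_data(self, data):
        if self.in_skipped:
            return
        self.all_parts.append(data)
        if self.in_title:
            self.title_parts.append(data)


def make_digest(html):
    """Return a two-attribute digest with whitespace normalised."""
    builder = DigestBuilder()
    builder.feed(html)
    builder.close()
    # Text chunks are joined with a space, so words never merge across tags
    return {
        "All": " ".join(" ".join(builder.all_parts).split()),
        "Title": " ".join(" ".join(builder.title_parts).split()),
    }


def document_frequency(digests, phrase, attribute, case_sensitive=False):
    """Count digests whose attribute contains phrase as a substring."""
    needle = phrase if case_sensitive else phrase.lower()
    count = 0
    for digest in digests:
        text = digest[attribute]
        if not case_sensitive:
            text = text.lower()
        if needle in text:
            count += 1
    return count


# Three made-up pages standing in for the collection
pages = [
    "<html><head><title>Lava Lamp History</title></head>"
    "<body><p>How a lava lamp works.</p></body></html>",
    "<html><head><title>Home decor</title></head>"
    "<body><p>We sell a lava lamp.</p></body></html>",
    "<html><body><p>No title here.</p></body></html>",
]
digests = [make_digest(page) for page in pages]
in_all = document_frequency(digests, "lava lamp", "All")
in_title = document_frequency(digests, "lava lamp", "Title")
print(in_all, in_title)
```

Dividing the ‘Title’ count by the ‘All’ count for a phrase gives the kind of figure reported in the Ratio column.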

Manual inspection of a few randomly selected documents from these result sets revealed that documents found using structured search (‘Title’ attribute) were much more relevant to our search intent. For example, the first of the 19 documents returned by the title search on RS-232 was entitled ‘RS-232, RS-422 AND V.35 INTERFACES’ and contained detailed information on the topic, including connector pinouts and signal descriptions. In contrast, most of the 2124 documents returned by unstructured search on the same topic described various devices that happened to use RS-232 (such as GPS, industrial machinery, communication equipment, etc.).

These preliminary observations support the expectation that structured search techniques can significantly increase search precision with very little loss of recall. More advanced experimentation, including usability studies, in a distributed environment is needed to better quantify these expectations. Human-assisted structured search on the 50 test topics from WT10g can provide a good estimate of the achievable recall and precision. We plan to carry out these experiments when the system is in pilot operation.

10

Discussion

In a heterogeneous distributed search system, the diversity of information sources makes it difficult to conduct searches based on structured conditions. Traditional approaches, such as imposing a common structure on all information sources, fail because of inherent differences in application domains. A global distributed system has to embrace this diversity, adapting itself to different kinds of online information. In designing such a system, there are three major issues to address:

• Request model: the data model for search requests (queries).

• Request adaptation: how queries are translated to match requirements of different target search engines (document databases).

• Request routing and propagation: finding document databases with the biggest potential to retrieve relevant results, according to a given structured request.

The attribute-based request model proposed in this paper is the most widely used model for structured search in homogeneous search engines (such as library systems or Web sites serving content of private databases). This implies that the model is adequate from the user’s point of view. Choosing a different model would therefore escalate entry cost and be more difficult to implement with no clear advantages.

The problems of request adaptation, routing, and propagation are closely related to the choice and adaptation of attribute sets used. Heterogeneity and scalability require that the system support arbitrary attribute sets. The approach taken here is inspired by such technologies as XML, which are discoverable and can easily evolve and adapt to the conditions at hand. The attributes are specified by their name (rather than a unique number), eliminating the need for a central authority that assigns attributes. The reason is that if different components (say, document databases) introduce identically named attributes, they are very likely to have the same semantics and must therefore be treated as the same attribute.
The architecture described in this paper standardises a few attribute names that are required to have pre-defined semantics. One such name is ‘All’, an all-encompassing attribute that provides traditional keyword-based unstructured search. Apart from these few special names, no attributes are enforced by the architecture: each database is free to select the attribute set that best matches its application domain. Commonly used attribute names will attract more requests, so there is an incentive for information providers to base individual attribute sets on established and accepted standards such as Dublin Core (http://dublincore.org/).

Request routing is based on the content keywords extracted from the request, and on the attributes used. Potential targets (document databases) are ranked by a measure derived from content similarity and attribute set similarity. Ranking by each attribute separately would require service directories to maintain several indexes (one per attribute) for each service. Assuming an average attribute set size of around 8, this would give an almost 8-fold increase in index size and search complexity. Given that even the unstructured service indexes used today can grow quite large, and anticipating future growth of attribute sets, detailed indexing seems infeasible.

Evaluation of the approach described in this paper is a subject of further research and analysis of results that will be obtained through a prototype ADSA system.
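To make the idea of ranking targets by a combination of content similarity and attribute set similarity more concrete, the following sketch shows one possible form of such a measure. It is purely illustrative: this section does not give the formula, so the coverage-based attribute similarity, the linear weighting, the parameter `attr_weight`, and the candidate data are all assumptions. The content score is taken as already computed from a single unstructured index per database.

```python
def attribute_similarity(query_attrs, db_attrs):
    """Fraction of query attributes the database supports (assumed measure)."""
    query_attrs = set(query_attrs)
    if not query_attrs:
        return 0.0
    return len(query_attrs & set(db_attrs)) / len(query_attrs)


def rank_targets(query_attrs, candidates, attr_weight=0.5):
    """Order candidate databases by a combined score, best first.

    candidates is an iterable of (name, content_score, attribute_set),
    where content_score lies in [0, 1] and comes from one unstructured
    index per database rather than one index per attribute.
    """
    scored = []
    for name, content_score, db_attrs in candidates:
        score = ((1 - attr_weight) * content_score
                 + attr_weight * attribute_similarity(query_attrs, db_attrs))
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically by name
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical candidates known to a Service Directory
candidates = [
    ("library-db", 0.6, {"All", "Author", "Title"}),
    ("movie-db", 0.7, {"All", "Title", "Director"}),
    ("news-db", 0.9, {"All"}),
]
ranking = rank_targets({"Author", "Title"}, candidates)
print(ranking)
```

In this toy example the database with the best content score ranks last because it supports none of the query attributes, which is the kind of structure-aware behaviour that routing on content keywords alone cannot provide.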


11

Conclusions

The use of structured queries has helped improve search quality in private centralised search engines, where document structure is fixed. Distributed structured search has been implemented for Z39.50 databases, which again employ a fixed, standardised structure (attribute set) for formulating queries. Achieving the same quality improvement in distributed heterogeneous search systems is problematic due to the diversity of document structures. This paper presented an approach to structured search in a distributed system of topic-specific search engines for the Internet. This approach is being implemented and evaluated using the ADSA project’s development environment. The project aims to achieve greater scalability and coverage of the “deep” Web than the existing centralised search engines can provide. The core system architecture presented consists of two main types of components: Document Databases (search engines which find documents placed on the Web), and Service Directories, which find Document Databases. This is based on successful approaches implemented by the project’s predecessors. Because of their orientation towards library applications, previous projects could implement structured search by employing attribute-value based document descriptions, using a common attribute set, such as Bib-1. This is not possible or effective in a heterogeneous Web search system, where one must be prepared to handle arbitrary attribute sets. Service Directories in particular must be able to find document search services (Document Databases) based not only on their content, but also on the attribute sets that they support. This paper showed how these problems are tackled by the ADSA project. The project implements the traditional view of structured queries as composed of attribute-value pairs. A detailed explanation of query formats and attribute usage in queries was given. 
The paper described how the project deals with populating structured document repositories, assisting the user in submitting structured queries, choosing the best document sources, and adapting queries to differently structured document databases. Following encouraging initial results, a pilot ADSA system, when ready, will help us further estimate the value that distributed structured search can offer in the multi-faceted heterogeneous Web domain.

References

ANSI/NISO (1995). ANSI/NISO Z39.50-1995: Information Retrieval (Z39.50): Application Service Definition and Protocol Specification. Z39.50 Maintenance Agency, USA. http://lcweb.loc.gov/z3950/agency/document.html.

Bowman, C. M., Danzig, P. B., Hardy, D. R., Manber, U., and Schwartz, M. F. (1995). “The Harvest information discovery and access system” in Computer Networks and ISDN Systems v.28 n.1-2 p.119–125.

Brin, S. and Page, L. (1998). “The anatomy of a large-scale hypertextual Web search engine” in Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia.

Callan, J. P., Lu, Z., and Croft, W. B. (1995). “Searching distributed collections with inference networks” in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval p.21–28. ACM Press.

Craswell, N., Bailey, P., and Hawking, D. (2000). “Server selection on the World Wide Web” in Proceedings of the Fifth ACM Conference on Digital Libraries p.37–46, San Antonio, TX, USA. ACM Press.

Craswell, N. E. (2000). Methods for Distributed Information Retrieval. PhD thesis, The Australian National University.

Gravano, L. and Garcia-Molina, H. (1995). “Generalizing GlOSS to vector-space databases and broker hierarchies” in Proceedings of the 21st International Conference on Very Large Data Bases (VLDB ’95) p.78–89.

Gravano, L. and Garcia-Molina, H. (1999). “GlOSS: Text-source discovery over the Internet” in ACM Transactions on Database Systems v.24 n.2 p.229–264.

ISO (1996). Information and Documentation — Format for Information Exchange. International Organization for Standardization, Geneva, Switzerland, 3rd edition. International Standard ISO 2709:1996.

Khoussainov, R., O’Meara, T., and Patel, A. (2001). “Independent proprietorship and competition in distributed Web search architectures” in Proceedings of the Seventh IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2001) p.191–199, University of Skövde, Skövde, Sweden. IEEE Computer Society Press, Los Alamitos, California, USA.

Neven, F. and Schwentick, T. (1999). “Query automata” in Proceedings of the 11th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems p.205–214, Philadelphia, USA. ACM Press.

O’Meara, T. and Patel, A. (2001). “A topic-specific Web robot model based on restless bandits” in IEEE Internet Computing v.5 n.2 p.27–35.

Patel, A., Petrosjan, L. A., and Rosenstiel, W., editors (1999). OASIS: Distributed Search System in the Internet. St.Petersburg State University Published Press, St.Petersburg, Russia.

Pettman, I. and Ward, S. (1997). “UNIverse — Global, distributed library services” in Brophy, P., Fisher, S., and Clarke, Z., editors, Libraries without Walls 2. The delivery of library services to distant users. Conference proceedings, Lesvos, Greece. Centre for Research in Library and Information Management (CERLIM), Manchester Metropolitan University, Library Association Publishing, London, 1998.

PRIDE project team (1999). “Oiling the works: the PRIDE project develops an information brokerage service” in Exploit Interactive n.1.

Schmidt, N. and Patel, A. (2001). “Labelled ordered trees as a unified structured data model” in Callaos, N., Izworski, A., and Pineda, J., editors, Proceedings of the 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2001) v.V p.84–89, Orlando, USA. International Institute of Informatics and Systemics.

W3C (1999). XML Path Language (XPath) 1.0. World Wide Web Consortium, 2nd edition. W3C Recommendation.

Waterhouse, S. (2001). “JXTA search: Distributed search for distributed networks.” Technical report, Sun Microsystems, Inc., Palo Alto, CA, USA.

Yuwono, B. and Lee, D. L. (1996). “Search and ranking algorithms for locating resources on the World Wide Web” in Proceedings of the 12th International Conference on Data Engineering p.164–171. IEEE Computer Society.
