
XRL: A XML-based Query Language for Advanced Services in Digital Libraries

Juan Manuel Pérez, María José Aramburu, and Rafael Berlanga

Universitat Jaume I, Castellón, Spain
{martinej, aramburu, berlanga}@uji.es

Abstract. In this paper we present a new XML-based query language for XML documents, called XRL. This language expresses database conditions concerning the attributes and the structure of documents, as well as Information Retrieval conditions over their contents and their relevance. XRL queries can be stored in an XML repository and manipulated like any other XML document. In order to illustrate the usefulness of this language, we also describe some general guidelines of its current implementation, and present an example application. This application consists of a subscription/notification service for news articles, which are periodically retrieved from a digital library of newspapers according to the preferences of each user.

1. Introduction

New applications of digital libraries store and manipulate large amounts of semistructured data, mainly multimedia documents with textual data, and provide access to them by means of an extensive range of tools. From our point of view, the main purpose of these applications is to provide users with new Web-based services whose development presents a set of common requirements, which can be summarized as follows:

1. When storing and retrieving documents, these services must integrate the usual functions of databases and Information Retrieval (IR) systems with new techniques for the efficient evaluation of trajectories over document structures.
2. They must have a middleware architecture designed to free the servers from executing the specific tasks that are required by the user services and that are too complex to be processed at Web clients.
3. These systems must facilitate query reuse in a uniform way, so that user queries can be stored, retrieved, redefined, interchanged, and executed by applying the same techniques as for documents.

XML is the most immediate response to these requirements, as it is designed to represent and manipulate semi-structured documents. At the same time, it allows the transmission of data between software components distributed across heterogeneous architectures. The storage and retrieval of large amounts of XML documents is supported by current technology (e.g. [1][2]). However, concerning the first and third requirements, some issues still lack a proper solution. In order to solve them we propose a new query language for XML documents with an XML-based syntax, in the way explained in [3]. Furthermore, this language should be able to express database conditions over the attributes and the structure of documents, as well as IR conditions over their contents and relevance. Finally, queries should be stored in an XML repository so that they can be manipulated like any other XML document.

In previous work, we have developed new applications of digital libraries for storing documents with a temporal component. The main novelty of the resulting languages and tools is that they allow both the retrieval of documents by their temporal properties and the historical analysis of past events [4]. Building on these features, and considering the requirements explained above, in this paper we present a new temporal document retrieval language with an XML-based syntax. This language expresses conditions on the attributes and structure of the documents, as well as on their temporal components. Furthermore, it supports a variety of IR operators [11], the ranking of results by combining different relevance criteria (keywords and document structure), and several ways of presenting the query results.

In order to illustrate the usefulness of this language, in this paper we also describe some general guidelines of its implementation, and present an example application. This application consists of a subscription/notification service for news articles, which are periodically retrieved from a digital library of newspapers according to the preferences of each user.

The rest of the paper is organized as follows. Section 2 presents XRL (XML Retrieval Language). In Section 3 we give the implementation guidelines. Section 4 presents the subscription/notification service. In Section 5 some related work is analyzed and, finally, in Section 6 we outline some conclusions and future work.

R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 300–309, 2002. © Springer-Verlag Berlin Heidelberg 2002

2. The XML Retrieval Language (XRL)

In this section we present the syntax of the XML Retrieval Language (XRL). Briefly, an XRL query is an XML document that specifies a set of advanced IR conditions over XML documents. An XRL document consists of two main parts: the definition of the query variables, and the specification of the queries. The following sections describe these two parts in detail by means of examples.

2.1 Variable Definition

The variables of an XRL query are defined by means of define_var tags. In XRL, a variable represents a relevance-ordered list of document components (i.e. XML document sub-trees) that satisfy the given retrieval conditions. These retrieval conditions are specified by means of the attributes of define_var, which are described in turn.

The id attribute identifies the variable so that it can be referenced either from queries or from other variables. The contains attribute specifies a boolean Information Retrieval Expression (IRE) over the textual contents of the document components represented by the variable. For example, the variable v1 in Figure 1 establishes the following condition: the text must contain the word ‘PSG’ or the words ‘Paris Saint Germain’. Besides the contains attribute, the text enclosed by the define_var tag is also treated as an IRE, which retrieves the document components whose contents are most similar to the specified text (see variable v5 in Figure 1).

[Figure 1 listing: the XRL markup was lost in extraction. The only surviving content, apparently the text enclosed by the define_var tag of variable v5, reads: "Paris Saint-Germain FC beat Olympique de Marseille 7-6 on penalties in the last 16 of the French Cup on Sunday, depriving the 1993 UEFA Champions League winners of their last chance to play in Europe next season."]

Fig. 1: Example of variable definition in XRL.

The path attribute specifies, by means of a path expression, the document components that must satisfy the retrieval condition. If this attribute is omitted, the retrieval conditions are evaluated over all the components of the documents. The path attribute can contain alternative path expressions, like those specified in the definition of variable v2. Moreover, it is possible to specify conditions on the order of the document components (see variable v4) and on the tag attribute values (see variable v6). The level attribute establishes which components of the documents will be retrieved by the variable. If omitted, the variable represents the last non-wildcard component specified in the path attribute. Additionally, the attributes d_ini and d_end define a time window over the document publication dates. Two attributes restrict the size of the query variables: the max attribute, which indicates the maximum number of document components to be retrieved, and the min_rel attribute, which states the minimum relevance of the retrieved document components.

One of the most interesting properties of XRL is the redefinition of variables. The redefines attribute allows the definition of a new variable from a previously defined one. For example, in Figure 1 the variable v3 redefines the variable v1. To reference a variable defined in a separate XRL document, we can indicate the URI of that document along with the variable to reuse (see variable v4). When a variable is defined using the redefines attribute, it inherits all the attribute values of the referenced variable, except those redefined in the new variable.
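The inheritance rule of the redefines attribute can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag and attribute names (define_var, id, contains, path, redefines, d_ini, d_end, max) are taken from the text, but the root tag, the attribute values, and the syntax used inside contains and path are hypothetical, since the original Figure 1 listing is not available here. References to variables in other XRL documents (by URI) are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical XRL fragment. The <xrl> root tag, the attribute values and the
# IRE/path syntax are invented for illustration; only the tag and attribute
# names come from the paper.
XRL_DOC = """
<xrl>
  <define_var id="v1" contains="PSG or 'Paris Saint Germain'"
              path="//article/title" d_ini="2002-04-01" d_end="2002-04-30" max="20"/>
  <define_var id="v3" redefines="v1" path="//article/body" max="5"/>
</xrl>
"""

def resolve_variables(xrl_text):
    """Return {variable id: effective attributes} after applying `redefines`."""
    root = ET.fromstring(xrl_text)
    raw = {v.get("id"): dict(v.attrib) for v in root.iter("define_var")}
    resolved = {}

    def resolve(var_id, seen=()):
        if var_id in resolved:
            return resolved[var_id]
        if var_id in seen:
            raise ValueError(f"circular redefinition involving {var_id!r}")
        attrs = dict(raw[var_id])
        parent_id = attrs.pop("redefines", None)
        if parent_id is not None:
            # Start from the parent's effective attributes, then let the
            # attributes given on the new variable override them.
            inherited = dict(resolve(parent_id, seen + (var_id,)))
            inherited.update(attrs)
            attrs = inherited
        resolved[var_id] = attrs
        return attrs

    for var_id in raw:
        resolve(var_id)
    return resolved

variables = resolve_variables(XRL_DOC)
print(variables["v3"])
```

In this sketch v3 keeps the contains, d_ini and d_end values of v1 and overrides only path and max, which is the behaviour the text describes for redefined variables.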



Finally, the XRL language includes the value_of operator ($), whose purpose is to reference attribute values of previously defined variables. For example, v5 takes the value of its d_ini attribute from the d_end attribute of v1, and v6 defines its contains attribute starting from that of v1.

2.2 Query Specification

In XRL, a query can be seen as a combination of variables. Queries are specified by means of query tags (see Figure 2), which indicate the variables involved as well as how to combine their document components to construct the result. As in the define_var tag, the id attribute of a query tag gives an identifier for the query. When specifying a query, the user can select the relevance function used to rank the query results. Specifically, the relevance attribute can take the values contents, structural or none. The first indicates that query results will be ordered according to the relevance given by the keyword frequencies [11]. The second indicates that query results will be ordered by a combination of the term relevance and the structural relevance [5]. In the third case, query results are not ordered at all.

Fig. 2: An example of query specification in XRL.

The type attribute specifies how query results are returned to the user. This attribute can take the following values: x_doc to obtain the whole document components, x_ref to obtain references to the document components, or x_count to obtain the size of the query result. The size of the returned results can be limited by using the min_rel attribute. We can also specify a range within the document components of the result by using the min and max attributes. For example, in Figure 2 query q1 returns the retrieved document components from the 5th to the 12th in order of relevance.

The variables involved in a query are specified inside a query tag by means of var tags. The id attribute of a var tag is used to reference a previously defined variable. It is also possible to reference the result of another query (see the third variable of query q1 in Figure 2). In order to combine the results of several query variables, XRL introduces the set operators union, intersects and difference, which are expressed as nested tags inside the query tag. The semantics of these operators is as usual. However, these operators can change the relevance order of the resulting documents, as shown in Table 1. Finally, the relevance of the results of a variable can also be adjusted by using the weight attribute.

Set operator          Relevance recalculation
union(S1, S2)         The resulting relevance of a document appearing in S1 or in S2 is the maximum of the relevance of this document in the two sets.
intersects(S1, S2)    The resulting relevance of a document appearing in S1 and in S2 is the minimum of the relevance of this document in the two sets.
difference(S1, S2)    No recalculation is needed.

Table 1. Relevance recalculation implied by the set operators.
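The recalculation rules of Table 1 can be illustrated with a short Python sketch. The representation is an assumption made for this example only: each result set is modelled as a dictionary from a document component identifier to a relevance score, and the identifiers and scores are made up. For union, a document present in only one set keeps the relevance it has in that set.

```python
def union(s1, s2):
    """Documents in S1 or S2; relevance is the maximum over the sets containing it."""
    result = dict(s1)
    for doc, rel in s2.items():
        result[doc] = max(rel, result.get(doc, rel))
    return result

def intersects(s1, s2):
    """Documents in both S1 and S2; relevance is the minimum of the two values."""
    return {doc: min(rel, s2[doc]) for doc, rel in s1.items() if doc in s2}

def difference(s1, s2):
    """Documents in S1 but not in S2; relevance is left unchanged."""
    return {doc: rel for doc, rel in s1.items() if doc not in s2}

def ranked(result):
    """Order a result set by decreasing relevance (ties broken by identifier)."""
    return sorted(result.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical result sets of two variables.
s1 = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
s2 = {"d2": 0.8, "d3": 0.2, "d4": 0.5}

print(ranked(union(s1, s2)))       # d2 is promoted by its higher relevance in s2
print(ranked(intersects(s1, s2)))  # d3 is demoted by its lower relevance in s2
print(ranked(difference(s1, s2)))
```

The example shows why the set operators can change the relevance order: d3 precedes d2 in s1, but d2 precedes d3 in both the union and the intersection.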

The redefines attribute of a query tag can be used to reuse the definition of a previously defined query. The example of Figure 2 shows how this attribute is used in the specification of query q2. The parameters given in parentheses indicate the correspondence between the variables of the original query and the new ones.

3. Implementation Guidelines

In this section we briefly describe the main issues of the current implementation of the XRL query processor. This implementation mainly relies on three indexes, namely:

• Scodes [2], which are unique codes, implemented as strings, that represent the trajectories through the XML trees that lead to the terminal elements and attributes. XRL path expressions are efficiently evaluated over scodes by using string-matching operations [2].
• Inverted file, which records for each term a list with the scodes of the XML elements containing it. XRL’s contains expressions are evaluated over this index.
• Time index, which is used to project the document database on a time window.

The current implementation of the XRL query processor uses a relational database with IR capabilities to store the XML documents according to the previous indexes. Specifically, we have defined the following tables:

• repository(id, scode, text), which stores the text of each terminal element identified by the corresponding scode. The inverted file index is defined on this table.
• repository_att(id, scode, type, value), which stores the value of each tag attribute identified by the corresponding scode. The type of the value can be either string, number or date, and it is inferred at insertion time.
• type(element, code), which associates a unique code to each XML element or attribute. This code is used to build the scode of the document elements.
• insertion_time(scode, time), which stores the insertion time of each document component, identified by its scode. Since the insertion of a whole document is usually considered as a transaction, this table only contains the root elements of the XML documents.

[Figure 3 diagram; surviving labels: Client Applications; XML Document; XRL Query (XML); Query Result (XML); HTTP; Servlets: XMLLoader, XRLpreProcessor; Schema; JDBC; Relational DBMS]

Figure 3. XRL query processor architecture

To store and query XML documents in this database, a middleware architecture was designed (see Figure 3). The two components that are visible to the client applications are XMLLoader and XRLpreProcessor, both implemented as Java Servlets. These two components constitute the wrapper of the relational database. They contain an XML parser to translate the XML documents and XRL queries into the database representation. The XRLpreProcessor is in charge of translating XRL queries into a set of simple SQL queries, usually one for each variable involved in the query. These SQL queries can be efficiently evaluated in parallel by using the indexing schema presented above. In this first stage, the XRLpreProcessor uses the Schema component to translate the path expressions into string-matching expressions over scodes. Afterwards, the results of the variable queries are combined by union, intersection or difference operations to construct the final XRL query result.
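The translation of a path expression into a string-matching condition over scodes can be sketched as follows. This is a simplified illustration and not the encoding of [2], which is not described in this paper: here a scode is assumed to be the dot-separated sequence of fixed-width element codes along the path, the id column is assumed to identify the document, the path syntax is limited to / and // steps, and a plain LIKE condition on the text stands in for the inverted file. The element codes, table contents and function names are hypothetical; SQLite is used only to keep the example self-contained.

```python
import re
import sqlite3

# Hypothetical contents of the type(element, code) table.
ELEMENT_CODES = {"newspaper": "01", "article": "02", "title": "03", "body": "04"}

def path_to_like(path, codes):
    """Translate a simple path (steps separated by / or //) into a LIKE pattern."""
    steps = re.findall(r"(//|/)([^/]+)", path)
    # A // step may skip any number of intermediate elements, hence the % wildcard.
    parts = [("%" if sep == "//" else "") + codes[name] for sep, name in steps]
    return ".".join(parts)

def build_demo_db():
    """Create an in-memory repository(id, scode, text) table with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE repository(id INTEGER, scode TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO repository VALUES (?, ?, ?)",
        [
            (1, "01.02.03", "PSG beat Marseille on penalties"),
            (1, "01.02.04", "Paris Saint-Germain FC beat Olympique de Marseille 7-6"),
            (2, "01.02.03", "Stock markets fall"),
        ],
    )
    return conn

def find(conn, path, word):
    """Evaluate one variable: a path condition plus a single-keyword condition."""
    pattern = path_to_like(path, ELEMENT_CODES)
    rows = conn.execute(
        "SELECT id, scode, text FROM repository "
        "WHERE scode LIKE ? AND text LIKE ? ORDER BY id, scode",
        (pattern, f"%{word}%"),
    )
    return rows.fetchall()

conn = build_demo_db()
print(path_to_like("/newspaper//title", ELEMENT_CODES))
print(find(conn, "//title", "PSG"))
```

Each call to find corresponds to one of the simple SQL queries mentioned above, one per variable; their results would then be combined with the set operators of Section 2.2.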

4. An Example Application

The first application we have developed with the XRL query language is a subscription/notification system for news articles. The users of this system can specify their preferences by means of the interface of Figure 4. Firstly, users can describe the topics of interest by means of an Information Retrieval expression. Additionally, they can specify the newspaper section (e.g. sports, economy, etc.) and the article's component where the retrieval conditions must be evaluated (e.g. title, summary, whole contents, etc.). Afterwards, users can also indicate which parts of the articles must be returned (e.g. title, summary, etc.).



To define the frequency of the subscription evaluation, users specify patterns by selecting a weekly (resp. monthly) frequency and the exact days of the week (resp. month) on which they want to be notified.

Fig. 4: Interface for the News Subscription System.

Every day, the whole contents of the published news articles are inserted into the database in XML format and indexed by their structure, topics and temporal attributes. Thus, each user subscription is interpreted by the system as a temporal query over the news database, which must be executed with the specified frequency to retrieve the latest news related to the user preferences.

Figure 5 outlines the architecture proposed to implement this system. Firstly, user requirements are processed by a module that updates the users and queries data files. At the same time, a planning module schedules the execution of each query according to the subscription frequency. Every day, the planner sends the query processor the queries that must be executed. Once the queries are evaluated, their results are passed to the reports generator, so that they can be properly presented to the users.

An important feature of this system is that both queries and query results are represented as XML documents. In this way, queries can be easily stored, retrieved, redefined and executed as many times as needed. Additionally, the format of the documents returned by the query processor is flexible, so they can be processed to generate different presentations adapted to the preferences of the users.



4.1 Operation of the System

The subscriber module of Figure 5 is in charge of executing the processes that display the user interfaces at the client. Furthermore, this module generates the initial XRL query that retrieves the documents relevant to the user subscription, and inserts it into the queries file. Finally, this module creates an object for each subscription that contains its frequency, its XRL query and the last date on which it was evaluated. This object is sent to the planner module, which is in charge of determining the next dates on which each subscription must be evaluated.

[Figure 5 diagram; surviving labels: Client Applications; Subscription of digital news; applets; users; Subscriber; Reports Generator; Query results; queries; Planner; XML; Middleware; Servers; XRL; XRL query processor; Digital Newspapers Database]

Figure 5. Architecture of the News Subscription System.

Thus, every day the planner retrieves the scheduled XRL queries and sends them to the query processor, updating the subscription objects accordingly. As an example, consider the XRL query at the left side of Figure 6. This basic XRL query could be initially associated with a subscription that sends monthly notifications of articles about the “Champions League”. As the user wishes to be notified on the first day of each month, this query must be redefined before being executed in order to adjust its temporal projection. The query at the right side of Figure 6 retrieves all the relevant articles published during April, so it must be executed on the first day of May. In this way, we make use of the facilities for query reuse and redefinition provided by XRL, as explained in Section 2.

Finally, the reports generator receives the XML file with the query results and produces another XML/HTML document ready to be displayed. Depending on their preferences, users are automatically notified by means of an e-mail or a new link to this document in their personal Web pages.
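The temporal redefinition performed before each monthly execution can be sketched in Python. The sketch assumes that the time window is carried by the d_ini and d_end attributes of a variable, that dates are written in ISO format (YYYY-MM-DD), and that the redefined variable is named after the execution date; none of these details is confirmed by the text, and the Figure 6 listing is not available here. The identifier v_base and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

def previous_month_window(run_date):
    """Return (first day, last day) of the calendar month preceding run_date."""
    first_of_this_month = run_date.replace(day=1)
    d_end = first_of_this_month - timedelta(days=1)
    d_ini = d_end.replace(day=1)
    return d_ini, d_end

def redefine_for_run(base_var_id, run_date):
    """Build a define_var element that redefines base_var_id for this execution."""
    d_ini, d_end = previous_month_window(run_date)
    return ET.Element(
        "define_var",
        {
            "id": f"{base_var_id}_{run_date.isoformat()}",  # naming scheme is an assumption
            "redefines": base_var_id,
            "d_ini": d_ini.isoformat(),
            "d_end": d_end.isoformat(),
        },
    )

# A subscription executed on the first day of May covers the articles of April.
variable = redefine_for_run("v_base", date(2002, 5, 1))
print(ET.tostring(variable, encoding="unicode"))
```

Only the time window changes between executions; all other conditions are inherited from the base variable through redefines, which is what makes the stored subscription query reusable.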



Fig. 6: XRL Query for an example subscription.

5. Related Work

In this section we summarise other models and query languages for storing and retrieving XML documents, and we compare these approaches with our work.

As presented in Section 2, our language provides mechanisms for specifying document structure conditions by including a subset of XPath [9]. Other languages, like XQuery [10], are based on this proposal. However, they provide neither Information Retrieval operators nor relevance ranking mechanisms.

In [1], as in our approach, indexing is only performed at the lower/text level of the structure of the documents. Their indexing scheme assigns a GID (General element IDentifier) to each element of the document. GIDs can be used for obtaining the ancestors of each element and for accumulating the weights of the terms for relevance ranking. However, it is not clear how the mapping between the query evaluation method and a path expression is performed, specifically when path expressions include wildcards or the searched elements are at different levels. Under this approach, query response time depends on the retrieval level specified by the user, increasing as this level approaches the root element of the document. They disregard the attributes of tags (present in many XML/SGML documents), the temporal properties of documents, and structural relevance.

The work in [6] proposes an expressive model for efficiently indexing and retrieving structured documents. This work considers different structural hierarchies over the same document, with a special hierarchy for the textual contents of the document. As in our approach, they use two indexes, one for the document structures and another for the textual contents. Similarly, [7] proposes an index that combines an inverted file with two approaches to codifying the document structure. However, they do not take into account any relevance ranking mechanism for the results.

Finally, YAXQL [8] is a very powerful and expressive XML language for querying XML documents. As in our approach, it provides query reuse and separates the variable definitions from the results specification. Nevertheless, it provides neither IR operators nor relevance ranking.



6. Conclusions

In this paper we have presented an XML query language for retrieving XML documents. The main goals of this language are to combine structured document queries with IR queries, enabling relevance ranking as well as query reuse. We have described the guidelines of its implementation, and we have shown how this language can be used in a news subscription system.

The XRL query processor is currently implemented with Java Servlets and a commercial database. In the future we want to migrate this database to a native XML database with a more appropriate indexing scheme. Over this native database we plan to implement the entire XPath recommendation [9] for XRL, and to develop optimization techniques for complex XRL queries that combine a large number of variables. Additionally, we plan to introduce advanced presentations for query results, like the histograms and chronicles presented in [5].

Acknowledgements. This work has been funded by the Bancaixa project with contract number PI.1B2000-14, and the CYCIT project with contract number TIC2000-1568-C03-02.

References

1. D. Shin, H. Jang and H. Jin. “BUS: An Effective Indexing and Retrieval Scheme in Structured Documents”. Digital Libraries '98, pp. 235-243, 1998.
2. R. Berlanga, M. J. Aramburu and S. Garcia. “Efficient Retrieval of Structured Documents from Object-Relational Databases”. DEXA'1999, LNCS 1677, Springer-Verlag, 1999.
3. A. Deutsch, M. F. Fernandez, D. Florescu, A. Y. Levy, D. Maier and D. Suciu. “Querying XML Data”. IEEE Data Engineering Bulletin 22(3), pp. 10-18, 1999.
4. M. J. Aramburu and R. Berlanga. “A Temporal Object-Oriented Model for Digital Libraries of Documents”. Concurrency: Practice and Experience 13(11), John Wiley, 2001.
5. R. Berlanga, J. M. Pérez, M. J. Aramburu and D. Llidó. “Techniques and Tools for the Temporal Analysis of Retrieved Information”. DEXA'2001, LNCS 2113, Springer-Verlag, 2001.
6. G. Navarro and R. Baeza-Yates. “Proximal Nodes: A Model to Query Document Databases by Contents and Structure”. ACM Trans. on Information Systems 15(4), 1997.
7. V. Aguilera, S. Cluet, P. Veltri, D. Vodislav and F. Wattez. “Querying XML Documents in Xyleme”. VLDB Conference, Roma, 2001.
8. G. Moerkotte. “YAXQL: A powerful and web-aware query language supporting query reuse”. Technical Report, University of Mannheim, January 2000.
9. J. Clark and S. DeRose. “XML path language (XPath)”. W3C Working Draft 9, July 1999.
10. “XQuery 1.0: An XML Query Language”. W3C Working Draft, April 2002.
11. R. Baeza-Yates and B. Ribeiro-Neto. “Modern Information Retrieval”. Addison-Wesley, 1999.