An Original Semantics to Keyword Queries for XML Using Structural Patterns

Dimitri Theodoratos and Xiaoying Wu
Department of Computer Science, New Jersey Institute of Technology, USA
{dth,xw43}@njit.edu

Abstract. XML is by now the de facto standard for exporting and exchanging data on the web. The need for querying XML data sources whose structure is not fully known to the user and the need to integrate multiple data sources with different tree structures have motivated recently the suggestion of keyword-based techniques for querying XML documents. The semantics adopted by these approaches aims at restricting the answers to meaningful ones. However, these approaches suffer from low precision, while recent ones with improved precision suffer from low recall. In this paper, we introduce an original approach for assigning semantics to keyword queries for XML documents. We exploit index graphs (a structural summary of data) to extract tree patterns that return meaningful answers. In contrast to previous approaches that operate locally on the data to compute meaningful answers (usually by computing lowest common ancestors), our approach operates globally on index graphs to detect and exploit meaningful tree patterns. We implemented and experimentally evaluated our approach on DBLP-based data sets with irregularities. Its comparison to previous ones shows that it succeeds in finding all the meaningful answers when the others fail (perfect recall). Further, it outperforms approaches with similar recall in excluding meaningless answers (better precision). Since our approach is based on tree-pattern query evaluation, it can be easily implemented on top of an XQuery engine.

R. Kotagiri et al. (Eds.): DASFAA 2007, LNCS 4443, pp. 727–739, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

XML is by now the de facto standard for exporting and exchanging data on the web. XML data are represented in a tree-structured form.¹ Structured query languages for XML are based on the specification of tree patterns to be matched against the data tree. The semistructured nature of XML poses problems when it comes to querying data on the web using query languages based on tree patterns. First, XML data does not have to comply with a schema, and writing a Tree Pattern Query (TPQ) in this context becomes intricate. Second, even if the data complies with some schema, the syntax of a structured language like XQuery is much more complex than a keyword query and is therefore not appropriate for the naive user. Third, a user might not have full knowledge of the schema of the document. Then, formulating a TPQ that retrieves the desired results without being too general can be extremely cumbersome. Finally, data sources usually export data on the web under different structures even if they export the same information. Since elements may be ordered differently in these structures, a single TPQ is not able to retrieve the desired information from all of them.

¹ ID/IDREF links would require the modeling of XML documents as graphs, but we ignore these features here for simplicity.

These issues have been identified early on and attempts have been made to exploit keyword-based techniques used by current search engines on the web for HTML documents. Two main modifications have to be made to these techniques so that they are applicable to XML documents. First, they have to be able to distinguish between values/text (data) and tags/elements (metadata). Second, they have to be able to return fragments of the documents that contain the keywords, as this is appropriate for XML, instead of links to documents. Several approaches suggest keyword-based queries as standalone languages [16,10,5]. Others extend structured query languages for XML (e.g. XQuery) with keyword search capabilities [6,14].

The problem. Keyword queries usually return to the user a large percentage of XML document fragments that are meaningless (that is, the keywords are matched to unrelated parts of the document). To cope with this problem most approaches assign semantics to queries using some variation of the concept of Lowest Common Ancestor (LCA) of a set of nodes in a tree [16,10,5,19]. However, in most practical cases, the information in the XML tree is incomplete (e.g. optional elements/values in the schema of the document are missing), or irregular (e.g. different structural patterns coexist in the same document). For instance, examining the DBLP data set (data collected in May 2006) we found that almost 10% of the ‘book’ entries and over 1% of the ‘article’ entries do not have an author, while almost all ‘proceedings’ entries do not have authors (the latter is reasonable and expected).
In such cases, even when these approaches succeed in retrieving all the meaningful answers, the meaningful answers make up only a tiny percentage of their answer set. Most of the answers are meaningless. In other words, these approaches have low precision. Our experiments in Section 6 with DBLP-based data sets show that for some approaches precision falls below 1% in some cases. Clearly, such a low precision is a serious drawback for those approaches.

A recent approach [14] introduces the concept of Meaningful Lowest Common Ancestor Structure (MLCAS) for assigning semantics to keyword queries. It also adds new functionality to XQuery to allow users to specify optional structural restrictions on the data selected by the keyword search. The goal is to improve the precision of previous approaches. However, the percentage of meaningful answers returned by this approach (i.e. the recall) is low when the data is incomplete. In our experiments in Section 6, the recall of the MLCAS approach falls below 60% for several cases of incomplete XML data. Clearly, the poor recall cannot be improved by further imposing structural restrictions. This performance is not satisfactory for data integration environments for which this approach is intended.

Our approach. In this paper we suggest an original approach for assigning semantics to keyword queries for XML documents. The originality of our approach lies in the use of structural summaries of the XML document for identifying structural patterns (in the form of TPQs) for a given query. Using a transformation for TPQs we identify those of them (called meaningful TPQs) that return meaningful answers. Previous approaches identify meaningful answers by operating locally on the data (usually computing lowest common ancestors of nodes in the XML tree). In contrast, our approach operates globally on structural summaries of data to compute meaningful TPQs. This overview of data gives an advantage to our approach compared to previous ones.

Contribution. Our main contributions are the following:
• We introduce a simple keyword query language for XML that allows the specification of elements and values of elements (atomic predicates) (Section 3). This language allows the user to specify queries without knowledge of the structure of the XML tree, and without knowledge of a complex TPQ language like XQuery.
• We define structural summaries of data called index graphs. We show how index graphs can be used to compute a set of TPQs for a keyword query that together compute the answer of that query (Section 4).
• Based on a transformation for the TPQs of a query we determine those of them that are meaningful. The meaningful TPQs are used to assign semantics to keyword queries (Section 5).
• Since the meaningful TPQs are tree pattern queries they can be evaluated using any XQuery engine. Therefore, their execution can profit from optimization techniques developed up to now for XQuery (e.g. [11,1,4]).
• We compare our approach with other prominent keyword-based approaches and also with the MLCAS approach. We analyse cases where our approach succeeds in returning meaningful answers that escape other approaches. We also analyse cases where our approach succeeds in excluding meaningless answers that are returned by other approaches.
• We have implemented and experimentally evaluated our approach both on complete and incomplete real XML data. Our approach shows better recall compared to previous ones. In addition, it allows for a better precision among approaches with similar recall (Section 6).

2 Related Work

A number of papers deal with the assignment of meaningful semantics to keyword-based query languages for XML [16,10,5,14,19]. All of them are based on some variation of the concept of Lowest Common Ancestor (LCA). The query language in [5] also allows some primitive structural restrictions to be expressed. The MLCAS approach [14] provides an extension of XQuery to allow users to query an XML document without full knowledge of the structure. We experimentally compare our approach with the three approaches in [16,5,14] in Section 6. In [19] the concept of Smallest Lowest Common Ancestor (SLCA) is used to assign semantics to keyword queries. SLCAs are defined to be LCAs that do not contain other LCAs. This semantics is similar to that of the MLCAS approach.

In order to cope with low precision some approaches extend the database techniques with information retrieval techniques. In this direction, they rank the answers of keyword search queries on XML documents according to their estimated relevance [5,8]. Information retrieval systems using ranking functions may trade recall for precision. We view our keyword query language as a database query language. Therefore, we do not employ any ranking functions. Our goal is to not miss any meaningful answer and to exclude as many meaningless answers as possible.

Some languages employ approximation techniques to search for answers when the initial query is too restricted to return any. They either relax the structure of the queries or the matchings of the queries to the data [12,2]. In contrast to our language, these languages return approximate (not exact) answers.

Several papers focus on providing efficient algorithms for evaluating LCAs for keyword queries [16,10,5,14,19,9]. Our approach is different and does not have to explicitly compute LCAs of nodes in the XML tree. Instead, it computes a number of meaningful TPQs for keyword queries from a structural summary of the data tree (index graph). TPQs can be evaluated directly using an XQuery engine.

3 Data Model and Keyword Queries

We present in this section the data model and our simple keyword-based query language.

3.1 Data Model

Let E be an infinite set of elements that includes a distinguished element r, and V be an infinite set of values. Symbols e and v (possibly with indices) refer systematically to an element and a value, respectively. As is usual, we model XML documents as trees. Nodes in an XML tree are labeled by elements or values. In particular, the root node of an XML tree is the only node labeled by element r. Values can label only leaf nodes. For simplicity we assume that the same element does not label two nodes on the same path. Further, attributes of elements in an XML document are modeled as (sub)elements.

Figure 1 shows three XML trees T1, T2, and T3 from different data sources that record bibliographic information in different formats (a slight extension of an example introduced in [14]). T1 and T3 categorize the data based on the publication year, while T2 categorizes the data based on the type of publication (article or book). Still, in T1 the year of the publications is specified as a child element of a ‘bib’ node, while in T3 there is no ‘bib’ node, and the ‘book’ and ‘article’ nodes are children of a ‘year’ node that indicates their year of publication. We are interested in retrieving information by issuing the same keyword query against all these data sources, even though information is structured differently in each one of them. Therefore, we view all these XML trees as one tree T rooted at r.

3.2 Keyword Query Language

Our query language allows the specification of element keywords and value keywords associated with elements (atomic predicates on elements).

Definition 1. A keyword query is a set of constructs, each of them being an expression of the form: (a) an element e, or (b) a predicate e = V, where V, the annotation of e, is a set of values {v1, . . . , vk}, k ≥ 1.


Suppose that we want to find the title and year of publications authored by “Mary” [14]. We formulate this request as the keyword query Q1 = [year, author = {Mary}, title]. We use Q1 as a running example throughout this paper. The answer of a keyword query is based on the concept of query embedding.

Definition 2. An embedding of a keyword query Q to an XML tree T is a mapping M of the elements of Q to nodes in T such that: (a) an element e of Q is mapped by M to a node in T labeled by e, and (b) if an element e has an annotation V (that is, a predicate e = V is specified in Q), then the image of e under M has a child value node labeled by a value in V.

We initially define keyword query answers as follows:

[Figure omitted: the XML trees T1, T2 and T3. Tree T is the tree resulting by merging the roots of T1, T2 and T3.]

Fig. 1. An XML Tree T

Definition 3. Let Q be a keyword query and T be an XML tree. An answer of Q on T is a subtree Ta of T rooted at r such that: (a) there is an embedding M of Q to Ta, and (b) the leaf nodes of Ta are images under M of an element of Q or the child nodes of the image of an element node in Q (if it has one). The set of all the answers of Q on T is called the answer set of Q on T.

Figure 2 shows four of the answers of the keyword query Q1 on the XML tree T of Figure 1. More specifically, these answers correspond to embeddings of Q1 to the XML tree T1. The keyword query is able to retrieve with one query the title and year of the publications of Mary from different parts of the XML tree, even though these parts structure the data in different ways.

[Figure omitted: four answer subtrees of T, labeled (a), (b), (c) and (d).]

Fig. 2. Four answers of Q1 on T

The previous definition of the answer set of a keyword query accepts any possible embedding of Q to T. This generality allows embeddings that do not relate elements and values in the way the user was expecting when formulating the query. We call the answers corresponding to these embeddings meaningless answers. For instance, the answer of Q1 in Figure 2(a) correctly corresponds to a publication (a book in this case) authored by “Mary”. However, this is not the case with the answers of Q1 shown in Figures 2(b), (c) and (d). In each one of them, year and/or title values do not correspond to a publication authored by “Mary” even though these values appear in an answer with “Mary”. In Section 5, we will present a technique that excludes these subtrees and returns answers to the user that are meaningful.
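Definitions 1 to 3 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the paper: the `Node` class, the dictionary encoding of a keyword query (element mapped to its annotation, or to `None` when no predicate is given), and the small example tree, which is a hypothetical fragment and not the tree T of Figure 1.

```python
from itertools import product

class Node:
    """An element node; `values` holds the labels of its child value leaves."""
    def __init__(self, label, children=(), values=()):
        self.label = label
        self.children = list(children)
        self.values = set(values)
        self.parent = None
        for child in self.children:
            child.parent = self

def all_nodes(root):
    """Yield every element node of the tree in preorder."""
    yield root
    for child in root.children:
        yield from all_nodes(child)

def embeddings(query, root):
    """Enumerate embeddings (Definition 2) as dicts: element -> tree node."""
    candidates = []
    for element, annotation in query.items():
        candidates.append([
            n for n in all_nodes(root)
            if n.label == element
            and (annotation is None or n.values & set(annotation))
        ])
    for combo in product(*candidates):
        yield dict(zip(query, combo))

def answer_nodes(embedding):
    """Element nodes of the answer subtree (Definition 3): images plus ancestors."""
    result = set()
    for node in embedding.values():
        while node is not None and node not in result:
            result.add(node)
            node = node.parent
    return result

# Hypothetical fragment: one book by Mary and one article by Bob under a 'bib' node.
tree = Node("r", [Node("bib", [
    Node("year", values={"1999"}),
    Node("book", [Node("author", values={"Mary"}), Node("title", values={"XML"})]),
    Node("article", [Node("author", values={"Bob"}), Node("title", values={"XQuery"})]),
])])
q1 = {"year": None, "author": {"Mary"}, "title": None}
embs = list(embeddings(q1, tree))
```

On this fragment the query has two embeddings. One maps title to the book title and the other maps it to the article title, so the second pairs “Mary” with a title of a publication she did not write. This is the kind of meaningless answer discussed above.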

4 Evaluating Keyword Queries Using TPQs

We show now how keyword queries can be evaluated using TPQs. We first discuss index graphs for XML trees. Then, we use index graphs to construct a set of TPQs whose answers together form the answer set of a given keyword query. The TPQs of a keyword query provide the basis for defining meaningful semantics for such queries in the next section.

4.1 Index Graphs

Given a partitioning of the nodes of an XML tree T, an index graph for T is a graph G such that: (a) every node in G is associated with a distinct equivalence class of element nodes in T, and (b) there is an edge in G from the node associated with the equivalence class a to the node associated with the equivalence class b, iff there is an edge in T from a node in a to a node in b.

Index graphs have been referred to with different names in the literature including “path summaries”, “path indexes” and “structural summaries”. They differ in the equivalence relations they employ to partition the nodes of the XML tree, which include simulation and bisimulation [15,13] or even semantic equivalence relations [17]. Index graphs have been extensively studied in recent years in both the “exact” [7,15,3] and the “approximate” flavor [13]. A common characteristic of these approaches is that the index graph is used as a back end for evaluating a class of path expressions without accessing the XML tree. To this end, the equivalence classes of the XML tree nodes are attached to the corresponding index graph nodes.

Here we define index graphs where the equivalence classes are formed by all the nodes labeled by the same element in the XML tree. Figure 3 shows the index graph G of the XML tree T of Figure 1. In contrast to other approaches, the equivalence classes of the XML tree nodes are not kept with the index graph. Therefore, keyword queries are ultimately evaluated on the XML tree. We use index graphs to support the evaluation of a keyword query through the generation of an equivalent set of TPQs.

[Figure omitted: index graph over the elements r, bib, year, book, article, author and title.]

Fig. 3. Index graph G

[Figure omitted: two TPQs rooted at r, each containing a node annotated author = {Mary}.]

Fig. 4. Two TPQs of Q1 on G: (a) U1, (b) U3

4.2 TPQs for a Keyword Query

If G is the index graph of an XML tree T, we say that T underlies G. Given a keyword query Q and an index graph G, Q can be evaluated by computing a set of TPQs whose answers taken together are equal to the answer of Q on any XML tree underlying G. Intuitively, a TPQ satisfies both the keyword query requirements and the structural constraints of the index graph.

Definition 4. Let Q be a keyword query and G be an index graph. A TPQ of Q on G is a TPQ U without descendant precedence relationships which is rooted at a node labeled by r and satisfies the following conditions: (a) There is a mapping M from the elements of Q to the nodes of U such that element e is mapped by M to a node labeled by e. If V1, . . . , Vk are the annotations of all the elements in Q that are mapped to the same node n in U, n is annotated by V1 ∩ . . . ∩ Vk. Two nodes in a path in U are not labeled by the same element, and every leaf node of U is the image of an element of Q under M. (b) There is a mapping M′ from the nodes of U to the nodes of G that respects labeling elements and child precedence relationships.

Figure 4 shows two of the TPQs of the keyword query Q1 on the index graph G of Figure 3. We define an answer of a TPQ U on an XML tree T to be a subtree of T which matches U. The child value nodes of the matching element nodes of T are also included in an answer. The answer set of a TPQ U on T is the set of all the answers of U on T. The following proposition shows that the answers of a keyword query Q on an XML tree T can be computed by determining all the TPQs of Q on the index graph of T and by computing their answers on T.

Proposition 1. Let Q be a keyword query, G be an index graph, and U1, . . . , Uk, k ≥ 1, be all the TPQs of Q on G. Let also A, A1, . . . , Ak be the answer sets of Q, U1, . . . , Uk, respectively, on an XML tree underlying G. Then A = ∪i∈[1,k] Ai.
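The label-based index graph of Section 4.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the nested-tuple tree encoding, the function names `index_graph` and `label_paths`, and the example tree are all hypothetical, and the tree is not the tree T of Figure 1.

```python
def index_graph(tree):
    """Build the index graph whose nodes are element labels.

    `tree` is a nested tuple (label, [children]). An edge (a, b) is added
    iff some node labeled a has a child labeled b in the tree.
    """
    labels, edges = set(), set()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        labels.add(label)
        for child in children:
            edges.add((label, child[0]))
            stack.append(child)
    return labels, edges

def label_paths(edges, target, root="r"):
    """List root-to-target label paths in the index graph, never repeating a label."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    found = []

    def walk(path):
        if path[-1] == target:
            found.append("/".join(path))
            return
        for nxt in sorted(children.get(path[-1], [])):
            if nxt not in path:
                walk(path + [nxt])

    walk([root])
    return sorted(found)

# Hypothetical tree mixing two layouts: year below 'bib', and 'article' below year.
tree = ("r", [
    ("bib", [("year", []), ("book", [("author", []), ("title", [])])]),
    ("year", [("article", [("author", []), ("title", [])])]),
])
labels, edges = index_graph(tree)
title_paths = label_paths(edges, "title")
```

In this example `title_paths` contains `r/bib/year/article/title`, a path that exists in the index graph but not in the tree itself. This suggests why a structural pattern derived from the index graph still has to be evaluated on the XML tree, as the text notes.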


Consider the XML tree T (Figure 1) and its index graph G (Figure 3). Consider also our keyword query Q1 and two of its TPQs on G, U1 and U3 , shown in Figure 4. One can see that the answer of Q1 on T shown in Figure 2(a) is also an answer of U1 . Similarly, the answer of Q1 on T shown in Figure 2(d) is also an answer of U3 .
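Checking whether a TPQ that uses only child relationships has an answer on an XML tree can be sketched as a recursive match. The tuple encodings of patterns and trees, the function name `matches`, and both example patterns below are assumptions made for illustration; the patterns are not claimed to be U1 or U3 of Figure 4.

```python
def matches(pattern, node):
    """Return True if `pattern` matches the tree at `node`.

    pattern: (label, annotation or None, [child patterns])
    node:    (label, set of child values, [child nodes])
    Every pattern edge is a child edge, as in Definition 4.
    """
    p_label, p_values, p_children = pattern
    n_label, n_values, n_children = node
    if p_label != n_label:
        return False
    if p_values is not None and not (set(p_values) & set(n_values)):
        return False
    # Each pattern child must be matched by at least one tree child.
    return all(any(matches(pc, nc) for nc in n_children) for pc in p_children)

# Hypothetical tree: a single book by Mary, with the year as a child of 'bib'.
tree = ("r", set(), [("bib", set(), [
    ("year", {"1999"}, []),
    ("book", set(), [("author", {"Mary"}, []), ("title", {"XML"}, [])]),
])])

def pattern_for(publication):
    """Pattern: year under 'bib', and author = {Mary} and title under `publication`."""
    return ("r", None, [("bib", None, [
        ("year", None, []),
        (publication, None, [("author", {"Mary"}, []), ("title", None, [])]),
    ])])

book_has_answer = matches(pattern_for("book"), tree)
article_has_answer = matches(pattern_for("article"), tree)
```

This sketch only tests for the existence of an answer. Enumerating the answers themselves would require collecting the matched nodes instead of returning a boolean.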

5 Using TPQs to Define Meaningful Answers

In this section, we assign semantics to our keyword query language that returns meaningful answers. In contrast to previous approaches which exclude embeddings of the query to the XML tree [16,5,14,19], our approach excludes TPQs of the query on the index graph of the XML tree. In this sense, our approach relies both on data and on structural patterns of data, instead of relying exclusively on data.

Based on the results of the previous section, we consider that, given an XML tree T (and its index graph G), the answer of a keyword query on T is the union of the answers of its TPQs on G. However, some of these TPQs may return meaningless answers. Consider, for instance, our query Q1 and the XML tree of Figure 1 along with its index graph G of Figure 3. The TPQ U3 of Q1 on G shown in Figure 4(b) returns (among others) the answer shown in Figure 2(d). This answer is meaningless. Therefore, TPQ U3 should not be used for computing the answers of Q1 on T. Analogously to query answers, we characterize a TPQ of a query Q on G as meaningful with respect to T if it returns a meaningful answer on T. Otherwise, it is characterized as meaningless with respect to T. In order to formally define meaningful TPQs we need to introduce a transformation for TPQs.

5.1 A Transformation for TPQs

Let Q be a keyword query, T be an XML tree and G be its index graph. Figure 5 shows two TPQs, U and U′, of Q on G. TPQ U comprises three subtrees Ta, Tb and Tc rooted at the nodes labeled by a, b, and c respectively. Tb is a subtree of Tc. Subtrees Ta and Tb can be empty (that is, trivially contain only their root node a and b respectively). The c-node can coincide with the root of U. However, the a-node cannot coincide with the c-node, and the b-node cannot coincide with the c-node (that is, the c-node is an ancestor of the a-node and b-node). Labels a and b can be equal. Subtree T′b in U′ is a tree identical to Tb except that its root is labeled by a instead of b. TPQ U′ can be obtained from U by removing the subtree Tc below the c-node, and by making T′b a subtree of the a-node. The transformation TR on TPQs is a transformation that takes a TPQ of the form of U and returns a TPQ of the form of U′.

[Figure omitted: TPQs U and U′ related by TR, with the subtrees Ta, Tb and Tc marked.]

Fig. 5. Transformation TR

Consider, for instance, the keyword query Q1 and the index graph G (Figure 3). Figure 6 shows three TPQs U1, U2 and U3 of Q1 on G, and two applications of transformation TR. Dotted lines denote the subtrees Ta, Tb, and Tc of transformation TR as they are graphically shown in Figure 5. Notice that in U2, the roots of Ta and Tb are labeled by the same element ‘book’, while in U3 they are labeled by different elements ‘book’ and ‘article’. Figure 7 shows two applications of TR in sequence. Notice that Tb of U5 (and consequently T′b of U6) is empty. TPQ U5 also has an extra branch from the root with respect to U6.

[Figure omitted: TPQs U1, U2 and U3 with the subtrees Ta, Tb and Tc marked.]

Fig. 6. Two applications of transformation TR to TPQs of Q1 on G. U2 and U3 are meaningless

[Figure omitted: TPQs U4, U5 and U6 with the subtrees Ta, Tb and Tc marked.]

Fig. 7. Two applications of TR to TPQs of Q1 on G, in sequence. U4 and U5 are meaningless

5.2 Determining Meaningful TPQs

We first provide some intuition on transformation TR. Consider a TPQ U′ resulting by applying TR to a TPQ U. To understand the idea, observe that there is a 1-1 mapping f from the nodes of U′ to the nodes of U that respects node labels and child precedence relationships (with the exception of the child precedence relationships from the node labeled by a in T′b). The following proposition holds:


Proposition 2. Assume that TPQ U′ results by applying transformation TR to a TPQ U. If n′ is the lowest common ancestor (LCA) of the nodes n1, . . . , nk in U′, and n is the LCA of the nodes f(n1), . . . , f(nk) in U, then n is not a descendant of f(n′) in U.

When there is an answer of Q on T that closely relates the nodes as determined by U′, any answer of Q on T that relates the nodes in the looser way determined by U is not meaningful. Therefore, if U′ returns an answer on T, U should be characterized as meaningless and should be excluded from generating an answer for Q on T.

Definition 5. A TPQ U of Q on G is called meaningless with respect to T if there is another TPQ U′ of Q on G such that (a) U′ can be obtained from U by a sequence of applications of transformation TR, and (b) U′ has an answer on T. Otherwise, it is called meaningful with respect to T.

Consider the TPQs U1, U2 and U3 of Q1 on G shown in Figure 6. One can see that U1 has an answer on T. Therefore, U2 and U3 are meaningless w.r.t. T. Consider also the TPQs U4, U5 and U6 of Q1 on G shown in Figure 7. One can see that U5 and U6 have an answer on T. Therefore, U4 and subsequently U5 are meaningless w.r.t. T.

We can now update the definition of the answer set of a keyword query given in Section 3.2 so that an answer set comprises only meaningful answers. The new definition is based on Proposition 1 and Definition 5.

Definition 6. Let Q be a keyword query, T be an XML tree and G be its index graph. Let also U1, . . . , Uk, k ≥ 1, be the meaningful TPQs of Q on G with respect to T. If A, A1, . . . , Ak are the answer sets of Q, U1, . . . , Uk, respectively, on T, then A = ∪i∈[1,k] Ai.

Consider the TPQ U3 of Q1 on G shown in Figure 4(b). As mentioned in Section 4.2, U3 evaluated on the XML tree T of Figure 1 returns the meaningless answer of Figure 2(d). TPQ U3 is also shown in Figure 6 and it is meaningless according to Definition 5. Therefore, it will not be used to generate answers for query Q1 on T. In contrast, TPQ U1 of Figure 4(a) returns only the meaningful answer of Figure 2(a). TPQ U1 is also shown in Figure 6. One can see that TR cannot be applied to U1. Therefore, it is correctly characterized by Definition 5 as meaningful, and will be used to generate answers for Q1 on T.

Since the meaningful TPQs are TPQs, their evaluation can be implemented on top of an XQuery engine and benefit from the extensive optimization techniques that have been developed up to now for XQuery [11,1,4].
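Definition 5 can be read as a reachability test over TR steps. The sketch below is not the authors' implementation: it assumes that the single-step TR relation between TPQs and the "has an answer on T" test are supplied from outside (the names `tr_steps`, `answered` and `meaningful_tpqs` are hypothetical), and it does not construct TR itself. The example data encodes what the text states about Figures 6 and 7, with the added assumption that TR cannot be applied to U6.

```python
def meaningful_tpqs(tpqs, tr_steps, has_answer):
    """Return the TPQs that are meaningful in the sense of Definition 5.

    tpqs:       iterable of TPQ identifiers
    tr_steps:   dict mapping a TPQ to the TPQs obtained by ONE application of TR
    has_answer: function telling whether a TPQ has an answer on the XML tree
    """
    def reachable(start):
        # All TPQs obtainable from `start` by one or more applications of TR.
        seen, stack = set(), list(tr_steps.get(start, ()))
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(tr_steps.get(u, ()))
        return seen

    return [u for u in tpqs if not any(has_answer(v) for v in reachable(u))]

# TR steps as described for Figures 6 and 7; no step from U1 (stated) or U6 (assumed).
tr_steps = {"U2": {"U1"}, "U3": {"U1"}, "U4": {"U5"}, "U5": {"U6"}}
# TPQs stated in the text to have an answer on T.
answered = {"U1", "U3", "U5", "U6"}
result = meaningful_tpqs(["U1", "U2", "U3", "U4", "U5", "U6"],
                         tr_steps, lambda u: u in answered)
```

Under these inputs `result` keeps U1 and U6 and drops U2, U3, U4 and U5, which agrees with the captions of Figures 6 and 7. Note that U3 is dropped even though it has an answer itself, because a TPQ reachable from it by TR also has one.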

6 Experimental Evaluation

We implemented our approach (Meaningful Tree Pattern - MTP) and the three other approaches Meet [16], XSEarch [5], and MLCAS [14]. We ran detailed experiments to compare their Recall (defined as the proportion of relevant materials retrieved) and Precision (defined as the proportion of retrieved materials that are relevant).

We used real-world DBLP data collected in May 2006. To reduce the size of the document for the experiments, we retained only three publication types: ‘book’, ‘article’, and ‘inproceedings’. For each publication type, we retained only the properties ‘title’, ‘authors’, and ‘year’. As the original DBLP data is flat, we restructured it into three types of data sets. Publications in schema type 1 do not have references. Publications in schemas type 2 and 3 may have references. One difference between schemas type 2 and type 3 is that publications in schema type 3 are categorized by year.

Besides the structure of the document, the “incompleteness” of the data also affects the effectiveness of the keyword-based searches. We define a publication in the data set as complete if it has all the subelements ‘title’, ‘year’, and ‘author’. Otherwise it is incomplete. For each query and each data set type, we ran the four approaches on six XML documents with increasing percentage of incomplete publications in the range from 0% (all the publications are complete) to 50% (half the publications are incomplete).

We ran the experiments on a Pentium 2.40GHz computer with 512MB of RAM running Windows XP Professional. We implemented all keyword search techniques in Java and used the SAX API of the Xerces Java Parser for the parsing of XML files. Berkeley DB XML 2.2.13 was used to store XML files and run XQuery.

[Figure omitted: six plots for the query with 2 keywords (year, author), one Recall plot and one Precision plot for each of the Type 1, Type 2 and Type 3 documents. Horizontal axis: %Incomplete Publications (0 to 50). Vertical axis: Recall or Precision (0 to 1). Curves: Meet, XSEarch, MLCAS, MTP.]

Fig. 8. Recall and Precision for the two-keyword query {author, year}

[Figure omitted: six plots for the query with 3 keywords (year, title, author), one Recall plot and one Precision plot for each of the Type 1, Type 2 and Type 3 documents. Horizontal axis: %Incomplete Publications (0 to 50). Vertical axis: Recall or Precision (0 to 1).]

Fig. 9. Recall and Precision for the three-keyword query {title, author, year}

Figure 8 shows precision and recall of the two-keyword query {author, year} for the three types of documents, varying the percentage of incomplete publications in the documents. Figure 9 reports on the same measurements for the three-keyword query {title, author, year}. The trends are similar, with a slight degradation of the recall of the Meet approach, and an average degradation of the precision of XSEarch and Meet.

In summary, Meet and XSEarch show very poor precision on average. MLCAS improves significantly on precision but scores low on recall both for the two- and the three-keyword query. This performance is not satisfactory for a database query language. Employing a structured TPQ language (e.g. XQuery) to further filter a query answer set using structural restrictions does not recover the missed meaningful answers. In contrast, MTP shows perfect recall. It also shows better precision compared to approaches with similar recall. Precision can be further improved by imposing structural restrictions on the answer set or by integrating our semantics for keyword queries with a structured TPQ query language.
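The two measures used in this section can be computed as in the following minimal Python sketch. The function name, the representation of answers as hashable identifiers, the example sets, and the convention of returning 1.0 when a denominator is empty are assumptions made for illustration; the paper does not describe how the measures were implemented.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for a set of returned answers.

    returned: answers produced by an approach
    relevant: answers judged meaningful for the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Convention assumed here: an empty denominator yields 1.0.
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Hypothetical example: 4 answers returned, 3 meaningful answers exist, 2 overlap.
precision, recall = precision_recall({"a1", "a2", "a3", "a4"}, {"a1", "a2", "a5"})
```

Here precision is 2/4 and recall is 2/3. An approach with perfect recall, as reported for MTP, would return every element of `relevant`.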

7 Conclusion

Issues related to applications exporting and exchanging XML data on the web have recently motivated the extension of keyword-based techniques for querying XML documents. Although these keyword-based approaches provide independence from the structure of the XML documents, they fail to retrieve all and only meaningful answers, especially when the XML data are incomplete.

We have introduced a simple keyword query language for querying XML documents and we suggested a novel semantics for it. In contrast to previous approaches that operate locally on data to extract lowest common ancestors (LCAs), our approach operates on structural summaries of data to extract meaningful tree patterns. This global view of data provides an advantage to our approach compared with previous ones. Our approach generates tree pattern queries (TPQs). Therefore, it can be easily implemented on top of an XQuery engine and benefit from well-known query optimization techniques.

We experimentally compared our approach to previous ones. Our experimental evaluation shows that it has perfect recall both for XML documents with complete and incomplete data. It also has better precision compared to approaches with similar recall. Its precision can be further improved by specifying structural restrictions on the answers returned by a keyword query.

We are currently working on applying the semantics suggested in this paper for keyword search to recently suggested query languages for tree-structured data [17,18] that flexibly allow not only keywords but also partial specification of a tree pattern.

References

1. S. Al-Khalifa, H. V. Jagadish, J. M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In Proc. of the Intl. Conf. on Data Engineering, pages 141–, 2002.
2. S. Amer-Yahia, S. Cho, and D. Srivastava. Tree Pattern Relaxation. In Proc. of the 8th Intl. Conf. on Extending Database Technology, Prague, Czech Republic, 2002.
3. A. Barta, M. P. Consens, and A. O. Mendelzon. Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods. In Proc. of the 31st Intl. Conf. on Very Large Data Bases, pages 133–144, 2005.
4. N. Bruno, N. Koudas, and D. Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 310–321, 2002.
5. S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A Semantic Search Engine for XML. In Proc. of the 29th Intl. Conf. on Very Large Data Bases, 2003.
6. D. Florescu, D. Kossmann, and I. Manolescu. Integrating Keyword Search into XML Query Processing. Computer Networks, 33(1-6):119–135, 2000.
7. R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proc. of the 23rd Intl. Conf. on Very Large Data Bases, pages 436–445, 1997.
8. L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 16–27, 2003.
9. V. Hristidis, N. Koudas, Y. Papakonstantinou, and D. Srivastava. Keyword Proximity Search in XML Trees. IEEE Trans. Knowl. Data Eng., 18(4):525–539, 2006.
10. V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword Proximity Search on XML Graphs. In Proc. of the 19th Intl. Conf. on Data Engineering, pages 367–378, 2003.
11. H. V. Jagadish, S. Al-Khalifa, A. Chapman, L. V. S. Lakshmanan, A. Nierman, S. Paparizos, J. M. Patel, D. Srivastava, N. Wiwatwattana, Y. Wu, and C. Yu. Timber: A Native XML Database. VLDB Journal, 11(4):274–291, 2002.
12. Y. Kanza and Y. Sagiv. Flexible Queries Over Semistructured Data. In Proc. of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001.
13. R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. Covering Indexes for Branching Path Queries. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, 2002.
14. Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In Proc. of the 30th Intl. Conf. on Very Large Data Bases, pages 72–83, 2004.
15. T. Milo and D. Suciu. Index Structures for Path Expressions. In Proc. of the 9th Intl. Conf. on Database Theory, pages 277–295, 1999.
16. A. Schmidt, M. L. Kersten, and M. Windhouwer. Querying XML Documents Made Easy: Nearest Concept Queries. In Proc. of the 17th Intl. Conf. on Data Engineering, 2001.
17. D. Theodoratos, T. Dalamagas, A. Koufopoulos, and N. Gehani. Semantic Querying of Tree-Structured Data Sources Using Partially Specified Tree-Patterns. In Proc. of the 14th ACM Intl. Conf. on Information and Knowledge Management, pages 712–719, 2005.
18. D. Theodoratos, S. Souldatos, T. Dalamagas, P. Placek, and T. Sellis. Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs. In Proc. of the 15th ACM Intl. Conf. on Information and Knowledge Management, 2006.
19. Y. Xu and Y. Papakonstantinou. Efficient Keyword Search for Smallest LCAs in XML Databases. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, 2005.