Validity-Sensitive Querying of XML Databases∗ Extended Abstract†

Slawomir Staworko

Jan Chomicki

Department of Computer Science and Engineering University at Buffalo {staworko,chomicki}@cse.buffalo.edu

Abstract We consider the problem of using XPath to query XML documents which are not valid with respect to given DTDs. If a query is formulated under the assumption that documents satisfy a DTD and a given document does not, then the query answer in this document may be different from the expected one. We propose an alternative mode of query evaluation which actively incorporates the schema into the evaluation process: If a document is invalid, all possible valid documents obtained from it by applying the minimum number of repairing operations are considered. Conceptually, the query is evaluated in each such document, and the intersection of all the results is returned to the user. We study the issue of how the computational complexity of query evaluation in our approach depends on the repertoire of the repairing operations and the syntax of the query. We describe experiments that validate our approach.

1

differ with respect to the constraints on the cardinalities of elements. The presence of legacy data sources may even result in situations where schema constraints cannot be enforced at all. Also, temporary violations of the schema may arise during data entry. At the same time, DTDs and XML Schemas capture important information about the expected structure of an XML document. Very often, the formulation of queries relies on the document schema. However, if the document is not valid, then the result of query evaluation may be insufficiently informative or may fail to conform to the expectations of the user. Example 1 Consider the following DTD D0 specifying a description of a project:

proj emp name salary

(name, emp, proj*, emp*) > (name, salary) > (#PCDATA) > (#PCDATA) >

The description consists of a name, a manager, a collection of subprojects, and a collection of employees involved in the project. The following query Q0 computes the salaries of all employees that are not managers:

Introduction

XML is rapidly becoming the standard format for the representation and exchange of semi-structured data (documents) over the Internet. In most contexts, documents are processed in the presence of a schema, typically a Document Type Definition (DTD) or an XML Schema. Although currently there exist various methods for maintaining the validity of semi-structured databases, many XML-based applications operate on data which is invalid with respect to a given schema. A document may be the result of integrating several documents of which some are not valid. Parts of an XML document could be imported from a document that is valid with respect to a schema slightly different than the given one. For example, the schemas may ∗ Research supported by NSF Grants IIS-0119186 and IIS0307434. † Previously published in Proc. EDBT Workshops (dataX), March 2006, Munich, Germany, Springer, LNCS 4254, pp. 164– 177.

//proj/emp/f ollowing sibling::emp/salary Now consider the following document T0 which lacks the information about the manager of the main project: proj name

emp

emp

Pierogies proj name Stuffing

name salary

name salary

John

Mary

emp

80k

40k

emp

name salary

name salary

Peter

Steve

30k

50k

Such a document can be the result of the main project not having the manager assigned yet or the manager being changed.

The standard evaluation of the query Q0 will yield the salaries of Mary and Steve. However, knowing the DTD D0 , we can determine that an emp element following the name element ‘‘Pierogies’’ is likely to be missing, and conclude that the salary of John should also be returned.

and modifying the label of a node. Such operations are commonly used in the context of XML document change detection [9, 10] and incremental integrity maintenance of XML documents [1, 4, 5]. We define repairs to be valid documents obtained from a given invalid document using sequences of operations of minimum cost, where the cost is measured as the cumulative number of inserted, deleted, and modified nodes. Valid query answers are defined analogously to consistent answers as answer obtained in every repair. A valid document is its only repair, and hence when no validity violence occurs the valid answers are equivalent to standard query answers. We consider schemas of XML documents defined using DTDs.

In our paper we address the impact of validity violations of XML documents on the result of query evaluation. In general, this problem can be approached from 3 different directions: 1. Document repair which involves techniques of cleaning and correcting databases [8, 23]. Usually, however, there exists many possible alternatives to restore the validity of the document and selecting an arbitrary repair may lead to an unjustified loss of information. User’s input can be incorporated in the correction process, but user may not have an adequate competence to fully repair the document (for instance in Example 1 the user issuing the query may not know who is the manager of main project).

Example 2 The validity of the document T1 from Example 1 can be restored in two alternative ways: 1. by inserting in the main project a missing emp element (together with its subelements name and salary, and two text elements). The cost is 5. 2. by deleting the main project node and all its subelements. The cost is 26. Because of the minimum-cost requirement, only the first way leads to a repair. Note that the text nodes of name and salary of the inserted emp element can assume infinitely many possible values, and thus there are infinitely many repairs. However, all those repairs have the same structure, and so in the context of valid answers it can be determined that the manager of the main project exists but a specific name or salary for her cannot be returned. However, the query Q0 doesn’t ask about those values but only requires the existence of the first emp element to access the remaining employees. Therefore, the valid answers to Q0 consist of the salaries of Mary, Steve, and John.

2. DTD alteration which in Example 1 would require changing the rule for the proj element:

This approach, however, requires knowledge of anticipated validity violations which is often not available. This approach may also lead to an explosion of the DTD and make it uninformative for the user. Furthermore, the modified DTD may lose an important semantic information expressed the document structure (the manager is always the first listed employee). Finally the user may not wish to change the DTD if the document contains only a small number of validity violations. 3. Query modification again requires prior identification of possible validity violations. Query modification addressing one kind of validity violation will most likely fail to handle others.

In our opinion, the set of document operations proposed here is sufficient for the correction of local violations of validity created by missing or superfluous nodes, or by incorrect node labels. The notion of valid query answer provides a way to query possibly invalid XML documents in a validity-sensitive way. In Section 6.1 we discuss possibilities of adapting our framework to handle other operations motivated by common kinds of validity violations. The contributions of this paper include:

We propose an alternative query evaluation method using the schema to identify validity violations and returning refined answers by considering all possible repairing scenarios. Our approach adapts the framework of repairs and consistent query answers developed in the context of inconsistent relational databases [3]. A repair is a consistent database instance which is minimally different from the original database. Various different notions of minimality have been studied, e.g., set- or cardinality-based minimality. A consistent query answer is an answer obtained in every repair. The framework of [3] is used as a foundation for most of the work in the area of querying inconsistent databases (for recent developments see [7, 15]). In our approach, the differences between XML documents are captured using sequences of editing operations on documents: inserting/deleting a subtree,

• A framework for validity-sensitive querying of XML documents (Section 2); • The notion of a trace graph which is a compact representation of all repairs of a possibly invalid XML document (Section 3); • An efficient algorithm, based on the trace graph, for the computation of valid query answers to a broad class of XPath queries (Section 4); • Experimental evaluation of the proposed algorithms (Section 5).

2

2

Basic notions

A tree T = X(T1 , . . . , Tn ) is valid w.r.t. a DTD D if: (1) Ti is valid w.r.t. D for every i and, (2) if X1 , . . . , Xn are labels of root nodes of T1 , . . . , Tn respectively and E = D(X), then X1 · · · Xn ∈ L(E).

We adapt a model of XML documents and DTDs similar to those commonly used in the literature [4, 5, 21, 25].

Example 3 Consider the following DTD D1 :

Ordered labeled trees

D1 (C) = (A · B)∗ ,

We view XML documents as labeled ordered trees with text values. For simplicity we ignore attributes: they can be easily simulated using text values. By Σ we denote a fixed (and finite) set of node labels and we distinguish a label PCDATA ∈ Σ to identify text nodes. A text node has no children and is additionally labeled with an element from an infinite domain Γ of text constants. For clarity of presentation, we use capital letters A, B, C, D, E, . . . for elements from Σ and capital letters X, Y, Z . . . for variables ranging over Σ. We assign a unique identifier to every node. Figure 1 contains an example of a tree. C n1

A

B n2

d

e

n4

B

D1 (B) = .

The tree T1 = C(A(d), B(e), B) is not valid w.r.t. D1 but the tree C(A(d), B) is. To capture the sets of strings satisfying regular expressions we use the standard notion of non-deterministic finite automaton (NDFA) M = hΣ, S, q0 , ∆, F i, where S is a finite set of states, q0 ∈ S is a distinguished starting state, F ⊆ S is the set of final states, and ∆ ⊆ S ×Σ×S is the transition relation. It’s a classic result that for every regular expression there exists an equivalent NDFA whose set of states is of size linear in the size of the expression [20].

n0

n3

D1 (A) = PCDATA + ,

2.1

n5

Tree edit distance and repairs

Tree operations text nodes PCDATA

A location is a sequence of natural numbers defined as follows: ε is the location of the root node, and v · i is the location of i-th child of the node at location v. This notion allows us to identify nodes without fixing a tree. We consider the three standard tree operations [1, 4, 5, 9, 10]:

Figure 1: Running example We assume that the data structure used to store a document allows for any given node to get its label, its parent, its first child, and its immediate following sibling in time O(1). For the purpose of presentation, we represent trees as terms over the signature Σ \ {PCDATA} with constants from Γ. For instance, the tree T1 from Figure 1 is represented as C(A(d), B(e), B). Note that a text constant t ∈ Γ, viewed as a term, represents a tree node with no children, that is labeled with PCDATA, and whose additional text label is t.

1. Deleting a subtree rooted at a specified location; 2. Inserting a subtree at a specified location; 3. Modifying the label at a specified location. With each operation we associate its cost. For deleting (inserting) it is the size of the deleted (resp. inserted) subtree. The cost of label modification is 1. The following example shows that the order of applying operations is important and therefore we consider sequences (rather than sets) of operations when transforming documents.

DTDs For simplicity our view of DTDs omits the specification of the root label. A DTD is a function D that maps labels from Σ \ {PCDATA} to regular expressions over Σ. We use regular expressions given by the grammar E ::≡ | X | E + E | E · E | E ∗ ,

Example 4 Recall the tree T1 = C(A(d), B(e), B) from Figure 1. If we first insert D as a second child of the root and then remove the first child of the root we obtain C(D, B(e), B). If, however, we first remove the first child of the root and then add D as a second child of the root we get C(B(e), D, B).

where X ranges over Σ, with denoting the empty string, and E + E, E · E, E ∗ denoting the union, concatenation and the Kleene closure, respectively. L(E) is the set of all strings over Σ satisfying E. |E| stands for the size (length) of E. The size of D, denoted |D|, is the sum of the lengths of the regular expressions occurring in D.

The cost of a sequence of operations is the sum of costs of its elements. Two sequences of operations are equivalent on a tree T if their application to T yields the same tree. We observe that some sequences may perform redundant operations, for instance an insertion and subsequent removal of a subtree. Because we focus on finding cheapest sequences of operations,

3

we restrict our considerations to redundancy-free sequences (those for which there is no equivalent but cheaper sequence).

3.1

We start the construction of a trace graph by building first a restoration graph which captures all (possibly not optimal) ways to repair the document at a given node. Suppose T = X(T1 , . . . , Tn ) is a tree and X1 , . . . , Xn the sequence of the root labels of T1 , . . . , Tn respectively. The DTD we work with is D and E = D(X). Let ME = hΣ, S, q0 , ∆, F i be the NDFA recognizing L(E). The restoration graph for the root of T , denoted UT , contains vertices of the form q i for q ∈ S and i ∈ {0, . . . , n}. The vertex q i is referred as the state q in the i-th column of UT and it corresponds to the state q being reached by ME after reading the elements X1 , . . . , Xi with possibly some elements deleted and some elements inserted. The edges of UT are:

Definition 1 (Edit distance) Given two trees T and T 0 , the edit distance dist(T, T 0 ) between T and T 0 is the minimum cost of transforming T into T 0 . Note that the distance between two documents (even if we consider only deletions and insertions) is a metric i.e., it is positively defined, symmetric, and satisfies the triangle inequality. For a DTD D and a (possibly invalid) tree T , a sequence of operations is a sequence repairing T w.r.t. D if the document resulting from applying the sequence to T is valid w.r.t. D. We are interested in the optimal repairing sequences of T . Definition 2 (Distance to a DTD) Given a document T and a DTD D, the distance dist(T, D) of T to D is the minimum cost of transforming T into a document valid w.r.t. D.

Del

• q i−1 −−−→ q i for any state q ∈ S and any i ∈ {1, . . . , n} (we remove Xi from the input but we don’t change the state of ME ), Ins Y

• pi −−−−−→ q i exists only if ∆(p, Y, q) (we don’t remove any symbol from the input but we change the state of ME as if the symbol Y was read),

Repairs We use the notions of edit distance to capture the minimality of change required to repair a document.

Read

• pi−1 −−−−→ q i exists only if ∆(p, Xi , q) (we read Xi from the input and change the state of ME accordingly).

Definition 3 (Repair) Given a document T and a DTD D, a document T 0 is a repair of T w.r.t. D if T 0 is valid w.r.t. D and dist(T, T 0 ) = dist(T, D).

A repairing path in UT is a path from q00 to any accepting state in the last column of UT .

Note that, if repairing a document involves inserting a text node, the corresponding text label can have infinitely many values, and thus in general there can be infinitely many repairs. However, as shown in the following example, even if the operations are restricted to deletions there can be an exponential number of non-isomorphic repairs of a given document.

Example 6 Recall the document T1 = C(A(d), B(e), B) and the DTD D1 from Example 3. The automaton M(A·B)∗ consists of two states q0 and q1 ; q0 is both the starting and the only accepting state; ∆ = {(q0 , A, q1 ), (q1 , B, q0 )}. The restoration graph UT is presented in Figure 2.

q11

Del

Del

Ins B

Ins B

Ins B

q12

d ea R

q03 Ins A

Del

d ea R

Del

q02 Ins A

q10

R ea d

Del

q01 Ins A

Ins A

The document A(B(1), T, F, . . . , B(n), T, F) consists of 4n + 1 elements and has 2n repairs w.r.t. D2 . For instance, an example of a repair for the document T2 = A(B(1), T, F, B(2), T, F, B(3), T, F) is the document A(B(1), T, B(2), F, B(3), T).

3

Del

q00

D2 (T) = , D2 (F) = .

Ins B

Example 5 Consider the following DTD D2 : D2 (A) = (B · (T + F))∗ , D2 (B) = PCDATA,

Restoration graph

q13

Figure 2: Construction of the restoration graph We observe that for any sequence of editing operations that repairs T we can obtain an equivalent sequence by translating a repairing path in UT in the following fashion:

Trace graph

In this section we present a construction that allows to capture all repairs of an XML document. For clarity of presentation, we limit operations to insertion and deletion. Later on we show how to extend our approach to handle label modification as well. The main element of this construction is a trace graph which is built for every node of the tree. A trace graph captures all optimal ways to repair the document at a given node.

Del

• q i−1 −−−→ q i is mapped to a deletion of Ti , Ins Y

• pi −−−−−→ q i is mapped to an insertion of a valid subtree with root label Y , Read

• pi−1 −−−−→ q i is mapped to a repairing sequence for Ti (obtained by induction from UTi ).

4

3.2

Similarly a choice of paths in each of the trace graphs yields a repair. It is important to note that a choice of a path on the top-level trace graph of T may correspond to more than one repair (this happens when some subtree has more than one repair). Trace graphs can also be used for interactive document repair.

Compact representation of repairs

Now, to capture optimal repairing sequences of T we assign to the edges of UT the (minimum) costs of the corresponding operations. The trace graph for T , denoted UT∗ , is the subgraph of UT consisting of only the optimal repairing paths. Note that dist(T, D) = the cost of an optimal repairing path in UT∗

Complexity analysis We note that for any vertex from the restoration graph UT the incoming edges come from the same or the preceding column. Therefore, when computing the minimum cost of a path to a given vertex we need to consider at most |Σ| × |S| + |S| + 1 values (up to |Σ| × |S| Ins edges, up to |S| Read edges, and one Del edge). We assume that |S| is bounded by the size of D and that Σ is fixed.

The exact cost assigned to edges of UT is: Del

• q i−1 −−−→ q i the cost is |Ti |, Ins Y

• q i −−−−−→ pi the cost is equal to the minimal size of a valid subtree with root label Y (this cost can be computed with a simple algorithm omitted here), Read

• q i−1 −−−−→ pi the cost is dist(Ti , D) (obtained by recursion from UT∗i ).

Theorem 1 For a given tree T and a DTD D all trace graphs of T can be constructed in O(|D|2 × |T |) time.

Example 7 (cont. Example 6) Repairing the second child of T1 requires removing the text node d and Read hence the cost assigned to q11 −−−−→q02 is 1. For the DTD D1 all insertion costs are 1. The trace graph UT∗ is presented in Figure 3. The repairing paths in the

3.3

Handling label modification

In order to extend the trace graph to handle label modification we add the following edges: Mod Y

q11 , 0

Del

Ins B

d ea R

d ea R

q12 , 2

Del

• q i −−−−−−−→ pi+1 which exists only if ∆(q, Y, p) and Y 6= Xi (we remove Xi from the input, but we change the state of ME as if Y was read);

q03 , 2 Ins B

Del

q02 , 1

Ins A

Del

Del

Ins A

q10 , 1

Ins A

Ins A

R ea d

q01 , 2 Ins B

Del

Ins B

q00 , 0

This edges corresponds to modification of the root label of Ti to Y and then recursively repairing Ti0 (which is Ti with root label changed to Y ). The cost assigned to this edge is equal 1 + dist(Ti0, D). We obtain the value dist(Ti0 , D) similarly to the way we obtained dist(Ti , D): by first constructing the trace graph UT∗ 0 . i Handling label modification requires computing |Σ| values rather than 1, but since Σ is assumed to be fixed, the asymptotic time and space complexity remains the same. Note, however, that in this way a significant constant factor in the real running time of the algorithm is introduced.

q13 , 3

Figure 3: Construction of the trace graph trace graph capture 3 different repairs: Read

Read

Ins

Read

A q12 −−−→ q03 yields 1. q00 −−−→ q11 −−−→ q02 −−−→ C(A(d), B, A, B) by repairing the second child (which involves deleting the text node e), and inserting A;

Read

Del

Read

Read

Read

2. q00 −−−→ q11 −−→ q12 −−−→ q03 yields C(A(d), B) by deleting the second child; Del

3. q00 −−−→ q11 −−−→ q02 −−→ q03 yields C(A(d), B) by repairing the second child and deleting the third.

4

Valid Query Answers

We consider here with the following class of positive Regular XPath queries [24]:

We note that although isomorphic, the repairs (2) and (3) are not the same because the nodes labeled with B are obtained from different nodes in the original tree (resp. n5 and n3 ).

Q ::= ⇐ | ⇓ | Q∗ | Q−1 | Q1 /Q2 | Q1 ∪ Q2 | name() | text() | | [t],

Repairs and paths in the trace graph

where ‘⇐’ and ‘⇓’ stand for the immediate-previoussibling and child axis; ‘Q∗ ’ is the closure of Q (repeating Q any number of times); ‘Q−1 ’ is the inverse of Q; ‘Q1 /Q2 ’ and Q1 ∪ Q2 stand for the composition and the union of Q1 and Q2 , respectively; name() selects the label of the current node, while text() can be used to obtain the text value stored at the current node (if

Note that if UT has cycles, only Ins edges can be involved in them. Since the costs of inserting operations are positive, UT∗ is a directed acyclic graph. Suppose now that we have constructed a trace graph in every node of T . Every repair can be characterized by selecting a path in each of the trace graphs.

5

it’s a text node); finally, [t] is the self axis with an optional test condition t. The test conditions are:

Example 9 (cont. Example 8) Consider a query Q1 = ::C/⇓∗/text(). For T1 we can infer: (n0 , name(), C) 7→ (n0 , ::C, n0 ), (n0 , , n0 ) 7→ (n0 , ⇓∗ , n0 ), (n0 , ⇓∗ , n0 ) ∧ (n0 , ⇓, n3 ) 7→ (n0 , ⇓∗ , n3 ), (n0 , ⇓∗ , n3 ) ∧ (n3 , ⇓, n4 ) 7→ (n0 , ⇓∗ , n4 ), (n0 , ::C, n0 ) ∧ (n0 , ⇓∗ , n4 ) 7→ (n0 , ::C/⇓∗ , n4 ), (n0 , ::C/⇓∗ , n4 ) ∧ (n4 , text(), e) 7→ (n0 , Q1 , e).

t ::= name() = X | text() = t | Q | Q1 = Q2 , where name() = X and text() = t test respectively the tag and the text label of the nodes; Q used as a test condition filters nodes from which the query Q can reach any object (an object is a node, a text value, or a node label); and Q1 = Q2 is the join condition which is satisfied by a node from which there is an object reachable with both Q1 and Q2 . Queries that don’t contain join conditions are called join-free. For convenience we introduce the following macros: Q+ := Q/Q∗ , −1

⇒ := ⇐

,

QAQ1 (T1 ) = {d, e}. We note that when computing answers to a query Q we need to use only a fixed number of different derivation rules (which involve only subqueries of Q). Since we use positive queries, the rules don’t contain negation. Therefore, the derivation process, similarly to negation-free Datalog programs, is monotonic i.e., adding new (basic) facts cannot invalidate facts derived previously. This observation allows us to construct a simple algorithm for computing the set of all tree facts that are relevant to answering Q (facts that use subqueries of Q). It traverses the document (in left-to-right prefix order): for every encountered node it adds to the computed set the basic facts about the node and then uses derivation rules to add the implied facts. After the whole tree is processed we select from the computed set the answers to Q.

Q[t] := Q/[t], Q::X := Q[name() = X].

⇓∗ ::proj/ ⇓ ::emp/ ⇒+ ::emp/ ⇓ ::salary. Recall from [24] that Regular XPath (with negation) is known to capture XPath 1.0 [31] restricted to its logical core (all axes but no attributes and no functions). Evaluation of positive queries

We use the standard semantics of XPath queries [31, 24], defined in a way geared towards the computation of valid query answers. The main notion we use is that of a tree fact which is a triple (x, Q, y), where y is an object (a node, a node label, or a text label) reachable from a node x with a query Q. We distinguish basic tree facts that use only the queries , ⇓, ⇐, name(), and text().

4.2

Validity-sensitive query evaluation

We define valid query answers by considering every repair. Definition 4 (Valid query answers) Given a tree T , a query Q, and a DTD D, an object x is a valid answer to Q in T w.r.t D if x is an answer to Q in every repair of T w.r.t. D. We denote the set of all valid query answers to Q in T w.r.t. D by V QAQ D (T ).

Example 8 Recall the tree T1 from Figure 1. Examples of basic tree facts for T1 are: (n0 , , n0 ), (n0 , name(), C), (n0 , ⇓, n3 ), (n3 , ⇓, n4 ), (n4 , text(), e). We observe that basic facts capture all structural and textual information of a given tree. Other tree facts can be derived using simple Horn rules that follow the standard semantics of XPath. For example: (x, Q∗ , x) ← (x, , x), (x, Q∗ , y) ← (x, Q∗ , z) ∧ (z, Q, y), (x, Q1 /Q2 , y) ← (x, Q1 , z) ∧ (z, Q2 , y), (x, ::X, x) ← (x, name(), X).

(4) (1) (2) (2) (3) (3)

Similarly, we can infer (n0 , Q1 , d). Thus

The query Q0 from Example 1 can be expressed as:

4.1

by by by by by by

4.2.1

Computational complexity

Computing valid answers is more complex than computing standard answers to queries (whose combined complexity is known to be PTIME [16]).

(1) (2) (3) (4)

Theorem 2 The (combined) complexity of computing valid answers to positive join-free Regular XPath queries is co-NP-complete.

Naturally, x is an answer to Q in T if (r, Q, x) can be derived for T , where r is the root node of T . We denote the set of all answers to Q in T with QAQ (T ).

Proof sketch: We reduce the membership of the complement of SAT to checking if an object is a valid answer. We use the DTD D2 and the documents from Example 5 to represent all possible valuations and translate the boolean formula into an

6

Algorytm 1 Computing V QAQ D (T ) function Certain(T ,D,Q) T = X(T1 , . . . , Tn ) with the root node r MD(X) = hΣ, S, q0 , ∆, F i def in(v) = {u ∈ UT∗ | u −→ v} (∗ incoming neighbors ∗) precompute: UT∗ – the trace graph for T CY – tree facts common for every valid tree with the root label Y (for Y ∈ Σ) Ci := Certain(Ti , D, Q) (for i ∈ {1, . . . , n}) 1: V := {q00 } 2: C(q00 ) := ({(r, , r), (r, name(), X)})Q

XPath query accordingly. For instance for the formula ϕ = (x1 ∨ ¬x2 ) ∧ x3 we get the query Q2 : ::A ⇓::B[⇓[text()=1]]/⇒::T ∪ ⇓::B[⇓[text()=2]]/⇒::F ⇓::B[⇓[text()=3]]/⇒::T . 2 We claim that ϕ 6∈ SAT ⇔ r ∈ V QAQ D2 (T2 ), where r is the root node of T2 (Example 5). Therefore, we use the data complexity [29], to express the complexity of the problem in terms of the size of the document only (by assuming other input components to be fixed).

4.3

(∗ if T is a text node t we also add (r, text(), t) ∗)

Na¨ıve computation of valid query answers

3: 4: 5: 6:

Again for clarity we consider only inserting and deleting tree operations: the approach can be extended to modifying operations in a way analogous to trace graphs in Section 3. Algorithm 1 computes valid query answers by (recursively) constructing the set of certain facts that hold in every repair of the document. Basically, for every repairing path in the trace graph the algorithm constructs the set of tree facts present in the repairs represented by this path. These sets are constructed in an incremental fashion using a flooding technique: V stores visited nodes and C(v) stores sets of certain facts for all paths from q00 to v. For the empty path we only include basic tree facts of the root node. For every traversed edge we append all certain facts contributed by the edge: none for a Del edge; for Read edge certain facts of the corresponding subtree; for Ins Y certain facts CY present in every valid tree with root Y (computed with a simple algorithm). The appending operation ]r is union which also adds basic facts establishing parent-child relation (between r and the root node of the appended subtree) and basic facts establishing sibling relation (and order) among appended trees. In every step we also add any facts that can be derived with the derivation rules for the query Q; this operation is denoted with the superscript (·)Q .

7: 8: 9:

while V 6= UT∗ do choose any q i ∈ UT∗ such that in(q i ) ⊆ V V := V ∪ {q i } C(q i ) := ∅ Del if q i−1 −−→ q i ∈ UT∗ then C(q i ) := C(q i−1 ) Ins Y

for pi −−−−→ q i ∈ UT∗ do C(q i ) := C(q i ) ∪ {(C ]r CY )Q |C ∈ C(pi )} Read

for pi−1 −−−→ q i ∈ UT∗ do C(q i )T:= S C(q i )∪{(C ]r Ci )Q |C ∈ C(pi−1 )} n return qn ∈UT∗ C(q ) end function body V QAQ D (T ) = {x|(r, Q, x) ∈ Certain(T, D, Q)} end body 10: 11: 12:

C(q20 ) = {B2 }, where B2 = (B1 ∪ C2 ∪ {(n0 , ⇓, n3 ), (n3 , ⇐, n1 )})Q1 , and C2 is the set of certain facts for the second child (note that C2 doesn’t contain information about n4 ) C2 = ({(n3 , , n3 ), (n3 , name(), B)})Q1 . C(q12 ) = {B1 , B3 }, where B3 = (B1 ∪ CA ∪ {(n0 , ⇓, i1 ), (i1 , ⇐, n3 )})Q1 , where CA is the set of certain facts for every valid tree with the root label A (i1 is a new node) CA = ({(i1 , , i1 ), (i1 , name(), A)})Q1 .

Example 10 Recall the tree T1 from Figure 1, the DTD D1 from Example 3, and the query Q1 = ::C/ ⇓∗ /text() (Example 9). The trace graph for T1 and D1 can be found in Figure 3.

C(q03 ) = {B2 , B4 , B5 }, where B4 = (B3 ∪ C3 ∪ {(n0 , ⇓, n5 ), (n5 , ⇐, i1 )})Q1 , B5 = (B1 ∪ C3 ∪ {(n0 , ⇓, n5 ), (n5 , ⇐, n1 )})Q1 , where C3 is the set of certain facts for the third child C3 = ({(n5 , , n5 ), (n5 , name(), B)})Q1 .

C(q00 ) = {B0 }, where B0 = ({(n0 , , n0 ), (n0 , name(), C, n0 )})Q1 . C(q11 ) = {B1 }, where

Now the set of certain facts for T1 is C = B2 ∩B4 ∩B5 . Note that (n0 , Q1 , d) ∈ C but (n0 , Q1 , e) 6∈ C. Hence

B1 = (B0 ∪ C1 ∪ {(n0 , ⇓, n1 )})Q1 , and C1 is the set of certain facts for A(d)

1 V QAQ D1 (T1 ) = {d}.

C1 = ({(n1 , , n1 ), (n1 , name(), A), (n1 , ⇓, n2 ), (n2 , , n2 ), (n2 , name(), PCDATA),

If we compare the valid answer with with QAQ1 (T1 ) (Example 9) we note that e has been removed. This is because D1 doesn’t allow any (text) nodes under B.

(n2 , text(), d)})Q1 .

7

Algorytm 2 Eager intersection (mod. of Algorithm 1) function Certain(T ,D,Q) ... T 9: C(q i ) := C(q i ) ∪ {(C ]r CY )Q |C ∈ C(pi )} ... T i i 11: C(q ) := C(q )∪ {(C ]r Ci )Q |C ∈ C(pi−1 )}

We recall from Example 7 that our framework allows more than one isomorphic repair. Therefore the set of valid answers to query ⇓∗ ::B in T1 is empty. This is a consequence of our decision to query the invalid document without repairing it, and thus the valid answers have to be given in terms of the original document. We note however, that if we consider a query ⇓∗ ::B/name() that asks about the names of the nodes (and thus disregards the differences between isomorphic nodes), the answer is {B}. 4.3.1

... end function Because of space limitations we skip the correctness proof of this rule. Using it in Example 10 we obtain: 0 0 C(q03 ) = {B2 , B4,5 }, where B4,5 = B4 ∩ B5 . Using simple induction over the number of the column, we can easily prove that in Algorithm 2 the number of sets stored in C(q i ) is O(i × |S| × |Σ|).

Exponential running time

We note that because there can be an exponential number of repairing paths in the trace graph (Example 5), Algorithm 1 may require exponential time to run. This algorithm may be the only alternative for computing valid answers to positive Regular XPath queries.

Theorem 4 Algorithm 2 computes valid answers to join-free positive Regular XPath queries in time polynomial in the size of the document.

Theorem 3 There exists a positive Regular XPath query Q (with join conditions) and a DTD D such that the set

4.5

DQ,D = {(T, x)|x ∈ V QAQ D (T )}.

Note that for a valid document every trace graph contains only one path (consisting of Read edges only). If the document contains validity violations, this path may create branches which correspond to alternative repairing scenarios (an example of a branching point in Figure 3 is q11 ). On each of the branches a separate copy of certain facts collected so far need to be used. Eventually the branches meet back and the sets of certain facts need to be intersected (either by following Ins or Read edges or in the final intersection). If the set of certain facts is large, copying it and consecutive intersection for every validity violation may cause a significant overhead. A lazy copying optimization separates the facts collected on different branches from the facts collected before the branching point; the intersection is performed only on the former facts.

is co-NP-complete. Proof sketch: We reduce the complement of SAT to DQ3 ,D3 , where the DTD D3 is defined as follows: D3 (A) = ((T + F) · B)∗ · C∗ , D3 (C) = N∗ , D3 (B) = , D3 (F) = D3 (T) = D3 (N) = PCDATA. and the query Q3 is ::A[⇓::C[⇓::N/⇓/text() = ⇑::A/(⇓::T ∪ ⇓::F)/⇓/text()]] We show only an example of constructing the document T3 for the formula φ = (x1 ∨¬x2 ∨x3 )∧(x2 ∨x3 ) : A T(1), F(∼1), B, T(2), F(∼2), B, T(3), F(∼3), B, C(N(∼1), N(2), N(∼3)), C(N(∼2), N(∼3))

5

Experimental results

The main goal of our experiments was to validate our approach and to discover possible limitations to be addressed in the future. Because flat documents (documents of bounded height) are very common among XML repositories, we tested the behavior of our solutions on flat documents.

We claim that ϕ 6∈ SAT ⇔ (T3 , r) ∈ DQ3 ,D3 , where r is the root node of T3 . 4.4

Lazy copying

Eager intersection

If Q is a join-free query, then we can use the heuristic of eager intersection:

Data sets To observe the impact of the document size on the performance of algorithms we first randomly generated a valid document. Next, we introduced the violations of validity to a document by removing and inserting randomly chosen nodes. To measure the validity violations of a document T we use the invalidity ratio dist(T, D)/|T |. For most of the experiments, we used the DTD D0 and the query Q0 from Example 1. In experiments

Let B1 and B2 be two sets from C(v) for some vertex v. Suppose that v −→ v 0 and the edge is labeled with an operation that appends a tree (either Rep or Ins). Let B10 and B20 be the sets for v 0 obtained from B1 and B2 resp. as described before. Instead of storing in C(v 0 ) both sets B10 and B20 we only store their intersection B10 ∩ B20 .

8

investigating the impact of the DTD size on the performance, we used a family of DTDs Dn , n ≥ 0:

bc Parse

Time (sec.)

Dn (A) = (. . . ((PCDATA + A1 ) · A2 + A3 ) · A4 + . . . An ) , Dn (Ai ) = A∗ , for i ∈ {1, . . . , n}. For documents generated for these DTDs we used a simple query ⇓∗/text(). Implementation All compared algorithms were implemented in Java 5.0 and used common programming tools including: a Pull model parser (STaX), a representation of regular expressions and corresponding NDFA’s, structures storing tree facts, and an implementation of derivation rules. For ease of implementation, we considered only a restricted class of descending path queries which involve only simple filter conditions (testing tag and text labels), do not use operators ∪ and −1 , and use the closure operator ∗ only on the axises ⇓ and ⇐. We note that those queries are most commonly used in practice and the restrictions allow to compute standard answers to such queries in time linear in the size of the document.

u MDist

t u

u

9

u 6

u

3

u

0

u bct

t u u b t b bc bc

u t bcb

t bu bc

0

10

u t b bc

u t b bc

t u b bc

t u bc

20 30 Document size (MB)

bc

bc

bc

bc

b

b

b

b

b

t u

t u

t u

40

50

Figure 4: Trace graph construction for variable document size (0.1% invalidity ratio) to efficiently validate XML documents should also be applicable to efficiently construct trace graphs. By increasing the size of DTD we also expand Σ whose size is a factor in the time of constructing trace graphs with label modification. This explains why the observed execution time of MDist is cubic in the size of |D|.

Environment All tests were performed on an Intel Pentium M 1000MHz machine running Windows XP with 512 MB RAM and 40 GB hard drive. We repeated each test 5 times, discarded extreme readings, and took the average of the remaining ones.

bc Validate u

b Dist

u MDist b

6 Time (sec.)

5.1

t Dist u

u

12

∗

b Validate

Trace graph construction

In the first part of the experiments, we measured the time necessary to construct the trace graphs. We used algorithms MDist and Dist that constructed trace graphs (resp. with and without label modification) to compute the edit distance of the document to the DTD. As a base line we used the time necessary to parse the file containing the document (Parse). We also compared the performance of our algorithms with Validate which validates the database w.r.t. the DTD. Figure 4 contains the results showing the impact of the document size on the performance of our algorithms. The results confirm our analysis: trace graphs are constructed in time linear in the size of the document. Note also that computation of the edit distance without using modifying operations introduces only small overhead over validation. Note also that the overhead of Validate over Parse is small as well, which suggests that the implementation of validation is efficient. We also observe that including modification operation in the model leads to a significant increase in the computation time. Figure 5 shows that Dist takes the time quadratic in the size of the DTD. We note, however, that Validate behaves similarly and the overhead of Dist is small. Because in our approach we don’t assume any particular properties of the automata used, we conjecture: that any technique that optimize the automata

u u

4

bb

b

bb

b

b bc bc bb bc bc b c b c b b b b bc bc bc bc bc bb b bcb bc bc bc bc uu b b b c b bc bc bc bc bc bc bc u

2

0 0

u

30

60 DTD size |D|

b

b

cb bc

90

b

b bc c

120

Figure 5: Trace graph construction for variable DTD size (1MB Document with 0.1% invalidity ratio) 5.2

Valid query answer computation

In the second part of the experiments we measured the time needed to compute the valid query answers. The algorithm QA for computing standard query answers, described in Section 4.1 is used as the baseline. Our implementations MVQA and VQA (resp. with and without label modification) use both eager intersection and lazy copying. Figure 6 shows that for the DTD D0 computing valid query answers (without modifying operations) is about 6 times longer than computing query answers with QA. Incorporating the modification operation into the model causes again a very significant evaluation time increase. Because the computation of valid answers involves constructing trace graphs, similarly to computing edit distance, we observe in Figure 7 a quadratic depen-

9

b VQA c QA b

b VQA

t MVQA u

t u b b t u

b

b

b

b

b

4

b

tb u c b u 0 t 0

c b

b bc

bc

bc

bc

bc bc

Time (sec.)

30

b

b

b b

0

10

20

b

30 40 DTD size |D|

50

60

70

Figure 7: Valid query answers computation for variable DTD size (1 MB document with 0.1% invalidity ratio) In the last series of experiments we investigated the impact of invalidities on the execution time of algorithms computing valid answers. To observe the benefits of lazy copying we compare algorithms VQA with EagerVQA which computes valid query answers without using lazy copying. We use the DTD D2 from Example 3. In Figure 8 we find that invalidities significantly affect the computation time of EagerVQA while with lazy copying the execution time grows very slowly as the invalidity ratio increases.

6 6.1

rs

rs

b b

rs b

b

b

b

b

b

b

0.05 0.10 0.15 0.20 0.25 Invalidity ratio dist(t, D)/|t| (%)

Missing or superfluous inner nodes. The validity of a document that is missing (or has a superfluous) an inner node can be restored by performing vertical insertion (resp. deletion) of such a node. A vertical insertion of a node takes a subsequence of subelements and links them as the children of the inserted node. The vertical deletion works in the opposite way: the children of the removed node are linked to the parent of the node. This generalized notion of edit distance [27, 6] subsumes ours (our notion is sometimes called 1-degree edit distance [26]). This notion has practical applications in information extraction [12] and integration of XML repositories [19]. [28] presents a polynomial algorithm for computing the generalized edit distance between a document T and a DTD. This algorithm creates graph structures which, analogously to trace graphs, are used to find optimal repairing sequences by selecting the shortest paths. Although these structures could be used to compute valid answers, their construction takes O(|T |5 ) time, which limits the practical applications of this approach. It is an open question if valid query answers can be computed more efficiently in this setting.

b

0

10

Figure 8: Valid query answers computation for variable invalidity ratio (1 MB document)

b VQA 40

b

rs

0 9

dence between the performance time and the size of DTD for VQA (because of significantly higher readings of MVQA we omit them).

b b b b b b b b

rs

rs

0

Figure 6: Valid query answers computation for variable document size (0.1% invalidity ratio)

10

20

sr b

c bc bc b

3 6 Document size (MB)

20

rs rs

b

8

EagerVQA

rs Time (sec.)

Time (sec.)

16 12

rs

30

Element transpositions Validity violations caused by incorrect order of elements could be addressed by an operation moving a subtree to a specified location [6]. This kind of an operation is studied (together with insertion, deletion, and modification) in the context of detecting document changes [9, 10] and document correction techniques [8]. It is an open question if our framework can be extended to include this operation, however the intractability of the problem of calculating the string edit distance with moves [8, 11] is discouraging.

Related work Other editing operations

6.2

By extending the repertoire of considered editing operations, our basic framework of repairs could be adapted to handle other common causes of validity violations. We note, however, those extensions may require new algorithms to compute valid answers.

Query approximation

A framework for evaluating queries with none or only a partial knowledge of the schema is proposed in [22]. A (refined) notion of Lowest Common Ancestor is used to identify meaningful relationships between document

10

elements in situations where expected document structure is not known. In this approach, however, the schema is not used directly in the query evaluation but rather the user is required to assess her knowledge of the schema during query formulation and identify potential structural ambiguity. This contrasts with our approach, where the user is assumed to possess a good knowledge of the expected document structure but cannot or does not wish to identify all potential structural discrepancies. Also, the schema is used during the query evaluation to alleviate the impact of possible validly violations on the query answer. 6.3

DTD’s a polynomial algorithm is proposed to compute certain answers to path queries. The considered class of queries are a subset of XPath, and in particular does not contain queries with union. Also this work contains no experimental evaluation of the proposed framework. [30] is another adaptation of consistent query answers to XML databases closely based on the framework of [3].

7

In this paper we have investigated the problem of querying XML documents containing violations of validity of a local nature, caused by missing or superfluous (leaf) nodes, or incorrect node labeling. We proposed a framework which considers possible repairs of a given document, obtained by applying a minimum number of operations that insert, delete, or rename nodes. We demonstrated an efficient algorithm for validity-sensitive querying based on the notion of valid query answer. We envision several possible directions for future work. First, one can investigate if valid answers can be obtained using query rewriting [17, 18]. Techniques based on query rewriting have been successfully used to compute consistent query answers in relational databases [3, 15]. Second, it is an open question if negation could be introduced into our framework. Our solutions make use of the monotonicity of the derivation process, which are implied with the rules having positive premises. For simple negative facts like (n, [name() 6= X], n), the derivation process is still performed in a monotonic fashion. This is not the case for queries with more general form of negation like ⇓∗ [¬ ⇑∗ [A]]. Third, it would be of significant interest to establish a complete picture of how the complexity of computing valid query answers depends on the query language and the repertoire of the available tree operations. Finally, it would be interesting to find out to what extent our framework can be adapted to handle semantic inconsistencies in XML documents, for example violations of key dependencies.

Structural restoration

The problem of correcting a slightly invalid document is considered in [8]. Under certain conditions, the proposed algorithm returns a valid document whose distance from the original one is guaranteed to be within a multiplicative constant of the minimum distance. A notion equivalent to the distance of a document to a DTD (Definition 2) was used to construct errorcorrecting parsers for context-free languages [2]. 6.4

Conclusions and future work

Consistent query answers for XML

[13] investigates querying XML documents that are valid but violate functional dependencies. Two repairing actions are considered: updating element values with a null value and marking nodes as unreliable. This choice of actions prevents from introducing further validity violations in the document upon repairing it. Nodes with null values or marked as unreliable do not cause violations of functional dependencies but also are not returned in the answers to queries. Repairs are consistent instances with a minimal set of nodes affected by the repairing actions. Two types of query semantics are proposed: certain answers – answers present in every repair; and possible answers – answers present in any repair. A polynomial algorithm for computing certain and possible answers is presented. We note that only simple descending path queries a1 /a2 / . . . /an are considered. This work contains no experimental evaluation of the proposed framework. A set of operations similar to ours is considered for consistent querying of XML documents that violate functional dependencies in [14]. Depending on the operations used different notions of repairs are considered: cleaning repairs obtained only by deleting elements, completing repairs obtained by inserting nodes, and general repairs obtained by both operations. A set-theoretic notion of minimality is used when defining repairs: the set of operations needed to repair a document is minimized with the preference for inserting operations nodes (the document T1 = C(A(d), B(e), B) has only one repair C(A(d), B, A, B). Again possible and certain answers are considered. For restricted classes of functional dependencies and

References [1] S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, and J. L. Wiener. Incremental Maintenance for Materialized Views over Semistructured Data. In International Conference on Very Large Data Bases (VLDB), pages 38–49, 1998. [2] A. V. Aho and T. G. Peterson. A Minimum Distance Error-Correcting Parser for ContextFree Languages. SIAM Journal on Computing, 1(4):305–312, 1972. [3] M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. In

11

[17] G. Grahne and A. Thomo. Query Containment and Rewriting using Views for Regular Path Queries under Constraints. In ACM Symposium on Principles of Database Systems (PODS), 2003.

ACM Symposium on Principles of Database Systems (PODS), pages 68–79, 1999. [4] A Balmin, Y. Papakonstantinou, and V. Vianu. Incremental Validation of XML Documents. ACM Transactions on Database Systems (TODS), 29(4):710–751, December 2004.

[18] G. Grahne and A. Thomo. Query Answering and Containment for Regular Path Queries under Distortions. In Foundations of Information and Knowledge Systems (FOIKS), 2004.

[5] M. Benedikt, W. Fan, and F. Geerts. XPath Satisfiability in the Presence of DTDs. In ACM Symposium on Principles of Database Systems (PODS), 2005.

[19] S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, and T. Yu. Approximate XML Joins. In ACM SIGMOD International Conference on Management of Data, pages 287–298, 2002.

[6] P. Bille. Tree Edit Distance, Aligment and Inclusion. Technical Report TR-2003-23, The IT University of Copenhagen, 2003.

[20] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2nd edition, 2001.

[7] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. In ACM SIGMOD International Conference on Management of Data, 2005.

[21] A. Klarlund, T. Schwentick, and D. Suciu. XML: Model, Schemas, Types, Logics and Queries. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases. Springer-Verlag, 2003.

[8] U. Boobna and M. de Rougemont. Correctors for XML Data. In International XML Database Symposium (Xsym), pages 97–111, 2004.

[22] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In International Conference on Very Large Data Bases (VLDB), pages 72–83, 2004.

[9] S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. In ACM SIGMOD International Conference on Management of Data, pages 493–504, 1996.

[23] W. L. Low, W. H. Tok, M. Lee, and T. W. Ling. Data Cleaning and XML: The DBLP Experience. In International Conference on Data Engineering (ICDE), page 269, 2002.

[10] G. Cobena, S. Abiteboul, and A. Marian. Detecting Changes in XML Documents. In International Conference on Data Engineering (ICDE), 2002.

[24] M. Marx. Conditional XPath, the First Order Complete XPath Dialect. In ACM Symposium on Principles of Database Systems (PODS), 2004.

[11] G. Cormode and S. Muthukrishnan. The String Edit Distance Matching Problem with Moves. In ACM SIAM Symposium on Discrete Algorithms (SODA), pages 667–676, 2002.

[25] F. Neven. Automata, Logic, and XML. In Workshop on Computer Science Logic (CSL), 2002.

[12] D. de Castro Reis, P. Golgher, A. da Silva, and A Laender. Automatic Web News Extraction using Tree Edit Distance. In International Conference on World Wide Web (WWW), 2004.

[26] S. M. Selkow. The Tree-to-Tree Editing Problem. Information Processing Letters,1977. [27] D. Shasha and K. Zhang. Approximate Tree Pattern Matching. In A. Apostolico and Z. Galil, editors, Pattern Matching in Strings, Trees, and Arrays, Oxford University Press, 1997.

[13] S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Repairs and Consistent Answers for XML Data with Functional Dependencies. In International XML Database Symposium (Xsym), 2003.

[28] N. Suzuki. Finding an Optimum Edit Script between an XML Document and a DTD. In ACM Symposium on Applied Computing (SAC), 2005.

[14] S. Flesca, F Furfaro, S. Greco, and E. Zumpano. Querying and Repairing Inconsistent XML Data. In Web Information Systems Engineering (WISE), pages 175–188, 2005.

[29] M. Y. Vardi. The Complexity of Relational Query Languages. In ACM Symposium on Theory of Computing (STOC), pages 137–146, 1982.

[15] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient Management of Inconsistent Databases. In ACM SIGMOD International Conference on Management of Data, 2005.

[30] S. Vllalobos. Consistent Answers to Queries Posed to Inconsistent XML Databases. Master’s thesis, Catholic University of Chile (PUC), 2003.

[16] G. Gottlob, C. Koch, and R. Pichler. Efficient Algorithms for Processing XPath Queries. In International Conference on Very Large Data Bases (VLDB), pages 95–106, 2002.

[31] W3C. XML path language (XPath 1.0), 1999.

12

Slawomir Staworko

Jan Chomicki

Department of Computer Science and Engineering University at Buffalo {staworko,chomicki}@cse.buffalo.edu

Abstract We consider the problem of using XPath to query XML documents which are not valid with respect to given DTDs. If a query is formulated under the assumption that documents satisfy a DTD and a given document does not, then the query answer in this document may be different from the expected one. We propose an alternative mode of query evaluation which actively incorporates the schema into the evaluation process: If a document is invalid, all possible valid documents obtained from it by applying the minimum number of repairing operations are considered. Conceptually, the query is evaluated in each such document, and the intersection of all the results is returned to the user. We study the issue of how the computational complexity of query evaluation in our approach depends on the repertoire of the repairing operations and the syntax of the query. We describe experiments that validate our approach.

1

differ with respect to the constraints on the cardinalities of elements. The presence of legacy data sources may even result in situations where schema constraints cannot be enforced at all. Also, temporary violations of the schema may arise during data entry. At the same time, DTDs and XML Schemas capture important information about the expected structure of an XML document. Very often, the formulation of queries relies on the document schema. However, if the document is not valid, then the result of query evaluation may be insufficiently informative or may fail to conform to the expectations of the user. Example 1 Consider the following DTD D0 specifying a description of a project:

proj emp name salary

(name, emp, proj*, emp*) > (name, salary) > (#PCDATA) > (#PCDATA) >

The description consists of a name, a manager, a collection of subprojects, and a collection of employees involved in the project. The following query Q0 computes the salaries of all employees that are not managers:

Introduction

XML is rapidly becoming the standard format for the representation and exchange of semi-structured data (documents) over the Internet. In most contexts, documents are processed in the presence of a schema, typically a Document Type Definition (DTD) or an XML Schema. Although currently there exist various methods for maintaining the validity of semi-structured databases, many XML-based applications operate on data which is invalid with respect to a given schema. A document may be the result of integrating several documents of which some are not valid. Parts of an XML document could be imported from a document that is valid with respect to a schema slightly different than the given one. For example, the schemas may ∗ Research supported by NSF Grants IIS-0119186 and IIS0307434. † Previously published in Proc. EDBT Workshops (dataX), March 2006, Munich, Germany, Springer, LNCS 4254, pp. 164– 177.

//proj/emp/f ollowing sibling::emp/salary Now consider the following document T0 which lacks the information about the manager of the main project: proj name

emp

emp

Pierogies proj name Stuffing

name salary

name salary

John

Mary

emp

80k

40k

emp

name salary

name salary

Peter

Steve

30k

50k

Such a document can be the result of the main project not having the manager assigned yet or the manager being changed.

The standard evaluation of the query Q0 will yield the salaries of Mary and Steve. However, knowing the DTD D0 , we can determine that an emp element following the name element ‘‘Pierogies’’ is likely to be missing, and conclude that the salary of John should also be returned.

and modifying the label of a node. Such operations are commonly used in the context of XML document change detection [9, 10] and incremental integrity maintenance of XML documents [1, 4, 5]. We define repairs to be valid documents obtained from a given invalid document using sequences of operations of minimum cost, where the cost is measured as the cumulative number of inserted, deleted, and modified nodes. Valid query answers are defined analogously to consistent answers as answer obtained in every repair. A valid document is its only repair, and hence when no validity violence occurs the valid answers are equivalent to standard query answers. We consider schemas of XML documents defined using DTDs.

In our paper we address the impact of validity violations of XML documents on the result of query evaluation. In general, this problem can be approached from 3 different directions: 1. Document repair which involves techniques of cleaning and correcting databases [8, 23]. Usually, however, there exists many possible alternatives to restore the validity of the document and selecting an arbitrary repair may lead to an unjustified loss of information. User’s input can be incorporated in the correction process, but user may not have an adequate competence to fully repair the document (for instance in Example 1 the user issuing the query may not know who is the manager of main project).

Example 2 The validity of the document T1 from Example 1 can be restored in two alternative ways: 1. by inserting in the main project a missing emp element (together with its subelements name and salary, and two text elements). The cost is 5. 2. by deleting the main project node and all its subelements. The cost is 26. Because of the minimum-cost requirement, only the first way leads to a repair. Note that the text nodes of name and salary of the inserted emp element can assume infinitely many possible values, and thus there are infinitely many repairs. However, all those repairs have the same structure, and so in the context of valid answers it can be determined that the manager of the main project exists but a specific name or salary for her cannot be returned. However, the query Q0 doesn’t ask about those values but only requires the existence of the first emp element to access the remaining employees. Therefore, the valid answers to Q0 consist of the salaries of Mary, Steve, and John.

2. DTD alteration which in Example 1 would require changing the rule for the proj element:

This approach, however, requires knowledge of anticipated validity violations which is often not available. This approach may also lead to an explosion of the DTD and make it uninformative for the user. Furthermore, the modified DTD may lose an important semantic information expressed the document structure (the manager is always the first listed employee). Finally the user may not wish to change the DTD if the document contains only a small number of validity violations. 3. Query modification again requires prior identification of possible validity violations. Query modification addressing one kind of validity violation will most likely fail to handle others.

In our opinion, the set of document operations proposed here is sufficient for the correction of local violations of validity created by missing or superfluous nodes, or by incorrect node labels. The notion of valid query answer provides a way to query possibly invalid XML documents in a validity-sensitive way. In Section 6.1 we discuss possibilities of adapting our framework to handle other operations motivated by common kinds of validity violations. The contributions of this paper include:

We propose an alternative query evaluation method using the schema to identify validity violations and returning refined answers by considering all possible repairing scenarios. Our approach adapts the framework of repairs and consistent query answers developed in the context of inconsistent relational databases [3]. A repair is a consistent database instance which is minimally different from the original database. Various different notions of minimality have been studied, e.g., set- or cardinality-based minimality. A consistent query answer is an answer obtained in every repair. The framework of [3] is used as a foundation for most of the work in the area of querying inconsistent databases (for recent developments see [7, 15]). In our approach, the differences between XML documents are captured using sequences of editing operations on documents: inserting/deleting a subtree,

• A framework for validity-sensitive querying of XML documents (Section 2); • The notion of a trace graph which is a compact representation of all repairs of a possibly invalid XML document (Section 3); • An efficient algorithm, based on the trace graph, for the computation of valid query answers to a broad class of XPath queries (Section 4); • Experimental evaluation of the proposed algorithms (Section 5).

2

2

Basic notions

A tree T = X(T1 , . . . , Tn ) is valid w.r.t. a DTD D if: (1) Ti is valid w.r.t. D for every i and, (2) if X1 , . . . , Xn are labels of root nodes of T1 , . . . , Tn respectively and E = D(X), then X1 · · · Xn ∈ L(E).

We adapt a model of XML documents and DTDs similar to those commonly used in the literature [4, 5, 21, 25].

Example 3 Consider the following DTD D1 :

Ordered labeled trees

D1 (C) = (A · B)∗ ,

We view XML documents as labeled ordered trees with text values. For simplicity we ignore attributes: they can be easily simulated using text values. By Σ we denote a fixed (and finite) set of node labels and we distinguish a label PCDATA ∈ Σ to identify text nodes. A text node has no children and is additionally labeled with an element from an infinite domain Γ of text constants. For clarity of presentation, we use capital letters A, B, C, D, E, . . . for elements from Σ and capital letters X, Y, Z . . . for variables ranging over Σ. We assign a unique identifier to every node. Figure 1 contains an example of a tree. C n1

A

B n2

d

e

n4

B

D1 (B) = .

The tree T1 = C(A(d), B(e), B) is not valid w.r.t. D1 but the tree C(A(d), B) is. To capture the sets of strings satisfying regular expressions we use the standard notion of non-deterministic finite automaton (NDFA) M = hΣ, S, q0 , ∆, F i, where S is a finite set of states, q0 ∈ S is a distinguished starting state, F ⊆ S is the set of final states, and ∆ ⊆ S ×Σ×S is the transition relation. It’s a classic result that for every regular expression there exists an equivalent NDFA whose set of states is of size linear in the size of the expression [20].

n0

n3

D1 (A) = PCDATA + ,

2.1

n5

Tree edit distance and repairs

Tree operations text nodes PCDATA

A location is a sequence of natural numbers defined as follows: ε is the location of the root node, and v · i is the location of i-th child of the node at location v. This notion allows us to identify nodes without fixing a tree. We consider the three standard tree operations [1, 4, 5, 9, 10]:

Figure 1: Running example We assume that the data structure used to store a document allows for any given node to get its label, its parent, its first child, and its immediate following sibling in time O(1). For the purpose of presentation, we represent trees as terms over the signature Σ \ {PCDATA} with constants from Γ. For instance, the tree T1 from Figure 1 is represented as C(A(d), B(e), B). Note that a text constant t ∈ Γ, viewed as a term, represents a tree node with no children, that is labeled with PCDATA, and whose additional text label is t.

1. Deleting a subtree rooted at a specified location; 2. Inserting a subtree at a specified location; 3. Modifying the label at a specified location. With each operation we associate its cost. For deleting (inserting) it is the size of the deleted (resp. inserted) subtree. The cost of label modification is 1. The following example shows that the order of applying operations is important and therefore we consider sequences (rather than sets) of operations when transforming documents.

DTDs For simplicity our view of DTDs omits the specification of the root label. A DTD is a function D that maps labels from Σ \ {PCDATA} to regular expressions over Σ. We use regular expressions given by the grammar E ::≡ | X | E + E | E · E | E ∗ ,

Example 4 Recall the tree T1 = C(A(d), B(e), B) from Figure 1. If we first insert D as a second child of the root and then remove the first child of the root we obtain C(D, B(e), B). If, however, we first remove the first child of the root and then add D as a second child of the root we get C(B(e), D, B).

where X ranges over Σ, with denoting the empty string, and E + E, E · E, E ∗ denoting the union, concatenation and the Kleene closure, respectively. L(E) is the set of all strings over Σ satisfying E. |E| stands for the size (length) of E. The size of D, denoted |D|, is the sum of the lengths of the regular expressions occurring in D.

The cost of a sequence of operations is the sum of costs of its elements. Two sequences of operations are equivalent on a tree T if their application to T yields the same tree. We observe that some sequences may perform redundant operations, for instance an insertion and subsequent removal of a subtree. Because we focus on finding cheapest sequences of operations,

3

we restrict our considerations to redundancy-free sequences (those for which there is no equivalent but cheaper sequence).

3.1

We start the construction of a trace graph by building first a restoration graph which captures all (possibly not optimal) ways to repair the document at a given node. Suppose T = X(T1 , . . . , Tn ) is a tree and X1 , . . . , Xn the sequence of the root labels of T1 , . . . , Tn respectively. The DTD we work with is D and E = D(X). Let ME = hΣ, S, q0 , ∆, F i be the NDFA recognizing L(E). The restoration graph for the root of T , denoted UT , contains vertices of the form q i for q ∈ S and i ∈ {0, . . . , n}. The vertex q i is referred as the state q in the i-th column of UT and it corresponds to the state q being reached by ME after reading the elements X1 , . . . , Xi with possibly some elements deleted and some elements inserted. The edges of UT are:

Definition 1 (Edit distance) Given two trees T and T 0 , the edit distance dist(T, T 0 ) between T and T 0 is the minimum cost of transforming T into T 0 . Note that the distance between two documents (even if we consider only deletions and insertions) is a metric i.e., it is positively defined, symmetric, and satisfies the triangle inequality. For a DTD D and a (possibly invalid) tree T , a sequence of operations is a sequence repairing T w.r.t. D if the document resulting from applying the sequence to T is valid w.r.t. D. We are interested in the optimal repairing sequences of T . Definition 2 (Distance to a DTD) Given a document T and a DTD D, the distance dist(T, D) of T to D is the minimum cost of transforming T into a document valid w.r.t. D.

Del

• q i−1 −−−→ q i for any state q ∈ S and any i ∈ {1, . . . , n} (we remove Xi from the input but we don’t change the state of ME ), Ins Y

• pi −−−−−→ q i exists only if ∆(p, Y, q) (we don’t remove any symbol from the input but we change the state of ME as if the symbol Y was read),

Repairs We use the notions of edit distance to capture the minimality of change required to repair a document.

Read

• pi−1 −−−−→ q i exists only if ∆(p, Xi , q) (we read Xi from the input and change the state of ME accordingly).

Definition 3 (Repair) Given a document T and a DTD D, a document T 0 is a repair of T w.r.t. D if T 0 is valid w.r.t. D and dist(T, T 0 ) = dist(T, D).

A repairing path in UT is a path from q00 to any accepting state in the last column of UT .

Note that, if repairing a document involves inserting a text node, the corresponding text label can have infinitely many values, and thus in general there can be infinitely many repairs. However, as shown in the following example, even if the operations are restricted to deletions there can be an exponential number of non-isomorphic repairs of a given document.

Example 6 Recall the document T1 = C(A(d), B(e), B) and the DTD D1 from Example 3. The automaton M(A·B)∗ consists of two states q0 and q1 ; q0 is both the starting and the only accepting state; ∆ = {(q0 , A, q1 ), (q1 , B, q0 )}. The restoration graph UT is presented in Figure 2.

q11

Del

Del

Ins B

Ins B

Ins B

q12

d ea R

q03 Ins A

Del

d ea R

Del

q02 Ins A

q10

R ea d

Del

q01 Ins A

Ins A

The document A(B(1), T, F, . . . , B(n), T, F) consists of 4n + 1 elements and has 2n repairs w.r.t. D2 . For instance, an example of a repair for the document T2 = A(B(1), T, F, B(2), T, F, B(3), T, F) is the document A(B(1), T, B(2), F, B(3), T).

3

Del

q00

D2 (T) = , D2 (F) = .

Ins B

Example 5 Consider the following DTD D2 : D2 (A) = (B · (T + F))∗ , D2 (B) = PCDATA,

Restoration graph

q13

Figure 2: Construction of the restoration graph We observe that for any sequence of editing operations that repairs T we can obtain an equivalent sequence by translating a repairing path in UT in the following fashion:

Trace graph

In this section we present a construction that allows to capture all repairs of an XML document. For clarity of presentation, we limit operations to insertion and deletion. Later on we show how to extend our approach to handle label modification as well. The main element of this construction is a trace graph which is built for every node of the tree. A trace graph captures all optimal ways to repair the document at a given node.

Del

• q i−1 −−−→ q i is mapped to a deletion of Ti , Ins Y

• pi −−−−−→ q i is mapped to an insertion of a valid subtree with root label Y , Read

• pi−1 −−−−→ q i is mapped to a repairing sequence for Ti (obtained by induction from UTi ).

4

3.2

Similarly a choice of paths in each of the trace graphs yields a repair. It is important to note that a choice of a path on the top-level trace graph of T may correspond to more than one repair (this happens when some subtree has more than one repair). Trace graphs can also be used for interactive document repair.

Compact representation of repairs

Now, to capture optimal repairing sequences of T we assign to the edges of UT the (minimum) costs of the corresponding operations. The trace graph for T , denoted UT∗ , is the subgraph of UT consisting of only the optimal repairing paths. Note that dist(T, D) = the cost of an optimal repairing path in UT∗

Complexity analysis We note that for any vertex from the restoration graph UT the incoming edges come from the same or the preceding column. Therefore, when computing the minimum cost of a path to a given vertex we need to consider at most |Σ| × |S| + |S| + 1 values (up to |Σ| × |S| Ins edges, up to |S| Read edges, and one Del edge). We assume that |S| is bounded by the size of D and that Σ is fixed.

The exact cost assigned to edges of UT is: Del

• q i−1 −−−→ q i the cost is |Ti |, Ins Y

• q i −−−−−→ pi the cost is equal to the minimal size of a valid subtree with root label Y (this cost can be computed with a simple algorithm omitted here), Read

• q i−1 −−−−→ pi the cost is dist(Ti , D) (obtained by recursion from UT∗i ).

Theorem 1 For a given tree T and a DTD D all trace graphs of T can be constructed in O(|D|2 × |T |) time.

Example 7 (cont. Example 6) Repairing the second child of T1 requires removing the text node d and Read hence the cost assigned to q11 −−−−→q02 is 1. For the DTD D1 all insertion costs are 1. The trace graph UT∗ is presented in Figure 3. The repairing paths in the

3.3

Handling label modification

In order to extend the trace graph to handle label modification we add the following edges: Mod Y

q11 , 0

Del

Ins B

d ea R

d ea R

q12 , 2

Del

• q i −−−−−−−→ pi+1 which exists only if ∆(q, Y, p) and Y 6= Xi (we remove Xi from the input, but we change the state of ME as if Y was read);

q03 , 2 Ins B

Del

q02 , 1

Ins A

Del

Del

Ins A

q10 , 1

Ins A

Ins A

R ea d

q01 , 2 Ins B

Del

Ins B

q00 , 0

This edges corresponds to modification of the root label of Ti to Y and then recursively repairing Ti0 (which is Ti with root label changed to Y ). The cost assigned to this edge is equal 1 + dist(Ti0, D). We obtain the value dist(Ti0 , D) similarly to the way we obtained dist(Ti , D): by first constructing the trace graph UT∗ 0 . i Handling label modification requires computing |Σ| values rather than 1, but since Σ is assumed to be fixed, the asymptotic time and space complexity remains the same. Note, however, that in this way a significant constant factor in the real running time of the algorithm is introduced.

q13 , 3

Figure 3: Construction of the trace graph trace graph capture 3 different repairs: Read

Read

Ins

Read

A q12 −−−→ q03 yields 1. q00 −−−→ q11 −−−→ q02 −−−→ C(A(d), B, A, B) by repairing the second child (which involves deleting the text node e), and inserting A;

Read

Del

Read

Read

Read

2. q00 −−−→ q11 −−→ q12 −−−→ q03 yields C(A(d), B) by deleting the second child; Del

3. q00 −−−→ q11 −−−→ q02 −−→ q03 yields C(A(d), B) by repairing the second child and deleting the third.

4

Valid Query Answers

We consider here with the following class of positive Regular XPath queries [24]:

We note that although isomorphic, the repairs (2) and (3) are not the same because the nodes labeled with B are obtained from different nodes in the original tree (resp. n5 and n3 ).

Q ::= ⇐ | ⇓ | Q∗ | Q−1 | Q1 /Q2 | Q1 ∪ Q2 | name() | text() | | [t],

Repairs and paths in the trace graph

where ‘⇐’ and ‘⇓’ stand for the immediate-previoussibling and child axis; ‘Q∗ ’ is the closure of Q (repeating Q any number of times); ‘Q−1 ’ is the inverse of Q; ‘Q1 /Q2 ’ and Q1 ∪ Q2 stand for the composition and the union of Q1 and Q2 , respectively; name() selects the label of the current node, while text() can be used to obtain the text value stored at the current node (if

Note that if UT has cycles, only Ins edges can be involved in them. Since the costs of inserting operations are positive, UT∗ is a directed acyclic graph. Suppose now that we have constructed a trace graph in every node of T . Every repair can be characterized by selecting a path in each of the trace graphs.

5

it’s a text node); finally, [t] is the self axis with an optional test condition t. The test conditions are:

Example 9 (cont. Example 8) Consider a query Q1 = ::C/⇓∗/text(). For T1 we can infer: (n0 , name(), C) 7→ (n0 , ::C, n0 ), (n0 , , n0 ) 7→ (n0 , ⇓∗ , n0 ), (n0 , ⇓∗ , n0 ) ∧ (n0 , ⇓, n3 ) 7→ (n0 , ⇓∗ , n3 ), (n0 , ⇓∗ , n3 ) ∧ (n3 , ⇓, n4 ) 7→ (n0 , ⇓∗ , n4 ), (n0 , ::C, n0 ) ∧ (n0 , ⇓∗ , n4 ) 7→ (n0 , ::C/⇓∗ , n4 ), (n0 , ::C/⇓∗ , n4 ) ∧ (n4 , text(), e) 7→ (n0 , Q1 , e).

t ::= name() = X | text() = t | Q | Q1 = Q2 , where name() = X and text() = t test respectively the tag and the text label of the nodes; Q used as a test condition filters nodes from which the query Q can reach any object (an object is a node, a text value, or a node label); and Q1 = Q2 is the join condition which is satisfied by a node from which there is an object reachable with both Q1 and Q2 . Queries that don’t contain join conditions are called join-free. For convenience we introduce the following macros: Q+ := Q/Q∗ , −1

⇒ := ⇐

,

QAQ1 (T1 ) = {d, e}. We note that when computing answers to a query Q we need to use only a fixed number of different derivation rules (which involve only subqueries of Q). Since we use positive queries, the rules don’t contain negation. Therefore, the derivation process, similarly to negation-free Datalog programs, is monotonic i.e., adding new (basic) facts cannot invalidate facts derived previously. This observation allows us to construct a simple algorithm for computing the set of all tree facts that are relevant to answering Q (facts that use subqueries of Q). It traverses the document (in left-to-right prefix order): for every encountered node it adds to the computed set the basic facts about the node and then uses derivation rules to add the implied facts. After the whole tree is processed we select from the computed set the answers to Q.

Q[t] := Q/[t], Q::X := Q[name() = X].

⇓∗ ::proj/ ⇓ ::emp/ ⇒+ ::emp/ ⇓ ::salary. Recall from [24] that Regular XPath (with negation) is known to capture XPath 1.0 [31] restricted to its logical core (all axes but no attributes and no functions). Evaluation of positive queries

We use the standard semantics of XPath queries [31, 24], defined in a way geared towards the computation of valid query answers. The main notion we use is that of a tree fact which is a triple (x, Q, y), where y is an object (a node, a node label, or a text label) reachable from a node x with a query Q. We distinguish basic tree facts that use only the queries , ⇓, ⇐, name(), and text().

4.2

Validity-sensitive query evaluation

We define valid query answers by considering every repair. Definition 4 (Valid query answers) Given a tree T , a query Q, and a DTD D, an object x is a valid answer to Q in T w.r.t D if x is an answer to Q in every repair of T w.r.t. D. We denote the set of all valid query answers to Q in T w.r.t. D by V QAQ D (T ).

Example 8 Recall the tree T1 from Figure 1. Examples of basic tree facts for T1 are: (n0 , , n0 ), (n0 , name(), C), (n0 , ⇓, n3 ), (n3 , ⇓, n4 ), (n4 , text(), e). We observe that basic facts capture all structural and textual information of a given tree. Other tree facts can be derived using simple Horn rules that follow the standard semantics of XPath. For example: (x, Q∗ , x) ← (x, , x), (x, Q∗ , y) ← (x, Q∗ , z) ∧ (z, Q, y), (x, Q1 /Q2 , y) ← (x, Q1 , z) ∧ (z, Q2 , y), (x, ::X, x) ← (x, name(), X).

(4) (1) (2) (2) (3) (3)

Similarly, we can infer (n0 , Q1 , d). Thus

The query Q0 from Example 1 can be expressed as:

4.1

by by by by by by

4.2.1

Computational complexity

Computing valid answers is more complex than computing standard answers to queries (whose combined complexity is known to be PTIME [16]).

(1) (2) (3) (4)

Theorem 2 The (combined) complexity of computing valid answers to positive join-free Regular XPath queries is co-NP-complete.

Naturally, x is an answer to Q in T if (r, Q, x) can be derived for T , where r is the root node of T . We denote the set of all answers to Q in T with QAQ (T ).

Proof sketch: We reduce the membership of the complement of SAT to checking if an object is a valid answer. We use the DTD D2 and the documents from Example 5 to represent all possible valuations and translate the boolean formula into an

6

Algorytm 1 Computing V QAQ D (T ) function Certain(T ,D,Q) T = X(T1 , . . . , Tn ) with the root node r MD(X) = hΣ, S, q0 , ∆, F i def in(v) = {u ∈ UT∗ | u −→ v} (∗ incoming neighbors ∗) precompute: UT∗ – the trace graph for T CY – tree facts common for every valid tree with the root label Y (for Y ∈ Σ) Ci := Certain(Ti , D, Q) (for i ∈ {1, . . . , n}) 1: V := {q00 } 2: C(q00 ) := ({(r, , r), (r, name(), X)})Q

XPath query accordingly. For instance for the formula ϕ = (x1 ∨ ¬x2 ) ∧ x3 we get the query Q2 : ::A ⇓::B[⇓[text()=1]]/⇒::T ∪ ⇓::B[⇓[text()=2]]/⇒::F ⇓::B[⇓[text()=3]]/⇒::T . 2 We claim that ϕ 6∈ SAT ⇔ r ∈ V QAQ D2 (T2 ), where r is the root node of T2 (Example 5). Therefore, we use the data complexity [29], to express the complexity of the problem in terms of the size of the document only (by assuming other input components to be fixed).

4.3

(∗ if T is a text node t we also add (r, text(), t) ∗)

Na¨ıve computation of valid query answers

3: 4: 5: 6:

Again for clarity we consider only inserting and deleting tree operations: the approach can be extended to modifying operations in a way analogous to trace graphs in Section 3. Algorithm 1 computes valid query answers by (recursively) constructing the set of certain facts that hold in every repair of the document. Basically, for every repairing path in the trace graph the algorithm constructs the set of tree facts present in the repairs represented by this path. These sets are constructed in an incremental fashion using a flooding technique: V stores visited nodes and C(v) stores sets of certain facts for all paths from q00 to v. For the empty path we only include basic tree facts of the root node. For every traversed edge we append all certain facts contributed by the edge: none for a Del edge; for Read edge certain facts of the corresponding subtree; for Ins Y certain facts CY present in every valid tree with root Y (computed with a simple algorithm). The appending operation ]r is union which also adds basic facts establishing parent-child relation (between r and the root node of the appended subtree) and basic facts establishing sibling relation (and order) among appended trees. In every step we also add any facts that can be derived with the derivation rules for the query Q; this operation is denoted with the superscript (·)Q .

7: 8: 9:

while V 6= UT∗ do choose any q i ∈ UT∗ such that in(q i ) ⊆ V V := V ∪ {q i } C(q i ) := ∅ Del if q i−1 −−→ q i ∈ UT∗ then C(q i ) := C(q i−1 ) Ins Y

for pi −−−−→ q i ∈ UT∗ do C(q i ) := C(q i ) ∪ {(C ]r CY )Q |C ∈ C(pi )} Read

for pi−1 −−−→ q i ∈ UT∗ do C(q i )T:= S C(q i )∪{(C ]r Ci )Q |C ∈ C(pi−1 )} n return qn ∈UT∗ C(q ) end function body V QAQ D (T ) = {x|(r, Q, x) ∈ Certain(T, D, Q)} end body 10: 11: 12:

C(q20 ) = {B2 }, where B2 = (B1 ∪ C2 ∪ {(n0 , ⇓, n3 ), (n3 , ⇐, n1 )})Q1 , and C2 is the set of certain facts for the second child (note that C2 doesn’t contain information about n4 ) C2 = ({(n3 , , n3 ), (n3 , name(), B)})Q1 . C(q12 ) = {B1 , B3 }, where B3 = (B1 ∪ CA ∪ {(n0 , ⇓, i1 ), (i1 , ⇐, n3 )})Q1 , where CA is the set of certain facts for every valid tree with the root label A (i1 is a new node) CA = ({(i1 , , i1 ), (i1 , name(), A)})Q1 .

Example 10 Recall the tree T1 from Figure 1, the DTD D1 from Example 3, and the query Q1 = ::C/ ⇓∗ /text() (Example 9). The trace graph for T1 and D1 can be found in Figure 3.

C(q03 ) = {B2 , B4 , B5 }, where B4 = (B3 ∪ C3 ∪ {(n0 , ⇓, n5 ), (n5 , ⇐, i1 )})Q1 , B5 = (B1 ∪ C3 ∪ {(n0 , ⇓, n5 ), (n5 , ⇐, n1 )})Q1 , where C3 is the set of certain facts for the third child C3 = ({(n5 , , n5 ), (n5 , name(), B)})Q1 .

C(q00 ) = {B0 }, where B0 = ({(n0 , , n0 ), (n0 , name(), C, n0 )})Q1 . C(q11 ) = {B1 }, where

Now the set of certain facts for T1 is C = B2 ∩B4 ∩B5 . Note that (n0 , Q1 , d) ∈ C but (n0 , Q1 , e) 6∈ C. Hence

B1 = (B0 ∪ C1 ∪ {(n0 , ⇓, n1 )})Q1 , and C1 is the set of certain facts for A(d)

1 V QAQ D1 (T1 ) = {d}.

C1 = ({(n1 , , n1 ), (n1 , name(), A), (n1 , ⇓, n2 ), (n2 , , n2 ), (n2 , name(), PCDATA),

If we compare the valid answer with with QAQ1 (T1 ) (Example 9) we note that e has been removed. This is because D1 doesn’t allow any (text) nodes under B.

(n2 , text(), d)})Q1 .

7

Algorytm 2 Eager intersection (mod. of Algorithm 1) function Certain(T ,D,Q) ... T 9: C(q i ) := C(q i ) ∪ {(C ]r CY )Q |C ∈ C(pi )} ... T i i 11: C(q ) := C(q )∪ {(C ]r Ci )Q |C ∈ C(pi−1 )}

We recall from Example 7 that our framework allows more than one isomorphic repair. Therefore the set of valid answers to query ⇓∗ ::B in T1 is empty. This is a consequence of our decision to query the invalid document without repairing it, and thus the valid answers have to be given in terms of the original document. We note however, that if we consider a query ⇓∗ ::B/name() that asks about the names of the nodes (and thus disregards the differences between isomorphic nodes), the answer is {B}. 4.3.1

... end function Because of space limitations we skip the correctness proof of this rule. Using it in Example 10 we obtain: 0 0 C(q03 ) = {B2 , B4,5 }, where B4,5 = B4 ∩ B5 . Using simple induction over the number of the column, we can easily prove that in Algorithm 2 the number of sets stored in C(q i ) is O(i × |S| × |Σ|).

Exponential running time

We note that because there can be an exponential number of repairing paths in the trace graph (Example 5), Algorithm 1 may require exponential time to run. This algorithm may be the only alternative for computing valid answers to positive Regular XPath queries.

Theorem 4 Algorithm 2 computes valid answers to join-free positive Regular XPath queries in time polynomial in the size of the document.

Theorem 3 There exists a positive Regular XPath query Q (with join conditions) and a DTD D such that the set

4.5

DQ,D = {(T, x)|x ∈ V QAQ D (T )}.

Note that for a valid document every trace graph contains only one path (consisting of Read edges only). If the document contains validity violations, this path may create branches which correspond to alternative repairing scenarios (an example of a branching point in Figure 3 is q11 ). On each of the branches a separate copy of certain facts collected so far need to be used. Eventually the branches meet back and the sets of certain facts need to be intersected (either by following Ins or Read edges or in the final intersection). If the set of certain facts is large, copying it and consecutive intersection for every validity violation may cause a significant overhead. A lazy copying optimization separates the facts collected on different branches from the facts collected before the branching point; the intersection is performed only on the former facts.

is co-NP-complete. Proof sketch: We reduce the complement of SAT to DQ3 ,D3 , where the DTD D3 is defined as follows: D3 (A) = ((T + F) · B)∗ · C∗ , D3 (C) = N∗ , D3 (B) = , D3 (F) = D3 (T) = D3 (N) = PCDATA. and the query Q3 is ::A[⇓::C[⇓::N/⇓/text() = ⇑::A/(⇓::T ∪ ⇓::F)/⇓/text()]] We show only an example of constructing the document T3 for the formula φ = (x1 ∨¬x2 ∨x3 )∧(x2 ∨x3 ) : A T(1), F(∼1), B, T(2), F(∼2), B, T(3), F(∼3), B, C(N(∼1), N(2), N(∼3)), C(N(∼2), N(∼3))

5

Experimental results

The main goal of our experiments was to validate our approach and to discover possible limitations to be addressed in the future. Because flat documents (documents of bounded height) are very common among XML repositories, we tested the behavior of our solutions on flat documents.

We claim that ϕ 6∈ SAT ⇔ (T3 , r) ∈ DQ3 ,D3 , where r is the root node of T3 . 4.4

Lazy copying

Eager intersection

If Q is a join-free query, then we can use the heuristic of eager intersection:

Data sets To observe the impact of the document size on the performance of algorithms we first randomly generated a valid document. Next, we introduced the violations of validity to a document by removing and inserting randomly chosen nodes. To measure the validity violations of a document T we use the invalidity ratio dist(T, D)/|T |. For most of the experiments, we used the DTD D0 and the query Q0 from Example 1. In experiments

Let B1 and B2 be two sets from C(v) for some vertex v. Suppose that v −→ v 0 and the edge is labeled with an operation that appends a tree (either Rep or Ins). Let B10 and B20 be the sets for v 0 obtained from B1 and B2 resp. as described before. Instead of storing in C(v 0 ) both sets B10 and B20 we only store their intersection B10 ∩ B20 .

8

investigating the impact of the DTD size on the performance, we used a family of DTDs Dn , n ≥ 0:

bc Parse

Time (sec.)

Dn (A) = (. . . ((PCDATA + A1 ) · A2 + A3 ) · A4 + . . . An ) , Dn (Ai ) = A∗ , for i ∈ {1, . . . , n}. For documents generated for these DTDs we used a simple query ⇓∗/text(). Implementation All compared algorithms were implemented in Java 5.0 and used common programming tools including: a Pull model parser (STaX), a representation of regular expressions and corresponding NDFA’s, structures storing tree facts, and an implementation of derivation rules. For ease of implementation, we considered only a restricted class of descending path queries which involve only simple filter conditions (testing tag and text labels), do not use operators ∪ and −1 , and use the closure operator ∗ only on the axises ⇓ and ⇐. We note that those queries are most commonly used in practice and the restrictions allow to compute standard answers to such queries in time linear in the size of the document.

u MDist

t u

u

9

u 6

u

3

u

0

u bct

t u u b t b bc bc

u t bcb

t bu bc

0

10

u t b bc

u t b bc

t u b bc

t u bc

20 30 Document size (MB)

bc

bc

bc

bc

b

b

b

b

b

t u

t u

t u

40

50

Figure 4: Trace graph construction for variable document size (0.1% invalidity ratio) to efficiently validate XML documents should also be applicable to efficiently construct trace graphs. By increasing the size of DTD we also expand Σ whose size is a factor in the time of constructing trace graphs with label modification. This explains why the observed execution time of MDist is cubic in the size of |D|.

Environment All tests were performed on an Intel Pentium M 1000MHz machine running Windows XP with 512 MB RAM and 40 GB hard drive. We repeated each test 5 times, discarded extreme readings, and took the average of the remaining ones.

bc Validate u

b Dist

u MDist b

6 Time (sec.)

5.1

t Dist u

u

12

∗

b Validate

Trace graph construction

In the first part of the experiments, we measured the time necessary to construct the trace graphs. We used algorithms MDist and Dist that constructed trace graphs (resp. with and without label modification) to compute the edit distance of the document to the DTD. As a base line we used the time necessary to parse the file containing the document (Parse). We also compared the performance of our algorithms with Validate which validates the database w.r.t. the DTD. Figure 4 contains the results showing the impact of the document size on the performance of our algorithms. The results confirm our analysis: trace graphs are constructed in time linear in the size of the document. Note also that computation of the edit distance without using modifying operations introduces only small overhead over validation. Note also that the overhead of Validate over Parse is small as well, which suggests that the implementation of validation is efficient. We also observe that including modification operation in the model leads to a significant increase in the computation time. Figure 5 shows that Dist takes the time quadratic in the size of the DTD. We note, however, that Validate behaves similarly and the overhead of Dist is small. Because in our approach we don’t assume any particular properties of the automata used, we conjecture: that any technique that optimize the automata

u u

4

bb

b

bb

b

b bc bc bb bc bc b c b c b b b b bc bc bc bc bc bb b bcb bc bc bc bc uu b b b c b bc bc bc bc bc bc bc u

2

0 0

u

30

60 DTD size |D|

b

b

cb bc

90

b

b bc c

120

Figure 5: Trace graph construction for variable DTD size (1MB Document with 0.1% invalidity ratio) 5.2

Valid query answer computation

In the second part of the experiments we measured the time needed to compute the valid query answers. The algorithm QA for computing standard query answers, described in Section 4.1 is used as the baseline. Our implementations MVQA and VQA (resp. with and without label modification) use both eager intersection and lazy copying. Figure 6 shows that for the DTD D0 computing valid query answers (without modifying operations) is about 6 times longer than computing query answers with QA. Incorporating the modification operation into the model causes again a very significant evaluation time increase. Because the computation of valid answers involves constructing trace graphs, similarly to computing edit distance, we observe in Figure 7 a quadratic depen-

9

b VQA c QA b

b VQA

t MVQA u

t u b b t u

b

b

b

b

b

4

b

tb u c b u 0 t 0

c b

b bc

bc

bc

bc

bc bc

Time (sec.)

30

b

b

b b

0

10

20

b

30 40 DTD size |D|

50

60

70

Figure 7: Valid query answers computation for variable DTD size (1 MB document with 0.1% invalidity ratio) In the last series of experiments we investigated the impact of invalidities on the execution time of algorithms computing valid answers. To observe the benefits of lazy copying we compare algorithms VQA with EagerVQA which computes valid query answers without using lazy copying. We use the DTD D2 from Example 3. In Figure 8 we find that invalidities significantly affect the computation time of EagerVQA while with lazy copying the execution time grows very slowly as the invalidity ratio increases.

6 6.1

rs

rs

b b

rs b

b

b

b

b

b

b

0.05 0.10 0.15 0.20 0.25 Invalidity ratio dist(t, D)/|t| (%)

Missing or superfluous inner nodes. The validity of a document that is missing (or has a superfluous) an inner node can be restored by performing vertical insertion (resp. deletion) of such a node. A vertical insertion of a node takes a subsequence of subelements and links them as the children of the inserted node. The vertical deletion works in the opposite way: the children of the removed node are linked to the parent of the node. This generalized notion of edit distance [27, 6] subsumes ours (our notion is sometimes called 1-degree edit distance [26]). This notion has practical applications in information extraction [12] and integration of XML repositories [19]. [28] presents a polynomial algorithm for computing the generalized edit distance between a document T and a DTD. This algorithm creates graph structures which, analogously to trace graphs, are used to find optimal repairing sequences by selecting the shortest paths. Although these structures could be used to compute valid answers, their construction takes O(|T |5 ) time, which limits the practical applications of this approach. It is an open question if valid query answers can be computed more efficiently in this setting.

b

0

10

Figure 8: Valid query answers computation for variable invalidity ratio (1 MB document)

b VQA 40

b

rs

0 9

dence between the performance time and the size of DTD for VQA (because of significantly higher readings of MVQA we omit them).

b b b b b b b b

rs

rs

0

Figure 6: Valid query answers computation for variable document size (0.1% invalidity ratio)

10

20

sr b

c bc bc b

3 6 Document size (MB)

20

rs rs

b

8

EagerVQA

rs Time (sec.)

Time (sec.)

16 12

rs

30

Element transpositions Validity violations caused by incorrect order of elements could be addressed by an operation moving a subtree to a specified location [6]. This kind of an operation is studied (together with insertion, deletion, and modification) in the context of detecting document changes [9, 10] and document correction techniques [8]. It is an open question if our framework can be extended to include this operation, however the intractability of the problem of calculating the string edit distance with moves [8, 11] is discouraging.

Related work Other editing operations

6.2

By extending the repertoire of considered editing operations, our basic framework of repairs could be adapted to handle other common causes of validity violations. We note, however, those extensions may require new algorithms to compute valid answers.

Query approximation

A framework for evaluating queries with none or only a partial knowledge of the schema is proposed in [22]. A (refined) notion of Lowest Common Ancestor is used to identify meaningful relationships between document

10

elements in situations where expected document structure is not known. In this approach, however, the schema is not used directly in the query evaluation but rather the user is required to assess her knowledge of the schema during query formulation and identify potential structural ambiguity. This contrasts with our approach, where the user is assumed to possess a good knowledge of the expected document structure but cannot or does not wish to identify all potential structural discrepancies. Also, the schema is used during the query evaluation to alleviate the impact of possible validly violations on the query answer. 6.3

DTD’s a polynomial algorithm is proposed to compute certain answers to path queries. The considered class of queries are a subset of XPath, and in particular does not contain queries with union. Also this work contains no experimental evaluation of the proposed framework. [30] is another adaptation of consistent query answers to XML databases closely based on the framework of [3].

7

In this paper we have investigated the problem of querying XML documents containing violations of validity of a local nature, caused by missing or superfluous (leaf) nodes, or incorrect node labeling. We proposed a framework which considers possible repairs of a given document, obtained by applying a minimum number of operations that insert, delete, or rename nodes. We demonstrated an efficient algorithm for validity-sensitive querying based on the notion of valid query answer. We envision several possible directions for future work. First, one can investigate if valid answers can be obtained using query rewriting [17, 18]. Techniques based on query rewriting have been successfully used to compute consistent query answers in relational databases [3, 15]. Second, it is an open question if negation could be introduced into our framework. Our solutions make use of the monotonicity of the derivation process, which are implied with the rules having positive premises. For simple negative facts like (n, [name() 6= X], n), the derivation process is still performed in a monotonic fashion. This is not the case for queries with more general form of negation like ⇓∗ [¬ ⇑∗ [A]]. Third, it would be of significant interest to establish a complete picture of how the complexity of computing valid query answers depends on the query language and the repertoire of the available tree operations. Finally, it would be interesting to find out to what extent our framework can be adapted to handle semantic inconsistencies in XML documents, for example violations of key dependencies.

Structural restoration

The problem of correcting a slightly invalid document is considered in [8]. Under certain conditions, the proposed algorithm returns a valid document whose distance from the original one is guaranteed to be within a multiplicative constant of the minimum distance. A notion equivalent to the distance of a document to a DTD (Definition 2) was used to construct errorcorrecting parsers for context-free languages [2]. 6.4

Conclusions and future work

Consistent query answers for XML

[13] investigates querying XML documents that are valid but violate functional dependencies. Two repairing actions are considered: updating element values with a null value and marking nodes as unreliable. This choice of actions prevents from introducing further validity violations in the document upon repairing it. Nodes with null values or marked as unreliable do not cause violations of functional dependencies but also are not returned in the answers to queries. Repairs are consistent instances with a minimal set of nodes affected by the repairing actions. Two types of query semantics are proposed: certain answers – answers present in every repair; and possible answers – answers present in any repair. A polynomial algorithm for computing certain and possible answers is presented. We note that only simple descending path queries a1 /a2 / . . . /an are considered. This work contains no experimental evaluation of the proposed framework. A set of operations similar to ours is considered for consistent querying of XML documents that violate functional dependencies in [14]. Depending on the operations used different notions of repairs are considered: cleaning repairs obtained only by deleting elements, completing repairs obtained by inserting nodes, and general repairs obtained by both operations. A set-theoretic notion of minimality is used when defining repairs: the set of operations needed to repair a document is minimized with the preference for inserting operations nodes (the document T1 = C(A(d), B(e), B) has only one repair C(A(d), B, A, B). Again possible and certain answers are considered. For restricted classes of functional dependencies and

References [1] S. Abiteboul, J. McHugh, M. Rys, V. Vassalos, and J. L. Wiener. Incremental Maintenance for Materialized Views over Semistructured Data. In International Conference on Very Large Data Bases (VLDB), pages 38–49, 1998. [2] A. V. Aho and T. G. Peterson. A Minimum Distance Error-Correcting Parser for ContextFree Languages. SIAM Journal on Computing, 1(4):305–312, 1972. [3] M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. In

11

[17] G. Grahne and A. Thomo. Query Containment and Rewriting using Views for Regular Path Queries under Constraints. In ACM Symposium on Principles of Database Systems (PODS), 2003.

ACM Symposium on Principles of Database Systems (PODS), pages 68–79, 1999. [4] A Balmin, Y. Papakonstantinou, and V. Vianu. Incremental Validation of XML Documents. ACM Transactions on Database Systems (TODS), 29(4):710–751, December 2004.

[18] G. Grahne and A. Thomo. Query Answering and Containment for Regular Path Queries under Distortions. In Foundations of Information and Knowledge Systems (FOIKS), 2004.

[5] M. Benedikt, W. Fan, and F. Geerts. XPath Satisfiability in the Presence of DTDs. In ACM Symposium on Principles of Database Systems (PODS), 2005.

[19] S. Guha, H. V. Jagadish, N. Koudas, D. Srivastava, and T. Yu. Approximate XML Joins. In ACM SIGMOD International Conference on Management of Data, pages 287–298, 2002.

[6] P. Bille. Tree Edit Distance, Aligment and Inclusion. Technical Report TR-2003-23, The IT University of Copenhagen, 2003.

[20] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2nd edition, 2001.

[7] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. In ACM SIGMOD International Conference on Management of Data, 2005.

[21] A. Klarlund, T. Schwentick, and D. Suciu. XML: Model, Schemas, Types, Logics and Queries. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases. Springer-Verlag, 2003.

[8] U. Boobna and M. de Rougemont. Correctors for XML Data. In International XML Database Symposium (Xsym), pages 97–111, 2004.

[22] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In International Conference on Very Large Data Bases (VLDB), pages 72–83, 2004.

[9] S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. In ACM SIGMOD International Conference on Management of Data, pages 493–504, 1996.

[23] W. L. Low, W. H. Tok, M. Lee, and T. W. Ling. Data Cleaning and XML: The DBLP Experience. In International Conference on Data Engineering (ICDE), page 269, 2002.

[10] G. Cobena, S. Abiteboul, and A. Marian. Detecting Changes in XML Documents. In International Conference on Data Engineering (ICDE), 2002.

[24] M. Marx. Conditional XPath, the First Order Complete XPath Dialect. In ACM Symposium on Principles of Database Systems (PODS), 2004.

[11] G. Cormode and S. Muthukrishnan. The String Edit Distance Matching Problem with Moves. In ACM SIAM Symposium on Discrete Algorithms (SODA), pages 667–676, 2002.

[25] F. Neven. Automata, Logic, and XML. In Workshop on Computer Science Logic (CSL), 2002.

[12] D. de Castro Reis, P. Golgher, A. da Silva, and A Laender. Automatic Web News Extraction using Tree Edit Distance. In International Conference on World Wide Web (WWW), 2004.

[26] S. M. Selkow. The Tree-to-Tree Editing Problem. Information Processing Letters,1977. [27] D. Shasha and K. Zhang. Approximate Tree Pattern Matching. In A. Apostolico and Z. Galil, editors, Pattern Matching in Strings, Trees, and Arrays, Oxford University Press, 1997.

[13] S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Repairs and Consistent Answers for XML Data with Functional Dependencies. In International XML Database Symposium (Xsym), 2003.

[28] N. Suzuki. Finding an Optimum Edit Script between an XML Document and a DTD. In ACM Symposium on Applied Computing (SAC), 2005.

[14] S. Flesca, F Furfaro, S. Greco, and E. Zumpano. Querying and Repairing Inconsistent XML Data. In Web Information Systems Engineering (WISE), pages 175–188, 2005.

[29] M. Y. Vardi. The Complexity of Relational Query Languages. In ACM Symposium on Theory of Computing (STOC), pages 137–146, 1982.

[15] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient Management of Inconsistent Databases. In ACM SIGMOD International Conference on Management of Data, 2005.

[30] S. Vllalobos. Consistent Answers to Queries Posed to Inconsistent XML Databases. Master’s thesis, Catholic University of Chile (PUC), 2003.

[16] G. Gottlob, C. Koch, and R. Pichler. Efficient Algorithms for Processing XPath Queries. In International Conference on Very Large Data Bases (VLDB), pages 95–106, 2002.

[31] W3C. XML path language (XPath 1.0), 1999.

12