
Institute für Wirtschaftsinformatik, No. 09.01 / May 2009

Inclusion Dependencies in XML: Extending Relational Semantics

Michael Karlinger Millist Vincent Michael Schrefl

Research Report 09.01

Editors' preamble. The research reports of the institutes for business informatics (Institute für Wirtschaftsinformatik) present preliminary results, e.g. project reports and intermediate results, which will usually be revised for subsequent publication. The authors appreciate critical comments. All rights reserved.

Editors
o. Univ.-Prof. Dipl.-Ing. Dr. Gustav Pomberger, Institut für Wirtschaftsinformatik – Software Engineering
o. Univ.-Prof. Mag. Dr. Friedrich Roithmayr, Institut für Wirtschaftsinformatik – Information Engineering
o. Univ.-Prof. Dipl.-Ing. Dr. Michael Schrefl, Institut für Wirtschaftsinformatik – Data & Knowledge Engineering
o. Univ.-Prof. Dipl.-Ing. Dr. Christian Stary, Institut für Wirtschaftsinformatik – Communications Engineering

Authors
Michael Karlinger¹, Millist Vincent², Michael Schrefl¹
¹ Johannes Kepler Universität Linz, Institut für Wirtschaftsinformatik - Data & Knowledge Engineering, Altenberger Str. 69, A-4040 Linz, Austria, http://www.dke.jku.at
² University of South Australia, School of Computer and Information Science, Mawson Lakes, Adelaide, South Australia SA 5001, http://www.cis.unisa.edu.au/

Inclusion Dependencies in XML: Extending Relational Semantics (Internal Technical Report)

Michael Karlinger¹, Millist Vincent², and Michael Schrefl¹

¹ Johannes Kepler University, Linz, Austria
² University of South Australia, Adelaide, Australia

Abstract. In this article we define a new type of integrity constraint in XML, called an XML inclusion constraint (XIND), and show that it extends the semantics of a relational inclusion dependency. This property is important in areas such as XML publishing and ‘data-centric’ XML, and is one that is not possessed by other proposals for XML inclusion constraints. We also investigate the implication and consistency problems for XINDs in complete XML documents, a class of XML documents that generalizes the notion of a complete relation, and present an axiom system that we show to be sound and complete.

1 Introduction

Integrity constraints are one of the oldest and most important topics in database research, and they find application in a variety of areas such as database design, data translation, query optimization and data storage [1]. With the adoption of the eXtensible Markup Language (XML) [11] as the industry standard for data interchange over the internet, and its increasing usage as a format for the permanent storage of data in database systems [12], the study of integrity constraints in XML has increased in importance in recent years. In this article we investigate the topic of inclusion constraints in XML. We use the syntactic framework of the keyref mechanism in XML Schema [11], where both the LHS and RHS of the constraint include a selector, which is used to select elements in the XML document, followed by a sequence of fields, which are used to specify the descendant nodes that are required to match in the document. This general idea of requiring selected elements in a document to have matching descendant nodes is also found in other approaches towards inclusion constraints in XML [3, 9, 10]. While the syntactic framework of the keyref mechanism in XML Schema is an expressive one, both it and other proposals for XML inclusion constraints [3, 6, 9, 10, 14] have some important limitations from the perspective of semantics. In particular, these proposals for XML inclusion constraints do not always allow one to extend the semantics of a relational inclusion dependency (IND), which we now illustrate by the example of an XML publishing scenario.

Figure 1 shows two relations, teaches and offer. Relation teaches stores the details of courses taught by lecturers in a department, where lec is the name of the lecturer, cno is the identifier of the course they are teaching, day is the day of the week on which the course is taught and sem is the semester in which the course is taught. Relation offer stores all the courses offered by the university, where cno, day and sem have the same meaning as in teaches. We note that the key for offer is {cno, day, sem} and the key for teaches is {lec, cno, day, sem}, so more than one lecturer can teach a course. The database also satisfies the IND teaches[cno, day, sem] ⊆ offer[cno, day, sem], which specifies that a lecturer can only teach courses that are offered by the university. Suppose we now map the relational data to XML by first mapping the two relations to separate documents with root nodes offer and teaches and then combining these documents into a single document with root node uni, as shown in Figure 1. In particular, the tuples in relation offer were directly mapped to elements with tag course. Concerning the relation teaches, a nesting on {lec, day, sem} preceded the direct mapping of the (nested) tuples, within which tags course and info were introduced.

Flat relations:

teaches
lec  cno  day  sem
L1   C1   TUE  09S
L1   C1   MON  08W

offer
cno  day  sem
C1   TUE  09S
C1   MON  08W

Nested relation:

cno  {lec  day  sem}
C1    L1   TUE  09S
      L1   MON  08W

[XML document: tree with root node uni; the nodes marked 1, 2 and 3 are referred to in the text.]

Fig. 1. Example Relations and XML Document
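For illustration, the relational IND of this example can be checked directly on the tuples shown in Figure 1. The following Python sketch is not part of the original report; the function name satisfies_ind and the encoding of relations as lists of dictionaries are assumptions made only for this example.

```python
def satisfies_ind(r1, attrs1, r2, attrs2):
    """Return True iff the projection r1[attrs1] is included in r2[attrs2]."""
    rhs = {tuple(t[a] for a in attrs2) for t in r2}
    return all(tuple(t[a] for a in attrs1) in rhs for t in r1)


# The flat relations of Figure 1, encoded as lists of dicts (an assumed encoding).
teaches = [
    {"lec": "L1", "cno": "C1", "day": "TUE", "sem": "09S"},
    {"lec": "L1", "cno": "C1", "day": "MON", "sem": "08W"},
]
offer = [
    {"cno": "C1", "day": "TUE", "sem": "09S"},
    {"cno": "C1", "day": "MON", "sem": "08W"},
]

COLS = ["cno", "day", "sem"]
# teaches[cno, day, sem] ⊆ offer[cno, day, sem]
ind_holds = satisfies_ind(teaches, COLS, offer, COLS)
print(ind_holds)
```

Adding a teaches tuple for a course that is not offered would make the check fail, which is the behaviour the IND is meant to capture.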

Now the XML document satisfies an inclusion constraint because of the original IND, but one cannot express this inclusion constraint by χ = ((uni.teaches.course, [cno, info.day, info.sem]) ⊆ (uni.offer.course, [cno, day, sem])), where uni.teaches.course and uni.offer.course are the LHS and RHS selectors and [cno, info.day, info.sem] and [cno, day, sem] are the LHS and RHS fields, when applying the semantics given in [3, 6, 9–11, 14]. In particular, the keyref mechanism [11] requires that each selector node has at most one descendant node per field. This does not hold in our example, since for the LHS selector uni.teaches.course there exist two descendant info.sem nodes, which are marked with 2 and 3 in Figure 1. The approaches in [3, 10] require the field nodes to be attributes of the selector nodes. In our example, the LHS fields info.day and info.sem are not attributes of the LHS selector uni.teaches.course, and therefore the approaches in [3, 10] cannot express the constraint χ. Finally, the proposals in [6, 9, 14] require that every possible combination of nodes from [cno, info.day, info.sem] within a uni.teaches.course node has matching nodes in [cno, day, sem] within a uni.offer.course node. Since one such combination is {C1, TUE, 08W}, this requires a uni.offer.course node with child nodes {C1, TUE, 08W}, which clearly does not exist.

In this paper, we propose different semantics so that the constraint χ holds in our example. The key idea is that we do not allow arbitrary combinations of nodes from the LHS fields; we only allow nodes that are closely related by what we will define later as the closest property. So, for example in Figure 1, the day and sem nodes marked with 1 and 2 satisfy the closest property, but nodes 1 and 3 do not. The motivation for this restriction is that in the relational model data values that appear in the same tuple are more closely related than those that belong to different tuples. Our closest notion extends this idea to XML and hence allows relational semantics to be extended.

Having an XML inclusion constraint that extends the semantics of an IND is important in several areas. Firstly, in the area of XML publishing [8], where a source relational database has to be mapped to a single predefined XML schema, knowing how relational integrity constraints map to XML integrity constraints allows the XML document to preserve the original semantics. This argument also applies to ‘data-centric’ XML [12], where XML databases (not necessarily with predefined schemas) are generated from relational databases.

The first contribution of this article is to define an XML inclusion constraint (called XIND) that extends the semantics of an IND.
While the constraint is defined for any XML document (tree), we show that in the special case where the XML tree is generated by first mapping complete relations to nested relations by an arbitrary sequence of nest operations, and then directly to an XML tree, the database satisfies the IND if and only if the XML tree satisfies the corresponding XIND.

The second contribution of this article is to address the implication problem, i.e. the question of whether a new XIND holds given a set of existing XINDs, and the consistency problem, i.e. the question of whether there exists at least one non-empty XML tree that satisfies a given set of XINDs. We consider in our analysis a class of XINDs which we call core XINDs. This class excludes from all XINDs the very limited set of XINDs that interact with structural constraints of an XML document and, as a consequence, impose a constraint comparable to a fixed value constraint in a DTD [11]. We do not address this exceptional case, since we believe that it is not the intent of an XIND to impose a fixed value constraint. Also, we focus in our analysis on a class of XML trees introduced in previous work by one of the authors [13], called complete XML trees, which is intuitively the class of XML trees that contain ‘no missing data’. Within this context, we present an axiom system for XIND implication and show that the system is sound and complete. While our axiom system contains rules that parallel those of INDs [1], it also contains additional rules that have no parallel in the IND system, reflecting the fact, as we soon discuss, that complete XML trees are more general than complete relations. Our proof techniques are based on a chase style algorithm, and we show that this algorithm can be used to solve the consistency problem and to provide a decision procedure for XIND implication.

The motivation for complete XML trees is to extend the notion of a complete relation to XML. A complete XML tree is however a more general notion than a complete relation, since it includes trees that cannot be mapped to complete relations, such as those that contain duplicate nodes or subtrees, and trees that contain element leaf nodes rather than only text or attribute leaf nodes. Our motivation for considering the implication and consistency problems related to XINDs in complete XML trees is that while XML explicitly caters for irregularly structured data, it is also widely used in more traditional business applications involving regularly structured data [12], often referred to as ‘data-centric’ XML, and complete XML trees are a natural subclass in such applications.

The rest of the paper is organized as follows. Preliminary definitions are given in Section 2 and our definition of an XIND in Section 3. Section 4 shows that an XIND extends the semantics of an IND, and the implication and consistency problems are addressed in Section 5. Finally, Section 6 discusses related work.

2 XML Trees, Paths and Reachable Nodes

In this section we present some preliminary definitions. First, following the model adopted by XPath and DOM [11], we model an XML document as a tree as follows. We assume countably infinite, disjoint sets 𝐄 and 𝐀 of element and attribute labels respectively, and the symbol S indicating text. Thereby, the set of labels that can occur in the XML tree, 𝐋, is defined by 𝐋 = 𝐄 ∪ 𝐀 ∪ {S}.

Definition 1. An XML tree T is defined by T = (V, E, lab, val, vρ), where
- V is a finite, non-empty set of nodes;
- the total function lab : V → 𝐋 assigns a label to every node in V. A node v is called an element node if lab(v) ∈ 𝐄, an attribute node if lab(v) ∈ 𝐀, and a text node if lab(v) = S;
- vρ ∈ V is a distinguished element node, called the root node, and lab(vρ) = ρ;
- the parent-child relation E ⊂ V × V defines the directed edges connecting the nodes in V and is required to form a tree structure rooted at node vρ. Thereby, for every edge (v, v̄) ∈ E, (i) v is an element node and is said to be the parent of v̄. Conversely, v̄ is said to be a child of v; (ii) if v̄ is an attribute node, then there does not exist a node ṽ ∈ V and an edge (v, ṽ) ∈ E such that lab(ṽ) = lab(v̄) and ṽ ≠ v̄;
- the partial function val : V → string assigns a string value to every attribute and text node in V.

In addition to the parent of a node v in a tree T, we define its ancestor nodes, denoted by ancestor(v), to be the transitive closure of parents of v. An example of an XML tree is presented in Figure 2, where 𝐄 = {ρ, offer, teaches, course, info} and 𝐀 = {cno, day, lec, sem}.

vρ ρ
  v1 offer
    v2 course
      v3 cno = C1
      v4 day = TUE
      v5 sem = 09S
    v6 course
      v7 cno = C1
      v8 day = MON
      v9 sem = 08W
  v10 teaches
    v11 course
      v12 cno = C1
      v13 info
        v14 lec = L1
        v15 day = TUE
        v16 sem = 09S
      v17 info
        v18 lec = L1
        v19 day = MON
        v20 sem = 08W

Fig. 2. An XML tree
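As a rough sketch of how the tree model of Definition 1 might be represented in code, the tree of Figure 2 can be stored as a mapping from node identifiers to (label, parent, value) triples. This encoding, the ASCII label "rho" for ρ, the identifier "vr" for the root node, and the helper names are assumptions of this example and do not come from the report.

```python
# Each node id maps to (label, parent id, value); value is None for element nodes.
FIG2 = {
    "vr": ("rho", None, None),
    "v1": ("offer", "vr", None),
    "v2": ("course", "v1", None),
    "v3": ("cno", "v2", "C1"), "v4": ("day", "v2", "TUE"), "v5": ("sem", "v2", "09S"),
    "v6": ("course", "v1", None),
    "v7": ("cno", "v6", "C1"), "v8": ("day", "v6", "MON"), "v9": ("sem", "v6", "08W"),
    "v10": ("teaches", "vr", None),
    "v11": ("course", "v10", None),
    "v12": ("cno", "v11", "C1"),
    "v13": ("info", "v11", None),
    "v14": ("lec", "v13", "L1"), "v15": ("day", "v13", "TUE"), "v16": ("sem", "v13", "09S"),
    "v17": ("info", "v11", None),
    "v18": ("lec", "v17", "L1"), "v19": ("day", "v17", "MON"), "v20": ("sem", "v17", "08W"),
}
ATTRIBUTE_LABELS = {"cno", "day", "lec", "sem"}


def ancestors(tree, v):
    """Strict ancestors of v, nearest first (transitive closure of parent)."""
    result = []
    parent = tree[v][1]
    while parent is not None:
        result.append(parent)
        parent = tree[parent][1]
    return result


def attribute_children_unique(tree, attribute_labels):
    """Condition (ii) of Definition 1: no node has two attribute children
    carrying the same label."""
    seen = set()
    for label, parent, _value in tree.values():
        if label in attribute_labels:
            if (parent, label) in seen:
                return False
            seen.add((parent, label))
    return True


print(ancestors(FIG2, "v15"))
print(attribute_children_unique(FIG2, ATTRIBUTE_LABELS))
```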

The notion of a path, which we now present, is central to all work on XML integrity constraints.

Definition 2. A path P = l1.···.ln is a non-empty sequence of labels (possibly with duplicates) from 𝐋. Also, path P is defined to be legal, if for all i ∈ [1, n]³, li ∈ 𝐄 (the set of element labels) if i < n.

For example, referring to Figure 2, offer.course and ρ.cno.course are paths but not legal ones, whereas ρ.offer.course is a legal path. We also define the following frequently required operators on paths.

Definition 3. Given paths P = l1.···.lm and P̄ = l̄1.···.l̄n, we define
- P to be a prefix of P̄, denoted by P ⊆ P̄, if li = l̄i for all i ∈ [1, m], and m ≤ n.
- P to be a strict prefix of P̄, denoted by P ⊂ P̄, if P ⊆ P̄ and m < n.
- P to be equal to P̄, denoted by P = P̄, if P ⊆ P̄ and P̄ ⊆ P.
- the concatenation of P and P̄, denoted by P.P̄, to be l1.···.lm.l̄1.···.l̄n.
- the intersection of P and P̄ if both are legal paths, denoted by P ∩ P̄, to be the longest path that is a prefix of both P and P̄.
- last(P) = lm, the last label in P.
- parent(P) = l1.···.lm−1 if m > 1, the longest strict prefix of P.
- length(P) = m, the length of P.

For example, the path ρ.offer is a strict prefix of ρ.offer.course, and if P = ρ.offer.course.cno and P̄ = ρ.offer.course.sem, then P ∩ P̄ = ρ.offer.course. We now define a path instance, which is essentially a downward sequence of nodes in an XML tree.

Definition 4. A path instance p = v1.···.vn in a tree T = (V, E, lab, val, vρ) is a non-empty sequence of nodes in V such that v1 = vρ and for all i ∈ [2, n], vi−1 = parent(vi). The path instance p is said to be defined over a path P = l1.···.ln, if lab(vi) = li for all i ∈ [1, n].

³ [1, n] denotes the set {1, …, n}.
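The path operators of Definition 3 translate almost literally into operations on label sequences. The following sketch assumes that a path is encoded as a Python tuple of labels, with "rho" standing for ρ; the function names are chosen for this example only.

```python
def is_prefix(p, q):
    """P ⊆ P̄: p agrees with q on its first len(p) labels."""
    return len(p) <= len(q) and q[:len(p)] == p


def is_strict_prefix(p, q):
    """P ⊂ P̄: p is a prefix of q and shorter than q."""
    return is_prefix(p, q) and len(p) < len(q)


def intersection(p, q):
    """P ∩ P̄: the longest path that is a prefix of both p and q."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


def parent(p):
    """parent(P): the longest strict prefix, defined only for length > 1."""
    if len(p) < 2:
        raise ValueError("parent is only defined for paths of length > 1")
    return p[:-1]


def last(p):
    """last(P): the last label of the path."""
    return p[-1]


P = ("rho", "offer", "course", "cno")
Q = ("rho", "offer", "course", "sem")
print(intersection(P, Q))
print(is_strict_prefix(("rho", "offer"), ("rho", "offer", "course")))
```

Concatenation needs no helper in this encoding, since tuple addition (p + q) already joins two label sequences.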

For example, referring to Figure 2, vρ.v1.v2 is a path instance, and this path instance is defined over path ρ.offer.course. We also define the following frequently required operators on path instances.

Definition 5. Given path instances p = v1.···.vn and p̄ = v̄1.···.v̄m, we define
- p to be a prefix of p̄, denoted by p ⊆ p̄, if vi = v̄i for all i ∈ [1, n], and n ≤ m.
- p to be a strict prefix of p̄, denoted by p ⊂ p̄, if p ⊆ p̄ and n < m.
- p to be equal to p̄, denoted by p = p̄, if p ⊆ p̄ and p̄ ⊆ p.
- the concatenation of p and p̄, denoted by p.p̄, to be v1.···.vn.v̄1.···.v̄m.
- last(p) = vn to return the last node in p.

The next definition specifies the set of nodes reachable in a tree T from the root node by following a path P.

Definition 6. Given a tree T = (V, E, lab, val, vρ) and a legal path P, the function N(P, T) returns the set of nodes defined by {v ∈ V | v is the final node in path instance p and p is defined over P}.

For instance, if T is the tree in Figure 2 and P = ρ.offer.course.day, then N(P, T) = {v4, v8}. We note that it follows from our tree model that for every node v in a tree T there is exactly one path instance p such that v is the final node in p, and therefore N(P, T) ∩ N(P̄, T) = ∅ if P ≠ P̄. We therefore say that P is the path such that v ∈ N(P, T).
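A minimal sketch of the function N(P, T) follows, restricted to the offer part of Figure 2 to keep the example short. The (label, parent, value) encoding of nodes, the label "rho" for ρ, and the names path_of and reachable are assumptions of this example.

```python
# The offer part of the tree in Figure 2: node id -> (label, parent id, value).
OFFER_PART = {
    "vr": ("rho", None, None),
    "v1": ("offer", "vr", None),
    "v2": ("course", "v1", None),
    "v3": ("cno", "v2", "C1"), "v4": ("day", "v2", "TUE"), "v5": ("sem", "v2", "09S"),
    "v6": ("course", "v1", None),
    "v7": ("cno", "v6", "C1"), "v8": ("day", "v6", "MON"), "v9": ("sem", "v6", "08W"),
}


def path_of(tree, v):
    """The unique path over which the path instance ending in v is defined."""
    labels = []
    while v is not None:
        label, parent, _value = tree[v]
        labels.append(label)
        v = parent
    return tuple(reversed(labels))


def reachable(tree, path):
    """N(P, T): all nodes that end a path instance defined over `path`."""
    return {v for v in tree if path_of(tree, v) == path}


print(sorted(reachable(OFFER_PART, ("rho", "offer", "course", "day"))))
```

Because every node has exactly one path, two calls of reachable with different paths always return disjoint sets, which mirrors the remark following Definition 6.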

3 Defining XML Inclusion Dependencies

In this section we present the syntax and semantics of our definition of an XIND, starting with the syntax.

Definition 7. An XML Inclusion Dependency is a statement of the form (P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n]), where P and P′ are paths called LHS and RHS selector, and P1, …, Pn and P′1, …, P′n are non-empty sequences of paths, called LHS and RHS fields, such that for all i ∈ [1, n], P.Pi and P′.P′i are legal paths ending in an attribute or text label.

We now compare this definition to the keyref mechanism in XML Schema, which is the basis for the syntax of an XIND. (i) We only consider simple paths in the selectors and fields, whereas the keyref mechanism allows for a restricted form of XPath expressions. (ii) In contrast to an XIND, the keyref mechanism also allows for relative constraints, whereby the inclusion constraint is only evaluated in part of the XML tree. (iii) The restrictions on fields mean that we only consider inclusion between text/attribute nodes, whereas the keyref mechanism also allows for inclusion between element nodes. We should mention that the restrictions discussed in (i) - (iii) are not intrinsic to our approach, and our definition of an XIND can easily be extended to handle

these extensions. Our reason for not considering these extensions here is so that we can concentrate on the main contribution of our paper, which is to apply different semantics to an XIND so as to extend relational semantics.

To define the semantics of an XIND, we first make a preliminary definition (first presented in [13]) that is central to our approach. The intuition behind it, which will be made more precise in the next section, is as follows. In defining relational integrity constraints such as FDs or INDs, it is implicit that the relevant data values from either the LHS or RHS of the constraint belong to the same tuple. The closest definition extends this property of two data values belonging to the same tuple to XML, that is, if two nodes in the XML tree satisfy the closest property, then ‘they belong to the same tuple’.

Definition 8. Given nodes v1 and v2 in an XML tree T, the boolean function closest(v1, v2) is defined to return true, iff there exists a node v21 such that (i) v21 ∈ aancestor(v1), and (ii) v21 ∈ aancestor(v2), and (iii) v21 ∈ N(P1 ∩ P2, T), where P1 and P2 are the paths such that v1 ∈ N(P1, T) and v2 ∈ N(P2, T), and the aancestor function is defined by aancestor(v) = ancestor(v) ∪ {v}.

For instance, in Figure 2 closest(v3, v4) is true. This is because P = ρ.offer.course.cno and P̄ = ρ.offer.course.day are the paths such that v3 ∈ N(P, T) and v4 ∈ N(P̄, T), and v3 and v4 have the common ancestor node v2 ∈ N(ρ.offer.course, T), where ρ.offer.course = P ∩ P̄. However, closest(v3, v8) is false since v3 and v8 have no common ancestor node in N(ρ.offer.course, T), and closest(v3, v7) is false because v3 and v7 have no common ancestor node in N(ρ.offer.course.cno, T). This leads to the definition of the semantics of an XIND.

Definition 9. An XML tree T satisfies an XIND σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])), denoted by T ⊨ σ, iff whenever there exists an LHS selector node v and corresponding field nodes v1, …, vn such that:
(i) v ∈ N(P, T),
(ii) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v ∈ ancestor(vi),
(iii) for all i, j ∈ [1, n], closest(vi, vj) = true,
then there exists an RHS selector node v′ and field nodes v′1, …, v′n such that
(i′) v′ ∈ N(P′, T),
(ii′) for all i ∈ [1, n], v′i ∈ N(P′.P′i, T) and v′ ∈ ancestor(v′i),
(iii′) for all i, j ∈ [1, n], closest(v′i, v′j) = true,
(iv′) for all i ∈ [1, n], val(vi) = val(v′i).

For instance, the XML tree in Figure 2 satisfies the XIND χ = ((ρ.teaches.course, [cno, info.day, info.sem]) ⊆ (ρ.offer.course, [cno, day, sem])). This is because the only sequences of LHS field nodes that pairwise satisfy the closest property are v12, v15, v16 and v12, v19, v20, and v12, v15, v16 is value equal to the sequence of RHS field nodes v3, v4, v5 and v12, v19, v20 is value equal to v7, v8, v9. The essential difference between an XIND and other proposals is

that we require the sequence of field nodes (both LHS and RHS) generated by the cross product to also satisfy the closest property, whereas other proposals [6, 9, 14] do not contain this additional restriction. As a consequence, the constraint χ is violated in the XML tree in Figure 2 according to the other proposals. We also make the point that we do not address the situation where there may be no node for an LHS field: we only require the inclusion between LHS and RHS field nodes when there is a node for every LHS field. This is consistent with other work in the area, for example the proposal for XML keys in [5].
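Definitions 8 and 9 can be tried out on the tree of Figure 2 with a brute-force check. The sketch below is only an illustration under stated assumptions: nodes are encoded as (label, parent, value) triples, "rho" stands for ρ, and all function names as well as the use_closest switch are inventions of this example. Setting use_closest to False drops condition (iii), which approximates the cross-product semantics attributed above to the other proposals.

```python
from itertools import product

# The tree of Figure 2: node id -> (label, parent id, value).
FIG2 = {
    "vr": ("rho", None, None), "v1": ("offer", "vr", None),
    "v2": ("course", "v1", None),
    "v3": ("cno", "v2", "C1"), "v4": ("day", "v2", "TUE"), "v5": ("sem", "v2", "09S"),
    "v6": ("course", "v1", None),
    "v7": ("cno", "v6", "C1"), "v8": ("day", "v6", "MON"), "v9": ("sem", "v6", "08W"),
    "v10": ("teaches", "vr", None), "v11": ("course", "v10", None),
    "v12": ("cno", "v11", "C1"), "v13": ("info", "v11", None),
    "v14": ("lec", "v13", "L1"), "v15": ("day", "v13", "TUE"), "v16": ("sem", "v13", "09S"),
    "v17": ("info", "v11", None),
    "v18": ("lec", "v17", "L1"), "v19": ("day", "v17", "MON"), "v20": ("sem", "v17", "08W"),
}


def aancestors(tree, v):
    """aancestor(v): the node itself together with all its ancestors."""
    nodes = set()
    while v is not None:
        nodes.add(v)
        v = tree[v][1]
    return nodes


def path_of(tree, v):
    """Labels from the root down to v."""
    labels = []
    while v is not None:
        labels.append(tree[v][0])
        v = tree[v][1]
    return tuple(reversed(labels))


def nodes_at(tree, path):
    """N(P, T) as a list."""
    return [v for v in tree if path_of(tree, v) == path]


def closest(tree, v1, v2):
    """Definition 8: a shared aancestor lies on the intersection path."""
    p1, p2 = path_of(tree, v1), path_of(tree, v2)
    k = 0
    while k < min(len(p1), len(p2)) and p1[k] == p2[k]:
        k += 1
    shared = aancestors(tree, v1) & aancestors(tree, v2)
    return any(path_of(tree, a) == p1[:k] for a in shared)


def field_values(tree, selector, fields, use_closest=True):
    """Value tuples of the field-node sequences below each selector node."""
    values = set()
    for v in nodes_at(tree, selector):
        candidates = [[u for u in nodes_at(tree, selector + f)
                       if v in aancestors(tree, u)] for f in fields]
        for combo in product(*candidates):
            if not use_closest or all(closest(tree, a, b)
                                      for a in combo for b in combo):
                values.add(tuple(tree[u][2] for u in combo))
    return values


def satisfies_xind(tree, lhs, rhs, use_closest=True):
    """Definition 9, checked as an inclusion between sets of value tuples."""
    return (field_values(tree, *lhs, use_closest)
            <= field_values(tree, *rhs, use_closest))


LHS = (("rho", "teaches", "course"), [("cno",), ("info", "day"), ("info", "sem")])
RHS = (("rho", "offer", "course"), [("cno",), ("day",), ("sem",)])
print(satisfies_xind(FIG2, LHS, RHS))
print(satisfies_xind(FIG2, LHS, RHS, use_closest=False))
```

Without the closest restriction the LHS also produces the combination (C1, TUE, 08W), which has no counterpart below any ρ.offer.course node, so the second check fails while the first succeeds.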

4 Extending Relational Semantics

In this section we justify our claim that an XIND extends the semantics of an IND by showing that, in the case where the XML tree is generated from a complete database by a very general class of mappings, the database satisfies the IND if and only if the XML tree satisfies the corresponding XIND. To show this, we first define a general class of mappings from complete relational databases to XML trees. The presentation of the mapping procedure given here is abbreviated because of space requirements, and we refer the reader to [13] for a more detailed presentation if needed.

The first step in the mapping procedure maps each initial flat relation to a nested relation by a sequence of nest operations. To be more precise, we recall that the nest operator ν_Y(R*) on a nested (or flat) relation R*, where Y is a subset of the schema 𝐑 of R*, combines the tuples in R* which are equal on R*[𝐑 − Y] into single tuples [4]. So if the initial flat relation is denoted by R, we perform an arbitrary sequence of nest operations ν_Y1, …, ν_Yn on R, and so the final nested relation R* is defined by R* = ν_Yn(··· ν_Y1(R)). For instance, in the introductory example the flat relation teaches is converted to a nested relation R* by R* = ν_{lec,day,sem}(teaches).

The next step in the mapping procedure is to map the nested relation to an XML tree by converting each sub-tuple in the nested relation to a subtree in the XML tree, using a new element node for the root of the subtree, as illustrated in the introductory example. While we don’t claim that our method is the only way to map a relation to an XML tree, it does have two desirable features. First, it allows the initial flat relation to be nested arbitrarily, which is a desirable feature in data-centric applications of XML [12]. Second, it has been shown that the mapping procedure is invertible [13], and so no information is lost by the transformation.
In the context of mapping multiple relations to XML, we extend the method just outlined as follows. We first map each relation to an XML tree as just discussed. We then replace the label in the root node by a label containing the name of the relation (which we assume to be unique), and construct a new XML tree with a new root node and with the XML trees just generated being principal subtrees. This procedure was used in the introductory example. This leads to the following important result which justifies our claim that an XIND extends the semantics of an IND.
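The nest step of the mapping procedure can be sketched for a single nest operation on a flat relation as follows. This is a simplified illustration, not the procedure of [13]: the function name nest, the list-of-dicts encoding, and the attribute name "info" chosen for the nested component are assumptions of this example, and repeated nesting of already nested relations is not handled.

```python
def nest(relation, nested_attrs, name):
    """Group the tuples of a flat relation that agree on all attributes
    outside nested_attrs, collecting their nested_attrs parts under `name`."""
    groups = {}
    for t in relation:
        # The grouping key consists of the attributes that are not nested.
        key = tuple(sorted((a, v) for a, v in t.items() if a not in nested_attrs))
        sub = {a: t[a] for a in nested_attrs}
        subs = groups.setdefault(key, [])
        if sub not in subs:  # keep set semantics for the nested component
            subs.append(sub)
    return [{**dict(key), name: subs} for key, subs in groups.items()]


teaches = [
    {"lec": "L1", "cno": "C1", "day": "TUE", "sem": "09S"},
    {"lec": "L1", "cno": "C1", "day": "MON", "sem": "08W"},
]

# Nesting on {lec, day, sem}, as in the introductory example.
nested = nest(teaches, ["lec", "day", "sem"], "info")
print(nested)
```

Each resulting tuple would then be mapped to a course element, and each nested sub-tuple to an info element below it, which corresponds to the teaches part of Figure 2.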

Theorem 1. Let complete flat relations R1 and R2 be mapped to an XML tree T by the method just outlined. Then R1 and R2 satisfy the IND 𝐑1[A1, …, An] ⊆ 𝐑2[B1, …, Bn], where 𝐑1 and 𝐑2 are the schemas of R1 and R2, iff T satisfies the XIND ((ρ.R1, [PA1, …, PAn]) ⊆ (ρ.R2, [PB1, …, PBn])), where ρ.R1.PA1, …, ρ.R1.PAn, ρ.R2.PB1, …, ρ.R2.PBn represent the paths over which the path instances in T that end in leaf nodes are defined.

For instance, we deduce from the IND teaches[cno, day, sem] ⊆ offer[cno, day, sem] the XIND ((ρ.teaches, [course.cno, course.info.day, course.info.sem]) ⊆ (ρ.offer, [course.cno, course.day, course.sem])) and, from the inference rules to be given in the next section, this XIND is equivalent to the XIND given in the introductory example, namely ((ρ.teaches.course, [cno, info.day, info.sem]) ⊆ (ρ.offer.course, [cno, day, sem])).

Proof (Theorem 1). some proof.

5 Reasoning About XML Inclusion Dependencies

We focus in our reasoning on core XINDs in complete XML trees, and we first define these key concepts in Section 5.1. We then present in Section 5.2 a chase algorithm and use this algorithm in Section 5.3 to solve the implication and consistency problems related to core XINDs in complete XML trees.

5.1 The Framework: Core XINDs in Complete XML Trees

From a general point of view, before requiring the data in an XML document to be complete, one first has to specify the structure of the information that the document is expected to contain. We use a set of legal paths 𝐏 to specify the structure of the expected information in an XML document, and now define what we mean by an XML tree conforming to 𝐏.

Definition 10. A tree T is defined to conform to a set of legal paths 𝐏, if for every node v in T, if P is the path such that v ∈ N(P, T), then P ∈ 𝐏.

For example, if we denote the subtree rooted at node v1 in Figure 2 by T1, then T1 conforms to the set of paths 𝐏1 = {offer, offer.course, offer.course.cno, offer.course.day, offer.course.sem}. We now introduce the concept of a complete XML tree, which extends the notion of a complete relation to XML. To understand the intuition, consider again the subtree T1 in Figure 2 and the set of paths 𝐏̄1 = 𝐏1 ∪ {offer.course.max}. Then T1 also conforms to 𝐏̄1, but we do not consider it to be complete w.r.t. 𝐏̄1, since the existence of the path offer.course.max means that we expect every course in T1 to have a max number of students, which is not satisfied by nodes v2 and v6 in Figure 2. We now make this idea more precise.

Definition 11. If T is a tree that conforms to a set of paths 𝐏, then T is defined to be complete w.r.t. 𝐏, if whenever P and P̄ are paths in 𝐏 such that P ⊂ P̄, and there exists node v ∈ N(P, T), then there also exists node v̄ ∈ N(P̄, T) such that v ∈ ancestor(v̄).

For instance, as just noted, T1 is not complete w.r.t. 𝐏̄1 but it is complete w.r.t. 𝐏1. This example also illustrates an important point. Unlike the relational case, the completeness of a tree is only defined w.r.t. a specific set of paths and so, as we have just seen, a tree may conform to two different sets of paths, but may be complete w.r.t. one set but not the other. We also note that if a tree T is complete w.r.t. a set of paths 𝐏, then 𝐏 is what we call downward-closed.

Definition 12. Given paths P and P̃, a set of paths 𝐏 is downward-closed if whenever P ∈ 𝐏 and P̃ ⊂ P, then P̃ ∈ 𝐏.

For example the sets 𝐏1 and 𝐏̄1 are downward-closed. We now turn to the class of XINDs that we consider in our reasoning. It is natural to expect that if an XIND σ is intended to apply to an XML tree T, then the constraint imposed by σ should belong to the information represented by T. We incorporate this idea by requiring that the paths in an XIND are taken from the set of paths to which the targeted tree conforms, which we now define.

Definition 13. An XIND σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) is defined to conform to a set of paths 𝐏, if for all i ∈ [1, n], P.Pi ∈ 𝐏 and P′.P′i ∈ 𝐏.

We also place another restriction on an XIND, motivated by our belief that an XIND σ should not enforce, as a hidden side effect, that each node in a set of nodes in a tree T must have the same value. Suppose then that for the RHS selector of σ, P′ = ρ, and for some i ∈ [1, n], the RHS field P′i is an attribute label. Then since there is only one root node, and in turn at most one attribute node in N(P′.P′i, T), the semantics of σ means that every node in N(P.Pi, T) ∪ N(P′.P′i, T) must have the same value.
We believe that this is not the intent of an XIND, and that such a constraint should instead be specified explicitly in a DTD or XSD. Since the study of the interaction between structural constraints and integrity constraints is known to be a complex one [3], and outside the scope of this paper, we exclude such an XIND and this leads to the following definition.

Definition 14. An XIND ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) is a core XIND, if in case that P′ = ρ, then there does not exist an RHS field P′i such that length(P′i) = 1 and P′i ends in an attribute label.

5.2 The Chase for Core XINDs in Complete Trees

The chase is a recursive algorithm that takes as input (i) a set of paths 𝐏, (ii) a tree T that is complete w.r.t. 𝐏, and (iii) a set of XINDs Σ that conforms to 𝐏, and adds new nodes to T such that T ⊨ Σ. From a bird's-eye view, the chase halts if the input tree Ts for a (recursive) step s satisfies Σ, and otherwise it
(i) chooses an XIND σs = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) from Σ, such that Ts ⊭ σs because of a sequence of nodes [v0, v1, …, vn], where v0 is an LHS selector node for σs and v1, …, vn are corresponding field nodes, and
(ii) creates new nodes in Ts, such that the resulting tree Ts+1 contains an RHS selector node v′0 and field nodes v′1, …, v′n that remove the violation.

Ts:
vρ ρ
  v1 ass
    v2 name = A1
  v3 ass
    v4 name = A2
  v5 emp
    v6 name = A3
    v7 room = R1

Ts+1 extends Ts by the following nodes:
  v8 emp (child of vρ)
    v9 name = A1
    v10 room = 0

𝐏 = {ρ, ρ.ass, ρ.ass.name, ρ.emp, ρ.emp.name, ρ.emp.room}
Σ = {σ = ((ρ.ass, [name]) ⊆ (ρ.emp, [name]))}

Fig. 3. Example Chase Step
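The step depicted in Figure 3 can be reproduced with a deliberately simplified sketch. It is not the chase of Algorithm 1: it assumes a single XIND with exactly one field on each side, runs iteratively rather than recursively, relies on the dictionary insertion order as document order, and generates node identifiers by counting. The (label, parent, value) encoding, the label "rho" for ρ and all names are assumptions of this example.

```python
from itertools import count

# Paths of Figure 3 in prefix order; LHS/RHS are selector.field of σ.
PATHS = [("rho",), ("rho", "ass"), ("rho", "ass", "name"),
         ("rho", "emp"), ("rho", "emp", "name"), ("rho", "emp", "room")]
LEAF_LABELS = {"name", "room"}
LHS = ("rho", "ass", "name")
RHS = ("rho", "emp", "name")


def figure3_tree():
    """Tree Ts of Figure 3: node id -> (label, parent id, value)."""
    return {
        "vr": ("rho", None, None),
        "v1": ("ass", "vr", None), "v2": ("name", "v1", "A1"),
        "v3": ("ass", "vr", None), "v4": ("name", "v3", "A2"),
        "v5": ("emp", "vr", None), "v6": ("name", "v5", "A3"),
        "v7": ("room", "v5", "R1"),
    }


def path_of(tree, v):
    labels = []
    while v is not None:
        labels.append(tree[v][0])
        v = tree[v][1]
    return tuple(reversed(labels))


def nodes_at(tree, path):
    # Insertion order of the dict is used as document order.
    return [v for v in tree if path_of(tree, v) == path]


def common_prefix(p, q):
    k = 0
    while k < min(len(p), len(q)) and p[k] == q[k]:
        k += 1
    return p[:k]


def first_violation(tree, lhs, rhs):
    """First LHS field node (in document order) without a matching RHS value."""
    rhs_values = {tree[v][2] for v in nodes_at(tree, rhs)}
    for v in nodes_at(tree, lhs):
        if tree[v][2] not in rhs_values:
            return v
    return None


def chase_step(tree, paths, rhs, value, fresh):
    """Create one new path instance per path that shares more than the
    root label with the RHS field path; unknown leaf values become "0"."""
    created = {paths[0]: nodes_at(tree, paths[0])[0]}  # starts with the root
    for r in paths:
        if len(common_prefix(r, rhs)) <= 1:
            continue
        if r == rhs:
            val = value
        else:
            val = "0" if r[-1] in LEAF_LABELS else None
        node = f"v{next(fresh)}"
        tree[node] = (r[-1], created[r[:-1]], val)  # appended as last child
        created[r] = node


def chase(tree, paths, lhs, rhs):
    fresh = count(len(tree))  # assumes ids v1..vk are already in use
    while True:
        v = first_violation(tree, lhs, rhs)
        if v is None:
            return tree
        chase_step(tree, paths, rhs, tree[v][2], fresh)


ts = figure3_tree()
fresh_ids = count(len(ts))
violating = first_violation(ts, LHS, RHS)
chase_step(ts, PATHS, RHS, ts[violating][2], fresh_ids)  # ts now plays the role of Ts+1
print(ts["v8"], ts["v9"], ts["v10"])
```

After this single step the sequence [v3, v4] still violates σ, so a second step is needed before the tree satisfies Σ; the chase function above simply repeats the step until no violation remains.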

We now illustrate a step s in the chase by the example depicted in Figure 3. Here the XIND σ is violated in tree Ts by the sequences of nodes [v1, v2] and [v3, v4]. Given that the sequence [v1, v2] is chosen, the chase creates nodes v8, v9 in tree Ts+1, which remove the violation, and then adds node v10 as a child of v8, in order that Ts+1 is complete w.r.t. the set of paths 𝐏.

The procedure just outlined is non-deterministic, since both the choice of σs and the choice of a violating sequence of nodes [v0, v1, …, vn] in a step s are arbitrary. These choices however essentially determine the characteristics of the trees generated by the steps of the chase, and thus our proof techniques, which are based on certain characteristics of the generated trees. We therefore designed a deterministic chase algorithm, depicted in Algorithm 1, that results in a unique tree. The essential prerequisite is the following, simplified version of document order.

Definition 15. In a tree T, node v is defined to precede node v̄ w.r.t. document order, denoted by v ≺ v̄, if v is visited before v̄ in a pre-order traversal of T.

We now illustrate how uniqueness of the tree Ts+1 resulting from a step s of the chase is achieved. First, the choice of σs at Line 2 in Algorithm 1 is deterministic, given that σs is the first XIND in Σ that is violated in Ts, and that Σ is expected to be a sequence, rather than a set, of XINDs. Second, in case that there is more than one sequence of violating nodes, then the one that is, roughly speaking, in the top-left of tree Ts is chosen at Lines 3-9. Referring to the example in Figure 3, the chase deliberately chooses the sequence of violating nodes [v1, v2], since v1 ≺ v3. Third, the procedure for removing the violation in a step s is deterministic. In particular, the chase loops for this purpose over the paths in 𝐏 and creates path instances accordingly (cf. Lines 13 - 25), such that the resulting tree Ts+1 is complete w.r.t. 𝐏 and contains a sequence of nodes [v′0, v′1, …, v′n] that removes the violation. The desired uniqueness is basically achieved, since (i) 𝐏 is expected to be a sequence, rather than a set, of

Algorithm 1 Chase(P, T, Σ)
in:  A downward-closed sequence of legal paths P = [R1, ..., Rm] in prefix-order
     A tree T = (V, E, lab, val, vρ) that is complete w.r.t. P
     A sequence of XINDs Σ that conforms to P
out: Tree T̄ that subsumes T, is complete w.r.t. P and satisfies Σ

 1: if T ⊨ Σ then return T; end if;
 2: let σ = ((P, [P1, ..., Pn]) ⊆ (P′, [P′1, ..., P′n])) be the first XIND in Σ such that T ⊭ σ;
 3: let Y be the set of all sequences of nodes that violate σ in tree T;
 4: for i := 0 to n do                                        ▷ choose violation
 5:    repeat
 6:       choose sequences [v0, v1, ..., vn] and [v̂0, v̂1, ..., v̂n] from Y;
 7:       if vi ≺ v̂i then remove [v̂0, v̂1, ..., v̂n] from Y; end if
 8:    until no more change to Y is possible;
 9: end for
10: let [v0, v1, ..., vn] be the remaining sequence of nodes in Y;
11: let X be the set of nodes such that X = {vρ};
12: for i := 1 to m do                                        ▷ remove violation
13:    if there exists path P′x ∈ [P′1, ..., P′n] such that Ri ∩ P′.P′x ≠ ρ then
14:       let v̂ be the node in N(parent(Ri), T) ∩ X;
15:       create a new node v and add v to both V and X;
16:       set lab(v) = last(Ri);
17:       add (v̂, v) to E such that v is the last child of v̂;
18:       if there exists path P′y ∈ [P′1, ..., P′n] such that Ri = P′.P′y then    ▷ vy ∈ [v1, ..., vn]
19:          set val(v) = val(vy);
20:       else if last(Ri) is an attribute or text label then
21:          set val(v) = "0";
22:       end if
23:    end if
24: end for
25: return Chase(P, T, Σ);
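The selection of the violating sequence at Lines 3-10 can be read as taking the minimum of Y under a position-by-position comparison in document-order. The following is a minimal sketch of that reading, not the authors' implementation. It assumes a tree stored as ordered child lists; the node names and the helper names `preorder_rank` and `choose_violation` are hypothetical.

```python
from typing import Dict, List, Sequence


def preorder_rank(children: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Map every node to its position in a pre-order traversal (Definition 15)."""
    rank: Dict[str, int] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        rank[node] = len(rank)
        # Push children right-to-left so that the leftmost child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return rank


def choose_violation(violations: Sequence[Sequence[str]],
                     rank: Dict[str, int]) -> Sequence[str]:
    """Return the sequence that is smallest position by position w.r.t. document-order.

    Assumption: Lines 4-9 amount to a lexicographic minimum over the
    pre-order ranks of the nodes in each violating sequence.
    """
    return min(violations, key=lambda seq: [rank[v] for v in seq])


# Hypothetical tree: root vr with two subtrees, each holding one violation.
children = {"vr": ["v1", "v3"], "v1": ["v2"], "v3": ["v4"]}
rank = preorder_rank(children, "vr")
chosen = choose_violation([["v3", "v4"], ["v1", "v2"]], rank)
print(chosen)
```

Because the ranks of distinct sequences differ in at least one position, the minimum is unique, which is the property the deterministic chase relies on.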

paths, and therefore the succession in which the paths in P are iterated in the loop at Line 12 is deterministic, and (ii) a new node is always added to the parent as the last child w.r.t. document-order (cf. Line 17).

Before presenting our main result on the procedure of the chase, we now introduce the notions of tree subsumption and isomorphism, which we then use as criterion for the uniqueness of a tree. Essentially, we say that the final tree T̄ resulting from an application of the chase is unique, if whenever T̃ is the final tree resulting from another application of the chase on the same input, then T̄ and T̃ are isomorphic.

Definition 16. A tree T = (V, E, lab, val, vρ) is defined to be subsumed within tree T̄ = (V̄, Ē, l̄ab, v̄al, v̄ρ), denoted by T ⊆ T̄, if there exists mapping α : V → V̄ such that
(i) α(vρ) = v̄ρ, and
(ii) if (v, ṽ) ∈ E, then (α(v), α(ṽ)) ∈ Ē, and
(iii) for every node v ∈ V, lab(v) = l̄ab(α(v)), and
(iv) for every attribute or text node v ∈ V, val(v) = v̄al(α(v)), and
(v) for every pair of nodes v, ṽ ∈ V, if v ≺ ṽ in T, then α(v) ≺ α(ṽ) in T̄.

Definition 17. A tree T = (V, E, lab, val, vρ) is defined to be isomorphic to tree T̄ = (V̄, Ē, l̄ab, v̄al, v̄ρ), denoted by T ≈ T̄, if there exists mapping α : V → V̄ such that T ⊆ T̄ is established by α, and T̄ ⊆ T is established by α⁻¹.

We then have the following result on the procedure of the chase.

Lemma 1. An application of Chase(P, T, Σ) terminates and returns a unique tree T̄ that subsumes T, is complete w.r.t. P and satisfies Σ.

The next lemmas present some properties of subsumed and isomorphic trees, which we require in order to demonstrate Lemma 1.

Lemma 2. Let T = (V, E, lab, val, vρ) and T̄ = (V̄, Ē, l̄ab, v̄al, v̄ρ) be trees, and α : V → V̄ be a mapping witnessing that T ⊆ T̄. Then, given nodes v, ṽ in V,
(i) if v = parent(ṽ) then α(v) = parent(α(ṽ));
(ii) if v ∈ aancestor(ṽ) then α(v) ∈ aancestor(α(ṽ));
(iii) if p = v1. ··· .vn is the path instance in T such that v = last(p), then p̄ = α(v1). ··· .α(vn) is a path instance in T̄;
(iv) if P is the path such that v ∈ N(P, T) then α(v) ∈ N(P, T̄);
(v) if closest(v, ṽ) = true then closest(α(v), α(ṽ)) = true.

Proof (Lemma 2). (i) Given that v = parent(ṽ), (v, ṽ) is an edge in E according to Definition 1. Consequently, (α(v), α(ṽ)) is an edge in Ē according to (ii) in Definition 16 and therefore α(v) = parent(α(ṽ)).

(ii) Given that v ∈ aancestor(ṽ), either v = ṽ or v ∈ ancestor(ṽ). If v = ṽ then α(v) = α(ṽ) according to Definition 16 and thus α(v) ∈ aancestor(α(ṽ)). If instead v ∈ ancestor(ṽ), then α(v) ∈ ancestor(α(ṽ)) follows from (i) in Lemma 2, since function ancestor returns the transitive closure of parents of a node per definition. Therefore, α(v) ∈ aancestor(α(ṽ)) also in case that v ∈ ancestor(ṽ).
(iii) According to (i) in Lemma 2, α(vi−1) = parent(α(vi)) for all i ∈ [2, n]. From this together with (i) in Definition 16 it follows that p̄ = α(v1). ··· .α(vn) is a path instance in T̄.

(iv) Let p = v1. ··· .vn be the path instance in T such that v = last(p). Then α(v1). ··· .α(vn) is a path instance in T̄ according to (iii) in Lemma 2. Further, given that v ∈ N(P, T) and that v = last(p), path P = l1. ··· .ln where for all i ∈ [1, n], li = lab(vi). Also, α(v) = α(vn) given that v = last(p) = vn, and therefore α(v) = α(vn) ∈ N(P, T̄), since for all i ∈ [1, n], l̄ab(α(vi)) = lab(vi) according to (iii) in Definition 16, and lab(vi) = li as shown above.

(v) Given that closest(v, ṽ) = true, there exists node v̂ in T according to Definition 8 such that v̂ ∈ aancestor(v), v̂ ∈ aancestor(ṽ) and v̂ ∈ N(P ∩ P̃, T), where P and P̃ are the paths such that v ∈ N(P, T) and ṽ ∈ N(P̃, T). We now show that closest(α(v), α(ṽ)) = true by showing that node α(v̂) satisfies (i)-(iii) in Definition 8 w.r.t. nodes α(v) and α(ṽ). Thereby, (i) and (ii) in Definition 8, i.e. that α(v̂) ∈ aancestor(α(v)) and α(v̂) ∈ aancestor(α(ṽ)), follow from (ii) in Lemma 2 and the assumption that v̂ ∈ aancestor(v) and v̂ ∈ aancestor(ṽ), respectively. Further, (iii) in Definition 8, i.e. that α(v̂) ∈ N(P ∩ P̃, T̄), follows from (iv) in Lemma 2 and the assumption that v̂ ∈ N(P ∩ P̃, T).
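The five conditions of Definition 16 can be checked mechanically for a candidate mapping α. The sketch below is only an illustration under assumptions that are not part of the report: trees are stored as ordered child lists, `val` is a dictionary holding values for attribute and text nodes only, and the names `Tree` and `subsumes` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Tree:
    root: str
    children: Dict[str, List[str]]   # ordered child lists encode E and sibling order
    lab: Dict[str, str]
    val: Dict[str, str] = field(default_factory=dict)  # attribute/text nodes only

    def edges(self) -> Set[Tuple[str, str]]:
        return {(p, c) for p, cs in self.children.items() for c in cs}

    def preorder(self) -> List[str]:
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            out.append(node)
            stack.extend(reversed(self.children.get(node, [])))
        return out


def subsumes(t: Tree, tbar: Tree, alpha: Dict[str, str]) -> bool:
    """Check whether alpha witnesses T ⊆ T̄ in the sense of Definition 16."""
    tbar_edges = tbar.edges()
    order = t.preorder()
    tbar_rank = {v: i for i, v in enumerate(tbar.preorder())}
    images = [tbar_rank[alpha[v]] for v in order]
    return (
        alpha[t.root] == tbar.root                                          # (i)
        and all((alpha[p], alpha[c]) in tbar_edges for p, c in t.edges())   # (ii)
        and all(t.lab[v] == tbar.lab[alpha[v]] for v in order)              # (iii)
        and all(t.val[v] == tbar.val.get(alpha[v]) for v in t.val)          # (iv)
        # (v): document-order is total, so it suffices that the images of
        # consecutive nodes in pre-order have strictly increasing ranks.
        and all(a < b for a, b in zip(images, images[1:]))
    )


labels = {"R": "root", "A1": "A", "B1": "B", "C1": "C"}
t = Tree("r", {"r": ["a", "b"]}, {"r": "root", "a": "A", "b": "B"})
tbar = Tree("R", {"R": ["A1", "C1", "B1"]}, labels)
swapped = Tree("R", {"R": ["B1", "A1"]}, labels)
alpha = {"r": "R", "a": "A1", "b": "B1"}
print(subsumes(t, tbar, alpha), subsumes(t, swapped, alpha))
```

In the second call the mapping preserves edges and labels but reverses the sibling order, so only condition (v) fails.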

Lemma 3. Let T = (V, E, lab, val, vρ) and T̄ = (V̄, Ē, l̄ab, v̄al, v̄ρ) be trees, and α : V → V̄ be a mapping witnessing that T ≈ T̄. Then, an LHS selector node v0 and a sequence of LHS fields v1, ..., vn violate an XIND σ = ((P, [P1, ..., Pn]) ⊆ (P′, [P′1, ..., P′n])) in T iff nodes α(v0) and α(v1), ..., α(vn) violate σ in T̄.

Proof (Lemma 3). some proof.

We now introduce some notations used throughout the subsequent proofs. The procedure of Algorithm 1 is recursive, and hence we use Ts to denote the input tree for a (recursive) step s in an application of Algorithm 1. Also, in case that Ts ⊭ Σ in step s, we use σs to denote the XIND chosen at Line 2, Ys to denote the set of sequences of nodes at Line 3 that violate σs in Ts, and Xs to denote the set of nodes at Line 11.

The next lemmas present some properties of the procedure for removing the violation of an XIND within a step s of an application of Algorithm 1.

Lemma 4. In an iteration i ∈ [1, m] of the loop at Line 12 within a step s of an application Chase(P, T, Σ), where P = [R1, ..., Rm],
(i) if Ri meets the condition at Line 13, then length(Ri) ≥ 2;
(ii) if Ri meets the condition at Line 13 and length(Ri) > 2, then parent(Ri) meets the condition at Line 13;
(iii) if the condition at Line 13 is met for the first time within step s, then length(Ri) = 2.

Proof (Lemma 4). (i) Assume to the contrary that length(Ri) = 1. Then Ri = ρ and in turn Ri ∩ P′.P′x = ρ for all P′x ∈ [P′1, ..., P′n]. This contradicts the assumption that Ri meets the condition at Line 13, and therefore length(Ri) ≥ 2.

(ii) Since path Ri meets the condition at Line 13 per assumption, there exists path P′x ∈ [P′1, ..., P′n] such that Ri ∩ P′.P′x ≠ ρ. Now, if parent(Ri) ≠ ρ, then also parent(Ri) ∩ P′.P′x ≠ ρ, since parent(Ri) ⊂ Ri per definition. Thereby, since length(Ri) > 2 per assumption, it follows that length(parent(Ri)) > 1 and consequently parent(Ri) ≠ ρ. Thus, parent(Ri) meets the condition at Line 13.

(iii) Since Ri meets the condition at Line 13 per assumption, it follows that length(Ri) ≥ 2 according to (i) in Lemma 4, and it therefore remains to be shown that length(Ri) ≤ 2. For this purpose, assume to the contrary that length(Ri) > 2. Then, path parent(Ri) exists and also parent(Ri) ∈ [R1, ..., Rm], since paths [R1, ..., Rm] are downward-closed per assumption. Also, path parent(Ri) meets the condition at Line 13 according to (ii) in Lemma 4. This however contradicts the assumption that the condition at Line 13 is met within step s for the first time in iteration i, since paths [R1, ..., Rm] are ordered by length, and therefore Ri succeeds path parent(Ri) within the sequence of paths [R1, ..., Rm].
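The condition at Line 13 and the first two claims of Lemma 4 can be illustrated on concrete paths. This is a rough sketch under assumptions: a path is a tuple of labels starting with the root label, the intersection of two paths is taken to be their longest common prefix, and the labels `root`, `dept`, `emp`, `@id`, `proj` as well as the function names are made up for the example.

```python
from typing import Sequence, Tuple

Path = Tuple[str, ...]


def intersect(p: Path, q: Path) -> Path:
    """Longest common prefix of two paths (assumed meaning of P ∩ Q)."""
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]


def meets_line_13(r: Path, rhs_selector: Path, rhs_fields: Sequence[Path]) -> bool:
    """True iff some RHS field P′x satisfies R ∩ P′.P′x ≠ ρ.

    The intersection differs from the root path exactly when the common
    prefix contains more than the root label.
    """
    return any(len(intersect(r, rhs_selector + f)) > 1 for f in rhs_fields)


selector: Path = ("root", "dept")          # hypothetical RHS selector P′
fields = [("emp", "@id")]                  # hypothetical RHS field P′1

print(meets_line_13(("root",), selector, fields))                       # root path
print(meets_line_13(("root", "dept", "emp", "@id"), selector, fields))  # the field path
print(meets_line_13(("root", "proj"), selector, fields))                # unrelated branch
```

A path such as `("root", "dept", "name")` also meets the condition although it is not a field of the XIND; under this reading, nodes for such paths are created only to keep the tree complete w.r.t. P.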

Lemma 5. Whenever a node v is created within a step s of an application Chase(P, T, Σ), where P = [R1, ..., Rm], then
(i) {v} = N(Ri, Ts) ∩ Xs, where i denotes the iteration of the loop at Line 12 within which node v was created, and
(ii) {vρ} = ancestor(v) ∩ ancestor(ṽ), if ṽ is a node in Ts such that ṽ ∉ Xs, and
(iii) closest(v, ṽ) = true, if ṽ is a node in Ts such that ṽ ∈ Xs.

Proof (Lemma 5). We demonstrate Lemma 5 by induction over the iterations of the loop at Line 12. We note that a node v is created in an iteration i of the loop at Line 12 iff path Ri meets the condition at Line 13. Also, we use vi to denote that node v was created within iteration i.

Base Case: We assume that in step s the condition at Line 13 is met for the first time within iteration x of the loop at Line 12, and we use this iteration as base case for the induction. Then, length(Rx) = 2 according to (iii) in Lemma 4. In turn, parent(Rx) = ρ and therefore v̂ = vρ at Line 14. We note that vρ ∈ Xs according to Line 11. Given that v̂ = vρ it follows that edge (vρ, vx) is added to E at Line 17 and therefore parent(vx) = vρ. The resulting path instance vρ.vx is defined over path Rx, since Rx is of the form ρ.last(Rx), given that length(Rx) = 2, and lab(vρ) = ρ according to Definition 1, and further lab(vx) = last(Rx) according to Line 16. Consequently, vx ∈ N(Rx, Ts). Also, {vx} = N(Rx, Ts) ∩ Xs, since vx is the first node added to Xs in step s, given that the condition at Line 13 is met for the first time in iteration x, which establishes (i) in Lemma 5.

We proceed with showing that if ṽ is a node in Ts and ṽ ∉ Xs, then aancestor(vx) ∩ aancestor(ṽ) = {vρ}. Given that parent(vx) = vρ it follows that aancestor(vx) = {vρ, vx}. Therefore, if aancestor(ṽ) ∩ aancestor(vx) ≠ {vρ}, then either vx = ṽ or vx ∈ ancestor(ṽ). However, ṽ ≠ vx follows from the assumption that ṽ ∉ Xs and the observation that vx ∈ Xs according to Line 15. Also, vx ∉ ancestor(ṽ) since vx is newly created per assumption and is therefore a leaf node in Ts within iteration x. Hence, (ii) in Lemma 5 is established.

Next, if ṽ is a node in Ts and ṽ ∈ Xs, then either ṽ = vx or ṽ = vρ. In both cases closest(vx, ṽ) follows directly from Definition 8, which establishes (iii) in Lemma 5.

Inductive Step: We now assume that Lemma 5 holds true until iteration y, where y ≥ x, and establish the inductive step by showing that Lemma 5 also holds true for the iteration z, where z > y, within which a node is created next.

We start with showing that there exists node v̂ at Line 14, such that {v̂} = N(parent(Rz), Ts) ∩ Xs. We note that either length(Rz) = 2 or length(Rz) > 2 according to (i) in Lemma 4. Thereby, if length(Rz) = 2, then parent(Rz) = ρ. In turn {v̂} = N(parent(Rz), Ts) ∩ Xs follows, since v̂ = vρ given that parent(Rz) = ρ. If instead length(Rz) > 2, then the inductive assumption applies, since parent(Rz) meets the condition at Line 13 according to (ii) in Lemma 4, and also parent(Rz) precedes Rz within the sequence of paths R1, ..., Rm, given that paths R1, ..., Rm are in prefix-order. Thus, {v̂} = N(parent(Rz), Ts) ∩ Xs follows from (i) in Lemma 5 if length(Rz) > 2.

Given that {v̂} = N(parent(Rz), Ts) ∩ Xs it follows that parent(vz) = v̂, since edge (v̂, vz) is added to E at Line 17. In turn vz ∈ N(Rz, Ts) follows, since lab(vz) = last(Rz) according to Line 16. Further, vz ∈ Xs according to Line 15, and therefore vz ∈ N(Rz, Ts) ∩ Xs. Also, {vz} = N(Rz, Ts) ∩ Xs, since if there exists to the contrary a node ṽ ∈ Xs such that ṽ ∈ N(Rz, Ts) and ṽ ≠ vz, then either ṽ = vρ or ṽ was created in a previous iteration of the loop at Line 12 for some path R̃ ∈ [R1, ..., Rm]. We exclude the case that ṽ = vρ, since then Rz = ρ, which contradicts (i) in Lemma 4. We also exclude the case that ṽ was previously created, since then Rz = R̃, which contradicts the assumption that paths R1, ..., Rm do not contain duplicates. Hence (i) in Lemma 5 is established.

We show next that if ṽ is a node in Ts and ṽ ∉ Xs, then aancestor(vz) ∩ aancestor(ṽ) = {vρ}. We note that vρ ∈ aancestor(vz), since vz ∈ N(Rz, Ts) as shown above, and also that vρ ∈ aancestor(ṽ), since initially Ts is a tree and ṽ exists in Ts before the first iteration of the loop at Line 12 given that ṽ ∉ Xs. Therefore, if to the contrary aancestor(vz) ∩ aancestor(ṽ) ≠ {vρ}, then either aancestor(parent(vz)) ∩ aancestor(ṽ) ≠ {vρ} or vz ∈ aancestor(ṽ). However, vz ∉ aancestor(ṽ) since vz ≠ ṽ, given that vz ∈ Xs but ṽ ∉ Xs, and also vz ∉ ancestor(ṽ), given that vz is newly created and therefore a leaf node in Ts within iteration z. It therefore remains to be shown that aancestor(parent(vz)) ∩ aancestor(ṽ) = {vρ}. Thereby, parent(vz) = v̂ as shown above, and therefore either parent(vz) = vρ or parent(vz) was newly created in a previous iteration of the loop at Line 12. Now, if parent(vz) = vρ then aancestor(parent(vz)) = {vρ} and consequently aancestor(parent(vz)) ∩ aancestor(ṽ) = {vρ}. If instead parent(vz) ≠ vρ, then aancestor(parent(vz)) ∩ aancestor(ṽ) = {vρ} follows from the inductive assumption, which establishes (ii) in Lemma 5.

We show finally that if ṽ is a node in Ts and ṽ ∈ Xs, then closest(vz, ṽ) = true. Thereby, if ṽ = vz, then closest(vz, ṽ) = true follows directly from Definition 8. We therefore assume that vz ≠ ṽ, and show first that closest(vz, ṽ) = true if parent(vz) ∉ ancestor(ṽ). Then, it follows again directly from Definition 8 that closest(vz, ṽ) = true iff closest(parent(vz), ṽ) = true. Thereby, since parent(vz) ∈ Xs per assumption, it follows from the inductive assumption that closest(parent(vz), ṽ) = true, and consequently also closest(vz, ṽ) = true.

If instead parent(vz) ∈ ancestor(ṽ), then {parent(vz)} = aancestor(ṽ) ∩ aancestor(vz), since vz is newly created per assumption and is therefore a leaf node in tree Ts within iteration z of the loop at Line 12. We note that ṽ ≠ vz per assumption. Given that {parent(vz)} = aancestor(ṽ) ∩ aancestor(vz), it follows that closest(vz, ṽ) = true if Rz ∩ R̃ = parent(Rz), where R̃ is the path such that ṽ ∈ N(R̃, Ts). We observe that parent(Rz) ⊆ Rz ∩ R̃, given that parent(vz) ∈ aancestor(vz) ∩ aancestor(ṽ). Consequently, if parent(Rz) ≠ Rz ∩ R̃, then Rz ∩ R̃ = Rz. Then however, it follows from the assumption that {parent(vz)} = aancestor(ṽ) ∩ aancestor(vz), that there exists node v̄ ∈ aancestor(ṽ), such that v̄ ∈ N(Rz, Ts) and v̄ ≠ vz. Further, given that v̄ ∈ aancestor(ṽ) and that ṽ ∈ Xs, it follows from the inductive assumption, in particular (ii) in Lemma 5, that v̄ = vρ if v̄ ∉ Xs. However, given that v̄ ∈ N(Rz, Ts), it follows that v̄ ≠ vρ, since length(Rz) ≥ 2 according to (i) in Lemma 4. Now, given that v̄ ∈ Xs and that v̄ ≠ vρ, it follows that v̄ was created in a previous iteration of the loop at Line 12. This however clearly contradicts the inductive assumption, in particular (i) in Lemma 5, if both v̄ and vz are nodes in N(Rz, Ts) and v̄ ≠ vz. Consequently, Rz ∩ R̃ = parent(Rz) and thus closest(vz, ṽ) = true, which establishes (iii) in Lemma 5 for the inductive step.
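The removal procedure analysed in Lemmas 4 and 5 (Lines 11-24) can be sketched as follows. This is an illustrative reading rather than the authors' code. It assumes that paths are tuples of labels in prefix-order with the root path first, that attribute labels start with `@` and the text label is `S`, and that fresh node ids of the form `new1`, `new2`, ... do not clash with existing ids. All labels and names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]


def common_prefix(p: Path, q: Path) -> Path:
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]


def remove_violation(paths: Sequence[Path], selector: Path, fields: Sequence[Path],
                     lhs_values: Sequence[str], root: str,
                     children: Dict[str, List[str]], lab: Dict[str, str],
                     val: Dict[str, str]) -> Dict[Path, str]:
    """Create one node per path that meets the Line 13 condition (mutates the tree)."""
    # X indexed by path: by Lemma 5 (i) each path owns exactly one node of X.
    created: Dict[Path, str] = {paths[0]: root}            # Line 11, assuming R1 = ρ
    targets = [selector + f for f in fields]               # the paths P′.P′y
    for r in paths:                                        # Line 12
        if not any(len(common_prefix(r, t)) > 1 for t in targets):
            continue                                       # Line 13 not met
        parent = created[r[:-1]]                           # Line 14
        node = f"new{len(created)}"                        # Line 15 (hypothetical id scheme)
        created[r] = node
        lab[node] = r[-1]                                  # Line 16
        children.setdefault(parent, []).append(node)       # Line 17: last child
        children[node] = []
        if r in targets:                                   # Lines 18-19
            val[node] = lhs_values[targets.index(r)]
        elif r[-1].startswith("@") or r[-1] == "S":        # Lines 20-21
            val[node] = "0"
    return created


paths = [("root",), ("root", "dept"), ("root", "dept", "@name"),
         ("root", "dept", "emp"), ("root", "dept", "emp", "@id"),
         ("root", "proj"), ("root", "proj", "@lead")]
children = {"r": ["p1"], "p1": ["a1"], "a1": []}
lab = {"r": "root", "p1": "proj", "a1": "@lead"}
val = {"a1": "7"}
created = remove_violation(paths, ("root", "dept"), [("emp", "@id")], ["7"],
                           "r", children, lab, val)
print(children["r"], val)
```

Looking up the parent via `created[r[:-1]]` relies on Lemma 4 (ii) together with the prefix-order of the paths: the parent path has already been processed and has met the condition whenever a longer path does.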

The next lemma is central in that it presents important properties of the tree Ts+1, which is returned from a step s in an application of Algorithm 1, in case that Ts ⊭ Σ.

Lemma 6. If Ts ⊭ Σ in a step s of an application Chase(P, T, Σ), then a tree Ts+1 is returned such that
(i) Ts+1 conforms to Definition 1, and
(ii) Ts+1 subsumes Ts, and
(iii) Ts+1 is complete w.r.t. P, and
(iv) Ts+1 ≈ T̃u+1, if T̃u+1 is the tree returned from a step u in an application Chase(P, T̃, Σ) and Ts ≈ T̃u.

Proof (Lemma 6). We note that tree Ts = (V, E, lab, val, vρ) equals tree Ts+1 at the end of step s.

(i) Given that Ts conforms to Definition 1 when step s is entered, it follows that also Ts+1 conforms to Definition 1, if whenever a node vi is added to Ts in an iteration i of the loop at Line 12, then
(a) function lab assigns a label in L to vi, and
(b) the parent-child edge relation E forms a tree structure rooted at vρ, which satisfies (i) and (ii) in Definition 1, and
(c) function val assigns a string value to vi iff vi is an attribute or text node.
We therefore demonstrate (i) in Lemma 6 by showing that (a)-(c) hold true whenever a node vi is added to Ts in an iteration i of the loop at Line 12.

(a) It follows from Line 16 together with the observation that last(Ri) ∈ L according to Definition 2, that function lab assigns a label in L to vi.

(b) We observe that E forms a tree structure rooted at vρ, since (i) in Lemma 5 implies that vi is connected to vρ, and further Lines 14 and 17 imply that vi has exactly one parent and that parent(vi) ≠ vi. Next, E satisfies (i) in Definition 1, i.e. that lab(parent(vi)) ∈ E (the set of element labels), since either parent(vi) = vρ and lab(vρ) ∈ E according to Definition 1, or parent(vi) was newly created in step s. Then lab(parent(vi)) = last(parent(Ri)), according to Line 16, and therefore lab(parent(vi)) ∈ E follows from the assumption that path Ri conforms to Definition 2.
We show next that E also satisfies (ii) in Definition 1, by showing that if vi is an attribute node, then there does not exist attribute node ṽ, such that lab(vi) = lab(ṽ), parent(vi) = parent(ṽ) and vi ≠ ṽ. For this purpose, assume to the contrary that ṽ exists. Then, ṽ ∉ Xs, because (i) in Lemma 5 implies that if ṽ ∈ Xs then ṽ = vρ, given that ṽ ≠ vi, and ṽ ≠ vρ follows from the assumption that ṽ is an attribute node. Further, given that ṽ ∉ Xs, it follows from (ii) in Lemma 5 that aancestor(ṽ) ∩ aancestor(vi) = {vρ}. This together with the assumption that parent(vi) = parent(ṽ) implies that parent(vi) = parent(ṽ) = vρ. It follows then from (i) in Lemma 5 that Ri is of the form Ri = ρ.last(Ri). This together with the observations that last(Ri) ∈ A, since vi is an attribute node per assumption, and that Ri meets the condition at Line 13, since vi is created in iteration i per assumption, implies that σs contains an RHS field P′x such that P′.P′x = Ri. This however clearly contradicts the assumption that σs conforms to Definition 14. Consequently, node ṽ does not exist and hence E also satisfies (ii) in Definition 1.

(c) If vi is an element node, then last(Ri) ∈ E according to (i) in Lemma 5 and therefore neither the condition at Line 20 nor the condition at Line 18 is met, given that σs conforms to Definition 7 and does therefore not contain field paths ending in element labels. Consequently, val(vi) is undefined if vi is an element node. Further, if vi is an attribute node, then val(vi) is defined, since either the condition at Line 18 or the condition at Line 20 is met. We note that if val(vi) is set at Line 19, then val(vy) is defined, since Definition 7 together with Definition 9 implies that nodes v1, ..., vn are attribute or text nodes, given that the sequence of nodes [v0, v1, ..., vn] at Line 10 violates σs in Ts.

(ii) Since the chase does not remove nodes or edges, and also does not alter the values, labels or positions of nodes that exist in tree Ts when step s is entered, it follows that Ts ⊆ Ts+1.

(iii) Given that Ts conforms to P when step s is entered, it follows that also Ts+1 conforms to P, since whenever a node vi is added to Ts, in some iteration i of the loop at Line 12, then vi ∈ N(Ri, Ts) according to (i) in Lemma 5, and Ri ∈ P per assumption.
Given that Ts+1 conforms to P and that Ts is complete w.r.t. P when step s is entered, it follows that also Ts+1 is complete w.r.t. P, if it applies to every new node vi, created in some iteration i of the loop at Line 12, that if R̃ is a path in P and Ri ⊂ R̃, then there exists node ṽ in tree Ts at the end of step s, such that ṽ ∈ N(R̃, Ts) and vi ∈ ancestor(ṽ).

Thereby, given that vi is created in iteration i, it follows that Ri meets the condition at Line 13. This together with the assumptions that Ri ⊂ R̃ and that R̃ ∈ P implies that also R̃ meets the condition at Line 13. It follows in turn from (i) in Lemma 5 that there exists node ṽ ∈ N(R̃, Ts) ∩ Xs eventually, and, as will be shown next, also vi ∈ ancestor(ṽ). Thereby, given that ṽ ∈ N(R̃, Ts) and that Ri ⊂ R̃, it follows that there exists node v̂ ∈ N(Ri, Ts) such that v̂ ∈ ancestor(ṽ). Now, if v̂ ∈ Xs, then both vi and v̂ are nodes in N(Ri, Ts) ∩ Xs and therefore (i) in Lemma 5 implies that vi = v̂. Consequently, if v̂ ∈ Xs, then vi ∈ ancestor(ṽ), as desired. If instead v̂ ∉ Xs, then aancestor(v̂) ∩ aancestor(ṽ) = {vρ} according to (ii) in Lemma 5. In turn, v̂ = vρ, since v̂ ∈ ancestor(ṽ) per assumption, and therefore Ri = ρ, given that v̂ ∈ N(Ri, Ts). This however contradicts the assumption that Ri meets the condition at Line 13 according to (i) in Lemma 4. Consequently, v̂ ∈ Xs, and hence (iii) in Lemma 6 is established.

(iv) We subsequently refer to the applications of Algorithm 1 generating trees Ts+1 and T̃u+1 as the first and the second application, respectively.

Given that Ts = (V, E, lab, val, vρ) ≈ T̃u = (Ṽ, Ẽ, l̃ab, ṽal, ṽρ), it follows that there exists the isomorphism mapping α : V → Ṽ, when steps s and u are entered within the first and the second application, respectively. From this we conclude that Ts+1 ≈ T̃u+1, if mapping α can be extended to the sets of nodes V ∪ Xs and Ṽ ∪ Xu, when steps s and u are exited. We note that Xs and Xu denote the sets of nodes that are created in steps s and u, respectively.

Thereby, it follows from Lemma 3, together with the assumption that Σ is the input sequence of XINDs in both applications, that the same XIND σ ∈ Σ is chosen at Line 2 within step s and step u of the first and the second application, respectively. Also, given that P = [R1, ..., Rm] is the input sequence of paths in both applications of Algorithm 1, it follows that the loop at Line 12 is iterated m times within step s and step u of the first and the second application, respectively. Therefore, given that the paths in P are iterated in the same succession in step s and step u of the first and the second application, respectively, and that in both steps σ is chosen at Line 2, it follows that in every iteration i ∈ [1, m] of the loop at Line 12, the condition at Line 13 evaluates to the same result in both applications. It follows in turn that for all i ∈ [1, m], a node vi is created in iteration i of the loop at Line 12 in step s of the first application iff a node ṽi is created within the same iteration of the loop at Line 12 in step u of the second application.

Based on this observation we claim that when choosing α(vi) = ṽi whenever nodes vi and ṽi are created within an iteration i of the loop at Line 12 in step s and step u, respectively, then α establishes the desired isomorphism between trees Ts and T̃u, when steps s and u are exited.

We note that the definition of α implies that α is a bijection and consequently, the inverse mapping α⁻¹ exists. We proceed with showing that both α and α⁻¹ satisfy Definition 16. Thereby, α(vρ) = ṽρ and α⁻¹(ṽρ) = vρ follow from the observation that the chase does not create a root node together with the assumption that α establishes the isomorphism between trees Ts and T̃u, when steps s and u of the first and the second application, respectively, are entered. Hence, α and α⁻¹ satisfy (i) in Definition 16.

Further, whenever nodes vi and ṽi are created in an iteration i of the loop at Line 12 within steps s and u of the first and the second application, respectively, then lab(vi) = last(Ri) and l̃ab(ṽi) = last(Ri) according to Line 16. In turn, the definition of α implies that lab(vi) = l̃ab(α(vi)) and conversely, that l̃ab(ṽi) = lab(α⁻¹(ṽi)). Consequently, α and α⁻¹ satisfy (iii) in Definition 16.

We proceed with showing that α and α⁻¹ satisfy (iv) in Definition 16. We note that the conditions at Lines 18 and 20 evaluate to the same result in steps s and u of the first and the second application, respectively, given that P is the input sequence of paths in both applications, and that the same XIND σ = ((P, [P1, ..., Pn]) ⊆ (P′, [P′1, ..., P′n])) is chosen at Line 2 in steps s and u of the two applications. Now, if nodes vi and ṽi are created in an iteration i of the loop at Line 12 within steps s and u, and the condition at Line 20 applies,
then val(vi) = ṽal(ṽi) = "0" according to Line 21. In turn, val(vi) = ṽal(α(vi)) and conversely ṽal(ṽi) = val(α⁻¹(ṽi)) follow from the definition of α.

We show next that α and α⁻¹ also satisfy (iv) in Definition 16 in case that the values of nodes vi and ṽi are set at Line 19. For this purpose, let [v̇0, v̇1, ..., v̇n] and [v̈0, v̈1, ..., v̈n] denote the sequences of nodes in Ys and Yu at Line 10 in steps s and u of the first and the second application, respectively. We note that Ys contains exactly one sequence of nodes at Line 10, since if to the contrary there exists an additional sequence [v̂0, v̂1, ..., v̂n] ∈ Ys, then Definition 15 together with the condition at Line 7 implies that v̇j = v̂j for all j ∈ [0, n], which contradicts the assumption that Ys does not contain duplicates. Analogously, [v̈0, v̈1, ..., v̈n] is the only sequence of nodes in Yu at Line 10 within step u of the second application.

Then, val(vi) = val(v̇y) and ṽal(ṽi) = ṽal(v̈y) at Line 19 in steps s and u of the first and the second application, respectively, where y is the index of the RHS field P′y ∈ [P′1, ..., P′n] such that Ri = P′.P′y. We note that P′y is unique, since σ does not contain duplicate RHS fields according to Definition 7. Further, since per assumption nodes [v̇0, v̇1, ..., v̇n] and [v̈0, v̈1, ..., v̈n] exist in trees Ts and T̃u when steps s and u are entered, it follows that α(v̇j) and α⁻¹(v̈j) are defined for all j ∈ [0, n].

Now, assume for the moment that for all j ∈ [0, n], α(v̇j) = v̈j and conversely, that α⁻¹(v̈j) = v̇j. Then, given that val(vi) = val(v̇y), it follows that val(vi) = ṽal(α(v̇y)), since val(v̇y) = ṽal(α(v̇y)) according to (iv) in Definition 16. In turn, val(vi) = ṽal(v̈y), given that α(v̇y) = v̈y, and therefore val(vi) = ṽal(α(vi)), since α(vi) = ṽi per definition of α and ṽal(ṽi) = ṽal(v̈y) per assumption. The converse, i.e. that ṽal(ṽi) = val(α⁻¹(ṽi)), follows immediately.

Therefore, α and α⁻¹ satisfy (iv) in Definition 16 if for all j ∈ [0, n], α(v̇j) = v̈j and α⁻¹(v̈j) = v̇j, which is what we show next.

We show first that α(v̇j) = v̈j for all j ∈ [0, n]. Thereby, it follows from Lemma 3 that [α(v̇0), α(v̇1), ..., α(v̇n)] ∈ Yu at Line 3 in step u of the second application, since per assumption T̃u ≈ Ts when step u is entered, and also [v̇0, v̇1, ..., v̇n] ∈ Ys at Line 10 in step s of the first application. Therefore, if also [α(v̇0), α(v̇1), ..., α(v̇n)] ∈ Yu at Line 10 in step u of the second application, then α(v̇j) = v̈j for all j ∈ [0, n], as desired. We show now the contradiction that [v̇0, v̇1, ..., v̇n] ∉ Ys at Line 10 in step s of the first application, if [α(v̇0), α(v̇1), ..., α(v̇n)] ∉ Yu at Line 10 in step u of the second application. For this purpose, let z be the iteration of the loop at Line 4 within step u of the second application, such that for all i ∈ [0, z], v̈i = α(v̇i) if i < z, and v̈i ≺ α(v̇i) if i = z. We note that this iteration exists, given that [v̈0, v̈1, ..., v̈n] ≠ [α(v̇0), α(v̇1), ..., α(v̇n)] at Line 10 in step u of the second application. Then, it follows from (v) in Definition 16 that for all i ∈ [0, z], α⁻¹(v̈i) = α⁻¹(α(v̇i)) if i < z, and α⁻¹(v̈i) ≺ α⁻¹(α(v̇i)) if i = z. That is, for all i ∈ [0, z], α⁻¹(v̈i) = v̇i if i < z, and α⁻¹(v̈i) ≺ v̇i if i = z, since α⁻¹(α(v̇i)) = v̇i per assumption. Now, given that Ts ≈ T̃u when step s of the first application is entered, and that [v̈0, v̈1, ..., v̈n] ∈ Yu at Line 3 in step u of the second application, it follows from Lemma 3 that [α⁻¹(v̈0), α⁻¹(v̈1), ..., α⁻¹(v̈n)] ∈ Ys at Line
3 in step s of the first application. Also, given that for all i ∈ [0, z], α⁻¹(v̈i) = v̇i if i < z, and that α⁻¹(v̈i) ≺ v̇i if i = z, it follows that [v̇0, v̇1, ..., v̇n] is removed from Ys in iteration z of the loop at Line 4 within step s of the first application, which establishes the contradiction. The converse, i.e. that α⁻¹(v̈j) = v̇j for all j ∈ [0, n], follows immediately. Hence α and α⁻¹ satisfy (iv) in Definition 16.

We show finally by induction over the iterations of the loop at Line 12 within steps s and u of the first and the second application, respectively, that α and α⁻¹ also satisfy (ii) and (v) in Definition 16 when steps s and u are exited.

Base Case: We assume that in steps s and u of the first and the second application, respectively, the condition at Line 13 is met for the first time within iteration x of the loop at Line 12, and we use this iteration as base case for the induction. Thereby, length(Rx) = 2 according to (iii) in Lemma 4, and therefore parent(Rx) = ρ. In turn {vρ} = N(parent(Rx), Ts) and {ṽρ} = N(parent(Rx), T̃u) at Line 14 within steps s and u of the first and the second application, respectively. Consequently, edges (vρ, vx) and (ṽρ, ṽx) are added to E and Ẽ at Line 17, respectively. Thereby, since α(vρ) = ṽρ according to (i) in Definition 16, and α(vx) = ṽx per definition of α, it follows that (α(vρ), α(vx)) ∈ Ẽ. The converse, i.e. that (α⁻¹(ṽρ), α⁻¹(ṽx)) ∈ E, follows immediately. Hence α and α⁻¹ satisfy (ii) in Definition 16 in the base case.

Further, according to Line 17, nodes vx and ṽx are the last child nodes of vρ and ṽρ, respectively. Therefore, nodes vx and ṽx are the last nodes visited in a pre-order traversal of trees Ts and T̃u, respectively. From this it follows that for every node v̇ in Ts, v̇ ≺ vx in tree Ts iff α(v̇) ≺ α(vx) in tree T̃u, since α(vx) = ṽx per definition of α. Analogously, for every node v̈ in T̃u, v̈ ≺ ṽx in tree T̃u iff α⁻¹(v̈) ≺ α⁻¹(ṽx) in tree Ts, which establishes that α and α⁻¹ satisfy (v) in Definition 16 in the base case.

Inductive Step: We now assume that α and α⁻¹ satisfy (ii) and (v) in Definition 16 until iteration y, where y ≥ x, and establish the inductive step by showing that α and α⁻¹ also satisfy (ii) and (v) in Definition 16 in iteration z, where z > y, within which nodes are created next in steps s and u of the first and the second application, respectively. Thereby, if length(Rz) = 2, then it follows analogously from the argumentation in the base case that α and α⁻¹ satisfy (ii) and (v) in Definition 16.

If instead length(Rz) ≠ 2, then length(Rz) > 2 according to (i) in Lemma 4. Now, let v̂1 and v̂2 be the nodes at Line 14 within step s and u of the first and the second application, respectively. Then, given that length(Rz) > 2, it follows that v̂1 ≠ vρ and v̂2 ≠ ṽρ, since v̂1 ∈ N(parent(Rz), Ts) and v̂2 ∈ N(parent(Rz), T̃u) according to Line 14. Further, given that v̂1 ≠ vρ and v̂2 ≠ ṽρ, it follows that v̂1 and v̂2 were created in a previous iteration of the loop at Line 12 within steps s and u of the first and the second application, respectively, since v̂1 ∈ Xs and v̂2 ∈ Xu per assumption. Therefore (i) in Lemma 5 applies to nodes v̂1 and v̂2, and thus {v̂1} = N(parent(Rz), Ts) ∩ Xs and {v̂2} = N(parent(Rz), T̃u) ∩ Xu.

We conclude from this together with the definition of α, that α(ˆ v1 ) = vˆ2 and v2 ) = vˆ1 . conversely, that α−1 (ˆ ˜ within v2 , v˜z ) are added to E and E, Further, at Line 17 edges (ˆ v1 , vz ) and (ˆ steps s and u of the first and the second application, respectively. Therefore, ˜ that α(vz ) = v˜z , it follows that α(ˆ v1 ), α(vz ) ∈ E. given that α(ˆ v1 ) = vˆ2 and −1 −1 v2 ), α (˜ vz ) ∈ E, follows immediately. Hence α The converse, i.e. that α (ˆ and α−1 satisfy (ii) in Definition 16. In order to show that α and α−1 also satisfy (v) in Definition 16, we show ˙ ≺ α(vz ) in tree Tu . first that for every node v˙ in tree Ts , v˙ ≺ vz iff α(v) / ancestor(v), ˙ then v˙ is visited before vz in a pre-order We observe that if vˆ1 ∈ traversal of tree Ts iff v˙ is visited before vˆ1 , since vˆ1 = parent(vz ) per assumption. / ancestor(v), ˙ then v˙ ≺ vz iff v˙ ≺ vˆ1 . Also, it follows from the That is, if vˆ1 ∈ ˙ ≺ α(ˆ v1 ). Consequently, v˙ ≺ vz iff inductive assumption, that v˙ ≺ vˆ1 iff α(v) / ancestor(v). ˙ α(v) ˙ ≺ α(vz ), given that vˆ1 ∈ ˙ ≺ α(vz ) in case that We proceed with showing that also v˙ ≺ vz iff α(v) ˙ We show first that v˙ ≺ vz if α(v) ˙ ≺ α(vz ). Thereby, v˙ ≺ vz vˆ1 ∈ ancestor(v). follows since vˆ1 ∈ ancestor(v) ˙ per assumption, and vz is the last child of vˆ1 according to Line 17, and therefore v˙ is visited before vz in a pre-order traversal of tree Ts . We show next that α(v) ˙ ≺ α(vz ) if v˙ ≺ vz . Thereby, given that vˆ1 ∈ ˙ In ancestor(v), ˙ it follows from (ii) in Lemma 2, that α(ˆ v1 ) ∈ ancestor(α(v)). ˙ since α(ˆ v1 ) = vˆ2 , as shown above. Further, α(vz ) = v˜z turn, vˆ2 ∈ ancestor(α(v)), per definition of α, and therefore α(v) ˙ is visited before α(vz ) in a pre-order tra˜u , since α(vz ) = v˜z is the last child of vˆ2 according to Line 17. versal of tree T Consequently, α(v) ˙ ≺ α(vz ). Therefore, α satisfies (v) in Definition 16. 
It follows analogously from the argumentation above that also α−1 satisfies (v) in Definition 16, which establishes the inductive step.
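The ordering argument above rests on a simple property of pre-order traversal: a node appended as the last child of a parent is visited after every node already in that parent's subtree. The following Python sketch illustrates this on a toy tree. The `Node` class and the helper names are assumptions made for illustration and are not part of Algorithm 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality: two nodes are equal only if they are the same object
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_last_child(self, label: str) -> "Node":
        """Create a node and append it as the last child (hypothetical analogue of Line 17)."""
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def preorder(root: Node) -> List[Node]:
    """Return the nodes of the tree in pre-order (document order)."""
    order = [root]
    for child in root.children:
        order.extend(preorder(child))
    return order


def precedes(root: Node, a: Node, b: Node) -> bool:
    """a ≺ b: a is visited strictly before b in a pre-order traversal."""
    order = preorder(root)
    return order.index(a) < order.index(b)


# Toy tree: root -> a -> a1, root -> b
root = Node("root")
a = root.add_last_child("a")
a1 = a.add_last_child("a1")
b = root.add_last_child("b")

# Append z as the last child of a, after a1 already exists.
z = a.add_last_child("z")
labels = [n.label for n in preorder(root)]
```

Here `z` follows `a1`, a descendant of its parent, and precedes `b`, which follows the parent's subtree. This is the case distinction on whether v̂1 is an ancestor of v̇ used in the inductive step.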

We are now ready to establish Lemma 1.

Proof (Lemma 1). We start with showing that the application Chase(P, T, Σ) of Algorithm 1 terminates. For this purpose, we show first that there exists a finite set of string-values U such that whenever u is a string-value in tree Ts in a step s of the application of Algorithm 1, then u ∈ U. We note that u is said to be a string-value in tree Ts if there exists an attribute or text node v such that val(v) = u. Now, let U be the set of string-values in the initial tree T1 including the string-value "0". Then, U is finite since T1 conforms to Definition 1 and therefore the number of attribute and text nodes in T1 is finite. Also, u ∈ U, if u is a string-value in T1, follows directly from the construction of U. Now assume that it holds true until step s that u ∈ U if u is a string-value in tree Ts, and we show that it also holds true for step s + 1. Thereby, since Ts ⊆ Ts+1 according to (ii) in Lemma 6, it follows that either u is a string-value in Ts, and thus u ∈ U follows from the inductive assumption, or u is the string-value of a node v created in step s. If u is the string-value of a new node v, then u = "0" if u is assigned to v at Line 21, and u is the string-value of a node in Ts if u is assigned to v at Line 19. Consequently u ∈ U, if u is a string-value in tree Ts+1.

We proceed with showing that tree Ts satisfies an XIND σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])), if for every sequence of string-values u1, …, un ∈ U × ··· × U, there exists a sequence of nodes v′1, …, v′n such that
(a) v′i ∈ N(P′.P′i, Ts), for all i ∈ [1, n], and
(b) closest(v′i, v′j) = true, for all i, j ∈ [1, n], and
(c) val(v′i) = ui, for all i ∈ [1, n].

Now, assume for the moment that whenever there exist nodes v0, v1, …, vn in tree Ts satisfying (i) - (iii) in Definition 9 with respect to σ, then there exist nodes v′0, v′1, …, v′n satisfying (i’) - (iv’) in Definition 9 with respect to σ and nodes v1, …, vn, if there exist nodes v′1, …, v′n satisfying (a) - (c) with respect to σ and the sequence of values val(v1), …, val(vn). Then, Ts ⊨ σ follows immediately, since val(v1), …, val(vn) is contained in U × ··· × U, given that val(vi) ∈ U, for all i ∈ [1, n].

Thereby, given that nodes v′1, …, v′n satisfy (a) - (c) with respect to σ and the sequence of values val(v1), …, val(vn), it follows that if there exists node v′0 such that v′0 ∈ N(P′, Ts) and v′0 ∈ aancestor(v′i), for all i ∈ [1, n], then nodes v′0, v′1, …, v′n satisfy (i’) - (iv’) in Definition 9 with respect to σ and nodes v1, …, vn. We observe that there exists node v̂ ∈ N(P′.P′1 ∩ ··· ∩ P′.P′n, Ts) such that v̂ ∈ aancestor(v′i), for all i ∈ [1, n], since closest(v′i, v′j) = true, for all i, j ∈ [1, n], per assumption. Consequently, there exists node v′0 ∈ aancestor(v̂) such that v′0 ∈ N(P′, Ts). Also, given that v′0 ∈ aancestor(v̂) and that v̂ ∈ aancestor(v′i), for all i ∈ [1, n], it follows that also v′0 ∈ aancestor(v′i), for all i ∈ [1, n].

Further, we say that Ts misses a sequence of nodes in order to satisfy σ, if for a given sequence of string-values u1, …, un ∈ U × ··· × U, there do not exist nodes v′1, …, v′n in tree Ts satisfying (a) - (c). Then, since a succeeding step s + 1 is performed by the application Chase(P, T, Σ) iff Ts ⊭ Σ according to Line 1, it follows that step s + 1 is performed iff tree Ts misses at least one sequence of nodes in order to satisfy Σ.
From this we conclude that if the initial tree T1 misses only a finite number of sequences of nodes in order to satisfy Σ, and also the number of missing sequences of nodes strictly decreases from a step s to step s + 1, then only a finite number of steps is performed, and thus the application Chase(P, T, Σ) of Algorithm 1 terminates. Since U is finite as shown above, it follows that T1 misses only a finite number of sequences of nodes in order to satisfy Σ. To be more precise, T1 misses at most ∑_{i=1}^{x} |U|^{|σi|} sequences of nodes, where x is the number of XINDs in Σ, |U| denotes the number of string-values in U, and |σi| denotes the number of RHS fields in σi. Therefore, if the number of missing sequences of nodes strictly decreases from every step s to step s + 1, which is what we show next, then the application Chase(P, T, Σ) of Algorithm 1 terminates after at most ∑_{i=1}^{x} |U|^{|σi|} steps.

Now, let ωs and ωs+1 be the number of missing sequences of nodes in order to satisfy Σ in trees Ts and Ts+1, respectively. Also, let α be the subsumption mapping establishing that Ts ⊆ Ts+1. Then, whenever there exists a sequence of nodes v′1, …, v′|σ| in tree Ts that satisfies (a) - (c) with respect to an XIND σ ∈ Σ and a sequence of values u1, …, u|σ| ∈ U × ··· × U, then the sequence of nodes α(v′1), …, α(v′|σ|) in tree Ts+1 satisfies (a) and (b), according to (iv) and (v) in Lemma 2, and it also satisfies (c), according to (iv) in Definition 16. From this we conclude that ωs+1 ≤ ωs.

It therefore remains to be shown that ωs+1 < ωs. Thereby, given that step s + 1 is performed, it follows that Ts ⊭ Σ, and in turn, there exists a sequence of nodes v0, v1, …, vn at Line 10 that violates σs = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) in tree Ts. That is, tree Ts misses a sequence of nodes v′1, …, v′n that satisfies (a) - (c) with respect to σs and the sequence of string-values val(v1), …, val(vn). Now, if tree Ts+1 contains this sequence of nodes v′1, …, v′n, then ωs+1 < ωs, as desired.

We therefore show finally that Ts+1 indeed contains nodes v′1, …, v′n satisfying (a) - (c) with respect to σs and the sequence of string-values val(v1), …, val(vn). Thereby, since Σ conforms to P per assumption, it follows that P′.P′i ∈ [R1, …, Rm], for all i ∈ [1, n]. Also, since [R1, …, Rm] is a sequence of paths and therefore does not contain duplicates, there exists a unique mapping φ : [1, n] → [1, m] such that P′.P′i = Rφ(i), for all i ∈ [1, n]. Further, since P′.P′i ≠ ρ, for all i ∈ [1, n], according to Definition 7, it follows that for all i ∈ [1, n], path Rφ(i) = P′.P′i satisfies the condition at Line 13 within iteration φ(i) of the loop at Line 12. Now, let for all i ∈ [1, n], v′i be the node created in iteration φ(i) of the loop at Line 12. Then, according to (i) in Lemma 5, v′i ∈ N(Rφ(i), Ts) when step s is exited, and thus v′i ∈ N(P′.P′i, Ts+1), for all i ∈ [1, n], given that P′.P′i = Rφ(i). Hence, nodes v′1, …, v′n satisfy (a). Further, given that nodes v′1, …, v′n were created in step s, closest(v′i, v′j) = true for all i, j ∈ [1, n] according to (iii) in Lemma 5, and thus nodes v′1, …, v′n also satisfy (b). Finally, observing that the sequence of paths [P′1, …, P′n] does not contain duplicates, and that for all i ∈ [1, n], path Rφ(i) meets the condition at Line 18, since Rφ(i) = P′.P′i per assumption, we conclude that for all i ∈ [1, n], val(vi) is assigned to v′i at Line 19 within iteration φ(i) of the loop at Line 12. Consequently, nodes v′1, …, v′n also satisfy (c).

Given that the application Chase(P, T, Σ) terminates, it follows immediately that the resulting tree T̄ satisfies Σ. Further, it follows directly from (i) - (iii) in Lemma 6 that T̄ conforms to Definition 1, subsumes tree T, and is complete w.r.t. P. We conclude the proof of Lemma 1 with the observation that if T̃ is the tree returned by another application of Chase(P, T, Σ), then T̃ ≈ T̄, according to (iv) in Lemma 6, which establishes that tree T̄ is unique.
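The termination bound used in the proof, the sum over all XINDs σi in Σ of |U| raised to the number of RHS fields of σi, can be evaluated directly. The short Python sketch below only computes this bound. The function name and the representation of Σ as a list of field counts are assumptions made for illustration.

```python
from typing import Sequence


def chase_step_bound(num_string_values: int, rhs_field_counts: Sequence[int]) -> int:
    """Upper bound on the number of chase steps: sum over i of |U| ** |sigma_i|.

    num_string_values -- |U|, the number of distinct string-values (including "0")
    rhs_field_counts  -- |sigma_i| for each XIND sigma_i in Sigma (number of RHS fields)
    """
    return sum(num_string_values ** k for k in rhs_field_counts)


# Example: |U| = 3 and two XINDs with 2 and 1 RHS fields, i.e. 3**2 + 3**1.
bound = chase_step_bound(3, [2, 1])
```

The bound grows exponentially in the number of RHS fields, so it serves as a termination argument and not as an estimate of practical running time.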

5.3 Consistency and Implication of Core XINDs in Complete Trees

We formulate the consistency problem in our framework as the question of whether there exists a tree T, for any given combination of a downward-closed set of paths P and a set of core XINDs Σ that conforms to P, such that T is complete w.r.t. P and satisfies Σ. We have the following result.

Theorem 2. The class of core XINDs in complete XML trees is consistent.

The correctness of Theorem 2 follows from the fact that there always exists a tree T̃ that is complete w.r.t. a given set of paths P, and the result in Lemma 1 that the tree T̄ returned by Chase(P, T̃, Σ) is complete w.r.t. P and satisfies the given set of core XINDs Σ.

We now turn to the implication of core XINDs. We use Σ ⊨ σ to denote that Σ implies σ, i.e. that given a set of paths P to which Σ ∪ {σ} conforms, there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ.

R1 Reflexivity
   {} ⊢ ((P, [P1, …, Pn]) ⊆ (P, [P1, …, Pn]))
R2 Permutated Projection
   ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) ⊢ ((P, [Pπ(1), …, Pπ(m)]) ⊆ (P′, [P′π(1), …, P′π(m)])), if {π(1), …, π(m)} ⊆ {1, …, n}
R3 Transitivity
   ((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n])) ∧ ((P̄, [P̄1, …, P̄n]) ⊆ (P′, [P′1, …, P′n])) ⊢ ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n]))
R4 Downshift
   ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n])) ⊢ ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n]))
R5 Upshift
   ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n])) ⊢ ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n]))
R6 Union
   ((P, [P1, …, Pm]) ⊆ (ρ, [P′1, …, P′m])) ∧ ((P, [Pm+1, …, Pn]) ⊆ (ρ, [P′m+1, …, P′n])) ⊢ ((P, [P1, …, Pn]) ⊆ (ρ, [P′1, …, P′n])), if ρ.P′i ∩ ρ.P′j = ρ for all (i, j) ∈ [1, m] × [m+1, n]

Fig. 4. Inference Rules for XINDs
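The syntactic rules R2, R4, R5 and R6 of Figure 4 can be sketched as operations on a small data structure. The Python code below is an illustration under stated assumptions, not the authors' implementation: paths are modelled as tuples of labels, an XIND as a frozen dataclass, and path intersection as the longest common prefix. All class and function names are hypothetical, and conformance to a path set P (Definition 7) is not checked.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Path = Tuple[str, ...]
ROOT: Path = ("ρ",)


@dataclass(frozen=True)
class XIND:
    lhs_selector: Path
    lhs_fields: Tuple[Path, ...]
    rhs_selector: Path
    rhs_fields: Tuple[Path, ...]


def path(text: str) -> Path:
    """Parse a dotted path such as 'ρ.offer.course' into a tuple of labels."""
    return tuple(text.split("."))


def intersect(p: Path, q: Path) -> Path:
    """Path intersection, taken here as the longest common prefix of p and q."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


def project(x: XIND, positions: Sequence[int]) -> XIND:
    """R2: keep (and reorder) the field pairs at the given 0-based positions.
    Assumption: positions are distinct and within range."""
    n = len(x.lhs_fields)
    if len(set(positions)) != len(positions) or not all(0 <= i < n for i in positions):
        raise ValueError("positions must be distinct indices of existing fields")
    return XIND(x.lhs_selector, tuple(x.lhs_fields[i] for i in positions),
                x.rhs_selector, tuple(x.rhs_fields[i] for i in positions))


def downshift(x: XIND, k: int = 1) -> XIND:
    """R4: move the last k labels of the RHS selector to the start of every RHS field."""
    if not 0 < k < len(x.rhs_selector):
        raise ValueError("the RHS selector must keep at least the root label")
    r = x.rhs_selector[-k:]
    return XIND(x.lhs_selector, x.lhs_fields,
                x.rhs_selector[:-k], tuple(r + f for f in x.rhs_fields))


def upshift(x: XIND, r: Path) -> XIND:
    """R5: move the common prefix r of all RHS fields to the end of the RHS selector."""
    if not all(f[:len(r)] == r for f in x.rhs_fields):
        raise ValueError("r must be a prefix of every RHS field")
    return XIND(x.lhs_selector, x.lhs_fields,
                x.rhs_selector + r, tuple(f[len(r):] for f in x.rhs_fields))


def union(x: XIND, y: XIND) -> Optional[XIND]:
    """R6: combine two XINDs; returns None if the rule is not applicable."""
    if x.lhs_selector != y.lhs_selector or x.rhs_selector != ROOT or y.rhs_selector != ROOT:
        return None
    for f in x.rhs_fields:
        for g in y.rhs_fields:
            if intersect(ROOT + f, ROOT + g) != ROOT:  # side condition of R6
                return None
    return XIND(x.lhs_selector, x.lhs_fields + y.lhs_fields,
                ROOT, x.rhs_fields + y.rhs_fields)


tc = path("ρ.teaches.course")
x = XIND(tc, (path("cno"),), path("ρ.offer.course"), (path("cno"),))
shifted = downshift(x)  # RHS becomes (ρ.offer, [course.cno])

a = XIND(tc, (path("cno"),), ROOT, (path("offer.course.cno"),))
b = XIND(tc, (path("info.lec"),), ROOT, (path("department.lec"),))
c = XIND(tc, (path("info.sem"),), ROOT, (path("offer.course.sem"),))
```

With these definitions, `union(a, b)` yields the combined XIND because the RHS fields meet only at the root, whereas `union(a, c)` returns `None` because both RHS fields share the prefix ρ.offer.course.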

In order to discuss the implication problem in our framework, which we formulate as the question whether Σ ⊨ σ (and also Σ ⊭ σ) is decidable, we start by giving in Figure 4 a set of inference rules, where symbol ⊢ denotes that the XINDs in the premise derive the XIND in the conclusion. Rules R1 - R3 correspond to the well-known inference rules for INDs [1], which is to be expected given Theorem 1 and the fact that XML trees generated from a complete relational database by the mapping described in Section 4 are a subclass of complete XML trees. The remaining rules have no parallels in the inference rules for INDs, and we now discuss them.

Rule R4 allows one to shift a path from the end of the RHS selector in an XIND down to the start of the RHS fields. For example, by applying R4 to the XIND ((ρ.teaches.course, [cno]) ⊆ (ρ.offer.course, [cno])), we derive the XIND ((ρ.teaches.course, [cno]) ⊆ (ρ.offer, [course.cno])), whereby the last label in the RHS selector ρ.offer.course has been shifted down to the start of the RHS fields. Rule R5 is the reverse of R4, whereby a path from the start of the RHS fields is shifted up to the end of the RHS selector.

Rule R6 is a rule that, roughly speaking, allows one to union the LHS fields and the RHS fields of two XINDs, provided that the RHS fields intersect only at the root. For example, given the XINDs ((ρ.teaches.course, [cno]) ⊆ (ρ, [offer.course.cno])) and ((ρ.teaches.course, [info.lec]) ⊆ (ρ, [department.lec])), we can derive ((ρ.teaches.course, [cno, info.lec]) ⊆ (ρ, [offer.course.cno, department.lec])), since ρ.offer.course.cno ∩ ρ.department.lec = ρ. However, the XINDs ((ρ.teaches.course, [cno]) ⊆ (ρ, [offer.course.cno])) and ((ρ.teaches.course, [info.sem]) ⊆ (ρ, [offer.course.sem])) do not imply ((ρ.teaches.course, [cno, info.sem]) ⊆ (ρ, [offer.course.cno, offer.course.sem])), since ρ.offer.course.cno ∩ ρ.offer.course.sem ≠ ρ.

We have the following result on the soundness (Σ ⊢ σ ⇒ Σ ⊨ σ) and completeness (Σ ⊨ σ ⇒ Σ ⊢ σ) of our inference rules.

Theorem 3. The set of inference rules R1 - R6 is sound and complete for the implication of core XINDs in complete XML trees.

We now establish some preliminary lemmas which we require in order to establish soundness of the inference rules.

Lemma 7. Let P1, P2, P3 be paths such that P1 ∩ P3 ⊆ P2 ∩ P3, and let v1 ∈ N(P1, T), v2 ∈ N(P2, T), v3 ∈ N(P3, T) be nodes in a tree T such that closest(v1, v2) = true and closest(v2, v3) = true. Then closest(v1, v3) = true.

Proof (Lemma 7). Throughout this proof we use P21, P31 and P32 to denote paths P1 ∩ P2, P1 ∩ P3 and P2 ∩ P3, respectively. Then, given that closest(v1, v2) = true, there exists node v21 according to Definition 8 such that v21 ∈ aancestor(v1), v21 ∈ aancestor(v2) and v21 ∈ N(P21, T). Also, given that closest(v2, v3) = true, there exists node v32 such that v32 ∈ aancestor(v2), v32 ∈ aancestor(v3) and v32 ∈ N(P32, T).

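[Figure 5: two trees rooted at vρ containing nodes v1, v2 and v3 together with the common ancestors v21 (on path R1 ∩ R2) and v32 (on path R2 ∩ R3); in the left tree v32 lies above v21, in the right tree v21 lies above v32.]

Fig. 5. Some Trees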

Further, given that v21 ∈ aancestor(v2) and that v32 ∈ aancestor(v2), it follows that either v21 ∈ aancestor(v32) or v32 ∈ ancestor(v21).

We discuss first the case where v21 ∈ aancestor(v32), which is illustrated in the right of Figure 5. We show that closest(v1, v3) = true by showing that node v21 satisfies (i) - (iii) in Definition 8 w.r.t. nodes v1 and v3. In particular v21 ∈ aancestor(v1) per assumption, which establishes (i) in Definition 8 w.r.t. nodes v1 and v3. Also, v21 ∈ aancestor(v3) since v21 ∈ aancestor(v32) per assumption and v32 ∈ aancestor(v3) as shown above, which establishes (ii) in Definition 8 w.r.t. nodes v1 and v3. Therefore, closest(v1, v3) = true if v21 ∈ N(P31, T). We observe that if P21 = P31 then v21 ∈ N(P31, T), since v21 ∈ N(P21, T) as shown above. We show that P21 = P31 by showing that P21 ⊆ P31 and that also P31 ⊆ P21.

We start with showing that P21 ⊆ P31. We observe that if P21 is a common prefix of paths P1 and P3, then P21 ⊆ P31, since P31 is required to be the longest common prefix of paths P1 and P3. Thereby, P21 ⊆ P1 follows directly from the definition of path intersection. It therefore remains to be shown that also P21 ⊆ P3. Thereby, v21 ∈ aancestor(v3), since v21 ∈ aancestor(v32) per assumption, and v32 ∈ aancestor(v3) as shown above. Given that v21 ∈ aancestor(v3) it follows that P21 ⊆ P3, since v21 ∈ N(P21, T) as shown above and v3 ∈ N(P3, T) per assumption.

We proceed with showing that P31 ⊆ P21. We observe that if P31 is a common prefix of paths P1 and P2, then P31 ⊆ P21, since P21 is required to be the longest common prefix of paths P1 and P2. Thereby, P31 ⊆ P1 follows directly from the definition of path intersection. Also P31 ⊆ P2, since P31 ⊆ P32 per assumption, and P32 ⊆ P2 according to the definition of path intersection.

We discuss now the case where v32 ∈ ancestor(v21), which is illustrated in the left of Figure 5.
We show that closest(v1, v3) = true by showing that node v32 satisfies (i) - (iii) in Definition 8 w.r.t. nodes v1 and v3. In particular v32 ∈ aancestor(v1), since v32 ∈ ancestor(v21) per assumption, and v21 ∈ aancestor(v1) as shown above, which establishes (i) in Definition 8 w.r.t. nodes v1 and v3. Also, v32 ∈ aancestor(v3) as shown above, which establishes (ii) in Definition 8 w.r.t. nodes v1 and v3. Therefore, closest(v1, v3) = true if v32 ∈ N(P31, T). We observe that if P32 = P31, then v32 ∈ N(P31, T), since v32 ∈ N(P32, T) as shown above. We note that P31 ⊆ P32 per assumption, and that therefore P32 = P31 if P32 ⊆ P31. We observe that if P32 is a common prefix of paths P1 and P3, then P32 ⊆ P31, since P31 is required to be the longest common prefix of paths P1 and P3. Thereby, P32 ⊆ P3 follows directly from the definition of path intersection. It therefore remains to be shown that also P32 ⊆ P1. Thereby, v32 ∈ aancestor(v1), since v32 ∈ ancestor(v21) per assumption, and v21 ∈ aancestor(v1) as shown above. Given that v32 ∈ aancestor(v1) it follows that P32 ⊆ P1, since v32 ∈ N(P32, T) as shown above and v1 ∈ N(P1, T) per assumption.

Lemma 8. Given a downward-closed set of legal paths P and a tree T that is complete w.r.t. P, let {P1, …, Pm} and {Pm+1, …, Pn} be non-empty, disjoint subsets of P and let v1, …, vm be a set of nodes in T such that vi ∈ N(Pi, T) for all i ∈ [1, m], and closest(vi, vj) = true for all i, j ∈ [1, m]. Then, there exist nodes vm+1, …, vn in T such that
(i) vi ∈ N(Pi, T) for all i ∈ [m + 1, n], and
(ii) closest(vi, vj) = true for all i, j ∈ [1, n].

Proof (Lemma 8). We show first that if {R1, …, Rx} is a non-empty subset of P and R̄ is a distinct, single path in P, and there exist nodes v̄1, …, v̄x in T such that v̄i ∈ N(Ri, T) for all i ∈ [1, x], and closest(v̄i, v̄j) = true for all i, j ∈ [1, x], then there exists node v̄ in T such that
(a) v̄ ∈ N(R̄, T), and
(b) closest(v̄, v̄i) = true for all i ∈ [1, x].

It follows from Definition 3 and the assumption that the paths in P are legal that Ri ∩ R̄ ⊆ R̄ for all i ∈ [1, x]. From this we conclude that one can choose y ∈ [1, x] such that for all i ∈ [1, x], Ri ∩ R̄ ⊆ Ry ∩ R̄. Then, since v̄y ∈ N(Ry, T) per assumption, there exists node v̂ such that v̂ ∈ N(Ry ∩ R̄, T) and v̂ ∈ aancestor(v̄y).

Further, Ry ∩ R̄ ∈ P since Ry ∩ R̄ ⊆ R̄ per definition, and R̄ ∈ P and P is a downward-closed set of paths, per assumption. Therefore, given that T is complete w.r.t. P and that v̂ ∈ N(Ry ∩ R̄, T), it follows from Definition 11 that there exists node v̄ ∈ N(R̄, T) such that v̂ ∈ aancestor(v̄), which establishes (a).

In order to establish (b), we start with showing that closest(v̄, v̄y) = true. Thereby, closest(v̂, v̄y) = true and closest(v̂, v̄) = true follow directly from Definition 8 and the assumption that v̂ ∈ aancestor(v̄y) and v̂ ∈ aancestor(v̄), respectively. In turn, closest(v̄y, v̄) = true follows from Lemma 7 (when choosing P1 = Ry, P2 = (Ry ∩ R̄), P3 = R̄, and v1 = v̄y, v2 = v̂, v3 = v̄) since Ry ∩ R̄ ⊆ (Ry ∩ R̄) ∩ R̄ follows directly from Definition 3, and v̄y ∈ N(Ry, T), v̂ ∈ N(Ry ∩ R̄, T), and v̄ ∈ N(R̄, T), per assumption.

It therefore remains to be shown that also closest(v̄, v̄i) = true for all i ∈ [1, x], where i ≠ y.
Thereby, for all i ∈ [1, x], closest(v̄i, v̄y) = true per assumption and closest(v̄y, v̄) = true as shown above. In turn, closest(v̄, v̄i) = true follows from Lemma 7 (when choosing P1 = Ri, P2 = Ry, P3 = R̄, and v1 = v̄i, v2 = v̄y, v3 = v̄) since per assumption Ri ∩ R̄ ⊆ Ry ∩ R̄, and also v̄i ∈ N(Ri, T), v̄y ∈ N(Ry, T), and v̄ ∈ N(R̄, T).

Now, Lemma 8 follows immediately, since one can choose for all i ∈ [m+1, n], vi to be the node that exists according to the result just established.

We are now ready to demonstrate soundness of our inference rules. We show in particular that given a downward-closed set of paths P and a set of XINDs Σ ∪ {σ} that conforms to P, Σ ⊢ σ ⇒ Σ ⊨ σ.

Proof (Theorem 3 - soundness).

R1: We show that given a downward-closed set of paths P and a conforming set of XINDs Σ ∪ {σ} where σ = ((P, [P1, …, Pn]) ⊆ (P, [P1, …, Pn])), there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ. We show in particular the strictly stronger result that there does not exist a tree T that is complete w.r.t. P and violates σ. For this purpose assume to the contrary that T ⊭ σ. Then there exist nodes v0, v1, …, vn according to Definition 9 such that
(a) v0 ∈ N(P, T) and
(b) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(c) for all i, j ∈ [1, n], closest(vi, vj) = true,
and there do not exist nodes v′0, v′1, …, v′n such that
(a’) v′0 ∈ N(P, T) and
(b’) for all i ∈ [1, n], v′i ∈ N(P.Pi, T) and v′0 ∈ aancestor(v′i) and
(c’) for all i, j ∈ [1, n], closest(v′i, v′j) = true and
(d’) for all i ∈ [1, n], val(v′i) = val(vi).

We observe that when choosing nodes v′0, v′1, …, v′n to be nodes v0, v1, …, vn, then nodes v′0, v′1, …, v′n satisfy (a’) - (d’), and therefore the existence of nodes v0, v1, …, vn contradicts the absence of nodes v′0, v′1, …, v′n. Consequently, T ⊨ σ.

R2: We show that given a downward-closed set of paths P and a conforming set of XINDs Σ ∪ {σ} where
- ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) ∈ Σ and
- σ = ((P, [Pπ(1), …, Pπ(m)]) ⊆ (P′, [P′π(1), …, P′π(m)])) and
- for all i ∈ [1, m], π(i) ∈ [1, n],
there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ. We show in particular that if T ⊭ σ then T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])). We assume subsequently for the ease of presentation that π(i) = i, for all i ∈ [1, m]. That is, we assume that σ = ((P, [P1, …, Pm]) ⊆ (P′, [P′1, …, P′m])). We note that if m = n, then σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) and consequently T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) if T ⊭ σ. We therefore assume that m < n. Then, given that T ⊭ σ, there exist nodes v0, v1, …, vm according to Definition 9 such that
(a) v0 ∈ N(P, T) and
(b) for all i ∈ [1, m], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(c) for all i, j ∈ [1, m], closest(vi, vj) = true,
and there do not exist nodes v′0, v′1, …, v′m such that
(a’) v′0 ∈ N(P′, T) and
(b’) for all i ∈ [1, m], v′i ∈ N(P′.P′i, T) and v′0 ∈ aancestor(v′i) and
(c’) for all i, j ∈ [1, m], closest(v′i, v′j) = true and
(d’) for all i ∈ [1, m], val(v′i) = val(vi).

Now, according to Lemma 8, the existence of nodes v1, …, vm implies the existence of nodes vm+1, …, vn such that
(e) for all i ∈ [m + 1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(f) for all i, j ∈ [1, n], closest(vi, vj) = true.

It follows therefore from (a) - (f) that T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) because of nodes v0, v1, …, vn, if there do not exist nodes v̄0, v̄1, …, v̄n such that
(a”) v̄0 ∈ N(P′, T) and
(b”) for all i ∈ [1, n], v̄i ∈ N(P′.P′i, T) and v̄0 ∈ aancestor(v̄i) and
(c”) for all i, j ∈ [1, n], closest(v̄i, v̄j) = true and
(d”) for all i ∈ [1, n], val(v̄i) = val(vi).

We observe that the existence of nodes v̄0, v̄1, …, v̄n contradicts the absence of nodes v′0, v′1, …, v′m, since (a”) - (d”) establish (a’) - (d’), respectively, and therefore T ⊭ σ implies that T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])).

R3: We show that given a downward-closed set of paths P and a conforming set of XINDs Σ ∪ {σ} where
- ((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n])) ∈ Σ and
- ((P̄, [P̄1, …, P̄n]) ⊆ (P′, [P′1, …, P′n])) ∈ Σ and
- σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])),
there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ. We show in particular that if T ⊭ σ then T ⊭ {((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n]))} ∪ {((P̄, [P̄1, …, P̄n]) ⊆ (P′, [P′1, …, P′n]))}. Thereby, given that T ⊭ σ, there exist nodes v0, v1, …, vn such that
(a) v0 ∈ N(P, T) and
(b) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(c) for all i, j ∈ [1, n], closest(vi, vj) = true,
and there do not exist nodes v′0, v′1, …, v′n such that
(a’) v′0 ∈ N(P′, T) and
(b’) for all i ∈ [1, n], v′i ∈ N(P′.P′i, T) and v′0 ∈ aancestor(v′i) and
(c’) for all i, j ∈ [1, n], closest(v′i, v′j) = true and
(d’) for all i ∈ [1, n], val(v′i) = val(vi).

Further, given that nodes v0, v1, …, vn that satisfy (a) - (c) exist, it follows from Definition 9 that T ⊭ ((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n])) if there do not exist nodes v̄0, v̄1, …, v̄n such that
(e) v̄0 ∈ N(P̄, T), and
(f) for all i ∈ [1, n], v̄i ∈ N(P̄.P̄i, T) and v̄0 ∈ aancestor(v̄i), and
(g) for all i, j ∈ [1, n], closest(v̄i, v̄j) = true, and
(h) for all i ∈ [1, n], val(v̄i) = val(vi).

Now, T ⊭ ((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n])) if nodes v̄0, v̄1, …, v̄n that satisfy (e) - (h) do not exist. However, if instead nodes v̄0, v̄1, …, v̄n exist, then T ⊨ ((P̄, [P̄1, …, P̄n]) ⊆ (P′, [P′1, …, P′n])) according to Definition 9 iff there exist nodes v′0, v′1, …, v′n that satisfy (a’) - (d’). We note that (d’) equals the condition that for all i ∈ [1, n], val(v′i) = val(v̄i), since for all i ∈ [1, n], val(v̄i) = val(vi) according to (h). Since nodes satisfying (a’) - (d’) do not exist by assumption, the second XIND is violated in this case. Consequently, T ⊭ σ ⇒ T ⊭ {((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n]))} ∪ {((P̄, [P̄1, …, P̄n]) ⊆ (P′, [P′1, …, P′n]))}.

R4: We show that given a downward-closed set of paths P and a conforming set of XINDs Σ ∪ {σ} where
- ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n])) ∈ Σ and
- σ = ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n])),
there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ. We show in particular that T ⊭ σ ⇒ T ⊭ ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n])). Thereby, given that T ⊭ σ, there exist nodes v0, v1, …, vn such that
(a) v0 ∈ N(P, T) and
(b) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(c) for all i, j ∈ [1, n], closest(vi, vj) = true,
and there do not exist nodes v′0, v′1, …, v′n such that
(a’) v′0 ∈ N(P′, T) and
(b’) for all i ∈ [1, n], v′i ∈ N(P′.R.P′i, T) and v′0 ∈ aancestor(v′i) and
(c’) for all i, j ∈ [1, n], closest(v′i, v′j) = true and
(d’) for all i ∈ [1, n], val(v′i) = val(vi).

However, given that nodes v0, v1, …, vn exist, T ⊭ ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n])) according to Definition 9, if there do not exist nodes v̄0, v̄1, …, v̄n such that
(a”) v̄0 ∈ N(P′.R, T) and
(b”) for all i ∈ [1, n], v̄i ∈ N(P′.R.P′i, T) and v̄0 ∈ aancestor(v̄i) and
(c”) for all i, j ∈ [1, n], closest(v̄i, v̄j) = true and
(d”) for all i ∈ [1, n], val(v̄i) = val(vi).

We show next that the existence of nodes v̄0, v̄1, …, v̄n that satisfy (a”) - (d”) contradicts the absence of nodes v′0, v′1, …, v′n that satisfy (a’) - (d’), i.e. that T ⊭ ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n])) if T ⊭ σ. Thereby, since v̄0 ∈ N(P′.R, T) per assumption, one can choose node v′0 from the set of ancestor nodes of v̄0 such that v′0 ∈ N(P′, T), which establishes (a’). Further, given that v′0 ∈ ancestor(v̄0), it follows from (b”) that v′0 ∈ aancestor(v̄i) for all i ∈ [1, n]. It follows from this together with (b”) - (d”) that the existence of nodes v̄0, v̄1, …, v̄n implies the existence of nodes v′0, v′1, …, v′n, since one can choose v′i = v̄i for all i ∈ [1, n].

R5: We show that given a downward-closed set of paths P and a conforming set of XINDs Σ ∪ {σ} where
- ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n])) ∈ Σ and
- σ = ((P, [P1, …, Pn]) ⊆ (P′.R, [P′1, …, P′n])),
there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ. We show in particular that T ⊭ σ ⇒ T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n])). Thereby, given that T ⊭ σ, there exist nodes v0, v1, …, vn such that
(a) v0 ∈ N(P, T) and
(b) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(c) for all i, j ∈ [1, n], closest(vi, vj) = true,
and there do not exist nodes v′0, v′1, …, v′n such that
(a’) v′0 ∈ N(P′.R, T) and
(b’) for all i ∈ [1, n], v′i ∈ N(P′.R.P′i, T) and v′0 ∈ aancestor(v′i) and
(c’) for all i, j ∈ [1, n], closest(v′i, v′j) = true and
(d’) for all i ∈ [1, n], val(v′i) = val(vi).

However, given that nodes v0, v1, …, vn exist, T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n])) according to Definition 9, if there do not exist nodes v̄0, v̄1, …, v̄n such that
(a”) v̄0 ∈ N(P′, T) and
(b”) for all i ∈ [1, n], v̄i ∈ N(P′.R.P′i, T) and v̄0 ∈ aancestor(v̄i) and
(c”) for all i, j ∈ [1, n], closest(v̄i, v̄j) = true and
(d”) for all i ∈ [1, n], val(v̄i) = val(vi).

We show next that the existence of nodes v̄0, v̄1, …, v̄n that satisfy (a”) - (d”) contradicts the absence of nodes v′0, v′1, …, v′n that satisfy (a’) - (d’), i.e. that T ⊭ ((P, [P1, …, Pn]) ⊆ (P′, [R.P′1, …, R.P′n])) if T ⊭ σ. Thereby, since per assumption v̄i ∈ N(P′.R.P′i, T) for all i ∈ [1, n], there exists node ṽi such that ṽi ∈ N(P′.R, T) and ṽi ∈ aancestor(v̄i) for all i ∈ [1, n]. Also, the assumption that closest(v̄i, v̄j) = true for all i, j ∈ [1, n] implies that ṽi = ṽj for all i, j ∈ [1, n], which is what we show next. Given that closest(v̄i, v̄j) = true, there exists node v̄ⁱⱼ according to Definition 8 such that v̄ⁱⱼ ∈ N(P′.R.P′i ∩ P′.R.P′j, T) and v̄ⁱⱼ ∈ aancestor(v̄i) and v̄ⁱⱼ ∈ aancestor(v̄j). Further, given that v̄ⁱⱼ ∈ aancestor(v̄i) and that also ṽi ∈ aancestor(v̄i), it follows that ṽi ∈ aancestor(v̄ⁱⱼ), since ṽi ∈ N(P′.R, T) and v̄ⁱⱼ ∈ N(P′.R.P′i ∩ P′.R.P′j, T) per assumption and P′.R ⊆ (P′.R.P′i ∩ P′.R.P′j). Analogously, given that v̄ⁱⱼ ∈ aancestor(v̄j) and that also ṽj ∈ aancestor(v̄j), it follows that ṽj ∈ aancestor(v̄ⁱⱼ), since ṽj ∈ N(P′.R, T) and v̄ⁱⱼ ∈ N(P′.R.P′i ∩ P′.R.P′j, T) per assumption and P′.R ⊆ (P′.R.P′i ∩ P′.R.P′j). Now, given that ṽi ∈ aancestor(v̄ⁱⱼ) and that also ṽj ∈ aancestor(v̄ⁱⱼ) and that {ṽi, ṽj} ⊆ N(P′.R, T), it follows that ṽi = ṽj.

Therefore, one can choose v′0 to be the node such that for all i ∈ [1, n], v′0 = ṽi. Consequently, there exists node v′0 such that v′0 ∈ N(P′.R, T) and for all i ∈ [1, n], v′0 ∈ aancestor(v̄i), which establishes (a’). It follows from this together with (b”) - (d”) that the existence of nodes v̄0, v̄1, …, v̄n implies the existence of nodes v′0, v′1, …, v′n, since one can choose for all i ∈ [1, n], v′i = v̄i.

R6: We show that given a downward-closed set of paths P and a conforming set of XINDs Σ ∪ {σ} such that
- ((P, [P1, …, Pm]) ⊆ (ρ, [P′1, …, P′m])) ∈ Σ and
- ((P, [Pm+1, …, Pn]) ⊆ (ρ, [P′m+1, …, P′n])) ∈ Σ and
- ρ.P′i ∩ ρ.P′j = ρ for all (i, j) ∈ [1, m] × [m+1, n] and
- σ = ((P, [P1, …, Pn]) ⊆ (ρ, [P′1, …, P′n])),
there does not exist a tree T that is complete w.r.t. P such that T ⊨ Σ but T ⊭ σ. We show in particular that if T ⊭ σ then either T ⊭ ((P, [P1, …, Pm]) ⊆ (ρ, [P′1, …, P′m])) or T ⊭ ((P, [Pm+1, …, Pn]) ⊆ (ρ, [P′m+1, …, P′n])). Thereby, given that T ⊭ σ, there exist nodes v0, v1, …, vn such that
(a) v0 ∈ N(P, T) and
(b) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi) and
(c) for all i, j ∈ [1, n], closest(vi, vj) = true,
and there do not exist nodes v′0, v′1, …, v′n such that
(a’) v′0 ∈ N(ρ, T) and
(b’) for all i ∈ [1, n], v′i ∈ N(ρ.P′i, T) and v′0 ∈ aancestor(v′i) and
(c’) for all i, j ∈ [1, n], closest(v′i, v′j) = true and
(d’) for all i ∈ [1, n], val(v′i) = val(vi).

However, given that nodes v0 , v1 , . . . , vm exist that satisfy (a) - (c), T ])) according to Definition 9 if there do not (P, ([P1 , . . . , Pm ]) ⊆ (P , ([P1 , . . . , Pm exist nodes v¯1 , . . . , v¯m such that vi ) and (a”) for all i ∈ [1, m], v¯i ∈ N(P .Pi , T) and v0 ∈ aancestor(¯ (b”) for all i, j ∈ [1, m], closest(¯ vi , v¯j ) = true, and (c”) for all i ∈ [1, m], val(¯ vi ) = val(vi ). Analogously, given that nodes v0 , vm+1 , . . . , vn exist that satisfy (a) - (c), , . . . , Pn ])) according to Definition 9 if T (P, ([Pm+1 , . . . , Pn ]) ⊆ (P , ([Pm+1 , . . . , v¯n such that there do not exist nodes v¯m+1 (a”’) for all i ∈ [m+1, n], v¯i ∈ N(P .Pi , T) and v0 ∈ aancestor(¯ vi ) and (b”’) for all i, j ∈ [m+1, n], closest(¯ vi , v¯j ) = true and (c”’) for all i ∈ [m+1, n], val(¯ vi ) = val(vi ). Consequently, the existence of nodes v0 , v1 , . . . , vn contradicts the absence of vi , v¯j ) = true, since nodes v0 , v1 , . . . , vn if for all i, j ∈ [1, m] × [m + 1, n], closest(¯ then nodes v0 , v¯1 , . . . , v¯n satisfy (a’) - (d’). Therefore, if for all i, j ∈ [1, m]× [m+ 1, n], closest(¯ vi , v¯j ) = true, then T σ implies that T {(P, ([P1 , . . . , Pm ]) ⊆ ]))} ∪ {(P, ([Pm+1 , . . . , Pn ]) ⊆ (P , ([Pm+1 , . . . , Pn ]))}. (P , ([P1 , . . . , Pm We finish the proof with showing that closest(¯ vi , v¯j ) = true for all i, j ∈ [1, m] × [m + 1, n]. In particular the root node vρ of T satisfies (i) - (iii) in Definition 8 w.r.t. nodes v¯i and v¯j for all i, j ∈ [1, m] × [m+1, n] since – vρ ∈ aancestor(¯ vi ) given that T is a tree and also vj ) given that T is a tree, and – vρ ∈ aancestor(¯ – vρ ∈ N(P .Pi ∩ P .Pj , T), since P .Pi ∩ P .Pj = ρ by assumption.
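The last step of this proof relies only on the three conditions of Definition 8 as they are quoted above: a common ancestor-or-self node that lies on the intersection of the two paths. The following Python sketch illustrates that reading. It is not the report's definition verbatim: the `Node` class, the helper names, and the interpretation of path intersection as the longest common prefix are assumptions made for illustration.

```python
class Node:
    """Hypothetical tree node: a label plus a parent pointer (None for the root)."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def aancestors(v):
    """Return v followed by all of its ancestors up to the root (ancestor-or-self)."""
    chain = []
    while v is not None:
        chain.append(v)
        v = v.parent
    return chain

def path(v):
    """Path of v as a tuple of labels from the root down to v."""
    return tuple(n.label for n in reversed(aancestors(v)))

def path_intersection(p, q):
    """Assumed reading of P ∩ Q: the longest common prefix of the two paths."""
    k = 0
    while k < min(len(p), len(q)) and p[k] == q[k]:
        k += 1
    return p[:k]

def closest(v1, v2):
    """True if some common ancestor-or-self node lies on path(v1) ∩ path(v2)."""
    target = path_intersection(path(v1), path(v2))
    shared = {id(a) for a in aancestors(v2)}
    return any(id(a) in shared and path(a) == target for a in aancestors(v1))

# Small example tree (labels are illustrative only).
root = Node("root")
offer1, offer2 = Node("offer", root), Node("offer", root)
cno1, title1 = Node("cno", offer1), Node("title", offer1)
cno2 = Node("cno", offer2)
teaches = Node("teaches", root)
tcno = Node("cno", teaches)
```

In this example, `closest(cno1, title1)` holds because both nodes hang under the same `offer` node, `closest(cno2, title1)` fails because the only shared ancestor is the root while the path intersection is `root.offer`, and `closest(tcno, cno1)` holds because the paths intersect only in the root path, which is the situation used for the root node vρ at the end of the proof.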

Algorithm 2 InitTree(P, σ)
in:  A downward-closed sequence of legal paths P = [R1, …, Rm] ordered by length
     An XIND σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])) that conforms to P
out: An initial tree T = (V, E, lab, val, vρ) that is complete w.r.t. P

 1: let T be a tree that exclusively contains the root node vρ;
 2: for i := 2 to m do
 3:   let v̂ be the node in N(parent(Ri), T);
 4:   create a new node v and add v to V;
 5:   set lab(v) = last(Ri);
 6:   add (v̂, v) to E such that v is the last child of v̂;
 7:   if there exists path Pj ∈ [P1, …, Pn] such that Ri = P.Pj then
 8:     set val(v) = j;
 9:   else if last(Ri) is an attribute or text label then
10:     set val(v) = "0";
11:   end if
12: end for
13: return T;
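The steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: paths are modelled as tuples of labels, the tree is returned as a dictionary from path to node (one node per path), values are stored as strings, and the predicate `is_value_label`, which decides whether a label is an attribute or text label, is a hypothetical parameter.

```python
def init_tree(paths, selector, fields, is_value_label):
    """Build the initial tree for an XIND with LHS selector `selector` and LHS fields `fields`.

    paths          -- downward-closed list of paths (tuples of labels), ordered by length;
                      paths[0] is assumed to be the root path
    selector       -- LHS selector path P as a tuple
    fields         -- LHS field paths [P1, ..., Pn], each a tuple relative to the selector
    is_value_label -- hypothetical predicate: True for attribute or text labels
    """
    root_path = paths[0]
    # Line 1: the tree initially contains only the root node.
    nodes = {root_path: {"label": root_path[-1], "val": None, "children": []}}
    for r in paths[1:]:                                   # Line 2: i = 2 .. m
        parent = nodes[r[:-1]]                            # Line 3: node for parent(Ri)
        node = {"label": r[-1], "val": None, "children": []}   # Lines 4-5
        parent["children"].append(node)                   # Line 6: append as last child
        for j, field in enumerate(fields, start=1):
            if r == selector + field:                     # Line 7: Ri = P.Pj
                node["val"] = str(j)                      # Line 8 (stored as a string here)
                break
        else:
            if is_value_label(r[-1]):                     # Line 9
                node["val"] = "0"                         # Line 10
        nodes[r] = node
    return nodes                                          # Line 13

# Illustrative input: attribute labels are assumed to start with "@".
example_paths = [("r",), ("r", "a"), ("r", "b"),
                 ("r", "a", "@x"), ("r", "a", "@y"), ("r", "b", "@z")]
example_tree = init_tree(example_paths, ("r", "a"), [("@x",), ("@y",)],
                         lambda label: label.startswith("@"))
```

Because each path receives exactly one node, the sketch makes property (i) of Lemma 9 directly visible: the node created for Ri is the only node reachable over Ri.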

We now turn to the completeness of our inference rules. A roadmap for the proof is as follows. We use Algorithm 2 to construct a special initial tree Tσ, which essentially has LHS field nodes with distinct values w.r.t. the XIND σ and is empty elsewhere. We then show by induction that the only XINDs satisfied by any intermediate tree during the chase are those derivable from Σ using rules R1 - R6. That is, if T̄σ ⊨ σ, where T̄σ is the final tree returned by the application Chase(P, Tσ, Σ), then Σ ⊢ σ and thus Σ ⊨ σ ⇒ Σ ⊢ σ, since T̄σ ⊨ σ if Σ ⊨ σ from Lemma 1. Lemmas 9 and 10 establish some properties of the procedure of Algorithm 2, and the tree resulting from an application of Algorithm 2, respectively.

Lemma 9. Whenever a node v is created within an application InitTree(P, σ), where P = [R1, …, Rm], then
(i) {v} = N(Ri, T), where i denotes the iteration of the loop at Line 2 within which node v was created, and
(ii) closest(v, ṽ) = true, where ṽ is a node in T.

Proof (Lemma 9). some proof.

Lemma 10. An application InitTree(P, σ) of Algorithm 2, where σ is of the form σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])), terminates and returns a unique tree T that is complete w.r.t. P, and contains nodes v0, v1, …, vn such that
(i) v0 ∈ N(P, T), and
(ii) for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi), and
(iii) for all i, j ∈ [1, n], closest(vi, vj) = true, and
(iv) for all i ∈ [1, n], val(vi) = i.

Proof (Lemma 10). some proof.

For the proof of completeness of the inference rules, we require the notion of the intersection node for a given set of nodes, which is what we define next.

Definition 18. Given a set of nodes v1, …, vn in a tree T, node ṽ is defined to be the intersection node for nodes v1, …, vn, denoted by ṽ = v1 ∩ ··· ∩ vn, if ṽ ∈ aancestor(vi) for all i ∈ [1, n], and there does not exist node v̂ such that
(i) v̂ ∈ aancestor(vi) for all i ∈ [1, n], and
(ii) ṽ ∈ ancestor(v̂).

We note that there exists an intersection node ṽ for every set of nodes v1, …, vn in a tree T, since every node is connected to the root node of T, given that T conforms to Definition 1. Also, it follows directly from Definition 18 that ṽ is unique for the set of nodes v1, …, vn. The next lemma establishes some properties of the intersection node of a given set of nodes, which we frequently require in the completeness proof.

Lemma 11. Given a set of nodes v1, …, vn in a tree T, and a set of paths P.P1, …, P.Pn such that for all i ∈ [1, n], vi ∈ N(P.Pi, T). If there exists node v̂ such that v̂ ∈ aancestor(vi) for all i ∈ [1, n], then
(i) v̂ ∈ aancestor(ṽ), where ṽ = v1 ∩ ··· ∩ vn, and
(ii) P̂ ⊆ P̃, where P̂ and P̃ are the paths such that v̂ ∈ N(P̂, T) and ṽ ∈ N(P̃, T).

Proof (Lemma 11). some proof.
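Definition 18 describes what is commonly called the lowest common ancestor, with the twist that a node counts as its own ancestor (aancestor). A minimal Python sketch of this reading follows; the representation of the tree as a `parent` dictionary and the function names are assumptions for illustration only.

```python
def aancestors(v, parent):
    """Return v followed by its ancestors up to the root (ancestor-or-self chain)."""
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain

def intersection_node(nodes, parent):
    """Deepest node that is an ancestor-or-self of every node in `nodes`."""
    common = aancestors(nodes[0], parent)      # ordered from nodes[0] up to the root
    for v in nodes[1:]:
        anc = set(aancestors(v, parent))
        common = [a for a in common if a in anc]
    # The root is always common, so `common` is non-empty; its first entry is the deepest.
    return common[0]

# Illustrative tree: r has children a and d; a has children b and c.
example_parent = {"r": None, "a": "r", "b": "a", "c": "a", "d": "r"}
```

Since `common` is always a suffix of the ancestor chain of the first node, every common ancestor-or-self of the input nodes is itself an ancestor-or-self of the returned node, which is the content of part (i) of Lemma 11.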

We are now ready to demonstrate completeness of our inference rules.

Proof (Theorem 3 - completeness). We show that given a downward-closed set of legal paths P and a conforming set of XINDs Σ ∪ {σ}, Σ ⊨ σ ⇒ Σ ⊢ σ. For this purpose, let σ be of the general form σ = ((P, [P1, …, Pn]) ⊆ (P′, [P′1, …, P′n])). Also, let T be the tree generated by the application InitTree(P, σ) of Algorithm 2, and let T̄ be the final tree returned by the application Chase(P, T, Σ) of Algorithm 1. Then, T̄ ⊨ Σ according to Lemma 1 and therefore, if T̄ ⊭ σ, then T̄ witnesses the contradiction that Σ ⊭ σ. If instead T̄ ⊨ σ, then in order to show that Σ ⊢ σ, we claim that whenever there exist nodes v̄1, …, v̄n in tree T̄ such that

- for all i, j ∈ [1, n], closest(v̄i, v̄j) = true, and
- for all i ∈ [1, n], val(v̄i) = i,

then Σ ⊢ σ̄ of the form σ̄ = ((P, [P1, …, Pn]) ⊆ (P̄, [P̄1, …, P̄n])), where

- P̄ is the path such that v̄1 ∩ ··· ∩ v̄n ∈ N(P̄, T̄), and
- for all i ∈ [1, n], P̄i is the path such that v̄i ∈ N(P̄.P̄i, T̄).

We show first that the claim is sufficient. That is, we show that if the claim holds true, then Σ ⊢ σ if T̄ ⊨ σ. Thereby, according to Lemma 10, T contains nodes v0, v1, …, vn such that

- v0 ∈ N(P, T), and
- for all i ∈ [1, n], vi ∈ N(P.Pi, T) and v0 ∈ aancestor(vi), and
- for all i, j ∈ [1, n], closest(vi, vj) = true, and
- for all i ∈ [1, n], val(vi) = i.

Further, T ⊆ T̄ according to Lemma 1, and therefore Lemma 2 together with Definition 16 implies that nodes v0, v1, …, vn also exist in T̄. In turn, Definition 9 implies that if T̄ ⊨ σ then there exist nodes v′0, v′1, …, v′n in T̄ such that

- v′0 ∈ N(P′, T̄), and
- for all i ∈ [1, n], v′i ∈ N(P′.P′i, T̄) and v′0 ∈ aancestor(v′i), and
- for all i, j ∈ [1, n], closest(v′i, v′j) = true, and
- for all i ∈ [1, n], val(v′i) = val(vi) = i.

Thereby, given that for all i, j ∈ [1, n], closest(v′i, v′j) = true, and that for all i ∈ [1, n], val(v′i) = i, it follows that the claim applies to nodes v′1, …, v′n. Therefore, given that the claim holds true, Σ derives the XIND

    (P, [P1, …, Pn]) ⊆ (P̃, [P̃1, …, P̃n])    (1)

where

- P̃ is the path such that v′1 ∩ ··· ∩ v′n ∈ N(P̃, T̄), and
- for all i ∈ [1, n], P̃i is the path such that v′i ∈ N(P̃.P̃i, T̄).

Now, given that for all i ∈ [1, n], v′0 ∈ aancestor(v′i), it follows from (i) in Lemma 11 that v′0 ∈ aancestor(v′1 ∩ ··· ∩ v′n). In turn, (ii) in Lemma 11 implies that P′ ⊆ P̃, since v′1 ∩ ··· ∩ v′n ∈ N(P̃, T̄) per assumption. Thereby, if P′ = P̃, then it follows that also P′i = P̃i for all i ∈ [1, n], since per assumption v′i ∈ N(P′.P′i, T̄) and also v′i ∈ N(P̃.P̃i, T̄), and therefore P′.P′i = P̃.P̃i for all i ∈ [1, n]. Thus, σ equals (1) if P′ = P̃. Hence Σ ⊢ σ in case that P′ = P̃. If instead P′ ⊂ P̃, then let R be the path such that P′.R = P̃. Then, applying rule R5 (Downshift) to (1) yields

    (P, [P1, …, Pn]) ⊆ (P′, [R.P̃1, …, R.P̃n])    (2)

Thereby, given that P′.R = P̃, and that for all i ∈ [1, n], P′.P′i = P̃.P̃i, it follows that P′.P′i = P′.R.P̃i for all i ∈ [1, n]. Consequently, R.P̃i = P′i for all i ∈ [1, n], and therefore σ equals (2). Hence, Σ ⊢ σ also in case that P′ ⊂ P̃.
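The path manipulation behind the two shift rules, as they are used in this proof, can be illustrated with a small Python sketch. Only the right-hand side of an XIND changes, so the functions below operate on an RHS selector and its field list, with paths modelled as tuples of labels. The function names are hypothetical, and the sketch covers only the path arithmetic visible in steps such as (1) to (2); the full statements of R4 and R5 are given earlier in the report and may carry side conditions that are not modelled here.

```python
def downshift(rhs_selector, rhs_fields, new_selector):
    """Shorten the RHS selector to `new_selector` and prefix every field with the cut-off part R.

    Mirrors the step from (P~, [P~1, ..., P~n]) to (P', [R.P~1, ..., R.P~n]) with P'.R = P~.
    """
    if rhs_selector[:len(new_selector)] != new_selector:
        raise ValueError("new selector must be a prefix of the current selector")
    r = rhs_selector[len(new_selector):]
    return new_selector, [r + field for field in rhs_fields]

def upshift(rhs_selector, rhs_fields, r):
    """Extend the RHS selector by `r` and strip `r` from the front of every field.

    Mirrors the step from (P, [R.P1, ..., R.Pn]) to (P.R, [P1, ..., Pn]).
    """
    if any(field[:len(r)] != r for field in rhs_fields):
        raise ValueError("every field must start with the path r")
    return rhs_selector + r, [field[len(r):] for field in rhs_fields]

# Illustrative paths only.
shifted_down = downshift(("r", "offer", "course"), [("cno",)], ("r", "offer"))
shifted_up = upshift(("r", "offer"), [("course", "cno")], ("course",))
```

On the level of paths, the two operations are inverse to each other, which is why the proof can move the RHS selector to whichever prefix is needed before comparing the resulting XIND with σ.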

We demonstrate the claim by induction over the trees generated by the chase. We assume for this purpose that the application Chase(P, T, Σ) of Algorithm 1 terminates after f steps. Then, the sequence of input trees for the steps of the application Chase(P, T, Σ) is given by T1, …, Tf, where T1 = T and Tf = T̄.

Base Case: We use tree T1 as base case. Thereby, by applying rule R1 (Reflexivity), Σ derives the XIND

    (P, [P1, …, Pn]) ⊆ (P, [P1, …, Pn])    (3)

Now, assume for the moment that if the claim applies to a set of nodes v̄1, …, v̄n in tree T1 = T, then for all i ∈ [1, n], v̄i = vi, where vi is the node in N(P.Pi, T) according to Lemma 10. Then, for all i ∈ [1, n], P̄.P̄i = P.Pi, since v̄i ∈ N(P̄.P̄i, T1) per assumption, and also v̄i ∈ N(P.Pi, T1), given that v̄i = vi and that T1 = T. Therefore, if P̄ = P, then σ̄ equals (3). Hence, Σ ⊢ σ̄ in case that P̄ = P.

We proceed with showing that Σ ⊢ σ̄ also if P̄ ≠ P. Thereby, given that v̄i = vi for all i ∈ [1, n], it follows that for all i ∈ [1, n], v0 ∈ aancestor(v̄i), since v0 ∈ aancestor(vi) per assumption. It follows from this together with (ii) in Lemma 11 that P ⊆ P̄, since per assumption v0 ∈ N(P, T1), and v̄1 ∩ ··· ∩ v̄n ∈ N(P̄, T1) and v̄i ∈ N(P̄.P̄i, T1) for all i ∈ [1, n]. In turn, P ⊂ P̄ follows from the assumption that P ≠ P̄. Consequently, there exists path R such that P.R = P̄. From this together with the assumption that P̄.P̄i = P.Pi for all i ∈ [1, n], it follows that for all i ∈ [1, n], P.R.P̄i = P.Pi, and in turn R.P̄i = Pi. Therefore, by applying rule R4 (Upshift) to (3), Σ derives the XIND

    (P, [P1, …, Pn]) ⊆ (P.R, [P̄1, …, P̄n])    (4)

Observing that P.R = P̄, we conclude that (4) equals σ̄, and hence Σ ⊢ σ̄ in case that P̄ ≠ P.

It therefore remains to be shown that for all i ∈ [1, n], v̄i = vi. Thereby, since val(v̄i) = i for all i ∈ [1, n], it follows that for all i ∈ [1, n], val(v̄i) ≠ 0. From this together with the assumption that T1 is the tree returned by the application InitTree(P, σ) of Algorithm 2, it follows that for all i ∈ [1, n], val(v̄i) was set at Line 8 in Algorithm 2. Further, since the sequence of RHS fields [P1, …, Pn] does not contain duplicates according to Definition 7, and val(vi) = i for all i ∈ [1, n] per assumption, it follows that Pi ∈ [P1, …, Pn] is the path for which the condition at Line 7 is met. In turn, (i) in Lemma 9 implies that for all i ∈ [1, n], {v̄i} = N(P.Pi, T1), and thus v̄i = vi follows from the assumption that vi ∈ N(P.Pi, T1), which finishes the base case.

Inductive Step: Suppose that the claim holds true until step s of the application Chase(P, T, Σ) of Algorithm 1, and we show that it also holds true for step s+1. We note that if nodes v̄1, …, v̄n exist in tree Ts, then Σ ⊢ σ̄ according to the inductive assumption. We assume therefore that a non-empty subset of nodes v̄1, …, v̄n exists in tree Ts+1, but does not exist in tree Ts. Based on this assumption, we now distinguish whether (a) the entire set of nodes v̄1, …, v̄n exclusively exists in tree Ts+1, or (b) there exists also a non-empty subset of nodes v̄1, …, v̄n in tree Ts.

(a) Given that nodes v̄1, …, v̄n exclusively exist in tree Ts+1, it follows that nodes v̄1, …, v̄n were created in step s. Now, let σs be the XIND chosen at Line

2 within step s of the application Chase(P, T, Σ) of Algorithm 1, and let σs be of the form σs = ((P̃, [P̃1, …, P̃k]) ⊆ (P̃′, [P̃′1, …, P̃′k])). Also, let φ : [1, n] → [1, m], where m is the number of paths in P, be the mapping such that for all i ∈ [1, n], v̄i is created in iteration φ(i) of the loop at Line 12 within step s of the application of Algorithm 1. Then, since for all i ∈ [1, n], val(v̄i) = i per assumption, and thus val(v̄i) ≠ 0, it follows that val(v̄i) is set at Line 19 in iteration φ(i) of the loop at Line 12. Consequently, the condition at Line 18 is met in iteration φ(i), for all i ∈ [1, n], and therefore R̄φ(i) ∈ [P̃′.P̃′1, …, P̃′.P̃′k], where R̄φ(i) is the path in P in iteration φ(i) of the loop at Line 12.

Given that {R̄φ(1), …, R̄φ(n)} ⊆ {P̃′.P̃′1, …, P̃′.P̃′k}, it follows that there exists mapping π : [1, n] → [1, k] such that R̄φ(i) = P̃′.P̃′π(i) for all i ∈ [1, n]. Therefore, by applying rule R2 (Permutated Projection) to σs, Σ derives the XIND

    (P̃, [P̃π(1), …, P̃π(n)]) ⊆ (P̃′, [P̃′π(1), …, P̃′π(n)])    (5)

Thereby, since for all i ∈ [1, n], v̄i ∈ N(P̄.P̄i, Ts+1) per assumption, and also v̄i ∈ N(R̄φ(i), Ts+1) according to (i) in Lemma 5, it follows that R̄φ(i) = P̄.P̄i. Consequently, also P̄.P̄i = P̃′.P̃′π(i) for all i ∈ [1, n], since R̄φ(i) = P̃′.P̃′π(i) as shown above. In turn, P̄.P̄1 ∩ ··· ∩ P̄.P̄n = P̃′.P̃′π(1) ∩ ··· ∩ P̃′.P̃′π(n), and therefore P̄ = P̃′.P̃′π(1) ∩ ··· ∩ P̃′.P̃′π(n) follows, given that v̄1 ∩ ··· ∩ v̄n ∈ N(P̄, Ts+1), and that v̄i ∈ N(P̄.P̄i, Ts+1) for all i ∈ [1, n]. From this together with the observation that P̃′ ⊆ P̃′.P̃′π(1) ∩ ··· ∩ P̃′.P̃′π(n), we conclude that P̃′ ⊆ P̄.

Thereby, if P̃′ ⊂ P̄, then there exists path R̃ such that P̄ = P̃′.R̃. Further, since for all i ∈ [1, n], P̄.P̄i = P̃′.P̃′π(i) as shown above, it follows that for all i ∈ [1, n], P̃′.R̃.P̄i = P̃′.P̃′π(i), and in turn R̃.P̄i = P̃′π(i). Therefore, by applying rule R4 (Upshift) to (5), Σ derives the XIND

    (P̃, [P̃π(1), …, P̃π(n)]) ⊆ (P̃′.R̃, [P̄1, …, P̄n])    (6)

Thereby, given that P̃′.R̃ = P̄, it follows that (6) equals the XIND

    (P̃, [P̃π(1), …, P̃π(n)]) ⊆ (P̄, [P̄1, …, P̄n])    (7)

We note that if instead P̃′ = P̄, then (5) equals (7), since for all i ∈ [1, n], P̄.P̄i = P̃′.P̃′π(i), as shown above. Therefore, Σ derives the XIND (7) in both cases.

Now, let ṽ0, ṽ1, …, ṽk be the remaining sequence of nodes at Line 10 in step s of the application of Algorithm 1. That is, ṽ0, ṽ1, …, ṽk is the sequence of nodes that violates σs in Ts. Then, it follows from Definition 9 that

- ṽ0 ∈ N(P̃, Ts), and
- for all i ∈ [1, k], ṽi ∈ N(P̃.P̃i, Ts) and ṽ0 ∈ aancestor(ṽi), and
- for all i, j ∈ [1, k], closest(ṽi, ṽj) = true.

We observe that the claim applies to nodes ṽπ(1), …, ṽπ(n) if val(ṽπ(i)) = i for all i ∈ [1, n], given that closest(ṽi, ṽj) = true for all i, j ∈ [1, k]. Thereby, given that for all i ∈ [1, n], v̄i is newly created in step s, and that v̄i ∈ N(R̄φ(i), Ts), it follows from Line 19 in Algorithm 1, together with the assumption that ṽ0, ṽ1, …, ṽk is the sequence of nodes at Line 10 in the application of Algorithm 1, that val(ṽπ(i)) = val(v̄i) for all i ∈ [1, n], since R̄φ(i) = P̃′.P̃′π(i) as shown above. Also, for all i ∈ [1, n], val(v̄i) = i per assumption and consequently val(ṽπ(i)) = i.

Given that the claim applies to nodes ṽ0, ṽ1, …, ṽk, it follows from the inductive assumption that Σ derives the XIND

    (P, [P1, …, Pn]) ⊆ (P̂, [P̂π(1), …, P̂π(n)])    (8)

where

- P̂ is the path such that ṽπ(1) ∩ ··· ∩ ṽπ(n) ∈ N(P̂, T̄s), and
- for all i ∈ [1, n], P̂π(i) is the path such that ṽπ(i) ∈ N(P̂.P̂π(i), T̄s).

We note that for all i ∈ [1, k], ṽπ(i) ∈ N(P̃.P̃π(i), T̄s) follows from the definition of π and the assumption that ṽi ∈ N(P̃.P̃i, Ts). Thus, given that for all i ∈ [1, n], ṽπ(i) ∈ N(P̂.P̂π(i), T̄s), it follows that for all i ∈ [1, n], P̂.P̂π(i) = P̃.P̃π(i). In turn, P̂ = P̃.P̃π(1) ∩ ··· ∩ P̃.P̃π(n), given that ṽπ(1) ∩ ··· ∩ ṽπ(n) ∈ N(P̂, T̄s+1). It follows then from the observation that P̃ ⊆ P̃.P̃π(1) ∩ ··· ∩ P̃.P̃π(n), that P̃ ⊆ P̂.

Now, if P̃ = P̂, then rule R3 (Transitivity) applies to the XINDs (8) and (7), given that P̂.P̂π(i) = P̃.P̃π(i) for all i ∈ [1, n]. We observe that the resulting XIND equals σ̄, and we conclude therefore that Σ ⊢ σ̄ if P̃ = P̂.

It therefore remains to be shown that Σ ⊢ σ̄ if P̃ ⊂ P̂. Thereby, given that P̃ ⊂ P̂, there exists path R̂ such that P̃.R̂ = P̂. Therefore, by applying rule R5 (Downshift), Σ derives the XIND

    (P, [P1, …, Pn]) ⊆ (P̃, [R̂.P̂π(1), …, R̂.P̂π(n)])    (9)

Thereby, given that P̂.P̂π(i) = P̃.P̃π(i) for all i ∈ [1, n], and that P̃.R̂ = P̂, it follows that for all i ∈ [1, n], P̃.R̂.P̂π(i) = P̃.P̃π(i) and in turn R̂.P̂π(i) = P̃π(i). Consequently, rule R3 (Transitivity) applies to XINDs (9) and (7). Again, the resulting XIND equals σ̄, and hence Σ ⊢ σ̄ also in case that P̃ ⊂ P̂.

(b) Given that a non-empty subset of nodes v̄1, …, v̄n exists in Ts, it follows that there exists integer m, where 1 ≤ m < n, and a mapping μ : [1, n] → [1, n] such that for all i ∈ [1, n]

- if i ≤ m, then node v̄μ(i) exclusively exists in tree Ts+1, and
- if i > m, then node v̄μ(i) also exists in tree Ts.

We assume however for the ease of presentation that μ is the identity function and that therefore μ(i) = i for all i ∈ [1, n]. Then, nodes v̄1, …, v̄m exclusively exist in tree Ts+1, and nodes v̄m+1, …, v̄n also exist in tree Ts.

Then, given that the claim applies to nodes v̄1, …, v̄n, and that nodes v̄1, …, v̄m exclusively exist in tree Ts+1, it follows from the argumentation in (a) that Σ derives the XIND

    (P, [P1, …, Pm]) ⊆ (P̃, [P̃1, …, P̃m])    (10)

where

- P̃ is the path such that v̄1 ∩ ··· ∩ v̄m ∈ N(P̃, T̄s+1), and
- for all i ∈ [1, m], P̃i is the path such that v̄i ∈ N(P̃.P̃i, T̄s+1).

Further, given that the claim applies to nodes v̄1, …, v̄n, and that nodes v̄m+1, …, v̄n exist in tree Ts, it follows from the inductive assumption that Σ derives the XIND

    (P, [Pm+1, …, Pn]) ⊆ (P̂, [P̂m+1, …, P̂n])    (11)

where

- P̂ is the path such that v̄m+1 ∩ ··· ∩ v̄n ∈ N(P̂, T̄s+1), and
- for all i ∈ [m+1, n], P̂i is the path such that v̄i ∈ N(P̂.P̂i, T̄s+1).

We observe that (ii) in Lemma 5 implies that for all i, j ∈ [1, m] × [m+1, n], {vρ} = aancestor(v̄i) ∩ aancestor(v̄j), given that v̄i exclusively exists in Ts+1 and that v̄j exists in Ts. Consequently, vρ = v̄1 ∩ ··· ∩ v̄n and therefore P̄ = ρ, since v̄1 ∩ ··· ∩ v̄n ∈ N(P̄, Ts+1) per assumption.

Now, let R̃ and R̂ be the paths such that P̃ = ρ.R̃ and that P̂ = ρ.R̂. Then, by applying rule R5 (Downshift) to the XINDs in (10) and (11), Σ derives the XINDs

    (P, [P1, …, Pm]) ⊆ (ρ, [R̃.P̃1, …, R̃.P̃m])    (12)
    (P, [Pm+1, …, Pn]) ⊆ (ρ, [R̂.P̂m+1, …, R̂.P̂n])    (13)

We note that for all i ∈ [1, m], P̃.P̃i = P̄.P̄i, since per assumption v̄i ∈ N(P̃.P̃i, T̄s+1) and also v̄i ∈ N(P̄.P̄i, T̄s+1). It follows analogously that P̂.P̂i = P̄.P̄i for all i ∈ [m+1, n]. From this together with the observation that P̄ = ρ, it follows that for all i ∈ [1, m], R̃.P̃i = P̄i and that also, for all i ∈ [m+1, n], R̂.P̂i = P̄i. Therefore, if rule R6 (Merge) applies to XINDs (12) and (13), which is what we show next, then the resulting XIND equals σ̄, and hence Σ ⊢ σ̄.

We note that rule R6 (Merge) applies to the XINDs (12) and (13) if ρ.R̃.P̃i ∩ ρ.R̂.P̂j = ρ for all i, j ∈ [1, m] × [m+1, n]. Thereby, given that for all i, j ∈ [1, n], closest(v̄i, v̄j) = true, it follows that there exists node v̄_j^i, for all i, j ∈ [1, m] × [m+1, n], such that

- v̄_j^i ∈ aancestor(v̄i) and
- v̄_j^i ∈ aancestor(v̄j) and
- v̄_j^i ∈ N(ρ.R̃.P̃i ∩ ρ.R̂.P̂j, Ts+1).

Now, given that for all i, j ∈ [1, m] × [m+1, n], {vρ} = aancestor(v̄i) ∩ aancestor(v̄j), it follows that v̄_j^i = vρ if closest(v̄i, v̄j) = true. Consequently, for all i, j ∈ [1, m] × [m+1, n], ρ.R̃.P̃i ∩ ρ.R̂.P̂j = ρ follows from the assumption that closest(v̄i, v̄j) = true.

Given that Σ ⊢ σ if T̄σ ⊨ σ, it follows that T̄σ ⊨ σ ⇒ Σ ⊨ σ, since Σ ⊢ σ ⇒ Σ ⊨ σ. Also, Σ ⊨ σ ⇒ T̄σ ⊨ σ, since if to the contrary T̄σ ⊭ σ then Σ ⊭ σ from Lemma 1. Combining this with the result in Lemma 1 that the chase terminates yields that the chase is a decision procedure for the implication of core XINDs in complete XML trees, i.e. Σ ⊨ σ iff T̄σ ⊨ σ, and we therefore finally have the following result on the implication problem.

Theorem 4. The implication problem for the class of core XINDs in complete XML trees is decidable.
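The argument leading to Theorem 4 can be read as a cycle of three implications. The LaTeX display below is only a compact restatement of the reasoning above, writing \vdash for derivability with rules R1 - R6 and \bar{T}_\sigma for the tree returned by the chase when started on the initial tree for σ.

```latex
\begin{align*}
\Sigma \models \sigma
  &\;\Longrightarrow\; \bar{T}_\sigma \models \sigma
  && \text{(Lemma 1: } \bar{T}_\sigma \models \Sigma\text{)} \\
  &\;\Longrightarrow\; \Sigma \vdash \sigma
  && \text{(claim of the completeness proof)} \\
  &\;\Longrightarrow\; \Sigma \models \sigma
  && \text{(soundness of R1--R6)}
\end{align*}
```

Since the chase terminates, checking whether \bar{T}_\sigma satisfies σ therefore decides whether Σ implies σ.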

6 Discussion and Related Work

In recent years, several types of XML Integrity Constraints (XICs), such as functional dependencies and keys for XML, have been investigated. Because of space limitations, we restrict our attention in this section to inclusion-type constraints and refer the reader to [7] for a survey of other types of XICs.

Path constraints [2] are an early type of XIC. A path inclusion constraint (PIC) essentially requires that whenever a node is reachable over one path, it must also be reachable over another path. In contrast, an XIND asserts that given a set of nodes, there also exist other nodes with corresponding values. Because of this basic difference, one cannot directly compare a PIC and an XIND.

Closer to XINDs are the XML Foreign Keys defined in [9, 3]. Translated to the selector/field framework, these XICs constrain the fields to point to attribute or text nodes that are children of the selector nodes, and so cannot express, for example, the constraint τ = ((ρ.teaches, [course.cno]) ⊆ (ρ.offer, [course.cno])). The keyref mechanism of XSD [11] is limited with respect to the possible number of matching path instances per field. For instance, referring to the constraint τ above and Figure 2, the semantics of the keyref mechanism requires that any offer node has at most one descendant course.cno node, which is clearly not satisfied in this example.

The XICs in [9, 6] overcome these limitations. However, again translated to the selector/field framework, these XICs regard every sequence of field nodes as relevant as long as they are descendants of one selector node. As a consequence, these XICs do not always preserve the semantics of an IND, which we have illustrated in detail in the introductory example. The limitation of not always preserving the semantics of an IND also applies to the XML inclusion constraint in [14], developed by a subset of the authors, which in fact motivated the present work of defining an XIND. Compared to the XIND defined in this paper, the approach in [14] is less expressive and does not use the selector/field framework. Further, the semantics used in [14], although partly based on the closest concept, is nevertheless different from the semantics used in this paper, and as a result an XIND preserves the semantics of an IND.

In further research, we will relax some of the restrictions on the syntax of an XIND and address the implication and consistency problems related to XINDs that allow for (i) path expressions rather than simple paths in the selectors and fields, (ii) a relative constraint that is only evaluated in parts of the XML tree, and (iii) field nodes that are elements rather than text or attribute nodes.

References

1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
2. S. Abiteboul and V. Vianu. Regular Path Queries with Constraints. JCSS, 58(3):428–452, 1999.
3. M. Arenas, W. Fan, and L. Libkin. On the Complexity of Verifying Consistency of XML Specifications. SIAM Journal on Computing, 38(3):841–880, 2008.
4. P. Atzeni and V. DeAntonellis. Relational Database Theory. Benjamin/Cummings, 1993.
5. P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Keys for XML. Computer Networks, 39(5):473–487, 2002.
6. A. Deutsch and V. Tannen. XML Queries and Constraints, Containment and Reformulation. Theoretical Computer Science, 336(1):57–87, 2005.
7. W. Fan. XML Constraints: Specification, Analysis, and Applications. In DEXA Workshops, pages 805–809. IEEE Computer Society, 2005.
8. W. Fan. XML Publishing: Bridging Theory and Practice. In DBPL, pages 1–16, 2007.
9. W. Fan, G. M. Kuper, and J. Siméon. A Unified Constraint Model for XML. Computer Networks, 39(5):489–505, 2002.
10. W. Fan and J. Siméon. Integrity Constraints for XML. JCSS, 66(1):254–291, 2003.
11. A. Möller and M. Schwartzbach. An Introduction to XML and Web Technologies. Addison-Wesley, 2006.
12. A. Vakali, B. Catania, and A. Maddalena. XML Data Stores: Emerging Practices. IEEE Internet Computing, 9(2):62–69, 2005.
13. M. W. Vincent, J. Liu, and M. Mohania. On the Equivalence between FDs in XML and FDs in Relations. Acta Informatica, 44(3-4):207–247, 2007.
14. M. W. Vincent, M. Schrefl, J. Liu, and C. Liu. Generalized Inclusion Dependencies in XML. In APWeb, volume 3007 of LNCS, pages 224–233. Springer, 2004.
