Efficient Approximations of Conjunctive Queries

Pablo Barceló
Department of Computer Science, Universidad de Chile

Leonid Libkin
School of Informatics, University of Edinburgh

Miguel Romero
Department of Computer Science, Universidad de Chile

ABSTRACT

When finding exact answers to a query over a large database is infeasible, it is natural to approximate the query by a more efficient one that comes from a class with good bounds on the complexity of query evaluation. In this paper we study such approximations for conjunctive queries. These queries are of special importance in databases, and we have a very good understanding of the classes that admit fast query evaluation, such as acyclic, or bounded (hyper)treewidth queries. We define approximations of a given query Q as queries from one of those classes that disagree with Q as little as possible. We mostly concentrate on approximations that are guaranteed to return correct answers. We prove that for the above classes of tractable conjunctive queries, approximations always exist, and are at most polynomial in the size of the original query. This follows from general results we establish that relate closure properties of classes of conjunctive queries to the existence of approximations. We also show that in many cases, the size of approximations is bounded by the size of the query they approximate. We establish a number of results showing how combinatorial properties of queries affect properties of their approximations, study bounds on the number of approximations, as well as the complexity of finding and identifying approximations. We also look at approximations that return all correct answers and study their properties.

Categories and Subject Descriptors. H.2.3 [Database Management]: Languages – Query Languages; G.2.2 [Discrete Mathematics]: Graph algorithms

Keywords. Conjunctive queries, query evaluation, query approximation, tractability, acyclic queries, treewidth, hypertree width, graphs, homomorphisms.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS’12, May 21–23, 2012, Scottsdale, Arizona, USA. Copyright 2012 ACM 978-1-4503-1248-6/12/05 ...$10.00.

1. INTRODUCTION

The idea of finding approximate solutions to problems for which computing exact solutions is impossible or infeasible is ubiquitous in computer science. It is common in databases too: approximate query answering techniques help evaluate queries over extremely large databases or queries with very high inherent complexity, see, e.g., [10, 11, 14, 23, 28]. By analyzing the structure of both the database and the query one often finds a reasonable approximation of the answer, sometimes with performance guarantees. Approximate techniques are relevant even for problems whose complexity is viewed as acceptable for regular-size databases, since finding precise answers may become impossible for large data sets we often deal with these days.

To approximate a query, we must have a good understanding of the complexity of query evaluation, in order to find an approximation that is guaranteed to be efficient. For one very common class of queries, namely conjunctive (or select-project-join) queries, we do have a very good understanding of their complexity. In fact, we know which classes of conjunctive queries (CQs from now on) are easy to evaluate [8, 15, 16, 19, 24, 34].

Given the importance of conjunctive queries, and our good understanding of them, we would like to initiate a study of their approximations. We do it from the static analysis point of view, i.e., independently of the input database: for a query Q, we want to find another query Q′ that will be much faster than Q, and whose output would be close to the output of Q on all databases. Such analysis is essential when a query is repeatedly evaluated on a very large database (say, in response to frequent updates), and when producing approximations based on both data and queries may be infeasible.

The complexity of checking whether a tuple $\bar{a}$ belongs to the output of a CQ Q on a database D is of the order $|D|^{O(|Q|)}$, where $|\cdot|$ measures the size (of a database or a query) [3, 33]. In fact, the problem is known to be NP-complete when its input consists of D as well as Q (even for Boolean CQs). In other words, the combined complexity of CQs is intractable [7]. Of course the data complexity of CQs is low, but having O(|Q|) as the exponent may be prohibitively high for very large datasets.

This observation led to an extensive study of classes of CQs for which the combined complexity is tractable. The first result of this kind by Yannakakis [34] showed tractability for acyclic CQs. That was later extended to queries of bounded treewidth [8, 12, 24]; this notion captures tractability for classes of CQs defined in terms of their graphs [19]. For classes of CQs defined in terms of their hypergraphs, the corresponding notion guaranteeing tractability is bounded hypertree width [16], which includes acyclicity as a special case. All these conditions can be tested in polynomial time [5, 13, 16].

The question we address is whether we can approximate a CQ Q by a CQ Q′ from one of such classes so that Q and Q′ would disagree as little as possible. Assume, for example, that we manage to find an approximation of Q by an acyclic CQ Q′, for which checking whether $\bar{a} \in Q'(D)$ is done in time $O(|D| \cdot |Q'|)$ [34]. Then we have replaced the original problem of complexity $|D|^{O(|Q|)}$ with one of complexity $O\big(f(|Q|) + |D| \cdot s(|Q|)\big)$, where $s(\cdot)$ measures the size of the resulting approximation, and $f(\cdot)$ is the complexity of finding one.
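To get a feel for the gap between the two bounds, the following sketch plugs numbers into them. It is only an illustration: the choices f(|Q|) = 2^|Q| and s(|Q|) = |Q| are assumptions made here for the arithmetic, constants hidden by the O-notation are ignored, and the function names are hypothetical.

```python
def exact_cost(db_size: int, query_size: int) -> int:
    """Rough step count |D|^|Q| for evaluating Q directly (constants ignored)."""
    return db_size ** query_size


def approx_cost(db_size: int, query_size: int) -> int:
    """Rough step count f(|Q|) + |D| * s(|Q|) under the assumptions below."""
    f = 2 ** query_size  # assumed: single-exponential time to construct Q'
    s = query_size       # assumed: the approximation is no larger than Q
    return f + db_size * s


if __name__ == "__main__":
    d, q = 10**6, 5
    print(exact_cost(d, q))   # 10**30
    print(approx_cost(d, q))  # 5000032
```

With a million-element database and a query of size 5, the exponential term 2^5 is negligible next to the linear pass over the data, which is the point of the comparison above.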

Thus, assuming that the complexity measures f and s are acceptable, the combined complexity of running Q′ is much better than for Q. Hence, if the quality of the approximation Q′ is good too, then we may prefer to run the much faster query Q′ instead of Q, especially in the case of very large databases. Thus, we need to answer the following questions:

• What are the acceptable bounds for constructing approximations, i.e., the functions f and s above?
• What types of guarantees do we expect from approximations?

For the first question, if Q′ is of the same size as Q, or even if it polynomially increases the size, this is completely acceptable, as the exponent O(|Q|) is now replaced by the factor s(|Q|). For the complexity f of static computation (i.e., transforming Q to Q′), a single exponential is typically acceptable. Indeed, this is the norm in many static analysis and verification questions [29, 32], and small exponentials (like $2^{O(|Q|)}$ or $2^{O(|Q| \log |Q|)}$, which we shall mainly encounter) are significantly smaller than $|D|^{|Q|}$ if |D| is large. Thus, in terms of their complexity, our desiderata for approximations are:

1. the approximating query should be at most polynomially larger than Q, and ideally bounded by the size of Q; and
2. the complexity of finding an approximating query should not exceed single-exponential.

As for the guarantees we expect from approximations, in general they can be formulated in two different ways. By doing it qualitatively we state that an approximation is a query that cannot be improved in terms of how much it disagrees with the query it approximates. Alternatively, to do it quantitatively, we define a measure of disagreement between two queries, and look for approximations whose measure of disagreement with the query they approximate is below a certain threshold.

Here we develop the qualitative approach to approximating CQs. For a given Q, we compare queries from some good (tractable) class C by how much they disagree with Q: to do so, we define an ordering Q1 ⊑Q Q2 saying, intuitively, that Q2 disagrees with Q less often than Q1 does. Then the best queries with respect to the ordering are our approximations from the class C. Furthermore, we require the approximations to return correct results. This approach is standard in databases (for instance, the standard approximation of query results in the settings of query answering using views and data integration is the notion of maximally contained rewriting [2, 20, 26]).

Our goal is to explore approximations of arbitrary CQs by tractable CQs. We shall see that approximations are guaranteed to exist for all the tractable classes of CQs mentioned earlier, which makes the notion worth studying. We first explore CQs on graphs. Even though graph vocabularies often lack enough structure to exhibit interesting approximations, the essential machinery needs to be developed for them first. In addition, most lower bounds for complexity results are witnessed already in this simple case. Then we show how to extend results to queries over arbitrary databases.

The structure of approximations depends heavily on combinatorial properties of the (tableau of the) query Q we approximate. Consider, for instance, a Boolean query Q1() :– E(x, y), E(y, z), E(z, x) over graphs. Its best acyclic approximation is Q′1() :– E(x, x), which is contained in every Boolean graph query and thus provides us with little information. It turns out that this will be the case whenever the tableau of the query is not a bipartite graph. Let $P_m(x_0, \ldots, x_m)$ be the CQ stating that $x_0, \ldots, x_m$ form a path of length m, i.e., $E(x_0, x_1), \ldots, E(x_{m-1}, x_m)$. If we now look at Q2() :– P3(x, y, z, u), P3(x′, y′, z′, u′), E(x, z′), E(y, u′) (which has a cycle with variables x, y, z′, u′), then it has a nontrivial acyclic approximation Q′2() :– P4(x′, x, y, z, u). What changed is that the tableau of Q2 is bipartite, guaranteeing the existence of nontrivial approximations for queries over graphs.

Going beyond graph vocabularies lets us find more approximations. Consider again Q1 above, replace binary relation E with a ternary relation R, and introduce fresh variables in the middle positions, i.e., look at the query Q() :– R(x, u, y), R(y, v, z), R(z, w, x). This query does have several nontrivial acyclic approximations: for instance, Q′() :– R(x, u, y), R(y, v, u), R(u, z, x) is one.

These examples provide a flavor of the results we establish. We now provide a quick summary of the results of the paper. As mentioned earlier, we first study queries over graphs and then lift results to arbitrary queries.

Class of queries    Type of approximation   Existence of approximation   Size of approximation   Time to compute approximation
Graph queries       Acyclic                 always exists                at most |Q|             single-exponential
Graph queries       Treewidth k             always exists                at most |Q|             single-exponential
Arbitrary queries   Acyclic                 always exists                polynomial in |Q|       single-exponential
Arbitrary queries   Hypertreewidth k        always exists                polynomial in |Q|       single-exponential

Figure 1: Summary of results on approximations for conjunctive queries Q

Results for graph queries. For a query Q, we are interested in approximations Q′ from a good class C. The classes we consider are acyclic queries [34] and queries of fixed treewidth k, which capture the notion of tractability of CQs over graphs [19]. The first two rows in Figure 1 summarize some of our results: within both classes, approximations exist for all queries (this will follow from a general existence result that relates closure properties of classes of graphs to the existence of approximations), they do not increase the complexity of the query, and can be constructed in single-exponential time, thus satisfying all our desiderata for approximating queries.

In addition, we study the structure of approximations. We show a close relationship between (k + 1)-colorability of the tableau and the existence of interesting treewidth-k approximations. For Boolean queries, we prove a finer trichotomy result for acyclic approximations, which also shows that such approximations are guaranteed to reduce the number of joins. We show that there are at most exponentially many non-equivalent approximations, and that the exponential number of approximations can be witnessed.

We provide further complexity analysis, showing that the problem of checking whether Q′ is an acyclic (or treewidth-k) approximation of Q is complete for the class DP (this class, defined formally later, is “slightly” above both NP and coNP [31]). DP-completeness results appeared in the database literature in connection with computing cores of structures [9]; our result is of a very different nature because it holds even when both Q and Q′ are minimized (i.e., their tableaux are cores). Finally, we briefly consider approximations that drop the ‘no false positives’ requirement.
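The colorability condition mentioned above can be checked directly on small tableaux. The sketch below is a brute-force test, exponential in the number of variables and meant only for toy examples; the function name `is_colorable` and the encoding of a tableau as a list of directed edges are assumptions made for this illustration. With k = 2 it tests bipartiteness, the condition discussed for acyclic approximations.

```python
from itertools import product


def is_colorable(edges, k):
    """Return True if the graph given by `edges` has a proper k-coloring.

    `edges` is an iterable of pairs (u, v); orientation is ignored.
    A loop (x, x) can never be properly colored.
    """
    edges = list(edges)
    nodes = sorted({x for e in edges for x in e})
    for colors in product(range(k), repeat=len(nodes)):
        col = dict(zip(nodes, colors))
        if all(col[u] != col[v] for u, v in edges):
            return True
    return False


# Tableau of Q1: a directed triangle (not bipartite).
Q1 = [("x", "y"), ("y", "z"), ("z", "x")]

# Tableau of Q2: two paths of length 3 plus the edges E(x, z') and E(y, u').
Q2 = [("x", "y"), ("y", "z"), ("z", "u"),
      ("x'", "y'"), ("y'", "z'"), ("z'", "u'"),
      ("x", "z'"), ("y", "u'")]

if __name__ == "__main__":
    print(is_colorable(Q1, 2))  # False
    print(is_colorable(Q2, 2))  # True
```

The two outcomes match the examples from the introduction: Q1 has only the trivial acyclic approximation, while Q2 has a nontrivial one.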
Results for arbitrary queries. There are two ways of getting tractable classes of CQs over arbitrary databases, depending on whether one formulates conditions in terms of the graph of a query Q, or its hypergraph. For graph-based notions, it is known that bounded treewidth characterizes tractability [19]. For them, results for graph queries extend to arbitrary queries. For hypergraph-based notions, we have the original notion of acyclicity from [34] and its more recent extension to the notion of bounded hypertree width [16]; it is known that hypertree width 1 coincides with acyclicity.

We again prove a general existence result for approximations. However, the closure conditions imposed on classes of hypergraphs become more involved, and it actually requires an effort to prove that they hold for classes of bounded hypertree width. We show that it is still possible to find approximations in single-exponential time. As for their sizes, they need not be bounded by |Q|, but they remain polynomial in |Q|, with the polynomial depending only on the vocabulary (schema). Thus, as the summary table in Figure 1 shows, in this case too, our desiderata for approximations are met.

Regarding techniques required to prove these results, we mainly work with tableaux of queries, and characterize approximations via preorders based on the existence of homomorphisms. Thus, we make heavy use of techniques from the theory of graph homomorphisms [21]. Besides graph theory and combinatorics, these are commonly used in constraint satisfaction [25], but recently they were applied in database theory as well [6, 27].

Organization. Basic notations are given in Section 2. In Section 3 we define the notion of approximations. Section 4 studies queries over graphs, concentrating on acyclic and bounded treewidth approximations. In Section 5 we look at arbitrary databases, concentrating on acyclic and bounded hypertree width approximations. Overapproximations are studied in Section 6, and conclusions are given in Section 7. Due to space limitations, we only give a couple of sample (short) proofs here.

2. NOTATIONS

Graphs and digraphs. Both graphs and digraphs are defined as pairs $G = \langle V, E \rangle$, where V is a set of nodes (vertices) and E is a set of edges. For graphs, an edge is a set {u, v}, where u, v ∈ V; for digraphs, an edge is a pair (u, v), i.e., it has an orientation from u to v. If u = v, we have an (undirected or directed) loop. If $G = \langle V, E \rangle$ is a directed graph, then $G^u$ is the underlying undirected graph: $G^u = \langle V, \{\{u, v\} \mid (u, v) \in E\} \rangle$. We denote by $K_m$ the complete graph on m vertices: $K_m = \langle \{u_1, \ldots, u_m\}, \{\{u_i, u_j\} \mid i \neq j,\ i, j \leq m\} \rangle$, and by $K_m^{\rightleftarrows}$ the complete digraph on m vertices, i.e., $K_m^{\rightleftarrows} = \langle \{u_1, \ldots, u_m\}, \{(u_i, u_j) \mid i \neq j,\ i, j \leq m\} \rangle$, so that edges go in both directions. Note that $(K_m^{\rightleftarrows})^u = K_m$.

Graph homomorphisms and cores. Given two graphs (directed or undirected) $G_1 = \langle V_1, E_1 \rangle$ and $G_2 = \langle V_2, E_2 \rangle$, a homomorphism between them is a map $h : V_1 \to V_2$ such that h(e) is in $E_2$ for every edge $e \in E_1$. Of course by h(e) we mean {h(u), h(v)} if e = {u, v} and (h(u), h(v)) if e = (u, v). The image of h is the (di)graph $\mathrm{Im}(h) = \langle h(V_1), \{h(e) \mid e \in E_1\} \rangle$. If there is a homomorphism h from $G_1$ to $G_2$, we write $G_1 \to G_2$ or $G_1 \xrightarrow{h} G_2$.

A graph G is a core if there is no homomorphism G → G′ into a proper subgraph G′ of G. A subgraph G′ of G is a core of G if G′ is a core and G → G′. It is well known that all cores of a graph are isomorphic and hence we can speak of the core of a graph, denoted by core(G). We say that two graphs G and G′ are homomorphically equivalent if both G → G′ and G′ → G hold. Homomorphically equivalent graphs have the same core, i.e., core(G) and core(G′) are isomorphic.

We shall also deal with graphs with distinguished vertices. Let G, G′ be (di)graphs and $\bar{u}$, $\bar{u}'$ tuples of vertices in G and G′, respectively, of the same length. Then we write $(G, \bar{u}) \to (G', \bar{u}')$ if there is a homomorphism h : G → G′ such that $h(\bar{u}) = \bar{u}'$. With this definition, the notion of core naturally extends to graphs with distinguished vertices. We write G ⇄ G′ if G → G′, but G′ → G does not hold.
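The definitions of homomorphism and core can be made concrete with a small brute-force sketch for digraphs. It is an illustration only, not an algorithm from the paper: the names `homomorphisms` and `core` are hypothetical, vertices are taken to be exactly those occurring in edges, and the search enumerates all maps, so it is usable only on very small graphs. It relies on the observation that a finite graph fails to be a core exactly when some endomorphism misses a vertex, in which case the graph can be replaced by the image of that endomorphism.

```python
from itertools import product


def homomorphisms(edges1, edges2):
    """Yield every homomorphism (as a dict) from digraph edges1 to digraph edges2."""
    nodes1 = sorted({x for e in edges1 for x in e})
    nodes2 = sorted({x for e in edges2 for x in e})
    target = set(edges2)
    for image in product(nodes2, repeat=len(nodes1)):
        h = dict(zip(nodes1, image))
        if all((h[u], h[v]) in target for u, v in edges1):
            yield h


def core(edges):
    """Return the edge set of a core of the digraph, by repeated shrinking."""
    edges = set(edges)
    while True:
        nodes = {x for e in edges for x in e}
        for h in homomorphisms(edges, edges):
            if len(set(h.values())) < len(nodes):
                # Replace the graph by Im(h), a homomorphically equivalent subgraph.
                edges = {(h[u], h[v]) for u, v in edges}
                break
        else:
            return edges  # every endomorphism is onto: the graph is a core


# An undirected path a - b - c, encoded with edges in both directions.
path = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
# A directed triangle, which is already a core.
triangle = [("x", "y"), ("y", "z"), ("z", "x")]
```

On these inputs, `core(path)` shrinks to a single two-directional edge, while `core(triangle)` leaves the triangle unchanged.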

Databases (relational structures). While the case of graphs is crucial for understanding the main concepts, we shall also state results for conjunctive queries over arbitrary relational structures. A vocabulary (often called a schema in the database context) is a set σ of relation names $R_1, \ldots, R_l$, each relation $R_i$ having an arity $n_i$. A relational structure, or a database, of vocabulary σ is $D = \langle U, R_1^D, \ldots, R_l^D \rangle$, where U is a finite set, and each $R_i^D$ is an $n_i$-ary relation over U, i.e., a subset of $U^{n_i}$. We usually omit the superscript D if it is clear from the context. We also assume (as is normal in database theory) that U is the active domain of D, i.e., the set of all elements that occur in the relations $R_i^D$. Both directed and undirected graphs, for example, are relational structures of the vocabulary that contains a single binary relation E. For digraphs, it is the edge relation; for graphs, it contains pairs (u, v) and (v, u) for each edge {u, v}.

We often deal with databases together with a tuple of distinguished elements, i.e., $(D, \bar{a})$, where $\bar{a}$ is a k-tuple of elements of the active domain, for some k > 0. Technically, these are structures of vocabulary σ expanded with k extra constant symbols, interpreted as $\bar{a}$.

Homomorphisms of structures are defined in the same way as for graphs: for $D_1 = \langle U_1, (R_i^{D_1})_{i \leq l} \rangle$ and $D_2 = \langle U_2, (R_i^{D_2})_{i \leq l} \rangle$, a homomorphism $h : D_1 \to D_2$ is a map from $U_1$ to $U_2$ so that $h(\bar{t}) \in R_i^{D_2}$ for every $n_i$-ary tuple $\bar{t} \in R_i^{D_1}$, for all $i \leq l$. As before, we write $D_1 \to D_2$ in this case. For databases with tuples of distinguished elements we have $(D_1, \bar{a}_1) \to (D_2, \bar{a}_2)$ if the homomorphism h in addition satisfies $h(\bar{a}_1) = \bar{a}_2$.

The notion of a core for relational structures (with distinguished elements) is defined just as for graphs, using homomorphisms of structures.

Conjunctive queries and tableaux. A conjunctive query (CQ) over a relational vocabulary σ is a logical formula in the ∃, ∧-fragment of first-order logic, i.e., a formula of the form $Q(\bar{x}) = \exists \bar{y} \bigwedge_{j=1}^{m} R_{i_j}(\bar{x}_{i_j})$, where each $R_{i_j}$ is a symbol from σ, and $\bar{x}_{i_j}$ a tuple of variables among $\bar{x}, \bar{y}$ whose length is the arity of $R_{i_j}$. These are often written in a rule-based notation

$Q(\bar{x}) \text{ :– } R_{i_1}(\bar{x}_{i_1}), \ldots, R_{i_m}(\bar{x}_{i_m}).$    (1)

The number of joins in the CQ (1) is m − 1. Given a database D, the answer Q(D) to Q is $\{\bar{a} \mid D \models Q(\bar{a})\}$. If Q is a Boolean query (a sentence), the answer true is, as usual, modeled by the set containing the empty tuple, and the answer false by the empty set. A CQ Q is contained in a CQ Q′, written as Q ⊆ Q′, if Q(D) ⊆ Q′(D) for every database D.

With each CQ $Q(\bar{x})$ of the form (1) we associate its tableau $(T_Q, \bar{x})$, where $T_Q$ is the body of Q viewed as a σ-database; i.e., it contains the tuples $\bar{x}_{i_j}$ in the relations $R_{i_j}$, for j ≤ m. If Q is a Boolean CQ, then its tableau is just the σ-structure $T_Q$. Many key properties of CQs can be stated in terms of homomorphisms of tableaux. For example, $\bar{a} \in Q(D)$ iff $(T_Q, \bar{x}) \to (D, \bar{a})$. For CQs $Q(\bar{x})$ and $Q'(\bar{x}')$ with the same number of free variables, Q ⊆ Q′ iff $(T_{Q'}, \bar{x}') \to (T_Q, \bar{x})$. Hence, the combined complexity of CQ evaluation and the complexity of CQ containment are in NP (in fact, both are NP-complete [7]).
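The two homomorphism characterizations above translate directly into code. The following is a minimal sketch under stated assumptions: a query is encoded as a pair (head variables, list of atoms), an atom as (relation name, tuple of terms), and a database as a list of facts in the same shape; the names `find_homomorphism`, `holds` and `is_contained` are hypothetical. The search is a naive enumeration, consistent with the NP upper bound but not an efficient procedure.

```python
from itertools import product


def find_homomorphism(src, dst, fixed=None):
    """Return a homomorphism from structure `src` to structure `dst`, or None.

    `fixed` pre-assigns the images of distinguished elements.
    """
    fixed = dict(fixed or {})
    facts = set(dst)
    free = sorted({t for _, args in src for t in args} - set(fixed))
    targets = sorted({t for _, args in dst for t in args})
    for image in product(targets, repeat=len(free)):
        h = dict(fixed)
        h.update(zip(free, image))
        if all((rel, tuple(h[t] for t in args)) in facts for rel, args in src):
            return h
    return None


def holds(query, database, answer=()):
    """Check that `answer` is in Q(D): (T_Q, x) -> (D, a)."""
    head, atoms = query
    fixed = {}
    for x, a in zip(head, answer):
        if fixed.setdefault(x, a) != a:
            return False
    return find_homomorphism(atoms, database, fixed) is not None


def is_contained(q, q_prime):
    """Check Q ⊆ Q' via a homomorphism (T_Q', x') -> (T_Q, x)."""
    (head, atoms), (head_p, atoms_p) = q, q_prime
    if len(head) != len(head_p):
        return False
    return holds((head_p, atoms_p), atoms, head)


# Q1() :- E(x, y), E(y, z), E(z, x)  and its trivial approximation  Q1'() :- E(x, x)
Q1 = ((), [("E", ("x", "y")), ("E", ("y", "z")), ("E", ("z", "x"))])
Q1_TRIV = ((), [("E", ("x", "x"))])
```

Here `is_contained(Q1_TRIV, Q1)` succeeds by mapping every variable of Q1 to x, while `is_contained(Q1, Q1_TRIV)` fails because a triangle without loops offers no image for E(x, x).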

3. THE NOTION OF APPROXIMATION

We now explain the main idea of approximations. Suppose C is a class of conjunctive queries (e.g., acyclic, or of bounded treewidth). We are given a query Q not in this class, and we want to approximate it within C. For that, we define an ordering [...]

4.2 Bounded treewidth queries

We have already seen that treewidth-k approximations of a CQ Q always exist, that they cannot exceed the size of Q, and can be constructed in single-exponential time. There is an analog of the dichotomy for acyclic queries, in which bipartiteness (i.e., being 2-colorable) is replaced by (k + 1)-colorability for TW(k).

Theorem 4.13. Let Q be a CQ over graphs. If its tableau $T_Q$
• is not (k + 1)-colorable, then all of its TW(k)-approximations have a subgoal of the form E(x, x);
• is (k + 1)-colorable, then Q has a TW(k)-approximation without a subgoal of the form E(x, x).

Recall that a Boolean CQ $Q^{triv}$() :– E(x, x) is a trivial (acyclic, or treewidth-k) approximation of every Boolean CQ. In the acyclic case, 2-colorability (or bipartiteness) of $T_Q$ was equivalent to the existence of nontrivial approximations. This result extends to treewidth-k.

Corollary 4.14. A Boolean CQ Q over graphs has a nontrivial TW(k)-approximation iff its tableau $T_Q$ is (k + 1)-colorable.

Note the big difference in the complexity of testing for the existence of nontrivial approximations: while it is in Ptime in the acyclic case, the problem is already NP-complete for TW(2).

[...] 1, testing, for a Boolean CQ Q over graphs, whether $Q^{triv}_{k+1}$ is a TW(k)-approximation of Q is NP-hard.

Thus, while the behavior of acyclic and treewidth-k approximations for k > 1 is in general similar, testing conditions that guarantee certain properties of approximations is harder even for treewidth-2, compared to the acyclic case.

Finally, we note that the analog of the Acyclic Approximation problem for treewidth k (i.e., checking if Q′ is a TW(k)-approximation of Q) remains DP-complete for all k ≥ 1. Indeed, the proof of the upper bound for the acyclic case applies to bounded treewidth, and the lower bound is already established for k = 1.

5. APPROXIMATING ARBITRARY QUERIES

We now switch to queries over arbitrary vocabularies. For them, tractability restrictions could be either graph-based or hypergraph-based. For the graph-based notions, one deals with the graph of query Q, denoted by G(Q). The nodes of G(Q) are variables used in Q. If there is an atom $R(x_1, \ldots, x_n)$ in Q, then G(Q) has undirected edges $\{x_i, x_j\}$ for all 1 ≤ i < j ≤ n. Note that for graph queries, we have $G(Q) = T_Q^u$.

For hypergraph-based notions, we put restrictions on the hypergraph H(Q) of Q. Recall that its nodes again are variables used in Q, and its hyperedges correspond to the atoms of Q, i.e., for each atom $R(x_1, \ldots, x_n)$ in Q, we have a hyperedge $\{x_1, \ldots, x_n\}$.

Restrictions on queries are imposed as follows. If C is a class of graphs, or hypergraphs, then a query Q is
• a graph-based C-query if G(Q) ∈ C, and
• a hypergraph-based C-query if H(Q) ∈ C.

In general, these are incompatible: there are graph-based classes that are not hypergraph-based, and vice versa [12].
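The graph G(Q) and the hypergraph H(Q) of a query can be built mechanically from its atoms. The sketch below assumes a query body is given as a list of (relation name, tuple of variables) pairs; the function names are hypothetical. Edges are stored as frozensets, so a repeated variable such as in E(x, x) yields the one-element set {x}, which plays the role of a loop.

```python
from itertools import combinations


def query_graph(atoms):
    """Return (nodes, edges) of G(Q): one edge {x_i, x_j} per pair of positions i < j."""
    nodes, edges = set(), set()
    for _, args in atoms:
        nodes.update(args)
        for xi, xj in combinations(args, 2):
            edges.add(frozenset((xi, xj)))
    return nodes, edges


def query_hypergraph(atoms):
    """Return (nodes, hyperedges) of H(Q): one hyperedge per atom."""
    nodes, hyperedges = set(), set()
    for _, args in atoms:
        nodes.update(args)
        hyperedges.add(frozenset(args))
    return nodes, hyperedges


# Q() :- R(x, u, y), R(y, v, z), R(z, w, x)
Q = [("R", ("x", "u", "y")), ("R", ("y", "v", "z")), ("R", ("z", "w", "x"))]
```

For this query, G(Q) has six nodes and nine edges (each ternary atom contributes a triangle), while H(Q) has three hyperedges of size three.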

5.1 Graph-based classes

For graph-based queries, it is easy to transfer results from queries over graphs to queries over arbitrary schemas. We state the result below only for the classes of tractable graph-based queries, but a general existence theorem, extending Theorem 4.1, is true as well.

Tractability of CQ answering with respect to graph-based classes of queries was fully characterized in [19] (under a certain complexity-theoretic assumption): given a class C, query answering for graph-based C-queries is tractable iff C ⊆ TW(k) for some k.

We call a CQ Q′ a graph-based C-approximation of Q if it is an approximation of Q in the class of graph-based C-queries. Then we have an analog of the existence of approximation results from Section 4.

Theorem 5.1. Every CQ Q has a graph-based TW(k)-approximation, for every k ≥ 1, with at most as many joins as Q. Moreover, such an approximation can be found in single-exponential time.

5.2 Hypergraph-based classes

We now look at hypergraph-based C-approximations, i.e., approximations in the class of hypergraph-based C-queries. The oldest tractability criterion for CQs, acyclicity [34], is a hypergraph-based notion (see the definition in Section 3). An analog of bounded treewidth for hypergraphs was defined in [16]; that notion of bounded hypertree width properly extended acyclicity and led to tractable classes of CQs over arbitrary vocabularies.

Our first goal, therefore, is to have a general result about the existence of approximations that will apply to both acyclicity and bounded hypertree width (to be defined formally shortly). Note that we cannot trivially lift the closure condition used in Theorem 4.1 to hypergraphs, since even acyclic hypergraphs are not closed under taking subhypergraphs. Indeed, take a hypergraph H with hyperedges {a, b, c}, {a, b}, {b, c}, {a, c}. It is acyclic: the decomposition has {a, b, c} associated with the root of the tree, and two-element edges with the children of the root. However, it has cyclic subhypergraphs, for instance, one that contains its two-element edges.

The closure conditions we use instead are:
• Closure under induced subhypergraphs: if $H = \langle V, E \rangle$ is in C and H′ is an induced subhypergraph, then H′ ∈ C. Recall that an induced subhypergraph is one of the form $\langle V', \{e \cap V' \mid e \in E\} \rangle$.
• Closure under edge extensions: if $H = \langle V, E \rangle$ is in C and H′ is obtained by adding new vertices V′ to one hyperedge e ∈ E, where V′ is disjoint from V, then H′ ∈ C.
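The example hypergraph above can be checked mechanically. The sketch below uses the GYO reduction, a standard test for hypergraph acyclicity that is not described in this paper and is brought in here only for illustration: it repeatedly deletes vertices that occur in a single hyperedge and hyperedges that are empty or contained in another one, and the hypergraph is acyclic exactly when nothing remains. The function names are hypothetical.

```python
from itertools import combinations


def is_acyclic(hyperedges):
    """GYO reduction: True iff the hypergraph reduces to nothing."""
    edges = {frozenset(e) for e in hyperedges}
    while True:
        # Count in how many hyperedges each vertex occurs.
        count = {}
        for e in edges:
            for v in e:
                count[v] = count.get(v, 0) + 1
        # Rule 1: drop vertices that occur in exactly one hyperedge.
        reduced = {frozenset(v for v in e if count[v] > 1) for e in edges}
        # Rule 2: drop empty hyperedges and those strictly inside another one.
        reduced = {e for e in reduced if e and not any(e < g for g in reduced)}
        if reduced == edges:
            return not edges
        edges = reduced


def induced_subhypergraph(hyperedges, keep):
    """Hyperedges of the induced subhypergraph on vertex set `keep`."""
    keep = set(keep)
    return {frozenset(e) & keep for e in hyperedges}


H = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
TWO_ELEMENT_EDGES = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

if __name__ == "__main__":
    print(is_acyclic(H))                  # True
    print(is_acyclic(TWO_ELEMENT_EDGES))  # False
    # Every induced subhypergraph of H stays acyclic.
    print(all(is_acyclic(induced_subhypergraph(H, s))
              for r in range(4) for s in combinations("abc", r)))  # True
```

The first two checks reproduce the observation that H is acyclic while the subhypergraph formed by its two-element edges is not; the last one is consistent with closure under induced subhypergraphs on this small instance.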

We shall see that these will be satisfied by the classes of hypergraphs of interest to us. The analog of the previous existence results can now be stated as follows.

Theorem 5.2. Let C be a class of hypergraphs closed under induced subhypergraphs and edge extensions. Then every CQ Q that has at least one hypergraph-based C-query contained in it, has a hypergraph-based C-approximation. Moreover, the number of non-equivalent hypergraph-based C-approximations of Q is at most exponential in the size of Q, and every such approximation is equivalent to one which has at most $O(n^{m-1})$ variables and at most $O(n^m)$ joins, where n is the number of variables in Q, and m is the maximum arity of a relation in the vocabulary.

It is straightforward to check that the class of acyclic hypergraphs satisfies both closure conditions, and that any constant homomorphism on a query Q produces an acyclic query. Thus,

Corollary 5.3. For every vocabulary σ, there exist two polynomials $p_\sigma$ and $r_\sigma$ such that every CQ Q over σ has a hypergraph-based acyclic approximation of size at most $p_\sigma(|Q|)$ that can be found in time $2^{r_\sigma(|Q|)}$.

Next, we extend these results to hypertree width. First we recall the definitions [16]. A hypertree decomposition of a hypergraph $H = \langle V, E \rangle$ is a triple $\langle T, f, c \rangle$, where T is a rooted tree, f is a map from T to $2^V$ and c is a map from T to $2^E$, such that
• (T, f) is a tree decomposition of H.
• $f(u) \subseteq \bigcup c(u)$ holds for every u ∈ T.
• $\bigcup c(u) \cap \bigcup \{f(t) \mid t \in T_u\} \subseteq f(u)$ holds for every u ∈ T, where $T_u$ refers to the subtree of T rooted at u.

The width of a hypertree decomposition $\langle T, f, c \rangle$ is $\max_{u \in T} |c(u)|$. The hypertree width hw(H) of H is the minimum width over all its hypertree decompositions. We denote by HTW(k) the class of hypergraphs with hypertree width at most k, and slightly abusing notation, the class of CQs or tableaux whose hypergraphs have hypertree width at most k.

The key result of [16] is that for each fixed k, CQs from HTW(k) can be evaluated in polynomial time with respect to combined complexity. It is also shown in [16] that a hypergraph H is acyclic iff its hypertree width is 1. That is, AC = HTW(1).

To apply the existence result, we need to check the closure conditions for hypergraphs of fixed hypertree width. It turns out they are satisfied.

Lemma 5.4. For each k, the class HTW(k) is closed under induced subhypergraphs and edge extensions.

This gives us the desired result about the existence of approximations within HTW(k) for every k.
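The three conditions in the definition of a hypertree decomposition can be verified for a given candidate. The following sketch does only that: it checks a supplied triple and reports its width, and does not search for a decomposition. The encoding is an assumption of this example (a `children` dictionary for the rooted tree, dictionaries `f` and `c` keyed by tree node), as is the standard reading of "tree decomposition" used for the first condition: every hyperedge lies inside some bag f(u), and the nodes whose bag contains a given vertex form a connected subtree.

```python
def subtree(children, u):
    """All nodes of the subtree rooted at u, including u."""
    out, stack = [], [u]
    while stack:
        t = stack.pop()
        out.append(t)
        stack.extend(children.get(t, ()))
    return out


def is_hypertree_decomposition(hyperedges, children, root, f, c):
    """Check the three conditions for the candidate decomposition (T, f, c)."""
    hyperedges = {frozenset(e) for e in hyperedges}
    nodes = subtree(children, root)
    parent = {ch: u for u in nodes for ch in children.get(u, ())}
    # c(u) may only use hyperedges of H.
    if any(not c[u] <= hyperedges for u in nodes):
        return False
    # Tree decomposition, part 1: every hyperedge fits inside some bag.
    if any(not any(e <= f[u] for u in nodes) for e in hyperedges):
        return False
    # Tree decomposition, part 2: bags containing v form one connected subtree,
    # i.e. exactly one of them has no parent bag containing v.
    for v in set().union(*hyperedges):
        tops = [u for u in nodes
                if v in f[u] and (u == root or v not in f[parent[u]])]
        if len(tops) != 1:
            return False
    for u in nodes:
        covered = set().union(*c[u])
        below = set().union(*(f[t] for t in subtree(children, u)))
        if not f[u] <= covered:             # f(u) is covered by the edges in c(u)
            return False
        if not (covered & below) <= f[u]:   # third condition of the definition
            return False
    return True


def width(children, root, c):
    """Width of the decomposition: the largest |c(u)|."""
    return max(len(c[u]) for u in subtree(children, root))


# The hypergraph with hyperedges {a,b,c}, {a,b}, {b,c}, {a,c} and the decomposition
# with {a,b,c} at the root and the two-element edges at its children.
abc, ab, bc, ac = (frozenset(s) for s in ("abc", "ab", "bc", "ac"))
H = [abc, ab, bc, ac]
CHILDREN = {"r": ["n1", "n2", "n3"]}
F = {"r": set("abc"), "n1": set("ab"), "n2": set("bc"), "n3": set("ac")}
C = {"r": {abc}, "n1": {ab}, "n2": {bc}, "n3": {ac}}
```

For this instance the check succeeds with width 1, in line with the statement that acyclic hypergraphs are exactly those of hypertree width 1. A single-node decomposition of the triangle {a,b}, {b,c}, {a,c} needs two hyperedges in c(root) to cover the bag {a, b, c}, which gives width 2.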

Corollary 5.5. For every vocabulary σ, there exist two polynomials $p_\sigma$ and $r_\sigma$ such that every CQ Q over σ has a hypergraph-based HTW(k)-approximation of size at most $p_\sigma(|Q|)$ that can be found in time $2^{r_\sigma(|Q|)}$, for every k ≥ 1.

Example 5.6. Consider a Boolean query Q() :– R(x1, x2, x3), R(x3, x4, x5), R(x5, x6, x1) over a schema with one ternary relation. If we had a binary relation instead and omitted the middle attribute, we would obtain a query whose tableau is a cycle of length 3, thus having only trivial approximations. However, going beyond graphs lets us find nontrivial acyclic approximations. In fact this query has 3 non-equivalent acyclic approximations (all queries below are minimized):
• With fewer joins than Q: Q′1() :– R(x, y, x).
• With as many joins as Q: Q′2() :– R(x1, x2, x3), R(x3, x4, x2), R(x2, x5, x1).
• With more joins than Q: Q′3() :– R(x1, x2, x3), R(x3, x4, x5), R(x5, x6, x1), R(x1, x3, x5).
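One part of Example 5.6 is easy to confirm by machine: each of the three queries is contained in Q, which by the tableau criterion amounts to a homomorphism from the tableau of Q into the tableau of the approximation. The sketch below checks only this containment, not acyclicity or maximality, and the backtracking routine `homomorphism` is a hypothetical helper written for Boolean queries.

```python
def homomorphism(src_atoms, dst_atoms):
    """Backtracking search for a homomorphism between two Boolean tableaux.

    Atoms are (relation name, tuple of variables). Returns a dict or None.
    """
    def extend(i, h):
        if i == len(src_atoms):
            return h
        rel, args = src_atoms[i]
        for rel2, args2 in dst_atoms:
            if rel2 != rel or len(args2) != len(args):
                continue
            h2 = dict(h)
            # Map the i-th source atom onto this target atom if consistent with h.
            if all(h2.setdefault(a, b) == b for a, b in zip(args, args2)):
                found = extend(i + 1, h2)
                if found is not None:
                    return found
        return None

    return extend(0, {})


Q = [("R", ("x1", "x2", "x3")), ("R", ("x3", "x4", "x5")), ("R", ("x5", "x6", "x1"))]
Q1P = [("R", ("x", "y", "x"))]
Q2P = [("R", ("x1", "x2", "x3")), ("R", ("x3", "x4", "x2")), ("R", ("x2", "x5", "x1"))]
Q3P = Q + [("R", ("x1", "x3", "x5"))]

if __name__ == "__main__":
    for name, approx in (("Q'1", Q1P), ("Q'2", Q2P), ("Q'3", Q3P)):
        # A homomorphism T_Q -> T_Q' witnesses that Q' is contained in Q.
        print(name, homomorphism(Q, approx) is not None)
```

All three checks succeed. In the other direction, `homomorphism(Q1P, Q)` finds nothing, since no atom of Q repeats a variable in its first and third positions, so Q′1 is strictly contained in Q.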

6. EXTENSIONS

So far approximations were required to produce only correct answers, i.e., no false positives. In general, one can drop this assumption, or replace it by a different one. We now take a quick look at what happens in those cases, leaving more detailed investigation to further work.

Our first result concerns the ordering ⊑Q we used to define approximations. Recall that Q1 ⊑Q Q2 means that Q2 approximates Q as well as Q1 does: that is, whenever Q and Q1 agree on a database, then Q and Q2 agree on it too. We now provide the exact complexity of testing this ordering.

Note that even the upper bound on the complexity of Q1 ⊑Q Q2 is not straightforward. Indeed, the definition of ⊑Q involves universal quantification over databases, and then checking whether CQs evaluate to true or false over them. To start with, given universal quantification over all databases, a priori it is not clear if Q1 ⊑Q Q2 is decidable. Assuming, however, that we manage to prove that it suffices to check only databases of polynomial size (in terms of the sizes of Q, Q1, and Q2), then parsing the definition of ⊑Q would give us a $\Pi^p_2$ upper bound. However, we can lower the complexity (and, somewhat surprisingly, to a class that only permits existential guessing).

Proposition 6.1. The following problem is NP-complete: given three CQs Q, Q1, Q2, is Q1 ⊑Q Q2?

It remains NP-complete if Q1, Q2 are restricted to be from the class AC of acyclic queries, or from TW(k) for k ≥ 1. The proof establishes structural properties of ⊑Q based on the properties of the → ordering; those in turn lead to an NP algorithm.

We finish the paper by looking at the assumption that is opposite to the one we have considered so far: instead of insisting that approximating queries return no false positives (i.e., only correct answers), we now look at approximations that produce all correct answers, and perhaps something else, i.e., no false negatives. We refer to them as overapproximations. Formally, for a CQ Q not in a class C of CQs, a query Q′ ∈ C such that Q ⊆ Q′ is a C-overapproximation of Q if there is no query Q′′ ∈ C with Q ⊆ Q′′ such that Q′