Conjunctive Filter: Breaking the Entropy Barrier

Daisuke Okanohara∗ and Yuichi Yoshida†

∗ Preferred Infrastructure, Inc. and the Department of Computer Science, The University of Tokyo. email: [email protected]
† Preferred Infrastructure, Inc. and School of Informatics, Kyoto University. email: [email protected]

Abstract

We consider the problem of storing a map that associates a key with a set of values. To store n values from a universe of size m, $\log_2 \binom{m}{n}$ bits of space are required, which can be approximated as n(1.44 + log2(m/n)) bits when n ≪ m. If we allow an ε fraction of errors in the outputs, we can store the map with roughly n log2(1/ε) bits, which matches the entropy bound; the Bloom filter is a well-known example of such a data structure. Our objective is to break this entropy bound and construct more space-efficient data structures. In this paper, we propose a novel data structure called a conjunctive filter, which supports conjunctive queries on k distinct keys for fixed k. Although a conjunctive filter cannot return the set of values associated with a single queried key, it can perform conjunctive queries with an O(1/√m) fraction of errors. Moreover, the consumed space is (n/k) log2 m bits, which is significantly smaller than the entropy bound (n/2) log2 m when k ≥ 3. We show that many problems, such as full-text search and database join queries, can be solved using a conjunctive filter. We also conducted experiments on a real-world data set and show that a conjunctive filter answers conjunctive queries almost correctly using about 1/2 to 1/4 of the space of the entropy bound.

1 Introduction

We consider a data structure called a map or an associative array. A map associates a key with a value and supports lookup operations: given a query, it returns the corresponding value if it exists, or returns a special value indicating that the key is not found. We assume that the pairs of keys and values are given beforehand and that the map supports neither insertions nor deletions. We focus on an extended version of a map in which a set of integers is associated with each key (we assume that these integers are distinct for the sake of simplicity).


An important application of this data structure is full-text search: each term plays the role of a key, and the documents containing the term play the role of values.

Lower and upper bounds on the number of bits required to store a set of integers are well studied. For n integers in {1, ..., m}, we can store them using n(log2 e + log2 m/n) + o(n) bits of space and can access each of them in O(1) time [11]. If we reduce the space further, we cannot always obtain correct answers anymore. However, in some cases such errors are allowed. An example of such data structures is the Bloom filter [3]. A Bloom filter stores a set of keys and, given a key, answers whether the key exists or not. A Bloom filter never produces false negatives, i.e., it always reports that a key exists if the key indeed exists. Since a Bloom filter allows false positives, it may report that a key exists even if it does not. Therefore, when we deal with a massive number of keys, we can filter out most of the uninteresting keys using a Bloom filter stored in fast memory, and then check whether a key actually exists using another data structure stored on slow external memory. Another example of a data structure allowing errors is the Bloomier filter [4, 5, 8, 10], which associates a key with a value and accepts a small probability of false positives. In particular, the original Bloomier filter [5] is space-efficient especially when the distribution of values is very skewed.

Our study follows this line of work. We consider the case that a map associates a key with a set of values, and we accept false positives. Our objective is to reduce the number of required bits beyond the entropy bound, which is roughly log2(1/ε) bits per element when we allow an ε fraction of the universe of values to appear in outputs. To achieve this goal, we obviously have to discard some features, since the entropy bound is a lower bound. In this paper, we propose a novel data structure called a conjunctive filter, which only processes k-conjunctive queries. A k-conjunctive query is a tuple of k distinct keys; given a k-conjunctive query, we are to compute the intersection of the sets of values associated with the queried keys. We assume that these sets of values are sparse, that is, each set contains o(m) elements and the number of elements in the map is linear in the number of keys.


Then, we will show that conjunctive filters can perform k-conjunctive queries with just an ε = O(1/√m) fraction of errors, and that the consumed space is (1/k) log2 m bits per element, which is significantly smaller than the entropy bound log2(1/ε) = (1/2) log2 m when k ≥ 3.

An interesting feature of a conjunctive filter is that it can perform k-conjunctive queries with a few errors even though it cannot output the set of values associated with a single key. Intuitively, the intersection operations act as filters on incorrect values, and only correct values can survive; thus the resulting output may contain only a small number of errors.

In the present study, we give two sample problems solved with k-conjunctive queries. The first one is full-text search, and the second one is conjunctive attribute queries on a database. We also describe the implementation details of a conjunctive filter, which is fairly easy. In experiments, we examined the performance of a conjunctive filter on a movie data set. We found that a conjunctive filter indeed performs k-conjunctive queries with a few errors and requires less working space than the entropy bound.

Organization: In Section 2, we give the notions used in this paper. In Section 3, we show a lower bound on the size of a data structure that supports k-conjunctive queries. We describe the detailed implementation of a conjunctive filter in Section 4. Possible applications of a conjunctive filter are given in Section 5. In Section 6, using a movie database, we show that the experimental results support our analysis. Section 7 gives our conclusions.

2 Definitions

An instance considered in this paper is a map that associates each key with a set of values. Let X be the universe of keys and V be the universe of values. Then, a map f can be expressed as a function f : X → 2^V. Throughout the paper, n and m denote |X| and |V|, respectively. The number of elements of f is defined as Σ_{x∈X} |f(x)|.

For a positive integer k, let X̂^k be the set of all tuples of k distinct keys, i.e., {(x_1, ..., x_k) ∈ X^k | ∀i ≠ j, x_i ≠ x_j}. For X = (x_1, ..., x_k) ∈ X̂^k, let f(X) = ∩_{i=1}^{k} f(x_i) be the conjunction of the f(x_i) and f∪(X) = ∪_{i=1}^{k} f(x_i) be the disjunction of the f(x_i).

Let s be a binary encoding of a function that is supposed to simulate k-conjunctive queries on f. We identify s with the function, i.e., s : X̂^k → 2^V. For X ∈ X̂^k, we call a value in s(X) \ f(X) a false positive and a value in f(X) \ s(X) a false negative.


Using these notions, we define two error measures:

$$\epsilon^{+}_X(s) = \frac{|s(X) \setminus f(X)|}{|V \setminus f(X)|}, \qquad \epsilon^{-}_X(s) = \frac{|f(X) \setminus s(X)|}{|f(X)|}.$$

Here, ε+_X(s) is the proportion of false positives among V \ f(X) and ε−_X(s) is the proportion of false negatives among f(X). Next, we introduce another error measure by relaxing ε+_X(s), using f∪(X) instead of f(X). We define

$$\epsilon^{\cup}_X(s) = \frac{|s(X) \setminus f^{\cup}(X)|}{|V \setminus f^{\cup}(X)|}.$$

In this paper, we only consider data structures with ε−_X(s) = 0, i.e., s(X) always contains f(X). We say that a binary string s (ε, k)-encodes a map f if it supports queries by tuples of k distinct keys and ε∪_X(s) ≤ ε and ε−_X(s) = 0 hold for all X ∈ X̂^k.

Some readers might wonder why we introduce ε∪_X(s). This is because our data structure eliminates false positives by taking the intersection of the f(x_i). Thus, if a false positive appears in most (but not all) of the f(x_i), it will survive with high probability, and we can only guarantee that values outside f∪(X) will be eliminated from s(X). Also, one might think that an (ε, k)-encoding of f supports k-disjunctive queries rather than k-conjunctive queries, since ε∪_X(s) is defined using f∪(X). However, an (ε, k)-encoding only guarantees that s(X) contains f(X); there may be false negatives with respect to f∪(X).
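To make these definitions concrete, here is a small toy computation (our own example, not from the paper) of f(X), f∪(X), and the three error measures in Python:

    # Toy illustration of the definitions above (hypothetical data).
    V = set(range(10))                      # universe of values, m = 10
    f = {"a": {1, 2, 3}, "b": {2, 3, 7}}    # a map f : X -> 2^V with l = 6 elements

    X = ("a", "b")                           # a 2-conjunctive query
    f_conj = f["a"] & f["b"]                 # f(X)   = {2, 3}
    f_disj = f["a"] | f["b"]                 # f_u(X) = {1, 2, 3, 7}

    s_X = {2, 3, 9}                          # an answer with one false positive, no false negative

    eps_plus  = len(s_X - f_conj) / len(V - f_conj)    # |s(X)\f(X)|   / |V\f(X)|   = 1/8
    eps_minus = len(f_conj - s_X) / len(f_conj)        # |f(X)\s(X)|   / |f(X)|     = 0
    eps_union = len(s_X - f_disj) / len(V - f_disj)    # |s(X)\f_u(X)| / |V\f_u(X)| = 1/6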

3 A Lower Bound on the Size of an (ε, k)-Encoding of a Map

In this section, we show a lower bound on the size of an (ε, k)-encoding of a map f. First, we consider the case k = 1. Note that using an (ε, 1)-encoding we can restore all the elements of f(x) (with a few errors) for every x.

Lemma 3.1. The average number of bits required per element in any data structure that can (ε, 1)-encode any map with n keys and l elements is at least log2((nm − l)/(l + ε(nm − l))).

Proof. Let f be a map and s be a w-bit string that (ε, 1)-encodes f. Since |s(x)| ≤ |f(x)| + ε(m − |f(x)|) holds for any x ∈ X and Σ_{x∈X} |f(x)| = l, we have Σ_{x∈X} |s(x)| ≤ l + ε(nm − l). Thus, s can (ε, 1)-encode at most $\binom{l+\epsilon(nm-l)}{l}$ maps. From the counting argument, we must have

$$2^w \binom{l + \epsilon(nm - l)}{l} \ge \binom{nm}{l}.$$


Using the fact that $\frac{(a-b)^b}{b!} \le \binom{a}{b} \le \frac{a^b}{b!}$ and taking the logarithm, we require w ≥ l log2((nm − l)/(l + ε(nm − l))).

and taking fact that we need log2 m bits for each element when we naively encode f . Now, we show how to create a data structure under more general settings. For each key x, we construct a In particular, when l = o(nm), the lower bound becomes 1 balanced binary tree Tx such that there is a bijection log2 ϵ . In order to break this bound, we discard the from leaves of Tx to V. We construct this bijection feature of restoring f (x) itself. Instead, we only require randomly using a hash function with seed x. An encode k-conjunctive queries for k ≥ 2. The next theorem of a value v with respect to Tx is a bit encoding of states that such a data structure might be constructed the path from the root of Tx to the leaf corresponding using less bits. to v. That is, in the path from the root to the leaf, we Theorem 3.1. The average number of bits required append 0 (resp., 1) if we go down to the left child (resp., per element in any data structure that can (ϵ, k)- the right child) at each node. encode any map with n keys and l elements is at least Let hx (1 ≤ hx ≤ log2 m) be a parameter deter1 nm−l log . mined later. This parameter will decide the size of our 2 l+ϵ(nm−l) k data structure and the amount of errors. We encode Proof. Let L(n, l) be the lower bound on the size of a f (x) as follows. For each v ∈ f (x), we encode the posidata structure that can (ϵ, k)-encode any map with n tion of v with respect to Tx . To reduce the number of keys and l elements. Let f be a map with n keys and l bits, we only use the first hx bits and let pv be the resultelements. We create a new map f ′ by duplicating each ing bits. Note that pv1 = pv2 may hold for v1 , v2 ∈ f (x) pair of a key and values in f by k times, i.e., for each even if v1 ̸= v2 . Let Px = {pv |v ∈ f (x)}. Then we store x ∈ X we introduce corresponding k keys x1 , . . . , xk Px using rice coding [12]. The number of consumed bits such that f ′ (x1 ) = · · · = f ′ (xk ) = f (x). Let n′ and l′ is at most |Px | log 2hx . 2 |Px | be the number of keys and elements in f ′ . Note that Now, we describe how to perform k-conjunctive n′ = kn and l′ = kl. queries given a bit string obtained by the method stated Let s be a w-bit string that (ϵ, k)-encodes f ′ where above. Let X ∈ Xbk be a query. For each x ∈ X, we w = L(n′ , l′ ) = L(kn, kl). We regard s as an (ϵ, 1)- decode P from the bit string. Then, we traverse T x x encoding of f . To determine f (x), we ask s for using p ∈ P until we reach a node with depth h . Note x x a k-conjunctive query X = (x1 , . . . , xk ). We have that this is possible since T was constructed only from x |s(X) \ f ∪ (X)| ≤ ϵ(m − |f ∪ (X)|) since s is an (ϵ, k)- x. Let V be the set of reached nodes. We create S x x encoding of f ′ . Also, since f ∪ (X) = f (x), we have by gathering all values corresponding to leaves in the |s(X) \ f (x)| ≤ ϵ(m − |f (x)|) and it follows that s subtree rooted at each node of V . The size of S is at x x is also an (ϵ, 1)-encoding of f . From Lemma 3.1, we most |P |2log2 m−hx and f (x) ⊆ S . Finally, we output x x nm−l must have w = L(kn, kl) ≥ l log2 l+ϵ(nm−l) . Therefore, S = ∩ X x∈X Sx as an approximation to f (X). This can l nm−l k−1 k−1 L(n, l) ≥ k log2 l+ϵ(nm−l) . be done in O(km k ) time since |Sx | = m k holds for each x. 4 Conjunctive Filter We set hx = k1 log2 m + log2 |Px | for each x so k−1 In this section, we propose a conjunctive filter, which that |Sx | becomes m k . 
4 Conjunctive Filter

In this section, we propose a conjunctive filter, which (ε, k)-encodes a map with a number of bits close to the lower bound given by Theorem 3.1.

To describe our idea, let us consider the case that k = 2 and |f(x)| = 1 for every x ∈ X. In order to identify f(x) among m values, we basically need log2 m bits (if we do not allow errors). If we use only (1/2) log2 m bits instead of log2 m bits, √m candidates remain for f(x). Let S_x be the set of these candidates. Suppose that there is a method that outputs S_x for x ∈ X such that f(x) ⊆ S_x and the other elements of S_x are uniformly distributed. Then, for any distinct keys x and y, it holds that f(x) ∩ f(y) ⊆ S_x ∩ S_y and |S_x ∩ S_y| = O(1) (from the birthday paradox). Using this idea, we can construct a data structure that supports 2-conjunctive queries with a few errors using (1/2) log2 m bits for each element. This is much more efficient compared to the fact that we need log2 m bits for each element when we naively encode f.

Now, we show how to create a data structure under more general settings. For each key x, we construct a balanced binary tree T_x such that there is a bijection from the leaves of T_x to V. We construct this bijection randomly using a hash function with seed x. An encoding of a value v with respect to T_x is a bit encoding of the path from the root of T_x to the leaf corresponding to v. That is, on the path from the root to the leaf, we append 0 (resp. 1) if we go down to the left child (resp. the right child) at each node.

Let h_x (1 ≤ h_x ≤ log2 m) be a parameter determined later. This parameter will decide the size of our data structure and the amount of errors. We encode f(x) as follows. For each v ∈ f(x), we encode the position of v with respect to T_x. To reduce the number of bits, we only use the first h_x bits, and let p_v be the resulting bits. Note that p_{v1} = p_{v2} may hold for v_1, v_2 ∈ f(x) even if v_1 ≠ v_2. Let P_x = {p_v | v ∈ f(x)}. Then we store P_x using Rice coding [12]. The number of consumed bits is at most |P_x| log2(2^{h_x}/|P_x|).

Now, we describe how to perform k-conjunctive queries given a bit string obtained by the method stated above. Let X ∈ X̂^k be a query. For each x ∈ X, we decode P_x from the bit string. Then, we traverse T_x using each p ∈ P_x until we reach a node of depth h_x. Note that this is possible since T_x was constructed only from x. Let V_x be the set of reached nodes. We create S_x by gathering all values corresponding to leaves in the subtree rooted at each node of V_x. The size of S_x is at most |P_x| 2^{log2 m − h_x}, and f(x) ⊆ S_x. Finally, we output S_X = ∩_{x∈X} S_x as an approximation to f(X). This can be done in O(k m^{(k−1)/k}) time since |S_x| = m^{(k−1)/k} holds for each x.

We set h_x = (1/k) log2 m + log2 |P_x| for each x so that |S_x| becomes m^{(k−1)/k}.
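To make the construction concrete, here is a minimal Python sketch (our own illustration, not the paper's code, with simplifications: each P_x is stored as a plain Python set instead of being Rice-coded, positions come from a seeded SHA-256 hash standing in for the tree T_x, and the candidate sets S_x are recovered by scanning a toy-sized universe; the Implementation Details below replace that scan with an invertible hash):

    import hashlib

    def position(x, v, m):
        # Pseudo-random position of value v among the leaves of the (implicit) tree T_x.
        digest = hashlib.sha256(f"{x}:{v}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % m

    def prefix(pos, h_bits, depth):
        # First h_bits bits of the depth-bit encoding of pos (a root-to-node path prefix).
        return pos >> (depth - h_bits)

    def build(f, m, k):
        # For every key x, keep only the set P_x of h_x-bit prefixes of its values' positions.
        depth = max(1, (m - 1).bit_length())
        filt = {}
        for x, values in f.items():
            # h_x ~ (1/k) log2 m + log2 |f(x)|   (the paper uses |P_x| <= |f(x)| here)
            h_x = min(depth, depth // k + max(1, len(values)).bit_length())
            filt[x] = (h_x, {prefix(position(x, v, m), h_x, depth) for v in values})
        return depth, filt

    def query(state, keys, m):
        # Approximate f(X) by intersecting the candidate sets S_x of all queried keys.
        depth, filt = state
        result = None
        for x in keys:
            h_x, P_x = filt[x]
            S_x = {v for v in range(m)          # toy-sized scan; see Implementation Details
                   if prefix(position(x, v, m), h_x, depth) in P_x}
            result = S_x if result is None else result & S_x
        return result

For instance, with a hypothetical map f = {"alice": {3, 9}, "bob": {5, 9}} and m = 16, query(build(f, 16, 2), ("alice", "bob"), 16) returns a superset of {9}; any extra elements are values whose hashed prefixes happen to collide under both keys.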

The following two lemmas guarantee that S_X is indeed a good approximation to f(X).

Lemma 4.1. Let X ∈ X̂^k be a tuple of k distinct keys. Then f(X) ⊆ S_X holds.

Proof. Since f(x) ⊆ S_x for each x ∈ X, the lemma immediately follows.

Lemma 4.2. Let X ∈ X̂^k be a tuple of k distinct keys. Then E[|S_X|] ≤ |f∪(X)| + 1 holds.

Proof. Let X_v (v ∈ V \ f∪(X)) be the event that the value v is in S_X.
Since the elements of S_x \ f(x) are independently distributed for each x ∈ X, we have

$$\Pr[X_v] = \prod_{x \in X} \frac{|S_x| - |f(x)|}{m - |f(x)|} \;\le\; \prod_{x \in X} \frac{|S_x|}{m} \;\le\; \prod_{x \in X} \frac{|P_x|\, 2^{\log_2 m - h_x}}{m} \;=\; \frac{1}{m},$$

where the first inequality uses |S_x| ≤ m and the last equality uses |P_x| 2^{log2 m − h_x} = m^{(k−1)/k} for each of the k keys in X.

Since there are at most m values v ∈ V \ f∪(X), the expected number of elements in S_X is at most |f∪(X)| + m · (1/m) = |f∪(X)| + 1.

We cannot bound E[|S_X|] using |f(X)|. To see this, let X = (x_1, ..., x_k) and suppose that a value v appears in k − 1 of f(x_1), ..., f(x_k). Then the probability that v is in S_X is m^{(k−1)/k}/m = 1/m^{1/k}, since only one f(x_i) does not contain v. It follows that the expected number of such false positives is m^{(k−1)/k}.

Lemma 4.3. The average number of bits per element consumed by a conjunctive filter is (1/k) log2 m.

Proof. Since for each x we require |P_x| log2(2^{h_x}/|P_x|) bits, the total number of bits consumed by a conjunctive filter is

$$\sum_{x \in X} |P_x| \log_2 \frac{2^{h_x}}{|P_x|} = \sum_{x \in X} |P_x| (h_x - \log_2 |P_x|) = \sum_{x \in X} |P_x| \left( \frac{1}{k} \log_2 m + \log_2 |P_x| - \log_2 |P_x| \right) \le \frac{l}{k} \log_2 m.$$

Lemma 4.2 does not directly imply that a conjunctive filter is an (ε, k)-encoding of f for small ε, since we must bound ε∪_X(s) for every X ∈ X̂^k. However, from a simple application of Hoeffding's inequality [6], we can show that |S_X| ≤ |f∪(X)| + O(√(m − |f∪(X)|)) holds for every X ∈ X̂^k with high probability. Thus, a conjunctive filter is actually an (O(1/√(m − u)), k)-encoding of f, where u = max_{X∈X̂^k} |f∪(X)|. If |f(x)| = o(m) holds for every x, it follows that ε = O(1/√m). The lower bound given by Theorem 3.1 for such ε is (1/(2k)) log2 m + O(1).

Implementation Details: It is not practical to build a balanced binary tree T_x for each x ∈ X, since it takes O(m) time. Here, we introduce a simple alternative. Suppose that we want to store the set of values associated with a key x. We identify V with {0, ..., m − 1} and calculate the position of a leaf in T_x using a random bijection. Let p be the minimum prime number with p ≥ m. We choose two integers a_x ≠ 0 and b_x using a hash function with seed x, and let g_x(v) = (a_x v + b_x) mod p. Clearly, g_x is a bijection. The position of the leaf corresponding to a value v in T_x will be g_x(v). If we want to encode a value v using h bits, we simply take the first h bits of g_x(v). When performing a conjunctive query, we have to decode v from g_x(v). This can easily be done using the fact that g_x^{-1}(w) = a_x^{p−2}(w − b_x) mod p, which follows from Fermat's little theorem. These operations can be done in O(log m) time.

Gathering all the discussions in this section, we have the following theorem.

Theorem 4.1. A conjunctive filter (ε, k)-encodes a map with l elements using (l/k) log2 m bits, where m is the size of the universe of values. The time complexity of performing a k-conjunctive query is O(k m^{(k−1)/k}).
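A small sketch of this invertible hashing (our own illustration; the helper names next_prime and make_bijection are ours, and SHA-256 is used only as a convenient deterministic seeded hash):

    import hashlib

    def next_prime(m):
        # Smallest prime p >= m (naive trial division; fine for a sketch).
        def is_prime(q):
            if q < 2:
                return False
            i = 2
            while i * i <= q:
                if q % i == 0:
                    return False
                i += 1
            return True
        p = max(2, m)
        while not is_prime(p):
            p += 1
        return p

    def make_bijection(x, m):
        # Derive a_x != 0 and b_x deterministically from the key x.
        p = next_prime(m)
        a_x = int.from_bytes(hashlib.sha256(f"a:{x}".encode()).digest()[:8], "big") % (p - 1) + 1
        b_x = int.from_bytes(hashlib.sha256(f"b:{x}".encode()).digest()[:8], "big") % p

        def g(v):
            # g_x(v) = (a_x v + b_x) mod p; injective on V = {0, ..., m-1} since p is prime.
            return (a_x * v + b_x) % p

        def g_inv(w):
            # g_x^{-1}(w) = a_x^{p-2} (w - b_x) mod p, by Fermat's little theorem.
            return (pow(a_x, p - 2, p) * (w - b_x)) % p

        return g, g_inv

At query time, each stored prefix of g_x(v) is expanded to the positions sharing that prefix and mapped back through g_inv, which replaces the linear scan of the universe used in the earlier sketch.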

5 Applications

We explain two applications of conjunctive filters. The first one is full-text search for long queries, and the second one is conjunctive attribute search on a database.

5.1 Full-Text Search with Long Queries

Given a query q of length m and a document set D of n documents, the task of full-text search is to find the set of documents d_1, d_2, ..., d_t containing q. In this subsection, we show a space-efficient index for full-text search using conjunctive filters.

We focus on the case that the length of a query is large enough, say, Ω(log n). In this setting, we can search for the documents containing q by taking the conjunction of the documents containing a substring of q.

To achieve this in an efficient manner, we propose to use a decomposition of suffix trees. First, we concatenate the documents, inserting a special character at each boundary between documents, and build a suffix tree over the resulting text.

Let b be a parameter, which controls how many nodes will be removed from the suffix tree. The index becomes smaller by taking b larger. However, by doing so, a search result will have more false positives, i.e., documents not containing the query.

For a node t in a suffix tree, d(t) denotes the number of leaves descending from t. Let p(t) denote the parent node of t if it exists and null if t is the root node. A node t is called b-marked if d(t) < b and d(p(t)) ≥ b. Clearly, the set of b-marked nodes covers all leaves without overlapping (a similar technique was used for the marked ancestor problem [1]). Then, for each b-marked node t, we remove all leaves descending from t. Also, we associate the positions stored in these leaves with t. We call this tree a b-suffix tree. An example is depicted in Figure 1.






Figure 1: An example of a suffix tree and its reduced tree.
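The b-marked nodes can be found with one bottom-up pass over leaf counts; here is a minimal sketch on a generic tree represented as a child dictionary (our own illustration; it assumes d(root) ≥ b and does not build an actual suffix tree):

    def b_marked_nodes(children, root, b):
        # children: dict mapping each internal node to its list of children; leaves are absent.
        # Returns the b-marked nodes, i.e., nodes t with d(t) < b and d(parent(t)) >= b.
        d = {}

        def count(t):
            # d(t) = number of leaves descending from t
            kids = children.get(t, [])
            d[t] = 1 if not kids else sum(count(c) for c in kids)
            return d[t]

        count(root)
        marked = []

        def walk(t):
            for c in children.get(t, []):
                if d[c] < b <= d[t]:
                    marked.append(c)   # prune here; the positions below c get attached to c
                else:
                    walk(c)

        walk(root)                     # assumes d(root) >= b, as in the setting of the text
        return marked

Running this on the suffix tree and replacing each marked node's subtree by the list of positions it covered yields the b-suffix tree described above.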

The b-suffix tree can be regarded as a map from a string to a set of positions. Thus, we can store it using a conjunctive filter. Given a query q of length m, we check which substrings of q appear in the b-suffix tree; all of these can be checked in O(m) time. Let {(o_1, P_1), (o_2, P_2), ..., (o_k, P_k)} be the set of pairs of an offset and the positions associated with such b-marked nodes, where an offset is the beginning position of the corresponding substring in the query. Then, we perform a k-conjunctive query on these position sets, taking the offset information into account, and output the result as the set of documents containing q. For example, suppose that the result for a query q is {(1, {4, 16, 31}), (3, {6, 11, 18, 33})}. Then, positions {3, 15} will be returned as positions where the query q occurs in the text.

5.2 Conjunctive Attribute Search on a Database

Let us consider a database in which each item is represented by a set of attributes. For example, the IMDB movie database (http://www.imdb.com/) has the following attributes for each movie: actor/actress, year, production company, locations, and editors. In a typical situation, a user searches for movies by specifying some of these attributes. To solve this problem, we assign a unique identifier to each attribute value, and these identifiers form the universe of keys, while the items form the universe of values. Then, this problem can be solved naively by a conjunctive filter. We note that a conjunctive filter never gives rise to false negatives, so we can obtain exact results by checking whether each returned item actually has the queried attributes.
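A sketch of this verification step (our own illustration with hypothetical names: cf_query stands for a conjunctive-filter query and movie_attrs for the exact per-item attribute sets consulted during verification):

    def conjunctive_attribute_search(cf_query, movie_attrs, queried_attrs):
        # 1. Ask the conjunctive filter for items matching all queried attribute values.
        #    The answer may contain false positives but never misses a true answer.
        candidates = cf_query(queried_attrs)
        # 2. Verify each candidate against its true attribute set to drop false positives.
        return {item for item in candidates if queried_attrs <= movie_attrs[item]}

    # Hypothetical usage:
    # exact = conjunctive_attribute_search(cf_query, movie_attrs, {"actor:Alice", "year:1999"})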

6 Experiments

We implemented the database described in Section 5.2 using a conjunctive filter. We used the actors section of the IMDB data set (http://www.imdb.com/interfaces; ftp://ftp.fu-berlin.de/pub/misc/movies/database/actors.list.gz). It contains 1103393 actors, 1791274 movies, and 6493558 relations between actors and movies, i.e., which actors appear in which movies. Actors play the role of keys and movies play the role of values. Thus, a query for a conjunctive filter is a set of actors, and the result for the query is the set of movies in which all the queried actors appear.

Table 1 shows the size of the naive encoding and the sizes of conjunctive filters storing the IMDB data set for k = 2, 3, 4. The lower bound is obtained from Theorem 3.1 by substituting n = 1103393, m = 1791274, l = 6493558, and ε = 1/√m.

Next, we examined the accuracy of conjunctive filters for k = 2, 3. The result for k = 4 is similar and we omit it. We selected 695332 and 550884 movies for k = 2 and k = 3, respectively. For each movie, we selected k actors who appear in the movie and performed a k-conjunctive query with these k actors. Note that the answer to each query is not empty.

Figures 2 and 3 show the results of 2- and 3-conjunctive queries, respectively. Here, the x axis is the size of the exact solution for a query and the y axis is the number of movies returned by a conjunctive filter given the query. Therefore, a point on the line y = x indicates that there is no error for the corresponding query. We can see that there are no false negatives and that the number of false positives is roughly bounded by 1000. This matches the fact that a conjunctive filter is an (O(1/√m), k)-encoding, where m = 1791274 here.

Figures 4 and 5 show the results of 2- and 3-conjunctive queries whose exact solution has size 1. Here, the x axis is the number of movies returned by a conjunctive filter and the y axis is the number of queries for which the number of returned movies is x. We can see that the graph decays exponentially.

Figures 6 and 7 compare the upper bound given by Lemma 4.2 with the number of returned movies. From these results, we find that the variance of the number of false positives decreases as the number of returned movies increases.

7 Conclusions

We have presented conjunctive filters, which (ε, k)-encode a map with (l/k) log2 m bits, where m is the size of the universe of values and l is the number of elements in the map. By restricting our attention to k-conjunctive queries, when l is linear in n, we break the barrier of the entropy bound on the number of bits required to store a map.





Table 1: The sizes of conjunctive filters for k = 2, 3, 4. The Size column shows the actual size with Rice coding.

           Size (bits)    Lower bound (bits)
  raw      114949784      -
  k = 2     83997560      33701395
  k = 3     61002568      22467597
  k = 4     50409144      16850698
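A rough back-of-the-envelope check of the lower-bound column (our own arithmetic, not from the paper; it uses the approximate per-element bound (1/(2k)) log2 m stated in Section 4 rather than the exact expression of Theorem 3.1):

$$\frac{l}{2k} \log_2 m \approx \frac{6493558 \times 20.77}{2k} \approx \begin{cases} 3.37 \times 10^{7} \text{ bits} & (k = 2), \\ 2.25 \times 10^{7} \text{ bits} & (k = 3), \\ 1.69 \times 10^{7} \text{ bits} & (k = 4), \end{cases}$$

which agrees with the lower-bound entries above up to rounding.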

Figure 2: The result of 2-conjunctive queries (x axis: size of exact solutions; y axis: size of approximations).

Figure 3: The result of 3-conjunctive queries (x axis: size of exact solutions; y axis: size of approximations).

Figure 4: The result of 2-conjunctive queries for queries X with |f(X)| = 1 (x axis: size of approximations; y axis: count of queries).

Figure 5: The result of 3-conjunctive queries for queries X with |f(X)| = 1 (x axis: size of approximations; y axis: count of queries).

Figure 6: The comparison of |f∪(X)| and s(X) in 2-conjunctive queries (x axis: upper bound on expected size of approximations; y axis: size of approximations).

Figure 7: The comparison of |f∪(X)| and s(X) in 3-conjunctive queries (x axis: upper bound on expected size of approximations; y axis: size of approximations).

The size of conjunctive filters obtained by Lemma 4.3 still does not match the lower bound indicated by Theorem 3.1. In particular, we have no lower bound when l = Ω(nm). Tightening this gap will be future work. In another direction, it is important to reduce the space when the distribution of values is skewed. Very recently, Hreinsson et al. presented a new data structure for an associative array [7], which achieves a space usage close to the 0-th order entropy of the sequence of function values. We believe that a conjunctive filter can use the idea in their study to improve its space efficiency. Also, reducing the time complexity is an important issue. Since conjunctive filters require O(k m^{(k−1)/k}) time to perform k-conjunctive queries, we cannot insist that they are truly practical.


It might be interesting to construct a data structure that supports k-conjunctive queries with a time complexity of, say, O(log m), by using an efficient computation of intersections of compressed representations of hashed values [2]. Our idea would also be useful for other applications such as the secondary indexing problem [9].

Acknowledgments. We thank the anonymous reviewers for their useful comments.

References

[1] S. Alstrup, T. Husfeldt, and T. Rauhe. Marked ancestor problems. In FOCS '98: Proceedings of the 39th Annual Symposium on Foundations of Computer Science, pages 534–543, 1998.
[2] P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In ISAAC '07: Proceedings of the 18th International Symposium on Algorithms and Computation, 2007.
[3] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[4] D. Xavier Charles and K. Chellapilla. Bloomier filters: A second look. In ESA '08: Proceedings of the 16th Annual European Symposium on Algorithms, pages 259–270, 2008.
[5] Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld, and Ayellet Tal. The Bloomier filter: an efficient data structure for static support lookup tables. In SODA '04: Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 30–39, 2004.
[6] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[7] J. B. Hreinsson, M. Krøyer, and R. Pagh. Storing a compressed function with constant time access. In ESA '09: Proceedings of the 17th Annual European Symposium on Algorithms, pages 730–741, 2009.
[8] R. Pagh and M. Dietzfelbinger. Succinct data structures for retrieval and approximate membership. In ICALP '08: Proceedings of the 35th International Colloquium on Automata, Languages and Programming, pages 385–396, 2008.
[9] R. Pagh and S. R. Satti. Secondary indexing in one dimension: beyond B-trees and bitmap indexes. In PODS '09: Proceedings of the 28th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 177–186, 2009.
[10] Ely Porat. An optimal Bloom filter replacement based on matrix solving. In CSR '09: Proceedings of the 4th Computer Science Symposium in Russia, pages 263–273, 2009.
[11] R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In SODA '02: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 233–242, 2002.
[12] Robert F. Rice. Some practical universal noiseless coding techniques. JPL Publication 79-22, 1979.

