Forbidden Patterns

Johannes Fischer^{1,*}, Travis Gagie^{2,**}, Tsvi Kopelowitz^3, Moshe Lewenstein^4, Veli Mäkinen^{5,***}, Leena Salmela^{5,†}, and Niko Välimäki^{5,‡}

1 KIT, Karlsruhe, Germany, [email protected]
2 Aalto University, Espoo, Finland, [email protected]
3 Weizmann Institute of Science, Rehovot, Israel, [email protected]
4 Bar-Ilan University, Ramat Gan, Israel, [email protected]
5 University of Helsinki, Helsinki, Finland, {vmakinen|leena.salmela|nvalimak}@cs.helsinki.fi

* Supported by the German Research Foundation (DFG).
** Supported by Academy of Finland grant 134287.
*** Partially funded by the Academy of Finland grant 1140727. Also affiliated with Helsinki Institute for Information Technology (HIIT).
† Supported by Academy of Finland grant 118653 (ALGODAN).
‡ Partially funded by the Academy of Finland grant 118653 (ALGODAN) and Helsinki Doctoral Programme in Computer Science (Hecse).

Abstract. We consider the problem of indexing a collection of documents (a.k.a. strings) of total length n such that the following kind of queries is supported: given two patterns P^+ and P^-, list all n_match documents containing P^+ but not P^-. This is a natural extension of the classic problem of document listing as considered by Muthukrishnan [SODA'02], where only the positive pattern P^+ is given. Our main solution is an index of size O(n^{3/2}) bits that supports queries in O(|P^+| + |P^-| + n_match + √n) time.

1 Introduction

In recent years, the pattern matching community has paid considerable attention to document retrieval tasks, where, in contrast to traditional indexed pattern matching, the task is to output each document containing a search pattern only once, and in particular without spending time proportional to the total number of occurrences of the pattern in that document. Starting with Muthukrishnan's seminal paper [14] (building in fact on an earlier paper by the same author with colleagues [12]), an abundance of articles on variations of this scheme emerged, including: space reductions of the underlying data structures [7, 16, 17], ranking of the output [9, 11, 15], two-pattern queries [3, 5], and many more.

A different, and indeed very natural, extension of the basic problem is to exclude some documents from the output. In this setting, the user specifies, in addition to the query pattern P^+, a negative pattern P^- that should not occur in the retrieved documents. Formally, this problem of forbidden patterns can be modeled as follows:^6

Given: A collection of static documents D = {D_1, ..., D_d} over an alphabet Σ, of total length n = ∑_{i≤d} |D_i|.
Compute: An index that, given two patterns P^+ and P^- online, allows us to compute all n_match documents containing P^+, but not P^-.

The best we can hope for is certainly an index of linear size having a query time of O(|P^+| + |P^-| + n_match), as this is the time just to read the input and write the output of the query. However, achieving anything close to this optimum seems completely out of reach (at least at the current state of research), as forbidden pattern queries can be regarded as set difference queries, which are arguably at least as hard as set intersection queries. In the realm of document retrieval, the latter queries correspond to the case where two positive patterns P_1^+ and P_2^+ are given (and one is interested in all documents containing both positive patterns); we are aware of three indexes that address this problem: (1) Õ(n^{3/2}) words of space with query time O(|P_1^+| + |P_2^+| + n_match + √n) [5], (2) O(n log n) words and O(|P_1^+| + |P_2^+| + (√(n_match · n log n) + n_match) log^2 n) time [3], and (3) O(n) words and O(|P_1^+| + |P_2^+| + (√(n_match · n log n) + n_match) log n) time [8] (this improves on [3] in both time and space and has the further advantage that it generalizes to more than two patterns). In Appendix A of this paper, we show that this problem of two positive patterns is indeed harder than document retrieval with just one pattern.^7 However, the main body of this paper is devoted to the forbidden patterns problem. The following theorem summarizes our main result.

Theorem 1. For a text collection of total length n, there exists a data structure of size O(n^{3/2}) bits such that subsequent forbidden pattern queries can be answered in O(|P^+| + |P^-| + n_match + √n) time.

The rest of this article is structured as follows. Sect. 2 introduces known results that form the basic building blocks of our solution, including a description of the preprocessing algorithm for document retrieval with just one positive pattern [14]. In Sect. 3, we then give data structures for the forbidden patterns problem, where, apart from proving Thm. 1, we also look at the variation of just counting the number of documents. Finally, Sect. 4 concludes the paper.

^6 Muthukrishnan [14] already considered the case where just a negative pattern P^- is given, and gave an optimal solution that outputs all documents not containing P^-.
^7 Unfortunately, we could not come up with any meaningful lower bound for the forbidden patterns problem.


2 Preliminaries

2.1 Succinct Data Structures

Consider a bit-string S[1, n] of length n. We define the fundamental rank- and select-operations on S as follows: rank_1(S, i) gives the number of 1's in the prefix S[1, i], and select_1(S, i) gives the position of the i'th 1 in S, reading S from left to right (1 ≤ i ≤ n). The following lemma summarizes a by-now classic result (see, e.g., [13]):

Lemma 1. A bit-string of length n can be represented in n + o(n) bits such that rank- and select-operations are supported in O(1) time.
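To make this interface concrete, here is a minimal, non-succinct sketch of rank and select: it spends O(n) words on plain prefix sums and a list of 1-positions instead of the n + o(n) bits of Lemma 1, so only the semantics of the two operations carry over.

class BitVector:
    def __init__(self, bits):                 # bits: list of 0/1 values, S[1] = bits[0]
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + b)
        self.ones = [i + 1 for i, b in enumerate(bits) if b == 1]

    def rank1(self, i):                       # number of 1's in S[1, i]
        return self.prefix[i]

    def select1(self, i):                     # position of the i'th 1 in S
        return self.ones[i - 1]

S = BitVector([1, 0, 1, 1, 0, 1])
assert S.rank1(4) == 3 and S.select1(3) == 4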

2.2 Range Minimum Queries

A basic building block for our solution is a space-efficient preprocessing scheme for O(1) range minimum queries. For a static array E[1, n] of n objects from a totally ordered universe and two indices i and j with 1 ≤ i ≤ j ≤ n, a range minimum query rmq_E(i, j) returns the position of a minimum element in the sub-array E[i, j]; in symbols: rmq_E(i, j) = argmin{ E[k] | i ≤ k ≤ j }. We state the following result [6, Thm. 5.8]:

Lemma 2. A static array E[1, n] can be preprocessed in O(n) time into a data structure of size 2n + o(n) bits such that subsequent range minimum queries on E can be answered in O(1) time, without consulting E at query time. This size is asymptotically optimal.
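For concreteness, the following sparse-table sketch provides O(1)-time range minimum queries; it uses O(n log n) words and consults E at query time, so it does not attain the 2n + o(n)-bit bound of Lemma 2, but it pins down what rmq_E(i, j) returns (here with 0-indexed positions).

def build_rmq(E):
    """Sparse table: table[k][i] = position of a minimum of E[i .. i + 2^k - 1]."""
    n = len(E)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[k - 1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if E[a] <= E[b] else b)
        table.append(row)
        k += 1

    def rmq(i, j):                            # position of a minimum in E[i..j], i <= j
        k = (j - i + 1).bit_length() - 1
        a, b = table[k][i], table[k][j - (1 << k) + 1]
        return a if E[a] <= E[b] else b

    return rmq

E = [3, 1, 4, 1, 5, 9, 2, 6]
rmq_E = build_rmq(E)
assert rmq_E(2, 7) == 3 and E[3] == 1         # the minimum of E[2..7] is E[3] = 1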

2.3 Document Retrieval

We now explain Muthukrishnan's solution [14] for document retrieval with only one positive pattern P (in fact, we describe a variant [16] of the original algorithm that is more convenient for our purposes). The overall idea is to build a generalized suffix tree ST for the collection of documents D = {D_1, ..., D_d}, and enhance it with additional information for reporting the documents. A generalized suffix tree for D is a suffix tree for the text T := D_1 #_1 D_2 #_2 ... D_d #_d, where the #_i's are distinct characters not appearing elsewhere in D. A suffix tree ST on T, in turn, is a compacted trie on all suffixes of T, and consists of only O(|T|) nodes. Every such node v is said to correspond to a substring α of T iff the letters on the root-to-v path are exactly α. A suffix tree ST for T allows us to locate all occ occurrences of a search pattern P in T in optimal O(|P| + occ) time (with perfect hashing; otherwise the search takes O(|P| log |Σ| + occ) time). This search proceeds in two steps: it first finds in O(|P|) time the node v in ST such that all leaves below v correspond to the occ suffixes that are prefixed by P. In a second step, the starting points of all these suffixes are reported in additional O(occ) time.

For document retrieval, it should be clear that we can reuse only the first part of this search, but must modify the second step such that it uses O(n_match) instead of O(occ) time. To this end, Muthukrishnan's solution proceeds as follows. Consider the leaves of ST in lexicographic order. The positions in T of their corresponding suffixes form a permutation of the numbers [1, n'] (n' = n + d = O(n) being the size of T), the so-called suffix array A[1, n']. Define a document array D[1, n'] of the same size as A, such that D[i] holds the document number of the lexicographically i'th suffix. More formally, D[i] = j iff #_j is the first document separator in T[A[i], n']. We now chain suffixes from the same document in a new array E[1, n'] by defining E[i] = max{ j < i | D[j] = D[i] }, where the maximum of the empty set is assumed to be −∞. Array E is prepared for constant-time range minimum queries using Lemma 2.

With these data structures, we can obtain optimal O(|P| + n_match) listing time, as explained next. We first use ST to find in O(|P|) time the interval [ℓ, r] in A such that the suffixes in A[ℓ, r] are exactly those that are prefixed by P. We then call the recursive procedure list in Alg. 1, initially invoked by list(ℓ, r) and assuming V[i] = 0 for all 1 ≤ i ≤ d just before that first call:

Algorithm 1: List all documents in D[i, j] not occurring in D[ℓ, i − 1].

procedure list(i, j)
    if i > j then return
    m ← rmq_E(i, j)
    if V[D[m]] = 0 then
        output D[m]
        V[D[m]] ← 1
        list(i, m − 1)
        list(m + 1, j)

The idea of procedure list is that each distinct document identifier in D[ℓ, r] is listed only at the place m of its leftmost occurrence in the interval [ℓ, r]; such places are conveniently located by range minimum queries on E. To avoid duplicate outputs, we mark all documents found by a '1' in an additional array V[1, d], which is initialized with all 0's in the preprocessing phase. Whenever the smallest element in E[i, j] comes from a document already reported (hence V[D[m]] = 1), the recursion can be stopped, since every document in D[i, j] has already been reported when visiting distinct intervals [i', j'] with ℓ ≤ i' ≤ j' < i. Hence, the overall running time is O(n_match). At the end of the reporting phase, we need to reset V[·] to 0 for all documents in the output. This takes additional O(n_match) time. Apart from the suffix tree ST, the space for this solution is dominated by the n' words needed for storing the document array D.
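The following self-contained sketch puts Sect. 2.3 together on a toy collection: it builds T, the suffix array A, the document array D and the chaining array E, and then runs the recursion of Alg. 1. The suffix array is built naively and the range minimum query is a linear scan, so the stated time bounds are not met; the point is only to illustrate the arrays and the listing logic.

def build_index(docs):
    seps = [chr(1 + i) for i in range(len(docs))]          # distinct separators #_i
    T = "".join(d + s for d, s in zip(docs, seps))
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])              # suffix array A (naive)
    doc_of_pos = []                                         # document number of each
    for j, d in enumerate(docs):                            # position of T
        doc_of_pos.extend([j] * (len(d) + 1))
    D = [doc_of_pos[p] for p in sa]                         # document array
    E, last = [], {}                                        # chaining array
    for i, d in enumerate(D):
        E.append(last.get(d, float("-inf")))
        last[d] = i
    return T, sa, D, E

def sa_interval(T, sa, P):                                  # [l, r] of suffixes prefixed by P
    hits = [i for i in range(len(sa)) if T[sa[i]:].startswith(P)]
    return (hits[0], hits[-1]) if hits else None

def list_documents(D, E, l, r):
    out, seen = [], set()                                   # 'seen' plays the role of V
    def rmq(i, j):                                          # linear-scan stand-in for Lemma 2
        return min(range(i, j + 1), key=lambda k: E[k])
    def rec(i, j):                                          # procedure list of Alg. 1
        if i > j:
            return
        m = rmq(i, j)
        if D[m] in seen:                                    # document already reported:
            return                                          # stop the recursion
        seen.add(D[m]); out.append(D[m])
        rec(i, m - 1); rec(m + 1, j)
    rec(l, r)
    return out

docs = ["turing", "tape", "enigma", "apple"]
T, sa, D, E = build_index(docs)
l, r = sa_interval(T, sa, "t")
print(sorted(list_documents(D, E, l, r)))                   # -> [0, 1] ("turing", "tape")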

3 Document Retrieval with Forbidden Patterns

We now come to the description of our solution to the forbidden patterns problem, as presented in the introduction. We first present a rather simple solution (Sect. 3.1), which is subsequently refined (Sect. 3.2–3.3).

3.1 O(n^2) Words of Space

We first show how to achieve optimal O(|P^+| + |P^-| + n_match) query time with O(n^2) words of space. The idea is again to store a generalized suffix tree ST for the set of documents D and to enhance it with additional information. In particular, for every node v in ST corresponding to a string α, we store a copy of ST that excludes the documents containing α. We call that copy ST_v. Every ST_v is prepared for "normal" document listing (Sect. 2.3). When a query arrives, we first match P^- in ST until reaching node v (if the matching ends on an edge, we take the following node). We then jump to ST_v, where we match P^+, and list all n_match documents in optimal O(n_match) time. The space for this solution is clearly O(n^2) words, as the number of nodes in ST is O(n), and for each such node v we build another generalized suffix tree ST_v, each of which can contain O(n) nodes in the worst case.

3.2 O(n^2) Bits of Space

We now reduce the space of the solution from the previous section to O(n^2) bits. Our aim is to reuse the full suffix tree ST also when matching the positive pattern P^+, and to use a modified RMQ-structure when reporting documents with procedure list (see Sect. 2.3). To this end, let v be a node in ST, and let D_v be the set of documents containing the string represented by v (hence D_v is the set of documents in D[ℓ, r] if A[ℓ, r] is the suffix array interval for v in the sense of Sect. 2.3). In a (conceptual) copy E_v of the global chaining array E, we blank out all entries corresponding to documents in D_v by setting the corresponding values to +∞. More precisely, E_v[i] = +∞ if D[i] ∈ D_v, and E_v[i] = E[i] otherwise.

Now each such E_v is prepared for range minimum queries using Lemma 2, and only this RMQ-structure (not the array E_v itself!) is stored at node v. A further bit-vector B_v[1, n'] at node v marks with a '1' those positions that correspond to documents in D_v; in symbols: B_v[i] = 1 if E_v[i] = +∞, and B_v[i] = 0 otherwise. Hence, the total space needed is 3n' + o(n') = O(n) bits per node in ST. We also store the global document array D[1, n'], plus the bit-vector V[1, d] needed by Alg. 1, requiring O(n log n) and d = O(n) bits, respectively. The space for the entire data structure thereby amounts to O(n^2) bits.

Algorithm 2: Modified procedure for document listing.

procedure list'(i, j)
    if i > j then return
    m ← rmq_{E_v}(i, j)
    if V[D[m]] = 0 and B_v[m] = 0 then
        output D[m]
        V[D[m]] ← 1
        list'(i, m − 1)
        list'(m + 1, j)

The query processing starts as in the previous section: we first match P^- in ST until reaching node v (and again take the following node if we end on an edge). Now, instead of jumping to ST_v (which is no longer stored), we use ST again to find the interval [ℓ, r] in the suffix array A such that the suffixes in A[ℓ, r] are exactly those that are prefixed by P^+. We then call procedure list(ℓ, r), but using the RMQ-structure for E_v (corresponding to the negative pattern P^-) instead of the one for E. We need to further modify that procedure such that it does not list the documents in D_v; this can be accomplished by checking whether the m'th bit in B_v is set to '0'. If, on the other hand, B_v[m] = 1, we can stop the recursion, as in that case all other entries in E_v[i, j] must also be +∞ (and hence come from documents in D_v). The complete modified procedure list' is given in Alg. 2. As before, after having listed all n_match documents, we need to unmark the listed documents in V in additional O(n_match) time in order to prepare for the next query.
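As an illustration, the sketch below materializes E_v and B_v for a toy document array D (standing in for D[ℓ, r] on the suffix-array interval of P^+) and runs the recursion of Alg. 2 on it. The O(1)-time RMQ over E_v is again a linear scan and the arrays are stored explicitly rather than succinctly; the document set D_v is given directly instead of being derived from a pattern match.

INF = float("inf")

def chain(D):                                    # global chaining array E
    E, last = [], {}
    for i, d in enumerate(D):
        E.append(last.get(d, -INF))
        last[d] = i
    return E

def restricted_arrays(D, E, Dv):                 # E_v and B_v for the forbidden set D_v
    Ev = [INF if D[i] in Dv else E[i] for i in range(len(D))]
    Bv = [1 if D[i] in Dv else 0 for i in range(len(D))]
    return Ev, Bv

def list_excluding(D, Ev, Bv, l, r):             # procedure list' of Alg. 2
    out, seen = [], set()
    def rec(i, j):
        if i > j:
            return
        m = min(range(i, j + 1), key=lambda k: Ev[k])       # rmq on E_v (linear scan)
        if Bv[m] == 1 or D[m] in seen:                      # forbidden or already listed:
            return                                          # the whole range is exhausted
        seen.add(D[m]); out.append(D[m])
        rec(i, m - 1); rec(m + 1, j)
    rec(l, r)
    return out

D = [0, 2, 1, 0, 2, 3, 1]                        # toy document array; document 2 is forbidden
E = chain(D)
Ev, Bv = restricted_arrays(D, E, {2})
print(list_excluding(D, Ev, Bv, 0, len(D) - 1))  # -> [0, 1, 3]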

3.3 O(n^{3/2}) Bits of Space

We now present a space/time tradeoff for the solution given in the previous section. Our general idea is to store the RMQ-structures only at a selected subset of the nodes of ST, thereby possibly listing false documents that need to be filtered out at query time. For what follows, the reader should also consult the example shown in Fig. 1.

We assign a weight w_v to each node v in ST as follows. As before, let D_v denote the set of documents "below" v, i.e., the set of documents in D that contain the string represented by v. Then the weight of v is defined to be the number of documents in D_v: w_v = |D_v|. In ST, we mark certain nodes as important; the RMQ-structures will only be stored at important nodes. We will make sure that each node v has an important successor u such that w_v ≤ w_u + s, for some integer s to be determined later. At v, we also store a pointer to this important successor u; let p_v = u denote this pointer (for important nodes v we define p_v = v).

[Fig. 1: Generalized suffix tree (with super-leaf ℓ) for the collection of documents D = {turing, tape, enigma, apple}, built from T = turing#1 tape#2 enigma#3 apple#4 (with irrelevant parts pruned). The number inside a node v denotes its weight w_v. Important nodes (assuming s = 2) are shown in bold; hence, in this example only 3 RMQ-structures are stored.]

At query time, when the search for P^- ends at node v, we use the algorithm from the previous section, but now with the RMQ-structure for E_{p_v}. This reports at most s false documents, which need to be discarded from the output.

It thus remains to identify the false documents. For this, we need not store any additional data structures, as explained next. Let α denote the string represented by node p_v. Observe that the false documents are exactly those that contain P^- but not α; but this corresponds to a forbidden pattern query, with P^- as the positive pattern and α as the negative one! And for this query all necessary data structures are at hand, because at p_v there exists an RMQ-structure that filters out exactly those documents containing α.

In summary, to answer the query "P^+ and not P^-", we first match P^- in ST up to node v, where we follow the pointer p_v to an important successor representing a string α. At p_v, we answer the query "P^- and not α" with the algorithm from Sect. 3.2 to identify the set of false documents, which we mark in a bit-vector F[1, d]. Next, we answer the query "P^+ and not α", again using the algorithm from Sect. 3.2, but this time outputting only those documents not marked in F. In the end, all marked documents are unmarked to prepare for the next query. As by definition the marked (= false) documents are at most s in number, the total query time is O(|P^+| + |P^-| + n_match + s).
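The sketch below shows only the query decomposition just described, on a toy collection: the suffix tree, the pointers p_v and the per-node structures of Sect. 3.2 are replaced by naive substring scans, and the string α of the important successor is chosen by hand (a hypothetical choice). What remains visible is how "P^+ and not P^-" reduces to the two queries at p_v plus the filtering of at most s false documents.

docs = ["turing", "tape", "enigma", "apple"]

def docs_containing(pattern):                    # stand-in for the suffix-tree search
    return {i for i, d in enumerate(docs) if pattern in d}

def list_excluding(pattern, alpha):              # stand-in for the Sect. 3.2 query:
    return docs_containing(pattern) - docs_containing(alpha)    # 'pattern' but not 'alpha'

def forbidden_query(p_plus, p_minus, alpha):
    # alpha is the string of the important successor p_v of the locus of p_minus,
    # so p_minus is a prefix of alpha and every document containing alpha also
    # contains p_minus.
    assert alpha.startswith(p_minus)
    false_docs = list_excluding(p_minus, alpha)  # "P^- and not alpha": at most s documents
    candidates = list_excluding(p_plus, alpha)   # "P^+ and not alpha"
    return candidates - false_docs               # discard the false documents

# Documents containing "p" but not "t", with the (hypothetical) important
# successor of the locus of "t" representing alpha = "tu".
print(sorted(forbidden_query("p", "t", "tu")))   # -> [3]  (only "apple")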

Identifying Important Nodes. It remains to show how the important nodes can be identified. We do this in a bottom-up traversal of ST. We first enhance ST with a super-leaf ℓ that is the single child of all original leaves in ST. This node ℓ has weight w_ℓ = 0, is marked as important, and stores an RMQ-structure for the original array E. During the bottom-up traversal of all original nodes in ST (excluding the super-leaf ℓ), let us assume we arrive at a node v with children v_1, ..., v_k. By induction, all v_i's have already been assigned an important successor p_{v_i}. Let v_m be the child of v whose important successor occurs in the most documents, m = argmax{ w_{p_{v_i}} | 1 ≤ i ≤ k }. Then, if w_v − w_{p_{v_m}} ≤ s, we set p_v to p_{v_m}. Otherwise, we mark v as important, create an RMQ-structure for E_v, and set p_v = v.

Space Analysis. To analyze the space, consider the subtree ST_I of ST consisting only of the important nodes and their mutual lowest common ancestors (in the terminology of Cole et al. [4], ST_I is the subtree of ST induced by the important nodes). The nodes in ST_I are further divided into two classes: (1) non-branching internal, and (2) other. A node belongs to class non-branching internal iff it has exactly one child in ST_I; otherwise it belongs to class other. We analyze the number of non-branching internal and other nodes separately.

Let us first consider the other nodes. They form yet another induced subtree of ST_I, call it ST'_I. A leaf v of ST'_I covers at least s (original) leaves of ST, as its weight must by definition satisfy w_v > s, and in order to cover s documents, at least s suffixes of T must be covered. Hence, the number of leaves in ST'_I is bounded from above by n/s, and because ST'_I is compact, the total number of nodes in ST'_I is O(n/s).

Now look at the non-branching internal nodes of ST_I. Because every such node v must have w_v − w_u > s for its single child u in ST_I, this increase in weight can only come from at least s (original) leaves of ST for which v is their nearest important ancestor. As every leaf of ST can contribute to at most one such non-branching internal node, the number of non-branching internal nodes is also bounded by O(n/s).

In total, we store the RMQ-structures (using O(n) bits each) only at O(n/s) nodes of ST. By setting s = √n, we obtain Thm. 1.
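A sketch of the bottom-up marking is given below, on a hypothetical toy tree: every node is a plain dictionary that already carries the set of documents in its subtree (so w_v is just the size of that set), and the construction of the RMQ-structure at important nodes is omitted.

def mark_important(node, s):
    """Returns (p_v, w_{p_v}): the important successor of this node and its weight."""
    if not node["children"]:                     # the super-leaf: weight 0, always important
        node["important"] = True
        return node, len(node["docs"])
    best, best_w = None, -1
    for child in node["children"]:
        p, w = mark_important(child, s)
        if w > best_w:                           # child whose important successor
            best, best_w = p, w                  # covers the most documents
    wv = len(node["docs"])
    if wv - best_w <= s:                         # weight gap at most s: reuse the successor
        node["important"], node["p"] = False, best
        return best, best_w
    node["important"], node["p"] = True, node    # gap larger than s: mark v important
    return node, wv                              # (this is where an RMQ for E_v is built)

leaf = {"docs": set(), "children": []}           # super-leaf
a = {"docs": {0, 1}, "children": [leaf]}
b = {"docs": {0, 2, 3}, "children": [leaf]}
root = {"docs": {0, 1, 2, 3}, "children": [a, b]}
mark_important(root, s=2)
print([n["important"] for n in (root, a, b, leaf)])     # -> [False, False, True, True]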

3.4 Approximate Counting Queries

If only the number of documents containing P^+ but not P^- matters, we can obtain a faster data structure than that of Thm. 1. We first explain Sadakane's succinct data structure for document counting in the presence of just a positive pattern P [16, Sect. 5.2]. Without going too much into the details, the idea of that algorithm is as follows: for each node v in ST we store in c(v) how many "duplicate" suffixes from the same document occur in its subtree. When a query pattern P arrives, we first locate the node v representing P and its suffix array interval [ℓ, r]. Then r − ℓ + 1 − c(v) gives the number of documents containing P, as r − ℓ + 1 is the total number of occurrences, and c(v) is the right "correction term" that needs to be subtracted. (This idea was first used by Hui [10], who also showed how to compute the correction terms in overall linear time by lowest common ancestor queries.) The c(v)-numbers are not stored directly in the nodes (this would need O(n log n) bits overall), but rather in a bit-vector H' of length O(n) such that c(v) can be computed by a constant number of rank- and select-queries on H' (in fact by counting the 0's in a certain interval of H'), given v's suffix array interval [ℓ, r]. In total, this solution uses O(n) bits (see Lemma 1).

For forbidden patterns, we modify the data structure from Sect. 3.3 and store only the following information at the important nodes v:

– Vector B_v, marking the documents in D_v, as defined in Sect. 3.2. Each B_v is prepared for constant-time rank_1-queries.
– Vector H'_v, which is the bit-vector H' as defined in the preceding paragraph, but modified to exclude the documents in D_v. Using H'_v we can compute, for any node u, a modified correction term c_v(u) that excludes the documents containing the string represented by v. Vector H'_v is at most as long as H', hence using only O(n) bits.

Finally, each node v in ST stores the pointer p_v to its nearest important successor. When we need to answer a query of the form "what is the approximate number of documents containing P^+ but not P^-?", we first match P^- in ST until reaching node v. We then match P^+ in ST, say until reaching node u. Assume that u corresponds to the suffix array interval [ℓ, r]. Now observe that the number of 1's in B_{p_v}[ℓ, r] gives the number of suffixes below u that come from documents in D_{p_v}, and that this number can be computed as f = rank_1(B_{p_v}, r) − rank_1(B_{p_v}, ℓ − 1). So r − ℓ + 1 − f is the number of suffixes below u, excluding those from documents containing the string represented by p_v (and hence possibly still containing P^-). Hence, we return n'_match = r − ℓ + 1 − f − c_{p_v}(u) as the approximate answer to n_match, which satisfies n_match ≤ n'_match ≤ n_match + √n by the definition of important nodes (with s = √n).

Theorem 2. For a text collection of total length n, there exists a data structure of size O(n^{3/2}) bits such that subsequent queries asking for the approximate number of documents containing P^+ but not P^- can be answered in O(|P^+| + |P^-|) time, with an additive error of at most √n.
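The sketch below evaluates the formula n'_match = (r − ℓ + 1) − f − c_{p_v}(u) directly on a toy document array: the two rank_1-queries on B_{p_v} and the correction term obtained from H'_{p_v} are replaced by brute-force counting, only to spell out which quantities enter the formula.

def approx_count(D, forbidden, l, r):
    interval = D[l:r + 1]
    # f: suffixes in A[l, r] from documents in D_{p_v} (the 1-bits of B_{p_v} in [l, r])
    f = sum(1 for d in interval if d in forbidden)
    # c_{p_v}(u): "duplicate" suffixes among the remaining ones (Hui-style correction term)
    seen, c = set(), 0
    for d in interval:
        if d in forbidden:
            continue
        if d in seen:
            c += 1
        seen.add(d)
    return (r - l + 1) - f - c

# Document array of the suffix-array interval of P^+; documents {2} contain the
# string represented by p_v, hence are excluded.
D = [0, 2, 1, 0, 2, 3, 1]
print(approx_count(D, {2}, 0, len(D) - 1))       # -> 3 (the documents 0, 1 and 3)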

4 Conclusions

We have initiated the study of document retrieval in the presence of forbidden patterns. Apart from improving on the space consumption and/or query time of our data structures, we point out the following subjects deserving further investigation: (1) document retrieval with more than two patterns, both positive and negative, (2) lower bounds for forbidden patterns, possibly in the spirit of Appendix A, but also in more realistic machine models such as the word-RAM, and (3) algorithm engineering and comparison with inverted indexes.

References

1. B. Chazelle. Lower bounds for orthogonal range searching, I: The reporting case. J. ACM, 37:200–212, 1990.
2. Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Geometric Burrows-Wheeler transform: Linking range searching and text indexing. In Proc. DCC, pages 252–261. IEEE Press, 2008.
3. H. Cohen and E. Porat. Fast set intersection and two-patterns matching. Theor. Comput. Sci., 411(40–42):3795–3800, 2010.
4. R. Cole, M. Farach-Colton, R. Hariharan, T. M. Przytycka, and M. Thorup. An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J. Comput., 30(5):1385–1404, 2000.
5. P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. J. Comput. Syst. Sci., 66(4):763–774, 2003.
6. J. Fischer and V. Heun. Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465–492, 2011.
7. T. Gagie, S. J. Puglisi, and A. Turpin. Range quantile queries: Another virtue of wavelet trees. In Proc. SPIRE, volume 5721 of LNCS, pages 1–6. Springer, 2009.
8. W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. String retrieval for multi-pattern queries. In Proc. SPIRE, volume 6393 of LNCS, pages 55–66. Springer, 2010.
9. W.-K. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In Proc. FOCS, pages 713–722. IEEE Computer Society, 2009.
10. L. C. K. Hui. Color set size problem with application to string matching. In Proc. CPM, volume 644 of LNCS, pages 230–243. Springer, 1992.
11. M. Karpinski and Y. Nekrich. Top-k color queries for document retrieval. In Proc. SODA, pages 401–411. ACM/SIAM, 2011.
12. Y. Matias, S. Muthukrishnan, S. C. Şahinalp, and J. Ziv. Augmenting suffix trees, with applications. In Proc. ESA, volume 1461 of LNCS, pages 67–78. Springer, 1998.
13. J. I. Munro and V. Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3):762–776, 2001.
14. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA, pages 657–666. ACM/SIAM, 2002.
15. G. Navarro and Y. Nekrich. Top-k document retrieval in optimal time and linear space. In Proc. SODA. ACM/SIAM, to appear 2012.
16. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12–22, 2007.
17. N. Välimäki and V. Mäkinen. Space-efficient algorithms for document retrieval. In Proc. CPM, volume 4580 of LNCS, pages 205–215. Springer, 2007.

Appendix A: A Lower Bound for Two Positive Patterns

Consider the two-pattern matching problem in [3]:

Given: A collection of static documents D = {D_1, ..., D_d} over an alphabet Σ, of total length n = ∑_{i≤d} |D_i|.
Compute: An index that, given two patterns P_1^+ and P_2^+ online, allows us to compute all n_match documents containing both P_1^+ and P_2^+.

We give a new lower bound result for the problem using the technique introduced in [2]. The reduction is from 4-dimensional range queries: given a set S of n points in R^4, each represented with 4h' bits, preprocess S so as to find the points {s | s ∈ S ∩ ([x_ℓ, x_r] × [y_ℓ, y_r] × [z_ℓ, z_r] × [t_ℓ, t_r])}, where x, y, z, and t denote the 4 coordinate ranges. Chazelle [1] showed that on a pointer machine, an index supporting d-dimensional range searching in O(polylog(n) + occ) query time requires Ω(n (log n / log log n)^{d−1}) words of storage.

First, we can assume that the points in S are from a [1, n] × [1, n] × [1, n] × [1, n] grid, since sorting and mapping the n points in R^4 to their ranks in each coordinate takes O(nh') space and O(n log n) time. Later, a range query can be cast to the corresponding one on the ranks in O(log n) time. From the i'th point s = (x_i, y_i, z_i, t_i) ∈ S we create the document

D_i = rev(⟨y_i⟩) #_1 ⟨x_i⟩ #_2 rev(⟨t_i⟩) #_3 ⟨z_i⟩,    (1)

where ⟨c⟩ denotes the h = Θ(log n)-bit binary representation of the integer c, rev(⟨c⟩) its reverse, and the #_i's are again new characters. Consider a balanced binary tree with the values 1, 2, ..., n in its leaves, in this order. Associating 0 with left-branches and 1 with right-branches, the paths from the root to the leaves define prefix codes for the values. An interval [c, d] can be partitioned into O(log n) intervals such that each interval corresponds to a different subtree; denote by P(c, d) the set of O(log n) prefix codes defined by the paths from the root to the roots of these O(log n) subtrees. We cast a given 4-dimensional range query [x_ℓ, x_r] × [y_ℓ, y_r] × [z_ℓ, z_r] × [t_ℓ, t_r] into O(log^4 n) two-pattern queries

P_1^+ = rev(⟨y⟩) #_1 ⟨x⟩  and  P_2^+ = rev(⟨t⟩) #_3 ⟨z⟩    (2)

for all (x, y, z, t) ∈ P(x_ℓ, x_r) × P(y_ℓ, y_r) × P(z_ℓ, z_r) × P(t_ℓ, t_r). One can now see that these O(log^4 n) two-pattern queries on D = {D_1, ..., D_n}, of total length N = Θ(n log n) bits and constructed using Eq. (1), solve the 4-dimensional range reporting query.
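The sketch below makes the reduction concrete for a single point: it builds the document of Eq. (1), computes the canonical prefix-code decompositions P(·, ·), and checks the resulting two-pattern queries of Eq. (2) by plain substring tests, which stand in for the two-pattern index. The multi-character separators "#1", "#2", "#3" stand in for the distinct characters #_i, and coordinates are taken from [0, n − 1] instead of [1, n]; both are simplifications of this sketch.

H = 4                                            # h = Theta(log n) bits, so n = 2^H = 16

def code(c):                                     # the H-bit binary representation <c>
    return format(c, "0{}b".format(H))

def prefix_cover(lo, hi):
    """Canonical decomposition of [lo, hi] into O(log n) prefix codes P(lo, hi)."""
    out = []
    def rec(node_lo, node_hi, prefix):
        if hi < node_lo or node_hi < lo:         # subtree disjoint from [lo, hi]
            return
        if lo <= node_lo and node_hi <= hi:      # subtree fully contained in [lo, hi]
            out.append(prefix)
            return
        mid = (node_lo + node_hi) // 2
        rec(node_lo, mid, prefix + "0")
        rec(mid + 1, node_hi, prefix + "1")
    rec(0, (1 << H) - 1, "")
    return out

def document(x, y, z, t):                        # Eq. (1); rev(.) is string reversal
    return code(y)[::-1] + "#1" + code(x) + "#2" + code(t)[::-1] + "#3" + code(z)

# The point (x, y, z, t) = (3, 5, 9, 2) lies in the box [2,6] x [4,7] x [8,12] x [0,3],
# so at least one of the O(log^4 n) two-pattern queries must report its document.
doc = document(3, 5, 9, 2)
queries = [(cy[::-1] + "#1" + cx, ct[::-1] + "#3" + cz)             # Eq. (2)
           for cx in prefix_cover(2, 6) for cy in prefix_cover(4, 7)
           for cz in prefix_cover(8, 12) for ct in prefix_cover(0, 3)]
print(any(p1 in doc and p2 in doc for p1, p2 in queries))           # -> True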

Theorem 3. On a pointer machine, an index on a document collection of total length n supporting two-pattern matching in O(|P_1^+| + |P_2^+| + n_match + polylog n) time requires Ω(n (log n / log log n)^3) bits in the worst case.

Proof. The reduction showed the connection between an N-bit collection and 4-dimensional range queries on Θ(N / log N) points, so recasting the result with a collection of length n gives Ω((n / log n) · (log(n / log n) / log log(n / log n))^{4−1}) words, i.e., Ω(n (log n / log log n)^3) bits. □
