The Performance of Difference Coding for Sets and Relational Tables

WEI BIAO WU
University of Chicago, Chicago, Illinois

AND

CHINYA V. RAVISHANKAR
University of California—Riverside, Riverside, California

Abstract. We characterize the performance of difference coding for compressing sets and database relations through an analysis of the problem of estimating the number of bits needed for storing the spacings between values in sets of integers. We provide analytical expressions for estimating the effectiveness of difference coding when the elements of the sets or the attribute fields in database tuples are drawn from the uniform and Zipf distributions. We also examine the case where a uniformly distributed domain is combined with a Zipf distribution, and with an arbitrary distribution. We present limit theorems for most cases, and probabilistic convergence results in other cases. We also examine the effects of attribute domain reordering on the compression ratio. Our simulations show excellent agreement with theory.

Categories and Subject Descriptors: E.4 [Coding and Information Theory]: data compaction and compression; G.3 [Probability and Statistics]: distribution functions; H.2.4 [Database Management]: Systems—relational databases

General Terms: Algorithms, Performance, Theory

Additional Key Words and Phrases: Statistical limit theorems, lossless compression, data warehousing, on-line analytical processing

Authors' addresses: W. B. Wu, Department of Statistics, University of Chicago, 5734 South University Avenue, Chicago, IL 60637, e-mail: [email protected]; C. V. Ravishankar, Computer Science & Engineering Department, University of California—Riverside, Riverside, CA 92521, e-mail: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2003 ACM 0004-5411/03/0900-0665 $5.00

Journal of the ACM, Vol. 50, No. 5, September 2003, pp. 665–693.

1. Introduction

Difference coding, a widely used method [Netravali and Haskell 1988; Viterbi and Omura 1979], represents a series of values by the differences between them. In most applications, successive values in the original sequence are correlated, so that their differences are small, yielding compression. The coding is lossless if the differences are preserved exactly, but this technique is sometimes combined with quantization to convert a continuous range of difference values to a discrete range


of output values. Quantization results in lossy compression, which is acceptable in applications like image or voice coding [Netravali and Haskell 1988]. Lossy compression is generally not appropriate for compressing relational databases, but lossless difference coding has been used in this domain [Ng and Ravishankar 1997].

Compression in databases has significant practical advantages. Commercial databases can be very large, and data warehouse sizes can easily reach 10^12 to 10^13 bytes or more. Disk I/O tends to be the bottleneck for query performance, since queries of interest frequently compute statistics over a large number of records from the database. Database compression reduces both storage requirements and the volume of data transferred between disk and main memory. Compression reduces I/O bandwidth requirements at the cost of higher processor loads. This is a desirable trade-off, since processor performance is improving much faster than disk performance.

Difference coding is particularly applicable to sets and databases, since the usual requirement that the ordering between elements be preserved no longer applies. We are at liberty to choose the ordering between the elements of a set that maximizes the compression. The Tuple-Difference Coding (TDC) method [Ng and Ravishankar 1997] for database compression is based on this insight. Each tuple in a database table is first treated as an integer, and the table is sorted by rows. Successive tuples are then differenced, and the differences are used to represent the table. The ith tuple in the table can be reconstructed from the first tuple and the first (i − 1) difference values. In practice, to avoid such laborious reconstruction, the sorted table is partitioned into disk-block-sized chunks and TDC is applied separately to each chunk. This approach has been shown to perform well in practice when used for compressing relational databases.
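The mechanism just described can be illustrated with a short sketch. This is our own illustration (function names are ours, not from Ng and Ravishankar [1997]); it difference-codes a set of integers and compares the bit counts of the raw values and the spacings:

```python
# A minimal sketch (our own illustration, not the authors' code) of lossless
# difference coding applied to a set of integers.

def diff_encode(values):
    """Sort the set, keep the first element, and store successive spacings."""
    s = sorted(values)
    return s[0], [b - a for a, b in zip(s, s[1:])]

def diff_decode(first, gaps):
    """Rebuild the sorted values by accumulating the spacings."""
    out = [first]
    for g in gaps:
        out.append(out[-1] + g)
    return out

values = {3, 17, 18, 100, 101, 103}
first, gaps = diff_encode(values)
assert set(diff_decode(first, gaps)) == values

# The spacings are much smaller than the raw values, so they need fewer bits.
raw_bits = sum(max(1, x.bit_length()) for x in values)
gap_bits = max(1, first.bit_length()) + sum(max(1, g.bit_length()) for g in gaps)
print(raw_bits, gap_bits)  # the spacings typically need far fewer bits
```

Because the spacings between sorted values are small relative to the values themselves, their binary representations are shorter; this is exactly the effect the analysis below quantifies.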
In this article, we evaluate the performance of difference coding applied in the more general context of sets of integers. Our results apply directly to relational databases, since a relational table is a set of tuples, and can be modeled as a sample of integers from a large but finite set. In the remainder of the article, we will therefore treat the terms "database" and "set" as equivalent.

2. Tuple-Difference Coding

The original work describing TDC [Ng and Ravishankar 1997] discusses practical details such as how to handle textual attributes in TDC, and provides experimental results on query times and other performance parameters in practice. It demonstrates that TDC is superior to other database compression methods currently in use, providing both better compression and faster query response times. A number of factors affecting compression are listed and their effects discussed in qualitative terms in Ng and Ravishankar [1997]. Our present purpose is to provide a sounder theoretical basis for the performance characteristics of difference coding applied to sets and relational tables.

2.1. A MODEL FOR PERFORMANCE ANALYSIS. We say R is a relational schema over D1, D2, ..., Dr, with Di = {0, 1, ..., |Di| − 1} being the ith attribute domain,¹

¹ We will sometimes specify the domain as {1, 2, ..., |Di|} instead. However, this change of notation will not alter the semantics of compression, since TDC is based on the spacings between successive samples.


if R = D1 × D2 × · · · × Dr. We call D a database with schema R if D ⊆ R. Each record in the database is an r-tuple ⟨d1, d2, ..., dr⟩, with di ∈ Di.

TDC works as follows: given a database D = {t1, t2, ..., tn} of tuples, a bijective mapping ϕ : R → N_R is constructed, where N_R = {0, 1, ..., |R| − 1}. Next, ϕ is used to map each database record tk to an integer, and the database is sorted on ϕ(tk) as key. Successive tuples are differenced, and the differences ϕ(t_{k+1}) − ϕ(t_k) are stored instead of the original tuples t_k and t_{k+1} themselves. These differences tend to be significantly smaller than the original tuples, thus achieving compression. In practice, it is convenient to use a ϕ that is equivalent to lexicographic sorting. Formally, given a tuple t_k = ⟨d1, d2, ..., dr⟩, with di ∈ Di,

    ϕ(t_k) = Σ_{i=1}^{r} d_i ( Π_{j=i+1}^{r} |D_j| ),    (1)

so that di is simply treated as a digit with base |Di|, and t_k becomes a mixed-radix number. This mapping is invertible, so compression is lossless.

2.1.1. Some Issues. Two points are worthy of note. First, the distribution of values in different attribute domains Di and Dj is typically different, though there may be correlations between the domains. Second, the actual ordering of the Di in the product space defining the schema R is typically immaterial to the database semantics. It is reasonable, therefore, to view a database as a sample from the joint distribution of the domains Di. Since a database table is a set of tuples, it will not contain duplicate tuples, so this sample must be taken without replacement from the joint distribution space.

We proceed to form the database by choosing n samples (X_1, ..., X_n) from the joint distribution space F(D1, D2, ..., Dr). Since this sampling must be performed without replacement, these are not i.i.d. random variables, a fact that causes significant technical difficulties for our analysis in the remainder of the paper.

When no compression is performed, we must allocate enough space in the tuple to accommodate any value that may need to be stored in the database. Thus, if N = |D1| · |D2| · · · |Dr|, each tuple will need to be at least log₂ N bits in length, and the entire database will be n log₂ N bits in size. We can apply TDC as follows: by sorting on ϕ(X_i) (see Eq. (1)), we can construct the sorted database X_(1) < · · · < X_(n). [...]

J_{k+1} = min{i > J_k : Z_i ∉ {Z_{J_1}, ..., Z_{J_k}}}, k ≥ 1. For any n, J_n can be shown to be proper, in the sense that Pr[J_n < ∞] = 1, which follows immediately from Eq. (3) below. Based on {J_n, n ≥ 1}, we can naturally obtain n mutually different random variables Z_{J_1}, ..., Z_{J_n}. We show that

    (X_1, ..., X_n) =_D (Z_{J_1}, ..., Z_{J_n}).    (2)
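Eq. (1) is an ordinary mixed-radix positional encoding. The following sketch (our illustration, with hypothetical function names) shows the mapping and its inverse:

```python
# Sketch of the mapping phi of Eq. (1) and its inverse; names are ours.

def phi(t, domains):
    """Treat tuple t as a mixed-radix number, with digit i in base |D_i|."""
    x = 0
    for d, size in zip(t, domains):
        assert 0 <= d < size
        x = x * size + d  # Horner-style evaluation of Eq. (1)
    return x

def phi_inverse(x, domains):
    """Recover the digits of x, least-significant attribute first."""
    digits = []
    for size in reversed(domains):
        digits.append(x % size)
        x //= size
    return tuple(reversed(digits))

domains = [4, 10, 10]          # |D_1| = 4, |D_2| = |D_3| = 10
t = (3, 1, 4)
assert phi(t, domains) == 314  # 3*100 + 1*10 + 4
assert phi_inverse(314, domains) == t
# phi is a bijection onto {0, ..., |R| - 1}, so sorting on phi(t)
# is equivalent to lexicographic sorting of the tuples.
```

Since ϕ is a bijection, the differences ϕ(t_{k+1}) − ϕ(t_k) determine the table exactly, which is why the coding is lossless.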

Set the conditional probability P̂r[·] = Pr[·|Z_{J_1} = x_1, ..., Z_{J_k} = x_k]. Using the Markov property, we have

    P̂r[Z_{J_{k+1}} = x_{k+1}]
      = Σ_{l=1}^{∞} P̂r[Z_{J_k + l} = x_{k+1}, Z_{J_k + j} ∈ {x_1, ..., x_k} for j = 1, ..., l − 1]
      = Σ_{l=1}^{∞} Pr[Z = x_{k+1}] Pr[Z ∈ {x_1, ..., x_k}]^{l−1}
      = p(x_{k+1}) / (1 − Σ_{i=1}^{k} p(x_i)),

proving Eq. (2).

Let T_k = J_{k+1} − J_k, k ≥ 1. It is easy to see that, given the outcomes (x_1, ..., x_n), the values T_k are geometrically distributed with mean (1 − Σ_{i=1}^{k} p(x_i))^{−1}. On average, we need E[J_n] i.i.d. random variables to obtain n different values. This connection between the two sampling schemes enables us to work with an i.i.d. sampling scheme instead of the more complex scheme of sampling without replacement.

To illustrate the use of this idea, consider Theorem 4.7 below, which analyzes the performance of TDC on databases that may be modeled by sampling a Zipf variable without replacement. The theorem, in fact, gives approximations of EΛ_Z based on a sampling scheme with replacement; that is, the proof of the theorem is developed in terms of n i.i.d. Zipf(N) random variables. Therefore, in applying Theorem 4.7 to obtain a reasonable estimate for EΛ_Z when replacement is not allowed, we would use E[J_n] in place of n. However, it is still extremely difficult to use

    E[J_n] = 1 + Σ_{k=1}^{n−1} E[T_k] = 1 + Σ_{k=1}^{n−1} E[(1 − Σ_{i=1}^{k} p(X_i))^{−1}]    (3)

directly, given the very complicated nature of the joint distribution of (X_1, ..., X_n). We finesse this problem by considering the converse issue:

QUESTION. Given n′ i.i.d. random variables Z_1, ..., Z_{n′}, what is the number of different elements in this sample? Equivalently, what is the cardinality of {Z_1, ..., Z_{n′}}?

In this setup, n′ and |{Z_1, ..., Z_{n′}}| assume the roles of E[J_n] and n, respectively, in the original problem. This converse can be interpreted as a random allocation problem, which is extensively studied in the literature, especially for weak convergence in terms of the Central Limit Theorem and Poisson approximations (cf. Kolchin et al. [1978]). This converse question has also been studied in Csörgő and Wu [2000] for the case where Z has a uniform distribution, using large deviation techniques.

Suppose we have N cells labeled 1, ..., N, and we view the random variables Z_1, ..., Z_{n′} as n′ balls, with ball j being allocated to cell Z_j. Define random variables Y_i = Σ_{j=1}^{n′} 1(Z_j = i), i = 1, ..., N, representing the number of balls in the ith cell, where 1(A) is the indicator function. Hence, the number of occupied cells is |{Z_1, ..., Z_{n′}}| = Σ_{i=1}^{N} 1(Y_i > 0). Now (Y_1, ..., Y_N) follows the multinomial distribution Multi(n′; p(1), ..., p(N)). Let V_i have a Poisson distribution with mean n′ p(i), and suppose that {V_i, 1 ≤ i ≤ N} are independent. Then we have the following Poisson representation of the multinomial distribution:

    (Y_1, ..., Y_N) =_D (V_1, ..., V_N | V_1 + · · · + V_N = n′).

LEMMA 3.1. Let I_i, 1 ≤ i ≤ N, be independent Bernoulli random variables with q_i = Pr[I_i = 1] = 1 − exp(−n′ p(i)). Then Pr[|Σ_{i=1}^{N} (I_i − q_i)| > n′ε] ≤ 2 exp(−n′ε²/3) holds for all 0 < ε ≤ 1/10.

PROOF. Let t > 0. Denote S_N = Σ_{i=1}^{N} (I_i − q_i). Then, by Markov's inequality,

    log Pr[S_N > n′ε] ≤ log[exp(−n′εt) E(e^{tS_N})]
      = −n′εt + Σ_{i=1}^{N} log[exp(−tq_i)(1 − q_i) + exp(t(1 − q_i)) q_i].

Elementary manipulations show that log[exp(−tq)(1 − q) + exp(t(1 − q)) q] ≤ t²(1.1q − q²)/2 holds for all 0 < t < 1/10 and 0 < q < 1. Let the sets I = {i : 1 ≤ i ≤ N, n′p(i) ≥ 1} and J = {1, ..., N} − I. Then |I| ≤ n′, since Σ_{i=1}^{N} p(i) = 1. If i ∈ J, then n′p(i) < 1 and hence q_i ≤ n′p(i), since 1 − exp(−t) ≤ t for 0 ≤ t ≤ 1. Clearly, for all q, 1.1q − q² ≤ 0.55². So

    Σ_{i=1}^{N} (1.1q_i − q_i²) = Σ_{i∈I} + Σ_{i∈J} ≤ Σ_{i∈I} 0.55² + Σ_{i∈J} 1.1 n′p(i)
      ≤ 0.55² |I| + 1.1 n′ ≤ 1.4025 n′.

Hence, log Pr[S_N > n′ε] ≤ −n′εt + t² · 1.4025 n′/2 ≤ −n′ε²/3, by letting t = ε/1.4025. The other case, Pr[S_N < −n′ε], can be handled similarly.

Remark 3.2. The constants in the upper bound in Lemma 3.1 are not the best possible, but the bound suffices for our application. The classical Hoeffding inequality [Hoeffding 1963] can yield a bound of order exp[−2(n′ε)²/N], which is too rough in our context, since N is typically much larger than n′.

Let M = Σ_{i=1}^{N} Pr[V_i > 0] = Σ_{i=1}^{N} [1 − exp(−n′p(i))]. Then, applying Lemma 3.1 to I_i = 1(Y_i > 0), we have

    Pr[|Σ_{i=1}^{N} 1(Y_i > 0) − M| ≥ n′ε]
      = Pr[|Σ_{i=1}^{N} 1(V_i > 0) − M| ≥ n′ε | V_1 + · · · + V_N = n′]
      ≤ Pr[|Σ_{i=1}^{N} 1(V_i > 0) − M| ≥ n′ε] / Pr[V_1 + · · · + V_N = n′]
      ≤ 2 exp(−n′ε²/3) · (n′! e^{n′} / n′^{n′}) = O(1) √n′ exp(−n′ε²/3),

which vanishes to 0 at a geometric rate. Hence

    (1/n′) [Σ_{i=1}^{N} 1(Y_i > 0) − M] →_P 0.

Therefore, it is reasonable to take M as the expected number of occupied cells, and the expected number of i.i.d. copies needed, n′, can be approximated via the equation

    n = M = Σ_{i=1}^{N} [1 − exp(−n′ p(i))].    (4)
This scheme causes a complication if the definition of Λ given in Section 2.1.1 is used directly. Let Z_(1) ≤ · · · ≤ Z_(n′) be the order statistics of Z_1, ..., Z_{n′}. Since we cannot guarantee that the Z_i are mutually different, some of these spacings may be zero. The definition of Λ in Section 2.1.1 is unusable, since it involves the logarithms of these spacings. Instead, we should use one of the forms Λ_Z = Σ_{k=1}^{n′−1} ln max(Z_(k+1) − Z_(k), 1) or Λ_Z = Σ_{k=1}^{n′−1} ln(Z_(k+1) − Z_(k) + 1). In this article, we suggest the second form, which appears conservative, and is mathematically convenient. In addition, it captures an aspect of the real world: whenever Z_(k+1) − Z_(k) = 1 in practice, we need one bit to store the difference. However, the corresponding term in the first form goes to zero and makes no contribution to the sum. Numerical simulation indicates that the difference between the two forms is negligible.

4. Limit Theorems for the Single-Field Case

We begin our analysis of the TDC technique with the simplest case. We assume that the database consists of n tuples, each tuple comprising a single attribute field A. This is a reasonable starting point for two reasons. First, in some cases, we are able to reduce the general case of r attribute domains to the case of a single attribute domain. Second, we use the single-attribute results to construct an analysis for the multiple-domain case.

4.1. SINGLE ATTRIBUTE, UNIFORM DISTRIBUTION. We first consider the case when the attribute values are drawn uniformly from a single attribute domain of size N. The uniform distribution is interesting for several reasons. First, many attribute domains that appear in practice are uniform. Second, the uniform distribution is known to yield the largest value for the sum of sample spacings among all distributions defined over a given range [Shao and Marjorie 1995]. In this sense, the behavior for the uniform distribution forms a lower bound for the compression efficiency of TDC. Also, the uniform distribution is a "least informative" distribution over a given range, and is useful as a model when little is known about a distribution. Finally, as we show in Section 5, a set of k uniformly distributed attribute domains can be modeled as a single uniformly distributed domain.

We proceed to form the database by choosing n integers (X_1, ..., X_n) from {0, 1, ..., N − 1}, so that each such n-tuple representing the database has the same probability 1/(N choose n) of being selected. By sorting this database, we can construct the order statistics X_(1) < · · · < X_(n), and the corresponding set of spacings {X_(k+1) − X_(k)}, k = 1, ..., n − 1. As in Section 2.1.1, we form the statistic Λ_U = Σ_{k=1}^{n−1} ln(X_(k+1) − X_(k) + 1) to estimate the size of the compressed database.

4.1.1. Prior Work. We need to work with a discrete uniform distribution, and so we call the problem of estimating Λ_U the discrete spacing problem. To the best of our knowledge, prior work in this area has dealt exclusively with continuous distributions. See, for example, Darling [1953], Blumenthal [1968], Pyke [1965], and Shao and Marjorie [1995]. Pyke [1965] reviews the literature in this area. Darling [1953] uses characteristic function techniques to obtain the following limit theorem for the continuous spacings of independent random variables uniform on (0, 1).

THEOREM 4.1. Let U_(1) < U_(2) < · · · < U_(n) be the order statistics of i.i.d. uniform(0,1) random variables U_1, ..., U_n. Then, if γ = 0.5772 · · · is Euler's constant,

    [Σ_{i=1}^{n−1} ln(U_(i+1) − U_(i)) + (n + 1)(ln n + γ)] / √(n(π²/6 − 1)) →_D N(0, 1).

However, we cannot directly extend these results to the discrete case. In particular, although sampling with and without replacement are equivalent in the continuous case, they are not so for discrete distributions. Sampling with replacement causes a singularity in the logarithmic term, since X_(k+1) − X_(k) = 0 with nonzero probability.
This difficulty can be overcome by changing ln(X_(k+1) − X_(k)) to ln(X_(k+1) − X_(k) + 1). We develop Theorems 4.3 and 4.4 for sampling without and with replacement, respectively. At first sight, it seems feasible to apply Darling's Theorem to our situation by simply substituting the discrete random variables X_i = ⌊NU_i⌋, 1 ≤ i ≤ n, for the continuous random variables NU_i, 1 ≤ i ≤ n. However, this straightforward substitution is problematic: the errors will be large if we replace the spacing term ln(X_(k+1) − X_(k) + 1) in Theorems 4.3 and 4.4 by the continuous version ln(NU_(k+1) − NU_(k) + 1), unless the spacings NU_(k+1) − NU_(k), 1 ≤ k ≤ n − 1, are stochastically large. The reason for this difficulty is obvious: the error ln(t + dt) − ln t ≈ dt/t is small only for large t. We show that this difficulty may be circumvented when N is large enough, and specifically, when n² = o(N).

A significant aspect of our approach to the proof of Theorem 4.3 is that we estimate the possible errors caused by the continuous approximation we use, and show them to be negligible. We then proceed to obtain the limiting distribution. Although Darling's Theorem is not helpful in the proof of Theorem 4.3, it does provide an incidental benefit. The asymptotic variance we obtain for our limit theorems is hard to estimate analytically, but we can infer it by comparison with Theorem 4.1.
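The sensitivity to the size of the spacings can be seen numerically. The sketch below (our illustration) compares the discrete and continuous versions of the log-spacing sum for a small and a large domain size N:

```python
import math, random

# Numeric illustration (ours) of the discretization error discussed above:
# replacing ln(floor-spacing + 1) by ln(continuous spacing + 1) is accurate
# only when the spacings N*(U_(k+1) - U_(k)) are stochastically large.

random.seed(1)
n = 1000

def relative_error(N):
    u = sorted(random.random() for _ in range(n))
    cont = sum(math.log(N * (b - a) + 1) for a, b in zip(u, u[1:]))
    disc = sum(math.log(math.floor(N * b) - math.floor(N * a) + 1)
               for a, b in zip(u, u[1:]))
    return abs(cont - disc) / abs(cont)

# A larger N (relative to n^2) makes the continuous approximation better.
print(relative_error(10**4), relative_error(10**10))
```

With N = 10^4 the typical spacing is only about N/n = 10, and the rounding error in each logarithm is noticeable; with N = 10^10 the spacings are huge and the two sums agree to many digits, matching the condition n² = o(N) used below.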


4.1.2. A Central-Limit Theorem for Discrete Uniform Spacings. Since Λ_U is the sum of a large number of random variables, we would expect the statistic to be distributed normally. However, in order to characterize the performance of TDC, we are especially interested in the mean and variance of this distribution. In Theorem 4.3 below, we prove a version of the Central Limit Theorem for this case, and show that the expected value of Λ_U is approximately n[−γ + ln(N/n)], where γ = 0.57721... is Euler's constant. In contrast, the number of bits needed to store X_1, ..., X_n without compression is n ln N.

In proving Theorem 4.3, we first approximate the sample (X_1, ..., X_n) by the i.i.d. random variables X′_1, ..., X′_n distributed as ⌊NU⌋, where U is uniformly distributed over (0, 1). Obviously, because we are sampling without replacement, (X_1, ..., X_n) are not independent; but if n = o(N^{1/2}), then we expect them to be asymptotically independent, since the probability of X_i = X_j for some 1 ≤ i < j ≤ n is very small. In the proof, we deal with the order statistics X′_(1) ≤ · · · ≤ X′_(n), or equivalently, ⌊NU_(1)⌋ ≤ ⌊NU_(2)⌋ ≤ · · · ≤ ⌊NU_(n)⌋, using the representation

    (U_(1), ..., U_(n)) =_D (S_1/S_{n+1}, ..., S_n/S_{n+1}),    (5)

where Y_1, Y_2, ... are i.i.d. exp(1) random variables, and S_j = Σ_{i=1}^{j} Y_i. We use this representation throughout the article. Therefore, Λ_U can be approximated by Σ_{k=1}^{n−1} ln(NY_{k+1}/S_{n+1}), which can be analyzed using the Strong Law of Large Numbers. In the process, however, we encounter sets with small probabilities, with which we must deal with care. We first prove the following lemma.

LEMMA 4.2. Let {Y, Y_i, i ≥ 1} be i.i.d. exp(1) random variables. Then

    (1/(n ln n)) Σ_{i=1}^{n} 1/Y_i →_P 1.

PROOF. Clearly, n Pr[1/Y > n ln n] = n[1 − exp(−1/(n ln n))] → 0 as n → ∞. Then

    (1/(n ln n)) (Σ_{i=1}^{n} 1/Y_i − n E[Y^{−1} 1(Y^{−1} ≤ n ln n)]) →_P 0

by Klass and Teicher [1977]. Thus, the lemma follows, since

    lim_{ε→0} E[Y^{−1} 1(Y ≥ ε)] / (−ln ε) = lim_{ε→0} [∫_{ε}^{∞} y^{−1} exp(−y) dy] / (−ln ε) = lim_{ε→0} (−ε^{−1} exp(−ε)) / (−ε^{−1}) = 1

by L'Hospital's rule.

We now present a theorem dealing with the performance of TDC when the values in the database are uniformly distributed, and when the database size n and the attribute domain size N_n obey n² = o(N_n).¹ In this case, we are able to obtain accurate results without recourse to the equivalent sampling scheme described in Section 3. In Section 4.1.3, we show how to extend these results to the case when n² = o(N_n) fails to hold.

¹ It is usual to write f(n) = o(g(n)) when f and g are functions of an independent variable n such that lim_{n→∞} f(n)/g(n) = 0. So we will frequently write n = o(N_n) to emphasize our view of N as a function of the independent variable n, with lim_{n→∞} n/N_n = 0.

THEOREM 4.3. Let (X_1, ..., X_n) be n numbers sampled from the set {0, 1, ..., N_n − 1} equiprobably, and without replacement. Let X_(1) < · · · < X_(n) be the order statistics of (X_1, ..., X_n), and define the random variable Λ_U = Σ_{k=1}^{n−1} ln(X_(k+1) − X_(k) + 1). Then, if n² = o(N_n),

    (Λ_U − μ_U) / (σ_U/√n) →_D N(0, 1),

where μ_U = (n − 1)[ln N_n − ln(n + 1) − γ], σ_U = α √(n(n − 1)), and γ, α are defined in terms of a standard exponential random variable Y as γ = −E(ln Y) = 0.57721..., or Euler's constant, and α² = Var(ln Y − Y) = π²/6 − 1 = 0.644934....

PROOF. We write N = N_n, σ = σ_U, μ = μ_U for simplicity. Let X′_1, ..., X′_n be i.i.d. random variables with common distribution Pr[X′_1 = k] = 1/N for k = 0, 1, ..., N − 1. First, we claim the following distributional equality, which transforms the dependent random variables (X_1, ..., X_n) to the i.i.d. random variables (X′_1, ..., X′_n):

    (X_1, ..., X_n) =_D (X′_1, ..., X′_n | X′_1, ..., X′_n are different).    (6)
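The centering μ_U just defined can be spot-checked by simulation before continuing with the proof. The following sketch is ours; the parameters are chosen so that n² is small relative to N:

```python
import math, random

# Monte Carlo check (ours) of Theorem 4.3's centering: the standardized
# statistic (Lambda_U - mu_U) / (sigma_U / sqrt(n)) should behave like N(0,1).

random.seed(7)
gamma = 0.5772156649015329
N, n = 2**31 - 1, 10000            # n^2 / N is about 0.05, i.e. small

x = sorted(random.sample(range(N), n))       # sampling without replacement
lam = sum(math.log(b - a + 1) for a, b in zip(x, x[1:]))

mu = (n - 1) * (math.log(N) - math.log(n + 1) - gamma)
sigma = math.sqrt((math.pi**2 / 6 - 1) * n * (n - 1))

z = (lam - mu) / (sigma / math.sqrt(n))
print(z)   # a typical N(0,1) value: a few units in magnitude at most
```

One run gives only a single draw from the limiting distribution; repeating the experiment and checking that the empirical mean and variance of z are near 0 and 1 reproduces the agreement reported in Figure 1.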

For x_1, ..., x_n ∈ {0, 1, ..., N − 1}, if x_i = x_j for some i ≠ j, then Pr[X_1 = x_1, ..., X_n = x_n] = 0 = Pr[X′_1 = x_1, ..., X′_n = x_n | X′_1, ..., X′_n are different]. If x_1, x_2, ..., x_n are mutually different, then

    Pr[X′_1 = x_1, ..., X′_n = x_n | X′_1, ..., X′_n are different]
      = Pr[X′_1 = x_1, ..., X′_n = x_n] / Pr[X′_1, ..., X′_n are different]
      = (1/N)^n (Π_{j=1}^{n−1} (1 − j/N))^{−1}
      = Pr[X_1 = x_1, ..., X_n = x_n].

Hence, for fixed λ ∈ R,

    Pr[(Λ_U − μ) / ((n − 1)^{1/2} α) < λ]
      = Pr[Σ_{k=1}^{n−1} ln(X′_(k+1) − X′_(k) + 1) − μ < λ(n − 1)^{1/2} α | X′_1, ..., X′_n are different]
      = Pr[B_n | A_n] (say),

where X′_(1) ≤ · · · ≤ X′_(n) are the order statistics of (X′_1, ..., X′_n), and ln⁺ x = ln(max(1, x)). Observing that Pr[A_n] = Π_{j=1}^{n−1} (1 − j/N) = 1 + O(n²/N) = 1 + o(1), and that

    Pr[B_n]/Pr[A_n] ≥ Pr[A_n B_n]/Pr[A_n] = Pr[B_n | A_n] ≥ Pr[B_n]/Pr[A_n] + 1 − 1/Pr[A_n],

we find that Pr[B_n | A_n] is close to Pr[B_n]. Hence, it suffices to show lim_{n→∞} Pr[B_n] = Φ(λ) = (1/√(2π)) ∫_{−∞}^{λ} e^{−t²/2} dt. Since X′_k =_D ⌊NU_k⌋, where U_1, ..., U_n are i.i.d. random variables uniformly distributed over (0, 1), we know

    (X′_1, ..., X′_n) =_D (⌊NU_1⌋, ..., ⌊NU_n⌋),

which yields

    (X′_(1), ..., X′_(n)) =_D (⌊NU_(1)⌋, ..., ⌊NU_(n)⌋).    (7)

Let Y, Y_1, ... be i.i.d. exp(1) random variables, and let S_m = Σ_{i=1}^{m} Y_i. We now make use of Eq. (5), and let the event

    B̂_n = {Σ_{k=1}^{n−1} ln(⌊NS_{k+1}/S_{n+1}⌋ − ⌊NS_k/S_{n+1}⌋ + 1) − μ < λ(n − 1)^{1/2} α}.

From Eq. (7), Pr[B_n] = Pr[B̂_n]. Now we can estimate Pr[B_n] by approximating the integer parts by the values themselves. Roughly speaking, the summand ⌊NS_{k+1}/S_{n+1}⌋ − ⌊NS_k/S_{n+1}⌋ in the logarithmic terms will be close to NY_{k+1}/n, which is stochastically large, since we have N/n → ∞ and S_{n+1}/n → EY = 1 by the usual Strong Law of Large Numbers. To be more precise, we introduce the events C_n, D_n as follows:

    C_n = {NY_2/S_{n+1} > 2, ..., NY_n/S_{n+1} > 2},   D_n = {|(S_{n+1} − (n + 1)) / √(n ln n)| > 1}.

Event D_n^c, the complement of event D_n, leads us to the approximation S_{n+1} ≈ n. If the event C_n D_n^c occurs, then we can show by a straightforward approach that ln(⌊NS_{k+1}/S_{n+1}⌋ − ⌊NS_k/S_{n+1}⌋ + 1) can be approximated by ln(NY_{k+1}/S_{n+1}). Hence, we really need to show that Pr[C_n D_n^c] = 1 + o(1), or, that Pr[C_n^c] + Pr[D_n] = o(1). To prove this, first Pr[D_n] ≤ (n ln n)^{−1} E[S_{n+1} − (n + 1)]² = o(1). Next, for large n, we have

    Pr[C_n] ≥ Pr[C_n D_n^c] ≥ Pr[NY_2 > 3n, ..., NY_n > 3n, D_n^c] ≥ (exp(−3n/N))^{n−1} − Pr[D_n] = 1 + o(1).

Let {x} denote the fractional part of x (i.e., {x} = x − ⌊x⌋), and let ε_{nk} = {NS_k/S_{n+1}} − {NS_{k+1}/S_{n+1}} + 1 ∈ (0, 2). If ω ∈ C_n D_n^c, and n is sufficiently large, we can obtain the following estimates: |S_{n+1}/(n + 1) − 1| < (ln n/n)^{1/2}, and |S_{n+1}/(n + 1) − 1 − ln(S_{n+1}/(n + 1))| ≤ (S_{n+1}/(n + 1) − 1)² < ln n/n. From these estimates, we now have

    |Σ_{k=1}^{n−1} ln(⌊NS_{k+1}/S_{n+1}⌋ − ⌊NS_k/S_{n+1}⌋ + 1) − μ − (Σ_{k=1}^{n−1} (ln Y_{k+1} + γ) − S_{n+1} + n + 1)|
      = |Σ_{k=1}^{n−1} ln(1 + ε_{nk} S_{n+1}/(NY_{k+1})) + (n − 1)(S_{n+1}/(n + 1) − 1 − ln(S_{n+1}/(n + 1))) + 2(S_{n+1}/(n + 1) − 1)|
      ≤ Σ_{k=1}^{n−1} 2S_{n+1}/(NY_{k+1}) + 2(ln n/n)^{1/2} + ln n
      ≤ (3n/N) Σ_{k=1}^{n−1} 1/Y_{k+1} + 2 ln n.

By Lemma 4.2, noticing that Pr[C_n D_n^c] = 1 + o(1) and n² = o(N_n), we obtain

    lim_{n→∞} n^{−1/2} |Σ_{k=1}^{n−1} ln(⌊NS_{k+1}/S_{n+1}⌋ − ⌊NS_k/S_{n+1}⌋ + 1) − μ − (Σ_{k=1}^{n−1} (ln Y_{k+1} + γ) − S_{n+1} + n + 1)| = 0 in probability,

which leads to Theorem 4.3 via Slutsky's Theorem and the classical central limit theorem [(n − 1)^{1/2} α]^{−1} Σ_{k=1}^{n−1} (ln Y_{k+1} + γ − Y_{k+1} + 1) →_D N(0, 1). The exact value of the asymptotic variance α² is presented in Corollary 4.6 below.

FIG. 1. Uniform distribution: Agreement between Theorem 4.3 and experiment.

Figure 1 compares the estimates of Λ_U from Theorem 4.3 with the results of experiments on databases of different sizes containing integers drawn uniformly without replacement from {1, 2, ..., 2³¹ − 1}. Theory and experiment agree to within a fraction of one percent, even for databases as large as 2 · 10⁶ (showing the robustness of the theorem, since √N ≈ 46,000 in this case).

The major idea in the proof of Theorem 4.3 was first to show the asymptotic equivalence of the sample (X_1, ..., X_n) taken without replacement and n i.i.d. uniform(N_n) random variables under the constraint n² = o(N_n), and hence to reduce to the classical central limit theorem for the i.i.d. case. After some minor modifications, the second part of the proof also implies the following theorem for n i.i.d. uniform(N_n) random variables.

THEOREM 4.4. Assume that n² = o(N_n). Let X_(1) ≤ X_(2) ≤ · · · ≤ X_(n) be the order statistics of i.i.d. uniform(N_n) random variables X_1, ..., X_n. Define Λ_U = Σ_{k=1}^{n−1} ln(X_(k+1) − X_(k) + 1). Then,

    (Λ_U − μ_U) / (σ_U/√n) →_D N(0, 1),

where µU , σU are the same as that in Theorem 4.3. Remark 4.5. Theorems 4.3 and 4.4 can be used to construct confidence intervals based on the limiting distributions. Obtaining the exact form of the variance term is an interesting exercise. It is somewhat challenging to obtain α 2 = Var(ln Y − Y ) directly, but we note that Theorems 4.1 and 4.4 jointly lead to the following interesting observation. COROLLARY 4.6. If Y is exp(1) distributed, then α 2 = Var(ln Y −Y ) = π 2 /6− 1 = 0.644934 . . . . 4.1.3. The Case of Large Databases. When n 2 = o(Nn ) is not satisfied, we must fall back on the equivalent sampling scheme described in Section 3. Since for the uniform distribution, p(x) = N −1 for all x ∈ F, we have E[Jn ] = 1 + Pn−1 −1 by Eq. 3. Under the assumption n < N /2, we claim that k=1 (1 − k/n) ¯ µ ¶¯ ¯ ¯ ¯ E[Jn ] − N ln 1 − n ¯ < n + 2 (8) ¯ N ¯ N Define function g(t) = (1 − t/N )−1 . For integer k ∈ [1, n − 1], if t ∈ [k − 1/2, k + 1/2], the Taylor expansion yields (t − k)2 00 g (ξ ) 2 for some ξ ∈ [k − 1/2, k + 1/2]. If t ∈ (0, N /2), then ¯ µ ¶−3 ¯ ¯ ¯ 2 t 00 ¯ ≤ 16 . |g (t)| = ¯¯ 2 1 − ¯ N2 N N g(t) = g(k) + (t − k)g 0 (k) +

Therefore

¯Z ¯ ¯ ¯ n−1 n−1 ¯ Z k+1/2 ¯ n−1/2 ¯ X ¯ X ¯ ¯ ¯ ¯ g(t) dt − g(k)¯ ≤ [g(t) − g(k)] dt¯ ¯ ¯ ¯ 1/2 ¯ k=1 ¯ k−1/2 ¯ k=1 Z n−1 X 1 2n 16 k+1/2 (t − k)2 < dt ≤ . ≤ 2 2 N k−1/2 2 3N 3N k=1

Next, ¯ Ã ¯ ¯ ¯ E[Jn ] − N ln 1 − ¯ ¯ Ã ¯1 ¯ < ¯ + N ln 1 − ¯2
1 ln Nn / ln n = C < ∞, then we have 1 1U P − (1 − ρn )2 −→ 0, n ln Nn 2 and 1 E1U − (1 − ρn )2 −→ 0, n ln Nn 2 as n → ∞, where ρn = ln n/ ln N . PROOF. We write N = Nn for simplicity. The Zipf distribution is very skewed towards the high-probability elements, so for any integer k0 ∈ N, the first k0 values in the order statistics X (1) < X (2) < · · · < X (k0 ) are very likely to be 1, . . . , k0 . We take k0 = bnρn c here. This observation suggests that P2,0 −1 ln(X (k+1) − X (k) + 1) should be stochastically small. In terms of our 3U0 = kk=1

680

W. B. WU AND C. V. RAVISHANKAR

P 0 −1 approximation, for the corresponding sum 1U0 = kk=1 ln(N U(k+1) − N U(k) + 1), we P 0 will prove 1U /(n ln N ) −→ 0. Since the logarithm function is concave, we may apply Jensen’s inequality to get à ! µ ¶ kX 0 −1 ¢ ¡ 1 1 1 0 U(k+1) U(k) U(k0 ) −N + 1 ≤ ln +1 . N 1 ≤ ln N k0 − 1 U k0 − 1 k=1 k0 − 1 Now for any ε > 0, µ ¶ ¸ · 1 k0 − 1 U(k0 ) ln N +1 >ε Pr n ln N k0 − 1 ¸ · ln(k0 − 1) + ln(N ε/ρn − 1) ≤ Pr U(k0 ) > ln N ¸ · n + 1 ln(k0 − 1) + ln(N ε/ρn − 1) Sk0 /k0 > . = Pr Sn+1 /(n + 1) k0 ln N Sk0 /k0 Sn+1 /(n+1)

P

−→ 1 by the Weak Law of Large Numbers and µ ¶ n + 1 ln(k0 − 1) + ln(N ε/ρn − 1) ε ε lim inf ≥ lim inf 1 + 2 ≥ 1 + 2 , n→∞ n→∞ k0 ln N ρn C

Since

P

we have limn→∞ 1U0 /(n ln N ) = 0. Next define 1U00 =

n−1 X ¡ ¢ ln N U(k+1) − N U(k) + 1 k=k0

$$\Delta_U'' \stackrel{D}{=} \frac{\ln N}{S_{n+1}}\sum_{k=k_0}^{n-1} S_{k+1} + \sum_{k=k_0}^{n-1}\ln\left(1 - N^{-Y_{k+1}/S_{n+1}} + N^{-S_{k+1}/S_{n+1}}\right) = I_n + J_n \quad\text{(say)}.$$

Elementary calculations show that $E\big[\sum_{k=k_0}^{n-1}(S_{k+1} - k - 1)\big]^2 \le n^3$, since for an $\exp(1)$ random variable $Y$ we know $E(Y-1) = 0$ and $E(Y-1)^2 = 1$. Hence,
$$\frac{I_n}{n\ln N} - \frac{1}{n^2}\sum_{k=k_0}^{n-1}(k+1) = \frac{\sum_{k=k_0}^{n-1}(S_{k+1} - k - 1)}{nS_{n+1}} + \left(\frac{n}{S_{n+1}} - 1\right)\frac{1}{n^2}\sum_{k=k_0}^{n-1}(k+1) \xrightarrow{P} 0,$$
or $(n\ln N)^{-1}I_n - (1 - \rho_n^2)/2 \xrightarrow{P} 0$. Now we consider $J_n$. Given any $\varepsilon > 0$, since $0 \ge \ln(1 - N^{-Y_{k+1}/S_{n+1}} + N^{-S_{k+1}/S_{n+1}}) \ge \ln(1 - N^{-Y_{k+1}/S_{n+1}})$,
$$\Pr\left[\frac{|J_n|}{n\ln N} > \varepsilon\right] \le \Pr\left[\frac{J_n}{n\ln N} < -\varepsilon,\ \frac{\ln N}{S_{n+1}} < 1\right] + \Pr\left[\frac{\ln N}{S_{n+1}} \ge 1\right] \le \Pr\left[\frac{1}{n\ln N}\sum_{k=k_0}^{n-1}\ln(1-\exp(-Y_{k+1})) < -\varepsilon\right] + \Pr\left[\frac{\ln N}{S_{n+1}} \ge 1\right].$$


FIG. 2. Zipf distribution: Agreement between Theorem 4.7 and experiment.

Observe that if $Y_{k+1}$ is $\exp(1)$, then $1 - \exp(-Y_{k+1})$ is uniform(0,1) and $E\ln(1 - \exp(-Y_{k+1})) = -1$; hence, by Markov's inequality, the first term is at most $-(\varepsilon\ln N)^{-1}E\ln(1 - \exp(-Y_{k+1})) = (\varepsilon\ln N)^{-1}$. The second term goes to 0 by the Weak Law of Large Numbers, which completes the proof of the first statement of Theorem 4.7. By Jensen's inequality,
$$0 < \frac{\Delta_U}{n\ln N_n} \le \frac{1}{\ln N_n}\ln\left[\frac{1}{n-1}\sum_{k=1}^{n-1}\left(N^{U_{(k+1)}} - N^{U_{(k)}} + 1\right)\right] < \frac{\ln(N_n + n)}{\ln N_n} < 2,$$
hence the random variables $\{\Delta_U/(n\ln N_n) - (1-\rho_n)^2/2,\ n \ge 2\}$ are uniformly integrable. The second convergence result stated in the theorem then follows easily from the first one and the Mean Convergence Criterion [Chow and Teicher 1988].

Remark 4.8. Under the conditions of Theorem 4.7, a more careful analysis leads to the stronger result
$$\frac{\Delta_U - \mu_n'}{n} \xrightarrow{P} 0 \quad\text{as } n \to \infty,$$

where $\mu_n' = \frac{1}{2}(1 - \rho_n^2)\,n\ln N + n(1-\rho_n)(\ln\ln N - \gamma - \ln n)$, and $\gamma = 0.5772\ldots$ is Euler's constant. Since the details of the proof are complicated, we omit the proof and provide only an outline here. First, to obtain $\Delta_U'/n \xrightarrow{P} 0$, we use the Law of the Iterated Logarithm [Chow and Teicher 1988], $\limsup_{n\to\infty}|(S_n - n)/\sqrt{n\ln\ln n}| = \sqrt{2}$, a much finer estimate than we can obtain from the WLLN. For $I_n$, the estimate used in the proof of Theorem 4.7 yields $n^{-1}I_n - \frac{1}{2}(1-\rho_n^2)\ln N \xrightarrow{P} 0$. Since $\ln N^{-Y_{k+1}/S_{n+1}} \xrightarrow{P} 0$, we can use the Taylor expansion $1 - N^{-Y_{k+1}/S_{n+1}} \approx (\ln N)Y_{k+1}/S_{n+1}$. Hence $J_n$ can be further approximated by $(n-k_0)(\ln\ln N - \gamma - \ln n)$, by the usual SLLN $(n-k_0)^{-1}\sum_{k=k_0}^{n}\ln Y_{k+1} \to E\ln Y = -\gamma$ and $S_{n+1}/n \to 1$ a.s. Together, these facts imply the refined limit theorem.

Figure 2 evaluates how well Remark 4.8 matches the results of experiments on databases of different sizes containing integers drawn without replacement from a Zipf distribution over $\{1, 2, \ldots, 2^{31} - 1\}$. To validate both the analysis and the approximations driving it, we used the actual value of $\Lambda_Z$ obtained from experiments in place of $\Delta_U$. Figure 2 shows the percentage difference between $\Lambda_Z/(n\ln N)$ and $\mu_n'/(n\ln N)$. Agreement is to within a few percent even for databases that are quite large, suggesting that our formula is an excellent predictor of experimental results.
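Remark 4.8 is straightforward to exercise in simulation. The sketch below is our own illustration (the parameters $n = 2000$, $N = 2^{31}$ are arbitrary): it draws the uniform surrogate, computes $\Delta_U = \sum_{k}\ln(N^{U_{(k+1)}} - N^{U_{(k)}} + 1)$, and compares it with $\mu_n'$.

```python
import math
import random

GAMMA = 0.5772156649015329  # Euler's constant

def mu_prime(n, N):
    """mu'_n from Remark 4.8."""
    rho = math.log(n) / math.log(N)
    return (0.5 * (1 - rho ** 2) * n * math.log(N)
            + n * (1 - rho) * (math.log(math.log(N)) - GAMMA - math.log(n)))

rng = random.Random(7)
n, N = 2000, 2 ** 31
u = sorted(rng.random() for _ in range(n))
delta_U = sum(math.log(N ** u[k + 1] - N ** u[k] + 1) for k in range(n - 1))

# Agreement should be within a few percent, as reported for Figure 2.
assert abs(delta_U / mu_prime(n, N) - 1) < 0.1
```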


We appear to have satisfactorily addressed the problem of estimating $\Lambda_Z$ for relatively large databases. However, the situation for small databases is somewhat different, since the small number of samples means that the spacings between them are likely to be larger. We now address the case where the database is small, and present the following result.

THEOREM 4.9. Let $U_{(1)} < U_{(2)} < \cdots < U_{(n)}$ be the order statistics of i.i.d. uniform(0,1) random variables $U_1, \ldots, U_n$. If $\lim_{n\to\infty} n/\ln N_n = 0$, then we have
$$\frac{\Delta_U}{n\ln N_n} - \frac{1}{2} \xrightarrow{P} 0$$
and
$$\lim_{n\to\infty}\left(\frac{E\Delta_U}{n\ln N_n} - \frac{1}{2}\right) = 0.$$

PROOF. As in the proof of Theorem 4.7, we write
$$\Delta_U = \sum_{k=1}^{n-1}\ln\left(N^{U_{(k+1)}} - N^{U_{(k)}} + 1\right) \stackrel{D}{=} \frac{\ln N}{S_{n+1}}\sum_{k=1}^{n-1}S_{k+1} + \sum_{k=1}^{n-1}\ln\left(1 - N^{-Y_{k+1}/S_{n+1}} + N^{-S_{k+1}/S_{n+1}}\right) = I_n + J_n \quad\text{(say)}.$$

Using the same argument as in Theorem 4.7, we have $(n\ln N)^{-1}I_n - 1/2 \xrightarrow{P} 0$. For any $\varepsilon > 0$, let $n > n_0$ be large enough that $(\ln N)^{-1}(n+1) < 1/2$; then
$$\Pr\left[\frac{|J_n|}{n\ln N} > \varepsilon\right] \le \Pr\left[\frac{1}{n\ln N}\sum_{k=1}^{n-1}\ln\left(1 - N^{-Y_{k+1}/S_{n+1}}\right) < -\varepsilon,\ \frac{S_{n+1}}{n+1} \le 2\right] + \Pr\left[\frac{S_{n+1}}{n+1} \ge 2\right] \le \Pr\left[\frac{1}{n\ln N}\sum_{k=1}^{n-1}\ln(1 - \exp(-Y_{k+1})) < -\varepsilon\right] + \Pr\left[\frac{S_{n+1}}{n+1} \ge 2\right].$$

Again by the same arguments as in Theorem 4.7, we know $(n\ln N)^{-1}J_n \xrightarrow{P} 0$. The second convergence result stated in the theorem then follows from the first via uniform integrability, which is an immediate consequence of the uniform boundedness of the random sequence $\{(n\ln N_n)^{-1}\Delta_U - 1/2,\ n \ge 2\}$.

Remark 4.10. Under the conditions of Theorem 4.9, we have $\lim_{n\to\infty}\ln n/\ln N_n = 0$, so $\rho_n \approx 0$. Interestingly enough, Theorem 4.9 and Theorem 4.7 are then consistent, and both give the result $E\Delta_U \approx \frac{1}{2}n\ln N_n$.

4.3. SPACINGS FOR DISTRIBUTIONS WITH HIGH CONCENTRATION. A nonnegative integer-valued random variable $Z$ is said to be highly concentrated if $Z$ takes values in a set of few elements with high probability. Thus, the Binomial, Poisson, Geometric, and general Zipf distributions are highly concentrated. (The general Zipf distribution is defined by $\Pr[Z = k] \sim ck^{-\alpha}$ as $k \to \infty$, $\alpha > 1$.) When highly concentrated distributions are sampled without replacement, the spacings


FIG. 3. Two uniform attributes: Effect of attribute reordering on (3U1 U2 − 3U2 U1 )/3U1 U2 .

$Z_{(2)} - Z_{(1)}, \ldots, Z_{(n)} - Z_{(n-1)}$ are very likely to be 1, where $Z_{(1)}, \ldots, Z_{(n)}$ are the order statistics of the $n$ samples $Z_1, \ldots, Z_n$. Thus, the total number of bits required in this case is likely to be close to $O(n)$. Defining $\Lambda_Z$ simply as the sum of the logarithms of the differences would lead to a smaller estimate, since the logarithmic terms corresponding to differences of 1 would be zero. Fortunately, adopting the conservative form $\Lambda_Z = \sum_{k=1}^{n-1}\ln(Z_{(k+1)} - Z_{(k)} + 1)$ suggested in Section 3 leads to $\Lambda_Z \approx n\ln 2$, in perfect agreement with practice.

5. Optimal Ordering of Attribute Domains

When multiple attribute fields $X_1, X_2, \ldots, X_k$ are present in a database tuple, the ordering of the attribute fields will clearly influence the value resulting from the application of $\varphi$ (see Section 2) to the tuples. In this section, we consider the question of how to order the attribute fields so that $E\Lambda$ reaches its minimum.

5.1. UNIFORM ATTRIBUTE DOMAINS. Consider first the case when the $k$ fields are all uniformly distributed, so that $X_i$ is uniform over $(1, |D_i|)$. Somewhat contrary to intuition, $E\Lambda$ remains unaffected in this case by attribute domain reordering, since the random integer $X_1|D_2||D_3|\cdots|D_k| + X_2|D_3||D_4|\cdots|D_k| + \cdots + X_k$ is, regardless of field ordering, always distributed uniformly over the set $\{a+1, a+2, \ldots, a+b\}$, if we define $a = |D_2||D_3|\cdots|D_k| + |D_3||D_4|\cdots|D_k| + \cdots + |D_k|$ and $b = |D_1||D_2|\cdots|D_k|$. This somewhat paradoxical result is confirmed by our simulations, which are shown in Figure 3.

5.2. NONUNIFORM ATTRIBUTE DOMAINS. The case of nonuniform attributes is more complex. In fact, the optimal attribute ordering actually depends on the database size. A full analysis is elusive, but we provide general characterizations of behavior for the different cases.

5.2.1. Small Databases.
Let us first consider the simplest case, where there are only two fields and the database contains just two records. We use the analysis of this case to provide insight into more general situations. Suppose that $X, Y$ are independent random variables distributed as Zipf($m$) and Zipf($n$), respectively. Then $Z = (X, Y) = nX + Y$ has distribution function
$$F_Z(z) = \Pr[Z \le z] = \frac{H_{x-1}}{H_m} + \frac{H_y}{xH_mH_n} \approx \frac{\ln x + \gamma}{\ln m + \gamma} \approx \frac{\ln(z/n) + \gamma}{\ln m + \gamma} := \tilde F_Z(z)$$


for $z = y + nx$, $x = 1, \ldots, m$, $y = 1, \ldots, n$. Hence $Z$ can be approximated by the random variable $\tilde F_Z^{-1}(U) = n\exp[U(\ln m + \gamma) - \gamma]$, where $U$ is uniform on (0,1). Take $Z_1, Z_2$ to be i.i.d. copies of $Z$ with order statistics $Z_{(1)} \le Z_{(2)}$. Now,
$$E\Lambda_{xy} = E\ln\left(Z_{(2)} - Z_{(1)} + 1\right) \approx E\ln\left(ne^{U_{(2)}(\ln m + \gamma) - \gamma} - ne^{U_{(1)}(\ln m + \gamma) - \gamma}\right) = \ln n - \gamma + (\ln m + \gamma)\left[EU_{(1)} + E\left(U_{(2)} - U_{(1)}\right)\right] = \ln n + \frac{2}{3}\ln m - \frac{\gamma}{3}.$$
We could, but do not, derive this asymptotic formula from the original distribution function $F(z)$, since that route involves elementary but tedious calculations. We observe that the random variable $\tilde F_Z^{-1}(U)$ does not take the distribution of $Y$ into account, which appears reasonable since the first field dominates $\Lambda$ when dealing with a sample of size two. Hence, this approach also works for any discrete random variable $Y$ taking possibly $n$ values. The same idea works when $X$ is uniform on $(1, m)$. For an integer-valued random variable $Y$ taking at most $n$ values, we use $\tilde F_U^{-1}(U) = mnU$ in place of $Z = (X, Y)$, since
$$F_U(z) = \Pr[nX + Y \le nx + y] \approx \frac{x}{m} \approx \frac{z}{mn}.$$

As before,
$$E\Lambda_{xy} = E\ln\left(Z_{(2)} - Z_{(1)} + 1\right) \approx E\ln\left(mnU_{(2)} - mnU_{(1)}\right) = \ln m + \ln n - \frac{11}{6},$$
where $Z_{(1)} \le Z_{(2)}$ are the order statistics of $Z_1, Z_2$.

Now let us assume $n > m$. From the formulas above, if $X$ is Zipf($m$) and $Y$ is Zipf($n$), then
$$E\Lambda_{xy} \approx \ln n + \frac{2}{3}\ln m - \frac{\gamma}{3} > \ln m + \frac{2}{3}\ln n - \frac{\gamma}{3} \approx E\Lambda_{yx},$$
which suggests that we should put field $Y$ first to minimize $E\Lambda$. This result also appears paradoxical, since the domain of $X$ is smaller than that of $Y$. Intuition might have suggested that placing $X$ before $Y$ would result both in smaller values of $\varphi$ and in longer runs of leading zeroes in the sequence of differences, leading to a lower value of $\Lambda$. This apparent contradiction can be resolved by considering the skew and concentration effects of the distributions involved. For any $p \in (0,1)$, the order $(X, Y)$ gives the $p$-percentile $P_1(p) = \tilde F_Z^{-1}(p) = n\exp[p(\ln m + \gamma) - \gamma]$, by $\Pr[(X, Y) \le P_1(p)] = p$, while the order $(Y, X)$ gives the $p$-percentile $P_2(p) = m\exp[p(\ln n + \gamma) - \gamma] < P_1(p)$. Hence, the latter is more skewed than the former; consequently, the sample data is more likely to be concentrated at the left extreme, reducing $E\Lambda$. Figure 4 suggests this relationship convincingly by displaying the quantiles. We may also interpret this phenomenon in terms of the distribution functions. Clearly, for integers $1 \le x \le m$, $1 \le y \le n$, we have $\Pr[(X,Y) \le (x,y)] = \Pr[X < x] + \Pr[X = x, Y \le y] = H_{x-1}/H_m + H_y/(xH_mH_n)$ and $\Pr[(Y,X) \le (y,x)] = \Pr[Y < y] + \Pr[Y = y, X \le x] = H_{y-1}/H_n + H_x/(yH_mH_n)$. Through a rather complicated calculation, it can be shown that $\Pr[(X,Y) \le (x,y)] \le \Pr[(Y,X) \le (y',x')]$ if $1 \le x, x' \le m$, $1 \le y, y' \le n$, and $xn + y = y'm + x'$.
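The ordering comparison for two Zipf fields is easy to reproduce by simulation. The following sketch is our own (the parameters $m = 10$, $n = 1000$ are illustrative): it samples two-record databases under each ordering and estimates $E\Lambda_{xy}$ and $E\Lambda_{yx}$; the gap should be near $\frac{1}{3}\ln(n/m)$.

```python
import bisect
import math
import random

def zipf_sampler(size, rng):
    """Inverse-CDF sampler for Pr[X = k] proportional to 1/k, k = 1..size."""
    cdf, acc = [], 0.0
    for k in range(1, size + 1):
        acc += 1.0 / k
        cdf.append(acc)
    total = cdf[-1]
    return lambda: bisect.bisect_left(cdf, rng.random() * total) + 1

rng = random.Random(0)
m, n, trials = 10, 1000, 50_000
X, Y = zipf_sampler(m, rng), zipf_sampler(n, rng)

lam_xy = lam_yx = 0.0
for _ in range(trials):
    # ordering (X, Y): composite key n*X + Y; ordering (Y, X): m*Y + X
    z1, z2 = n * X() + Y(), n * X() + Y()
    lam_xy += math.log(abs(z1 - z2) + 1)
    w1, w2 = m * Y() + X(), m * Y() + X()
    lam_yx += math.log(abs(w1 - w2) + 1)
lam_xy /= trials
lam_yx /= trials

# Putting the larger Zipf domain first yields the smaller expected Lambda.
assert lam_yx < lam_xy
gap = lam_xy - lam_yx   # predicted to be near (1/3) * ln(n/m)
assert 0.5 < gap < 3.0
```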


FIG. 4. Two Zipf attributes: Characterizing skew through the distribution of quantiles.

FIG. 5. Attribute reordering: One Zipf and one uniform attribute.

Another extreme case is when all the fields are Zipf distributed, so that $X_i$ is Zipf($|D_i|$). In this situation, the analysis above suggests that to minimize $E\Lambda$, we order the fields so that the first field corresponds to the largest value of $|D_i|$, the second field to the second largest value, and so on. If both Zipf and uniform distributions are present among the fields, one should put the Zipf-distributed fields first, followed by the uniformly distributed ones. Similarly, when there are fields with arbitrary nonuniform distributions, we place the uniformly distributed fields last, the field with the highest concentration first, then the field with the second highest concentration, and so on. Figure 5 illustrates this effect by showing the values of $\Lambda_{ZU}$ and $\Lambda_{UZ}$ obtained through experiment. Our analysis began by assuming a database size of 2, but can clearly be extended to databases whose size is small relative to $N = \prod_i |D_i|$. The concentration effect is again the key to determining the optimal ordering. We note, however, that for two Zipf random variables, the advantage of optimal ordering over an arbitrary ordering seems small, since $E\Lambda_{xy} - E\Lambda_{yx} \approx \frac{1}{3}\ln(n/m)$, which is significant only when the ratio $n/m$ is extremely large. Even for $n = 10^{16}$ and $m = 10$, the difference is merely $5\ln 10$, which is not very significant.

5.2.2. Large Databases. Consider now the case when the database size $n$ is large, but we still have two Zipf attributes $X$ and $Y$. The situation is now quite different, since the concentration effect will no longer be crucial in determining $\Lambda$. Whether we order the attributes as $(X,Y)$ or $(Y,X)$, it is very likely that the initial segment of the order statistics $Z_{(1)} < Z_{(2)} < \cdots < Z_{(d)}$ will be the first $d$ consecutive integers for some $d \in \mathbb{N}$. Thus, the lower values in the Zipf range are very likely to


FIG. 6. Attribute reordering: two Zipf attributes.

be quickly exhausted, so that their contribution to $\Lambda$ is small, since $n$ is relatively large. The main contributions to $\Lambda$ will come from the elements at the right extreme of these order statistics. From Jensen's inequality, we have $\Lambda_Z \le (k-1)\ln[(Z_{(k)} - Z_{(1)})/(k-1)]$. It is intuitively clear that the two sides of this inequality will be closer together if the spacings $Z_{(2)} - Z_{(1)}, \ldots, Z_{(k)} - Z_{(k-1)}$ are close to each other. Since $k$ is large, $Z_{(k)}$ is close to $mn$ for both orderings. Therefore, nonuniformity within the set of spacings is the real issue. Such nonuniformity is most significant at the right extreme, and more uniformity leads to a higher $\Lambda$. The quantile plot of $(X,Y)$ and $(Y,X)$ shown in Figure 4 shows that the former displays less uniformity at the right extreme, so we expect the corresponding $\Lambda$ to be smaller. Figure 6 illustrates this effect through experiments on two Zipf domains with $Z_1 < Z_2$. For smaller database sizes, the order $Z_2Z_1$ yields lower $\Lambda$ values (in agreement with our results in Section 5.2.1), while the order $Z_1Z_2$ is better for larger database sizes. Although this discussion provides adequate intuition for understanding the difference between the two orderings at small and large database sizes, it appears quite difficult to quantify and compare the effects of concentration and nonuniformity. It also appears difficult to determine the borderline represented by the value of $k$. We suggest using the ordering $(Y,X)$ if $k < m = \min(m,n)$, and $(X,Y)$ otherwise.

6. Limit Theorems for the Multifield Case

We now turn to the problem of estimating values of $\Lambda$ when the database has several fields. Suppose the database has $k$ fields drawn from independent domains $D_1, \ldots, D_k$, respectively. Consider the corresponding random vector $\vec X = (X_1, \ldots, X_k)$, with $X_i$ taking values in $\{1, \ldots, |D_i|\}$, $i = 1, \ldots, k$.
As in Section 2, this random vector can be represented by the corresponding random integer $X_1|D_2|\cdots|D_k| + \cdots + X_{k-1}|D_k| + X_k$. Our analysis in Section 5 showed that lower values of $\Lambda$ result when the uniformly distributed domains are placed at the least-significant end of the tuple. In this case, the remaining nonuniform domains will be placed in some suitable order at the head of the tuple. We may view these nonuniformly distributed domains as jointly constituting a single composite domain with an arbitrary discrete distribution. Let us therefore model the nonuniform domains $X_1, \ldots, X_{m-1}$ as a single random variable $Z$ with an arbitrary distribution, assuming values in the set


$\{1, \ldots, d\}$ with probabilities $\Pr[Z = k] = p_k > 0$, $k = 1, \ldots, d$, for some fixed $d \in \mathbb{N}$. The remaining fields $X_m, \ldots, X_k$ have uniform distributions, whence $X_m|D_{m+1}|\cdots|D_k| + \cdots + X_{k-1}|D_k| + X_k$ may be collapsed into a single random variable $U$ uniformly distributed over the set $\{a+1, a+2, \ldots, a+b\}$, where $a = |D_{m+1}|\cdots|D_k| + \cdots + |D_{k-1}||D_k| + |D_k|$ and $b = |D_m|\cdots|D_k|$. Thus, in estimating $\Lambda$, we can collapse the fields $X_1, \ldots, X_k$ into just two fields.

Let $Z$ be an arbitrary discrete random variable assuming values in $\{1, 2, \ldots, d\}$, and let $U$ be uniform on $\{1, 2, \ldots, u_n\}$. Let $Z$ and $U$ be independent, and form the random vector $(Z, U)$. Let $Y_i = (Z_i, U_i)$, $i = 1, \ldots, n$, be $n$ i.i.d. copies of this vector. Take $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$ to be the order statistics of the $Y_i$, and form the statistic $\Lambda_{DU} = \sum_{i=1}^{n-1}\ln(Y_{(i+1)} - Y_{(i)} + 1)$.

If $Z$ is uniform on $\{1, 2, \ldots, d\}$, then $(Z, U)$ can be viewed as a single large uniformly distributed field. Theorem 4.3 can be directly applied to obtain the following result.

COROLLARY 6.1. If $n^2 = o(u_n)$, then $(\Lambda_{DU} - \mu_n)/\sigma \xrightarrow{D} N(0,1)$, where $\sigma = \alpha n^{1/2}$ and $\mu_n = (n-1)(\ln d + \ln u_n - \gamma - \ln(n+1))$.

If the distribution of $Z$ is not uniform, we may proceed as follows. Since $Z$ can take at most $d$ values, we would expect to see groups of tuples in the database sharing the same value in their first fields. When the database is sorted, the tuples in each such cluster appear together, and their differences show a zero value in the first field. We call each such cluster of tuples in the sorted database a run. Therefore, we may split the original database into $d$ smaller databases, each defined by a run corresponding to a value of $Z$, with the $j$th run having $N_j = \sum_{i=1}^{n}\mathbf{1}(Z_i = j)$ records. Since we may have $N_j = 0$, we allow runs to be empty. Consequently, $\Lambda_{DU}$, the overall statistic used to estimate database size, can be decomposed into two components: one to model the spacings within the runs, and one to model the spacings across the runs. That is,
$$\Lambda_{DU} \stackrel{D}{=} \sum_{j=1}^{d}\Lambda_j + \sum_{j=1}^{d-1}\ln\left(U_{j+1,1} - U_{j,N_j} + u_n + 1\right) := \Lambda_w + \Lambda_b.$$

In this formula, $\Lambda_j = \sum_{i=1}^{N_j-1}\ln(U_{j,i+1} - U_{j,i} + 1)$; that is, given $N_j = l$, $\Lambda_j = \Lambda_j(l) = \sum_{i=1}^{l-1}\ln(U_{j,i+1} - U_{j,i} + 1)$, where $U_{j,1} \le U_{j,2} \le \cdots \le U_{j,l}$ are the order statistics of $\bar U_{j,1}, \ldots, \bar U_{j,l}$, and $\{\bar U_{j,i},\ 1 \le j \le d,\ i \ge 0\}$ are i.i.d. random variables uniform on $\{1, \ldots, u_n\}$ and independent of $Z_1, \ldots, Z_n$. Here $\Lambda_w$ is the contribution to $\Lambda_{DU}$ from within runs, and $\Lambda_b$ captures the spacings between consecutive runs.

Obviously, $(N_1, \ldots, N_d)$ follows the multinomial distribution Multi($n; p_1, \ldots, p_d$), so that $N_j$ has distribution Bin($n; p_j$). As in Shiryayev [1995], we may therefore write the inequality $\Pr[|N_j/n - p_j| \ge \varepsilon] \le 2\exp(-2n\varepsilon^2)$ for every $\varepsilon > 0$. When $N_j = 0$ or 1, we use the convention $\Lambda_j = 0$, and define the corresponding summand in $\Lambda_b$ to be 0. Given the large-deviation style inequality above, however, we are assured that $N_j = 0$ or 1 occurs only with exponentially small probability. Therefore, in pursuing the limiting distribution of $\Lambda_{DU}$ in Theorem 6.2, we may assume without undue concern that $N_j \ge 2$. We now state the main theorem that allows us to estimate the size of a compressed database with multiple attributes.
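The run decomposition $\Lambda_{DU} = \Lambda_w + \Lambda_b$ is an exact identity whenever every value of $Z$ occurs at least once, and it is easy to confirm in code. The sketch below is our own illustration (the parameters $d$, $u_n$, $n$ are arbitrary): it encodes each record as $(z-1)u_n + u$, computes $\Lambda_{DU}$ directly, and recomputes it run by run.

```python
import math
import random
from collections import defaultdict

rng = random.Random(3)
d, u_n, n = 4, 10_000, 1000
data = [(rng.randint(1, d), rng.randint(1, u_n)) for _ in range(n)]
assert {z for z, _ in data} == set(range(1, d + 1))  # every run nonempty

# Direct computation on the combined integers phi(z, u) = (z - 1)*u_n + u.
ys = sorted((z - 1) * u_n + u for z, u in data)
lam_direct = sum(math.log(ys[i + 1] - ys[i] + 1) for i in range(n - 1))

# Run-by-run decomposition: Lambda_w (within runs) + Lambda_b (across runs).
runs = defaultdict(list)
for z, u in data:
    runs[z].append(u)
for z in runs:
    runs[z].sort()
lam_w = sum(math.log(us[i + 1] - us[i] + 1)
            for us in runs.values() for i in range(len(us) - 1))
lam_b = sum(math.log(runs[j + 1][0] - runs[j][-1] + u_n + 1)
            for j in range(1, d))

assert abs(lam_direct - (lam_w + lam_b)) < 1e-6
```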


THEOREM 6.2. Let each record in the database comprise two discrete random fields $(Z, U)$, where $Z$ has an arbitrary distribution on $\{1, \ldots, d\}$ and $U$ is uniform on $\{1, \ldots, u_n\}$. If $n^2 = o(u_n)$, then as $n \to \infty$,
$$\frac{\Lambda_{DU} - \mu_{DU}}{\beta n^{1/2}} \xrightarrow{D} N(0, 1),$$
where
$$\mu_{DU} = \sum_{j=1}^{d}(np_j - 1)\left(\ln u_n - \ln(np_j) - \gamma\right) + \sum_{j=1}^{d-1}\ln\left(\frac{u_n}{np_j} + \frac{u_n}{np_{j+1}}\right) := \mu_w + \mu_b$$
and
$$\beta^2 = \alpha^2 + \sum_{j=1}^{d}p_j(\ln p_j)^2 - \left(\sum_{j=1}^{d}p_j\ln p_j\right)^2, \qquad \alpha^2 = \pi^2/6 - 1 = 0.644934\ldots.$$

PROOF. We first motivate the result with heuristics before proceeding to the rigorous argument. Since $N_j$ has distribution Bin($n; p_j$), we can replace $N_j$ by its mean $np_j$. Then $EU_{j+1,1} \approx u_n/(np_{j+1})$ and $E(u_n - U_{j,N_j}) \approx u_n/(np_j)$, so we approximate $E\Lambda_b$ by $\mu_b$. By Theorem 4.3, the mean $E\Lambda_w \approx \mu_w$ and the variance is $\sum_{j=1}^{d}\alpha^2np_j = \alpha^2n$. The part $n[\sum_{j=1}^{d}p_j(\ln p_j)^2 - (\sum_{j=1}^{d}p_j\ln p_j)^2]$ of the overall variance $n\beta^2$ can be interpreted as the uncertainty in choosing differences across runs, which corresponds to $\Lambda_b$.

Now let us proceed to the rigorous argument. For $k \ge 1$, set $\mu(k) = (k-1)(-\gamma + \ln u_n - \ln k)$, and let $\mu(k) = 0$ if $k \le 1$. To obtain the limiting distribution, we apply the Lévy Continuity Theorem by analyzing characteristic functions. We first note that, given $N_1 = n_1, \ldots, N_d = n_d$, the $\Lambda_1, \ldots, \Lambda_d$ are independent. Hence, for $t \in \mathbb{R}$, we have via conditioning,
$$E\exp\left(\sqrt{-1}\,t\sum_{j=1}^{d}\frac{\Lambda_j - \mu(np_j)}{\sqrt n}\right) = E\left\{\exp\left(\sqrt{-1}\,t\sum_{j=1}^{d}\frac{\mu(N_j) - \mu(np_j)}{\sqrt n}\right)E\left[\exp\left(\sqrt{-1}\,t\sum_{j=1}^{d}\frac{\Lambda_j - \mu(N_j)}{\sqrt n}\right)\,\Bigg|\, N_1, \ldots, N_d\right]\right\} = E\left\{\exp\left(\sqrt{-1}\,t\sum_{j=1}^{d}\frac{\mu(N_j) - \mu(np_j)}{\sqrt n}\right)\prod_{j=1}^{d}E\left[\exp\left(\sqrt{-1}\,t\,\frac{\Lambda_j - \mu(N_j)}{\sqrt n}\right)\,\Bigg|\, N_j\right]\right\}.$$

We next assert three convergence results, (11), (12), and (13), and proceed to prove them using the Lebesgue Dominated Convergence Theorem and Slutsky's


Theorem. These results will lead to Theorem 6.2 via the Lévy Continuity Theorem.
$$E\left[\exp\left(\sqrt{-1}\,t\,\frac{\Lambda_j - \mu(N_j)}{\sqrt n}\right)\,\Bigg|\, N_j\right] \xrightarrow{\text{a.s.}} \exp\left(\frac{-\alpha^2p_jt^2}{2}\right), \qquad (11)$$
$$\sum_{j=1}^{d}\frac{\mu(N_j) - \mu(np_j)}{\sqrt n} \xrightarrow{D} N\left(0,\ \sum_{j=1}^{d}p_j(\ln p_j)^2 - \Big(\sum_{j=1}^{d}p_j\ln p_j\Big)^2\right), \qquad (12)$$
and, for $j = 1, \ldots, d$,
$$\frac{1}{n^{1/2}}\left[\ln\left(U_{j+1,1} - U_{j,N_j} + u_n + 1\right) - \ln\left(\frac{u_n}{np_j} + \frac{u_n}{np_{j+1}}\right)\right] \xrightarrow{P} 0. \qquad (13)$$

To show (11), we proceed as follows. Since the event $\{N_j = l\}$ and $\Lambda_j(l)$ are independent,
$$E\left[\exp\left(\sqrt{-1}\,t\,\frac{\Lambda_j - \mu(N_j)}{\sqrt n}\right)\,\Bigg|\, N_j = l\right] = E\left[\exp\left(\sqrt{-1}\,t\,\frac{\Lambda_j(l) - \mu(l)}{\sqrt n}\right)\right] := g(t; n, l).$$
Define the set $I_n = \{l \in \mathbb{N} : l \in (np_j - n^{2/3},\ np_j + n^{2/3})\}$. Observing that
$$\limsup_{n\to\infty}\left|E\left[\exp\left(\sqrt{-1}\,t\,\frac{\Lambda_j - \mu(N_j)}{\sqrt n}\right)\,\Bigg|\, N_j\right] - \exp\left(\frac{-\alpha^2p_jt^2}{2}\right)\right| = \limsup_{n\to\infty}\left|\left(\sum_{l\in I_n} + \sum_{l\notin I_n}\right)\mathbf{1}(N_j = l)\left[g(t;n,l) - \exp\left(\frac{-\alpha^2p_jt^2}{2}\right)\right]\right| \le \limsup_{n\to\infty}\sup_{l\in I_n}\left|g(t;n,l) - \exp\left(\frac{-\alpha^2p_jt^2}{2}\right)\right| + 2\limsup_{n\to\infty}\mathbf{1}(N_j \notin I_n) := A + B,$$
for (11) we only need to show that $A = 0$ and $B = 0$ a.s. Again, by the inequality $\Pr[|N_j/n - p_j| \ge \varepsilon] \le 2\exp(-2n\varepsilon^2)$, we have $\lim_{k\to\infty}\sum_{n=k}^{\infty}\Pr[N_j \notin I_n] \le 2\lim_{k\to\infty}\sum_{n=k}^{\infty}\exp[-2n(n^{-1/3})^2] = 0$. Hence $B = 0$ a.s. via the Borel–Cantelli lemma. If $A \ne 0$, then there exist an $\varepsilon > 0$, a subsequence $\{n'\} \subset \mathbb{N}$, and $l(n') \in I_{n'}$ such that along this subsequence $|g(t; n', l(n')) - \exp(-\alpha^2p_jt^2/2)| > \varepsilon$. However, by the Lévy Continuity Theorem, we do have $|g(t; n', l(n')) - \exp(-\alpha^2p_jt^2/2)| \to 0$, following from $(n')^{-1/2}[\Lambda_j(l(n')) - \mu(l(n'))] \xrightarrow{D} N(0, \alpha^2p_j)$, which is due to $l(n')/n' \to p_j$ and $[\alpha^2l(n')]^{-1/2}[\Lambda_j(l(n')) - \mu(l(n'))] \xrightarrow{D} N(0,1)$ as asserted by Theorem 4.3, since $l(n') \to \infty$.

To prove (12), define $\hat p_n = (N_1/n, \ldots, N_d/n)$, $\hat p = (p_1, \ldots, p_d)$, and the entropy function $\nu(\hat q) = \sum_{j=1}^{d}q_j\ln q_j$ for a $d$-dimensional probability vector $\hat q = (q_1, \ldots, q_d)$. By the classical Central Limit Theorem for vectors, we have $n^{1/2}(\hat p_n - \hat p) \xrightarrow{D} N(0, \Sigma)$, where $\Sigma$ is a $d \times d$ positive definite matrix with $\Sigma_{ii} = p_i(1-p_i)$, $\Sigma_{ij} = -p_ip_j$. Using the Delta method [van der Vaart 1998], which expands a function of random variables about its mean with a one-step


Taylor expansion to compute its variance, we now have
$$n^{1/2}[\nu(\hat p_n) - \nu(\hat p)] \xrightarrow{D} N\left(0,\ \frac{\partial\nu}{\partial\hat q}\bigg|_{\hat p}\,\Sigma\,\bigg(\frac{\partial\nu}{\partial\hat q}\bigg|_{\hat p}\bigg)^{\tau}\right). \qquad (14)$$
In view of $n^{1/2}(\hat p_n - \hat p) \xrightarrow{D} N(0, \Sigma)$, the Taylor expansion $\nu(\hat p_n) - \nu(\hat p) \approx (\hat p_n - \hat p)(\partial\nu/\partial\hat q)^{\tau}|_{\hat p}$ gives some intuition for the application of the Delta method in (14). Now we can use (14) to prove (12) by writing
$$\sum_{j=1}^{d}[\mu(N_j) - \mu(np_j)] = \sum_{j=1}^{d}\ln\frac{\hat p_{n,j}}{p_j} + n[\nu(\hat p) - \nu(\hat p_n)],$$
since $\hat p_{n,j} \to p_j$ a.s. and
$$\frac{\partial\nu}{\partial\hat q}\bigg|_{\hat p}\,\Sigma\,\bigg(\frac{\partial\nu}{\partial\hat q}\bigg|_{\hat p}\bigg)^{\tau} = \sum_{j=1}^{d}p_j(\ln p_j)^2 - \bigg(\sum_{j=1}^{d}p_j\ln p_j\bigg)^2.$$

For (13), we only need to show that
$$\frac{np_{j+1}}{u_n}U_{j+1,1} = O_P(1), \qquad \frac{np_j}{u_n}\left(u_n - U_{j,N_j}\right) = O_P(1).$$
The notation $X_n = O_P(1)$, as in van der Vaart [1998], means that the random sequence $X_n$ is stochastically bounded; that is, for each $\varepsilon > 0$ there exists a $K = K(\varepsilon) > 0$ such that $\sup_{n\ge1}\Pr[|X_n| > K] < \varepsilon$. In fact, it is possible to obtain the stronger result
$$\frac{np_{j+1}}{u_n}U_{j+1,1} \xrightarrow{D} \exp(1), \qquad \frac{np_j}{u_n}\left(u_n - U_{j,N_j}\right) \xrightarrow{D} \exp(1). \qquad (15)$$
Here we prove only the second claim, since the first can be derived similarly. For $x \ge 0$,
$$\lim_{n\to\infty}\Pr\left[\frac{np_j}{u_n}\left(u_n - U_{j,N_j}\right) > x\right] = \lim_{n\to\infty}\sum_{l=1}^{\infty}\Pr\left[\frac{np_j}{u_n}\left(u_n - U_{j,l}\right) > x,\ N_j = l\right] = \lim_{n\to\infty}\sum_{l=1}^{\infty}\Pr[N_j = l]\left(\frac{1}{u_n}\left(u_n - \left\lceil\frac{u_nx}{np_j}\right\rceil\right)\right)^{l} = \lim_{n\to\infty}E\left[\left(\frac{1}{u_n}\left(u_n - \left\lceil\frac{u_nx}{np_j}\right\rceil\right)\right)^{np_j\cdot N_j/(np_j)}\right] = \exp(-x).$$
The last step follows from the Lebesgue Dominated Convergence Theorem.

Figure 7 shows how closely Theorem 6.2 agrees with values of $\Lambda$ observed in practice. We generated two datasets, each with two distributions, one skewed and one uniform. The skewed distribution in the first dataset was Zipf(100), with its second field uniform over $(0, 10^7)$. The other dataset had its first field distributed as Binomial(10, 1/3), with its second field uniform over $(0, 10^5)$.
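The predicted mean $\mu_{DU}$ in Theorem 6.2 is simple to compute directly. As a consistency check, our own sketch below (with arbitrary illustrative parameters) verifies that when $Z$ is uniform on $\{1,\ldots,d\}$, $\mu_{DU}$ nearly coincides with $\mu_n = (n-1)(\ln d + \ln u_n - \gamma - \ln(n+1))$ from Corollary 6.1.

```python
import math

GAMMA = 0.5772156649015329  # Euler's constant

def mu_DU(p, u_n, n):
    """mu_DU = mu_w + mu_b from Theorem 6.2, for run probabilities p[0..d-1]."""
    mu_w = sum((n * pj - 1) * (math.log(u_n) - math.log(n * pj) - GAMMA)
               for pj in p)
    mu_b = sum(math.log(u_n / (n * p[j]) + u_n / (n * p[j + 1]))
               for j in range(len(p) - 1))
    return mu_w + mu_b

d, u_n, n = 10, 10 ** 9, 1000   # n^2 = o(u_n) regime
uniform_p = [1.0 / d] * d
mu_corollary = (n - 1) * (math.log(d) + math.log(u_n) - GAMMA - math.log(n + 1))

assert abs(mu_DU(uniform_p, u_n, n) / mu_corollary - 1) < 0.01
```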


FIG. 7. Multiple fields: Agreement between Theorem 6.2 and experiment.

Agreement in both cases is to within a fraction of one percent over a large range, illustrating the power of Theorem 6.2.

Remark 6.3. The reader may observe that, to obtain a more accurate estimate of $E\Lambda_b = \sum_{j=1}^{d-1}E\ln(U_{j+1,1} - U_{j,N_j} + u_n + 1)$, one must take advantage of the limiting distributions of $U_{j+1,1}$ and $u_n - U_{j,N_j}$ specified by (15), since bias results if we directly replace $U_{j+1,1}$ and $u_n - U_{j,N_j}$ by their asymptotic means $u_n/(np_{j+1})$ and $u_n/(np_j)$. This can be achieved by the following steps. (We again omit the details because of their complexity.) First, given $N_j$ and $N_{j+1}$, $U_{j+1,1}$ and $u_n - U_{j,N_j}$ are independent, and $N_j, N_{j+1}$ are asymptotically independent. So we have
$$\frac{n}{u_n}\left[U_{j+1,1} + \left(u_n - U_{j,N_j}\right)\right] \xrightarrow{D} \frac{Y_1}{p_{j+1}} + \frac{Y_2}{p_j},$$
where $Y_1, Y_2$ are two i.i.d. $\exp(1)$ random variables. Next, following a careful estimation, the random sequence in the preceding display can be shown to be uniformly integrable. Hence,
$$\lim_{n\to\infty}\left[E\Lambda_b - (d-1)\ln\frac{u_n}{n}\right] = \sum_{j=1}^{d-1}E\ln\left(\frac{Y_1}{p_{j+1}} + \frac{Y_2}{p_j}\right).$$
Finally, an elementary but interesting computation, through the parameter transformation $x = s/p_{j+1} + t/p_j$, $y = s + t$, leads to
$$E\ln\left(\frac{Y_1}{p_{j+1}} + \frac{Y_2}{p_j}\right) = \int_0^{\infty}\!\!\int_0^{\infty}\exp(-(s+t))\ln\left(\frac{s}{p_{j+1}} + \frac{t}{p_j}\right)ds\,dt = \frac{p_j\ln p_{j+1} - p_{j+1}\ln p_j}{p_{j+1} - p_j} - \gamma.$$
To summarize, we have outlined the better estimate
$$E\Lambda_b = (d-1)\ln\frac{u_n}{n} + \sum_{j=1}^{d-1}\left[\frac{p_j\ln p_{j+1} - p_{j+1}\ln p_j}{p_{j+1} - p_j} - \gamma\right].$$

Remark 6.4. If the condition $n^2 = o(u_n)$ does not hold, we would use the techniques of Section 3: an equivalent sampling scheme must be adopted, with $n$ replaced by the adjusted size $n'$.
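The closed form in Remark 6.3 can be checked directly by Monte Carlo. The sketch below is our own (the values $p_j = 0.2$, $p_{j+1} = 0.5$ are arbitrary illustrations): it compares $E\ln(Y_1/p_{j+1} + Y_2/p_j)$ against $(p_j\ln p_{j+1} - p_{j+1}\ln p_j)/(p_{j+1} - p_j) - \gamma$.

```python
import math
import random

GAMMA = 0.5772156649015329  # Euler's constant

def closed_form(p_j, p_j1):
    """(p_j ln p_{j+1} - p_{j+1} ln p_j)/(p_{j+1} - p_j) - gamma, from Remark 6.3."""
    return (p_j * math.log(p_j1) - p_j1 * math.log(p_j)) / (p_j1 - p_j) - GAMMA

rng = random.Random(11)
p_j, p_j1 = 0.2, 0.5
samples = 200_000
mc = sum(math.log(rng.expovariate(1.0) / p_j1 + rng.expovariate(1.0) / p_j)
         for _ in range(samples)) / samples

assert abs(mc - closed_form(p_j, p_j1)) < 0.05   # closed form is about 1.643
```

As a sanity check on the formula itself, letting $p_{j+1} \to p_j = p$ recovers $1 - \gamma - \ln p$, which matches $E\ln((Y_1 + Y_2)/p) = \psi(2) - \ln p$ for the Gamma(2,1) sum $Y_1 + Y_2$.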


7. Conclusions and Future Work

This article provides the theoretical foundations for the Tuple Difference Coding (TDC) method for compressing large databases and data warehouses. As already noted, practical interest in the TDC method is growing, and the results given in this paper will help in organizing the data in a warehouse so as to maximize the effects of compression. The problem of estimating the effectiveness of compression using TDC reduces to the problem of estimating the sum of the logarithms of the spacings between elements of samples taken without replacement. This is a nontrivial problem, but for the purpose of estimating compression efficiency, we may consider it effectively solved using the techniques we have developed. In particular, the approach we develop in Section 3, which recasts sampling without replacement in terms of sampling with replacement, is likely to be useful beyond its applications in this paper.

This article provides methods for estimating the compression for cases where the population from which database records are sampled is either uniform, Zipf, or the product of a uniform distribution and an arbitrary distribution. We have verified our theoretical results by conducting experiments; agreement between theory and practice is always within a few percent, and within a fraction of a percent in most cases.

The issue most in need of additional work is that of ordering attribute domains to achieve optimal compression. We have made significant progress on this issue in this paper, but do not yet have strong analytical results; this is material for further work. Also, our analysis in this article assumes knowledge of the data distributions, but in practice this information is not always available.
Much more likely is nonparametric knowledge of data characteristics, such as variance, skew, or information such as "80% of the data comes from 20% of the values." The estimation of compression efficiency from such nonparametric information is an important area of future work. From the probability theory and statistics viewpoint, it appears quite important to derive the asymptotic distributions for discrete spacings under proper scaling. The results available to date require the strong assumption that the distribution functions are absolutely continuous.

ACKNOWLEDGMENTS. The authors are grateful to the reviewers for their reading of this article, and for the suggestions they have made for its improvement.

REFERENCES

BLUMENTHAL, S. 1968. Logarithms of sample spacings. SIAM J. Appl. Math. 16, 1184–1191.
CHOW, Y. S., AND TEICHER, H. 1988. Probability Theory. Springer-Verlag, New York.
CSÖRGŐ, S., AND WU, W. B. 2000. Random graphs and the strong convergence of bootstrap means. Combin. Prob. Comput. 9, 315–347.
DARLING, D. A. 1953. On a class of problems relating to the random division of an interval. Ann. Math. Stat. 24, 239–253.
HOEFFDING, W. 1963. Probability inequalities for sums of bounded random variables. J. Amer. Stat. Assoc. 58, 13–30.
KLASS, M., AND TEICHER, H. 1977. Iterated logarithm laws for asymmetric random variables barely with or without finite mean. Ann. Prob. 5, 861–874.
KOLCHIN, V. F., SEVAST'YANOV, B. A., AND CHISTYAKOV, V. P. 1978. Random Allocations. Wiley, New York.
LI, W. 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inf. Theory 38, 1842–1845.


NETRAVALI, A. N., AND HASKELL, B. G. 1988. Digital Pictures—Representation and Compression. Plenum Press, New York and London.
NG, W.-K., AND RAVISHANKAR, C. V. 1997. Block-oriented compression techniques for large statistical databases. IEEE Trans. Knowl. Data Eng. 9, 314–328.
PYKE, R. 1965. Spacings. J. Roy. Stat. Soc., Ser. B 27, 395–449.
SHAO, Y., AND HAHN, M. G. 1995. Limit theorems for the logarithm of sample spacings. Stat. Prob. Lett. 24, 121–132.
SHIRYAYEV, A. N. 1995. Probability. Springer-Verlag, New York.
VAN DER VAART, A. W. 1998. Asymptotic Statistics. Cambridge University Press, Cambridge, U.K.
VITERBI, A. J., AND OMURA, J. K. 1979. Principles of Digital Communication and Coding. McGraw-Hill, New York.

RECEIVED JANUARY 1999; REVISED APRIL 2002; ACCEPTED MAY 2003

Journal of the ACM, Vol. 50, No. 5, September 2003.