Pointwise Redundancy in Lossy Data Compression and Universal Lossy Data Compression

I. Kontoyiannis

August 17, 1999

Abstract: We characterize the achievable pointwise redundancy rates for lossy data compression at a fixed distortion level. "Pointwise redundancy" refers to the difference between the description length achieved by an nth-order block code and the optimal nR(D) bits. For memoryless sources, we show that the best achievable redundancy rate is of order $O(\sqrt{n})$ in probability. This follows from a second-order refinement to the classical source coding theorem, in the form of a "one-sided central limit theorem." Moreover, we show that, along (almost) any source realization, the description lengths of any sequence of block codes operating at distortion level $D$ exceed $nR(D)$ by at least as much as $C\sqrt{n\log\log n}$, infinitely often. Corresponding direct coding theorems are also given, showing that these rates are essentially achievable. The above rates are in sharp contrast with the expected redundancy rates of order $O(\log n)$ recently reported by various authors. Our approach is based on showing that the compression performance of an arbitrary sequence of codes is essentially bounded below by the performance of Shannon's random code. We obtain partial generalizations of the above results for arbitrary sources with memory, and we prove lossy analogs of "Barron's Lemma."

Index Terms: Lossy data compression, universal coding, redundancy, large deviations, rate-distortion.

I. Kontoyiannis is with the Department of Statistics, Purdue University, 1399 Mathematical Sciences Building, W. Lafayette, IN 47907-1399. Email: [email protected]
This work was supported in part by a grant from the Purdue Research Foundation.
Preliminary results from this work were reported at the 1999 Canadian Workshop on Information Theory, Kingston, Ontario, June 1999.

1 Introduction

Broadly speaking, the objective of lossy data compression is to find efficient approximate representations for relatively large amounts of data. Let $x_1^n = (x_1, x_2, \ldots, x_n)$ denote a data string generated by a random source $X = \{X_n;\ n \ge 1\}$ taking values in the source alphabet $A$. We wish to represent each $x_1^n$ by a corresponding string $y_1^n = (y_1, y_2, \ldots, y_n)$ taking values in the reproduction alphabet $\hat{A}$ (where $\hat{A}$ may or may not be the same as $A$), so that the distortion between each data string and its representation lies within some fixed allowable range. For our purposes, distortion is measured by a family of single-letter distortion measures,

\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i), \qquad x_1^n \in A^n,\ y_1^n \in \hat{A}^n,

where $\rho: A \times \hat{A} \to [0, \infty)$ is a fixed nonnegative function. To be specific, we consider "variable-length block codes operating at a fixed distortion level," that is, codes $C_n$ defined by triplets $(B_n, \phi_n, \psi_n)$ where: (a) $B_n$ is a subset of $\hat{A}^n$ called the codebook; (b) $\phi_n: A^n \to B_n$ is the encoder or quantizer; (c) $\psi_n: B_n \to \{0,1\}^*$ is an invertible (and prefix-free) representation of the elements of $B_n$ by finite-length binary strings. For $D \ge 0$, the block code $C_n = (B_n, \phi_n, \psi_n)$ is said to operate at distortion level $D$ [14] (or to be $D$-semifaithful [23]), if it encodes each source string with distortion $D$ or less:

\rho_n(x_1^n, \phi_n(x_1^n)) \le D, \qquad \text{for all } x_1^n \in A^n.

From the point of view of data compression, the main quantity of interest is the description length of a block code $C_n$, expressed in terms of its associated length function $\ell_n: A^n \to \mathbb{N}$. Here, $\ell_n(x_1^n)$ denotes the description length, in bits, assigned by $C_n$ to the string $x_1^n$. Formally,

\ell_n(x_1^n) = \text{length of } [\psi_n(\phi_n(x_1^n))].

Roughly speaking, the smaller the description length, the better the code.

Shannon in 1959 characterized the best achievable compression performance of block codes. Suppose, for example, that the data are generated by a memoryless source $X = \{X_n;\ n \ge 1\}$, that is, the $X_n$ are independent and identically distributed (i.i.d.) random variables with common distribution $P$ on $A$. Suppose also that $\{C_n = (B_n, \phi_n, \psi_n);\ n \ge 1\}$ is an arbitrary sequence of block codes operating at distortion level $D$. In [28] Shannon identified the minimal expected description length that can be achieved by any such sequence $\{C_n\}$. He showed that the expected compression ratio $E[\ell_n(X_1^n)]/n$ is asymptotically bounded below by the rate-distortion function $R(D)$,

\liminf_{n\to\infty} \frac{E[\ell_n(X_1^n)]}{n} \ge R(D) \ \text{bits per symbol}    (1)

where $R(D) = R(P, D)$ is defined by the well-known formula

R(D) = R(P, D) = \inf_{(X, Y):\ X \sim P,\ E[\rho(X, Y)] \le D} I(X; Y).

[Precise definitions are given in the next section.] Moreover, Shannon demonstrated the existence of codes achieving the above lower bound with equality; see [28] or Berger's classic text [4]. A stronger version of Shannon's "converse" (1) was proved by Kieffer in 1991 [14], who showed that the rate-distortion function is an asymptotic lower bound for $\ell_n(X_1^n)$ not just in expectation but also in the pointwise sense:

\liminf_{n\to\infty} \frac{\ell_n(X_1^n)}{n} \ge R(D) \quad \text{a.s.}    (2)

[Here and throughout this paper the terms "pointwise," "almost surely" (denoted "a.s.") and "with probability one" are used interchangeably.] Kieffer's result says that, asymptotically, it is impossible to beat the rate-distortion function even on a small fraction of the messages generated by the source. In [14] it is also demonstrated that the bound in (2) can be achieved with equality. Our main aim in this paper is to characterize the achievable pointwise redundancy rates for block codes applied to memoryless sources, where the pointwise redundancy is defined as the difference between the description length $\ell_n(X_1^n)$ of an $n$th-order block code $C_n$ and the optimum description length given by $nR(D)$. Mathematically, this problem translates to describing the possible rates of convergence in (2) and, in particular, finding the fastest such rate. The main gist of our approach will be to show that the performance of any sequence of block codes operating at a fixed distortion level is bounded below by the performance of a (simple variant of) Shannon's random code. In terms of data compression, knowing the possible convergence rates that can be achieved in (2) tells us how large a block of data we need to take in order to come reasonably close to optimal compression performance. Clearly, these questions are of significant practical relevance.

1.1 Outline

For simplicity, assume for now that $A$ and $\hat{A}$ are both finite sets, and let $X = \{X_n;\ n \ge 1\}$ be a memoryless source with rate-distortion function $R(D)$. Our main results (Theorems 4 and 5, summarized in Corollary 1) state that the performance of an arbitrary sequence of codes $\{C_n, \ell_n\}$ is essentially dominated by the performance of a random code, to which we refer as the "Shannon code."

(MAIN RESULT): Let $Q^*_n$ be the optimal reproduction distribution at distortion level $D$, and write $B(x_1^n, D)$ for the distortion-ball of radius $D$ around $x_1^n$ [precise definitions are given in the next section]. For any sequence of block codes $\{C_n\}$ operating at distortion level $D$, with associated length functions $\{\ell_n\}$, we have:

\ell_n(X_1^n) \ge \log[1 / Q^*_n(B(X_1^n, D))] + O(\log n) \quad \text{a.s.}

Moreover, the Shannon code asymptotically achieves this lower bound with equality. [Above and throughout the paper, "log" denotes the logarithm taken to base 2 and "log_e" denotes the natural logarithm.]

Next, motivated by corresponding results in the case of lossless data compression [15], we interpret Kieffer's result (2) as a "one-sided" law of large numbers, and we state and prove corresponding second-order refinements to (2). In Theorem 1 we give a "one-sided" central limit theorem (CLT) corresponding to the pointwise lower bound in (2):

(CLT): There is a sequence of random variables $\{G_n\}$ (depending on $P$ and $D$) such that, for any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, we have

\ell_n(X_1^n) \ge nR(D) + \sqrt{n}\, G_n + O(\log n) \quad \text{a.s.}    (3)

where the $G_n$ converge in distribution (as $n \to \infty$) to a Gaussian random variable. Moreover, there exist codes $\{C_n\}$ achieving the lower bound in (3) [see Theorem 2]. This means that for any sequence of codes, about half the time, the description length $\ell_n(X_1^n)$ will deviate from the optimum $nR(D)$ bits by an extra $O(\sqrt{n})$ bits.

A further refinement to the pointwise converse (2) is also given in Theorem 1, in the form of a "one-sided" law of the iterated logarithm (LIL) (under some mild conditions). This provides a complete characterization of the pointwise redundancy of block codes at a fixed distortion level:

(LIL): For any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, the pointwise redundancy exceeds $C\sqrt{n\log\log n}$ for infinitely many values of $n$ (for some $C > 0$):

\ell_n(X_1^n) - nR(D) \ge C\sqrt{n\log\log n} \quad \text{infinitely often, a.s.}    (4)

Moreover, there exist codes $\{C_n\}$ asymptotically achieving this lower bound [Theorem 2].

The pointwise redundancy rates in (3) and (4) are in sharp contrast with the corresponding expected redundancy results recently reported by Zhang, Yang and Wei in [39]. There, it is shown that the best possible expected redundancy $E[\ell_n(X_1^n)] - nR(D)$ achievable by block codes is of order $O(\log n)$. For practical purposes, this difference suggests the following interpretation: Since any compression algorithm used in practice is bound to have fluctuations in the description length of order at least as large as $O(\sqrt{n})$, for big enough block lengths $n$ it may or may not be worth putting a lot of effort into optimizing the algorithm's expected performance. Instead, it might be more useful to either (a) try to control the variance of the description lengths $\ell_n(X_1^n)$, or (b) optimize the algorithm's implementation. Indeed, it seems to often be the case in practice that "implementation complexity might be the dominating issue" [5].

Our next result says, perhaps somewhat surprisingly, that there is no cost for universality in pointwise redundancy. That is, essentially the same performance can be achieved even when the source distribution is not known in advance. For the class of all memoryless sources $P$ over the alphabet $A$, Theorem 3 demonstrates the existence of a sequence of universal codes $\{C_n\}$ with length functions $\{\ell_n\}$ such that, for every source $P$ (and for some $C' > 0$):

(a) $\ell_n(X_1^n)/n \to R(P, D)$ a.s.
(b) $\ell_n(X_1^n) = nR(P, D) + \sqrt{n}\, G_n + O(\log n)$ a.s.
(c) $\ell_n(X_1^n) - nR(P, D) \le C' \sqrt{n\log\log n}$ eventually, a.s.

A natural next question to ask is whether these results remain true when sources with memory are considered. The fundamental coding theorems in (1) and (2) are of course still valid (with the rate-distortion function now defined in terms of the distribution of the whole process $X$), but redundancy questions appear to be much more delicate. For arbitrary sources with memory (not even necessarily stationary or ergodic), Theorem 6 gives a general pointwise lower bound for the performance of block codes at a fixed distortion level. This result can be thought of as the natural lossy-compression analog of a well-known result from lossless compression, sometimes [29] referred to as "Barron's Lemma" [1][2]. A more detailed discussion of this connection is given in Section 2.4. Finally, Theorem 8 is a direct coding theorem demonstrating a pointwise achievability result, which complements the lower bound of Theorem 6.

1.2 History

Despite its obvious practical relevance, the redundancy problem for lossy data compression at a fixed distortion level seems to have only been considered relatively recently, and, with few exceptions, attention has been restricted to questions regarding expected redundancy. In 1993, Yu and Speed [38] demonstrated the existence of a sequence of universal codes with expected redundancy rate of order $O(\log n)$ over the class of memoryless sources with finite source and reproduction alphabets. In the case of the Hamming distortion measure, Merhav in 1995 [21] proved a corresponding lower bound showing that the expected redundancy (even when the source distribution is known in advance) is bounded below by $(1/2)\log n$. The question was essentially settled by the work of Zhang, Yang and Wei [39] in 1997, where it is demonstrated that Merhav's lower bound is true quite generally, and corresponding direct coding theorems are given, exhibiting codes with redundancy bounded above by $[\log n + o(\log n)]$. A similar direct coding theorem for sources with abstract alphabets was recently proved by Yang and Zhang [35]. For universal coding at a fixed distortion level, Chou, Effros and Gray [8] showed that the price paid for being universal over $k$-dimensional parametric classes of sources is essentially $(k/2)\log n$. A universal direct coding theorem for memoryless sources over finite alphabets was recently reported by Yang and Zhang in [37].

With only a couple of notable exceptions from 1968 (Pilc [24] and Wyner [32]), the dual problem of lossy compression at a fixed rate level appears to also have been considered rather recently. Linder, Lugosi and Zeger [19][20] studied various aspects of the distortion redundancy problem and exhibited universal codes with distortion redundancy of order $O(\log n)$. Zhang, Yang and Wei [39] proved a lower bound of order $O(\log n)$, and they constructed codes achieving this lower bound (to first order). Coding for sources with abstract alphabets is considered in [35], and questions of universality are treated in [8] and [36], among many others.

The rest of the paper is organized as follows. In the next section we state and discuss our main results. Section 3 contains the proofs of the pointwise converses for memoryless sources (Theorems 1 and 4), and Section 4 contains the proofs of the corresponding direct coding theorems (Theorems 2, 3, and 5). In Section 5 we prove our results for arbitrary sources (Theorems 6-8), and the Appendices contain proofs of various technical steps needed along the way.

2 Results

Let $X = \{X_n;\ n \ge 1\}$ be a random source taking values in the source alphabet $A$, where $A$ is assumed to be a Polish space (i.e., a complete, separable metric space); let $\mathcal{A}$ denote its associated Borel $\sigma$-field. Although all our results will be stated for the general case, there is no essential "loss of ideas" in thinking of $A$ as being finite. For $1 \le i \le j \le \infty$, write $X_i^j$ for the vector of random variables $(X_i, X_{i+1}, \ldots, X_j)$ and similarly write $x_i^j = (x_i, x_{i+1}, \ldots, x_j) \in A^{j-i+1}$ for a realization of $X_i^j$.

Let $\hat{A}$ denote the reproduction alphabet. Given a nonnegative measurable function $\rho: A \times \hat{A} \to [0, \infty)$, we define a sequence of single-letter distortion measures $\rho_n: A^n \times \hat{A}^n \to [0, \infty)$, $n \ge 1$, by

\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i), \qquad x_1^n \in A^n,\ y_1^n \in \hat{A}^n.

Throughout the paper we will assume that the set $\hat{A}$ is finite, and that the function $\rho$ is bounded, i.e., $\rho(x, y) \le M < \infty$ for some fixed constant $M$, for all $x \in A$, $y \in \hat{A}$. Although these assumptions are not necessary for the validity of all of our results, they are made here for the sake of simplicity of the exposition. Without loss of generality, we also make the customary assumption that

\sup_{x \in A} \min_{y \in \hat{A}} \rho(x, y) = 0.    (5)

We are interested in variable-length block codes $C_n$ operating at a fixed distortion level, where $C_n = (B_n, \phi_n, \psi_n)$ is defined in terms of a subset $B_n$ of $\hat{A}^n$ called the codebook, an encoder $\phi_n: A^n \to B_n$, and a lossless (prefix-free) binary code $\psi_n: B_n \to \{0,1\}^*$ for $B_n$. For $D \ge 0$, we say that the code $C_n$ operates at distortion level $D$ if

\rho_n(x_1^n, \phi_n(x_1^n)) \le D \qquad \text{for all source strings } x_1^n \in A^n.

The length function $\ell_n: A^n \to \mathbb{N}$ induced by $C_n$ is defined by

\ell_n(x_1^n) = \text{length of } [\psi_n(\phi_n(x_1^n))],

so that $\ell_n(x_1^n)$ is the length (in bits) of the description of $x_1^n$ by $C_n$.

For $D \ge 0$ and $n \ge 1$, the $n$th-order rate-distortion function of $X$ (see, e.g., [4]) is defined by

R_n(D) = \inf_{(X_1^n, Y_1^n)} I(X_1^n; Y_1^n)

where $I(X_1^n; Y_1^n)$ denotes the mutual information (in bits) between $X_1^n$ and $Y_1^n$, and the infimum is over all jointly distributed random vectors $(X_1^n, Y_1^n)$ with values in $A^n \times \hat{A}^n$, such that $X_1^n$ has the source distribution and $E[\rho_n(X_1^n, Y_1^n)] \le D$; if there are no such $(X_1^n, Y_1^n)$, we let $R_n(D) = \infty$. [Similarly, throughout the paper, the infimum of an empty set is taken to be $+\infty$.] The rate-distortion function $R(D)$ of $X$ is defined as the limit of $(1/n) R_n(D)$ as $n \to \infty$, provided the limit exists.
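To make the preceding definitions concrete, here is a minimal illustrative sketch (not taken from the paper; the class and function names are hypothetical). The codebook $B_n$ is an explicit list, the encoder $\phi_n$ returns the first $D$-close codeword, and $\psi_n$ is an Elias-gamma-style prefix-free code of its index, so the length function $\ell_n$ simply counts the bits produced.

```python
# Minimal sketch (not from the paper): a toy variable-length block code
# operating at a fixed distortion level D, for a finite alphabet and
# Hamming distortion.  All names below are illustrative only.

from itertools import product

def hamming(x, y):
    """Single-letter distortion rho(x, y)."""
    return 0.0 if x == y else 1.0

def rho_n(x, y):
    """Average per-letter distortion rho_n(x_1^n, y_1^n)."""
    return sum(hamming(a, b) for a, b in zip(x, y)) / len(x)

def elias_gamma(i):
    """A prefix-free binary description of the positive integer i."""
    b = bin(i)[2:]
    return "0" * (len(b) - 1) + b

class BlockCode:
    def __init__(self, codebook, D):
        self.codebook = list(codebook)   # B_n, a subset of A-hat^n
        self.D = D

    def encode(self, x):
        """phi_n followed by psi_n: return (reproduction word, bit string)."""
        for i, y in enumerate(self.codebook, start=1):
            if rho_n(x, y) <= self.D:     # first D-close codeword
                return y, elias_gamma(i)
        raise ValueError("codebook does not cover this string at level D")

    def length(self, x):
        """The length function ell_n(x_1^n), in bits."""
        return len(self.encode(x)[1])

# Example: n = 4, binary alphabet, D = 0.25; the full product set trivially
# covers every source string, so the code operates at distortion level D.
code = BlockCode(product([0, 1], repeat=4), D=0.25)
print(code.encode((0, 1, 1, 0)), code.length((0, 1, 1, 0)))
```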

2.1 Second-Order Coding Theorems for Memoryless Sources

In this section we assume that $X$ is a memoryless source with fixed distribution $P$. That is, the random variables $\{X_n\}$ are i.i.d. according to $P$, where, strictly speaking, $P$ is a probability measure on $(A, \mathcal{A})$. As is well-known [4], the rate-distortion function of a memoryless source reduces to its first-order rate-distortion function

R(D) = R(P, D) = \inf_{(X, Y)} I(X; Y)    (6)

where the infimum is over all jointly distributed random variables $(X, Y)$ such that $X$ has distribution $P$ and $E[\rho(X, Y)] \le D$. Let

D_{\max} = D_{\max}(P) = \min_{y \in \hat{A}} E_P[\rho(X, y)]    (7)

and note that $R(D) = 0$ for $D \ge D_{\max}$ (see, e.g., Proposition 1 (iv) in Section 3). In order to avoid the trivial case when $R(D)$ is identically zero, we assume that $D_{\max} > 0$.

In our first result, Theorem 1, we give lower bounds on the pointwise deviations of the description lengths $\ell_n(X_1^n)$ of any code $C_n$ from the optimum $nR(D)$ bits. It is proved in Section 3.2 by an application of the general lower bound in Theorem 4.

Theorem 1: (Second-Order Converses) Let $X$ be a memoryless source with rate-distortion function $R(D)$, and let $D \in (0, D_{\max})$.

(i) CLT: There is a sequence of random variables $G_n = G_n(P, D)$ such that, for any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, we have

\ell_n(X_1^n) - nR(D) \ge \sqrt{n}\, G_n - 2\log n \quad \text{eventually, a.s.}    (8)

and the $G_n$ converge in distribution to a Gaussian random variable

G_n \xrightarrow{\mathcal{D}} N(0, \sigma^2)

with variance $\sigma^2$ explicitly identified.

(ii) LIL: With $\sigma^2$ as above, for any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$:

\limsup_{n\to\infty} \frac{\ell_n(X_1^n) - nR(D)}{\sqrt{2n \log_e\log_e n}} \ge \sigma \quad \text{a.s.}

\liminf_{n\to\infty} \frac{\ell_n(X_1^n) - nR(D)}{\sqrt{2n \log_e\log_e n}} \ge -\sigma \quad \text{a.s.}

[Recall that $\log_e$ denotes the natural logarithm and $\log \equiv \log_2$.]

Our next result, Theorem 2, shows that these lower bounds are tight. It is proved in Section 4 using a random coding argument. Although the construction is essentially identical to Shannon's classical argument, determining its pointwise asymptotic behavior is significantly more delicate, and it relies heavily on the recent results of Dembo and Kontoyiannis [10] and Yang and Zhang [35] on the asymptotics of the probability of "distortion balls." See the discussion after Theorem 4.

Theorem 2: (Direct Coding Theorem) Let $X$ be a memoryless source with rate-distortion function $R(D)$, and let $D \in (0, D_{\max})$. There is a sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, which achieve asymptotic equality (to first order) in all the almost-sure statements of Theorem 1:

(a) \lim_{n\to\infty} \left( \frac{\ell_n(X_1^n) - nR(D)}{\sqrt{n}} - G_n \right) = 0 \quad \text{a.s.}

(b) \limsup_{n\to\infty} \frac{\ell_n(X_1^n) - nR(D)}{\sqrt{2n \log_e\log_e n}} = \sigma \quad \text{a.s.}

(c) \liminf_{n\to\infty} \frac{\ell_n(X_1^n) - nR(D)}{\sqrt{2n \log_e\log_e n}} = -\sigma \quad \text{a.s.}

Remarks:

1. Variance. The variance $\sigma^2$ in Theorems 1 and 2 is a quantity characteristic of the source, which tells us that, when the source is encoded in the most efficient way, the deviations of the codeword lengths $\ell_n(X_1^n)$ from the optimum $nR(D)$ bits will have a variance roughly equal to $n\sigma^2$. If any other code is used, these deviations will be asymptotically bounded below by a Gaussian random variable of variance $n\sigma^2$. In view of this, we think of $\sigma^2 = \sigma^2(P, D)$ as the minimal coding variance of the source $P$ at distortion level $D$. The precise definition of $\sigma^2$ is given in the next section and its properties are discussed in some detail in Section 2.3. In particular, $\sigma^2$ is always nonnegative (typically it is strictly positive), and it can be expressed as

\sigma^2 = \mathrm{Var}(-\log F(X_1))    (9)

for some function $F: A \to (0, \infty)$.

2. Pointwise redundancy. Let $\{C_n, \ell_n\}$ be arbitrary codes operating at distortion level $D$. If $\sigma^2 > 0$, part (ii) of Theorem 1 says that when the codes $\{C_n\}$ are applied to almost any realization of the source $X$, then for infinitely many $n$,

\ell_n(X_1^n) - nR(D) \ge C\sqrt{n \log_e\log_e n},    (10)

where for $C$ we can take any constant $C \in (0, \sigma\sqrt{2})$. Moreover, the amount by which we can "beat" the rate-distortion function satisfies

\ell_n(X_1^n) - nR(D) \ge -C\sqrt{n \log_e\log_e n} \quad \text{eventually, a.s.}

The $O(\sqrt{n \log_e\log_e n})$ rate in (10) is in sharp contrast with the expected redundancy rates of order $O(\log n)$ reported in [39].

3. Expected versus pointwise redundancy. The difference between the two types of redundancy is reminiscent of the classical bias/variance trade-off in statistics. Here, if the goal is to design a lossy compression algorithm that will be used repeatedly and on large data sets, then it is probably a good idea to optimize the expected performance. On the other hand, if it is important to guarantee compression performance within certain bounds, it might be possible to give up some rate in order to reduce the variance.

4. Lossless compression. The results in Theorem 1 are close parallels of the corresponding lossless compression results in [15, Theorems 1 and 2]. There, the coding variance takes the simple form

\sigma^2 = \mathrm{Var}(-\log P(X_1))    (11)

(cf. equation (9) above), which can be viewed as the natural second-order analog of the entropy $H = E(-\log P(X_1))$. In the lossless case, the pointwise lower bounds are easily achieved, for example, by the Huffman code or the Shannon code [9]. In fact, it is well-known [30][18] that we can come within $O(\log n)$ of the Shannon code universally over all memoryless sources, for all message strings $x_1^n$. Therefore, in the lossless case, the same pointwise behavior can be achieved universally at no extra cost [15].

Next we show that the pointwise redundancy rates of Theorem 2 can be achieved universally over all memoryless sources on $A$. The proof of Theorem 3 (Section 4) is similar in spirit to that of Theorem 2, with the difference that here, in order to be universal, we generate multiple random codebooks and we allow the encoder to choose the best one. The additional cost of transmitting the index of the codebook that was used turns out to be negligible, and the pointwise behavior obtained is identical (up to terms of order $O(\log n)$) to that achieved with knowledge of the source distribution. The idea of multiple random codebooks is well-known in information theory, dating at least as far back as Ziv's 1972 paper [40] and the work of Neuhoff, Gray and Davisson in 1975 [22]. Nevertheless, determining the exact pointwise behavior of this random code is more delicate, and our analysis relies on recent results from [10] and [35].

Theorem 3: (Universal Coding) There is a sequence of universal codes $\{C_n, \ell_n\}$ operating at distortion level $D$, such that, if the data $(X_1, X_2, \ldots)$ are generated by any memoryless source $P$ on $A$, and if $D \in (0, D_{\max}(P))$, then:

(a') \lim_{n\to\infty} \left( \frac{\ell_n(X_1^n) - nR(P, D)}{\sqrt{n}} - G_n(P, D) \right) = 0 \quad P\text{-a.s.}

(b') \limsup_{n\to\infty} \frac{\ell_n(X_1^n) - nR(P, D)}{\sqrt{2n \log_e\log_e n}} = \sigma(P, D) \quad P\text{-a.s.}

(c') \liminf_{n\to\infty} \frac{\ell_n(X_1^n) - nR(P, D)}{\sqrt{2n \log_e\log_e n}} = -\sigma(P, D) \quad P\text{-a.s.}

where the random variables $G_n = G_n(P, D)$ and the variance $\sigma^2 = \sigma^2(P, D)$ are as in Theorem 1.

2.2 Main Results: Pointwise Optimality of the Shannon Code

In this section we state our main results, from which Theorems 1, 2 and 3 of the previous section will follow. Assume that $X$ is a memoryless source with distribution $P$ on $A$, and let $Q$ be an arbitrary measure on $\hat{A}$; since $\hat{A}$ is a finite set, we think of $Q$ simply as a discrete probability mass function (p.m.f.). For each $Q$, define

R(P, Q, D) = \inf_{(X, Y)} \left[ I(X; Y) + H(Q_Y \| Q) \right]    (12)

where $H(R\|Q)$ denotes the relative entropy (in bits) between two distributions $R$ and $Q$, $Q_Y$ denotes the distribution of $Y$, and the infimum is over all jointly distributed random variables $(X, Y)$ with values in $A \times \hat{A}$ such that $X$ has distribution $P$ and $E[\rho(X, Y)] \le D$. It is easy to see that the rate-distortion function of $X$ can be expressed as

R(D) = R(P, D) = \inf_{Q} R(P, Q, D)    (13)

where the infimum is over all p.m.f.s $Q$ on $\hat{A}$ (simply interchange the two infima). For each source $P$ on $A$ and distortion level $D \ge 0$, let $Q^* = Q^*(P, D)$ denote a p.m.f. achieving the infimum in (13):

R(D) = R(P, Q^*, D).    (14)

[See Proposition 2 (ii) in Section 3.1 for the existence of $Q^*$.] We call this $Q^*$ the optimal reproduction distribution for $P$ at distortion level $D$. For a fixed source $P$, a distortion level $D \in (0, D_{\max})$, and a corresponding $Q^*$ as in (14), we let $\Lambda_x(\lambda)$, $x \in A$, $\lambda \le 0$, be the log-moment generating function of the random variable $\rho(x, Y)$ when $Y \sim Q^*$:

\Lambda_x(\lambda) = \log_e E_{Q^*}\left[ e^{\lambda \rho(x, Y)} \right], \qquad x \in A,\ \lambda \le 0.

Then there exists a unique $\lambda = \lambda^* < 0$ such that $\frac{d}{d\lambda}\left[ E_P(\Lambda_X(\lambda)) \right] = D$ at $\lambda = \lambda^*$ (see Lemma 1 in Section 3.1).

Our next result, Theorem 4, shows that the pointwise redundancy of any sequence of block codes is essentially bounded below by a sum of i.i.d. random variables. As we discuss in the remarks following Theorem 4, this lower bound can be interpreted as saying that the performance of any sequence of block codes is dominated by the performance of Shannon's random code. Theorem 4 is proved in Section 3.2.
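As an aside, here is a small illustrative sketch (under standard assumptions, not part of the paper; all names are hypothetical) of how $Q^*$ and $\lambda^*$ can be computed numerically for a finite-alphabet source: run the classical fixed-slope Blahut-Arimoto iteration and bisect on the slope until the induced distortion equals $D$. At the solution the slope is precisely $\lambda^*$, since the distortion produced at slope $\lambda$ equals $\frac{d}{d\lambda} E_P[\Lambda_X(\lambda)]$.

```python
# Sketch (not from the paper; all names illustrative): compute the optimal
# reproduction distribution Q* and the slope lambda* for a finite-alphabet
# source via fixed-slope Blahut-Arimoto plus bisection on the slope.

import math

def blahut_arimoto(P, rho, s, iters=500):
    """Fixed-slope Blahut-Arimoto: returns (Q, distortion, rate_in_bits)."""
    nx, ny = len(P), len(rho[0])
    Q = [1.0 / ny] * ny
    for _ in range(iters):
        Z = [sum(Q[y] * math.exp(s * rho[x][y]) for y in range(ny)) for x in range(nx)]
        Q = [sum(P[x] * Q[y] * math.exp(s * rho[x][y]) / Z[x] for x in range(nx))
             for y in range(ny)]
    Z = [sum(Q[y] * math.exp(s * rho[x][y]) for y in range(ny)) for x in range(nx)]
    c = lambda x, y: Q[y] * math.exp(s * rho[x][y]) / Z[x]   # optimal W(y|x)
    dist = sum(P[x] * c(x, y) * rho[x][y] for x in range(nx) for y in range(ny))
    rate = sum(P[x] * c(x, y) * math.log2(math.exp(s * rho[x][y]) / Z[x])
               for x in range(nx) for y in range(ny))
    return Q, dist, rate

def optimal_reproduction(P, rho, D, lo=-60.0, hi=-1e-9):
    """Bisect on the slope s = lambda* until the distortion matches D."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if blahut_arimoto(P, rho, mid)[1] < D:
            lo = mid
        else:
            hi = mid
    s = (lo + hi) / 2
    Q, _, rate = blahut_arimoto(P, rho, s)
    return s, Q, rate

# Bernoulli(0.3) source with Hamming distortion at D = 0.1: expect
# lambda* = ln(D/(1-D)) ~ -2.197, Q* ~ Bernoulli(0.25), R(D) = h(0.3)-h(0.1) ~ 0.412.
s, Q, R = optimal_reproduction([0.7, 0.3], [[0, 1], [1, 0]], 0.1)
print(round(s, 3), [round(q, 3) for q in Q], round(R, 3))
```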


Theorem 4: (Pointwise Lower Bound) Let $X$ be a memoryless source with rate-distortion function $R(D)$, and let $D \in (0, D_{\max})$. For any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$ and any sequence $\{b_n\}$ of positive constants such that $\sum_n 2^{-b_n} < \infty$, we have

\ell_n(X_1^n) - nR(D) \ge \sum_{i=1}^n f(X_i) - b_n \quad \text{eventually, a.s.}    (15)

where

f(x) \triangleq (\log e)\left( -\Lambda_x(\lambda^*) - E_P[-\Lambda_X(\lambda^*)] \right).    (16)

Remarks:

1. Consequences. It is easy to see that Theorem 1 is an immediate consequence of the lower bound (15). In particular, the coding variance $\sigma^2$ in Theorems 1 and 2 is simply the variance of the random variable $f(X_1)$.

2. Intuition. Suppose we generate a random (Shannon) codebook according to $Q^*$, that is, we generate i.i.d. codewords $Y(i) = (Y_{i,1}, Y_{i,2}, \ldots, Y_{i,n})$, $i = 1, 2, \ldots$, each drawn from the distribution $Q^*_n = (Q^*)^n$. We can encode each source sequence $X_1^n$ by specifying the index $i = W_n$ of the first codeword $Y(i)$ such that $\rho_n(X_1^n, Y(i)) \le D$. This description takes approximately $\log W_n$ bits. But $W_n$, the "waiting time" until the first $D$-close match for $X_1^n$, is approximately equal to the reciprocal of the probability of finding such a match, so

\log W_n \approx \log[1 / Q^*_n(B(X_1^n, D))]

where the "distortion balls" $B(x_1^n, D)$ are defined by

B(x_1^n, D) = \{ y_1^n \in \hat{A}^n : \rho_n(x_1^n, y_1^n) \le D \}, \qquad x_1^n \in A^n.    (17)

From the recent work of Dembo and Kontoyiannis [10] and Yang and Zhang [35] we know that these probabilities behave like

\log[1 / Q^*_n(B(X_1^n, D))] = nR(D) + \sum_{i=1}^n f(X_i) + \frac{1}{2}\log n + O(\log\log n) \quad \text{a.s.}    (18)

[See Proposition 3 in Section 4.] Therefore, the pointwise description length of the Shannon code is, approximately,

\log W_n \approx nR(D) + \sum_{i=1}^n f(X_i) \ \text{bits, a.s.}
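The waiting-time picture above is easy to simulate. The following sketch (not from the paper; parameters and names are illustrative only) draws i.i.d. codewords from $Q^*$ for a Bernoulli source with Hamming distortion and compares $\log_2 W_n$ with the leading term $nR(D)$ of (18); for small $n$ the agreement is only rough, since the $\sum_i f(X_i)$ and $\frac{1}{2}\log n$ terms are not negligible.

```python
# Minimal simulation sketch (not from the paper) of the Shannon-code intuition:
# draw i.i.d. codewords from the optimal reproduction distribution Q* and record
# the waiting time W_n until the first D-close match.  Assumes a Bernoulli(p)
# source with Hamming distortion, for which Q* is Bernoulli(q) with
# q = (p - D)/(1 - 2D) and R(D) = h(p) - h(D) bits.

import math
import random

def h(x):
    """Binary entropy in bits."""
    return 0.0 if x in (0.0, 1.0) else -x*math.log2(x) - (1-x)*math.log2(1-x)

def waiting_time(x, q, D, max_draws=2_000_000):
    """Index of the first i.i.d. Bernoulli(q) codeword within distortion D of x."""
    n = len(x)
    for i in range(1, max_draws + 1):
        y = [1 if random.random() < q else 0 for _ in range(n)]
        if sum(a != b for a, b in zip(x, y)) / n <= D:
            return i
    return None

p, D, n = 0.4, 0.15, 24
q = (p - D) / (1 - 2*D)
x = [1 if random.random() < p else 0 for _ in range(n)]   # one source realization
W = waiting_time(x, q, D)
print("log2 W_n =", round(math.log2(W), 2))
print("n * R(D) =", round(n * (h(p) - h(D)), 2))   # leading term in (18)
```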

In view of this, we can rephrase Theorem 4 by saying that, in a strong sense, the performance of any code is bounded below by the performance of the Shannon code. Indeed, the proof of Theorem 4 is based on first showing that any code can be thought of as a random code (according to a measure on $\hat{A}^n$ different from $Q^*_n$), and then proving that the pointwise performance of any random code is dominated by the performance of the Shannon code.

The following result, Theorem 5, formalizes the above random coding argument, and also shows that essentially the same performance can be achieved universally over all memoryless sources.

Theorem 5: (The Shannon Code - Random Coding)

(i) Let $X$ be a memoryless source with rate-distortion function $R(D)$, and let $D \in (0, D_{\max})$. There is a sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, such that

\ell_n(X_1^n) \le \log[1 / Q^*_n(B(X_1^n, D))] + 4\log n + \text{Const.} \quad \text{eventually, a.s.}

where $Q^*$ is the optimal reproduction distribution at distortion level $D$, and $Q^*_n = (Q^*)^n$.

(ii) There is a sequence of universal codes $\{C_n, \ell_n\}$ operating at distortion level $D$, such that, if the data $(X_1, X_2, \ldots)$ are generated by any memoryless source $P$ on $A$, and if $D \in (0, D_{\max}(P))$, then

\ell_n(X_1^n) \le \log[1 / Q^*_n(B(X_1^n, D))] + (4 + k)\log n + \text{Const.} \quad \text{eventually, } P\text{-a.s.}

where $k$ is the number of elements in $\hat{A}$, $Q^* = Q^*(P, D)$ is the optimal reproduction distribution corresponding to the true source $P$ at distortion level $D$, and $Q^*_n = (Q^*)^n$.

In the next Corollary we combine Theorem 4 (with $b_n = (1 + \varepsilon)\log n$) with (18) and Theorem 5 (i), to rewrite the above results in an intuitively more appealing form. Although Corollary 1 is a simple consequence of Theorems 4 and 5, conceptually it contains the main contribution of this paper:

Corollary 1: (Pointwise Optimality of the Shannon Code) Under the assumptions of Theorem 4, for any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, we have

\ell_n(X_1^n) \ge \log[1 / Q^*_n(B(X_1^n, D))] - 2\log n \quad \text{eventually, a.s.}    (19)

where the Shannon code achieves

\ell_n(X_1^n) \le \log[1 / Q^*_n(B(X_1^n, D))] + 4\log n + \text{Const.} \quad \text{eventually, a.s.}

2.3 Minimal Coding Variance

Suppose $X$ is a memoryless source with distribution $P$, let $D \in (0, D_{\max})$, and let $Q^*$ be the corresponding optimal reproduction distribution. In the notation of the previous section, the minimal coding variance $\sigma^2 = \mathrm{Var}[f(X_1)]$ can be written as

\sigma^2 = \mathrm{Var}_P\left( -\log E_{Q^*}\left[ e^{\lambda^*[\rho(X, Y) - D]} \right] \right)

(here $X$ and $Y$ are independent random variables with distributions $P$ and $Q^*$, respectively). As we will see in Section 3.1, the rate-distortion function $R(D)$ can similarly be expressed as

R(D) = E_P\left( -\log E_{Q^*}\left[ e^{\lambda^*[\rho(X, Y) - D]} \right] \right).

Comparing the last two expressions suggests that we may think of $\sigma^2$ as a second-order version of $R(D)$, and further justifies the term minimal coding variance, by analogy to the minimal coding rate $R(D)$. It is obvious that $\sigma^2$ is always nonnegative, and it is typically strictly positive, since the only way it can be zero is if the expectation $E_{Q^*}[e^{\lambda^* \rho(x, Y)}]$ is constant for $P$-almost all $x \in A$. We give three simple examples illustrating this.

E.g. 1. Lossless Compression. As mentioned above, in the case of lossless compression the minimal coding variance reduces to $\sigma^2 = \mathrm{Var}[-\log P(X_1)]$ (cf. eq. (11)), from which it is immediate that $\sigma^2 = 0$ if and only if $P$ is the uniform distribution over the finite alphabet $A$. [See [15] for more details and a corresponding characterization for Markov sources.]

E.g. 2. Binary Source, Hamming Distortion. This is the simplest non-trivial lossy example. Suppose $X$ has a Bernoulli($p$) distribution for some $p \in (0, 1/2]$. Let $A = \hat{A} = \{0, 1\}$ and let $\rho$ be the Hamming distortion measure, $\rho(x, y) = |x - y|$. For $D \in (0, p)$, easy but tedious calculations (cf. [4, Example 2.7.1] and [9, Theorem 13.3.1]) show that $Q^*$ is a Bernoulli($q$) distribution with $q = (p - D)/(1 - 2D)$, $\lambda^* = \log_e(D/(1 - D))$, and

E_{Q^*}\left[ e^{\lambda^* \rho(x, Y)} \right] = \frac{P(x)}{1 - D}.

As we already noted, $\sigma^2 = 0$ if and only if the above expression is constant in $x$, so that, here, $\sigma^2 = 0$ if and only if $p = 1/2$, i.e., if and only if $P$ is the uniform distribution on $A = \{0, 1\}$.

E.g. 3. A Quaternary Example. This is a standard example from Berger's text [4, Example 2.7.2]. Suppose $X$ takes values in the alphabet $A = \{1, 2, 3, 4\}$ and let $\hat{A} = A$. Suppose the distribution of $X$ is given by $P = (p/2, (1-p)/2, (1-p)/2, p/2)$, for some $p \in (0, 1/2]$, and let the distortion measure $\rho$ be specified by the matrix $(\rho_{ij}) = (\rho(i, j))$, $i, j \in A$, where

(\rho_{ij}) = \begin{pmatrix} 0 & 1/2 & 1/2 & 1 \\ 1/2 & 0 & 1 & 1/2 \\ 1/2 & 1 & 0 & 1/2 \\ 1 & 1/2 & 1/2 & 0 \end{pmatrix}.

For $D \in (0, (1 - \sqrt{1 - 2p})/2)$ it is possible (although tedious) to calculate $Q^*$ and $\lambda^*$ explicitly, and to obtain that

E_{Q^*}\left[ e^{\lambda^* \rho(x, Y)} \right] = \frac{P(x)}{(1 - D)^2}.

Once again, this is generally not constant in $x$, with the exception of the case $p = 1/2$. So, $\sigma^2$ will be strictly positive unless $P$ is the uniform distribution on $A = \{1, 2, 3, 4\}$. There is an obvious trend in all three examples above: the variance $\sigma^2$ is strictly positive unless $P$ is uniformly distributed over $A$. It is an interesting problem to determine how generally this pattern persists.
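The closed forms quoted in E.g. 2 are easy to check numerically. The following sketch (an illustration, not part of the paper) verifies the displayed identity $E_{Q^*}[e^{\lambda^*\rho(x, Y)}] = P(x)/(1 - D)$ and recovers the classical formula $R(D) = h(p) - h(D)$; since the $D$-dependent factors are constant in $x$, the resulting minimal coding variance coincides for this example with the lossless value $\mathrm{Var}_P(-\log P(X_1))$.

```python
# Sketch (assumes the closed forms quoted above for the Bernoulli/Hamming
# example): check E_{Q*}[e^{lambda* rho(x,Y)}] = P(x)/(1-D), recover
# R(D) = h(p) - h(D), and compute the minimal coding variance sigma^2.

import math

def h(x):                       # binary entropy, bits
    return -x*math.log2(x) - (1-x)*math.log2(1-x)

p, D = 0.3, 0.1
P   = {0: 1-p, 1: p}
q   = (p - D) / (1 - 2*D)       # Q* = Bernoulli(q)
Q   = {0: 1-q, 1: q}
lam = math.log(D / (1 - D))     # lambda* < 0

for x in (0, 1):
    mgf = sum(Q[y] * math.exp(lam * abs(x - y)) for y in (0, 1))
    assert abs(mgf - P[x]/(1 - D)) < 1e-12   # the displayed identity

# R(D) = E_P[-log E_{Q*}[e^{lambda*(rho(X,Y) - D)}]]  (bits)
RD = sum(P[x] * -math.log2(P[x]/(1-D) * math.exp(-lam*D)) for x in (0, 1))
print(round(RD, 6), round(h(p) - h(D), 6))   # the two values agree

# sigma^2: variance of the bracketed quantity above, in bits^2
vals = [-math.log2(P[x]/(1-D) * math.exp(-lam*D)) for x in (0, 1)]
mean = sum(P[x]*v for x, v in zip((0, 1), vals))
sigma2 = sum(P[x]*(v - mean)**2 for x, v in zip((0, 1), vals))
print(round(sigma2, 6))
```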

2.4 Sources With Memory

Here we present analogs of Theorems 4 and 5 for arbitrary sources. Of course, at this level of generality, the results we get are not as strong as the ones in the memoryless case. Still, adopting a different approach, we are able to get interesting partial generalizations.

Let $X$ be an arbitrary source with values in $A$, and let $P_n$ denote the distribution of $X_1^n$. By a sub-probability measure $Q_n$ on $\hat{A}^n$ we mean a positive measure with total mass $0 < Q_n(\hat{A}^n) \le 1$. For each $D \ge 0$ we define (recall the notation in (17))

K_n(D) = K_n(P, D) = \inf_{Q_n} E_{P_n}\left\{ -\log Q_n(B(X_1^n, D)) \right\}

where the infimum is over all sub-probability measures $Q_n$. Suppose $\widetilde{Q}_n$ achieves the above infimum (the existence of such a $Q_n$ is established by Lemma 3 in Section 5). Our next result gives an analog of Theorem 4 for the case of sources with memory. It is proved in Section 5 using an argument similar to the one used by Kieffer in the proof of Theorem 2 in [14].

Theorem 6: (A Lossy "Barron's Lemma") Suppose $X$ is an arbitrary source, and let $D \ge 0$ be such that $K_n(D) < \infty$. For any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, we have:

(i) For all $n$:

E[\ell_n(X_1^n)] \ge K_n(D) \ge R_n(D).

(ii) For any sequence $\{b_n\}$ of positive constants such that $\sum_n 2^{-b_n} < \infty$:

\ell_n(X_1^n) \ge \log[1 / \widetilde{Q}_n(B(X_1^n, D))] - b_n \quad \text{eventually, a.s.}    (20)

The lower bound in (20) is a natural "lossy analog" of a well-known result from lossless compression, often called "Barron's Lemma" [1][2]. Barron's Lemma states that for any sequence of lossless codes $\{C_n, \ell_n\}$,

\ell_n(X_1^n) \ge \log[1 / P_n(X_1^n)] - b_n \quad \text{eventually, a.s.}

Similarly, we can interpret the lower bound of Corollary 1 (equation (19)) as a different generalization of Barron's Lemma, valid only for memoryless sources. The reason why (19) is preferable over (20) is that the $Q^*_n$ are product measures, whereas it is apparently hard to characterize the measures $\widetilde{Q}_n$ in general. For example, in the case of memoryless sources one would expect that, for large $n$, the measures $\widetilde{Q}_n$ "converge" to the measures $Q^*_n$ in some sense. The only way in which we have been able to make this intuition precise is by proving the following result asserting the asymptotic equivalence between the compression performance of $\widetilde{Q}_n$ and $Q^*_n$.

Theorem 7: (Equivalence of Measures) Let $X$ be a memoryless source with distribution $P$, and let $D \in (0, D_{\max})$. Then:

\log[1 / \widetilde{Q}_n(B(X_1^n, D))] = \log[1 / Q^*_n(B(X_1^n, D))] + O(\log n) \quad \text{a.s.}

Clearly, Theorem 7 can be combined with the recent results on the probabilities of $D$-balls mentioned in (18) to give alternative proofs of the pointwise converses in Theorems 1 and 4.

Finally, we state a direct coding theorem demonstrating that the lower bound in Theorem 6 is asymptotically tight. It is proved in Section 5 using a random coding argument.

Theorem 8: (Partial Achievability) Suppose $X$ is an arbitrary source, and let $D \ge 0$ be such that $K_n(D) < \infty$. Then there is a sequence of codes $\{C_n, \ell_n\}$ operating at distortion level $D$, such that:

\ell_n(X_1^n) \le \log[1 / \widetilde{Q}_n(B(X_1^n, D))] + 2\log n + 2\log\log_e\left[ \frac{2 n^2}{\widetilde{Q}_n(B(X_1^n, D))} \right] + \text{Const.} \quad \text{eventually, a.s.}

Theorem 8, combined with Theorem 7 and (18), provides alternative proofs for Theorem 2 and Theorem 5 (i). Lastly, we remark that, for "nice" measures $\widetilde{Q}_n$, it is natural to expect that the probabilities $\widetilde{Q}_n(B(X_1^n, D))$ will decay to zero exponentially fast. This would be true, for example, if the $\widetilde{Q}_n$ were the finite-dimensional marginals of a "nice" process, like a Markov chain [34] or a process with "rapid mixing" properties [7]. In that case, the "log log" term in Theorem 8 would grow like $\log n$, implying that the lower bound of Theorem 6 is tight up to terms of order $O(\log n)$.

3 Converses for Memoryless Sources

In Section 3.1 we collect some useful technical facts, and in Section 3.2 we prove Theorem 4 and use it to deduce Theorem 1.

3.1 Representations and Properties of R(P, Q, D)

For n  1, let  and  be arbitrary probability measures on An and A^n , respectively (of course, since A^ is a nite set,  is a discrete p.m.f. on A^n ). Write X1n for a random vector with distribution  on An , and Y1n for an independent random vector with distribution  on A^n . Let Sn = fy1n 2 A^n :  (y1n) > 0g  A^n denote the support of ; and de ne ; Dmin ; Dmax





n n = E ymin n 2Sn n (X1 ; y1 ) 1 = E [ n (X1n ; Y1n )] :

;  D; < 1: For   0, we de ne Clearly, 0  Dmin max

i  h ; () = E loge E en (X n ;Y n ) 1

1

and, for D  0, we write ; for the Fenchel-Legendre transform of ; : ; (D) = sup [D ? ; ()]: 0

In analogy with (12), we also de ne n n n R(; ; D) = (Xinf n ;Z n )[I (X1 ; Z1 ) + H (QZ k )] 1

1

1

where H (RkQ) denotes the relative entropy (in bits) between two distributions R and Q, QZ n denotes the distribution of Z1n, and the in mum is over all jointly distributed random vectors (X1n ; Z1n ) with values in An  A^n such that X1n has distribution  and E [n (X1n ; Z1n )]  D. In the next Lemma we collect various standard properties of ; and ; . Parts (i){(iii) can be found, e.g., in [10] or [17]; part (iv) is proved in Appendix I. 1

Lemma 1: ; , and 0 () ! D; as  ! ?1. (i) ; is in nitely di erentiable on (?1; 0), 0; (0) = Dmax ; min ; < D; , then 00 () > 0 for all   0. (ii) 00; ()  0 for all   0; if, moreover, Dmin max ; ; < D; and D 2 (D; ; D; ), then there exists a unique  < 0 such that 0 () = D (iii) If Dmin max ; min max  and ; (D) = D ? ; ().

16

(iv) For every   0 and every probability measure  on A, ; () is upper semicontinuous as a function of  . In the following Propositions we give two alternative representations of the function R(; ; D), and state several of its properties. Propositions 1 and 2 are proved in Appendices II and III, respectively. Proposition 1: (Representations of R(; ; D))

(i) For all D  0,

R(; ; D) = inf E [H ((jX1n )k ())] ;  

where the in mum is taken over all probability measures  on AnA^n such that the An -marginal of  equals  and E [n (X1n ; Y1n )]  D: (ii) For all D  0 we have:

R(; ; D) = (log e); (D):

Proposition 2: (Properties of R(; ; D))

(i) For every D  0 and every probability measure  on An , R(; ; D) is lower semicontinuous as a function of  . (ii) For every D  0, there exists a p.m.f. Q = Q on A^ achieving the in mum in (14). ; , R(; ; D) = 1; for D; < D < D; , 0 < R(; ; D) < 1; and for D  D; , (iii) For D < Dmin max max min R(; ; D) = 0.

(iv) For 0 < D < Dmax we have 0 < R(D) < 1, whereas for D  Dmax , R(D) = 0. P;Q < DP;Q and D 2 (DP;Q ; DP;Q ). (v) If D 2 (0; Dmax ), then Dmin max max min 







3.2 Proofs of Converses Proof of Theorem 1 from Theorem 4: Taking bn = 2 log n in Theorem 4, yields

n X n (21) `n(X1 ) ? nR(D) = f (Xi) ? 2 log n eventually, a.s. i=1 p P Writing Gn = (1= n) ni=1 f (Xi ) we get (8) since the random variables ff (Xn )g are zero-mean, D N (0; 2 ), where 2 = bounded, i.i.d. random variables, the ordinary CLT implies that Gn ?!

Var(f (X1 )). This proves part (i).

17

p

Next, dividing both sides of (21) by 2n loge loge n, letting n ! 1, and invoking the classical LIL (see, e.g., [6, Theorem 13.25] or [12, p. 437]), immediately gives the two statements in part (ii). 2 Proof of Theorem 4: Let fCn = (Bn ; n ; n )g be an arbitrary sequence of block codes operating at distortion level D, and let f`n g be the sequence of corresponding length functions. By Proposition 2 (ii) we can choose a p.m.f. Q on A^ so that R(D) = R(D; Q ; D). Since we assume D 2 (0; Dmax ), P;Q ; DP;Q ), so by Lemma 1 we can pick a  < 0 with Proposition 2 (v) implies that D 2 (Dmin max

 D ? P;Q ( ) = P;Q (D) = (loge 2)R(P; Q ; D) = (loge 2)R(D);

(22)

where the second equality comes from Proposition 1 (ii). Since n is a pre x-free lossless code (for each n), it induces a length function Ln on Bn given by

Ln(y1n) = length of [ n(y1n )]; y1n 2 Bn: The functions Ln and `n are clearly related by `n(xn1 ) = Ln(n (xn1 )). The key idea of the proof is to consider the following sub-probability measure on A^n : 4 QCn (F ) =

X

2?Ln (yn ) ; for all F  A^n : 1

y1n 2F \Bn

Note that QCn is supported entirely on Bn ; the fact that it is a sub-probability measure follows by the Kraft inequality. Our main use for QCn will be to bound the description lengths `n (xn1 ) in terms of an expectation over QCn . For any xn1 2 An : 2?`n (xn ) = 2?Ln (n (xn )) (a)  2?Ln (n (xn )) en[n(xn ;n(xn ))?D] X ?Ln (yn ) n[n(xn ;yn)?D] 2 e  1

1

1

1

1

1

1

1

y1n 2Bn

  = EQCn en [n (xn ;Y n )?D] 1

1

where (a) follows from the fact that Cn operates at distortion level D. This gives the lower bound





`n (xn1 )  ? log EQCn en [n(xn ;Y n )?D] : 1

1

(23)

Now consider the following family of functions on An ,

n





Fn =4 g : g(xn1 ) = EQn en[n(xn ;Y n )?D] for a sub-probability measure Qn on A^n 1

1

and notice that Fn is a convex family. We are interested in the in mum

n



 n n n f? loge g (X1n )g = inf EP n ? loge EQn en [n (X ;Y )?D] E inf P Qn g2Fn

18

1

1

o

o (24)

where P n denotes the distribution X1n . According to Lemma 2 below, this in mum is achieved by the function g 2 Fn de ned in terms of the measure Qn = (Q )n ,





g (xn1 ) = EQn en[n(xn ;Y n )?D] =



i.e.,

1

1

 g(X n ) 

EP n loge g (X1n ) 1

n Y i=1





EQ e [(xi;Y )?D] ;

(25)

 0; for all g 2 Fn:

But these are exactly the Kuhn-Tucker conditions for the optimality of g in (24); therefore, by [3, Theorem 2] we have that  g(X n )  EP n g (X1n)  1; for all g 2 Fn : (26) 1 The result of Theorem 4 can now be proved as follows. De ne gn 2 Fn by





gn(xn1 ) = EQCn en[n (xn ;Y n )?D] : 1

1

(27)

Recall the function f on A de ned in (16), and observe that, using (22), it can be rewritten as





f (x) = ?R(D) ? log EQ e[(x;Y )?D] :

(28)

Then, the probability that the assertion of the Theorem fails can be bounded above as Pr

(

) n X n `n (X1 ) ? nR(D) < f (Xi ) ? bn ) ( i=1 n  n [n(X n ;Y n )?D] X (a) 1 1 < [f (Xi ) + R(D)] ? bn  Pr ? log EQCn e i=1 (b)  n n = Pr f?  g log(Xgnn)(X1 ) 2bn 1 g (X n )  (c)  2?bn EP n gn (X1n ) 1 (d) ?b  2 n;

(29)

where (a) follows from the bound (23), (b) follows from the de nitions of f , gn and g in equations (28) (27) and (25), (c) is simply Markov's inequality, and (d) follows from the Kuhn-Tucker conditions (26) with g = gn . Now since the sequence 2?bn is summable by assumption, an application of the Borel-Cantelli lemma to the bound in (29) completes the proof. 2 Lemma 2: The in mum in equation (24) is achieved by Qn = (Q )n .

19

Proof of Lemma 2: Write Pn (or SP n ) for the collection of all probability (respectively, subprobability) measures on A^n . For   0 and Qn a p.m.f. on A^n , let

h(; Qn) = nD ? P n;Qn (n): Taking Qn = Qn = (Q )n in the in mum in (24), the result of the Lemma follows from the following series of relations:

n



EP n ? loge EQn en[n(X n ;Y n )?D] 1

1

o

(a)

= n D ? P n ;Qn (n )

(b)

= (c) = (d) = (e) = (f ) = (g)

n [ D ? P;Q ( )] (loge 2)nR(P; Q ; D) (loge 2)nR(D) (loge 2)Rn (D) R(P n ; Qn ; D) (loge 2) Q inf 2P n

n

sup h(=n; Qn ) = Q inf n 2Pn 0

(h)

sup h(; Qn ) = Q inf n 2Pn 0

(i)

h(; Qn ) = sup Q inf 0 n2Pn

(j )

h( ; Qn ) = Q inf n 2Pn n   n n o (k) n ? loge EQn en [n (X ;Y )?D] E = Q inf P n 2Pn

n

(`)



1

1

o

 n n n ? loge EQn en [n (X ;Y )?D] ; E = Q inf P n 2SP n where (a) and (b) follow from the de nitions of P n ;(Q )n and P;Q , respectively; (c) follows from the choice of  and (d) follows from the choice of Q in (14); (e) follows from the well-known fact that the rate-distortion function of a vector of n i.i.d. random variables equals n times the rate-distortion function of one of them; (f ) follows in the same way as we noted in the Introduction that (6) is the same as (13); (g) follows from Proposition 1 (ii); (h) follows simply by replacing  by n; (i) follows by an application of the minimax theorem (see below for an explanation); (j ) follows from a simple continuity argument given below; (k) follows from the de nitions of the functions h and P n ;Qn ; and (`) follows from noting that if Qn is strictly a sub-probability measure with Qn (A^n ) = Z < 1, then using the probability measure Q0n () = Z ?1 Qn () we can make the expression that is being minimized on line (`) smaller by log Z < 0.

20

1

1

To justify the application of the minimax theorem in step (i), rst we note that, by Lemma 1 (iv), h is lower semicontinuous as a function of Qn , and by Lemma 1 (i) it is a continuous function of : Also, since by Lemma 1 (ii) 00P n ;Qn ()  0, h is concave in , and by Jensen's inequality (and the concavity of the logarithm), h is convex in Qn. And since the space of all p.m.f.s Qn on A^n is compact, we can invoke Sion's minimax theorem [31, Corollary 3.3] to justify the exchange of the in mum and the supremum in step (i). Finally we need to justify step (j ). For   0 de ne the functions M () = h(; Qn ), and

h(; Qn): m() = Q inf 2P n

n

Our goal is to show sup m() = m( ): 0

(30)

From the above argument (a){(i), we have that

M ( ) = sup M () = sup m(): 0

0

As noted in the beginning of the proof of Theorem 4, M () is strictly concave and its supremum is achieved uniquely by  < 0, so we may restrict our attention to an interval I = [ ? ;  + ]  (?1; 0), and since m()  M () for all ,

M ( ) = sup M () = sup m(): 2I

2I

By the de nition of m as the in mum of continuous functions it follows that it is upper semicontinuous (see, e.g., [27, p. 38]), so it achieves its supremum on the compact set I . Since m()  M () for all , and M () = M ( ) only at  , this implies that the supremum of m over I must also be achieved at  , giving (30). 2

4 Shannon's Random Codes and D-Ball Probabilities In this section we prove Theorem 5, and we deduce Theorems 2 and 3 from it. We continue in the notation of the previous section, and recall the de nition of a distortion ball in (17). Let X be a memoryless source with distribution P , x D 2 (0; Dmax ), let Q be the optimal reproduction distribution for P at distortion level D, and write Qn = (Q )n . The following proposition will be used repeatedly in the proofs. If follows easily from some recent results in [10] and [35]; see Appendix VI. 21

Proposition 3: If D 2 (0; Dmax ), we have:

? log Qn(B (X1n ; D)) = nR(D) +

n X

f (Xi) + 21 log n + O(log log n) a.s. i=1

Proofs of Theorems 2 and 3 from Theorem 5: Combining Theorem 5 (i) with Proposition 3 gives n X n `n(X1 ) ? nR(D)  f (Xi ) i=1

+ O(log n) a.s.

In view of the corresponding lower bound in Theorem 4 (with bn = 2 log n), this, together with the classical CLT and the LIL applied to the sum of the bounded, zero-mean, i.i.d. random variables ff (Xi)g, yield the three statements of Theorem 2. Theorem 3 follows from Theorem 5 (ii) in exactly the same way. 2 Proof of Theorem 5 (i): For each n  1 we generate a random codebook according to Qn = (Q )n : Let Y (i) = (Yi;1 ; Yi;2 ; : : : ; Yi;n ); i = 1; 2; : : : ; be i.i.d random vectors in A^n , each drawn according to Qn. Given a source string X1n to be encoded, let Wn = Wn(X1n ) denote the rst index i such that X1n matches the ith codeword with distortion D or less:

Wn = inf fi  1 : n (X1n ; Y (i))  Dg: If such a match exists, we describe X1n (with distortion no more than D) by describing the integer Wn to the decoder; this can be done using (cf. [13][33])

dlog(Wn + 1)e + 2dlog (dlog(Wn + 1)e + 1)e  log Wn + 2 log log(2Wn) + 10 bits. Otherwise, we describe X1n exactly, using dn log jA^je bits (recall (5)). Our code Cn consists of combining these two descriptions, together with a one-bit ag to specify which one was chosen. Next we show that, for large n, the \waiting time" Wn until a D-match is found can be approximated by the reciprocal of the probability of nding such a match, Qn (B (X1n ; D)). Speci cally, we 4 claim that the di erence n = log Wn ? log[1=Qn (B (X1n ; D))] satis es

n  2 log n eventually, a.s.,

(31)

where the almost-sure statement above (and also in all subsequent statements) is with respect to P -almost any source realization, and almost any sequence of codebooks generated according to the above procedure. We prove (31) by an argument along the lines of the \strong approximation" results 22

n  in [16][10][17]. Let Gm = fx1 1 : 0 < Qn (B (x1 ; D)) < 1=2 for all n  mg. Proposition 2 implies that, eventually, almost every string x1 1 generated by X will belong to a Gm , i.e.,

P ([m1 Gm) = 1;

(32)

where, with a slight abuse of notation, we write P for the one-dimensional marginal of the distribution of X as well as the in nite dimensional product distribution it induces. Now let n  m  1. Conditional on X11 = x1 1 2 Gm , the waiting time Wn has a geometric distribution with parameter 4  pn = Qn(B (X1n; D)), so that Prfn > 2 log n j X1n = xn1 g = PrfWn > n2 =pn j X1n = xn1 g  (1 ? pn)(n =pn )?1  2(1 ? pn)n =pn  2=n2 2

2

where the last step follows from the inequality (1 ? p)M  1=(Mp), for p 2 (0; 1) and M > 0. Since 2 this bound is uniform over x1 1 2 Gm , and (2=n ) is summable, the Borel-Cantelli lemma implies that n  2 log n, eventually, for almost every codebook sequence, and P -almost all x1 1 2 Gm . This together with (32) establishes (31). In particular, (31) implies that with probability one, Wn < 1 eventually. Therefore the description length of our code

`n(X1n )  log Wn + 2 log log(2Wn ) + 11 bits, eventually, a.s. and this can be bounded above as

`n(X1n )  log[1=Qn (B (X1n ; D))] + n + 2 log[1 + n ? log Qn(B (X1n ; D))] + 11 (a)  log[1=Qn (B (X1n ; D))] + 2 log n + 2 log[1 + 2 log n + 2nR(D)] + 11  log[1=Qn (B (X1n ; D))] + 4 log n + Const. eventually, a.s. where (a) follows from (31) and Proposition 3. This proves part (i).

2

Note that the above proof not only demonstrates the existence of a good sequence of codes fCn ; `n g, but it also shows that almost every sequence of random codes generated as above will satisfy the statement of the theorem. Proof of Theorem 5 (ii): Here we assume that the source X = fXn ; n  1g has a distribution P , where P is unknown to the encoder and decoder, but such that D 2 (0; Dmax (P )). For each n  1

23

we generate a family of codebooks, one for each n-type on A^. Recall [9] that a p.m.f. Q on A^ is called an n-type if, for each y 2 A^, Q(y) = m=n for an integer m. The number T (n) of n-types grows polynomially in n, and it is bounded above as

T (n)  (n + 1)k

(33)

where k = jA^j denotes the cardinality of A^; see [9, Chapter 13]. For 1  j  T (n), let Q(j ) denote the j th n-type. The j th codebook consists of i.i.d. random (j ) ); i = 1; 2; : : : ; where each Y (j ) (i) is drawn according to (Q(j ) )n . vectors Y (j ) (i) = (Yi;(1j ) ; Yi;(2j ) ; : : : ; Yi;n Given a source string X1n , we let Wn(j ) = Wn(j ) (X1n ) be the waiting time until a D-close match for X1n is found in the j th codebook,

Wn(j) = inf fi  1 : n(X1n ; Y (j) (i))  Dg; and we de ne

Wn = 1min W (j) : j T (n) n

It is not hard to see that, for large enough n, there will (almost) always be a D-match for X1n in one of the codebooks, so that Wn < 1 eventually, a.s., where the almost-sure statement here (and also in all subsequent statements) is with respect to P almost any source realization, and almost any sequence of codebooks generated as above. [This is so because, for n  k, at least one of the n-types has positive probability on all elements of A^, so with probability one every possible A^n -string will appear in nitely often. Assumption (5) then guarantees the existence of a D-match.] Therefore, we can describe X1n (with distortion no more than D) to the decoder by specifying the waiting time Wn , and the codebook in which Wn is achieved. As in part (i), and using the bound in (33), this can be done using

`n(X1n ) = dlog(Wn + 1)e + 2dlog (dlog(Wn + 1)e + 1)e + dk log(n + 1)e  log Wn + 2 log log(2Wn ) + k log n + (11 + k) bits, eventually, a.s. Now, following [17], we pick a sequence of n-types that are close to Q . We let qn be an n-type such that qn (y) > 0 and jqn (y) ? Q (y)j  k=n, for all y 2 A^. This can be done for all n  N , for some xed integer N (see [17] for the details). Let Vn denote the waiting time associated with the codebook corresponding to qn, and write Qn = (qn )n . The same argument as the one used to prove (31) in part (i) can be used here to show that 4 0n = = log Vn ? log[1=Qn (B (X1n ; D))]  2 log n eventually, a.s.

24

(34)

Using the obvious fact that Wn is never greater than Vn , we can bound `n(X1n ) above by

`n(X1n )  log Vn + 2 log log(2Vn ) + k log n + (11 + k)  log(1=pn ) + n + 0n + 2 log[1 + n + 0n + log(1=pn )] + k log n + (11 + k);

(35)

eventually, almost surely, where pn = Qn (B (X1n ; D)) as before, and  Q (B (X n; D))  n n = n (X1 ) = log Qn (B (X1n ; D)) : n 1 Next we claim that there exist absolute constants C and M such that:

n (xn1 )  C; for all n  M; and all xn1 2 An.

(36)

Before proving this, let us see how it allows us to complete the proof. Recalling Proposition 3 and substituting the bounds (34) and (36) into (35) gives

`n (X1n )  log(1=pn ) + 2 log[1 + C + 2 log n + 3nR(D)] + (2 + k) log n + (C + 11 + k)  log(1=pn ) + 2 log[4nR(D)] + (2 + k) log n + (C + 11 + k)  log[1=Qn (B (X1n ; D))] + (4 + k) log n + Const. eventually, P ?a.s.

Finally we need to prove (36). Pick M  N large enough, so that for all n  M and all y 2 A^, Q (y) is either equal to zero or Q (y) > k=n. Let n  M and xn1 2 An be arbitrary. Then P n n hQ (yn ) Qn(ynn ) i n  y 2B(x ;D) n 1 Qn (y ) Qn(B (x1 ; D)) = P n n Q (yn) n Qn(B (x1 ; D)) y 2B(x ;D) n 1 Qn(y1n )  yn2max B(xn ;D) Qn(y1n ) n  Y Q (yi) = n maxn y 2B(x ;D) i=1 qn(yi ) !  (y) n Q  max y2A^ : Q(y)>0 qn (y) !n  (y) (a) Q  max y2A^ : Q(y)>0 Q (y) ? k=n 1

1

1

1



1

1

1

1



1

1

k

1 ? nQ (y )

?n

where (a) is by the choice of qn, and y is the y 2 A^ with the smallest nonzero Q probability. So,

n (xn1 )  ?n log(1 ? C 0=n); with C 0 = k=Q (y ), and this is a convergent sequence so it must be bounded.

2

As in part (i), this proof actually shows that almost every sequence of random codes generated as above will satisfy the statement of the theorem. 25

5 Arbitrary Sources Let X be an A-valued source, and write Pn for the distribution of X1n . In this section, we prove Theorems 6{8. We begin with two useful Lemmas; they are proved in Appendices IV and V, respectively. Lemma 3: The in mum

EPn f? log Qn(B (X1n ; D))g inf Q

(37)

n

over all sub-probability measures Qn on A^n is the same as the in mum over all probability measures, and it is achieved by some probability measure Qen . Lemma 4:

n

o

Kn(D) = EPn ? log Qen(B (X1n ; D))  Rn (D):

Proof of Theorem 6: Let fCn ; `n g be an arbitrary sequence of codes operating at distortion level D, where each Cn consists of a triple (Bn; n ; n ). Let Ln be the length function induced by n on Bn. As in the proof of Theorem 4, the key idea is to consider the sub-probability measure QCn on A^n de ned by 4 X ?Ln (yn ) 2 ; for all F  A^n: QCn (F ) = 1

y1n 2F \Bn

Since Cn operates at distortion level D, for any xn1 2 An we have:

`n (xn1 ) = Ln (n(xn1 )) = ? log QCn (n (xn1 ))  ? log QCn (B (xn1 ; D)):

(38)

From (38) and the de nition of Kn (D) we immediately get that

EPn [`n(X1n )]  Kn(D) and, in view of Lemma 4, this proves part (i). For part (ii), we de ne a family of functions on An

n

Gn =4 g : g(xn1 ) = Qn(B (xn1 ; D)) for a sub-probability measure Qn on A^n

o

and note that Gn is a convex family. By Lemma 3 we know that inf EPn f? log g(X1n )g = EPn f? log g~(X1n )g ;

g2Gn

26

(39)

where g~ is the function g~(xn1 ) = Qen (B (xn1 ; D)): But for each n  1, (39) are exactly the Kuhn-Tucker conditions for the optimality of g~ in Gn , so [3, Theorem 2] implies that  g(X n )  (40) EPn g~(X1n)  1; for all g 2 Gn : 1 Therefore, letting gn (xn1 ) = QCn (B (xn1 ; D), the probability that the assertion of (ii) fails can be bounded above as Prf`n (X1n ) < log[1=Qen (B (X1n ; D))] ? bn g (a)

 Prf(log[1=QCn (B (X1n; D))] )< log[1=Qen (B (X1n ; D))] ? bng n

= Pr QeCn (B (Xn1 ; D)) > bn Qn(B (X1 ; D))  gn (X n)  (b) ? b n  2 EPn g~(X n1) 1 (c)

 2?bn ;

where (a) follows from the bound (38), (b) is simply Markov's inequality, and (c) follows from the Kuhn-Tucker conditions (40) with g = gn . Since the sequence 2?bn is summable by assumption, the Borel-Cantelli lemma completes the proof. 2 Proof of Theorem 7: Suppose X is a memoryless source with distribution P , let Pn = P n denote the distribution of X1n , and write Qn = (Q )n , where Q is the optimal reproduction distribution at distortion level D. Replacing gn by gn0 (xn1 ) = Qn(B (xn1 ; D) in the last part of the argument of the proof of Theorem 6, 4 and taking bn = 2 log n; we get

Proof of Theorem 7: Suppose $X$ is a memoryless source with distribution $P$, let $P_n = P^n$ denote the distribution of $X_1^n$, and write $Q^*_n = (Q^*)^n$, where $Q^*$ is the optimal reproduction distribution at distortion level $D$.

Replacing $g_n$ by $g'_n(x_1^n) = Q^*_n(B(x_1^n, D))$ in the last part of the argument of the proof of Theorem 6, and taking $b_n = 2\log n$, we get

$$ \log[1/Q^*_n(B(X_1^n, D))] \ \ge\ \log[1/\tilde{Q}_n(B(X_1^n, D))] - 2\log n \qquad \mbox{eventually, a.s.} \qquad (41) $$

Similarly, taking

$$ g''_n(x_1^n) = E_{\tilde{Q}_n}\left\{ e^{n\lambda^*[\rho_n(x_1^n, Y_1^n) - D]} \right\} $$

in place of $g_n$ in the proof of Theorem 4, and choosing $b_n = 2\log n$, we get

$$ -\log E_{\tilde{Q}_n}\left\{ e^{n\lambda^*[\rho_n(X_1^n, Y_1^n) - D]} \right\} \ \ge\ -\log E_{Q^*_n}\left\{ e^{n\lambda^*[\rho_n(X_1^n, Y_1^n) - D]} \right\} - 2\log n \qquad \mbox{eventually, a.s.} \qquad (42) $$

But by Proposition 3 we know that (in the notation of the proof of Theorem 4)

$$ \log[1/Q^*_n(B(X_1^n, D))] = nR(D) + \sum_{i=1}^n f(X_i) + \frac{1}{2}\log n + O(\log\log n) \quad \mbox{a.s.} $$
$$ = -\log E_{Q^*_n}\left\{ e^{n\lambda^*[\rho_n(X_1^n, Y_1^n) - D]} \right\} + \frac{1}{2}\log n + O(\log\log n) \quad \mbox{a.s.,} \qquad (43) $$

and also, by a simple Chernoff-type bound,

$$ \tilde{Q}_n(B(X_1^n, D)) = E_{\tilde{Q}_n}\left\{ I_{\{\rho_n(X_1^n, Y_1^n) \le D\}} \right\} \ \le\ E_{\tilde{Q}_n}\left\{ e^{n\lambda^*[\rho_n(X_1^n, Y_1^n) - D]} \right\}. \qquad (44) $$

From (42), (43) and (44) we have

$$ \log[1/\tilde{Q}_n(B(X_1^n, D))] \ \ge\ \log[1/Q^*_n(B(X_1^n, D))] - \frac{5}{2}\log n + O(\log\log n) \quad \mbox{a.s.} $$

Combining this with the corresponding lower bound in (41) completes the proof. $\Box$
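The Chernoff-type bound (44) holds with any measure in place of $\tilde{Q}_n$ and any $\lambda \le 0$. Purely as an illustration (uniform Bernoulli reproduction letters and Hamming distortion, so that the ball probability is an exact binomial tail; these choices are not from the text), the Python sketch below compares $Q^n(B(x_1^n, D))$ with the bound $E_{Q^n}[e^{n\lambda(\rho_n - D)}]$, optimized over $\lambda < 0$:

    import math

    # Illustrative check of a Chernoff-type bound of the form (44): reproduction letters
    # i.i.d. uniform on {0,1} (an arbitrary stand-in for Q-tilde_n), Hamming distortion,
    # so the distortion-ball probability is an exact binomial tail.
    n, D = 60, 0.25
    k_max = math.floor(n * D)
    ball_prob = sum(math.comb(n, k) for k in range(k_max + 1)) / 2.0**n

    def mgf_bound(lam):                          # E[e^{n*lam*(rho_n - D)}] for lam <= 0
        return math.exp(-n * lam * D) * (0.5 + 0.5 * math.exp(lam)) ** n

    best = min(mgf_bound(-i / 100.0) for i in range(1, 501))   # optimize over lam in [-5, 0)
    print("Q^n(B(x, D))        =", ball_prob)
    print("best Chernoff bound =", best)
    assert ball_prob <= best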

Proof of Theorem 8: We use a random coding argument, very similar to the ones used in the proofs of Theorem 5 parts (i) and (ii). For each $n \ge 1$ we generate a random codebook according to $\tilde{Q}_n$: Let $Y(i) = (Y_{i,1}, Y_{i,2}, \ldots, Y_{i,n})$, $i = 1, 2, \ldots$, be i.i.d. random vectors in $\hat{A}^n$, each drawn according to $\tilde{Q}_n$. Given a source string $X_1^n$, let $W_n = W_n(X_1^n)$ denote the first index $i$ such that $X_1^n$ matches the $i$th codeword with distortion $D$ or less:

$$ W_n = \inf\{ i \ge 1 \,:\, \rho_n(X_1^n, Y(i)) \le D \}. $$

If such a match exists, we describe $X_1^n$ to the decoder (with distortion no more than $D$) by describing $W_n$, using, as before, no more than $\log W_n + 2\log\log(2W_n) + 10$ bits. Otherwise, we describe $X_1^n$ exactly, using $\lceil n\log|\hat{A}|\rceil$ bits; this is possible because of our initial assumption (5). Our code $C_n$ consists of combining these two descriptions, together with a one-bit flag to specify which one was chosen.

Next we claim that the waiting times $W_n$ can be approximated by the quantities $1/\tilde{Q}_n(B(X_1^n, D))$, in that their difference satisfies:

$$ \Delta_n \,\triangleq\, \log W_n - \log[1/\tilde{Q}_n(B(X_1^n, D))] \ \le\ 2\log n \qquad \mbox{eventually, a.s.} \qquad (45) $$

The assumption that $K_n(D) < \infty$ implies that $\tilde{Q}_n(B(x_1^n, D)) > 0$ for $P_n$-almost all $x_1^n$, so the strong approximation argument from the proof of Theorem 5 goes through essentially verbatim to prove (45). In particular, (45) implies that $W_n < \infty$ eventually, almost surely, so the description length of our code can be bounded above as

$$ \ell_n(X_1^n) \ \le\ \log W_n + 2\log\log(2W_n) + 11 $$
$$ \le\ \log[1/\tilde{Q}_n(B(X_1^n, D))] + \Delta_n + 2\log\log\left[ \frac{2^{\Delta_n + 1}}{\tilde{Q}_n(B(X_1^n, D))} \right] + 11 $$
$$ \le\ \log[1/\tilde{Q}_n(B(X_1^n, D))] + 2\log n + 2\log\log\left[ \frac{2n^2}{\tilde{Q}_n(B(X_1^n, D))} \right] + \mbox{Const.} \qquad \mbox{eventually, a.s.,} $$

and we are done. $\Box$
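The waiting-time construction above is easy to simulate once $\tilde{Q}_n$ is replaced by a concrete product measure (an arbitrary stand-in; the measure $\tilde{Q}_n$ of Lemma 3 is not computed here). The Python sketch draws fair-coin codewords, records the waiting time $W_n$, and compares $\log W_n$ with $\log[1/Q^n(B(X_1^n, D))]$, which is the approximation behind (45); it also evaluates the resulting description-length bound.

    import math
    import random

    random.seed(1)
    # Illustrative stand-in for the construction above: codewords drawn i.i.d. uniform
    # on {0,1}^n (instead of the measure Q-tilde_n of Lemma 3), Hamming distortion.
    n, D = 40, 0.3
    x = [random.randint(0, 1) for _ in range(n)]               # a source string

    def rho(a, b):
        return sum(u != v for u, v in zip(a, b)) / n

    # exact ball probability under the product measure: a binomial tail
    k_max = math.floor(n * D)
    ball = sum(math.comb(n, k) for k in range(k_max + 1)) / 2.0**n

    W = 1                                       # waiting time for the first D-close codeword
    while rho(x, [random.randint(0, 1) for _ in range(n)]) > D:
        W += 1

    bound = math.log2(W) + 2 * math.log2(math.log2(2 * W)) + 10
    print("W_n                 =", W)
    print("log2 W_n            =", round(math.log2(W), 2))
    print("log2 1/Q^n(B(x, D)) =", round(-math.log2(ball), 2))
    print("description length bound ~", round(bound, 1), "bits")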

Acknowledgments The author wishes to thank Amir Dembo, Dimitris Gatzouras and Herman Rubin for various enlightening technical discussions, and Andrew Barron for his useful comments on an earlier version of this paper.

Appendix I

Proof of Lemma 1, (iv): Fix $\lambda \le 0$ and a probability measure $\mu$ on $A^n$. Let $\{\nu^{(k)}\}$ be a sequence of p.m.f.s on $\hat{A}^n$, such that the $\nu^{(k)}$ converge, as $k \to \infty$, to some p.m.f. $\nu$ on $\hat{A}^n$. Then,

$$ \limsup_{k\to\infty} \Lambda_{\mu,\nu^{(k)}}(\lambda) = -\liminf_{k\to\infty} E_\mu\left\{ -\log_e E_{\nu^{(k)}}\left[ e^{\lambda\rho_n(X_1^n, Y_1^n)} \right] \right\} $$
$$ \stackrel{(a)}{\le} -E_\mu\left\{ \liminf_{k\to\infty}\left( -\log_e E_{\nu^{(k)}}\left[ e^{\lambda\rho_n(X_1^n, Y_1^n)} \right] \right) \right\}
 \ \stackrel{(b)}{=}\ -E_\mu\left\{ -\log_e E_\nu\left[ e^{\lambda\rho_n(X_1^n, Y_1^n)} \right] \right\} = \Lambda_{\mu,\nu}(\lambda), $$

where $(a)$ follows from Fatou's Lemma and $(b)$ follows from the assumption that $\nu^{(k)} \to \nu$. Therefore, $\Lambda_{\mu,\nu}(\lambda)$ is upper semicontinuous in $\nu$. $\Box$

Appendix II

Proof of Proposition 1: The alternative representation of $R(\mu,\nu,D)$ in (i) can be obtained from its definition by a simple application of the chain rule for relative entropy (see, e.g., [25, eq. (3.11.5)] or [11, Theorem D.13]).

For part (ii) we will use the representation in (i). Fix $D \ge 0$ arbitrary, and recall (see [11, Lemma 6.2.13]) that for any bounded measurable function $\phi: \hat{A}^n \to \mathbb{R}$, any $x_1^n \in A^n$, and any candidate measure $\pi$ on $A^n \times \hat{A}^n$ with $A^n$-marginal equal to $\mu$ and $E_\pi[\rho_n(X_1^n, Y_1^n)] \le D$, we have

$$ (\log_e 2)\, H\big(\pi(\cdot|x_1^n)\,\|\,\nu(\cdot)\big) \ \ge\ \int \phi(y_1^n)\, d\pi(y_1^n|x_1^n) - \log_e E_\nu\left[ e^{\phi(Y_1^n)} \right]. $$

Choosing $\phi(y_1^n) = \lambda\rho_n(x_1^n, y_1^n)$ with $\lambda \le 0$, and taking expectations of both sides with respect to $\mu$ (using $E_\pi[\rho_n(X_1^n, Y_1^n)] \le D$), yields that

$$ (\log_e 2)\, E_\mu\left[ H\big(\pi(\cdot|X_1^n)\,\|\,\nu(\cdot)\big) \right] \ \ge\ \lambda D - \Lambda_{\mu,\nu}(\lambda). $$

Taking the infimum over all candidate measures $\pi$ and the supremum over all $\lambda \le 0$, implies (from (i)):

$$ R(\mu,\nu,D) \ \ge\ (\log e)\, \Lambda^*_{\mu,\nu}(D). \qquad (46) $$

To prove the reverse inequality we consider four cases. (Note that we only need to consider cases when $\Lambda^*_{\mu,\nu}(D) < \infty$.)

Case I: $D_{\min}^{\mu,\nu} > 0$ and $D \in (0, D_{\min}^{\mu,\nu})$. By Lemma 1, for all $\lambda \le 0$,

$$ \frac{d}{d\lambda}\left[ \lambda D - \Lambda_{\mu,\nu}(\lambda) \right] = D - \Lambda'_{\mu,\nu}(\lambda) \ \le\ D - D_{\min}^{\mu,\nu} < 0. $$

Therefore

$$ \Lambda^*_{\mu,\nu}(D) \ \ge\ \limsup_{\lambda\to-\infty}\left[ \lambda D - \Lambda_{\mu,\nu}(\lambda) \right] = \infty, $$

so in this case

$$ R(\mu,\nu,D) = (\log e)\,\Lambda^*_{\mu,\nu}(D) = \infty. \qquad (47) $$

Case II: $D \ge D_{\max}^{\mu,\nu}$. Here, by Lemma 1, for all $\lambda \le 0$,

$$ \frac{d}{d\lambda}\left[ \lambda D - \Lambda_{\mu,\nu}(\lambda) \right] = D - \Lambda'_{\mu,\nu}(\lambda) \ \ge\ D_{\max}^{\mu,\nu} - D_{\max}^{\mu,\nu} = 0, $$

so the supremum defining $\Lambda^*_{\mu,\nu}(D)$ is achieved at $\lambda = 0$, giving $\Lambda^*_{\mu,\nu}(D) = -\Lambda_{\mu,\nu}(0) = 0$. On the other hand, taking $\pi = \mu \times \nu$, noting that $E_\pi[\rho_n(X_1^n, Y_1^n)] = D_{\max}^{\mu,\nu}$, and recalling that relative entropy is nonnegative, implies that $R(\mu,\nu,D) = 0$. Hence, here

$$ R(\mu,\nu,D) = (\log e)\,\Lambda^*_{\mu,\nu}(D) = 0. \qquad (48) $$

Case III: $D_{\min}^{\mu,\nu} < D_{\max}^{\mu,\nu}$ and $D \in (D_{\min}^{\mu,\nu}, D_{\max}^{\mu,\nu})$. By Lemma 1, there is a unique $\lambda^* < 0$ such that $\Lambda'_{\mu,\nu}(\lambda^*) = D$ and

$$ \Lambda^*_{\mu,\nu}(D) = \lambda^* D - \Lambda_{\mu,\nu}(\lambda^*) \ >\ 0\cdot D - \Lambda_{\mu,\nu}(0) = 0. \qquad (49) $$

Let

$$ \frac{d\pi^*}{d(\mu\times\nu)}(x_1^n, y_1^n) = \frac{e^{\lambda^*\rho_n(x_1^n, y_1^n)}}{E_\nu\left[ e^{\lambda^*\rho_n(x_1^n, Y_1^n)} \right]} $$

and observe that $E_{\pi^*}[\rho_n(X_1^n, Y_1^n)] = \Lambda'_{\mu,\nu}(\lambda^*) = D$. Then

$$ R(\mu,\nu,D) \ \le\ E_\mu\left[ H\big(\pi^*(\cdot|X_1^n)\,\|\,\nu(\cdot)\big) \right] = \int \log\frac{d\pi^*}{d(\mu\times\nu)}\, d\pi^* = (\log e)\left[ \lambda^* D - \Lambda_{\mu,\nu}(\lambda^*) \right] = (\log e)\,\Lambda^*_{\mu,\nu}(D). $$

This, together with (46) and (49), implies that here:

$$ 0 < R(\mu,\nu,D) = (\log e)\,\Lambda^*_{\mu,\nu}(D) < \infty. \qquad (50) $$

Case IV: $D_{\min}^{\mu,\nu} < D_{\max}^{\mu,\nu}$ and $D = D_{\min}^{\mu,\nu}$. Write $S_\nu \subseteq \hat{A}^n$ for the support of $\nu$, and for $x_1^n \in A^n$ let

$$ \bar{\rho}_n(x_1^n) = \min_{y_1^n \in S_\nu} \rho_n(x_1^n, y_1^n), $$

so that $D_{\min}^{\mu,\nu} = E_\mu[\bar{\rho}_n(X_1^n)]$ and

$$ \lambda D_{\min}^{\mu,\nu} - \Lambda_{\mu,\nu}(\lambda) = E_\mu\left[ -\log_e E_\nu\, e^{\lambda[\rho_n(X_1^n, Y_1^n) - \bar{\rho}_n(X_1^n)]} \right]. $$

Also, by Lemma 1,

$$ \frac{d}{d\lambda}\left[ \lambda D_{\min}^{\mu,\nu} - \Lambda_{\mu,\nu}(\lambda) \right] = D_{\min}^{\mu,\nu} - \Lambda'_{\mu,\nu}(\lambda) < 0, $$

so $\Lambda^*_{\mu,\nu}(D_{\min}^{\mu,\nu})$ is the increasing limit of $[\lambda D_{\min}^{\mu,\nu} - \Lambda_{\mu,\nu}(\lambda)]$ as $\lambda \to -\infty$. Therefore, letting $Z(x_1^n)$ denote the event $\{y_1^n : \rho_n(x_1^n, y_1^n) = \bar{\rho}_n(x_1^n)\} \subseteq \hat{A}^n$ and $\bar{Z}(x_1^n)$ denote its complement,

$$ \Lambda^*_{\mu,\nu}(D_{\min}^{\mu,\nu}) = \lim_{\lambda\to-\infty} E_\mu\left[ -\log_e E_\nu\, e^{\lambda[\rho_n(X_1^n, Y_1^n) - \bar{\rho}_n(X_1^n)]} \right] $$
$$ = \lim_{\lambda\to-\infty} E_\mu\left[ -\log_e\left\{ \nu(Z(X_1^n)) + E_\nu\left( e^{\lambda[\rho_n(X_1^n, Y_1^n) - \bar{\rho}_n(X_1^n)]}\, I_{\bar{Z}(X_1^n)} \right) \right\} \right] = E_\mu\left[ -\log_e \nu(Z(X_1^n)) \right], $$

where the last equality follows from the monotone convergence theorem. Since we are only interested in the case $\Lambda^*_{\mu,\nu}(D_{\min}^{\mu,\nu}) < \infty$, the above calculation implies that we may assume, without loss of generality, that $\nu(Z(x_1^n)) > 0$ for $\mu$-almost all $x_1^n \in A^n$. We can then define a measure $\bar{\pi}$ by

$$ \frac{d\bar{\pi}}{d(\mu\times\nu)}(x_1^n, y_1^n) = \frac{1}{\nu(Z(x_1^n))}\, I_{\{y_1^n \in Z(x_1^n)\}}, $$

which has $E_{\bar{\pi}}[\rho_n(X_1^n, Y_1^n)] = D_{\min}^{\mu,\nu}$, and

$$ R(\mu,\nu,D) \ \le\ E_\mu\left[ H\big(\bar{\pi}(\cdot|X_1^n)\,\|\,\nu(\cdot)\big) \right] = \int \log\frac{d\bar{\pi}}{d(\mu\times\nu)}\, d\bar{\pi} = (\log e)\int \left[ -\log_e \nu(Z(x_1^n)) \right] d\mu(x_1^n) = (\log e)\,\Lambda^*_{\mu,\nu}(D_{\min}^{\mu,\nu}). $$

This, together with (46), completes the proof. $\Box$
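In a small single-letter example, the identity of Proposition 1 (ii) can be checked numerically. The Python sketch below (with arbitrary choices of $P$, $Q$ and $D$, binary alphabets and Hamming distortion) computes the dual value $(\log e)\sup_{\lambda\le 0}[\lambda D - \Lambda_{P,Q}(\lambda)]$ by a grid search over $\lambda$, and the primal value $\inf E_P[H(\pi(\cdot|X)\,\|\,Q)]$ by a grid search over conditional distributions satisfying the distortion constraint; the two printed numbers should agree up to the grid resolution.

    import math

    # Single-letter illustration of Proposition 1 (ii); the specific P, Q, D are arbitrary.
    # Binary alphabets, Hamming distortion rho(x, y) = 1{x != y}.
    P = [0.7, 0.3]
    Q = [0.6, 0.4]
    D = 0.2
    log2e = math.log2(math.e)

    def Lam(lam):                                   # Lambda_{P,Q}(lambda), in nats
        return sum(P[x] * math.log(sum(Q[y] * math.exp(lam * (x != y)) for y in (0, 1)))
                   for x in (0, 1))

    # dual value: (log e) * sup_{lambda <= 0} [lambda*D - Lambda(lambda)], in bits
    dual = max(log2e * (-l / 1000.0 * D - Lam(-l / 1000.0)) for l in range(0, 20001))

    def kl(w, q):                                   # relative entropy, in bits
        return sum(wi * math.log2(wi / qi) for wi, qi in zip(w, q) if wi > 0)

    # primal value: inf over pi(.|x) with E[rho] <= D of E_P[ H(pi(.|X) || Q) ], in bits
    primal = float("inf")
    steps = 400
    for i in range(steps + 1):
        a = i / steps                               # a = pi(Y=1 | X=0)
        for j in range(steps + 1):
            b = j / steps                           # b = pi(Y=0 | X=1)
            if P[0] * a + P[1] * b <= D:            # distortion constraint
                primal = min(primal, P[0] * kl([1 - a, a], Q) + P[1] * kl([1 - b, b], Q))

    print("dual   (log e) * Lambda*(D) =", round(dual, 4))
    print("primal R(P, Q, D)           =", round(primal, 4))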

Appendix III

Proof of Proposition 2: By Lemma 1 (iv), $\Lambda_{\mu,\nu}(\lambda)$ is upper semicontinuous as a function of $\nu$, so $[\lambda D - \Lambda_{\mu,\nu}(\lambda)]$ is lower semicontinuous. Therefore, by the representation of $R(\mu,\nu,D)$ in Proposition 1 (ii) and the fact that the supremum of lower semicontinuous functions is itself lower semicontinuous (see, e.g., [27, p. 38]), we get that $R(\mu,\nu,D)$ is lower semicontinuous as a function of $\nu$, proving (i).

Part (ii) follows immediately from part (i): Since $\hat{A}$ is finite, the set of all p.m.f.s $Q$ on $\hat{A}$ is compact, and therefore the lower semicontinuous function $R(P, Q, D)$ must achieve its infimum over that compact set (see, e.g., [26, p. 195]), proving the existence of the required $Q^*$.

For part (iii), it is easy to check that the stated properties of $R(\mu,\nu,D)$ are actually proved in the course of proving Proposition 1 (ii); see equations (47), (48) and (50).

Part (iv): First, if $D \in (0, D_{\max})$, then letting $U$ denote the uniform distribution on $\hat{A}$ and recalling our basic assumption (5), we have

$$ D_{\min}^{P,U} = E_P\left[ \min_{y\in\hat{A}} \rho(X, y) \right] \ \le\ \sup_{x\in A}\, \min_{y\in\hat{A}} \rho(x, y) = 0, $$

i.e., $D_{\min}^{P,U} = 0$. Therefore $D > 0$ means that $D > D_{\min}^{P,U}$, so by part (iii) above $R(P, U, D) < \infty$ and hence $R(D) < \infty$. Also, for any distribution $Q$ on $\hat{A}$,

$$ D_{\max} = \min_{y\in\hat{A}} E_P[\rho(X, y)] \ \le\ E_{P\times Q}[\rho(X, Y)] = D_{\max}^{P,Q}, $$

so, in particular, $D < D_{\max}^{P,Q^*}$. Part (iii) then implies that $R(D) = R(P, Q^*, D) > 0$.

On the other hand, if $D \ge D_{\max}$, then by the definition of $D_{\max}$ in (7) there exists a $z^* \in \hat{A}$ such that

$$ D_{\max} = E_P[\rho(X, z^*)] = E_{P\times\delta_{z^*}}[\rho(X, Y)] = D_{\max}^{P,\delta_{z^*}}, $$

where $\delta_{z^*}$ is the measure attaching unit mass at $z^*$. This means that $D \ge D_{\max}^{P,\delta_{z^*}}$, so by part (iii) above $R(P, \delta_{z^*}, D) = 0$, and hence $R(D) = 0$.

Part (v): Since $R(P, Q^*, D) = R(D) \in (0, \infty)$, from part (iii) we have that $D \in [D_{\min}^{P,Q^*}, D_{\max}^{P,Q^*})$, and, in particular, that $D_{\min}^{P,Q^*} < D_{\max}^{P,Q^*}$. So we only have to rule out the case $D = D_{\min}^{P,Q^*}$, but this is done in Appendix II of [17]. $\Box$
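The quantities used in part (iv) are easy to compute in a small example. The Python sketch below uses an arbitrary $3\times 3$ distortion matrix (chosen so that every source letter has a zero-distortion reproduction, in the spirit of the basic assumption (5)) and evaluates $D_{\min}^{P,U}$, $D_{\max} = \min_y E_P[\rho(X,y)]$ and $D_{\max}^{P,Q} = E_{P\times Q}[\rho(X,Y)]$ for particular $P$ and $Q$:

    import numpy as np

    # Small illustration of the quantities in part (iv); the 3x3 distortion matrix and
    # the distributions are arbitrary, chosen so that every source letter has a
    # zero-distortion reproduction.
    rho = np.array([[0.0, 1.0, 2.0],
                    [1.0, 0.0, 1.0],
                    [2.0, 1.0, 0.0]])
    P = np.array([0.5, 0.3, 0.2])                   # source distribution on A
    Q = np.array([0.2, 0.5, 0.3])                   # some reproduction distribution on A-hat

    D_min_PU = float(P @ rho.min(axis=1))           # E_P[min_y rho(X, y)]; equals 0 here
    D_max = float((P @ rho).min())                  # min_y E_P[rho(X, y)]
    D_max_PQ = float(P @ rho @ Q)                   # E_{P x Q}[rho(X, Y)]

    print("D_min^{P,U} =", D_min_PU)
    print("D_max       =", D_max)
    print("D_max^{P,Q} =", D_max_PQ)
    print("D_max <= D_max^{P,Q} :", D_max <= D_max_PQ + 1e-12)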

Appendix IV

Proof of Lemma 3: First observe that if $Q_n$ is strictly a sub-probability measure with $Q_n(\hat{A}^n) = Z < 1$, then using the probability measure $Q'_n(\cdot) = Z^{-1} Q_n(\cdot)$ we can make the expectation in (37) smaller, since it changes by $\log Z < 0$. Therefore it is enough to consider probability measures $Q_n$. As for the achievability of the infimum, it suffices to note that it is taken over a compact set (the set of all p.m.f.s $Q_n$ on $\hat{A}^n$), and that the map $Q_n \mapsto E_{P_n}\{-\log Q_n(B(X_1^n, D))\}$ is lower semicontinuous. This follows from Fatou's Lemma in exactly the same way as it was shown in Appendix I that $\Lambda_{\mu,\nu}(\lambda)$ is upper semicontinuous in $\nu$. $\Box$

Appendix V

Proof of Lemma 4: Define a joint probability measure $\pi$ on the product space $A^n \times \hat{A}^n$ by restricting the product measure $P_n \times \tilde{Q}_n$ to be supported on $\{(x_1^n, y_1^n) : \rho_n(x_1^n, y_1^n) \le D\}$:

$$ \frac{d\pi}{d(P_n \times \tilde{Q}_n)}(x_1^n, y_1^n) = \frac{1}{\tilde{Q}_n(B(x_1^n, D))}\, I_{\{y_1^n \in B(x_1^n, D)\}}. $$

Observe that the $A^n$-marginal of $\pi$ is $P_n$, and let $\pi_2$ denote its $\hat{A}^n$-marginal. Then, with $(X_1^n, Y_1^n)$ distributed according to $\pi$,

$$ K_n(D) \ \stackrel{(a)}{=}\ E_{P_n}\left\{ -\log \tilde{Q}_n(B(X_1^n, D)) \right\} = H(\pi \,\|\, P_n \times \tilde{Q}_n)
 \ \stackrel{(b)}{=}\ E\left\{ H\big(\pi(\cdot|Y_1^n)\,\|\,P_n(\cdot)\big) \right\} + H(\pi_2 \,\|\, \tilde{Q}_n) $$
$$ \stackrel{(c)}{\ge}\ H(\pi \,\|\, P_n \times \pi_2) \ \stackrel{(d)}{=}\ I(X_1^n; Y_1^n) \ \stackrel{(e)}{\ge}\ R_n(D), $$

where $(a)$ follows from the definition of $K_n(D)$, $(b)$ follows from the chain rule for relative entropy (see, e.g., [25, eq. (3.11.5)] or [11, Theorem D.13]), $(c)$ follows from the nonnegativity of relative entropy, $(d)$ is just the definition of mutual information, and $(e)$ comes from the definition of $R_n(D)$, since $E_\pi[\rho_n(X_1^n, Y_1^n)] \le D$. $\Box$

Appendix VI

Proof of Proposition 3: Since $D \in (0, D_{\max})$, Proposition 2 (v) implies that $D \in (D_{\min}^{P,Q^*}, D_{\max}^{P,Q^*})$, so we may invoke [35, Corollary 1] to obtain

$$ -\log Q^*_n(B(X_1^n, D)) = nR(\hat{P}_n, Q^*, D) + \frac{1}{2}\log n + O(1) \quad \mbox{a.s.,} $$

where $\hat{P}_n$ is the empirical measure induced by $X_1^n$ on $A$, i.e., the measure that assigns mass $1/n$ to each one of the values $X_i$, $i = 1, 2, \ldots, n$. Also, [10, Theorem 3] says that

$$ nR(\hat{P}_n, Q^*, D) = nR(D) + \sum_{i=1}^n f(X_i) + o(\sqrt{n}) \quad \mbox{a.s.,} $$

but a simple examination of the proof in [10] shows that we may replace the term $o(\sqrt{n})$ above by $O(\log\log n)$, without any changes in the proof. Combining these two results completes the proof of the proposition. $\Box$

References

[1] P.H. Algoet. Log-Optimal Investment. PhD thesis, Dept. of Electrical Engineering, Stanford University, 1985.
[2] A.R. Barron. Logically Smooth Density Estimation. PhD thesis, Dept. of Electrical Engineering, Stanford University, 1985.
[3] R. Bell and T.M. Cover. Game-theoretic optimal portfolios. Management Sci., 34(6):724–733, 1988.
[4] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall Inc., Englewood Cliffs, NJ, 1971.
[5] T. Berger and J.D. Gibson. Lossy source coding. IEEE Trans. Inform. Theory, 44(6):2693–2723, 1998.
[6] L. Breiman. Probability. SIAM Classics in Applied Mathematics, 7, Philadelphia, PA, 1992.
[7] Z. Chi. The first order asymptotics of waiting times with distortion between stationary processes. Preprint, 1999.
[8] P.A. Chou, M. Effros, and R.M. Gray. A vector quantization approach to universal noiseless coding and quantization. IEEE Trans. Inform. Theory, 42(4):1109–1138, 1996.
[9] T.M. Cover and J.A. Thomas. Elements of Information Theory. J. Wiley, New York, 1991.
[10] A. Dembo and I. Kontoyiannis. The asymptotics of waiting times between stationary processes, allowing distortion. Ann. Appl. Probab., 9:413–429, 1999.
[11] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Second edition. Springer-Verlag, New York, 1998.
[12] R. Durrett. Probability: Theory and Examples. Second edition. Duxbury Press, Belmont, CA, 1996.
[13] P. Elias. Universal codeword sets and representations of the integers. IEEE Trans. Inform. Theory, 21:194–203, 1975.
[14] J.C. Kieffer. Sample converses in source coding theory. IEEE Trans. Inform. Theory, 37(2):263–268, 1991.
[15] I. Kontoyiannis. Second-order noiseless source coding theorems. IEEE Trans. Inform. Theory, 43(4):1339–1341, 1997.
[16] I. Kontoyiannis. Asymptotic recurrence and waiting times for stationary processes. J. Theoret. Probab., 11:795–811, 1998.
[17] I. Kontoyiannis. An implementable lossy version of the Lempel-Ziv algorithm that is asymptotically optimal – Part I: Optimality for memoryless sources. To appear, IEEE Trans. Inform. Theory, November 1999.
[18] R.E. Krichevsky and V.K. Trofimov. The performance of universal coding. IEEE Trans. Inform. Theory, 27(2):199–207, 1981.
[19] T. Linder, G. Lugosi, and K. Zeger. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Trans. Inform. Theory, 40(6):1728–1740, 1994.
[20] T. Linder, G. Lugosi, and K. Zeger. Fixed-rate universal lossy source coding and rates of convergence for memoryless sources. IEEE Trans. Inform. Theory, 41(3):665–676, 1995.
[21] N. Merhav. A comment on 'A rate of convergence result for a universal D-semifaithful code'. IEEE Trans. Inform. Theory, 41(4):1200–1202, 1995.
[22] D.L. Neuhoff, R.M. Gray, and L.D. Davisson. Fixed rate universal block source coding with a fidelity criterion. IEEE Trans. Inform. Theory, 21(5):511–523, 1975.
[23] D. Ornstein and P.C. Shields. Universal almost sure data compression. Ann. Probab., 18:441–452, 1990.
[24] R.J. Pilc. The transmission distortion of a source as a function of the encoding block length. Bell System Tech. J., 47:827–885, 1968.
[25] M.S. Pinsker. Information and Information Stability of Random Variables and Processes. Izd. Akad. Nauk SSSR, Moscow, 1960. Translated and edited by A. Feinstein. San Francisco: Holden-Day, 1964.
[26] H.L. Royden. Real Analysis. Macmillan, New York, 1988.
[27] W. Rudin. Real and Complex Analysis. McGraw-Hill, New York, 1987.
[28] C.E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., part 4:142–163, 1959. Reprinted in D. Slepian (ed.), Key Papers in the Development of Information Theory, IEEE Press, 1974.
[29] P.C. Shields. String matching bounds via coding. Ann. Probab., 25:329–336, 1997.
[30] J. Shtarkov. Coding of discrete sources with unknown statistics. In Topics in Information Theory (I. Csiszar and P. Elias, eds.), Coll. Math. Soc. J. Bolyai, no. 16, North Holland, Amsterdam, pages 559–574, 1977.
[31] M. Sion. On general minimax theorems. Pac. J. Math., 8:171–176, 1958.
[32] A.D. Wyner. Communication of analog data from a Gaussian source over a noisy channel. Bell System Tech. J., 47:801–812, 1968.
[33] A.D. Wyner and J. Ziv. The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Proc. IEEE, 82(6):872–877, 1994.
[34] E.-h. Yang and J.C. Kieffer. On the performance of data compression algorithms based upon string matching. IEEE Trans. Inform. Theory, 44(1):47–65, 1998.
[35] E.-h. Yang and Z. Zhang. On the redundancy of lossy source coding with abstract alphabets. IEEE Trans. Inform. Theory, 45(4):1092–1110, 1999.
[36] E.-h. Yang and Z. Zhang. The redundancy of source coding with a fidelity criterion – Part II: Coding at a fixed rate level with unknown statistics. Preprint, 1999.
[37] E.-h. Yang and Z. Zhang. The redundancy of source coding with a fidelity criterion – Part III: Coding at a fixed distortion level with unknown statistics. Preprint, 1999.
[38] B. Yu and T.P. Speed. A rate of convergence result for a universal D-semifaithful code. IEEE Trans. Inform. Theory, 39(3):813–820, 1993.
[39] Z. Zhang, E.-h. Yang, and V.K. Wei. The redundancy of source coding with a fidelity criterion – Part I: Known statistics. IEEE Trans. Inform. Theory, 43(1):71–91, 1997.
[40] J. Ziv. Coding of sources with unknown statistics – Part II: Distortion relative to a fidelity criterion. IEEE Trans. Inform. Theory, 18(3):389–394, 1972.