INFORMATION BOUNDS FOR THE RISK OF BAYESIAN PREDICTIONS AND THE REDUNDANCY OF UNIVERSAL CODES

Andrew Barron, Bertrand Clarke, and David Haussler
Yale Univ., Univ. British Columbia, and Univ. California at Santa Cruz

ABSTRACT: Several diverse problems have solutions in terms of an information-theoretic quantity for which we examine the asymptotics. Let Y_1, Y_2, ..., Y_N be a sample of random variables with distribution depending on a (possibly infinite-dimensional) parameter \theta. The maximum of the mutual information I_N = I(\theta; Y_1, Y_2, ..., Y_N) over choices of the prior distribution of \theta provides a bound on the cumulative Bayes risk of prediction of the sequence of random variables for several choices of loss function. This same quantity is the minimax redundancy of universal data compression and the capacity of certain channels. General bounds for this mutual information are given. A special case concerns the estimation of binary-valued functions with Vapnik-Chervonenkis dimension d_vc, for which the information is bounded by d_vc log N. For smooth families of probability densities with a Euclidean parameter of dimension d, the information bound is (d/2) log N plus a constant. The prior density proportional to the square root of the Fisher information determinant is the unique continuous density that achieves a mutual information within o(1) of the capacity for large N. The Bayesian procedure with this prior is asymptotically minimax for the cumulative relative entropy risk.

SUMMARY: A parameterized family of distributions P_{Y^N | \theta} is used to model a sequence of random variables Y^N = (Y_1, Y_2, ..., Y_N). For problems of data compression and on-line prediction we compare the performance that can be achieved when \theta is unknown to the performance that would be achieved if it were known. Entropy and probability of error, respectively, can be used to measure the performance. The relative entropy is used to bound the additional risk due to lack of knowledge of the parameter. If \theta were known, the best on-line prediction and compression of the sequence of variables Y_k would be available from the conditional distribution P_{Y_k | Y^{k-1}, \theta}. If \theta is unknown, these actions may be based on an estimate of the conditional distribution using the observed past. When a prior distribution is assigned to the parameter, Bayesian procedures use the distribution P_{Y_k | Y^{k-1}} obtained by averaging out the parameter. We examine the cumulative relative entropy distance between these predictive distributions. By the chain rule this quantity reduces to the relative entropy D_{N,\theta} = D(P_{Y^N | \theta} || P_{Y^N}) between the joint distributions of Y^N, with and without conditioning on \theta. In statistical terminology, D_{N,\theta} is the cumulative risk, when relative entropy is used as the loss function. Averaging with respect to the prior distribution of \theta yields the mutual information I_N as the (cumulative) Bayes risk. Maximizing the Bayes risk I_N with respect to the choice of the prior for \theta yields the information capacity C_N and determines the sequence of Bayes estimators of the conditional distribution that are minimax, i.e., that minimize the maximum value of D_{N,\theta}. In situations where determination of the exact asymptotics of I_N is not possible, bounds on I_N may be used to provide bounds on the minimax cumulative risk.
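To spell out the reduction just described, the chain rule for relative entropy gives the following identities. This is only a restatement of the quantities defined above; the symbol \pi for the prior distribution on \theta is our notation.

% Chain rule: the cumulative risk of the Bayesian predictive distributions
% equals the relative entropy between the joint distributions of Y^N.
\[
D_{N,\theta}
  \;=\; \sum_{k=1}^{N} \mathbb{E}\!\left[\,
        D\!\left(P_{Y_k \mid Y^{k-1},\,\theta} \,\big\|\, P_{Y_k \mid Y^{k-1}}\right)
        \,\middle|\, \theta \right]
  \;=\; D\!\left(P_{Y^N \mid \theta} \,\big\|\, P_{Y^N}\right),
\]
\[
I_N \;=\; \int D_{N,\theta}\, \pi(d\theta) \;=\; I(\theta;\, Y^N),
\qquad
C_N \;=\; \sup_{\pi}\, I(\theta;\, Y^N).
\]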
In universal noiseless coding of discrete random variables, the redundancy R_{N,\theta} of a code is the increase in the expected total codelength due to the lack of knowledge of the parameter value. For the code based on P_{Y^N}, the relative entropy D_{N,\theta} is the redundancy; the information I_N is the average redundancy; the information capacity C_N is the minimax redundancy; and the choice of the prior that achieves the capacity provides the minimax code (Davisson 1973, Davisson and Leon-Garcia 1980).

In the on-line prediction problem, we let the regret r_{N,\theta} be defined as the increase in the expected frequency of mistakes in predicting the values of the sequence, due to the lack of knowledge of the parameter value. The regret of the sequence of Bayesian predictions is bounded by

\[
r_{N,\theta} \;\le\; \left( 2 D_{N,\theta} / N \right)^{1/2}.
\]

Thus the regret converges to zero if the relative entropy is of smaller order than N. A tighter bound between r_{N,\theta} and D_{N,\theta} is possible if the sequence of conditional distributions satisfies an \alpha-separation property, that is, for some \alpha > 0, the difference between the first and second largest values of P(Y_k = y | Y^{k-1}, \theta) is never less than \alpha. In this case, the regret of the Bayesian predictions is shown to be bounded by r_{N,\theta} \le (2/\alpha) D_{N,\theta} / N. Averaging with respect to the prior yields the Bayes average regret

\[
r_N \;\le\; (2/\alpha)\, I_N / N.
\]
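One standard route to a bound of this square-root form combines Pinsker's inequality with Jensen's inequality. The sketch below is ours, under the assumption that both the known-\theta predictor and the Bayesian predictor output the most probable next value under their respective conditional distributions; it is not necessarily the argument used in the paper.

% Sketch (our derivation): at round k, predicting the mode of P_{Y_k | Y^{k-1}}
% instead of the mode of P_{Y_k | Y^{k-1}, \theta} increases the error
% probability by at most twice the total variation distance between the two
% conditionals.  Pinsker's inequality (TV \le \sqrt{D/2}) and Jensen's
% inequality (concavity of the square root), together with the chain rule
% above, then give
\[
r_{N,\theta}
 \;\le\; \frac{1}{N} \sum_{k=1}^{N}
   \mathbb{E}\!\left[\, 2\,\mathrm{TV}\!\left(P_{Y_k \mid Y^{k-1},\,\theta},\,
                        P_{Y_k \mid Y^{k-1}}\right) \,\middle|\, \theta \right]
 \;\le\; \frac{1}{N} \sum_{k=1}^{N}
   \mathbb{E}\!\left[\, \sqrt{\,2\, D\!\left(P_{Y_k \mid Y^{k-1},\,\theta}
         \,\big\|\, P_{Y_k \mid Y^{k-1}}\right)\,} \,\middle|\, \theta \right]
 \;\le\; \sqrt{\,\frac{2\, D_{N,\theta}}{N}\,}.
\]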
A basic role in the analysis of the asymptotics of the mutual information is played by the relative entropy D(P_{Y^N | \theta} || P_{Y^N | \theta'}) between the distributions at neighboring parameter points \theta and \theta'. It is shown that the mutual information is bounded by

\[
I_N \;\le\; \inf_{\Pi} \left\{ \bar{D}_N(\Pi) + H(\Pi) \right\},
\]

where the infimum is over partitions \Pi of the parameter space. Here \bar{D}_N(\Pi) is the average diameter of the cells of the partition as measured by the relative entropy distance and H(\Pi) is the entropy of the discrete random variable induced by the partition. This bound may be used to show that for certain "nonparametric" cases I_N is of order N^p with 0 < p < 1. We also give finite and infinite dimensional cases where I_N is of order log N. So the price for lack of knowledge of the parameter is small compared to the total entropy.
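As a heuristic illustration of how this bound can yield the (d/2) log N rate quoted in the abstract, consider a smooth family observed i.i.d. with a bounded d-dimensional parameter set; the local quadratic approximation below is our assumption, not a statement taken from the paper.

% Heuristic sketch (our assumptions): for i.i.d. observations from a smooth
% family, the relative entropy is locally quadratic in the parameter,
\[
D\!\left(P_{Y^N \mid \theta} \,\big\|\, P_{Y^N \mid \theta'}\right)
  \;=\; N\, D\!\left(P_{Y \mid \theta} \,\big\|\, P_{Y \mid \theta'}\right)
  \;\approx\; \frac{N}{2}\, (\theta - \theta')^{\top} I(\theta)\, (\theta - \theta'),
\]
% so a partition \Pi of the bounded parameter set into cubes of side N^{-1/2}
% has \bar{D}_N(\Pi) = O(1), while the number of cells is O(N^{d/2}), so that
% H(\Pi) \le \log(\text{number of cells}) = (d/2)\log N + O(1).  The partition
% bound then gives
\[
I_N \;\le\; \bar{D}_N(\Pi) + H(\Pi) \;\le\; \frac{d}{2}\,\log N + O(1).
\]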

In these bounds, we are permitted to have a sequence of exogenous input variables X_1, X_2, ..., X_N on which the distributions are conditioned. For example the Y_k may equal a function f_\theta(X_k) corrupted by noise. Of particular interest is the case that the Y_k variables are binary-valued and equal f_\theta(X_k) plus independent Bernoulli(\lambda) noise (modulo 2), where f_\theta(x) is in a given family of binary-valued functions of Vapnik-Chervonenkis dimension d_vc, and the noise rate satisfies 0 < \lambda < 1/2. Then for any prior distribution on \theta, the mutual information satisfies

\[
I_N \;\le\; d_{vc} \log\!\left( e N / d_{vc} \right).
\]

It follows that for the on-line Bayesian prediction of Y_1, Y_2, ..., Y_N the relative frequency of errors has an average that exceeds the noise level \lambda by not more than a multiple of (d_vc / N) log(N / d_vc). Likewise for universal data compression, the length of the Shannon code based on the Bayesian model for Y_1, Y_2, ..., Y_N, divided by the sample size N, has an average that exceeds the noise entropy h(\lambda) by not more than (d_vc / N) log(e N / d_vc).

Refined results are possible in the case of smooth parametric families of densities p(y | \theta) indexed by a finite-dimensional parameter vector \theta. Here Y_1, Y_2, ..., Y_N are assumed to be independent and identically distributed when conditioned on the parameter. An asymptotic expression for the mutual information I_N of the form (d/2) log N + c(p) + o(1) has been determined by Ibragimov and Hasminskii (1973), in which the constant c(p) is precisely determined as a function of the prior density p(\theta). (Somewhat stringent conditions are required for their result; see Efroimovich 1980 and Clarke 1989 for other formulations of the conditions.) Here d is the Euclidean dimension. A related asymptotic expression for D_{N,\theta} is given in Clarke and Barron (1990). This leads us to examine the asymptotics of the capacity C_N and the choices of prior distributions for \theta that asymptotically achieve this capacity. For each finite N the optimizing prior distribution is generally discrete (Berger and Bernardo 1989, Zhang and Hartigan 1992). Nevertheless, we show under general smoothness conditions that a unique continuous density p(\theta) achieves a value of I_N within o(1) of the capacity C_N. As conjectured by Bernardo (1979), it is Jeffreys' prior, i.e., the prior proportional to the square root of the determinant of the Fisher information matrix. No other prior (continuous or discrete) achieves an asymptotically larger value of the mutual information.

We give a further asymptotic decision-theoretic property of the optimal prior. Jeffreys' prior is shown to be asymptotically least favorable; that is, the minimax statistical risk \inf_P \max_\theta D_{N,\theta} (which also equals the capacity C_N) is achieved asymptotically by the Bayesian procedure with Jeffreys' prior, uniquely among continuous priors. Moreover, with this choice of prior, D_{N,\theta} is asymptotically independent of the parameter \theta, so that, in this case, the relative entropy D_{N,\theta}, the mutual information I_N, and the capacity C_N are asymptotically the same.
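As a concrete one-dimensional instance of the prior just described (our example, not taken from the paper), consider the Bernoulli(\theta) family.

% Example (ours): Jeffreys' prior for the Bernoulli family.  The Fisher
% information of Bernoulli(\theta) is I(\theta) = 1/(\theta(1-\theta)), so the
% prior proportional to \sqrt{\det I(\theta)} is the Beta(1/2, 1/2) density:
\[
p(\theta) \;\propto\; \sqrt{I(\theta)} \;=\; \frac{1}{\sqrt{\theta(1-\theta)}},
\qquad
p(\theta) \;=\; \frac{1}{\pi \sqrt{\theta(1-\theta)}}, \quad 0 < \theta < 1.
\]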

REFERENCES

J. O. Berger and J. M. Bernardo, "Ordered group reference priors with applications to multinomial and variance component problems." Purdue University, Department of Statistics Technical Report, 1989.
J. M. Bernardo, "Reference posterior distributions for Bayesian inference." Journal of the Royal Statistical Society, Ser. B, vol. 41, pp. 113-147, 1979.
B. S. Clarke and A. R. Barron, "Information theoretic asymptotics of Bayes methods." IEEE Transactions on Information Theory, vol. 36, no. 3, pp. 453-471, 1990.
B. S. Clarke, "Asymptotic cumulative risk and Bayes risk under entropy loss, with applications." Ph.D. Thesis, Department of Statistics, University of Illinois, 1989.
L. D. Davisson, "Universal noiseless coding." IEEE Transactions on Information Theory, vol. 19, pp. 783-795, 1973.
L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding minimax codes." IEEE Transactions on Information Theory, vol. 26, pp. 166-174, 1980.
S. Yu. Efroimovich, "Information contained in a sequence of observations." Problems of Information Transmission, vol. 15, pp. 176-189, 1980.
D. Haussler and A. R. Barron, "How well do Bayes methods work for on-line prediction of ±1 values?" To appear in Proc. Third NEC Symposium on Computation and Cognition, 1992.
I. A. Ibragimov and R. Z. Hasminskii, "On the information in a sample about a parameter." Second International Symposium on Information Theory, pp. 295-309, Akademiai Kiado, Budapest, 1973.
Z. Zhang and J. Hartigan, Department of Statistics, Yale University, personal correspondence, January 1992.