Convergence Rates in Nonparametric Bayesian Density Estimation

Willem Kruijer

© W.T. Kruijer, Amsterdam 2008. ISBN 9789086592180. Printed by PrintPartners Ipskamp, Enschede.

VRIJE UNIVERSITEIT

Convergence Rates in Nonparametric Bayesian Density Estimation

ACADEMIC DISSERTATION

for the acquisition of the degree of Doctor at the Vrije Universiteit Amsterdam, on the authority of the rector magnificus prof.dr. L.M. Bouter, to be defended in public before the doctoral committee of the Faculty of Exact Sciences on Friday 6 June at 13.45 in the aula of the university, De Boelelaan 1105

by

Willem Theodorus Kruijer, born in Eindhoven

promotor:

prof.dr. A.W. van der Vaart

Contents

1 Introduction . . . 1
  1.1 Mixtures . . . 2
  1.2 Asymptotics in Bayesian density estimation . . . 3
  1.3 Convergence rates in nonparametric models . . . 4
    1.3.1 Optimal rates . . . 4
    1.3.2 The test function approach . . . 7
    1.3.3 An information theoretic approach . . . 11
  1.4 Overview of the thesis . . . 14
  1.5 Publications . . . 14

2 Posterior Convergence Rates for Dirichlet Mixtures of Beta Densities . . . 15
  2.1 Introduction . . . 15
  2.2 Bernstein-Dirichlet prior . . . 19
  2.3 Adapted Bernstein-Dirichlet prior . . . 23
  2.4 Bivariate polynomials . . . 28
  2.5 Weakening the positivity requirement . . . 31
  2.6 Appendix . . . 33

3 Bayesian Density Estimation with Location-Scale Mixtures . . . 39
  3.1 Introduction . . . 39
  3.2 Location-scale mixtures of exponential kernels . . . 41
  3.3 Examples of priors on the mixing distribution . . . 48
    3.3.1 Priors for the locations . . . 48
    3.3.2 Priors on the weights . . . 50
  3.4 Mixtures of symmetric stable distributions . . . 54
  3.5 Appendix . . . 59

4 Deriving convergence rates from the information inequality: misspecification in nonparametric regression models . . . 65
  4.1 Introduction . . . 65
  4.2 Posterior convergence rates given independent observations . . . 66
  4.3 Density estimation . . . 71
    4.3.1 A parametric example . . . 71
    4.3.2 Beta mixtures . . . 72
  4.4 Regression . . . 74
    4.4.1 Fixed design . . . 74
    4.4.2 Random design . . . 78
  4.5 Misspecification . . . 81
  4.6 Regression under misspecification of the error distribution . . . 84
    4.6.1 Fixed design normal regression . . . 84
    4.6.2 Fixed design Laplace regression . . . 87
    4.6.3 Random design . . . 89
  4.7 The choice of {Γj} and the use of sieves . . . 90

5 Analyzing spatial count data, with an application to weed counts . . . 93
  5.1 Introduction . . . 93
  5.2 Modeling anisotropy in count data on a lattice . . . 95
    5.2.1 The Log-Poisson model . . . 95
    5.2.2 Introducing anisotropy . . . 96
    5.2.3 A bivariate model . . . 98
  5.3 Sampling from the posterior distribution . . . 99
  5.4 Application . . . 101
    5.4.1 Data . . . 101
    5.4.2 Results . . . 102
  5.5 Discussion . . . 105

Samenvatting (summary in Dutch) . . . 107
Dankwoord (acknowledgements) . . . 109
References . . . 110

Chapter 1

Introduction

In a statistical estimation problem we want to select a probability distribution p̂ from a model P of candidate distributions, where p̂ should appropriately approximate the data. After choosing a criterion that quantifies this appropriateness, the goal is to choose an estimation procedure that selects a p ∈ P with good values for this criterion. Obviously the choice will depend on the problem of interest: when estimating a regression function we may be interested in accurate prediction of the response at a particular point, or alternatively we may look for an estimator that performs well for all possible covariate values. Usually it is assumed that the data were actually generated by a certain distribution p0 contained in the model. The estimation problem can then be seen as the estimation of the unknown 'true' distribution p0. But even if p0 is not contained in the model, there often exists some p0* ∈ P which is in a certain sense close to p0, and a good estimator of p0* may still be very useful. Often the model is characterized by a parameter space, i.e. there exist a set Θ and a bijection θ ↦ pθ from Θ to P. A parameter can be a probability distribution itself (in which case pθ is the identity), or something characterizing a probability distribution, for instance a regression function. If Θ is finite dimensional, P is said to be a parametric model. Alternatively, the parameter space is infinite dimensional, typically a function space. The advantage of such models is that the assumption that the underlying distribution is contained in the model is more realistic. A disadvantage is that it is harder to find estimation procedures, and that it is harder to study certain properties of these procedures. This is true in particular for asymptotic properties, which describe the behavior of an estimator when the number of observations n tends to infinity. An asymptotic property studied in detail in this work is the convergence rate, which is the speed at which an estimator converges to the underlying distribution. We shall prove that various nonparametric density estimators converge at a certain rate. All the estimators we consider are Bayesian, i.e. they are based on a posterior distribution that depends on the data and a prior distribution on the parameter space, chosen beforehand. For many problems, a best possible or optimal rate is known to exist; this notion is discussed in more detail in section 1.3.1. No estimator, whether Bayesian or not, can improve upon this rate. However, to prove that Bayesian estimators converge at a certain rate, one uses different techniques than those used for frequentist estimators. In sections 1.3.2 and 1.3.3 we give an introduction.

1.1 Mixtures

An intuitive way to construct a flexible class of probability densities from a simple class is to use mixtures. Let {pθ} be a family of distributions indexed by θ ∈ Θ. If we first draw θ from a distribution F on Θ and, given θ, draw X from pθ, the resulting marginal density of X is called a mixed distribution or mixture. The distribution F is referred to as the mixing distribution. If F is continuous with density f, a continuous mixture

m(x) = ∫ f(θ) pθ(x) dθ

is obtained. If F is discrete, m is of the form Σ_{i∈A} wi pθi(x). If in addition the number of components k = |A| is finite, m is referred to as a finite mixture. Examples are given in figure 1.1, where Θ = R and {pθ : θ ∈ Θ} is the class of all normal distributions with variance 1. In figure 1.1(a) the mixing distribution consists of point masses (0.3, 0.5, 0.2) at the points (−1, 2, 6); in figure 1.1(b) the mixing distribution is uniform on [−1, 5].

Mixtures are used in a wide range of applications. If it can be assumed that the observations are drawn from a population consisting of different subpopulations, finite mixtures are a natural model. Of interest is estimation of the number of groups, the proportions of the different groups and the parameters θi characterizing the distributions of the groups. If k is known to be bounded, the estimation of these quantities is a parametric problem. Another use of mixtures lies in nonparametric density estimation, where the interest is in the mixed distribution itself. Consider the infinite dimensional model

P = { Σ_{i=1}^k wi pθi(x) : θi ∈ Θ, wi ≥ 0, Σ_{i=1}^k wi = 1, k ∈ N }    (1.1)

containing all finite mixtures. Densities of the form Σ_{i=1}^k wi pθi(·) can be used as an approximation to other densities p0 outside P. Under appropriate conditions on P and p0 this approximation becomes better for increasing k, and we could use the model to estimate p0. Obviously only a finite number of parameters can be estimated given finitely many observations.

It can be useful, or even necessary, to restrict P to a finite dimensional subset Pn, depending on the number of observations n. For example, P could be restricted to all mixtures for which k is at most n. If one is interested in the asymptotic behaviour of estimators, however, the sequence Pn, n → ∞, becomes important. As these sub-models typically converge to P, we can still speak of a nonparametric problem, at least in the asymptotic sense.

Figure 1.1: A finite mixture (a), with the (rescaled) mixing distribution, and a continuous mixture (b).
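As a concrete illustration of the finite mixture in figure 1.1(a), the following minimal sketch (added here for illustration; it is not part of the thesis) evaluates and samples from the mixture with weights (0.3, 0.5, 0.2) at locations (−1, 2, 6) and unit-variance normal kernels.

```python
import numpy as np
from scipy.stats import norm

# Finite normal mixture of figure 1.1(a): weights w_i, locations theta_i,
# each component a N(theta_i, 1) density.
weights = np.array([0.3, 0.5, 0.2])
locations = np.array([-1.0, 2.0, 6.0])

def mixture_density(x):
    """m(x) = sum_i w_i * phi(x - theta_i), the mixed density."""
    x = np.atleast_1d(x)
    return np.sum(weights[:, None] * norm.pdf(x[None, :], loc=locations[:, None]), axis=0)

def sample_mixture(n, rng=np.random.default_rng(0)):
    """Draw theta from the mixing distribution, then X | theta ~ N(theta, 1)."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(loc=locations[comp], scale=1.0)

print(mixture_density([-1.0, 2.0, 6.0]))  # density at the three component centers
print(sample_mixture(5))
```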

1.2 Asymptotics in Bayesian density estimation

In a large part of this thesis, we study convergence rates in nonparametric Bayesian estimation. After a brief discussion of asymptotic properties of estimators, the notion of convergence rates is discussed in more detail in the next section. Consider the Bayesian approach to the estimation of a probability density p0 based on i.i.d. observations X1, . . . , Xn. A prior distribution Π is defined on a model P of candidate densities, and estimates of p0 are based on the posterior distribution

Π(A | X1, . . . , Xn) = ∫_A ∏_{i=1}^n p(Xi) dΠ(P) / ∫ ∏_{i=1}^n p(Xi) dΠ(P),    (1.2)

where A is any measurable subset of the model. An important consideration for the choice of a model and prior is the asymptotic behavior of the resulting posterior. First of all, it is desirable that the posterior concentrates around p0 when the number of observations increases. This property is called consistency. More precisely, a model P ∋ p0 equipped with a prior Π and a semi-metric d is consistent if for all ε > 0

Π(B^c(p0, ε) | X1, . . . , Xn) = Π({p : d(p0, p) > ε} | X1, . . . , Xn) → 0

in P0^n-probability. If consistency holds, the next question is the rate at which the posterior concentrates around p0. The posterior converges to p0 with rate εn if, for some sufficiently large constant M,

Π(B^c(p0, M εn) | X1, . . . , Xn) → 0,    n → ∞,

in P0^n-probability. One can also consider other modes of convergence (e.g. almost sure), but unless otherwise stated any convergence statement will be 'in probability'. Finally, it is of interest to know whether the posterior converges to some limit distribution. For finite dimensional models, the Bernstein-von Mises theorem asserts that under certain conditions, the posterior distribution of √n(θ − θ0) converges in distribution to a normal distribution (see e.g. van der Vaart [55], p. 141). To what extent this result can be generalized to nonparametric models is the subject of much ongoing research.
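The posterior (1.2) is easiest to picture for a finite model. The following minimal sketch (an added illustration, not from the thesis; the candidate densities and sample size are arbitrary choices) computes the posterior weights of three candidate densities under a uniform prior; as n grows the mass concentrates on the candidate closest to the sampling density.

```python
import numpy as np
from scipy import stats

# Toy illustration of (1.2): a uniform prior on three candidate densities.
# The posterior weight of a candidate is proportional to prior mass times likelihood.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=100)      # i.i.d. sample from the 'true' p0 = N(2, 1)

candidates = [stats.norm(0, 1), stats.norm(2, 1), stats.norm(4, 1)]
prior = np.full(len(candidates), 1.0 / len(candidates))

log_lik = np.array([c.logpdf(data).sum() for c in candidates])   # log of prod_i p(X_i)
posterior = prior * np.exp(log_lik - log_lik.max())              # stabilized numerator of (1.2)
posterior /= posterior.sum()                                     # divide by the denominator of (1.2)
print(posterior)                                                 # concentrates near (0, 1, 0) as n grows
```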

1.3 Convergence rates in nonparametric models

There are various general results that can be used to prove that a posterior converges at a certain rate. After a brief discussion on optimal rates, section 1.3.2 gives an introduction to the results of Ghosal, Ghosh and van der Vaart [18]. In section 1.3.3 we consider an alternative information theoretic approach.

1.3.1 Optimal rates

Some of the estimators that we encounter in subsequent chapters are said to converge at the optimal rate. Informally speaking this means that no point estimator, whether Bayesian or frequentist, can converge at a faster rate. This also has implications for the asymptotic behavior of the posterior distribution, as point estimators are based on it. Consider for example the posterior mean p̄(x) = ∫ pθ(x) dΠ(θ | X). Applying Jensen's inequality with the square root, it can be seen that

h²(p0, p̄) = 2 − 2 ∫ √(p0(x) p̄(x)) dx ≤ ∫ h²(p0, pθ) dΠ(θ | X).

Hence the lower bound on the convergence rate of h²(p0, p̄) given in the example below is also a lower bound on the average posterior Hellinger distance between p0 and pθ.


Let f0 be a probability density contained in a nonparametric model P, and suppose we observe i.i.d. random variables X^(n) = (X1, . . . , Xn) with this density. Let f̂ = f̂(X^(n)) be any estimator of f0, with expected squared error E_{X^(n)∼f0^(n)} (f0(x0) − f̂(x0))² at the point x0. In what follows we focus on the squared error at a fixed point x0, although the arguments below apply more generally; see for example Devroye [9]. Yang and Barron [58] studied minimax rates from an information-theoretic perspective. In most parametric models there is an estimator for which the expected squared error is of order n^{−1}, which follows from the Bernstein-von Mises theorem. A nonparametric model, however, is so much 'larger' that it always contains functions for which the expected squared error converges at a slower rate, whatever estimator is used, i.e. there are constants C > 0 and β < 1 such that

Rn = inf_{f̂} sup_{f0 ∈ P} E_{X^(n)∼f0^(n)} (f0(x0) − f̂(x0))² ≥ C n^{−β} = an²    (1.3)

for every n. If this fails to hold for any sequence tending to zero at a slower rate, an is said to be the optimal rate or minimax rate. Formally, an is the optimal rate if for some positive constants c1 and c2,

0 < c1 ≤ lim inf_{n→∞} an^{−2} Rn ≤ lim sup_{n→∞} an^{−2} Rn ≤ c2 < ∞.

The existence of the constant c2 often follows from the fact that certain estimators actually achieve the rate an. We illustrate the existence of the lower bound using an example taken from Tsybakov ([54], p. 77-80). Let P be the Lipschitz class L(α, M)[−1, 1] (if α ≠ 1, this class is often referred to as a Hölder class). For α ∈ (0, 1], L(α, M)[−1, 1] is defined as the set of all functions g : [−1, 1] → R such that |g(u) − g(v)| ≤ M|u − v|^α for all u, v ∈ [−1, 1]. For α > 1, it contains all functions g : [−1, 1] → R that have ⌊α⌋ derivatives and satisfy |g^{(⌊α⌋)}(u) − g^{(⌊α⌋)}(v)| ≤ M|u − v|^{α−⌊α⌋} for all u, v ∈ [−1, 1]. For example, f0 ∈ L(1, M) if |f0(u) − f0(v)| ≤ M|u − v| for all u and v. It is a well-known fact that if P is the class of α-Lipschitz functions on a bounded interval, (1.3) holds with β = 2α/(2α + 1). This can be shown with the following argument.


For any sequences f1,n and f2,n contained in P = L(α, M)[−1, 1], and any estimator f̂,

sup_{f0 ∈ L(α,M)} E_{X^(n)∼f0^(n)} (f0(x0) − f̂(x0))²
≥ (1/2) [ E_{X^(n)∼f1,n^(n)} (f1,n(x0) − f̂(x0))² + E_{X^(n)∼f2,n^(n)} (f2,n(x0) − f̂(x0))² ]
≥ (1/2) ∫ [ (f1,n(x0) − f̂(x0))² + (f2,n(x0) − f̂(x0))² ] ( f1,n^(n)(x^(n)) ∧ f2,n^(n)(x^(n)) ) dλ(x^(n))
≥ (1/4) (f1,n(x0) − f2,n(x0))² ∫ ( f1,n^(n)(x^(n)) ∧ f2,n^(n)(x^(n)) ) dλ(x^(n))
≥ (1/8) (f1,n(x0) − f2,n(x0))² ( ∫ √( f1,n^(n)(x^(n)) f2,n^(n)(x^(n)) ) dλ(x^(n)) )²,    (1.4)

where we used Le Cam's inequality in the last step. Because, for any densities p1, p2 on R^n,

( ∫ √(p1 p2) )² = exp( 2 log ∫_{p1 p2 > 0} p1 √(p2/p1) ) ≥ exp( 2 ∫_{p1 p2 > 0} p1 log √(p2/p1) ) = exp( −D_KL(p1 ‖ p2) )

by Jensen's inequality, the last factor in (1.4) is at least exp( −n D_KL(f1,n ‖ f2,n) ). To show that (1.3) holds with β = 2α/(2α + 1), it suffices to find f1,n and f2,n in L(α, M)[−1, 1] such that (f1,n(x0) − f2,n(x0))² is of order n^{−2α/(1+2α)}, and D_KL(f1,n ‖ f2,n) is of order 1/n. For appropriate constants a, M and c0, we first define

ζ(x) = a [ exp{ −1/(1 − 9x²) } 1_{|3x| ≤ 1} − exp{ −1/(1 − 9(x − 2/3)²) } 1_{|3(x − 2/3)| ≤ 1} ],    (1.5)
ζn(x) = M hn^α ζ( (x − x0)/hn ),    where hn = c0 n^{−1/(1+2α)}.

Let f1,n(x) = (1/2) 1_{x ∈ [−1,1]} and f2,n(x) = f1,n(x) + ζn(x). If x0 ∈ (−1, 1) and n is sufficiently large, f2,n is a density on [−1, 1]. Moreover, (f1,n(x0) − f2,n(x0))² = (M/e)² c0^{2α} n^{−2α/(1+2α)}, and

(1/2) ‖f1,n − f2,n‖₁ = M hn^α ∫_{x0−hn/3}^{x0+hn/3} exp{ −1/(1 − 9((x − x0)/hn)²) } dx = M hn^{1+α} ∫_{−1/3}^{1/3} exp{ −1/(1 − 9x²) } dx.

Because −log(1 + u) ≤ −u + u²/2, the Kullback-Leibler divergence between f1,n and f2,n satisfies

D_KL(f1,n ‖ f2,n) = ∫_{−1}^{1} (1/2) log( (1/2) / ((1/2) + ζn(x)) ) dx = −∫_{−1/2}^{1/2} log( 1 + 2ζn(2x) ) dx
≤ ∫_{−1/2}^{1/2} ( 2ζn²(2x) − 2ζn(2x) ) dx ≤ 2 ∫_{−1}^{1} ζn²(x) dx = 2M² hn^{2α} ∫_{−1}^{1} ζ²( (x − x0)/hn ) dx
≤ 2M² hn^{2α+1} ∫ ζ²(u) du = 2M² c0^{2α+1} ( ∫ ζ²(u) du ) (1/n) ≲ 1/n.

We conclude that the right-hand side of (1.4) is bounded from below by a multiple of n^{−2α/(1+2α)}. It remains to show that f2,n is contained in L(α, M)[−1, 1]. For l = ⌊α⌋, f2,n^{(l)}(x) = M hn^{α−l} ζ^{(l)}( (x − x0)/hn ) and

| f2,n^{(l)}(x) − f2,n^{(l)}(x0) | = M hn^{α−l} | ζ^{(l)}( (x − x0)/hn ) − ζ^{(l)}(0) | ≤ M |x − x0|^{α−l},

provided that the constant a in (1.5) is sufficiently small. Hence f2,n ∈ L(α, M)[−1, 1].
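The two-point construction above can be checked numerically. The following minimal sketch (an added illustration, not from the thesis; the values a = 0.1, α = 1/2, M = c0 = 1, x0 = 0 and the grid are arbitrary choices) verifies that f2,n integrates to one and that n · D_KL(f1,n ‖ f2,n) stays roughly constant, i.e. the divergence decays like 1/n.

```python
import numpy as np

# Numerical check of the construction (1.5): f_{2,n} = f_{1,n} + zeta_n is a
# density on [-1, 1], and D_KL(f_{1,n} || f_{2,n}) is of order 1/n.
def zeta(x, a=0.1):
    out = np.zeros_like(x)
    m1 = np.abs(3 * x) < 1
    out[m1] += a * np.exp(-1.0 / (1.0 - 9.0 * x[m1] ** 2))
    m2 = np.abs(3 * (x - 2.0 / 3.0)) < 1
    out[m2] -= a * np.exp(-1.0 / (1.0 - 9.0 * (x[m2] - 2.0 / 3.0) ** 2))
    return out

alpha, M, c0, x0 = 0.5, 1.0, 1.0, 0.0
x = np.linspace(-1.0, 1.0, 200001)
dx = x[1] - x[0]
for n in (10**3, 10**4, 10**5):
    h = c0 * n ** (-1.0 / (1 + 2 * alpha))
    f1 = 0.5 * np.ones_like(x)
    f2 = f1 + M * h**alpha * zeta((x - x0) / h)
    kl = np.sum(f1 * np.log(f1 / f2)) * dx
    print(n, round(np.sum(f2) * dx, 6), n * kl)   # total mass ~ 1, n*KL roughly constant
```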

1.3.2 The test function approach

A first step towards results for convergence rates is to formulate sufficient conditions for consistency. Two classical consistency theorems are those of Doob [11] and Schwartz [50]. Doob's theorem from 1948 guarantees that under mild conditions on the model, consistency holds for Π-almost all p0 contained in the model. The exceptional null set can be problematic however, especially in nonparametric problems. Schwartz' 1965 theorem is more refined in the sense that it gives sufficient conditions for consistency at a specific point p0. The form stated below is taken from [31]. For a probability measure P on the sample space X and a measurable function f : X → R, we write Pf = ∫ f dP and Pn f = (1/n) Σ_{i=1}^n f(Xi).

Theorem 1.1 (Schwartz' consistency theorem). Let P be a model with a metric d, dominated by a σ-finite measure μ. Assume that P0 ∈ P, and let Π be a prior on P such that the following conditions hold.

(i) For every ε > 0,

Π( P ∈ P : P0 log(p0/p) ≤ ε ) > 0.    (1.6)

(ii) For every ε > 0, there exists a sequence of test functions φn : X^n → [0, 1] such that

P0^n φn → 0,    sup_{P : d(P,P0) > ε} P^n (1 − φn) → 0.    (1.7)

Then for every ε, Π(d(P, P0) ≥ ε | X1, . . . , Xn) → 0 P0-a.s. as n → ∞.

Informally speaking, the posterior converges to a point mass at P0 if the prior puts enough mass on distributions that are close to P0 in Kullback-Leibler sense, and if P0 can be appropriately 'distinguished' from other points in the model. This is the case if we can test P0 versus all P outside a ball of radius ε, and simultaneously control the errors. To extend this result to rates, it is necessary to quantify the speed at which the quantities in (1.6) and (1.7) converge to zero when ε decreases with rate εn. This is done in Theorem 1.2 below, which is a result by Ghosal, Ghosh and van der Vaart [18]. Lemmas 1.1 and 1.2 also appeared in this paper. Related results can be found in Shen


and Wasserman [52]. From this point we allow the prior to depend on n, and let d be the Hellinger distance h(p, q) = √( ∫ (√p − √q)² dx ). With regard to condition (1.7), Birgé [4] and Le Cam [39] proved several general results that can be applied here. For example, for any convex sets P0 and P1 of probability measures, there exist tests φn such that

sup_{P ∈ P0} P^n φn ≤ exp( n log( 1 − (1/2) h²(P0, P1) ) ),
sup_{P ∈ P1} P^n (1 − φn) ≤ exp( n log( 1 − (1/2) h²(P0, P1) ) ),

where h(P0, P1) is the minimum of h(P0, P1) over P0 ∈ P0 and P1 ∈ P1. This result illustrates the special role of the Hellinger distance, although a similar statement holds for any convex distance that is bounded by a multiple of the Hellinger distance. Note that the exponents are bounded by −n h²(P0, P1)/2. For any fixed pair P0, P1 in the model, we can therefore choose P0 = {P0} and P1 = {P : h(P, P1) < h(P0, P1)/2}, and obtain tests φn such that, for a constant K,

P0^n φn ≤ exp( −K n h²(P0, P1) ),    (1.8)
sup_{h(P,P1) < h(P0,P1)/2} P^n (1 − φn) ≤ exp( −K n h²(P0, P1) ).    (1.9)

The set {P : h(P, P0) > ε} appearing in (1.7) is not convex. This complication can be resolved by covering this set with Hellinger balls with distance to p0 at least ε; for each of these balls a test sequence satisfying (1.8) and (1.9) can be found. Combining these tests, we can obtain tests satisfying Schwartz' condition (1.7). To keep the error probabilities exponentially small, it is necessary to control the number of balls that are at a certain distance from P0. This is made rigorous in the lemma below, in which a bound on the packing number of the set {P : ε ≤ h(P, P0) ≤ 2ε} is assumed. The packing number D(ε, A, d) of a set A in a semi-metric space (S, d) is defined as the maximal number of points that can be contained in A such that the distance between every pair is at least ε. The collection of these points is called a maximal ε-separated set.

Lemma 1.1. Suppose that for some nonincreasing function D(ε), some sequence εn → 0, and every ε ≥ εn, D( ε/2, {P : ε ≤ h(P, P0) ≤ 2ε}, h ) ≤ D(ε). Then for every ε > εn there exist tests φn such that for every j ∈ N

P0^n φn ≤ D(ε) exp(−K n ε²) / ( 1 − exp(−K n ε²) ),    (1.10)
sup_{h(P,P0) > jε} P^n (1 − φn) ≤ exp(−K n ε² j²).    (1.11)

Proof. For a given j ∈ N, we cover {P : h(P, P0) > jε} with 'shells' Sr = {P : rε < h(P, P0) ≤ (r + 1)ε}, r ≥ j. Let Sr' be a maximal rε/2-separated set in Sr. Every P ∈ Sr is within distance rε/2 of some point in Sr', and by assumption, Sr' contains at most D(rε) points. For every P1 ∈ Sr' there exists a test ωn such that (1.8) and (1.9) hold; note that h(P0, P1)/2 in (1.9) is larger than rε/2. Let φn,r be the maximum of these tests, and let φn = max_{r≥j} φn,r. As D is nonincreasing,

P0^n φn ≤ Σ_{r≥j} Σ_{P1 ∈ Sr'} exp(−K n r² ε²) ≤ Σ_{r≥j} D(rε) exp(−K n r² ε²) ≤ D(ε) exp(−K n ε²) / ( 1 − exp(−K n ε²) ),

sup_{h(P,P0) > jε} P^n (1 − φn) ≤ sup_{r≥j} exp(−K n r² ε²) ≤ exp(−K n j² ε²).

Consequently, we have found tests such that (1.7) holds with exponentially small errors. To quantify the rate at which the prior mass in (1.6) decreases when εn → 0, it is convenient to assume that also P0 (log(p0/p))² is small.

Lemma 1.2. For every ε > 0 and probability measure Πn on the set

B_KL(p0, ε²) = { P : P0 log(p0/p) ≤ ε², P0 (log(p0/p))² ≤ ε² }    (1.12)

we have, for every C > 0,

P0^n( ∫ ∏_{i=1}^n (p/p0)(Xi) dΠn(P) ≤ exp( −(1 + C) n ε² ) ) ≤ 1/(C² n ε²).

Proof. Let Gn = √n (Pn − P0) be the empirical process. Then

P0^n( ∫ ∏_{i=1}^n (p/p0)(Xi) dΠn(P) ≤ exp( −(1 + C) n ε² ) )
≤ P0^n( (1/√n) Σ_{i=1}^n ∫ log(p/p0)(Xi) dΠn(P) ≤ −(1 + C) √n ε² )
≤ P0^n( ∫ Gn log(p/p0) dΠn(P) ≤ −√n (1 + C) ε² − √n ∫ P0 log(p/p0) dΠn(P) )
≤ P0^n( ∫ Gn log(p/p0) dΠn(P) ≤ −√n C ε² )
≤ var( ∫ log(p/p0) dΠn(P) ) / (C² n ε⁴)
≤ ∫ P0 (log(p0/p))² dΠn(P) / (C² n ε⁴) ≤ 1/(C² n ε²).

The first inequality follows from Jensen's inequality, applied to the logarithm. The third inequality follows from Fubini's theorem and the assumptions on the support of Πn. Chebyshev's inequality is used to obtain the fourth inequality. The next step follows from the fact that the variance is bounded by the second moment, together with an application of Jensen's inequality. Finally, we use the second moment assumption on the P's in the support of Πn.

Combining Lemmas 1.1 and 1.2, the following result can now be proved.

Theorem 1.2 (Ghosal, Ghosh and van der Vaart [18]). Suppose that for priors Πn on P and a sequence εn → 0 with n εn² → ∞, there are sets Pn ⊂ P and a constant C ≥ 1, such that

log D(εn, Pn, h) ≤ n εn²,    (1.13)
Πn(P \ Pn) ≤ e^{−(C+4) n εn²},    (1.14)
Πn( B_KL(p0, εn²) ) ≥ e^{−C n εn²}.    (1.15)

Then for sufficiently large M, Πn(p : h(p, p0) ≥ M εn | X1, . . . , Xn) → 0 in P0^n-probability.

Proof. Because packing numbers are nonincreasing in ε, condition (1.13) implies that

log D(ε/2, Pn, h) ≤ log D(εn, Pn, h) ≤ n εn²

for every ε ≥ 2εn. By Lemma 1.1, with j = 1, M ≥ 2, ε = M εn and D(ε) = e^{n εn²} constant in ε, there exist tests φn such that

P0^n φn ≤ exp{ n εn² − K n M² εn² } / ( 1 − exp{ −K n M² εn² } ),    (1.16)
sup_{P ∈ Pn, h(P,P0) > M εn} P^n (1 − φn) ≤ exp{ −K n M² εn² }.    (1.17)

If K M² − 1 > K, (1.16) implies that for large enough n,

P0^n Πn(P : h(P, P0) ≥ M εn | X1, . . . , Xn) φn ≤ P0^n φn ≤ 2 e^{−K n εn²}.

By Fubini's theorem and the fact that P0 (p/p0) ≤ 1 for any p ∈ P,

P0^n ∫_{Pn^c} ∏_{i=1}^n (p/p0)(Xi) dΠn(P) ≤ Πn(Pn^c).

Combining this with (1.17), it follows that

P0^n ∫_{h(P,P0) > M εn} ∏_{i=1}^n (p/p0)(Xi) dΠn(P) (1 − φn)
≤ Πn(Pn^c) + ∫_{P ∈ Pn, h(P,P0) > M εn} P^n (1 − φn) dΠn(P) ≤ Πn(Pn^c) + e^{−K n M² εn²} ≤ 2 e^{−(C+4) n εn²},

provided that M ≥ √((C + 4)/K). Let An be the event that

∫ ∏_{i=1}^n (p/p0)(Xi) dΠn(P) ≥ e^{−2 n εn²} Πn( B_KL(p0, εn²) ) ≥ e^{−(C+2) n εn²}.

Applying Lemma 1.2 with Πn the prior restricted to the set B_KL defined in (1.12), it follows that if C ≥ 1, P0^n(An) → 1. For the posterior mass outside the Hellinger ball around p0 we find that

P0^n Πn(p : h(p, p0) > M εn | X1, . . . , Xn) (1 − φn) 1_{An}
= P0^n ( ∫_{h(P,P0) > M εn} ∏_{i=1}^n (p/p0)(Xi) dΠn(P) / ∫ ∏_{i=1}^n (p/p0)(Xi) dΠn(P) ) (1 − φn) 1_{An}
≤ e^{(C+2) n εn²} · 2 e^{−(C+4) n εn²} → 0.

Because P0^n (1 − φn) 1_{An} is bounded by 1, this implies that Πn(p : h(p, p0) > M εn | X1, . . . , Xn) converges to zero in P0^n-probability.

Conditions (1.13) and (1.14) require the existence of increasing subsets Pn that receive

most of the prior mass, and whose packing numbers do not increase faster than e^{c1 n εn²}. These Pn are often referred to as sieves. In the simplest case, (1.13) holds with Pn = P for all n, and (1.14) becomes redundant. In the following chapters, however, Theorem 1.2 is applied to finite mixtures, where the choice of these sieves is important. It is convenient to define Pn as the set of all finite mixtures with at most kn components, for an appropriately chosen sequence kn → ∞. The prior on the number of components k needs to be chosen such that the prior mass on mixtures with more than kn components is exponentially small. At the same time, mixtures with a large number of components allow for a better approximation to p0, and in view of condition (1.15), the prior should put enough mass on large k. Consequently, Theorem 1.2 gives a good convergence rate if one can realize an appropriate trade-off between the 'prior-mass' condition (1.15) and the 'entropy' conditions (1.13) and (1.14). Theorem 1.2 has been refined in various directions. In Ghosal and van der Vaart [21] the entropy condition is relaxed by allowing infinite covers, in Ghosal and van der Vaart [20] the theorem is extended to various non-i.i.d. models, and Kleijn and van der Vaart [32] obtain convergence rates for misspecified models.
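The packing numbers appearing in the entropy conditions above can be pictured in a very simple case. The following small sketch (an added illustration, not part of the thesis; the grid and tolerances are arbitrary) counts a maximal ε-separated subset of [0, 1] under the absolute-value metric, whose size grows like 1/ε.

```python
import numpy as np

# Greedy epsilon-packing of [0, 1]: the size of a maximal eps-separated set is
# the packing number D(eps, [0,1], |.|), here of order 1/eps.
def greedy_packing(points, eps):
    """Return a greedily built eps-separated subset of the given points."""
    packed = []
    for x in points:
        if all(abs(x - y) >= eps for y in packed):
            packed.append(x)
    return packed

grid = np.linspace(0.0, 1.0, 10001)
for eps in (0.5, 0.1, 0.01):
    print(eps, len(greedy_packing(grid, eps)))   # roughly 1/eps + 1 points
```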

1.3.3 An information theoretic approach

An alternative method to obtain convergence rates for Bayesian estimators has recently been proposed by Zhang in [59] and [60]. The following well known inequality forms the starting point of this approach.


Lemma 1.3 (Information inequality). For all probability densities p1 and p2 on a measure space (X, λ),

D_KL(p1 ‖ p2) = ∫ p1(x) log( p1(x)/p2(x) ) dλ(x) ≥ 0.

Note that the Kullback-Leibler divergence is only defined if p1 ≪ p2, and that for y > 0, 0 · log(0/y) := lim_{x→0} x · log(x/y) = 0.

Proof. Let X be a random variable with density p2, and let Y = p1(X)/p2(X). Then

D_KL(p1 ‖ p2) = E[Y log Y] ≥ (EY) log(EY) = 0,

by the convexity of the function x ↦ x log x and Jensen's inequality.

This inequality can be applied to densities on a space of probability distributions P = {Pθ : θ ∈ Γ} indexed by θ ∈ Γ. For example, Pθ could be the law of Y = f(X) + e, where the error and covariate distributions are known and the regression function f is contained in some function space Γ. Let Π be a probability distribution on Γ.

Lemma 1.4. Let w(θ, X) be a data-dependent density relative to Π. For any measurable function f : X^n × Γ → R and all α ∈ R,

∫ w(θ, X) f(θ, X) dΠ(θ) − α ∫ w(θ, X) log E_X e^{f(θ,X)} dΠ(θ)
≤ log ∫ exp{ f(θ, X) − α log E_X e^{f(θ,X)} } dΠ(θ) + D_KL(w dΠ ‖ dΠ).

Proof. If ∫ exp{ f − α log E_X e^f } dΠ = ∞, the result is trivial. If ∫ exp{ f − α log E_X e^f } dΠ < ∞, we can use that

v = exp{ f − α log E_X e^f } / ∫ exp{ f − α log E_X e^f } dΠ
Our main results are valid for the Hellinger metric

h(f, g) = √( ∫ (√f − √g)²(x) dx ),

but also for the L1- and L2-metrics. The unit simplex in R^k is the set

∆k = { (x1, . . . , xk) : 0 ≤ xi ≤ 1, i = 1, . . . , k, Σ_{j=1}^k xj = 1 }.

If the random measure F is distributed according to the Dirichlet process prior with base measure with cumulative distribution function u on [0, 1], then the vector given in (2.3) possesses a Dirichlet distribution on ∆k with parameter vector ( u(1/k), u(2/k) − u(1/k), . . . , u(1) − u((k − 1)/k) ). For α ∈ (0, 1], a function f : [0, 1] → R is α-smooth if there exists a constant Lα(f) such that |f(x) − f(y)| ≤ Lα(f) |x − y|^α for all x, y ∈ [0, 1]. A function f on [0, 1] × [0, 1] is α-smooth if there exists a constant Lα(f) such that |f(x1, y1) − f(x2, y2)| ≤ Lα(f) ‖x − y‖^α for all x = (x1, y1), y = (x2, y2) ∈ [0, 1] × [0, 1], where ‖·‖ denotes the Euclidean norm. For α ∈ (1, 2], f is α-smooth if and only if it is differentiable and its derivative f′ is (α − 1)-smooth. In the two-dimensional case, f is α-smooth if both partial derivatives are (α − 1)-smooth. The class of all α-smooth functions is denoted C^α. The constants Lα(f) appearing below will be understood to be the minimal constants satisfying the inequalities. For vectors w ∈ R^k, ‖w‖1 and ‖w‖∞ denote the usual l1- and l∞-norms.

2.2 Bernstein-Dirichlet prior

Consider the prior distribution on densities obtained by assigning k and F in the mixture of beta-densities of the type (2.2) independently a prior ρ and a Dirichlet process prior. Thus the prior on the density of the observations is concentrated on the space ∪_{k=1}^∞ Bk, for Bk the set of Bernstein densities of order k given by

Bk = { Σ_{j=1}^k wj β(·; j, k + 1 − j) : w ∈ ∆k }.    (2.5)
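To make the Bernstein densities in (2.5) concrete, here is a minimal numerical sketch (added for illustration, not part of the thesis). It assumes, in line with (2.3), that the weight vector wk(F) collects the increments F(j/k) − F((j − 1)/k) of a distribution function F on [0, 1]; the particular F, the order k = 20 and the renormalization are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import beta, norm

def bernstein_density(x, k, w):
    """b(x; k, w) = sum_{j=1..k} w_j * Beta(j, k+1-j) density at x, cf. (2.5)."""
    j = np.arange(1, k + 1)
    return np.sum(w[:, None] * beta.pdf(x[None, :], j[:, None], (k + 1 - j)[:, None]), axis=0)

# Illustrative F: a normal cdf; its increments over a grid of [0,1] give weights.
F = lambda t: norm.cdf(t, loc=0.5, scale=0.2)
k = 20
j = np.arange(1, k + 1)
w = F(j / k) - F((j - 1) / k)
w = w / w.sum()                        # renormalize so that w lies in Delta_k

x = np.linspace(0.0, 1.0, 5)
print(bernstein_density(x, k, w))      # values of the Bernstein density on [0, 1]
```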

Theorem 2.1. Let the true density f0 be strictly positive and α-smooth for α ∈ (0, 2]. Assume that B1 e^{−β1 k} ≤ ρ(k) ≤ B2 e^{−β2 k} for all k ≥ 1 and some positive constants B1, B2, β1, β2, and assume that the base measure of the Dirichlet process prior possesses a continuous, strictly positive density on [0, 1]. Then, for a sufficiently large constant M,

Π( f : d(f, f0) > M (log n)^{(1+2α)/(2+2α)} / n^{α/(2+2α)} | X1, . . . , Xn ) → 0.

For α = 2 the convergence rate is n^{−1/3} (log n)^{5/6}, as was found by Ghosal [17]. The rate is slower for smaller α. For C¹-functions the rate is n^{−1/4} (log n)^{3/4}. The proof is based on Theorem 1 of Ghosal, Ghosh and van der Vaart [18], quoted in the appendix as Theorem 2.4, and on the following approximation result. It is a straightforward extension of the bound for α = 2 contained in the proof of Theorem 2.3 in [17].

Lemma 2.1. For f ∈ C^α and α ∈ (0, 2] define Mα(f) to be Lα(f) if α ≤ 1 and to be Lα−1(f′) + ‖f′‖∞ if 1 < α ≤ 2. Then ‖f − b(·; k, wk(F))‖∞ ≤ 3 Mα(f) (1/k)^{α/2} for any α ∈ (0, 2], where F is a primitive function of f and wk(F) is given in (2.3).

Proof. It can be verified that for a fixed x

b(x; k, wk(F)) = k E[ F((Y + 1)/k) − F(Y/k) ],    (2.6)

for a binomial random variable Y with parameters k − 1 and (success probability) x. First we assume that α ∈ (0, 1]. By the mean value theorem, F((y + 1)/k) − F(y/k) = k^{−1} f(ξy) for some ξy ∈ [y/k, (y + 1)/k]. Consequently, by (2.6),

| f(x) − b(x; k, wk(F)) | = | f(x) − E f(ξY) | ≤ E| f(x) − f(ξY) | ≤ Lα(f) E|x − ξY|^α ≤ Lα(f) ( E|x − ξY| )^α,    (2.7)

by Jensen's or Hölder's inequality. Since |ξY − Y/k| ≤ k^{−1} and E(Y/k) = x − x/k,

E|x − ξY| ≤ 1/k + E|x − Y/k| ≤ (1 + x)/k + E| E(Y/k) − Y/k | ≤ (1 + x)/k + √( var(Y/k) ) ≤ (1 + x)/k + 1/(2√k).

The assertion of the lemma for α ∈ (0, 1] follows upon inserting this bound in the right side of (2.7). If α ∈ (1, 2], we can use Taylor's theorem with integral remainder to obtain

k E[ F((Y + 1)/k) − F(Y/k) ] = E f(Y/k) + E (1/k) ∫_0^1 f′( Y/k + s/k ) (1 − s) ds.    (2.8)

The second term on the right is bounded above by (2k)^{−1} ‖f′‖∞. For every t and h there exists, by the mean value theorem, some ξ ∈ [0, 1] such that

| f(t + h) − f(t) − h f′(t) | = | h f′(t + ξh) − h f′(t) | ≤ |h| Lα−1(f′) |ξh|^{α−1} ≤ Lα−1(f′) |h|^α,    (2.9)

since f′ is (α − 1)-smooth. Substituting t = (k − 1)x/k and h = Y/k − (k − 1)x/k, we find

| f(Y/k) − f( (k−1)x/k ) − ( Y/k − (k−1)x/k ) f′( (k−1)x/k ) | ≤ Lα−1(f′) | Y/k − (k−1)x/k |^α.    (2.10)

Hölder's inequality gives

E| Y/k − (k−1)x/k |^α ≤ ( E( Y/k − (k−1)x/k )² )^{α/2} = ( var(Y/k) )^{α/2} ≤ ( 1/(4k) )^{α/2}.    (2.11)

21

Combining (2.6), (2.8), (2.10) and (2.11), and using that E(Y /k − (k − 1)x/k) = 0, we see that  (k − 1)x   (k − 1)x   Y  1 0 − Ef kf − b(·; k, wk (F ))k∞ ≤ sup f (x) − f kf k∞ + sup f + k k k 2k x x  1 α/2 1 0 1 + ≤ L1 (f ) + Lα−1 (f 0 ) kf k∞ . k 4k 2k This concludes the proof. The slow convergence of (derivatives of) Bernstein polynomials is the price paid for their nice shape-preserving properties; see DeVore and Lorentz [8](sections 10.2-10.4). Lemma 2.2. For every w ∈ Rk , kb(·; k, w)k∞ ≤ kkwk∞ ≤ k kwk1 . Proof. The left side is equal to   k k−1 X X  k − 1 k − 1 j−1 wj k sup x (1 − x)k−j ≤ k max |wj | sup xi (1 − x)k−1−i . j j − 1 i x x j=1 i=0 This is equal to kkwk∞ ≤ k kwk1 by the binomial formula. Proof. of Theorem 2.1. We use Theorem 2.4, stated in the appendix. Abbreviate wk0 = wk (F0 ). We let M1 , M2 , C1 , C2 , D1 , D2 , . . . be suitable constants that are fixed in the proof. We verify conditions (2.30) and (2.31) for ˜n = (log n/n)α/(2+2α) . For the verification 2/α

of (2.31) we let kn be the biggest integer smaller than M2 /˜ n

= M2 (n/ log n)1/(1+α) , for

some M2 > 1. According to Lemma 2.2, kb(·; kn , wkn )−b(·; kn , wk0n )k∞ ≤ kn kwkn − wk0n k1 . 1+2/α

Therefore, if kwkn − wk0n k1 ≤ ˜n

, then by Lemma 2.1,

kf0 − b(·; kn , wkn )k∞ ≤ kf0 − b(·; kn , wk0n )k∞ + kb(·; kn , wkn ) − b(·; kn , wk0n )k∞ ≤

M1 kn

α/2

M1 n = D1 ˜n . + kn ˜1+2/α ≤ 2( p α + M2 )˜ n M2

(2.12)

In particular, the function b(·; kn , wkn ) is bounded away from zero for kn sufficiently  large. Using Lemma 2.7 we conclude that h f0 , b(·; kn , wkn ) ≤ D2 ˜n , and next application of Lemma 2.10 gives that b(·; kn , wkn ) ∈ KL(f0 , C1 ˜n ). The prior mass of KL(f0 , C1 ˜n ) can now be lower bounded using Lemma 2.8 with  =

1 1+2/α ˜n . 2

This

yields that  Πn KL(f0 , C1 n ) ≥ ρ(kn )Π(wkn : kwkn − wk0n k1 ≤ ˜1+2/α ) n 1+2/α

≥ B1 e−β1 kn Ce−ckn log(2/˜n

)

≥ B1 Ce−c1 kn log n  1 ≥ B1 C exp −c2 n 1+α (log n)α/(1+α) = B1 C exp(−c2 n˜ 2n ).

(2.13)

22

2. Posterior Convergence Rates for Dirichlet Mixtures of Beta Densities

1+2/α

Note that the condition  ≤ 1/(M k) in Lemma 2.8 is satisfied, since  = 12 ˜n = 1+α/2 1 , which is smaller than a multiple of 1/kn . Furthermore, the parameter 2 M2 /kn of the Dirichlet distribution in Lemma 2.8 is bounded below by a multiple of 1/kn by the assumption that the base measure has a continuous positive density, that is bounded below by b for some b. To verify (2.29) and (2.30), we define integers sn such that, for some constants L1 and L2 , L1 n1/(1+α) (log n)α/(1+α) ≤ sn ≤ L2 n1/(1+α) (log n)α/(1+α) .

(2.14)

n Furthermore, we set Fn := ∪si=1 Bi , where Bk are the beta mixtures of degree k given in

(2.5). First we check condition (2.30): Π(Fn c ) ≤

∞ X

ρ(r) ≤ B3 e−β2 sn ≤ B3 e−β2 L1 n

1/(1+α)

(log n)α/(1+α)

,

r=sn +1

in which the exponent is of the same order as n˜ 2n = n1/(1+α) (log n)α/(1+α) . Comparing β2 L1 with the constant in (2.13), we conclude that condition (2.30) is satisfied if c2 +4 ≤ β2 L1 , for c2 given in (2.13). This can be achieved by choosing L1 sufficiently large. We verify (2.29) for ¯n = (log n)(1+2α)/(2+2α) /nα/(2+2α) , which is the same as ˜n apart from a higher power log-factor. Because the Hellinger distance h between densities is bounded above by the square root of the L1 -distance, which is bounded above by the uniform norm, we obtain, with the help of Lemma 2.2, D(¯ n , Fn , h) ≤

sn X

D(¯ n , Br , h) ≤

r=1

sn X

D(¯ 2n /r, ∆r , k · k1 )

(2.15)

r=1



sn  X 5r r−1 r=1

¯2n

≤ sn

5sn sn , ¯2n

where the third inequality follows by Lemma 2.9. It follows that  5s  n log D(¯ n , Fn , h) ≤ log sn + sn log 2 ≤ C1 sn log n ≤ c1 n¯ 2n , ¯n

(2.16)

for sufficiently large constants C1 and c1 . Thus all conditions of Theorem 2.4 are satisfied, whence the rate of convergence for the Hellinger distance is ˜n ∨ ¯n = ¯n . To obtain the same result for the L1 - or L2 -distances it suffices to revisit the entropy calculation. For the L1 -distance the calculation is identical, except that we may replace ˜2n by ˜n . For the L2 -distance we use that the densities in Bk are bounded uniformly by k, in view of Lemma 2.2, whence the densities in Fn are bounded above by sn . Therefore, for any f, g ∈ Fn , kf − gk22 =

Z p p √ √ ( f − g)2 (x) ( f + g)2 (x) dx ≤ sn h2 (f, g).

2.3. Adapted Bernstein-Dirichlet prior

23

√ Consequently, the -entropy of Fn for the L2 -norm is bounded above by the / sn entropy of Fn for the Hellinger distance. Now replacing ¯ in (2.15) by ¯/sn has an effect on the size of the constant c1 in (2.16) only, and hence the theorem extends to the L2 -distance.

2.3

Adapted Bernstein-Dirichlet prior

The rate of concentration of the posterior distribution in section 2.2 is, up to logarithmic factors, equal to n−1/(2+2α) rather than n−α/(1+2α) , which is known to be the optimal estimation rate for α-smooth densities. The main reason for the slow convergence in

 Theorem 2.1 is that the error f0 − b ·; k, w(F0 ) ∞ is of the order (1/k)α/2 , compared to (1/k)α for the minimax polynomial of degree k. In this section we show that the convergence rate can be improved by reducing the dimension of the Bernstein approximation b. That this dimension can indeed be reduced, is suggested by the following informal argument. The class Bk in (2.5) contains mixtures of the k densities β(·; i, k + 1 − i), for i = 1, . . . , k. For k odd the sum of the standard deviations of the corresponding distributions is equal to s s   (k−1)/2 k X X i(k + 1 − i) 2 k+1 k+1 p = − i + i (k + 1)2 (k + 2) 2 2 (k + 1)2 (k + 2) i=0 i=1 2 =p (k + 1)2 (k + 2)

(k−1)/2

(k−1)/2 X k + 1 X √ 2 − i = p i = O( k). 2 (k + 1)2 (k + 2) i=0 i=0

Hence for k → ∞, there is increasing overlap in the terms of the mixture, in particular √ for the densities close to the center of the unit interval. The O( k) suggests that the √ same quality of approximation might be achieved with only k components. Consider the prior distribution on densities obtained by assigning l and F in the mixture of beta-densities of the type (2.4) independently priors ρ and a Dirichlet process prior. Theorem 2.2. Let the true density f0 be strictly positive and α-smooth for some α ∈ (0, 1]. Assume that B1 e−β1 l ≤ ρ(l) ≤ B2 e−β2 l for all l ≥ 1 for some positive constants B1 , B2 , β1 , β2 and assume that the base measure of the Dirichlet process possesses a continuous, stictly positive density on [0, 1]. Then, for a sufficiently large constant M ,   (log n)(1+4α)/(2+4α) EΠ f : d(f, f0 ) > M | X , . . . , X 1 n −→ 0. nα/(1+2α) If f ∈ C α for some α > 1, then of course it is contained in C 1 and hence the convergence rate is at least (log n)5/6 n−1/3 , which is the rate obtained in Theorem 2.1 for

24

2. Posterior Convergence Rates for Dirichlet Mixtures of Beta Densities

2-smooth densities. Thus the rate of posterior convergence for the coarsened Bernstein prior is faster for any α < 2 and never worse. For 0 < α ≤ 1 the rate is optimal up to a logarithmic factor. The results of Theorems 2.1 and 2.2 are summarized in figure 2.2 ,

0.0

0.1

0.2

0.3

0.4

where the obtained convergence rates are plotted along with the minimax rate.

0.0

0.5

1.0

1.5

2.0

Figure 2.2: Let f0 be α-smooth and strictly positive. Apart from logarithmic factors, the rate obtained in Theorem 1 is n−α/(2+2α) = n−β1 (α) for α ∈ (0, 2], the rate obtained in Theorem 2 is n−α/(1+2α) = n−β2 (α) for α ∈ (0, 1] and n−1/3 = n−β2 (α) for α ∈ (1, 2], while the minimax rate is n−α/(1+2α) = n−β3 (α) for all α ∈ (0, 2]. The graph displays β1 (α) (dotted), β2 (α) (dashed) and β3 (α) (solid) as functions of α. The key to the proof of the preceding theorem is an approximation result for the coarsened Bernstein polynomial. For k a square of a natural number and w ∈ ∆√k define



˜b(x; k, w) =

k X



i k X

√ i=1 j=(i−1) k+1

1 √ wi β(x; j, k + 1 − j). k

(2.17)

For k = l2 and wi = F (i/l) − F ((i − 1)/l) this yields the expression in (2.4). The “classical” Bernstein approximation and the coarsened approximation (2.17) are mixtures of the same set of k beta-densities, which are polynomials of degree k. However, compared to the “classical” Bernstein approximation b(x; k, F ), the dimension of the √ space of functions ˜b(·; k, w) is reduced from k to k. The following lemma shows that the approximation error for α-smooth functions remains of the same order if α ∈ (0, 1] . √ It is assumed that k is integer; at the end of this section we show how ˜b can be defined if this is not the case.

2.3. Adapted Bernstein-Dirichlet prior

25

 Lemma 2.3. For α ∈ (0, 1] and f ∈ C α we have f −˜b ·; k, w√k (F ) ∞ ≤ 5Lα (f )(1/k)α/2 , where F is a primitive function of f and wk (F ) is given in (2.3). √ √ √ √ Proof. Define a function φk by φk (y) = i k if (i − 1) k ≤ y < i k, for i = 1, . . . , k. It is easy to verify that, for Y a binomial variable with parameters k − 1 and x, √ i   h  √ ˜b(x; k, w√ (F )) = k E F φk (Y ) − F φk (Y ) − k . (2.18) k k k

In view of Lemma 2.1 it suffices to show that b(·; k, wk (F )) − ˜b(·; k, w√k (F )) ∞ ≤ 2Lα (f )(1/k)α/2 . By Taylor’s theorem with integral remainder, Z 1  h Y + 1  Y i Y s k F −F = f + ds, k k k k 0 √ √ Z 1   φ (Y ) − k i √ h  φk (Y )  s  φk (Y ) − k k + √ ds. k F −F = f k k k k 0 By (2.6) and (2.18), for all x ∈ [0, 1],  φ (Y ) − √k Y s  s  k E f + √ −f + ds k k k k 0 √ Z 1 s s α Y φk (Y ) − k ≤ Lα (f ) E + √ − − ds k k k k 0 Z 1 √ α α 2 Lα (f ) ≤ Lα (f ) , E 2/ k ds ≤ k α/2 0 √ √ √ √ since by construction |y −(φk (y)− k)| ≤ k for every y, and |s/ k −s/k| ≤ 1/ k. Z b(x; k, wk (F )) − ˜b(x; k, w√k (F )) ≤

1

The preceding lemma does not extend to α ∈ (1, 2]. In fact, for bigger α the coarsening may lead to a deteriorated approximation. For example, if F (x) = x2 and f (x) = 2x, then h  √k − 1 i  k − 1 i √ h 1 ˜b(1; k, w√ (F )) = √1 F (1)−F √ β(1; k, 1) = k F (1)−F = 2− √ . k k k k k It follows that the approximation error of the coarsened Bernstein polynomial is equal √ to |f (1) − ˜b(1; k, w√ (F ))| = 1/ k, bigger than the approximation error 1/k of the k

ordinary Bernstein polynomial. As shown in Theorem 2.2 this has no bad consequences for the posterior rate. The essential difference with the construction of Theorem 2.1 lies in the fact that  Πn KL(f0 , C1 n ) on the left hand side of (2.13) can now be lower bounded using a set of lower dimension. Hence the factor ρ(kn ) on the first line of (2.13) can be replaced by ρ(ln ). On the third line of (2.13) the number kn in the exponent will therefore become √ ln = kn , with a possibly different additional logartihmic factor. Recall that condition (24) requires that this exponent is at least of the order −n˜ 2n . Because we only need to

26

2. Posterior Convergence Rates for Dirichlet Mixtures of Beta Densities

have ln ≤ n˜ 2n (instead of kn ≤ n˜ 2n ), ˜n can tend to zero faster than it did in Theorem 1. In the following proof we make this informal argument precise. Proof. of Theorem 2.2. We again use Theorem 2.4. Let M1 , M2 , D1 , D2 , . . . be suitable constants that are fixed in the proof. α/(1+2α) We verify conditions (2.30) and (2.31) for ˜n = log n/n , and let ln denote √ √ 1/α 1/(1+2α) n = M2 (n/ log n) , for some M2 > 1. the biggest integer smaller than M2 /˜ 2 2 0 2 ˜ ˜ Lemma 2.2 implies that kb(·; l , wl ) − b(·; l , w )k∞ ≤ l kwl − w0 k1 . It follows n

n

n

n

ln 1+2/α

ln

n

with the help of Lemma 2.3 that if kwln − wl0n k1 ≤ ˜n , then





f0 − ˜b(·; ln2 , wln ) ≤ f0 − ˜b(·; ln2 , wl0 ) + ˜b(·; ln2 , wln ) − ˜b(·; ln2 , wl0 ) n n ∞ ∞ ∞ ≤

M1 M1 + ln2 ˜1+2/α ≤ 2( p α + M2 )˜ n = D1 ˜n . n α ln M2

In particular, the function ˜b(·; ln2 , wln ) is bounded away from zero for ln sufficiently large, and by Lemma 2.7, h(f0 , ˜b(·; l2 , wl )) ≤ D2 ˜n . An application of Lemma 2.10 gives that n

n

˜b(·; l2 , wl ) ∈ KL(f0 , C1 ˜n ). The prior mass on KL(f0 , C1 ˜n ) can now be lower bounded n n 1+2/α

using Lemma 2.8 with  = 21 ˜n Πn

, yielding  1+ 2 KL(f0 , C1 ˜n ) ≥ ρ(ln )Π(wln : kwln − wl0n k1 ≤ ˜n α ) 1+2/α

≥ B1 e−β1 ln Ce−cln log(2/˜n

)

(2.19)

≥ B1 Ce−c1 ln log n = B1 C exp(−c2 n˜ 2n ) The condition  ≤

1 Mk

1+2/α

in Lemma 2.8 is satisfied, since  = 12 ˜n

= 21 (M2 /ln )2+α =

O( l1n ). Furthermore, the assumption that the base measure possesses a positive, continuous density implies that the parameters of the Dirichlet distribution are bounded below by a power of . n We verify (2.29) and (2.30) for Fn = ∪si=1 Ci , where Cl is the set of all beta-mixtures

˜b(·; l2 , w) with w ∈ ∆l , and sn is chosen to satisfy L1 n1/(1+2α) (log n)2α/(1+2α) ≤ sn ≤ L2 n1/(1+2α) (log n)2α/(1+2α) for some constants L1 and L2 . The verification of (2.29) is identical to the verification of this condition in the proof of Theorem 2.1. We verify (2.29) with ¯n = (log n)(1+4α)/(2+4α) /nα/(1+2α) , also along the lines of the proof of Theorem 2.1. In view of the inequality k˜b(·; l2 , w)k1 ≤ l2 kwk1 the packing numbers of the set Cl satisfy D(¯ n , Cl , k · k1 ) ≤ D(¯ n /l2 , ∆l , k · k1 ). By the argument in the proof of Theorem 2.1 it then follows that log D(¯ n , Cl , k · k1 ) ≤ log sn + sn log(5sn /¯ /s2n ) ≤ C1 sn log n ≤ c1 n¯ 2n . This verifies (2.29) for the L1 -distance. The Hellinger distance and L2 -distance can be handled similarly.

2.3. Adapted Bernstein-Dirichlet prior

27

√ √ In the definition of ˜b given by (2.17) we assumed that k is integer. If k is not integer, we can still define ˜b such that the error is of order k −α/2 . √ √ Let k be an arbitrary positive integer, and let m = b kc. In the case m = k √ we treated before, we divided the collection of beta-densities β(·; j, k + 1 − j) into k √ consecutive groups of k densities. More generally, we can define m groups containing Pm ki densities, where k = i=1 ki . If k equals me k, for e k = m, m + 1 or m + 2, we let all ki be equal to e k. For example, if k = 25, 30, or 35, we have respectively 5, 6 or 7 groups of 5 different beta densities. If k is not a multiple of m, it can be uniquely written as k = me k + r, where e k ∈ {m, m + 1} and r ∈ {1, 2, . . . , m − 1}. In this case, let r of the group sizes k1 , . . . , km be equal to e k + 1, and let the m − r remaining groups contain e k densities. Pi Given k1 , . . . , km , define partial sums si = j=1 kj (i = 1, . . . , m), and ˜b(x; k, w) =

m X

si X

i=1 j=si−1

1 wi β(x; j, k + 1 − j), k +1 i

(2.20)

where s0 = 0 by definition. For k = m2 and ki = m for all i, this definition coincides with (2.17). Given the partial sums s0 , s1 , . . . , sm and a distribution F on [0, 1], we define  s  s  s  s  s  s  1 0 2 1 k k−1 −F ,F −F ,...,F −F . wb√kc = F k k k k k k

(2.21)

Lemma 2.3 can now be extended as follows.

 k)α/2 , Lemma 2.4. For α ∈ (0, 1] and f ∈ C α we have f − ˜b ·; k, wb√kc (F ) ∞ . (1/e where F is a primitive function of f and wb√kc (F ) is defined in (2.21).

Proof. Define functions φk and ϕk by φk (y) = si and ϕk (y) = si−1 , where si−1 ≤ y < si Pm Psi −1 1 k−1 j and i = 1, . . . , m. For a normalizing constant k(x) = j=si−1 ki i=1 j x (1 − x)k−1−j ,   1 k−1 j Dk,x (j) = x (1 − x)k−1−j , k(x) ki j

si−1 ≤ j < si ,

(2.22)

defines a distribution on {0, 1, . . . , k − 1} that is close to the binomial distribution with parameters k − 1 and x. It can be verified that for a random variable Z with distribution Dk,x , h    i ˜b(x; k, w√ (F )) = k(x)k EZ F φk (Z) − F ϕk (Z) . k k k

(2.23)

In case k = me k, for e k = m, m + 1 or m + 2, k(x) = 1/e k for all x, Z is in fact binomially distributed. The last display then reduces to (2.18), except that for e k > m, φk increases

28

2. Posterior Convergence Rates for Dirichlet Mixtures of Beta Densities

√ in steps of e k instead of m = k. In the remainder, consider the case k = me k + r, where e k ∈ {m, m + 1} and r ∈ {1, 2, . . . , m − 1}. Similar to the proof of Lemma 2.3, we write       Z +1 Z √ ˜ −F f (x) − b ·; k, wb kc (F ) ≤ f (x) − kE F k k       Z +1 Z √ ˜ + b ·; k, wb kc (F ) − kE F −F . k k The first term on the right hand side is of order (1/e k)α/2 ; this is proved in Lemma 2.11 in the appendix. Using the Lipschitz-property of f , we see that the second term is bounded by   Z 1    Z 1  φ (Z) − ϕ (Z) Z s ϕk (Z) k k EZ f + ds − k(x) E s + ds φ (Z) − ϕ (Z) f Z k k k k k k 0 0 α Z 1 Z s φk (Z) − ϕk (Z) ϕk (Z) 1 ≤ Lα (f ) E + − s− ds + e . k k k k k 0 where we used that e k/(e k + 1) ≤ k(x)(φk (Z) − ϕk (Z)) ≤ (e k + 1)/e k. Since by construction e e |Z − ϕk (Z)| ≤ k and |(φk (Z) − ϕk (Z)) − 1| ≤ k, the integral on the right is at most Lα (f )(2e k/k)α .

2.4

Bivariate polynomials

Using multivariate Bernstein polynomials, previous results can be extended to multivariate densities. In the remainder we restrict the presentation to bivariate densities on the unit square. The Bernstein approximation of a function F on the unit square is given by B(x, y; k1 , k2 , F ) =

k1 X k2    X k1 k2 i=0 j=0

i

j

xi (1 − x)k1 −i y j (1 − y)k2 −j F (

i j , ). k1 k2

(DeVore and Lorentz (1993),p.10). If F is a cdf, x2 ≥ x1 and y2 ≥ y1 , we write  F (x1 , y1 ), (x2 , y2 ) := F (x2 , y2 ) − F (x1 , y2 ) − F (x2 , y1 ) + F (x1 , y1 ). If F has a density, it can be approximated with ∂2 B(x, y; k1 , k2 , F ) = b(x, y; k1 , k2 , wk1 k2 (F )) ∂x∂y k1 X k2 X = wk1 k2 (F )(ij) β(x; i, k1 − i + 1)β(y; j, k2 − j + 1) i=1 j=1

= k1 k2 EF (

X +1 Y +1  X Y , ), ( , ) , k1 k2 k1 k2 (2.24)

2.4. Bivariate polynomials

29

where X ∼ Bin(k1 − 1, x) and Y ∼ Bin(k2 − 1, y) are independent binomial random  j−1 j i variables and wk1 k2 (F ) is the matrix with entries wk1 k2 (F )(ij) = F ( i−1 k1 , k2 ), ( k1 , k2 ) . A natural extension of the Bernstein-Dirichlet prior used in section 2.2, is a Dirichlet process prior on the unit square, combined with independent priors ρ1 and ρ2 for the number of components k1 and k2 in both dimensions. Like in section 2.3 we adapt the √ √ prior by reducing the dimension. Assuming that k1 and k2 are natural numbers and w ∈ ∆√k1 k2 , define √

˜b(x, y; k1 , k2 , w) =

√ iX k1

k1 X



k2 X

√ rX k2



√ √ i=1 j=(i−1) k1 r=1 s=(r−1) k2

1 wir β(x; j, k1 +1−j)β(y; s, k2 +1−s), k1 k2

the two-dimensional analogue of (2.17). Let w√k1 √k2 (F ) denote the  i−1 √ , r−1 ), ( √ik , √rk ) . with entries w√k1 √k2 (F )(i, r) = F ( √ k k 1

2

1



k1 ×



k2 matrix

2

Lemma 2.5. For α ∈ (0, 1] and f ∈ C α , kf − ˜b(x, y; k1 , k2 , w√k1 √k2 (F ))k∞ ≤

4Lα (f ) , min(k1 ,k2 )α/2

for F a primitive function of f . Proof. We first prove that kf − b(x, y; k1 , k2 , wk1 k2 (F ))k∞ ≤

2Lα (f ) . min(k1 ,k2 )α/2

There exist

θX , θY ∈ [0, 1], such that  X Y X +1 Y +1 i 1 X + θX Y + θY F ( , ), ( , ) = f( , ). k1 k2 k1 k2 k1 k2 k1 k2 By Jensen’s inequality, sup |f (x, y) − b(x, y; k1 , k2 , wk1 k2 (F ))| = sup |f (x, y) − Ef ( (x,y)

(x,y)

X + θX Y + θY , )| k1 k2

≤ sup Lα (f )Ek(x, y) − ( (x,y)

X + θX Y + θY α , )k . k1 k2

The squared difference in the first variable has expectation E(x −

X + θX 2 X X + θX 2 2 ) = Var( ) + (E(x − )) ≤ , k1 k1 k1 k1

Y 2 and similarly, E(y− Y +θ k2 ) ≤

2Lα (f ) . (min(k1 ,k2 ))α/2

2 k2 .

As a consequence, kf (x, y)−b(x, y; k1 , k2 , wk1 k2 (F ))k∞ ≤

Like in the proof of Lemma 2.3 it now suffices to bound kb(x, y; k1 , k2 , wk1 k2 (F )) − ˜b(x, y; k1 , k2 , w√k1 √k2 (F ))k∞ .

√ √ √ We define functions φk1 and φk2 by φk1 (x) = i k1 if (i − 1) k1 ≤ x < i k1 , for √ √ √ √ √ i = 1, . . . , k 1 , and φk2 (y) = i k2 if (i − 1) k2 ≤ y < i k2 ,i = 1, . . . , k 2 . For binomial random variables X and Y as before, we obtain √ √   p ˜b(x, y; k1 , k2 , w√ √ (F )) = k1 k2 E F φk1 (X)− k1 , φk2 (Y )− k2 , k1 k2 k1 k2

φk1 (X) φk2 (Y )  , k2 k1

i

.

30

2. Posterior Convergence Rates for Dirichlet Mixtures of Beta Densities

Since F is a cdf, Z 1Z 1 X + s1 Y + s2 X Y X +1 Y +1  f( , )ds1 ds2 = k1 k2 F ( , ), ( , ) , k1 k2 k1 k2 k1 k2 0 0 √ √ √ φ (X) φ (Y )  φ (X)− k1 φk2 (Y )− k2 , ), ( k1k1 , k2k2 ) . and a similar expression holds for k1 k2 F ( k1 k1 k2 √

Since E( X+(1−k1

k1 )s1



√ φk1 (X)− k1 2 ) k1





2 k1

and E( Y +(1−k2

k2 )s2



√ φk2 (Y )− k2 2 ) k2

kb(x, y; k1 , k2 , wk1 k2 (F )) − ˜b(x, y; k1 , k2 , w√k1 √k2 (F ))k∞ Z 1Z 1 √ √    φk1 (X)− k1 φ (Y )− k2 1 Y +s2 √s1 , k2 ≤ E| f X+s , − f + + k1 k2 k1 k2 k 0

1

0

Z

1

Z

≤ Lα (f ) 0

0

1

√    φ (X)− k1 1 Y +s2 − k1 k 1 + Ek X+s k1 , k 2

√s2 k2

√ φ (Y )− k2 √s1 , k2 k 2 k1

+





2 k2 ,

| ds1 ds2

√s2 k2



kα ds1 ds2

2Lα (f ) ≤ . min(k1 , k2 )α/2

Theorem 2.3. Let the true density f0 be strictly positive and α-smooth for some α ∈ 2

2

2

(0, 1]. Assume that B1 e−β1 l1 ≤ ρ1 (l1 ) ≤ B2 e−β2 l1 and B1 e−β1 l2 ≤ ρ2 (l2 ) ≤ B2 e−β2 l2

2

for some positive constants B1 , B2 , β1 , β2 and assume that the base measure of the Dirichlet process possesses a continuous, stictly positive density on [0, 1] × [0, 1]. Then, for a sufficiently large constant M ,   (log n)(1+2α)/(2+2α) EΠ f : d(f, f0 ) > M | X1 , . . . , Xn −→ 0 α/(2+2α) n α/(2+2α) Proof. of Theorem 2.3. We verify conditions (2.30) and (2.31) for ˜n = log n/n . √ √ 1/α 1/(2+2α) Let ln1 = ln2 be the biggest integer smaller than M2 /˜ n = M2 (n/ log n) , for some M2 > 1. To clarify the presentation we still distinguish between ln1 and ln2 . 1+4/α

Due to Lemmas 2.2 and 2.3, kwln1 ln2 − wl0n1 ln2 k1 ≤ ˜n implies



f0 − ˜b(·, ·; ln1 2 , ln2 2 , wl l ) ≤ f0 − ˜b(·, ·; ln1 2 , ln2 2 , wl0 l ) n1 n2 n1 n2 ∞ ∞

2 2 2 2 0 ˜

˜

+ b(·, ·; ln1 , ln2 , wln1 ln2 ) − b(·, ·; ln1 , ln2 , wln1 ln2 ) ∞ ≤

M1 M1 2 2 1+4/α + ln1 ln2 ˜n ≤ 2( p α + M2 )˜ n = D1 ˜n . α min(ln1 , ln2 ) M2

It follows with the help of Lemma 2.7 that h(f0 , ˜b(·, ·; ln1 2 , ln2 2 , wln1 ln2 )) ≤ D2 ˜n , and by Lemma 2.10 we obtain ˜b(·, ·; ln1 2 , ln2 2 , wl l ) ∈ KL(f0 , C1 ˜n ). The prior mass on n1 n2

1+4/α

, yielding KL(f0 , C1 ˜n ) can be lower bounded using Lemma 2.8 with  = 12 ˜n  1+ 4 Πn KL(f0 , C1 ˜n ) ≥ ρ1 (ln1 )ρ2 (ln2 )Π(wln1 ln2 : kwln1 ln2 − wl0n1 ln2 k1 ≤ ˜n α ) 2

2

1+4/α

≥ B12 e−β1 ln1 −β1 ln2 Ce−cln1 ln2 log(2/˜n 2

≥ B12 Ce−c1 ln1 log n = B12 C exp(−c2 n˜ 2n )

)

2.5. Weakening the positivity requirement

31

We verify (2.29) and (2.30) for Fn = ∪i·j≤sn Cij , where Cl1 l2 is the set of all betamixtures ˜b(·, ·; l12 , l22 , w) with w ∈ ∆l1 l2 , and integers sn chosen according to (2.14). If L1 has distribution ρ1 and L2 has distribution ρ2 , √

m X m m P (L2 = ) + P (L1 L2 = m) = P (L1 = r, L2 = ) ≤ r r r=0 r=0 m X

m X √ r= m+1

P (L1 = r)

√ ≤ ( m + 1)B2 e−β2 m + mB2 e−β2 m ≤ B3 e−β3 m , due to the quadratic form in the exponent, in the bound on ρ1 and ρ2 . Π(Fn c ) ≤ ≤

X

ρ1 (i)ρ2 (j) ≤

∞ X

P (L1 L2 = m)

m=sn +1

ij>sn ∞ X

B3 e−β3 m ≤ B4 e−β4 sn ≤ B4 e−β4 L1 n

1/(1+α)

(log n)α/(1+α)

m=sn +1

We verify (2.29) with $\bar\epsilon_n = (\log n)^{(1+2\alpha)/(2+2\alpha)} / n^{\alpha/(2+2\alpha)}$, where the log-factor has the same power as in Theorem 2.1. In view of the inequality $\|\tilde b(\cdot,\cdot; i^2, j^2, w)\|_1 \le i^2 j^2 \|w\|_1$ (and the linearity of $\tilde b$ in $w$), the packing numbers of the set $C_{ij}$ satisfy $D(\bar\epsilon_n, C_{ij}, \|\cdot\|_1) \le D\big(\tfrac{\bar\epsilon_n}{i^2 j^2}, \Delta_{ij}, \|\cdot\|_1\big)$. Hence
\[
D(\bar\epsilon_n, \mathcal F_n, h) \le \sum_{1 \le i\cdot j \le s_n} D(\bar\epsilon_n, C_{ij}, h)
\le \sum_{r=1}^{s_n} r\, D\big(\bar\epsilon_n^2/r^2, \Delta_r, \|\cdot\|_1\big)
\le \sum_{r=1}^{s_n} s_n \Big( \frac{5 r^2}{\bar\epsilon_n^2} \Big)^{r-1}
\le s_n^2 \Big( \frac{5 s_n^2}{\bar\epsilon_n^2} \Big)^{s_n}.
\]
It follows that
\[
\log D(\bar\epsilon_n, \mathcal F_n, h) \le 2 \log s_n + 2 s_n \log\Big( \frac{\sqrt 5\, s_n}{\bar\epsilon_n^2} \Big) \le C_1 s_n \log n \le c_1 n \bar\epsilon_n^2,
\]
for sufficiently large constants $C_1$ and $c_1$. Thus all conditions of Theorem 2.4 are satisfied, whence the rate of convergence for the Hellinger distance is $\tilde\epsilon_n \vee \bar\epsilon_n = \bar\epsilon_n$. Again the same result can be obtained for the $L_1$- or $L_2$-distances.

2.5  Weakening the positivity requirement

In Theorems 2.1, 2.2 and 2.3 we assumed that $f_0$ is bounded away from zero. This assumption is useful to lower bound the prior mass on Kullback-Leibler balls around $f_0$, but it is more than what is strictly needed. Since for any $y > 0$, $\lim_{x \downarrow 0} x \log\frac{x}{y} = 0$ and $\lim_{x \downarrow 0} x \log^2\frac{x}{y} = 0$, it is natural to define $0 \log\frac{0}{y} = 0$ and $0 \log^2\frac{0}{y} = 0$ for any $y > 0$. This suggests that even if $f_0(x)$ is zero for some $x$, the quantities $\int \log\frac{f_0}{f}\,dF_0$ and $\int \log^2\frac{f_0}{f}\,dF_0$ in (2.32) can still be small, as long as $f$ is bounded away from zero on $[0,1]$. Since $f$ is a beta mixture, this can be achieved by putting a small weight uniformly over all beta mixtures with mean ranging over the unit interval. The only requirement on $f_0$ is an upper bound on the Lebesgue measure of the region where $f_0$ is very close to zero. This rules out densities that, for example, behave as $e^{-1/x^2}$ near the origin. Unfortunately there is a dependence on the smoothness in this assumption. To keep the presentation simple, the mixtures $b$ defined in (2.1) are used in the construction below; the mixtures $\tilde b$ defined in (2.17) can be handled with similar arguments. For $\alpha \in (0,2]$, let $f_0 \in C^\alpha$ be the density that is to be approximated by a beta mixture. To ensure that the mixture is sufficiently bounded away from zero, we replace the weight vector $w_k(F_0)$ defined in (2.3) by
\[
y_k(F_0) = \big(1 - k^{-(\alpha+2)/2}\big)\, w_k(F_0) + \big(k^{-(\alpha+4)/2}, \ldots, k^{-(\alpha+4)/2}\big). \tag{2.25}
\]
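The modification (2.25) is easy to carry out numerically. The following Python sketch is not part of the original text and the function name is ours; it constructs $y_k(F_0)$ from a given weight vector and checks the two properties used below: every coordinate is at least $k^{-(\alpha+4)/2}$ and the coordinates still sum to one.

```python
import numpy as np

def modified_weights(w_k, alpha):
    """Modification (2.25): shrink w_k(F0) slightly and add a small uniform weight.

    w_k   : array of length k with nonnegative entries summing to 1
    alpha : smoothness index in (0, 2]
    """
    k = len(w_k)
    return (1.0 - k ** (-(alpha + 2) / 2)) * np.asarray(w_k) + k ** (-(alpha + 4) / 2)

# Every coordinate of y_k is at least k^{-(alpha+4)/2}, and the entries still sum to 1:
w = np.random.dirichlet(np.ones(16))
y = modified_weights(w, alpha=1.0)
print(y.min() >= 16 ** (-2.5), np.isclose(y.sum(), 1.0))
```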

Lemma 2.6. For $\alpha \in (0,2]$ and $f_0 \in C^\alpha$, let $M_\alpha(f_0)$ be as in the assertion of Lemma 2.1. Assume that there is another constant $\widetilde M_\alpha(f_0)$ such that
\[
\lambda\big( \{ x : 0 < f_0(x) \le r_{f_0,\alpha} M_\alpha(f_0) k^{-\alpha/2} \} \big) \le \frac{\widetilde M_\alpha(f_0)}{(\log k)^2}, \tag{2.26}
\]
where $r_{f_0,\alpha}$ is defined as the smallest integer larger than $\frac{3 M_\alpha(f_0)+1}{M_\alpha(f_0)} + 1$. Then
\[
\| f_0 - b(\cdot; k, y_k(F_0)) \|_\infty \le (3 M_\alpha(f_0) + 1)\, k^{-\alpha/2},
\]
\[
\int_0^1 f_0(x) \log \frac{f_0(x)}{b(x; k, y_k(F_0))}\,dx \lesssim k^{-\alpha/2}, \tag{2.27}
\]
\[
\int_0^1 f_0(x) \log^2 \frac{f_0(x)}{b(x; k, y_k(F_0))}\,dx \lesssim k^{-\alpha/2}, \tag{2.28}
\]
where $F_0$ is a primitive function of $f_0$ and $y_k(F_0)$ is given by (2.25).

Proof. From the construction of $y_k(F_0)$ it follows that $\|y_k(F_0) - w_k(F_0)\|_\infty \le k^{-(\alpha+2)/2}$ and $\|b(\cdot;k,w_k(F_0)) - b(\cdot;k,y_k(F_0))\|_\infty \le k^{-\alpha/2}$. Since $\|f_0 - b(\cdot;k,w_k(F_0))\|_\infty \le 3 M_\alpha(f_0) k^{-\alpha/2}$ by Lemma 2.1, $\|f_0 - b(\cdot;k,y_k(F_0))\|_\infty \le (3 M_\alpha(f_0)+1) k^{-\alpha/2}$. To prove (2.27) and (2.28), define intervals $A_0 = \{x : f_0(x) = 0\}$ and $A_{k,j} = \{x : (j-1) M_\alpha(f_0) k^{-\alpha/2} < f_0(x) \le j M_\alpha(f_0) k^{-\alpha/2}\}$ for positive integers $k$ and $j$. For each $k \ge 1$ we can write $[0,1] = A_0 \cup (\cup_{j \ge 1} A_{k,j})$. By the binomial theorem,
\[
b(\cdot; k, y_k(F_0)) \ge \sum_{j=1}^{k} k^{-(\alpha+4)/2}\, \beta(x; j, k+1-j) = k^{-(\alpha+2)/2}
\]
for every $x \in [0,1]$. Consequently, $f_0(x)/b(x;k,y_k(F_0))$ is at most $r_{f_0,\alpha}\, k\, M_\alpha(f_0)$ if $x \in A_{k,j}$ with $j \le r_{f_0,\alpha}$. Moreover, for $j > r_{f_0,\alpha}$ we have that
\[
b(x; k, y_k(F_0)) \ge f_0(x) - \|f_0 - b(\cdot;k,y_k(F_0))\|_\infty \ge (j-1) M_\alpha(f_0) k^{-\alpha/2} - (3 M_\alpha(f_0)+1) k^{-\alpha/2}.
\]


Because $x \log^2(x/y) = 0$ if $y > x = 0$,
\[
\begin{aligned}
\int_0^1 f_0(x) \log^2 \frac{f_0(x)}{b(x;k,y_k(F_0))}\,dx
&= \sum_{j \ge 1} \int_{A_{k,j}} f_0(x) \log^2 \frac{f_0(x)}{b(x;k,y_k(F_0))}\,dx \\
&\le r_{f_0,\alpha} M_\alpha(f_0) k^{-\alpha/2}\, \lambda\big(\{x : 0 < f_0(x) \le r_{f_0,\alpha} M_\alpha(f_0) k^{-\alpha/2}\}\big)\, \log^2\big(r_{f_0,\alpha}\, k\, M_\alpha(f_0)\big) \\
&\qquad + M_\alpha(f_0) \sum_{j > r_{f_0,\alpha}} j\, \lambda(A_{k,j})\, k^{-\alpha/2} \log^2 \frac{j M_\alpha(f_0) k^{-\alpha/2}}{(j-1) M_\alpha(f_0) k^{-\alpha/2} - (3 M_\alpha(f_0)+1) k^{-\alpha/2}} \\
&\le r_{f_0,\alpha} M_\alpha(f_0) k^{-\alpha/2}\, \frac{\widetilde M_\alpha(f_0)}{(\log k)^2}\, \log^2\big(r_{f_0,\alpha}\, k\, M_\alpha(f_0)\big)
+ M_\alpha(f_0) \sum_{j > r_{f_0,\alpha}} j\, \lambda(A_{k,j})\, k^{-\alpha/2} \log^2 \frac{j}{j - r_{f_0,\alpha}}
\;\lesssim\; k^{-\alpha/2},
\end{aligned}
\]
where the last inequality follows from the fact that
\[
\sum_{j > r_{f_0,\alpha}} j\, \lambda(A_{k,j}) \log^2 \frac{j}{j - r_{f_0,\alpha}}
\le \sum_{j > r_{f_0,\alpha}} j\, \lambda(A_{k,j}) \Big( \frac{|\, j - (j - r_{f_0,\alpha}) \,|}{j - r_{f_0,\alpha}} \Big)^2
\le r_{f_0,\alpha}^2 \big( r_{f_0,\alpha} + 1 \big).
\]
This proves (2.28); the proof of (2.27) is similar.

To apply the preceding lemma in the proof of Theorem 2.1, it is desirable that (2.27) and (2.28) also hold for all mixtures $b$ whose weight vector lies in an $l_1$-ball around $y_k(F_0)$. This is achieved if, for example, the radius of the $l_1$-ball in (2.12) and (2.13) is $k_n^{-(\alpha+4)/2}$ instead of $\tilde\epsilon_n^{1+2/\alpha}$. Inspecting the inequalities in (2.13), it can be seen that this does not affect the convergence rate.

2.6  Appendix

In this appendix we state, for easy reference, key results used in the proofs of the main results of this chapter.

Theorem 2.4 (Ghosal, Ghosh and van der Vaart [18]). Suppose that for priors $\Pi_n$ on $\mathcal F$, there are sets $\mathcal F_n \subset \mathcal F$ and positive sequences $\bar\epsilon_n, \tilde\epsilon_n \to 0$, such that $n \min(\tilde\epsilon_n^2, \bar\epsilon_n^2) \to \infty$. Suppose that for certain constants $c_1, c_2, c_3, c_4$,
\[
\log D(\bar\epsilon_n, \mathcal F_n, d) \le c_1 n \bar\epsilon_n^2, \tag{2.29}
\]
\[
\Pi_n(\mathcal F \setminus \mathcal F_n) \le c_3 e^{-(c_2+4) n \tilde\epsilon_n^2}, \tag{2.30}
\]
\[
\Pi_n\big( KL(f_0, \tilde\epsilon_n) \big) \ge c_4 e^{-c_2 n \tilde\epsilon_n^2}, \tag{2.31}
\]
where
\[
KL(f_0, \epsilon) = \Big\{ f : \int \log\frac{f_0}{f}\,dF_0 \le \epsilon^2,\ \int \log^2\frac{f_0}{f}\,dF_0 \le \epsilon^2 \Big\}. \tag{2.32}
\]


Then for $\epsilon_n = \max(\tilde\epsilon_n, \bar\epsilon_n)$ and $M > 0$ sufficiently large, the posterior probability
\[
E_{f_0} \Pi\big( f : d(f, f_0) \le M \epsilon_n \mid X_1, \ldots, X_n \big) \longrightarrow 1.
\]

Lemma 2.7. If $f_0$ is bounded away from zero, then
\[
h(f, f_0) = \sqrt{ \int \frac{(f(x) - f_0(x))^2}{\big(\sqrt{f(x)} + \sqrt{f_0(x)}\big)^2}\,dx } \le \sqrt{\|1/f_0\|_\infty}\; \|f - f_0\|_2
\]
for every density $f$.

Lemma 2.8 (Ghosal [17]). Let $(X_1, \ldots, X_k)$ be Dirichlet distributed on $\Delta_k$, $k \ge 2$, with parameters $(m; \alpha_1, \ldots, \alpha_k)$, such that $Ab \le \alpha_j \le M$ and $m = \sum_{j=1}^k \alpha_j$, for some constants $A$, $b$, $m$ and $M \ge 1$. Let $(x_1, \ldots, x_k)$ be any point in $\Delta_k$. Then there exist positive constants $c$ and $C$ depending only on $A$, $b$, $M$ and $m$ such that for $\epsilon \le \frac{1}{Mk}$,
\[
P\Big( \sum_{j=1}^k |X_j - x_j| \le 2\epsilon,\ X_j > \epsilon^4 \text{ for all } j \Big) \ge C e^{-ck \log(1/\epsilon)}.
\]

Lemma 2.9 (Ghosal and van der Vaart [22], Lemma A.4). For $\epsilon \le 1$ the packing number of the $l_1$-simplex satisfies
\[
D(\epsilon, \Delta_k, \|\cdot\|_1) \le \Big( \frac{5}{\epsilon} \Big)^{k-1}. \tag{2.33}
\]

Lemma 2.10 (Ghosal, Ghosh and van der Vaart [18]). For any probability measure $F_0$ with density $f_0$,
\[
\Big\{ f : h^2(f, f_0)\, \Big\| \frac{f_0}{f} \Big\|_\infty \le \epsilon^2 \Big\} \subset \Big\{ f : \int \log\frac{f_0}{f}\,dF_0 \le 2\epsilon^2,\ \int \log^2\frac{f_0}{f}\,dF_0 \le 4\epsilon^2 \Big\}. \tag{2.34}
\]

Lemma 2.11. For $\alpha \in (0,1]$, let $f \in C^\alpha$ be a density on $[0,1]$, with cumulative distribution function $F$. Let $Z$ be a random variable on $\{0, 1, \ldots, k-1\}$ with probability mass function given by (2.22). Then
\[
\Big| f(x) - k\, E\Big( F\big(\tfrac{Z+1}{k}\big) - F\big(\tfrac{Z}{k}\big) \Big) \Big| \lesssim \frac{1}{k^{\alpha/2}}
\]
for all $x \in [0,1]$, the multiplicative constant being independent of $x$.

Proof. Analogous to (2.7) we find that
\[
\Big| f(x) - k\, E\Big( F\big(\tfrac{Z+1}{k}\big) - F\big(\tfrac{Z}{k}\big) \Big) \Big| \le L_\alpha(f)\, \big( E|x - \xi_Z| \big)^\alpha
\]
for some point $\xi_Z \in [Z/k, (Z+1)/k]$. Hence it suffices to show that
\[
E|x - \xi_Z| \le E\Big|x - \frac{Z}{k}\Big| + E\Big|\frac{Z}{k} - \xi_Z\Big| \le \frac{1}{k} + \frac{1}{k} E|Z - EZ| + \Big| \frac{EZ}{k} - x \Big| \tag{2.35}
\]


is of order $1/\sqrt{\tilde k}$. Because $D_{k,x}$ differs only by a factor $\tfrac{\tilde k}{\tilde k + 1}$ from the binomial distribution with parameters $k-1$ and $x$, it follows that
\[
\frac{\tilde k}{\tilde k + 1}\, (k-1) x \;\le\; EZ \;\le\; \frac{\tilde k + 1}{\tilde k}\, (k-1) x,
\]
and it can be verified that $|EZ/k - x| < 1/\tilde k$. In the remainder we show that the term $E|Z - EZ|$ in (2.35) is of the right order. As it is difficult to obtain a sufficiently sharp bound for the variance of $Z$, we cannot follow the proof of Lemma 2.1 at this point. Instead we use a coupling argument. The random variable $Z$ can be generated according to the following scheme. For $x \in [0,1]$, let $U_1, U_2, \ldots$ and $Y_1, Y_2, \ldots$ be independent sequences of i.i.d.\ random variables, being respectively uniformly distributed on $[0,1]$ and binomially distributed with parameters $k-1$ and $x$. We define the set
\[
A = \big\{ (u, y) \in [0,1] \times \{0, 1, \ldots, k-1\} \mid u \ge D_{k,x}(y) / \widehat B_{k,x}(y) \big\},
\]
where $\widehat B_{k,x}$ is defined as the probability mass function of the $Y_i$'s multiplied by $(\tilde k + 1)/\tilde k$. By construction, $D_{k,x}(y) \le \widehat B_{k,x}(y)$ for all $y$. Define $Z = Y_n$ if $(U_n, Y_n)$ is the first pair that is not contained in $A$. First we show that the output $Z$ indeed has the desired distribution; this is an example of rejection sampling. Let $q$ be the probability that a pair $(U_i, Y_i)$ is contained in $A$. Because $P((U_n, Y_n) \notin A, Y_n = y)$ does not depend on $n$,
\[
P(Z = y) = \sum_{n=1}^{\infty} P(N = n, Y_n = y) = \sum_{n=1}^{\infty} q^{n-1} P\big( (U_n, Y_n) \notin A,\, Y_n = y \big)
= \frac{P\big( (U_n, Y_n) \notin A,\, Y_n = y \big)}{1 - P\big( (U_n, Y_n) \in A \big)} = D_{k,x}(y),
\]
as the numerator in the last fraction equals
\[
P(Y_n = y)\, P\big( (U_n, Y_n) \notin A \mid Y_n = y \big) = B_{k,x}(y)\, \frac{D_{k,x}(y)}{\widehat B_{k,x}(y)} = D_{k,x}(y)\, \frac{\tilde k}{\tilde k + 1}
\]
and the denominator equals
\[
1 - P\big( (U_n, Y_n) \in A \big) = 1 - \sum_{j=0}^{k-1} B_{k,x}(j) \Big( 1 - \frac{D_{k,x}(j)}{\widehat B_{k,x}(j)} \Big) = \frac{\tilde k}{\tilde k + 1}.
\]
Because $E|Y_1 - EY_1| \le \sqrt{\operatorname{var} Y_1} \le \tfrac12 \sqrt{k} \approx \tfrac12 \sqrt{\tilde k}$ and
\[
\big|\, E|Y_1 - EY_1| - E|Z - EZ| \,\big| \le 2\, E|Y_1 - Z|,
\]


we can obtain the required bound for $E|Z - EZ|$ once we can bound $E|Y_1 - Z|$. Finally,
\[
\begin{aligned}
E|Y_1 - Z| &= \sum_{n=2}^{\infty} E\, |Y_1 - Y_n|\, 1_{\{(U_1,Y_1)\in A,\, \ldots,\, (U_{n-1},Y_{n-1})\in A,\, (U_n,Y_n)\notin A\}} \\
&= \sum_{n=2}^{\infty} E\, |Y_1 - Y_n| \Big( 1 - \frac{D_{k,x}(Y_1)}{\widehat B_{k,x}(Y_1)} \Big) \Big( \prod_{i=2}^{n-1} \Big( 1 - \frac{D_{k,x}(Y_i)}{\widehat B_{k,x}(Y_i)} \Big) \Big) \frac{D_{k,x}(Y_n)}{\widehat B_{k,x}(Y_n)} \\
&\le E\, |Y_1 - Y_n| \sum_{n=2}^{\infty} q^{n-2} \;\lesssim\; \sqrt{k},
\end{aligned}
\]
as $E|Y_1 - Y_n| \le \sqrt{\operatorname{var}(Y_1 - Y_2)} \le \sqrt{k/2}$ is independent of $n$.
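The generation scheme used in this proof is ordinary rejection sampling. The following Python sketch is not part of the thesis and the helper name and argument list are ours; it mirrors the construction above: the target pmf $D_{k,x}$ of (2.22) is passed in as a function, the proposal is the Binomial$(k-1,x)$ pmf, and the envelope is the proposal multiplied by a constant $c$ with $D_{k,x}(y) \le c\, B_{k,x}(y)$ for all $y$ (in the proof, $c = (\tilde k + 1)/\tilde k$).

```python
import numpy as np
from scipy.stats import binom

def sample_Z(target_pmf, k, x, c, rng=np.random.default_rng()):
    """Rejection sampler for a pmf D_{k,x} on {0,...,k-1} (e.g. the one in (2.22)),
    using the Binomial(k-1, x) pmf as proposal and the envelope c * proposal >= target."""
    while True:
        y = rng.binomial(k - 1, x)             # proposal draw Y_n
        u = rng.uniform()                      # uniform draw U_n
        envelope = c * binom.pmf(y, k - 1, x)  # \hat B_{k,x}(y)
        if u < target_pmf(y) / envelope:       # (U_n, Y_n) not in A: accept
            return y
```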


[Figure 2.3. Left column: the beta densities β(·; i, k + 1 − i), i = 1, . . . , k, for k equal to 4, 16 and 64. Right column: the corresponding approximations of the density f(x) = (e − 1)^{-1}(e^x + cos(4πx)) (f is the solid line).]
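A sketch of how this figure can be reproduced is given below (Python/matplotlib; it is not part of the thesis). It assumes that the weights $w_k(F)$ of (2.3) are the increments $F(i/k) - F((i-1)/k)$ of the cdf, which is the usual Bernstein-polynomial construction; the mixture plotted is then $b(x;k,w_k(F)) = \sum_i w_{k,i}\,\beta(x;i,k+1-i)$.

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

f = lambda x: (np.exp(x) + np.cos(4 * np.pi * x)) / (np.e - 1)                      # density in Figure 2.3
F = lambda x: (np.exp(x) - 1 + np.sin(4 * np.pi * x) / (4 * np.pi)) / (np.e - 1)    # its cdf

x = np.linspace(0, 1, 400)
for k in (4, 16, 64):
    # weights: increments of the cdf over the grid {0, 1/k, ..., 1} (assumed form of (2.3))
    w = F(np.arange(1, k + 1) / k) - F(np.arange(0, k) / k)
    # beta densities beta(x; i, k+1-i) and the resulting mixture approximation
    basis = np.array([beta.pdf(x, i, k + 1 - i) for i in range(1, k + 1)])
    plt.plot(x, w @ basis, '--', label=f'approximation, k={k}')
plt.plot(x, f(x), 'k-', label='f')
plt.legend()
plt.show()
```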

Chapter 3

Bayesian Density Estimation with Location-Scale Mixtures

Abstract: We study convergence rates of Bayesian density estimators based on finite location-scale mixtures of a kernel with exponential tails. It is assumed that the underlying density is either twice continuously differentiable with exponential tails, or is a finite mixture itself. Regarding the priors on the weights, locations, and number of components, we provide general conditions under which the posterior converges at a near optimal rate. Examples of priors which satisfy these conditions include Dirichlet and Polya tree priors for the weights, and Poisson processes for the locations. Some of the results can be extended to mixtures of symmetric stable kernels.

3.1  Introduction

When the number of components in a mixture model can increase with the sample size, the model can be used for nonparametric density estimation. Such models were called mixture sieves by Grenander [23] and Geman and Hwang [15]. Although originally introduced in a maximum likelihood context, they have since been studied in a large number of Bayesian papers. See, among others, Richardson and Green [47], Diebolt and Robert [10], and Escobar and West [13]. Whereas much progress has been made regarding the computational problems in nonparametric Bayesian inference (see for example the review by Marin, Mengersen and Robert [41]), results on convergence were found only recently, especially for the case when the underlying distribution is not a mixture itself. For the estimation of a $C^2$-density using continuous normal mixtures with a Dirichlet prior on the mixing distribution, Ghosal and van der Vaart [21] found optimal rates under certain assumptions on the prior. Convergence of normal mixtures has also been studied by Scricciolo [51] and Genovese and Wasserman [16]. These results are mostly proved by lower bounding the prior mass on Kullback-Leibler balls around the true density and simultaneously controlling the entropy of certain sieves in the model, following the general results in Ghosal, Ghosh and van der Vaart [18]. This approach is also taken in the present chapter. We obtain posterior convergence rates for location-scale mixtures of the type
\[
m(x; k, \mu, w, \sigma) = \sum_{j=1}^{k} w_j\, \psi_{\sigma_j}(x - \mu_j), \tag{3.1}
\]
where $\sigma_j > 0$, $w_j \ge 0$, $\sum_{j=1}^k w_j = 1$, $\mu_j \in \mathbb R$ and $\psi$ is some symmetric kernel, of a form specified below. Our aim is to formulate general conditions for the priors on $\mu$ and $w$, under which optimal rates can be found for estimation of an $\alpha$-smooth density $p_0$. The convergence rate is taken to be the posterior concentration on Hellinger balls around $p_0$, i.e.\ the rate is $\epsilon_n$ if $\Pi(h(p_0, p_\theta) > M \epsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_0^n$-probability, for some sufficiently large constant $M$. For convenience we focus on $C^2$-densities. For the inverse bandwidths $\sigma_j^{-1}$ we consider a (single) prior, which is a gamma distribution restricted to intervals $[l_n, u_n] = [n^{-L}, n^{U}]$, for arbitrary constants $L > 0$ and $U \ge \tfrac15$.

The value $n^{1/5}$ as smallest possible upper bound for the inverse bandwidths can be seen in relation to the optimal bandwidth of order $n^{-1/5}$ in kernel density estimation. In section 3.2, we consider kernels $\psi$ of the form
\[
\psi(x) = C_p\, e^{-|x|^p}, \tag{3.2}
\]
for a normalizing constant $C_p$. We find optimal rates under quite general conditions on the prior. In section 3.3 we give examples of priors that satisfy these conditions. It appears that optimal rates are achieved in many other models than the Dirichlet mixtures of normals most often considered in the literature. For example, one can equally well use Laplace mixtures with a Poisson process prior on the locations and a Polya tree prior on the weights. In section 3.4 we let $\psi$ be a symmetric stable density, and study Cauchy mixtures as a particular case. Although many of the properties of the 'exponential type' mixtures of section 3.2 also hold for Cauchy mixtures, the approximation properties of the latter are problematic. For a constant $\delta \in (0,1]$ and a symmetric stable density $\psi_\sigma$ such that $\int y^\delta \psi_\sigma(y)\,dy < \infty$, the Hellinger distance between a $C^2$-density $f$ and its convolution $f * \psi_\sigma$ is only of the order $\sigma^{\delta/2}$, much larger than the $\sigma^2$ found for the kernels in section 3.2. This leads to suboptimal rates.
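As a concrete illustration of (3.1) and (3.2), the following Python sketch evaluates a finite location-scale mixture with the exponential-power kernel. It is not part of the chapter; the normalizing constant is computed here as $C_p = 1/(2\Gamma(1+1/p))$, which makes $\psi$ integrate to one (the text only calls $C_p$ "a normalizing constant"). Taking $p = 1$ gives the Laplace mixtures mentioned above.

```python
import numpy as np
from scipy.special import gamma

def psi(x, p):
    """Kernel (3.2): psi(x) = C_p * exp(-|x|^p) with C_p = 1 / (2*Gamma(1+1/p))."""
    C_p = 1.0 / (2.0 * gamma(1.0 + 1.0 / p))
    return C_p * np.exp(-np.abs(x) ** p)

def mixture(x, mu, w, sigma, p):
    """Location-scale mixture (3.1): sum_j w_j * psi_{sigma_j}(x - mu_j),
    with psi_sigma(y) = psi(y / sigma) / sigma."""
    x = np.atleast_1d(x)[:, None]
    comps = psi((x - np.asarray(mu)) / np.asarray(sigma), p) / np.asarray(sigma)
    return comps @ np.asarray(w)

# Example: a three-component Laplace mixture (p = 1), evaluated on a grid.
vals = mixture(np.linspace(-3, 3, 7), mu=[-1.0, 0.0, 2.0],
               w=[0.3, 0.5, 0.2], sigma=[0.5, 1.0, 0.7], p=1.0)
```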

Notation: Let $\Delta_k = \{x \in \mathbb R^k : x_i \ge 0,\ \sum_{i=1}^k x_i = 1\}$ denote the $(k-1)$-dimensional unit simplex and $S_k = \{x \in \mathbb R^k : x_i \ge 0,\ \sum_{i=1}^k x_i \le 1\}$. For $b, d \in \mathbb R^k$, $H_k[b,d]$ denotes the hypercube $\{x \in \mathbb R^k \mid x_i \in [b_i, d_i]\}$. When no confusion can result we write $H_k[b,d] := H_k[(b,\ldots,b),(d,\ldots,d)]$ for real numbers $b$ and $d$. The supremum norm restricted to a compact interval $[-a,a]$ is denoted $\|\cdot\|_{\infty,a}$. Given $\epsilon > 0$ and fixed points $x \in \mathbb R^k$ and $y \in \Delta_k$, define the $l_1$-balls $B_k(x,\epsilon) = \{ z \in \mathbb R^k \mid \sum_{i=1}^k |z_i - x_i| \le \epsilon \}$ and $\Delta_k(y,\epsilon) = \{ z \in \Delta_k \mid \sum_{i=1}^k |z_i - y_i| \le \epsilon \}$. Inequality up to a multiplicative constant is denoted with $\lesssim$ and $\gtrsim$. Let $\log_+ x = \max(\log x, 0)$. The number of points in an interval $I \subset \mathbb R$ is denoted $N(I)$. Finally, $KL(p_0, \epsilon)$ stands for the Kullback-Leibler ball
\[
\{ p : P_0 \log(p_0/p) \le \epsilon^2,\ P_0 \log^2(p_0/p) \le \epsilon^2 \}. \tag{3.3}
\]

3.2  Location-scale mixtures of exponential kernels

For a given $p > 0$, let $\psi$ be defined as in (3.2).

Assumptions on $p_0$. The observations $X_1, \ldots, X_n$ are assumed to be an i.i.d.\ sample from a twice continuously differentiable density $p_0$. It is assumed that there are constants $q_0$, $D_1$ and $q_1 > (1 \vee p)$ such that $p_0(x) \lesssim e^{-q_0 |x|^{q_1}}$ if $|x| \ge D_1$, i.e.\ $p_0$ is assumed to have smaller tails than the kernel $\psi$. Since $q_1 > 1$ this implies that $P_0([-x,x]^c) \le \psi(x)$ if $|x|$ is larger than some constant $D_2$. In addition, it is assumed that $\int (p_0'/p_0)^4\,dP_0 < \infty$ and $\int (p_0''/p_0)^2\,dP_0 < \infty$. Ghosal and van der Vaart ([21], Lemma 4) proved that for normal mixtures these assumptions imply that $h(p_0, p_0 * \phi_\sigma) \lesssim \sigma^2$. A reading of their proof shows that this still holds if we replace $\phi$ by our kernel $\psi$.

To state the conditions on the prior we introduce the number $a_\epsilon$. For sufficiently small $\epsilon > 0$, let the number $\bar\epsilon$ be implicitly defined by $\epsilon^2 (\log \epsilon^{-1})^{-2} = \bar\epsilon\, \psi^{-1}(\bar\epsilon)$, where $\psi^{-1}(y) = \big( \log\tfrac{C_p}{y} \big)^{1/p}$ is the inverse of $\psi$ defined on $(0, C_p]$. For sufficiently small $\epsilon > 0$ we define $a_\epsilon = \psi^{-1}(\bar\epsilon)$ and $A_\epsilon = [-a_\epsilon, a_\epsilon]$. The use of $\bar\epsilon$ instead of $\epsilon$ is for technical convenience only. For example, from the form of $\psi$ and the assumptions on the tails of $p_0$, it follows that
\[
P_0(A_\epsilon^c) \lesssim \int_{a_\epsilon}^{\infty} e^{-q_0 |x|^{q_1}}\,dx \lesssim e^{-a_\epsilon^p} \lesssim \bar\epsilon. \tag{3.4}
\]
In the context of Theorem 3.1, we have $\epsilon = \epsilon_n = n^{-2/5} \log n$, and $a_{\epsilon_n}$ is of the order $(\log n)^{1/p}$. In contrast to the prior for the inverse bandwidths $\sigma_j^{-1}$, which is restricted to $[l_n, u_n]$, the priors for $\mu$ that we consider below do not have to be supported on the intervals $A_\epsilon$, and do not necessarily depend on $n$.

Prior ($\Pi_n$). Let $s = \tfrac15 (b + p^{-1})$, where the constant $b$ depends on the choice of the prior. For the sequence $\sigma(n) = n^{-1/5} (\log n)^s$, the inverse bandwidths $\tfrac{1}{\sigma_i}$ are independent


with the exponential density
\[
f(t) = \big( e^{-l_n \sigma(n)} - e^{-u_n \sigma(n)} \big)^{-1} \sigma(n)\, e^{-\sigma(n) t}, \qquad t \in [l_n, u_n].
\]
The priors on the number of components, locations and weights satisfy the conditions (3.5)-(3.8) below, where $d_1, d_2, d_3$ and $d_4$ may be arbitrary positive constants. First, the marginal distribution $\rho$ of the number of components $K$ is such that for all integers $m$,
\[
\sum_{k=m}^{\infty} \rho(k) \lesssim e^{-d_1 m (\log m)^r}, \tag{3.5}
\]
where the constant $r \ge 0$ may influence the logarithmic factor in the convergence rate. Second, the joint distribution of $(K, \mu)$ satisfies, for all $y > 0$, all $k \in \mathbb N$, all sufficiently small $\epsilon > 0$ and all $\mu_0 \in A_\epsilon^k$,
\[
\Pi\big( N([-y,y]^c) > 0 \big) \lesssim e^{-y^{d_2}}, \tag{3.6}
\]
\[
\Pi\big( K = k,\ \mu \in B_k(\mu_0, \epsilon) \big) \gtrsim \exp\{ -d_3\, k \log(1/\epsilon) \}. \tag{3.7}
\]
Third, for every $k \in \mathbb N$, every $w_0 \in \Delta_k$ and all sufficiently small $\epsilon > 0$, the conditional prior on the weights satisfies
\[
\Pi\big( w \in \Delta_k(w_0, \epsilon) \mid K = k \big) \gtrsim \exp\{ -d_4\, k (\log k)^b \log(1/\epsilon) \}. \tag{3.8}
\]

Theorem 3.1. Let $p_0$ satisfy the assumptions stated above, and let $\Pi_n$ be a sequence of priors satisfying conditions (3.5)-(3.8). Then $\Pi_n(\cdot \mid X_1, \ldots, X_n)$ converges to $p_0$ in $P_0^n$-probability, relative to the Hellinger or $L_1$-metric, with rate $\epsilon_n = n^{-2/5} (\log n)^t$, where $t > 1 + \tfrac25 (b + p^{-1}) + \max(0, \tfrac{1-r}{2})$.
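For concreteness, the following Python sketch draws once from a prior of the kind just described, using example choices of the type discussed in section 3.3: a Poisson number of components, i.i.d.\ Laplace locations, Dirichlet weights, and i.i.d.\ inverse bandwidths from the exponential density with rate $\sigma(n)$ restricted to $[l_n, u_n]$, sampled by inverting its cdf. The code and all numerical parameters (nu, L, U, b) are illustrative and not fixed by the text.

```python
import numpy as np

def sample_prior(n, p=1.0, b=0.0, L=1.0, U=0.2, nu=3.0, rng=np.random.default_rng()):
    """One illustrative draw (k, mu, w, sigma) from a prior of the type above."""
    l_n, u_n = n ** (-L), n ** U                               # [l_n, u_n] = [n^-L, n^U]
    s = n ** (-0.2) * np.log(n) ** ((b + 1.0 / p) / 5.0)       # sigma(n) = n^{-1/5} (log n)^s
    k = max(1, rng.poisson(nu))                                # number of components
    mu = rng.laplace(size=k)                                   # locations; density comparable to psi when p = 1
    w = rng.dirichlet(np.ones(k))                              # weights
    v = rng.uniform(size=k)                                    # inverse-cdf draws for 1/sigma_j on [l_n, u_n]
    inv_sigma = -np.log(np.exp(-s * l_n) - v * (np.exp(-s * l_n) - np.exp(-s * u_n))) / s
    return k, mu, w, 1.0 / inv_sigma
```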


The proof is an application of Theorem 3.5, and requires Lemmas 3.1, 3.2 and 3.3 given below, to verify conditions (3.31)-(3.33). Condition (3.33) asserts that there have  2 to be constants c2 and c4 such that Πn KL(p0 , n ) ≥ c4 e−c2 nn , i.e. the prior has to put enough mass on densities close to p0 , where closeness is measured in terms of the Kullback-Leibler balls defined by (3.3). As the support of the prior consists of finite mixtures, we need to find a set of finite mixtures close to p0 . In the following lemma, we construct a well approximating mixture m (·; k , µ , w , σ ), and consider all mixtures of equal dimension, whose coefficients are in an l1 -ball around (µ , w , σ ). The proof can be found in the appendix. In the proofs of Theorem 3.1 it is important that p0 can be well approximated by mixtures whose support points are contained in a compact interval. More specifically, for every small  > 0 there has to be a finite mixture m (·; k , µ , w , σ ) such that d(p0 , m ) <  and w,j ∈ [−a , a ] for j = 1, . . . , k . The distance d is either the Hellinger or Kullback-Leibler distance. The above assumption on the tails of p0 guarantees that for  tending to zero, this approximation can be achieved while controlling the increase of the intervals A = [−a , a ]. The smoothness assumptions on p0 imply that also the number of components k can be sufficiently controlled. This is made rigorous in the following lemma. Below we point out the main steps in the proof; the complete proof can be found in the appendix. 2 p ). Lemma 3.1. For ψ(x) = Cp e−|x| , let ¯ be implicitly defined by / log 1 = ¯ψ −1 (¯  − q p−p p p 2 1 Let  > 0 be such that ¯ < Cp exp − max{D1 , D2 , q0 } . Let p0 ∈ C and assume q1

that h(p0 , p0 ∗ ψσ ) . σ 2 and the tails of p0 are of order e−q0 |x| , with q1 > (1 ∨ p). Then there exists a finite mixture m = m(·; k , µ , w , σ ) contained in KL(p0 , ), such that all −1 of its support points lie in A = [−ψ ), ψ −1 (¯ )], and the number of support points k q (¯ −1 ψ (¯ ) 1 1 is bounded by a multiple of √ log  log ¯ . For a constant C > 0, m(·; k , µ, w, σ) ∈ 4

KL(p0 , C) if (µ, w, σ) ∈ M × W × Σ , where M = Bk (µ , e 2 σ()), W = ∆k (w , e2 ) and Σ = Hk [σ , σ +

2

e  σ() k ].

First we construct a finite mixture m (·; k , µ , w , σe ) such that h(p0 , m ) . . After a small modification of the coefficients, to guarantee that m is sufficiently bounded away from zero, Lemma 3.14 is used to obtain a bound for the Kullback-Leibler divergence. The construction of m relies on the approximation of p0 ∗ ψσ() by a finite mixture, as in Lemma 3.1 in Ghosal and van der Vaart (2000,[22]). Because this technique can only be used for compactly supported p0 , we first replace p0 ∗ ψσ() by p0 ∗ ψσ() , for p0 the restriction of p0 to A . We find that h(p0 , m ) ≤ h(m , p0 ∗ ψσ() ) + h(p0 ∗ ψσ() , p0 ∗ ψσ() ) + h(p0 ∗ ψσ() , p0 ).

(3.9)

By Lemma 3.13 in the appendix we can bound the first term on the right. The second term is sufficiently small due to the tail assumption on p0 . The last term is at most a

44

3. Bayesian Density Estimation with Location-Scale Mixtures

multiple of σ 2 () because of the smoothness assumptions on p0 . Choosing σ() = / log 1 , we see that h(p0 , m ) . / log 1 . Finally, from the construction of m (see the proof in the appendix) it follows that for sufficiently small neighborhoods M , W and Σ of respectively µ , w and σ , all mixtures whose parameters are contained in these sets, are also close to p0 . The latest step relies on the following lemma, that is also used in the entropy calculations in the proof of Theorem 3.1. A similar result is given by Lemma 5 in [21]. Lemma 3.2. Let w, w ˜ ∈ ∆k , µ, µ ˜ ∈ Rk and σ, σ ˜ ∈ (0, ∞)k . Let ψ be a differentiable symmetric density such that xψ 0 (x) is bounded. Then for mixtures m(x) = m(x; k, µ, w, σ) and m(x) ˜ = m(x; k, µ ˜, w, ˜ σ ˜ ) we have km − mk ˜ 1



kw − wk ˜ 1 + 2kψk∞

k X wi ∧ w ˜i i=1

km − mk ˜ ∞

.

σi ∧ σ ˜i

| µi − µ ˜i | +

k X wi ∧ w ˜i i=1

σi ∧ σ ˜i

| σi − σ ˜i |,

k k k X X | wi − w ˜i | X wi ∧ w wi ∧ w ˜i ˜i | µ − µ ˜ | + | σi − σ ˜i | . + i i 2 σi ∧ σ ˜i (σi ∧ σ ˜i ) (σi ∧ σ ˜i )2 i=1 i=1 i=1

Proof. Let 1 ≤ i ≤ k and assume that w ˜i ≤ wi . By the triangle inequality, kwi ψσi (· − µi ) − w ˜i ψσ˜i (· − µ ˜i )k ≤ kwi ψσi (· − µi ) − w ˜i ψσi (· − µi )k + kw ˜i ψσi (· − µi ) − w ˜i ψσi (· − µ ˜i )k + kw ˜i ψσi (· − µ ˜i ) − w ˜i ψσ˜i (· − µ ˜i )k for any norm. We have the following inequalities:     µ ˜i − µi µi − µ ˜i −Ψ kψσi (z − µi ) − ψσi (z − µ ˜i )k1 = 2 Ψ 2σi 2σi 2kψk∞ |µ ˜i − µi | ≤ 2kψk∞ ≤ |µ ˜i − µi |, σ σi ∧ σ ˜i Z i 1 x x 1 kψσi − ψσ˜i k1 ≤ | ψ( ) − ψ( ) | dx ≤ | σi − σ ˜i |, σi ∧ σ ˜i σi σ ˜i σi ∧ σ ˜i 1 d ˜i |, (3.10) kψσi − ψσ˜i k∞ ≤ k gx k∞ | σi − σ 2 (σi ∧ σ ˜i ) dz 1 kψσi (z − µi ) − ψσi (z − µ ˜i )k∞ . |µ ˜i − µi | . (3.11) (σi ∧ σ ˜i )2 To prove (3.10), let σ = z −1 and σ ˜ = z˜−1 , and for fixed x define the function gx : z → zψ(zx). By assumption,

d dz gx (z)

= ψ(zx) + zxψ 0 (zx) is bounded, and

kψσ − ψσ˜ k∞ = sup | gx (z) − gx (˜ z ) |≤| z − z˜ | k x

d 1 d gx k∞ ≤ k gx k∞ | σ − σ ˜|. 2 dz (σ ∧ σ ˜ ) dz

Applying the mean value theorem to ψ itself, (3.11) is obtained. Using Lemma 3.1 it is straightforward to verify condition (3.33) in Theorem 3.5: it suffices to lower bound the prior mass on M , W and Σ . This lower bound can be

3.2. Location-scale mixtures of exponential kernels

45

obtained directly from the conditions (3.7) and (3.8) on the priors for µ and w. The lower bound for Π(Σ ) is found by direct calculation. Conditions (3.31) and (3.32) in Theorem 3.5 require the existence of sets of finite mixtures Pn , of increasing dimension kn , such that the (metric) entropy is at most a multiple of n2n . By condition (3.32), the prior mass on Pnc must be exponentially small. Optimal rates can therefore only be found when kn increases at the appropriate speed. The next lemma gives two well known entropy bounds. Lemma 3.3. For positive vectors b = (b1 , . . . , bk ) and d = (d1 , . . . , dk ), with bi < di for all i, the packing numbers of ∆k and Hk [b, d] satisfy the following bounds. D(, ∆k , k · k1 ) ≤ D(, Hk [b, d], k · k1 ) ≤

5 k  Qk k! i=1 (di − bi + 2) (2)k

(3.12) (3.13)

Proof. A proof of (3.12) can be found in [18]; see also page 34 in the present work. The other result follows from a volume argument. For λk the k-dimensional Lebesgue measure, λk (Sk ) =

1 k!

and λk (Bk (y, 2 , k · k1 )) =

in Rk centered at y, with radius

 2.

k k! ,

where Bk (y, 2 , k · k1 ) is the l1 -ball

Suppose x1 , . . . , xN is a maximal -separated set

in Hk [b, d]. If the center y of an k · k1 -ball of radius any point z in this ball, | zi − yi |≤

 2

 2

is contained in Hk [b, d] then for

for all i. Because for each coordinate we have   2 and | zi |≥ bi − 2 , z is an element of Bk (x1 , 2 , k · k1 ), . . . , Bk (xN , 2 , k · k1 ) is therefore

the bounds | zi |≤| yi | + | zi − yi |≤ di + Hk [b − 2 , d + 2 ]. The union of the balls contained in Hk [b − 2 , d + 2 ]. 2

2

Proof of Theorem 3.1. Let n = n− 5 (log n)t1 and δn = n− 5 (log n)t2 be the sequences in Theorem 3.5, where the constants t1 and t2 are to be determined. Define e n = −1 n log 1n = σ 2 (n), and define ¯n implicitly by e 2n = ¯n ψ −1 (¯ n ). Applying Lemma 3.1 with  = n , we can find a sequence of mixtures mn (·; kn , µn , wn , σn ) contained in KL(p0 , n ). The elements of µn are contained in An = [−an , an ], where an = ψ −1 (¯ n ) . 1

1

1

t1

3

(log n) p , and kn is bounded by a multiple of n 5 (log n) p − 2 + 2 . For all kn -component mixtures m(·; kn , µ, w, σ), we have that m(·; kn , µ, w, σ) ∈ KL(p0 , n ) whenever (µ, w, σ) ∈ Mn × Wn × Σn . The sets Mn and Wn are the l1 -balls Bkn (µn , σ(n)e 2n ) and ∆kn (wn , and Σn is the hypercube Hkn [σn , σn +

e 2 σ(n) kn ],

e 4n 2 ),

where kn − 1 elements of σn equal σ(n),

and one element is 1. We verify condition (3.33) by lower bounding Π ({K = kn , µ ∈ Mn , w ∈ Wn , σ ∈ Σn }) = Π ({K = kn , µ ∈ Mn }) Π ({w ∈ Wn |K = kn }) Π (Σn |K = kn ) .

46

3. Bayesian Density Estimation with Location-Scale Mixtures

Note that conditional on K and µ, the vectors w and σ are independent. By (3.7), the  prior mass on {K = kn , µ ∈ Mn } is at least a multiple of exp −kn log ke2n . The same n

holds for the set {w ∈ Wn |K = kn }, by condition (3.8). −1

Since σ1−1 , . . . , σk−1 are i.i.d. exponential (σ(n)), restricted to [ln , un ] 3 σ(n) n

,

  Πn {σ ∈ Σn |K = kn }  1 1 e 2 σ(n) −1 −1  e 2 σ(n) −1  ∈ (σ(n) + n ) , σ (n) ∀i 6= j Πn ∈ (1 + n ) ,1 = Πn σi kn σj kn kn −1    e2 −kn  −σ(n)(σ(n)+ e2n σ(n) )−1 n σ(n) −1 kn = e−σ(n)ln − e−σ(n)un − e−1 e−σ(n)(1+ kn ) − e−σ(n) e  kn −1  e2  e2  n σ(n) −1 n σ(n) −1 ≥ e−σ(n)(σ(n)+ kn ) − e−1 e−σ(n)(1+ kn ) − e−σ(n) kn −1  σ(n) σ(n) | − 1 | − σ(n) | e−σ(n) ≥ e−1 | e 2n σ(n) e 2 σ(n) σ(n) + kn 1 + nkn n kn o e 2 kn = exp −kn log 2 . & n kn e n (3.14) In the first inequality we used that σ(n)un ≥ 1, and therefore e−σ(n)ln −e−σ(n)un

−kn



1. The second inequality follows from the mean value theorem. Consequently, we find positive constants c, c2 and c4 such that Πn (KL(p0 , n )) ≥ c4 exp{−ckn (log kn )b log

1 } ≥ c4 exp{−c2 n2n }, n

(3.15)

provided that t1 ≥ 1 + 52 (b + p−1 ). The next step is to specify an increasing sequence of submodels Pn such that (3.32) and (3.31) hold. Given the constant d2 in (3.6), let bn be a polynomially increasing 1

sequence such that bdn2 > n 5 and define  −1 k Pn,k = m(·; k, µ, w, σ)|w ∈ ∆k , µ ∈ Hk [−bn , bn ], σ ∈ [u−1 . n , ln ] rn

1

For rn = n 5 (log n)t3 , we choose a sequence of submodels Pn = ∪ Pn,k , where t3 > 0 is k=1

−1 k to be determined. By construction, the prior on σ is restricted to [u−1 n , ln ] . The prior

mass on mixtures with more than rn support points and the prior mass on mixtures with at least one support point outside [−bn , bn ] is controlled by conditions (3.5) and (3.6): Πn (Pnc ) ≤

∞ X

r

ρ(k) + Πn (N ([−bn , bn ]c > 0)) . e−d1 rn (log n) .

k=rn 2

Condition (3.32) requires that the right hand side is bounded by a multiple of e−(c2 +4)nn . Since c2 in (3.15) is unknown, this is only the case if t3 + r > 2t1 .

3.2. Location-scale mixtures of exponential kernels

47

As the L1 -distance is bounded by the Hellinger distance, condition (3.31) only needs to be verified for the L1 -distance. Since for any pair of metric spaces (A, d1 ) and (B, d2 ) we have D(, A × B, d1 + d2 ) ≤ D( 2 , A, d1 )D( 2 , B, d2 ), Lemma 3.2 implies that for all k ≥ 1, D(δn , Pn,k , k · k1 ) is bounded by D

  δ u−1   δ u−1  n n n n −1 k , ∆ k , k · k1 D , Hk [−bn , bn ], k · k1 D , [u−1 n , l n ] , k · k1 . 3 6kψk∞ 3



n

Lemma 3.3 provides the following bounds: D

 δ u−1  n n , Hk [−bn , bn ], k · k1 6kψk∞ δ  n D , ∆ k , k · k1 3  δ u−1  n n −1 k , [u−1 D n , l n ] , k · k1 3

k    δ u−1 −k Y δn u−1 n n n 2bn + , ≤ k! 3kψk∞ 3kψk∞ i=1



15 k−1 , δn

≤ k!

−k u−1 n δn /3

k  Y

 −1 (ln−1 − u−1 n ) + δn un /3 .

i=1

−k k Consequently, D(δn , Pn,k , k · k1 ) ≤ C k (k!)2 δn−3k u2k n ln bn for each k, and

log D(δn , Pn , k · k1 ) ≤ log

rn X

   n rn rn b l D(δn , Pn,k , k · k1 ) ≤ log rn C rn rn !δn−3rn u2r n n n

k=1

≤ log rn + rn log C + log rn + 3 log δn−1 + 2 log ln−1 + log bn



1

. n 5 (log n)t3 +r ≤ c1 nδn2 , where the last inequality only holds if t3 + 1 ≤ 2t2 . Combining this with t3 + r > 2t1 , we find that t2 > t1 + at least min(t1 , t2 ) > 1

1−r 2 . The logarithmic factor + 52 (b + p−1 ) + max(0, 1−r 2 ).

t in the convergence rate has to be

If p0 itself is a finite mixture of the kernel ψ, a parametric rate can be achieved, again with an extra logarithmic factor. The only limitation in the following result is that the intervals [ln , un ] cannot increase arbitrarily fast. Whereas in Theorem 3.1, un could increase at any polynomial rate as long as un & n1/5 , it is now required that un does not increases faster than n1/5 (log n)s , where s = 15 (b + p−1 ). The reason for this appears in the entropy calculation in the proof below, where it is required that un σ(n) . log n. Theorem 3.2. If, for some k0 ≥ 1, µ ∈ Rk0 , w0 ∈ ∆k0 and τ0 ∈ (0, ∞)k0 , p0 is of the form Pk0

i=1

w0,i ψτ0,i (x − µ0,i ), the posterior Πn (· | X1 , . . . , Xn ) converges to p0 , relative to the

L1 , L2 or Hellinger metric, with rate αn = n−1/2 (log n)t , where t > min( 21 , 1 − 2r ). 1

Proof of Theorem 3.2. Again we use Theorem 3.5 with sequences n = n− 2 (log n)t1 , 1

rn

δn = n− 2 (log n)t2 , rn = (log n)t3 and a sequence of submodels Pn = ∪ Pn,k , where k=1

48

3. Bayesian Density Estimation with Location-Scale Mixtures

t1 , t2 and t3 are to be determined. Let e n = n / log −1 ¯ = min τ0,i and assume that n , τ n > min{m | u−1 ¯}. From Lemma 3.2 it follows that whenever m ≤τ (µ, w, σ) ∈ Bk0 (µ0 , e 2n ) × ∆k0 (w0 , e 2n ) × Hk0 (τ0 , e 2n ), km(·; k0 , µ, w, σ) − p0 k . e 2n for the L1 - as well as for the supremum norm. This yields R p0 2n , and using Lemma 3.14 it can be seen that these mixtures are the bound m dP0 . e contained in a Kullback-Leibler ball around p0 . First we lower bound the prior mass on Hk0 (τ0 , e 2n ):     −1 Πn Hk0 (τ0 , e 2n ) = Πn σi−1 ∈ ((τ0,i + e 2n , τ0,i )∀i = (e−σ(n)ln − e−σ(n)un )−k0

k0  Y

2 −1

e−σ(n)(τ0,i +en )

−1

− e−σ(n)τ0,i



i=1

≥ σ(n)k0 | un − ln |k0 e−k0 σ(n)un σ(n)k0

k0 Y

−1 | (τ0,i + e 2n )−1 − τ0,i | e−σ(n)

P

−1 τ0,i

i=1

n X o −1 = exp −k0 un σ(n) + 2 log σ(n)−1 + log | un − ln |−1 −σ(n) τ0,i + log e −1 n & exp{−d5 k0 log e −1 n } (3.16) for a constant d5 > 0, as un σ(n) ≤ log n and log e −1 n . log n. Since it is assumed that 1

(3.7) and (3.8) hold with kn = k0 and n = n− 2 (log n)t1 ,  Πn K = k0 , µ ∈ Bk0 (µ0 , e 2n ), w ∈ ∆k0 (w0 , e 2n ), σ ∈ Hk0 (τ0 , e 2n ) b 2 & exp{−d3 k0 log e −1 −1 −1 n − d4 k0 (log k0 ) log e n − d5 k0 log e n } & exp{−c2 nn }

if t1 ≥ 21 . With regard to the conditions (3.31) and (3.32) the argument is similar to the proof of Theorem 3.1, with only a different sequence rn . Again we obtain the inequalities t3 + r > 2t1 and t3 + 1 > 2t1 . For t1 = 12 , these imply t2 > 1 − 2r . The logarithmic factor t in the convergence rate is at least min(t1 , t2 ) > min( 12 , 1 − 2r ).

3.3 3.3.1

Examples of priors on the mixing distribution Priors for the locations

We show that conditions (3.5)-(3.7) hold for two important types of priors for (k, µ). First, we consider hierarchical priors, where K is sampled from a prior ρ(·) on N, such that (3.5) holds by assumption. In addition it is assumed that ρ(k) & e−d3 k log k ,

(3.17)

3.3. Examples of priors on the mixing distribution

49

which we need to obtain the lower bound in (3.7). Given K = k, the locations µ1 , . . . , µk are drawn independently from a prior pµ on R satisfying pµ (x) & ψ(x),

(3.18)

pµ (x) . e−a1 |x|

a2

for constants a1 > 0 and a2 ≤ p .

(3.19)

The latter assumption implies that for any y > 0, Z a2 d2 P (| µi |> y) = pµ (x)dx . y max{0,1−a2 } e−y . e−|y|

(3.20)

[−y,y]c

for some constant d2 > 0. Because Eρ K < ∞ by condition (3.5), (3.6) follows from (3.20): Π(N ([−y, y]c ) > 0) = ≤

∞ X k=1 ∞ X

ρ(k) Π( max | µi |> y | K = k) i=1,...,k

(3.21) 

−|y|d2

ρ(k)k P (| µi |> y) . Eρ K e

.

k=1

To verify (3.7), let k > 0, 
0 such that Π(K = k, µ ∈ Bk (µ0 , )) & e−d3 k log k

 ¯ k k

1 & exp{−d3 k log }, 

(3.22)

using (3.17) and the fact that k ≤ −1 . Conditions (3.5) and (3.17) imply that ρ needs to of ”exponential form”. If for example for some positive constants B1 , B2 and A1 ≥ A2 B1 e−A1 k ≤ ρ(k) ≤ B2 e−A2 k ,

(3.23)

condition (3.5) holds for r = 0. Such exponential bounds were used by Ghosal [17] for density estimation with mixtures of beta-densities. If ρ is Poisson with intensity ν, we P∞ ν m+1 have k=m+1 ρ(k) ≤ (m+1)! , and using Stirling’s bound for (m + 1)! it can be seen that (3.5) holds with r = 1. Poisson processes are another popular choice for the location prior. We consider a Poisson point process with base measure Pµ on R and intensity λ. It is assumed that Pµ has a density pµ for which (3.18) and (3.19) hold. Again a lower bound on Π(K = k, µ ∈ Bk (µ0 , )) can be obtained by bounding Π({K = k, µ ∈ Rk : | µi − µ0i |≤  k,

1 ≤ i ≤ k}). For some integer l ≤ k, we can find disjoint intervals I1 , . . . , Il ⊂ A

50

3. Bayesian Density Estimation with Location-Scale Mixtures

of length

 k,

containing µ01 , . . . , µ0k . Let ki be the number of points in Ii , and I c the

complement of I1 ∪ . . . ∪ Il in A . Since all Ii are contained in A and pµ & ψ, Z Z  Pµ (Ii ) = pµ (x)dx & ψ(x)dx & λ(Ii )ψ(a ) = ¯. k Ii Ii

(3.24)

Again the tail assumptions on pµ can be used to verify (3.6): Π(N ([−y, y]c ) > 0) = 1 − Π(N ([−y, y]c ) = 0) d2

= 1 − exp(−λPµ ([−y, y]c )) ≤ λPµ ([−y, y]c ) . e−|y| . Finally, (3.7) follows from (3.24) and the fact that ψ(a ) = ¯ by the definition of a , since   Π(K = k, µ ∈ Bk (µ0 , )) ≥ P N (I1 ) = k1 , . . . , N (Il ) = kl , N (I c ) = 0, N (Ac ) = 0 c c exp{−lλPµ (Ii )} (Pµ (Ii )λ)k e−λ(Pµ (I )+Pµ (A )) k1 ! · . . . · kl !  exp{−λ¯ }  k −λ(2a +¯) 1 ≥ ( λ) e & exp −c2 k log . k! k 

=

3.3.2

Priors on the weights

In this section 2 classes of priors on the simplex are discussed. In both cases the Dirichlet distributions appear as a special case. The proofs of Theorems 3.1 and 3.2 require lower bounds l1 -balls around some fixed point in the simplex. These bounds are given in Lemmas 3.4 and 3.6. In the literature, the Dirichlet prior is the most popular choice for pw . As it is a well known fact that a normalized vector of independent gamma distributed random variables is Dirichlet distributed, a straightforward generalization is to consider random variables with an alternative distribution on R+ . Given independent random variables Y1 , . . . , Yk with densities fi on [0, ∞), define a vector X with elements Xi = Yi /(Y1 + . . . + Yk ), i = 1, . . . , k. For (x1 , . . . , xk−1 ) ∈ Sk−1 , Z ∞ P (X1 ≤ x1 , . . . , Xk−1 ≤ xk−1 ) = P (Y1 ≤ x1 y, . . . Yk−1 ≤ xk−1 y) dP Y1 +...+Yk (y) 0

Z



Z

x1 y

Z

x2 y

Z

xk−1 y

···

= 0

0

0

fk (y − 0

k−1 X

si )

i=1

k−1 Y

fi (si )ds1 · · · dsk−1 dy.

i=1

(3.25) The corresponding density is f

X1 ,...,Xk−1

Z



(x1 , . . . , xk−1 ) =

y

k−1

fk (y −

0

Z 0

xi y)

i=1 ∞

y

=

k−1 X

k−1

k Y i=1

fi (xi y)dy,

k−1 Y i=1

fi (xi y)dy (3.26)

3.3. Examples of priors on the mixing distribution

where xk = 1 −

Pk−1 i=1

51

xi . We obtain a result similar to lemma 8 in Ghosal and van der

Vaart [21]. Lemma 3.4. Let X1 , . . . , Xk have a joint distribution with a density of the form (3.26). Assume there are positive constants c1 (k), c2 (k) and c3 such that for i = 1, . . . , k, fi (z) ≥ c1 (k)z c3 if z ∈ [0, c2 (k)]. Then there are constants c and C such that for all c +1

y ∈ ∆k and all  ≤ ( k1 ∧ c1 (k)c2 (k) 3 )   1 P X ∈ ∆k (y, 2) ≥ Ce−ck log(  ) . Proof. As in [21] it is assumed that yk ≥ k −1 . Define δi = max(0, yi − 2 ) and δ¯i = Pk Pk−1 min(1, yi + 2 ). If xi ∈ (δi , δ¯i ) for i = 1, . . . , k − 1, then i=1 | xi − yi |≤ 2 i=1 | P k−1 2 xi −yi |≤ 2(k−1)2 ≤ . Note that (x1 , . . . , xk−1 ) ∈ Sk , as j=1 xj ≤ k−1 k +(k−1) < 1. Since all xi in (3.26) are at most one, c2 (k)

Z f (x1 , . . . , xk−1 ) ≥

y

k−1

0

k Y

c3

c1 (k)(xi y)

i=1



k c +1 c2 (k) 3 c1 (k) (x1 · . . . · xk )c3 . dy = (c3 + 1)k

Because xk = |1 −

k−1 X

xj | = |yk +

j=1



k−1 X

(yj − xj )| ≥ k −1 − (k − 1)2 ≥ 2 ≥

j=1

1 , k2

k k k−1 Z ¯ c +1 c +1 c2 (k) 3 c1 (k) 2k(c3 +1)−2 c2 (k) 3 c1 (k) Y δj c3 xj dxj ≥  k 2c3 (c3 + 1)k (c3 + 1)2 k j=1 δj √  2 c3 +1 ) . ≥ exp k log(c2 (k) c1 (k)) − log(c3 + 1) − log(k) − 2k log(  1

 P X ∈ Bk (y, ) ≥

c3 +1

As  ≤ ( k1 ∧ c1 (k)c2 (k)

), there are constants c and C for which this quantity is

−ck log( 1 )

lower-bounded by Ce

.

Alternatively, the Dirichlet distribution can be considered as a Polya tree distribution. Following Lavine [38] we use the notation E = {0, 1}, E 0 = ∅ and for m ≥ 1, i m E m = {0, 1}m . In addition, we use E∗m = ∪m for i=0 {0, 1} . It is assumed that k = 2

some integer m, and the coordinates are indexed with binary vectors  ∈ E m . A vector X has a Polya tree distribution if X =

m Y j=1,j =0

U1 ···j−1

m Y

 1 − U1 ···j−1 ,

j=1,j =0

 m−1

where Uδ , δ ∈ E∗ is a family of beta random variables with parameters (αδ1 , αδ2 ), δ ∈  m−1 E∗ . We only consider symmetric beta densities, for which αδ = αδ1 = αδ2 . Adding pairs of coordinates, lower dimensional vectors Xδ can be defined for δ ∈ E∗m−1 . For

52

3. Bayesian Density Estimation with Location-Scale Mixtures

 δ ∈ E∗m−1 , let Xδ0 = Uδ Xδ and Xδ1 = 1 − Uδ Xδ , and X∅ = 1 by construction. If for 1 ≤ i ≤ m and δ ∈ E i , αδ = 12 αδ1 ···δi−1 , X is Dirichlet distributed. Using further splittings, infinite Polya trees can be defined, see for example Lavine [38]. P∞ These Polya tree processes are absolutely continuous with probability 1 if i=0 α1i < ∞ (see Kraft [33]). Lemma 3.5. Let X have a Polya distribution with parameters αδ , δ ∈ E∗m−1 . Then for all y ∈ ∆2m and η > 0,   X pm (y, η) = P X ∈ ∆k (y, η) = P ( | Xm − ym |≤ η) ∈E m



m Y

P ( max | Uδ − ∂∈E i−1

i=1

yδ0 η |≤ m−i+2 ). yδ 2

Proof. For all i = 1, . . . , m and δ ∈ E i−1 , | Uδ Xδ − yδ0 | | (1 − Uδ )Xδ − yδ1 |

≤ Uδ | Xδ − yδ | +yδ | Uδ −

yδ0 |, yδ

≤ (1 − Uδ ) | Xδ − yδ | +yδ | (1 − Uδ ) −

yδ − yδ0 |. yδ

Consequently, X

X

| X δ − yδ | =

δ∈E m

| Xδ0 − yδ0 | + | Xδ1 − yδ1 |

δ∈E m−1

X



| X δ − yδ | + 2

δ∈E m−1

X



X

yδ | Uδ −

δ∈E m−1

| Xδ − yδ | + 2 max | Uδ − m−1 δ∈E

δ∈E m−1

yδ0 | yδ

yδ0 |. yδ

Hence, yδ0 η η | Uδ − |≤ ) pm (y, η) ≥ pm−1 (y, )P ( max m−1 2 yδ 4 ∂∈E m Y yδ0 η η ≥ P ( max | Uδ − |≤ m−i+2 )P (| U∅ − y0 |≤ m ) i−1 yδ 2 2 ∂∈E i=2 ≥

m Y i=1

P ( max | Uδ − i−1 ∂∈E

yδ0 η |≤ m−i+2 ), yδ 2

as p1 (η2−m ) = P (| X0 − y0 | + | X1 − y1 |≤ η2−m ) = P (| U0 − y0 | + | (1 − U0 ) − (1 − y0 ) |≤ η2−m ) = P (| U0 − y0 |≤ η2−m−1 ).

3.3. Examples of priors on the mixing distribution

53

With δ ∈ E i−1 fixed, we can lower-bound P (| Uδ −

yδ0 yδ

|≤

η 2m−i+2 )

for various values

of the αδ . In the remainder we will assume that αδ = αi , for all δ ∈ E i−1 , with i = 1, . . . , m. For increasing αi ≥ 1, Uδ has a unimodal beta-density, and we can restrict to the ”worst case” where worst case is

yδ0 yδ

=

1 2.

yδ0 yδ

= 0. If the αi are decreasing, and smaller than one, the

In both cases Lemma 3.15 is used to lower bound the normalizing

constant of the beta-density. The case αi → ∞, i = 1, . . . , m with m → ∞ −m+i−2

P (| Uδ − 0 |≤ η2

Z

η2−m+i−2

)= 0

Z

η2−m+i−2

& 0

Γ(2αi ) αi −1 x (1 − x)αi −1 dx Γ2 (αi ) 1 1 3 3 1 αi − 2 22αi − 2 xαi −1 dx = 2−(m−i)αi − 2 αi − 2 η αi 2

At the ith level there are 2i−1 independent variables Uδ with the Beta(αi , αi ) distribution, and therefore m   Y 2i−1 3 3 2−(m−i)αi − 2 αi − 2 η αi log pm (y, η) & log i=1

=

m X

 1 3 2i−1 −αi log − log(αi ) − αi (m − i) log(2) . η 2 i=1

The case αi → 0, i = 1, . . . , m with m → ∞ 1 P (| Uδ − |≤ η2−m+i−2 ) = 2

Z

1/2+η2−m+i−2

1/2−η2−m+i−2

& αη2−m+i−1

Γ(2αi ) αi −1 x (1 − x)αi −1 dx Γ2 (αi )

1 αi −1 4

m   X   1 log pm (y, η) & 2i−1 log(αi ) − 2αi + (m − i − 1) log(2) − log . η i=1

We have the following application of these results. Lemma 3.6. Let Xδm be Polya distributed with parameters αi . If αi = ib for b > 0, 1 P (X ∈ ∆k (y, η)) ≥ C exp{−ck(log k)b log }, η for some constants c and C. By a straightforward calculation one can see that this result is also valid for b = 0. In the Dirichlet case αi = 12 αi−1 for i = 1, . . . , m, 1 P (X ∈ ∆k (y, η)) ≥ C exp{−ck log }, η in accordance with the result in Ghosal, Ghosh and van der Vaart [18].

54

3.4

3. Bayesian Density Estimation with Location-Scale Mixtures

Mixtures of symmetric stable distributions

For α ∈ [0, 1], we consider kernels ψσ with characteristic functions of the form e−|σλ|

α+1

.

These are the symmetric stable distributions, see for example Ibragimov and Linnik [29] and Zolotarev [61]. A random variable X with a distribution of this form has tails P (X > x) = o(x−(1+α) ). Let Dα > 0 denote the number for which P (X > x) ≤ x−(1+α) if x ≥ Dα , and let γ be a constant which is 2 if α = 0 and 1 otherwise. For comparison with the results of the previous section, we give a convergence rate for finite mixtures. In Lemma 3.4 we give an entropy bound for more general mixtures. The main reason for the slower rate of convergence in Theorem 3.3 is that h(f, ψσ ∗ f ) is no longer of order σ 2 . Instead we have the following result. Lemma 3.7. If ψ is a density such that h(f, ψσ ∗ f ) . σ

δ 2

and kf − ψσ ∗ f k∞ . σ

δ 2

R

y δ ψ(y)dy < ∞ for some δ ∈ (0, 1], then

for all C 2 -densities f .

Proof. Because f is a C 2 density, kf (i) k∞ < ∞ for i = 0, 1, 2. For any x ∈ R, Z | f (x) − (ψσ ∗ f )(x) | ≤ | f (x − z) − f (x) | ψσ (z)dz Z n o δ δ δ 1 1− δ ≤ 2kf k∞ 2 (| z | kf (1) k∞ ) 2 + ( kf (2) k∞ z 2 ) 2 ψσ (z)dz . σ 2 . 2 √ √ From the inequality ( x − y)2 ≤ |x − y| it follows that v uZ sZ o2 u n p h(f, ψσ ∗ f ) . t f (x − σy)ψ(y)dy − f (x) dx sZ Z ≤

| f (x) − f (x − σy) | ψ(y)dydx sZ Z



δ

| f (x) − f (x − σy) |1−δ dxkf 0 kδ∞ | σy |δ ψ(y)dy . σ 2 .

The following lemmas give the same type of approximation results for Cauchy mixtures as the results of lemmas 3.12, 3.13 and 3.1 for exponential mixtures. Lemma 3.8. Let Fbe a  measure on a compact interval of width 2a, and assume that 1 ¯ ] and a . log  . Then there exists a discrete distribution F 0 with at most σ ∈ [σ, σ  γ N . log 1 support points within the support of F , for which k(F − F 0 ) ∗ ψσ k∞ .  2+α 1+α 1 ∧ 2¯σ1Dα . if  < 1, and k(F − F 0 ) ∗ ψσ k1 .  2+α if  < 2a

3.4. Mixtures of symmetric stable distributions

55

Proof. Without loss of generality it can be assumed that F is a distribution on [−a, a]. For each x ∈ R, Z  1 | e−ixλ F˜ (λ) − F˜ 0 (λ) ψ˜σ (λ)dλ | | (F ∗ ψσ )(x) − (F ∗ ψσ )(x) | = 2π Z 1 | F˜ (λ) − F˜ 0 (λ) | ψ˜σ (λ)dλ. ≤ 2π R Since F is a distribution on a compact interval, F 0 can be chosen such that z m dF (z) = R m 0 z dF (z) for m = 1, 2, . . . , k − 1; see the argument in the proof of Lemma 3.12 in the 0

appendix. Consequently | F˜ (λ) − F˜ 0 (λ) | =|

Z

0

izλ

d(F − F )(z) |=|

e

Z nk−1 X (iλz)j j=0

Z ≤

|

j!

+

 (iλz)k iλξz o e d F − F 0 (z) | k!

 zk k (|λ|a)k λ | d F + F 0 (z) ≤ 2 , k! k! (3.27)

1+α where ξz lies between 0 and z. Since ψ˜σ (λ) = e−|σλ| ,

1 | (F ∗ ψσ )(x) − (F ∗ ψσ )(x) | ≤ π 0

Z

(λa)k −|σλ|1+α 2ak e dλ = k! πk!

Z



| λ |k e−|σλ|

1+α



0

k+1

=

α

where Cα = e 1+α

1 1+α

1  1+α

∈ [1,

) Γ( α 2ak σ −(k+1) 1+α . Cαk ak σ −k (k + 1)− 1+α k , (1 + α)π k! (3.28)

p e/2] if α ∈ [0, 1]. The right hand side of (3.28) is at

most  if k is a sufficiently large multiple of log 1 . For a bound on the L1 -norm we need to 2+α 1 use the tail properties of stable distributions. Since by assumption  < 2a ∧ 2¯σ1Dα , 1

¯ Dα }. This is used in the second inequality in the T = − 2+α is larger than 2 max{a, σ R following display, where ψσ (x−z)d(F +F 0 )(z) ≤ 2ψσ (x−a) ≤ 2ψσ (x/2) ≤ σ2 ψ(x/(2¯ σ )). kF ∗ ψσ − F 0 ∗ ψσ k1 ≤

Z

Z

ψσ (x − z)d(F + F 0 )(z)dx + 2T kF ∗ ψσ − F 0 ∗ ψσ k∞

|x|≥T

Z .

ψ(x/(2¯ σ ))dx + 2T kF ∗ ψσ − F 0 ∗ ψσ k∞

|x|>T

−(1+α) 1+α . T /(2¯ σ) + 2T kF ∗ ψσ − F 0 ∗ ψσ k∞ .  2+α . The case α = 0 requires that Cα a/σ in the right hand side of (3.28) has to be smaller α

than one, as the factor (k + 1)− 1+α k vanishes. To achieve this, the support of F is partitioned into intervals of length

Cα a 2σ .

intervals Ii of length li =

σ 2Cα

The interval [−a, a] is partitioned into k =

and Ik+1 with length lk+1 smaller than 2Cσα . The Pk+1 probability measure F can be written as F = i=1 F (Ii )Fi with probability measures

b4 aCσ α c

56

3. Bayesian Density Estimation with Location-Scale Mixtures

Fi on Ii , and F ∗ ψσ =

Pk+1 i=1

F (Ii )(Fi ∗ ψσ ). By the preceding result (for α > 0),

0

discrete distributions Fi with Ni . log

1 

support points in Ii can be found such that 1+α

0

kFi ∗ ψσ − Fi ∗ ψσ k∞ ≤  and kFi ∗ ψσ − Fi 0 ∗ ψσ k1 .  2+α . The distribution F 0 =  Pk+1 0 aCα 1 i=1 F (Ii )Fi has b4 σ c + 1 log  support points and for any norm, it satisfies k(F − F 0 ) ∗ ψσ k = k

k+1 X

k+1 X  F (Ii ) Fi 0 ∗ ψσ − Fi ∗ ψσ k ≤ F (Ii )kFi 0 ∗ ψσ − Fi ∗ ψσ k.

i=1

i=1

2  As it is still assumed that a . log 1 , it follows that F 0 has a multiple of log 1 support points. The proof of the following lemma is identical to that of lemma 2 in Ghosal and van der Vaart [21]. Lemma 3.9. Let a, F and  be as in the previous lemma. For any σ > 0 there exists a finite distribution F 0 on the support of F for which 1+α

k(F − F 0 ) ∗ ψσ k1 .  2+α ,

k(F − F 0 ) ∗ ψσ k∞ .

 , σ

 γ where the number of support points of F 0 is bounded by a multiple of ( σa ∨ 1) log 1 . For Cauchy mixtures we have the following analogue of Lemma 3.1; the proof is given in the appendix. Lemma 3.10. Let ψ(x) =

1 1 π 1+x2 −|x|q1

p0 ∈ C 2 , with tails of order e

denote the Cauchy density, and let δ ∈ ( 23 , 1). Let

, where q1 > (1 ∨ p). Then there exists a finite mixture

m = m(·; k , µ , w , σ ) and neighborhoods M 3 µ , W 3 w and Σ 3 σ , such that for all sufficiently small , m(·; k , µ, w, σ) ∈ KL(p0 , ) whenever (µ, w, σ) ∈ M × W × Σ . 3 2 The number of support points k is bounded by a multiple of e − δ log e14 . In the following theorem we use the same type of prior as for the exponential-type mixtures. q1

Theorem 3.3. Let p0 be twice continuously differentiable, with tails of order e−|x| , with q1 > 1. Let Πn be a sequence of priors satisfying conditions (3.5)-(3.8). Then Πn (· | X1 , . . . , Xn ) converges to p0 in P0n -probability, with respect to the Hellinger or 1

δ

L1 -metric, with rate αn = n− 2 1+δ (log n)t , where t >

2+ 12 b+δ −1 1+δ −1

+ max(0, 1−r 2 ) and

δ ∈ ( 23 , 1) is an arbitrary constant. 1

δ

1

δ

Proof. Let n = n− 2 1+δ (log n)t1 and δn = n− 2 1+δ (log n)t2 be the sequences used in Theorem 3.5, where the constants t1 and t2 are to be determined. Define e n =  2 1 −1 2 n log n = σ (n) and σ(n) = e  δ . By Lemma 3.1 there exists a sequence of

3.4. Mixtures of symmetric stable distributions

57

mixtures mn = mn (·; kn , wn , µn , σn ) and neighborhoods Mn × Wn × Σn such that m(·; kn , µ, w, σ) ∈ KL(p0 , n ) if (µ, w, σ) ∈ Mn × Wn × Σn . The support points are contained in An = [−an , an ], where an = log e14 . The sets Mn and Wn are the l1 -balls Bkn (µn ,

σ(n)2 e 2n ) kn

and ∆kn (wn ,

e 5n 2 ).

The Σn are the hypercubes Hkn [σn , σn +

e 2 σ(n)2 ], kn

where all kn elements of σn equal σ(n). Conditions (3.7) and (3.8) guarantee that the prior puts sufficient mass on Mn and Wn . The prior mass on Σn can be lower bounded as in (3.14). The constant U (in un = nU ) needs to be sufficiently large, to ensure that σ(n) ∈ [un −1 , ln −1 ]. We find positive constants c, c2 and c4 such that Πn (KL(p0 , n )) ≥ c4 exp{−ckn (log kn )b log

1 } ≥ c4 exp{−c2 n2n }, n

(3.29)

provided that t1 is sufficiently large. By comparing the logarithmic factors of n2n = 1

1

2

n 1+δ (log n)2t1 and kn . n 1+δ (log n)3− δ (t1 −1) it can be seen that this is the case if −1

t1 ≥

2+ 12 b+δ 1+δ −1

.

The verification of (3.32) and (3.31) is almost the same as in the proof of Theorem 3.1. The same models Pn,k and Pn can be used, except for different sequences bn and δ

rn . We let rn = n 1+δ (log n)t3 and bn is a polynomially increasing sequence such that bdn2 > n

1 1+δ

. Again, condition (3.32) requires that t3 + r > 2t1 and condition (3.31)

requires that t3 + 1 ≤ 2t2 . Consequently the constant t in the convergence rate has to be larger than

2+ 21 b+δ −1 1+δ −1

+ max(0, 1−r 2 ).

For a more general bound on the entropy of mixtures of stable distributions, we need the following lemma. Lemma 3.11. For each non-negative integer j, j  2  1+α jα α σ j+1 k(ψσ ∗ F )(j) k∞ . e 1+α (j+1) (j + 1)− 1+α . j! 1+α

(3.30)

Proof. By the inversion formula, Z Z  1 1 e−ixλ F{ψσ ∗ F }(λ) dλ = e−ixλ ψ˜σ (λ)Fˆ (λ)dλ, ψσ ∗ F (x) = 2π 2π (j) R 1 and ψσ ∗ F (x) = 2π (−iλ)j e−ixλ ψ˜σ (λ)Fˆ (λ)dλ. Consequently, (a) 1 Z α+1 k(ψσ ∗ F )(j) k∞ . | λ |j e−|σλ| | Fˆ (λ) | dλ 2π sZ sZ 1 ≤ | Fˆ (λ) |2 e−|σλ|α+1 dλ | λ |2j e−|σλ|α+1 dλ 2π s r r 1 2 1  2 2j + 1  2j + 1  −(j+1) ≤ Γ Γ .σ Γ . 2j+1 2π σ 1 + α (1 + α)σ 1+α 1+α √ Since j! = Γ(j + 1) ≥ 2πe−(j+1) (j + 1)j , α ≤ 1 and using inequality (3.48) in Lemma 3.15 in the appendix, we obtain (3.30).

58

3. Bayesian Density Estimation with Location-Scale Mixtures

n The preceding lemma can be used to obtain an entropy bound for the class F = ψσ ∗F : o F ∈ M(R) . The following result is not used in the present work, but is interesting in its own right. Theorem 3.4. For k·ka∞ the supremum norm on [−a, a], log N (4, F, k·ka∞ ) .

a σ



log

1 

2

Proof. Let f = F ∗ ψσ ∈ F for some F . For i = −a/(rσ) + 1, . . . , a/(rσ), f can be approximated on the intervals ((i − 1)rσ, irσ] with Taylor polynomials of degree k. If α > 0, the constant r is set to one, and for the case α = 0, r is chosen smaller than 21 . Applying Lemma 3.11 with j = k, the remainder of the Taylor polynomial on ((i − 1)rσ, irσ] satisfies (for each i) k  2  1+α  kα−1 α (x − irσ)k | Ri x |=| (ψσ ∗ F )(k) (ξ) (k + 1)− 1+α rk . |. e 1+α (k+1) k! 1+α

If α > 0 this is ’dominated’ by the factor (k + 1)− k

reduces to (2r) . In the sequel we let k = k ≈ log

kα−1 1+α

1 

; if α = 0 the right hand side

be the smallest integer for which

the right hand side of the last display is at most . We construct the approximating class of densities a/(rσ)

F0

n = x 7→

X

1((i−1)rσ,irσ] (x)

kX  −1

o αij (x − irσ)j : αij ∈ Aj (, σ, k ) ,

j=0

i=−a/(rσ)+1

where Aj (, σ, k ) = {0, ±ηj , ±2ηj , . . . , ±Lj ηj } and ηj = (j)

F)

 k σ j

and Lj =

σ j k j! k(ψσ



k∞ . From Lemma 3.11 it follows that the Lj ’s are sufficiently large. For this choice

of the ηj ’s we find that every f ∈ F can be approximated by an element of F0 , such that the resulting k · ka∞ -error is at most 2. As a consequence, N (4, F, k · ka∞ ) ≤ N (, F0 , k · ka∞ ) = card(F0 ) a/(rσ)



Y

k  Y

i=−a/(rσ)+1 j=0

k 2a  Y  rσ 2Lj + 1 = (2Lj + 1) . j=0

By another application of Lemma 3.11, k

log N (2, F, k · ka∞ ) . k

  k σ j k(ψ ∗ F )(j) k + 1  2a 2a X 2k σ ∞  log 2 . (k + 1) log σ j=0  j! rσ 

 n o 2a X α 1 2  αj αj − 1 (j + 1) + log j− log(j + 1) + log σ + rσ j=0 1 + α 1+α 1+α 1+α 1+α a 1 2 . log . rσ 

.

3.5. Appendix

3.5

59

Appendix

Theorem 3.5 (Ghosal and van der Vaart [22]). Suppose that for priors Πn on P, there are sets Pn ⊂ P and positive sequences δn , n → 0, such that n min(2n , δn2 ) → ∞. Suppose that for certain constants c1 , c2 , c3 , c4 , log D(δn , Pn , d) ≤ c1 nδn2 , −(c2 +4)n2n

Πn

Πn (P \ Pn ) ≤ c3 e  2 KL(p0 , n ) ≥ c4 e−c2 nn .

(3.31) ,

(3.32) (3.33)

Then for αn = max(n , δn ) and M > 0 sufficiently large, Πn (p : d(p, p0 ) ≥ M αn | X1 , . . . , Xn ) → 0 in P0n -probability. The following lemma is a straightforward generalization of the result for normal mixtures, contained in lemma 2 of Ghosal and van der Vaart [21]. p

Lemma 3.12. Given p > 0, let ψ(x) = Cp e−|x| . Let F be a probability measure on ¯ ] and  < (1 ∧ Cp ). Then there [−a, a], where a . ψ −1 (), and assume that σ ∈ [σ, σ exists a discrete distribution F 0 on [−a, a] with at most N = pe2 log

Cp 

support points

0

such that kF ∗ ψσ − F ∗ ψσ k∞ . . Proof. For M = 2 max{a, σ ¯ ψ −1 ()}, sup | (F ∗ ψσ )(x) − (F 0 ∗ ψσ )(x) |≤ 2ψσ (M − a) ≤ |x|≥M

2 M ψ( ) .  σ 2σ

By Taylor’s expansion of ey and k! ≥ k k e−k , we have for any y < 0 and k > 1,

| ey −

k−1 X j=0

and for ψσ (y) =

Cp σ

 e | y | k yj |≤ , j! k

p

e−|y/σ| ,

| ψσ (x) − Cp

k−1 X j=0

(−1)j σ −(pj+1) | x |pj Cp (e1/p | x/σ |)pk |≤ . j! σ kk

For every distribution F on [−a, a] and y ∈ [−M, M ], |F ∗ ψσ (x) − F 0 ∗ ψσ (x)| is bounded

60

3. Bayesian Density Estimation with Location-Scale Mixtures

by Z k−1 X (−1)j σ −(pj+1) | x − z |pj 0 Cp d(F − F )(z) j! j=0 k−1 X (−1)j σ −(pj+1) | x − z |pj ψσ (x − z) − Cp +2 sup j! |x|≤M,|z|≤a j=0  Z k−1 pj X X (−1)j σ −(pj+1) pjl | x |pj−l | z |l . d(F − F 0 )(z) j! j=0 l=0  pk + 2σ −1 Cp ek ψ −1 () k −k , ≤ where we used that x−z σ if F 0 satisfies

Z

1 σ (|x|

+ |z|) . ψ −1 (). The first term on the right vanishes

Z

0

l

z dF (z) =

(3.34)

z l dF (z),

l = 1, . . . , p(k − 1).

(3.35)

By lemma A.1. in [22], there exists an F 0 with at most p(k − 1) + 1 support points such that (3.35) holds. The last term on the right in (3.34) is bounded by 2σ −1 Cp

e k

log

Cp k k Cp  = 2σ −1 Cp exp{−k log − log log }.  e 

For  < 1, this is bounded by a multiple of  if k ≥ log 2

which is the case if k = e log 2

at most p(e log

Cp 

Cp  .

1 

and log

k e

− log log

Cp 

≥ 1,

The required number of support points is therefore

− 1) + 1.

Lemma 3.13. Given σ ∈ [σ, σ ¯ ] and F ∈ M[−a, a], let F 0 be the discrete distribution from the previous lemma. Then kF ∗ ψσ − F 0 ∗ ψσ k1 .  ψ −1 (). Moreover, for any σ > 0 there exists a discrete F 0 with a multiple of (aσ −1 ∨ 1) log −1 support points, for which kF ∗ ψσ − F 0 ∗ ψσ k1 . ψ −1 () and kF ∗ ψσ − F 0 ∗ ψσ k∞ .

 σ

.

Proof. If | x |≥ T ≥ 2 max{a, σ ¯ ψ −1 (kF ∗ ψσ − F 0 ∗ ψσ k∞ )}, F ∗ ψσ (x) ≤ ψσ (x − |a|) ≤ ψσ (x/2) ≤

Cp −|x/(2¯ σ )|p , σ e

kF ∗ ψσ − F 0 ∗ ψσ0 k1 ≤

2 σ

hence Z ψ |x|>T

≤ 2

σ ¯ σ

|x|  dx + 2T kF ∗ ψσ − F 0 ∗ ψσ0 k∞ 2¯ σ

Z |x|>Dψ −1 ()

ψ(x)dx + 2T kF ∗ ψσ − F 0 ∗ ψσ0 k∞ . ψ −1 ().

An argument analogous to lemma 2 in Ghosal and van der Vaart [21] can now be used to prove the second assertion. If we make further assumption about the approximating finite mixtures, Kullback-Leibler neighborhoods of p0 can be constructed.

3.5. Appendix

61

Proof of Lemma 3.1. Using Lemma 3.13, we first obtain an a mixture m such that 1 h(p0 , m ) . / log =e . (3.36)  We then modify m in a way that the last inequality still holds, and show that the same is true for all mixtures m(·; k , µ, w, σ) with coefficients µ ∈ M = Bk (µ , e 2 σ()), 2

4

]. Finally we prove that for these w ∈ W = ∆k (w , e2 ) and σ ∈ Σ = Hk [σ , σ + e kσ()  mixtures m, and for m in particular, Z p0 (x) 1 log+ dP0 (x) . log , m(x) 

(3.37)

and apply Lemma 3.14 to show that these mixtures are contained in KL(p0 , ). √ Let a = ψ −1 (¯ ) and σ() = e . Let p0 be the (normalized) restriction of p0 to A = [−a , a ]. Using Lemma 3.13 with  = ¯, a = a and σ = σ(), a mixture m can p ¯ be found such that h(p0 ∗ ψσ() , m ) . ¯ψ −1 (¯ ) = e  and kp0 ∗ ψσ() − m k∞ . σ() ≤e . q −1 a (Notation: σ = (σ(), . . . , σ()) ∈ (0, ∞)k ). It has k . σ() log 1¯ . ψ √(¯) log 1 log 1¯ support points. We show that for the k · k∞,a - and Hellinger distance, d(p0 , m ) ≤ d(m , p0 ∗ ψσ() ) + d(p0 ∗ ψσ() , p0 ∗ ψσ() ) + d(p0 ∗ ψσ() , p0 )

(3.38)

is bounded by a multiple of e . First let d be the Hellinger distance. By assumption, h(p0 , p0 ∗ ψσ() ) . σ 2 () = e . To bound h(p0 ∗ ψσ() , p0 ∗ ψσ() ) we use the tail condition p

on p0 . Since ¯ < Cp e−D2 and a is larger than D2 , h2 (p0 ∗ ψσ() , p0 ∗ ψσ() ) ≤ kp0 ∗ ψσ() − p0 ∗ ψσ() k1 Z Z (3.39) ≤ | p0 (z − x) − p0 (z − x) | dx ψσ() (z)dz = 2P0 (Ac ) ≤ 2ψ(a ) = 2¯ .e 2 . Now let d be the k · k∞,a norm. To bound | p0 ∗ ψσ() − p0 ∗ ψσ() |, note that because of the bound on P (Ac ) in (3.4), | p0 (x) − p0 (x) |≤ kp0 k∞

¯ P0 (Ac ) ≤ kp0 k∞ c 1 − P0 (A ) 1 − ¯

for all x ∈ A . Therefore we have the bound Z    k p0 ∗ ψσ() (x) − p0 ∗ ψσ() (x)k∞,a ≤ | p0 (z − x) − p0 (z − x) | ψσ() (z)dz Z ¯ ≤ kp0 k∞ + p(z − x)ψσ() (z)dz (3.40) 1 − ¯ (A +x)c ¯ ¯ ≤ kp0 k∞ + ψσ() (0)P0 (Ac ) . ≤e . 1 − ¯ σ() The last term in (3.38) can be uniformly bounded on R. As Z  | p0 (x) − (p0 ∗ ψσ() )(x) | =| p0 (x − z) − p0 (x) ψσ (z)dz | Z 1 2 (2) (2) = z p0 (ξx,z )ψσ (z)dz . kp0 k∞ σ 2 () 2

62

3. Bayesian Density Estimation with Location-Scale Mixtures

2

and p0 is a C 2 density, kp0 − p0 ∗ ψσ() k∞ . σ() = e . Combining these parts, we find that | p0 (x) − m (x) | is bounded by | p0 (x) − (p0 ∗ ψσ() )(x) | + | (p0 ∗ ψσ() )(x) − (p0 ∗ ψσ() )(x) | + | (p0 ∗ ψσ() )(x) − m (x) | ¯ (2) .e , . kp0 k∞ σ 2 () + e + σ() (3.41) for all x ∈ A . Without loss of generality, we now make the following assumptions about k , w ∈ ∆k and µ ∈ Rk . • The elements of µ are ordered (µ,1 < µ,2 < . . . < µ,k ), maxi=2,...,k µ,i −  µ,i−1 ≤ 2σ()a , and max{µ,1 + a , µ,k − a } ≤ σ()a . This can be achieved by adding at most 2a /(2a σ()) = σ −1 () points, so k is still bounded by a multiple of

a σ()

log 1¯ .

• By modifying w , a version for which mini w,i ≥ e 4 can be obtained. By Lemma 3.2, the L1 -norm changes at most a multiple of k e 4 ≤ e 2 and the supremum norm changes at most

k e 4 σ()

.

¯ σ() .

• There is a component, say the k th, for which (w,k , µ,k , σ,k ) = (e 4 , 0, 1). Hence the first k − 1 elements of σ are σ() and the last element is 1. This extra component is introduced to ascertain the existence of a lower bound on the tails of m independent of σ . The L1 - and the supremum norm of m are changed at most e 4 . 4

Next we define M = Bk (µ , e 2 σ()), W = ∆k (w , e2 ) and Σ = Hk [σ , σ +

e 2 σ() k ],

neighborhoods of respectively µ , w and σ . Let µ ∈ M , w ∈ W and σ ∈ Σ , and let m = m(·; k , µ, w, σ). By Lemma 3.2, km − m k1 ≤ kw − w k1 + 2kψk∞

k X wi ∧ w,i i=1



σi ∧ σ,i

| µi − µ,i | +

k X wi ∧ w,i i=1

σi ∧ σ,i

| σi − σ,i |

 −1 kX  e 4 | µi − µ,i | e 4 + 2kψk∞ + | µk − µ,k | 2 σ() σ() i=1

+

k −1 1 X e 4 | σi − σ,i | + | σk − σ,k | . e 2 . σ() i=1 1−e 2 σ()/k

Combined with (3.38) (with d = h) this gives the bound h(p0 , m) ≤ h(p0 , m ) + p km − mk1 . e , i.e. m satisfies condition (3.36). By a similar application of Lemma 3.2 combined with (3.41), it can be seen that km − m k∞ . for all x ∈ A .

e 2 σ(e)

and | p0 (x) − m(x) |. e 

3.5. Appendix

63

The above assumptions on m are now used to obtain a lower bound for m on this interval. Using this lower bound it can be shown that m also satisfies (3.37). Let x ∈ A .  2 Because max µ,i − µ,i−1 ≤ 2σ()a and µ ∈ M , the interval x − σ()(a + e2 ), x + 2  σ()(a + e2 ) contains at least one support point with mass larger than e 4 . Using that 1  4 C w ∈ ∆k (w , e2 ) and a = ψ −1 (¯ ) = log ¯p p , we find that m(x) ≥ =

e 4 Cp e 2 p e 4 Cp exp{− | a + | }≥ exp{− | 2a |p } 2 σ() 2 2 σ() Cp e 4 ¯−2 e 4 Cp exp{−2p log }& 2 σ() ¯ σ()

p

for all x ∈ A and ¯ ≤ Cp . Because | m − p0 |. e  on A , Z Z Z p p p0 m+ | p0 − m | dP0 ≤ dP0 . (1 + σ()e −3 ¯2 )dP0 = 2a (1 + σ()e −3 ¯2 ) m A m A A (3.42) Since m has a component with bandwidth in [1, 1 + Z Ac

p0 dP0 . m

Z Ac

e −4 e−2q0 |x|

q1 −|x|p

dx ≤ 2e −4

Z

e 2 σ() k ]

and weight in ( 21 e 4 , 32 e 4 ),



a

exp{−q0 | x |q1 }dx . e −4

¯ C Cp (3.43) 1 1 −p

−q

for some positive constant C, where we used that q0 | x |q1 >| x |p if x > q0 and by assumption (on ¯), a >

− 1 q0 q1 −p .

,

Combining (3.42) and (3.43) it can be seen

that m satisfies condition (3.37). By application of Lemma 3.14 it follows that p0 ∈ KL(p0 , n ). We use the same construction to prove Lemma 3.10. A mixture obtained from Lemma 3.9 is modified and a set of mixtures is defined for which (3.36) and (3.37) hold. In what follows we point out the differences with the preceding proof. Proof of Lemma 3.10. Let δ ∈ ( 32 , 1),  > 0 and e  = (log 1 )−1 . As α = 0 we need to apply Lemma 3.9 with  = e 4 and a = log e14 in order to obtain a mixture m =

2 . Since we also need that h(p0 , p0 ∗ m(·; k , µ , w , σ ) such that h2 (m , p0 ∗ ψσ() ) . e 2

ψσ() ) . e  we choose σ() = e  δ in light of Lemma 3.7. Hence the number of support 2 3 2 a points k is bounded by σ() log e14 . e − δ log 1e , and (also Lemma 3.9) km − p0 ∗ 2

ψσ() k∞ . e 4− δ ≤ e , using that δ >

2 3.

Using the triangle inequality as in (3.38),

we now show that d(p0 , m ) . e  for the k · k∞,a - and Hellinger distance. Lemma 3.7 implies d(p0 , p0 ∗ ψσ() ) . e  for both distances. The term d(p0 ∗ ψσ() , p0 ∗ ψσ() ) can be bounded as in (3.39) and (3.40): by the assumption on the tails of p0 , P0 (Ac ) . e 4 and 2

. ψσ() (0)P0 (Ac ) . e 4− δ and hence d(p0 ∗ ψσ() , p0 ∗ ψσ() ) . e  As in the proof of Lemma 3.1 we assume that maxi=2,...,k µ,i − µ,i−1 ≤ 2σ()a , max{µ,1 + a , µ,k − a } ≤ σ()a , and mini w,i ≥ e 5 . Because of the heavy tails of

64

3. Bayesian Density Estimation with Location-Scale Mixtures

the Cauchy density, it is not necessary here to have at least one component with unit 5

4

variance. Next we define neighborhoods M = Bk (µ , e 1+ δ ), W = ∆k (w , e2 ) and 4

Σ = Hk [σ , σ +

e 1+ δ k

]. Let µ ∈ M , w ∈ W and σ ∈ Σ , and consider the mixture

m = m(·; k , µ, w, σ). Using Lemma 3.2, which can be applied with ψ being the Cauchy density, we find that h(p0 , m) ≤ h(p0 , m ) + h(m , m) . e  and that | p0 − m |. e  uniformly on A . 2

By the above assumptions on w and µ , the interval x − σ()(a + e 1+ δ ), x + 2  σ()(a + e 1+ δ ) contains at least one support point if x ∈ A . Since mini w,i ≥ e 5 and 5

w ∈ ∆k (w , e2 ), there is a positive constant R such that m(x) ≥ p0 m (x) 1 5 3 5 (2e  , 2e  ),

and

Z

e 5 Cp 1 & R 2 σ() π(1 + a2 )

2

≤ 1+e 4− δ −R for all x ∈ A . Since m has a component with weight in

p0 dP0 = m

Z A

Z Z Z  q1 p0 p0 1 1 2σ2 4− δ2 −R dP0 + dP0 ≤ 1+e  e−2|x| dx . dP0 + 5 2 + x2 m m πe  σ  c c A A A 

Since (3.36) and (3.37) hold for all (µ, w, σ) ∈ M × W × Σ , Lemma 3.14 implies that these mixtures are contained in KL(p0 , ). Lemma 3.14 (Ghosal and van der Vaart [21]). For every d > 0 there exists a constant d > 0 such that for all probability measures P and Q with 0 < h2 (p, q) < d P (p/q)d , p 1 1 1 p  P log . h2 (p, q) 1 + log+ + log+ P ( )d , (3.44) q d h(p, q) d q p 2 1 1 p 2 1 P log + log+ P ( )d . . h2 (p, q) 1 + log+ (3.45) q d h(p, q) d q Lemma 3.15. The following asymptotic formula for the Gamma function can be found in many textbooks, see for example Abramowitz and Stegun [1] (Online resource: http://www.math.sfu.ca/~cbm/aands/). √ 1 Γ(α) = 2πe−α αα− 2 eθ(α) , where 0 < θ(α)
0

j αj α 1 2  1+α √ e 1+α (j+1) (j + 1)− 1+α , 1+α 2π j αj α 1 1  1+α ≤ e 1+α (j+1)+ 12 (j + 1)− 1+α . 1+α



(3.47) (3.48)

Chapter 4

Deriving converge rates from the information inequality : misspecification in nonparametric regression models Abstract : In this chapter we look at various nonparametric Bayesian estimation problems within the information theoretic framework proposed by Zhang ([59],[60]). In particular, we show that his results can be extended to misspecified models. We give an alternative proof to Kleijn’s ([32]) asymptotic results for misspecified random-design regression models, which allows for an easy extension to fixed-design models.

4.1

Introduction

In statistical modeling it is usually assumed that the unknown distribution of the data is contained in the model. Very often however there is no strong evidence for this assumption. If the model does not contain the distribution that generated the data, it is said to be misspecified. In practice many models still perform well under misspecification, and it is of interest to formulate conditions under which this happens. In the context of Bayesian nonparametric statistics, general results for i.i.d. models were obtained by Kleijn and van der Vaart [32]. In this chapter we develop an alternative approach, based 65

66

4. Misspecification in nonparametric regression models

on recent work by Zhang ([59], [60]), who showed that general results regarding posterior convergence rates can be derived from the information inequality. Although the results in [59] and [60] are stated for models with i.i.d. observations, a reading of the proofs reveals that they easily extend to models with independent nonidentically distributed observations. This is worked out in section 4.2. In sections 4.3 and 4.4 we give various examples that illustrate the use of the information-theoretic bounds in (correctly specified) density estimation and regression models. An inequality for misspecified models is then derived in section 4.5. In section 4.6 this inequality is applied to nonparametric regression models. The conditions under which convergence rates can be found are comparable to those in Kleijn and Van der Vaart [32] for random design models. Moreover, we obtain new results for fixed design models. To a large extent we adopt the notation used by Zhang in [59], see also chapter 1. Note that Lemma’s 4.1 and 4.2, and Theorem 4.1 can be found in this paper, in an i.i.d.-setting. Some of the techniques used in section 4.4 can be found in Zhang [60], in particular the proof of Corollary 4.3. The prior Π is a probability distributionQon the parameter space Γ, dominated by a

sigma-finite measure ν. Let π(θ | X) =

R Qn

n i=1

i=1

pθ (Xi ) pθ (Xi )dΠ(θ)

denote the Bayesian posterior

with respect to Π. For the posterior π(θ | X)dΠ(θ) we write Π(θ | X). Integrals over Π are sometimes written as EΠ . For easy reference, we restate the following result from chapter 1. Lemma 4.1. Let w(θ, X) be a data-dependent density relative to Π. For any measurable function f : X n × Γ → R and all α ∈ R, Z Z 0 w(θ, X)f (θ, X)dΠ(θ) − α w(θ, X) log EX 0 ef (θ,X ) dΠ(θ) Z n o 0 ≤ log exp f (θ, X) − α log EX 0 ef (θ,X ) dΠ(θ) + DKL (wdΠ k dΠ).

4.2

(4.1)

Posterior convergence rates given independent observations

Suppose we observe independent observations X = (X1 , . . . , Xn ) with distributions P0,i , where the P0,i are assumed to have densities p0,i with respect to a measure µ on a sample space X . Throughout this chapter, (X , µ) will mostly be the real line equipped with the Lebesgue measure. For pθ = (pθ,1 , . . . , pθ,n ), we consider models Pn = {pθ | θ ∈ Γ} of densities with respect to µ. For example, in a fixed design regression model with normal errors, θ can be a regression function f contained in a function space Γ, pθ being a vector of normal densities with mean (f (x1 ), . . . , f (xn )). Note that the densities pθ,1 , . . . , pθ,n

4.2. Posterior convergence rates given independent observations

67

all depend on the same θ ∈ Γ. In i.i.d. models, pθ,1 , . . . , pθ,n are identical. The following result (see also [59], Theorem 2.1) is a direct consequence of Lemma 4.1. Lemma 4.2. Let X1 , . . . , Xn be independent with densities (p0,1 , . . . , p0,n ) = p0 ∈ Pn . For any α ∈ R, Qn pθ,i Z ρ n 1 1 X Re i=1 ( p0,i (Xi )) Q αEX Dρ (p0,i | pθ,i )dΠ(θ | X) ≤ EX log pθ,i n 0 ρ dΠ(θ) α n i=1 n EX 0 i=1 ( p0,i (Xi )) Z X n ρ p0,i 1 + EX log (Xi )dΠ(θ | X) + EX DKL (Π(· | X) k Π(·)). n p n θ,i i=1 Z

(4.2) Furthermore, if α = 1, Z EX

n

1 X Re D (p0,i | pθ,i )dΠ(θ | X) n i=1 ρ Z X n ρ p0,i 1 ≤ EX log (Xi )dΠ(θ | X) + EX DKL (Π(· | X) k Π(θ)). n pθ,i n i=1

Proof. We apply Lemma 4.1 with f = −ρ

Pn

(4.3)

p0,i pθ,i (Xi )

and w(θ, X) = π(θ | X), Pn and take the expectation over X. Note that DKL (p0 k pθ ) = i=1 DKL (p0,i k pθ,i ) and R R Qn Pn 1−ρ ρ Re DρRe (p0 | pθ ) = − log · · · i=1 p0,i (xi )pθ,i (xi )dµ(x1 ) · · · dµ(xn ) = i=1 Dρ (p0,i | i=1

log

pθ,i ). If α = 1, the last term in (4.2) is at most zero, which can be seen by applying Jensen’s inequality and interchanging the order of integration. To refine the preceding result, we define (following Zhang [59], p. 2185-2186) n X 1 p0,i (Xi ) λ Eπ w(θ) log + DKL (wdΠ k dΠ), n p n θ,i (Xi ) i=1

bλ (w) R

=

Rλ (w)

= Eπ w(θ)

(4.4)

n

1X λ DKL (p0,i k pθ,i ) + DKL (wdΠ k dΠ), n i=1 n

(4.5)

where λ is a positive constant and w can be any density with respect to Π. The right hand b1 (π(· | X)) + (1 − ρ)EX DKL (Π(· | X) k Π(·)) side of (4.3) can now be written as ρEX R R Pn (1−ρ) b1 (π(· | X)) − EX log p0 (Xi ) dΠ(θ | X). It is useful to consider a or EX R n

i=1

pθ (Xi )

combination of these forms. For constants γ ≥ 1 and λ0 =

λγ−1 γ−ρ

=

γ−1 γ−ρ

∈ [0, 1), (4.3)

can be written as Z n 1 X Re b1 (π(· | X)) − (γ − ρ)EX R bλ0 (π(· | X)). EX D (p0,i | pθ,i )dΠ(θ | X) ≤ γEX R n i=1 ρ (4.6) Our goal is to obtain a bound for each of the terms on the right hand side.

68

4. Misspecification in nonparametric regression models

b1 (π(· | X)). Note that if w does not depend on the First we derive a bound for EX R b1 (w) = R1 (w). It can be shown that R b1 (w) is minimal for w = π(· | X), and data, EX R that R1 (w) is minimized by q(θ) = exp{−

n X

Z DKL (p0,i k pθ,i )}/

exp{−

i=1

n X

DKL (p0,i k pθ,i )}dΠ(θ).

i=1

This result is a special case of Lemma 4.5 in section 4.5, with p∗i = p0,i for all i. Because b1 (π(· | X)) ≤ EX R b1 (w) = R1 (w) for any density w, we have EX R b1 (π(· | X)) ≤ inf EX R b1 (w) = inf R1 (w) EX R w w Z n   X 1 DKL (p0,i k pθ,i ) dΠ(θ). = − log exp − n i=1

(4.7)

Applying Markov’s inequality Ee−Z ≥ e−z P (Z < z) with z = n2 and the random Pn variable Z = i=1 DKL (p0,i k pθ,i ), it can be seen that for any  > 0 the last term in (4.7) is bounded by 2 −

n  1 1X log Π {θ ∈ Γ : DKL (p0,i k pθ,i ) < 2 } . n n i=1

b1 (π(· | X)) can be upper-bounded if we can control Consequently, EX R  Pn 1 Π {θ ∈ Γ : n i=1 DKL (p0,i k pθ,i ) < 2 } for small . The last term in (4.6) can be bounded using a bracketing argument, for which we need the following definition. Definition 4.1. For a constant δ ∈ (0, 1], let the upper-bracketing radius of a subset Γ0 ⊂ Γ be defined as 0

Z

rδ,n (Γ ) =

sup

n Y

θ∈Γ0 i=1

!δ pθ,i (xi )

n Y

n p1−δ 0,i (xi )dµ (x).

i=1

For i = 1, . . . , n, let the i-th coordinate radius be defined as 0

rδ,(i) (Γ ) =

Z 

δ sup pθ,i (xi ) p1−δ 0,i (xi )dµ(xi ).

(4.8)

θ∈Γ0

An -upper discretization of Γ is a covering {Γj } consisting of countably many subsets {Γj } such that rδ,n (Γj ) < 1 +  for each j. If this can be achieved with finitely many Γj , the upper-bracketing number Nub (Γ, ) of Γ is defined as the smallest number of subsets of upper-bracketing radius  and whose union contains Γ. Remark 1 For i.i.d. models all rδ,(i) are equal and we write rδ = rδ,(i) . If in addition δ = 1, (4.8) is the upper bracketing radius as used by Zhang in [59]. The upper bracketing

4.2. Posterior convergence rates given independent observations

69

radius of a subset Γ0 is bounded by the product of the coordinate radii, i.e. rδ,n (Γ0 ) ≤

n Y

rδ,(i) (Γ0 ).

(4.9)

i=1

Remark 2 In Lemma 4.2 it is assumed that p0 ∈ Pn , in which case it is convenient to choose δ = 1. In section 4.5 we extend Definition 4.1 to misspecified models, where it is of interest to consider δ < 1. Remark 3 The same Γ can be the parameter set of different models. Although these models can be covered by the same subsets Γj , the upper-bracketing radius may be different in each model. This is illustrated in section 4.4, where a C 1 class of bounded functions is used in different regression models. The last term in (4.6) can be bounded using the following argument, which is taken from Zhang ([59], Lemma A.2). From another application of Lemma 4.5 below, again with p∗i = p0,i for all i, it follows that for any density w with respect to Π and all λ0 > 0, 0

ˆ λ0 (w) ≥ − λ log R n

Z

n  1 X  p0,i exp − 0 log (Xi ) dΠ(θ). λ i=1 pθ,i

(4.10)

This holds in particular for the posterior π(θ | X) with respect to Π. It follows that for any 0 < δ ≤ 1, 0 < ρ < 1, γ > 1, and λ0 = 0 ˆ λ0 (π(· | X)) ≤ λ EX log −EX R n

1 ≤ log EX δn

γ−1 γ−ρ

> 0, we have

Z Y n  pθ,i i=1

p0,i

Z Y n  pθ,i i=1

p0,i

 X 1 log EX  Π(Γj ) ≤ δn j X 0 1 ≤ log EX Π(Γj )δλ δn j =

! 1/λ0 (Xi ) dΠ(θ) !δλ0 1/λ0 (Xi ) dΠ(θ)

sup

n Y pθ,i

θ∈Γj i=1

sup

p0,i

!1/λ0 δλ0  (Xi )

n Y pθ,i

θ∈Γj i=1

p0,i

(4.11)

!δ (Xi )

X 0 1 log Π(Γj )δλ rδ,n (Γj ). δn j

The second inequality follows from Jensen’s inequality, and in the fourth inequality we P P 0 0 used that δλ0 ∈ [0, 1) and therefore ( j aj )δλ ≤ j aδλ for all positive numbers {aj }. j By combination of (4.6), (4.7) and (4.11) we obtain the following result. Theorem 4.1. For arbitrary δ ∈ (0, 1], ρ ∈ (0, 1) and γ ≥ 1, let λ0 = be a sequence such that

n2n

γ−1 γ−ρ .

Let n → 0

→ ∞. Assume that there exist positive constants k1 and k2

70

4. Misspecification in nonparametric regression models

such that log

hX

i 0 Π(Γj )δλ rδ,n (Γj )

≤ k1 n2n ,

(4.12)

j n

Π {θ ∈ Γ :

 1X DKL (p0,i k pθ,i ) < 2n } n i=1

2

≥ e−k2 nn ,

(4.13)

where {Γj } is an arbitrary countable cover of Γ. Then Z EX

n  γ−ρ 1 X Re D (p0,i | pθ,i )dΠ(θ | X) ≤ k1 + γ(1 + k2 ) 2n . n i=1 ρ δ

(4.14)

The covering sets Γj may depend on n, which is the case for some of the examples in sections 4.3 and 4.4. Unless stated otherwise the sequence n and the constants δ, ρ, γ and λ0 are chosen as in the preceding theorem. The result above allows for the use of infinite covers, but the power δλ0 < 1 in (4.12) can be problematic in many applications. If Π(Γj ) cannot be calculated explicitly and 0

no (sharp) upper bound is available, all factors Πδλ (Γj ) have to be bounded by one. This restricts the result to situations where there is a finite cover, and typically leads to suboptimal rates. In special cases however, an optimal rate can still be derived from Theorem 4.1, for example for the beta-mixtures in section 4.3.2 and the Gaussian spline regression model in section 4.7. To see the problem that can occur, consider the problem of estimating a C α -density p0 based on an i.i.d. sample X1 , . . . , Xn . Given α > 0, we want to verify (4.12) for 2α

2n = n− 1+2α . If Γ1 , . . . , ΓN is an 2n -upper discretization of Γ, the left hand side of (4.12) becomes

1 n

log Nub (Γ, 2n ) + log(1 + 2n ), which should be at most a multiple of 2n . The

number Nub (Γ, ) is bounded by the bracketing1 number N[] (, P, L1 (µ)). The latter is also larger than the packing number N (, P, L1 (µ)) (van der Vaart and Wellner [56], p. 84). If P is a class of densities on a compact interval, the L1 -norm is bounded by a multiple of the supremum norm, and we have  1/α 1 log N (, P, L1 (µ)) . log N (, P, k · k∞ )  ,  where the last relation follows from Theorem 2.7.1. in [56]. Assuming that log Nub (Γ, ) 2α

is of the same order and plugging in  = 2n = n− 1+2α , we find that of order n 1 For

− 2α−1 1+2α

, which is larger than 2n = n

2α − 1+2α

1 n

log Nub (Γ, 2n ) is

.

functions l and u not necessarily contained in the model, the bracket [l, u] is defined as the

set {pθ ∈ P : l ≤ pθ ≤ u}. Given some metric d, the size of this bracket is defined as d(l, u). The -bracketing number N[] (, P, d) is defined as the minimal number of -brackets required to cover P.

4.3. Density estimation

4.3

71

Density estimation

To illustrate the use of Theorem 4.1 we first give examples regarding density estimation with i.i.d. observations. In an example with conjugate normal priors it is shown how parametric rates can be obtained. Also a nonparametric example is given.

4.3.1

A parametric example

The difference between (4.2) and (4.3) in Lemma 4.2 becomes important in parametric models, where, as pointed out in [59][remark 2.3], ”the choice of α = 1 would lead to a rate of O(log n/n)”. It is instructive to calculate the extra term Qn pθ Z (Xi ))ρ 1 i=1 ( Qn p0 pθ dΠ(θ) cρ,n (α) = EX log α n EX i=1 ( p0 (Xi ))ρ for a family of normal densities. Let X1 , . . . , Xn be i.i.d. with distribution P0 = N (θ0 , 1). Let Γ = R and Θ = {pθ | θ ∈ Γ} = {N (θ, 1) | θ ∈ R}. For θ we choose a normal prior with mean ν and unit variance. The posterior Π(θ | X) (with respect to the Lebesgue Pn measure on R) is the normal distribution with mean (ν + i=1 Xi )/(n + 1) and variance 1/(n + 1). Setting λ0 = 0 in (4.6), and adding the term cρ,n (α) from (4.2), it can be seen that Z αEX

DρRe (p0 | pθ )dΠ(θ | X)

ˆ 1 (π(· | X)) − 1 − ρ EX ≤ EX R n R

Since

φρσ (x



Z X n i=1

log

p0 (Xi )dΠ(θ | X) + cρ,n (α). pθ

(4.15)

− µ2 )dx = exp{−ρ(1 − ρ)(µ1 − µ2 )2 /(2σ 2 )} for all µ1 , µ2

µ1 )φ1−ρ σ (x

ρ(1−ρ) (θ 2

− θ0 )2 ; we show that the right ˆ 1 (π(· | X)). hand side is of order O(1/n). Using the first line of (4.7) we bound EX R R (µ1 −µ2 )2 1 1 exp{− 2 σ2 +τ 2 } for every pair of Because φσ (x − µ1 )φτ (x − µ2 )dx = √ 2 2 and σ > 0, the integrand on the left equals

2π(σ +τ )

normal densities, Z Z 2 n 1 1 −nDKL (p0 kpθ ) ˆ dΠ(θ) = − log e− 2 (θ−θ0 ) dΠ(θ) EX R1 (π(· | X)) ≤ − log e n n √    Z 1 1 1 (ν − θ0 )2 1 2πn−1 φ n1 (θ − θ0 )dΠ(θ) = log n + log 1 + + . = − log n 2n n n 2(n + 1) The first term contains an extra logarithmic factor, which cancels out with that in cρ,n (α), provided that α 6= 1. Applying Jensen’s inequality with the logarithm, we see that cρ,n (α) is at most 1 log n

Z EX

n Y pθ

!ρ !1−α

(1−α)n Z Z 1 log pρθ (x)p1−ρ (x)dx dΠ(θ) 0 n  Z 1 1 (ν − θ0 )2 φa−1 (θ − θ ) dΠ(θ) = − log(a + 1) − , 0 n n 2n 2n 1 + a−1 n

(Xi )

p i=1 0 p 1 = log 2πa−1 n n

dΠ(θ) =

72

4. Misspecification in nonparametric regression models

where an = (1 − α)ρ(1 − ρ)n. Finally, we bound the second term on the right hand side of (4.15). Z EX

 Z  n p0 (Xi ) 1X ¯ + 1 (θ2 − θ02 ) dΠ(θ | X) log dΠ(θ | X) = EX (θ0 − θ)X n i=1 pθ (Xi ) 2 !# " Pn Pn   2   ν + i=1 Xi ν + i=1 Xi ¯ 1 1 1 2 = EX X+ + − θ0 =O , θ0 − n+1 2 n+1 n+1 n

as the Xi are i.i.d N (θ0 , 1).

4.3.2

Beta mixtures

Let βk,j (·) denote the beta density β(x; j, k + 1 − j) =

Γ(k + 1) xj−1 (1 − x)k−j , Γ(j)Γ(k + 1 − j)

x ∈ [0, 1].

For an application of Theorem 4.1 to a nonparametric density estimation problem, consider the coarsened beta mixtures √



˜b(x; k, w) =

i k X

k X



i=1 j=(i−1) k+1

where



1 √ wi βk,j (x), k

w ∈ S√ k ,

(4.16)

k is assumed to be integer2 . Let 

wF,k =

F

1 √ k



 − F (0) , F

2 √ k



 −F

1 √ k



 , . . . , F (1) − F

k−1 √ k

!! ,

where F is given a Dirichlet process prior. In chapter 2 we showed that if the data come from an α-Lipschitz density p0 , with α ∈ (0, 1], the rate of posterior concentration around p0 is optimal apart from a logarithmic factor. For simplicity, let the dimension √ √ l = k be determined by a nonrandom sequence kn = ln → ∞. Inspecting the proof of Theorem 2.2 in chapter 2, it can be seen that if ln is of the order n1/(1+2α) , the assertion of this theorem still holds. We have a similar result for the expected posterior Renyi-entropy to p0 . Theorem 4.2. Let p0 be a strictly positive α-smooth density on [0, 1], for some α ∈ 1/(1+2α)

(0, 1]. Given n observations, let ln = (n/ log n)

be the dimension of the coarsened

mixture. Assume that the base measure of the Dirichlet process possesses a continuous, strictly positive density on [0, 1]. Then, for a sufficiently large constant M , Z 2α EX DρRe (p0 | ˜b(·; k, wF,√kn ))d(F | X1 , . . . , Xn ) ≤ M (n/ log n)− 1+2α . 2 See

section 2.5 for a definition in case



k is not integer.

4.3. Density estimation

73

This can be proved using Corollary 1.1 in Chapter 1, which appears as a special case of Theorem 4.1, with δ = 1, λ0 = 0 and i.i.d observations. Whereas the prior mass condition (4.13) can be handled as in chapter 2, the entropy condition (4.12) requires a somewhat different approach, as the covering and packing numbers used in chapter 2 are now upper-bracketing numbers. The following result gives a bound for the upperbracketing radius of an L1 -ball in the unit simplex. Lemma 4.3. For y ∈ Sl , let Sl (y, ) be the intersection of the (l − 1)-dimensional unit simplex in Rl and the L1 -ball of radius  around y. Then Z r1 (Sl (y, )) =

1

sup

l X

il X

0 w∈Sl (y,) i=1 j=(i−1)l+1

Proof. For every x ∈ [0, 1] and i = 1, . . . , l,

1 wi βl2 ,j (x)dx ≤ 1 + l . l

Pil

j=(i−1)l+1

βl2 ,j (x) ≤

Pl2

j=1

βl2 ,j (x) = l2 ,

by the binomial theorem. Consequently, Z

l X

1

sup

il X

0 w∈Sl (y,) i=1 j=(i−1)l+1

Z ≤ 0

1

l X

il X

i=1 j=(i−1)l+1

1 wi βl2 ,j (x)dx l

1 yi βl2 ,j (x)dx +  l

Z

1

max

il X

0 i=1,...,l j=(i−1)l+1

1 βl2 ,j (x)dx ≤ 1 + l. l

Proof of Theorem 4.2. We first verify condition (4.13). By Lemma 2.3, the approximating mixture ˜b satisfies kp0 − ˜b(·; l2 , wF0 ,l2 )k∞ . ln−α . Because the Kullback-Leibler ball with second moment used in chapter 2 is contained in the Kullback-Leibler ball in the present chapter, we can proceed as in (2.19) in the proof of Theorem 2.2:    Π {w ∈ Sln : DKL (p0 k ˜b(·; ln2 , w)) < 2n } & exp (−c1 ln log n) = exp −c1 n2n . Next we verify (4.12) for a cover of L1 -balls. According to Lemma 2.9, the L1 -packing number of Sln is bounded by (5/)ln −1 if  ≤ 1. Hence Sln can be covered with at most dn = (10ln /2n )ln −1 L1 -balls of radius 2n /ln . By the preceding lemma and the remark following Definition 4.1, log

hX j

  i  0 10ln Π(Γj )δλ (rδ,n (Γj )) < log dn (1 + C2n )n ≤ ln log +n log(1+C2n ) . n2n . 2n

74

4. Misspecification in nonparametric regression models

Remark The form of (4.12) suggests that it is beneficial to have pairwise disjoint sets 0

that have equal prior probabilities. In that case, the exact values Π(Γj )δλ can be used, whereas otherwise these numbers are hard to calculate. However, for the convergence rate in Theorem 4.2 this does not seem to make a difference. Edelsbrunner and Grayson [12] showed that any d-dimensional simplex can be subdivided into k d simplices with equal Lebesgue measure and similar shape characteristics, but it is hard to obtain a sharp bound for the upper-bracketing radii corresponding to these simplices.

4.4

Regression

Also in nonparametric regression, convergence rates can be derived from the information inequality, as was shown by Zhang in [60]. Instead of the logarithmic loss-function used for density estimation, he uses squared loss3 . In this section we show that similar results can be obtained if one uses logarithmic loss, as in density estimation. The results hold for random- as well as fixed design models, with either Gaussian or Laplacian noise. For simplicity, the variance of the noise is fixed throughout. The dependence on σ 2 is made explicit in the results, but when no confusion can result, it is suppressed in the notation.

4.4.1

Fixed design

Given fixed covariate values (x1 , . . . , xn ) ∈ Rn and an unknown function f0 : R → R, assume that the random variables Yi , i = 1, . . . , n, are independent with densities ψσ (yi − f0 (xi )), where ψσ is the normal or the Laplace density. For any measurable function f : R → R, let pf denote the vector of densities (pf,1 , . . . , pf,n ), with pf,i (yi ) = ψσ (yi − f (xi )). For f (xi ) and f0 (xi ) we use shorthand notation fi and f0,i , respectively. The model of interest is Pn = {pf | f ∈ F},

(4.17)

for a class F of measurable functions equipped with a prior Π. Applying Theorem 4.1 with pθ,i = pf,i and p0,i = pf0 ,i , we can find convergence rates for regression with Gaussian or Laplacian noise. The result for Gaussian models does not require any assumption on F, although the convergence rate depends on upper-bracketing numbers that we can only control if F is assumed to be uniformly bounded. For Laplacian models this is required in any case, in order to lower-bound the Renyi-entropy by the l2 -norm of f − f0 . 3 In

the proof of Theorem 1.3 on page 13, one can choose f (θ, X) = −ρ

Pn

i=1 (p0 (Xi )

− pθ (Xi ))2

4.4. Regression

75

Corollary 4.1. Let Y1 , . . . , Yn be independent with normal densities φσ (yi − f0,i ). If f0 ∈ F and X

0

2

Π(Fj )δλ rδ,n (Fj ) ≤ ek1 nn ,

(4.18)

j n 2  1 X fi − f0,i < 2n } 2σ 2 n i=1

Π {f ∈ F :

2

≥ e−k2 nn ,

(4.19)

for positive constants k1 and k2 , then ρ(1 − ρ) EY 2σ 2 n

Z X n i=1

2  γ−ρ fi − f0,i dΠ(f | Y ) ≤ k1 + γ(1 + k2 ) 2n . δ

(4.20)

Proof. Because the Kullback-Leibler divergence between normal distributions with means Pn 2 fi and f0,i and equal variance σ 2 is 2σ1 2 (fi − f0,i ) , the quantity n1 i=1 DKL (p0,i k pθ,i ) 2 Pn in (4.13) can be replaced by 2σ12 n i=1 fi − f0,i . The DρRe -entropy between these normal densities is

ρ(1−ρ) 2σ 2

2

(fi − f0,i ) , hence the convergence in Theorem 4.1 is relative to

the l2 norm. The preceding result holds for any class F, but unless we assume that F is uniformly bounded, it is hard to find sets Fj for which (4.18) holds. At the end of this section we give an example. If the noise is Laplacian, a result similar to Corollary 4.1 can be derived. In order to relate the Renyi-entropy to the l2 -norm it is necessary to assume that F is uniformly bounded. Corollary 4.2. Let Y1 , . . . , Yn be independent with Laplace densities ϕσ (yi − f0,i ). If f0 is contained in a uniformly bounded class F and X

0

2

Π(Fj )δλ rδ,n (Fj ) ≤ ek1 nn ,

(4.21)

j n

Π {f ∈ F :

 1 X (fi − f0,i )2 < 2n } σn i=1

2

≥ e−k2 nn

(4.22)

for positive constants k1 and k2 , then Z EY

n

1 X (fi − f0,i )2 dΠ(f | Y ) . 2n . σn i=1

(4.23)

Proof. Following the notation in Kleijn and Van der Vaart [32], we write H(Y, a) = |Y − a| − |Y | and Φ(a) = EH(Y, a) for a ∈ R and a random variable Y . If Y has distribution function G and a density g, it can be verified that Φ0 (a) = 2G(a) − 1 and Φ00 (a) = 2g(a). In particular, Φ0 (0) = 0 if the median of G is zero. Because Yi = f0,i + ei

76

4. Misspecification in nonparametric regression models

and ei is Laplace distributed, the Kullback-Leibler divergence

1 n

Pn

i=1

DKL (p0,i k pθ,i )

in (4.13) equals n

n

n

1 X 1 X 1 X Φ(fi − f0,i ) = (fi − f0,i )Φ0 (0) + (fi − f0,i )2 Φ00 (f˜i − f0,i ), (4.24) σn i=1 σn i=1 2σn i=1 for some point f˜i between fi and f0,i . Since Φ0 (0) = 0 and Φ00 = 2ϕσ is bounded away from zero and infinity by the boundedness of F, the Kullback-Leibler divergence is Pn 1 2 equivalent to σn i=1 (fi − f0,i ) . Hence (4.22) implies (4.13). Using a Taylor expansion as in (4.24) and the fact that Φ00 (x) is at least 2ϕσ (2K) by the boundedness of F, we Pn can write the Renyi-entropy n1 i=1 DρRe (p0,i | pθ,i ) in (4.14) as n



1X ρ log E exp{− H(ei , fi − f0,i )} n i=1 σ ≥−

n n o 1X ρ ρ2 log E 1 − H(ei , fi − f0,i ) + 2 e2ρK/σ H 2 (ei , fi − f0,i ) n i=1 σ 2σ

(4.25)

n

n o 1X 2ρ ρ2 ≥− log 1 − ϕσ (2K)(fi − f0,i )2 + 2 e2ρK/σ (fi − f0,i )2 n i=1 σ 2σ &

1 (fi − f0,i )2 , σn

where the last inequality holds if

ρ2 2ρK/σ 2σ 2 e


0, let N1 =

4aK 0 

and N2 =

Lk Ml

4K 

− 1, and define

   , −a + k , 0 2K 2K 0    = − K + (l − 1) , −K + (l + 1) . 2 2

=



− a + (k − 1)

(k = 1, . . . , N1 ) (l = 1, . . . , N2 )

(4.27)

4.4. Regression

77

For vectors aj ∈ {1, . . . , N2 }N1 we can then define  Fj = f ∈ F | f |Lk ⊂ Maj (k) , k = 1, . . . , N1 .

(4.28)

An example is given in figure 4.1. From the definition of Lk , Ml and Fj it follows that Fj is empty unless aj(k) − aj(k−1) ∈ {−2, −1, 0, 1, 2}, for k = 2, . . . , N1 . Consequently the model can be covered with at most N2 5N1 −1 sets of the form (4.28). Note that the intervals Mk - and therefore also the sets Fj - overlap: a given function f ∈ F is typically contained in more than one Fj . In figure 4.1 for example, Fj is such that the interval corresponding to L6 = [2, 3] is M5 = [0, 1], although on [2, 3], the displayed function is also contained in M6 = [1/2, 3/2]. R Qn To bound the upper-bracketing radii supf ∈Fj i=1 ψσ (yi −fi )dy of these sets, let Fj be determined by Maj (1) , . . . , Maj (N1 ) , where Maj (k) = [lj,k , uj,k ] and lj,k − uj,k = , k = 1, . . . , N1 (in what follows, it is not important whether aj(k) −aj(k−1) ∈ {−2, −1, 0, 1, 2}). For yi contained in (−∞, lj,i ), [lj,i , uj,i ] or (uj,i , ∞), we have that supf ∈Fj ψσ (yi − fi ) is bounded by respectively ψσ (yi −lj,i ), ψσ (0) or ψσ (yi −uj,i ). Therefore the ith coordinate radius is bounded by Z

lj,i

Z

uj,i

ψσ (yi − lj,i )dyi + −∞

Z



ψσ (yi − uj,i )dyi = 1 + ψσ (0).

ψσ (0)dyi + lj,i

(4.29)

uj,i n

As a consequence of (4.9), we find that r1,n (Fj ) ≤ (1 + ψσ (0)) . Consequently, F can n

be covered by N2 5N1 −1 sets whose upper-bracketing radius is at most (1 + ψσ (0)) . We can now verify conditions (4.18) and (4.21) in Corollaries 4.1 and 4.2 in the same way as we verified the entropy condition in the example of the beta densities in section 4.3.2. Suppose that the prior mass conditions (4.19) or (4.22) are satisfied. If we let  = 2n in (4.27), log r1,n (Fj ) ≤ ψσ (0)n2n . Substituting this in (4.18) or (4.21), it follows that n has to be such that log

hX j

i h i 0 1 1 Π(Fj )λ r1,n (Fj ) ≤ log N2 5N1 −1 max r1,n (Fj ) . log 2 + 2 +ψσ (0)n2n ≤ k1 n2n n n

for a constant k1 > 0. Provided that the prior puts enough mass on Kullback-Leibler balls around f0 , n is of the order n−1/4 . Note that this argument only depends on the bounds on the functions in F and their derivatives, and does not use the actual form of 0

ψ. The numbers Π(Fj )λ in the preceding display are bounded by one, which leads to a suboptimal rate. This is in fact the problem pointed out at the end of section 4.2. In section 4.7 we give an example were we can control these numbers, and find the optimal rate.

78

4. Misspecification in nonparametric regression models

K

1

0

M2

M1 −a

0

1

2

a

Figure 4.1: An example of a subset Fj ⊂ F, and a function contained in it. For a = 3, K = 2, K 0 = 1,  = 1, N1 =

4aK 0 

= 6, N2 =

4K 

− 1 = 7 and aj = (4, 5, 6, 6, 6, 5), let

Fj be defined by (4.28). It is the set of all f ∈ F that lie in the shaded region: for x contained in L1 , . . . , L6 , f (x) is contained in respectively M4 , M5 , M6 , M6 , M6 and M5 .

4.4.2

Random design

In random design models, the quantities

1 n

Pn

i=1

DKL (p0,i k pθ,i ) and

1 n

Pn

i=1

DρRe (p0,i |

pθ,i ) appearing in Theorem 4.1 become averages terms. A complication  of identical ρ however is that the Renyi-entropy − log EX,Y ppθ0 (X, Y ) is not always easily interpretable, as the expectation over X appears inside the logarithm. Using properties of the logarithmic moment generating function, Zhang [60] showed that this is bounded by a multiple of the L2 -norm between the regression functions, provided that the model is uniformly bounded. We restate this result in the current logarithmic- loss approach, and add a similar result for models with Laplacian noise. Suppose that we observe independent random vectors (X1 , Y1 ), . . . , (Xn , Yn ) in X ×R with joint density4 p0 , such that Yi = f0 (Xi )+ei , for independent errors ei and covariates 4 In

more conventional notation, p0 would be the density of a single vector (Xi , Yi ), but for consistency

with the notation in the fixed design case, we let p0 denote the density of n i.i.d. copies.

4.4. Regression

79

Xi . For simplicity, we assume that the Xi ’s have a known density m. We write m(x) for Qn Qn the joint density i=1 m(xi ) of X1 , . . . , Xn , and ψσ (y − f (x)) = i=1 ψσ (yi − f0 (xi )) for the joint density of Y1 , . . . , Yn given X = x. Consequently X and Y have a joint density p0 (x, y) = m(x)ψσ (y − f0 (x)) with respect to the Lebesgue measure on R2n . In this i.i.d setting, the averages of Renyi-entropies and Kullback-Leibler divergences appearing in Theorem 4.1, are averages of identical terms; we can therefore take the expectation over a single pair (Xi , ei ), which is denoted EX0 ,e0 . For functions f : X → R, let pf (x, y) = m(x)ψσ (y − f (x)) denote the density of a vector (X, Y ) such that Yi = f (Xi ) + ei ; in particular pf0 = p0 . For these pf , let Pn = {pf | f ∈ F}, as in (4.17). For a prior Π on a class of measurable functions F we find the following results for the posterior distribution on F. Corollary 4.3. Let (X1 , Y1 ), . . . , (Xn , Yn ) have density pf0 (x, y) = m(x)φσ (y − f0 (x)). If f0 is contained in a uniformly bounded class F and if there are positive constants k1 and k2 such that log

hX

i 0 Π(Fj )δλ rδn (Fj )

≤ k1 n2n ,

(4.30)

j

Π {f ∈ F : then

1 EX,Y σ2

2  1 EX0 f (X0 ) − f0 (X0 ) < 2n } 2σ 2 Z

2

≥ e−k2 nn ,

2 EX0 f (X0 ) − f0 (X0 ) dΠ(f | X, Y ) . 2n .

(4.31)

(4.32)

Proof. Again we apply Theorem 4.1. Since for i = 1, . . . , n, Yi = f0 (Xi ) + ei and the errors ei are N (0, σ 2 ) distributed and independent of Xi , n 1 2 2 o 1 DKL (pf0 k pf ) = EX0 ,e0 − 2 Y0 − f0 (X0 ) + 2 Y0 − f (X0 ) 2σ  2σ  1 1 2 2 = EX0 ,e0 − 2 e0 + 2 ((f (X0 ) − f0 (X0 )) − e0 ) 2σ 2σ 1 2 = EX0 (f (X0 ) − f0 (X0 )) . 2σ 2 Pn Hence the quantity n1 i=1 DKL (p0,i k pθ,i ) in (4.13) can be replaced by 2σ1 2 EX0 f (X0 )− 2 f0 (X0 ) . The Renyi-entropy DρRe (p0 | pθ ) in (4.14) equals n ρ o ρ 2 2 − log EX0 ,e0 exp − 2 (Y0 − f (X0 )) + 2 (Y0 − f0 (X0 )) 2σ n 2σ o ρ ρ 2 = − log EX0 ,e0 exp − 2 ((f0 − f )(X0 ) + e0 ) + 2 e20 2σ   2σ ρ(1 − ρ) 2 = − log EX0 exp − (f (X0 ) − f0 (X0 )) . 2σ 2 Using the fact that F is uniformly bounded it can be shown that this is larger than a multiple of the L2 -norm. The following inequalities are taken from Proposition 1.2

80

4. Misspecification in nonparametric regression models

in [60](Appendix I). Let C = 2ρ(1 − ρ)K 2 /σ 2 be a bound on W = ρ(1 − ρ)(f (X0 ) − f0 (X0 ))2 /(2σ 2 ). Because the function t 7→ (e−t + t − 1)/t is strictly increasing and smaller than one,   −W e +W −1 W log Ee−W ≤ Ee−W − 1 = −EW + E W   e−C + C − 1 1 − e−C ≤− 1− EW = − EW. C C

(4.33)

Hence   2 ρ(1 − ρ) f (X ) − f (X ) − log EX0 exp − 0 0 0 2σ 2  f (X ) − f (X ) 2   0 0 0 ≥ 1 − exp −2ρ(1 − ρ)K 2 /σ 2 EX0 2K 2 1 ≈ 2 EX0 f (X0 ) − f0 (X0 ) , σ if ρ is close to zero or one. A similar result holds when the error distribution is Laplacian. Corollary 4.4. Let (X1 , Y1 ), . . . , (Xn , Yn ) be have density pf0 (x, y) = m(x)ϕσ (y − f0 (x)). Let f0 be contained in a class F such that | f |≤ K for all f ∈ F, and log

hX

0

Π(Fj )δλ (rδ (Fj ))n

i

≤ k1 n2n ,

(4.34)

j

Π {f ∈ F :

 1 EX0 (f (X0 ) − f0 (X0 ))2 < 2n } σ

2

≥ e−k2 nn ,

for positive constants k1 and k2 . Then Z 2 1 EX,Y EX0 f (X0 ) − f0 (X0 ) dΠ(f | X, Y ) . 2n . σ

(4.35)

(4.36)

Proof. To show that the Kullback-Leibler ball is contained in an L2 -ball, (4.24) can be used, with fi and f0,i replaced by f (X0 ) and f0 (X0 ), respectively, and the average over the xi ’s replaced by expectation over X0 . By a similar modification of (4.25), we find that the Renyi-entropy is  ρ 1 − e−C ρ − log EX0 ,e0 − H(e0 , (f − f0 )(X0 )) ≥ EX0 ,e0 H(e0 , f (X0 ) − f0 (X0 )) σ C σ 1 & EX0 (f (X0 ) − f0 (X0 ))2 , σ where C =

2ρ σ K

and the first inequality follows from (4.33).

4.5. Misspecification

81

Example (continued) Let F be as in (4.26), and consider the covering sets Fj defined by (4.28). If we replace the xi by random variables Xi and take the expectation over them, the upper-bracketing radii can be bounded as in the preceding section. Analogous to (4.29), we find

r1,(i) (Fj ) =

N1 Z X k=1

=

4.5

sup ψσ (yi − f (xi ))dyi dxi f ∈Fj

Lk

N1 Z X k=1

Z m(xi )

(4.37)  1 + ψσ (0) m(xi )dxi = 1 + ψσ (0) .

Lk

Misspecification

The results presented sofar are valid for any underlying distribution P0 , but if Pn inf θ∈Γ n1 i=1 DKL (p0,i k pθ,i ) > 0 the stated bounds stay away from zero as n →  Pn ∞, because Π {θ ∈ Γ : n1 i=1 DKL (p0,i k pθ,i ) < 2n } in (4.13) is zero for sufficiently large n. For i.i.d. observations, Kleijn and van der Vaart [32] showed that under certain conditions, the posterior concentrates around the point in the model with the smallest Kullback-Leibler divergence relative to p0 . Using straightforward extensions of the results in section 4.2, similar results can be obtained for misspecified models with independent non-identically distributed observations. Given θ∗ = Pn argminθ∈Γ n1 i=1 DKL (p0,i k pθ,i ), let p∗ = (pθ∗ ,1 , . . . , pθ∗ ,n ). For p1 , p2 ∈ Pn , ρ ∈ (0, 1) p ρ Re . Applying Lemma 4.1 with (p1,i | pi,2 ) = − log Ep0,i p2,i and i = 1, . . . , n define Dρ,p 0,i 1,i Pn p∗ i f (θ, X) = −ρ i=1 log pθ,i (Xi ) and w(θ, X) = π(θ | X) we obtain a misspecified analogue of Lemma 4.2. Lemma 4.4. Let X1 , . . . , Xn be independent with densities p0,1 , . . . , p0,n , such that Pn inf pθ ∈Pn i=1 DKL (p0,i k pθ,i ) > 0. We have the following bound for the posterior concentration around p∗ .



1 EX n ≤

Z log EX 0

n Y pθ,i i=1

p∗i

ρ (Xi 0 ) dΠ(θ | X) = EX

1 ρ EX DKL (Π(· | X) k Π(θ)) + EX n n

Z X n i=1

Z

log

n

1 X Re D (p∗ | pθ,i )dΠ(θ | X) n i=1 ρ,p0,i i p∗i (Xi )dΠ(θ | X) pθ,i (4.38)

For a more refined bound on the right hand side, we extend the definition of Rλ and bλ to misspecified models with independent non-identically distributed observations. R

82

4. Misspecification in nonparametric regression models

Let ∗

bp (w) R λ ∗ Rλp (w)

= =

1 n

Z

1 n

Z

w(θ)

n X

log

i=1

λ p∗i (Xi ) dΠ(θ) + DKL (wdΠ k dΠ), pθ,i (Xi ) n

(4.39)

p∗i λ (x)dΠ(θ) + DKL (wdΠ k dΠ). pθ,i n

(4.40)

w(θ)Ep0,i log

Again we use constants γ ≥ 1, ρ ∈ (0, 1) and λ0 =

λγ−1 γ−ρ

=

γ−1 γ−ρ

∈ [0, 1), and rewrite

(4.38) as Z EX

n

∗ ∗ 1 X Re bp (π(· | X)) − (γ − ρ)EX R bp 0 (π(· | X)). Dρ,p0,i (p∗i | pθ,i )dΠ(θ | X) ≤ γEX R 1 λ n i=1 (4.41)

The following lemma, which is a straightforward extension of Proposition 5.1 in [59], is b1 (π(· | X)). useful to bound the term EX R Lemma 4.5. Let

Qn ( i=1 pθ,i (Xi ))1/λ π1/λ (θ | X) = R Qn ( i=1 pθ,i (Xi ))1/λ dΠ(θ)

be the generalized posterior relative to a prior Π, and let n

qe1/λ (θ) = exp{−

p∗ 1X Ep0,i log i }/ λ i=1 pθ,i

n

Z exp{−

p∗ 1X Ep0,i log i }dΠ(θ). λ i=1 pθ,i

Then Z n   X ∗ ∗ p∗i bp (w) = − λ log exp − 1 bp (π1/λ (· | X)) = inf R R log (X ) dΠ(θ), i λ λ w n λ i=1 pθ,i Z n  1X ∗ ∗ p∗  λ Ep0,i log i dΠ(θ), Rλp (e q1/λ (·)) = inf Rλp (w) = − log exp − w n λ i=1 pθ,i

(4.42) (4.43)

where the infimum is taken over the set of all densities with respect to Π.

Proof. Applying lemma 4.1 with α = 0 and f (θ, X) = − λ1 − log

Z Y n  pθ,i i=1

p∗i

Pn

i=1

log

p∗ i pθ,i (Xi )

we find that

 λ1 dΠ(θ) (Xi )

≤ DKL (wdΠ k dΠ) +

1 λ

Z w(θ, X)

n X

log

i=1

p∗i n bp∗ (Xi )dΠ(θ) = R (w) pθ,i λ λ Qn

for all densities w w.r.t. Π. Plugging in w(θ, X) =

(

i=1

1/λ

pθ,i (Xi ))

, it can be ( i=1 pθ,i (Xi )) dΠ(θ) seen that for the (generalized) Bayesian posterior, we have equality in the last display. R Qn

1/λ

4.5. Misspecification

83

In a similar way, we can apply Lemma 4.1 with f (θ, X) = − λ1

Pn

i=1

Ep0,i log

p∗ i pθ,i (xi )

to obtain (

) n 1X p∗i − log exp − Ep log dΠ(θ) λ i=1 0,i pθ,i Z n X 1 n bp∗ p∗ ≤ DKL (wdΠ k dΠ) + w(θ, X) (w) Ep0,i log i dΠ(θ) = R λ pθ,i λ λ i=1 Z

for all densities w w.r.t. Π, and plugging in w = qe1/λ , (4.43) is obtained. Using the preceding lemma we can also extend the inequalities in (4.10) and (4.11) to misspecified models. First we define the upper-bracketing radius for these models. Definition 4.2. The upper-bracketing radius under misspecification of a subset A ⊂ Γ is defined as ∗ rδ,n (A)

Z  =

Qn δ Y n supθ∈A i=1 pθ,i (xi ) Qn p0,i (xi )dµn (x). ∗ i=1 pi (xi ) i=1 ∗

bp 0 (π(· | X)) in (4.41) is bounded Analogous to (4.11) we find that the term −(γ − ρ)EX R λ by X 0 1 log Π(Γj )δλ EX δn j

sup

n Y pθ,i

∗ θ∈Γj ∈ i=1 pi

!δ (Xi )

=

X 0 1 ∗ log Π(Γj )δλ rδ,n (Γj ). δn j

We have proved the following result. Theorem 4.3. For arbitrary δ ∈ (0, 1], ρ ∈ (0, 1) and γ ≥ 1, let λ0 = be a sequence such that

n2n

γ−1 γ−ρ .

Let n → 0

→ ∞. Assume that there exist positive constants k1 and k2

such that log

hX

i 0 ∗ Π(Γj )δλ rδ,n (Γj )

≤ k1 n2n ,

(4.44)

j n

Π {θ ∈ Γ :

 1X p∗ Ep0,i log i < 2n } n i=1 pθ,i

2

≥ e−k2 nn .

(4.45)

Then Z EX

n  1 X Re γ−ρ Dρ,p0,i (p∗i | pθ,i )dΠ(θ | X) ≤ k1 + γ(1 + k2 ) 2n . n i=1 δ

(4.46)

84

4. Misspecification in nonparametric regression models

4.6

Regression under misspecification of the error distribution

With the preceding theorem we can now consider the regression models of section 4.4 under misspecification. A regression model is misspecified if, for example, the assumed error distribution differs from the actual distribution. For normal and Laplace regression with random design, Kleijn and van der Vaart [32] show that under certain conditions on this unknown error distribution, the posterior concentrates around f0 . At the same time the model can also be misspecified in the sense that f0 ∈ / F. For this case, it is shown in [32] that the posterior concentrates around f ∗ , the Kullback-Leibler ”projection” of f0 on F. In section 4.6.1 and 4.6.2 we obtain results for normal and Laplace regression with fixed design, and in section 4.6.3 we give their analogues for random design models. In Gaussian models, it is required that the distribution of the errors ei has subGaussian tails with scale factor σ, which is the case if for any number x, Eex ei ≤ eσ

2

x2 /2

.

In Laplace models it is necessary that the error distribution has a density that is bounded away from zero and infinity on a sufficiently large interval. Similar to results for correctly specified models in section 4.4, we find entropy conditions that depend on the number of subsets of a certain upper-bracketing radius, covering the model. In misspecified models we can use the same covers as before, but only under additional assumptions we can bound the upper-bracketing radii under misspecification.

4.6.1

Fixed design normal regression

Let Yi = f0,i + ei be independent random variables with densities p0,i (yi ) = g(yi − f0,i ). Our goal is to estimate f0 using the normal model Pn = {pf | f ∈ F} = {(φσ (y1 − f (x1 )), . . . , φσ (yn − f (xn ))) | f ∈ F}, for a class F of measurable functions. So instead of the true (unknown) error density g of the ei s, we use the normal density; if g is indeed different from φσ , p0 will usually differ from pf0 , even when f0 ∈ F. Because Ep0 log

p0 pf0

does not depend on f , minimizing

DKL (p0 k pf ) over F is equivalent to minimizing Ep0 log

pf0 pf

. If we assume that g has

zero mean and finite second moment, the latter quantity is n n 1 X  1 X  2 2 E (Y − f ) − (Y − f ) = E (fi − f0,i + ei )2 − e2i i i i 0,i 2 2 2σ i=1 2σ i=1 n 2 1 X = fi − f0,i , 2 2σ i=1

(4.47)

i.e. DKL (p0 k pf ) is minimized by the f ∈ F with the smallest (empirical) l2 distance to f0 . We assume that there is a unique minimizer f ∗ , which is the case if F is closed and

4.6. Regression under misspecification of the error distribution

85

convex. The convexity of F also implies that for every f ∈ F, ft = t f + (1 − t) f ∗ ∈ F. From the minimizing property of f ∗ it then follows that the right derivative of the Pn Pn function t 7→ n1 i=1 (ft (xi ) − f0,i )2 is nonnegative at t = 0, i.e. that i=1 (fi − fi∗ )(fi∗ − f0,i ) ≥ 0 for every f ∈ F. If f ∗ lies in the interior of F, also the left derivative is defined, and we have equality. If f ∗ = f0 this is trivially the case. Applying Theorem 4.3 we obtain the following results. As in the corresponding result in section 4.4.1 for the correctly specified normal regression model, the convergence is with respect to the l2 -norm, except for f0 being replaced by f ∗ . Theorem 4.4. Let Y1 , . . . , Yn be independent with densities g(yi − f0,i ), where the error density g is assumed to have zero mean and sub-Gaussian tails with scale factor σ. Let F be a closed and convex class of measurable functions, such that either f0 ∈ F or f ∗ lies in the interior of F. If for positive constants k1 and k2 X

0

2

∗ Π(Fj )δλ rδ,n (Fj ) ≤ ek1 nn ,

j n

Π {f ∈ F :

2  1 X fi − fi∗ < 2n } 2 σ n i=1

2

≥ e−k2 nn ,

then ρ(1 − ρ) EY 2σ 2 n

Z X n i=1

 2 γ−ρ k1 + γ(1 + k2 ) 2n . fi − fi∗ dΠ(f | Y ) ≤ δ

(4.48)

When |f0 | ≤ K and |f | ≤ K for all f ∈ F, then there exists a constant k1 such that for all δ ∈ (0, 1], ∗ log rδ,n (Fj ) ≤ k1 n2n ,

(4.49)

where Fj is defined as in (4.28) with  = 2n .

Proof. Because Yi = f0,i + ei and the errors have zero mean, the Kullback-Leibler Pn p∗ i divergence n1 i=1 Ep0,i log pθ,i in (4.45) can be written as n n n 2 2 o 2 2 o 1 X 1 Xn ∗ ∗ E e −(f −f ) − e −(f −f ) = f −f − f −f . e i i 0,i i 0,i i 0,i 0,i i i i 2σ 2 n i=1 2σ 2 n i=1

  Pn − f0,i (fi − fi∗ = 0 for all f ∈ F, this equals 2σ12 n i=1 (fi − fi∗ )2 . To P n Re obtain (4.48) we need to show that the Renyi-entropy n1 i=1 Dρ,p (p∗i | pθ,i ) in (4.46) 0,i

Since

Pn

∗ i=1 (fi

is larger than the by the l2 -norm. By the assumption that the errors have sub-Gaussian

86

4. Misspecification in nonparametric regression models

tails with scale factor σ,



n n ρ  o 2 2 1X ρ ρ log Eei exp fi∗ − f0,i − 2 fi − f0,i + 2 fi − fi∗ ei 2 n i=1 2σ 2σ σ

≥−

n n ρ  o 2 2 1X ρ ρ2 ∗ ∗ 2 log Eei exp f − f − f − f + f − f 0,i i 0,i i i i n i=1 2σ 2 2σ 2 2σ 2

≥−

n n n ρ(1 − ρ)  o ρ(1 − ρ) 1 X 2 1X ∗ 2 = log Eei exp − f − f fi − fi∗ , i i 2 2 n i=1 2σ 2σ n i=1

using that

Pn

∗ i=1 (fi

  − f0,i (fi − fi∗ = 0 in the second inequality.

To prove (4.49), assume that |f0 | ≤ K and |f | ≤ K for all K. If it is assumed that Qn Qn for every y ∈ Rn there is an fy ∈ F such that supf ∈Fj i=1 φσ (yi − fi ) = i=1 φσ (yi − fy (xi )), it follows that

∗ rδ,n (Fj )

Z

n Y

g(yi − f0,i )

supf ∈Fj Qn

Qn

i=1

φσ (yi − fi )



dy φσ (yi − fi∗ )  δ Z Y n φσ (yi − fy (xi )) g(yi − f0,i ) = dy φσ (yi − fi∗ ) Rn i=1 Z n n n δ X o Y ∗ 2 2 (y − f ) − (y − f (x )) g(yi − f0,i ) dy = exp i i y i i 2σ 2 i=1 Rn i=1 Z n n n δ X o Y ∗ ∗ 2 = exp 2(y − f )(f (x ) − f ) − (f (x ) − f ) g(yi − f0,i ) dy i 0,i y i y i i i 2σ 2 i=1 Rn i=1 Z n n n δ X o Y ∗ ∗ 2 2y (f (x ) − f ) − (f (x ) − f ) g(yi ) dy. = exp i y i y i i i 2σ 2 i=1 Rn i=1 =

Rn i=1

i=1

(4.50) In the fourth equality we used once more that

Pn

i=1 (fy (xi )

− fi∗ )(fi∗ − f0 ) = 0 for all

fy ∈ F. From the first two lines it can be seen that the expression above does not decrease if fy (xi ) is replaced by lj,i , yi or uj,i , when yi is contained in respectively (−∞, lj,i ), [lj,i , uj,i ] or (uj,i , ∞). Consequently,

∗ rδ,n (Fj )

Z n Y ≤ { i=1

lj,i

n δ o δ g(yi ) exp 2 (lj,i − fi∗ )yi − 2 (fi∗ − lj,i )2 dyi σ 2σ −∞ Z uj,i n δ o δ g(yi ) exp 2 (yi − fi∗ )yi − 2 (fi∗ − yi )2 dyi + σ 2σ lj,i Z ∞ n δ o δ + g(yi ) exp 2 (uj,i − fi∗ )yi − 2 (fi∗ − uj,i )2 dyi } . σ 2σ uj,i

(4.51)

4.6. Regression under misspecification of the error distribution

87

Assuming that (fi∗ − uj,i )2 ≥ (fi∗ − lj,i )2 , the ith term in this product is bounded by Z  δ  δ  2δ g(yi ) exp 2 (uj,i − fi∗ )yi dyi + g(0) exp 2 K 2 (uj,i − lj,i ) exp − 2 (fi∗ − lj,i )2 2σ σ σ 2  δ  δ 2δ ≤ exp − 2 (fi∗ − lj,i )2 + 2 (fi∗ − uj,i )2 + g(0) exp 2 K 2 2n , 2σ 2σ σ (4.52) by the assumption that g is sub-Gaussian with scale factor σ. This can be further bounded by  2δ  2δ  2δ  δ exp 2 |fi∗ − uj,i | 2n + g(0) exp 2 K 2 2n ≤ exp 2 K 2n + g(0) exp 2 K 2 2n . σ σ σ σ The same bound holds if (fi∗ − uj,i )2 < (fi∗ − lj,i )2 . We conclude that there is a constant k1 > 0 such that for any δ ∈ (0, 1],     ∗ log rδ,n (Fj ) ≤ n log exp 2δK 2n /σ 2 + g(0) exp 2δK 2 /σ 2 2n ≤ k1 n2n .

Note that the assumption of sub-Gaussian tails is stronger than Kleijn’s assumption that EeM |ei | < ∞ for all M . In that case we find that the first term on the left in (4.52) is bounded by E exp{2δK|e0 |/σ 2 } ≤ Lδ for some constant L. It follows that  n 2 2δ ∗ rδ,n (Fj ) ≤ Lδ + g(0)e σ2 K  .

(4.53)

The term Lδ should be sufficiently close to 1. If L > 1, we therefore have to let δ = δn decrease to zero. Since for a, b > 0, log(a + b) ≤ log a + b/a,   2δn 2 2 −δ  ∗ log rδ,n (Fj ) ≤ n δ log L + g(0) exp K n L . σ2 But as there is a factor δ −1 on the right hand side of (4.48), δn has to decrease at a slower rate than 2n . This suggests that under the assumption that EeM |e0 | < ∞ for all M > 0, a suboptimal rate is obtained.

4.6.2

Fixed design Laplace regression

Recall that if the errors have density g and a 7→ Φ(a) is defined by  00 Φ(a) = Eei |ei − a| − |ei | , then Φ (a) = 2g(a). In contrast to the assumption in Corollary 4.2, g does not have to be the Laplace density. Instead we only assume that g is bounded away from zero and infinity on [−2K, 2K], where K is such that |f0 (x)| ≤ K and |f (x)| ≤ K for all x and f ∈ F. Also f0 is not necessarily contained in the model. By definition, f ∗ minimizes n n n n o  1X 1 X 1 X DKL (p0,i k pf,i ) = Eei |ei − (fi − f0,i )| − |ei | = Φ fi − f0,i n i=1 2σn i=1 2σn i=1

88

4. Misspecification in nonparametric regression models

over F. If F is assumed to be convex, ft = t f + (1 − t) f ∗ ∈ F for all f ∈ F. From the minimizing property of f ∗ it then follows that the right derivative of the function  Pn t 7→ i=1 Φ fi∗ − f0,i is nonnegative at t = 0. If, in addition, f ∗ lies in the interior of F, or if f ∗ = f0 and the error distribution has zero median, this derivative is zero at t = 0, i.e. n X

  fi − fi∗ Φ0 fi∗ − f0,i = 0.

(4.54)

i=1

Theorem 4.5. For a positive constant K, let F be a class of measurable functions that is closed, convex and uniformly-bounded by K. Let Y1 , . . . , Yn be independent with densities g(yi − f0,i ), where the errors have zero median, and a density g that is bounded away from zero and infinity on [−2K, 2K]. It is assumed that either f ∗ lies in the interior of F, or that the error distribution has zero median and f ∗ = f0 . If f0 is such that |f0 | ≤ K, and if for positive constants k1 and k2 X

Π {f ∈ F :

1 σn

j n X

0

2

∗ Π(Fj )δλ rδ,n (Fj ) ≤ ek1 nn ,

 (fi − fi∗ )2 < 2n }

2

≥ e−k2 nn ,

(4.55)

i=1

then Z EY

n

1X (fi − fi∗ )2 dΠ(f | Y ) . 2n . n i=1

(4.56)

∗ Moreover, there exists a constant k1 such that for all δ ∈ (0, 1], log rδ,n (Fj ) ≤ k1 n,

where Fj is defined as in (4.28) with  = 2n .

Proof. First note that for some point f˜(xi ) between fi and fi∗ and a constant C1 > 0, n h X

 i Φ fi − f0,i − Φ fi∗ − f0,i

i=1

=

n X

n

fi −

fi∗



Φ

0

fi∗

i=1

≤ C1

n X

− f0,i



 2 1 X 00 ˜ + Φ f (xi ) − f0,i fi − fi∗ 2 i=1

(4.57)

2 fi − fi∗ ,

i=1

using (4.54), the uniform bound on all f ∈ F and f0 , and the assumption that Φ00 (x) = 2g(x) is bounded away from zero and infinity on [−2K, 2K]. Similarly we have the reverse inequality, say with a constant C2 . This implies that the Kullback-Leibler divergence is

4.6. Regression under misspecification of the error distribution

89

bounded by a multiple of the l2 -norm, as 2σ

n X

DKL (pf ∗ ,i k pf,i ) =

i=1

n X

n o Eei |ei − (fi − f0,i )| − |ei − (fi∗ − f0,i )|

i=1

=

n h X

 i Φ fi − f0,i − Φ fi∗ − f0,i .

i=1

Using the same arguments, it follows that the Renyi-entropy (e.g.

1 n

Pn

i=1

Re (p∗i | Dρ,p 0,i

pθ,i ) in (4.46)) between pf ∗ and pf is n nρ  ρ o 1X − log Eei exp H ei , (fi∗ − f0,i ) − H ei , (fi − f0,i ) n i=1 σ σ n

n 1X ρ Eei 1 + [H(ei , fi∗ − f0,i ) − H(ei , fi − f0,i )] n i=1 σ o 2 ρ 2 + 2 e2ρK/σ [H(ei , fi∗ − f0,i ) − H(ei , fi − f0,i )] 2σ n 2 2 o 1 Xn ρ2 ≥− − C2 fi − fi∗ + 2 e2ρK/σ fi − fi∗ n i=1 2σ

≥−

(4.58)

n

& C2

2 1X fi − fi∗ , n i=1

for sufficiently small ρ. ∗ With a calculation similar to (4.50) and (4.51), we find that rδ,n (Fj ) is bounded by

Z n Y (

lj,i

 δ δ ∗ |yi − fi | − |yi − lj,i | dyi g(yi − f0,i ) exp σ σ −∞ i=1     Z uj,i Z ∞ δ δ δ |yi − fi∗ | − |yi − uj,i | dyi + g(yi ) exp |yi − fi∗ | dyi ) + g(yi − f0,i ) exp σ σ σ lj,i uj,i   Z n Y δ δ δ 2δ ≤ g(yi − f0,i ) exp |yi − fi∗ | − |yi − lj,i | + 2n + K1yi ∈[lj,i ,uj,i ] dyi σ σ σ σ i=1 ( n ) n Z  n δ 2 Y δX 4Kδ/σ 2 nn ∗ σ ≤ 1+e n e exp (|yi − fi | − |yi − lj,i |) g(yi − f0,i )dy. σ i=1 i=1 

To show that the last integral is at most one for sufficiently small δ, we use the same Taylor series as in (4.58), and inequality (4.57) with an extra minus.

4.6.3

Random design

For easy comparison with the results of Kleijn and van der Vaart [32], we state the random design analogues of Theorems 4.4 and 4.5. They can be proved with the same arguments that were used in section 4.4.2.

90

4. Misspecification in nonparametric regression models

Theorem 4.6. Let (X1 , Y1 ), . . . , (Xn , Yn ) have density p0 (x, y) =

Qn

i=1

m(xi )g(yi −

f0 (xi )), where the error density g is assumed to have zero mean and sub-Gaussian tails. Let F be a closed and convex class of measurable functions, such that either f ∗ = f0 or f ∗ lies in the interior of F. If for positive constants k1 and k2 X

0

n

2

Π(Fj )δλ (rδ∗ (Fj )) ≤ ek1 nn ,

j

2  1 Π {f ∈ F : 2 EX0 f (X0 ) − f ∗ (X0 ) < 2n } σ then

Z

1 EX,Y σ2

2

≥ e−k2 nn ,

2 EX0 f (X0 ) − f ∗ (X0 ) dΠ(f | X, Y ) . 2n .

When |f0 | ≤ K and |f | ≤ K for all f ∈ F, then there exists a constant k1 such that for all δ ∈ (0, 1], ∗ log rδ,n (Fj ) ≤ k1 n2n ,

where Fj is defined as in (4.28) with  = 2n . Theorem 4.7. For a positive constant K, let F be a class of measurable functions that is closed, convex and uniformly-bounded by K. Let (X1 , Y1 ), . . . , (Xn , Yn ) have density Qn p0 (x, y) = i=1 m(xi )g(yi − f0 (xi )), where the errors have zero median, and a density g that is bounded away from zero and infinity on [−2K, 2K]. It is assumed that either f ∗ lies in the interior of F, or that the error distribution has zero median and f ∗ = f0 . If f0 is such that |f0 | ≤ K, and if for positive constants k1 and k2 X

0

n

2

Π(Fj )δλ (rδ∗ (Fj )) ≤ ek1 nn ,

j

Π {f ∈ F : then

1 EX,Y σ

 1 EX0 (f (X0 ) − f ∗ (X0 ))2 < 2n } σ

Z

2

≥ e−k2 nn ,

EX0 (f (X0 ) − f ∗ (X0 ))2 dΠ(f | X, Y ) . 2n .

∗ Moreover, there exists a constant k1 such that for all δ ∈ (0, 1], log rδ,n (Fj ) ≤ k1 n,

where Fj is defined as in (4.28) with  = 2n .

4.7 The choice of {Γj} and the use of sieves

In the regression example in section 4.4.1 we found a rate of $n^{-1/4}$. For the case that the model consists of splines with an increasing number of knots, equipped with a normal prior on the spline coefficients, we define an alternative cover {Γj} that leads to the optimal rate of $n^{-1/3}$. For simplicity we only consider regression with Gaussian errors.


Suppose we observe independent variables $Y_1,\ldots,Y_n$ with normal densities $\phi_\sigma(y_i - f_0(x_i))$, where the covariates $x_i$ are contained in the interval $[-M,M]$. Following Ghosal and van der Vaart [20], we use a basis of B-splines $B_1,\ldots,B_{l_n}$ to approximate regression functions on this interval. The true regression function $f_0$ is assumed to be $C^1$. The dimension $l_n = q + k_n - 1$ increases with the sample size, where $q-1$ is the degree of the polynomials on the subintervals of length $\frac{2M}{k_n}$, from which the splines are constructed. For properties of the B-splines see section 7.6.1 in [20] or the book by de Boor [7]. We adopt the notation of the former, and write $f_\beta(x) = \sum_{j=1}^{l_n}\beta_j B_j(x)$, $x\in[-M,M]$. The model $\mathcal{P}_n$ defined by (4.17) now takes the more specific form
$$\mathcal{P}_n = \bigl\{p_\beta \mid \beta\in\mathbb{R}^{l_n}\bigr\} = \Bigl\{\prod_{i=1}^n \phi_\sigma(y_i - f_\beta(x_i)) \,\Big|\, \beta\in\mathbb{R}^{l_n}\Bigr\}. \qquad(4.59)$$

The unknown density of Y is denoted $p_0(y) = \prod_{i=1}^n \phi_\sigma(y_i - f_0(x_i))$. Let the coefficients $\beta_1,\ldots,\beta_{l_n}$ be independent and standard normally distributed; this induces a prior $\Pi_n$ on $\mathcal{P}_n$. For $P_n^x$ the empirical measure of the covariates, let $\|\cdot\|_n$ be the norm
$$\|f\|_n = \sqrt{\frac1n\sum_{i=1}^n f^2(x_i)}, \qquad f\in L_2(P_n^x).$$
Given the covariance matrix $\Sigma_n = \bigl(\int B_i B_j\,dP_n^x\bigr)$ we assume, as is done in [20], that the sequence of covariates $x_1, x_2,\ldots$ satisfies the spatial separation condition $l_n^{-1}\|\beta\|_2^2 \lesssim \beta^t\Sigma_n\beta \lesssim l_n^{-1}\|\beta\|_2^2$. Under this assumption, there are positive constants $C_1$ and $C_2$ such that for all $\beta,\beta'\in\mathbb{R}^{l_n}$,
$$C_1\|\beta-\beta'\|_2 \le \sqrt{l_n}\,\|f_\beta - f_{\beta'}\|_n \le C_2\|\beta-\beta'\|_2. \qquad(4.60)$$
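To make the norm equivalence (4.60) concrete, the following Python sketch (ours, not part of the thesis; the values of M, q and k_n, the uniform covariate design, and the helper names are illustrative assumptions) builds a B-spline basis on [−M, M], draws coefficient vectors as under the standard normal prior, and checks numerically that the ratio in (4.60) stays between two constants.

```python
# A minimal numerical sketch of the spline model of section 4.7 (illustrative values).
import numpy as np
from scipy.interpolate import BSpline

M, q, kn = 1.0, 4, 10                     # degree q-1 = 3 (cubic splines), kn subintervals
ln = q + kn - 1                           # model dimension l_n
knots = np.linspace(-M, M, kn + 1)
t = np.r_[[-M] * (q - 1), knots, [M] * (q - 1)]   # knot vector with repeated end knots

def f_beta(x, beta):
    """Evaluate f_beta(x) = sum_j beta_j B_j(x)."""
    return BSpline(t, beta, q - 1)(x)

rng = np.random.default_rng(0)
n = 500
x = np.sort(rng.uniform(-M, M, n))                 # covariates in [-M, M]
beta, beta2 = rng.standard_normal(ln), rng.standard_normal(ln)   # two prior draws

def norm_n(vals):
    return np.sqrt(np.mean(vals ** 2))             # empirical norm ||.||_n

ratio = np.sqrt(ln) * norm_n(f_beta(x, beta) - f_beta(x, beta2)) / np.linalg.norm(beta - beta2)
print(ratio)   # should lie between constants C1 and C2, cf. (4.60)
```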

Using Corollary 4.1 with $\delta = 1$ and $\lambda' = \tfrac12$, we show that the convergence rate is at least $\epsilon_n = n^{-1/3}\sqrt{\log n}$. First we verify condition (4.19), which asserts that the prior mass on the set of $\beta$ for which $\frac{1}{\sqrt{2\sigma^2}}\|f_\beta - f_0\|_n < \epsilon_n$ is at least $\exp\{-k_2 n\epsilon_n^2\}$, for some constant $k_2 > 0$. Because $f_0$ is a $C^1$-function on $[-M,M]$, we have that for every $l\in\mathbb{N}$ there is some $\beta_l\in\mathbb{R}^l$ such that
$$\|f_0 - f_{\beta_l}\|_\infty \le D_1 l^{-1},$$
where the constant $D_1$ includes a factor $\|f_0\|_\infty$. For a sequence $l_n \sim n^{1/3}$ we then consider the sequence of approximations defined by $\beta_{l_n}$. Since the $\|\cdot\|_n$-norm is bounded by the $\|\cdot\|_\infty$-norm,
$$\|f_\beta - f_0\|_n \le \|f_{\beta_{l_n}} - f_\beta\|_n + \|f_{\beta_{l_n}} - f_0\|_n \le D_2\epsilon_n + D_1 l_n^{-1}$$


if $\|f_{\beta_{l_n}} - f_\beta\|_n \le D_2\epsilon_n$. From (4.60) it follows that this is the case if $\|\beta-\beta_{l_n}\|_2 \le \frac{D_2}{C_2}\sqrt{l_n}\,\epsilon_n$. Because the elements of $\beta_{l_n}$ are contained in compact intervals and the prior is standard normal, this ball has prior mass of order $(\sqrt{l_n}\,\epsilon_n)^{l_n} \approx e^{-l_n\log n} \approx e^{-k_2 n\epsilon_n^2}$.

To verify condition (4.18), let $\tau_n$ be a sequence to be chosen below, and cover the set $\Gamma_n$ in (4.59) by the $l_n$-dimensional hypercubes
$$\Gamma_{a,n} = \bigl(\tau_n a(1),\,\tau_n(1+a(1))\bigr]\times\cdots\times\bigl(\tau_n a(l_n),\,\tau_n(1+a(l_n))\bigr],$$
indexed by $a\in\mathbb{Z}^{l_n}$. These $\Gamma_{a,n}$ are disjoint and have Lebesgue measure $\tau_n^{l_n}$. We have to show that $\log\sum_j \Pi(\Gamma_{a,n})^{\lambda'}\,r_{\delta,n}(\Gamma_{a,n})$ is at most a multiple of $n\epsilon_n^2$. Although we let $\lambda' = 1/2$ in the following calculation, it could be any value in $(0,1)$. Using the symmetry of the model and the prior, we can restrict the sum to those $a$ with only nonnegative entries, if we put an extra factor $2^{l_n}$ in front, i.e.
$$\sum_{a\in\mathbb{Z}^{l_n}}\Pi(\Gamma_{a,n})^{\lambda'}\,r_{1,n}(\Gamma_{a,n}) \le 2^{l_n}\sum_{a\in\{0,1,2,\ldots\}^{l_n}}\sqrt{\Pi(\Gamma_{a,n})}\,r_{1,n}(\Gamma_{a,n})$$
$$= 2^{l_n}\sum_{a_1=0}^{\infty}\cdots\sum_{a_{l_n}=0}^{\infty} r_{1,n}(\Gamma_{a,n})\sqrt{\frac{\tau_n^{l_n}}{(2\pi)^{l_n/2}}\exp\Bigl\{-\frac{\tau_n^2}{2}\sum_{j=1}^{l_n}a_j^2\Bigr\}}$$
$$\le (c_0\tau_n)^{l_n}\Bigl(\sup_{a\in\mathbb{Z}^{l_n}} r_{1,n}(\Gamma_{a,n})\Bigr)\Bigl(\sum_{a_1=0}^{\infty}\exp\Bigl\{-\frac{\tau_n^2}{4}a_1^2\Bigr\}\Bigr)^{l_n}, \qquad(4.61)$$

for some constant $c_0 > 0$. Note that for any $c > 0$,
$$\sum_{k\ge 0} e^{-ck^2} \le 1 + \int_0^\infty e^{-cx^2}\,dx = 1 + \frac12\sqrt{\frac{\pi}{c}}.$$
Taking $c = \frac{\tau_n^2}{4}$ we see that the last factor in (4.61) is bounded by $\bigl(1 + \frac{\sqrt{\pi}}{\tau_n}\bigr)^{l_n}$.
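The tail bound above is elementary; a quick numerical check of it (ours, purely illustrative) for a few values of c:

```python
# Verify numerically that sum_{k>=0} exp(-c k^2) <= 1 + 0.5*sqrt(pi/c).
import numpy as np
for c in [0.01, 0.1, 1.0]:
    s = np.exp(-c * np.arange(0, 10000) ** 2).sum()
    bound = 1 + 0.5 * np.sqrt(np.pi / c)
    print(c, s, bound, s <= bound)
```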

We now choose $\tau_n = \epsilon_n^2/n$ and bound $r_{1,n}(\Gamma_{a,n})$. Let $a\in\mathbb{Z}^{l_n}$, and let $p\in\mathbb{R}^{l_n}$ be the center of the corresponding $\Gamma_{a,n}$. Let $A_1$ denote the ($l_n$-dimensional) $l_2$-ball with radius $\sqrt{l_n}\,\tau_n$ around $p$, let $A_2$ denote the ($n$-dimensional) $l_2$-ball with radius $C_2\sqrt{n}\,\tau_n$ around $f_p(x)$, and let $A_3$ be the $l_\infty$-ball with radius $C_2\sqrt{n}\,\tau_n$ around $f_p(x)$, i.e. the smallest $l_\infty$-ball containing $A_2$. Because $A_1\subset A_2$ by (4.60),
$$r_{1,n}(\Gamma_{a,n}) = \int\sup_{\beta\in\Gamma_{a,n}}\prod_{i=1}^n\phi_\sigma(y_i - f_\beta(x_i))\,dy \le \int\sup_{\beta\in A_1}\prod_{i=1}^n\phi_\sigma(y_i - f_\beta(x_i))\,dy \le \int\sup_{z\in A_2}\prod_{i=1}^n\phi_\sigma(y_i - z_i)\,dy \le \int\sup_{z\in A_3}\prod_{i=1}^n\phi_\sigma(y_i - z_i)\,dy \le \bigl(1 + C_2\phi_\sigma(0)\epsilon_n^2\bigr)^n,$$
where the last inequality follows from the example in section 4.4.1. Consequently, the logarithm of the right hand side of (4.61) is bounded by
$$C_2\phi_\sigma(0)n\epsilon_n^2 + l_n\log\Bigl(c_0\tau_n\Bigl(1 + \frac12\sqrt{\frac{\pi}{4\sigma^2\tau_n}}\Bigr)\Bigr) \le C_2\phi_\sigma(0)n\epsilon_n^2 + c_1 l_n\log n.$$

Chapter 5

Analyzing spatial count data, with an application to weed counts

Abstract: Count data on a lattice may arise in observational studies of ecological phenomena. In this chapter a hierarchical spatial model is used to analyze weed counts. Anisotropy is introduced, and a bivariate extension of the model is presented.

5.1 Introduction

Let $X_1,\ldots,X_n\in\mathbb{R}^p$ be an i.i.d. sample from a distribution with covariance matrix $\Sigma\in\mathbb{R}^{p\times p}$. If $\Sigma$ can be an arbitrary covariance matrix, n has to be large in order to estimate all covariances. In many applications, however, n is much smaller than p. This is typically the case in spatial problems, where n is often 1. Suppose that a quantity of interest X is observed at p distinct locations $s_1,\ldots,s_p$ in some d-dimensional set S. Questions of interest are the prediction of a new observation at the same locations, given only a few previous observations, and the prediction of X at other locations $t_1,\ldots,t_m$ given the observations at $s_1,\ldots,s_p$. Alternatively, one could be interested in estimation of the expected value of X at $t_1,\ldots,t_m$. For example, we could be given the annual rainfall in p cities and want to estimate the annual rainfall at a number of different locations. In these problems, it is often natural to assume a form of spatial dependence between observations of X at nearby locations. This can be modeled by imposing restrictions on the covariance matrix Σ.


Let n = 1 and d = 2, and suppose that $s_1,\ldots,s_p$ form a rectangular grid, with a fixed neighborhood structure. If the sites $s_i$ and $s_j$ are neighbors, this is denoted $i\sim j$. We wish to define the distribution of a random vector $X = (X_1,\ldots,X_p)$ whose covariance structure is such that the distribution of $X_i$ given $X_j = x_j$, $j\ne i$, only depends on the values at the neighbors of $s_i$. This is the defining property of Markov Random Fields; see for example Rozanov [48]. In what follows we only consider Gaussian Markov Random Fields, which are multivariate normal distributions that have the covariance structure described above. By the normality of X, the conditional distribution of $X_i$ given $X_j = x_j$ ($j\sim i$) is also normal. Suppose that all sites that are horizontally or vertically adjacent are neighbors, and that location $s_i$ has 4 neighbors. If $\Sigma_{ii} = 4\sigma^{-2}$ and $\Sigma_{ij} = -\gamma\sigma^{-2}$ ($i\sim j$), with $\gamma < 1$, it follows that this conditional distribution is normal with mean $\frac{\gamma}{4}\sum_{j\sim i}x_j$ and variance $\sigma^2$. This is a generalization of the Markov property in autoregressive time series, where the process has a density provided that $\gamma < 1$. Note however that our lattice is finite. If we define $\Sigma_{ii}$ to be $2\sigma^{-2}$ or $3\sigma^{-2}$ if $s_i$ lies on a corner or on an edge of the lattice, we obtain a process for which the mean of each conditional distribution is $\gamma$ times the mean of the values at its neighbors. In applications it can be of interest to define a process with a spatial dependence that is different for the horizontal and vertical direction. In geostatistics and stochastic geometry, processes whose law is not invariant under rotation are called anisotropic. In the preceding example, one could think of a process for which the means of the conditional distributions are weighted averages of the values at neighboring locations. On an infinite lattice such a process can be defined by modifying $\Sigma$ accordingly. For example, let $\gamma_1$ and $\gamma_2$ be nonnegative constants with $\gamma_1+\gamma_2 < 2$, and define $\Sigma_{i,j} = -\gamma_1\sigma^{-2}$ if $s_i$ and $s_j$ are vertical neighbors and $\Sigma_{i,j} = -\gamma_2\sigma^{-2}$ if they are horizontal neighbors. On a finite lattice, however, this approach is not always possible, due to complications at the boundaries of the lattice. If for example $\gamma_1 = 1.8$ and $\gamma_2 = 0.1$, and a grid point has two vertical neighbors with weight $\gamma_1/3$ and one horizontal neighbor with weight $\gamma_2/3$, the sum of the 3 weights is larger than 1, which suggests that the process might not have a density. That this is indeed a problem can be shown with Geršgorin's disc theorem. The main contribution of this chapter is the construction of non-singular covariance matrices $\Sigma$ such that for the inner grid points, the conditional distributions are as in the example above. Following Sain and Cressie [49], we also define a bivariate extension of these anisotropic processes. This work was motivated by a study in weed ecology. From 2001 to 2003, the number of weed plants was counted on a rectangular plot in a field near Wageningen, The Netherlands. For an extensive analysis of these data, see Heijting [24] and Heijting et al. ([25],[26]). For many of the observed weed species, the plants occurred in a stripy pattern, and we aimed to incorporate this in the statistical model. Since there may also


be an interaction between the species, it is of interest to study multivariate models. Besides gaining more insight into the underlying ecological processes, the study aimed to contribute to the development of precision agriculture, where efficient and environmentally friendly treatment of weeds is a major challenge ([46], [34], [53]). Weeds in a field are a threat to valuable crops, as their quality may be affected, either by the presence of damaging substances or by the reduction in yield. To optimize yield, spraying of herbicides is often required, which in most cases is still done by applying the same dosage to the whole field. Environmental and economic considerations, however, require a minimization of herbicide use and hence dose amounts that vary with location. This may be achieved by high-technology equipment to optimize herbicide application. A requirement is that weeds are properly characterized, both in species and in location. The plants were observed in a rectangular plot that was partitioned into adjacent squares, or quadrats. For various reasons it was not possible to record the precise location of every plant; therefore only the number of plants in each quadrat was counted. To model these counts, the Gaussian processes described above are incorporated in a log-Poisson model, as in Besag, York and Mollié [3]. Conditional on the values of the process, the data are assumed to be independently Poisson distributed, the intensities being the exponential of the process values. The motivation for such a model lies in the fact that a multivariate Poisson distribution with spatial dependencies has undesirable properties. This was shown by Besag ([2], p. 202) using the Hammersley-Clifford theorem.

5.2 Modeling anisotropy in count data on a lattice

The set of quadrats S is labeled $S = \{1, 2, \ldots, n\}$. To specify spatial dependence, a predefined neighborhood structure is used. For every pair (i, j), the quadrats associated with i and j are neighbors ($i\sim j$) if they are adjacent in one of the two orthogonal directions, say the y- and the x-direction. For the set of i's neighbors we write $S_i = \{i_1, i_2, \ldots, i_{m_i}\}$. The summation $\sum_{i=1}^n\sum_{j\in S_i}$ denotes the summation over all pairs of neighboring quadrats (i, j). Note that in this summation every pair is included twice. Let the elements of an $n\times n$ matrix A be denoted by $a_{ij}$, with $A^t$ as its transpose. Finally, $m_i$ equals the number of neighbors of quadrat i, i.e. $m_i = 2$ if the ith quadrat is a corner quadrat, $m_i = 3$ if it is an edge quadrat and $m_i = 4$ otherwise. Random variables are denoted by capitals and their realizations by lower case characters.

5.2.1 The Log-Poisson model

We use a log-Poisson model of the type introduced by Besag, York and Mollié [3]. Hess, van Lieshout, Payne and Stein [27] already applied this model to weed data, in a study


of the Striga hermonthica weed. A spatial autoregressive process $U = (U_1, U_2, \ldots, U_n)$ is used to model spatial variation, and a noise vector $V = (V_1, V_2, \ldots, V_n)$ is included to allow for extra variation. Given U = u and V = v, the observed counts $y_i$ are assumed to be realizations of independent Poisson random variables $Y_i$, with intensities
$$\lambda_i = \beta\cdot e^{u_i + v_i}. \qquad(5.1)$$

The parameter $\beta$ expresses the overall mean, whereas the factors $e^{u_i+v_i}$ account for local deviations from this mean. In disease mapping, $\beta$ is often referred to as the relative risk. In this chapter it will be fixed, and assumed to be equal to the mean of the observed counts. We assume that U follows a multivariate normal distribution with covariance matrix $A^{-1}$, where A is symmetric with $a_{ii} = \tau_1 m_i$ on the diagonal, $a_{ij} = a_{ji} = -\tau_1\gamma$ if $i\sim j$ and $a_{ij} = 0$ otherwise. The parameter $\gamma$ specifies the strength of spatial dependence and $\tau_1$ is the precision of the process U. We only consider the case that $\gamma\in(0,1)$, leading to an invertible precision matrix A. The vector U is a conditional autoregressive (CAR) process (see, for example, Cliff and Ord [6], section 6.2) with density
$$\frac{\sqrt{|A|}}{(2\pi)^{n/2}}\exp\Bigl(-\frac12 u^tAu\Bigr) = \frac{\sqrt{|A|}}{(2\pi)^{n/2}}\exp\Bigl\{-\frac12\Bigl(\tau_1\sum_{i=1}^n m_i u_i^2 - \tau_1\gamma\sum_{i=1}^n\sum_{j\in S_i}u_i u_j\Bigr)\Bigr\}. \qquad(5.2)$$
The conditional distribution of $U_i$ given the values at the other locations, therefore, is the following univariate normal distribution:
$$U_i\mid\{U_j = u_j,\, j\ne i\} \sim N\Bigl(\frac{\gamma}{m_i}\sum_{j\in S_i}u_j,\ \frac{1}{m_i\tau_1}\Bigr). \qquad(5.3)$$

In contrast to U, V has no spatial structure. The $V_i$ are i.i.d. normal with precision $\tau_2$, and independent of U. One could also directly consider U + V, which is normally distributed with covariance matrix $A^{-1} + \frac{1}{\tau_2}I_n$. The present decomposition into U and V, however, facilitates the estimation of $\tau_1$ and $\tau_2$. A Bayesian approach is taken to obtain estimates of $\tau_1$, $\tau_2$, $\gamma$ and $\beta$. In section 5.3 we specify priors $p(\tau_1)$, $p(\tau_2)$ and $p(\gamma)$, and describe the Gibbs sampler we use to sample from
$$p(u, v, \beta, \gamma_1, \gamma_2, \tau_1, \tau_2\mid y) \propto p(u\mid\gamma_1,\gamma_2,\tau_1)\,p(v\mid\tau_2)\,p(\gamma_1,\gamma_2)\,p(\tau_1)\,p(\tau_2)\prod_{i=1}^n p(y_i\mid u_i, v_i). \qquad(5.4)$$
Also estimates of u and v can be obtained from the Gibbs sampler.
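Although the chapter describes the model only mathematically, the following Python sketch (ours; the grid size and the values of β, γ, τ1 and τ2 are illustrative choices) simulates counts from the hierarchical model (5.1)-(5.3): a CAR field U with precision matrix A, i.i.d. noise V, and conditionally independent Poisson counts.

```python
# Toy simulation of the log-Poisson CAR model (illustrative parameter values).
import numpy as np

def car_precision(nrow, ncol, gamma, tau1):
    """Precision matrix A with a_ii = tau1*m_i and a_ij = -tau1*gamma for neighbours."""
    n = nrow * ncol
    A = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    j = rr * ncol + cc
                    A[i, i] += tau1            # m_i counts the neighbours
                    A[i, j] = -tau1 * gamma
    return A

rng = np.random.default_rng(1)
nrow = ncol = 16
tau1, tau2, gamma, beta = 2.0, 5.0, 0.95, 3.0
A = car_precision(nrow, ncol, gamma, tau1)
u = rng.multivariate_normal(np.zeros(nrow * ncol), np.linalg.inv(A))  # U ~ N(0, A^{-1})
v = rng.normal(0.0, 1.0 / np.sqrt(tau2), nrow * ncol)                 # V_i ~ N(0, 1/tau2)
y = rng.poisson(beta * np.exp(u + v))                                 # counts, cf. (5.1)
```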

5.2.2 Introducing anisotropy

Given values $U_j = u_j$, $j\ne i$, the distribution of $U_i$ is normal with mean $\frac{\gamma}{m_i}\sum_{j\in S_i}u_j$, as stated in (5.3). In this sum, all neighbors $u_j$ receive weight $\frac{\gamma}{m_i}$. In this section


we vary these weights. The motivation to do so is that the spatial dependence in one direction may be stronger than that in the other direction. It would then be reasonable to assign larger weights to the direction with the strongest dependence. In this case we replace $\gamma$ by $\gamma_1$ and $\gamma_2$, and write $i\overset{x}{\sim}j$ if i and j are adjacent in the x-direction and $i\overset{y}{\sim}j$ if they are adjacent in the y-direction. Our aim is to define A such that for every quadrat i not located on an edge or corner, the y-neighbors of i receive weights $\frac{\gamma_1}{4}$ and the x-neighbors of i receive weights $\frac{\gamma_2}{4}$. We could define A as the sum of $A_1$ for the y-direction and $A_2$ for the x-direction, with $A_1$ and $A_2$ defined as in section 5.2.1. For $A_1$, for example, $a_{ij} = a_{ji} = -\tau_1\gamma_1$ if $i\overset{y}{\sim}j$. Up to a factor $\tau_1$, the diagonal contains the number of neighbors in the y-direction. With a similar definition for $A_2$, however, a complication occurs on the edges. In fact, a quadrat with two y-neighbors and only one x-neighbor has a sum of weights that could be larger than one. Take, for example, $\gamma_1 = 1.7$ and $\gamma_2 = 0.2$. Then the sum equals $\frac{1.7+1.7+0.2}{3} = 1.2$, suggesting that the distribution might not exist. To obtain a positive definite matrix, we modify the $\gamma_1$ and $\gamma_2$ by a varying $\gamma_{ij}$. For all inner quadrats i and j, we define $\gamma_{ij} = \gamma_1$ if $i\overset{y}{\sim}j$ and $\gamma_{ij} = \gamma_2$ if $i\overset{x}{\sim}j$. It is assumed that $\gamma_1\ge 0$ and $\gamma_2\ge 0$ are constants such that $\gamma_1+\gamma_2 = 2-2\delta$ for a small constant $\delta > 0$. In our application we set $\delta = 0.005$. First, consider the case $\gamma_1\ge\gamma_2$. For quadrats i and j both located on either the minimal or maximal x-edge, we define $\gamma_{ij} = \gamma_1' = \frac{3-3\delta-\gamma_2}{2}$. For quadrats i and j both located on either the minimal or maximal y-edge, we define $\gamma_{ij} = \gamma_2' = \frac{3-3\delta-\gamma_1}{2}$. The latter

modification is not necessary to obtain positive definiteness, as discussed in section 5.5. The case $\gamma_2 > \gamma_1$ is handled similarly: $\gamma_{ij} = \gamma_2'$ is defined if i and j are on the minimal or maximal y-edge, and $\gamma_{ij} = \gamma_1'$ if i and j are on the minimal or maximal x-edge. These modifications are motivated by Geršgorin's disc theorem; see also Sain and Cressie [49].

Theorem (Geršgorin). Let T be an $n\times n$ matrix with complex elements $t_{ij}$. For $1\le i\le n$ define $R_i(T) = \sum_{1\le j\le n,\,j\ne i}|t_{ij}|$. Then the eigenvalues of T are located in the union of discs given by $\bigcup_{i=1}^n\{|z - t_{ii}|\le R_i(T)\}$.

A proof can be found in many linear algebra or matrix analysis textbooks, see for example Lancaster and Tismenetsky [37] (p. 371-372). For our matrix A, the condition
$$m_i > \sum_{j\in S_i}\gamma_{ij} \qquad \forall i$$
guarantees that all eigenvalues are positive, implying that A and $A^{-1}$ are positive definite and non-singular. For all quadrats i and j, the condition $m_i + m_j\le 6$ is equivalent to i and j being both located on an edge. Using this fact we summarize the


above definitions. For $\gamma_1\ge\gamma_2$,
$$\gamma_{ij} = \frac{-a_{ij}}{\tau_1} = \frac{-a_{ji}}{\tau_1} = \gamma_1'\,\mathbf{1}_{\{(i\overset{y}{\sim}j)\wedge(m_i+m_j\le 6)\}} + \gamma_1\,\mathbf{1}_{\{(i\overset{y}{\sim}j)\wedge(m_i+m_j\ge 7)\}} + \gamma_2'\,\mathbf{1}_{\{(i\overset{x}{\sim}j)\wedge(m_i+m_j\le 6)\}} + \gamma_2\,\mathbf{1}_{\{(i\overset{x}{\sim}j)\wedge(m_i+m_j\ge 7)\}}, \qquad(5.5)$$
and for $\gamma_2 > \gamma_1$,
$$\gamma_{ij} = \frac{-a_{ij}}{\tau_1} = \frac{-a_{ji}}{\tau_1} = \gamma_2'\,\mathbf{1}_{\{(i\overset{x}{\sim}j)\wedge(m_i+m_j\le 6)\}} + \gamma_2\,\mathbf{1}_{\{(i\overset{x}{\sim}j)\wedge(m_i+m_j\ge 7)\}} + \gamma_1'\,\mathbf{1}_{\{(i\overset{y}{\sim}j)\wedge(m_i+m_j\le 6)\}} + \gamma_1\,\mathbf{1}_{\{(i\overset{y}{\sim}j)\wedge(m_i+m_j\ge 7)\}}. \qquad(5.6)$$
For $\gamma_2 = \gamma_1$, the isotropic case, (5.5) and (5.6) both imply that $\gamma_{ij} = 1-\delta$ for all $i\sim j$. In the anisotropic case, we have $\frac{1}{m_i}\sum_{j\in S_i}\gamma_{ij} = 1-\delta$ for any quadrat i, except for the 4 corners. For the covariance structure described above, the distributions given by (5.2) and (5.3) can be refined to
$$p(u) = \frac{1}{(2\pi)^{n/2}\sqrt{|A^{-1}|}}\exp\Bigl\{-\frac12\Bigl(\tau_1\sum_{i=1}^n m_i u_i^2 - \tau_1\sum_{i=1}^n\sum_{j\in S_i}\gamma_{ij}u_i u_j\Bigr)\Bigr\}, \qquad(5.7)$$
$$U_i\mid\{U_j = u_j,\, j\ne i\} \sim N\Bigl(\frac{1}{m_i}\sum_{j\in S_i}\gamma_{ij}u_j,\ \frac{1}{m_i\tau_1}\Bigr). \qquad(5.8)$$
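A small Python sketch (ours) makes this construction explicit: it assembles the precision matrix A with the edge-modified weights of (5.5) for the case γ1 ≥ γ2, and then checks the Geršgorin condition and positive definiteness. The grid size, γ1, δ and the helper name are illustrative assumptions.

```python
# Anisotropic precision matrix of section 5.2.2 on a small rectangular grid (case gamma1 >= gamma2).
import numpy as np

def anisotropic_precision(nrow, ncol, gamma1, delta, tau1=1.0):
    gamma2 = 2 - 2 * delta - gamma1           # gamma_1 + gamma_2 = 2 - 2*delta
    g1p = (3 - 3 * delta - gamma2) / 2        # gamma_1' for pairs on an x-edge (vertical neighbours)
    g2p = (3 - 3 * delta - gamma1) / 2        # gamma_2' for pairs on a y-edge (horizontal neighbours)
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    n = nrow * ncol
    A = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            mi = sum(0 <= r + a < nrow and 0 <= c + b < ncol for a, b in steps)
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if not (0 <= rr < nrow and 0 <= cc < ncol):
                    continue
                j = rr * ncol + cc
                mj = sum(0 <= rr + a < nrow and 0 <= cc + b < ncol for a, b in steps)
                vertical = (dc == 0)                       # y-neighbours
                edge_pair = (mi + mj <= 6)                 # both quadrats on the boundary
                gij = (g1p if edge_pair else gamma1) if vertical else (g2p if edge_pair else gamma2)
                A[i, i] += tau1                            # diagonal: tau1 * m_i
                A[i, j] = -tau1 * gij
    return A

A = anisotropic_precision(8, 16, gamma1=1.8, delta=0.005)
offdiag = np.abs(A).sum(axis=1) - np.diag(A)
print(np.all(np.diag(A) > offdiag))            # Gersgorin condition m_i > sum_j gamma_ij
print(np.all(np.linalg.eigvalsh(A) > 0))       # hence A is positive definite
```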

5.2.3 A bivariate model

In this section we extend the univariate model defined by (5.1) to a bivariate model in which
$$\lambda_i^A = \beta_A\cdot e^{u_i^A + v_i^A}, \qquad \lambda_i^B = \beta_B\cdot e^{u_i^B + v_i^B}. \qquad(5.9)$$
We observe counts $Y_i^A$ and $Y_i^B$ which, given $U^A = u^A$, $U^B = u^B$, $V^A = v^A$ and $V^B = v^B$, are assumed to be independently Poisson distributed with these intensities. The processes $V^A = \{V_i^A\}_{i=1,2,\ldots,n}$ and $V^B = \{V_i^B\}_{i=1,2,\ldots,n}$ are independent and normally distributed with precisions $\tau_2^A$ and $\tau_2^B$. Also $U^A$ and $U^B$ have normal distributions, with parameters specified later in this section. This model is motivated by the idea that the conditional specification in expression (5.8) should be replaced by a bivariate normal distribution. It has variance $\frac{1}{\tau_1 m_i}$, and we can write the exponent in the right hand side of (5.7) as
$$-\frac12\Bigl(\tau_1\sum_{i=1}^n m_i u_i^2 - \tau_1\sum_{i=1}^n\sum_{j\in S_i}\gamma_{ij}u_i u_j\Bigr) = -\frac12\Bigl(\sum_{i=1}^n m_i u_i\Bigl(\frac{1}{\tau_1}\Bigr)^{-1}u_i - \sum_{i=1}^n\sum_{j\in S_i}\gamma_{ij}u_i\Bigl(\frac{1}{\tau_1}\Bigr)^{-1}u_j\Bigr).$$
Let the correlation between A and B be controlled by the parameter $c\in(-1,1)$, and define
$$\Gamma = \begin{pmatrix}\dfrac{1}{\tau_1^A} & \dfrac{c}{\sqrt{\tau_1^A\tau_1^B}}\\[6pt] \dfrac{c}{\sqrt{\tau_1^A\tau_1^B}} & \dfrac{1}{\tau_1^B}\end{pmatrix}.$$
The column vector $(U_i^A, U_i^B)^t$ is denoted $U_i^{A,B}$.


Similarly we write $u_i^{A,B}$ and $y_i^{A,B}$; $U^{A,B}$ denotes the column vector $(U_1^A, U_1^B, \ldots, U_n^A, U_n^B)^t$. Analogous to (5.2), the joint density of $U^A$ and $U^B$ is
$$p(u^{A,B}) \propto \exp\Bigl\{-\frac12 (u^{A,B})^t\Sigma(u^{A,B})\Bigr\} = \exp\Bigl\{-\frac12\Bigl(\sum_{i=1}^n m_i (u_i^{A,B})^t\Gamma^{-1}(u_i^{A,B}) - \sum_{i=1}^n\sum_{j\in S_i}\gamma_{ij}(u_i^{A,B})^t\Gamma^{-1}(u_j^{A,B})\Bigr)\Bigr\}, \qquad(5.10)$$
where $\Sigma$ is the $2n\times 2n$ block-matrix consisting of $2\times 2$ blocks $\Sigma_{ii} = m_i\Gamma^{-1}$ and $\Sigma_{ij} = -\gamma_{ij}\Gamma^{-1}$ for $i\ne j$. For every $i\sim j$, the terms $\frac{\gamma_{ij}}{2}(u_i^{A,B})^t\Gamma^{-1}(u_j^{A,B})$ and $\frac{\gamma_{ij}}{2}(u_j^{A,B})^t\Gamma^{-1}(u_i^{A,B})$ occur in (5.10). Hence the conditional density can be written as
$$p(u_i^{A,B}\mid\{u_j^{A,B}\}_{j\ne i}) \propto \exp\Bigl\{(u_i^{A,B})^t\bigl(\tfrac{1}{m_i}\Gamma\bigr)^{-1}\bar u_i^{A,B} - \tfrac12(u_i^{A,B})^t\bigl(\tfrac{1}{m_i}\Gamma\bigr)^{-1}u_i^{A,B} - \tfrac12(\bar u_i^{A,B})^t\bigl(\tfrac{1}{m_i}\Gamma\bigr)^{-1}\bar u_i^{A,B}\Bigr\} \propto \exp\Bigl\{-\frac12\bigl(u_i^{A,B}-\bar u_i^{A,B}\bigr)^t\bigl(\tfrac{1}{m_i}\Gamma\bigr)^{-1}\bigl(u_i^{A,B}-\bar u_i^{A,B}\bigr)\Bigr\}, \qquad(5.11)$$
in which $\bar u_i^{A,B} = \bigl(\frac{1}{m_i}\sum_{j\in S_i}\gamma_{ij}u_j^A,\ \frac{1}{m_i}\sum_{j\in S_i}\gamma_{ij}u_j^B\bigr)^t$. This is proportional to the bivariate normal density with mean $\bar u_i^{A,B}$ and covariance matrix $\frac{1}{m_i}\Gamma$. The process $U^{A,B}$

is a two-dimensional Gaussian Markov Random Field. Such processes were studied by Mardia [40] and used in various Bayesian models by Pettitt, Weir and Hart [45], Gelfand and Vounatsou [14], Sain and Cressie [49] and Jin, Carlin and Banerjee [30]. In case the spatial correlation between more than two variables is studied, extension of the bivariate model to a multivariate model is straightforward. The bivariate normal density defined by (5.11) then becomes a multivariate normal density.
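Continuing the sketch above, the 2n × 2n block precision matrix Σ can be assembled as a Kronecker product of the univariate matrix (divided by τ1, so that its entries are m_i and −γ_ij) with Γ^{-1}. The sketch below is ours; it reuses the hypothetical `anisotropic_precision` helper from the earlier sketch, and τ1^A, τ1^B and c are illustrative values.

```python
# Block precision matrix Sigma of the bivariate field of section 5.2.3 (illustrative values).
import numpy as np

def bivariate_precision(M, tau1A, tau1B, c):
    """M has entries m_i on the diagonal and -gamma_ij off the diagonal (i.e. A / tau1)."""
    Gamma = np.array([[1.0 / tau1A, c / np.sqrt(tau1A * tau1B)],
                      [c / np.sqrt(tau1A * tau1B), 1.0 / tau1B]])
    # block (i, j) of Sigma equals M[i, j] * Gamma^{-1}
    return np.kron(M, np.linalg.inv(Gamma))

M = anisotropic_precision(8, 16, gamma1=1.8, delta=0.005)   # from the earlier sketch (tau1 = 1)
Sigma = bivariate_precision(M, tau1A=2.0, tau1B=3.0, c=0.5)
print(np.all(np.linalg.eigvalsh(Sigma) > 0))                # positive definite for |c| < 1
```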

5.3 Sampling from the posterior distribution

Following Besag, York and Mollié [3], we sample from the posterior distribution using a Gibbs sampling algorithm.

Univariate sampling

We assume the following prior distributions for the parameters $\beta$, $\tau_1$, $\tau_2$ and $\gamma_1$. The anisotropy parameter $\gamma_1$ is discretized, and uniformly distributed on the set $G = \{0, 0.01, \ldots, 1.99\}$. As we assumed that $\gamma_2 = 1.99 - \gamma_1$, $\gamma_2$ also has a uniform distribution on G. Note that in this construction, either $\gamma_1\ge 1 > \gamma_2$ or $\gamma_2\ge 1 > \gamma_1$. Alternatively we could choose a discretization with, for example, 201 points, containing the case $\gamma_1 = \gamma_2 = 0.995$. The parameters $\tau_1$ and $\tau_2$ are given identical exponential priors with mean $\mu$. In the present application we set $\mu = 1$. Given $\tau_1$, $\tau_2$ and $\gamma_1$, U has the


density given by (5.7) and the $V_i$ are uncorrelated normal random variables with mean 0 and variance $\frac{1}{\tau_2}$. For every quadrat $i\in\{1,\ldots,n\}$, the vector $u_{S_i} = (u_{i_1},\ldots,u_{i_{m_i}})$ contains the values of u at adjacent quadrats. To sample from (5.4) the following conditional distributions need to be sampled:
$$p(u_i\mid u_{S_i}, y, \gamma_1, \tau_1) \propto p(y_i\mid u_i, v_i)\,p(u_i\mid u_{S_i},\gamma_1,\tau_1), \qquad(5.12)$$
$$p(v_i\mid y) = p(v_i\mid y_i) \propto p(y_i\mid u_i, v_i)\,p(v_i\mid\tau_2), \qquad(5.13)$$
$$p(\gamma_1\mid u,\tau_1) \propto p(u\mid\gamma_1,\tau_1)\,p(\gamma_1), \qquad(5.14)$$
$$p(\tau_1\mid u,\gamma_1) \propto p(u\mid\gamma_1,\tau_1)\,p(\tau_1), \qquad(5.15)$$
$$p(\tau_2\mid v) \propto \prod_{i=1}^n p(v_i\mid\tau_2)\,p(\tau_2).$$

The conditional density in (5.12) is proportional to
$$(\beta e^{v_i}e^{u_i})^{y_i}\exp\Bigl\{-\beta e^{v_i}e^{u_i} - \frac{m_i\tau_1}{2}(u_i - \bar u_i)^2\Bigr\} \propto \exp\Bigl\{y_i u_i - \beta e^{v_i}e^{u_i} + (m_i\tau_1\bar u_i)u_i - \frac{m_i\tau_1}{2}u_i^2\Bigr\}. \qquad(5.16)$$
First $u^*$, the mode of $p(u_i\mid u_{S_i}, y, \gamma_1, \tau_1)$, is determined numerically. We use a second-order Taylor approximation around $u^*$, $e^{u_i}\approx e^{u^*}\bigl(1 + (u_i-u^*) + \tfrac12(u_i-u^*)^2\bigr)$. Instead of sampling (5.12) exactly, we take the approximating normal density with mean $\bigl(b - ce^{u^*}(1-u^*)\bigr)/(2a + ce^{u^*})$ and variance $1/(2a + ce^{u^*})$, for $a = \frac{\tau_1 m_i}{2}$, $b = y_i + \tau_1 m_i\bar u_i$ and $c = \beta e^{v_i}$. A similar normal approximation is used for (5.13).

The main difficulty in sampling (5.14) is the determinant of A. It is more convenient to work with $Q = \frac{1}{\tau_1}A$, not depending on $\tau_1$. For all $\gamma_1\in G$, $|Q|$ is calculated before running the Gibbs sampler, and stored in a table. For the current values of u and $\tau_1$, $p(u\mid\gamma_1,\tau_1) \propto \tau_1^{n/2}\sqrt{|Q|}\,\exp\{-\frac{\tau_1}{2}u^tQu\}$ is then evaluated for all $\gamma_1\in G$. After normalization, a new $\gamma_1$ is drawn according to this vector. The density given by (5.15) is proportional to
$$\tau_1^{n/2}\sqrt{|Q|}\,\exp\Bigl\{-\frac12\tau_1 u^tQu\Bigr\}\,p(\tau_1) \propto \tau_1^{n/2}\exp\Bigl\{-\Bigl(\frac1\mu + \frac12 u^tQu\Bigr)\tau_1\Bigr\},$$
also a gamma density. The same holds for the conditional density of $\tau_2$.
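As an illustration of the single-site update just described (a sketch of ours, not the thesis code, which was written in R), one can locate the mode u* of (5.16) numerically and draw from the approximating normal density; the numerical inputs below are illustrative.

```python
# Single-site normal-approximation update for u_i (illustrative inputs).
import numpy as np
from scipy.optimize import minimize_scalar

def sample_ui(yi, vi, ubar_i, mi, tau1, beta, rng):
    a = tau1 * mi / 2.0
    b = yi + tau1 * mi * ubar_i
    c = beta * np.exp(vi)
    neglog = lambda u: -(b * u - c * np.exp(u) - a * u * u)   # minus the exponent in (5.16)
    ustar = minimize_scalar(neglog).x                         # mode u*
    prec = 2 * a + c * np.exp(ustar)                          # precision of the normal approximation
    mean = (b - c * np.exp(ustar) * (1 - ustar)) / prec
    return rng.normal(mean, 1 / np.sqrt(prec))

rng = np.random.default_rng(2)
print(sample_ui(yi=3, vi=0.1, ubar_i=0.5, mi=4, tau1=2.0, beta=3.0, rng=rng))
```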

Bivariate sampling

Using the same priors for the parameters, updating the $v_i^A$'s, $v_i^B$'s and the parameters $\tau_2^A$, $\tau_2^B$ and $\gamma_1$ can be done exactly as in the univariate sampler. The parameter c is discretized, and sampled from the set $\{-0.99, -0.98, \ldots, 0.99\}$, on which a uniform prior is assumed. Additionally, we need to sample from
$$p(u_i^A\mid u_{S_i}^A, u_{S_i}^B, y_i^A, \gamma_1, \tau_1^A, \tau_1^B, c) \propto p(y_i^A\mid u_i^A)\,p(u_i^A\mid u_{S_i}^A, u_{S_i}^B, \gamma_1, \tau_1^A, \tau_1^B, c), \qquad(5.17)$$
$$p(\tau_1^A\mid u^{AB}, \tau_1^B, c, \gamma_1) \propto p(u^{AB}\mid\gamma_1,\tau_1^A,\tau_1^B, c)\,p(\tau_1^A).$$


Similar factorizations hold for $p(u_i^B\mid u_{S_i}^A, u_{S_i}^B, y_i^B, \gamma_1, \tau_1^A, \tau_1^B, c)$ and $p(\tau_1^B\mid u^{AB}, \tau_1^A, c, \gamma_1)$. Since $p(u_i^A\mid u_{S_i}^A, u_{S_i}^B, \gamma_1, \tau_1^A, \tau_1^B, c)$ is normally distributed, (5.17) is of the same form as (5.16). The density $p(\tau_1^A\mid\tau_1^B, c, u^{AB}, \gamma_1)$ is of the form $(\tau_1^A)^{n/2}e^{-c_1\tau_1^A - c_2\sqrt{\tau_1^A}}$, with constants $c_1$, $c_2$ depending on $u^{AB}$, c, $\tau_1^B$ and $\mu$, and can be sampled using a rejection scheme.
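One possible rejection scheme (a sketch of ours, not necessarily the one used in the thesis) exploits that, when c2 ≥ 0, the density τ^{n/2} e^{−c1τ − c2√τ} is dominated by a Gamma(n/2 + 1, c1) density, so Gamma proposals can be accepted with probability exp{−c2√τ}; for c2 < 0 a different envelope would be needed. The constants below are illustrative.

```python
# Rejection sampler for a density proportional to tau^(n/2) * exp(-c1*tau - c2*sqrt(tau)), c2 >= 0.
import numpy as np

def sample_tau(n, c1, c2, rng, max_tries=10000):
    shape, rate = n / 2 + 1, c1
    for _ in range(max_tries):
        tau = rng.gamma(shape, 1.0 / rate)                    # Gamma(shape, rate) proposal
        if rng.uniform() < np.exp(-c2 * np.sqrt(tau)):        # acceptance ratio target / envelope
            return tau
    raise RuntimeError("acceptance rate too low")

rng = np.random.default_rng(3)
print(sample_tau(n=256, c1=40.0, c2=0.5, rng=rng))
```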

5.4 Application

5.4.1 Data

Figure 5.1: Numbers of weed plants of T. officinale (left), C. album (middle) and S. nigrum (right), in 2001. The dark grey tones represent high counts.

Weeds were observed within an arable field of 64 m wide and 281 m long, on a clay soil located near Wageningen, The Netherlands, from 2001 to 2003; see Heijting [24] and Heijting et al. [25][26]. The field was planted with maize. Weed plants were counted within a rectangular plot of 50.25 m long (y-direction) and 12 m wide (x-direction). This plot was partitioned into a grid of 0.75 × 0.75 m contiguous quadrats, corresponding in size to the 0.75 m spacing between the rows of maize plants. The plot contained n = 1072 quadrats with 67 and 16 quadrats in the y- and x-direction, respectively. Because of the strong trend in the data, we restrict the analysis in the present work to the plot of 16 × 16 quadrats, located along the Northern boundary of the observation area. In the three years of observation, more than 20 different weed species were found in the plot. For practical reasons we analyze the counts for three species, observed from 18 to 21 June 2001. These species are Chenopodium album L. (Fat hen), Solanum nigrum L. (Black nightshade) and Taraxacum officinale Weber (Dandelion). Figure 5.1 contains a graphical representation of the data. The orientation is such that the left


sides of the images are facing the West. Descriptive statistics of the counts can be found in table 5.1. The distribution of the weed counts is highly skewed. Having a maximum equal to 44 and a minimum equal to 0, the median of the C. album counts is only 2, whereas for the two other species the median equals 0. Large peak densities occur for C. album and S. nigrum.

Table 5.1: Quantiles, minimum, maximum, median and mean of the numbers of weed plants observed on 16 × 16 quadrats in 2001.

species         Min.   1st Qu.   Median   Mean     3rd Qu.   Max.
T. officinale   0      0         0        0.2305   0         4
C. album        0      2         5        7.887    11        44
S. nigrum       0      0         1        2.004    2         19

5.4.2 Results

The 2001 counts of T. officinale were analyzed using the univariate model. In addition, the bivariate model was applied to the C. album and S. nigrum counts. The Gibbs sampler described in section 5.3 was implemented in R, and run for 70000 iterations, after a burn-in period of 1000 iterations. Every 70th iteration was taken as a sample from the posterior distribution, hence 1000 samples were stored. The posterior density estimates are given in figure 5.2. The posterior means of the parameters can be found in table 5.2. We notice that the distributions are relatively wide, although mostly showing a sharp mode, which may be caused by the fact that the underlying process does not fully obey the conditions imposed by the statistical model. The estimates of $\tau_1$ and $\tau_2$ measure the relative magnitudes of spatial and non-spatial variation, respectively. For all species the spatial variation is considerably larger than the non-spatial variation. The anisotropy apparent in the counts of C. album and S. nigrum is reflected in the estimate of $\gamma_1$. The posterior mean of c is 0.5747, substantially larger than the correlation between the actual counts, which is 0.2088. Figure 5.3 displays images of the posterior means of U, V and $\beta\exp(U+V)$, which are denoted as $\tilde u$, $\tilde v$ and $\tilde\lambda$, respectively. The image of $\tilde v$ is merely included as a check on randomness. Indeed sudden variations appear in $\tilde v$, although for S. nigrum it still possesses some spatial structure. Generally $\tilde u$ and $\beta\exp(\tilde u + \tilde v)$ have the same spatial structure as is apparent in the data, but in a more smoothed fashion.

Figure 5.2: Posterior densities of the parameters τ1 (top line) and τ2 (middle line) for T. officinale (left), C. album (middle) and S. nigrum (right). On the bottom line: the posterior densities of γ1 for T. officinale (left), γ1 for C. album and S. nigrum (middle) and the posterior of the correlation c between C. album and S. nigrum (right).


Figure 5.3: The posterior mean of U (top line), the posterior mean of V (second line) and the estimated Poisson intensities β exp(u + v) (bottom line), for the 2001 counts of T. officinale (left), C. album (middle) and S. nigrum (right). The dark grey tones represent large values.


Table 5.2: The posterior means of γ1, τ1, τ2 and c. In the bivariate model, C. album and S. nigrum have the same γ1 and c.

species         τ1       τ2       γ1       c
T. officinale   0.0661   1.2641   1.0017   -
C. album        3.2201   9.4576   1.8130   0.5747
S. nigrum       2.1353   5.3999   1.8130   0.5747

5.5 Discussion

The models described above can be useful in the analysis of count data on a lattice. For further application, the approach taken in this chapter could be extended by incorporating covariables. For quadrats on the edges the $\gamma_{ij}$'s can be defined in various ways. For instance, if $\gamma_1\ge\gamma_2$, and i and j are located on the minimal or maximal y-edge, $\gamma_{ij} = \gamma_2$ could be defined instead of $\gamma_{ij} = \gamma_2' = \frac{3-3\delta-\gamma_1}{2} > \gamma_2$. Numerical experiments with isotropic

test data indicate, however, that this model only performs well on square lattices, but is biased for rectangular lattices. More precisely, this model favours a stronger spatial dependence in the longest direction of the field. With the current definition, this problem does not occur. Although our dataset is not rectangular, this improves the applicability of the model. The models presented in section 5.2 can be used to quantify the spatial and non-spatial variation in the data. For the application of SSWM, images of $\tilde u$ are easier to interpret than images of the original data, which exhibit much small-scale variation. This may lead to a more efficient use of resources. Our method is not restricted to grids of contiguous quadrats, and is also applicable to data obtained with discrete area sampling, or to observations at grid points. The grid does not need to be rectangular. In order to define first- and higher-order neighbors, however, distances between the observation areas, or points, need to be regular, or at least it must be possible to distribute them over distance classes. Christensen and Waagepetersen [5] take a geostatistical approach, considering weed counts as point observations. Instead of an autoregression process U, defined on a lattice, they assume a geostatistical process, defined for every point in the area. Prior distributions for the regression parameters are specified, and posterior distributions are estimated by MCMC simulation. An advantage of this approach is that predicting at unsampled locations is straightforward. Hrafnkelsson and Cressie [28] compare approaches


based on lattice and geostatistical processes and describe how MCMC simulations can be made. Another open question is the possibility of having different degrees of anisotropy for the two species. Parameters $\gamma_{ij}^A$ and $\gamma_{ij}^B$ would have to be introduced such that Σ remains positive definite.

Samenvatting: Convergence rates in nonparametric Bayesian density estimation

When Bayesian methods are applied to nonparametric models, the choice of the prior distribution turns out to be of great importance for the asymptotic behaviour of the posterior distribution. In this thesis this is worked out for a number of models, where we study the convergence rate attained by the posterior for various choices of the prior. Chapter 1 introduces a number of concepts and theorems from the literature that are used extensively in the later chapters.

Chapter 2 concerns the estimation of probability densities on the unit interval by means of beta densities. The construction of Petrone ([43],[42]) has, apart from a number of advantages, the drawback that the convergence rate is suboptimal within a class of α-Lipschitz functions; see Ghosal [17]. We discuss the cause of this. For α ∈ (0, 1] it turns out to be possible to attain the optimal rate by a different choice of the prior. The resulting posterior is adaptive in α, and the construction can be extended to higher dimensions.

Chapter 3 treats density estimators on the real line, based on finite mixtures of a kernel chosen in advance. In the literature this is usually the normal density, with a Dirichlet prior on the mixing distribution. Under certain assumptions on the density to be estimated, and certain conditions on the prior, the convergence of the posterior is known to be (asymptotically) optimal. In this chapter we address the question whether, under comparable conditions, the convergence rate is also optimal when a different kernel is used, combined with a different prior on the mixing distribution. If the kernel has exponential tails the answer is affirmative, under fairly general conditions on the prior. A number of important classes of priors turn out to satisfy these conditions. If the Cauchy distribution is taken as the kernel, the convergence rate is not optimal. More generally this holds for every kernel with characteristic function of the form exp{−|λ|^{α+1}}, for α ∈ [0, 1], the so-called symmetric stable distributions. This is caused by the fact that the Hellinger metric between a density and its convolution decreases too slowly as the bandwidth tends to zero.

Chapter 4 is concerned with convergence rates in misspecified models. A model is misspecified if the distribution of the data is not contained in the model. The results of Kleijn and van der Vaart [32] are based on a generalization of the asymptotics based on test functions (see for example Ghosal, Ghosh and van der Vaart [18]). In this chapter comparable theory is developed using the information-theoretic approach of Zhang ([59],[60]). After a number of examples with well-specified models, a general inequality is derived for misspecified models. This result is then applied to nonparametric regression models with Gaussian or Laplace distributed errors. For fixed-design models, convergence rates are found under conditions comparable to those of the results for random-design models obtained in [32].

In Chapter 5 we study a special case of regression, in which a certain spatial structure is assumed in the prior. Here we consider anisotropic Gaussian Markov Random Fields (GMRFs). While for infinite lattices the anisotropy is easy to introduce, for finite lattices one runs into problems at the boundaries of the lattice. Using Geršgorin's theorem on the eigenvalues of a positive definite matrix, one can see that for anisotropic GMRFs the covariance matrix is usually singular. To nevertheless obtain a non-singular covariance matrix, a modification of the local Markov structure at the boundaries is proposed. This modification is also possible for multivariate GMRFs. Finally, the implementation of an MCMC algorithm is described, and the anisotropic GMRFs are used for the analysis of the weed data of Heijting ([24],[25]).

Dankwoord (Acknowledgements)

First of all I thank my supervisor Aad van der Vaart, from whom I have been able to learn a great deal over the past years. When the research got stuck, he often came up with valuable ideas, and meanwhile there was every freedom to deviate from the original research proposal, which is reflected in the last two chapters. I further wish to thank Alfred Stein, Willem Schaafsma and Sanne Heijting for the collaboration that led to the final chapter. I thank Ismael Castillo for the discussions that contributed to the introductory chapter. I am grateful to the reading committee, consisting of Eduard Belitser, Bas Kleijn, Sonia Petrone, Judith Rousseau, Alfred Stein and Harry van Zanten, for their time and various corrections. I also want to thank Judith for making possible two remarkable visits to Paris in 2006 and 2007. I have fond memories of our stochastics group: the lunches, the monthly 'cultural events', and my office mates Rik, Geert and Ismael. Thank you all! I thank my friends Guido and Ingo for their role as paranymphs. Finally, I thank all friends and family members not mentioned here for their support and interest over the past years, in particular my parents, who always encouraged me to study and do research.


Bibliography

[1] Milton Abramowitz and Irene A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55 of National Bureau of Standards Applied Mathematics Series. For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C., 1964.
[2] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser. B, 36:192–236, 1974. With discussion by D. R. Cox, A. G. Hawkes, P. Clifford, P. Whittle, K. Ord, R. Mead, J. M. Hammersley, and M. S. Bartlett and with a reply by the author.
[3] Julian Besag, Jeremy York, and Annie Mollié. Bayesian image restoration, with two applications in spatial statistics. Ann. Inst. Statist. Math., 43(1):1–59, 1991. With discussion and a reply by Besag.
[4] Lucien Birgé. Sur un théorème de minimax et son application aux tests. Probab. Math. Statist., 3(2):259–282, 1984.
[5] Ole F. Christensen and Rasmus Waagepetersen. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics, 58(2):280–286, 2002.
[6] Andrew D. Cliff and J. Keith Ord. Spatial processes: models & applications. Pion Ltd., London, 1981.
[7] Carl de Boor. A practical guide to splines, volume 27 of Applied Mathematical Sciences. Springer-Verlag, New York, 1978.
[8] Ronald A. DeVore and George G. Lorentz. Constructive approximation, volume 303 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1993.
[9] Luc Devroye and László Györfi. Nonparametric density estimation. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons Inc., New York, 1985. The L1 view.


[10] Jean Diebolt and Christian P. Robert. Estimation of finite mixture distributions through Bayesian sampling. J. Roy. Statist. Soc. Ser. B, 56(2):363–375, 1994.
[11] J. L. Doob. Application of the theory of martingales. In Le Calcul des Probabilités et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, pages 23–27. Centre National de la Recherche Scientifique, Paris, 1949.
[12] H. Edelsbrunner and D. R. Grayson. Edgewise subdivision of a simplex. Discrete Comput. Geom., 24(4):707–719, 2000. ACM Symposium on Computational Geometry (Miami, FL, 1999).
[13] Michael D. Escobar and Mike West. Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc., 90(430):577–588, 1995.
[14] A. E. Gelfand and P. Vounatsou. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics, (4):11–15, 2003.
[15] Stuart Geman and Chii-Ruey Hwang. Nonparametric maximum likelihood estimation by the method of sieves. Ann. Statist., 10(2):401–414, 1982.
[16] Christopher R. Genovese and Larry Wasserman. Rates of convergence for the Gaussian mixture sieve. Ann. Statist., 28(4):1105–1127, 2000.
[17] Subhashis Ghosal. Convergence rates for density estimation with Bernstein polynomials. Ann. Statist., 29(5):1264–1280, 2001.
[18] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.
[19] Subhashis Ghosal, Jüri Lember, and Aad van der Vaart. On Bayesian adaptation. In Proceedings of the Eighth Vilnius Conference on Probability Theory and Mathematical Statistics, Part II (2002), volume 79, pages 165–175, 2003.
[20] Subhashis Ghosal and Aad van der Vaart. Convergence rates of posterior distributions for noniid observations. Ann. Statist., 35(1):192–223, 2007.
[21] Subhashis Ghosal and Aad van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist., 35(2):697–723, 2007.
[22] Subhashis Ghosal and Aad W. van der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Statist., 29(5):1233–1263, 2001.


[23] Ulf Grenander. Abstract inference. John Wiley & Sons Inc., New York, 1981. Wiley Series in Probability and Mathematical Statistics.
[24] Sanne Heijting. Spatial analysis of weed patterns. PhD thesis, Wageningen University.
[25] Sanne Heijting, Wopke van der Werf, Alfred Stein, and Martin J. Kropff. Are weed patches stable in location? Application of an explicitly two-dimensional methodology. 47(5):381–395, 2007.
[26] Sanne Heijting, Wopke van der Werf, Alfred Stein, and Martin J. Kropff. Testing the spatial significance of weed patterns in arable land using Mead's test. 47(5):396–405, 2007.
[27] D. Hess, M. N. M. van Lieshout, B. Payne, and A. Stein. A review of spatio-temporal modelling of quadrat count data with application to Striga occurrence in a pearl millet field. 3(2):133–138, 2001.
[28] Birgir Hrafnkelsson and Noel Cressie. Hierarchical modeling of count data with application to nuclear fall-out. Environ. Ecol. Stat., 10(2):179–200, 2003.
[29] I. A. Ibragimov and Yu. V. Linnik. Independent and stationary sequences of random variables. Wolters-Noordhoff Publishing, Groningen, 1971. With a supplementary chapter by I. A. Ibragimov and V. V. Petrov, translation from the Russian edited by J. F. C. Kingman.
[30] Xiaoping Jin, Bradley P. Carlin, and Sudipto Banerjee. Generalized hierarchical multivariate CAR models for areal data. Biometrics, 61(4):950–961, 2005.
[31] B. J. K. Kleijn. Misspecification in infinite-dimensional Bayesian statistics. PhD thesis, Department of Mathematics, Vrije Universiteit Amsterdam.
[32] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist., 34(2):837–877, 2006.
[33] Charles H. Kraft. A class of distribution function processes which have derivatives. J. Appl. Probability, 1:385–388, 1964.
[34] M. J. Kropff, J. Wallinga, and L. A. P. Lotz. Precision agriculture: spatial and temporal variability of environmental quality. In Ciba Foundation Symposium 210, pages 182–204, Chichester, 1997. Wiley.
[35] Willem Kruijer, Alfred Stein, Willem Schaafsma, and Sanne Heijting. Analyzing spatial count data, with an application to weed counts. Environ. Ecol. Stat., 14(4):399–410, 2007.


[36] Willem Kruijer and Aad van der Vaart. Posterior convergence rates for Dirichlet mixtures of beta densities. 138(7):1981–1992, 2008.
[37] Peter Lancaster and Miron Tismenetsky. The theory of matrices. Computer Science and Applied Mathematics. Academic Press Inc., Orlando, FL, second edition, 1985.
[38] Michael Lavine. Some aspects of Pólya tree distributions for statistical modelling. Ann. Statist., 20(3):1222–1235, 1992.
[39] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Series in Statistics. Springer-Verlag, New York, 1986.
[40] K. V. Mardia. Multidimensional multivariate Gaussian Markov random fields with application to image processing. J. Multivariate Anal., 24(2):265–284, 1988.
[41] J. M. Marin, K. Mengersen, and C. P. Robert. Bayesian modelling and inference on mixtures of distributions.
[42] Sonia Petrone. Bayesian density estimation using Bernstein polynomials. Canad. J. Statist., 27(1):105–126, 1999.
[43] Sonia Petrone. Random Bernstein polynomials. Scand. J. Statist., 26(3):373–393, 1999.
[44] Sonia Petrone and Larry Wasserman. Consistency of Bernstein polynomial posteriors. J. R. Stat. Soc. Ser. B Stat. Methodol., 64(1):79–100, 2002.
[45] A. N. Pettitt, I. S. Weir, and A. G. Hart. A conditional autoregressive Gaussian process for irregularly spaced multivariate data with application to modelling large sets of binary data. Stat. Comput., 12(4):353–367, 2002.
[46] L. J. Rew and R. D. Cousens. Spatial distribution of weeds in arable crops: are current sampling and analytical methods appropriate? Weed Research, 41:1–18, 2001.
[47] Sylvia Richardson and Peter J. Green. On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B, 59(4):731–792, 1997.
[48] Yu. A. Rozanov. Markov random fields. Applications of Mathematics. Springer-Verlag, New York, 1982. Translated from the Russian by Constance M. Elson.
[49] S. R. Sain and N. A. Cressie. A spatial model for multivariate lattice data, preprint. Journal of Econometrics, 2007.


[50] Lorraine Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4:10–26, 1965.
[51] C. Scricciolo. Convergence rates of posterior distributions for Dirichlet mixtures of normal densities. Working paper 2001-21. Technical report, 2001.
[52] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. Ann. Statist., 29(3):687–714, 2001.
[53] Alfred Stein. Spatial statistics for production ecology and resource conservation. Environ. Ecol. Stat., 8(4):293–295, 2001.
[54] Alexandre B. Tsybakov. Introduction à l'estimation non-paramétrique, volume 41 of Mathématiques & Applications (Berlin) [Mathematics & Applications]. Springer-Verlag, Berlin, 2004.
[55] A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
[56] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.
[57] Richard A. Vitale. Bernstein polynomial approach to density function estimation. In Statistical inference and related topics (Proc. Summer Res. Inst. Statist. Inference for Stochastic Processes, Indiana Univ., Bloomington, Ind., 1974, Vol. 2; dedicated to Z. W. Birnbaum), pages 87–99. Academic Press, New York, 1975.
[58] Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27(5):1564–1599, 1999.
[59] Tong Zhang. From ε-entropy to KL-entropy: analysis of minimum information complexity density estimation. Ann. Statist., 34(5):2180–2210, 2006.
[60] Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory, 52(4):1307–1321, 2006.
[61] V. M. Zolotarev. One-dimensional stable distributions, volume 65 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1986. Translated from the Russian by H. H. McFaden, translation edited by Ben Silver.