NONPARAMETRIC BERNSTEIN-VON MISES PHENOMENON: A TUNING PRIOR PERSPECTIVE

arXiv:1411.3686v1 [math.ST] 13 Nov 2014

By Zuofeng Shang* and Guang Cheng†
Department of Statistics, Purdue University
November 15, 2014

Abstract. Statistical inference on infinite-dimensional parameters in a Bayesian framework is investigated. The main contribution of this paper is to demonstrate that a nonparametric Bernstein-von Mises theorem can be established in a very general class of nonparametric regression models under a novel tuning prior (indexed by a non-random hyperparameter). Surprisingly, this type of prior connects two important classes of statistical methods, nonparametric Bayes and smoothing splines, at a fundamental level. The intrinsic connection with smoothing splines greatly facilitates both theoretical analysis and practical implementation of nonparametric Bayesian inference. For example, we can apply smoothing spline theory in our Bayesian setting, and we can employ generalized cross validation to select a proper tuning prior under which credible regions/intervals are frequentist valid. A collection of probabilistic tools, such as the Cameron-Martin theorem [7] and a Gaussian correlation inequality [30], are employed in this paper.

1. Introduction. A common practice to quantify Bayesian uncertainty is to construct credible sets (or regions) that cover a large fraction of the posterior mass. The frequentist validity of Bayesian credible sets is often called the Bernstein-von Mises (BvM) phenomenon. In nonparametric settings, the BvM theorem was initially found not to hold by [13, 18]. An essential reason for this failure is the lack of flexibility in the assigned priors. In this paper, we introduce a novel tuning prior under which a BvM theorem can be established in a very general class of nonparametric regression models. More interestingly, this type of prior connects two important classes of statistical methods, nonparametric Bayes and smoothing splines, at a fundamental level.

* Visiting Assistant Professor.
† Corresponding Author. Associate Professor. Research sponsored by NSF CAREER Award DMS-1151692, DMS-1418042, and Simons Foundation 305266. Guang Cheng was on sabbatical at Princeton while this work was finalized; he would like to thank the Princeton ORFE department for its hospitality and support.
AMS 2000 subject classifications: Primary 62C10; secondary 62G15, 62G08.
Keywords and phrases: Bayesian inference, Gaussian sequence model, nonparametric Bernstein-von Mises theorem, smoothing spline.


This coupling phenomenon is practically useful in the sense that a proper tuning prior can be selected through generalized cross validation. We begin with Gaussian sequence models, which serve as a platform for investigating more complicated statistical problems; see [3, 35]. Recently, a series of (adaptive) credible sets have been proposed for their mean sequences in the literature; see [29, 26, 39, 40, 10]. In this paper, we consider a novel tuning prior with a (non-random) hyperparameter that can be used to tune the assigned prior. Under tuning priors, we demonstrate that when the tuning parameter is properly selected, the frequentist coverage of an asymptotic credible set approaches one irrespective of the credibility level; we call this the weak BvM phenomenon. On the other hand, if the tuning parameter is poorly chosen, there always exist uncountably many true parameters that are asymptotically excluded by the constructed credible set. Hence, we observe a phase transition phenomenon, which is caused by the fact that the posterior standard deviation is always strictly greater or strictly smaller than the deviation between the truth and the posterior mode; see Figure 1 for a graphical illustration. The over-conservativeness of credible sets can be easily resolved by invoking a weaker topology (inspired by [10]). This leads to the so-called strong BvM phenomenon. In the end, we construct adaptive credible sets by substituting a consistent estimate (with a mild convergence rate) for the unknown model regularity. With such a simple replacement, the adaptive sets still preserve the same appealing features as those built upon the known regularity. In particular, their radii do not involve the inconvenient "blow-up" constant of [40]; see [33] for more discussion.
The major contribution of this paper is to establish a BvM theorem in a very general class of nonparametric regression models by borrowing the theoretical insights obtained from the simple Gaussian sequence models. For example, the assigned Gaussian process (GP) prior also involves a tuning hyperparameter (an idea rooted in [48]). As far as we know, our BvM theorem is the first nonparametric one that holds under the total variation distance. Such a result is crucial in constructing credible regions for the regression function under both strong and weak topologies, and credible intervals for its linear functionals, e.g., its local value. In particular, their constructions can be concretized by ODE techniques, and their frequentist behaviors are carefully investigated. In contrast, the BvM theorem developed by [11] is in terms of a weaker metric, the bounded Lipschitz metric, and crucially relies on the efficiency of a nonparametric estimate ([5]). Other related BvM theorems include [14, 15, 9, 4, 36, 12] for semiparametric models or functionals of the density. Our second contribution is to obtain posterior contraction rates under a type of Sobolev norm as an intermediate step in developing our BvM theorem. This Sobolev norm is stronger than the norms commonly used in deriving contraction rates, such as the Kullback-Leibler norm and the supremum norm ([20, 21, 43, 44, 45, 46]). The uniformly consistent test arguments of [21], together with the Cameron-Martin theorem [7] and a Gaussian correlation inequality [30], are applied to derive this new result. Interestingly, a straightforward application of Hájek's Lemma ([22]) shows that the posterior distribution under the assigned GP prior is closely related to the penalized likelihood in the smoothing spline literature. The intrinsic connection with smoothing splines greatly facilitates both theoretical analysis and practical implementation of nonparametric Bayesian inference. For example, we can apply smoothing spline theory in our Bayesian setting, and we can employ the generalized cross validation (GCV) method to select a proper tuning hyperparameter in the assigned GP prior, under which credible regions/intervals are frequentist valid. This methodology is crucially different from the empirical Bayes approach, and is strongly supported by our simulation results.

The rest of this article is organized as follows. Section 2 focuses on Gaussian sequence models under two types of tuning priors, and conducts a frequentist evaluation of the resulting (adaptive) credible sets. In Section 3, we present a general class of nonparametric regression models, including Gaussian regression and logistic regression, together with the assigned GP priors. The duality between nonparametric Bayes models and smoothing spline models is also discussed. Section 4 includes results on posterior contraction rates under a type of Sobolev norm. Section 5.1 includes the main result of the article: the nonparametric BvM theorem. Sections 5.2 – 5.4 discuss two important applications of the BvM theorem: construction of a credible region for the regression function and of credible intervals for a general class of linear functionals.
Their frequentist validity is also discussed. Section 6 contains a simulation study. All technical details are deferred to the online supplementary material.

2. Gaussian Sequence Model. In this section, we consider Gaussian sequence models, and generalize the Gaussian priors in [18] by introducing tuning parameters. In this new Bayesian framework, we demonstrate that when the tuning parameter is properly chosen, the credible region possesses more satisfactory frequentist coverage properties, which can be further improved based on a weaker topology; see also [10]. A phase transition phenomenon of credible regions is also observed. A set of adaptive credible regions is constructed in the end. The theoretical insights obtained from this simple model will be naturally carried over to the very general class of nonparametric regression models considered in Section 3.


Consider the following Gaussian sequence model:
\[
(2.1)\qquad Y_i = \theta_i + \frac{1}{\sqrt{n}}\,\epsilon_i, \quad i = 1, 2, \ldots,
\]
where $\theta = \{\theta_i\}_{i=1}^\infty \in \ell_2(q_0) \equiv \{\{\theta_i\}_{i=1}^\infty : \sum_{i=1}^\infty \theta_i^2 i^{2q_0} < \infty\}$ for some constant $q_0 \ge 0$. In particular, denote by $\ell_2 = \ell_2(0)$ the space of real square-summable sequences. Here, the $\epsilon_i$'s are iid standard Gaussian random variables. Suppose $\theta^0 = \{\theta_i^0\}_{i=1}^\infty \in \ell_2(q_0)$ is the "true" parameter generating the $Y_i$'s.

2.1. Tuning Prior I. We assign the following Gaussian priors on the mean sequence:

P1. $\theta_i \sim N(0, \tau_i^{-2})$, where $\tau_i^2 = n\lambda i^{2p}$, $\lambda = n^{-\alpha_0}$, $\alpha_0 > 0$, and $p$ satisfies $p > q_0 + 1/2 \ge 1/2$;

we call these tuning priors indexed by $\lambda$. This leads to a more flexible Bayesian framework. In particular, when $n\lambda = 1$ (equivalently, $\alpha_0 = 1$) and $q_0 = 0$, the above setup reduces to the one considered in [18]. It is easy to see that the posterior for $\theta_i$ is also Gaussian:
\[
\theta_i \mid Y_i \sim N(\widehat\theta_{ni}, v_{ni}^2), \quad i = 1, 2, \ldots,
\]
where $\widehat\theta_{ni} = Y_i/(1+\lambda i^{2p})$ and $v_{ni}^2 = \{n(1+\lambda i^{2p})\}^{-1}$. Thus, given the data, we can rewrite $\theta_i$ as $\theta_i = \widehat\theta_{ni} + v_{ni}\eta_i$, where $\eta_1, \eta_2, \ldots \overset{iid}{\sim} N(0,1)$. Denote by $\widehat\theta$ the posterior mode $\{\widehat\theta_{ni}\}_{i=1}^\infty$, and let $h = \lambda^{1/(2p)}$. Similar to Theorem 1 of [18], it can be shown that $n\sqrt{h}\,\|\theta - \widehat\theta\|_2^2 = h^{-1/2}E_n + D_n^{1/2}Z_n$, where
\[
(2.2)\qquad D_n \equiv \sum_{i=1}^\infty \frac{2h}{(1+\lambda i^{2p})^2}, \quad E_n \equiv \sum_{i=1}^\infty \frac{h}{1+\lambda i^{2p}}, \quad\text{and}\quad Z_n \equiv D_n^{-1/2}\sum_{i=1}^\infty \frac{\sqrt{h}\,(\eta_i^2-1)}{1+\lambda i^{2p}}.
\]
Since $Z_n \overset{d}{\to} N(0,1)$, we have the following $(1-\alpha)$ asymptotic credible region:
\[
(2.3)\qquad C_n(\alpha) = \Big\{\theta = \{\theta_i\}_{i=1}^\infty \in \ell_2(q_0) : \|\theta - \widehat\theta\|_2 \le \sqrt{\frac{E_n + \sqrt{D_n h}\, z_\alpha}{nh}}\Big\},
\]
where $z_\alpha = \Phi^{-1}(1-\alpha)$ and $\Phi$ is the c.d.f. of $N(0,1)$. It is easy to see that $D_n \approx 2\int_0^\infty (1+x^{2p})^{-2}\,dx$ and $E_n \approx \int_0^\infty (1+x^{2p})^{-1}\,dx$, where $a_n \approx b_n$ means $\lim_{n\to\infty} a_n/b_n = 1$. Hence, the radius of $C_n(\alpha)$ contracts at the rate $n^{-\frac{2p-\alpha_0}{4p}}$.
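To make the construction concrete, the following sketch (our own illustration, not code from the paper) evaluates truncated versions of $E_n$, $D_n$, and the radius of $C_n(\alpha)$. For $p = 1$, both limiting integrals in the display above equal $\pi/2$, which the truncated sums should approximately reproduce.

```python
import math

def credible_ball_radius(n, p, alpha0, z_alpha=1.6448536269514722, imax=200_000):
    """Radius of the 95% credible ball C_n(alpha) under Prior P1.

    z_alpha = Phi^{-1}(0.95); the infinite series are truncated at imax.
    """
    lam = n ** (-alpha0)              # lambda = n^{-alpha_0}
    h = lam ** (1.0 / (2 * p))        # h = lambda^{1/(2p)}
    Dn = sum(2 * h / (1 + lam * i ** (2 * p)) ** 2 for i in range(1, imax))
    En = sum(h / (1 + lam * i ** (2 * p)) for i in range(1, imax))
    radius = math.sqrt((En + math.sqrt(Dn * h) * z_alpha) / (n * h))
    return En, Dn, radius

# p = 1: E_n -> integral of 1/(1+x^2) = pi/2, D_n -> 2 * pi/4 = pi/2
En, Dn, r = credible_ball_radius(10**6, p=1, alpha0=2/3)
```

The radius here is of order $(nh)^{-1/2} = n^{-(2p-\alpha_0)/(4p)}$, matching the contraction rate stated above.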


The frequentist coverage probabilities of $C_n(\alpha)$ are summarized in Theorem 2.1 below, where a phase transition phenomenon is observed; see also Figure 1 for a graphical illustration.

Theorem 2.1. Under model (2.1) and Prior P1, we have:

(1). If $\alpha_0 \ge 2p/(2q_0+1)$, then for any $\theta^0 \in \ell_2(q_0)$, with probability approaching one, $\theta^0 \in C_n(\alpha)$.
(2). If $0 < \alpha_0 < 2p/(2q_0+1)$, then there exist uncountably many $\theta^0 \in \ell_2(q_0)$ such that for each such $\theta^0$, with probability approaching one, $\theta^0 \notin C_n(\alpha)$.

[Figure 1 about here: illustration of the phase transition. When $\alpha_0 \ge 2p/(2q_0+1)$, the credible ball centered at $\widehat\theta$ is wide enough to cover $\theta^0$; when $0 < \alpha_0 < 2p/(2q_0+1)$, it fails to cover certain $\theta^0$.]

In particular, when $\alpha_0 = 2p/(2q_0+1)$, the radius of $C_n(\alpha)$ contracts at the optimal rate $n^{-q_0/(2q_0+1)}$ for any $p > q_0 + 1/2$. Note that, without using the tuning parameter approach, one cannot get this optimal rate of convergence for $p > q_0 + 1/2$; see [49]. Another interesting observation is the phase transition of the coverage property at $\alpha_0 = 2p/(2q_0+1)$. We note that in the more general models considered in Section 3, calculations are so complex that the phase transition is hard to detect, although we conjecture its existence. A comprehensive study of this is beyond the scope of this paper.

Remark 2.1. Theorem 2.1 revisits Theorem 4.2 in [25] from a different perspective. Specifically, our assumption $p > q_0 + 1/2$ implies that $2p > q_0$. Under $2p > q_0$, Theorem 4.2 of [25] shows that the probability that the credible region covers the truth tends to either zero (along a sequence of truths) or one. Actually, [25] further show that if $2p \le q_0$ and $\alpha_0 = 2p/(2q_0+1)$, then there exists a sequence of truths along which the coverage probability tends to any level in $[0,1)$. Here we restrict to the case $p > q_0 + 1/2$ to ease the technical arguments; indeed, our results later in this section benefit from this restriction. Although our Theorem 2.1 can be viewed as a special case of [25], our new tuning prior formulation greatly benefits the theoretical study of nonparametric regression models; see Section 3.

Even under the flexible tuning priors, Theorem 2.1 still cannot exactly match the credibility level with the frequentist coverage. The main reason is that the $L_2$-topology, upon which $C_n(\alpha)$ is built, is too strong to support the frequentist-matching property. This motivates us to design a weaker norm as follows; see also [10]. For any $\theta = \{\theta_i\} \in \ell_2$, define $\|\theta\|_\dagger^2 = \sum_{i=1}^\infty d_i \theta_i^2$ for a given positive sequence $d_i$ satisfying $d_i \asymp 1/\{i(\log i)^\tau\}$ for some $\tau > 1$. Clearly, $\|\cdot\|_\dagger$ defines a norm weaker than the usual $\ell_2$-norm.
Define
\[
(2.4)\qquad C_n^\dagger(\alpha) = \Big\{\theta = \{\theta_i\}_{i=1}^\infty \in \ell_2(q_0) : \|\theta - \widehat\theta\|_\dagger \le \sqrt{c_\alpha/n}\Big\},
\]
where $c_\alpha$ is the upper $(1-\alpha)$-quantile of the distribution of the random variable $\sum_{i=1}^\infty d_i\eta_i^2$, that is, $P(\sum_{i=1}^\infty d_i\eta_i^2 \le c_\alpha) = 1-\alpha$. The value of $c_\alpha$ can be easily simulated, and it is not sensitive to the choice of $\tau$ (we select $\tau = 2$ in our simulations). It is easy to see that
\[
P\big(\theta \in C_n^\dagger(\alpha)\mid \{Y_i\}_{i=1}^\infty\big) = P\Big(n\sum_{i=1}^\infty d_i v_{ni}^2\eta_i^2 \le c_\alpha\Big) = P\Big(\sum_{i=1}^\infty \frac{d_i\eta_i^2}{1+\lambda i^{2p}} \le c_\alpha\Big) \to P\Big(\sum_{i=1}^\infty d_i\eta_i^2 \le c_\alpha\Big) = 1-\alpha,
\]
where the final limit follows from the fact that $\sum_{i=1}^\infty d_i\eta_i^2/(1+\lambda i^{2p})$ converges in distribution to $\sum_{i=1}^\infty d_i\eta_i^2$ as $n \to \infty$ (by a moment generating function argument). That is, $C_n^\dagger(\alpha)$ asymptotically possesses $(1-\alpha)$ posterior coverage. Frequentist coverage properties of $C_n^\dagger(\alpha)$ are summarized below.
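The quantile $c_\alpha$ can be approximated by plain Monte Carlo. The sketch below is our own illustration under the assumption $d_i = 1/\{(i+1)(\log(i+1))^\tau\}$ with $\tau = 2$, one concrete sequence with $d_i \asymp 1/\{i(\log i)^\tau\}$ (shifted to avoid $\log 1 = 0$); the series is truncated, which is harmless since $d_i$ is summable.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n_terms, n_mc = 2.0, 1000, 20_000
i = np.arange(1, n_terms + 1)
d = 1.0 / ((i + 1) * np.log(i + 1) ** tau)   # d_i ~ 1/(i (log i)^tau)

# draw n_mc realizations of sum_i d_i * eta_i^2, accumulating over
# blocks of indices to keep the memory footprint small
s = np.zeros(n_mc)
for start in range(0, n_terms, 100):
    block = d[start:start + 100]
    s += (block * rng.standard_normal((n_mc, block.size)) ** 2).sum(axis=1)

c_alpha = float(np.quantile(s, 0.95))        # upper 5%-quantile, i.e. alpha = 0.05
```

Since the limiting variable is a positively weighted sum of independent $\chi^2_1$'s, its upper quantiles exceed its mean $\sum_i d_i$, which gives a cheap sanity check on the simulation.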

Theorem 2.2. Under model (2.1) and Prior P1, we have:

(1). If $\alpha_0 \ge 2p/(2q_0+1)$, then for any $\theta^0 \in \ell_2(q_0)$, with probability approaching $1-\alpha$, $\theta^0 \in C_n^\dagger(\alpha)$.
(2). If $0 < \alpha_0 < 2p/(2q_0+1)$, then there exist uncountably many $\theta^0 \in \ell_2(q_0)$ such that for each such $\theta^0$, with probability approaching one, $\theta^0 \notin C_n^\dagger(\alpha)$.

Given the same choice of $\alpha_0$ as in Theorem 2.1, Part (1) of Theorem 2.2 guarantees the frequentist-matching property of $C_n^\dagger(\alpha)$ under a weaker norm. We also remark that $C_n^\dagger(\alpha)$ relies only on $(\widehat\theta, c_\alpha)$, both of which are easy to simulate; see Section 6. In contrast, the credible region constructed by [10] may require calculation of certain nontrivial quantities; see (14) therein.

In the end, we consider a set of adaptive credible sets that do not rely on the unknown regularity $q_0$; see Section S.3 for more details. The main idea is to replace $q_0$ by a proper estimator $\widehat q$ satisfying $\widehat q - q_0 = o_P(1/\log n)$; see Proposition S.1 and Remark S.1 for the construction of $\widehat q$. The $(1-\alpha)$ adaptive credible sets given in (S.2) and (S.3) are constructed by replacing $(p, \lambda, h)$ in (2.3) and (2.4) by $p = \widehat q + 1$, $\widehat\lambda = n^{-2p/(2\widehat q+1)}$, and $\widehat h = n^{-1/(2\widehat q+1)}$; see Theorem S.1 for the theoretical justification. Note that the radii of (S.2) and (S.3) do not involve any unknown constants. Hence, we claim that (S.2) and (S.3) are more practically feasible to construct than the region proposed in [40] with a "blow-up" constant; see [33] for more discussion.

2.2. Tuning Prior II. In this section, we assign a more subtle Gaussian prior P2 that includes an additional parameter $q$, and derive results parallel to those of Section 2.1. A similar type of prior will be placed on the Fourier coefficients of the nonparametric regression functions considered in Section 3. The new tuning prior (motivated by [48]) is of the following form:

P2. $\theta_i \sim N(0, \tau_{*i}^{-2})$, where $\tau_{*i}^2 = i^{2p} + n\lambda i^{2q}$, $\lambda = n^{-\alpha_0}$, $\alpha_0 > 0$, and $p, q$ satisfy $1/2 < q \le q_0 < 2q$ and $q + 1/2 < p < 2q + 1/2$.
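The limiting constants of $D_{*n}$ and $E_{*n}$ quoted after (2.8) below, for the example $p = 3$, $q = 2$, $\alpha_0 = 2/3$, arise as Riemann-sum limits: substituting $i = n^{1/6}x$ turns $\lambda i^{2q}$ into $x^4$ and $n^{-1}i^{2p}$ into $x^6$, so $E_{*n} \to \int_0^\infty (1+x^4+x^6)^{-1}\,dx$ and $D_{*n} \to 2\int_0^\infty (1+x^4+x^6)^{-2}\,dx$. The sketch below (our own numerical check, not code from the paper) evaluates these integrals and recovers the stated values of approximately $0.919$ and $1.467$.

```python
import numpy as np

# midpoint rule on [0, 30]; the integrands decay like x^{-6} and x^{-12},
# so the truncated tails are negligible
dx = 1e-3
x = np.arange(0.0, 30.0, dx) + dx / 2
g = 1.0 / (1.0 + x**4 + x**6)

E_lim = float(np.sum(g) * dx)            # limit of E_{*n}
D_lim = float(np.sum(2.0 * g**2) * dx)   # limit of D_{*n}
```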

By direct calculations, the posterior distribution of $\theta_i$ is
\[
\theta_i \mid Y_i \sim N(\widehat\theta_{*ni}, v_{*ni}^2), \quad i = 1, 2, \ldots,
\]
where $\widehat\theta_{*ni} = Y_i/(1+\lambda i^{2q}+n^{-1}i^{2p})$ and $v_{*ni}^2 = \{n(1+\lambda i^{2q}+n^{-1}i^{2p})\}^{-1}$. Denote by $\widehat\theta_* = \{\widehat\theta_{*ni}\}_{i=1}^\infty$ the posterior mode of $\theta$, and let $\alpha_1 = \min\{1/(2p), \alpha_0/(2q)\}$. By an analysis similar to that of Section 2.1, we have $n^{1-\alpha_1}\|\theta - \widehat\theta_*\|_2^2 = E_{*n} + n^{-\alpha_1/2}D_{*n}^{1/2}Z_{*n}$, where
\[
(2.5)\qquad D_{*n} = \sum_{i=1}^\infty \frac{2n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2},
\]
\[
(2.6)\qquad E_{*n} = \sum_{i=1}^\infty \frac{n^{-\alpha_1}}{1+\lambda i^{2q}+n^{-1}i^{2p}},
\]
\[
(2.7)\qquad Z_{*n} = D_{*n}^{-1/2}\sum_{i=1}^\infty \frac{n^{-\alpha_1/2}(\eta_i^2-1)}{1+\lambda i^{2q}+n^{-1}i^{2p}},
\]
and the $\eta_i \overset{iid}{\sim} N(0,1)$ are independent of the $Y_i$'s. Since $Z_{*n} \overset{d}{\to} N(0,1)$ by Lemma S.2, we construct an asymptotic $(1-\alpha)$ credible region
\[
(2.8)\qquad C_{*n}(\alpha) = \Big\{\theta = \{\theta_i\}_{i=1}^\infty \in \ell_2(q_0) : \|\theta - \widehat\theta_*\|_2 \le \sqrt{\frac{E_{*n} + \sqrt{D_{*n} n^{-\alpha_1}}\, z_\alpha}{n^{1-\alpha_1}}}\Big\},
\]
where $E_{*n}$ and $D_{*n}$ converge to the constants specified in Lemma S.1. For example, when $p = 3$, $q = 2$, and $\alpha_0 = 2/3$, we have $D_{*n} \approx 1.467$ and $E_{*n} \approx 0.919$. The radius of $C_{*n}(\alpha)$ decreases at the rate $n^{-\frac{1-\alpha_1}{2}}$. Our next result demonstrates that the frequentist behavior of the credible set $C_{*n}(\alpha)$ at any level $\alpha \in (0,1)$ depends on the relation between the regularity of the true parameter, i.e., $q_0$, and the prior specifications, i.e., $(p, q, \alpha_0)$. Again, to obtain a frequentist-matching credible region, we need to employ a norm weaker than the $\ell_2$-norm:
\[
(2.9)\qquad C_{*n}^\dagger(\alpha) = \Big\{\theta = \{\theta_i\}_{i=1}^\infty \in \ell_2(q) : \|\theta - \widehat\theta_*\|_\dagger \le \sqrt{c_\alpha/n}\Big\},
\]
and assume an additional technical condition $4q > 2q_0 + 1$.

Theorem 2.3. Suppose model (2.1) holds with Prior P2.

(1). For $C_{*n}(\alpha)$, the following results hold:
(a) If $p - 1/2 \le q_0 < 2q$ and $\alpha_0 \ge 2q/(2q_0+1)$, then for any $\theta^0 \in \ell_2(q_0)$, with probability approaching one, $\theta^0 \in C_{*n}(\alpha)$.
(b) If $q \le q_0 < p - 1/2$ or $0 < \alpha_0 < 2q/(2q_0+1)$, then there are uncountably many $\theta^0 \in \ell_2(q_0)$ such that for each such $\theta^0$, with probability approaching one, $\theta^0 \notin C_{*n}(\alpha)$.
(2). Furthermore, suppose $4q > 2q_0 + 1$. For $C_{*n}^\dagger(\alpha)$, the following results hold:
(a) If $p - 1/2 \le q_0 < 2q$ and $\alpha_0 \ge 2q/(2q_0+1)$, then for any $\theta^0 \in \ell_2(q_0)$, with probability approaching $1-\alpha$, $\theta^0 \in C_{*n}^\dagger(\alpha)$.
(b) If $q \le q_0 < p - 1/2$ or $0 < \alpha_0 < 2q/(2q_0+1)$, then there are uncountably many $\theta^0 \in \ell_2(q_0)$ such that for each such $\theta^0$, with probability approaching one, $\theta^0 \notin C_{*n}^\dagger(\alpha)$.

Besides frequentist validity, another important consequence of Theorem 2.3 is that $C_{*n}(\alpha)$ achieves the optimal contraction rate in the sense of [25] when $\lambda$ is carefully chosen. Specifically, $C_{*n}(\alpha)$ possesses the contraction rate $n^{-q_0/(2q_0+1)}$ when $p - 1/2 \le q_0 < 2q$ and $\alpha_0 = 2q/(2q_0+1)$. Moreover, to yield a valid region, we only need the weakest possible condition on the true regularity, i.e., $q_0 \ge p - 1/2$. Adaptive credible regions under P2 are constructed in Section S.3 of the supplementary document.

3. Nonparametric Regression Model. In this section, we present a general class of nonparametric regression models to which the nonparametric Bayesian method is applied. Specifically, we place a prior of type P2 on the Fourier coefficients of the unknown regression function, and further prove that the resulting posterior distribution is closely related to the penalized likelihood in the smoothing spline literature; see (3.8). This is the first main result of this paper.

Let $p_f(y|x)$ be the conditional probability density function of $Y$ given $X = x$ under an infinite-dimensional parameter $f$:
\[
(3.1)\qquad Y \mid f, X = x \sim p_f(y|x),
\]

where $(y, x) \in \mathcal{Y} \times I$ and $I := [0,1]$. We assume that there exists a "true" parameter $f_0$ under which the sample is drawn, and that $f_0$ belongs to an $m$-th order Sobolev space:
\[
H^m(I) = \{f \in L^2(I) : f^{(j)} \text{ is absolutely continuous for } j = 0, 1, \ldots, m-1, \text{ and } f^{(m)} \in L^2(I)\}.
\]
Throughout the paper, we let $m > 1/2$ so that $H^m(I)$ is a reproducing kernel Hilbert space (RKHS). We impose some regularity assumptions on $\ell(y; f(x)) \equiv \log p_f(y|x)$. Denote the first-, second-, and third-order derivatives of $\ell(y; a)$ with respect to $a$ by $\dot\ell_a(y;a)$, $\ddot\ell_a(y;a)$, and $\dddot\ell_a(y;a)$, respectively.

Assumption A1. (a) $\ell(y; a)$ is three times continuously differentiable and strictly concave in $a$. There exist positive constants $C_0$ and $C_1$ such that
\[
(3.2)\qquad E_{f_0}\big\{\exp\big(\sup_{a\in\mathbb{R}} |\ddot\ell_a(Y; a)|/C_0\big)\,\big|\, X\big\} \le C_1, \quad a.s.,
\]
\[
(3.3)\qquad E_{f_0}\big\{\exp\big(\sup_{a\in\mathbb{R}} |\dddot\ell_a(Y; a)|/C_0\big)\,\big|\, X\big\} \le C_1, \quad a.s.
\]
(b) There exists a constant $C_2 > 0$ such that $C_2^{-1} \le B(X) \equiv -E_{f_0}\{\ddot\ell_a(Y; f_0(X)) \mid X\} \le C_2$, a.s.
(c) $\epsilon \equiv \dot\ell_a(Y; f_0(X))$ satisfies $E_{f_0}\{\epsilon \mid X\} = 0$ and $E_{f_0}\{\epsilon^2 \mid X\} = B(X)$, a.s., and there exists a constant $M$ such that $E_{f_0}\{\epsilon^4 \mid X\} \le M$, a.s.

We present three common examples that satisfy Assumption A1.

Example 3.1. (Gaussian Model) In Gaussian regression, $Y = f(X) + \epsilon$ with $\epsilon \mid X \sim N(0,1)$. It is easy to see that $p_f(y|x) = (\sqrt{2\pi})^{-1}\exp(-(y-f(x))^2/2)$, and hence $\ell(y; a) = -\log\sqrt{2\pi} - (y-a)^2/2$. Assumption A1 is easily verified, with $B(X) = 1$.

Example 3.2. (Logistic Model) In logistic regression, where $P_f(Y=1|X) = 1 - P_f(Y=0|X) = \exp(f(X))/(1+\exp(f(X)))$, it is easy to see that $\ell(y; a) = ay - \log(1+\exp(a))$ and $B(X) = \exp(f_0(X))/(1+\exp(f_0(X)))^2$. Assumption A1 (a) follows from the boundedness of $\ddot\ell_a$ and $\dddot\ell_a$. Assumption A1 (b) follows from the fact that $B(X)$ is almost surely bounded away from zero and infinity, since $f_0$ is supported on a compact interval. Assumption A1 (c) can be verified by direct calculation.

Example 3.3. (Exponential Family) Suppose $(Y, X)$ follows a one-parameter exponential family with $p_f(y|x) = \exp\{yf(x) + A(y) - G(f(x))\}$, where $A(\cdot)$ and $G(\cdot)$ are known. It is easy to see that $\ell(y; a) = ya + A(y) - G(a)$, and hence $\dot\ell_a(y;a) = y - \dot G(a)$, $\ddot\ell_a(y;a) = -\ddot G(a)$, and $\dddot\ell_a(y;a) = -\dddot G(a)$. We assume that $G$ has bounded $j$th-order derivatives on $\mathbb{R}$ for $j = 2, 3, 4$, and that $\ddot G(a) \ge \delta$ for some constant $\delta > 0$ and any real number $a$ in the range of $f_0$. Clearly, both $\ddot\ell_a$ and $\dddot\ell_a$ are bounded, and hence Assumption A1 (a) holds. Furthermore, $B(X) = \ddot G(f_0(X))$ satisfies Assumption A1 (b). Since $\epsilon = Y - \dot G(f_0(X))$, it is easy to see that $E_{f_0}\{\epsilon \mid X\} = E_{f_0}\{Y \mid X\} - \dot G(f_0(X)) = 0$ and $E_{f_0}\{\epsilon^2 \mid X\} = \mathrm{Var}_{f_0}(Y \mid X) = \ddot G(f_0(X)) = B(X)$ (see [31]). Furthermore, $E_{f_0}(\epsilon^4 \mid X) = G^{(4)}(f_0(X)) + 3\ddot G(f_0(X))^2$, which is almost surely bounded by a constant. Therefore, Assumption A1 (c) holds.


Consider the smoothing spline estimate $\widehat f_{n,\lambda}$:
\[
(3.4)\qquad \widehat f_{n,\lambda} = \mathop{\arg\max}_{f \in H^m(I)} \ell_{n,\lambda}(f) = \mathop{\arg\max}_{f \in H^m(I)} \Big\{\frac{1}{n}\sum_{i=1}^n \big[\ell(Y_i; f(X_i)) + \log\pi(X_i)\big] - (\lambda/2) J(f, f)\Big\},
\]

where $J(f, \tilde f) = \int_0^1 f^{(m)}(x)\tilde f^{(m)}(x)\,dx$ for any $f, \tilde f \in H^m(I)$, $\pi(x)$ denotes the marginal density of $X$ satisfying $0 < \inf_{x\in I}\pi(x) \le \sup_{x\in I}\pi(x) < \infty$, and $\lambda \to 0$ as $n \to \infty$. Define $V(f, \tilde f) = E_{f_0}\{B(X)f(X)\tilde f(X)\}$ and $\langle f, \tilde f\rangle = V(f, \tilde f) + \lambda J(f, \tilde f)$. It follows from [38] that $\langle\cdot,\cdot\rangle$ defines an inner product on $H^m(I)$. Let $\|\cdot\|$ be the corresponding norm, and let $K$ be the corresponding reproducing kernel function. Let $W_\lambda$ be a self-adjoint bounded linear operator from $H^m(I)$ to itself satisfying $\langle W_\lambda f, \tilde f\rangle = \lambda J(f, \tilde f)$. For simplicity, we also write $V(f) = V(f,f)$ and $J(f) = J(f,f)$.

Assumption A2. There exist a sequence of eigenfunctions $\varphi_\nu \in H^m(I)$ satisfying $\sup_{\nu\in\mathbb{N}}\|\varphi_\nu\|_{\sup} < \infty$ and a nondecreasing sequence of eigenvalues $\rho_\nu \asymp \nu^{2m}$ such that
\[
(3.5)\qquad V(\varphi_\mu, \varphi_\nu) = \delta_{\mu\nu}, \quad J(\varphi_\mu, \varphi_\nu) = \rho_\mu\delta_{\mu\nu}, \quad \mu, \nu \in \mathbb{N},
\]
where $\delta_{\mu\nu}$ is the Kronecker delta. In particular, any $f \in H^m(I)$ admits a Fourier expansion $f = \sum_\nu V(f, \varphi_\nu)\varphi_\nu$ with convergence in the $\|\cdot\|$-norm. Moreover, the first $m$ eigenvalues are all zero, and $\rho_\nu > 0$ for $\nu > m$.

We refer to [38, Proposition 2.2] for the validity of Assumption A2. In particular, Assumption A2 yields equivalent series representations of $\|f\|^2$, $K$, and $W_\lambda$, as follows.

Proposition 3.1. Under Assumption A2, for any $f \in H^m(I)$ and $z \in I$, we have $\|f\|^2 = \sum_\nu |V(f,\varphi_\nu)|^2(1+\lambda\rho_\nu)$, $K_z(\cdot) = \sum_\nu \frac{\varphi_\nu(z)}{1+\lambda\rho_\nu}\varphi_\nu(\cdot)$, and $W_\lambda\varphi_\nu(\cdot) = \frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu(\cdot)$.
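To see the penalized likelihood (3.4) in action in the Gaussian model (Example 3.1), the sketch below (our own illustration; the cosine basis with $\rho_\nu = (\pi\nu)^{2m}$ is a stand-in for the eigensystem of Assumption A2, not the paper's exact construction) computes the penalized estimate by ridge-type least squares in a truncated basis.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, nbasis, lam = 300, 2, 30, 1e-4
X = rng.uniform(0, 1, n)
f0 = lambda x: np.cos(np.pi * x) + 0.5 * np.cos(3 * np.pi * x)  # truth in the basis span
Y = f0(X) + rng.standard_normal(n)

nu = np.arange(1, nbasis + 1)
Phi = np.sqrt(2) * np.cos(np.outer(X, np.pi * nu))  # basis evaluated at the design points
rho = (np.pi * nu) ** (2 * m)                       # J(phi_nu, phi_nu) = rho_nu ~ nu^{2m}

def objective(c):
    """Penalized least-squares criterion: Gaussian-model instance of (3.4), up to sign."""
    return np.mean((Y - Phi @ c) ** 2) + lam * np.sum(rho * c**2)

# normal equations of the convex penalized criterion give the exact minimizer
c_hat = np.linalg.solve(Phi.T @ Phi / n + lam * np.diag(rho), Phi.T @ Y / n)
```

Since the criterion is a convex quadratic in the coefficients, `c_hat` attains a smaller objective value than any other coefficient vector, which is easy to verify directly.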

We next assign a Gaussian process (GP) prior to $f$, and point out its relation to prior P2. Let $\{v_\nu\}_{\nu=1}^\infty$ be a sequence of independent random variables satisfying $v_\nu \sim \pi_\nu$ for $\nu = 1, \ldots, m$, where $\pi_1, \ldots, \pi_m$ are some probability distributions, and $v_\nu \sim N(0, \rho_\nu^{-(1+\beta/(2m))})$ for $\nu > m$. Here, the prior parameter $\beta$ measures the regularity difference between the prior and the parameter space $H^m(I)$. The $v_\nu$'s are assumed to be independent of the sample. The GP prior on $f$ is given as follows:
\[
(3.6)\qquad G_\lambda(t) = \sum_{\nu=1}^\infty w_\nu\varphi_\nu(t),
\]
where $w_\nu = v_\nu$ for $\nu = 1, \ldots, m$, and $w_\nu = (1 + n\lambda\rho_\nu^{-\beta/(2m)})^{-1/2}v_\nu$ for $\nu > m$. For simplicity, we assume that $\pi_j$ is $N(0, \sigma_j^2)$ for $j = 1, \ldots, m$ in the above GP prior, where the $\sigma_j^2$'s are fixed constants; our results may also apply to a general class of $\pi_1, \ldots, \pi_m$. When $\nu > m$, it is easy to see that $w_\nu \sim N(0, (\rho_\nu^{1+\beta/(2m)} + n\lambda\rho_\nu)^{-1})$, and the inverse prior variance satisfies $\rho_\nu^{1+\beta/(2m)} + n\lambda\rho_\nu \asymp \nu^{2m+\beta} + n\lambda\nu^{2m}$. This is exactly a prior of type P2. Clearly, the sample path of $G_\lambda$ belongs to $H^m(I)$ a.s. for any $\beta \in (1, 2m+1)$, due to the trivial observation $E\{J(G_\lambda, G_\lambda)\} = \sum_{\nu>m} \rho_\nu/(\rho_\nu^{1+\beta/(2m)} + n\lambda\rho_\nu) < \infty$. Note that $\beta$ and $m$ jointly determine the smoothness of the RKHS induced by $G_\lambda$; see [44]. Let $\Pi_\lambda$ be the probability measure induced by $G_\lambda$, i.e., $\Pi_\lambda(B) = P(G_\lambda \in B)$. The posterior distribution of $f$ can be written as
\[
(3.7)\qquad P(f \mid D_n) \propto \exp\Big(\sum_{i=1}^n \ell(Y_i; f(X_i))\Big)\,d\Pi_\lambda(f),
\]

where $D_n \equiv \{Z_1, \ldots, Z_n\}$ denotes the full sample, and $Z_i = (Y_i, X_i)$, $i = 1, \ldots, n$, are iid copies of $Z = (Y, X)$. By the key Lemma 3.2 below, we can link the above posterior distribution with the penalized likelihood $\ell_{n,\lambda}$:
\[
(3.8)\qquad P(f \in S \mid D_n) = \frac{\int_S \exp(n\ell_{n,\lambda}(f))\,d\Pi(f)}{\int_{H^m(I)} \exp(n\ell_{n,\lambda}(f))\,d\Pi(f)},
\]
for any $\Pi$-measurable subset $S \subseteq H^m(I)$, where $\Pi$ is the probability measure induced by the following GP $G$:
\[
(3.9)\qquad G(t) = \sum_{\nu=1}^\infty v_\nu\varphi_\nu(t).
\]
Similarly, one can check that the sample path of $G$ also belongs to $H^m(I)$ a.s. for any $\beta \in (1, 2m+1)$. We remark that the important duality between (3.7) and (3.8) holds universally, irrespective of the model assumptions in A1. The following lemma, established based on Hájek [22], demonstrates the relationship between $\Pi$ and $\Pi_\lambda$.


Lemma 3.2. For $f \in H^m(I)$, the Radon-Nikodym derivative of $\Pi_\lambda$ with respect to $\Pi$ is
\[
\frac{d\Pi_\lambda}{d\Pi}(f) = \prod_{\nu=m+1}^\infty \big(1 + n\lambda\rho_\nu^{-\beta/(2m)}\big)^{1/2} \cdot \exp\Big(-\frac{n\lambda}{2}J(f, f)\Big).
\]
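A draw from the series prior (3.9) is easy to simulate once an eigensystem is fixed. The sketch below (our own illustration, using the cosine-basis surrogate $\varphi_\nu(t) = \sqrt{2}\cos(\pi\nu t)$ with $\rho_\nu = (\pi\nu)^{2m}$, and ignoring the first $m$ null-space coefficients) also checks numerically that $E\{J(G,G)\} = \sum_\nu \rho_\nu\,\mathrm{Var}(v_\nu) = \sum_\nu \rho_\nu^{-\beta/(2m)}$ is finite for $\beta > 1$, which is why the sample paths land in $H^m(I)$ almost surely.

```python
import numpy as np

rng = np.random.default_rng(1)
m, beta, nbasis = 2, 1.5, 500
nu = np.arange(1, nbasis + 1)
rho = (np.pi * nu) ** (2 * m)               # surrogate eigenvalues rho_nu ~ nu^{2m}
sd = rho ** (-(1 + beta / (2 * m)) / 2)     # prior sd of v_nu (nu > m case)

v = sd * rng.standard_normal(nbasis)        # coefficients of one prior draw
t = np.linspace(0.0, 1.0, 201)
G = np.sqrt(2) * np.cos(np.outer(t, np.pi * nu)) @ v   # sample path of G(t)

# E J(G, G) = sum_nu (pi * nu)^{-beta}, a convergent series for beta > 1
EJ = float(np.sum(rho * sd**2))
```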

4. Rate of Posterior Contraction under a Sobolev Norm. In this section, we derive posterior contraction rates under a type of Sobolev norm given the GP prior (3.6). This Sobolev-type norm is stronger than the commonly used norms, such as the Kullback-Leibler norm and the supremum norm. Our result in this section is an intermediate step in developing the general nonparametric BvM theorem, and is also of independent interest. We first assume the following posterior consistency condition.

Assumption A3. For any $b > 0$, as $n \to \infty$, $P(\|f - f_0\|_{\sup} > b \mid D_n) \to 0$ in $P_{f_0}^n$-probability.

A similar assumption was used in [19] for deriving posterior contraction rates in terms of the supremum norm. We point out that this assumption can be removed in two situations: (1) exponential family models such as the Gaussian regression model; (2) logistic regression with the null space removed. We first employ the uniformly consistent test arguments of [21] to obtain an initial rate of posterior contraction. The Cameron-Martin theorem ([7]) together with a Gaussian correlation inequality ([30]) are also used in the proof of Proposition 4.1. With a slight abuse of notation, we define $h = \lambda^{1/(2m)}$.

Proposition 4.1. (An initial rate of posterior contraction) Suppose Assumptions A1–A3 hold, and $f_0(\cdot) = \sum_{\nu=1}^\infty f_\nu^0\varphi_\nu(\cdot)$ satisfies Condition (S): $\sum_{\nu=1}^\infty |f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}} < \infty$. Then for any $\delta_n \to 0$ satisfying $\delta_n \ge r_n := (nh)^{-1/2} + \lambda^{1/2}$ and $\delta_n = o(h^{1/2})$, we have $P(f \in H^m(I) : \|f - f_0\| > M\delta_n \mid D_n) \to 0$ in $P_{f_0}^n$-probability for sufficiently large $M$.

We comment on the implications of Condition (S). By Assumption A2, i.e., $\rho_\nu \asymp \nu^{2m}$, we have $\sum_{\nu=1}^\infty |f_\nu^0|^2\nu^{2m+\beta-1} < \infty$. This indicates $\{f_\nu^0\} \in \ell_2(m + (\beta-1)/2)$. Meanwhile, the GP prior (3.6) implies $f_\nu \sim N(0, (\nu^{2m+\beta} + n\lambda\nu^{2m})^{-1})$ for $\nu > m$. Then, in the notation of Section 2.2, we have $p = m + \beta/2$, $q = m$, and $q_0 = p - 1/2$. In particular, $q_0 = p - 1/2$ is the weakest possible condition on the regularity of $f_0$ for obtaining a valid credible set (see Parts (1a) and (2a) of Theorem 2.3).


The initial contraction rate in Proposition 4.1 is sub-optimal, but it helps us obtain the optimal one in Theorem 4.2 below. Define the optimal rate $\tilde r_n = (nh)^{-1/2} + h^{m+\frac{\beta-1}{2}}$. This rate optimality follows from the smoothness of the true function in Condition (S): $f_0 \in H^{m+(\beta-1)/2}(I)$. Let $a_n$ be defined as in Lemma S.4 of the online supplementary material.

Theorem 4.2. (An optimal rate of posterior contraction) Suppose Assumptions A1–A3 hold, and $f_0 = \sum_{\nu=1}^\infty f_\nu^0\varphi_\nu$ satisfies Condition (S). Furthermore, suppose the following set of rate conditions (Condition (R)) holds:

(i). $h = o(1)$, $\log h^{-1} = O(\log n)$, and $nh^2 \to \infty$;
(ii). $r_n = o(h^{1/2})$;
(iii). $a_n\log n = o(\tilde r_n)$;
(iv). $b_{n1} := n^{-1/2}\tilde r_n h^{-\frac{8m-1}{4m}}(\log\log n)^{1/2}\log n + h^{-1/2}\tilde r_n\log n = o(1)$;
(v). $b_{n2} := n^{-1/2}h^{-2}\log n\,(\log\log n)^{1/2} = o(1)$;
(vi). $b_{n3} := r_n^3 n^{-1/2}h^{-\frac{8m-1}{4m}}(\log n)(\log\log n)^{1/2} + r_n^3 h^{-1/2}\log n = o(\tilde r_n^2)$;
(vii). $b_{n4} := n^{-1/2}h^{-\frac{6m-1}{4m}}r_n^2(\log n)(\log\log n)^{1/2} = o(\tilde r_n^2)$.

Then for any $\epsilon, \delta > 0$, there exist positive constants $M = M(\epsilon, \delta)$ and $N = N(\epsilon, \delta)$ such that for any $n \ge N$, with $P_{f_0}^n$-probability less than $\epsilon$, $P(f : \|f - f_0\| \ge M\tilde r_n \mid D_n) > \delta$.

In addition to the probabilistic tools used in the proof of Proposition 4.1, the proof of Theorem 4.2 applies the functional Bahadur representation and a set of empirical process tools recently developed by [38].

Remark 4.1. It can be shown that, when $m > 3/2$ and $1 < \beta < m + \frac{1}{2}$, Condition (R) holds for $h \asymp h^* \equiv n^{-\frac{1}{2m+\beta}}$. This leads to the contraction rate $n^{-\frac{2m+\beta-1}{2(2m+\beta)}}$. According to Condition (S), we know that $f_0$ belongs to the higher-order Sobolev space $H^{m+(\beta-1)/2}(I)$. Hence, the above contraction rate turns out to be optimal according to [45].

5. Nonparametric Bernstein-von Mises Theorem. In this section, we show that the BvM theorem holds in the general nonparametric regression setup established in Section 3. To the best of our knowledge, our BvM theorem is the first nonparametric one based on the total variation distance. Such a result is crucial in constructing credible regions under both strong and weak topologies, and credible intervals for linear functionals, as illustrated in Sections 5.2 – 5.4. In contrast, the BvM theorems established by [10, 11] are in terms of a weaker metric: the bounded Lipschitz metric. Similar to Section 2, the


tuning hyper-parameter $\lambda$ in the GP prior $G_\lambda$ controls the frequentist performance of a credible region/interval. At the end of Section 5.1, we discuss how to employ generalized cross validation (GCV) to select a proper tuning prior, i.e., $\lambda$, by taking advantage of the duality between nonparametric Bayesian and smoothing spline models.

5.1. Main Theorem. In this section, we prove a general nonparametric BvM theorem, which says that the posterior measure can be well approximated by the probability measure $P_0$ defined in (5.1) below:
\[
(5.1)\qquad P_0(B) = \frac{\int_B \exp(-\frac{n}{2}\|f - \widehat f_{n,\lambda}\|^2)\,d\Pi(f)}{\int_{H^m(I)} \exp(-\frac{n}{2}\|f - \widehat f_{n,\lambda}\|^2)\,d\Pi(f)}, \quad\text{for any } B \in \mathcal{B},
\]
where $\mathcal{B}$ is a collection of $\Pi$-measurable sets in $H^m(I)$, i.e., $\mathcal{B}$ is a $\sigma$-algebra with respect to $\Pi$. In spite of its complicated appearance, $P_0$ turns out to be a Gaussian measure on $H^m(I)$, induced by a Gaussian process $W$ that has the closed form (5.3). This Gaussianity facilitates the applications of our BvM theorem, as will be seen in the subsequent sections.

Theorem 5.1. (Nonparametric BvM theorem) Suppose Assumptions A1–A3 hold, and $f_0 = \sum_{\nu=1}^\infty f_\nu^0\varphi_\nu$ satisfies Condition (S). Furthermore, let Condition (R) be satisfied, with the $b_{n1}, b_{n2}$ therein further satisfying $n\tilde r_n^2(b_{n1} + b_{n2}) = o(1)$ as $n \to \infty$. Then we have, as $n \to \infty$,
\[
(5.2)\qquad \sup_{B\in\mathcal{B}}|P(B \mid D_n) - P_0(B)| = o_{P_{f_0}^n}(1).
\]

We sketch the proof of Theorem 5.1 here. According to Theorem 4.2, the posterior mass is mostly concentrated on an $M\tilde r_n$-ball around $f_0$, denoted $B_{M\tilde r_n}(f_0)$. Hence, for any $B \in \mathcal{B}$, we decompose $P(B \mid D_n) = P(B \cap B_{M\tilde r_n}(f_0) \mid D_n) + P(B \cap B_{M\tilde r_n}^c(f_0) \mid D_n)$. By Theorem 4.2, the second term is uniformly negligible over $B \in \mathcal{B}$. By applying a Taylor expansion to the penalized likelihood in (3.8) (in terms of Fréchet derivatives), we can show, using empirical process techniques, that the first term is asymptotically close to $P_0(B)$ uniformly over $B \in \mathcal{B}$.

The conditions of Theorem 5.1 are not restrictive. Indeed, by direct calculation, we can verify that $h \asymp h^* \equiv n^{-\frac{1}{2m+\beta}}$ satisfies Condition (R) and $n\tilde r_n^2(b_{n1} + b_{n2}) = o(1)$ when $m > 1 + \frac{\sqrt{3}}{2} \approx 1.866$ and $1 < \beta < m + \frac{1}{2}$. Therefore, (5.2) holds under $h \asymp h^*$. Interestingly, the optimal posterior contraction rate is simultaneously obtained in this case; see Remark 4.1. We next show that $P_0$ equals a Gaussian measure $\Pi_W$, induced by the Gaussian process $W$ defined in (5.3). As will be seen later, the procedures


for Bayesian inference in Sections 5.2 – 5.4 can be easily performed in terms of $W$. Define

$a_\nu = \begin{cases} n(n + \sigma_\nu^{-2})^{-1}, & \nu = 1, \ldots, m, \\ n(1 + \lambda\rho_\nu)\big(n(1 + \lambda\rho_\nu) + \rho_\nu^{1+\beta/(2m)}\big)^{-1}, & \nu > m, \end{cases}$

$\xi_\nu = \begin{cases} (1 + n\sigma_\nu^2)^{-1/2} v_\nu, & \nu = 1, \ldots, m, \\ \big(1 + n\rho_\nu^{-(1+\beta/(2m))}(1 + \lambda\rho_\nu)\big)^{-1/2} v_\nu, & \nu > m, \end{cases}$

where recall that the $\sigma_\nu^2$'s are the prior variances of the $v_\nu$'s assumed in (3.6). Let

(5.3)  $W(\cdot) = \sum_{\nu=1}^\infty (\xi_\nu + a_\nu \hat{f}_\nu)\varphi_\nu(\cdot),$

where $\hat{f}_\nu = V(\hat{f}_{n,\lambda}, \varphi_\nu)$ for any $\nu \ge 1$. Define $\tilde{f}_{n,\lambda}(\cdot) = \sum_{\nu=1}^\infty a_\nu \hat{f}_\nu \varphi_\nu(\cdot)$. Clearly, $W = \tilde{f}_{n,\lambda} + \sum_\nu \xi_\nu \varphi_\nu$ is a Gaussian process with mean function $\tilde{f}_{n,\lambda}$. Note that $\tilde{f}_{n,\lambda} \ne \hat{f}_{n,\lambda}$, where $\hat{f}_{n,\lambda}$ is the smoothing spline estimate in (3.4). Given the data $D_n$, let $\Pi_W$ be the probability measure induced by $W$. Theorem 5.2 below essentially says that $P_0 = \Pi_W$. Together with Theorem 5.1, we note that the posterior distribution $P(\cdot|D_n)$ and $\Pi_W(\cdot)$ are asymptotically close under the total variation distance. This implies that the mean function $\tilde{f}_{n,\lambda}$ of $W$ is approximately the posterior mode of $P(\cdot|D_n)$. Therefore, $\tilde{f}_{n,\lambda}$ will be used as the center of the credible region to be constructed.

Theorem 5.2. With $f \in H^m(I)$, the Radon–Nikodym derivative of $\Pi_W$ with respect to $\Pi$ is

$\dfrac{d\Pi_W}{d\Pi}(f) = \dfrac{\exp\big(-\frac{n}{2}\|f - \hat{f}_{n,\lambda}\|^2\big)}{\int_{H^m(I)} \exp\big(-\frac{n}{2}\|f - \hat{f}_{n,\lambda}\|^2\big)\,d\Pi(f)}.$

Theorem 5.2 implies that

$\dfrac{dP_0}{d\Pi}(f) = \dfrac{d\Pi_W}{d\Pi}(f).$

At the end of this section, we point out an important consequence of the duality between nonparametric Bayesian models and smoothing spline models established here. Specifically, we can employ the well developed GCV method to select a proper tuning prior, i.e., the value of $\lambda$ (equivalently, $h$), in practice. Let $h_{GCV}$ be the GCV-selected tuning parameter based on the penalized likelihood function; $h_{GCV}$ is known to achieve the approximate optimal rate $n^{-1/(2m+1)}$ (see [47]). As illustrated in Sections 5.2 – 5.4,

$h$ should be chosen of order $n^{-1/(2m+\beta)}$ for obtaining nice frequentist properties of credible regions/intervals. Hence, we set $h$ as $h_{GCV}^{(2m+1)/(2m+\beta)}$ when tuning the assigned GP prior. Numerical evidence in Section 6 strongly supports this novel method.

5.2. Credible Region in Strong Topology. As the first application of our main result, we consider the construction of an asymptotic credible region for $f$ in terms of the $\|\cdot\|$-norm and study its frequentist property in this section. Throughout Sections 5.2 – 5.4, we suppose that $f_0$ satisfies Condition (S), and set $h \asymp h^* = n^{-1/(2m+\beta)}$ for simplicity. It will be shown that such a selection of $h$ yields credible regions with radius of the optimal order $n^{-\frac{2m+\beta-1}{2(2m+\beta)}}$.

We first re-write $W$ defined in (5.3) as $\tilde{f}_{n,\lambda} + Z$, where $Z(\cdot) = \sum_{\nu=1}^\infty \xi_\nu \varphi_\nu(\cdot)$, and then study the asymptotic behavior of $Z$. Assume for some constant $c_0 > 0$ that $\rho_\nu/(c_0\nu)^{2m} \to 1$ as $\nu \to \infty$, and define

$c_\nu = \begin{cases} (n + \sigma_\nu^{-2})^{-1}, & \nu = 1, \ldots, m, \\ \big(n(1 + \lambda\rho_\nu) + \rho_\nu^{1+\beta/(2m)}\big)^{-1}, & \nu > m. \end{cases}$

For $j = 1, 2$, let

(5.4)  $\zeta_j = c_0 h \sum_{\nu=1}^\infty (nc_\nu)^j (1 + \lambda\rho_\nu), \qquad \tau_j^2 = 2c_0 h \sum_{\nu=1}^\infty (nc_\nu)^{2j} (1 + \lambda\rho_\nu)^2.$

Lemma 5.3. As $n \to \infty$, $\big(n\sqrt{c_0 h}\,\|Z\|^2 - (c_0 h)^{-1/2}\zeta_1\big)/\tau_1 \xrightarrow{d} N(0,1)$.

We construct an asymptotic credible region with credibility level $(1-\alpha)$:

(5.5)  $R_n(\alpha) = \Big\{ f \in H^m(I) : \|f - \tilde{f}_{n,\lambda}\| \le \sqrt{\dfrac{\zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha}{c_0 n h}} \Big\}.$

The above $R_n(\alpha)$ is very similar to the credible region $C_n^*(\alpha)$ (defined in (2.8)) for the Gaussian sequence model under prior P2. In particular, we note that $\zeta_1$ and $\tau_1^2$ in $R_n(\alpha)$ play similar roles as $E_n^*$ and $D_n^*$ in $C_n^*(\alpha)$. On the other hand, some direct calculations based on Lemma S.1 indicate that $\zeta_1$ ($\tau_1^2$) is slightly larger than $E_n^*$ ($D_n^*$). This is mainly due to the different norms used in constructing $C_n^*(\alpha)$ and $R_n(\alpha)$. The posterior coverage of


$R_n(\alpha)$ follows from Theorem 5.1 and Lemma 5.3:

$P(R_n(\alpha)|D_n) = P_0(R_n(\alpha)) + o_{P_{f_0}^n}(1) = P(W \in R_n(\alpha)) + o_{P_{f_0}^n}(1)$
$= P\Big( \|W - \tilde{f}_{n,\lambda}\| \le \sqrt{\tfrac{\zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha}{c_0 n h}} \Big) + o_{P_{f_0}^n}(1)$
$= P\Big( \|Z\| \le \sqrt{\tfrac{\zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha}{c_0 n h}} \Big) + o_{P_{f_0}^n}(1)$
$= P\big( (n\sqrt{c_0 h}\,\|Z\|^2 - (c_0 h)^{-1/2}\zeta_1)/\tau_1 \le z_\alpha \big) + o_{P_{f_0}^n}(1)$
$= 1 - \alpha + o_{P_{f_0}^n}(1).$
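To make the construction above concrete, here is a small numerical sketch (ours, not from the paper) of the quantities in (5.4) and the radius in (5.5). It assumes the exact eigenvalue form $\rho_\nu = (c_0\nu)^{2m}$ (the paper only requires $\rho_\nu/(c_0\nu)^{2m} \to 1$) and a hypothetical common prior variance $\sigma_\nu^2 = \sigma^2$ for $\nu \le m$; it also checks Lemma 5.5 ($\zeta_1 > \zeta_2$, $\tau_1 \ge \tau_2$) numerically.

```python
import numpy as np
from statistics import NormalDist

def region_quantities(n, m=2, beta=2.0, c0=np.pi, sigma2=1.0, alpha=0.05, nmax=100000):
    """Compute zeta_1, zeta_2, tau_1, tau_2 of (5.4) and the radius of R_n(alpha)
    in (5.5), under the simplifying assumptions stated in the text above."""
    h = n ** (-1.0 / (2 * m + beta))                 # h ~ h* = n^{-1/(2m+beta)}
    lam = h ** (2 * m)                               # smoothing parameter lambda = h^{2m}
    nu = np.arange(1, nmax + 1, dtype=float)
    rho = (c0 * nu) ** (2 * m)                       # assumed eigenvalues
    c = np.empty(nmax)
    c[:m] = 1.0 / (n + 1.0 / sigma2)                 # c_nu for nu <= m
    c[m:] = 1.0 / (n * (1 + lam * rho[m:]) + rho[m:] ** (1 + beta / (2 * m)))
    w = 1 + lam * rho                                # the factor (1 + lambda*rho_nu)
    zeta = [c0 * h * np.sum((n * c) ** j * w) for j in (1, 2)]
    tau = [np.sqrt(2 * c0 * h * np.sum((n * c) ** (2 * j) * w ** 2)) for j in (1, 2)]
    z_alpha = NormalDist().inv_cdf(1 - alpha)        # upper-alpha normal quantile
    radius = np.sqrt((zeta[0] + np.sqrt(c0 * h) * tau[0] * z_alpha) / (c0 * n * h))
    return zeta, tau, radius
```

Since $nc_\nu < 1$ for every $\nu$, the sums satisfy $\zeta_1 > \zeta_2$ and $\tau_1 > \tau_2$ term by term, in line with Lemma 5.5.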

We next examine the frequentist property of $R_n(\alpha)$ by first studying the asymptotic property of its center, i.e., $\tilde{f}_{n,\lambda}$, in Lemma 5.4.

Lemma 5.4. Let Assumptions A1–A3 be satisfied, and $h \asymp h^* = n^{-1/(2m+\beta)}$ for $m > 1 + \sqrt{3}/2$ and $1 < \beta < m + 1/2$. Then, as $n \to \infty$,

$c_0 n h \|\tilde{f}_{n,\lambda} - f_0\|^2 = \zeta_2 + c_0 n h W^{(n)} + o_{P_{f_0}^n}(1),$

where $W^{(n)}$ is a random variable satisfying $c_0^{1/2}\tau_2^{-1} n h^{1/2} W^{(n)} \xrightarrow{d} N(0,1)$.

Lemma 5.5. $\tau_1 \ge \tau_2$ and $\lim_{n\to\infty} (\zeta_1 - \zeta_2) > 0$.

By Lemma 5.5, $\zeta_1 - \zeta_2 > c$ eventually for some positive constant $c > 0$. By Lemma 5.4, we have $c_0 n h W^{(n)} = o_{P_{f_0}^n}(1)$, and hence, with $P_{f_0}^n$-probability approaching one, $c_0 n h \|\tilde{f}_{n,\lambda} - f_0\|^2 = \zeta_2 + c_0 n h W^{(n)} + o_{P_{f_0}^n}(1) \le \zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha$, which is exactly the event $f_0 \in R_n(\alpha)$. In view of the above discussion, Theorem 5.6 shows that, if $f_0$ satisfies Condition (S) and $h \asymp h^*$, then $R_n(\alpha)$ is valid in the frequentist sense, and possesses the optimal contraction rate $n^{-\frac{2m+\beta-1}{2(2m+\beta)}}$.

Theorem 5.6. Under the conditions of Lemma 5.4, we have for any $\alpha \in (0,1)$, $\lim_{n\to\infty} P_{f_0}^n(f_0 \in R_n(\alpha)) = 1$.

5.3. Credible Region in Weak Topology. In this section, we construct an asymptotic credible region using a weaker topology such that the truth is covered with probability approaching exactly $1 - \alpha$. This echoes the use of a weaker norm in Section 2. As far as we know, this is the first frequentist-matching Bayesian credible set in nonparametric regression


models. We apply the general BvM theorem in Section 5.1 and employ a strong approximation result ([41]) in this section.

For any $f \in H^m(I)$ with $f(\cdot) = \sum_{\nu=1}^\infty f_\nu \varphi_\nu(\cdot)$, define $\|f\|_\dagger^2 = \sum_{\nu=1}^\infty d_\nu f_\nu^2$, where $d_\nu$ is a given positive sequence satisfying $d_\nu \asymp \nu^{-1}(\log 2\nu)^{-\tau}$ for a constant $\tau > 1$. For instance, one can choose $d_1 = \cdots = d_m = 1$ and $d_\nu = \rho_\nu^{-1/(2m)}(\log(\rho_\nu + 1))^{-\tau}$ for $\nu > m$. Define

(5.6)  $R_n^\dagger(\alpha) = \big\{ f \in H^m(I) : \|f - \tilde{f}_{n,\lambda}\|_\dagger \le \sqrt{c_\alpha/n} \big\},$

where $c_\alpha$ is defined in Section 2.1. We first check that $R_n^\dagger(\alpha)$ indeed covers $(1-\alpha)$ posterior mass. Analysis similar to that of the posterior coverage of $C_n^\dagger(\alpha)$ in Section 2.1 implies that $n\sum_\nu d_\nu \xi_\nu^2$ weakly converges to $\sum_\nu d_\nu \eta_\nu^2$, where $\eta_1, \eta_2, \ldots \overset{iid}{\sim} N(0,1)$. It then follows by Theorem 5.1 that

$P(R_n^\dagger(\alpha)|D_n) = P_0(R_n^\dagger(\alpha)) + o_{P_{f_0}^n}(1) = P(n\|W - \tilde{f}_{n,\lambda}\|_\dagger^2 \le c_\alpha) + o_{P_{f_0}^n}(1)$
$= P\big(n\textstyle\sum_\nu d_\nu \xi_\nu^2 \le c_\alpha\big) + o_{P_{f_0}^n}(1) \to 1 - \alpha,$

in $P_{f_0}^n$-probability.

To study the frequentist coverage of the credible set $R_n^\dagger(\alpha)$, we need to assume that the basis functions $\varphi_\nu$ and the real sequence $\rho_\nu$ come from the following ordinary differential system:

(5.7)  $(-1)^m \varphi_\nu^{(2m)}(\cdot) = \rho_\nu B(\cdot)\pi(\cdot)\varphi_\nu(\cdot), \quad \varphi_\nu^{(j)}(0) = \varphi_\nu^{(j)}(1) = 0, \ j = m, \ldots, 2m-1,$

where recall that $\pi(\cdot)$ is the marginal density of the covariate $X$ and $B(\cdot)$ is the function defined in Assumption A1. Theorem 5.7 below proves that the credible region $R_n^\dagger(\alpha)$ constructed under the weak topology exactly matches the nominal frequentist coverage.

Theorem 5.7. Let $(\varphi_\nu, \rho_\nu)$ be the eigenfunctions and eigenvalues of the ODE system (5.7). Suppose that the conditions of Lemma 5.4 hold, and there exists a positive constant $C_1$ such that $E_{f_0}\{\exp(|\epsilon|/C_1)\} < \infty$. Then we have for any $\alpha \in (0,1)$, $\lim_{n\to\infty} P_{f_0}^n(f_0 \in R_n^\dagger(\alpha)) = 1 - \alpha$.

It should be noted that the scope of Theorem 5.7 covers a variety of nonparametric models including Examples 3.1 – 3.3. Again, we can choose a proper tuning prior in $R_n^\dagger(\alpha)$ via the generalized cross validation method.
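For implementation, the critical value $c_\alpha$ used in (5.6) — per the weak-convergence display above, the upper-$\alpha$ quantile of $\sum_\nu d_\nu \eta_\nu^2$ — can be approximated by Monte Carlo. The sketch below is our own, not the paper's code; it uses the example weights $d_1 = \cdots = d_m = 1$ and $d_\nu = \rho_\nu^{-1/(2m)}(\log(\rho_\nu+1))^{-\tau}$, with the hypothetical eigenvalue choice $\rho_\nu = (\pi\nu)^{2m}$ and a truncated series.

```python
import numpy as np

def c_alpha(alpha=0.05, m=2, tau=1.1, nmax=1000, reps=10000, seed=0):
    """Monte-Carlo estimate of c_alpha: the (1-alpha)-quantile of
    sum_nu d_nu * eta_nu^2 with eta_nu iid N(0,1), series truncated at nmax."""
    rng = np.random.default_rng(seed)
    nu = np.arange(1, nmax + 1, dtype=float)
    rho = (np.pi * nu) ** (2 * m)                 # assumed eigenvalues
    d = rho ** (-1.0 / (2 * m)) * np.log(rho + 1.0) ** (-tau)
    d[:m] = 1.0                                   # d_1 = ... = d_m = 1
    draws = np.zeros(reps)
    for lo in range(0, nmax, 250):                # accumulate the series in chunks
        hi = min(lo + 250, nmax)
        draws += rng.standard_normal((reps, hi - lo)) ** 2 @ d[lo:hi]
    return float(np.quantile(draws, 1 - alpha))
```

Since $\tau > 1$ makes $\sum_\nu d_\nu$ convergent, truncating the series introduces only a small bias for a sketch of this kind.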


5.4. Linear Functionals of the Regression Function. In practice, it is also of great interest to consider a linear functional of the regression function $f$. Typical examples include (i) the evaluation functional $F_z(f) = f(z)$, where $z \in I$ is a fixed point; and (ii) the integral functional $F_\omega(f) = \int_0^1 f(z)\omega(z)\,dz$, where $\omega(\cdot)$ is a known deterministic integrable function such as an indicator function. The major goal of this section is to construct asymptotic credible intervals for a general class of linear functionals. In spirit, some of the results in this section are similar to those implied by the parametric BvM theorem in the sense that the constructed credible intervals may contract at the parametric rate $n^{-1/2}$. We note that linear functionals of densities were considered in [36] but without any frequentist justification.

Let $F : H^m(I) \to \mathbb{R}$ be a linear $\Pi$-measurable functional, i.e., $F(af + bg) = aF(f) + bF(g)$ for any $a, b \in \mathbb{R}$ and $f, g \in H^m(I)$. We assume that $\sup_{\nu\ge 1} |F(\varphi_\nu)| < \infty$. This assumption holds for both $F_z$ and $F_\omega$ above due to the uniform boundedness of the $\varphi_\nu$. In addition, we assume that there exists a constant $\kappa > 0$ such that for any $f \in H^m(I)$,

(5.8)  $|F(f)| \le \kappa h^{-1/2} \|f\|.$

Lemma 5.8 below (given in [38]) implies that both $F_z$ and $F_\omega$ satisfy (5.8).

Lemma 5.8. There exists a universal constant $c > 0$ such that for any $f \in H^m(I)$, $\|f\|_{\sup} \le c h^{-1/2}\|f\|$.

Define the following Bayesian credible interval for $F(f_0)$:

(5.9)  $CI_{n,\lambda}(\alpha) : F(\tilde{f}_{n,\lambda}) \pm z_{\alpha/2}\dfrac{t_n}{\sqrt{n}},$

where

$t_n^2 = \sum_{\nu=1}^m \dfrac{F(\varphi_\nu)^2}{1 + n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>m} \dfrac{F(\varphi_\nu)^2}{1 + \lambda\rho_\nu + n^{-1}\rho_\nu^{1+\beta/(2m)}}.$

If $t_n$ converges to a positive constant, then $CI_{n,\lambda}(\alpha)$ contracts at the parametric rate $n^{-1/2}$; see (5.17) and (5.18). The posterior coverage of $CI_{n,\lambda}(\alpha)$ can be easily checked via Theorems 5.1 and 5.2 as follows. It follows by Theorem 5.1 that

$|P(F(f) \in CI_{n,\lambda}(\alpha)|D_n) - P_0(F^{-1}(CI_{n,\lambda}(\alpha)))| \le \sup_{B\in\mathcal{B}} |P(B|D_n) - P_0(B)| \to 0,$

in $P_{f_0}^n$-probability. On the other hand, it follows by Theorem 5.2 that

$P_0(F^{-1}(CI_{n,\lambda}(\alpha))) = P(W \in F^{-1}(CI_{n,\lambda}(\alpha))) = P(F(W) \in CI_{n,\lambda}(\alpha))$
$= P(-z_{\alpha/2} t_n/\sqrt{n} \le F(Z) \le z_{\alpha/2} t_n/\sqrt{n}) = 1 - \alpha,$


where the last equality follows from the fact that $F(Z) = \sum_{\nu=1}^\infty \xi_\nu F(\varphi_\nu) \sim N(0, t_n^2/n)$, given that the $\xi_\nu$'s are independent zero-mean Gaussian variables.

Theorem 5.9 below shows that $CI_{n,\lambda}$ covers the true value $F(f_0)$ with probability asymptotically at least $1-\alpha$ for a general class of functionals $F$ satisfying (5.10). Under an additional condition on $F$, i.e., $0 < \sum_{\nu=1}^\infty F(\varphi_\nu)^2 < \infty$, we can even prove that the coverage is asymptotically exact.

Theorem 5.9. Let the assumptions in Theorem 5.1 be satisfied. Suppose $m > 1 + \sqrt{3}/2$, $1 < \beta < m + 1/2$, and $f_0$ satisfies Condition (S†): $\sum_{\nu=1}^\infty |f_\nu^0|^2 \rho_\nu^{1+\frac{\beta}{2m}} < \infty$. Furthermore, suppose there exists a constant $r \in [0,1]$ such that for $j = 1, 2$,

(5.10)  $\liminf_{n\to\infty} h^r \Big( \sum_{\nu=1}^m \dfrac{F(\varphi_\nu)^2}{(1 + n^{-1}\sigma_\nu^{-2})^j} + \sum_{\nu>m} \dfrac{F(\varphi_\nu)^2}{(1 + \lambda\rho_\nu + (\lambda\rho_\nu)^{1+\beta/(2m)})^j} \Big) > 0.$

Then

(5.11)  $\liminf_{n\to\infty} P_{f_0}^n(F(f_0) \in CI_{n,\lambda}(\alpha)) \ge 1 - \alpha.$

In particular, if $0 < \sum_{\nu=1}^\infty F(\varphi_\nu)^2 < \infty$ and (5.10) holds for $r = 0$, then $CI_{n,\lambda}(\alpha)$ asymptotically achieves $(1-\alpha)$ frequentist coverage, that is,

(5.12)  $\lim_{n\to\infty} P_{f_0}^n(F(f_0) \in CI_{n,\lambda}(\alpha)) = 1 - \alpha.$

Note that Condition (S†) is slightly stronger than Condition (S). We next verify that Condition (5.10) holds when $F$ is the evaluation functional or the integral functional, and model (3.1) is Gaussian, i.e., the setting of Example 3.1. The proof relies on a nice closed form of the $\varphi_\nu$ and a careful analysis of the trigonometric functions.

Proposition 5.10. Suppose $m = 2$, $X \sim \mathrm{Unif}[0,1]$, and $Y \mid f, X \sim N(f(X), 1)$.

(i) If $F = F_z$ for any $z \in (0,1)$, then (5.10) holds for $r = 1$;
(ii) If $F = F_\omega$ for any $\omega \in L^2(I)\setminus\{0\}$, then $0 < \sum_{\nu=1}^\infty F(\varphi_\nu)^2 < \infty$ and (5.10) holds for $r = 0$.

By carefully examining the proof of Theorem 5.9, we find that when $F = F_z$, the inequality (5.11) is actually strict. However, when $F = F_\omega$, $CI_{n,\lambda}(\alpha)$ covers the truth with probability approaching exactly $1-\alpha$. Therefore, we observe a subtle difference between these two types of functionals, which can be empirically detected in the simulations; see Section 6.
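For concreteness, the interval (5.9) can be assembled as below once the eigenvalues $\rho_\nu$, the values $F(\varphi_\nu)$, and the center $F(\tilde{f}_{n,\lambda})$ are available; in this sketch all of these inputs, as well as the common prior variance `sigma2`, are assumptions supplied by the caller.

```python
import numpy as np
from statistics import NormalDist

def credible_interval(F_center, F_phi, rho, n, m=2, beta=2.0, sigma2=1.0, alpha=0.05):
    """Sketch of the credible interval CI_{n,lambda}(alpha) of (5.9).
    F_center: F applied to the center f~_{n,lambda}; F_phi: array of F(phi_nu);
    rho: matching eigenvalues rho_nu; sigma2: assumed prior variance, nu <= m."""
    h = n ** (-1.0 / (2 * m + beta))
    lam = h ** (2 * m)
    tn2 = np.sum(F_phi[:m] ** 2 / (1 + 1.0 / (n * sigma2)))          # nu <= m terms
    tn2 += np.sum(F_phi[m:] ** 2 /
                  (1 + lam * rho[m:] + rho[m:] ** (1 + beta / (2 * m)) / n))
    z = NormalDist().inv_cdf(1 - alpha / 2)                          # z_{alpha/2}
    half = z * np.sqrt(tn2 / n)
    return F_center - half, F_center + half
```

For example, if $F(\varphi_1) = 1$ and the remaining $F(\varphi_\nu)$ vanish, then $t_n^2 \approx 1$ and the interval has width roughly $2z_{\alpha/2}/\sqrt{n}$, i.e., the parametric rate.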


In the end, we give a concrete example of $CI_{n,\lambda}(\alpha)$. Under the setup of Proposition 5.10, it follows from [38] that $(\varphi_\nu, \rho_\nu)$ in Assumption A2 can be chosen as the eigensystem of the following uniform free beam problem:

(5.13)  $\varphi_\nu^{(4)}(\cdot) = \rho_\nu \varphi_\nu(\cdot), \quad \varphi_\nu^{(j)}(0) = \varphi_\nu^{(j)}(1) = 0, \ j = 2, 3.$

The eigenvalues satisfy $\lim_{\nu\to\infty} \rho_\nu/(\pi\nu)^4 = 1$; see [24, Problem 3.10]. The normalized solutions to (5.13) are

(5.14)  $\varphi_1(z) = 1, \quad \varphi_2(z) = \sqrt{3}(2z - 1),$

(5.15)  $\varphi_{2k+1}(z) = \dfrac{\sin(\gamma_{2k+1}(z - 1/2))}{\sin(\gamma_{2k+1}/2)} + \dfrac{\sinh(\gamma_{2k+1}(z - 1/2))}{\sinh(\gamma_{2k+1}/2)}, \quad k \ge 1,$

(5.16)  $\varphi_{2k+2}(z) = \dfrac{\cos(\gamma_{2k+2}(z - 1/2))}{\cos(\gamma_{2k+2}/2)} + \dfrac{\cosh(\gamma_{2k+2}(z - 1/2))}{\cosh(\gamma_{2k+2}/2)}, \quad k \ge 1,$

where $\gamma_\nu = \rho_\nu^{1/4}$ satisfies $\cos(\gamma_\nu)\cosh(\gamma_\nu) = 1$; see [6, pages 295–296]. For $F = F_z$ with $z \in (0,1)$, we have

(5.17)  $CI_{n,\lambda} : \tilde{f}_{n,\lambda}(z) \pm z_{\alpha/2}\dfrac{t_n}{\sqrt{n}},$

where $t_n^2 = \sum_{\nu=1}^2 \frac{\varphi_\nu(z)^2}{1 + n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>2} \frac{\varphi_\nu(z)^2}{1 + \lambda\rho_\nu + n^{-1}\rho_\nu^{1+\beta/4}}$, with $\rho_\nu \approx (\pi\nu)^4$ and $\varphi_\nu$ given in (5.14) – (5.16). Here, we have $t_n^2 \asymp 1/h$. For $F = F_\omega$ with $\omega \in L^2(I)\setminus\{0\}$, we have

(5.18)  $CI_{n,\lambda} : \int_0^1 \tilde{f}_{n,\lambda}(z)\omega(z)\,dz \pm z_{\alpha/2}\dfrac{t_n}{\sqrt{n}},$

where $t_n^2 = \sum_{\nu=1}^2 \frac{F_\omega(\varphi_\nu)^2}{1 + n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>2} \frac{F_\omega(\varphi_\nu)^2}{1 + \lambda\rho_\nu + n^{-1}\rho_\nu^{1+\beta/4}} \to \sum_{\nu=1}^\infty F_\omega(\varphi_\nu)^2 = \int_0^1 \omega(z)^2\,dz > 0$, as $n \to \infty$. The last equality follows from the fact that $F_\omega(\varphi_\nu) = \int_0^1 \omega(z)\varphi_\nu(z)\,dz$ is the $\nu$-th Fourier coefficient of $\omega$. Theorem 5.9 implies that the frequentist coverage of (5.17) is strictly larger than $1-\alpha$, while (5.18) has an asymptotically exact frequentist coverage probability.

To use (5.17) and (5.18), we need explicit expressions for $(\rho_\nu, \varphi_\nu)$. However, explicit expressions for the eigenvalues $\rho_\nu$ are not available, and hence we cannot directly use formulas (5.14) – (5.16) to find the eigenfunctions $\varphi_\nu$. To circumvent this problem, in the numerical study we use an equivalent kernel approach. To be more specific, it follows by [38] that the reproducing


kernel function $K$ with respect to the inner product $\langle\cdot,\cdot\rangle$ satisfies $K(s,t) = \sum_{\nu=1}^\infty \varphi_\nu(s)\varphi_\nu(t)/(1 + \lambda\rho_\nu)$. On the other hand, it follows by [32, pages 184–185] that in the setting of Proposition 5.10 the kernel function $K(s,t)$ admits the approximation

$K(s,t) \approx \dfrac{1}{2\sqrt{2}h}\exp\Big(-\dfrac{|s-t|}{\sqrt{2}h}\Big)\Big(\sin\dfrac{|s-t|}{\sqrt{2}h} + \cos\dfrac{|s-t|}{\sqrt{2}h}\Big) + \dfrac{1}{2\sqrt{2}h}\Big(\Phi\Big(\dfrac{s+t}{\sqrt{2}h}, \dfrac{s-t}{\sqrt{2}h}\Big) + \Phi\Big(\dfrac{2-s-t}{\sqrt{2}h}, \dfrac{t-s}{\sqrt{2}h}\Big)\Big),$

where $\Phi(u,v) = \exp(-u)(\cos(u) - \sin(u) + 2\cos(v))$. Then $(\rho_\nu, \varphi_\nu)$ can be found by performing a spectral decomposition of the above function.

6. Simulations.

6.1. Gaussian Sequence Model. In this section, the performance of the adaptive credible regions (ACR) and modified adaptive credible regions (MACR) constructed under Priors P1 and P2 is demonstrated through a simulation study. These adaptive regions are constructed in Section S.3 of the supplementary document. For simplicity, we use ACR1, MACR1, ACR2 and MACR2 to denote the regions constructed by formulas (S.2), (S.3), (S.6) and (S.7), respectively.

Suppose that the samples $Y_i$ are generated from (2.1) with true parameters $\theta_i^0 = (i\log i)^{-3/2}$, $i = 2, 3, \ldots$. Thus, the true regularity is $q_0 = 1$. The model errors $\epsilon_i$ are iid normal random variables. Results are based on 10,000 independent experiments. In each experiment, $(1-\alpha)$ ACR1, MACR1, ACR2 and MACR2 were constructed. We examined the coverage proportions (CP) of these regions, i.e., the proportions of the regions that cover the true model parameter $\theta_0$. To fully investigate how the coverage proportions relate to the credibility levels, we tested a set of different values of the credibility level $1-\alpha$. Results are summarized in Figures 3 and 4. It can be seen that the coverage proportions of both ACR1 and ACR2 become larger than the credibility level $1-\alpha$ as $n$ becomes sufficiently large. Meanwhile, we observe that ACR2 yields a slightly smaller CP than ACR1. This is because ACR2 has a smaller radius and hence a smaller volume.
The smaller volume may be viewed as an advantage since it means that the credible region is more "concentrated." The fact that ACR2 has a smaller volume is not magic: we have used the same $\hat{q}$ in both formulas (S.2) and (S.6). From a numerical


Fig 2. Plots of $\delta_E = \hat{E}_n - \hat{E}_n^*$ and $\delta_D = \hat{D}_n - \hat{D}_n^*$ versus $\hat{q}$. In $\hat{E}_n$ and $\hat{D}_n$ appearing in (S.2), we choose $p = \hat{q} + 1$; in $\hat{E}_n^*$ and $\hat{D}_n^*$ appearing in (S.6), we choose $p = \hat{q} + 1/2$ and $q = (2\hat{q} + 1)/4$. The plots display that $\delta_D$ and $\delta_E$ are positive-valued along $\hat{q}$, indicating that $\hat{D}_n > \hat{D}_n^*$ and $\hat{E}_n > \hat{E}_n^*$.

examination it can be seen that $\hat{E}_n > \hat{E}_n^*$ and $\hat{D}_n > \hat{D}_n^*$ (see Figure 2), and hence $C_n^{*,\mathrm{adapt}}(\alpha)$ has a smaller radius than $C_n^{\mathrm{adapt}}(\alpha)$.

It can also be observed that both MACR1 and MACR2 yield coverage proportions very close to the nominal level $1-\alpha$ for a variety of values of $\alpha$. This means that, under both priors P1 and P2, MACR matches the frequentist coverage better than ACR.

6.2. Smoothing Spline Model. In this section, we investigate the frequentist coverage probabilities of the credible region (5.5), the modified credible region (5.6), and the credible intervals (5.17) and (5.18) for the two functionals. Suppose

(6.1)  $Y_i = f_0(X_i) + \epsilon_i, \quad i = 1, 2, \ldots, n,$

where the $X_i$ are iid uniform over $[0,1]$, and the $\epsilon_i$ are iid standard normal random variables independent of the $X_i$. The true regression function was chosen as $f_0(x) = 3\beta_{30,17}(x) + 2\beta_{3,11}(x)$, where $\beta_{a,b}$ is the probability density function of Beta$(a,b)$. Figure 5 displays the true function $f_0$, from which it can be seen that $f_0$ has both peaks and troughs.
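The data-generating mechanism of model (6.1) can be reproduced directly; a minimal sketch (the function names are ours) using `scipy.stats.beta` for the two Beta densities:

```python
import numpy as np
from scipy.stats import beta

def f0(x):
    """True regression function of Section 6.2:
    f0(x) = 3*b_{30,17}(x) + 2*b_{3,11}(x), where b_{a,b} is the Beta(a,b) pdf."""
    return 3 * beta.pdf(x, 30, 17) + 2 * beta.pdf(x, 3, 11)

def simulate(n, seed=None):
    """One sample of size n from model (6.1): X_i ~ Unif[0,1], eps_i ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)
    return x, f0(x) + rng.standard_normal(n)
```

Plotting `f0` over a grid on $[0,1]$ reproduces the peak-and-trough shape shown in Figure 5.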

Fig 3. Coverage proportion (CP) of ACR1 and MACR1, constructed by (S.2) and (S.3) respectively, based on various sample sizes and credibility levels. The dotted red line indicates the position of the $1-\alpha$ credibility level.

To examine the coverage property of the credible regions, we chose $n$ ranging from 20 to 2000. For each $n$, 1,000 independent trials were conducted. In each trial, a credible region (CR) and a modified credible region (MCR) were constructed. The proportions of the CRs and MCRs covering $f_0$ were calculated and plotted against the sample size. Results are summarized in Figure 6. It can be seen that for different credibility levels $1-\alpha$, the coverage proportions (CP) of the CR are greater than $1-\alpha$ when $n$ is large enough; they even tend to one for large sample sizes. However, the CP of the MCR tends to exactly $1-\alpha$ as $n$ increases. Thus, the numerical results confirm the theory developed in Sections 5.2 and 5.3.

To examine the coverage property of the credible intervals, we chose $n = 2^5, 2^7, 2^8, 2^9$ to demonstrate the trend of coverage along with increasing

Fig 4. Coverage proportion (CP) of ACR2 and MACR2, constructed by (S.6) and (S.7) respectively, based on various sample sizes and credibility levels. The dotted red line indicates the position of the $1-\alpha$ credibility level.

sample sizes. For the pointwise functional, we considered $F = F_z$ for 15 evenly-spaced points $z$ in $[0,1]$. For each $z$, a credible interval based on (5.17) was constructed. We then calculated the coverage probability of this interval based on 1,000 independent experiments, that is, the empirical proportion of the intervals (among the 1,000 intervals) that cover the true value $f_0(z)$. Figure 7 summarizes the results for different credibility levels, with coverage probabilities plotted against the corresponding points $z$. It can be seen that the coverage probability of the pointwise intervals is slightly larger than $1-\alpha$ for all $\alpha$ and $n$ considered, which is consistent with Proposition 5.10 (i), except at the points near the right peak of $f_0$. Indeed, at those points near the right peak, under-coverage is observed. This is a common phenomenon in the frequentist literature: peaks and troughs

Fig 5. Plot of the true function f0 used in model (6.1).

may affect the coverage property of the pointwise interval; see [34, 38]. For the integral functional, we considered $F = F_{\omega_{z_0}}$ for $\omega_{z_0}(z) = I(0 \le z \le z_0)$ with 15 evenly-spaced points $z_0$ in $[0,1]$. We evaluated the coverage probability at each $z_0$ based on 1,000 experiments. Figure 8 summarizes the results for different credibility levels, with coverage probabilities plotted against the corresponding points $z_0$. It can be seen that, as $n$ increases, the coverage probability of the integral intervals tends to $1-\alpha$ for all $\alpha$. This phenomenon is consistent with our theory, i.e., Proposition 5.10 (ii).

Acknowledgements. We thank PhD student Meimei Liu at Purdue for help with the simulation study.

REFERENCES

[1] Birkhoff, D. (1908). Boundary value and expansion problems of ordinary linear differential equations. Transactions of the American Mathematical Society, 9, 373–395.
[2] Belitser, E. and Ghosal, S. (2003). Adaptive Bayesian inference on the mean of an infinite dimensional normal distribution. Annals of Statistics, 31, 536–559.
[3] Brown, L. D. and Low, M. G. (1996). Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics, 24, 2384–2398.
[4] Bickel, P. J. and Kleijn, B. J. K. (2012). The semiparametric Bernstein–von Mises theorem. Annals of Statistics, 40, 206–237.
[5] Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be "plugged-in". Annals of Statistics, 31, 1033–1053.

Fig 6. Coverage proportion (CP) of CR and MCR, constructed by (5.5) and (5.6) respectively, based on various sample sizes and credibility levels. The dotted red line indicates the position of the 1 − α credibility level.

[6] Courant, R. and Hilbert, D. (1937). Methods of Mathematical Physics, Vol. 1. Interscience Publishers, Inc., New York.
[7] Cameron, R. H. and Martin, W. T. (1944). Transformations of Wiener integrals under translations. Annals of Mathematics, 45, 386–396.
[8] Castillo, I. (2014). On Bayesian supremum norm contraction rates. Annals of Statistics, 42, 2058–2091.
[9] Castillo, I. (2012). A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probability Theory and Related Fields, 152, 53–99.
[10] Castillo, I. and Nickl, R. (2013). Nonparametric Bernstein–von Mises theorems in Gaussian white noise. Annals of Statistics, 41, 1999–2028.
[11] Castillo, I. and Nickl, R. (2014). On the Bernstein–von Mises phenomenon for nonparametric Bayes procedures. Annals of Statistics, 42, 1941–1969.
[12] Castillo, I. and Rousseau, J. (2014). A Bernstein–von Mises theorem for smooth functionals in semiparametric models. Preprint.

Fig 7. Coverage proportion (CP) of the credible interval for Fz (f0 ) versus z. The dotted red line indicates the position of the 1 − α credibility level.

[13] Cox, D. (1993). An analysis of Bayesian inference for nonparametric regression. Annals of Statistics, 21, 903–923.
[14] Cheng, G. and Kosorok, M. R. (2008). General frequentist properties of the posterior profile distribution. Annals of Statistics, 36, 1819–1853.
[15] Cheng, G. and Kosorok, M. R. (2009). The penalized profile sampler. Journal of Multivariate Analysis, 100, 345–362.
[16] Cheng, G. and Shang, Z. (2013). Joint asymptotics for semi-nonparametric regression models with partially linear structure. http://arxiv.org/abs/1311.2628.
[17] de Jong, P. (1987). A central limit theorem for generalized quadratic forms. Probability Theory and Related Fields, 75, 261–277.
[18] Freedman, D. (1999). On the Bernstein–von Mises theorem with infinite-dimensional parameters. Annals of Statistics, 27, 1119–1140.
[19] Giné, E. and Nickl, R. (2011). Rates of contraction for posterior distributions in $L^r$-metrics, $1 \le r \le \infty$. Annals of Statistics, 39, 2795–3443.

Fig 8. Coverage proportion (CP) of the credible interval for Fωz0 (f0 ) versus z0 . The dotted red line indicates the position of the 1 − α credibility level.

[20] Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999). Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics, 27, 143–158.
[21] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28, 500–531.
[22] Hájek, J. (1962). On linear statistical problems in stochastic processes. Czechoslovak Mathematical Journal, 12, 404–444.
[23] Hoffmann-Jørgensen, J., Shepp, L. A. and Dudley, R. M. (1979). On the lower tail of Gaussian seminorms. Annals of Probability, 7, 193–384.
[24] Johnstone, I. M. (2013). Gaussian Estimation: Sequence and Wavelet Models.
[25] Knapik, B. T., van der Vaart, A. W. and van Zanten, J. H. (2011). Bayesian inverse problems with Gaussian priors. Annals of Statistics, 39, 2626–2657.
[26] Knapik, B. T., Szabó, B. T., van der Vaart, A. W. and van Zanten, J. H. (2013). Bayes procedures for adaptive inference in inverse problems for the white noise model. Preprint.


[27] Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
[28] Kuelbs, J., Li, W. V. and Linde, W. (1994). The Gaussian measure of shifted balls. Probability Theory and Related Fields, 98, 143–162.
[29] Leahu, H. (2011). On the Bernstein–von Mises phenomenon in the Gaussian white noise model. Electronic Journal of Statistics, 5, 373–404.
[30] Li, W. V. (1999). A Gaussian correlation inequality and its applications to small ball probabilities. Electronic Communications in Probability, 4, 111–118.
[31] Morris, C. N. (1982). Natural exponential families with quadratic variance functions. Annals of Statistics, 10, 65–80.
[32] Messer, K. and Goldstein, L. (1993). A new class of kernels for nonparametric curve estimation. Annals of Statistics, 21, 179–195.
[33] Nickl, R. (2014). Discussion of "Frequentist coverage of adaptive nonparametric Bayesian credible sets". Annals of Statistics, to appear.
[34] Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83, 1134–1143.
[35] Nussbaum, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. Annals of Statistics, 24, 2399–2430.
[36] Rivoirard, V. and Rousseau, J. (2012). Bernstein–von Mises theorem for linear functionals of the density. Annals of Statistics, 40, 1489–1523.
[37] Shang, Z. (2010). Convergence rate and Bahadur type representation of general smoothing spline M-estimates. Electronic Journal of Statistics, 4, 1411–1442.
[38] Shang, Z. and Cheng, G. (2013). Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41, 2608–2638.
[39] Szabó, B., van der Vaart, A. W. and van Zanten, H. (2014). Empirical Bayes scaling of Gaussian priors in the white noise model. Electronic Journal of Statistics, 7, 991–1018.
[40] Szabó, B., van der Vaart, A. W. and van Zanten, H. (2014). Frequentist coverage of adaptive nonparametric Bayesian credible sets. Annals of Statistics.
[41] Tusnády, G. (1977). A remark on the approximation of the sample distribution function in the multidimensional case. Periodica Mathematica Hungarica, 8, 53–55.
[42] Utreras, F. I. (1988). Boundary effects on convergence rates for Tikhonov regularization. Journal of Approximation Theory, 54, 235–249.
[43] van der Vaart, A. W. and van Zanten, J. H. (2007). Bayesian inference with rescaled Gaussian process priors. Electronic Journal of Statistics, 1, 433–448.
[44] van der Vaart, A. W. and van Zanten, J. H. (2008). Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collections: Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, 3, 200–222.
[45] van der Vaart, A. W. and van Zanten, J. H. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36, 1031–1508.
[46] van der Vaart, A. W. and van Zanten, J. H. (2008). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. Annals of Statistics, 37, 2655–2675.
[47] Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics, 13, 1251–1638.
[48] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.

[49] Zhao, L. H. (2000). Bayesian aspects of some nonparametric problems. Annals of Statistics, 28, 532–552.

Supplementary document to NONPARAMETRIC BERNSTEIN-VON MISES PHENOMENON: A TUNING PRIOR PERSPECTIVE

By Zuofeng Shang and Guang Cheng
Department of Statistics, Purdue University

This supplementary document contains the proofs of the main results and additional materials not included in the main manuscript. We organize this document as follows:

• Section S.1 contains proofs in Section 2.1, i.e., proofs of Theorems 2.1 and 2.2.
• Section S.2 contains proofs in Section 2.2, i.e., proofs of Theorem 2.3 and several technical lemmas.
• In Section S.3, we construct adaptive credible regions under priors P1 and P2, and explore their frequentist coverage properties.
• Section S.4 includes some preliminary results in Section 3, e.g., the expression of the Radon–Nikodym derivative of $\Pi_\lambda$ with respect to $\Pi$, together with several limiting results.
• Section S.5 includes results on posterior contraction rates under a type of Sobolev norm, as given in Section 4.
• Section S.6 contains the proof of the nonparametric BvM theorem, i.e., Theorem 5.1 presented in Section 5.1, together with the specification of the probability measure $P_0$.
• Section S.7 includes the proofs of the results in Section 5.2, in particular, the coverage property of the credible regions under a type of Sobolev norm.
• Section S.8 includes the proofs of the results in Section 5.3, in particular, the coverage property of the credible regions under a weaker norm.


• Section S.9 includes the proofs of the results in Section 5.4, in particular, the coverage property of the credible intervals for linear functionals of the regression function. Both the pointwise functional and the integral functional are addressed.

S.1. Proofs in Section 2.1.

Proof of Theorem 2.1. For simplicity, denote $\gamma = 2p/(2q_0 + 1)$.

Proof of Part (1). If $\alpha_0 \ge \gamma$, implying that $h^{-(2q_0+1)} = n^{\alpha_0/\gamma} \ge n$, then for any $\theta_0 \in \ell_2(q_0)$, by the dominated convergence theorem, as $n \to \infty$,

$n\sum_{i=1}^\infty \dfrac{\lambda^2 i^{4p}}{(1+\lambda i^{2p})^2}|\theta_i^0|^2 = nh^{2q_0}\sum_{i=1}^\infty \dfrac{(ih)^{4p-2q_0}}{(1+(ih)^{2p})^2}|\theta_i^0|^2 i^{2q_0} = o(nh^{2q_0}) = o(h^{-1}),$

which leads to

$n\sum_{i=1}^\infty (\hat{\theta}_i - \theta_i^0)^2 = n\sum_{i=1}^\infty \dfrac{(-\lambda i^{2p}\theta_i^0 + n^{-1/2}\epsilon_i)^2}{(1+\lambda i^{2p})^2} = o_P(h^{-1}) + \sum_{i=1}^\infty \dfrac{\epsilon_i^2 - 1}{(1+\lambda i^{2p})^2} + \sum_{i=1}^\infty \dfrac{1}{(1+\lambda i^{2p})^2}.$

It follows by direct calculations that

$E\Big\{\Big|\sum_{i=1}^\infty \dfrac{\epsilon_i^2 - 1}{(1+\lambda i^{2p})^2}\Big|^2\Big\} = \sum_{i=1}^\infty \dfrac{2}{(1+\lambda i^{2p})^4} = O(h^{-1}),$

and therefore $\sum_{i=1}^\infty \frac{\epsilon_i^2 - 1}{(1+\lambda i^{2p})^2} = O_P(h^{-1/2})$. This implies that

$nh\|\theta_0 - \hat{\theta}\|_2^2 = o_P(1) + F_n + h\sum_{i=1}^\infty \dfrac{\epsilon_i^2 - 1}{(1+\lambda i^{2p})^2} = F_n + o_P(1),$

where $F_n = h\sum_{i=1}^\infty \frac{1}{(1+\lambda i^{2p})^2}$. It is easy to see that as $h \to 0$, $E_n \to \int_0^\infty \frac{1}{1+x^{2p}}\,dx$ and $F_n \to \int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx$. So $\inf_{h>0}(E_n - F_n) > 0$. This implies that, with probability approaching one, $\theta_0 \in C_n(\alpha)$.

Proof of Part (2). If $0 < \alpha_0 < \gamma$, then $\alpha_0/(2p) < 1/(2q_0+1) \le 1$ and $2q_0 < 2p/\alpha_0 - 1$. Let $t$ be a constant such that $2q_0 < t < 2p/\alpha_0 - 1$ and $t < 4p - 1$, and let $\theta_i^0 = i^{-(1+t)/2}$. It is easy to see that $\sum_i |\theta_i^0|^2 i^{2q_0} = \sum_i i^{-1-t+2q_0} < \infty$, and hence $\theta_0 = \{\theta_i^0\}_{i=1}^\infty \in \ell_2(q_0)$. Then

$n\sum_{i=1}^\infty \dfrac{\lambda^2 i^{4p}}{(1+\lambda i^{2p})^2}|\theta_i^0|^2 = nh^{t+1}\sum_{i=1}^\infty \dfrac{(ih)^{4p-t-1}}{(1+(ih)^{2p})^2} \approx nh^t \int_0^\infty \dfrac{x^{4p-t-1}}{(1+x^{2p})^2}\,dx.$

3

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

In the meantime, E{|

∞ ∞ X X (ih)2p θi0 i 2 (ih)4p |θi0 |2 | } = (1 + (ih)2p )2 (1 + (ih)2p )4 i=1

i=1

Z ∞ 4p−t−1 ∞ X x (ih)4p i−(t+1) t = ≈h dx, (1 + (ih)2p )4 (1 + x2p )4 0 i=1

implying that

∞ X (ih)2p θi0 i = OP (ht/2 ), (1 + (ih)2p )2 i=1

and E{

∞ X i=1

X 2i 1 } = h−1 , 2p 2 (1 + (ih) ) (1 + (ih)2p )2 i

which further imply that

2i

P∞

b 2 = nh nhkθ0 − θk 2 = nh

i=1 (1+(ih)2p )2

= OP (h−1 ). Therefore,

∞ X (ih)4p |θ0 |2 − 2n−1/2 i (ih)2p θ0 + n−1 2 i

i=1 ∞ X i=1

i

i

(1 + (ih)2p )2 √ λ2 i4p |θi0 |2 + OP ( nh1+t/2 + 1) 2p 2 (1 + λi )

nht+1 (1 + oP (1)) = n1−α0 (t+1)/(2p) (1 + oP (1)). Since 1−α0 (t+1)/2p > 0, the above derivations imply that, with probability √ 2 b / Cn (α). approaching one, nhkθ0 − θk2 > En + Dn hzα , or equivalently, θ0 ∈ Proof of Theorem 2.2. Proof of Part (1). Note that b2 nkθ0 − θk † ∞ X (−λi2p θi0 + n−1/2 i )2 = n di (1 + λi2p )2 i=1

∞ ∞ ∞ X X √ X di (ih)4p |θi0 |2 di (ih)2p θi0 i di 2i = n − 2 n + . (1 + (ih)2p )2 (1 + (ih)2p )2 (1 + (ih)2p )2 i=1

i=1

i=1

It follows by conditions on θ0 and dominated convergence theorem that ∞ ∞ X X di (ih)4p |θi0 |2 (ih)4p−2q0 −1 0 2 2q0 2q0 +1 n . nh |θ | i = o(nh2q0 +1 ) = o(1), (1 + (ih)2p )2 (1 + (ih)2p )2 i i=1

i=1

4

Z. SHANG AND G. CHENG

where the last equality holds since α0 ≥ 2p/(2q0 + 1) implying nh2q0 +1 = n1−α0 (2q0 +1)/(2p) = O(1). On the other side, it follows by similar calculations that ∞ ∞ X √ X di (ih)2p θi0 i 2 d2i (ih)4p |θi0 |2 E{| n | } = n = o(1), (1 + (ih)2p )2 (1 + (ih)2p )4 i=1

i=1

P∞ √ P di (ih)2p θi0 i which implies n ∞ i=1 (1+(ih)2p )2 = oP (1). In the end, as n → ∞, i=1 P 2 in distribution. Therefore, as n → ∞, converges to ∞ d i i i=1

di 2i (1+(ih)2p )2

∞ X b 2 ≤ cα ) → P ( Pθ0 (θ0 ∈ Cn† (α)) = Pθ0 (nkθ0 − θk di 2i ≤ cα ) = 1 − α. † i=1

Proof of Part (2). Now suppose $0 < \alpha_0 < 2p/(2q_0+1)$, and hence $\alpha_0(2q_0+1)/(2p) < 1$. Choose $t>0$ with $\alpha_0(2q_0+1+t)/(2p) < 1$ and $t < 4p-2q_0-2$. Define $\theta_i^0 = i^{-(2q_0+1+t)/2}$ for $i=1,2,\dots$. It is easy to see that $\theta^0 = \{\theta_i^0\}_{i=1}^{\infty}\in\ell_2(q_0)$. By the choice of $d_i$ we have
$$ n\sum_{i=1}^{\infty}\frac{d_i(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^2} \asymp n\sum_{i=1}^{\infty}\frac{i^{-1}(\log 2i)^{-\tau}(ih)^{4p}i^{-2q_0-1-t}}{(1+(ih)^{2p})^2} = nh^{2q_0+2+t}\sum_{i=1}^{\infty}\frac{(\log 2ih+\log h^{-1})^{-\tau}(ih)^{4p-2q_0-2-t}}{(1+(ih)^{2p})^2} \asymp nh^{2q_0+1+t}(\log n)^{-\tau}\int_0^{\infty}\frac{x^{4p-2q_0-2-t}}{(1+x^{2p})^2}\,dx. $$
On the other side,
$$ E\Big\{\Big|\sqrt n\sum_{i=1}^{\infty}\frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2}\Big|^2\Big\} \lesssim n\sum_{i=1}^{\infty}\frac{d_i(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^2} \asymp nh^{2q_0+1+t}(\log n)^{-\tau}, $$
which implies $\sqrt n\sum_{i=1}^{\infty}\frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2} = O_P\big(\sqrt{nh^{2q_0+1+t}(\log n)^{-\tau}}\big)$. Since
$$ nh^{2q_0+1+t}(\log n)^{-\tau} = n^{1-\alpha_0(2q_0+1+t)/(2p)}(\log n)^{-\tau} \to \infty, $$
we have
$$ n\|\theta^0-\hat\theta\|_\dagger^2 \asymp nh^{2q_0+1+t}(\log n)^{-\tau}(1+o_P(1)) + O_P(1). $$
Therefore, with probability approaching one, $n\|\theta^0-\hat\theta\|_\dagger^2 > c_\alpha$, i.e., $\theta^0\notin C_n^{\dagger}(\alpha)$. □
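As an informal numerical illustration of the limit used in the proof of Theorem 2.1(1) (not part of the original argument), the following Python sketch simulates the truncated sequence model $Y_i = \theta_i^0 + \epsilon_i/\sqrt n$ with the shrinkage estimator $\hat\theta_i = Y_i/(1+\lambda i^{2p})$ and compares $nh\|\theta^0-\hat\theta\|_2^2$ with $F_n = \sum_i h/(1+\lambda i^{2p})^2$. All concrete choices (the values of $n$, $p$, $q_0$, the smooth truth $\theta_i^0 = i^{-(q_0+1)}$, and the truncation level) are illustrative assumptions.

```python
import numpy as np

# Monte Carlo sketch of the truncated Gaussian sequence model.
# Illustrative parameter choices, not taken from the paper.
rng = np.random.default_rng(0)
n, p, q0 = 2000, 2, 2
alpha0 = 2 * p / (2 * q0 + 1)        # boundary case alpha0 = gamma
h = n ** (-alpha0 / (2 * p))         # tuning bandwidth, lambda = h^{2p}
lam = h ** (2 * p)
N = 20000                            # truncation level of the infinite sums
i = np.arange(1, N + 1)
theta0 = i ** (-(q0 + 1.0))          # an illustrative truth in ell_2(q0)

# F_n, the limit of nh * ||theta0 - thetahat||_2^2 in Theorem 2.1(1)
F_n = np.sum(h / (1 + lam * i ** (2 * p)) ** 2)

stats = []
for _ in range(200):
    Y = theta0 + rng.standard_normal(N) / np.sqrt(n)
    thetahat = Y / (1 + lam * i ** (2 * p))     # posterior mode under Prior P1
    stats.append(n * h * np.sum((theta0 - thetahat) ** 2))

print(np.mean(stats), F_n)   # the two quantities should be close
```

For moderate $n$ the Monte Carlo average sits slightly above $F_n$ because of the $o(1)$ bias term, shrinking as $n$ grows.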


S.2. Proofs in Section 2.2. The following Lemma S.1 shows that both $E_{*n}$ and $D_{*n}$ are asymptotically positive constants. The limiting distribution of $Z_{*n}$ is standard normal, as specified in Lemma S.2.

Lemma S.1. For any constant $\delta\ge1$, we have
$$ l_{\alpha_0} \equiv \lim_{n\to\infty}\sum_{i=1}^{\infty}\frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^{\delta}} = \begin{cases}\int_0^{\infty}\frac{dx}{(1+x^{2p})^{\delta}}, & \alpha_0>q/p,\\[2pt] \int_0^{\infty}\frac{dx}{(1+x^{2q})^{\delta}}, & 0<\alpha_0<q/p,\\[2pt] \int_0^{\infty}\frac{dx}{(1+x^{2q}+x^{2p})^{\delta}}, & \alpha_0=q/p.\end{cases} $$

Lemma S.2. As $n\to\infty$, $Z_{*n}\stackrel d\to N(0,1)$.

Proof of Lemma S.1. If $\alpha_0 > q/p$, then $\alpha_1 = 1/(2p)$, and hence
$$ \sum_{i=1}^{\infty}\frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^{\delta}} = \sum_{i=1}^{\infty}\frac{n^{-1/(2p)}}{(1+(in^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(in^{-1/(2p)})^{2p})^{\delta}} \le \sum_{i=1}^{\infty}\int_{i-1}^{i}\frac{n^{-1/(2p)}\,dx}{(1+(xn^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(xn^{-1/(2p)})^{2p})^{\delta}} = \int_0^{\infty}\frac{dx}{(1+x^{2q}n^{-\alpha_0+q/p}+x^{2p})^{\delta}} \xrightarrow{n\to\infty} \int_0^{\infty}\frac{dx}{(1+x^{2p})^{\delta}}, $$
where the last limit follows by the dominated convergence theorem. Similarly, we have
$$ \sum_{i=1}^{\infty}\frac{n^{-1/(2p)}}{(1+(in^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(in^{-1/(2p)})^{2p})^{\delta}} \ge \sum_{i=1}^{\infty}\int_{i}^{i+1}\frac{n^{-1/(2p)}\,dx}{(1+(xn^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(xn^{-1/(2p)})^{2p})^{\delta}} = \int_{n^{-1/(2p)}}^{\infty}\frac{dx}{(1+x^{2q}n^{-\alpha_0+q/p}+x^{2p})^{\delta}} \xrightarrow{n\to\infty} \int_0^{\infty}\frac{dx}{(1+x^{2p})^{\delta}}. $$
This proves the result for $\alpha_0 > q/p$. The proofs in the other cases are similar and omitted. □

Proof of Lemma S.2. The logarithm of the moment generating function of $Z_{*n}$ satisfies, for $t\in(-1/2,1/2)$,
$$ \log\big(E\{\exp(tZ_{*n})\}\big) = \log\Big(E\Big\{\exp\Big(tD_{*n}^{-1/2}\sum_{i=1}^{\infty}\frac{n^{-\alpha_1/2}(\eta_i^2-1)}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)\Big\}\Big) = \sum_{i=1}^{\infty}\Big(-\frac12\log\Big(1-\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)-\frac{tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big) $$
$$ = \sum_{i=1}^{\infty}\Big(-\frac12\Big(-\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}-\frac12\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^2+O\Big(\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^3\Big)\Big)-\frac{tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big) $$
$$ = \frac{t^2}{2}+O\Big(\sum_{i=1}^{\infty}\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^3\Big). $$
By Lemma S.1,
$$ \sum_{i=1}^{\infty}\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^3 = (2tD_{*n}^{-1/2})^3\,n^{-\alpha_1/2}\sum_{i=1}^{\infty}\frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^3} \to 0, $$

X n−α1 2tD∗n n−α1 /2 3 −1/2 3 −α1 /2 ) = (2tD ) n → 0, ( ∗n 1 + λi2q + n−1 i2p (1 + λi2q + n−1 i2p )3 i=1

therefore, E{exp(tZ∗n )} → exp(t2 /2), which proves the desired result. Proof of Theorem 2.3. Before formal proofs, we notice a simple fact: for any i ≥ 1, (S.1) λi2q = (n−α0 /(2q) i)2q ≤ (n−α1 i)2q , n−1 i2p = (n−1/(2p) i)2p ≤ (n−α1 i)2p . Meanwhile, by direct examinations we have

=

kθb∗ − θ0 k22 √ ∞ X (−(λi2q + n−1 i2p )θ0 + i / n)2 i

(1 + λi2q + n−1 i2p )2

i=1

=

∞ X (λi2q + n−1 i2p )2 |θ0 |2 i=1

(1 + λi2q +

+n−1

∞ X i=1

i n−1 i2p )2

− 2n

2i (1 + λi2q + n−1 i2p )2

−1/2

∞ X (λi2q + n−1 i2p )θi0 i (1 + λi2q + n−1 i2p )2 i=1

.

7

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

Define the three terms on the right side of the above equation to be T1 , T2 , T3 . It follows immediately from Lemma S.1 that E{T3 } = O(n−1+α1 ), implying that T3 = OP (n−1+α1 ). Proof of Part (1a). If p − 1/2 ≤ q0 < 2q and α0 ≥ 2q/(2q0 + 1), then we have (2q0 + 1)α1 ≥ 1 since both 1/(2p) ≥ 1/(2q0 + 1) and α0 /(2q) ≥ 1/(2q0 + 1) are true. Therefore, by (S.1), that the function x 7→ x/(1 + x) is increasing on x ≥ 0, and dominated convergence theorem, we get that n1−α1 T1 2 ∞ X λi2q + n−1 i2p 1−α1 = n |θi0 |2 1 + λi2q + n−1 i2p i=1 2 ∞ X (n−α1 i)2q + (n−α1 i)2p 1−α1 |θi0 |2 ≤ n 1 + (n−α1 i)2q + (n−α1 i)2p i=1

= n1−α1 −2α1 q0 = o(n

∞ X

(n−α1 i)4q−2q0

i=1 1−α1 −2α1 q0

1 + (n−α1 i)2(p−q) 1 + (n−α1 i)2q + (n−α1 i)2p

!2 |θi0 |2 i2q0

).

We also note that E{T22 }

= 4n

−1

∞ X i=1

(λi2q + n−1 i2p )2 |θ0 |2 ≤ 4n−1 T1 , (1 + λi2q + n−1 i2p )4 i

1/2

therefore, T2 = oP (n−1/2 T1 ). Since 1/2−α1 −α1 q0 < (1−α1 −2α1 q0 )/2 ≤ 0, we get n1−α1 T2 = oP (n1/2−α1 −α1 q0 ). To handle T3 , note that n

1−α1

T3 = n

−α1

∞ X i=1

∞

X 2i − 1 n−α1 + , (1 + λi2q + n−1 i2p )2 (1 + λi2q + n−1 i2p )2 i=1

and by Lemma S.1, E{|

∞ X i=1

∞

X 2i − 1 2 |2 } = = O(nα1 ), 2q −1 2p 2 2q (1 + λi + n i ) (1 + λi + n−1 i2p )4 i=1

which implies n1−α1 T3 = OP (n−α1 /2 ) + the above, we get that n1−α1 kθ0 − θb∗ k22 = n1−α1 (T1 + T2 + T3 ) =

n−α1 i=1 (1+λi2q +n−1 i2p )2 .

P∞

Combining

∞ X

n−α1 + oP (1). (1 + λi2q + n−1 i2p )2

i=1


Since the leading term in the above equation is asymptotically strictly smaller than $E_{*n}$, we have, with probability approaching one, $\theta^0\in C_{*n}(\alpha)$.

Proof of Part (1b). If $q \le q_0 < p-1/2$ or $0 < \alpha_0 < 2q/(2q_0+1)$, then $\alpha_1 < 1/(2q_0+1)$, since either $1/(2p) < 1/(2q_0+1)$ or $\alpha_0/(2q) < 1/(2q_0+1)$. This implies that $1/\alpha_1-2q_0 > 1$. Since $4q-2q_0 > 0$, we can choose a real number $t$ such that $1 < t < \min\{2,\,1/\alpha_1-2q_0,\,4q-2q_0+1\}$. Let $|\theta_i^0|^2 = i^{-2q_0-t}$ for $i=1,2,\dots$, so $\theta^0\in\ell_2(q_0)$. Observe that $\lambda i^{2q}+n^{-1}i^{2p} \ge (n^{-\alpha_1}i)^{2q}$ for $i\ge n^{\alpha_1}$, which leads to
$$ n^{1-\alpha_1}T_1 \ge n^{1-\alpha_1}\sum_{i\ge n^{\alpha_1}}\Big(\frac{(n^{-\alpha_1}i)^{2q}}{1+(n^{-\alpha_1}i)^{2q}}\Big)^2 i^{-2q_0-t} = n^{1-\alpha_1(2q_0+t)}\cdot n^{-\alpha_1}\sum_{i\ge n^{\alpha_1}}\frac{(n^{-\alpha_1}i)^{4q-2q_0-t}}{(1+(n^{-\alpha_1}i)^{2q})^2} \asymp n^{1-\alpha_1(2q_0+t)}\int_1^{\infty}\frac{x^{4q-2q_0-t}}{(1+x^{2q})^2}\,dx. $$
Note the integral on the right of the last equation is finite since $4q-2q_0-t > -1$ and $2q_0+t > 1$. On the other hand,
$$ E\{T_2^2\} = 4n^{-1}\sum_{i=1}^{\infty}\frac{(\lambda i^{2q}+n^{-1}i^{2p})^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^4}\,i^{-2q_0-t} \le 4n^{-1}\sum_{i=1}^{\infty}\frac{\big((n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p}\big)^2}{\big(1+(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p}\big)^4}\,i^{-2q_0-t} $$
$$ = 4n^{-1-\alpha_1(2q_0+t)+\alpha_1}\cdot n^{-\alpha_1}\sum_{i=1}^{\infty}\frac{(n^{-\alpha_1}i)^{4q-2q_0-t}\big(1+(n^{-\alpha_1}i)^{2(p-q)}\big)^2}{\big(1+(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p}\big)^4} \asymp 4n^{-1-\alpha_1(2q_0+t)+\alpha_1}\int_0^{\infty}\frac{x^{4q-2q_0-t}(1+x^{2(p-q)})^2}{(1+x^{2q}+x^{2p})^4}\,dx, $$
where the integral in the last expression is finite because $4q-2q_0-t > -1$, $p > q$ and $2q_0+t > 1$. Therefore, $n^{1-\alpha_1}T_2 = O_P(n^{(1-\alpha_1-\alpha_1(2q_0+t))/2})$. Since $\alpha_1 < 1/(2q_0+1)$ and $1 < t < 2$, it can be seen that $\alpha_1(2q_0+t-1) < 1$. This implies that $(1-\alpha_1-\alpha_1(2q_0+t))/2 < 1-\alpha_1(2q_0+t)$, and hence $n^{1-\alpha_1}T_2 = o_P(n^{1-\alpha_1}T_1)$. The above analysis of $T_1$, $T_2$ leads to
$$ n^{1-\alpha_1}\|\hat\theta_*-\theta^0\|_2^2 \ge n^{1-\alpha_1}T_1(1+o_P(1)). $$


Since $\alpha_1(2q_0+t) < 1$, as $n\to\infty$, $n^{1-\alpha_1}T_1\to\infty$. So, with probability approaching one, $\theta^0\notin C_{*n}(\alpha)$, which completes the proof for $C_{*n}(\alpha)$. The proof for $C_{*n}^{\dagger}(\alpha)$ is similar, so we only sketch it briefly.

Proof of Part (2a). Similar to the proof of Part (1a), using $4q > 2q_0+1$ it can be shown that
$$ n\sum_{i=1}^{\infty}\frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} = o(n^{1-\alpha_1(2q_0+1)}) = o(1) $$
and
$$ \sqrt n\sum_{i=1}^{\infty}\frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} = o_P(1). $$
Hence,
$$ n\|\theta^0-\hat\theta_*\|_\dagger^2 = \sum_{i=1}^{\infty}\frac{d_i\epsilon_i^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} + o_P(1) \stackrel d\to \sum_{i=1}^{\infty}d_i\epsilon_i^2, $$
implying $P_{\theta^0}(\theta^0\in C_{*n}^{\dagger}(\alpha))\to1-\alpha$.

Proof of Part (2b). If $q\le q_0<p-1/2$ or $0<\alpha_0<2q/(2q_0+1)$, then we have $\alpha_1(2q_0+1) < 1$; see the proof of Part (1b). For any $t>1$, let $|\theta_i^0|^2 = i^{-(2q_0+1)}(\log 2i)^{-t}$. It is easy to see that $\theta^0\in\ell_2(q_0)$. Following the proof of Part (1b) and using $4q > 2q_0+1$, it can be shown that
$$ n\sum_{i=1}^{\infty}\frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} \gtrsim n^{1-\alpha_1(2q_0+1)}(\log n)^{-\tau-t}, $$
$$ \sqrt n\sum_{i=1}^{\infty}\frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} = O_P\big(n^{(1-\alpha_1(2q_0+1))/2}\big). $$
Therefore,
$$ n\|\theta^0-\hat\theta_*\|_\dagger^2 = n\sum_{i=1}^{\infty}\frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} - 2\sqrt n\sum_{i=1}^{\infty}\frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} + \sum_{i=1}^{\infty}\frac{d_i\epsilon_i^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} \gtrsim n^{1-\alpha_1(2q_0+1)}(\log n)^{-\tau-t}(1+o_P(1)) + O_P(1). $$
The leading term approaches infinity, which implies that, with probability approaching one, $\theta^0\notin C_{*n}^{\dagger}(\alpha)$. □
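As an informal sanity check of Lemma S.1 (not part of the original argument), the limit in the case $\alpha_0 > q/p$ can be verified numerically: the truncated sum should approach $\int_0^\infty (1+x^{2p})^{-\delta}dx$, which has the closed form $\Gamma(1/m)\Gamma(\delta-1/m)/(m\,\Gamma(\delta))$ with $m=2p$. The parameter values and truncation level below are illustrative assumptions.

```python
import math
import numpy as np

# Numerical check of the Lemma S.1 limit when alpha0 > q/p (alpha1 = 1/(2p)).
# Illustrative parameter choices only.
p, q, delta = 2.0, 1.0, 2.0
alpha0 = 0.8                      # satisfies alpha0 > q/p = 0.5
alpha1 = 1 / (2 * p)

def partial_sum(n, N=10**6):
    """Truncated version of sum_i n^{-alpha1}/(1 + lam*i^{2q} + i^{2p}/n)^delta."""
    i = np.arange(1, N + 1, dtype=float)
    lam = n ** (-alpha0)
    return np.sum(n ** (-alpha1) / (1 + lam * i ** (2 * q) + i ** (2 * p) / n) ** delta)

# closed form: int_0^inf (1+x^m)^{-s} dx = Gamma(1/m)Gamma(s-1/m)/(m*Gamma(s))
m = 2 * p
limit = math.gamma(1 / m) * math.gamma(delta - 1 / m) / (m * math.gamma(delta))
print(partial_sum(10**6), limit)   # the two values should nearly agree
```

The agreement improves as $n$ grows, since the discretization step $n^{-\alpha_1}$ and the vanishing $x^{2q}$-term both shrink.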


S.3. Adaptive credible regions under priors P1 and P2. In this section, we construct a set of adaptive credible regions that do not rely on the unknown regularity q0 . The main idea is to replace q0 by a proper estimator, namely qb with qb ≥ 0 a.s. This is different from the existing adaptive framework for proving contraction rates; see [2]. For simplicity, we choose p = qb+1 in Prior P1, although other choices satisfying p > qb + 1/2 also apply in our b 2p ), procedure. The corresponding posterior mode θb becomes θbni = Yi /(1+ λi −2p/(2b q +1) b=n where for simplicity we choose λ . The (1−α) adaptive credible sets (parallel to (2.3) and (2.4)) are stated as v q u u bn + D b nb E hz t α adapt ∞ b (S.2) Cn (α) = θ = {θi }i=1 ∈ `2 : kθ − θk2 ≤ , nb h o n p b c /n , Cn†adapt (α) = θ = {θi }∞ ∈ ` : kθ − θk ≤ α 2 † i=1

(S.3)

b n, E bn are defined similar as (2.2) by replacing where b h = n−1/(2bq+1) and D h, λ and q with their estimators. theorem S.1. Suppose that Yi ’s are generated from model (2.1) with θ0 ∈ `2 (q0 ) for some unknown q0 ≥ 0, and the Sobolev rectangle property holds: supi≥1 |θi0 |2 i2q0 +1 < ∞. Let qb be any nonnegative-valued estimator of q0 satisfying qb − q0 = oP (1/ log n), and set p = qb + 1 in Prior P1. Then as †adapt n → ∞, P (θ ∈ Cnadapt (α)|{Yi }∞ (α)|{Yi }∞ i=1 ) → 1−α and P (θ ∈ Cn i=1 ) → 1 − α in probability. Furthermore, with probability approaching one, θ0 ∈ Cnadapt (α); with probability approaching 1 − α, θ0 ∈ Cn†adapt (α). Proof of Theorem S.1. Define h0 = n−1/(2q0 +1) , p0 = q0 + 1, λ0 = Let Dn0 , En0 , Zn0 be defined as (2.2) with (h, λ, p) therein replaced by (h0 , λ0 , p0 ). To simplify the notation, in the rest of the proofs we still use h b respectively, where recall that b and λ to denote b h and λ h = n−1/(2bq+1) and −2p/(2b q +1) b=n λ are the empirical version of the tuning parameters. By (??) √ b 2 = h−1/2 En + Dn1/2 Zn , where Dn , En , Zn are defined as we have n hkθ − θk 2 (2.2). It can be shown that 0 h2p 0 .

Dn − Dn0 = oP (1), En − En0 = oP (1), Zn − Zn0 = oP (1). To see this, note that p = qb + 1 → p0 in probability and h → 0. Hence Z ∞ Z ∞ ∞ Z i+1 X h 1 1 in probability Dn ≥ 2 dx = 2 dx −→ 2 dx, 2p 2 2p 2 (1 + (xh) ) (1 + x2p0 )2 i h (1 + x ) 0 i=1

11

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM ∞ Z X

Dn ≤ 2

i=1

i

i−1

h dx = 2 (1 + (xh)2p )2

∞

Z 0

1 in probability dx −→ 2 (1 + x2p )2

Z

∞

0

1 dx. (1 + x2p0 )2

Therefore, Dn −Dn0 = oP (1) holds. The proof for En −En0 = oP (1) is similar. The proof of Zn − Zn0 = oP (1) is a bit nontrivial. Observe that E{|Zn − Zn0 |2 } = E{E{|Zn − Zn0 |2 |b q }} √ √ ∞ X h h0 − )2 } = 2E{ ( 2p 1 + λi 1 + λ0 i2p0 i=1

∞ X = 2E{ i=1

√ ∞ ∞ X X h hh0 h0 } − 4E{ }+2 . 2p 2 2p 2p 0 (1 + λi ) (1 + λi )(1 + λ0 i ) (1 + λ0 i2p0 )2 i=1

i=1

It is easy to see that Z E{

∞

h

Z ∞ ∞ X 1 h 1 dx} ≤ E{ } ≤ E{ dx}. 2p 2 2p 2 (1 + x ) (1 + λi ) (1 + x2p )2 0 i=1

R∞ R∞ 1 1 dx → Since p ≥ 1, we have 0 (1+x12p )2 dx ≤ 1+ 1 (1+x 2 )2 dx. Since 0 (1+x2p )2 R∞ 1 0 (1+x2p0 )2 dx in probability, it follows by bounded convergence theorem R∞ R∞ that E{ 0 (1+x12p )2 dx} → 0 (1+x12p0 )2 dx. Since h ≤ 1 and h → 0 in probRh ability, by bounded convergence theorem, E{ 0 (1+x12p )2 dx} ≤ E{h} → 0. R∞ P 1 h Therefore, E{ ∞ i=1 (1+λi2p )2 } → 0 (1+x2p0 )2 dx. We also note that r Z ∞ √ ∞ X 1 hh0 h E{ } ≤ E{ dx}, 2p 2p 2p 0 (1 + λi )(1 + λ0 i ) h0 0 (1 + x (h/h0 )2p )(1 + x2p0 ) R∞

i=1

and E{

∞ X i=1

√

hh0 } ≥ E{ 2p (1 + λi )(1 + λ0 i2p0 )

r

h h0

Z

∞

h0

1 (1 +

x2p (h/h0 )2p )(1

+ x2p0 )

dx}.

It can be easily seen that r Z h 0 p h 1 dx} ≤ E{ E{ hh0 } → 0. h0 0 (1 + x2p (h/h0 )2p )(1 + x2p0 ) Since qb − q0 = oP (1/ log n), q it Ris easy to show that h/h0 → 1Rin probability. ∞ ∞ Then it can be seen that hh0 0 (1+x2p (h/h 1)2p )(1+x2p0 ) dx → 0 (1+x12p0 )2 dx 0

12

Z. SHANG AND G. CHENG

in probability, and r Z ∞ h 1 dx h0 0 (1 + x2p (h/h0 )2p )(1 + x2p0 ) √ Z ∞ hh0 = dx 2p (1 + λx )(1 + λ0 x2p0 ) 0 Z Z Z ∞ 1 ∞ h h0 1 ∞ 1 ≤ dx + dx ≤ dx, a.s. 2p 2 2p 2 0 2 0 (1 + λx ) 2 0 (1 + λ0 x ) (1 + x2 )2 0 It follows by bounded convergence theorem that ∞ X E{ i=1

√

hh0 }→ 2p (1 + λi )(1 + λ0 i2p0 )

Z

∞

0

1 dx. (1 + x2p0 )2

This shows that E{|Zn − Zn0 |2 } → 0, and hence, Zn − Zn0 = oP (1). Since d

d

Zn0 → N (0, 1), we have Zn → N (0, 1). Therefore, P (θ ∈ Cnadapt (α)|{Yi }∞ i=1 ) p √ 2 −1/2 b ≤h En + Dn zα |{Yi }∞ = P (n hkθ − θk 2 i=1 ) = P (Zn ≤ zα ) → 1 − α. To the end of the proof we show that P (θ0 ∈ Cnadapt (α)) → 1. Define 0 }∞ with θ b0 = Yi /(1 + λ0 i2p0 ). Following the proof of Theorem θb0 = {θbni i=1 ni P h0 0 2.1 (1), nh0 kθ0 − θb k22 = Fn0 + oP (1), where Fn0 = ∞ i=1 (1+λ0 i2p0 )2 . Since h/h0 = 1 + oP (1), we have nhkθ0 − θb0 k2 = F 0 + oP (1). On the other hand, 2

n

for any x > 1 the inequality |xα − 1| ≤ (log x)x|α| |α| holds for any α ∈ R.

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

13

Then we have b2 kθb0 − θk 2 2 ∞ X 1 1 2 = − Yi 1 + λ0 i2p0 1 + λi2p i=1 ∞ X

=

Yi2 λ20 i4p0

i=1 ∞ X

=

( λλ0 i2(p−p0 ) − 1)2 (1 + λ0 i2p0 )2 (1 + λi2p )2 1 (2b q +1)(2q0 +1)

i)2(bq−q0 ) − 1)2 (1 + λ0 i2p0 )2 (1 + λi2p )2

((n Yi2 λ20 i4p0

i=1

≤ 4|b q − q0 |

2

∞ X

(log(n Yi2 λ20 i4p0

1 (2b q +1)(2q0 +1)

i=1 2

= 4|b q − q0 | n 2

≤ 8|b q − q0 | n

8p|b q −q0 | (2b q +1)(2q0 +1)

8p|b q −q0 | (2b q +1)(2q0 +1)

∞ X i=1 ∞ X

2p ( q+1)(2q 0 +1) 2 2 4p0 (2b Yi λ 0 i

8p|b q −q0 |

−1

log n + log (ih0 ))2 (ih0 )4|bq−q0 |

(1 + λ0 i2p0 )2 (1 + λi2p )2

2p ( q+1)(2q 0 +1) 0 2 (2b |θi |

log n + log (ih0 ))2 (ih0 )4p0 +4|bq−q0 |

(1 + λ0 i2p0 )2 (1 + λi2p )2

i=1

+8|b q − q0 |2 n (2bq+1)(2q0 +1)

1

i))2 (n (2bq+1)(2q0 +1) i)4|bq−q0 | (1 + λ0 i2p0 )2 (1 + λi2p )2

∞ X i=1

2i

2p ( (2bq+1)(2q log n + log (ih0 ))2 (ih0 )4p0 +4|bq−q0 | 0 +1)

(1 + λ0 i2p0 )2 (1 + λi2p )2

Denote the two terms on the right side of the last inequality to be T1 , T2 . Since qb − q0 = oP (1/ log n), with probability approaching one, |b q − q0 | ≤ 1/ log n. In what follows we assume |b q − q0 | ≤ 1/ log n. To address T1 , T2 , we introduce an elementary inequality: for any x > 0, (S.4)

x4p0 +4|bq−q0 | ≤ 1 + x8/ log n . (1 + x2p )2

We briefly show (S.4). If 0 < x ≤ 1, then the left hand side of (S.4) is always less than 1, so the inequality trivially holds. If x > 1, then x4p0 +4|bq−q0 | ≤ x4|bq−q0 |+4(p0 −p) ≤ x8|bq−q0 | < 1 + x8/ log n . (1 + x2p )2 8p|b q −q0 |

8p Using (S.4) and n (2bq+1)(2q0 +1) ≤ exp( (2bq+1)(2q ) = OP (1), we have 0 +1) 2 −1

T2 = OP (1) · |b q − q0 | n

∞ X i=1

2i

(log n + log(ih0 ))2 (1 + (ih0 )8/ log n ) . (1 + (ih0 )2p0 )2

.

14

Z. SHANG AND G. CHENG

Since E{

∞ X

2i

i=1

=

(log n + log(ih0 ))2 (1 + (ih0 )8/ log n ) } (1 + (ih0 )2p0 )2

∞ X (log n + log(ih0 ))2 (1 + (ih0 )8/ log n )

(1 + (ih0 )2p0 )2

i=1

≤ h−1 0

∞

Z 0

}

(log n + log x)2 (1 + x8/ log n ) 2 dx . h−1 0 (log n) . (1 + x2p0 )2

Therefore, T2 = OP ((nh0 )−1 (log n)2 |b q − q0 |2 ) = oP ((nh0 )−1 ). For T1 , using condition supi≥1 |θi0 |2 i2q0 +1 < ∞ and (S.4) we have 0 +1 T1 = OP (1)|b q − q0 |2 h2q 0

∞ X (log n + log(ih0 ))2 (ih0 )4p0 +4|bq−q0 |−2q0 −1

(1 + (ih0 )2p0 )2 (1 + (ih0 )2p )2

i=1

= OP (1)|b q−

0 +1 −1 h0 q0 |2 h2q 0

2

= OP ((log n) |b q−

0 q0 |2 h2q 0 )

∞

Z 0

(log n + log x)2 x4p0 +4|bq−q0 |−2q0 −1 dx (1 + x2p0 )2 (1 + x2p )2

= oP ((nh0 )−1 ).

b 2 = T1 + T2 = oP ((nh0 )−1 ) = oP ((nh)−1 ). So we Consequently, kθb0 − θk 2 have b 2 = nhkθ0 − θb0 k2 + oP (1) = F 0 + oP (1). nhkθ0 − θk 2 2 n Since En − En0 = oP (1) and lim inf n→∞ (En0 − Fn0 ) > 0, with probability approaching one, θ0 ∈ Cnadapt (α). To the end, we prove the results for Cn†adapt (α). Since b2= nkθ − θk †

∞ X i=1

∞

X di ηi2 d → di ηi2 , b 2p 1 + λi i=1

it is easy to see that Cn†adapt (α) possesses asymptotically 1 − α posterior coverage. Let θb0 be defined the same as in the proof of Theorem S.1. It follows by Theorem 2.2 that P (nkθ0 − θb0 k2 < cα ) → 1 − α. Following the proof of †

Theorem S.1, it can be shown that b 2 = OP (b nkθb0 − θk q − q0 ) = oP (1). † b 2 < cα ) → 1 − α. This completes the proof of Theorem Hence, P (nkθ0 − θk † S.1.

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

15

A crucial assumption in Theorem S.1 is the existence of qb that satisfies qb− q0 = oP (1/ log n). Due to this proper rate, the contraction rate of Cnadapt (α) is of the order n−bq/(2bq+1) , which is asymptotically equivalent to the theoretical optimal one n−q0 /(2q0 +1) under any regularity q0 > 1/2. We next discuss the construction of a desirable qb by starting from a simple example. If it is known that the true parameter satisfies |θi0 |2 = (i log i)−(2q0 +1) for i = 2, 3, . . ., then we assign a prior on θi as θi ∼ N (0, (i log i)−(2q0 +1) ), i = 2, 3, . . . , such that the marginal distribution of Yi ∼ N (0, n−1 + (i log i)−(2q0 +1) ). Let φ0 = 2q0 + 1. Based on {Yi }ni=2 , the log-likelihood function of φ ≡ 2q + 1 is n 1X Yi2 −1 −φ (S.5) ln (φ) = − log(n + (i log i) ) + −1 . 2 n + (i log i)−φ i=2

proposition S.1. (Existence of qb with a desirable rate of convergence) b of the log-likelihood There exists a sequence of local maxima, denoted as φ, 1 − function ln (φ) satisfying φb − φ0 = OP (n 2φ0 (log n)2φ0 ). If we let qb = (φb − 1)+ /2, where (a)+ is the positive part of a real number a, then qb satisfies qb − q0 = OP (n

− 2φ1

0

(log n)2φ0 ) = oP (1/ log n).

remark S.1. It is possible to extend Proposition S.1 to more general situations. For example, given that |θi0 |2 = b0 i−(2q0 +1) (log i)−a0 for i = 2, 3, . . . and unknown b0 > 0, q0 > 0, a0 > 1, then one may assign priors θi ∼ N (0, b0 i−(2q0 +1) (log i)−a0 ), i = 2, 3, . . . , which leads to log-likelihood function of (b, φ, a) as n Yi2 1X −1 −φ −a log(n + bi (log i) ) + −1 . ln (b, φ, a) = − 2 n + bi−φ (log i)−a i=2

It can be shown by a multivariate extension of the proof of Proposition S.1 bb that there exists a sequence of local maxima of ln (b, φ, a), namely, (bb, φ, a), b b such that the marginal entry φ satisfies φ − φ0 = oP (1/ log n). Hence, by defining qb = (φb − 1)+ /2 we have qb − q0 = oP (1/ log n).The rate for (bb, b a) can also be derived but is of less interest. Proof of Proposition S.1. Let φn = n that with probability approaching one,

− 2φ1

ln (φ0 ± φn ) < ln (φ0 ),

0

(log n)2φ0 . We will show

16

Z. SHANG AND G. CHENG

which implies that there exists a local maxima φb of ln (φ) s.t. φ0 − φn < φb < φ0 + φn . We only show ln (φ0 + φn ) < ln (φ0 ) with probability approaching one. The proof for ln (φ0 − φn ) < ln (φ0 ) is exactly the same. It is easy to show that the first-, second- and third-order derivatives of ln (φ) are " # n 2 (log c )cφ X Y 1 log c i i i l˙n (φ) = − i , −1 cφ −1 cφ )2 2 1 + n (1 + n i i i=2 # " n 2 (log c )2 cφ (1 − n−1 cφ ) 2 n−1 cφ X Y (log c ) 1 i i i i i ¨ln (φ) = − + i , −1 cφ )2 −1 cφ )3 2 (1 + n (1 + n i i i=2 # " n ... ) 1 X (log ci )3 n−1 cφi (1 − n−1 cφi ) (log ci )3 Yi2 cφi (1 − 4n−1 cφi + n−2 c2φ i , + l n (φ) = − φ 3 φ 4 −1 −1 2 (1 + n ci ) (1 + n ci ) i=2

where ci = i log i. By Taylor’s expansion, 1 ... 1 ln (φ0 + φn ) − ln (φ0 ) = l˙n (φ0 )φn + ¨ln (φ0 )φ2n + l n (φ∗ )φ3n , 2 6 for a random φ∗ ∈ [φ0 , φ0 + φn ]. 0 0 Noting that Yi ∼ N (θi0 , 1/n), we have E{|Yi2 − c−φ − 1/n|2 } = 4c−φ /n + i i 2 2/n . By direct calculations it can be shown that E{|l˙n (φ0 )|2 } n −φ0 0 2 2 − 1/n|2 } 1 X c2φ i (log ci ) E{|Yi − ci = 4 (1 + n−1 cφi 0 )4 i=2

=

n 1 X c2φ0 (log ci )2 (4c−φ0 /n + 2/n2 ) i

i −1 (1 + n cφi 0 )4 i=2 n n X n−1 cφi 0 (log ci )2 X . −1 φ0 3 (1 i=2 (1 + n ci ) i=2

4

1 (log i)2 φ0 (log n)2 n + n−1 iφ0 )2

Z 0

∞

1 dx, (1 + xφ0 )2

n n 1X (log ci )2 1X (log i)2 . −1 φ0 2 2 2 (1 + n−1 iφ0 )2 i=2 (1 + n ci ) i=2 Z ∞ 1 1 n φ0 (log n)2 dx, (1 + xφ0 )2 0

−E{¨ln (φ0 )} =

n

1 1 X (log i)2(1−φ0 ) 2(1−φ0 ) φ0 −E{¨ln (φ0 )} & & (log n) n 2 (1 + n−1 iφ0 )2

i=2

Z 0

∞

1 dx, (1 + xφ0 )2

17

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

V ar(¨ln (φ0 )) n −φ0 0 2 − 1/n|2 } 1 X (log ci )4 (1 − n−1 cφi 0 )2 c2φ i E{|Yi − ci = 4 (1 + n−1 cφi 0 )6 1 4

=

i=2 n X i=2

n X

.

−φ0 0 (log ci )4 (1 − n−1 cφi 0 )2 c2φ /n + 2/n2 ) i (4ci

i=2

(log ci )4 (1 +

n−1 cφi 0 )2

.

(1 + n−1 cφi 0 )6 n X (log i)4 i=2

(1 +

n−1 iφ0 )2

4

(log n) n

1 φ0

Z

∞

0

1 dx. (1 + xφ0 )2

Using Yi2 ≤ 2(ci−φ0 +2i /n) and the elementary fact max1≤i≤n 2i = OP (log n) we have uniformly for any φ∗ ∈ [φ0 , φ0 + φn ], cφi ∗ −φ0 ≤ cφnn = 1 + o(1) and ... | l n (φ∗ )| n (log cn )3 X n−1 cφi ∗ |1 − n−1 cφi ∗ | ≤ 2 (1 + n−1 cφi ∗ )3 i=2 3

+OP ((log n)(log cn ) )

n ∗ X (cφi ∗ −φ0 + n−1 cφi ∗ )|1 − 4n−1 cφi ∗ + n−2 c2φ i |

(1 + n−1 cφi ∗ )4

i=2 1

1

= OP ((log n)4 n φ∗ ) = OP ((log n)4 n φ0 ), 1

1

where in the last equation we have used the fact n φ∗ n φ0 uniformly for any φ∗ ∈ [φ0 , φ0 + φn ]. Therefore, ln (φ0 + φn ) − ln (φ0 ) 1

1

1

= E{¨ln (φ0 )}φ2n + OP (n 2φ0 (log n)φn + n 2φ0 (log n)2 φ2n + n φ0 (log n)4 φ3n ) 1

1

1

1

. −n φ0 (log n)2(1−φ0 ) φ2n + OP (n 2φ0 (log n)φn + n 2φ0 (log n)2 φ2n + n φ0 (log n)4 φ3n ). It is easy to see that with φn = n 1

1

− 2φ1

0

(log n)2φ0 , 1

1

n 2φ0 (log n)φn + n 2φ0 (log n)2 φ2n + n φ0 (log n)4 φ3n n φ0 (log n)2(1−φ0 ) φ2n , therefore, with probability approaching one, ln (φ0 + φn ) < ln (φ0 ). It is easy to see that qb = (φb − 1)+ /2 satisfies the desirable rate of convergence. This completes the proof of the proposition. To the end of this section, we propose two adaptive credible regions under P2. Similarly, assume that qb is a proper estimator of q0 and let p = qb + 1/2,

18

Z. SHANG AND G. CHENG

q = (2b q + 1)/4 and α0 = 2q/(2b q + 1). In this case, α0 = 1/2 and α1 = b = n−α0 = n−1/2 . Consider the following regions: 1/(2b q + 1). Hence λ (S.6) v q u u b∗n + D b ∗n n−α1 zα t E adapt b C∗n (α) = θ = {θi }∞ , i=1 ∈ `2 : kθ − θ∗ k2 ≤ n1−α1

(S.7)

n o p †adapt b C∗n (α) = θ = {θi }∞ ∈ ` : kθ − θ k ≤ c /n , 2 ∗ † α i=1

b 2q + n−1 i2p ) for i = 1, 2, . . ., and D b ∗n , E b∗n are dewhere θb∗ni = Yi /(1 + λi fined similar as (2.5) with α1 , p, q, λ therein replaced by the above empirical b∗n and D b ∗n . counterparts. Again, Lemma S.1 gives the limiting values of E adapt −b q /(2b q +1) The contraction rate of C∗n (α) is of the order n . Based on similar proof in Theorem S.1, we study the Bayesian and frequentist properties of †adapt adapt (α) as follows. (α) and C∗n C∗n theorem S.2. Suppose that Yi ’s are generated from model (2.1) with θ0 ∈ `2 (q0 ), supi≥1 |θi0 |2 i2q0 +1 < ∞, and that qb − q0 = oP (1/ log n). Then as adapt †adapt n → ∞, P (θ ∈ C∗n (α)|{Yi }∞ (α)|{Yi }∞ i=1 ) → 1−α and P (θ ∈ C∗n i=1 ) → 1 − α in probability. Furthermore, with probability approaching one, θ0 ∈ †adapt adapt (α). (α); with probability approaching 1 − α, θ0 ∈ C∗n C∗n S.4. Proofs in Section 3. m (I), by Assumption A2, f adProof of Lemma 3.2. For any f ∈ HP mits theP unique series representation f = ∞ ν=1 fν ϕν , where fν = V (f, ϕν ) satisfies ν fν2 ρν < ∞. Therefore, T : f 7→ {fP ν : ν ≥ 1} defines a one-to-one 2 m ∞ ∞ map from H (I) to Rm ≡ {{fν }ν=1 ∈ R : ∞ ν=1 fν ρν < ∞}. ˜ ˜ Let Πλ and Π be the probability measures induced by {wν : ν > m} and {vν : ν > m}, respectively, which are both defined on R∞ . That is, for ˜ λ (S) = P ({wν : ν > m} ∈ S) and Π(S) ˜ any subset S ∈ R∞ , Π = P ({vν : 0 0 ν > m} ∈ S). Likewise, let Πλ and Π be probability measures induced by {wν : ν ≥ 1} and {vν : ν ≥ 1}. It is easy to see that, for any measurable B ⊆ Rm ,

Πλ (T −1 B) = P (Gλ ∈ T −1 B) = P ({wν : ν ≥ 1} ∈ B) = Π0λ (B), and Π(T −1 B) = P (G ∈ T −1 B) = P ({vν : ν ≥ 1} ∈ B) = Π0 (B). The following result can be found in H´ajek [22].

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

˜ λ w.r.t. Π ˜ is The Radon-Nikodym derivative of Π

proposition S.2.

˜λ dΠ ({fν : ν > m}) = ˜ dΠ =

∞ Y ν>m ∞ Y

1 + nλρ−β/(2m) ν

1/2

1 + nλρ−β/(2m) ν

1/2

ν>m

Note that in Proposition S.2, since

−β/(2m) ρν

19

Q∞

ν>m

exp(−

nλ 2 f ρν ) 2 ν

∞ nλ X 2 · exp − f ρν 2 ν>m ν

−β/(2m) 1/2

1 + nλρν

! .

is convergent

ν −β . Therefore, by Proposition S.2, we have

dΠ0λ ({fν : ν ≥ 1}) ˜ λ ({fν : ν > m}) = dπ1 (f1 ) · · · dπm (fm )dΠ ˜λ dΠ ˜ = dπ1 (f1 ) · · · dπm (fm ) ({fν : ν > m}) · dΠ({f ν : ν > m}) ˜ dΠ ∞ 1/2 Y 1 + nλρ−β/(2m) = dπ1 (f1 ) · · · dπm (fm ) ν ν=m+1

nλ · exp − 2 =

∞ Y ν=m+1

∞ X

! fν2 ρν

˜ dΠ({f ν : ν > m})

ν=m+1

1 + nλρν−β/(2m)

1/2

∞ nλ X 2 fν ρν · exp − 2 ν=m+1

! dΠ0 ({fν : ν ≥ 1}).

20

Z. SHANG AND G. CHENG

Then for any measurable S ⊆ H m (I), by change of variable, we have Πλ (S) = Π0λ (T S) Z dΠ0λ ({fν : ν ≥ 1}) = =

TS ∞ Y

1 + nλρ−β/(2m) ν

1/2

ν=m+1 ∞ nλ X 2 exp − · fν ρν 2 TS

Z

! dΠ0 ({fν : ν ≥ 1})

ν=m+1

=

∞ Y

1 + nλρ−β/(2m) ν

1/2

ν=m+1

nλ −1 −1 exp − J(T ({fν : ν ≥ 1}), T ({fν : ν ≥ 1})) · 2 TS Z

·d(Π ◦ T −1 )({fν : ν ≥ 1}) ∞ 1/2 Z Y nλ −β/(2m) 1 + nλρν exp(− J(f, f ))dΠ(f ). = 2 S ν=m+1

This completes the proof of the lemma. S.5. Proofs in Section 4. Before proving Proposition 4.1, we present a preliminary lemma. Let {ϕ˜ν : ν ≥ 1} be a bounded orthonormal basis of L2 (I) under usual 2 L inner product. For any b ∈ [0, β], define ˜b = { H

∞ X

fν ϕ˜ν :

ν=1

∞ X

fν2 ρ1+b/(2m) < ∞}. ν

ν=1

˜ b can be viewed as a version of Sobolev space with regularity m + Then H ˜ = P∞ vν ϕ˜ν , a centered GP, and f˜0 = P∞ fν0 ϕ˜ν . Define b/2. Define G ν=1R ν=1 1 ˜ ) = V˜ (f, g) = hf, giL2 = 0 f (x)g(x)dx, the usual L2 inner product, J(f P∞ ˜ 2 ˜ ˜ ˜ν )| ρν , a functional on H0 . For simplicity, we denote V (f ) = ν=1 |V (f, ϕ ˜ β . Since G ˜ is a Gaussian process with covariance V˜ (f, f ). Clearly, f˜0 ∈ H function ˜ t) = E{G(s) ˜ G(t)} ˜ R(s, =

m X ν=1

σν2 ϕ˜ν (s)ϕ˜ν (t) +

X ν>m

β −(1+ 2m )

ρν

ϕ˜ν (s)ϕ˜ν (t),

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

21

˜ β is the RKHS of G. ˜ For any H ˜ b with 0 ≤ b ≤ β, it follows by [44] that H define inner product ∞ ∞ m X X X X 1+ b fν gν ρν 2m . h fν ϕ˜ν , gν ϕ˜ν ib = σν−2 fν gν + ν=1

ν=1

ν>m

ν=1

Let k · kb be the norm corresponding to the above inner product. The following lemma plays a similar role as quantifying the concentration function defined in [45]. The calculation of the concentration function in [44] depends on convolutions (see their proof of Theorem 4.1), while the proof of Lemma S.3 depends on a series representation. Lemma S.3. Let dn be any positive sequence. If Condition (S) holds, ˜ β such that then there exists ω ∈ H 2(β−1)

˜ − f˜0 ) ≤ (i). V˜ (ω − f˜0 ) + λJ(ω 2

−

1 2 16 (dn

+ λdn2m+β−1 ), and

(ii). kωk2β = O(dn 2m+β−1 ). Proof of Lemma S.3. Let ω = 2/(2m+β−1) dn ,

P∞

˜ν , ν=1 ων ϕ

where ων =

dfν0 d+(σbν )α ,

σ=

1/(2m) ρν ,

bν = α = m + (β − 1)/2, and d > 0 is a constant to be (σbν )α fν0 described. It is easy to see that for any ν, fν0 − ων = d+(σb α . Then ν) V˜ (ω − f˜0 ) =

∞ ∞ X X |fν0 |2 (σbν )2α (fν0 − ων )2 = (d + (σbν )α )2 ν=1

ν=1

≤ σ 2m+β−1 d−2

∞ X

1+ β−1 2m

|fν0 |2 ρν

,

ν=1

and ˜ − f˜0 ) = J(ω

∞ X

(fν0 − ων )2 ρν

ν=1

= σ β−1

∞ X

1+ β−1 2m

|fν0 |2 ρν

(d(σbν )−m + (σbν )(β−1)/2 )−2

ν=1

≤ d−

β−1 k

∞

((

2m − m 2m β−1 −2 β−1 X 0 2 1+ β−1 |fν | ρν 2m , ) k +( ) 2k ) σ β−1 β−1 ν=1

where in the last equation k = m + (β − 1)/2. Therefore, we choose d as a suitably large fixed constant such that (i) holds.

22

Z. SHANG AND G. CHENG

To show (ii), observe that kωk2β =

m X

σν−2 ων2 +

=

ν>m

ν=1

= O(σ

1+ β ων2 ρν 2m

X

−1

m X ν=1

β−1

2m X d2 |f 0 |2 ρ1+ ν ν −2 0 2 bν σν |fν | + (d + (σbν )α )2 ν>m

). 2/(2m+β−1)

The result follows by σ = dn

.

Proof of Proposition 4.1. The proof is rooted in Theorem 2.1 of [21] with nontrivial modifications. We write Pf0 = Pfn0 for simplicity. Let γ02 = P 0 2 2 2 −1 ν |fν | ρν and Mn = (1 + γ0 ) δn λ . We can rewrite the posterior density by Qn nλ i=1 (pf /pf0 )(Zi ) exp(− 2 J(f ))dΠ(f ) p(f |Dn ) = R , Qn nλ i=1 (pf /pf0 )(Zi ) exp(− 2 J(f ))dΠ(f ) H m (I) where recall that pf (Z) is the probability density of Z = (Y, X) under f . First of all, we give a lower bound for n Y

Z I1 = H m (I)

(pf /pf0 )(Zi ) exp(−

i=1

nλ J(f ))dΠ(f ). 2

Define Bn = {f ∈ H m (I) : J(f ) ≤ Mn , V (f − f0 ) ≤ δn2 }. Then n Y nλ ≥ (pf /pf0 )(Zi ) exp(− J(f ))dΠ(f ) 2 Bn i=1 Z Y n ≥ (pf /pf0 )(Zi )dΠ(f ) exp(−nλMn /2)

Z

I1

Bn i=1

Z = Bn

n X exp( Ri (f, f0 ))dΠ(f ) exp(−nλMn /2), i=1

where Ri (f, f0 ) = `(Yi ; f (Xi ))−`(Yi ; f0 (Xi )) for i = 1, . . . , n. Define dΠ∗ (f ) =

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

23

$d\Pi(f)/\Pi(B_n)$, a reduced probability measure on $B_n$. By Jensen's inequality,
\begin{align*}
\log\int_{B_n}\exp\Big(\sum_{i=1}^nR_i(f,f_0)\Big)\,d\Pi^*(f)
&\ge\int_{B_n}\sum_{i=1}^nR_i(f,f_0)\,d\Pi^*(f)\\
&=\int_{B_n}\sum_{i=1}^n[R_i(f,f_0)-E_{f_0}\{R_i(f,f_0)\}]\,d\Pi^*(f)+n\int_{B_n}E_{f_0}\{R_i(f,f_0)\}\,d\Pi^*(f)\\
&\equiv J_1+J_2.
\end{align*}
By the Cauchy–Schwarz inequality and direct examination, we have
\[
E_{f_0}\{J_1^2\}=E_{f_0}\Big\{\Big|\int_{B_n}\sum_{i=1}^n[R_i(f,f_0)-E_{f_0}\{R_i(f,f_0)\}]\,d\Pi^*(f)\Big|^2\Big\}\le n\int_{B_n}E_{f_0}\{R_i(f,f_0)^2\}\,d\Pi^*(f).
\]
For any $f\in B_n$, by Taylor's expansion, there exists a constant $\tilde C>0$ s.t.
\begin{align*}
E_{f_0}\{R_i(f,f_0)^2\}&=E_{f_0}\{(\ell(Y_i;f(X_i))-\ell(Y_i;f_0(X_i)))^2\}\\
&=E_{f_0}\{(\dot\ell_a(Y;f_0(X))(f(X)-f_0(X))+\ddot\ell_a(Y;f^*(X))(f(X)-f_0(X))^2)^2\}\\
&\le2E_{f_0}\{\epsilon^2(f(X)-f_0(X))^2\}+2E_{f_0}\{E_{f_0}\{\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|^2\,|X\}|f(X)-f_0(X)|^4\}\\
&\le\tilde C\delta_n^2,
\end{align*}
where $f^*(X)$ is between $f(X)$ and $f_0(X)$, and the last step holds because $E_{f_0}\{\epsilon^2|X\}=B(X)$ (Assumption A1(c)) and $E_{f_0}\{\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|^2\,|X\}\le2C_0^2C_1$ a.s. (Assumption A1(a)). On the other hand, for any $f\in B_n$, by Taylor's expansion and Assumption A1(a),
\[
|E_{f_0}\{R_i(f,f_0)\}|\le|E_{f_0}\{\ddot\ell_a(Y;f^*(X))(f(X)-f_0(X))^2\}|\le E_{f_0}\{E_{f_0}\{\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|\,|X\}(f(X)-f_0(X))^2\}\le C_0C_1\delta_n^2,
\]
where $f^*(X)$ is between $f(X)$ and $f_0(X)$. Therefore, $J_2\ge-C_0C_1n\delta_n^2$, and by Chebyshev's inequality, $P_{f_0}(J_1<-n\delta_n^2)\le\tilde C/(n\delta_n^2)\to0$. So
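For completeness, the tail bound for $J_1$ just used is a sketch of the standard Chebyshev step applied to the second-moment estimate above:

```latex
P_{f_0}\big(J_1<-n\delta_n^2\big)
\le\frac{E_{f_0}\{J_1^2\}}{(n\delta_n^2)^2}
\le\frac{n\tilde C\delta_n^2}{n^2\delta_n^4}
=\frac{\tilde C}{n\delta_n^2}\to0,
```

since $n\delta_n^2\to\infty$.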


, with probability approaching one,
\[
I_1\ge\exp(-(1+C_0C_1)n\delta_n^2-n\lambda M_n/2)\,\Pi(B_n).
\]
To proceed we provide a lower bound for $\Pi(B_n)$. Let $\omega\in\tilde H^\beta$ satisfy (i)–(ii) of Lemma S.3 with $d_n$ therein replaced by $\delta_n$. It follows immediately from $\delta_n=o(1)$ and $\lambda\le\delta_n^2$ that $\tilde V(\omega-\tilde f_0)+\lambda\tilde J(\omega-\tilde f_0)\le\delta_n^2/4$. By the Gaussian correlation inequality (see Theorem 1.1 of [30]), the Cameron–Martin theorem (see [7]), and the fact $\lambda^{-1}\delta_n^2>1$, we have
\begin{align*}
\Pi(B_n)&=P(V(G-f_0)\le\delta_n^2,\ J(G)\le M_n)\\
&=P(\tilde V(\tilde G-\tilde f_0)\le\delta_n^2,\ \tilde J(\tilde G)\le(1+\gamma_0)^2\delta_n^2\lambda^{-1})\\
&\ge P(\tilde V(\tilde G-\tilde f_0)\le\delta_n^2,\ \lambda\tilde J(\tilde G-\tilde f_0)\le\delta_n^2)\\
&\ge P(\tilde V(\tilde G-\omega)\le\delta_n^2/4,\ \lambda\tilde J(\tilde G-\omega)\le\delta_n^2/4)\\
&\ge\exp\Big(-\frac12\|\omega\|_\beta^2\Big)P(\tilde V(\tilde G)\le\delta_n^2/4,\ \tilde J(\tilde G)\le1/4)\\
&\ge\exp\Big(-\frac12\|\omega\|_\beta^2\Big)P(\tilde V(\tilde G)\le\delta_n^2/8)P(\tilde J(\tilde G)\le1/8)\\
&\ge P(\tilde J(\tilde G)\le1/8)\exp(-c_1\delta_n^{-2/(2m+\beta-1)}),
\end{align*}
where the third last inequality follows by (4.18) of [28], and the last step follows by Lemma S.3(ii) and Example 4.5 of [23], in which $c_1>0$ is a universal constant. Note in the last step $c_2:=P(\tilde J(\tilde G)\le1/8)$ is a universal positive constant. Since $\beta>1$ and $\delta_n^2\ge(nh)^{-1}+\lambda\ge n^{-2m/(2m+1)}$, we get
\[
n\delta_n^{\frac{2(2m+\beta)}{2m+\beta-1}}\ge n^{1-\frac{2m(2m+\beta)}{(2m+1)(2m+\beta-1)}}>1,\quad\text{so }n\delta_n^2>\delta_n^{-2/(2m+\beta-1)}.
\]
So
\[
\Pi(B_n)\ge c_2\exp(-c_1\delta_n^{-2/(2m+\beta-1)})\ge c_2\exp(-c_1n\delta_n^2).
\]
Consequently, with $P_{f_0}$-probability approaching one, $I_1\ge c_2\exp(-c_3n\delta_n^2)$, where $c_3=1+C_0C_1+c_1+(1+\gamma_0)^2/2$.

Let $I_2=\int_{A_n}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp(-\frac{n\lambda}{2}J(f))\,d\Pi(f)$, where $A_n=\{f\in H^m(I):\|f-f_0\|\ge2M\delta_n,\ \|f-f_0\|_{\sup}\le b\}$, for some $M$ to be determined. Let $A_{n1}=\{f\in H^m(I):V(f-f_0)\ge M^2\delta_n^2,\ \|f-f_0\|_{\sup}\le b\}$ and $A_{n2}=\{f\in H^m(I):\lambda J(f-f_0)\ge M^2\delta_n^2,\ \|f-f_0\|_{\sup}\le b\}$. So $I_2\le I_{21}+I_{22}$, where $I_{2j}=\int_{A_{nj}}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp(-\frac{n\lambda}{2}J(f))\,d\Pi(f)$ for $j=1,2$.

Since for $f\in A_{n2}$, $M^2\delta_n^2\le\lambda J(f-f_0)\le2\lambda J(f)+2\lambda J(f_0)\le2\lambda J(f)+2\delta_n^2\gamma_0^2$, we have $\lambda J(f)\ge(M^2-2\gamma_0^2)\delta_n^2/2$, and hence $E_{f_0}\{I_{22}\}\le\exp(-(M^2-2\gamma_0^2)n\delta_n^2/4)$. In the end, we address the term $I_{21}$, for which we need to build uniformly consistent tests. Define $\mathcal P_n=\{f\in H^m(I):\|f-f_0\|_{\sup}\le b,\ J(f)\le c_4M_n,\ V(f-f_0)\ge M^2\delta_n^2\}$, where $c_4>2(c_3+2)/(1+\gamma_0)^2$ is a fixed constant. Let $\delta=M\delta_n$ and $D(\delta/2,\mathcal P_n,\|\cdot\|_{L^2})$ be the $\delta/2$-packing number
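As a check on the exponent bookkeeping above, the claimed implication is an algebraic identity:

```latex
n\delta_n^2>\delta_n^{-\frac{2}{2m+\beta-1}}
\iff n\,\delta_n^{2+\frac{2}{2m+\beta-1}}>1
\iff n\,\delta_n^{\frac{2(2m+\beta)}{2m+\beta-1}}>1,
```

and the last inequality holds because $\delta_n^2\ge n^{-2m/(2m+1)}$ and $(2m+1)(2m+\beta-1)=2m(2m+\beta)+\beta-1>2m(2m+\beta)$, so the exponent $1-\frac{2m(2m+\beta)}{(2m+1)(2m+\beta-1)}$ is strictly positive.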


in terms of the usual $L^2$-norm. Let $M\ge2\sqrt{c_4}(1+\gamma_0)$. It is well known that
\[
\log D(\delta/2,\mathcal P_n,\|\cdot\|_{L^2})\le\log D(\sqrt{c_4}(1+\gamma_0)\delta_n,\mathcal P_n,\|\cdot\|_{L^2})\le\big(\sqrt{c_4}(1+\gamma_0)\delta_n/\sqrt{c_4M_n}\big)^{-1/m}\le n\delta_n^2,
\]
where the last inequality follows by $\delta_n^2\ge(nh)^{-1}$. It follows by the fact that, for sufficiently small $b$, on $\mathcal P_n$ the $L^2$-metric is dominated by the Hellinger metric, and Theorem 7.1 of [21], that there exist tests $\phi_n$ such that
\[
E_{f_0}\{\phi_n\}\le\exp(-TM^2n\delta_n^2),\quad\text{and}\quad\sup_{f\in\mathcal P_n}E_f\{1-\phi_n\}\le\exp(-TM^2n\delta_n^2),
\]
where $T>0$ is a universal constant. Note $I_{21}/I_1=P(A_{n1}|D_n)=P(A_{n1}|D_n)\phi_n+P(A_{n1}|D_n)(1-\phi_n)$, and $E_{f_0}\{P(A_{n1}|D_n)\phi_n\}\le\exp(-TM^2n\delta_n^2)$. On the other hand, following similar arguments of [21] we have that
\begin{align*}
E_{f_0}\Big\{\int_{A_{n1}}\prod_{i=1}^n\frac{p_f(Z_i)}{p_{f_0}(Z_i)}\exp\Big(-\frac{n\lambda}{2}J(f)\Big)\,d\Pi(f)\,(1-\phi_n)\Big\}
&\le\exp(-n\lambda c_4M_n/2)+\sup_{f\in\mathcal P_n}E_f\{1-\phi_n\}\\
&\le\exp(-(c_3+2)n\delta_n^2)+\exp(-TM^2n\delta_n^2).
\end{align*}
Therefore, by the above analysis, we can choose the constant $M$ such that $M^2>4(c_3+2)+2\gamma_0^2$, $TM^2>c_3+2$ and $M>2\sqrt{c_4}(1+\gamma_0)$. And then, it can be seen that, as $n\to\infty$, $P(\|f-f_0\|\ge2M\delta_n|D_n)$ converges to zero in $P_{f_0}$-probability.

Before proving Theorem 4.2, let us introduce some preliminaries. Let $\Delta f,\Delta f_j\in H^m(I)$ for $j=1,2,3$. The Fréchet derivative of $\ell_{n,\lambda}$ can be identified as
\[
D\ell_{n,\lambda}(f)\Delta f=\frac1n\sum_{i=1}^n\dot\ell_a(Y_i;f(X_i))\langle K_{X_i},\Delta f\rangle-\langle W_\lambda f,\Delta f\rangle\equiv\langle S_{n,\lambda}(f),\Delta f\rangle.
\]
Note that $S_{n,\lambda}(\hat f_{n,\lambda})=0$, and $S_{n,\lambda}(f_0)$ can be expressed as
\[
(S.8)\qquad S_{n,\lambda}(f_0)=\frac1n\sum_{i=1}^n\epsilon_iK_{X_i}-W_\lambda f_0.
\]
The Fréchet derivative of $S_{n,\lambda}$ (resp. of $DS_{n,\lambda}$) is denoted $DS_{n,\lambda}(f)\Delta f_1\Delta f_2$ (resp. $D^2S_{n,\lambda}(f)\Delta f_1\Delta f_2\Delta f_3$).


These derivatives can be explicitly written as
\begin{align*}
D^2\ell_{n,\lambda}(f)\Delta f_1\Delta f_2&=n^{-1}\sum_{i=1}^n\ddot\ell_a(Y_i;f(X_i))\langle K_{X_i},\Delta f_1\rangle\langle K_{X_i},\Delta f_2\rangle-\langle W_\lambda\Delta f_1,\Delta f_2\rangle,\\
D^3\ell_{n,\lambda}(f)\Delta f_1\Delta f_2\Delta f_3&=n^{-1}\sum_{i=1}^n\dddot\ell_a(Y_i;f(X_i))\langle K_{X_i},\Delta f_1\rangle\langle K_{X_i},\Delta f_2\rangle\langle K_{X_i},\Delta f_3\rangle.
\end{align*}

We next review a set of results from [38, 16]. The following result gives the rate of convergence for $\hat f_{n,\lambda}$ when $h$ satisfies the rate conditions in Theorem 4.2.

Proposition S.3. Suppose Assumptions A1 and A2 are satisfied. Furthermore, as $n\to\infty$, $h=o(1)$ and $n^{-1/2}h^{-2}(\log n)(\log\log n)^{1/2}=o(1)$. Then $\|\hat f_{n,\lambda}-f_0\|=O_{P_{f_0}^n}(r_n)$.

The proof of Proposition S.3 can be found in [16]. The following lemma, whose proof can be found in [38, 37], is also useful.

Lemma S.4. (Functional Bahadur representation under the true model) Suppose that Assumptions A1–A3 hold, $h=o(1)$, and $nh^2\to\infty$. Recall that $S_{n,\lambda}(f_0)$ is defined in (S.8). Then we have
\[
(S.9)\qquad\|\hat f_{n,\lambda}-f_0-S_{n,\lambda}(f_0)\|=O_{P_{f_0}^n}(a_n\log n),
\]
where $a_n=n^{-1/2}r_nh^{-(6m-1)/(4m)}(\log\log n)^{1/2}+C_\ell h^{-1/2}r_n^2/\log n$, and
\[
C_\ell=\sup_{x\in I}E_{f_0}\big\{\sup_{a\in\mathbb R}|\dddot\ell_a(Y;a)|\,\big|\,X=x\big\}.
\]
Let the true $f_0$ be expressed as $f_0(\cdot)=\sum_{\nu=1}^\infty f_\nu^0\varphi_\nu(\cdot)$, satisfying Condition (S). Under the conditions of Lemma S.4, it can be directly verified that $\|S_{n,\lambda}(f_0)\|=O_{P_{f_0}^n}(\tilde r_n)$, and therefore $\|\hat f_{n,\lambda}-f_0\|=O_{P_{f_0}^n}(a_n\log n+\tilde r_n)$. To see this, by Proposition 3.1 we have
\[
E_{f_0}\Big\{\Big\|\frac1n\sum_{i=1}^n\epsilon_iK_{X_i}\Big\|^2\Big\}=\frac1nE_{f_0}\{\epsilon^2K(X,X)\}=\frac1n\sum_{\nu=1}^\infty\frac1{1+\lambda\rho_\nu}=O((nh)^{-1}),
\]
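The middle identity can be verified coordinatewise; the following sketch assumes the eigen-expansion $K(x,x')=\sum_\nu\varphi_\nu(x)\varphi_\nu(x')/(1+\lambda\rho_\nu)$ from Proposition 3.1, together with $E_{f_0}\{\epsilon^2|X\}=B(X)$ and $E_{f_0}\{B(X)\varphi_\nu(X)\varphi_\mu(X)\}=V(\varphi_\nu,\varphi_\mu)=\delta_{\nu\mu}$:

```latex
E_{f_0}\{\epsilon^2K(X,X)\}
=\sum_{\nu=1}^\infty\frac{E_{f_0}\{B(X)\varphi_\nu(X)^2\}}{1+\lambda\rho_\nu}
=\sum_{\nu=1}^\infty\frac{V(\varphi_\nu,\varphi_\nu)}{1+\lambda\rho_\nu}
=\sum_{\nu=1}^\infty\frac{1}{1+\lambda\rho_\nu}.
```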


and
\begin{align*}
\|W_\lambda f_0\|^2&=\Big\langle\sum_{\nu=1}^\infty f_\nu^0\frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu,\ \sum_{\nu=1}^\infty f_\nu^0\frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu\Big\rangle=\sum_{\nu=1}^\infty|f_\nu^0|^2\frac{\lambda^2\rho_\nu^2}{1+\lambda\rho_\nu}\\
&=\lambda^{1+\frac{\beta-1}{2m}}\sum_{\nu=1}^\infty|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}}\frac{(\lambda\rho_\nu)^{1-\frac{\beta-1}{2m}}}{1+\lambda\rho_\nu}=O(h^{2m+\beta-1}),
\end{align*}
where the last equation follows by $\lambda=h^{2m}$, $\sup_{x\ge0}\frac{x^{1-\frac{\beta-1}{2m}}}{1+x}<\infty$, and Condition (S). Then $\|S_{n,\lambda}(f_0)\|=O_{P_{f_0}^n}((nh)^{-1/2}+h^{m+\frac{\beta-1}2})=O_{P_{f_0}^n}(\tilde r_n)$.

To complete the proof of Theorem 4.2, we consider a function class $\mathcal G=\{g\in H^m(I):\|g\|_{\sup}\le1,\ J(g)\le c^{-2}h\lambda^{-1}\}$, where $c>0$ is the universal constant satisfying Lemma 5.8. The following equicontinuity lemma over $\mathcal G$ is useful for further theoretical analysis. The proof can be found in [38].

Lemma S.5. Suppose that $\psi_n(z;g)$ is a measurable function defined upon $z=(y,x)\in\mathcal Y\times I$ and $g\in\mathcal G$ satisfying $\psi_n(z;0)=0$ and the following Lipschitz continuity condition: for any $i=1,\dots,n$ and $f,g\in\mathcal G$,
\[
(S.10)\qquad|\psi_n(Z_i;f)-\psi_n(Z_i;g)|\le c^{-1}h^{1/2}\|f-g\|_{\sup},\quad\text{a.s. in }P_{f_0}\text{-probability},
\]
where $c>0$ is the universal constant specified in Lemma 5.8. Then we have
\[
\lim_{n\to\infty}P_{f_0}^n\Big(\sup_{g\in\mathcal G}\frac{\|Z_n(g)\|}{h^{-(2m-1)/(4m)}\|g\|_{\sup}^\gamma+n^{-1/2}}\le(5\log\log n)^{1/2}\Big)=1,
\]
where $Z_n(g)=\frac1{\sqrt n}\sum_{i=1}^n[\psi_n(Z_i;g)K_{X_i}-E_{f_0}^Z\{\psi_n(Z;g)K_X\}]$, $E_{f_0}^Z$ represents the expectation taken with respect to $Z$ under $P_{f_0}$, and $\gamma=1-\frac1{2m}$.

Proof of Theorem 4.2. Let $r_n=(nh)^{-1/2}+\lambda^{1/2}$ and denote $\hat f=\hat f_{n,\lambda}$ for convenience. By $\|\hat f-f_0\|=O_{P_{f_0}}(\tilde r_n)$ and Assumption A1, there exists a constant $M_1>1$ s.t. $P_{f_0}^n(E_n')$ approaches one (as $n\to\infty$), where
\[
(S.11)\qquad E_n'=\Big\{\|\hat f-f_0\|\le M_1\tilde r_n,\ \max_{1\le i\le n}\sup_{a\in\mathbb R}|\ddot\ell_a(Y_i;a)|\le M_1\log n,\ \max_{1\le i\le n}\sup_{a\in\mathbb R}|\dddot\ell_a(Y_i;a)|\le M_1\log n\Big\}.
\]
Define $A_{in}=\{|\ddot\ell_a(Y_i;f_0(X_i))|\le M_1\log n\}$, for $i=1,2,\dots,n$. By Assumption A1(a), we can actually manage the above $M_1$ such that $P_{f_0}(A_{in}^c)\le C_2^{-1}n^{-1/2}h^{-(2m-1)/(4m)}\log n(5\log\log n)^{1/2}$, where recall that $C_2$ satisfies Assumption A1(a) and $c$ satisfies Lemma 5.8. By Proposition 4.1, there exists a large $M_0$ s.t. $P(f\in H^m(I):\|f-f_0\|\le M_0r_n|D_n)$ converges to one in $P_{f_0}^n$-probability. Let $M>M_1$ be a constant to be determined. Since $P(\|f-f_0\|>2M\tilde r_n|D_n)=P(\|f-f_0\|>M_0r_n|D_n)+P(2M\tilde r_n<\|f-f_0\|\le M_0r_n|D_n)$, where the first term on the right side converges to zero in $P_{f_0}^n$-probability, we only look at the second term. For convenience, define $S_n=\{f:2M\tilde r_n<\|f-f_0\|\le M_0r_n\}$. Since $\tilde r_n\le r_n$, on $E_n$ and for any $f\in S_n$, $\|f-\hat f\|\le\|f-f_0\|+\|\hat f-f_0\|\le(M_0+M_1)r_n$, and $\|f-\hat f\|\ge\|f-f_0\|-\|\hat f-f_0\|>M\tilde r_n$. Define
\[
(S.12)\qquad E_n''=\Big\{\sup_{g\in\mathcal G}\frac{\|Z_n^{(j)}(g)\|}{h^{-(2m-1)/(4m)}\|g\|_{\sup}^\gamma+n^{-1/2}}\le(5\log\log n)^{1/2},\ j=1,2\Big\},
\]
where $Z_n^{(j)}(g)=\frac1{\sqrt n}\sum_{i=1}^n[\psi_n^{(j)}(Z_i;g)K_{X_i}-E_{f_0}^Z\{\psi_n^{(j)}(Z;g)K_X\}]$, $\psi_n^{(1)}(Z;g)=c^{-1}h^{1/2}g(X)$ and $\psi_n^{(2)}(Z;g)=(M_1\log n)^{-1}c^{-1}h^{1/2}\ddot\ell_a(Y;f_0(X))g(X)I_{A_{in}}$. Then Lemma S.5 says that, as $n\to\infty$, $P_{f_0}^n(E_n'')\to1$. Let $E_n=E_n'\cap E_n''$, whose $P_{f_0}^n$-probability thus goes to one as $n$ tends to infinity. In the rest of the proof, we will assume that $E_n$ holds.

For any $f$, define $I_n(f)=\int_0^1\int_0^1sDS_{n,\lambda}(\hat f+ss'(f-\hat f))(f-\hat f)(f-\hat f)\,ds\,ds'$. Then by Taylor's expansion we get that $\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)=S_{n,\lambda}(\hat f)(f-\hat f)+I_n(f)=I_n(f)$. By (3.8) we have that
\[
P(f\in S_n|D_n)=\frac{\int_{S_n}\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))\,d\Pi(f)}{\int_{H^m(I)}\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))\,d\Pi(f)}=\frac{\int_{S_n}\exp(nI_n(f))\,d\Pi(f)}{\int_{H^m(I)}\exp(nI_n(f))\,d\Pi(f)}.
\]
We first derive a lower bound for $J_1=\int_{H^m(I)}\exp(nI_n(f))\,d\Pi(f)$. Immediately,
\[
J_1\ge\int_{f:\|f-f_0\|\le\tilde r_n}\exp(nI_n(f))\,d\Pi(f).
\]
Note on $E_n$ and when $\|f-f_0\|\le\tilde r_n$, we have $\|f-\hat f\|\le\|f-f_0\|+\|\hat f-f_0\|\le(M_1+1)\tilde r_n$. Let $c>0$ be the constant specified in Lemma 5.8, $d_n=c(M_1+1)h^{-1/2}\tilde r_n$ and $g=d_n^{-1}(f-\hat f)$. Immediately, it can be shown that $\|g\|_{\sup}\le ch^{-1/2}\|g\|=ch^{-1/2}d_n^{-1}\|f-\hat f\|\le ch^{-1/2}d_n^{-1}(M_1+1)\tilde r_n=1$, and $J(g)\le\lambda^{-1}\|g\|^2=\lambda^{-1}d_n^{-2}\|f-\hat f\|^2\le\lambda^{-1}d_n^{-2}(M_1+1)^2\tilde r_n^2=c^{-2}h\lambda^{-1}$. This implies $g\in\mathcal G$.


Then we can rewrite $I_n(f)$ as
\begin{align*}
I_n(f)&=\int_0^1\int_0^1s[DS_{n,\lambda}(\hat f+ss'(f-\hat f))-DS_{n,\lambda}(f_0)](f-\hat f)(f-\hat f)\,ds\,ds'\\
&\quad+\frac12[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}](f-\hat f)(f-\hat f)-\frac12\|f-\hat f\|^2\\
&=d_n^2\int_0^1\int_0^1s[DS_{n,\lambda}(\hat f+ss'd_ng)-DS_{n,\lambda}(f_0)]gg\,ds\,ds'\\
&\quad+\frac{d_n^2}2[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}]gg-\frac12\|f-\hat f\|^2\\
&\equiv T_1+T_2-\frac12\|f-\hat f\|^2,
\end{align*}
where $T_1$ and $T_2$ denote the first and second terms in the above expression. For any $s\in[0,1]$, by Lemma 5.8 and $\|g\|_{\sup}\le1$, we have
\[
\|\hat f+sd_ng-f_0\|_{\sup}\le\|\hat f-f_0\|_{\sup}+d_n\|g\|_{\sup}\le c(2M_1+1)h^{-1/2}\tilde r_n.
\]


Note that, on $E_n$ and using $\|g\|^2\le c^{-2}h$, for any $s\in[0,1]$,
\begin{align*}
&|[DS_{n,\lambda}(\hat f+sd_ng)-DS_{n,\lambda}(f_0)]gg|\\
&=\Big|\frac1n\sum_{i=1}^n[\ddot\ell_a(Y_i;(\hat f+sd_ng)(X_i))-\ddot\ell_a(Y_i;f_0(X_i))]g(X_i)^2\Big|\\
&\le\frac1n\sum_{i=1}^n\sup_{a\in\mathbb R}|\dddot\ell_a(Y_i;a)|\cdot\|\hat f-f_0+sd_ng\|_{\sup}\,g(X_i)^2\\
&\le\frac{cM_1(2M_1+1)h^{-1/2}\tilde r_n\log n}n\sum_{i=1}^ng(X_i)^2\\
&=\frac{cM_1(2M_1+1)h^{-1/2}\tilde r_n\log n}n\Big\langle\sum_{i=1}^n[g(X_i)K_{X_i}-E_{f_0}^X\{g(X)K_X\}],g\Big\rangle+cM_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,E_{f_0}^X\{g(X)^2\}\\
&\le\frac{c^2M_1(2M_1+1)h^{-1}\tilde r_n\log n}n\Big\langle\sum_{i=1}^n[\psi_n^{(1)}(Z_i;g)K_{X_i}-E_{f_0}^X\{\psi_n^{(1)}(Z;g)K_X\}],g\Big\rangle+cC_2M_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,\|g\|^2\\
&\le c^2M_1(2M_1+1)h^{-1}n^{-1/2}\tilde r_n\log n\,\|Z_n^{(1)}(g)\|\cdot\|g\|+cC_2M_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,\|g\|^2\\
&\le c^2M_1(2M_1+1)h^{-1}n^{-1/2}\tilde r_n\log n\,(h^{-(2m-1)/(4m)}+n^{-1/2})(5\log\log n)^{1/2}\|g\|+cC_2M_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,\|g\|^2\\
&\le2cM_1(2M_1+1)n^{-1/2}\tilde r_nh^{-(4m-1)/(4m)}(5\log\log n)^{1/2}\log n+c^{-1}C_2M_1(2M_1+1)h^{1/2}\tilde r_n\log n.
\end{align*}
Therefore, $|T_1|\le C_1(c,C_2,M_1)\tilde r_n^2(n^{-1/2}\tilde r_nh^{-(8m-1)/(4m)}(\log\log n)^{1/2}\log n+h^{-1/2}\tilde r_n\log n)=C_1(c,C_2,M_1)\tilde r_n^2b_{n1}$, where $C_1(c,C_2,M_1)$ is a positive constant depending only on $c,C_2,M_1$. By the assumption in the theorem, as $n\to\infty$, $T_1=o_{P_{f_0}^n}(\tilde r_n^2)$.

To handle $T_2$, we use similar ideas. Note that on $E_n''$, by $\|g\|_{\sup}\le1$, $\|g\|\le c^{-1}h^{1/2}$, $C_2P(A_{in}^c)\le n^{-1/2}h^{-(2m-1)/(4m)}\log n(5\log\log n)^{1/2}$ and


direct calculations, we have
\begin{align*}
&|[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}]gg|\\
&\le\Big\|\frac1n\sum_{i=1}^n[\ddot\ell_a(Y_i;f_0(X_i))g(X_i)I_{A_{in}}K_{X_i}-E_{f_0}^{Z_i}\{\ddot\ell_a(Y_i;f_0(X_i))g(X_i)I_{A_{in}}K_{X_i}\}]\Big\|\cdot\|g\|+C_2P(A_{in}^c)\\
&=n^{-1/2}(M_1\log n)ch^{-1/2}\|Z_n^{(2)}(g)\|\cdot\|g\|+C_2P(A_{in}^c)\\
&\le2M_1n^{-1/2}h^{-(2m-1)/(4m)}\log n(5\log\log n)^{1/2}+C_2P(A_{in}^c)\\
&\le3M_1n^{-1/2}h^{-(2m-1)/(4m)}\log n(5\log\log n)^{1/2},
\end{align*}
therefore $|T_2|\le(d_n^2/2)3M_1n^{-1/2}h^{-(2m-1)/(4m)}\log n(5\log\log n)^{1/2}\le C_2(c,M_1)\tilde r_n^2n^{-1/2}h^{-(6m-1)/(4m)}\log n(\log\log n)^{1/2}=C_2(c,M_1)b_{n2}\tilde r_n^2$, where $C_2(c,M_1)>0$ is a constant depending only on $c,M_1$. By assumptions in the theorem, $T_2=o_{P_{f_0}^n}(\tilde r_n^2)$.

Let $n$ be sufficiently large such that $C_1(c,C_2,M_1)b_{n1}+C_2(c,M_1)b_{n2}\le c^2(M_1+1)^2/2$. By the above bounds of $T_1$ and $T_2$, on $E_n$ and when $\|f-f_0\|\le\tilde r_n$,
\[
nI_n(f)\ge-n(C_1(c,C_2,M_1)b_{n1}+C_2(c,M_1)b_{n2}+c^2(M_1+1)^2/2)\tilde r_n^2\ge-c^2(M_1+1)^2n\tilde r_n^2.
\]
Therefore, $J_1\ge\exp(-c^2(M_1+1)^2n\tilde r_n^2)\Pi(B_n)$, where $B_n=\{f:\|f-f_0\|\le\tilde r_n\}$.

To proceed, we still need a lower bound for $\Pi(B_n)$. We employ an idea used in the proof of Proposition 4.1. Let $\omega\in\tilde H^\beta$ satisfy the properties (i)–(ii) of Lemma S.3, with $d_n$ therein replaced by $\tilde r_n$. Note that $\tilde r_n^2\ge h^{2m+\beta-1}=\lambda^{\frac{2m+\beta-1}{2m}}$, which implies $\lambda\le\tilde r_n^{\frac{4m}{2m+\beta-1}}$. It can be shown that, if $\tilde V(\tilde G-\omega)\le\tilde r_n^2/16$ and $\tilde J(\tilde G-\omega)\le\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}}/16$, then $\tilde V(\tilde G-\tilde f_0)+\lambda\tilde J(\tilde G-\tilde f_0)\le(\tilde r_n^2+\lambda\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}})/4\le\tilde r_n^2$. It then follows by the Cameron–Martin theorem, Gaussian


correlation inequality and the results of [23] that
\begin{align*}
\Pi(B_n)&=P(\|G-f_0\|\le\tilde r_n)=P(\tilde V(\tilde G-\tilde f_0)+\lambda\tilde J(\tilde G-\tilde f_0)\le\tilde r_n^2)\\
&\ge P(\tilde V(\tilde G-\omega)\le\tilde r_n^2/16,\ \tilde J(\tilde G-\omega)\le\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}}/16)\\
&\ge\exp(-\|\omega\|_\beta^2/2)P(\tilde V(\tilde G)\le\tilde r_n^2/16,\ \tilde J(\tilde G)\le\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}}/16)\\
&\ge\exp(-\|\omega\|_\beta^2/2)P(\tilde V(\tilde G)\le\tilde r_n^2/32)P(\tilde J(\tilde G)\le\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}}/32)\\
&\ge\exp(-c_5\tilde r_n^{-\frac2{2m+\beta-1}}),
\end{align*}
where the last inequality follows by
\[
-\log P(\tilde V(\tilde G)\le\tilde r_n^2/32)\lesssim\tilde r_n^{-\frac2{2m+\beta-1}},\qquad
-\log P(\tilde J(\tilde G)\le\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}}/32)\lesssim\big(\tilde r_n^{\frac{2(\beta-1)}{2m+\beta-1}}\big)^{-\frac1{\beta-1}}=\tilde r_n^{-\frac2{2m+\beta-1}}
\]
(see Example 4.5 of [23]), and $c_5>0$ is a universal constant. Since
\[
n\tilde r_n^{2+\frac2{2m+\beta-1}}\ge n\big(2n^{-\frac{2m+\beta-1}{2m+\beta}}\big)^{1+\frac1{2m+\beta-1}}>1,
\]
which implies $\tilde r_n^{-\frac2{2m+\beta-1}}<n\tilde r_n^2$, we get that
\[
J_1\ge\exp(-c_5n\tilde r_n^2).
\]
Next we derive an upper bound for $J_2=\int_{S_n}\exp(nI_n(f))\,d\Pi(f)$. The idea we employ is similar to how we handle $J_1$, but with some technical difference. Let $d_{*n}=c(M_0+M_1)h^{-1/2}r_n$, and $g_*=d_{*n}^{-1}(f-\hat f)$. It is easy to see that, on $E_n$ and when $f\in S_n$, we get that $\|g_*\|^2\le c^{-2}h$, $\|g_*\|_{\sup}\le1$ and $J(g_*)\le c^{-2}h\lambda^{-1}$, which imply that $g_*\in\mathcal G$. Then we can rewrite $I_n(f)$ as
\begin{align*}
(S.13)\qquad I_n(f)&=d_{*n}^2\int_0^1\int_0^1s[DS_{n,\lambda}(\hat f+ss'd_{*n}g_*)-DS_{n,\lambda}(f_0)]g_*g_*\,ds\,ds'\\
&\quad+\frac{d_{*n}^2}2[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}]g_*g_*-\frac12\|f-\hat f\|^2\\
&\equiv T_3+T_4-\frac12\|f-\hat f\|^2.
\end{align*}
Similar to the analysis of $T_1$ and $T_2$, it can be shown that, on $E_n$ and for any $f\in S_n$,
\begin{align*}
|T_3|&\le2c^3M_1(M_0+2M_1)(M_0+M_1)^2r_n^3n^{-1/2}h^{-\frac{8m-1}{4m}}(\log n)(5\log\log n)^{1/2}+c^2M_1(M_0+2M_1)(M_0+M_1)^2r_n^3h^{-1}\log n\\
&\le C_3(c,M_0,M_1)\big(r_n^3n^{-1/2}h^{-\frac{8m-1}{4m}}(\log n)(\log\log n)^{1/2}+r_n^3h^{-1/2}\log n\big)=C_3(c,M_0,M_1)b_{n3},
\end{align*}


\begin{align*}
|T_4|&\le3c^2M_1(M_0+M_1)^2n^{-1/2}h^{-\frac{6m-1}{4m}}r_n^2(\log n)(5\log\log n)^{1/2}\\
&\le C_4(c,M_0,M_1)n^{-1/2}h^{-\frac{6m-1}{4m}}r_n^2(\log n)(\log\log n)^{1/2}=C_4(c,M_0,M_1)b_{n4},
\end{align*}

where $C_j(c,M_0,M_1)$ for $j=3,4$ are positive constants depending on $c,M_0,M_1$ only. Let $n$ be large enough such that $C_3(c,M_0,M_1)b_{n3}+C_4(c,M_0,M_1)b_{n4}\le\tilde r_n^2$. Therefore, on $E_n$ and for any $f\in S_n$,
\[
I_n(f)\le C_3(c,M_0,M_1)b_{n3}+C_4(c,M_0,M_1)b_{n4}-\frac{M^2}2\tilde r_n^2\le(1-M^2/2)\tilde r_n^2,
\]
which implies $J_2\le\exp((1-\frac{M^2}2)n\tilde r_n^2)$. Consequently, on $E_n$,
\[
P(f\in S_n|D_n)=\frac{J_2}{J_1}\le\exp\Big(\Big(1+c_5-\frac{M^2}2\Big)n\tilde r_n^2\Big).
\]
By choosing $M>M_1$ and $M>\sqrt{2(1+c_5)}$, we have that the right side in the above equation tends to zero as $n\to\infty$. This concludes the proof.

S.6. Proofs in Section 5.1.

Proof of Theorem 5.1. Let $\delta,\varepsilon$ be arbitrary prefixed small positive numbers within $(0,1/8)$. Let $E_n'$ and $E_n''$ be defined as in (S.11) and (S.12), in which the constant $M_1$ is chosen to be large so that, as $n$ approaches infinity, both events have $P_{f_0}^n$-probability greater than $1-\frac\varepsilon4$. It follows by Theorem 4.2 that there exist $M,N>0$ such that, for any $n\ge N$, $P_{f_0}^n(E_n''')\ge1-\frac\varepsilon4$, where $E_n'''=\{P(f:\|f-f_0\|\le M\tilde r_n|D_n)\ge1-\delta\}$. We can actually make the above $M$ greater than $M_1+\sqrt{2(c_5+1)}$ so that, on $E_n'$ and for any $f$ satisfying $\|f-f_0\|\ge M\tilde r_n$, $\|f-\hat f\|\ge(M-M_1)\tilde r_n\ge\sqrt{2(c_5+1)}\,\tilde r_n$, where $c_5$ is the universal constant in (S.13). Using the proofs of Theorem 4.2, we can show that, on $E_n'$,
\[
\int_{H^m(I)}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)d\Pi(f)\ge\int_{f:\|f-f_0\|\le\tilde r_n}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)d\Pi(f)\ge\exp(-(c_5+1/2)n\tilde r_n^2),
\]
\[
\int_{f:\|f-f_0\|\ge M\tilde r_n}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)d\Pi(f)\le\exp(-(c_5+1)n\tilde r_n^2).
\]
Thus, on $E_n'$,
\[
P_0(f:\|f-f_0\|\ge M\tilde r_n)\le\exp(-n\tilde r_n^2/2),
\]


in which the right side tends to zero as $n\to\infty$. This means that the above $N$ can be further managed such that, for any $n\ge N$, $P_{f_0}^n(E_n^*)>1-\varepsilon$, where $E_n^*=E_n'\cap E_n''\cap E_n'''\cap E_n''''$ and $E_n''''=\{P_0(f:\|f-f_0\|\le M\tilde r_n)\ge1-\delta\}$.

For arbitrary $B\in\mathcal B$, let $B'=B\cap\{f:\|f-f_0\|\le M\tilde r_n\}$. Then it is easy to see that, on $E_n^*$,
\begin{align*}
|P(B|D_n)-P_0(B)|&\le P(\|f-f_0\|\ge M\tilde r_n|D_n)+P_0(\|f-f_0\|\ge M\tilde r_n)+|P(B'|D_n)-P_0(B')|\\
&\le2\delta+|P(B'|D_n)-P_0(B')|.
\end{align*}
To prove the theorem, it is sufficient to give a small upper bound for $|P(B'|D_n)-P_0(B')|$, and this upper bound does not depend on $B$. Observe that
\[
n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f))+\frac n2\|f-\hat f\|^2=n(T_1+T_2),
\]
where
\[
T_1=\int_0^1\int_0^1s[DS_{n,\lambda}(\hat f+ss'(f-\hat f))-DS_{n,\lambda}(f_0)](f-\hat f)(f-\hat f)\,ds\,ds',
\]
\[
T_2=\frac12[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}](f-\hat f)(f-\hat f).
\]
Based on the proof of Theorem 4.2, it can be shown that, on $E_n^*$ and for any $f$ satisfying $\|f-f_0\|\le M\tilde r_n$, $n|T_1|\le C_5(c,M_1,M)n\tilde r_n^2b_{n1}$ and $n|T_2|\le C_6(c,M_1,M)n\tilde r_n^2b_{n2}$, where $C_j(c,M_1,M)$, $j=5,6$, are constants depending only on $c,M_1,M$, and recall that $c$ satisfies Lemma 5.8. By assumptions, the above upper bounds for $n|T_1|$ and $n|T_2|$ are both of order $o(1)$, as $n\to\infty$. Define
\[
J_{n1}=\int_{H^m(I)}\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))\,d\Pi(f)
\quad\text{and}\quad
J_{n2}=\int_{H^m(I)}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)\,d\Pi(f).
\]
Also define $\bar J_{n1}=\int_{f:\|f-f_0\|\le M\tilde r_n}\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))\,d\Pi(f)$ and $\bar J_{n2}=\int_{f:\|f-f_0\|\le M\tilde r_n}\exp(-\frac n2\|f-\hat f\|^2)\,d\Pi(f)$. Then, on $E_n^*$, we have
\[
0\le\frac{J_{nj}-\bar J_{nj}}{J_{nj}}\le\delta,\quad\text{for }j=1,2,
\]


which implies that
\[
(S.14)\qquad\bar J_{nj}\le J_{nj}\le\frac1{1-\delta}\bar J_{nj},\quad\text{for }j=1,2.
\]
Let $R_n(f)=n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f))+\frac n2\|f-\hat f\|^2$. On $E_n^*$ and for any $f$ satisfying $\|f-f_0\|\le M\tilde r_n$, $|R_n(f)|\le C_5(c,M_1,M)n\tilde r_n^2b_{n1}+C_6(c,M_1,M)n\tilde r_n^2b_{n2}\equiv t_n(c,M_1,M)$. By assumption, as $n\to\infty$, $t_n(c,M_1,M)\to0$. It can then be seen that, on $E_n^*$,
\begin{align*}
|P(B'|D_n)-P_0(B')|&=\Big|\int_{B'}\Big(\frac{\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))}{J_{n1}}-\frac{\exp(-\frac n2\|f-\hat f\|^2)}{J_{n2}}\Big)d\Pi(f)\Big|\\
&\le\int_{B'}\Big|\frac{\exp(R_n(f))}{J_{n1}}-\frac1{J_{n2}}\Big|\exp\Big(-\frac n2\|f-\hat f\|^2\Big)d\Pi(f)\\
&\le\int_{f:\|f-f_0\|\le M\tilde r_n}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)\Big|\frac{\exp(R_n(f))J_{n2}-J_{n1}}{J_{n1}J_{n2}}\Big|d\Pi(f)\\
&\le\int_{f:\|f-f_0\|\le M\tilde r_n}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)\Big(\frac{|J_{n2}-J_{n1}|}{J_{n1}J_{n2}}\exp(R_n(f))+\frac1{J_{n2}}|\exp(R_n(f))-1|\Big)d\Pi(f)\\
&\le\frac{|J_{n2}-J_{n1}|}{J_{n1}J_{n2}}\bar J_{n2}\exp(t_n(c,M_1,M))+\frac{2t_n(c,M_1,M)}{J_{n2}}\bar J_{n2}\\
(S.15)\qquad&\le\frac{|J_{n2}-J_{n1}|}{J_{n1}}\exp(t_n(c,M_1,M))+2t_n(c,M_1,M).
\end{align*}
It can also be shown that, on $E_n^*$,
\[
|\bar J_{n2}-\bar J_{n1}|\le\int_{f:\|f-f_0\|\le M\tilde r_n}\exp\Big(-\frac n2\|f-\hat f\|^2\Big)|\exp(R_n(f))-1|\,d\Pi(f)\le2t_n(c,M_1,M)\bar J_{n2},
\]
which implies
\[
\frac1{1+2t_n(c,M_1,M)}\le\frac{\bar J_{n2}}{\bar J_{n1}}\le\frac1{1-2t_n(c,M_1,M)}.
\]
Combined with (S.14) we have that
\[
\frac{1-\delta}{1+2t_n(c,M_1,M)}\le(1-\delta)\frac{\bar J_{n2}}{\bar J_{n1}}\le\frac{J_{n2}}{J_{n1}}\le\frac1{1-\delta}\cdot\frac{\bar J_{n2}}{\bar J_{n1}}\le\frac1{(1-\delta)(1-2t_n(c,M_1,M))},
\]


which leads to −

2tn (c, M1 , M ) + δ 2tn (c, M1 , M ) + δ Jn2 −1≤ ≤ , 1 + 2tn (c, M1 , M ) Jn1 (1 − δ)(1 − 2tn (c, M1 , M ))

therefore, |

2tn (c, M1 , M ) + δ δ Jn2 − 1| ≤ → ≤ 2δ. Jn1 (1 − δ)(1 − 2tn (c, M1 , M )) 1−δ

It follows by (S.15) that as n is large, on En∗ and any B ∈ B, |P (B 0 |Dn ) − P0 (B 0 )| ≤ 3δ. This shows the desired conclusion (5.2). Proof of Theorem 5.2. Let T : f 7→ {fν : ν ≥ 1} be the one-to-one map from H m (I) to Rm , as defined in Lemma 3.2. Let Π0W and Π0 be the probability measures induced by {ξν + aν fbν : ν ≥ 1} and {vν : ν ≥ 1}. Then dΠ0W /dΠ0 equals limN →∞ p1,N (f1 , . . . , fN )/p2,N (f1 , . . . , fN ), where p1,N and p2,N are the probability densities under fν ∼ ξν + aν fbν and fν ∼ vν , ν = 1, . . . , N , respectively. A direct evaluation leads to that ∞

\[
\frac{d\Pi_W'}{d\Pi'}(\{f_\nu:\nu\ge1\})=C_{n,\lambda}\exp\Big(-\frac n2\sum_{\nu=1}^\infty(f_\nu-\hat f_\nu)^2(1+\lambda\rho_\nu)\Big),
\]
where
\[
C_{n,\lambda}=\prod_{\nu=1}^m(1+n\sigma_\nu^2)^{1/2}\prod_{\nu>m}\big(1+n(1+\lambda\rho_\nu)\rho_\nu^{-(1+\beta/(2m))}\big)^{1/2}\exp\Big(\frac12\sum_{\nu=1}^\infty\hat f_\nu^2b_\nu\Big),
\]
and $b_\nu=n\sigma_\nu^{-2}/(n+\sigma_\nu^{-2})$ for $\nu=1,\dots,m$, and $b_\nu=n(1+\lambda\rho_\nu)\rho_\nu^{1+\beta/(2m)}/(n(1+\lambda\rho_\nu)+\rho_\nu^{1+\beta/(2m)})$ for $\nu>m$. Since $\sum_\nu\hat f_\nu^2\rho_\nu^{1+\beta/(2m)}<\infty$ and $\beta>1$, it is not hard to see that $C_{n,\lambda}$ is a finite constant. By similar arguments in the proof of Lemma 3.2, we have for any $B\subseteq\mathbb R^\infty$, $\Pi_W(T^{-1}B)=\Pi_W'(B)$, and $\Pi(T^{-1}B)=\Pi'(B)$. By change of variable, for any $\Pi$-measurable $S\subseteq H^m(I)$,
\begin{align*}
\Pi_W(S)&=\Pi_W'(TS)=\int_{TS}d\Pi_W'(\{f_\nu:\nu\ge1\})\\
&=C_{n,\lambda}\int_{TS}\exp\Big(-\frac n2\sum_{\nu=1}^\infty(f_\nu-\hat f_\nu)^2(1+\lambda\rho_\nu)\Big)d\Pi'(\{f_\nu:\nu\ge1\})\\
&=C_{n,\lambda}\int_S\exp\Big(-\frac n2\|f-\hat f_{n,\lambda}\|^2\Big)d\Pi(f).
\end{align*}
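As a one-coordinate sanity check of the density ratio, the following sketch assumes (consistently with the expressions for $b_\nu$ and $C_{n,\lambda}$ above, but not stated explicitly in this chunk) that for $\nu>m$ the centered $\xi_\nu$ is Gaussian with variance $\tau_\nu^2=[n(1+\lambda\rho_\nu)+\rho_\nu^{1+\beta/(2m)}]^{-1}$, the prior coordinate $v_\nu$ is centered Gaussian with variance $\rho_\nu^{-(1+\beta/(2m))}$, and $a_\nu=n(1+\lambda\rho_\nu)\tau_\nu^2$:

```latex
\frac{\mathcal N\big(f_\nu;\,a_\nu\hat f_\nu,\,\tau_\nu^2\big)}{\mathcal N\big(f_\nu;\,0,\,\rho_\nu^{-(1+\beta/(2m))}\big)}
=\big(1+n(1+\lambda\rho_\nu)\rho_\nu^{-(1+\beta/(2m))}\big)^{1/2}
\exp\Big(\frac{b_\nu\hat f_\nu^2}{2}\Big)
\exp\Big(-\frac{n(1+\lambda\rho_\nu)}{2}(f_\nu-\hat f_\nu)^2\Big),
```

since $\tau_\nu^{-2}-\rho_\nu^{1+\beta/(2m)}=n(1+\lambda\rho_\nu)$ and, completing the square, $n(1+\lambda\rho_\nu)(1-a_\nu)=b_\nu$; taking the product over $\nu$ recovers the displayed form of $d\Pi_W'/d\Pi'$.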


In particular, letting $S=H^m(I)$ in the above equations, we get that
\[
C_{n,\lambda}=\Big(\int_{H^m(I)}\exp\Big(-\frac n2\|f-\hat f_{n,\lambda}\|^2\Big)d\Pi(f)\Big)^{-1}.
\]
This proves the desired result.

S.7. Proofs in Section 5.2.

Proof of Lemma 5.3. It is easy to see that
\[
n\sqrt{c_0h}\,\|Z\|^2=n\sqrt{c_0h}\sum_{\nu=1}^\infty c_\nu\eta_\nu^2(1+\lambda\rho_\nu)
=n\sqrt{c_0h}\sum_{\nu=1}^\infty c_\nu(1+\lambda\rho_\nu)(\eta_\nu^2-1)+n\sqrt{c_0h}\sum_{\nu=1}^\infty c_\nu(1+\lambda\rho_\nu),
\]
where $\eta_\nu$ are iid standard normal random variables independent of the data $D_n$. Therefore, the logarithm of the moment generating function of $U_n=(n\sqrt{c_0h}\,\|Z\|^2-n\sqrt{c_0h}\sum_{\nu=1}^\infty c_\nu(1+\lambda\rho_\nu))/\tau_1$ is
\begin{align*}
\log(E\{\exp(tU_n)\})&=\log\Big(E\Big\{\exp\Big(t\tau_1^{-1}n\sqrt{c_0h}\sum_{\nu=1}^\infty c_\nu(1+\lambda\rho_\nu)(\eta_\nu^2-1)\Big)\Big\}\Big)\\
&=\sum_{\nu=1}^\infty\log\big(E\{\exp(t\tau_1^{-1}n\sqrt{c_0h}\,c_\nu(1+\lambda\rho_\nu)(\eta_\nu^2-1))\}\big)\\
&=-\sum_{\nu=1}^\infty\Big[\frac12\log\big(1-2t\tau_1^{-1}n\sqrt{c_0h}\,c_\nu(1+\lambda\rho_\nu)\big)+t\tau_1^{-1}n\sqrt{c_0h}\,c_\nu(1+\lambda\rho_\nu)\Big]\\
&=t^2c_0n^2h\sum_{\nu=1}^\infty c_\nu^2(1+\lambda\rho_\nu)^2/\tau_1^2+O\Big(t^3n^3c_0^{3/2}h^{3/2}\sum_{\nu=1}^\infty c_\nu^3(1+\lambda\rho_\nu)^3/\tau_1^3\Big)\\
&\to t^2/2,
\end{align*}
where the limit in the last equation follows by (5.4) and the fact that, as $n\to\infty$, $\tau_1\asymp1$ and $n^3h^{3/2}\sum_{\nu=1}^\infty c_\nu^3(1+\lambda\rho_\nu)^3/\tau_1^3=O(h^{1/2})=o(1)$. Consequently, the result of the lemma holds.

Proof of Lemma 5.4. The proof relies on the following kernel function:
\[
R(x,x')=\sum_{\nu=1}^\infty c_\nu\varphi_\nu(x)\varphi_\nu(x'),\quad\text{for any }x,x'\in I,
\]
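The coordinatewise expansion in the proof of Lemma 5.3 above rests on the standard $\chi_1^2$ log-moment-generating-function identity, valid for $|s|<1/2$:

```latex
\log E\{\exp(s(\eta^2-1))\}=-\tfrac12\log(1-2s)-s=s^2+O(s^3)\quad(s\to0),
```

applied with $s=t\tau_1^{-1}n\sqrt{c_0h}\,c_\nu(1+\lambda\rho_\nu)$ for each $\nu$.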


where $c_\nu$ are defined in Section 5.2. We also use $R_x(\cdot)$ to denote the function $R(x,\cdot)$, for simplicity. Denote $\hat f=\hat f_{n,\lambda}$, $\tilde f=\tilde f_{n,\lambda}$, and $R_n=\hat f-f_0-S_{n,\lambda}(f_0)$. For any $\nu\ge1$,
\begin{align*}
\hat f_\nu&=V(\hat f,\varphi_\nu)=V(R_n,\varphi_\nu)+V(f_0,\varphi_\nu)+V(S_{n,\lambda}(f_0),\varphi_\nu)\\
&=V(R_n,\varphi_\nu)+f_\nu^0+\frac1n\sum_{i=1}^n\epsilon_i\frac{\varphi_\nu(X_i)}{1+\lambda\rho_\nu}-\frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}f_\nu^0.
\end{align*}
Therefore, by $a_\nu/(1+\lambda\rho_\nu)=nc_\nu$, we get that
\begin{align*}
\tilde f-f_0&=\sum_\nu(a_\nu\hat f_\nu-f_\nu^0)\varphi_\nu\\
&=\sum_\nu a_\nu V(R_n,\varphi_\nu)\varphi_\nu+\frac1n\sum_{i=1}^n\epsilon_i\sum_\nu\frac{a_\nu\varphi_\nu(X_i)}{1+\lambda\rho_\nu}\varphi_\nu+\sum_\nu(a_\nu-1)f_\nu^0\varphi_\nu-\sum_\nu\frac{a_\nu\lambda\rho_\nu}{1+\lambda\rho_\nu}f_\nu^0\varphi_\nu\\
(S.16)\qquad&=\sum_\nu a_\nu V(R_n,\varphi_\nu)\varphi_\nu+\sum_\nu(a_\nu-1)f_\nu^0\varphi_\nu+\sum_{i=1}^n\epsilon_iR_{X_i}-\sum_\nu f_\nu^0\frac{a_\nu\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu,
\end{align*}
where $R_{X_i}=\sum_\nu c_\nu\varphi_\nu(X_i)\varphi_\nu$, for $i=1,\dots,n$. Denote the above four terms by $I_1,I_2,I_3,I_4$. It follows by Lemma S.4 and $a_\nu^2\le1$ that
\[
n\|I_1\|^2=n\sum_\nu a_\nu^2|V(R_n,\varphi_\nu)|^2(1+\lambda\rho_\nu)\le n\|R_n\|^2=O_{P_{f_0}^n}(n(a_n\log n)^2)=o_{P_{f_0}^n}(n\tilde r_n^2)=o_{P_{f_0}^n}(h^{-1}).
\]
Since $|a_\nu-1|=O(n^{-1})$ for $\nu=1,\dots,m$, it is easy to see that $n\sum_{\nu=1}^m(a_\nu-1)^2|f_\nu^0|^2(1+\lambda\rho_\nu)=O(n^{-1})$. By the dominated convergence theorem and direct


calculations, it follows that
\begin{align*}
n\sum_{\nu>m}(a_\nu-1)^2|f_\nu^0|^2(1+\lambda\rho_\nu)
&=n\sum_{\nu>m}\Big(\frac{\rho_\nu^{1+\beta/(2m)}}{n(1+\lambda\rho_\nu)+\rho_\nu^{1+\beta/(2m)}}\Big)^2|f_\nu^0|^2(1+\lambda\rho_\nu)\\
&=n\sum_{\nu>m}\frac{(\lambda\rho_\nu)^{2+\beta/m}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2}|f_\nu^0|^2
+n\lambda\sum_{\nu>m}\frac{(\lambda\rho_\nu)^{2+\beta/m}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2}|f_\nu^0|^2\rho_\nu\\
&=n\lambda^{1+\frac{\beta-1}{2m}}\sum_{\nu>m}\frac{(\lambda\rho_\nu)^{1+(\beta+1)/(2m)}+(\lambda\rho_\nu)^{2+(\beta+1)/(2m)}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2}|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}.
\end{align*}
Let
\[
g_\lambda(\nu)=\frac{(\lambda\rho_\nu)^{1+(\beta+1)/(2m)}+(\lambda\rho_\nu)^{2+(\beta+1)/(2m)}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2},\quad\text{for }\nu>m.
\]
Clearly, for all $\lambda$ and $\nu$,
\[
0<g_\lambda(\nu)\le\sup_{x>0}\frac{x^{1+(\beta+1)/(2m)}+x^{2+(\beta+1)/(2m)}}{(1+x+x^{1+\beta/(2m)})^2}<\infty,
\]
and $\lim_{\lambda\to0}g_\lambda(\nu)=0$ for all $\nu$. Since $\sum_\nu|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}<\infty$, it follows by the dominated convergence theorem that, as $\lambda\to0$, $\sum_{\nu>m}g_\lambda(\nu)|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}\to0$. Therefore, by $n\lambda^{1+\frac{\beta-1}{2m}}\asymp h^{-1}$, we get that, as $n\to\infty$,
\[
n\|I_2\|^2=n\sum_{\nu=1}^m(a_\nu-1)^2|f_\nu^0|^2(1+\lambda\rho_\nu)+n\sum_{\nu>m}(a_\nu-1)^2|f_\nu^0|^2(1+\lambda\rho_\nu)=O(n^{-1})+o(h^{-1})=o(h^{-1}).
\]
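The finiteness of the supremum bounding $g_\lambda$ above is a degree count:

```latex
\deg\big(x^{1+\frac{\beta+1}{2m}}+x^{2+\frac{\beta+1}{2m}}\big)=2+\tfrac{\beta+1}{2m}
<2+\tfrac{2\beta}{2m}=\deg\big((1+x+x^{1+\beta/(2m)})^2\big)\quad\text{for }\beta>1,
```

so the ratio is continuous on $(0,\infty)$, vanishes at $0$, and tends to $0$ as $x\to\infty$; hence its supremum is finite.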

By the dominated convergence theorem and direct examination, it can be shown that
\begin{align*}
n\|I_4\|^2&=n\sum_\nu|f_\nu^0|^2\frac{a_\nu^2(\lambda\rho_\nu)^2}{1+\lambda\rho_\nu}\le n\sum_\nu|f_\nu^0|^2\frac{(\lambda\rho_\nu)^2}{1+\lambda\rho_\nu}\\
&=n\lambda^{1+(\beta-1)/(2m)}\sum_\nu|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}\frac{(\lambda\rho_\nu)^{1-(\beta-1)/(2m)}}{1+\lambda\rho_\nu}=o(h^{-1}),
\end{align*}
where the last equation follows by $0<1-(\beta-1)/(2m)<1$, implying $\sup_{x>0}\frac{x^{1-(\beta-1)/(2m)}}{1+x}<\infty$, and the dominated convergence theorem.
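The factorization used here (and earlier for $\|W_\lambda f_0\|^2$) isolates the weight that is summable by Condition (S):

```latex
\frac{(\lambda\rho_\nu)^2}{1+\lambda\rho_\nu}
=\lambda^{1+\frac{\beta-1}{2m}}\,\rho_\nu^{1+\frac{\beta-1}{2m}}\cdot
\frac{(\lambda\rho_\nu)^{1-\frac{\beta-1}{2m}}}{1+\lambda\rho_\nu},
```

where the exponents of $\lambda$ and of $\rho_\nu$ on the right each sum to $2$, matching the left side, and the last factor is uniformly bounded since $0<1-(\beta-1)/(2m)<1$.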


Next, let us handle $I_3$. Note that
\[
\Big\|\sum_{i=1}^n\epsilon_iR_{X_i}\Big\|^2=\sum_{i=1}^n\epsilon_i^2\|R_{X_i}\|^2+2\sum_{i<j}\epsilon_i\epsilon_j\langle R_{X_i},R_{X_j}\rangle\equiv\ \cdots
\]
\[
\cdots\ \int_0^\infty\frac{(x^{2m}+x^{2m+\beta})(1+x^{2m})}{(1+x^{2m}+x^{2m+\beta})^2}\,dx>0.
\]
This completes the proof for the $\zeta_j$ part. The proof for the $\tau_j^2$ part is similar, and is omitted.

S.8. Proofs in Section 5.3.

Proof of Theorem 5.7. We start from the decomposition (S.16) in the proof of Lemma 5.4. Let $I_1,I_2,I_3,I_4$ be defined the same as in (S.16). By $m>1+\sqrt3/2$ and $1<\beta<m+1/2$, it can be shown that $r_n=(nh)^{-1/2}+h^m=O(h^m)$, and hence $n(a_n\log n)^2=o(1)$. Therefore,
\[
n\|I_1\|_\dagger^2\le n\sum_\nu d_\nu|V(R_n,\varphi_\nu)|^2\lesssim n\sum_\nu|V(R_n,\varphi_\nu)|^2\le n\|R_n\|^2=O_{P_{f_0}^n}(n(a_n\log n)^2)=o_{P_{f_0}^n}(1).
\]

By $|a_\nu-1|=O(1/n)$ and the choice of $d_\nu$, we get
\begin{align*}
n\|I_2\|_\dagger^2&=n\sum_\nu d_\nu(a_\nu-1)^2|f_\nu^0|^2=n\sum_{\nu=1}^md_\nu(a_\nu-1)^2|f_\nu^0|^2+n\sum_{\nu>m}d_\nu(a_\nu-1)^2|f_\nu^0|^2\\
&=O(1/n)+n\sum_{\nu>m}d_\nu\frac{\rho_\nu^{2+\beta/m}}{(n(1+\lambda\rho_\nu)+\rho_\nu^{1+\beta/(2m)})^2}|f_\nu^0|^2.
\end{align*}


Using $\lambda\asymp n^{-2m/(2m+\beta)}$ and the dominated convergence theorem, it can be shown that
\begin{align*}
n\sum_{\nu>m}d_\nu\frac{\rho_\nu^{2+\beta/m}}{(n(1+\lambda\rho_\nu)+\rho_\nu^{1+\beta/(2m)})^2}|f_\nu^0|^2
&\lesssim n^{-1}\sum_{\nu>m}d_\nu\frac{\rho_\nu^{2+\beta/m}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2}|f_\nu^0|^2\\
&\lesssim n^{-1}\sum_{\nu>m}\rho_\nu^{-1/(2m)}\frac{\rho_\nu^{1+(\beta+1)/(2m)}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2}|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}\\
&\lesssim\sum_{\nu>m}\frac{(\lambda\rho_\nu)^{1+\beta/(2m)}}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^2}|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}=o(1).
\end{align*}
Therefore, $n\|I_2\|_\dagger^2=o(1)$. Similarly, we have
\begin{align*}
n\|I_4\|_\dagger^2&=n\sum_\nu d_\nu|f_\nu^0|^2\frac{a_\nu^2(\lambda\rho_\nu)^2}{(1+\lambda\rho_\nu)^2}\le n\sum_\nu d_\nu\frac{(\lambda\rho_\nu)^2}{(1+\lambda\rho_\nu)^2}|f_\nu^0|^2\\
&\lesssim n\sum_{\nu>m}\rho_\nu^{-1/(2m)}\frac{(\lambda\rho_\nu)^2\rho_\nu^{-1-(\beta-1)/(2m)}}{(1+\lambda\rho_\nu)^2}|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}\\
&\lesssim\sum_\nu\frac{(\lambda\rho_\nu)^{1-\beta/(2m)}}{(1+\lambda\rho_\nu)^2}|f_\nu^0|^2\rho_\nu^{1+(\beta-1)/(2m)}=o(1).
\end{align*}
Next we handle $I_3$. Define
\[
T(x,y)=\sum_{\nu=1}^m\frac{\varphi_\nu(x)\varphi_\nu(y)}{1+n^{-1}\sigma_\nu^{-2}}+\sum_{\nu>m}\frac{\varphi_\nu(x)\varphi_\nu(y)}{1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)}}.
\]
It follows by Theorem 5 of [42] and the proof of Proposition 2.2 in [38] that for any $j=0,1,2,\dots,2m$, $\sup_\nu\|\varphi_\nu^{(j)}\|_{\sup}\rho_\nu^{-j/(2m)}<\infty$. And hence, for any $x,y\in I$,
\[
\Big|\frac\partial{\partial x}T(x,y)\Big|\lesssim O(1)+\sum_{\nu>m}\frac{|\dot\varphi_\nu(x)\varphi_\nu(y)|}{1+(h\nu)^{2m}+(h\nu)^{2m+\beta}}\lesssim O(1)+\sum_{\nu>m}\frac\nu{1+(h\nu)^{2m}+(h\nu)^{2m+\beta}}\asymp h^{-2}\int_0^\infty\frac s{1+s^{2m}+s^{2m+\beta}}\,ds,
\]


which implies $\sup_{x,y\in I}|\frac\partial{\partial x}T(x,y)|=O(h^{-2})$. For simplicity, define $T_x(y)=T(x,y)$ for any $x,y\in I$. Let $R_{X_i}$ be defined as in the proof of Lemma 5.4. By direct examination, it can be seen that
\[
R_{X_i}=\sum_{\nu=1}^m\frac{\varphi_\nu(X_i)\varphi_\nu}{n+\sigma_\nu^{-2}}+\sum_{\nu>m}\frac{\varphi_\nu(X_i)\varphi_\nu}{n(1+\lambda\rho_\nu)+\rho_\nu^{1+\beta/(2m)}}=T_{X_i}/n.
\]
So $I_3=n^{-1}\sum_{i=1}^n\epsilon_iT_{X_i}$, which implies $n\|I_3\|_\dagger^2=n^{-1}\|\sum_{i=1}^n\epsilon_iT_{X_i}\|_\dagger^2$. Since $E_{f_0}\{\exp(|\epsilon|/C_1)\}<\infty$, we can choose a constant $C>C_1$ such that $P_{f_0}^n(E_n)\to1$, where $E_n=\{\max_{1\le i\le n}|\epsilon_i|\le b_n\equiv C\log n\}$. We can even choose the above $C$ to be properly large so that the following rate condition holds:
\[
(S.17)\qquad h^{-1}n^{1/2}\exp(-b_n/(2C_1))=o(1),\qquad h^{-2}\exp(-b_n/(2C_1))=o(1).
\]
Define
\[
H_n(\cdot)=n^{-1/2}\sum_{i=1}^n\epsilon_iT_{X_i}(\cdot),\qquad H_n^b(\cdot)=n^{-1/2}\sum_{i=1}^n\epsilon_iI(|\epsilon_i|\le b_n)T_{X_i}(\cdot).
\]
Write $H_n=H_n-H_n^b-E_{f_0}\{H_n-H_n^b\}+H_n^b-E_{f_0}\{H_n^b\}$. Clearly, on $E_n$, $H_n=H_n^b$, and hence,
\begin{align*}
|H_n(z)-H_n^b(z)-E_{f_0}\{H_n(z)-H_n^b(z)\}|&=|E_{f_0}\{H_n(z)-H_n^b(z)\}|=n^{1/2}|E_{f_0}\{\epsilon I(|\epsilon|>b_n)T_X(z)\}|\\
&\lesssim n^{1/2}h^{-1}E_{f_0}\{\epsilon^2\}^{1/2}P_{f_0}^n(|\epsilon|>b_n)^{1/2}\le n^{1/2}h^{-1}\exp(-b_n/(2C_1))=o(1),
\end{align*}
where the last $o(1)$-term follows by (S.17) and is free of the argument $z$. Thus,
\[
(S.18)\qquad\sup_{z\in I}|H_n(z)-H_n^b(z)-E_{f_0}\{H_n(z)-H_n^b(z)\}|=o_{P_{f_0}^n}(1).
\]
Define $R_n=H_n^b-E_{f_0}\{H_n^b\}$ and $Z_n(e,x)=n^{1/2}(P_n(e,x)-P(e,x))$, where $P_n(e,x)$ is the empirical distribution of $(\epsilon,X)$ and $P(e,x)$ is the population distribution of $(\epsilon,X)$ under $P_{f_0}^n$-probability. It follows by Theorem 1 of [41] that
\[
(S.19)\qquad\sup_{e\in\mathbb R,x\in I}|Z_n(e,x)-W(t(e,x))|=O_{P_{f_0}^n}(n^{-1/2}(\log n)^2),
\]


where $W(\cdot,\cdot)$ is a Brownian bridge indexed on $I^2$, $t(e,x)=(F_1(x),F_2(e|x))$, $F_1$ is the marginal distribution of $X$ and $F_2$ is the conditional distribution of $\epsilon$ given $X$, both under $P_{f_0}^n$-probability. It can be seen that
\[
R_n(z)=\int_0^1\int_{-b_n}^{b_n}eT_x(z)\,dZ_n(e,x).
\]
Define
\[
R_n^0(z)=\int_0^1\int_{-b_n}^{b_n}eT_x(z)\,dW(t(e,x)).
\]
Write
\[
dU_n(x)=\int_{-b_n}^{b_n}e\,dZ_n(e,x),\qquad dU_n^0(x)=\int_{-b_n}^{b_n}e\,dW(t(e,x)).
\]
It follows from integration by parts, where all quadratic variation terms are zero, that
\[
U_n(x)=\int_{-b_n}^{b_n}e\,d_eZ_n(e,x)=Z_n(e,x)e\big|_{e=-b_n}^{b_n}-\int_{-b_n}^{b_n}Z_n(e,x)\,de,
\]
\[
U_n^0(x)=\int_{-b_n}^{b_n}e\,d_eW(t(e,x))=W(t(e,x))e\big|_{e=-b_n}^{b_n}-\int_{-b_n}^{b_n}W(t(e,x))\,de,
\]
and hence, it follows by (S.19) that
\[
\sup_{x\in I}|U_n(x)-U_n^0(x)|=O_{P_{f_0}^n}(b_nn^{-1/2}(\log n)^2).
\]
It follows from integration by parts again and $\sup_{x,y\in I}|\frac\partial{\partial x}T(x,y)|=O(h^{-2})$ that
\[
R_n(z)=\int_0^1T_x(z)\,dU_n(x)=U_n(x)T(x,z)\big|_{x=0}^1-\int_0^1U_n(x)\frac\partial{\partial x}T(x,z)\,dx,
\]
\[
R_n^0(z)=\int_0^1T_x(z)\,dU_n^0(x)=U_n^0(x)T(x,z)\big|_{x=0}^1-\int_0^1U_n^0(x)\frac\partial{\partial x}T(x,z)\,dx,
\]
and hence,
\[
(S.20)\qquad\sup_{z\in I}|R_n(z)-R_n^0(z)|=O_{P_{f_0}^n}(h^{-2}b_nn^{-1/2}(\log n)^2)=o_{P_{f_0}^n}(1),
\]
where the last equality follows by $2m+\beta>4$, and hence $h^{-2}n^{-1/2}b_n(\log n)^2=O(n^{-1/2+2/(2m+\beta)}(\log n)^3)=o(1)$.


Next we handle the term $R_n^0$. Write $W(s,t)=B(s,t)-stB(1,1)$, where $B(s,t)$ is a standard Brownian motion indexed on $I^2$. Define
\[
\bar R_n^0(z)=\int_0^1\int_{-b_n}^{b_n}eT_x(z)\,dB(t(e,x)).
\]
Let $F(e,x)=F_1(x)F_2(e|x)$ be the joint distribution of $(\epsilon,X)$. It is easy to see that
\begin{align*}
|\bar R_n^0(z)-R_n^0(z)|&=|B(1,1)|\cdot\Big|\int_0^1\int_{-b_n}^{b_n}eT_x(z)\,dF(e,x)\Big|=|B(1,1)|\cdot|E_{f_0}\{\epsilon I(|\epsilon|\le b_n)T_X(z)\}|\\
&=|B(1,1)|\cdot|E_{f_0}\{\epsilon I(|\epsilon|>b_n)T_X(z)\}|=O_{P_{f_0}^n}(h^{-1}\exp(-b_n/(2C_1)))=o_{P_{f_0}^n}(1),
\end{align*}
where the last equality follows by (S.17). Therefore, we have shown that
\[
(S.21)\qquad\sup_{z\in I}|\bar R_n^0(z)-R_n^0(z)|=o_{P_{f_0}^n}(1).
\]
By (S.18), (S.20) and (S.21),
\[
(S.22)\qquad n\|I_3\|_\dagger^2=\|H_n\|_\dagger^2=\|\bar R_n^0\|_\dagger^2+o_{P_{f_0}^n}(1).
\]
Define
\[
\tilde R(z)=\int_0^1\int_{-\infty}^\infty eT_x(z)\,dB(t(e,x)).
\]
Let $\Delta(z)=\tilde R(z)-\bar R_n^0(z)$. Then
\[
\Delta(z)=\int_0^1\int_{|e|>b_n}eT_x(z)\,dB(t(e,x)).
\]
For each $z$, $\Delta(z)$ is a zero-mean Gaussian random variable with variance
\[
E_{f_0}\{\Delta(z)^2\}=\int_0^1\int_{|e|>b_n}e^2T_x(z)^2\,dF(e,x)\lesssim h^{-2}E_{f_0}\{\epsilon^2I(|\epsilon|>b_n)\}=O(h^{-2}\exp(-b_n/(2C_1)))=o(1),
\]
where the last $o(1)$-term follows from (S.17) and is free of the argument $z$. Therefore,
\[
E_{f_0}\{\|\tilde R-\bar R_n^0\|_\dagger^2\}\lesssim E_{f_0}\{\|\Delta\|_{L^2}^2\}=\int_0^1E_{f_0}\{\Delta(z)^2\}\,dz=o(1),
\]


implying that $\|\tilde R-\bar R_n^0\|_\dagger=o_{P_{f_0}^n}(1)$. Therefore, it follows by (S.22) that
\[
(S.23)\qquad n\|I_3\|_\dagger^2=\|\tilde R\|_\dagger^2+o_{P_{f_0}^n}(1).
\]
It follows from the definition of $T(\cdot,\cdot)$ that
\[
\|\tilde R\|_\dagger^2=\sum_{\nu=1}^m\frac{d_\nu\tilde\eta_\nu^2}{(1+n^{-1}\sigma_\nu^{-2})^2}+\sum_{\nu>m}\frac{d_\nu\tilde\eta_\nu^2}{(1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)})^2}\to\sum_{\nu=1}^\infty d_\nu\tilde\eta_\nu^2,
\]
where $\tilde\eta_\nu=\int_0^1\int_{-\infty}^\infty e\varphi_\nu(x)\,dB(t(e,x))$. It is easy to see that for any $\nu,\mu$,
\[
E_{f_0}\{\tilde\eta_\nu\tilde\eta_\mu\}=E_{f_0}\{\epsilon^2\varphi_\nu(X)\varphi_\mu(X)\}=E_{f_0}\{B(X)\varphi_\nu(X)\varphi_\mu(X)\}=V(\varphi_\nu,\varphi_\mu)=\delta_{\nu\mu},
\]
that is, $\tilde\eta_\nu$ are iid standard normal random variables. Combined with the above analysis of the terms $I_1,I_2,I_3,I_4$, we have shown that, as $n\to\infty$,
\[
n\|f_0-\tilde f_{n,\lambda}\|_\dagger^2\overset d\to\sum_{\nu=1}^\infty d_\nu\tilde\eta_\nu^2.
\]

This implies that, as $n\to\infty$,
$$P_{f_0}^n\big(f_0\in R_n^\dagger(\alpha)\big)=P_{f_0}^n\big(n\|f_0-\tilde f_{n,\lambda}\|_\dagger^2\le c_\alpha\big)\to 1-\alpha.$$
The proof is completed.

S.9. Proofs in Section 5.4.

Proof of Theorem 5.9. Let $I_1,I_2,I_3,I_4$ be the decomposition terms in (S.16). Then $\|\tilde f-f_0-I_3\|^2\le 3(\|I_1\|^2+\|I_2\|^2+\|I_4\|^2)$. We first handle the norms of $I_1,I_2,I_4$. Since $m>1+\sqrt{3}/2$ and $1<\beta<m+1/2$, it can be shown that $n(a_n\log n)^2=o(1)$. It follows from the derivations in the proof of Lemma 5.4 that, as $n\to\infty$,
$$n\|I_1\|^2=O_{P^n_{f_0}}\big(n(a_n\log n)^2\big)=o_{P^n_{f_0}}(1),$$
$$n\|I_2\|^2\lesssim nh^{2m+\beta}\sum_\nu\frac{(h\nu)^{2m+\beta}+(h\nu)^{4m+\beta}}{(1+(h\nu)^{2m}+(h\nu)^{2m+\beta})^2}\,|f_\nu^0|^2\nu^{2m+\beta}=o(1),$$
$$n\|I_4\|^2\lesssim nh^{2m+\beta}\sum_\nu\frac{(h\nu)^{2m-\beta}}{1+(h\nu)^{2m}}\,|f_\nu^0|^2\nu^{2m+\beta}=o(1).$$
This means that $\|\tilde f-f_0-I_3\|=o_{P^n_{f_0}}(n^{-1/2})$. It follows by (5.8) that $|F(\tilde f-f_0-I_3)|\le\kappa h^{-1/2}\|\tilde f-f_0-I_3\|=o_{P^n_{f_0}}(h^{-1/2}n^{-1/2})$.


Before proceeding further, we analyze $F(I_3)$, which equals $\sum_{i=1}^n\epsilon_i F(R_{X_i})$. Since both $\varphi_\nu$ and $F(\varphi_\nu)$ are uniformly bounded, we have
$$E_{f_0}\{\epsilon^2|nF(R_X)|^2\}=\sum_\nu n^2c_\nu^2F(\varphi_\nu)^2
=\sum_{\nu=1}^m\frac{F(\varphi_\nu)^2}{(1+n^{-1}\sigma_\nu^{-2})^2}
+\sum_{\nu>m}\frac{F(\varphi_\nu)^2}{(1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)})^2}
\lesssim h^{-1},$$
$$|nF(R_X)|=\Big|\sum_{\nu=1}^m\frac{\varphi_\nu(X)F(\varphi_\nu)}{1+n^{-1}\sigma_\nu^{-2}}
+\sum_{\nu>m}\frac{\varphi_\nu(X)F(\varphi_\nu)}{1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)}}\Big|
\lesssim h^{-1}.$$
Let $s_n^2=\mathrm{Var}_{P^n_{f_0}}\big(\sum_{i=1}^n\epsilon_i\,nF(R_{X_i})\big)$. It is easy to see that
$$s_n^2=n\Big(\sum_{\nu=1}^m\frac{F(\varphi_\nu)^2}{(1+n^{-1}\sigma_\nu^{-2})^2}
+\sum_{\nu>m}\frac{F(\varphi_\nu)^2}{(1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)})^2}\Big).$$
Clearly, $s_n^2=O(nh^{-1})$. By (5.10), $s_n^2\gtrsim nh^{-r}$. Next we verify Lindeberg's condition: for any $\delta>0$,
$$\begin{aligned}
s_n^{-2}\sum_{i=1}^n E_{f_0}\{\epsilon_i^2|nF(R_{X_i})|^2 I(|\epsilon_i nF(R_{X_i})|\ge\delta s_n)\}
&=ns_n^{-2}E_{f_0}\{\epsilon^2|nF(R_X)|^2 I(|\epsilon nF(R_X)|\ge\delta s_n)\}\\
&\le ns_n^{-2}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}^{1/2}\,P(|\epsilon nF(R_X)|\ge\delta s_n)^{1/2}\\
&\le ns_n^{-2}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}^{1/2}\big((\delta s_n)^{-4}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}\big)^{1/2}\\
&=ns_n^{-2}(\delta s_n)^{-2}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}\\
&\lesssim ns_n^{-4}h^{-3}\lesssim n^{-1}h^{2r-3}=n^{-\frac{2m+\beta+2r-3}{2m+\beta}}=o(1).
\end{aligned}$$

This implies that Lindeberg's condition holds. By the central limit theorem,
$$nF(I_3)/s_n=\sum_{i=1}^n\epsilon_i\,nF(R_{X_i})/s_n\overset{d}{\to}N(0,1).$$
Observe that $F(f_0)\in CI_{n,\lambda}(\alpha)$ is equivalent to $|F(\tilde f-f_0)|\le z_{\alpha/2}t_n/\sqrt n$. In the meantime,
$$\frac{n}{s_n}F(\tilde f-f_0)=\frac{n}{s_n}F(\tilde f-f_0-I_3)+\frac{n}{s_n}F(I_3)
=\frac{n}{s_n}F(I_3)+o_{P^n_{f_0}}(1)\overset{d}{\to}N(0,1).$$


Since $t_n^2\ge s_n^2/n$, we get that
$$P_{f_0}^n\big(F(f_0)\in CI_{n,\lambda}(\alpha)\big)
=P_{f_0}^n\big(|F(\tilde f-f_0)|\le z_{\alpha/2}t_n/\sqrt n\big)
=P_{f_0}^n\Big(\frac{n}{s_n}|F(\tilde f-f_0)|\le z_{\alpha/2}\frac{\sqrt n\,t_n}{s_n}\Big)
\ge P_{f_0}^n\Big(\frac{n}{s_n}|F(\tilde f-f_0)|\le z_{\alpha/2}\Big)\to 1-\alpha,$$
as $n\to\infty$. In particular, we notice that when $\sum_\nu F(\varphi_\nu)^2<\infty$ and (5.10) holds for $r=0$, it follows by the dominated convergence theorem that both $s_n^2/n$ and $t_n^2$ converge to $\sum_\nu F(\varphi_\nu)^2$. This implies that $\sqrt n\,t_n/s_n\to1$, and hence $P_{f_0}^n(F(f_0)\in CI_{n,\lambda}(\alpha))\to 1-\alpha$. This completes the proof of the theorem.

Proof of Proposition 5.10. Proof of (i). By direct examination, it can be shown that when $\nu\ge3$ is odd, $\cos(x)\cosh(x)=1$ has a unique solution in $((\nu+1/2)\pi,(\nu+1)\pi)$, that is, $\gamma_\nu\in((\nu+1/2)\pi,(\nu+1)\pi)$; when $\nu\ge3$ is even, $\cos(x)\cosh(x)=1$ has a unique solution in $(\nu\pi,(\nu+1/2)\pi)$, that is, $\gamma_\nu\in(\nu\pi,(\nu+1/2)\pi)$. Consequently, for any $k\ge1$, $0<\gamma_{2k+2}-\gamma_{2k+1}<\pi$. Let $\delta_0$ be a constant such that $0<\delta_0<\pi/2-\pi|z-1/2|$, and let $d_0=\min\{\sin^2(\delta_0),\cos^2(\delta_0+\pi|z-1/2|)\}$. Clearly, $d_0>0$ is a constant. It is easy to see that, as $k\to\infty$,
$$\frac{\sinh(\gamma_{2k+1}(z-1/2))}{\sinh(\gamma_{2k+1}/2)}\to0,
\quad\text{and}\quad
\frac{\cosh(\gamma_{2k+2}(z-1/2))}{\cosh(\gamma_{2k+2}/2)}\to0.$$
Then for arbitrarily small $\varepsilon\in(0,d_0/8)$, there exists $N$ s.t. for any $k\ge N$,
$$\varphi_{2k+1}(z)^2\ge\tfrac12\sin^2(\gamma_{2k+1}(z-1/2))-\varepsilon,
\quad\text{and}\quad
\varphi_{2k+2}(z)^2\ge\tfrac12\cos^2(\gamma_{2k+2}(z-1/2))-\varepsilon.$$
Let $\phi_k'=(\gamma_{2k+2}-\gamma_{2k+1})(z-1/2)$. Then $|\phi_k'|\le\pi|z-1/2|<\pi/2$. There exists an integer $l_k$ s.t. $\gamma_{2k+1}(z-1/2)=\phi_k+l_k\pi$, where $\phi_k\in[0,\pi)$. Then $\sin^2(\gamma_{2k+1}(z-1/2))=\sin^2(\phi_k)$, and $\cos^2(\gamma_{2k+2}(z-1/2))=\cos^2(\phi_k+\phi_k')$. If $0\le\phi_k\le\delta_0$, then $-\pi|z-1/2|\le\phi_k'\le\phi_k+\phi_k'\le\delta_0+\phi_k'\le\delta_0+\pi|z-1/2|$, and therefore $\cos^2(\phi_k+\phi_k')\ge\cos^2(\delta_0+\pi|z-1/2|)$. If $\delta_0<\phi_k<\pi-\delta_0$, then $\sin^2(\phi_k)\ge\sin^2(\delta_0)$. If $\pi-\delta_0\le\phi_k<\pi$, then $\pi-\delta_0-\pi|z-1/2|\le\phi_k+\phi_k'<\pi+\pi|z-1/2|$.


Therefore, $\cos^2(\phi_k+\phi_k')\ge\cos^2(\delta_0+\pi|z-1/2|)$. Consequently, for any $k\ge N$,
$$\varphi_{2k+1}(z)^2+\varphi_{2k+2}(z)^2
\ge\tfrac12\big(\sin^2(\gamma_{2k+1}(z-1/2))+\cos^2(\gamma_{2k+2}(z-1/2))\big)-2\varepsilon
\ge\tfrac12\min\{\sin^2(\delta_0),\cos^2(\delta_0+\pi|z-1/2|)\}-2\varepsilon\ge d_0/4.$$

Then we have
$$\begin{aligned}
\sum_{\nu>2}\frac{h\varphi_\nu(z)^2}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/4})^j}
&=\sum_{k\ge1}\frac{h\varphi_{2k+1}(z)^2}{(1+\lambda\rho_{2k+1}+(\lambda\rho_{2k+1})^{1+\beta/4})^j}
+\sum_{k\ge1}\frac{h\varphi_{2k+2}(z)^2}{(1+\lambda\rho_{2k+2}+(\lambda\rho_{2k+2})^{1+\beta/4})^j}\\
&\ge\sum_{k\ge1}\frac{h\varphi_{2k+1}(z)^2+h\varphi_{2k+2}(z)^2}{(1+\lambda\rho_{2k+2}+(\lambda\rho_{2k+2})^{1+\beta/4})^j}\\
&\ge\sum_{k\ge N}\frac{h\,d_0/4}{(1+\lambda\rho_{2k+2}+(\lambda\rho_{2k+2})^{1+\beta/4})^j}\\
&\gtrsim\sum_{k\ge N}\frac{h}{(1+(k\pi h)^4+(k\pi h)^{4+\beta})^j}\\
&\ge\int_N^\infty\frac{h}{(1+(\pi hx)^4+(\pi hx)^{4+\beta})^j}\,dx\\
&=\frac1\pi\int_{\pi Nh}^\infty\frac{1}{(1+x^4+x^{4+\beta})^j}\,dx
\xrightarrow{h\to0}\frac1\pi\int_0^\infty\frac{1}{(1+x^4+x^{4+\beta})^j}\,dx>0.
\end{aligned}$$

This shows that condition (5.10) holds for $r=1$.

Proof of (ii). Write $\omega=\sum_\nu\omega_\nu\varphi_\nu$, where $\{\omega_\nu\}$ is a square-summable real sequence. Then $F_\omega(\varphi_\nu)=\int_0^1\omega(z)\varphi_\nu(z)\,dz=\omega_\nu$. Therefore, $\sum_\nu F_\omega(\varphi_\nu)^2=\sum_\nu\omega_\nu^2<\infty$. Meanwhile, since $\omega\ne0$, $\sum_{\nu=1}^\infty F_\omega(\varphi_\nu)^2>0$. Consequently, for $j=1,2$, it follows by the dominated convergence theorem that, as $n\to\infty$,
$$\sum_{\nu=1}^m\frac{F_\omega(\varphi_\nu)^2}{(1+n^{-1}\sigma_\nu^{-2})^j}
+\sum_{\nu>m}\frac{F_\omega(\varphi_\nu)^2}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^j}
\to\sum_{\nu=1}^\infty F_\omega(\varphi_\nu)^2>0.$$
Hence (5.10) holds for $r=0$.

Department of Statistics
Purdue University
250 N. University Street
West Lafayette, IN 47906
Email: [email protected]; [email protected]



1. Introduction. A common practice to quantify Bayesian uncertainty is to construct credible sets (or regions) that cover a large fraction of posterior mass. The frequentist validity of Bayesian credible sets is often called the Bernstein-von Mises (BvM) phenomenon. In nonparametric settings, the BvM theorem was initially found not to hold by [13, 18]. An essential reason for this failure is the lack of flexibility in the assigned priors. In this paper, we introduce a novel tuning prior under which the BvM theorem can be established in a very general class of nonparametric regression models. More interestingly, this type of prior connects two important classes of statistical methods, nonparametric Bayes and smoothing spline, at a fundamental level.



This coupling phenomenon is practically useful in the sense that a proper tuning prior can be selected through generalized cross validation.

We begin with Gaussian sequence models, which can be used as a platform to investigate more complicated statistical problems; see [3, 35]. Recently, a series of (adaptive) credible sets have been proposed for their mean sequences in the literature; see [29, 26, 39, 40, 10]. In this paper, we consider a novel tuning prior with a (non-random) hyper-parameter that can be used to tune the assigned prior. Under tuning priors, we demonstrate that when the tuning parameter is properly selected, the frequentist coverage of an asymptotic credible set approaches one irrespective of the credibility level; we call this the weak BvM phenomenon. On the other hand, if the tuning parameter is poorly chosen, there always exist uncountably many true parameters that are asymptotically excluded by the constructed credible set. Hence, we observe a phase transition phenomenon, caused by the fact that the posterior standard deviation is always strictly greater than, or strictly smaller than, the deviation between the truth and the posterior mode. Also see Figure 1 for a graphical illustration. The over-conservativeness of credible sets can be easily resolved by invoking a weaker topology (inspired by [10]). This leads to the so-called strong BvM phenomenon. In the end, we construct adaptive credible sets by substituting a consistent estimate (with a mild convergence rate) for the unknown model regularity. With such a simple replacement, the adaptive sets still preserve the same appealing features as those built upon the known regularity. In particular, their radii do not involve the annoying "blow-up" constant as in [40]; see [33] for more discussions.
The major contribution of this paper is to establish the BvM theorem in a very general class of nonparametric regression models by borrowing the theoretical insights obtained from the simple Gaussian sequence models. For example, the assigned Gaussian process (GP) prior also involves a tuning hyper-parameter (this idea is rooted in [48]). As far as we know, our BvM theorem is the first nonparametric one that holds under the total variation distance. Such a result is crucial in constructing credible regions for the regression function under both strong and weak topology, and credible intervals for its linear functionals, e.g., its local value. In particular, their constructions can be concretized by ODE techniques, and their frequentist behaviors are carefully investigated. In contrast, the BvM theorem developed by [11] is in terms of a weaker metric, i.e., the bounded Lipschitz metric, and crucially relies on the efficiency of the nonparametric estimate ([5]). Other related BvM theorems include [14, 15, 9, 4, 36, 12] for semiparametric models or functionals of the density.

Our second contribution is to obtain posterior contraction rates under a type of Sobolev norm as an intermediate step in developing our BvM theorem. This Sobolev norm is stronger than the norms commonly used in deriving contraction rates, such as the Kullback-Leibler norm and the supremum norm ([20, 21, 43, 44, 45, 46]). The uniformly consistent test arguments ([21]), together with the Cameron-Martin theorem [7] and a Gaussian correlation inequality [30], are applied to derive this new result. Interestingly, a straightforward application of Hájek's Lemma ([22]) illustrates that the posterior distribution under the assigned GP prior turns out to be closely related to the penalized likelihood in the smoothing spline literature. The intrinsic connection with smoothing spline greatly facilitates both theoretical analysis and practical implementation for nonparametric Bayesian inference. For example, we can apply smoothing spline theory in our Bayesian setting, and we can employ the generalized cross validation (GCV) method to select a proper tuning hyper-parameter in the assigned GP prior, under which credible regions/intervals are frequentist valid. This novel methodology is crucially different from the empirical Bayes approach, and is strongly supported by our simulation results.

The rest of this article is organized as follows. Section 2 focuses on Gaussian sequence models under two types of tuning priors, and conducts a frequentist evaluation of the resulting (adaptive) credible sets. In Section 3, we present a general class of nonparametric regression models, including Gaussian regression and logistic regression, and the assigned GP priors. The duality between nonparametric Bayes models and smoothing spline models is also discussed. Section 4 presents results on posterior contraction rates under a type of Sobolev norm. Section 5.1 contains the main result of the article: the nonparametric BvM theorem. Sections 5.2 – 5.4 discuss two important applications of the BvM theorem: construction of credible regions for the regression function and credible intervals for a general class of linear functionals.
Their frequentist validity is also discussed. Section 6 contains a simulation study. All technical details are deferred to the online supplementary material.

2. Gaussian Sequence Model. In this section, we consider Gaussian sequence models, and generalize the Gaussian priors in [18] by introducing tuning parameters. In this new Bayesian framework, we demonstrate that when the tuning parameter is properly chosen, the credible region possesses more satisfactory frequentist coverage properties, which can be further improved based on a weaker topology; also see [10]. A phase transition phenomenon of credible regions is also observed. A set of adaptive credible regions is constructed in the end. The theoretical insights obtained from this simple model will be naturally carried over to a very general class of nonparametric regression models considered in Section 3.


Consider the following Gaussian sequence model:
$$Y_i=\theta_i+\frac{1}{\sqrt n}\,\epsilon_i,\qquad i=1,2,\ldots,\tag{2.1}$$
where $\theta=\{\theta_i\}_{i=1}^\infty\in\ell_2(q_0)\equiv\{\{\theta_i\}_{i=1}^\infty:\sum_{i=1}^\infty\theta_i^2 i^{2q_0}<\infty\}$ for some constant $q_0\ge0$. In particular, denote $\ell_2=\ell_2(0)$, the space of real square-summable sequences. Here, the $\epsilon_i$'s are iid standard Gaussian random variables. Suppose $\theta_0=\{\theta_i^0\}_{i=1}^\infty\in\ell_2(q_0)$ is the "true" parameter generating the $Y_i$'s.

2.1. Tuning Prior I. We assign the following Gaussian priors on the mean sequence:

P1. $\theta_i\sim N(0,\tau_i^{-2})$, where $\tau_i^2=n\lambda i^{2p}$, $\lambda=n^{-\alpha_0}$, $\alpha_0>0$, and $p$ satisfies $p>q_0+1/2\ge1/2$;

we call these tuning priors indexed by $\lambda$. This leads to a more flexible Bayesian framework. In particular, when $n\lambda=1$ (equivalently, $\alpha_0=1$) and $q_0=0$, the above setup reduces to the one considered in [18]. It is easy to see that the posterior for $\theta_i$ is also Gaussian:
$$\theta_i\,|\,Y_i\sim N(\hat\theta_{ni},v_{ni}^2),\qquad i=1,2,\ldots,$$
where $\hat\theta_{ni}=Y_i/(1+\lambda i^{2p})$ and $v_{ni}^2=\{n(1+\lambda i^{2p})\}^{-1}$. Thus, given the data, we can rewrite $\theta_i$ as $\theta_i=\hat\theta_{ni}+v_{ni}\eta_i$, where $\eta_1,\eta_2,\ldots\overset{iid}{\sim}N(0,1)$. Denote by $\hat\theta$ the posterior mode $\{\hat\theta_{ni}\}_{i=1}^\infty$, and let $h=\lambda^{1/(2p)}$. Similar to Theorem 1 of [18], it can be shown that $n\sqrt h\,\|\theta-\hat\theta\|_2^2=h^{-1/2}E_n+D_n^{1/2}Z_n$, where

$$D_n\equiv\sum_{i=1}^\infty\frac{2h}{(1+\lambda i^{2p})^2},\qquad
E_n\equiv\sum_{i=1}^\infty\frac{h}{1+\lambda i^{2p}},\qquad\text{and}\qquad
Z_n\equiv D_n^{-1/2}\sum_{i=1}^\infty\frac{\sqrt h\,(\eta_i^2-1)}{1+\lambda i^{2p}}.\tag{2.2}$$
Since $Z_n\overset{d}{\to}N(0,1)$, we have the following $(1-\alpha)$ asymptotic credible region
$$C_n(\alpha)=\Big\{\theta=\{\theta_i\}_{i=1}^\infty\in\ell_2(q_0):\ \|\theta-\hat\theta\|_2\le\sqrt{\frac{E_n+\sqrt{D_n h}\,z_\alpha}{nh}}\Big\},\tag{2.3}$$
where $z_\alpha=\Phi^{-1}(1-\alpha)$ and $\Phi$ is the c.d.f. of $N(0,1)$. It is easy to see that $D_n\approx2\int_0^\infty\frac{1}{(1+x^{2p})^2}\,dx$ and $E_n\approx\int_0^\infty\frac{1}{1+x^{2p}}\,dx$, where $a_n\approx b_n$ means $\lim_{n\to\infty}a_n/b_n=1$. Hence, the radius of $C_n(\alpha)$ contracts at the rate $n^{-\frac{2p-\alpha_0}{4p}}$.
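As a quick numerical illustration (our own sketch, not code from the paper): under Prior P1 all ingredients of $C_n(\alpha)$ are explicit, so the region can be computed directly once the infinite sums in (2.2) are truncated. The truncation level, the simulated truth, and all variable names below are illustrative choices of ours.

```python
import numpy as np

# Sketch: computing the credible ball C_n(alpha) of (2.3) under Prior P1.
# N_TRUNC and the simulated truth are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
n, p, alpha0, alpha = 500, 2.0, 1.0, 0.05
lam = float(n) ** (-alpha0)
h = lam ** (1.0 / (2.0 * p))
N_TRUNC = 5000
i = np.arange(1, N_TRUNC + 1, dtype=float)

# Data from model (2.1) with a smooth truth theta0_i = i^{-2}.
theta0 = i ** (-2.0)
Y = theta0 + rng.standard_normal(N_TRUNC) / np.sqrt(n)

# Posterior: theta_i | Y_i ~ N(Y_i/(1 + lam*i^{2p}), 1/(n*(1 + lam*i^{2p}))).
shrink = 1.0 + lam * i ** (2.0 * p)
theta_hat = Y / shrink

# D_n, E_n of (2.2) and the radius of C_n(alpha) in (2.3).
D_n = np.sum(2.0 * h / shrink ** 2)
E_n = np.sum(h / shrink)
z_alpha = 1.6448536269514722  # Phi^{-1}(0.95)
radius = np.sqrt((E_n + np.sqrt(D_n * h) * z_alpha) / (n * h))

# Check whether the truth lies in C_n(alpha) for this particular draw.
covered = float(np.sqrt(np.sum((theta0 - theta_hat) ** 2))) <= radius
print(radius, covered)
```

With this choice $\alpha_0=1\ge 2p/(2q_0+1)$, so Part (1) of Theorem 2.1 predicts coverage probability tending to one, which such a simulation can be used to check empirically.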


NONPARAMETRIC BVM THEOREM

The frequentist coverage probabilities of $C_n(\alpha)$ are summarized in the following Theorem 2.1, where a phase transition phenomenon is observed. Also see Figure 1 for a graphical illustration.

Theorem 2.1. Under model (2.1) and Prior P1, we have:

(1). If $\alpha_0\ge 2p/(2q_0+1)$, then for any $\theta_0\in\ell_2(q_0)$, with probability approaching one, $\theta_0\in C_n(\alpha)$.
(2). If $0<\alpha_0<2p/(2q_0+1)$, then there exist uncountably many $\theta_0\in\ell_2(q_0)$ such that for each such $\theta_0$, with probability approaching one, $\theta_0\notin C_n(\alpha)$.

[Figure 1 about here: a graphical illustration of the phase transition of $C_n(\alpha)$. When $\alpha_0\ge 2p/(2q_0+1)$, the credible ball around the posterior mode $\hat\theta$ is large enough to capture the truth $\theta_0$; when $0<\alpha_0<2p/(2q_0+1)$, the ball can be too small and asymptotically excludes certain truths $\theta_0$.]

In particular, when $\alpha_0=2p/(2q_0+1)$, the radius of $C_n(\alpha)$ contracts at the optimal rate $n^{-q_0/(2q_0+1)}$, provided $p>q_0+1/2$. Note that, without using the tuning parameter approach, one cannot get this optimal rate of convergence for $p>q_0+1/2$; see [49]. Another interesting observation is the phase transition of the coverage property at $\alpha_0=2p/(2q_0+1)$. We note that in the more general models considered in Section 3, the calculations are so complex that the phase transition is hard to detect, although we conjecture its existence. A comprehensive study of this is beyond the scope of this paper.

Remark 2.1. Theorem 2.1 revisits Theorem 4.2 in [25] from a different perspective. Specifically, our assumption $p>q_0+1/2$ implies that $2p>q_0$. Under $2p>q_0$, Theorem 4.2 of [25] shows that the probability that the credible region covers the truth tends to either zero (along a sequence of truths) or one. Actually, [25] further show that if $2p\le q_0$ and $\alpha_0=2p/(2q_0+1)$, then there exists a sequence of truths along which the coverage probability tends to any level in $[0,1)$. Here we restrict to the case $p>q_0+1/2$ to ease the technical arguments. Indeed, our results later in this section benefit from this restriction. Although our Theorem 2.1 can be viewed as a special case of [25], our new tuning prior formulation greatly benefits the theoretical study of nonparametric regression models; see Section 3.

Even under the flexible tuning priors, Theorem 2.1 still cannot exactly match the credibility level with the frequentist coverage. The main reason is that the $L_2$-topology, upon which $C_n(\alpha)$ is built, is too strong to support the frequentist-matching property. This motivates us to design a weaker norm as follows; also see [10]. For any $\theta=\{\theta_i\}\in\ell_2$, define $\|\theta\|_\dagger^2=\sum_{i=1}^\infty d_i\theta_i^2$ for a given positive sequence $d_i$ satisfying $d_i\asymp\frac{1}{i(\log 2i)^\tau}$ for some $\tau>1$. Clearly, $\|\cdot\|_\dagger$ defines a norm weaker than the usual $\ell_2$-norm.
Define
$$C_n^\dagger(\alpha)=\Big\{\theta=\{\theta_i\}_{i=1}^\infty\in\ell_2(q_0):\ \|\theta-\hat\theta\|_\dagger\le\sqrt{c_\alpha/n}\Big\},\tag{2.4}$$
where $c_\alpha$ is the upper $(1-\alpha)$-quantile of the distribution of the random variable $\sum_{i=1}^\infty d_i\eta_i^2$, that is, $P(\sum_{i=1}^\infty d_i\eta_i^2\le c_\alpha)=1-\alpha$. The value of $c_\alpha$ can be easily simulated, and is not sensitive to the choice of $\tau$ (we select $\tau=2$ in our simulations). It is easy to see that
$$P\big(\theta\in C_n^\dagger(\alpha)\,\big|\,\{Y_i\}_{i=1}^\infty\big)
=P\Big(n\sum_{i=1}^\infty d_i v_{ni}^2\eta_i^2\le c_\alpha\Big)
=P\Big(\sum_{i=1}^\infty\frac{d_i\eta_i^2}{1+\lambda i^{2p}}\le c_\alpha\Big)
\to P\Big(\sum_{i=1}^\infty d_i\eta_i^2\le c_\alpha\Big)=1-\alpha,$$
where the final limit follows from the fact that $\sum_{i=1}^\infty\frac{d_i\eta_i^2}{1+\lambda i^{2p}}$ converges in distribution to $\sum_{i=1}^\infty d_i\eta_i^2$ as $n\to\infty$ (by a moment generating function argument). That is, $C_n^\dagger(\alpha)$ asymptotically possesses $(1-\alpha)$ posterior coverage. Frequentist coverage properties of $C_n^\dagger(\alpha)$ are summarized below.
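The claim that $c_\alpha$ "can be easily simulated" can be made concrete by Monte Carlo: draw iid standard normals, form the truncated weighted sum $\sum_i d_i\eta_i^2$, and take its empirical $(1-\alpha)$-quantile. The sketch below is our own code; it assumes the particular weights $d_i=1/(i(\log 2i)^\tau)$ with $\tau=2$, and the truncation and replication counts are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo sketch for c_alpha: the (1 - alpha)-quantile of
# sum_i d_i * eta_i^2, with weights d_i = 1/(i * (log 2i)^tau), tau = 2.
# Truncation level and number of replications are our own choices.
rng = np.random.default_rng(1)
tau, alpha = 2.0, 0.05
N_TRUNC, N_REP = 1000, 5000

i = np.arange(1, N_TRUNC + 1, dtype=float)
d = 1.0 / (i * np.log(2.0 * i) ** tau)

eta2 = rng.standard_normal((N_REP, N_TRUNC)) ** 2  # iid chi^2_1 draws
samples = eta2 @ d                                 # realizations of sum_i d_i eta_i^2
c_alpha = float(np.quantile(samples, 1.0 - alpha))
print(c_alpha)
```

Since the weights are summable, the truncation error is negligible for moderate truncation levels, and the resulting $c_\alpha$ is stable across seeds.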

Theorem 2.2. Under model (2.1) and Prior P1, we have:

(1). If $\alpha_0\ge 2p/(2q_0+1)$, then for any $\theta_0\in\ell_2(q_0)$, with probability approaching $1-\alpha$, $\theta_0\in C_n^\dagger(\alpha)$.
(2). If $0<\alpha_0<2p/(2q_0+1)$, then there exist uncountably many $\theta_0\in\ell_2(q_0)$ such that for each such $\theta_0$, with probability approaching one, $\theta_0\notin C_n^\dagger(\alpha)$.

Given the same choice of $\alpha_0$ as in Theorem 2.1, Part (1) of Theorem 2.2 guarantees the frequentist-matching property of $C_n^\dagger(\alpha)$ under a weaker norm. We also remark that $C_n^\dagger(\alpha)$ relies only on $(\hat\theta,c_\alpha)$, which are both easy to simulate; see Section 6. In contrast, the credible region constructed by [10] may require calculation of certain nontrivial quantities; see (14) therein.

In the end, we consider a set of adaptive credible sets that do not rely on the unknown regularity $q_0$; see Section S.3 for more details. The main idea is to replace $q_0$ by a proper estimator $\hat q$ satisfying $\hat q-q_0=o_P(1/\log n)$; see Proposition S.1 and Remark S.1 for the construction of $\hat q$. The $(1-\alpha)$ adaptive credible sets given in (S.2) and (S.3) are constructed by replacing $(p,\lambda,h)$ in (2.3) and (2.4) by $p=\hat q+1$, $\hat\lambda=n^{-2p/(2\hat q+1)}$ and $\hat h=n^{-1/(2\hat q+1)}$; see Theorem S.1 for their theoretical justification. Note that the radii of (S.2) and (S.3) do not involve any unknown constants. Hence, we claim that (S.2) and (S.3) are more practically feasible to construct than the region proposed in [40] with a "blow-up" constant; see [33] for more discussions.

2.2. Tuning Prior II. In this section, we assign a more subtle Gaussian prior P2 that includes an additional parameter $q$, and derive results parallel to those of Section 2.1. A similar type of prior will be placed on the Fourier coefficients of the nonparametric regression functions considered in Section 3. The new tuning prior (motivated from [48]) is of the following form:

P2. $\theta_i\sim N(0,\tau_{*i}^{-2})$, where $\tau_{*i}^2=i^{2p}+n\lambda i^{2q}$, $\lambda=n^{-\alpha_0}$, $\alpha_0>0$, and $p,q$ satisfy $1/2<q\le q_0<2q$ and $q+1/2<p<2q+1/2$.

By direct calculations, the posterior distribution of $\theta_i$ is
$$\theta_i\,|\,Y_i\sim N(\hat\theta_{*ni},v_{*ni}^2),\qquad i=1,2,\ldots,$$
where $\hat\theta_{*ni}=Y_i/(1+\lambda i^{2q}+n^{-1}i^{2p})$ and $v_{*ni}^2=\{n(1+\lambda i^{2q}+n^{-1}i^{2p})\}^{-1}$. Denote by $\hat\theta_*=\{\hat\theta_{*ni}\}_{i=1}^\infty$ the posterior mode of $\theta$, and let $\alpha_1=\min\{1/(2p),\alpha_0/(2q)\}$. By a similar analysis as in Section 2.1, we have
$$n^{1-\alpha_1}\|\theta-\hat\theta_*\|_2^2=E_{*n}+n^{-\alpha_1/2}D_{*n}^{1/2}Z_{*n},$$


where
$$D_{*n}=\sum_{i=1}^\infty\frac{2n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2},\tag{2.5}$$
$$E_{*n}=\sum_{i=1}^\infty\frac{n^{-\alpha_1}}{1+\lambda i^{2q}+n^{-1}i^{2p}},\tag{2.6}$$
$$Z_{*n}=D_{*n}^{-1/2}\sum_{i=1}^\infty\frac{n^{-\alpha_1/2}(\eta_i^2-1)}{1+\lambda i^{2q}+n^{-1}i^{2p}},\tag{2.7}$$
and $\eta_i\overset{iid}{\sim}N(0,1)$ independent of the $Y_i$. Considering Lemma S.2, which gives $Z_{*n}\overset{d}{\to}N(0,1)$, we construct an asymptotic $(1-\alpha)$ credible region
$$C_{*n}(\alpha)=\Big\{\theta=\{\theta_i\}_{i=1}^\infty\in\ell_2(q_0):\ \|\theta-\hat\theta_*\|_2\le\sqrt{\frac{E_{*n}+\sqrt{D_{*n}n^{-\alpha_1}}\,z_\alpha}{n^{1-\alpha_1}}}\Big\},\tag{2.8}$$
where $E_{*n}$ and $D_{*n}$ converge to constants specified in Lemma S.1. For example, when $p=3$, $q=2$, $\alpha_0=2/3$, we have $D_{*n}\approx1.467$ and $E_{*n}\approx0.919$. The radius of $C_{*n}(\alpha)$ decreases at the rate $n^{-\frac{1-\alpha_1}{2}}$. Our next result demonstrates that the frequentist behavior of the credible set $C_{*n}(\alpha)$ at any level $\alpha\in(0,1)$ depends on the relation between the regularity of the true parameter, i.e., $q_0$, and the prior specifications, i.e., $(p,q,\alpha_0)$. Again, to obtain a frequentist-matching credible region, we need to employ a norm weaker than the $\ell_2$-norm:
$$C_{*n}^\dagger(\alpha)=\big\{\theta=\{\theta_i\}_{i=1}^\infty\in\ell_2(q):\ \|\theta-\hat\theta_*\|_\dagger\le\sqrt{c_\alpha/n}\big\},\tag{2.9}$$
and assume the additional technical condition $4q>2q_0+1$.

Theorem 2.3. Suppose model (2.1) holds with Prior P2.

(1). For $C_{*n}(\alpha)$, the following results hold:
(a) If $p-1/2\le q_0<2q$ and $\alpha_0\ge 2q/(2q_0+1)$, then for any $\theta_0\in\ell_2(q_0)$, with probability approaching one, $\theta_0\in C_{*n}(\alpha)$.
(b) If $q\le q_0<p-1/2$ or $0<\alpha_0<2q/(2q_0+1)$, then there are uncountably many $\theta_0\in\ell_2(q_0)$ such that for each such $\theta_0$, with probability approaching one, $\theta_0\notin C_{*n}(\alpha)$.
(2). Furthermore, suppose $4q>2q_0+1$. For $C_{*n}^\dagger(\alpha)$, the following results hold:


(a) If $p-1/2\le q_0<2q$ and $\alpha_0\ge 2q/(2q_0+1)$, then for any $\theta_0\in\ell_2(q_0)$, with probability approaching $1-\alpha$, $\theta_0\in C_{*n}^\dagger(\alpha)$.
(b) If $q\le q_0<p-1/2$ or $0<\alpha_0<2q/(2q_0+1)$, then there are uncountably many $\theta_0\in\ell_2(q_0)$ such that for each such $\theta_0$, with probability approaching one, $\theta_0\notin C_{*n}^\dagger(\alpha)$.

Besides frequentist validity, another important consequence of Theorem 2.3 is that $C_{*n}(\alpha)$ achieves the optimal contraction rate in the sense of [25] when $\lambda$ is carefully chosen. Specifically, $C_{*n}(\alpha)$ possesses the contraction rate $n^{-q_0/(2q_0+1)}$ when $p-1/2\le q_0<2q$ and $\alpha_0=2q/(2q_0+1)$. Moreover, to yield a valid region, we only need the weakest possible condition on the true regularity, i.e., $q_0\ge p-1/2$. Adaptive credible regions under P2 are constructed in Section S.3 of the supplementary document.

3. Nonparametric Regression Model. In this section, we present a general class of nonparametric regression models to which the nonparametric Bayesian method is applied. Specifically, we place a prior of type P2 on the Fourier coefficients of the unknown regression function, and further prove that the resulting posterior distribution turns out to be closely related to the penalized likelihood in the smoothing spline literature; see (3.8). This is the first main result of this paper.

Let $p_f(y|x)$ be the conditional probability density function of $Y$ given $X=x$ under an infinite-dimensional parameter $f$:
$$Y\,|\,f,X=x\sim p_f(y|x),\tag{3.1}$$
where $(y,x)\in\mathcal Y\times\mathbb I$ and $\mathbb I:=[0,1]$. We assume that there exists a "true" parameter $f_0$ under which the sample is drawn, and that $f_0$ belongs to an $m$-th order Sobolev space:
$$H^m(\mathbb I)=\{f\in L^2(\mathbb I)\,:\,f^{(j)}\text{ absolutely continuous for }j=0,1,\ldots,m-1,\text{ and }f^{(m)}\in L^2(\mathbb I)\}.$$
Throughout the paper, we let $m>1/2$ so that $H^m(\mathbb I)$ is a reproducing kernel Hilbert space (RKHS). We impose some regularity assumptions on $\ell(y;f(x))\equiv\log p_f(y|x)$. Denote the first-, second-, and third-order derivatives of $\ell(y;a)$ w.r.t. $a$ by $\dot\ell_a(y;a)$, $\ddot\ell_a(y;a)$, and $\dddot\ell_a(y;a)$, respectively.

Assumption A1. (a) $\ell(y;a)$ is three times continuously differentiable and strictly concave w.r.t. $a$. There exist positive constants $C_0$ and $C_1$


s.t.
$$E_{f_0}\Big\{\exp\Big(\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|/C_0\Big)\,\Big|\,X\Big\}\le C_1,\ \text{a.s.},\tag{3.2}$$
$$E_{f_0}\Big\{\exp\Big(\sup_{a\in\mathbb R}|\dddot\ell_a(Y;a)|/C_0\Big)\,\Big|\,X\Big\}\le C_1,\ \text{a.s.}\tag{3.3}$$
(b) There exists a constant $C_2>0$ such that $C_2^{-1}\le B(X)\equiv-E_{f_0}\{\ddot\ell_a(Y;f_0(X))\,|\,X\}\le C_2$, a.s.
(c) $\epsilon\equiv\dot\ell_a(Y;f_0(X))$ satisfies $E_{f_0}\{\epsilon|X\}=0$, $E_{f_0}\{\epsilon^2|X\}=B(X)$, a.s., and there exists a constant $M$ such that $E_{f_0}\{\epsilon^4|X\}\le M$, a.s.

We present three common examples that satisfy Assumption A1.

Example 3.1. (Gaussian model) In Gaussian regression, $Y=f(X)+\epsilon$ with $\epsilon|X\sim N(0,1)$. It is easy to see that $p_f(y|x)=(\sqrt{2\pi})^{-1}\exp(-(y-f(x))^2/2)$, and hence $\ell(y;a)=-\log\sqrt{2\pi}-(y-a)^2/2$. We can easily verify Assumption A1 with $B(X)=1$.

Example 3.2. (Logistic model) In logistic regression, where $P_f(Y=1|X)=1-P_f(Y=0|X)=\exp(f(X))/(1+\exp(f(X)))$, it is easy to see that $\ell(y;a)=ay-\log(1+\exp(a))$ and $B(X)=\exp(f_0(X))/(1+\exp(f_0(X)))^2$. Assumption A1 (a) follows from the boundedness of $\ddot\ell_a$ and $\dddot\ell_a$. Assumption A1 (b) follows from the fact that $B(X)$ is almost surely bounded away from zero and infinity, since $f_0$ is supported on a compact interval. Assumption A1 (c) can be verified by direct calculation.

Example 3.3. (Exponential family) Suppose $(Y,X)$ follows the one-parameter exponential family $Y|f,X\sim\exp\{Yf(X)+A(Y)-G(f(X))\}$, where $A(\cdot)$ and $G(\cdot)$ are known. It is easy to see that $\ell(y;a)=ya+A(y)-G(a)$, and hence $\dot\ell_a(y;a)=y-\dot G(a)$, $\ddot\ell_a(y;a)=-\ddot G(a)$ and $\dddot\ell_a(y;a)=-\dddot G(a)$. We assume that $G$ has bounded $j$th-order derivatives on $\mathbb R$ for $j=2,3,4$, and that for any real number $a$ belonging to the range of $f_0$, $\ddot G(a)\ge\delta$ for some constant $\delta>0$. Clearly, both $\ddot\ell_a$ and $\dddot\ell_a$ are bounded, and hence Assumption A1 (a) holds. Furthermore, $B(X)=\ddot G(f_0(X))$ satisfies Assumption A1 (b). Since $\epsilon=Y-\dot G(f_0(X))$, it is easy to see that $E_{f_0}\{\epsilon|X\}=E_{f_0}\{Y|X\}-\dot G(f_0(X))=0$ and $E_{f_0}\{\epsilon^2|X\}=\mathrm{Var}_{f_0}(Y|X)=\ddot G(f_0(X))=B(X)$ (see [31]). Furthermore, $E_{f_0}\{\epsilon^4|X\}=G^{(4)}(f_0(X))+3\ddot G(f_0(X))^2$, which is almost surely bounded by a constant. Therefore, Assumption A1 (c) holds.


Consider the smoothing spline estimate $\hat f_{n,\lambda}$:
$$\hat f_{n,\lambda}=\arg\max_{f\in H^m(\mathbb I)}\ell_{n,\lambda}(f)
=\arg\max_{f\in H^m(\mathbb I)}\Big\{\frac1n\sum_{i=1}^n\big[\ell(Y_i;f(X_i))+\log\pi(X_i)\big]-(\lambda/2)J(f,f)\Big\},\tag{3.4}$$
where $J(f,\tilde f)=\int_0^1 f^{(m)}(x)\tilde f^{(m)}(x)\,dx$ for any $f,\tilde f\in H^m(\mathbb I)$, $\pi(x)$ denotes the marginal density of $X$, satisfying $0<\inf_{x\in\mathbb I}\pi(x)\le\sup_{x\in\mathbb I}\pi(x)<\infty$, and $\lambda\to0$ as $n\to\infty$. Define $V(f,\tilde f)=E_{f_0}\{B(X)f(X)\tilde f(X)\}$ and $\langle f,\tilde f\rangle=V(f,\tilde f)+\lambda J(f,\tilde f)$. It follows from [38] that $\langle\cdot,\cdot\rangle$ well defines an inner product on $H^m(\mathbb I)$. Let $\|\cdot\|$ be the corresponding norm, and $K$ the corresponding reproducing kernel function. Let $W_\lambda$ be a self-adjoint bounded linear operator from $H^m(\mathbb I)$ to itself satisfying $\langle W_\lambda f,\tilde f\rangle=\lambda J(f,\tilde f)$. For simplicity, we also write $V(f)=V(f,f)$ and $J(f)=J(f,f)$.

Assumption A2. There exists a sequence of eigenfunctions $\varphi_\nu\in H^m(\mathbb I)$ satisfying $\sup_{\nu\in\mathbb N}\|\varphi_\nu\|_{\sup}<\infty$, and a nondecreasing sequence of eigenvalues $\rho_\nu\asymp\nu^{2m}$, such that
$$V(\varphi_\mu,\varphi_\nu)=\delta_{\mu\nu},\qquad J(\varphi_\mu,\varphi_\nu)=\rho_\mu\delta_{\mu\nu},\qquad\mu,\nu\in\mathbb N,\tag{3.5}$$
where $\delta_{\mu\nu}$ is Kronecker's delta. In particular, any $f\in H^m(\mathbb I)$ admits a Fourier expansion $f=\sum_\nu V(f,\varphi_\nu)\varphi_\nu$ with convergence in the $\|\cdot\|$-norm. Moreover, the first $m$ eigenvalues are all zero, and $\rho_\nu>0$ for $\nu>m$.

We refer to [38, Proposition 2.2] for the validity of Assumption A2. In particular, Assumption A2 yields equivalent series representations of $\|f\|^2$, $K$ and $W_\lambda$, as follows.

Proposition 3.1. Under Assumption A2, for any $f\in H^m(\mathbb I)$ and $z\in\mathbb I$, we have
$$\|f\|^2=\sum_\nu|V(f,\varphi_\nu)|^2(1+\lambda\rho_\nu),\qquad
K_z(\cdot)=\sum_\nu\frac{\varphi_\nu(z)}{1+\lambda\rho_\nu}\varphi_\nu(\cdot),\qquad
W_\lambda\varphi_\nu(\cdot)=\frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu(\cdot).$$
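Under Assumption A2, the penalized criterion (3.4) becomes a ridge-type problem in the eigenbasis. The following is a minimal sketch of this for the Gaussian model of Example 3.1, using the cosine basis with $\rho_\nu=\nu^{2m}$ as illustrative stand-ins for the exact eigenpairs (which in the paper are characterized through an ODE problem); all names and tuning choices are ours.

```python
import numpy as np

# Sketch: penalized least squares version of (3.4) for Gaussian regression,
# fitted in a truncated cosine basis. The basis and rho_nu = nu^{2m} are
# illustrative stand-ins for the eigenpairs of Assumption A2.
rng = np.random.default_rng(2)
n, m, N_BASIS = 300, 2, 50
lam = float(n) ** (-2.0 * m / (2.0 * m + 1.0))  # standard smoothing rate

x = rng.uniform(size=n)
f0 = lambda t: np.sin(2.0 * np.pi * t)          # smooth illustrative truth
y = f0(x) + rng.standard_normal(n)

nu = np.arange(1, N_BASIS + 1, dtype=float)
Phi = np.sqrt(2.0) * np.cos(np.pi * nu[None, :] * x[:, None])
rho = nu ** (2.0 * m)

# Minimize (1/n) sum_i (y_i - f(x_i))^2 + lam * sum_nu rho_nu f_nu^2:
# solve (Phi'Phi/n + lam * diag(rho)) c = Phi'y / n.
A = Phi.T @ Phi / n + lam * np.diag(rho)
c = np.linalg.solve(A, Phi.T @ y / n)

grid = np.linspace(0.0, 1.0, 201)
fhat = np.sqrt(2.0) * np.cos(np.pi * nu[None, :] * grid[:, None]) @ c
rmse = float(np.sqrt(np.mean((fhat - f0(grid)) ** 2)))
print(rmse)
```

The same normal-equations structure is what makes GCV selection of $\lambda$ computationally cheap in the smoothing spline literature.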

We next assign a Gaussian process (GP) prior to $f$, and point out its relation to prior P2. Let $\{v_\nu\}_{\nu=1}^\infty$ be a sequence of independent random variables satisfying $v_\nu\sim\pi_\nu$ for $\nu=1,\ldots,m$, where $\pi_1,\ldots,\pi_m$ are probability distributions, and $v_\nu\sim N(0,\rho_\nu^{-(1+\beta/(2m))})$ for $\nu>m$. Here, the prior parameter $\beta$ measures the regularity difference between the prior and the parameter space $H^m(\mathbb I)$. The random variables


$v_\nu$'s are assumed to be independent of the samples. The Gaussian process (GP) prior on $f$ is given as follows:
$$G_\lambda(t)=\sum_{\nu=1}^\infty w_\nu\varphi_\nu(t),\tag{3.6}$$
where $w_\nu=v_\nu$ for $\nu=1,\ldots,m$, and $w_\nu=(1+n\lambda\rho_\nu^{-\beta/(2m)})^{-1/2}v_\nu$ for $\nu>m$. For simplicity, we assume that $\pi_j$ is $N(0,\sigma_j^2)$ for $j=1,\ldots,m$ in the above GP prior, where the $\sigma_j^2$'s are fixed constants. Our results may also apply to a general class of $\pi_1,\ldots,\pi_m$. When $\nu>m$, it is easy to see that $w_\nu\sim N(0,(\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu)^{-1})$, and the inverse prior variance satisfies $\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu\asymp\nu^{2m+\beta}+n\lambda\nu^{2m}$. This is exactly a prior of type P2. Clearly, the sample path of $G_\lambda$ belongs to $H^m(\mathbb I)$ a.s. for any $\beta\in(1,2m+1)$, due to the trivial observation $E\{J(G_\lambda,G_\lambda)\}=\sum_{\nu>m}\rho_\nu/(\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu)<\infty$. Note that $\beta$ and $m$ jointly determine the smoothness of the RKHS induced by $G_\lambda$; see [44]. Let $\Pi_\lambda$ be the probability measure induced by $G_\lambda$, i.e., $\Pi_\lambda(B)=P(G_\lambda\in B)$. The posterior distribution of $f$ can be written as
$$P(f|D_n)\propto\exp\Big(\sum_{i=1}^n\ell(Y_i;f(X_i))\Big)\,d\Pi_\lambda(f),\tag{3.7}$$
where $D_n\equiv\{Z_1,\ldots,Z_n\}$ denotes the full sample, and $Z_i=(Y_i,X_i)$, $i=1,\ldots,n$, are iid copies of $Z=(Y,X)$. By the key Lemma 3.2 below, we can link the above posterior distribution with the penalized likelihood $\ell_{n,\lambda}$:
$$P(f\in S|D_n)=\frac{\int_S\exp(n\ell_{n,\lambda}(f))\,d\Pi(f)}{\int_{H^m(\mathbb I)}\exp(n\ell_{n,\lambda}(f))\,d\Pi(f)},\tag{3.8}$$
for any $\Pi$-measurable subset $S\subseteq H^m(\mathbb I)$, where $\Pi$ is the probability measure induced by the following GP $G$:
$$G(t)=\sum_{\nu=1}^\infty v_\nu\varphi_\nu(t).\tag{3.9}$$

Similarly, we can check that the sample path of $G$ also belongs to $H^m(\mathbb I)$ a.s. for any $\beta\in(1,2m+1)$. We remark that the important duality between (3.7) and (3.8) holds universally, irrespective of Assumption A1. The following lemma is established based on Hájek [22], and demonstrates the relationship between $\Pi$ and $\Pi_\lambda$.
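To make the prior concrete: a draw of $G_\lambda$ in (3.6) only requires independent Gaussian coefficients, with variances $(\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu)^{-1}$ for $\nu>m$. A truncated sketch follows, again with the cosine basis and $\rho_\nu=\nu^{2m}$ as illustrative stand-ins for the eigenpairs of Assumption A2 (in particular we take $\sigma_j=1$ for the first $m$ coordinates); these are our assumptions, not the paper's exact construction.

```python
import numpy as np

# Sketch: one truncated draw from the GP prior G_lambda of (3.6).
# Cosine basis and rho_nu = nu^{2m} are illustrative stand-ins for the
# eigenpairs of Assumption A2; sigma_j = 1 for the first m coordinates.
rng = np.random.default_rng(3)
n, m, beta, N_TRUNC = 500, 2, 2.0, 200
h = float(n) ** (-1.0 / (2.0 * m + beta))  # the rate h ~ n^{-1/(2m+beta)}
lam = h ** (2.0 * m)

nu = np.arange(1, N_TRUNC + 1, dtype=float)
rho = nu ** (2.0 * m)

sd = np.empty(N_TRUNC)
sd[:m] = 1.0  # sigma_j = 1 for nu = 1, ..., m
sd[m:] = (rho[m:] ** (1.0 + beta / (2.0 * m)) + n * lam * rho[m:]) ** -0.5

w = sd * rng.standard_normal(N_TRUNC)
t = np.linspace(0.0, 1.0, 101)
G_lam = np.sqrt(2.0) * np.cos(np.pi * nu[None, :] * t[:, None]) @ w

# E{J(G_lambda, G_lambda)} = sum_{nu>m} rho_nu * sd_nu^2 is finite here,
# consistent with sample paths lying in H^m(I) a.s. for beta in (1, 2m+1).
EJ = float(np.sum(rho[m:] * sd[m:] ** 2))
print(G_lam.shape, EJ)
```

Repeating the draw shows visually how larger $\beta$ produces smoother sample paths, matching the role of $\beta$ described above.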


Lemma 3.2. For $f\in H^m(\mathbb I)$, the Radon-Nikodym derivative of $\Pi_\lambda$ with respect to $\Pi$ is
$$\frac{d\Pi_\lambda}{d\Pi}(f)=\prod_{\nu=m+1}^\infty\big(1+n\lambda\rho_\nu^{-\beta/(2m)}\big)^{1/2}\cdot\exp\Big(-\frac{n\lambda}{2}J(f,f)\Big).$$
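As a sanity check (our own verification, not part of the paper's proof), the formula in Lemma 3.2 follows coordinatewise: for $\nu>m$ the two priors put Gaussian laws on the coefficient $f_\nu=V(f,\varphi_\nu)$ with variances $(\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu)^{-1}$ under $\Pi_\lambda$ and $\rho_\nu^{-(1+\beta/(2m))}$ under $\Pi$, so the ratio of the coordinate densities is

```latex
\frac{d\Pi_\lambda}{d\Pi}(f)
= \prod_{\nu>m}
  \sqrt{\frac{\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu}{\rho_\nu^{1+\beta/(2m)}}}\;
  \exp\!\Big(-\tfrac12 f_\nu^2\big[(\rho_\nu^{1+\beta/(2m)}+n\lambda\rho_\nu)
       -\rho_\nu^{1+\beta/(2m)}\big]\Big)
= \prod_{\nu=m+1}^{\infty}\big(1+n\lambda\rho_\nu^{-\beta/(2m)}\big)^{1/2}
  \exp\!\Big(-\frac{n\lambda}{2}\sum_{\nu>m}\rho_\nu f_\nu^2\Big),
```

and $\sum_{\nu>m}\rho_\nu f_\nu^2=J(f,f)$ by Assumption A2; the first $m$ coordinates have $\rho_\nu=0$ and identical priors under $\Pi_\lambda$ and $\Pi$, so they contribute a factor of one.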

4. Rate of Posterior Contraction under Sobolev Norm. In this section, we derive posterior contraction rates under a type of Sobolev norm, given the GP prior (3.6). This Sobolev-type norm is stronger than commonly used norms such as the Kullback-Leibler norm and the supremum norm. Our result in this section is an intermediate step in developing the general nonparametric BvM theorem, and is also of independent interest. We first assume the following posterior consistency condition.

Assumption A3. For any $b>0$, as $n\to\infty$, $P(\|f-f_0\|_{\sup}>b\,|\,D_n)\to0$ in $P_{f_0}^n$-probability.

A similar assumption was also used in [19] for deriving posterior contraction rates in terms of the supremum norm. We point out that this assumption can be removed in two situations: (1) exponential family models, such as the Gaussian regression model; (2) logistic regression with the null space removed.

We first employ the uniformly consistent test arguments ([21]) to obtain an initial rate of posterior contraction. The Cameron-Martin theorem ([7]) together with a Gaussian correlation inequality ([30]) are also used in the proof of Proposition 4.1. With a slight abuse of notation, we define $h=\lambda^{1/(2m)}$.

Proposition 4.1. (An initial rate of posterior contraction) Suppose Assumptions A1–A3 hold, and $f_0(\cdot)=\sum_{\nu=1}^\infty f_\nu^0\varphi_\nu(\cdot)$ satisfies Condition (S): $\sum_{\nu=1}^\infty|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}}<\infty$. Then for any $\delta_n\to0$ satisfying $\delta_n\ge r_n:=(nh)^{-1/2}+\lambda^{1/2}$ and $\delta_n=o(h^{1/2})$, we have $P(f\in H^m(\mathbb I):\|f-f_0\|>M\delta_n\,|\,D_n)\to0$ in $P_{f_0}^n$-probability for sufficiently large $M$.

We comment on the implications of Condition (S). By Assumption A2, i.e., $\rho_\nu\asymp\nu^{2m}$, we have $\sum_{\nu=1}^\infty|f_\nu^0|^2\nu^{2m+\beta-1}<\infty$. This indicates that $\{f_\nu^0\}\in\ell_2(m+(\beta-1)/2)$. Meanwhile, the GP prior (3.6) implies $f_\nu\sim N(0,(\nu^{2m+\beta}+n\lambda\nu^{2m})^{-1})$ for $\nu>m$. Then, using the notation of Section 2.2, we have $p=m+\beta/2$, $q=m$, and $q_0=p-1/2$. In particular, $q_0=p-1/2$ satisfies the weakest possible condition on the regularity of $f_0$ for obtaining a valid credible set (see Parts (1a) and (2a) of Theorem 2.3).


The initial contraction rate in Proposition 4.1 is sub-optimal, but it helps us obtain the optimal one in Theorem 4.2 below. Define the optimal rate $\tilde r_n = (nh)^{-1/2} + h^{m+\frac{\beta-1}{2}}$. This rate optimality follows from the smoothness of the true function in Condition (S): $f_0 \in H^{m+(\beta-1)/2}(I)$. Let $a_n$ be defined as in Lemma S.4 of the online supplement.

Theorem 4.2. (An optimal rate of posterior contraction) Suppose Assumptions A1--A3 hold, and $f_0 = \sum_{\nu=1}^\infty f_\nu^0 \varphi_\nu$ satisfies Condition (S). Furthermore, suppose the following set of rate conditions (Condition (R)) holds:
(i) $h = o(1)$, $\log h^{-1} = O(\log n)$ and $nh^2 \to \infty$;
(ii) $r_n = o(h^{1/2})$;
(iii) $a_n \log n = o(\tilde r_n)$;
(iv) $b_{n1} := n^{-1/2}\tilde r_n h^{-\frac{8m-1}{4m}}(\log\log n)^{1/2}\log n + h^{-1/2}\tilde r_n \log n = o(1)$;
(v) $b_{n2} := n^{-1/2} h^{-2} \log n\,(\log\log n)^{1/2} = o(1)$;
(vi) $b_{n3} := r_n^3 n^{-1/2} h^{-\frac{8m-1}{4m}}(\log n)(\log\log n)^{1/2} + r_n^3 h^{-1/2}\log n = o(\tilde r_n^2)$;
(vii) $b_{n4} := n^{-1/2} h^{-\frac{6m-1}{4m}} r_n^2 (\log n)(\log\log n)^{1/2} = o(\tilde r_n^2)$.

Then for any $\varepsilon, \delta > 0$, there exist positive constants $M = M(\varepsilon,\delta)$ and $N = N(\varepsilon,\delta)$ such that for any $n \geq N$,
$$P_{f_0}^n\big(P(f : \|f - f_0\| \geq M\tilde r_n \mid D_n) > \delta\big) < \varepsilon.$$
Besides the probabilistic tools used in the proof of Proposition 4.1, we also apply the functional Bahadur representation and a set of empirical process tools recently developed in [38] in the proof of Theorem 4.2.

Remark 4.1. It can be shown that, when $m > 3/2$ and $1 < \beta < m + \frac{1}{2}$, Condition (R) holds for $h \asymp h^* \equiv n^{-\frac{1}{2m+\beta}}$. This leads to the contraction rate $n^{-\frac{2m+\beta-1}{2(2m+\beta)}}$. According to Condition (S), $f_0$ belongs to the higher-order Sobolev space $H^{m+(\beta-1)/2}(I)$. Hence, the above contraction rate turns out to be optimal according to [45].

5. Nonparametric Bernstein-von Mises Theorem. In this section, we show that the BvM theorem holds in the general nonparametric regression setup established in Section 3. To the best of our knowledge, our BvM theorem is the first nonparametric one based on the total variation distance. Such a result is crucial for constructing credible regions under both strong and weak topologies, and credible intervals for linear functionals, as illustrated in Sections 5.2 -- 5.4. In contrast, the BvM theorems established by [10, 11] are in terms of a weaker metric: the bounded Lipschitz metric. As in Section 2, the


tuning hyperparameter $\lambda$ in the GP prior $G^\lambda$ controls the frequentist performance of a credible region/interval. At the end of Section 5.1, we discuss how to employ generalized cross validation (GCV) to select a proper tuning prior, i.e., $\lambda$, by taking advantage of the duality between nonparametric Bayesian and smoothing spline models.

5.1. Main Theorem. In this section, we prove a general nonparametric BvM theorem which says that the posterior measure can be well approximated by a probability measure $P_0$ defined in (5.1) below:
$$(5.1) \qquad P_0(B) = \frac{\int_B \exp(-\frac{n}{2}\|f - \hat f_{n,\lambda}\|^2)\,d\Pi(f)}{\int_{H^m(I)} \exp(-\frac{n}{2}\|f - \hat f_{n,\lambda}\|^2)\,d\Pi(f)}, \quad \text{for any } B \in \mathcal{B},$$
where $\mathcal{B}$ is a collection of $\Pi$-measurable sets in $H^m(I)$, i.e., $\mathcal{B}$ is a $\sigma$-algebra with respect to $\Pi$. In spite of its complicated appearance, $P_0$ turns out to be a Gaussian measure on $H^m(I)$ induced by a Gaussian process $W$ which has the closed form (5.3). This Gaussianity facilitates the applications of our BvM theorem, as will be seen in the subsequent sections.

Theorem 5.1. (Nonparametric BvM theorem) Suppose Assumptions A1--A3 hold, and $f_0 = \sum_{\nu=1}^\infty f_\nu^0 \varphi_\nu$ satisfies Condition (S). Furthermore, let Condition (R) be satisfied and let the $b_{n1}, b_{n2}$ therein further satisfy, as $n \to \infty$, $n\tilde r_n^2(b_{n1} + b_{n2}) = o(1)$. Then we have, as $n \to \infty$,
$$(5.2) \qquad \sup_{B \in \mathcal{B}} |P(B|D_n) - P_0(B)| = o_{P_{f_0}^n}(1).$$

We sketch the proof of Theorem 5.1 here. According to Theorem 4.2, the posterior mass is mostly concentrated on an $M\tilde r_n$-ball around $f_0$, denoted $B_{M\tilde r_n}(f_0)$. Hence, for any $B \in \mathcal{B}$, we decompose $P(B|D_n) = P(B \cap B_{M\tilde r_n}(f_0)|D_n) + P(B \cap B_{M\tilde r_n}^c(f_0)|D_n)$. As implied by Theorem 4.2, the second term is uniformly negligible over $B \in \mathcal{B}$. By applying a Taylor expansion to the penalized likelihood in (3.8) (in terms of Fréchet derivatives), we can show that the first term is asymptotically close to $P_0(B)$ uniformly over $B \in \mathcal{B}$, based on empirical process techniques.

The conditions in Theorem 5.1 are not restrictive. Indeed, by direct calculations, we can verify that $h \asymp h^* \equiv n^{-1/(2m+\beta)}$ satisfies Condition (R) and $n\tilde r_n^2(b_{n1} + b_{n2}) = o(1)$ when $m > 1 + \sqrt{3}/2 \approx 1.866$ and $1 < \beta < m + 1/2$. Therefore, (5.2) holds under $h \asymp h^*$. Interestingly, the optimal posterior contraction rate is simultaneously obtained in this case; see Remark 4.1.

We next show that $P_0$ is equivalent to a Gaussian measure $\Pi^W$, induced by a Gaussian process $W$ defined in (5.3). As will be seen later, the procedures


for Bayesian inference in Sections 5.2 -- 5.4 can be easily performed in terms of $W$. Define
$$a_\nu = \begin{cases} n(n + \sigma_\nu^{-2})^{-1}, & \nu = 1, \ldots, m, \\ n(1 + \lambda\rho_\nu)\big(n(1 + \lambda\rho_\nu) + \rho_\nu^{1+\beta/(2m)}\big)^{-1}, & \nu > m, \end{cases}$$
$$\xi_\nu = \begin{cases} (1 + n\sigma_\nu^2)^{-1/2}\, v_\nu, & \nu = 1, \ldots, m, \\ \big(1 + n\rho_\nu^{-(1+\beta/(2m))}(1 + \lambda\rho_\nu)\big)^{-1/2}\, v_\nu, & \nu > m, \end{cases}$$
where recall that the $\sigma_\nu^2$'s are the prior variances of the $v_\nu$'s assumed in (3.6). Let
$$(5.3) \qquad W(\cdot) = \sum_{\nu=1}^\infty (\xi_\nu + a_\nu \hat f_\nu)\varphi_\nu(\cdot),$$
where $\hat f_\nu = V(\hat f_{n,\lambda}, \varphi_\nu)$ for any $\nu \geq 1$. Define $\tilde f_{n,\lambda}(\cdot) = \sum_{\nu=1}^\infty a_\nu \hat f_\nu \varphi_\nu(\cdot)$. Clearly, $W = \tilde f_{n,\lambda} + \sum_\nu \xi_\nu \varphi_\nu$ is a Gaussian process with mean function $\tilde f_{n,\lambda}$. Note that $\tilde f_{n,\lambda} \neq \hat f_{n,\lambda}$, where $\hat f_{n,\lambda}$ is the smoothing spline estimate in (3.4). Given the data $D_n$, let $\Pi^W$ be the probability measure induced by $W$. Theorem 5.2 below essentially says that $P_0 = \Pi^W$. Together with Theorem 5.1, we conclude that the posterior distribution $P(\cdot|D_n)$ and $\Pi^W(\cdot)$ are asymptotically close under the total variation distance. This implies that the mean function $\tilde f_{n,\lambda}$ of $W$ is approximately the posterior mode of $P(\cdot|D_n)$. Therefore, $\tilde f_{n,\lambda}$ will be used as the center of the credible region to be constructed.

Theorem 5.2. For $f \in H^m(I)$, the Radon-Nikodym derivative of $\Pi^W$ with respect to $\Pi$ is
$$\frac{d\Pi^W}{d\Pi}(f) = \frac{\exp(-\frac{n}{2}\|f - \hat f_{n,\lambda}\|^2)}{\int_{H^m(I)} \exp(-\frac{n}{2}\|f - \hat f_{n,\lambda}\|^2)\,d\Pi(f)}.$$

Theorem 5.2 implies that
$$\frac{dP_0}{d\Pi}(f) = \frac{d\Pi^W}{d\Pi}(f).$$
At the end of this section, we point out an important consequence of the duality between nonparametric Bayesian models and smoothing spline models established here. Specifically, we can employ the well-developed GCV method to select a proper tuning prior, i.e., the value of $\lambda$ (equivalently, $h$), in practice. Let $h_{GCV}$ be the GCV-selected tuning parameter based on the penalized likelihood function; $h_{GCV}$ is known to achieve the approximate optimal rate $n^{-1/(2m+1)}$ (see [47]). As illustrated in Sections 5.2 -- 5.4,

$h$ should be chosen in the order of $n^{-1/(2m+\beta)}$ for obtaining nice frequentist properties of credible regions/intervals. Hence, we set $h = h_{GCV}^{(2m+1)/(2m+\beta)}$ when tuning the assigned GP prior. Numerical evidence in Section 6 strongly supports this novel method.

5.2. Credible Region in Strong Topology. As the first application of our main result, we consider the construction of an asymptotic credible region for $f$ in terms of the $\|\cdot\|$-norm and study its frequentist property in this section. Throughout Sections 5.2 -- 5.4, we suppose that $f_0$ satisfies Condition (S), and set $h \asymp h^* = n^{-1/(2m+\beta)}$ for simplicity. It will be shown that such a selection of $h$ yields credible regions with radius of the optimal order $n^{-\frac{2m+\beta-1}{2(2m+\beta)}}$.

We first re-write $W$ defined in (5.3) as $\tilde f_{n,\lambda} + Z$, where $Z(\cdot) = \sum_{\nu=1}^\infty \xi_\nu \varphi_\nu(\cdot)$, and then study the asymptotic behavior of $Z$. Assume for some constant $c_0 > 0$, $\rho_\nu/(c_0\nu)^{2m} \to 1$ as $\nu \to \infty$, and define
$$c_\nu = \begin{cases} (n + \sigma_\nu^{-2})^{-1}, & \nu = 1, \ldots, m, \\ \big(n(1 + \lambda\rho_\nu) + \rho_\nu^{1+\beta/(2m)}\big)^{-1}, & \nu > m. \end{cases}$$
For $j = 1, 2$, let
$$(5.4) \qquad \zeta_j = c_0 h \sum_{\nu=1}^\infty (n c_\nu)^j (1 + \lambda\rho_\nu), \qquad \tau_j^2 = 2 c_0 h \sum_{\nu=1}^\infty (n c_\nu)^{2j} (1 + \lambda\rho_\nu)^2.$$

Lemma 5.3. As $n \to \infty$, $(n\sqrt{c_0 h}\,\|Z\|^2 - (c_0 h)^{-1/2}\zeta_1)/\tau_1 \xrightarrow{d} N(0,1)$.
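The standardization in Lemma 5.3 can be checked by simulation. The sketch below is illustrative only: it assumes $m = 2$, $\beta = 1.5$, $\sigma_\nu^2 = 1$, $\rho_\nu = (c_0\nu)^{2m}$ with $c_0 = \pi$, the norm $\|f\|^2 = \sum_\nu (1+\lambda\rho_\nu)f_\nu^2$, and a truncated frequency range; none of these choices come from the paper's simulations. Since $\zeta_1$ and $\tau_1^2$ are exactly the mean and variance of $n\sqrt{c_0 h}\,\|Z\|^2$ (after the $(c_0 h)^{-1/2}$ centering), the standardized statistic has mean $0$ and variance $1$ at any $n$; normality itself holds only in the limit.

```python
import numpy as np

# Illustrative check of Lemma 5.3 (assumed inputs: m = 2, beta = 1.5,
# sigma_nu^2 = 1, rho_nu = (c0*nu)^{2m} with c0 = pi, 500 frequencies).
rng = np.random.default_rng(1)
m, beta, n, N = 2, 1.5, 10_000, 10_000
h = n ** (-1.0 / (2 * m + beta))
lam = h ** (2 * m)
c0 = np.pi

nu = np.arange(1, 501)
rho = (c0 * nu) ** (2 * m)
w = 1.0 + lam * rho                              # norm weights 1 + lambda*rho_nu
c_nu = np.where(nu <= m, 1.0 / (n + 1.0),        # (n + sigma_nu^{-2})^{-1}
                1.0 / (n * w + rho ** (1 + beta / (2 * m))))
zeta1 = c0 * h * np.sum(n * c_nu * w)                     # (5.4) with j = 1
tau1 = np.sqrt(2 * c0 * h * np.sum((n * c_nu * w) ** 2))  # (5.4) with j = 1

# xi_nu are independent N(0, c_nu); ||Z||^2 = sum_nu w_nu * xi_nu^2.
xi = rng.standard_normal((N, nu.size)) * np.sqrt(c_nu)
Z2 = (w * xi ** 2).sum(axis=1)
T = (n * np.sqrt(c0 * h) * Z2 - zeta1 / np.sqrt(c0 * h)) / tau1
# T is exactly standardized: sample mean near 0, sample variance near 1.
```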

We construct an asymptotic credible region with credibility level $(1-\alpha)$:
$$(5.5) \qquad R_n(\alpha) = \Big\{ f \in H^m(I) : \|f - \tilde f_{n,\lambda}\| \leq \sqrt{\frac{\zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha}{c_0 n h}} \Big\}.$$
The above $R_n(\alpha)$ is very similar to the credible region $C_n^*(\alpha)$ (defined in (2.8)) for the Gaussian sequence model under Prior P2. In particular, we note that $\zeta_1$ and $\tau_1^2$ in $R_n(\alpha)$ play roles similar to those of $E_n^*$ and $D_n^*$ in $C_n^*(\alpha)$. On the other hand, some direct calculations based on Lemma S.1 indicate that $\zeta_1$ ($\tau_1^2$) is slightly larger than $E_n^*$ ($D_n^*$). This is mainly due to the different norms used in constructing $C_n^*(\alpha)$ and $R_n(\alpha)$. The posterior coverage of


$R_n(\alpha)$ follows from Theorem 5.1 and Lemma 5.3:
$$\begin{aligned}
P(R_n(\alpha)|D_n) &= P_0(R_n(\alpha)) + o_{P_{f_0}^n}(1) = P(W \in R_n(\alpha)) + o_{P_{f_0}^n}(1) \\
&= P\Big(\|W - \tilde f_{n,\lambda}\| \leq \sqrt{\tfrac{\zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha}{c_0 n h}}\Big) + o_{P_{f_0}^n}(1) \\
&= P\Big(\|Z\| \leq \sqrt{\tfrac{\zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha}{c_0 n h}}\Big) + o_{P_{f_0}^n}(1) \\
&= P\big((n\sqrt{c_0 h}\,\|Z\|^2 - (c_0 h)^{-1/2}\zeta_1)/\tau_1 \leq z_\alpha\big) + o_{P_{f_0}^n}(1) \\
&= 1 - \alpha + o_{P_{f_0}^n}(1).
\end{aligned}$$

We next examine the frequentist property of $R_n(\alpha)$ by first studying the asymptotic property of its center, $\tilde f_{n,\lambda}$, in Lemma 5.4.

Lemma 5.4. Let Assumptions A1--A3 be satisfied, and let $h \asymp h^* = n^{-1/(2m+\beta)}$ for $m > 1 + \sqrt{3}/2$ and $1 < \beta < m + 1/2$. Then, as $n \to \infty$,
$$c_0 n h \|\tilde f_{n,\lambda} - f_0\|^2 = \zeta_2 + c_0 n h W^{(n)} + o_{P_{f_0}^n}(1),$$
where $W^{(n)}$ is a random variable satisfying $c_0^{1/2}\tau_2^{-1} n h^{1/2} W^{(n)} \xrightarrow{d} N(0,1)$.

Lemma 5.5. $\tau_1 \geq \tau_2$ and $\lim_{n\to\infty}(\zeta_1 - \zeta_2) > 0$.

By Lemma 5.5, $\zeta_1 - \zeta_2 > c$ for some positive constant $c > 0$. By Lemma 5.4, we have $c_0 n h W^{(n)} = o_{P_{f_0}^n}(1)$, and with $P_{f_0}^n$-probability approaching one, $c_0 n h \|\tilde f_{n,\lambda} - f_0\|^2 = \zeta_2 + c_0 n h W^{(n)} + o_{P_{f_0}^n}(1) \leq \zeta_1 + (c_0 h)^{1/2}\tau_1 z_\alpha$, which is exactly the event $f_0 \in R_n(\alpha)$. In view of the above discussion, Theorem 5.6 shows that, if $f_0$ satisfies Condition (S) and $h \asymp h^*$, then $R_n(\alpha)$ is valid in the frequentist sense and possesses the optimal contraction rate $n^{-\frac{2m+\beta-1}{2(2m+\beta)}}$.
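Lemma 5.5 can also be checked numerically for the truncated series (5.4). The sketch below reuses the illustrative assumptions from before ($m = 2$, $\beta = 1.5$, $\sigma_\nu^2 = 1$, $\rho_\nu = (\pi\nu)^{2m}$, all assumed rather than taken from the paper) and verifies $\zeta_1 > \zeta_2$ and $\tau_1 \geq \tau_2$ directly; both inequalities follow because $n c_\nu < 1$ for every $\nu$.

```python
import numpy as np

# Numerical check of Lemma 5.5 for the truncated series (5.4).
# Assumed values: m = 2, beta = 1.5, sigma_nu^2 = 1, rho_nu = (pi*nu)^{2m}.
m, beta, n, c0 = 2, 1.5, 10_000, np.pi
h = n ** (-1.0 / (2 * m + beta))
lam = h ** (2 * m)
nu = np.arange(1, 5001)
rho = (c0 * nu) ** (2 * m)
w = 1.0 + lam * rho
c_nu = np.where(nu <= m, 1.0 / (n + 1.0),
                1.0 / (n * w + rho ** (1 + beta / (2 * m))))
# zeta_j and tau_j from (5.4) for j = 1, 2; since n*c_nu < 1 for every nu,
# zeta_1 > zeta_2 and tau_1 > tau_2 hold term by term.
zeta = [c0 * h * np.sum((n * c_nu) ** j * w) for j in (1, 2)]
tau = [float(np.sqrt(2 * c0 * h * np.sum((n * c_nu) ** (2 * j) * w ** 2)))
       for j in (1, 2)]
```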

Theorem 5.6. Under the conditions of Lemma 5.4, we have for any $\alpha \in (0,1)$, $\lim_{n\to\infty} P_{f_0}^n(f_0 \in R_n(\alpha)) = 1$.

5.3. Credible Region in Weak Topology. In this section, we construct an asymptotic credible region using a weaker topology, such that the truth can be covered with probability approaching exactly $1-\alpha$. This echoes the use of a weaker norm in Section 2. As far as we know, these are the first frequentist-matching Bayesian credible sets in nonparametric regression


models. We apply the general BvM theorem in Section 5.1 and employ a strong approximation result ([41]) in this section.

For any $f \in H^m(I)$ with $f(\cdot) = \sum_{\nu=1}^\infty f_\nu \varphi_\nu(\cdot)$, define $\|f\|_\dagger^2 = \sum_{\nu=1}^\infty d_\nu f_\nu^2$, where $\{d_\nu\}$ is a given positive sequence satisfying $d_\nu \asymp \nu^{-1}(\log 2\nu)^{-\tau}$ for a constant $\tau > 1$. For instance, one can choose $d_1 = \cdots = d_m = 1$ and $d_\nu = \rho_\nu^{-1/(2m)}(\log(\rho_\nu + 1))^{-\tau}$ for $\nu > m$. Define
$$(5.6) \qquad R_n^\dagger(\alpha) = \big\{f \in H^m(I) : \|f - \tilde f_{n,\lambda}\|_\dagger \leq \sqrt{c_\alpha/n}\big\},$$
where $c_\alpha$ is defined in Section 2.1. We first check that $R_n^\dagger(\alpha)$ indeed covers $(1-\alpha)$ posterior mass. An analysis similar to that of the posterior coverage of $C_n^\dagger(\alpha)$ in Section 2.1 implies that $n\sum_\nu d_\nu \xi_\nu^2$ converges weakly to $\sum_\nu d_\nu \eta_\nu^2$, where $\eta_1, \eta_2, \ldots \overset{iid}{\sim} N(0,1)$. It then follows by Theorem 5.1 that
$$P(R_n^\dagger(\alpha)|D_n) = P_0(R_n^\dagger(\alpha)) + o_{P_{f_0}^n}(1) = P(n\|W - \tilde f_{n,\lambda}\|_\dagger^2 \leq c_\alpha) + o_{P_{f_0}^n}(1) = P\Big(n\sum_\nu d_\nu \xi_\nu^2 \leq c_\alpha\Big) + o_{P_{f_0}^n}(1) \to 1 - \alpha,$$
in $P_{f_0}^n$-probability.
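The threshold $c_\alpha$ can be approximated by Monte Carlo. The sketch below takes $c_\alpha$ to be the $(1-\alpha)$ quantile of the limit $\sum_\nu d_\nu \eta_\nu^2$, consistently with the display above (Section 2.1's exact definition is not repeated here, so this is an assumption of the illustration); the choices $\tau = 2$, $d_\nu = \nu^{-1}(\log 2\nu)^{-\tau}$ for every $\nu$, and a truncation at 2,000 terms are likewise illustrative.

```python
import numpy as np

# Monte Carlo sketch of the threshold c_alpha in (5.6), assumed to be the
# (1 - alpha) quantile of sum_nu d_nu * eta_nu^2 with eta_nu iid N(0,1).
# tau = 2 and the 2,000-term truncation are illustrative choices.
rng = np.random.default_rng(2)
tau, alpha = 2.0, 0.05
nu = np.arange(1, 2001)
d = 1.0 / (nu * np.log(2.0 * nu) ** tau)

reps, chunk = 20_000, 1_000
samples = np.concatenate([
    (rng.standard_normal((chunk, nu.size)) ** 2) @ d   # sum_nu d_nu eta_nu^2
    for _ in range(reps // chunk)
])
c_alpha = float(np.quantile(samples, 1 - alpha))
# The radius of R_n^dagger(alpha) is then sqrt(c_alpha / n).
```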

To study the frequentist coverage of the credible set $R_n^\dagger(\alpha)$, we need to assume that the basis functions $\varphi_\nu$ and the real sequence $\rho_\nu$ come from the following ordinary differential system:
$$(5.7) \qquad (-1)^m \varphi_\nu^{(2m)}(\cdot) = \rho_\nu B(\cdot)\pi(\cdot)\varphi_\nu(\cdot), \quad \varphi_\nu^{(j)}(0) = \varphi_\nu^{(j)}(1) = 0, \quad j = m, \ldots, 2m-1,$$
where recall that $\pi(\cdot)$ is the marginal density of the covariate $X$ and $B(\cdot)$ is the function defined in Assumption A1. Theorem 5.7 below proves that the credible region $R_n^\dagger(\alpha)$ constructed under the weak topology exactly matches the nominal frequentist coverage.

Theorem 5.7. Let $(\rho_\nu, \varphi_\nu)$ be the eigenvalues and eigenfunctions of the ODE system (5.7). Suppose that the conditions of Lemma 5.4 hold, and there exists a positive constant $C_1$ such that $E_{f_0}\{\exp(|\epsilon|/C_1)\} < \infty$. Then we have for any $\alpha \in (0,1)$, $\lim_{n\to\infty} P_{f_0}^n(f_0 \in R_n^\dagger(\alpha)) = 1 - \alpha$.

It should be noted that the scope of Theorem 5.7 covers a variety of nonparametric models, including Examples 3.1 -- 3.3. Again, we can choose a proper tuning prior in $R_n^\dagger(\alpha)$ via the generalized cross validation method.


5.4. Linear Functionals of the Regression Function. In practice, it is also of great interest to consider a linear functional of the regression function $f$. Typical examples include (i) the evaluation functional $F_z(f) = f(z)$, where $z \in I$ is a fixed point; and (ii) the integral functional $F_\omega(f) = \int_0^1 f(z)\omega(z)\,dz$, where $\omega(\cdot)$ is a known deterministic integrable function such as an indicator function. The major goal of this section is to construct asymptotic credible intervals for a general class of linear functionals. In spirit, some of the results in this section are similar to those implied by the parametric BvM theorem, in the sense that the constructed credible intervals may contract at the parametric rate $n^{-1/2}$. We note that linear functionals of densities were considered in [36], but without any frequentist justification.

Let $F : H^m(I) \mapsto \mathbb{R}$ be a linear $\Pi$-measurable functional, i.e., $F(af + bg) = aF(f) + bF(g)$ for any $a, b \in \mathbb{R}$ and $f, g \in H^m(I)$. We assume that $\sup_{\nu\geq 1} |F(\varphi_\nu)| < \infty$. This assumption holds for both $F_z$ and $F_\omega$ above due to the uniform boundedness of the $\varphi_\nu$. In addition, we assume that there exists a constant $\kappa > 0$ such that for any $f \in H^m(I)$,
$$(5.8) \qquad |F(f)| \leq \kappa h^{-1/2}\|f\|.$$
Lemma 5.8 below (given in [38]) implies that both $F_z$ and $F_\omega$ satisfy (5.8).

Lemma 5.8. There exists a universal constant $c > 0$ such that for any $f \in H^m(I)$, $\|f\|_{\sup} \leq c h^{-1/2}\|f\|$.

Define the following Bayesian credible interval for $F(f_0)$:
$$(5.9) \qquad CI_{n,\lambda}(\alpha) : F(\tilde f_{n,\lambda}) \pm z_{\alpha/2}\frac{t_n}{\sqrt{n}},$$
where
$$t_n^2 = \sum_{\nu=1}^m \frac{F(\varphi_\nu)^2}{1 + n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>m} \frac{F(\varphi_\nu)^2}{1 + \lambda\rho_\nu + n^{-1}\rho_\nu^{1+\beta/(2m)}}.$$
If $t_n$ converges to a positive constant, then $CI_{n,\lambda}(\alpha)$ contracts at the parametric rate $n^{-1/2}$; see (5.17) and (5.18). The posterior coverage of $CI_{n,\lambda}(\alpha)$ can be easily checked via Theorems 5.1 and 5.2 as follows. It follows by Theorem 5.1 that
$$|P(F(f) \in CI_{n,\lambda}(\alpha)|D_n) - P_0(F^{-1}(CI_{n,\lambda}(\alpha)))| \leq \sup_{B\in\mathcal{B}} |P(B|D_n) - P_0(B)| \to 0,$$
in $P_{f_0}^n$-probability. On the other hand, it follows by Theorem 5.2 that
$$P_0(F^{-1}(CI_{n,\lambda}(\alpha))) = P(W \in F^{-1}(CI_{n,\lambda}(\alpha))) = P(F(W) \in CI_{n,\lambda}(\alpha)) = P(-z_{\alpha/2}t_n/\sqrt{n} \leq F(Z) \leq z_{\alpha/2}t_n/\sqrt{n}) = 1 - \alpha,$$


where the last equality follows from the fact that $F(Z) = \sum_{\nu=1}^\infty \xi_\nu F(\varphi_\nu) \sim N(0, t_n^2/n)$, given that the $\xi_\nu$'s are independent zero-mean Gaussian variables.

Theorem 5.9 below shows that $CI_{n,\lambda}$ covers the true value $F(f_0)$ with probability asymptotically at least $1-\alpha$ for a general class of functionals $F$ satisfying (5.10). Under an additional condition on $F$, i.e., $0 < \sum_{\nu=1}^\infty F(\varphi_\nu)^2 < \infty$, we can even prove that the coverage is asymptotically exact.

Theorem 5.9. Let the assumptions in Theorem 5.1 be satisfied. Suppose $m > 1 + \sqrt{3}/2$, $1 < \beta < m + 1/2$, and $f_0$ satisfies Condition (S$^\dagger$): $\sum_{\nu=1}^\infty |f_\nu^0|^2 \rho_\nu^{1+\frac{\beta}{2m}} < \infty$. Furthermore, suppose there exists a constant $r \in [0,1]$ such that for $j = 1, 2$,
$$(5.10) \qquad \liminf_{n\to\infty}\; h^r\Big(\sum_{\nu=1}^m \frac{F(\varphi_\nu)^2}{(1 + n^{-1}\sigma_\nu^{-2})^j} + \sum_{\nu>m} \frac{F(\varphi_\nu)^2}{(1 + \lambda\rho_\nu + (\lambda\rho_\nu)^{1+\beta/(2m)})^j}\Big) > 0.$$
Then
$$(5.11) \qquad \liminf_{n\to\infty} P_{f_0}^n(F(f_0) \in CI_{n,\lambda}(\alpha)) \geq 1 - \alpha.$$
In particular, if $0 < \sum_{\nu=1}^\infty F(\varphi_\nu)^2 < \infty$ and (5.10) holds for $r = 0$, then $CI_{n,\lambda}(\alpha)$ asymptotically achieves $(1-\alpha)$ frequentist coverage, that is,
$$(5.12) \qquad \lim_{n\to\infty} P_{f_0}^n(F(f_0) \in CI_{n,\lambda}(\alpha)) = 1 - \alpha.$$
We remark that Condition (S$^\dagger$) is slightly stronger than Condition (S). We next verify that Condition (5.10) holds when $F$ is the evaluation functional or the integral functional, and model (3.1) is Gaussian, i.e., the setting of Example 3.1. The proof relies on a nice closed form of the $\varphi_\nu$ and a careful analysis of the trigonometric functions.

Proposition 5.10. Suppose $m = 2$, $X \sim \mathrm{Unif}[0,1]$, and $Y|f,X \sim N(f(X), 1)$.

(i) If $F = F_z$ for any $z \in (0,1)$, then (5.10) holds for $r = 1$;
(ii) If $F = F_\omega$ for any $\omega \in L^2(I)\setminus\{0\}$, then $0 < \sum_{\nu=1}^\infty F(\varphi_\nu)^2 < \infty$ and (5.10) holds for $r = 0$.

By carefully examining the proof of Theorem 5.9, we find that when $F = F_z$, the inequality (5.11) is actually strict. However, when $F = F_\omega$, $CI_{n,\lambda}(\alpha)$ covers the truth with probability approaching exactly $1 - \alpha$. Therefore, we observe a subtle difference between these two types of functionals, which can be empirically detected in the simulations; see Section 6.
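The exact-coverage case of Theorem 5.9 rests on $t_n^2 \to \sum_\nu F(\varphi_\nu)^2$ when that sum is finite. The sketch below illustrates this numerically under assumed inputs ($m = 2$, $\beta = 1.5$, $\sigma_\nu^2 = 1$, $\rho_\nu = (\pi\nu)^{2m}$, and hypothetical coefficients $F(\varphi_\nu) = 1/\nu$, none taken from the paper): the gap between the $t_n^2$ of (5.9) and its limit shrinks as $n$ grows.

```python
import math

# Illustration of t_n^2 -> sum_nu F(phi_nu)^2 (finite-sum case of Theorem 5.9).
# Assumed inputs: m = 2, beta = 1.5, sigma_nu^2 = 1, rho_nu = (pi*nu)^{2m},
# and hypothetical coefficients F(phi_nu) = 1/nu.
def t_squared(n, m=2, beta=1.5, n_terms=5000):
    h = n ** (-1.0 / (2 * m + beta))
    lam = h ** (2 * m)
    total = 0.0
    for nu in range(1, n_terms + 1):
        F2 = 1.0 / nu ** 2                       # F(phi_nu)^2 = 1/nu^2
        if nu <= m:
            total += F2 / (1.0 + 1.0 / n)        # sigma_nu^2 = 1
        else:
            rho = (math.pi * nu) ** (2 * m)
            total += F2 / (1.0 + lam * rho + rho ** (1 + beta / (2 * m)) / n)
    return total

limit = sum(1.0 / nu ** 2 for nu in range(1, 5001))      # ~ pi^2 / 6
gap_small_n = limit - t_squared(10 ** 2)
gap_large_n = limit - t_squared(10 ** 6)
# gap_large_n < gap_small_n: the width z * t_n / sqrt(n) stabilizes at the
# parametric rate as n grows.
```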


In the end, we give a concrete example of $CI_{n,\lambda}(\alpha)$. Under the setup of Proposition 5.10, it follows from [38] that $(\varphi_\nu, \rho_\nu)$ in Assumption A2 can be chosen as the eigensystem of the following uniform free beam problem:
$$(5.13) \qquad \varphi_\nu^{(4)}(\cdot) = \rho_\nu \varphi_\nu(\cdot), \quad \varphi_\nu^{(j)}(0) = \varphi_\nu^{(j)}(1) = 0, \quad j = 2, 3.$$
The eigenvalues satisfy $\lim_{\nu\to\infty} \rho_\nu/(\pi\nu)^4 = 1$; see [24, Problem 3.10]. The normalized solutions to (5.13) are
$$(5.14) \qquad \varphi_1(z) = 1, \quad \varphi_2(z) = \sqrt{3}(2z - 1),$$
$$(5.15) \qquad \varphi_{2k+1}(z) = \frac{\sin(\gamma_{2k+1}(z - 1/2))}{\sin(\gamma_{2k+1}/2)} + \frac{\sinh(\gamma_{2k+1}(z - 1/2))}{\sinh(\gamma_{2k+1}/2)}, \quad k \geq 1,$$
$$(5.16) \qquad \varphi_{2k+2}(z) = \frac{\cos(\gamma_{2k+2}(z - 1/2))}{\cos(\gamma_{2k+2}/2)} + \frac{\cosh(\gamma_{2k+2}(z - 1/2))}{\cosh(\gamma_{2k+2}/2)}, \quad k \geq 1,$$
where $\gamma_\nu = \rho_\nu^{1/4}$ satisfies $\cos(\gamma_\nu)\cosh(\gamma_\nu) = 1$; see [6, pages 295--296]. For $F = F_z$ with $z \in (0,1)$, we have
$$(5.17) \qquad CI_{n,\lambda} : \tilde f_{n,\lambda}(z) \pm z_{\alpha/2}\frac{t_n}{\sqrt{n}},$$
where $t_n^2 = \sum_{\nu=1}^2 \frac{\varphi_\nu(z)^2}{1+n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>2} \frac{\varphi_\nu(z)^2}{1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/4}}$, with $\rho_\nu \approx (\pi\nu)^4$ and $t_n^2 \asymp 1/h$. For $F = F_\omega$ with $\omega \in L^2(I)\setminus\{0\}$, we have
$$(5.18) \qquad CI_{n,\lambda} : \int_0^1 \tilde f_{n,\lambda}(z)\omega(z)\,dz \pm z_{\alpha/2}\frac{t_n}{\sqrt{n}},$$
where $t_n^2 = \sum_{\nu=1}^2 \frac{F_\omega(\varphi_\nu)^2}{1+n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>2} \frac{F_\omega(\varphi_\nu)^2}{1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/4}} \to \sum_{\nu=1}^\infty F_\omega(\varphi_\nu)^2 = \int_0^1 \omega(z)^2\,dz > 0$, as $n \to \infty$. The last equality follows from the trivial fact that $F_\omega(\varphi_\nu) = \int_0^1 \omega(z)\varphi_\nu(z)\,dz$ is the $\nu$-th Fourier coefficient of $\omega$. Theorem 5.9 implies that the frequentist coverage of (5.17) is strictly larger than $1-\alpha$, while (5.18) has an exact asymptotic frequentist coverage probability.

To use (5.17) and (5.18), we need explicit expressions for $(\rho_\nu, \varphi_\nu)$. However, explicit expressions for the eigenvalues $\rho_\nu$ are not available, and hence we cannot directly use formulas (5.14) -- (5.16) to find the eigenfunctions $\varphi_\nu$. To circumvent this problem, in the numerical study we use an equivalent kernel approach. To be more specific, it follows by [38] that the reproducing


kernel function $K$ with respect to the inner product $\langle\cdot,\cdot\rangle$ satisfies $K(s,t) = \sum_{\nu=1}^\infty \varphi_\nu(s)\varphi_\nu(t)/(1+\lambda\rho_\nu)$. On the other hand, it follows by [32, pages 184--185] that, in the setting of Proposition 5.10, the kernel function $K(s,t)$ admits the approximation
$$K(s,t) \approx \frac{1}{2\sqrt{2}h}\exp\Big(-\frac{|s-t|}{\sqrt{2}h}\Big)\Big\{\sin\Big(\frac{|s-t|}{\sqrt{2}h}\Big) + \cos\Big(\frac{s-t}{\sqrt{2}h}\Big)\Big\} + \frac{1}{2\sqrt{2}h}\Big\{\Phi\Big(\frac{s+t}{\sqrt{2}h}, \frac{s-t}{\sqrt{2}h}\Big) + \Phi\Big(\frac{2-s-t}{\sqrt{2}h}, \frac{t-s}{\sqrt{2}h}\Big)\Big\},$$
where $\Phi(u,v) = \exp(-u)(\cos(u) - \sin(u) + 2\cos(v))$. Then $(\rho_\nu, \varphi_\nu)$ can be found by performing a spectral decomposition of the above function.

6. Simulations.

6.1. Gaussian Sequence Model. In this section, the performance of the adaptive credible regions (ACR) and modified adaptive credible regions (MACR) constructed under Priors P1 and P2 is demonstrated through a simulation study. These adaptive regions are constructed in Section S.3 of the supplementary document. For simplicity, we use ACR1, MACR1, ACR2 and MACR2 to denote the regions constructed by formulas (S.2), (S.3), (S.6) and (S.7), respectively. Suppose that the samples $Y_i$ are generated from (2.1) with true parameters $\theta_i^0 = (i\log i)^{-3/2}$, $i = 2, 3, \ldots$. Thus, the true regularity is $q_0 = 1$. The model errors $\epsilon_i$ are iid normal random variables. Results are based on 10,000 independent experiments. In each experiment, $(1-\alpha)$ ACR1, MACR1, ACR2 and MACR2 were constructed. We examined the coverage proportions (CP) of these regions, i.e., the proportions of the regions that cover the true model parameter $\theta^0$. To fully investigate how the coverage proportions relate to the credibility levels, we tested a set of different values of the credibility level $1-\alpha$. Results are summarized in Figures 3 and 4. It can be seen that the coverage proportions of both ACR1 and ACR2 become larger than the credibility level $1-\alpha$ as $n$ becomes sufficiently large. Meanwhile, we observe that ACR2 yields a slightly smaller CP than ACR1. This is because ACR2 has smaller radius and hence smaller volume.
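The equivalent kernel recipe just described can be sketched as follows: evaluate the approximate $K(s,t)$ on a grid and take a spectral decomposition, whose leading eigenvalues approximate $1/(1+\lambda\rho_\nu)$. The grid size and the value $h = 0.1$ below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def Phi(u, v):
    return np.exp(-u) * (np.cos(u) - np.sin(u) + 2.0 * np.cos(v))

def K(s, t, h):
    """Equivalent-kernel approximation from [32] in the m = 2 setting."""
    r = np.sqrt(2.0) * h
    d = s - t
    a = np.abs(d)
    main = np.exp(-a / r) * (np.sin(a / r) + np.cos(d / r))
    bdry = Phi((s + t) / r, d / r) + Phi((2.0 - s - t) / r, (t - s) / r)
    return (main + bdry) / (2.0 * np.sqrt(2.0) * h)

h, n_grid = 0.1, 200
z = (np.arange(n_grid) + 0.5) / n_grid          # grid on (0, 1)
Kmat = K(z[:, None], z[None, :], h)
# Quadrature-weighted spectral decomposition; eigenvalues (descending)
# approximate 1/(1 + lambda*rho_nu), hence lie near (0, 1] and decay.
evals = np.linalg.eigvalsh(Kmat / n_grid)[::-1]
```

With $\lambda = h^{2m}$, the leading eigenvalues $\mu_\nu$ can then be inverted to rough estimates $\rho_\nu \approx (1/\mu_\nu - 1)/\lambda$.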
The smaller volume may be viewed as an advantage, since it means that the credible region is more "concentrated." The fact that ACR2 has smaller volume is not surprising: we have used the same $\hat q$ in both formulas (S.2) and (S.6).

Fig 2. Plots of $\delta_E = \hat E_n - \hat E_n^*$ and $\delta_D = \hat D_n - \hat D_n^*$ versus $\hat q$. In $\hat E_n$ and $\hat D_n$ appearing in (S.2), we choose $p = \hat q + 1$; in $\hat E_n^*$ and $\hat D_n^*$ appearing in (S.6), we choose $p = \hat q + 1/2$ and $q = (2\hat q + 1)/4$. The plots display that $\delta_D$ and $\delta_E$ are positive-valued along $\hat q$, indicating that $\hat D_n > \hat D_n^*$ and $\hat E_n > \hat E_n^*$.

From a numerical

examination it can be seen that $\hat E_n > \hat E_n^*$ and $\hat D_n > \hat D_n^*$ (see Figure 2), and hence $C_{*n}^{\mathrm{adapt}}(\alpha)$ has smaller radius than $C_n^{\mathrm{adapt}}(\alpha)$. It can also be observed that both MACR1 and MACR2 yield coverage proportions very close to the nominal level $1-\alpha$ for a variety of values of $\alpha$. This means that, under both Priors P1 and P2, MACR matches the frequentist coverage better than ACR.

6.2. Smoothing Spline Model. In this section, we investigate the frequentist coverage probabilities of the credible region (5.5) and modified credible region (5.6), and of the credible intervals for the two functionals (5.17) and (5.18). Suppose
$$(6.1) \qquad Y_i = f_0(X_i) + \epsilon_i, \quad i = 1, 2, \ldots, n,$$

where the $X_i$ are iid uniform over $[0,1]$, and the $\epsilon_i$ are iid standard normal random variables independent of the $X_i$. The true regression function was chosen as $f_0(x) = 3\beta_{30,17}(x) + 2\beta_{3,11}(x)$, where $\beta_{a,b}$ is the probability density function of Beta$(a,b)$. Figure 5 displays the true function $f_0$, from which it can be seen that $f_0$ has both peaks and troughs.
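A minimal sketch of this data-generating process (the seed and $n = 200$ are arbitrary illustrative choices; the Beta densities are written out via log-gamma so no external libraries are needed):

```python
import math
import random

# Sketch of the simulation design (6.1): Y_i = f0(X_i) + eps_i,
# X_i ~ Unif[0,1], eps_i ~ N(0,1). Seed and n = 200 are illustrative.
def beta_pdf(x, a, b):
    """Beta(a, b) density, via log-gamma for numerical stability."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x))

def f0(x):
    return 3 * beta_pdf(x, 30, 17) + 2 * beta_pdf(x, 3, 11)

rng = random.Random(0)
n = 200
X = [rng.random() for _ in range(n)]
Y = [f0(x) + rng.gauss(0.0, 1.0) for x in X]
```

The right peak of $f_0$ sits near the mode of Beta$(30,17)$, $x \approx 0.64$, which is where the pointwise intervals below show under-coverage.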

Fig 3. Coverage proportion (CP) of ACR1 and MACR1, constructed by (S.2) and (S.3) respectively, based on various sample sizes and credibility levels (panels for $1-\alpha$ = 0.95, 0.90, 0.70, 0.50, 0.30, 0.10; CP versus $n$). The dotted red line indicates the position of the $1-\alpha$ credibility level.

To examine the coverage property of the credible regions, we chose $n$ ranging from 20 to 2000. For each $n$, 1,000 independent trials were conducted. From each trial, a credible region (CR) and a modified credible region (MCR) were constructed. The proportions of the CR and MCR covering $f_0$ were calculated and displayed against the sample sizes. Results are summarized in Figure 6. It can be seen that for different credibility levels $1-\alpha$, the coverage proportions (CP) of CR are greater than $1-\alpha$ when $n$ is large enough; they even tend to one for large sample sizes. However, the CP of MCR tends to exactly $1-\alpha$ as $n$ increases. Thus, the numerical results confirm the theory developed in Sections 5.2 and 5.3.

To examine the coverage property of the credible intervals, we chose $n = 2^5, 2^7, 2^8, 2^9$ to demonstrate the trend of coverage along with increasing

Fig 4. Coverage proportion (CP) of ACR2 and MACR2, constructed by (S.6) and (S.7) respectively, based on various sample sizes and credibility levels (panels for $1-\alpha$ = 0.95, 0.90, 0.70, 0.50, 0.30, 0.10; CP versus $n$). The dotted red line indicates the position of the $1-\alpha$ credibility level.

sample sizes. For the pointwise functional, we considered $F = F_z$ for 15 evenly-spaced points $z$ in $[0,1]$. For each $z$, a credible interval based on (5.17) was constructed. We then calculated the coverage probability of this interval based on 1,000 independent experiments, that is, the empirical proportion of the intervals (among the 1,000 intervals) that cover the true value $f_0(z)$. Figure 7 summarizes the results for different credibility levels, where coverage probabilities are plotted against the corresponding points $z$. It can be seen that the coverage probability of the pointwise intervals is slightly larger than $1-\alpha$ for all $\alpha$ and $n$ being considered, which is consistent with Proposition 5.10 (i), except for the points near the right peak of $f_0$. Indeed, at those points near the right peak, under-coverage is observed. This is a common phenomenon in the frequentist literature: peaks and troughs

Fig 5. Plot of the true function $f_0$ used in model (6.1).

may affect the coverage property of the pointwise interval; see [34, 38]. For the integral functional, we considered $F = F_{\omega_{z_0}}$ for $\omega_{z_0}(z) = I(0 \leq z \leq z_0)$ with 15 evenly-spaced points $z_0$ in $[0,1]$. We evaluated the coverage probability at each $z_0$ based on 1,000 experiments. Figure 8 summarizes the results for different credibility levels, where coverage probabilities are plotted against the corresponding points $z_0$. It can be seen that, as $n$ increases, the coverage probability of the integral intervals tends to $1-\alpha$ for all $\alpha$. This phenomenon is consistent with our theory, i.e., Proposition 5.10 (ii).

Acknowledgements. We thank PhD student Meimei Liu at Purdue for her help with the simulation study.

REFERENCES

[1] Birkhoff, D. (1908). Boundary value and expansion problems of ordinary linear differential equations. Transactions of the American Mathematical Society, 9, 373–395.
[2] Belitser, E. and Ghosal, S. (2003). Adaptive Bayesian inference on the mean of an infinite dimensional normal distribution. Annals of Statistics, 31, 536–559.
[3] Brown, L. D. and Low, M. G. (1996). Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics, 24, 2384–2398.
[4] Bickel, P. J. and Kleijn, B. J. K. (2012). The semiparametric Bernstein-von Mises theorem. Annals of Statistics, 40, 206–237.
[5] Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be "plugged-in". Annals of Statistics, 31, 1033–1053.

Fig 6. Coverage proportion (CP) of CR and MCR, constructed by (5.5) and (5.6) respectively, based on various sample sizes and credibility levels (panels for $1-\alpha$ = 0.95, 0.90, 0.70, 0.50, 0.30, 0.10; CP versus $n$). The dotted red line indicates the position of the $1-\alpha$ credibility level.

[6] Courant, R. and Hilbert, D. (1937). Methods of Mathematical Physics, Vol. 1. Interscience Publishers, Inc., New York.
[7] Cameron, R. H. and Martin, W. T. (1944). Transformations of Wiener integrals under translations. Annals of Mathematics, 45, 386–396.
[8] Castillo, I. (2014). On Bayesian supremum norm contraction rates. Annals of Statistics, 42, 2058–2091.
[9] Castillo, I. (2012). A semiparametric Bernstein-von Mises theorem for Gaussian process priors. Probability Theory and Related Fields, 152, 53–99.
[10] Castillo, I. and Nickl, R. (2013). Nonparametric Bernstein-von Mises theorems in Gaussian white noise. Annals of Statistics, 41, 1999–2028.
[11] Castillo, I. and Nickl, R. (2014). On the Bernstein-von Mises phenomenon for nonparametric Bayes procedures. Annals of Statistics, 42, 1941–1969.
[12] Castillo, I. and Rousseau, J. (2014). A Bernstein-von Mises theorem for smooth functionals in semiparametric models. Preprint.

Fig 7. Coverage proportion (CP) of the credible interval for $F_z(f_0)$ versus $z$ (panels for $n = 2^5, 2^7, 2^8, 2^9$ and credibility levels $1-\alpha$ = 0.95, 0.90, 0.70, 0.50, 0.30, 0.10). The dotted red line indicates the position of the $1-\alpha$ credibility level.

[13] Cox, D. (1993). An analysis of Bayesian inference for nonparametric regression. Annals of Statistics, 21, 903–923.
[14] Cheng, G. and Kosorok, M. R. (2008). General frequentist properties of the posterior profile distribution. Annals of Statistics, 36, 1819–1853.
[15] Cheng, G. and Kosorok, M. R. (2009). The penalized profile sampler. Journal of Multivariate Analysis, 100, 345–362.
[16] Cheng, G. and Shang, Z. (2013). Joint asymptotics for semi-nonparametric regression models with partially linear structure. http://arxiv.org/abs/1311.2628.
[17] de Jong, P. (1987). A central limit theorem for generalized quadratic forms. Probability Theory and Related Fields, 75, 261–277.
[18] Freedman, D. (1999). On the Bernstein-von Mises theorem with infinite-dimensional parameters. Annals of Statistics, 27, 1119–1140.
[19] Giné, E. and Nickl, R. (2011). Rates of contraction for posterior distributions in $L^r$-metrics, $1 \leq r \leq \infty$. Annals of Statistics, 39, 2795–3443.

Fig 8. Coverage proportion (CP) of the credible interval for $F_{\omega_{z_0}}(f_0)$ versus $z_0$ (panels for $n = 2^5, 2^7, 2^8, 2^9$ and credibility levels $1-\alpha$ = 0.95, 0.90, 0.70, 0.50, 0.30, 0.10). The dotted red line indicates the position of the $1-\alpha$ credibility level.

[20] Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999). Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics, 27, 143–158.
[21] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28, 500–531.
[22] Hájek, J. (1962). On linear statistical problems in stochastic processes. Czechoslovak Mathematical Journal, 12, 404–444.
[23] Hoffmann-Jørgensen, Shepp, L. A. and Dudley, R. M. (1979). On the lower tail of Gaussian seminorms. Annals of Probability, 7, 193–384.
[24] Johnstone, I. M. (2013). Gaussian Estimation: Sequence and Wavelet Models.
[25] Knapik, B. T., van der Vaart, A. W. and van Zanten, J. H. (2011). Bayesian inverse problems with Gaussian priors. Annals of Statistics, 39, 2626–2657.
[26] Knapik, B. T., Szabó, B. T., van der Vaart, A. W. and van Zanten, J. H. (2013). Bayes procedures for adaptive inference in inverse problems for the white noise model. Preprint.


[27] Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
[28] Kuelbs, J., Li, W. V. and Linde, W. (1994). The Gaussian measure of shifted balls. Probability Theory and Related Fields, 98, 143–162.
[29] Leahu, H. (2011). On the Bernstein-von Mises phenomenon in the Gaussian white noise model. Electronic Journal of Statistics, 5, 373–404.
[30] Li, W. V. (1999). A Gaussian correlation inequality and its applications to small ball probabilities. Electronic Communications in Probability, 4, 111–118.
[31] Morris, C. N. (1982). Natural exponential families with quadratic variance functions. Annals of Statistics, 10, 65–80.
[32] Messer, K. and Goldstein, L. (1993). A new class of kernels for nonparametric curve estimation. Annals of Statistics, 21, 179–195.
[33] Nickl, R. (2014). Discussion to "Frequentist coverage of adaptive nonparametric Bayesian credible sets". Annals of Statistics, to appear.
[34] Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83, 1134–1143.
[35] Nussbaum, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. Annals of Statistics, 24, 2399–2430.
[36] Rivoirard, V. and Rousseau, J. (2012). Bernstein-von Mises theorem for linear functionals of the density. Annals of Statistics, 40, 1489–1523.
[37] Shang, Z. (2010). Convergence rate and Bahadur type representation of general smoothing spline M-estimates. Electronic Journal of Statistics, 4, 1411–1442.
[38] Shang, Z. and Cheng, G. (2013). Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41, 2608–2638.
[39] Szabó, B., van der Vaart, A. W. and van Zanten, H. (2014). Empirical Bayes scaling of Gaussian priors in the white noise model. Electronic Journal of Statistics, 7, 991–1018.
[40] Szabó, B., van der Vaart, A. W. and van Zanten, H. (2014). Frequentist coverage of adaptive nonparametric Bayesian credible sets. Annals of Statistics.
[41] Tusnády, G. (1977). A remark on the approximation of the sample distribution function in the multidimensional case. Periodica Mathematica Hungarica, 8, 53–55.
[42] Utreras, F. I. (1988). Boundary effects on convergence rates for Tikhonov regularization. Journal of Approximation Theory, 54, 235–249.
[43] van der Vaart, A. W. and van Zanten, J. H. (2007). Bayesian inference with rescaled Gaussian process priors. Electronic Journal of Statistics, 1, 433–448.
[44] van der Vaart, A. W. and van Zanten, J. H. (2008). Reproducing kernel Hilbert spaces of Gaussian priors. IMS Collections, Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, 3, 200–222.
[45] van der Vaart, A. W. and van Zanten, J. H. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36, 1031–1508.
[46] van der Vaart, A. W. and van Zanten, J. H. (2008). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. Annals of Statistics, 37, 2655–2675.
[47] Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics, 13, 1251–1638.
[48] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.


Z. SHANG AND G. CHENG

[49] Zhao, L. H. (2000). Bayesian aspects of some nonparametric problems. Annals of Statistics, 28, 532–552.

Supplementary document to NONPARAMETRIC BERNSTEIN-VON MISES PHENOMENON: A TUNING PRIOR PERSPECTIVE
By Zuofeng Shang and Guang Cheng, Department of Statistics, Purdue University

This supplementary document contains the proofs of the main results and additional materials not included in the main manuscript. We organize this document as follows:
• Section S.1 contains proofs in Section 2.1, i.e., proofs of Theorems 2.1 and 2.2.
• Section S.2 contains proofs in Section 2.2, i.e., proofs of Theorem 2.3 and several technical lemmas.
• In Section S.3, we construct adaptive credible regions under priors P1 and P2, and explore their frequentist coverage properties.
• Section S.4 includes some preliminary results in Section 3, e.g., the expression of the Radon-Nikodym derivative of Πλ with respect to Π, together with several limiting results.
• Section S.5 includes results on posterior contraction rates under a type of Sobolev norm, as given in Section 4.
• Section S.6 contains the proof of the nonparametric BvM theorem, i.e., Theorem 5.1 presented in Section 5.1, together with the specification of the probability measure P0.
• Section S.7 includes the proofs of the results in Section 5.2, in particular, the coverage property of the credible regions under a type of Sobolev norm.
• Section S.8 includes the proofs of the results in Section 5.3, in particular, the coverage property of the credible regions under a weaker norm.


• Section S.9 includes the proofs of the results in Section 5.4, in particular, the coverage property of linear functionals of the regression function. Both pointwise and integral functionals are addressed.

S.1. Proofs in Section 2.1.

Proof of Theorem 2.1. For simplicity, denote $\gamma = 2p/(2q_0+1)$.

Proof of Part (1). If $\alpha_0 \ge \gamma$, implying that $h^{-(2q_0+1)} = n^{\alpha_0/\gamma} \ge n$, then for any $\theta_0 \in \ell_2(q_0)$, by dominated convergence theorem, as $n\to\infty$,
\[
n\sum_{i=1}^\infty \frac{\lambda^2 i^{4p}}{(1+\lambda i^{2p})^2}|\theta_i^0|^2
= nh^{2q_0}\sum_{i=1}^\infty \frac{(ih)^{4p-2q_0}}{(1+(ih)^{2p})^2}|\theta_i^0|^2 i^{2q_0}
= o(nh^{2q_0}) = o(h^{-1}),
\]
which leads to
\[
n\sum_{i=1}^\infty(\hat\theta_i-\theta_i^0)^2
= n\sum_{i=1}^\infty \frac{(-\lambda i^{2p}\theta_i^0 + n^{-1/2}\epsilon_i)^2}{(1+\lambda i^{2p})^2}
= o_P(h^{-1}) + \sum_{i=1}^\infty \frac{\epsilon_i^2-1}{(1+\lambda i^{2p})^2} + \sum_{i=1}^\infty \frac{1}{(1+\lambda i^{2p})^2}.
\]
It follows by direct calculations that
\[
E\Big\{\Big|\sum_{i=1}^\infty \frac{\epsilon_i^2-1}{(1+\lambda i^{2p})^2}\Big|^2\Big\}
= \sum_{i=1}^\infty \frac{2}{(1+\lambda i^{2p})^4} = O(h^{-1}),
\]
therefore, $\sum_{i=1}^\infty \frac{\epsilon_i^2-1}{(1+\lambda i^{2p})^2} = O_P(h^{-1/2})$. This implies that
\[
nh\|\theta_0-\hat\theta\|_2^2 = o_P(1) + F_n + h\sum_{i=1}^\infty \frac{\epsilon_i^2-1}{(1+\lambda i^{2p})^2} = F_n + o_P(1),
\]
where $F_n = \sum_{i=1}^\infty \frac{h}{(1+\lambda i^{2p})^2}$. It is easy to see that as $h\to 0$, $E_n \to \int_0^\infty \frac{1}{1+x^{2p}}\,dx$ and $F_n \to \int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx$. So $\inf_{h>0}(E_n-F_n) > 0$. This implies that, with probability approaching one, $\theta_0 \in C_n(\alpha)$.

Proof of Part (2). If $0<\alpha_0<\gamma$, then $\alpha_0/(2p) < 1/(2q_0+1) \le 1$ and $2q_0 < 2p/\alpha_0 - 1$. Let $t$ be a constant such that $2q_0 < t < \min\{2p/\alpha_0-1,\,4p-1\}$, and let $\theta_i^0 = i^{-(1+t)/2}$. It is easy to see that $\sum_i |\theta_i^0|^2 i^{2q_0} = \sum_i i^{-1-t+2q_0} < \infty$, and hence, $\theta_0 = \{\theta_i^0\}_{i=1}^\infty \in \ell_2(q_0)$. Then
\[
n\sum_{i=1}^\infty \frac{\lambda^2 i^{4p}}{(1+\lambda i^{2p})^2}|\theta_i^0|^2
= nh^{t+1}\sum_{i=1}^\infty \frac{(ih)^{4p-t-1}}{(1+(ih)^{2p})^2}
\approx nh^t\int_0^\infty \frac{x^{4p-t-1}}{(1+x^{2p})^2}\,dx.
\]
In the meantime,
\[
E\Big\{\Big|\sum_{i=1}^\infty \frac{(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2}\Big|^2\Big\}
= \sum_{i=1}^\infty \frac{(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^4}
= h^{t+1}\sum_{i=1}^\infty \frac{(ih)^{4p-t-1}}{(1+(ih)^{2p})^4}
\approx h^t\int_0^\infty \frac{x^{4p-t-1}}{(1+x^{2p})^4}\,dx,
\]
implying that $\sum_{i=1}^\infty \frac{(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2} = O_P(h^{t/2})$, and
\[
E\Big\{\sum_{i=1}^\infty \frac{\epsilon_i^2}{(1+(ih)^{2p})^2}\Big\} = \sum_{i=1}^\infty \frac{1}{(1+(ih)^{2p})^2} \asymp h^{-1},
\]
which further implies that $\sum_{i=1}^\infty \frac{\epsilon_i^2}{(1+(ih)^{2p})^2} = O_P(h^{-1})$. Therefore,
\[
nh\|\theta_0-\hat\theta\|_2^2
= nh\sum_{i=1}^\infty \frac{(ih)^{4p}|\theta_i^0|^2 - 2n^{-1/2}\epsilon_i(ih)^{2p}\theta_i^0 + n^{-1}\epsilon_i^2}{(1+(ih)^{2p})^2}
= nh\sum_{i=1}^\infty \frac{\lambda^2 i^{4p}|\theta_i^0|^2}{(1+\lambda i^{2p})^2} + O_P(\sqrt{n}h^{1+t/2}+1)
\asymp nh^{t+1}(1+o_P(1)) = n^{1-\alpha_0(t+1)/(2p)}(1+o_P(1)).
\]
Since $1-\alpha_0(t+1)/(2p) > 0$, the above derivations imply that, with probability approaching one, $nh\|\theta_0-\hat\theta\|_2^2 > E_n + \sqrt{D_n h}\,z_\alpha$, or equivalently, $\theta_0 \notin C_n(\alpha)$.

Proof of Theorem 2.2. Proof of Part (1). Note that
\[
n\|\theta_0-\hat\theta\|_\dagger^2 = n\sum_{i=1}^\infty d_i\frac{(-\lambda i^{2p}\theta_i^0+n^{-1/2}\epsilon_i)^2}{(1+\lambda i^{2p})^2}
\]

\[
= n\sum_{i=1}^\infty \frac{d_i(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^2}
- 2\sqrt{n}\sum_{i=1}^\infty \frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2}
+ \sum_{i=1}^\infty \frac{d_i\epsilon_i^2}{(1+(ih)^{2p})^2}.
\]
It follows by the conditions on $\theta_0$ and dominated convergence theorem that
\[
n\sum_{i=1}^\infty \frac{d_i(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^2}
\lesssim nh^{2q_0+1}\sum_{i=1}^\infty \frac{(ih)^{4p-2q_0-1}}{(1+(ih)^{2p})^2}|\theta_i^0|^2 i^{2q_0}
= o(nh^{2q_0+1}) = o(1),
\]
where the last equality holds since $\alpha_0 \ge 2p/(2q_0+1)$, implying $nh^{2q_0+1} = n^{1-\alpha_0(2q_0+1)/(2p)} = O(1)$. On the other side, it follows by similar calculations that
\[
E\Big\{\Big|\sqrt{n}\sum_{i=1}^\infty \frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2}\Big|^2\Big\}
= n\sum_{i=1}^\infty \frac{d_i^2(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^4} = o(1),
\]
which implies $\sqrt{n}\sum_{i=1}^\infty \frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2} = o_P(1)$. In the end, as $n\to\infty$, $\sum_{i=1}^\infty \frac{d_i\epsilon_i^2}{(1+(ih)^{2p})^2}$ converges to $\sum_{i=1}^\infty d_i\epsilon_i^2$ in distribution. Therefore, as $n\to\infty$,
\[
P_{\theta_0}(\theta_0 \in C_n^\dagger(\alpha)) = P_{\theta_0}(n\|\theta_0-\hat\theta\|_\dagger^2 \le c_\alpha) \to P\Big(\sum_{i=1}^\infty d_i\epsilon_i^2 \le c_\alpha\Big) = 1-\alpha.
\]
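The limit above can be illustrated numerically. The following sketch (not part of the formal argument) simulates the sequence model $Y_i = \theta_i^0 + n^{-1/2}\epsilon_i$, takes the weights $d_i = i^{-1}(\log 2i)^{-\tau}$ used in the proof of Part (2), a truth in $\ell_2(q_0)$, and the boundary tuning $\alpha_0 = 2p/(2q_0+1)$ (all illustrative choices), and checks that the frequentist coverage of $C_n^\dagger(\alpha)$ is near $1-\alpha$:

```python
import numpy as np

# Monte Carlo sketch of Theorem 2.2, Part (1): under alpha_0 >= 2p/(2q_0+1),
# n * ||theta_0 - theta_hat||_dagger^2 is approximately distributed as
# sum_i d_i eps_i^2, so C_n^dagger(alpha) covers theta_0 with prob. ~ 1-alpha.
rng = np.random.default_rng(0)
n, N, q0, tau, alpha = 200, 2000, 1.0, 1.1, 0.05
p = q0 + 1.0                                  # illustrative choice p > q0 + 1/2
h = n ** (-1.0 / (2 * q0 + 1))                # boundary case alpha_0 = 2p/(2q0+1)
lam = h ** (2 * p)                            # lambda = h^{2p} = n^{-alpha_0}
i = np.arange(1.0, N + 1)
d = np.log(2 * i) ** (-tau) / i               # assumed weights d_i
theta0 = i ** (-(2 * q0 + 1) / 2 - 0.1)       # a truth in l_2(q0)

# c_alpha: (1-alpha) quantile of the limit law sum_i d_i eps_i^2 (Monte Carlo)
c_alpha = np.quantile((d * rng.standard_normal((2000, N)) ** 2).sum(1), 1 - alpha)

reps, cover = 500, 0
for _ in range(reps):
    Y = theta0 + rng.standard_normal(N) / np.sqrt(n)
    theta_hat = Y / (1 + lam * i ** (2 * p))  # posterior mode under Prior P1
    cover += n * np.sum(d * (theta0 - theta_hat) ** 2) <= c_alpha
print(cover / reps)  # empirical coverage; theory predicts -> 1 - alpha
```

The sum over $i$ is truncated at $N$ terms; since the weights $d_i$ decay, the truncation error is negligible for this illustration.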

Proof of Part (2). Now suppose $0<\alpha_0<2p/(2q_0+1)$, and hence, $\alpha_0(2q_0+1)/(2p) < 1$. Choose $t>0$ with $\alpha_0(2q_0+1+t)/(2p) < 1$ and $t < 4p-2q_0-2$. Define $\theta_i^0 = i^{-(2q_0+1+t)/2}$ for $i=1,2,\ldots$. It is easy to see that $\theta_0 = \{\theta_i^0\}_{i=1}^\infty \in \ell_2(q_0)$. By the choice of $d_i$ we have
\[
n\sum_{i=1}^\infty \frac{d_i(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^2}
\asymp n\sum_{i=1}^\infty \frac{i^{-1}(\log 2i)^{-\tau}(ih)^{4p}i^{-2q_0-1-t}}{(1+(ih)^{2p})^2}
= nh^{2q_0+2+t}\sum_{i=1}^\infty \frac{(\log 2ih+\log h^{-1})^{-\tau}(ih)^{4p-2q_0-2-t}}{(1+(ih)^{2p})^2}
\asymp nh^{2q_0+1+t}(\log n)^{-\tau}\int_0^\infty \frac{x^{4p-2q_0-2-t}}{(1+x^{2p})^2}\,dx.
\]
On the other side,
\[
E\Big\{\Big|\sqrt{n}\sum_{i=1}^\infty \frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2}\Big|^2\Big\}
\lesssim n\sum_{i=1}^\infty \frac{d_i(ih)^{4p}|\theta_i^0|^2}{(1+(ih)^{2p})^2}
\asymp nh^{2q_0+1+t}(\log n)^{-\tau},
\]
which implies $\sqrt{n}\sum_{i=1}^\infty \frac{d_i(ih)^{2p}\theta_i^0\epsilon_i}{(1+(ih)^{2p})^2} = O_P(\sqrt{nh^{2q_0+1+t}(\log n)^{-\tau}})$. Since
\[
nh^{2q_0+1+t}(\log n)^{-\tau} = n^{1-\alpha_0(2q_0+1+t)/(2p)}(\log n)^{-\tau} \to \infty,
\]
we have
\[
n\|\theta_0-\hat\theta\|_\dagger^2 \asymp nh^{2q_0+1+t}(\log n)^{-\tau}(1+o_P(1)) + O_P(1).
\]
Therefore, with probability approaching one, $n\|\theta_0-\hat\theta\|_\dagger^2 > c_\alpha$, i.e., $\theta_0 \notin C_n^\dagger(\alpha)$.

S.2. Proofs in Section 2.2. The following Lemma S.1 shows that both $E_{*n}$ and $D_{*n}$ are asymptotically positive constants. The limiting distribution of $Z_{*n}$ is standard normal, as specified in Lemma S.2.

Lemma S.1. For any constant $\delta \ge 1$, we have
\[
l_{\alpha_0} \equiv \lim_{n\to\infty}\sum_{i=1}^\infty \frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^\delta}
= \begin{cases}
\int_0^\infty (1+x^{2p})^{-\delta}\,dx, & \alpha_0 > q/p,\\
\int_0^\infty (1+x^{2q})^{-\delta}\,dx, & 0 < \alpha_0 < q/p,\\
\int_0^\infty (1+x^{2q}+x^{2p})^{-\delta}\,dx, & \alpha_0 = q/p.
\end{cases}
\]

Lemma S.2. As $n\to\infty$, $Z_{*n} \stackrel{d}{\to} N(0,1)$.

Proof of Lemma S.1. If $\alpha_0 > q/p$, then $\alpha_1 = 1/(2p)$, and hence
\[
\sum_{i=1}^\infty \frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^\delta}
= \sum_{i=1}^\infty \frac{n^{-1/(2p)}}{(1+(in^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(in^{-1/(2p)})^{2p})^\delta}
\le \sum_{i=1}^\infty \int_{i-1}^i \frac{n^{-1/(2p)}}{(1+(xn^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(xn^{-1/(2p)})^{2p})^\delta}\,dx
= \int_0^\infty \frac{1}{(1+x^{2q}n^{-\alpha_0+q/p}+x^{2p})^\delta}\,dx
\xrightarrow{n\to\infty} \int_0^\infty \frac{1}{(1+x^{2p})^\delta}\,dx,
\]
where the last limit follows by dominated convergence theorem. Similarly, we have
\[
\sum_{i=1}^\infty \frac{n^{-1/(2p)}}{(1+(in^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(in^{-1/(2p)})^{2p})^\delta}
\ge \sum_{i=1}^\infty \int_i^{i+1} \frac{n^{-1/(2p)}}{(1+(xn^{-1/(2p)})^{2q}n^{-\alpha_0+q/p}+(xn^{-1/(2p)})^{2p})^\delta}\,dx
= \int_{n^{-1/(2p)}}^\infty \frac{1}{(1+x^{2q}n^{-\alpha_0+q/p}+x^{2p})^\delta}\,dx
\xrightarrow{n\to\infty} \int_0^\infty \frac{1}{(1+x^{2p})^\delta}\,dx.
\]
This proves the result for $\alpha_0 > q/p$. The proofs in the other cases are similar and omitted.

Proof of Lemma S.2. The logarithm of the moment generating function of $Z_{*n}$ satisfies, for $t\in(-1/2,1/2)$,
\[
\log(E\{\exp(tZ_{*n})\})
= \sum_{i=1}^\infty \log\Big(E\Big\{\exp\Big(\frac{tD_{*n}^{-1/2}n^{-\alpha_1/2}(\eta_i^2-1)}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)\Big\}\Big)
= \sum_{i=1}^\infty\Big(-\frac{1}{2}\log\Big(1-\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)-\frac{tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)
\]
\[
= \sum_{i=1}^\infty\Bigg(-\frac{1}{2}\Bigg(-\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}
-\frac{1}{2}\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^2
+O\Big(\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^3\Big)\Bigg)
-\frac{tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Bigg)
= \frac{t^2}{2} + O\Bigg(\sum_{i=1}^\infty\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^3\Bigg).
\]
By Lemma S.1,
\[
\sum_{i=1}^\infty\Big(\frac{2tD_{*n}^{-1/2}n^{-\alpha_1/2}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^3
= (2tD_{*n}^{-1/2})^3 n^{-\alpha_1/2}\sum_{i=1}^\infty \frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^3} \to 0,
\]

therefore, $E\{\exp(tZ_{*n})\} \to \exp(t^2/2)$, which proves the desired result.

Proof of Theorem 2.3. Before the formal proofs, we notice a simple fact: for any $i\ge 1$,
\[
(S.1)\quad \lambda i^{2q} = (n^{-\alpha_0/(2q)}i)^{2q} \le (n^{-\alpha_1}i)^{2q},\qquad
n^{-1}i^{2p} = (n^{-1/(2p)}i)^{2p} \le (n^{-\alpha_1}i)^{2p}.
\]
Meanwhile, by direct examinations we have
\[
\|\hat\theta_*-\theta_0\|_2^2
= \sum_{i=1}^\infty \frac{(-(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0+\epsilon_i/\sqrt{n})^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}
= \sum_{i=1}^\infty \frac{(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}
- 2n^{-1/2}\sum_{i=1}^\infty \frac{(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}
+ n^{-1}\sum_{i=1}^\infty \frac{\epsilon_i^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}.
\]

Define the three terms on the right side of the above equation to be $T_1, T_2, T_3$. It follows immediately from Lemma S.1 that $E\{T_3\} = O(n^{-1+\alpha_1})$, implying that $T_3 = O_P(n^{-1+\alpha_1})$.

Proof of Part (1a). If $p-1/2 \le q_0 < 2q$ and $\alpha_0 \ge 2q/(2q_0+1)$, then we have $(2q_0+1)\alpha_1 \ge 1$, since both $1/(2p) \ge 1/(2q_0+1)$ and $\alpha_0/(2q) \ge 1/(2q_0+1)$ are true. Therefore, by (S.1), the fact that the function $x\mapsto x/(1+x)$ is increasing on $x\ge 0$, and dominated convergence theorem, we get that
\[
n^{1-\alpha_1}T_1
= n^{1-\alpha_1}\sum_{i=1}^\infty\Big(\frac{\lambda i^{2q}+n^{-1}i^{2p}}{1+\lambda i^{2q}+n^{-1}i^{2p}}\Big)^2|\theta_i^0|^2
\le n^{1-\alpha_1}\sum_{i=1}^\infty\Big(\frac{(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p}}{1+(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p}}\Big)^2|\theta_i^0|^2
= n^{1-\alpha_1-2\alpha_1 q_0}\sum_{i=1}^\infty (n^{-\alpha_1}i)^{4q-2q_0}\Big(\frac{1+(n^{-\alpha_1}i)^{2(p-q)}}{1+(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p}}\Big)^2|\theta_i^0|^2 i^{2q_0}
= o(n^{1-\alpha_1-2\alpha_1 q_0}).
\]
We also note that
\[
E\{T_2^2\} = 4n^{-1}\sum_{i=1}^\infty \frac{(\lambda i^{2q}+n^{-1}i^{2p})^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^4}|\theta_i^0|^2 \le 4n^{-1}T_1,
\]
therefore, $T_2 = o_P(n^{-1/2}T_1^{1/2})$. Since $1/2-\alpha_1-\alpha_1 q_0 < (1-\alpha_1-2\alpha_1 q_0)/2 \le 0$, we get $n^{1-\alpha_1}T_2 = o_P(n^{1/2-\alpha_1-\alpha_1 q_0})$. To handle $T_3$, note that
\[
n^{1-\alpha_1}T_3 = n^{-\alpha_1}\sum_{i=1}^\infty \frac{\epsilon_i^2-1}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} + \sum_{i=1}^\infty \frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2},
\]
and by Lemma S.1,
\[
E\Big\{\Big|\sum_{i=1}^\infty \frac{\epsilon_i^2-1}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}\Big|^2\Big\}
= \sum_{i=1}^\infty \frac{2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^4} = O(n^{\alpha_1}),
\]
which implies $n^{1-\alpha_1}T_3 = O_P(n^{-\alpha_1/2}) + \sum_{i=1}^\infty \frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}$. Combining the above, we get that
\[
n^{1-\alpha_1}\|\theta_0-\hat\theta_*\|_2^2 = n^{1-\alpha_1}(T_1+T_2+T_3)
= \sum_{i=1}^\infty \frac{n^{-\alpha_1}}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} + o_P(1).
\]

Since the leading term in the above equation is asymptotically strictly smaller than $E_{*n}$, we have, with probability approaching one, $\theta_0 \in C_{*n}(\alpha)$.

Proof of Part (1b). If $q \le q_0 < p-1/2$ or $0 < \alpha_0 < 2q/(2q_0+1)$, then $\alpha_1 < 1/(2q_0+1)$, since either $1/(2p) < 1/(2q_0+1)$ or $\alpha_0/(2q) < 1/(2q_0+1)$. This implies that $1/\alpha_1 - 2q_0 > 1$. Since $4q-2q_0 > 0$, we can choose a real number $t$ such that $1 < t < \min\{2,\,1/\alpha_1-2q_0,\,4q-2q_0+1\}$. Let $|\theta_i^0|^2 = i^{-2q_0-t}$ for $i=1,2,\ldots$, so $\theta_0 \in \ell_2(q_0)$. Observe that $\lambda i^{2q}+n^{-1}i^{2p} \ge (n^{-\alpha_1}i)^{2q}$ for $i \ge n^{\alpha_1}$, which leads to
\[
n^{1-\alpha_1}T_1
\ge n^{1-\alpha_1}\sum_{i\ge n^{\alpha_1}}\Big(\frac{(n^{-\alpha_1}i)^{2q}}{1+(n^{-\alpha_1}i)^{2q}}\Big)^2 i^{-2q_0-t}
= n^{1-\alpha_1(2q_0+t)}\cdot n^{-\alpha_1}\sum_{i\ge n^{\alpha_1}}\frac{(n^{-\alpha_1}i)^{4q-2q_0-t}}{(1+(n^{-\alpha_1}i)^{2q})^2}
\asymp n^{1-\alpha_1(2q_0+t)}\int_1^\infty \frac{x^{4q-2q_0-t}}{(1+x^{2q})^2}\,dx.
\]
Note the integral on the right of the last equation is finite since $4q-2q_0-t > -1$ and $2q_0+t > 1$. On the other hand,
\[
E\{T_2^2\}
= 4n^{-1}\sum_{i=1}^\infty \frac{(\lambda i^{2q}+n^{-1}i^{2p})^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^4}i^{-2q_0-t}
\le 4n^{-1}\sum_{i=1}^\infty \frac{((n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p})^2}{(1+(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p})^4}i^{-2q_0-t}
= 4n^{-1-\alpha_1(2q_0+t)+\alpha_1}\cdot n^{-\alpha_1}\sum_{i=1}^\infty \frac{(n^{-\alpha_1}i)^{4q-2q_0-t}(1+(n^{-\alpha_1}i)^{2(p-q)})^2}{(1+(n^{-\alpha_1}i)^{2q}+(n^{-\alpha_1}i)^{2p})^4}
\asymp 4n^{-1-\alpha_1(2q_0+t)+\alpha_1}\int_0^\infty \frac{x^{4q-2q_0-t}(1+x^{2(p-q)})^2}{(1+x^{2q}+x^{2p})^4}\,dx,
\]
where the integral in the last equation finitely exists because $4q-2q_0-t > -1$, $p>q$ and $2q_0+t > 1$. Therefore, $n^{1-\alpha_1}T_2 = O_P(n^{(1-\alpha_1-\alpha_1(2q_0+t))/2})$. Since $\alpha_1 < 1/(2q_0+1)$ and $1<t<2$, it can be seen that $\alpha_1(2q_0+t-1) < 1$. This implies that $(1-\alpha_1-\alpha_1(2q_0+t))/2 < 1-\alpha_1(2q_0+t)$, and hence, $n^{1-\alpha_1}T_2 = o_P(n^{1-\alpha_1}T_1)$. The above analysis of $T_1, T_2$ leads to
\[
n^{1-\alpha_1}\|\hat\theta_*-\theta_0\|_2^2 \ge n^{1-\alpha_1}T_1(1+o_P(1)).
\]

Since $\alpha_1(2q_0+t) < 1$, as $n\to\infty$, $n^{1-\alpha_1}T_1 \to \infty$. So, with probability approaching one, $\theta_0 \notin C_{*n}(\alpha)$, which completes the proof for $C_{*n}(\alpha)$. The proof for $C_{*n}^\dagger(\alpha)$ is similar, so we only briefly sketch it.

Proof of Part (2a). Similar to the proof of Part (1a), it can be shown by using $4q > 2q_0+1$ that
\[
n\sum_{i=1}^\infty \frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} = o(n^{1-\alpha_1(2q_0+1)}) = o(1)
\]
and
\[
\sqrt{n}\sum_{i=1}^\infty \frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} = o_P(1).
\]
And hence,
\[
n\|\theta_0-\hat\theta_*\|_\dagger^2 = \sum_{i=1}^\infty \frac{d_i\epsilon_i^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} + o_P(1) \stackrel{d}{\to} \sum_{i=1}^\infty d_i\epsilon_i^2,
\]
implying $P_{\theta_0}(\theta_0 \in C_{*n}^\dagger(\alpha)) \to 1-\alpha$.

Proof of Part (2b). If $q \le q_0 < p-1/2$ or $0 < \alpha_0 < 2q/(2q_0+1)$, then we have $\alpha_1(2q_0+1) < 1$; see the proof of Part (1b). For any $t>1$, let $|\theta_i^0|^2 = i^{-(2q_0+1)}(\log 2i)^{-t}$. It is easy to see that $\theta_0 \in \ell_2(q_0)$. Following the proof of Part (1b) and using $4q > 2q_0+1$, it can be shown that
\[
n\sum_{i=1}^\infty \frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} \gtrsim n^{1-\alpha_1(2q_0+1)}(\log n)^{-\tau-t},
\]
\[
\sqrt{n}\sum_{i=1}^\infty \frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2} = O_P(n^{(1-\alpha_1(2q_0+1))/2}).
\]
Therefore,
\[
n\|\theta_0-\hat\theta_*\|_\dagger^2
= n\sum_{i=1}^\infty \frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})^2|\theta_i^0|^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}
- 2\sqrt{n}\sum_{i=1}^\infty \frac{d_i(\lambda i^{2q}+n^{-1}i^{2p})\theta_i^0\epsilon_i}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}
+ \sum_{i=1}^\infty \frac{d_i\epsilon_i^2}{(1+\lambda i^{2q}+n^{-1}i^{2p})^2}
\gtrsim n^{1-\alpha_1(2q_0+1)}(\log n)^{-\tau-t}(1+o_P(1)) + O_P(1).
\]
The leading term approaches infinity, which implies that, with probability approaching one, $\theta_0 \notin C_{*n}^\dagger(\alpha)$.
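The asymptotic normality in Lemma S.2, which underlies the Theorem 2.3 coverage statements, can also be illustrated by simulation. The sketch below assumes the normalizer $D_{*n} = 2\sum_i n^{-\alpha_1}(1+\lambda i^{2q}+n^{-1}i^{2p})^{-2}$ (consistent with the variance computation in the proof of Lemma S.2; parameter values are arbitrary illustrations):

```python
import numpy as np

# Simulation sketch of Lemma S.2: the normalized weighted sum of centered
# chi-square variables Z_*n is approximately N(0,1). With the assumed
# D_*n = 2 * sum_i n^{-alpha_1} w_i^2, Var(Z_*n) = 1 exactly and E(Z_*n) = 0.
rng = np.random.default_rng(1)
p, q, alpha0, n, N = 2.0, 1.0, 0.8, 10**4, 2000
alpha1 = 1.0 / (2 * p)                       # case alpha0 > q/p
i = np.arange(1.0, N + 1)
w = 1.0 / (1 + n ** (-alpha0) * i ** (2 * q) + i ** (2 * p) / n)
Dn = 2 * np.sum(n ** (-alpha1) * w ** 2)
eta = rng.standard_normal((4000, N))         # 4000 replications of (eta_i)
Z = (n ** (-alpha1 / 2) * (eta ** 2 - 1) * w).sum(1) / np.sqrt(Dn)
print(Z.mean(), Z.var())                     # approximately 0 and 1
```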

S.3. Adaptive credible regions under priors P1 and P2. In this section, we construct a set of adaptive credible regions that do not rely on the unknown regularity $q_0$. The main idea is to replace $q_0$ by a proper estimator, namely $\hat q$ with $\hat q \ge 0$ a.s. This is different from the existing adaptive framework for proving contraction rates; see [2]. For simplicity, we choose $p = \hat q+1$ in Prior P1, although other choices satisfying $p > \hat q+1/2$ also apply in our procedure. The corresponding posterior mode $\hat\theta$ becomes $\hat\theta_{ni} = Y_i/(1+\hat\lambda i^{2p})$, where for simplicity we choose $\hat\lambda = n^{-2p/(2\hat q+1)}$. The $(1-\alpha)$ adaptive credible sets (parallel to (2.3) and (2.4)) are stated as
\[
(S.2)\quad C_n^{adapt}(\alpha) = \Big\{\theta=\{\theta_i\}_{i=1}^\infty \in \ell_2:\ \|\theta-\hat\theta\|_2 \le \sqrt{\frac{\hat E_n+\sqrt{\hat D_n\hat h}\,z_\alpha}{n\hat h}}\Big\},
\]
\[
(S.3)\quad C_n^{\dagger adapt}(\alpha) = \Big\{\theta=\{\theta_i\}_{i=1}^\infty \in \ell_2:\ \|\theta-\hat\theta\|_\dagger \le \sqrt{c_\alpha/n}\Big\},
\]
where $\hat h = n^{-1/(2\hat q+1)}$, and $\hat D_n, \hat E_n$ are defined similarly to (2.2) by replacing $h$, $\lambda$ and $q$ with their estimators.

Theorem S.1. Suppose that the $Y_i$'s are generated from model (2.1) with $\theta_0 \in \ell_2(q_0)$ for some unknown $q_0 \ge 0$, and the Sobolev rectangle property holds: $\sup_{i\ge 1}|\theta_i^0|^2 i^{2q_0+1} < \infty$. Let $\hat q$ be any nonnegative-valued estimator of $q_0$ satisfying $\hat q-q_0 = o_P(1/\log n)$, and set $p = \hat q+1$ in Prior P1. Then as $n\to\infty$, $P(\theta \in C_n^{adapt}(\alpha)\,|\,\{Y_i\}_{i=1}^\infty) \to 1-\alpha$ and $P(\theta \in C_n^{\dagger adapt}(\alpha)\,|\,\{Y_i\}_{i=1}^\infty) \to 1-\alpha$ in probability. Furthermore, with probability approaching one, $\theta_0 \in C_n^{adapt}(\alpha)$; with probability approaching $1-\alpha$, $\theta_0 \in C_n^{\dagger adapt}(\alpha)$.

Proof of Theorem S.1. Define $h_0 = n^{-1/(2q_0+1)}$, $p_0 = q_0+1$, $\lambda_0 = h_0^{2p_0}$. Let $D_{n0}, E_{n0}, Z_{n0}$ be defined as in (2.2) with $(h,\lambda,p)$ therein replaced by $(h_0,\lambda_0,p_0)$. To simplify the notation, in the rest of the proof we still use $h$ and $\lambda$ to denote $\hat h$ and $\hat\lambda$, respectively, where recall that $\hat h = n^{-1/(2\hat q+1)}$ and $\hat\lambda = n^{-2p/(2\hat q+1)}$ are the empirical versions of the tuning parameters. By (??) we have $n\sqrt{h}\|\theta-\hat\theta\|_2^2 = h^{-1/2}E_n + D_n^{1/2}Z_n$, where $D_n, E_n, Z_n$ are defined as in (2.2). It can be shown that

\[
D_n - D_{n0} = o_P(1),\quad E_n - E_{n0} = o_P(1),\quad Z_n - Z_{n0} = o_P(1).
\]
To see this, note that $p = \hat q+1 \to p_0$ in probability and $h\to 0$. Hence
\[
D_n \ge 2\sum_{i=1}^\infty \int_i^{i+1}\frac{h}{(1+(xh)^{2p})^2}\,dx = 2\int_h^\infty \frac{1}{(1+x^{2p})^2}\,dx \xrightarrow{\text{in probability}} 2\int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx,
\]
\[
D_n \le 2\sum_{i=1}^\infty \int_{i-1}^i\frac{h}{(1+(xh)^{2p})^2}\,dx = 2\int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx \xrightarrow{\text{in probability}} 2\int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx.
\]
Therefore, $D_n - D_{n0} = o_P(1)$ holds. The proof for $E_n - E_{n0} = o_P(1)$ is similar. The proof of $Z_n - Z_{n0} = o_P(1)$ is a bit nontrivial. Observe that
\[
E\{|Z_n-Z_{n0}|^2\} = E\{E\{|Z_n-Z_{n0}|^2\,|\,\hat q\}\}
= 2E\Big\{\sum_{i=1}^\infty\Big(\frac{\sqrt{h}}{1+\lambda i^{2p}}-\frac{\sqrt{h_0}}{1+\lambda_0 i^{2p_0}}\Big)^2\Big\}
= 2E\Big\{\sum_{i=1}^\infty \frac{h}{(1+\lambda i^{2p})^2}\Big\}
- 4E\Big\{\sum_{i=1}^\infty \frac{\sqrt{hh_0}}{(1+\lambda i^{2p})(1+\lambda_0 i^{2p_0})}\Big\}
+ 2\sum_{i=1}^\infty \frac{h_0}{(1+\lambda_0 i^{2p_0})^2}.
\]
It is easy to see that
\[
E\Big\{\int_h^\infty \frac{1}{(1+x^{2p})^2}\,dx\Big\}
\le E\Big\{\sum_{i=1}^\infty \frac{h}{(1+\lambda i^{2p})^2}\Big\}
\le E\Big\{\int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx\Big\}.
\]
Since $p\ge 1$, we have $\int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx \le 1+\int_1^\infty \frac{1}{(1+x^2)^2}\,dx$. Since $\int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx \to \int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx$ in probability, it follows by bounded convergence theorem that $E\{\int_0^\infty \frac{1}{(1+x^{2p})^2}\,dx\} \to \int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx$. Since $h\le 1$ and $h\to 0$ in probability, by bounded convergence theorem, $E\{\int_0^h \frac{1}{(1+x^{2p})^2}\,dx\} \le E\{h\} \to 0$. Therefore, $E\{\sum_{i=1}^\infty \frac{h}{(1+\lambda i^{2p})^2}\} \to \int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx$. We also note that
\[
E\Big\{\sum_{i=1}^\infty \frac{\sqrt{hh_0}}{(1+\lambda i^{2p})(1+\lambda_0 i^{2p_0})}\Big\}
\le E\Big\{\sqrt{\frac{h}{h_0}}\int_0^\infty \frac{1}{(1+x^{2p}(h/h_0)^{2p})(1+x^{2p_0})}\,dx\Big\},
\]
and
\[
E\Big\{\sum_{i=1}^\infty \frac{\sqrt{hh_0}}{(1+\lambda i^{2p})(1+\lambda_0 i^{2p_0})}\Big\}
\ge E\Big\{\sqrt{\frac{h}{h_0}}\int_{h_0}^\infty \frac{1}{(1+x^{2p}(h/h_0)^{2p})(1+x^{2p_0})}\,dx\Big\}.
\]
It can be easily seen that
\[
E\Big\{\sqrt{\frac{h}{h_0}}\int_0^{h_0}\frac{1}{(1+x^{2p}(h/h_0)^{2p})(1+x^{2p_0})}\,dx\Big\} \le E\{\sqrt{hh_0}\} \to 0.
\]
Since $\hat q-q_0 = o_P(1/\log n)$, it is easy to show that $h/h_0 \to 1$ in probability. Then it can be seen that $\sqrt{\frac{h}{h_0}}\int_0^\infty \frac{1}{(1+x^{2p}(h/h_0)^{2p})(1+x^{2p_0})}\,dx \to \int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx$ in probability, and
\[
\sqrt{\frac{h}{h_0}}\int_0^\infty \frac{1}{(1+x^{2p}(h/h_0)^{2p})(1+x^{2p_0})}\,dx
= \int_0^\infty \frac{\sqrt{hh_0}}{(1+\lambda x^{2p})(1+\lambda_0 x^{2p_0})}\,dx
\le \frac{1}{2}\int_0^\infty \frac{h}{(1+\lambda x^{2p})^2}\,dx + \frac{1}{2}\int_0^\infty \frac{h_0}{(1+\lambda_0 x^{2p_0})^2}\,dx
\le 1+\int_1^\infty \frac{1}{(1+x^2)^2}\,dx,\quad \text{a.s.}
\]
It follows by bounded convergence theorem that
\[
E\Big\{\sum_{i=1}^\infty \frac{\sqrt{hh_0}}{(1+\lambda i^{2p})(1+\lambda_0 i^{2p_0})}\Big\} \to \int_0^\infty \frac{1}{(1+x^{2p_0})^2}\,dx.
\]
This shows that $E\{|Z_n-Z_{n0}|^2\} \to 0$, and hence, $Z_n-Z_{n0} = o_P(1)$. Since $Z_{n0}\stackrel{d}{\to} N(0,1)$, we have $Z_n\stackrel{d}{\to} N(0,1)$. Therefore,
\[
P(\theta \in C_n^{adapt}(\alpha)\,|\,\{Y_i\}_{i=1}^\infty)
= P\big(n\sqrt{h}\|\theta-\hat\theta\|_2^2 \le h^{-1/2}E_n+\sqrt{D_n}\,z_\alpha\,\big|\,\{Y_i\}_{i=1}^\infty\big)
= P(Z_n\le z_\alpha) \to 1-\alpha.
\]
Next we show that $P(\theta_0 \in C_n^{adapt}(\alpha)) \to 1$. Define $\hat\theta^0 = \{\hat\theta_{ni}^0\}_{i=1}^\infty$ with $\hat\theta_{ni}^0 = Y_i/(1+\lambda_0 i^{2p_0})$. Following the proof of Theorem 2.1 (1), $nh_0\|\theta_0-\hat\theta^0\|_2^2 = F_{n0}+o_P(1)$, where $F_{n0} = \sum_{i=1}^\infty \frac{h_0}{(1+\lambda_0 i^{2p_0})^2}$. Since $h/h_0 = 1+o_P(1)$, we have $nh\|\theta_0-\hat\theta^0\|_2^2 = F_{n0}+o_P(1)$. On the other hand, for any $x>1$ the inequality $|x^\alpha-1| \le |\alpha|(\log x)x^{|\alpha|}$ holds for any $\alpha\in\mathbb{R}$.
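This elementary inequality (which follows from $|e^t-1| \le |t|e^{|t|}$ with $t=\alpha\log x$) can be checked numerically as a sanity test:

```python
import numpy as np

# Numerical check of the elementary inequality used in the proof:
# for x > 1 and any real a, |x^a - 1| <= |a| (log x) x^{|a|}.
rng = np.random.default_rng(2)
x = 1.0 + 10.0 * rng.random(10**5)             # x in (1, 11)
a = rng.uniform(0.01, 3.0, 10**5) * rng.choice([-1.0, 1.0], 10**5)
lhs = np.abs(x ** a - 1.0)
rhs = np.abs(a) * np.log(x) * x ** np.abs(a)
print(bool(np.all(lhs <= rhs)))                # True
```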


Then we have
\[
\|\hat\theta^0-\hat\theta\|_2^2
= \sum_{i=1}^\infty \Big(\frac{1}{1+\lambda_0 i^{2p_0}}-\frac{1}{1+\lambda i^{2p}}\Big)^2 Y_i^2
= \sum_{i=1}^\infty Y_i^2\lambda_0^2 i^{4p_0}\frac{(\frac{\lambda}{\lambda_0}i^{2(p-p_0)}-1)^2}{(1+\lambda_0 i^{2p_0})^2(1+\lambda i^{2p})^2}
= \sum_{i=1}^\infty Y_i^2\lambda_0^2 i^{4p_0}\frac{((n^{\frac{1}{(2\hat q+1)(2q_0+1)}}i)^{2(\hat q-q_0)}-1)^2}{(1+\lambda_0 i^{2p_0})^2(1+\lambda i^{2p})^2}
\]
\[
\le 4|\hat q-q_0|^2\sum_{i=1}^\infty Y_i^2\lambda_0^2 i^{4p_0}\frac{(\log(n^{\frac{1}{(2\hat q+1)(2q_0+1)}}i))^2(n^{\frac{1}{(2\hat q+1)(2q_0+1)}}i)^{4|\hat q-q_0|}}{(1+\lambda_0 i^{2p_0})^2(1+\lambda i^{2p})^2}
= 4|\hat q-q_0|^2 n^{\frac{8p|\hat q-q_0|}{(2\hat q+1)(2q_0+1)}}\sum_{i=1}^\infty Y_i^2\frac{(\frac{2p}{(2\hat q+1)(2q_0+1)}\log n+\log(ih_0))^2(ih_0)^{4p_0+4|\hat q-q_0|}}{(1+\lambda_0 i^{2p_0})^2(1+\lambda i^{2p})^2}
\]
\[
\le 8|\hat q-q_0|^2 n^{\frac{8p|\hat q-q_0|}{(2\hat q+1)(2q_0+1)}}\sum_{i=1}^\infty |\theta_i^0|^2\frac{(\frac{2p}{(2\hat q+1)(2q_0+1)}\log n+\log(ih_0))^2(ih_0)^{4p_0+4|\hat q-q_0|}}{(1+\lambda_0 i^{2p_0})^2(1+\lambda i^{2p})^2}
+ 8|\hat q-q_0|^2 n^{\frac{8p|\hat q-q_0|}{(2\hat q+1)(2q_0+1)}-1}\sum_{i=1}^\infty \epsilon_i^2\frac{(\frac{2p}{(2\hat q+1)(2q_0+1)}\log n+\log(ih_0))^2(ih_0)^{4p_0+4|\hat q-q_0|}}{(1+\lambda_0 i^{2p_0})^2(1+\lambda i^{2p})^2},
\]
where we used the elementary inequality above with $\alpha = 2(\hat q-q_0)$, the bound $Y_i^2 \le 2|\theta_i^0|^2+2\epsilon_i^2/n$, and the identity $\lambda_0^2 i^{4p_0} = (ih_0)^{4p_0}$. Denote the two terms on the right side of the last inequality by $T_1$ and $T_2$. Since $\hat q-q_0 = o_P(1/\log n)$, with probability approaching one, $|\hat q-q_0| \le 1/\log n$. In what follows we assume $|\hat q-q_0| \le 1/\log n$. To address $T_1, T_2$, we introduce an elementary inequality: for any $x>0$,
\[
(S.4)\quad \frac{x^{4p_0+4|\hat q-q_0|}}{(1+x^{2p})^2} \le 1+x^{8/\log n}.
\]
We briefly show (S.4). If $0<x\le 1$, then the left hand side of (S.4) is always less than 1, so the inequality trivially holds. If $x>1$, then
\[
\frac{x^{4p_0+4|\hat q-q_0|}}{(1+x^{2p})^2} \le x^{4|\hat q-q_0|+4(p_0-p)} \le x^{8|\hat q-q_0|} < 1+x^{8/\log n}.
\]
Using (S.4) and $n^{\frac{8p|\hat q-q_0|}{(2\hat q+1)(2q_0+1)}} \le \exp(\frac{8p}{(2\hat q+1)(2q_0+1)}) = O_P(1)$, we have
\[
T_2 = O_P(1)\cdot|\hat q-q_0|^2 n^{-1}\sum_{i=1}^\infty \epsilon_i^2\frac{(\log n+\log(ih_0))^2(1+(ih_0)^{8/\log n})}{(1+(ih_0)^{2p_0})^2}.
\]


Since
\[
E\Big\{\sum_{i=1}^\infty \epsilon_i^2\frac{(\log n+\log(ih_0))^2(1+(ih_0)^{8/\log n})}{(1+(ih_0)^{2p_0})^2}\Big\}
= \sum_{i=1}^\infty \frac{(\log n+\log(ih_0))^2(1+(ih_0)^{8/\log n})}{(1+(ih_0)^{2p_0})^2}
\le h_0^{-1}\int_0^\infty \frac{(\log n+\log x)^2(1+x^{8/\log n})}{(1+x^{2p_0})^2}\,dx
\lesssim h_0^{-1}(\log n)^2,
\]
we have $T_2 = O_P((nh_0)^{-1}(\log n)^2|\hat q-q_0|^2) = o_P((nh_0)^{-1})$. For $T_1$, using the condition $\sup_{i\ge 1}|\theta_i^0|^2 i^{2q_0+1} < \infty$ and (S.4), we have
\[
T_1 = O_P(1)\,|\hat q-q_0|^2 h_0^{2q_0+1}\sum_{i=1}^\infty \frac{(\log n+\log(ih_0))^2(ih_0)^{4p_0+4|\hat q-q_0|-2q_0-1}}{(1+(ih_0)^{2p_0})^2(1+(ih_0)^{2p})^2}
= O_P(1)\,|\hat q-q_0|^2 h_0^{2q_0+1}h_0^{-1}\int_0^\infty \frac{(\log n+\log x)^2 x^{4p_0+4|\hat q-q_0|-2q_0-1}}{(1+x^{2p_0})^2(1+x^{2p})^2}\,dx
= O_P((\log n)^2|\hat q-q_0|^2 h_0^{2q_0}) = o_P((nh_0)^{-1}).
\]
Consequently, $\|\hat\theta^0-\hat\theta\|_2^2 = T_1+T_2 = o_P((nh_0)^{-1}) = o_P((nh)^{-1})$. So we have
\[
nh\|\theta_0-\hat\theta\|_2^2 = nh\|\theta_0-\hat\theta^0\|_2^2 + o_P(1) = F_{n0}+o_P(1).
\]
Since $E_n-E_{n0} = o_P(1)$ and $\liminf_{n\to\infty}(E_{n0}-F_{n0}) > 0$, with probability approaching one, $\theta_0 \in C_n^{adapt}(\alpha)$.

Finally, we prove the results for $C_n^{\dagger adapt}(\alpha)$. Since
\[
n\|\theta-\hat\theta\|_\dagger^2 = \sum_{i=1}^\infty \frac{d_i\eta_i^2}{1+\hat\lambda i^{2p}} \stackrel{d}{\to} \sum_{i=1}^\infty d_i\eta_i^2,
\]
it is easy to see that $C_n^{\dagger adapt}(\alpha)$ possesses asymptotically $1-\alpha$ posterior coverage. Let $\hat\theta^0$ be defined the same as above. It follows by Theorem 2.2 that $P(n\|\theta_0-\hat\theta^0\|_\dagger^2 < c_\alpha) \to 1-\alpha$. Following arguments similar to the above, it can be shown that
\[
n\|\hat\theta^0-\hat\theta\|_\dagger^2 = O_P(\hat q-q_0) = o_P(1).
\]
Hence, $P(n\|\theta_0-\hat\theta\|_\dagger^2 < c_\alpha) \to 1-\alpha$. This completes the proof of Theorem S.1.


A crucial assumption in Theorem S.1 is the existence of $\hat q$ satisfying $\hat q-q_0 = o_P(1/\log n)$. Given this rate, the contraction rate of $C_n^{adapt}(\alpha)$ is of the order $n^{-\hat q/(2\hat q+1)}$, which is asymptotically equivalent to the theoretically optimal one $n^{-q_0/(2q_0+1)}$ under any regularity $q_0 > 1/2$. We next discuss the construction of a desirable $\hat q$ by starting from a simple example. If it is known that the true parameter satisfies $|\theta_i^0|^2 = (i\log i)^{-(2q_0+1)}$ for $i=2,3,\ldots$, then we assign a prior on $\theta_i$ as $\theta_i \sim N(0,(i\log i)^{-(2q_0+1)})$, $i=2,3,\ldots$, such that the marginal distribution of $Y_i$ is $N(0, n^{-1}+(i\log i)^{-(2q_0+1)})$. Let $\phi_0 = 2q_0+1$. Based on $\{Y_i\}_{i=2}^n$, the log-likelihood function of $\phi \equiv 2q+1$ is
\[
(S.5)\quad l_n(\phi) = -\frac{1}{2}\sum_{i=2}^n\Big[\log(n^{-1}+(i\log i)^{-\phi}) + \frac{Y_i^2}{n^{-1}+(i\log i)^{-\phi}} - 1\Big].
\]
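The likelihood (S.5) is easy to evaluate and maximize numerically. The following sketch (the sample size, grid, and seed are arbitrary illustrative choices) simulates the marginal model, evaluates (S.5) on a grid, and forms the plug-in estimator $\hat q = (\hat\phi-1)_+/2$:

```python
import numpy as np

# Illustration of the estimator behind Proposition S.1: simulate
# Y_i ~ N(0, 1/n + (i log i)^{-phi0}) and locate a maximizer of the
# log-likelihood (S.5) over a grid of phi values.
rng = np.random.default_rng(3)
n, q0 = 5000, 1.0
phi0 = 2 * q0 + 1                            # true phi0 = 3
i = np.arange(2.0, n + 1)
var0 = 1.0 / n + (i * np.log(i)) ** (-phi0)  # marginal variance of Y_i
Y = np.sqrt(var0) * rng.standard_normal(i.size)

def ln(phi):
    v = 1.0 / n + (i * np.log(i)) ** (-phi)
    return -0.5 * np.sum(np.log(v) + Y ** 2 / v - 1.0)  # (S.5)

grid = np.linspace(1.5, 5.0, 701)            # grid containing phi0 = 3
phi_hat = grid[np.argmax([ln(phi) for phi in grid])]
q_hat = max(phi_hat - 1.0, 0.0) / 2.0        # q_hat = (phi_hat - 1)_+ / 2
print(phi_hat, q_hat)
```

The grid maximizer satisfies $l_n(\hat\phi) \ge l_n(\phi_0)$ by construction; the (slow, logarithmic-factor) rate of Proposition S.1 means moderate sample sizes only give a rough estimate of $\phi_0$.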

Proposition S.1. (Existence of $\hat q$ with a desirable rate of convergence) There exists a sequence of local maxima of the log-likelihood function $l_n(\phi)$, denoted as $\hat\phi$, satisfying $\hat\phi-\phi_0 = O_P(n^{-\frac{1}{2\phi_0}}(\log n)^{2\phi_0})$. If we let $\hat q = (\hat\phi-1)_+/2$, where $(a)_+$ is the positive part of a real number $a$, then $\hat q$ satisfies
\[
\hat q-q_0 = O_P(n^{-\frac{1}{2\phi_0}}(\log n)^{2\phi_0}) = o_P(1/\log n).
\]

Remark S.1. It is possible to extend Proposition S.1 to more general situations. For example, given that $|\theta_i^0|^2 = b_0 i^{-(2q_0+1)}(\log i)^{-a_0}$ for $i=2,3,\ldots$ and unknown $b_0>0$, $q_0>0$, $a_0>1$, one may assign priors $\theta_i \sim N(0, b_0 i^{-(2q_0+1)}(\log i)^{-a_0})$, $i=2,3,\ldots$, which leads to the log-likelihood function of $(b,\phi,a)$:
\[
l_n(b,\phi,a) = -\frac{1}{2}\sum_{i=2}^n\Big[\log(n^{-1}+bi^{-\phi}(\log i)^{-a}) + \frac{Y_i^2}{n^{-1}+bi^{-\phi}(\log i)^{-a}} - 1\Big].
\]
It can be shown by a multivariate extension of the proof of Proposition S.1 that there exists a sequence of local maxima of $l_n(b,\phi,a)$, namely $(\hat b,\hat\phi,\hat a)$, such that the marginal entry $\hat\phi$ satisfies $\hat\phi-\phi_0 = o_P(1/\log n)$. Hence, by defining $\hat q = (\hat\phi-1)_+/2$ we have $\hat q-q_0 = o_P(1/\log n)$. The rate for $(\hat b,\hat a)$ can also be derived but is of less interest.

Proof of Proposition S.1. Let $\phi_n = n^{-\frac{1}{2\phi_0}}(\log n)^{2\phi_0}$. We will show that with probability approaching one,
\[
l_n(\phi_0 \pm \phi_n) < l_n(\phi_0),
\]

(log n)2φ0 . We will show

16

Z. SHANG AND G. CHENG

which implies that there exists a local maxima φb of ln (φ) s.t. φ0 − φn < φb < φ0 + φn . We only show ln (φ0 + φn ) < ln (φ0 ) with probability approaching one. The proof for ln (φ0 − φn ) < ln (φ0 ) is exactly the same. It is easy to show that the first-, second- and third-order derivatives of ln (φ) are " # n 2 (log c )cφ X Y 1 log c i i i l˙n (φ) = − i , −1 cφ −1 cφ )2 2 1 + n (1 + n i i i=2 # " n 2 (log c )2 cφ (1 − n−1 cφ ) 2 n−1 cφ X Y (log c ) 1 i i i i i ¨ln (φ) = − + i , −1 cφ )2 −1 cφ )3 2 (1 + n (1 + n i i i=2 # " n ... ) 1 X (log ci )3 n−1 cφi (1 − n−1 cφi ) (log ci )3 Yi2 cφi (1 − 4n−1 cφi + n−2 c2φ i , + l n (φ) = − φ 3 φ 4 −1 −1 2 (1 + n ci ) (1 + n ci ) i=2

where ci = i log i. By Taylor’s expansion, 1 ... 1 ln (φ0 + φn ) − ln (φ0 ) = l˙n (φ0 )φn + ¨ln (φ0 )φ2n + l n (φ∗ )φ3n , 2 6 for a random φ∗ ∈ [φ0 , φ0 + φn ]. 0 0 Noting that Yi ∼ N (θi0 , 1/n), we have E{|Yi2 − c−φ − 1/n|2 } = 4c−φ /n + i i 2 2/n . By direct calculations it can be shown that E{|l˙n (φ0 )|2 } n −φ0 0 2 2 − 1/n|2 } 1 X c2φ i (log ci ) E{|Yi − ci = 4 (1 + n−1 cφi 0 )4 i=2

=

n 1 X c2φ0 (log ci )2 (4c−φ0 /n + 2/n2 ) i

i −1 (1 + n cφi 0 )4 i=2 n n X n−1 cφi 0 (log ci )2 X . −1 φ0 3 (1 i=2 (1 + n ci ) i=2

4

1 (log i)2 φ0 (log n)2 n + n−1 iφ0 )2

Z 0

∞

1 dx, (1 + xφ0 )2

n n 1X (log ci )2 1X (log i)2 . −1 φ0 2 2 2 (1 + n−1 iφ0 )2 i=2 (1 + n ci ) i=2 Z ∞ 1 1 n φ0 (log n)2 dx, (1 + xφ0 )2 0

−E{¨ln (φ0 )} =

n

1 1 X (log i)2(1−φ0 ) 2(1−φ0 ) φ0 −E{¨ln (φ0 )} & & (log n) n 2 (1 + n−1 iφ0 )2

i=2

Z 0

∞

1 dx, (1 + xφ0 )2

17

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

V ar(¨ln (φ0 )) n −φ0 0 2 − 1/n|2 } 1 X (log ci )4 (1 − n−1 cφi 0 )2 c2φ i E{|Yi − ci = 4 (1 + n−1 cφi 0 )6 1 4

=

i=2 n X i=2

n X

.

−φ0 0 (log ci )4 (1 − n−1 cφi 0 )2 c2φ /n + 2/n2 ) i (4ci

i=2

(log ci )4 (1 +

n−1 cφi 0 )2

.

(1 + n−1 cφi 0 )6 n X (log i)4 i=2

(1 +

n−1 iφ0 )2

4

(log n) n

1 φ0

Z

∞

0

1 dx. (1 + xφ0 )2

Using Yi2 ≤ 2(ci−φ0 +2i /n) and the elementary fact max1≤i≤n 2i = OP (log n) we have uniformly for any φ∗ ∈ [φ0 , φ0 + φn ], cφi ∗ −φ0 ≤ cφnn = 1 + o(1) and ... | l n (φ∗ )| n (log cn )3 X n−1 cφi ∗ |1 − n−1 cφi ∗ | ≤ 2 (1 + n−1 cφi ∗ )3 i=2 3

+OP ((log n)(log cn ) )

n ∗ X (cφi ∗ −φ0 + n−1 cφi ∗ )|1 − 4n−1 cφi ∗ + n−2 c2φ i |

(1 + n−1 cφi ∗ )4

i=2 1

1

= OP ((log n)4 n φ∗ ) = OP ((log n)4 n φ0 ), 1

1

where in the last equation we have used the fact n φ∗ n φ0 uniformly for any φ∗ ∈ [φ0 , φ0 + φn ]. Therefore, ln (φ0 + φn ) − ln (φ0 ) 1

1

1

= E{¨ln (φ0 )}φ2n + OP (n 2φ0 (log n)φn + n 2φ0 (log n)2 φ2n + n φ0 (log n)4 φ3n ) 1

1

1

1

. −n φ0 (log n)2(1−φ0 ) φ2n + OP (n 2φ0 (log n)φn + n 2φ0 (log n)2 φ2n + n φ0 (log n)4 φ3n ). It is easy to see that with φn = n 1

1

− 2φ1

0

(log n)2φ0 , 1

1

n 2φ0 (log n)φn + n 2φ0 (log n)2 φ2n + n φ0 (log n)4 φ3n n φ0 (log n)2(1−φ0 ) φ2n , therefore, with probability approaching one, ln (φ0 + φn ) < ln (φ0 ). It is easy to see that qb = (φb − 1)+ /2 satisfies the desirable rate of convergence. This completes the proof of the proposition. To the end of this section, we propose two adaptive credible regions under P2. Similarly, assume that qb is a proper estimator of q0 and let p = qb + 1/2,

18

Z. SHANG AND G. CHENG

q = (2b q + 1)/4 and α0 = 2q/(2b q + 1). In this case, α0 = 1/2 and α1 = b = n−α0 = n−1/2 . Consider the following regions: 1/(2b q + 1). Hence λ (S.6) v q u u b∗n + D b ∗n n−α1 zα t E adapt b C∗n (α) = θ = {θi }∞ , i=1 ∈ `2 : kθ − θ∗ k2 ≤ n1−α1

(S.7)

n o p †adapt b C∗n (α) = θ = {θi }∞ ∈ ` : kθ − θ k ≤ c /n , 2 ∗ † α i=1

b 2q + n−1 i2p ) for i = 1, 2, . . ., and D b ∗n , E b∗n are dewhere θb∗ni = Yi /(1 + λi fined similar as (2.5) with α1 , p, q, λ therein replaced by the above empirical b∗n and D b ∗n . counterparts. Again, Lemma S.1 gives the limiting values of E adapt −b q /(2b q +1) The contraction rate of C∗n (α) is of the order n . Based on similar proof in Theorem S.1, we study the Bayesian and frequentist properties of †adapt adapt (α) as follows. (α) and C∗n C∗n theorem S.2. Suppose that Yi ’s are generated from model (2.1) with θ0 ∈ `2 (q0 ), supi≥1 |θi0 |2 i2q0 +1 < ∞, and that qb − q0 = oP (1/ log n). Then as adapt †adapt n → ∞, P (θ ∈ C∗n (α)|{Yi }∞ (α)|{Yi }∞ i=1 ) → 1−α and P (θ ∈ C∗n i=1 ) → 1 − α in probability. Furthermore, with probability approaching one, θ0 ∈ †adapt adapt (α). (α); with probability approaching 1 − α, θ0 ∈ C∗n C∗n S.4. Proofs in Section 3. m (I), by Assumption A2, f adProof of Lemma 3.2. For any f ∈ HP mits theP unique series representation f = ∞ ν=1 fν ϕν , where fν = V (f, ϕν ) satisfies ν fν2 ρν < ∞. Therefore, T : f 7→ {fP ν : ν ≥ 1} defines a one-to-one 2 m ∞ ∞ map from H (I) to Rm ≡ {{fν }ν=1 ∈ R : ∞ ν=1 fν ρν < ∞}. ˜ ˜ Let Πλ and Π be the probability measures induced by {wν : ν > m} and {vν : ν > m}, respectively, which are both defined on R∞ . That is, for ˜ λ (S) = P ({wν : ν > m} ∈ S) and Π(S) ˜ any subset S ∈ R∞ , Π = P ({vν : 0 0 ν > m} ∈ S). Likewise, let Πλ and Π be probability measures induced by {wν : ν ≥ 1} and {vν : ν ≥ 1}. It is easy to see that, for any measurable B ⊆ Rm ,

Πλ (T −1 B) = P (Gλ ∈ T −1 B) = P ({wν : ν ≥ 1} ∈ B) = Π0λ (B), and Π(T −1 B) = P (G ∈ T −1 B) = P ({vν : ν ≥ 1} ∈ B) = Π0 (B). The following result can be found in H´ajek [22].

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

˜ λ w.r.t. Π ˜ is The Radon-Nikodym derivative of Π

proposition S.2.

˜λ dΠ ({fν : ν > m}) = ˜ dΠ =

∞ Y ν>m ∞ Y

1 + nλρ−β/(2m) ν

1/2

1 + nλρ−β/(2m) ν

1/2

ν>m

Note that in Proposition S.2, since

−β/(2m) ρν

19

Q∞

ν>m

exp(−

nλ 2 f ρν ) 2 ν

∞ nλ X 2 · exp − f ρν 2 ν>m ν

−β/(2m) 1/2

1 + nλρν

! .

is convergent

ν −β . Therefore, by Proposition S.2, we have

dΠ0λ ({fν : ν ≥ 1}) ˜ λ ({fν : ν > m}) = dπ1 (f1 ) · · · dπm (fm )dΠ ˜λ dΠ ˜ = dπ1 (f1 ) · · · dπm (fm ) ({fν : ν > m}) · dΠ({f ν : ν > m}) ˜ dΠ ∞ 1/2 Y 1 + nλρ−β/(2m) = dπ1 (f1 ) · · · dπm (fm ) ν ν=m+1

nλ · exp − 2 =

∞ Y ν=m+1

∞ X

! fν2 ρν

˜ dΠ({f ν : ν > m})

ν=m+1

1 + nλρν−β/(2m)

1/2

∞ nλ X 2 fν ρν · exp − 2 ν=m+1

! dΠ0 ({fν : ν ≥ 1}).

20

Z. SHANG AND G. CHENG

Then for any measurable S ⊆ H m (I), by change of variable, we have Πλ (S) = Π0λ (T S) Z dΠ0λ ({fν : ν ≥ 1}) = =

TS ∞ Y

1 + nλρ−β/(2m) ν

1/2

ν=m+1 ∞ nλ X 2 exp − · fν ρν 2 TS

Z

! dΠ0 ({fν : ν ≥ 1})

ν=m+1

=

∞ Y

1 + nλρ−β/(2m) ν

1/2

ν=m+1

nλ −1 −1 exp − J(T ({fν : ν ≥ 1}), T ({fν : ν ≥ 1})) · 2 TS Z

·d(Π ◦ T −1 )({fν : ν ≥ 1}) ∞ 1/2 Z Y nλ −β/(2m) 1 + nλρν exp(− J(f, f ))dΠ(f ). = 2 S ν=m+1

This completes the proof of the lemma. S.5. Proofs in Section 4. Before proving Proposition 4.1, we present a preliminary lemma. Let {ϕ˜ν : ν ≥ 1} be a bounded orthonormal basis of L2 (I) under usual 2 L inner product. For any b ∈ [0, β], define ˜b = { H

∞ X

fν ϕ˜ν :

ν=1

∞ X

fν2 ρ1+b/(2m) < ∞}. ν

ν=1

Then $\tilde H_b$ can be viewed as a version of the Sobolev space with regularity $m+b/2$. Define $\tilde G=\sum_{\nu=1}^{\infty}v_\nu\tilde\varphi_\nu$, a centered Gaussian process, and $\tilde f_0=\sum_{\nu=1}^{\infty}f_\nu^0\tilde\varphi_\nu$. Define $\tilde V(f,g)=\langle f,g\rangle_{L^2}=\int_0^1 f(x)g(x)\,dx$, the usual $L^2$ inner product, and $\tilde J(f)=\sum_{\nu=1}^{\infty}|\tilde V(f,\tilde\varphi_\nu)|^2\rho_\nu$, a functional on $\tilde H_0$. For simplicity, we denote $\tilde V(f)=\tilde V(f,f)$. Clearly, $\tilde f_0\in\tilde H_\beta$. Since $\tilde G$ is a Gaussian process with covariance function
$$\tilde R(s,t)=E\{\tilde G(s)\tilde G(t)\}=\sum_{\nu=1}^{m}\sigma_\nu^2\tilde\varphi_\nu(s)\tilde\varphi_\nu(t)+\sum_{\nu>m}\rho_\nu^{-(1+\frac{\beta}{2m})}\tilde\varphi_\nu(s)\tilde\varphi_\nu(t),$$
it follows by [44] that $\tilde H_\beta$ is the RKHS of $\tilde G$. For any $\tilde H_b$ with $0\le b\le\beta$, define the inner product
$$\Big\langle\sum_{\nu=1}^{\infty}f_\nu\tilde\varphi_\nu,\sum_{\nu=1}^{\infty}g_\nu\tilde\varphi_\nu\Big\rangle_b=\sum_{\nu=1}^{m}\sigma_\nu^{-2}f_\nu g_\nu+\sum_{\nu>m}f_\nu g_\nu\rho_\nu^{1+\frac{b}{2m}}.$$

Let $\|\cdot\|_b$ be the norm corresponding to the above inner product. The following lemma plays a role similar to quantifying the concentration function defined in [45]. The calculation of the concentration function in [44] depends on convolutions (see their proof of Theorem 4.1), while the proof of Lemma S.3 depends on a series representation.

Lemma S.3. Let $d_n$ be any positive sequence. If Condition (S) holds, then there exists $\omega\in\tilde H_\beta$ such that

(i). $\tilde V(\omega-\tilde f_0)+\lambda\tilde J(\omega-\tilde f_0)\le\frac{1}{16}\big(d_n^2+\lambda d_n^{\frac{2(\beta-1)}{2m+\beta-1}}\big)$, and

(ii). $\|\omega\|_\beta^2=O\big(d_n^{-\frac{2}{2m+\beta-1}}\big)$.

Proof of Lemma S.3. Let $\omega=\sum_{\nu=1}^{\infty}\omega_\nu\tilde\varphi_\nu$, where $\omega_\nu=\frac{df_\nu^0}{d+(\sigma b_\nu)^\alpha}$, $\sigma=d_n^{2/(2m+\beta-1)}$, $b_\nu=\rho_\nu^{1/(2m)}$, $\alpha=m+(\beta-1)/2$, and $d>0$ is a constant to be described. It is easy to see that for any $\nu$, $f_\nu^0-\omega_\nu=\frac{(\sigma b_\nu)^\alpha f_\nu^0}{d+(\sigma b_\nu)^\alpha}$. Then
$$\tilde V(\omega-\tilde f_0)=\sum_{\nu=1}^{\infty}(f_\nu^0-\omega_\nu)^2=\sum_{\nu=1}^{\infty}\frac{|f_\nu^0|^2(\sigma b_\nu)^{2\alpha}}{(d+(\sigma b_\nu)^\alpha)^2}\le\sigma^{2m+\beta-1}d^{-2}\sum_{\nu=1}^{\infty}|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}},$$
and
$$\begin{aligned}
\tilde J(\omega-\tilde f_0)&=\sum_{\nu=1}^{\infty}(f_\nu^0-\omega_\nu)^2\rho_\nu=\sigma^{\beta-1}\sum_{\nu=1}^{\infty}|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}}\big(d(\sigma b_\nu)^{-m}+(\sigma b_\nu)^{(\beta-1)/2}\big)^{-2}\\
&\le d^{-\frac{\beta-1}{k}}\Big(\Big(\frac{2m}{\beta-1}\Big)^{-\frac{m}{k}}+\Big(\frac{2m}{\beta-1}\Big)^{\frac{\beta-1}{2k}}\Big)^{-2}\sigma^{\beta-1}\sum_{\nu=1}^{\infty}|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}},
\end{aligned}$$
where in the last inequality $k=m+(\beta-1)/2$. Therefore, we choose $d$ as a suitably large fixed constant such that (i) holds.
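The closed-form lower bound in the last step comes from minimizing $g(x)=dx^{-m}+x^{(\beta-1)/2}$ over $x>0$ at the stationary point $x_*=(2dm/(\beta-1))^{1/k}$. This can be checked numerically; the following sketch is illustrative only, with arbitrarily chosen values of $m$, $\beta$, $d$:

```python
# Grid check (illustrative) that min_{x>0} [d*x^(-m) + x^((beta-1)/2)] matches the
# closed form from the stationary point x* = (2dm/(beta-1))^(1/k), k = m + (beta-1)/2.
m, beta, d = 2.0, 1.5, 3.0
k = m + (beta - 1) / 2
grid = [1e-3 * 1.01 ** i for i in range(3000)]  # log-spaced grid covering the minimizer
grid_min = min(d * x ** (-m) + x ** ((beta - 1) / 2) for x in grid)
r = 2 * m / (beta - 1)
closed_form = d ** ((beta - 1) / (2 * k)) * (r ** (-m / k) + r ** ((beta - 1) / (2 * k)))
print(abs(grid_min - closed_form))  # agreement up to grid resolution
```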

To show (ii), observe that
$$\|\omega\|_\beta^2=\sum_{\nu=1}^{m}\sigma_\nu^{-2}\omega_\nu^2+\sum_{\nu>m}\omega_\nu^2\rho_\nu^{1+\frac{\beta}{2m}}=\sum_{\nu=1}^{m}\sigma_\nu^{-2}|f_\nu^0|^2\frac{d^2}{(d+(\sigma b_\nu)^\alpha)^2}+\sum_{\nu>m}\frac{d^2|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}}b_\nu}{(d+(\sigma b_\nu)^\alpha)^2}=O(\sigma^{-1}).$$
The result follows by $\sigma=d_n^{2/(2m+\beta-1)}$.

Proof of Proposition 4.1. The proof is rooted in Theorem 2.1 of [21] with nontrivial modifications. We write $P_{f_0}=P_{f_0}^n$ for simplicity. Let $\gamma_0^2=\sum_\nu|f_\nu^0|^2\rho_\nu$ and $M_n=(1+\gamma_0)^2\delta_n^2\lambda^{-1}$. We can rewrite the posterior density as
$$p(f|D_n)=\frac{\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp(-\frac{n\lambda}{2}J(f))\,d\Pi(f)}{\int_{H^m(I)}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp(-\frac{n\lambda}{2}J(f))\,d\Pi(f)},$$
where recall that $p_f(Z)$ is the probability density of $Z=(Y,X)$ under $f$. First of all, we give a lower bound for
$$I_1=\int_{H^m(I)}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp\Big(-\frac{n\lambda}{2}J(f)\Big)\,d\Pi(f).$$
Define $B_n=\{f\in H^m(I):J(f)\le M_n,\ V(f-f_0)\le\delta_n^2\}$. Then
$$I_1\ge\int_{B_n}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp\Big(-\frac{n\lambda}{2}J(f)\Big)\,d\Pi(f)\ge\int_{B_n}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\,d\Pi(f)\,\exp(-n\lambda M_n/2)=\int_{B_n}\exp\Big(\sum_{i=1}^nR_i(f,f_0)\Big)\,d\Pi(f)\,\exp(-n\lambda M_n/2),$$

where $R_i(f,f_0)=\ell(Y_i;f(X_i))-\ell(Y_i;f_0(X_i))$ for $i=1,\dots,n$. Define $d\Pi^*(f)=d\Pi(f)/\Pi(B_n)$, a reduced probability measure on $B_n$. By Jensen's inequality,
$$\log\int_{B_n}\exp\Big(\sum_{i=1}^nR_i(f,f_0)\Big)\,d\Pi^*(f)\ge\int_{B_n}\sum_{i=1}^nR_i(f,f_0)\,d\Pi^*(f)=\int_{B_n}\sum_{i=1}^n[R_i(f,f_0)-E_{f_0}\{R_i(f,f_0)\}]\,d\Pi^*(f)+n\int_{B_n}E_{f_0}\{R_i(f,f_0)\}\,d\Pi^*(f)\equiv J_1+J_2.$$

By the Cauchy-Schwarz inequality and direct examination, we have
$$E_{f_0}\{J_1^2\}=E_{f_0}\Big\{\Big|\int_{B_n}\sum_{i=1}^n[R_i(f,f_0)-E_{f_0}\{R_i(f,f_0)\}]\,d\Pi^*(f)\Big|^2\Big\}\le n\int_{B_n}E_{f_0}\{R_i(f,f_0)^2\}\,d\Pi^*(f).$$

For any $f\in B_n$, by Taylor expansion, there exists a constant $\tilde C>0$ such that
$$\begin{aligned}
E_{f_0}\{R_i(f,f_0)^2\}&=E_{f_0}\{(\ell(Y_i;f(X_i))-\ell(Y_i;f_0(X_i)))^2\}=E_{f_0}\{(\dot\ell_a(Y;f_0(X))(f(X)-f_0(X))+\ddot\ell_a(Y;f^*(X))(f(X)-f_0(X))^2)^2\}\\
&\le2E_{f_0}\{\epsilon^2(f(X)-f_0(X))^2\}+2E_{f_0}\{E_{f_0}\{\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|^2\,|\,X\}|f(X)-f_0(X)|^4\}\le\tilde C\delta_n^2,
\end{aligned}$$
where $f^*(X)$ is between $f(X)$ and $f_0(X)$, and the last step is because $E_{f_0}\{\epsilon^2|X\}=B(X)$ (Assumption A1(c)) and $E_{f_0}\{\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|^2|X\}\le2C_0^2C_1$, a.s. (Assumption A1(a)). On the other hand, for any $f\in B_n$, by Taylor expansion and Assumption A1(a),
$$|E_{f_0}\{R_i(f,f_0)\}|\le|E_{f_0}\{\ddot\ell_a(Y;f^*(X))(f(X)-f_0(X))^2\}|\le E_{f_0}\{E_{f_0}\{\sup_{a\in\mathbb R}|\ddot\ell_a(Y;a)|\,|\,X\}(f(X)-f_0(X))^2\}\le C_0C_1\delta_n^2,$$
where $f^*(X)$ is between $f(X)$ and $f_0(X)$. Therefore, $J_2\ge-C_0C_1n\delta_n^2$, and by Chebyshev's inequality, $P_{f_0}(J_1<-n\delta_n^2)\le\tilde Cn\delta_n^2/(n\delta_n^2)^2=\tilde C/(n\delta_n^2)\to0$. So, using $\sqrt{n\delta_n^2}\le n\delta_n^2$, we get that with probability approaching one,
$$I_1\ge\exp(-(1+C_0C_1)n\delta_n^2-n\lambda M_n/2)\,\Pi(B_n).$$
To proceed we provide a lower bound for $\Pi(B_n)$. Let $\omega\in\tilde H_\beta$ satisfy (i)-(ii) of Lemma S.3 with $d_n$ therein replaced by $\delta_n$. It follows immediately from $\delta_n=o(1)$ and $\lambda\le\delta_n^2$ that $\tilde V(\omega-\tilde f_0)+\lambda\tilde J(\omega-\tilde f_0)\le\delta_n^2/4$. By the Gaussian correlation inequality (see Theorem 1.1 of [30]), the Cameron-Martin theorem (see [7]), and the fact $\lambda^{-1}\delta_n^2>1$, we have
$$\begin{aligned}
\Pi(B_n)&=P(V(G-f_0)\le\delta_n^2,\ J(G)\le M_n)=P(\tilde V(\tilde G-\tilde f_0)\le\delta_n^2,\ \tilde J(\tilde G)\le(1+\gamma_0)^2\delta_n^2\lambda^{-1})\\
&\ge P(\tilde V(\tilde G-\tilde f_0)\le\delta_n^2,\ \lambda\tilde J(\tilde G-\tilde f_0)\le\delta_n^2)\\
&\ge P(\tilde V(\tilde G-\omega)\le\delta_n^2/4,\ \lambda\tilde J(\tilde G-\omega)\le\delta_n^2/4)\\
&\ge\exp\Big(-\frac12\|\omega\|_\beta^2\Big)P(\tilde V(\tilde G)\le\delta_n^2/4,\ \tilde J(\tilde G)\le1/4)\\
&\ge\exp\Big(-\frac12\|\omega\|_\beta^2\Big)P(\tilde V(\tilde G)\le\delta_n^2/8)P(\tilde J(\tilde G)\le1/8)\\
&\ge P(\tilde J(\tilde G)\le1/8)\exp(-c_1\delta_n^{-2/(2m+\beta-1)}),
\end{aligned}$$
where the third-to-last inequality follows by (4.18) of [28], and the last step follows by Lemma S.3 (ii) and Example 4.5 of [23], in which $c_1>0$ is a universal constant. Note in the last step $c_2:=P(\tilde J(\tilde G)\le1/8)$ is a universal positive constant. Since $\beta>1$ and $\delta_n^2\ge(nh)^{-1}+\lambda\ge n^{-2m/(2m+1)}$, we get
$$n\delta_n^{\frac{2(2m+\beta)}{2m+\beta-1}}\ge n^{1-\frac{2m(2m+\beta)}{(2m+1)(2m+\beta-1)}}>1,\quad\text{so}\quad n\delta_n^2>\delta_n^{-\frac{2}{2m+\beta-1}}.$$
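The exponent comparison behind the last display, that $1-\frac{2m(2m+\beta)}{(2m+1)(2m+\beta-1)}>0$ exactly when $\beta>1$, reduces to the identity $(2m+1)(2m+\beta-1)-2m(2m+\beta)=\beta-1$, which can be spot-checked numerically (illustrative values only):

```python
# Spot check (illustrative): (2m+1)(2m+beta-1) - 2m(2m+beta) = beta - 1,
# so the exponent of n above is positive iff beta > 1.
results = []
for m in [1, 2, 5, 10]:
    for beta in [1.1, 1.5, 2.0, 4.0]:
        diff = (2 * m + 1) * (2 * m + beta - 1) - 2 * m * (2 * m + beta)
        results.append(abs(diff - (beta - 1)) < 1e-9)
print(all(results))
```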

Hence
$$\Pi(B_n)\ge c_2\exp(-c_1\delta_n^{-2/(2m+\beta-1)})\ge c_2\exp(-c_1n\delta_n^2).$$
Consequently, with $P_{f_0}$-probability approaching one, $I_1\ge c_2\exp(-c_3n\delta_n^2)$, where $c_3=1+C_0C_1+c_1+(1+\gamma_0)^2/2$.

Let $I_2=\int_{A_n}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp(-\frac{n\lambda}{2}J(f))\,d\Pi(f)$, where $A_n=\{f\in H^m(I):\|f-f_0\|\ge2M\delta_n,\ \|f-f_0\|_{\sup}\le b\}$, for some $M$ to be determined. Let $A_{n1}=\{f\in H^m(I):V(f-f_0)\ge M^2\delta_n^2,\ \|f-f_0\|_{\sup}\le b\}$ and $A_{n2}=\{f\in H^m(I):\lambda J(f-f_0)\ge M^2\delta_n^2,\ \|f-f_0\|_{\sup}\le b\}$. So $I_2\le I_{21}+I_{22}$, where $I_{2j}=\int_{A_{nj}}\prod_{i=1}^n(p_f/p_{f_0})(Z_i)\exp(-\frac{n\lambda}{2}J(f))\,d\Pi(f)$ for $j=1,2$.

Since for $f\in A_{n2}$, $M^2\delta_n^2\le\lambda J(f-f_0)\le2\lambda J(f)+2\lambda J(f_0)\le2\lambda J(f)+2\delta_n^2\gamma_0^2$, we have $\lambda J(f)\ge(M^2-2\gamma_0^2)\delta_n^2/2$, and hence, $E_{f_0}\{I_{22}\}\le\exp(-(M^2-2\gamma_0^2)n\delta_n^2/4)$.

In the end, we address the term $I_{21}$, for which we need to build uniformly consistent tests. Define $\mathcal P_n=\{f\in H^m(I):\|f-f_0\|_{\sup}\le b,\ J(f)\le c_4M_n,\ V(f-f_0)\ge M^2\delta_n^2\}$, where $c_4>2(c_3+2)/(1+\gamma_0)^2$ is a fixed constant. Let $\delta=M\delta_n$ and let $D(\delta/2,\mathcal P_n,\|\cdot\|_{L^2})$ be the $\delta/2$-packing number of $\mathcal P_n$ in terms of the usual $L^2$-norm. Let $M\ge2\sqrt{c_4}(1+\gamma_0)$. It is well known that
$$\log D(\delta/2,\mathcal P_n,\|\cdot\|_{L^2})\le\log D(\sqrt{c_4}(1+\gamma_0)\delta_n,\mathcal P_n,\|\cdot\|_{L^2})\lesssim\big(\sqrt{c_4}(1+\gamma_0)\delta_n/\sqrt{c_4M_n}\big)^{-1/m}\le n\delta_n^2,$$
where the last inequality follows by $\delta_n^2\ge(nh)^{-1}$. It follows by the fact that, for sufficiently small $b$, the $L^2$-metric on $\mathcal P_n$ is dominated by the Hellinger metric, and by Theorem 7.1 of [21], that there exist tests $\phi_n$ such that
$$E_{f_0}\{\phi_n\}\le\exp(-TM^2n\delta_n^2),\qquad\sup_{f\in\mathcal P_n}E_f\{1-\phi_n\}\le\exp(-TM^2n\delta_n^2),$$
where $T>0$ is a universal constant. Note $I_{21}/I_1=P(A_{n1}|D_n)=P(A_{n1}|D_n)\phi_n+P(A_{n1}|D_n)(1-\phi_n)$, and $E_{f_0}\{P(A_{n1}|D_n)\phi_n\}\le\exp(-TM^2n\delta_n^2)$. On the other hand, following arguments similar to [21], we have
$$E_{f_0}\Big\{\int_{A_{n1}}\prod_{i=1}^n\frac{p_f(Z_i)}{p_{f_0}(Z_i)}\exp\Big(-\frac{n\lambda}{2}J(f)\Big)\,d\Pi(f)(1-\phi_n)\Big\}\le\exp(-n\lambda c_4M_n/2)+\sup_{f\in\mathcal P_n}E_f\{1-\phi_n\}\le\exp(-(c_3+2)n\delta_n^2)+\exp(-TM^2n\delta_n^2).$$
Therefore, by the above analysis, we can choose the constant $M$ such that $M^2>4(c_3+2)+2\gamma_0^2$, $TM^2>c_3+2$, and $M>2\sqrt{c_4}(1+\gamma_0)$. It then can be seen that, as $n\to\infty$, $P(\|f-f_0\|\ge2M\delta_n|D_n)$ converges to zero in $P_{f_0}$-probability.

Before proving Theorem 4.2, let us introduce some preliminaries. Let $\Delta f,\Delta f_j\in H^m(I)$ for $j=1,2,3$. The Fréchet derivative of $\ell_{n,\lambda}$ can be identified as
$$D\ell_{n,\lambda}(f)\Delta f=\frac1n\sum_{i=1}^n\dot\ell_a(Y_i;f(X_i))\langle K_{X_i},\Delta f\rangle-\langle W_\lambda f,\Delta f\rangle\equiv\langle S_{n,\lambda}(f),\Delta f\rangle.$$
Note that $S_{n,\lambda}(\hat f_{n,\lambda})=0$, and $S_{n,\lambda}(f_0)$ can be expressed as
$$(\text{S.8})\qquad S_{n,\lambda}(f_0)=\frac1n\sum_{i=1}^n\epsilon_iK_{X_i}-W_\lambda f_0.$$
The Fréchet derivative of $S_{n,\lambda}$ (of $DS_{n,\lambda}$) is denoted $DS_{n,\lambda}(f)\Delta f_1\Delta f_2$ ($D^2S_{n,\lambda}(f)\Delta f_1\Delta f_2\Delta f_3$). These derivatives can be explicitly written as
$$D^2\ell_{n,\lambda}(f)\Delta f_1\Delta f_2=n^{-1}\sum_{i=1}^n\ddot\ell_a(Y_i;f(X_i))\langle K_{X_i},\Delta f_1\rangle\langle K_{X_i},\Delta f_2\rangle-\langle W_\lambda\Delta f_1,\Delta f_2\rangle,$$
$$D^3\ell_{n,\lambda}(f)\Delta f_1\Delta f_2\Delta f_3=n^{-1}\sum_{i=1}^n\dddot\ell_a(Y_i;f(X_i))\langle K_{X_i},\Delta f_1\rangle\langle K_{X_i},\Delta f_2\rangle\langle K_{X_i},\Delta f_3\rangle.$$

We next review a set of results from [38, 16]. The following result gives the rate of convergence of $\hat f_{n,\lambda}$ when $h$ satisfies the rate conditions in Theorem 4.2.

Proposition S.3. Suppose Assumptions A1 and A2 are satisfied. Furthermore, as $n\to\infty$, $h=o(1)$ and $n^{-1/2}h^{-2}(\log n)(\log\log n)^{1/2}=o(1)$. Then $\|\hat f_{n,\lambda}-f_0\|=O_{P_{f_0}^n}(r_n)$.

The proof of Proposition S.3 can be found in [16]. The following lemma is also useful, whose proof can be found in [38, 37].

Lemma S.4 (Functional Bahadur representation under the true model). Suppose that Assumptions A1-A3 hold, $h=o(1)$, and $nh^2\to\infty$. Recall that $S_{n,\lambda}(f_0)$ is defined in (S.8). Then we have
$$(\text{S.9})\qquad\|\hat f_{n,\lambda}-f_0-S_{n,\lambda}(f_0)\|=O_{P_{f_0}^n}(a_n\log n),$$
where $a_n=n^{-1/2}r_nh^{-(6m-1)/(4m)}(\log\log n)^{1/2}+C_\ell h^{-1/2}r_n^2/\log n$, and
$$C_\ell=\sup_{x\in I}E_{f_0}\Big\{\sup_{a\in\mathbb R}|\dddot\ell_a(Y;a)|\,\Big|\,X=x\Big\}.$$

Let the true $f_0$ be expressed as $f_0(\cdot)=\sum_{\nu=1}^{\infty}f_\nu^0\varphi_\nu(\cdot)$ satisfying Condition (S). Under the conditions of Lemma S.4, it can be directly verified that $\|S_{n,\lambda}(f_0)\|=O_{P_{f_0}^n}(\tilde r_n)$, and therefore, $\|\hat f_{n,\lambda}-f_0\|=O_{P_{f_0}^n}(a_n\log n+\tilde r_n)$. To see this, by Proposition 3.1 we have
$$E_{f_0}\Big\{\Big\|\frac1n\sum_{i=1}^n\epsilon_iK_{X_i}\Big\|^2\Big\}=\frac1nE_{f_0}\{\epsilon^2K(X,X)\}=\frac1n\sum_{\nu=1}^{\infty}\frac1{1+\lambda\rho_\nu}=O((nh)^{-1}),$$
and
$$\|W_\lambda f_0\|^2=\Big\langle\sum_{\nu=1}^{\infty}f_\nu^0\frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu,\sum_{\nu=1}^{\infty}f_\nu^0\frac{\lambda\rho_\nu}{1+\lambda\rho_\nu}\varphi_\nu\Big\rangle=\sum_{\nu=1}^{\infty}|f_\nu^0|^2\frac{\lambda^2\rho_\nu^2}{1+\lambda\rho_\nu}=\lambda^{1+\frac{\beta-1}{2m}}\sum_{\nu=1}^{\infty}|f_\nu^0|^2\rho_\nu^{1+\frac{\beta-1}{2m}}\frac{(\lambda\rho_\nu)^{1-\frac{\beta-1}{2m}}}{1+\lambda\rho_\nu}=O(h^{2m+\beta-1}),$$
where the last equality follows by $\lambda=h^{2m}$, $\sup_{x\ge0}\frac{x^{1-\frac{\beta-1}{2m}}}{1+x}<\infty$, and Condition (S). Then $\|S_{n,\lambda}(f_0)\|=O_{P_{f_0}^n}((nh)^{-1/2}+h^{m+\frac{\beta-1}{2}})=O_{P_{f_0}^n}(\tilde r_n)$.

To complete the proof of Theorem 4.2, we consider a function class $\mathcal G=\{g\in H^m(I):\|g\|_{\sup}\le1,\ J(g)\le c^{-2}h\lambda^{-1}\}$, where $c>0$ is the universal constant satisfying Lemma 5.8. The following equicontinuity lemma over $\mathcal G$ is useful for further theoretical analysis. The proof can be found in [38].

Lemma S.5. Suppose that $\psi_n(z;g)$ is a measurable function defined upon $z=(y,x)\in\mathcal Y\times I$ and $g\in\mathcal G$ satisfying $\psi_n(z;0)=0$ and the following Lipschitz continuity condition: for any $i=1,\dots,n$ and $f,g\in\mathcal G$,
$$(\text{S.10})\qquad|\psi_n(Z_i;f)-\psi_n(Z_i;g)|\le c^{-1}h^{1/2}\|f-g\|_{\sup},\quad\text{a.s. in }P_{f_0}\text{-probability},$$
where $c>0$ is the universal constant specified in Lemma 5.8. Then we have
$$\lim_{n\to\infty}P_{f_0}^n\Big(\sup_{g\in\mathcal G}\frac{\|Z_n(g)\|}{h^{-(2m-1)/(4m)}\|g\|_{\sup}^{\gamma}+n^{-1/2}}\le(5\log\log n)^{1/2}\Big)=1,$$
where $Z_n(g)=\frac1{\sqrt n}\sum_{i=1}^n[\psi_n(Z_i;g)K_{X_i}-E_{f_0}^Z\{\psi_n(Z;g)K_X\}]$, $E_{f_0}^Z$ represents the expectation taken with respect to $Z$ under $P_{f_0}$, and $\gamma=1-\frac1{2m}$.

Proof of Theorem 4.2. Let $r_n=(nh)^{-1/2}+\lambda^{1/2}$ and denote $\hat f=\hat f_{n,\lambda}$ for convenience. By $\|\hat f-f_0\|=O_{P_{f_0}}(\tilde r_n)$ and Assumption A1, there exists a constant $M_1>1$ s.t. $P_{f_0}^n(E_n')$ approaches one (as $n\to\infty$), where
$$(\text{S.11})\qquad E_n'=\Big\{\|\hat f-f_0\|\le M_1\tilde r_n,\ \max_{1\le i\le n}\sup_{a\in\mathbb R}|\ddot\ell_a(Y_i;a)|\le M_1\log n,\ \max_{1\le i\le n}\sup_{a\in\mathbb R}|\dddot\ell_a(Y_i;a)|\le M_1\log n\Big\}.$$
Define $A_{in}=\{|\ddot\ell_a(Y_i;f_0(X_i))|\le M_1\log n\}$ for $i=1,2,\dots,n$. By Assumption A1(a), we can actually manage the above $M_1$ such that $P_{f_0}(A_{in}^c)\le C_2^{-1}n^{-1/2}h^{-(2m-1)/(4m)}\log n(5\log\log n)^{1/2}$, where recall that $C_2$ satisfies Assumption A1(a) and $c$ satisfies Lemma 5.8.

By Proposition 4.1, there exists a large $M_0$ s.t. $P(f\in H^m(I):\|f-f_0\|\le M_0r_n|D_n)$ converges to one in $P_{f_0}^n$-probability. Let $M>M_1$ be a constant to be determined. Since $P(\|f-f_0\|>2M\tilde r_n|D_n)=P(\|f-f_0\|>M_0r_n|D_n)+P(2M\tilde r_n<\|f-f_0\|\le M_0r_n|D_n)$, where the first term on the right side converges to zero in $P_{f_0}^n$-probability, we only look at the second term. For convenience, define $\mathcal S_n=\{f:2M\tilde r_n<\|f-f_0\|\le M_0r_n\}$. Since $\tilde r_n\le r_n$, on $E_n$ and for any $f\in\mathcal S_n$, $\|f-\hat f\|\le\|f-f_0\|+\|\hat f-f_0\|\le(M_0+M_1)r_n$, and $\|f-\hat f\|\ge\|f-f_0\|-\|\hat f-f_0\|>M\tilde r_n$. Define
$$(\text{S.12})\qquad E_n''=\Big\{\sup_{g\in\mathcal G}\frac{\|Z_n^{(j)}(g)\|}{h^{-(2m-1)/(4m)}\|g\|_{\sup}^{\gamma}+n^{-1/2}}\le(5\log\log n)^{1/2},\ j=1,2\Big\},$$
where $Z_n^{(j)}(g)=\frac1{\sqrt n}\sum_{i=1}^n[\psi_n^{(j)}(Z_i;g)K_{X_i}-E_{f_0}^Z\{\psi_n^{(j)}(Z;g)K_X\}]$, $\psi_n^{(1)}(Z;g)=c^{-1}h^{1/2}g(X)$ and $\psi_n^{(2)}(Z;g)=(M_1\log n)^{-1}c^{-1}h^{1/2}\ddot\ell_a(Y;f_0(X))g(X)I_{A_{in}}$. Then Lemma S.5 says that, as $n\to\infty$, $P_{f_0}^n(E_n'')\to1$. Let $E_n=E_n'\cap E_n''$, whose $P_{f_0}^n$-probability thus goes to one as $n$ tends to infinity. In the rest of the proof, we will assume that $E_n$ holds.

For any $f$, define $I_n(f)=\int_0^1\int_0^1sDS_{n,\lambda}(\hat f+ss'(f-\hat f))(f-\hat f)(f-\hat f)\,ds\,ds'$. Then by Taylor expansion we get that $\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)=S_{n,\lambda}(\hat f)(f-\hat f)+I_n(f)=I_n(f)$. By (3.8) we have that
$$P(f\in\mathcal S_n|D_n)=\frac{\int_{\mathcal S_n}\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))\,d\Pi(f)}{\int_{H^m(I)}\exp(n(\ell_{n,\lambda}(f)-\ell_{n,\lambda}(\hat f)))\,d\Pi(f)}=\frac{\int_{\mathcal S_n}\exp(nI_n(f))\,d\Pi(f)}{\int_{H^m(I)}\exp(nI_n(f))\,d\Pi(f)}.$$
We first derive a lower bound for $J_1=\int_{H^m(I)}\exp(nI_n(f))\,d\Pi(f)$. Immediately,
$$J_1\ge\int_{f:\|f-f_0\|\le\tilde r_n}\exp(nI_n(f))\,d\Pi(f).$$
Note on $E_n$ and when $\|f-f_0\|\le\tilde r_n$, we have $\|f-\hat f\|\le\|f-f_0\|+\|\hat f-f_0\|\le(M_1+1)\tilde r_n$. Let $c>0$ be the constant specified in Lemma 5.8, $d_n=c(M_1+1)h^{-1/2}\tilde r_n$ and $g=d_n^{-1}(f-\hat f)$. Immediately, it can be shown that $\|g\|_{\sup}\le ch^{-1/2}\|g\|=ch^{-1/2}d_n^{-1}\|f-\hat f\|\le ch^{-1/2}d_n^{-1}(M_1+1)\tilde r_n=1$, and $J(g)\le\lambda^{-1}\|g\|^2=\lambda^{-1}d_n^{-2}\|f-\hat f\|^2\le\lambda^{-1}d_n^{-2}(M_1+1)^2\tilde r_n^2=c^{-2}h\lambda^{-1}$. This implies $g\in\mathcal G$.

Then we can rewrite $I_n(f)$ as
$$\begin{aligned}
I_n(f)&=\int_0^1\int_0^1s[DS_{n,\lambda}(\hat f+ss'(f-\hat f))-DS_{n,\lambda}(f_0)](f-\hat f)(f-\hat f)\,ds\,ds'\\
&\quad+\frac12[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}](f-\hat f)(f-\hat f)-\frac12\|f-\hat f\|^2\\
&=d_n^2\int_0^1\int_0^1s[DS_{n,\lambda}(\hat f+ss'd_ng)-DS_{n,\lambda}(f_0)]gg\,ds\,ds'+\frac{d_n^2}2[DS_{n,\lambda}(f_0)-E_{f_0}\{DS_{n,\lambda}(f_0)\}]gg-\frac12\|f-\hat f\|^2\\
&\equiv T_1+T_2-\frac12\|f-\hat f\|^2,
\end{aligned}$$
where $T_1$ and $T_2$ denote the first and second terms in the above expression. For any $s\in[0,1]$, by Lemma 5.8 and $\|g\|_{\sup}\le1$, we have
$$\|\hat f+sd_ng-f_0\|_{\sup}\le\|\hat f-f_0\|_{\sup}+d_n\|g\|_{\sup}\le c(2M_1+1)h^{-1/2}\tilde r_n.$$

Note that, on $E_n$ and using $\|g\|^2\le c^{-2}h$, for any $s\in[0,1]$,
$$\begin{aligned}
&|[DS_{n,\lambda}(\hat f+sd_ng)-DS_{n,\lambda}(f_0)]gg|\\
&=\Big|\frac1n\sum_{i=1}^n[\ddot\ell_a(Y_i;(\hat f+sd_ng)(X_i))-\ddot\ell_a(Y_i;f_0(X_i))]g(X_i)^2\Big|\\
&\le\frac1n\sum_{i=1}^n\sup_{a\in\mathbb R}|\dddot\ell_a(Y_i;a)|\cdot\|\hat f-f_0+sd_ng\|_{\sup}\,g(X_i)^2\\
&\le\frac{cM_1(2M_1+1)h^{-1/2}\tilde r_n\log n}{n}\sum_{i=1}^ng(X_i)^2\\
&=\frac{cM_1(2M_1+1)h^{-1/2}\tilde r_n\log n}{n}\Big\langle\sum_{i=1}^n[g(X_i)K_{X_i}-E_{f_0}^X\{g(X)K_X\}],g\Big\rangle+cM_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,E_{f_0}^X\{g(X)^2\}\\
&\le\frac{c^2M_1(2M_1+1)h^{-1}\tilde r_n\log n}{n}\Big\langle\sum_{i=1}^n[\psi_n^{(1)}(Z_i;g)K_{X_i}-E_{f_0}^X\{\psi_n^{(1)}(Z;g)K_X\}],g\Big\rangle+cC_2M_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,\|g\|^2\\
&\le c^2M_1(2M_1+1)h^{-1}n^{-1/2}\tilde r_n\log n\,\|Z_n^{(1)}(g)\|\cdot\|g\|+cC_2M_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,\|g\|^2\\
&\le c^2M_1(2M_1+1)h^{-1}n^{-1/2}\tilde r_n\log n\,(h^{-(2m-1)/(4m)}+n^{-1/2})(5\log\log n)^{1/2}\|g\|+cC_2M_1(2M_1+1)h^{-1/2}\tilde r_n\log n\,\|g\|^2\\
&\le2cM_1(2M_1+1)n^{-1/2}\tilde r_nh^{-(4m-1)/(4m)}(5\log\log n)^{1/2}\log n+c^{-1}C_2M_1(2M_1+1)h^{1/2}\tilde r_n\log n.
\end{aligned}$$

Therefore, |T1 | ≤ C1 (c, C2 , M1 )˜ rn2 (n−1/2 r˜n h−(8m−1)/(4m) (log log n)1/2 log n + h−1/2 r˜n log n) = C1 (c, C2 , M1 )˜ rn2 bn1 , where C1 (c, C2 , M1 ) is a positive constant depending only on c, C2 , M1 . By assumption in the theorem, as n → ∞, T1 = oPfn (˜ rn2 ). 0 To handle T2 , we use similar ideas. Note that on En00 and by kgksup ≤ 1, kgk ≤ c−1 h1/2 , C2 P (Acin ) ≤ n−1/2 h−(2m−1)/(4m) log n(5 log log n)1/2 and

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

31

direct calculations we have |[DSn,λ (f0 ) − Ef0 {DSn,λ (f0 )}]gg| n 1X¨ ≤ k [`a (Yi ; f0 (Xi ))g(Xi )IAin KXi − EfZ0i {`¨a (Yi ; f0 (Xi ))g(Xi )IAin KXi }]k n i=1

·kgk + C2 P (Acin ) = n−1/2 (M1 log n)ch−1/2 kZn(2) (g)k · kgk + C2 P (Acin ) ≤ 2M1 n−1/2 h−(2m−1)/(4m) log n(5 log log n)1/2 + C2 P (Acin ) ≤ 3M1 n−1/2 h−(2m−1)/(4m) log n(5 log log n)1/2 , therefore, |T2 | ≤ (d2n /2)3M1 n−1/2 h−(2m−1)/(4m) log n(5 log log n)1/2 ≤ C2 (c, M1 )˜ rn2 n−1/2 h−(6m−1)/(4m) log n(log log n)1/2 = C2 (c, M1 )bn2 r˜n2 , where C2 (c, M1 ) > 0 is a constant depending only on c, M1 . By assumptions in the theorem, T2 = oPfn (˜ rn2 ). 0 Let n be sufficiently large such that C1 (c, C2 , M1 )bn1 + C2 (c, M1 )bn2 ≤ c2 (M1 +1)2 /2. By the above bounds of T1 and T2 , on En and when kf −f0 k ≤ r˜n , nIn (f ) ≥ −n(C1 (c, C2 , M1 )bn1 + C2 (c, M1 )bn2 + c2 (M1 + 1)2 /2)˜ rn2 ≥ −c2 (M1 + 1)2 n˜ rn2 . Therefore, J1 ≥ exp(−c2 (M1 + 1)2 n˜ rn2 )Π(Bn ), where Bn = {f : kf − f0 k ≤ r˜n }. To proceed, we still need a lower bound for Π(Bn ). We employ an idea ˜ β satisfy the properties (i)–(ii) used in the proof of Proposition 4.1. Let ω ∈ H of Lemma S.3, with dn therein replaced by r˜n . Note that r˜n2 ≥ h2m+β−1 = 4m 2m+β−1 ˜ λ 2m which implies that λ ≤ r˜n2m+β−1 . It can be shown that, if V˜ (G−ω) ≤ 2(β−1)

˜G ˜ − ω) ≤ r˜n2m+β−1 /16, then V˜ (G ˜ − f˜0 ) + λJ( ˜G ˜ − f˜0 ) ≤ (˜ r˜n2 /16 and J( rn2 + 2(β−1)

λ˜ rn2m+β−1 )/4 ≤ r˜n2 . It then follows by Cameron-Martin theorem, Gaussian

32

Z. SHANG AND G. CHENG

correlation inequality and the results of [23] that Π(Bn ) = P (kG − f0 k ≤ r˜n ) ˜ − f˜0 ) + λJ( ˜G ˜ − f˜0 ) ≤ r˜n2 ) = P (V˜ (G 2(β−1)

˜ − ω) ≤ r˜n2 /16, J( ˜G ˜ − ω) ≤ r˜n2m+β−1 /16) ≥ P (V˜ (G 2(β−1)

˜ ≤ r˜n2 /16, J( ˜ G) ˜ ≤ r˜n2m+β−1 /16) ≥ exp(−kωk2β /2)P (V˜ (G) 2(β−1)

˜ ≤ r˜n2 /32)P (J( ˜ G) ˜ ≤ r˜n2m+β−1 /32) ≥ exp(−kωk2β /2)P (V˜ (G) −

2

≥ exp(−c5 r˜n 2m+β−1 ), where the last inequality follows by 2

−

˜ ≤ r˜n2 /32) r˜n 2m+β−1 , − log P (V˜ (G) 2(β−1)

β−1

2

− − ˜ G) ˜ ≤ r˜n2m+β−1 /32) (˜ − log P (J( rn2m+β−1 ) β−1 = r˜n 2m+β−1 2

2 2+ 2m+β−1

(see Example 4.5 of [23]), and c5 > 0 is a universal constant. Since n˜ rn n(2n

1 − 2m+β−1 1+ 2m+β−1 2m+β

)

> 1 which implies

2 − r˜n 2m+β−1

< n˜ rn2 , we get that

J1 ≥ exp(−c5 n˜ rn2 ). R Next we derive an upper bound for J2 = Sn exp(nIn (f ))dΠ(f ). The idea we employ is similar to how we handle J1 , but with some technical difference. b Let d∗n = c(M0 + M1 )h−1/2 rn , and g∗ = d−1 ∗n (f − f ). It is easy to see 2 −2 that, on En and when f ∈ Sn , we get that kg∗ k ≤ c h, kg∗ ksup ≤ 1 and J(g∗ ) ≤ c−2 hλ−1 , which imply that g∗ ∈ G. Then we can rewrite In (f ) as Z 1Z 1 In (f ) = d2∗n s[DSn,λ (fb + ss0 d∗n g∗ ) − DSn,λ (f0 )]g∗ g∗ dsds0 (S.13)

0

0

d2 1 + ∗n [DSn,λ (f0 ) − Ef0 {DSn,λ (f0 )}]g∗ g∗ − kf − fbk2 2 2 1 2 ≡ T3 + T4 − kf − fbk . 2 Similar to the analysis of T1 and T2 , it can be shown that, on En and for any f ∈ Sn , |T3 | ≤ 2c3 M1 (M0 + 2M1 )(M0 + M1 )2 rn3 n−1/2 h−

8m−1 4m

(log n)(5 log log n)1/2

+c2 M1 (M0 + 2M1 )(M0 + M1 )2 rn3 h−1 log n ≤ C3 (c, M0 , M1 )(rn3 n−1/2 h− = C3 (c, M0 , M1 )bn3 ,

8m−1 4m

(log n)(log log n)1/2 + rn3 h−1/2 log n)

≥

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

|T4 | ≤ 3c2 M1 (M0 + M1 )2 n−1/2 h− ≤ C4 (c, M0 , M1 )n−1/2 h−

6m−1 4m

6m−1 4m

33

rn2 (log n)(5 log log n)1/2

rn2 (log n)(log log n)1/2 = C4 (c, M0 , M1 )bn4 ,

where Cj (c, M0 , M1 ) for j = 3, 4 are positive constants depending on c, M0 , M1 only. Let n be large enough such that C3 (c, M0 , M1 )bn3 +C4 (c, M0 , M1 )bn4 ≤ r˜n2 . Therefore, on En and for any f ∈ Sn , In (f ) ≤ C3 (c, M0 , M1 )bn3 + C4 (c, M0 , M1 )bn4 − which implies J2 ≤ exp((1 − Consequently, on En ,

M2 2 r˜ ≤ (1 − M 2 /2)˜ rn2 , 2 n

M2 rn2 ). 2 )n˜

J2 M2 ≤ exp((1 + c5 − )n˜ rn2 ). J1 2 p By choosing M > M1 and M > 2(1 + c5 ), we have that the right side in the above equation tends to zero as n → ∞. This concludes the proof. P (f ∈ Sn |Dn ) =

S.6. Proofs in Section 5.1. Proof of Theorem 5.1. Let δ, ε be arbitrary prefixed small positive numbers within (0, 1/8). Let En0 and En00 be defined as in (S.11) and (S.12), in which the constant M1 is chosen to be large so that, as n approaches infinity, both events have Pfn0 -probability greater than 1 − 4ε . It follows by Theorem 4.2 that, there exist M, N > 0 such that, for any n ≥ N , Pfn0 (En000 ) ≥ 1 − 4ε , where En000 = {P (f : kf − f0 k ≤ Mpr˜n |Dn ) ≥ 1 − δ}. We can actually make the above M greater than M1 + 2(c5 + 1) so that, on En0 and for any f p satisfying kf − f0 k ≥ M r˜n , kf − fbk ≥ (M − M1 )˜ rn ≥ 2(c5 + 1), where c5 is the universal constant in (S.13). Using the proofs of Theorem 4.2, we can show that, on En0 , Z Z n n exp(− kf − fbk2 )dΠ(f ) ≥ exp(− kf − fbk2 )dΠ(f ) 2 2 H m (I) f :kf −f0 k≤˜ rn ≥ exp(−(c5 + 1/2)n˜ rn2 ), Z

n exp(− kf − fbk2 )dΠ(f ) ≤ exp(−(c5 + 1)n˜ rn2 ). 2 f :kf −f0 k≥M r˜n

Thus, on En0 ,

P0 (f : kf − f0 k ≥ M r˜n ) ≤ exp(−n˜ rn2 /2),

34

Z. SHANG AND G. CHENG

in which the right side tends to zero as n → ∞. This means that the above N can be further managed such that, for any n ≥ N , Pfn0 (En∗ ) > 1 − ε, where En∗ = En0 ∩ En00 ∩ En000 ∩ En0000 and En0000 = {P0 (f : kf − f0 k ≤ M r˜n ) ≥ 1 − δ}. For arbitrary B ∈ B, let B 0 = B ∩ {f : kf − f0 k ≤ M r˜n }. Then it is easy to see that, on En∗ , |P (B|Dn ) − P0 (B)| ≤ P (kf − f0 k ≥ M r˜n |Dn ) + P0 (kf − f0 k ≥ M r˜n ) + |P (B 0 |Dn ) − P0 (B 0 )| ≤ 2δ + |P (B 0 |Dn ) − P0 (B 0 )|. To prove the theorem, it is sufficient to give a small upper bound for |P (B 0 |Dn ) − P0 (B 0 )|, and this upper bound does not depend on B. Observe that n n(`n,λ (f ) − `n,λ (fb)) + kf − fbk2 = n(T1 + T2 ), 2 where Z

1Z 1

T1 = 0

s[DSn,λ (fb + ss0 (f − fb)) − DSn,λ (f0 )](f − fb)(f − fb)dsds0 ,

0

1 T2 = [DSn,λ (f0 ) − Ef0 {DSn,λ (f0 )}](f − fb)(f − fb). 2 Based on the proof of Theorem 4.2, it can be shown that, on En∗ and for any f satisfying kf − f0 k ≤ M r˜n , n|T1 | ≤ C5 (c, M1 , M )n˜ rn2 bn1 and n|T2 | ≤ 2 C6 (c, M1 , M )n˜ rn bn2 , where Cj (c, M1 , M ), j = 5, 6, are constants depending only on c, M1 , M , and recall that c satisfies Lemma 5.8. By assumptions, the above upper bounds for n|T1 | and n|T2 | are both of order o(1), as n → ∞. Define Z Jn1 = exp(n(`n,λ (f ) − `n,λ (fb)))dΠ(f ) H m (I)

and

Z Jn2 =

n exp(− kf − fbk2 )dΠ(f ). 2 H m (I)

R Also define J¯n1 = f :kf −f0 k≤M r˜n exp(n(`n,λ (f ) − `n,λ (fb)))dΠ(f ) and J¯n2 = R n ∗ b2 f :kf −f0 k≤M r˜n exp(− 2 kf − f k )dΠ(f ). Then, on En , we have 0≤

Jnj − J¯nj ≤ δ, for j = 1, 2, Jnj

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

35

which implies that J¯nj ≤ Jnj ≤

(S.14)

1 ¯ Jnj , for j = 1, 2. 1−δ

Let Rn (f ) = n(`n,λ (f )−`n,λ (fb))+ n2 kf − fbk2 . On En∗ and for any f satisfying kf − f0 k ≤ M r˜n , |Rn (f )| ≤ C5 (c, M1 , M )n˜ rn2 bn1 + C6 (c, M1 , M )n˜ rn2 bn2 ≡ tn (c, M1 , M ). By assumption, as n → ∞, tn (c, M1 , M ) → 0. It can then be see that, on En∗ ,

= ≤ ≤ ≤

≤ (S.15) ≤

|P (B 0 |Dn ) − P0 (B 0 )| Z exp(n(`n,λ (f ) − `n,λ (fb))) exp(− n2 kf − fbk2 ) − dΠ(f ) 0 Jn1 Jn2 B Z exp(Rn (f )) 1 n − exp(− kf − fbk2 ) dΠ(f ) 2 Jn1 Jn2 B0 Z n 2 exp(Rn (f ))Jn2 − Jn1 b exp(− kf − f k ) dΠ(f ) 2 Jn1 Jn2 f :kf −f0 k≤M r˜n Z n exp(− kf − fbk2 ) 2 f :kf −f0 k≤M r˜n |Jn2 − Jn1 | 1 · exp(Rn (f )) + | exp(Rn (f )) − 1| dΠ(f ) Jn1 Jn2 Jn2 2tn (c, M1 , M ) ¯ |Jn2 − Jn1 | ¯ Jn2 + Jn2 exp(tn (c, M1 , M )) Jn1 Jn2 Jn2 |Jn2 − Jn1 | exp(tn (c, M1 , M )) + 2tn (c, M1 , M ). Jn1

It can also be shown that, on En∗ , Z n ¯ ¯ |Jn2 − Jn1 | ≤ exp(− kf − fbk2 )| exp(Rn (f )) − 1|dΠ(f ) 2 f :kf −f0 k≤M r˜n Z n ≤ 2tn (c, M1 , M ) exp(− kf − fbk2 )dΠ(f ) 2 f :kf −f0 k≤M r˜n ¯ = 2tn (c, M1 , M )Jn2 , which implies J¯n2 1 1 ≤ ¯ ≤ . 1 + 2tn (c, M1 , M ) 1 − 2tn (c, M1 , M ) Jn1 Combined with (S.14) we have that 1−δ J¯n2 Jn2 1 J¯n2 1 ≤ (1−δ) ¯ ≤ ≤ ·¯ ≤ , 1 + 2tn (c, M1 , M ) Jn1 1 − δ Jn1 (1 − δ)(1 − 2tn (c, M1 , M )) Jn1

36

Z. SHANG AND G. CHENG

which leads to −

2tn (c, M1 , M ) + δ 2tn (c, M1 , M ) + δ Jn2 −1≤ ≤ , 1 + 2tn (c, M1 , M ) Jn1 (1 − δ)(1 − 2tn (c, M1 , M ))

therefore, |

2tn (c, M1 , M ) + δ δ Jn2 − 1| ≤ → ≤ 2δ. Jn1 (1 − δ)(1 − 2tn (c, M1 , M )) 1−δ

It follows by (S.15) that as n is large, on En∗ and any B ∈ B, |P (B 0 |Dn ) − P0 (B 0 )| ≤ 3δ. This shows the desired conclusion (5.2). Proof of Theorem 5.2. Let T : f 7→ {fν : ν ≥ 1} be the one-to-one map from H m (I) to Rm , as defined in Lemma 3.2. Let Π0W and Π0 be the probability measures induced by {ξν + aν fbν : ν ≥ 1} and {vν : ν ≥ 1}. Then dΠ0W /dΠ0 equals limN →∞ p1,N (f1 , . . . , fN )/p2,N (f1 , . . . , fN ), where p1,N and p2,N are the probability densities under fν ∼ ξν + aν fbν and fν ∼ vν , ν = 1, . . . , N , respectively. A direct evaluation leads to that ∞

dΠ0W nX ({fν : ν ≥ 1}) = Cn,λ exp(− (fν − fbν )2 (1 + λρν )), 0 dΠ 2 ν=1

where Cn,λ =

m Y

∞

(1+nσν2 )1/2

Y

(1+n(1+λρν )ρ−(1+β/(2m)) )1/2 exp ν

ν>m

ν=1

1 X b2 fν bν 2

! ,

ν=1

1+β/(2m)

and bν = nσν−2 /(n+σν−2 ) for ν = 1, . . . , m, and bν = n(1+λρν )ρν /(n(1+ P b2 1+β/(2m) λρν ) + ρν ) for ν > m. Since ν fν ρν < ∞ and β > 1, it is not hard to see that Cn,λ is a finite constant. By similar arguments in the proof of Lemma 3.2, we have for any B ⊆ Rm , ΠW (T −1 B) = Π0W (B), and Π(T −1 B) = Π0 (B). By change of variable, for any Π-measurable S ⊆ H m (I), ΠW (S) = Π0W (T S) Z = dΠ0W ({fν : ν ≥ 1}) TS

Z

n

nX = Cn,λ exp(− (fν − fbν )2 (1 + λρν ))dΠ0 (fν : ν ≥ 1) 2 TS ν=1 Z n = Cn,λ exp(− kf − fbn,λ k2 )dΠ(f ). 2 S

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

37

In particular, let S = H m (I) in the above equations, we get that !−1 Z n 2 b Cn,λ = exp(− kf − fn,λ k )dΠ(f ) . 2 H m (I) This proves the desired result. S.7. Proofs in Section 5.2. Proof of Lemma 5.3. It is easy to see that n

∞ X p p c0 hkZk2 = n c0 h cν ην2 (1 + λρν ) ν=1

= n

p

c0 h

∞ X

cν (1 + λρν )(ην2 − 1) + n

∞ X p c0 h cν (1 + λρν ),

ν=1

ν=1

where ην are iid standard normal random variables independent of the data Dn√. Therefore, the of the moment generating function of Un = √ logarithm P∞ 2 (n c0 hkZk − n c0 h ν=1 cν (1 + λρν ))/τ1 is

=

log(E{exp(tUn )}) ∞ X p −1 log(E{exp(tτ1 n c0 h cν (1 + λρν )(ην2 − 1))}) ν=1

= =

∞ X

p log(E{exp(tτ1−1 n c0 hcν (1 + λρν )(ην2 − 1))})

ν=1 ∞ X

−

ν=1

=

2

p p 1 [ log(1 − 2tτ1−1 n c0 hcν (1 + λρν )) + tτ1−1 n c0 hcν (1 + λρν )] 2

2

t c0 n h

∞ X

c2ν (1

2

+ λρν )

ν=1

/τ12

+

3/2 O(t3 n3 c0 h3/2

∞ X

c3ν (1 + λρν )3 /τ13 )

ν=1

2

→ t /2, where the limit in the by (5.4) and the fact, as n → ∞, P∞last 3equation follows 3 3 1/2 ) = o(1). Consequently, 3 3/2 τ1 1 and n h ν=1 cν (1 + λρν ) /τ1 = O(h the result of the lemma holds. Proof of Lemma 5.4. The proof relies on the following kernel function: 0

R(x, x ) =

∞ X ν=1

cν ϕν (x)ϕν (x0 ), for any x, x0 ∈ I,

38

Z. SHANG AND G. CHENG

where cν are defined in Section 5.2. We also use Rx (·) to denote the function R(x, ·), for simplicity. Denote fb = fbn,λ , fe = fen,λ , and Rn = fb − f0 − Sn,λ (f0 ). For any ν ≥ 1, fbν

= V (fb, ϕν ) = V (Rn , ϕν ) + V (f0 , ϕν ) + V (Sn,λ (f0 ), ϕν ) n 1 X ϕν (Xi ) λρν = V (Rn , ϕν ) + fν0 + + fν0 . i n 1 + λρν 1 + λρν i=1

Therefore, by aν /(1 + λρν ) = ncν , we get that fe − f0 X = (aν fbν − fν0 )ϕν ν n

=

X

aν V (Rn , ϕν )ϕν +

ν

+

(aν − 1)fν0 ϕν +

ν

X ν

=

X

X

1 X X aν ϕν (Xi ) i ϕν n 1 + λρ ν ν i=1

aν λρν fν0 ϕν 1 + λρν

aν V (Rn , ϕν )ϕν +

ν

X ν

(aν −

1)fν0 ϕν

+

n X

i RXi +

X ν

i=1

fν0

aν λρν ϕν , 1 + λρν

(S.16) P where RXi = ν cν ϕν (Xi )ϕν , for i = 1, . . . , n. Denote the above four terms by I1 , I2 , I3 , I4 . It follows by Lemma S.4 and a2ν ≤ 1 that X nkI1 k2 = n a2ν |V (Rn , ϕν )|2 (1 + λρν ) ≤ nkRn k2 ν

= OPfn (n(an log n)2 ) = oPfn (n˜ rn2 ) = oPfn (h−1 ). 0

0

0

P Since |aν −1| = O(n−1 ) for ν = 1, . . . , m, it is easy to see that n m ν=1 (aν − 1)2 |fν0 |2 (1 + λρν ) = O(n−1 ). By Dominated convergence theorem and direct

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

39

calculations it follows that X (aν − 1)2 |fν0 |2 (1 + λρν ) n ν>m

= n

−1

1+β/(2m)

n(1 + λρν ) + ρν

ν>m

n

!2

n(1 + λρν )

X

|fν0 |2 (1 + λρν )

(λρν )2+β/m |f 0 |2 1+β/(2m) )2 ν (1 + λρ + (λρ ) ν ν ν>m X

+nλ

(λρν )2+β/m |f 0 |2 ρν 1+β/(2m) )2 ν (1 + λρ + (λρ ) ν ν ν>m X β−1

= nλ1+ 2m

Let gλ (ν) = and ν,

X (λρν )1+(β+1)/(2m) + (λρν )2+(β+1)/(2m) |fν0 |2 ρν1+(β−1)/(2m) . 1+β/(2m) )2 (1 + λρ + (λρ ) ν ν ν>m

(λρν )1+(β+1)/(2m) +(λρν )2+(β+1)/(2m) , (1+λρν +(λρν )1+β/(2m) )2

for ν > m. Clearly, for all λ

x1+(β+1)/(2m) + x2+(β+1)/(2m) < ∞, (1 + x + x1+β/(2m) )2 x>0 P 1+(β−1)/(2m) < ∞, it follows and limλ→0 gλ (ν) = 0 for all ν. Since ν |fν0 |2 ρν P 1+(β−1)/(2m) → by dominated convergence theorem that, as λ → 0, ν>m gλ (ν)|fν0 |2 ρν 0 < gλ (ν) ≤ sup

β−1

0. Therefore, by nλ1+ 2m h−1 , we get that, as n → ∞, nkI2 k

2

= n

m X

(aν − 1)2 |fν0 |2 (1 + λρν ) + n

ν=1 −1

= O(n

X

(aν − 1)2 |fν0 |2 (1 + λρν )

ν>m −1

) + o(h

−1

) = o(h

).

By dominated convergence theorem and direct examinations it can be shown that nkI4 k2 = n

X ν

|fν0 |2

X a2ν (λρν )2 (λρν )2 ≤n |fν0 |2 1 + λρν 1 + λρν ν

= nλ1+(β−1)/(2m)

X ν

|fν0 |2 ρν1+(β−1)/(2m)

(λρν )1−(β−1)/(2m) = o(h−1 ), 1 + λρν

where the last equation follows by 0 < 1 − (β − 1)/(2m) < 1, implying 1−(β−1)/(2m) supx>0 x 1+x < ∞, and dominated convergence theorem.

40

Z. SHANG AND G. CHENG

Next, let us handle I3 . Note that k

n X

n X

2

i RXi k =

i=1

2i kRXi k2 + 2

i=1

X

i j hRXi , RXj i ≡

im Z ∞ 2m 2m+β 2m (x + x )(1 + x ) dx > 0. 2m (1 + x + x2m+β )2 0 This completes the proof for the ζj part. The proof for the τj2 part is similar, which is ignored. S.8. Proofs in Section 5.3. Proof of Theorem 5.7. We start from the decomposition (S.16) in the proof of Lemma √ 5.4. Let I1 , I2 , I3 , I4 be defined the same as in (S.16). By m > 1 + 3/2 and 1 < β < m + 1/2, it can be shown that rn = (nh)−1/2 + hm = O(hm ), and hence n(an log n)2 = o(1). Therefore, X dν |V (Rn , ϕν )|2 nkI1 k2† ≤ n ν

. n

X

|V (Rn , ϕν )|2 ≤ nkRn k2 = OPfn (n(an log n)2 ) = oPfn (1). 0

ν

By |aν − 1| = O(1/n) and the choice of dν , we get X nkI2 k2† = n dν (aν − 1)2 |fν0 |2 ν

= n

m X

dν (aν − 1)2 |fν0 |2 + n

ν=1

= O(1/n) + n

X ν>m

dν

(n(1 +

X

dν (aν − 1)2 |fν0 |2

ν>m 2+β/m ρν |f 0 |2 . 1+β/(2m) 2 ν λρν ) + ρν )

0

SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

43

Using λ n−2m/(2m+β) and dominated convergence theorem, it can be shown that 2+β/m

n

X

dν

ν>m

ρν

(n(1 + λρν ) +

1+β/(2m) 2 ρν )

|fν0 |2

2+β/m

. n

−1

ρν |fν0 |2 dν 1+β/m 2 (1 + λρν + (λρν ) ) ν>m X

1+(β+1)/(2m)

. n−1

X

ρ−1/(2m) ν

ν>m

.

ρν |f 0 |2 ρ1+(β−1)/(2m) (1 + λρν + (λρν )1+β/m )2 ν ν

(λρν )1+β/(2m) |f 0 |2 ρν1+(β−1)/(2m) = o(1). 1+β/m )2 ν (1 + λρ + (λρ ) ν ν ν>m X

Therefore, nkI2 k2† = o(1). Similarly, we have nkI4 k2† = n

X

≤ n

X

. n

X

dν |fν0 |2

ν

dν

ν

a2ν (λρν )2 (1 + λρν )2

(λρν )2 |f 0 |2 (1 + λρν )2 ν

ρ−1/(2m) ν

ν>m

.

(λρν )2 ρ−1−(β−1)/(2m) |fν0 |2 ρν1+(β−1)/(2m) (1 + λρν )2 ν

X (λρν )1−β/(2m) ν

(1 + λρν )2

|fν0 |2 ρν1+(β−1)/(2m) = o(1).

Next we handle I3 . Define m ∞ X X ϕν (x)ϕν (y) ϕν (x)ϕν (y) T (x, y) = . −2 + −1 −1 1+β/(2m) 1 + n σν ν>m 1 + λρν + n ρν ν=1

It follows by Theorem 5 of [42] and proof of Proposition 2.2 in [38] that (j) −j/(2m) for any j = 0, 1, 2, . . . , 2m, supν kϕν ksup ρν < ∞. And hence, for any x, y ∈ I, |

X ∂ |ϕ˙ ν (x)ϕν (y)| T (x, y)| . O(1) + ∂x 1 + (hν)2m + (hν)2m+β ν>m X ν . O(1) + 2m + (hν)2m+β 1 + (hν) ν>m Z ∞ s h−2 ds, 2m 1 + s + s2m+β 0

44

Z. SHANG AND G. CHENG

∂ which implies supx,y∈I | ∂x T (x, y)| = O(h−2 ). For simplicity, define Tx (y) = T (x, y) for any x, y ∈ I. Let RXi be defined as in the proof of Lemma 5.4. By direct examination, it can be seen that

RXi =

m X ϕν (Xi )ϕν ν=1

n+

σν−2

+

X

ϕν (Xi )ϕν

ν>m

n(1 + λρν ) + ρν

1+β/(2m)

= TXi /n.

P P So I3 = n−1 ni=1 i TXi which implies nkI3 k2† = n−1 k ni=1 i TXi k2† . Since Ef0 {exp(||/C1 )} < ∞, we can choose a constant C > C1 such that Pfn0 (En ) → 1, where En = {max1≤i≤n |i | ≤ bn ≡ C log n}. We can even choose the above C to be properly large so that the following rate condition holds: (S.17)

h−1 n1/2 exp(−bn /(2C1 )) = o(1), h−2 exp(−bn /(2C1 )) = o(1).

Define Hn (·) = n−1/2

n X

i TXi (·), Hnb (·) = n−1/2

i=1

n X

i I(|i | ≤ bn )TXi (·).

i=1

Write $H_n = H_n - H_n^b - E_{f_0}\{H_n-H_n^b\} + H_n^b - E_{f_0}\{H_n^b\}$, which holds since $E_{f_0}\{H_n\}=0$. Clearly, on $E_n$, $H_n = H_n^b$, and hence,
$$
|H_n(z)-H_n^b(z)-E_{f_0}\{H_n(z)-H_n^b(z)\}| = |E_{f_0}\{H_n(z)-H_n^b(z)\}| = n^{1/2}|E_{f_0}\{\epsilon I(|\epsilon|>b_n)T_X(z)\}| \lesssim n^{1/2}h^{-1}E_{f_0}\{\epsilon^2\}^{1/2}P_{f_0}^n(|\epsilon|>b_n)^{1/2} \le n^{1/2}h^{-1}\exp(-b_n/(2C_1)) = o(1),
$$
where the last $o(1)$-term follows by (S.17) and is free of the argument $z$. Thus,
$$
(S.18)\qquad \sup_{z\in I}|H_n(z)-H_n^b(z)-E_{f_0}\{H_n(z)-H_n^b(z)\}| = o_{P_{f_0}^n}(1).
$$

Define $R_n = H_n^b - E_{f_0}\{H_n^b\}$ and $Z_n(e,x) = n^{1/2}(P_n(e,x)-P(e,x))$, where $P_n(e,x)$ is the empirical distribution of $(\epsilon,X)$ and $P(e,x)$ is the population distribution of $(\epsilon,X)$ under $P_{f_0}^n$-probability. It follows by Theorem 1 of [41] that
$$
(S.19)\qquad \sup_{e\in\mathbb{R},\,x\in I}|Z_n(e,x)-W(t(e,x))| = O_{P_{f_0}^n}(n^{-1/2}(\log n)^2),
$$


SUPPLEMENT TO NONPARAMETRIC BVM THEOREM

where $W(\cdot,\cdot)$ is a Brownian bridge indexed on $I^2$, $t(e,x) = (F_1(x), F_2(e|x))$, $F_1$ is the marginal distribution of $X$, and $F_2$ is the conditional distribution of $\epsilon$ given $X$, both under $P_{f_0}^n$-probability. It can be seen that
$$
R_n(z) = \int_0^1\int_{-b_n}^{b_n} e\,T_x(z)\,dZ_n(e,x).
$$
Define
$$
R_n^0(z) = \int_0^1\int_{-b_n}^{b_n} e\,T_x(z)\,dW(t(e,x)).
$$

Write
$$
dU_n(x) = \int_{-b_n}^{b_n} e\,dZ_n(e,x),\qquad dU_n^0(x) = \int_{-b_n}^{b_n} e\,dW(t(e,x)).
$$

It follows from integration by parts, where all quadratic variation terms are zero, that
$$
U_n(x) = \int_{-b_n}^{b_n} e\,d_e Z_n(e,x) = Z_n(e,x)e\Big|_{e=-b_n}^{b_n} - \int_{-b_n}^{b_n} Z_n(e,x)\,de,
$$
$$
U_n^0(x) = \int_{-b_n}^{b_n} e\,d_e W(t(e,x)) = W(t(e,x))e\Big|_{e=-b_n}^{b_n} - \int_{-b_n}^{b_n} W(t(e,x))\,de,
$$
and hence, it follows by (S.19) that
$$
\sup_{x\in I}|U_n(x)-U_n^0(x)| = O_{P_{f_0}^n}(b_n n^{-1/2}(\log n)^2).
$$

It follows from integration by parts again and $\sup_{x,y\in I}|\frac{\partial}{\partial x}T(x,y)| = O(h^{-2})$ that
$$
R_n(z) = \int_0^1 T_x(z)\,dU_n(x) = U_n(x)T(x,z)\Big|_{x=0}^{1} - \int_0^1 U_n(x)\frac{\partial}{\partial x}T(x,z)\,dx,
$$
$$
R_n^0(z) = \int_0^1 T_x(z)\,dU_n^0(x) = U_n^0(x)T(x,z)\Big|_{x=0}^{1} - \int_0^1 U_n^0(x)\frac{\partial}{\partial x}T(x,z)\,dx,
$$
and hence,
$$
(S.20)\qquad \sup_{z\in I}|R_n(z)-R_n^0(z)| = O_{P_{f_0}^n}(h^{-2}b_n n^{-1/2}(\log n)^2) = o_{P_{f_0}^n}(1),
$$
where the last equality follows by $2m+\beta>4$, and hence $h^{-2}n^{-1/2}b_n(\log n)^2 = O(n^{-1/2+2/(2m+\beta)}(\log n)^3) = o(1)$.
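The exponent arithmetic behind the last rate can be illustrated numerically; an aside in which the values of $m$, $\beta$, and $C$ are illustrative, with $h = n^{-1/(2m+\beta)}$ and $b_n = C\log n$ as in the proof.

```python
import math

# With h = n^{-1/(2m+beta)} and b_n = C*log(n),
#   h^{-2} * n^{-1/2} * b_n * (log n)^2 = C * n^{-1/2 + 2/(2m+beta)} * (log n)^3,
# whose polynomial exponent is negative exactly when 2m + beta > 4.
def rate(n, m=2, beta=1.0, C=1.0):
    h = n ** (-1.0 / (2 * m + beta))
    b_n = C * math.log(n)
    return h ** -2 * n ** -0.5 * b_n * math.log(n) ** 2

m, beta = 2, 1.0             # illustrative: 2m + beta = 5 > 4
exponent = -0.5 + 2.0 / (2 * m + beta)
print(exponent)              # about -0.1, i.e. negative
```

Note the log factors make the expression decrease only for very large $n$; the polynomial exponent is what matters asymptotically.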


Next we handle the term $R_n^0$. Write $W(s,t) = B(s,t) - stB(1,1)$, where $B(s,t)$ is standard Brownian motion indexed on $I^2$. Define
$$
\bar R_n^0(z) = \int_0^1\int_{-b_n}^{b_n} e\,T_x(z)\,dB(t(e,x)).
$$

Let $F(e,x) = F_1(x)F_2(e|x)$ be the joint distribution of $(\epsilon,X)$. It is easy to see that
$$
|\bar R_n^0(z)-R_n^0(z)| = |B(1,1)|\cdot\Big|\int_0^1\int_{-b_n}^{b_n} e\,T_x(z)\,dF(e,x)\Big| = |B(1,1)|\cdot|E_{f_0}\{\epsilon I(|\epsilon|\le b_n)T_X(z)\}| = |B(1,1)|\cdot|E_{f_0}\{\epsilon I(|\epsilon|>b_n)T_X(z)\}| = O_{P_{f_0}^n}(h^{-1}\exp(-b_n/(2C_1))) = o_{P_{f_0}^n}(1),
$$
where the third equality uses $E_{f_0}\{\epsilon T_X(z)\}=0$, and the last equality follows by (S.17). Therefore, we have shown that
$$
(S.21)\qquad \sup_{z\in I}|\bar R_n^0(z)-R_n^0(z)| = o_{P_{f_0}^n}(1).
$$

It follows from (S.18), (S.20) and (S.21) that
$$
(S.22)\qquad n\|I_3\|_\dagger^2 = \|H_n\|_\dagger^2 = \|\bar R_n^0\|_\dagger^2 + o_{P_{f_0}^n}(1).
$$

Define
$$
\tilde R(z) = \int_0^1\int_{-\infty}^{\infty} e\,T_x(z)\,dB(t(e,x)).
$$
Let $\Delta(z) = \tilde R(z) - \bar R_n^0(z)$. Then
$$
\Delta(z) = \int_0^1\int_{|e|>b_n} e\,T_x(z)\,dB(t(e,x)).
$$

For each $z$, $\Delta(z)$ is a zero-mean Gaussian random variable with variance
$$
E_{f_0}\{\Delta(z)^2\} = \int_0^1\int_{|e|>b_n} e^2\,T_x(z)^2\,dF(e,x) \lesssim h^{-2}E_{f_0}\{\epsilon^2 I(|\epsilon|>b_n)\} = O(h^{-2}\exp(-b_n/(2C_1))) = o(1),
$$
where the last $o(1)$-term follows from (S.17) and is free of the argument $z$. Therefore,
$$
E_{f_0}\{\|\tilde R-\bar R_n^0\|_\dagger^2\} \lesssim E_{f_0}\{\|\Delta\|_{L^2}^2\} = \int_0^1 E_{f_0}\{\Delta(z)^2\}\,dz = o(1),
$$


implying that $\|\tilde R-\bar R_n^0\|_\dagger = o_{P_{f_0}^n}(1)$. Therefore, it follows by (S.22) that
$$
(S.23)\qquad n\|I_3\|_\dagger^2 = \|\tilde R\|_\dagger^2 + o_{P_{f_0}^n}(1).
$$

It follows from the definition of $T(\cdot,\cdot)$ that, as $n\to\infty$,
$$
\|\tilde R\|_\dagger^2 = \sum_{\nu=1}^{m}\frac{d_\nu\tilde\eta_\nu^2}{(1+n^{-1}\sigma_\nu^{-2})^2} + \sum_{\nu>m}\frac{d_\nu\tilde\eta_\nu^2}{(1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)})^2} \to \sum_{\nu=1}^{\infty}d_\nu\tilde\eta_\nu^2,
$$
where $\tilde\eta_\nu = \int_0^1\int_{-\infty}^{\infty} e\,\varphi_\nu(x)\,dB(t(e,x))$. It is easy to see that for any $\nu,\mu$,

$$
E_{f_0}\{\tilde\eta_\nu\tilde\eta_\mu\} = E_{f_0}\{\epsilon^2\varphi_\nu(X)\varphi_\mu(X)\} = E_{f_0}\{B(X)\varphi_\nu(X)\varphi_\mu(X)\} = V(\varphi_\nu,\varphi_\mu) = \delta_{\nu\mu}
$$
(here $B(x)=E_{f_0}\{\epsilon^2|X=x\}$ denotes the conditional variance function, not the Brownian motion above); that is, the $\tilde\eta_\nu$ are iid standard normal random variables. Combined with the above analysis of the terms $I_1, I_2, I_3, I_4$, we have shown that, as $n\to\infty$,
$$
n\|f_0-\tilde f_{n,\lambda}\|_\dagger^2 \overset{d}{\to} \sum_{\nu=1}^{\infty}d_\nu\tilde\eta_\nu^2.
$$
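As an implementation aside (not part of the proof): given the weights $d_\nu$, the critical value $c_\alpha$ of a weighted chi-square limit of this form can be approximated by Monte Carlo. A sketch, in which the weight sequence `d` is an illustrative summable placeholder rather than the paper's $d_\nu$:

```python
import random

# Monte Carlo approximation of the (1 - alpha)-quantile of
# sum_nu d_nu * eta_nu^2, with eta_nu iid standard normal.
def critical_value(weights, alpha=0.05, reps=10_000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    draws = sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(reps)
    )
    return draws[int((1.0 - alpha) * reps) - 1]

d = [nu ** -2.0 for nu in range(1, 101)]  # placeholder summable weights
c_alpha = critical_value(d, alpha=0.05)
print(c_alpha)  # approximate 95% quantile of sum_nu d_nu * eta_nu^2
```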

This implies that, as $n\to\infty$,
$$
P_{f_0}^n(f_0\in R_n^\dagger(\alpha)) = P_{f_0}^n(n\|f_0-\tilde f_{n,\lambda}\|_\dagger^2\le c_\alpha) \to 1-\alpha.
$$
The proof is completed.

S.9. Proofs in Section 5.4.

Proof of Theorem 5.9. Let $I_1, I_2, I_3, I_4$ be the decomposition terms in (S.16). Then $\|\tilde f-f_0-I_3\|^2 \le 3(\|I_1\|^2+\|I_2\|^2+\|I_4\|^2)$. We first handle the norms of $I_1, I_2, I_4$. Since $m>1+\sqrt{3}/2$ and $1<\beta<m+1/2$, it can be shown that $n(a_n\log n)^2 = o(1)$. It follows from the derivations in the proof of Lemma 5.4 that, as $n\to\infty$,
$$
n\|I_1\|^2 = O_{P_{f_0}^n}(n(a_n\log n)^2) = o_{P_{f_0}^n}(1),
$$

$$
n\|I_2\|^2 \lesssim nh^{2m+\beta}\sum_\nu\frac{(h\nu)^{2m+\beta}+(h\nu)^{4m+\beta}}{(1+(h\nu)^{2m}+(h\nu)^{2m+\beta})^2}|f_\nu^0|^2\nu^{2m+\beta} = o(1),
$$
$$
n\|I_4\|^2 \lesssim nh^{2m+\beta}\sum_\nu\frac{(h\nu)^{2m-\beta}}{1+(h\nu)^{2m}}|f_\nu^0|^2\nu^{2m+\beta} = o(1).
$$
This means that $\|\tilde f-f_0-I_3\| = o_{P_{f_0}^n}(n^{-1/2})$. It follows by (5.8) that $|F(\tilde f-f_0-I_3)| \le \kappa h^{-1/2}\|\tilde f-f_0-I_3\| = o_{P_{f_0}^n}(h^{-1/2}n^{-1/2})$.


Before proceeding further, we analyze $F(I_3)$, which equals $\sum_{i=1}^n\epsilon_i F(R_{X_i})$. Since both $\varphi_\nu$ and $F(\varphi_\nu)$ are uniformly bounded, we have
$$
E_{f_0}\{\epsilon^2|nF(R_X)|^2\} = \sum_\nu n^2c_\nu^2 F(\varphi_\nu)^2 = \sum_{\nu=1}^{m}\frac{F(\varphi_\nu)^2}{(1+n^{-1}\sigma_\nu^{-2})^2} + \sum_{\nu>m}\frac{F(\varphi_\nu)^2}{(1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)})^2} \lesssim h^{-1},
$$
$$
|nF(R_X)| = \Big|\sum_{\nu=1}^{m}\frac{\varphi_\nu(X)F(\varphi_\nu)}{1+n^{-1}\sigma_\nu^{-2}} + \sum_{\nu>m}\frac{\varphi_\nu(X)F(\varphi_\nu)}{1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)}}\Big| \lesssim h^{-1}.
$$
Let $s_n^2 = \mathrm{Var}_{P_{f_0}^n}\big(\sum_{i=1}^n\epsilon_i\,nF(R_{X_i})\big)$. It is easy to see that
$$
s_n^2 = n\left(\sum_{\nu=1}^{m}\frac{F(\varphi_\nu)^2}{(1+n^{-1}\sigma_\nu^{-2})^2} + \sum_{\nu>m}\frac{F(\varphi_\nu)^2}{(1+\lambda\rho_\nu+n^{-1}\rho_\nu^{1+\beta/(2m)})^2}\right).
$$
Clearly, $s_n^2 = O(nh^{-1})$. By (5.10), $s_n^2 \gtrsim nh^{-r}$. Next we verify Lindeberg's condition: for any $\delta>0$,

$$
\begin{aligned}
&s_n^{-2}\sum_{i=1}^n E_{f_0}\{\epsilon_i^2|nF(R_{X_i})|^2 I(|\epsilon_i nF(R_{X_i})|\ge\delta s_n)\}\\
&\quad= ns_n^{-2}E_{f_0}\{\epsilon^2|nF(R_X)|^2 I(|\epsilon nF(R_X)|\ge\delta s_n)\}\\
&\quad\le ns_n^{-2}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}^{1/2}\,P(|\epsilon nF(R_X)|\ge\delta s_n)^{1/2}\\
&\quad\le ns_n^{-2}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}^{1/2}\,\big[(\delta s_n)^{-4}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}\big]^{1/2}\\
&\quad= ns_n^{-2}(\delta s_n)^{-2}E_{f_0}\{\epsilon^4|nF(R_X)|^4\}\\
&\quad\lesssim ns_n^{-4}h^{-3} \lesssim n^{-1}h^{2r-3} = n^{-\frac{2m+\beta+2r-3}{2m+\beta}} = o(1),
\end{aligned}
$$
where the first inequality is Cauchy--Schwarz and the second is Markov's inequality.

This implies that Lindeberg's condition holds. By the central limit theorem, $nF(I_3)/s_n = \sum_{i=1}^n\epsilon_i\,nF(R_{X_i})/s_n \overset{d}{\to} N(0,1)$. Observe that $F(f_0)\in CI_{n,\lambda}(\alpha)$ is equivalent to $|F(\tilde f-f_0)|\le z_{\alpha/2}t_n/\sqrt{n}$. In the meantime,
$$
\frac{n}{s_n}F(\tilde f-f_0) = \frac{n}{s_n}F(\tilde f-f_0-I_3) + \frac{n}{s_n}F(I_3) = \frac{n}{s_n}F(I_3) + o_{P_{f_0}^n}(1) \overset{d}{\to} N(0,1).
$$


Since $t_n^2\ge s_n^2/n$, we get that
$$
\begin{aligned}
P_{f_0}^n(F(f_0)\in CI_{n,\lambda}(\alpha)) &= P_{f_0}^n(|F(\tilde f-f_0)|\le z_{\alpha/2}t_n/\sqrt{n})\\
&= P_{f_0}^n\Big(\frac{n}{s_n}|F(\tilde f-f_0)|\le z_{\alpha/2}\frac{\sqrt{n}\,t_n}{s_n}\Big)\\
&\ge P_{f_0}^n\Big(\frac{n}{s_n}|F(\tilde f-f_0)|\le z_{\alpha/2}\Big) \to 1-\alpha,\quad\text{as }n\to\infty.
\end{aligned}
$$
In particular, we notice that when $\sum_\nu F(\varphi_\nu)^2<\infty$ and (5.10) holds for $r=0$, it follows by the dominated convergence theorem that both $s_n^2/n$ and $t_n^2$ converge to $\sum_\nu F(\varphi_\nu)^2$. This implies that $\frac{\sqrt{n}\,t_n}{s_n}\to1$, and hence $P_{f_0}^n(F(f_0)\in CI_{n,\lambda}(\alpha))\to1-\alpha$. This completes the proof of the theorem.

Proof of Proposition 5.10. Proof of (i). By direct examination, it can be shown that when $\nu\ge3$ is odd, $\cos(x)\cosh(x)=1$ has a unique solution in $((\nu+1/2)\pi,(\nu+1)\pi)$, that is, $\gamma_\nu\in((\nu+1/2)\pi,(\nu+1)\pi)$; when $\nu\ge3$ is even, $\cos(x)\cosh(x)=1$ has a unique solution in $(\nu\pi,(\nu+1/2)\pi)$, that is, $\gamma_\nu\in(\nu\pi,(\nu+1/2)\pi)$. Consequently, for any $k\ge1$, $0<\gamma_{2k+2}-\gamma_{2k+1}<\pi$. Let $\delta_0$ be a constant such that $0<\delta_0<\pi/2-\pi|z-1/2|$, and let $d_0=\min\{\sin^2(\delta_0),\cos^2(\delta_0+\pi|z-1/2|)\}$. Clearly, $d_0>0$ is a constant. It is easy to see that, as $k\to\infty$,
$$
\frac{\sinh(\gamma_{2k+1}(z-1/2))}{\sinh(\gamma_{2k+1}/2)}\to0, \quad\text{and}\quad \frac{\cosh(\gamma_{2k+2}(z-1/2))}{\cosh(\gamma_{2k+2}/2)}\to0.
$$
Then for arbitrarily small $\varepsilon\in(0,d_0/8)$, there exists $N$ s.t. for any $k\ge N$,

$$
\varphi_{2k+1}(z)^2 \ge \frac12\sin^2(\gamma_{2k+1}(z-1/2))-\varepsilon, \quad\text{and}\quad \varphi_{2k+2}(z)^2 \ge \frac12\cos^2(\gamma_{2k+2}(z-1/2))-\varepsilon.
$$

Let $\phi_k^0 = (\gamma_{2k+2}-\gamma_{2k+1})(z-1/2)$. Then $|\phi_k^0|\le\pi|z-1/2|<\pi/2$. There exists an integer $l_k$ s.t. $\gamma_{2k+1}(z-1/2) = \phi_k + l_k\pi$, where $\phi_k\in[0,\pi)$. Then $\sin^2(\gamma_{2k+1}(z-1/2)) = \sin^2(\phi_k)$, and $\cos^2(\gamma_{2k+2}(z-1/2)) = \cos^2(\phi_k+\phi_k^0)$. If $0\le\phi_k\le\delta_0$, then it can be seen that $-\pi|z-1/2|\le\phi_k^0\le\phi_k+\phi_k^0\le\delta_0+\phi_k^0\le\delta_0+\pi|z-1/2|$; therefore, $\cos^2(\phi_k+\phi_k^0)\ge\cos^2(\delta_0+\pi|z-1/2|)$. If $\delta_0<\phi_k<\pi-\delta_0$, then $\sin^2(\phi_k)\ge\sin^2(\delta_0)$. If $\pi-\delta_0\le\phi_k<\pi$, then it can be seen that $\pi-\delta_0-\pi|z-1/2|\le\phi_k+\phi_k^0<\pi+\pi|z-1/2|$.


Therefore, $\cos^2(\phi_k+\phi_k^0)\ge\cos^2(\delta_0+\pi|z-1/2|)$. Consequently, for any $k\ge N$,
$$
\varphi_{2k+1}(z)^2+\varphi_{2k+2}(z)^2 \ge \frac12\big(\sin^2(\gamma_{2k+1}(z-1/2))+\cos^2(\gamma_{2k+2}(z-1/2))\big)-2\varepsilon \ge \frac12\min\{\sin^2(\delta_0),\cos^2(\delta_0+\pi|z-1/2|)\}-2\varepsilon \ge d_0/4.
$$

Then we have
$$
\begin{aligned}
\sum_{\nu>2}\frac{h\varphi_\nu(z)^2}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/4})^j}
&= \sum_{k\ge1}\frac{h\varphi_{2k+1}(z)^2}{(1+\lambda\rho_{2k+1}+(\lambda\rho_{2k+1})^{1+\beta/4})^j} + \sum_{k\ge1}\frac{h\varphi_{2k+2}(z)^2}{(1+\lambda\rho_{2k+2}+(\lambda\rho_{2k+2})^{1+\beta/4})^j}\\
&\ge \sum_{k\ge1}\frac{h\varphi_{2k+1}(z)^2+h\varphi_{2k+2}(z)^2}{(1+\lambda\rho_{2k+2}+(\lambda\rho_{2k+2})^{1+\beta/4})^j}
\ge \sum_{k\ge N}\frac{hd_0/4}{(1+\lambda\rho_{2k+2}+(\lambda\rho_{2k+2})^{1+\beta/4})^j}\\
&\gtrsim \sum_{k\ge N}\frac{h}{(1+(k\pi h)^4+(k\pi h)^{4+\beta})^j}
\ge \int_N^\infty\frac{h}{(1+(\pi hx)^4+(\pi hx)^{4+\beta})^j}\,dx\\
&= \frac1\pi\int_{\pi Nh}^\infty\frac{1}{(1+x^4+x^{4+\beta})^j}\,dx
\xrightarrow{h\to0} \frac1\pi\int_0^\infty\frac{dx}{(1+x^4+x^{4+\beta})^j} > 0.
\end{aligned}
$$
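As a numerical aside (not part of the proof), the sum-to-integral comparison in this last display and the positivity of the limiting integral can be checked directly; the values of $\beta$, $j$, and $N$ below are illustrative.

```python
import math

# As h -> 0, the lower-bound Riemann sum
#   sum_{k >= N} h / (1 + (k*pi*h)^4 + (k*pi*h)^{4+beta})^j
# approaches (1/pi) * int_0^inf dx / (1 + x^4 + x^{4+beta})^j > 0.
beta, j, N = 1.0, 1, 10  # illustrative choices

def g(x):
    return 1.0 / (1.0 + x ** 4 + x ** (4 + beta)) ** j

def riemann(h, k_max=200_000):
    return sum(h * g(k * math.pi * h) for k in range(N, k_max))

def limit_integral(upper=50.0, steps=50_000):
    dx = upper / steps  # midpoint rule; the tail beyond `upper` is negligible
    return sum(g((i + 0.5) * dx) for i in range(steps)) * dx / math.pi

print(riemann(1e-3), limit_integral())  # nearly equal, and both positive
```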

This shows that condition (5.10) holds for $r=1$.

Proof of (ii). Write $\omega = \sum_\nu\omega_\nu\varphi_\nu$, where $\{\omega_\nu\}$ is a square-summable real sequence. Then $F_\omega(\varphi_\nu) = \int_0^1\omega(z)\varphi_\nu(z)\,dz = \omega_\nu$. Therefore, $\sum_\nu F_\omega(\varphi_\nu)^2 = \sum_\nu\omega_\nu^2 < \infty$. Meanwhile, since $\omega\ne0$, $\sum_{\nu=1}^{\infty}F_\omega(\varphi_\nu)^2 > 0$. Consequently, for $j=1,2$, it follows by the dominated convergence theorem that, as $n\to\infty$,

$$
\sum_{\nu=1}^{m}\frac{F_\omega(\varphi_\nu)^2}{(1+n^{-1}\sigma_\nu^{-2})^j} + \sum_{\nu>m}\frac{F_\omega(\varphi_\nu)^2}{(1+\lambda\rho_\nu+(\lambda\rho_\nu)^{1+\beta/(2m)})^j} \to \sum_{\nu=1}^{\infty}F_\omega(\varphi_\nu)^2 > 0.
$$
Hence (5.10) holds for $r=0$.

Department of Statistics
Purdue University
250 N. University Street
West Lafayette, IN 47906
Email: [email protected]; [email protected]