Asymptotic equivalence for nonparametric regression experiments with random design

Andrew Carter*
University of California Santa Barbara

*Research supported by NSF Grant DMS-05-04233

Abstract

In a nonparametric regression problem where the design points are randomly distributed with an unknown distribution, there are asymptotically sufficient statistics in the averages of the data over small subintervals. An approximation to these statistics can be used to construct a continuous Gaussian process experiment that is asymptotically equivalent to the regression experiment.

KEY WORDS: Nonparametric Regression, Asymptotic Equivalence of Experiments, Asymptotic Sufficiency

1. Introduction

The statistical problem is to estimate the mean function $f$ from $n$ i.i.d. ordered pairs $(X_i, Y_i)$ of the form
$$Y_i = f(X_i) + \zeta_i \tag{1}$$
where $\zeta_i \sim N(0, 1)$ and the $X_i$ are marginally i.i.d. with density $g$ on $[0, 1]$. This nonparametric regression problem with random design is a more realistic model than assuming the data will be observed on a regular grid.

How much of a difference does it make that the $X_i$ are randomly placed, given a sufficiently large number of observations? Our approach is to find a statistic that is approximately sufficient in the sense of Le Cam and is observed at regular fixed intervals over the design space. One motivation for this analysis is the wavelet smoothing techniques (as in Kovac & Silverman (2000), for example) that map regression data onto a regular grid before applying the wavelet transform.

1.1 Asymptotically sufficient statistics

We are looking to show that the regression observations can be simplified with no appreciable loss of information. The main results are that the various statistical experiments are asymptotically equivalent in the sense of Le Cam (1964, 1986). These asymptotically equivalent experiments are such that for any decision procedure in one experiment there is a corresponding decision procedure in the second experiment such that the difference between the risk functions is small for all bounded loss functions.

A related notion is that of asymptotically sufficient statistics (Le Cam, 1964), or insufficiency as defined in Le Cam (1974) and Le Cam (1986, chapter 5). For a statistical experiment $\mathcal{P}$, the data $X$ are observed with distribution $P_f$ on the space $(\mathcal{X}, \mathcal{A})$. From this experiment there is a statistic $T(X)$ (or more generally a sub-sigma-field $\mathcal{A}_0$) that is not a sufficient statistic for estimating $f$ from $\mathcal{P}$ but is sufficient for estimating $f$ from the experiment $\mathcal{P}^*$ with distributions $P^*_f$ on $(\mathcal{X}, \mathcal{A})$. Le Cam's insufficiency is the distance between the distributions, $\sup_f d(P_f, P^*_f)$, and $T(X)$ (or $\mathcal{A}_0$) is asymptotically sufficient if this distance goes to 0 as $n \to \infty$. Le Cam used total-variation distance in his definition to correspond with the notion of asymptotically equivalent experiments. We will generally bound the Kullback–Leibler divergence between the distributions because it is more convenient in product experiments and can be used to bound the total-variation distance. Heyer (1980) describes connections between the classical notions of sufficiency and $f$-divergences.

If there are two experiments $\mathcal{P}$ and $\mathcal{Q}$ that have asymptotically sufficient statistics $T$ and $R$ respectively, and the distance between the distribution of $T$ and the distribution of $R$ goes to 0, then $\mathcal{P}$ and $\mathcal{Q}$ are asymptotically equivalent. This follows from the triangle inequality applied to Le Cam's deficiency distance between experiments.

One drawback to this asymptotic equivalence framework is that it is not clear how to handle nuisance parameters like the design density $g$. In a situation where our decision procedures are related to estimating $f$ but the distributions also depend on a parameter $g$, we elevate this second parameter to the same role as $f$ and proceed as if our interest were in estimating both $f$ and $g$. This is a conservative tack to take in this problem, but one that leads to approximations nearly as good in the case where $g$ is fixed. Thus, the theory does not seem to lose any of its applicability by not differentiating the roles of the parameters $f$ and $g$.
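As a concrete footnote to the last remark about divergences, the following sketch (ours, not part of the paper's argument) checks Pinsker's inequality, $\|P - Q\|_{TV} \le \sqrt{D(P, Q)/2}$, on a pair of small hypothetical discrete distributions; this is the sense in which a Kullback–Leibler bound yields a total-variation bound.

```python
import numpy as np

# Two hypothetical discrete distributions standing in for P_f and P*_f.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))    # Kullback-Leibler divergence D(P || Q)
tv = 0.5 * np.sum(np.abs(p - q))  # total-variation distance

# Pinsker's inequality: TV <= sqrt(KL / 2).
print(f"KL = {kl:.4f}, TV = {tv:.4f}, sqrt(KL/2) = {np.sqrt(kl / 2):.4f}")
assert tv <= np.sqrt(kl / 2)
```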

1.2 The parameter space

We will confine ourselves to the Haar basis with scaling functions $\phi_{k\ell}(x) = 2^{k/2}\,1\{(\ell - 1)2^{-k} \le x \le \ell 2^{-k}\}$ and wavelet functions $\psi_{k\ell} = (\phi_{k+1,2\ell} - \phi_{k+1,2\ell-1})/\sqrt{2}$. Thus, the assumption is that the mean function is of the form
$$f(x) = \sum_{\ell=1}^{2^{k_0}} a_\ell\, \phi_{k_0\ell}(x) + \sum_{k \ge k_0} \sum_j c_{kj}\, \psi_{kj}(x), \tag{2}$$
and the tail of this sequence converges uniformly.

Condition 1 (F) The space of functions $\mathcal{F}_\alpha(C, \gamma(k))$ contains functions $f$ of the form (2) that are bounded, $|f| < C$, and satisfy
$$\sup_{f \in \mathcal{F}_\alpha} \sum_{k \ge k_0} 2^{2k(\alpha + 1/2)} \sup_j c_{kj}^2 \le \gamma(k_0)$$
for some function $\gamma(k)$ such that $\lim_{k \to \infty} \gamma(k) = 0$.

Thus the space $\mathcal{F}_\alpha(C, \gamma(k))$ forms a compact subspace of a Besov$(\alpha, \infty, 2)$ space of functions. In particular, it follows that for any $f \in \mathcal{F}_\alpha(C, \gamma(k))$,
$$\int_0^1 \bigg(f(x) - \sum_{\ell=1}^{2^{k_0}} a_\ell\, \phi_{k_0\ell}(x)\bigg)^2 dx = \sum_{k \ge k_0} \sum_{j=1}^{2^k} c_{kj}^2 \le 2^{-2k_0\alpha}\, \gamma(k_0). \tag{3}$$
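The first equality in (3) is just Parseval's relation for the Haar system. The following sketch (our illustration; the function $f$ below is hypothetical) verifies it numerically on a fine dyadic grid, computing the wavelet coefficients $c_{kj}$ directly from left and right half-interval averages.

```python
import numpy as np

J = 12                      # finest dyadic level; f is resolved on 2^J cells
N = 2 ** J
x = (np.arange(N) + 0.5) / N
f = np.sin(2 * np.pi * x) + 0.5 * x      # hypothetical smooth mean function

k0 = 4
m = 2 ** k0
# Projection onto the scaling functions phi_{k0,l}: piecewise bin averages.
fbar = np.repeat(f.reshape(m, -1).mean(axis=1), N // m)
lhs = np.mean((f - fbar) ** 2)           # grid approximation of the integral

# Haar wavelet coefficients c_kj = <f, psi_kj> for k0 <= k < J.
rhs = 0.0
for k in range(k0, J):
    halves = f.reshape(2 ** k, 2, -1).mean(axis=2)   # left/right half means
    c = 2 ** (-k / 2 - 1) * (halves[:, 1] - halves[:, 0])
    rhs += np.sum(c ** 2)

print(lhs, rhs)             # the two sides agree to floating-point accuracy
assert abs(lhs - rhs) < 1e-10
```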

We will make stronger smoothness assumptions about the unknown density.

Condition 2 (G) The design density $g$ is assumed to be in a function space $\mathcal{G}(M, \epsilon)$ consisting of all densities on $[0, 1]$ such that at every $x$ the function $g(x) \ge \epsilon$ and
$$\sup_{g \in \mathcal{G}}\, \sup_{x, \delta}\, |g(x) - g(x + \delta)| \le M\delta.$$

Thus the space $\mathcal{G}(M, \epsilon)$ is a space of Lipschitz functions. There are other conditions under which our main results might be true. In particular, it is possible to weaken the smoothness assumptions on the densities if there are more stringent assumptions on $\mathcal{F}$.

1.3 Main results

Let $\mathcal{P}$ be the nonparametric regression experiment described in (1). This experiment can be simplified by dividing the data up into $m$ bins. Specifically, let $N_\ell$ be the number of $X_i$ that fall into the interval from $(\ell - 1)/m$ to $\ell/m$. Then take the average over each interval, $\bar Y_\ell = \frac{1}{N_\ell} \sum_i Y_i\, 1\{\frac{\ell - 1}{m} \le X_i \le \frac{\ell}{m}\}$.

Theorem 1 Assuming that the parameter space of pairs $(f, g) \in \mathcal{F}_\alpha \times \mathcal{G}$ satisfies Conditions 1 and 2 respectively, the pairs $(N_1, \bar Y_1), (N_2, \bar Y_2), \ldots, (N_m, \bar Y_m)$ are approximately sufficient statistics from $\mathcal{P}$ for estimating $f$ and $g$ when $m \ge n^{1/(2\alpha)}$ for $0 < \alpha < 1$.

These approximately sufficient statistics can be used as the increments of a continuous Gaussian process, yielding the following approximation.

Theorem 2 Under the same conditions on $m$, $f$ and $g$, the experiment $\mathcal{P}$ is also asymptotically equivalent to the experiment $\mathcal{Q}$ that observes
$$dN(t) = 2\sqrt{g(t)}\,dt + \frac{1}{\sqrt{n}}\,dW_2(t),$$
and then, conditional on the $N$,
$$dY(t) = \bar N(t) f(t)\,dt + \frac{1}{\sqrt{n}}\,dW_1(t),$$
where $\bar N(t) = \sqrt{mN_\ell/n}$ for $(\ell - 1)/m \le t < \ell/m$, and $W_1$ and $W_2$ are independent standard Brownian motions.

This approximating experiment has a similar structure to the original experiment in that the $N(t)$ or $X_i$ contain all the information about $g$ and then, conditionally, the observations $Y(t)$ or $Y_i$ do not depend on $g$ while containing all the information about $f$. If $g$ is a nuisance parameter and we only want to estimate $f$, then one strategy is to take $\hat f$ to minimize the risk conditional on $N(t)$.

1.4 Previous work on asymptotic equivalence of nonparametric regression experiments

Brown & Low (1996) showed that the nonparametric regression where the design points $X_i$ are fixed at $i/n$ is asymptotically equivalent to the continuous Gaussian process
$$dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \qquad 0 < t < 1.$$
Carter (2006) extended this result to two-dimensional design spaces, and Rohde (2004) showed how orthogonal bases can be used to get better results in cases where smoother function classes are used. Reiss (2007) provides a general treatment of these problems.

Brown et al. (2002) showed that if the design points are randomly chosen from a uniform distribution over the interval, then the same continuous Gaussian approximation is available. Our results extend this to the case where $g$ is also unknown.

The approximation in Theorem 2 is in the spirit of Carter (2007) in that the nuisance parameter generates $N(t)$ and the mean is Gaussian conditional on $N(t)$. Thus, the approximation establishes a sort of infinite-dimensional asymptotic mixed normality.

Using bins to simplify nonparametric regression experiments is a common strategy. For example, Kovac & Silverman (2000) used the bins to simplify wavelet threshold estimators. They had to deal with two complications that typically result: the variance of the binned averages and the possibility of empty bins. Our approximation side-steps these issues (see Section 2), passing the same issues on to the approximating experiment. For instance, note that $Y(t)$ in experiment $\mathcal{Q}$ is not equivalent to
$$d\tilde Y(t) = f(t)\,dt + \frac{1}{\sqrt{nN_\ell}}\,dW_1(t)$$
because there is a positive probability that $N_\ell = 0$; the small sketch at the end of this section illustrates the point.

We propose a statistic in the nonparametric regression experiment in Section 2. In Section 3, we show that this statistic is asymptotically sufficient for the regression experiment and prove Theorem 1. Section 5 shows that there is a continuous Gaussian process with an asymptotically sufficient statistic described in Section 4. The asymptotically sufficient statistics from the $\mathcal{P}$ and $\mathcal{Q}$ experiments have nearly the same distributions, and thus the continuous Gaussian process is asymptotically equivalent to $\mathcal{P}$, establishing Theorem 2. The detailed bounds and calculations are collected in Section 6.
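As promised above, the empty-bin issue is easy to quantify. For a uniform design, $P(N_\ell = 0) = (1 - 1/m)^n > 0$ exactly; the sketch below (ours, with hypothetical sizes) counts empty bins in a simulated design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 200, 64, 0.75      # hypothetical sizes with m >= n**(1/(2*alpha))

X = rng.uniform(0, 1, size=n)    # uniform design, as in Brown et al. (2002)
N = np.bincount(np.minimum((m * X).astype(int), m - 1), minlength=m)

p_empty = (1 - 1 / m) ** n       # exact P(N_l = 0) for the uniform design
print("empty bins observed:", int(np.sum(N == 0)),
      "expected:", m * p_empty)
```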

2. The statistic

The first step in constructing the statistic is to divide the support of the $X_i$ into $m$ equal subintervals $I_\ell = \{x \in [0, 1] : (\ell - 1)/m \le x < \ell/m\}$. Then the number of observations where the $X_i$ falls into the interval $I_\ell$ is denoted $N_\ell$. Finally, the average of the corresponding $Y_i$ values for those $X_i \in I_\ell$ is
$$\bar Y_\ell = \frac{1}{N_\ell} \sum_{i=1}^n Y_i\, 1\{X_i \in I_\ell\}.$$

We can improve this statistic somewhat via the homogenizing transformation
$$Z_\ell^* = \sqrt{N_\ell}\,\bar Y_\ell = \frac{1}{\sqrt{N_\ell}} \sum_{i=1}^n Y_i\, 1\{X_i \in I_\ell\}$$
so that the conditional variances are $\operatorname{Var}(Z_\ell^* \mid N_\ell) = 1$. This statistic is especially useful in generating a continuous Gaussian process in Section 5.

In any of the bins where $N_\ell = 0$, the $\bar Y_\ell$ and $Z_\ell^*$ are not well defined. In these cases there is no information about $f$ in the interval $I_\ell$, thus it is sufficient to generate uninformative $Z_\ell^* \mid \{N_\ell = 0\} \sim N(0, 1)$.
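A minimal sketch of this construction (ours; the $f$ and $g$ below are hypothetical members of $\mathcal{F}_\alpha$ and $\mathcal{G}$) computes the counts $N_\ell$, the averages $\bar Y_\ell$, and the homogenized $Z_\ell^*$, including the uninformative $N(0,1)$ fill for empty bins.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 32
f = lambda x: np.sin(2 * np.pi * x)          # hypothetical mean, |f| < C = 1
# Hypothetical design density g(x) = 0.5 + 3x(1-x): bounded below, Lipschitz.
u = rng.random(n)
X = np.where(u < 0.5, rng.uniform(0, 1, n), rng.beta(2, 2, n))
Y = f(X) + rng.standard_normal(n)            # the regression model (1)

bins = np.minimum((m * X).astype(int), m - 1)
N = np.bincount(bins, minlength=m)           # bin counts N_l
sums = np.bincount(bins, weights=Y, minlength=m)

Ybar = np.where(N > 0, sums / np.maximum(N, 1), 0.0)   # bin averages
Zstar = np.where(N > 0, sums / np.sqrt(np.maximum(N, 1)),
                 rng.standard_normal(m))     # Var(Z*_l | N_l) = 1; N(0,1) fill
print(N[:4], Ybar[:4], Zstar[:4])
```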

3. Asymptotic sufficiency of the binned data

The set of $m$ pairs $(N_1, Z_1^*), \ldots, (N_m, Z_m^*)$ is approximately sufficient for the nonparametric observations from (1). This statistic is exactly sufficient if the parameter $(f, g)$ is restricted to functions that are constant on each of the subintervals $I_\ell$. Thus, we have a $\mathcal{P}^*$ experiment that observes $n$ independent observations $(X_i, Y_i)$ as in (1) except that instead of conditional mean $f$ and density $g$, the parameters are replaced by $\bar f$ and $\bar g$, where
$$\bar f_\ell = m\int_{I_\ell} f(x)\,dx \qquad \text{and} \qquad \bar f(t) = \sum_\ell \bar f_\ell\, 1\{t \in I_\ell\},$$
and $\bar g_\ell$ and $\bar g(t)$ are defined analogously.

The sufficiency can be demonstrated by writing the likelihood
$$L^*(f, g) = \prod_{i=1}^n \bar g(X_i)\,\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\big(Y_i - \bar f(X_i)\big)^2\Big) = \Bigg[\prod_{\ell=1}^m \bar g_\ell^{N_\ell}\Bigg]\Bigg[\prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\,e^{-Y_i^2/2}\Bigg]\exp\Bigg(-\frac{1}{2}\sum_{\ell=1}^m \Big(N_\ell \bar f_\ell^2 - 2Z_\ell^*\sqrt{N_\ell}\,\bar f_\ell\Big)\Bigg),$$
implying sufficiency via the factorization theorem. Note that this likelihood is correct even for $N_\ell = 0$ because then the likelihood does not depend on that particular $Z_\ell^*$.

The distance between the distributions $P_{fg}^*$ and $P_{fg}$ is
$$D\big(P_{\bar f\bar g}, P_{fg}\big) = nD(X_i^*, X_i) + nE\,D(Y_i^*, Y_i \mid X_i^*).$$
In Section 6.3 we bound $D(X_i^*, X_i) \le CM^2 m^{-2}/\epsilon^2$ in (10) and $E\,D(Y_i^*, Y_i \mid X_i) \le (M+1)m^{-2\alpha}\gamma_\alpha(k_0)$ in (11). Therefore, $(N_\ell, Z_\ell^*)$ is asymptotically sufficient whenever $n = O(m^{2\alpha})$. In particular, $m$ can be chosen as $n^{1/(2\alpha)}$ for $\alpha < 1$.

Theorem 1 follows from this result because $(N_\ell, \bar Y_\ell)$ is a one-to-one function of $(N_\ell, Z_\ell^*)$ except when $N_\ell = 0$, in which case $\bar Y_\ell$ can be defined however we want.

4. Simpler statistic

To get a homogeneous Gaussian process approximation we will use the statistics
$$Z_\ell = \sqrt{N_\ell}\,\bar f_\ell + \xi_\ell,$$
where $N_\ell$ is the number of $X_i$ in $I_\ell$ and the $\xi_\ell$ are independent standard normal errors. The experiment thus observes $m$ independent pairs $(N_\ell, Z_\ell)$.

The Kullback–Leibler divergence between the distributions of the $m$ pairs $(N_\ell, Z_\ell)$ and the pairs $(N_\ell, Z_\ell^*)$ is
$$D\big(\{(N_\ell, Z_\ell^*)\}_{\ell=1}^m,\, \{(N_\ell, Z_\ell)\}_{\ell=1}^m\big) = \sum_{\ell=1}^m E\,D(Z_\ell^*, Z_\ell \mid N_\ell).$$
Furthermore, to bound the K–L divergence between the conditional distributions of the $Z_\ell$ and the $Z_\ell^*$, we consider the joint distribution of the $Z_\ell^*$ and the original design points $X_i$ (still conditioning on the bin counts $N_\ell$):
$$D\big((Z_\ell^*, X), (Z_\ell, X) \mid N_\ell\big) = E\big[D(Z_\ell^*, Z_\ell \mid X) \mid N_\ell\big].$$
Along with this,
$$D\big((Z_\ell^*, X), (Z_\ell, X) \mid N_\ell\big) = D(Z_\ell^*, Z_\ell \mid N_\ell) + E\big[D(X^*, X \mid Z_\ell, N_\ell) \mid N_\ell\big],$$
implying that
$$D(Z_\ell^*, Z_\ell \mid N_\ell) \le E\big[D(Z_\ell^*, Z_\ell \mid X) \mid N_\ell\big].$$
Therefore,
$$\sum_{\ell=1}^m D(Z_\ell^*, Z_\ell \mid N_\ell) \le \sum_{\ell=1}^m E\big(D(Z_\ell^*, Z_\ell \mid X) \mid N_\ell\big) = \sum_{\ell=1}^m E\Bigg[\frac{1}{2}\bigg(\frac{1}{\sqrt{N_\ell}}\sum_{i=1}^n f(X_i)1\{X_i \in I_\ell\} - m\sqrt{N_\ell}\int_{I_\ell} f(x)\,dx\bigg)^2\,\Bigg|\, N_\ell\Bigg] \tag{4}$$
(we can ignore the case $N_\ell = 0$ because the construction ensures that those $Z_\ell^* = Z_\ell$).

As you would expect, we need to bound the squared expectation and the variance. These are shown to be negligible in Sections 6.1 and 6.2. From equations (7) and (9), we get
$$\sum_{\ell=1}^m E\big[D(Z_\ell^*, Z_\ell \mid N_\ell)\big] \le nm^{-2}\bigg(\frac{C^2M^2}{2\epsilon^2} + \gamma(k_0)\bigg),$$
implying that the error goes to 0 for $m$ sufficiently large, $n = o(m^2)$. Therefore, the two experiments generated by the pairs $(N_\ell, Z_\ell^*)$ and $(N_\ell, Z_\ell)$ respectively are asymptotically equivalent.
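The equality in (4) rests on the fact that, conditional on the design, $Z_\ell^*$ and $Z_\ell$ are unit-variance normals, for which the K–L divergence is exactly half the squared difference in means. A quick numerical check of that identity (ours; the two means are arbitrary):

```python
import numpy as np

mu1, mu2 = 0.7, 0.2                      # hypothetical conditional means
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu1) ** 2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(x - mu2) ** 2 / 2) / np.sqrt(2 * np.pi)

kl = np.sum(p * np.log(p / q)) * dx      # numerical D(N(mu1,1) || N(mu2,1))
print(kl, (mu1 - mu2) ** 2 / 2)          # both equal 0.125
assert abs(kl - (mu1 - mu2) ** 2 / 2) < 1e-6
```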

5. Continuous Gaussian process approximation

Brown & Low (1996) showed that a set of $n$ regular nonparametric regression observations is asymptotically sufficient for the continuous Gaussian process
$$Y(t) = \int_0^t f(x)\,dx + \frac{1}{\sqrt{n}}\,W(t),$$
where the error is on the order of $n\int (f - \bar f)^2\,dx$ for $\bar f$ a piecewise constant function equal to the average of the function $f$ on each of $n$ subintervals. See Carter (2006) for further discussion of this problem.

An exactly equivalent experiment to the continuous process $Y$ is the Gaussian sequence experiment that observes a countable sequence of independent normal random variables
$$Z_{k_0\ell} \sim N\Big(\langle f, \phi_{k_0\ell}\rangle,\, \frac{1}{n}\Big); \qquad \zeta_{kj} \sim N\Big(\langle f, \psi_{kj}\rangle,\, \frac{1}{n}\Big)$$
for every $k \ge k_0$ and $j = 1, \ldots, 2^k$. The equivalence can be demonstrated by noting that $\zeta_{kj} = \int \psi_{kj}(t)\,dY(t)$ and
$$Y(t) = \int_0^t \Bigg[\sum_\ell Z_{k_0\ell}\,\phi_{k_0\ell}(x) + \sum_{k \ge k_0}\sum_j \zeta_{kj}\,\psi_{kj}(x)\Bigg] dx.$$
The argument in Brown & Low (1996) could then be seen as arguing that the $Z_{k_0\ell}$ alone are asymptotically sufficient for the sequence experiment.

In the random design experiment, the $Z_\ell$ observations provide the scaling function coefficients because $\bar f_\ell = \sqrt{m}\,a_\ell$. Thus the $(N_\ell, Z_\ell)$ are sufficient statistics for the experiment that observes the bin counts $N_\ell$ and $Z_{k_0\ell} = Z_\ell/\sqrt{n}$, so that
$$Z_{k_0\ell} \mid N_\ell \sim N\Big(\sqrt{mN_\ell/n}\,\langle f, \phi_{k_0\ell}\rangle,\, \frac{1}{n}\Big), \qquad \zeta_{kj} \mid N_\ell \sim N\Big(0,\, \frac{1}{n}\Big).$$
Therefore, the $(Z_\ell, N_\ell)$ are asymptotically sufficient for the experiment that observes the bin counts and
$$Z_{k_0\ell} \mid N_\ell \sim N\Big(a_\ell\sqrt{mN_\ell/n},\, \frac{1}{n}\Big), \qquad \zeta_{kj} \mid N_\ell \sim N\Big(c_{kj}\sqrt{mN_\ell/n},\, \frac{1}{n}\Big),$$
which is equivalent to the Gaussian process
$$dY(t) = f(t)\bar N(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t)$$
with $\bar N(t) = \sqrt{mN_\ell/n}$ when $(\ell - 1)/m \le t < \ell/m$.

The error between the two distributions of sequences of independent normals is
$$D(\mathcal{Q}, \mathcal{Q}^*) = E\Bigg[\frac{m}{2}\sum_{k \ge k_0}\sum_j N_\ell\, c_{kj}^2\Bigg].$$
This is bounded in Section 6.4 by $nm^{-2\alpha}\gamma_\alpha(k_0)$, which is negligible under the same conditions as in Section 3.

As an aside, we note that this approximation is different from Brown et al. (2002) in that it is performed conditional on the $N_\ell$. In Brown et al. (2002), the design is assumed known so that the $N_\ell$ are ancillary, and the resulting approximating experiment replaces the $N_\ell$ in the distribution of $Y(t)$ with the expected values of the $N_\ell$. This makes the bound more difficult in the known case because there is a component accounting for the difference between the $N_\ell$ and their expectations.

There is an entirely Gaussian approximation to the $(Y(t), N_\ell)$ experiment that approximates the bin counts by a continuous Gaussian process. This follows from the results of Nussbaum (1996), Carter (2002), and Brown et al. (2004) on the asymptotic equivalence between density estimation experiments and a continuous Gaussian process. They imply that an experiment that observes the $N_\ell$ is asymptotically equivalent to an experiment that observes
$$N(t) = 2\int_0^t \sqrt{g(x)}\,dx + \frac{1}{\sqrt{n}}\,W_1(t).$$
Thus, the $(Z_\ell, N_\ell)$ (and therefore the original $Y_i$) are equivalent to observing the $N(t)$ and then, conditional on $N(t)$,
$$Y(t) = \int_0^t \bar N(x) f(x)\,dx + \frac{1}{\sqrt{n}}\,W_2(t),$$
where $W_1$ and $W_2$ are independent standard Brownian motions and $\bar N(x)$ is a function of $N(t)$ that is constant on each interval $(\ell - 1)/m \le x < \ell/m$. (We can think of $\bar N(x)$ as the increments of $N(t)$, but the construction follows Brown et al. (2004) and is somewhat more involved.)

This proves the result in Theorem 2.
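On a grid, the experiment $\mathcal{Q}$ of Theorem 2 is straightforward to simulate. The sketch below (ours; the $f$, $g$, and sizes are hypothetical) draws the bin counts from the design density, forms the step function $\bar N(t)$, and generates discretized increments of $N(t)$ and $Y(t)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 1000, 32, 4096                 # T grid cells, m bins, n observations
t = (np.arange(T) + 0.5) / T
dt = 1.0 / T
f = lambda x: np.sin(2 * np.pi * x)      # hypothetical mean function
g = lambda x: 0.5 + 3.0 * x * (1 - x)    # hypothetical design density

# Bin counts N_l as in the regression experiment (approximate draws from g).
X = t[rng.choice(T, size=n, p=g(t) / g(t).sum())]
N = np.bincount(np.minimum((m * X).astype(int), m - 1), minlength=m)
Nbar = np.repeat(np.sqrt(m * N / n), T // m)   # step function Nbar(t)

# Discretized increments of the two processes in experiment Q.
dN = 2 * np.sqrt(g(t)) * dt + rng.standard_normal(T) * np.sqrt(dt / n)
dY = Nbar * f(t) * dt + rng.standard_normal(T) * np.sqrt(dt / n)
print(dN.sum(), dY.sum())                # the values N(1) and Y(1)
```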

6. Bounds and calculations

6.1 Squared expectation.

We are looking for an expression for the expected value
$$E\Bigg[\frac{1}{\sqrt{N_\ell}}\sum_{i=1}^n f(X_i)1\{X_i \in I_\ell\}\,\Bigg|\, N_\ell\Bigg] = \sqrt{N_\ell}\,E\big(f(X) \mid X \in I_\ell\big). \tag{5}$$
We would like to bound the difference between this quantity and $\sqrt{N_\ell}\,\bar f_\ell$. If the density function $g$ is constant over the interval $I_\ell$, then the conditional expectation in (5) is
$$E\big(f(X) \mid X \in I_\ell\big) = \frac{m\int_{I_\ell} f(x)g(x)\,dx}{\bar g_\ell} = \bar f_\ell,$$
and there is no bias. When $g$ is not constant, the above expression can be written as
$$\frac{m\int_{I_\ell} f(x)g(x)\,dx}{\bar g_\ell} = \bar f_\ell + m\int_{I_\ell} f(x)\,\frac{g(x) - \bar g_\ell}{\bar g_\ell}\,dx.$$
By Condition 2, we have that $|g(x) - \bar g_\ell| \le M/m$. Furthermore, $f$ is bounded above by $C$ and $g$ is bounded below by $\epsilon$, so that
$$\Big|\sqrt{N_\ell}\,E\big(f(X) \mid X \in I_\ell\big) - \sqrt{N_\ell}\,\bar f_\ell\Big| \le \sqrt{N_\ell}\,\frac{CM}{m\epsilon}. \tag{6}$$
Therefore, since $E\sum_\ell N_\ell = n$, the contribution to (4) is
$$\sum_{\ell=1}^m E\bigg[N_\ell\,\frac{C^2M^2}{m^2\epsilon^2}\bigg] = \frac{C^2M^2}{\epsilon^2}\,\frac{n}{m^2}. \tag{7}$$

6.2 Variance term.

We also need to bound the contribution from the variance,
$$\operatorname{Var}\Bigg(\frac{1}{\sqrt{N_\ell}}\sum_{i=1}^n f(X_i)1\{X_i \in I_\ell\}\,\Bigg|\, N_\ell\Bigg) = \operatorname{Var}\big(f(X_i) \mid X_i \in I_\ell\big),$$
as there are $N_\ell$ non-zero terms in the sum. We can bound this conditional variance via
$$\operatorname{Var}\big(f(X_i) \mid X_i \in I_\ell\big) \le E\Big[\big(f(X_i) - \bar f_\ell\big)^2\,\Big|\, X_i \in I_\ell\Big].$$
By Condition 2, the ratio $g(x)/\bar g_\ell \le 1 + M/(m\epsilon) < 2$ for $m$ sufficiently large. Thus,
$$E\Big[\big(f(X_i) - \bar f_\ell\big)^2\,\Big|\, X_i \in I_\ell\Big] = \int_{I_\ell} \big(f(x) - \bar f_\ell\big)^2\,\frac{m\,g(x)}{\bar g_\ell}\,dx \le 2m\int_{I_\ell} \big(f(x) - \bar f_\ell\big)^2\,dx. \tag{8}$$
Therefore, the bound on the sum of the variances is
$$\sum_{\ell=1}^m \operatorname{Var}\Bigg(\frac{1}{\sqrt{N_\ell}}\sum_{i=1}^n f(X_i)1\{X_i \in I_\ell\}\,\Bigg|\, N_\ell\Bigg) \le 2m\int_0^1 \big(f(x) - \bar f(x)\big)^2\,dx \le 2\gamma_2(k_0) \tag{9}$$
by Condition 1 and (3).

6.3 Bounding the error in the sufficient statistics

The K–L divergence between the distribution of $X_i$ and $X_i^*$ can be bounded using the same arguments as in Carter (2006). More directly, it follows from Condition 2 that $|1 - g/\bar g| \le M/(m\epsilon)$, and then
$$nD(X_i^*, X_i) = nE\log\frac{g(X)}{\bar g(X)} \le nCE\bigg(1 - \frac{\bar g(X)}{g(X)}\bigg)^2 \le \frac{nCM^2}{m^2\epsilon^2}. \tag{10}$$
The divergence between $Y_i$ and $Y_i^*$ given $X_i$ is
$$D(Y_i^*, Y_i \mid X_i^* = x_i) = \frac{1}{2}\big(f(x_i) - \bar f_\ell\big)^2,$$
assuming that $x_i \in I_\ell$. Taking the expectation of this quantity,
$$\sum_{i=1}^n E\,D(Y_i^*, Y_i \mid X_i = x_i) = \frac{n}{2}\sum_{\ell=1}^m \bar g_\ell \int_{I_\ell} \big(f(x) - \bar f_\ell\big)^2\,dx.$$
The $\bar g_\ell$ terms are bounded above by $M + 1$ as a result of the smoothness condition on $g$ and the fact that they are densities. Thus, identifying the bins with the scaling functions at level $k_0$ so that $m = 2^{k_0}$, Condition 1 and (3) give
$$\frac{n}{2}\sum_{\ell=1}^m \bar g_\ell \int_{I_\ell} \big(f(x) - \bar f_\ell\big)^2\,dx \le \frac{(M+1)n}{2}\int_0^1 \big(f(x) - \bar f(x)\big)^2\,dx \le \frac{(M+1)n}{2m^{2\alpha}}\,\gamma_\alpha(k_0). \tag{11}$$

6.4 Bounding the error in the distribution of the Gaussian processes

The error is
$$D(\mathcal{Q}, \mathcal{Q}^*) = \frac{m}{2}\sum_{k \ge k_0}\sum_j E N_\ell\, c_{kj}^2,$$
where $\ell = \ell(k, j)$ indexes the bin containing the support of $\psi_{kj}$. It is sufficient to bound the sum using
$$m\sum_{k \ge k_0}\sum_j N_\ell\, c_{kj}^2 \le m\sum_{k \ge k_0}\Big(\sup_j c_{kj}^2\Big)\sum_j N_\ell = m\sum_{k \ge k_0}\Big(\sup_j c_{kj}^2\Big)\,2^{k-k_0} n = n\sum_{k \ge k_0} 2^k \sup_j c_{kj}^2 \le nm^{-2\alpha}\sum_{k \ge k_0} 2^{2k(\alpha+1/2)} \sup_j c_{kj}^2 \le nm^{-2\alpha}\,\gamma_\alpha(k_0)$$
by Condition 1. The middle equality holds because each bin contains $2^{k-k_0}$ of the wavelets at level $k$ and $\sum_\ell N_\ell = n$.
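The bias bound (6) is easy to probe numerically. The sketch below (ours, with a hypothetical $f$ and $g$ for which $C = 1$, $M = 3$, and $\epsilon = 1/2$) compares the worst conditional-mean bias over the bins with $CM/(m\epsilon)$.

```python
import numpy as np

m = 32
grid = np.linspace(0, 1, 200001)
f = np.cos(np.pi * grid)                 # hypothetical f with C = 1
g = 0.5 + 3.0 * grid * (1 - grid)        # hypothetical g: eps = 0.5, M = 3

worst = 0.0
for l in range(m):
    mask = (grid >= l / m) & (grid < (l + 1) / m)
    fbar = f[mask].mean()                               # bin average of f
    cond = np.sum(f[mask] * g[mask]) / np.sum(g[mask])  # E[f(X) | X in I_l]
    worst = max(worst, abs(cond - fbar))

C, M, eps = 1.0, 3.0, 0.5
print(worst, C * M / (m * eps))          # observed worst bias vs. the bound
assert worst <= C * M / (m * eps)
```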

References

Brown, Lawrence, Cai, Tony, Low, Mark, & Zhang, Cun-Hui. 2002. Asymptotic equivalence theory for nonparametric regression with random design. Ann. Statist., 30(3), 688–707.

Brown, Lawrence D., & Low, Mark G. 1996. Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist., 24(6), 2384–2398.

Brown, Lawrence D., Carter, Andrew V., Low, Mark G., & Zhang, Cun-Hui. 2004. Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. Ann. Statist., 32(5), 2074–2097.

Carter, Andrew V. 2002. Deficiency distance between multinomial and multivariate normal experiments. Ann. Statist., 30(3), 708–730.

Carter, Andrew V. 2006. A continuous Gaussian approximation to a nonparametric regression in two dimensions. Bernoulli, 12(1), 143–156.

Carter, Andrew V. 2007. Asymptotic approximation of nonparametric regression experiments with unknown variances. Ann. Statist. To appear.

Heyer, Herbert. 1980. Information-type measures and sufficiency. Symposia Mathematica, 25, 25–54.

Kovac, Arne, & Silverman, Bernard W. 2000. Extending the scope of wavelet regression methods by coefficient-dependent thresholding. Journal of the American Statistical Association, 95(449), 172–183.

Le Cam, L. 1964. Sufficiency and approximate sufficiency. Ann. Math. Statist., 35, 1419–1455.

Le Cam, Lucien. 1974. On the information contained in additional observations. Ann. Statist., 2, 630–649.

Le Cam, Lucien. 1986. Asymptotic Methods in Statistical Decision Theory. New York: Springer–Verlag.

Nussbaum, Michael. 1996. Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist., 24(6), 2399–2430.

Reiss, Markus. 2007. Asymptotic equivalence for nonparametric regression with multivariate and random design. Ann. Statist. To appear.

Rohde, Angelika. 2004. On the asymptotic equivalence and rate of convergence of nonparametric regression and Gaussian white noise. Statistics & Decisions, 22(3), 235–243.