Nonparametric Estimation of Quadratic Regression Functionals Li-Shan Huang Department of Statistics Florida State University Tallahassee, F.L. 32306-4330

Jianqing Fan Department of Statistics University of North Carolina Chapel Hill, N.C. 27599-3260

Abstract: Quadratic regression functionals are important for bandwidth selection of nonparametric

regression techniques and for nonparametric goodness-of- t test. Based on local polynomial regression, we propose estimators for weighted integrals of squared derivatives of regression functions. The rates of convergence in mean square error are calculated under various degrees of smoothness and appropriate values of the smoothing parameter. Asymptotic distributions of the proposed quadratic estimators are considered with the Gaussian noise assumption. It is shown that when the estimators are pseudo-quadratic (linear components dominate quadratic components), asymptotic normality with the n?1=2 rate can be achieved.

Key words and phrases. Asymptotic normality, equivalent kernel, local polynomial regression. Abbreviated title: Estimation of Quadratic Regression Functionals.

1 Introduction Let (X ; Y ); : : : ; (Xn; Yn ) be independent and identically distributed observations with the conditional mean E (yjx) = m(x) and variance Var(yjx) = . Consider the problem of estimating functionals of the form 1

1

2

Z

= [m (x)] w(x)dx; ( )

2

= 0; 1; : : : ;

(1.1)

where w(x) is a prescribed nonnegative weight function. These functionals appear in expressions for the asymptotically optimal bandwidth for nonparametric function estimates; see for example Ruppert and Wand (1994) and Fan and Gijbels (1996). Doksum and Samarov (1995) consider particularly the applications of in measuring the explanatory power of covariates in regression. Another area of application is to examine the goodness of t of a ( ? 1)-th degree polynomial model by testing H : = 0, = 1; 2; : : :. When w() is the indicator function of an interval, H tests the polynomial modeling at that given interval. Problems similar to estimation of are considered in a number of papers. Hall and Marron (1987, 1991), Bickel and Ritov (1988), and Jones and Sheather (1991) discuss estimation of R [f (x)] dx based on kernel density estimators, where f () is a probability density function. Birge and Massart (1995) expand the study to the estimation of integrals of smoothed functionals of a density. Adaptive estimation of quadratic functionals can be found in Efromovich and Low (1996). Laurent (1996, 97) use the orthonormal series method to estimate density functionals of the form 0

0

( )

0

2

To whom correspondence should be addressed.

1

R f (x)f 0 (x)w(x)dx and deal with this more general integral form via Taylor's expansions. OpR timality results of estimating [m (x)] dx under the Gaussian white noise model are obtained in ( )

(

)

( )

2

Donoho and Nussbaum (1990) and Fan (1991). In the context of nonparametric regression, the corresponding problem is much less understood. Doksum and Samarov (1995) give estimators of R m (x)f (x)w(x)dx (f () is the design density function) by local constant approximation, i.e. the Nadaraya-Watson estimator. Here we develop estimates of based on local polynomial regression estimators. Ruppert, Sheather, and Wand (1995) consider functionals similar to with weight w(x) = f (x). However, no results as general as shown in this paper have been established and our estimators for are dierent from those of Doksum and Samarov (1995) and Ruppert, Sheather, and Wand (1995). Technically, 's are nonlinear functionals of m(). While most theory focuses on estimation of linear functionals, it is also of theoretical interest to explore the diculty of estimating nonlinear R functionals. In the density estimation setting, the n? = -consistency of estimating [f (x)] dx is established in Hall and Marron (1987, 1991), Bickel and Ritov (1988), and Berge and Massart (1995). Laurent (1996, 97) show that similar results hold for estimated weighted integrals as well. Now a natural question is whether n? = -consistent estimators can be constructed for regression functionals . Note that the results for estimating density integral functionals are based on \diagonal-out" type estimators. However, we do not have a \diagonal-out" estimator for . A similar \bias-corrected" estimator (see (3.5)) can be constructed for regression functionals , but it requires s > 2 + 1=2 to achieve the n? = rate (see Theorem 4.4). Our estimators are based on local polynomial regression estimators. Section 4 describes the asymptotic behavior of the estimators, which are quadratic functionals of weighted regression estimates. Appropriate values of the smoothing parameter and smoothness conditions of m() are given for mean square rates of convergence. To study the asymptotic distributions of the proposed P Pn a Y Y are considered in Section 5, where Y ; : : : ; Y estimators, functionals of the form ni i;j i j n j are independent response variables. As a consequence, we provide further insights into the diculty of estimating quadratic functionals: the n? = rate for estimating can be achieved when estimators are pseudo-quadratic (linear component dominates). For genuine-quadratic estimators (quadratic component dominates), the n? = rate is not attainable. This paper is organized as follows. Section 2 provides the background of local polynomial regression. The proposed estimators of are given in Section 3. Section 4 contains rate-ofconvergence results in mean square error, from both a theoretical and a practical points of view. Asymptotic distributions of the proposed estimators are studied in Sections 5 and 6. Proofs of 2

1 2

( )

1 2

1 2

=1

1

=1

1 2

1 2

2

2

lemmas and theorems are in Section 7.

2 Local polynomial regression Suppose that (Xi; Yi ); i = 1; : : : ; n; have been collected from the model

Yi = m(Xi ) + "i ;

E ("i ) = 0; Var("i ) = ;

(2.1)

2

where m() is an unknown regression function, "i 's are independent and identically distributed error terms, and Xi 's are independent of "i 's. The function of interest could be the regression curve m() or its derivatives. The idea of local polynomial regression is to t locally a low-order polynomial at grid points of interest, with observations receiving dierent weights. Assume that the (p + 1)-th derivative of m() exists, with p a nonnegative integer. For a xed point x, the regression function m() can be locally approximated by

m(z ) m(x) + m0 (x)(z ? x) + : : : + m p (x)(z ? x)p=p!; ( )

for z in a neighborhood of x. This leads to the following least-squares problem: min

n X i

Yi ?

=1

p X

(Xi ? x)

!

=0

2

K Xi h? x ;

(2.2)

where = m (x)= !, = ( ; ; : : : ; p )T , and the dependence of on x is suppressed. The neighborhood is controlled by a bandwidth h, and the weights are assigned via a kernel function K , a continuous, bounded, and symmetric real function which integrates to one. Let ^ = ( ^ ; ^ ; : : : ; ^p ) be the solution to the minimization problem (2.2). Then ! ^ is an estimator for m (x), obtained by tting a p-th degree weighted polynomial in a neighborhood of x. See Fan and Gijbels (1996) for further details. For convenience, some matrix notation is introduced here. Let ( )

0

1

0

1

( )

0 1 (X ? x) B B .. .. X=B . . B @ 1 (Xn ? x) 1

? x)p

: : : (X .. ... . : : : (Xn ? x)p 1

0 K ( X h?x ) B B ... W = diag(K ( Xi h? x ))nn = B [email protected]

1 CC CC ; A

1

3

K ( Xnh?x )

(2.3)

1 CC CC ; A

(2.4)

Y = (Y ; : : : ; Yn)T , and m = (m(X ); : : : ; m(Xn))T . By the standard least-squares theory, the estimator ^ can be written as ^ = (X T WX )? X T WY: (2.5) We further investigate ^ , = 0; : : : ; p; individually. Let e ; = 0; : : : ; p; denote (p + 1) 1 unit vectors having 1 at the ( + 1)-th component, 0 otherwise. Simple algebra shows that 1

1

1

n X ^ (x) = eT ^ = Wn Xi h? x Yi ; i

(2.6)

=1

where

Wn (t) = eT (X T WX )? (1; ht; : : : ; hp tp)T K (t); 1

= 0; : : : ; p:

(2.7)

The weight functions Wn (), = 0; : : : ; p; depend on both x and the design points fX ; : : : ; Xng, and satisfy 1

n X i

=1

Xi ? x 8 < 1 if k = , = (Xi ? x)k Wn : 0 otherwise, for 0 ; k p: h

(2.8)

The dependence of Wn () on x and the design points fX ; : : : ; Xn g is the key to design adaptation and superior boundary behavior of the local polynomial regression. To understand the estimates ^ , = 0; : : : ; p, more intuitively, de ne 1

K (t) = eT S ? (1; t; : : : ; tp)T K (t); 1

= 0; : : : ; p;

(2.9)

R

where S = (i j ? ) p p with k = tk K (t)dt. It is easy to see that the functions K (), = 0; : : : ; p, are independent of x and fX ; : : : ; Xn g and satisfy +

2 ( +1)

( +1)

1

Z

8 < 1 if k = , for 0 ; k p: tk K (t)dt = : 0 otherwise,

(2.10)

The theoretical connection between K () and Wn() is shown in the following lemma, which extends the equivalent kernel results of Ruppert and Wand (1994).

Lemma 2.1 If the marginal density of X , denoted by f (), is Holder continuous on an interval

[a; b] and minx2 a;b f (x) > 0, then [

]

a:s: 0; sup sup jnh Wn (t) ? K (t)=f (x)j ?!

t2 ? ; x2 a;b [

1 1]

[

+1

]

provided that K has a bounded support [?1; 1] and h = h(n) ! 0, nh ! 1.

4

From (2.6) and Lemma 2.1, the local polynomial estimator of (x) = m (x)= ! has the asymptotic expansion: n ^ (x) = (1 + oP (1)) X K Xi ? x Yi : (2.11) nh f (x) i h Indeed (2.11) provides a simple technical device for studying the asymptotic behavior of ^. ( )

+1

=1

3 Estimators By the de nition of in (1.1), a natural approach is to substitute an estimate m^ (x) for m (x). R One candidate is ( !) ^ (x)w(x)dx, where ^ (x) is obtained via tting a local p-th degree polynomial, p , as in (2.6). Since ( !) is a constant for any xed , an estimator ( )

2

( )

2

2

Z

^ = ^ (x)w(x)dx; 2

is considered. Correspondingly, denote

Z Z = (x)w(x)dx = (1!) [m (x)] w(x)dx: ( )

2

2

2

We shall study the asymptotic properties of ^ later, but rst an intuitive discussion may be helpful. In addition to the model assumption in (2.1), we further assume that E ("i ) = 0 and E ("i ) < 1, i = 1; : : : ; n. From (2.6), n X n X ^ = ai;j ( )YiYj ; (3.1) 3

i

=1

where

j

4

=1

Z ai;j ( ) = Wn Xi h? x Wn Xj h? x w(x)dx: Clearly ^ is in a quadratic form of Yi 's and can be written as

(3.2)

^ = Y T A Y with A = (ai;j ( ))nn : The conditional mean and variance of ^ follow at once from (3.1):

E f^ jX ; : : : ; Xng = mT A m + tr(A );

(3.3)

2

1

Varf^ jX ; : : : ; Xng = 4 mT A m + 2 tr(A ) 2

2

4

2

1

+(E (" ) ? 3 ) 4 1

4

n X i

=1

5

ai;i ( ); 2

(3.4)

where tr(A ) denotes the trace of A . Note that the second term in the RHS of (3.3) may be thought of as extra bias in the estimation. This motivates a \bias-corrected" estimate:

XX = ai;j ( )YiYj ? ^ tr(A ); n n

i

=1

j

(3.5)

2

=1

where ^ is an estimator of . The non-negativity of is no longer granted. This can be easily modi ed by taking the estimator = max( ; 0), whose performance is at least as good as . 2

2

+

4 Rates of convergence Asymptotic expressions for the conditional bias and variance of ^ and will be derived in this section. We shall use these to study how the mean square errors (MSE) behave. It is convenient at this point to introduce some conditions. Conditions (A): (A1) The kernel K () is a continuous, bounded, and symmetric probability density function, having a support on [?1; 1]. (A2) The design density function f () is positive and continuous for x 2 [?1; 1], and f () has derivatives of order max(; (p ? + 1)=2) on [?1; 1]. (A3) The weight w() is a bounded and nonnegative function on the interval [?1; 1]. Further w() has bounded derivatives of order max(; (p ? + 1)=2) and w i (1) = w i (?1) = 0; i = 0; : : : ; max(; (p ? + 1)=2). Condition (A1) is assumed for simplicity of proofs, and Condition (A2) is imposed in order to apply the \integration by parts" in deriving the results (see (7.9) and (7.15)). With Condition (A2), more smoothness condition on m is required. Condition (A3) is mainly for eliminating the \boundary eects" so that deeper and cleaner results can be obtained. Without loss of generality, the interval [-1,1] is used in Conditions (A1), (A2), and (A3). The results will hold for any bounded intervals [a ; b ], [a ; b ], and [a ; b ] in Conditions (A1), (A2), and (A3) respectively. The regression function m() will be said to have smoothness of order s if there is a constant M > 0 such that for all x and y in [?1; 1], ( )

1

1

2

2

2

2

jm l (x) ? m l (y)j M jx ? yj ; ( )

( )

where s = l + and 0 < 1:

( )

(4.1)

Let [s] denote the largest integer strictly less than s. Throughout we will use s to denote the smoothness of the regression function m() and p to denote the degree of local polynomial tting. Note that p and s are two independent indices with p and s > . When estimating the -th 6

derivative of m(), it is argued in Ruppert and Wand (1994) and Fan and Gijbels (1996) that odd integers (p ? ) should be used. Thus, (p ? ) is assumed to be odd in this section so that (p +1) is a multiple of 2. For simplicity of notation, it is understood that all expectations in this section are R conditioned on fX ; : : : ; Xng. Finally, let ( )(u) = (t) (u ? t)dt denote the convolution of two real-valued functions and . The asymptotic bias and variance of ^ are described in the following theorem. The proof is given in Section 7. 1

1

1

2

1

2

2

Theorem 4.1 Assume that as n ! 1, h = h(n) ! 0, nh ! 1, and nh 2 +1

s? s

2(

[ ])+1

under Conditions (A), (a) the asymptotic bias is

! 1. Then

E (^ ) ? Bn;

8 < OP (hs s ? ) + (C + oP (1))n? h? ? = : (C + oP (1))hp ? + (C + oP (1))n? h? ? +[ ]

2

1

2

+1

1

2

1

2

if s (p + + 1)=2; if s > (p + + 1)=2;

1

2

1

(4.2) where

C = 2

Z

tp

+1

1

C =

2

Z

2

K (t)dt

Z

p (x) (x)w(x)dx ; +1

Z

[K (t)] dt 2

f (x)? w(x)dx ; 1

(4.3)

(b) the asymptotic variance is Var(^ ) Vn;

8 < (D + oP (1))n? h? ? + OP (n? h? s) = : (D + oP (1))n? h? ? + (D + oP (1))n? 1

1

where

D = 2

2

Z

1

+2

2

ZZ

2

4

1

2

4

1

1

2

Z

f (x)? w (x)dx 2

2

m (x)f ? (x)f ? (y)w(x)w(y)dy ; 2

1

1

Z 4 D = ( !) [G (y)] f (y)dy with G (y) = (y)w(y)f ? (y). 2

( )

2

1

2

[(K K )(z )] dz

if s 2; if s > 2;

2 +

2

2

1

7

(4.4)

Remark 1: Note that in Theorem 4.1 (a), when s (p + 1), the convention of R (x) (x)w(x)dx, which involves m p (x), is p Z dr ( (x)w(x))dx p (x) dx with r = (p ? + 1)=2; r ( +1)

+1

+ +1 2

via integration by parts. The asymptotic minimum mean square error of ^ can be obtained from Theorem 4.1. We state the results in two versions. The theoretical version (part (a)) is that given the degree of smoothness s, what the best rate can be by choosing a suciently large p. Practically (part (b)), given the order of local polynomial tting p, one wants to know what the best rate is if m() is suciently smooth.

Theorem 4.2 Under the assumptions of Theorem 4.1, the asymptotic minimum mean square errors of ^ are achieved as follows. (a) Given the degree of smoothness s > , taking h = O(n? = s s

1 ( +[ ]+1)

8 < OP (n? s s ? = s s ) ^ E ( ? ) = : (D + oP (1))n? + OP (n? 2( +[ ]

if s 2; ) if s > 2:

2 ) ( +[ ]+1)

2

1

2

s s? = s s

2( +[ ]

) and p (2s ? ? 1), then

2 ) ( +[ ]+1)

In particular, when s > 2 + 1, E (^ ? ) = D n? + oP (n? ). (b) Given the order of local polynomial t p , if s > (p + + 1)=2, then 2

2

1

1

8 > OP (n? p? = p ) if C > > > OP (n? p? = p ) if C > > > < (D + oP (1))n? E (^ ? ) = > +OP (n? p? = p ) if C > > > > (D + oP (1))n? > > : +OP (n? p? = p ) if C (2

2 +3) ( + +2)

2(

+1) ( + +2)

(2

2 +3) ( + +2)

2(

+1) ( + +2)

8 =p > < C hOPT = > ?CC n = p : C p ? n 2 1

1 ( + +2)

2 (2 +1)

1 ( +1

1

< 0 and s > 2;

1

> 0 and s > 2;

(4.5)

1

2

by taking

1

< 0 and s 2; > 0 and s 2;

1

2

2

1

1 ( + +2)

if C < 0; if C > 0:

(4.6)

1

1

)

In particular, E (^ ? ) = D n? + oP (n? ), if p > 3 ? 1 for the case of C < 0, and p > 3 for C > 0. 2

2

1

1

1

1

Remark 2: The bandwidth hOPT in (4.6) minimizes the mean square error of ^ . Particu-

larly, when D n? is the leading term in (4.5), hOPT minimizes the second-order terms of the MSE 2

1

8

of ^ . In this case, any h satisfying

h = o(n? =

p ?

1 (2( +1

))

) and n? =

1 (4 +2)

= o(h)

can be the optimal smoothing parameter. Thus, the choice of bandwidth is not sensitive in this case. Bickel and Ritov (1988) and Laurent (1996, 97) give the ecient information bound for estimating respectively unweighted and weighted integrals of squared density derivatives. For the current regression setting, we conjecture that D n? is the semiparametric information bound. In other words, ^ is ecient when the degree of smoothness is suciently large. But the construction of the estimator ^ is conceptually simpler than that of Bickel and Ritov (1988). Note that C n? h? ? in the asymptotic bias (4.2) converges to zero slower than the squareroot of D n? h? ? in the variance expression (4.4). In kernel density estimation, Jones and Sheather (1991) argue in favor of estimators of the type ^ , since one can choose an ideal bandwidth such that the leading bias terms (analogous to the second expression in (4.2)) cancel altogether. However, this bias reduction technique is not applicable here, since C in (4.3) is not necessarily negative and C is always positive. To correct the extra bias term, we use the estimator de ned in (3.5). Suppose that can be estimated at a O(n) rate, i.e. E f(^ ? ) g = O(n). It will be seen in Theorem 4.4 that the rate n = o(h = ) is fast enough for the purpose of bias correction. The following theorem gives the asymptotic bias and variance of . 1

2

1

2

1

2

4

2

1

1

1

2

2

2

2

2

2

1 2

Theorem 4.3 Under the assumptions of Theorem 4.1,

(a) the asymptotic bias of is

E ( ) ? bn;

8 s s? < OP h if s (p + + 1)=2; + nn? h? ? = : (C + oP (1))hp ? + OP (nn? h? ? ) if s > (p + + 1)=2; +[ ]

2

1

2

1

+1

1

1

2

1

(b) the asymptotic variance of is Var( ) vn;

8 < (D + oP (1))n? h? ? + OP (n? h? s + nn? h? ? ) if s 2; = : ? ? ? ? ? ? ? ) if s > 2: (D + oP (1))n h + (D + oP (1))n + OP (nn h 1

1

2

4

1

2

4

1

1

2 +

2

1

2

9

2

4

2

2

2

4

2

(4.7)

We now summarize some results from Theorem 4.3. Again they are stated in both theoretical (part (a)) and practical (part (b)) versions.

Theorem 4.4 Under the assumptions of Theorem 4.1, the asymptotic minimum mean square errors of are achieved as follows. (a) Given s, taking h = O(n? = s s

2 (2 +2[ ]+1)

) and p (2s ? ? 1), if n = o(h = ), then 1 2

8 < O (n? s s ? = s s ) E ( ? ) = : P (D + oP (1))n? + OP (n? s 4( +[ ]

if s 2; ) if s > 2:

2 ) (2 +2[ ]+1)

2

1

2

s? = s

4( +[ ]

s

(4.8)

2 ) (2 +2[ ]+1)

In particular, E ( ? ) = D n? + oP (n? ) when s > 2 + 0:5. (b) Given p, if s > (p + + 1)=2 and n = o(h = ), then 2

2

1

1

1 2

8 ? p ? = p 2;

4( +1

=

p

+ 1) hopt = 2CD ((4 n? = p p + 1 ? ) In particular, E ( ? ) = D n? + oP (n? ) when p > 3 ? 0:5. 1 (2 +2 +3)

1

2

1

2 (2 +2 +3)

2 1

2

(4.9)

) (2 +2 +3)

:

(4.10)

1

Remark 3: As shown in Theorem 4.4, the n? = rate is achievable and in this case, the 1 2

smoothing parameter aects only the second-order terms of the MSE. In fact, when D n? is the dominant term in (4.9), the n? = rate of convergence is attained for any bandwidth satisfying 2

1

1 2

h = o(n? =

p ?

1 (2( +1

))

) and n? =

1 (4 +1)

= o(h):

Remark 4: Theorem 4.4 (a) shows that if s > 2 + :5, can be estimated at the n? =

1 2

rate. This minimal smoothness condition is slightly stronger than s > 2 + 1=4 for estimating functionals of a density, as shown in Bickel and Ritov (1988), Laurent (1996, 97) and Birge and Massart (1995). One can possibly follow a similar idea to Laurent (1996) to construct better biascorrected estimators, at expenses of getting much more complex estimators in our setting. We have not chosen this option because we are primarily interested in understanding the performance of the intuitively natural estimators. In practical implementation one should attempt to estimate and hOPT or hopt. The parameter can be estimated at a n? = rate using the estimators of Rice (1984) and Hall, Kay, and Titterington (1990), for example. The problem of estimating the ideal bandwidth for ^ or is beyond the scope of this chapter, but (4.6) and (4.10) provide a guideline. A simple rule of 2

2

1 2

10

thumb is the following. Fit a global polynomial of order (p + + 5)=2 to obtain an estimate for C (see Remark 1) by regarding the regression function as a polynomial function; estimate f (x) and plug the resulting estimate in C or D . Then an estimate of hOPT or hopt is formed. This \plug-in rule" is expected to work reasonably well in many situations, since ^ and are robust against the choice of bandwidth (see Remarks 2 and 3). Ruppert, Sheather, and Wand (1995) consider estimation of 1

2

Z

1

rs = m r (x)m s (x)f (x)dx; r; s 0 and r + s even. They propose an estimator

( )

(4.11)

( )

n X ^rs = n? m^ r (Xi; g)m^ s (Xi; g); 1

i

=1

where m^ r (; g) and m^ s (; g) are obtained via tting a p-th degree local polynomial with a bandwidth g. Also p ? r and p ? s are both odd. Comparing with their estimator, our error criterion (weighted mean integrated square error) is more general and the estimators ^ and are dierent from ^rs .

5 Asymptotic distributions of quadratic functionals The asymptotic distribution of

XX ^ = ai;j Yi Yj n n

i

=1

j

(5.1)

=1

will be considered in this section. For simplicity of technical arguments, we restrict attention to the Gaussian model,

Yi = m(Xi ) + "i ;

"i i.i.d. N (0; );

i = 1; : : : ; n:

2

(5.2)

This Gaussian noise assumption can be removed with additional work on proofs, e.g. via applying the martingale central limit theorem. The ai;j 's in (5.1) possibly depend on n and the design points fX ; : : : ; Xn g. Let m be a n 1 column vector with entries m(Xi ), i = 1; : : : ; n, " = (" ; : : : ; "n)T , and A = (ai;j )nn. Then ^ can be written as 1

1

^ = Y T AY = mT Am + 2mT A" + "T A":

(5.3)

The mean and variance of ^ can be calculated directly from (5.3):

E f^jX ; : : : ; Xng = mT Am + tr(A);

(5.4)

2

1

Varf^jX ; : : : ; Xng = 2 tr(A ) + 4 mT A m: 4

1

11

2

2

2

(5.5)

Conditioned on fX ; : : : ; Xn g, 2mT A" and "T A" are the stochastic terms in (5.3). It is easy to see that 2mT A" is linear in "i 's and contributes 4 mT A m to the variance in (5.5), while "T A" is quadratic in "i 's with variance 2 tr(A ). The objective is to show that ^ is asymptotically normally distributed under some conditions. Two cases will be considered, depending on whether the linear term or the quadratic term dominates (see (5.3) and (5.5)): 1

2

4

2

2

P 0 (mT A" dominates), (i) tr(A )=mT A m ! 2

2

P 1 ("T A" dominates). (ii) tr(A )=mT A m ! 2

2

If the linear term dominates (case (i)), i.e. ^ is \pseudo-quadratic", then the normality of ^ follows directly under the Gaussian noise assumption. The distribution theory of quadratic forms in normal variables will be used to prove the asymptotic normality for case (ii), where ^ is \genuine-quadratic". As will be seen in Section 6, the separate treatments of these two cases are natural, corresponding respectively to root-n and non-root-n rates of convergence. There is a \boundary case" when tr(A ) and mT A m are of the same order. That may be handled with additional work. The general theory of quadratic functionals in Whittle (1964) and Khatri (1980) is useful, but not directly applicable to the situation described here. We provide the following result. 2

2

Theorem 5.1 Under the model assumption in (5.2), if ^ is pseudo-quadratic (case (i)), then the conditional distribution of ^ given fX ; : : : ; Xng is asymptotically normal: C N mT Am + tr(A); 2 tr(A ) + 4 mT A m ; Y T AY ?! (5.6) 1

2

4

2

2

2

C denotes convergence conditioned on fX ; : : : ; X g. Similarly, (5.6) holds for where the symbol ?! n case (ii) if 1

P 0; as n ! 1: tr(A )=(tr(A )) ! 4

2

2

More precisely, the conditional asymptotic normality in (5.6) means that ( T ) T Am ? tr(A) Y AY ? m P (t) 8t; P (2 tr(A ) + 4 mT A m) = tjX ; : : : ; Xn ?! (5.7) where () is the cumulative distribution function of the standard normal random variable. Expression (5.7) also implies the unconditional asymptotic normality by the Dominated Convergence Theorem: ) ( T T Am ? tr(A) Y AY ? m for all t: P (2 tr(A ) + 4 mT A m) = t ?! (t) 2

4

2

2

2

1 2

2

4

2

2

2

1 2

12

1

6 Asymptotic normality We now establish the asymptotic normality of ^ and . In addition to Conditions (A), the bandwidth is assumed to satisfy Condition (B): h = h(n) ! 0 and nh + log(h) ! 1 as n ! 1. The requirement (B) is minor in the sense that if a local smoothing neighborhood contains at least (log n) data points (h > n? (log n)k , k > 1), then Condition (B) is satis ed. We need the following lemmas to show the asymptotic normality of ^ and . 1

Lemma 6.1 Suppose Zn; , Zn; , : : : is a sequence of random variables, having a multinomial distribution (n; pn; ; pn; ; : : :) with parameters satisfying pn;j 0; j = 1; 2; : : : and maxj pn;j < ch, 1

1

2

2

where c is a constant. Under Condition (B),

P fmax Z > nhbng ?! 0; j n;j

as n ! 1;

for any sequence bn such that bn ! 1.

Lemma 6.2 Let max ( ) be the maximum eigenvalue of A = (ai;j ( ))nn, where ai;j ( ) is de ned

in (3.2). Under Conditions (A) and (B),

max ( ) ?! P 0: (6.1) tr(A ) Since the design matrix X (see (2.3)) and the matrix A are the same for (X ; (Y )),: : :,(Xn; (Yn )) with () a known function, Lemmas 6.1 and 6.2 are applicable to the case of estimating functionals of the form Z [m (x)] w(x)dx; 2

2

1

( )

1

2

where m (x) = E ( (y)jx). From Lemmas 6.1 and 6.2, the asymptotic normality of ^ and follows.

Theorem 6.1 Under the Gaussian model assumption in (5.2), if Conditions (A) and (B) are satis ed and nh ! 1, nh s? s ! 1, as n ! 1, then the conditional distributions of ^ 2 +1

2(

[ ])+1

and are asymptotically normal:

C N (B ; V ); (^ ? ) ?! n; n; C N (b ; v ); ( ? ) ?! n; n;

where Bn; , Vn; , bn; and vn; are given in Theorems 4.1 and 4.3.

13

In particular, the n? = -consistency is achieved with appropriate smoothing parameters (see Theorems 4.2 and 4.4). (a) Given s > 2 + 1, taking h = O(n? = s s ) and p (2s ? ? 1), 1 2

1 ( +[ ]+1)

p ^ C N (0; D ): n( ? ) ?! 2

(b) Given p such that

8 < p > 3 ? 1 : p > 3

if C < 0; if C > 0; 1

1

if s > (p + + 1)=2 and h = hOPT given in (4.6), then

p ^ C N (0; D ): n( ? ) ?! 2

(c) Given s > 2 + 0:5, taking h = O(n? = s s

) and p (2s ? ? 1), if n = o(h = ), then

2 (2 +2[ ]+1)

1 2

pn( ? ) ?! C N (0; D ): 2

(d) Given p > 3 ? 0:5, taking h = hopt in (4.10), if s > (p + + 1)=2 and n = o(h = ), then 1 2

pn( ? ) ?! C N (0; D ): 2

Note that the above n? = rate of convergence is achieved when estimators are pseudo-quadratic, since D n? is one of the terms in the asymptotic variance of 2mT A ". Nonparametric estimation of has important implications for the choice of the smoothing parameter of local regression. Based on the theory in this paper, Fan and Huang (1996) show that the relative rate of convergence of the bandwidth selector proposed by Fan and Gijbels (1995) for tting local linear regression is of order n? = if the locally cubic tting is used in the pilot stage, and the rate can be improved to n? = when a local polynomial of degree 5 is used in the pilot tting. The theoretical results appear to be new in the nonparametric regression literature. 1 2

2

1

2 7

2 5

7 Proofs Proof of Lemma 2.1: Stone (1980) shows that X ? x n X a.s. 0; j = 0; : : : ; 2p: j ? j ? ? ? f (x)j j ?! (Xi ? x) h K i h sup jn h x2 c;d [

]

1

1

i

=1

14

Substituting f (x)j , j = 0; : : : ; 2p, for the corresponding moments in the matrix (X T WX ), we obtain a.s. 0; jjn? h? H ? (X T WX )H ? ? f (x)S jj1 ?! 1

1

1

(7.1)

1

where H = diag(1; h; h ; : : : ; hp) p p . Lemma 2.1 follows from (2.7) and (7.1). 2 Proof of Theorem 4.1: We begin by estimating the conditional bias using (3.3). From Lemma 2.1 and Condition (A1), it can be shown that Xi ? x n Z X n tr(A ) = W w(x)dx h i 2

( +1)

( +1)

2

=1

n Z 1 X Xi ? x + oP (1)I X ?h;X h (x) w(x)dx K i i f (x) h

= (nh1 ) i +1 2

2

[

+

]

=1

Xi ? x n Z 1 X (1 + o (1)) P = (nh ) f (x) K h w(x)dx: i

(7.2)

2

+1 2

2

=1

Since Var

n X i

=1

K

2

Xi ? x ! h

= O(nh) = o

nE K

2

Xi ? x ! 2

h

;

the summation in (7.2) can be replaced by its expectation: Z Z 1 ? tr(A ) = nh K (t)dt f (x) w(x)dx + oP (n? h? ? ): We now evaluate the rst term of (3.3). A Taylor series expansion gives 2

1

1

2

1

2 +1

n X i

=1

Wn

(7.3)

X ? x X ? x n X i i n 0 W m ( X ) = i h h [m(x) + m (x)(Xi ? x) + : : :] i

=1

= (x) + r(x);

(7.4)

where r(x) denotes the remainder terms. For the case that s > (p + + 1)=2, assume initially that m() has (p + 1) bounded derivatives. It is shown in Ruppert and Wand (1994) that

r(x) = hp ? +1

Hence

mT A m =

Z

Z

tp

+1

K (t)dt p (x) + oP (hp ? ):

(x)w(x)dx + 2hp ? 2

Z

(7.5)

+1

+1

+1

Z

tp K (t)dt +1

p (x) (x)w(x)dx + oP (hp ? ): +1

15

+1

(7.6)

R

As noted in Remark 1, p (x) (x)w(x)dx depends only on the rst ((p + + 1)=2) derivatives of m(). Since any function with ((p + + 1)=2) derivatives can be approximated arbitrarily close by a function with (p + 1) derivatives, (7.6) must hold for s > (p + + 1)=2. Similar arguments can be seen in Hall and Marron (1991), page 170. In the case that s (p + + 1)=2, set l = [s] + [s ? ]. Assume initially that m l exists and is bounded (note that l < (p + 1)). From a Taylor expansion with an integral remainder, +1

( )

m(Xi ) = m(x) +

l? X 1

k

(Xi ? x)k m k (x)=k! ( )

=1

i ? x) + (X (l ? 1)!

lZ

1

(1 ? t)l? m l (x + t(Xi ? x))dt; 1

0

(7.7)

( )

for i = 1; : : : ; n. It follows from (7.4) and (7.7) that

mT A m =

Z

Z

(x)w(x)dx + (x)r(x)w(x)dx + OP (h s? ): 2

2(

(7.8)

)

The second term in the RHS of (7.8) can be written as

X ? x n Z (1 ? t)l? Z d s? X i n l ( x ) W ( X ? x ) w ( x ) i (l ? 1)! dx s? h i m s (x + t(Xi ? x)) ? m s (x) dxdt X ? x Z (1 ? t)l? Z d s? i l ? ( x ) K = h? ? ( X ? x ) w ( x ) f ( x ) E f i (l ? 1)! h dx s? m s (x + t(Xi ? x)) ? m s (x) dxgdt + OP (hl? an) Z )l? Z Z d s? (x)K (u)ulw(x)f ? (x) = hl? (1(l??t1)! dx s? m s (x + htu) ? m s (x) f (x + hu)dudxdt + OP (hl? an ); (7.9) 1

1

=1

[

]

[

0

]

([ ])

([ ])

1

1

[

]

1

1

[

0

([ ])

1

]

([ ])

1

[

]

1

[

0

([ ])

]

([ ])

where an = h + pnh is obtained by more careful evaluation of the oP (1) term in Lemma 2.1. It follows from (7.8), (7.9), and (4.1) that if s (p + + 1)=2, 1

mT A m

=

Z

(x)w(x)dx + OP (hs s ? ): 2

+[ ]

2

The combination of (7.3), (7.6), and (7.10) gives the bias expression in Theorem 4.1 (a).

16

(7.10)

Next, we compute the asymptotic conditional variance of ^ . In the sequel, we treat explicitly the three terms given in the RHS of (3.4). The rst term (omitting the factor 4 ) is

mT A

m

n X n X

=

2

j j

=

X i k6 j

=

X i6 j

m (Xi ) 2

i

=1 =1

i;j;k

Z

k

=

2

i6 j n h 4

4 +4

=

all dierent i k6 j

Wn

X

X

+

i j6 k i j k

Xj ? x h

:

(7.11)

= =

= =

w(x)dx

2

K Xi h? x K Xj h? x w(x)f ? (x) 2

[

Xi;h Xi (x)I ?h Xj;h Xj (x))dx

+

= n? h? ? 4

+2

= =

h

(1 + oP (1)I ?h Z Z Z 2

X

+

Xi ? x

Wn

X m (Xi ) Z

ai;j ( )m(Xi)ak;j ( )m(Xk )

=1

X

=

= =

ai;j ( )m(Xi)

2

=1

n X n X n X

=

Now

i

=1

!

2

2

+

]

[

+

+

o

2

]

K (u )K u + x ?h y m (x + hu )f (x + hu ) 2

1

1

1

1

o 1 + oP (1)I ? ; (u )I ? ? x?y =h; ? x?y =h (u ) du Z x ? y K (u )K u + h f (y + hu ) 1 + oP (1)I ? ; (u ) o I ? ? x?y =h; ? x?y =h (u ) du w(x)w(y)f ? (x)f ? (y)dxdy [

1

1 1]

2

[

1

(

)

1

(

)

]

1

2

2

1

[

2

[

1

(

)

1

(

+OP (n? h? ? = ) 3

4

4

ZZ

1

i;j;k

all dierent

2

2

2

Z

(K K (z )) dz 2

m (x)f ? (x)f ? (y)w(x)w(y)dxdy: 2

1

By (7.4) and Lemma 2.1,

X

]

2

3 2

= n? h? ? (1 + oP (1)) 2

)

1 1]

= n? h? 1

2

ZZZ

(7.12)

1

( (y) + r(y))( (y + hz ) + r(y + hz ))K(t)

K (t + z )f (y + h(z + t))w(y)w(y + hz )f ? (y + hz ) 1

17

p f ? (y)dtdzdy + OP ( nhn? h? ? ) ZZZ 1

2

= n? h? 1

2

2

g (y)g (y + hz )K(t)K(t + z )

2

p f (y + h(z + t))dtdzdy + OP ( nhn? h? ? ) 2

2

(7.13)

2

with g (y) = ( (y)+ r(y))w(y)f ? (y). For the case that s > 2 , it is assumed initially that s > 3 . Then write 1

g (y + hz ) = and

f (y + h(z + t)) =

X 2

l

( )

(7.14)

2

=0

X k

gl (y)hlz l =l! + o(h )

f k (y)hk (z + t)k =k! + o(h ):

(7.15)

( )

=0

Note that from (2.10), for nonnegative integers l; k, and l + k 2 ,

8 > (?1) > ZZ < z l (z + t)k K (t)K(t + z )dtdz = > > :0

l l?

if l ; k ; and l + k = 2; elsewhere.

!

!(

)!

(7.16)

Combining (7.13), (7.14), (7.15), and (7.16),

! Z X ( ? 1) ! i ? i = n( !) g (y) g (y)f (y) i!( ? i)! + oP (n? ) i i;j;k all dierent X

( + )

(

1

)

2

=0

Z d [g (y)f (y)]dy + o (n? ): = n(?(1)!) g (y) dy P ( )

(7.17)

1

2

Applying integration by parts and Condition (A3) to (7.17), it follows that

Z 1 = n( !) [g (y)] f (y)dy + oP (n? ) i;j;k all dierent X

( )

2

1

2

Z = n(1!) [G (y)] f (y)dy + oP (n? ) ( )

2

2

1

(7.18)

with G (y) = (y)w(y)f ? (y). Expression (7.18) only involves derivatives of m() up to the 2 -th, and hence must hold for s > 2 . To modify arguments from (7.13) to (7.18) for the case of s 2 , the change required is in (7.14) and (7.15). The initial assumption is that m() has (2[s]? ) bounded derivatives, and Taylor expansions of orders 2([s] ? ) and (2 ? [s]) for (g (y + hz ) ? g (y)) and 1

18

f (y + h(z + t)) respectively. Then, applying integration by parts ([s] ? ) times, together with the smoothness de nition (4.1), we obtain

X

= OP (n? h? s): 1

2 +

all dierent P P P The other two terms in (7.11), 2 i j 6 k and i j k are of smaller order than i;j;k all dierent P and i k6 j . The second term in the RHS of (3.4) (omitting the factor 2 ) is i;j;k

= =

= =

= =

4

tr(A ) = 2

n X i

ai;i ( ) + 2

=1

X i6 j

ai;j ( ) 2

=

I +I ; 1

2

where I and I denote the diagonal and non-diagonal terms respectively. Using similar arguments as in establishing the term mT A m, we obtain that I = OP (n? h? ? ) and X Z Xi ? x Xj ? x ? n ( n ? 1)(1 + o (1)) P I = E K nh h K h f (x) w(x)dx i6 j 1

2

2

3

1

4

2

2

2

2

4

4 +4

= n? h? ? 2

4

1

ZZ

=

f(K K )(z )g dzf (x)? w (x)dx + oP (n? h? ? ): 2

Consequently, tr(A ) = n? h? ? (1 + oP (1)) 2

2

4

1

Z

2

2

2

[(K K )(z )] dz 2

?

Z

4

1

f (x)? w (x)dx : 2

2

(7.19)

Observe that the last term (omitting the factor E (" ) ? 3 ) in the RHS of (3.4) is n X i

4

4 1

ai;i ( ) = I = oP (tr(A )): 2

(7.20)

2

1

=1

The asymptotic variance follows from (7.12), (7.18), (7.19) and (7.20). 2 Proof of Theorem 4.3: This theorem follows directly by the following two expressions.

E ( ) = E (^ ) ? E (^ tr(A )) 2

= E (^ ) ? tr(A ) + E (^ ? )tr(A ) 2

2

2

= E (^ ) ? tr(A ) + OP (nn? h? ? ): 2

1

Var( ) = Var(^ ? (^ ? )tr(A )) 2

2

= Var(^ ) + OP (nn? h? ? ): 2

19

2

4

2

2

1

2

P !

Proof of Theorem 5.1: If the linear term dominates, i.e. tr(A )=mT A m 0, then "T A" ? E ("T A") = oP (mT A m) = . Conditioning on fX ; : : : ; Xn g, it is trivial that mT A" is normally distributed with mean 0 and variance mT A m. This implies 2

2

1 2

2

1

2

2

(4 mT A m) = (^ ? E (^)) 2

2

1 2

C N (0; 1): = (4 mT A m) = f2mT A" + ("T A" ? E ("T A"))g ?! 2

2

1 2

P 1, we need only to consider "T A". Factor A into a product For case (ii), tr(A )=mT A m ! ?T ?, where ? is an orthonormal matrix and is a diagonal matrix with eigenvalues of A, i 's as its entries. Now 2

2

"T A" = T =

n X i

i i

2

=1

P 0; as n ! 1, with = ?" = ( ; : : : ; n): Note that by the assumption that tr(A )=(tr(A )) ! n X i

4

1

E c ji i ? i j = 2

4

n X i

i E c (i ? 1) = 60 4

2

4

4

i

2

i = oP ([tr(A )] ); 4

2

2

=1

=1

=1

n X

2

where E c denotes the conditional expectation given fX ; : : : ; Xng. Hence the Lindeberg condition P for the asymptotic normality of ni i i holds: Pn E c j ? j 15 Pni i ?! P 0; i i i i P = [Var( ni i i jX ; : : : ; Xn )] [tr(A )] and so "T A" is asymptotic normally distributed. Consequently, the asymptotic normality of ^ for cases (i) and (ii) is proved. 2 Proof of Lemma 6.1: Note that 1

2

=1

4

2

4

=1

=1

4

=1

2

2

1

P (max Z > nhbn ) j n;j

2

X j

2

P (Zn;j > nhbn ):

(7.21)

Clearly, it is sucient to show that [Pnj P (Zn;j > nhbn )] ! 0. Given an arbitrary positive constant d, =1

P (Zn;j > nhbn) = P (exp(Zn;j d) > exp(nhbnd))

exp(?nhbn d)E fexp(Zn;j d)g = exp(?nhbn d)(1 ? pn;j + ed pn;j )n

exp(?nhbn d + npn;j ed ): 20

(7.22)

Take d = log( phbn;jn ) in (7.22). If n is suciently large, then

pn;j e nhbn

P (Zn;j > nhbn )

hbn

:

(7.23)

Since maxj pn;j ch, the RHS of (7.23) is bounded by (ch)nhbn?

1

ce nhbn e nhbn 1 pn;j = ch b pn;j : hbn n

Now bn(log(ce) ? log(bn)) ?1 for suciently large n, which together with Condition (B) give X 1 1 nh X P (Zn;j > nhbn ) ( ) pn;j j ch e j 1 ( 1 )nh ?! 0: = ch e

(7.24)

2

The conclusion follows from (7.21) and (7.24). Proof of Lemma 6.2: Observe that from (7.19) tr(A ) c n? h? ? + oP (n? h? ? ) 2

2

4

1

2

4

(7.25)

1

1

for some nonnegative constant c . It suces to consider the rate of max ( ). By basic theory of linear algebra, n X n X max ( ) = sup ai;j ( )bibj : (7.26) 1

kbk i j =1 =1

From Lemma 2.1, for n suciently large, n X n X i

=1

j

ai;j ( )bibj (nh ) +1

=1

Z

2

2

f (x)?

2

=1

X ? x ! K i h bi w(x)dx:

n X

2

i

=1

Under Conditions (A), the RHS of (7.27) is further bounded by

c n? h? ? 2

2

2

Z X n

2

j bi j I Xi?h;h [

i

!

Xi (x)

+

2

]

(7.27)

dx

=1

n Z X = c n? h? ? bi I Xi?h;Xi h (x)dx 2

2

2

2

2

[

i

+

]

=1

+

X i6 j

1 h (x)dxA

Z

j bi jj bj j I Xi?h;Xi h (x)I Xj ?h;Xj [

+

]

[

+

]

=

c n? h? ? 2

2

2

2

1 0 n X X @2h bi + 2h I jXi ?Xj j h j bi jj bj jA 2

i6 j

i

=1

=

21

[

2

]

(7.28)

for some nonnegative constant c . We use the following argument to evaluate the double summation in (7.28). Let Ik = (4(k ? 1)h; 4kh], k 2 Z be a partition of (?1; 1), and nk be the number of design points Xi that fall in the interval Ik . Since I jXi?Xj j h = 0 for Xi 2 Ik , Xj 2 Ik ; and jk ? k j > 1, 2

[

X i;j

2

1

]

0 1 X @ I jXi?Xj j h jbi jjbj j [

2

]

X

k ?1 Xi;Xj 2Ik [I(k+1)

1

2

1 1 0 X @ jbijjbj jA =

2

X

2

k ?1 Xi2Ik [I(k+1)

=

1 jbijA :

(7.29)

=

Put N = maxk2Z (nk + nk ) (nk + nk ) for some k 2 Z. The maximum of the RHS of (7.29) p over kbk = 1 is attained at jbi j = 1= N for those indices such that Xi 2 Ik [ Ik . Therefore, +1

max kbk

=1

0

0 +1

0

X

0

I jXi?Xj j h jbi jjbj j N: [

i;j

0 +1

2

]

Note that fnk g have a multinomial distribution and hence by Lemma 6.1, N = OP (nhn), for any sequence n ! 1. Taking n = h? , it follows from (7.28) that 1 4

max ( ) c n? h? ? OP (h + nh = ): 2

2

2

7 4

2

(7.30)

Combine (7.25) and (7.30) to conclude that

max ( ) = O (n? h? + h = ) ?! 0: P tr(A ) 2

2

1

1 2

2

Proof of Theorem 6.1: From Theorem 5.1, one only needs to verify tr(A ) ?! P 0; (tr(A ))

as n ! 1:

4

2

2

2

Let i ( ); i = 1; : : : ; n be eigenvalues of A and max ( ) be the maximum eigenvalue among all i ( )'s. Then,

P

tr(A ) = P ni i ( ) Pmax ( ) = max ( ) ?! P 0; as n ! 1: (7.31) n ( ) (tr(A )) ( ni i ( )) tr( A ) i i Hence the asymptotic normality of ^ and follows. 2 Acknowledgments. This work was part of Li-Shan Huang's doctoral dissertation at University of North Carolina. She would like to thank the Department of Statistics, University of North Carolina, National Institute of Statistical Sciences, Professors M.R. Leadbetter and R.L. Smith for the nancial support during her graduate studies. 4

2

2

4

=1

2

2

2

=1

2

2

2

=1

22

REFERENCES Berge, L. and Massart, P. (1995) Estimation of integral functionals of a density. Ann. Statist., 23, 11-29. Bickel, P.J. and Ritov, Y. (1988) Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya, 50-A, 381-393. Doksum, K. and Samarov, A. (1995) Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Ann. Statist., 23, 1443-1473. Donoho, D.L. and Nussbaum, M. (1990) Minimax quadratic estimation of a quadratic functional. J. Complex., 6, 290-323. Efromovich, S. and Low, M. (1996) On Bickel and Ritov's conjecture about adaptive estimation of the integral of the square of density derivative. Ann. Statist., 24, 682-686. Fan, J. (1991) On the estimation of quadratic functionals. Ann. Statist., 19, 1273-1294. Fan, J. and Gijbels, I. (1995) Data-driven bandwidth selection in local polynomial tting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Sec. B, 57, 371-394. Fan, J. and Gijbels, I. (1996) Local polynomial modelling and its applications. London: Chapman and Hall. Fan, J. and Huang, L.-S. (1996) Rates of convergence for the pre-asymptotic substitution bandwidth selector. Technical report #912, Department of Statistics, Florida State University. Hall, P., Kay, J.W. and Titterington, D.M. (1990) Asymptotically optimal dierence-based estimation of variance in nonparametric regression. Biometrika, 77, 521-528. Hall, P. and Marron, J.S. (1987) Estimation of integrated squared density derivatives. Stat. Probab. Lett., 6, 109-115. Hall, P. and Marron, J.S. (1991) Lower bounds for bandwidth selection in density estimation. Probab. Th. Rel. Fields, 90, 149-173. Jones, M.C. and Sheather, S.J. (1991) Using non-stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives. Stat. Probab. Lett., 11, 511-514. Khatri, C.G. (1980) Quadratic forms in normal variables. In P.R. Krishnaiah (ed.), Handbook of Statistics, Vol. I, 443-469. New York: North-Holland. Laurent, B. (1996) Ecient estimation of integral functionals of a density. Ann. Statist., 24, 659-681. Laurent, B. (1997) Estimation of integral functionals of a density and its derivatives. Bernoulli, 3, 181-211. Rice, J. (1984) Bandwidth choice for nonparametric regression. Ann. Statist., 12, 1215-1230. Ruppert, D., Sheather, S.J. and Wand, M.P. (1995) An eective bandwidth selector for local least squares regression. Jour. Amer. Statist. Assoc., 90, 1257{1270. 23

Ruppert, D. and Wand, M.P. (1994) Multivariate locally weighted least squares regression. Ann. Statist., 22, 1346-1370. Stone, C.J. (1980) Optimal rates of convergence for nonparametric estimators. Ann. Statist., 8, 1348-1360. Whittle, P. (1964) On the convergence to normality of quadratic forms in independent random variables. Theory Prob. Appl., 9, 103-108.

24

Jianqing Fan Department of Statistics University of North Carolina Chapel Hill, N.C. 27599-3260

Abstract: Quadratic regression functionals are important for bandwidth selection of nonparametric

regression techniques and for nonparametric goodness-of- t test. Based on local polynomial regression, we propose estimators for weighted integrals of squared derivatives of regression functions. The rates of convergence in mean square error are calculated under various degrees of smoothness and appropriate values of the smoothing parameter. Asymptotic distributions of the proposed quadratic estimators are considered with the Gaussian noise assumption. It is shown that when the estimators are pseudo-quadratic (linear components dominate quadratic components), asymptotic normality with the n?1=2 rate can be achieved.

Key words and phrases. Asymptotic normality, equivalent kernel, local polynomial regression. Abbreviated title: Estimation of Quadratic Regression Functionals.

1 Introduction Let (X ; Y ); : : : ; (Xn; Yn ) be independent and identically distributed observations with the conditional mean E (yjx) = m(x) and variance Var(yjx) = . Consider the problem of estimating functionals of the form 1

1

2

Z

= [m (x)] w(x)dx; ( )

2

= 0; 1; : : : ;

(1.1)

where w(x) is a prescribed nonnegative weight function. These functionals appear in expressions for the asymptotically optimal bandwidth for nonparametric function estimates; see for example Ruppert and Wand (1994) and Fan and Gijbels (1996). Doksum and Samarov (1995) consider particularly the applications of in measuring the explanatory power of covariates in regression. Another area of application is to examine the goodness of t of a ( ? 1)-th degree polynomial model by testing H : = 0, = 1; 2; : : :. When w() is the indicator function of an interval, H tests the polynomial modeling at that given interval. Problems similar to estimation of are considered in a number of papers. Hall and Marron (1987, 1991), Bickel and Ritov (1988), and Jones and Sheather (1991) discuss estimation of R [f (x)] dx based on kernel density estimators, where f () is a probability density function. Birge and Massart (1995) expand the study to the estimation of integrals of smoothed functionals of a density. Adaptive estimation of quadratic functionals can be found in Efromovich and Low (1996). Laurent (1996, 97) use the orthonormal series method to estimate density functionals of the form 0

0

( )

0

2

To whom correspondence should be addressed.

1

R f (x)f 0 (x)w(x)dx and deal with this more general integral form via Taylor's expansions. OpR timality results of estimating [m (x)] dx under the Gaussian white noise model are obtained in ( )

(

)

( )

2

Donoho and Nussbaum (1990) and Fan (1991). In the context of nonparametric regression, the corresponding problem is much less understood. Doksum and Samarov (1995) give estimators of R m (x)f (x)w(x)dx (f () is the design density function) by local constant approximation, i.e. the Nadaraya-Watson estimator. Here we develop estimates of based on local polynomial regression estimators. Ruppert, Sheather, and Wand (1995) consider functionals similar to with weight w(x) = f (x). However, no results as general as shown in this paper have been established and our estimators for are dierent from those of Doksum and Samarov (1995) and Ruppert, Sheather, and Wand (1995). Technically, 's are nonlinear functionals of m(). While most theory focuses on estimation of linear functionals, it is also of theoretical interest to explore the diculty of estimating nonlinear R functionals. In the density estimation setting, the n? = -consistency of estimating [f (x)] dx is established in Hall and Marron (1987, 1991), Bickel and Ritov (1988), and Berge and Massart (1995). Laurent (1996, 97) show that similar results hold for estimated weighted integrals as well. Now a natural question is whether n? = -consistent estimators can be constructed for regression functionals . Note that the results for estimating density integral functionals are based on \diagonal-out" type estimators. However, we do not have a \diagonal-out" estimator for . A similar \bias-corrected" estimator (see (3.5)) can be constructed for regression functionals , but it requires s > 2 + 1=2 to achieve the n? = rate (see Theorem 4.4). Our estimators are based on local polynomial regression estimators. Section 4 describes the asymptotic behavior of the estimators, which are quadratic functionals of weighted regression estimates. Appropriate values of the smoothing parameter and smoothness conditions of m() are given for mean square rates of convergence. To study the asymptotic distributions of the proposed P Pn a Y Y are considered in Section 5, where Y ; : : : ; Y estimators, functionals of the form ni i;j i j n j are independent response variables. As a consequence, we provide further insights into the diculty of estimating quadratic functionals: the n? = rate for estimating can be achieved when estimators are pseudo-quadratic (linear component dominates). For genuine-quadratic estimators (quadratic component dominates), the n? = rate is not attainable. This paper is organized as follows. Section 2 provides the background of local polynomial regression. The proposed estimators of are given in Section 3. Section 4 contains rate-ofconvergence results in mean square error, from both a theoretical and a practical points of view. Asymptotic distributions of the proposed estimators are studied in Sections 5 and 6. Proofs of 2

1 2

( )

1 2

1 2

=1

1

=1

1 2

1 2

2

2

lemmas and theorems are in Section 7.

2 Local polynomial regression Suppose that (Xi; Yi ); i = 1; : : : ; n; have been collected from the model

Yi = m(Xi ) + "i ;

E ("i ) = 0; Var("i ) = ;

(2.1)

2

where m() is an unknown regression function, "i 's are independent and identically distributed error terms, and Xi 's are independent of "i 's. The function of interest could be the regression curve m() or its derivatives. The idea of local polynomial regression is to t locally a low-order polynomial at grid points of interest, with observations receiving dierent weights. Assume that the (p + 1)-th derivative of m() exists, with p a nonnegative integer. For a xed point x, the regression function m() can be locally approximated by

m(z ) m(x) + m0 (x)(z ? x) + : : : + m p (x)(z ? x)p=p!; ( )

for z in a neighborhood of x. This leads to the following least-squares problem: min

n X i

Yi ?

=1

p X

(Xi ? x)

!

=0

2

K Xi h? x ;

(2.2)

where = m (x)= !, = ( ; ; : : : ; p )T , and the dependence of on x is suppressed. The neighborhood is controlled by a bandwidth h, and the weights are assigned via a kernel function K , a continuous, bounded, and symmetric real function which integrates to one. Let ^ = ( ^ ; ^ ; : : : ; ^p ) be the solution to the minimization problem (2.2). Then ! ^ is an estimator for m (x), obtained by tting a p-th degree weighted polynomial in a neighborhood of x. See Fan and Gijbels (1996) for further details. For convenience, some matrix notation is introduced here. Let ( )

0

1

0

1

( )

0 1 (X ? x) B B .. .. X=B . . B @ 1 (Xn ? x) 1

? x)p

: : : (X .. ... . : : : (Xn ? x)p 1

0 K ( X h?x ) B B ... W = diag(K ( Xi h? x ))nn = B [email protected]

1 CC CC ; A

1

3

K ( Xnh?x )

(2.3)

1 CC CC ; A

(2.4)

Y = (Y ; : : : ; Yn)T , and m = (m(X ); : : : ; m(Xn))T . By the standard least-squares theory, the estimator ^ can be written as ^ = (X T WX )? X T WY: (2.5) We further investigate ^ , = 0; : : : ; p; individually. Let e ; = 0; : : : ; p; denote (p + 1) 1 unit vectors having 1 at the ( + 1)-th component, 0 otherwise. Simple algebra shows that 1

1

1

n X ^ (x) = eT ^ = Wn Xi h? x Yi ; i

(2.6)

=1

where

Wn (t) = eT (X T WX )? (1; ht; : : : ; hp tp)T K (t); 1

= 0; : : : ; p:

(2.7)

The weight functions Wn (), = 0; : : : ; p; depend on both x and the design points fX ; : : : ; Xng, and satisfy 1

n X i

=1

Xi ? x 8 < 1 if k = , = (Xi ? x)k Wn : 0 otherwise, for 0 ; k p: h

(2.8)

The dependence of Wn () on x and the design points fX ; : : : ; Xn g is the key to design adaptation and superior boundary behavior of the local polynomial regression. To understand the estimates ^ , = 0; : : : ; p, more intuitively, de ne 1

K (t) = eT S ? (1; t; : : : ; tp)T K (t); 1

= 0; : : : ; p;

(2.9)

R

where S = (i j ? ) p p with k = tk K (t)dt. It is easy to see that the functions K (), = 0; : : : ; p, are independent of x and fX ; : : : ; Xn g and satisfy +

2 ( +1)

( +1)

1

Z

8 < 1 if k = , for 0 ; k p: tk K (t)dt = : 0 otherwise,

(2.10)

The theoretical connection between K () and Wn() is shown in the following lemma, which extends the equivalent kernel results of Ruppert and Wand (1994).

Lemma 2.1 If the marginal density of X , denoted by f (), is Holder continuous on an interval

[a; b] and minx2 a;b f (x) > 0, then [

]

a:s: 0; sup sup jnh Wn (t) ? K (t)=f (x)j ?!

t2 ? ; x2 a;b [

1 1]

[

+1

]

provided that K has a bounded support [?1; 1] and h = h(n) ! 0, nh ! 1.

4

From (2.6) and Lemma 2.1, the local polynomial estimator of (x) = m (x)= ! has the asymptotic expansion: n ^ (x) = (1 + oP (1)) X K Xi ? x Yi : (2.11) nh f (x) i h Indeed (2.11) provides a simple technical device for studying the asymptotic behavior of ^. ( )

+1

=1

3 Estimators By the de nition of in (1.1), a natural approach is to substitute an estimate m^ (x) for m (x). R One candidate is ( !) ^ (x)w(x)dx, where ^ (x) is obtained via tting a local p-th degree polynomial, p , as in (2.6). Since ( !) is a constant for any xed , an estimator ( )

2

( )

2

2

Z

^ = ^ (x)w(x)dx; 2

is considered. Correspondingly, denote

Z Z = (x)w(x)dx = (1!) [m (x)] w(x)dx: ( )

2

2

2

We shall study the asymptotic properties of ^ later, but rst an intuitive discussion may be helpful. In addition to the model assumption in (2.1), we further assume that E ("i ) = 0 and E ("i ) < 1, i = 1; : : : ; n. From (2.6), n X n X ^ = ai;j ( )YiYj ; (3.1) 3

i

=1

where

j

4

=1

Z ai;j ( ) = Wn Xi h? x Wn Xj h? x w(x)dx: Clearly ^ is in a quadratic form of Yi 's and can be written as

(3.2)

^ = Y T A Y with A = (ai;j ( ))nn : The conditional mean and variance of ^ follow at once from (3.1):

E f^ jX ; : : : ; Xng = mT A m + tr(A );

(3.3)

2

1

Varf^ jX ; : : : ; Xng = 4 mT A m + 2 tr(A ) 2

2

4

2

1

+(E (" ) ? 3 ) 4 1

4

n X i

=1

5

ai;i ( ); 2

(3.4)

where tr(A ) denotes the trace of A . Note that the second term in the RHS of (3.3) may be thought of as extra bias in the estimation. This motivates a \bias-corrected" estimate:

XX = ai;j ( )YiYj ? ^ tr(A ); n n

i

=1

j

(3.5)

2

=1

where ^ is an estimator of . The non-negativity of is no longer granted. This can be easily modi ed by taking the estimator = max( ; 0), whose performance is at least as good as . 2

2

+

4 Rates of convergence Asymptotic expressions for the conditional bias and variance of ^ and will be derived in this section. We shall use these to study how the mean square errors (MSE) behave. It is convenient at this point to introduce some conditions. Conditions (A): (A1) The kernel K () is a continuous, bounded, and symmetric probability density function, having a support on [?1; 1]. (A2) The design density function f () is positive and continuous for x 2 [?1; 1], and f () has derivatives of order max(; (p ? + 1)=2) on [?1; 1]. (A3) The weight w() is a bounded and nonnegative function on the interval [?1; 1]. Further w() has bounded derivatives of order max(; (p ? + 1)=2) and w i (1) = w i (?1) = 0; i = 0; : : : ; max(; (p ? + 1)=2). Condition (A1) is assumed for simplicity of proofs, and Condition (A2) is imposed in order to apply the \integration by parts" in deriving the results (see (7.9) and (7.15)). With Condition (A2), more smoothness condition on m is required. Condition (A3) is mainly for eliminating the \boundary eects" so that deeper and cleaner results can be obtained. Without loss of generality, the interval [-1,1] is used in Conditions (A1), (A2), and (A3). The results will hold for any bounded intervals [a ; b ], [a ; b ], and [a ; b ] in Conditions (A1), (A2), and (A3) respectively. The regression function m() will be said to have smoothness of order s if there is a constant M > 0 such that for all x and y in [?1; 1], ( )

1

1

2

2

2

2

jm l (x) ? m l (y)j M jx ? yj ; ( )

( )

where s = l + and 0 < 1:

( )

(4.1)

Let [s] denote the largest integer strictly less than s. Throughout we will use s to denote the smoothness of the regression function m() and p to denote the degree of local polynomial tting. Note that p and s are two independent indices with p and s > . When estimating the -th 6

derivative of m(), it is argued in Ruppert and Wand (1994) and Fan and Gijbels (1996) that odd integers (p ? ) should be used. Thus, (p ? ) is assumed to be odd in this section so that (p +1) is a multiple of 2. For simplicity of notation, it is understood that all expectations in this section are R conditioned on fX ; : : : ; Xng. Finally, let ( )(u) = (t) (u ? t)dt denote the convolution of two real-valued functions and . The asymptotic bias and variance of ^ are described in the following theorem. The proof is given in Section 7. 1

1

1

2

1

2

2

Theorem 4.1 Assume that as n ! 1, h = h(n) ! 0, nh ! 1, and nh 2 +1

s? s

2(

[ ])+1

under Conditions (A), (a) the asymptotic bias is

! 1. Then

E (^ ) ? Bn;

8 < OP (hs s ? ) + (C + oP (1))n? h? ? = : (C + oP (1))hp ? + (C + oP (1))n? h? ? +[ ]

2

1

2

+1

1

2

1

2

if s (p + + 1)=2; if s > (p + + 1)=2;

1

2

1

(4.2) where

C = 2

Z

tp

+1

1

C =

2

Z

2

K (t)dt

Z

p (x) (x)w(x)dx ; +1

Z

[K (t)] dt 2

f (x)? w(x)dx ; 1

(4.3)

(b) the asymptotic variance is Var(^ ) Vn;

8 < (D + oP (1))n? h? ? + OP (n? h? s) = : (D + oP (1))n? h? ? + (D + oP (1))n? 1

1

where

D = 2

2

Z

1

+2

2

ZZ

2

4

1

2

4

1

1

2

Z

f (x)? w (x)dx 2

2

m (x)f ? (x)f ? (y)w(x)w(y)dy ; 2

1

1

Z 4 D = ( !) [G (y)] f (y)dy with G (y) = (y)w(y)f ? (y). 2

( )

2

1

2

[(K K )(z )] dz

if s 2; if s > 2;

2 +

2

2

1

7

(4.4)

Remark 1: Note that in Theorem 4.1 (a), when s (p + 1), the convention of R (x) (x)w(x)dx, which involves m p (x), is p Z dr ( (x)w(x))dx p (x) dx with r = (p ? + 1)=2; r ( +1)

+1

+ +1 2

via integration by parts. The asymptotic minimum mean square error of ^ can be obtained from Theorem 4.1. We state the results in two versions. The theoretical version (part (a)) is that given the degree of smoothness s, what the best rate can be by choosing a suciently large p. Practically (part (b)), given the order of local polynomial tting p, one wants to know what the best rate is if m() is suciently smooth.

Theorem 4.2 Under the assumptions of Theorem 4.1, the asymptotic minimum mean square errors of ^ are achieved as follows. (a) Given the degree of smoothness s > , taking h = O(n? = s s

1 ( +[ ]+1)

8 < OP (n? s s ? = s s ) ^ E ( ? ) = : (D + oP (1))n? + OP (n? 2( +[ ]

if s 2; ) if s > 2:

2 ) ( +[ ]+1)

2

1

2

s s? = s s

2( +[ ]

) and p (2s ? ? 1), then

2 ) ( +[ ]+1)

In particular, when s > 2 + 1, E (^ ? ) = D n? + oP (n? ). (b) Given the order of local polynomial t p , if s > (p + + 1)=2, then 2

2

1

1

8 > OP (n? p? = p ) if C > > > OP (n? p? = p ) if C > > > < (D + oP (1))n? E (^ ? ) = > +OP (n? p? = p ) if C > > > > (D + oP (1))n? > > : +OP (n? p? = p ) if C (2

2 +3) ( + +2)

2(

+1) ( + +2)

(2

2 +3) ( + +2)

2(

+1) ( + +2)

8 =p > < C hOPT = > ?CC n = p : C p ? n 2 1

1 ( + +2)

2 (2 +1)

1 ( +1

1

< 0 and s > 2;

1

> 0 and s > 2;

(4.5)

1

2

by taking

1

< 0 and s 2; > 0 and s 2;

1

2

2

1

1 ( + +2)

if C < 0; if C > 0:

(4.6)

1

1

)

In particular, E (^ ? ) = D n? + oP (n? ), if p > 3 ? 1 for the case of C < 0, and p > 3 for C > 0. 2

2

1

1

1

1

Remark 2: The bandwidth hOPT in (4.6) minimizes the mean square error of ^ . Particu-

larly, when D n? is the leading term in (4.5), hOPT minimizes the second-order terms of the MSE 2

1

8

of ^ . In this case, any h satisfying

h = o(n? =

p ?

1 (2( +1

))

) and n? =

1 (4 +2)

= o(h)

can be the optimal smoothing parameter. Thus, the choice of bandwidth is not sensitive in this case. Bickel and Ritov (1988) and Laurent (1996, 97) give the ecient information bound for estimating respectively unweighted and weighted integrals of squared density derivatives. For the current regression setting, we conjecture that D n? is the semiparametric information bound. In other words, ^ is ecient when the degree of smoothness is suciently large. But the construction of the estimator ^ is conceptually simpler than that of Bickel and Ritov (1988). Note that C n? h? ? in the asymptotic bias (4.2) converges to zero slower than the squareroot of D n? h? ? in the variance expression (4.4). In kernel density estimation, Jones and Sheather (1991) argue in favor of estimators of the type ^ , since one can choose an ideal bandwidth such that the leading bias terms (analogous to the second expression in (4.2)) cancel altogether. However, this bias reduction technique is not applicable here, since C in (4.3) is not necessarily negative and C is always positive. To correct the extra bias term, we use the estimator de ned in (3.5). Suppose that can be estimated at a O(n) rate, i.e. E f(^ ? ) g = O(n). It will be seen in Theorem 4.4 that the rate n = o(h = ) is fast enough for the purpose of bias correction. The following theorem gives the asymptotic bias and variance of . 1

2

1

2

1

2

4

2

1

1

1

2

2

2

2

2

2

1 2

Theorem 4.3 Under the assumptions of Theorem 4.1,

(a) the asymptotic bias of is

E ( ) ? bn;

8 s s? < OP h if s (p + + 1)=2; + nn? h? ? = : (C + oP (1))hp ? + OP (nn? h? ? ) if s > (p + + 1)=2; +[ ]

2

1

2

1

+1

1

1

2

1

(b) the asymptotic variance of is Var( ) vn;

8 < (D + oP (1))n? h? ? + OP (n? h? s + nn? h? ? ) if s 2; = : ? ? ? ? ? ? ? ) if s > 2: (D + oP (1))n h + (D + oP (1))n + OP (nn h 1

1

2

4

1

2

4

1

1

2 +

2

1

2

9

2

4

2

2

2

4

2

(4.7)

We now summarize some results from Theorem 4.3. Again they are stated in both theoretical (part (a)) and practical (part (b)) versions.

Theorem 4.4 Under the assumptions of Theorem 4.1, the asymptotic minimum mean square errors of are achieved as follows. (a) Given s, taking h = O(n? = s s

2 (2 +2[ ]+1)

) and p (2s ? ? 1), if n = o(h = ), then 1 2

8 < O (n? s s ? = s s ) E ( ? ) = : P (D + oP (1))n? + OP (n? s 4( +[ ]

if s 2; ) if s > 2:

2 ) (2 +2[ ]+1)

2

1

2

s? = s

4( +[ ]

s

(4.8)

2 ) (2 +2[ ]+1)

In particular, E ( ? ) = D n? + oP (n? ) when s > 2 + 0:5. (b) Given p, if s > (p + + 1)=2 and n = o(h = ), then 2

2

1

1

1 2

8 ? p ? = p 2;

4( +1

=

p

+ 1) hopt = 2CD ((4 n? = p p + 1 ? ) In particular, E ( ? ) = D n? + oP (n? ) when p > 3 ? 0:5. 1 (2 +2 +3)

1

2

1

2 (2 +2 +3)

2 1

2

(4.9)

) (2 +2 +3)

:

(4.10)

1

Remark 3: As shown in Theorem 4.4, the n? = rate is achievable and in this case, the 1 2

smoothing parameter aects only the second-order terms of the MSE. In fact, when D n? is the dominant term in (4.9), the n? = rate of convergence is attained for any bandwidth satisfying 2

1

1 2

h = o(n? =

p ?

1 (2( +1

))

) and n? =

1 (4 +1)

= o(h):

Remark 4: Theorem 4.4 (a) shows that if s > 2 + :5, can be estimated at the n? =

1 2

rate. This minimal smoothness condition is slightly stronger than s > 2 + 1=4 for estimating functionals of a density, as shown in Bickel and Ritov (1988), Laurent (1996, 97) and Birge and Massart (1995). One can possibly follow a similar idea to Laurent (1996) to construct better biascorrected estimators, at expenses of getting much more complex estimators in our setting. We have not chosen this option because we are primarily interested in understanding the performance of the intuitively natural estimators. In practical implementation one should attempt to estimate and hOPT or hopt. The parameter can be estimated at a n? = rate using the estimators of Rice (1984) and Hall, Kay, and Titterington (1990), for example. The problem of estimating the ideal bandwidth for ^ or is beyond the scope of this chapter, but (4.6) and (4.10) provide a guideline. A simple rule of 2

2

1 2

10

thumb is the following. Fit a global polynomial of order (p + + 5)=2 to obtain an estimate for C (see Remark 1) by regarding the regression function as a polynomial function; estimate f (x) and plug the resulting estimate in C or D . Then an estimate of hOPT or hopt is formed. This \plug-in rule" is expected to work reasonably well in many situations, since ^ and are robust against the choice of bandwidth (see Remarks 2 and 3). Ruppert, Sheather, and Wand (1995) consider estimation of 1

2

Z

1

rs = m r (x)m s (x)f (x)dx; r; s 0 and r + s even. They propose an estimator

( )

(4.11)

( )

n X ^rs = n? m^ r (Xi; g)m^ s (Xi; g); 1

i

=1

where m^ r (; g) and m^ s (; g) are obtained via tting a p-th degree local polynomial with a bandwidth g. Also p ? r and p ? s are both odd. Comparing with their estimator, our error criterion (weighted mean integrated square error) is more general and the estimators ^ and are dierent from ^rs .

5 Asymptotic distributions of quadratic functionals The asymptotic distribution of

XX ^ = ai;j Yi Yj n n

i

=1

j

(5.1)

=1

will be considered in this section. For simplicity of technical arguments, we restrict attention to the Gaussian model,

Yi = m(Xi ) + "i ;

"i i.i.d. N (0; );

i = 1; : : : ; n:

2

(5.2)

This Gaussian noise assumption can be removed with additional work on proofs, e.g. via applying the martingale central limit theorem. The ai;j 's in (5.1) possibly depend on n and the design points fX ; : : : ; Xn g. Let m be a n 1 column vector with entries m(Xi ), i = 1; : : : ; n, " = (" ; : : : ; "n)T , and A = (ai;j )nn. Then ^ can be written as 1

1

^ = Y T AY = mT Am + 2mT A" + "T A":

(5.3)

The mean and variance of ^ can be calculated directly from (5.3):

E f^jX ; : : : ; Xng = mT Am + tr(A);

(5.4)

2

1

Varf^jX ; : : : ; Xng = 2 tr(A ) + 4 mT A m: 4

1

11

2

2

2

(5.5)

Conditioned on fX ; : : : ; Xn g, 2mT A" and "T A" are the stochastic terms in (5.3). It is easy to see that 2mT A" is linear in "i 's and contributes 4 mT A m to the variance in (5.5), while "T A" is quadratic in "i 's with variance 2 tr(A ). The objective is to show that ^ is asymptotically normally distributed under some conditions. Two cases will be considered, depending on whether the linear term or the quadratic term dominates (see (5.3) and (5.5)): 1

2

4

2

2

P 0 (mT A" dominates), (i) tr(A )=mT A m ! 2

2

P 1 ("T A" dominates). (ii) tr(A )=mT A m ! 2

2

If the linear term dominates (case (i)), i.e. ^ is \pseudo-quadratic", then the normality of ^ follows directly under the Gaussian noise assumption. The distribution theory of quadratic forms in normal variables will be used to prove the asymptotic normality for case (ii), where ^ is \genuine-quadratic". As will be seen in Section 6, the separate treatments of these two cases are natural, corresponding respectively to root-n and non-root-n rates of convergence. There is a \boundary case" when tr(A ) and mT A m are of the same order. That may be handled with additional work. The general theory of quadratic functionals in Whittle (1964) and Khatri (1980) is useful, but not directly applicable to the situation described here. We provide the following result. 2

2

Theorem 5.1 Under the model assumption in (5.2), if ^ is pseudo-quadratic (case (i)), then the conditional distribution of ^ given fX ; : : : ; Xng is asymptotically normal: C N mT Am + tr(A); 2 tr(A ) + 4 mT A m ; Y T AY ?! (5.6) 1

2

4

2

2

2

C denotes convergence conditioned on fX ; : : : ; X g. Similarly, (5.6) holds for where the symbol ?! n case (ii) if 1

P 0; as n ! 1: tr(A )=(tr(A )) ! 4

2

2

More precisely, the conditional asymptotic normality in (5.6) means that ( T ) T Am ? tr(A) Y AY ? m P (t) 8t; P (2 tr(A ) + 4 mT A m) = tjX ; : : : ; Xn ?! (5.7) where () is the cumulative distribution function of the standard normal random variable. Expression (5.7) also implies the unconditional asymptotic normality by the Dominated Convergence Theorem: ) ( T T Am ? tr(A) Y AY ? m for all t: P (2 tr(A ) + 4 mT A m) = t ?! (t) 2

4

2

2

2

1 2

2

4

2

2

2

1 2

12

1

6 Asymptotic normality We now establish the asymptotic normality of ^ and . In addition to Conditions (A), the bandwidth is assumed to satisfy Condition (B): h = h(n) ! 0 and nh + log(h) ! 1 as n ! 1. The requirement (B) is minor in the sense that if a local smoothing neighborhood contains at least (log n) data points (h > n? (log n)k , k > 1), then Condition (B) is satis ed. We need the following lemmas to show the asymptotic normality of ^ and . 1

Lemma 6.1 Suppose Zn; , Zn; , : : : is a sequence of random variables, having a multinomial distribution (n; pn; ; pn; ; : : :) with parameters satisfying pn;j 0; j = 1; 2; : : : and maxj pn;j < ch, 1

1

2

2

where c is a constant. Under Condition (B),

P fmax Z > nhbng ?! 0; j n;j

as n ! 1;

for any sequence bn such that bn ! 1.

Lemma 6.2 Let max ( ) be the maximum eigenvalue of A = (ai;j ( ))nn, where ai;j ( ) is de ned

in (3.2). Under Conditions (A) and (B),

max ( ) ?! P 0: (6.1) tr(A ) Since the design matrix X (see (2.3)) and the matrix A are the same for (X ; (Y )),: : :,(Xn; (Yn )) with () a known function, Lemmas 6.1 and 6.2 are applicable to the case of estimating functionals of the form Z [m (x)] w(x)dx; 2

2

1

( )

1

2

where m (x) = E ( (y)jx). From Lemmas 6.1 and 6.2, the asymptotic normality of ^ and follows.

Theorem 6.1 Under the Gaussian model assumption in (5.2), if Conditions (A) and (B) are satis ed and nh ! 1, nh s? s ! 1, as n ! 1, then the conditional distributions of ^ 2 +1

2(

[ ])+1

and are asymptotically normal:

C N (B ; V ); (^ ? ) ?! n; n; C N (b ; v ); ( ? ) ?! n; n;

where Bn; , Vn; , bn; and vn; are given in Theorems 4.1 and 4.3.

13

In particular, the n? = -consistency is achieved with appropriate smoothing parameters (see Theorems 4.2 and 4.4). (a) Given s > 2 + 1, taking h = O(n? = s s ) and p (2s ? ? 1), 1 2

1 ( +[ ]+1)

p ^ C N (0; D ): n( ? ) ?! 2

(b) Given p such that

8 < p > 3 ? 1 : p > 3

if C < 0; if C > 0; 1

1

if s > (p + + 1)=2 and h = hOPT given in (4.6), then

p ^ C N (0; D ): n( ? ) ?! 2

(c) Given s > 2 + 0:5, taking h = O(n? = s s

) and p (2s ? ? 1), if n = o(h = ), then

2 (2 +2[ ]+1)

1 2

pn( ? ) ?! C N (0; D ): 2

(d) Given p > 3 ? 0:5, taking h = hopt in (4.10), if s > (p + + 1)=2 and n = o(h = ), then 1 2

pn( ? ) ?! C N (0; D ): 2

Note that the above n? = rate of convergence is achieved when estimators are pseudo-quadratic, since D n? is one of the terms in the asymptotic variance of 2mT A ". Nonparametric estimation of has important implications for the choice of the smoothing parameter of local regression. Based on the theory in this paper, Fan and Huang (1996) show that the relative rate of convergence of the bandwidth selector proposed by Fan and Gijbels (1995) for tting local linear regression is of order n? = if the locally cubic tting is used in the pilot stage, and the rate can be improved to n? = when a local polynomial of degree 5 is used in the pilot tting. The theoretical results appear to be new in the nonparametric regression literature. 1 2

2

1

2 7

2 5

7 Proofs Proof of Lemma 2.1: Stone (1980) shows that X ? x n X a.s. 0; j = 0; : : : ; 2p: j ? j ? ? ? f (x)j j ?! (Xi ? x) h K i h sup jn h x2 c;d [

]

1

1

i

=1

14

Substituting f (x)j , j = 0; : : : ; 2p, for the corresponding moments in the matrix (X T WX ), we obtain a.s. 0; jjn? h? H ? (X T WX )H ? ? f (x)S jj1 ?! 1

1

1

(7.1)

1

where H = diag(1; h; h ; : : : ; hp) p p . Lemma 2.1 follows from (2.7) and (7.1). 2 Proof of Theorem 4.1: We begin by estimating the conditional bias using (3.3). From Lemma 2.1 and Condition (A1), it can be shown that Xi ? x n Z X n tr(A ) = W w(x)dx h i 2

( +1)

( +1)

2

=1

n Z 1 X Xi ? x + oP (1)I X ?h;X h (x) w(x)dx K i i f (x) h

= (nh1 ) i +1 2

2

[

+

]

=1

Xi ? x n Z 1 X (1 + o (1)) P = (nh ) f (x) K h w(x)dx: i

(7.2)

2

+1 2

2

=1

Since Var

n X i

=1

K

2

Xi ? x ! h

= O(nh) = o

nE K

2

Xi ? x ! 2

h

;

the summation in (7.2) can be replaced by its expectation: Z Z 1 ? tr(A ) = nh K (t)dt f (x) w(x)dx + oP (n? h? ? ): We now evaluate the rst term of (3.3). A Taylor series expansion gives 2

1

1

2

1

2 +1

n X i

=1

Wn

(7.3)

X ? x X ? x n X i i n 0 W m ( X ) = i h h [m(x) + m (x)(Xi ? x) + : : :] i

=1

= (x) + r(x);

(7.4)

where r(x) denotes the remainder terms. For the case that s > (p + + 1)=2, assume initially that m() has (p + 1) bounded derivatives. It is shown in Ruppert and Wand (1994) that

r(x) = hp ? +1

Hence

mT A m =

Z

Z

tp

+1

K (t)dt p (x) + oP (hp ? ):

(x)w(x)dx + 2hp ? 2

Z

(7.5)

+1

+1

+1

Z

tp K (t)dt +1

p (x) (x)w(x)dx + oP (hp ? ): +1

15

+1

(7.6)

R

As noted in Remark 1, p (x) (x)w(x)dx depends only on the rst ((p + + 1)=2) derivatives of m(). Since any function with ((p + + 1)=2) derivatives can be approximated arbitrarily close by a function with (p + 1) derivatives, (7.6) must hold for s > (p + + 1)=2. Similar arguments can be seen in Hall and Marron (1991), page 170. In the case that s (p + + 1)=2, set l = [s] + [s ? ]. Assume initially that m l exists and is bounded (note that l < (p + 1)). From a Taylor expansion with an integral remainder, +1

( )

m(Xi ) = m(x) +

l? X 1

k

(Xi ? x)k m k (x)=k! ( )

=1

i ? x) + (X (l ? 1)!

lZ

1

(1 ? t)l? m l (x + t(Xi ? x))dt; 1

0

(7.7)

( )

for i = 1; : : : ; n. It follows from (7.4) and (7.7) that

mT A m =

Z

Z

(x)w(x)dx + (x)r(x)w(x)dx + OP (h s? ): 2

2(

(7.8)

)

The second term in the RHS of (7.8) can be written as

X ? x n Z (1 ? t)l? Z d s? X i n l ( x ) W ( X ? x ) w ( x ) i (l ? 1)! dx s? h i m s (x + t(Xi ? x)) ? m s (x) dxdt X ? x Z (1 ? t)l? Z d s? i l ? ( x ) K = h? ? ( X ? x ) w ( x ) f ( x ) E f i (l ? 1)! h dx s? m s (x + t(Xi ? x)) ? m s (x) dxgdt + OP (hl? an) Z )l? Z Z d s? (x)K (u)ulw(x)f ? (x) = hl? (1(l??t1)! dx s? m s (x + htu) ? m s (x) f (x + hu)dudxdt + OP (hl? an ); (7.9) 1

1

=1

[

]

[

0

]

([ ])

([ ])

1

1

[

]

1

1

[

0

([ ])

1

]

([ ])

1

[

]

1

[

0

([ ])

]

([ ])

where an = h + pnh is obtained by more careful evaluation of the oP (1) term in Lemma 2.1. It follows from (7.8), (7.9), and (4.1) that if s (p + + 1)=2, 1

mT A m

=

Z

(x)w(x)dx + OP (hs s ? ): 2

+[ ]

2

The combination of (7.3), (7.6), and (7.10) gives the bias expression in Theorem 4.1 (a).

16

(7.10)

Next, we compute the asymptotic conditional variance of ^ . In the sequel, we treat explicitly the three terms given in the RHS of (3.4). The rst term (omitting the factor 4 ) is

mT A

m

n X n X

=

2

j j

=

X i k6 j

=

X i6 j

m (Xi ) 2

i

=1 =1

i;j;k

Z

k

=

2

i6 j n h 4

4 +4

=

all dierent i k6 j

Wn

X

X

+

i j6 k i j k

Xj ? x h

:

(7.11)

= =

= =

w(x)dx

2

K Xi h? x K Xj h? x w(x)f ? (x) 2

[

Xi;h Xi (x)I ?h Xj;h Xj (x))dx

+

= n? h? ? 4

+2

= =

h

(1 + oP (1)I ?h Z Z Z 2

X

+

Xi ? x

Wn

X m (Xi ) Z

ai;j ( )m(Xi)ak;j ( )m(Xk )

=1

X

=

= =

ai;j ( )m(Xi)

2

=1

n X n X n X

=

Now

i

=1

!

2

2

+

]

[

+

+

o

2

]

K (u )K u + x ?h y m (x + hu )f (x + hu ) 2

1

1

1

1

o 1 + oP (1)I ? ; (u )I ? ? x?y =h; ? x?y =h (u ) du Z x ? y K (u )K u + h f (y + hu ) 1 + oP (1)I ? ; (u ) o I ? ? x?y =h; ? x?y =h (u ) du w(x)w(y)f ? (x)f ? (y)dxdy [

1

1 1]

2

[

1

(

)

1

(

)

]

1

2

2

1

[

2

[

1

(

)

1

(

+OP (n? h? ? = ) 3

4

4

ZZ

1

i;j;k

all dierent

2

2

2

Z

(K K (z )) dz 2

m (x)f ? (x)f ? (y)w(x)w(y)dxdy: 2

1

By (7.4) and Lemma 2.1,

X

]

2

3 2

= n? h? ? (1 + oP (1)) 2

)

1 1]

= n? h? 1

2

ZZZ

(7.12)

1

( (y) + r(y))( (y + hz ) + r(y + hz ))K(t)

K (t + z )f (y + h(z + t))w(y)w(y + hz )f ? (y + hz ) 1

17

p f ? (y)dtdzdy + OP ( nhn? h? ? ) ZZZ 1

2

= n? h? 1

2

2

g (y)g (y + hz )K(t)K(t + z )

2

p f (y + h(z + t))dtdzdy + OP ( nhn? h? ? ) 2

2

(7.13)

2

with g (y) = ( (y)+ r(y))w(y)f ? (y). For the case that s > 2 , it is assumed initially that s > 3 . Then write 1

g (y + hz ) = and

f (y + h(z + t)) =

X 2

l

( )

(7.14)

2

=0

X k

gl (y)hlz l =l! + o(h )

f k (y)hk (z + t)k =k! + o(h ):

(7.15)

( )

=0

Note that from (2.10), for nonnegative integers l; k, and l + k 2 ,

8 > (?1) > ZZ < z l (z + t)k K (t)K(t + z )dtdz = > > :0

l l?

if l ; k ; and l + k = 2; elsewhere.

!

!(

)!

(7.16)

Combining (7.13), (7.14), (7.15), and (7.16),

! Z X ( ? 1) ! i ? i = n( !) g (y) g (y)f (y) i!( ? i)! + oP (n? ) i i;j;k all dierent X

( + )

(

1

)

2

=0

Z d [g (y)f (y)]dy + o (n? ): = n(?(1)!) g (y) dy P ( )

(7.17)

1

2

Applying integration by parts and Condition (A3) to (7.17), it follows that

Z 1 = n( !) [g (y)] f (y)dy + oP (n? ) i;j;k all dierent X

( )

2

1

2

Z = n(1!) [G (y)] f (y)dy + oP (n? ) ( )

2

2

1

(7.18)

with G (y) = (y)w(y)f ? (y). Expression (7.18) only involves derivatives of m() up to the 2 -th, and hence must hold for s > 2 . To modify arguments from (7.13) to (7.18) for the case of s 2 , the change required is in (7.14) and (7.15). The initial assumption is that m() has (2[s]? ) bounded derivatives, and Taylor expansions of orders 2([s] ? ) and (2 ? [s]) for (g (y + hz ) ? g (y)) and 1

18

f (y + h(z + t)) respectively. Then, applying integration by parts ([s] ? ) times, together with the smoothness de nition (4.1), we obtain

X

= OP (n? h? s): 1

2 +

all dierent P P P The other two terms in (7.11), 2 i j 6 k and i j k are of smaller order than i;j;k all dierent P and i k6 j . The second term in the RHS of (3.4) (omitting the factor 2 ) is i;j;k

= =

= =

= =

4

tr(A ) = 2

n X i

ai;i ( ) + 2

=1

X i6 j

ai;j ( ) 2

=

I +I ; 1

2

where I and I denote the diagonal and non-diagonal terms respectively. Using similar arguments as in establishing the term mT A m, we obtain that I = OP (n? h? ? ) and X Z Xi ? x Xj ? x ? n ( n ? 1)(1 + o (1)) P I = E K nh h K h f (x) w(x)dx i6 j 1

2

2

3

1

4

2

2

2

2

4

4 +4

= n? h? ? 2

4

1

ZZ

=

f(K K )(z )g dzf (x)? w (x)dx + oP (n? h? ? ): 2

Consequently, tr(A ) = n? h? ? (1 + oP (1)) 2

2

4

1

Z

2

2

2

[(K K )(z )] dz 2

?

Z

4

1

f (x)? w (x)dx : 2

2

(7.19)

Observe that the last term (omitting the factor E (" ) ? 3 ) in the RHS of (3.4) is n X i

4

4 1

ai;i ( ) = I = oP (tr(A )): 2

(7.20)

2

1

=1

The asymptotic variance follows from (7.12), (7.18), (7.19) and (7.20). 2 Proof of Theorem 4.3: This theorem follows directly by the following two expressions.

E ( ) = E (^ ) ? E (^ tr(A )) 2

= E (^ ) ? tr(A ) + E (^ ? )tr(A ) 2

2

2

= E (^ ) ? tr(A ) + OP (nn? h? ? ): 2

1

Var( ) = Var(^ ? (^ ? )tr(A )) 2

2

= Var(^ ) + OP (nn? h? ? ): 2

19

2

4

2

2

1

2

P !

Proof of Theorem 5.1: If the linear term dominates, i.e. tr(A )=mT A m 0, then "T A" ? E ("T A") = oP (mT A m) = . Conditioning on fX ; : : : ; Xn g, it is trivial that mT A" is normally distributed with mean 0 and variance mT A m. This implies 2

2

1 2

2

1

2

2

(4 mT A m) = (^ ? E (^)) 2

2

1 2

C N (0; 1): = (4 mT A m) = f2mT A" + ("T A" ? E ("T A"))g ?! 2

2

1 2

P 1, we need only to consider "T A". Factor A into a product For case (ii), tr(A )=mT A m ! ?T ?, where ? is an orthonormal matrix and is a diagonal matrix with eigenvalues of A, i 's as its entries. Now 2

2

"T A" = T =

n X i

i i

2

=1

P 0; as n ! 1, with = ?" = ( ; : : : ; n): Note that by the assumption that tr(A )=(tr(A )) ! n X i

4

1

E c ji i ? i j = 2

4

n X i

i E c (i ? 1) = 60 4

2

4

4

i

2

i = oP ([tr(A )] ); 4

2

2

=1

=1

=1

n X

2

where E c denotes the conditional expectation given fX ; : : : ; Xng. Hence the Lindeberg condition P for the asymptotic normality of ni i i holds: Pn E c j ? j 15 Pni i ?! P 0; i i i i P = [Var( ni i i jX ; : : : ; Xn )] [tr(A )] and so "T A" is asymptotic normally distributed. Consequently, the asymptotic normality of ^ for cases (i) and (ii) is proved. 2 Proof of Lemma 6.1: Note that 1

2

=1

4

2

4

=1

=1

4

=1

2

2

1

P (max Z > nhbn ) j n;j

2

X j

2

P (Zn;j > nhbn ):

(7.21)

Clearly, it is sucient to show that [Pnj P (Zn;j > nhbn )] ! 0. Given an arbitrary positive constant d, =1

P (Zn;j > nhbn) = P (exp(Zn;j d) > exp(nhbnd))

exp(?nhbn d)E fexp(Zn;j d)g = exp(?nhbn d)(1 ? pn;j + ed pn;j )n

exp(?nhbn d + npn;j ed ): 20

(7.22)

Take d = log( phbn;jn ) in (7.22). If n is suciently large, then

pn;j e nhbn

P (Zn;j > nhbn )

hbn

:

(7.23)

Since maxj pn;j ch, the RHS of (7.23) is bounded by (ch)nhbn?

1

ce nhbn e nhbn 1 pn;j = ch b pn;j : hbn n

Now bn(log(ce) ? log(bn)) ?1 for suciently large n, which together with Condition (B) give X 1 1 nh X P (Zn;j > nhbn ) ( ) pn;j j ch e j 1 ( 1 )nh ?! 0: = ch e

(7.24)

2

The conclusion follows from (7.21) and (7.24). Proof of Lemma 6.2: Observe that from (7.19) tr(A ) c n? h? ? + oP (n? h? ? ) 2

2

4

1

2

4

(7.25)

1

1

for some nonnegative constant c . It suces to consider the rate of max ( ). By basic theory of linear algebra, n X n X max ( ) = sup ai;j ( )bibj : (7.26) 1

kbk i j =1 =1

From Lemma 2.1, for n suciently large, n X n X i

=1

j

ai;j ( )bibj (nh ) +1

=1

Z

2

2

f (x)?

2

=1

X ? x ! K i h bi w(x)dx:

n X

2

i

=1

Under Conditions (A), the RHS of (7.27) is further bounded by

c n? h? ? 2

2

2

Z X n

2

j bi j I Xi?h;h [

i

!

Xi (x)

+

2

]

(7.27)

dx

=1

n Z X = c n? h? ? bi I Xi?h;Xi h (x)dx 2

2

2

2

2

[

i

+

]

=1

+

X i6 j

1 h (x)dxA

Z

j bi jj bj j I Xi?h;Xi h (x)I Xj ?h;Xj [

+

]

[

+

]

=

c n? h? ? 2

2

2

2

1 0 n X X @2h bi + 2h I jXi ?Xj j h j bi jj bj jA 2

i6 j

i

=1

=

21

[

2

]

(7.28)

for some nonnegative constant c . We use the following argument to evaluate the double summation in (7.28). Let Ik = (4(k ? 1)h; 4kh], k 2 Z be a partition of (?1; 1), and nk be the number of design points Xi that fall in the interval Ik . Since I jXi?Xj j h = 0 for Xi 2 Ik , Xj 2 Ik ; and jk ? k j > 1, 2

[

X i;j

2

1

]

0 1 X @ I jXi?Xj j h jbi jjbj j [

2

]

X

k ?1 Xi;Xj 2Ik [I(k+1)

1

2

1 1 0 X @ jbijjbj jA =

2

X

2

k ?1 Xi2Ik [I(k+1)

=

1 jbijA :

(7.29)

=

Put N = maxk2Z (nk + nk ) (nk + nk ) for some k 2 Z. The maximum of the RHS of (7.29) p over kbk = 1 is attained at jbi j = 1= N for those indices such that Xi 2 Ik [ Ik . Therefore, +1

max kbk

=1

0

0 +1

0

X

0

I jXi?Xj j h jbi jjbj j N: [

i;j

0 +1

2

]

Note that fnk g have a multinomial distribution and hence by Lemma 6.1, N = OP (nhn), for any sequence n ! 1. Taking n = h? , it follows from (7.28) that 1 4

max ( ) c n? h? ? OP (h + nh = ): 2

2

2

7 4

2

(7.30)

Combine (7.25) and (7.30) to conclude that

max ( ) = O (n? h? + h = ) ?! 0: P tr(A ) 2

2

1

1 2

2

Proof of Theorem 6.1: From Theorem 5.1, one only needs to verify tr(A ) ?! P 0; (tr(A ))

as n ! 1:

4

2

2

2

Let i ( ); i = 1; : : : ; n be eigenvalues of A and max ( ) be the maximum eigenvalue among all i ( )'s. Then,

P

tr(A ) = P ni i ( ) Pmax ( ) = max ( ) ?! P 0; as n ! 1: (7.31) n ( ) (tr(A )) ( ni i ( )) tr( A ) i i Hence the asymptotic normality of ^ and follows. 2 Acknowledgments. This work was part of Li-Shan Huang's doctoral dissertation at University of North Carolina. She would like to thank the Department of Statistics, University of North Carolina, National Institute of Statistical Sciences, Professors M.R. Leadbetter and R.L. Smith for the nancial support during her graduate studies. 4

2

2

4

=1

2

2

2

=1

2

2

2

=1

22

REFERENCES Berge, L. and Massart, P. (1995) Estimation of integral functionals of a density. Ann. Statist., 23, 11-29. Bickel, P.J. and Ritov, Y. (1988) Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya, 50-A, 381-393. Doksum, K. and Samarov, A. (1995) Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. Ann. Statist., 23, 1443-1473. Donoho, D.L. and Nussbaum, M. (1990) Minimax quadratic estimation of a quadratic functional. J. Complex., 6, 290-323. Efromovich, S. and Low, M. (1996) On Bickel and Ritov's conjecture about adaptive estimation of the integral of the square of density derivative. Ann. Statist., 24, 682-686. Fan, J. (1991) On the estimation of quadratic functionals. Ann. Statist., 19, 1273-1294. Fan, J. and Gijbels, I. (1995) Data-driven bandwidth selection in local polynomial tting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Sec. B, 57, 371-394. Fan, J. and Gijbels, I. (1996) Local polynomial modelling and its applications. London: Chapman and Hall. Fan, J. and Huang, L.-S. (1996) Rates of convergence for the pre-asymptotic substitution bandwidth selector. Technical report #912, Department of Statistics, Florida State University. Hall, P., Kay, J.W. and Titterington, D.M. (1990) Asymptotically optimal dierence-based estimation of variance in nonparametric regression. Biometrika, 77, 521-528. Hall, P. and Marron, J.S. (1987) Estimation of integrated squared density derivatives. Stat. Probab. Lett., 6, 109-115. Hall, P. and Marron, J.S. (1991) Lower bounds for bandwidth selection in density estimation. Probab. Th. Rel. Fields, 90, 149-173. Jones, M.C. and Sheather, S.J. (1991) Using non-stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives. Stat. Probab. Lett., 11, 511-514. Khatri, C.G. (1980) Quadratic forms in normal variables. In P.R. Krishnaiah (ed.), Handbook of Statistics, Vol. I, 443-469. New York: North-Holland. Laurent, B. (1996) Ecient estimation of integral functionals of a density. Ann. Statist., 24, 659-681. Laurent, B. (1997) Estimation of integral functionals of a density and its derivatives. Bernoulli, 3, 181-211. Rice, J. (1984) Bandwidth choice for nonparametric regression. Ann. Statist., 12, 1215-1230. Ruppert, D., Sheather, S.J. and Wand, M.P. (1995) An eective bandwidth selector for local least squares regression. Jour. Amer. Statist. Assoc., 90, 1257{1270. 23

Ruppert, D. and Wand, M.P. (1994) Multivariate locally weighted least squares regression. Ann. Statist., 22, 1346-1370. Stone, C.J. (1980) Optimal rates of convergence for nonparametric estimators. Ann. Statist., 8, 1348-1360. Whittle, P. (1964) On the convergence to normality of quadratic forms in independent random variables. Theory Prob. Appl., 9, 103-108.

24