Statistical Estimation in Varying-Coefficient Models

Jianqing Fan
Department of Statistics
University of North Carolina
Chapel Hill, NC 27599-3260

Wenyang Zhang
Department of Statistics
The Chinese University of Hong Kong
Shatin, Hong Kong

Abstract

Varying-coefficient models are a useful extension of classical linear models. They arise naturally when one wishes to examine how regression coefficients change over different groups characterized by certain covariates such as age. The appeal of these models is that the coefficient functions can easily be estimated via a simple local regression. This yields a simple one-step estimation procedure. We show that such a one-step method cannot be optimal when different coefficient functions admit different degrees of smoothness. This drawback can be repaired by using our proposed two-step estimation procedure. The asymptotic mean squared error of the two-step procedure is obtained and shown to achieve the optimal rate of convergence. A few simulation studies show that the gain from the two-step procedure can be quite substantial. The methodology is illustrated by an application to an environmental data set.

KEY WORDS: Varying-coefficient models, local linear fit, optimal rate of convergence, mean squared errors.

SHORT TITLE: Varying-Coefficient Models.

1 Introduction

1.1 Background

Driven by many sophisticated applications and fueled by modern computing power, many useful data-analytic modeling techniques have been proposed to relax traditional parametric models and to exploit possible hidden structure. For an introduction to these techniques, see the books by Hastie and Tibshirani (1990), Green and Silverman (1994), Wand and Jones (1995) and Fan and Gijbels (1996), among others. In dealing with high-dimensional data, many powerful approaches have been developed to avoid the so-called "curse of dimensionality". Examples include additive models (Breiman and Friedman, 1985; Hastie and Tibshirani, 1990), low-dimensional interaction models (Friedman, 1991; Gu and Wahba, 1993; Stone et al., 1997), multiple-index models (Hardle and Stoker, 1989; Li, 1991), partially linear models (Wahba, 1984; Green and Silverman, 1994), and their hybrids (Carroll et al., 1997; Fan et al., 1998; Heckman et al., 1998), among others. Different models explore different aspects of high-dimensional data and incorporate different prior knowledge into modeling and approximation. Together they form a useful tool kit for processing high-dimensional data.

A useful extension of classical linear models is the varying-coefficient model. The idea is scattered throughout textbooks; see, for example, page 245 of Shumway (1988). However, the potential of such

* Partially supported by NSF Grant DMS-9803200 and NSA Grant 96-1-0015.


a modeling technique was not fully explored until the seminal work of Cleveland et al. (1991) and Hastie and Tibshirani (1993). The varying-coefficient model assumes the following conditional linear structure:

\[ Y = \sum_{j=1}^{p} a_j(U)\, X_j + \varepsilon \tag{1.1} \]

for given covariates $(U, X_1, \ldots, X_p)^T$ and response variable $Y$, with

\[ E(\varepsilon \mid U, X_1, \ldots, X_p) = 0 \quad \text{and} \quad \operatorname{var}(\varepsilon \mid U, X_1, \ldots, X_p) = \sigma^2(U). \]

By regarding $X_1 \equiv 1$, model (1.1) allows a varying intercept term. The appeal of this model is that, by allowing the coefficients $a_1, \ldots, a_p$ to depend on $U$, the modeling bias can be reduced significantly and the "curse of dimensionality" can be avoided. Another advantage of this model is its interpretability: it arises naturally when one is interested in exploring how regression coefficients change over different groups such as age. It is particularly appealing in longitudinal studies, where it allows one to examine the extent to which covariates affect responses over time. See Hoover et al. (1997) and Fan and Zhang (1998) for details on novel applications of varying-coefficient models to longitudinal data. For nonlinear time series applications, see Chen and Tsay (1993), where functional-coefficient AR models are proposed and studied.

1.2 Estimation Methods

Suppose that we have a random sample $\{(U_i, X_{i1}, \ldots, X_{ip}, Y_i)\}_{i=1}^{n}$ from model (1.1). One simple approach to estimating the coefficient functions $a_j(\cdot)$ $(j = 1, \ldots, p)$ is local linear modeling. For each given point $u_0$, approximate each function locally as

\[ a_j(u) \approx a_j + b_j(u - u_0), \tag{1.2} \]

for $u$ in a neighborhood of $u_0$. This leads to the following local least-squares problem: minimize

\[ \sum_{i=1}^{n} \Big[ Y_i - \sum_{j=1}^{p} \{a_j + b_j(U_i - u_0)\} X_{ij} \Big]^2 K_h(U_i - u_0) \tag{1.3} \]

for a given kernel function $K$ and bandwidth $h$, where $K_h(\cdot) = K(\cdot/h)/h$. The idea is due to Cleveland et al. (1991). While this idea is simple and useful, it implicitly assumes that the functions $a_j(\cdot)$ possess about the same degree of smoothness, so that they can be approximated equally well over the same interval. If the functions possess different degrees of smoothness, the least-squares method (1.3) yields suboptimal estimators.
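To make the one-step fit concrete, here is a minimal Python/NumPy sketch of the local least-squares problem (1.3). The function name `local_linear_vc` and its interface are our own illustration, not the authors' implementation; the Epanechnikov kernel is the one adopted in Section 5, and the design columns are grouped rather than interleaved (equivalent up to a column permutation).

```python
import numpy as np

def local_linear_vc(u0, U, X, Y, h):
    """One-step local linear fit (1.3) at the point u0.

    U : (n,) smoothing covariate; X : (n, p) linear covariates
    (set X[:, 0] = 1 for a varying intercept); Y : (n,) responses.
    Returns the estimated coefficients a_1(u0), ..., a_p(u0).
    """
    d = U - u0
    t = d / h
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0) / h  # K_h(U_i - u0)
    # Columns: X_ij (local intercepts) and X_ij * (U_i - u0) (local slopes)
    Z = np.hstack([X, X * d[:, None]])
    beta = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * Y))
    return beta[: X.shape[1]]
```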

To formulate the above intuition in a mathematical framework, let us assume that $a_p(\cdot)$ is smoother than the remaining functions. For concreteness, we assume that $a_p$ possesses a bounded fourth derivative, so that the function can be approximated locally by a cubic function:

\[ a_p(u) \approx a_p + b_p(u - u_0) + c_p(u - u_0)^2 + d_p(u - u_0)^3, \tag{1.4} \]

for $u$ in a neighborhood of $u_0$. This naturally leads to the following weighted least-squares problem:

\[ \sum_{i=1}^{n} \Big[ Y_i - \sum_{j=1}^{p-1} \{a_j + b_j(U_i - u_0)\} X_{ij} - \{a_p + b_p(U_i - u_0) + c_p(U_i - u_0)^2 + d_p(U_i - u_0)^3\} X_{ip} \Big]^2 K_{h_1}(U_i - u_0). \tag{1.5} \]

Let $\hat a_{j,1}$, $\hat b_{j,1}$ $(j = 1, \ldots, p-1)$ and $\hat a_{p,1}$, $\hat b_{p,1}$, $\hat c_{p,1}$, $\hat d_{p,1}$ minimize (1.5). The resulting estimator $\hat a_{p,OS}(u_0) = \hat a_{p,1}$ is called a one-step estimator. We will show that the bias of the one-step estimator is of order $O(h_1^2)$ and its variance is of order $O\{(nh_1)^{-1}\}$. Therefore, the one-step estimator $\hat a_{p,OS}(u_0)$ cannot achieve the optimal rate of order $O(n^{-8/9})$.

To achieve the optimal rate, a two-step procedure has to be used. The first step involves obtaining initial estimates of $a_1(\cdot), \ldots, a_{p-1}(\cdot)$. These initial estimates are usually undersmoothed, so that their bias is small. Then, in the second step, a local least-squares regression is fitted again, substituting the initial estimates into the local least-squares problem (1.5). More precisely, we use local linear regression to obtain a preliminary estimate by minimizing

\[ \sum_{k=1}^{n} \Big[ Y_k - \sum_{j=1}^{p} \{a_j + b_j(U_k - u_0)\} X_{kj} \Big]^2 K_{h_0}(U_k - u_0) \tag{1.6} \]

for a given initial bandwidth $h_0$ and kernel $K$. Let $\hat a_{1,0}(u_0), \ldots, \hat a_{p,0}(u_0)$ denote the resulting initial estimates of $a_1(u_0), \ldots, a_p(u_0)$. In the second step, we substitute the preliminary estimates $\hat a_{1,0}(\cdot), \ldots, \hat a_{p-1,0}(\cdot)$ and use a local cubic fit to estimate $a_p(u_0)$; namely, we minimize

\[ \sum_{i=1}^{n} \Big[ Y_i - \sum_{j=1}^{p-1} \hat a_{j,0}(U_i) X_{ij} - \{a_p + b_p(U_i - u_0) + c_p(U_i - u_0)^2 + d_p(U_i - u_0)^3\} X_{ip} \Big]^2 K_{h_2}(U_i - u_0) \tag{1.7} \]

with respect to $a_p$, $b_p$, $c_p$, $d_p$, where $h_2$ is the bandwidth used in the second step. In this way, a two-step estimator $\hat a_{p,TS}(u_0)$ of $a_p(u_0)$ is obtained. We will show that the bias of the two-step estimator is of order $O(h_2^4)$ and its variance of order $O\{(nh_2)^{-1}\}$, provided that

\[ h_0 = o(h_2^2), \qquad nh_0/\log(1/h_0) \to \infty, \qquad \text{and} \qquad nh_0^3 \to \infty. \]

This means that when the optimal bandwidth $h_2 \sim n^{-1/9}$ is used and the preliminary bandwidth $h_0$ lies between the rates $O(n^{-1/3})$ and $O(n^{-2/9})$, the optimal rate of convergence $O(n^{-8/9})$ for estimating $a_p$ can be achieved.
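A sketch of the full two-step recipe (1.6)-(1.7), reusing the hypothetical `local_linear_vc` helper above; again, this is our illustration of the procedure, not the authors' code.

```python
def two_step_vc(u0, U, X, Y, h0, h2):
    """Two-step estimate of a_p(u0): an undersmoothed local linear pilot fit
    for a_1, ..., a_{p-1} (bandwidth h0), then a local cubic fit (bandwidth h2)
    on the partial residuals, as in (1.7)."""
    n, p = X.shape
    # Step 1: preliminary estimates \hat a_{j,0}(U_i) at every data point
    a0 = np.array([local_linear_vc(Ui, U, X, Y, h0) for Ui in U])
    # Partial residuals: strip the fitted contribution of the first p-1 terms
    R = Y - np.sum(a0[:, : p - 1] * X[:, : p - 1], axis=1)
    # Step 2: local cubic fit of R on X_p around u0
    d = U - u0
    t = d / h2
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0) / h2
    Z = X[:, -1][:, None] * np.column_stack([np.ones(n), d, d ** 2, d ** 3])
    theta = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * R))
    return theta[0]  # \hat a_{p,TS}(u0)
```

A local linear second step, recommended for routine use in Section 2, is obtained by truncating the cubic basis to `[1, d]`.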

Note that the condition $nh_0^3 \to \infty$ is only a convenient technical condition, based on the assumption of bounded sixth moments of the covariates. It plays little role in the understanding of the two-step estimation procedure. If $X_i$ is assumed to have higher moments, the condition can be relaxed to as weak as $nh_0^{1+\epsilon} \to \infty$ for some small $\epsilon > 0$; see Condition (7) in Section 4 for details. Therefore, the requirement on $h_0$ is very minimal. A practical implication is that the two-step estimation method is not sensitive to the initial bandwidth $h_0$, which makes practical implementation much easier.

Another possible way to conduct variable smoothing for the coefficient functions is the following smoothing spline approach, proposed by Hastie and Tibshirani (1993):

\[ \sum_{i=1}^{n} \Big[ Y_i - \sum_{j=1}^{p} a_j(U_i) X_{ij} \Big]^2 + \sum_{j=1}^{p} \lambda_j \int \{a_j''(u)\}^2\, du, \]

for some smoothing parameters $\lambda_1, \ldots, \lambda_p$. While this idea is powerful, there are a number of potential problems. First, there are $p$ smoothing parameters to choose simultaneously, which is quite a task in practice. Second, the computation can be challenging; an iterative scheme was proposed in Hastie and Tibshirani (1993). Third, the sampling properties are somewhat difficult to obtain, and it is not clear whether the resulting method can achieve the same optimal rate of convergence as the one-step procedure.

[Figure 1 here. Panels: "Example 1: A typical result", "Example 2: A typical result", "Example 3: A typical result".]

Figure 1: Comparison of the performance of the one-step and two-step estimators. Solid curves: true functions; short-dashed curves: estimates based on the one-step procedure; long-dashed curves: estimates based on the two-step procedure.

The above theoretical work is not purely academic: it has important practical implications. To validate our asymptotic claims, we use three simulated examples to illustrate our methodology. The sample size is $n = 500$ and $p = 2$. Figure 1 depicts typical estimates from the one-step and two-step

methods, both using the optimal bandwidth for estimating $a_2(\cdot)$. (For the two-step estimator, we do not optimize the bandwidths $h_0$ and $h_2$ simultaneously; rather, we optimize only the bandwidth $h_2$ for a given small bandwidth $h_0$.) Details of the simulations can be found in Section 5.2. In the first example, the bias of the one-step estimate is too large, since the optimal bandwidth $h_1$ for $a_2$ is so large that $a_1$ can no longer be approximated well by a linear function over such a large neighborhood. In the second example, the estimated curve is clearly undersmoothed by the one-step estimate, since the optimal bandwidth for $a_2$ has to be very small in order to compensate for the bias arising from approximating $a_1$. The one-step estimator works reasonably well in the third example, though the two-step estimator still improves somewhat on the quality of the one-step estimate.

In real applications, we do not know in advance whether $a_p$ is really smoother than the rest of the functions. The above discussion reveals that the two-step procedure can lead to a significant gain when $a_p$ is smoother than the rest of the functions. When $a_p$ has the same degree of smoothness as the rest of the functions, we will demonstrate that the two-step estimation procedure has the same performance as the one-step approach. Therefore, the two-step scheme is always more reliable than the one-step approach. Details of implementing the two-step method are outlined in Section 2.

1.3 Outline of the paper

Section 2 gives strategies for implementing the two-step estimators. Explicit formulas for the proposed estimators are given in Section 3. Section 4 studies the asymptotic properties of the one-step and two-step estimators. In Section 5, we study the finite-sample properties of the one-step and two-step estimators via simulated examples; the two-step technique is further illustrated by an application to an environmental data set. Technical proofs are given in Section 6.

2 Practical implementation of two-step estimators

As discussed in the Introduction, the one-step procedure is not optimal when the coefficient functions admit different degrees of smoothness. However, we do not know in advance which function is less smooth. To implement the two-step strategy, one minimizes (1.6) with a small bandwidth $h_0$ to obtain preliminary estimates $\hat a_{1,0}(U_i), \ldots, \hat a_{p,0}(U_i)$ for $i = 1, \ldots, n$. With these preliminary estimates, one can then estimate each coefficient function $a_j(u_0)$ by using an equation similar to (1.7). Other techniques, such as smoothing splines, can also be used in the second stage of fitting.

In practical implementation, it usually suffices to use local linear fits instead of local cubic fits in the second step, which results in computational savings. Our experience with local polynomial fits shows that, for practical purposes, the local linear fit with optimally chosen bandwidth performs comparably with the local cubic fit with optimal bandwidth.

As discussed in the Introduction, the two-step estimator is not very sensitive to the choice of initial bandwidth, as long as it is small enough that the bias in the first-step smoothing is negligible.

This suggests the following simple automatic rule. Use cross-validation or generalized cross-validation (see, e.g., Hoover et al., 1997) to select the bandwidth $\hat h$ for the one-step fit. Then use $h_0 = 0.5\hat h$ (say) as the initial bandwidth. An advantage of the two-step procedure is that the second step is really a univariate smoothing problem. Therefore, one can apply univariate bandwidth selection procedures such as cross-validation (Stone, 1974), the pre-asymptotic substitution method (Fan and Gijbels, 1995), the plug-in bandwidth selector (Ruppert, Sheather and Wand, 1995) and the empirical bias method (Ruppert, 1997) to select the smoothing parameter. As discussed before, the preliminary bandwidth $h_0$ is not very crucial to the final estimates, since the two-step method achieves the optimal rate for a wide range of bandwidths $h_0$. This is another benefit of the two-step procedure: the bandwidth selection problem becomes relatively easy.
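In code, this rule amounts to a leave-one-out search followed by halving; a sketch using our own hypothetical helper names:

```python
def cv_bandwidth(U, X, Y, grid):
    """Leave-one-out cross-validation for the one-step fit over a bandwidth grid."""
    n = len(Y)
    scores = []
    for h in grid:
        resid = []
        for i in range(n):
            keep = np.arange(n) != i
            a = local_linear_vc(U[i], U[keep], X[keep], Y[keep], h)
            resid.append(Y[i] - X[i] @ a)
        scores.append(np.mean(np.square(resid)))
    return grid[int(np.argmin(scores))]

# Simple automatic rule from the text: undersmooth the CV-selected bandwidth.
# h_hat = cv_bandwidth(U, X, Y, grid); h0 = 0.5 * h_hat
```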

3 Formulae for the proposed estimators

The solutions to the least-squares problems (1.5)-(1.7) can easily be obtained. We take this opportunity to introduce the necessary notation. In the notation below, the subscripts "0", "1" and "2" indicate quantities related to the initial, one-step and two-step estimators, respectively. Let

\[
\mathbf{X}_0 = \begin{pmatrix}
X_{11} & X_{11}(U_1 - u_0) & \cdots & X_{1p} & X_{1p}(U_1 - u_0) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
X_{n1} & X_{n1}(U_n - u_0) & \cdots & X_{np} & X_{np}(U_n - u_0)
\end{pmatrix},
\]

\[
\mathbf{Y} = (Y_1, \ldots, Y_n)^T, \quad \text{and} \quad
\mathbf{W}_0 = \operatorname{diag}\{K_{h_0}(U_1 - u_0), \ldots, K_{h_0}(U_n - u_0)\}.
\]

Then the solution to the least-squares problem (1.6) can be expressed as

\[
\hat a_{j,0}(u_0) = e_{2j-1,2p}^T (\mathbf{X}_0^T \mathbf{W}_0 \mathbf{X}_0)^{-1} \mathbf{X}_0^T \mathbf{W}_0 \mathbf{Y}, \qquad j = 1, \ldots, p. \tag{3.1}
\]

Here and hereafter, $e_{k,m}$ denotes the unit vector of length $m$ with a 1 at the $k$-th position. The solution to problem (1.5) can be expressed as follows. Let

\[
\mathbf{X}_2 = \begin{pmatrix}
X_{1p} & X_{1p}(U_1 - u_0) & X_{1p}(U_1 - u_0)^2 & X_{1p}(U_1 - u_0)^3 \\
\vdots & \vdots & \vdots & \vdots \\
X_{np} & X_{np}(U_n - u_0) & X_{np}(U_n - u_0)^2 & X_{np}(U_n - u_0)^3
\end{pmatrix}
\]

and

\[
\mathbf{X}_3 = \begin{pmatrix}
X_{11} & X_{11}(U_1 - u_0) & \cdots & X_{1(p-1)} & X_{1(p-1)}(U_1 - u_0) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
X_{n1} & X_{n1}(U_n - u_0) & \cdots & X_{n(p-1)} & X_{n(p-1)}(U_n - u_0)
\end{pmatrix},
\]

and set

\[
\mathbf{X}_1 = (\mathbf{X}_3, \mathbf{X}_2), \qquad
\mathbf{W}_1 = \operatorname{diag}\{K_{h_1}(U_1 - u_0), \ldots, K_{h_1}(U_n - u_0)\}.
\]

Then the solution to the least-squares problem (1.5) is given by

\[
\hat a_{p,1}(u_0) = e_{2p-1,2p+2}^T (\mathbf{X}_1^T \mathbf{W}_1 \mathbf{X}_1)^{-1} \mathbf{X}_1^T \mathbf{W}_1 \mathbf{Y}. \tag{3.2}
\]

Using the notation introduced above, we can express the two-step estimator as

\[
\hat a_{p,2}(u_0) = (1, 0, 0, 0)(\mathbf{X}_2^T \mathbf{W}_2 \mathbf{X}_2)^{-1} \mathbf{X}_2^T \mathbf{W}_2 (\mathbf{Y} - \mathbf{V}), \tag{3.3}
\]

where

\[
\mathbf{W}_2 = \operatorname{diag}\{K_{h_2}(U_1 - u_0), \ldots, K_{h_2}(U_n - u_0)\}
\]

and $\mathbf{V} = (V_1, \ldots, V_n)^T$ with $V_i = \sum_{j=1}^{p-1} \hat a_{j,0}(U_i) X_{ij}$. Note that the two-step estimator $\hat a_{p,2}$ is a linear estimator for given bandwidths $h_0$ and $h_2$, since it is a weighted average of the observations $Y_1, \ldots, Y_n$. The weights are somewhat complicated. To obtain them, let $\mathbf{X}_{(i)}$ be the matrix $\mathbf{X}_0$ with $u_0 = U_i$, and let $\mathbf{W}_{(i)}$ be the matrix $\mathbf{W}_0$ with $u_0 = U_i$. Then

\[
V_i = \sum_{j=1}^{p-1} X_{ij}\, e_{2j-1,2p}^T (\mathbf{X}_{(i)}^T \mathbf{W}_{(i)} \mathbf{X}_{(i)})^{-1} \mathbf{X}_{(i)}^T \mathbf{W}_{(i)} \mathbf{Y}.
\]

Set

\[
\mathbf{B}_n = \mathbf{I}_n - \begin{pmatrix}
\sum_{j=1}^{p-1} X_{1j}\, e_{2j-1,2p}^T (\mathbf{X}_{(1)}^T \mathbf{W}_{(1)} \mathbf{X}_{(1)})^{-1} \mathbf{X}_{(1)}^T \mathbf{W}_{(1)} \\
\vdots \\
\sum_{j=1}^{p-1} X_{nj}\, e_{2j-1,2p}^T (\mathbf{X}_{(n)}^T \mathbf{W}_{(n)} \mathbf{X}_{(n)})^{-1} \mathbf{X}_{(n)}^T \mathbf{W}_{(n)}
\end{pmatrix}.
\]

Then

\[
\hat a_{p,2}(u_0) = (1, 0, 0, 0)(\mathbf{X}_2^T \mathbf{W}_2 \mathbf{X}_2)^{-1} \mathbf{X}_2^T \mathbf{W}_2 \mathbf{B}_n \mathbf{Y}. \tag{3.4}
\]

4 Main results

We impose the following technical conditions:

(1) $EX_j^{2s} < \infty$ for some $s > 2$, $j = 1, \ldots, p$.

(2) $a_j''(\cdot)$ is continuous in a neighborhood of $u_0$ for $j = 1, \ldots, p$. Further, assume $a_j''(u_0) \neq 0$ for $j = 1, \ldots, p$.

(3) The function $a_p$ has a continuous fourth derivative in a neighborhood of $u_0$.

(4) $r_{ij}''(\cdot)$ is continuous in a neighborhood of $u_0$ and $r_{ij}''(u_0) \neq 0$ for $i, j = 1, \ldots, p$, where $r_{ij}(u) = E(X_i X_j \mid U = u)$.

(5) The marginal density $f$ of $U$ has a continuous second derivative in some neighborhood of $u_0$, and $f(u_0) \neq 0$.

(6) The kernel $K(t)$ is a symmetric density function with compact support.

(7) $h_0/h_2 \to 0$ and $h_2 \to 0$, and $nh_0^{\alpha}/\log(1/h_0) \to \infty$ for any $\alpha > s/(s-2)$, with $s$ given in Condition (1).

Throughout this paper, we use the following notation. Let

\[ \mu_i = \int t^i K(t)\, dt \qquad \text{and} \qquad \nu_i = \int t^i K^2(t)\, dt, \]

and let $\mathcal{D}$ denote the vector of observed covariates, namely

\[ \mathcal{D} = (U_1, \ldots, U_n, X_{11}, \ldots, X_{1n}, \ldots, X_{p1}, \ldots, X_{pn})^T. \]

Set $r_{ij} = r_{ij}(u_0) = E(X_i X_j \mid U = u_0)$ for $i, j = 1, \ldots, p$. Put

\[ \Sigma = \operatorname{diag}\{\sigma^2(U_1), \ldots, \sigma^2(U_n)\}, \]

\[ \gamma_j(u) = (r_{1j}(u), \ldots, r_{(p-1)j}(u))^T, \qquad \gamma_j = \gamma_j(u_0) \quad \text{for } j = 1, \ldots, p, \]

and

\[ \Omega_i(u) = E\{(X_1, \ldots, X_i)^T (X_1, \ldots, X_i) \mid U = u\}, \qquad \Omega_i = \Omega_i(u_0) \quad \text{for } i = 1, \ldots, p. \]

For the one-step estimator, we have the following asymptotic bias and variance.

Theorem 1 Under Conditions (1)-(6), if $h_1 \to 0$ in such a way that $nh_1 \to \infty$, then the asymptotic conditional bias of $\hat a_{p,OS}(u_0)$ is given by

\[ \operatorname{bias}\{\hat a_{p,OS}(u_0) \mid \mathcal{D}\} = -\frac{h_1^2 \mu_2}{2 r_{pp}} \sum_{j=1}^{p-1} r_{pj}\, a_j''(u_0) + o_P(h_1^2), \]

and the asymptotic conditional variance of $\hat a_{p,OS}(u_0)$ is

\[ \operatorname{var}\{\hat a_{p,OS}(u_0) \mid \mathcal{D}\} = \frac{\sigma^2(u_0)\,(\lambda_2\, r_{pp} + \lambda_3\, \gamma_p^T \Omega_{p-1}^{-1} \gamma_p)}{n h_1 f(u_0)\, \lambda_1\, r_{pp}\,(r_{pp} - \gamma_p^T \Omega_{p-1}^{-1} \gamma_p)}\,(1 + o_P(1)), \]

where $\lambda_1 = (\mu_4 - \mu_2^2)^2$, $\lambda_2 = \nu_0\mu_4^2 - 2\mu_2\mu_4\nu_2 + \mu_2^2\nu_4$, and $\lambda_3 = 2\mu_2\mu_4\nu_2 - 2\nu_0\mu_2^2\mu_4 - \mu_2^2\nu_4 + \nu_0\mu_4^2$.

The proofs of Theorem 1 and the other theorems are given in Section 6. It is clear that the conditional MSE of the one-step estimator $\hat a_{p,OS}(u_0)$ is only of order $O_P\{h_1^4 + (nh_1)^{-1}\}$, which achieves the rate $O_P(n^{-4/5})$ when a bandwidth $h_1 = O(n^{-1/5})$ is used. The bias expression above indicates clearly that the approximation errors of the functions $a_1, \ldots, a_{p-1}$ are transmitted to the bias in estimating $a_p$. Thus, the one-step estimator of $a_p$ inherits non-negligible approximation errors and is not optimal. Note that Theorem 1 continues to hold if Condition (3) is dropped; see also Theorem 3. We now consider the asymptotic MSE of the two-step estimator.

Theorem 2 If Conditions (1)-(7) hold, then the asymptotic conditional bias of $\hat a_{p,TS}(u_0)$ can be expressed as

\[ \operatorname{bias}\{\hat a_{p,TS}(u_0) \mid \mathcal{D}\} = \frac{1}{4!}\,\frac{\mu_4^2 - \mu_2\mu_6}{\mu_4 - \mu_2^2}\, a_p^{(4)}(u_0)\, h_2^4 - \frac{h_0^2 \mu_2}{2 r_{pp}} \sum_{j=1}^{p-1} a_j''(u_0)\, r_{pj} + o_P(h_2^4 + h_0^2), \]

and the asymptotic conditional variance of $\hat a_{p,TS}(u_0)$ is given by

\[ \operatorname{var}\{\hat a_{p,TS}(u_0) \mid \mathcal{D}\} = \frac{(\mu_4^2\nu_0 - 2\mu_2\mu_4\nu_2 + \mu_2^2\nu_4)\,\sigma^2(u_0)}{n h_2 f(u_0)\,(\mu_4 - \mu_2^2)^2}\; e_{p,p}^T \Omega_p^{-1} e_{p,p}\,\{1 + o_P(1)\}. \]

By Theorem 2, the asymptotic variance of the two-step estimator is independent of the initial bandwidth as long as $nh_0^{\alpha} \to \infty$, where $\alpha$ is given in Condition (7). Thus, the initial bandwidth $h_0$ should be chosen as small as possible, subject to the constraint that $nh_0^{\alpha} \to \infty$. In particular, when $h_0 = o(h_2^2)$, the bias from the initial estimator becomes negligible, and the bias expression for the two-step estimator is

\[ \frac{1}{4!}\,\frac{\mu_4^2 - \mu_6\mu_2}{\mu_4 - \mu_2^2}\, a_p^{(4)}(u_0)\, h_2^4 + o_P(h_2^4). \]

Hence, by taking the optimal bandwidth $h_2$ of order $n^{-1/9}$, the conditional MSE of the two-step estimator achieves the optimal rate of convergence $O_P(n^{-8/9})$.
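For completeness, this rate follows from balancing the squared bias and the variance in the MSE expansion implied by Theorem 2 (the constants $C_1, C_2 > 0$ absorb the quantities above):

\[ \operatorname{MSE}(h_2) \asymp C_1 h_2^8 + \frac{C_2}{n h_2}, \qquad 8 C_1 h_2^7 = \frac{C_2}{n h_2^2} \;\Longrightarrow\; h_2 \asymp n^{-1/9}, \quad \operatorname{MSE} \asymp n^{-8/9}. \]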

Remark 1 Consider the ideal situation where $a_1, \ldots, a_{p-1}$ are known. Then one can simply run a local cubic fit to estimate $a_p$. The resulting estimator has asymptotic bias

\[ \frac{1}{4!}\,\frac{\mu_4^2 - \mu_6\mu_2}{\mu_4 - \mu_2^2}\, a_p^{(4)}(u_0)\, h_2^4 + o_P(h_2^4) \]

and asymptotic variance

\[ \frac{\mu_4^2\nu_0 - 2\mu_4\mu_2\nu_2 + \mu_2^2\nu_4}{n h_2 f(u_0)\, r_{pp}\,(\mu_4 - \mu_2^2)^2}\,\sigma^2(u_0) + o_P\{(nh_2)^{-1}\}. \]

This ideal estimator has the same asymptotic bias as the two-step estimator, and the same order of variance. In other words, the two-step estimator enjoys the same optimal rate of convergence as the ideal one.

We now consider the case where $a_p$ is as smooth as the rest of the functions. In technical terms, we assume that $a_p$ has only a continuous second derivative. In this case, a local linear approximation is used for $a_p$ in both the one-step and two-step procedures. With some abuse of notation, we still denote the resulting one-step and two-step estimators by $\hat a_{p,OS}$ and $\hat a_{p,TS}$, respectively. Our technical results establish that the two-step estimator does not lose statistical efficiency: it has the same performance as the one-step procedure. Since it gains efficiency when $a_p$ is smoother, we conclude that the two-step estimator is preferable. These results give theoretical endorsement to the two-step method proposed in Section 2.

Theorem 3 Under Conditions (1)-(2) and (4)-(6), if $h_1 \to 0$ and $nh_1 \to \infty$, then the asymptotic conditional bias of the one-step estimator is given by

\[ \operatorname{bias}\{\hat a_{p,OS}(u_0) \mid \mathcal{D}\} = \frac{h_1^2 \mu_2}{2}\, a_p''(u_0)\,(1 + o_P(1)), \]

and the asymptotic conditional variance of $\hat a_{p,OS}(u_0)$ is given by

\[ \operatorname{var}\{\hat a_{p,OS}(u_0) \mid \mathcal{D}\} = \frac{\sigma^2(u_0)\,\nu_0}{n h_1 f(u_0)}\; e_{p,p}^T \Omega_p^{-1} e_{p,p}\,\{1 + o_P(1)\}. \]

We now consider the asymptotic behavior of the two-step estimator.

Theorem 4 Suppose that Conditions (1)-(2) and (4)-(7) hold. Then the asymptotic conditional bias is

\[ \operatorname{bias}\{\hat a_{p,TS}(u_0) \mid \mathcal{D}\} = \frac{\mu_2}{2}\Big( a_p''(u_0)\, h_2^2 - \frac{h_0^2}{r_{pp}} \sum_{j=1}^{p-1} a_j''(u_0)\, r_{pj} \Big)(1 + o_P(1)), \]

and the asymptotic variance is

\[ \operatorname{var}\{\hat a_{p,TS}(u_0) \mid \mathcal{D}\} = \frac{\nu_0\,\sigma^2(u_0)}{n h_2 f(u_0)}\; e_{p,p}^T \Omega_p^{-1} e_{p,p}\,\{1 + o_P(1)\}. \]

Remark 2 The asymptotic bias of the two-step estimator simplifies to

\[ \tfrac{1}{2}\, a_p''(u_0)\,\mu_2 h_2^2\,(1 + o_P(1)) \]

by taking the initial bandwidth $h_0 = o(h_2)$. Moreover, the two-step estimator has the same asymptotic variance as the one-step estimator. In other words, the performance of the one-step and two-step estimators is asymptotically identical.

Remark 3 When $a_1(t), \ldots, a_{p-1}(t)$ are known, we can use a local linear fit to estimate $a_p$. Such an ideal estimator possesses bias

\[ \tfrac{1}{2}\, a_p''(u_0)\,\mu_2 h_2^2\,\{1 + o_P(1)\} \]

and variance

\[ \frac{\sigma^2(u_0)\,\nu_0}{n h_2 f(u_0)\, r_{pp}}\,\{1 + o_P(1)\}. \]

So both the one-step and two-step estimators have the same order of MSE as the ideal estimator. However, the variance of the ideal estimator is typically smaller: unless $(X_1, \ldots, X_{p-1})$ and $X_p$ are uncorrelated given $U = u_0$, the asymptotic variance of the ideal estimator is always smaller.

5 Simulations and Applications

In this section, we illustrate our methodology via an application to an environmental data set and via simulations. Throughout this section, we use the Epanechnikov kernel $K(t) = 0.75(1 - t^2)_+$.
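For this kernel, the moment constants defined in Section 4 are easily evaluated; for instance,

\[ \mu_2 = \int_{-1}^{1} t^2\,\tfrac{3}{4}(1 - t^2)\, dt = \tfrac{1}{5}, \qquad \nu_0 = \int_{-1}^{1} \big\{\tfrac{3}{4}(1 - t^2)\big\}^2\, dt = \tfrac{3}{5}. \]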

5.1 Application to an environmental data set

We now illustrate the methodology via an application to an environmental data set. The data set used here consists of daily measurements of pollutants and other environmental factors in Hong Kong between January 1, 1994 and December 31, 1995 (courtesy of Professor T.S. Lau). Three pollutants are considered here: Sulphur Dioxide (in $\mu g/m^3$), Nitrogen Dioxide (in $\mu g/m^3$) and dust (in $\mu g/m^3$). Table 1 summarizes their correlation coefficients. The correlation between the dust level and NO2 is quite high. Figure 2 depicts the marginal distributions of the pollutant levels in the summer (April 1 - September 30) and winter (October 1 - March 31) seasons. The pollutant levels in the summer season (when it rains very often) tend to be lower and to have smaller variation.

Table 1: Correlation coefficients among the pollutants

                     Sulphur Dioxide   Nitrogen Dioxide   Dust
Sulphur Dioxide         1.000000          0.402452       0.281008
Nitrogen Dioxide                          1.000000       0.781975
Dust                                                     1.000000

[Figure 2 here. Panels: "Densities of SO2", "Densities of NO2", "Densities of dust".]

Figure 2: Density estimates of the distributions of the pollutants. Solid curves are for the summer season and dashed curves are for the winter season.

An objective of the study is to understand the association between the pollutant levels and the number of daily total hospital admissions for circulatory and respiratory problems, and to examine the extent to which this association varies over time. We consider the relationship among the number of daily hospital admissions ($Y$) and the levels of the pollutants SO2, NO2 and dust, denoted by $X_2$, $X_3$ and $X_4$, respectively. We take $X_1 = 1$ (the intercept term) and $U = t =$ time.

[Figure 3 here. Panels: "Coefficient Functions a1(t)", "Coefficient Functions a2(t)", "Coefficient Functions a3(t)", "Coefficient Functions a4(t)", each plotting the coefficient function against t.]

Figure 3: The estimated coefficient functions. The solid and long-dashed curves are for the two-step and one-step methods, respectively. Two short-dashed curves indicate pointwise 95% confidence intervals with bias ignored.

The varying-coefficient model

\[ Y = a_1(t) + a_2(t) X_2 + a_3(t) X_3 + a_4(t) X_4 + \varepsilon \tag{5.1} \]

is fitted to the given data. The two-step method is employed. An initial bandwidth $h_0 = 0.06 \times 729$ (six percent of the whole interval) was chosen. As anticipated, the results do not alter much with different choices of initial bandwidth. The second-stage bandwidths $h_2$ were chosen to be 25%, 25%, 30% and 30% of the interval length for the functions $a_1, \ldots, a_4$, respectively. Figure 3 depicts the estimated coefficient functions; they describe the extent to which the coefficients vary with time. Two short-dashed curves indicate pointwise 95% confidence intervals with bias ignored.

The standard errors are computed from the second-stage local cubic regression; see Section 4.3 of Fan and Gijbels (1996) for how to compute estimated standard errors from local polynomial regression. The figure shows that there is a strong time effect on the coefficient functions. For comparison purposes, Figure 3 also superimposes the estimates (long-dashed curves) obtained by the one-step procedure with bandwidths 25%, 25%, 30% and 30% of the time interval for $a_1, \ldots, a_4$, respectively.

To compare the performance of the one-step and two-step methods, we define the relative efficiency of the two-step method over the one-step method by computing

\[ \{\mathrm{RSS}_h(\text{one-step}) - \mathrm{RSS}_h(\text{two-step})\}\,/\,\mathrm{RSS}_h(\text{two-step}), \]

where $\mathrm{RSS}_h(\text{one-step})$ and $\mathrm{RSS}_h(\text{two-step})$ are the residual sums of squares for the one-step procedure using the bandwidth $h$ and the two-step method using the same bandwidth $h$ in the second stage, respectively. Figure 4(a) shows that the two-step method has smaller RSS than the one-step method. The gain is more pronounced as the bandwidth increases. This can be explained intuitively as follows: as the bandwidth increases, at least one of the components acquires a non-negligible bias.
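In code, the comparison is a one-liner once the two residual sums of squares at a common second-stage bandwidth $h$ are available (our notation):

```python
def relative_efficiency(rss_one_step, rss_two_step):
    """{RSS_h(one-step) - RSS_h(two-step)} / RSS_h(two-step), as defined above."""
    return (rss_one_step - rss_two_step) / rss_two_step
```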

[Figure 4 here. Panels: (a) "Relative RSS" (relative efficiency against bandwidth), (b) "RSS vs. time delay" (RSS of the two-step fit against delay time), (c) "Mean number of admissions" (daily admissions against time).]

Figure 4: (a) The relative efficiency of the two-step method over the one-step method. (b) Testing whether there is any time delay in the response variable. (c) The expected number of hospital admissions over time when pollutant levels are set at their averages.

Pollutants may not have an immediate effect on the circulatory and respiratory systems. A natural question is whether there is any time lag in the response variable. To study this question, we fit model (5.1) for each time lag $\Delta$ using the data

\[ \big\{ Y(t + \Delta),\ X_2(t),\ X_3(t),\ X_4(t) \big\}, \qquad t = 1, 2, \ldots. \]

Figure 4(b) presents the resulting residual sum of squares for each time lag. As $\Delta$ gets larger, so does the residual sum of squares. This in turn suggests no evidence of a time delay in the response variable. We now examine how the expected number of hospital admissions changes over time when the pollutant levels are set at their averages. Namely, we plot the function

\[ \hat Y(t) = \hat a_1(t) + \hat a_2(t)\,\bar X_2 + \hat a_3(t)\,\bar X_3 + \hat a_4(t)\,\bar X_4 \]

against $t$, where the estimated coefficient functions were obtained by the two-step approach. Figure 4(c) presents the result. It indicates an overall increasing trend in the number of hospital admissions for respiratory and circulatory problems. A seasonal pattern can also be seen. These features are not available from the usual parametric least-squares models.
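Purely as a usage illustration (the Hong Kong data set is not distributed with the paper, so the array names `t`, `so2`, `no2`, `dust`, `y` below are assumptions of ours), the fit described in this subsection could be reproduced along these lines with the sketches from Sections 1.2 and 2:

```python
# Assumed arrays: t (day index, 1..729), so2, no2, dust, y (daily admissions).
# X = np.column_stack([np.ones(len(t)), so2, no2, dust])  # X1 = 1: intercept
# h0 = 0.06 * 729                    # initial bandwidth: 6% of the interval
# h2 = 0.30 * 729                    # second-stage bandwidth for a4
# a4_hat = np.array([two_step_vc(u0, t, X, y, h0, h2) for u0 in t])
```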

5.2 Simulations

[Figure 5 here. Panels: "Example 1: true functions", "Example 2: true functions", "Example 3: true functions".]

Figure 5: Varying-coefficient functions. Solid curves are $a_1(\cdot)$ and dashed curves are $a_2(\cdot)$.

We use the following three examples to illustrate the performance of our method:

Example 1: $Y = \sin(60U)\, X_1 + 4U(1 - U)\, X_2 + \varepsilon$;

Example 2: $Y = \sin(6U)\, X_1 + \sin(2U)\, X_2 + \varepsilon$;

Example 3: $Y = \sin\{8(U - 0.5)\}\, X_1 + \big( 3.5[\exp\{-(4U - 1)^2\} + \exp\{-(4U - 3)^2\}] - 1.5 \big) X_2 + \varepsilon$;

where $U$ follows a uniform distribution on $[0, 1]$; $X_1$ and $X_2$ are normally distributed with correlation coefficient $2^{-1/2}$, with standard normal marginal distributions; and $\varepsilon$, $U$ and $(X_1, X_2)$ are independent.

The random variable $\varepsilon$ follows a normal distribution with mean zero and variance $\sigma^2$, where $\sigma^2$ is chosen so that the signal-to-noise ratio is about 5:1; namely,

\[ \sigma^2 = 0.2\,\operatorname{var}\{m(U, X_1, X_2)\}, \qquad \text{with } m(U, X_1, X_2) = E(Y \mid U, X_1, X_2). \]

Figure 5 shows the varying-coefficient functions $a_1$ and $a_2$ for Examples 1-3.

For each of the above examples, we conducted 100 simulations with sample sizes $n = 250, 500, 1000$. The mean integrated squared error (MISE) for estimating $a_2$ was recorded. For the one-step procedure, we plot the MISE against $h_1$, so that the optimal bandwidth can be chosen. For the two-step procedure, we choose a small initial bandwidth $h_0$ and then compute the MISE of the two-step estimator as a function of $h_2$. Specifically, we chose $h_0 = 0.03$, 0.04 and 0.05 for Examples 1, 2 and 3, respectively. The optimal bandwidths $h_1$ and $h_2$ were used to compute the resulting estimators presented in Figure 1. Among the 100 samples, we select the one for which the two-step estimator attains its median performance; once the sample is selected, the one-step and two-step estimates are computed. Figure 1 depicts the resulting estimates based on $n = 500$.
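A sketch of the data-generating mechanism for Example 2, with the coefficient functions exactly as printed above; the noise variance is calibrated empirically from the simulated signal rather than analytically, and the seed is arbitrary:

```python
rng = np.random.default_rng(0)

def simulate_example2(n):
    """One sample from Example 2: Y = sin(6U) X1 + sin(2U) X2 + eps."""
    U = rng.uniform(0.0, 1.0, n)
    Z = rng.standard_normal((n, 2))
    rho = 2 ** (-0.5)                        # corr(X1, X2) = 2^{-1/2}
    X1 = Z[:, 0]
    X2 = rho * Z[:, 0] + np.sqrt(1.0 - rho ** 2) * Z[:, 1]
    m = np.sin(6 * U) * X1 + np.sin(2 * U) * X2
    sigma2 = 0.2 * np.var(m)                 # signal-to-noise ratio about 5:1
    Y = m + np.sqrt(sigma2) * rng.standard_normal(n)
    return U, np.column_stack([X1, X2]), Y
```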

Figure 6 depicts the MISE as a function of bandwidth. The MISE curves for the two-step method are always below those for the one-step approach in the three examples we tested. This is in line with our asymptotic theory that the two-step approach outperforms the one-step procedure when the initial bandwidth is correctly chosen. The improvement of the two-step estimator is quite substantial when the optimal bandwidth is used (in comparison with the one-step approach using its own optimal bandwidth). Further, for the two-step estimator, the MISE curve is flatter than that for the one-step method. This in turn reveals that the bandwidth is less crucial for the two-step estimator than for the one-step procedure. This is an extra benefit of the two-step procedure.
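The MISE curves of Figure 6 can be approximated by Monte Carlo along the following lines (our sketch, reusing the hypothetical simulator and estimator above):

```python
def mc_mise(n_sim, n, h0, h2, grid):
    """Monte Carlo MISE of the two-step estimate of a2 in Example 2, with the
    integrated squared error computed on `grid` by the trapezoid rule."""
    ises = []
    for _ in range(n_sim):
        U, X, Y = simulate_example2(n)
        a2_hat = np.array([two_step_vc(u0, U, X, Y, h0, h2) for u0 in grid])
        ises.append(np.trapz((a2_hat - np.sin(2 * grid)) ** 2, grid))
    return float(np.mean(ises))
```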

6 Proofs

The proof of Theorem 3 (and Theorem 4) is similar to that of Theorem 1 (and Theorem 2); thus, we prove only Theorems 1 and 2. When the asymptotic conditional bias and variance of the two-step estimator $\hat a_{p,TS}(u_0)$ are calculated, the following lemma on uniform convergence will be used.

Lemma 1 Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. random vectors, where the $Y_i$'s are scalar random variables. Assume further that $E|Y|^s < \infty$ and $\sup_x \int |y|^s f(x, y)\, dy < \infty$, where $f$ denotes the joint density of $(X, Y)$. Let $K$ be a bounded positive function with bounded support, satisfying a Lipschitz condition. Then

\[ \sup_{x \in D} \Big| n^{-1} \sum_{i=1}^{n} \big\{ K_h(X_i - x) Y_i - E[K_h(X_i - x) Y_i] \big\} \Big| = O_P\big[\{nh/\log(1/h)\}^{-1/2}\big], \]

provided that $n^{2\varepsilon - 1} h \to \infty$ for some $\varepsilon < 1 - s^{-1}$.


[Figure 6 here. Panels: "Example 1: n = 250", "Example 2: n = 250", "Example 3: n = 250", "Example 1: n = 500", "Example 2: n = 500", "Example 3: n = 500".]

Figure 6: MISE as a function of bandwidth. Solid curves: one-step procedure; dashed curves: two-step procedure.

Proof: This follows immediately from the result obtained by Mack and Silverman (1982).

The following notation will be used in the proofs of the theorems. Let

\[ S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{pmatrix} \]

with

\[ S_{11} = \Omega_{p-1} \otimes \begin{pmatrix} 1 & 0 \\ 0 & \mu_2 \end{pmatrix}, \qquad S_{12} = \gamma_p \otimes \begin{pmatrix} 1 & 0 & \mu_2 & 0 \\ 0 & \mu_2 & 0 & \mu_4 \end{pmatrix} \]

and

\[ S_{22} = r_{pp} \begin{pmatrix} 1 & 0 & \mu_2 & 0 \\ 0 & \mu_2 & 0 & \mu_4 \\ \mu_2 & 0 & \mu_4 & 0 \\ 0 & \mu_4 & 0 & \mu_6 \end{pmatrix}, \]

where $\otimes$ denotes the Kronecker product. Let $\tilde S$ be the matrix defined as $S$ but with each $\mu_i$ replaced by $\nu_i$. Set

\[ Q = \Omega_p \otimes \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad S_{(i)} = \Omega_p(U_i) \otimes \begin{pmatrix} 1 & 0 \\ 0 & \mu_2 \end{pmatrix}, \qquad S_{(0)} = S_{(i)}\big|_{U_i = u_0}, \]

and

\[ \beta_{(i)}^T = \sum_{j=1}^{p} a_j''(U_i)\,\mu_2\,\big(\gamma_j^T(U_i),\, r_{pj}(U_i)\big) \otimes (1, 0), \qquad \beta^T = (\gamma_p^T, r_{pp}) \otimes (1, 0), \]

with $\beta_{(0)}$ denoting $\beta_{(i)}$ evaluated at $U_i = u_0$. Put

\[ A = I_{p-1} \otimes \begin{pmatrix} 1 & 0 \\ 0 & h_1 \end{pmatrix}, \qquad G = I_p \otimes \begin{pmatrix} 1 & 0 \\ 0 & h_0 \end{pmatrix}, \]

\[ D = \operatorname{diag}(1, h_1, h_1^2, h_1^3), \qquad D_2 = \operatorname{diag}(1, h_2, h_2^2, h_2^3). \]

We are now ready to prove our results.

Proof of Theorem 1: First, let us calculate the asymptotic conditional bias of $\hat a_{p,1}(u_0)$. By Taylor expansion, we have

\[ \mathbf{Y} = \mathbf{X}_1 \big(a_1(u_0), a_1'(u_0), \ldots, a_{p-1}(u_0), a_{p-1}'(u_0), a_p(u_0), a_p'(u_0), \tfrac{1}{2}a_p''(u_0), \tfrac{1}{3!}a_p'''(u_0)\big)^T \]
\[ \qquad + \frac{1}{2}\sum_{j=1}^{p-1}\begin{pmatrix} a_j''(\xi_{1j})(U_1-u_0)^2 X_{1j} \\ \vdots \\ a_j''(\xi_{nj})(U_n-u_0)^2 X_{nj} \end{pmatrix} + \frac{1}{4!}\begin{pmatrix} a_p^{(4)}(\eta_1)(U_1-u_0)^4 X_{1p} \\ \vdots \\ a_p^{(4)}(\eta_n)(U_n-u_0)^4 X_{np} \end{pmatrix} + \tilde\varepsilon, \]

where $\tilde\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$, and $\xi_{ij}$ and $\eta_i$ lie between $U_i$ and $u_0$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p-1$. Thus,

\[ \hat a_{p,1}(u_0) = a_p(u_0) + \frac{1}{2}\sum_{j=1}^{p-1} e_{2p-1,2p+2}^T(\mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1)^{-1}\mathbf{X}_1^T\mathbf{W}_1 \begin{pmatrix} a_j''(\xi_{1j})(U_1-u_0)^2 X_{1j} \\ \vdots \\ a_j''(\xi_{nj})(U_n-u_0)^2 X_{nj} \end{pmatrix} \]
\[ \qquad + \frac{1}{4!}\, e_{2p-1,2p+2}^T(\mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1)^{-1}\mathbf{X}_1^T\mathbf{W}_1 \begin{pmatrix} a_p^{(4)}(\eta_1)(U_1-u_0)^4 X_{1p} \\ \vdots \\ a_p^{(4)}(\eta_n)(U_n-u_0)^4 X_{np} \end{pmatrix} + e_{2p-1,2p+2}^T(\mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1)^{-1}\mathbf{X}_1^T\mathbf{W}_1\tilde\varepsilon. \]

Obviously,

\[ \mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1 = \begin{pmatrix} \mathbf{X}_3^T\mathbf{W}_1\mathbf{X}_3 & \mathbf{X}_3^T\mathbf{W}_1\mathbf{X}_2 \\ \mathbf{X}_2^T\mathbf{W}_1\mathbf{X}_3 & \mathbf{X}_2^T\mathbf{W}_1\mathbf{X}_2 \end{pmatrix}. \]

By calculating means and variances, one can easily obtain

\[ \mathbf{X}_3^T\mathbf{W}_1\mathbf{X}_3 = n f(u_0)\, A S_{11} A\,(1 + o_P(1)), \qquad \mathbf{X}_3^T\mathbf{W}_1\mathbf{X}_2 = n f(u_0)\, A S_{12} D\,(1 + o_P(1)), \]

and

\[ \mathbf{X}_2^T\mathbf{W}_1\mathbf{X}_2 = n f(u_0)\, D S_{22} D\,(1 + o_P(1)). \tag{6.1} \]

Combining the last three asymptotic expressions leads to

\[ \mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1 = n f(u_0)\,\operatorname{diag}(A, D)\, S\,\operatorname{diag}(A, D)\,(1 + o_P(1)). \]

Similarly, we have

\[ \mathbf{X}_3^T\mathbf{W}_1\begin{pmatrix} a_j''(\xi_{1j})(U_1-u_0)^2X_{1j} \\ \vdots \\ a_j''(\xi_{nj})(U_n-u_0)^2X_{nj} \end{pmatrix} = n f(u_0)\, h_1^2\, a_j''(u_0)\, A\,\{\gamma_j \otimes (1,0)^T\}\,\mu_2\,(1 + o_P(1)) \]

and

\[ \mathbf{X}_2^T\mathbf{W}_1\begin{pmatrix} a_j''(\xi_{1j})(U_1-u_0)^2X_{1j} \\ \vdots \\ a_j''(\xi_{nj})(U_n-u_0)^2X_{nj} \end{pmatrix} = n f(u_0)\, h_1^2\, a_j''(u_0)\, D \begin{pmatrix} r_{pj}\mu_2 \\ 0 \\ r_{pj}\mu_4 \\ 0 \end{pmatrix}(1 + o_P(1)). \]

Thus,

\[ \mathbf{X}_1^T\mathbf{W}_1\begin{pmatrix} a_j''(\xi_{1j})(U_1-u_0)^2X_{1j} \\ \vdots \\ a_j''(\xi_{nj})(U_n-u_0)^2X_{nj} \end{pmatrix} = n f(u_0)\, h_1^2\, a_j''(u_0)\,\operatorname{diag}(A, D)\,\big(\gamma_j^T \otimes (1,0)\,\mu_2,\; r_{pj}\mu_2,\; 0,\; r_{pj}\mu_4,\; 0\big)^T (1 + o_P(1)). \]

So the asymptotic conditional bias of $\hat a_{p,1}(u_0)$ is given by

\[ \operatorname{bias}\{\hat a_{p,1}(u_0)\mid\mathcal{D}\} = \frac{1}{2} h_1^2 \sum_{j=1}^{p-1} a_j''(u_0)\, e_{2p-1,2p+2}^T S^{-1}\big(\gamma_j^T \otimes (1,0)\,\mu_2,\; r_{pj}\mu_2,\; 0,\; r_{pj}\mu_4,\; 0\big)^T (1 + o_P(1)). \]

Using the properties of the Kronecker product, we have

\[ \operatorname{bias}\{\hat a_{p,1}(u_0)\mid\mathcal{D}\} = \frac{h_1^2\mu_2}{2\,(r_{pp} - \gamma_p^T\Omega_{p-1}^{-1}\gamma_p)\, r_{pp}} \sum_{j=1}^{p-1} \big(r_{pj}\,\gamma_p^T\Omega_{p-1}^{-1}\gamma_p - r_{pp}\,\gamma_p^T\Omega_{p-1}^{-1}\gamma_j\big)\, a_j''(u_0)\,(1 + o_P(1)) = -\frac{h_1^2\mu_2}{2 r_{pp}} \sum_{j=1}^{p-1} r_{pj}\, a_j''(u_0) + o_P(h_1^2), \]

since $\gamma_j$ is the $j$-th column of $\Omega_{p-1}$ for $j \le p-1$, whence $\gamma_p^T\Omega_{p-1}^{-1}\gamma_j = r_{pj}$.

We now calculate the asymptotic variance. Using a similar asymptotic argument, it is easy to see that the asymptotic conditional variance of $\hat a_{p,1}(u_0)$ is given by

\[ \operatorname{var}\{\hat a_{p,1}(u_0)\mid\mathcal{D}\} = e_{2p-1,2p+2}^T(\mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1)^{-1}\mathbf{X}_1^T\mathbf{W}_1\,\Sigma\,\mathbf{W}_1\mathbf{X}_1(\mathbf{X}_1^T\mathbf{W}_1\mathbf{X}_1)^{-1}e_{2p-1,2p+2} = \frac{\sigma^2(u_0)}{n h_1 f(u_0)}\, e_{2p-1,2p+2}^T S^{-1}\tilde S S^{-1} e_{2p-1,2p+2}\,(1 + o_P(1)). \]

By the properties of the Kronecker product, it follows that

\[ \operatorname{var}\{\hat a_{p,1}(u_0)\mid\mathcal{D}\} = \frac{\sigma^2(u_0)\,(\lambda_2\, r_{pp} + \lambda_3\,\gamma_p^T\Omega_{p-1}^{-1}\gamma_p)}{n h_1 f(u_0)\,\lambda_1\, r_{pp}\,(r_{pp} - \gamma_p^T\Omega_{p-1}^{-1}\gamma_p)}\,(1 + o_P(1)), \]

with $\lambda_1$, $\lambda_2$ and $\lambda_3$ as in Theorem 1. This establishes Theorem 1.

Proof of Theorem 2: We first compute the asymptotic conditional bias. By Taylor expansion, one obtains

\[ \mathbf{Y} = \mathbf{X}_{(i)}\big(a_1(U_i), a_1'(U_i), \ldots, a_p(U_i), a_p'(U_i)\big)^T + \frac{1}{2}\sum_{j=1}^{p}\begin{pmatrix} a_j''(\xi_{1j})(U_1-U_i)^2X_{1j} \\ \vdots \\ a_j''(\xi_{nj})(U_n-U_i)^2X_{nj} \end{pmatrix} + \tilde\varepsilon \]
\[ = \mathbf{X}_{(i)}\big(a_1(U_i), a_1'(U_i), \ldots, a_p(U_i), a_p'(U_i)\big)^T + \frac{1}{2}\sum_{j=1}^{p}\begin{pmatrix} a_j''(U_i)(U_1-U_i)^2X_{1j} \\ \vdots \\ a_j''(U_i)(U_n-U_i)^2X_{nj} \end{pmatrix} \]
\[ \qquad + \frac{1}{2}\sum_{j=1}^{p}\begin{pmatrix} \{a_j''(\xi_{1j})-a_j''(U_i)\}(U_1-U_i)^2X_{1j} \\ \vdots \\ \{a_j''(\xi_{nj})-a_j''(U_i)\}(U_n-U_i)^2X_{nj} \end{pmatrix} + \tilde\varepsilon, \]

where $\xi_{kj}$ lies between $U_i$ and $U_k$. Thus, for $l = 1, \ldots, p-1$,

\[ \hat a_{l,0}(U_i) = a_l(U_i) + \frac{1}{2}\, e_{2l-1,2p}^T(\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\sum_{j=1}^{p}\begin{pmatrix} a_j''(U_i)(U_1-U_i)^2X_{1j} \\ \vdots \\ a_j''(U_i)(U_n-U_i)^2X_{nj} \end{pmatrix} \]
\[ \qquad + \frac{1}{2}\, e_{2l-1,2p}^T(\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\sum_{j=1}^{p}\begin{pmatrix} \{a_j''(\xi_{1j})-a_j''(U_i)\}(U_1-U_i)^2X_{1j} \\ \vdots \\ \{a_j''(\xi_{nj})-a_j''(U_i)\}(U_n-U_i)^2X_{nj} \end{pmatrix} + e_{2l-1,2p}^T(\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\tilde\varepsilon. \]

By Lemma 1, we have

\[ \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)} = n f(U_i)\, G S_{(i)} G\,(1 + o_P(1)) \tag{6.2} \]

and

\[ \sum_{j=1}^{p}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\begin{pmatrix} a_j''(U_i)(U_1-U_i)^2X_{1j} \\ \vdots \\ a_j''(U_i)(U_n-U_i)^2X_{nj} \end{pmatrix} = n f(U_i)\, h_0^2\, G\,\beta_{(i)}\,(1 + o_P(1)). \tag{6.3} \]

Note that in the applications below we only consider those $U_i$'s lying in a neighborhood of $u_0$; by the continuity assumptions, the $o_P(1)$ terms hold uniformly over such $i$. Combining (6.2) and (6.3), we have

\[ \frac{1}{2}\, e_{2l-1,2p}^T(\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\sum_{j=1}^{p}\begin{pmatrix} a_j''(U_i)(U_1-U_i)^2X_{1j} \\ \vdots \\ a_j''(U_i)(U_n-U_i)^2X_{nj} \end{pmatrix} = \frac{1}{2}h_0^2\, e_{2l-1,2p}^T S_{(i)}^{-1}\beta_{(i)}\,(1 + o_P(1)). \]

Note that $K$ has bounded support. From the last expression and the uniform continuity of the functions $a_j''(\cdot)$ in a neighborhood of $u_0$, it follows that

\[ E\{\hat a_{l,0}(U_i) - a_l(U_i)\mid\mathcal{D}\} = \frac{1}{2}h_0^2\, e_{2l-1,2p}^T S_{(i)}^{-1}\beta_{(i)}\,(1 + o_P(1)). \tag{6.4} \]

Since

\[ \begin{pmatrix} Y_1 - \sum_{j=1}^{p-1}\hat a_{j,0}(U_1)X_{1j} \\ \vdots \\ Y_n - \sum_{j=1}^{p-1}\hat a_{j,0}(U_n)X_{nj} \end{pmatrix} = \begin{pmatrix} a_p(U_1)X_{1p} \\ \vdots \\ a_p(U_n)X_{np} \end{pmatrix} + \begin{pmatrix} \sum_{j=1}^{p-1}\{a_j(U_1)-\hat a_{j,0}(U_1)\}X_{1j} \\ \vdots \\ \sum_{j=1}^{p-1}\{a_j(U_n)-\hat a_{j,0}(U_n)\}X_{nj} \end{pmatrix} + \tilde\varepsilon, \]

it follows from (3.3) that

\[ \hat a_{p,2}(u_0) = a_p(u_0) + \frac{1}{4!}(1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\begin{pmatrix} a_p^{(4)}(\eta_1)(U_1-u_0)^4X_{1p} \\ \vdots \\ a_p^{(4)}(\eta_n)(U_n-u_0)^4X_{np} \end{pmatrix} \]
\[ \qquad + (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\begin{pmatrix} \sum_{j=1}^{p-1}\{a_j(U_1)-\hat a_{j,0}(U_1)\}X_{1j} \\ \vdots \\ \sum_{j=1}^{p-1}\{a_j(U_n)-\hat a_{j,0}(U_n)\}X_{nj} \end{pmatrix} + (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\tilde\varepsilon \]
\[ \equiv a_p(u_0) + \frac{1}{4!}\tilde J_1 + \tilde J_2 + (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\tilde\varepsilon. \tag{6.5} \]

By simple calculation, we have

\[ E(\tilde J_1\mid\mathcal{D}) = h_2^4\, a_p^{(4)}(u_0)\,(1,0,0,0)\,S_{22}^{-1}\begin{pmatrix} r_{pp}\mu_4 \\ 0 \\ r_{pp}\mu_6 \\ 0 \end{pmatrix}(1 + o_P(1)) = h_2^4\, a_p^{(4)}(u_0)\Big(\frac{\mu_4}{\mu_4-\mu_2^2},\, 0,\, -\frac{\mu_2}{\mu_4-\mu_2^2},\, 0\Big)\begin{pmatrix} \mu_4 \\ 0 \\ \mu_6 \\ 0 \end{pmatrix}(1 + o_P(1)) \]
\[ = \frac{\mu_4^2 - \mu_2\mu_6}{\mu_4-\mu_2^2}\, h_2^4\, a_p^{(4)}(u_0)\,(1 + o_P(1)). \]

By (6.4), we have

\[ E(\tilde J_2\mid\mathcal{D}) = (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\begin{pmatrix} \sum_{j=1}^{p-1}E\{a_j(U_1)-\hat a_{j,0}(U_1)\mid\mathcal{D}\}X_{1j} \\ \vdots \\ \sum_{j=1}^{p-1}E\{a_j(U_n)-\hat a_{j,0}(U_n)\mid\mathcal{D}\}X_{nj} \end{pmatrix} \]
\[ = -\frac{1}{2}h_0^2\,(1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\begin{pmatrix} \sum_{j=1}^{p-1}e_{2j-1,2p}^T S_{(1)}^{-1}\beta_{(1)}X_{1j} \\ \vdots \\ \sum_{j=1}^{p-1}e_{2j-1,2p}^T S_{(n)}^{-1}\beta_{(n)}X_{nj} \end{pmatrix}(1 + o_P(1)) \]
\[ = -\frac{h_0^2}{2r_{pp}}\Big(\frac{\mu_4}{\mu_4-\mu_2^2},\, 0,\, -\frac{\mu_2}{\mu_4-\mu_2^2},\, 0\Big)\begin{pmatrix} \sum_{j=1}^{p-1}e_{2j-1,2p}^T S_{(0)}^{-1}\beta_{(0)}\,r_{pj} \\ 0 \\ \sum_{j=1}^{p-1}e_{2j-1,2p}^T S_{(0)}^{-1}\beta_{(0)}\,r_{pj}\mu_2 \\ 0 \end{pmatrix}(1 + o_P(1)) = -\frac{h_0^2}{2r_{pp}}\sum_{j=1}^{p-1}e_{2j-1,2p}^T S_{(0)}^{-1}\beta_{(0)}\,r_{pj}\,(1 + o_P(1)). \]

Therefore, by (6.5) we obtain

\[ \operatorname{bias}\{\hat a_{p,2}(u_0)\mid\mathcal{D}\} = \Big( -\frac{h_0^2}{2r_{pp}}\sum_{j=1}^{p-1}e_{2j-1,2p}^T S_{(0)}^{-1}\beta_{(0)}\,r_{pj} + \frac{\mu_4^2-\mu_2\mu_6}{4!\,(\mu_4-\mu_2^2)}\,a_p^{(4)}(u_0)\,h_2^4 \Big)(1 + o_P(1)). \]

Since $(\gamma_j^T, r_{pj})^T$ is the $j$-th column of $\Omega_p$, the properties of the Kronecker product give $e_{2j-1,2p}^T S_{(0)}^{-1}\beta_{(0)} = \mu_2\, a_j''(u_0)$, and hence

\[ \operatorname{bias}\{\hat a_{p,2}(u_0)\mid\mathcal{D}\} = \frac{1}{4!}\,\frac{\mu_4^2-\mu_6\mu_2}{\mu_4-\mu_2^2}\,a_p^{(4)}(u_0)\,h_2^4 - \frac{h_0^2\mu_2}{2r_{pp}}\sum_{j=1}^{p-1}a_j''(u_0)\,r_{pj} + o_P(h_2^4 + h_0^2). \]

This proves the bias expression in Theorem 2. We now calculate the asymptotic variance. Recall $\mathbf{B}_n$ defined at the end of Section 3, and write $\mathbf{H} = \mathbf{I}_n - \mathbf{B}_n$. By (3.4), we have

\[ \operatorname{var}\{\hat a_{p,2}(u_0)\mid\mathcal{D}\} = (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\,\Sigma\,\mathbf{W}_2\mathbf{X}_2(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}(1,0,0,0)^T \]
\[ \qquad - 2\,(1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}(1,0,0,0)^T \]
\[ \qquad + (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{H}^T\mathbf{W}_2\mathbf{X}_2(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}(1,0,0,0)^T. \tag{6.6} \]

Using similar arguments as before, we can show that

\[ (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\,\Sigma\,\mathbf{W}_2\mathbf{X}_2(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}(1,0,0,0)^T = \frac{\mu_4^2\nu_0 - 2\mu_2\mu_4\nu_2 + \mu_2^2\nu_4}{n h_2 f(u_0)\, r_{pp}\,(\mu_4-\mu_2^2)^2}\,\sigma^2(u_0)\,(1 + o_P(1)). \tag{6.7} \]

Since

\[ \mathbf{H}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2 = \begin{pmatrix} \sum_{j=1}^{p-1} X_{1j}\, e_{2j-1,2p}^T(\mathbf{X}_{(1)}^T\mathbf{W}_{(1)}\mathbf{X}_{(1)})^{-1}\mathbf{X}_{(1)}^T\mathbf{W}_{(1)}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2 \\ \vdots \\ \sum_{j=1}^{p-1} X_{nj}\, e_{2j-1,2p}^T(\mathbf{X}_{(n)}^T\mathbf{W}_{(n)}\mathbf{X}_{(n)})^{-1}\mathbf{X}_{(n)}^T\mathbf{W}_{(n)}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2 \end{pmatrix}, \]

by Lemma 1 we have

\[ \mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2 = n f(U_i)\,\sigma^2(U_i)\, K_{h_2}(U_i - u_0)\, G\, T_{2p\times 4,(i)}\, D_2\,(1 + o_P(1)), \]

where $T_{2p\times 4,(i)} = (\tilde u_{k,l,(i)})_{2p\times 4}$, $1 \le k \le 2p$, $0 \le l \le 3$, with, for $k = 1, \ldots, p$,

\[ \tilde u_{2k-1,l,(i)} = r_{kp}(U_i)\prod_{m=1}^{l}\Big\{\frac{U_i-u_0}{h_2} + o_P(1)\Big\}, \qquad \tilde u_{2k,l,(i)} = o_P(1)\prod_{m=1}^{l}\Big\{\frac{U_i-u_0}{h_2} + o_P(1)\Big\}, \]

with the empty product ($l = 0$) taken to be 1. Thus, we obtain

\[ \mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2 = \frac{n f(u_0)\,\sigma^2(u_0)}{h_2}\sum_{j=1}^{p-1} e_{2j-1,2p}^T S_{(0)}^{-1}\beta\; r_{pj}\; D_2\begin{pmatrix} \nu_0 & 0 & \nu_2 & 0 \\ 0 & \nu_2 & 0 & \nu_4 \\ \nu_2 & 0 & \nu_4 & 0 \\ 0 & \nu_4 & 0 & \nu_6 \end{pmatrix}D_2\,(1 + o_P(1)). \]

This and (6.1) together yield

\[ (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{W}_2\mathbf{X}_2(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}(1,0,0,0)^T = \frac{\mu_4^2\nu_0 - 2\mu_2\mu_4\nu_2 + \mu_2^2\nu_4}{n h_2 f(u_0)\, r_{pp}^2\,(\mu_4-\mu_2^2)^2}\sum_{j=1}^{p-1} e_{2j-1,2p}^T S_{(0)}^{-1}\beta\; r_{pj}\;\sigma^2(u_0)\,(1 + o_P(1)). \tag{6.8} \]

Let

\[ X_{k(i)} = \big(X_{k1},\, X_{k1}(U_k - U_i),\, \ldots,\, X_{kp},\, X_{kp}(U_k - U_i)\big)^T \qquad \text{and} \qquad \mathbf{X}_{(i)}^T = (X_{1(i)}, X_{2(i)}, \ldots, X_{n(i)}). \]

Then we have

\[ \mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{H}^T\mathbf{W}_2\mathbf{X}_2 = (v_{rs})_{4\times 4}, \qquad 0 \le r, s \le 3, \]

where

\[ v_{rs} = \sum_{i=1}^{n}\sum_{l=1}^{n}\Big[ X_{ip}X_{lp}(U_i-u_0)^r(U_l-u_0)^s K_{h_2}(U_i-u_0)K_{h_2}(U_l-u_0)\Big\{\sum_{j=1}^{p-1}X_{ij}\,e_{2j-1,2p}^T(\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\Big\}\,\Sigma\,\Big\{\sum_{m=1}^{p-1}X_{lm}\,e_{2m-1,2p}^T(\mathbf{X}_{(l)}^T\mathbf{W}_{(l)}\mathbf{X}_{(l)})^{-1}\mathbf{X}_{(l)}^T\mathbf{W}_{(l)}\Big\}^T\Big] \]
\[ = \sum_{j=1}^{p-1}\sum_{m=1}^{p-1}\sum_{k=1}^{n}\Big\{\sum_{i=1}^{n}X_{ip}X_{ij}(U_i-u_0)^r\,e_{2j-1,2p}^T(\mathbf{X}_{(i)}^T\mathbf{W}_{(i)}\mathbf{X}_{(i)})^{-1}X_{k(i)}\,K_{h_2}(U_i-u_0)K_{h_0}(U_k-U_i)\,\sigma^2(U_k)\Big\} \]
\[ \qquad\times\Big\{\sum_{l=1}^{n}X_{lp}X_{lm}(U_l-u_0)^s\,X_{k(l)}^T(\mathbf{X}_{(l)}^T\mathbf{W}_{(l)}\mathbf{X}_{(l)})^{-1}e_{2m-1,2p}\,K_{h_2}(U_l-u_0)K_{h_0}(U_k-U_l)\Big\}. \]

Using Lemma 1 and a tedious calculation, we obtain

\[ \mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{H}^T\mathbf{W}_2\mathbf{X}_2 = \frac{n f(u_0)\,\sigma^2(u_0)}{h_2}\sum_{j=1}^{p-1}\sum_{m=1}^{p-1} r_{pj}\,r_{pm}\; e_{2j-1,2p}^T S_{(0)}^{-1} Q S_{(0)}^{-1} e_{2m-1,2p}\; D_2\begin{pmatrix} \nu_0 & 0 & \nu_2 & 0 \\ 0 & \nu_2 & 0 & \nu_4 \\ \nu_2 & 0 & \nu_4 & 0 \\ 0 & \nu_4 & 0 & \nu_6 \end{pmatrix}D_2\,(1 + o_P(1)). \]

Combining this with (6.1) gives

\[ (1,0,0,0)(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}\mathbf{X}_2^T\mathbf{W}_2\mathbf{H}\,\Sigma\,\mathbf{H}^T\mathbf{W}_2\mathbf{X}_2(\mathbf{X}_2^T\mathbf{W}_2\mathbf{X}_2)^{-1}(1,0,0,0)^T \]
\[ = \frac{\mu_4^2\nu_0 - 2\mu_2\mu_4\nu_2 + \mu_2^2\nu_4}{n h_2 f(u_0)\, r_{pp}^2\,(\mu_4-\mu_2^2)^2}\sum_{j=1}^{p-1}\sum_{m=1}^{p-1} r_{pj}\,r_{pm}\; e_{2j-1,2p}^T S_{(0)}^{-1} Q S_{(0)}^{-1} e_{2m-1,2p}\;\sigma^2(u_0)\,(1 + o_P(1)). \tag{6.9} \]

Substituting (6.7)-(6.9) into (6.6), we have

\[ \operatorname{var}\{\hat a_{p,2}(u_0)\mid\mathcal{D}\} = \frac{\mu_4^2\nu_0 - 2\mu_4\mu_2\nu_2 + \mu_2^2\nu_4}{n h_2\,(\mu_4-\mu_2^2)^2 f(u_0)\, r_{pp}^2}\Big( r_{pp} + \sum_{j=1}^{p-1}\sum_{m=1}^{p-1} r_{pj}\,r_{pm}\, e_{2j-1,2p}^T S_{(0)}^{-1} Q S_{(0)}^{-1} e_{2m-1,2p} - 2\sum_{j=1}^{p-1} r_{pj}\, e_{2j-1,2p}^T S_{(0)}^{-1}\beta \Big)\,\sigma^2(u_0)\,(1 + o_P(1)). \]

Using the properties of the Kronecker product, we get

\[ \operatorname{var}\{\hat a_{p,2}(u_0)\mid\mathcal{D}\} = \frac{(\mu_4^2\nu_0 - 2\mu_4\mu_2\nu_2 + \mu_2^2\nu_4)\,\sigma^2(u_0)}{n h_2\,(\mu_4-\mu_2^2)^2 f(u_0)\, r_{pp}^2}\Big( r_{pp} + r_{pp}^2\, e_{p,p}^T\Omega_p^{-1}e_{p,p} - (\gamma_p^T, r_{pp})\,\Omega_p^{-1}\begin{pmatrix}\gamma_p \\ r_{pp}\end{pmatrix}\Big)(1 + o_P(1)). \]

Note that $\Omega_p^{-1}(\gamma_p^T, r_{pp})^T = e_{p,p}$, so the factor in parentheses equals $r_{pp}^2\, e_{p,p}^T\Omega_p^{-1}e_{p,p}$. This yields Theorem 2.

Acknowledgement: The authors thank an associate editor for helpful comments that led to an improvement in the presentation of the paper.

References

Breiman, L. and Friedman, J.H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc., 80, 580-619.

Carroll, R.J., Fan, J., Gijbels, I. and Wand, M.P. (1997). Generalized partially linear single-index models. J. Amer. Statist. Assoc., 92, 477-489.

Chen, R. and Tsay, R.S. (1993). Functional-coefficient autoregressive models. J. Amer. Statist. Assoc., 88, 298-308.

Cleveland, W.S., Grosse, E. and Shyu, W.M. (1991). Local regression models. In Statistical Models in S (Chambers, J.M. and Hastie, T.J., eds), 309-376. Wadsworth & Brooks, Pacific Grove.

Friedman, J.H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist., 19, 1-141.

Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Royal Statist. Soc. B, 57, 371-394.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Fan, J., Hardle, W. and Mammen, E. (1998). Direct estimation of additive and linear components for high dimensional data. Ann. Statist., 26, 943-971.

Fan, J. and Zhang, J. (1998). Functional linear models for longitudinal data. Submitted.

Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: a Roughness Penalty Approach. Chapman and Hall, London.

Gu, C. and Wahba, G. (1993). Smoothing spline ANOVA with component-wise Bayesian "confidence intervals". J. Comput. Graph. Statist., 2, 97-117.

Hardle, W. and Stoker, T.M. (1989). Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc., 84, 986-995.

Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London.

Hastie, T.J. and Tibshirani, R.J. (1993). Varying-coefficient models. J. Royal Statist. Soc. B, 55, 757-796.

Heckman, J., Ichimura, H., Smith, J. and Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica, 66, 1017-1098.

Hoover, D.R., Rice, J.A., Wu, C.O. and Yang, L.P. (1997). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85, 809-822.

Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc., 86, 316-342.

Mack, Y.P. and Silverman, B.W. (1982). Weak and strong uniform consistency of kernel regression estimates. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 61, 405-415.

Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J. Amer. Statist. Assoc., 92, 1049-1062.

Ruppert, D., Sheather, S.J. and Wand, M.P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc., 90, 1257-1270.

Shumway, R.H. (1988). Applied Statistical Time Series Analysis. Prentice-Hall.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Royal Statist. Soc. B, 36, 111-147.

Stone, C.J., Hansen, M., Kooperberg, C. and Truong, Y.K. (1997). Polynomial splines and their tensor products in extended linear modeling. Ann. Statist., 25, 1371-1470.

Wahba, G. (1984). Partial spline models for semiparametric estimation of functions of several variables. In Statistical Analysis of Time Series, Proceedings of the Japan-U.S. Joint Seminar, Tokyo, 319-329. Institute of Statistical Mathematics, Tokyo.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
