EFFICIENT ESTIMATION OF GENERALIZED ADDITIVE NONPARAMETRIC REGRESSION MODELS

Oliver B. Linton
The Department of Economics, Yale University, New Haven, CT 06520, U.S.A.

September 18, 1998

Abstract

We define new procedures for estimating generalized additive nonparametric regression models that are more efficient than the Linton and Härdle (1996) integration-based method and achieve certain oracle bounds. We consider criterion functions based on the Linear Exponential Family, which includes many important special cases. We also consider the extension to multiple parameter models like the Gamma distribution and to models for conditional heteroskedasticity.

Some key words: Additive Regression Models; ARCH Models; Dimensionality Reduction; Generalized Additive Models; Kernel Estimation; Nonparametric Regression.

Address correspondence to: Oliver Linton, Cowles Foundation for Research in Economics, Yale University, Yale Station Box 208281, New Haven, CT 06520-8281, USA. Phone: (203) 432-3699. Fax: (203) 432-6167. E-mail address: [email protected]; http://www.econ.yale.edu/~linton.


1 Introduction

Additive models are widely used in both theoretical economics and econometric data analysis. The standard text of Deaton and Muellbauer (1980) provides many microeconomic examples in which a separable structure is convenient for analysis and important for interpretability. There has been much recent theoretical and applied work in econometrics on semiparametric and nonparametric methods, see Härdle and Linton (1994) and Powell (1994) for bibliography and discussion; in such models additivity often has important implications for the rate at which the components can be estimated. Let $(X, Y)$ be a random variable with $X$ of dimension $d$ and $Y$ a scalar. Consider the estimation of the regression function $m(x) = E(Y \mid X = x)$ based on a random sample $\{(X_i, Y_i)\}_{i=1}^{n}$ from this population. Stone (1980, 1982) and Ibragimov and Hasminskii (1980) showed that the optimal rate for estimating $m$ is $n^{-\ell/(2\ell+d)}$, with $\ell$ an index of smoothness of $m$. An additive structure for $m$ is a regression function of the form

$$m(x) = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha), \qquad x = (x_1,\ldots,x_d)', \qquad\qquad (1)$$

where $x = (x_1,\ldots,x_d)'$ are the $d$-dimensional predictor variables and $m_\alpha$ are one-dimensional nonparametric functions operating on each element of the vector of predictor variables, with $E\{m_\alpha(X_\alpha)\} = 0$. Stone (1986) showed that for such regression curves the optimal rate for estimating $m$ is the one-dimensional rate of convergence $n^{-\ell/(2\ell+1)}$, and does not deteriorate with the dimension. In practice, the backfitting procedures proposed in Breiman and Friedman (1985) and Buja, Hastie and Tibshirani (1989) are widely used to estimate the additive components. Buja, Hastie and Tibshirani (1989, equation (18)) consider the problem of finding the projection of $m$ onto the space of additive functions representing the right hand side of (1). Replacing population by sample, this leads to a system of normal equations of dimension $nd \times nd$. To solve this in practice, the backfitting, or Gauss-Seidel, algorithm is usually used, see Hastie and Tibshirani (1990, p91) and Venables and Ripley (1994, pp251-255). This technique is iterative and depends on the starting values and convergence criterion. These methods have been evaluated on numerous datasets and have been refined quite considerably since their introduction.

Recently, Linton and Nielsen (1995), Tjøstheim and Auestad (1994), and Newey (1994) have independently proposed an alternative procedure for estimating $m_\alpha$, which we call integration, that exploits the following idea. Suppose that $m(x_1, x_2)$ is any bivariate function, and consider the quantities $\alpha_1(x_1) = \int m(x_1, x_2)\,dP_2(x_2)$ and $\alpha_2(x_2) = \int m(x_1, x_2)\,dP_1(x_1)$, where $P_1$ and $P_2$ are probability measures. If $m(x_1, x_2) = m_1(x_1) + m_2(x_2)$, then $\alpha_1(\cdot)$ and $\alpha_2(\cdot)$ are $m_1(\cdot)$ and $m_2(\cdot)$, respectively, up to a constant. In practice one replaces $m$ by an estimate and integrates with respect to some known measure. The procedure is explicitly defined and its asymptotic distribution is easily derived: centered correctly, it converges to a normal distribution at the one-dimensional rate; the faster rate arises because integration is averaging and hence reduces variance. The estimation procedure has been extended to a number of other contexts like the generalized additive model [Linton and Härdle (1996)], to dependent variable transformation models [Linton, Chen, Wang, and Härdle (1997)], to econometric time series models [Yang and Härdle (1997)], to panel data models [Porter (1996)], and to hazard models with time varying covariates and right censoring [Nielsen and Linton (1997)]. Gozalo and Linton (1997) develop tests of additivity. In this wide variety of sampling schemes and models, asymptotics for integration-based procedures have been derived because of the explicit form of the estimator. However, the integration method does not fully exploit the additive structure in (2) and is inefficient. Linton (1997) proposed a two-step procedure that took the integration estimate as a first step and then did one backfitting iteration from that. This procedure was argued to be oracle efficient, that is, as efficient as the infeasible estimate that is based on knowing all components but the one of interest. The theoretical analysis of backfitting-like methods has only just begun and is thus far confined to regression. Opsomer and Ruppert (1997) provided conditional mean squared error expressions for bivariate i.i.d. data under strong conditions, while Linton, Mammen, and Nielsen (1997) established a central limit theorem for a modified form of backfitting called empirical projection.

A generalized additive structure for $m$ is of the form

$$G\{m(x)\} = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha) \qquad\qquad (2)$$

for some known, typically monotonic, link function $G$, where $x = (x_1,\ldots,x_d)^T$ are the $d$-dimensional predictor variables and $m_\alpha$ are one-dimensional nonparametric functions operating on each element of the vector of predictor variables. Here, $E\{m_\alpha(X_\alpha)\} = 0$ for identification. This class of models includes additive regression when $G$ is the identity and multiplicative regression when $G$ is the logarithm. For binary data it is appropriate to take $G$ to be the inverse of a cumulative distribution function like the normal or logit [this ensures that the regression function lies between 0 and 1 no matter what values $c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha)$ takes]. Compare this specification with the semiparametric single index model considered in Ichimura (1993) in which the index on the right hand side of (2)

is linear, but the link function $G(\cdot)$ is unrestricted [apart from the fact that it is the inverse of a c.d.f.]. Both models considerably weaken the restrictions imposed by parametric binary choice models, but are non-nested. One advantage of the additive model is that it allows for more general elasticity patterns: specifically, while in the single index model the elasticity ratio $\eta_{j:k} = (\partial \ln m/\partial x_j)/(\partial \ln m/\partial x_k)$ is restricted to be constant with respect to $x$, for the additive model $\eta_{j:k}$ can vary with $x_j$ and $x_k$ [although not with the other $x$'s]. Note that (2) is a partial model specification, and we have not restricted in any way the variance or other aspects of the conditional distribution $\mathcal{L}(Y|X)$ of $Y$ given $X$. A full model specification, widely used in this context, is to assume that $\mathcal{L}(Y|X)$ belongs to an exponential family with known link function $G$ and mean $m$. This class of models was called Generalized Additive by Hastie and Tibshirani (1990). In some respects, econometricians would prefer the partial model specification in which we keep (2), but do not restrict ourselves to the exponential family. This flexibility is a relevant consideration for many datasets where there is over-dispersion or heterogeneity.

Turning to estimation, Stone (1986) showed that for such models the optimal rate for estimating $m$ [and $m_\alpha$], based on a random sample $\{(Y_i, X_i)\}_{i=1}^{n}$ from this population, is the one-dimensional rate of convergence $n^{-\ell/(2\ell+1)}$, to be compared with the best possible rate of $n^{-\ell/(2\ell+d)}$ when $m$ is not so restricted. In practice, backfitting procedures in conjunction with Fisher scoring are widely used to estimate Generalized Additive Models, see Hastie and Tibshirani (1990, p141). Linton and Härdle (1996) recently proposed an alternative direct method for estimating the components by integrating a transformed pilot regression smoother. They provided sufficient conditions for a central limit theorem at the optimal one-dimensional rate. Nevertheless, this estimator is inefficient for the reasons given above.

In this paper, we suggest two-step procedures for estimating $m_\alpha(\cdot)$ in (2) that are more efficient than the integration method, thus extending the recent work of Linton (1997) in regression. We also provide more rigorous proofs of the claims made in that work. We base our procedures on a localized version of the likelihood function of Linear Exponential Families, see Gourieroux, Monfort, and Trognon (1984a,b). This family includes what we are calling the partial model specification as a special case that corresponds to the homoskedastic normal likelihood function. Our estimators are nonlinear and their asymptotics do not follow immediately from standard arguments for kernel estimators. Our proofs are based on a modification of some recent results of Gozalo and Linton (1995). For expositional purposes we shall work with the special case where we expect the `one-dimensional' rate of convergence $n^{2/5}$ for the additive estimates. The paper is organized as follows.

In section 2 we discuss infeasible oracle procedures for estimating one component that use knowledge of the other components. In particular, we introduce a criterion function based on the linear exponential family density. We discuss feasible procedures and standard error construction. In section 3 we discuss the extension to a model in which additive components enter into the local parameters of a general moment condition. We estimate the unknown functions using a local Generalized Method of Moments (GMM) and a local partial GMM criterion function. Our examples include the Binomial and Poisson models as well as models for conditional heteroskedasticity, known in time series as ARCH.

The symbol $\to_p$ denotes convergence in probability, while $\Rightarrow$ denotes convergence in distribution. For a random sequence $X_n$ and deterministic decreasing sequences $a_n, b_n$ we write $X_n \stackrel{AD}{=} N(a_n, b_n^2)$ whenever $(X_n - a_n)/b_n \Rightarrow N(0,1)$.
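To fix ideas, here is a minimal numerical sketch of the integration idea described above, applied to a generalized additive model with a logit link: estimate the unrestricted regression by a pilot smoother, transform it by $G$, and average over the empirical distribution of the direction not of interest. The helper names, bandwidth, and simulated design are illustrative assumptions, not the Linton and Härdle (1996) procedure in full detail.

```python
import numpy as np

def nw_smoother(x0, X, Y, h):
    """Nadaraya-Watson estimate of E(Y | X = x0) with a Gaussian product kernel."""
    w = np.exp(-0.5 * np.sum(((X - x0) / h) ** 2, axis=1))
    return np.sum(w * Y) / np.sum(w)

def integration_estimate(x1_grid, X, Y, G, h):
    """Marginal integration estimate of m_1 (up to a constant):
    average G(mhat(x1, X_2i)) over the empirical distribution of X_2."""
    n = len(Y)
    out = np.empty(len(x1_grid))
    for j, x1 in enumerate(x1_grid):
        vals = [G(nw_smoother(np.array([x1, X[i, 1]]), X, Y, h)) for i in range(n)]
        out[j] = np.mean(vals)
    return out

# Illustrative simulated data: logit link, two additive components.
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
eta = 0.5 + np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 - 1 / 3   # c + m1 + m2
p = 1 / (1 + np.exp(-eta))                                    # F = G^{-1} is logistic
Y = rng.binomial(1, p)

G = lambda m: np.log(m / (1 - m))       # logit link
x1_grid = np.linspace(-0.9, 0.9, 19)
m1_tilde = integration_estimate(x1_grid, X, Y, G, h=0.25)
m1_tilde -= m1_tilde.mean()             # recentre so the component averages to zero
```

The resulting `m1_tilde` is the kind of explicitly defined, one-dimensional-rate pilot estimate that the two-step procedures of this paper then improve upon.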

2 Single Parameter Linear Exponential Family

2.1 Infeasible Procedures

We partition $X = (X_1, X_2)$ and $x = (x_1, x_2)$, where $x_1$ and $X_1$ are scalar, while $x_2$ and $X_2$ are in general of dimension $d-1$. Let $p_1$ be the marginal density of $X_1$, and let $p_2$ and $p$ be the densities of $X_2$ and $X$ respectively. Throughout, $m_2(\cdot)$ is an abbreviation for all the other components, i.e., $m_2(x_2) = \sum_{\alpha=2}^{d} m_\alpha(x_\alpha)$, and can be of any dimension. Let $\sigma^2(x) = \mathrm{var}(Y|X=x)$.

Our purpose here is to define a standard by which to measure estimators of the components. The notion of efficiency in nonparametric models is not as clear and well understood as it is in parametric models. In particular, pointwise mean squared error comparisons do not provide a simple ranking between estimators like kernel, splines, and nearest neighbors. While minimax efficiencies can in principle serve this purpose, they are hard to work with and even harder to justify. Our approach is to measure our procedures against a given infeasible (oracle) procedure for estimating $m_1(x_1)$ based on knowledge of $c$ and $m_2(\cdot)$. Linton (1997) already defined the oracle estimator when $G(\cdot)$ is the identity function, i.e., when we are in the additive regression setting (1). In this case, one smooths the partial errors $Y_i - c - m_2(X_{2i})$ on the direction of interest $X_{1i}$. He showed that indeed the oracle estimate has mean squared error smaller than the comparable integration-type estimator. In the general case though, one cannot find simple transformations of $Y_i$ and $c + m_2(X_{2i})$ to which one can apply one-dimensional smoothing and that result in a more efficient procedure than the integration-type estimators. In sum, it was not immediately clear to us how to even define oracle efficiency in these nonlinear models. We suggest the following solution: impose our knowledge about $c + m_2(X_{2i})$ inside of a suitable criterion function.

We shall work with a criterion function motivated by the likelihood function of a complete specification of the conditional distribution of $Y|X$ along with the additivity restriction (2). In particular, we consider one-parameter linear exponential families, described in Gourieroux, Monfort, and Trognon (1984a), applied to the conditional distribution of $Y$ given $X = x$. Every member of the family has a density with respect to some fixed measure and this density function can be written as

$$\ell(y; m) = \exp\{A(m) + B(y) + C(m)y\}, \qquad\qquad (3)$$

where $A(\cdot)$, $B(\cdot)$, and $C(\cdot)$ are known functions, with $m$ being the mean of the distribution whose density is $\ell(y; m)$. The scalar $m \in \mathcal{M}$, a suitable parameter space. See Gourieroux, Monfort, and Trognon (1984a,b) for parametric theory and applications in economics. The above likelihood function leads us to suggest the following class of criterion functions:





$$Q_n(\theta) = \frac{1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\{Y_i C_i(\theta) + A_i(\theta)\}, \qquad\qquad (4)$$

where $C_i(\theta) = C(F(c + m_2(X_{2i}) + \theta_0 + \theta_1(X_{1i}-x_1)))$ and $A_i(\theta) = A(F(c + m_2(X_{2i}) + \theta_0 + \theta_1(X_{1i}-x_1)))$ with $F = G^{-1}$, while $\theta = (\theta_0, \theta_1)$. Here, $h_n$ is a scalar bandwidth sequence and $K$ is a kernel function. Let $\hat\theta$ maximize $Q_n(\theta)$, and let $\hat m_1(x_1) = \hat\theta_0(x_1)$ be our infeasible estimate of $m_1(x_1)$. We have the following result.
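As an illustration of the oracle criterion (4), the sketch below maximizes a local linear Poisson version numerically: for the Poisson member of the family (3) one may take $A(m) = -m$, $B(y) = -\log y!$, and $C(m) = \log m$, with log link $G = \log$ so that $F = \exp$. The simulated data, helper names, and the use of a generic numerical optimizer are illustrative assumptions; the point is only that $\hat\theta_0(x_1)$ is obtained by maximizing (4) with $c + m_2(X_{2i})$ treated as known.

```python
import numpy as np
from scipy.optimize import minimize

def oracle_fit(x1, X1, Y, offset, h):
    """Local linear Poisson criterion (4): offset = c + m_2(X_{2i}) is known to the oracle.
    Returns (theta0, theta1); theta0 estimates m_1(x1), theta1 estimates m_1'(x1)."""
    K = np.exp(-0.5 * ((X1 - x1) / h) ** 2)          # Gaussian kernel weights

    def neg_Qn(theta):
        eta = offset + theta[0] + theta[1] * (X1 - x1)   # local linear index
        mu = np.exp(eta)                                  # F = G^{-1} = exp
        # Poisson LEF terms: Y*C(mu) + A(mu) = Y*log(mu) - mu = Y*eta - exp(eta)
        return -np.sum(K * (Y * eta - mu)) / (len(Y) * h)

    res = minimize(neg_Qn, x0=np.zeros(2), method="BFGS")
    return res.x

# Illustrative data with additive structure on the log scale.
rng = np.random.default_rng(1)
n = 400
X1 = rng.uniform(-1, 1, n)
X2 = rng.uniform(-1, 1, n)
c, m1, m2 = 0.5, np.sin(np.pi * X1), X2 ** 2 - 1 / 3
Y = rng.poisson(np.exp(c + m1 + m2))

theta0, theta1 = oracle_fit(0.3, X1, Y, offset=c + (X2 ** 2 - 1 / 3), h=0.2)
# theta0 approximates m_1(0.3) = sin(0.3*pi); theta1 approximates m_1'(0.3).
```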

Theorem 1. Suppose that (2) holds. Then, under the regularity conditions A given in the appendix, we have

$$\hat m_1(x_1) - m_1(x_1) \stackrel{AD}{=} N\!\left(\frac{h_n^2}{2}\,\mu_2(K)\,m_1''(x_1),\ \frac{1}{nh_n}\,\|K\|_2^2\,\frac{i_1(x_1)}{j_1^2(x_1)}\right),$$

where $\|K\|_2^2 = \int K^2(s)\,ds$ and $\mu_2(K) = \int K(s)s^2\,ds$, while

$$i_1(x_1) = \int C'(m(x))^2\,F'(G(m(x)))^2\,\sigma^2(x)\,p(x)\,dx_2,$$
$$j_1(x_1) = \int C'(m(x))\,F'(G(m(x)))^2\,p(x)\,dx_2.$$

We call $\hat m_1(x_1)$ an oracle estimator because its definition uses knowledge that only an oracle could have. A variety of smoothing paradigms could have been chosen here, and each will result in an `oracle' estimate. We have chosen local linear kernel weighting with a constant bandwidth because the local constant version, which does not include the slope parameter $\theta_1$ and is slightly easier computationally, will result in `bad bias' behaviour, see Fan (1992) for a discussion of the merits of local linear estimation. Higher order polynomials than linear can be used and will result in faster rates of convergence under appropriate smoothness conditions.

Remarks.

1. When (3) is true, we have $C'(m(x)) = 1/\sigma^2(x)$ by Property 3 of Gourieroux, Monfort, and Trognon (1984a). In this case, $j_1(x_1)$ is proportional to $i_1(x_1)$ and one obtains the simpler asymptotic variance proportional to

$$V_E = \frac{1}{\int \sigma^{-2}(x)\,F'(G(m(x)))^2\,p(x)\,dx_2}.$$

The integration procedure of Linton and Härdle (1996) has asymptotic variance proportional to

$$V_{LH} = \int G'\{m(x)\}^2\,\sigma^2(x)\,\frac{p_2^2(x_2)}{p(x)}\,dx_2.$$

Since $G' = 1/F'$, we have, applying the Cauchy-Schwarz inequality, that $V_E \le V_{LH}$, and the oracle estimator has lower variance than the integration estimator. When (3) is not completely true, i.e., when the variance is misspecified, it is not possible to (uniformly) rank the two estimators unless the form of heteroskedasticity is restricted in some way, see the next section.
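One way to verify this inequality (a sketch, using only that $p_2(x_2)$ integrates to one and that $G'(m(x)) = 1/F'(G(m(x)))$) is to apply Cauchy-Schwarz to a suitable factorization:

$$1 = \left(\int p_2(x_2)\,dx_2\right)^2 = \left(\int \frac{F'(G(m(x)))\sqrt{p(x)}}{\sigma(x)}\cdot\frac{\sigma(x)\,p_2(x_2)}{F'(G(m(x)))\sqrt{p(x)}}\,dx_2\right)^2 \le \frac{1}{V_E}\cdot V_{LH},$$

so that $V_E \le V_{LH}$, with equality if and only if $\sigma^2(x)\,p_2(x_2)/\{F'(G(m(x)))^2\,p(x)\}$ is constant in $x_2$.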

2. The bias of $\hat m_1(x_1)$ is what you would expect if $c + m_2(\cdot)$ were known to be exactly zero, and is design adaptive. In the Linton and Härdle procedure there is an additional multiplicative factor to the bias,

$$\int \frac{p_2(x_2)}{F'(G(m(x)))}\,dx_2,$$

which can be either greater or less than one.

3. Note that $\hat m_1(x_1)$ is not guaranteed to satisfy $\int \hat m_1(x_1)p_1(x_1)\,dx_1 = 0$, but the recentred estimate

$$\hat m_1^{\,c}(x_1) = \hat m_1(x_1) - \int \hat m_1(x_1)\,p_1(x_1)\,dx_1$$

does have this property. In fact, the variances of $\hat m_1^{\,c}(x_1)$ and $\hat m_1(x_1)$ are the same to first order, while the bias of $\hat m_1^{\,c}(x_1)$ has $m_1''(x_1)$ replaced by $m_1''(x_1) - \int m_1''(x_1)p_1(x_1)\,dx_1$. According to integrated mean squared error, then, we are better off recentering because

$$\int \left\{m_1''(x_1) - \int m_1''(x_1)p_1(x_1)\,dx_1\right\}^2 p_1(x_1)\,dx_1 \le \int \{m_1''(x_1)\}^2\,p_1(x_1)\,dx_1.$$
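In practice the recentring integral is unknown; a natural plug-in replaces it by a sample average over the observed $X_{1i}$. A minimal sketch (the function name is an assumption for illustration):

```python
import numpy as np

def recentre(m1_hat_values, m1_hat_at_data):
    """Impose the identification E{m_1(X_1)} = 0 on an estimated component:
    subtract its sample average over the observed X_{1i} from the values
    computed on an evaluation grid."""
    return m1_hat_values - np.mean(m1_hat_at_data)
```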

2.2 Feasible Procedures

The previous section established the standard by which we choose to measure our estimators. We now show that one can achieve the oracle efficiency bounds given in Theorem 1 by substituting a suitable pilot estimator of $c + m_2(X_{2i})$ in the criterion function (4). Suppose that $\tilde c + \tilde m_2(X_{2i})$ is some initial consistent estimate, and let





$$\tilde Q_n(\theta) = \frac{1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\{Y_i \tilde C_i(\theta) + \tilde A_i(\theta)\}, \qquad\qquad (5)$$

where $\tilde A_i(\theta) = A(F(\tilde c + \tilde m_2(X_{2i}) + \theta_0 + \theta_1(X_{1i}-x_1)))$ and $\tilde C_i(\theta) = C(F(\tilde c + \tilde m_2(X_{2i}) + \theta_0 + \theta_1(X_{1i}-x_1)))$. Now let $\hat\theta^{\,e}(x_1) = (\hat\theta_0^{\,e}(x_1), \hat\theta_1^{\,e}(x_1))$ maximize $\tilde Q_n(\theta)$, and let $\hat m_1^{\,e}(x_1) = \hat\theta_0^{\,e}(x_1)$ be our feasible estimate of $m_1(x_1)$. Suitable initial estimates are provided by the Linton and Härdle (1996) procedure, which is explicitly defined. Finding $\hat\theta^{\,e}$ still involves solving a nonlinear optimization problem in general; an alternative approach here is to use the linearized two-step estimator

$$\begin{pmatrix} \hat m_1^{*}(x_1) \\ \hat m_1^{*\prime}(x_1) \end{pmatrix} = \hat\theta^{*} = \tilde\theta - \left[\frac{\partial^2 \tilde Q_n(\tilde\theta)}{\partial\theta\,\partial\theta^T}\right]^{-1}\frac{\partial \tilde Q_n(\tilde\theta)}{\partial\theta},$$

where $\tilde\theta$ is the full vector of preliminary estimates. To provide asymptotic results we shall suppose that the initial estimator satisfies a linear expansion. Specifically, we suppose that





$$\tilde c - c + \tilde m_2(X_{2i}) - m_2(X_{2i}) = \sum_{\alpha=2}^{d}\frac{1}{ng_n}\sum_{j=1}^{n} K_\alpha\!\left(\frac{X_{\alpha i} - X_{\alpha j}}{g_n}\right) w_\alpha(X_i, X_j)\,\varepsilon_j + \eta_{ni}, \qquad\qquad (6)$$

where $\varepsilon_j = Y_j - E(Y_j|X_j)$, while $K_\alpha$ is a kernel function, $g_n$ is a bandwidth sequence, and $w_\alpha$ is some fixed function. Here, $\eta_{ni} = o_p(n^{-2/5})$ is a small remainder term, and the expansion (6) is assumed to obey the regularity conditions B given in the appendix. A number of procedures have recently been proposed for estimating components in additive models under a variety of sampling schemes, see, for example, Linton and Nielsen (1995), Linton and Härdle (1996), Yang and Härdle (1997), and Kim, Linton, and Hengartner (1997) amongst others. The expansion (6) can be achieved by all of these methods by undersmoothing under various conditions.¹ One might need to assume stronger smoothness conditions than made in assumption A to achieve this, although recent work by Hengartner (1996) suggests this may not be necessary. We now have the following result.

Theorem 2. Suppose that assumptions A and B given in the appendix hold. Then, under (2), we have

$$n^{2/5}\{\hat m_1^{\,e}(x_1) - \hat m_1(x_1)\},\ \ n^{2/5}\{\hat m_1^{*}(x_1) - \hat m_1(x_1)\} \to_p 0.$$

This says that efficient estimates can be constructed by the two-step procedure and by the linearized two-step estimator; estimation of the nuisance parameter $c + m_2(\cdot)$ has no effect on the limiting distribution. This is not generally the case in parametric estimation problems, unless there is some orthogonality between the estimating equations. In our case, there is an intrinsic local orthogonality that affects smoothing operations. Standard error and bandwidth choice issues can now be addressed via the mean squared error expressions given in Theorem 1, using modifications of standard methods. Thus, under the conditions of Theorem 2 and provided $nh_n^5 \to 0$, the following interval

$$\hat m_1^{\,e}(x_1) \pm z_{\alpha/2}\sqrt{\frac{1}{nh_n}\,\|K\|_2^2\,\frac{\hat i_1(x_1)}{\hat j_1^2(x_1)}}$$

provides $1-\alpha$ coverage of the true function $m_1(x_1)$, where $z_\alpha$ is the critical value from the standard normal distribution, while

$$\hat i_1(x_1) = \frac{1}{n}\sum_{i=1}^{n} C'(\tilde m(x_1, X_{2i}))^2\,F'(G(\tilde m(x_1, X_{2i})))^2\,\tilde\sigma^2(x_1, X_{2i}),$$
$$\hat j_1(x_1) = \frac{1}{n}\sum_{i=1}^{n} C'(\tilde m(x_1, X_{2i}))\,F'(G(\tilde m(x_1, X_{2i})))^2,$$

¹ Note that the expansion (6) contains no bias terms, which can be achieved by undersmoothing or additional bias reduction.


in which $\tilde m(\cdot)$ and $\tilde\sigma^2(\cdot)$ are any uniformly consistent estimates of $m(\cdot)$ and $\sigma^2(\cdot)$; see Härdle and Linton (1994).
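The sketch below strings these pieces together for the Poisson/log-link case used earlier: a pilot estimate of $c + m_2(\cdot)$ (assumed available, e.g. from an integration-type first stage), one Newton step from the pilot as in the linearized two-step estimator, and a plug-in standard error built from $\hat i_1$ and $\hat j_1$. All function names are illustrative assumptions, and the pilot is treated as given rather than constructed in full.

```python
import numpy as np

def two_step_poisson(x1, X1, Y, offset_tilde, theta_tilde, h, z=1.96):
    """One Newton step from the pilot theta_tilde for the local linear Poisson
    criterion (5), plus a plug-in confidence interval based on Theorems 1-2.
    offset_tilde[i] is the pilot estimate of c + m_2(X_{2i})."""
    n = len(Y)
    u = X1 - x1
    K = np.exp(-0.5 * (u / h) ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel, integrates to 1

    # Score and Hessian of (5) at the pilot value (Poisson: mu = exp(eta)).
    eta = offset_tilde + theta_tilde[0] + theta_tilde[1] * u
    mu = np.exp(eta)
    Z = np.column_stack([np.ones(n), u])                   # local linear design
    score = Z.T @ (K * (Y - mu)) / (n * h)
    hess = -(Z * (K * mu)[:, None]).T @ Z / (n * h)
    theta_star = theta_tilde - np.linalg.solve(hess, score)  # linearized two-step

    # Plug-in i1_hat and j1_hat: for the Poisson LEF, C'(m) = 1/m, F'(G(m)) = m,
    # and sigma^2 is estimated by the fitted mean itself.
    m_tilde = np.exp(offset_tilde + theta_star[0])           # \tilde m(x_1, X_{2i})
    i1 = np.mean((1.0 / m_tilde) ** 2 * m_tilde ** 2 * m_tilde)   # reduces to mean(m_tilde)
    j1 = np.mean((1.0 / m_tilde) * m_tilde ** 2)                  # reduces to mean(m_tilde)
    K_norm2 = 1.0 / (2 * np.sqrt(np.pi))                     # ||K||_2^2 for the Gaussian kernel
    se = np.sqrt(K_norm2 * i1 / (n * h * j1 ** 2))
    return theta_star[0], (theta_star[0] - z * se, theta_star[0] + z * se)
```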

3 Multiparameter Extensions

The models we have examined thus far were one-parameter families, as has been the case in most of the literature on additive models; we now consider extensions to multiple parameter families. The quadratic exponential family of Gourieroux, Monfort, and Trognon (1984a) can be analyzed similarly to the above. This would amount to having an additional set of equations that impose additivity on some transformation of the variance. We shall consider a slightly more general situation based on the Generalized Method of Moments, which allows additivity to be imposed on any set of moments. We suppose that there exists a known function $\varphi: \mathbb{R}^{m+d+p} \to \mathbb{R}^{q}$ such that there exists a vector of additive functions $g_0(x) = (g_{10}(x),\ldots,g_{p0}(x))$ with

$$g_{l0}(x) = c_l + \sum_{\alpha=1}^{d} g_{l\alpha}(x_\alpha), \qquad l = 1,\ldots,p,$$

where the $g_{l\alpha}(X_\alpha)$ are mean zero for identification, for which

$$E\left[\varphi\big(U, g_0(X)\big)\,\big|\,X = x\right] = 0, \qquad\qquad (7)$$

where $U = (Y, X)$. We assume that $q > 1$ and that there is a unique solution to (7). This sort of information could arise from an economic model or through partial specification of moments, as happens in ARCH models, see below. It also includes a full likelihood specification as a special case. For example, suppose that $\ell(U, g_0(X))$ is the logarithm of the density function of $Y|X$ in which $g_0(X)$ is a vector of parameters. Then, $g_0(x)$ is the unique quantity that satisfies

$$\frac{\partial}{\partial g}\,E\left[\ell(U, g_0(X))\,\big|\,X = x\right] = 0.$$

This system of equations is of the form (7). This leads naturally to the following estimation scheme. First, estimate $g_0(x)$ by any unrestricted smoothing method: we propose a sort of local GMM. Second, integrate out the directions not of interest to get a preliminary estimate of the univariate effects. Finally, re-estimate the local GMM criterion function, replacing the parameters of the components not of interest by preliminary estimates.

Let $\tilde\theta(x) = (\tilde\theta_1(x),\ldots,\tilde\theta_p(x))$ minimize the following criterion

$$\left\|\frac{1}{nh_n^d}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n}\right)\varphi(U_i,\theta)\right\|_{A_n}^2 \qquad\qquad (8)$$

with respect to $\theta = (\theta_1,\ldots,\theta_p)$, where $U_i = (Y_i, X_i)$, $K$ is a multivariate kernel, while $\|x\|_{A_n} = (x^T A_n x)^{1/2}$ for some sequence of positive definite matrices $A_n \to_p A$, and let $\tilde g(x) = \tilde\theta(x)$. We are using a local constant approach here for simplicity. The asymptotic properties of this procedure can be derived using an extension of Gozalo and Linton (1995); we expect that $\tilde g(x)$ is asymptotically normal with pointwise mean squared error rate of $n^{-4/(4+d)}$ and indeed has an expansion like (6). To obtain estimates of the component functions, we simply integrate this pilot procedure as follows, letting, for example,

$$\tilde g_{l1}(x_1) = \int \tilde g_l(x)\,p_2(x_2)\,dx_2, \qquad l = 1,\ldots,p, \qquad\qquad (9)$$

and the other components similarly.² To estimate $c_l$ we can use $\tilde c_l = \int \tilde g_{l1}(x_1)\,p_1(x_1)\,dx_1$. Thus, the $\tilde g_{lj}(\cdot)$ are feasible preliminary estimates of the $g_{lj}(\cdot)$. To achieve efficiency, we must modify this procedure to impose additivity. We first describe the oracle estimate. Let $\hat\theta = (\hat\theta_0, \hat\theta_1) = (\hat\theta_{01},\ldots,\hat\theta_{0p},\hat\theta_{11},\ldots,\hat\theta_{1p})$ minimize the partial GMM criterion

$$G_n(\theta) = \left\|\frac{1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\varphi\left[U_i,\ c + \theta_0 + \theta_1\cdot(X_{1i}-x_1) + g_2(X_{2i})\right]\right\|_{A_n}^2$$

with respect to $\theta_0 = (\theta_{01},\ldots,\theta_{0p})$ and $\theta_1 = (\theta_{11},\ldots,\theta_{1p})$, where the vectors $g_2(\cdot) = (g_{12}(\cdot),\ldots,g_{p2}(\cdot))$ and $c = (c_1,\ldots,c_p)$ are assumed known, and let $\hat g_1(x_1) = (\hat g_{11}(x_1),\ldots,\hat g_{p1}(x_1)) = \hat\theta_0(x_1)$. Finally, the feasible version of this replaces $g_2(\cdot)$ and $c$ by a vector of preliminary estimates provided by the integration principle, i.e., we let $\hat\theta^{\,e} = (\hat\theta_0^{\,e}, \hat\theta_1^{\,e}) = (\hat\theta_{01}^{\,e},\ldots,\hat\theta_{0p}^{\,e},\hat\theta_{11}^{\,e},\ldots,\hat\theta_{1p}^{\,e})$ minimize

$$\tilde G_n(\theta) = \left\|\frac{1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\varphi\left[U_i,\ \tilde c + \theta_0 + \theta_1\cdot(X_{1i}-x_1) + \tilde g_2(X_{2i})\right]\right\|_{A_n}^2$$

with respect to $\theta = (\theta_0, \theta_1)$, where $\tilde c$ and $\tilde g_2(X_{2i})$ are obtained from (8)-(9), and let $\hat g_1^{\,e}(x_1) \equiv (\hat g_{11}^{\,e}(x_1),\ldots,\hat g_{p1}^{\,e}(x_1)) = \hat\theta_0^{\,e}(x_1)$.
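A minimal numerical sketch of the second, `partial' step for $p = q = 2$ moments is given below; it evaluates the criterion at a single point $x_1$, taking the preliminary $\tilde c + \tilde g_2(X_{2i})$ as given. The moment function, weighting matrix, and optimizer call are illustrative assumptions rather than a full implementation.

```python
import numpy as np
from scipy.optimize import minimize

def partial_local_gmm(x1, X1, U, offset_tilde, phi, A_n, h):
    """Minimize the partial local GMM criterion at x1.
    offset_tilde: (n, p) array holding tilde_c + tilde_g2(X_{2i}) row by row.
    phi(U_i, t): returns the q-vector of moments evaluated at parameter vector t."""
    n, p = offset_tilde.shape
    K = np.exp(-0.5 * ((X1 - x1) / h) ** 2)

    def crit(theta):
        theta0, theta1 = theta[:p], theta[p:]
        gbar = np.zeros(A_n.shape[0])
        for i in range(n):
            t_i = offset_tilde[i] + theta0 + theta1 * (X1[i] - x1)
            gbar += K[i] * phi(U[i], t_i)
        gbar /= n * h
        return gbar @ A_n @ gbar          # ||.||_{A_n}^2

    res = minimize(crit, x0=np.zeros(2 * p), method="Nelder-Mead")
    return res.x[:p]                       # hat g_1(x1), one entry per component l

# Example moment function in the spirit of the mean/variance model of Section 3.2:
# phi_1 = Y - F_m(theta_m), phi_2 = Y^2 - F_m(theta_m)^2 - F_s(theta_s),
# with illustrative links F_m = tanh and F_s = exp.
def phi_mean_var(U_i, t, F_m=np.tanh, F_s=np.exp):
    y = U_i[0]
    return np.array([y - F_m(t[0]), y ** 2 - F_m(t[0]) ** 2 - F_s(t[1])])
```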

² A computationally efficient estimate of $g_{l1}(x_1)$ can be constructed by generalizing Kim, Linton, and Hengartner (1997) as follows. Let

$$\bar g_{l1}(x_1) = \frac{1}{n}\sum_{i=1}^{n} K_h(x_1 - X_{1i})\,\tilde g_l(X_i)\,\frac{\tilde p_2(X_{2i})}{\tilde p(X_i)},$$

where $\tilde p_2$ and $\tilde p$ are kernel estimates of $p_2$ and $p$ respectively.


3.1 Asymptotics

Define the following $q \times p$ and $q \times q$ matrices

$$\Gamma(x,t) = E\left[\frac{\partial\varphi(U,t)}{\partial t}\,\Big|\,X = x\right], \qquad R(x,t) = E\left[\varphi(U,t)\,\varphi^T(U,t)\,\Big|\,X = x\right],$$

and let $\Gamma_1 = \Gamma_1(x_1) = \int \Gamma(x, g_0(x))\,p(x)\,dx_2$ and $R_1 = R_1(x_1) = \int R(x, g_0(x))\,p(x)\,dx_2$. Furthermore, suppose that each of the preliminary estimators described in (8)-(9) satisfies a linear expansion like (6). We have the following result.

Theorem 3. Under the regularity conditions A$'$ and B$'$ given in the appendix, we have under

the specification (3), that $n^{2/5}[\hat g_1^{\,e}(x_1) - \hat g_1(x_1)] = o_p(1)$ and that

$$\hat g_1(x_1) - g_1(x_1) \stackrel{AD}{=} N\!\left(\frac{h_n^2}{2}\,\mu_2(K)\,g_1''(x_1),\ \frac{1}{nh_n}\,\|K\|_2^2\,\left(\Gamma_1^T A\Gamma_1\right)^{-1}\Gamma_1^T A R_1 A \Gamma_1\left(\Gamma_1^T A\Gamma_1\right)^{-1}\right). \qquad\qquad (10)$$

Furthermore, if we take $A_n = \hat R_1^{-1}(x_1)$, where $\hat R_1(x_1) \to_p R_1(x_1)$, then $n^{2/5}[\hat g_1^{\,e}(x_1) - \hat g_1(x_1)] = o_p(1)$ and

$$\hat g_1(x_1) - g_1(x_1) \stackrel{AD}{=} N\!\left(\frac{h_n^2}{2}\,\mu_2(K)\,g_1''(x_1),\ \frac{1}{nh_n}\,\|K\|_2^2\,\left(\Gamma_1^T R_1^{-1}\Gamma_1\right)^{-1}\right).$$

The choice of $A_n = \hat R_1^{-1}(x_1)$ as weighting gives minimum variance amongst the class of all such procedures. Note that the efficiency standard we erect here is not as high as in the one-parameter models. This is because, generically, we can expect correlation between $\hat g_{j1}(x_1)$ and $\hat g_{k1}(x_1)$ for $j,k = 1,\ldots,p$. In other words, it is not possible to estimate $g_{11}(x_1)$, say, as well as if one knew every other component function in the model, although it is possible to estimate the vector $g_1(\cdot)$ as well as if $g_2(\cdot)$ were known. As before, the above result can be used for bandwidth choice and standard error construction by replacing unknown quantities in (10) by estimates. Thus, under the conditions of Theorem 3 and provided $nh_n^5 \to 0$, for any vector $a = (a_1,\ldots,a_p)^T$, the following interval

$$a^T\hat g_1^{\,e}(x_1) \pm z_{\alpha/2}\sqrt{\frac{1}{nh_n}\,\|K\|_2^2\ a^T\left(\hat\Gamma_1^T A_n\hat\Gamma_1\right)^{-1}\hat\Gamma_1^T A_n\hat R_1 A_n\hat\Gamma_1\left(\hat\Gamma_1^T A_n\hat\Gamma_1\right)^{-1} a}$$

provides $1-\alpha$ coverage of the true function $a^T g_1(x_1)$, where

$$\hat\Gamma_1 = \frac{1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\frac{\partial\varphi}{\partial t}\left[U_i,\ \tilde c + \tilde g_1(x_1) + \tilde g_2(X_{2i})\right],$$
$$\hat R_1 = \frac{1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)(\varphi\,\varphi^T)\left[U_i,\ \tilde c + \tilde g_1(x_1) + \tilde g_2(X_{2i})\right].$$

3.2 Examples

Gamma and Beta

Suppose that there exist functions $\nu(x)$ and $\beta(x)$, both themselves additively separable functions of $x$, that satisfy the equations

$$E(Y|X=x) = \nu(x)\beta(x), \qquad E(Y^2|X=x) = \beta^2(x)\,\nu(x)[1 + \nu(x)].$$

This partial model specification is implied by $Y|X=x$ being gamma distributed, but is somewhat weaker. In this case, (7) is satisfied with $\varphi_1(Y,X|\nu,\beta) = Y - \nu\beta$ and $\varphi_2(Y,X|\nu,\beta) = Y^2 - \nu\beta^2(1+\nu)$. A full model specification can be based on the gamma (log) density function of $(Y,X)$, from which we obtain

$$E[\ell(U,\nu,\beta)\,|\,X=x] = (\nu(x)-1)\,m_\ell(x) - \beta(x)^{-1}m(x) - \ln\Gamma(\nu(x)) - \nu(x)\ln\beta(x), \qquad\qquad (11)$$

where $\Gamma(\cdot)$ is the gamma function, while $m(x) = E[Y|X=x]$ and $m_\ell(x) = E[\ln Y|X=x]$. This generates the following moment conditions

$$\varphi_1(U|\nu,\beta) = \ln Y - \frac{\Gamma'(\nu)}{\Gamma(\nu)} - \ln\beta, \qquad \varphi_2(U|\nu,\beta) = \frac{Y - \nu\beta}{\beta^2}.$$

The asymptotic variance of these procedures can be found by direct calculation.³

³ With regard to preliminary estimation in the full model specification, there are two estimation strategies. First, simply substitute estimates of $m(x)$ and $m_\ell(x)$ in (11) and maximize to obtain $\tilde\nu(x)$ and $\tilde\beta(x)$. Second, one can estimate the local parameters $\nu(x)$ and $\beta(x)$ by local likelihood, i.e., let $\tilde\nu(x)$ and $\tilde\beta(x)$ maximize

$$\frac{1}{nh_n^d}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n}\right)\left\{(\nu-1)\ln Y_i - \beta^{-1}Y_i - \ln\Gamma(\nu) - \nu\ln\beta\right\}$$

with respect to $\nu, \beta$. In both cases, we then integrate $\tilde\nu(x)$ and $\tilde\beta(x)$ with respect to $p_2(x_2)\,dx_2$.


The Beta distribution, which is frequently used in the study of rate or proportion data, can also be put in this framework. See Heckman and Willis (1977) for an econometric application of the Beta distribution.
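For concreteness, the gamma moment conditions above are easy to code; the short sketch below (an illustrative assumption, with `scipy.special.digamma` supplying $\Gamma'/\Gamma$) simply evaluates $\varphi$ for use inside a local GMM criterion such as (8).

```python
import numpy as np
from scipy.special import digamma

def phi_gamma_partial(y, nu, beta):
    """Partial specification: first two conditional moments of the gamma-type model."""
    return np.array([y - nu * beta,
                     y ** 2 - nu * beta ** 2 * (1.0 + nu)])

def phi_gamma_full(y, nu, beta):
    """Score-type moments derived from the gamma log density (full specification)."""
    return np.array([np.log(y) - digamma(nu) - np.log(beta),
                     (y - nu * beta) / beta ** 2])
```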

Variance Models (ARCH)

Suppose that with probability one

$$E(Y|X=x) = m(x) = F_m[\theta_m(x)], \qquad \theta_m(x) = c_m + m_1(x_1) + m_2(x_2), \qquad\qquad (12)$$
$$\mathrm{var}(Y|X=x) = \sigma^2(x) = F_\sigma[\theta_\sigma(x)], \qquad \theta_\sigma(x) = c_\sigma + \sigma_1(x_1) + \sigma_2(x_2), \qquad\qquad (13)$$

for some known functions $F_m$ and $F_\sigma$. Estimates of $m_j(\cdot)$ and $\sigma_j(\cdot)$ can be obtained by integrating (transformed) nonparametric estimates of the mean and variance, as in Yang and Härdle (1997). Note that their procedure ignores the cross-equation information, which can be imposed in our framework. Using only the mean and variance specification gives the following moment functions $\varphi_1(Y,X|\theta_m,\theta_\sigma) = Y - F_m(\theta_m)$ and $\varphi_2(Y,X|\theta_m,\theta_\sigma) = Y^2 - F_m^2(\theta_m) - F_\sigma(\theta_\sigma)$; the asymptotic variance of the GMM procedure is as in (10) with

$$R(x, g_0(x)) = \begin{bmatrix} \sigma^2(x) & \mu_3(x) \\ \mu_3(x) & \mu_4(x) + 2\,\cdots \end{bmatrix}, \qquad \Gamma(x, g_0(x)) = \begin{bmatrix} F_m'(\theta_m(x)) & 0 \\ 2F_m(\theta_m(x))F_m'(\theta_m(x)) & F_\sigma'(\theta_\sigma(x)) \end{bmatrix},$$

where $\mu_3(x) = E[\{Y - E(Y|X=x)\}^3\,|\,X=x]$. The optimal estimator has lower asymptotic variance than the procedure of Yang and Härdle (1997, Theorem 2.4) because it uses cross-equation information.⁴ A convenient complete model specification here is that $Y|X=x$ is $N(m(x), \sigma^2(x))$, which leads to the following moments

$$\varphi_1(Y,X|\theta_m,\theta_\sigma) = \frac{Y - F_m(\theta_m(x))}{F_\sigma(\theta_\sigma(x))}\,F_m'(\theta_m(x)), \qquad \varphi_2(Y,X|\theta_m,\theta_\sigma) = \frac{1}{2}\left\{\frac{(Y - F_m(\theta_m(x)))^2}{F_\sigma(\theta_\sigma(x))} - 1\right\}F_\sigma'(\theta_\sigma(x)).$$

The corresponding procedure has asymptotic variance as in (10) with

$$R(x, g_0(x)) = \Gamma(x, g_0(x)) = \begin{bmatrix} \dfrac{F_m'(\theta_m(x))^2}{F_\sigma(\theta_\sigma(x))} & 0 \\ 0 & \dfrac{1}{2}\left[\dfrac{F_\sigma'(\theta_\sigma(x))}{F_\sigma(\theta_\sigma(x))}\right]^2 \end{bmatrix}.$$

⁴ Strictly speaking our results only apply to the i.i.d. case, but recent work of Kim (1998) has extended this to a time series setting.
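The cross-equation structure of (12)-(13) is straightforward to encode. Below is a hedged sketch of the two moment systems just described (the partial mean/variance moments and the Gaussian score moments), written so they can be dropped into a local GMM criterion like (8); the link functions chosen here (identity for the mean, exponential for the variance) are illustrative assumptions.

```python
import numpy as np

# Illustrative links: F_m = identity, F_sigma = exp (keeps the variance positive).
F_m = lambda t: t
dF_m = lambda t: 1.0
F_s = np.exp
dF_s = np.exp

def phi_arch_partial(y, theta_m, theta_s):
    """Moments using only the mean and variance specification (12)-(13)."""
    return np.array([y - F_m(theta_m),
                     y ** 2 - F_m(theta_m) ** 2 - F_s(theta_s)])

def phi_arch_gaussian(y, theta_m, theta_s):
    """Score-type moments from the conditional normal specification."""
    resid = y - F_m(theta_m)
    return np.array([resid / F_s(theta_s) * dF_m(theta_m),
                     0.5 * (resid ** 2 / F_s(theta_s) - 1.0) * dF_s(theta_s)])
```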


4 Concluding Remarks

We have provided a general principle for obtaining efficient estimates that works in almost any model with separable nonparametric components, whether fully specified or only partially specified. We did not consider models with parametric components or discrete explanatory variables, since such models can be viewed as special cases of ours. The only new issue that arises in such models is how to impose the restriction of parametric effects efficiently. If the additive structure (2) does not hold, then $\hat m_1(x_1)$ is estimating some other functional of the joint distribution (depending of course on what $c + m_2(\cdot)$ is), see for example Newey (1994). Specifically, $\hat m_1(x_1)$ consistently estimates the minimizer of a Kullback-Leibler distance with respect to $\theta_0$. Centred correctly, the asymptotic distributions take a similar form, with some relabeling, and are efficient for estimating these particular functionals.

5 Appendix

Let $L(z) = C(F(z))$, $P(z) = A(F(z))$, and

$$D(x,z) = m(x)L(z) + P(z).$$

We shall let $D^{(j)}(x,z)$, $j = 1,2,\ldots$, denote partial derivatives of $D$ with respect to $z$. We let $|A| = \{\mathrm{tr}(A^TA)\}^{1/2}$ for any matrix $A$.

We use the following assumptions.

Assumption A

1. The random sample $\{(Y_i, X_i)\}_{i=1}^{n}$, $Y_i \in \mathbb{R}$, $X_i \in \mathcal{X}$ a compact subset of $\mathbb{R}^d$, is independent and identically distributed (i.i.d.) with $E(Y^4) < \infty$.

2. Let $p(x)$ be the marginal density of $X$ with respect to Lebesgue measure and let $m(x) = E(Y|X=x)$. We suppose that $p(x)$ and $m_1(x_1)$ are twice continuously differentiable with respect to $x_1$ at all $x$ and that $\inf_{x\in\mathcal{X}} p(x) > 0$.

3. The variance function $\sigma^2(x) = \mathrm{var}(Y|X=x)$ is Lipschitz continuous at all $x \in \mathcal{X}$, i.e., there exists a constant $c$ such that for all $x, x'$, we have $|\sigma^2(x) - \sigma^2(x')| \le c\,|x - x'|$.

4. The functions $A(\cdot)$, $C(\cdot)$, $G(\cdot)$, and $F(\cdot)$ have bounded continuous second derivatives over any compact interval. The function $G$ is strictly monotonic.

5. The kernel weighting function $K$ is continuous, symmetric about zero, of bounded support, and satisfies $\int K(v)\,dv = 1$.

6. $\{h_n : n \ge 1\}$ is a sequence of nonrandom bounded positive constants satisfying $h_n \to 0$ and $nh_n/\log n \to \infty$.

7. The true parameters $\theta_{00}(x_1) = m_1(x_1)$ and $\theta_{01}(x_1) = m_1'(x_1)$ lie in the interior of the compact parameter space $\Theta = \Theta_0 \times \Theta_1$.

Assumption B.

1. For each $\alpha = 2,\ldots,d$, the functions $w_\alpha$ and $K_\alpha$ are continuous on their bounded supports. Furthermore, $K_\alpha$ is Lipschitz continuous, i.e., there exists a finite constant $c$ such that $|K_\alpha(t) - K_\alpha(s)| \le c\,|t - s|$ for all $t, s$.

2. The bandwidths satisfy $g_n/h_n \to 0$, $nh_n g_n \to \infty$, and $n^3 g_n^5/\log n \to \infty$.

3. The remainder term in (6) satisfies $\max_{1\le i\le n} |\eta_{ni}| = o_p(n^{-2/5})$.

4. The functions $A(\cdot)$, $C(\cdot)$, and $F(\cdot)$ have bounded continuous third derivatives over any compact interval.

Assumptions A$'$ and B$'$ are like A and B except that we replace $m$, $\sigma^2$, $A$, $C$, and $F$ by the corresponding quantities derived from $\varphi$.

Proof of Theorem 1. Let $\theta_{00}(x_1)$ and $\theta_{01}(x_1)$ be the true local parameters, i.e., $\theta_{00}(x_1) = m_1(x_1)$ and $\theta_{01}(x_1) = m_1'(x_1)$. We first show that $\hat\theta(x_1) = (\hat\theta_0(x_1), \hat\theta_1(x_1))^T$ consistently estimates $\theta(x_1) = (\theta_0(x_1), \theta_1(x_1))$. By the uniform law of large numbers in Gozalo and Linton (1995), we have

$$\sup_{\theta\in\Theta}\left|Q_n(\theta) - \bar Q_n(\theta)\right| \to_p 0,$$

where $\bar Q_n(\theta) = E\{Q_n(\theta)\}$. This applies because of the smoothness and boundedness conditions on $A$, $C$, and $F$. Furthermore,

$$\bar Q_n(\theta) = \int \frac{1}{h_n} K\!\left(\frac{x_1 - X_1}{h_n}\right) D\big(X,\ c + m_2(X_2) + \theta_0 + \theta_1(X_1 - x_1)\big)\,p(X)\,dX$$
$$= \int\!\!\int D\big(x_1 - uh_n,\, x_2,\ c + m_2(x_2) + \theta_0 + \theta_1 h_n u\big)\,K(u)\,p(x_1 - uh_n, x_2)\,du\,dx_2 \qquad\qquad (14)$$
$$\to \int D\big(x,\ c + m_2(x_2) + \theta_0\big)\,p(x)\,dx_2 := Q_0(\theta_0)$$

uniformly in $\theta \in \Theta$. The second equality follows by the change of variables $X_1 \to u = (x_1 - X_1)/h_n$, and convergence follows by dominated convergence and continuity. We now apply property 4 of Gourieroux, Monfort and Trognon (1984a) [henceforth GMT], which says that, provided $F$ is monotonic, $Q_0(\theta_0) \le Q_0(\theta_{00})$ with equality if and only if $\theta_0 = \theta_{00}$. This establishes consistency of $\hat\theta_0(x_1)$. The derivative parameter $\theta_1(x_1)$ is determined by the next order term (in $h_n$) through a Taylor expansion of (14). When evaluated at $(\theta_{00}, \theta_1)$, this quantity is, apart from terms that do not depend on $\theta_1$ or are of smaller order, $h_n^2$ times a constant times

$$Q_1(\theta_1) = \int\left\{a(x)\theta_1 + \tfrac{1}{2}\,b(x)\theta_1^2\right\} p(x)\,dx_2, \qquad\qquad (15)$$

where

$$a(x) = \frac{\partial m}{\partial x_1}(x)\,C'(m(x))\,F'(G(m(x))), \qquad b(x) = D''(x, G(m(x))).$$

Note that by properties 1 and 2 of GMT, we have

$$D''(x, G(m(x))) = -C'(m(x))\,F'(G(m(x)))^2, \qquad\qquad (16)$$

and we can see that the unique minimum of $Q_1(\theta_1)$ is $\theta_1(x_1) = m_1'(x_1)$ [$C'(m) > 0$ by property 3 of GMT]. See Gozalo and Linton (1995) for further discussion. This establishes the consistency of $\hat\theta(x_1)$. We now turn to asymptotic normality. By an asymptotic expansion we have

$$H_n\left[\hat\theta(x_1) - \theta^0(x_1)\right] = -\left[H_n^{-1}\frac{\partial^2 Q_n(\bar\theta(x_1))}{\partial\theta\,\partial\theta^T}H_n^{-1}\right]^{-1} H_n^{-1}\frac{\partial Q_n(\theta^0(x_1))}{\partial\theta}, \qquad\qquad (17)$$

where $H_n = \mathrm{diag}(1, h_n)$ and $\bar\theta(x_1)$ is a vector intermediate between $\hat\theta(x_1)$ and $\theta^0(x_1)$. The presentation of (17) assumes that the matrix in square brackets is invertible, which we shall show is true with probability tending to one. The score function is

$$\frac{\partial Q_n(\theta^0(x_1))}{\partial\theta} = \frac{-2}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 \\ X_{1i} - x_1 \end{pmatrix}\{Y_i\,L'(\bar Z_i) + P'(\bar Z_i)\},$$

while the Hessian matrix is

$$\frac{\partial^2 Q_n(\theta)}{\partial\theta\,\partial\theta^T} = \frac{2}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 & X_{1i} - x_1 \\ X_{1i} - x_1 & (X_{1i} - x_1)^2 \end{pmatrix}\{Y_i\,L''(\bar Z_i(\theta)) + P''(\bar Z_i(\theta))\},$$

where

$$\bar Z_i(\theta) = c + m_2(X_{2i}) + \theta_0 + \theta_1(X_{1i} - x_1), \qquad Z_i = c + m_2(X_{2i}) + m_1(X_{1i}),$$

and $\bar Z_i = \bar Z_i(\theta^0(x_1))$. We next show that the vector $H_n^{-1}\,\partial Q_n(\theta^0(x_1))/\partial\theta$ satisfies a central limit theorem, while $H_n^{-1}\{\partial^2 Q_n(\bar\theta(x_1))/\partial\theta\,\partial\theta^T\}H_n^{-1}$ is, asymptotically, a positive definite diagonal matrix. Write the score function as

$$H_n^{-1}\frac{\partial Q_n(\theta^0(x_1))}{\partial\theta} = \frac{-2}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 \\ \frac{X_{1i}-x_1}{h_n} \end{pmatrix}\varepsilon_i\,L'(\bar Z_i)\ -\ \frac{2}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 \\ \frac{X_{1i}-x_1}{h_n} \end{pmatrix} D'(X_i, \bar Z_i)\ \equiv\ T_{n1} + T_{n2},$$

where $\varepsilon_i = Y_i - m(X_i) = Y_i - E(Y_i|X_i)$. The first random vector is mean zero and has variance matrix

$$\mathrm{var}(T_{n1}) = \frac{4}{nh_n}\,\frac{1}{h_n}\,E\left[K^2\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\sigma^2(X_i)\,L'(\bar Z_i)^2\begin{pmatrix} 1 & \frac{X_{1i}-x_1}{h_n} \\ \frac{X_{1i}-x_1}{h_n} & \left(\frac{X_{1i}-x_1}{h_n}\right)^2 \end{pmatrix}\right]$$
$$= \frac{4}{nh_n}\int K^2(u)\,\sigma^2(x_1 - uh_n, x_2)\,L'\big(c + m_2(x_2) + m_1(x_1) + h_n m_1'(x_1)u\big)^2\begin{pmatrix} 1 & u \\ u & u^2 \end{pmatrix} p(x_1 - uh_n, x_2)\,dx_2\,du$$
$$= \frac{4}{nh_n}\begin{pmatrix} \|K\|_2^2 & 0 \\ 0 & \mu_2(K^2) \end{pmatrix} i_1(x_1)\,\{1 + o(1)\}$$

by the law of iterated expectation, Fubini's theorem, and dominated convergence, which can be applied using the boundedness and continuity conditions. Finally,

$$\sqrt{\frac{nh_n}{4\,\|K\|_2^2\,i_1(x_1)}}\;(1, 0)\,T_{n1} \Rightarrow N(0, 1)$$

by the Lindeberg Central Limit Theorem, see Lemma CLT of Gozalo and Linton (1995). The second term in the score function determines the bias of $\hat m_1(x_1)$. By Taylor expansion

$$D'(X_i, \bar Z_i) = D'(X_i, Z_i) + D''(X_i, Z_i)\left[m_1(X_{1i}) - m_1(x_1) - m_1'(x_1)(X_{1i}-x_1)\right] + D'''(X_i, Z_i^*)\left[m_1(X_{1i}) - m_1(x_1) - m_1'(x_1)(X_{1i}-x_1)\right]^2,$$

where the $Z_i^*$ are intermediate between $\bar Z_i$ and $Z_i$. Note that $D'(X_i, Z_i) = 0$ by property 1 of GMT. Therefore,

$$T_{n2} = \frac{-2}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 \\ \frac{X_{1i}-x_1}{h_n} \end{pmatrix} D''(X_i, Z_i)\left[m_1(X_{1i}) - m_1(x_1) - m_1'(x_1)(X_{1i}-x_1)\right]$$
$$\qquad + \frac{-2}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 \\ \frac{X_{1i}-x_1}{h_n} \end{pmatrix} D'''(X_i, Z_i^*)\left[m_1(X_{1i}) - m_1(x_1) - m_1'(x_1)(X_{1i}-x_1)\right]^2$$
$$= \frac{-1}{nh_n}\sum_{i=1}^{n} K\!\left(\frac{x_1 - X_{1i}}{h_n}\right)\begin{pmatrix} 1 \\ \frac{X_{1i}-x_1}{h_n} \end{pmatrix} m_1''(x_1)(X_{1i}-x_1)^2\,D''(X_i, Z_i) + o_p(h_n^2) \qquad\qquad (18)$$
$$= -h_n^2\,\mu_2(K)\,m_1''(x_1)\,i_1(x_1)\begin{pmatrix} 1 \\ 0 \end{pmatrix} + o_p(h_n^2), \qquad\qquad (19)$$

where (18) follows from the fact that, for some $c < \infty$,

$$\sup_{|t_1 - x_1|\le c h_n}\left|m_1(t_1) - m_1(x_1) - m_1'(x_1)(t_1 - x_1) - \tfrac{1}{2}\, m_1''(x_1)(t_1 - x_1)^2\right| = o_p(h_n^2),$$