Nonparametric Regression with Correlated Errors

Jean Opsomer, Iowa State University

Yuedong Wang, University of California, Santa Barbara
Yuhong Yang, Iowa State University

February 1, 1999

Abstract

Nonparametric regression techniques are often sensitive to the presence of correlation in the errors. The practical consequences of this sensitivity are explained, with particular emphasis on smoothing parameter selection. We review the existing literature on kernel regression, smoothing splines and wavelet regression, for both short-range and long-range dependence. Extensions to random design, higher dimensional models and adaptive estimation are discussed.

1 Introduction

In nonparametric regression, the researcher is most often interested in estimating the mean function E(Y|X) = f(X) for a set of observations (X_1, Y_1), ..., (X_n, Y_n), where the X_i can be either univariate or multivariate. Nonparametric regression is a rapidly growing and exciting branch of statistics, both because of recent theoretical developments and because of the more widespread use of fast and inexpensive computers. Many competing methods are currently available, including kernel-based methods, regression splines, smoothing splines, and wavelet and Fourier series decompositions. The bulk of the literature in these areas has focused on the case in which an unknown mean function is "masked" by a certain amount of white noise, and the goal of the regression is to "remove" the white noise and uncover this unknown mean function. Recently, a number of authors have begun to look at the situation where the noise is no longer white, and instead contains a certain amount of "structure" in the form of correlation. The focus of this article is the problem of estimating the mean function in the presence of correlation, not that of estimating the correlation

function itself. In this context, the goals of this article are (1) to explain some of the difficulties associated with the presence of correlation in nonparametric regression, (2) to provide an overview of the nonparametric regression literature that deals with the correlated errors case, and (3) to discuss some new developments in this area. In this article, we will be looking at the following statistical model:

Y_i = f(X_i) + ε_i,    (1)

where f(·) is an unknown, smooth function, the X_i are either random or fixed with domain 𝒳, and ε = (ε_1, ..., ε_n)^T has variance Var(ε) = σ²C. As a convention, we will use lower case x when the design is fixed and capital X when it is random. The correlation matrix C is either considered completely specified, known up to a finite number of parameters, or left completely unspecified. Section 2 explains some of the practical difficulties associated with estimating f(·) under model (1). In Section 3, we review the existing literature on this topic in several areas of nonparametric regression. Section 4 describes some extensions of existing results as well as new developments.

2 Problems with Correlation

A number of problems, some quite fundamental, occur when nonparametric regression is attempted in the presence of correlated errors. Indeed, in the most general setting where no parametric shape is assumed for either the mean or the correlation function, the model is essentially unidentifiable, so that it is impossible to estimate either function separately. From a practical perspective, the presence of correlation (whether identifiable or not) causes the commonly used automatic tuning parameter selection methods, such as cross-validation or plug-in, to break down in most nonparametric regression techniques. These problems are illustrated by the "Drum Roller" data analyzed by Laslett [44] and Altman [3] (the data are available on Statlib). As noted by both authors, the data exhibit significant short-range correlation. Figure 1 shows two local linear regression fits with the bandwidth obtained by data-driven selection methods. As an example of an automated bandwidth selection method, cross-validation was used to fit a mean function to this dataset. As the figure shows, the cross-validation fit is very wiggly and does not result in a satisfactory mean function estimate. This type of undersmoothing behavior is common with most commonly used automated bandwidth selection methods.


Figure 1: Drum Roller data with estimated mean function using local linear regression; bandwidth selection performed using cross-validation (- -) and CDPI (-).

At its most conceptual level, the undersmoothing is caused by the fact that the bandwidth selection method "perceives" all the trend in the data to be due to the mean function, and attempts to incorporate that trend into its estimate. When the data are uncorrelated, this "perception" is valid, but it breaks down in the presence of correlation. The other fit in Figure 1 uses a bandwidth value selected by CDPI, a method discussed in Section 4.1 which takes the correlation into account when estimating the optimal bandwidth. It should be noted again that, when the shapes of both the mean and correlation functions are completely unspecified, the separation of trend and noise remains a subjective matter, so that for some applications the cross-validation fit in Figure 1 could be the correct fit. The choice of bandwidth selection method therefore depends on the specific type of data under consideration.


Figure 2: Simulation 1 (independent data): (a) Observations (circles) and spline fit (line); (b) Autocorrelation function of residuals. Simulation 2 (autoregressive): (c) Observations (circles) and spline fit (line); (d) Autocorrelation function of residuals.

Similarly, an inappropriate choice of tuning parameter can induce spurious serial correlation in the regression residuals, even when no correlation is present in the errors. Indeed, consider the following naive approach to fitting the data:

1. fit the function with a "sensible" choice of the smoothing parameter,
2. estimate the correlation using the residuals of that fit, and
3. estimate the function again using the estimated correlation.

While reasonable on the surface, this approach can break down in practice. Indeed, even if the data are independent, a wrong choice of the smoothing parameter can induce "spurious" serial correlation in the residuals and, similarly, if the data are correlated, a

wrong choice of the smoothing parameter leads to estimated correlation in the residuals which does not reflect correlation in the random errors. Two simple simulations using smoothing splines illustrate these facts (see Figure 2). In the first simulation, 50 observations are generated from the following model

Y_i = sin(2πi/50) + ε_i,    i = 1, ..., 50,

where "i's are iid normal with mean zero and standard deviation 0.1. The S-Plus function smooth.spline is used to t the data with smoothing parameter set at 0.01 (Figure 2(a)). In the second simulation, 50 observations are generated according to model (1) with mean zero and errors following a rst-order autoregressive process (AR(1)) with autocorrelation 0.5 and standard deviation 0.1. The S-Plus function smooth.spline is again used to t the data with smoothing parameter selected by GCV method (Figure 2(c)). The estimated autocorrelation function (ACF) for the rst plot looks autoregressive (Figure 2(b)), while that for the second plot appears independent (Figure 2(d)). In both cases, this conclusion about the error structure is erroneous. For this naive approach to work, one has to have the correct guess of the smoothing parameter in the rst step.

3 Results to Date

3.1 Kernel-based Methods

In this section, we consider the problem of how to estimate the mean function for data assumed to follow model (1), with

E(ε_i) = 0,   Var(ε_i) = σ²,   Corr(ε_i, ε_j) = ρ_n(X_i − X_j),    (2)

with σ² and ρ_n unknown. The dependence of ρ on n is indicated by the subscript, because the consistency properties of the estimators will depend on the behavior of the correlation function as n increases. In this section, we only consider univariate X_i in a bounded interval [a, b], and for simplicity we let [a, b] = [0, 1]. The researchers who have studied the properties of kernel-based estimators of the function f(·) have focused on the time series case, in which the design points are fixed and equally spaced, or equivalently,

Y_i = f(i/n) + ε_i,   Corr(ε_i, ε_j) = ρ_n(|i − j|).

We first consider the simplest situation, in which the correlation function is taken to be ρ_n(i) = ρ(i). This restriction on ρ_n will be relaxed in Section 4.1. The function f(·) can be fitted by kernel regression or local polynomial regression. Following the authors in this area, we discuss estimation by kernel regression, and for simplicity we consider the Priestley-Chao kernel estimator

f̂(x) = s_x^T Y = (1/nh) Σ_{i=1}^n K((X_i − x)/h) Y_i

(Priestley and Chao [56]) for some kernel function K and bandwidth h, at a point x ∈ [0, 1]. In vector notation, Y = (Y_1, ..., Y_n)^T and s_x^T = (1/nh)(K((X_1 − x)/h), ..., K((X_n − x)/h)). The mean squared error (MSE) of f̂(x) is

MSE(f̂(x)) = E(s_x^T Y − f(x))² = (s_x^T E(Y) − f(x))² + σ² s_x^T C s_x.

The first term of the MSE represents the squared bias. Under the usual assumptions on the kernel and the bandwidth, and for a kernel of order p, the bias is approximated by

s_x^T E(Y) − f(x) = (−h)^p (μ_p(K)/p!) f^(p)(x) + o(h^p + 1/(nh))

(Altman [2]), with μ_q(K) = ∫ t^q K(t) dt for any q ≥ 0. Note the important fact that the bias of f̂(x) does not depend on the dependence structure of the errors. This holds not only in the time series context, but also for random designs and in higher dimensional problems. The effect on the variance part of the MSE is potentially severe, however, and we will focus on that aspect in this section. It is necessary to make some assumptions about ρ to ensure that summations over the correlations in the data remain bounded:

(R.I) lim_{n→∞} Σ_{k=1}^n |ρ(k)| < ∞,

(R.II) lim_{n→∞} (1/n) Σ_{k=1}^n k|ρ(k)| = 0.
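As an aside, the Priestley-Chao estimator itself takes only a few lines of code. The sketch below (not from the paper) evaluates it on an equally spaced design; the Gaussian kernel is an illustrative choice, not a requirement of the estimator.

```python
import numpy as np

def priestley_chao(x_eval, x, y, h):
    """Priestley-Chao estimate f_hat(x) = (1/nh) sum_i K((x_i - x)/h) y_i."""
    n = len(x)
    kern = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel (assumption)
    u = (x[None, :] - x_eval[:, None]) / h                        # (m, n) scaled distances
    return kern(u) @ y / (n * h)

# Example on an equally spaced design.
n = 200
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).standard_normal(n)
fhat = priestley_chao(x, x, y, h=0.05)
```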

Assumptions (R.I) and (R.II), common in time series analysis, ensure that observations sufficiently far apart are essentially uncorrelated. Then the variance component of the MSE can be approximated by

σ² s_x^T C s_x = (1/nh) μ_0(K²) σ² (1 + 2R) + o(1/(nh))

(Altman [2]), with R = Σ_{k=1}^∞ ρ(k). When the observations are uncorrelated, R = 0, so that this result reduces to the one usually reported for kernel regression with independent errors. Note also that, since the spectral density of the errors is

H(ω) = (σ²/2π) Σ_{k=−∞}^∞ ρ(|k|) e^{−ikω},

it is easy to see that (1/2π) σ²(1 + 2R) = H(0), the spectral density at ω = 0. This fact will be useful in developing bandwidth selection methods that incorporate the effect of the correlation. Let AMSE denote the asymptotic approximation to the MSE, or

AMSE(f̂(x)) = ((−h)^p (μ_p(K)/p!) f^(p)(x))² + (1/nh) μ_0(K²) σ² (1 + 2R).    (3)

This is a simple function of h, so that the effect of the correlation on the optimal bandwidth can easily be seen. The presence of the additional term R in the AMSE has important implications for the correct choice of bandwidth. If R > 0, implying that the error correlation is (mostly) positive, then the variance of f̂(x) will be larger than in the corresponding uncorrelated case. The AMSE is therefore minimized by a value of the bandwidth h that is larger than in the uncorrelated case. Conversely, if R < 0, the AMSE-optimal bandwidth is smaller than in the uncorrelated case. In addition to changing the asymptotically optimal bandwidth, correlation has a perverse effect on automated bandwidth selection methods as well, as described in Altman [2] and Hart [34] for cross-validation. As a global measure of goodness-of-fit, we consider the Mean Average Squared Error (MASE), or

MASE(h) = (1/n) Σ_{i=1}^n (f̂(i/n) − f(i/n))².    (4)

Let f̂^(−i) denote the kernel regression estimate computed on the dataset with the ith observation removed. The cross-validation criterion is

CV(h) = (1/n) Σ_{i=1}^n (f̂^(−i)(i/n) − Y_i)²,    (5)

so that

E(CV(h)) ≈ MASE(h) + σ² − (2/n) Σ_{i=1}^n Cov(f̂^(−i)(i/n), ε_i).

It is easy to show that, asymptotically, Cov(f̂^(−i)(i/n), ε_i) ≈ T_1/(nh), so that for sufficiently large amounts of correlation,

E(CV(h)) ≈ T_0 − 2T_1/(nh)

for some constants T_0, T_1, which is minimized by letting h become very small. This leads to a fit which nearly interpolates the data and explains the behavior of CV shown in Figure 1. This result holds not only for cross-validation but also for related measures of fit such as GCV and Mallows' criterion (Chiu [13]). One approach to solve this problem was proposed independently by Chiu [13], Altman [2] and Hart [34]. While the specific implementations varied, they chose to estimate the correlation function parametrically and to use this estimate to adjust the bandwidth selection criterion. The estimation of the correlation function is of course complicated by the fact that the errors in (1) are unobserved. Chiu [13] attempts to bypass that problem by estimating the correlation function in the frequency domain while down-weighting the low-frequency periodogram components, in which the mean function is assumed to lie, generalizing the results of Hurvich and Zeger [38] to the nonparametric regression context. Altman [2] uses residuals from a pilot kernel regression fit and fits a low-order autoregressive process to the residuals. Hart [34] uses differencing to remove the trend, followed by estimation in the frequency domain. In later work, Hart [35] describes a further refinement of this approach. He introduces time series cross-validation as a new goodness-of-fit criterion, which can be jointly minimized over the set of parameters for the correlation function (a pth-order autoregressive process in this case) and the bandwidth parameter. These approaches appear to work well in practice. Even when the parametric part of the model is misspecified, they provide a significant improvement over the fits computed under the assumption of independent errors, as the simulation experiments in Altman [2] and Hart [34] show. However, when performing nonparametric regression, it is sometimes desirable to completely avoid such parametric assumptions. Chu and Marron [14] propose two new cross-validation-based criteria that estimate the MASE-optimal bandwidth without specifying the correlation function. In modified cross-validation (MCV), the kernel regression values f̂^(−i) in (5) are computed by leaving out the 2l+1 observations i−l, i−l+1, ..., i+l−1, i+l surrounding the ith observation. Since the correlation is assumed to be short-range, a proper choice of l will greatly decrease the effect of the terms Cov(f̂^(−i)(i/n), ε_i) in the CV criterion. In partitioned cross-validation (PCV), the observations are partitioned into g subgroups by taking every gth observation. Within each subgroup, the observations are further apart and hence are assumed less correlated. Cross-validation is performed for each subgroup, and the bandwidth estimate for all the observations is a simple function of the average of the subgroup-optimal bandwidths. The drawback of both MCV and PCV is that the values of l and g need to be selected with some care.
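As a concrete illustration of MCV (a sketch of the idea rather than Chu and Marron's exact implementation), the code below selects the bandwidth by leave-(2l+1)-out cross-validation for a simple kernel fit; the Gaussian kernel, the Nadaraya-Watson form of the leave-out fit, the candidate bandwidth grid and the default l = 2 are all illustrative assumptions.

```python
import numpy as np

def mcv_bandwidth(x, y, bandwidths, l=2):
    """Modified cross-validation: drop the 2l+1 observations i-l, ..., i+l when predicting Y_i."""
    n = len(x)
    kern = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    scores = []
    for h in bandwidths:
        w = kern((x[:, None] - x[None, :]) / h)
        for i in range(n):
            w[i, max(0, i - l):i + l + 1] = 0.0    # remove the block of neighbours around i
        fhat_loo = (w @ y) / w.sum(axis=1)         # leave-block-out kernel fit at x_i
        scores.append(np.mean((y - fhat_loo) ** 2))
    return bandwidths[int(np.argmin(scores))]

# Usage: h_hat = mcv_bandwidth(x, y, bandwidths=np.linspace(0.02, 0.3, 15), l=2)
```

Setting l = 0 recovers ordinary leave-one-out cross-validation, which is exactly the criterion that breaks down under positive short-range correlation.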

Herrmann et al. [37] also propose a fully nonparametric method for estimating the MASE-optimal bandwidth, but replace the CV-based criteria by a plug-in approach. This type of bandwidth selection has been shown to have a number of theoretical and practical advantages over CV (Hardle et al. [31], [32]). By using the asymptotic approximation (3) for a kernel of order p = 2, it is easy to show that the minimizer of the MASE is asymptotically equal to

h_MASE ≈ C(K) ( σ²(1 + 2R) / (n ∫ f''(x)² dx) )^{1/5},

with C(K) a known kernel-dependent constant. Plug-in bandwidth selection is performed by estimating the unknown quantities in this expression and replacing them by their estimates (hence the name "plug-in"). The estimation of ∫ f''(x)² dx is completely analogous to that in the uncorrelated case. The variance component S = σ²(1 + 2R) is estimated by a summation over squared differencing residuals. As in Chu and Marron [14], no parametric shape is assumed for the correlation function. More recently, Hall et al. [30] extended the results of Chu and Marron [14] in a number of useful directions. Their theoretical results apply to kernel regression as well as local linear regression, a method with significantly better properties. They also explicitly consider the long-range dependence case, where assumptions (R.I) and (R.II) are no longer required. They discuss bandwidth selection through MCV and compare it with a bootstrap-based approach which estimates the MASE in (4) directly through resampling of "blocks" of residuals from a pilot smooth. However, because the latter approach requires a pilot fit, it is sensitive to the problems described in Section 2. As was the case for Chu and Marron [14], both approaches are fully nonparametric but require the choice of other tuning parameters. Opsomer et al. [55] analyzed a large spatial dataset where the observations were thought to contain both location-dependent heteroskedasticity and correlation. While the mean function was parametrically specified, the spatial variance function was to be computed by local linear regression on the (correlated) squared residuals. In contrast to the time series work by the authors discussed above, this research involved multivariate, unequally spaced observations. Following Altman [2] and Hart [34], the correlation function was assumed parametric in order to simplify the calculation. The overall fitting procedure is iterative, with the mean, variance and correlation functions fitted in sequence and assumed known in the next fitting step, until convergence is achieved for each function. The bandwidth for the variance function was selected using EBBS (Ruppert [57]), with the variance component of

the MSE suitably adjusted for the presence of the correlation.
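The plug-in recipe can be sketched in a few lines. The fragment below is an illustration, not the difference-based estimator of Herrmann et al.: it estimates S = σ²(1 + 2R) by summing sample autocovariances of pilot-fit residuals up to a truncation lag and plugs the result into the asymptotic bandwidth formula; the pilot bandwidth, the truncation lag, the kernel constant C(K) ≈ 0.776 (the value for a Gaussian kernel) and the externally supplied value of ∫ f''(x)² dx are all assumptions made for the example.

```python
import numpy as np

def plug_in_bandwidth(x, y, pilot_h, int_f2, C_K=0.776, max_lag=5):
    """h = C(K) * (S / (n * int_f2))^(1/5), with S = sigma^2*(1+2R) estimated
    from the autocovariances of residuals of a pilot kernel fit."""
    n = len(x)
    kern = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    w = kern((x[:, None] - x[None, :]) / pilot_h)
    resid = y - (w @ y) / w.sum(axis=1)                 # pilot-fit residuals
    resid = resid - resid.mean()
    gamma = [np.sum(resid[:n - k] * resid[k:]) / n for k in range(max_lag + 1)]
    S = gamma[0] + 2.0 * sum(gamma[1:])                 # estimate of sigma^2 * (1 + 2R)
    return C_K * (S / (n * int_f2)) ** (1 / 5)
```

Because the residuals come from a pilot fit, this shortcut inherits the difficulties of Section 2; the difference-based estimator described above avoids that pilot fit.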

3.2 Polynomial Splines

Consider model (1) with fixed design points x_i and 𝒳 = [0, 1], and assume that f is a function with certain smoothness properties. More precisely, assume f belongs to the Sobolev space

W_2^m = { f : f^(v) absolutely continuous, v = 0, ..., m−1; ∫_0^1 (f^(m)(x))² dx < ∞ }.

A smoothing spline estimate f̂ is the minimizer of the following penalized weighted least squares criterion

min_{f ∈ W_2^m} { (1/n) (Y − f)^T C^{−1} (Y − f) + λ ∫_0^1 (f^(m)(x))² dx },    (6)

where λ is the smoothing parameter controlling the trade-off between the goodness-of-fit, measured by weighted least squares, and the roughness of the estimate, measured by ∫_0^1 (f^(m)(x))² dx.

Let φ_ν(x) = x^{ν−1}/(ν−1)!, ν = 1, ..., m, and let R(x, z) = ∫_0^1 (x−u)_+^{m−1} (z−u)_+^{m−1} du / ((m−1)!)², where (x)_+ = x for x ≥ 0 and (x)_+ = 0 otherwise. Denote T_{n×m} = {φ_ν(x_i)}_{i=1,...,n; ν=1,...,m} and Σ_{n×n} = {R(x_i, x_j)}_{i,j=1,...,n}. Let T = (Q_1 Q_2)(R^T 0^T)^T be the QR decomposition of T. Kimeldorf and Wahba [40] showed that the solution to (6) is

f̂(x) = Σ_{ν=1}^m d_ν φ_ν(x) + Σ_{i=1}^n c_i R(x_i, x),

where c = (c_1, ..., c_n)^T and d = (d_1, ..., d_m)^T are solutions to

(Σ + nλC)c + Td = Y,    T^T c = 0.    (7)

At the design points, f̂ = (f̂(x_1), ..., f̂(x_n))^T = AY, where

A = I − nλ C Q_2 (Q_2^T (Σ + nλC) Q_2)^{−1} Q_2^T

is the "hat" matrix. Note that A may be asymmetric, which differs from the independent case.
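For the cubic-spline case m = 2, the system (7) can be solved directly once C is specified. The sketch below is a minimal numpy illustration under that assumption: it uses the closed form R(x, z) = min(x,z)²(3 max(x,z) − min(x,z))/6 of the reproducing kernel and a dense linear solve rather than the efficient algorithms discussed below; the function and variable names are ours.

```python
import numpy as np

def cubic_spline_with_correlation(x, y, lam, C):
    """Solve (Sigma + n*lam*C) c + T d = Y,  T^T c = 0  (eq. (7)) for m = 2."""
    n = len(x)
    T = np.column_stack([np.ones(n), x])                  # phi_1(x) = 1, phi_2(x) = x
    a, b = np.minimum.outer(x, x), np.maximum.outer(x, x)
    Sigma = a ** 2 * (3.0 * b - a) / 6.0                  # R(x_i, x_j) for m = 2
    top = np.hstack([Sigma + n * lam * C, T])
    bottom = np.hstack([T.T, np.zeros((2, 2))])
    rhs = np.concatenate([y, np.zeros(2)])
    sol = np.linalg.solve(np.vstack([top, bottom]), rhs)
    c, d = sol[:n], sol[n:]
    return T @ d + Sigma @ c                              # fitted values at the design points
```

Passing C = np.eye(n) recovers the ordinary cubic smoothing spline, so the effect of a specified correlation structure can be examined by swapping in, say, an AR(1) correlation matrix.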

So far the smoothing parameter λ has been fixed. Good choices of λ are crucial to the performance of spline estimates. Much research has been devoted to developing data-driven methods for selecting λ when observations are independent. Several

methods have been proposed, and among them the CV (cross-validation), GCV (generalized cross-validation), GML (generalized maximum likelihood) and UBR (unbiased risk) methods are popular choices (Wahba [66]). The CV and GCV methods are well known for their optimal properties (Wahba [66]). GML is very stable and efficient for small to moderate sample sizes. When the dispersion parameter is known, e.g. for binary and Poisson data, the UBR method works better. All these methods tend to underestimate smoothing parameters when data are correlated, for the same reasons discussed in Section 2. As discussed in Section 2, the model is essentially unidentifiable without any restriction on the mean or the correlation function. Several authors have considered assumptions on the correlation function. Diggle and Hutchinson [19] assumed that the random errors are generated by an autoregressive process. In the following we discuss their method in a more general setting, where we assume that C is known up to a set of parameters. Define the degrees of freedom

d = tr(C^{−1/2} A C^{1/2}) = tr(A).

Replacing the function f by its smoothing spline estimate, Diggle and Hutchinson [19] proposed to estimate all the parameters (λ, the correlation parameters and σ²) by minimizing the penalized profile likelihood (we use the negative log likelihood here)

(Y − f̂)^T C^{−1} (Y − f̂)/(2σ²) + (1/2) ln|C| + (n/2) ln σ² + φ(n, d),    (8)

where φ is an increasing penalty function of d. It is easy to see that σ̂² = (Y − f̂)^T C^{−1} (Y − f̂)/n. Thus (8) reduces to minimizing

(n/2) ln((Y − f̂)^T C^{−1} (Y − f̂)) + (1/2) ln|C| + φ(n, d).

Two forms of penalty have been compared in Diggle and Hutchinson [19]:

φ(n, d) = −2n ln(1 − d/n)   and   φ(n, d) = d ln n,

which are analogues of AIC and BIC, respectively. When observations are independent, the first penalty gives a method that approximates GCV, and the second penalty gives a new method which does not reduce to any existing method. Simulation results in Diggle and Hutchinson [19] suggest that the second penalty function works better. However, in their discussion, Diggle and Hutchinson [19] commented that using the second

penalty gives results which are significantly inferior to those obtained by GCV when C is known, which includes independent data as a special case. More research is necessary to establish the properties of this method. Diggle and Hutchinson [19] have developed an efficient O(n) algorithm for the special AR(1) error structure. For independent observations, Wahba [63] showed that a polynomial spline of degree 2m − 1 can be obtained by signal extraction. Denote by W(x) a zero-mean Wiener process. Suppose that f is generated by the stochastic differential equation

d^m f(x)/dx^m = (nλ)^{−1/2} dW(x)/dx    (9)

with initial conditions

z_0 = (f(0), f^(1)(0), ..., f^(m−1)(0))^T ∼ N(0, a I_m).    (10)

Let f̂(x; a) = E(f(x)|Y; a) represent the signal extraction estimate of f. Then

lim_{a→∞} f̂(x; a) equals the smoothing spline estimate. Kohn, Ansley and Wong [42] used the signal extraction approach to derive a method for spline smoothing with autoregressive moving average errors. Assume the observations are equally spaced, x_i = i/n, and consider again model (1). Kohn et al. [42] assumed model (1) with f generated by the stochastic model (9) and (10), and the errors ε_i generated by a discrete-time stationary ARMA(p, q) model

"i = 1"i?1 +    + p"i?p + ei + 1ei?1 +    + qei?q ;

(11)

where the e_i are iid N(0, σ²) and are independent of the Wiener process W(x). Denote z_i = (f(x_i), f^(1)(x_i), ..., f^(m−1)(x_i))^T. The stochastic model (9) and (10) can be written in state space form as

z_i = F_i z_{i−1} + u_i,    i = 1, ..., n,

where F_i is an upper triangular m × m matrix having ones on the diagonal and (j, k)th element (x_i − x_{i−1})^{k−j}/(k−j)! for k > j. The perturbation u_i is normal with mean zero and covariance matrix σ²/(nλ) times the m × m matrix with (j, k)th element (x_i − x_{i−1})^{2m−j−k+1}/[(2m − j − k + 1)(m − k)!(m − j)!]. For the ARMA(p, q) model (11), let m_0 = max(p, q + 1) and let G be the m_0 × m_0 companion matrix whose first column is (φ_1, ..., φ_{m_0})^T (with φ_j = 0 for j > p), whose upper right (m_0 − 1) × (m_0 − 1) block is the identity matrix I_{m_0−1}, and whose remaining entries are zero.

Consider the following state space model

w_i = G w_{i−1} + v_i,    i = 1, ..., n,    (12)

where w_i is an m_0-vector and v_i = (e_i, θ_1 e_i, ..., θ_{m_0−1} e_i)^T, with θ_j = 0 for j > q. Substituting repeatedly from the bottom row of the system, it is easy to see that the first element of w_i is

identical to the ARMA model defined in (11). Therefore the ARMA(p, q) model can be represented in the state space form (12). Combining the two state space representations for the signal and the random error, the original model (1) can be represented by the following state space model:

Y_i = h^T x_i,    x_i = H_i x_{i−1} + a_i,

where

x_i = (z_i^T, w_i^T)^T,    a_i = (u_i^T, v_i^T)^T,    H_i = diag(F_i, G),

and h is an (m + m_0)-vector with 1 in the first and the (m+1)st positions and zeros elsewhere.

Given this state space representation, the filtering and smoothing algorithms can be used to calculate the estimate of the function. Kohn et al. [42] also derived algorithms to calculate the GML and GCV estimates of all the parameters λ, φ_1, ..., φ_p, θ_1, ..., θ_q and σ².
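To make the representation concrete, the sketch below assembles the matrices described above for the cubic-spline signal (m = 2) combined with ARMA noise; it stops short of the Kalman filter and smoother themselves, and the helper names are ours rather than Kohn et al.'s.

```python
from math import factorial
import numpy as np

def signal_transition(delta, m=2):
    """F_i: upper triangular, ones on the diagonal, (j,k) entry delta^(k-j)/(k-j)! for k > j."""
    F = np.eye(m)
    for j in range(m):
        for k in range(j + 1, m):
            F[j, k] = delta ** (k - j) / factorial(k - j)
    return F

def arma_state_space(phi, theta):
    """Companion matrix G and loading vector for v_i = (1, theta_1, ..., theta_{m0-1})^T e_i."""
    p, q = len(phi), len(theta)
    m0 = max(p, q + 1)
    phi_full = np.concatenate([phi, np.zeros(m0 - p)])
    load = np.concatenate([[1.0], theta, np.zeros(m0 - 1 - q)])
    G = np.zeros((m0, m0))
    G[:, 0] = phi_full                     # AR coefficients down the first column
    G[:-1, 1:] = np.eye(m0 - 1)            # shifted identity block
    return G, load

def combined_transition(delta, phi, theta, m=2):
    """Block-diagonal H_i = diag(F_i, G) and the observation vector h."""
    F = signal_transition(delta, m)
    G, load = arma_state_space(phi, theta)
    m0 = G.shape[0]
    H = np.block([[F, np.zeros((m, m0))], [np.zeros((m0, m)), G]])
    h = np.zeros(m + m0)
    h[0], h[m] = 1.0, 1.0                  # picks out f(x_i) and epsilon_i
    return H, h, load

# Example: equally spaced design (delta = 1/n) with AR(1) errors, phi = [0.5], theta = [].
H, h, load = combined_transition(1.0 / 100, [0.5], [])
```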

3.3 Long-range Dependence

In Sections 3.1 and 3.2, we have seen that with short-range dependence among the errors, familiar nonparametric procedures intended for uncorrelated errors behave very poorly. Several methods have been proposed to handle the weak dependence in that setting. When the dependence among the errors is stronger, regression estimation becomes even harder. In this section, we review known theoretical results on the effects of long-range dependence. As in the previous sections, consider stationary errors. In addition, the errors are assumed to be Gaussian here, with a known covariance matrix unless stated otherwise. Let ρ(i) = Corr(ε_j, ε_{j+i}) be the correlation between the errors ε_j and ε_{j+i}. The error process is said to be long-range dependent if, for some c > 0 and 0 < γ < 1, the spectral density of the errors satisfies

H(ω) ∼ cω^{−(1−γ)}   as ω → 0

(see, e.g., Cox [16]). Then ρ(j) is of order |j|^{−γ} (see, e.g., Adenstedt [1]). When the correlation decreases at order |j|^{−γ} for some dependence index γ > 1, the errors are said to have short-range dependence. Minimax risks have been widely considered for evaluating the performance of an estimator of a function assumed to be in a target class. Let F be a nonparametric (infinite-dimensional) class of functions on a compact interval in [0, 1]^d. Let C denote the covariance matrix of the errors. Let ||u − v|| = (∫(u − v)² dx)^{1/2} be the L_2 distance between two functions u and v. A minimax risk for estimating the regression function f under the squared L_2 loss is

R(F, C, n) = min_{f̂} max_{f ∈ F} E ||f − f̂||²,

where f̂ ranges over all estimators based on (X_i, Y_i)_{i=1}^n and the expectation is taken under the true regression function f. The minimax risk measures how well one can estimate f uniformly over the function class F. With knowledge of the minimax risk, one knows how large the sample size should be to achieve a desired level of accuracy for every f ∈ F. Due to the difficulty of evaluating R(F, C, n) in general, its convergence rate is often considered. An estimator with risk converging at the minimax risk rate uniformly over F is said to be minimax-rate optimal. For short-range dependence, it has been shown that the minimax risk rate remains unchanged compared to the case with independent errors (see, e.g., Bierens [10], Collomb and Hardle [15], Johnstone and Silverman [39], Wang [71], Yang [72]). Fundamental differences show up when the errors become long-range dependent. Estimation under long-range dependence has attracted more and more attention in recent years. In many scientific research fields, such as astronomy, physics, geoscience, hydrology and signal processing, the observational errors sometimes reveal long-range dependence. Kunsch et al. [43] wrote: "Perhaps most unbelievable to many is the observation that high-quality measurement series from astronomy, physics, chemistry, generally regarded as prototypes of `i.i.d.' observations, are not independent but long-range correlated." A number of results have been obtained for parametric estimation under long-range dependence and on the estimation of dependence parameters. For a review of work in that direction, see Beran [8], [9]. We focus here on the nonparametric estimation of f. Results on minimax rates of convergence have been obtained for long-range dependent errors with a one-dimensional, equally spaced fixed design in Hall and Hart [29], Wang [71], and Johnstone and Silverman [39] for some classical smoothness function classes.

The model being considered is again model (1), with X_i = (i − 1)/n, Corr(ε_i, ε_j) ≈ c|i − j|^{−γ} for some 0 < γ < 1, and f assumed to be in a Besov class (or a Lipschitz class in Hall and Hart [29]). The minimax rate of convergence under the squared L_2 loss for estimating f is shown to be of order

n^{−2αγ/(2α+γ)},    (13)

where α is the smoothness parameter of the Besov class. In contrast, the rate of convergence under independent or short-range dependent errors is n^{−2α/(2α+1)}. Clearly the long-range dependence has a damaging effect on the convergence rate. Suppose now that the X_i, i ≥ 1, are i.i.d. and independent of the errors. Note that here the errors are serially correlated (the correlation has nothing to do with the values of the X_i's). Given F, the distribution of the X_i, 1 ≤ i ≤ n, and a general dependence among the errors (not necessarily stationary), Yang [72] shows that the minimax risk rate for estimating f is determined by the maximum of two rates: the rate of convergence for the class F under i.i.d. errors and the rate of convergence for the estimation of the mean value of the regression function, μ = E f(X), under the dependence. The first rate is determined by the largeness of the target class F and the second rate by the severity of the dependence among the errors. Thus the difficulty in estimating f is determined by either the massiveness of F or the severity of the dependence among the errors, whichever is worse. As a consequence, the minimax rate may well remain unchanged if the dependence is not severe enough relative to the largeness of the target function class. A similar result was obtained independently by Efromovich [23] for a one-dimensional Lipschitz class, but without requiring normality of the errors. It is also shown in Yang [72] that dependence among the errors in general, as long as it is known, does not hurt prediction of the next response. When the above result of Yang [72] is applied to the case of a Besov class, one obtains the minimax rate of convergence

n^{−min(2α/(2α+1), γ)}.    (14)

For a given long-range dependence of index γ, the rate of convergence gets damaged only when α is relatively large, i.e., α > γ/(2(1 − γ)). Note that the above rate of convergence is always faster than the rate given in (13). At first glance, the results discussed above seem to be contradictory. Experience in the i.i.d. world suggests that the random design and a fixed, equally spaced

design do not differ in terms of rate of convergence for regression estimation. It may then seem clear that one should expect the same rates of convergence under long-range dependence of the same index γ. What is really going on? An explanation of this phenomenon is as follows (see Efromovich [23] and Yang [72]). For the fixed design case, observations with X values close to each other are highly correlated. For this situation, the dependence among the errors can be viewed in terms of locations, i.e., Corr(ε_i, ε_j) ≈ c n^{−γ} |X_i − X_j|^{−γ}. Note that the correlation depends on the locations X_i and X_j and also on the sample size. For the random design case, however, because of the assumption that the correlations among the errors have nothing to do with the locations, when X_i and X_j are nearby, the orders (i and j) of the measurements are not necessarily adjacent but are on average quite far apart, resulting in weaker correlations. In fact, given the locations of the design points X_i, the correlation between the responses associated with X_i and X_j (i ≠ j) is

(1/(n − 1)) Σ_{i=1}^{n−1} ρ(i),

which is of order n^{−γ}. Note that this correlation is the same regardless of the locations of X_i and X_j. Since nearby (in location) observations have a much weaker correlation than in the previous fixed design case, it is no longer surprising that the rate of convergence for the new case is faster. The difference in rates in (13) and (14) may seem to suggest that the random design is better. One could interpret the dependence for the fixed design case in a different way, i.e., the long-range dependent errors are measurement errors independent of the sampling sites, and the observations of the responses are made at the equally spaced grid points in the natural order. In this way, one could conclude that the random design wins. But it is not a completely fair comparison. For the fixed design, an experienced statistician would randomize the order of the observations, as taught in any experimental design course. Then would the rate of convergence still be n^{−2αγ/(2α+γ)}? The answer is not given exactly in the previously mentioned results, but we believe the actual rate is the same as under the random design, i.e., n^{−min(2α/(2α+1), γ)}. Thus the difference in rates of convergence in (13) and (14) is not caused by the difference in design per se.
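A quick simulation makes the "severity of dependence" point tangible. The sketch below generates Gaussian errors with the polynomially decaying correlation ρ(k) = (1 + k)^{−γ} (an illustrative long-range dependent process, not one prescribed by the papers cited above) and checks that the variance of the sample mean decays like n^{−γ} rather than n^{−1}, consistent with the n^{−γ} component of the rates in (13) and (14).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.4
for n in (100, 400, 1600):
    k = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    L = np.linalg.cholesky((1.0 + k) ** (-gamma))          # correlation rho(k) = (1+k)^(-gamma)
    means = [(L @ rng.standard_normal(n)).mean() for _ in range(200)]
    # If Var(sample mean) ~ n^(-gamma), the printed value stays roughly constant in n.
    print(n, round(np.var(means) * n ** gamma, 3))
```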


3.4 Wavelet Estimation

In this subsection, we review wavelet methods under dependence, focusing on the long-range dependence case. For function estimation using wavelets, one first performs a wavelet transformation of the data and then uses a thresholding rule on the coefficients in the wavelet expansion. Nason [49] reported that methods intended for uncorrelated errors do not work well for correlated data. Johnstone and Silverman [39] point out that for independent errors one can use the same threshold for all the coefficients, while for dependent errors the variances of the wavelet coefficients depend on the level but are the same within each level. Accordingly, level-dependent thresholds are proposed. Their procedure is briefly described as follows. Consider the regression model (1) with n = 2^J for some integer J. Let W be a discrete wavelet transform operator (for examples of wavelet bases and fast O(n) algorithms, see, e.g., Donoho and Johnstone [20]). Let w_{j,k}, j = 0, ..., J − 1, k = 1, ..., 2^j, be the wavelet transform of the data Y_1, ..., Y_n. Let Z = Wε be the wavelet transform of the errors. Let λ_j be the threshold to be applied to the coefficients at level j. Then define θ̂ by θ̂_{j,k} = δ(w_{j,k}, σ̂_j λ_j), where δ is a threshold function and σ̂_j is an estimator of the standard deviation of w_{j,k}. The final estimator of the regression function is

f̂ = W^T θ̂.

Earlier work of Donoho and Johnstone (see, e.g., [20]) suggests soft or hard thresholding as follows:

δ_S(w_{j,k}, σ̂_j λ_j) = sgn(w_{j,k}) (|w_{j,k}| − σ̂_j λ_j)_+,
δ_H(w_{j,k}, σ̂_j λ_j) = w_{j,k} I{|w_{j,k}| ≥ σ̂_j λ_j}.

A suggested choice of σ̂_j is σ̂_j = MAD{w_{j,k}, k = 1, ..., 2^j}/0.6745, where MAD denotes the median absolute deviation and the constant 0.6745 is derived to work for Gaussian errors. For the choice of the λ_j's, Johnstone and Silverman [39] suggest a method based on Stein unbiased risk estimation (SURE). For v = (v_1, ..., v_d) and given σ̂, let

Û(t) = σ̂² d + Σ_{k=1}^d [ (v_k² ∧ σ̂²t²) − 2σ̂² I{|v_k| ≤ σ̂ t} ].

Define

t̂(v) = argmin_{0 ≤ t ≤ √(2 log d)} Û(t).

By comparing s_d² = d^{−1} Σ_{k=1}^d v_k² − 1 with a threshold η_d, let

t̃(v) = σ̂ √(2 log d) if s_d² ≤ η_d,   and   t̃(v) = t̂(v) if s_d² > η_d.

Now take

λ_j = 0 for the coarse levels j ≤ L and λ_j = t̃(w_{j,·}/σ̂_j) for j > L, where L is a fixed cutoff level. From Johnstone and Silverman [39], the regression estimator f̂ produced by the above procedure converges at the minimax rate of convergence simultaneously over the Besov classes B^α_{p,q}(C) for (p, q) ∈ [1, ∞)², C ∈ (0, ∞) and α (a smoothness parameter) sufficiently large (depending on the dependence index and some other constants involved in the procedure). Note that the adaptation is over both the regression function classes and the dependence parameter (see the next subsection for a review of adaptive estimation in nonparametric regression). For certain Lipschitz classes (more restrictive than the Besov classes), Efromovich [23] studied a trigonometric expansion estimator with the order adaptively selected based on the data. The resulting estimator of the regression function (and also of its derivatives) is shown to be minimax-rate optimal without knowing the number of derivatives the true regression function has.
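The level-dependent idea is easy to prototype. The sketch below is a simplified numpy stand-in for the procedure above: it uses a Haar transform, the MAD-based scale estimate σ̂_j at each level, and the universal threshold σ̂_j √(2 log n) in place of the SURE choice; it also thresholds every detail level rather than only levels above a cutoff. All of these simplifications are ours.

```python
import numpy as np

def haar_dwt(y):
    """Orthonormal Haar transform: detail coefficients (finest level first) plus the scaling coefficient."""
    details, approx = [], y.astype(float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    return details, approx

def haar_idwt(details, approx):
    for d in reversed(details):
        even, odd = (approx + d) / np.sqrt(2), (approx - d) / np.sqrt(2)
        approx = np.empty(2 * len(d))
        approx[0::2], approx[1::2] = even, odd
    return approx

def levelwise_soft_threshold(y):
    """Soft-threshold each level with its own MAD-based scale estimate (n must be a power of 2)."""
    n = len(y)
    details, approx = haar_dwt(y)
    shrunk = []
    for d in details:
        sigma_j = np.median(np.abs(d)) / 0.6745      # level-dependent scale estimate
        lam_j = sigma_j * np.sqrt(2 * np.log(n))     # universal threshold (SURE stand-in)
        shrunk.append(np.sign(d) * np.maximum(np.abs(d) - lam_j, 0.0))
    return haar_idwt(shrunk, approx)
```

Under positive long-range dependence the estimated σ̂_j tend to be larger at the coarser levels, so those coefficients are shrunk less aggressively than a single global threshold would allow, which is precisely the point of the level-dependent scheme.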

3.5 Adaptive Estimation

Many nonparametric procedures have tuning parameters, e.g., the bandwidth h for local polynomial regression and the smoothing parameter λ for the smoothing spline, as considered earlier. The optimal choices of such tuning parameters in general depend on certain characteristics of the unknown regression function and are therefore unknown to us. Various methods (e.g., AIC, BIC, cross-validation and other related model selection criteria) have been proposed to choose a tuning parameter automatically based on the data, so that the final estimator performs as well as (or nearly as well as) the estimator based on the optimal choice. This is the task of adaptive

estimation. In this subsection, we briefly review the ideas of adaptive estimation in the context of nonparametric regression. This will provide background for some of the results in the next section. Basically, there are two types of results on adaptive estimation: adaptive estimation with respect to a collection of estimation procedures (procedure-wise adaptation), and adaptive estimation with respect to a collection of target classes (target-oriented adaptation). For the first case, one is given a collection of procedures and the goal of adaptation is to obtain a final procedure performing close to the best one in the collection. For example, the collection might be a kernel procedure with all valid choices of the bandwidth h. Another collection may be a list of wavelet estimators based on different choices of wavelet bases. In general, one may have completely different estimation procedures in the collection for more flexibility, which may be desirable for many applications where it is rather unclear beforehand which procedures are appropriate. For instance, in the case of high-dimensional regression estimation, one faces the so-called curse of dimensionality, in the sense that traditional function estimation methods such as histograms, kernel methods and series expansions have exponentially many parameters to be estimated, which cannot be done accurately based on a moderate sample size. For this situation, a solution is to try different parsimonious models. Since one does not know which parsimonious characterization works best for the underlying unknown regression function, adaptation capability is desired. For the second type of adaptation, one is given a collection of target function classes, i.e., the true unknown function is assumed to be in one of the classes (without knowing which one it is). A goal of adaptation is to have an estimator with the capability to perform optimally simultaneously over the target classes, i.e., the estimator automatically converges at the minimax rate of the class that contains the true function. The two types of adaptation are closely related. If a collection of procedures is designed optimally for a collection of function classes, respectively, then procedure-wise adaptation implies target-oriented adaptation. It is in this sense that procedure-wise adaptation is more general than target-oriented adaptation. In applications, one may encounter a mixture of both types of adaptation at the same time. On one hand, there may be several plausible procedures to try, and on the other hand, there may be a few plausible characteristics (e.g., monotonicity, additivity) of the regression function to explore. For this situation, one may derive optimal (or at least reasonably good) estimators for each

characteristic respectively and then add them to the original collection of procedures. Adaptation with respect to the collection of procedures is then the desired property. A large number of results have been obtained on adaptive estimation. Examples of adaptation within a specific type of procedure or a specific type of class include Craven and Wahba [17], Efromovich and Pinsker [24], Efromovich [22], Hardle and Marron [33], Lepski [45], Golubev and Nussbaum [27], Donoho and Johnstone (e.g., [20]), Mammen and van de Geer [48], Goldenshluger and Nemirovski [26], Lepski et al. [46], Devroye and Lugosi [18], and others. General schemes have also been proposed for the construction of adaptive estimators by model selection in Barron and Cover [7], Birge and Massart [11], Barron, Birge and Massart [6], Yang and Barron [75], Yang [74], Lugosi and Nobel [47] and others. More recently, for nonparametric regression with i.i.d. Gaussian errors, general positive results on adaptive estimation were obtained in Yang [73]. It is shown that, under very mild conditions, for any collection of uniformly bounded function classes, minimax-rate adaptive estimators exist. More generally, for any given collection of regression procedures, a single adaptive procedure can be constructed by combining them. The new procedure pays only a small price for adaptation, i.e., the risk of the combined procedure is bounded above by the risk of each procedure plus a small penalty term, which is asymptotically negligible for probably most of the interesting applications. A result on multi-dimensional adaptive estimation under long-range dependence will be given in Section 4.3. For nonparametric regression with dependence, adaptation with respect both to the unknown characteristics of the regression function and to the unknown dependence is of interest. A success in this direction is the wavelet estimator based on SURE proposed by Johnstone and Silverman [39], as reviewed in Section 3.4.

4 New Developments

4.1 Kernel Regression Extensions

Opsomer [52] outlines some parts of recent research that extends existing methodological results for kernel-based regression estimators under short-range dependence in several directions. The approach is fully nonparametric, uses local linear regression and implements a plug-in bandwidth estimator. The range of applications is extended to include random designs, univariate and bivariate observations, and additive

models with correlated errors. We will review some of the main findings here; full details and proofs are available in Opsomer [51]. The data are generated by model (1), where (X_1, Y_1), ..., (X_n, Y_n) are a set of random variables in IR^{D+1}. The model errors ε_i are again assumed to have the moment properties (2). In this general setting, we refer to model (1) as the general (G) model. We also consider two special cases: the simple (S1) model, where the X_i, i = 1, ..., n, are restricted to be univariate as in Section 3.1, and the bivariate additive (A2) model, in which f(·) is assumed to be additive, i.e.

f(X_i) = μ + f_1(X_{1i}) + f_2(X_{2i}).

We define c_n(x) = nE(ρ_n(X_i − x)), and let X = [X_1, ..., X_n]^T and X_k = (X_{k1}, ..., X_{kn})^T for k = 1, ..., D. Let 𝒳 = [0, 1]^D represent the support of X_i and g its density function, with g_k the marginal density corresponding to X_{ki} for k = 1, ..., D. We write D_k^r for the rth derivative operator with respect to the kth covariate. As in Section 3.1, let K represent a univariate kernel function. In order to simplify notation for model G, we restrict our attention to (tensor) product kernels K(u_1)···K(u_D) in that case, with corresponding bandwidth matrix H = diag{h_1, ..., h_D}. For model A2, we need some additional notation. Let T_{12} represent the n × n matrix whose ijth element is

[T_{12}]_{ij} = (1/n) ( g(X_{1i}, X_{2j}) / (g_1(X_{1i}) g_2(X_{2j})) − 1 ),

and let t_i^T and v_j represent the ith row and jth column of (I − T_{12})^{−1}, respectively. Let D²f_1 denote the n-vector with elements d²f_1(X_{1i})/dx_1², and let E(f_1''(X_{1i})|X_2) denote the n-vector whose ith element is E(f_1''(X_{1i})|X_{2i}),

and analogously for D²f_2 and E(f_2''(X_{2i})|X_1). The local linear estimator of f(·) at a point x for model G (and, with the obvious changes, model S1) is defined as f̂(x) = s_x^T Y, with the smoother vector s_x^T defined as

s_x^T = e_1^T (X_x^T W_x X_x)^{−1} X_x^T W_x,    (15)

with e_1 a vector with 1 in its first position and 0s elsewhere, the weight matrix

W_x = |H|^{−1} diag{ K(H^{−1}(X_1 − x)), ..., K(H^{−1}(X_n − x)) },

and X_x the n × (D+1) matrix whose ith row is (1, (X_i − x)^T).
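For the univariate model S1, the smoother vector in (15) reduces to a 2 × 2 weighted least squares problem at each point. The numpy sketch below evaluates it directly from that definition; the Epanechnikov kernel is an illustrative choice, and the bandwidth must be large enough that at least two design points receive positive weight.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear fit at x0: e_1^T (X^T W X)^{-1} X^T W Y (univariate version of (15))."""
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)    # Epanechnikov kernel weights
    Xd = np.column_stack([np.ones_like(x), x - x0])           # rows (1, X_i - x0)
    XtW = Xd.T * w                                            # X^T W
    beta = np.linalg.solve(XtW @ Xd, XtW @ y)
    return beta[0]                                            # intercept = f_hat(x0)
```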

For model A2, the estimator of f(·) at an observation point X_i can also be written as a linear smoother w_{X_i}^T Y, but the expression is much more complicated and is rarely used directly to compute the estimators. We assume here that the conditions guaranteeing convergence of the backfitting algorithm, further described in Opsomer and Ruppert [53], are met. These conditions do not depend on the correlation structure of the errors. We make the following assumptions:

(AS.I) the kernel K is compactly supported, bounded and continuous, and μ_r(K) = 0 for r odd,

(AS.II) the density g is compactly supported, bounded and continuous, and g(x) > 0 for all x ∈ 𝒳,

(AS.III) the 2nd derivative(s) of f(·) are bounded and continuous, and

(AS.IV) as n → ∞, H → 0 and n|H| → ∞.

In addition, we assume that the correlation function ρ_n is an element of a sequence {ρ_n} with the following properties:

(P.1) ρ_n is differentiable, ∫ n|ρ_n(t − x)| dt = O(1) and ∫ n ρ_n(t − x)² dt = o(1) for all x,

(P.2) there exists δ > 0 such that ∫ |ρ_n(t)| I(||H^{−1}t|| > δ) dt = o(∫ |ρ_n(t)| dt).

These properties require the effect of ρ_n to be "short-range" (relative to the bandwidth), but allow its functional form to be otherwise unspecified. They are generalizations of assumptions (R.I) and (R.II) in Section 3.1 to the random, multivariate design. As noted in Section 3.1, the conditional bias of f̂(X_i) is not affected by the presence of correlation in the errors. We therefore refer to Ruppert and Wand [59] for the asymptotic bias of models G and S1, and to Opsomer and Ruppert [53] for that of model A2. We construct asymptotic approximations to the conditional variance of f̂(X_i) and to the conditional Mean Average Squared Error (MASE) of f̂, defined in (4) for the fixed design case.

Theorem 4.1 The conditional variance of f̂(X_i) for models G and S1 is

Var(f̂(X_i)|X) = (σ² μ_0(K²)^D / (n|H|)) (1 + c_n(X_i)) / g(X_i) + o_p(1/(n|H|)).

For model A2,

Var(f̂(X_i)|X) = σ² μ_0(K²) [ g_1(X_{1i})^{−1}(1 + E(c_n(X_i)|X_{1i}))/(nh_1) + g_2(X_{2i})^{−1}(1 + E(c_n(X_i)|X_{2i}))/(nh_2) ] + o_p(1/(nh_1) + 1/(nh_2)).

We let

R_n = n ∫_{−1/2}^{1/2} ρ_n(t) dt

and define IC_n = σ²(1 + R_n), the integrated covariance function. We also define the 2nd derivative regression functionals θ_{22}(k, l) = E(D_k² f(X_i) D_l² f(X_i)) with k, l = 1, ..., D for models G and S1, and

θ_{22}(1, 1) = (1/n) Σ_{i=1}^n ( t_i^T D²f_1 − v_i^T E(f_1''(X_{1i})|X_2) )²,
θ_{22}(2, 2) = (1/n) Σ_{i=1}^n ( v_i^T D²f_2 − t_i^T E(f_2''(X_{2i})|X_1) )²,
θ_{22}(1, 2) = (1/n) Σ_{i=1}^n ( t_i^T D²f_1 − v_i^T E(f_1''(X_{1i})|X_2) ) ( v_i^T D²f_2 − t_i^T E(f_2''(X_{2i})|X_1) ),

for model A2. Because of assumptions AS.I-AS.IV, these quantities are well defined and bounded, as long as the additive model has a unique solution.

!2 D D  2 (K ) X X 2 2 hk hl 22(k; l) MASE (H jX ) = 2 k=1 l=1 D 1  (K 2)D IC + o (X 4 + 1 ): h + njH 0 n p k njH j j k=1

For model A2,

!2 D D  2 (K ) X X 2 2 hk hl 22(k; l)+ MASE(H jX ) = 2 k=1 l=1 ! 1 + E( g 1 (X1i )?1 cn (X i )) 1 + E(g2 (X2i )?1 cn (X i )) 2 R(K ) + nh1 nh1 +op (h41 + h42 + nh1 + nh1 ): 1

2

23

In this more general setting, the optimal rates of convergence for the bandwidths are again the same as those found for the independent errors case: hk = Op(n?1=(4+D)) for models G and S1, and model A2 achieving the same rate as model S1. Similarly, if Rn > 0 for those models, the optimal bandwidths are larger than those for the uncorrelated case. For model A2, note that independent errors imply that cn(X i) = 0, so that the result in Theorem 4.2 reduces to the MASE approximation derived in Opsomer and Ruppert [53] for the independent errors case. An interesting aspect of Theorem 4.2 is that the presence of correlated errors induces an adjustment in the MASE approximations for models G and S, that does not depend on the distribution of the observations, even though the variance adjustment cn(X i) in Theorem 4.1 does. It is therefore easy to show that the MASE approximations for models G and S in Theorem 4.2 are valid for both random and xed designs. This nding does not hold for the additive model. For the case D = 1, Opsomer [51] develops a plug-in bandwidth selection method that generalizes the Direct Plug-In (DPI) bandwidth selection proposed by Ruppert et al. [58] for the independent error case and extended to the additive model by Opsomer and Ruppert [54]. The method described here is therefore referred to as Correlation DPI (CDPI), and was used as the correlation-adjusted bandwidth selection method in Figure 1. The estimation of the 22(k; l) in CDPI is analogous to that in DPI. Opsomer [51] shows that the term ICn is approximately equal to the frequency at 0 of the covariance function of suitably scaled, binned errors (the binning is used to achieve an equally spaced set of \observations"). Hence, ICn is estimated in the frequency domain by periodogram smoothing (Priestley [56]). CDPI behaves very much like DPI when the data are uncorrelated, but at least partly o sets the e ect of the correlation when it is present. To illustrate this point, we again use the Drum Roller data introduced in Section 3.1. Figure 3 shows four ts to the data using both DPI (dash-dotted lines) and CDPI (solid lines): n = 1150 represents the full dataset, n = 575 uses every other observation, n= 230 every 5th and n = 115 every 10th. The remaining observations being located at increasing distance, it can be expected that the correlation should decrease with decreasing sample size. The plots in Figure 3 indeed exhibit this behavior, with the DPI ts nearly coinciding with the CDPI ones for the two smaller sample sizes. For the two larger sample sizes, CDPI manages to display an approximately \correct" shape for the mean function, while DPI results in severely undersmoothed estimate.

24


Figure 3: CDPI (solid) and DPI (dash-dotted) fits to the Drum Roller data for four different sample sizes.
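The frequency-domain step of CDPI can be illustrated in a simplified form. The sketch below estimates the variance factor σ²(1 + 2R) = 2πH(0) from Section 3.1 by averaging the periodogram of residuals over the lowest Fourier frequencies; it assumes equally spaced data (so no binning is needed), a pilot fit supplied by the user, and an arbitrary number of ordinates to average, and it is not Opsomer's exact implementation.

```python
import numpy as np

def integrated_covariance(resid, n_freq=10):
    """Estimate sigma^2*(1+2R) = 2*pi*H(0) by averaging the residual periodogram
    over the n_freq lowest nonzero Fourier frequencies."""
    n = len(resid)
    r = resid - resid.mean()
    periodogram = np.abs(np.fft.rfft(r)) ** 2 / n      # I(omega_k), k = 0, 1, ...
    return periodogram[1:n_freq + 1].mean()            # skip k = 0 (mean already removed)
```

Because the residuals come from a pilot fit, some genuine low-frequency signal may already have been absorbed into that fit; this is the same caveat raised in Section 2 and one reason bandwidth selection under correlation remains delicate.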

4.2 Smoothing Spline ANOVA Models

All the methods we reviewed in Section 3.2 are developed for polynomial splines with special error structures. Some even require the design points to be equally spaced. Their applications are limited to time series. In many applications, the data and/or the error structure are more complicated. Interesting examples are spatial, longitudinal and spatio-temporal data. The mean function of these kinds of data can be modeled in a unified fashion using the general spline models and the smoothing spline ANOVA models defined on arbitrary domains (Wahba [66]). However, previous research on the general spline models and the smoothing spline ANOVA models assumed that the observations are independent. When data are correlated, which often is the case for spatial and longitudinal data, conventional methods for selecting smoothing

parameters for these models face the same problems as illustrated in Section 2. Our goal in this section is to present extensions of the GML, GCV and UBR methods for smoothing spline ANOVA (SS ANOVA) models when observations are correlated. In model (1), suppose now that the domain 𝒳 = 𝒳_1 × ··· × 𝒳_d, where the 𝒳_i are measurable spaces of rather general form. We assume that f belongs to a subspace of tensor products of reproducing kernel Hilbert spaces (RKHS). More precisely, the model space H of an SS ANOVA model contains elements

f(x) = C + Σ_{j ∈ J_1} f_j(x_j) + Σ_{(j_1, j_2) ∈ J_2} f_{j_1, j_2}(x_{j_1}, x_{j_2}) + ··· + Σ_{(j_1, ..., j_d) ∈ J_d} f_{j_1, ..., j_d}(x_{j_1}, ..., x_{j_d}),    (16)

where x = (x_1, ..., x_d) ∈ 𝒳, x_k ∈ 𝒳_k, and J_k is a subset of the set of all k-tuples {(j_1, ..., j_k) : 1 ≤ j_1 < ··· < j_k ≤ d} for k = 1, ..., d. Identifiability conditions are imposed such that each term in the sums integrates to zero with respect to any one of its arguments. Each term in the first sum is called a main effect, each term in the second sum is called a two-factor interaction, and so on. As in the analysis of variance, higher-order interactions are often eliminated from the model space to relieve the curse of dimensionality. See Aronszajn [4] for details about RKHS, and Wahba [66], Gu and Wahba [28] and Wahba, Wang, Gu, Klein and Klein [68] for details about SS ANOVA models. After a subspace is chosen as the model space, we can regroup and write it in the form

H = H_0 ⊕ H_1 ⊕ ··· ⊕ H_p,

where H_0 is a finite dimensional space containing functions which are not going to be penalized, and H_1, ..., H_p are subspaces which contain the "smooth" elements in the decomposition (16). Again, suppose that C is known up to a set of parameters. No specific structure is assumed for C; it is therefore not limited to the autoregressive or any other special type of error structure. In practice, if the error structure is unknown, different structures may be fitted in order to select a final model. We do not discuss how to model the correlation functions in this paper; see Wang [69] for an example. To illustrate potential applications of the SS ANOVA models with correlated errors, consider spatio-temporal data. Denote by 𝒳_1 = [0, 1] the time domain and by 𝒳_2 = R² the spatial domain (latitude and longitude). Polynomial splines are often used to model temporal data, and thin plate splines are often used to model spatial

data. Thus the tensor product of the two corresponding RKHS's can be used to model the mean function of spatio-temporal data (Gu and Wahba [28]). Components in model (16) can be interpreted as spatio-temporal main effects and interactions. An autoregressive structure may be used to model possible temporal correlation, and exponential structures may be used to model spatial correlation. Both correlations may appear in the covariance matrix C. A direct generalization of the penalized weighted least squares criterion (6) is

[...]

For α > 0, let B^α_{p,q}(C) denote all functions g ∈ L_p[0, 1]^r such that the Besov norm satisfies ||g||_{B^α_{p,q}} ≤ C (see, e.g., Triebel [62]). Besov and Triebel's F spaces contain a rich collection of function spaces, including Hölder-Zygmund spaces, Sobolev spaces, fractional Sobolev spaces, and inhomogeneous Hardy spaces. The richness of these classes provides a lot of desirable flexibility (e.g., spatial inhomogeneity) for statistical curve estimation, as illustrated in Donoho et al. [21], Donoho and Johnstone [20] and others. Consider the following function classes on [0, 1]^d:

S^{α,1}_{p,q}(C) = { Σ_{i=1}^d g_i(x_i) : g_i ∈ B^α_{p,q}(C), 1 ≤ i ≤ d },

S^{α,2}_{p,q}(C) = { Σ_{1 ≤ i < j ≤ d} g_{i,j}(x_i, x_j) : g_{i,j} ∈ B^α_{p,q}(C) },

and similarly S^{α,r}_{p,q}(C) for sums of components of interaction order r.

[...] α > d/p.

2. Without knowing the hyperparameters α, p, q and the interaction order r, one can construct a minimax-rate adaptive estimator. That is, a single estimator f̂ can be constructed such that

max_{f ∈ S^{α,r}_{p,q}(C)} E ||f − f̂||² = O( n^{−min(2α/(2α+r), γ)} )

automatically for all 1 ≤ r ≤ d, 1 ≤ p, q ≤ ∞ and α > d/p.


The first result of the theorem suggests the advantage of additive or low interaction order models. For a fixed set of hyperparameters for the Besov classes, if one knew the true interaction order r, and if r < d, then a faster convergence rate can be achieved compared to an estimator obtained assuming the highest interaction order (i.e., r = d). The improvement in the rate of convergence is substantial when r is small compared to d. For similar results on the effects of dimension reduction with independent errors, see Stone [60], Nicoleris and Yatracos [50] and Yang [74]. For applications, since the smoothness parameter and the interaction order are unknown, adaptation over them is desired. The second result of the above theorem guarantees the existence of such adaptive estimators. General adaptive estimation with unknown dependence is currently being investigated by one of the authors of this paper. For the Besov classes, as studied in Wang [71] and Johnstone and Silverman [39] for the one-dimensional case, wavelet estimators are natural candidates. It seems reasonable to expect that, with proper thresholding and a method to select the interaction order r, tensor-product wavelet estimators will have the desired adaptation capability.

References

[1] R. K. Adenstedt. On large sample estimation for the mean of a stationary sequence. Annals of Statistics, 2:1095-1107, 1974.
[2] N. S. Altman. Kernel smoothing of data with correlated errors. Journal of the American Statistical Association, 85:749-759, 1990.
[3] N. S. Altman. Krige, smooth, both or neither? Technical report, Biometrics Unit, Cornell University, Ithaca, NY, September 1994.
[4] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
[5] A. R. Barron and R. L. Barron. Statistical learning networks: a unifying view. In Computer Science and Statistics: Proceedings of the 21st Interface, 1988.
[6] A. R. Barron, L. Birge, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, to appear, 1996.
[7] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034-1054, 1991.

[8] J. Beran. Statistical methods for data with long-range dependence. Statistical Science, 7:404–416, 1992.
[9] J. Beran. Statistics for Long-Memory Processes. Chapman and Hall, New York, 1994.
[10] H. J. Bierens. Uniform consistency of kernel estimators of a regression function under generalized conditions. Journal of the American Statistical Association, 78:699–707, 1983.
[11] L. Birgé and P. Massart. From model selection to adaptive estimation. In D. Pollard, E. Torgerson, and G. Yang, editors, Research Papers in Probability and Statistics: Festschrift for Lucien Le Cam. Springer-Verlag, 1996.
[12] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[13] S.-T. Chiu. Bandwidth selection for kernel estimate with correlated noise. Statistics and Probability Letters, 8:347–354, 1989.
[14] C.-K. Chu and J. S. Marron. Comparison of two bandwidth selectors with dependent errors. Annals of Statistics, 19:1906–1918, 1991.
[15] G. Collomb and W. Härdle. Strong uniform convergence rates in robust nonparametric time series analysis. Stochastic Processes and their Applications, 23:77–89, 1986.
[16] D. R. Cox. Long-range dependence: a review. In H. A. David and H. T. David, editors, Statistics: An Appraisal. Proceedings 50th Anniversary Conference, pages 55–74, Ames, IA, 1984. Iowa State University Press.
[17] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31:377–403, 1979.
[18] L. P. Devroye and G. Lugosi. Nonparametric universal smoothing factors, kernel complexity, and Yatracos classes. Annals of Statistics, 25:2626–2637, 1997.
[19] P. J. Diggle and M. F. Hutchinson. On spline smoothing with autocorrelated errors. The Australian Journal of Statistics, 31:166–182, 1989.
[20] D. L. Donoho and I. M. Johnstone. Minimax estimation via wavelet shrinkage. Annals of Statistics, 26:879–921, 1998.
[21] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia? (with discussion). Journal of the Royal Statistical Society, Series B, 57:301–370, 1995.
[22] S. Y. Efromovich. Nonparametric estimation of a density of unknown smoothness. Theory Probab. Appl., 30:557–568, 1985.
[23] S. Y. Efromovich. How to overcome curse of long-memory errors in nonparametric regression. Manuscript, 1997.
[24] S. Y. Efromovich and M. S. Pinsker. A self-educating nonparametric filtration algorithm. Automation and Remote Control, 45:58–65, 1984.
[25] J. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76:817–823, 1981.
[26] A. Goldenschluger and A. Nemirovski. Adaptive de-noising of signals satisfying differential inequalities. IEEE Transactions on Information Theory, 43:872–889, 1997.
[27] G. K. Golubev and M. Nussbaum. Adaptive spline estimates for nonparametric regression models. Theory Probab. Appl., 37:521–529, 1993.
[28] C. Gu and G. Wahba. Semiparametric ANOVA with tensor product thin plate spline. Journal of the Royal Statistical Society, Series B, 55:353–368, 1993.
[29] P. Hall and J. D. Hart. Nonparametric regression with long-range dependence. Stochastic Processes and their Applications, 36:339–351, 1990.
[30] P. Hall, S. N. Lahiri, and J. Polzehl. On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Annals of Statistics, 23:1921–1936, 1995.
[31] W. Härdle, P. Hall, and J. S. Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83:86–95, 1988.
[32] W. Härdle, P. Hall, and J. S. Marron. Regression smoothing parameters that are not far from their optimum. Journal of the American Statistical Association, 87:227–233, 1992.
[33] W. Härdle and J. S. Marron. Optimal bandwidth selection in nonparametric regression function estimation. Annals of Statistics, 13:1465–1481, 1985.
[34] J. D. Hart. Kernel regression estimation with time series errors. Journal of the Royal Statistical Society, Series B, 53:173–187, 1991.
[35] J. D. Hart. Automated kernel smoothing of dependent data by using time series cross-validation. Journal of the Royal Statistical Society, Series B, 56:529–542, 1994.
[36] D. Harville. Extension of the Gauss-Markov theorem to include the estimation of random effects. Annals of Statistics, 4:384–395, 1976.
[37] E. Herrmann, T. Gasser, and A. Kneip. Choice of bandwidth for kernel regression when residuals are correlated. Biometrika, 79:783–795, 1992.
[38] C. M. Hurvich and S. L. Zeger. A frequency domain selection criterion for regression with autocorrelated errors. Journal of the American Statistical Association, 85:705–714, 1990.
[39] I. M. Johnstone and B. W. Silverman. Wavelet threshold estimators for data with correlated noise. Journal of the Royal Statistical Society, Series B, 59:319–351, 1997.
[40] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82–94, 1971.
[41] R. Kohn, C. F. Ansley, and D. Tharm. The performance of cross-validation and maximum likelihood estimators of spline smoothing parameters. Journal of the American Statistical Association, 86:1042–1050, 1991.
[42] R. Kohn, C. F. Ansley, and C. Wong. Nonparametric spline regression with autoregressive moving average errors. Biometrika, 79:335–346, 1992.
[43] H. Künsch, J. Beran, and F. Hampel. Contrasts under long-range correlations. Annals of Statistics, 21:943–964, 1993.
[44] G. M. Laslett. Kriging and splines: an empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89:392–409, 1994.
[45] O. V. Lepski. Asymptotically minimax adaptive estimation I: Upper bounds. Optimally adaptive estimates. Theory Probab. Appl., 36:682–697, 1991.
[46] O. V. Lepski, E. Mammen, and V. G. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. Annals of Statistics, 25:929–947, 1997.
[47] G. Lugosi and A. Nobel. Adaptive model selection using empirical complexities. Submitted to Annals of Statistics, 1996.
[48] E. Mammen and S. van de Geer. Locally adaptive regression splines. Annals of Statistics, 25:387–413, 1997.
[49] G. P. Nason. Wavelet shrinkage using cross-validation. Journal of the Royal Statistical Society, Series B, 58:463–479, 1996.
[50] T. Nicoleris and Y. G. Yatracos. Rates of convergence of estimates, Kolmogorov's entropy and the dimensionality reduction principle in regression. Annals of Statistics, 25:2493–2511, 1997.
[51] J.-D. Opsomer. Estimating a function by local linear regression when the errors are correlated. Preprint 95-42, Department of Statistics, Iowa State University, 13 December 1995.
[52] J.-D. Opsomer. Nonparametric regression in the presence of correlated errors. In Modelling Longitudinal and Spatially Correlated Data: Methods, Applications and Future Directions, pages 339–348. Springer-Verlag, 1997.
[53] J.-D. Opsomer and D. Ruppert. Fitting a bivariate additive model by local polynomial regression. Annals of Statistics, 25:186–211, 1997.
[54] J.-D. Opsomer and D. Ruppert. A fully automated bandwidth selection method for fitting additive models by local polynomial regression. Journal of the American Statistical Association, 93:605–619, 1998.
[55] J.-D. Opsomer, D. Ruppert, M. P. Wand, U. Holst, and O. Hössjer. Kriging with nonparametric variance function estimation. Working paper, Center for Agricultural and Rural Development, Iowa State University, 1997. Revised for Biometrics.
[56] M. B. Priestley and M. T. Chao. Nonparametric function fitting. Journal of the Royal Statistical Society, Series B, 34:385–392, 1972.
[57] D. Ruppert. Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. Journal of the American Statistical Association, 92:1049–1062, 1997.
[58] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90:1257–1270, 1995.
[59] D. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. Annals of Statistics, 22:1346–1370, 1994.
[60] C. J. Stone. The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22:118–184, 1994.
[61] C. J. Stone, M. H. Hansen, C. Kooperberg, and Y. Truong. Polynomial splines and their tensor products in extended linear modeling (with discussion). Annals of Statistics, 25:1371–1470, 1997.
[62] H. Triebel. Theory of Function Spaces II. Birkhäuser, Boston, 1992.
[63] G. Wahba. Improper priors, spline smoothing, and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society, Series B, 40:364–372, 1978.
[64] G. Wahba. Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B, 45:133–150, 1983.
[65] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameters in the generalized spline smoothing problem. Annals of Statistics, 13:1378–1402, 1985.
[66] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 59.
[67] G. Wahba and Y. Wang. Behavior near zero of the distribution of GCV smoothing parameter estimates for splines. Statistics and Probability Letters, 25:105–111, 1993.
[68] G. Wahba, Y. Wang, C. Gu, R. Klein, and B. Klein. Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Annals of Statistics, 23:1865–1895, 1995.
[69] Y. Wang. Mixed-effects smoothing spline ANOVA. Journal of the Royal Statistical Society, Series B, 60:159–174, 1998.
[70] Y. Wang. Smoothing spline models with correlated random errors. Journal of the American Statistical Association, 93:341–348, 1998.
[71] Yazhen Wang. Function estimation via wavelet shrinkage for long-memory data. Annals of Statistics, 24:466–484, 1996.
[72] Y. Yang. Nonparametric regression with dependent errors. Preprint 97-29, Department of Statistics, Iowa State University, 1997.
[73] Y. Yang. Combining different procedures for adaptive regression. Technical Report 98-11, Department of Statistics, Iowa State University, 1998.
[74] Y. Yang. Model selection for nonparametric regression. Accepted by Statistica Sinica, 1998.
[75] Y. Yang and A. R. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44:95–116, 1998.