Double Smoothing Robust Estimators in Nonparametric Regression

Sankhyā: The Indian Journal of Statistics
2009, Volume 71-A, Part 2, pp. 298-330
© 2009, Indian Statistical Institute

Ruey-Ching Hwang, National Dong Hwa University, Hualien, Taiwan
Zong-Huei Lin, Taiwan Hospitality & Tourism College, Hualien, Taiwan
C.K. Chu, National Dong Hwa University, Hualien, Taiwan

Abstract. For nonparametric regression, a double smoothing robust procedure is proposed to estimate the regression function. It is constructed by applying the local constant M-estimator to the pseudo data. The pseudo data are generated using local linear fits from every pair of observations in the smoothing window. The asymptotic behavior of the proposed estimator is studied. In practice, the proposed estimator is implemented by the Newton method together with the median of the pseudo data as the initial guess. Real data examples and simulations are given to illustrate our method. The results show that our strategy is a useful alternative for estimating the regression function.

AMS (2000) subject classification. Primary 62G05; Secondary 62G20.
Keywords and phrases. Double smoothing robust estimator, local constant M-estimator, local linear M-estimator, Newton method, nonparametric regression.

1 Introduction

Nonparametric regression, by smoothing methods, has been well established as a useful data-analytic tool. In this paper, the equally spaced fixed design nonparametric regression model is considered. The regression model is given by

Y_i = m(x_i) + ε_i,   (1.1)

for i = 1, ..., n. Here m is an unknown but smooth regression function defined on the closed interval [0, 1], x_i are equally spaced fixed design points,


that is, x_i = i/n, ε_i are regression errors with mean 0, and Y_i are noisy observations of m at x_i. We wish to estimate m(x), for x ∈ [0, 1]. To estimate m(x), the local linear estimator (LLE) is one of the most popular smoothers, due to its computational simplicity and nice asymptotic properties. Given the kernel function K as a probability density function supported on the interval [−1, 1], and the bandwidth h = h_n tending to 0 as n → ∞, it is constructed by minimizing the local weighted sum of squares

S_LLE(a, b; x) = ∑_{i=1}^{n} (Y_i − a − b z_i)^2 K(z_i),   (1.2)

where z_i = (x − x_i)/h. The kernel function K is used to compute the weights assigned to the observations, and the value of the bandwidth h stands for the width of the neighborhood in which averaging is performed. By the Taylor theorem and conditions (C1)-(C5) given in Section 4 of this paper, asymptotically, S_LLE(a, b; x) has a unique global minimizer close to the location of {m(x), (−h)m^{(1)}(x)}. Let (â_LLE, b̂_LLE) be the global minimizer of S_LLE(a, b; x). Thus the LLE for m(x) is taken as m̂_LLE(x) = â_LLE.
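As a concrete illustration of (1.2), the following minimal Python sketch computes the LLE at one point by weighted least squares; the function names and the use of NumPy are our own choices, not part of the paper.

```python
import numpy as np

def epanechnikov(z):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - z**2) * (np.abs(z) <= 1.0)

def lle(x, xs, ys, h, kernel=epanechnikov):
    """Local linear estimator: minimize the weighted sum of squares (1.2).

    Returns (a_hat, b_hat); the regression estimate m_hat_LLE(x) is a_hat."""
    z = (x - xs) / h                              # z_i = (x - x_i)/h
    w = kernel(z)
    X = np.column_stack([np.ones_like(z), z])     # design matrix for a + b*z_i
    XtW = X.T * w                                 # weighted cross-products
    a_hat, b_hat = np.linalg.solve(XtW @ X, XtW @ ys)
    return a_hat, b_hat
```

For instance, with design points xs = np.arange(1, n + 1)/n and responses ys, the whole curve estimate is obtained by evaluating lle(t, xs, ys, h)[0] over a grid of t values in [0, 1].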

For asymptotic properties of m̂_LLE(x), including high asymptotic efficiency and automatic boundary carpentry, see Fan (1992, 1993) and Fan and Gijbels (1995). For a detailed introduction to nonparametric regression, see the monographs by Eubank (1988), Müller (1988), Härdle (1990, 1991), Scott (1992), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996), and Györfi, Kohler, Krzyzak, and Walk (2002). Because the LLE is based on least squares estimation, it is sensitive to large swings in the data. For this fact, see the monographs by Huber (1981), Hampel, Ronchetti, Rousseeuw, and Stahel (1986), and Rousseeuw and Leroy (1987). To cope with this drawback, a local linear M-estimator (LLM) integrating both ideas of robust M-estimation and local linear fitting is suggested. Given the standard Gaussian density function L and the tuning parameter g, the LLM for m(x) is constructed in this paper by maximizing

S_LLM(a, b; x) = ∑_{i=1}^{n} L((Y_i − a − b z_i)/g) K(z_i),   (1.3)

where z_i, h, and K are the same as those in (1.2). The function S_LLM(a, b; x) in (1.3) is a local weighted sum of Gaussian ridges along the straight lines a + b z_i = Y_i on the (a, b)-plane.


Note that the LLM produced from (1.3) is originally formulated by minimizing its M function, constructed using the outlier-resistant function −L. Its computational problems will be illustrated in Section 2 using an artificial example. To give the example better visual appeal, its M function has been multiplied by −1 (which turns the minimization problem into a maximization problem). The same computational problems shown by the example also occur for LLMs using other outlier-resistant functions with continuous derivatives. For simplicity of presentation, the LLM discussed in this paper is restricted to the one produced from (1.3). By the Taylor theorem and conditions (C1)-(C5) given in Section 4 of this paper, asymptotically, S_LLM(a, b; x) has a unique global maximizer close to the location of {m(x), (−h)m^{(1)}(x)}. Let (â_LLM, b̂_LLM) be the global maximizer of S_LLM(a, b; x). Thus the LLM for m(x) is taken as m̂_LLM(x) = â_LLM.

The LLM not only overcomes the lack of robustness, but also enjoys the advantages of the LLE. For details, see Tsybakov (1986), Besl, Birch, and Watson (1989), Fan, Hu, and Truong (1994), and Fan and Jiang (1999). By Fan and Jiang (1999), where the LLM is studied in the random design case, and Fan, Gasser, Gijbels, Brookmann, and Engels (1993), the optimal K for constructing each of m̂_LLE(x) and m̂_LLM(x) is the Epanechnikov kernel K(z) = (3/4)(1 − z^2) I_[−1,1](z), in the sense of yielding smaller asymptotic mean squared error (AMSE). For other useful local robust procedures, see, for example, Cleveland (1979) for the idea of LOWESS and Härdle and Gasser (1984) for the local constant M-estimator. The purpose of this paper is twofold. First, we discuss the robustness problem of m̂_LLM(x) from a computational viewpoint. This is done by means of a simple example in Section 2. Second, to sidestep the practical problems of m̂_LLM(x), a double smoothing robust estimator (DSRE) m̂_DSRE(x) for m(x) is proposed in Section 3. Its asymptotic behavior is studied in Section 4. Empirical results given in Section 5 demonstrate that the proposed DSRE is useful for estimating the regression function. Finally, sketches of the proofs are contained in Section 6.

2 The Problem

In practice, m̂_LLM(x) is often implemented by the Newton method together with an initial guess. For computational convenience, one often uses


Figure 1: An artificial example. The data used in Figure 1 are shown in (1a) and have outliers at x = 0.7, 0.72, and 0.74. They were generated by the regression model (1.1), except those outliers. (1b) Regression function estimates. (1c) The surface of S_LLM(a, b; x = 0.7), using the Epanechnikov kernel K and the standard Gaussian density function L, for m̂_LLM(0.7) in (1b) at the location marked by the vertical solid line. Locations of the vertical dashed, solid, and dotted lines in (1c) denote respectively the global maximizer of S_LLM(a, b; x = 0.7), and the initial and final estimates of the Newton method for m̂_LLM(0.7) in (1b). Solid curves at the top of (1c) stand for locations of the straight lines a + b z_i = Y_i for observations (x_i, Y_i) with x_i ∈ [0.7 − h, 0.7 + h]. (1d) Same captions as those of (1c), with S_LLM(a, b; x = 0.7) of the LLM replaced by S_DSRE(a, b; x = 0.7) of the proposed DSRE. The pseudo data used by S_DSRE(a, b; x = 0.7) are shown by stars and pluses at the top of (1d). Pluses were produced using at least one of the outliers in (1a), and stars using none. Captions of (1e)-(1g) and (1h)-(1j) are the same as those of (1b)-(1d), respectively, in each case.


Figure 2: Same captions as those of Figure 1. The data used in Figure 2 are shown in (2a). They are the same as those in (1a) with the outliers at x = 0.7, 0.72, and 0.74 deleted. The values of smoothing parameters h and g used by the discussed estimators in Figure 2 are the same as those employed in Figure 1.


(â_LLE, b̂_LLE) as the initial guess for the Newton iteration (Härdle and Gasser 1984; Fan and Jiang 1999). Unfortunately, based on that strategy, two practical problems may arise. First, by the nonrobustness of the LLE, such an initial guess might be far from the global maximizer of S_LLM(a, b; x). Second, the Gaussian density function L employed by m̂_LLM(x) has a peak. This implies that S_LLM(a, b; x) might have several ridges on its surface. If both practical problems occur simultaneously, then the Newton iteration is likely to progress toward some undesirable ridge on the surface of S_LLM(a, b; x) which is not located at the global maximizer of S_LLM(a, b; x). Consequently, an undesirable solution may be produced for m̂_LLM(x), and the resulting LLM estimate is then not robust enough. These computational problems of the LLM can be avoided if the global maximizer of S_LLM(a, b; x) is found by grid search. But the computational burden of the grid search approach is very heavy. On the other hand, the LLM in its least-absolute-deviation version, that is, taking its outlier-resistant function as the absolute value function, can be computed efficiently by way of the simplex method or the interior point method for linear programming (Portnoy and Koenker 1997). However, as with the median, the solution for the latter LLM is not unique in general (Fried, Einbeck, and Gather 2007). The above computational problems of m̂_LLM(x) are illustrated in Figures 1 and 2, using simulated data with outliers and with those outliers deleted, respectively. Based on the data with outliers in Figure 1a, both estimates produced by the LLE and the LLM in Figures 1a and 1b, respectively, exhibit undesirable departures from the true regression function, whereas that by the proposed DSRE in Figure 1b does not. On the other hand, when the outlying data in Figure 1a are deleted, all the resulting estimates in Figures 2a and 2b produced by the discussed estimators are close to the true regression function. Comparing these results with those in Figures 1a and 1b, the estimates produced by the proposed DSRE are nearly the same regardless of the presence of outliers, but those by the LLE and the LLM are quite different in the two situations. Thus, under the simulated settings, the proposed DSRE has better outlier-resistance performance than the LLE and the LLM. The same remark also applies to Figures 1e and 1h, with the same value of h but that of g increased. In Figures 1 and 2, the value of h = 0.1 was a subjective choice, and those of g were chosen to demonstrate their effects on the Newton method for the LLM. Figure 1c explains the poor performance of m̂_LLM(0.7) in Figure 1b by showing the surface of S_LLM(a, b; x = 0.7). In Figure 1c, the Newton iteration's initial estimate produced by the LLE for computing m̂_LLM(0.7) is


driven away from the global maximizer of S_LLM(a, b; x = 0.7) by those outliers. Also, there are many ridges on the surface of S_LLM(a, b; x = 0.7). These two practical problems give rise to the undesirable solution provided by the Newton method for m̂_LLM(0.7) in Figure 1b. Similar explanations apply to Figures 1f and 1i for the poor performance of m̂_LLM(0.7) in Figures 1e and 1h with larger values of g, respectively.
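To make the computation just described concrete, the following Python sketch (our own naming, not the authors' GAUSS code) runs the Newton iteration for maximizing S_LLM(a, b; x) in (1.3) from a supplied starting value, for example the LLE fit; with such a nonrobust start it can settle on a spurious ridge exactly as in Figure 1c.

```python
import numpy as np

def epanechnikov(z):
    return 0.75 * (1.0 - z**2) * (np.abs(z) <= 1.0)

def gauss(u):
    """Standard Gaussian density L."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def newton_llm(x, xs, ys, h, g, start, n_iter=20):
    """Newton iteration for maximizing S_LLM(a, b; x) in (1.3).

    `start` is the initial (a, b); the conventional choice is the LLE fit."""
    z = (x - xs) / h
    w = epanechnikov(z)
    a, b = start
    for _ in range(n_iter):
        u = (ys - a - b * z) / g
        phi = gauss(u)
        # gradient of S_LLM, using L'(u) = -u L(u)
        grad = np.array([np.sum(w * u * phi), np.sum(w * z * u * phi)]) / g
        # Hessian of S_LLM, using L''(u) = (u^2 - 1) L(u)
        c = w * (u**2 - 1.0) * phi
        hess = np.array([[np.sum(c), np.sum(c * z)],
                         [np.sum(c * z), np.sum(c * z**2)]]) / g**2
        a, b = np.array([a, b]) - np.linalg.solve(hess, grad)
    return a, b                # m_hat_LLM(x) is the a-component
```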

3 The Proposed Estimator

To sidestep the practical problems of the LLM, the DSRE for m(x) is proposed. It is based on a strategy different from conventional M-estimation, and is constructed in two stages. In the first stage, the pseudo data (â_ij, b̂_ij) are generated using local linear fits from every pair of observations (x_i, Y_i) and (x_j, Y_j) in the compact window [x − h, x + h] around the given point x ∈ [0, 1]. Note that (â_ij, b̂_ij) is equivalent to (â_LLE, b̂_LLE) using only the two observations (x_i, Y_i) and (x_j, Y_j). Through a straightforward calculation, for i ≠ j,

â_ij = (z_i Y_j − z_j Y_i) / (z_i − z_j),   b̂_ij = (Y_i − Y_j) / (z_i − z_j),   (3.1)

where z_i and z_j are the same as those in (1.2). By (1.1), the number of observations Y_i occurring in [x − h, x + h] is of order O(nh), and thus the number of generated pseudo data (â_ij, b̂_ij) is of order O(n^2 h^2).
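A minimal Python sketch of this first stage follows; the function name and the window handling are ours, under the assumption that the design points in the window have distinct z values.

```python
import numpy as np

def pseudo_data(x, xs, ys, h):
    """First stage of the DSRE: local linear fits (3.1) from every pair of
    observations in the smoothing window [x - h, x + h]."""
    in_window = np.abs(x - xs) <= h
    z, y = (x - xs[in_window]) / h, ys[in_window]
    i, j = np.triu_indices(len(y), k=1)      # every pair; (3.1) is symmetric in (i, j)
    dz = z[i] - z[j]
    a_ij = (z[i] * y[j] - z[j] * y[i]) / dz
    b_ij = (y[i] - y[j]) / dz
    return a_ij, b_ij                        # O((nh)^2) pseudo data points
```

These pairs are the pseudo data that the second-stage criterion S_DSRE(a, b; x) below acts on.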

Based on the influence of outliers, the generated pseudo data can be separated into two parts. The first part includes those using none of the outliers, and the second part those using at least one of the outliers. Using the regression model (1.1) and assuming that the regression function m(x) has two Lipschitz continuous derivatives, it will be shown in Section 6 that the asymptotic expectations of the â_ij and b̂_ij not polluted by outliers can be expressed as

E(â_ij) = m(x) + h^2 α_ij + O_u(h^3),   E(b̂_ij) = (−h) m^{(1)}(x) + h^2 β_ij + O_u(h^3).   (3.2)

Here α_ij = (−1/2) m^{(2)}(x) z_i z_j, β_ij = (1/2) m^{(2)}(x) (z_i + z_j), and the notation r_ij = O_u(r_n) means that |r_ij / r_n| is uniformly bounded above by a positive constant over the subindices i and j. Combining (1.1), (3.1), and (3.2) with the condition that the density function of the regression errors ε_i is unimodal and symmetric about 0, asymptotically, the pseudo data uncontaminated by outliers cluster around the single location {m(x), (−h)m^{(1)}(x)}. On the other hand, those contaminated are themselves outliers and flee from the


location. Such characteristics of the uncontaminated and the contaminated pseudo data are important for the success of the proposed DSRE. Figures 1d and 2d illustrate the centralization of the uncontaminated pseudo data and the decentralization of the contaminated pseudo data. In the second stage, given K, L, h, g, z_i, and z_j as in (1.2) and (1.3), a two-dimensional local constant M-estimator is applied to the pseudo data (â_ij, b̂_ij). It is defined similarly to S_LLM(a, b; x) in (1.3) by

S_DSRE(a, b; x) = ∑_{i=1}^{n} ∑_{j=1, i>j}^{n} K(z_i) K(z_j) L((a − â_ij)/g) L((b − b̂_ij)/g).

The function S_DSRE(a, b; x) is a local weighted sum of Gaussian peaks centered at (â_ij, b̂_ij) on the (a, b)-plane. Note that it is proportional to the weighted kernel density estimator (Devroye and Györfi 1985; Silverman 1986) of (â_ij, b̂_ij). The equivalence of M-estimation and kernel density estimation has been pointed out in Chu, Glad, Godtliebsen, and Marron (1998). Combining this result with the characteristics of both the uncontaminated and the contaminated pseudo data introduced above, asymptotically, S_DSRE(a, b; x) has a unique global maximizer close to the location {m(x), (−h)m^{(1)}(x)}. Due to the fast decay rate of the Gaussian density function L, our S_DSRE(a, b; x) has the advantage of effectively separating the pseudo data (â_ij, b̂_ij) through its modes. Such a characteristic of the Gaussian density function has been utilized to construct the Gaussian kernel classifier (Section 9.1 of Scott 1992). By the above considerations, the proposed DSRE for m(x) is constructed by maximizing S_DSRE(a, b; x) over (a, b). Let (â_DSRE, b̂_DSRE) be the global maximizer of S_DSRE(a, b; x). Thus the DSRE of m(x) is taken as m̂_DSRE(x) = â_DSRE.

Its asymptotic behavior will be studied in Section 4. In practice, to compute m̂_DSRE(x), we suggest using the Newton method together with the robust initial estimate (â_med, b̂_med). Here â_med denotes the median of the â_ij, and b̂_med is similarly defined for the b̂_ij. By the equivalence of S_DSRE(a, b; x) to a kernel density estimator, m̂_DSRE(x) inherits the outlier-resistance of the kernel density estimator, and is related to modal regression (Section 8.3.2 of Scott 1992). In particular, it is the mode of the pseudo data closest to the initial guess (â_med, b̂_med). A procedure written using the GAUSS software for computing m̂_DSRE(x) is available upon request.
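Putting the two stages together, the following Python sketch computes m̂_DSRE(x) by Newton steps on S_DSRE(a, b; x) started from the coordinate-wise medians of the pseudo data. It is only a schematic of the procedure described above (the authors' GAUSS program is the reference implementation), and every name in it is our own.

```python
import numpy as np

def epanechnikov(z):
    return 0.75 * (1.0 - z**2) * (np.abs(z) <= 1.0)

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def dsre(x, xs, ys, h, g, n_iter=20):
    """Double smoothing robust estimate of m(x).

    Stage 1: pseudo data (3.1) from all pairs in [x - h, x + h].
    Stage 2: maximize S_DSRE(a, b; x) by Newton steps from the medians."""
    z_all = (x - xs) / h
    keep = np.abs(z_all) <= 1.0
    z, y = z_all[keep], ys[keep]
    i, j = np.triu_indices(len(y), k=1)
    dz = z[i] - z[j]
    a_ij = (z[i] * y[j] - z[j] * y[i]) / dz
    b_ij = (y[i] - y[j]) / dz
    w = epanechnikov(z[i]) * epanechnikov(z[j])        # pair weights K(z_i) K(z_j)
    a, b = np.median(a_ij), np.median(b_ij)            # robust initial guess
    for _ in range(n_iter):
        u, v = (a - a_ij) / g, (b - b_ij) / g
        peak = w * gauss(u) * gauss(v)                 # one Gaussian peak per pair
        grad = -np.array([np.sum(u * peak), np.sum(v * peak)]) / g
        hess = np.array([[np.sum((u**2 - 1.0) * peak), np.sum(u * v * peak)],
                         [np.sum(u * v * peak), np.sum((v**2 - 1.0) * peak)]]) / g**2
        a, b = np.array([a, b]) - np.linalg.solve(hess, grad)
    return a                                           # m_hat_DSRE(x)
```

Evaluating dsre over a grid of x values then gives the full curve estimate of the kind shown, for example, in Figure 1b.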


Figure 1d explains the nice performance of m̂_DSRE(0.7) in Figure 1b by showing some appealing features of the pseudo data and the surface of S_DSRE(a, b; x = 0.7). First, comparing the uncontaminated pseudo data with the contaminated ones, the former are larger in number, clustered together, and closer to the target, that is, the global maximizer of S_DSRE(a, b; x = 0.7). As a result, the initial estimate for the Newton iteration, taken as the median of the pseudo data, is mainly determined by the uncontaminated pseudo data; thus it is not only resistant to outliers, but also close to the target. Second, the surface of S_DSRE(a, b; x = 0.7) has a smooth appearance, and the location of its single principal peak corresponds to that of the uncontaminated pseudo data. Consequently, these appealing features enhance our strategy by first providing a robust initial candidate close enough to the target, and then providing a single principal peak toward which the Newton iteration can advance correctly. These explanations also apply to Figures 1g and 1j for the nice performance of m̂_DSRE(0.7) in Figures 1e and 1h with larger values of g, respectively. We now close this section with the following remarks.

Remark 3.1. (Using different values of g in each dimension of the pseudo data) One might use different values of g in each dimension of the pseudo data (â_ij, b̂_ij) to produce S_DSRE(a, b; x). Here, for simplicity of presentation, the same value of g is employed in each dimension.

Remark 3.2. (The computational burden of the DSRE) The DSRE requires O(nh) times as much computational time as the LLM, since the number of L function values computed by S_LLM(a, b; x) in each Newton iteration is of order O(nh), but that by S_DSRE(a, b; x) is of order O(n^2 h^2). To speed up the implementation of the DSRE, one may consider the idea of iterative weighted least squares (Simpson, He, and Liu 1998) with a small fixed number of iterations starting at (â_med, b̂_med).

Remark 3.3. (Taking the robust (â_med, b̂_med) as the starting value of the Newton method of the LLM, instead of the nonrobust (â_LLE, b̂_LLE)) Let m̂_LLM*(x), the LLM*, be such an estimator. In practice, it avoids the first computational problem of the Newton method for the LLM. However, it still might suffer from the second computational problem. If the value of g employed by m̂_LLM*(x) is not large enough, then the resulting S_LLM*(a, b; x) has a rough appearance which might cause the Newton method to produce an undesirable solution. Simulation results in Section 5.2 show that, using such suboptimal values of the smoothing parameters, the resulting LLM* reacts


more sensitively than the corresponding DSRE, in the sense of yielding a larger sample standard deviation.

Remark 3.4. (Considering â_med as an estimate of m(x)) Based on the above discussions, â_med has the characteristics of fast computation, outlier-resistance, and being close to m(x). Thus it can also be considered as an estimate of m(x). Let m̂_LLA, the LLA, be such an estimator. However, simulation results in Section 5.2 show that it has worse performance than the DSRE, in the sense of yielding a larger sample integrated squared error.

4 Results

In this section, we study the asymptotic behavior of m̂_DSRE(x). For this, assume the regression model (1.1) and that the following conditions are satisfied:

(C1) The regression function m(x) in (1.1) is defined on the interval [0, 1] and has two Lipschitz continuous derivatives.

(C2) The regression errors ε_i are independent and identically distributed random variables with all moments finite. Their density function f is Lipschitz continuous, unimodal, symmetric about 0, and supported on R.

(C3) The function K is a Lipschitz continuous and symmetric probability density function supported on [−1, 1], and L is the standard Gaussian density function.

(C4) The values of h and g are selected respectively on the intervals [δ n^{−1/3+δ}, δ^{−1} n^{−δ}] and [δ, δ^{−1}], where the positive constant δ is arbitrarily small.

(C5) The total number of observations in this regression setting is n, with n → ∞.

(C6) The density function f of the regression errors ε_i satisfies the condition ζ_{2,0} ζ_{0,2} − ζ_{1,1}^2 ≠ 0. The notation ζ_{p,q}, for p, q ≥ 0, is defined below.

Conditions (C1)-(C5) are standard for the usual nonparametric robust regression analysis. Condition (C6) makes sure that the Hessian matrix for S_DSRE(a, b; x) is invertible. The following Theorem 4.1 gives the asymptotic bias and variance of m̂_DSRE(x). Its proof will be given in Section 6. Due to the complicated covariances of â_ij and b̂_ij shown in (6.2) and (6.3), the


asymptotic bias and variance of m̂_DSRE(x) are complicated. Define the quantities related to the asymptotic bias and variance of m̂_DSRE(x) as follows:

ζ_{p,q} = ∫_ρ^ϕ ∫_{ρ, t>s}^ϕ K(s) K(t) (t − s) I_{p,q}(s, t) ds dt,

ξ_1 = ∫_ρ^ϕ ∫_{ρ, t>s}^ϕ K(s) K(t) (t − s) {s t I_{2,0}(s, t) − (s + t) I_{1,1}(s, t)} ds dt,

ξ_2 = ∫_ρ^ϕ ∫_{ρ, t>s}^ϕ K(s) K(t) (t − s) {s t I_{1,1}(s, t) − (s + t) I_{0,2}(s, t)} ds dt,

ψ_{p,q,i,j} = ∫_ρ^ϕ ∫_ρ^ϕ ∫_{ρ, u>t>s}^ϕ K(s) K(t) K(u) {K(s)(s − t)(s − u) J_{1,p,q,i,j}(s, t, u) + K(t)(t − s)(u − t) J_{2,p,q,i,j}(s, t, u) + K(u)(u − s)(u − t) J_{3,p,q,i,j}(s, t, u)} ds dt du,

for p, q, i, j ≥ 0, where

ρ = max{−1, (x − 1)/h},   ϕ = min{1, x/h},

I_{p,q}(s, t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} L^{(p)}(−α) L^{(q)}(−β) f{g(α + βs)} f{g(α + βt)} dα dβ,

J_{1,p,q,i,j}(s, t, u) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} L^{(p)}(−α) L^{(q)}(−β) L^{(i)}(−α − βs + γs) L^{(j)}(−γ) × f{g(α + βs)} f{g(α + βt)} f[g{α + βs + γ(u − s)}] dα dβ dγ,

J_{2,p,q,i,j}(s, t, u) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} L^{(p)}(−α) L^{(q)}(−β) L^{(i)}(−α − βt + γt) L^{(j)}(−γ) × f{g(α + βt)} f{g(α + βu)} f[g{α + βt + γ(s − t)}] dα dβ dγ,

J_{3,p,q,i,j}(s, t, u) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} L^{(p)}(−α) L^{(q)}(−β) L^{(i)}(−α − βu + γu) L^{(j)}(−γ) × f{g(α + βu)} f{g(α + βs)} f[g{α + βu + γ(t − u)}] dα dβ dγ.

Theorem 4.1. If the regression model (1.1) and conditions (C1)-(C6) are satisfied, then the asymptotic bias and variance of m̂_DSRE(x) can be expressed respectively as

Bias{m̂_DSRE(x)} = h^2 m^{(2)}(x) b_DSRE + o(h^2),   (4.1)

Var{m̂_DSRE(x)} = n^{−1} h^{−1} v_DSRE + o(n^{−1} h^{−1}),   (4.2)

for each x ∈ [0, 1], where

b_DSRE = (1/2) (ζ_{1,1} ξ_2 − ζ_{0,2} ξ_1) / (ζ_{2,0} ζ_{0,2} − ζ_{1,1}^2),

v_DSRE = 2 g (ζ_{0,2}^2 ψ_{1,0,1,0} − 2 ζ_{1,1} ζ_{0,2} ψ_{1,0,0,1} + ζ_{1,1}^2 ψ_{0,1,0,1}) / (ζ_{2,0} ζ_{0,2} − ζ_{1,1}^2)^2.

We now close this section with the following remarks.

Remark 4.1. (Comparing m̂_DSRE(x), m̂_LLE(x), and m̂_LLM(x) on their AMSEs) By Theorem 1 of Fan (1993) and Theorem 3.2 of Fan and Jiang (1999), the asymptotic bias and variance of m̂_LLE(x) and those of m̂_LLM(x) can be expressed as

Bias{m̂_LLE(x)} = h^2 m^{(2)}(x) b_LLE + o(h^2),   (4.3)

Var{m̂_LLE(x)} = n^{−1} h^{−1} v_LLE + o(n^{−1} h^{−1}),   (4.4)

Bias{m̂_LLM(x)} = h^2 m^{(2)}(x) b_LLM + o(h^2),   (4.5)

Var{m̂_LLM(x)} = n^{−1} h^{−1} v_LLM + o(n^{−1} h^{−1}),   (4.6)

for each x ∈ [0, 1], where

b_LLE = (1/2) (κ_2^2 − κ_1 κ_3) / (κ_0 κ_2 − κ_1^2) = b_LLM,

v_LLE = σ^2 (κ_2^2 τ_0 − 2 κ_1 κ_2 τ_1 + κ_1^2 τ_2) / (κ_0 κ_2 − κ_1^2)^2 = σ^2 g^{−2} η_2^2 η_1^{−1} v_LLM,

κ_i = ∫_ρ^ϕ s^i K(s) ds,   τ_i = ∫_ρ^ϕ s^i K(s)^2 ds,

η_1 = E{L^{(1)}(ε/g)^2},   η_2 = E{L^{(2)}(ε/g)},

for i ≥ 0. By (4.1), the dominant term of the asymptotic bias of m̂_DSRE(x) is of order h^2 in magnitude, for each x ∈ [0, 1]. Therefore m̂_DSRE(x) does not suffer from boundary effects. For a detailed discussion of boundary effects on the kernel regression function estimator, see, for example, Section 4.3 of Müller (1988) and Section 4.4 of Härdle (1990). By (4.1)-(4.6), the AMSEs of m̂_DSRE(x), m̂_LLE(x), and m̂_LLM(x) are of the same order h^4 + n^{−1} h^{−1} in magnitude, for each x ∈ [0, 1]. However, due to the unknown factors m^{(2)}(x) and f(x), their magnitudes are not comparable. Note that the results (4.1)-(4.2) for m̂_DSRE(x) and (4.5)-(4.6) for m̂_LLM(x) are obtained when g is constant. In the case when g → ∞, a straightforward calculation gives

m̂_DSRE(x) → m̂_DSRE*(x),   m̂_LLM(x) → m̂_LLE(x),   (4.7)

where

m̂_DSRE*(x) = {∑_{i=1}^{n} ∑_{j=1, i>j}^{n} K(z_i) K(z_j) â_ij} / {∑_{i=1}^{n} ∑_{j=1, i>j}^{n} K(z_i) K(z_j)}.


By (4.7), each of the DSRE and the LLM loses its outlier-resistance ability as g → ∞. Using (1.1) and (C1)-(C6), and assuming that the regression function m(x) has four continuous derivatives, it can be proved that the dominant terms of the asymptotic bias and variance of m̂_DSRE*(x) are of order h^4 and n^{−1} h^{−1} in magnitude, respectively, for x ∈ (h, 1 − h). Thus, in the case when g → ∞, the asymptotic bias of the corresponding m̂_DSRE(x) has a faster convergence rate than that of the associated m̂_LLM(x), for x ∈ (h, 1 − h). But, by the characteristics of a kernel estimator having asymptotic bias of order h^4 (Müller 1988), the coefficient of the dominant term of the asymptotic variance of the former estimator is larger in magnitude than that of the latter estimator. Using an artificial example, Figure 3 presents some numerical results for the constant factors in (4.1)-(4.6). Firstly, it shows that as the value of r increases, the value of b_DSRE decreases, that of v_DSRE increases, and v_LLM approaches the value of v_LLE. This result coincides with that of (4.7). Secondly, for most of the selected values of g, m̂_DSRE(x) has the smallest AMSE among the three discussed estimators. Finally and most importantly, when using a sufficiently small value of g, the minimal AMSE of m̂_LLM(x) can be attained, and its magnitude is nearly equal to the corresponding AMSE of m̂_DSRE(x). However, the smaller the value of g employed, the more the resulting estimate produced by m̂_LLM(x) suffers from the two computational problems of the Newton method, whereas that by m̂_DSRE(x) does not.

Remark 4.2. (Choosing the kernel function K and the values of h and g for constructing m̂_DSRE(x)) From Theorem 4.1, the optimal K for constructing m̂_DSRE(x) with minimal AMSE is not available, since it depends on the unknown factors m^{(2)}(x) and f(x). On the other hand, we suggest using the robust cross-validation criterion (Leung 2005) to choose the values of h and g. The selected values of h and g are taken as the minimizer of the robust cross-validation score CV_DSRE(h, g) = n^{−1} ∑_{i=1}^{n} ρ{m̂_i(x_i) − Y_i}. Here m̂_i(x_i) is m̂_DSRE(x_i) computed with (x_i, Y_i) deleted, and ρ is a given outlier-resistant function. For other robust selection techniques for smoothing parameters, see, for example, Cantoni and Ronchetti (2001) and Boente and Rodriguez (2008).

Remark 4.3. (Using more observations to construct the pseudo data) Since (â_ij, b̂_ij) is equivalent to (â_LLE, b̂_LLE) produced using only the two observations (x_i, Y_i) and (x_j, Y_j), the variances of â_ij and b̂_ij given in (6.2) and (6.3) are large when the values of |i − j| are small. To remedy this drawback, we may consider using more than two, say q, observations to construct the pseudo data. However, the drawbacks of this approach include that both the


Figure 3: Plot of numerical values of the constant factors in (4.1)-(4.6) versus the value of g = rσ. These numerical values were obtained using Riemann sum approximation and the simulated settings of Figure 6 in Section 5.2, including the Epanechnikov kernel K, the standard Gaussian density function L, an interior point x, and regression errors ε_i as the normal mixture random variables (1/10)N(0, 100µ^2) + (9/10)N(0, µ^2). Here µ = 0.083, and σ = µ√10.9 = 0.27 denotes the standard deviation of the simulated regression errors ε_i. Locations of the vertical dashed and solid lines in (3a)-(3b) stand for the optimal values of g = 1.35σ and 2.07σ used by m̂_LLM(x) and m̂_DSRE(x) in Figure 6, respectively, in each case.


computational burden of the DSRE and the percentage of contaminated pseudo data increase as the value of q increases.

Remark 4.4. (Constructing the DSRE with a local p-degree polynomial fit) Under the regression model (1.1), the DSRE m̂_DSRE,p(x) for m(x) with a local p-degree polynomial fit can be similarly constructed in two stages, where p > 0. In the first stage, given every p + 1 observations (x_{i_0}, Y_{i_0}), ..., (x_{i_p}, Y_{i_p}) occurring in [x − h, x + h], take the (p + 1)-dimensional pseudo data (ĉ_{I,0}, ..., ĉ_{I,p}) as the solution of the system of equations c_0 + c_1 z_{i_k} + ··· + c_p z_{i_k}^p = Y_{i_k}, for k = 0, ..., p. Here I = (i_0, ..., i_p) with i_0 > ··· > i_p. In the second stage, a (p + 1)-dimensional local constant M-estimator is applied to the pseudo data (ĉ_{I,0}, ..., ĉ_{I,p}). It is defined by

S_DSRE,p(c_0, ..., c_p; x) = ∑_{i_0=1}^{n} ··· ∑_{i_p=1, i_0 > ··· > i_p}^{n} {∏_{k=0}^{p} K(z_{i_k}) L((c_k − ĉ_{I,k})/g)}.

Following the arguments in Section 3, the uncontaminated pseudo data cluster around the location {m(x), (−h)m^{(1)}(x), ..., (−h)^p m^{(p)}(x)/(p!)}, asymptotically. Thus m̂_DSRE,p(x) is taken as the first component of the global maximizer of S_DSRE,p(c_0, ..., c_p; x). To find the global maximizer, we suggest using the Newton method with initial estimate (ĉ_{0,med}, ..., ĉ_{p,med}). Here ĉ_{k,med} denotes the median of the ĉ_{I,k}, for each k = 0, ..., p.
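For one index set I = (i_0, ..., i_p), the first-stage system in Remark 4.4 is a small Vandermonde solve; a Python sketch (illustrative names, assuming the selected z values are distinct) is:

```python
import numpy as np

def pseudo_coef(z_sub, y_sub):
    """Solve c_0 + c_1 z + ... + c_p z^p = Y at the p + 1 selected points.

    z_sub, y_sub: arrays holding z_{i_k} and Y_{i_k}, k = 0, ..., p."""
    p = len(z_sub) - 1
    V = np.vander(z_sub, N=p + 1, increasing=True)   # columns 1, z, ..., z^p
    return np.linalg.solve(V, y_sub)                  # (c_hat_{I,0}, ..., c_hat_{I,p})
```

Looping this over all ordered index sets I in the window produces the (p + 1)-dimensional pseudo data that S_DSRE,p acts on.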

Remark 4.5. (Constructing the DSRE in the random design case) Consider the p-dimensional random design nonparametric regression model Y_i = w(X_i) + ε_i, for i = 1, ..., n. Here X_i = (X_{i,1}, ..., X_{i,p}) are p-dimensional design points, w is an unknown but smooth regression function, ε_i are regression errors with mean 0, and Y_i are noisy observations of w at X_i. The DSRE ŵ_DSRE,p(x) for w(x), x = (x_1, ..., x_p), is constructed in two stages. In the first stage, given every p + 1 observations (X_{j_0}, Y_{j_0}), ..., (X_{j_p}, Y_{j_p}) occurring in the compact window ∏_{u=1}^{p} [x_u − h, x_u + h], take the (p + 1)-dimensional pseudo data (ĉ_{J,0}, ..., ĉ_{J,p}) as the solution of the system of equations c_0 + c_1 Z_{j_k,1} + ··· + c_p Z_{j_k,p} = Y_{j_k}, for k = 0, ..., p. Here Z_{j_k} = (x − X_{j_k})/h ≡ (Z_{j_k,1}, ..., Z_{j_k,p}) and J = (j_0, ..., j_p) with j_0 > ··· > j_p. In the second stage, a (p + 1)-dimensional local constant M-estimator is applied to the pseudo data (ĉ_{J,0}, ..., ĉ_{J,p}). It is defined by

S*_DSRE,p(c_0, ..., c_p; x) = ∑_{j_0=1}^{n} ··· ∑_{j_p=1, j_0 > ··· > j_p}^{n} {∏_{k=0}^{p} K*(Z_{j_k}) L((c_k − ĉ_{J,k})/g)},

where K*(Z_{j_k}) = ∏_{u=1}^{p} K(Z_{j_k,u}). Following the arguments in Section 3, the uncontaminated pseudo data cluster around the location {w(x), (−h)(∂/∂x_1)w(x), ..., (−h)(∂/∂x_p)w(x)}, asymptotically. Thus ŵ_DSRE,p(x) is taken as the first component of the global maximizer of S*_DSRE,p(c_0, ..., c_p; x). To find the global maximizer, we suggest using the Newton method with initial


estimate (ĉ_{0,med}, ..., ĉ_{p,med}). Here ĉ_{k,med} denotes the median of the ĉ_{J,k}, for each k = 0, ..., p.

5 Empirical Studies

To evaluate the performance of the proposed DSRE, empirical studies were carried out. Real data examples and simulation studies are given in Subsections 5.1 and 5.2, respectively. In this section, the five estimators LLE, LLM, LLM*, LLA, and DSRE were considered. The functions K and L used by the five discussed estimators were taken as the Epanechnikov kernel and the standard Gaussian density function, respectively.

5.1. Two examples. In this subsection, two real data sets were employed to show the practical performance of the DSRE. The five discussed estimators were applied to each data set using their robust cross-validated smoothing parameters (see Remark 4.2). The values of the robust cross-validated smoothing parameters for each discussed estimator were chosen on the equally spaced logarithmic grid of 1001 values of h in [2d_1, d_2/2] and on that of 1001 values of g in [σ̂/5, 20σ̂], where d_1 denotes the distance between every two consecutive design points, d_2 the distance between the minimum and the maximum design points, and σ̂ the estimate of the standard deviation of the regression errors. Huber's function ρ(u) = (1/2)u^2, for |u| < c, and c|u| − (1/2)c^2, for |u| ≥ c, where c = 1.35 × σ̂, was used to compute the robust cross-validation score. In this paper, σ̂^2 was constructed as a trimmed mean (Wu and Chu 1993), and defined by σ̂^2 = {2(n − 1 − 2q)}^{−1} ∑_{i=2+q}^{n−q} ξ_i, where the ξ_i, for i = 2, ..., n, denote the rearranged (Y_i − Y_{i−1})^2 in ascending order, and q = [n/10]. Thus the values of ξ_i affected by outliers can be left out in constructing σ̂^2. Here the value of q was chosen subjectively. See Marron and Wand (1992) for a discussion of why an equally spaced grid of parameters is typically not a very efficient design for this type of grid search.

Example 1. The first data set is the annual total U.S. lumber production (millions of board feet) in the period 1947 to 1976. It comes from Abraham and Ledolter (1983). The numerical results are shown in Figure 4. Using the data set, Figures 4a and 4b show that, in the period 1953 to 1976, the regression function estimate produced by the DSRE shows a sharp valley, but that by each of the LLM, LLE, LLA, and LLM* presents only a slight sag. The result for the latter estimates is caused by the outlying observation in 1959, which is much larger than the other observations in the period 1957 to 1963.
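As a concrete rendering of the scale estimate and cross-validation score used for the smoothing-parameter selection in this subsection, here is a minimal Python sketch; the array `loo_fits` stands in for the leave-one-out fits m̂_i(x_i) of whichever estimator is being tuned, and all names are ours.

```python
import numpy as np

def sigma2_trimmed(y, trim_frac=0.10):
    """Trimmed-mean variance estimate: keep the middle ordered squared first
    differences (Y_i - Y_{i-1})^2 and divide their sum by 2(n - 1 - 2q)."""
    d2 = np.sort(np.diff(y)**2)           # ordered xi_i, i = 2, ..., n
    n = len(y)
    q = int(n * trim_frac)                # q = [n/10] for the default trim
    kept = d2[q:len(d2) - q]              # drop the q smallest and q largest
    return np.sum(kept) / (2.0 * (n - 1 - 2 * q))

def huber_rho(u, c):
    """Huber's rho with threshold c."""
    return np.where(np.abs(u) < c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)

def cv_score(y, loo_fits, c):
    """Robust cross-validation score n^{-1} * sum rho(m_hat_i(x_i) - Y_i)."""
    return np.mean(huber_rho(loo_fits - y, c))
```

The candidate (h, g) minimizing cv_score over the logarithmic grid would then be selected, with c = 1.35 * sqrt(sigma2_trimmed(y)).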


Figure 4: U.S. lumber production example. (4a)-(4b) Regression function estimates using the same data set and the values of the smoothing parameters h = 10.56 for m̂_LLE(x), h = 11.51 for m̂_LLA(x), (h, g) = (8.76, 0.51σ̂) for m̂_LLM(x), (h, g) = (11.76, 0.50σ̂) for m̂_LLM*(x), and (h, g) = (7.47, 0.62σ̂) for m̂_DSRE(x), where σ̂ = 1444. (4c)-(4d) Regression function estimates using the data of (4a)-(4b) with the data point marked by the vertical solid line deleted. The values of the smoothing parameters for producing the estimates in (4c)-(4d) include h = 6.57 for m̂_LLE(x), h = 11.51 for m̂_LLA(x), (h, g) = (5.20, 0.66σ̂) for m̂_LLM(x), (h, g) = (11.46, 0.53σ̂) for m̂_LLM*(x), and (h, g) = (8.01, 0.59σ̂) for m̂_DSRE(x), where σ̂ = 1366.


Comparing the five regression function estimates with the observations in the period 1957 to 1976, it is clear that the estimate produced by the DSRE has the best fit and is not affected by the outlying observation in 1959. On the other hand, after deleting that outlying observation, Figures 4c and 4d show that the resulting estimate produced by the DSRE remains nearly unchanged, but those by the LLM, LLE, LLA, and LLM* change significantly, showing the valley in the period 1957 to 1963 more clearly. Hence, in this example, the DSRE has the best outlier-resistance performance among the five discussed estimators.

Example 2. The second data set is the daily stock trading volume of Laboratory Corporation of America Holding, listed on the New York Stock Exchange. It contains trading volumes on 100 trading days in the period August 11, 2004 to December 31, 2004. The data come from the free database provided by America Online (http://money.aol.com). They can be accessed by first inputting the company name on the website homepage, and then clicking on the Historical Prices button on the left hand side and giving the range of trading dates. The numerical results are shown in Figure 5. Figure 5a shows that there are two outlying observations of large magnitude on the 54th and 57th trading dates. Using the data set, the regression function estimate produced by the LLE in Figure 5a is affected by the two outlying observations, since it shows a huge peak in the period of the 50th to the 64th trading dates, but those by each of the DSRE, LLM, LLM*, and LLA in Figures 5b-5c are not affected. The estimate produced by the DSRE shows clearly that there are three peaks around the 30th, 59th, and 86th trading dates, but those by the LLM, LLM*, and LLA present two peaks around the 30th and 59th trading dates and only a slight bump around the 86th trading date. On the other hand, after deleting the two outlying observations, Figures 5d-5f show that the resulting estimate obtained by the DSRE remains nearly unchanged, but those by the other four discussed estimators change significantly, showing the three peaks around the 30th, 59th, and 86th trading dates more clearly. Hence, in this example, the outlier-resistance performance of the DSRE is the best among the five discussed estimators.

5.2. Simulations. In this subsection, a simulation study was performed to assess the performance of the DSRE. The simulated regression function m(x) = (1/4) m*(x), for x ∈ [0, 1], was considered, where m*(x) = 2 − 2x + 3 exp{−100(x − 1/2)^2}


Figure 5: Daily stock trading volume example. (5a)-(5c) Regression function estimates using the same data set and the values of the smoothing parameters h = 6.00 for m̂_LLE(x), h = 7.01 for m̂_LLA(x), (h, g) = (7.46, 19.63σ̂) for m̂_LLM(x), (h, g) = (5.92, 0.51σ̂) for m̂_LLM*(x), and (h, g) = (4.26, 4.54σ̂) for m̂_DSRE(x), where σ̂ = 1.61 × 10^5. For better visual performance of the estimates, the observations of large magnitude marked by vertical solid lines in (5a) are not drawn in (5b)-(5c). (5d)-(5f) Regression function estimates using the data of (5a)-(5c) with the data points marked by vertical solid lines deleted. The values of the smoothing parameters for producing the estimates in (5d)-(5f) include h = 3.42 for m̂_LLE(x), h = 2.50 for m̂_LLA(x), (h, g) = (3.77, 1.66σ̂) for m̂_LLM(x), (h, g) = (2.72, 1.63σ̂) for m̂_LLM*(x), and (h, g) = (3.35, 7.74σ̂) for m̂_DSRE(x), where σ̂ = 1.53 × 10^5.


(Müller 1988). The minimum and the maximum values of the simulated regression function m(x) are nearly equal to 0 and 1, respectively. The sample size n = 100 was employed. Several distributions for generating the simulated regression errors ε_i were considered, although some of them do not satisfy assumption (C2). These distributions include:

E1: standard normal distribution,
E2: standard double exponential distribution,
E3: standard logistic distribution,
E4: standard Cauchy distribution,
E5: normal mixture distribution (1/20)N(0, 25) + (19/20)N(0, 1),
E6: normal mixture distribution (1/10)N(0, 25) + (9/10)N(0, 1),
E7: normal mixture distribution (1/20)N(0, 100) + (19/20)N(0, 1),
E8: normal mixture distribution (1/10)N(0, 100) + (9/10)N(0, 1),
E9: skewed Student t distribution with degrees of freedom 3 and scale parameter 5 (Fernandez and Steel, 1998),
E10: Pareto distribution with degrees of freedom 3 and scale parameter 5 (Siegrist, 2005),
E11: 5% Uniform(0, 10) contamination in E1, and
E12: 10% Uniform(0, 10) contamination in E1.

Given each of E1-E12, 200 independent sets of observations were generated from the regression model (1.1). The pseudo random variables generated from each of E9 and E10 were first centered by subtracting their expectation, so that the resulting pseudo random variables had expectation zero. The simulated regression errors ε_i were taken as the scaled pseudo random variables generated from each of E1-E10, so that the signal-to-noise ratio (Donoho and Johnstone 1994) of the resulting observations was nearly equal to 1. The simulated regression errors corresponding to each of E11 and E12 were those produced from E1 with 5 percent and 10 percent of them replaced by pseudo Uniform(0, 10) random variables, respectively. Thus E11 and E12 provide one-sided positive contamination. Compared with E11-E12, E1 produces no contamination, E2-E8 two-sided symmetric contamination, and E9-E10 two-sided asymmetric contamination. The distributions E1-E8 have been used in Fan and Jiang (1999) and Cantoni and Ronchetti (2001), and E9-E10 in Hwang, Cheng, and Lee (2007).
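As an illustration of how one simulated data set could be generated under (1.1) with the regression function above and, say, the E8 normal-mixture errors, a short Python sketch follows; the final rescaling of the errors to hit a target signal-to-noise ratio is omitted, and the seed, names, and the scale value (echoing the µ = 0.083 reported for Figure 3) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def m(x):
    """Simulated regression function m(x) = (1/4) m*(x)."""
    return 0.25 * (2.0 - 2.0 * x + 3.0 * np.exp(-100.0 * (x - 0.5)**2))

def errors_e8(n, scale):
    """E8 errors: (1/10) N(0, 100) + (9/10) N(0, 1), multiplied by `scale`."""
    wide = rng.random(n) < 0.10                 # contaminating component
    sd = np.where(wide, 10.0, 1.0)
    return scale * sd * rng.standard_normal(n)

n = 100
xs = np.arange(1, n + 1) / n                    # equally spaced design x_i = i/n
ys = m(xs) + errors_e8(n, scale=0.083)          # one simulated data set
```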


Given each of E1-E12 and each data set, the values of m̂_DSRE(x) and its integrated squared error ISE_DSRE(h, g) were calculated on the equally spaced logarithmic grid of 201 × 201 values of (h, g) in [2/n, 1/2] × [σ/5, 20σ], where σ is the standard deviation of the regression errors ε_i. Given the values of h and g, the value of ISE_DSRE(h, g) was defined by ∫_0^1 {m̂_DSRE(x) − m(x)}^2 dx, and was empirically approximated by (1/u) ∑_{i=0}^{u} {m̂_DSRE(u_i) − m(u_i)}^2. Here u_i = i/u and u = 200. (A code sketch of this evaluation is given at the end of this subsection.) After evaluation on the grid, the global minimizer (h°_DSRE, g°_DSRE) of ISE_DSRE(h, g) was taken on the grid. When the values of (h°_DSRE, g°_DSRE) had been obtained over the 200 data sets, their sample average, denoted by (h̄_DSRE, ḡ_DSRE), was calculated. In our simulation, the optimal values of h and g for m̂_DSRE(x) were taken as h̄_DSRE and ḡ_DSRE. After the values of (h̄_DSRE, ḡ_DSRE) were obtained, those of m̂_DSRE(x) using the smoothing parameters (r h̄_DSRE, r ḡ_DSRE) and of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) were computed over the 200 data sets, for each value of r = 1/2, 1, and 2. This range of smoothing parameters is regarded as wide enough to cover most applications. The sample average and standard deviation of m̂_DSRE(x) and the sample root mean squared deviation (RMSD) of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) were computed over the 200 data sets, for each value of r. The latter quantity was taken as the square root of the average of the squared deviations of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) from 0. The results for r = 1 are treated as the optimal performance, and those for r = 1/2 and 2 as the obtainable performance of m̂_DSRE(x). This computational procedure follows Fan and Jiang (1999). The same computational procedure for the DSRE was applied to each of the LLM, LLE, LLA, and LLM*. Let ISE_LLM(r h̄_LLM, r ḡ_LLM), ISE_LLE(r h̄_LLE), ISE_LLA(r h̄_LLA), and ISE_LLM*(r h̄_LLM*, r ḡ_LLM*) be similarly defined. Given each of E1-E12, to present the typical performance of the discussed estimators, a typical simulated data set was chosen as the one for which m̂_DSRE(x) attains its median accuracy, in terms of ISE_DSRE(h̄_DSRE, ḡ_DSRE), over the 200 data sets. Given the typical simulated data set, the typical regression function estimates were generated by the discussed estimators using their optimal smoothing parameters. On the other hand, to compare the overall performance of the discussed estimators in terms of their sample integrated squared errors, the ratios of the value of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) to those of ISE_LLM(r h̄_LLM, r ḡ_LLM), ISE_LLE(r h̄_LLE), ISE_LLA(r h̄_LLA), and ISE_LLM*(r h̄_LLM*, r ḡ_LLM*) were evaluated over the 200 data sets, for each of E1-E12 and each value of r = 1/2, 1, and 2. Also, the ratios of the sample RMSD of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) to those of ISE_LLM(r h̄_LLM, r ḡ_LLM),


ISE_LLE(r h̄_LLE), ISE_LLA(r h̄_LLA), and ISE_LLM*(r h̄_LLM*, r ḡ_LLM*) were computed. The simulation results are summarized in Figures 6-7 and Table 1. Figure 6 presents both the typical and the local performance of the discussed estimators using normal mixture regression errors generated from E8. The results using the other simulated regression errors are similar and omitted. Under the simulated data, Figures 6a-6d show that the typical regression function estimate produced by the LLE suffers from the outlying observations, but those by the LLM, LLA, LLM*, and DSRE are resistant to the outliers. For r = 1, Figures 6i-6l show that the optimal performance of the DSRE is better than that of the LLE and the LLM, but similar to that of the LLA and the LLM*, in terms of the sample bias and standard deviation. For r ≠ 1, Figures 6e-6h and 6m-6p show that the DSRE has the best obtainable performance among the five discussed estimators when using the suboptimal values of r considered here. Note that, for r = 1/2, the performance of the LLM deteriorates seriously to a level similar to that of the LLE. In addition, the LLA and the LLM* have unstable performance, since their sample standard deviations fluctuate wildly over the interval [0, 1]. Such poor performance of the LLM is due to the two computational problems of the Newton method introduced in Section 2, and that of the LLM* to the rough appearance of S_LLM(a, b; x) when using a small value of g. Note also that, for r = 2, the LLM, LLA, LLM*, and DSRE have similar sample standard deviations, but the DSRE has the smallest sample bias in magnitude. By the result of (4.7), this advantage of the DSRE is due to the faster convergence rate of its bias when using a large value of g. Figure 7 and Table 1 present the relative overall performance of the LLM, LLE, LLA, and LLM* with respect to the DSRE at the given finite sample, for each of E1-E12 and each value of r = 1/2, 1, and 2. They show that, for r = 1, the optimal overall performance of the DSRE is better than that of the LLM, LLE, and LLA, but similar to that of the LLM* for most cases of regression errors, in terms of the sample integrated squared error. For r ≠ 1, the DSRE has the best obtainable overall performance among the five discussed estimators for most cases of regression errors when using the suboptimal values of r considered here. Comparing the results for r = 1 with those for r ≠ 1 in Figures 6-7 and Table 1, it is clear that each of the LLM, LLE, LLA, and LLM* reacts more sensitively to the suboptimal values of r considered here than the DSRE.
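The ISE evaluation and grid minimization referred to above can be sketched as follows in Python, with a generic `estimator(x, xs, ys, h, g)` placeholder standing for any of the discussed smoothers and `m_true` for the known simulated regression function; this is our own schematic, not the authors' code.

```python
import numpy as np

def ise(estimator, xs, ys, h, g, m_true, u=200):
    """Empirical ISE: (1/u) * sum_{i=0}^{u} {m_hat(u_i) - m(u_i)}^2, u_i = i/u."""
    grid = np.arange(u + 1) / u
    fit = np.array([estimator(t, xs, ys, h, g) for t in grid])
    return np.sum((fit - m_true(grid))**2) / u

def best_on_grid(estimator, xs, ys, m_true, h_grid, g_grid):
    """Grid minimizer (h, g) of the empirical ISE for one data set."""
    scores = [(ise(estimator, xs, ys, h, g, m_true), h, g)
              for h in h_grid for g in g_grid]
    _, h_best, g_best = min(scores)
    return h_best, g_best
```

Averaging the per-data-set minimizers over the 200 replications then gives the optimal smoothing parameters used in the comparisons above.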


Figure 6: Simulation results using simulated regression errors generated from E8. (6a)-(6d) Typical regression function estimates using the optimal smoothing parameters h_LLE = 0.094, h_LLA = 0.094, (h_LLM, g_LLM) = (0.073, 1.35σ), (h_LLM*, g_LLM*) = (0.066, 0.81σ), and (h_DSRE, g_DSRE) = (0.085, 2.07σ), where σ = 0.27. (6e)-(6h) The absolute value of the sample bias and the sample standard deviation using half the optimal smoothing parameters over the 200 data sets. (6i)-(6l) Same captions as those of (6e)-(6h), but using the optimal smoothing parameters. (6m)-(6p) Same captions as those of (6e)-(6h), but using double the optimal smoothing parameters.


Figure 7: Box plots, with limits at the 5th and 95th percentiles, of the ratios of the value of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) to those of ISE_LLM(r h̄_LLM, r ḡ_LLM), ISE_LLE(r h̄_LLE), ISE_LLA(r h̄_LLA), and ISE_LLM*(r h̄_LLM*, r ḡ_LLM*) over the 200 data sets, for each of E1-E12 and each value of r = 1/2, 1, and 2.

Table 1. The ratios of the sample RMSD of ISE_DSRE(r h̄_DSRE, r ḡ_DSRE) to those of ISE_LLM(r h̄_LLM, r ḡ_LLM), ISE_LLE(r h̄_LLE), ISE_LLA(r h̄_LLA), and ISE_LLM*(r h̄_LLM*, r ḡ_LLM*) over the 200 data sets, for each of E1-E12 and each value of r = 1/2, 1, and 2.

            r = 1/2                     r = 1                       r = 2
      LLM   LLE   LLA   LLM*      LLM   LLE   LLA   LLM*      LLM   LLE   LLA   LLM*
E1    1.00  0.97  0.81  1.06      0.77  1.01  0.83  0.99      1.00  0.84  0.78  0.71
E2    0.95  0.81  0.97  0.93      0.64  0.85  1.02  0.93      0.87  0.65  0.80  0.57
E3    0.95  0.91  0.84  0.99      0.75  0.94  0.86  0.95      0.95  0.81  0.81  0.71
E4    0.53  0.31  0.83  0.43      0.32  0.18  1.01  0.87      0.25  0.12  1.05  0.65
E5    0.95  0.57  0.82  0.97      0.63  0.61  0.91  0.97      0.64  0.52  0.75  0.60
E6    0.91  0.46  0.83  0.89      0.58  0.50  0.93  0.97      0.54  0.44  0.78  0.57
E7    0.44  0.23  0.69  0.55      0.59  0.26  0.88  1.03      0.20  0.28  0.85  0.64
E8    0.35  0.27  0.92  0.60      0.45  0.18  0.86  0.94      0.24  0.21  0.89  0.67
E9    0.05  0.07  0.60  0.83      0.43  0.06  0.53  0.79      0.06  0.15  0.67  0.85
E10   0.28  0.25  0.69  0.64      0.80  0.24  0.68  0.82      0.22  0.41  0.84  0.83
E11   0.09  0.06  0.77  0.29      0.52  0.06  0.78  1.00      0.04  0.10  0.75  0.66
E12   0.05  0.03  0.59  0.23      0.30  0.02  0.64  1.00      0.03  0.05  0.73  0.71

6 Sketches of the Proofs

The following results and notation will be used in this section. Set

m_0 = m(x),   m_1 = (−h) m^{(1)}(x),

R_{a_ij} = m_0 − {z_i m(x_j) − z_j m(x_i)}/(z_i − z_j),   R_{b_ij} = m_1 − {m(x_i) − m(x_j)}/(z_i − z_j),

V_{a_ij} = (−1)(z_i ε_j − z_j ε_i)/(z_i − z_j),   V_{b_ij} = (−1)(ε_i − ε_j)/(z_i − z_j),

∑_i = ∑_{i=1}^{n},   K_i = K(z_i),   L_ij^{(p,q)} = L^{(p)}(V_{a_ij}/g) L^{(q)}(V_{b_ij}/g),

Ω_ij = K_i K_j {L_ij^{(1,0)}, L_ij^{(0,1)}},

I_0(s, t; u, v) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} L(u − α) L(v − β) f{g(α + βs)} f{g(α + βt)} dα dβ,

ξ_0(u, v) = ∫_ρ^ϕ ∫_{ρ, t>s}^ϕ K(s) K(t) (t − s) I_0(s, t; u, v) ds dt,

θ = (a, b),   θ̂ = (â_DSRE, b̂_DSRE),   θ_0 = (m_0, m_1),   ξ = (ξ_1, ξ_2),

S(θ) = S_DSRE(a, b; x),   RS(θ) = E{S(θ)},   VS(θ) = S(θ) − E{S(θ)},

ψ = [ ψ_{1,0,1,0}  ψ_{1,0,0,1} ; ψ_{1,0,0,1}  ψ_{0,1,0,1} ],   ζ = [ ζ_{2,0}  ζ_{1,1} ; ζ_{1,1}  ζ_{0,2} ],

J_p(s, t, u) = [ J_{p,1,0,1,0}  J_{p,1,0,0,1} ; J_{p,1,0,0,1}  J_{p,0,1,0,1} ](s, t, u),

for i, j = 1, ..., n and p, q ≥ 0. Each element in ξ, ψ, ζ, and J_p has been defined in Section 4. Let S^{(1)}(θ) and S^{(2)}(θ) be the gradient vector and the Hessian matrix of S(θ), respectively. Set Ω = {θ : θ ∈ [−δ^{−1}, δ^{−1}] × [−δ^{−1}, δ^{−1}]}, Ω_1 = {θ : θ ∈ Ω and |θ − θ_0| > g}, and Ω* = {(p, q) : (ω_p, ω_q) ∈


Ω}. Here the notation |θ| denotes the Euclidean length of the given vector θ, and the ω_p, p ∈ Z, are partition points of R satisfying ω_p − ω_{p−1} = n^{−3}. Decompose

â_ij = m_0 − R_{a_ij} − V_{a_ij},   b̂_ij = m_1 − R_{b_ij} − V_{b_ij},   S(θ) = RS(θ) + VS(θ).   (6.1)

Using (1.1) and (C2), through a straightforward calculation, the covariances between â_ij and â_pq and those between b̂_ij and b̂_pq, for i > j and p > q, are

Cov(â_ij, â_pq) =
  σ^2 (z_i^2 + z_j^2)(z_i − z_j)^{−2},               if p = i and q = j,
  σ^2 z_j z_q (z_i − z_j)^{−1} (z_i − z_q)^{−1},      if p = i > j > q or p = i > q > j,
  σ^2 z_i z_q (z_i − z_j)^{−1} (z_q − z_j)^{−1},      if i > j = p > q,
  σ^2 z_j z_p (z_i − z_j)^{−1} (z_i − z_p)^{−1},      if p > q = i > j,
  σ^2 z_i z_p (z_i − z_j)^{−1} (z_p − z_j)^{−1},      if i > p > q = j or p > i > q = j,
  0,                                                  if i ≠ j ≠ p ≠ q,   (6.2)

Cov(b̂_ij, b̂_pq) =
  2σ^2 (z_i − z_j)^{−2},                              if p = i and q = j,
  σ^2 (z_i − z_j)^{−1} (z_i − z_q)^{−1},              if p = i > j > q or p = i > q > j,
  σ^2 (z_i − z_j)^{−1} (z_q − z_j)^{−1},              if i > j = p > q,
  σ^2 (z_i − z_j)^{−1} (z_i − z_p)^{−1},              if p > q = i > j,
  σ^2 (z_i − z_j)^{−1} (z_p − z_j)^{−1},              if i > p > q = j or p > i > q = j,
  0,                                                  if i ≠ j ≠ p ≠ q.   (6.3)

Proof of (3.2). Using (C1) and applying the second order Taylor expansion to m(x_j) around x_i, we have m(x_j) = m(x_i) + (x_j − x_i) m^{(1)}(x_i) + (1/2)(x_j − x_i)^2 m^{(2)}(x_i) + O_u(|x_j − x_i|^3). Using this result, (1.1), and the facts that x_j − x_i = h(z_i − z_j) and |z_i|, |z_j| ≤ 1 for x_i, x_j ∈ [x − h, x + h], we have

E(â_ij) = {z_i m(x_j) − z_j m(x_i)} / (z_i − z_j)
        = m(x_i) + h z_i m^{(1)}(x_i) + (1/2) h^2 z_i (z_i − z_j) m^{(2)}(x_i) + O_u(h^3).   (6.4)

Applying the (2 − j)th order Taylor expansion to m^{(j)}(x_i) around x, for each j = 0, 1, 2, and combining the result with (6.4) and (C1), the result for E(â_ij) in (3.2) follows. The result for E(b̂_ij) in (3.2) follows similarly. Thus the proof of (3.2) is complete.

Proof of Theorem 4.1. Using (C1)-(C5), (1.1), (3.2), (6.2), and (6.3), we have the following asymptotic results:

E{S^{(1)}(θ_0)} = (1/2) n^2 h^4 m^{(2)}(x) ξ + o(n^2 h^4),   (6.5)

Var{S^{(1)}(θ_0)} = 2g n^3 h^3 ψ + o(n^3 h^3),   (6.6)

E{S^{(2)}(θ_0)} = n^2 h^2 ζ + o(n^2 h^2),   (6.7)

E[S{θ_0 + (gu, gv)}] = n^2 h^2 g^2 ξ_0(u, v) + o(n^2 h^2),   (6.8)

P(|θ̂ − θ_0| > g   i.o.) = 0.   (6.9)

To show (6.5) and (6.6), by (3.2) and (C1), we have

R_{a_ij} = (−1) h^2 α_ij + O_u(h^3) ≡ O_u(h^2),   R_{b_ij} = (−1) h^2 β_ij + O_u(h^3) ≡ O_u(h^2),   (6.10)

for each x ∈ [0, 1]. Using (C3), (6.10), and the decomposition of â_ij and b̂_ij in (6.1), and applying the first order Taylor expansion to each of L^{(j)}{(R_{a_ij} + V_{a_ij})/g} and L^{(j)}{(R_{b_ij} + V_{b_ij})/g}, for j = 0, 1, S^{(1)}(θ_0) can be decomposed into

S^{(1)}(θ_0) = A_1 + A_2 + A_3 + O(n^2 h^6),   (6.11)

where

A_1 = g^{−1} ∑_i ∑_{j, i>j} Ω_ij,

A_2 = g^{−2} ∑_i ∑_{j, i>j} K_i K_j R_{a_ij} {L_ij^{(2,0)}, L_ij^{(1,1)}},

A_3 = g^{−2} ∑_i ∑_{j, i>j} K_i K_j R_{b_ij} {L_ij^{(1,1)}, L_ij^{(0,2)}}.

By (C2)-(C5), (6.10), Riemann sum approximation, and a change of integration variable, through a straightforward calculation, we have

E(A_1) = 0,

E(A_2) = (1/2) n^2 h^4 m^{(2)}(x) ∫_ρ^ϕ ∫_{ρ, t>s}^ϕ K(s) K(t) s t (t − s) {I_{2,0}(s, t), I_{1,1}(s, t)} ds dt + o(n^2 h^4),

E(A_3) = (1/2) n^2 h^4 m^{(2)}(x) ∫_ρ^ϕ ∫_{ρ, t>s}^ϕ K(s) K(t) (s^2 − t^2) {I_{1,1}(s, t), I_{0,2}(s, t)} ds dt + o(n^2 h^4).

Combining these results with (6.11), the asymptotic expectation of S^{(1)}(θ_0) in (6.5) follows. To show (6.6), by (6.11) and E(A_1) = 0, the asymptotic variance of S^{(1)}(θ_0) follows by showing

E(A_1^T A_1) = 2g n^3 h^3 ψ + o(n^3 h^3),   (6.12)

E[{A_2^T − E(A_2^T)}{A_2 − E(A_2)}] = o(n^3 h^3),   (6.13)

E[{A_3^T − E(A_3^T)}{A_3 − E(A_3)}] = o(n^3 h^3).   (6.14)


To show (6.12), by the same subindex structure as Cov(â_ij, â_pq) in (6.2), decompose A_1^T A_1 into

A_1^T A_1 = g^{−2} ∑_i ∑_j ∑_p ∑_{q, i>j, p>q} Ω_ij^T Ω_pq = ∑_{j=1}^{5} B_j,   (6.15)

where

B_1 = g^{−2} ∑_i ∑_j ∑_p ∑_{q, i>j, p>q, p=i, q=j} Ω_ij^T Ω_pq,

B_2 = g^{−2} ∑_i ∑_j ∑_p ∑_{q, i>j, p>q, p=i>j>q or p=i>q>j} Ω_ij^T Ω_pq,

B_3 = g^{−2} ∑_i ∑_j ∑_p ∑_{q, i>j, p>q, i>j=p>q or p>q=i>j} Ω_ij^T Ω_pq,

B_4 = g^{−2} ∑_i ∑_j ∑_p ∑_{q, i>j, p>q, i>p>q=j or p>i>q=j} Ω_ij^T Ω_pq,

B_5 = g^{−2} ∑_i ∑_j ∑_p ∑_{q, i>j, p>q, i≠j≠p≠q} Ω_ij^T Ω_pq.

Using (C2)-(C5), Riemann sum approximation, and a change of integration variable, through a straightforward calculation, we have

E(B_1) = O(n^2 h^2),

E(B_2) = 2g n^3 h^3 ∫_ρ^ϕ ∫_ρ^ϕ ∫_{ρ, u>t>s}^ϕ K(s)^2 K(t) K(u) (s − t)(s − u) J_1(s, t, u) ds dt du + o(n^3 h^3),

E(B_3) = 2g n^3 h^3 ∫_ρ^ϕ ∫_ρ^ϕ ∫_{ρ, u>t>s}^ϕ K(s) K(t)^2 K(u) (t − s)(u − t) J_2(s, t, u) ds dt du + o(n^3 h^3),

E(B_4) = 2g n^3 h^3 ∫_ρ^ϕ ∫_ρ^ϕ ∫_{ρ, u>t>s}^ϕ K(s) K(t) K(u)^2 (u − s)(u − t) J_3(s, t, u) ds dt du + o(n^3 h^3),

E(B_5) = 0.

Combining these results with (6.15), (6.12) follows. Following the same proof as for (6.12), and replacing Ω_ij in A_1 respectively with {L_ij^{(2,0)}, L_ij^{(1,1)}} − E{L_ij^{(2,0)}, L_ij^{(1,1)}} and {L_ij^{(1,1)}, L_ij^{(0,2)}} − E{L_ij^{(1,1)}, L_ij^{(0,2)}}, through a straightforward calculation, (6.13) and (6.14) follow. Thus the proof of (6.6) is complete. Those of (6.7) and (6.8) can be obtained similarly, and are omitted.

We now give the proof of (6.9). The proof is complete by showing

P{ sup_{θ∈Ω_1} S(θ) ≥ S(θ_0)   i.o. } = 0.   (6.16)


To check (6.16), by (C1)-(C5) and (6.8), through a straightforward calculation, we have

RS(θ_0) − sup_{θ∈Ω_1} RS(θ) = 4 n^2 h^2 c_g + o(n^2 h^2),

where c_g = (1/4) g^2 {ξ_0(0, 0) − ξ_0(0, 1)} > 0. Combining this result with the decomposition of S(θ) in (6.1), we have

sup_{θ∈Ω_1} S(θ) − S(θ_0) ≤ 2 sup_{(p,q)∈Ω*} |VS(ω_p, ω_q)| + 2 sup_{(p,q)∈Ω*} sup_{|a−ω_p|, |b−ω_q| ≤ n^{−3}} |VS(ω_p, ω_q) − VS(θ)| − 4 n^2 h^2 c_g + o(n^2 h^2).

By this inequality, the proof of (6.16) is complete by showing that

P{ sup_{(p,q)∈Ω*} |VS(ω_p, ω_q)| ≥ n^2 h^2 c_g   i.o. } = 0,

P{ sup_{(p,q)∈Ω*} sup_{|a−ω_p|, |b−ω_q| ≤ n^{−3}} |VS(ω_p, ω_q) − VS(θ)| + o(n^2 h^2) ≥ n^2 h^2 c_g   i.o. } = 0.

Their proofs are essentially the same as that of (5.2) of Chu and Cheng (1996), and are omitted. Thus the proof of (6.9) is complete.

We now prove (4.1) and (4.2). By the first order Taylor expansion, we have

0 = S^{(1)}(θ̂) = S^{(1)}(θ_0) + (θ̂ − θ_0) S^{(2)}(θ*),   (6.17)

for each x ∈ [0, 1], where θ* = (θ*_1, θ*_2) = u θ̂ + (1 − u) θ_0 with u ∈ [0, 1]. By (6.5) and (6.6), S^{(1)}(θ_0) = O_p(n^2 h^4 + n^{3/2} h^{3/2}). By (C3), (6.1), (6.9), and (6.10), each quantity L^{(j)}{(θ*_1 − â_ij)/g} L^{(2−j)}{(θ*_2 − b̂_ij)/g} in S^{(2)}(θ*) can be expressed as L^{(j)}(a_n + g^{−1} V_{a_ij}) L^{(2−j)}(b_n + g^{−1} V_{b_ij}) + O_u(h^2) ≡ O_p(1), for each j = 0, 1, and 2, where a_n = u(â_DSRE − m_0)/g and b_n = u(b̂_DSRE − m_1)/g satisfy |a_n| ≤ 1 and |b_n| ≤ 1 with probability one. Combining this result with (C2), (1.1), (6.2), and (6.3), we have S^{(2)}(θ*) = O_p(n^2 h^2). Thus, by comparing the magnitudes of S^{(1)}(θ_0) and S^{(2)}(θ*) in (6.17), we have

θ̂ − θ_0 = O_p(h^2 + n^{−1/2} h^{−1/2}).   (6.18)

By (6.18), (6.7), (C3)-(C5), and Theorem 1.3.6 of Serfling (1980), we have

E{S^{(2)}(θ*)} = n^2 h^2 ζ + o(n^2 h^2).   (6.19)


Using (C1)-(C6), (6.5)-(6.7), (6.17)-(6.19), and the approximation to the standard errors of functions of random variables given in Section 10.5 of Stuart and Ord (1987), through a straightforward calculation, (4.1) and (4.2) follow. Hence the proof of Theorem 4.1 is complete.

Acknowledgements. The authors thank the Editor and two referees for their valuable suggestions which greatly improved the presentation of this paper. The research was supported by the National Science Council, Taiwan, Republic of China.

References

Abraham, B. and Ledolter, J. (1983). Statistical Methods for Forecasting. Wiley, New York.
Besl, P.J., Birch, J.B. and Watson, L.T. (1989). Robust window operators. Machine Vision and Applications, 2, 179-191.
Boente, G. and Rodriguez, D. (2008). Robust bandwidth selection in semiparametric partly linear regression models: Monte Carlo study and influential analysis. Computational Statistics and Data Analysis, 52, 2808-2828.
Cantoni, E. and Ronchetti, E. (2001). Resistant selection of the smoothing parameter for smoothing splines. Statistics and Computing, 11, 141-146.
Chu, C.K. and Cheng, P.E. (1996). Estimation of jump points and jump values of a density function. Statist. Sinica, 6, 79-95.
Chu, C.K., Glad, I., Godtliebsen, F. and Marron, J.S. (1998). Edge-preserving smoothers for image processing (with discussion). J. Amer. Statist. Assoc., 93, 526-556.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74, 829-836.
Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425-455.
Eubank, R.L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.
Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc., 87, 998-1004.
Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist., 21, 196-216.
Fan, J., Gasser, T., Gijbels, I., Brookmann, M. and Engels, J. (1993). Local polynomial fitting: a standard for nonparametric regression. Discussion paper 9315, Institut de Statistique, Universite Catholique de Louvain, Belgium.
Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc., Series B, 57, 371-394.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and its Application — Theory and Methodologies. Chapman and Hall, New York.
Fan, J., Hu, T. and Truong, Y. (1994). Robust non-parametric function estimation. Scand. J. Statist., 21, 433-446.
Fan, J. and Jiang, J. (1999). Variable bandwidth and one-step local M-estimator. Science in China, Series A, 29, 1-15.
Fernandez, C. and Steel, M.F.J. (1998). On Bayesian modeling of fat tails and skewness. J. Amer. Statist. Assoc., 93, 359-371.
Fried, R., Einbeck, J. and Gather, U. (2007). Weighted repeated median smoothing and filtering. J. Amer. Statist. Assoc., 102, 1300-1308.
Györfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-free Theory of Nonparametric Regression. Springer, New York.
Hampel, F., Ronchetti, E., Rousseeuw, P. and Stahel, W. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge.
Härdle, W. (1991). Smoothing Techniques: with Implementation in S. Springer, Berlin.
Härdle, W. and Gasser, T. (1984). Robust nonparametric function fitting. J. Roy. Statist. Soc., Series B, 46, 42-51.
Huber, P.J. (1981). Robust Statistics. Wiley, New York.
Hwang, R.C., Cheng, K.F. and Lee, J.C. (2007). A semiparametric method for predicting bankruptcy. J. of Forecasting, 26, 317-342.
Leung, D. (2005). Cross-validation in nonparametric regression with outliers. Ann. Statist., 33, 2291-2310.
Marron, J.S. and Wand, M.P. (1992). Exact mean integrated squared error. Ann. Statist., 20, 712-736.
Müller, H.G. (1988). Nonparametric Regression Analysis of Longitudinal Data. Lecture Notes in Statistics, 46. Springer, Berlin.
Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators (with discussion). Statistical Science, 12, 279-300.
Rousseeuw, P.J. and Leroy, A.M. (1987). Robust Regression and Outlier Detection. Wiley, New York.
Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Siegrist, K. (2005). The Pareto distribution. Available at http://www.ds.unifi.it/VL/VL EN/special/special12.html [10 December 2008].
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York.
Simonoff, J.S. (1996). Smoothing Methods in Statistics. Springer, New York.
Simpson, D.G., He, X. and Liu, Y.T. (1998). Comment: Edge-preserving smoothers for image processing. J. Amer. Statist. Assoc., 93, 544-548.
Stuart, A. and Ord, J.K. (1987). Kendall's Advanced Theory of Statistics, Volume 1. Oxford University Press, New York.
Tsybakov, A.B. (1986). Robust reconstruction of functions by the local approximation. Problems of Information Transformation, 22, 133-146.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.
Wu, J.S. and Chu, C.K. (1993). Kernel-type estimators of jump points and values of a regression function. Ann. Statist., 21, 1545-1566.

Ruey-Ching Hwang Dept. of Finance National Dong Hwa University Hualien, Taiwan 974. E-mail: [email protected]

Zong-Huei Lin General Education Center Taiwan Hospitality & Tourism College Hualien, Taiwan 974.

C.K. Chu Dept. of Applied Mathematics National Dong Hwa University Hualien, Taiwan 974.

Paper received February 2008; revised October 2009.