Nonparametric estimation for dependent data

RESEARCH ARTICLE

Nonparametric estimation for dependent data†

Jan Johannes^a,∗ and Suhasini Subba Rao^b

^a Institute of Applied Mathematics, University Heidelberg, 69120 Heidelberg, Germany; ^b Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, U.S.A.

(26 February 2008)

In this paper we consider nonparametric estimation for dependent data, where the observations do not necessarily come from a linear process. We study density estimation and also discuss associated problems in nonparametric regression using the 2-mixing dependence measure. We compare the results under 2-mixing with those derived under the assumption that the process is linear.

Keywords: Density estimation; nonparametric regression; 2-mixing; nonlinear processes.

AMS Subject Classification: Primary: 62G05; 62M10; Secondary: 62G07; 62G08.

† This work was partially supported by the DFG (DA 187/12-3).
∗ Corresponding author. Email: [email protected]

1. Introduction

Nonparametric estimation for dependent observations has a long history in statistics. Rosenblatt [42] first studied density estimation for dependent data. Since then several authors have considered nonparametric estimation under various assumptions (notable early articles include Robinson [39] and Hart [29]). For example, Hall and Hart [25], Giraitis et al. [22], Mielniczuk [34] and Estevas and Vieu [18] consider density estimation for linear processes which have long memory, whereas Cheng and Robinson [9] consider density estimation for random variables which are nonlinear functions of a linear process. A notable result is that if the observations come from a linear process with short memory, then the usual rate of convergence, known for independent observations, also holds for dependent observations. On the other hand, for long memory processes the rate of convergence is different. Interestingly, although long memory influences the rate of convergence, it has no influence on the bandwidth choice, which is the same regardless of short or long memory. In other words, if the observations come from a linear process, a larger bandwidth does not improve the rate of convergence


of the density estimator. Similar results can also be derived for nonparametric regression problems (c.f. Hall and Hart [26], Cheng and Robinson [10], Hall et al. [28], Claesken and Hall [11], Csörgő and Mielniczuk [13, 14, 15] and Bryk and Mielniczuk [7]). However, it is usually assumed that the observations come from a linear process or are functions of a linear or Gaussian process (often referred to as generalised Gaussian processes). In the case of linearity, the joint density of the observations can be characterized (in some sense) in terms of the autocovariances. It is this representation that allows the mean squared error of the nonparametric estimator to be derived in terms of the autocovariance function. However, this result does not necessarily hold when the process is nonlinear.

The assumption of linearity can be relaxed by using the notion of 2-mixing (see Bosq [3] and Bradley [5]). Unlike the autocovariance function, 2-mixing can be considered as a measure of dependence between two random variables (see Definition 2.3, below), and the 2-mixing size quantifies this dependence: a large mixing size indicates little dependence, whereas a small mixing size indicates large dependence. Since the strong mixing size gives a lower bound for the 2-mixing size, the 2-mixing size can be established for several types of processes, for example linear processes (see Athreya and Pantula [1], Chanda [8], Gorodetskii [24] and Appendix A.3) and also nonlinear processes (see Pham [35], Masry and Tjøstheim [33], Cline and Pu [12], Bousamma [4] and Basrak et al. [2], where many of these results show geometric ergodicity, which implies 2-mixing of the process). Assuming that the 2-mixing size (or strong mixing size) is sufficiently large, Robinson [39] (see Assumptions A3.1 and A3.2) and Bosq [3] (see also Vieu [46], Viano et al. [45], Mielniczuk [34], Fan and Yao [19] etc.) obtain the rate of convergence for nonparametric kernel estimators. However, despite the huge body of literature on rates of convergence for nonparametric kernel estimators based on the assumption of linearity of the process, and some on rates of convergence for processes which are 2-mixing with a sufficiently large 2-mixing size, as far as we are aware very little exists on rates of convergence of nonparametric kernel estimators for nonlinear processes whose 2-mixing size can be small. This is particularly pertinent, as nonlinear processes with small mixing sizes can arise in several applications; for example, the ARCH(∞) process is a nonlinear process which is used to model financial time series and can have a small mixing size (see Fryzlewicz and Subba Rao [20] for the details).

In this paper we address this issue, and obtain rates of convergence for nonparametric kernel estimators for dependent data, formulating the results in terms of the 2-mixing size. We study both density estimation and also problems in nonparametric regression. In Section 3 we consider kernel density estimation; in particular, we obtain the sampling properties of the Rosenblatt-Parzen kernel estimator and obtain a bound for the mean squared error under the assumption that the time series is stationary and 2-mixing. We show that, as for long memory processes, the 2-mixing size can influence the rate of convergence. But unlike the long memory case, a much larger 2-mixing size may be required to obtain the usual rate of convergence. Moreover, the bandwidths which minimise the obtained bounds are influenced by


the 2-mixing size: the smaller the 2-mixing size, the larger the bandwidth. We demonstrate that several problems could arise if one were to falsely suppose that the observations come from a linear process when they do not. For example, if the usual optimal bandwidth for linear processes were used on a nonlinear process, the mean squared error may no longer converge to zero. Thus our results give a warning to practitioners who apply well known results for linear processes without checking whether the process is linear or not.

In Section 4 we consider nonparametric regression for dependent data. We discuss this with reference to two models. First we suppose the response and explanatory variables (X_t, Z_t) satisfy (i) X_t = ϕ(Z_t) + h(Z_t)η_t, where {η_t} and {Z_t} are independent of each other (Hart [30] considers a particular example of this model, where h(·) is a constant); secondly, we assume the conditional expectation satisfies (ii) E(X_t | Z_t) = ϕ(Z_t). We observe that the latter model includes the former model as a special case. We estimate ϕ(·) using the classical kernel estimator and derive rates of convergence similar to those obtained for the density estimator. But in the case of model (i) the rate of convergence depends on two factors, the 2-mixing size of {Z_t} and the rate of decay of the autocovariance function of {η_t}, whereas for model (ii) the rate of convergence is determined by the mixing size of the multivariate random process {(X_t, Z_t)}. All the proofs can be found in the appendix. Some 2-mixing inequalities for linear processes used here are also included in the appendix.

2. Notation

In this section we introduce some definitions that will be used in the paper. Note we will assume all the necessary densities exist. We start by defining the kernel.

Definition 2.1 A kernel K is of order r (see Scott [43]) if K is a univariate, even function such that

∫ du K(u) = 1,  ∫ du u^i K(u) = 0 for all i = 1, …, r − 1,

and there exists a constant S_K such that

∫ du |u|^r K(u) = S_K.

Let K_b(z) := b^{−d} K(z/b), where b > 0 is a bandwidth.
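To make Definition 2.1 concrete, the following short Python sketch (ours, not part of the paper) numerically checks that the rectangular kernel used later in Proposition 3.1 is a kernel of order r = 2; the helper names `rect_kernel` and `moment` are hypothetical.

```python
import numpy as np

def rect_kernel(u):
    # Rectangular kernel: K(x) = 1 on [-1/2, 1/2] and zero otherwise.
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def moment(i, grid):
    # Numerically approximate the i-th moment: integral of u^i K(u) du.
    return np.trapz(grid ** i * rect_kernel(grid), grid)

grid = np.linspace(-1.0, 1.0, 200001)
print(moment(0, grid))   # ~1: the kernel integrates to one
print(moment(1, grid))   # ~0: the first moment vanishes (K is even)
print(np.trapz(np.abs(grid) ** 2 * rect_kernel(grid), grid))  # ~1/12, the constant S_K for r = 2
```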


Below we define the smoothness class (c.f. Robinson [38]) which we use to bound the bias of the estimators.

Definition 2.2 For s, △ > 0, the space G^s_△ is the class of functions g : ℝ → ℝ satisfying: g is everywhere (m − 1)-times differentiable for m − 1 < s ≤ m; for some ρ > 0 and for all x, the inequality

sup_{y: |y−x| < ρ} |g(y) − g(x) − Q(y − x)| / |y − x|^s ≤ △

holds, where for m > 1, Q is an (m − 1)th-degree homogeneous polynomial in y − x whose coefficients are the derivatives of g of orders 1 to m − 1 evaluated at x; and △ is a finite constant.

The dependence of the process {Y_t} is quantified in terms of the 2-mixing size, which we define below.

Definition 2.3 (i) A stationary process {Y_t} is said to be 2-mixing with size v if for all t ≠ 0

sup_{A∈σ(Y_t), B∈σ(Y_0)} |P(A ∩ B) − P(A)P(B)| ≤ C|t|^{−v}

for some C < ∞ independent of t.
(ii) The covariance of a stationary process {Y_t} has size u if for all t ≠ 0, |cov(Y_t, Y_0)| ≤ C|t|^{−u} for some C < ∞ independent of t.

We note that the notion of strong mixing is defined in a similar way (see Bradley [5] for properties of strong mixing and Rio [37] for applications in central limit theorems). However, the crucial difference between 2-mixing and strong mixing is the sigma-algebras over which the supremum is taken. In the definition of strong mixing the sigma-algebra is over the entire left and right tails of {Y_t}, whereas the sigma-algebras in the definition of 2-mixing are more restrictive. We observe that the covariance is a measure of linear dependence, whereas 2-mixing is a generalization of this, and can be considered as a measure of dependence.

2-mixing is quite a general notion, which is satisfied by several processes. For example, under certain conditions on the innovations, most linear models are 2-mixing (see Appendix A.3, and Athreya and Pantula [1], Cline and Pu [12] and Chanda [8], where strong mixing is shown). Further, under additional conditions on the innovations and the parameters, ARCH/GARCH processes are also strongly mixing (c.f. Masry and Tjøstheim [33], Bousamma [4] and Basrak et al. [2]), which implies that they are also 2-mixing.

Most of the results and bounds in this paper are derived using 2-mixing. In general, the larger the mixing size, the faster the rate of convergence. For example, in the case of iid observations (where the 2-mixing size can be treated as ∞), using just a few observations, information over the entire domain of the density function can be obtained. On the other hand, a sample which has a small mixing size (and so tends to be clustered about certain points) will require a much larger number of observations to give the same information.

For brevity, we use the standard notation ∧ to denote minimum and ∨ to denote maximum.

3. Kernel density estimation

Suppose we observe the stationary time series {Z_1, …, Z_T}, and let f denote the density of Z_t. The most popular estimator of f is the Rosenblatt-Parzen kernel estimator

f̂(u) = (1/T) Σ_{t=1}^T K_b(Z_t − u),  (1)

where K_b(z) is defined below Definition 2.1.
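As an illustration (ours, not from the paper), a minimal Python sketch of the estimator (1) with the rectangular kernel of Proposition 3.1 below; the function name `density_estimate` and the AR(1) toy data are our own choices.

```python
import numpy as np

def rect_kernel(u):
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def density_estimate(Z, u, b, kernel=rect_kernel):
    """Rosenblatt-Parzen estimator (1): f_hat(u) = (1/T) sum_t K_b(Z_t - u),
    with K_b(z) = b^{-1} K(z / b) in the univariate case d = 1."""
    Z = np.asarray(Z)
    return np.mean(kernel((Z - u) / b)) / b

# Toy usage on a dependent sample (a stationary AR(1) process, which is 2-mixing):
rng = np.random.default_rng(0)
T = 5000
Z = np.zeros(T)
for t in range(1, T):
    Z[t] = 0.5 * Z[t - 1] + rng.standard_normal()
print(density_estimate(Z, u=0.0, b=T ** (-1 / 5)))  # estimate of f(0)
```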

In this section we investigate the sampling properties of the kernel density estimator defined above. The dependence of the process {Z_t} is quantified in terms of its 2-mixing size (see Definition 2.3). We first derive a bound for the mean squared error (MSE) E|f̂(z) − f(z)|² using only minimal assumptions on the distribution of {Z_t}.

Proposition 3.1 Suppose the univariate stationary process {Z_t} is 2-mixing with size v, and the marginal density f of Z_t and its second derivative f′′ are both uniformly bounded. Let f̂ be defined as in (1), where K is the rectangular kernel, i.e., K(x) = 1 if x ∈ [−1/2, 1/2] and zero otherwise. Then we have

E|f̂(z) − f(z)|² = O(b⁴ + T^{−[v∧1]} b^{−([v∨1]+1)/[v∨1]}) = { O(b⁴ + T^{−1} b^{−(v+1)/v}), v > 1;  O(b⁴ + T^{−v} b^{−2}), v ≤ 1 }.

Proof. To prove the result we will bound the risk using the standard variance bias decomposition. First the bias: as we are using a rectangular kernel and f′′ is uniformly bounded, it is clear that E f̂(z) = f(z) + O(b²). To obtain a bound for the variance we require a bound for the covariances inside the variance expansion T²·var(f̂(z)) = Σ_{t,τ} cov[K_b(Z_t − z), K_b(Z_τ − z)]. Since {Z_t} is 2-mixing with size v, by using the covariance inequality in Bradley [6] (see also Rio [36]) we have

|cov[K_b(Z_t − z), K_b(Z_τ − z)]| ≤ 4·∫_0^∞ ∫_0^∞ min{ C|t − τ|^{−v}, P(|K_b(Z_t − z)| > x), P(|K_b(Z_τ − z)| > y) } dx dy.  (2)

Studying P(|K_b(Z_t − z)| > x) and recalling that K(·) is the rectangular kernel, we can show that

P(|K_b(Z_t − z)| > x) = { 0, if x > 1/b;  P(Z_t ∈ [z − b/2, z + b/2]), otherwise }.

By using the mean value theorem we have P(Z_t ∈ [z − b/2, z + b/2]) = b f(z̃), for


some z̃ ∈ [z − b/2, z + b/2]. Substituting this into (2) leads to

|cov[K_b(Z_t − z), K_b(Z_τ − z)]| ≤ 4·∫_0^{1/b} ∫_0^{1/b} min{ C|t − τ|^{−v}, b·f(z̃) } dx dy = 4·b^{−2} min{ C|t − τ|^{−v}, b·f(z̃) }.

Altogether this yields the bound

T²·var(f̂(z)) ≤ 4 Σ_{t,τ} b^{−2} min{ C|t − τ|^{−v}, b·f(z̃) }.  (3)

Examining the minimum inside the summand above, we partition the sum into two parts which we bound separately (for the details see the proof of Theorem 3.3 in the Appendix). Finally, recalling that |E f̂(z) − f(z)|² = O(b⁴) leads to the desired result. □

We observe, in the proof above, that besides the 2-mixing condition we do not have any assumptions on the joint distribution of (Z_t, Z_τ). The cost of using such weak assumptions is that the usual bound O(b⁴ + (bT)^{−1}) for the MSE, obtained for independent observations, is not achieved: even for large v the 2-mixing size has an influence on the bound. However, introducing some assumptions on the joint densities of {Z_t} allows us to tighten the bound derived in (3) and hence, for a sufficiently large mixing size v, to recover the usual bound O(b⁴ + (bT)^{−1}) for the MSE (we note that the rest of the proofs in this section and the subsequent sections require more subtle arguments, and these can be found in the appendix).

Assumption 3.2 Densities and kernels
(i) The marginal density f is uniformly bounded.
(ii) For each t ∈ ℤ let f^{(t)} denote the joint density of (Z_t, Z_0). Define¹ F^{(t)} := f^{(t)} − f ⊗ f. Then ‖F^{(t)}‖_{p_F} is uniformly bounded in t for some p_F > 2, and we define q_F = 1 − 2/p_F.
(iii) The kernel K is uniformly bounded and has a finite first and second moment, i.e., ‖K‖_1 < ∞ and ‖K‖_2 < ∞.

We use these assumptions to derive a uniform bound for the MSE of the density estimator.

Theorem 3.3 Suppose the stationary time series {Z_t} is 2-mixing with size v and Assumption 3.2 is fulfilled for some q_F ∈ (0, 1). In addition assume that f ∈ G^s_△ for some △, s > 0 (see Definition 2.2). Let f̂ be defined as in (1), where K is a kernel of order s. Then we have, uniformly for all z ∈ ℝ,

E|f̂(z) − f(z)|² = O(b^{2s} + b^{−1}·T^{−1} + b^{−(2+q_F(1−[v∨1]))}·T^{−[v∧1]}),  T → ∞.

¹ We use the notation f ⊗ g(x, y) = f(x)g(y) and ‖f‖_p = (∫ |f(x)|^p dx)^{1/p}.


For ease of presentation we have only stated the result for univariate {Z_t}; however, it is straightforward to extend this result to multivariate {Z_t}.

Remark 3.4 We note that in the bound given in Theorem 3.3 the second term dominates the third term when v > 1 + 1/q_F. Conversely, when v < 1 + 1/q_F the third term dominates the second term. Moreover, the third term can be partitioned into two further cases, 1 < v ≤ 1 + 1/q_F and v ≤ 1. This means that Theorem 3.3 can be written as

(i) if v > 1 + 1/q_F, then E|f̂(z) − f(z)|² = O(b^{2s} + 1/(bT));
(ii) if 1 < v ≤ 1 + 1/q_F, then E|f̂(z) − f(z)|² = O(b^{2s} + 1/(b^{2+q_F(1−v)} T));
(iii) if v ≤ 1, then E|f̂(z) − f(z)|² = O(b^{2s} + 1/(b² T^v));

as T → ∞. Studying the three bounds, we see that the bound improves linearly with v for 0 ≤ v ≤ 1; after this point there is a change in behaviour and the improvement is more gradual. The bound plateaus when v > 1 + 1/q_F; after this point we have the usual nonparametric bound obtained for iid observations. There is also continuity in the three bounds: when v is at the boundary points 1 and 1 + q_F^{−1}, there is a continuous transition between the bounds. □

We now consider the rate of convergence by using a bandwidth b* which balances the three terms in the bound of Theorem 3.3.

Corollary 3.5 Under the assumptions of Theorem 3.3, let b* ≈ T^{−γ/(2s+1)} with

γ := { 1, v > 1 + 1/q_F;  [v∧1]·(2s+1)/(2s+(2+q_F(1−[v∨1]))), 1 + 1/q_F ≥ v }.  (4)

Then uniformly for all z ∈ ℝ we have

E|f̂(z) − f(z)|² = O(T^{−(2s/(2s+1))·γ}) as T → ∞.

In other words, if b* ≈ T^{−γ/(2s+1)}, then we have

E|f̂(z) − f(z)|² = { O(T^{−2s/(2s+1)}), v > 1 + 1/q_F;  O(T^{−2s/(2s+2+q_F(1−v))}), 1 + 1/q_F ≥ v > 1;  O(T^{−2sv/(2s+2)}), 1 ≥ v }.  (5)
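The bandwidth rule (4) and the rates in (5) are easy to tabulate; the sketch below (our transcription, with hypothetical names, not part of the paper) computes the exponent γ and the resulting MSE exponent.

```python
def gamma_density(v, qF, s):
    """Exponent gamma in b* ~ T^{-gamma/(2s+1)}, transcribed from (4) of Corollary 3.5."""
    if v > 1 + 1 / qF:
        return 1.0
    return min(v, 1.0) * (2 * s + 1) / (2 * s + 2 + qF * (1 - max(v, 1.0)))

def mse_exponent(v, qF, s):
    """delta such that E|f_hat(z) - f(z)|^2 = O(T^{-delta}), cf. (5)."""
    return 2 * s / (2 * s + 1) * gamma_density(v, qF, s)

print(mse_exponent(v=0.5, qF=1.0, s=1))  # 0.25: small mixing size, slow rate
print(mse_exponent(v=3.0, qF=1.0, s=1))  # 2/3: the usual 2s/(2s+1) rate
```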

We note that if sup_z |f(z)| < ∞ and sup_t sup_z |f^{(t)}(z)| < ∞ (both the density and the joint densities are uniformly bounded), then uniformly in t, ‖F^{(t)}‖_∞ < ∞. This means q_F = 1, and the bound can be divided into the three cases v ≤ 1, 1 ≤ v ≤ 2 and v ≥ 2. On the other hand, when ‖F^{(t)}‖_{p_F} < ∞ for only a finite p_F, then q_F < 1 and v > 1 + q_F^{−1} > 2 is required to be sure of the usual nonparametric bound. Referring to Corollary 3.5, we observe that when v < 1 + q_F^{−1} the bandwidth b* used is much larger than the usual bandwidths encountered in nonparametric estimation (b ≈ T^{−1/(2s+1)}). We discuss this further in Section 3.2.

3.1. A comparison of the MSEs for linear processes

In this section we compare the MSE in Theorem 3.3 with the results obtained under the stronger condition that the observations {Z_t} come from a linear process (which is a much stronger assumption than 2-mixing, see Appendix A.3). We will use the results in Appendix A.3 to show that if the process is linear, and not just mixing, then the rate of convergence is better than the rate obtained in Corollary 3.5. However, in Section 3.2 we demonstrate that misspecifying the process as linear can lead to several problems with the density estimator, including bounds which do not converge to zero. Let us suppose {Z_t} has a linear process representation and satisfies

Z_t = Σ_{j=0}^∞ a_j ε_{t−j},  (6)

where the innovations {ε_t} are iid. Under the assumptions in Lemma A.8 (see the Appendix), it can be shown that cov(K_b(Z_0), K_b(Z_t)) = O(cov(Z_0, Z_t)). Using this as the basis, Hall and Hart [25], Giraitis et al. [22], Mielniczuk [34] and Estevas and Vieu [18] have shown that the MSE is

E|f̂(z) − f(z)|² = O(b^{2s} + 1/(bT) + R_T/T),  (7)

where R_T = Σ_{t=1}^T |cov(Z_0, Z_t)|. It is clear that both cov(Z_0, Z_t) and R_T depend on the rate of decay of the parameters {a_j}. We observe that if |a_j| ≤ Cj^{−θ}, then

|cov(Z_0, Z_t)| = O(t^{−(2θ−1)}) and R_T = O(T^{−(2θ−1)+1}) if 1/2 < θ ≤ 1;
|cov(Z_0, Z_t)| = O(t^{−θ}) and R_T = O(T^{−θ+1}) if θ > 1.

Substituting these rates into (7) we see that the bound of the MSE depends on θ. We recall that a process {Z_t} is called a short memory process if Σ_t |cov(Z_0, Z_t)| < ∞; otherwise it is called a long memory process. Now studying (7) we see that R_T does not depend on the bandwidth b; in other words, long memory has no influence on the choice of the optimal bandwidth. To summarize, the rate of convergence for observations coming from a linear process is

E|f̂(z) − f(z)|² ≤ { O(T^{−(2θ−1)}), if 2θ − 1 ≤ 2s/(2s+1);  O(T^{−2s/(2s+1)}), if 2θ − 1 > 2s/(2s+1) }.  (8)

Hall and Hart [25] have shown that the rates above are optimal. On the other hand, let us recall Corollary 3.5, in particular the case v ≤ 1, where we have shown

E|f̂(z) − f(z)|² = O(T^{−v+2v/(2s+2)}).  (9)

It is difficult to directly compare (8) and (9), since (8) is in terms of the long memory parameter θ whereas (9) is in terms of the mixing size v. However, in the special case that {Z_t} is Gaussian (and thus linear), there is a one-to-one correspondence, for


example, if 2θ − 1 ≤ 1 then the covariance size and mixing size are the same, and v = (2θ − 1) = u. We illustrate the case when the mixing and the covariance sizes are the same in Figure 1 (for both large and small s).

Figure 1. The top and bottom plots correspond to s = 1 and s = 5, respectively. The x-axis is the covariance and mixing size (assuming both are the same) and the y-axis is the index δ in the MSE E|f̂(x) − f(x)|² = O(T^{−δ}). The solid line is the MSE using 2-mixing and the dotted line is the MSE when {Z_t}_t is a linear process. We have assumed that q_F = 1 (in other words ‖F^{(t)}‖_∞ < ∞).

In the non-Gaussian case,

where the 2-mixing and covariance sizes do not necessarily coincide (v ≠ u), if all the moments of {Z_t} exist, then for 1/2 < θ ≤ 1 we have that (2θ − 1)/2 ≤ v ≤ (2θ − 1) (see (A32) in the appendix). In this case, it is not clear whether the rate (9) is worse than (8). However, substituting the lower bound v ≥ (2θ − 1)/2 into Corollary 3.5 yields a rate which is slower than (8). In summary, better rates of convergence can often be obtained if the observations come from a linear process. On the other hand, 2-mixing is a weaker condition, which is satisfied by a far wider class of processes. We consider below the MSE for processes which are not linear, and show that misspecifying the model by assuming linearity when the process is nonlinear could severely affect the MSE.
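The comparison plotted in Figure 1 can be reproduced numerically; the following sketch (ours, not from the paper) evaluates the MSE exponent δ under 2-mixing (Corollary 3.5, assuming q_F = 1) and under linearity (from (8)), taking the covariance and mixing sizes to coincide, v = u = 2θ − 1.

```python
def delta_linear(size, s):
    # Rate exponent from (8) for a linear process, in terms of size = 2*theta - 1.
    return min(size, 2 * s / (2 * s + 1))

def delta_mixing(v, s, qF=1.0):
    # Rate exponent from Corollary 3.5 / (5) under 2-mixing with size v.
    if v > 1 + 1 / qF:
        return 2 * s / (2 * s + 1)
    if v > 1:
        return 2 * s / (2 * s + 2 + qF * (1 - v))
    return v * 2 * s / (2 * s + 2)

for size in (0.25, 0.5, 1.0, 1.5, 2.5):
    print(size, delta_linear(size, s=1), delta_mixing(size, s=1))
```

Under these assumptions the linear-process exponent is at least as large as the 2-mixing exponent for every size, mirroring the gap between the solid and dotted lines in Figure 1.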

3.2. The MSE for nonlinear processes

As far as we are aware, theory is lacking for processes which are nonlinear but have a small mixing size. One of the main aims of Theorem 3.3 is to fill this gap in the theory, and to derive a bound for the MSE when the observations come from nonlinear processes with small 2-mixing size.


The joint densities of processes which are nonlinear do not necessarily satisfy the density decomposition in Lemma A.8. Without this result it cannot be shown that cov(K_b(Z_0), K_b(Z_t)) = O(cov(Z_0, Z_t)), and the rates in (8) do not necessarily hold. Instead, to prove the results under Assumption 3.2, we use classical mixing inequalities to tighten the bound given in (3) (see the proof of Proposition 3.1). More precisely, to prove Theorem 3.3 we show that

|cov(K_b(Z_t − z), K_b(Z_τ − z))| ≤ C·b^{−2} min{ |t − τ|^{−v}, b^{1+q_F} },

where C is a finite constant (see Lemma A.1 for the proof). Looking at some of the implications of Theorem 3.3, we demonstrate below that several problems could arise if one were to falsely suppose that the observations come from a linear process, when they do not.

(i) In the case of linear processes, the optimal bandwidth has the same order as the bandwidth for iid random variables (regardless of long memory). The same is not necessarily true when all that is known is that the process is 2-mixing. Moreover, if the mixing size satisfies v ≤ 1 and the bandwidth is such that b²T^v remains bounded, then we see from Theorem 3.3 that the bound does not converge to zero. An important example is when the 'usual' bandwidth for linear or iid data is used (that is, b ≈ T^{−1/(2s+1)}). In this case, substituting b ≈ T^{−1/(2s+1)} into Theorem 3.3 leads to the result

E|f̂(z) − f(z)|² = { O(T^{−2s/(2s+1)}), v > 1 + 1/q_F;  O(T^{(1+q_F(1−v)−2s)/(2s+1)}), 1 < v ≤ 1 + 1/q_F;  O(T^{(2−v(2s+1))/(2s+1)}), 0 ≤ v ≤ 1 }.

Studying the rates above, we see that when v < 1 + 1/q_F the rates are slower than the rates obtained using the bandwidth which balances bias and variance (compare the above with the rates in Corollary 3.5). Moreover, in the case that v ≤ 2/(2s+1), the bound cannot be used to show consistency of the estimator, since the bound does not even converge to zero. In short, to estimate the density at any given point, the number of observations (approximately bT) needs to be much larger than in the iid case.

(ii) Rather surprisingly, even when Σ_{t=1}^∞ |cov(Z_0, Z_t)| < ∞, the 'usual O(T^{−2s/(2s+1)})' rate may not hold, unlike for linear processes. However, the usual rate does hold when v ≥ 1 + 1/q_F > 2. Therefore, even when the mixing and covariance sizes are the same, a far larger mixing size may be required to obtain the 'usual O(T^{−2s/(2s+1)})' rate of convergence.

Our results give a cautionary warning to practitioners who apply the optimal bandwidths for linear processes to nonlinear processes. In the subsequent sections, where we consider nonparametric regression problems, the assumptions and proofs will be more involved; however, the underlying message is the same. That is, more than just the second order autocovariance function may have influence on the rate of convergence, and the rate of convergence can be severely compromised if the


usual bandwidths were used.

Example 3.6 It is almost impossible to estimate the 2-mixing size from the observations, in contrast to long memory (c.f. Geweke and Porter-Hudak [21], Künsch [31] and Robinson [41]). However, to conclude this section we give an example of a nonlinear process whose 2-mixing size is less than 1 + δ, for some δ > 0. Let us consider the ARCH(∞) process (see Robinson [40]), where {Z_t} satisfies

Z_t = σ_t ε_t,  σ_t² = a_0 + Σ_{j=1}^∞ a_j Z_{t−j}²,

with E(ε_t) = 0 (estimation of ARCH(∞) parameters is considered in Subba Rao [44]). Giraitis et al. [23] have shown that if for large t, a_t ≈ t^{−(1+δ)} (for some δ > 0) and [E(ε_t⁴)]^{1/2} Σ_{j=1}^∞ a_j < 1, then |cov(Z_0², Z_t²)| ≈ t^{−(1+δ)}. That is, the absolute sum of the covariances is finite, but 'only just' if δ is small. Furthermore, if we assume that |ε_t| < 1, then it is straightforward to show that Z_t is a bounded random variable. This means that using the mixing inequality for bounded random variables (see Hall and Heyde [27]) we can show

|cov(Z_0², Z_t²)| ≤ C sup_{A∈σ(Z_0), B∈σ(Z_t)} |P(A ∩ B) − P(A)P(B)|,

for some C < ∞. Altogether, this gives a lower bound for the 2-mixing coefficients of the ARCH(∞) process with a_t ≈ t^{−(1+δ)}, and hence v ≤ (1 + δ). In other words, the 2-mixing size of some ARCH(∞) processes is small, and far from the geometric rate often assumed in nonparametric estimation. An upper bound for the 2-mixing size can be found in Fryzlewicz and Subba Rao [20]. □
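To illustrate Example 3.6, the following Python sketch (ours; all constants are arbitrary choices, and the infinite-order recursion is truncated for simulation) generates a version of this ARCH(∞) process with slowly decaying coefficients a_j ∝ j^{−(1+δ)}, bounded mean-zero innovations, and coefficients rescaled so that the stated conditions hold.

```python
import numpy as np

rng = np.random.default_rng(1)
T, J, delta, a0 = 10000, 200, 0.2, 0.1
# Coefficients a_j ~ j^{-(1+delta)}, truncated at lag J and rescaled so that
# sum_j a_j < 1; then Z_t stays bounded when |eps_t| < 1, and the condition
# [E(eps_t^4)]^{1/2} * sum_j a_j < 1 also holds (E eps^4 = 1/5 for U(-1, 1)).
a = np.arange(1, J + 1, dtype=float) ** (-(1 + delta))
a *= 0.9 / a.sum()
eps = rng.uniform(-1.0, 1.0, T)   # bounded innovations with mean zero
Z = np.zeros(T)
for t in range(T):
    past = Z[max(0, t - J):t][::-1] ** 2        # Z_{t-1}^2, ..., Z_{t-J}^2
    sigma2 = a0 + np.dot(a[:len(past)], past)   # sigma_t^2 = a0 + sum_j a_j Z_{t-j}^2
    Z[t] = np.sqrt(sigma2) * eps[t]
# cov(Z_0^2, Z_t^2) then decays roughly like t^{-(1+delta)}, up to the truncation.
```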

4. Nonparametric regression

In this section we consider nonparametric regression, with random design, where the observations are dependent. It is worth mentioning that there has been extensive research on nonparametric regression with fixed design and dependent errors (c.f. Hall and Hart [26], Csörgő and Mielniczuk [13], and the references therein). In this case one typically observes Y_t, where Y_t = ϕ(t/T) + ε_t and {ε_t}_t are stationary random variables with varying degrees of dependence. It has been shown that the rate of convergence depends on the covariance of {ε_t}_t, in particular on the absolute sum Σ_{t=1}^∞ |cov(ε_0, ε_t)|. In the random design model, one observes the stationary two-dimensional vector time series {(X_t, Z_t)}_t, where

X_t = ϕ(Z_t) + ε_t  (10)

with E(X_t | Z_t = z) = ϕ(z) and ε_t = X_t − E(X_t | Z_t). The randomness in this model is determined by two factors: the design {Z_t} and the errors {ε_t = X_t − E(X_t | Z_t)}. Therefore, unlike the fixed design model, the rate of convergence of any estimator


of ϕ must depend on the sampling properties of the design density estimator. Thus, it is clear that similar results to those in Section 3 should also apply to an estimator of ϕ. We now define the classical Nadaraya-Watson estimator of ϕ(·) and study its sampling properties under various assumptions on {(X_t, Z_t)}. Let p(x, z) be the joint density of (X_t, Z_t). The estimator is

ϕ̂(z) = ĝ(z)/f̂(z),  (11)

where ĝ(z) := (1/T) Σ_{t=1}^T X_t K_b(Z_t − z) and f̂(z) := (1/T) Σ_{t=1}^T K_b(Z_t − z) are estimators of g(z) = ∫ x p(x, z) dx and of f(z), the density of Z_t.
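A minimal sketch (ours, not from the paper) of the estimator (11); the name `nadaraya_watson` is hypothetical and the rectangular kernel is reused from Section 3.

```python
import numpy as np

def rect_kernel(u):
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def nadaraya_watson(X, Z, z, b, kernel=rect_kernel):
    """Estimator (11): phi_hat(z) = g_hat(z) / f_hat(z), where
    g_hat(z) = (1/T) sum_t X_t K_b(Z_t - z) and f_hat(z) = (1/T) sum_t K_b(Z_t - z)."""
    X, Z = np.asarray(X), np.asarray(Z)
    w = kernel((Z - z) / b) / b
    f_hat = np.mean(w)
    return np.mean(X * w) / f_hat if f_hat > 0 else np.nan
```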

We first consider the sampling properties for a particular class of models which satisfy (10). Suppose the vector time series {(X_t, Z_t)} satisfies the representation

X_t = ϕ(Z_t) + h(Z_t)η_t  (12)

for some h : ℝ → ℝ_+, where the time series {Z_t} and {η_t} are independent of each other. This class of models is similar to the fixed design model X_t = ϕ(t/T) + η_t, but in (12) the design is random and the conditional variance var(X_t | Z_t) = h(Z_t)² var(η_t) depends on the design. This model arises in various applications, and we consider one application in Example 4.5. We will show in the theorem below that the rate of convergence depends both on the mixing size of the design {Z_t} and on the size of the covariances of the process {η_t} (which we denote by u, see Definition 2.3). We require the following assumptions.

Assumption 4.1 Densities, moments and kernels
(i) For some p > 2 the functions h²·f and |ϕ|^p·f are uniformly bounded and we define q := 1 − 2/p.
(ii) Let f^{(t)} and F^{(t)} be defined as in Assumption 3.2(ii), and let

g^{(t)}(z_1, z_2) := E[X_t X_0 | Z_t = z_1, Z_0 = z_2]·f^{(t)}(z_1, z_2)

and G^{(t)} := g^{(t)} − g ⊗ g. Then ‖F^{(t)}‖_{p_F} and ‖G^{(t)}‖_{p_G} are uniformly bounded in t for some p_F, p_G > 2. We define q_F := 1 − 2/p_F, q_G := 1 − 2/p_G and q_{FG} := q_F ∧ q_G.
(iii) The kernel K has finite first and p-th moment.

Studying Assumption 4.1(i), we see that it allows for various types of growth of the regression function ϕ and the function h. The type of growth depends on the rate at which the density f decays to zero. For example, if f were the Gaussian density, then exponential growth of ϕ and h would be possible. However, as we shall demonstrate in the theorem below, the larger the p such that sup_x h(x)^p·f(x) < ∞ and sup_x |ϕ(x)|^p·f(x) < ∞, the faster the rate of convergence of |ϕ̂(z) − ϕ(z)|².

Theorem 4.2 Suppose the stationary time series {(X_t, Z_t)} satisfies (12), {Z_t}


is 2-mixing with size v and the autocovariance of the time series {η_t} has size u. Let Assumption 4.1 be fulfilled for some q, q_{FG} ∈ (0, 1). In addition assume that ϕ·f, f ∈ G^s_△ for some △, s > 0 and that f is bounded away from zero. Let the estimator ϕ̂(z) be defined as in (11), where K is a kernel of order s. Then we have for all z ∈ ℝ

|ϕ̂(z) − ϕ(z)|² = O_P(b^{2s} + b^{−1}·T^{−(u∧1)} + b^{−1−q−q_{FG}(1−[(qv)∨1])}·T^{−[(qv)∧1]}),  T → ∞.

Remark 4.3 We observe that the bound obtained in Theorem 4.2 is similar to the bound derived for the density estimator in Theorem 3.3, where

E|f̂(z) − f(z)|² = O(b^{2s} + b^{−1}·T^{−1} + b^{−2−q_F(1−[v∨1])}·T^{−[v∧1]}).  (13)

The difference is the inclusion of the covariance size u of the errors and of the q which 'balances' the tails of ϕ and f (see Assumption 4.1(i)). However, we observe that we can partition the bound in Theorem 4.2 into three cases, which are similar to the three cases considered in Remark 3.4. Most notably, we observe that if u > 1 and v > 1/q_{FG} + 1/q then we obtain the usual bound O_P(b^{2s} + b^{−1}·T^{−1}) for the squared error. It is interesting to note that in the case h^p·f and |ϕ|^p·f are uniformly bounded for all p, then q = 1 (e.g. h and ϕ are bounded functions and f is an exponential density). In this case the bounds given in (13) and Theorem 4.2 are quite similar; the main difference is the appearance of q_{FG} rather than q_F, and the term b^{−1}T^{−(u∧1)} which replaces b^{−1}T^{−1}. □

Corollary 4.4 Under the assumptions of Theorem 4.2, let b* ≈ T^{−γ/(2s+1)} with

γ := { min(u, 1), qv > 1 + q/q_{FG};  min(u, [(qv)∧1]·(2s+1)/(2s+1+q+q_{FG}(1−[(qv)∨1]))), 1 + q/q_{FG} ≥ qv }.  (14)

Then we have |ϕ̂(z) − ϕ(z)|² = O_P(T^{−(2s/(2s+1))·γ}) for all z ∈ ℝ.

Let us now compare Theorem 4.2 with the bound obtained for the deterministic design X_t = ϕ(t/T) + ε_t, where u is the covariance size of the errors. In the case of the fixed design, the bound for the deviation of the kernel estimator is O(b^{2s} + T^{−(u∧1)} b^{−1}) (c.f. Hall and Hart [26]). We see that the bound in Theorem 4.2 includes this term, but also the additional term O(b^{−1−q−q_{FG}(1−[(qv)∨1])}·T^{−[(qv)∧1]}), which is the influence of the design, in particular of v. If the mixing size of the design is sufficiently large, then the fixed design and random design estimators have the same rate of convergence, O(T^{−2s/(2s+1)}).

Example 4.5 Examples of processes which satisfy (12) are stochastic volatility models (c.f. Linton and Mammen [32]), where one observes {Y_t}, which satisfies the representation Y_t = σ(Z_t)η_t. Here {η_t} are iid random variables with E(η_t²) = 1, and {Z_t} are explanatory variables


which can include past values of Y_t. Usually in finance the object is to estimate the conditional volatility σ². By noting that Y_t² can be written as Y_t² = σ(Z_t)² + (η_t² − 1)σ(Z_t)², we see that Y_t² satisfies (12) with X_t = Y_t², the role of η_t played by (η_t² − 1) and h(·) = σ(·)². Therefore we can estimate the volatility σ(·)² using (11), where σ̂(·)² is the kernel estimator of σ(·)². Furthermore, Theorem 4.2 can be applied to obtain the rate of convergence. More precisely, let v be the mixing size of {Z_t}; noting that cov{(η_t² − 1), (η_s² − 1)} = 0 when t ≠ s, which implies u = ∞, we obtain

|σ̂(z)² − σ(z)²|² = O_P(b^{2s} + b^{−1−q−q_{FG}(1−[(qv)∨1])}·T^{−[(qv)∧1]}). □
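A toy illustration (ours, with arbitrary constants) of the volatility estimation described in Example 4.5: apply (11) to X_t = Y_t² with design Z_t = Y_{t−1}.

```python
import numpy as np

rng = np.random.default_rng(2)
T, b, z = 20000, 0.2, 0.5
sigma2 = lambda u: 0.5 + 0.25 * u ** 2           # true conditional volatility sigma(u)^2
Y = np.zeros(T)
for t in range(1, T):                            # Y_t = sigma(Y_{t-1}) * eta_t, E(eta^2) = 1
    Y[t] = np.sqrt(sigma2(Y[t - 1])) * rng.standard_normal()
X, Z = Y[1:] ** 2, Y[:-1]                        # X_t = Y_t^2, design Z_t = Y_{t-1}
w = (np.abs((Z - z) / b) <= 0.5).astype(float) / b   # rectangular K_b weights
print(np.mean(X * w) / np.mean(w), sigma2(z))    # kernel estimate of sigma(z)^2 vs truth
```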

From Corollary 4.4 we see that there are two factors which affect the rate of convergence: the mixing size v of the random design {Z_t} and the size u of the covariance function of {η_t}. There are, however, several models of interest which do not satisfy condition (12). In this case Theorem 4.2 cannot be applied, and it is of interest to investigate what happens in the general case. Examples of models which do not necessarily satisfy (12) include the Cheng-Robinson model, where {X_t} satisfies the representation X_t = F(U_t) + G(U_t, Y_t) with E(G(U_t, Y_t) | U_t) = 0 and {Y_t} is a long memory process which is independent of the weakly dependent design random variables {U_t} (c.f. Cheng and Robinson [10], Csörgő and Mielniczuk [14, 15]). However, the results are derived under the assumption that {Y_t} comes from a linear process and G(·) has a particular form. An alternative approach is developed in Bosq [3], who considers nonparametric prediction for time series, where one observes the stationary time series {(X_t, Z_t)} and the parameter of interest is ϕ(z) = E(X_t | Z_t = z). The sampling results in Bosq [3] are based on the assumption that the mixing size of {(X_t, Z_t)} is sufficiently large (thus excluding Cheng-Robinson type models), yielding an estimator which has the same rate as the kernel estimator for iid random variables.

We now consider the sampling properties of ϕ̂ when the observations {(X_t, Z_t)} satisfy the general model defined in (10), and dependence is quantified through its 2-mixing size, which can be arbitrary. We will use the following assumptions.

Assumption 4.6 Densities, moments and kernels
(i) Let E|X_t|^p < ∞ for some p > 2 and define g^{(p)}(z) := E[|X_t|^p | Z_t = z]·f(z). Then the functions g^{(p)} and f are uniformly bounded and we define q := 1 − 2/p.
(ii) Let f^{(t)} and F^{(t)} be defined as in Assumption 3.2(ii), and let g^{(t)} and G^{(t)} be defined as in Assumption 4.1(ii). Then ‖F^{(t)}‖_{p_F} and ‖G^{(t)}‖_{p_G} are uniformly bounded in t for some p_F, p_G > 2, where we define q_F := 1 − 2/p_F, q_G := 1 − 2/p_G and q_{FG} := q_F ∧ q_G.
(iii) The kernel K has finite first and p-th moment.


We note that the assumptions above are similar to Assumption 4.1. The difference lies in Assumption 4.1(i) and Assumption 4.6(i): Assumption 4.6(i) is in terms of moments, whereas Assumption 4.1(i) is in terms of functions. In the following theorem we derive an error bound for the estimator ϕ̂.

Theorem 4.7 Suppose the stationary time series {(X_t, Z_t)} satisfies (10) and is 2-mixing with size v. Furthermore, suppose Assumption 4.6 is fulfilled for some q_{FG}, q ∈ (0, 1). In addition assume that ϕ·f, f ∈ G^s_△ for some △, s > 0 and that f is bounded away from zero. Let the estimator ϕ̂(z) be defined as in (11), where K is a kernel of order s. Then we have for all z ∈ ℝ

|ϕ̂(z) − ϕ(z)|² = O_P(b^{2s} + b^{−1}·T^{−1} + b^{−1−q−q_{FG}(1−[(qv)∨1])}·T^{−[(qv)∧1]}),  T → ∞.

We now obtain the rates of convergence by balancing the three terms in the bound of the last assertion.

Corollary 4.8 Under the assumptions of Theorem 4.7, let b* ≈ T^{−γ/(2s+1)} with

γ := { 1, qv > 1 + q/q_{FG};  [(qv)∧1]·(2s+1)/(2s+1+q+q_{FG}(1−[(qv)∨1])), 1 + q/q_{FG} ≥ qv }.  (15)

Then we have |ϕ̂(z) − ϕ(z)|² = O_P(T^{−(2s/(2s+1))·γ}) for all z ∈ ℝ.
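As with Corollary 3.5, the rule (15) is straightforward to evaluate; the helper below (our transcription, hypothetical name, not part of the paper) returns the bandwidth exponent γ.

```python
def gamma_regression(v, q, qFG, s):
    """Exponent gamma in b* ~ T^{-gamma/(2s+1)}, transcribed from (15) of Corollary 4.8;
    then |phi_hat(z) - phi(z)|^2 = O_P(T^{-2s/(2s+1) * gamma})."""
    qv = q * v
    if qv > 1 + q / qFG:
        return 1.0
    return min(qv, 1.0) * (2 * s + 1) / (2 * s + 1 + q + qFG * (1 - max(qv, 1.0)))

print(gamma_regression(v=4.0, q=0.8, qFG=0.8, s=2))  # large q*v: gamma = 1
print(gamma_regression(v=0.8, q=0.8, qFG=0.8, s=2))  # small q*v: gamma < 1
```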

5. Discussion

In this paper we have considered nonparametric estimation for dependent data, focusing on the case where the observations are nonlinear and highly dependent. We have obtained bounds for the kernel density estimator and also rates of convergence for two types of nonparametric regression models, both using the 2-mixing dependence measure. We show that when the assumption of linearity is relaxed, the rate of convergence does not necessarily depend on the autocovariance function of the observations. As we are working under relatively weak conditions, we do not claim that the bounds obtained are minimax. However, the bounds can be considered as a worst case scenario for the nonparametric estimator. In future work, it would be of interest to investigate whether the bounds in this paper are indeed close to minimax for certain nonlinear time series. It would also be of interest to develop bandwidth selection methods when the 2-mixing size of the observations is unknown.

Acknowledgments

The authors are grateful to Rainer Dahlhaus, Rafal Kulik and two anonymous referees for making several useful suggestions. This work has been supported by the Deutsche Forschungsgemeinschaft under DA 187/15-1.


Appendix A.

A.1. Proofs: Nonparametric density estimation

We now prove the results in Section 3.

Lemma A.1 Suppose the time series {Z_t} is 2-mixing with size v and Assumption 3.2 is fulfilled for some q_F ∈ (0, 1). If 1 ≤ t, τ ≤ T, then¹

|cov{K_b(Z_t − z), K_b(Z_τ − z)}| ≲ min{ b^{−(1−q_F)}; b^{−2}|t − τ|^{−v} }.  (A1)

Proof. Writing the covariance as an integral, and using the notation in Assumption 3.2(ii), we have

cov{K_b(Z_t − z), K_b(Z_τ − z)} = ∫ K_b(u − z) K_b(v − z) F^{(t−τ)}(u, v) du dv.

Now by using Hölder's inequality with p_F^{−1} + p̄_F^{−1} = 1, and recalling that q_F = 1 − 2/p_F, it is clear that

|cov{K_b(Z_t − z), K_b(Z_τ − z)}| ≤ (1/b²)·b^{2/p̄_F}·‖K‖²_{p̄_F}·‖F^{(t−τ)}‖_{p_F} ≲ b^{−(1−q_F)}.

Using Assumption 3.2 we have that ‖F^{(t−τ)}‖_{p_F} is uniformly bounded, and by Lyapunov's inequality ‖K‖_{p̄_F} < ∞ for all 1 < p̄_F < 2. This gives us the first bound in (A1). On the other hand, under Assumption 3.2(iii) the kernel K is uniformly bounded and therefore, using the 2-mixing property of {Z_t} together with Hall and Heyde [27], Theorem A.5, we obtain |cov{K_b(Z_t − z), K_b(Z_τ − z)}| ≲ b^{−2}·|t − τ|^{−v}, which gives the second bound in (A1). □

¹ We write A ≲ B when there exists a positive constant c independent of A and B such that A ≤ cB.

Proof of Theorem 3.3. We mention that parts of the following proof are motivated by techniques used in Bosq [3]. Consider the standard decomposition

E|f̂(z) − f(z)|² = var(f̂(z)) + |E f̂(z) − f(z)|².  (A2)

Under the stated assumptions we will derive the following two bounds, which together give the result of the theorem. The bias is bounded by

|E f̂(z) − f(z)|² ≲ b^{2s},  (A3)

while for the variance we have

var(f̂(z)) ≲ T^{−1}·b^{−1} + T^{−[v∧1]}·b^{−(2+q_F−q_F[v∨1])}.  (A4)


Proof of (A3). We can write

E f̂(z) = (1/T) Σ_{t=1}^T E K_b(Z_t − z) = ∫ du f(u) K_b(u − z).

Since f ∈ G^s_△ and K is a kernel of order s with ∫ du |u|^s K(u) ≤ S_K, using a Taylor expansion up to the order s leads to E f̂(z) = f(z) + b^s R with remainder |R| ≤ △S_K < ∞, which proves (A3).

In order to prove (A4), we consider the expansion

var(f̂(z)) = (1/T²) Σ_{t=1}^T var{K_b(Z_t − z)} + (2/T²) Σ_{t>τ} cov{K_b(Z_t − z), K_b(Z_τ − z)} =: A_1 + A_2.  (A5)

We will show that |A_1| ≲ T^{−1}·b^{−1} and

|A_2| ≲ { T^{−v}·b^{−2}, v ≤ 1;  T^{−1}·{b^{−1} + b^{−(2+q_F−q_F v)}}, 1 < v }.  (A6)

Furthermore, if 0 ≤ v ≤ 1/q_F + 1 then |A_1| is dominated by |A_2|, whereas for v > 1/q_F + 1 the terms |A_1| and |A_2| are of the same order O(T^{−1}b^{−1}). Therefore, the bounds derived for |A_2| will lead to (A4).

First let us consider A_1. Due to stationarity, we have the bound

T·A_1 ≤ E[K_b²(Z_1 − z)] = ∫ du f(u) K_b²(u − z).

Since under the stated assumptions ‖K‖_2 < ∞ and the density f is uniformly bounded, this leads to A_1 ≲ T^{−1}·b^{−1}.

The term T·|A_2| is bounded by the sum 4 Σ_{t=2}^T |cov{K_b(Z_t − z), K_b(Z_1 − z)}|. If v ≤ 1 then we estimate the sum using the second bound in Lemma A.1, i.e., T·|A_2| ≲ T^{−v+1} b^{−2}, which is the first bound in (A6). On the other hand, if v > 1 we partition the sum into two parts which we estimate separately using the bounds in Lemma A.1, thus giving us

T·|A_2| ≲ Σ_{t=2}^{h} b^{−(1−q_F)} + Σ_{t=h+1}^{T} b^{−2} t^{−v} ≲ h·b^{−(1−q_F)} + h^{−v+1}·b^{−2}.

Thereby using h ≈ b^{−q_F} we obtain T·|A_2| ≲ b^{−1} + b^{−(2+q_F−q_F v)}, i.e., the second bound in (A6). Thus we have proved (A4). □

Proof of Corollary 3.5. Under the assumption on the bandwidth the result is obtained by balancing the terms in the bound given in Theorem 3.3. □

A.2. Proofs: Nonparametric regression

We now prove the results in Section 4.

Lemma A.2 Suppose the stationary time series {(X_t, Z_t)} satisfies (12), {Z_t} is 2-mixing with size v and the autocovariances of the time series {η_t} have size u (see Definition 2.3). Suppose Assumption 4.1 is fulfilled for some q, q_G ∈ (0, 1). If 1 ≤ t, τ ≤ T, then

|cov{X_t K_b(Z_t − z), X_τ K_b(Z_τ − z)}| ≲ b^{−(1−q_G)},  (A7)
|cov{ϕ(Z_t)K_b(Z_t − z), ϕ(Z_τ)K_b(Z_τ − z)}| ≲ b^{−(1+q)}|t − τ|^{−qv},  (A8)
|cov{h(Z_t)K_b(Z_t − z)η_t, h(Z_τ)K_b(Z_τ − z)η_τ}| ≲ b^{−1}|t − τ|^{−u}.  (A9)

Proof. Using the notation in Assumption 4.1 together with Hölder's inequality, and recalling that q_G = 1 − 2/p_G with p_G^{−1} + p̄_G^{−1} = 1, we have |cov{X_t K_b(Z_t − z), X_τ K_b(Z_τ − z)}| ≲ b^{−(1−q_G)}, where we use that under Assumption 4.1, K has a finite p̄_G-th moment for 1 < p̄_G < p (by Lyapunov's inequality) and ‖G^{(t−τ)}‖_{p_G} is uniformly bounded. This gives us (A7).

We now prove (A8). Under Assumption 4.1 the function |ϕ|^p·f is uniformly bounded and ‖K‖_p is finite for some p = 2/(1 − q) > 2; therefore we have [E|ϕ(Z_1)K_b(Z_1 − z)|^p]^{2/p} ≲ b^{−(q+1)}. Using the 2-mixing property of {Z_t} together with Hall and Heyde [27], Theorem A.6, we obtain (A8).

We now prove (A9). The series {Z_t} and {η_t} are independent; therefore expanding the term A := cov{h(Z_t)K_b(Z_t − z)η_t, h(Z_τ)K_b(Z_τ − z)η_τ} gives

A = cov(η_t, η_τ)·E[h(Z_t)K_b(Z_t − z) h(Z_τ)K_b(Z_τ − z)].

Since the covariance of the time series {η_t} has size u, applying the Cauchy-Schwarz inequality gives

|A| ≲ |t − τ|^{−u}·E|h(Z_1)K_b(Z_1 − z)|².

Under Assumption 4.1 the function |h|²·f is uniformly bounded and ‖K‖_2 < ∞, therefore E|h(Z_1)K_b(Z_1 − z)|² ≲ b^{−1}, and hence we obtain (A9). □

Lemma A.3 Suppose the stationary time series {Z_t} is 2-mixing with size v and Assumption 4.1 is fulfilled for some q, q_F ∈ (0, 1). If 1 ≤ t, τ ≤ T, then

|cov{K_b(Z_t − z), K_b(Z_τ − z)}| ≲ min{ b^{−(1−q_F)}; b^{−(1+q)}|t − τ|^{−qv} }.  (A10)

Proof. The proof is very similar to the proof of Lemma A.2 and we omit the details. □


Lemma A.4 Suppose the assumptions in Theorem 4.2 are satisfied. Let ĝ be defined as in (11). Then we have

E|ĝ(z) − g(z)|² ≲ b^{2s} + b^{−1}T^{−1} + b^{−(1+q+q_G(1−[(qv)∨1]))}T^{−[(qv)∧1]} + b^{−1}T^{−(u∧1)}.  (A11)

Proof. Consider the standard variance bias decomposition

E|ĝ(z) − g(z)|² = var(ĝ(z)) + |E ĝ(z) − g(z)|².  (A12)

Under the stated assumptions we will derive the following two bounds, which altogether give the estimate in (A11). The bias is bounded by

|E ĝ(z) − g(z)|² ≲ b^{2s},  (A13)

while for the variance we have

var(ĝ(z)) ≲ b^{−1}T^{−1} + b^{−(1+q+q_G(1−[(qv)∨1]))}T^{−[(qv)∧1]} + b^{−1}T^{−(u∧1)}.  (A14)

We first prove (A13). We can write

E ĝ(z) = (1/T) Σ_{t=1}^T E[ E[X_t | Z_t] K_b(Z_t − z) ] = ∫ du g(u) K_b(u − z).

Since g ∈ G^s_△ and K is a kernel of order s with ∫ du |u|^s K(u) ≤ S_K, using a Taylor expansion up to the order s leads to E ĝ(z) = g(z) + b^s R with remainder |R| ≤ △S_K < ∞, which proves (A13).

In order to prove (A14), we consider the expansion

var(ĝ(z)) = (1/T²) Σ_{t=1}^T var{X_t K_b(Z_t − z)} + (2/T²) Σ_{t>τ} cov{X_t K_b(Z_t − z), X_τ K_b(Z_τ − z)} =: A_1 + A_2.  (A15)

We will show that |A_1| ≲ T^{−1}·b^{−1} + T^{−(u∧1)}·b^{−1} and

|A_2| ≲ { T^{−qv}·b^{−(1+q)} + T^{−(u∧1)}·b^{−1}, qv ≤ 1;  T^{−1}·b^{−1} + T^{−1}·b^{−(1+q+q_G(1−qv))} + T^{−(u∧1)}·b^{−1}, 1 < qv }.  (A16)

Furthermore, if 0 ≤ qv ≤ q/q_G + 1 then we show that |A_1| is dominated by |A_2|, whereas for qv > q/q_G + 1 the terms |A_1| and |A_2| are of the same order O(T^{−1}·b^{−1} + T^{−(u∧1)}·b^{−1}). Therefore, the bounds derived for |A_2| will lead to the estimates in (A14).

First let us consider A_1. Due to stationarity of the process, we have the bound

T·A_1 ≤ E|X_1 K_b(Z_1 − z)|² ≲ E|ϕ(Z_1)K_b(Z_1 − z)|² + E|h(Z_1)K_b(Z_1 − z)|².


Under the stated assumptions the functions |ϕ|^p·f with p > 2 and |h|²·f are uniformly bounded and ‖K‖_2 < ∞, therefore A_1 ≲ T^{−1}·b^{−1}. Let us now consider the term A_2, which is bounded by

T·|A_2| ≤ 4 Σ_{t=2}^T |cov{X_t K_b(Z_t − z), X_1 K_b(Z_1 − z)}|,  (A17)

where using representation (12) the t-th summand in (A17) can be estimated by

|cov{ϕ(Z_t)K_b(Z_t − z), ϕ(Z_1)K_b(Z_1 − z)}| + |cov{h(Z_t)K_b(Z_t − z)η_t, h(Z_1)K_b(Z_1 − z)η_1}|.  (A18)

If qv ≤ 1 and u ≤ 1 then we bound the sum (A17) using (A18), i.e.,

T·|A_2| ≲ Σ_{t=2}^T |cov{ϕ(Z_t)K_b(Z_t − z), ϕ(Z_1)K_b(Z_1 − z)}| + Σ_{t=2}^T |cov{h(Z_t)K_b(Z_t − z)η_t, h(Z_1)K_b(Z_1 − z)η_1}|.  (A19)

We use the bounds (A8) and (A9) in Lemma A.2 to estimate each of the sums in (A19) separately, which gives

T·|A_2| ≲ T^{−qv+1}·b^{−(1+q)} + T^{−u+1}·b^{−1}.  (A20)

On the other hand, if qv > 1 or if u > 1 we partition the sum (A17) into two parts, where we estimate the first part using the bound (A7) in Lemma A.2 and the second using (A18), thus giving us

T·|A_2| ≲ h·b^{−(1−q_G)} + Σ_{t=h+1}^T |cov{ϕ(Z_t)K_b(Z_t − z), ϕ(Z_1)K_b(Z_1 − z)}| + Σ_{t=h+1}^T |cov{h(Z_t)K_b(Z_t − z)η_t, h(Z_1)K_b(Z_1 − z)η_1}|.  (A21)

We use the bounds (A8) and (A9) in Lemma A.2 to estimate each of the sums in (A21) separately, which gives

T·|A_2| ≲ h·b^{−(1−q_G)} + { T^{−qv+1}·b^{−(1+q)} + h^{−u+1}·b^{−1}, qv ≤ 1 and u > 1;  h^{−qv+1}·b^{−(1+q)} + T^{−u+1}·b^{−1}, qv > 1 and u ≤ 1;  h^{−qv+1}·b^{−(1+q)} + h^{−u+1}·b^{−1}, qv > 1 and u > 1 }.  (A22)


Thereby using h ≈ b^{−q_G} we obtain

T·|A_2| ≲ { b^{−1} + T^{−qv+1}·b^{−(1+q)} + b^{−(1+q_G(1−u))}, qv ≤ 1 and u > 1;  b^{−1} + b^{−(1+q+q_G(1−qv))} + T^{−u+1}·b^{−1}, qv > 1 and u ≤ 1;  b^{−1} + b^{−(1+q+q_G(1−qv))} + b^{−(1+q_G(1−u))}, qv > 1 and u > 1 },  (A23)

and hence, combining (A20) and (A23) gives the bound (A16) for the term A_2. □

We now state a slight variation of Theorem 3.3, where f can satisfy slightly weaker conditions. We use this result to prove Theorem 4.2.

Lemma A.5 Suppose the stationary time series {Z_t} is 2-mixing with size v and Assumption 4.1 is fulfilled for some q, q_F ∈ (0, 1). In addition assume that the function f belongs to G^s_△ for s, △ > 0. Let f̂ be defined as in (1), where the kernel is of order s. Then we have

E|f̂(z) − f(z)|² ≲ b^{2s} + b^{−1}T^{−1} + b^{−(1+q+q_F(1−[(qv)∨1]))}T^{−[(qv)∧1]}.  (A24)

Proof. Under the stated assumptions, using Lemma A.3 the proof is very similar to the proof of Lemma A.4 and we omit the details. □

Proof of Theorem 4.2. Consider the decomposition

ϕ̂(z) − ϕ(z) = ĝ(z)/f̂(z) − ϕ(z) = (ĝ(z) − f̂(z)ϕ(z))/f(z) + ((f(z) − f̂(z))/f̂(z)) · (ĝ(z) − f̂(z)ϕ(z))/f(z).

We first note that Lemma A.5 gives E|f(z) − f̂(z)|² = o(1), which implies that |f̂(z)^{−1}| is bounded in probability. Therefore the second term in the above expansion is of order o_P({ĝ(z) − f̂(z)ϕ(z)}/f(z)); hence in the decomposition above the second term is negligible in comparison to the first term. Thereby bounding the first term of the decomposition we obtain the result: by using Lemmas A.4 and A.5 and noting that q_{FG} = q_F ∧ q_G, we obtain Theorem 4.2. □

Proof of Corollary 4.4. Under the assumption on the bandwidth the result is obtained by balancing the terms in the bound given in Theorem 4.2. □

Lemma A.6 Suppose the stationary vector time series {(X_t, Z_t)} is 2-mixing with size v and Assumption 4.6 is fulfilled for some q, q_G ∈ (0, 1). If 1 ≤ t, τ ≤ T, then

|cov{X_t K_b(Z_t − z), X_τ K_b(Z_τ − z)}| ≲ min{ b^{−(1−q_G)}, b^{−(1+q)}|t − τ|^{−qv} }.  (A25)

Proof. Under Assumption 4.6(ii) the first bound in (A25) follows as for (A7) in Lemma A.2. On the other hand, under Assumption 4.6(i, iii) the function


E[|X_1|^p | Z_1]·f is uniformly bounded and ‖K‖_p is finite for some p = 2/(1 − q) > 2; therefore we have [E|X_1 K_b(Z_1 − z)|^p]^{2/p} ≲ b^{−(q+1)}. Using the 2-mixing property of {(X_t, Z_t)} together with Hall and Heyde [27], Theorem A.6, we obtain the second bound in (A25). □

Lemma A.7 Suppose the stationary vector time series {(X_t, Z_t)} is 2-mixing with size v and Assumption 4.6 is fulfilled for some q, q_G ∈ (0, 1). In addition assume that the function g = ϕ·f belongs to G^s_△ for s, △ > 0. Let ĝ be defined as in (11), where the kernel is of order s. Then we have

E|ĝ(z) − g(z)|² ≲ b^{2s} + b^{−1}T^{−1} + b^{−(1+q+q_G(1−[(qv)∨1]))}T^{−[(qv)∧1]}.  (A26)

Proof. Under the stated assumptions, using Lemma A.6 the proof is very similar to the proof of Lemma A.4 and we omit the details. □

Proof of Theorem 4.7. Using Lemmas A.5 and A.7 we obtain the result using a similar proof to that of Theorem 4.2. □

Proof of Corollary 4.8. Under the assumption on the bandwidth the result is obtained by balancing the terms in the bound given in Theorem 4.7. □

A.3. Covariances and 2-mixing rates for linear processes

We use the results derived in this section in Section 3.1, where we compared the rates of convergence for linear processes with the rates in the general 2-mixing case. Let us suppose {Z_t} satisfies the linear process representation in (6). By placing some additional conditions on the innovations we have the following lemma, which is due to Giraitis et al. [22], Lemmas 1 and 2.

Lemma A.8 (Giraitis et al. [22]) Suppose {Z_t} is a linear process which satisfies (6), and cov(Z_0, Z_t) ≤ Ct^{−θ}. Let f be the density of Z_t and let f_t denote the joint density of (Z_0, Z_t). If E(|ε_t|³) < ∞, and for all u ∈ ℝ the characteristic function satisfies |E[exp(−iuε_1)]| ≤ 1/(1+|u|)^δ for some δ > 0, then the joint density satisfies the relation

f_t(x, y) = f(x)f(y) + r(t)f′(x)f′(y) + O(t^{−θ−d}),

where f′ ∈ L_1(ℝ) and r(t) = cov(Z_0, Z_t), for some 0 < d < min(θ/7, (1−θ)/12).

Using the result above, the MSE of the kernel estimator with observations from a linear process can be derived. For most processes there isn't a direct correspondence between the 2-mixing and the covariance size. However, for Gaussian processes both sizes are linked by the inequality

(2π)^{−1}·|cov(Z_0, Z_t)|/var(Z_0) ≤ sup_{A∈σ(Z_0), B∈σ(Z_t)} |P(A ∩ B) − P(A)P(B)| ≤ 2π·|cov(Z_0, Z_t)|/var(Z_0)  (A27)


(see Doukhan [17], Section 2.1); thus the covariance and the 2-mixing sizes are the same. Suppose that {Z_t} satisfies (6), where the innovations are Gaussian and |a_j| ≲ j^{−θ}. Then we have

|cov(Z_0, Z_t)|/var(Z_0) and sup_{A∈σ(Z_0), B∈σ(Z_t)} |P(A ∩ B) − P(A)P(B)| ≲ { t^{−(2θ−1)}, if 1/2 < θ ≤ 1;  t^{−θ}, if θ > 1 }.  (A28)

We now consider more general linear processes, which are not necessarily Gaussian. Then the covariance size does not immediately give the 2-mixing size. However, if the density of the innovations satisfies certain smoothness conditions, then we can obtain the following bound. We note that the supremum below is taken over a larger sigma-algebra than the one used in the definition of 2-mixing.

Lemma A.9 Suppose {Z_t} is a linear process which satisfies the representation Z_t = Σ_{j=0}^∞ a_j ε_{t−j}, where the function 1/(Σ_{j=0}^∞ a_j z^j) = Σ_{j=0}^∞ b_j z^j is analytic inside and on the unit circle. Let f_ε be the density of the innovation ε_t, which satisfies ∫ |f_ε(x + a) − f_ε(x)| dx ≤ C|a|. If E(|ε_t|^ℓ) < ∞, then

sup_{A∈σ(Z_0,Z_{−1},…), B∈σ(Z_t)} |P(A ∩ B) − P(A)P(B)| ≤ C (Σ_{j=t+1}^∞ a_j²)^{ℓ/(2(ℓ+1))},

where C is an arbitrary constant.

Proof. The following proof is motivated by Chanda [8], Gorodetskii [24] and Davidson [16] (Theorem 14.9). To prove the result, we use the decomposition Z_t = a_0 ε_t + Σ_{j=1}^t a_j ε_{t−j} + V_t, where V_t = Σ_{j=t+1}^∞ a_j ε_{t−j}. We first note that under the stated assumptions {Z_t} is an invertible time series, that is ε_t = Σ_{j=0}^∞ b_j Z_{t−j}; hence σ(Z_t, Z_{t−1}, …) = σ(ε_t, ε_{t−1}, …), and the event {V_t ≤ η} ∈ σ(Z_0, Z_{−1}, …). Now by slightly adapting Proposition 2.1 in Fryzlewicz and Subba Rao [20], we obtain

(A29)

A∈σ(Zt )

B∈σ(Z0 ,Z−1 ,...)

≤ 2 sup |v|≤η

Z

fεt−1 (εt−1 )

Z

fZ

 , v) − f (x|ε (x|ε , 0) dx dεt−1 + 3P (|Vt | ≥ η), Zt |εt−1 ,Vt t−1 t−1 t |εt−1 ,Vt

where fZt |εt−1 ,Vt is the conditional density of Zt given εt−1 = (εt−1 , . . . , ε1 ) and Vt . We now bound each of the terms in (A29). We first obtain an expression the conditional density fZt |εt−1 ,Vt in terms of Pt the innovation density fε . Since εt = a−1 0 (Zt − j=1 aj εt−j − Vt ) we have Pt −1 −1 fZt |εt−1 ,Vt (x|εt−1 , v) = a0 fZ (a0 (xt − j=1 aj εt−j − z)). Substituting this into (A29) and by using the stated assumptions we have Z

fεt−1 (ε)

Z

fZ

 (x|εt−1 , v) − fZt |εt−1 ,Vt (x|εt−1 , 0) dx dεt−1 ≤ K|z|. (A30) t |εt−1 ,Vt

March 9, 2009

14:57

Journal of Nonparametric Statistics

24

Johannes-Subba-Rao

J. Johannes and S. Subba Rao

This means that for all η > 0

sup_{A∈σ(Z_t), B∈σ(Z_0,Z_{−1},…)} |P(A ∩ B) − P(A)P(B)| ≤ 2K|η| + 3P(|V_t| ≥ η),  (A31)

where K is a constant independent of η. We now bound P(|V_t| ≥ η). Using the Markov and Burkholder inequalities we have

P(|V_t| ≥ η) ≤ E|V_t|^ℓ / η^ℓ ≲ 2^{ℓ−1} E(|ε_t|^ℓ) (Σ_{j=t+1}^∞ a_j²)^{ℓ/2} / η^ℓ.

Substituting the above bound into (A31) gives sup A∈σ(Zt )

B∈σ(Z0 ,Z−1 ,...)

P 2 ℓ/2 )  ( ∞ j=t+1 aj ) |P (A ∩ B) − P (A)P (B)| ≤ 2K |η| + . ηℓ 

The minimum of the right hand side of the above is obtained when η = P 2 ℓ/(2(ℓ+1)) , giving ( ∞ j=t+1 aj ) |P (A ∩ B) − P (A)P (B)| ≤ K(

sup A∈σ(Zt )

∞ X

a2j )ℓ/(2(ℓ+1)) .

j=t+1

B∈σ(Z0 ,Z−1 ,...)

Thus yielding the desired result.
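As a quick check that this choice of $\eta$ balances the two terms (a verification, not part of the original proof): writing $S_t = \sum_{j=t+1}^{\infty} a_j^2$ and $\eta = S_t^{\ell/(2(\ell+1))}$,
\[
\frac{S_t^{\ell/2}}{\eta^{\ell}} = S_t^{\frac{\ell}{2} - \frac{\ell^2}{2(\ell+1)}} = S_t^{\frac{\ell(\ell+1)-\ell^2}{2(\ell+1)}} = S_t^{\frac{\ell}{2(\ell+1)}} = \eta,
\]
so both terms on the right hand side are of the same order, which is exactly the rate appearing in the lemma.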



Remark A.10 (i) Let us suppose the parameters of the linear process satisfy $|a_j| \le Cj^{-\theta}$ (with $\theta > 1/2$ and $E(|\varepsilon_t|^{\ell}) < \infty$). Then $\sum_{j=t+1}^{\infty} a_j^2 \lesssim t^{-(2\theta-1)}$, and we have
\[
\sup_{A\in\sigma(Z_0,Z_{-1},\ldots),\,B\in\sigma(Z_t)} |P(A\cap B) - P(A)P(B)| \le C t^{\frac{\ell(-2\theta+1)}{2(\ell+1)}},
\]
where $C$ is a finite constant.
(ii) It is interesting to compare the 2-mixing sizes derived in Lemma A.9 with the strong $\alpha$-mixing results for MA($\infty$) processes. Under the same set of conditions, but with the additional restriction that $\theta > 3/2$, we have
\[
\sup_{A\in\sigma(Z_0,Z_{-1},\ldots),\,B\in\sigma(Z_t,Z_{t+1},\ldots)} |P(A\cap B) - P(A)P(B)| \lesssim |t|^{\frac{\ell(-2\theta+1)}{2(\ell+1)}+1}.
\]
In other words, the 2-mixing size is larger than the $\alpha$-mixing size (note that the exponent above is negative only if $\theta > 3/2 + 1/\ell$, which explains the additional restriction on $\theta$). This is because, by definition, the $\sigma$-algebras involved in the definition of $\alpha$-mixing are far larger than the $\sigma$-algebras in the definition of 2-mixing, thus allowing more extreme cases.

Comparing Lemma A.9 with the covariance size given in (A28), we see that when the Gaussianity assumption is relaxed the covariance and 2-mixing sizes no longer necessarily coincide.


However, by using Lemma A.9 and Hall and Heyde [27], Theorem A.5, we have the upper and lower bounds
\[
t^{\frac{\ell(-2\theta+1)}{\ell-2}} \;\lesssim\; \sup_{A\in\sigma(Z_0),\,B\in\sigma(Z_t)} |P(A\cap B) - P(A)P(B)| \;\lesssim\; t^{\frac{\ell(-2\theta+1)}{2(\ell+1)}}.
\]
Therefore the 2-mixing size $v$ of the linear process $\{Z_t\}$ is bounded by
\[
(2\theta-1)\,\frac{\ell}{2(\ell+1)} \;\le\; v \;\le\; (2\theta-1)\,\frac{\ell}{\ell-2}. \tag{A32}
\]
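To make the gap between these bounds concrete, the following sketch tabulates the two sides of (A32) for a few illustrative values of $\theta$ and $\ell$ (the function name and parameter values are hypothetical choices, not taken from the paper):

# Bounds on the 2-mixing size v from (A32):
# (2*theta - 1) * ell / (2*(ell + 1)) <= v <= (2*theta - 1) * ell / (ell - 2).
def mixing_size_bounds(theta, ell):
    lower = (2 * theta - 1) * ell / (2 * (ell + 1))
    upper = (2 * theta - 1) * ell / (ell - 2)
    return lower, upper

for theta in (1.0, 1.5):
    for ell in (4, 8, 32):
        lo, hi = mixing_size_bounds(theta, ell)
        # As ell -> infinity the bounds tend to (2*theta-1)/2 and (2*theta-1).
        print(f"theta={theta}, ell={ell}: {lo:.3f} <= v <= {hi:.3f}")

As $\ell \to \infty$ the two sides approach $(2\theta-1)/2$ and $2\theta-1$ respectively, so even when all moments exist the bounds differ by a factor of two in the exponent.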

References

[1] K. B. Athreya and S. Pantula. Mixing properties of Harris chains and autoregressive processes. J. Appl. Probab., 23:880–892, 1986.
[2] B. Basrak, R. Davis, and T. Mikosch. Regular variation of GARCH processes. Stochastic Processes and their Applications, 99:95–115, 2002.
[3] D. Bosq. Nonparametric Statistics for Stochastic Processes. Springer, New York, 1998.
[4] F. Bousamma. Ergodicité, mélange et estimation dans les modèles GARCH. PhD thesis, Paris 7, 1998.
[5] R. C. Bradley. Introduction to Strong Mixing Conditions, Volumes 1, 2 and 3. Kendrick Press, 2007.
[6] R. C. Bradley. A covariance inequality under a two-part dependence assumption. Statistics and Probability Letters, 30:287–293, 1996.
[7] A. Bryk and J. Mielniczuk. Asymptotic properties of density estimates for linear processes: applications of linear projection method. J. Nonparametric Statistics, 17:121–133, 2005.
[8] K. C. Chanda. Strong mixing properties of linear stochastic processes. J. Appl. Probab., 11:401–408, 1974.
[9] B. Cheng and P. M. Robinson. Density estimation in strongly dependent non-linear time series. Statistica Sinica, 1:335–359, 1991.
[10] B. Cheng and P. M. Robinson. Semiparametric estimation from time series with long-range dependence. Journal of Econometrics, 94:335–353, 1994.
[11] G. Claesken and P. Hall. Effect of dependence on stochastic measures of accuracy of density estimators. Ann. Statist., 30:431–454, 2002.
[12] D. Cline and H. Pu. Geometric ergodicity of nonlinear time series. Statistica Sinica, 9:1103–1118, 1999.
[13] S. Csörgő and J. Mielniczuk. Nonparametric regression under long-range dependent normal errors. Ann. Statist., 23:1000–1014, 1995.
[14] S. Csörgő and J. Mielniczuk. The smoothing dichotomy in random-design regression with long-memory based on moving averages. Statistica Sinica, 10:771–787, 2001.
[15] S. Csörgő and J. Mielniczuk. Random-design regression under long-range dependent errors. Bernoulli, 5:209–224, 1999.
[16] J. Davidson. Stochastic Limit Theory. Oxford University Press, Oxford, 1994.
[17] P. Doukhan. Mixing: Properties and Examples. Springer, New York, 1994.
[18] G. Estevas and P. Vieu. Nonparametric estimation under long memory dependence. Nonparametric Statistics, 15:535–551, 2003.
[19] J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York, 2005.
[20] P. Fryzlewicz and S. Subba Rao. On mixing properties of ARCH and time-varying ARCH processes. Preprint, 2007.
[21] J. Geweke and S. Porter-Hudak. The estimation and application of long memory time series models. Journal of Time Series Analysis, 4:221–238, 1983.
[22] L. Giraitis, H. L. Koul, and D. Surgailis. Asymptotic normality of regression estimators with long memory errors. Statistics and Probability Letters, 29:317–335, 1996.
[23] L. Giraitis, P. Kokoszka, and R. Leipus. Stationary ARCH models: Dependence structure and central limit theorem. Econometric Theory, 16:3–22, 2000.
[24] V. Gorodetskii. On the strong mixing property for linear sequences. Theory of Probability and its Applications, 22:411–413, 1977.
[25] P. Hall and J. Hart. Convergence rates in density estimation for data from infinite-order moving average processes. Probability Theory and Related Fields, 87:253–274, 1990.
[26] P. Hall and J. Hart. Nonparametric regression with long-range dependence. Stochastic Processes and their Applications, 87:339–351, 1990.
[27] P. Hall and C. Heyde. Martingale Limit Theory and its Application. Academic Press, New York, 1980.
[28] P. Hall, S. N. Lahiri, and Y. K. Truong. On bandwidth choice for density estimation with dependent data. Ann. Statist., 23:2241–2264, 1995.
[29] J. Hart. Efficiency of a kernel density estimator under an autoregressive dependence model. Journal of the American Statistical Association, 83:86–99, 1984.
[30] J. Hart. Automated kernel smoothing of dependent data using time series cross-validation. Journal of the Royal Statistical Society (B), 56:529–542, 1994.
[31] H. R. Künsch. Statistical aspects of self-similar processes. Proceedings of the World Congress of the Bernoulli Society, 1:67–74, 1987.
[32] O. B. Linton and E. Mammen. Estimating semiparametric ARCH(∞) models by kernel smoothing methods. Econometrica, 73:771–836, 2004.
[33] E. Masry and D. Tjøstheim. Nonparametric estimation and identification of nonlinear ARCH time series. Econometric Theory, 11:258–289, 1995.
[34] J. Mielniczuk. On the asymptotic mean integrated squared error of kernel density estimator for dependent data. Statistics and Probability Letters, 34:53–58, 1997.
[35] D. T. Pham. The mixing property of bilinear and generalised random coefficient autoregressive models. Stochastic Processes and their Applications, 23:291–300, 1986.
[36] E. Rio. Covariance inequalities for strongly mixing processes. Ann. Inst. H. Poincaré Probab. Statist., 29:587–597, 1993.
[37] E. Rio. About the Lindeberg method for strongly mixing sequences. ESAIM Probab. Statist., 1:35–61, 1997.
[38] P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica, 56:931–954, 1988.
[39] P. M. Robinson. Nonparametric estimates for time series. Journal of Time Series Analysis, 4:185–201, 1983.
[40] P. M. Robinson. Testing for strong serial correlation and dynamic conditional heteroskedasticity in multiple regression. Journal of Econometrics, 47:67–78, 1991.
[41] P. M. Robinson. Log-periodogram regression of time series with long range dependence. Ann. Statist., 23:1048–1072, 1995.
[42] M. Rosenblatt. Density estimates and Markov sequences. In M. Puri, editor, Non-parametric Techniques in Statistical Inference, pages 199–210. Cambridge University Press, London, 1970.
[43] D. W. Scott. Multivariate Density Estimation. Wiley, New York, 1992.
[44] S. Subba Rao. A note on uniform convergence of an ARCH(∞) estimator. Sankhyā, 68:600–620, 2006.
[45] M.-C. Viano, C. Deniau, and G. Oppenheim. Long-range dependence and mixing for discrete time fractional processes. J. Time Ser. Anal., 16:323–338, 1995.
[46] P. Vieu. Quadratic errors for nonparametric estimates under dependence. Journal of Multivariate Analysis, 39:324–347, 1991.