Generalized Method of Moments with Tail Trimming

0 downloads 0 Views 616KB Size Report
i,t(θ0)] < expedites Gaussian asymptotics, but this requires t and yt to have finite 4th ...... two-tailed thresholds ci,T and the same fractiles kj,i,T = kT to simplify notation. Hence. T ...... Î. T,t, and m∗T,t = mtIT,t. We prove ||T−1S−1. T. PT s,t=1 kT,s,t{ ˆm∗T,s ˆm∗T,t − m∗T,sm∗T,t}|| ...... [62] Ibragimov, I. A. and Y. V. Linnik (1971).
Generalized Method of Moments with Tail Trimming Jonathan B. Hill∗ and Eric Renault† Dept. of Economics University of North Carolina - Chapel Hill April. 2010

Abstract We develop a GMM estimator for stationary heavy tailed data by trimming an asymptotically vanishing sample portion of the estimating equations. Trimming ensures the estimator is asymptotically normal, and self-normalization implies we do not need to know the rate of convergence. Tail -trimming, however, ensures asymmetric models are covered under rudimentary assumptions about the thresholds, and √ it implies possibly √ heterogeneous convergence rates below, at or above T . Further, it implies super- T -consistency is achievable depending on regressor and error tail thickness and feedback, with a rate equivalent to the largest possible rate amongst untrimmed minimum distance estimators for linear models with iid errors, and a faster rate than QML for heavy tailed GARCH. In the latter cases the optimal rate is achieved with the efficient GMM weight, and by using simple rules of thumb for choosing the number of trimmed equations. Simulation evidence shows the new estimator dominates GMM and QML when these estimators are not or have not been shown to be asymptotically normal.

1. INTRODUCTION We develop a Generalized Method of Tail-Trimmed Moments estimator for possibly very heavy tailed nonlinear time series. Heavy tails could be the result of the underlying shocks (e.g. ARX) and/or the parametric structure (e.g. GARCH), depending on the model. There now exists an abundance of evidence in favor of asymmetry and heavy tails in financial, macroeconomic and actuarial data like exchange rate and asset price fluctuations and insurance claims (Mandelbrot 1963, Campbell and Hentschel 1992, Engle and Ng 1993, Embrechts et al 1997, Finkenstadt and Rootzén 2003); microeconomic data like auction bids and birth weight (Hill and Shneyerov 2009, Chernozhukov 2010); and network traffic (Resnick 1997). Coupled with the necessity for ∗ Corresponding author. Dept. of Economics, University of North Carolina -Chapel Hill; www.unc.edu/∼jbhill; [email protected]. † Dept.

of Economics, University of North Carolina -Chapel Hill; [email protected].

Key words and phrases : GMM, tail trimming, robust estimation, heavy tails, super consistency. JEL subject classifications : C13, C22. ˇ zek, Feico Drost, Eric Ghysels, Christian Gouriéroux, Alberto Holly, We thank Brendan Beare, Pavel Ciˇ James Hamilton, Artem Prokhorov, Bas Werker and Hal White for helpful comments. We also thank seminar participants at the Triangle Econometrics Conference 2008, the Joint Statistical Meetings 2008 and 2010, the First French Econometrics Conference 2009, University of North Carolina-Chapel Hill Dept. of Economics, University of Tilburg Dept. of Econometrics and O.R., CIREQ and Concordia University Dept. of Economics, Univeristy of Califormia at San Diego and Riverside Dept. of Economics, Brussels Conference on Latest Developments in Heavy-Tailed Distributions, and the Australian Meeting of the Econometric Society.

1

over-identifying restrictions in economic models, a robust GMM methodology will be useful to the analyst unwilling to impose ad hoc restrictions. See Hansen (1982), Renault (1997) and Hall (2005). 1.1

TAIL-TRIMMED ESTIMATING EQUATIONS Let mt (θ) denote estimating equations, a stochastic mapping mt : Θ → Rq , compact Θ ⊂ Rr , q ≥ r,

induced from some moment condition. The strong global identification condition is E [mt (θ)] = 0 if and only if θ = θ0 for unique θ0 ∈ Θ. Consider a strong-ARCH(1) process {yt } as a simple example, 2 yt = ht t , h2t = α0 + β 0 yt−1 , θ0 = [α0 , β 0 ]0 ,

iid t

∼ (0, 1) and =t := σ(yτ : τ ≤ t)

(1)

with least squares-type equations © £ 2 ¤ ª 2 mt (θ) = yt2 − α − βyt−1 , ... ∈ Rq , q ≥ 2. zt−1 , zt−1 = 1, yt−1 The GMM estimator solves

ˆθg = argmin θ∈Θ

⎧Ã T ⎨ 1X ⎩ T

!0

mt (θ)

t=1

ˆT Υ

Ã

!⎫ T ⎬ X 1 mt (θ) ⎭ T t=1

ˆ T ∈ Rq×q , and T ≥ 1 is the sample size. Under standard for some positive definite matrix Υ regulatory conditions ˆθg is asymptotically linear (e.g. Newey and McFadden 1994) ³ ´ T 1/2 θˆ − θ0 = AT ×

1 T 1/2

T X t=1

mt (θ0 ) + op (1) for some AT ∈ Rr×q ,

P so asymptotics are grounded on Tt=1 mt (θ0 ). Finite variances E[m2i,t (θ0 )] < ∞ expedites Gaussian asymptotics, but this requires t and yt to have finite 4th and 8th moments respectively. This rules out mildly heavy-tailed shocks, integrated random volatility (e.g. IGARCH), and much more. If over-identifying restrictions exist q ≥ 3 with say z3,t−1 = |yt−1 |2+δ/2 and δ > 0 then yt must have a finite (8 + δ)th moment, a very tall order for financial time series (Embrechts et al 1997, Finkenstadt and Rootzén 2003). The problem carries over to a large class of random volatility models (cf. Meddahi and Renault 2004). Models with heterogeneous estimating equations include the multifactor Capital Asset Pricing Model with high risk (e.g. oil futures), composite market returns (e.g. NYMEX), low risk asset returns (e.g. U.S. Treasury Bill), and factor premia (e.g. market capitalization and book-to-price ratio); VARX for causality modeling of financial and macroeconomic returns; and multivariate random volatility. See French and Fama (1996), Ding and Granger (1996), Mikosch and St˘ aric˘ a (2000), and Embrechts et al (2003). QML equations by contrast allow for arbitrarily heavy tailed yt as long as E[ 4t ] < ∞. But QML for GARCH with E[yt4 ] = ∞, like IGARCH, is well known for its poor finite sample properties (Lumsdaine 1995, 1996, Mikosch and Straumann 2006), and errors need not be thin tailed (Mikosch and St˘ aric˘ a 2000, Hall and Yao 2003, Ling 2007, Davis and Mikosch 2008, Linton et al 2010). 2

Although GMM with a non-Gaussian limit is certainly achievable in the manner of Mestimators (e.g. Hannan and Kanter 1977, An and Chen 1982, Knight 1987, Cline 1989, Davis et al 1992), we seek an estimator that permits standard inference and is therefore simple to use. We propose asymptotically negligibly trimming k1,i,T left-tailed and k2,i,T right-tailed observations from each equation sample {mi,t (θ)}Tt=1 , where kj,i,T → ∞ and kj,i,T /T → 0 Define tail specific observations of mi,t (θ) and sample order statistics: (−)

(−)

(−)

(+)

(+)

(+)

mi,t (θ) := mi,t (θ) × I (mi,t (θ) < 0) and mi,(1) (θ) ≤ · · · ≤ mi,(T ) (θ) ≤ 0 mi,t (θ) := mi,t (θ) × I (mi,t (θ) > 0) and mi,(1) (θ) ≥ · · · ≥ mi,(T ) (θ) ≥ 0 (a)

(a)

(a)

mi,t (θ) := |mi,t (θ)| and mi,(1) (θ) ≥ · · · ≥ mi,(T ) (θ) ≥ 0. Now trim any equation mi,t (θ0 ) that may have an infinite variance between its lower k1,i,T /T th and upper k2,i,T /T th sample quantiles: ³ ´ (−) (+) m ˆ ∗i,T,t (θ) := mi,t (θ) × I mi,(k1,i,T ) (θ) ≤ mi,t (θ) ≤ mi,(k2,i,T ) (θ) (2) = mi,t (θ) × Iˆi,T,t (θ)

h iq m ˆ ∗T,t (θ) = mi,t (θ) × Iˆi,T,t (θ)

i=1

where Iˆj,T,t (θ) = 1 if equation j is not trimmed,

and I(A) = 1 is A is true, and 0 otherwise1 . If the data generating process is symmetric and mi,t (θ0 ) is heavy-tailed then symmetric trimming is appropriate: for ki,T → ∞ and ki,T /T → 0 ³ ´ (a)

m ˆ ∗i,T,t (θ) := mi,t (θ) × I |mi,t (θ)| ≤ mi,(ki,T ) (θ) .

(3)

The Generalized Method of Tail-Trimmed Moments [GMTTM] estimator solves ⎧Ã !0 Ã !⎫ T T ⎨ 1X ⎬ X 1 ˆθT = argmin ˆT × m ˆ ∗T,t (θ) × Υ m ˆ ∗T,t (θ) . ⎭ T t=1 θ∈Θ ⎩ T t=1

As long as mt (θ0 ) is integrable and satisfies a mixing condition, and standard smoothness conditions apply, then ³ ´ d 1/2 ˆ VT θT − θ0 → N (0, Ir )

for some sequence of positive definite matrices {VT }. Negligible trimming ensures m ˆ ∗T,t (θ) identifies θ0 as T → ∞ for symmetric or asymmetric processes, and the Gaussian limit holds for a host of heavy tailed time series. See Sections 2 and 4. In Section 3 we present simple rules of thumb for selecting the rate kj,i,T → ∞ in order to optimize the rate ||VT || → ∞ for efficient GMTTM, where || · || is the spectral norm. Further, if mi,t (θ0 ) is symmetrically distributed and trimmed then any two-tailed ki,T ensure identification for each T . Otherwise, in Section 4 we present rules of thumb for relating k1,i,T and k2,i,T that optimize the rate of identification. (N )

1 Other

criteria for trimming exist, including trimming according to a norm ||·||: mt (θ) := ||mt (θ) ||. (N ) In this case m ˆ T,t (θ) = mt (θ)I(||mt (θ)|| ≤ m(k ) (θ)) where kT → ∞ and kT /T → 0. Simulation work T not presented here reveals the latter is massively dominated by component-wise trimming when q > 1, irrespective of distribution symmetry.

3

The components ˆθi,T may have heterogeneous rates, and linear combinations of subsets of ˆθT may help optimize the rates (Antoine and Renault 2010). In the sequel, where 1/2 confusion is avoided, we simply say "rates of convergence" to refer variously to2 VT , 1/2 1/2 diag{VT }, or ||VT || . Inference does not require knowledge of the rates of convergence since we self-normalize, and tail trimming equations with finite variance has no impact on the rate. In particular, ˆθT is T 1/2 -consistent if all equations have a finite variance. In Section 3 we show sub-, exact- or super-T 1/2 -convergence may arise in heavy tailed cases depending on the equation form (e.g. QML versus least squares-type equations for GARCH); relative tail thickness of error and regressor; and whether the error is iid (e.g. AR with iid shocks) or depends on the regressor through some form of random volatility feedback (e.g. AR with ARCH shocks, or GARCH). The feasible rates of convergence, however, is dampened in general precisely due to trimming, but the damage can be truly or nearly eradicated following simple rules of thumb for choosing kj,i,T . GMTTM for linear regression models with infinite variance iid errors obtains the highest rate possible amongst M-estimators, identically T 1/κ /L(T ) for slowly varying L(T ) → ∞ where κ < 2 is the tail index (Davis et al 1992)3 . In the case of AR-GARCH or GARCH feedback exists that depresses the rate to o(T 1/2 ). A simple rule of thumb for selecting kj,i,T optimizes the rate to T 1/2 /L(T ) which dominates QML for GARCH with infinite kurtosis errors (Hall and Yao 2003). See Section 3 for convergence rate derivation for dynamic linear regression, IV, ARCH and AR-ARCH models. See also Antoine and Renault (2010) for broad GMM theory under variable coefficient estimator rates that are no greater than T 1/2 . In Section 5 we show consistency and asymptotic normality are primitive properties for linear-in-parameters models with innovations that have smooth distributions. Finally, perform a monte carlo study in Section 6 demonstrates the superiority of GMTTM over GMM and QML for linear and nonlinear models including AR, GARCH, IGARCH, Quadratic-ARCH, and Threshold-ARCH with Gaussian or Paretian innovations. Fixed quantile or central order trimming, by comparison, imposes kj,i,T /T → λj,i ∈ (0, 1) for each equation i and tail j. This is the standard in the robust M-estimation and Method of Moments literatures where symmetry is imposed λ1,i = λ2,i . See the review below. In this case without further information the equations mi,t (θ0 ) must be symmetrically distributed to ensure identification of θ0 . Since key asymptotic arguments in this paper exploit negligibility and degeneracy properties under tail trimming, a direct extension to fixed quantile trimming is not evident. 1.2

EXTANT METHODS

The best extant theory of Minimum Distance Estimation for time series covers Mestimators, in particular QML and LAD for GARCH models and Least Trimmed Squares in the robust estimation literature. Francq and Zakoïan (2004) prove the QML estimator is asymptotically normal for strong-GARCH and ARMA-GARCH under E( 4t ) < ∞. See, also, Hansen and Lee (1994), Lumsdaine (1996) and Jensen and Rahbek (2004) for results covering stationary and non-stationary cases. Hall and Yao (2003) characterize nonnormal QML limit laws for linear GARCH models with possibly infinite variance errors. Davis et al (1992) characterize a general class of M-estimators with smooth criterion, and LAD, for stationary autoregressions with infinite variance iid errors. 2 In

1/2

t-tests of a single parameter restriction Vi,i,T is the proper measure of a rate of convergence. The ||1/2

measures the maximum rate which is required for asymptotic arguments. norm ||VT 3 A slowly varying function L(T ) satisfies lim p T →∞ L(λT )/L(T ) = 1 for all λ > 0, hence L(T ) = o(T ) for all p > 0 (Resnick 1987). The natural log ln(T ) is a classic example.

4

Linton et al (2010) prove asymptotic normality of the log-transformed LAD estimator for non-stationary GARCH provided E( 2t ) < ∞ for martingale difference t , and E| t |p < ∞ for some p > 0 for iid t . See also Peng and Yao (2003). Although robust estimation has a substantial history (Huber 1964, Stigler 1973, Jurecková and Sen 1996), only a few results concern fully nonlinear models with heavy tails. Most regression treatments focus on breakdown point analysis for thin tailed data with ˇ zek outliers under contamination (e.g. Rousseeuw 1985, Basset 1991, He et al 1996, Ciˇ ˇ 2005, 2008, 2009); most concern M-estimator frameworks (e.g. Ciˇzek 2008 and the citations therein); and when trimming, truncation or weighting are employed only non-tail data quantiles are considered. Almost all uses of tail trimming appear in the robust location and central limit theory literatures, and to our knowledge extremum estimation and tail-trimming have never been combined. See Hill (2010) for a review, and see Horowitz (1998) for a tail-trimmed covariance matrix without supporting theory. There are few regression estimators that are asymptotically normal for heavy tailed data. Andrews (2008) exploits rank orders of least squares residuals to estimate linear ARMA models with iid errors. The estimator is T 1/2 -consistent and non-linear models or non-iid errors are not treated. Let st (θ) ≥ 0 denote criterion equations, for example st (θ) = |yt − θ0 xt | for LAD. Ling (2005, 2007) proposes Least Absolute Weighted Deviations [LAWD] and Quasi-Maximum PT Weighted Likelihood [QMWL] criteria t=1 wt (c)st (θ) where wt (c) is a symmetric smooth stochastic function of the data yt based on some threshold c. Since wt (c) is not a function of θ, the threshold c is not with respect to the criterion st (θ). Linear autoregressive and GARCH models are separately covered allowing E[ 2t ] < ∞ and E|yt |p < ∞ for some p > 0. The rate of convergence is T 1/2 for both AR and ARMA-GARCH models since restrictions imposed on wt (c) imply it operates like a smoothed fixed quantile trimming indicator. ˇ zek (2005, 2008) improves the breakdown point of M-estimators by trimming the kT Ciˇ ∼ λT largest st (θ). Nonlinear models and models with limited dependent variables are covered, the errors are assumed to be iid with a finite variance, and asymptotic variance estimation is neglected so inference is not available. Kan and Lewbel (2007) use trimming to solve bias problems in semiparametric least squares estimation for linear truncated regression models of iid data with thin tails. The literature is too large for a fair review: ˇ zek (2005, 2008) and his citations, and see Ruppert and Carroll (1980), Rousseeuw see Ciˇ (1985), Basset (1991), Tableman (1994), Stromberg (1993), and Agulló et al (2008) for trimmed and truncated M-estimators. The fundamental short-comings of trimming M-estimator criterion equations st (θ) by a fixed quantile of itself are super-T 1/2 -convergence is impossible for stationary data under required moment restrictions; asymptotic normality cannot be achieved when regressors are heavy tailed; and in general asymmetric models and over-identifying restrictions are ignored. Adaptive weighting based on observable data as in Ling (2005, 2007) neglects criterion and normal equation information; and weights that act like fixed quantile trimming, and rank-orders can only deliver a T 1/2 -consistent estimator. See Section 2.4 for direct comparisons of GMTTM with trimmed and weighted M-estimators. 
ˇ zek (2009) trims a fixed quantile of A few results are couched in method of moments. Ciˇ mt (θ) for thin-tailed cross-sections under data contamination, covering limited dependent and instrumental variables. Since the quantile is fixed identification must be assumed and an efficient criterion weight does not exist. Powell (1986) and Honoré (1992) construct least squares estimators couched in the method of trimmed moments for censored linear regressions models of iid data. Ronchetti and Trojani (2001a) symmetrically truncate mt (θ) and propose a method of simulated moments to overcome bias in asymmetric models. See also Ronchetti and Trojani (2001b) and Dell’Aquila et al (2003). The error

5

distribution must therefore be known and heavy-tailed cases are ignored. Throughout λmin (A) and Pλmax (A) are the minimum and maximum eigenvalues of A. The Lp -norm is ||x||p = ( i,j E|xi,j |p )1/p , and the spectral (matrix) norm is ||A|| = (λmax (A0 A))1/2 . (z)+ := max{0, z}. K denotes a positive finite constant whose value may p d change from line to line; ι > 0 is a tiny constant; N is a whole number. → and → denote probability and distribution convergence. Id is a d-dimensional identity matrix and A1/2 denotes the square-root matrix for positive definite A. U 0 (δ 1 , δ 2 ) := {θ ∈ Θ : δ 1 ≤ ||θ − θ0 || ≤ δ 2 } for any 0 ≤ δ 1 < δ 2 and U 0 (δ) = U 0 (0, δ). supθ = supθ∈Θ and inf θ = inf θ∈Θ . 2. TAIL-TRIMMED GMM In this section we develop a model-free theory of GMTTM based on primitive properties of mt (θ). We treat specific models in Sections 3-5. 2.1

TAIL-TRIMMING

Denote by Li (θ), Ui (θ) ∈ [0, ∞] equation specific support bounds: −Li (θ) ≤ mi,t (θ) ≤ Ui (θ) a.s. The problem of interest is some mi,t (θ0 ) may have an unbounded support and infinite variance. Assume by convention the first q ∈ {1, ..., q} equations are trimmed: h iq m ˆ ∗T,t (θ) = mi,t (θ) × Iˆi,T,t (θ)

i=1

=

hn oq mi,t (θ) × Iˆi,T,t (θ)

i=1

q

, {mi,t (θ)}i=q+1

i0

and assume throughout q ≥ 1 since otherwise the following reduces to known results. Let positive integer sequences {k1,i,T , k2,i,T : 1 ≤ i ≤ q} and positive sequences of threshold functions {li,T (θ), ui,T (θ) : 1 ≤ i ≤ q} satisfy kj,i,T → ∞, kj,i,T /T → 0, 1 ≤ k1,i,T + k2,i,T < T li,T (θ) → Li (θ) and ui,T (θ) → Ui (θ) uniformly on compact Θ ⊂ Rr , and for all θ ∈ Θ T k1,i,T

P (mi,t (θ) < −li,T (θ)) = 1 and

T k2,i,T

P (mi,t (θ) > ui,T (θ)) = 1.

(4)

Thus, li,T (θ) and ui,T (θ) are identically the equation specific lower k1,i,T /T th → 0 and upper k2,i,T /T th → 0 tail quantiles (e.g. Leadbetter et al 1983: Theorem 1.7.13). We can always find threshold sequences {li,T (θ), ui,T (θ)} that satisfy (4) since we assume the marginal distributions of mi,t (θ) are absolutely continuous. See Appendix A for all assumptions and related discussions, in particular condition D1. The practice of GMTTM involves m ˆ ∗T,t (θ) in (2), but theory centers around deterministic trimming: m∗i,T,t (θ) := mi,t (θ) × I (−li,T (θ) ≤ mi,t (θ) ≤ ui,T (θ)) = mi,t (θ) × Ii,T,t (θ) : 1 ≤ i ≤ q m∗T,t (θ)

= [mi,t (θ) × Ii,T,t (θ)]qi=1 where Ij,T,t (θ) = 1 for q + 1 ≤ j ≤ q.

Although mt (θ) identifies θ0 , we can only say m∗T,t (θ) eventually identifies θ0 : £ ¤ E m∗T,t (θ0 ) → 0. 6

(5)

This is easily guaranteed by Lebesgue’s dominated convergence theorem for any fractiles kj,i,T → ∞ since E[mt (θ0 )] = 0 and tail trimming is negligible kj,i,T = o(T ). It is interesting to note "eventual identification" runs contrary to weak and nearly weak identification where information vanishes at some rate (e.g. Stock and Wright 2000, Antoine and Renault 2009). Here, information amasses at some rate. If the DGP of {mt (θ0 )} is symmetric then E[m∗T,t (θ0 )] = 0 for any thresholds li,T (θ0 ) = ui,T (θ0 ) and fractiles k1,i,T = k2,i,T . Since trimming is negligible and we assume mt (θ) has absolutely continuos marginal distributions, the asymptotic covariance matrix of ˆθT is identical in form to the standard GMM estimator. We state the results here and develope the details in the appendices. Thus, we need standard covariance and Jacobian matrix constructions. The instantaneous and long run covariances are £ ¤ ΣT (θ) := E m∗T,t (θ) m∗T,t (θ)0 and ΣT := ΣT (θ0 ); ST (θ) :=

T ¤ 1 X £ ∗ E mT,s (θ) m∗T,t θ0 and ST := ST (θ0 ); T s,t=1

population and sample Jacobia are JT (θ) :=

∗ (θ) JT,t

¤ ∂ £ ∗ E mT,t (θ) ∈ Rq×r and JT = JT (θ0 ) ∂θ ∙

∂ := mi,t (θ) × Ii,T,t (θ) ∂θ

∗ (θ) := JˆT,t



∂ mi,t (θ) × Iˆi,T,t (θ) ∂θ

¸q

and JT∗ (θ) :=

i=1

¸q

i=1

T 1X ∗ J (θ) T t=1 T,t

T 1 X ˆ∗ and JˆT∗ (θ) := J (θ); T t=1 T,t

and the GMTTM scale is VT (θ) := T × HT (θ) [JT0 (θ) ΥT ST (θ) ΥT JT (θ)]

−1

HT (θ) and VT := VT (θ0 )

where HT (θ) := JT (θ)0 ΥT JT (θ) ∈ Rr×r

and HT := HT (θ0 ).

We scale the covariance ST with 1/T to help clarify the difference between ||VT ||1/2 and T 1/2 . See Section 3. 2.2

MAIN RESULTS

The main results follow: θˆT is consistent for θ0 and asymptotically normal. See Appendix A for assumptions details concerning distribution properties D, identification properties I and matrix properties M. p THEOREM 2.1 Under D1-D6, I1-I3 and M1-M2 ˆθT → θ0 . 1/2

The rate VT (ˆθT − θ0 ) = Op (Ir ) can similarly be shown under conditions D1-D6, I1-I3 and M1-M2, without proving asymptotic normality of ˆθT , by using arguments in

7

Pakes and Pollard (1989) and Newey and McFadden (1994). We do not present the result here since asymptotic normality follows from the same set of assumptions4 . 1/2

THEOREM 2.2 Under D1-D6, I1-I3 and M1-M2 VT

d (ˆθT − θ0 ) → N (0, Ir ) .

Remark 1: An "optimal" GMTTM weight sequence {ΥT } in the sense of asymptotic efficiency is {ST−1 /||ST−1 ||} for any fractiles {kj,i,T } due to the quadratic form VT = T × HT (JT0 ΥT ST ΥT JT )−1 HT (Hansen 1982; Newey and MacFadden 1994: p. 2164). In this case ¡ ¢ VT = T JT0 ST−1 JT .

Remark 2: Under β-mixing D3 the long-run covariance of the tail-trimmed equa−1 tions ST satisfies ||Σ−1 T ST || ≤ K and ||ST ΣT || ≤ K. See Lemma C.2 in Appendix C. −1 The efficient weight therefore reduces to ΥT = Σ−1 T /||ΣT ||. Remark 3: The existence of an efficient weight ΥT = ST−1 ||ST || is non-trivial since a symmetric variance form does not arise under fixed quantile trimming. In this case JT has two components that enter VT asymmetrically as T → ∞, so an optimal weight does ˇ zek 2009). Under tail trimming, however, each Ji,j,T also decomposes into not exist (Ciˇ two components, a mean Jacobian and a Jacobian of a mean: ∙ ¸ £ ¤ ∂ ∂ mi,t (θ)|θ0 × Ii,T,t (θ0 ) + E mi,t (θ0 ) × Ii,T,t (θ) |θ0 . Ji,j,T = E ∂θj ∂θj The latter is asymptotically dominated by the former due to negligibility Ii,T,t (θ) → 1 a.s. so ∙ ¸ ∂ 0 Ji,j,T = E mi,t (θ)|θ0 × Ii,T,t (θ ) × (1 + o(1)). ∂θj

See Lemma C.4 in Appendix D.

If the Jacobian and covariance are asymptotically bounded JT → J and ST → S then 1/2 the rate is exactly T 1/2 since VT ∼ T 1/2 V 1/2 for some positive definite V ∈ Rr×r . This holds for any stationary DGP for which the conventional GMM estimator is asymptotically normal, so tail trimming is always a safe practice. Otherwise the rates need not be homogeneous over ˆθi,T and may be greater or less than T 1/2 . See Section 3. LEMMA 2.3 If lim supT ≥1 ||ST || < ∞ and lim supT ≥1 ||JT ]|| < ∞ then the rate of convergence of each ˆθi,T is T 1/2 .

2.3

COVARIANCE AND JACOBIAN MATRIX ESTIMATION

Unless specific dependence information is available for mt (θ0 ) we must estimate the long run covariance ST non-parametrically. A now classic approach is a kernel HAC estimator T 1 X k ((s − t) /γ T ) m ˆ ∗T,t (˜θT )m ˆ ∗T,t (˜θT )0 SˆT (˜θT ) = T s,t=1 4 A standard aproach to proving asymptotic normality of MDE’s involves first demonstrating consistency and T 1/2 -convergence (cf. Pakes and Pollard 1989, Newey and McFadden 1994). This pat1/2 tern is not well suited to our problem since showing VT (ˆ θT − θ0 ) = Op (Ir ) follows from consistency is quite tedious in lieu of non-differentiability and the nonlinearity imbedded order statistics (−) (+) {mi,(k (θ) , mi,(k (θ)}. Our proof of Theorem 2.2 relies on a version of the δ-method based 1,i,T ) 2,i,T ) on an asymptotic expansion that exploits non-degeneracy properties under tail-trimming. See Appendix C for background theory.

8

for some consistent plug-in ˜θT , where k(·) is a kernel function and γ T → ∞ is bandwidth (cf. Newey and West 1987). We restrict k(·) and γ T = o(T ) in condition K1 of Appendix A based on theory developed in Davidson and de Jong (2000). Kernels covered include Bartlett, Parzen, Quadratic Spectral, Tukey-Hanning and others. Since SˆT (˜θT ) may itself be used for GMTTM estimation, in practice ˜θT need not be ˆT = the final GMTTME ˆθT . Candidate plug-ins include one-step GMTTM (e.g. naïve Υ Iq ), but also untrimmed estimators like the OLS, GMM and QML since in general they converge faster than GMTTM. See Section 3. LEMMA 2.4 Under D1-D6, I1-I2, K1 and M1 ||ST−1 SˆT (˜θT ) − Iq || = op (1) for any ||˜θT − θ0 || = Op (T −1/2 ||ST ||1/2 ||JT ||−1 ). Remark 1: If the equations are sufficiently orthogonal that ST = ΣT (1 + o(1)) ∗ ˜ ˆ ∗ (˜θT )0 . Since ˆ T (˜θT ) = 1/T PT m then the appropriate estimator of ST is Σ T,t t=1 ˆ T,t (θ T )m kernel class K1 includes k((s − t)/γ T ) = 0 ∀s 6= t and k((s − t)/γ T ) = 1 ∀s 6= t we have ˆ T (˜θT ) − Iq || = op (1) by Lemma 2.4. ||ST−1 Σ Remark 2: In principle ˜θT may be a first-stage GMTTM estimator based on some weight ΥT and fractile policy {k˜j,i,T }. Notice irrespective of the choice of plug-in ˜θT , we must have the Jacobian JT and covariance ST reflect the policy {kj,i,T } of the final GMTTM estimator based on its weight and fractiles. ∗ Tail trimming implies the Jacobian JT is proportional to E[JT,t ], cf. Lemma C.4 in ∗ ˜ ˆ Appendix D. Due to its simple form consistency JT (θT ) = JT × (1 + op (1)) follows for any consistent ˜θT .

LEMMA 2.5 Under D1-D6, I2 and M1-M2 JT∗ (˜θT ) = JT × (1 + op (1)) and JˆT∗ (˜θT ) = p JT × (1 + op (1)) for any ||˜θT − θ0 || → 0. The efficient-weighted covariance matrix VT−1 is estimated by n o−1 VˆT−1 (θ) = T × JˆT∗ (θ)0 SˆT−1 (θ) JˆT∗ (θ) .

Since trivially the efficient estimator satisfies ||ˆθT − θ0 ||2 = O(||VT ||−1/2 ) = O(T −1/2 ||JT ||−1 ||ST ||1/2 ) by Theorem 2.2, the scale estimator VˆT−1 (ˆθT ) is consistent by Lemmas 2.4 and 2.5. THEOREM 2.6 Under D1-D6, I1, I2 and M1-M2 VˆT (ˆθT ) = VT × (1 + op (1)). Remark : It is important to note we only show ||VˆT (θˆT ) − VT || = op (||VT ||), and p p ˆ ˆ not ||VT (θT ) − VT || → 0 since that requires ˆθT → θ0 far faster than Op (||VT ||−1/2 ). This is irrelevant, however, since any estimator VˆT (ˆθT ) = VT × (1 + op (1)) = VT + op (||VT ||) 1/2 1/2 suffices for inference. Consider a t-ratio Vˆi,i,T (ˆθT )ˆθi,T : if θ0i = 0 then Vˆi,i,T (ˆθT )ˆθi,T = d −1/2 ˆ θi,T × (1 + op (1)) → N (0, 1) by Theorems 2.2 and 2.6, and Cramér’s Theorem. V i,i,T

2.4

ROBUST M-ESTIMATORS

We now discuss why trimming M-estimator criterion equations may fail to promote asymptotic normality. Least Trimmed Squares:

Consider a linear model with least squares criterion

yt = θ00 xt +

t

¡ ¢2 with st (θ) := yt − θ0 xt . 9

Assume t is zero-mean with distribution function F ( ) := P ( t ≤ ), and two-tailed inverse F|−1 | (λ) := inf{ ≥ 0 : P (| t | ≤ ) ≤ λ}. The fixed quantile LTS estimator solves ˇ zek 2008) (Ruppert and Carrol 1980, Rousseeuw 1985, Ciˇ ˜θT = argmin θ∈Θ

(

) T ¡ ¢ 1X st (θ) × I st (θ) ≤ s([T λ]) (θ) , λ ∈ (0, 1) . T t=1

If the distribution governing st (θ) is absolutely continuous on Θ-a.e., {xt , t } have finite variance marginal distributions, { t , xt } are geometrically β-mixing, and J˜ (λ) := −E[xt x0t I(| t | ≤ F|−1 | (λ))] is non-singular, then for the given linear DGP ³ ´ T 1/2 ˜θT − θ0 = J˜ (λ)−1

1 T 1/2

X

t xt I

³ ´ ³ ´ d ˜ −1 (λ) , | t | ≤ F|−1 (λ) + o (1) → N 0, V p |

˜ 0Σ ˜ −1 J(λ) ˜ ˜ ˇ zek (2005, where V˜ (λ) = J(λ) and Σ(λ) := E[ 2t xt x0t I(| t | ≤ F|−1 | (λ))]. See Ciˇ 2008). Clearly if the error t isPindependent of xt and any stochastic element of xt has an 2 2 infinite variance then 1/T 1/2 t xt I( t ≤ ([T λ]) ) does not have a Gaussian limit since ˜ ˜ and J(λ) do not exist.The object that governs asymptotics only t is trimmed, and Σ(λ) is the gradient (∂/∂θ)st (θ)|θ0 , so its components t xi,t must be trimmed to ensure asymptotic normality, not simply t . iid

Quasi-Maximum Trimmed Likelihood: Consider an ARCH(1) yt = ht t , t ∼ 2 (0, 1), h2t (θ) = α + βyt−1 , (α, β) ≥ 0 with QML criterion equations st (θ) = ln h2t (θ) + ˇ zek (2008) for Maximum Trimmed yt2 /h2t (θ). See Neykov and Neytchev (1990) and Ciˇ Likelihood of models of the conditional mean. Since a standard question is whether a conditional heteroscedastic effect exists, suppose not for simplicity: β 0 = 0. If the distribution governing t is absolutely continuous then by ˇ zek (2008) gT,t (θ) := (∂/∂θ)sT,t (θ) = (∂/∂θ)st (θ) × I(st (θ) ≤ F −1 (λ)) Lemma 2.1 of Ciˇ s(θ)

a.s. on Θ-a.e. By direct computation it follows under β 0 = 0 ³ ´ ¢ £ 2 ¤0 ¡ ¢ ¡ × I | t | ≤ F|−1 gT,t θ0 = − 2t − 1 1, yt−1 | (λ) .

Now exploit independence to deduce ´i h¡ ¢2 ³ £ 4 ¤ £ 2 ¡ 0 ¢¤ −1 2 − 1 I | | ≤ F (λ) × E yt−1 E g2,T,t θ =E . t t | |

Since β 0 = 0 we know yt has an unbounded fourth moment E[yt4 ] = ∞ if and only if E[ 4t ] = ∞. In this case the QMTL Jacobian is unbounded, and by asymptotic linearity and independence between t and yt−1 the estimator is not asymptotically normal. Adaptive M-Estimation: Ling’s (2005, 2007) symmetrically weighed LAD and QML criteria work like smoothed fixed quantile trimmed criteria. Although heavy-tails are allowed the imposed weight structure rules out super-T 1/2 -convergence. Further, theory is only delivered for symmetric DGP’s, only fixed quantiles of the data yt are considered for the weight function, and by construction over-identification is not allowed. The theory developed cannot be immediately extended to the problem of tail-trimming parametric equations mt (θ). 3. FRACTILE SELECTION AND GMTTM CONVERGENCE RATES We now characterize convergence rates for heavy tailed models, which provides rules of thumb for selecting the fractile rate kj,i,T → ∞. We treat asymmetric fractile selection separately in Section 4. 10

We specifically treat regression models with martingale difference equations mt (θ0 ). Let h ≥ 1. In the spirit of Section 4 it can be shown E[m∗i,T,s (θ0 )m∗j,T,t+h (θ0 )] = 0 for {m∗i,T,s (θ0 ), m∗j,T,t+h (θ0 )} with symmetric joint distributions, and any two-tailed {ki,T } hence ST = ΣT . Similarly, E[m∗i,T,s (θ0 )m∗j,T,t+h (θ0 )] → 0 sufficiently fast for infinitely many policies {k1,i,T , k2,i,T } such that ST = ΣT (1 + o(1)). We therefore assume ST = ΣT for clarity. By positive definiteness and the Cauchy-Schwartz inequality we can define diagonal matrices ΓT ∈ Rq×q with elements ³ ¡ ¡ ¢¢2 ´−1/2 −1/2 −1 : Γ−1 Γi,i,T = Σi,i,T = E[ m∗i,T,t θ0 ] T ΣT ΓT → Σ a positive definite matrix. Now write

¡ ¡ ¢0 ¢ −1 VT = T × Γ−1 × Σ−1 × Γ−1 = [σi,j ]qi,j=1 . T JT T JT × (1 + o (1)) and Σ

Assume σ i,j 6= 0 ∀i, j, to simplify exposition. The component-wise rates are ⎤1/2 ⎡ q X 1/2 −1 ⎦ . Tθi := Vi,i,T = KT 1/2 × ⎣ σ l1 ,l2 Γ−1 l1 ,l1 ,T Γl2 ,l2 ,T Jl1 ,i,T Jl2 ,i,T

(6)

l1 ,l2 =1

¡ ¢ Textbook intuition explains Tθi . Holding everything else constant, if Γi,i,T = (E[(m∗i,T,t θ0 )2 ])1/2 → ∞ due to heavy-tailed errors then Tθi → ∞ slowly: sharp estimates are more difficult to obtain from models with disproportionately dispersive errors (an "outlier" effect). Conversely, using Jacobian relation Lemma C.4 in Appendix C, if |Jl,i,T | = |E[(∂/∂θi )ml,t (θ)|θ0 Il,T,t (θ0 )]| × (1 + o(1)) → ∞ due to heavy tailed regressors then Tθi → ∞ quickly: sharpness improves with regressor dispersion and association (a "leverage" effect.). If both error and regressor are heavy-tailed and exhibit feedback then ΓT may overwhelm JT . In this section we inspect the gamut of such cases. In order to characterize ΓT and JT we consider dynamic linear regression, AR, ARCH and AR-ARCH models with least-squares and in some cases QML equations. Since we only care about convergence rates, initially all equations are symmetrically trimmed with two-tailed thresholds ci,T and the same fractiles kj,i,T = kT to simplify notation. Hence ¡ ¢¯ ¢ T ¡¯¯ P mi,t θ0 ¯ > ci,T = 1. kT

This allows for a cleaner discussion of rules for selecting the rate kT → ∞. Throughout { t } is iid Lp -bounded, p > 0, with an absolutely continuous distribution on R-a.e., symmetric about 0. Assume covariance and Jacobian non-degeneracy properties D 5 and M2 throughout. See Appendix A. Let L(T ) be a slowly varying function, L (T ) → ∞ and L (T ) ≤ T, whose value or rate may change with the context. 3.1

DYNAMIC REGRESSION WITH IID ERRORS We first examine a stationary dynamic linear regression with an intercept yt = θ00 xt + t , x1,t = 1, xt ∈ Rr with mt (θ) = (yt − θx0t ) xt ,

where t and xt are mutually independent. Stochastic xi,t are measurable with R-a.e. continuous, stationary, symmetric distributions. Independence rules out random volatility errors. See Sections 3.2-3.4 for this case. 11

Define the moment suprema of zt ∈ { t , xi,t }, κz := sup {α > 0 : E |zt |α < ∞} > 1, where κz = ∞ is possible (e.g. uniform, normal, or bounded support). Identification requires integrability of t xi,t hence κz > 1. If any zt ∈ { t , xi,t } has an infinite variance κz ∈ (1, 2] then assume the distribution tail decays as a power law: P (|zt | > z) = dz z −κz (1 + o (1)) with indices κz ∈ {κ , κi } > 1. Such zt satisfy the Feller (1971: Theorem 1 in IX.8) property: £ ¤ κz < 2 : E zt2 I (|zt | ≤ c) ∼ Kc2 P (|zt | > c) as c → ∞

(7)

(FE)

£ ¤ κz = 2 : E zt2 I (|zt | ≤ c) ∼ L(c) as c → ∞ where L(c) → ∞ is slowly varying.

Heavy-tailed convolutions mi,t (θ0 ) = t xi,t also satisfy (7) with index κ (Cline 1986), so ci,T = K(T /kT )1/κ ,i by construction. Define a∗,(i) := min {1/κ j ∈{1,i} /

BT :=

max

j ∈{1,i}:κ / j ≤2

,j

+ (1 − 1/κj ) κi /κ ,i } , AT := (T /kT )1/2+1/κ

n o (T /kT )2/κj −1 , CT :=

max

j ∈{1,i}:κ / j ≤2

,i

:= min{κ , κi }

∗ ,i +κi /κ ,i −2a ,(i)

n o −1 (T /kT )2−2/κj −2/κi +2κ ,i (κi /κj +1−κi ) .

By convention κ ,1 = κ since x1,t = 1, a∗,(i) = κi /κ ,i if there is only one stochastic regressor, and a∗,(i) is not defined if there is only an intercept. LEMMA 3.1 (ARX with IID Error) a. Let max{κ , κ2 , ..., κr } < 2. Each Γi,i,T = (T /kT )1/κ ,i −1/2 and Ji,j,T ∼ −E[xi,t xj,t I(| t xj,t | −1 ≤ cj,T )]. For stochastic {xi,t , xj,t } in general Ji,i,T ∼ K(T /kT )κ ,i (2−κi ) , Ji,j,T = −1 O((T /kT )κ ,j (κj /κi +1−κj ) ) ∀i 6= j, and Ji,j,T ∼ K if xi,t is independent of xj,t . Hence the slope rates are in general 1/2−κi /κ

Tθi ∼ KT 1/2 (T /kT )

,i +1/κ ,i

1/2

[K + O (AT )]

, i = 2, ..., r.

Further J1,1,T = −1+o(1) and Ji,1,T , J1,i,T = O(1) × (1 + o(1)), hence the intercept rate is Tθ1 = KT 1/2 × K(kT /T )1/κ −1/2 (1 + O(1)) .

b. Let κ > 2. If κi > 2 then Tθi ∼ T 1/2 × [K + O(BT )]1/2 , and if κi < 2 then Tθi ∼ KT 1/2 (T /kT )1/κi −1/2 × [K + O(CT )]1/2 .

Remark 1: Notice Tθ1 = o(T 1/2 ) when κ < 2. Irrespective of the other regressors, as long as the error t is heavy tailed the intercept rate is sub-T 1/2 -consistent due an outlier efffect: to kT /T → 0 under tail trimming allows ||ΣT || → ∞. Remark 2: If the error has a finite variance then Tθi is governed entirely by the regressor tails, hence ˆθi,T is super-T 1/2 -consistent when xi,t has an infinite variance due to a leverage effect: under tail-trimming ||JT || → ∞. If t and xi,t have finite variances but some other regressor xj,t is heavy-tailed, then all we can say is ˆθi,T is at least T 1/2 consistent since cross-Jacobia complexity precludes sharper results.

12

Remark 3: We ignore the hairline infinite variance cases κ = 2 or κi = 2 to retain brevity. We inspect the case in specific models below which can be easily verified from (FE) and the arguments used to prove Lemma 3.1. In the following assume all random variables are heavy tailed max{κ , κ2 , ..., κr } ≤ 2. General examples are similar. EXAMPLE 1 (Slope Rate Lower Bound): Assume max{κ , κ2 , ..., κr } < 2, and 1/2−κi /κ ,i +1/κ ,i ] ≥ K depends solely on the dispersion note lim inf T →∞ Tθi /[T 1/2 (T /kT ) of t and xi,t . As long as 1/2 − κi /κ ,i + 1/κ ,i > 0 then ˆθi,T is super-T 1/2 -consistent. There are two cases. In this case the leverage effect dominates, so Tθi ≥ KT 1/2 (T /kT ) Case 1 (κi ≤ κ ): only reflects the tails of xi,t . Simply choose a light fractile kT = [λL(T )] for λ > 0 and slowly varying L(T ) → ∞ to obtain Tθi ≥ KT 1/κi /L(T ). The GMTTME rate is therefore equivalent to the highest achievable rate T 1/κi /L(T ) for stationary data with Paretian tails evidently by any estimation method, including untrimmed OLS, LAD, QML, smooth M-estimators, and Whittle and Yule-Walker estimation (e.g. Hannan and Kanter 1977, An and Chen 1982, Cline 1989, Davis et al 1992, Hall and Yao 2003). The latter estimators, however, have non-Gaussian limits, so the GMTTME offers a two-fold advantage: it obtains the highest possible rate and is asymptotically normal. Case 2 ( κi > κ ): If t is more heavy-tailed than xi,t then super-T 1/2 -convergence still arises as long as xi,t has an infinite variance κi < 2 and the dispersion of t is not too great, κ > 2(κi − 1). If κi = 1.5, for example, then any κ ≥ 1 allows a dominate leverage affect. The converse κ ≤ 2(κi − 1) naturally arises in random volatility models with AR-in-squares representations. See Sections 3.2-3.4. EXAMPLE 2 (Independent Regressors): If stochastic xi,t are independent random variables then because they have finite means Ji,j,T ∼ K ∀i 6= j. The leverage effect vanishes hence Tθi = o(T 1/2 ) if the error has an infinite variance. EXAMPLE 3 (Tail Homogeneity): If κ = κi = κ < 2 for all i, then κ a∗,(i) = 1, hence AT = (kT /T )1/2 = o(1) and the rate reduces to

,i

= κ and

Tθi ∼ KT 1/2 (T /kT )1/κ−1/2 . Similarly, using property (FE) when κ = 2 it is easy to verify Tθi ∼ KT 1/2 L(T ). SuperT 1/2 -convergence arises if and only if variance is infinite κ ≤ 2. In this case the leverage affect dominates the outlier effect. The following provide deeper intuition as to why super-T 1/2 -convergence may or may not arise. EXAMPLE 4 (Location):

Consider estimating location

yt = θ0 + t ,

iid t

∼ (7), κ ∈ (1, 2],

with one equation mt (θ) = yt − θ. The Jacobian is JT = −1 + o(1), and the covariance 1/2 ΓT = ΣT = T 1/2 (T /kT )1/κ−1/2 when κ < 2 or ΓT = T 1/2 L(T ) when κ = 2. Therefore Tθ = KT 1/2 (kT /T )1/κ−1/2 = o(T 1/2 ) if κ < 2 and Tθ = KT 1/2 /L(T ) = o(T 1/2 ) if κ = 2, so GMTTM is sub-T 1/2 -consistent when κ ≤ 2. 13

1/κi −1/2

Remark 1: The reason for the sluggish rate is essentially the same as in Example 2: bounded regressors cannot provide explanatory leverage against a heavy tailed shock. In the above simple model there is only an outlier effect. Remark 2: It is straightforward to show over identifying restrictions involving lags of yt have no impact on the sub-T 1/2 rate since the added regressors yt−i are independent. Remark 3: Since there is only an outlier effect when κ ≤ 2, maximal trimming optimizes the rate. Set kT = [T /L(T )] to achieve Tθ = KT 1/2 /L(T ). EXAMPLE 5 (AR with iid error): Consider a stationary infinite variance autoregression r X iid yt = θ0i yt−i + t , t ∼ (7), κ ∈ (1, 2]. i=1

The AR process {yt } satisfies (7) with the same index κ (Cline 1989, Brockwell and Cline 1/κ−1/2 → 1985). Since κ = κi = κ ,i = κ, apply Example 3 to get Tθi /T 1/2 ∼ K (T /kT ) ∞. Remark 1: →∞

The AR(1) case is particularly revealing: for slowly varying L0 (T ) , L (T )

κ < 2 : Tθ = KT

κ = 2 : Tθ ∼ KT

1/2 |JT |

ΓT

1/2

∼ KT

1/2

£ 2 ¤ 2/κ−1 E yt−1 I (| t yt−1 | ≤ c1,T ) (T /kT ) 1/2 ¤¢1/2 ∼ KT ¡ £ 2 2 h i1/2 2/κ−1 E t yt−1 I (| t yt−1 | ≤ c1,T ) (T /kT )

£ 2 ¤ E yt−1 I (| t yt−1 | ≤ c1,T ) L0 (T ) 1/2 = T 1/2 L (T ) . ¤¢1/2 ∼ KT ¡ £ 2 2 1/2 [L0 (T )] E t yt−1 I (| t yt−1 | ≤ c1,T )

The case κ = 2 can be verified from (FE) and the proof of Lemma 3.1. Independent errors 2 2 mean yt−1 and 2t yt−1 have the same tail shape up to a constant scale, so L0 (T ) are the same in the numerator and denominator (Cline 1986), hence Tθ ∼ T 1/2 L (T ) for some L (T ). Consider the case κ < 2. The numerator Jacobian £ 2 ¤ 2/κ−1 E yt−1 I (| t yt−1 | ≤ c1,T ) ∼ K (T /kT ) works like a tail-trimmed variance. If sequences {cy,T , kT } satisfy (T /kT )P (|yt | > cy,T ) → 2 ∞ then arguments in the proof of Lemma 3.1 reveal E[yt−1 I(|yt−1 | ≤ cy,T )] ∼ c2y,T P (|yt−1 | > cy,T ) = K(T /kT )2/κ−1 . Trimming by t yt−1 delivers the same rate because t is independent of yt−1 and each have tail index κ < 2, hence t yt−1 has index κ (Cline 1986). By comparison the denominator ¤¢1/2 ¡ £ 2 2 = (T /kT )1/κ−1/2 E t yt−1 I (| t yt−1 | ≤ c1,T )

is a tail-trimmed standard deviation of an object with the same tail index κ. Therefore Tθ ∼ KT 1/2 (T /kT )1/κ−1/2 dominates T 1/2 when t has an infinite variance. If t is not independent of yt−1 then the above arguments fails, and feedback can cause ΓT → ∞ so fast that super-T 1/2 -convergence cannot arise. See Sections 3.2 and 3.3. Remark 2: Heavy-tailed regressors without error-regressor feedback promotes a pure leverage effect, so minimal trimming is optimal. If kT is regularly varying kT = [T λ ] for λ ∈ (0, 1) then Tθi ∼ KT 1/κ−λ(1/κ−1/2) > KT 1/κ−ι for any tiny λ, ι > 0. Conversely if slowly varying kT = [λ ln(T )] then Tθi = KT 1/κ /L(T ), the maximum rate, cf. Davis et al (1992).

14

EXAMPLE 6 (Instrumental Variables): It is tempting to use heavy-tailed instruments zt to induce super-T 1/2 -convergence. Consider a simple scalar model for reference iid

yt = θ0 xt + t , where {xt , t } ∼ (0, 1) and mt (θ) = (yt − θxt ) zt ∈ R. Assume the instrument zt ∈ R has tail (7) and index κz ∈ (1, 2), and is valid: it is independent of t and inf T ≥N |E[xt zt I(| t zt | ≤ cT )]| > 0 for large N . For example, we might use zt = x2t if xt has a finite variance and infinite kurtosis. Since mt (θ0 ) = t zt has tail index κz (Cline 1986), the Cauchy-Schwartz inequality and arguments in the proof of Lemma 3.1 reveal à £ ¤ !1/2 E zt2 I (| t zt | ≤ c1,T ) E [xt zt I (| t zt | ≤ c1,T )] 1/2 1/2 Tθi = KT ≤ KT = KT 1/2 . 2 z 2 I (| z | ≤ c 1/2 2 2 E [ )] t t 1,T t t (E [ z I (| t zt | ≤ c1,T )]) t t

A thin-tailed regressor xt handicaps the Jacobian rate irrespective of the instrument zt . 3.2

ARCH WITH LEAST SQUARES EQUATIONS Consider a Strong-ARCH(p) model with least squares-type equations yt = ht t , ¡

iid t

∼ (0, 1) , h2t = α0 +

mt (θ) = yt2 − θ0 xt

¢

p X i=1

2 β 0i yt−i = θ00 xt , α0 > 0, β 0 ≥ 0, θ = [α, β 0 ]0

£ 2 ¤0 2 , ..., yt−p . × xt , xt = 1, yt−1

Assume the Lyapunov exponent associated with the stochastic recurrence equation form is negative5 . Then {yt , ht } are stationary with tail (7) and index κy > 0 as long at least one as one β 0i > 0 (e.g. Basrak et al 2002: Theorem 3.1). If β 0i > 0 then mt (θ0 ) = ( 2t − 1)h2t xt have tails (7) with indices {κy /2, κy /4, ..., κy /4}6 . 2 Integrability of least squares-type mt (θ0 ) requires E[ 2t ] < ∞ and E|h2t yi,t | < ∞, so assume £ ¤ E yt4 < ∞ hence κy > 4.

If β 0 = 0 or β 0 > 0 and κy > 8 then mi,t (θ0 ) have finite variances so all Tα , Tβ i ∼ KT 1/2 . The harsh moment requirement is relaxed with QML equations due to scaling. See Section 3.4. Since (∂/∂θ)mt (θ) = −xt x0t have indices in {κy /4, κy /2, ∞} and κy > 4 each component is integrable, thus JT ∼ J. The standard deviations Γi,i,T are almost as simple to compute since κy ∈ (4, 8) implies the intercept term Γ1,1,T ≤ (E[( 2t − 1)2 h4t ])1/2 = K, and the remaining Γi,i,T = (E[m2i,t (θ0 )Ii,T,t (θ0 )])1/2 ∼ Kci,T (kT /T )1/2 = K(T /kT )4/κy −1/2 by Feller property (FE) and ci,T = K(T /kT )1/(κy /4) for tail (7). Therefore J1,1,T /Γ1,1,T ∼ K and all other Ji,i,T /Γi,i,T ∼ K(T /kT )−(4/κy −1/2) = o(1). Now use (6) to deduce Tα , Tβ i ∼ KT 1/2 . If κy = 8 similar steps under (FE) reveal Tα , Tβ i ∼ KT 1/2 . This proves the next claim. LEMMA 3.2 (Strong-ARCH) Any stationary strong-ARCH process with negative Lyapunov exponent and κy > 4 has GMTTME rates Tα , Tβ i ∼ KT 1/2 . Remark 1: Strong-ARCH are AR in squares yt2 = θ0 xt + vt , where E[vt |=t−1 ] = 0. But stationary AR equations all have the same tail index κ when t is iid with tail (7). Sp Sp 0 0 5 The Lyapunov condition can be replaced with i=1 β i < 1, or i=1 β i = 1 and the error distribution does not have an atom at zero (Bougerol and Picard 1992). 6 In the ARCH(1) case κ solves (E| |κy )2/κy β 0 = 1: large β 0 implies small κ (de Haan et al 1989). y t y 1 1

15

In the ARCH case the AR-in-squares error vt = yt2 − h2t = ( 2t − 1)θ00 xt depends on xt , thus m1,t (θ0 ) = ( 2t − 1)h2t = vt has tail index κy /2 > 2 and all other mi,t (θ0 ) = ( 2t − 2 1)h2t yt−i+1 for i ≥ 2 have index κy /4 > 1 due to feedback. The intuition from Section 3.1 suffices to explain the rate: models with disproportionately heavy-tailed "errors" render less sharp estimates (in this case, less than super-T 1/2 -consistent). Remark 2: The tails of t do not play any role per se. Thicker tailed t and/or larger slopes β 0 imply yt is heavier tailed: why yt is heavy tailed is irrelevant. 3.3

AR-ARCH WITH LEAST SQUARES EQUATIONS Consider an AR(1) with ARCH(1) error ¯ ¯ yt = ρ0 yt−1 + ut , ¯ρ0 ¯ < 1, ut = ht t ,

iid t

∼ N (0, 1)

h2t = α0 + β 0 u2t−1 , α0 > 0, β 0 > 0, θ = [ρ, α, β]0 h ¯ ¡ ¢1/2 ¯¯i ¯ E ln ¯ρ0 + β 0 t ¯ < 0,

and three least squares-type equations used to estimate each θ = [ρ0 , α0 , β 0 ]0 , ⎤ ⎡ (yt − ρyt−1 ) yt−1 ⎥ ⎢ t − ρyt−1 )2 − α − β (yt−1 − ρyt−2 )2 mt (θ) = ⎣ (y ⎦. ³ ´ 2 2 2 (yt − ρyt−1 ) − α − β (yt−1 − ρyt−2 ) × (yt−1 − ρyt−2 )

Thus {yt } is stationary, geometrically β-mixing with regular varying tail (7) and index κy > 0 (Borkovec and Klüppelberg 2001: Theorems 1 and 3), and mi,t (θ0 ) satisfy (7) with indices {κy , κy /2, κy /4}. Error normality can be replaced with sufficient distribution smoothness. Integrability again requires κy > 4, hence JT ∼ J. If κy > 8 then ΓT = T 1/2 I3 . If κy ∈ (4, 8) use E[yt4 ] < ∞ to deduce E[u4t ] < ∞, E[h4t ] < ∞ and E[ 4t ] < ∞ so Σ1,1,T ∼ K, 4 2 Σ2,2,T ∼ K, and Σ3,3,T = E[( 2t − 1)2 h4t yt−1 I(| 2t − 1|h2t yt−1 ≤ c3,T )] ∼ K(T /kT )8/κy −1 . This gives trimmed standard deviations Γ1,1,T = Γ2,2,T = 1 and Γ3,3,T = (T /kT )4/κy −1/2 . Similarly if κy = 8 then Γ1,1,T = Γ2,2,T = 1 and Γ3,3,T = L(T ). The same conclusion as the Strong-ARCH case is therefore reached by working through formula (6): all rates are T 1/2 since feedback between ut and yt−1 dulls the rate below super-T 1/2 -convergence iid

LEMMA 3.3 (AR-ARCH) Any stationary AR(1)-ARCH(1) with t ∼ N (0, 1), E[ln |ρ0 ¡ ¢1/2 1/2 + β0 . t |] < 0 and κy > 4 has GMTTME rates Tρ = Tα = Tβ ∼ KT EXAMPLE 7 (AR with ARCH Error): Estimate only the AR slope ρ0 in the above AR-ARCH: mt (ρ) = (yt − ρyt−1 ) yt−1 .

Notice mt (ρ0 ) = t ht yt−1 has a tail index κy /2 due to feedback, half that in the iid case Example 5. But this implies mt (ρ0 ) is integrable only if κy > 2. Arguments in the proof of Lemmas 3.1 can be used to deduce the following. COROLLARY 3.4 (AR with ARCH Error) Under the conditions of Lemma 3.3 if only ρ0 is estimated with one equation mt (ρ) = (yt − ρyt−1 )yt−1 then Tρ ∼ KT 1/2 if κy > 4, Tρ ∼ KT 1/2 (T /kT )−(2/κy −1/2) if κy ∈ (2, 4), and Tρ ∼ KT 1/2 /L(T ) if κy = 4. 16

Remark : Feedback between error ut and regressor yt−1 substantially elevates estimating equation tail thickness relative to the Jacobian, hence the convergence rate Tρ falls below T 1/2 . 3.4

ARCH WITH QML EQUATIONS Consider an ARCH(1) with QML equations iid

yt = ht t ,

t

2 ∼ (0, 1) , h2t = α0 + β 0 yt−1 = θ00 xt , α0 > 0, β 0 ≥ 0

¡ ¢ mt (θ) = yt2 − θ0 xt {θ0 xt }−2 xt ,

where E[mt (θ)] = 0 if and only if θ = θ0 . Integrability requires E[ 2t ] < ∞, and assume at least one β 0i > 0 to highlight the impact of scaling. If E[ 4t ] = ∞ assume t has tail (7) with index κ ∈ (2, 4]. Scaling mt (θ0 ) = ( 2t 0 − 1)h−2 t xt with β > 0 implies only the tails of t matter, and JT ∼ J. This implies diminished rates of convergence when E[ 4t ] = ∞ for the reasons given in Section 3.1. LEMMA 3.5 Assume E|yt |p < ∞ for some p > 0, β 0 > 0 and E[ 2t ] < ∞. If κ > 4 then Tα = Tβ = KT 1/2 , if κ = 4 then Tα , Tβ ∼ KT 1/2 /L(T ), and if κ ∈ (2, 4) then Tα , Tβ ∼ KT 1/2 (T /kT )−(2/κ −1/2) . Remark 1: In lieu of scaling only trivial restrictions on the tails of yt are required, as long as E[ 2t ] < ∞. Remark 2: Although QML equations permit far heavier tails than least-squares equations, the rates of convergence suffer when t has an infinite kurtosis since scaling eliminates a leverage effect. Nevertheless the rate can be made arbitrarily close to T 1/2 when κ ∈ (2, 4] by maximal trimming kT = [T /L(T )]: Tα , Tβ ∼ KT 1/2 /L(T ). Compare this with the QML rate T 1−2/κ for any κ ∈ (2, 4) (Hall and Yao 2003: Theorem 2.1). Since 1 − 2/κ < 1/2 is strict for any κ ∈ (2, 4), then GMTTM beats QML in rate of convergence as long as the rule of thumb kT = [T /L(T )] is used. iid Remark 3: The same conclusion is met for (nonlinear) GARCH yt = ht (θ0 ) t , t ∼ −4 −2 (0, 1), with QML equations mt (θ) = (yt2 − h2t (θ))ht (θ) × (∂/∂θ)ht (θ) as long as ht (θ0 ) × (∂/∂θ)ht (θ)|θ0 is integrable. Remark 4: Although we do not tackle the general problem of nonlinear errorregressor dependence, it seems reasonable to suspect such feedback will always reduce the rate below T 1/2 when tails are thick enough. 4. FRACTILE SELECTION FOR ASYMMETRIC EQUATIONS The preceding section presents rules for selecting how fast kj,T → ∞ to maximize the efficient GMTTM rates of convergence. In this section we use the rate of identification E[m∗T,t ] → 0 as a criterion for selecting an asymmetric policy {k1,T , k2,T }. Let mt (θ0 ) ∈ R denote a specific equation, and drop θ0 everywhere for notational simplicity. The existence of asymmetric equations is hardly in question. An ARCH(1) yt iid 2 = (α0 + β 0 yt−1 )1/2 t , α0 , β 0 > 0, with t ∼ N (0, 1) has asymmetrically distributed least squares and QML equations. Let mt = ( 2t − 1)zt2 denote a particular scalar least squares 2 equation zt ∈ {1, yt−1 }. Then mt has tail (7) with index κ > 0 (de Haan et al 1989), hence as m → ∞ ¢ 2 ¢ ¡¡ 2 ¢ 2 ¢ ¡¡ 2 κ P t − 1 zt > 1 ∼ m P t − 1 zt > m → d2 P

¡¡

2 t

¢ ¢ ¡¡ − 1 zt2 < −1 ∼ mκ P 17

2 t

¢ ¢ − 1 zt2 < −m → d1 .

iid

Since t ∼ N (0, 1) it follows d2 > d1 . Now let mt be an arbitrary scalar equation. Simplify exposition further by sharpening condition D1.ii in Appendix A to an exact Pareto tail: for all m ≥ M and some M ≥ 1 P (mt < −m) = d1 m−κ1 and P (mt > m) = d2 m−κ2 , where di > 0, and min{κi } > 1 ensures integrability. It is a simple exercise to show for large T µ µ ¶ ¶ ¤ £ d1 κ1 d2 κ2 1 1 − . E m∗T,t = E [mt I (−lT ≤ mt ≤ uT )] = κ −1 κ 1 κ1 − 1 lT κ2 − 1 uT2 −1

Although identification E[m∗T,t ] → 0 holds for any thresholds {lT , uT } → ∞ hence any intermediate order sequences {k1,T , k2,T }, we can always relate k1,T to k2,T to promote E[m∗T,t ] → 0 arbitrarily fast. In general, the heavier tail (e.g. d2 > d1 and/or κ2 < κ1 ) should be trimmed less (e.g. uT > lT hence k2,T < k1,T ). Simply symmetric £ consider ¤ trimming lT = uT = cT with d2 > d1 and κ1 = κ2 . Then E m∗T,t < 0 because a disportionate number of large positive values are being trimmed. The is ameliorated for each T by lowering the number of trimmed right-tail equations k2,T < k1,T . This is strongly demonstated by simulation: see Section 6. 4.1

Asymmetric Fractile Policy 1/κ

The thresholds {lT , uT } for Pareto tails by construction (4) are lT = d1 1 (T /k1,T )1/κ1 1/κ 1/κ and uT = d2 2 (T /k2,T )1/κ2 . Use this to deduce E[m∗T,t ] ≈ 0 for any T when d2 2 /(T /k2,T )1−1/κ2 1/κ1

≈ d1

/(T /k1,T )1−1/κ1 or {k1,T , k2,T } satisfy 1−1/κ2

k2,T

1−1/κ k1,T 1

1/κ1

= T 1/κ1 −1/κ2

d1

1/κ d2 2

(1 − 1/κ2 ) = T 1/κ1 −1/κ2 × D (d, κ) (1 − 1/κ1 )

The ARCH model is a simple case since κ1 = κ2 = κ for any particular equation, hence k2,T /k1,T ∼ (d1 /d2 )1/(κ−1) . For example, if kj,T ∼ δ j T / ln(T ) then any policy {δ 1 , δ 2 } must satisfy δ 2 /δ 1 ∼ (d1 /d2 )1/(κ−1) . Similarly, if kj,T ∼ δ j T λj then δ 2 /δ 1 ∼ (d1 /d2 )1/(κ−1) and λ1 = λ2 . Table 1 presents three classes of k1,T based on the Section 3 rules of thumb for optimizigin GMTTM rates convergence, and solving for k2,T .

k1,T Class

Table 1 : General Policy Classes {k1,T , k2,T } k2,T 1/(1−1/κ2 )

(1−1/κ1 )/(1−1/κ2 )

L1 (T )

T κ1 (1−κ1 /κ2 )/(1−1/κ2 )

L1 (T )

D (d, κ)

T L1 (T )

D (d, κ)1/(1−1/κ2 )

δ 1 T λ1

´1/(1−1/κ2 ) ³ 1−1/κ1 D (d, κ) T [λ1 (1−1/κ1 )+(1/κ1 −1/κ2 )]/(1−1/κ2 ) δ1

T L1 (T )

(1−1/κ1 )/(1−1/κ2 )

18

k2,T Class L2 (T ) T λ2 T L2 (T )

δ 2 T λ2

In each case the rate of increase kj,T → ∞ should differ across tails only if tail indices differ. The first class k1,T ∼ L1 (T ) only makes sense if κ2 ≥ κ1 since κ2 < κ1 implies λ2 < 0 hence k2,T → 0. Otherwise fix k2,T ∼ L2 (T ) and solve for k1,T . If κ2 > κ1 the right-tailed observations are trimmed exponentially faster than left-tailed observations. Only the heaviest tail can have a slowly varying fractile in this class. Based on Section 3 this class apparently makes sense solely for models without error-regressor feedback. The remaining classes T /Lj (T ) and δ j T λj apply to both tails for any scales or indices. In the second class T /Lj (T ) the heavier tail (e.g. κ1 < κ2 ) receives heavier trimming (e.g. k1,T /k2,T → ∞) aligning this class naturally to models with error-regressor feedback. In the third class δ 1 T λ1 the heavier tail (e.g. κ1 < κ2 ) receives lighter trimming (e.g. k1,T /k2,T → 0), aligning this class with models that do not exhibit error-regressor feedback. GARCH models with innovations that have the same left- and right-tail index have least squares and QML equations that are symmetric in κ1 = κ2 = κ. In each class above it is easy to show if κ1 = κ2 = κ then k2,T = (d1 /d2 )

1/(κ−1)

k1,T ,

so a thicker right tail d2 > d1 implies trimming fewer right tail equations k2,T < k1,T . 4.2

Adaptive Asymmetric Fractile Computation

The shape parameters di and κi can be estimated by classic methods, although most estimators require a nuisance intermediate order sequence, say {hT }. Consider the widely used estimators by Hill (1975) and Hall (1982): κ ˆ 1,hT (θ) =

hT hT ³ ´ ³ ´ 1 X 1 X (−) (−) (+) (+) ln m(i) (θ) /m(hT +1) (θ) , κ ˆ 2,hT (θ) = ln m(i) (θ) /m(hT +1) (θ) hT i=1 hT i=1

´κˆ 1,hT (θ) ´κˆ 2,hT (θ) hT ³ (−) hT ³ (+) , dˆ2,hT (θ) = . m(hT +1) (θ) m(hT +1) (θ) dˆ1,hT (θ) = T T

Define

ˆ hT D

´ ³ −1 κ ˆ −1 (θ) 1 − κ ˆ 1,hT (θ) ˆ 2,h d1,hT (θ) T ´, (θ) := ׳ κ ˆ −1 −1 2,hT (θ) ˆ 1−κ ˆ 1,hT (θ) d2,hT (θ)

and let {k1,T (θ) , k2,T (θ)} denote any policy that satisfies the relation k2,T (θ) k1,T (θ)

1−ˆ κ−1 2,h (θ) T

1−ˆ κ−1 1,h (θ) T

−1

−1

ˆ h (θ) . = T κˆ 1,T (θ)−ˆκ2,T (θ) × D T

The parameter θ is obviously problematic since computing ˆθT requires {k1,T (θ) , k2,T (θ)} and {k1,T (θ) , k2,T (θ)} require a plug-in. There are at least two solutions. First, iterate by initializing a policy {k1,T , k2,T } to obtain ˆθT , then plug in ˆθT above to obtain an updated policy, and repeat. The initial policy {k1,T (˜θT ), k2,T (˜θT )} can be based on any consistent estimator ˜θT with only two iterations: compute ˜θT , then ˆθT , and stop. This will be ˆ T , e.g. Sˆ−1/2 (˜θT )/||Sˆ−1 (˜θT )||. convenient since ˜θT may be used to compute the weight Υ T T Second, estimate ˆθT in one step by minimizing the criterion with {k1,T (θ) , k2,T (θ)} by a standard numerical optimization routine. We complete this section with a few deserved comments. First, Hill (2010) shows 1/2 1/2 κ ˆ i,hT = κ + Op (1/hT ) and dˆi,hT = d1 + Op (1/hT ) for a truly vast array of dependent, 19

heterogeneous processes that covers all equation processes allowed under tail decay D1 and mixing D3. The plug-in ˆθT will not affect consistency of {ˆ κi,hT , dˆi,hT } by an application of expansion Lemma D.1 in Appendix D. See Hsing (1991) and Hill (2010) for reviews on other tail shape estimators. Second, it is well beyond the scope of this paper to characterize in theory how a stochastically selected policy {k1,T (θ) , k2,T (θ)} impacts the GMTTM limit law. Third, even if we accept as valid any particular method for computing {k1,T (θ) , k2,T (θ)}, we only have a relationship for which there are infinitely many solutions. In theory any solution is valid, but small sample evidence suggests heavier tails should be more (less) trimmed if the errors are (not) dependent on the regressors. This is reverberated by four different simulations that use different fractile policies, including {k1,T (˜θT ), k2,T (˜θT )} with QML ˜θT . Fourth, still (!) we have an intermediate order sequence {hT } that must be chosen to compute κ ˆ hT and dˆhT . The extreme value theory literature is entirely silent on fractile selection for dependent data, although some theory exists for iid data with exact or nearly Pareto tails (e.g. Drees et al 2001, Huisman et al 2001). All of these matters deserve more attention, but far exceed the limits of this paper’s scope. We leave the topic of fractile selection to the simulation study, below. 5. AUTOREGRESSIONS AND ARCH We now verify the major assumptions for heavy-tailed stationary AR and ARCH with weight ΥT = ST−1 ||ST ||.

5.1

Autoregression

Consider a stationary AR(r) process with iid, heavy-tailed errors 0

yt = θ00 xt + t , xt = [yt−1 , ..., yt−r ] ,

iid t

∼ (7) with κ ∈ (1, 2) , E[ t ] = 0

(8)

¡ ¢ mt (θ) = yt − θ0 xt xt .

Assume t has an absolutely continuous marginal distribution symmetric at zero, uniformly bounded and positive R-a.e. Then yt is uniformly L1+ι -bounded geometrically β-mixing (An and Huang 1996: Theorem 3.1), and yt and mi,t (θ0 ) = t yt−i have tail (7) 2 with the same index κ while mi,t (θ) = t yt−i − (θ0 − θ)0 xt yt−i has tail (7) with index 0 κ/2 if θ 6= θ (Cline 1986, 1989, Davis and Resnick 1996). Impose symmetric trimming m∗i,T,t (θ) = mi,t (θ)I(|mi,t (θ)| ≤ ci,T (θ)) where (T /kT )P (|mi,t (θ)| > ci,T (θ)) → 1, with the same fractile kT for each equation. Error independence, stationarity, linearity, distribution continuity, mixing and the efficient weight ensure continuity and power-law tails D1, differentiability D2, mixing D3, identification I1 and covariance non-degeneracy M2. The envelope bounds D4 are trivial given the linear form of mt (θ), compactness of Θ, and Lp -boundedness. Further I2 is trivial since E[m∗T,t (θ0 )] = 0 under symmetry, integrability and orthogoˆ T = Sˆ−1 ||SˆT || given HAC consistency Lemma nality; M1 holds under I1 and D1-D6 for Υ T 2.4. What remains is Jacobia property D5, indicator property D6, and smoothness I3. D5 (Jacobia). D5.i. Each part is trivial given the linear data generating process and iid innovations with absolutely continuous marginal distribution. D5.ii. We must show supθ∈U 0 (δT ) {||JT (θ) − JT ||} = o(||JT ||). Stationarity, Lp ∗ boundedness of yt and the construction JT,i,j,t (θ) = −yt−i yt−j IT,j,t (θ) for a linear AR im∗ ∗ s ply E[(supθ∈U 0 (δT ) {||JT (θ) − JT |||}) ] ≤ K for any s ∈ (0, p/2), hence supθ∈U 0 (δT ) {||JT∗ (θ) 20

− JT∗ ||} = op (||JT ||) by Markov’s inequality and ||JT || → ∞. Hence D5.ii follows if we show ||JT∗ − JT || = op (||JT ||). ∗ In lieu of the general Jacobian property E[JT,t ] = JT (1 + o(1)) by Lemma C.4 in Appendix C, it suffices to show £ ¤ T T 0 0 1 X yt−i yt−j IT,j,t (θ ) − E yt−i yt−j IT,j,t (θ ) 1X Yi,j,T = op (1) . := T t=1 kJT k T t=1 Since geometric β-mixing ensures {yt−i yt−j IT,j,t (θ0 ), =t } forms a geometric L2 -mixingale, apply stationarity and the Lemma E.1 variance bound in Hill and Renault (2010) to PT 2 ]. Lemma 3.1 dictates ||JT || ∼ K(T /kT )2/κ−1 , deduce E(1/T t=1 Yi,j,T )2 ≤ T −1 E[Yi,j,T 2 2 yt−j IT,j,t (θ0 )] and arguments in the line of proof of Lemma 3.1 can be used to show E[yt−i ∼ K (T /kT )4/κ−1 . Therefore

1 £ 2 ¤ E Yi,j,T = O T

Ã

1 (T /kT )4/κ−1 T (T /kT )4/κ−2

!

=O

µ

(T /kT ) T



= o(1).

The law now follows by Chebyshev’s inequality. D6 (indicator class). The fdd’s of mi,t (θ) are absolutely continuous so we may assume without any loss of generality the thresholds ci,T (θ) are continuous. Further, fdd’s of mi,t (θ) have bounded densities uniformly on Θ. Therefore Ii,T,t (θ) is L2 -Lipschitz: E[(Ii,T,t (θ) − Ii,T,t (˜θ))2 ] ≤ K||θ − ˜θ||. Proving the L2 -bracketing numbers satisfy D6 is then a classic exercise. See Giné and Zinn (1984), Pollard (1984, 1989, 2002) and van der Vaart and Wellner (1996). I3 (smoothness). Define QT (θ) := E[m∗T,t (θ)]0 × ΥT × E[m∗T,t (θ)]. Identification at interior point θ0 ∈ Θ and the definition of a derivative imply E[m∗T,t (θ)] = JT (θ − θ0 ) + o(||JT || × ||θ − θ0 ||), hence mT := supθ ||E[m∗T,t (θ)]|| ≤ K||JT || × (1 + o(1)) given compactness of Θ. Therefore since ΥT is bounded ½ ¾ 0 ª ¡ © −2 ¢ ¢ JT ¡ 0 0 JT 0 inf inf θ−θ mT QT (θ) ≥ ×(1 + o (1))+o (1) . ΥT θ−θ kJT k kJT k ||θ−θ0 ||>δ ||θ−θ0 ||>δ

But boundedness and positive definiteness of ΥT imply JT0 ΥT JT /||JT ||2 is positive definite for sufficiently large T , so I3 follows. Asymptotic normality is therefore a primitive property of GMTTM for stationary AR data with a sufficiently smooth error distribution. The above verification with Theorems 2.2 and 2.6 and Lemma 3.1 suffice to prove the following claim. 1/2 COROLLARY 5.1 Let yt satisfy (8) and let ΥT = ST−1 /||ST ||−1 . Then VT (θˆT − θ0 ) d → N (0, Ir ) and VˆT = VT (1 + op (1)). In particular, T 1/2 (T /kT )1/κ−1/2 (ˆθi,T − θ0i ) d

→ N (0, Vi ) for any kT → ∞ and kT = o(T ) each i and some Vi < ∞. Further, d (T 1/κ /L(T ))(ˆθi,T − θ0i ) → N (0, Vi ) for some Vi < ∞ and slowly varying L(T ) → ∞ when kT ∼ L(T ).

Remark : The result generalizes to autoregressive distributed lags, and AR-ARCH with adjustments to the convergence rates. 5.2

ARCH, GARCH, Nonlinear-GARCH

21

The ARCH(p) process {yt } from section 3.2 is stationary, Lp -bounded, and geometrically β-mixing with tail (7). See Basrak et al (2002: Theorem 3.1) and Meitz Pp and0 Saikkonen (2008: Proposition 1). The Lyapunov condition can be replaced with i=1 αi Pp < 1, or i=1 α0i = 1 and the error distribution does not have an atom at zero (Bougerol and Picard 1992). Least squares and QML equations mt (θ) are differentiable in θ with absolutely continuous marginal distributions, and integrable at θ0 for all κy > 4 (least squares) or all κy > 2 (QML). All conditions can be verified using arguments from Sections 4.1 and 5.1. The same basic arguments apply to a wide variety of smooth nonlinear AR-GARCH models, including Quadratic ARCH, smooth transition AR and GARCH, Asymmetric GARCH, and so on (An and Huang 1996, Borkovec and Klüppelberg 2001, Carrasco and Chen 2002, Cline 2007, Meitz and Saikkonen 2008). SIMULATION STUDY In this section we compare one-step and two-step (1) (2) GMTTM’s, denoted ˆθT and ˆθT , to GMM and QML ˆθg and ˆθq . Write ˆθ to denote any estimator. The models are LOCATION, AR(1), ARCH(1), GARCH(1,1), Threshold ARCH(1) and Quadratic ARCH(1), covering symmetric and asymmetric DGP’s. Let N0,1 denote a standard normal law and Pγ a symmetric Pareto law with index γ > 0 : if t is governed by Pγ then P ( t > ) = P ( t < − ) = (1/2) × (1 + )−γ . Define the moment supremum κy = sup{α > 0 : E|yt |κ < ∞}. See Table 2 for each DGP yt = f (θ0 , t , xt ) where xt may contain a constant and lags of yt . We simulate 1000 samples of size T = 1000 for each model. 6.

Model Type LOCATION AR(1) ARCH(1) ARCH(1) ARCH(1) GARCH(1,1) IGARCH(1,1) GARCH(1,1) TARCH(1) TARCH(1) QARCH(1) QIARCH(1)

TABLE 2 - Data Generating Functional Form: f (θ0 , t , xt ) 1+ t .9 × yt−1 + t , 2 {.3 + .5yt−1 }1/2 t 2 {.3 + .6yt−1 }1/2 t 2 {.3 + .6yt−1 }1/2 t 2 {.3 + .3yt−1 + .6h2t−1 }1/2 t 2 {.3 + .4yt−1 + .6h2t−1 }1/2 t 2 {.3 + .3yt−1 + .6h2t−1 }1/2 t 2 {.3 + .6yt−1 × I(yt−1 < 0)}1/2 t 2 {.3 + .6yt−1 × I(yt−1 < 0)}1/2 t |.3 + .8yt−1 | t |.3 + yt−1 | t

Processes θ0r iid errors 1 P1.5 , P2.5 .9 P1.5 , P2.5 .5 N0,1 .6 N0,1 .6 P2.5 .6 N0,1 .6 N0,1 .6 P2.5 .6 N0,1 .6 P2.5 .8 N0,1 1 N0,1

t

κy 1.50, 2.50 1.50, 2.50 4.657 3.80 1.80 4.10 2.00 1.50 5.258 2.60 3.509 2.00

The GMM and GMTTM estimating equations are mt (θ) = ut (θ) × zt (θ) for some "error: ut (θ) ∈ R and "regressor" zt (θ) ∈ Rq described in Table 3. In each location and AR model least squares-type equations are used, ut (θ) = yt − θ0 xt . Recall QML-type 7 Basrak et al (2002: eq. 2.10) show E[(β 2 + γ)κ/2 ] = 1 for GARCH(1,1) y = h t t t with iid t and t 2 h2t = α + βyt−1 + γh2t−1 , provided the Lyapunov index is negative. The index κ is computed as κ ˆ = S 2 κ/2 − 1|} over K ∈ {.01, .02, ..., 10} based on N = 100, 000 iid random arg minκ∈K {|1/N N t=1 (β t + γ) draws t from N0,1 , P2.5 or P2.1 . The 1% bands are less than .001 in all cases. 8 An ARCH affect exists only for the left-tail, so κ solves β κ/2 E[| |κ I( < 0)] = 1 (Cline 2007: Lemma t t 2.1 and Example 3). But t is symmetrically distributed about 0, hence β κ/2 E[| t |κ ] = 2. The monte carlo experiment described in above allows for computation of κ. κ 9 Since y = |α + βy κ t t−1 | t use Lemma 2.1 of Cline (2007) to deduce β E[| t | ] = 1.

22

equations in GARCH cases allows for the minimal moment condition E[ 2t ] < ∞. We therefore use QML-type ut (θ) = (yt2 − h2t (θ))h−4 t (θ) for GARCH models with index κy ≤ 4 and least squares-type ut (θ) = yt2 − h2t (θ) if κy > 4. We consider both exact and over-identification cases concerning choice of zt (θ). Exact identication allows for direct comparisons with least squares and QML. The two cases result in qualitatively similar results, so we only present output for exact identication. See Hill and Renault (2010) for omitted results.

Model Location AR(1) ARCH(1) ARCH(1)

TABLE 3 iid errors t P1.5 , P2.5 P1.5 , P2.5 N0,1 N0,1 , P2.5

GARCH(1,1)b

N0,1

GARCH(1,1)

N0,1 , P2.5

c

TARCH(1) QARCH(1)

N0,1 , P2.5 N0,1 , N0,1

Estimating Equations mt (θ) = ut (θ) × zt κy ut (θ) ∈ R zt (θ) ∈ Rq 0 1.5, 2.5 LS 1 or [1, yt−1 ] 1.5, 2.5 LS [yt−i ]qi=1 where q = 1 or 2a £ © 2 ªq ¤0 4.65 LS 1, y £ © t−i ªi=1 q ¤0 2 3.80, 1.80 QML 1, yt−i i=1 ªq ¤0 £ © 2 ∂ 4.10 LS 1, yt−i , h2t−i (θ) i=1 + β h2t−1 (θ) ∂θ ªq ¤0 £ © 2 ∂ 2.00, 1.50 QML 1, yt−i , h2t−i (θ) i=1 + β h2t−1 (θ) ∂θ £ 2 ¤0 q 5.25, 2.60 LS, QML It−q 1, yt−1 It−1 , ..., yt−q £ © ªq ¤0 2 3.50, 2.00 QML 1, yt−i , yt−i i=1

Notes: a. In all cases q = 1 or 2. Exact identification corresponds to q = 1. 2 + βh2t−1 (θ); c. It := I(yt < 0). b. h2t (θ) = ω + αyt−1

All processes in this study have regularly varying distribution tails (7) with index κy > 0 (Hannan and Kanter 1977, Cline 1986, 1989, Borkovec and Klüppelberg 2001, Cline 2007). In the iid and AR cases tail thickness is gauged by the Pareto innovations. All equations mi,t (θ0 ) have symmetric tail indices, and all GARCH equations are asymmetric through a tail scale. See Section 4. iid The collective GARCH group have heavy tails due to the innovations t ∼ P2.5 and/or the parametric structure. The kurtosis of yt is infinite in most cases, and variance is infinite for IGARCH and QIARCH, and ARCH and GARCH with Pareto errors. Thus, in all random volatility models the GMM estimator is not asymptotically normal, and QML has not been shown to deliver an asymptotically normal estimator when E[ 4t ] = ∞. Nevertheless, QML is consistent in all cases10 . 6.1

Evaluation

We analyze rth parameter estimates ˆθT,r for brevity. Consult the third column of Table 2 for the true θ0r . Estimator performance is gauged by simulation means, meansquared-errors, and Kolmogorov-Smirnov tests of standard normality. Let {ˆθT,j,r }1000 j=1 be the independently drawn sequence of estimates of θ0r . P 0 2 ˆ We usse the simulation mse sˆ2T,r = (1/1000) 1000 j=1 (θ T,j,r − θ r ) to generate an iid P1000 0 ˆ ˆ sequence of ratios {tˆj,r }1000 s2T,r . We report (1/1000) j=1 ˆθT,j,r , j=1 , tj,r = {θ T,j,r − θ r }/ˆ

ˆ q (θ) denote the QML criterion. It can be easily verified for all models above (∂/∂θ)Q ˆ q (θ) = Q 0 g (θ) for some L -bounded martingale difference sequence {g (θ ), = }. Since an L -bounded t t t 1+ι 1+ι t=1 mds trivially forms a uniformly integrable L1 -mixingale, Andrews’ (1988: Theorem 1) law of large numbers S p applies: 1/T T t=1 gt (θ0 ) → 0. Further, each gt (θ) satisfies Andrew’s (1992: W-LIP) Lipschitz condition S p W-LIP given differentiability and the QML criterion form, so supθ |1/T T t=1 {gt (θ) − E[gt (θ)]}| → 0 by Theorem 3 of Andrews (1992). Consistency of QML is now a standard exercise (e.g. Pakes and Pollard 1989: Corollary 3.4). See also Hall and Yao (2003) for the GARCH case with infinite kurtosis errors. 1 0 Let

ST

23

sˆT,r , and the KS test is based on {tˆj,r }1000 j=1 . 6.2

Fractile Selection

In all models we use the same fractiles {k1,T , k2,T } for each equation since evaluation focuses on just ˆθT,r . In the first experiment we use regularly varying trimming fractiles kj,T = [T λj ] where λj ∈ {.01, .02, ..., .99}. Models with symmetrically distributed equations (LOCATION, AR) are trimmed symmetrically: λ1 = λ2 . All GARCH models demand asymmetric trimming. These results are summarized in Tables 4-6. The class T λj is best aligned with symmetric equations without error-regressor feedback, as in LOCATION and AR. Still, this policy does not optimize the convergence rates in any model treated here. In a second experiment we use rate optimizing policies: kT = [δ ln(T )] for AR; kT = [δT / ln(T )] for LOCATION; and kj,T = [δ j T / ln(T )] for all GARCH11 . In all cases we minimize the KS statistic over a one or two dimensional grid of δ i ∈ {.01, .02, ..., 3.0}. See Table 7. 6.3

GMTTM Weight

(1) ˆ T = Iq , and Let ˆθT be the one-step GMTTM estimator based on the naïve weight Υ (2) ˆ ˆT = Σ ˆ −1 (˜θT ) × ||Σ ˆ T (˜θT )|| and θT (˜θT ) the two-step estimator with efficient weight Υ T plug-in θ˜T . (1) (2) (1) In simulations not reported here ˆθT dominated ˆθT (ˆθT ) across models and evaluation criteria due to the computational complexity of a multi-step algorithm under nonlinearity (2) associated with trimming. Further, the two-step ˆθ (ˆθq ) with a QML plug-in dominated T

(1) (2) ˆ θT and ˆθT (ˆθg ) with a GMM plug-in. Since QML is more stable than GMM and one-step GMTTM due to non-trimming and scaling (for GARCH), and ˆθq is consistent for each (2) DGP in this study, we compute θˆT = θˆT (θˆq ) in all cases. In all cases the GMM estimator is computed in two steps using a QML first-step plug-in.

6.4

Summary of Results

Refer to Tables 4-7 for all results. Tail-trimming always delivers an approximately normal estimator. The GMTTME is roughly normal even for the profoundly heavy-tailed linear and nonlinear GARCH models. By comparison the GMME fails tests of normality in every heavy tailed case as expected, and the QMLE is non-normal in all cases where it is not asymptotically normal (infinite variance location and AR, GARCH with infinite kurtosis error). The most notable findings are summarized below. i. In the presence of heavy-tails the GMME and QMLE strongly fail KS tests of normality, while the GMTTME passes roughly as well as in any other case. ii. The QMLE fails normality tests in all GARCH cases where the errors have an infinite fourth moment. Further, even though the QMLE for IGARCH with Gaussian innovations is asymptotically normal (e.g. Lumsdaine 1996), for small samples it is demonstrably non-normal as shown elsewhere (e.g. Lumsdaine 1995). iii. Asymmetric trimming for asymmetrically distributed equations is always optimal. Approximate normality for a small sample is promoted by trimming more observations from the thinner tail, with the intuition discussed in Section 4. iv. Lighter trimming ensures approximate normality for location and AR models, where heavier tails require monotonically less trimming. Conversely, heavier trimming 1 1 Convergence rate optimization has only been verified for LOCATION, AR and ARCH in Section 3. We must leave the study of the convergence rate for nonlinear models and GARCH models to future research.

24

is required for GARCH models with more trimming for heavier tails. Both outcomes match theory predictions from Section 3. This is verified by optimizing the KS statistic over fractile classes [T λj ], [δ j ln(T )] or [δ j T / ln(T )] and comparing λj and δ j . In general the heavier the tails in location and AR the smaller is the KS minimizing λ and δ; and the heavier the tails in GARCH the larger is λ and δ. v. The GMTTME works equally well if kj,T = [T λj ] or the convergence rate optimizing variety kj,T = [δ j ln(T )] or [δ j T / ln(T )] is used. This is rather trivial since a grid search renders the KS minimizing values {k2,T , k2,T } essentially identical. 7. CONCLUSION This paper develops a robust GMM estimator for possibly very heavy tailed data commonly encountered in financial and macroeconomic applications. This is accomplished by trimming an asymptotically vanishing portion of the sample estimating equations. Our approach applies equally to asymmetric or symmetric processes with thin or thick tails. We prove trimming estimating equations themselves ensures asymptotic normality, while tail -trimming can promote consistency for θ0 , super-T 1/2 -convergence for models without error-regressor feedback, and a rate that beats QML for GARCH models with infinite kurtosis errors. Simulation work demonstrates the new estimator is approximately normal for a variety of linear and nonlinear data generating processes with heavy tails; symmetric trimming leads to profoundly poor estimates for asymmetric data; and GMTTM dominates GMM and QML in heavy tailed cases. Future work should tackle rates of convergence for nonlinear processes; the tradeoff between small sample distribution and efficiency; adaptive methods for selecting the trimming fractiles {k1,i,T , k2,i,T }; and other criteria for trimming.

APPENDIX A: Assumptions Let {ΥT } be a sequence of positive definite weight matrices ΥT ∈ Rq×q . The sample criterion is T T X X ˆ T (θ) := 1 ˆT × 1 ˆ T ∈ Rq×q , Q m ˆ ∗T,t (θ)0 × Υ m ˆ ∗ (θ), where Υ T t=1 T t=1 T,t

hence the GMTTME solves ˆθT = argmin{Q ˆ T (θ)}. θ∈Θ

Under the identification and smoothness conditions detailed below, ˆθT exists and is unique. Asymptotic arguments require the following matrix constructions, some of which are already defined above. The trimmed equation instantaneous and long run covariance matrices are £ 0¤ ΣT (θ) := E m∗T,t (θ) m∗T,t (θ) and ΣT := ΣT (θ0 ) ST (θ) :=

T ¤ 1X £ ∗ E mT,s (θ) m∗T,t (θ)0 and ST := ST (θ0 ); T s,t

the tail-trimmed moment envelope is ° £ ¤° mT = sup °E m∗T,t (θ) ° ; θ∈Θ

25

population and sample Jacobia are JT (θ) :=

∗ (θ) JT,t

¤ ∂ £ ∗ E mT,t (θ) ∈ Rq×r and JT = JT (θ0 ) ∂θ ∙

∂ := mi,t (θ) × Ii,T,t (θ) ∂θ

∗ (θ) := JˆT,t



∂ mi,t (θ) × Iˆi,T,t (θ) ∂θ

¸q

and JT∗ (θ) :=

i=1

¸q

i=1

T 1X ∗ J (θ) T t=1 T,t

T 1 X ˆ∗ and JˆT∗ (θ) := J (θ); T t=1 T,t

and the Hessian and scale are HT (θ) := JT (θ)0 ΥT JT (θ) ∈ Rr×r

and HT := HT (θ0 )

VT (θ) := T × HT (θ) [JT0 (θ) ΥT ST (θ) ΥT JT (θ)]

−1

HT (θ) and VT := VT (θ0 ).

Write compactly throughout ci,T (θ) := max {li,T (θ) , ui,T (θ)} and cT (θ) = max {ci,T (θ)} 1≤i≤q

ki,T = max {k1,i,T , k2,i,T } and kT = max {ki,T } . 1≤i≤q

asymptotiFour sets of assumptions for θ0 ; ˆθT can be expressedP PTensure∗ identification T 0 PT cally as a linear function of t=1 m ˆ T,t (θ ); t=1 m ˆ ∗T,t (θ) is sufficiently close to t=1 m∗T,t (θ) P −1/2 T 0 ∗ ˆ∗ ˜ ˆ ˜ uniformly on Θ; ST t=1 mT,t (θ ) is asymptotically normal; and JT (θ T ) and ST (θ T ) are consistent for consistent ˜θT . Most are versions of standard regulatory conditions contoured to heavy tailed data under tail trimming. The remaining are easily verified for linear-in-parameters models. See Section 4 for dynamic linear regression and ARCH models. Let {=t } be any sequence of increasing σ-fields adapted to {mt (θ)}, θ ∈ Θ, where {=t } itself does not depend on θ. The first set characterizes matrix norms, weight limits and covariance definiteness12 . M1 (weight). ΥT is positive definite for every T ≥ N and sufficiently large N ≥ 1; p ˆ T − ΥT || → 0 and ||ΥT − Υ0 || → 0 for some positive definite Υ0 , 0 < ||Υ0 || < ∞. and ||Υ M2 (covariance non-degeneracy ). lim inf T ≥N inf θ {λmin (AT (θ))} > 0.

Each AT (θ) ∈ {ΣT (θ), ST (θ)} satisfies

Remark 3: M1 is standard. M2 imposes positive definiteness for sufficiently large T ≥ N since trimming can technically render a zero matrix, hence, e.g. λmin (ΣT (θ)) = 0 for some T . The second set promotes local identification of θ0 . I1 (identification by mt (θ)). E[mt (θ)] = 0 if and only if θ = θ0 , a unique interior 1 2 We simplify notation by ignoring measurability issues that arise when taking a supremum over an index set when the index itself is a function. We implicitly assume all functions in this paper satisfy Pollard’s (1984) permissibility criteria, the measure space that governs all random variables is complete, and therefore all majorants are measurable. See also Dudley (1978) for an appeal to Suslin measurability or image admissible Suslin. Thus, probability statements for majorants are with respect to outer probability, and expectations over majorants are outer expectations.

26

point of compact Θ ⊂ Rr .

I2 (identification by m∗T,t (θ)). E[m∗T,t (θ0 )] = o(||ST ||1/2 /T 1/2 ) for T ≥ N and sufficiently large N ≥ 1.

∗ I3 (smoothness). inf T ≥N inf ||θ−θ0 ||>δ {m−1 T ||E[mT,t (θ)]||} > 0 for tiny δ > 0, and lim inf T ≥N {mT } > 0 for some N ≥ 1.

Remark 1: Identification E[m∗T,t (θ0 )] → 0 is assured by I1 and Lebesgue’s dominated convergence. In lieu of the GMM quadratic criterion, I2 states m∗T,t (θ) identifies θ0 sufficiently fast as information accumulates T → ∞. If the marginal distributions mt (θ0 ) are symmetric then E[m∗i,T,t (θ0 )] = 0 by I1 for any fractiles k1,i,T = k2,i,T or thresholds li,T (θ0 ) = ui,T (θ0 ). Further, we show in Section 4 if m∗i,T,t (θ0 ) has exact Pareto tails then E[m∗i,T,t (θ0 )] ≈ 0 arbitrarily close for each T for infinitely many sequences {k1,i,T , k2,i,T }. Otherwise ST = o(T ) is trivial in thin-tailed cases by the Cauchy-Schwartz inequality, and holds for any threshold sequences {li,T (θ), ui,T (θ)} for equations with power-law tail decay under intermediate order tail-trimming. See Lemma C.2. Remark 2: Versions of smoothness I3 are standard for consistency (Huber 1967, Pakes and Pollard 1989, Newey and McFadden 1994). The envelope scale mT is required since mt (θ) need not be integrable on Θ-a.e. in heavy-tailed cases, whereas E[m∗T,t (θ)]/mT is always well defined. Consider an AR(1) yt = θ0 yt−1 + t with |θ0 | < 1, =t = σ(yτ : τ ≤ t), martingale difference innovations E[ t |=t−1 ] = 0 with infinite variance E[ 2t ] = ∞, and one equation mt (θ) = (yt − θyt−1 )yt−1 . Then E[mt (θ0 )|=t−1 ] = 0 a.s. hence E[mt (θ0 )] 2 = 0, but in general mt (θ) = −(θ − θ0 ) × yt−1 is non-integrable for any coefficient θ 6= θ0 . p This matters for a proof of consistency ˆθT → θ0 since that requires a ULLN for m∗T,t (θ) by well known arguments (e.g. Pakes and Pollard (1989)13 . The next set concerns properties of the equations mt (θ), trimming indicators Ii,T,t (θ) ∗ ∗ and the random Jacobian matrices JT,t (θ) and JˆT,t (θ). Let κi (θ) ∈ (0, ∞] denote the p moment supremum of mi,t (θ): E|mi,t (θ)| < ∞ if and only if p < κi (θ), where κi (θ) = ∞ is possible (e.g. bounded support, exponential tail decay). Similarly Θ2,i ⊆ Θ is the set of all θ such that κi (θ) ≤ 2. D1 (distribution). i. The finite dimensional distributions of mt (θ) are strictly stationary and absolutely continuous with respect to Lebesgue measure on Θ. ii. Let inf θ κi (θ) > 0 and κi (θ0 ) > 1. If κi (θ) ≤ 2 then P (|mi,t (θ)| > m)} = di (θ)m−κi (θ) (1 + o(1)). In particular supθ∈Θ2,i |mκi (θ) P (|mi,t (θ)| > m)−di (θ)| → 0 where inf θ∈Θ2,i di (θ) > 0. D2 (differentiability). mt (θ) is continuous and differentiable on Θ-a.e. D3 (mixing ). mt (θ) is stationary geometrically β-mixing (absolutely regular): β l : = supA⊂=+∞ E|P (A|=t−∞ ) − P (A)| = o(ρl ) for ρ ∈ (0, 1). t+l

D4 (envelope bounds). supθ ||mt (θ)|| and supθ ||(∂/∂θ)mt (θ)||} are Lι -bounded . D5 (Jacobia). i. JT (θ) exists on Θ-a.e.; lim inf T ≥N ||Ji,i,T || > 0 for each i ∈ {1, ..., q}; supθ ||JT (θ)|| 1 3 Although

the untrimmed equations mt (θ) need not be integrable for arbitrary θ, we prove S ∗ ∗ supθ ||1/T T t=1 {mT,t (θ) − E[mT,t (θ)]}|| = op (m T ) in Lemma D.3 in Appendix D, which suffices for consistency.

27

∗ ∗ < ∞ for each T ; {JT (θ), JT∗ (θ), E[JT,t (θ)], JˆT∗ (θ), E[JT,t (θ)]} have full column rank for each T ≥ N and each θ ∈ Θ.

ii. supθ∈U 0 (δT ) {||JT (θ) − JT } = o(||JT ||) for any δ T → 0. D6 (indicator class). {Ii,T,t (θ) : θ ∈ Θ} satisfies metric entropy with L2 -bracketing H[ ] (ε, Θ, || · ||2 ) = O(ln(ε)), ε ∈ (0, 1). Remark 1: Distribution continuity D1.i and equation differentiability D2 reduce generality, but simplify key uniform arguments since trimming adds substantial complexity. In regression models D1.i requires at least idiosyncratic shocks to have a density. Power-law tails D1.ii in the infinite variance case is very mild since distribution tails are bounded by regularly varying function by Markov’s inequality, and power-laws govern those infinite variances processes that satisfy a central limit theorem (Ibragimov and Linnik 1997, Leadbetter et al 1983, Resnick 1987)14 . Remark 2: Mixing D3 and indicator metric entropy D6 promote uniform laws for m∗T,t (θ) and m ˆ ∗T,t (θ) − m∗T,t (θ), while geometric decay keeps notation simple. Nevertheless, many nonlinear AR-nonlinear GARCH models are covered (An and Huan 1996, Carrasco and Chen 2002, Meitz and Saikonnen 2008). Remark 3: Although we do explicitly bound the rate cT (θ) → ∞ it is implicitly bounded under power-law tail decay D1.ii since for any intermediate order sequences {k1,i,T , k2,i,T } n o 1/2

sup cT (θ)/ kΣT (θ)k

= o(T 1/2 ).

θ

−1/2

See Lemma C.1 in Appendix C. Therefore power-law tails ensures ΣT uniformly relatively stable ½ ° °¾ ³ ´ ° −1/2 ° ∗ max sup °ΣT (θ)mT,t (θ)° = op T 1/2 , 1≤t≤T

(θ)m∗T,t (θ) is

θ

a property held by finite variance processes that are stationary with weakly dependent maxima (Naveau 2003), or are merely identically distributed (Bonnal and Renault 2004: Lemma A.1; Kitamura et al 2004: Lemma D.2)15 . Remark 4: Jacobian non-degeneracy D5.i is standard. Smoothness D5.ii ensures if ||JT (θ)|| → ∞ due to heavy tails then the rate is proportional to ||JT || for θ "close" to θ0 . The assumption automatically holds for linear models. See Section 4. Remark 5: The D4 bounds, D5 Jacobia properties and D6 indicator class PTmoment are used to prove 1/T t=1 {m ˆ ∗T,t (θ) − m∗T,t (θ)} = op (1) uniformly on Θ, required for consistency. Uniformity under non-differentiability of Ii,T,t (θ) is expedited if {Ii,T (θ) : θ ∈ Θ} exhibits good metric entropy properties, where metric entropy with L2 -bracketing D6 delivers a required uniform CLT and uniform maximal inequality16 . We must exploit a

assume a Paretian tail P (|mi,t (θ)| > m)} = di (θ)m−κi (θ) (1 + o(1)) to simplify notation. It is straightforward to generalized D1.ii to P (|mi,t (θ)| > m)} = m−κi (θ) L(θ, m) for slowly varying L(·, m) and bounded L(θ, ·). See, e.g., Resnick (1987). 1 5 Technically D1.ii only requires the tails to be identically distributed. Relative stability aligns with a necessary and sufficient condition for the distribution limit of a sum of an iid array to be Gaussian (e.g. Kallenberg 2002: Theorem 5.15). 1 6 The brackets {l, u} of an index function class F satisfy l ≤ f ≤ u for every member f ∈ F, where {l, u} may not be members of F; an ε-L2 -bracket {l, u} satisfies ||l − u|| ≤ ε; the L2 -bracketing numbers N[ ] (ε, B, || · ||2 ) are the number of ε-L2 -brackets required to cover F, and metric entropy with L2 bracketing is H[ ] (ε, B, || · ||2 ) = ln(N[ ] (ε, B, || · ||2 )). See Giné and Zinn (1984), Pollard (1984), van der Vaart and Wellner (1996) and Dudley (1999). Since H[ ] (ε, Θ, || · ||2 ) = O(| ln(ε)|) under D6 implies U 1 1/2 0 H[ ] (ε, Θ, || · ||2 )dε < ∞, a required stochastic equicontinuity condition for weak convergence of a partial sum of IT,t (θ) applies (see Dudley’s 1978 landmark paper). 1 4 We

28

uniform maximal inequality, and the simplest set of sufficient conditions with the greatest payoff appears to be due to Doukhan et al (1995) under D3 and D6. Finally, the kernel class for the HAC estimator SˆT (θ). K1 (kernel ). k(·) is a member of class K, where K = {k : R → [−1, 1] | k(0) = 1, k(x) = k(−x) ∀x ∈ R, Z ∞ Z ∞ |k(x)|dx < ∞, | (ξ)|dξ < ∞, −∞

−∞

k(·) is continuous at 0 and all but a finite number of points}, R PT ∞ 2 and (ξ) = (2π)−1 −∞ k(x)eiξx dx < ∞. Further s,t=1 |k((s − t)/γ T )| = o(T ), PT max1≤s≤T t=1 k((s − t)/γ T ) = o(T ) and bandwidth γ T = o(T ).

Remark : Class K includes Bartlett, Parzen, Quadratic Spectral, Tukey-Hanning and other kernels. See Davidson and de Jong (2000) and the citations therein.

APPENDIX B: Proofs of Main Results The following proofs exploit threshold and moment properties Lemmas C.1-C.3 and limit theory Lemmas D.1-D.8. See Appendices C and D respectively. P P ˆ ∗T,t (θ), m∗T (θ) := 1/T Tt=1 m∗T,t (θ), Proof of Theorem 2.1. Define m ˆ ∗T (θ) := 1/T Tt=1 m and QT (θ) := E[m∗T,t (θ)]0 × ΥT × E[m∗T,t (θ)]. The following is similar to Pakes and Pollard’s (1989: p. 1039) argument. Use smoothness I3 and weight boundedness M1 to define (δ) := inf T ≥N inf ||θ−θ0 ||>δ {m−2 T × QT (θ)} > 0 for arbitrarily large N and tiny δ > 2 ˆ ˆ 0. Since P (||ˆθT − θ0 || > δ) ≤ P (m−2 T QT (θ T ) > (δ)) it suffices to show QT (θ T ) = op (mT ) p 0 ˆ to prove ||θT − θ || → 0. Now, the Lemma D.5 uniform criterion probability bound implies ¯ ¯ ³ ´ ˆ T (ˆθT ) + ¯¯Q ˆ T (ˆθT ) − QT (ˆθT )¯¯ ≤ Q ˆ T (ˆθT ) + m2T + QT (ˆθT ) × op (1) , QT (ˆθT ) ≤ Q

ˆ T (ˆ ˆ T (ˆθT ) ≤ Q ˆ T (θ0 ), while hence QT (θˆT )(1 − op (1)) ≤ Q θT ) + op (m2T ). By construction Q weight bound M1, the Lemma D.2.a asymptotic approximation and covariance bound ˆ T (θ0 ) is bounded: Lemma C.2.c dictate Q ³° ³ ´´2 ° ∗ 0 °2 ° ° ¡° ¢2 ˆ T (θ0 ) ≤ K °m Q ˆ T (θ )° ≤ K °m∗T (θ0 )° + op kST k1/2 /T 1/2 = K °m∗T (θ0 )° + op (1) . p

Finally, the Lemma D.3 law of large numbers states ||m∗T (θ0 ) − E[m∗T,t (θ0 )]|| → 0, and ||E[m∗T,t (θ0 )]|| = o(||ST ||1/2 /T 1/2 ) = o(1) by identification I2 and covariance bound Lemma C.2.c. Therefore ||m∗T (θ0 )|| = op (1) by Minkowski’s inequality which completes the proof. Proof of Theorem 2.2. Asymptotic linearity Lemma D.6 states 1/2

VT

T ³ ´ ¡ ¢ X ˆθT − θ0 = −V 1/2 H −1 J 0 ΥT 1 m ˆ ∗ (θ0 ) × (1 + op (1)) . T T T T t=1 T,t

Invoke asymptotic approximation Lemma D.2.a coupled with the construction of VT to deduce 1/2

VT

T ³ ´ ¡ ¢ X ˆθT − θ0 = −V 1/2 H −1 J 0 ΥT 1 m∗ (θ0 ) × (1 + op (1)) + op (1) . T T T T t=1 T,t

29

1/2

1/2

Now invoke central limit theorem Lemma D.7 and VT (T −1/2 HT−1 JT0 ΥT ST ) → Ir to conclude T ³ ´ X d 1/2 ˆ −1/2 −1/2 0 m∗T,t (θ0 ) × (1 + op (1)) + op (1) → N (0, Ir ) . VT θT − θ = −ST T t=1

Proof of Lemma 2.4. Define x ˆ∗T,t (λ) := λ0 m ˆ ∗T,t (˜θT ) and x∗T,t (λ) := λ0 m∗T,t (θ0 ) for any 0 conformable λ λ = 1, and write kT,s,t = k((s − t)/γ T ). Since it suffices to show PT ˆ∗T,s (λ)ˆ x∗T,t (λ) p s,t=1 kT,s,t x h i → 1 ∀λ0 λ = 1, PT ∗ ∗ s,t=1 E xT,s (λ)xT,t (λ)

ˆ ∗T,t (θ0 )||, we and x∗T,t (λ) is geometrically β-mixing under D3 with the same bounds as ||m need only consider the univariate case mT,t (θ0 ) ∈ R (e.g. Newey and West 1987). Thus, by Minkowski’s inequality we must show AT := ST−1

T o p n 1 X kT,s,t m ˆ ∗T,t (˜θT ) − m∗T,s (θ0 )m∗T,t (θ0 ) → 0 ˆ ∗T,s (˜θT )m T s,t=1

BT := ST−1

T 1 X p kT,s,t m∗T,s (θ0 )m∗T,t (θ0 ) → 1. T s,t=1

Cross-product expansion Lemma D.2.d with ||˜θT − θ0 ||2 = O(T −1/2 ||ST ||1/2 ||JT ||−1 ) p implies AT → 0. p In order to show BT → 1 we will apply Theorem 2.1 of Davidson and de Jong (2000) = DJ. It suffices to verify their Asssumptions 1-3. DJ’s Assumption 1 holds by K1. Assumptions 2 and 3 concern a Near Epoch Dependence property and relate the NED property to the bandwidth γ T . Both conditions are used soly to promote partial sum variance bounds. It helps to translate DJ’s environment into ours. In their setting m∗T,t (θ0 ) is assumed Lr -bounded for r > 2, or uniformly square integable, with a standardization XT,t −1/2 : = T −1/2 ΣT m∗T,t (θ0 ). Under geometric β-mixing D3 {m∗T,t (θ0 ), =t } forms a geometric L2 -mixingale with constants eT,t (cf. McLeish 1975: Theorem 2.1). Therefore {XT,t , =t } −1/2 forms a geometric L2 -mixingale with constants ET,t := T −1/2 ΣT eT,t . P PT 2 ≤ K by Their Assumption 2 is used only to ensure E( t=1 XT,t )2 ≤ K Tt=1 ET,t invoking McLeish’s (1975: Theorem 1.6) maximal inequality and by bounding eT,t . But PT 2 under Lemma C.2.a ||Σ−1 T ST || ≤ K hence E( t=1 XT,t ) ≤ K without any reference to P mixingale constants. A careful inspection of DJ’s proof of their Theorem 2.1 reveals E( Tt=1 XT,t )2 ≤ K suffices in place of their Assumption 2. 2 Finally, Asssumption 3 states γ T × max1≤t≤T {ET,t } = o(1) and is used, like Assumption 2, only to ensure partial sum bounds for L2 -mixingale functions of XT,t . See especially the proofs of their Lemmas A.3 and A.4. Lemma C.2.a, however, implies we can always side-step the use of mixingale coefficients in partial sum variance bounds for geometrically β-mixing data, in particular we can always replace eT,t with K||ΣT ||1/2 , hence ET,t −1/2 = T −1/2 ΣT eT,t with T −1/2 . Therefore γ T × T −1 = o(T /T ) under K1. This completes the proof. Proof of Lemma 2.5. Recall JT = JT (θ0 ) = (∂/∂θ)E[m∗T,t (θ)]|θ0 and write m ˆ ∗T (θ) = PT 1/T t=1 m ˆ ∗T,t (θ). We only prove JˆT∗ (˜θT ) = JT × (1 + op (1)) since JT∗ (˜θT ) = JT × (1 + op (1)) is similar. 30

Denote by ei ∈ Rr the unit vector (e.g. e2 = [0, 1, 0, ..., 0]0 ), define a sequence of bounded positive numbers {εT } that satisfies lim inf T ≥1 εT ||JT || > 0 and ||˜θT − θ0 ||/εT p → 0. This is always possible in lieu of the plug-in rate and Lemma C.2.c: ||˜θT − θ0 ||/εT = Op (T −1/2 ||ST ||1/2 ) = op (1). Define T ª 1 1 X© ∗ ∗ Jˇi,j,T (θ, εT ) := × ˆ ∗j,T,t (θ − ei εT ) . m ˆ j,T,t (θ + ei εT ) − m 2εT T t=1

Minkowski’s inequality implies for arbitrary θ ° ° ° ° ° ° ° ˆ∗ ˜ ° ° ° °JT (θT ) − JT ° ≤ °JˆT∗ (˜θT ) − JˇT∗ (θ, εT )° + °JˇT∗ (θ, εT ) − JT °

Apply asymptotic expansion Lemma D.1.b to deduce for some ˜θT,∗ ∈ {˜θT − ei εT , ˜θT + ei εT } ° ° ° ° ∗ JˆT∗ (˜θT ) = Jˇi,j,T (˜θT,∗ , εT ) + op (kJT k) , hence °JˆT∗ (˜θT ) − JˇT∗ (θ, εT )° = op (kJT k) .

Since ||˜θT,∗ − θ0 || ≤ ||˜θT − θ0 || = op (1) it remains to show ||JˇT∗ (˜θT , εT ) − JT || = p op (||JT ||) for any ||˜θT − θ0 || → 0. Define ° ° © ª U 0 (δ 1 , δ 2 ) := θ ∈ Θ : δ 1 ≤ °θ − θ0 ° ≤ δ 2 for 0 ≤ δ 1 ≤ δ 2 JT (δ 1 , δ 2 ) :=

sup

θ∈U 0 (δ1 ,δ2 )

½

kJT∗ (θ) − JT k kJT k

¾

Stochastic differentiability Lemma D.8 and the fact that U 0 (δ 1 , δ 2 ) ⊆ U 0 (0, δ 2 ), and p consistency ˜θT → θ0 imply for large K and any non-zero constant vector a ∈ Rr /0 °n ³ ´i £ ¡ ¢o n h ¡ ¢¤o° ° ° ˆ T (˜θT + aεT ) − m ˆ T θ0 − E m∗T,t ˜θT + aεT − E m∗T,t θ0 ° m ° °o ° n ° ° ≤ K 1 + kJT k × °˜θT + aεT − θ0 ° × op (1) × (JT (δ 1 , δ 2 ) + op (1))

° ° n o ° ° ≤ K 1 + kJT k × °˜θT − θ0 ° + kJT k × kaεT k × (JT (δ 1 , δ 2 ) + op (1)) = op (εT kJT k) + Op (εT kJT k × JT (δ 1 , δ 2 )) .

Similarly, by differentiability of E[m∗T,t (θ)], i ° h ° ° E m∗ (˜θ + aε ) − E £m∗ ¡θ0 ¢¤ ° T ° ° T,t T T,t ° − aJT ° ° ° ε T ° ° ° ³ ´ ³ ³ ´´° ° ˜θT + aεT − θ0 − aJT + op kJT k ε−1 ˜θT + εT − θ0 ° = °JT ε−1 ° T T ° ³ ´° ° ˜θT − θ0 ° = °JT ε−1 ° + op (kJT k) = op (kJT k) . T

Replace ˜θT + aεT with ˜θT − aεT to deduce the same bounds. Therefore ° ° ° ° ° °m ˆ ∗T (˜θT − εT ) ° ˇ∗ ˜ ° ° ° ˆ ∗T (˜θT + εT ) − m − JT ° = op (kJT k)+Op (kJT k × JT (δ 1 , δ 2 )) °JT (θT , εT ) − JT ° = ° ° ° 2εT 31

hence we have shown JˆT∗ (˜θT ) = JT (1 + op (1)) + Op (||JT || × JT (δ 1 , δ 2 )). Since 0 ≤ δ 1 < δ 2 are arbitrary, the proof is complete if we show for some sequence of positive numbers {δ 1,T }, δ 1,T → 0 and δ 2,T = 2δ 1,T : p

JT (δ 1,T , δ 2,T ) → 0. Define mT (δ 1 , δ 2 ) =

sup θ∈U 0 (δ 1 ,δ 2 )

° £ ∗ ¤° °E mT,t (θ) ° .

The required limit follows from expansion Lemma D.1.b, and a generalization of ULLN Lemma D.3 to U 0 (δ 1 , δ 2 ). For each θ ∈ U 0 (δ) we can always find a sequence {θT,δ } ∈ U 0 (δ 1 , δ 2 ), θT,δ 6= θ0 for each finite T ≥ N , such that ¤ £ ¡ ¢¤ £ ¡ ¢ E m∗T,t (θT,δ ) − E m∗T,t θ0 m∗T (θT,δ ) − m∗T θ0 mT (δ 1 , δ 2 ) ° ° ° ° ° = + op (1) × ° °θT,δ − θ0 ° °θT,δ − θ0 ° °θT,δ − θ0 ° ¢ ¡ θT,δ − θ0 m (δ , δ ) ∗ ° × (1 + op (1)) + op (1) × ° T 1 20 ° , = JT (θ) × ° °θT,δ − θ0 ° °θT,δ − θ ° where each op (1) term does not depend on θ. Moreover, by moment expansion Lemma C.3 £ ¤ £ ¡ ¢¤ ¢ ¡ E m∗T,t (θT,δ ) − E m∗T,t θ0 θT,δ − θ0 ° ° ° × (1 + o (1)) . = JT × ° °θT,δ − θ0 ° °θT,δ − θ0 ° Further, by construction ||θT,δ − θ0 || ≥ δ 2,T /2. Together it follows ½ ∗ µ ¾ ¶ kJT (θ) − JT k mT (δ 1 , δ 2 ) = op (1) + op . sup kJT k δ 2,T kJT k θ∈U 0 (δ) p

Therefore JT (δ 1,T , δ 2,T ) → 0 if mT (δ 1,T , δ 2,T )/[δ 2,T ||JT ||] = O(1). By the definition of a derivative, the construction U 0 (δ 1,T , δ 2,T ) ⊆ U 0 (0, δ 2,T ) = U 0 (δ 2,T ) and identification I2, mT (δ 1,T , δ 2,T ) ≤ Kδ 2,T sup kJT (θ)k × (1 + o (1)) θ∈U 0 (δ 2,T )

Now invoke Jacobian smoothness D5.ii to conclude mT (δ 1 , δ 2 ) Kδ 2,T kJT k (1 + o(1)) + o(1) ≤ = O(1). δ 2,T kJT k δ 2,T kJT k

Proof of Lemma 3.1. We treat the cases κ ≤ 2 and κ > 2 separately. Case 1 (max{κ , κ2 , ..., κr } < 2): First recall properties of regularly varying tails. Since t and stochastic xi,t are mutually independent with tails (7) and indices κ and κi , the convolutions t xi,t satisfy (Cline 1986) mi,t (θ0 ) =

t xi,t

∼ (7) with index κ

,i

:= min {κ , κi } ,

(9)

where x1,t = 1 implies (10) with κ ,i = κ . Therefore by the construction of ci,T and kT , and tail (7), (10) ci,T = K (T /kT )1/κ ,i .

32

Finally, processes zt with regularly varying tails (7) and index κz ∈ (0, 2] satisfy (Feller 1971: Theorem 1 in IX.8): £ ¤ (11) κz < 2 : E zt2 I (|zt | ≤ c) ∼ Kc2 P (|zt | > c) as c → ∞. Use (6), and Γi,i,T > 0 and Ji,i,T > 0 for all T ≥ N and some N ∈ N by D5.i and M2, to express Tθi as

Tθi = KT 1/2

Ji,i,T Γi,i,T

Step 1 (ΣT , ΓT ):



⎤1/2 n o ⎞2 −1 J Γ max j6=i j,j,T j,i,T ⎢ ⎠ × (1 + O (1))⎥ × ⎣K + K ⎝ ⎦ . −1 Γi,i,T Ji,i,T ⎛

Properties (10) and (11) imply

¯ £ ¤ ¡¯ ¢ kT 2/κ Σi,i,T = E m2i.T,t (θ0 ) ∼ c2i,T P ¯mi,t (θ0 )¯ > ci,T ∼ c2i,T = K (T /kT ) T

Hence Γi,i,T = (T /kT )1/κ

(12)

,i −1/2

,i −1

.

.

Step 2 (JT ): By Lemma C.4 Ji,j,T = −E[xi,t xj,t I(| t xj,t | ≤ cj,T )] × (1 + o(1)). Assume initially all regressors are stochastic. Since κi , κ ≤ 2 it follows κi < κ + 2. Thus, by independence, (10) and (11) µ ¶# Z " 2 ¤ £ 2 ci,T ci,T P |xi,t | > E xi,t I (| t xi,t | ≤ ci,T ) = K f (d ) 2 || " µ ¶ # Z c2i,T ci,T −κi ∼ K E f (d ) 2 || h i κi −2 i i = Kc2−κ = (T /kT )(2−κi )/κ ,i . = Kc2−κ i,T E | t | i,T

We bound the cross-products E[xi,t xj,t I(| t xj,t | ≤ cj,T )] by case. If xi,t and xj,t are independent then E[xi,t xj,t I(| t xi,t | ≤ ci,T )] ∼ K given supt∈Z E|xi,t | < ∞ ∀i. If they are perfectly positively dependent then since xi,t , xj,t ∼ (7) with indices κi , κj ∈ (1, 2) it can only be the case that xi,t = sign(xj,t ) × |xj,t |p where p = κj /κi . But this implies κj − p − 1 < κ hence " Ã µ ¶(p+1)/2 !# Z cj,T p+1 (p+1)/2 E [xi,t xj,t I (| t xj,t | ≤ cj,T )] = E |xj,t | I |xj,t | ≤ f (d ) | | ¶p+1 µ ¶# Z "µ cj,T cj,T = K P |xj,t | > f (d ) || | | "µ ¶p+1 µ ¶−κj # Z cj,T cj,T ∼ K E f (d ) || || Z i h p+1−κj = Kcj,T E | |κj −p−1 f (d ) p+1−κj

= Kcj,T

= K (T /kT )(κj /κi +1−κj )/κ ,j .

−1

The perfect negative dependence case is similar. Hence Ji,j,T = O((T /kT )κ ,j (κj /κi +1−κj ) ). Finally, use x1,t = 1 to deduce E[x21,t I(| t xi,t | ≤ ci,T )] ∼ 1 − kT /T , E[xi,t x1,t I(| t x1,t | ≤ ci,T )] = E[xj,t I(| t | ≤ ci,T )] = O(1) and E[x1,t xi,t I(| t xi,t | ≤ ci,T )] = E(E[xi,t I(|xi,t | 33

≤ ci,T /| t |]| t ) = O(1). Therefore J1,1,T = −1 + o(1) and Ji,1,T , J1,i,T = O(1) × (1 + o(1)). Step 3 (Tθi ): Consider the slope rates, the intercept rate being similar. The claim follows from (12) by noting 1/2−1/κ ,i (T /kT )(2−κi )/κ ,i = (T /kT )1/2+(1−κi )/κ Γ−1 i,i,T Ji,i,T = (T /kT )

,i

and µ n o n 1/2−1/κ max Γ−1 J = O max (T /kT ) j,j,T j,i,T j6=i

j ∈{1,i} /

³ = O (T /kT )1/2+1/κ

,j

× (T /kT )(κi /κj +1−κi )/κ

{1/κ ,j +(1−1/κj )κi /κ ,i } ,i −minj ∈{1,i} /

,i

´



£ ¤ Case 2 (κ > 2): In this case Σ1,1,T = E m21,T,t (θ0 ) < ∞ hence Γ1,1,T = 1; if any κi > 2 then Γi,i,T = 1; and if κi < 2 then arguments under Case 1 imply Γi,i,T = (T /kT )1/κi −1/2 . The intercept Jacobia J1,i,T and Ji,1,T are characterized by Case 1 since κi > 1. If both κi , κj < 2 then Ji,j,T follow from Case 1. If κi , κj > 2 then Ji,j,T = −K. If κi > 2 > κj then by Case 1 and κ ,j = κj it follows Jj,j,T = K(T /kT )2/κj −1 , Ji,j,T = −1 O((T /kT )1/κi +1/κj −1 ), and Jj,i,T = O((T /kT )κ ,i (κj /κi +1−κi ) ). The rate can be deduced from (12) by noting if κi > 2 then ¯ n o¯¯ n o ³ ´ ¯ −1 −1 1/κj −1/2 ¯ Γi,i,T Ji,i,T = 1×K, ¯ max J ) Γj,j,T Jj,i,T ¯¯ = K, max Γ−1 = O (T /k j,i,T T j,j,T j ∈{1,i}:κ / j >2

j ∈{1,i}:κ / j ≤2

{(T /kT )2/κj −1 })]1/2 . hence Tθi ∼ T 1/2 × [K + O(maxj ∈{1,i}:κ / j ≤2 1/κi −1/2 and The rate under κi < 2 follows similarly since Γ−1 i,i,T Ji,i,T = ( T /kT ) −1

{Γ−1 {(T /kT )1/2−1/κj +κ ,i (κi /κj +1−κi ) }). maxj ∈{1,i}:κ / / j ≤2 j ≤2 j,j,T Jj,i,T } = O(maxj ∈{1,i}:κ Proof of Lemma 3.5. See Hill and Renault (2010).

APPENDIX C : Threshold and Moment Properties LEMMA C.1 (threshold bound) Under power-law tail decay D1 supθ {cT (θ)/||ΣT (θ)||1/2 } = o(T 1/2 ). LEMMA C.2 (covariance bounds) Let D1, D3, and M2 hold. −1 a. supθ ||Σ−1 T (θ)ST (θ)|| ≤ K and supθ ||ST (θ)ΣT (θ)|| ≤ K.

b. ΣT = o(T ), ΣT (θ) = o(T ||E[m∗T,t (θ)]||2 ) and supθ ||ΣT (θ)|| = o(T supθ || E[m∗T,t (θ)]||2 ). c. ST = o(T ). LEMMA C.3 (moment expansion) Under D5.i E[m∗T,t (θ)] − E[m∗T,t (˜θ)] = JT (˜θ)(θ − ˜θ) + o(||JT (˜θ)|| × ||θ − ˜θ||) for any θ, ˜θ ∈ Θ. ∗ LEMMA C.4 (Jacobian approximation) Under D1-D6 JT = E[JT,t ] × (1 + o(1)).

34

Proof of Lemma C.1. Use the arguments from the proof of Lemma C.2 under powerlaw decay D1.ii to deduce ⎧ ⎫ ( ) ⎪ ⎪ ⎨ ⎬ max1≤i≤q {ci,T (θ)} max1≤i≤q {ci,T (θ)} sup ≤ K × sup ´1/2 ⎪ ⎪ ³Pq θ θ ⎩ kΣT (θ)k1/2 ⎭ c2 (θ)(k /T ) i,T

i=1 i,T

= O

Ã

T 1/2

1/2

min1≤i≤q {ki,T }

!

³ ´ = o T 1/2 .

Proof of Lemma C.2. Claim (a): Apply variance bound Lemma E.1 in Hill and Renault (2010) to deP P 2 2 duce E[( Tt=1 zT,t (θ, r)] ≤ K Tt=1 E[zT,t (θ, r)] = K. An identical argument reveals P PT T 2 2 supθ E[( t=1 zT,t (θ, r)] ≤ K supθ t=1 E[zT,t (θ, r)] = K, hence supθ ||Σ−1 T (θ)ST (θ)|| ≤ −1 K. The remaining claim supθ ||ST (θ)ΣT (θ)|| ≤ K follows from the construction of ST (θ) and non-degeneracy M2. Claim (b): If ||ΣT || < ∞ the claim is trivial, so assume at least one E[m2i,t (θ0 )] = ∞, and assume without loss of generality mi,t (θ) is symmetrically trimmed with two-tailed thresholds ci,T (θ) and fractiles ki,T : (T /ki,T )P (|mi,t (θ)| > ci,T (θ)) = 1. Power-law tail 0 0 D1.ii implies ci,T = d(θ0 )1/κi (θ ) (T /ki,T )1/κi (θ ) for some κi (θ0 ) ∈ (1, 2]. Coupled with properties of trimmed variances for regularly varying tails (Feller 1971: Theorem 1 in IX.8) if κi (θ0 ) ∈ (1, 2) then h¡ ¯ ¢2 i ¡¯ ¢ 0 E m∗i,T,t (θ0 ) ∼ Kc2i,T P ¯mi,t (θ0 )¯ > ci,T ∼ Kc2i,T (ki,T /T ) = K(T /ki,T )2/κi (θ )−1 0

It is easy to show (T /ki,T )2/κi (θ )−1 = o(T ) for all κi (θ0 ) ≥ 1. Similarly if κi (θ0 ) = 2 then E[(m∗i,T,t (θ0 ))2 ] ∼ L(T ) → ∞ a slowly varying function which is trivially o(T ). Now invoke the Cauchy-Schwartz inequality to deduce ΣT = o(T ). The uniform case is identical. If supθ ||ΣT (θ)|| < ∞ the claim is trivial. Otherwise under D1.ii at least one equation tail index inf θ κi (θ) = inf θ∈Θ2,i κi (θ) < 2 by the definition of subset Θ2,i ⊆ Θ Therefore supθ E[m∗i,T,t (θ)2 ] ∼ K supθ∈Θ2,i c2i,T (θ)(ki,T /T ) = K(T /ki,T )2/ inf θ∈Θ2,i κi (θ)−1 = o(T ) if inf θ∈Θ2,i κi (θ) ≥ 1. However, if inf θ∈Θ2,i κi (θ) < 1 then supθ∈Θ2,i |E[m∗i,T,t (θ)]| ∼ supθ∈Θ2,i ci,T (θ)(ki,T /T ) = K(T /ki,T )1/ inf θ∈Θ2,i κi (θ)−1 . Therefore h¡ ¢2 i supθ∈Θ2,i E m∗i,T,t (θ) ¯ h i¯2 ∼ K(T /ki,T ) = o(T ), ¯ ¯ supθ∈Θ2,i ¯E m∗i,T,t (θ) ¯ which proves (b). Claim (c):

The final claim follows from (a) and (b).

Proof of Lemma C.3. Apply Jacobian existence D5.i and the definition of a derivative. Proof of Lemma C.4. Lemma D.1.b implies for rT → 0 arbitrarily fast, some ||θ∗ − θ0 || ≤ ||θ − θ0 || and tiny ι > 0 ⎛ £ ¤ £ ¡ ¢¤ °1/ι ⎞ ° ¢ ¡ ¤ £ ∗ E m∗T,t (θ) − E m∗T,t θ0 θ − θ0 rT × °θ − θ0 ° ⎠. ° ° ° +o⎝ ° ° = E JT,t (θ∗ ) × ° °θ − θ0 ° °θ − θ0 ° °θ − θ0 ° 35

Further, moment expansion Lemma C.3 asserts ¤ £ ¡ ¢¤ £ ¢ ¡ E m∗T,t (θ) − E m∗T,t θ0 θ − θ0 ° ° ° ° × (1 + o (1)) . × = J T °θ − θ 0 ° °θ − θ0 °

Equate the right-hand-side of each equation and take ||θ∗ − θ0 || ≤ ||θ − θ0 || → 0 to prove the claim.

APPENDIX D: Limit Theory for Tail-Trimmed Sums This appendix contains limit theory for tail-trimmed arrays. Since mt (θ) are differentiable under D2, if some mi.t (θ) has a finite variance and is untrimmed such that mT,i.t (θ) = mi.t (θ) and m ˆ T,i.t (θ) = m ˆ i.t (θ), the following claims applied to mT,i.t (θ) can be proven from existing arguments. We therefore assume all equations are trimmed for clarity: q = q. The first two results characterize expansions and rate of approximations for the trimmed equations. Define m ˆ ∗T (θ) :=

T T 1X ∗ 1X ∗ m ˆ T,t (θ) and m∗T (θ) := m (θ). T t=1 T t=1 T,t

ˆ ∗T (θ)}, JˇT∗ (θ) LEMMA D.1 (expansions) Assume D1-D6 hold. Let m ˇ ∗T (θ) ∈ {m∗T (θ), m ∈ {JT∗ (θ), JˆT∗ (θ)} and IˇT,t (θ) ∈ {IT,t (θ), IˆT,t (θ)}, choose θ, ˜θ ∈ Θ and let {rT } be a sequence of strictly positive numbers where rT → 0 arbitrarily fast. In the following o(1) and op (1) are not functions of t ∈ Z or θ ∈ Θ. For some sequence {θT,∗ } satisfying ||θT,∗ − ˜θ|| ≤ ||θ − ˜θ|| that may be different in different places, and for infinitesssimel ι > 0: n o PT a. 1/T t=1 mt (θ) IˇT,t (θ) − IˇT,t (˜θ) = op (rT × ||θ − ˜θ||1/ι ) 1/T b.

PT

t=1

n o Jt (θ) IˇT,t (θ) − IˇT,t (˜θ) = op (rT × ||θ − ˜θ||1/ι )

ˇ ∗T (˜θ) + JˇT∗ (θT,∗ )(θ − ˜θ) + op (rT × ||θ − ˜θ||1/ι ) m ˇ ∗T (θ) = m

LEMMA D.2 (approximation) Under D1-D4, and D6: °P ³ ´ ª° ° T © ∗ ° a. ˆ T,t (θ) − m∗T,t (θ) ° = op T 1/2 kST (θ)k1/2 for any θ ∈ Θ ° t=1 m b.

°o n° ° °¢ ¡ P ° ° supθ °1/T Tt=1 {m ˆ ∗T,t (θ) − m∗T,t (θ)}° = op supθ °E[m∗T,t (θ)]° .

Recall the kernel function kT,s,t under K1. If additionally I2 and K1 hold then °¤ª ° £ © ∗ ˆ T (θ) − m∗T (θ)k / 1 + kJT k × °θ − θ0 ° = op (1) c. supθ∈U 0 (δT ) km for any positive sequence {δ T }, δ T → 0 d.

° n o° ° −1 −1 PT ° ˆ ∗T,t (˜θT )0 − m∗T,s (θ0 )m∗T,t (θ0 )0 ° = op (1) ˆ ∗T,s (˜θT )m °ST T s,t=1 kT,s,t m ³ ´ 1/2 −1 for any ˜θT = θ0 + Op T −1/2 kST k kJT k

ˆ T (θ). Next, uniform laws and bounds for m∗T,t (θ), Ii,T,t (θ) and Q 36

P LEMMA D.3 (LLN and ULLN) Under D1-D4 and I2 1/T Tt=1 m∗T,t (θ0 ) = op (1). If £ ¤ PT additionally I3 holds supθ {||1/T t=1 (m∗T,t (θ) − E m∗T,t (θ) )||} = op (supθ E[||m∗T,t (θ)||]).

LEMMA D.4 (uniform indicator laws) Let D1-D4 and D6 hold. Define I∗T,t (θ) := ((T /kT )1/2 ){Ii,T,t (θ) − E[Ii,T,t (θ)]} for any i, and let {I(θ) : θ ∈ Θ} be a Gaussian process with a version17 that has uniformly bounded and continuous samPuniformly T ∗ ple paths with respect to the L2 -norm. Then {T −1/2 t=1 IT,t (θ) : θ ∈ Θ} =⇒∗ PT ∗ {I(θ) : θ ∈ Θ} and E[(supθ {|T −1/2 t=1 IT,t (θ)|})2 ] = O(1), and =⇒∗ denotes weak convergence in the sense of Hoffmann-Jørgensen (1984).

Remark : Hoffmann-Jørgensen (1984) defines weak convergence for some index function space Z as {IT (ξ) : ξ ∈ Z} =⇒∗ {I (ξ) : ξ ∈ Z} if and only if limT →∞ E ∗ [g(IT (ξ))] = E ∗ [g(I(ξ))] for all uniformly bounded g. See Dudley (1978), Pollard (1984), and van der Vaart and Wellner (1994)18 . LEMMA D.5 (uniform criterion bound) Under D1-D4 and D6 ¯ ¯⎫ ⎧ ¯ˆ ¯⎬ ⎨ m−2 T × ¯QT (θ) − QT (θ)¯ = op (1) . sup ⎭ 1 + m−2 θ ⎩ T × QT (θ)

Asymptotic linearity of ˆθT and asymptotic normality of the tail-trimmed equations follow. LEMMA D.6 (asymptotic linearity) Under D1-D6, I1-I3 and M1-M2 1/2

VT

T ³ ´ X ˆθT − θ0 = AT m ˆ ∗T,t (θ0 ) × (1 + op (1)) + op (1) a.s. t=1

1/2

where AT = −VT

(HT−1 JT0 ΥT )T −1 ∈ Rr×q . −1/2

LEMMA D.7 (CLT) Under D1 and D3 T −1/2 ST

PT

t=1

d

m∗T,t (θ0 ) → N (0, 1).

Stochastic differentiability aids proving Jacobian estimator consistency. LEMMA D.8 (stochastic differentiability) Under D1-D6 and M2 for any δ ≥ 0 ( °© ª © £ ¤ £ ¤ª° ) ° m ˆ ∗T (θ0 ) − E m∗T,t (θ) − E m∗T,t (θ0 ) ° ˆ ∗T (θ) − m ° ° sup 1 + kJT k × °θ − θ0 ° θ∈U 0 (δ) ½ ∗ ¾ kJT (θ) − JT k = sup + op (1). kJT k θ∈U 0 (δ) 1 7 Two

random variables are versions of each other if they have the same finite dimensional distributions. a landmark paper Dudley (1978) shows it suffices to prove convergence in finite dimensional U 1/2 distributions and the metric entropy with bracketing bound 01 H[ ] (ε, Z, ρ)dε < ∞, where ρ is the metric under which the brackets are defined. This justifies D6. See also Ossiander (1987) and Doukhan et al (1995). 1 8 In

37

Proof of Lemma D.1. Assume θ and mt (θ) are scalars and mt (θ) is symmetrically trimmed to simplify notation. We only expand m∗T (θ) since m ˆ ∗T (θ) is similar. Write m∗T,t (θ) = mt (θ) × IT,t (θ) where IT,t (θ) = I(|mt (θ)| ≤ cT (θ)), and choose ||θ − ˜θ|| ≤ δ for any δ > 0. Use differentiability D2 to deduce by Taylor’s theorem n o m∗T,t (θ) = mt (˜θ) + Jt (θT,δ )(θ − ˜θ) × IT,t (θ) where ||θT,δ − ˜θ|| ≤ ||θ − ˜θ||, and Jt (θ) := (∂/∂θ)mt (θ). Therefore m∗T (θ) − m∗T (˜θ) = JT∗ (θT,δ ) × (θ − ˜θ) + +

T n o 1X mt (θ) × IT,t (θ) − IT,t (˜θ) T t=1

(13)

T 1X Jt (θT,δ ) × {IT,t (θ) − IT,t (θT,δ )} × (θ − ˜θ). T t=1

We will show the second and third terms are op (rT × ||θ − ˜θ||1/ι ), proving both claims (a) and (b). Consider the second term in (13) and use IT,t (θ) − IT,t (˜θ) ∈ {−1, 0, 1} to bound ¯ ¯ T T ¯1 X n o¯ n o¯ 1 X ¯¯ ¯ ¯ ¯ ˜ mt (θ) IT,t (θ) − IT,t (˜θ) ¯ ≤ (θ) I (θ) − I ( θ) ¯m ¯ ¯ t T,t T,t 1/2 ¯T ¯ T t=1 t=1 ×

1

T 1/2

T ¯ ¯ X ¯ ¯ θ). ¯IT,t (θ) − IT,t (˜θ)¯ = AT (θ, ˜θ) × BT (θ, ˜ t=1

The threshold construction (4), IT,t (θ) ∈ {0, 1} and triangle inequality imply for any p > 0 ¯ ¯p ¯ ¯ sup E ¯IT,t (θ) − IT,t (˜θ)¯ = O (kT /T ) θ,˜ θ∈Θ

where O(·) is not a function of θ. Combined with D1.i continuity and boundedness of the finite dimensional distributions of mt (θ) and the mean-value-theorem, it follows E|IT,t (θ) − IT,t (˜θ)|p = O((kT /T )) × ||θ − ˜θ||. Now invoke stationarity D1.i, envelope bound D4 and the Cauchy-Schwartz inequality to deduce for tiny ι > 0 °1/ι n o¯ι i1/ι ³ i´1/ι h ¯ ´ ° ³ h ¯ ¯ ° ° ≤ T 1/2 E ¯mt (θ) IT,t (θ) − IT,t (˜θ) ¯ = O T 1/2 (kT /T )1/ι ×°θ − ˜θ° . E AT (θ, ˜θ)ι

Since ι > 0 can be chosen arbitrarily small and kT /T → 0 by tail trimming, invoke Markov’s inequality to conclude for some rT → 0 arbitrarily fast and op (·) not a function of θ µ °1/ι ¶ ° °1/ι ´1/ι ° ³ ° ° ° ° 1/2 1/2 ˜ ˜ AT (θ, θ) = op T = op (rT ) × °θ − ˜θ° . k /T °θ − θ° T

˜ ∈ {0, 1} we have Since E|BT (θ, ˜θ)| ≤ T 1/2 follows trivially from |IT,t (θ) − IT,t (θ)| shown for some rT → 0 arbitrarily fast ¯ ¯ T ° °1/ι ¯1 X n o¯ ¯ ° ° ¯ ˜ mt (θ) IT,t (θ) − IT,t (θ) ¯ ≤ AT (θ, ˜θ) × BT (θ, ˜θ) = op (rT ) × °θ − ˜θ° . ¯ ¯ ¯T t=1 Repeat the argument for the third term in (13) by invoking envelope bound D4 for Jt (θ). 38

The proof of approximation Lemma D.2 requires consistency of the intermediate order (·) (a) statistics mi,(kj,i,T ) (θ). Simplify notation by considering the two-tailed equations mi,t (θ) := |mi,t (θ)|, and two-tailed fractiles and thresholds that satisfy (T /ki,T )P (|mi,t (θ)| > (a) ci,T (θ)) = 1. The order statistic mi,(ki,T ) (θ) therefore estimates ci,T (θ). (a)

LEMMA D.2.1 (uniform order statistic law) Under D1-D4 and D6 supθ |mi,(kT ) (θ)/ci,T (θ) −1/2

− 1| = Op (ki,T ).

Proof. The claim follows from the Lemma D.4 ULCT for tail indicators, and an argument linking tail indicator limit behavior to order statistics (e.g. Hsing 1991: p. 1553). See Hill and Renault (2010: Lemma D.2.1). Proof of Lemma D.2. Assume θ and mt (θ) are scalars and mt (θ) is symmetrically trimmed for notational convenience, and write I¯T,t (θ) := 1 − IT,t (θ). Assume θ and mt (θ) are scalars and mt (θ) is symmetrically trimmed for notational convenience, and write I¯T,t (θ) := 1 − IT,t (θ). Claim (a): Let θ ∈ Θ be arbitrary, and write mt = mt (θ), cT = cT (θ), m ˆ ∗T,t = m ˆ ∗T,t (θ), m∗T,t = m∗T,t (θ), I¯T,t = 1 − IT,t (θ), IˆT,t = IˆT,t (θ), and ST := ST (θ). First bound ° T ° T ° °X © ° ° n° n o°o X ª ° °ˆ ° ° ° ° m ˆ ∗T,t − m∗T,t ° ≤ max °mt IˆT,t − IT,t ° × ° °IT,t − IT,t ° . ° ° 1≤t≤T t=1

t=1

(a) (a) −1/2 By construction ||mt {IˆT,t − IT,t }|| ≤ 2||m(kT ) − cT ||, where m(kT ) /cT = 1 + Op (kT ) given Lemma D.2.1. Now use threshold bound Lemma C.1 and covariance relation Lemma C.2.a to deduce ° ° ° ° n° n o°o ³ ´ ° ° (a) ° ° ° ° (a) max °mt IˆT,t − IT,t ° ≤ 2 °m(kT ) − cT ° = 2cT °m(kT ) /cT − 1° = op kST k1/2 (T /kT )1/2 .

1≤t≤T

Next, by construction and the triangle inequality ° ° ° µ ¶° T T ° ° ° X X ° £ © ¤ª° T £¯ ¤ °ˆ ° ° 1/2 ° 1 1/2 ° 1/2 ° k E I I¯T,t − E I¯T,t ° + kT ° − 1 °IT,t − IT,t ° ≤ kT ° 1/2 T,t T ° ° °k ° kT t=1 t=1 T 1/2

which is Op (kT ) by the threshold construction (4) and an application of Lemma D.4.b. P 1/2 Therefore Tt=1 {m ˆ ∗T,t −m∗T,t } = op (||ST ||1/2 (T /kT )1/2 kT ) = op (||ST ||1/2 T 1/2 ). Claim (b):

Define

MT∗

½ ° °¾ ° ° ˆ := max sup °mt (θ) {IT,t (θ) − IT,t (θ)}° . 1≤t≤T

θ

and repeat the above argument to reach ° ° ° ° T T 1/2 °1 X ° 1 X ª° £ ¤ª° © ∗ © kT ° ° ° ° ∗ ∗ sup ° m ˆ T,t (θ) − mT,t (θ) ° ≤ MT × I¯T,t (θ) − E I¯T,t (θ) ° sup ° 1/2 ° ° T θ ° T t=1 θ °k t=1 T ° ¶° µ 1/2 ° 1/2 T £ ° ¤ k + MT∗ × T sup ° kT E I¯T,t (θ) − 1 ° ° °. T kT θ 39

Uniform indicator law Lemma D.4.b and threshold construction (4) imply the right-hand1/2 side is Op (MT∗ kT /T ). 1/2 We need only prove MT∗ = op (supθ ||E[m∗T,t (θ)]||T /kT ) to complete the proof. Since ¯ ¯ ¯ n o¯ ¯ ¯ (a) ¯ ¯ ¯mt (θ) IˆT,t (θ) − IT,t (θ) ¯ ≤ 2cT (θ) ¯m(kT ) (θ)/cT (θ) − 1¯ ,

use uniform law Lemma D.2.1, threshold bound Lemma C.1, and covariance bound Lemma C.2.b to deduce µ ¶ ¯ ¯ ¯ (a) ¯ 1/2 1/2 1/2 ∗ MT ≤ K sup cT (θ) sup ¯m(kT ) (θ)/cT (θ) − 1¯ ≤ o sup kΣT (θ)k T /kT θ

θ

θ

µ ¶ ° £ ∗ ¤° 1/2 ° ° = o sup E mT,t (θ) T /kT . θ

Claim (c): The claim follows (b) and Jacobian smoothness supθ∈U 0 (δT ) ||JT (θ)||/||JT || = O(1) under D5.ii, since by the definition of a derivative and identification I2 ( ° £ ( ¤° ) ° °) °E m∗ (θ) ° kJT (θ)k × °θ − θ0 ° T,t ° ≤ sup ° ≤ K. ° ° sup 1 + kJT k × °θ − θ0 ° 1 + kJT k × °θ − θ0 ° θ∈U 0 (δT ) θ∈U 0 (δ T )

ˆ ∗T,t Claim (d): Write mt = mt (θ0 ), IˆT,t = IˆT,t (θ0 ), IT,t = IT,t (θ0 ), I¯T,t := 1 − IT,t , m P = mt IˆT,t , and m∗T,t = mt IT,t . We prove ||T −1 ST−1 Ts,t=1 kT,s,t {m ˆ ∗T,s m ˆ ∗T,t − m∗T,s m∗T,t }|| P T ˆ ∗T,s (˜θT )m ˆ ∗T,t (˜θT ) − m ˆ ∗T,s m ˆ ∗T,t }|| = op (1) in two steps. = op (1) and ||T −1 ST−1 s,t=1 kT,s,t {m The claim then follows by the triangle inequality. Step 1: Observe ° ° T ° °1 X © ª ° ° kT,s,t m ˆ ∗T,t − m∗T,s m∗T,t ° ˆ ∗T,s m ° ST−1 ° °T s,t=1 ° ° T ° °1 ³ ´ X ° ° kT,s,t ms IˆT,s − IT,s m∗T,t ° ≤ 2 ° ST−1 ° °T s,t=1 ° ° T °1 ³ ´ ³ ´° X ° ° kT,s,t ms IˆT,s − IT,s mt IˆT,t − IT,t ° + ° ST−1 ° °T s,t=1 = A1,T + A2,T .

We only bound A1,T since A2,T is similar. Define for any δ > 0

© 2 −2 ª 1 ηδ (x) := ¡ ¢1/2 exp −x δ /2 and η δ,T,j := ηδ (j/γ T ) 2 2δ π ! Ã T −t 2T ³ ´ X 1 1 X −1/2 ∗ ∗ k (l/γ T ) 1/2 ST mt+l IˆT,t+l − IT,t+l I (0 ≤ l ≤ [γ T /δ]) A1,T,δ := 1/2 T γ t=−T +1 l=1−t T ⎞ ⎛ T −t X 1 1 −1/2 ηδ,T,j 1/2 ST m∗T,t+j I (0 ≤ j ≤ [γ T /δ])⎠ × (1 + op (1)) . × ⎝ 1/2 T γ j=1−t T

By CLT Lemma D.7

° ° T ° ° 1 X ° −1/2 ∗ ° mT,t ° = O(1). ° 1/2 ST ° °T t=1

40

2

(14)

Similarly, approximation Lemma D.2.a coupled with CLT Lemma D.7 and the Helly-Bray theorem imply ° ° T ° 1 ³ ´° X ° ° −1/2 ∗ ∗ ∗ mT,t IˆT,t − IT,t ° = O(1). (15) ° 1/2 ST °T ° t=1

2

Now imitate Davidson and de Jong’s (2000: Lemmas A.2-A.3) arguments to deduce19 lim lim sup kA1,T − A1,T,δ × (1 + op (1))k1 = 0.

(16)

δ→0 T →∞

Next, consider the components of A1,T,δ . It is straightforward to generalize approximation Lemma D.2.a to a weighted version with k(t/γ T ) under K1. Specifically, define NT (δ) := min{T, [γ T /δ] + 1} use stationarity to deduce for any δ ° ° T −t ° 1 ° X © ª 1 ° ° −1/2 T 1/2 max k (l/γ T ) m ˆ ∗T,t+l − m∗T,t+l I (0 ≤ l ≤ [γ T /δ])° ° 1/2 1/2 ST −T +1≤t≤2T ° γ ° T l=1−t T

° °1/2 ° ° ° NT (δ) ° ° ° N 1/2 (δ) X ° ° © ª 1 ° ° −1/2 ∗ ∗ ° ≤ ° T 1/2 SNT (δ) ST−1 ° ° k (t/γ ) m ˆ − m S T,t T,t ° T N (δ) ° 1/2 T ° ° N (δ) ° γ ° t=1 T T

2

→ 0 as n → ∞,

since ||SNT (δ) ST−1 || = O(1) and NT (δ)/γ T = O(1) by construction and covariance nondegeneracy M2. Similarly, by a straightforward generalization of Lemma D.7 for any δ ° ° ° ° T −t X ° ° 1 −1/2 ∗ ° 1 ° max η S m I (0 ≤ j ≤ [γ /δ]) T 1/2 δ,T,j 1/2 T T T,t+j ° ° 1/2 −T +1≤t≤2T ° γ T ° j=1−t T 2 ° ° °1/2 ° ° ° NT (δ) ° ° ° N 1/2 (δ) X ° 1 ° ° −1/2 ∗ ° ≤ ° T 1/2 SNT (δ) ST−1 ° ° η m S δ,T,j T,t [γ /δ] ° ° + o (1) 1/2 ° ° N (δ) T ° γ ° T

T

t=1

2

→ 0 as n → ∞.

Therefore lim lim sup kA1,T,δ k1 = 0.

δ→0 T →∞

(17)

Combine (17) and (18) to conclude A1,T = op (1).

Step 2: Note ° ° T °1 n o° X ° ° ˆ ∗T,s (˜θT )m kT,s,t m ˆ ∗T,t (˜θT ) − m ˆ ∗T,s m ˆ ∗T,t ° ° ST−1 ° °T s,t=1 ° ° T ° °1 n o X ° ° kT,s,t m ˆ ∗T,s m ˆ ∗T,s (˜θT ) − m ˆ ∗T,t ° ≤ 2 ° ST−1 ° °T s,t=1 ° ° T °1 n on o° X ° ° kT,s,t m ˆ ∗T,s ˆ ∗T,t ° . ˆ ∗T,s (˜θT ) − m m ˆ ∗T,t (˜θT ) − m + ° ST−1 ° °T s,t=1

S −1/2 ∗ 2 XT,t := T −1/2 ST mT,t . Davidson and de Jong (2000: p. 414) invoke E( T t=1 XT,t ) = O(1) under their Lemma A.1, which holds by a mixingale property and McLeish’s (1975: Theorem 1.6) S 2 maximal inequality. Close inspection of their proofs show E( T t=1 XT,t ) = O(1) and kernel property K1 need only hold, without reference to McLeish (1975). Thus (15) and (16) suffice. 1 9 Define

41

2

We bound the first term; the second is similar. Use the Taylor expansion argument in the proof of expansion Lemma D.1 to deduce for some $\|\theta_{T,*} - \theta^0\| \le \|\tilde\theta_T - \theta^0\|$
$$\left\|S_T^{-1}\frac{1}{T}\sum_{s,t=1}^{T}k_{T,s,t}\,\hat m^*_{T,s}\left\{\hat m^*_{T,t}(\tilde\theta_T) - \hat m^*_{T,t}\right\}\right\|$$
$$\le \left\|S_T^{-1}\frac{1}{T}\sum_{s,t=1}^{T}k_{T,s,t}\,\hat J_{T,s}(\theta_{T,*})\,\hat m^*_{T,t}\right\|\times\left\|\tilde\theta_T - \theta^0\right\|$$
$$+ \left\|S_T^{-1}\frac{1}{T}\sum_{s,t=1}^{T}k_{T,s,t}\,J_s(\theta_{T,*})\left\{\hat I_{T,s}(\theta_{T,*}) - \hat I_{T,s}\left(\theta^0\right)\right\}\hat m^*_{T,t}\right\|\times\left\|\tilde\theta_T - \theta^0\right\|$$
$$+ \left\|S_T^{-1}\frac{1}{T}\sum_{s,t=1}^{T}k_{T,s,t}\,J_s(\theta_{T,*})\left\{\hat I_{T,s}(\tilde\theta_T) - \hat I_{T,s}\left(\theta^0\right)\right\}\hat m^*_{T,t}\right\|\times\left\|\tilde\theta_T - \theta^0\right\|$$
$$+ \left\|S_T^{-1}\frac{1}{T}\sum_{s,t=1}^{T}k_{T,s,t}\,m_s\left(\theta^0\right)\left\{\hat I_{T,s}(\tilde\theta_T) - \hat I_{T,s}\left(\theta^0\right)\right\}\hat m^*_{T,t}\right\| = \sum_{i=1}^{4}B_{i,T}.$$

The gist of Davidson and de Jong's (2000: p. 419-420) Fourier inversion argument applies. Extend their equation (A.51) to our environment to obtain
$$B_{1,T} \le K\int_{-\infty}^{\infty}\left\|\frac{1}{T}\sum_{s=1}^{T}e^{-i\xi s/\gamma_T}\hat J_{T,s}(\theta_{T,*})\right\|\left\|\tilde\theta_T - \theta^0\right\|\times\left\|\frac{1}{T^{1/2}}S_T^{-1/2}\sum_{t=1}^{T}e^{i\xi t/\gamma_T}\hat m^*_{T,t}\right\|\left|\mathcal K(\xi)\right|d\xi = K\int_{-\infty}^{\infty}C_T(\xi)\,D_T(\xi)\left|\mathcal K(\xi)\right|d\xi,$$
where $\mathcal K(\xi)$ is defined under K1. Approximation Lemma D.2.a and CLT Lemma D.7 render $D_T(\xi) = O_p(1)$. Further, Jacobian consistency Lemma 2.5 with $\|\theta_{T,*} - \theta^0\| \le \|\tilde\theta_T - \theta^0\| = O_p(T^{-1/2}\|S_T\|^{1/2}\times\|J_T\|^{-1})$, and K1 properties $\sum_{s,t=1}^{T}|k_{T,s,t}| = o(T^2)$, $\max_{1\le s\le T}\sum_{t=1}^{T}|k_{T,s,t}| = o(T)$ and $\gamma_T = o(T)$ imply $C_T(\xi) = o_p(1)$. Therefore $\int_{-\infty}^{\infty}C_T(\xi)D_T(\xi)|\mathcal K(\xi)|\,d\xi = o_p(1)$ by dominated convergence and K1. Similar arguments extend to the remaining terms by exploiting expansion Lemma D.1.a.
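For concreteness, the object handled in Claim (d) is a kernel (HAC) covariance estimator built from the tail-trimmed equations $\hat m^*_{T,t}$. The following sketch computes such an estimator with a Bartlett kernel, one kernel satisfying K1; the function name, inputs and kernel choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hac_trimmed(m_star, gamma_T):
    """Kernel (HAC) estimate of the long-run covariance of trimmed
    equations: (1/T) sum_{s,t} k((s-t)/gamma_T) m*_s m*_t', with the
    Bartlett kernel k(x) = max(1 - |x|, 0) and bandwidth gamma_T."""
    T, q = m_star.shape
    m = m_star - m_star.mean(axis=0)   # center the trimmed equations
    S = m.T @ m / T                    # lag-0 term
    for l in range(1, T):
        w = 1.0 - l / gamma_T          # Bartlett weight k(l/gamma_T)
        if w <= 0.0:
            break                      # weights vanish beyond gamma_T
        G = m[l:].T @ m[:-l] / T       # lag-l sample autocovariance
        S += w * (G + G.T)             # symmetrized weighted lag term
    return S
```

The double sum over $(s,t)$ collapses to the lag-0 covariance plus symmetrized, kernel-weighted autocovariances, which is what the loop accumulates.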

Proof of Lemma D.3. The pointwise LLN
$$\frac{1}{T}\sum_{t=1}^{T}m^*_{T,t}\left(\theta^0\right) = \frac{1}{T}\sum_{t=1}^{T}\left\{m^*_{T,t}\left(\theta^0\right) - E\left[m^*_{T,t}\left(\theta^0\right)\right]\right\} + E\left[m^*_{T,t}\left(\theta^0\right)\right] = o_p(1)$$

follows from identification I2, covariance bound Lemma C.2.c and Chebyshev's inequality. The ULLN follows from a classic bracketing argument (e.g. Blum 1955, DeHardt 1971, cf. Dudley 1999). Define for any $i\in\{1,...,q\}$
$$h^*_{T,t}(\theta) := \frac{m^*_{i,T,t}(\theta) - E\left[m^*_{i,T,t}(\theta)\right]}{\sup_{\theta\in\Theta}\left\|E\left[m^*_{T,t}(\theta)\right]\right\|}.$$
Use Lemma C.2.a,b to deduce the covariance bound $S_T(\theta) = o(T\sup_{\theta\in\Theta}\|E[m^*_{T,t}(\theta)]\|^2)$, hence $1/T\sum_{t=1}^{T}\{h^*_{T,t}(\theta) - E[h^*_{T,t}(\theta)]\} = o_p(1)$ by Chebyshev's inequality. Further, $h^*_{T,t}(\theta)$ is uniformly $L_1$-bounded so it belongs to a separable Banach space (Royden 1988). Therefore the $L_1$-bracketing numbers satisfy $N_{[\,]}(\varepsilon,\Theta,\|\cdot\|_1) < \infty$ (Dudley 1999: Proposition 7.1.7). Together, the pointwise LLN and bracketing numbers $N_{[\,]}(\varepsilon,\Theta,\|\cdot\|_1) < \infty$ deliver $\sup_\theta|1/T\sum_{t=1}^{T}(h^*_{T,t}(\theta) - E[h^*_{T,t}(\theta)])| = o_p(1)$. See Theorem 7.1.5 of Dudley (1999), cf. Blum (1955) and DeHardt (1971).

Proof of Lemma D.4. Threshold construction (4) and D3 imply $I^*_{T,t}(\theta)$ is $L_2$-bounded uniformly on $1\le t\le T$, $T\ge1$, and $\Theta$, and geometrically $\beta$-mixing. Coupled with metric entropy with $L_2$-bracketing D6 we may extend Doukhan et al's (1995: Theorem 1; eq. (2.17)) uniform central limit theorem to triangular arrays $\{I^*_{T,t}(\theta)\}$. See especially Application 4 in Doukhan et al (1995). Therefore $\{1/T^{1/2}\sum_{t=1}^{T}I^*_{T,t}(\theta) : \theta\in\Theta\} \Longrightarrow \{I(\theta) : \theta\in\Theta\}$, a Gaussian process with a version that has uniformly bounded and uniformly continuous sample paths with respect to $\|\cdot\|_2$. Further, the conditions for Doukhan et al's (1995: Theorem 2) uniform maximal inequality are satisfied since their required bound (2.10) holds under their (2.17), which D6 ensures. Therefore $E[(\sup_\theta\{|T^{-1/2}\sum_{t=1}^{T}I^*_{T,t}(\theta)|\})^2] = O(1)$, which completes the proof.

Proof of Lemma D.5. Write $\hat m^*_T(\theta) := 1/T\sum_{t=1}^{T}\hat m^*_{T,t}(\theta)$ and $m^*_T(\theta) := 1/T\sum_{t=1}^{T}m^*_{T,t}(\theta)$. By weight property M1 and the triangle inequality
$$\left|\hat Q_T(\theta) - Q_T(\theta)\right| \le m_T^{-2}\left\|\hat m^*_T(\theta)\right\|^2\times\left\|\hat\Upsilon_T - \Upsilon_T\right\| + m_T^{-2}\left|\hat m^*_T(\theta)'\Upsilon_T\hat m^*_T(\theta) - Q_T(\theta)\right|$$
$$\le \left\{\left\|\hat m^*_T(\theta)\right\|^2\times o_p\left(m_T^{-2}\right)\right\} + \left\{K m_T^{-2}\left\|\hat m^*_T(\theta) - m^*_T(\theta)\right\|^2\right\} + \left\{K m_T^{-2}\left\|m^*_T(\theta)\right\|\times\left\|\hat m^*_T(\theta) - m^*_T(\theta)\right\|\right\} + \left\{m_T^{-2}\left|m^*_T(\theta)'\Upsilon_T m^*_T(\theta) - Q_T(\theta)\right|\right\}$$
$$= A_{1,T}(\theta) + A_{2,T}(\theta) + A_{3,T}(\theta) + A_{4,T}(\theta).$$
Uniform approximation Lemma D.2.b and $\liminf_{T\ge N}m_T > 0$ under smoothness I3 imply
$$\sup_\theta\left\{A_{1,T}(\theta)\right\}^{1/2} = \sup_\theta\left\{\frac{\left\|\hat m^*_T(\theta)\right\|}{\sup_{\theta\in\Theta}\left\|E\left[m^*_{T,t}(\theta)\right]\right\|}\right\}\times o_p(1)
\le \frac{\sup_\theta\left\|m^*_T(\theta)\right\|}{\sup_{\theta\in\Theta}\left\|E\left[m^*_{T,t}(\theta)\right]\right\|}\times o_p(1) + o_p(1)
\le \frac{\sup_\theta\left\|m^*_T(\theta) - E\left[m^*_t(\theta)\right]\right\|}{\sup_{\theta\in\Theta}\left\|E\left[m^*_{T,t}(\theta)\right]\right\|}\times o_p(1) + o_p(1).$$
Now apply the ULLN Lemma D.3 to deduce $\sup_\theta\{A_{1,T}(\theta)\} = o_p(1)$. Similar arguments based on approximation Lemma D.2.b reveal $\sup_\theta\{A_{2,T}(\theta)\}$ and $\sup_\theta\{A_{3,T}(\theta)\}$ are $o_p(1)$. Finally, under M1
$$A_{4,T}(\theta) = m_T^{-2}\left|m^*_T(\theta)'\Upsilon_T m^*_T(\theta) - E\left[m^*_t(\theta)\right]'\Upsilon_T E\left[m^*_t(\theta)\right]\right|
\le K m_T^{-2}\left\|m^*_T(\theta) - E\left[m^*_t(\theta)\right]\right\|^2 + K m_T^{-2}\left\|E\left[m^*_t(\theta)\right]\right\|\times\left\|m^*_T(\theta) - E\left[m^*_t(\theta)\right]\right\|.$$
Lemma D.3 implies each term is $o_p(1)$ uniformly on $\Theta$.

Proof of Lemma D.6. By Čížek's (2008: Lemma 2.1) argument, absolute continuity of the equation distributions D1.i and equation differentiability D2 ensure $\hat Q_T(\theta)$ is continuous and differentiable at $\hat\theta_T$ a.s. Now use $\hat Q_T(\hat\theta_T) \le \hat Q_T(\theta)$ $\forall\theta\in\Theta$ to deduce by Čížek's (2008: Lemma 2.1) argument
$$\hat J^*_T(\hat\theta_T)'\hat\Upsilon_T\frac{1}{T}\sum_{t=1}^{T}\hat m^*_{T,t}(\hat\theta_T) = 0\ \text{a.s.}$$
The Lemma D.1.b asymptotic expansion for $1/T\sum_{t=1}^{T}\hat m^*_{T,t}(\hat\theta_T)$ implies we may write
$$\hat J^*_T(\hat\theta_T)'\hat\Upsilon_T\left\{\hat J^*_T(\theta_{T,*})\left(\hat\theta_T - \theta^0\right) + \frac{1}{T}\sum_{t=1}^{T}\hat m^*_{T,t}\left(\theta^0\right)\right\} + \hat J^*_T(\hat\theta_T)'\hat\Upsilon_T\times o_p\left(r_T\times\left\|\hat\theta_T - \theta^0\right\|^{1/\iota}\right) = 0\ \text{a.s.}$$
for some $\|\theta_{T,*} - \theta^0\| \le \|\hat\theta_T - \theta^0\|$, $r_T\to0$ arbitrarily fast and tiny $\iota > 0$.

Consistency $\|\hat\theta_T - \theta^0\|\overset{p}{\to}0$ under Theorem 2.1 and Jacobian consistency Lemma 2.5 ensure both $\hat J^*_T(\hat\theta_T) = J_T(1 + o_p(1))$ and $\hat J^*_T(\theta_{T,*}) = J_T(1 + o_p(1))$. Further, $H_T^{-1} := (J_T'\Upsilon_T J_T)^{-1}$ exists given weight and Jacobian properties M1 and D5.i. Re-arrange terms and exploit the construction of $V_T$ and $r_T\to0$ arbitrarily fast to deduce
$$V_T^{1/2}\left(\hat\theta_T - \theta^0\right) = -\left\{V_T^{1/2}H_T^{-1}J_T'\Upsilon_T\right\}\frac{1}{T}\sum_{t=1}^{T}\hat m^*_{T,t}\left(\theta^0\right)\times\left(1 + o_p(1)\right) + o_p(1) = A_T\sum_{t=1}^{T}\hat m^*_{T,t}\left(\theta^0\right)\times\left(1 + o_p(1)\right) + o_p(1).$$

Proof of Lemma D.7. Define $z^*_{T,t}(r) = r'T^{-1/2}S_T^{-1/2}(\theta^0)m^*_{T,t}(\theta^0)$ for arbitrary $r\in\mathbb R^q$, $r'r = 1$, and observe identification I2 implies both $E[z^*_{T,t}(r)] = o(1)$ and $E[(\sum_{t=1}^{T}z^*_{T,t}(r))^2] = 1 + o(1)$. The proof that $\sum_{t=1}^{T}z^*_{T,t}(r)\overset{d}{\to}N(0,1)$ follows from a standard martingale difference decomposition under geometric $\beta$-mixing D3. The proof is tedious, hence relegated to Hill and Renault (2010: Lemma D.7).
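Lemma D.7 is the asymptotic normality that trimming is designed to restore. A minimal Monte Carlo sketch of the idea follows; the fractile exponent $\lambda = .4$, tail index 1.5 and sample sizes are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, R = 2000, 5000
k_T = int(T ** 0.4)                # trimming fractile k_T = [T^.4]

z = np.empty(R)
for r in range(R):
    # Symmetrized Pareto draws with tail index 1.5: infinite variance,
    # so the untrimmed standardized sum is not asymptotically normal.
    x = rng.pareto(1.5, T) * rng.choice([-1.0, 1.0], T)
    c_T = np.sort(np.abs(x))[-k_T]              # two-tailed threshold
    x_star = np.where(np.abs(x) < c_T, x, 0.0)  # tail-trimmed series
    x_star = x_star - x_star.mean()             # center
    z[r] = x_star.sum() / (T ** 0.5 * x_star.std())

print(np.mean(np.abs(z) <= 1.96))  # close to .95 if N(0,1) holds
```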

Proof of Lemma D.8. Apply Minkowski's inequality and the Lemma D.2.b uniform approximation to obtain
$$\sup_{\theta\in U^0(\delta)}\left\{\frac{\left\|\left\{\hat m^*_T(\theta) - \hat m^*_T(\theta^0)\right\} - \left\{E\left[m^*_{T,t}(\theta)\right] - E\left[m^*_{T,t}(\theta^0)\right]\right\}\right\|}{1 + \left\|J_T\right\|\times\left\|\theta - \theta^0\right\|}\right\}$$
$$\le \sup_{\theta\in U^0(\delta)}\left\{\frac{\left\|\left\{m^*_T(\theta) - m^*_T(\theta^0)\right\} - \left\{E\left[m^*_{T,t}(\theta)\right] - E\left[m^*_{T,t}(\theta^0)\right]\right\}\right\|}{1 + \left\|J_T\right\|\times\left\|\theta - \theta^0\right\|}\right\} + 2\sup_{\theta\in U^0(\delta)}\left\{\frac{\left\|\hat m^*_T(\theta) - m^*_T(\theta)\right\|}{1 + \left\|J_T\right\|\times\left\|\theta - \theta^0\right\|}\right\}$$
$$= \sup_{\theta\in U^0(\delta)}\left\{\frac{\left\|\left\{m^*_T(\theta) - m^*_T(\theta^0)\right\} - \left\{E\left[m^*_{T,t}(\theta)\right] - E\left[m^*_{T,t}(\theta^0)\right]\right\}\right\|}{1 + \left\|J_T\right\|\times\left\|\theta - \theta^0\right\|}\right\} + o_p(1).$$
Moment expansion Lemma C.3 and equation expansion Lemma D.1.b imply the last line is bounded by $\sup_{\theta\in U^0(\delta)}\{\|J^*_T(\theta) - J_T\|/\|J_T\|\} + o_p(1)$.

REFERENCES

[1] Agulló, J., C. Croux and S. Van Aelst (2008). The Multivariate Least-Trimmed Squares Estimator, Journal of Multivariate Analysis 99, 311-338.
[2] An, H.Z. and Z.G. Chen (1982). On Convergence of LAD Estimates in Autoregression with Infinite Variance, Journal of Multivariate Analysis 12, 335-345.
[3] An, H.Z. and F.C. Huang (1996). The Geometrical Ergodicity of Nonlinear Autoregressive Models, Statistica Sinica 6, 943-956.
[4] Andrews, B. (2008). Rank-Based Estimation for Autoregressive Moving Average Time Series Models, Journal of Time Series Analysis 29, 51-73.
[5] Antoine, B. and E. Renault (2009). Efficient GMM with Nearly-Weak Instruments, Econometrics Journal (Tenth Anniversary Issue) 12, S135-S171.
[6] Antoine, B. and E. Renault (2010). Efficient Minimum Distance Estimation with Multiple Rates of Convergence, Journal of Econometrics: forthcoming.
[7] Bassett, G.W. (1991). Equivariant, Monotonic, 50% Breakdown Estimators, American Statistician 45, 135-137.
[8] Basrak, B., R.A. Davis and T. Mikosch (2002). Regular Variation of GARCH Processes, Stochastic Processes and their Applications 99, 95-115.
[9] Blum, J.R. (1955). On Convergence of Empirical Distribution Functions, Annals of Mathematical Statistics 26, 527-529.
[10] Bonnal, H. and E. Renault (2004). On the Efficient Use of the Informational Content of Estimating Equations: Implied Probabilities and Euclidean Empirical Likelihood, Working Paper, CIREQ, Université de Montréal.
[11] Borkovec, M. and C. Klüppelberg (2001). The Tail of the Stationary Distribution of an Autoregressive Process with ARCH(1) Errors, Annals of Applied Probability 11, 1220-1241.
[12] Bougerol, P. and N. Picard (1992). Stationarity of GARCH Processes and of Some Nonnegative Time Series, Journal of Econometrics 52, 115-127.
[13] Brockwell, P.J. and D.B.H. Cline (1985). Linear Prediction of ARMA Processes with Infinite Variance, Stochastic Processes and their Applications 19, 281-296.
[14] Campbell, J. and L. Hentschel (1992). No News is Good News: An Asymmetric Model of Changing Volatility in Stock Returns, Journal of Financial Economics 31, 281-318.
[15] Carrasco, M. and X. Chen (2002). Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models, Econometric Theory 18, 17-39.
[16] Chernozhukov, V. (2010). Inference for Extremal Conditional Quantile Models, with an Application to Birthweights, Review of Economic Studies: forthcoming.
[17] Čížek, P. (2005). Least Trimmed Squares in Nonlinear Regression Under Dependence, Journal of Statistical Planning and Inference 136, 3967-3988.
[18] Čížek, P. (2008). General Trimmed Estimation: Robust Approach to Nonlinear and Limited Dependent Variable Models, Econometric Theory 24, 1500-1529.
[19] Čížek, P. (2009). Generalized Method of Trimmed Moments, Working Paper, CentER, University of Tilburg.
[20] Cline, D.B.H. (1986). Convolution Tails, Product Tails and Domains of Attraction, Probability Theory and Related Fields 72, 529-557.
[21] Cline, D.B.H. (1989). Consistency for Least Squares Regression Estimators with Infinite Variance Data, Journal of Statistical Planning and Inference 23, 163-179.
[22] Cline, D.B.H. (2007). Regular Variation of Order 1 Nonlinear AR-ARCH Models, Stochastic Processes and their Applications 117, 840-861.
[23] Davidson, J. (1994). Stochastic Limit Theory. Oxford University Press: Oxford.
[24] Davidson, J. and R. de Jong (2000). Consistency of Kernel Estimators of Heteroscedastic and Autocorrelated Covariance Matrices, Econometrica 68, 407-423.
[25] Davis, R.A., K. Knight and J. Liu (1992). M-Estimation for Autoregressions with Infinite Variance, Stochastic Processes and their Applications 40, 145-180.


[26] Davis, R.A. and T. Mikosch (2008). Extreme Value Theory for GARCH Models. In: T.G. Andersen, R.A. Davis, J.-P. Kreiss and T. Mikosch (eds.), Handbook of Financial Time Series, 186-200. Springer: New York.
[27] Davis, R.A. and S.I. Resnick (1986). Limit Theory for the Sample Covariance and Correlation Functions of Moving Averages, Annals of Statistics 14, 533-558.
[28] Davis, R.A. and S.I. Resnick (1996). Limit Theory for Bilinear Processes with Heavy-Tailed Noise, Annals of Applied Probability 6, 1191-1210.
[29] de Haan, L., S.I. Resnick, H. Rootzén and C.G. de Vries (1989). Extremal Behaviour of Solutions to a Stochastic Difference Equation with Applications to ARCH Processes, Stochastic Processes and their Applications 32, 213-224.
[30] Dell'Aquila, R., E. Ronchetti and F. Trojani (2003). Robust GMM Analysis of Models for the Short Rate Process, Journal of Empirical Finance 10, 373-397.
[31] DeHardt, J. (1971). Generalizations of the Glivenko-Cantelli Theorem, Annals of Mathematical Statistics 42, 2050-2055.
[32] Ding, Z. and C.W.J. Granger (1996). Modeling Volatility Persistence of Speculative Returns: A New Approach, Journal of Econometrics 73, 185-215.
[33] Doukhan, P. (1994). Mixing: Properties and Examples, Lecture Notes in Statistics 85. Springer: Berlin.
[34] Doukhan, P., P. Massart and E. Rio (1995). Invariance Principles for Absolutely Regular Empirical Processes, Annales de l'Institut Henri Poincaré 31, 393-427.
[35] Dudley, R.M. (1978). Central Limit Theorems for Empirical Measures, Annals of Probability 6, 899-929.
[36] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University Press: Cambridge.
[37] Embrechts, P., C. Klüppelberg and T. Mikosch (1997). Modelling Extremal Events for Insurance and Finance. Springer-Verlag: Berlin.
[38] Engle, R. and V. Ng (1993). Measuring and Testing the Impact of News on Volatility, Journal of Finance 48, 1749-1778.
[39] Fama, E.F. and K.R. French (1993). Common Risk Factors in the Returns on Stocks and Bonds, Journal of Financial Economics 33, 3-56.
[40] Feller, W. (1971). An Introduction to Probability Theory and its Applications, 2nd ed., Vol. 2. Wiley: New York.
[41] Finkenstadt, B. and H. Rootzén (2003). Extreme Values in Finance, Telecommunications and the Environment. Chapman and Hall: New York.
[42] Francq, C. and J.M. Zakoïan (2004). Maximum Likelihood Estimation of Pure GARCH and ARMA-GARCH Processes, Bernoulli 10, 605-637.
[43] Giné, E. and J. Zinn (1984). Some Limit Theorems for Empirical Processes, Annals of Probability 12, 929-989.
[44] Griffin, P.S. and W.E. Pruitt (1987). The Central Limit Problem for Trimmed Sums, Mathematical Proceedings of the Cambridge Philosophical Society 102, 329-349.
[45] Hall, A. (2005). Generalized Method of Moments. Oxford University Press: Oxford.
[46] Hall, P. (1982). On Some Simple Estimates of an Exponent of Regular Variation, Journal of the Royal Statistical Society Series B 44, 37-42.
[47] Hall, P. and Q. Yao (2003). Inference in ARCH and GARCH Models with Heavy-Tailed Errors, Econometrica 71, 285-317.
[48] Hannan, E.J. and M. Kanter (1977). Autoregressive Processes with Infinite Variance, Journal of Applied Probability 14, 411-415.
[49] Hansen, L.P. (1982). Large Sample Properties of Generalized Method of Moments Estimators, Econometrica 50, 1029-1054.
[50] Hansen, B. and S.W. Lee (1994). Asymptotic Theory for the GARCH(1,1) Quasi-Maximum Likelihood Estimator, Econometric Theory 10, 29-52.


[51] He, X., J. Jurečková, R. Koenker and S. Portnoy (1990). Tail Behavior of Regression Estimators and their Breakdown Points, Econometrica 58, 1195-1214.
[52] Hill, B.M. (1975). A Simple General Approach to Inference about the Tail of a Distribution, Annals of Statistics 3, 1163-1174.
[53] Hill, J.B. (2010). On Tail Index Estimation for Dependent Heterogeneous Data, Econometric Theory 26: in press.
[54] Hill, J.B. and E. Renault (2010). Appendix E for "Generalized Method of Moments with Tail Trimming", University of North Carolina - Chapel Hill; www.unc.edu/∼jbhill/gmttm_tech_append.pdf.
[55] Hill, J.B. and A. Shneyerov (2009). Are There Common Values in BC Timber Sales? A Tail-Index Nonparametric Test, Working Paper, Dept. of Economics, University of North Carolina - Chapel Hill: submitted.
[56] Hoffmann-Jørgensen, J. (1984). Convergence of Stochastic Processes on Polish Spaces. Unpublished manuscript.
[57] Honoré, B. (1992). Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects, Econometrica 60, 533-565.
[58] Horowitz, J.L. (1998). Bootstrap Methods for Covariance Structures, Journal of Human Resources 33, 39-61.
[59] Hsing, T. (1991). On Tail Index Estimation Using Dependent Data, Annals of Statistics 19, 1547-1569.
[60] Huber, P.J. (1964). Robust Estimation of a Location Parameter, Annals of Mathematical Statistics 35, 73-101.
[61] Huisman, R., K. Koedijk and C. Kool (2001). Tail-Index Estimates in Small Samples, Journal of Business and Economic Statistics 19, 208-216.
[62] Ibragimov, I.A. and Y.V. Linnik (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff: Groningen.
[63] Jensen, S.T. and A. Rahbek (2004). Asymptotic Normality of the QML Estimator of ARCH in the Nonstationary Case, Econometrica 72, 641-646.
[64] Jurečková, J. and P.K. Sen (1996). Robust Statistical Procedures: Asymptotics and Interrelations. Wiley: New York.
[65] Khan, S. and A. Lewbel (2007). Weighted and Two-Stage Least Squares Estimation of Semiparametric Truncated Regression Models, Econometric Theory 23, 309-347.
[66] Kitamura, Y., G. Tripathi and H. Ahn (2004). Empirical Likelihood-Based Inference in Conditional Moment Restriction Models, Econometrica 72, 1667-1714.
[67] Knight, K. (1987). Rate of Convergence of Centred Estimates of Autoregressive Parameters for Infinite Variance Autoregressions, Journal of Time Series Analysis 8, 51-60.
[68] Leadbetter, M.R., G. Lindgren and H. Rootzén (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag: New York.
[69] Ling, S. (2005). Self-Weighted LAD Estimation for Infinite Variance Autoregressive Models, Journal of the Royal Statistical Society Series B 67, 381-393.
[70] Ling, S. (2007). Self-Weighted and Local Quasi-Maximum Likelihood Estimators for ARMA-GARCH/IGARCH Models, Journal of Econometrics 140, 849-873.
[71] Linton, O., J. Pan and H. Wang (2010). Estimation for a Non-stationary Semi-Strong GARCH(1,1) Model with Heavy-Tailed Errors, Econometric Theory 26, 1-28.
[72] Lumsdaine, R. (1995). Finite-Sample Properties of the Maximum Likelihood Estimator in GARCH(1,1) and IGARCH(1,1) Models: A Monte Carlo Investigation, Journal of Business and Economic Statistics 13, 1-10.
[73] Lumsdaine, R. (1996). Consistency and Asymptotic Normality of the Quasi-Maximum Likelihood Estimator in IGARCH(1,1) and Covariance Stationary GARCH(1,1) Models, Econometrica 64, 575-596.


[74] Mandelbrot, B. (1963). The Variation of Certain Speculative Prices, Journal of Business 36, 394-419.
[75] McFadden, D.L. (1989). A Method of Simulated Moments for Estimation of Discrete Response Models Without Numerical Integration, Econometrica 57, 995-1026.
[76] McLeish, D.L. (1975). A Maximal Inequality and Dependent Strong Laws, Annals of Probability 3, 829-839.
[77] Meddahi, N. and E. Renault (2004). Temporal Aggregation of Volatility Models, Journal of Econometrics 119, 355-379.
[78] Meitz, M. and P. Saikkonen (2008). Stability of Nonlinear AR-GARCH Models, Journal of Time Series Analysis 29, 453-475.
[79] Mikosch, T. and C. Stărică (2000). Limit Theory for the Sample Autocorrelations and Extremes of a GARCH(1,1) Process, Annals of Statistics 28, 1427-1451.
[80] Mikosch, T. and D. Straumann (2006). Quasi-Maximum-Likelihood Estimation in Conditionally Heteroscedastic Time Series: A Stochastic Recurrence Equations Approach, Annals of Statistics 34, 2449-2495.
[81] Naveau, P. (2003). Almost Sure Relative Stability of the Maximum of a Stationary Sequence, Advances in Applied Probability 35, 721-736.
[82] Newey, W.K. and D.L. McFadden (1994). Large Sample Estimation and Hypothesis Testing. In: R. Engle and D. McFadden (eds.), Handbook of Econometrics, Vol. 4. North-Holland: Amsterdam.
[83] Ossiander, M. (1987). A Central Limit Theorem Under Metric Entropy with L2 Bracketing, Annals of Probability 15, 897-919.
[84] Pakes, A. and D. Pollard (1989). Simulation and the Asymptotics of Optimization Estimators, Econometrica 57, 1027-1057.
[85] Peng, L. and Q. Yao (2003). Least Absolute Deviations Estimation for ARCH and GARCH Models, Biometrika 90, 967-975.
[86] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag: New York.
[87] Pollard, D. (1989). Bracketing Methods in Statistics and Econometrics. In: W.A. Barnett, J. Powell and G. Tauchen (eds.), Proceedings of the Conference on Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press: Cambridge.
[88] Pollard, D. (2002). Maximal Inequalities via Bracketing with Adaptive Truncation, Annales de l'Institut Henri Poincaré 38, 1039-1052.
[89] Powell, J.L. (1986). Symmetrically Trimmed Least Squares Estimation for Tobit Models, Econometrica 54, 1435-1460.
[90] Renault, E. (1997). Econométrie de la Finance: La Méthode des Moments Généralisés. In: Y. Simon (ed.), Encyclopédie des Marchés Financiers, 330-407. Economica: Paris.
[91] Resnick, S.I. (1987). Extreme Values, Regular Variation and Point Processes. Springer-Verlag: New York.
[92] Resnick, S.I. (1997). Heavy Tail Modeling and Teletraffic Data, Annals of Statistics 25, 1805-1849.
[93] Ronchetti, E. and F. Trojani (2001a). Robust Inference with GMM Estimators, Journal of Econometrics 101, 37-69.
[94] Ronchetti, E. and F. Trojani (2001b). Robust Inference for Generalized Linear Estimators, Journal of the American Statistical Association 96, 1022-1030.
[95] Royden, H.L. (1988). Real Analysis. Prentice Hall: New Jersey.
[96] Rousseeuw, P.J. (1985). Multivariate Estimation with High Breakdown Point. In: W. Grossmann, G. Pflug, I. Vincze and W. Wertz (eds.), Mathematical Statistics and Applications, Vol. B, 283-297. Reidel: Dordrecht.
[97] Ruppert, D. and R.J. Carroll (1980). Trimmed Least Squares Estimation in the Linear Model, Journal of the American Statistical Association 75, 828-838.
[98] Stock, J.H. and J.H. Wright (2000). GMM with Weak Identification, Econometrica 68, 1055-1096.


[99] Tableman, M. (1994). The Asymptotics of the Least Trimmed Absolute Deviations (LTAD) Estimator, Statistics and Probability Letters 19, 329-337.
[100] van der Vaart, A.W. and J.A. Wellner (1994). Weak Convergence and Empirical Processes. Springer-Verlag: New York.

TABLE 4: LOCATION, AR, kT ∼ T^λ

LOCATION: εt ∼ P2.5^b, κy = 2.5, θ0r^a = 1          AR: εt ∼ P2.5, κy = 2.5, θ0r = .9

θ̂^f     Mean^c        KS^d    λ^e    kT      |    Mean          KS      λ      kT
GMT     1.01 (.33)    .083    .44    21      |    .901 (.01)    .051    .48    28
GMM     1.00 (.41)    .111    -      -       |    .898 (.01)    .086    -      -
QML     .999 (.04)    .121    -      -       |    .899 (.01)    .132    -      -

LOCATION: εt ∼ P1.5, κy = 1.5, θ0r = 1              AR: εt ∼ P1.5, κy = 1.5, θ0r = .9

θ̂       Mean          KS      λ      kT      |    Mean          KS      λ      kT
GMT     1.01 (.31)    .072    .32    9       |    .900 (.01)    .069    .32    9
GMM     1.07 (.46)    .145    -      -       |    .900 (.01)    .187    -      -
QML     1.01 (.15)    .272    -      -       |    .889 (.01)    .253    -      -

a. True parameter value for the rth element θ0r.
b. Error distribution (P2.5, P1.5 or N0,1), and moment supremum or tail index κy of yt.
c. Simulation mean of the parameter estimate (square root of the MSE in parentheses).
d. Kolmogorov-Smirnov test p-value. In the case of GMTTM, KS p-values are evaluated at the kT = [T^λ] which minimizes KS. 1%, 5%, 10% critical values: .136, .122, .107.
e. KS-minimizing λ in the trimming fractile kT = [T^λ], λ ∈ {.01, .02, ..., .99}.
f. GMT = GMTTM. GMT and GMM are computed in two steps with the efficient weight and a QMLE plug-in; exact identification is considered here. See Hill and Renault (2010) for over-identification GMT and GMM results.
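Note e's trimming rule is easy to state operationally: per moment equation, null the kT = [T^λ] observations with the largest magnitudes. A minimal sketch follows; the function name and the per-coordinate two-tailed thresholding are illustrative assumptions, not the authors' code.

```python
import numpy as np

def tail_trim(m, lam):
    """Null the k_T = [T^lam] most extreme equations per coordinate
    via a two-tailed threshold, mimicking k_T ~ T^lam in Tables 4-6."""
    T, q = m.shape
    k_T = max(int(T ** lam), 1)
    m_star = m.copy()
    for i in range(q):
        c_i = np.sort(np.abs(m[:, i]))[-k_T]     # k_T-th largest |m_{i,t}|
        m_star[np.abs(m[:, i]) >= c_i, i] = 0.0  # trim the k_T extremes
    return m_star

# Note e's grid: the simulations scan lambda over {.01, .02, ..., .99}
# and report the KS-minimizing value.
lam_grid = np.arange(0.01, 1.00, 0.01)
```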

TABLE 5: ARCH, GARCH, kj,T ∼ T^λj

ARCH: εt ∼ N0,1, κy = 4.65, θ0r = .5               GARCH: εt ∼ N0,1, κy = 4.1, θ0r = .6

θ̂       Mean          KS      λ1,λ2^a    k1,k2^b  |    Mean          KS      λ1,λ2     k1,k2
GMT     .501 (.09)    .074    .50,.35    32,11    |    .598 (.16)    .066    .46,.28   24,7
GMM     .457 (.13)    .228    -          -        |    .577 (.20)    .177    -         -
QML     .492 (.07)    .051    -          -        |    .597 (.08)    .098    -         -

ARCH: εt ∼ N0,1, κy = 3.8, θ0r = .6                GARCH: εt ∼ N0,1, κy = 2.0, θ0r = .6

GMT     .605 (.09)    .069    .55,.40    45,16    |    .603 (.18)    .071    .68,.33   110,10
GMM     .600 (.17)    .162    -          -        |    .628 (.23)    .273    -         -
QML     .599 (.07)    .082    -          -        |    .586 (.20)    .262    -         -

ARCH: εt ∼ P2.5, κy = 1.5, θ0r = .6                GARCH: εt ∼ P2.5, κy = 1.8, θ0r = .6

GMT     .606 (.33)    .097    .70,.35    126,11   |    .596 (.24)    .101    .79,.42   234,18
GMM     .547 (.22)    .289    -          -        |    .264 (.35)    .448    -         -
QML     .620 (.27)    .110    -          -        |    .605 (.18)    .569    -         -

a. KS-minimizing pair {λ1, λ2}.
b. kj = kj,T = [T^λj].

TABLE 6: TARCH, QARCH, kT ∼ T^λ

TARCH: εt ∼ N0,1, κy = 5.25, θ0r = .6              QARCH: εt ∼ N0,1, κy = 3.5, θ0r = .8

θ̂       Mean          KS      λ1,λ2     k1,k2     |    Mean          KS      λ1,λ2     k1,k2
GMT     .617 (.16)    .083    .29,.16   7,3       |    .798 (.09)    .066    .32,.10   10,2
GMM     .536 (.15)    .269    -         -         |    .673 (.09)    .454    -         -
QML     .595 (.09)    .073    -         -         |    .896 (.66)    .389    -         -

TARCH: εt ∼ P2.5, κy = 2.6, θ0r = .6               QARCH: εt ∼ N0,1, κy = 2.0, θ0r = 1

GMT     .619 (.29)    .104    .67,.09   102,2     |    1.01 (.09)    .050    .57,.37   51,13
GMM     .270 (.20)    .532    -         -         |    .721 (.09)    .643    -         -
QML     .516 (.38)    .323    -         -         |    .997 (.64)    .287    -         -

TABLE 7: GMTTM, kT ∼ δ ln(T), δT/ln(T)

Model       εt      κy     θ0r    Mean           KS      δ^a          kT
LOCATION    P2.5    2.5    1      .996 (.04)     .095    .18          26
LOCATION    P1.5    1.5    1      .995 (.13)     .994    .05          7
AR          P2.5    2.5    .9     .898 (.01)     .053    .19          28
AR          P1.5    1.5    .9     .901 (.01)     .089    .06          9
ARCH        N0,1    3.8    .6     .593 (.09)     .105    .31, .12     45,17
ARCH        P2.5    1.8    .6     .606 (.25)     .116    .90, .09     130,13
IGARCH      N0,1    2.0    .6     .597 (.21)     .076    .73, .06     106,9
GARCH       P2.5    1.5    .6     .595 (.26)     .104    1.7, .13     246,19
TARCH       P2.5    2.6    .6     .618 (.29)     .105    .70, .01     101,1
QARCH       N0,1    3.5    .8     .805 (.10)     .069    .06, .01     9,1
QIARCH      N0,1    2.0    1      .995 (.71)     .074    .37, .10     54,14

a. The KS-minimizing δ for LOCATION, where kT = [δT/ln(T)], and AR, where kT = [δ ln(T)]; and {δ1, δ2} for all GARCH models, where kj,T = [δj T/ln(T)].
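Table 7 swaps the fractile rule kT = [T^λ] of Tables 4-6 for logarithmic rules of thumb. The sketch below (constants arbitrary; at T = 1,000 the rule kT = [.18 T/ln(T)] reproduces the LOCATION fractile kT = 26 reported above) compares how the rules grow with T while keeping kT/T → 0.

```python
import numpy as np

# Illustrative fractile rules of thumb: each k_T diverges with T while
# k_T / T -> 0, as tail-trimming requires. Constants chosen arbitrarily.
for T in (100, 1000, 10000):
    lnT = np.log(T)
    print(T,
          int(T ** 0.44),         # k_T = [T^lambda]         (Tables 4-6)
          int(0.18 * T / lnT),    # k_T = [delta T / ln(T)]  (Table 7)
          int(4.0 * lnT))         # k_T = [delta ln(T)]      (Table 7)
```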