Quantile Processes for Semi and Nonparametric Regression

Shih-Kang Chao∗

Stanislav Volgushev†

Guang Cheng‡

arXiv:1604.02130v1 [math.ST] 7 Apr 2016

April 7, 2016

Abstract: A collection of quantile curves provides a complete picture of conditional distributions. Properly centered and scaled versions of estimated curves at various quantile levels give rise to the so-called quantile regression process (QRP). In this paper, we establish weak convergence of the QRP in a general series approximation framework, which includes linear models with increasing dimension, nonparametric models and partial linear models. An interesting consequence is obtained in the last class of models, where parametric and nonparametric estimators are shown to be asymptotically independent. Applications of our general process convergence results include the construction of non-crossing quantile curves and the estimation of conditional distribution functions. As a result of independent interest, we obtain a series of Bahadur representations with exponential bounds for tail probabilities of all remainder terms.

Keywords: Bahadur representation, quantile regression process, semi/nonparametric model, series estimation.

1 Introduction

Quantile regression is widely applied in various scientific fields such as economics (Koenker and Hallock (2001)), biology (Briollais and Durrieu (2014)) and ecology (Cade and Noon (2003)). By focusing on a collection of conditional quantiles instead of a single conditional mean, quantile regression allows one to describe the impact of predictors on the entire conditional distribution of the response. Properly scaled and centered versions of these estimated curves form an underlying (conditional) quantile regression process (see Section 2 for a formal definition). Existing literature

∗ Postdoctoral Fellow, Department of Statistics, Purdue University, West Lafayette, IN 47906. E-mail: [email protected]. Tel: +1 (765) 496-9544. Fax: +1 (765) 494-0558. Partially supported by the Office of Naval Research (ONR N00014-15-1-2331).

† Assistant Professor, Department of Statistical Science, Cornell University, 301 Malott Hall, Ithaca, NY 14853. E-mail: [email protected]. Part of this work was conducted while the second author was a postdoctoral fellow at the Ruhr University Bochum, Germany. During that time the second author was supported by the Sonderforschungsbereich "Statistical modelling of nonlinear dynamic processes" (SFB 823), Teilprojekt (C1), of the Deutsche Forschungsgemeinschaft.

‡ Corresponding Author. Associate Professor, Department of Statistics, Purdue University, West Lafayette, IN 47906. E-mail: [email protected]. Tel: +1 (765) 496-9549. Fax: +1 (765) 494-0558. Research sponsored by NSF CAREER Award DMS-1151692, DMS-1418042, and the Office of Naval Research (ONR N00014-15-1-2331).


on QRP is either concerned with models of fixed dimension (Koenker and Xiao, 2002; Angrist et al., 2006), or with a linearly interpolated version based on kernel smoothing (Qu and Yoon, 2015). In this paper, we study weak convergence of the QRP in models of the following (approximate) form

Q(x; τ) ≈ Z(x)⊤ γ_n(τ),   (1.1)

where Q(x; τ) denotes the τ-th quantile of the distribution of Y conditional on X = x ∈ R^d and Z(x) ∈ R^m is a transformation vector of x. As noted by Belloni et al. (2011), this framework incorporates a variety of estimation procedures such as parametric (Koenker and Bassett, 1978), nonparametric (He and Shi, 1994) and semiparametric (He and Shi, 1996) ones. For example, Z(x) = x corresponds to a linear model (with potentially high dimension), while Z(x) can be chosen as powers, trigonometric functions or local polynomials in a nonparametric basis expansion (where m diverges at a proper rate). Partially linear and additive models are also covered by (1.1). Therefore, our weak convergence results are developed in a broader context than those available in the literature.

A noteworthy result in the present paper is obtained for partially linear models

Q(X; τ) = V⊤ α(τ) + h(W; τ),

where X = (V⊤, W⊤)⊤ ∈ R^{k+k′}, α(τ) is an unknown Euclidean vector and h(W; τ) is an unknown smooth function. Here, k and k′ are both fixed. In the spirit of (1.1), we can estimate (α(τ), h(·; τ)) based on the following series approximation

h(W; τ) ≈ Z̃(W)⊤ β_n†(τ).

Our general theorem shows the weak convergence of the joint quantile process resulting from (α̂(τ), ĥ(·; τ)) in (ℓ∞(T))^{k+1}, where ℓ∞(T) denotes the class of uniformly bounded real functions on T ∋ τ. An interesting consequence is that α̂(τ) and ĥ(w_0; τ) (after proper centering and scaling) jointly converge to two independent Gaussian processes. This asymptotic independence result is useful for simultaneously testing the parametric and nonparametric components in semi-nonparametric models.* To the best of our knowledge, this is the first time that such a joint asymptotic result for quantile regression has been established; in fact, even the point-wise result is new. We therefore prove that the "joint asymptotics phenomenon" discovered by Cheng and Shang (2015) holds even for non-smooth loss functions with multivariate nonparametric covariates.

Weak convergence of the QRP is very useful in developing statistical inference procedures such as hypothesis testing on conditional distributions (Bassett and Koenker, 1982), detection of treatment effects on the conditional distribution after an intervention (Koenker and Xiao, 2002; Qu and Yoon, 2015) and testing conditional stochastic dominance (Delgado and Escanciano, 2013). For additional examples, the interested reader is referred to Koenker (2005). Our paper focuses on the estimation of conditional distribution functions and the construction of non-crossing quantile curves by means of monotone rearrangement (Dette and Volgushev, 2008; Chernozhukov et al., 2010).

* In the literature, a statistical model is called semi-nonparametric if it contains both finite-dimensional and infinite-dimensional unknown parameters of interest; see Cheng and Shang (2015).


Our derivation of quantile process convergence relies upon a series of Bahadur representations which are provided in Section 5. Specifically, we obtain weak convergence results by examining the asymptotic tightness of the leading terms in these representations, as shown in Lemma A.3; this is achieved by combining them with a new maximal inequality from Kley et al. (2015). These new Bahadur representations are more flexible than those available in the literature (see, for instance, Belloni et al. (2011)) in terms of the choice of approximation model centers, which is crucial for deriving the joint process convergence in partial linear models. As a result of independent interest, we obtain bounds with exponential tail probabilities for the remainder terms in our Bahadur representations. Bounds of this kind are especially useful in analyzing statistical inference procedures in the divide-and-conquer setup; see Zhao et al. (2016); Shang and Cheng (2015); Volgushev et al. (2016).

The rest of this paper is organized as follows. Section 2 presents the weak convergence of the QRP in a general series approximation framework. Section 3 discusses the QRP in quantile partial linear models. As an application of our weak convergence theory, Section 4 considers various functionals of the quantile regression process. A detailed discussion of our novel Bahadur representations is given in Section 5, and all proofs are deferred to the appendix.

Notation. Denote by {(X_i, Y_i)}_{i=1}^n i.i.d. samples in X × R, where X ⊂ R^d. Here, the distribution of (X_i, Y_i) and the dimension d can depend on n, i.e. we allow for triangular arrays. For brevity, let Z = Z(X) and Z_i = Z(X_i). Define the empirical measure of (Y_i, Z_i) by P_n, and the true underlying measure by P with corresponding expectation E. Note that the measure P depends on n in the triangular array case, but this dependence is omitted in the notation. Denote by ‖b‖ the L_2-norm of a vector b; λ_min(A) and λ_max(A) are the smallest and largest eigenvalues of a matrix A; 0_k denotes the k-dimensional zero vector and I_k the k-dimensional identity matrix for k ∈ N. Define ρ_τ(u) := (τ − 1(u ≤ 0))u, where 1(·) is the indicator function. C^η(X) denotes the class of η-continuously differentiable functions on a set X, and C(0, 1) denotes the class of continuous functions defined on (0, 1). Define

ψ(Y_i, Z_i; b, τ) := Z_i( 1{Y_i ≤ Z_i⊤ b} − τ ),   µ(b, τ) := E[ ψ(Y_i, Z_i; b, τ) ] = E[ Z_i( F_{Y|X}(Z_i⊤ b | X) − τ ) ],

and for a vector γ_n(τ) ∈ R^m we define the quantity

g_n := g_n(γ_n) := sup_{τ∈T} ‖µ(γ_n(τ), τ)‖ = sup_{τ∈T} ‖ E[ Z_i( F_{Y|X}(Z_i⊤ γ_n(τ) | X) − τ ) ] ‖.   (1.2)

Let S^{m−1} := {u ∈ R^m : ‖u‖ = 1} denote the unit sphere in R^m. For a set I ⊂ {1, ..., m}, define

R_I^m := { u = (u_1, ..., u_m)⊤ ∈ R^m : u_j ≠ 0 if and only if j ∈ I },
S_I^{m−1} := { u = (u_1, ..., u_m)⊤ ∈ S^{m−1} : u_j ≠ 0 if and only if j ∈ I }.

Finally, consider the class of functions

Λ_c^η(X, T) := { f_τ ∈ C^{⌊η⌋}(X) : τ ∈ T, sup_{|j|≤⌊η⌋} sup_{x, τ∈T} |D^j f_τ(x)| ≤ c, sup_{|j|=⌊η⌋} sup_{x≠y, τ∈T} |D^j f_τ(x) − D^j f_τ(y)| / ‖x − y‖^{η−⌊η⌋} ≤ c },   (1.3)

where ⌊η⌋ denotes the integer part of a real number η, and |j| = j_1 + ... + j_d for a d-tuple j = (j_1, ..., j_d). For simplicity, we sometimes write sup_τ (inf_τ) and sup_x (inf_x) instead of sup_{τ∈T} (inf_{τ∈T}) and sup_{x∈X} (inf_{x∈X}) throughout the paper.
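The check function ρ_τ and its link to quantiles can be verified numerically: minimizing q ↦ Σ_i ρ_τ(Y_i − q) over a grid recovers the τ-th sample quantile. A minimal numpy sketch (the sample size and grid are hypothetical):

```python
import numpy as np

def rho(u, tau):
    # Check function rho_tau(u) = (tau - 1{u <= 0}) * u, nonnegative for every u.
    return (tau - (u <= 0)) * u

# The tau-th sample quantile minimizes q -> sum_i rho_tau(Y_i - q):
rng = np.random.default_rng(1)
y = rng.normal(size=1000)
tau = 0.8
grid = np.linspace(-3.0, 3.0, 2001)
loss = np.array([rho(y - q, tau).sum() for q in grid])
q_hat = grid[np.argmin(loss)]
```

The grid minimizer lands (up to grid resolution) in the interval of minimizers between the relevant order statistics, which is exactly the set of empirical τ-quantiles.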

2 Weak Convergence Results

In this section, we first present weak convergence results for the QRP in a general series approximation framework that covers linear models with increasing dimension, nonparametric models and partial linear models. We then demonstrate that using polynomial splines with local support, such as B-splines, significantly weakens the sufficient conditions required in this general framework.

2.1 General Series Estimator

Consider a general series estimator Q̂(x; τ) := γ̂(τ)⊤ Z(x), where, for each fixed τ,

γ̂(τ) := argmin_{γ∈R^m} Σ_{i=1}^n ρ_τ( Y_i − γ⊤ Z_i ),   (2.1)

and m is allowed to grow as n → ∞. We assume the following conditions:

(A1) ‖Z_i‖ ≤ ξ_m = O(n^b) almost surely with b > 0, and 1/M ≤ λ_min(E[ZZ⊤]) ≤ λ_max(E[ZZ⊤]) ≤ M holds uniformly in n for some fixed constant M > 0.

(A2) The conditional distribution F_{Y|X}(y|x) is twice differentiable with respect to y; denote the corresponding derivatives by f_{Y|X}(y|x) and f′_{Y|X}(y|x). Assume that f̄ := sup_{y,x} |f_{Y|X}(y|x)| < ∞ and f̄′ := sup_{y,x} |f′_{Y|X}(y|x)| < ∞ uniformly in n.

(A3) Uniformly in n, there exists a constant f_min > 0 such that

inf_{τ∈T} inf_x f_{Y|X}(Q(x; τ)|x) ≥ f_min.

In the above assumptions, uniformity in n is necessary as we consider triangular arrays. Assumptions (A2) and (A3) are fairly standard in the quantile regression literature, so we only comment on Assumption (A1). In linear models where Z(X) = X and m = d, it holds that ξ_m ≲ √m if each component of X is bounded almost surely. If the B-splines B̃(x) defined in Section 4.3 of Schumaker (1981) are adopted, one needs to use the re-scaled version B(x) = m^{1/2} B̃(x) as Z(x) so that (A1) holds (cf. Lemma 6.2 of Zhou et al. (1998)); in this case we have ξ_m ≍ √m. In addition, Assumptions (A1) and (A3) imply that for any sequence of R^m-valued (non-random) functions γ_n(τ) satisfying sup_{τ∈T} sup_x |γ_n(τ)⊤ Z(x) − Q(x; τ)| = o(1), the smallest eigenvalues of the matrices

J̃_m(τ) := E[ ZZ⊤ f_{Y|X}(γ_n(τ)⊤ Z | X) ],   J_m(τ) := E[ ZZ⊤ f_{Y|X}(Q(X; τ) | X) ]

are bounded away from zero uniformly in τ for all n. Define, for any u ∈ R^m,

χ_{γ_n}(u, Z) := sup_{τ∈T} | u⊤ J_m(τ)^{-1} E[ Z_i( 1{Y_i ≤ Q(X_i; τ)} − 1{Y_i ≤ Z_i⊤ γ_n(τ)} ) ] |.
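For concreteness, the minimization in (2.1) can be solved exactly as a linear program by splitting each residual into positive and negative parts. A minimal scipy sketch (the data-generating model and all dimensions are hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

def series_quantile_fit(Z, y, tau):
    # Solve (2.1) exactly: write y_i - Z_i'gamma = u_i - v_i with u, v >= 0 and
    # minimize tau*sum(u) + (1 - tau)*sum(v) over the variables (gamma, u, v).
    n, m = Z.shape
    cost = np.concatenate([np.zeros(m), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([Z, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * m + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:m]

# Median regression in a toy linear model (all coefficients hypothetical).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, n)
Z = np.column_stack([np.ones(n), x])         # Z(x) = (1, x)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)  # true conditional median: 1 + 2x
gamma_hat = series_quantile_fit(Z, y, tau=0.5)
```

Since the objective tau*sum(u) + (1 - tau)*sum(v) equals Σ_i ρ_τ(Y_i − γ⊤Z_i) at the optimum, the LP solution coincides with the series quantile regression estimator.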


We are now ready to state our weak convergence result for the QRP based on general series estimators.

Theorem 2.1. Suppose (A1)-(A3) hold and m³ ξ_m² (log n)³ = o(n). Let γ_n(·) : T → R^m be a sequence of functions. Let g_n := g_n(γ_n(τ)) = o(n^{-1/2}) (see (1.2)) and c_n = c_n(γ_n) := sup_{x, τ∈T} |Q(x; τ) − Z(x)⊤ γ_n(τ)|, and assume that m c_n log n = o(1). Then for any u_n ∈ R^m satisfying χ_{γ_n}(u_n, Z) = o(‖u_n‖ n^{-1/2}) and γ̂(τ) defined in (2.1),

u_n⊤( γ̂(τ) − γ_n(τ) ) = −(1/n) u_n⊤ J_m(τ)^{-1} Σ_{i=1}^n Z_i( 1{Y_i ≤ Q(X_i; τ)} − τ ) + o_P( ‖u_n‖ / √n ),   (2.2)

where the remainder term is uniform in τ ∈ T. In addition, if the limit

H(τ_1, τ_2; u_n) := lim_{n→∞} ‖u_n‖^{-2} u_n⊤ J_m(τ_1)^{-1} E[ZZ⊤] J_m(τ_2)^{-1} u_n (τ_1 ∧ τ_2 − τ_1 τ_2)   (2.3)

exists for any τ_1, τ_2 ∈ T, then

(√n / ‖u_n‖) u_n⊤( γ̂(·) − γ_n(·) ) ⇝ G(·) in ℓ∞(T),   (2.4)

where G(·) is a centered Gaussian process with the covariance function H defined in (2.3). In particular, there exists a version of G with almost surely continuous sample paths.

The proof of Theorem 2.1 is given in Section A.1. Theorem 2.1 holds under very general conditions. For transformations Z that have a specific local structure, the assumptions on m, ξ_m can be relaxed considerably; details are provided in Section 2.2. We close this subsection by illustrating Theorem 2.1 in linear quantile regression models with increasing dimension, in which g_n, c_n and χ_{γ_n}(u, Z) are trivially zero. As far as we are aware, this is the first quantile process result for linear models with increasing dimension.

Corollary 2.2 (Linear models with increasing dimension). Suppose (A1)-(A3) hold with Z(X) = X and Q(x; τ) = x⊤ γ_n(τ) for any x and τ ∈ T. Assume that m³ ξ_m² (log n)³ = o(n). In addition, if u_n ∈ R^m is such that the limit

H_1(τ_1, τ_2; u_n) := lim_{n→∞} ‖u_n‖^{-2} u_n⊤ J_m(τ_1)^{-1} E[XX⊤] J_m(τ_2)^{-1} u_n (τ_1 ∧ τ_2 − τ_1 τ_2)   (2.5)

exists for any τ_1, τ_2 ∈ T, then (2.4) holds with the covariance function H_1 defined in (2.5). Moreover, setting u_n = x_0, we have for any fixed x_0,

(√n / ‖x_0‖)( Q̂(x_0; ·) − Q(x_0; ·) ) ⇝ G(x_0; ·) in ℓ∞(T),

where G(x_0; ·) is a centered Gaussian process with covariance function H_1(τ_1, τ_2; x_0). In particular, there exists a version of G with almost surely continuous sample paths.


2.2 Local Basis Series Estimator

In this section, we assume that Z(·) corresponds to a basis expansion with "local" support. Our main motivation for considering this setting is that it allows us to considerably weaken the assumptions on m, ξ_m made in the previous section. To distinguish such basis functions from the general setting of the previous section, we use the notation B instead of Z. Let β̂(τ) be defined as

β̂(τ) := argmin_{b∈R^m} Σ_i ρ_τ( Y_i − b⊤ B_i ),   (2.6)

where B_i = B(X_i). The notion of "local support" is made precise in the following sense.

(L) For each x, the basis vector B(x) has zeroes in all but at most r consecutive entries, where r is fixed. Moreover, sup_{x,τ} E[ |B(x)⊤ J̃_m(τ)^{-1} B(X)| ] = O(1).

The above assumption holds for certain choices of basis functions, e.g., univariate B-splines.

Example 2.3. Let X = [0, 1], assume that (A2)-(A3) hold and that the density of X over X is uniformly bounded away from zero and infinity. Consider the space of polynomial splines of order q with k uniformly spaced knots 0 = t_1 < ... < t_k = 1 in the interval [0, 1]. This space can be represented through linear combinations of the basis functions B_1, ..., B_{k−q−1}, with each basis function B_j having support contained in the interval [t_j, t_{j+q+1}). Let B(x) := (B_1(x), ..., B_{k−q−1}(x))⊤. Then the first part of assumption (L) holds with r = q. The condition sup_{x,τ} E[ |B(x)⊤ J̃_m(τ)^{-1} B(X)| ] = O(1) is verified in the Appendix; see Section A.2.

Condition (L) ensures that the matrix J̃_m(τ) has a band structure, which is useful for bounding the off-diagonal entries of J̃_m(τ)^{-1}; see Lemma 6.3 in Zhou et al. (1998) for additional details. Throughout this section, consider the specific centering

β_n(τ) := argmin_{b∈R^m} E[ (B⊤ b − Q(X; τ))² f_{Y|X}(Q(X; τ)|X) ],   (2.7)

where B = B(X). For basis functions satisfying condition (L), the assumptions in Theorem 2.1 can be replaced by the following weaker version.

(B1) Assume that ξ_m⁴ (log n)⁶ = o(n) and let c̃_n := sup_{x,τ} |β_n(τ)⊤ B(x) − Q(x; τ)| with c̃_n² = o(n^{-1/2}), where ‖B(X_i)‖ ≤ ξ_m almost surely.

Note that the condition ξ_m⁴ (log n)⁶ = o(n) in (B1) is less restrictive than the condition m³ ξ_m² (log n)³ = o(n) required in Theorem 2.1. For instance, in the setting of Example 2.3, where ξ_m ≍ √m, we only require m² (log n)⁶ = o(n), which is weaker than m⁴ (log n)³ = o(n) in Theorem 2.1. This improvement is made possible by the local structure of the spline basis. In the setting of Example 2.3, bounds on c̃_n can be obtained provided that the function x ↦ Q(x; τ) is smooth for all τ ∈ T. For instance, assuming that Q(·; ·) ∈ Λ_c^η(X, T) with X = [0, 1] and integer η, Remark B.1 shows that c̃_n = O(m^{−⌊η⌋}). Thus the condition c̃_n² = o(n^{-1/2}) holds provided that m^{−2⌊η⌋} = o(n^{-1/2}). Since for splines we have ξ_m ∼ m^{1/2}, this is compatible with the restrictions imposed in assumption (B1) provided that η ≥ 1.
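The rescaled spline basis of Example 2.3 can be constructed directly; a minimal scipy sketch (the spline degree and knot count are hypothetical) illustrating the local-support property behind condition (L) and the rescaling B(x) = m^{1/2} B̃(x):

```python
import numpy as np
from scipy.interpolate import BSpline

q = 3                                   # spline degree (cubic)
grid_knots = np.linspace(0.0, 1.0, 12)  # uniformly spaced knots on [0, 1]
knots = np.r_[np.zeros(q), grid_knots, np.ones(q)]  # clamped knot vector
m = len(knots) - q - 1                  # number of basis functions
coef = np.eye(m)
x_grid = np.linspace(0.0, 1.0 - 1e-9, 200)
# Raw B-spline design matrix: row i holds B~(x_i).
B_raw = np.array([[BSpline(knots, coef[j], q)(x) for j in range(m)] for x in x_grid])
B = np.sqrt(m) * B_raw                  # rescaled basis, cf. (A1)
# Local support: at most q + 1 consecutive entries of B~(x) are nonzero.
nonzeros = (np.abs(B_raw) > 1e-12).sum(axis=1)
```

Since clamped B-splines form a partition of unity, ‖B̃(x)‖ is bounded above and below uniformly in x, so the rescaled basis satisfies ‖B(x)‖ ≍ √m as stated after (A1).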

Theorem 2.4 (Nonparametric models with local basis functions). Assume that conditions (A1)-(A3) hold with Z = B, that (L) holds for B, and that (B1) holds for β_n(τ). Assume that the set I consists of at most L consecutive integers, where L ≥ 1 is fixed. Then for any u_n ∈ R_I^m, (2.2) holds with γ̂(τ), γ_n(τ) and Z replaced by β̂(τ), β_n(τ) and B. In addition, if the limit

H̃(τ_1, τ_2; u_n) := lim_{n→∞} ‖u_n‖^{-2} u_n⊤ J_m(τ_1)^{-1} E[BB⊤] J_m(τ_2)^{-1} u_n (τ_1 ∧ τ_2 − τ_1 τ_2)   (2.8)

exists for any τ_1, τ_2 ∈ T, then (2.4) holds with the same replacement as above, and the limit G is a centered Gaussian process with covariance function H̃ defined in (2.8). Moreover, for any x_0, let Q̂(x_0; τ) := B(x_0)⊤ β̂(τ) and assume that c̃_n = o(‖B(x_0)‖ n^{-1/2}). Then

(√n / ‖B(x_0)‖)( Q̂(x_0; ·) − Q(x_0; ·) ) ⇝ G(x_0; ·) in ℓ∞(T),   (2.9)

where G(x_0; ·) is a centered Gaussian process with covariance function H̃(τ_1, τ_2; B(x_0)). In particular, there exists a version of G with almost surely continuous sample paths.

The proof of Theorem 2.4 is given in Section A.2.

Remark 2.5. The proof of Theorem 2.4 and the related Bahadur representation result in Section 5.2 crucially rely on the fact that the elements of J̃_m(τ)^{-1} decay exponentially fast in their distance from the diagonal, i.e. a bound of the form |(J̃_m(τ)^{-1})_{i,j}| ≤ C γ^{|i−j|} for some γ < 1. Assumption (L) provides one way to guarantee such a result. We conjecture that similar results can be obtained for more classes of basis functions, as long as the entries of J̃_m(τ)^{-1} decay exponentially fast in their distance from suitable subsets of indices (j, j′) ∈ {1, ..., m}². This kind of result can be obtained for matrices J̃_m(τ) with specific sparsity patterns; see for instance Demko et al. (1984). In particular, we conjecture that such arguments can be applied to tensor-product B-splines; see Example 1 in Section 5 of Demko et al. (1984). A detailed investigation of this interesting topic is left to future research.

We conclude this section by discussing a special case where the limit in (2.9) can be characterized more explicitly.

Remark 2.6. The covariance function H̃ can be explicitly characterized for u_n = B(x_0) and univariate B-splines B(x) on x ∈ [0, 1] of order r with equidistant knots 0 = t_1 < ... < t_k = 1. Assume, in addition to (A3), that

sup_{t∈X, τ∈T} | ∂_x f_{Y|X}(Q(x; τ)|x) |_{x=t} | < C, where C > 0 is a constant,   (2.10)

and that the density f_X(x) of X is bounded above. Then, under c̃_n = o(‖B(x_0)‖ n^{-1/2}), (2.9) in Theorem 2.4 can be rewritten as

√( n / ( B(x_0)⊤ E[BB⊤]^{-1} B(x_0) ) ) ( B(x_0)⊤ β̂(·) − Q(x_0; ·) ) ⇝ G(·; x_0) in ℓ∞(T),   (2.11)

where the Gaussian process G(·; x_0) is defined by the covariance function

H̃(τ_1, τ_2; x_0) = (τ_1 ∧ τ_2 − τ_1 τ_2) / ( f_{Y|X}(Q(x_0; τ_1)|x_0) f_{Y|X}(Q(x_0; τ_2)|x_0) ).

Although we only show the univariate case here, the same arguments are expected to hold for tensor-product B-splines by the same reasoning. See Section A.2 for a proof of this remark.

3 Joint Weak Convergence for Partial Linear Models

In this section, we consider partial linear models of the form

Q(X; τ) = V⊤ α(τ) + h(W; τ),   (3.1)

where X = (V⊤, W⊤)⊤ ∈ R^{k+k′} and k, k′ ∈ N are fixed. An interesting joint weak convergence result is obtained for (α̂(τ), ĥ(w_0; τ)) at any fixed w_0. More precisely, α̂(τ) and ĥ(w_0; τ) (after proper scaling and centering) are proved to be asymptotically independent at any fixed τ ∈ T. Therefore, the "joint asymptotics phenomenon" first discovered in Cheng and Shang (2015) persists even for the non-smooth quantile loss function. Such a theoretical result is practically useful for joint inference on α(τ) and h(W; τ); see Cheng and Shang (2015).

Expanding w ↦ h(w; τ) in terms of basis vectors w ↦ Z̃(w), we can approximate (3.1) through the series expansion Z(x)⊤ γ_n†(τ) by setting Z(x) = (v⊤, Z̃(w)⊤)⊤. In this section, Z̃ : R^{k′} → R^m is regarded as a general basis expansion that need not satisfy the local support assumptions of the previous section. Estimation is performed in the following form:

γ̂†(τ) = (α̂(τ)⊤, β̂†(τ)⊤)⊤ := argmin_{a∈R^k, b∈R^m} Σ_i ρ_τ( Y_i − a⊤ V_i − b⊤ Z̃(W_i) ).   (3.2)

For a theoretical analysis of γ̂†, define the population coefficients γ_n†(τ) := (α(τ)⊤, β_n†(τ)⊤)⊤, where

β_n†(τ) := argmin_{β∈R^m} E[ f_{Y|X}(Q(X; τ)|X)( h(W; τ) − β⊤ Z̃(W) )² ],   (3.3)

similar to (2.7); see Remark 3.3 for additional explanations. To state our main result, we need to define the class of functions

U_τ := { w ↦ g(w) : g measurable and E[ g²(W) f_{Y|X}(Q(X; τ)|X) ] < ∞, w ∈ R^{k′} }.

For V ∈ R^k, define, for j = 1, ..., k,

h_{VW,j}(·; τ) := argmin_{g∈U_τ} E[ (V_j − g(W))² f_{Y|X}(Q(X; τ)|X) ],   (3.4)

where V_j denotes the j-th entry of the random vector V. By the definition of h_{VW}, we have for all τ ∈ T and g ∈ U_τ,

E[ (V − h_{VW}(W; τ)) g(W) f_{Y|X}(Q(X; τ)|X) ] = 0_k.   (3.5)

The matrix A(τ) is defined as the coefficient matrix of the best series approximation of h_{VW}(W; τ):

A(τ) := argmin_A E[ f_{Y|X}(Q(X; τ)|X) ‖ h_{VW}(W; τ) − A Z̃(W) ‖² ].   (3.6)
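The joint estimator (3.2) is the same linear program as in the general series case, applied to the stacked design (V, Z̃(W)). A minimal sketch (the toy model, the crude indicator "basis" for h, and all constants are hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

def plm_quantile_fit(V, Bw, y, tau):
    # Joint minimization (3.2) over (a, b), cast as a linear program exactly as
    # in the nonparametric case, with the stacked design [V, Z~(W)].
    Z = np.hstack([V, Bw])
    n, p = Z.shape
    cost = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([Z, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:V.shape[1]], res.x[V.shape[1]:p]

# Toy model Q(X; 0.5) = 1.5*V + sin(2*pi*W); the coefficient and h are hypothetical.
rng = np.random.default_rng(2)
n = 400
V = rng.normal(size=(n, 1))
W = rng.uniform(size=n)
y = 1.5 * V[:, 0] + np.sin(2 * np.pi * W) + rng.normal(0.0, 0.3, n)
# Piecewise-constant "series" for h: indicators of 10 equal bins of [0, 1].
Bw = np.eye(10)[np.minimum((W * 10).astype(int), 9)]
alpha_hat, beta_hat = plm_quantile_fit(V, Bw, y, tau=0.5)
```

Here alpha_hat estimates the parametric part α(0.5) and beta_hat the bin-wise values of h(·; 0.5); with a richer spline basis for Z̃ the approximation error of the nonparametric part shrinks as discussed around (C1).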

The following two assumptions are needed for our main results.

(C1) Define c_n† := sup_{τ,w} | Z̃(w)⊤ β_n†(τ) − h(w; τ) | and assume that

ξ_m c_n† = o(1);   (3.7)

sup_{τ∈T} E[ f_{Y|X}(Q(X; τ)|X) ‖ h_{VW}(W; τ) − A(τ) Z̃(W) ‖² ] = O(λ_n²) with ξ_m λ_n² = o(1).   (3.8)

(C2) max_{j≤k} |V_j| < C almost surely for some constant C > 0.

Bounds on c_n† can be obtained under various assumptions on the basis expansion and the smoothness of the function w ↦ h(w; τ). Assume for instance that W = [0, 1]^{k′}, that h(·; ·) ∈ Λ_c^η(W, T), and that Z̃ corresponds to a tensor-product B-spline basis of order q on W with m^{1/k′} equidistant knots in each coordinate. Assuming that (V, W) has a density f_{V,W} such that 0 < inf_{v,w} f_{V,W}(v, w) ≤ sup_{v,w} f_{V,W}(v, w) < ∞ and that q > η, we show in Remark B.1 that c_n† = O(m^{−⌊η⌋/k′}). Assumption (3.8) essentially states that h_{VW} can be approximated sufficiently well by a series estimator. This assumption is necessary to ensure that α(τ) is estimable at a parametric rate without undersmoothing when estimating h(·; τ). In general, (3.8) is a non-trivial high-level assumption. It can be verified under smoothness conditions on the joint density of (X, Y) by applying arguments similar to those in Appendix S.1 of Cheng et al. (2014). In addition to (C1)-(C2), we need the following condition.

(B1′) Assume that

( m ξ_m^{2/3} log n / n )^{3/4} + c_n^{†2} ξ_m = o(n^{-1/2}).

Moreover, assume that c_n† λ_n = o(n^{-1/2}) and m c_n† log n = o(1).

We are now ready to state the main result of this section.

Theorem 3.1. Let Conditions (A1)-(A3) hold with Z = (V⊤, Z̃(W)⊤)⊤, and let (B1′) and (C1)-(C2) hold for β_n†(τ) defined in (3.3). For any sequence w_n ∈ R^m with E[ |w_n⊤ M_2(τ)^{-1} Z̃(W)| ] = o(‖w_n‖), where M_2(τ) := E[ Z̃(W) Z̃(W)⊤ f_{Y|X}(Q(X; τ)|X) ], if

Γ_22(τ_1, τ_2) = lim_{n→∞} ‖w_n‖^{-2} w_n⊤ M_2(τ_1)^{-1} E[ Z̃(W) Z̃(W)⊤ ] M_2(τ_2)^{-1} w_n   (3.9)

exists, then

( √n ( α̂(·) − α(·) ), (√n / ‖w_n‖) w_n⊤( β̂†(·) − β_n†(·) ) )⊤ ⇝ (G_1(·), ..., G_k(·), G_h(·))⊤ in (ℓ∞(T))^{k+1},   (3.10)

and the multivariate process (G_1(·), ..., G_k(·), G_h(·)) has the covariance function

Γ(τ_1, τ_2; w_n) = (τ_1 ∧ τ_2 − τ_1 τ_2) [ Γ_11(τ_1, τ_2)   0_k ;
                                          0_k⊤            Γ_22(τ_1, τ_2) ]   (3.11)

with

Γ_11(τ_1, τ_2) = M_{1,h}(τ_1)^{-1} E[ (V − h_{VW}(W; τ_1))(V − h_{VW}(W; τ_2))⊤ ] M_{1,h}(τ_2)^{-1},   (3.12)

where M_{1,h}(τ) = E[ (V − h_{VW}(W; τ))(V − h_{VW}(W; τ))⊤ f_{Y|X}(Q(X; τ)|X) ]. In addition, at any fixed w_0 ∈ R^{k′}, let w_n = Z̃(w_0) satisfy the above conditions, ĥ(w_0; τ) = Z̃(w_0)⊤ β̂†(τ), and c_n† = o(‖Z̃(w_0)‖ n^{-1/2}); then

( √n ( α̂(·) − α(·) ), (√n / ‖Z̃(w_0)‖)( ĥ(w_0; ·) − h(w_0; ·) ) )⊤ ⇝ (G_1(·), ..., G_k(·), G_h(w_0; ·))⊤ in (ℓ∞(T))^{k+1},   (3.13)

where (G_1(·), ..., G_k(·), G_h(w_0; ·)) are centered Gaussian processes with joint covariance function Γ_{w_0}(τ_1, τ_2) of the form (3.11), where Γ_22(τ_1, τ_2) is defined through the limit in (3.9) with w_n replaced by Z̃(w_0). In particular, there exists a version of G_h(w_0; ·) with almost surely continuous sample paths.

The proof of Theorem 3.1 is presented in Section A.3. The invertibility of the matrices M_{1,h}(τ) and M_2(τ) is discussed in Remark 5.5. In general, α̂(τ) is not semiparametrically efficient, as its covariance matrix τ(1 − τ)Γ_11 does not achieve the efficiency bound given in Section 5 of Lee (2003). The joint asymptotic process convergence result (in ℓ∞(T)) presented in Theorem 3.1 is new in the quantile regression literature. The block structure of the covariance function Γ defined in (3.11) implies that α̂(τ) and ĥ(w_0; τ) are asymptotically independent for any fixed τ. This effect was recently discovered by Cheng and Shang (2015) in the case of mean regression and termed the "joint asymptotics phenomenon".

Remark 3.2. We point out that E[ |w_n⊤ M_2(τ)^{-1} Z̃(W)| ] = o(‖w_n‖) is a crucial sufficient condition for asymptotic independence between the parametric and nonparametric parts; we conjecture that this condition is also necessary. The condition holds, for example, for w_n = Z̃(w_0) or w_n = ∂_{w_j} Z̃(w_0) at a fixed w_0, j = 1, ..., k′, where Z̃(w) is a vector of B-spline basis functions. However, this condition may not hold for other estimators. Consider for instance the case W = [0, 1], B-splines of order zero, and the vector w_n = λ ∫ Z̃(w) dw for some λ > 0. In this case ‖w_n‖ ≍ 1, and one can show that E[ |w_n⊤ M_2(τ)^{-1} Z̃(W)| ] ≍ 1 instead. A more detailed investigation of related questions is left to future research.

Remark 3.3. A seemingly more natural choice of centering vector, which was also considered in Belloni et al. (2011), is

γ_n*(τ) = (α_n*(τ), β_n*(τ)) := argmin_{(a,b)} E[ ρ_τ( Y − a⊤ V − b⊤ Z̃(W) ) ],

which gives g_n(γ_n*(τ)) = 0. However, a major drawback of centering with γ_n*(τ) is that in this representation it is difficult to find bounds for the difference α_n*(τ) − α(τ).

4 Applications of Weak Convergence Results

In this section, we consider applications of the process convergence results to the estimation of conditional distribution functions and the construction of non-crossing quantile curves via rearrangement operators. For the former, define the functional (see Dette and Volgushev (2008), Chernozhukov et al. (2010) or Volgushev (2013) for similar ideas)

Φ : ℓ∞((τ_L, τ_U)) → ℓ∞(R),   Φ(f)(y) := τ_L + ∫_{τ_L}^{τ_U} 1{f(τ) < y} dτ.

A simple calculation shows that

Φ(Q(x; ·))(y) = τ_L if F_{Y|X}(y|x) < τ_L;  F_{Y|X}(y|x) if τ_L ≤ F_{Y|X}(y|x) ≤ τ_U;  τ_U if F_{Y|X}(y|x) > τ_U.

The latter identity motivates the following estimator of the conditional distribution function:

F̂_{Y|X}(y|x) := τ_L + ∫_{τ_L}^{τ_U} 1{Q̂(x; τ) < y} dτ,

where Q̂(x; τ) denotes the estimator of the conditional quantile function in any of the three settings discussed in Sections 2 and 3. By following the arguments in Chernozhukov et al. (2010), one can easily show that under suitable assumptions the functional Φ is compactly differentiable (see Section A.5 for more details). Hence, the general process convergence results in Sections 2 and 3 readily yield the asymptotic properties of F̂_{Y|X}; see Theorem 4.1 at the end of this section. The second functional of interest is the monotone rearrangement operator, defined as

Ψ : ℓ∞((τ_L, τ_U)) → ℓ∞((τ_L, τ_U)),   Ψ(f)(τ) := inf{ y : Φ(f)(y) ≥ τ }.

The main motivation for considering Ψ is that the function τ ↦ Ψ(f)(τ) is non-decreasing by construction. Thus, for any initial estimator Q̂(x; ·), its rearranged version Ψ(Q̂(x; ·))(τ) is an estimator of the conditional quantile function which avoids the issue of quantile crossing. For more detailed discussions of rearrangement operators and their use in avoiding quantile crossing, we refer the interested reader to Dette and Volgushev (2008) and Chernozhukov et al. (2010).

Theorem 4.1 (Convergence of F̂_{Y|X}(y|x) and Ψ(Q̂(x; ·))). For any fixed x_0, an initial estimator Q̂(x_0; ·), and any compact sets [τ_L, τ_U] ⊂ T and Y ⊂ Y_{0,T} := {y : F_{Y|X}(y|x_0) ∈ T}, we have

a_n( F̂_{Y|X}(·|x_0) − F_{Y|X}(·|x_0) ) ⇝ −f_{Y|X}(F_{Y|X}(·|x_0)|x_0) G( x_0; F_{Y|X}(·|x_0) ) in ℓ∞(Y),
a_n( Ψ(Q̂(x_0; ·))(·) − Q(x_0; ·) ) ⇝ G(x_0; ·) in ℓ∞((τ_L, τ_U)),

where Q̂(x_0; ·), the normalization a_n, and the process G(x_0; ·) are as follows:

1. (Linear model with increasing dimension) Suppose Z(X) = X, Q̂(x_0; ·) = γ̂(·)⊤ x_0, and the conditions of Corollary 2.2 hold. In this case, a_n = √n / ‖x_0‖, and G(x_0; ·) is a centered Gaussian process with covariance function H_1(τ_1, τ_2; x_0) defined in (2.5).

2. (Nonparametric model) Suppose Q̂(x_0; ·) = β̂(·)⊤ B(x_0) and the conditions of Theorem 2.4 hold. In this case, a_n = √n / ‖B(x_0)‖, and G(x_0; ·) is a centered Gaussian process with covariance function H̃(τ_1, τ_2; B(x_0)) defined in (2.8).

3. (Partial linear model) Suppose x_0⊤ = (v_0⊤, w_0⊤), Q̂(x_0; ·) = γ̂†(·)⊤ (v_0⊤, Z̃(w_0)⊤)⊤, and the conditions of Theorem 3.1 hold. In this case, a_n = √n / ‖Z̃(w_0)‖, and G(x_0; ·) is a centered Gaussian process with covariance function Γ_22(τ_1, τ_2; Z̃(w_0)) defined in (3.9).

The proof of Theorem 4.1 is a direct consequence of the functional delta method; details can be found in Section A.5.
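On a finite grid of quantile levels, Φ amounts to counting and Ψ to taking a generalized inverse; a minimal numpy sketch (grid sizes and the toy crossing curve are hypothetical):

```python
import numpy as np

tau_L, tau_U = 0.1, 0.9
taus = np.linspace(tau_L, tau_U, 401)
dtau = taus[1] - taus[0]
# A toy initial quantile-curve estimate with artificial crossing (non-monotone in tau).
rng = np.random.default_rng(3)
q_hat = np.quantile(rng.normal(size=200), taus) + 0.2 * np.sin(40 * taus)

def F_hat(y):
    # Phi applied to q_hat: tau_L plus the measure of {tau : q_hat(tau) < y}.
    return tau_L + dtau * np.sum(q_hat < y)

y_grid = np.linspace(q_hat.min() - 1.0, q_hat.max() + 1.0, 2000)
F_vals = np.array([F_hat(y) for y in y_grid])
# Psi: the generalized inverse of F_hat yields a non-crossing quantile curve.
psi = np.array([y_grid[np.argmax(F_vals >= t)] for t in taus])
```

By construction F_hat is non-decreasing in y, so its generalized inverse psi is a monotone quantile curve even when the initial estimate q_hat crosses itself.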


5 Bahadur Representations

In this section, we provide Bahadur representations for the estimators discussed in Sections 2 and 3. Sections 5.1 and 5.2 state Bahadur representations for general series estimators and for the more specific choice of local basis functions, respectively; the latter representation is developed with an improved remainder term. Section 5.3 contains a special case of the general theorem in Section 5.1 that is particularly tailored to partial linear models. The remainders in these representations are shown to admit exponential bounds on their tail probabilities (uniformly over T).

5.1 A Fundamental Bahadur Representation

Our first result gives a Bahadur representation for γ̂(τ) − γ_n(τ) for centering functions γ_n satisfying certain conditions; recall the definition of γ̂(τ) in (2.1). This kind of representation for quantile regression with an increasing number of covariates was previously established in Theorem 2 of Belloni et al. (2011). Compared to their results, the Bahadur representation given below has several advantages. First, we allow for a more general centering, which is helpful for the analysis of partial linear models (see Sections 3 and S.1.2). Second, we provide exponential tail bounds on the remainder terms, which are much more explicit and sharper than those in Belloni et al. (2011).

Theorem 5.1. Suppose Conditions (A1)-(A3) hold and that additionally m ξ_m² log n = o(n). Then, for any γ_n(·) satisfying g_n(γ_n) = o(ξ_m^{-1}) and c_n(γ_n) = o(1), we have

γ̂(τ) − γ_n(τ) = −(1/n) J_m(τ)^{-1} Σ_{i=1}^n ψ(Y_i, Z_i; γ_n(τ), τ) + r_{n,1}(τ) + r_{n,2}(τ) + r_{n,3}(τ) + r_{n,4}(τ).

The remainder terms r_{n,j} can be bounded as follows:

sup_{τ∈T} ‖r_{n,1}(τ)‖ ≤ ( 1 / inf_{τ∈T} λ_min(J_m(τ)) ) · ( m ξ_m / n )   a.s.   (5.1)

Moreover, we have for any $\kappa_n \lesssim n/\xi_m^2$, all sufficiently large $n$, and a constant $C$ independent of $n$, exponential tail bounds of the form $P\big(\sup_{\tau\in\mathcal T}\|r_{n,j}(\tau)\| \le C\,r_j(\kappa_n)\big) \ge 1-e^{-\kappa_n}$ for $j=2,3,4$.

In the local basis setting of Theorem 5.2, the representation takes the form
$$u_n^\top\big(\widehat\beta(\tau)-\beta_n(\tau)\big) = -\frac{1}{n}\,u_n^\top \widetilde J_m(\tau)^{-1}\sum_{i=1}^n B_i\big(1\{Y_i\le \beta_n(\tau)^\top B_i\}-\tau\big) + \sum_{k=1}^4 r_{n,k}(\tau, u_n), \tag{5.5}$$

where the remainder terms $r_{n,k}(\tau,u_n)$ can be bounded as follows:
$$\sup_{u_n\in S_I^{m-1}}\sup_{\tau\in\mathcal T}|r_{n,1}(\tau,u_n)| \lesssim \frac{\xi_m\log n}{n} \quad a.s., \tag{5.6}$$
$$\sup_{u_n\in S_I^{m-1}}\sup_{\tau\in\mathcal T}|r_{n,4}(\tau,u_n)| \le \frac{1}{n} + \frac{1}{2}\,\bar f'\,\widetilde c_n^{\,2}\sup_{u_n\in S_I^{m-1}}\widetilde{\mathcal E}(u_n, B) \quad a.s., \tag{5.7}$$
where $\widetilde{\mathcal E}(u_n, B) := \sup_\tau E\big|u_n^\top \widetilde J_m(\tau)^{-1} B\big|$. Moreover, we have for any $\kappa_n \lesssim n/\xi_m^2$, all sufficiently large $n$, and a constant $C$ independent of $n$,
$$P\Big(\sup_{u_n\in S_I^{m-1}}\sup_{\tau\in\mathcal T}|r_{n,j}(\tau,u_n)| \le C\,\widetilde C_j(\kappa_n)\Big) \ge 1 - n^2 e^{-\kappa_n}, \quad j = 2, 3,$$
where $\widetilde C_2(\kappa_n)$ involves the quantity
$$\sup_\tau\Big(u_n^\top \widetilde J_m(\tau)^{-1} E[BB^\top]\,\widetilde J_m(\tau)^{-1} u_n\Big)^{1/2} = O(1).$$

Remark 5.3. Theorem 5.2 enables us to study several quantities associated with the quantile function $Q(x;\tau)$. For instance, consider the spline setting of Example 2.3. Setting $u_n = B(x)/\|B(x)\|$ in Theorem 5.2 yields a representation for $\widehat Q(x;\tau)$, while setting $u_n = B'(x)/\|B'(x)\|$ yields a representation for the estimator of the derivative $\partial_x Q(x;\tau)$. Uniformity in $x$ follows once we observe that for different values of $x$, the support of the vector $B(x)$ is always consecutive, so that there are at most $n^l$, $l > 0$, different sets $I$ that we need to consider.
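The consecutive-support structure of $B(x)$ invoked here is easy to visualize with the simplest spline basis. The following self-contained sketch (our illustration, not the paper's code) builds degree-one B-splines (hat functions) and verifies that at any point at most two consecutive basis functions are nonzero, so the empirical Gram matrix is banded:

```python
import numpy as np

def hat_basis(x, knots):
    """Evaluate the degree-1 B-spline (hat function) basis at points x.

    Basis function j is supported on (knots[j-1], knots[j+1]), so at any x
    at most two *consecutive* entries of B(x) are nonzero -- the structure
    behind the index sets I used in Theorem 5.2 and Remark 5.3.
    """
    m = len(knots)
    B = np.zeros((len(x), m))
    for j in range(m):
        left = knots[j - 1] if j > 0 else knots[0] - 1.0
        right = knots[j + 1] if j < m - 1 else knots[-1] + 1.0
        up = (x - left) / (knots[j] - left)      # rising edge of the hat
        down = (right - x) / (right - knots[j])  # falling edge of the hat
        B[:, j] = np.clip(np.minimum(up, down), 0.0, None)
    return B

knots = np.linspace(0.0, 1.0, 11)
x = np.random.default_rng(1).uniform(size=200)
B = hat_basis(x, knots)
n_nonzero = (B > 0).sum(axis=1)   # at most 2 nonzero entries per row
gram = B.T @ B / len(x)           # banded: zero whenever |j - j'| >= 2
```

Higher-degree splines behave the same way with a wider band ($r$ consecutive nonzero entries instead of two).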


5.3 Bahadur Representation for Partial Linear Models

In this section, we provide a joint Bahadur representation for the parametric and nonparametric parts of this model. Recall the partial linear model $Q(X;\tau) = h(W;\tau) + \alpha(\tau)^\top V$.

Theorem 5.4. Let Conditions (A1)-(A3), (C1)-(C2) hold with $Z = (V^\top, \widetilde Z(W)^\top)^\top$ and assume $m\xi_m^2(\log n)^2 = o(n)$. Then
$$\begin{pmatrix}\widehat\alpha(\tau) - \alpha(\tau)\\ \widehat\beta^\dagger(\tau) - \beta_n^\dagger(\tau)\end{pmatrix} = -J_m(\tau)^{-1}\,\frac{1}{n}\sum_{i=1}^n Z_i\big(1\{Y_i \le \gamma_n^\dagger(\tau)^\top Z_i\} - \tau\big) + \sum_{j=1}^4 r_{n,j}(\tau),$$
where the remainder terms $r_{n,j}$ satisfy the bounds stated in Theorem 5.1 with $g_n = \xi_m c_n^{\dagger 2}$. Additionally, the matrix $J_m(\tau)^{-1}$ can be represented as
$$J_m(\tau)^{-1} = \begin{pmatrix} M_1(\tau)^{-1} & -M_1(\tau)^{-1}A(\tau)\\ -A(\tau)^\top M_1(\tau)^{-1} & M_2(\tau)^{-1} + A(\tau)^\top M_1(\tau)^{-1}A(\tau)\end{pmatrix}, \tag{5.10}$$
where
$$M_1(\tau) := E\big[f_{Y|X}(Q(X;\tau)|X)\,(V - A(\tau)\widetilde Z(W))(V - A(\tau)\widetilde Z(W))^\top\big], \qquad M_2(\tau) := E\big[\widetilde Z(W)\widetilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)\big],$$

and A(τ ) is defined in (3.6).

See Section S.1.3 for the proof of Theorem 5.4.

Remark 5.5. We discuss the positive definiteness of $M_1(\tau)$ and $M_2(\tau)$. Under Condition (A1) with $Z = (V^\top, \widetilde Z(W)^\top)^\top$, we have, for all $n$,
$$1/M \le \inf_{\tau\in\mathcal T}\lambda_{\min}(M_1(\tau)) \le \sup_{\tau\in\mathcal T}\lambda_{\max}(M_1(\tau)) \le M; \tag{5.11}$$
$$1/M \le \inf_{\tau\in\mathcal T}\lambda_{\min}(M_2(\tau)) \le \sup_{\tau\in\mathcal T}\lambda_{\max}(M_2(\tau)) \le M. \tag{5.12}$$
To see this, observe that $M_1(\tau) = (I_k \mid -A(\tau))\,J_m(\tau)\,(I_k \mid -A(\tau))^\top$, where $I_k$ is the $k$-dimensional identity matrix, $[A \mid B]$ denotes the block matrix with $A$ in the left block and $B$ in the right block, and
$$J_m(\tau) = \begin{pmatrix} M_1(\tau) + A(\tau)M_2(\tau)A(\tau)^\top & A(\tau)M_2(\tau)\\ M_2(\tau)A(\tau)^\top & M_2(\tau)\end{pmatrix},$$
whose form follows from the definition and condition (3.5) (see the proof of Theorem 5.4 for more details). Thus, for an arbitrary nonzero vector $a\in\mathbb R^k$, by Condition (A1),
$$0 < 1/M \le a^\top M_1(\tau)a = a^\top(I_k \mid -A(\tau))\,J_m(\tau)\,(I_k \mid -A(\tau))^\top a \le M < \infty$$
by the strict positive definiteness of $J_m(\tau)$ for some $M > 0$. The strict positive definiteness of $M_2(\tau)$ follows directly from the observation that
$$0 < 1/M \le b^\top M_2(\tau)b = (0_k^\top, b^\top)\,J_m(\tau)\,(0_k^\top, b^\top)^\top \le M < \infty$$
for all nonzero $b\in\mathbb R^m$ and some $M > 0$.

APPENDIX

This appendix gives technical details of the results shown in the main text. Appendix A contains all the proofs of the weak convergence results in Theorems 2.1, 2.4 and 3.1. Appendix B discusses basis approximation errors in full technical detail.

Additional Notation. For a function $x\mapsto f(x)$, define $\mathbb G_n(f) := n^{1/2}\int f(x)\,(dP_n(x) - dP(x))$ and $\|f\|_{L_p(P)} = \big(\int |f(x)|^p\,dP(x)\big)^{1/p}$ for $0 < p < \infty$. For a class of functions $\mathcal G$, let $\|P_n - P\|_{\mathcal G} := \sup_{f\in\mathcal G}|P_n f - Pf|$. For any $\epsilon > 0$, the covering number $N(\epsilon, \mathcal G, L_p)$ is the minimal number of balls of radius $\epsilon$ (under the $L_p$-norm) needed to cover $\mathcal G$. The bracketing number $N_{[\,]}(\epsilon, \mathcal G, L_p)$ is the minimal number of $\epsilon$-brackets needed to cover $\mathcal G$; an $\epsilon$-bracket is a pair of functions within $\epsilon$ of each other, $\|u - l\|_{L_p} < \epsilon$. Throughout the proofs, $C, C_1, C_2$, etc. denote constants that do not depend on $n$ but may take different values in different lines.

APPENDIX A: Proofs for Process Convergence

A.1 Proof of Theorem 2.1

A.1.1 Proof of (2.2)

Under Conditions (A1)-(A3) and those of Theorem 2.1, it follows from Theorem 5.1 applied with $\kappa_n = c\log n$ for a suitable constant $c$ (note that the conditions $g_n = o(\xi_m^{-1})$ and $c_n = o(1)$ in Theorem 5.1 follow under the assumptions of Theorem 2.1) that
$$u_n^\top\big(\widehat\gamma(\tau) - \gamma(\tau)\big) + \frac{1}{n}\sum_{i=1}^n u_n^\top J_m(\tau)^{-1} Z_i\big(1\{Y_i \le Q(X_i;\tau)\} - \tau\big) = I(\tau) + o_P(\|u_n\| n^{-1/2}),$$
where the remainder term is uniform in $\mathcal T$ and
$$I(\tau) := -\frac{1}{n}\,u_n^\top J_m(\tau)^{-1}\sum_{i=1}^n Z_i\big(1\{Y_i \le Z_i^\top\gamma_n(\tau)\} - 1\{Y_i \le Q(X_i;\tau)\}\big).$$

Under the assumption $\chi_{\gamma_n}(u_n, Z) = o(\|u_n\| n^{-1/2})$, we have $\sup_{\tau\in\mathcal T}|E[I(\tau)]| = o(\|u_n\| n^{-1/2})$, and moreover
$$\sup_{\tau\in\mathcal T}\big|I(\tau) - E[I(\tau)]\big| \le \|u_n\|\Big[\inf_{\tau\in\mathcal T}\lambda_{\min}(J_m(\tau))\Big]^{-1}\|P_n - P\|_{\mathcal G_5},$$
where the class of functions $\mathcal G_5$ is defined as
$$\mathcal G_5(Z, \gamma_n) = \Big\{(X, Y)\mapsto a^\top Z(X)\,1\{\|Z(X)\|\le\xi_m\}\big(1\{Y\le Z(X)^\top\gamma_n(\tau)\} - 1\{Y\le Q(X;\tau)\}\big) \ \Big|\ \tau\in\mathcal T,\ a\in S^{m-1}\Big\}.$$
It remains to bound $\|P_n - P\|_{\mathcal G_5}$. For any $f\in\mathcal G_5$ and a sufficiently large $C$, we obtain
$$|f| \le |a^\top Z| \le \xi_m, \qquad \|f\|_{L_2(P)}^2 \le 2\bar f c_n\,\lambda_{\max}(E[ZZ^\top]) \le C c_n.$$


By Lemma 21 of Belloni et al. (2011), the VC index of $\mathcal G_5$ is of order $O(m)$. Therefore, we obtain from (S.2.2)
$$E\|P_n - P\|_{\mathcal G_5} \le \widetilde C\Big[\Big(\frac{m c_n}{n}\log\frac{\xi_m}{\sqrt{c_n}}\Big)^{1/2} + \frac{m\xi_m}{n}\log\frac{\xi_m}{\sqrt{c_n}}\Big]. \tag{A.1}$$

For any $\kappa_n > 0$, let
$$r_{N,3}^0(\kappa_n) = C\Big[\Big(\frac{m c_n}{n}\log\frac{\xi_m}{\sqrt{c_n}}\Big)^{1/2} + \frac{m\xi_m}{n}\log\frac{\xi_m}{\sqrt{c_n}} + \Big(\frac{c_n\kappa_n}{n}\Big)^{1/2} + \frac{\xi_m\kappa_n}{n}\Big]$$
for a sufficiently large constant $C > 0$. We obtain from (S.2.3) combined with (A.1)
$$P\Big(\sup_{\tau\in\mathcal T}|I(\tau)| \ge \|u_n\|\,r_{N,3}^0(\kappa_n) + \sup_{\tau\in\mathcal T}\big|E[I(\tau)]\big|\Big) \le e^{-\kappa_n}.$$
Finally, note that under the conditions $m c_n\log n = o(1)$ and $m^3\xi_m^2(\log n)^3 = o(n)$ we have $r_{N,3}^0(\log n) = o(n^{-1/2})$. This completes the proof of (2.2).
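The engine of this proof is that the empirical-process norm $\|P_n - P\|_{\mathcal G}$ over a VC-type class shrinks at (nearly) the root-$n$ rate. A quick numerical sanity check of this phenomenon for the simplest such class, $\{1\{U\le\tau\} - \tau : \tau\in\mathcal T\}$ (our illustration only; the grid and sample sizes are ad hoc):

```python
import numpy as np

rng = np.random.default_rng(3)
taus = np.linspace(0.1, 0.9, 81)

def sup_deviation(n):
    """sup over tau of |P_n 1{U <= tau} - tau| for U ~ Uniform(0, 1)."""
    U = rng.uniform(size=n)
    emp = np.mean(U[:, None] <= taus[None, :], axis=0)
    return np.max(np.abs(emp - taus))

d_small, d_large = sup_deviation(1_000), sup_deviation(100_000)
# the sup deviation shrinks roughly like n^{-1/2}
```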

A.1.2 Proof of (2.4)

Throughout this subsection, assume without loss of generality that $\|u_n\| = 1$. It suffices to prove finite-dimensional convergence and asymptotic equicontinuity. Asymptotic equicontinuity follows from (A.33). The existence of a version of the limit with continuous sample paths is a consequence of Theorem 1.5.7 and Addendum 1.5.8 in van der Vaart and Wellner (1996). So, we only need to focus on finite-dimensional convergence. Let
$$\mathbb G_n(\tau) := \frac{1}{\sqrt n}\,u_n^\top J_m(\tau)^{-1}\sum_{i=1}^n Z_i\big(1\{Y_i \le Q(X_i;\tau)\} - \tau\big),$$

and let $G$ be the Gaussian process defined in (2.4). By the Cramér-Wold theorem, the goal is to show that for arbitrary $\{\tau_1, \dots, \tau_L\}\subset\mathcal T$ and $(\lambda_1, \dots, \lambda_L)\in\mathbb R^L$,
$$\sum_{l=1}^L \lambda_l\,\mathbb G_n(\tau_l) \xrightarrow{d} \sum_{l=1}^L \lambda_l\,G(\tau_l).$$
Define the triangular array $V_{n,i}(\tau) := n^{-1/2}\,u_n^\top J_m(\tau)^{-1} Z_i\big(1\{Y_i\le Q(X_i;\tau)\} - \tau\big)$. Then for all $\tau\in\mathcal T$ we have $E[V_{n,i}(\tau)] = 0$, $|V_{n,i}(\tau)| \lesssim n^{-1/2}\xi_m$, and $\mathrm{var}(V_{n,i}(\tau)) = n^{-1}u_n^\top J_m(\tau)^{-1}E[Z_iZ_i^\top]J_m(\tau)^{-1}u_n\,\tau(1-\tau) < \infty$ by Conditions (A1)-(A3). We can write $\mathbb G_n(\tau) = \sum_{i=1}^n V_{n,i}(\tau)$ and $\sum_{l=1}^L\lambda_l\mathbb G_n(\tau_l) = \sum_{i=1}^n\sum_{l=1}^L\lambda_l V_{n,i}(\tau_l)$, with $\mathrm{var}\big(\sum_{i=1}^n\sum_{l=1}^L\lambda_l V_{n,i}(\tau_l)\big) =: \sigma_{n,L}^2$, where
$$\sigma_{n,L}^2 = \sum_{l,l'=1}^L \lambda_l\lambda_{l'}\,u_n^\top J_m(\tau_l)^{-1}E[Z_iZ_i^\top]J_m(\tau_{l'})^{-1}u_n\,(\tau_l\wedge\tau_{l'} - \tau_l\tau_{l'}).$$

If $\lim_{n\to\infty}\sigma_{n,L}^2 = \sum_{l,l'=1}^L\lambda_l\lambda_{l'}H(\tau_l,\tau_{l'};u_n) = \mathrm{var}\big(\sum_{l=1}^L\lambda_l G(\tau_l)\big) = 0$, then by Markov's inequality $\sum_{i=1}^n\sum_{l=1}^L\lambda_l V_{n,i}(\tau_l)\to 0$ in probability, which coincides with the distribution of $\sum_{l=1}^L\lambda_l G(\tau_l)$, a single point mass at $0$. Next, consider the case $\sigma_{n,L}^2\to\sigma_L^2 > 0$. For sufficiently large $n$ and arbitrary $v > 0$, Markov's inequality implies
$$\sigma_{n,L}^{-2}\sum_{i=1}^n E\Big[\Big(\sum_{l=1}^L\lambda_l V_{n,i}(\tau_l)\Big)^2 1\Big\{\Big|\sum_{l=1}^L\lambda_l V_{n,i}(\tau_l)\Big| > v\Big\}\Big] \lesssim \xi_m^2 n^{-1}\sigma_{n,L}^{-2}\sum_{i=1}^n E\Big[1\Big\{\Big|\sum_{l=1}^L\lambda_l V_{n,i}(\tau_l)\Big| > v\Big\}\Big]$$
$$\lesssim \xi_m^2 n^{-1}\sigma_{n,L}^{-2} v^{-2}\sum_{l,l'=1}^L\lambda_l\lambda_{l'}\,u_n^\top J_m(\tau_l)^{-1}E[Z_iZ_i^\top]J_m(\tau_{l'})^{-1}u_n\,(\tau_l\wedge\tau_{l'} - \tau_l\tau_{l'}) = o(1),$$
since $\xi_m^2 n^{-1} = o(1)$ by the assumption $m\xi_m^2\log n = o(n)$. Hence the Lindeberg condition is verified, and the finite-dimensional convergence follows from the Cramér-Wold device. This completes the proof.

A.2 Proofs of Theorem 2.4, Example 2.3 and Remark 2.6

We begin by introducing some notation and useful preliminary results. For a vector $u = (u_1, \dots, u_m)^\top$ and a set $I\subset\{1, \dots, m\}$, let $u^{(I)}\in\mathbb R^m$ denote the vector with entries $u_i$ for $i\in I$ and zeros otherwise. For a vector $a\in\mathbb R^m$ with at most $\|a\|_0$ non-zero consecutive entries, let $k_a$ denote the position of its first non-zero entry, and define
$$I(a, D) := \{i : |i - k_a| \le \|a\|_0 + D\}, \tag{A.2}$$
$$I'(a, D) := \{1\le j\le m : \exists\, i\in I(a, D) \text{ such that } |j - i| \le \|a\|_0\}. \tag{A.3}$$

Lemma A.1. Under (L), for an arbitrary vector $a\in\mathbb R^m$ with at most $\|a\|_0$ non-zero consecutive entries, we have for a constant $\gamma\in(0,1)$ independent of $n, \tau$,
$$\big|(a^\top\widetilde J_m^{-1}(\tau))_j\big| \le C_1\|a\|_\infty\sum_{q=k_a}^{k_a+\|a\|_0}\gamma^{|q-j|}, \tag{A.4}$$
$$\big\|a^\top\widetilde J_m^{-1}(\tau) - \big(a^\top\widetilde J_m^{-1}(\tau)\big)^{(I(a,D))}\big\| \lesssim \|a\|_\infty\|a\|_0\,\gamma^D, \tag{A.5}$$
$$\big\|a^\top J_m^{-1}(\tau) - \big(a^\top J_m^{-1}(\tau)\big)^{(I(a,D))}\big\| \lesssim \|a\|_\infty\|a\|_0\,\gamma^D. \tag{A.6}$$

Proof of Lemma A.1. Under (L), the matrix $Z(x)Z(x)^\top$ has no non-zero entries that are further than $r$ away from the diagonal. Thus $\widetilde J_m(\tau)$ is a band matrix with band width no larger than $2r$. Applying similar arguments as in the proof of Lemma 6.3 in Zhou et al. (1998), we find that under (L) the entries of $\widetilde J_m^{-1}(\tau)$ satisfy
$$\sup_{\tau, m}\big|(\widetilde J_m^{-1}(\tau))_{j,k}\big| \le C_1\gamma^{|j-k|}$$
for some $\gamma\in(0,1)$ and a constant $C_1$, where both $\gamma$ and $C_1$ do not depend on $n, \tau$. It follows that
$$\big|(a^\top\widetilde J_m^{-1}(\tau))_j\big| \le C_1\|a\|_\infty\sum_{q=k_a}^{k_a+\|a\|_0}\gamma^{|q-j|},$$
and thus (A.4) is established. For a proof of (A.5), note that by (A.4) we have
$$\big|(a^\top\widetilde J_m^{-1}(\tau))_j\big| \le C_1\|a\|_\infty\sum_{q=k_a}^{k_a+\|a\|_0}\gamma^{|q-j|} \le C_1\|a\|_\infty\|a\|_0\,\gamma^{|k_a - j| - \|a\|_0}.$$
By the definition of $I(a, D)$ we find
$$\big\|a^\top\widetilde J_m^{-1}(\tau) - \big(a^\top\widetilde J_m^{-1}(\tau)\big)^{(I(a,D))}\big\| \le C\|a\|_\infty\|a\|_0\,\gamma^D$$

for a constant $C$ independent of $n$. The proof of (A.6) is similar to that of (A.5).

Proof of Theorem 2.4. By Theorem 5.2 and Condition (B1), we first obtain
$$u_n^\top\big(\widehat\beta(\tau) - \beta(\tau)\big) = -\frac{1}{n}\,u_n^\top\widetilde J_m(\tau)^{-1}\sum_{i=1}^n B_i\big(1\{Y_i\le B_i^\top\beta_n(\tau)\} - \tau\big) + o_P(\|u_n\| n^{-1/2}). \tag{A.7}$$
Let $\widetilde U_{1,n}(\tau) := n^{-1}u_n^\top\widetilde J_m(\tau)^{-1}\sum_{i=1}^n B_i\big(1\{Y_i\le B_i^\top\beta_n(\tau)\} - \tau\big)$. We claim that
$$\widetilde U_{1,n}(\tau) - U_n(\tau) = o_P(\|u_n\| n^{-1/2}), \tag{A.8}$$
where $U_n(\tau) := n^{-1}u_n^\top J_m(\tau)^{-1}\sum_{i=1}^n B_i\big(1\{Y_i\le Q(X_i;\tau)\} - \tau\big)$. Given (A.8), the process convergence of $u_n^\top\big(\widehat\beta(\tau) - \beta(\tau)\big)$ and continuity of the sample paths of the limiting process follow from process convergence of $U_n(\tau)$, which can be shown via exactly the same steps as in Section A.1.2, replacing $Z$ by $B$, given Assumptions (A1)-(A3).

To show (A.8), we proceed in two steps. Given $u_n\in S_I^{m-1}$, let $U_{1,n}(\tau) := n^{-1}u_n^\top\widetilde J_m(\tau)^{-1}\sum_{i=1}^n B_i\big(1\{Y_i\le Q(X_i;\tau)\} - \tau\big)$.

Step 1: $\sup_{\tau\in\mathcal T}\big|\widetilde U_{1,n}(\tau) - U_{1,n}(\tau)\big| = o_P(n^{-1/2})$ for all $u_n\in S_I^{m-1}$.

Let $\widetilde I_0(\tau) := E\big[\widetilde U_{1,n}(\tau) - U_{1,n}(\tau)\big]$ and observe the decomposition
$$\widetilde U_{1,n}(\tau) - U_{1,n}(\tau) = \Big[u_n^\top\widetilde J_m^{-1}(\tau) - \big(u_n^\top\widetilde J_m^{-1}(\tau)\big)^{(I(u_n,D))}\Big](P_n - P)\Big[B\big(1\{Y\le B^\top\beta_n(\tau)\} - 1\{Y\le Q(X;\tau)\}\big)\Big]$$
$$\quad + (P_n - P)\Big[\big(u_n^\top\widetilde J_m^{-1}(\tau)\big)^{(I(u_n,D))}B\big(1\{Y\le B^\top\beta_n(\tau)\} - 1\{Y\le Q(X;\tau)\}\big)\Big] + \widetilde I_0(\tau) =: \widetilde I_1(\tau) + \widetilde I_2(\tau) + \widetilde I_0(\tau).$$
For $\sup_{\tau\in\mathcal T}|\widetilde I_0(\tau)|$, by the construction of $\beta_n(\tau)$ in (2.7),
$$\sup_{\tau\in\mathcal T}\big|\widetilde I_0(\tau)\big| \le \sup_{\tau\in\mathcal T}\Big|u_n^\top\widetilde J_m^{-1}(\tau)E\Big[B_i\big(1\{Y_i\le Q(X_i;\tau)\} - 1\{Y_i\le B_i^\top\beta_n(\tau)\}\big)\Big]\Big| \le \widetilde c_n^{\,2}\bar f'\sup_{u\in S^{m-1}}E\big|u^\top\widetilde J_m^{-1}(\tau)B\big| \le \widetilde c_n^{\,2}\bar f'\sup_{u\in S^{m-1}}\Big(u^\top\widetilde J_m^{-1}(\tau)E[BB^\top]\widetilde J_m^{-1}(\tau)u\Big)^{1/2} = o(n^{-1/2}),$$
where the final rate follows from Assumption (A2) and $\widetilde c_n^{\,2} = o(n^{-1/2})$ in (B1). By (A.5) in Lemma A.1 with $D = c\log n$ for large enough $c > 0$, we have almost surely
$$\sup_{\tau\in\mathcal T}\big|\widetilde I_1(\tau)\big| \le \sup_{\tau\in\mathcal T}\big\|u_n^\top\widetilde J_m^{-1}(\tau) - \big(u_n^\top\widetilde J_m^{-1}(\tau)\big)^{(I(u_n,D))}\big\|\,\xi_m \le \|u_n\|_\infty\|u_n\|_0\,n^{c\log\gamma}\,\xi_m = o(n^{-1/2}).$$

For bounding $\sup_{\tau\in\mathcal T}|\widetilde I_2(\tau)|$, observe that
$$\frac{1}{n}\sum_{i=1}^n\big(u_n^\top\widetilde J_m(\tau)^{-1}\big)^{(I(u_n,D))}B_i\big(1\{Y_i\le B_i^\top\beta_n(\tau)\} - 1\{Y_i\le Q(X_i;\tau)\}\big)$$
$$= \frac{1}{n}\sum_{\{i:\,\mathrm{supp}(B_i)\cap I(u_n,D)\ne\emptyset\}}\big(u_n^\top\widetilde J_m(\tau)^{-1}\big)^{(I(u_n,D))}B_i\big(1\{Y_i\le B_i^\top\beta_n(\tau)\} - 1\{Y_i\le Q(X_i;\tau)\}\big)$$
$$= \frac{1}{n}\sum_{\{i:\,B_i^{(I'(u_n,D)^c)} = 0\}}\big(u_n^\top\widetilde J_m(\tau)^{-1}\big)^{(I(u_n,D))}B_i\big(1\{Y_i\le B_i^\top\beta_n(\tau)\} - 1\{Y_i\le Q(X_i;\tau)\}\big)$$
$$= \frac{1}{n}\sum_{i=1}^n\big(u_n^\top\widetilde J_m(\tau)^{-1}\big)^{(I(u_n,D))}B_i^{(I(u_n,D))}\Big(1\big\{Y_i\le \big(B_i^{(I'(u_n,D))}\big)^\top\beta_n(\tau)^{(I'(u_n,D))}\big\} - 1\{Y_i\le Q(X_i;\tau)\}\Big), \tag{A.9}$$
where the last equality follows from the fact that $B_i^{(I(u_n,D))}\ne 0$ can only happen for $i\in\{i : B_i^{(I'(u_n,D)^c)} = 0\}$, because $B$ can only be nonzero in $r$ consecutive entries by Assumption (L); here $I'(u_n,D)^c = \{1, \dots, m\}\setminus I'(u_n,D)$ denotes the complement of $I'(u_n,D)$ in $\{1, \dots, m\}$. By restricting ourselves to the set $\{i : B_i^{(I'(u_n,D)^c)} = 0\}$, it is enough to consider the coefficients $\beta_n(\tau)^{(I'(u_n,D))}$ in the last line of (A.9). Hence,
$$\sup_{\tau\in\mathcal T}\big|\widetilde I_2(\tau)\big| \le \big\|P_n - P\big\|_{\widetilde{\mathcal G}_5(I(u_n,D),\,I'(u_n,D))},$$
where for any two index sets $I_1$ and $I_1'$,
$$\widetilde{\mathcal G}_5(I_1, I_1') = \Big\{(X, Y)\mapsto a^\top B(X)^{(I_1)}\big(1\{Y\le B(X)^\top b^{(I_1')}\} - 1\{Y\le Q(X;\tau)\}\big) \ \Big|\ \tau\in\mathcal T,\ b\in\mathbb R^m,\ a\in S^{m-1}\Big\}.$$
With the choice $D = c\log n$, the cardinality of both $I(u_n,D)$ and $I'(u_n,D)$ is of order $O(\log n)$. Hence, the VC index of $\widetilde{\mathcal G}_5(I(u_n,D), I'(u_n,D))$ is bounded by $O(\log n)$. Note that for any $f\in\widetilde{\mathcal G}_5(I(u_n,D), I'(u_n,D))$, $|f|\lesssim\xi_m$ and $\|f\|_{L_2(P)}\lesssim\widetilde c_n$. Applying (S.2.3) yields
$$P\bigg(\sup_{\tau\in\mathcal T}\big|\widetilde I_2(\tau)\big| \le C\Big[\Big(\frac{\widetilde c_n(\log n)^2}{n}\Big)^{1/2} + \frac{\xi_m(\log n)^2}{n} + \Big(\frac{\widetilde c_n\kappa_n}{n}\Big)^{1/2} + \frac{\kappa_n\xi_m}{n}\Big]\bigg) \ge 1 - e^{-\kappa_n}.$$
Taking $\kappa_n = C\log n$, the conditions $\widetilde c_n^{\,2} = o(n^{-1/2})$ and $\xi_m^4(\log n)^6 = o(n)$ in (B1) imply that $\sup_{\tau\in\mathcal T}|\widetilde I_2(\tau)| = o_P(n^{-1/2})$.

Step 2: $\sup_{\tau\in\mathcal T}\big|U_{1,n}(\tau) - U_n(\tau)\big| = o_P(n^{-1/2})$ for all $u_n\in S_I^{m-1}$.

Observe that
$$U_{1,n}(\tau) - U_n(\tau) = \Big[u_n^\top\big(\widetilde J_m(\tau)^{-1} - J_m(\tau)^{-1}\big) - \big(u_n^\top(\widetilde J_m(\tau)^{-1} - J_m(\tau)^{-1})\big)^{(I(u_n,D))}\Big]\frac{1}{n}\sum_{i=1}^n B_i\big(1\{Y_i\le Q(X_i;\tau)\} - \tau\big)$$
$$\quad + \big(u_n^\top(\widetilde J_m(\tau)^{-1} - J_m(\tau)^{-1})\big)^{(I(u_n,D))}\frac{1}{n}\sum_{i=1}^n B_i\big(1\{Y_i\le Q(X_i;\tau)\} - \tau\big) =: \widetilde I_3(\tau) + \widetilde I_4(\tau).$$
Applying (A.5) and (A.6) in Lemma A.1 with $D = c\log n$, where $c > 0$ is chosen sufficiently large, we have almost surely
$$\sup_{\tau\in\mathcal T}\big|\widetilde I_3(\tau)\big| \le \Big[\sup_{\tau\in\mathcal T}\big\|u_n^\top\widetilde J_m^{-1}(\tau) - \big(u_n^\top\widetilde J_m^{-1}(\tau)\big)^{(I(u_n,D))}\big\| + \sup_{\tau\in\mathcal T}\big\|u_n^\top J_m^{-1}(\tau) - \big(u_n^\top J_m^{-1}(\tau)\big)^{(I(u_n,D))}\big\|\Big]\xi_m \le 2\|u_n\|_\infty\|u_n\|_0\,n^{c\log\gamma}\,\xi_m = o(n^{-1/2}). \tag{A.10}$$
It remains to bound $\sup_{\tau\in\mathcal T}\big|\widetilde I_4(\tau)\big|$.

n

 (I(un ,D)) 1 X > e un Jm (τ )−1 − Jm (τ )−1 Bi (1{Yi ≤ Q(Xi ; τ )} − τ ) Ie4 (τ ) = n i=1

n  (I(u ,D)) 1X > e un Jm (τ )−1 − Jm (τ )−1 Bi n (1{Yi ≤ Q(Xi ; τ )} − τ ). = n i=1

Hence,





−1 e − Jm (τ )−1

Pn − P G0 (I(un ,D))·G4 sup e I4 (τ ) ≤ sup u> n Jm (τ )

τ ∈T

where for any I,

τ ∈T

 G0 (I) := (B, Y ) 7→ a> B(I) 1{kBk ≤ ξm } a ∈ S m−1 ,  G4 := (X, Y ) 7→ 1{Yi ≤ Q(X; τ )} − τ τ ∈ T .

The cardinality of the set I(un , c log n) is of order O(log n). Thus, the VC index for G0 (I(un , D)) is of order O(log n). The VC index of G4 is 2 (see Lemma S.2.4). By Lemma S.2.2, N (G0 (I(un , D)) · G4 , L2 (Pn ); ε) ≤



AkF kL2 (Pn ) ε

v0 (n)

,

where v0 (n) = O(log n). In addition, for any f ∈ G0 (I(un , D)) · G4 , |f | . ξm and kf kL2 (P ) = O(1) by (A1). Furthermore, by assumptions (A1)-(A2) and the definition of e cn ,

> 

un Jem (τ )−1 − Jm (τ )−1 ≤ e cn λmax (E[B(X)B(X)> ])f¯0 . e cn . (A.11) By (S.2.3), we have for some constant C > 0,  h (log n)2 1/2 ξ (log n)2  κ 1/2 κ ξ i n m n m e P sup I2 (τ ) ≤ Ce cn + + + ≥ 1 − eκn . n n n n τ ∈T 20

Taking κn = C log n, an application of (B1) completes the proof. Proof for Example 2.3. As Jem (τ ) is a band matrix, applying similar arguments as in the proof of Lemma 6.3 in Zhou et al. (1998) gives 0 −1 sup |(Jem (τ ))j,j 0 | ≤ C1 γ |j−j | ,

(A.12)

τ,m

for some γ ∈ (0, 1) and C1 > 0. Let kB(x) be the index of the first nonzero element of the vector B(x). Then by (A.12), we have kB(x) +kB(x)k0

and also

−1 (τ ))j | sup |(B(x)> Jem τ,m

−1 (τ )B(X)| sup E|B(x)> Jem τ,m

≤ C1 kB(x)k∞

0

γ |j −j| ,

j 0 =kun

≤ C1 kB(x)k∞ max E|Bl (X)| l≤m

X

m X j=1

kB(x) +kB(x)k0

X

0

γ |j −j| .

(A.13)

j 0 =kB(x)

Since kB(x)k0 is bounded by a constant, the sum in (A.13) is bounded uniformly. Moreover, in the present setting we have kB(x)k∞ = O(m1/2 ) and maxl≤m E|Bl (X)| = O(m−1/2 ). Therefore, for each m we have −1 sup E|B(x)> Jem (τ )B(X)| = O(1). τ ∈T ,x∈X

Proof of Remark 2.6. Consider the product Bj (x)Bj 0 (x) of two B-spline functions. The fact that Bj (x) is locally supported on [tj , tj+r ] implies that for all j 0 satisfying |j − j 0 | ≥ r, Bj (x)Bj 0 (x) = 0 for all x, where r ∈ N is the degree of spline. This also implies Jm (τ ) and E[BB> ] are a band matrices with each column having at most Lr := 2r +1 nonzero elements and each non-zero element is at most r entries away from the main diagonal. Recall also the fact that maxj≤m supt∈R |Bj (t)| . m1/2 (by the discussion following assumption (A1)). Define Jm,D (τ ) := Dm (τ )E[BB> ], where matrix Dm (τ ) := diag(fY |X (Q(tj ; τ )|tj ), j = 1, ..., m), and Rm (τ ) := Jm (τ ) − Jm,D (τ ). Both Jm,D (τ ) and Rm (τ ) have the same band structure as Jm (τ ). For arbitrary j, j 0 = 1, ..., m, τ ∈ T and a universal constant C > 0, |(Rm (τ ))j,j 0 |    = E Bj (X)Bj 0 (X) fY |X (Q(X; τ )|X) − fY |X (Q(tj ; τ )|tj )  Z 1  r 2 fY |X (Q(x; τ )|x) − fY |X (Q(tj ; τ )|tj ) fX (x)dx ≤ 2 max sup |Bj (t)| 1 |x − tj | ≤ C j≤m t∈R m 0  Z 1  r ≤ 2Cm 1 |x − tj | ≤ C |x − tj |dx m 0 = O(m−1 ),

(A.14)

21

where the second inequality is an application of the upper bound of maxj≤m supt∈R |Bj (t)| . m1/2 and the local support property of Bj ; the third inequality follows by the assumption (2.10) and bounded fX (x). This shows that maxj,j 0 =1,...,m supτ ∈T |(Rm (τ ))j,j 0 | = O(m−1 ). Now we show a stronger result that supτ ∈T kRm (τ )k = O(m−1/2 ) for later use. Let v = (v1 , ..., vm ). Denote kj the index with the first nonzero entry in the jth column of Rm (τ ). By the band structure of Rm (τ ), 2 m  kj +L X Xr −1 vi (Rm (τ ))i,j sup kRm (τ )k2 = sup sup kRm (τ )vk22 = sup sup τ ∈T

τ ∈T v∈S m−1 j=1

τ ∈T v∈S m−1

. sup max |(Rm (τ ))j,j 0 | m = O(m 0 2

τ ∈T j,j

−1

i=kj

),

(A.15)

where the last equality follows by (A.14). Note that from assumptions (A1)-(A3) that fmin M −1 < λmin (Jm,D (τ )) ≤ λmax (Jm,D (τ )) < f¯M

(A.16)

uniformly in τ ∈ T , where the constant M > 0 is defined as in Assumption (A1). Using (A.16), assumptions (A1)-(A3) and (A.15), −1 −1 −1 −1 kJm (τ ) − Jm,D (τ )k ≤ kJm,D (τ )kkJm,D (τ ) − Jm (τ )kkJm (τ )k . sup kRm (τ )k = O(m−1/2 ) τ ∈T

uniformly in τ ∈ T . Without loss of generality, from now on we drop the term τ1 ∧ τ2 − τ1 τ2 out of our discussion e 1 , τ2 ; un ) defined in (2.8). From (A1) and focus on the matrix part in the covariance function H(τ we have kE[BB> ]k < M for some constant M > 0 so for any τ1 , τ2 ∈ T , −1 > −1 > −1 > −1 kun k−2 u> J (τ )E[BB ]J (τ )u − u J (τ ) E[BB ]J (τ ) u 1 2 n 1 2 n m,D m,D n m m n





. sup Rm (τ ) sup E[BB> ]Jm,D (τ )−1 + sup Rm (τ ) sup E[BB> ]Jm (τ )−1 τ ∈T

−1/2

= O(m

τ ∈T

τ ∈T

τ ∈T

).

(A.17)

Moreover, note that −1 > −1 > −1 > −1 −1 u> n Jm,D (τ1 ) E[BB ]Jm,D (τ2 ) un = un Dm (τ1 ) E[BB ] Dm (τ2 ) un

(A.18)

If un = B(x), observe that for l = 1, ..., m, as suggested by the local support property, we only need to focus on the index l satisfying |x − tl | ≤ Cr/m, for a universal constant C > 0. We have (B(x)> Dm (τ )−1 )l = Bl (x)fY |X (Q(tl ; τ )|tl )−1 = Bl (x)fY |X (Q(x; τ )|x)−1 + R0 (tl ),

(A.19)

−2 where by assumption (2.10), |R0 (tl )| ≤ maxj≤m supt∈R |Bj (t)|Cfmin |x − tl | = O(m−1/2 ). Therefore, > −1 −1 > the sparse vector B(x) Dm (τ ) = fY |X (Q(x; τ )|x) B(x) + aB(x) , where aB(x) ∈ Rm is a vector with the same support as B(x) (only r < ∞ nonzero components) and with nonzero components of order O(m−1/2 ). Hence, kaB(x) k = O(m−1/2 ). Continued from (A.18), for any x ∈ [0, 1], B(x)> E[BB> ]−1 B(x) −2 > −1 > −1 −1 kB(x)k B(x) Dm (τ1 ) E[BB ] Dm (τ2 ) B(x) − fY |X (Q(x; τ1 )|x)fY |X (Q(x; τ2 )|x)

−1



≤ B(x) aB(x) sup E[BB> ]−1 Dm (τ )−1 + B(x)k−1 kaB(x)

E[BB> ]−1 /fmin

= O(m

−1

τ ∈T

).

22

We observe that B(x)> E[BB> ]−1 B(x) does not depend on τ1 and τ2 and can be treated as a scaling factor and shifted out of the covariance function as (2.11). Therefore, we finish the proof.

A.3

Proof of Theorem 3.1

Observe that b j (·) − αj (·) = e> α γ † (·) − γn† (·)) j (b

where ej denotes the j-th unit vector in Rm+k for j = 1, .., m + k, and  > wn> βb† (τ ) − βn† (τ ) = (0> γ † (·) − γn† (·)). k , wn )(b

> β † (τ ). The following results will be established at the end of the proof. e Let h†n (w, τ ) = Z(w) n   > −1 > † sup E e J (τ ) Z 1{Y ≤ Q(X; τ )} − 1{Y ≤ α(τ ) V + h (W, τ )} = o(n−1/2 ), j m n τ ∈T ,j=1,...,k

(A.20)   > −1 > † −1/2 sup (0> ). k , wn /kwn k)Jm (τ ) E Z(1{Y ≤ Q(X; τ )} − 1{Y ≤ α(τ ) V + hn (W, τ )}) = o(n

τ ∈T

(A.21)

From Theorem 5.4, we obtain that under Condition (B1’) −1 † −1/2 e> γ † (τ ) − γn† (τ )) = −n−1/2 e> ), j = 1, ..., k; (A.22) j (b j Jm (τ ) Gn (ψ(·; γn (τ ), τ )) + oP (n

> > −1 † −1/2 (0> γ † (τ ) − γn† (τ )) = −n−1/2 (0> ) k , wn )(b k , wn )Jm (τ ) Gn (ψ(·; γn (τ ), τ )) + oP (n

(A.23)

uniformly in τ ∈ T . Equation (A.20) implies that for j = 1, ..., k −1 † > −1 −1/2 e> ) = o(n−1/2 ). j Jm (τ ) E[ψ(Yi , Zi ; γn (τ ), τ )] = ej Jm (τ ) E[1{Yi ≤ Q(Xi ; τ )} − τ ] + o(n

Following similar arguments as given in the proof of (2.2) in Section A.1.1, (A.20) and (A.22) imply that e> γ † (τ ) j (b



γn† (τ ))

=

−1 −n−1 e> j Jm (τ )

n X i=1

Zi (1{Yi ≤ Q(Xi ; τ )} − τ ) + oP (n−1/2 ), j = 1, ..., k.

uniformly in τ ∈ T . Similarly, by (A.21) and (A.23) we have −1 > −1 kwn k−1 wn> (βb† (τ )−βn† (τ )) = −n−1 (0> k , kwn k wn )Jm (τ )

n X i=1

Zi (1{Yi ≤ Q(Xi ; τ )}−τ )+oP (n−1/2 ).

Thus, the claim will follow once we prove

G n (·) := (Gn,1 (·), ..., Gn,k (·), Gn,h (·)) where −1 Gn,j (τ ) := −n−1/2 e> j Jm (τ )

n X i=1

G (·) in (`∞ (T ))k+1

Zi (1{Yi ≤ Q(Xi ; τ )} − τ ), 23

j = 1, ..., k

and > −1 Gn,h (τ ) := −kwn k−1 n−1/2 (0> k , wn )Jm (τ )

n X i=1

Zi (1{Yi ≤ Q(Xi ; τ )} − τ ).

We need to establish tightness and finite dimensional convergence. By Lemma 1.4.3 of van der Vaart and Wellner (1996), it is enough to show the tightness of Gn,j ’s and Gn,h individually. Tightness follows from asymptotic equicontinuity which can be proved by an application of Lemma A.3. More precisely, apply Lemma A.3 with un = ej to prove tightness of Gn,j (·) for j = 1, ..., k, and > > Lemma A.3 with u> n = (0k , wn ) to prove tightness of Gn,h (w0 ; ·). Continuity of the sample paths of Gn,h (w0 ; ·) follows by the same arguments as given at the beginning of Section A.1.2. Next, we prove finite-dimensional convergence. Observe the decomposition ( ! !) n e i )) X  M1 (τ )−1 (Vi − A(τ )Z(W 0 −1/2 G n (τ ) = −n 1{Yi ≤ Q(Xi ; τ )} − τ + e i) kwn k−1 w> M2 (τ )−1 Z(W ϕi (τ ) n

i=1

where

 e i )) 1{Yi ≤ Q(Xi ; τ )} − τ . ϕi (τ ) := −kwn k−1 wn> A(τ )> M1 (τ )−1 (Vi − A(τ )Z(W

By definition, we have E[ϕi (τ )] = 0 and moreover

e i ))(Vi − A(τ )Z(W e i ))> ]M1 (τ )−1 A(τ )wn . E[ϕ2i (τ )] . kwn k−2 wn> A(τ )> M1 (τ )−1 E[(Vi − A(τ )Z(W

Since fY |X (Q(Xi ; τ )|X) is bounded away from zero uniformly, it follows that

e i ))(Vi − A(τ )Z(W e i ))> ]k ≤ fmin λmax (M1 (τ )) < ∞, kE[(Vi − A(τ )Z(W

by Remark 5.5. Moreover, by Lemma A.2 proven later, kA(τ )wn k = O(1) uniformly in τ , and thus P by kwn k → ∞, supτ ∈T E[ϕ2i (τ )] = o(1). This implies that n−1/2 ni=1 ϕi (τ ) = oP (1) for every fixed τ ∈ T . Hence it suffices to prove finite dimensional convergence of ! n −1 (V − A(τ )Z(W e X  M (τ ) )) 1 i i −n−1/2 1{Yi ≤ Q(Xi ; τ )} − τ . e i) kwn k−1 w> M2 (τ )−1 Z(W n

i=1

 e i )) 1{Yi ≤ Q(Xi ; τ )}−τ ] = 0 and by assumpObserve that E[M1 (τ )−1 (hV W (Wi ; τ )−A(τ )Z(W tions (A1)-(A3), (C1) e sup E[kM1 (τ )−1 (hV W (W ; τ ) − A(τ )Z(W ))k2 ]

τ ∈T

≤ sup τ ∈T

1 e E[kfY |X (Q(X; τ )|X)(hV W (w; τ ) − A(τ )Z(W ))k2 ] = o(1). fmin λmin (M1 (τ ))

 P e i )) 1{Yi ≤ Q(Xi ; τ )} − τ = oP (1) for every Thus, n−1/2 ni=1 M1 (τ )−1 (hV W (Wi ; τ ) − A(τ )Z(W fixed τ ∈ T . So, now we only need to consider finite dimensional convergence of ! n n −1 (V − h X X  M (τ ) (W ; τ )) 1 i i VW ψi (τ ) := −n−1/2 1{Yi ≤ Q(Xi ; τ )} − τ . (A.24) e i) kwn k−1 w> M2 (τ )−1 Z(W i=1

i=1

n

24

Note that E[ψi (τ1 )ψi (τ2 )> ] = (τ1 ∧ τ2 − τ1 τ2 )

Γ11 (τ1 , τ2 ) + o(1)

Γ12 (τ1 , τ2 )

Γ12 (τ1 , τ2 )>

Γ22 (τ1 , τ2 ) + o(1)

!

,

e i )]. We shall now show where Γ12 (τ1 , τ2 ) := kwn k−1 E[M1 (τ2 )−1 (Vi − hV W (Wi ; τ1 ))wn> M2 (τ2 )−1 Z(W that Γ12 (τ1 , τ2 ) = o(1) uniformly in τ1 , τ2 . Note that from the definition of hV W (W ; τ ) in (3.4), by standard argument we can write hV W (W ; τ ) = E[f (Q(X; τ )|X)|W ]−1 E[V f (Q(X; τ )|X)|W ].

(A.25)

From (A3) we obtain E[f (Q(X; τ )|X)|W ] ≥ fmin > 0, and from (C1), (A2) it follows that |E[V (j) f (Q(X; τ )|X)|W ]| ≤ C f¯ = O(1), i.e. the components of hV W (W ; τ ) are bounded by a constant almost surely. Hence, e i )]k kΓ12 (τ1 , τ2 )k = kwn k−1 kM1 (τ1 )−1 E[(Vi − hV W (Wi ; τ ))wn> M2 (τ2 )−1 Z(W e i )]k ≤ kwn k−1 kM1 (τ1 )−1 kkE[(Vi − hV W (Wi ; τ1 ))wn> M2 (τ2 )−1 Z(W   e i )| . kwn k−1 E kVi − hV W (Wi ; τ1 )k|wn> M2 (τ2 )−1 Z(W   e i )| . kwn k−1 E |w> M2 (τ2 )−1 Z(W n

= o(1),

where the third inequality applies the lower bound for inf τ ∈T λmin (M1 (τ )) in Remark 5.5; the fourth (j) inequality follows from sup1≤j≤k,τ |V (j) | + |hV W (W ; τ )| < ∞ a.s., while the last equality follows by the assumptions of the Theorem. Now we prove the finite dimensional convergence (A.24). Taking arbitrary collections {τ1 , ..., τJ } ⊂ T , c1 , .., cJ ∈ Rk+1 , we need to show that J X

c> j G n (τj )

j=1

=

J X n X j=1 i=1

d c> j ψi (τj ) →

J X

c> j G (τj ).

j=1

P −1/2 ξ . Using the results derived Define Vi,J = Jj=1 c> m j ψi (τj ). Note that E[Vi,J ] = 0 and |Vi,J | . n above, we have var(Vi,J ) = o(n

−1

)+n

−1

J X

(τj ∧ τj 0 − τj τj 0 )c> j Γ(τj , τj 0 )cj 0 ,

j,j 0 =1

P where Γ(τj , τj 0 ) is defined as (3.11). If Jj,j 0 =1 (τj ∧ τj 0 − τj τj 0 )c> Γ(τj , τj 0 )cj 0 = 0, then the distriPJ PJ j > P P > bution of j=1 cj G (τj ) is a single point mass at 0, and j=1 cj G n (τj ) = Jj=1 ni=1 c> j ψi (τj ) converges to 0 in probability by Markov’s inequality. P Pn 2 If n−1 Jj,j 0 =1 (τj ∧τj 0 −τj τj 0 )c> j Γ(τj , τj 0 )cj 0 > 0 define sn,J = i=1 var(Vi,J ). We will now verify that the triangular array of random variables (Vi,J )i=1,...,n satisfies the Lindeberg condition. For any v > 0 and sufficiently large n, Markov’s inequality gives s−2 n,J

n X  2    2 −2 2 −2 −2 −1 2 E Vi,J I(Vi,J ≥ v) . ξm sn,J E I(V1,J ≥ v) . ξm sn,J v n sn,J , i=1

25

2 n−1 = o(1) by (B1’). Thus the Lindeberg condition holds and it follows that where ξm Pn i=1 Vi,J d → N (0, 1). sn,J

Finally, it remains to prove (A.20) and (A.21). Begin by observing that

 

e )) FY |X (Q(X; τ )|X) − FY |X (α(τ )> V + h†n (W, τ )|X)

E (V − A(τ )Z(W

 

e e . E (V − A(τ )Z(W ))fY |X (Q(Xi ; τ )|X) h(W, τ ) − βn† (τ )> Z(W )

  2 

e e ¯ † (Xi , τ )|X) h(W, τ ) − βn† (τ )> Z(W + E (V − A(τ )Z(W ))fY0 |X (Q )

 

e e . E (hV W (W, τ ) − A(τ )Z(W ))fY |X (Q(Xi ; τ )|X) h(W, τ ) − βn† (τ )> Z(W ) + c†2 n

h i

e . c†n E hV W (W, τ ) − A(τ )Z(W ) fY |X (Q(Xi ; τ )|X) + c†2 n = O(c†n λn + c†2 n ),

(A.26)

¯ † (Xi , τ ) lying on the line where the first inequality is an application of Taylor expansion, with Q segment of Q(X; τ ) and α(τ )> Vi +h†n (Wi , τ ); the second inequality is the result of the orthogonality condition (3.5), Condition (A2), Conditions (C1)-(C2); the third inequality follows from Conditions (A2), (C1), and the last line follows from condition (C1) and the H¨ older inequality. For a proof of > −1 > −1 e (A.20) observe that by (5.10) ej Jm (τ ) Z = ej M1 (τ ) (V − A(τ )Z(W )) for j = 1, ..., k. Thus we obtain from Remark 5.5  > E[ej Jm (τ )−1 Z 1{Y ≤ Q(X; τ )} − 1{Y ≤ α(τ )> V + h†n (W, τ )} ]   −1 > † e = e> M (τ ) E (V − A(τ ) Z(W )) F (Q(X; τ )|X) − F (α(τ ) V + h (W, τ )|X) 1 Y |X Y |X j n = O(c†n λn + c†2 n ),

To prove (A.21), without loss of generality, let kwn k = 1. We note that by (5.10) > −1 > > −1 > −1 e e (0> k , wn )Jm (τ ) Zi = −wn A(τ ) M1 (τ ) (Vi − A(τ )Z(Wi )) + wn M2 (τ ) Z(Wi ).

From (A.26), (5.11) in Remark 5.5 and Lemma A.2 we obtain   e i )) 1{Yi ≤ Q(Xi ; τ )} − 1{Yi ≤ α(τ )> Vi + h† (Wi , τ )} E wn> A(τ )> M1 (τ )−1 (Vi − A(τ )Z(W n = O(c†n λn + c†2 n ).

(A.27)

Moreover,   e i ) 1{Yi ≤ Q(Xi ; τ )} − 1{Yi ≤ α(τ )> Vi + h† (Wi , τ )} E wn> M2 (τ )−1 Z(W n   e i )fY |X (Q(Xi ; τ )|X) h(W, τ ) − β † (τ )> Z(W e . wn> M2 (τ )−1 E Z(W ) n  2  e i )f 0 (Q e ¯ † (Xi , τ )|X) h(W, τ ) − βn† (τ )> Z(W + wn> M2 (τ )−1 E Z(W ) Y |X  2  e i )f 0 (Q e ¯ † (Xi , τ )|X) h(W, τ ) − βn† (τ )> Z(W = E wn> M2 (τ )−1 Z(W ) Y |X  >  e i )| c†2 = o(kwn kc†2 ), . E |wn M2 (τ2 )−1 Z(W n n 26

(A.28)

¯ † (Xi , τ ) lying on the line segment where the first inequality follows from the Taylor expansion with Q of Q(X; τ ) and α(τ )> Vi + h†n (Wi , τ ); the second equality follows from the first order condition of (3.3), and the last line follows Conditions (A2), (C1) and the conditions of the theorem. Combining (A.27) and (A.28), and (B1’) we obtain (A.21). e Lemma A.2. Under Assumptions (A1)-(A3), (C2) and supτ ∈T E[|Z(W )> M2 (τ )−1 wn |] = o(kwn k), we have sup kA(τ )wn k2 = o(kwn k) τ

(A.29)

Proof for Lemma A.2. By the first order condition for obtaining A(τ ), e A(τ ) = E[hV W (W ; τ )Z(W )> fY |X (Q(X; τ )|X)]M2 (τ )−1 .

By the orthogonality condition (3.5),

e )> fY |X (Q(X; τ )|X)]M2 (τ )−1 wn kA(τ )wn k = E[hV W (W ; τ )Z(W

e = E[(hV W (W ; τ ) − V + V )Z(W )> fY |X (Q(X; τ )|X)]M2 (τ )−1 wn

e )> fY |X (Q(X; τ )|X)]M2 (τ )−1 wn . = E[V Z(W

By the assumption that at fixed j, |Vj | ≤ C, the uniform boundedness of the conditional density e )> M2 (τ )−1 wn |] = o(kwn k), in (A2), and the hypothesis supτ ∈T E[|Z(W e e sup E[|Vj Z(W )> M2 (τ )−1 wn |fY |X (Q(X; τ )|X)] ≤ f¯C sup E[|Z(W )> M2 (τ )−1 wn |] = o(kwn k).

τ ∈T

τ ∈T

(A.30)

This completes the proof of (A.29) by noting that supτ ∈T kM2 (τ )−1 k = O(1).

A.4

Asymptotic Tightness of Quantile Process

∞ In this section we establish the asymptotic tightness of the process n1/2 u> n Un (τ ) in ` (T ) with un ∈ Rm being an arbitrary vector, where

Un (τ ) =

−1 n−1 Jm (τ )

n X i=1

 Zi 1{Yi ≤ Q(Xi ; τ )} − τ .

(A.31)

Note that the results obtained in this section, in particular Lemma A.3, apply to any series expansion Z = Z(Xi ) satisfying Assumptions (A1). The following definition is only used in this subsection: For any non-decreasing, convex function Ψ : R+ → R+ with Φ(0) = 0, the Orlicz norm of a real-valued random variable Z is defined as (see e.g. Chapter 2.2 of van der Vaart and Wellner (1996))  kZkΨ = inf C > 0 : EΦ(|Z|/C) ≤ 1 . (A.32)

2 (log n)2 = Lemma A.3 (Asymptotic Equicontinuity of Quantile Process). Under (A1)-(A3) and ξm o(n), we have for any ε > 0 and vector un ∈ Rm ,   > lim lim sup P kun k−1 n1/2 sup U (τ ) > ε = 0, (A.33) un Un (τ1 ) − u> n n 2 δ→0 n→∞

τ1 ,τ2 ∈T ,|τ1 −τ2 |≤δ

where Un (τ ) is defined in (A.31).

27

Proof of Lemma A.3. Without loss of generality we assume that $u_n$ is a sequence of vectors with $\|u_n\| = 1$; this can always be achieved by rescaling. Define $G_n(\tau) := n^{1/2} u_n^\top U_n(\tau)$ and consider the decomposition

$$ u_n^\top U_n(\tau_1) - u_n^\top U_n(\tau_2) = n^{-1} u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) \sum_i Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 \big) + n^{-1} u_n^\top J_m^{-1}(\tau_2) \sum_i Z_i \Delta_i(\tau_1,\tau_2), $$

where $\Delta_i(\tau_1,\tau_2) := 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 - \big( 1\{Y_i \leq Q(X_i;\tau_2)\} - \tau_2 \big)$. Note that for any $L \geq 2$,

$$ \mathrm{E}\Big[ \big| u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2) \big|^L \Big] \lesssim \xi_m^{L-2}\, \mathrm{E}\Big[ \big| u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2) \big|^2 \Big] = \xi_m^{L-2}\, u_n^\top J_m^{-1}(\tau_2)\, \mathrm{E}\big[ Z_i Z_i^\top \Delta_i^2(\tau_1,\tau_2) \big]\, J_m^{-1}(\tau_2) u_n \lesssim \xi_m^{L-2} |\tau_1 - \tau_2|. \quad (A.34) $$

By the Lipschitz continuity of $\tau \mapsto J_m^{-1}(\tau)$ (cf. Lemma 13 of Belloni et al. (2011)) and the positive definiteness of $J_m(\tau)$, we have

$$ \big\| J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big\| = \big\| J_m^{-1}(\tau_2)\big( J_m(\tau_1) - J_m(\tau_2) \big) J_m^{-1}(\tau_1) \big\| \leq |\tau_1 - \tau_2| \Big( \inf_{\tau \in \mathcal{T}} \lambda_{\min}(J_m(\tau)) \Big)^{-2} \frac{\bar{f}'}{f_{\min}}\, \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big), $$

where $\|\cdot\|$ denotes the operator norm of a matrix. Thus, we have for $L \geq 2$,

$$ \mathrm{E}\Big[ \big| u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 \big) \big|^L \Big] \lesssim \xi_m^{L-2}\, \mathrm{E}\Big[ \big| u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 \big) \big|^2 \Big] \leq \xi_m^{L-2}\, \mathrm{E}\Big[ \big| u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) Z_i \big|^2 \Big] = \xi_m^{L-2}\, u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) \mathrm{E}\big[ Z_i Z_i^\top \big] \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) u_n \leq \xi_m^{L-2}\, \big\| \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) \mathrm{E}\big[ Z_i Z_i^\top \big] \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) \big\| \lesssim \xi_m^{L-2} |\tau_1 - \tau_2|^2. \quad (A.35) $$

To simplify notation, define

$$ \tilde{V}_{n,i}(\tau_1,\tau_2) := u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 \big) + u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2). $$

Combining the bounds (A.34) and (A.35) (with $L$ replaced by $2L$) and the triangle inequality for the $L^{2L}$ norm yields

$$ \Big( \mathrm{E}\big| \tilde{V}_{n,i}(\tau_1,\tau_2) \big|^{2L} \Big)^{1/2L} \leq \Big( \mathrm{E}\big| u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 \big) \big|^{2L} \Big)^{1/2L} + \Big( \mathrm{E}\big| u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2) \big|^{2L} \Big)^{1/2L} \lesssim \big( \xi_m^{2(L-1)} |\tau_1 - \tau_2| \big)^{1/2L}. \quad (A.36) $$

Note that (A.36) holds for all positive integers $L \geq 1$. Since $G_n(\tau_1) - G_n(\tau_2) = n^{-1/2} \sum_{i=1}^n \tilde{V}_{n,i}(\tau_1,\tau_2)$ and $\mathrm{E}\tilde{V}_{n,i}(\tau_1,\tau_2) = 0$, expanding the $2L$-th moment of the centered i.i.d. sum (all terms containing some $\tilde{V}_{n,i}$ to the first power vanish) and applying (A.36) to the remaining products, with $k \leq L$ blocks of sizes $l_1 + \dots + l_k = L$, $l_j \geq 1$, yields

$$ \mathrm{E}\big[ |G_n(\tau_1) - G_n(\tau_2)|^{2L} \big] = n^{-L}\, \mathrm{E}\Big[ \Big( \sum_{i=1}^n \tilde{V}_{n,i}(\tau_1,\tau_2) \Big)^{2L} \Big] \lesssim n^{-L} \sum_{k=1}^{L} n^k\, \xi_m^{2(L-k)}\, |\tau_1 - \tau_2|^k = \sum_{k=1}^{L} \Big( \frac{\xi_m^2}{n} \Big)^{L-k} |\tau_1 - \tau_2|^k. \quad (A.37) $$

Define $\bar{\omega}_n := 2\xi_m/\sqrt{n}$, the semimetric $d(\tau_1,\tau_2) := |\tau_1 - \tau_2|^{1/2}$, and the Orlicz function $\Psi(x) := x^{2L}$. For $|\tau_1 - \tau_2| \geq \bar{\omega}_n^2$, each summand in (A.37) is bounded by $|\tau_1 - \tau_2|^L$ up to a constant, so that

$$ \big\| G_n(\tau_1) - G_n(\tau_2) \big\|_\Psi \lesssim |\tau_1 - \tau_2|^{1/2} = d(\tau_1,\tau_2) \quad \text{whenever } d(\tau_1,\tau_2) \geq \bar{\omega}_n. \quad (A.38) $$

Moreover, the packing number of $(\mathcal{T}, d)$ satisfies $D(\omega, d) \lesssim \omega^{-2}$, so that for any $\omega > 0$,

$$ \Psi^{-1}\big( D^2(\omega, d) \big) \lesssim \Psi^{-1}\big( \omega^{-4} \big) = \omega^{-2/L}. \quad (A.39) $$

Therefore, applying Lemma S.2.1 yields for any $\delta > 0$

$$ \sup_{|\tau_1 - \tau_2| \leq \delta} |G_n(\tau_1) - G_n(\tau_2)| = \sup_{d(\tau_1,\tau_2) \leq \delta^{1/2}} |G_n(\tau_1) - G_n(\tau_2)| \leq S_{1,n}(\delta) + 2 \sup_{d(\tau,\tau') \leq \bar{\omega}_n,\, \tau \in \tilde{\mathcal{T}}} |G_n(\tau') - G_n(\tau)|, \quad (A.40) $$

where $\tilde{\mathcal{T}} \subset \mathcal{T}$ contains at most $D(\bar{\omega}_n, d) \lesssim \bar{\omega}_n^{-2}$ points and $S_{1,n}(\delta)$ is a random variable that satisfies, for a constant $K > 0$,

$$ P\big( |S_{1,n}(\delta)| > z \big) \leq \Big[ z^{-1} \Big( 8K \int_{\bar{\omega}_n/2}^{\omega} \Psi^{-1}\big( D(\varepsilon, d) \big)\, d\varepsilon + \big( \delta^{1/2} + 2\bar{\omega}_n \big) \Psi^{-1}\big( D^2(\omega, d) \big) \Big) \Big]^{2L} \lesssim z^{-2L} \Big( \frac{\omega^{1 - 1/L} - (\bar{\omega}_n/2)^{1 - 1/L}}{1 - 1/L} + \big( \delta^{1/2} + 2\bar{\omega}_n \big) \omega^{-2/L} \Big)^{2L} \quad (A.41) $$

for any $\omega > 0$.
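As a sanity check on the choice of $L$ made below (our addition, not part of the original proof), one can verify directly that $L = 6$ makes the right-hand side of (A.41) vanish:

```latex
% With \omega = \delta and \bar\omega_n = o(\delta^{1/2}), the bracket in (A.41)
% behaves like
z^{-1}\Big( \tfrac{\delta^{1-1/L}}{1-1/L} + \delta^{1/2}\,\delta^{-2/L} \Big)
  \asymp z^{-1}\big( \delta^{1-1/L} + \delta^{1/2 - 2/L} \big).
% For L = 6 the exponents are 1 - 1/6 = 5/6 and 1/2 - 2/6 = 1/6, both positive,
% so the bound tends to 0 as \delta \to 0 for every fixed z > 0.
% Any L > 4 would do; L = 6 is a convenient integer choice.
```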

Let $\omega = \delta$ and $L = 6$. For $n$ large enough we have $\omega > \bar{\omega}_n$, and the right-hand side of (A.41) tends to zero as $\delta \to 0$; hence $\lim_{\delta \to 0} \limsup_{n \to \infty} P(|S_{1,n}(\delta)| > z) = 0$ for any $z > 0$. To bound the remaining term in (A.40), observe that

$$ \sup_{d(\tau,\tau') \leq \bar{\omega}_n,\, \tau \in \tilde{\mathcal{T}}} |G_n(\tau') - G_n(\tau)| = \sup_{|\tau - \tau'| \leq \bar{\omega}_n^2,\, \tau \in \tilde{\mathcal{T}}} |G_n(\tau') - G_n(\tau)| \leq \sup_{|\tau - \tau'| \leq \bar{\omega}_n^2,\, \tau, \tau' \in \mathcal{T}} |G_n(\tau') - G_n(\tau)|. $$

Now by Lemma A.4 we have

$$ P\Big( \sup_{|\tau' - \tau| \leq \bar{\omega}_n^2,\, \tau, \tau' \in \mathcal{T}} |G_n(\tau') - G_n(\tau)| > r_n(\kappa_n) \Big) < e^{-\kappa_n}, $$

where

$$ r_n(\kappa_n) := C \Big( \bar{\omega}_n \log \frac{\xi_m}{\bar{\omega}_n} + \frac{\xi_m}{\sqrt{n}} \log \frac{\xi_m}{\bar{\omega}_n} + \kappa_n^{1/2} \bar{\omega}_n + \frac{\xi_m}{\sqrt{n}} \kappa_n \Big) \quad (A.42) $$

for a sufficiently large constant $C > 0$. Take $\kappa_n = \log n$. Since $\bar{\omega}_n (\log n)^{1/2} = 2\xi_m (\log n)^{1/2}/\sqrt{n} = o(1)$ and $\xi_m \log(n)/\sqrt{n} = o(1)$ by assumption, it follows that $r_n(\log n) \to 0$. Therefore, we conclude from Lemma A.4 that

$$ \sup_{d(\tau,\tau') \leq \bar{\omega}_n,\, \tau \in \tilde{\mathcal{T}}} |G_n(\tau') - G_n(\tau)| = o_P(1). \quad (A.43) $$

Applying the bounds (A.41) and (A.43) to (A.40) verifies the asymptotic equicontinuity

$$ \lim_{\delta \to 0} \limsup_{n \to \infty} P\Big( \sup_{|\tau_1 - \tau_2| \leq \delta} |G_n(\tau_1) - G_n(\tau_2)| > z \Big) = 0 $$

for all $z > 0$. $\Box$

The following result is applied in the proof of Lemma A.3.

Lemma A.4. Under (A1)-(A3), we have for any $\kappa_n > 0$ and any $1/n \leq \delta < 1$,

$$ P\Big( \sup_{0 \leq h \leq \delta}\ \sup_{\tau \in [\epsilon,\, 1 - \epsilon - h]} \big| u_n^\top U_n(\tau + h) - u_n^\top U_n(\tau) \big| \geq C r_n(\delta, \kappa_n) \Big) \leq 3e^{-\kappa_n}, $$

where $U_n(\tau)$ is defined in (A.31), $u_n \in \mathbb{R}^m$ is arbitrary, and

$$ r_n(\delta, \kappa_n) = \|u_n\| \Big( \frac{\delta}{n} \Big)^{1/2} \log \frac{\xi_m}{\sqrt{\delta}} + \frac{\|u_n\| \xi_m}{n} \log \frac{\xi_m}{\sqrt{\delta}} + \|u_n\| \Big( \frac{\kappa_n \delta}{n} \Big)^{1/2} + \frac{\|u_n\| \xi_m}{n} \kappa_n. \quad (A.44) $$
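The two rates are consistent: (A.42) is, up to constants, the scaled version of (A.44) at $\delta = \bar{\omega}_n^2$. The following check is our addition and uses only the definitions $G_n(\tau) = n^{1/2} u_n^\top U_n(\tau)$ and $\bar{\omega}_n = 2\xi_m/\sqrt{n}$:

```latex
% Multiply (A.44) by n^{1/2}/\|u_n\| and plug in \delta = \bar\omega_n^2 = 4\xi_m^2/n:
n^{1/2}\,r_n(\bar\omega_n^2,\kappa_n)/\|u_n\|
  = \underbrace{\bar\omega_n \log\tfrac{\xi_m}{\bar\omega_n}}_{n^{1/2}(\delta/n)^{1/2}\log(\xi_m/\sqrt{\delta})}
  + \underbrace{\tfrac{\xi_m}{\sqrt{n}} \log\tfrac{\xi_m}{\bar\omega_n}}_{n^{1/2}\cdot \xi_m n^{-1}\log(\xi_m/\sqrt{\delta})}
  + \underbrace{\kappa_n^{1/2}\,\bar\omega_n}_{n^{1/2}(\kappa_n\delta/n)^{1/2}}
  + \underbrace{\tfrac{\xi_m}{\sqrt{n}}\,\kappa_n}_{n^{1/2}\cdot \xi_m n^{-1}\kappa_n},
% which is exactly the expression r_n(\kappa_n) in (A.42).
```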

To prove Lemma A.4, we need to establish some preliminary results. For any fixed vector $u \in \mathbb{R}^m$ and $\delta > 0$, define the function classes

$$ \mathcal{G}_3(u) := \big\{ (Z,Y) \mapsto u^\top J_m(\tau)^{-1} Z\, 1\{\|Z\| \leq \xi_m\} \,\big|\, \tau \in \mathcal{T} \big\}, $$
$$ \mathcal{G}_4 := \big\{ (X,Y) \mapsto 1\{Y \leq Q(X;\tau)\} - \tau \,\big|\, \tau \in \mathcal{T} \big\}, $$
$$ \mathcal{G}_6(u,\delta) := \big\{ (Z,Y) \mapsto u^\top \{ J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1} \} Z\, 1\{\|Z\| \leq \xi_m\} \,\big|\, \tau_1, \tau_2 \in \mathcal{T},\ |\tau_1 - \tau_2| \leq \delta \big\}, $$
$$ \mathcal{G}_7(\delta) := \big\{ (X,Y) \mapsto 1\{Y \leq Q(X;\tau_1)\} - 1\{Y \leq Q(X;\tau_2)\} - (\tau_1 - \tau_2) \,\big|\, \tau_1, \tau_2 \in \mathcal{T},\ |\tau_1 - \tau_2| \leq \delta \big\}. $$

Denote by $G_3$, $G_6$ and $G_7$ the envelope functions of $\mathcal{G}_3$, $\mathcal{G}_6$ and $\mathcal{G}_7$, respectively. The following covering number bounds will be shown in Section S.2.2: for any probability measure $Q$ and any $\epsilon > 0$,

$$ N\big( \epsilon \|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q) \big) \leq \Big( \frac{C_0}{\epsilon} \Big)^2, \quad (A.45) $$
$$ N\big( \epsilon \|G_6\|_{L_2(Q)}, \mathcal{G}_6(u,\delta), L_2(Q) \big) \leq \Big( \frac{2C_0}{\epsilon} \Big)^2, \quad (A.46) $$
$$ N\big( \epsilon \|G_7\|_{L_2(Q)}, \mathcal{G}_7(\delta), L_2(Q) \big) \leq \Big( \frac{A_7}{\epsilon} \Big)^2, \quad (A.47) $$

where $C_0 := \frac{\bar{f}'}{f_{\min}} \frac{\lambda_{\max}(\mathrm{E}[ZZ^\top])}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(J_m(\tau))} < \infty$ given Assumptions (A1)-(A3), and $A_7 > 0$ is a constant. Also, $\mathcal{G}_4$ has VC index 2 according to Lemma S.2.4.

Proof of Lemma A.4. Observe the decomposition $u_n^\top U_n(\tau_1) - u_n^\top U_n(\tau_2) = I_1(\tau_1,\tau_2) + I_2(\tau_1,\tau_2)$, where

$$ I_1(\tau_1,\tau_2) := n^{-1} u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) \sum_{i=1}^n Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - \tau_1 \big), $$
$$ I_2(\tau_1,\tau_2) := n^{-1} u_n^\top J_m^{-1}(\tau_2) \sum_{i=1}^n Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - 1\{Y_i \leq Q(X_i;\tau_2)\} - (\tau_1 - \tau_2) \big). $$

Step 1: bounding $I_1(\tau_1,\tau_2)$. Note that $\sup_{\tau_1,\tau_2 \in \mathcal{T},\, |\tau_1 - \tau_2| \leq \delta} |I_1(\tau_1,\tau_2)| \leq \|P_n - P\|_{\mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4}$, where

$$ \mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4 := \big\{ (Z,Y) \mapsto u_n^\top \{ J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1} \} Z \big( 1\{Y \leq Q(X;\tau_3)\} - \tau_3 \big) 1\{\|Z\| \leq \xi_m\} \,\big|\, \tau_1, \tau_2, \tau_3 \in \mathcal{T},\ |\tau_1 - \tau_2| \leq \delta \big\}. $$

Theorem 2.6.7 of van der Vaart and Wellner (1996) and Part 1 of Lemma S.2.4 give

$$ N\big( \epsilon \|G_4\|_{L_2(P_n)}, \mathcal{G}_4, L_2(P_n) \big) \leq \frac{A_4}{\epsilon}, $$

where the envelope for $\mathcal{G}_4$ is $G_4 = 2$ and $A_4$ is a universal constant. Part 2 of Lemma S.2.4 and Part 2 of Lemma S.2.2 imply that

$$ N\big( \epsilon \|G_6 G_4\|_{L_2(P_n)}, \mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4, L_2(P_n) \big) \leq \frac{2A_4}{\epsilon} \Big( \frac{2C_0}{\epsilon} \Big)^2 = \Big( \frac{2A_4^{1/3} C_0^{2/3}}{\epsilon} \Big)^3, \quad (A.48) $$

where $2A_4^{1/3} C_0^{2/3} \leq C$ for a large enough universal constant $C > 0$. To bound $\sup_{f \in \mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4} \|f\|_{L_2(P)}$, note that

$$ \mathrm{E}\Big[ \big( u_n^\top \big( J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1} \big) Z_i \big( 1\{Y_i \leq Q(X_i;\tau_3)\} - \tau_3 \big) \big)^2 \Big] \leq 4 \|u_n\|^2 \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big) \Big[ \inf_{\tau \in \mathcal{T}} \lambda_{\min}(J_m(\tau)) \Big]^{-2} C_0^2 \delta^2 \leq C \|u_n\|^2 \delta^2 $$

for a large enough constant $C$. In addition, the functions in $\mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4$ are uniformly bounded by

$$ 2 \xi_m \|u_n\| \Big[ \inf_{\tau \in \mathcal{T}} \lambda_{\min}(J_m(\tau)) \Big]^{-1} C_0 \delta \leq C \delta \xi_m \|u_n\|, $$

and we can take this upper bound as envelope. Applying the bounds (S.2.2) and (S.2.3) and taking into account (A.48), we obtain for any $u_n$ and $\delta > 0$

$$ \mathrm{E} \|P_n - P\|_{\mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4} \leq c_1 \Big[ \|u_n\| \delta \Big( \frac{\log \xi_m}{n} \Big)^{1/2} + \frac{\delta \|u_n\| \xi_m}{n} \log \xi_m \Big]. \quad (A.49) $$

Finally, for any $\kappa_n > 0$, let

$$ r_{n,1}(\delta,\kappa_n) := \tilde{C} \Big[ \|u_n\| \delta \Big( \frac{\log \xi_m}{n} \Big)^{1/2} + \frac{\delta \|u_n\| \xi_m}{n} \log \xi_m + \|u_n\| \delta \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\delta \|u_n\| \xi_m}{n} \kappa_n \Big] $$

for a sufficiently large constant $\tilde{C} > 0$. From this, we obtain

$$ P\Big( \sup_{\tau_1,\tau_2 \in \mathcal{T},\, |\tau_1 - \tau_2| \leq \delta} |I_1(\tau_1,\tau_2)| \geq r_{n,1}(\delta,\kappa_n) \Big) \leq P\big( \|P_n - P\|_{\mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4} \geq r_{n,1}(\delta,\kappa_n) \big) \leq e^{-\kappa_n}. $$

Step 2: bounding $I_2(\tau_1,\tau_2)$. Similarly, $\sup_{\tau_1,\tau_2 \in \mathcal{T},\, |\tau_1 - \tau_2| \leq \delta} |I_2(\tau_1,\tau_2)| \leq \|P_n - P\|_{\mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta)}$, and combining (A.45) and (A.47) with Part 2 of Lemma S.2.2 gives a covering number bound of the same polynomial form,

$$ N\big( \epsilon \|G_3 G_7\|_{L_2(P_n)}, \mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta), L_2(P_n) \big) \leq \Big( \frac{C}{\epsilon} \Big)^4, \quad (A.50) $$

for a large enough constant $C > 0$. To bound $\sup_{f \in \mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta)} \|f\|_{L_2(P)}$, note that

$$ \mathrm{E}\Big[ \big( u_n^\top J_m(\tau_3)^{-1} Z_i \big( 1\{Y_i \leq Q(X_i;\tau_1)\} - 1\{Y_i \leq Q(X_i;\tau_2)\} - (\tau_1 - \tau_2) \big) \big)^2 \Big] \leq 3 \|u_n\|^2 \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big) \Big[ \inf_{\tau \in \mathcal{T}} \lambda_{\min}(J_m(\tau)) \Big]^{-2} \delta \leq C \|u_n\|^2 \delta. $$

Moreover,

$$ \sup_{f \in \mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta)} \|f\|_\infty \leq 2 \|u_n\| \xi_m \Big[ \inf_{\tau \in \mathcal{T}} \lambda_{\min}(J_m(\tau)) \Big]^{-1} \leq C \|u_n\| \xi_m $$

for some constant $C$. Applying the bounds (S.2.2) and (S.2.3) and taking into account (A.50),

$$ \mathrm{E} \|P_n - P\|_{\mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta)} \leq c_1 \Big[ \|u_n\| \Big( \frac{\delta}{n} \Big)^{1/2} \log \frac{\xi_m}{\sqrt{\delta}} + \frac{\|u_n\| \xi_m}{n} \log \frac{\xi_m}{\sqrt{\delta}} \Big]. \quad (A.51) $$

For any $\kappa_n > 0$, let

$$ r_{n,2}(\delta,\kappa_n) := C \Big[ \|u_n\| \Big( \frac{\delta}{n} \Big)^{1/2} \log \frac{\xi_m}{\sqrt{\delta}} + \frac{\|u_n\| \xi_m}{n} \log \frac{\xi_m}{\sqrt{\delta}} + \|u_n\| \Big( \frac{\kappa_n \delta}{n} \Big)^{1/2} + \frac{\|u_n\| \xi_m}{n} \kappa_n \Big] $$

for a sufficiently large constant $C > 0$; we obtain

$$ P\Big( \sup_{\tau_1,\tau_2 \in \mathcal{T},\, |\tau_1 - \tau_2| \leq \delta} |I_2(\tau_1,\tau_2)| \geq r_{n,2}(\delta,\kappa_n) \Big) \leq P\big( \|P_n - P\|_{\mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta)} \geq r_{n,2}(\delta,\kappa_n) \big) \leq e^{-\kappa_n}. $$

Since $r_{n,1}(\delta,\kappa_n) + r_{n,2}(\delta,\kappa_n) \lesssim r_n(\delta,\kappa_n)$, combining Steps 1 and 2 completes the proof. $\Box$

For a function $g : \mathcal{W} \to \mathbb{R}$, define

$$ \tilde{\beta}_{n,g}(\tau) := \operatorname{argmin}_{\beta \in \mathbb{R}^m} \int \big( \tilde{Z}(w)^\top \beta - g(w) \big)^2 f_{Y|X}\big( Q(v,w;\tau) \,\big|\, (v,w) \big) f_{V,W}(v,w)\, dv\, dw. \quad (B.1) $$

Note that $w \mapsto \tilde{Z}(w)^\top \tilde{\beta}_{n,g}(\tau)$ can be viewed as a projection of the function $g : \mathcal{W} \to \mathbb{R}$ onto the spline space $B_m(\mathcal{W}) := \{ w \mapsto \tilde{Z}(w)^\top b : b \in \mathbb{R}^m \}$ with respect to the inner product $\langle g_1, g_2 \rangle = \int g_1(w) g_2(w)\, d\nu(w)$, where $d\nu(w) := \big( \int_v f_{Y|X}(Q(v,w;\tau)|v,w)\, f_{V,W}(v,w)\, dv \big)\, dw$.

We first apply Theorem A.1 on p. 1630 of Huang (2003). To do so, we need to verify Conditions A.1-A.3 of Huang (2003). Condition A.1 can be verified by invoking (A2)-(A3) in our paper and using the bounds on $f_W$. The choice of basis functions and knots ensures that Conditions A.2 and A.3 hold (see the discussion on p. 1630 of Huang (2003)). Thus, Theorem A.1 of Huang (2003) implies that there exists a constant $C$ independent of $n$ such that for any bounded function $g$ on $\mathcal{W}$,

$$ \sup_{w \in \mathcal{W}} \big| \tilde{Z}(w)^\top \tilde{\beta}_{n,g}(\tau) \big| \leq C \sup_{w \in \mathcal{W}} |g(w)|. $$

Recall that $\mathcal{W}$ is a compact subset of $\mathbb{R}^d$ and $h(\cdot;\tau) \in \Lambda_c^\eta(\mathcal{W}, \mathcal{T})$. Since $B_m(\mathcal{W})$ is a finite-dimensional vector space of functions, by a compactness argument there exists $g^*(\cdot;\tau) \in B_m(\mathcal{W})$ such that $\sup_{w \in \mathcal{W}} |h(w;\tau) - g^*(w;\tau)| = \inf_{g \in B_m(\mathcal{W})} \sup_{w \in \mathcal{W}} |h(w;\tau) - g(w)|$ for each fixed $\tau$. With $m > \eta$, the inequality in the proof of Theorem 12.8 in Schumaker (1981), with their "$m_i$" equal to our $\eta$ and $\Delta_i \asymp m^{-1/k'}$, together with the linearity of $g \mapsto \tilde{\beta}_{n,g}$ and the fact that the projection reproduces $g^* \in B_m(\mathcal{W})$, yields

$$ \tilde{c}_n = \sup_{w,\tau} \big| \tilde{Z}(w)^\top \tilde{\beta}_{n,h(\cdot;\tau)}(\tau) - h(w;\tau) \big| = \sup_{w,\tau} \big| \tilde{Z}(w)^\top \tilde{\beta}_{n,h(\cdot;\tau)}(\tau) - g^*(w;\tau) + g^*(w;\tau) - h(w;\tau) \big| \leq \sup_{w,\tau} \big| \tilde{Z}(w)^\top \tilde{\beta}_{n,\, h(\cdot;\tau) - g^*(\cdot;\tau)}(\tau) \big| + \sup_{w,\tau} \big| g^*(w;\tau) - h(w;\tau) \big| \leq (C+1) \sup_{\tau \in \mathcal{T}}\ \inf_{g \in B_m(\mathcal{W})}\ \sup_{w \in \mathcal{W}} \big| h(w;\tau) - g(w) \big| \lesssim m^{-\lfloor\eta\rfloor/k'} \max_{|j| \leq \eta}\ \sup_{\tau \in \mathcal{T}}\ \sup_{w \in \mathcal{W}} |D^j h(w;\tau)|, $$

where $\lfloor\eta\rfloor$ is the greatest integer less than $\eta$, $\max_{|j| \leq \eta} \sup_{\tau \in \mathcal{T}} \sup_{w \in \mathcal{W}} |D^j h(w;\tau)| = O(1)$ by the assumption that $h(\cdot;\tau) \in \Lambda_c^\eta(\mathcal{W}, \mathcal{T})$, and $k'$ is fixed. An extension of Theorem 12.8 of Schumaker (1981) to Besov spaces (see Example 6.29 of Schumaker (1981) or p. 5573 of Chen (2007)), in a similar manner as Theorem 6.31 of Schumaker (1981), could refine the rate to $\tilde{c}_n \lesssim m^{-\eta/k'}$, but we do not pursue this direction here.

Next we show the bound $\tilde{c}_n = o(m^{-\lfloor\eta\rfloor})$ in the setting of Example 2.3. Assume that the density $f_X(x)$ of $X$ exists and that $0 < \inf_{x \in \mathcal{X}} f_X(x) \leq \sup_{x \in \mathcal{X}} f_X(x) < \infty$. Define the measure $\nu(u)$ by $d\nu(u) = f(Q(u;\tau)|u) f_X(u)\, du$. Then $x \mapsto B(x)^\top \beta_{n,g}(\tau)$, with $\beta_{n,g}$ defined similarly to (B.1), is now viewed as a projection of a function $g : \mathcal{X} \to \mathbb{R}$ onto the space $B(\mathcal{X})$ with respect to the inner product $\langle g_1, g_2 \rangle = \int g_1(u) g_2(u)\, d\nu(u)$. The remaining proof is similar to that for the partially linear model, with $h(w;\tau)$ replaced by $Q(x;\tau)$, and we omit the details.
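For intuition (our addition, not part of the proof): the rate $m^{-\lfloor\eta\rfloor/k'}$ is the classical Jackson-type spline approximation rate. A concrete univariate instance reads:

```latex
% Illustration under the assumptions k' = 1, \eta = 2, \mathcal{W} = [0,1]:
% for h with bounded second derivative and a spline space B_m([0,1]) of degree
% at least 2 on a quasi-uniform knot sequence with mesh size \Delta \asymp m^{-1},
\inf_{g \in B_m([0,1])}\ \sup_{w \in [0,1]} |h(w) - g(w)|
  \;\lesssim\; \Delta^{2}\, \sup_{w \in [0,1]} |h''(w)|
  \;\asymp\; m^{-2}\, \sup_{w \in [0,1]} |h''(w)|,
% which is the special case \lfloor\eta\rfloor/k' = 2 of the display above
% (cf. Schumaker (1981), Chapter 6 and Theorem 12.8).
```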

References

Angrist, J., Chernozhukov, V., and Fernández-Val, I. (2006). Quantile regression under misspecification, with an application to the U.S. wage structure. Econometrica, 74(2):539-563.

Bassett, Jr., G. and Koenker, R. (1982). An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association, 77(378):407-415.

Belloni, A., Chernozhukov, V., and Fernández-Val, I. (2011). Conditional quantile processes based on series or many regressors. arXiv preprint arXiv:1105.6154.

Briollais, L. and Durrieu, G. (2014). Application of quantile regression to recent genetic and -omic studies. Human Genetics, 133:951-966.

Cade, B. S. and Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1(8):412-420.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In Heckman, J. J. and Leamer, E., editors, Handbook of Econometrics, chapter 76. North-Holland.

Cheng, G. and Shang, Z. (2015). Joint asymptotics for semi-nonparametric regression models with partially linear structure. Annals of Statistics, 43(3):1351-1390.

Cheng, G., Zhou, L., and Huang, J. Z. (2014). Efficient semiparametric estimation in generalized partially linear additive models for longitudinal/clustered data. Bernoulli, 20(1):141-163.

Chernozhukov, V., Fernández-Val, I., and Galichon, A. (2010). Quantile and probability curves without crossing. Econometrica, 78(3):1093-1125.

Delgado, M. A. and Escanciano, J. C. (2013). Conditional stochastic dominance testing. Journal of Business & Economic Statistics, 31(1):16-28.

Demko, S., Moss, W. F., and Smith, P. W. (1984). Decay rates for inverses of band matrices. Mathematics of Computation, 43(168):491-499.

Dette, H. and Volgushev, S. (2008). Non-crossing non-parametric estimates of quantile curves. Journal of the Royal Statistical Society: Series B, 70(3):609-627.

He, X. and Shi, P. (1994). Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3(3-4):299-308.

He, X. and Shi, P. (1996). Bivariate tensor-product B-splines in a partly linear model. Journal of Multivariate Analysis, 58(2):162-181.

Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. Annals of Statistics, 31(5):1600-1635.

Kley, T., Volgushev, S., Dette, H., and Hallin, M. (2015+). Quantile spectral processes: Asymptotic analysis and inference. Bernoulli, to appear.

Knight, K. (2008). Asymptotics of the regression quantile basic solution under misspecification. Applications of Mathematics, 53(3):223-234.

Koenker, R. (2005). Quantile Regression. Cambridge University Press, New York.

Koenker, R. and Bassett, Jr., G. (1978). Regression quantiles. Econometrica, 46(1):33-50.

Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4):143-156.

Koenker, R. and Xiao, Z. (2002). Inference on the quantile regression process. Econometrica, 70(4):1583-1612.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593-2656.

Lee, S. (2003). Efficient semiparametric estimation of a partially linear quantile regression model. Econometric Theory, 19:1-31.

Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Annals of Probability, 28:863-884.

Puntanen, S. and Styan, G. P. H. (2005). Schur complements in statistics and probability. In Zhang, F., editor, The Schur Complement and Its Applications, volume 4 of Numerical Methods and Algorithms. Springer.

Qu, Z. and Yoon, J. (2015). Nonparametric estimation and inference on conditional quantile processes. Journal of Econometrics, 185:1-19.

Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley, New York.

Shang, Z. and Cheng, G. (2015). A Bayesian splitotic theory for nonparametric models. arXiv preprint arXiv:1508.04175.

van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
Volgushev, S. (2013). Smoothed quantile regression processes for binary response models. arXiv preprint arXiv:1302.5644.


Volgushev, S., Chao, S.-K., and Cheng, G. (2016). Data compression for extraordinarily large data: A quantile regression approach. Technical report, Purdue University.

Zhao, T., Cheng, G., and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Annals of Statistics.

Zhou, S., Shen, X., and Wolfe, D. (1998). Local asymptotics for regression splines and confidence regions. Annals of Statistics, 26(5):1760-1782.


SUPPLEMENTARY MATERIAL: QUANTILE PROCESS FOR SEMI-NONPARAMETRIC REGRESSION MODELS

In this supplemental material, we provide the auxiliary proofs needed in the appendices. Section S.1 develops the technical details for the Bahadur representations. Section S.2 presents some empirical process results and computes the covering numbers of the function classes encountered in the proofs of asymptotic tightness of the quantile process.

S.1: Proofs for Bahadur Representations

S.1.1 Proof of Theorem 5.1

Some rearranging of terms yields

$$ P_n \psi\big( \cdot; \hat{\gamma}(\tau), \tau \big) = n^{-1/2} \mathbb{G}_n\big( \psi(\cdot; \hat{\gamma}(\tau), \tau) \big) - n^{-1/2} \mathbb{G}_n\big( \psi(\cdot; \gamma_n(\tau), \tau) \big) + \tilde{J}_m(\tau)\big( \hat{\gamma}(\tau) - \gamma_n(\tau) \big) + n^{-1/2} \mathbb{G}_n\big( \psi(\cdot; \gamma_n(\tau), \tau) \big) + \mu(\gamma_n(\tau), \tau) + \Big( \mu(\hat{\gamma}(\tau), \tau) - \mu(\gamma_n(\tau), \tau) - \tilde{J}_m(\tau)\big( \hat{\gamma}(\tau) - \gamma_n(\tau) \big) \Big). $$

In other words,

$$ \hat{\gamma}(\tau) - \gamma_n(\tau) = -n^{-1/2} J_m(\tau)^{-1} \mathbb{G}_n\big( \psi(\cdot; \gamma_n(\tau), \tau) \big) + r_{n,1}(\tau) + r_{n,2}(\tau) + r_{n,3}(\tau) + r_{n,4}(\tau), \quad (S.1.1) $$

where

$$ r_{n,1}(\tau) := \tilde{J}_m(\tau)^{-1} P_n \psi\big( \cdot; \hat{\gamma}(\tau), \tau \big), $$
$$ r_{n,2}(\tau) := -\tilde{J}_m(\tau)^{-1} \Big( \mu(\hat{\gamma}(\tau), \tau) - \mu(\gamma_n(\tau), \tau) - \tilde{J}_m(\tau)\big( \hat{\gamma}(\tau) - \gamma_n(\tau) \big) \Big), $$
$$ r_{n,3}(\tau) := -n^{-1/2} \tilde{J}_m(\tau)^{-1} \Big( \mathbb{G}_n\big( \psi(\cdot; \hat{\gamma}(\tau), \tau) \big) - \mathbb{G}_n\big( \psi(\cdot; \gamma_n(\tau), \tau) \big) \Big), $$
$$ r_{n,4}(\tau) := -n^{-1/2} \big( J_m(\tau)^{-1} - \tilde{J}_m(\tau)^{-1} \big) \mathbb{G}_n\big( \psi(\cdot; \gamma_n(\tau), \tau) \big) - \tilde{J}_m(\tau)^{-1} \mu(\gamma_n(\tau), \tau). $$

The remaining proof consists in bounding the individual remainder terms.

The bound on $r_{n,1}$ follows from results on duality theory for convex optimization; see Lemma 26 on page 66 of Belloni et al. (2011) for a proof. To bound $r_{n,2}$ and $r_{n,3}$, define the class of functions

$$ \mathcal{G}_1 := \big\{ (Z,Y) \mapsto a^\top Z \big( 1\{Y \leq Z^\top b\} - \tau \big) 1\{\|Z\| \leq \xi_m\} \,\big|\, \tau \in \mathcal{T},\ b \in \mathbb{R}^m,\ a \in S^{m-1} \big\}. \quad (S.1.2) $$

Moreover, let $s_{n,1} := \|P_n - P\|_{\mathcal{G}_1}$. Observe that by Lemma S.1.2 with $t = 2$ we have

$$ \Omega_{1,n} := \Big\{ \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \leq \frac{4(s_{n,1} + g_n)}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau))} \Big\} \supseteq \Big\{ s_{n,1} + g_n < \frac{\inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde{J}_m(\tau))}{8 \xi_m \bar{f}' \lambda_{\max}(\mathrm{E}[ZZ^\top])} \Big\} =: \Omega_{2,n}. $$

Define the event

$$ \Omega_{3,n} := \Big\{ s_{n,1} \leq C \Big[ \Big( \frac{m}{n} \log n \Big)^{1/2} + \frac{m \xi_m}{n} \log n + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} \Big] \Big\}. $$

Now it follows from Lemma S.1.3 that $P(\Omega_{3,n}) \geq 1 - e^{-\kappa_n}$ [note that $\xi_m = O(n^b)$ yields $\log \xi_m = O(\log n)$]. Moreover, the assumptions $m \xi_m^2 \log n = o(n)$, $\xi_m = O(n^b)$ and $\xi_m g_n = o(1)$ imply that for $\kappa_n \ll n/\xi_m^2$ and large enough $n$,

$$ C \Big[ \Big( \frac{m}{n} \log n \Big)^{1/2} + \frac{m \xi_m}{n} \log n + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} \Big] + g_n \leq \frac{\inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde{J}_m(\tau))}{8 \xi_m \bar{f}' \lambda_{\max}(\mathrm{E}[ZZ^\top])} \quad (S.1.3) $$

for $n$ large enough. Thus, for all $n$ for which (S.1.3) holds, $\Omega_{3,n} \subseteq \Omega_{2,n} \subseteq \Omega_{1,n}$. From this we obtain that on $\Omega_{3,n}$, for a constant $C_2$ which is independent of $n$,

$$ \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \leq C_2 \Big[ \Big( \frac{m}{n} \log n \Big)^{1/2} + \frac{m \xi_m}{n} \log n + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} + g_n \Big]. $$

In particular, for all $n$ for which (S.1.3) holds,

$$ P\Big( \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \leq C_2 \Big[ \Big( \frac{m}{n} \log n \Big)^{1/2} + \frac{m \xi_m}{n} \log n + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} + g_n \Big] \Big) \geq 1 - e^{-\kappa_n}. \quad (S.1.4) $$

The bound on $r_{n,2}$ is now a direct consequence of Lemma S.1.1 and the facts that $n^{-1} m \xi_m \log n = o((n^{-1} m \log n)^{1/2})$ and $\xi_m \kappa_n n^{-1} = o(\kappa_n^{1/2} n^{-1/2})$. The bound on $r_{n,4}$ follows once we observe that for any $a \in S^{m-1}$,

$$ \big| a^\top \big( J_m(\tau) - \tilde{J}_m(\tau) \big) a \big| \leq \bar{f}' c_n \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big). \quad (S.1.5) $$

Together with the identity $A^{-1} - B^{-1} = B^{-1}(B - A)A^{-1}$, this implies that for sufficiently large $n$ we have $\sup_{\tau \in \mathcal{T}} \|r_{n,4}(\tau)\| \leq C_1 (c_n s_{n,1} + g_n)$ for a constant $C_1$ which does not depend on $n$. Thus, it remains to bound $r_{n,3}$. Observe that on the event $\{ \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \leq \delta \}$ we have the bound

$$ \sup_{\tau \in \mathcal{T}} \|r_{n,3}(\tau)\| \leq \frac{1}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau))} \|P_n - P\|_{\mathcal{G}_2(\delta)}, $$

where the class of functions G2 (δ) is defined as follows

$$ \mathcal{G}_2(\delta) := \big\{ (Z,Y) \mapsto a^\top Z \big( 1\{Y \leq Z^\top b_1\} - 1\{Y \leq Z^\top b_2\} \big) 1\{\|Z\| \leq \xi_m\} \,\big|\, b_1, b_2 \in \mathbb{R}^m,\ \|b_1 - b_2\| \leq \delta,\ a \in S^{m-1} \big\}. \quad (S.1.6) $$

It thus follows that for any $\delta, \alpha > 0$,

$$ P\Big( \sup_{\tau \in \mathcal{T}} \|r_{n,3}(\tau)\| \geq \alpha \Big) \leq P\Big( \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \geq \delta \Big) + P\Big( \frac{\|P_n - P\|_{\mathcal{G}_2(\delta)}}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau))} \geq \alpha \Big). $$

Letting $\delta := C\big( (n^{-1} m \log n)^{1/2} + (\kappa_n/n)^{1/2} + g_n \big)$ and $\alpha := C \zeta_n(\delta_n, \kappa_n)$ (here, $\zeta_n$ is defined in (S.1.11) in the statement of Lemma S.1.3) with a suitable constant $C$, the bounds in Lemma S.1.3 and (S.1.4) yield the desired bound. $\Box$

S.1.1.1 Technical details for the proof of Theorem 5.1

Lemma S.1.1. Under assumptions (A1)-(A3) we have for any $\delta > 0$,

$$ \sup_{\tau \in \mathcal{T}}\ \sup_{\|b - \gamma_n(\tau)\| \leq \delta} \big\| \mu(b,\tau) - \mu(\gamma_n(\tau),\tau) - \tilde{J}_m(\tau)\big( b - \gamma_n(\tau) \big) \big\| \leq \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big)\, \bar{f}'\, \delta^2 \xi_m. $$

Proof of Lemma S.1.1. Note that $\mu'(\gamma_n(\tau),\tau) = \mathrm{E}[ZZ^\top f_{Y|X}(Z^\top \gamma_n(\tau)|X)] = \tilde{J}_m(\tau)$, where we use the notation $\mu'(b,\tau) := \partial_b \mu(b,\tau)$. Additionally, we have $\mu(b,\tau) = \mu(\gamma_n(\tau),\tau) + \mu'(\bar{\gamma}_n,\tau)(b - \gamma_n(\tau))$, where $\bar{\gamma}_n = b + \lambda_{b,\tau}(\gamma_n(\tau) - b)$ for some $\lambda_{b,\tau} \in [0,1]$. Moreover, for any $a \in \mathbb{R}^m$,

$$ a^\top \big[ \mu(b,\tau) - \mu(\gamma_n(\tau),\tau) - \mu'(\gamma_n(\tau),\tau)(b - \gamma_n(\tau)) \big] = a^\top \big[ \big( \mu'(\bar{\gamma}_n,\tau) - \mu'(\gamma_n(\tau),\tau) \big)(b - \gamma_n(\tau)) \big], $$

and thus we have for any $\|b - \gamma_n(\tau)\| \leq \delta$,

$$ \big\| \mu(b,\tau) - \mu(\gamma_n(\tau),\tau) - \mu'(\gamma_n(\tau),\tau)(b - \gamma_n(\tau)) \big\| \leq \sup_{\|a\|=1} \mathrm{E}\Big[ \big| a^\top Z \big|\, \big| Z^\top(b - \gamma_n(\tau)) \big|\, \big| f_{Y|X}(Z^\top \bar{\gamma}_n | X) - f_{Y|X}(Z^\top \gamma_n(\tau) | X) \big| \Big] \leq \bar{f}' \sup_{\|a\|=1} \mathrm{E}\Big[ \big| a^\top Z \big|\, \big| Z^\top(\bar{\gamma}_n - \gamma_n(\tau)) \big|\, \big| Z^\top(b - \gamma_n(\tau)) \big| \Big] \leq \bar{f}' \xi_m\, \mathrm{E}\Big[ \big| Z^\top(\bar{\gamma}_n - \gamma_n(\tau)) \big|\, \big| Z^\top(b - \gamma_n(\tau)) \big| \Big] \leq \xi_m \delta^2 \bar{f}' \sup_{\|a\|=1} \mathrm{E}\big[ |a^\top Z|^2 \big], $$

where the last inequality follows by the Cauchy-Schwarz inequality. Since the last line does not depend on $\tau$ or $b$, this completes the proof. $\Box$

Lemma S.1.2. Let assumptions (A1)-(A3) hold. Then, for any $t > 1$,

$$ \Big\{ \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \leq \frac{2t(s_{n,1} + g_n)}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau))} \Big\} \supseteq \Big\{ (s_{n,1} + g_n) < \frac{\inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde{J}_m(\tau))}{4t \xi_m \bar{f}' \lambda_{\max}(\mathrm{E}[ZZ^\top])} \Big\}, $$

where $s_{n,1} := \|P_n - P\|_{\mathcal{G}_1}$ and $\mathcal{G}_1$ is defined in (S.1.2).

Proof of Lemma S.1.2. Observe that $f : b \mapsto P_n \rho_\tau(Y_i - Z_i^\top b)$ is convex, and the vector $P_n \psi(\cdot; b, \tau)$ is a subgradient of $f$ at the point $b$. Recalling that $\hat{\gamma}(\tau)$ is a minimizer of $P_n \rho_\tau(Y_i - Z_i^\top b)$, it follows that for any $a > 0$,

$$ \Big\{ \sup_{\tau \in \mathcal{T}} \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| \leq a(s_{n,1} + g_n) \Big\} \supseteq \Big\{ \inf_{\tau}\ \inf_{\|\delta\|=1} \delta^\top P_n \psi\big( \cdot;\, \gamma_n(\tau) + a(s_{n,1} + g_n)\delta,\, \tau \big) > 0 \Big\}. \quad (S.1.7) $$

To see this, define $\delta := (\hat{\gamma}(\tau) - \gamma_n(\tau))/\|\hat{\gamma}(\tau) - \gamma_n(\tau)\|$ and note that by the definition of the subgradient we have for any $\tilde{\zeta}_n > 0$,

$$ P_n \rho_\tau\big( Y_i - Z_i^\top \hat{\gamma}(\tau) \big) \geq P_n \rho_\tau\big( Y_i - Z_i^\top(\gamma_n(\tau) + \tilde{\zeta}_n \delta) \big) + \big( \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| - \tilde{\zeta}_n \big)\, \delta^\top P_n \psi\big( \cdot;\, \gamma_n(\tau) + \tilde{\zeta}_n \delta,\, \tau \big). $$

Set $\tilde{\zeta}_n = a(s_{n,1} + g_n)$. By the definition of $\hat{\gamma}(\tau)$ as minimizer, the inequality above can only be true if $\big( \|\hat{\gamma}(\tau) - \gamma_n(\tau)\| - \tilde{\zeta}_n \big)\, \delta^\top P_n \psi(\cdot;\, \gamma_n(\tau) + \tilde{\zeta}_n \delta,\, \tau) \leq 0$, which yields (S.1.7).

The proof is finished once we minorize the empirical score $\delta^\top P_n \psi\big( \cdot;\, \gamma_n(\tau) + a(s_{n,1} + g_n)\delta,\, \tau \big)$ in (S.1.7) in terms of $s_{n,1} + g_n$. To proceed, observe that under assumptions (A1)-(A3) we have by Lemma S.1.1

$$ \sup_{\|\delta\|=1} \Big| \mathrm{E}\big[ \delta^\top \big\{ \psi(Y,Z;b,\tau) - \psi(Y,Z;\gamma_n(\tau),\tau) - ZZ^\top f_{Y|X}(\gamma_n(\tau)^\top Z|X)(b - \gamma_n(\tau)) \big\} \big] \Big| \leq \xi_m \bar{f}' \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big) \|b - \gamma_n(\tau)\|^2. \quad (S.1.8) $$

Therefore, we have for arbitrary $\|\delta\| = 1$ and $\tau \in \mathcal{T}$ that

$$ \delta^\top P_n \psi\big( \cdot;\, \gamma_n(\tau) + a(s_{n,1} + g_n)\delta,\, \tau \big) \geq -s_{n,1} - g_n + \delta^\top \Big( \mathrm{E}\big[ \psi\big( Y,Z;\, \gamma_n(\tau) + a(s_{n,1} + g_n)\delta,\, \tau \big) \big] - \mathrm{E}\big[ \psi(Y,Z;\gamma_n(\tau),\tau) \big] \Big) \geq a(s_{n,1} + g_n) \inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau)) - s_{n,1} - g_n - \xi_m \bar{f}' \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big)\, a^2 (s_{n,1} + g_n)^2, $$

where for the first inequality we recall the definition $g_n = \sup_{\tau \in \mathcal{T}} \| \mathrm{E}[\psi(Y,Z;\gamma_n(\tau),\tau)] \|$ and $s_{n,1} = \|P_n - P\|_{\mathcal{G}_1}$; the second inequality follows by (S.1.8). Setting $a = 2t/\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau))$ in $\tilde{\zeta}_n$, we see that the right-hand side of the display above is positive when

$$ s_{n,1} + g_n < \frac{2t-1}{t^2} \cdot \frac{\inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde{J}_m(\tau))}{4 \xi_m \bar{f}' \lambda_{\max}(\mathrm{E}[ZZ^\top])}. $$
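The positivity threshold can be checked by elementary algebra (our addition, spelling out the computation the proof refers to):

```latex
% Write \lambda := \inf_{\tau} \lambda_{\min}(\tilde{J}_m(\tau)), s := s_{n,1} + g_n,
% and c := \xi_m \bar{f}' \lambda_{\max}(\mathrm{E}[ZZ^\top]). The lower bound reads
a s \lambda - s - c\, a^2 s^2 > 0
  \iff a\lambda - 1 > c\, a^2 s
  \iff s < \frac{a\lambda - 1}{c\, a^2}.
% Plugging in a = 2t/\lambda gives a\lambda - 1 = 2t - 1 and a^2 = 4t^2/\lambda^2, hence
s < \frac{(2t-1)\,\lambda^2}{4t^2\, c} = \frac{2t-1}{t^2} \cdot \frac{\lambda^2}{4c},
% and (2t-1)/t^2 \ge 1/t for t > 1 yields the threshold \lambda^2/(4tc)
% appearing in the statement of Lemma S.1.2.
```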

Observing that for $t > 1$ we have $(2t-1)/t^2 \geq 1/t$, and plugging $a = 2t/\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde{J}_m(\tau))$ into equation (S.1.7), completes the proof. $\Box$

Lemma S.1.3. Consider the classes of functions $\mathcal{G}_1$, $\mathcal{G}_2(\delta)$ defined in (S.1.2) and (S.1.6), respectively. Under assumptions (A1)-(A3), and provided that $\xi_m = O(n^b)$ for some fixed $b$, we have for some constant $C$ independent of $n$ and all $\kappa_n > 0$,

$$ P\Big( \|P_n - P\|_{\mathcal{G}_1} \geq C \Big[ \Big( \frac{m}{n} \log \xi_m \Big)^{1/2} + \frac{m \xi_m}{n} \log \xi_m + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} \Big] \Big) \leq e^{-\kappa_n}. \quad (S.1.9) $$

For any $\delta_n$ satisfying $\xi_m \delta_n \gg n^{-1}$, we have for sufficiently large $n$ and arbitrary $\kappa_n > 0$,

$$ P\big( \|P_n - P\|_{\mathcal{G}_2(\delta_n)} \geq C \zeta_n(\delta_n, \kappa_n) \big) \leq e^{-\kappa_n}, \quad (S.1.10) $$

where

$$ \zeta_n(t, \kappa_n) := \xi_m^{1/2} t^{1/2} \Big( \frac{m}{n} \log(\xi_m \vee n) \Big)^{1/2} + \frac{m \xi_m}{n} \log(\xi_m \vee n) + n^{-1/2} (\xi_m t \kappa_n)^{1/2} + n^{-1} \xi_m \kappa_n. \quad (S.1.11) $$

Proof of Lemma S.1.3. Observe that for each $f \in \mathcal{G}_1$ we have $|f(x,y)| \leq \xi_m$, and the same holds for $\mathcal{G}_2(\delta)$ for any value of $\delta$. Additionally, arguments similar to those in the proof of Lemma 18 in Belloni et al. (2011), together with Theorem 2.6.7 in van der Vaart and Wellner (1996), imply that, almost surely,

$$ N\big( \mathcal{G}_2(\delta), L_2(P_n); \varepsilon \big) \leq \Big( \frac{A \|F\|_{L_2(P_n)}}{\varepsilon} \Big)^{v_1(m)}, \qquad N\big( \mathcal{G}_1, L_2(P_n); \varepsilon \big) \leq \Big( \frac{A \|F\|_{L_2(P_n)}}{\varepsilon} \Big)^{v_2(m)}, $$

where $A$ is some constant and $v_1(m) = O(m)$, $v_2(m) = O(m)$. Finally, for each $f \in \mathcal{G}_1$ we have

$$ \mathrm{E}[f^2] \leq \sup_{\|a\|=1} a^\top \mathrm{E}[ZZ^\top] a = \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big). $$

On the other hand, each $f \in \mathcal{G}_2(\delta_n)$ satisfies

$$ \mathrm{E}[f^2] \leq \sup_{\|a\|=1}\ \sup_{\|b_1 - b_2\| \leq \delta_n} \mathrm{E}\Big[ (a^\top Z)^2\, 1\big\{ |Y - Z^\top b_1| \leq |Z^\top(b_1 - b_2)| \big\} \Big] \leq \sup_{b \in \mathbb{R}^m}\ \sup_{\|a\|=1} \mathrm{E}\Big[ (a^\top Z)^2\, 1\{ |Y - Z^\top b| \leq \xi_m \delta_n \} \Big] \leq 2 \bar{f} \xi_m \delta_n \lambda_{\max}\big(\mathrm{E}[ZZ^\top]\big). $$

Note that under assumptions (A1)-(A3) the right-hand side is bounded by $c \xi_m \delta_n$, where $c$ is a constant that does not depend on $n$. Thus the bound in (S.2.2) implies that for $\xi_m \delta_n \gg n^{-1}$ we have, for some constant $C$ which is independent of $n$,

$$ \mathrm{E} \|P_n - P\|_{\mathcal{G}_1} \leq C \Big[ \Big( \frac{m}{n} \log(\xi_m \vee n) \Big)^{1/2} + \frac{m \xi_m}{n} \log(\xi_m \vee n) \Big], \quad (S.1.12) $$
$$ \mathrm{E} \|P_n - P\|_{\mathcal{G}_2(\delta_n)} \leq C \Big[ \Big( \xi_m \delta_n\, \frac{m}{n} \log(\xi_m \vee n) \Big)^{1/2} + \frac{m \xi_m}{n} \log(\xi_m \vee n) \Big]. \quad (S.1.13) $$

Thus (S.1.9) and (S.1.10) follow from the bound in (S.2.3) by setting $t = \kappa_n$. $\Box$
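For readers who do not wish to consult Section S.2, the two inputs used repeatedly above have the following standard shape. This is our paraphrase of the usual symmetrization-plus-Talagrand route (see, e.g., Massart (2000) or Koltchinskii (2006)); the precise constants and statements are those of (S.2.2) and (S.2.3):

```latex
% Let \mathcal{G} be a class of functions with envelope bounded by U,
% \sup_{f \in \mathcal{G}} \mathrm{E}[f^2] \le \sigma^2, and a polynomial covering
% number bound with exponent v, as in the proofs above. A Dudley-type entropy
% integral then yields
\mathrm{E}\,\|P_n - P\|_{\mathcal{G}}
  \;\lesssim\; \sigma \sqrt{\frac{v \log(A U/\sigma)}{n}}
  + \frac{v\, U \log(A U/\sigma)}{n}
% (the role of (S.2.2)), and a Talagrand-type concentration inequality gives,
% for every t > 0,
P\Big( \|P_n - P\|_{\mathcal{G}}
      \ge 2\,\mathrm{E}\,\|P_n - P\|_{\mathcal{G}}
      + c_1 \sigma \sqrt{t/n} + c_2\, U\, t/n \Big) \le e^{-t}
% (the role of (S.2.3)). Setting t = \kappa_n reproduces the tail bounds
% (S.1.9) and (S.1.10).
```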

S.1.2 Proof of Theorem 5.2

We begin with the following useful decomposition:

$$ \hat{\beta}(\tau) - \beta_n(\tau) = -n^{-1/2} \tilde{J}_m^{-1}(\tau) \mathbb{G}_n\big( \psi(\cdot; \beta_n(\tau), \tau) \big) + \tilde{J}_m^{-1}(\tau) \sum_{k=1}^{4} R_{n,k}(\tau), \quad (S.1.14) $$

where

$$ R_{n,1}(\tau) := P_n \psi\big( \cdot; \hat{\beta}(\tau), \tau \big), $$
$$ R_{n,2}(\tau) := -\Big( \mu(\hat{\beta}(\tau), \tau) - \mu(\beta_n(\tau), \tau) - \tilde{J}_m(\tau)\big( \hat{\beta}(\tau) - \beta_n(\tau) \big) \Big), $$
$$ R_{n,3}(\tau) := -n^{-1/2} \Big( \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) - \mathbb{G}_n\big( \psi(\cdot; \beta_n(\tau), \tau) \big) \Big), $$
$$ R_{n,4}(\tau) := -\mu(\beta_n(\tau), \tau). $$

Define $r_{n,2}(\tau, u_n) := u_n^\top \tilde{J}_m^{-1}(\tau) R_{n,2}(\tau)$, $r_{n,k}(\tau, u_n) := \big( u_n^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(u_n,D))} R_{n,k}(\tau)$ for $k = 1, 3$, and

$$ r_{n,4}(\tau, u_n) := u_n^\top \tilde{J}_m^{-1}(\tau) R_{n,4}(\tau) + \Big( u_n^\top \tilde{J}_m^{-1}(\tau) - \big( u_n^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(u_n,D))} \Big) \big( R_{n,1}(\tau) + R_{n,3}(\tau) \big). $$

With these definitions we obtain

$$ u_n^\top \big( \hat{\beta}(\tau) - \beta_n(\tau) \big) = -n^{-1/2} u_n^\top \tilde{J}_m^{-1}(\tau) \mathbb{G}_n\big( \psi(\cdot; \beta_n(\tau), \tau) \big) + \sum_{k=1}^{4} r_{n,k}(\tau, u_n). $$

We will now show that the terms $r_{n,k}(\tau, u_n)$ defined above satisfy the bounds given in the statement of Theorem 5.2 if we let $D = c \log n$ for a sufficiently large constant $c$.

The bound on $r_{n,1}$ follows from Lemma S.1.7. To bound $r_{n,2}$, apply Lemma S.1.4 and Lemma S.1.6. To bound $r_{n,3}$, observe that by Lemma S.1.4 the event

$$ \Omega_1 := \Big\{ \sup_{\tau,x} \big| B(x)^\top \hat{\beta}(\tau) - B(x)^\top \beta_n(\tau) \big| \leq C \Big( \tilde{c}_n^2 + \Big( \frac{\xi_m(\log n + \kappa_n)}{n^{1/2}} \Big)^{1/2} \Big) \Big\} $$

has probability at least $1 - (m+1)e^{-\kappa_n}$. Letting $\delta_n := C\big( \tilde{c}_n^2 + \big( \frac{\xi_m(\log n + \kappa_n)}{n^{1/2}} \big)^{1/2} \big)$, we find that on $\Omega_1$,

$$ \sup_{u_n \in S_I^{m-1}}\ \sup_{\tau} |r_{n,3}(\tau, u_n)| \lesssim \|P_n - P\|_{\tilde{\mathcal{G}}_2(\delta_n,\, I(u_n,D),\, I'(u_n,D))}; $$

this follows since

$$ n^{-1/2} \big( u_n^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(u_n,D))} \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) = n^{-1/2} \big( u_n^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(u_n,D))} \mathbb{G}_n\big( \psi^{(I'(u_n,D))}(\cdot; \hat{\beta}(\tau), \tau) \big) $$

and a similar identity holds with $\beta_n$ instead of $\hat{\beta}$. The bound on $r_{n,3}$ now follows from Lemma S.1.5. To bound the first part of $r_{n,4}(\tau, u_n)$, we proceed as in the proof of equation (S.1.18) to obtain

$$ \big| \big( u_n^\top \tilde{J}_m^{-1}(\tau) \big) \mu(\beta_n(\tau), \tau) \big| \leq \frac{1}{2} \bar{f}' \tilde{c}_n^2\, \tilde{E}(u_n, B), $$

where the bound follows after a Taylor expansion. To bound the second part of $r_{n,4}(\tau, u_n)$, note that $\sup_\tau \|R_{n,1}(\tau)\| + \|R_{n,3}(\tau)\| \leq 3\xi_m$ almost surely, and thus choosing $D = c \log n$ with $c$ sufficiently large yields $\big\| \big( u_n^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(u_n,D))} - u_n^\top \tilde{J}_m^{-1}(\tau) \big\| \leq n^{-1} 3^{-1} \xi_m^{-1}$, where we used (A.5). This completes the proof. $\Box$

S.1.2.1 Technical details for the proof of Theorem 5.2

Lemma S.1.4. Under the assumptions of Theorem 5.2, we have for sufficiently large $n$ and any $\kappa_n \ll n/\xi_m^2$,

$$ P\Big( \sup_{\tau,x} \big| B(x)^\top \big( \hat{\beta}(\tau) - \beta_n(\tau) \big) \big| \geq C \Big\{ \Big( \frac{\xi_m^2 \log n}{n} + \frac{\xi_m \kappa_n}{n^{1/2}} \Big)^{1/2} + \tilde{c}_n^2 \Big\} \Big) \leq (m+1)e^{-\kappa_n}, $$

where the constant $C$ does not depend on $n$.

Proof of Lemma S.1.4. Apply (A.5) with $a = B(x)$ to obtain

$$ \big\| B(x)^\top \tilde{J}_m^{-1}(\tau) - \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} \big\| \lesssim m \xi_m \gamma^D, $$

where $I(B(x),D)$ is defined in (A.2) and $\gamma \in (0,1)$ is a constant independent of $n$. Next observe the decomposition

$$ B(x)^\top \big( \hat{\beta}(\tau) - \beta_n(\tau) \big) = -n^{-1/2} \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) + \sum_{k=1}^{4} r_{n,k}(\tau, x), $$

where

$$ r_{n,1}(\tau,x) := \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} P_n \psi\big( \cdot; \hat{\beta}(\tau), \tau \big), $$
$$ r_{n,2}(\tau,x) := -\big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big) \Big( \mu(\hat{\beta}(\tau),\tau) - \mu(\beta_n(\tau),\tau) - \tilde{J}_m(\tau)\big( \hat{\beta}(\tau) - \beta_n(\tau) \big) \Big), $$
$$ r_{n,3}(\tau,x) := -\big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big) \mu(\beta_n(\tau),\tau), $$

and

$$ r_{n,4}(\tau,x) := \Big( B(x)^\top \tilde{J}_m^{-1}(\tau) - \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} \Big) \Big( P_n \psi\big( \cdot; \hat{\beta}(\tau), \tau \big) - n^{-1/2} \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) \Big). $$

Letting $D = c \log n$ for a sufficiently large constant $c$, the bound above yields $\sup_{\tau,x} |r_{n,4}(\tau,x)| \leq n^{-1}$ almost surely. Lemma S.1.7 yields the bound

$$ \sup_{x,\tau} |r_{n,1}(\tau,x)| \lesssim \frac{\xi_m^2 \log n}{n} \quad \text{a.s.} \quad (S.1.15) $$

Let $\delta_n := \sup_{x,\tau} \big| B(x)^\top \hat{\beta}(\tau) - B(x)^\top \beta_n(\tau) \big|$. From Lemma S.1.6 we obtain under (L)

$$ \sup_{x,\tau} |r_{n,2}(\tau,x)| \leq \delta_n^2 \sup_{x,\tau} \mathrm{E}\big[ \big| B(x)^\top \tilde{J}_m^{-1}(\tau) B \big| \big] \lesssim \delta_n^2, \quad (S.1.16) $$

where $\sup_{x,\tau} \mathrm{E}\big[ | B(x)^\top \tilde{J}_m^{-1}(\tau) B | \big] = O(1)$ by assumption (L). Finally, note that

$$ n^{-1/2} \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) = n^{-1/2} \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} \mathbb{G}_n\big( \psi^{(I'(B(x),D))}(\cdot; \hat{\beta}(\tau), \tau) \big), $$

where $I'(B(x),D)$ is defined in (A.3). This yields

$$ \sup_{\tau,x} \Big| n^{-1/2} \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x),D))} \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) \Big| \lesssim \xi_m \sup_{x \in \mathcal{X}} \|P_n - P\|_{\tilde{\mathcal{G}}_1(I(B(x),D),\, I'(B(x),D))}. $$

By the definition of $I(B(x),D)$ and $I'(B(x),D)$, the supremum above ranges over at most $m$ distinct terms. Additionally, $\sup_{x \in \mathcal{X}} |I(B(x), c\log n)| + |I'(B(x), c\log n)| \lesssim \log n$. Thus Lemma S.1.5 yields

$$ P\Big( \sup_{\tau,x} \Big| n^{-1/2} \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big)^{(I(B(x), c\log n))} \mathbb{G}_n\big( \psi(\cdot; \hat{\beta}(\tau), \tau) \big) \Big| \geq C \Big( \frac{\xi_m^2 (\log n)^2}{n} \Big)^{1/2} + C \frac{\xi_m \kappa_n^{1/2}}{n^{1/2}} \Big) \leq m e^{-\kappa_n}. \quad (S.1.17) $$

Finally, from the definition of $\beta_n$ as minimizer we obtain

$$ |r_{n,3}(\tau,x)| = \big| \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big) \mu(\beta_n(\tau),\tau) \big| = \big| \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big) \mathrm{E}\big[ B\big( 1\{Y \leq \beta_n(\tau)^\top B\} - \tau \big) \big] \big| = \big| \big( B(x)^\top \tilde{J}_m^{-1}(\tau) \big) \mathrm{E}\big[ B\big( F_{Y|X}(\beta_n(\tau)^\top B|X) - F_{Y|X}(Q(X;\tau)|X) \big) \big] \big| \leq \frac{1}{2} \bar{f}' \tilde{c}_n^2\, O(1), \quad (S.1.18) $$

where the last step follows after a second-order Taylor expansion, using the identity $\mathrm{E}\big[ B f_{Y|X}(Q(X;\tau)|X)\big( \beta_n(\tau)^\top B - Q(X;\tau) \big) \big] = 0$ (from the definition of $\beta_n$) and making use of (L). Combining this with (S.1.15)-(S.1.17) yields

$$ \delta_n \leq C \Big[ \Big( \frac{\xi_m^2 (\log n)^2}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n^{1/2}}{n^{1/2}} + \frac{\xi_m^2 \log n}{n} + \delta_n^2 + \tilde{c}_n^2 \Big] $$

with probability at least $1 - m e^{-\kappa_n}$. By Lemma S.1.2 we have $P(\delta_n \geq 1/(2C)) \leq e^{-\kappa_n}$ for any $\kappa_n$ satisfying $\xi_m \kappa_n \gg n^{-1}$. This yields the assertion. $\Box$

Lemma S.1.5. Let $\mathcal{Z} := \{ B(x) : x \in \mathcal{X} \}$, where $\mathcal{X}$ is the support of $X$. For $I_1, I_1' \subset \{1, \dots, m\}$, define the classes of functions

$$ \tilde{\mathcal{G}}_1(I_1, I_1') := \big\{ (Z,Y) \mapsto a^\top Z^{(I_1)} \big( 1\{Y \leq Z^\top b^{(I_1')}\} - \tau \big) 1\{\|Z\| \leq \xi_m\} \,\big|\, \tau \in \mathcal{T},\ b \in \mathbb{R}^m,\ a \in S^{m-1} \big\}, \quad (S.1.19) $$
$$ \tilde{\mathcal{G}}_2(\delta, I_1, I_1') := \big\{ (Z,Y) \mapsto a^\top Z^{(I_1)} \big( 1\{Y \leq Z^\top b_1^{(I_1')}\} - 1\{Y \leq Z^\top b_2^{(I_1')}\} \big) 1\{Z \in \mathcal{Z}\} \,\big|\, b_1, b_2 \in \mathbb{R}^m,\ \sup_{v \in \mathcal{Z}} |v^\top b_1 - v^\top b_2| \leq \delta,\ a \in S^{m-1} \big\}. \quad (S.1.20) $$

Under assumptions (A1)-(A3) we have

$$ P\Big( \|P_n - P\|_{\tilde{\mathcal{G}}_1(I_1, I_1')} \geq C \Big[ \Big( \frac{\max(|I_1|, |I_1'|)}{n} \log \xi_m \Big)^{1/2} + \frac{\max(|I_1|, |I_1'|)\, \xi_m}{n} \log \xi_m + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} \Big] \Big) \leq e^{-\kappa_n}, \quad (S.1.21) $$

and for any $\delta_n$ satisfying $\xi_m \delta_n \gg n^{-1}$ we have, for sufficiently large $n$ and arbitrary $\kappa_n > 0$,

$$ P\big( \|P_n - P\|_{\tilde{\mathcal{G}}_2(\delta, I_1, I_1')} \geq C \zeta_n(\delta, I_1, I_1', \kappa_n) \big) \leq e^{-\kappa_n}, \quad (S.1.22) $$

where

$$ \zeta_n(t, I_1, I_1', \kappa_n) := t^{1/2} \Big( \frac{\max(|I_1|, |I_1'|)}{n} \log(\xi_m \vee n) \Big)^{1/2} + \frac{\max(|I_1|, |I_1'|)\, \xi_m}{n} \log(\xi_m \vee n) + n^{-1/2} (t \kappa_n)^{1/2} + n^{-1} \xi_m \kappa_n. $$

Proof of Lemma S.1.5. We begin by observing that

$$ N\big( \tilde{\mathcal{G}}_2(\delta, I_1, I_1'), L_2(P_n); \varepsilon \big) \leq \Big( \frac{A \|F\|_{L_2(P_n)}}{\varepsilon} \Big)^{v_1(m)}, \qquad N\big( \tilde{\mathcal{G}}_1(I_1, I_1'), L_2(P_n); \varepsilon \big) \leq \Big( \frac{A \|F\|_{L_2(P_n)}}{\varepsilon} \Big)^{v_2(m)}, $$

where $v_1(m) = O(\max(|I_1|, |I_1'|))$ and $v_2(m) = O(\max(|I_1|, |I_1'|))$. The proof of the bound for $\tilde{\mathcal{G}}_1(I_1, I_1')$ now follows by similar arguments as the proof of Lemma S.1.3. For a proof of the second part, note that for $f \in \tilde{\mathcal{G}}_2$ we have

$$ \mathrm{E}[f^2] \leq \sup_{\|a\|=1}\ \sup_{b_1, b_2 \in R(\delta_n)} \mathrm{E}\Big[ \big( a^\top B^{(I_1)} \big)^2\, 1\big\{ |Y - B^\top b_1^{(I_1')}| \leq |B^\top( b_1^{(I_1')} - b_2^{(I_1')} )| \big\} \Big] \leq \sup_{b \in \mathbb{R}^m}\ \sup_{\|a\|=1} \mathrm{E}\Big[ \big( a^{(I_1)} \big)^\top B B^\top a^{(I_1)}\, 1\{ |Y - B^\top b^{(I_1')}| \leq \delta_n \} \Big] \leq 2 \bar{f} \delta_n \lambda_{\max}\big(\mathrm{E}[BB^\top]\big), $$

where we defined $R(\delta) := \big\{ b_1, b_2 \in \mathbb{R}^m : \sup_{v \in \mathcal{Z}} |v^\top b_1 - v^\top b_2| \leq \delta \big\}$. The rest of the proof follows by similar arguments as the proof of Lemma S.1.3. $\Box$

Lemma S.1.6. Under assumptions (A1)-(A3), we have for any $a, b \in \mathbb{R}^m$

$$ \big| a^\top \tilde{J}_m(\tau)^{-1} \mu(b,\tau) - a^\top \tilde{J}_m(\tau)^{-1} \mu(\beta_n(\tau),\tau) - a^\top(b - \beta_n(\tau)) \big| \leq \bar{f}' \sup_x \big| B(x)^\top b - B(x)^\top \beta_n(\tau) \big|^2\, \mathrm{E}\big[ \big| a^\top \tilde{J}_m(\tau)^{-1} B \big| \big]. $$

Proof of Lemma S.1.6. Note that $\mu'(\beta_n(\tau),\tau) = \mathrm{E}[BB^\top f_{Y|X}(B^\top \beta_n(\tau)|X)] = \tilde{J}_m(\tau)$. Additionally, we have

$$ \mu(b,\tau) = \mu(\beta_n(\tau),\tau) + \mu'(\bar{b},\tau)(b - \beta_n(\tau)), $$

where $\bar{b} = b + \lambda_{b,\tau}(\beta_n(\tau) - b)$ for some $\lambda_{b,\tau} \in [0,1]$. Moreover,

$$ a^\top \big[ \tilde{J}_m(\tau)^{-1} \mu(b,\tau) - \tilde{J}_m(\tau)^{-1} \mu(\beta_n(\tau),\tau) - (b - \beta_n(\tau)) \big] = a^\top \tilde{J}_m(\tau)^{-1} \big[ \mu'(\bar{b},\tau) - \tilde{J}_m(\tau) \big](b - \beta_n(\tau)), $$

and thus

$$ \big| a^\top \tilde{J}_m(\tau)^{-1} \mu(b,\tau) - a^\top \tilde{J}_m(\tau)^{-1} \mu(\beta_n(\tau),\tau) - a^\top(b - \beta_n(\tau)) \big| = \Big| \mathrm{E}\Big[ a^\top \tilde{J}_m(\tau)^{-1} B B^\top \big( f_{Y|X}(B^\top \bar{b}|X) - f_{Y|X}(B^\top \beta_n(\tau)|X) \big)(b - \beta_n(\tau)) \Big] \Big| \leq \bar{f}'\, \mathrm{E}\Big[ \big| a^\top \tilde{J}_m(\tau)^{-1} B \big|\, \big| B^\top(b - \beta_n(\tau)) \big|^2 \Big] \leq \bar{f}' \sup_x \big| B(x)^\top b - B(x)^\top \beta_n(\tau) \big|^2\, \mathrm{E}\big[ \big| a^\top \tilde{J}_m(\tau)^{-1} B \big| \big]. \quad \Box $$

Lemma S.1.7. Under assumptions (A1)-(A3) and (L), we have for any $a \in \mathbb{R}^m$ having zero entries everywhere except at $L$ consecutive positions,

$$ \big| a^\top P_n \psi\big( \cdot; \hat{\beta}(\tau), \tau \big) \big| \leq \frac{(L + 2r) \|a\| \xi_m}{n}. $$

Proof of Lemma S.1.7. From standard arguments based on the optimality conditions of quantile regression (p. 35 of Koenker (2005); see also equation (2.2) on p. 224 of Knight (2008)), we know that for any $\tau \in \mathcal{T}$,

$$ P_n \psi\big( \cdot; \hat{\beta}(\tau), \tau \big) = \frac{1}{n} \sum_{i=1}^n B_i \big( 1\{Y_i \leq B_i^\top \hat{\beta}(\tau)\} - \tau \big) = \frac{1}{n} \sum_{i \in H_\tau} v_i B_i, $$

where $v_i \in [-1,1]$ and $H_\tau = \{ i : Y_i = B_i^\top \hat{\beta}(\tau) \}$. Since $a$ has at most $L$ non-zero entries, the dimension of the subspace spanned by $\{ B_i : a^\top B_i \neq 0 \}$ is at most $L + 2r$ [each vector $B_i$ by construction has at most $r$ non-zero entries and all of those entries are consecutive]. Since the conditional distribution of $Y$ given the covariates has a density, the data are in general position almost surely, i.e. no more than $k$ of the points $(B_i, Y_i)$ lie in any $k$-dimensional linear space. It follows that the cardinality of the set $H_\tau \cap \{ i : a^\top B_i \neq 0 \}$ is bounded by $L + 2r$, and the assertion follows after an elementary calculation. $\Box$

S.1.3

Proof of Theorem 5.4

The statement follows from Theorem 5.1 if we prove that the vector $\gamma_n^\dagger(\tau)$ satisfies
\[
\sup_{\tau\in\mathcal{T}} \big\|\mu(\gamma_n^\dagger(\tau);\tau)\big\| = O(\xi_m c_n^{\dagger 2}) \tag{S.1.23}
\]
as $c_n^\dagger = o(\xi_m^{-1})$ in Condition (C1), and establish the identity in (5.10). For the identity (5.10), we first observe the representation
\[
J_m(\tau) = \begin{pmatrix} M_1(\tau) + A(\tau)M_2(\tau)A(\tau)^\top & A(\tau)M_2(\tau) \\ M_2(\tau)A(\tau)^\top & M_2(\tau) \end{pmatrix}, \tag{S.1.24}
\]
which follows from (3.5) and
\[
E\big[(V - A(\tau)\widetilde{Z}(W))\widetilde{Z}(W)^\top f_{Y|X}(Q(X;\tau)|X)\big] = 0, \quad \text{for all } \tau\in\mathcal{T}.
\]

To simplify the notation, we suppress the argument $\tau$ in the following matrix calculations. Recall the following identity for the inverse of a $2\times 2$ block matrix (see equation (6.0.8) on p. 165 of Puntanen and Styan (2005)):
\[
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} (A-BD^{-1}C)^{-1} & -(A-BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A-BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}.
\]
Identifying the blocks in the representation (S.1.24) with the blocks in the above representation yields the result after some simple calculations. For a proof of (S.1.23), observe that
\[
\mu(\gamma_n^\dagger(\tau);\tau) = E\big[(V^\top, \widetilde{Z}(W)^\top)^\top \big(F_{Y|X}(\gamma_n^\dagger(\tau)^\top Z|X) - \tau\big)\big].
\]

Now on the one hand we have, uniformly in $\tau\in\mathcal{T}$,
\begin{align*}
&\Big\| E\big[V\big(F_{Y|X}(\alpha(\tau)^\top V + \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)|X) - \tau\big)\big]\Big\| \\
&= \Big\| E\big[V f_{Y|X}(Q(X;\tau)|X)\big(Q(X;\tau) - \alpha(\tau)^\top V - \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)\big)\big]\Big\| + O(c_n^{\dagger 2}) \\
&= \Big\| E\big[V f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)\big)\big]\Big\| + O(c_n^{\dagger 2}) \\
&= \Big\| E\big[\big(V - h_{VW}(W;\tau) + h_{VW}(W;\tau)\big) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)\big)\big]\Big\| + O(c_n^{\dagger 2}) \\
&= \Big\| E\big[h_{VW}(W;\tau) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)\big)\big]\Big\| + O(c_n^{\dagger 2}) \\
&= \Big\| E\big[\big(h_{VW}(W;\tau) - A(\tau)\widetilde{Z}(W) + A(\tau)\widetilde{Z}(W)\big) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)\big)\big]\Big\| + O(c_n^{\dagger 2}) \\
&= \Big\| E\big[\big(h_{VW}(W;\tau) - A(\tau)\widetilde{Z}(W)\big) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \widetilde{Z}(W)\big)\big]\Big\| + O(c_n^{\dagger 2}) \\
&= O(c_n^{\dagger 2} + \lambda_n c_n^\dagger).
\end{align*}

Here, the first equality follows after a Taylor expansion taking into account that, by the definition of the conditional quantile function, $F_{Y|X}(Q(X;\tau)|X)\equiv\tau$. The fourth equality is a consequence of (3.5), and the sixth equality follows since
\[
E\big[\widetilde{Z}(W) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top\widetilde{Z}(W)\big)\big] = 0 \tag{S.1.25}
\]
by the definition of $\beta_n^\dagger(\tau)$ as a minimizer. The last line follows from the Cauchy-Schwarz inequality. On the other hand,
\begin{align*}
&E\big[\widetilde{Z}(W)\big(F_{Y|X}(\alpha(\tau)^\top V + \beta_n^\dagger(\tau)^\top\widetilde{Z}(W)|X) - \tau\big)\big] \\
&= E\big[\widetilde{Z}(W) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top\widetilde{Z}(W)\big)\big]
+ \frac{1}{2} E\big[\widetilde{Z}(W) f'_{Y|X}(\zeta(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top\widetilde{Z}(W)\big)^2\big].
\end{align*}
By (S.1.25), the first term in the representation above is zero, and the norm of the second term is of the order $O(\xi_m c_n^{\dagger 2})$. This completes the proof.
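As a side check, the $2\times 2$ block-inverse identity of Puntanen and Styan (2005) invoked in the proof above is easy to verify numerically. The following sketch, using an arbitrary random positive definite matrix and block sizes of our choosing, is for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 4

# A random symmetric positive definite matrix, partitioned into blocks.
M = rng.normal(size=(p + q, p + q))
M = M @ M.T + (p + q) * np.eye(p + q)
A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]

# Schur complement of D in M.
S = A - B @ np.linalg.inv(D) @ C
Si, Di = np.linalg.inv(S), np.linalg.inv(D)

# Assemble the inverse block by block, following equation (6.0.8)
# of Puntanen and Styan (2005).
top = np.hstack([Si, -Si @ B @ Di])
bot = np.hstack([-Di @ C @ Si, Di + Di @ C @ Si @ B @ Di])
M_inv = np.vstack([top, bot])

print(np.allclose(M_inv, np.linalg.inv(M)))
```

Applying this formula to the representation (S.1.24) is exactly the "simple calculation" that produces the identity (5.10).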

APPENDIX S.2: Auxiliary Results

S.2.1

Results on empirical process theory

In this section, we collect some basic results from empirical process theory needed in our proofs. Denote by $\mathcal{G}$ a class of functions that satisfies $|f(x)| \le F(x) \le U$ for every $f\in\mathcal{G}$, and let $\sigma^2 \ge \sup_{f\in\mathcal{G}} P f^2$. Additionally, assume that for some $A>0$, $V>0$ and all $\varepsilon>0$,
\[
N(\varepsilon, \mathcal{G}, L_2(P_n)) \le \Big(\frac{A\|F\|_{L_2(P_n)}}{\varepsilon}\Big)^V. \tag{S.2.1}
\]
Note that if $\mathcal{G}$ is a VC-class, then $V$ is the VC-index of the set of subgraphs of functions in $\mathcal{G}$. In that case, the symmetrization inequality and inequality (2.2) from Koltchinskii (2006) yield
\[
E\|P_n - P\|_{\mathcal{G}} \le c_0\bigg[\sigma\Big(\frac{V}{n}\log\frac{A\|F\|_{L_2(P)}}{\sigma}\Big)^{1/2} + \frac{VU}{n}\log\frac{A\|F\|_{L_2(P)}}{\sigma}\bigg] \tag{S.2.2}
\]
for a universal constant $c_0>0$, provided that $1 \ge \sigma^2 > \mathrm{const}\times n^{-1}$ [in fact, the inequality in Koltchinskii (2006) is stated for $\sigma^2 = \sup_{f\in\mathcal{G}} P f^2$; however, this is not a problem since we can replace $\mathcal{G}$ by $\mathcal{G}\sigma/(\sup_{f\in\mathcal{G}} P f^2)^{1/2}$]. The second inequality (a refined version of Talagrand's concentration inequality) states that for any countable class of measurable functions $\mathcal{F}$ with elements mapping into $[-M, M]$,
\[
P\Big\{\|P_n - P\|_{\mathcal{F}} \ge 2E\|P_n - P\|_{\mathcal{F}} + c_1 n^{-1/2}\Big(\sup_{f\in\mathcal{F}} P f^2\Big)^{1/2}\sqrt{t} + n^{-1} c_2 M t\Big\} \le e^{-t} \tag{S.2.3}
\]
for all $t>0$ and universal constants $c_1, c_2 > 0$. This is a special case of Theorem 3 in Massart (2000) [in the notation of that paper, set $\varepsilon = 1$].

Lemma S.2.1 (Lemma 7.1 of Kley et al. (2015)). Let $\{\mathbb{G}_t : t\in T\}$ be a separable stochastic process with $\|\mathbb{G}_s - \mathbb{G}_t\|_\Psi \le C d(s,t)$ ($\|\cdot\|_\Psi$ is defined in (A.32)) for all $s,t$ satisfying $d(s,t) \ge \bar\omega/2 \ge 0$. Denote by $D(\varepsilon, d)$ the packing number of the metric space $(T,d)$. Then, for any $\delta>0$, $\omega \ge \bar\omega$, there exists a random variable $S_1$ and a constant $K<\infty$ such that
\[
\sup_{d(s,t)\le\delta} |\mathbb{G}_s - \mathbb{G}_t| \le S_1 + 2\sup_{d(s,t)\le\bar\omega,\, t\in \widetilde{T}} |\mathbb{G}_s - \mathbb{G}_t|, \tag{S.2.4}
\]
where the set $\widetilde{T}$ contains at most $D(\bar\omega, d)$ points, and $S_1$ satisfies
\[
\|S_1\|_\Psi \le K\bigg[\int_{\bar\omega/2}^{\omega} \Psi^{-1}\big(D(\varepsilon,d)\big)\,d\varepsilon + (\delta + 2\bar\omega)\,\Psi^{-1}\big(D^2(\omega,d)\big)\bigg] \tag{S.2.5}
\]
and
\[
P(|S_1| > x) \le \bigg[\Psi\bigg(x\Big/\Big\{8K\Big[\int_{\bar\omega/2}^{\omega}\Psi^{-1}\big(D(\varepsilon,d)\big)\,d\varepsilon + (\delta+2\bar\omega)\,\Psi^{-1}\big(D^2(\omega,d)\big)\Big]\Big\}\bigg)\bigg]^{-1}. \tag{S.2.6}
\]

S.2.2

Covering number calculation

A few useful lemmas on covering numbers are given in this section.

Lemma S.2.2. Suppose $\mathcal{F}$ and $\mathcal{G}$ are two function classes with envelopes $F$ and $G$, respectively.

1. The class $\mathcal{F} - \mathcal{G} := \{f - g \,|\, f\in\mathcal{F}, g\in\mathcal{G}\}$ has envelope $F+G$ and
\[
\sup_Q N\big(\varepsilon\|F+G\|_{Q,2}, \mathcal{F}-\mathcal{G}, L_2(Q)\big) \le \sup_Q N\Big(\frac{\varepsilon\|F\|_{Q,2}}{\sqrt{2}}, \mathcal{F}, L_2(Q)\Big)\, \sup_Q N\Big(\frac{\varepsilon\|G\|_{Q,2}}{\sqrt{2}}, \mathcal{G}, L_2(Q)\Big), \tag{S.2.7}
\]

2. The class $\mathcal{F}\cdot\mathcal{G} := \{fg : f\in\mathcal{F}, g\in\mathcal{G}\}$ has envelope $FG$ and
\[
\sup_Q N\big(\varepsilon\|FG\|_{Q,2}, \mathcal{F}\cdot\mathcal{G}, L_2(Q)\big) \le \sup_Q N\Big(\frac{\varepsilon\|F\|_{Q,2}}{2}, \mathcal{F}, L_2(Q)\Big)\, \sup_Q N\Big(\frac{\varepsilon\|G\|_{Q,2}}{2}, \mathcal{G}, L_2(Q)\Big), \tag{S.2.8}
\]

where the suprema are taken over the appropriate subsets of all finitely discrete probability measures $Q$.

Proof of Lemma S.2.2. 1. It is obvious that $|f-g| \le |f| + |g| \le F + G$ for any $f\in\mathcal{F}$ and $g\in\mathcal{G}$. Hence, $F+G$ is an envelope for the function class $\mathcal{F}-\mathcal{G}$. Suppose that $A = \{f_1,\dots,f_J\}$ and $B = \{g_1,\dots,g_K\}$ are the centers of an $\frac{\varepsilon\|F\|_{Q,2}}{\sqrt{2}}$-net for $\mathcal{F}$ and an $\frac{\varepsilon\|G\|_{Q,2}}{\sqrt{2}}$-net for $\mathcal{G}$, respectively. For any $f-g$, there exist $f_j$ and $g_k$ such that
\[
\|(f_j - g_k) - (f-g)\|_{Q,2}^2 = \|(f_j - f) - (g_k - g)\|_{Q,2}^2 \le 2\big(\|f_j - f\|_{Q,2}^2 + \|g_k - g\|_{Q,2}^2\big) \le \varepsilon^2\big(\|F\|_{Q,2}^2 + \|G\|_{Q,2}^2\big) \le \varepsilon^2\|F+G\|_{Q,2}^2,
\]
where the last inequality follows from the fact that both $F$ and $G$ are nonnegative. Hence, $\{f_j - g_k : 1\le j\le J,\ 1\le k\le K\}$ forms an $\varepsilon\|F+G\|_{Q,2}$-net for the class $\mathcal{F}-\mathcal{G}$, with cardinality $JK$.

2. See Lemma 6 of Belloni et al. (2011).

For any fixed vector $u\in\mathbb{R}^m$ and $\delta>0$, recall the function classes
\begin{align*}
\mathcal{G}_3(u) &= \big\{(Z,Y)\mapsto u^\top J_m(\tau)^{-1} Z \,\big|\, \tau\in\mathcal{T}\big\},\\
\mathcal{G}_4 &= \big\{(X,Y)\mapsto \mathbf{1}\{Y \le Q(X;\tau)\} - \tau \,\big|\, \tau\in\mathcal{T}\big\},\\
\mathcal{G}_6(u,\delta) &= \big\{(Z,Y)\mapsto u^\top\{J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1}\} Z \,\big|\, \tau_1,\tau_2\in\mathcal{T},\ |\tau_1-\tau_2|\le\delta\big\},\\
\mathcal{G}_7(\delta) &= \big\{(X,Y)\mapsto \mathbf{1}\{Y \le Q(X;\tau_1)\} - \mathbf{1}\{Y \le Q(X;\tau_2)\} - (\tau_1-\tau_2) \,\big|\, \tau_1,\tau_2\in\mathcal{T},\ |\tau_1-\tau_2|\le\delta\big\}.
\end{align*}
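The net construction in part 1 of Lemma S.2.2 can be illustrated on a finite toy example, identifying each function with its vector of values at sample points so that $L_2(Q)$ is the root-mean-square norm under the empirical measure. The toy classes, the greedy net construction, and all numerical choices below are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)

def norm(v):
    # L2(Q) norm with Q the empirical measure on the 50 grid points.
    return np.sqrt(np.mean(v ** 2))

def greedy_net(funcs, radius):
    # Greedy cover: keep a function as a center only if it is farther
    # than `radius` from all centers chosen so far; every member of the
    # class is then within `radius` of some center.
    centers = []
    for f in funcs:
        if all(norm(f - c) > radius for c in centers):
            centers.append(f)
    return centers

# Two toy classes, each function identified with its values on the grid.
F = [a * np.sin(x) for a in np.linspace(-1, 1, 40)]
G = [b * np.cos(x) for b in np.linspace(-1, 1, 40)]
envF = np.max(np.abs(F), axis=0)  # pointwise envelope of F
envG = np.max(np.abs(G), axis=0)  # pointwise envelope of G

eps = 0.3
netF = greedy_net(F, eps * norm(envF) / np.sqrt(2))
netG = greedy_net(G, eps * norm(envG) / np.sqrt(2))

# Every difference f - g lies within eps * ||F + G|| of some f_j - g_k.
gap = max(
    min(norm((f - g) - (fj - gk)) for fj in netF for gk in netG)
    for f in F for g in G
)
print(gap, eps * norm(envF + envG), len(netF) * len(netG))
```

The product net $\{f_j - g_k\}$ has cardinality $JK$, mirroring the product of covering numbers on the right-hand side of (S.2.7).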

Recall the following Lipschitz continuity property of $J_m^{-1}(\tau)$, which follows from Lemma 13 of Belloni et al. (2011): for $\tau_1,\tau_2\in\mathcal{T}$,
\[
\big\|J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1}\big\| \le \frac{\bar f'}{f_{\min}}\Big(\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))\Big)^{-2}\lambda_{\max}(E[ZZ^\top])\,|\tau_1-\tau_2|
=: C_0\Big(\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))\Big)^{-1}|\tau_1-\tau_2|, \tag{S.2.9}
\]
where $C_0 = \dfrac{\bar f'\,\lambda_{\max}(E[ZZ^\top])}{f_{\min}\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))}$.

Lemma S.2.3. $\mathcal{G}_3(u)$ has an envelope $G_3(Z) = \|u\|\xi_m[\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))]^{-1}$ and
\[
N\big(\varepsilon\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)\big) \le \frac{C_0}{\varepsilon},
\]
where $C_0 = \dfrac{\bar f'\,\lambda_{\max}(E[ZZ^\top])}{f_{\min}\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))} < \infty$, for any probability measure $Q$ and any $\varepsilon > 0$.

Proof of Lemma S.2.3. By (S.2.9), for any $\tau_1,\tau_2\in\mathcal{T}$ and $u$,
\[
\big|u^\top J_m(\tau_1)^{-1} Z - u^\top J_m(\tau_2)^{-1} Z\big| \le \|u\|\xi_m \frac{\bar f'}{f_{\min}}\lambda_{\max}(E[ZZ^\top])\Big[\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))\Big]^{-2}|\tau_1-\tau_2| = C_0\,\|G_3\|_{L_2(Q)}\,|\tau_1-\tau_2|.
\]
Applying the relation between covering and bracketing numbers on p. 84 and Theorem 2.7.11 of van der Vaart and Wellner (1996) yields, for each $u$ and any probability measure $Q$,
\[
N\big(\varepsilon\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)\big) \le N_{[\,]}\big(2\varepsilon\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)\big) \le N\Big(\frac{\varepsilon}{C_0}, \mathcal{T}, |\cdot|\Big) \le \frac{C_0}{\varepsilon}.
\]

Lemma S.2.4.

1. $\mathcal{G}_4$ is a VC-class with VC-index 2.

2. The envelopes for $\mathcal{G}_6(u,\delta)$ and $\mathcal{G}_7(\delta)$ are $G_6 = \|u\|\xi_m[\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))]^{-1} C_0\delta$ and $G_7 = 2$, respectively. Furthermore, it holds for any fixed $\varepsilon$ and $\delta \le |\mathcal{T}|$ that
\[
N\big(\varepsilon\|G_6\|_{L_2(Q)}, \mathcal{G}_6(u,\delta), L_2(Q)\big) \le \Big(\frac{\sqrt{2}\,C_0}{\varepsilon}\Big)^2, \tag{S.2.10}
\]
\[
N\big(\varepsilon\|G_7\|_{L_2(Q)}, \mathcal{G}_7(\delta), L_2(Q)\big) \le \Big(\frac{A_7}{\varepsilon}\Big)^2, \tag{S.2.11}
\]
where $A_7$ is a universal constant and $Q$ is an arbitrary probability measure.

Proof of Lemma S.2.4. 1. Due to the fact that $Q(X;\tau)$ is monotone in $\tau$, it can be argued with a basic VC-subgraph argument that $\mathcal{G}_4$ has VC-index 2, under the definition given on p. 135 of van der Vaart and Wellner (1996).

2. By (S.2.9), the envelope for $\mathcal{G}_6(u,\delta)$ is $\|u\|\xi_m[\inf_{\tau\in\mathcal{T}}\lambda_{\min}(J_m(\tau))]^{-1}C_0\delta$. The envelope for $\mathcal{G}_7(\delta)$ is obvious. By the fact that $\mathcal{G}_7(\delta)\subset \mathcal{G}_4 - \mathcal{G}_4$ and the covering number of $\mathcal{G}_4$ (implied by Theorem 2.6.7 of van der Vaart and Wellner (1996)), (S.2.11) thus follows by (S.2.7) of Lemma S.2.2. As for (S.2.10), we note that $\mathcal{G}_6(u,\delta)\subset \mathcal{G}_3(u) - \mathcal{G}_3(u)$. Then, (S.2.7) of Lemma S.2.2 and Lemma S.2.3 imply
\[
N\big(\varepsilon\|G_6\|_{L_2(Q)}, \mathcal{G}_6(u,\delta), L_2(Q)\big) \le N\Big(\frac{\varepsilon}{\sqrt{2}}\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)\Big)^2 \le \Big(\frac{\sqrt{2}\,C_0}{\varepsilon}\Big)^2,
\]
where $Q$ is an arbitrary probability measure.