Nonparametric Kernel Regression with Multiple Predictors and Multiple Shape Constraints

Pang Du, Christopher F. Parmeter and Jeffrey S. Racine



June 2, 2011

Abstract

Nonparametric smoothing under shape constraints has recently received much well-deserved attention. Powerful methods have been proposed for imposing a single shape constraint such as monotonicity and concavity on univariate functions. In this paper, we extend the monotone kernel regression method in Hall and Huang (2001) to the multivariate and multi-constraint setting. We impose equality and/or inequality constraints on a nonparametric kernel regression model and its derivatives. A bootstrap procedure is also proposed for testing the validity of the constraints. Consistency of our constrained kernel estimator is provided through an asymptotic analysis of its relationship with the unconstrained estimator. Theoretical underpinnings for the bootstrap procedure are also provided. Illustrative Monte Carlo results are presented and an application is considered.

Key words: Shape restrictions, nonparametric regression, multivariate kernel estimation, hypothesis testing∗

∗Pang Du is Assistant Professor at Department of Statistics, Virginia Tech, Blacksburg, VA 24061 (Email: [email protected]). Christopher Parmeter is Assistant Professor at Department of Economics, University of Miami, Coral Gables, FL 33124, USA (Email: [email protected]). Jeffrey Racine is Professor at Department of Economics, McMaster University, Hamilton, Ontario, L8S 4M4, Canada (Email: [email protected]). We would like to thank but not implicate Daniel Wikström for inspiring conversations and Li-Shan Huang and Peter Hall for their insightful comments and suggestions. The research of Du is supported by NSF DMS-1007126. Racine would like to gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC: www.nserc.ca), the Social Sciences and Humanities Research Council of Canada (SSHRC: www.sshrc.ca), and the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca).


1 Introduction

Imposing shape constraints on a regression model is a necessary component of sound applied data analysis. For example, imposing monotonicity and concavity constraints is often needed in a range of application domains. Early developments in the absence of smoothness restrictions can be found in Robertson et al. (1988) and references therein. When data are believed to have nonlinear structure in addition to obeying shape constraints, nonparametric smoothing methods such as kernel smoothing and splines are often used. For example, restricted kernel smoothing has been considered by Braun and Hall (2001), Hall and Huang (2001), Hall and Kang (2005), Mammen (1991), and Mukerjee (1988), and restricted spline smoothing has been studied by Mammen and Thomas-Agnan (1999), Ramsay (1988), Wang and Shen (2010), and Wright and Wegman (1980), to name but a few. As pointed out in the comprehensive review by Mammen et al. (2001), a common limitation of these approaches is that they only consider shape constraints on univariate functions, and often only one constraint at a time is entertained. In practice, however, many applications necessitate imposing multiple shape constraints on multivariate functions. Examples include dose-response studies where multiple drug combinations are applied and the response function is often assumed to be monotone in the amount of each drug, or economic studies where the response function must satisfy coordinate-wise monotonicity, coordinate-wise concavity, and constant returns to scale, which essentially requires first-order partial derivatives from a double-log model to sum to one.

For such problems, Gallant (1982) and Gallant and Golub (1984) proposed a series-based estimator called the Flexible Fourier Form (Gallant, 1981) whose coefficients can be restricted to impose the relevant shape constraints. Although the method can handle multiple constraints, it is hard to incorporate certain common constraints such as monotonicity. Villalobos and Wahba (1987) proposed

a constrained thin-plate spline estimator and applied it to posterior probability estimation in classification problems where the probability function must lie between 0 and 1. But the asymptotic properties of their estimator are not well understood. Matzkin (1991, 1992) studied constrained maximum likelihood estimators using interpolation, but these estimators are not smooth and are hard to generalize beyond coordinate-wise monotonicity and concavity. Mammen et al. (2001) discussed extending their projection framework for constrained smoothing to nonparametric additive models, which can be considered a special case of the general constraints to be described below. Also, none of the aforementioned approaches for multiple constraints propose hypothesis testing procedures for the constraints themselves.

In this paper, we propose a kernel smoothing method that can handle multiple general shape constraints for multivariate functions. Our method is a generalization of Hall and Huang (2001), where monotone regression was considered for a general class of kernel smoothers. Similar to Hall and Huang (2001), our estimator is constructed by introducing weights for each response data point which can dampen or magnify the impact of any observation. In order to deliver an estimate satisfying the shape constraints, the weights are selected to minimize their distance from the uniform weights of the unconstrained estimator while obeying the constraints. In Hall and Huang (2001), this distance was measured by a power divergence metric introduced in Cressie and Read (1984), which has a rather complicated form and is hard to generalize. Instead, we resort to the well-known $l_2$ metric, which is much simpler and has all the desired properties of the power divergence metric (Theorem 2.1). Under certain conditions, we generalize the consistency results of (Hall and Huang, 2001, Theorem 4.3) to our multivariate and multiple constraint setting. In essence, when the shape constraints are non-binding (i.e., strictly satisfied) on the domain, the restricted estimator is asymptotically and numerically equivalent to the unrestricted estimator.


When the shape constraints are non-binding everywhere except for a certain area of measure 0, the restricted and unrestricted estimators are still very close to each other except in a neighborhood of the binding area.

Besides the multivariate and multi-constraint extension, we also propose a bootstrap procedure to test the validity of the shape constraints. Inference for shape restrictions in nonparametric regression settings has attracted much attention in the past decade. For example, Hall and Heckman (2000) developed a bootstrap test for monotonicity whose test statistic is formulated around the slope estimates of linear regression models fitted over small intervals. Ghosal et al. (2000) introduced a statistic based on a U-process measuring the discordance between predictor and response. Both of these approaches are specifically designed for monotonicity constraints and it is difficult to extend them to our multivariate and multi-constraint setting. Yatchew and Härdle (2006) employ a residual-based test to check for monotonicity and convexity simultaneously in a regression setting, but their result is for univariate functions and requires that the constraints be strictly satisfied. Our bootstrap procedure originates from Hall et al. (2001), who tested the monotonicity of hazard functions but provided no theoretical justification for the procedure. Although the test statistic is difficult to analyze asymptotically, we are able to provide asymptotic results for its numerical version when shape constraints are satisfied on a sufficiently dense grid of points. The derivation of our result takes advantage of the simple form of the $l_2$ metric that may not be available for the power divergence metric, an appealing feature of the $l_2$ metric we adopt. Another appealing aspect of our method is that it only involves quadratic programming and can be readily implemented using standard sequential quadratic programming software. We construct our kernels using the multivariate kernel method developed in Li and Racine (2007), which can incorporate both categorical and continuous covariates.


The rest of this paper proceeds as follows. Section 2 outlines the basic approach, then lays out a general theory for our constrained estimator in the presence of linear shape constraints and presents a simple test of the validity of the constraints. Section 3 considers a number of simulated applications, examines the finite-sample performance of the proposed test, and presents an empirical application involving technical efficiency on Indonesian rice farms. Section 4 offers some concluding remarks, and the Appendix provides the technical proofs of the theorems presented in Section 2.

2 Constrained Kernel Regression Estimator

2.1 The Estimator

In what follows we let $\{Y_i, X_i\}_{i=1}^n$ denote sample pairs of response and explanatory variables where $Y_i$ is a scalar, $X_i$ is of dimension $r$, and $n$ denotes the sample size. The goal is to estimate the unknown average response $g(x) \equiv E(Y|X = x)$ subject to constraints on $g^{(s)}(x)$, where $s$ is an $r$-vector corresponding to the dimension of $x$. In what follows, the elements of $s$ represent the order of the partial derivative corresponding to each element of $x$. Thus $s = (0, 0, \ldots, 0)$ represents the function itself, while $s = (1, 0, \ldots, 0)$ represents $\partial g(x)/\partial x_1$. In general, for $s = (s_1, s_2, \ldots, s_r)$ we have
$$g^{(s)}(x) = \frac{\partial^{s_1}}{\partial x_1^{s_1}} \cdots \frac{\partial^{s_r}}{\partial x_r^{s_r}}\, g(x). \tag{1}$$

We consider the class of kernel regression smoothers that can be written as linear combinations of the response $Y_i$, i.e.,
$$\hat g(x) = \sum_{i=1}^n A_i(x) Y_i, \tag{2}$$

where $A_i(x)$ is a local weight function. This class of kernel smoothers includes the Nadaraya-Watson estimator (Nadaraya (1965), Watson (1964)), the Priestley-Chao estimator (Priestley and Chao (1972)), the Gasser-Müller estimator (Gasser and Müller (1979)) and the local polynomial estimator (Fan (1992)), among others. We presume that the reader wishes to impose constraints on the estimate $\hat g(x)$ of the form
$$l(x) \le \hat g^{(s)}(x) \le u(x) \tag{3}$$

for arbitrary $l(\cdot)$, $u(\cdot)$, and $s$, where $l(\cdot)$ and $u(\cdot)$ represent (local) lower and upper bounds, respectively. For some applications, $s = (0, \ldots, 0, 1, 0, \ldots, 0)$ would be of particular interest, say for example when the partial derivative represents a budget share and therefore must lie in $[0, 1]$. Or, $s = (0, 0, \ldots, 0)$ might be of interest when an outcome must be bounded (i.e., $\hat g(x)$ could represent a probability and hence must lie in $[0, 1]$, but this could be violated when using, say, a local linear smoother). Additional types of constraints that could be imposed in this framework are (log-)supermodularity, (log-)convexity, and quasiconvexity, all of which focus on second-order or cross-partial derivatives. Or, $l(\cdot) = u(\cdot)$ might be required (i.e., equality rather than inequality constraints), such as when imposing adding-up constraints, say, when the sum of the budget shares must equal one, or when imposing homogeneity of a particular degree, by way of example. The approach we describe is quite general. It is firmly embedded in a conventional multivariate kernel framework, and admits arbitrary combinations of constraints (i.e., for any $s$ or combination thereof) subject to the obvious caveat that the constraints must be internally consistent.

Following Hall and Huang (2001), we consider a generalization of $\hat g(x)$ defined in (2) given by
$$\hat g(x|p) = \sum_{i=1}^n p_i A_i(x) Y_i, \tag{4}$$
and for what follows $\hat g^{(s)}(x|p) = \sum_{i=1}^n p_i A_i^{(s)}(x) Y_i$, where
$$A_i^{(s)}(x) = \frac{\partial^{s_1}}{\partial x_1^{s_1}} \cdots \frac{\partial^{s_r}}{\partial x_r^{s_r}}\, A_i(x)$$
for real-valued $x$. Again, in our notation $s$ represents an $r \times 1$ vector of nonnegative integers that indicate the order of the partial derivatives of the weighting function of the kernel smoother.

We first consider the mechanics of the proposed approach, and by way of example use (4) to generate an unrestricted Nadaraya-Watson estimator. In this case we would set $p_i = 1/n$, $i = 1, \ldots, n$, and set
$$A_i(x) = \frac{n K_h(X_i, x)}{\sum_{j=1}^n K_h(X_j, x)}, \tag{5}$$

where $K_h(\cdot)$ is a product kernel and $h$ is a vector of bandwidths; see Racine and Li (2004) for details. When $p_i \neq 1/n$ for some $i$, then we would have a restricted Nadaraya-Watson estimator (the selection of $p$ satisfying particular restrictions is discussed below).

We now consider how one can impose particular restrictions on the estimator $\hat g(x|p)$. Let $p_u$ be the $n$-vector of uniform weights $1/n$ and let $p$ be the vector of weights to be selected. In order to impose our constraints, we choose $p$ to minimize some distance measure from $p$ to the uniform weights $p_i = 1/n\ \forall i$, as proposed by Hall and Huang (2001). This is appealing intuitively since the unconstrained estimator is that for which $p_i = 1/n\ \forall i$, as noted above. Whereas Hall and Huang (2001) consider probability weights (i.e., $0 \le p_i \le 1$, $\sum_i p_i = 1$) and distance measures suitable for probability weights (i.e., Hellinger), we need to relax the constraint that $0 \le p_i \le 1$ and will instead allow for both positive and negative weights (while retaining $\sum_i p_i = 1$), and shall also therefore require alternative distance measures. To appreciate why this is necessary, suppose one simply wished to constrain a surface that is uniformly positive to have negative regions. This could be accomplished by allowing some of the weights to be negative. However, probability weights would fail to produce a feasible solution as they are non-negative, hence our need to relax this condition.

We also have to forgo the power divergence metric of Cressie and Read (1984), which was used by Hall and Huang (2001), since it is only valid for probability weights. For what follows we select the well-worn $l_2$ metric $D(p) = (p_u - p)'(p_u - p)$, which has a number of appealing features in this context, as will be seen. Our problem therefore boils down to selecting those weights $p$ that minimize $D(p)$ subject to $l(x) \le \hat g^{(s)}(x|p) \le u(x)$ (and perhaps additional constraints of a similar form), which can be cast as a general nonlinear programming problem. For the illustrative constraints we consider below we have (in)equalities that are linear in $p$, which can be solved using standard quadratic programming methods and off-the-shelf software. For example, in the R language (R Development Core Team (2008)) it is solved using the quadprog package (code in the R language is available from the authors upon request). Even when $n$ is quite large the solution is computationally fast.
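To fix ideas, the following is a minimal sketch (ours, not the authors' released code) of how the weight-selection problem can be handed to quadprog's solve.QP for a univariate Nadaraya-Watson smoother, imposing nonnegativity of the fitted function on a grid; the bandwidth, grid, and all object names here are illustrative assumptions.

```r
## Minimal sketch (not the authors' code): constrained weights via quadprog.
## Univariate Nadaraya-Watson smoother; constraint: ghat(x_j | p) >= 0 on a grid.
library(quadprog)

set.seed(42)
n  <- 100
x  <- sort(runif(n, -2, 2))
y  <- x + rnorm(n, sd = 0.25)            # true g(x) = x dips below zero
h  <- 0.4                                # illustrative bandwidth (assumed)
xg <- seq(-2, 2, length.out = 50)        # constraint grid x_1, ..., x_N

## A_i(x_j) from (5): n K_h(X_i, x_j) / sum_k K_h(X_k, x_j)
K   <- outer(xg, x, function(u, v) dnorm((u - v) / h))
A   <- n * K / rowSums(K)                # N x n matrix of NW weights
Psi <- sweep(A, 2, y, `*`)               # psi_i(x_j) = A_i(x_j) Y_i

## min (p - p_u)'(p - p_u)  s.t.  sum(p) = 1 (equality) and Psi %*% p >= 0
pu  <- rep(1 / n, n)
sol <- solve.QP(Dmat = diag(n), dvec = pu,
                Amat = cbind(rep(1, n), t(Psi)),
                bvec = c(1, rep(0, nrow(Psi))), meq = 1)
p    <- sol$solution
ghat <- Psi %*% p                        # restricted estimate on the grid
Dp   <- sum((pu - p)^2)                  # the statistic D(p-hat)
```

Here solve.QP minimizes $\frac{1}{2}b'Db - d'b$ subject to $A'b \ge b_0$, with the first meq constraints treated as equalities, which matches the $l_2$ problem above after dropping constants.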

2.2 Existence of Constrained Estimator

In this section, we show the existence of a unique set of weights that satisfy the constraints defined in (3). Hall and Huang (2001) demonstrate that a vector of weights always exists that satisfies the monotonicity constraint when the response is assumed to be positive for all observations (this assumption, however, is too restrictive for the framework we consider). For what follows we focus on restrictions that are linear in $p$, which are quite general and allow for the equivalent types of constraint violations, albeit in a multidimensional setting. To allow for multiple simultaneous constraints we express our restrictions as follows:
$$\sum_{i=1}^n p_i \left[\sum_{s\in S} \alpha_s A_i^{(s)}(x)\right] Y_i - c(x) \ge 0, \tag{6}$$
where the inner sum is taken over all derivative vectors $s$ in the set $S$ that corresponds to our constraints and the $\alpha_s$ are a set of constants used to generate various constraints. In what follows we

shall presume, without loss of generality, that for all $s$, $\alpha_s \ge 0$ and $c(x) \equiv 0$, since $c(x)$ is a known function. For what follows $c(x)$ equals either $l(x)$ or $-u(x)$, and since we admit multiple constraints our proofs cover lower bound, upper bound, and equality constraint settings. To simplify the notation, we define a differential operator $m \mapsto m^D$ such that $m^D(x) = \sum_{s\in S} \alpha_s m^{(s)}(x)$ and let $\psi_i(x) = A_i^D(x) Y_i$. Additionally, we define $|s| = \sum_{i=1}^r s_i$ as the order for a derivative vector $s = (s_1, \ldots, s_r)$. We say a derivative $s_1$ has a higher order than $s_2$ if $|s_1| > |s_2|$. Denote by $d$ the derivative of the "maximum order" among all the derivatives in $S$. In the case of ties, such as $S = \{(2, 2), (0, 3), (1, 3)\}$, $d$ can be either $(2, 2)$ or $(1, 3)$; $(0, 3)$ is of smaller order. Also, let $g^{(d)}(x)$ denote the partial derivative for this maximum order derivative. The following theorem ensures the existence of a unique constrained estimator. Its proof is similar to that of Theorem 4.1 in Hall and Huang (2001) and thus omitted here.

Theorem 2.1. If one assumes that

(i) a sequence $\{i_1, \ldots, i_k\} \subseteq \{1, \ldots, n\}$ exists such that for each $k$, $\psi_{i_k}(x)$ is strictly positive and continuous on $(L_{i_k}, U_{i_k}) \equiv \prod_{i=1}^r (L_{i_k}, U_{i_k}) \subset \mathbb{R}^r$ (where $L_{i_k} < U_{i_k}$), and vanishes on $(-\infty, L_{i_k}]$,

(ii) every $x \in \mathcal{I} \equiv [c, e] = \prod_{i=1}^r [c_i, e_i]$ is contained in at least one interval $(L_{i_k}, U_{i_k})$,

(iii) for $1 \le i \le n$, $\psi_i(x)$ is continuous on $(-\infty, \infty)$,

then there exists a vector $p = (p_1, \ldots, p_n)$ such that the constraints are satisfied for all $x \in \mathcal{I}$.

We note that the above conditions are sufficient but not necessary for the existence of a set of weights that satisfy the constraints for all $x \in \mathcal{I}$. For example, if for some

sequence $j_n$ in $\{1, \ldots, n\}$, $\operatorname{sgn}\psi_{j_n}(x) = 1\ \forall x \in \mathcal{I}$, and for another sequence $l_n$ in $\{1, \ldots, n\}$, $\operatorname{sgn}\psi_{l_n}(x) = -1\ \forall x \in \mathcal{I}$, then for those observations that switch signs $p_i$ may be set equal to zero, while $p_{j_n} > 0$ and $p_{l_n} < 0$ is sufficient to ensure existence of a set of $p$'s satisfying the constraints.

2.3 Hypothesis Testing on Shape Constraints

In this section, we propose a test for the validity of arbitrary shape constraints using $D(\hat p)$ as the test statistic. It is a bootstrap testing procedure that is simple to implement and extends the monotonicity testing procedure in Hall et al. (2001) to our multivariate and multiple constraints setting. A theoretical investigation of $D(\hat p)$ is provided in Theorem 2.3 of Section 2.4. This bootstrap approach involves estimating the constrained regression function $\hat g(x|p)$ based on the sample realizations $\{Y_i, X_i\}$ and then rejecting $H_0$ if the observed value of $D(\hat p)$ is too large. We use a resampling approach for generating the null distribution of $D(\hat p)$ which involves generating resamples for $y$ drawn from the constrained model via iid residual resampling (i.e., conditional on the sample $\{X_i\}$), which we denote $\{Y_i^*, X_i\}$ (if one were concerned about the presence of heteroscedasticity then a wild residual resampling scheme could be used instead). These resamples are generated under $H_0$, hence we recompute $\hat g(x|p)$ for the bootstrap sample $\{Y_i^*, X_i\}$, which we denote $\hat g(x|p^*)$ and which then yields $D(p^*)$. We then repeat this process $B$ times. Finally, we compute the empirical $P$ value, $P_B$, which is simply the proportion of the $B$ bootstrap resamples $D(p^*)$ that exceed $D(\hat p)$, i.e.,
$$P_B = 1 - \hat F(D(\hat p)) = \frac{1}{B}\sum_{j=1}^B I(D(p_j^*) > D(\hat p)),$$
where $I(\cdot)$ is the indicator function and $\hat F(D(\hat p))$ is the empirical distribution function of the bootstrap statistics. Then one rejects the null hypothesis if $P_B$ is less than $\alpha$,

the level of the test. Before proceeding further, we note that there exist three situations that can occur in practice:

(i) Impose non-binding constraints (they are 'correct' de facto)

(ii) Impose binding constraints that are correct

(iii) Impose binding constraints that are incorrect

If one encounters (i) in practice (i.e., $D(\hat p) = 0$) we recommend following the advice of (Hall et al., 2001, p. 609) who state "For those datasets with $D(\hat p) = 0$, no further bootstrapping is necessary [. . . ] and so the conclusion (for that dataset) must be to not reject $H_0$."
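As a sketch of the resampling loop (our own rendering of the steps above, not the authors' implementation), suppose constrained_fit() is a hypothetical wrapper around the quadratic program of Section 2.1 that returns the distance D and the constrained fitted values at the sample points:

```r
## iid residual bootstrap for H0 using D(p-hat); constrained_fit() is a
## hypothetical helper returning list(D = ..., fitted = ...) for a response y.
bootstrap_pvalue <- function(y, constrained_fit, B = 99) {
  fit0 <- constrained_fit(y)
  if (fit0$D == 0) return(1)             # non-binding constraints: do not reject H0
  e <- y - fit0$fitted                   # residuals from the constrained model
  e <- e - mean(e)                       # center before iid resampling
  Dstar <- replicate(B, {
    ystar <- fit0$fitted + sample(e, replace = TRUE)  # resample under H0
    constrained_fit(ystar)$D
  })
  mean(Dstar > fit0$D)                   # empirical P value, P_B
}
```

For the wild-bootstrap variant mentioned above, one would multiply each residual by an independent two-point random variable rather than resampling residuals with replacement.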

2.4 Theoretical Properties

2.4.1 Consistency of Constrained Estimator

Theorem 2.2 below states that for nonbinding constraints the constrained estimator converges to the unconstrained estimator with probability one in the limit; for constraints that bind with equality on an interior hyperplane subset (defined below) the order in probability of the difference between the constrained and unconstrained estimates is obtained; while in a shrinking neighborhood where the constraints bind, the ratio of the restricted and unrestricted estimators is approximately constant.

Define a hyperplane subset of $\mathcal{J} = \prod_{i=1}^r [a_i, b_i]$ to be a subset of the form $S = \{x_k^0\} \times \prod_{i\neq k} [a_i, b_i]$ for some $1 \le k \le r$ and some $x_k^0 \in [a_k, b_k]$. We call $S$ an interior hyperplane subset if $x_k^0 \in (a_k, b_k)$. For what follows, $g^D$ ($g$) is the true data generating process (DGP), $\hat p$ is the optimal weight vector satisfying the constraints, $\hat g^D(\cdot|\hat p)$ ($\hat g(\cdot|\hat p)$) is the constrained estimator defined in (4), and $\tilde g^D$ ($\tilde g$) is the unconstrained estimator defined in (2).

Theorem 2.2. Suppose that Assumption A.1(i)-(iv) in the Appendix holds.

(i) If $g^D > 0$ on $\mathcal{J}$ then, with probability 1, $\hat p_i = 1/n$ for all $i$ and all sufficiently large $n$, and $\hat g^D(\cdot|\hat p) = \tilde g^D$ on $\mathcal{J}$ for all sufficiently large $n$. Hence, $\hat g(\cdot|\hat p) = \tilde g$ on $\mathcal{J}$ for all sufficiently large $n$.

(ii) Suppose that $g^D > 0$ except on an interior hyperplane subset $\mathcal{X}_0 \subset \mathcal{J}$ such that $g^D(x) = 0$, $\forall x \in \mathcal{X}_0$, and suppose that for any $x_0 \in \mathcal{X}_0$, $g^D$ has second-order continuous derivatives in the neighborhood of $x_0$ with $\frac{\partial g^D}{\partial x}(x_0) = 0$ and with $\frac{\partial^2 g^D}{\partial x \partial x^T}(x_0)$ nonsingular; then $|\hat g(\cdot|\hat p) - \tilde g| = O_p\left(h^{|d|+\frac{r+1}{2}}\right)$ uniformly on $\mathcal{J}$.

(iii) Under the same conditions given in (ii) above, there exist random variables $\Theta = \Theta(n)$ and $Z_1 = Z_1(n) \ge 0$ satisfying $\Theta = O_p\left(h^{|d|+r+1}\right)$ and $Z_1 = O_p(1)$ such that $\hat g(x|\hat p) = (1+\Theta)\tilde g(x)$ uniformly for $x \in \mathcal{J}$ with $\inf_{x_0\in\mathcal{X}_0} |x - x_0| > Z_1 h^{\frac{r+1}{4}}$. The latter property reflects the fact that, for a random variable $Z_2 = Z_2(n) \ge 0$ satisfying $Z_2 = O_p(1)$, we have $\hat p_i = n^{-1}(1+\Theta)$ for all indices $i$ such that both $\inf_{x_0\in\mathcal{X}_0} |X_i - x_0| > Z_2 h^{\frac{r+1}{4}}$ and $A_i(x) \neq 0$ for some $x \in \mathcal{J}$.

The technical proof of Theorem 2.2 is relegated to the Appendix. Theorem 2.2 is the multivariate, multi-constraint, hyperplane subset generalization of the univariate, single-constraint, single-point-violation setting considered in Hall and Huang (2001), having dispensed with probability weights and power divergence distance measures of necessity. However, the theory in Hall and Huang (2001) lays the foundation for several of the results that we obtain, for which we are indebted. We obtain in Theorem 2.2(ii) and (iii) the multivariate analogues of the rates in (Hall and Huang, 2001, Theorem 4.3(c)) in their univariate setting. A further issue with the imposition of constraints is the choice of distance metric with which to select the optimal weights. Whereas Hall and Huang (2001) use the power divergence metric depending on a parameter $\rho$, they impose the condition that

$\rho$ lies between 0 and 1 in their technical arguments. In a sense, if $0 \le \rho \le 1$ then the $l_1$ and $l_2$ norms can be viewed as limiting cases for the Hall and Huang analysis; see their equation (5.26). However, a key difference here is the relative ease with which the constraints can be implemented in practice if one forgoes power divergence and uses either a linear or quadratic program to solve for the optimal weights. Furthermore, in our proof of Theorem 2.3 the use of the $l_2$ norm delivers several simplifications that are not available when using the power divergence metric.

2.4.2 Asymptotic Property of $D(\hat p)$

We introduce some necessary notation for the theorem below. Recall that $\psi_i(x) = A_i^D(x) Y_i$, $i = 1, \ldots, n$, and that our minimization problem is
$$\min_{p_1,\ldots,p_n} \sum_{i=1}^n (n^{-1} - p_i)^2, \quad \text{s.t.} \quad \sum_{i=1}^n p_i = 1, \quad \sum_{i=1}^n p_i \psi_i(x) \ge 0,\ \forall x. \tag{7}$$

In practice, this minimization is carried out by taking a fine grid $(x_1, \ldots, x_N)$, where $N$ is large, and solving
$$\min_{p_1,\ldots,p_n} \sum_{i=1}^n (n^{-1} - p_i)^2, \quad \text{s.t.} \quad \sum_{i=1}^n p_i = 1, \quad \sum_{i=1}^n p_i \psi_i(x_j) \ge 0,\ 1 \le j \le N. \tag{8}$$

Let $\hat p_i$, $i = 1, \ldots, n$, be the solution to the quadratic programming problem (8). Then the asymptotic distribution of $D(\hat p)$ is given in the following theorem, whose proof is relegated to the Appendix.

Theorem 2.3. Suppose that Assumptions A.1(i)-(iv) and A.2(i)-(ii) hold. Then, as $n \to \infty$, we have
$$\frac{n^2 \sigma_K^2}{h^{2|d|+r}\left(\sum_{j=1}^M g^D(x_j^*)\right)^2}\, D(\hat p) \sim \chi^2(n), \tag{9}$$
where $\sigma_K^2 = \sigma^2 \int \left[K^{(d)}(y)\right]^2 dy$, and $\{x_1^*, \ldots, x_M^*\} \subset \{x_1, \ldots, x_N\}$ are the slack points defined in the Appendix.


The diverging degrees of freedom in the asymptotic distribution come as no surprise here since both the null and alternative hypotheses are nonparametric and reside on infinite-dimensional parameter spaces. A similar phenomenon was observed by Fan et al. (2001) for their generalized likelihood ratio test. Theoretically, the asymptotic distribution of $D(\hat p)$ in Theorem 2.3 can be used in determining the $p$-value of our test. However, this may pose several difficulties in practice. First, the asymptotic distribution does not necessarily provide a good approximation for finite sample sizes. Also, the normalizing constant in (9) requires the determination of slack points, which can be rather difficult in practice. Hence, a bootstrap approach like the one proposed in Section 2.3 is the alternative commonly used in nonparametric tests.

3 Simulated Illustrations, Finite-Sample Test Performance, and an Application

We now demonstrate the flexibility and simplicity of the approach by first imposing a range of constraints on a simulated dataset using a large number of observations, thereby showcasing the feasibility of this approach in substantive applied settings, and then consider some Monte Carlo experiments that examine the finite-sample performance of the proposed test on both equality constraints and inequality constraints.

For what follows we simulate data from a nonlinear multivariate relationship and then consider imposing a range of restrictions by way of example. We consider a 3D surface defined by
$$Y_i = \frac{\sin\left(\sqrt{X_{i1}^2 + X_{i2}^2}\right)}{\sqrt{X_{i1}^2 + X_{i2}^2}} + \varepsilon_i, \quad i = 1, \ldots, n, \tag{10}$$
where $x_1$ and $x_2$ are independent draws from the uniform $[-5, 5]$. We draw $n = 10{,}000$ observations from this DGP with $\varepsilon \sim N(0, \sigma^2)$ and $\sigma = 0.1$. As we will demonstrate

the method by imposing restrictions on the surface and also on its first and second partial derivatives, we use the local quadratic estimator for what follows as it delivers consistent estimates of the regression function and its first and second partial derivatives. Figure 1 presents the unrestricted regression estimate whose bandwidths were chosen via least squares cross-validation.
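For concreteness, the DGP in (10) can be simulated as follows (a sketch under our own seed and variable names; the local quadratic fit and cross-validated bandwidths are produced by the authors' code, which we do not reproduce here):

```r
## Simulate the sinc-surface DGP of equation (10), with sigma = 0.1.
set.seed(123)
n  <- 10000
x1 <- runif(n, -5, 5)
x2 <- runif(n, -5, 5)
r  <- sqrt(x1^2 + x2^2)
y  <- sin(r) / r + rnorm(n, sd = 0.1)
```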

[Figure 1 here: perspective plot of the estimated conditional expectation over (X1, X2).]

Figure 1: Unrestricted nonparametric estimate of (10).

3.1 A Simulated Illustration: Restricting $\hat g^{(0)}(x)$

Next, we arbitrarily impose the constraint that the regression function lies in the range [0,0.5]. A plot of the restricted surface appears in Figure 2. Figures 1 and 2 clearly reveal that the regression surface for the restricted model is both smooth and satisfies the constraints.
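Bounding the fit from both sides simply adds mirrored inequality rows to the quadratic program sketched in Section 2.1; assuming the objects defined there (with quadprog loaded), a hypothetical extension imposing $0 \le \hat g(x_j|p) \le 0.5$ on the grid is:

```r
## Reuse A, Psi, pu, n from the Section 2.1 sketch; upper bounds enter as
## -Psi %*% p >= -0.5, i.e., Psi %*% p <= 0.5.
sol <- solve.QP(Dmat = diag(n), dvec = pu,
                Amat = cbind(rep(1, n), t(Psi), -t(Psi)),
                bvec = c(1, rep(0, nrow(Psi)), rep(-0.5, nrow(Psi))),
                meq  = 1)
```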

3.2 A Simulated Illustration: Restricting $\hat g^{(1)}(x)$

We consider the same DGP given above, but now we arbitrarily impose the constraint that the first derivatives with respect to both $x_1$ and $x_2$ lie in the range $[-0.1, 0.1]$

[Figure 2 here: perspective plot of the restricted conditional expectation over (X1, X2).]

Figure 2: Restricted nonparametric estimate of (10) where the restriction is defined over $\hat g^{(s)}(x|p)$, $s = (0, 0)$ ($0 \le \hat g(x|p) \le 0.5$).

($s = (1, 0)$ and $s = (0, 1)$). A plot of the restricted surface appears in Figure 3.

[Figure 3 here: perspective plot of the restricted conditional expectation over (X1, X2).]

Figure 3: Restricted nonparametric estimate of (10) where the restriction is defined over $g^{(s)}(x)$, $s \in \{(1, 0), (0, 1)\}$ ($-0.1 \le \partial\hat g(x|p)/\partial x_1 \le 0.1$, $-0.1 \le \partial\hat g(x|p)/\partial x_2 \le 0.1$).


Figure 3 clearly reveals that the regression surface for the restricted model possesses derivatives that satisfy the constraints everywhere and is smooth.

3.3 A Simulated Illustration: Restricting $\hat g^{(2)}(x)$

We consider the same DGP given above, but now we arbitrarily impose the constraint that the second derivatives with respect to both $x_1$ and $x_2$ are positive (negative), which is a necessary (but not sufficient) condition for convexity (concavity). As can be seen from Figures 4 and 5, the shape of the restricted function changes drastically depending on the curvature restrictions placed upon it.

[Figure 4 here: perspective plot of the restricted conditional expectation over (X1, X2).]

Figure 4: Restricted nonparametric estimate of (10) where the restriction is defined over $\hat g^{(s)}(x)$, $s \in \{(2, 0), (0, 2)\}$ ($\partial^2\hat g(x|p)/\partial x_1^2 \ge 0$, $\partial^2\hat g(x|p)/\partial x_2^2 \ge 0$).

We could as easily impose restrictions defined jointly on, say, both $\hat g(x)$ and $\hat g^{(1)}(x)$, or perhaps on cross-partial derivatives if so desired. We hope that these illustrative applications reassure the reader that the method we propose is powerful, fully general, and can be applied in large-sample settings.


[Figure 5 here: perspective plot of the restricted conditional expectation over (X1, X2).]

Figure 5: Restricted nonparametric estimate of (10) where the restriction is defined over $\hat g^{(s)}(x)$, $s \in \{(2, 0), (0, 2)\}$ ($\partial^2\hat g(x|p)/\partial x_1^2 \le 0$, $\partial^2\hat g(x|p)/\partial x_2^2 \le 0$).

3.4 Finite-Sample Performance: Testing Inequality Restrictions

By way of illustration we consider testing the inequality restriction $H_0: g(x) \ge 0$ versus $H_1: g(x) < 0$ and consider a simple linear DGP given by
$$Y_i = g(X_i) + \varepsilon_i = X_i + \varepsilon_i, \tag{11}$$
where $X_i$ is uniform $[-a, 4-a]$, $a$ is a parameter that determines whether the constraint is binding ($a > 0$) or not ($a = 0$) and the length of the interval over which the constraint binds, and $\varepsilon \sim N(0, 1/4)$. When $a = 0$ we can assess rejection frequencies under $H_0$, while when $a > 0$ we can assess power. We construct power curves based on $M = 1{,}000$ Monte Carlo replications, and we compute $B = 99$ bootstrap replications. The power curves corresponding to $\alpha = 0.05$ appear in Figure 6.

[Figure 6 here: rejection frequency versus the interval length $a$ over which the restriction binds, for $n = 25, 50, 75, 100$.]

Figure 6: Power curves for $\alpha = 0.05$ for sample sizes $n = (25, 50, 75, 100)$ based upon the DGP given in (11). The solid horizontal line represents the test's nominal level ($\alpha$).

Figure 6 reveals that the empirical rejection frequencies are in line with nominal size while power increases with $n$. Given that the sample sizes considered here would typically be much smaller than those used by practitioners adopting nonparametric smoothing methods, we expect that the proposed test would have reasonable performance in empirical applications.
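The Monte Carlo skeleton for this experiment is simple to reproduce (a sketch with our own names; test_pvalue() stands in for the bootstrap test of Section 2.3 applied to the constraint $g(x) \ge 0$):

```r
## DGP (11): X uniform on [-a, 4 - a]; a > 0 makes g(x) = x < 0 on part of the support.
power_at <- function(a, n, M = 1000, alpha = 0.05) {
  mean(replicate(M, {
    x <- runif(n, -a, 4 - a)
    y <- x + rnorm(n, sd = 0.5)          # Var(eps) = 1/4
    test_pvalue(x, y) < alpha            # hypothetical bootstrap test (B = 99 inside)
  }))
}
```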

3.5 Finite-Sample Performance: Testing Equality Restrictions

By way of example we consider testing the restriction that the nonparametric model $g(x)$ is equivalent to a specific parametric functional form (i.e., $g(x) = x'\beta$ for some fixed $\beta$). We consider the following DGP:
$$Y_i = g(X_{i1}, X_{i2}) + \varepsilon_i = 1 + X_{i1}^2 + X_{i2} + \varepsilon_i,$$
where $X_{ij}$, $j = 1, 2$, are uniform $[-2, 2]$ and $\varepsilon \sim N(0, 1/2)$.

We then impose the restriction that $g(x)$ is of a particular parametric form, and test whether this restriction is valid. When we generate data from this DGP and impose the correct model as a restriction (i.e., that given by the DGP, say, $\beta_0 + \beta_1 x_{i1}^2 + \beta_2 x_{i2}$) we can assess the test's rejection frequencies under $H_0$, while when we generate data from this DGP and impose an incorrect model that is in fact linear in variables we can assess the test's power. We conduct $M = 1{,}000$ Monte Carlo replications from our DGP, and consider $B = 99$ bootstrap replications. Results are presented in Table 1 in the form of empirical rejection frequencies for $\alpha = (0.10, 0.05, 0.01)$ for samples of size $n = 25, 50, 75, 100, 200$. Table 1 indicates that the test's empirical rejection frequencies appear to be in line with nominal size while power increases with $n$.
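A sketch of the DGP and the two parametric restrictions (correct and incorrect) used for size and power, under our own names:

```r
## Equality-restriction experiment of Section 3.5 (sketch).
set.seed(1)
n  <- 100
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2)
y  <- 1 + x1^2 + x2 + rnorm(n, sd = sqrt(1/2))
fit_size  <- lm(y ~ I(x1^2) + x2)  # correct functional form: assesses size
fit_power <- lm(y ~ x1 + x2)       # misspecified linear form: assesses power
```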

3.6 Application: Imposing Constant Returns to Scale for Indonesian Rice Farmers

We consider a production dataset that has been studied by Horrace and Schmidt (2000), who analyzed technical efficiency for Indonesian rice farms. We examine the issue of returns to scale, focusing on one growing season's worth of data for the year 1977, acknowledged to be a particularly wet season. A total of 171 farmers were selected from six villages in the rice production area of the Cimanuk River Basin in West Java by the Center for Agro Economic Research, Ministry of Agriculture, Indonesia, through the Agro Economic Survey, which was part of the larger Rural Dynamic Study. Output is measured as kilograms (kg) of rice produced, and inputs to produce rice are seed (kg), urea (kg), trisodium phosphate (TSP) (kg), labour (hours), and land (hectares). Table 2 presents several summary statistics for the data. We use log transformations throughout. Of interest here is whether or not the production technology exhibits constant

Table 1: Test for correct parametric functional form. Values represent the empirical rejection frequencies for the $M = 1{,}000$ Monte Carlo replications.

        n    α = 0.10   α = 0.05   α = 0.01
Size
       25      0.100      0.049      0.010
       50      0.074      0.043      0.011
       75      0.086      0.034      0.008
      100      0.069      0.031      0.006
      200      0.093      0.044      0.007
Power
       25      0.391      0.246      0.112
       50      0.820      0.665      0.356
       75      0.887      0.802      0.590
      100      0.923      0.849      0.669
      200      0.987      0.970      0.903

returns to scale (i.e., whether or not the sum of the first-order partial derivatives equals one). Returns to scale measure the proportional increase in output for an equal proportional increase in all inputs. Constant returns to scale implies that output increases by exactly the amount that all the inputs are increased; if all the inputs are doubled, then output doubles. Given the primitive nature of the production of rice, one expects constant returns to scale to be present in the underlying technology. Learning about returns to scale is important for a variety of reasons. In our context returns to scale can provide insights into appropriate rural development strategies. If several farms possessed decreasing returns to scale then farm operations could


Table 2: Summary Statistics for the Data

Variable       Mean    StdDev
log(rice)     6.9170   0.9144
log(seed)     2.4534   0.9295
log(urea)     4.0144   1.1039
log(TSP)      2.7470   1.4093
log(labor)    5.6835   0.8588
log(land)    -1.1490   0.9073

be scaled back, resulting in lower costs. Moreover, a well-known result in the rural development literature is that land is more productive on small farms than on larger farms. A classic explanation for this result is the presence of decreasing returns to scale. Therefore, a finding of constant returns to scale suggests an alternative explanation for this inverse relationship between land and productivity. We estimate the production function using a nonparametric local linear estimator with least squares cross-validated bandwidth selection. Figure 7 presents the unrestricted and restricted partial derivative sums for each observation (i.e., farm), where the restriction is that the sum of the partial derivatives equals one. The horizontal line represents the restricted partial derivative sum (1.00) and the points represent the unrestricted sums for each farm. An examination of Figure 7 reveals that the estimated returns to scale lie in the interval [0.98, 1.045]. In order to test whether the restriction is valid we apply the test outlined in Section 2.3. We conducted $B = 999$ bootstrap replications and test the null that the technology exhibits constant returns to scale. The empirical $P$ value is $P_B = 0.122$, hence we fail to reject the null at all conventional levels. We are encouraged by this fully nonparametric application, particularly as it involves a fairly large number of

[Figure 7 here: scatter of unrestricted gradient sums (roughly 0.98 to 1.045) against observation number, with the restricted sum fixed at 1.00; axes labelled Observation and Gradient Sum.]

Figure 7: The sum of the partial derivatives for observation $i$ (i.e., each farm) appears on the vertical axis, and each observation (farm) appears on the horizontal axis.

regressors (five) and a fairly small number of observations (n = 171).
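In the notation of Section 2.1, the constant-returns-to-scale hypothesis tested here amounts to the equality restriction (our restatement, with the five log inputs as regressors):
$$\sum_{k=1}^{5} \frac{\partial \hat g(x|p)}{\partial x_k} = 1 \quad \text{for all evaluation points } x,$$
that is, $l(x) = u(x) = 1$ with $S = \{(1,0,0,0,0), \ldots, (0,0,0,0,1)\}$ and $\alpha_s = 1$ for each $s \in S$.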

4 Discussion

We present a framework for imposing and testing the validity of conventional constraints on the partial derivatives of a multivariate nonparametric kernel regression function, namely, $l(x) \le g^{(s)}(x) \le u(x)$. The proposed approach nests special cases such as imposing monotonicity and concavity (convexity), while delivering a seamless framework for general restricted nonparametric kernel estimation and inference. Illustrative simulated examples are presented, the finite-sample performance of the proposed test is examined via Monte Carlo simulations, and an illustrative application is undertaken. An open implementation in the R language (R Development Core Team (2008)) is available from the authors. We note that our procedure is valid for a range of kernel estimators in addition to


those discussed herein, as well as for estimation and testing in the presence of categorical data. Moreover, our constrained smoothing approach nests covariate irrelevance and significance testing, given that these represent constraints of the first partials being equal to zero. Semiparametric models such as the partially linear, single index, smooth coefficient, and additively separable models could utilize this approach towards constrained estimation. Nonparametric unconditional and conditional density and distribution estimators, as well as survival and hazard functions, smooth conditional quantiles, and structural nonparametric estimators including auction methods could also benefit from the framework developed here. Future work on the theoretical side could focus on the importance of the choice of distance metric, the asymptotic behavior of the bootstrap testing procedure, and the relative merits of the alternative data tilting methods that exist. We leave this as a subject for future research.

Appendices: Assumptions and technical proofs

A.1 Assumptions

Assumption A.1 is needed in Theorems 2.2 and 2.3. Assumption A.2 is on the grid points $(x_1, \ldots, x_N)$ in Section 2.4.2 and is used in Theorem 2.3.

Assumption A.1. (i) The sample $X_i$ either form a regularly spaced grid on a compact set $\mathcal{I}$ or constitute independent random draws from a distribution whose density $f$ is continuous and nonvanishing on $\mathcal{I}$; the $\varepsilon_i$ are independent and identically distributed with zero mean and variance $\sigma^2$, and are independent of the $X_i$; the kernel function $K(\cdot)$ is a symmetric, compactly supported density such that $K^D$ is Hölder-continuous on $\mathcal{J} \equiv [a, b] = \prod_{i=1}^r [a_i, b_i] \subset \mathcal{I}$.

(ii) $E(|\varepsilon_i|^t)$ is bounded for sufficiently large $t > 0$.

(iii) $f^D$ and $g^D$ are continuous on $\mathcal{J}$.

(iv) The bandwidth associated with each explanatory variable, $h_j$, satisfies $h_j \propto n^{-1/(3r+2|d|)}$, $1 \le j \le r$, where $|d| = \sum_{j=1}^r d_j$ is the maximum order of the derivative vector $d$ defined above.

Assumption A.1(i) is standard in the kernel regression literature. Assumption A.1(ii) is a sufficient condition required for the application of a strong approximation result which we invoke in Lemma A.3, while Assumption A.1(iii) assures requisite smoothness of $f^D$ and $g^D$ ($f$ is the design density). Note that the bandwidth rate in Assumption A.1(iv) is generally higher than the standard optimal rate $n^{-1/(r+4)}$. However, this is not surprising for our restricted problem. The optimal rate only guarantees the convergence of our unrestricted function estimator $\tilde g$. But the restricted problem also requires the convergence of the derivative $\tilde g^D$, which often needs a higher bandwidth rate. In the single-predictor monotone regression problem considered in Hall and Huang (2001), this rate happens to coincide with the optimal rate $n^{-1/5}$. Furthermore, when the bandwidths all share the same rate, one can rescale each component of $x$ to ensure a uniform bandwidth $h \propto n^{-1/(3r+2|d|)}$ for all components, which simplifies the notation in the proofs somewhat without loss of generality. In the proofs that follow we will therefore use $h^r$ rather than $\prod_{j=1}^r h_j$ purely for notational simplicity.

Assumption A.2. (i) $N \to \infty$ as $n \to \infty$ and $N = O(n)$. (ii) Let $d_N = \inf_{1\le j_1 \neq j_2 \le N} |x_{j_1} - x_{j_2}|$ be the minimum distance between grid points. We require $d_N \to 0$ and $h^{-1} d_N \to \infty$.

Assumption A.2(i) ensures that enough grid points are used in calculating $D(\hat p)$ and that the number of grid points is at most the order of $n$. Assumption A.2(ii) requires that the minimum distance between grid points decreases at a rate slower than $h$, so that the correlation between derivative estimates at these grid points is zero when $n$ is sufficiently large.
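Returning to Assumption A.1(iv), a concrete instance may help fix ideas: for the curvature-constrained bivariate illustrations of Section 3.3 one has $r = 2$ and $|d| = 2$, so
$$h_j \propto n^{-1/(3r+2|d|)} = n^{-1/(3\cdot 2 + 2\cdot 2)} = n^{-1/10},$$
which shrinks more slowly (the bandwidth is larger) than the pointwise-optimal rate $n^{-1/(r+4)} = n^{-1/6}$, reflecting the need for the derivative estimate $\tilde g^D$ to converge as well.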

A.2 Proof of Theorem 2.2

For the proofs that follow we first establish some technical lemmas. Lemma A.1 partitions the weights into two sets, one where individual weights are identical and one where they are a set of fixed constants (that may differ), and is used exclusively in Theorem 2.2(iii). Lemma A.2 establishes that there exists a set of weights that guarantees that the constraint is satisfied with probability approaching one in the limit. These weights do not have to equal the optimal weights, but are used to bound the distance metric evaluated at the optimal weights. Lemma A.2 is used in Theorem 2.2(ii) and (iii). Lemma A.3 establishes that, with probability approaching one in the limit, the unrestricted estimate $\tilde g$ satisfies $\tilde g^D(x) > \frac{1}{3} g^D(x)$ for any point $x$ a certain distance away from $\mathcal{X}_0$. It is used to determine the suitable distance away from $\mathcal{X}_0$ for a point $x$ to have $\hat g^D(x|\hat p) > 0$, which is then used to derive the orders of $Z_1$ and $Z_2$ in Theorem 2.2(iii).

Lemma A.1. If $A$ and $B$ are complementary subsets of the integers $1, \ldots, n$ and if $p_i$, for $i \in A$, are fixed, then the values of $p_j$ for $j \in B$ that minimize $D(p)$ are identical, and are uniquely determined by the constraint that their sum should equal $1 - \sum_{i\in A} p_i$.

Proof of Lemma A.1. We find the optimal $p_j$'s by minimizing $\sum_{j\in B}(p_j - 1/n)^2$ subject to $\sum_{j\in B} p_j = 1 - \sum_{i\in A} p_i$. Without loss of generality, assume $n \in B$. By incorporating the constraint directly into our distance metric we obtain
$$\min_{p_j,\, j\in B\setminus\{n\}} \left[\sum_{j\in B\setminus\{n\}} (p_j - 1/n)^2 + \left(1 - \sum_{i\in A} p_i - \sum_{j\in B\setminus\{n\}} p_j - 1/n\right)^2\right],$$
which yields the first-order conditions, for $j \in B\setminus\{n\}$, $p_j = 1 - \sum_{i\in A} p_i - \sum_{k\in B\setminus\{n\}} p_k$. Since $\sum_{k\in B\setminus\{n\}} p_k$ is fixed we see that all $p_j$, $j \in B\setminus\{n\}$, are identical. Moreover, the excluded $p_n$ is also equal to $1 - \sum_{j\in A} p_j - \sum_{k\in B\setminus\{n\}} p_k$, which proves Lemma A.1.

Lemma A.2. For each $\delta > 0$ there exists a $\tilde p = \tilde p(\delta)$ satisfying
$$P\left\{\hat g^D(x|\tilde p) > 0 \quad \forall x \in \mathcal{J}\right\} > 1 - \delta \tag{A.12}$$
for all sufficiently large $n$. Moreover, this yields $D(\hat p) = O_p\left(n^{-1} h^{2|d|+2r+1}\right)$.

Proof of Lemma A.2. Choose any $x_0 \in \mathcal{X}_0$. Define
$$\tilde p_i = n^{-1}\left\{1 + \Delta + h^{|d|+r} g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right)\right\}, \tag{A.13}$$
where $\Delta$ is a constant defined by $\sum_{i=1}^n \tilde p_i = 1$, and $L(x) = \sum_{j=1}^r L_j(x_j)$ with each $L_j$ being a fixed and compactly supported function with appropriate derivatives to be specified later. Note that the compactness of $L$ is component-wise: $L\left(\frac{x_0 - X_i}{h}\right) \neq 0$ requires that $|x_{0j} - X_{ij}| \le C_1 h$ for at least one $1 \le j \le r$, for some constant $C_1 > 0$. This is different from the compactness of $K$ (and its derivatives), which indicates that there exists a constant $C_2 > 0$ such that $K\left(\frac{x - X_i}{h}\right) \neq 0$ requires that $|x_j - X_{ij}| \le C_2 h$ for all $j = 1, \ldots, r$. Now
$$\begin{aligned} \hat g^D(x|\tilde p) &= \sum_{i=1}^n \tilde p_i A_i^D(x) Y_i \\ &= (1+\Delta)\, n^{-1} \sum_{i=1}^n A_i^D(x) Y_i + n^{-1} \sum_{i=1}^n \left[h^{|d|+r} g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right)\right] A_i^D(x) Y_i \\ &= (1+\Delta)\tilde g^D(x) + h^{|d|+r} B_1(x) + h^{|d|+r} B_2(x), \end{aligned} \tag{A.14}$$
where
$$B_1(x) = n^{-1} \sum_{i=1}^n L\left(\frac{x_0 - X_i}{h}\right) A_i^D(x), \qquad B_2(x) = n^{-1} \sum_{i=1}^n g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right) A_i^D(x)\varepsilon_i.$$
The compactness of $L(\cdot)$ gives us
$$n^{-1} \sum_{i=1}^n g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right) = O_p(h), \tag{A.15}$$
and we have $\sum_{i=1}^n \tilde p_i = 1 + \Delta + h^{|d|+r} O_p(h)$, which implies $\Delta = O_p\left(h^{|d|+r+1}\right)$.

Since a higher order derivative $(j)$ implies a higher order for $A^{(j)}(x)$, we have $B_1(x) = O_p(\tilde B_1(x))$ and $B_2(x) = O_p(\tilde B_2(x))$ with
$$\tilde B_1(x) = n^{-1} \sum_{i=1}^n L\left(\frac{x_0 - X_i}{h}\right) A_i^{(d)}(x), \qquad \tilde B_2(x) = n^{-1} \sum_{i=1}^n g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right) A_i^{(d)}(x)\varepsilon_i,$$
where only the highest order derivative is included for both terms. We can then bound $\tilde B_1(x)$ and $\tilde B_2(x)$ as follows.

Note that $\tilde B_2(x)$ is the sum of $n$ i.i.d. random variables with mean zero and variance
$$\operatorname{Var}\left[g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right) A_i^{(d)}(x)\varepsilon_i\right] = E\left[g(X_i)^{-2} L\left(\frac{x_0 - X_i}{h}\right)^2 A_i^{(d)}(x)^2\right] E[\varepsilon_i^2] = O_p\left(h^{-(2|d|+r-1)}\right),$$
where the independence of $\varepsilon_i$ and $X_i$ is used in the first equality and the compactness of $L$ and $K$ is used in the second equality. Hence
$$\sup_{x\in\mathcal{J}} |B_2(x)| = O_p\left(\left(nh^{2|d|+r-1}\right)^{-1/2}\right). \tag{A.16}$$
For $\tilde B_1(x)$, without loss of generality, we have the following
$$\tilde B_1(x) = h^{-(|d|+r)}\, n^{-1} \sum_{i=1}^n K^{(d)}\left(\frac{x - X_i}{h}\right) L\left(\frac{x_0 - x}{h} + \frac{x - X_i}{h}\right) = (-h)^{|d|} \int K(y)\, L^{(d)}\left(\frac{x_0 - x}{h} + y\right) dy + O_p(1), \tag{A.17}$$
where the last equality follows from the definition of the integral, integration by parts, and the compactness of $L$ and $K$. Combining (A.14), (A.16) and (A.17) gives us
$$\hat g^D(x|\tilde p) = \tilde g^D(x) + (-1)^{|d|} h^r \int K(y)\, L^{(d)}\left(\frac{x_0 - x}{h} + y\right) dy + B_3(x), \tag{A.18}$$
where $B_3(x) = o_p(h^r)$ uniformly in $x \in \mathcal{J}$. In particular, we require $nh^{r-1} \to \infty$ here to claim $h^{|d|+r} B_2(x) = o_p(h^r)$.

For a subset $S \subset \mathcal{J}$, define
$$B(S, \delta) = \left\{x \in \mathcal{J} : \inf_{y\in S} |x - y| \le \delta\right\}.$$
Consider any point $x \in \mathcal{J} \setminus B\left(\mathcal{X}_0, Ch^{r/2}\right)$. Let $y_0 \in \mathcal{X}_0$ be the point such that $|x - y_0| = \inf_{y\in\mathcal{X}_0} |x - y|$. The existence of such a $y_0$ is guaranteed by the fact that $\mathcal{X}_0$ is a closed subset. Since $\partial g^D/\partial x$ is continuous, we must have $\frac{\partial g^D(y_0)}{\partial x} = 0$. Hence, the Taylor expansion at $y_0$ gives $g^D(x) = \frac{1}{2}(x - y_0)^T \frac{\partial^2 g^D}{\partial x\partial x^T}(y_0)(x - y_0) + O_p\left(h^{3r/2}\right)$. Since $\tilde g$ is a consistent estimate of $g$, for sufficiently large $n$ and $C > 0$ we have
$$P\left\{\tilde g^D(x) > 3h^r, \quad \forall x \in \mathcal{J} \setminus B\left(\mathcal{X}_0, Ch^{r/2}\right)\right\} > 1 - \frac{\delta}{3}. \tag{A.19}$$
In (A.18), when $x$ is in the neighborhood of $x_0$, $\tilde g^D(x) = O_p\left(\left(nh^{2|d|+r}\right)^{-1/2}\right) = O_p(h^r)$ if we require $\left(nh^{2|d|+3r}\right)^{-1} = O(1)$. Hence, for such an $x$, we need to have a positive dominating second term in (A.18) in order to claim
$$P\left\{\tilde g^D(x) + (-1)^{|d|} h^r \int K(y)\, L^{(d)}\left(\frac{x_0 - x}{h} + y\right) dy > 2h^r \quad \forall x \in B\left(\mathcal{X}_0, Ch^{r/2}\right)\right\} > 1 - \frac{1}{3}\delta. \tag{A.20}$$
Given (A.19), we must also have
$$(-1)^{|d|} \int K(y)\, L^{(d)}\left(\frac{x_0 - x}{h} + y\right) dy > -1 \quad \forall x \in \mathcal{J} \setminus B\left(\mathcal{X}_0, Ch^{r/2}\right). \tag{A.21}$$
To guarantee (A.20) and (A.21), given both $C$ and $\delta$, we may choose $L$ such that each component of $L^{(d)}$ is a positive or negative constant with sufficiently large absolute value on a sufficiently wide interval containing the origin, returning sufficiently slowly to 0 on either side of the interval where it is constant. Note that our restrictions on $\mathcal{X}_0$ play a role here. A point in the neighborhood of $\mathcal{X}_0$ will have at least one coordinate falling in the neighborhood of the corresponding coordinate of the chosen point $x_0$ in (A.13). This is critical for guaranteeing that $L^{(d)}\left(\frac{x_0 - x}{h} + y\right)$ in (A.20) has at least one non-zero component for any $x \in B\left(\mathcal{X}_0, Ch^{r/2}\right)$. Lastly, for all $C, \delta > 0$, we have
$$P\left\{B_3(x) > -h^r, \quad \forall x \in \mathcal{J}\right\} > 1 - \frac{\delta}{3}. \tag{A.22}$$
The derivation in (A.18) with the above set probabilities yields
$$P\left\{\hat g^D(x|\tilde p) > 0, \quad \forall x \in \mathcal{J}\right\} > 1 - \delta, \tag{A.23}$$
which establishes (A.12).

Next, we will show that $D(\tilde p) = O_p\left(n^{-1} h^{2|d|+2r+1}\right)$. This follows from $\Delta = O_p\left(h^{|d|+r+1}\right)$ and
$$n^{-1} \sum_{i=1}^n g(X_i)^{-2} L\left(\frac{x_0 - X_i}{h}\right)^2 = O_p(h), \tag{A.24}$$
which follows from the compactness of $L(\cdot)$. Replacing $p_i$ with $\tilde p_i$ yields
$$\begin{aligned} \sum_{i=1}^n (n\tilde p_i - 1)^2 &= \sum_{i=1}^n \left[\Delta + h^{|d|+r} g(X_i)^{-1} L\left(\frac{x_0 - X_i}{h}\right)\right]^2 \\ &\le 2\left[n\Delta^2 + h^{2|d|+2r} \sum_{i=1}^n g(X_i)^{-2} L\left(\frac{x_0 - X_i}{h}\right)^2\right] \\ &= 2n\Delta^2 + 2nh^{2|d|+2r} \cdot n^{-1}\sum_{i=1}^n g(X_i)^{-2} L\left(\frac{x_0 - X_i}{h}\right)^2 \\ &= O_p\left(nh^{2|d|+2r+2}\right) + O_p\left(nh^{2|d|+2r+1}\right) = O_p\left(nh^{2|d|+2r+1}\right). \end{aligned}$$
Since $\hat p$ minimizes $D(p)$ subject to nonnegativity of $\hat g^D$ on $\mathcal{J}$, this implies that for all sufficiently large $n$, $P\{D(\hat p) \le D(\tilde p)\} > 1 - \delta$. Hence $D(\hat p) = O_p\left(n^{-1} h^{2|d|+2r+1}\right)$.

Lemma A.3. For each $\delta > 0$, there exists $C = C(\delta)$ such that, for all sufficiently large $n$,
$$P\left\{\tilde g^D(x) > \frac{1}{3} g^D(x), \quad \forall x \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge Ch^{r/2}\right\} > 1 - \delta. \tag{A.25}$$

Proof of Lemma A.3. Define $\mathcal{X} = \{X_1, \ldots, X_n\}$ and $\mu = E[\tilde g|\mathcal{X}]$. Let
$$A_n = \sup\left\{M > 0 : \mu^D(x) < \frac{2}{3} g^D(x), \ \forall x \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \le M\right\}, \tag{A.26}$$
$$B_n = \sup\left\{M > 0 : |\tilde g^D(x) - \mu^D(x)| > \frac{1}{3} g^D(x), \ \forall x \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \le M\right\}, \tag{A.27}$$
and $C_n = \max(A_n, B_n)$. Then, for any $x$ such that $\inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge C_n$, we have $\mu^D(x) \ge \frac{2}{3} g^D(x)$ and $\left|\tilde g^D(x) - \mu^D(x)\right| \le \frac{1}{3} g^D(x)$. Thus,
$$\tilde g^D(x) > \frac{1}{3} g^D(x), \quad \forall x \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge C_n.$$
Therefore, we only need to show that
$$A_n = O_p\left(h^{r/2}\right) \quad \text{and} \quad B_n = O_p\left(h^{r/2}\right). \tag{A.28}$$
Define the stochastic processes $\gamma_1 = \mu^D - g^D$ and $\gamma_2 = \tilde g^D$. For an integer $j \ge 0$, let
$$\mathcal{J}_j = \left\{x \in C(\mathcal{J} \setminus \mathcal{X}_0) : jh^{r/2} \le \inf_{x_0\in\mathcal{X}_0} |x - x_0| \le (j+1)h^{r/2}\right\},$$
where $C(S)$ is the closure of a subset $S \subset \mathcal{J}$. Let $J = \max\{j : \mathcal{J}_j \neq \emptyset\}$. Define $\gamma_{kj}(x) = \sqrt{n}\, h^{|d|+\frac{r}{2}} \gamma_k(x)$ on $\mathcal{J}_j$ for each $0 \le j \le J$. We want to show that
$$\sup_j P\left\{\sup_{x\in\mathcal{J}_j} |\lambda_j(x)| > v\right\} \le C_1(1+v)^{-2} \tag{A.29}$$
for all $v \in [0, C_2 h^{-r/2}]$, where $\lambda_j$ stands for $\gamma_{1j}$ or $\gamma_{2j}$ and the constants $C_1, C_2 > 0$ do not depend on $n$.

For this purpose, we need to use a strong approximation result found in Komlós et al. (1975), generally referred to as the Hungarian embedding. Let $z_i$, $i = 1, \ldots, n$, be a sequence of independent and identically distributed random variables with $E[z_i] = 0$ and $E[z_i^2] = \sigma^2$, and let $S_n = \sum_{i=1}^n z_i$. Then the Hungarian embedding result says that there exists a sequence $T_n$ with the same distribution as $S_n$ and a sequence of standard Brownian bridges $B_n(t)$ such that
$$\sup_{0\le t\le 1}\left|T_{\lfloor nt\rfloor} - \sigma\sqrt{n}\, B_n(t)\right| = O_p(\log n),$$
where $\lfloor x\rfloor$ signifies the integer part of $x$.

Again, we only need to prove (A.29) for the highest order derivative $d$. To economize on notation, we now use $\gamma_1 = \mu^{(d)} - g^{(d)}$ and $\gamma_2 = \tilde g^{(d)} - \mu^{(d)}$. Due to the independence between $\varepsilon_i = Y_i - g(X_i)$ and $X_i$, we have $\mu^{(d)}(x) = \frac{1}{n}\sum_{i=1}^n A_i^{(d)}(x) g(X_i)$. Hence
$$\gamma_1(x) = \frac{1}{n}\sum_{i=1}^n \left\{A_i^{(d)}(x) g(X_i) - g^{(d)}(x)\right\} \quad \text{and} \quad \gamma_2(x) = \frac{1}{n}\sum_{i=1}^n A_i^{(d)}(x)\varepsilon_i. \tag{A.30}$$
Note that $\operatorname{Var}(\gamma_k(x)) = \left(nh^{2|d|+r+1}\right)^{-1} C_{k,K}\, g(x) + o\left(\left(nh^{2|d|+r+1}\right)^{-1}\right)$, where $C_{1,K}$ and $C_{2,K}$ are both constants depending only on $K(\cdot)$. Also, the fact that $K(\cdot)$ is compactly supported indicates that both sums in (A.30) essentially have only $\lfloor nC_{x,h}\rfloor$ terms, where $C_{x,h} = P\{X \in B(x, Ch)\}$ for some constant $C$. Now applying the Hungarian embedding result to each $\sqrt{n}\gamma_{kj}(x)$ yields
$$\sup_{x\in\mathcal{J}_j} |\lambda_j(x)| \le \sup_{x\in\mathcal{J}_j}\left|\sqrt{C_{k,K}\, g(x)}\, B_n(C_{x,h})\right| + O_p\left(\frac{\log n}{\sqrt{n}}\right).$$
Then (A.29) follows from the modulus of continuity for a Brownian bridge; see, e.g., (Csörgő and Révész, 1981, Chapter 1).

Therefore, letting $w_j \equiv (3h^r)^{-1}\inf_{x\in\mathcal{J}_j} g^D(x)$ and $v_j \equiv \min\left(w_j, C_2 h^{-r/2}\right)$, we have
$$P\left\{|\gamma(t)| > \frac{1}{3} g^D(t) \text{ for some } t \in \cup_{j=j_0}^J \mathcal{J}_j\right\} \le \sum_{j=j_0}^J P\left\{|\gamma(t)| > \frac{1}{3} g^D(t) \text{ for some } t \in \mathcal{J}_j\right\} \le \sum_{j=j_0}^J P\left\{\sup_{x\in\mathcal{J}_j} |\lambda_j(x)| > v_j\right\} \le \sum_{j=j_0}^J C_1(1+v_j)^{-2}, \tag{A.31}$$
where $\gamma$ represents either $\gamma_1$ or $\gamma_2$. For any $x \in \mathcal{J}_j$, let $x_0 = \arg\min_{y_0\in\mathcal{X}_0} |x - y_0| \in \partial\mathcal{X}_0 \setminus \partial\mathcal{J}$ and $e = (x - x_0)/|x - x_0|$ be a unit vector. Then there exists a constant $0 \le u \le 1$ such that $x = x_0 + (j+u)h^{r/2} e$. The conditions imposed on $g$ in the theorem imply that
$$g^D\left(x_0 + (j+u)h^{r/2} e\right) = \frac{1}{2}(j+u)^2 h^r e^T \frac{\partial^2 g^D}{\partial x\partial x^T}(x_0)\, e + O_p\left(h^{3r/2}\right).$$
Hence, $w_j \ge C_3 j^2$ and $v_j \ge \min\left(C_3 j^2, C_4 h^{-r/2}\right)$, where $C_3, C_4 > 0$ do not depend on $n$. Since $\sum_{j=j_0}^J \left(1 + C_3 j^2\right)^{-2} \to 0$ as $j_0 \to \infty$ and $\sum_{j=j_0}^J \left(1 + C_4 h^{-r/2}\right)^{-2} = O_p(Jh^r) = O_p\left(h^{r/2}\right)$, it now follows from (A.31) that
$$\lim_{C\to\infty}\limsup_{n\to\infty} P\left\{\tilde g^D(x) \le \frac{1}{3} g^D(x) \text{ for some } x \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge Ch^{r/2}\right\} = 0.$$
Applying the results to the respective versions of $\gamma$ we obtain (A.25).

Proof of Theorem 2.2. We prove each part of Theorem 2.2 in turn.

(i) For $n$ sufficiently large the uniform weights will automatically satisfy the constraints. In this case the constrained estimator will be equivalent to the unconstrained estimator.

(ii) By the Cauchy-Schwarz inequality,
$$\left|\hat g^{(j)}(x|\hat p) - \tilde g^{(j)}(x)\right| \le \left|n^{-1}\sum_{i=1}^n (n\hat p_i - 1)^2\right|^{1/2}\left|n^{-1}\sum_{i=1}^n A_i^{(j)}(x)^2 Y_i^2\right|^{1/2} \le O_p\left(n^{-1}h^{2|d|+2r+1}\right)^{1/2} O_p\left(h^{-(r+2|j|)}\right)^{1/2} = O_p\left(h^{\jmath_j}\right), \tag{A.32}$$
where $\jmath_j = |d| + \frac{r+1}{2} - |j|$. The last inequality follows from Lemma A.2 and the fact that $K(\cdot)$ is compactly supported. Note that (A.32) indicates that when $j$ is of lower order than $d$,
$$\sup_{x\in\mathcal{J}}\left|\hat g^{(j)}(x|\hat p) - \tilde g^{(j)}(x)\right| \le \sup_{x\in\mathcal{J}}\left|\hat g^{(d)}(x|\hat p) - \tilde g^{(d)}(x)\right| = O_p\left(h^{\jmath_d}\right) = O_p\left(h^{\frac{r+1}{2}}\right). \tag{A.33}$$
We used Lemma A.2 to establish the final order of the above difference. In particular, $\sup_{x\in\mathcal{J}} |\hat g(x|\hat p) - \tilde g(x)| = O_p\left(h^{|d|+\frac{r+1}{2}}\right)$. This proves part (ii) of Theorem 2.2.

(iii) Since $K(\cdot)$ is compactly supported, there exists a constant $M$ such that
$$A_i(x) = 0 \quad \text{if } |x - X_i| \ge Mh. \tag{A.34}$$
Let $Z_2$ be a random variable such that
$$Z_2 h^{\frac{r+1}{4}} = \inf\left\{z > 0 : \hat p_i = n^{-1}(1+\Theta), \ \forall i \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x_0 - X_i| \ge z\right\},$$
where $\Theta$ is a random variable not depending on $i$. This infimum is well defined given Lemma A.1. Given (A.34), any $x$ such that $\inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge Mh + Z_2 h^{\frac{r+1}{4}}$ implies $\hat p_i = n^{-1}(1+\Theta)$, $\forall i$ such that $A_i(x) \neq 0$, which in turn implies that $\hat g(x|\hat p) = (1+\Theta)\tilde g(x)$.

Then Lemma A.3 yields
$$P\left\{\hat g^D(x|\hat p) > \frac{1}{3}(1+\Theta) g^D(x) > 0, \ \forall x \text{ such that } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge \max\left(Ch^{r/2},\, Mh + Z_2 h^{\frac{r+1}{4}}\right)\right\} > 1 - \delta. \tag{A.35}$$
Using (A.33) along with the convergence rate of $\tilde g^D$ gives us $\hat g^D(x|\hat p) = g^D(x) + O_p\left(h^{\frac{r+1}{2}}\right)$ for any $x \in \mathcal{J}$. In particular, let $x = x_0 + \left(Ch^{r/2} + Z_2 h^{\frac{r+1}{4}}\right) e \in \mathcal{J} \setminus \mathcal{X}_0$, where $x_0 = \arg\min_{y_0\in\mathcal{X}_0} |x - y_0| \in \partial\mathcal{X}_0 \setminus \partial\mathcal{J}$ and $e = (x - x_0)/\|x - x_0\|$ is a vector of unit length. Using $Z(h) = Ch^{r/2} + Z_2 h^{\frac{r+1}{4}}$ and a Taylor expansion gives
$$\hat g^D(x_0 + Z(h)e|\hat p) = \frac{1}{2} Z(h)^2 e^T \frac{\partial^2 g^D}{\partial x\partial x^T}(x_0)\, e + O_p\left(h^{\frac{r+1}{2}}\right) + o_p\left(Z(h)^2\right) = O_p(h^r) + O_p\left(Z_2^2 h^{\frac{r+1}{2}}\right) + O_p\left(h^{\frac{r+1}{2}}\right).$$
Clearly, we should have $Z_2 = O_p(1)$ in order to guarantee (A.35). Hence, $\hat p_i$ is identically equal to $n^{-1}(1+\Theta)$ for some random variable $\Theta$ and for all $i$ such that $|X_i - x_0| \ge Z_2 h$, where $Z_2 = O_p(1)$. Now we can define a random variable $Z_1$ such that
$$Z_1 h^{\frac{r+1}{4}} = \inf\left\{z \ge 0 : \forall x \in \mathcal{J} \text{ for which } \inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge z, \ A_i(x) = 0 \text{ whenever } \hat p_i \neq n^{-1}(1+\Theta)\right\}.$$
Clearly $Z_1 = O_p(1)$ and $\hat g(x|\hat p) = (1+\Theta)\tilde g(x)$ for all values $x \in \mathcal{J}$ such that $\inf_{x_0\in\mathcal{X}_0} |x - x_0| \ge Z_1 h^{\frac{r+1}{4}}$. This establishes the second part of (iii) of Theorem 2.2.

The order of $\Theta$ is derived as follows. From Lemma A.1 we have that the weights $\hat p_i$ are identical to $n^{-1}(1+\Theta)$ for indices $i$ that lie a distance $O\left(h^{r/2}\right)$ from $x_0$. Let there be $N$ such indices that $\hat p_i = n^{-1}(1+\Theta)$, and let $A$ be the set of remaining indices. Then $N = O_p(n)$, $N_1 \equiv |A| = O_p\left(nh^{r/2}\right)$ and $\sum_i \hat p_i = 1$ deliver
$$|\Theta| \le N^{-1}\sum_{i\in A} |n\hat p_i - 1| \le N^{-1}\left\{\sum_{i\in A} 1\right\}^{1/2}\left\{\sum_{i\in A} (n\hat p_i - 1)^2\right\}^{1/2} \le N^{-1} N_1^{1/2}\left\{\sum_{i=1}^n (n\hat p_i - 1)^2\right\}^{1/2} = O_p\left(h^{|d|+r+1}\right).$$
This establishes the first part of (iii) of Theorem 2.2.

A.3 Proof of Theorem 2.3

The proof of Theorem 2.3 is divided into two steps: the first step uses the KarushKuhn-Tucker conditions to obtain a closed form for D(ˆ p) and the second step applies standard asymptotic results for multivariate kernel regression function estimates and their derivatives to obtain the asymptotic distribution of D(ˆ p). Some tedious details are skipped in the second step for the sake of brevity. Proof of Theorem 2.3. The Karush-Kuhn-Tucker conditions say that there exist constants λ1 and λ2j , j = 1, . . . , N such that −1

2(ˆ pi − n ) + λ1 −

N X

λ2j ψi (xj ) = 0, i = 1, . . . , n,

j=1 n X

λ2j

pˆi ψi (xj ) = 0, j = 1, . . . , N,

i=1

X

(A.36) (A.37)

pˆi = 1,

(A.38)

pˆi ψi (xj ) ≥ 0, λ2j ≥ 0, j = 1, . . . , N,

(A.39)

i=1 n X i=1

where (A.36) is the stationary condition, (A.37) is the complementary slackness condition, and (A.38) and (A.39) are feasibility conditions. 36

P ¯ Let ψ(x) = n−1 ni=1 ψi (x). Solving (A.36) for pˆi we have N

−1

pˆi = n

λ1 1 X + + λ2j ψi (xj ). 2 2 j=1

Summing up (A.40) and using (A.38) we can obtain λ1 =

(A.40)

PN j=1

¯ j ). Substitution λ2j ψ(x

of this into (A.40) yields N

1 1X ¯ j )). λ2j (ψi (xj ) − ψ(x pˆi = − n 2 j=1

(A.41)

In light of (A.37), we can permute $(\lambda_{21}, \ldots, \lambda_{2N})$ to obtain $(\delta_1, \ldots, \delta_N)$ such that $\delta_j > 0$ for $j = 1, \ldots, M$ and $\delta_j = 0$ for $j = M+1, \ldots, N$, where $M$ is the number of nonzero $\lambda_{2j}$'s. Permuting $(x_1, \ldots, x_N)$ accordingly to obtain $(x_1^*, \ldots, x_N^*)$, (A.41) becomes
$$\hat{p}_i = \frac{1}{n} + \frac{1}{2} \sum_{j=1}^M \delta_j \big( \psi_i(x_j^*) - \bar{\psi}(x_j^*) \big). \tag{A.42}$$
For $j = 1, \ldots, M$, (A.37) implies $\sum_{i=1}^n \hat{p}_i \psi_i(x_j^*) = 0$, which, after plugging in (A.42), yields
$$\sum_{k=1}^M \delta_k \sigma_{jk}^{(n)} = -\frac{2}{n} \bar{\psi}(x_j^*), \quad j = 1, \ldots, M, \tag{A.43}$$
where $\sigma_{jk}^{(n)} = \frac{1}{n} \sum_{i=1}^n \psi_i(x_j^*) \psi_i(x_k^*) - \bar{\psi}(x_j^*) \bar{\psi}(x_k^*)$.

Let $\Sigma_n$ be the matrix with element $(j,k)$ equal to $\sigma_{jk}^{(n)}$, let $\delta = (\delta_1, \ldots, \delta_M)^T$ and $x^* = (x_1^*, \ldots, x_M^*)^T$, and for a function $f$ write $f(x^*) = (f(x_1^*), \ldots, f(x_M^*))^T$. Then (A.43) gives $\delta = -\frac{2}{n} \Sigma_n^{-1} \bar{\psi}(x^*)$. Combining this with (A.42) gives
$$D(\hat{p}) = \sum_{i=1}^n \left( \frac{1}{n} - \hat{p}_i \right)^2 = \frac{1}{n^2} \sum_{i=1}^n \Big[ \bar{\psi}(x^*)^T \Sigma_n^{-1} \big( \psi_i(x^*) - \bar{\psi}(x^*) \big) \Big]^2 \equiv \frac{1}{n^2} \sum_{i=1}^n T_i^2, \tag{A.44}$$
where $T_i = \bar{\psi}(x^*)^T \Sigma_n^{-1} \big( \psi_i(x^*) - \bar{\psi}(x^*) \big)$.
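The closed form (A.44) is easy to evaluate directly. A minimal sketch, assuming a hypothetical $n \times M$ matrix `PsiStar` whose columns hold $\psi_i(x_j^*)$ at the $M$ active constraint points:

```r
## Minimal sketch of (A.44): D(p-hat) computed from the T_i's. PsiStar is a
## hypothetical n x M matrix with PsiStar[i, j] = psi_i(x*_j); illustrative only.
D_closed_form <- function(PsiStar) {
  n        <- nrow(PsiStar)
  psi_bar  <- colMeans(PsiStar)                             # bar-psi(x*)
  Sigma_n  <- crossprod(PsiStar) / n - tcrossprod(psi_bar)  # sigma^(n)_jk
  centered <- sweep(PsiStar, 2, psi_bar)                    # psi_i(x*) - bar-psi(x*)
  Ti       <- as.vector(centered %*% solve(Sigma_n, psi_bar))
  sum(Ti^2) / n^2                                           # equation (A.44)
}
```

Up to numerical tolerance, this agrees with $\sum_i (1/n - \hat{p}_i)^2$ computed from the quadratic-program solution sketched earlier.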

We now investigate the asymptotic behavior of the $T_i$'s. For the sake of brevity we only provide a sketch of the argument; the details we skip are straightforward but tedious extensions of the local asymptotic results in Gasser and Müller (1984). We use empirical process notation for part of the argument: let $E_n$ denote the empirical mean, $E_n[k(X,Y)] = \frac{1}{n} \sum_{i=1}^n k(X_i, Y_i)$ for any function $k$, where $X$ and $Y$ are random variables following the distributions of the $X_i$ and $Y_i$, respectively.

First note that under Assumptions A.1(i)-(iv), $\bar{\psi}(x) = E_n[A_X^D(x) Y]$ is a consistent estimate of $g^D(x)$ with
$$\operatorname{Cov}\big(\bar{\psi}(t_1), \bar{\psi}(t_2)\big) = \frac{\sigma^2}{n h^{2|d|+r}} \left[ \int K^{(d)}(y)\, K^{(d)}\Big( y + \frac{t_2 - t_1}{h} \Big)\, dy + o(1) \right], \tag{A.45}$$
and $\sigma_{jk}^{(n)} = E_n[\psi_X(x_j^*) \psi_X(x_k^*)] - E_n[\psi_X(x_j^*)]\, E_n[\psi_X(x_k^*)]$ is a consistent estimate of $\operatorname{Cov}(\psi_X(x_j^*), \psi_X(x_k^*))$. Under Assumption A.2(ii), when $j \neq k$, $x_j^*$ and $x_k^*$ are sufficiently far apart that $\operatorname{Cov}(\psi_X(x_j^*), \psi_X(x_k^*)) = 0$. Let $\sigma_K^2 = \sigma^2 \int \big[ K^{(d)}(y) \big]^2 dy$; then $\operatorname{Var}(\psi_X(x^*)) = h^{-(2|d|+r)} \sigma_K^2 I_M$ and $\Sigma_n = h^{-(2|d|+r)} \big( \sigma_K^2 I_M + o_p(1) \big)$, where $I_M$ is the identity matrix of dimension $M$. Together with the consistency of $\bar{\psi}(x)$, we thus have
$$\bar{\psi}(x^*) = g^D(x^*) + o_p(1) \quad \text{and} \quad \Sigma_n^{-1} = h^{2|d|+r} \big( \sigma_K^{-2} I_M + o_p(1) \big).$$
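The kernel constant $\sigma_K^2$ is a one-dimensional integral and is cheap to evaluate numerically. A small sketch, assuming for concreteness a Gaussian kernel, a first-order derivative constraint (so $K^{(d)} = K'$), and $\sigma^2 = 1$; none of these choices are prescribed by the paper:

```r
## Minimal sketch: sigma_K^2 = sigma^2 * integral of [K^(d)(y)]^2 dy for a
## Gaussian kernel with d = 1 and sigma^2 = 1 (illustrative assumptions).
Kd <- function(y) -y * dnorm(y)   # first derivative of the Gaussian kernel
sigma_K2 <- integrate(function(y) Kd(y)^2, lower = -Inf, upper = Inf)$value
sigma_K2                          # approximately 1 / (4 * sqrt(pi)) = 0.1410474
```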

By Slutsky's Theorem, the asymptotic distribution of $T_i$ is equivalent to that of the random variable
$$W_i = \sigma_K^{-2} h^{2|d|+r} \big[ g^D(x^*) \big]^T \big( \psi_i(x^*) - g^D(x^*) \big). \tag{A.46}$$
Thus, the asymptotic distribution of $D(\hat{p})$ is equivalent to that of $\frac{1}{n^2} \sum_{i=1}^n W_i^2$. Under Assumptions A.1(i)-(iv), the $h^{|d|+r/2} \big( \psi_i(x^*) - g^D(x^*) \big)$ are independent and identically distributed random vectors that are asymptotically normal with zero mean and covariance $\sigma_K^2 I_M$. Hence $W_i$ is asymptotically equivalent to $\sigma_K^{-1} h^{|d|+r/2} \big\{ \sum_{j=1}^M g^D(x_j^*) \big\} Z_i$, where the $Z_i$ are i.i.d. standard normal random variables. Thus
$$\frac{n^2 \sigma_K^2}{h^{2|d|+r} \big( \sum_{j=1}^M g^D(x_j^*) \big)^2}\, D(\hat{p}) = \sum_{i=1}^n Z_i^2 \sim \chi^2(n).$$


Note that $M$ may increase with $n$; Assumption A.2(i) ensures that $\frac{1}{n} \sum_{j=1}^M g^D(x_j^*) = O(1)$.
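In principle the last display could be used directly for inference: rescale $D(\hat{p})$ by the constant on its left-hand side and compare with a $\chi^2(n)$ critical value. The schematic below illustrates only this rescaling; every input (`Dhat`, `h`, `d_abs`, `r`, `gD_star`, `sigma_K2`) is assumed to be available or estimated, and this is not the bootstrap procedure proposed in the paper for testing the constraints.

```r
## Schematic sketch of the chi-square rescaling implied by the last display.
## All inputs are assumed known or estimated; illustrative only.
chisq_rescale <- function(Dhat, n, h, d_abs, r, gD_star, sigma_K2, alpha = 0.05) {
  stat <- n^2 * sigma_K2 * Dhat / (h^(2 * d_abs + r) * sum(gD_star)^2)
  crit <- qchisq(1 - alpha, df = n)  # chi-square with n degrees of freedom
  list(statistic = stat, critical = crit, reject = stat > crit)
}
```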

References

Braun, W. J. and Hall, P. "Data sharpening for nonparametric inference subject to constraints." Journal of Computational and Graphical Statistics, 10:786–806 (2001).

Cressie, N. A. C. and Read, T. R. C. "Multinomial Goodness-of-Fit Tests." Journal of the Royal Statistical Society, Series B, 46:440–464 (1984).

Csörgő, M. and Révész, P. Strong Approximations in Probability and Statistics. New York: Academic Press (1981).

Fan, J. "Design-Adaptive Nonparametric Regression." Journal of the American Statistical Association, 87(420):998–1004 (1992).

Fan, J., Zhang, C., and Zhang, J. "Generalized Likelihood Ratio Statistics and Wilks Phenomenon." The Annals of Statistics, 29(1):153–193 (2001).

Gallant, A. R. "On the Bias in Flexible Functional Forms and an Essential Unbiased Form: The Fourier Flexible Form." Journal of Econometrics, 15:211–245 (1981).

—. "Unbiased Determination of Production Technologies." Journal of Econometrics, 20:285–323 (1982).

Gallant, A. R. and Golub, G. H. "Imposing Curvature Restrictions on Flexible Functional Forms." Journal of Econometrics, 26:295–321 (1984).

Gasser, T. and Müller, H.-G. "Kernel Estimation of Regression Functions." In Smoothing Techniques for Curve Estimation, 23–68. Berlin, Heidelberg, New York: Springer-Verlag (1979).

—. "Estimating Regression Functions and Their Derivatives by the Kernel Method." Scandinavian Journal of Statistics, 11(3):171–185 (1984).

Ghosal, S., Sen, A., and van der Vaart, A. W. "Testing Monotonicity of Regression." Annals of Statistics, 28(4):1054–1082 (2000).

Hall, P. and Heckman, N. E. "Testing monotonicity of a regression mean by calibrating for linear functionals." Annals of Statistics, 28(1):20–39 (2000).

Hall, P. and Huang, H. "Nonparametric kernel regression subject to monotonicity constraints." The Annals of Statistics, 29(3):624–647 (2001).

Hall, P., Huang, H., Gifford, J., and Gijbels, I. "Nonparametric estimation of hazard rate under the constraint of monotonicity." Journal of Computational and Graphical Statistics, 10(3):592–614 (2001).

Hall, P. and Kang, K. H. "Unimodal kernel density estimation by data sharpening." Statistica Sinica, 15:73–98 (2005).

Horrace, W. and Schmidt, P. "Multiple Comparisons with the Best, with Economic Applications." Journal of Applied Econometrics, 15:1–26 (2000).

Komlós, J., Major, P., and Tusnády, G. "An approximation of partial sums of independent random variables and the sample distribution function, part I." Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32(1–2):111–131 (1975).

Li, Q. and Racine, J. Nonparametric Econometrics: Theory and Practice. Princeton University Press (2007).

Mammen, E. "Estimating a Smooth Monotone Regression Function." Annals of Statistics, 19(2):724–740 (1991).

Mammen, E., Marron, J. S., Turlach, B. A., and Wand, M. P. "A General Projection Framework for Constrained Smoothing." Statistical Science, 16(3):232–248 (2001).

Mammen, E. and Thomas-Agnan, C. "Smoothing Splines and Shape Restrictions." Scandinavian Journal of Statistics, 26:239–252 (1999).

Matzkin, R. L. "Semiparametric Estimation of Monotone and Concave Utility Functions for Polychotomous Choice Models." Econometrica, 59:1315–1327 (1991).

—. "Nonparametric and Distribution-Free Estimation of the Binary Choice and the Threshold-Crossing Models." Econometrica, 60:239–270 (1992).

Mukerjee, H. "Monotone Nonparametric Regression." Annals of Statistics, 16:741–750 (1988).

Nadaraya, E. A. "On Nonparametric Estimates of Density Functions and Regression Curves." Theory of Probability and Its Applications, 10:186–190 (1965).

Priestley, M. B. and Chao, M. T. "Nonparametric Function Fitting." Journal of the Royal Statistical Society, Series B, 34:385–392 (1972).

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2008). ISBN 3-900051-07-0. URL http://www.r-project.org/

Racine, J. S. and Li, Q. "Nonparametric estimation of regression functions with both categorical and continuous data." Journal of Econometrics, 119(1):99–130 (2004).

Ramsay, J. O. "Monotone Regression Splines in Action (with Comments)." Statistical Science, 3:425–461 (1988).

Robertson, T., Wright, F., and Dykstra, R. Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons (1988).

Villalobos, M. and Wahba, G. "Inequality-Constrained Multivariate Smoothing Splines with Application to the Estimation of Posterior Probabilities." Journal of the American Statistical Association, 82:239–248 (1987).

Wang, X. and Shen, J. "A class of grouped Brunk estimators and penalized spline estimators for monotone regression." Biometrika, 97:585–601 (2010).

Watson, G. S. "Smooth Regression Analysis." Sankhyā, Series A, 26:359–372 (1964).

Wright, I. W. and Wegman, E. J. "Isotonic, Convex and Related Splines." The Annals of Statistics, 8:1023–1035 (1980).

Yatchew, A. and Härdle, W. "Nonparametric state price density estimation using constrained least squares and the bootstrap." Journal of Econometrics, 133:579–599 (2006).