Nonparametric Estimation of Conditional Average Treatment Effects

60 downloads 0 Views 785KB Size Report
Keywords: conditional average treatment effect, inverse probability weighted ... that the average treatment effect (ATE) in the population is nonparametrically ...
Nonparametric Estimation of Conditional Average Treatment Effects Jason Abrevaya∗ Yu-Chin Hsu† Robert P. Lieli‡

Abstract We introduce a functional parameter called the conditional average treatment effect (CATE), designed to capture the possible heterogeneity of a treatment effect across subpopulations in situations where the unconfoundedness assumption applies. In contrast to quantile regressions, the subpopulations of interest are defined in terms of the possible values of a set of continuous covariates rather than the quantiles of the outcome distribution. We show that the CATE parameter is nonparametrically identified under the unconfoundedness assumption and propose to estimate it with an inverse probability weighted estimator. Under regularity conditions, some of which are standard and some of which are new in the literature, we show (pointwise) consistency and asymptotic normality. We apply our method to estimate the average effect of a first-time mother’s smoking during pregnancy on the baby’s birth weight as a function of the mother’s age. Focusing on the subpopulation of black mothers, we find that the effect of smoking decreases with age in absolute value, i.e. smoking is particularly harmful for the babies of young first time mothers. Keywords: conditional average treatment effect, inverse probability weighted estimation, treatment effect heterogeneity, nonparametric methods, birth weight ∗

Department of Economics, University of Texas, Austin. Department of Economics, University of Missouri, Columbia. ‡ Department of Economics, Central European University, Budapest and the National Bank of Hungary. †

Email: [email protected]. Preliminary and incomplete; please do not cite without the authors’ permission.

1

1

Introduction

When individual treatment effects in the population are heterogeneous, but treatment assignment is unconfounded given a vector X of observable covariates, it is a well-known result that the average treatment effect (ATE) in the population is nonparametrically identified (see, e.g., Rosenbaum and Rubin 1983, 1985). Given the heterogeneity of individual effects, it may also be of interest to estimate ATE in various subpopulations defined by the possible values of some component(s) of X. We will refer to the value of the ATE parameter within such a subpopulation as a conditional average treatment effect (CATE). For example, if one of the covariates is gender, one might be interested in estimating ATE separately for males and females. As treatment assignment in the two subpopulations is unconfounded given the rest of the components of X, one can simply split the sample by gender and apply standard nonparametric estimators of ATE to the two subsamples. A second example, considered by Lee and Whang (2009), is the case when X is absolutely continuous and the subpopulations of interest are of the form X = x, each having measure zero. The parameter of interest, CATE, is an entire function defined over the support of X that returns, for any given value x of X, the conditional mean of the treatment effect. Lee and Whang (2009) estimate this function using a kernel-based nonparametric estimator and propose a method for testing various hypotheses about it. (In effect, they study the ’first stage’ of the imputation-based nonparametric ATE estimators proposed by Hahn (1998) and Heckman, Ichimura and Todd (1997, 1998).) In this paper we extend the second example to the technically more challenging situation in which the covariates X1 we want to condition on are a strict subset of X. As the unconfoundedness assumption will not generally hold conditional on X1 alone, it is not possible to simply apply the Lee and Whang (2009) estimator with X1 playing the role of X. Rather, one needs to estimate CATE as a function of X, and then average out the unwanted components by integrating with respect to the conditional distribution of X(1) given X1 , where X(1) denotes those components of X that are not in X1 . This distribution is, however, generally unknown and has to be estimated. When X1 is a discrete variable (such as in the first example), averaging with respect 2

to the empirical distribution of X(1) | X1 is accomplished “automatically” by virtue of the ATE estimator being implemented subsample-by-subsample. The result is an estimate of CATE(x1 ) for each point x1 in the support of X1 . This suggests that when X1 is continuous one could at least approximate CATE by discretizing X1 and estimating ATE on the resulting subsamples (provided that they are large enough). There are at least two problems with this procedure. First, the fact that treatment assignment is unconfounded conditional on X does not theoretically imply that it will continue to be unconfounded if the continuous subvector X1 is replaced by a discrete approximation.1 Second, even if unconfoundedness holds up (at least in some approximate sense), the CATE estimate obtained this way will still depend on the discretization used and will be rather crude and discontinuous, just as a histogram is generally a crude and discontinuous estimate of the underlying density function. The technical contribution of this paper consists of proposing a “smooth” nonparametric estimator of CATE when X1 , the vector of continuous conditioning variables, is a strict subset of X.2 The estimator is constructed as follows. First, the propensity score, the probability of treatment conditional on X, is estimated by a kernel-based nonparametric regression. Second, the estimator assigns weights to the observed outcomes based on treatment status and the estimated propensity score. The third (final) step can be interpreted as integrating with respect to a smoothed estimate of the distribution of X(1) given X1 . Under regularity conditions the estimator is shown to be consistent and asymptotically normal; the results allow for pointwise inference about CATE as a function of X1 . Of the conditions used to prove this result, the most noteworthy ones are those that restrict the relative convergence 1

Let U , V and X be random variables such that U and V are independent conditional on X. If X is

discrete, then U and V are also conditionally independent given 1{X=x} , where P (X = x) > 0. In contrast, if X is absolutely continuous, then U and V are not generally independent conditional on 1{X∈I} , where I is some interval with P (X ∈ I) > 0. E.g., let U = X +  and V = X + ν, where X,  and ν are mutually independent and have mean zero. Then E[U | X ∈ I] = E[V | X ∈ I] = E[X | X ∈ I] but E[U V | X ∈ I] = E[X 2 | X ∈ I] 6= (E[X | X ∈ I])2 . Nevertheless, if the interval I is small enough, an approximate equality might hold. 2 For simplicity, most formal results stated in the text will assume that X has continuous components only. Nevertheless, in applications X will typically contain discrete as well as continuous covariates. We will discuss a way to deal with the mixed case in the comments following Theorem 1.

3

rates of the smoothing parameters used in estimating the propensity score and the final averaging step and prescribe the order of the kernel. The CATE estimator described above can be regarded as a generalization of the inverse probability weighted ATE estimator proposed by Hirano, Imbens and Ridder (2003). An alternative nonparametric estimator of CATE can be based on the imputation approach (e.g. Hahn (1998)). While we can show that the two estimators are (first order) asymptotically equivalent, in this paper we will restrict attention to the first approach. We apply our method to a dataset on births in North Carolina recorded between 1988 to 2002. More specifically, we estimate the expected effect of a first-time mother’s smoking during pregnancy on the birth weight of her child as a function of several (continuous) independent variables, taken one at a time. We focus on black mothers. The contribution of this exercise to the pertaining empirical literature consists of exploring the heterogeneity of the smoking effect in an unrestricted and intuitive fashion. Previous estimates reported in the literature are typically constrained to be a single number by the functional form of the underlying regression model. If the effect of smoking is actually heterogeneous, such an estimate is of course not informative about how much the effect varies across relevant subpopulations and may not even be consistent for the overall population mean. Nevertheless, there have been some attempts in the literature to capture the heterogeneity of the treatment effect in question. Most notably, Abrevaya and Dahl (2008) estimate the effect of smoking separately for various quantiles of the birth weight distribution. Though insightful, a drawback of the quantile regression approach is that it allows for heterogeneity in the treatment effect across subpopulations that are not identifiable based on the mother’s characteristics alone. Hence, the estimated effects are hard to translate into targeted ’policy’ recommendations. For instance, ibid. report that the negative effect of smoking on birth weight is more pronounced at the median of the birth weight distribution (−81 grams) than at the 90th percentile (−56 grams). However, it is not clear, before actual birth, or at least without additional modeling, which estimate should be ’assigned’ to a mother with a given set of observable characteristics. In addition, the treatment effect for any given quantile could also be a function of these characteristics (this is assumed away by the linear specification

4

they use). In contrast, the CATE parameter is defined as a function of variables that are observable a priori, and the estimator proposed in this paper places only mild restrictions on the shape of this function. Furthermore, variations in the average effect might be easier to communicate to the general public than variations across quantiles. In this draft of the paper we discuss the estimation results only for the case when X1 = mother’s age. The main story that emerges from the exercise is that smoking is particularly harmful for the babies of young first time mothers in the population of interest. More specifically, the estimated average smoking effect on birth weight is −320 grams for a 16year-old black mother and −130 grams for a 32-year-old. The rest of the paper is organized as follows. In Section 2 we introduce the CATE parameter, discuss its identification and estimation. In Section 3 we present the empirical application. Section 4 outlines possible extensions of the basic framework to multivalued treatments, instrumental variables, etc. Section 5 concludes.

2

Theory

2.1

The formal framework and the estimator

Let D be a dummy variable indicating treatment status in a population of interest with D = 1 if an individual (unit) receives treatment and D = 0 otherwise. Define Y (1) as the potential outcome for an individual if treatment were imposed exogenously; Y (0) is the corresponding potential outcome without treatment. Let X be a k-dimensional vector of covariates with k ≥ 2. The econometrician observes D, X, and Y ≡ D · Y (1) + (1 − D) · Y (0). In particular, we make the following assumption. Assumption 1 (Sampling): The data, denoted {(Di , Xi , Yi )}ni=1 , is a random sample of size n from the joint distribution of the vector (D, X, Y ). Throughout the paper we maintain the assumption that the observed vector X can fully control for any endogeneity in treatment choice. Stated formally: Assumption 2 (Unconfoundedness): (Y (0), Y (1)) ⊥ D X. 5

Assumption 2 is also known as (strongly) “ignorable treatment assignment” (Rubin 1978), and it is a standard identifying assumption in the treatment effect literature. Let X1 ∈ R` be a subvector of X ∈ Rk , 1 ≤ ` < k, X absolutely continuous. The conditional average treatment effect (CATE) given X1 = x1 is defined as τ (x1 ) ≡ E[Y (1) − Y (0)|X1 = x1 ]. Under the unconfoundedness assumption, τ (x1 ) can be identified from the joint distribution of (X, D, Y ) as # (1 − D)Y DY − τ (x1 ) = E X 1 = x1 , p(X) 1 − p(X) "

(1)

where p(x) = P [D = 1|X = x] denotes the propensity score function. The identification result follows from a simple string of equalities justified by the law of iterated expectations and the unconfoundedness assumption. We propose a two-step estimator for τ (x1 ) based on (1). In the first stage, we estimate pˆ(x) by a kernel-based (Nadaraya-Watson) nonparametric regression. That is,  Pn Xi −x 1 D K i k i=1 h  . pˆ(x) = nh 1 P n Xi −x K i=1 h nhk

(2)

where K(·) is a kernel function and h is a smoothing parameter (bandwidth). The inverse probability weighted kernel estimator for τ (x1 ) is then defined as   P n  D i Yi (1−Di )Yi X1i −x1 1 − K 1 ` i=1 pˆ(Xi ) 1−ˆ p(Xi ) h1 nh  τˆ(x1 ) = 1 , Pn X1i −x1 1 i=1 K1 h1 nh` 1

where K1 (u) is a kernel function and h1 is a bandwidth different from K(u) and h, respectively. Even though we develop the asymptotic theory of τˆ(x1 ) for any ` < k, in our assessment the most relevant case in practice is ` = 1 (and maybe ` = 2). When X1 is a scalar, τˆ(x1 ) can easily be displayed as a two-dimensional graph while for higher dimensional X1 the presentation and interpretation of the CATE estimator can become rather cumbersome.

6

2.2

CATE in a linear regression framework

The standard linear regression model for program evaluation combines the unconfoundedness assumption with the assumption that the conditional expectation of the potential outcomes is a linear function. As noted by Imbens and Wooldridge (2009), the treatment effect literature has moved away from the regression method in the last fifteen years. The main reason is that the estimated average treatment effect can be severely biased if the linear functional form is not correct. Nevertheless, as the CATE parameter introduced in this paper has not been in use in the treatment effect literature, it is useful to develop further intuition about it by relating it to a standard linear regression framework. Given the vector X of covariates, we can, without loss of generality, write E[Y (d) | X] = µd + rd (X),

d = 0, 1,

where µd = E[Y (d)] and rd (·) is some function with E[rd (X)] = 0. Under the unconfoundedness assumption, the mean of the observed outcome conditional on D and X can then be represented as E(Y | X, D) = µ0 + (µ1 − µ0 )D + r0 (X) + [r1 (X) − r0 (X)]D. We examine two benchmark cases. Case 1. If one assumes r0 (X) = r1 (X) = [X−E(X)]0 β, then CATE(x1 ) = E[Y (1)−Y (0) | X1 = x1 ] is a constant function whose value is equal to ATE= µ1 − µ0 everywhere. ATE itself can be estimated as the regression coefficient on D from a linear regression of Y on a constant, D and X. Case 2. If one assumes r0 (X) = [X − E(X)]0 β and r1 (X) = [X − E(X)]0 (β + δ), then ATE can be estimated as in Case 1. However, CATE(X1 ) is no longer a trivial function of X1 ; rather, it is given by CAT E(X1 ) = µ1 − µ0 + E[(X − E(X))0 δ | X1 ]. Further assuming that the above conditional expectation w.r.t. X1 is a linear function of X1 gives rise to a three-step parametric estimator of CATE. The first step consists of regressing 7

[ Y on a constant, D, X and X · D; let AT E and δˆ denote the regression coefficients on D ¯ 0 δˆ on a constant and and X · D, respectively. The second step consists of regressing (X − X) X1 ; let γˆ0 and γˆ1 denote the estimated value of the constant term and the slope coefficients, respectively. Finally, for X1 = x1 , one takes [ AT E + γˆ0 + γˆ1 x1 as an estimate of CATE(x1 ). In most real world applications the assumptions underlying the two benchmark cases will hold, at best, as approximations. We will show that our nonparametric CATE estimator can tease out useful information from the data that would otherwise be masked by these narrow functional form assumptions. We do not mean to imply that nonparametric methods are inherently superior—often the curse of dimensionality can only be overcome by using tightly parameterized models. It is however important to be aware of the limitations of such models, especially if they are being used for analytical convenience only.

2.3

Asymptotic properties

We derive the asymptotic properties of τˆ(x1 ) under the following regularity conditions. Assumption 3 (Distribution of X): The support X of the k-dimensional covariate X is a Cartesian product of compact intervals, and the density of X, f (x), is bounded away from 0 on X . Let s and s1 denote positive integers such that s ≥ k and s1 ≥ (2 + `)k/2. Assumption 4 (Propensity score): (i) p(x) is bounded away from 0 and 1 on X ; (ii) p(x) is s-times continuously differentiable on X . Assumption 5 (Moments and smoothness): (i) supx∈X E[Y (j)2 | X = x] < ∞ for j = 0, 1; (ii) the functions mj (x) = E[Y (j) | X = x], j = 0, 1 and f (x) are s-times continuously differentiable on X .

8

Assumption 6 (Kernels): (i) K(u) is a kernel function of order s, is supported on Qk 3 i=1 [−1, 1], symmetric around zero, integrates to 1 and is continuously differentiable. Q (ii) K1 (u) is a kernel function of order s1 , is supported on `i=1 [−1, 1], symmetric around zero, integrates to 1 and is s times continuously differentiable. Assumption 7 (Bandwidths): The bandwidths h and h1 satisfy the following conditions: (i) hs = op (n−1/4 ) and n−1 h−k (log n) = op (n−1/2 ). (ii) nh12s+` → 0 and nh`1 → ∞. (iii) h2 h−2−` → 0. 1 Define the function ψ(x, y, d) as ψ(x, y, d) =

d(y − m1 (x)) (1 − d)(y − m0 (x)) − + m1 (x) − m0 (x) − τ (x1 ). p(x) 1 − p(x)

The following theorem states our main theoretical result. The proof is given in the Appendix. Theorem 1 Suppose that Assumptions 1 through 7 are satisfied. Then, for each point x1 in the support of X1 , n

X − x  X 1 1i 1 p ψ(Xi , Yi , Di )K1 (a) + op (1) h1 nh`1 fx1 (x1 ) i=1   q kK1 k22 σψ2 (x1 ) d ` (b) nh1 (ˆ τ (x1 ) − τ (x1 )) → N 0, , fx1 (x1 ) R where fx1 (x1 ) is the pdf of X1 , kK1 k2 = ( K(u)2 du)−1/2 , and σψ2 (x1 ) = E[ψ 2 (X, Y, D)|X1 = q nh`1 (ˆ τ (x1 ) − τ (x1 )) =

1

x1 ]. Comments 1. The technical restrictions imposed on the distribution of X and on various conditional moment functions in Assumptions 3 through 5 are analogous to those in Hirano, Imbens and Ridder (2003) (other references?) and are common in the literature on nonparametric estimation. As pointed out by Tamer and Khan (2010), the assumption that the 3

K : Rk → R is an s-order kernel if P that 1 ≤ i pi < s.

R

up11 · . . . · upkk K(u)du = 0 for all nonnegative integers p1 , . . . , pk such

9

propensity score is bounded away from zero and one plays an important role in determining the convergence rate of inverse probability weighted estimators. 2. We must show that there exist bandwidth sequences h and h1 satisfying Assumption 7—otherwise the asymptotic theory would be vacuous. Choose s = k and let s1 be the smallest even integer such that s1 ≥ (2 + `)k/2. If one sets −1

h = a · n 2k+δ , where a and δ are positive numbers; −1

h1 = a1 · n (2+`)k+δ1 , where a1 and δ1 are positive numbers; δ1 < ` and 2δ1 > (2 + `)δ, then nh12s1 +` = n

2s +`

1 1− 3k+δ

1 1− (2+k)k+δ

nh1 = n

1

1

−k

(2+`)k+` 1

1− (2+`)k+δ

≤n

δ1 −`

= n 3k+δ1 = op (1),

→ ∞, 1

hs = n 2k+1 = op (n− 4 ), k

−k−1

1

n−1 h−k log n = n−1+ 2k+1 log n = n 2k+1 log n = op (n− 2 ), −2

2+`

(2+`)δ−2δ1

h2 h1−2−` = n 2k+δ n (2+`)k+δ1 = n (2k+δ)((2+`)k+δ1 ) = op (1).

(3)

3. It is Assumptions 6 and 7 that underlie the more novel aspects of our asymptotic theory. [*** More about the role of bandwidth choice and the higher order kernels ***] 4. Assumption 3 does not allow X to have discrete components, which is of course restrictive in applications. One way to incorporate discrete covariates into the analysis is as follows. For concreteness, suppose that in addition to some continuous variables, X contains gender. Let M denote the indicator of the male subpopulation and define qm (x1 ) = P (M = 1 | X1 = x1 ) and τm (x1 ) = E[Y (1) − Y (0) | X1 = x1 , M = 1]. For the female subpopulation the quantities qf (x1 ) and τf (x1 ) are defined analogously. CATE in the overall population is given by τ (x1 ) = qm (x1 )τm (x1 ) + qf (x1 )τf (x1 ) = qm (x1 )τm (x1 ) + [1 − qm (x1 )]τf (x1 ). Thus, one can estimate CATE(x1 ) separately in the two discrete subpopulations using the proposed estimator, and then take a point-by-point weighted average with weights given by 10

a consistent estimate of qm (x1 ). The only difficulty lies in constructing standard errors for τˆ(x1 ). As we show in Appendix B, one can proceed as follows. Let τm (x1 ) and τf (x1 ) be estimated using the same bandwidth sequence h1 satisfying Assumption 7, where n is the overall sample size (including males and females). If, in addition, qm (x1 ) is estimated by   Pn X1i −x1 1 M K i 1 i=1 h1 nh`   , qˆm (x1 ) = 1 P n X1i −x1 1 i=1 K1 h1 nh` 1

then the following influence function representation applies to τˆ(x1 ): q τ (x1 ) − τ (x1 )) nh`1 (ˆ n h X  1 p = ψmi Mi + ψf i (1 − Mi ) + Mi − qm (x1 ) τm (x1 ) f (x1 ) nh`1 i=1 i  + 1 − Mi − qf (x1 ) τf (x1 ) K1i + op (1),

(4)

where ψm and ψf are analogs of the function ψ in the male/female subpopulations, ψmi =   1 ψm (Xi , Yi , Di ), and K1i = K1 X1ih−x . One can then obtain standard errors as in Theorem 1 1 with the function in the square brackets above playing the role of the function ψ.

3 3.1

Empirical application The data and indentification

The data set we use is from the North Carolina Vital Statistics State Center for Health Statistics. We focus on black mothers with their first child born between 1988-2002; these restrictions yield 157,989 observations. An attractive feature of the North Carolina Vital Statistics data is that it records the mother’s zip code. The availability of this information allows us to assign a per capita income and a geographical location (latitude and longitude) to each mother. Per capital income in the mother’s zip code is a proxy for family income, which is a potentially important covariate but is unavailable at the individual level. Geographical location also allows us to control for zip-code level unobserved heterogeneity (’zip code fixed effects’). 11

The full set of conditioning variables, i.e., the vector X used in estimating the propensity score, consists of the baby’s gender, marital status, mother’s age, mother’s education, month of first prenatal visit4 , number of prenatal visits, per capita income in mother’s zip code, zip code’s latitude and longitude. In addition to the effects for the whole population, we also consider four subpopulations defined by the first two components of X: single mothers with a baby girl (58,897 observations), single mothers with a baby boy (61,217 observations), married mothers with a baby girl (18,716 observations) and married mothers with a baby boy (19,159 observations). The rest of the components of X are treated as continuous within each subpopulation.5 We will estimate the conditional average treatment effect of the mother’s smoking on the firstborn’s birth weight as a function two factors, taken one at a time. First we set X1 as the mother’s age and then as the per capita income in the mother’s zip code. Our identifying assumption is that the potential birth weight outcomes are independent of the smoking decision conditional on the observables in X. Almond, Chay and Lee (2005), da Veiga and Wilder (2008), and Walker, Tekin and Wallace (2009) also use variants of the unconfoundedness assumption to identify the effect of smoking on birth weight. By contrast, Abrevaya (2006) and Abrevaya and Dahl (2008) attempt to control for unobserved heterogeneity by using a panel of mothers with multiple births. Nevertheless, their approach still imposes strong restrictions on the channels through which unobservable factors are allowed to operate. In particular, there cannot be feedback from the birth weight of the first child to the decision whether or not to give up smoking during the second pregnancy. This is likely violated in practice. Our purely cross-sectional approach (we focus on first births) eliminates this problem. Furthermore, as mentioned above, by including the location of the mother’s zip code in X, we hope to control for unobserved heterogeneity at the zip code 4

Equals zero if no prenatal care. This makes moving from 0 to 1 very different from moving from 1 to 2.

Of course, a nonparametric estimator should be able to handle such nonlinearity. 5 Admittedly, some of these variables strain the definition of ’continuous’. For example, month of first prenatal visit is of course not continuous, but it is more convenient to treat it as such instead of further partitioning the sample according to the ten possible values this variable can take on. We also experimented with adding uniform[−.5, .5] random numbers to the observations, but this does not seem to make much difference.

12

level (this is new in the literature). In sum, there appears to be no identification strategy in the literature this is inherently superior to ours; there are advantages and disadvantages of each approach.

3.2

Implementation and results

We estimate the CATE parameter separately in the four subpopulations defined above; results for all first time black mothers are obtained by averaging these estimates as described in comment 4 after Theorem 1. We will focus the estimation results for X1 =mother’s age. (We also report results for X1 =zip code level per capita income, but do not discuss them in this draft.) To avoid scaling issues arising form using the same bandwidth for all continuous covariates, we standardize them before estimation (the full sample mean and standard deviation is used). For mother’s age, the full sample mean is 21.8 years, the standard deviation is 5.32 years. The youngest age at which CATE is estimated is mean-1.1std≈16 years and the oldest is mean+2.5std≈35 years. To generate higher order kernels, we use the method in Imbens and Ridder (2009), using the Epanechnikov kernel as the basis. As with all nonparametric estimation procedures, bandwidth choice is critical. Theory suggests that we should set h = Cn1/(2k+δ) and h1 = C1 n1/(2+k1 )(k+δ) , where k1 = 1 and k = 7. In order to be able to calculate standard errors for the overall CATE estimate (obtained by averaging the CATE estimates in the four subpopulations), we use the same bandwidth in all subpopulations regardless of their size, i.e. here n is the full sample size. Unfortunately, the first order asymptotic theory does not have implications about how to pick the constants C, C1 , δ and δ1 . After some experimenting we found that both h and h1 have to be large (i.e., we need to oversmooth) to obtain ’sensible’ results that are roughly in line with the magnitudes of the previous estimates in the literature. In particular, we set C = 6, C1 = 7, δ = 0.25 and δ1 = 0.75, which yields h = 2.59 and h1 = 4.04.6 Obviously, 6

Using the estimate of the propensity score, one can estimate the (unconditional) ATE of smoking in the

four subpopulations and in the overall population using the Hirano, Imbens and Ridder (2003) nonparametric estimator (HIR). If h is too small, then the ATE estimates tend to be very large in absolute value—several hundred grams instead of the more usual 100-200 gram range. Regarding h1 , we found that a low value

13

we will further investigate the sensitivity of our estimator to bandwidth choice in a more principled way. Standard error calculations entail bandwidth choices as well, but these do not seem to be critical, and one does not need to oversmooth. Results are presented in Figures 1 through 5 in Appendix D. The solid line represents the estimates and the dashed lines denote the upper bound and lower bound of the pointwise 95% confidence bands. Recall that the horizontal axis measures age in standard deviations from the mean. The main story that emerges from these pictures is that smoking is particularly impactful for the babies of young first time mothers. Considering the overall population of interest, the estimated average smoking effect on birth weight is −320 grams for a 16-year-old mother (≈ −1 std. dev.) and −130 grams for a 32-year-old (≈ +2 std. dev.) If one examines the subpopulations separately, there are some finer points, but these are less intuitive. For example, for older single mothers the negative effect of smoking disappears and turns into positive. (The standard errors at the upper end of the age scale are surprisingly tight given that there are not that many first time black moms around age 35.) For married mothers, on the other hand, the effect of smoking remains significantly negative even at higher ages. A possible story here is that if the mother smokes, then the father is more likely to smoke too (we don’t control for this), so the baby could be exposed to more smoke than in a single household, though some of it only passively. Also, for single mothers the estimated effect does not strongly depend on the baby’s gender. In contrast, for married mothers it apparently does—smoking affects baby boys quite a bit more negatively. It is hard to rationalize this result. In comparison, Almond, Chay and Lee (2005) find that the average impact of smoking on birth weight is around −203.2 grams based on Linked National-Natality-Mortality Details Files from Pennsylvania between 1989 and 1991. de Veiga and Wilder (2008) report that the smoking impact on birth weight is −227.4 grams for Caucasians and is −186.9 grams of h1 causes the CATE estimates to have too wide a range over the support of X1 . If h1 is small, then at older ages smoking can easily be estimated to have a strong positive effect, and at young ages a very strong negative effect. Here ’very strong’ can mean as large as −800 to −1000 grams, which seems extreme in light of other estimates in the literature.

14

for African-Americans based on the live births in the United States in 1995.7 Based on data from Georgia, Walker, Tekin and Wallance (2009) the estimated average effect for nonblack teens is −164 grams, for nonblack nonteens is −211 grams, for black teens is −106 grams, and for black nonteens is −176grams. That is, there is a significant difference between the smoking effect on birth weight for teens and nonteens and the absolute value of the effect is larger for non-teens. Hence, our estimation results contradict the age effect found by Walker, Tekin and Wallance (2009). In Appendix D we also report estimation results for the case when X1 =zip code level per capita income, but we do not discuss them in this draft. See Figures 6 through 10 for the results. The sample mean of per capita income is about 18K per year; the standard deviation is about 5K.

4

Theoretical extensions

In this section we discuss several possible extensions of our theory. First, under the unconfoundedness assumption, we define the conditional average treatment effect of the treated (CATT) and extend our theory to this case. Second, in the cases where the unconfoundedness assumption is violated, but there exists a valid binary instrument variable, we define the conditional local average treatment effects (CLATE) and the conditional local average treatment effects treated (CLATT), and extend our theory to these parameters. Finally, in the cases where the treatment is multi-valued and under the unconfoundedness assumption, we extend our theory to the conditional marginal average treatment effect (CMATE).

4.1

Conditional Average Treatment Effect of the Treated

In empirical studies, sometimes the average treatment effect of the treated (ATT), which evaluates the average treatment effect for those who are actually treated, is a more policy 7

The birth certificates of California, Indiana, New York State (except for New York City) and South

Dakota do not have information on maternal smoking during pregnancy, so de Veiga and Wilder (2008) exclude these data from consideration.

15

revelent parameter to report. As a result, we extend the ATT to CATT. The CATT(x1 ) is defined as τt (x1 )E[Y (1) − Y (0)|D = 1, X1 = x1 ], where X1 ∈ R` be a subvector of X ∈ Rk and conditioning on X1 , the unconfoundedness assumption does not hold.8 It can be shown that τ (x1 ) can be identified as h i.   p(X)(1 − D)Y τt (x1 ) = E DY − X1 = x1 E p(X) X1 = x1 , 1 − p(X) and τt (x1 ) can be estimated by Pn  1 i=1 Di Yi − nh` τˆt (x1 ) = 1 Pn 1 nh`1

i=1

pˆ(Xi )(1−Di )Yi 1−ˆ p(Xi )

pˆ(Xi )K1



K1  −x1

X1i h1

X1i −x1 h1

(5)

 .

(6)

The following theorem which is similar to Theorem 1 summarizes the asymptotics of τˆ(x1 ). Theorem 2 Suppose that Assumptions 1 through 7 are satisfied. Then, for each point x1 in the support of X1 ,   q kK1 k22 σψ2 t (x1 ) d ` nh1 (ˆ τt (x1 ) − τt (x1 )) → N 0, , fx1 (x1 ) where σψ2 t (x1 ) = E[ψt2 (X, Y, D)|X1 = x1 ] with 1 ψt (x, y, d) = p x1

  p(x)(1 − d)(y − m0 (x)) + d(m1 (x) − m0 (x) − τt (x1 )) , d(y − m1 (x)) − 1 − p(x)

px1 = E[D = 1|X1 = x1 ]. Comments 1. The influence function ψt (x, y, d) is similar to the efficent estimator for ATT in the literature, e.g., Hahn (1998) and Hirano, Imbens and Ridder (2003). 8

If the unconfoundedness assumption holds conditioning on X1 , it can be shown that CAT T (x1 ) =

CAT E(x1 ) and as we discuss in Section 2, it is not necessary to use our method to estimate CAT E(x1 ).

16

2. The proof is done by show the following:     Pn  pˆ(Xi )(1−Di )Yi X1i −x1 1 q D Y − K i i 1 ` i=1 1−ˆ p(Xi ) h1 nh  − τt (x1 )px1  nh`1  1 Pn X1i −x1 1 K 1 ` i=1 h1 nh1  n X 1 1 p(Xi )(1 − Di )(Yi − m0 (Xi )) Di (Yi − m1 (Xi )) − =p 1 − p(Xi ) nh`1 fx1 (x1 ) i=1 ! X − x  1i 1 + op (1), + Di (m1 (Xi ) − m0 (Xi )) − τt (x1 )px1 K1 h1  Pn ! X1i −x1 1 q D K i 1 ` i=1 h1 nh1  − p x1 nh`1 Pn X1i −x1 1 K 1 ` i=1 h1 nh1 n X − x  X 1 1 1i 1 =p + op (1). (Di − px1 )K1 ` fx (x1 ) h 1 nh1 1 i=1 The second result in the previous equation suggests that we can replace the denominator of  P τˆt (x1 ) in (6) with ni=1 Di K1 (X1i − x1 )/h1 /(nh`1 ) without changing the first order asymptotics of τˆt (x1 ).

4.2

Endogenous Treatment Assignment

In this section, we extend our theory to the cases where the unconfoundeness assumption does not hold conditional on X, but we assume that there is a valid binary instrument variable (Z) available. The following IV framework, augmented by covariates, is now standard in the treatment effect literature, e.g., Abadie (2003), Frolich (2007), Hong and Nekipelov (2010) and Donald, Hsu and Lieli (2011). In addition to Y , D and X, we observe the value of a binary instrument Z ∈ {0, 1} for each individual in the sample. For Z = z, the random variable D(z) ∈ {0, 1} specifies individuals’ potential treatment status with D(z) = 1 corresponding to treatment and D(z) = 0 to no treatment. The actually observed treatment status is then given by D ≡ D(Z) = D(1)Z + D(0)(1 − Z). The following assumptions, taken from Donald, Hsu and Lieli (2011) with some modifications, describe the relationships between the variables defined above and justify Z being referred to as an instrument:  Assumption 8 (i) (Unconfoundedness Assumption): Y (0), Y (1), D(1), D(0) ⊥ Z X. (ii) (First stage): P [D(1) = 1 | X] > P [D(0) = 1 | X] and 0 < P (Z = 1 | X) < 1. 17

(iii) (Monotonicity): P [D(1) ≥ D(0)] = 1. Assumption 8(i) is the unconfoundedness assumption in the IV framework which requires that conditional on X, Z is independent of the potential outcomes and the potential treatment status. Part (ii) postulates that the instrument is (positively) related to the probability of being treated and implies that the distributions X|Z = 0 and X|Z = 1 have common support. Finally, the monotonicity of D(z) in z, required in part (iii), implies that there are no defiers [D(0) = 1, D(0) = 0] in the population. Parts (ii) and (iii) together imply that conditional on X, the group of compliers [D(0) = 0, D(1) = 1] is not of measure zero. This guarantees that the groups of compliers is not of measure zero conditional on X1 = x1 for all x1 . We define the conditional local average treatment effect (CLATE) and the conditional local average treatment effect of the treated (CLATT) parameters conditional on X1 = x1 as CLAT E(x1 ) ≡= E[Y (1) − Y (0) | D(1) = 1, D(0) = 0, X1 = x1 ] CLAT T (x1 ) ≡= E[Y (1) − Y (0) | D(1) = 1, D(0) = 0, D = 1, X1 = x1 ]. Following Donald, Hsu and Lieli (2011), we can show that CLAT E(x1 ) and CLAT T (x1 ) are identified by 

.   ZY (1 − Z)Y ZD (1 − Z)D γ(x1 ) = E − X 1 = x1 E − X1 = x 1 , q(X) 1 − q(X) q(X) 1 − q(X) .    q(X)(1 − Z)Y q(X)(1 − Z)D γt (x1 ) = E ZY − X 1 = x1 E ZD − X1 = x 1 , 1 − q(X) 1 − q(X) where q(x) = P [Z = 1|X = x]. That is, the CLATE (CLATT) is identified by the CATE (CATT) of Z on Y over the CATE (CATT) of Z on D. Therefore, we can use the CATE and CATT estimators developed in previous sections to estimate the numerators and denominators of CLATE and CLATT. Under similar regularity conditions and by the delta method, one can obtain the asymptotic properties of the CLATE and CLATT estimators and we omit the formal statements.

18

4.3

Multi-valued Treatment

In this section, we consider multi-valued treatment assignment. For example, Walker, Tekin and Wallance (2009) and Cattaneo (2010) further divide the smoking indicator into several groups depending on the intensity of the smoking. In Walker, Tekin and Wallance (2009), they consider four groups: non-smokers, smokers with tobacco use between 1 and 10 cigarettes per day, smokers with tobacco use between 11 and 20 cigarettes per day and smokers with tobacco use more than 20 cigarettes per day. On the other hand, Cattaneo (2010) divides the smokers into 5 groups depending on the daily tobacco use: 1-5, 6-10, 11-15, 16-20, and 20+. By doing this, one can study the effect of maternal smoking intensity on birth weight. We introduce the model and the notation following Cattaneo (2010). A finite collection of multiple treatment status (categorical or ordinal) indexed by t ∈ {0, 1, . . . , J} ≡ T with J ∈ N fixed.9 For all t ∈ T , let Y (t) be the potential outcome under treatment level t. Define random variable T ∈ T indicates which of the J + 1 potential outcomes is observed and Dt = 1(T = t) for all t ∈ T where 1(·) is the indicator function. The observed P outcome Y is defined as t∈T Dt Y (t). The parameters considered in Cattaneo (2010) are the marginal average treatment effects (MATE): E[Y (t)] for all t ∈ T . In this paper, we extend to conditional marginal average treatment effects (CMATE) conditional on X1 = x1 : E[Y (t)|X1 = x1 ] for all t ∈ T . The following assumption is the main assumption we need to identify the MATE or CMATE and define the generalized propensity score as pt (x) = P (Dt = 1|X = x) for t ∈ T . Assumption 9 (i) (Unconfoundedness Assumption): For all t ∈ T , Y (t) ⊥ Dt |X. (ii) (Generalized Propensity Score): For all t ∈ T , pt (x) ≥ δ > 0 on X . Assumption 9 is similar to the binary treatment case, so we omit the discussion. Under Assumption 9, E[Y (t)|X1 = x1 ] for all t ∈ T is identified as   Dt Y E[Y (t)|X1 = x1 ] = E X 1 = x1 , pt (X) 9

If J = 1, the whole set-up reduces to the binary treatment case.

19

and E[Y (t)|X1 = x1 ] ≡ λt (x1 ) can be estimated by  Pn  Dti Yi  X1i −x1 1 i=1 pˆt (Xi ) K1 h1 nh`1 ˆ t (x1 ) =  λ , Pn X1i −x1 1 K 1 ` i=1 h nh 1 1

where pˆt (x) is a Nadaraya-Watson estimator for pt (x) as in (2). Under similar regularity conditions as in Theorem 1, one can obtain the asymptotic properties of the CMATE estimators and we omit the formal statements.

5

Conclusions

In this paper we introduce the idea of a conditional average treatment effect. This functional parameter is designed to capture the variation in the average treatment effect conditional on some covariate(s) used in identifying the unconditional average. We propose an inverse probability weighted nonparametric estimator of this function and provide pointwise first order asymptotic theory. We also discuss a number of straightforward extensions. The application consists of estimating the effect of a first time mother’s smoking on the birth weight of their baby conditional on various characteristics of the mother. In this draft we focus on African-American mothers and present (preliminary) estimation results as a function of age. The main qualitative finding is that smoking has a larger impact (in absolute value) for young mothers. Refining and extending the estimation exercise is ongoing work.

20

Appendix A. Proof of Theorem 1 and Theorem 2 It is suffices to show Theorem 1 because the proof for Theorem 2 can be obtained by analogous arguments. Define ωij , which is depends only on X1 , ..., Xn , as   Xi −Xj 1 K k h nh  . ωij = Xi −Xt 1 P K t:t6=i h nhk

(7)

Define τ = τ (x1 ), w = (y, d, x) and Ψ(w, τ, q) ≡

dy (1 − d)y − − τ. p 1−p

Let Ψp and Ψpp denote the partial derivative of Ψ w.r.t. the argument p, and let Wi = (Yi , Di , Xi ). Then (1 − Di )Yi Di Yi − − τ, p(Xi ) 1 − p(Xi )   D i Yi (1 − Di )Yi Ψp (Wi , τ, p(Xi )) = − + , p2 (Xi ) (1 − p(Xi ))2 2Di Yi 2(1 − Di )Yi Ψpp (Wi , τ, p(Xi )) = 3 − , p (Xi ) (1 − p(Xi ))3

Ψ(Wi , τ, p(Xi )) =

and we further define  Sp (Xi ) = E[Ψp (Wi , τ, p(Xi ))|Xi ] = −

m1 (Xi ) m0 (Xi ) + p(Xi ) 1 − p(Xi )

 ,

ζi = Ψp (Wi , τ, p(Xi )) − Sp (Xi ), i = Di − p(Xi ), βn (Xi ) = E[ˆ p(Xi )|X1 , ..., Xn ] − p(Xi ) =

X

ωij p(Xj ) − p(Xi ),

j:i6=j

where the last quantity is the bias of the estimator conditional on X1 , . . . , Xn . Note that p nh1 (ˆ τ − τ) =

√1 nh1

X1i −x1 Ψ(Wi , τ, pˆ(Xi )) i=1 K1 h1 ,  X1i −x1 1 Pn i=1 K1 nh1 h1

Pn



and n 1 X  X1i − x1  p → fx1 (x1 ). K1 nh1 h1 i=1

21

Hence, it is sufficient to show that √

n n 1 X  X1i − x1  1 X  X1i − x1  d K1 Ψ(Wi , τ, pˆ(Xi )) ∼ √ K1 ψ(Wi ). h h nh1 i=1 nh1 i=1

By a Taylor series expansion around p(Xi ), n 1 X  X1i − x1  K1 Ψ(Wi , τ, pˆ(Xi )) h nh1 i=1 n 1 X  X1i − x1  Ψ(Wi , τ, p(Xi )) =√ K1 h1 nh1



i=1

n 1 X  X1i − x1  K1 Ψp (Wi , τ, p(Xi ))(ˆ p(Xi ) − p(Xi )) h1 nh1 i=1 n 1 X  X1i − x1  Ψpp (Wi , τ, p∗ (Xi ))(ˆ p(Xi ) − p(Xi ))2 +√ K1 h nh1 i=1 1 n 1 X  X1i − x1  =√ Ψ(Wi , τ, p(Xi )) K1 h1 nh1 i=1 n 1 X  X1i − x1  +√ K1 Sp (Xi )(ˆ p(Xi ) − p(Xi )) h1 nh1

+√

i=1

+√ +√

1 nh1 1 nh1

n X

K1

X − x  1i 1 ζi (ˆ p(Xi ) − p(Xi )) h1

K1

X − x  1i 1 Ψpp (Wi , τ, p∗ (Xi ))(ˆ p(Xi ) − p(Xi ))2 h1

i=1 n X i=1

≡J0 + J1 + J2 + J3 , where p∗ (Xi ) is a value between pˆ(Xi ) and p(Xi ) for all i, and the J’s are defined line by line. We

22

further expand the J1 term as n 1 X  X1i − x1  Sp (Xi )(ˆ p(Xi ) − p(Xi )) J1 = √ K1 h1 nh1 i=1

=√

=√

1 nh1 1 nh1

n X i=1 n X i=1

K1

X  X − x  1i 1 Sp (Xi ) ωij Dj − p(Xi ) h j:j6=i

X − x  X  1i 1 K1 Sp (Xi ) ωij (j + p(Xj )) − p(Xi ) h1 j:j6=i

n X

n X  X − x  1 1 X  X1i − x1  1i 1 =√ Sp (Xi )i + √ Sp (Xi ) ωij j − i K1 K1 h1 h1 nh1 i=1 nh1 i=1 j:j6=i

+√

n X  1 X  X1i − x1  K1 Sp (Xi ) ωij p(Xj ) − p(Xi ) h1 nh1 i=1 j:j6=i

n 1 X  X1i − x1  K1 Sp (Xi )i h1 nh1 i=1 n X − x  X − x   1 X X 1j 1 1i 1 +√ i ωji K1 Sp (Xj ) − K1 Sp (Xi ) h1 h1 nh1

=√

+√

1 nh1

i=1 n X i=1

j:j6=i

K1

X − x  1i 1 Sp (Xi )βn (Xi ) h1

≡ J11 + J12 + J13 . Note that J0 and J11 together gives the expression we need in Theorem 1. Hence, to show Theorem 1, it is sufficient to show that J12 , J13 , J2 , and J3 are all op (1). STEP 1: We first claim that Ch  Xi − Xj  K |ωij − ωji | ≤ . h nhk

(8)

By Assumption 6, ωij = ωji = 0 for kXj − Xi k∞ > h. Now assume kXj − Xi k∞ ≤ h. For all i, define 1 X  Xi − Xt  fˆ(Xi ) = K . h nhk t:t6=i

Then   Xi − Xj ˆ−1 1 −1 ˆ |ωij − ωji | = K · f (Xi ) − f (Xj ) h nhk   Xi − Xj n ˆ−1 1 −1 ≤ k K · f (X ) − f (X ) i i h nh o + fˆ−1 (Xj ) − f −1 (Xj ) + f −1 (Xi ) − f −1 (Xj ) .

23

(9)

Using arguments similar to those in, e.g., Corollary 1 of Masry (1996), one can show that ! r log n = Op (h). sup kfˆ−1 (Xi ) − f −1 (Xi )k∞ = Op h + nhk i By continuity and X compact, it follows that sup kfˆ−1 (Xi ) − f −1 (Xi )k∞ = Op (h). i

Since f is continuously differentiable and is bounded away from zero, |f −1 (x1 ) − f −1 (x2 )| ≤ Ckx1 − x2 k∞ for all x1 , x2 ∈ X . Hence, |f −1 (Xi ) − f −1 (Xj )| = O(h) given kXj − Xi k∞ ≤ h. Combining these observations yields   Xi − Xj Ch |ωij − ωji | = K , h nhk where C > 0. This bound is independent of Xi and Xj . This supports our claim. Also, under regularity conditions, we have ! r log n s sup |ˆ p(x) − p(x)| = Op h + = op (n−1/4 ). k nh x∈X

(10)

STEP 2 (J12 ): First note that   X X1j − x1 1 √ sup (ωji − ωij )K1 Sp (Xj ) h1 h1 i j:j6=i  X X1j − x1  Sp (Xj ) ≤ sup |ωji − ωij | K1 h1 i j:j6=i  X 1 Xj − xi  Ch ≤ √ sup K = op (1) · Op (1) = op (1). h nhk h1 i

(11)

j:j6=i

The last inequality follows from the fact that K1 (·) is bounded, supi

Xj −xi  1 k j:j6=i nh K h

P

= op (1),

and h/h1 = op (1) by assumption. Second, note that   X  X − x  X1j − x1  1 1i 1 √ sup ωij K1 − K1 Sp (Xj ) h1 h1 h1 i j:j6=i X C X1j − X1i ≤√ |ωij ||Sp (Xj )| sup h1 h1 i j:j6=i ≤

Ch 3/2

h1

· Op (1) = op (1).

(12)

24

The first inequality follows from that K1 is Lipschitz continuous of order one. The last inequality P −3/2 follows from that supi j:j6=i |ωij ||Sp (Xj )| = Op (1) and hh1 = op (1) by assumption. Last, we have X X − x  X − x  1 1i 1 1i 1 √ sup ωij K1 Sp (Xj ) − K1 Sp (Xi ) h h h1 i j:j6=i 1 1   X1i − x1 1 X ωij Sp (Xj ) − Sp (Xi ) = √ K1 sup h1 h1 i j:j6=i

hs+1

≤√

h1

= op (1).

(13)

P −1/2 = The last inequality follows from that supi j:j6=i ωij Sp (Xj ) − Sp (Xi ) = Op (hs ) and hs+1 h1 op (1). By (11), (12) and (13), we have 1 X X − x  X − x  1j 1 1i 1 sup √ ωji K1 Sp (Xj ) − K1 Sp (Xi ) = op (1). h1 h1 h1 j:j6=i i Further, the i ’s are mutually independent conditional on the sample path of the Xi ’s, then we can show J12

n X − x   X − x  1 X  1 X 1j 1 1i 1 Sp (Xj ) − K1 Sp (Xi ) = op (1) ωji K1 =√ i √ h1 h1 n h1 i=1

j:j6=i

conditional on the sample path of the Xi ’s with probability approaching one. STEP 3 (J13 ): Note that βn (x) is the (conditional) bias of pˆ(x), which is Op (h

s+1

) uniformly. By

assumption, we have supx∈X βn (x) = op (n−1/2 ). It follows that n 1 X X − x  1i 1 |J13 | = √ K1 Sp (Xi )βn (Xi ) h1 nh1 i=1 n p 1 X  X1i − x1  ≤ sup | nh1 βn (x)| √ K1 |Sp (Xi )| = op (1)Op (1) = op (1). h1 nh1 x∈X i=1

−1/2

STEP 4 (J2 ): Note that h1

supx∈X |ˆ p(x) − p(x)| = op (1). By the same argument for J12 , we

have J2 = op (1). STEP 5 (J3 ): Note that p∗ (Xi ) is between pˆ(Xi ) and p(Xi ), so it is uniformly bounded away from √ zero and one. Also, we have nh1 supi (ˆ p(x) − p(x))2 = op (1), so by the same argument for J13 , we have J3 = op (1). These results together show Theorem 1.



25

B. Handling discrete covariates Suppose there are two discrete subpopulations ‘males’ and ‘females’. Let Mi denote the gender indicator such that Mi = 1 if individual i is male and 0 otherwise. The CATE estimator for the male group is defined as p  d nm h1 τˆm (x1 ) − τm (x1 ) ∼

n X 1 √ ψmi Mi K1i , fm (x1 ) nm h1 i=1

where K1i = K((X1i − x1 )/h1 ). Similarly, we have p  d nf h1 τˆf (x1 ) − τf (x1 ) ∼

n X 1 p ψf i (1 − Mi )K1i . ff (x1 ) nf h1 i=1

The parameter of our interest is τ (x1 ) = qm (x1 )τm (x1 ) + qf (x1 )τf (x1 ), where qm (x1 ) = P (Mi = 1|X1 = x1 ) and qf (x1 ) = 1 − qm (x1 ). Note that qm (x1 ) is estimated by 1 P i=1 Mi K1i nh1 , qˆm (x1 ) = 1 P i=1 K1i nh1 such that p  d nh1 qˆm (x1 ) − qm (x1 ) ∼

n X  1 √ Mi − qm (x1 ) K1i . f (x1 ) nh1 i=1

Similarly, p  d nh1 qˆf (x1 ) − qf (x1 ) ∼

n X  1 √ 1 − Mi − qf (x1 ) K1i . f (x1 ) nh1 i=1

As a result, an estimator for τ (x1 ) can be defined as qˆm (x1 )ˆ τm (x1 ) + qˆf (x1 )ˆ τf (x1 ). Note that n X p  d n 1 √ nh1 τˆm (x1 ) − τm (x1 ) ∼ ψmi Mi K1i nm fm (x1 ) nh1 i=1

n X 1 d √ ψmi Mi K1i , ∼ qm fm (x1 ) nh1 i=1

where qm = P (Mi = 1). Also, note that we have P (X1 = x1 , M = 1) = f (x1 )qm (x1 ) = qm fm (x1 ). Therefore, p  d nh1 τˆm (x1 ) − τm (x1 ) ∼

n X 1 √ ψmi Mi K1i qm fm (x1 ) nh1 i=1

n X 1 √ ψmi Mi K1i . ∼ f (x1 )qm (x1 ) nh1 i=1 d

26

Similarly, p  d nh1 τˆf (x1 ) − τf (x1 ) ∼

n X 1 √ ψf i (1 − Mi )K1i . f (x1 )qf (x1 ) nh1 i=1

Finally, we have p p  nh1 (ˆ τ (x1 ) − τ (x1 )) = nh1 qˆm (x1 )ˆ τm (x1 ) + qˆf (x1 )ˆ τf (x1 ) − qm (x1 )τm (x1 ) − qf (x1 )τf (x1 ) p  p  = nh1 qm (x1 ) τˆm (x1 ) − τm (x1 ) + nh1 qf (x1 ) τˆf (x1 ) − τf (x1 ) p p   + nh1 qˆm (x1 ) − qm (x1 ) τˆm (x1 ) + nh1 qˆf (x1 ) − qf (x1 ) τˆf (x1 ) d



n n X X 1 1 √ √ ψmi Mi K1i + ∼ ψf i (1 − Mi )K1i f (x1 ) nh1 i=1 f (x1 ) nh1 i=1

n n X X   1 1 √ √ Mi − qm (x1 ) τm (x1 )K1i + 1 − Mi − qf (x1 ) τf (x1 )K1i f (x1 ) nh1 i=1 f (x1 ) nh1 i=1 n  X   1 d √ ∼ ψmi + ψf i + Mi − qm (x1 ) τm (x1 ) + 1 − Mi − qf (x1 ) τf (x1 ) K1i . f (x1 ) nh1 i=1

+

C. Proof of Theorem 2 The proof is similar to that of Theorem 1, and we omit it.

27

D. Figures

300

CATE in subpop 1 (single mom, baby girl)

200 100 0 −100 −200 −300 −400 −1.5

−1

−0.5

0

0.5

1

1.5

2

Figure 1: CATE as a function of age: single, baby girl

28

2.5

300

CATE in subpop 2 (single mom, baby boy)

200 100 0 −100 −200 −300 −400 −1.5

−1

−0.5

0

0.5

1

1.5

2

Figure 2: CATE as a function of age: single, baby boy

29

2.5

−50

CATE in subpop 3 (married, baby girl)

−100 −150 −200 −250 −300 −350 −400 −1.5

−1

−0.5

0

0.5

1

1.5

2

Figure 3: CATE as a function of age: married, baby girl

30

2.5

CATE in subpop 4 (married, baby boy) −150 −200 −250 −300 −350 −400 −450 −500 −550 −1.5

−1

−0.5

0

0.5

1

1.5

2

Figure 4: CATE as a function of age: married, baby boy

31

2.5

Estimated CATE in overall population (black moms, first child) 50 0 −50 −100 −150 −200 −250 −300 −350 −1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

Figure 5: CATE as a function of age (averaged over the four subopulations)

32

CATE in subpop 1 (single, baby girl)

−160 −180 −200 −220 −240 −260 −280 −300 −1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

3

Figure 6: CATE as a function of per capita income: single, baby girl

33

CATE in subpop 2 (single, baby boy)

−160

−180

−200

−220

−240

−260

−280 −1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

3

Figure 7: CATE as a function of per capita income: single, baby boy

34

CATE in subpop 3 (married, baby girl)

−50 −100 −150 −200 −250 −300 −350 −400 −450 −1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

3

Figure 8: CATE as a function of per capita income: married, baby girl

35

CATE in subpop 4 (married, baby boy)

−150 −200 −250 −300 −350 −400 −450 −500 −550 −1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

3

Figure 9: CATE as a function of per capita income: married, baby boy

36

CATE in overall population (black moms with first child) −160 −180 −200 −220 −240 −260 −280 −300 −320 −340 −360 −1.5

−1

−0.5

0

0.5

1

1.5

2

2.5

3

Figure 10: CATE as a function of per capita income (averaged over the four subopulations)

37

References Abadie, A. (2003). “Semiparametric Instrument Variable Estimation of Treatment Response Models,” Journal of Econometrics, 113, 231–263. Abrevaya, J. (2006). “Estimating the Effect of Smoking on Birth Outcomes Using a Matched Panel Data Approach”. Journal of Applied Econometrics, 21, 489–519. Abrevaya, J., and C. Dahl (2008). “The effects of birth inputs on birthweight: evidence from quantile estimation on panel data,” Journal of Business and Economic Statistics, 26, 379-397. Almond, D., K. Y. Chay, and D. S. Lee (2005). “The Costs of Low Birth Weight,” Quarterly Journal of Economics, 120, 1031–1083. Cattaneo, M. D. (2010). “Efficient Semiparametric Estimation of Multi-Valued Treatment Effects under Ignorability,” Journal of Econometrics, 155, 138–154. da Veiga, P. V., and R. P. Wilder (2008). “Maternal Smoking During Pregnancy and Birthweight: A Propensity Score Matching Approach,” Maternal and Child Health Journal, 12, 194–203. Donald, S. G., Y.-C. Hsu and R. P. Lieli (2010). “Efficient Nonparametric IV Estimation of Local Average Treatment Effects Using the Estimated Propensity Score,” Working Paper. Fr¨olich, M. (2007). “Nonparametric IV Estimation of Local Average Treatment Effects with Covariates,” Journal of Econometrics, 139, 35–75. Hahn, J. (1988). “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects,” Econometrica, 66, 315–331. Heckman, J. and J. Hotz (1989): “Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training,” Journal of the American Statistical Association, 84, 862–874.

38

Heckman, J., H. Ichimura and P. Todd (1997). “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program,” Review of Economic Studies, 64, 605–654. Heckman, J., H. Ichimura and P. Todd (1998). “Matching as an Econometric Evaluations Estimator,” Review of Economic Studies, 65, 261–294. Hirano, K., G. W. Imbens and G. Ridder (2003). “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71, 1161–1189. Hong, H., and D. Nekipelov (2010). “Semiparametric Efficiency in Nonlinear LATE Models,” Quantitative Economics, 1, 279–304. Ichimura, H., and O. Linton (2005). “Asymptotic Expansions for Some Semiparametric Program Evaluation Estimators,” in Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg ed. by Andrews, D.W.K. and Stock, J., Cambridge University Press. Imbens, G., and G. Ridder (2009). “Estimation and Inference for Generalized Full and Partial Means and Derivatives,” Working Paper. Imbens, G. W., and J. W. Wooldridge (2009). “Recent Developments in the Econometrics of Program Evaluation,” Journal of Economic Literature, 47, 5–86. Khan, S., and E. Tamer (2010). “Irregular Identification, Support Conditions, and Inverse Weight Estimation,” Econometrica, 6, 2021–2042. Lee, S., and J.-Y. Whang (2009). “Nonparametric Tests of Conditional Treatment Effects,” Cowles Foundation Discussion Papers 1740, Cowles Foundation, Yale University. Masry, E. (1996). “Multivariate Local Polynomial Regression for Time Series: Uniform Strong Consistency and Rates,” Journal of Time Series Analysis, 17, 571-599. Rosenbaum, P. and D. Rubin (1983): “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, 70, 41–55. Rosenbaum, P. and D. Rubin (1985): “Reducing Bias in Observational Studies Using Sub-

39

classfication on the Propensity Score,” Journal of American Statistical Association, 79, 516–524. Rosenbaum, P. (1987): “Model-Based Direct Adjustment,” Journal of American Statistical Association, 82, 387–394. Walker, M. B., E. Tekin, and S. Wallace (2009). “Teen Smoking and Birth Outcomes,” Southern Economic Journal, 75, 892–907.

40