Biometrics 64, 1–??

DOI: 10.1111/j.1541-0420.2005.00454.x

December 2008

Web-based Supplementary Materials for Variable Selection in the Cox Model with Covariates Missing at Random by Garcia, Ibrahim, and Zhu

Ramon I. Garcia*, Joseph G. Ibrahim**, and Hongtu Zhu***
Department of Biostatistics, University of North Carolina at Chapel Hill, USA
*email: [email protected] **email: [email protected] ***email: [email protected]


Web Appendix A

Second order Taylor's series approximation of PQ1(β|θ^(s))

In order to maximize PQ1,τ(β|θ^(s)), following Fan and Li (2001), a second order Taylor's series approximation of PQ1(β|θ^(s)) was used. The approximation, centered at the value β^(s), is

\[
PQ_1(\beta|\theta^{(s)}) \approx -\frac{1}{2}\big(\hat{y}^{(s)} - \beta\big)^T \Sigma^{(s)} \big(\hat{y}^{(s)} - \beta\big), \tag{1}
\]

where terms independent of β have been excluded and

\[
\omega_i^{(s)} = \frac{1}{L}\sum_{l=1}^{L}\sum_{u \in R(y_i)} \exp\big\{(z_u^{(s,l)T}, x_u^T)\beta^{(s)}\big\},
\]

\[
\Gamma_i^{(s)} = \frac{1}{\omega_i^{(s)}}\, \frac{1}{L}\sum_{l=1}^{L}\sum_{u \in R(y_i)} \exp\big\{(z_u^{(s,l)T}, x_u^T)\beta^{(s)}\big\}\,(z_u^{(s,l)T}, x_u^T)^T,
\]

\[
\Sigma^{(s)} = \sum_{i=1}^{n}\delta_i\, \frac{1}{\omega_i^{(s)}}\, \frac{1}{L}\sum_{l=1}^{L}\sum_{u \in R(y_i)} \exp\big\{(z_u^{(s,l)T}, x_u^T)\beta^{(s)}\big\}\,(z_u^{(s,l)T}, x_u^T)^T(z_u^{(s,l)T}, x_u^T) - \sum_{i=1}^{n}\delta_i\, \Gamma_i^{(s)}\Gamma_i^{(s)T},
\]

and

\[
\hat{y}^{(s)} = \beta^{(s)} + \big(\Sigma^{(s)}\big)^{-1}\Big\{\frac{1}{L}\sum_{l=1}^{L}\sum_{i=1}^{n}\delta_i\, (z_i^{(s,l)T}, x_i^T)^T - \sum_{i=1}^{n}\delta_i\, \Gamma_i^{(s)}\Big\}.
\]
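For illustration, the quantities ω_i^(s), Γ_i^(s), Σ^(s), and ŷ^(s) defined above can be assembled from a set of MCMC draws with a few array operations. The sketch below is our own illustration under hypothetical dimensions and simulated data, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: n subjects, L MCMC draws, q missing and r observed covariates.
n, L, q, r = 30, 50, 2, 2
y = rng.exponential(1.0, n)                    # observed times y_i
delta = rng.integers(0, 2, n)                  # event indicators delta_i
x = rng.normal(size=(n, r))                    # fully observed covariates x_i
z = rng.normal(size=(L, n, q))                 # MCMC draws z_i^{(s,l)} of the missing covariates
beta = np.zeros(q + r)                         # current iterate beta^{(s)}

# Stacked covariate vectors (z_u^{(s,l)T}, x_u^T), shape (L, n, q + r)
w = np.concatenate([z, np.tile(x, (L, 1, 1))], axis=2)
e = np.exp(w @ beta)                           # exp{(z_u^{(s,l)T}, x_u^T) beta^{(s)}}, shape (L, n)

p = q + r
omega = np.zeros(n)
Gamma = np.zeros((n, p))
Sigma = np.zeros((p, p))
score = np.zeros(p)
for i in range(n):
    R = y >= y[i]                              # risk set R(y_i)
    omega[i] = e[:, R].sum() / L
    # Gamma_i: MC average over the risk set of exp{...} (z, x)^T, normalized by omega_i
    Gamma[i] = np.einsum('lu,lup->p', e[:, R], w[:, R, :]) / (L * omega[i])
    # Second-moment term of Sigma for subject i
    M2 = np.einsum('lu,lup,luk->pk', e[:, R], w[:, R, :], w[:, R, :]) / (L * omega[i])
    Sigma += delta[i] * (M2 - np.outer(Gamma[i], Gamma[i]))
    score += delta[i] * (w[:, i, :].mean(axis=0) - Gamma[i])

y_hat = beta + np.linalg.solve(Sigma, score)   # centering point of the quadratic approximation
```

Each summand of Σ^(s) is a weighted covariance matrix, so Σ^(s) is symmetric and positive semidefinite, which is what makes the quadratic approximation usable as a surrogate objective.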

Expression of f(β|τ, n) for random effects penalty estimation with the SCAD penalty

The SCAD penalty (Fan and Li, 2001) is a nonconcave function defined by φτ(0) = 0 and, for |β| > 0,

\[
\phi'_\tau(|\beta|) = \tau\, 1(|\beta| \le \tau) + \frac{(a\tau - |\beta|)_+}{a - 1}\, 1(|\beta| > \tau), \tag{2}
\]

where t_+ denotes the positive part of t and a = 3.7. Because the integral of the negative exponential of the SCAD penalty is infinite, i.e. \(\int_{-\infty}^{\infty} \exp\{-n\phi_\tau(|\beta|)\}\,d\beta = \infty\), we use a


truncated version of φτ(|β|) to define the density f(β|τ, n). For the SCAD, we have

\[
f(\beta|\tau, n)\, C(\tau, n) =
\begin{cases}
\exp(-n\tau|\beta|), & |\beta| < \tau,\\[2pt]
\exp\big[n\{|\beta|^2 - 2a\tau|\beta| + \tau^2\}/\{2(a-1)\}\big], & \tau \le |\beta| \le a\tau,\\[2pt]
\exp\{-n(a+1)\tau^2/2\}, & a\tau < |\beta| \le \bar{\beta},\\[2pt]
0, & |\beta| > \bar{\beta},
\end{cases}
\]

where β̄ is an arbitrarily large value and C(τ, n) is given by

\[
C(\tau, n) = \frac{2\{1 - \exp(-n\tau^2)\}}{n\tau} + 2(\bar{\beta} - a\tau)\exp\{-n(a+1)\tau^2/2\} + 2\int_{\tau}^{a\tau} \exp\big[n(\beta^2 - 2a\tau\beta + \tau^2)/\{2(a-1)\}\big]\, d\beta.
\]
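As a numerical sanity check (a sketch only; the values of τ, n, and β̄ below are hypothetical), the closed-form pieces of C(τ, n) can be compared against direct quadrature of exp{−nφτ(|β|)} over the truncation range [−β̄, β̄]:

```python
import numpy as np

a = 3.7  # SCAD constant

def scad(beta, tau):
    """SCAD penalty phi_tau(|beta|), the antiderivative of equation (2)."""
    b = np.abs(beta)
    return np.where(
        b <= tau, tau * b,
        np.where(b <= a * tau,
                 -(b**2 - 2 * a * tau * b + tau**2) / (2 * (a - 1)),
                 (a + 1) * tau**2 / 2))

def trap(fx, x):
    """Simple trapezoidal rule."""
    return float(np.sum((fx[1:] + fx[:-1]) * np.diff(x)) / 2)

def C_closed(tau, n, beta_bar):
    """C(tau, n): two exact pieces plus the middle piece integrated numerically."""
    t1 = 2 * (1 - np.exp(-n * tau**2)) / (n * tau)
    t2 = 2 * (beta_bar - a * tau) * np.exp(-n * (a + 1) * tau**2 / 2)
    b = np.linspace(tau, a * tau, 20001)
    t3 = 2 * trap(np.exp(n * (b**2 - 2 * a * tau * b + tau**2) / (2 * (a - 1))), b)
    return t1 + t2 + t3

tau, n, beta_bar = 0.5, 50, 10.0
g = np.linspace(-beta_bar, beta_bar, 400001)
C_direct = trap(np.exp(-n * scad(g, tau)), g)  # integral of exp(-n phi) over [-beta_bar, beta_bar]
```

The agreement of the two computations confirms that f(β|τ, n) = exp{−nφτ(|β|)}/C(τ, n) integrates to one on the truncated support.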

Algorithms to estimate the penalty parameter and perform MPL estimation

Algorithm to compute the MPL estimate

For a given penalty parameter τ and penalty function φτj(·), the MPL estimate θ̂τ is computed as follows:

Step 1. Choose an initial value of θ, denote it by θ^(0), and let s = 0.

Step 2. For i = 1, …, n, take a MCMC sample of size L, (z_i^(s,1), …, z_i^(s,L)), from the density f(z_{i,m}|d_{i,o}; θ^(s)).

Step 3. Using the approximation in (1), maximize PQ1,τ(β|θ^(s)) = PQ1(β|θ^(s)) − n Σ_{j=1}^p φτj(|βj|) using the LLA algorithm, and denote the maximizing value by β^(s+1).

Step 4. Using a standard optimization algorithm, such as the Newton-Raphson algorithm (Little and Schluchter, 1985; Schluchter and Jackson, 1989; Ibrahim, 1990; Ibrahim and Lipsitz, 1996), compute

\[
\alpha^{(s+1)} = \arg\max_{\alpha}\, \sum_{i=1}^{n} L^{-1}\sum_{l=1}^{L} \log f(z_i^{(s,l)}|x_i; \alpha).
\]

Step 5. Compute Λ^(s+1)(y_i) = Σ_{u=1}^n λ^(s+1)(y_u) 1{y_u ≤ y_i, δ_u = 1}, where

\[
\lambda^{(s+1)}(y_i) = \delta_i \Big[\sum_{u \in R(y_i)} \frac{1}{L}\sum_{l=1}^{L} \exp\big\{(z_u^{(s,l)T}, x_u^T)\beta^{(s+1)}\big\}\Big]^{-1}.
\]

Step 6. Return to Step 2 until the difference between θ^(s+1) and θ^(s) is small. The converged value of θ is the MPL estimate θ̂τ.
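Step 3 combines the quadratic approximation in (1) with a local linear approximation (LLA) of the penalty, which reduces the update to a weighted lasso problem. A minimal sketch of that step (the coordinate descent solver and all function names are our own illustration, not the authors' implementation):

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def scad_deriv(b, tau, a=3.7):
    """SCAD derivative phi'_tau(|b|) from equation (2)."""
    b = np.abs(b)
    return np.where(b <= tau, tau, np.maximum(a * tau - b, 0.0) / (a - 1))

def lla_step(Sigma, y_hat, beta_s, tau, n, iters=200):
    """One LLA step for Step 3: minimize
    (1/2)(y_hat - beta)' Sigma (y_hat - beta) + n * sum_j w_j |beta_j|,
    with fixed weights w_j = phi'_tau(|beta_j^{(s)}|), by cyclic coordinate descent."""
    w = scad_deriv(beta_s, tau)
    beta = beta_s.astype(float).copy()
    for _ in range(iters):
        for j in range(len(beta)):
            r = Sigma[j] @ beta - Sigma[j, j] * beta[j]   # contribution of the other coordinates
            beta[j] = soft(Sigma[j] @ y_hat - r, n * w[j]) / Sigma[j, j]
    return beta

# Toy check with a diagonal Sigma: large coefficients are kept, tiny ones are thresholded to zero.
Sigma = 2.0 * np.eye(3)
y_hat = np.array([1.0, 0.05, -0.5])
beta_new = lla_step(Sigma, y_hat, beta_s=y_hat, tau=0.2, n=10)
```

Because the LLA weight is zero for coefficients beyond aτ, large coefficients are left essentially unpenalized, which is the unbiasedness property of the SCAD penalty that motivates its use here.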


Algorithm to compute the ICQ penalty estimate

For a given penalty function φτj(·) and ICQ criterion function cn(·), the ICQ penalty estimate is computed as follows:

Step 1. Compute θ̂0. For i = 1, …, n, take a MCMC sample of size L, (z_i^(1), …, z_i^(L)), from the density f(z_{i,m}|d_{i,o}; θ̂0). Using this sample, the ICQ criterion, ICQ(τ) = −2Q(θ̂τ|θ̂0) + cn(θ̂τ), can be approximated for any value of τ.

Step 2. Minimize the criterion ICQ(τ) with respect to τ. The minimizing value of τ is the ICQ penalty estimate.

Algorithm to compute the random effects penalty estimate

For a given penalty function φτj(·), the random effects penalty estimate is computed as follows:

Step 1. Choose an initial value of (α, Λ, τ), denote it by (α^(0), Λ^(0), τ^(0)), and let s = 0.

Step 2. Take a MCMC sample of size L, (z_1^(s,l), …, z_n^(s,l), β^(s,l)) for l = 1, …, L, from the joint density of (z_m, β) proportional to Π_{i=1}^n f(y_i, δ_i, z_i|x_i; θ^(s)) f(β|τ^(s), n), which satisfies

\[
\int\!\!\int \prod_{i=1}^{n} f(y_i, \delta_i, z_i|x_i; \theta)\, f(\beta|\tau, n)\, dz_{i,m}\, d\beta = \int \prod_{i=1}^{n} f(d_{i,o}|\theta)\, f(\beta|\tau, n)\, d\beta,
\]

where f(β|τ, n) is defined by

\[
f(\beta|\tau, n) = \prod_{j=1}^{p} \exp\{-n\phi_{\tau_j}(|\beta_j|)\}/C(\tau_j, n), \tag{3}
\]

in which C(τj, n) is the normalizing constant of f(βj|τj, n).

Step 3. Using a standard optimization algorithm, compute

\[
\tau^{(s+1)} = \arg\max_{\tau}\, L^{-1}\sum_{l=1}^{L} \log f(\beta^{(s,l)}|\tau, n).
\]

Step 4. Using a standard optimization algorithm, compute

\[
\alpha^{(s+1)} = \arg\max_{\alpha}\, \sum_{i=1}^{n} L^{-1}\sum_{l=1}^{L} \log f(z_i^{(s,l)}|x_i; \alpha).
\]

Step 5. Compute Λ^(s+1)(y_i) = Σ_{u=1}^n λ^(s+1)(y_u) 1{y_u ≤ y_i, δ_u = 1}, where

\[
\lambda^{(s+1)}(y_i) = \delta_i \Big[\sum_{u \in R(y_i)} \frac{1}{L}\sum_{l=1}^{L} \exp\big\{(z_u^{(s,l)T}, x_u^T)\beta^{(s,l)}\big\}\Big]^{-1}.
\]


Step 6. Return to Step 2 until the difference between (α^(s+1), Λ^(s+1), τ^(s+1)) and (α^(s), Λ^(s), τ^(s)) is small. The converged value of τ is the random effects penalty estimate.

Assumptions and Proofs of Theorems 1-2

In this section, we establish the asymptotic theory of the MPL estimators and the consistency of the penalty selection procedure based on ICQ. In order to establish these results, we assume the following conditions:

(C1) Let ymax be a finite time point at which any individual still under study is censored. Assume P(yi ≥ ymax) > 0 and Λ(ymax) < ∞.

(C2) Let f(z|x; α) be continuous with respect to α and have second order derivatives with respect to α. Suppose that f(z|x; α) is identifiable; that is, f(z|x; α1) = f(z|x; α2) a.e. implies α1 = α2. Also suppose z is bounded.

(C3) We assume that the partially observed covariates can be observed for all possible covariate values. That is, if we let r be a q × 1 vector with jth component equal to 1 if the jth component of z is observed and 0 otherwise, then P(r = (1, …, 1)^T | y, δ, z, x) > 0 for almost all z and almost all y ∈ [t1, t2] such that Λ(t1) ≠ Λ(t2).

(C4) The parameters α and β are interior points of known compact sets A and B, respectively.

(C5) The efficient Fisher information matrix Ĩ(γ*) is finite and positive definite.

(C6) Let an = max_j{φ'_{τjn}(|β*_j|) : β*_j ≠ 0} and bn = max_j{|φ''_{τjn}(|β*_j|)| : β*_j ≠ 0}. Assume that (1) τjn →p 0 for j = 1, …, p1; (2) n^{1/2} an = Op(1); and (3) bn →p 0.

(C7) (1) lim inf_{n→∞} lim inf_{β→0+} τjn^{-1} φ'_{τjn}(β) > 0 in probability for j = p1+1, …, p; and (2) n^{1/2} dn →p ∞, where dn = min{τjn : j = p1+1, …, p}.

(C8) Let θ*_S = arg sup_{θ: βj ≠ 0, j∈S} E{Q(θ|θ*)} and θ̄_S = arg sup_{θ: βj ≠ 0, j∈S} Q(θ|θ*). Assume that θ̄_S →p θ*_S, where the expectation in E{Q(θ|θ*)} is taken with respect to the true density of the observed random variables.

Proof of Theorem 1a.

Conditions (C1)-(C5) establish identifiability of the observed data log likelihood, existence and consistency of θ̂0, and asymptotic normality of β̂0. Let PL(γ) be a profiled version of ℓ(θ) in which Λ has been profiled out; that is, PL(γ) = sup_{Λ∈L̄} ℓ(θ), where L̄ is the set of all nonnegative increasing stepwise continuous functions such that Λ(0) = 0. Following Chen and Little (1999) and Murphy and van der Vaart (2000), it can be shown that for any random sequence γn →p γ*,

\[
\log PL(\gamma_n) = \log PL(\gamma^*) + (\gamma_n - \gamma^*)^T \sum_{i=1}^{n} \tilde{l}(\gamma^*; d_{i,o}) - \frac{n}{2}(\gamma_n - \gamma^*)^T \tilde{I}(\gamma^*)(\gamma_n - \gamma^*) + o_p\big(\sqrt{n}\,\|\gamma_n - \gamma^*\| + 1\big)^2, \tag{4}
\]

\[
n^{-1/2}\,\partial_\gamma PL(\gamma_n) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \tilde{l}(\gamma^*; d_{i,o}) + o_p(1), \tag{5}
\]

and

\[
\sqrt{n}\,(\hat{\gamma}_n - \gamma^*) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \tilde{I}(\gamma^*)^{-1}\tilde{l}(\gamma^*; d_{i,o}) + o_p(1), \tag{6}
\]

where l̃(γ*; d_{i,o}) is the efficient score function of γ, Ĩ(γ*) is the efficient Fisher information matrix, and γ̂n is the maximizer of PL(γ). To show that γ̂τ is a √n-consistent estimator of γ*,


it is enough to show that, for any ε > 0,

\[
P\Big[\sup_{\|u\|=C}\Big\{PL(\gamma^* + n^{-1/2}u) - n\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta^*_j + n^{-1/2}u_j|)\Big\} < PL(\gamma^*) - n\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta^*_j|)\Big] \ge 1 - \varepsilon
\]

for large C, since this implies there exists a local maximizer in the ball {γ* + n^{-1/2}u : ||u|| ≤ C}, and thus ||γ̂τ − γ*|| = Op(n^{-1/2}). Using this approach, we have

\[
\begin{aligned}
&PL(\gamma^* + n^{-1/2}u) - n\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta^*_j + n^{-1/2}u_j|) - PL(\gamma^*) + n\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta^*_j|)\\
&\quad\le PL(\gamma^* + n^{-1/2}u) - n\sum_{j=1}^{p_1}\phi_{\tau_{jn}}(|\beta^*_j + n^{-1/2}u_j|) - PL(\gamma^*) + n\sum_{j=1}^{p_1}\phi_{\tau_{jn}}(|\beta^*_j|)\\
&\quad= n^{-1/2}u^T\sum_{i=1}^{n}\tilde{l}(\gamma^*; d_{i,o}) - \frac{1}{2}u^T\tilde{I}(\gamma^*)u - n^{1/2}\sum_{j=1}^{p_1}\phi'_{\tau_{jn}}(|\beta^*_j|)\,\mathrm{sgn}(\beta^*_j)\,u_j - \frac{1}{2}\sum_{j=1}^{p_1}\phi''_{\tau_{jn}}(|\beta^*_j|)\,u_j^2 + o_p(\|u\| + 1)^2\\
&\quad\le n^{-1/2}u^T\sum_{i=1}^{n}\tilde{l}(\gamma^*; d_{i,o}) - \frac{1}{2}u^T\tilde{I}(\gamma^*)u + \sqrt{p_1}\,n^{1/2}a_n\|u_1\| + \frac{1}{2}b_n\|u_1\|^2 + o_p(1)\\
&\quad= n^{-1/2}u^T\sum_{i=1}^{n}\tilde{l}(\gamma^*; d_{i,o}) - \frac{1}{2}u^T\tilde{I}(\gamma^*)u + O_p(1)\|u_1\| + o_p(1),
\end{aligned} \tag{7}
\]

where u = (u_1^T, u_2^T)^T and u_1 is a p_1 × 1 vector. The first inequality follows because φτjn(0) = 0 and φτjn(·) ≥ 0. The second step follows from the expansion in (4) and a second order Taylor's expansion of the penalty function. The third step follows from condition (C6) and Σ_{j=1}^{p1}|u_j| ≤ √p_1 ||u_1||. Since the first term in (7) is Op(1) and uᵀĨ(γ*)u is bounded below by ||u||² times the smallest eigenvalue of Ĩ(γ*), the second term in (7) dominates the rest

and (7) can be made negative for large enough C.

Proof of Theorem 1b.

Now suppose the conditions of Theorem 1(a) hold. Then there exists a maximizer, γ̂τ, of PL(γ) − n Σ_{j=1}^p φτjn(|βj|) which is a √n-consistent estimator of γ*, such that ||γ̂τ − γ*|| = Op(n^{-1/2}) and ||β̂(2)τ|| ≤ Cn^{-1/2}. Using the expansion in (5), we have

\[
\begin{aligned}
0 &= n^{-1/2}\,\partial_\gamma\Big[PL(\gamma) - n\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta_j|)\Big]\Big|_{\gamma=\hat{\gamma}_\tau}\\
&= n^{-1/2}\sum_{i=1}^{n}\tilde{l}(\gamma^*; d_{i,o}) - n^{1/2}\,\partial_\gamma\Big\{\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta_j|)\Big\}\Big|_{\gamma=\hat{\gamma}_\tau} + o_p(1)\\
&= O_p(1) - n^{1/2}\,\partial_\gamma\Big\{\sum_{j=1}^{p}\phi_{\tau_{jn}}(|\beta_j|)\Big\}\Big|_{\gamma=\hat{\gamma}_\tau},
\end{aligned} \tag{8}
\]

where (8) follows because n^{-1/2} Σ_{i=1}^n l̃(γ*; d_{i,o}) = Op(1). For j = p1+1, …, p, the component of the second term of (8) corresponding to βj is −sgn(β̂jτ) n^{1/2} τjn {τjn^{-1} φ'τjn(|β̂jτ|)}. Since ||β̂(2)τ|| = Op(n^{-1/2}), τjn^{-1} φ'τjn(|β̂jτ|) is greater than zero for large n by condition (C7)-(1). Therefore, this component of (8) is dominated by −sgn(β̂jτ) n^{1/2} dn. Since n^{1/2} dn →p ∞ by condition (C7)-(2), it must be the case that β̂jτ = 0 for j = p1+1, …, p; otherwise (8) could be made arbitrarily large in absolute value and so could not equal zero.

Proof of Theorem 1c.

Given conditions (C1)-(C7), Theorems 1(a) and 1(b) apply. Thus, there exists a γ̂τ = (β̂(1)τ^T, 0^T, α̂τ^T)^T which is a √n-consistent local maximizer of PL(γ) − n Σ_{j=1}^p φτjn(|βj|). Let ϕ̂τ = (β̂(1)τ^T, α̂τ^T)^T, ϕ* = (β*(1)^T, α*^T)^T, l̃(ϕ*; d_{i,o}) = l̃((β*(1), 0, α*); d_{i,o}), and let C(ϕ*) be the matrix that results from removing rows and columns p1+1 through p of the matrix Ĩ((β*(1), 0, α*)). Also let

\[
h_1(\beta_{(1)}) = \big(\phi'_{\tau_1}(|\beta_1|)\,\mathrm{sgn}(\beta_1), \ldots, \phi'_{\tau_{p_1}}(|\beta_{p_1}|)\,\mathrm{sgn}(\beta_{p_1})\big)^T,\qquad
G_1(\beta_{(1)}) = \mathrm{diag}\big(\phi''_{\tau_1}(|\beta_1|), \ldots, \phi''_{\tau_{p_1}}(|\beta_{p_1}|)\big),
\]

\[
h(\varphi^*) = \begin{pmatrix} h_1(\beta^*_{(1)}) \\ 0 \end{pmatrix},\qquad
G(\varphi^*) = \begin{pmatrix} G_1(\beta^*_{(1)}) & 0 \\ 0 & 0 \end{pmatrix},\qquad\text{and}
\]

\[
\Sigma(\varphi^*) = \{C(\varphi^*) + G(\varphi^*)\}^{-1}\, C(\varphi^*)\, \{C(\varphi^*) + G(\varphi^*)\}^{-1}.
\]


The efficient score function for ϕ of PL((β(1), 0, α)) − n Σ_{j=1}^{p1} φτjn(|βj|) is l̃(ϕ*; d_{i,o}) − h(ϕ*) and the efficient Fisher information matrix is C(ϕ*) + G(ϕ*). Using (6), we have

\[
\sqrt{n}\,\{C(\varphi^*) + G(\varphi^*)\}(\hat{\varphi}_\tau - \varphi^*) = n^{-1/2}\sum_{i=1}^{n}\tilde{l}(\varphi^*; d_{i,o}) - \sqrt{n}\,h(\varphi^*) + o_p(1),
\]

which can be re-expressed as

\[
\sqrt{n}\,\{C(\varphi^*) + G(\varphi^*)\}\big[\hat{\varphi}_\tau - \varphi^* + \{C(\varphi^*) + G(\varphi^*)\}^{-1}h(\varphi^*)\big] = n^{-1/2}\sum_{i=1}^{n}\tilde{l}(\varphi^*; d_{i,o}) + o_p(1).
\]

Since n^{-1/2} Σ_{i=1}^n l̃(ϕ*; d_{i,o}) →D N(0, C(ϕ*)) and h(ϕ*) →p 0 by condition (C6), it follows that

\[
\sqrt{n}\,(\hat{\varphi}_\tau - \varphi^*) \xrightarrow{D} N\big(0, \Sigma(\varphi^*)\big).
\]

For certain penalty parameters, condition (C6)-(1) implies (C6)-(2) and (C6)-(3) because an and bn are functions of τjn. For example, for the SCAD penalty with τjn = τn, if τn = op(1) and n^{1/2}τn →p ∞, then an = 0 for sufficiently large n, and conditions (C6) and (C7) are satisfied. For the ALASSO penalty, with τjn = τn|β̂j|^{-1}, where β̂j is the unpenalized ML estimator, τn = Op(n^{-1/2}) implies conditions (C6) and (C7). This follows because for j = 1, …, p1, τn/|β̂j| = Op(n^{-1/2}) is equivalent to τn = Op(n^{-1/2}), and for j = p1+1, …, p, n^{1/2}τn/|β̂j| = Op(1)/|β̂j| →p ∞ since |β̂j| →p 0.

Proof of Theorem 2.

To prove Theorem 2, we use the law of large numbers from van der Vaart and Wellner (1996)

to show that, for all θ and θ1n →p θ1,

\[
\frac{1}{n}Q(\theta|\hat{\theta}_0) - \frac{1}{n}Q(\theta|\theta^*) = o_p(1),\qquad
\frac{1}{n}Q(\theta_{1n}|\theta^*) - \frac{1}{n}E\{Q(\theta_1|\theta^*)\} = o_p(1). \tag{9}
\]

Sufficient conditions for ensuring (9) are

\[
\sup_{\theta\in\Theta,\,\|\tilde{\theta}-\theta^*\|\le\delta}\Big|\frac{1}{n}Q(\theta|\tilde{\theta}) - \frac{1}{n}E\{Q(\theta|\tilde{\theta})\}\Big| = o_p(1) \tag{10}
\]

and

\[
\sup_{\theta\in\Theta,\,\|\tilde{\theta}-\theta'\|\le\delta}\Big|\frac{1}{n}E\{Q(\theta|\tilde{\theta})\} - \frac{1}{n}E\{Q(\theta|\theta')\}\Big| = o(1). \tag{11}
\]

Conditions (C2) and (C4) are sufficient for establishing (10) and (11). For instance, to prove (10), we note that both Λ(·) and λ(·) belong to functional spaces with bounded bracketing numbers and that f(z|x; α) is a smooth function. Let θ̃Sτ = arg sup_{θ: βj ≠ 0, j∈Sτ} Q(θ|θ̂0). Since θ̂0 →p θ* and θ̄Sτ →p θ*Sτ by conditions (C1)-(C5) and (C8), we have

\[
\begin{aligned}
\frac{1}{n}d_{IC_Q}(\tau, 0) &= \frac{1}{n}\{IC_Q(\tau) - IC_Q(0)\}\\
&= \frac{1}{n}\Big[2Q(\hat{\theta}_0|\hat{\theta}_0) - 2Q(\hat{\theta}_\tau|\hat{\theta}_0) + c_n(\hat{\theta}_\tau) - c_n(\hat{\theta}_0)\Big]\\
&\ge \frac{2}{n}\Big\{Q(\hat{\theta}_0|\hat{\theta}_0) - Q(\tilde{\theta}_{S_\tau}|\hat{\theta}_0)\Big\} + o_p(1)\\
&= \frac{2}{n}\Big\{Q(\hat{\theta}_0|\hat{\theta}_0) - Q(\tilde{\theta}_{S_\tau}|\theta^*)\Big\} + o_p(1)\\
&\ge \frac{2}{n}\Big\{Q(\hat{\theta}_0|\hat{\theta}_0) - Q(\bar{\theta}_{S_\tau}|\theta^*)\Big\} + o_p(1)\\
&= \frac{2}{n}\Big[E\{Q(\theta^*|\theta^*)\} - E\{Q(\theta^*_{S_\tau}|\theta^*)\}\Big] + o_p(1)\\
&\ge \frac{2}{n}\min_{S \not\supset S_T}\Big[E\{Q(\theta^*|\theta^*)\} - E\{Q(\theta^*_S|\theta^*)\}\Big] + o_p(1),
\end{aligned}
\]

where the second and fourth inequalities follow because Q(θ̂τ|θ̂0) ≤ Q(θ̃Sτ|θ̂0) and Q(θ̃Sτ|θ*) ≤ Q(θ̄Sτ|θ*) for all τ, and the third and fifth equalities follow from (9). Therefore, we have

\[
\Pr\Big(\inf_{\tau\in\mathbb{R}^p_u} IC_Q(\tau) > IC_Q(0)\Big) \to 1,
\]

which yields Theorem 2(a). For Theorem 2(b), we have

\[
\begin{aligned}
n^{-1/2}\delta_Q(\tau_2, \tau_1) &= n^{-1/2}\{IC_Q(\tau_2) - IC_Q(\tau_1)\}\\
&= 2n^{-1/2}\big\{Q(\hat{\theta}_{\tau_1}|\hat{\theta}_0) - Q(\hat{\theta}_{\tau_2}|\hat{\theta}_0)\big\} + n^{-1/2}\big\{c_n(\hat{\theta}_{\tau_2}) - c_n(\hat{\theta}_{\tau_1})\big\}\\
&= 2n^{-1/2}\Big[Q(\hat{\theta}_{\tau_1}|\hat{\theta}_0) - E\{Q(\theta^*_{S_{\tau_1}}|\hat{\theta}_0)\}\Big] - 2n^{-1/2}\Big[Q(\hat{\theta}_{\tau_2}|\hat{\theta}_0) - E\{Q(\theta^*_{S_{\tau_2}}|\hat{\theta}_0)\}\Big]\\
&\qquad + 2n^{-1/2}\Big[E\{Q(\theta^*_{S_{\tau_1}}|\hat{\theta}_0)\} - E\{Q(\theta^*_{S_{\tau_2}}|\hat{\theta}_0)\}\Big] + n^{-1/2}\delta_c(\tau_2, \tau_1)\\
&= O_p(1) + n^{-1/2}\delta_c(\tau_2, \tau_1) \xrightarrow{p} \infty.
\end{aligned}
\]


Thus ICQ(τ2) > ICQ(τ1) in probability, which yields Theorem 2(b). The proof of Theorem 2(c) is similar to that of Theorem 2(b).

VA Administration Lung Cancer Data.

We applied the proposed methodology to the well known Veterans Administration (VA) lung cancer data set of Kalbfleisch and Prentice (2002). Although these data have no missing covariates, we analyzed them in order to compare complete data results with scenarios based on hypothetical missing data. The VA data set includes n = 137 cancer patients and 5 predictors of interest: age (years), treatment (yes and no coded as 1 and 0), prior therapy (yes and no coded as 1 and 0), cell type (squamous, small, adeno, and large), and Karnofsky score, which is a measure of general performance. The response variable is survival time since diagnosis, which may be right censored. The goal of the analysis is to identify the most important predictors of survival time. Age and Karnofsky score were standardized to reduce collinearity, and for the predictor cell type, squamous was selected as the reference category. Since these data contain no missing covariates, we hypothetically assigned Karnofsky score and prior therapy to be MAR. The covariate distribution used for the missing covariates is given by

\[
[z_{i1}|z_{i2}, x_i] \sim \mathrm{Bernoulli}\Big(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\Big),\qquad
[z_{i2}|x_i] \sim N(\mu_i, \sigma^2),
\]

for i = 1, …, n, where zi = (zi1, zi2)^T = (prior therapy_i, Karnofsky score_i)^T, xi = (xi1, …, xi5)^T = (age_i, small_i, adeno_i, large_i, treatment_i)^T, η_i = η0 + Σ_{j=1}^5 ηj xij + η6 zi2, and μ_i = μ0 + Σ_{j=1}^5 μj xij. To assign hypothetical missing values, we used the following missing data mechanism:

\[
f(r_{i1}, r_{i2}|x_i, y_i; \xi_1, \xi_2) = f(r_{i1}|r_{i2}, x_i, y_i; \xi_1)\, f(r_{i2}|x_i, y_i; \xi_2),
\]


where

\[
f(r_{i1} = 1|r_{i2}, x_i, y_i; \xi_1) = \frac{\exp(\xi_{i1})}{1 + \exp(\xi_{i1})},\qquad
f(r_{i2} = 1|x_i, y_i; \xi_2) = \frac{\exp(\xi_{i2})}{1 + \exp(\xi_{i2})},
\]

\[
\xi_{i1} = \xi_{10} + \sum_{j=1}^{5}\xi_{1j}x_{ij} + \xi_{16}y_i + \xi_{17}r_{i2},\qquad
\xi_{i2} = \xi_{20} + \sum_{j=1}^{5}\xi_{2j}x_{ij} + \xi_{26}y_i,
\]

and the values of ξ1 and ξ2 were selected to achieve 30% missingness. The same MPL estimators as those in the simulations were computed, along with the unpenalized ML estimate. The results are presented in Table 1. For the analysis based on no missing covariate data, the unpenalized ML analysis identified Karnofsky score, small cell type, and adeno cell type as significant predictors of survival time. The SCAD-ICQ, ALASSO-ICQ, and ALASSO-RE estimates also identified these covariates as significant predictors, but the ALASSO-RE estimate additionally identified treatment as significant. In contrast, the SCAD-RE estimate identified Karnofsky score as the sole significant predictor of survival time.

In the presence of missing data, the SCAD-RE and ALASSO-RE estimates identified the same set of covariates as significant as in the analysis with no missing data. The estimates using the ICQ penalty estimate, however, were different. The ALASSO-ICQ estimate identified Karnofsky score as the only significant predictor, whereas the SCAD-ICQ estimate did not identify any covariates as significant. The results indicate that when there is no missing covariate data, the ICQ estimates are consistent with the results from an ML analysis, since the same set of covariates is identified as significant. This is not the case, however, in the presence of missing data. The ALASSO-RE estimate identifies treatment as a significant predictor whereas the ML analysis does not. This result is consistent with the simulations, where the ALASSO-RE estimate was found to exhibit significant overfitting.

[Table 1 about here.]
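The missingness assignment described above can be mimicked in a few lines. The sketch below draws (r_i1, r_i2) from the two sequential logistic models; the coefficient values ξ1 and ξ2 and the simulated data are hypothetical (in the analysis they were tuned to give roughly 30% missingness, a tuning not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2002)

def assign_missingness(x, y, xi1, xi2, rng):
    """Draw indicators (r_i1, r_i2) under the sequential logistic models:
    xi2 holds (intercept, coefficients of x_i1..x_i5, y_i); xi1 additionally includes r_i2."""
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    lin2 = xi2[0] + x @ xi2[1:6] + xi2[6] * y
    r2 = (rng.random(len(y)) < expit(lin2)).astype(int)    # f(r_i2 = 1 | x_i, y_i)
    lin1 = xi1[0] + x @ xi1[1:6] + xi1[6] * y + xi1[7] * r2
    r1 = (rng.random(len(y)) < expit(lin1)).astype(int)    # f(r_i1 = 1 | r_i2, x_i, y_i)
    return r1, r2

# Hypothetical data of the same shape as the VA analysis (n = 137, five x covariates).
n = 137
x = rng.normal(size=(n, 5))
y = rng.exponential(1.0, size=n)
xi1 = np.array([-1.0, 0.2, 0.1, 0.0, 0.0, 0.1, 0.05, 0.5])   # hypothetical coefficients
xi2 = np.array([-0.8, 0.1, 0.0, 0.2, 0.0, 0.1, 0.05])        # hypothetical coefficients
r1, r2 = assign_missingness(x, y, xi1, xi2, rng)
```

Because r_i1 depends on r_i2, x_i, and y_i but not on the unobserved covariate values themselves, the resulting mechanism is MAR.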


References

Chen, H. Y. and Little, R. J. A. (1999). Proportional hazard regression with missing covariates. Journal of the American Statistical Association 94, 896-908.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.

Ibrahim, J. G. (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association 85, 765-769.

Ibrahim, J. G. and Lipsitz, S. R. (1996). Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable. Biometrics 52, 1071-1078.

Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. New York: John Wiley.

Little, R. J. A. and Schluchter, M. (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72, 497-512.

Murphy, S. A. and van der Vaart, A. W. (2000). On profile likelihood. Journal of the American Statistical Association 95, 449-465.

Schluchter, M. and Jackson, K. (1989). Log-linear analysis of censored survival data with partially observed covariates. Journal of the American Statistical Association 84, 42-52.

van der Vaart, A. W. and Wellner, J. (1996). Weak Convergence and Empirical Processes: with Applications to Statistics. New York: Springer.


Table 1
Maximum penalized likelihood estimates of VA lung cancer data with no missing data and covariates missing at random.

                              No miss^a (MAR^b)
                     SCAD                              ALASSO
Variable        RE              ICQ              RE              ICQ              MLE^c
Prior therapy   0.000 ( 0.000)  0.000 ( 0.000)   0.000 ( 0.000)  0.000 ( 0.000)   0.073  ( 0.022  )
Karnofsky      -0.670 (-0.599) -0.610 ( 0.000)  -0.599 (-0.538) -0.596 (-0.471)  -0.658**(-0.215**)
Age             0.000 ( 0.000)  0.000 ( 0.000)   0.000 ( 0.000)  0.000 ( 0.000)  -0.092  ( 0.020  )
Diagnosis       0.000 ( 0.000)  0.000 ( 0.000)   0.000 ( 0.000)  0.000 ( 0.000)  -0.004  ( 0.083  )
Small cell      0.000 ( 0.000)  0.573 ( 0.000)   0.499 ( 0.539)  0.468 ( 0.000)   0.863**( 0.913* )
Adeno cell      0.000 ( 0.000)  1.000 ( 0.000)   0.891 ( 0.762)  0.870 ( 0.000)   1.198**( 1.312**)
Large cell      0.000 ( 0.000)  0.000 ( 0.000)   0.000 ( 0.000)  0.000 ( 0.000)   0.404  ( 0.291  )
Treatment       0.000 ( 0.000)  0.000 ( 0.000)   0.032 ( 0.162)  0.000 ( 0.000)   0.297  ( 0.201  )

a No miss is the estimate from data with no missing covariates.
b MAR is the estimate from data with covariates missing at random.
c * indicates p value < .05, ** indicates p value < .001.