Nonparametric Tree-Structured Modeling for Interval-Censored Survival Data

Yanming Yin and Stewart J. Anderson, University of Pittsburgh
Stewart J. Anderson, 303 Parran Hall, GSPH, 130 DeSoto Street, Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261 ([email protected] and [email protected])

October 18, 2002

Key Words: Interval-censored data, Tree-structured model, Exponential model, Nonparametric model.

1 Introduction

Survival analysis has been a major research area in statistics. In survival analysis, the time to some event is usually the outcome variable. The objective of a survival study is to identify the relationship between treatments, risk factors or other covariates and the time to event. However, in a real clinical trial or longitudinal study, the exact time to event for each participant is not always known. The exact time to event might be unknown either because it takes too long to follow every participant until his or her event occurs, or because each participant is only evaluated periodically, so that the time to event is only known to lie inside an interval. When the time to event is known only to occur inside an interval, we say that the survival time is interval-censored. Left censoring and right censoring are special cases of interval censoring.

The idea of recursive partitioning was first proposed by Morgan and Sonquist (1963). Breiman, Friedman, Olshen and Stone (1984) advanced the development of tree-structured models, and the availability of their software, CART (Classification And Regression Trees), helped tree-structured methods become a popular statistical tool. Tree-structured regression models were introduced into survival analysis, using different splitting and pruning approaches, by Gordon and Olshen (1985), Segal (1988), Davis and Anderson (1989), Ciampi (1991), LeBlanc and Crowley (1992), LeBlanc and Crowley (1993) and Ahn and Loh (1994). However, most of those methods only deal with right-censored survival data. Bacchetti and Segal (1995) extended the method in Segal (1988) to allow for left truncation and time-dependent covariates. Huang, Chen and Soong (1998) proposed a piecewise exponential survival tree with time-dependent covariates.

All of the above tree-structured models are applicable to data that are right-censored or data that are both left-truncated and right-censored. In this paper, we propose to extend the tree-structured model further to accommodate interval-censored survival data in two ways. First, we extend the exponential tree model proposed by Davis and Anderson (1989) to interval-censored survival data (Yin and Anderson (2001)). We also propose a nonparametric method to construct tree models for interval-censored data. The performance of the two methods will be compared through two sets of simulations.

2 Nonparametric Tree-Structured Model for Interval-Censored Survival Data

We represent the interval-censored data as {(L_i, R_i], δ_i, X_i}, i = 1, 2, ..., N, where (L_i, R_i] is the interval during which the ith response occurred, X_i is a p-dimensional vector of covariates, δ_i is the censoring indicator variable and N is the total number of observations. We use Δ_i = R_i − L_i to denote the width of the censoring interval.

In previous work (Yin and Anderson, 2001), we proposed an exponential tree model for interval-censored survival data. There we assumed that the survival time has an exponential distribution. For any node t, if we assume that the subjects in this node have the same constant hazard rate, λ_t, then the log-likelihood for the interval-censored data in this node can be written as

    l_t = \sum_{i=1}^{n_t} \left[ \delta_i \log\{1 - \exp(-\lambda_t \Delta_i)\} - \lambda_t L_i \right]    (2.1)

To estimate the log-likelihood l_t, we first have to estimate the hazard rate λ_t, which can be done with the Newton-Raphson method. This estimated log-likelihood is the key quantity in constructing the interval-censored exponential tree model; it is used in almost every step of tree construction, e.g., splitting, pruning and cross-validation.
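
For concreteness, here is a minimal Python sketch of this node-level computation (our own illustration, not the authors' code): it evaluates the log-likelihood (2.1) at a candidate hazard rate and maximizes it by Newton-Raphson. The names exp_node_loglik, fit_exp_node and width (for Δ_i), and the assumption that the inputs are NumPy arrays with at least one interval-censored subject in the node, are ours.

```python
import numpy as np

def exp_node_loglik(lam, L, delta, width):
    """Log-likelihood (2.1) for one node under constant hazard lam.
    Interval-censored subjects (delta = 1) contribute
    log(1 - exp(-lam * width_i)) - lam * L_i; right-censored
    subjects (delta = 0) contribute -lam * L_i."""
    w = width[delta == 1]
    return np.sum(np.log1p(-np.exp(-lam * w))) - lam * np.sum(L)

def fit_exp_node(L, delta, width, lam0=1.0, tol=1e-8, max_iter=50):
    """Newton-Raphson for the node hazard rate (a sketch: step-halving
    and other safeguards are omitted)."""
    lam = lam0
    w = width[delta == 1]
    for _ in range(max_iter):
        e = np.exp(-lam * w)
        score = np.sum(w * e / (1.0 - e)) - np.sum(L)   # dl/d(lam)
        hess = -np.sum(w**2 * e / (1.0 - e)**2)         # d2l/d(lam)2
        lam_new = lam - score / hess
        if lam_new <= 0:                                # keep the hazard positive
            lam_new = lam / 2.0
        if abs(lam_new - lam) < tol:
            lam = lam_new
            break
        lam = lam_new
    return lam, exp_node_loglik(lam, L, delta, width)
```

The estimated node log-likelihood returned here is the quantity used for splitting and pruning in the exponential tree.
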
The exponential assumption may be reasonable in many situations. Most of the time, however, we have no information about the distribution of the survival time. The exponential tree model might then not be an appropriate method, and a nonparametric method should be applied instead. We therefore propose a nonparametric tree-structured model for interval-censored survival data in which we make no assumption about the distribution of the survival time. For each node, we formulate the problem as a multinomial problem by constructing disjoint intervals and then estimating the probability of an event occurring in each interval. Alternatively, the results can be expressed as a so-called "experimental survival curve" (Peto, 1973). Turnbull's (1976) method is used to estimate those probabilities. With the multinomial probabilities, we can still write out a log-likelihood function and use its maximization as the criterion for splitting, pruning and model selection, just as in the parametric setting.

2.1 Nonparametric estimation for interval-censored data

For a node t, the censoring interval for subject i is represented as (L_i, R_i], where i denotes the ith subject, i = 1, 2, ..., n_t, and n_t is the number of subjects in node t. Following Turnbull (1976), we define A = ∪_{i=1}^{n_t} (L_i, R_i] and assume that a semi-infinite interval is semi-closed. We then use Turnbull's (1976) method to construct a set of disjoint intervals, estimate the MLE of the probabilities in those intervals, and thereby obtain the MLE of the likelihood for a given node t. The likelihood associated with node t can be written as

    L_t = \prod_{i=1}^{n_t} \left[ F(R_i) - F(L_i) \right]    (2.2)

where F(t) denotes the underlying CDF at time t.

First we construct a set of disjoint intervals whose left and right end points lie in the sets {L_i; i = 1, 2, ..., n_t} and {R_i; i = 1, 2, ..., n_t}, respectively, and contain no other members of {L_i} and {R_i}. We denote this set of disjoint intervals by C = ∪_{j=1}^{m_t} (u_j, v_j], where m_t is the number of disjoint intervals constructed for node t and u_1 < v_1 ≤ u_2 < v_2 ≤ ... ≤ u_{m_t} < v_{m_t}. These disjoint intervals are also called "sample innermost intervals" by Yu et al. (2000). The (u_j, v_j] are nonempty intersections of the intervals (L_i, R_i], so that for any pair of intervals (u_j, v_j] and (L_i, R_i], either (u_j, v_j] ⊆ (L_i, R_i] or (u_j, v_j] ∩ (L_i, R_i] = ∅. We use p_j to denote the probability that an event occurs in interval (u_j, v_j] and let α_ij be an indicator variable denoting whether (u_j, v_j] ⊆ (L_i, R_i], that is, α_ij = 1 if (u_j, v_j] ⊆ (L_i, R_i] and α_ij = 0 otherwise. Then the log-likelihood can be written as

    l_t = \sum_{i=1}^{n_t} \log\left( \sum_{j=1}^{m_t} \alpha_{ij} p_j \right)    (2.3)

To estimate the log-likelihood, we need the MLEs of the p_j's. Turnbull (1976) proposed a method to estimate them using the EM algorithm, or "self-consistent algorithm". There are two major steps in estimating the MLE of the parameters with the EM algorithm: the E-step and the M-step. For our problem, in each node t, the EM algorithm can be applied as follows.

Step 0: Obtain initial estimates p_j^0, j = 1, 2, ..., m_t.

Step 1 (E-step): Estimate the expected number of observations in each interval (u_j, v_j] given the incomplete data and the initial estimates p_j^0. The expected number of observations for the jth interval is

    w_j = \sum_{i=1}^{n_t} \frac{\alpha_{ij} p_j^0}{\sum_{k=1}^{m_t} \alpha_{ik} p_k^0}    (2.4)

Step 2 (M-step): Update the estimates of the p_j's given the expected counts from the E-step, that is,

    p_j^1 = \frac{w_j}{\sum_{j=1}^{m_t} w_j}    (2.5)

Step 3: Replace the p_j^0 from Step 0 by the p_j^1 estimated in Step 2 and repeat the above procedure until it converges. The p_j^1 from the final iteration is the MLE of p_j, denoted p̂_j. Substituting p̂_j for p_j in the log-likelihood gives the estimated log-likelihood of node t,

    \hat{l}_t = \sum_{i=1}^{n_t} \log\left( \sum_{j=1}^{m_t} \alpha_{ij} \hat{p}_j \right)    (2.6)

After we obtain the estimated log-likelihood, we use it to construct the nonparametric tree model, following a procedure similar to that for the exponential tree model.
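
To illustrate Steps 0-3, the following Python sketch (our own illustration, not the authors' implementation) constructs the innermost intervals for one node and runs the self-consistent EM iteration. The names innermost_intervals and turnbull_node, the NumPy-array inputs, the representation of right-censored subjects by R_i = numpy.inf, and the tolerance-based convergence check are all our assumptions.

```python
import numpy as np

def innermost_intervals(L, R):
    """Construct the sample innermost intervals (u_j, v_j]: left end
    points taken from {L_i}, right end points from {R_i}, containing
    no other end points.  At tied values a right end point is placed
    before a left end point, which matches half-open (L, R] intervals."""
    pts = [(l, 0) for l in L] + [(r, 1) for r in R]
    pts.sort(key=lambda t: (t[0], -t[1]))        # R before L at ties
    return [(a, b) for (a, fa), (b, fb) in zip(pts, pts[1:])
            if fa == 0 and fb == 1]              # an L immediately followed by an R

def turnbull_node(L, R, tol=1e-8, max_iter=1000):
    """Self-consistent (EM) estimates of the interval probabilities
    p_j for one node and the estimated log-likelihood (2.6)."""
    L, R = np.asarray(L, float), np.asarray(R, float)
    ivals = innermost_intervals(L, R)
    u = np.array([a for a, _ in ivals])
    v = np.array([b for _, b in ivals])
    # alpha[i, j] = 1 if (u_j, v_j] is contained in (L_i, R_i]
    alpha = ((L[:, None] <= u[None, :]) & (v[None, :] <= R[:, None])).astype(float)
    p = np.full(len(ivals), 1.0 / len(ivals))    # Step 0: initial estimates
    for _ in range(max_iter):
        denom = alpha @ p                        # P(event in (L_i, R_i]) under p
        w = (alpha * p / denom[:, None]).sum(axis=0)   # Step 1, eq. (2.4)
        p_new = w / w.sum()                      # Step 2, eq. (2.5)
        if np.max(np.abs(p_new - p)) < tol:      # Step 3: iterate to convergence
            p = p_new
            break
        p = p_new
    return ivals, p, np.sum(np.log(alpha @ p))   # (intervals, p_hat, l_hat_t)
```

The third returned value is the estimated node log-likelihood l̂_t of equation (2.6), which is the quantity used for splitting and pruning below.
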

2.2 Splitting rules

The goodness-of-split criterion for the nonparametric interval-censored tree model is the sum of the estimated log-likelihoods of the left and right daughter nodes, t_L and t_R, of node t, that is, G(s, t) = l̂_{t_L} + l̂_{t_R}. In our extension, however, the estimated log-likelihood is calculated by equation (2.6). For each possible split s of node t, we apply Turnbull's (1976) method to the resulting left and right daughter nodes and calculate their estimated log-likelihoods. The best split, s*, is the one that maximizes G(s, t) over all possible splits of node t.
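
As an illustration, an exhaustive search over splits could look like the sketch below, which reuses turnbull_node from the earlier sketch; handling only ordered covariates and omitting the minimum-node-size constraint of Section 2.3 are our simplifications, not the authors' code.

```python
import numpy as np

def best_split(X, L, R):
    """Return (covariate index, cut point, G) maximizing
    G(s, t) = lhat_tL + lhat_tR over splits of the form X[:, k] <= c.
    Only ordered covariates are handled here, and the minimum-node-size
    constraint of Section 2.3 is not shown."""
    best = (None, None, -np.inf)
    for k in range(X.shape[1]):
        for c in np.unique(X[:, k])[:-1]:        # candidate cut points
            left = X[:, k] <= c                  # boolean daughter membership
            _, _, ll_left = turnbull_node(L[left], R[left])
            _, _, ll_right = turnbull_node(L[~left], R[~left])
            if ll_left + ll_right > best[2]:
                best = (k, c, ll_left + ll_right)
    return best
```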

2.3 Stopping rules

We repeat the above splitting process recursively until we reach a stopping criterion. In constructing the exponential tree model, we defined the minimum node size to be 25. For the nonparametric model, we set the minimum node size to 60 in order to obtain reasonable parameter estimates with Turnbull's method. This criterion could be adjusted according to the number of parameters and the number of subjects in the dataset. A node is also not split further if all the subjects in the node have the same covariates or the same censoring interval; in this case, we say that the node is already "pure".
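
One literal reading of these rules is sketched below; the function name is_terminal and the choice to apply the minimum-size check to the node itself (rather than to prospective daughter nodes) are our assumptions.

```python
import numpy as np

def is_terminal(X, L, R, min_node_size=60):
    """Stopping-rule sketch: declare a node terminal if it is smaller
    than the minimum node size or 'pure', i.e. all subjects share the
    same covariate vector or the same censoring interval."""
    if len(L) < min_node_size:
        return True
    same_covariates = len(np.unique(np.asarray(X), axis=0)) == 1
    same_interval = len(np.unique(np.column_stack([L, R]), axis=0)) == 1
    return same_covariates or same_interval
```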

2.4 Choosing the final tree model

By applying the above splitting and stopping rules recursively we obtain a very large tree, T_max. To choose the best tree model, we follow pruning and cross-validation procedures similar to those used for the exponential tree model. In the pruning procedure, the within-node risk R(h) for terminal node h is replaced by the negative estimated log-likelihood obtained with Turnbull's method. Starting from the largest tree, T_max, we use the weakest-link cutting process to prune the tree upward and generate a series of nested tree models. A cross-validation method is then used to choose the final model from this series. In a V-fold cross-validation procedure, we divide the whole sample into V parts and repeat the tree-growing and pruning processes for each learning sample L^(v), v = 1, 2, ..., V. For each learning sample L^(v), we obtain a series of nested tree models T_1^(v) ⊃ T_2^(v) ⊃ ... ⊃ {t_0^(v)}. We then run the data in the corresponding held-out sample L_v down each of these tree models, find the terminal node to which each subject belongs, and calculate the estimated log-likelihood for the data in L_v using the parameters estimated in those terminal nodes. For an exponential tree model this step is simple, since there is only one estimated characteristic parameter per terminal node, namely the estimated hazard rate λ̂_h. It is more complicated for the nonparametric tree model because each terminal node h has a vector of estimated parameters, P̂ = (p̂_1, p̂_2, ..., p̂_{m_h}). Nevertheless, the estimated log-likelihood can be calculated for each nested tree model by finding the terminal node to which each subject belongs and using the estimated probabilities of that terminal node. The other steps in cross-validation follow those used in the exponential tree model. However, after we choose the tree model T_{k0} that gives the maximum estimated log-likelihood, we do not use the chi-square test to compare it with simpler trees, as we did in the exponential model; because the number of parameters estimated in each terminal node is greater than one and can differ from node to node, that test is no longer appropriate. Instead, we use the one-standard-error rule proposed by Breiman et al. (1984): we choose as the final model the smallest tree T* with the property G^CV(T*) < G^CV(T_{k0}) + SE(G^CV(T_{k0})).
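
The two node-level ingredients of this step can be sketched as follows (our names heldout_loglik and choose_final_tree; G^CV is read as the cross-validated risk, i.e. the negative held-out log-likelihood, and the handling of a test subject whose interval covers none of the node's innermost intervals is left out).

```python
import numpy as np

def heldout_loglik(L, R, intervals, p_hat):
    """Held-out log-likelihood for the test subjects falling into one
    terminal node: each subject contributes log of the sum of p_hat[j]
    over the innermost intervals (u_j, v_j] contained in (L_i, R_i]."""
    u = np.array([a for a, _ in intervals])
    v = np.array([b for _, b in intervals])
    L, R = np.asarray(L, float), np.asarray(R, float)
    alpha = ((L[:, None] <= u[None, :]) & (v[None, :] <= R[:, None])).astype(float)
    return np.sum(np.log(alpha @ p_hat))

def choose_final_tree(candidates):
    """One-standard-error rule (Breiman et al., 1984): `candidates` is
    a list of (tree, n_terminal_nodes, gcv, se) tuples over the nested
    sequence, where gcv is the cross-validated risk and se its
    standard error."""
    best = min(candidates, key=lambda c: c[2])               # T_{k0}
    eligible = [c for c in candidates if c[2] < best[2] + best[3]]
    return min(eligible, key=lambda c: c[1])[0]              # smallest such tree
```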

3 Simulations

The performance of the nonparametric tree model is assessed by two sets of simulations: 1) simulations using exponential interval-censored survival data; and 2) simulations using Weibull interval-censored survival data.

3.1 Simulations using exponential data

In order to compare the performance of the exponential tree and nonparametric tree models, we used the same structure as in the simulations for the exponential tree model (Yin and Anderson (2001); Davis and Anderson (1989)). Survival times were generated as exponentially distributed with parameters dependent on the covariates. Eight covariates X1-X8 were generated. X2-X7 were binary variables with equal probability of being 0 and 1; X1 and X8 were uniform discrete variables with values 1, 2, 3, 4 and 5. The exponential parameters were assigned according to the realized values of X1-X4. For example, if X1 had a value of 1 or 2 and X2 had a value of 0, then the exponential parameter was assigned the value 0.35. The overall structure consisted of five terminal nodes, each with a characteristic failure rate. The hazard used to generate the survival time for each terminal node is shown in Table 1. The data depend only on variables X1-X4; X5-X8 are "noise" variables. The right-censoring times were generated from uniform distributions, U(2,4) and U(0.25,1.25), which give 15% and 57% dropout rates, respectively. We chose the constant censoring-interval length to be 0.5. The purpose of this simulation was to compare the performance of the two tree models (exponential and nonparametric), in particular the effect of sample size and right-censoring percentage. Accordingly, we randomly generated exponentially distributed survival data with sample sizes of 500, 1000 and 2000 for each right-censoring percentage and applied the two tree models to those data sets. Two hundred and fifty simulations were run for each case. The number of correct structures and the average number of terminal nodes were used to assess the performance of the models.
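
A Python sketch of this data-generating mechanism is given below; the way the observed interval (L, R] is formed from the true event time and the fixed interval length (periodic inspection on a grid of width 0.5) is our reading of the description, and the names hazard and simulate_exponential are ours, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def hazard(x1, x2, x3, x4):
    """Table 1 hazard structure; X5-X8 are noise and do not enter."""
    if x1 <= 2:
        return 0.35 if x2 == 0 else 0.6
    if x3 == 1:
        return 0.8
    return 1.75 if x4 == 0 else 1.00

def simulate_exponential(n, cens_low=2.0, cens_high=4.0, width=0.5):
    """One simulated data set for Section 3.1: exponential survival
    times with covariate-dependent hazards, uniform right-censoring
    (dropout) times and a fixed censoring-interval length."""
    x1 = rng.integers(1, 6, n)                    # X1: uniform on 1..5
    xb = rng.integers(0, 2, (n, 6))               # X2-X7: Bernoulli(0.5)
    x8 = rng.integers(1, 6, n)                    # X8: uniform on 1..5
    X = np.column_stack([x1, xb, x8])             # columns X1,...,X8
    lam = np.array([hazard(r[0], r[1], r[2], r[3]) for r in X])
    t = rng.exponential(1.0 / lam)                # true event times
    c = rng.uniform(cens_low, cens_high, n)       # dropout times
    delta = (t <= c).astype(int)                  # 1 = event seen in an interval
    L = np.where(delta == 1, np.floor(t / width) * width, c)
    R = np.where(delta == 1, L + width, np.inf)
    return X, L, R, delta
```

For the 57% dropout scenario the dropout times would instead be drawn from U(0.25, 1.25), e.g. simulate_exponential(n, 0.25, 1.25).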

The results are shown in Table 2. In this set of simulations, the performance of the nonparametric tree model improves as the sample size increases. The percentage of correct structures increases from 8.4% to 85.2% as the sample size increases from 500 to 2000 for a right-censoring percentage of 15%. When the data are heavily right-censored (57%), the performance also improves with sample size, from 0.8% to 38.4% as the sample size increases from 500 to 2000. As expected, for exponentially distributed data the exponential tree-structured model performs better, especially for small sample sizes. When the data are lightly right-censored (15%) and the sample size is large (2000), the exponential tree model gives the correct structure 95.2% of the time. For the large-sample-size simulations, the performance of the two models is similar for lightly right-censored data (15%). For example, in our simulations with sample size 2000, the performance of the nonparametric tree model (85.2% with a correct structure) is quite close to that of the exponential model (95.2% with a correct structure).

3.2 Simulations using Weibull data

The purpose of the second set of simulations is to show that, when the data are no longer exponentially distributed, the nonparametric tree model still performs well while the exponential tree model may no longer be a good choice. The survival times were randomly generated from Weibull distributions following a structure similar to that used in the simulation above. Accordingly, the eight covariates X1-X8 were generated in the same way as in the previous section, and the survival times were generated according to the corresponding covariates as shown in Table 1. As in the previous simulation, there were five terminal nodes. For a subject with X1 equal to 1 or 2 and X2 equal to 0, the survival time was generated from a Weibull distribution with shape parameter 0.4 and scale parameter 0.85967. The shape and scale parameters were chosen so that the expected survival time is approximately the same as in the previous simulation. Right-censoring (dropout) times were generated from a uniform U(2,4) distribution; about 11% of the data were right-censored.
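
Although the matching rule is not stated explicitly, the scale values in Table 1 are consistent with equating each node's Weibull mean, η Γ(1 + 1/k) (η the scale, k the shape), to the corresponding exponential mean 1/λ. For the first node, for example,

    \eta = \frac{1/\lambda}{\Gamma(1 + 1/k)} = \frac{1/0.35}{\Gamma(3.5)} \approx \frac{2.857}{3.323} \approx 0.860,

which agrees with the tabulated scale 0.85967 up to rounding.
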
We compared the performance of the models for censoring-interval lengths 0.5 and 1 and sample sizes 1000 and 2000. Both the nonparametric and the exponential tree-structured models were fit to the simulated Weibull data, and 250 simulations were performed for each situation. Table 3 summarizes the results. Again, we use the percentage of correct structures and the average number of terminal nodes to assess performance.

In this simulation, sample size is still a very important factor: for both interval lengths, 0.5 and 1.0, performance improves with increasing sample size. The nonparametric interval-censored tree model performs very well for the largest sample size (2000); the correct structure is obtained 85.2% and 76.0% of the time for interval lengths 0.5 and 1.0, respectively. In contrast, the performance of the exponential interval-censored tree model is poor: the percentage of correct structures is only 8% for sample size 2000 and interval length 0.5.

Of course, this performance may be expected because the model is misspecified, and it worsens as the censoring-interval length increases. For large sample sizes, the censoring-interval length does not affect the performance of the nonparametric tree model very much. Moreover, when the exponential tree model was applied to the Weibull-generated data, the average number of terminal nodes was around 4; most of the time, however, the model could still correctly identify 3 of the 4 splits.

4 Discussion

A nonparametric tree-structured model was proposed here to analyze interval-censored survival data. When applied to simulated data, this method identified the true structure very well when there was enough information (i.e., a large sample size and a low dropout rate). As demonstrated by our simulations, when we have information about the distribution of the survival time, a parametric method utilizing the correct failure distribution is clearly the best way to analyze the data. For example, in our simulations with exponentially distributed data, the exponential tree model performs better than the nonparametric tree model in all cases. When the sample size is large enough and the dropout rate is not very high, it almost always recovers the correct structure (95.2% correct structures for sample size 2000, dropout rate 15% and interval length 0.5). Even so, with large sample sizes and low censoring, the nonparametric procedure does not perform much worse than the parametric procedure (85.2% versus 95.2% correct structures, respectively). However, if the distribution of the survival time is not known, parametric models can give misleading results. In those situations a nonparametric model is preferable, since its performance is reasonably good regardless of the underlying survival distribution, as shown in Tables 2 and 3. In contrast, it is not appropriate to use an exponential tree model when the failure data are not exponentially distributed: as demonstrated in Table 3, the performance of the exponential tree model is very poor in all cases when (incorrectly) applied to interval-censored Weibull data.

BIBLIOGRAPHY

1. Ahn, H. and Loh, W.-Y. (1994). Tree-structured proportional hazards regression modeling. Biometrics 50, 471-485.

2. Bacchetti, P. and Segal, M. R. (1995). Survival trees with time-dependent covariates: application to estimating changes in the incubation period of AIDS. Lifetime Data Analysis 1, 35-47.

3. Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, Calif.

4. Ciampi, A., Lou, Z., Lin, Q. and Negassa, A. (1991). Recursive partition and amalgamation with the exponential family: theory and applications. Applied Stochastic Models and Data Analysis 7, 121-137.

5. Davis, R. B. and Anderson, J. R. (1989). Exponential survival trees. Statistics in Medicine 8, 947-961.

6. Gordon, L. and Olshen, R. A. (1985). Tree-structured survival analysis. Cancer Treatment Reports 69, 1065-1068.

7. Huang, X., Chen, S. and Soong, S. (1998). Piecewise exponential survival trees with time-dependent covariates. Biometrics 54, 1420-1433.

8. LeBlanc, M. and Crowley, J. (1992). Relative risk trees for censored survival data. Biometrics 48, 411-425.

9. LeBlanc, M. and Crowley, J. (1993). Survival trees by goodness of split. Journal of the American Statistical Association 88, 457-467.

10. Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data and a proposal. Journal of the American Statistical Association 58, 415-434.

11. Peto, R. (1973). Experimental survival curves for interval-censored data. Applied Statistics 22, 86-91.

12. Segal, M. R. (1988). Regression trees for censored data. Biometrics 44, 35-47.

13. Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B 38, 290-295.

14. Yin, Y. and Anderson, S. J. (2001). Exponential tree-structured modeling for interval-censored survival data. 2001 Proceedings of the American Statistical Association, Biometrics Section [CD-ROM]. Alexandria, VA: American Statistical Association.

15. Yu, Q., Li, L. and Wong, G. Y. C. (2000). On consistency of the self-consistent estimator of survival functions with interval-censored data. Scandinavian Journal of Statistics 27, 35-44.

Table 1: Simulation description for generating exponential and Weibull data

Covariates combination            Exponential hazard rate   Weibull scale parameter   Weibull shape parameter
{X1=1,2} & {X2=0}                 0.35                      0.85967                   0.4
{X1=1,2} & {X2=1}                 0.6                       1.471                     0.8
{X1=3,4,5} & {X3=1}               0.8                       1.25                      1
{X1=3,4,5} & {X3=0} & {X4=0}      1.75                      0.6443                    2
{X1=3,4,5} & {X3=0} & {X4=1}      1.00                      1.10773                   1.5

Table 2: Comparison of nonparametric and exponential tree-structured models using exponential data

Right censoring (%)   Sample size   Nonparametric: correct structures (%)   Nonparametric: avg # terminal nodes   Exponential: correct structures (%)   Exponential: avg # terminal nodes
15                    500           21 (8.4)                                3.272                                 54 (21.6)                             4.388
15                    1000          88 (35.2)                               5.024                                 184 (73.6)                            5.116
15                    2000          213 (85.2)                              5.184                                 238 (95.2)                            5.060
57                    500           2 (0.8)                                 3.088                                 8 (3.2)                               2.888
57                    1000          22 (8.8)                                3.692                                 53 (21.2)                             4.108
57                    2000          96 (38.4)                               4.527                                 152 (60.8)                            4.908

Table 3: Comparison of nonparametric and exponential tree-structured models using Weibull data

Sample size   Censoring interval length   Nonparametric: correct structures (%)   Nonparametric: avg # terminal nodes   Exponential: correct structures (%)   Exponential: avg # terminal nodes
1000          0.5                         112 (44.8)                              5.224                                 7 (2.8)                               4.208
1000          1                           71 (28.4)                               4.836                                 2 (0.8)                               4.204
2000          0.5                         213 (85.2)                              5.356                                 20 (8.0)                              4.204
2000          1                           190 (76.0)                              4.968                                 9 (3.6)                               4.124