
STATISTICS IN MEDICINE Statist. Med. 2002; 21:3789–3801 (DOI: 10.1002/sim.1421)

Global goodness-of-fit tests in logistic regression with sparse data‡
Oliver Kuss∗,†
Institute of Medical Epidemiology, Biostatistics, and Informatics, University of Halle-Wittenberg, 06097 Halle/Saale, Germany

SUMMARY

The logistic regression model has become the standard analysing tool for binary responses in medical statistics. Methods for assessing goodness-of-fit, however, are less developed, and this problem is especially pronounced when performing global goodness-of-fit tests with sparse data, that is, if the data contain only a small number of observations for each pattern of covariate values. In this situation it has been known for a long time that the standard goodness-of-fit tests (residual deviance and Pearson chi-square) behave unsatisfactorily if p-values are calculated from the $\chi^2$-distribution. As a remedy the Hosmer–Lemeshow test is frequently recommended; it relies on a new grouping of the observations to avoid sparseness, where this grouping depends on the estimated probabilities from the model. It has been shown, however, that the Hosmer–Lemeshow test also has some deficiencies; for example, it depends heavily on the calculating algorithm and thus different implementations might lead to different conclusions regarding the fit of the model. We present some alternative tests from the statistical literature which should also perform well with sparse data. Results from a simulation study are given which show that there exist some goodness-of-fit tests (for example, the Farrington test) that have good properties regarding size and power and that even outperform the Hosmer–Lemeshow test. We illustrate the various tests with an example from dermatology on occupational hand eczema in hairdressers. Copyright © 2002 John Wiley & Sons, Ltd.

KEY WORDS: logistic regression; goodness-of-fit; sparse data; Hosmer–Lemeshow test

1. INTRODUCTION

The logistic regression model has become the standard analysing tool for binary responses in medical statistics for many reasons: ease of interpretation of parameters; possibility of calculating prognoses for the event of interest; availability of standard software, to name just a few.

∗ Correspondence to: Oliver Kuss, Institute of Medical Epidemiology, Biostatistics, and Informatics, University of Halle-Wittenberg, 06097 Halle/Saale, Germany.
† E-mail: [email protected]
‡ Presented at the International Society for Clinical Biostatistics, Twenty-second International Meeting, Stockholm, Sweden, August 2001.


Received January 2002 Accepted October 2002


Methods for checking goodness-of-fit, however, are less developed, which may be due to the relative youth and greater mathematical complexity of the logistic regression model compared to, for example, the linear regression model. In general, there are two different approaches to assessing goodness-of-fit in logistic regression models. The first, known as residual analysis, investigates the model on the level of individuals and looks for those observations which are not adequately described by the model or which are highly influential on the model fit [1]. The second approach seeks to combine the information on the amount of lack-of-fit in a single number. Statistical tests, so-called goodness-of-fit tests, are then calculated to judge if this lack-of-fit is significant or due to random chance.

We can distinguish two types of goodness-of-fit tests: specific and global. Specific tests embed the logistic model in a wider class of models, say, with a more general link function, and check if the data at hand can be better described by the enhanced model. If not, we stay with our fitted model. In contrast, global tests do not evaluate specific alternatives but rather test unspecific hypotheses of the form ‘the model fits’ versus the alternative ‘the model does not fit’. It is an inherent feature of global tests that we do not get advice on how to improve the model in the case of a bad model fit. However, it is dangerous to expect this from specific tests either. In general, specific tests are derived under the assumption of a single isolated misspecification (for example, a misspecified link function) that is checked in the alternative hypothesis, but they are only valid when all other aspects of the model specification are correct [2]. Only in this special case will a rejection of the null hypothesis lead to an indication of how to improve the model.

A second disadvantage of specific goodness-of-fit tests is that they require, at least if we consider likelihood ratio or Wald tests, the estimation of the parameters from the enlarged model, which in most cases is infeasible with standard software [3].

Two classical global goodness-of-fit tests for the logistic regression model are the Pearson test and the residual deviance. It has been known for a long time [3–6], however, that these tests behave unsatisfactorily with sparse data, that is, if we have continuous covariates or a large number of covariates, so that, in extreme cases, every observation has its own pattern of covariate values. It appears to us that sparseness is more the rule than the exception in today’s data sets. In the following we present five alternative tests which do not rely on the assumption of large cell counts and show some results of a simulation study which compares these tests to the standard procedures. The methods are illustrated with a data set on occupational hand eczema in hairdressers.

2. THE EPIDEMIOLOGICAL EXAMPLE: RISK FACTORS FOR OCCUPATIONAL HAND ECZEMA IN HAIRDRESSERS

Occupational skin diseases rank at the top of all occupational diseases in Germany. More than 90 per cent of all occupational skin diseases manifest as hand eczema, and, with respect to occupational groups, the highest incidence rates are reported for hairdressers [7]. Despite this fact and the resulting social and economic consequences (for example, in Germany retraining costs amount to more than 50 000 per case [8]), little was known about possible endogenous and exogenous risk factors, at least up to the mid-1990s. However, this knowledge was necessary to introduce and test effective measures of prevention like pre-employment screening or improved protection against work-related hazardous substances. Supported by the ‘German Federation of Institutions for Statutory Accident Insurance and Prevention’, a prospective cohort study was conducted to assess endogenous and exogenous risk factors for developing hand eczema in hairdresser apprentices [9].

After one year of follow-up, data from 574 hairdressers were available and hand eczema was diagnosed in 340 hairdressers. Six known or suspected confounders were included in a logistic regression model to evaluate their influence on developing hand eczema: wet work (more/less than 4 hours/day); working with permanent wave (more/less than 1 hour/day); atopic disposition (square root of the continuous atopy score by Diepgen et al. [10]); diagnosed dyshidrosis (yes/no); centre (Erlangen/Dortmund); and change in skin protection behaviour (continuous score ranging from 0 to 5), because it was known from previous studies that this is an important confounder. The parameters were estimated by maximum likelihood and a highly significant influence was found for all covariates except for wet work, which was marginally significant.

3. GOODNESS-OF-FIT TESTS IN LOGISTIC REGRESSION WITH SPARSE DATA

Let $y_i \sim \mathrm{binomial}(m_i, \pi_i)$ and $\mathrm{logit}(\pi_i) = x_i^T \beta$, $i = 1, \ldots, N$, where $\beta = (\beta_0, \ldots, \beta_p)^T$ is a vector of regression parameters corresponding to a vector of $p + 1$ covariates $x_i = (1, x_{i1}, \ldots, x_{ip})^T$. In the following we assume that estimates $\hat{\beta}$ of the $\beta$ have been calculated by maximum likelihood. We get estimates $\hat{\pi}_i$ of the $\pi_i$ by plugging the $\hat{\beta}$ into the model equation. The standard tests for assessing goodness-of-fit are the Pearson statistic

\[ X^2 = \sum_{i=1}^{N} \frac{(y_i - m_i \hat{\pi}_i)^2}{m_i \hat{\pi}_i (1 - \hat{\pi}_i)} \tag{1} \]

and the residual deviance

\[ D = 2 \sum_{i=1}^{N} \left[ y_i \log\left(\frac{y_i}{m_i \hat{\pi}_i}\right) + (m_i - y_i) \log\left(\frac{m_i - y_i}{m_i (1 - \hat{\pi}_i)}\right) \right] \tag{2} \]
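Both statistics can be computed directly from the grouped data. The following is a minimal Python sketch, not the author's SAS/IML program; the function name `pearson_and_deviance` and the example values are hypothetical, and the fitted probabilities are assumed to come from a logistic model estimated elsewhere.

```python
import math


def pearson_and_deviance(y, m, pi_hat):
    """Return (X2, D) of equations (1) and (2) for grouped binomial data."""
    x2 = 0.0
    dev = 0.0
    for yi, mi, pi in zip(y, m, pi_hat):
        expected = mi * pi
        x2 += (yi - expected) ** 2 / (expected * (1.0 - pi))
        # Convention 0 * log(0) = 0 for patterns with y_i = 0 or y_i = m_i.
        if yi > 0:
            dev += yi * math.log(yi / expected)
        if yi < mi:
            dev += (mi - yi) * math.log((mi - yi) / (mi * (1.0 - pi)))
    return x2, 2.0 * dev


# Hypothetical example: three covariate patterns with made-up fitted probabilities.
x2, dev = pearson_and_deviance(y=[2, 5, 9], m=[10, 10, 10], pi_hat=[0.2, 0.5, 0.8])
```

Only the third pattern contributes here, because the first two have observed counts equal to their expected counts.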

Note that $D$ should not be confused with the deviance statistic. The deviance measures the difference between the null model (containing only an intercept) and the actual model, whereas the residual deviance measures the difference between the actual model and the saturated model. The Pearson statistic and the residual deviance rely on the principle of comparing observed, $y_i$, to predicted, $m_i \hat{\pi}_i$, values and should be large if the model does not fit the data well. To judge statistical significance they are usually compared to a $\chi^2$-distribution with $N - p - 1$ degrees of freedom. The validity of this distribution, however, relies on the assumption of large $m_i$, and both tests show unsatisfactory behaviour with sparse data. It can be shown [5] that $D$ degenerates to

\[ D = -2 \sum_{i=1}^{N} \left[ \hat{\pi}_i \log\left(\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) + \log(1 - \hat{\pi}_i) \right] \tag{3} \]

in the extreme case when every individual observation has its own covariate pattern ($m_i \equiv 1$). Then $D$ is a function of the fitted probabilities alone, no longer involves the observations $y_i$ directly, and contains absolutely no information about the model fit. The Pearson statistic does not perform much better in this situation, for it can be shown [5] that $X^2 \approx N$, and the sample size is not a sensible measure of fit either!

However, alternative tests have been proposed and we present five of them. Osius and Rojek [11] derived asymptotic moments for a general class of goodness-of-fit statistics in logistic regression under sparseness assumptions ($N, M \to \infty$, where $M = \sum_{i=1}^{N} m_i$). This class, the so-called ‘power-divergence family’ of Cressie and Read [12], incorporates $X^2$ as well as $D$; however, moments in closed form can only be calculated for $X^2$. A statistical test can be performed by standardizing $X^2$ with these moments and comparing the resulting test statistic ($X_O^2$) to the standard normal distribution.

McCullagh [13, 14] followed the same direction and relaxed the assumption of large $m_i$, but he argued for using conditional asymptotic moments for $X^2$ given the parameter estimates $\hat{\beta}$. This conditioning on a sufficient statistic of the parameter estimates removes the dependency of $X^2$ on the $\hat{\beta}$ and thus accounts for the fact that the parameters from the logistic regression model have been estimated and were not fixed in advance. Although similar results could be derived for $D$, these conditional moments are conveniently computable only for $X^2$. To assess statistical significance, McCullagh also proposed comparing his standardized statistic ($X_{McC}^2$) to a standard normal distribution.

Farrington [15] also relies on the conditioning principle and investigated a family of generalized Pearson statistics which extend $X^2$ by an additive constant. He showed that the member

\[ X_F^2 = \sum_{i=1}^{N} \frac{(y_i - m_i \hat{\pi}_i)^2}{m_i \hat{\pi}_i (1 - \hat{\pi}_i)} + \sum_{i=1}^{N} \frac{-(1 - 2\hat{\pi}_i)}{m_i \hat{\pi}_i (1 - \hat{\pi}_i)} (y_i - m_i \hat{\pi}_i) \tag{4} \]

has minimal variance in this family and has the property of local orthogonality to the $\hat{\beta}$. That is, the Farrington statistic removes the dependence of the distribution of the test statistic on the bias of the parameter estimates and thus can be considered an improvement of the McCullagh method. Approximate moments for $X_F^2$ can be calculated in closed form and the standardized statistic can be compared to the standard normal distribution. Unfortunately the Farrington test has the structural deficiency that in the case of extreme sparseness ($m_i \equiv 1$) $X_F^2 = N$ and the test will never reject the null hypothesis of a good fit.

The Hosmer–Lemeshow test [16] relies on a new grouping of the individual observations into preferably ten groups of equal size to avoid small cell counts, where the grouping depends on the percentiles of the estimated probabilities ($\hat{\pi}_i$) from the model. Observed and expected numbers of events are determined for each of the new groups, and their discrepancies are summed. Lack-of-fit is judged by comparing this sum, which is, after standardization, a Pearson statistic from a $2 \times g$ table with $g$ being the number of new groups, to a $\chi^2_{g-2}$-distribution. This distribution was assessed through simulation and the loss of a degree of freedom can be interpreted as accounting for the fact that the new grouping depends on estimated parameters and was not fixed in advance. The Hosmer–Lemeshow test has become the standard test for assessing goodness-of-fit in logistic regression and is implemented in all major statistical packages. It has some deficiencies, however. It has been shown that the value of the test statistic may depend on the number of new groups and on the calculating algorithm. Hosmer et al. [17] reported results from fitting the same data set in six major statistical software packages, obtaining identical values for the estimated parameters but six different values for the p-value of the Hosmer–Lemeshow test, ranging from 0.02 to 0.16!
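The grouping step of the Hosmer–Lemeshow test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any package's implementation: the function name `hosmer_lemeshow` and the slicing rule used to form near-equal groups are choices made here, and other rules for ties and remainders are equally defensible, which is one way implementations can come to disagree. The statistic is the usual sum over groups of (observed − expected)² divided by n_k · mean(π̂) · (1 − mean(π̂)); the p-value, which would come from a chi-square distribution with g − 2 degrees of freedom, is not computed.

```python
def hosmer_lemeshow(y, pi_hat, g=10):
    """Hosmer-Lemeshow statistic C-hat for individual 0/1 outcomes y (needs len(y) >= g)."""
    order = sorted(range(len(y)), key=lambda i: pi_hat[i])
    n = len(order)
    c_hat = 0.0
    for k in range(g):
        # Near-equal group sizes; how ties and remainders are split is a free choice here.
        idx = order[k * n // g:(k + 1) * n // g]
        n_k = len(idx)
        observed = sum(y[i] for i in idx)
        expected = sum(pi_hat[i] for i in idx)
        mean_pi = expected / n_k
        c_hat += (observed - expected) ** 2 / (n_k * mean_pi * (1.0 - mean_pi))
    return c_hat


# Hypothetical toy data with two groups, small enough to check by hand.
c_hat = hosmer_lemeshow([0, 0, 1, 1], [0.2, 0.3, 0.7, 0.8], g=2)
```

In the toy call each of the two groups contributes 0.25 / 0.375, so the statistic is 4/3.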
Pigeon and Heyse [18] gave an even more impressive example and found p-values ranging from 0.02 to 0.45 for a single data set. Recent work of Bertolini et al. [19] showed that results from the Hosmer–Lemeshow test might depend on the simple ordering of observations. Pigeon and Heyse [18] point to the fact that the grouping strategy of the Hosmer–Lemeshow test collects all individual observations with low and high probabilities in single groups, respectively. Thus it is possible that the first groups have low expected numbers of events and the last groups have low expected numbers of non-events, both facts questioning the validity of the $\chi^2$-distribution. A final disadvantage of the Hosmer–Lemeshow test is that observations belonging to the same new group might differ considerably in their covariate values, and so we get no information about where in the space of covariates there is lack-of-fit.

Recently, Pulkstenis and Robinson [20] addressed this problem and proposed a two-stage modification of the Hosmer–Lemeshow test where the individual observations are grouped at the first stage according to a cross-classification of all categorical covariates in the model, and at the second stage they are split according to the median of the estimated probabilities $\hat{\pi}_i$ within the newly defined groups. Analogous to the Hosmer–Lemeshow test, an ordinary Pearson test or the deviance is then calculated to compare expected and observed counts in the resulting contingency table. The authors showed by simulation that their test is superior to the standard Hosmer–Lemeshow test. Nevertheless, the requirement of both categorical and continuous covariates in the model, owing to the construction principle, should be considered a weakness of this new procedure.

The information matrix test [21] relies on the idea of comparing two different estimators of the information matrix which should give comparable results under a satisfactory model fit. Hosmer and Lemeshow [4] termed the IM-test ‘elegant, but difficult to compute in practice’, but Orme [22] showed how to calculate this test for logistic regression models. Evaluating the difference of the diagonal elements of the two estimators results in the $((p + 1) \times 1)$-vector

\[ \hat{d} = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{\pi}_i)(1 - 2\hat{\pi}_i) z_i \tag{5} \]

with $z_i = (1, x_{i1}^2, \ldots, x_{ip}^2)^T$, where the components of $\hat{d}$ sum to 0 in the case of a good model fit. After standardization with an appropriate variance, the test statistic ($IM_{DIAG}$) can be compared to a $\chi^2_{p+1}$-distribution. Note that the IM-test is calculated for the individual and not for the grouped observations, so we do not expect problems with sparse data. Chesher [23] noted that the IM-test is also a score test for random variation in the parameters across the observations in logistic regression, that is, a score test with null hypothesis $\beta_{0i}, \ldots, \beta_{pi} \equiv \beta_0, \ldots, \beta_p \; \forall i$ leads to $\hat{d}$ as a test statistic. This questions the current distinction between specific and global goodness-of-fit tests if a specific and a global goodness-of-fit test use an identical test statistic.

The RSS (residual sum of squares) test by Copas [24] considers only the numerator of the Pearson statistic, where the summation is again over the individual observations:

\[ \mathrm{RSS} = \sum_{i=1}^{M} (y_i - \hat{\pi}_i)^2 \tag{6} \]

Hosmer et al. [17] show how to calculate asymptotical moments of RSS and to perform a statistical test.
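The two quantities built from individual observations, the vector of equation (5) and the RSS of equation (6), are simple to compute once fitted probabilities are available. The Python sketch below is illustrative only: the function names `im_diag_vector` and `rss` and the toy data are hypothetical, and the standardization by the variances and moments referred to in the text is deliberately left out, so no test statistic or p-value is produced.

```python
def im_diag_vector(y, pi_hat, x):
    """d-hat of equation (5); x[i] holds the p covariate values of observation i."""
    big_m = len(y)
    p = len(x[0])
    d_hat = [0.0] * (p + 1)
    for yi, pi, xi in zip(y, pi_hat, x):
        weight = (yi - pi) * (1.0 - 2.0 * pi)
        z = [1.0] + [v * v for v in xi]  # z_i = (1, x_i1^2, ..., x_ip^2)
        for j in range(p + 1):
            d_hat[j] += weight * z[j] / big_m
    return d_hat


def rss(y, pi_hat):
    """Residual sum of squares of equation (6)."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pi_hat))


# Hypothetical data: four individuals, one covariate, made-up fitted probabilities.
y = [0, 1, 1, 0]
pi_hat = [0.25, 0.75, 0.5, 0.5]
x = [[-1.0], [1.0], [2.0], [-2.0]]
d_hat = im_diag_vector(y, pi_hat, x)
rss_value = rss(y, pi_hat)
```

Observations with a fitted probability of exactly 0.5 drop out of the vector of equation (5) because the factor 1 − 2π̂ vanishes, while they still contribute to the RSS.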

Note how the described tests differ in their constructing principles. Osius and Rojek as well as McCullagh retain the classic goodness-of-fit statistics but derive different limiting distributions, the Hosmer–Lemeshow test introduces a new grouping of observations to avoid sparseness, whereas the other tests have completely new statistics. We would also like to emphasize again that the tests of McCullagh and Farrington are derived conditional on sufficient statistics of the parameter estimates, that is, they explicitly account for the additional error resulting from estimating the parameters of the logistic model. However, Osius and Rojek [11] showed that the conditional moments of the Pearson statistic, as given by McCullagh, and the unconditional moments, as given by themselves, coincide, at least to first order, and thus we expect similar results from assessing the model fit by $X_O^2$ and $X_{McC}^2$.
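The structural deficiency of the Farrington statistic under extreme sparseness, noted earlier in this section, can be checked numerically from equation (4). In the Python sketch below the function name `farrington_statistic` and the input values are hypothetical; the point is only that with every $m_i = 1$ each summand equals 1, whatever the outcomes and fitted probabilities are.

```python
def farrington_statistic(y, m, pi_hat):
    """X_F^2 of equation (4): Pearson statistic plus Farrington's correction term."""
    total = 0.0
    for yi, mi, pi in zip(y, m, pi_hat):
        resid = yi - mi * pi
        var = mi * pi * (1.0 - pi)
        total += resid ** 2 / var - (1.0 - 2.0 * pi) * resid / var
    return total


# Extreme sparseness: every covariate pattern holds one observation (m_i = 1).
y = [0, 1, 1, 0, 1]
pi_hat = [0.1, 0.9, 0.3, 0.6, 0.5]  # arbitrary illustrative values
x_f2 = farrington_statistic(y, [1] * 5, pi_hat)
```

For a binary outcome the summand factorizes as (y − π̂)(y + π̂ − 1) / (π̂(1 − π̂)), which is 1 for both y = 0 and y = 1, so the statistic equals the number of observations.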

4. COMPARISON OF THE VARIOUS GOODNESS-OF-FIT TESTS

To our knowledge, up to now there has only been one large systematic investigation of global goodness-of-fit tests in logistic regression, namely that of Hosmer et al. [17]. Our work can be seen as a kind of supplement to this study, where we consider in more depth the behaviour of the tests under varying $m_i$ and add some more tests ($X_O^2$, $X_F^2$ and $IM_{DIAG}$) which have not been investigated by Hosmer and co-workers. By varying the $m_i$ we might be able to detect a lower bound for the $m_i$ above which the standard tests have good properties and could be used without danger. Note that this is just a natural extension of the common problem of the validity of the asymptotic chi-square test in a $2 \times 2$ table with small cell counts. This is because the data in a $2 \times 2$ table can be interpreted as a logistic regression model with a single binary covariate and it can be shown that the chi-square statistic for independence equals $X^2$ in this case [3].

In the following we give some results of our simulation study.† We report the empirical levels of the tests under various situations concerning null and alternative hypothesis and under differing constellations of the $m_i$. Six different constellations of the $m_i$ were chosen to adequately assess the effect of varying sparseness. We used four constellations with constant $m_i$ (1, 2, 5, 10), one constellation (1-2) with half of the covariate patterns consisting of a single individual observation and the other half of two individual observations, and one constellation (1-10) where 64 per cent of the covariate patterns contained a single individual observation, 21 per cent two observations, 9 per cent five observations and 6 per cent ten observations. This last constellation was chosen because it reflects a distribution across covariate patterns which is frequently seen in daily practice. The number of individual observations ($M$) was chosen to be 100 or 500. This coincides with the numbers that were used by Hosmer et al. in their simulations and also with our experience of sample sizes in medical data sets. In the simulation, 10 000 data sets were generated and assessed for $M = 100$ and 1000 for $M = 500$. The simulation program was written in SAS/IML.

† In our simulation we actually investigated 28 different global goodness-of-fit tests. The five tests described here were chosen (besides the reference tests $X^2$ and $D$) because they performed best in terms of power and size throughout the whole study. $X_{McC}^2$ and RSS were also found to be the most valid tests in the study of Hosmer et al. [17]. A complete list of included tests and results is available from the author on request.
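When reading simulated levels it helps to know how much an estimated rejection rate can vary by chance alone. The Python sketch below is an addition for orientation, not part of the reported study: it assumes independent replications, so that the number of rejections is binomial, and the function name `monte_carlo_se` is hypothetical.

```python
import math


def monte_carlo_se(rate, replications):
    """Binomial standard error of an estimated rejection rate, in percentage points."""
    return 100.0 * math.sqrt(rate * (1.0 - rate) / replications)


se_m100 = monte_carlo_se(0.05, 10_000)  # 10 000 data sets for M = 100
se_m500 = monte_carlo_se(0.05, 1_000)   # 1000 data sets for M = 500
```

Under this assumption a test with a true level of 5 per cent has a simulation standard error of about 0.22 percentage points with 10 000 data sets and about 0.69 percentage points with 1000 data sets.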




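Table I. Empirical level (in per cent) under the null hypothesis of a correctly specified model for $X^2$ and $D$ under various $m_i$, various model specifications, $M = 100$ and $M = 500$, and $\alpha = 0.05$. The simulated data sets were generated under the model described in the ‘Model:’ statement. Columns give the constellation of $m_i$.

Model: $\mathrm{logit}(\pi_i) = 0$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X^2$ | 0.00 | 0.00 | 1.28 | 0.21 | 4.11 | 5.18 |
| 100 | $D$ | 99.97 | 86.48 | 58.58 | 46.09 | 14.59 | 7.67 |
| 500 | $X^2$ | 0.00 | 0.00 | 1.15 | 0.10 | 3.53 | 3.93 |
| 500 | $D$ | 100.00 | 100.00 | 99.69 | 99.90 | 30.64 | 9.41 |

Model: $\mathrm{logit}(\pi_i) = 0.693 x_{1i}$, $x_1 \sim U(-6, 6)$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X^2$ | 11.43 | 14.83 | 8.00 | 19.43 | 5.77 | 4.60 |
| 100 | $D$ | 0.00 | 0.00 | 0.40 | 0.00 | 2.53 | 4.14 |
| 500 | $X^2$ | 11.08 | 16.30 | 8.17 | 24.70 | 5.97 | 5.37 |
| 500 | $D$ | 0.00 | 0.00 | 0.00 | 0.00 | 1.66 | 5.24 |

Model: $\mathrm{logit}(\pi_i) = 0.693 x_{1i}$, $x_1 \sim N(0, 1)$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X^2$ | 0.06 | 0.20 | 1.14 | 1.16 | 4.08 | 4.94 |
| 100 | $D$ | 76.68 | 43.56 | 43.48 | 13.15 | 13.76 | 8.33 |
| 500 | $X^2$ | 0.00 | 0.00 | 1.02 | 0.20 | 3.43 | 4.57 |
| 500 | $D$ | 100.00 | 99.00 | 97.76 | 58.50 | 33.91 | 11.42 |

Model: $\mathrm{logit}(\pi_i) = 0.223 x_{1i} + 0.405 x_{2i} + 0.693 x_{3i}$, $x_j$ iid $N(0, 1)$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X^2$ | 0.47 | 1.12 | 2.20 | 2.56 | 4.71 | 5.21 |
| 100 | $D$ | 56.76 | 43.86 | 38.49 | 21.18 | 14.05 | 8.18 |
| 500 | $X^2$ | 0.00 | 0.10 | 1.41 | 0.90 | 4.01 | 4.31 |
| 500 | $D$ | 100.00 | 99.70 | 95.89 | 86.60 | 32.86 | 11.79 |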

Under the null hypothesis (Table I) we only report empirical levels of the standard tests $X^2$ and $D$, because the other tests kept to the prespecified level in most of the cases. We find a dramatic anticonservatism of $D$ in three of four simulated models, where the anticonservatism gets worse as the $m_i$ get smaller and where this behaviour does not depend much on the sample size. Surprisingly, we find a very conservative behaviour of $D$ in the second model. Somewhat satisfactory results regarding empirical size can be observed only for $m_i = 10$. The Pearson test $X^2$ also shows some undesirable behaviour, but to a much lesser extent than the residual deviance. In most cases we find the Pearson test too conservative for $m_i < 5$, but it keeps to the prespecified level with $m_i > 5$. These simulation results show (and thus confirm existing knowledge) that $X^2$ and $D$ are not valid goodness-of-fit tests in logistic regression with sparse data, with the residual deviance showing by far the more erratic behaviour.

In Table II we report some results for the empirical power of the various goodness-of-fit tests from our simulation study under four different alternative hypotheses (missing covariate, wrong functional form of the covariate, overdispersion, and misspecified link function). This time we consider only the introduced alternative tests because they kept to the prespecified level under the null hypothesis, as opposed to the standard tests $X^2$ and $D$.

We shall not comment on every single figure in the table but several global points can be made. $X_O^2$, $X_{McC}^2$ and $X_F^2$ behave very similarly; $X_{McC}^2$ performs better than $X_O^2$ but both are outperformed by the Farrington test $X_F^2$. Unfortunately, this test has the structural deficiency of never rejecting the null hypothesis with extreme sparseness ($m_i \equiv 1$), but it recovers from this in the constellation 1-2. $IM_{DIAG}$ and RSS also behave similarly, where $IM_{DIAG}$ is superior in the models with a missing covariate, with overdispersion and with a misspecified link function, and RSS performs better with a wrong functional form of the covariate. The described tests can be seen to divide into two groups regarding their behaviour under the alternative hypothesis: $X_O^2$, $X_{McC}^2$ and $X_F^2$ behave better with a missing covariate and overdispersion, whereas $IM_{DIAG}$ and RSS have better power with a wrong functional form of the covariate and a misspecified link function. The Hosmer–Lemeshow test $\hat{C}$ lies somewhere in between, and it should be noted that for every misspecified model there is a goodness-of-fit test which has superior power to this test, which is still used as the standard procedure for assessing global goodness-of-fit in applied medical statistics. In general, all tests gain empirical power with increasing $m_i$ and increasing sample size $M$. Nevertheless, we may still be disappointed with the low power for detecting lack-of-fit, especially with small $m_i$.

5. EVALUATING GOODNESS-OF-FIT FOR THE HAIRDRESSERS’ DATA

Returning to the hairdressers’ data we find, depending on the six covariates in the model, that the 574 observations can be divided into 334 different covariate groups with identical values of the covariates within each group. The distribution of the $m_i$ is the following:

| $m_i$ | Absolute frequency | Relative frequency (per cent) |
|---|---|---|
| 1 | 205 | 61.4 |
| 2 | 68 | 20.4 |
| 3 | 35 | 10.5 |
| >3 | 26 | 7.8 |
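A tabulation of this kind can be produced by counting how often each distinct covariate vector occurs. The Python sketch below is a hypothetical illustration, not the procedure used for the hairdressers' data: the function name `pattern_size_distribution` and the toy rows are made up, and continuous covariates are assumed to be compared for exact equality.

```python
from collections import Counter


def pattern_size_distribution(rows):
    """Map each covariate-pattern size m_i to the number of patterns of that size."""
    pattern_sizes = Counter(tuple(row) for row in rows)  # m_i for every distinct pattern
    return Counter(pattern_sizes.values())


# Hypothetical covariate rows (two binary covariates); not the hairdressers' data.
rows = [(0, 1), (0, 1), (1, 0), (1, 1), (0, 1), (1, 0)]
distribution = pattern_size_distribution(rows)
```

In the toy data there is one pattern each of size 1, 2 and 3, and the sizes weighted by their counts add up to the number of observations.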

We notice a certain degree of sparseness and decide not to rely on the standard tests $X^2$ and $D$ in assessing goodness-of-fit. If we calculate the introduced tests (see Table III) we find that the goodness-of-fit of the model does not seem to be very high; even the standard Pearson test $X^2$, which in the simulation study showed a conservative behaviour with sparse data, indicates some lack-of-fit. A second look at the results reveals that most of the tests which are based on a summation of residuals ($X^2$, $X_O^2$, $X_{McC}^2$ and RSS) indicate lack-of-fit, as opposed to the tests that rely on different computing principles ($\hat{C}$ and $IM_{DIAG}$). This arouses the suspicion that some outlying observations are responsible for the bad fit of the model, and indeed a residual analysis identifies two observations which had an estimated probability of 0.96 for developing a hand eczema, but in both cases no hand eczema was observed.

A reanalysis of the data after removal of these two outliers gave the results which are shown in the right column ($p^*$) of Table III. There is no longer any indication of a bad model fit. The sharpest increase in the p-value is observed for the RSS test, which seems plausible given that the test statistic sums up only the numerators of the Pearson residuals. In contrast, the p-value of the Farrington test remains essentially unchanged and we might speculate that this test is less sensitive to outliers; however, this behaviour cannot be explained easily from the definition of the test statistic. Note the strange behaviour of the Hosmer–Lemeshow test $\hat{C}$, which adds another point to the list of its peculiarities: the removal of two obviously outlying observations leads to a worse model fit, at least according to the Hosmer–Lemeshow test.

Table II. Power (in per cent) to detect a misspecified model for $X_O^2$, $X_{McC}^2$, $\hat{C}$, $X_F^2$, $IM_{DIAG}$, and RSS under various $m_i$, various misspecifications (missing covariate, wrong functional form of the covariate, overdispersion, misspecified link function), $M = 100$ and $M = 500$, and $\alpha = 0.05$. The simulated data sets were generated under the misspecification described in the ‘Model:’ statement; the actually fitted model in all cases is a standard logistic model with an intercept and one continuous covariate. Columns give the constellation of $m_i$.

Missing covariate. Model: $\mathrm{logit}(\pi_i) = 0.405 x_{1i} + 0.223 x_{2i}$, $x_j$ iid $U(-6, 6)$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X_O^2$ | 5.04 | 9.68 | 13.10 | 17.28 | 26.17 | 34.93 |
| 100 | $X_{McC}^2$ | 5.81 | 11.47 | 16.80 | 21.23 | 36.05 | 49.10 |
| 100 | $\hat{C}$ | 4.29 | 6.05 | 7.48 | 10.51 | 18.89 | 42.99 |
| 100 | $X_F^2$ | 0.00 | 15.32 | 18.20 | 32.00 | 36.59 | 48.96 |
| 100 | $IM_{DIAG}$ | 4.75 | 5.70 | 6.45 | 11.60 | 11.21 | 20.53 |
| 100 | RSS | 5.34 | 5.72 | 6.08 | 10.08 | 9.48 | 15.12 |
| 500 | $X_O^2$ | 3.80 | 22.00 | 37.60 | 46.60 | 80.50 | 94.60 |
| 500 | $X_{McC}^2$ | 4.20 | 22.70 | 40.20 | 50.70 | 83.80 | 95.90 |
| 500 | $\hat{C}$ | 5.60 | 6.70 | 8.00 | 20.70 | 18.90 | 38.70 |
| 500 | $X_F^2$ | 0.00 | 35.00 | 41.70 | 82.10 | 85.00 | 95.90 |
| 500 | $IM_{DIAG}$ | 5.80 | 6.30 | 6.60 | 11.90 | 9.70 | 17.30 |
| 500 | RSS | 4.80 | 6.40 | 5.80 | 9.50 | 8.00 | 13.50 |

Wrong functional form of the covariate. Model: $\mathrm{logit}(\pi_i) = 0.405 x_{1i}^2$, $x_1 \sim U(-6, 6)$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X_O^2$ | 0.06 | 0.05 | 0.16 | 0.13 | 1.22 | 3.02 |
| 100 | $X_{McC}^2$ | 0.08 | 0.08 | 0.20 | 0.17 | 2.50 | 6.45 |
| 100 | $\hat{C}$ | 5.06 | 5.02 | 5.25 | 2.44 | 4.87 | 5.24 |
| 100 | $X_F^2$ | 0.00 | 10.29 | 6.67 | 10.29 | 9.86 | 13.42 |
| 100 | $IM_{DIAG}$ | 12.86 | 13.03 | 13.07 | 13.14 | 12.74 | 12.64 |
| 100 | RSS | 25.16 | 25.55 | 24.51 | 25.24 | 23.05 | 20.23 |
| 500 | $X_O^2$ | 0.00 | 0.00 | 0.00 | 0.00 | 0.70 | 4.40 |
| 500 | $X_{McC}^2$ | 0.00 | 0.00 | 0.00 | 0.00 | 0.80 | 6.50 |
| 500 | $\hat{C}$ | 59.90 | 57.50 | 57.50 | 60.50 | 55.40 | 54.90 |
| 500 | $X_F^2$ | 0.00 | 13.10 | 9.80 | 21.40 | 18.80 | 33.50 |
| 500 | $IM_{DIAG}$ | 94.40 | 92.70 | 92.40 | 92.90 | 92.20 | 92.50 |
| 500 | RSS | 97.90 | 97.30 | 97.50 | 96.90 | 96.30 | 96.50 |

Overdispersion. Model: $\mathrm{logit}(\pi_i) = \beta_0 + 0.405 x_{1i}$, $x_1 \sim U(-6, 6)$ with $E(\beta_0) = 0$ and $\mathrm{Var}(\beta_0) = 0.323$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X_O^2$ | 5.66 | 6.79 | 8.67 | 10.93 | 13.46 | 17.76 |
| 100 | $X_{McC}^2$ | 6.26 | 8.22 | 11.09 | 13.87 | 20.42 | 29.74 |
| 100 | $\hat{C}$ | 4.59 | 5.47 | 5.86 | 5.87 | 11.04 | 24.15 |
| 100 | $X_F^2$ | 0.00 | 10.20 | 12.08 | 18.71 | 20.71 | 29.58 |
| 100 | $IM_{DIAG}$ | 4.56 | 5.29 | 5.04 | 8.23 | 7.99 | 13.20 |
| 100 | RSS | 5.06 | 6.00 | 5.52 | 8.27 | 7.39 | 10.88 |
| 500 | $X_O^2$ | 4.50 | 11.50 | 21.10 | 20.10 | 46.90 | 64.50 |
| 500 | $X_{McC}^2$ | 4.70 | 11.90 | 23.00 | 23.00 | 52.40 | 69.40 |
| 500 | $\hat{C}$ | 4.60 | 6.00 | 5.20 | 12.10 | 11.10 | 23.10 |
| 500 | $X_F^2$ | 0.00 | 21.30 | 23.20 | 46.40 | 52.10 | 69.90 |
| 500 | $IM_{DIAG}$ | 4.30 | 6.80 | 4.00 | 8.60 | 7.90 | 12.30 |
| 500 | RSS | 4.50 | 7.90 | 5.30 | 6.10 | 5.70 | 10.70 |

Misspecified link function. Model: $\log[-\log(1 - \pi_i)] = 0.405 x_{1i}$, $x_1 \sim U(-6, 6)$

| M | Test | 1 | 1-2 | 2 | 1-10 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| 100 | $X_O^2$ | 0.23 | 0.24 | 0.91 | 0.33 | 1.55 | 2.11 |
| 100 | $X_{McC}^2$ | 0.26 | 0.32 | 1.22 | 0.44 | 2.88 | 4.98 |
| 100 | $\hat{C}$ | 3.76 | 4.31 | 3.79 | 2.01 | 4.18 | 4.14 |
| 100 | $X_F^2$ | 0.00 | 5.85 | 6.16 | 5.64 | 6.93 | 8.32 |
| 100 | $IM_{DIAG}$ | 8.70 | 9.32 | 9.29 | 8.89 | 8.94 | 8.05 |
| 100 | RSS | 3.68 | 3.69 | 3.86 | 3.71 | 4.38 | 4.83 |
| 500 | $X_O^2$ | 0.00 | 0.00 | 0.10 | 0.00 | 1.30 | 2.50 |
| 500 | $X_{McC}^2$ | 0.00 | 0.00 | 0.10 | 0.00 | 1.80 | 3.70 |
| 500 | $\hat{C}$ | 20.00 | 20.60 | 19.70 | 20.40 | 20.10 | 19.50 |
| 500 | $X_F^2$ | 0.00 | 6.30 | 6.70 | 5.90 | 10.60 | 12.60 |
| 500 | $IM_{DIAG}$ | 54.10 | 53.00 | 54.50 | 52.70 | 55.00 | 51.70 |
| 500 | RSS | 27.50 | 26.40 | 27.70 | 28.90 | 28.10 | 26.70 |

Table III. Results from assessing goodness-of-fit for the various tests for the hairdressers’ data set. The middle column shows the p-values for the original data, the right column ($p^*$) shows the p-values for the data set with two outlying observations removed.

| Goodness-of-fit test | p-value | $p^*$-value |
|---|---|---|
| $X^2$ | 0.053 | 0.391 |
| $D$ | 0.012 | 0.033 |
| $X_O^2$ | 0.044 | 0.511 |
| $X_{McC}^2$ | 0.031 | 0.458 |
| $\hat{C}$ | 0.451 | 0.299 |
| $X_F^2$ | 0.408 | 0.427 |
| $IM_{DIAG}$ | 0.365 | 0.873 |
| RSS | 0.062 | 0.734 |

6. DISCUSSION

Our work shows, and thus confirms existing knowledge, that the standard goodness-of-fit tests for logistic regression (residual deviance and Pearson chi-square) are not valid in the case of sparse data if p-values are calculated from the $\chi^2$-distribution, with the residual deviance $D$ showing by far the more erratic behaviour. The Pearson test $X^2$ might be used and will give reliable results if a significant portion of covariate patterns has more than five individual observations.

Regarding non-standard goodness-of-fit tests, our simulation study confirms the work of Hosmer et al. [17] wherever identical tests have been investigated. However, we extended their findings by showing that there exist additional tests which perform at least as well as, and sometimes better than, their recommended goodness-of-fit tests in logistic regression, namely $X_{McC}^2$ and RSS. The first one is the Farrington test $X_F^2$, which extends the Pearson statistic by an additive constant. Although this test has the structural deficiency of never rejecting the null hypothesis in the case of extreme sparseness ($m_i \equiv 1$), it already gives reliable results in cases where the average $m_i$ is as small as 1.5. The second test is the information matrix test $IM_{DIAG}$, better known in econometrics [25–27] than in medical statistics, which relies on the comparison of two estimators of the information matrix. This test uses the individual observations for the calculation of the test statistic and thus has reasonable power even with very sparse data. It should be stated that a violation of the information matrix equation results in inconsistent estimators for the variance of parameters and thus in invalid hypothesis tests [28]. However, a simple correction of the variance estimator is available [21] that leads to a quasi-maximum likelihood estimator that is optimal in terms of the Kullback–Leibler divergence and leads to asymptotically valid parameter tests.

The current quasi-standard in assessing goodness-of-fit in logistic regression, the Hosmer–Lemeshow test, was shown to keep the asymptotic level under the null hypothesis and to have comparable power under the alternative. However, considering the described weaknesses of this test (dependency on the calculating algorithm, yielding different results from different software packages), we might feel better knowing that certain valid alternative tests exist to supplement the findings from the Hosmer–Lemeshow test. It could be observed that the described tests can be divided into two groups regarding their power to detect different misspecifications, and therefore we recommend calculating all of them. This should be done in a descriptive manner to avoid multiple comparison problems.

Concerning everyday practice it could be shown that global goodness-of-fit tests can be a valuable tool in the data analyst’s tool box for logistic regression; however, a sound goodness-of-fit analysis should not be considered adequate after calculating a single goodness-of-fit statistic. Especially when the tests show a certain lack-of-fit, we get no advice from them on how to improve the model and should turn to residual analysis. This is even more important with very sparse data or small overall sample sizes, where only low power was observed in the simulation study.

We restricted our discussion to goodness-of-fit tests that are calculated after the parameters of the logistic model have been fitted by maximum likelihood. This excludes some important special cases from everyday practice where the assumptions for valid ML estimation are no longer fulfilled, for example, data sets with separation where ML estimates no longer exist, small sample sizes or a large number of parameters to be estimated. Conditional or exact estimation should be used under these circumstances. There are also some goodness-of-fit tests for these situations, but they suffer from computational complexity, requiring Gibbs sampling [29] or extensive enumeration [30] to obtain p-values, and are undefined for situations with extreme sparseness. Additionally, it can be speculated that these methods show unduly conservative behaviour, as is frequently observed with conditional inference methods [31].

Calculation of the described tests is straightforward and can be realized with standard software; thus they all fulfil the requirement of Pagan (cited in Reference [32]), who stated that ‘the reality of model construction demands that diagnostic and specification tests be neither expensive nor cumbersome to construct. Once methods begin to cause trouble on either of these criteria, they are likely to be ignored.’ An SAS/IML macro that calculates the described tests is available from the author [33].

To conclude, the fundamental dilemma of all goodness-of-fit tests remains: a non-significant result of a goodness-of-fit test does not tell you that your model is correct; it just tells you that the lack-of-fit is not large enough for you to reject your model.

ACKNOWLEDGEMENTS

We thank (in alphabetical order) Maria Blettner, Ralf Bender, Thomas L. Diepgen, Peter Dirschedl, Herwig Friedl and Gerhard Osius for valuable help and fruitful discussion during the project. The comments of an anonymous referee helped to improve the paper significantly and were very much appreciated.

REFERENCES

1. Pregibon D. Logistic regression diagnostics. Annals of Statistics 1981; 9(2):705–724.
2. Alston JM, Chalfant JA. Unstable models from incorrect forms. American Journal of Agricultural Economics 1991; 73(4):1171–1181.
3. Santner TJ, Duffy DE. The Statistical Analysis of Discrete Data. Springer: New York, 1989.
4. Hosmer DW, Lemeshow S. Applied Logistic Regression. Wiley: Chichester, 1989.
5. McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall: London, 1989.
6. Agresti A. Categorical Data Analysis. Wiley: Chichester, 1990.
7. Dickel H, Kuss O, Schmidt A, Schmitt J, Diepgen TL. Incidence of occupation-related skin diseases in skin-exposure occupational groups. Hautarzt 2001; 52(7):615–623.
8. Diepgen TL, Coenraads PJ. The epidemiology of occupational contact dermatitis. International Archives of Occupational and Environmental Health 1999; 72(8):496–506.
9. Diepgen TL, Tepe A, Pilz B, Schmidt A, Huner A, Huber A, Hornstein OP, Frosch PJ, Fartasch M. Berufsbedingte Hauterkrankungen bei Auszubildenden im Friseur- und Krankenpflegeberuf. Allergologie 1993; 16(10):396–403.
10. Diepgen TL, Sauerbrei W, Fartasch M. Development and validation of diagnostic scores for atopic dermatitis incorporating criteria of data quality and practical usefulness. Journal of Clinical Epidemiology 1996; 49(9):1031–1038.
11. Osius G, Rojek D. Normal goodness-of-fit tests for multinomial models with large degrees of freedom. Journal of the American Statistical Association 1992; 87(420):1145–1152.
12. Cressie NAC, Read TRC. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 1984; 46(2):440–464.
13. McCullagh P. Sparse data and conditional tests. Bulletin of the International Statistics Institute, Proceedings of the 45th Session of ISI (Amsterdam), Invited Paper 1985; 28(3):1–10.
14. McCullagh P. On the asymptotic distribution of Pearson's statistic in linear exponential-family models. International Statistical Review 1985; 53(1):61–67.
15. Farrington CP. On assessing goodness of fit of generalized linear models to sparse data. Journal of the Royal Statistical Society, Series B 1996; 58(2):349–360.
16. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics – Theory and Methods 1980; 9(10):1043–1069.
17. Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine 1997; 16(9):965–980.
18. Pigeon JG, Heyse JF. A cautionary note about assessing the fit of logistic regression models. Journal of Applied Statistics 1999; 26(7):847–853.
19. Bertolini G, D'Amico R, Nardi D, Tinazzi A, Apolone G. One model, several results: the paradox of the Hosmer–Lemeshow goodness-of-fit test for the logistic regression model. Journal of Epidemiology and Biostatistics 2000; 5(4):251–253.
20. Pulkstenis E, Robinson TJ. Two goodness-of-fit tests for logistic regression models with continuous covariates. Statistics in Medicine 2002; 21(1):79–93.
21. White H. Maximum likelihood estimation of misspecified models. Econometrica 1982; 50(1):1–25.
22. Orme C. The calculation of the information matrix test for binary data models. The Manchester School 1988; 54(4):370–376.
23. Chesher A. Testing for neglected heterogeneity. Econometrica 1984; 52(4):865–872.
24. Copas JB. Unweighted sum of squares test for proportions. Applied Statistics 1989; 38(1):71–80.
25. Lechner M. Testing logit models in practice. Empirical Economics 1991; 16:177–198.
26. Thomas JM. On testing the logistic assumption in binary dependent variable models. Empirical Economics 1993; 18:381–392.
27. Aparicio T, Villanua I. The asymptotically efficient version of the information matrix test in binary choice models. A study of size and power. Journal of Applied Statistics 2001; 28(2):167–182.
28. White H. Estimation, Inference and Specification Analysis. Cambridge University Press: Cambridge, 1994.
29. Forster JJ, McDonald JW, Smith PWF. Monte Carlo exact conditional tests for log-linear and logistic models. Journal of the Royal Statistical Society, Series B 1996; 58(2):445–453.
30. Tang ML. Exact goodness-of-fit test for binary logistic model. Statistica Sinica 2001; 11(1):199–211.
31. Agresti A. Exact inference for categorical data: recent advances and continuing controversies. Statistics in Medicine 2001; 20(17–18):2709–2722.
32. Godfrey LG. Misspecification Tests in Econometrics. Cambridge University Press: Cambridge, 1988.
33. Kuss O. A SAS/IML macro for goodness-of-fit testing in logistic regression models with sparse data. Proceedings of the 26th Annual SAS Users Group International Conference 2001; P265–26.

