The effect of collapsing multinomial data when assessing agreement

© International Epidemiological Association 2000

International Journal of Epidemiology 2000;29:1070–1075

Printed in Great Britain

E Bartfaya,b and A Donnera

Background In epidemiological studies, researchers often depend on proxies to obtain information when primary subjects are unavailable. However, relatively few studies have performed formal statistical inference to assess agreement between proxy informants and primary study subjects. In this paper, we consider inference procedures for studies of interobserver agreement characterized by two raters and three or more outcome categories. Of particular interest is the consequence of dichotomizing such data on the expected confidence interval width for the kappa coefficient. The effect of dichotomization on sample size requirements for testing hypotheses concerning kappa is also evaluated.

Methods

Simulation studies were used to compare coverage levels and widths for constructing confidence intervals. Sample size requirements were compared for multinomial and dichotomous data. We illustrate our results using a published data set on drinking habits that assesses agreement among primary and proxy respondents.

Results

Our results show that when multinomial data are treated as dichotomous, not only do the expected confidence interval widths become greater, but the penalty in terms of larger sample size requirements for hypothesis testing can be severe.

Conclusion

We conclude that there are clear advantages in preserving multinomial data on the original scale rather than collapsing the data into a binary trait.

Keywords

Agreement, kappa statistic, sample size, confidence interval, epidemiological studies

Accepted

10 May 2000

Since its introduction by Cohen,1 the kappa coefficient (κ) has become a very popular index for quantifying agreement among raters with respect to categorical measurements. The principal advantage of kappa compared with earlier measures of agreement is that it corrects for the excess agreement expected by chance. Donner and Eliasziw2 developed a goodness-of-fit (GOF) procedure to construct inferences for the kappa statistic when the trait of interest is measured on a dichotomous scale. In particular, they showed how one can construct confidence intervals for the kappa statistic and estimate sample size requirements for hypothesis testing using this procedure. Most of the literature on agreement assessment, however, has focused on continuous or dichotomous outcome data.3–6 Nevertheless, recent interest in kappa7 reflects the importance of this statistic, which can be applied to more general problems. For instance, investigators in epidemiological studies often rely on proxy informants when primary subjects are unavailable to provide the needed information, particularly when study subjects are elderly or very young children.8,9 It has been suggested that the criteria for evaluating agreement between information obtained from primary subjects and their proxy respondents depend on their relationship, the research subject matter, and even the subjects' ethnicity.9–12 For example, wives seem to report their husbands' dietary intake more accurately, while husbands tend to be more accurate about their wives' alcohol consumption.13,14 Moreover, children can provide reliable information on the smoking habits of their cohabiting parents, whereas parents are not effective informants for evaluating their children's oral health status.15,16 In spite of the varying degrees of proxy-primary agreement reported in the literature, formal statistical evaluation has not been routinely used.9 One purpose of this paper is to show how the results of Donner and Eliasziw2 can be extended to provide a statistical procedure for constructing confidence intervals about the kappa coefficient for multinomial data. In addition, we demonstrate the consequence of collapsing multinomial data into a binary outcome measure. Our results show that this practice can be disadvantageous in terms of sample size requirements for hypothesis testing and in terms of the expected confidence interval width. We illustrate these results using data on drinking habits from a previously published study.

a Department of Epidemiology and Biostatistics, The University of Western Ontario, London, Ontario, Canada.
b Current address: Radiation Oncology Research Unit, Department of Oncology, and Department of Community Health and Epidemiology, Queen's University, Kingston, Ontario, Canada.
Reprint requests: E Bartfay, Radiation Oncology Research Unit, Apps Level 4, Kingston General Hospital, Kingston, Ontario, Canada, K7L 2V7. E-mail: [email protected]

Materials and Methods

Suppose that a sample of n subject-pairs has been selected. Each individual is asked to classify a response into one of J (>2) mutually exclusive categories. Let xtj denote the number of individuals of the tth pair in category j, where t = 1, 2, …, n and j = 1, 2, …, J. Assume that the joint distribution of xt1, xt2, …, xtJ is multinomial, i.e.

Pr(xt1, xt2, …, xtJ) = [2!/(xt1! xt2! … xtJ!)] p1^xt1 p2^xt2 … pJ^xtJ

with parameters p1, p2, …, pJ (p1 + p2 + … + pJ = 1) and xt1 + xt2 + … + xtJ = 2. The parameters p1, p2, …, pJ represent the probability of a rating being classified into categories 1, 2, …, J, respectively. We further assume that p1, p2, …, pJ follow a Dirichlet distribution with parameters α1, α2, …, αJ, and with density function given by

f(p1, p2, …, pJ) = [Γ(α1 + α2 + … + αJ) / Π_{j=1..J} Γ(αj)] Π_{j=1..J} pj^(αj – 1)

where αj > 0, j = 1, 2, …, J. The joint distribution of xt1, xt2, …, xtJ is then Dirichlet-multinomial,17 which can be written as

Pr(xt1, xt2, …, xtJ) = [2! Γ(α1 + α2 + … + αJ) Π_{j=1..J} Γ(αj + xtj)] / [Π_{j=1..J} xtj! Γ(2 + α1 + α2 + … + αJ) Π_{j=1..J} Γ(αj)]

If we let Pr(j, j′) be the probability that the first rating is in category j and the second rating is in category j′, the basic model can be written as

Pr(j, j) = Pr(xt1 = 0, …, xtj = 2, …, xtJ = 0) = αj(αj + 1)/[(θ + 1)θ]
Pr(j, j′) = Pr(j′, j) = Pr(xt1 = 0, …, xtj = 1, xtj′ = 1, …, xtJ = 0) = αjαj′/[(θ + 1)θ]

where θ = α1 + α2 + … + αJ. Letting µj = αj/θ and κ = (1 + θ)–1, the Dirichlet-multinomial model can also be expressed as

Pr(j, j) = µj² + µj(1 – µj)κ
Pr(j, j′) = Pr(j′, j) = µjµj′(1 – κ)

for all j, j′ = 1, 2, …, J. The coefficient of interobserver agreement κ defined above has a parallel interpretation as the correlation between any two subjects within a litter in toxicological studies.18 Each pair of ratings may be regarded as falling into one of J(J + 1)/2 classifications (see Table 1 for the data layout). Letting ni denote the number of subjects in classification i, i = 1, 2, …, J(J + 1)/2, the log-likelihood function can be written as

logL = Σ_{i=1..J(J+1)/2} ni log[Pri(µ1, …, µJ–1, κ)]

In order to construct a one degree of freedom GOF test, we may further combine all discordant cells into a single cell. The modified log-likelihood function may then be expressed as

logLM = Σ_{i=1..J+1} mi log[Pr′i(µ1, …, µJ–1, κ)]

where mi = ni for concordant cells and mJ+1 represents the sum of all discordant cells. In the next three subsections, we show how one can construct confidence intervals and estimate sample size requirements for hypothesis testing concerning κ.

Confidence interval construction

Suppose it is of interest to construct a 100(1 – α)% confidence interval for κ. The observed frequencies mi, corresponding to the Pr′i(κ), i = 1, 2, …, J + 1, follow a multinomial distribution conditional on the sample size n (Table 1). The estimated probabilities P̂r′i(κ) can be obtained by replacing the µj by suitable estimates in Pr′i(κ), i = 1, 2, …, J + 1. It follows that

χ²G = Σ_{i=1..J+1} [mi – nP̂r′i(κ)]² / [nP̂r′i(κ)]

has a limiting chi-square distribution with one degree of freedom. One can obtain the two corresponding 100(1 – α)%

Table 1 Data layout

Category       Ratings          Frequency     Probability       Frequency    Probability
1              (1, 1)           n1            Pr1(κ)            m1           Pr′1(κ)
2              (2, 2)           n2            Pr2(κ)            m2           Pr′2(κ)
⋮              ⋮                ⋮             ⋮                 ⋮            ⋮
J – 1          (J – 1, J – 1)   nJ–1          PrJ–1(κ)          mJ–1         Pr′J–1(κ)
J              (J, J)           nJ            PrJ(κ)            mJ           Pr′J(κ)
J + 1          (j, j′)*         nJ+1          PrJ+1(κ)          mJ+1         Pr′J+1(κ)
⋮              ⋮                ⋮             ⋮
J(J + 1)/2     (j′, j)*         nJ(J+1)/2     PrJ(J+1)/2(κ)

* j ≠ j′; j, j′ = 1, 2, …, J


confidence limits for κ by finding the admissible roots of the polynomial equation χ²G = χ²1–α. When a closed-form solution is unattainable, the confidence limits may be obtained numerically by replacing the µj with their maximum likelihood estimates and solving the equation above for κ. Maximum likelihood estimates can be obtained by solving ∂logLM/∂µ1 = 0, …, ∂logLM/∂µJ–1 = 0 and ∂logLM/∂κ = 0 simultaneously. For the case of a binary outcome variable, an explicit expression for the maximum likelihood estimator was obtained by Bloch and Kraemer.3
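The interval inversion above can be sketched numerically. The following is my own illustration, not the authors' code: cell_probs implements the collapsed-cell probabilities Pr′i of the model, while the grid search over κ and the use of fixed plug-in values for the µj are simplifications of the maximum likelihood approach described in the text. The search is restricted to 0 < κ < 1 because θ > 0 implies this range under the Dirichlet-multinomial parameterization.

```python
import numpy as np
from scipy.stats import chi2

def cell_probs(mu, kappa):
    """Collapsed probabilities Pr'_i: J concordant cells plus one pooled discordant cell."""
    conc = [m * m + m * (1.0 - m) * kappa for m in mu]
    return np.array(conc + [1.0 - sum(conc)])

def gof_stat(m_counts, mu, kappa):
    """Chi-square GOF statistic: sum_i (m_i - n Pr'_i)^2 / (n Pr'_i)."""
    m = np.asarray(m_counts, dtype=float)
    n = m.sum()
    expected = n * cell_probs(mu, kappa)
    return float(((m - expected) ** 2 / expected).sum())

def gof_ci(m_counts, mu, level=0.95, grid=4000):
    """Admissible kappa values whose GOF statistic stays below the 1-df cut-off."""
    crit = chi2.ppf(level, df=1)
    kappas = np.linspace(1e-4, 1.0 - 1e-4, grid)
    inside = [k for k in kappas if gof_stat(m_counts, mu, k) <= crit]
    return inside[0], inside[-1]
```

As a check, with µ = (0.2, 0.3, 0.5) and κ = 0.4 the expected collapsed counts at n = 100 are (10.4, 17.4, 35.0, 37.2); the statistic vanishes at κ = 0.4 and the returned interval brackets that value.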

Hypothesis testing

The procedure above may also be used to test hypotheses concerning κ. Suppose it is of interest to test H0: κ = κ0, where κ0 is a pre-specified value. The GOF test statistic is given by

χ²0 = Σ_{i=1..J+1} [mi – nP̂r′i(κ0)]² / [nP̂r′i(κ0)]

Under H0, χ²0 follows an approximate chi-square distribution with one degree of freedom. The P̂r′i(κ0), i = 1, 2, …, J + 1, are obtained by replacing µj, j = 1, 2, …, J – 1 with their maximum likelihood estimates and κ by κ0.

Sample size requirements

Suppose that it is of interest to estimate the number of subjects needed to test the null hypothesis H0: κ = κ0 versus Ha: κ = κa at the 100α% significance level (2-sided), and with power (1 – β). Under Ha, the GOF statistic has a non-central chi-square distribution with one degree of freedom, and with corresponding non-centrality parameter given by

λ = n Σ_{i=1..J+1} [Pr′i(κa) – Pr′i(κ0)]² / Pr′i(κ0)

If 1 – β(1, λ, α) denotes the power of the GOF statistic corresponding to λ and α, one can determine the sample size required to test H0: κ = κ0 versus Ha: κ = κa by using tables of the non-central chi-square distribution (e.g. Haynam et al.19). The required number of subjects is then given by

n = λ {Σ_{i=1..J+1} [Pr′i(κa) – Pr′i(κ0)]² / Pr′i(κ0)}^(–1)

As an example, suppose that we have a trinomial outcome variable and wish to test H0: κ = 0.2 at α = 0.05 (2-sided) and β = 0.2. When µ1 = 0.2, µ2 = 0.3 and κa = 0.4, we can compute the required number of subjects from the equation above as n = 118.

Results

Confidence interval comparison

A Monte Carlo simulation study was conducted to evaluate the effect on coverage level and confidence interval width of dichotomizing trinomial outcome data. The parameters in the simulation included various values of µ1, µ2 and κ, as well as the total number of subjects (n = 50, 100, 200). The number of replications used in the simulation was 1000, which allows a departure of 0.025 from the true coverage of 95% to be detected as statistically significant with 90% power.20 To conserve space, we present some of the results in Table 2. Most of the coverage levels fall between 940 and 960 per 1000 replications, and are therefore generally acceptable. When the number of subjects is increased to 100 and 200, the differences in coverage level become negligible. The confidence interval results further show that the three-category interval widths are consistently narrower at all parameter values. The advantage is particularly apparent at n = 50 for κ = 0.1, where some of the observed differences in average width are as great as 0.17, an increase of 48.6%. The average widths become more similar when the number of subjects is increased to 100 and 200. Results for n = 200 are not provided for reasons of space; nevertheless, using three categories still produced narrower interval widths at all parameter values.

Sample size requirements comparison

We now consider the effect of dichotomization on sample size requirements for testing H0: κ = κ0 against Ha: κ = κa. For this purpose, we specify the values of µ1, µ2, κ0 and κa that correspond to the three-category case and may combine any two categories to facilitate this comparison. For example, suppose we have µ1 = 0.2, µ2 = 0.2 and µ3 = 0.6 for a trinomial outcome variable. Collapsing the data into a binary trait yields either (i) µ = µ1 = 0.2, 1 – µ = 0.8, or (ii) µ = µ1 + µ2 = 0.4, 1 – µ = 0.6. The numbers of subjects required in the trinomial and binomial cases are displayed in Table 3 for µ1 = (0.2, 0.3), µ2 = (0.2, 0.3, 0.4, 0.6), κ0 = (0.1, 0.2), κa = (0.3, 0.4, 0.5), α = (0.01, 0.05) and β = (0.2, 0.1); the number of subjects required when β = 0.1 is reported in parentheses. The overall results show that when a trinomial outcome variable is collapsed to create a binary variable, substantial increases in sample size are required in order to maintain the same level of power at a given level of significance. The magnitude of this effect depends on how the categories are combined. For example, when testing H0: κ = 0.2 against Ha: κ = 0.4 at α = 0.05 and β = 0.2, a trinomial outcome variable with µ1 = 0.2 and µ2 = 0.3 requires a sample size of 118 subjects. The corresponding number of subjects required for a dichotomous outcome variable is 248 at µ = 0.2 with 1 – µ = 0.8, but is reduced to 189 at µ = µ1 + µ2 = 0.5 with 1 – µ = 0.5.
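The sample size calculation above can be reproduced with a short script. This is my own sketch rather than the authors' code: the non-centrality λ giving power 1 – β for the 1-df test is found here with scipy root-finding instead of the Haynam et al. tables cited in the text.

```python
from math import ceil
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def cell_probs(mu, kappa):
    # Collapsed-cell probabilities: J concordant cells plus one pooled discordant cell
    conc = [m * m + m * (1.0 - m) * kappa for m in mu]
    return conc + [1.0 - sum(conc)]

def sample_size(mu, kappa0, kappa_a, alpha=0.05, beta=0.2):
    p0 = cell_probs(mu, kappa0)
    pa = cell_probs(mu, kappa_a)
    # per-subject non-centrality: sum_i (Pr'(ka) - Pr'(k0))^2 / Pr'(k0)
    per_subject = sum((a - b) ** 2 / b for a, b in zip(pa, p0))
    crit = chi2.ppf(1.0 - alpha, df=1)
    # lambda such that the 1-df non-central chi-square test attains power 1 - beta
    lam = brentq(lambda l: ncx2.sf(crit, 1, l) - (1.0 - beta), 1e-6, 200.0)
    return ceil(lam / per_subject)

# Trinomial example from the text: mu1 = 0.2, mu2 = 0.3, kappa0 = 0.2, kappa_a = 0.4
print(sample_size([0.2, 0.3, 0.5], 0.2, 0.4))   # 118
# The two dichotomizations discussed in the text
print(sample_size([0.2, 0.8], 0.2, 0.4))        # 248
print(sample_size([0.5, 0.5], 0.2, 0.4))        # 189
```

The three printed values match the worked example: 118 subjects for the trinomial variable, versus 248 and 189 for the two collapsed binary variables.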

Application

An example: drinking habits among non-fatal myocardial infarction patients

For illustrative purposes, we use part of the data from a community-based case-control study of coronary heart disease.21 The study population consisted of all white men and women aged 25–64 who resided in the Auckland Statistical Area from 1986 to 1988. Cases included all non-fatal myocardial infarction patients from a World Health Organization MONICA project and all myocardial infarction deaths. Controls for the non-fatal myocardial infarction patients were a group-matched, age- and sex-stratified random sample from the study population. Information on alcohol consumption was collected using the 'typical occasions' method.22 In the present paper, we are


Table 2 Effect of dichotomization on coverage level and confidence interval widtha

n = 50
µ1 = 0.1                               µ2 = 0.2                      µ2 = 0.4
No. of categories      κ:       0.1    0.3    0.5    0.7      0.1    0.2    0.5    0.9
Two         Coverage            947    950    950    950      958    948    945    957
            Width             0.513  0.522  0.493  0.424    0.526  0.519  0.469  0.260
Three       Coverage            970    954    945    943      954    945    942    939
            Width             0.366  0.444  0.450  0.400    0.354  0.397  0.416  0.241

µ1 = 0.3                               µ2 = 0.2                      µ2 = 0.3
No. of categories      κ:       0.1    0.3    0.5    0.7      0.1    0.2    0.5    0.9
Two         Coverage            958    941    958    940      943    954    956    963
            Width             0.520  0.506  0.465  0.392    0.522  0.520  0.474  0.269
Three       Coverage            940    945    954    951      951    953    945    961
            Width             0.377  0.410  0.396  0.341    0.385  0.398  0.383  0.220

n = 100
µ1 = 0.1                               µ2 = 0.2                      µ2 = 0.4
No. of categories      κ:       0.1    0.3    0.5    0.7      0.1    0.2    0.5    0.9
Two         Coverage            954    951    962    950      947    952    950    975
            Width             0.382  0.384  0.358  0.302    0.381  0.376  0.334  0.178
Three       Coverage            946    956    951    957      956    953    945    943
            Width             0.288  0.336  0.327  0.284    0.277  0.303  0.298  0.166

µ1 = 0.3                               µ2 = 0.2                      µ2 = 0.3
No. of categories      κ:       0.1    0.3    0.5    0.7      0.1    0.2    0.5    0.9
Two         Coverage            949    937    951    953      962    952    946    953
            Width             0.381  0.367  0.334  0.277    0.381  0.378  0.340  0.183
Three       Coverage            948    928    950    955      963    952    947    964
            Width             0.282  0.297  0.283  0.241    0.281  0.288  0.274  0.151

a The upper entries denote the coverage level based on 1000 replications, and the lower entries denote the average confidence interval width, defined as the difference between the average upper limit and the average lower limit.

interested in the degree of agreement regarding drinking habits between the primary respondents (non-fatal myocardial infarction cases) and the proxy respondents (closest next-of-kin). Three categories were used in the data analysis: (I) non-drinker; (II) occasional drinker; or (III) regular drinker. The data for the 58 respondent pairs are given in Table 4, with the results summarized in Table 5. When the data from the 3 × 3 table are used to estimate agreement, the 95% confidence interval for κ is given by (0.38, 0.73). When we collapse the data into non-regular drinker (groups I and II) versus regular drinker (group III), the 95% confidence interval for κ is given by (0.45, 0.85). Alternatively, when we collapse the data into non-drinker (group I) versus drinker (groups II and III), the 95% confidence interval for κ is given by (0.32, 0.77). In the two collapsed cases, we observe increases in confidence interval width of 16% and 34%, respectively. It is clear that the confidence interval is narrower when more categories are used.
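The two collapses of the Table 4 counts can be written out explicitly. The snippet below is my own illustration (rows are proxy respondents, columns primary respondents, in the order I, II, III); the collapse_2x2 helper is a hypothetical name, not from the paper.

```python
import numpy as np

# Table 4 counts: rows = proxy, columns = primary (I non-, II occasional, III regular drinker)
table4 = np.array([[10,  4,  0],
                   [ 5, 19,  2],
                   [ 0,  5, 13]])

def collapse_2x2(counts, group):
    """Pool the categories listed in `group` against the remaining ones."""
    g = list(group)
    rest = [j for j in range(counts.shape[0]) if j not in g]
    a = counts[np.ix_(g, g)].sum()        # both respondents in the pooled group
    b = counts[np.ix_(g, rest)].sum()     # proxy pooled, primary not
    c = counts[np.ix_(rest, g)].sum()     # primary pooled, proxy not
    d = counts[np.ix_(rest, rest)].sum()  # both in the remaining group
    return np.array([[a, b], [c, d]])

# non-regular drinker (I + II) versus regular drinker (III)
print(collapse_2x2(table4, [0, 1]))   # [[38  2] [ 5 13]]
# non-drinker (I) versus drinker (II + III)
print(collapse_2x2(table4, [0]))      # [[10  4] [ 5 39]]
```

Both collapsed tables preserve the 58 respondent pairs; each is the binary layout to which the GOF interval procedure was applied.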

Discussion

Donner and Eliasziw23 proposed a hierarchical approach to the construction of inferences concerning interobserver agreement when the outcome variable of interest is multinomial. By combining the original categories into binary traits, the authors were able to perform a series of nested, statistically independent inferences. However, this method is only appropriate when some of the outcome categories can be naturally combined to answer a series of questions that are of a priori interest. Kraemer24 addressed the problem of multinomial outcome categories by proposing the use of a symmetric matrix of coefficients to measure reliability. In this approach, the intraclass kappa coefficients on the matrix diagonal represent the degree of agreement for a particular category relative to all other categories combined. An advantage of the matrix approach is that it meets the concerns of those who have criticized the use of a single overall measure of reliability for multinomial data (e.g. Roberts and McNamee25). When the main interest is in an initial global measure of agreement, however, these methods can become cumbersome and may provide more information than is needed. The results of our simulation study, as reflected in the example, show that there are clear advantages to preserving multinomial data on the original scale rather than collapsing the data into a binary trait. Depending upon how the categories are collapsed, the penalty in terms of larger sample size requirements for testing hypotheses can be quite severe. These observations are


Table 3 Comparison of sample size requirements (number of subjects for β = 0.2, with β = 0.1 in parentheses)

κ0 = 0.1
                                 κa = 0.3                                    κa = 0.4
µ1                   0.2     0.2     0.2     0.3     0.3         0.2     0.2     0.2     0.3     0.3
µ2                   0.2     0.3     0.4     0.4     0.6         0.2     0.3     0.4     0.4     0.6
Three categories
  α = 0.05           117     111     110     107     125          52      50      49      48      56
                    (156)   (149)   (147)   (143)   (167)        (70)    (66)    (66)    (64)    (75)
  α = 0.01           174     165     163     159     186          77      74      73      71      83
                    (221)   (211)   (208)   (203)   (237)        (99)    (94)    (93)    (90)   (106)
Two categories
  α = 0.05      231 (312)a  207 (279)b  197 (266)c  195 (263)d  197 (266)e      103 (139)a  92 (124)b  88 (119)c  87 (117)d  88 (119)e
  α = 0.01      343 (437)a  308 (392)b  294 (374)c  290 (369)d  294 (374)e      153 (195)a  137 (174)b  131 (166)c  129 (164)d  131 (166)e

κ0 = 0.2
                                 κa = 0.4                                    κa = 0.5
µ1                   0.2     0.2     0.2     0.3     0.3         0.2     0.2     0.2     0.3     0.3
µ2                   0.2     0.3     0.4     0.4     0.6         0.2     0.3     0.4     0.4     0.6
Three categories
  α = 0.05           127     118     115     111     137          57      53      52      50      61
                    (170)   (158)   (154)   (148)   (183)        (76)    (70)    (69)    (66)    (82)
  α = 0.01           189     175     171     165     203          84      78      76      74      91
                    (240)   (223)   (218)   (211)   (259)       (107)    (99)    (97)    (94)   (115)
Two categories
  α = 0.05      248 (334)a  209 (282)b  193 (261)c  189 (255)d  193 (261)e      110 (149)a  93 (125)b  86 (116)c  84 (114)d  86 (116)e
  α = 0.01      368 (469)a  310 (395)b  287 (366)c  281 (358)d  287 (366)e      164 (209)a  138 (176)b  128 (163)c  125 (159)d  128 (163)e

Categories were combined as follows:
a µ = 0.2, 1 – µ = 0.8.
b µ = 0.3, 1 – µ = 0.7.
c µ = 0.4, 1 – µ = 0.6.
d µ = 0.5, 1 – µ = 0.5.
e µ = 0.6, 1 – µ = 0.4.

Table 4 Data layout for drinking habits by primary and proxy respondents

                                      Primary respondents
Proxy respondents        Non-drinker    Occasional drinker    Regular drinker    Total
Non-drinker                   10               4                     0              14
Occasional drinker             5              19                     2              26
Regular drinker                0               5                    13              18
Total                         15              28                    15              58

Table 5 95% confidence interval for κ: alcohol consumption example

                        3 categoriesa    2 categoriesb (non-regular    2 categoriesc (non-drinker
                                         versus regular drinker)       versus drinker)
Lower limit                0.383             0.453                        0.314
Upper limit                0.727             0.853                        0.774
Width                      0.344             0.400                        0.460
% increase in width          –               16%                          34%

a Non-drinker versus occasional drinker versus regular drinker.
b Non-regular drinker included non- and occasional drinker.
c Drinker included occasional and regular drinker.

also consistent with those of previous authors,26–28 who show that there can be a severe loss of power when an inherently continuous variable is dichotomized. However, when there are small numbers in some of the categories, collapsing categories may sometimes be the only practical way to proceed with the data analysis. Nonetheless, when an investigator considers collapsing categories, biological or clinical relevance should also be taken into account.


Acknowledgements Dr Bartfay’s research has been partially supported by a grant from the Advisory Research Committee of Queen’s University and Dr Donner’s research has been partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

References

1 Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measurement 1960;20:37–46.
2 Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Statist Med 1992;11:1511–19.
3 Bloch DA, Kraemer HC. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989;45:269–87.
4 Mak TK. Analysing intraclass correlation for dichotomous variables. Appl Stat 1988;37:344–52.
5 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
6 Basu S, Basu A. Comparison of several goodness-of-fit tests for the kappa statistic based on exact power and coverage probability. Statist Med 1995;14:347–56.
7 Banerjee M, Capozzoli L, McSweeney L, Sinha D. Beyond kappa: a review of interrater agreement measures. Can J Stat 1999;27:3–23.
8 Pierre U, Wood-Dauphinee S, Korner-Bitensky N, Gayton D, Hanley J. Proxy use of the Canadian SF-36 in rating health status of the disabled elderly. J Clin Epidemiol 1998;51:983–90.
9 Whiteman D, Green A. Wherein lies the truth? Assessment of agreement between parent proxy and child respondents. Int J Epidemiol 1997;26:855–59.
10 Navarro AM. Smoking status by proxy and self report: rate of agreement in different ethnic groups. Tob Control 1999;8:182–85.
11 MaCarthur C, Dougherty G, Pless IB. Reliability and validity of proxy respondent information about childhood injury: an assessment of a Canadian surveillance system. Am J Epidemiol 1997;145:834–41.
12 Walker AM, Velema JP, Robins JM. Analysis of case-control data derived in part from proxy respondents. Am J Epidemiol 1988;127:905–14.
13 Humble CG, Samet JM, Skipper BE. Comparison of self- and surrogate-reported dietary information. Am J Epidemiol 1984;119:86–98.
14 Cahalan D. Quantifying alcohol consumption: patterns and problems. Circulation 1981;64(Suppl.III):7–14.
15 Barnett T, O'Loughlin J, Paradis G, Renaud L. Reliability of proxy reports of parental smoking by elementary schoolchildren. Ann Epidemiol 1997;7:396–99.
16 Beltran ED, Malvitz DM, Eklund SA. Validity of two methods for assessing oral health status of populations. J Public Health Dent 1997;57:206–14.
17 Johnson NL, Kotz S. Discrete Distributions. New York: Wiley, 1969.
18 Chen JJ, Kodell RL, Howe RB, Gaylor DW. Analysis of trinomial responses from reproductive and developmental toxicity experiments. Biometrics 1991;47:1049–58.
19 Haynam GE, Govindarajulu Z, Leone GC. Tables of the Cumulative Non-central Chi-square Distribution. Case Statistical Laboratory, Publication No. 104, 1962. Part of the tables have been published in: Harter HL, Owen DB (eds). Selected Tables in Mathematical Statistics Vol. 1. Chicago: Markham, 1970.
20 Robey RR, Barcikowski RS. Type I error and the number of iterations in Monte Carlo studies of robustness. Br J Math Stat Psychol 1992;45:283–88.
21 Graham P, Jackson R. Primary versus proxy respondents: comparability of questionnaire data on alcohol consumption. Am J Epidemiol 1993;138:443–52.
22 Alanko T. An overview of techniques and problems in measurement of alcohol consumption. Res Adv Alcohol Drug Prob 1984;8:299–326.
23 Donner A, Eliasziw M. A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Statist Med 1997;16:1097–106.
24 Kraemer HC. Measurement of reliability for categorical data in medical research. Stat Meth Med Res 1992;1:183–99.
25 Roberts C, McNamee R. A matrix of kappa-type coefficients to assess the reliability of nominal scales. Statist Med 1998;17:471–88.
26 Cohen J. The cost of dichotomization. Appl Psychol Measurement 1983;7:249–53.
27 Kraemer HC. A measure of 2 × 2 association with stable variance and approximately normal small-sample distribution: planning cost-effective studies. Biometrics 1986;42:359–70.
28 Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994;50:550–55.