Journal of Applied Psychology 2003, Vol. 88, No. 2, 356–362

Copyright 2003 by the American Psychological Association, Inc. 0021-9010/03/$12.00 DOI: 10.1037/0021-9010.88.2.356

Accurate Tests of Statistical Significance for rWG and Average Deviation Interrater Agreement Indexes

William P. Dunlap, Michael J. Burke, and Kristin Smith-Crowe

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Tulane University

The authors demonstrated that the most common statistical significance test used with rWG-type interrater agreement indexes in applied psychology, based on the chi-square distribution, is flawed and inaccurate. The chi-square test is shown to be extremely conservative even for modest, standard significance levels (e.g., .05). The authors present an alternative statistical significance test, based on Monte Carlo procedures, that produces the equivalent of an approximate randomization test for the null hypothesis that the actual distribution of responding is rectangular and demonstrate its superiority to the chi-square test. Finally, the authors provide tables of critical values and offer downloadable software to implement the approximate randomization test for rWG-type and for average deviation (AD)-type interrater agreement indexes. The implications of these results for studying a broad range of interrater agreement problems in applied psychology are discussed.

William P. Dunlap and Kristin Smith-Crowe, Psychology Department, Tulane University; Michael J. Burke, A. B. Freeman School of Business, Tulane University. William P. Dunlap passed away on February 28, 2002, during the completion of this article. His Web site that contains the interrater agreement program reported in this study will be maintained at www.tulane.edu/~dunlap/psylib.html. Correspondence concerning this article should be addressed to Michael J. Burke, A. B. Freeman School of Business, Tulane University, New Orleans, Louisiana 70118. E-mail: [email protected]

Over the past decade, researchers in the fields of applied psychology, occupational health psychology, and management have been addressing a number of problems that require knowledge of interrater agreement, that is, the extent to which raters assign the same ratings to a single target (e.g., Burke, Rupinski, Dunlap, & Davison, 1996; Chatman & Flynn, 2001; Neal, Griffin, & Hart, 2000; Schneider, White, & Paul, 1998; Tesluk, Farr, Mathieu, & Vance, 1995). In particular, the applied psychology literature has mushroomed in the past 2 years with respect to assessments of interrater agreement (e.g., see Button, 2001; Demerouti, Bakker, Nachreiner, & Schaufeli, 2001; Dirks, 2000; Judge & Bono, 2000; Klein, Conn, Smith, & Sorra, 2001; Lindell & Brandt, 2000; Masterson, 2001; Schminke, Ambrose, & Cropanzano, 2000; Totterdell, 2000; Zohar, 2000). In the latter studies, assessments of interrater agreement were made for determining whether aggregated individual-level scores can be used as indicators of group-level constructs, such as organizational climate, group efficacy, and team effectiveness. Along with a composition argument for justifying the aggregation of individual-level data (cf. Chan, 1998), a demonstration of interrater agreement provides the measurement justification for using aggregated individual-level data (e.g., the group’s mean) as indicators of group-level constructs (Kozlowski & Klein, 2000). In addition, there are numerous problems in applied psychology and management that appear to call for assessing interrater agreement results in both a measurement sense and in terms of statistical significance, such as job analysts’ ratings of task items for a single job, judges’ ratings of critical or cutoff scores (e.g., using the Angoff method; see Maurer & Alexander, 1992) on the items of a test, or customers’ ratings of the service of a particular store.

The most widely used index of interrater agreement on Likert-type scales has been James, Demaree, and Wolf’s (1984) rWG index. James et al.’s index is based on the ratio of actual variance in ratings to the theoretical variance of ratings given a null distribution. A limitation of the rWG index is that the appropriate specification of the null distribution is debatable. Although James et al. recommended using the uniform distribution (i.e., the distribution in which each response category is equally likely), Lindell and Brandt (1997) recommended using maximum dissensus. Burke, Finkelstein, and Dusig (1999), however, cautioned against the use of maximum dissensus because its use may lead to the overestimation of interrater agreement. Because of interpretability problems of this nature with the rWG index, Burke et al. proposed the average deviation (AD) index, which is computed by finding the absolute deviation of each rating from the mean or median of the group rating and then averaging the deviations. In contrast to rWG-type indexes, AD indexes can be interpreted in terms of the actual categories of the Likert scale used. Because these two indexes (i.e., rWG and AD) can yield somewhat different results, Burke and colleagues (Burke & Dunlap, 2002; Burke et al., 1999) recommended that both indexes be used. Of course, in addition to determining what interrater agreement index to use, one must determine how it can best be employed.

Two basic issues to consider when deciding how to use an interrater agreement index are, first, whether an index indicates that interrater agreement is sufficiently strong or disagreement is sufficiently weak so that one can trust that the average opinion of a group is interpretable or representative, and second, whether the apparent agreement for the group is sufficiently different from chance agreement so that one can conclude that some agreement exists regardless of its magnitude. These issues are not necessarily mutually exclusive and, as noted above, both may be relevant for some practical interrater agreement problems. The first issue is probably the most important in deciding whether one should conclude that a reasonable consensus exists for a group to aggregate individual-level data to the group level of analysis. Ensuring that the interrater agreement results for any given interrater agreement index meet the second criterion is also helpful for making decisions about data aggregation and is necessary for hypothesis testing purposes (i.e., ensuring that agreement is significantly different from chance responding given the size of the group).

Recently, researchers have discussed how the bootstrapping method (Cohen, Doveh, & Eick, 2001) or chi-square test (Lindell & Brandt, 1999) can be employed to test the statistical significance of rWG-type indexes. Alternatively, Burke and Dunlap (2002) proposed the use of an approximate randomization test for testing the statistical significance of AD interrater agreement results. Statistical significance tests for assessing interrater agreement are the focus of the present article.

The remainder of this article unfolds as follows. First, we review the statistical logic behind chi-square-based tests of significance for interrater agreement indexes. Next, we demonstrate that, although a chi-square-based test would work well for normal data, the use of chi-square to test issues regarding variance relative to a uniform (i.e., rectangular) distribution, as is done for tests of interrater agreement with rWG, is very sensitive to nonnormality. Therefore, chi-square tests perform poorly for hypothetical rectangular populations and are often very inaccurate for rWG interrater agreement indexes. Subsequently, we develop the rationale for an empirical, or approximate randomization, test of the null hypothesis of rectangular (chance) responding and demonstrate its superiority to the chi-square-based tests. Finally, we present both tables of critical values and a downloadable program that operates efficiently on a personal computer to compute measures of interrater agreement for rWG and AD indexes at the item level and to perform the superior, approximate randomization significance test.

Chi-Square Tests of Significance for Interrater Agreement Indexes

Traditionally, chi-square has been the statistic of choice for testing the statistical significance of interrater agreement (e.g., Lawlis & Lu, 1972; Tinsley & Weiss, 1975). Regarding interrater agreement indexes, the null hypothesis tested by chi-square is that there is no agreement among raters in their rating of an item above and beyond what would be expected by chance or random responding. Thus, a significantly small chi-square indicates that disagreement among the raters is less than what would be expected by chance alone. Lindell and Brandt (1999) recently applied a chi-square test to the problem of determining whether an rWG value is statistically significant. Given N raters, chi-square is equal to N − 1 multiplied by the observed variance ($\sigma_o^2$) in the ratings and divided by the theoretical variance of the uniform distribution ($\sigma_u^2$), with the degrees of freedom being equal to N − 1. The variance of the uniform distribution is equal to

$$\sigma_u^2 = (c^2 - 1)/12, \tag{1}$$

where c = the number of Likert categories or response options, and chi-square is equal to

$$\chi^2_{(N-1)} = (N - 1)\,\sigma_o^2 / \sigma_u^2. \tag{2}$$
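As an illustration of Equations 1 and 2, the following minimal Python sketch computes the chi-square statistic for a single item. The function names and the ratings are hypothetical and are not part of the authors' FORTRAN program; the observed variance is assumed to use N − 1 in the denominator.

```python
def uniform_variance(c):
    """Variance of a discrete uniform distribution over c categories (Equation 1)."""
    return (c ** 2 - 1) / 12.0


def observed_variance(ratings):
    """Sample variance of the ratings, with N - 1 in the denominator."""
    n = len(ratings)
    mean = sum(ratings) / n
    return sum((x - mean) ** 2 for x in ratings) / (n - 1)


def chi_square_statistic(ratings, c):
    """Chi-square statistic of Equation 2 and its degrees of freedom."""
    n = len(ratings)
    chi2 = (n - 1) * observed_variance(ratings) / uniform_variance(c)
    return chi2, n - 1


# Hypothetical ratings from five raters on a 5-point scale (illustration only).
ratings = [4, 5, 4, 4, 5]
chi2, df = chi_square_statistic(ratings, c=5)
print(f"chi-square = {chi2:.3f} on {df} df")
```

For these ratings the statistic is about 0.6 on 4 degrees of freedom, which one would then compare with the lower-tail critical value of the chi-square distribution.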

Unfortunately, chi-square, as a distribution of variance, is not robust against violations of normality (Cohen et al., 2001; Hayes, 1973; Tinsley & Weiss, 1975). In fact, variances, which are central to interrater agreement indexes, are presumed to be sampled from a nonnormal distribution, that is, the rectangular multinomial distribution. To give a practical example, when one’s data are categorical (as they are when using Likert scales), agreement among raters (e.g., most raters assigning a 4 or a 5 out of one to five possible categories to a target or item) is tested against the hypothesis that all categories are equally likely. In this hypothetical situation, a significant chi-square is rendered inaccurate because of platykurtosis in the hypothesized rectangular distribution. Furthermore, chi-square typically increasingly underestimates the stated alpha as either sample size increases or the number of categories decreases. For these reasons, we hold that the application of chi-square to the problem of testing the statistical significance of an interrater agreement index is fundamentally flawed. The first empirical section of the present article was designed to demonstrate that chi-square would be an accurate test if the data were normal but in general becomes highly inaccurate when data are assumed to be sampled from a nonnormal (rectangular) population.

Empirical Type I Error Rates of the Chi-Square Test

Method

A Monte Carlo program was written in FORTRAN to use data simulated via IMSL (1982) subroutines to generate multinomial rectangular data (categorical data with equal population proportions in each category), rectangular data (equal probabilities of fractions in the interval 0–1), and normal data, N(0, 1). Under the null hypothesis of equal response probabilities, one could not get normal data, so these data were studied for comparison purposes only. The program input included the sample size, the number of categories for discrete multinomial data, and the nominal alpha level. The program then generated 10,000 data sets. For each data set, the ratio of the observed sum of squares, $(N - 1)\sigma_o^2$, to the theoretical population variance, $\sigma^2$, was computed. The theoretical population variances were $(c^2 - 1)/12$ for categorical data, 1/12 for rectangular data, and 1 for normal data. The number of times that the observed ratio was less than the lower tail critical value of chi-square was counted for each data type across the 10,000 simulations. For normal data, this comparison should result in a test that the observed variance is significantly smaller than the theoretical variance at the stated level of significance. For categorical data, this test was expected to underestimate the stated level of significance. For illustration purposes, we used 2-, 3-, 5-, 7-, and 11-point categorical scales. The empirical Type I error rate was also computed for rectangular data for comparison purposes, which simulates the limit of the aforementioned test if the number of categories increases without bound.
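The categorical part of this simulation can be sketched in a few lines of Python. This is an illustrative approximation of the procedure described above, not the authors' FORTRAN/IMSL program: the function name, the seed, and the use of Python's `random` module are assumptions, and the lower-tail critical values are hard-coded from standard chi-square tables.

```python
import random


def empirical_type1_rate(n, c, chi2_lower_crit, iterations=10_000, seed=1):
    """Estimate how often (N - 1) * s^2 / sigma_u^2 falls below the lower-tail
    chi-square critical value when ratings are drawn uniformly from 1..c."""
    rng = random.Random(seed)
    sigma2_u = (c ** 2 - 1) / 12.0  # Equation 1
    rejections = 0
    for _ in range(iterations):
        sample = [rng.randint(1, c) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((x - mean) ** 2 for x in sample)  # equals (N - 1) * s^2
        if sum_sq / sigma2_u < chi2_lower_crit:
            rejections += 1
    return rejections / iterations


# Lower-tail .05 critical values from standard chi-square tables:
# 0.711 for 4 df (N = 5) and 3.325 for 9 df (N = 10).
rate_n5_c2 = empirical_type1_rate(n=5, c=2, chi2_lower_crit=0.711)
rate_n10_c5 = empirical_type1_rate(n=10, c=5, chi2_lower_crit=3.325)
print(rate_n5_c2, rate_n10_c5)
```

With two categories and N = 5, the only outcome that falls below the critical value is unanimous agreement, so the first rate should be close to 2 × (1/2)^5 = .0625 regardless of the nominal alpha of .05.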

Results and Discussion

As can be seen in Table 1, the test for variance when using normal data (see bottom row of Table 1) works well, as theory dictates it should. Because normal samples were included for each of the numbers of categories of simulated Likert data, the five estimates of empirical Type I errors were averaged so that these estimates are based on 50,000 iterations each. With normal samples, the empirical Type I error rate differed from nominal alpha by at most one digit in the third decimal place. As explained above, the fit of chi-square to ratios of this type is claimed to be accurate only for normal data, and users are often warned that the fit may be quite inaccurate for nonnormal data (e.g., Cohen et al., 2001). One can see in the next to the bottom row of Table 1 that the fit is quite poor for rectangular data. The observed Type I error rate is at best 58% of the nominal alpha of .05 ([100 × .029]/.05). The fit is even worse for categorical data: The fewer the categories, the poorer the fit. With few exceptions, as the numbers of categories increased, the inaccuracy approached that of rectangular data but in no case approached the accurate fit for normal data. More important, the accuracy of the chi-square-based test for rWG almost always becomes worse rather than better as the sample size increases.

Despite the inaccuracies with the chi-square for testing the significance of rWG, to conclude that using chi-square for this purpose is never worthwhile is premature (cf. Lindell, 2002). For example, Lindell argued that despite the problems associated with chi-square tests of rWG, it could still be meaningfully used when testing the statistical significance of raters’ agreement on multiple-item measures. Nevertheless, we demonstrated that a fundamental problem exists with the fit of the chi-square distribution to nonnormal data that is not likely to be easily ameliorated nor is it diminished by a large N. Therefore, we present an alternative test strategy in the following section for use with interrater agreement indexes.

Table 1
Empirical Probabilities That the Ratio of the Sum of Squares/Population Variance (Test of rWG for Categorical Data) for Categorical, Rectangular, and Normal Data Is Less Than the Lower Tail Critical Value of Chi-Square as Functions of Sample Size, Number of Categories, and Nominal Alpha Level

                 Nominal α (N = 5)        Nominal α (N = 10)       Nominal α (N = 20)       Nominal α (N = 50)
Data type        .01   .05   .10   .20    .01   .05   .10   .20    .01   .05   .10   .20    .01   .05   .10   .20      γ2
Categories
  2             .063  .062  .063  .062   .002  .002  .021  .021   .000  .000  .002  .012   .000  .000  .000  .002   −2.00
  3             .012  .012  .012  .094   .001  .013  .037  .048   .001  .004  .016  .064   .000  .002  .010  .055   −1.50
  5             .002  .040  .060  .165   .002  .016  .035  .097   .001  .008  .027  .084   .000  .005  .019  .078   −1.30
  7             .004  .035  .058  .117   .002  .015  .039  .095   .001  .008  .027  .089   .000  .006  .002  .085   −1.25
  11            .007  .031  .057  .130   .002  .015  .038  .102   .001  .009  .028  .090   .000  .006  .023  .088   −1.22
Rectangular     .006  .029  .061  .131   .002  .016  .040  .103   .001  .010  .030  .094   .000  .007  .025  .090   −1.20
Normal          .010  .050  .101  .200   .010  .050  .100  .200   .010  .051  .100  .199   .010  .050  .100  .200    0.00

Note. Empirical Type I error rates for categorical data are based on 10,000 iterations; rates for rectangular and normal data are based on 50,000 iterations. The γ2 statistic in the final column measures kurtosis; negative values represent platykurtosis.

The Approximate Randomization Test

We discuss Monte Carlo procedures as a tool for creating an alternative test strategy, which can be used to test the statistical significance of interrater agreement indexes without succumbing to problems associated with nonnormal data. That is, although Monte Carlo methods have often been narrowly conceived as a tool for assessing statistical significance tests, we argue that Monte Carlo simulations can also be used as a statistical significance test in the form of an approximate randomization test. Rather than using a parametric process based on assumed asymptotic properties of the statistics and their sampling distributions, one can use a Monte Carlo computer simulation program to count how often the test statistic under the null hypothesis in question equals or exceeds its originally obtained value.

One way in which simulation procedures can be used as a test of statistical significance is to create an exact randomization test. The exact test was invented by Fisher and serves as a benchmark against which to judge the performance of parametric alternative tests (Edgington, 1980). The exact test is based on generating all possible tables of N subjects responding in c categories. The test statistic can then be computed for each table. If that value is more extreme than the actual obtained outcome, the multinomial probability of that table can be computed with the given hypothetical category probabilities. By summing the multinomial probabilities of all tables for which the computed statistic is as extreme or more extreme than the obtained outcome, one gets the exact probability for the randomization test. For a program to perform such computations for multinomial tables, see Dunlap, Myers, and Silver (1984). The problem with the exact randomization test is that with very reasonable values of the sample size, N, and the number of categories, c, the number of outcomes that must be generated is enormous.
That number is (N + c − 1)!/[N!(c − 1)!]. For N = 20 and a seven-category Likert scale, this number is 230,230; for N = 50 and a five-category Likert scale, this number is 316,251. For larger sample sizes and numbers of categories, the number of outcomes that must be enumerated becomes simply unworkable. An easy solution to this problem is to use an approximate randomization test. For this test, instead of systematically enumerating every possible outcome, one selects a relatively large number of samples (e.g., 100,000) and randomly generates data sets under the null hypothesis. One then simply counts the number of outcomes that are as extreme or more extreme than the original observed value. Thus, one estimates how rarely such a value or a more extreme value occurs by chance alone. Because of the advantages of using an approximate randomization test, we developed Monte Carlo programs on the basis of this type of test.
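Both ideas can be illustrated with a short Python sketch, offered only as an approximation of the approach described above. The function names are hypothetical, rWG is taken as 1 minus the ratio of the observed variance to the uniform variance, and the sketch uses 20,000 samples rather than 100,000 simply to run quickly.

```python
import math
import random


def number_of_tables(n, c):
    """Number of distinct ways n raters can be distributed over c categories."""
    return math.comb(n + c - 1, c - 1)


def r_wg(ratings, c):
    """Single-item rWG: 1 minus observed variance over uniform variance."""
    n = len(ratings)
    mean = sum(ratings) / n
    s2 = sum((x - mean) ** 2 for x in ratings) / (n - 1)
    return 1.0 - s2 / ((c ** 2 - 1) / 12.0)


def approx_randomization_p(ratings, c, samples=20_000, seed=1):
    """Proportion of uniform random samples whose rWG is at least as large as
    the observed rWG (a large rWG means strong agreement)."""
    rng = random.Random(seed)
    n = len(ratings)
    observed = r_wg(ratings, c)
    as_extreme = 0
    for _ in range(samples):
        simulated = [rng.randint(1, c) for _ in range(n)]
        # Small tolerance so that exact ties count as "as extreme".
        if r_wg(simulated, c) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / samples


print(number_of_tables(20, 7), number_of_tables(50, 5))

# Hypothetical ratings on a 5-point scale (illustration only).
p_agree = approx_randomization_p([4, 4, 5, 4, 4, 5, 4, 4, 4, 5], c=5)
p_spread = approx_randomization_p([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], c=5)
print(p_agree, p_spread)
```

The first set of ratings clusters on two adjacent categories and should yield a very small estimated probability, whereas the second set is spread evenly over the scale and should not approach significance.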

An Empirically Based Significance Test for Interrater Agreement

Two separate but symmetrical programs were created in FORTRAN, one for each of the interrater agreement indexes of interest (i.e., rWG and AD). To generate the random samples for the approximate randomization test, we used subroutine RNUND from the IMSL (1982) package. This subroutine generates pseudorandom numbers from a discrete uniform distribution with c parameters (categories) in which the probability of each category is 1/c; hence, each category is equally likely or uniform. The program input included the sample size and the number of categories. Random outcomes for 100,000 samples were generated and stored. They were then sorted such that the user could determine values of the statistic that were sufficiently rare (i.e., occurring with probability less than .05).
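The generate, store, and sort procedure can be sketched as follows for rWG. This is a hedged Python illustration rather than the authors' program: `random.Random.randint` stands in for the IMSL RNUND subroutine, the function names are hypothetical, 20,000 samples are used instead of 100,000 for speed, and the handling of ties (the critical value is the smallest simulated value whose upper-tail proportion is below alpha) is an assumption about how "sufficiently rare" is determined.

```python
import random


def simulate_rwg(n, c, rng):
    """rWG for one sample of n ratings drawn uniformly from 1..c."""
    sample = [rng.randint(1, c) for _ in range(n)]
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return 1.0 - s2 / ((c ** 2 - 1) / 12.0)


def critical_rwg(n, c, alpha=0.05, samples=20_000, seed=1):
    """Smallest simulated rWG value v such that the proportion of simulated
    values >= v is below alpha; None if even the largest value is not that rare."""
    rng = random.Random(seed)
    # Round so that floating-point noise does not split tied values.
    values = sorted((round(simulate_rwg(n, c, rng), 10) for _ in range(samples)),
                    reverse=True)
    critical = None
    i = 0
    while i < samples:
        j = i
        while j < samples and values[j] == values[i]:
            j += 1  # after the loop, j is the count of values >= values[i]
        if j / samples < alpha:
            critical = values[i]
            i = j
        else:
            break
    return critical


print(critical_rwg(10, 5))  # N = 10 raters, 5-point scale
print(critical_rwg(6, 2))   # N = 6 raters, 2-point scale
print(critical_rwg(3, 2))   # too few raters for any outcome to be rare
```

Because the estimates come from random samples, values produced this way will vary slightly from run to run and from any tabled values.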

Results and Discussion

The results of these tests are shown in Tables 2 and 3. Table 2 displays the critical values for rWG for various combinations of sample size and number of categories, in which

$$r_{WG} = 1 - (\sigma_o^2 / \sigma_u^2), \tag{3}$$

and

$$\sigma_o^2 = \sum (x - \bar{x})^2 / (N - 1), \tag{4}$$

where x is the category number of each subject’s response, N equals the number of subjects, and $\sigma_u^2$ is defined as in Equation 1. In the final column of Table 2 ($\chi^2$) are the critical values of rWG for each sample size that would be significant by the chi-square-based test. The chi-square test (Equation 2) can be written as

$$\chi^2_{(N-1)} = (N - 1)(1 - r_{WG}), \tag{5}$$

by substituting Equation 3. Solving for rWG,

$$r_{WG} = 1 - \chi^2_{(N-1).05} / (N - 1), \tag{6}$$

where $\chi^2_{(N-1).05}$ is the critical lower tail value of the chi-square distribution. It should be clear that the chi-square critical values are

Table 2
Critical Values of the rWG Statistic at the 5% Level of Statistical Significance

                                                 No. of categories
   N     2     3     4     5     6     7     8     9    10    11    12    13    14    15    20    25    30    χ²
   3     —     —     —  1.00  1.00  1.00  1.00  1.00  1.00  1.00   .97   .97   .97   .98   .96   .95   .95   .95
   4     —  1.00  1.00  1.00   .91   .91   .87   .90   .88   .87   .87   .88   .86   .86   .86   .86   .85   .88
   5     —  1.00   .84   .85   .83   .80   .77   .77   .78   .78   .77   .77   .77   .77   .76   .76   .76   .82
   6  1.00   .75   .76   .72   .72   .70   .71   .70   .69   .70   .70   .69   .69   .70   .69   .69   .69   .77
   7  1.00   .78   .73   .67   .66   .67   .64   .65   .64   .64   .64   .63   .63   .63   .63   .63   .63   .73
   8  1.00   .68   .60   .61   .61   .59   .59   .58   .59   .59   .59   .58   .58   .58   .59   .58   .58   .69
   9   .56   .62   .58   .57   .56   .55   .55   .55   .55   .55   .55   .55   .55   .55   .55   .55   .55   .66
  10   .60   .52   .50   .53   .52   .51   .52   .52   .52   .52   .52   .52   .52   .52   .52   .52   .52   .63
  11   .64   .45   .48   .47   .49   .49   .49   .49   .49   .49   .49   .49   .49   .49   .49   .49   .49   .60
  12   .39   .42   .49   .44   .47   .46   .46   .46   .46   .46   .46   .46   .46   .46   .46   .46   .47   .58
  13   .44   .40   .42   .43   .43   .44   .43   .44   .44   .44   .44   .44   .44   .44   .44   .44   .44   .56
  14   .47   .40   .42   .41   .41   .42   .42   .42   .42   .42   .42   .42   .43   .42   .42   .43   .43   .55
  15   .31   .40   .41   .40   .40   .41   .41   .41   .41   .41   .41   .41   .41   .41   .41   .41   .41   .53
  16   .35   .36   .39   .38   .38   .39   .39   .39   .39   .40   .39   .39   .39   .40   .40   .39   .40   .52
  17   .38   .35   .35   .37   .37   .38   .38   .38   .38   .38   .38   .38   .38   .38   .38   .38   .38   .50
  18   .27   .33   .35   .35   .36   .36   .36   .37   .37   .37   .37   .37   .37   .37   .37   .37   .37   .49
  19   .30   .32   .35   .34   .35   .36   .35   .36   .36   .36   .36   .36   .36   .36   .36   .36   .36   .48
  20   .21   .30   .32   .33   .34   .34   .35   .35   .35   .35   .35   .35   .35   .35   .35   .35   .35   .47
  25   .16   .27   .28   .30   .30   .30   .31   .31   .31   .31   .31   .31   .31   .31   .31   .31   .31   .42
  30   .13   .24   .26   .27   .28   .28   .28   .28   .28   .28   .28   .28   .28   .28   .28   .28   .28   .39
  35   .11   .22   .24   .25   .26   .26   .26   .26   .26   .26   .26   .26   .26   .26   .26   .26   .26   .36
  40   .10   .20   .23   .23   .24   .24   .24   .24   .24   .24   .24   .24   .24   .24   .24   .24   .24   .34
  50   .08   .18   .20   .21   .21   .21   .21   .22   .22   .22   .22   .22   .22   .22   .22   .22   .22   .31
  70   .05   .15   .17   .18   .18   .18   .18   .18   .18   .19   .19   .19   .19   .19   .19   .19   .19   .26
 100   .04   .13   .14   .15   .15   .15   .15   .15   .16   .16   .16   .16   .16   .16   .16   .16   .16   .22
 150   .02   .10   .11   .12   .12   .13   .13   .13   .13   .13   .13   .13   .13   .13   .13   .13   .13   .18

Note. Observed rWG values equal to or greater than the values in the table are significant. The last column (χ²) gives the rWG value significant by the earlier inaccurate χ² test. Dashes in cells indicate that data were not obtained.

Table 3
Critical Values of the Average Deviation (AD) Statistic at the 5% Level of Statistical Significance

                                                         No. of categories
   N       2      3      4      5      6      7      8      9     10     11     12     13     14     15     20     25     30   Inf.
      (0.33) (0.50) (0.67) (0.83) (1.00) (1.17) (1.33) (1.50) (1.67) (1.83) (2.00) (2.17) (2.33) (2.50) (3.33) (4.17) (5.00) (.167)
   3       —      —      —    .00    .00    .00    .00    .00    .00    .00    .44    .44    .44    .44    .89   1.11   1.33   .052
   4       —    .00    .00    .00    .40    .40    .50    .50    .75    .75    .88    .88   1.13   1.13   1.63   1.88   2.38   .084
   5       —    .00    .32    .40    .48    .64    .72    .88    .96   1.04   1.20   1.28   1.44   1.52   2.00   2.56   3.12   .105
   6     .00    .28    .44    .56    .61    .78    .94   1.11   1.17   1.28   1.39   1.50   1.61   1.78   2.39   2.94   3.61   .121
   7     .00    .29    .41    .61    .74    .86   1.02   1.14   1.31   1.43   1.55   1.67   1.84   1.96   2.61   3.31   3.96   .133
   8     .00    .38    .50    .69    .81    .97   1.09   1.22   1.38   1.53   1.69   1.84   1.97   2.10   2.84   3.53   4.22   .143
   9     .20    .40    .57    .72    .86   1.01   1.16   1.31   1.46   1.63   1.78   1.93   2.07   2.20   2.96   3.73   4.47   .150
  10     .18    .42    .56    .74    .92   1.08   1.22   1.38   1.54   1.70   1.86   1.98   2.18   2.30   3.10   3.88   4.68   .156
  11     .17    .46    .63    .79    .94   1.11   1.27   1.44   1.60   1.75   1.92   2.08   2.23   2.40   3.21   4.03   4.79   .162
  12     .28    .44    .65    .82    .99   1.15   1.32   1.49   1.65   1.82   1.99   2.15   2.32   2.49   3.32   4.14   4.97   .166
  13     .26    .46    .65    .83   1.01   1.17   1.34   1.50   1.68   1.85   2.02   2.19   2.37   2.54   3.38   4.24   5.08   .170
  14     .24    .46    .66    .85   1.02   1.20   1.38   1.56   1.70   1.89   2.06   2.23   2.42   2.58   3.44   4.32   5.19   .173
  15     .32    .49    .68    .87   1.05   1.22   1.40   1.56   1.75   1.94   2.10   2.28   2.47   2.63   3.52   4.39   5.28   .177
  16     .30    .49    .70    .88   1.06   1.24   1.43   1.60   1.77   1.95   2.13   2.32   2.49   2.68   3.57   4.46   5.37   .179
  17     .36    .50    .71    .89   1.07   1.25   1.44   1.63   1.80   1.99   2.17   2.35   2.53   2.71   3.62   4.53   5.45   .181
  18     .35    .52    .72    .91   1.09   1.28   1.45   1.64   1.83   2.01   2.20   2.38   2.55   2.75   3.66   4.59   5.51   .183
  19     .33    .50    .73    .91   1.10   1.29   1.47   1.66   1.85   2.03   2.22   2.40   2.59   2.79   3.70   4.64   5.57   .186
  20     .38    .52    .73    .92   1.11   1.29   1.49   1.68   1.86   2.05   2.24   2.44   2.62   2.80   3.75   4.69   5.63   .188
  25     .40    .53    .77    .96   1.16   1.36   1.56   1.75   1.94   2.14   2.33   2.53   2.73   2.92   3.89   4.88   5.86   .195
  30     .42    .56    .80    .98   1.20   1.40   1.60   1.80   2.00   2.20   2.40   2.60   2.80   3.00   4.02   5.01   6.03   .200
  35     .43    .56    .81   1.01   1.22   1.42   1.63   1.84   2.04   2.25   2.45   2.66   2.86   3.06   4.09   5.12   6.13   .205
  40     .44    .57    .83   1.02   1.24   1.45   1.66   1.86   2.07   2.28   2.49   2.70   2.91   3.11   4.15   5.19   6.23   .208
  50     .45    .57    .86   1.04   1.27   1.48   1.70   1.91   2.12   2.34   2.55   2.77   2.98   3.19   4.25   5.32   6.39   .213
  70     .47    .59    .88   1.07   1.31   1.52   1.75   1.96   2.18   2.40   2.62   2.85   3.06   3.28   4.38   5.48   6.47   .219
 100     .48    .61    .90   1.09   1.35   1.55   1.79   2.01   2.24   2.46   2.69   2.91   3.14   3.36   4.49   5.61   6.73   .224
 150     .48    .62    .92   1.11   1.38   1.59   1.83   2.05   2.29   2.52   2.75   2.98   3.21   3.44   4.59   5.73   6.88   .229

Note. Observed AD values that are equal to or less than the values in the table are significant. Values in bold indicate significant AD values that are approximately equal to the cutoff values for each category number. Observed AD values that are equal to or less than the cutoff values are practically significant. Numbers in parentheses refer to the practical cutoff values for AD. Dashes in cells indicate that data were not obtained. Inf. = infinity.

considerably larger than the approximate randomization (Monte Carlo) values, confirming the conservative bias of the earlier chi-square tests. The probabilities of the tabled rWG values were less than 5% given a true null hypothesis; thus, observed values that are equal to or greater than the tabled values are significant. With respect to Table 2, there are two noteworthy points to be made. First, Table 2 indicates that the critical values for the rWG statistic change mainly as a function of sample size and rather little as a function of the number of categories. Second, the critical values in Table 2 are more variable (less stable) when the sample size is small and the number of categories is small; this finding is a result of the discrete nature of the multinomial distribution under such conditions. Table 3 displays the critical AD values, in which

$$AD = \sum |x - \bar{x}| / N. \tag{7}$$

Like the critical rWG values, the probabilities of the critical (tabled) AD values were less than 5% given a true null hypothesis. However, unlike the rWG values, because AD is a measure of disagreement, observed AD values less than or equal to critical values are significant. Note that the first value equaling or exceeding c/6 in magnitude (a rule-of-thumb for establishing a cutoff for AD; Burke & Dunlap, 2002) is printed in bold in Table 3. As can be seen, when N is 13 or less, AD values that are statistically significant (p ≤ .05) are also practically significant (i.e., less than c/6) for all Likert scales with five or more categories. When N is greater than 13 and c is at least 5, statistically significant AD values can exceed the cutoff for practical significance.

To give an example of how AD and its critical values for statistical and practical significance can be applied, consider the following problem. Five raters attempt to code an organizational climate variable in a particular study for a meta-analysis by using a 5-point rating scale (negative to positive climate). Three of the raters mark the climate found in the study as a 5, or the most positive organizational climate. The other two raters mark the climate as a 3 and a 4, respectively. For this problem, AD = .72; thus, according to Table 3, there is no statistically significant agreement (as the observed value exceeds the critical value of .40), but there is practically significant agreement (as the observed value is less than the cutoff of c/6 = .83). If the meta-analytic researcher only had an a priori decision rule of including a study (and variable) in the meta-analysis when the coders significantly agreed on the variable coding, then this study would not have been included in the meta-analysis involving organizational climate. Of course, researchers may wish to rely only on practical significance for interrater agreement problems, including the above example.
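The arithmetic of the five-rater example can be checked with a brief Python sketch. The function and variable names are hypothetical; the critical value of .40 is the Table 3 entry for N = 5 and five categories, and deviations are taken from the mean, as in Equation 7.

```python
def average_deviation(ratings):
    """AD index: mean absolute deviation of the ratings from their mean."""
    n = len(ratings)
    mean = sum(ratings) / n
    return sum(abs(x - mean) for x in ratings) / n


ratings = [5, 5, 5, 3, 4]      # the five coders' ratings on a 5-point scale
c = 5                          # number of response categories
ad = average_deviation(ratings)

statistical_critical = 0.40    # Table 3 entry for N = 5, c = 5
practical_cutoff = c / 6       # rule-of-thumb cutoff, about .83

print(round(ad, 2))                 # AD for the example
print(ad <= statistical_critical)   # statistically significant agreement?
print(ad <= practical_cutoff)       # practically significant agreement?
```

Run as written, the sketch should reproduce the conclusion in the text: the observed AD exceeds the statistical critical value but falls below the practical cutoff.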
We should note that Burke and Dunlap (2002, p. 168) ambiguously described the interpretation of the practical and statistical significance of observed AD values in an earlier, less complete version of Table 3 (i.e., Table 4 of Burke & Dunlap, 2002). The reader is also referred to Smith-Crowe and Burke (2003) for a more detailed discussion of this latter issue. The final column of Table 3 (Inf. = infinity) refers to AD values calculated from data in which the potential number of categories is infinite. An example would be a task in which a subject is offered a score continuum represented by a line of unit length and is asked to mark a point along this line that best represents his or her response. Therefore, the scores represent proportions of total length (a fraction between 0 and 1).

Besides addressing the problem of determining whether interrater agreement index values are statistically significant, we have highlighted a novel and useful application of Monte Carlo procedures, that of creating approximate randomization tests for hypotheses that may be intractable for solution by parametric procedures. Certainly, this application of Monte Carlo procedures can be widely generalized for use regarding any statistic.

General Discussion

The purpose of this article was to demonstrate the inappropriateness of applying chi-square-based tests to the problem of determining whether interrater agreement index values are statistically significant. In addition to pointing out the problem with chi-square tests to applied psychologists, we have provided them with a solution to the problem: an empirical significance test for interrater agreement. In general, there exists a broad range of interrater agreement problems in applied psychology in which multiple raters evaluate characteristics of a single target, such as a job, group, or organization. Heretofore, an accurate statistical significance test of interrater agreement has not been available. The results of this study provide an accurate statistical significance test at the item level for researchers dealing with these types of problems and studies. More important, to assist researchers in making statistical tests of interrater agreement results, we have made available a downloadable software program that can be used to calculate item-level rWG and AD values and to test each for statistical significance. This program, labeled “agree.exe,” is available at the following World Wide Web address: http://www.tulane.edu/~dunlap/psylib.html. These tools are intended not only to assist behavioral scientists in making more informed judgments about the use of individual-level data as indicators of group-level constructs but also to assist researchers and practitioners with other types of interrater agreement problems in applied psychology and management.

References

Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation (AD) index: A user’s guide. Organizational Research Methods, 5, 159–172.
Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. Organizational Research Methods, 2, 49–68.
Burke, M. J., Rupinski, M. T., Dunlap, W. P., & Davison, H. K. (1996). Do situational variables act as substantive causes of relationships between individual difference variables? Two large-scale tests of “common cause” models. Personnel Psychology, 49, 573–598.
Button, S. B. (2001). Organizational efforts to affirm sexual diversity: A cross-level examination. Journal of Applied Psychology, 86, 17–28.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of compositional models. Journal of Applied Psychology, 83, 234–236.
Chatman, J. A., & Flynn, F. J. (2001). The influence of demographic heterogeneity on the emergence and consequences of cooperative norms in work teams. Academy of Management Journal, 44, 956–974.
Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rwg(J) index of agreement. Psychological Methods, 6, 297–310.
Demerouti, E., Bakker, A. B., Nachreiner, F., & Schaufeli, W. B. (2001). The job demands-resources model of burnout. Journal of Applied Psychology, 86, 499–512.
Dirks, K. T. (2000). Trust in leadership and team performance: Evidence from NCAA basketball. Journal of Applied Psychology, 85, 1004–1012.
Dunlap, W. P., Myers, L., & Silver, N. C. (1984). Exact multinomial probabilities for one-way contingency tables. Behavior Research Methods, Instruments, and Computers, 16, 54–56.
Edgington, E. S. (1980). Randomization tests. New York: Marcel Dekker.
Hayes, W. L. (1973). Statistics for the social sciences (2nd ed.). New York: Holt, Rinehart, & Winston.
IMSL. (1982). International mathematical and statistical libraries reference manual (9th ed.). Houston, TX: Author.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
Judge, T. A., & Bono, J. E. (2000). Five-factor model of personality and transformational leadership. Journal of Applied Psychology, 85, 751–765.
Klein, K. J., Conn, A. B., Smith, D. B., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of within-group agreement in employee perceptions of the work environment. Journal of Applied Psychology, 86, 3–16.
Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 3–90). San Francisco: Jossey-Bass.
Lawlis, G. F., & Lu, E. (1972). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17–20.
Lindell, M. K. (2002, April). Analyzing interrater agreement with and without disattenuation. Poster session presented at the annual meeting of the Society for Industrial and Organizational Psychology, Toronto, Ontario, Canada.
Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied Psychological Measurement, 21, 271–278.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of the CVI, T, rWG(J), and r*WG(J) indexes. Journal of Applied Psychology, 84, 640–647.
Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.
Masterson, S. S. (2001). A trickle-down model of organizational justice: Relating employees’ and customers’ perceptions of and reactions to fairness. Journal of Applied Psychology, 86, 594–604.
Maurer, T. J., & Alexander, R. A. (1992). Methods of improving employment test critical scores derived by judging test content: A review and critique. Personnel Psychology, 45, 727–762.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). The impact of organizational climate on safety climate and individual behavior. Safety Science, 34, 99–109.
Schminke, M., Ambrose, M. L., & Cropanzano, R. S. (2000). The effect of organizational structure on perceptions of procedural fairness. Journal of Applied Psychology, 85, 294–304.
Schneider, B., White, S. S., & Paul, M. C. (1998). Linking service climate and customer perceptions of service quality: Test of a causal model. Journal of Applied Psychology, 83, 150–163.
Smith-Crowe, K., & Burke, M. J. (2003). Interpreting the statistical significance of observed AD interrater agreement values: Correction to Burke and Dunlap (2002). Organizational Research Methods, 6, 129–131.
Tesluk, P. E., Farr, J. L., Mathieu, J. E., & Vance, R. J. (1995). Generalization of employee involvement training: Individual and situational effects. Personnel Psychology, 48, 607–632.
Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22, 358–376.
Totterdell, P. (2000). Catching moods and hitting runs: Mood linkage and subjective performance in professional sport teams. Journal of Applied Psychology, 85, 848–859.
Zohar, D. (2000). A group-level model of safety climate: Testing the effect of group climate on microaccidents in manufacturing jobs. Journal of Applied Psychology, 85, 587–596.

Received December 18, 2001
Revision received May 15, 2002
Accepted May 23, 2002

Accurate Tests of Statistical Significance for rWG and Average Deviation Interrater Agreement Indexes

William P. Dunlap, Michael J. Burke, and Kristin Smith-Crowe
Tulane University

The authors demonstrated that the most common statistical significance test used with rWG-type interrater agreement indexes in applied psychology, based on the chi-square distribution, is flawed and inaccurate. The chi-square test is shown to be extremely conservative even for modest, standard significance levels (e.g., .05). The authors present an alternative statistical significance test, based on Monte Carlo procedures, that produces the equivalent of an approximate randomization test for the null hypothesis that the actual distribution of responding is rectangular and demonstrate its superiority to the chi-square test. Finally, the authors provide tables of critical values and offer downloadable software to implement the approximate randomization test for rWG-type and for average deviation (AD)-type interrater agreement indexes. The implications of these results for studying a broad range of interrater agreement problems in applied psychology are discussed.

Over the past decade, researchers in the fields of applied psychology, occupational health psychology, and management have been addressing a number of problems that require knowledge of interrater agreement, that is, the extent to which raters assign the same ratings to a single target (e.g., Burke, Rupinski, Dunlap, & Davison, 1996; Chatman & Flynn, 2001; Neal, Griffin, & Hart, 2000; Schneider, White, & Paul, 1998; Tesluk, Farr, Mathieu, & Vance, 1995). In particular, the applied psychology literature has mushroomed in the past 2 years with respect to assessments of interrater agreement (e.g., see Button, 2001; Demerouti, Bakker, Nachreiner, & Schaufeli, 2001; Dirks, 2000; Judge & Bono, 2000; Klein, Conn, Smith, & Sorra, 2001; Lindell & Brandt, 2000; Masterson, 2001; Schminke, Ambrose, & Cropanzano, 2000; Totterdell, 2000; Zohar, 2000). In the latter studies, assessments of interrater agreement were made for determining whether aggregated individual-level scores can be used as indicators of group-level constructs, such as organizational climate, group efficacy, and team effectiveness. Along with a composition argument for justifying the aggregation of individual-level data (cf. Chan, 1998), a demonstration of interrater agreement provides the measurement justification for using aggregated individual-level data (e.g., the group’s mean) as indicators of group-level constructs (Kozlowski & Klein, 2000). In addition, there are numerous problems in applied psychology and management that appear to call for assessing interrater agreement results in both a measurement sense and in terms of statistical significance, such as job analysts’ ratings of task items for a single job, judges’ ratings of critical or cutoff scores (e.g., using the Angoff method; see Maurer & Alexander, 1992) on the items of a test, or customers’ ratings of the service of a particular store.

The most widely used index of interrater agreement on Likert-type scales has been James, Demaree, and Wolf’s (1984) rWG index. James et al.’s index is based on the ratio of actual variance in ratings to the theoretical variance of ratings given a null distribution. A limitation of the rWG index is that the appropriate specification of the null distribution is debatable. Although James et al. recommended using the uniform distribution (i.e., the distribution in which each response category is equally likely), Lindell and Brandt (1997) recommended using maximum dissensus. Burke, Finkelstein, and Dusig (1999), however, cautioned against the use of maximum dissensus because its use may lead to the overestimation of interrater agreement. Because of interpretability problems of this nature with the rWG index, Burke et al. proposed the average deviation (AD) index, which is computed by finding the absolute deviation of each rating from the mean or median of the group rating and then averaging the deviations. In contrast to rWG-type indexes, AD indexes can be interpreted in terms of the actual categories of the Likert scale used. Because these two indexes (i.e., rWG and AD) can yield somewhat different results, Burke and colleagues (Burke & Dunlap, 2002; Burke et al., 1999) recommended that both indexes be used. Of course, in addition to determining what interrater agreement index to use, one must determine how it can best be employed.

Two basic issues to consider when deciding how to use an interrater agreement index are, first, whether an index indicates that interrater agreement is sufficiently strong or disagreement is sufficiently weak so that one can trust that the average opinion of a group is interpretable or representative, and second, whether the apparent agreement for the group is sufficiently different from chance agreement so that one can conclude that some agreement exists regardless of its magnitude. These issues are not necessarily mutually exclusive and, as noted above, both may be relevant for some practical interrater agreement problems. The first issue is probably the most important in deciding whether one should conclude that a reasonable consensus exists for a group to aggregate individual-level data to the group level of analysis. Ensuring that the interrater agreement results for any given interrater agreement index meet the second criterion is also helpful for making decisions about data aggregation and is necessary for hypothesis testing purposes (i.e., ensuring that agreement is significantly different from chance responding given the size of the group).

Recently, researchers have discussed how the bootstrapping method (Cohen, Doveh, & Eick, 2001) or chi-square test (Lindell & Brandt, 1999) can be employed to test the statistical significance of rWG-type indexes. Alternatively, Burke and Dunlap (2002) proposed the use of an approximate randomization test for testing the statistical significance of AD interrater agreement results. Statistical significance tests for assessing interrater agreement are the focus of the present article.

The remainder of this article unfolds as follows. First, we review the statistical logic behind chi-square-based tests of significance for interrater agreement indexes. Next, we demonstrate that, although a chi-square-based test would work well for normal data, the use of chi-square to test issues regarding variance relative to a uniform (i.e., rectangular) distribution, as is done for tests of interrater agreement with rWG, is very sensitive to nonnormality. Therefore, chi-square tests perform poorly for hypothetical rectangular populations and are often very inaccurate for rWG interrater agreement indexes. Subsequently, we develop the rationale for an empirical, or approximate randomization, test of the null hypothesis of rectangular (chance) responding and demonstrate its superiority to the chi-square-based tests. Finally, we present both tables of critical values and a downloadable program that operates efficiently on a personal computer to compute measures of interrater agreement for rWG and AD indexes at the item level and to perform the superior, approximate randomization significance test.

William P. Dunlap and Kristin Smith-Crowe, Psychology Department, Tulane University; Michael J. Burke, A. B. Freeman School of Business, Tulane University. William P. Dunlap passed away on February 28, 2002, during the completion of this article. His Web site that contains the interrater agreement program reported in this study will be maintained at www.tulane.edu/~dunlap/psylib.html. Correspondence concerning this article should be addressed to Michael J. Burke, A. B. Freeman School of Business, Tulane University, New Orleans, Louisiana 70118. E-mail: [email protected]

Chi-Square Tests of Significance for Interrater Agreement Indexes

Traditionally, chi-square has been the statistic of choice for testing the statistical significance of interrater agreement (e.g., Lawlis & Lu, 1972; Tinsley & Weiss, 1975). Regarding interrater agreement indexes, the null hypothesis tested by chi-square is that there is no agreement among raters in their rating of an item above and beyond what would be expected by chance or random responding. Thus, a significantly small chi-square indicates that disagreement among the raters is less than what would be expected by chance alone. Lindell and Brandt (1999) recently applied a chi-square test to the problem of determining whether an rWG value is statistically significant. Given N raters, chi-square is equal to N − 1 multiplied by the observed variance ($\sigma_o^2$) in the ratings and divided by the theoretical variance of the uniform distribution ($\sigma_u^2$), with the degrees of freedom being equal to N − 1. The variance of the uniform distribution is equal to

$$\sigma_u^2 = (c^2 - 1)/12, \tag{1}$$

where c = the number of Likert categories or response options, and chi-square is equal to

$$\chi^2_{(N-1)} = (N - 1)\,\sigma_o^2/\sigma_u^2. \tag{2}$$

Unfortunately, chi-square, as a distribution of variance, is not robust against violations of normality (Cohen et al., 2001; Hayes, 1973; Tinsley & Weiss, 1975). In fact, variances, which are central to interrater agreement indexes, are presumed to be sampled from a nonnormal distribution, that is, the rectangular multinomial distribution. To give a practical example, when one’s data are categorical (as they are when using Likert scales), agreement among raters (e.g., most raters assigning a 4 or a 5 out of one to five possible categories to a target or item) is tested against the hypothesis that all categories are equally likely. In this hypothetical situation, a significant chi-square is rendered inaccurate because of platykurtosis in the hypothesized rectangular distribution. Furthermore, chi-square typically increasingly underestimates the stated alpha as either sample size increases or the number of categories decreases. For these reasons, we hold that the application of chi-square to the problem of testing the statistical significance of an interrater agreement index is fundamentally flawed. The first empirical section of the present article was designed to demonstrate that chi-square would be an accurate test if the data were normal but in general becomes highly inaccurate when data are assumed to be sampled from a nonnormal (rectangular) population.
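As a minimal sketch of Equations 1 and 2 (not the authors' FORTRAN program), the chi-square statistic and the corresponding single-item rWG value can be computed from a set of ratings as follows. The function names and the example ratings are illustrative assumptions.

```python
from statistics import variance


def uniform_variance(c):
    """Variance of a discrete uniform distribution over c categories (Equation 1)."""
    return (c ** 2 - 1) / 12


def chi_square_statistic(ratings, c):
    """Equation 2: (N - 1) * observed variance / uniform variance."""
    n = len(ratings)
    observed = variance(ratings)  # sample variance with divisor N - 1
    return (n - 1) * observed / uniform_variance(c)


def rwg(ratings, c):
    """Single-item rWG: 1 - observed variance / uniform variance."""
    return 1 - variance(ratings) / uniform_variance(c)


# Hypothetical example: five raters using a 5-point scale.
ratings = [5, 5, 5, 4, 3]
chi2 = chi_square_statistic(ratings, c=5)
agreement = rwg(ratings, c=5)
print(f"chi-square = {chi2:.2f} on {len(ratings) - 1} df, rWG = {agreement:.2f}")
```

Under the chi-square approach, the statistic would then be compared with the lower-tail critical value of chi-square with N − 1 degrees of freedom, which is the step the following sections show to be inaccurate for rectangular data.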

Empirical Type I Error Rates of the Chi-Square Test

Method

A Monte Carlo program was written in FORTRAN to use data simulated via IMSL (1982) subroutines to generate multinomial rectangular data (categorical data with equal population proportions in each category), rectangular data (equal probabilities of fractions in the interval 0–1), and normal data, n(0, 1). Under the null hypothesis of equal response probabilities, one could not get normal data, so these data were studied for comparison purposes only. The program input included the sample size, the number of categories for discrete multinomial data, and the nominal alpha level. The program then generated 10,000 data sets. For each data set, the ratio of the observed sum of squares, $(N - 1)\sigma_o^2$, to the theoretical population variance, $\sigma^2$, was computed. The theoretical population variances were $(c^2 - 1)/12$ for categorical data, 1/12 for rectangular data, and 1 for normal data. The number of times that the observed ratio was less than the lower tail critical value of chi-square was counted for each data type across the 10,000 simulations. For normal data, this comparison should result in a test that the observed variance is significantly smaller than the theoretical variance at the stated level of significance. For categorical data, this test was expected to underestimate the stated level of significance. For illustration purposes, we used 2-, 3-, 5-, 7-, and 11-point categorical scales. The empirical Type I error rate was also computed for rectangular data for comparison purposes, which simulates the limit of the aforementioned test if the number of categories increases without bound.
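The simulation described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' FORTRAN/IMSL program: it covers only the categorical case, uses Python's standard pseudorandom generator, and takes the lower-tail critical value (3.325 for 9 df at alpha = .05) from a standard chi-square table rather than computing it. The function name and seed are hypothetical.

```python
import random

# Lower-tail .05 critical value of chi-square with 9 df, copied from a
# standard chi-square table (an assumption of this sketch).
CHI2_LOWER_05_DF9 = 3.325


def empirical_type_i_error(n, c, critical_value, n_sims=10_000, seed=1):
    """Proportion of uniform-null samples whose SS / uniform variance is below critical_value."""
    rng = random.Random(seed)
    uniform_var = (c ** 2 - 1) / 12
    rejections = 0
    for _ in range(n_sims):
        # One sample of n ratings, each category 1..c equally likely.
        sample = [rng.randint(1, c) for _ in range(n)]
        mean = sum(sample) / n
        sum_of_squares = sum((x - mean) ** 2 for x in sample)
        if sum_of_squares / uniform_var < critical_value:
            rejections += 1
    return rejections / n_sims


rate = empirical_type_i_error(n=10, c=5, critical_value=CHI2_LOWER_05_DF9)
print(f"Empirical Type I error rate at nominal alpha = .05: {rate:.3f}")
```

If the chi-square test were accurate for categorical data, the printed rate would be close to .05; a clearly smaller rate illustrates the conservative bias discussed in this section.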

Results and Discussion

As can be seen in Table 1, the test for variance when using normal data (see bottom row of Table 1) works well, as theory dictates it should. Because normal samples were included for each of the numbers of categories of simulated Likert data, the five estimates of empirical Type I errors were averaged so that these estimates are based on 50,000 iterations each. With normal samples, the empirical Type I error rate differed from nominal alpha by at most one digit in the third decimal place. As explained above, the fit of chi-square to ratios of this type is claimed to be accurate only for normal data, and users are often warned that the fit may be quite inaccurate for nonnormal data (e.g., Cohen et al., 2001). One can see in the next to the bottom row of Table 1 that the fit is quite poor for rectangular data. The observed Type I error rate is at best 58% of the nominal alpha of .05 ([100 × .029]/.05). The fit is even worse for categorical data: The fewer the categories, the poorer the fit. With few exceptions, as the numbers of categories increased, the inaccuracy approached that of rectangular data but in no case approached the accurate fit for normal data. More important, the accuracy of the chi-square-based test for rWG almost always becomes worse rather than better as the sample size increases.

Despite the inaccuracies with the chi-square for testing the significance of rWG, to conclude that using chi-square for this purpose is never worthwhile is premature (cf. Lindell, 2002). For example, Lindell argued that despite the problems associated with chi-square tests of rWG, it could still be meaningfully used when testing the statistical significance of raters’ agreement on multiple-item measures. Nevertheless, we demonstrated that a fundamental problem exists with the fit of the chi-square distribution to nonnormal data that is not likely to be easily ameliorated nor is it diminished by a large N. Therefore, we present an alternative test strategy in the following section for use with interrater agreement indexes.

Table 1
Empirical Probabilities That the Ratio of the Sum of Squares/Population Variance (Test of rWG for Categorical Data) for Categorical, Rectangular, and Normal Data Is Less Than the Lower Tail Critical Value of Chi-Square as Functions of Sample Size, Number of Categories, and Nominal Alpha Level

             Nominal α (N = 5)         Nominal α (N = 10)        Nominal α (N = 20)        Nominal α (N = 50)
Data type    .01   .05   .10   .20     .01   .05   .10   .20     .01   .05   .10   .20     .01   .05   .10   .20     γ2
Categories
  2          .063  .062  .063  .062    .002  .002  .021  .021    .000  .000  .002  .012    .000  .000  .000  .002    −2.00
  3          .012  .012  .012  .094    .001  .013  .037  .048    .001  .004  .016  .064    .000  .002  .010  .055    −1.50
  5          .002  .040  .060  .165    .002  .016  .035  .097    .001  .008  .027  .084    .000  .005  .019  .078    −1.30
  7          .004  .035  .058  .117    .002  .015  .039  .095    .001  .008  .027  .089    .000  .006  .002  .085    −1.25
  11         .007  .031  .057  .130    .002  .015  .038  .102    .001  .009  .028  .090    .000  .006  .023  .088    −1.22
Rectangular  .006  .029  .061  .131    .002  .016  .040  .103    .001  .010  .030  .094    .000  .007  .025  .090    −1.20
Normal       .010  .050  .101  .200    .010  .050  .100  .200    .010  .051  .100  .199    .010  .050  .100  .200     0.00

Note. Empirical Type I error rates for categorical data are based on 10,000 iterations; rates for rectangular and normal data are based on 50,000 iterations. The γ2 statistic in the final column measures kurtosis; negative values represent platykurtosis.

The Approximate Randomization Test

We discuss Monte Carlo procedures as a tool for creating an alternative test strategy, which can be used to test the statistical significance of interrater agreement indexes without succumbing to problems associated with nonnormal data. That is, although Monte Carlo methods have often been narrowly conceived as a tool for assessing statistical significance tests, we argue that Monte Carlo simulations can also be used as a statistical significance test in the form of an approximate randomization test. Rather than using a parametric process based on assumed asymptotic properties of the statistics and their sampling distributions, one can use a Monte Carlo computer simulation program to count how often the test statistic under the null hypothesis in question equals or exceeds its originally obtained value.

One way in which simulation procedures can be used as a test of statistical significance is to create an exact randomization test. The exact test was invented by Fisher and serves as a benchmark against which to judge the performance of parametric alternative tests (Edgington, 1980). The exact test is based on generating all possible tables of N subjects responding in c categories. The test statistic can then be computed for each table. If that value is as extreme or more extreme than the actual obtained outcome, the multinomial probability of that table can be computed with the given hypothetical category probabilities. By summing the multinomial probabilities of all tables for which the computed statistic is as extreme or more extreme than the obtained outcome, one gets the exact probability for the randomization test. For a program to perform such computations for multinomial tables, see Dunlap, Myers, and Silver (1984). The problem with the exact randomization test is that with very reasonable values of the sample size, N, and the number of categories, c, the number of outcomes that must be generated is enormous. That number is (N + c − 1)!/[N!(c − 1)!]. For N = 20 and a seven-category Likert scale, this number is 230,230; for N = 50 and a five-category Likert scale, this number is 316,251. For larger sample sizes and numbers of categories, the number of outcomes that must be enumerated becomes simply unworkable.

An easy solution to this problem is to use an approximate randomization test. For this test, instead of systematically enumerating every possible outcome, one selects a relatively large number of samples (e.g., 100,000) and randomly generates data sets under the null hypothesis. One then simply counts the number of outcomes that are as extreme or more extreme than the original observed value. Thus, one estimates how rarely such a value or a more extreme value occurs by chance alone. Because of the advantages of using an approximate randomization test, we developed Monte Carlo programs on the basis of this type of test.
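The count of tables that an exact test must enumerate is the binomial coefficient (N + c − 1 choose c − 1), so it can be checked directly. The short sketch below uses Python's standard `math.comb`; the function name is an illustrative assumption. Evaluating the formula this way gives 230,230 tables for N = 20 with seven categories and 316,251 for N = 50 with five categories.

```python
from math import comb


def n_possible_tables(n, c):
    """Number of ways n raters can be distributed over c categories:
    (n + c - 1)! / (n! * (c - 1)!)."""
    return comb(n + c - 1, c - 1)


# The two cases discussed in the text, plus a larger hypothetical case
# to show how quickly the count grows.
for n, c in [(20, 7), (50, 5), (100, 7)]:
    print(f"N = {n}, c = {c}: {n_possible_tables(n, c):,} tables")
```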

An Empirically Based Significance Test for Interrater Agreement

Two separate but symmetrical programs were created in FORTRAN, one for each of the interrater agreement indexes of interest (i.e., rWG and AD). To generate the random samples for the approximate randomization test, we used subroutine RNUND from the IMSL (1982) package. This subroutine generates pseudorandom numbers from a discrete uniform distribution with c parameters (categories) in which the probability of each category is 1/c; hence, each category is equally likely or uniform. The program input included the sample size and the number of categories. Random outcomes for 100,000 samples were generated and stored. They were then sorted such that the user could determine values of the statistic that were sufficiently rare (i.e., occurring with a probability of less than .05).
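The procedure just described can be sketched as follows. This is a hedged illustration, not the authors' FORTRAN program: it substitutes Python's `random` module for the IMSL RNUND subroutine, uses fewer samples by default in the example call, and adopts one particular tie-handling rule (the largest tabled value whose cumulative proportion does not exceed alpha), which is an assumption. All function names are hypothetical.

```python
import random


def lower_tail_critical(values, alpha):
    """Largest simulated value v with P(statistic <= v) at most alpha, or None."""
    ordered = sorted(values)
    total = len(ordered)
    critical = None
    i = 0
    while i < total:
        j = i
        while j < total and ordered[j] == ordered[i]:
            j += 1  # count the whole block of tied outcomes together
        if j / total > alpha:
            break
        critical = ordered[i]
        i = j
    return critical


def critical_values(n, c, n_sims=100_000, alpha=0.05, seed=1):
    """Monte Carlo critical values of rWG and AD under uniform (chance) responding."""
    rng = random.Random(seed)
    ss_scaled, ad_scaled = [], []
    for _ in range(n_sims):
        sample = [rng.randint(1, c) for _ in range(n)]
        total = sum(sample)
        # Integer-scaled statistics keep mathematically tied outcomes exactly equal.
        ss_scaled.append(n * sum(x * x for x in sample) - total * total)  # N * sum of squares
        ad_scaled.append(sum(abs(n * x - total) for x in sample))  # N^2 * AD
    ss_crit = lower_tail_critical(ss_scaled, alpha)
    ad_crit = lower_tail_critical(ad_scaled, alpha)
    uniform_var = (c ** 2 - 1) / 12
    # Small variance corresponds to large rWG, so the lower tail of the
    # sum of squares gives the upper-tail critical value of rWG.
    rwg_crit = None if ss_crit is None else 1 - ss_crit / (n * (n - 1)) / uniform_var
    ad_value = None if ad_crit is None else ad_crit / n ** 2
    return rwg_crit, ad_value


rwg_crit, ad_crit = critical_values(n=10, c=5, n_sims=20_000)
print(f"N = 10, c = 5: rWG >= {rwg_crit:.2f} or AD <= {ad_crit:.2f} is significant at .05")
```

For N = 10 and five categories, the estimates should fall close to the tabled values of .53 for rWG and .74 for AD, with small differences expected from sampling error and from the tie-handling assumption.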

Results and Discussion

The results of these tests are shown in Tables 2 and 3. Table 2 displays the critical values for rWG for various combinations of sample size and number of categories, in which

$$r_{WG} = 1 - (\sigma_o^2/\sigma_u^2), \tag{3}$$

and

$$\sigma_o^2 = \sum (x - \bar{x})^2/(N - 1), \tag{4}$$

where x is the category number of each subject’s response, N equals the number of subjects, and $\sigma_u^2$ is defined as in Equation 1. In the final column of Table 2 ($\chi^2$) are the critical values of rWG for each sample size that would be significant by the chi-square-based test. The chi-square test (Equation 2) can be written as

$$\chi^2_{(N-1)} = (N - 1)(1 - r_{WG}), \tag{5}$$

by substituting Equation 3. Solving for rWG,

$$r_{WG} = 1 - \chi^2_{(N-1),.05}/(N - 1), \tag{6}$$

where 2(N ⫺ 1).05 is the critical lower tail value of the chi-square distribution. It should be clear that the chi-square critical values are

Table 2 Critical Values of the rWG Statistic at the 5% Level of Statistical Significance No. of categories N

2

3

4

5

6

7

8

9

10

11

12

13

14

15

20

25

30

2

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 30 35 40 50 70 100 150

— — — 1.00 1.00 1.00 .56 .60 .64 .39 .44 .47 .31 .35 .38 .27 .30 .21 .16 .13 .11 .10 .08 .05 .04 .02

— 1.00 1.00 .75 .78 .68 .62 .52 .45 .42 .40 .40 .40 .36 .35 .33 .32 .30 .27 .24 .22 .20 .18 .15 .13 .10

— 1.00 .84 .76 .73 .60 .58 .50 .48 .49 .42 .42 .41 .39 .35 .35 .35 .32 .28 .26 .24 .23 .20 .17 .14 .11

1.00 1.00 .85 .72 .67 .61 .57 .53 .47 .44 .43 .41 .40 .38 .37 .35 .34 .33 .30 .27 .25 .23 .21 .18 .15 .12

1.00 .91 .83 .72 .66 .61 .56 .52 .49 .47 .43 .41 .40 .38 .37 .36 .35 .34 .30 .28 .26 .24 .21 .18 .15 .12

1.00 .91 .80 .70 .67 .59 .55 .51 .49 .46 .44 .42 .41 .39 .38 .36 .36 .34 .30 .28 .26 .24 .21 .18 .15 .13

1.00 .87 .77 .71 .64 .59 .55 .52 .49 .46 .43 .42 .41 .39 .38 .36 .35 .35 .31 .28 .26 .24 .21 .18 .15 .13

1.00 .90 .77 .70 .65 .58 .55 .52 .49 .46 .44 .42 .41 .39 .38 .37 .36 .35 .31 .28 .26 .24 .22 .18 .15 .13

1.00 .88 .78 .69 .64 .59 .55 .52 .49 .46 .44 .42 .41 .39 .38 .37 .36 .35 .31 .28 .26 .24 .22 .18 .16 .13

1.00 .87 .78 .70 .64 .59 .55 .52 .49 .46 .44 .42 .41 .40 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.97 .87 .77 .70 .64 .59 .55 .52 .49 .46 .44 .42 .41 .39 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.97 .88 .77 .69 .63 .58 .55 .52 .49 .46 .44 .42 .41 .39 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.97 .86 .77 .69 .63 .58 .55 .52 .49 .46 .44 .43 .41 .39 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.98 .86 .77 .70 .63 .58 .55 .52 .49 .46 .44 .42 .41 .40 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.96 .86 .76 .69 .63 .59 .55 .52 .49 .46 .44 .42 .41 .40 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.95 .86 .76 .69 .63 .58 .55 .52 .49 .46 .44 .43 .41 .39 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.95 .85 .76 .69 .63 .58 .55 .52 .49 .47 .44 .43 .41 .40 .38 .37 .36 .35 .31 .28 .26 .24 .22 .19 .16 .13

.95 .88 .82 .77 .73 .69 .66 .63 .60 .58 .56 .55 .53 .52 .50 .49 .48 .47 .42 .39 .36 .34 .31 .26 .22 .18

Note. Observed rWG values equal to or greater than the values in the table are significant. The last column (2) gives the rWG value significant by the earlier inaccurate 2 test. Dashes in cells indicate that data were not obtained.

— — — .00 .00 .00 .20 .18 .17 .28 .26 .24 .32 .30 .36 .35 .33 .38 .40 .42 .43 .44 .45 .47 .48 .48

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 30 35 40 50 70 100 150

— .00 .00 .28 .29 .38 .40 .42 .46 .44 .46 .46 .49 .49 .50 .52 .50 .52 .53 .56 .56 .57 .57 .59 .61 .62

3 (0.50)

— .00 .32 .44 .41 .50 .57 .56 .63 .65 .65 .66 .68 .70 .71 .72 .73 .73 .77 .80 .81 .83 .86 .88 .90 .92

Table 3
Critical Values of the Average Deviation (AD) Statistic at the 5% Level of Statistical Significance

No. of categories (practical cutoff), followed by the critical AD values in order of increasing N

5 (0.83) .00 .00 .40 .56 .61 .69 .72 .74 .79 .82 .83 .85 .87 .88 .89 .91 .91 .92 .96 .98 1.01 1.02 1.04 1.07 1.09 1.11
6 (1.00) .00 .40 .48 .61 .74 .81 .86 .92 .94 .99 1.01 1.02 1.05 1.06 1.07 1.09 1.10 1.11 1.16 1.20 1.22 1.24 1.27 1.31 1.35 1.38
7 (1.17) .00 .40 .64 .78 .86 .97 1.01 1.08 1.11 1.15 1.17 1.20 1.22 1.24 1.25 1.28 1.29 1.29 1.36 1.40 1.42 1.45 1.48 1.52 1.55 1.59
8 (1.33) .00 .50 .72 .94 1.02 1.09 1.16 1.22 1.27 1.32 1.34 1.38 1.40 1.43 1.44 1.45 1.47 1.49 1.56 1.60 1.63 1.66 1.70 1.75 1.79 1.83
9 (1.50) .00 .50 .88 1.11 1.14 1.22 1.31 1.38 1.44 1.49 1.50 1.56 1.56 1.60 1.63 1.64 1.66 1.68 1.75 1.80 1.84 1.86 1.91 1.96 2.01 2.05
10 (1.67) .00 .75 .96 1.17 1.31 1.38 1.46 1.54 1.60 1.65 1.68 1.70 1.75 1.77 1.80 1.83 1.85 1.86 1.94 2.00 2.04 2.07 2.12 2.18 2.24 2.29
11 (1.83) .00 .75 1.04 1.28 1.43 1.53 1.63 1.70 1.75 1.82 1.85 1.89 1.94 1.95 1.99 2.01 2.03 2.05 2.14 2.20 2.25 2.28 2.34 2.40 2.46 2.52
12 (2.00) .44 .88 1.20 1.39 1.55 1.69 1.78 1.86 1.92 1.99 2.02 2.06 2.10 2.13 2.17 2.20 2.22 2.24 2.33 2.40 2.45 2.49 2.55 2.62 2.69 2.75
13 (2.17) .44 .88 1.28 1.50 1.67 1.84 1.93 1.98 2.08 2.15 2.19 2.23 2.28 2.32 2.35 2.38 2.40 2.44 2.53 2.60 2.66 2.70 2.77 2.85 2.91 2.98
14 (2.33) .44 1.13 1.44 1.61 1.84 1.97 2.07 2.18 2.23 2.32 2.37 2.42 2.47 2.49 2.53 2.55 2.59 2.62 2.73 2.80 2.86 2.91 2.98 3.06 3.14 3.21
15 (2.50) .44 1.13 1.52 1.78 1.96 2.10 2.20 2.30 2.40 2.49 2.54 2.58 2.63 2.68 2.71 2.75 2.79 2.80 2.92 3.00 3.06 3.11 3.19 3.28 3.36 3.44
20 (3.33) .89 1.63 2.00 2.39 2.61 2.84 2.96 3.10 3.21 3.32 3.38 3.44 3.52 3.57 3.62 3.66 3.70 3.75 3.89 4.02 4.09 4.15 4.25 4.38 4.49 4.59
25 (4.17) 1.11 1.88 2.56 2.94 3.31 3.53 3.73 3.88 4.03 4.14 4.24 4.32 4.39 4.46 4.53 4.59 4.64 4.69 4.88 5.01 5.12 5.19 5.32 5.48 5.61 5.73
30 (5.00) 1.33 2.38 3.12 3.61 3.96 4.22 4.47 4.68 4.79 4.97 5.08 5.19 5.28 5.37 5.45 5.51 5.57 5.63 5.86 6.03 6.13 6.23 6.39 6.47 6.73 6.88
Inf. (.167) .052 .084 .105 .121 .133 .143 .150 .156 .162 .166 .170 .173 .177 .179 .181 .183 .186 .188 .195 .200 .205 .208 .213 .219 .224 .229

Note. Observed AD values that are equal to or less than the values in the table are significant. Values in bold indicate significant AD values that are approximately equal to the cutoff values for each category number. Observed AD values that are equal to or less than the cutoff values are practically significant. Numbers in parentheses refer to the practical cutoff values for AD. Dashes in cells indicate that data were not obtained. Inf. = infinity.


considerably larger than the approximate randomization (Monte Carlo) values, confirming the conservative bias of the earlier chi-square tests. The probabilities of the tabled rWG values were less than 5% given a true null hypothesis; thus, observed values that are equal to or greater than the tabled values are significant. With respect to Table 2, there are two noteworthy points to be made. First, Table 2 indicates that the empirical critical values for the rWG statistic change markedly with sample size but rather little with the number of categories. Second, the critical values in Table 2 are more variable (less stable) when the sample size is small and the number of categories is small; this finding is a result of the discrete nature of the multinomial distribution under such conditions. Table 3 displays the critical AD values, in which

\[ AD = \frac{\sum |x - \bar{x}|}{N}. \tag{7} \]

Like the critical rWG values, the probabilities of the critical (tabled) AD values were less than 5% given a true null hypothesis. However, unlike the rWG values, because AD is a measure of disagreement, observed AD values less than or equal to critical values are significant. Note that the first value equaling or exceeding c/6 in magnitude (a rule of thumb for establishing a cutoff for AD; Burke & Dunlap, 2002) is printed in bold in Table 3. As can be seen, when N is 13 or less, AD values that are statistically significant (p ≤ .05) are also practically significant (i.e., less than c/6) for all Likert scales with five or more categories. When N is greater than 13 and c is at least 5, statistically significant AD values can exceed the cutoff for practical significance.

To give an example of how AD and its critical values for statistical and practical significance can be applied, consider the following problem. Five raters attempt to code an organizational climate variable in a particular study for a meta-analysis by using a 5-point rating scale (negative to positive climate). Three of the raters mark the climate found in the study as a 5, or the most positive organizational climate. The other two raters mark the climate as a 3 and a 4, respectively. For this problem, AD = .72; thus, according to Table 3, there is no statistically significant agreement (as the observed value exceeds the critical value of .40), but there is practically significant agreement (as the observed value is less than the cutoff of c/6 = .83). If the meta-analytic researcher had only an a priori decision rule of including a study (and variable) in the meta-analysis when the coders significantly agreed on the variable coding, then this study would not have been included in the meta-analysis involving organizational climate. Of course, researchers may wish to rely only on practical significance for interrater agreement problems, including the above example.
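For readers who wish to check the arithmetic of this example, the computation can be written out as follows. This is only a worked restatement of Equation 7 (deviations taken about the mean) for the five ratings 5, 5, 5, 3, and 4; the critical value of .40 is the one cited from Table 3 for N = 5 and c = 5.

```latex
\begin{align*}
\bar{x} &= \frac{5 + 5 + 5 + 3 + 4}{5} = 4.4, \\
AD &= \frac{3\,|5 - 4.4| + |3 - 4.4| + |4 - 4.4|}{5}
    = \frac{1.8 + 1.4 + 0.4}{5} = \frac{3.6}{5} = .72, \\
\frac{c}{6} &= \frac{5}{6} \approx .83, \qquad .40 < .72 < .83 .
\end{align*}
```

The observed AD therefore lies above the critical value for statistical significance but below the rule-of-thumb cutoff for practical significance.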
We should note that Burke and Dunlap (2002, p. 168) ambiguously described the interpretation of the practical and statistical significance of observed AD values in an earlier, less complete version of Table 3 (i.e., Table 4 of Burke & Dunlap, 2002). The reader is also referred to Smith-Crowe and Burke (2003) for a more detailed discussion of this latter issue.

The final column of Table 3 (Inf. = infinity) refers to AD values calculated from data in which the potential number of categories is infinite. An example would be a task in which a subject is offered a score continuum represented by a line of unit length and is asked to mark the point along this line that best represents his or her response. The scores therefore represent proportions of total length (a fraction between 0 and 1).


General Discussion

The purpose of this article was to demonstrate the inappropriateness of applying chi-square-based tests to the problem of determining whether interrater agreement index values are statistically significant. In addition to pointing out the problem with chi-square tests to applied psychologists, we have provided them with a solution to the problem: an empirical significance test for interrater agreement. In general, there exists a broad range of interrater agreement problems in applied psychology in which multiple raters evaluate characteristics of a single target, such as a job, group, or organization. Heretofore, an accurate statistical significance test of interrater agreement has not been available. The results of this study provide an accurate statistical significance test at the item level for researchers dealing with these types of problems and studies. More important, to assist researchers in making statistical tests of interrater agreement results, we have made available a downloadable software program that can be used to calculate item-level rWG and AD values and to test each for statistical significance. This program, labeled “agree.exe,” is available at the following World Wide Web address: http://www.tulane.edu/~dunlap/psylib.html. These tools are intended not only to assist behavioral scientists in making more informed judgments about the use of individual-level data as indicators of group-level constructs but also to assist researchers and practitioners with other types of interrater agreement problems in applied psychology and management.

Besides addressing the problem of determining whether interrater agreement index values are statistically significant, we have highlighted a novel and useful application of Monte Carlo procedures, that of creating approximate randomization tests for hypotheses that may be intractable for solution by parametric procedures. Certainly, this application of Monte Carlo procedures can be generalized widely, in principle to any statistic.
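The logic of the approximate randomization test for AD can be illustrated with a short simulation. The following Python sketch is not the agree.exe program; it is a minimal illustration under stated assumptions: ratings are drawn independently from the rectangular (uniform) null distribution over c categories, AD is computed about the mean as in Equation 7, and the critical value is taken as the largest attainable AD whose simulated cumulative probability does not exceed alpha. The function names, the number of replications, and the seed are hypothetical choices made for the illustration.

```python
import random
from collections import Counter


def average_deviation(ratings):
    """Mean absolute deviation of the ratings about their mean (Equation 7)."""
    mean = sum(ratings) / len(ratings)
    return sum(abs(x - mean) for x in ratings) / len(ratings)


def mc_critical_ad(n_raters, n_categories, alpha=0.05, reps=100_000, seed=0):
    """Approximate the critical AD value under a uniform null by simulation.

    Returns the largest attainable AD value v such that the simulated
    P(AD <= v) is at most alpha, or None when even AD = 0 is too likely
    under the null to be significant. The defaults for reps and seed are
    illustrative assumptions, not values taken from the article.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(reps):
        sample = [rng.randint(1, n_categories) for _ in range(n_raters)]
        # Round so that equal AD values are not split by floating-point noise.
        counts[round(average_deviation(sample), 6)] += 1

    critical = None
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if cumulative / reps > alpha:
            break
        critical = value
    return critical


if __name__ == "__main__":
    observed = average_deviation([5, 5, 5, 3, 4])
    critical = mc_critical_ad(n_raters=5, n_categories=5)
    print(f"observed AD = {observed:.2f}, critical AD = {critical:.2f}")
    print("significant" if observed <= critical else "not significant")
```

For five raters and five categories, the sketch returns a critical value of .40, which agrees with the value cited from Table 3 in the worked example. Because AD is discrete for small N and few categories, the critical value is located by walking up the distinct simulated AD values and not by interpolating a quantile.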

References

Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation (AD) index: A user’s guide. Organizational Research Methods, 5, 159–172.
Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. Organizational Research Methods, 2, 49–68.
Burke, M. J., Rupinski, M. T., Dunlap, W. P., & Davison, H. K. (1996). Do situational variables act as substantive causes of relationships between individual difference variables? Two large-scale tests of “common cause” models. Personnel Psychology, 49, 573–598.
Button, S. B. (2001). Organizational efforts to affirm sexual diversity: A cross-level examination. Journal of Applied Psychology, 86, 17–28.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of compositional models. Journal of Applied Psychology, 83, 234–236.
Chatman, J. A., & Flynn, F. J. (2001). The influence of demographic heterogeneity on the emergence and consequences of cooperative norms in work teams. Academy of Management Journal, 44, 956–974.
Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rwg(J) index of agreement. Psychological Methods, 6, 297–310.
Demerouti, E., Bakker, A. B., Nachreiner, F., & Schaufeli, W. B. (2001). The job demands-resources model of burnout. Journal of Applied Psychology, 86, 499–512.


Dirks, K. T. (2000). Trust in leadership and team performance: Evidence from NCAA basketball. Journal of Applied Psychology, 85, 1004–1012.
Dunlap, W. P., Myers, L., & Silver, N. C. (1984). Exact multinomial probabilities for one-way contingency tables. Behavior Research Methods, Instruments, and Computers, 16, 54–56.
Edgington, E. S. (1980). Randomization tests. New York: Marcel Dekker.
Hayes, W. L. (1973). Statistics for the social sciences (2nd ed.). New York: Holt, Rinehart, & Winston.
IMSL. (1982). International mathematical and statistical libraries reference manual (9th ed.). Houston, TX: Author.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
Judge, T. A., & Bono, J. E. (2000). Five-factor model of personality and transformational leadership. Journal of Applied Psychology, 85, 751–765.
Klein, K. J., Conn, A. B., Smith, D. B., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of within-group agreement in employee perceptions of the work environment. Journal of Applied Psychology, 86, 3–16.
Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 3–90). San Francisco: Jossey-Bass.
Lawlis, G. F., & Lu, E. (1972). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17–20.
Lindell, M. K. (2002, April). Analyzing interrater agreement with and without disattenuation. Poster session presented at the annual meeting of the Society for Industrial and Organizational Psychology, Toronto, Ontario, Canada.
Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied Psychological Measurement, 21, 271–278.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of the CVI, T, rWG(J), and r*WG(J) indexes. Journal of Applied Psychology, 84, 640–647.
Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.
Masterson, S. S. (2001). A trickle-down model of organizational justice: Relating employees’ and customers’ perceptions of and reactions to fairness. Journal of Applied Psychology, 86, 594–604.
Maurer, T. J., & Alexander, R. A. (1992). Methods of improving employment test critical scores derived by judging test content: A review and critique. Personnel Psychology, 45, 727–762.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). The impact of organizational climate on safety climate and individual behavior. Safety Science, 34, 99–109.
Schminke, M., Ambrose, M. L., & Cropanzano, R. S. (2000). The effect of organizational structure on perceptions of procedural fairness. Journal of Applied Psychology, 85, 294–304.
Schneider, B., White, S. S., & Paul, M. C. (1998). Linking service climate and customer perceptions of service quality: Test of a causal model. Journal of Applied Psychology, 83, 150–163.
Smith-Crowe, K., & Burke, M. J. (2003). Interpreting the statistical significance of observed AD interrater agreement values: Correction to Burke and Dunlap (2002). Organizational Research Methods, 6, 129–131.
Tesluk, P. E., Farr, J. L., Mathieu, J. E., & Vance, R. J. (1995). Generalization of employee involvement training: Individual and situational effects. Personnel Psychology, 48, 607–632.
Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22, 358–376.
Totterdell, P. (2000). Catching moods and hitting runs: Mood linkage and subjective performance in professional sport teams. Journal of Applied Psychology, 85, 848–859.
Zohar, D. (2000). A group-level model of safety climate: Testing the effect of group climate on microaccidents in manufacturing jobs. Journal of Applied Psychology, 85, 587–596.

Received December 18, 2001
Revision received May 15, 2002
Accepted May 23, 2002