Statistical Properties of the rWG(J) Index of Agreement

Copyright 2001 by the American Psychological Association, Inc. 1082-989X/01/$5.00 DOI: 10.1037//1082-989X.6.3.297

Psychological Methods 2001, Vol. 6, No. 3, 297-310

Statistical Properties of the rWG(J) Index of Agreement

Ayala Cohen, Etti Doveh, and Uri Eick

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Technion-Israel Institute of Technology

L. R. James, R. G. Demaree, and G. Wolf (1984) introduced rw 0) and the question is whether it is significantly large. Lindell and Brandt (1997) examined the mathematical conditions under which using the uniform distribution as the null distribution results in rWG(J) values outside the interval (0,1). In the Sampling Distribution of rWG(J) section, we describe the consequences of replacing the values outside the (0,1) interval with zero.



By definition, rWG(J) depends on the number of items in the scale. For two indices with the same average sample variance s² but a different number of items J, the index with more items will have a larger value of rWG(J). Table 1 demonstrates this relationship in a small example. In their study on factors influencing unsafe behavior, Hofmann and Stetzer (1996) noted as surprising the large value .98 observed for rWG(J) on one of their scales with J = 13. One should bear in mind that with large J and a moderate variance among raters, it is not surprising to obtain a large rWG(J). As shown in Table 1, the same average variance with only 4 items would yield rWG(J) = .58, compared with .98 for J = 13.

The rationale for forcing rWG(J) to be nonnegative seemed questionable to those who argued that, as an agreement index, this measure should also take negative values to indicate disagreement. Lindell, Brandt, and Whitney (1999) suggested, as an alternative to rWG(J), a modified index r*WG(J) that is allowed to take negative values (even beyond -1). The modified index r*WG(J) addresses two of the criticisms that were raised against rWG(J). First, negative values can be obtained when the observed agreement is less than hypothesized. Second, unlike rWG(J), it does not include a Spearman-Brown correction, and thus it does not depend on the number of items (J). Lindell and Brandt (2000) used the modified index r*WG(J) as a variable that measures climate consensus. Then, they tested the role of this measure as a mediator between organizational antecedents and outcomes. This application of an agreement index differs from the one we consider, in which it is used for testing whether aggregation is justified.

The argument against rWG(J) for its use of a Spearman-Brown correction is not necessarily justified.
Those who disagree with this correction claim that it does not seem reasonable that two scales with a different number of items will provide different rWG(J) values when the average agreement (over items) is the same. This claim can be countered by the reasoning that a smaller agreement on each of several items of a scale is equivalent to a larger agreement on each item of a scale with fewer items. Moreover, when we apply a significance test on rWG(J), the number of items is taken into account.

Table 1
rWG for Different Numbers of Items (J) and the Same Average Variance (s² = 0.42)

J       rWG(J)
4       .58
5       .83
7       .91
13      .98

Note. rWG is an index of within-group agreement.

A common rule of thumb, which is somewhat arbitrary, is that aggregation is justifiable if rWG(J) > .70. A key problem with such a rule is that it takes into account neither the number of judges, nor the number of items, nor the number of points in the scale. Researchers who use this measure often report the average rWG(J) for their groups (for each scale they measured). It is also common to report the percentage of groups for which rWG(J) was greater than .70. However, as noted by Klein and Kozlowski (2000a), significance testing of rWG(J) is often overlooked.

Notation

We denote by N the number of groups in a study and by k the number of individuals within each group. For simplicity, we assume that each group consists of the same number of individuals, but it is easy to generalize to unequal group sizes. J denotes the number of parallel questions (items) answered by all individuals. These J questions form the scale of interest. Let A denote the number of response categories for each item (which, again, we assume without loss of generality to be the same for all J items). The number of categories (A) can be as small as two (dichotomous rating), but in most cases this number is equal to five or seven. There are various possible combinations of N × k. In some studies, there are few groups and they are large, whereas in others there are relatively many groups and they are small. Though rWG(J) is applicable for all designs, the inference depends on the design and so does the presentation of the results. Similarly, inference depends on J as well as on A. Thus, when we discuss the properties of rWG(J) and suggest methods of inference, we have to consider various combinations of the parameters N, k, J, and A.
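As a minimal sketch of how the index is computed in this notation, the Python functions below take the ratings of one group (k rows, J items, A categories). The sketch assumes the usual James et al. (1984) form with the Spearman-Brown step-up, rWG(J) = J(1 - s²/σ²_null) / [J(1 - s²/σ²_null) + s²/σ²_null], with s² the mean of the J item variances, and a uniform null with σ²_null = (A² - 1)/12. The function names and the example ratings are hypothetical and serve only as illustration.

```python
import statistics


def uniform_null_variance(a):
    """Variance of a discrete uniform distribution on the categories 1..A."""
    return (a * a - 1) / 12.0


def rwg_j(mean_item_variance, sigma2_null, j):
    """rWG(J) from the mean of the J item variances (assumed James et al. form)."""
    ratio = mean_item_variance / sigma2_null
    if ratio >= 1.0:
        # The index is zero or falls outside the unit interval; report zero.
        return 0.0
    agreement = j * (1.0 - ratio)
    return agreement / (agreement + ratio)


def rwg_j_from_ratings(ratings, a):
    """ratings: k rows (one per group member), each holding J item scores in 1..A."""
    j = len(ratings[0])
    item_variances = [
        statistics.variance([row[item] for row in ratings])  # divisor k - 1
        for item in range(j)
    ]
    return rwg_j(statistics.fmean(item_variances), uniform_null_variance(a), j)


# Hypothetical group: k = 3 members, J = 2 items, A = 5 categories.
example = [[4, 5], [4, 4], [5, 4]]
print(round(rwg_j_from_ratings(example, a=5), 3))

# Same mean variance (1.0) and null variance (2.0), different numbers of items.
for j in (1, 6, 13):
    print(j, round(rwg_j(1.0, 2.0, j), 3))
```

With the mean variance held fixed, the printed value rises with J, which is the Spearman-Brown effect discussed earlier. The variance values in the loop are illustrative and are not taken from Table 1.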
Sampling Distribution of rWG(J)

The exact distribution of rWG(J) cannot be derived explicitly. Instead, readily available computer power can be used to make inferences about rWG(J). Indeed, it has become quite common to use simulations in order to study properties of various test statistics and/or estimators. The distribution of rWG(J) depends on J, σ²_null, A, k, and, obviously, the underlying distribution of the data. One cannot cover all the combinations of these parameters. Our study serves two aims: first, to illustrate how a researcher can exploit computer power to make inferences about rWG(J) for his or her specific data structure and, second, to learn from our limited study some important properties of rWG(J).


rWG for One Item

First, we consider the simple case when J = 1, namely, when we test for agreement on one item. Under the null hypothesis, the responses of the k group members are k independent, identically distributed observations from the null distribution. Using the computer, we can simulate k such responses and calculate the corresponding s² and its associated rWG value. This process can be repeated as many times as we wish, but usually 5,000 simulations suffice to obtain the important features of the rWG null distribution, particularly its extreme quantiles. Figure 1 displays the null distribution for k = 10, a Likert scale with A = 5, and a uniform null distribution, based on 5,000 simulations. Because negative values are replaced by zero, we obtain a large proportion (.90) of zero values. The .95 percentile is .217. Thus, any observed rWG that is larger than .217 for one item with k = 10, A = 5, and a uniform null distribution can be considered as indicating some homogeneity, beyond uniformity.

Figure 1. Sampling null distribution of rWG for a group size (k) of 10, one item. rWG is an index of within-group agreement.

Rejecting the null hypothesis means that the homogeneity is larger than expected from a random response, but it does not necessarily justify considering the groups as homogeneous and aggregating the data to a higher level. This is analogous to inference based on correlation, where significance implies nonzero association, but not necessarily a practically significant relationship. The rule suggested by James (as quoted in George, 1990) as a criterion for justifying aggregation is rWG(J) ≥ .7. As Figure 1 shows, when the response is random, there is essentially zero probability of obtaining rWG(J) ≥ .7. In practice, one would rarely use a scale with one item, though in some particular cases it might be of interest to examine one particular item in a scale with J items. In general, testing each item separately is not recommended because of multiple comparison problems. This is analogous to classical ANOVA in which one applies an F test rather than several pairwise t tests.

Our simulation procedure for testing homogeneity on a single-item scale is more appropriate than the chi-square test suggested by Lindell and Brandt (1997). Their test statistic was defined as

\[ \chi^2 = \frac{(k - 1)\, s^2}{\sigma^2_{\mathrm{null}}} \tag{3} \]

It is well known that for normal data, the statistic defined in Equation 3 is distributed as chi-square with k - 1 degrees of freedom. However, for a Likert-type scale, the data are far from normal and using the chi-square test can lead to wrong conclusions. Figure 2 displays the sampling distribution of the statistic defined in Equation 3 and the theoretical chi-square distribution for k = 4 and also for k = 10, both with A = 5. As expected, because the uniform distribution has shorter tails than the normal distribution, the sampling distribution of the statistic is likewise shorter tailed than chi-square. Thus, using the test suggested by Lindell and Brandt (1997) will result in lower rejection rates than the specified alpha value. Consequently, their test will be less powerful (there will be a higher probability of wrongly accepting the null hypothesis of no agreement when there is agreement). Tinsley and Weiss (1975) pointed out the problem in using the chi-square test due to the violation of the normality assumption. Accordingly, they suggested using a stringent critical value.

For methodological reasons, we demonstrate by simulation that when the null distribution is approximately normal, the distribution of the statistic defined by Equation 3 is quite close to chi-square. In our simulation, k = 10. The chosen null distribution, Bin(20, 1/2) (binomial with n = 20 and p = .5), corresponds to a question (item) with 21 possible answers, j = 0, 1, ..., 20, and the probability of choosing j as an answer is

\[ P(j) = \binom{20}{j} (.5)^{20} \]

This discrete distribution is quite close to the normal distribution with expected value 10 and variance 5. (Thus, σ²_null = 5.) Figure 3 shows the close resemblance of the theoretical and simulated distributions. In conclusion, if we want to test for departure from the null distribution, the rejection region should be determined by simulations. It should not be determined by the chi-square percentiles.
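The single-item procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' SAS code: it assumes rWG = 1 - s²/σ²_null with negative values set to zero, a uniform null on 1..A, and Equation 3 in the form (k - 1)s²/σ²_null. The constants 3.325 and 16.919 are the standard table values for the 5th and 95th percentiles of chi-square with 9 degrees of freedom; which tail a given application uses is left open here, so both are reported. The seed and function names are hypothetical, and the printed numbers depend on these choices, so they need not reproduce the values reported for Figure 1.

```python
import random
import statistics

# 5th and 95th percentiles of the chi-square distribution with 9 df (table values).
CHI2_9DF_P05 = 3.325
CHI2_9DF_P95 = 16.919


def simulate_s2(k=10, a=5, n_sims=5000, seed=1):
    """Sample variances of k uniform responses on an A-point item (the null case)."""
    rng = random.Random(seed)
    return [
        statistics.variance([rng.randint(1, a) for _ in range(k)])  # divisor k - 1
        for _ in range(n_sims)
    ]


def rwg_one_item(s2, sigma2_null):
    """Single-item rWG; negative values are replaced by zero, as in the text."""
    return max(0.0, 1.0 - s2 / sigma2_null)


k, a = 10, 5
sigma2_null = (a * a - 1) / 12.0  # variance of a discrete uniform on 1..A
s2_values = simulate_s2(k=k, a=a)

# Null distribution of rWG and its simulated .95 percentile (the critical value).
rwg_values = [rwg_one_item(s2, sigma2_null) for s2 in s2_values]
critical_value = statistics.quantiles(rwg_values, n=20)[-1]
share_zero = sum(v == 0.0 for v in rwg_values) / len(rwg_values)

# Assumed form of Equation 3, referred to chi-square percentiles with k - 1 = 9 df.
chi2_stats = [(k - 1) * s2 / sigma2_null for s2 in s2_values]
lower_tail_rate = sum(x < CHI2_9DF_P05 for x in chi2_stats) / len(chi2_stats)
upper_tail_rate = sum(x > CHI2_9DF_P95 for x in chi2_stats) / len(chi2_stats)

print(f"simulated .95 percentile of rWG: {critical_value:.3f}")
print(f"share of zero values: {share_zero:.3f}")
print(f"rejection rate at the lower 5% chi-square cutoff: {lower_tail_rate:.3f}")
print(f"rejection rate at the upper 5% chi-square cutoff: {upper_tail_rate:.3f}")
```

Both rejection rates should fall below the nominal .05 for uniform data, which is the shorter-tail effect described above. Replacing the uniform draw with a Bin(20, 1/2) draw and σ²_null = 5 gives the approximately normal comparison case.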

Consider now the general case with J > 1, namely, a scale with several items. We follow the same basic idea as with J = 1, simulating data from the null distribution to obtain the sampling distribution of rWG(J) under the null hypothesis. The idea of using simulations was previously applied by Charnes and Schriesheim (1995). They assumed independence between the J items in the scale, and this is obviously an unrealistic assumption, because the J items measure the same construct. We now show our simulation results and thereby demonstrate that the J items' interdependence must be taken into account.

Our simulation study included small-, medium-, and large-size groups. We chose these designs on the basis of some real studies. The small-size group study that we analyzed included 40 groups, each of size 3. These were learning teams, and the purpose of the study was to investigate the effect of goal types and organizational learning mechanisms on the learning and performance of teams. Our motivation for the medium-size groups was the study of Hofmann and Stetzer (1996) on factors influencing accidents in a chemical plant. In their study, there were 204 employees nested in 21 groups, so that on average the group size was 10. Finally, in the large-size groups, there were about 100 individuals in each group. These were soldiers in army combat units, and it was of interest to assess the homogeneity within the units with regard to the organizational culture (Brainin & Erez, 1999).

There are, therefore, three group sizes that we examined: k = 3, 10, and 100. For each one, simulations were conducted with J = 6 and J = 10, and for each of these, both interdependence and independence were assumed to hold between the items. The dependence structure used was symmetrical, namely, the correlation was the same between every pair of items; the value we chose was .6. Such a correlation yields a Cronbach's alpha of approximately .9. The simulation was performed using the MVN macro of SAS (http://ftp.sas.com/techsup/download/stat/mvn.html). We simulated data from both uniform and slight skew underlying distributions, and the calculation of rWG(J) was performed in each case, assuming a uniform null distribution.
With respect to the former case, uniform data with a uniform null distribution are required for studying the error of the first kind (wrongly concluding that the response is not random), whereas the latter corresponds to power. One can simulate numerous possible underlying and null distributions according to their relevance for one's research problem. For brevity, we report on just a very few of those possibilities. Each of the plots in Figure 4 summarizes the results of several simulations. They display the dependence of the rWG(J) sampling distribution on various parameters.

Figure 2. Sampling distribution of the chi-square statistic and the theoretical chi-square distribution for k = 4 (A) and k = 10 (B). k = group size.

Figure 3. Sampling distribution of the chi-square statistic and the theoretical chi-square distribution for k = 4 (A) and k = 10 (B) (approximately normal data). k = group size.

Figure 4. A: Expected value of rWG(J) for uniform data tested versus uniform distribution as the null distribution. B: Expected value of rWG(J) for slight skew data tested versus uniform distribution as the null distribution. Both panels plot the expected value against group size (100, 10, 3) for independent and dependent items, with separate lines for 6 items and 10 items. rWG(J) is an index of within-group agreement.

In the top panel of Figure 4, we observe E(rWG(J)), which should be zero if rWG(J) were unbiased. Though s² is an unbiased estimate of σ²_null (when the data stem from the null distribution), rWG(J) is not unbiased, mainly because values beyond (0,1) are defined as zero. As expected, for larger size groups (larger k), the bias is smaller, and the expected value is closer to zero. The graph also illustrates the difference between values of rWG(J) obtained from independent items as compared with correlated items. In other words, the percentiles of the rWG(J) distributions obtained by simulations based on independence do not represent the percentiles corresponding to correlated data. We also observe that for a 10-item index, the graph points are higher than the corresponding points of a 6-item index. The bottom panel of Figure 4 is analogous to the top panel, but because the simulated data are not uniform, the expected values of rWG(J) are positive and relatively high (depending on J and k). Unlike the previous configuration, here, as expected, larger group sizes tend to have higher values. The difference between dependent and independent data is negligible for the large group size (k = 100), but not for the smaller groups.

The comparison, based on Figure 4, relates solely to the expected value of rWG(J), but this does not provide the whole picture. More details are given in displays, such as in Figure 5, that describe the entire rWG(J) distributions for some of the cases of Figure 4. The data in Figure 5 were generated for a scale with J = 6 items with a correlation of .6 between each pair of items.
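A simulation of this kind can be sketched in Python as follows. It is not the SAS MVN macro used for the article. As an assumption for illustration, equicorrelated items are produced from a one-factor normal model and then cut into A equally likely categories, so the value .6 is the correlation of the latent normal variables and the correlation between the resulting ratings is somewhat lower. The rWG(J) formula is the assumed James et al. (1984) form with a uniform null, and the function names, seed, and number of replications are hypothetical.

```python
import math
import random
import statistics

_STD_NORMAL = statistics.NormalDist()


def standardized_alpha(j, rho):
    """Standardized Cronbach's alpha for J items with common correlation rho."""
    return j * rho / (1.0 + (j - 1) * rho)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def simulate_group(rng, k, j, a, rho):
    """k rows of J uniform-marginal ratings; rho is the latent (normal) correlation."""
    rows = []
    for _ in range(k):
        shared = rng.gauss(0.0, 1.0)  # rater-level factor shared by all J items
        row = []
        for _ in range(j):
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            u = _STD_NORMAL.cdf(z)              # uniform on (0, 1)
            row.append(min(a, int(u * a) + 1))  # equal-width cut: categories 1..A
        rows.append(row)
    return rows


def rwg_j(rows, a):
    """rWG(J) against a uniform null; out-of-range values are set to zero."""
    j = len(rows[0])
    mean_var = sum(sample_variance([r[i] for r in rows]) for i in range(j)) / j
    ratio = mean_var / ((a * a - 1) / 12.0)
    if ratio >= 1.0:
        return 0.0
    return j * (1.0 - ratio) / (j * (1.0 - ratio) + ratio)


def mean_rwg_j(k, j, a, rho, n_sims=500, seed=1):
    rng = random.Random(seed)
    return statistics.fmean(
        [rwg_j(simulate_group(rng, k, j, a, rho), a) for _ in range(n_sims)]
    )


# Uniform data tested against a uniform null: independent (0.0) vs dependent (0.6).
results = {
    (k, rho): mean_rwg_j(k, j=6, a=5, rho=rho)
    for k in (3, 10, 100)
    for rho in (0.0, 0.6)
}
for (k, rho), value in sorted(results.items()):
    print(f"k={k:3d} rho={rho:.1f} mean rWG(J)={value:.3f}")
print("standardized alpha, J = 6, rho = .6:", round(standardized_alpha(6, 0.6), 3))
```

The loop covers the uniform-data case of Figure 4A for J = 6 only. Other values of J, or a skewed data distribution obtained by using unequal cut points, follow the same pattern.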


Of special interest is the probability of rWG(J) > .7. This probability is zero (as we would like it to be) for uniform data when the group size is large (k = 100), but it is not negligible for k = 3. This is the probability of wrongly concluding that there is homogeneity. It is not surprising that for small groups (such as k = 3) the probability of this error is larger than for very large groups (k = 100).

We now examine the opposite case, when the group members agree. Suppose, then, that the group members were sampled from a slight skew distribution (σ² = 1.34) and we consider the response to be random if it were uniform. If rWG(J) ≥ .7 should be considered as indicating homogeneity, we would like the probability of rWG(J) ≥ .7 to be large for this case, namely, for slight skew data with a uniform null distribution. This probability (see Figure 5B and Figure 5D) is .62 for k = 3 and .84 for k = 100. In other words, there is a nonnegligible probability that we would wrongly declare no agreement. As expected, the probability of this event decreases with increasing k. We should note that even for a group size of 10, the probability of observing (under the conditions above) rWG(J) ≥ .7 is still relatively small (.63). On the other hand, if J = 10, all the above-mentioned probabilities (of rWG(J) > .7) would be larger, as expected: .998 for k = 100, .81 for k = 10, and .71 for k = 3.

The effect of truncating sampled values beyond the unit interval is also well illustrated in Figure 5. When the true rWG(J) is relatively large (i.e., there is group agreement) and k is large, there are no sampled values of rWG(J) outside the unit interval. In this case rWG(J) will be unbiased (see Figure 5B). Both Figure 5B and Figure 5D display the sampling distribution of rWG(J) for simulated data from a slight skew distribution.
In both cases the theoretical rWG(J) value is .747, but it is clearly seen that an unbiased estimate is obtained for k = 100 with no observed zero value, whereas for k = 3, there is a large variance and, therefore, zero values are obtained, even though they are far from .747.

An additional point should be noted as illustrated in Table 2. It displays the mean and standard deviation of the sampling distributions of rWG(J) for various group sizes and two null distributions: the uniform (σ²_null = 2) and the slight skew (σ²_null = 1.34). In all cases, the data were simulated from a large-skew distribution, defined according to James et al. (1984), with the proportions of judgment as follows: 1 = .0, 2 = .0, 3 = .1, 4 = .4, and 5 = .5, yielding σ² = 0.44. For a scale with J = 6 items, the corresponding theoretical rWG(J) values are .955 and .924, respectively. The fact that the mean values are essentially equal demonstrates that the group size has no effect on the mean in this case. On the other hand, the standard deviation decreases with the group size as expected. The mean value stability, with respect to group size, is contrasted with the eta squared dependence on group size (Bliese & Halverson, 1998).

Table 2
Means and Standard Deviations of Sampled rWG for a Heavy Skew Distribution (J = 6)

        Uniform null        Slight skew null
k       M       SD          M       SD
3       .951    .0357       .908    .0902
10      .954    .0140       .922    .0273
20      .955    .0094       .923    .0180
50      .955    .0058       .924    .0119
100     .955    .0041       .924    .0077

Note. rWG is an index of within-group agreement.

One of the practical advantages of rWG(J) as compared with the WABA method is as follows: Instead of presenting the within-group agreement as a single summarizing number, the variation between groups with regard to their within-group agreement is exhibited by the empirical rWG(J) distribution. This can be illustrated by the above-mentioned study on learning teams. We observed there a .73 average rWG for an index with J = 6 items, each measured on a Likert-type scale with five categories. The mean was over the 40 groups (SD = .335), indicating a relatively high agreement. However, there were 6 groups for which rWG(J) = 0. Consequently, the 40 groups' median was much higher than the mean (Mdn = .90). Clearly, an rWG(J) distribution is more informative than a single summarizing number. From what we have learned (see Figure 5D), it is not surprising to get 6/40 = 15% groups of size 3 with rWG(J) = 0, when the "true" population value is above .7.
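The theoretical values quoted for Table 2 and for the slight skew case can be checked with a short calculation. The sketch below assumes the James et al. (1984) form of rWG(J) with the population variance in place of the mean sample variance. The heavy skew proportions and the variance 1.34 are taken from the text, and the function names are hypothetical.

```python
def categorical_variance(proportions):
    """Variance of a rating that takes category c (1-based) with probability proportions[c - 1]."""
    mean = sum(c * p for c, p in enumerate(proportions, start=1))
    return sum(p * (c - mean) ** 2 for c, p in enumerate(proportions, start=1))


def theoretical_rwg_j(sigma2_data, sigma2_null, j):
    """Population rWG(J) when the true variance is sigma2_data (assumed James et al. form)."""
    ratio = sigma2_data / sigma2_null
    return j * (1.0 - ratio) / (j * (1.0 - ratio) + ratio)


heavy_skew = [0.0, 0.0, 0.1, 0.4, 0.5]            # proportions stated in the text
sigma2_heavy = categorical_variance(heavy_skew)
sigma2_uniform = categorical_variance([0.2] * 5)  # uniform null on five categories
sigma2_slight = 1.34                              # slight skew variance quoted in the text

print(round(sigma2_heavy, 2), round(sigma2_uniform, 2))
print(round(theoretical_rwg_j(sigma2_heavy, sigma2_uniform, 6), 4))   # heavy skew data, uniform null
print(round(theoretical_rwg_j(sigma2_heavy, sigma2_slight, 6), 4))    # heavy skew data, slight skew null
print(round(theoretical_rwg_j(sigma2_slight, sigma2_uniform, 6), 4))  # slight skew data, uniform null
```

Under these assumptions the three indices agree with the quoted .955, .924, and .747 to within about .001.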

Confidence Interval for the Agreement Index

It is well known that significance tests depend on sample size and that statistical significance does not imply practical significance. Thus, rejecting the hypothesis of no agreement does not necessarily mean that there is substantial agreement. Moreover, in many studies we expect homogeneity, so that rejecting the hypothesis of no agreement is an obvious result. A means of obtaining valuable information beyond that of statistical significance is to assess the degree of agreement by constructing a confidence interval on the index of agreement. This is similar to the inference on correlation, where, beyond stating the significance of the observed correlation, we often add a confidence interval on ρ (using Fisher's transformation).

When k is sufficiently large (e.g., k ≥ 20), one can use the bootstrap method, which is a computer-intensive method used to construct confidence intervals. Since its introduction, the bootstrap method has become a popular tool for inference. It requires few assumptions and derives its critical values from the data at hand (Efron & Tibshirani, 1993). We illustrate the application of the bootstrap method for comparing homogeneity between two independent groups. In the army study mentioned in the previous section, the question arose whether one group (unit), consisting of 107 officers, was more homogeneous than another unit consisting of 26 officers with respect to the construct extensive use of technology. The respective rWG(J) values were .73 and .90. Using the bootstrap method requires repeatedly resampling from the two original samples, drawing independently with replacement, until we have two "new samples" the same size as the originals. We then compute for each of the two new samples their rWG values and their difference. The sampling "bootstrap distribution" of these differences is used to construct the required confidence interval for the homogeneity difference, whose point estimate was .90 - .73 = .17. The resulting 95% confidence interval was .013 to .40.
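The resampling scheme just described can be sketched in Python. The army data are not available here, so the two units below are hypothetical ratings generated only to make the example run, with J = 6 and A = 5 chosen for illustration. The sketch also assumes that individuals (rows) are the resampling unit and that a simple percentile interval is used, since the text does not specify the type of bootstrap interval.

```python
import random


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def rwg_j(rows, a):
    """rWG(J) against a uniform null; out-of-range values are set to zero."""
    j = len(rows[0])
    mean_var = sum(sample_variance([r[i] for r in rows]) for i in range(j)) / j
    ratio = mean_var / ((a * a - 1) / 12.0)
    if ratio >= 1.0:
        return 0.0
    return j * (1.0 - ratio) / (j * (1.0 - ratio) + ratio)


def make_unit(rng, k, j, weights):
    """Hypothetical unit: k members, J items, ratings drawn independently from weights."""
    categories = list(range(1, len(weights) + 1))
    return [rng.choices(categories, weights=weights, k=j) for _ in range(k)]


def bootstrap_difference_ci(unit_a, unit_b, a, n_boot=2000, level=0.95, seed=1):
    """Percentile interval for rWG(J) of unit_b minus rWG(J) of unit_a."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample members with replacement, keeping each unit at its original size.
        resample_a = rng.choices(unit_a, k=len(unit_a))
        resample_b = rng.choices(unit_b, k=len(unit_b))
        diffs.append(rwg_j(resample_b, a) - rwg_j(resample_a, a))
    diffs.sort()
    lower = diffs[int(n_boot * (1.0 - level) / 2.0)]
    upper = diffs[int(n_boot * (1.0 + level) / 2.0) - 1]
    return lower, upper


data_rng = random.Random(7)
unit_large = make_unit(data_rng, k=107, j=6, weights=[0.05, 0.15, 0.20, 0.35, 0.25])
unit_small = make_unit(data_rng, k=26, j=6, weights=[0.0, 0.05, 0.15, 0.45, 0.35])

point_estimate = rwg_j(unit_small, 5) - rwg_j(unit_large, 5)
lower, upper = bootstrap_difference_ci(unit_large, unit_small, a=5)
print(f"difference = {point_estimate:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

An interval that excludes zero would be read as evidence of a difference in homogeneity between the two units.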
Because the interval from .013 to .40 does not cover the value of zero, we conclude, with .95 confidence, that there is a difference in the homogeneity between the two units.

The computer-intensive methods described in this article can be applied in various ways. Although the bootstrap method was not in the main line of this article, we showed, in this section, an example of its use. It should be noted that resampling and Monte Carlo simulations were also used in multilevel research by Bliese (1999).

Conclusion

Since its introduction by James et al. (1984), rWG(J) has been used in many studies to assess homogeneity within groups. As researchers have gained experience using this index, several questions have arisen. Informal discussions (e.g., on the Research Methods Division Network; RMNET)1 as well as articles considered questions such as the following: How should values of rWG(J) beyond the (0,1) interval be handled? What are the consequences of replacing values beyond the unit interval by zero? What is the dependence of rWG(J) on the group size? In this work, we addressed these questions. Using simulations, we found that a positive bias is caused by the value truncation. However, for large true values of rWG(J) (i.e., large homogeneity), this bias is negligible. We also showed that, in this case, the group size has no effect on the mean of rWG.

In summary, our suggestion for the researcher who wants to make inferences about his or her sampled rWG(J) value(s) is to exploit the availability of powerful computers in the following way. Simulate data from your hypothesized (null) distribution with the same structure as your data, namely, number of groups (N), number of individuals within groups (k), number of items (J), and scale categories (A). Simulate the data with the same correlations between items as those in your sample. Finally, compare the simulation results for rWG(J) with those from your actual data to make the statistical inference.

All the calculations presented in this article were performed with the SAS software (SAS Institute, Inc., 1990). The code is available (on request) from the authors.

1 RMNET is the electronic question-and-answer network for members of the Research Methods Division of the Academy of Management. All the participants of this network are members of the Academy of Management. RMNET provides a forum for methodological debates.
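The procedure recommended in the Conclusion can be sketched for a single group as follows. This is an illustration under stated assumptions rather than the authors' SAS code: the null is uniform, item dependence comes from a one-factor normal model whose latent correlation (.6 here) stands in for the correlation estimated from one's own sample, and rWG(J) has the assumed James et al. (1984) form. The observed ratings, the seed, and all function names are hypothetical.

```python
import math
import random
import statistics

_STD_NORMAL = statistics.NormalDist()


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def rwg_j(rows, a):
    """rWG(J) against a uniform null; out-of-range values are set to zero."""
    j = len(rows[0])
    mean_var = sum(sample_variance([r[i] for r in rows]) for i in range(j)) / j
    ratio = mean_var / ((a * a - 1) / 12.0)
    if ratio >= 1.0:
        return 0.0
    return j * (1.0 - ratio) / (j * (1.0 - ratio) + ratio)


def simulate_null_group(rng, k, j, a, rho):
    """k rows of J uniform-marginal ratings with latent inter-item correlation rho."""
    rows = []
    for _ in range(k):
        shared = rng.gauss(0.0, 1.0)
        row = []
        for _ in range(j):
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            row.append(min(a, int(_STD_NORMAL.cdf(z) * a) + 1))
        rows.append(row)
    return rows


def null_distribution(k, j, a, rho, n_sims=2000, seed=1):
    rng = random.Random(seed)
    return sorted(rwg_j(simulate_null_group(rng, k, j, a, rho), a) for _ in range(n_sims))


def critical_value(null_values, level=0.95):
    """Approximate upper quantile of the sorted simulated null values."""
    return null_values[int(level * len(null_values)) - 1]


def monte_carlo_p_value(null_values, observed):
    """Share of simulated null values at least as large as the observed index."""
    return sum(v >= observed for v in null_values) / len(null_values)


# Hypothetical observed group: k = 10 members, J = 6 items, A = 5 categories.
observed_group = [
    [4, 4, 5, 4, 4, 5], [4, 5, 5, 4, 4, 4], [5, 4, 4, 4, 5, 5], [4, 4, 4, 5, 4, 4],
    [3, 4, 5, 4, 4, 5], [4, 4, 4, 4, 5, 4], [5, 5, 5, 4, 4, 4], [4, 4, 4, 3, 4, 5],
    [4, 5, 4, 4, 4, 4], [4, 4, 5, 4, 5, 5],
]
observed = rwg_j(observed_group, a=5)
null_values = null_distribution(k=10, j=6, a=5, rho=0.6)
cutoff = critical_value(null_values)
p_value = monte_carlo_p_value(null_values, observed)
print(f"observed rWG(J) = {observed:.3f}, simulated .95 cutoff = {cutoff:.3f}, p = {p_value:.4f}")
```

With N groups, the same null simulation can be repeated for each group size, or applied to a summary such as the mean rWG(J) across groups.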

References

Bartko, J. J. (1976). On various intraclass correlation reliability coefficients. Psychological Bulletin, 83, 762-765.

Bliese, P. D. (1999). Using resampling and Monte Carlo simulation in multi-level research (Tech. Rep.). Washington, DC: Walter Reed Army Institute of Research.

Bliese, P. D. (2000). Within group agreement, non-independence and reliability: Implications for data and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research and methods in organizations: Foundations, extensions, and new directions (pp. 349-381). San Francisco: Jossey-Bass.

Bliese, P. D., & Halverson, R. R. (1998). Group size and measures of group-level properties: An examination of eta-squared and ICC values. Journal of Management, 24, 157-172.

Brainin, E., & Erez, M. (1999, April). Shared technology-oriented values (TOV): Who shares them, and at what level of the organization are they being shared the most? Paper presented at the meeting of the Society for Industrial and Organizational Psychology, Atlanta, GA.

Charnes, J. M., & Schriesheim, C. A. (1995). Estimation of quantiles for the sampling distribution of the rw