working paper: please do not cite or distribute without ...


Cronbach’s Coefficient Alpha: Well Known but Poorly Understood

ABSTRACT This study disproves the following six common misconceptions about coefficient alpha: 1) Alpha was first developed by Cronbach. 2) Alpha equals reliability. 3) A high value of alpha is an indication of internal consistency. 4) Reliability will always be improved by deleting items using “alpha if item deleted.” 5) Alpha should be greater than or equal to .7 (or, alternatively, .8). 6) Alpha is the best choice among all published reliability coefficients. This study discusses the inaccuracy of each of these misconceptions and provides a correct statement. This study recommends that the assumption of unidimensionality and tau-equivalency be examined before the application of alpha and that SEM-based reliability estimators be substituted for alpha when one of these conditions is not satisfied. This study also provides formulas for SEM-based reliability estimators that do not rely on matrix notation and step-by-step explanations for the computation of SEM-based reliability estimates.

Keywords: reliability, coefficient alpha, tau-equivalency, internal consistency, multidimensionality, multiple-factor model


Although many methods for estimating the reliability of test scores have been proposed (see, e.g., Feldt & Brennan, 1989, and Haertel, 2006, for complete reviews of the estimation methods), this study focuses on Cronbach's (1951) coefficient alpha (hereinafter "alpha"), which estimates reliability by using data from a single test administration. Alpha, conceived as an "internal consistency" coefficient, is the most frequently used reliability coefficient (a term that denotes an estimator or an estimate of reliability, depending on the context) in organizational research.

Previous studies such as Cortina (1993) and Schmitt (1996) have played an important role in giving organizational researchers a better understanding of alpha. The influence of these seminal articles is remarkable: Cortina (1993) and Schmitt (1996) appear among the ten most-cited articles of the past twenty years in the Journal of Applied Psychology and Psychological Assessment, respectively (Harzing, 2013). While their comprehensive reviews still provide valuable guidance for organizational researchers, there is an increasing need for a review that discusses the latest studies and offers an updated perspective on the issue.

The influence of Cronbach (1951) itself has not faded. His groundbreaking paper has been cited by more than 22,000 studies, the largest number of citations for any paper published in Psychometrika (Harzing, 2013). Cronbach (2004, p. 393) declared it to be "[another sign] of success" that "there were very few later articles by others criticizing parts of my argument." Despite Cronbach's (2004) optimistic assessment, criticisms of alpha have been actively published in the past decade.


However, this research offers little practical aid to organizational researchers, for several reasons. First, the latest theoretical reviews of alpha can be found only in psychometrics journals, such as Psychometrika. As Sijtsma (2009a) noted, psychometricians, who are theorists of measurement, have become distanced from other social scientists, who are practitioners of measurement, and studies in psychometrics increasingly use advanced mathematics that most social scientists cannot easily follow. Second, previously published papers in psychometrics have typically focused on specific subjects within the field and are thus less helpful for understanding the broader outline of the field. Therefore, a comprehensive study of the historical background of alpha and of the practical issues arising from its use as a reliability coefficient would be particularly helpful to organizational researchers. The current study seeks to respond to this research need.

In the next section, we develop our analysis of test score reliability by focusing on six common misconceptions about alpha that are frequently held by organizational researchers. For example, one may believe that Cronbach first proposed alpha because it is named "Cronbach's alpha"; one may also contend that alpha should be regarded as the best available reliability coefficient because so many researchers use it in practice. Disproving such misconceptions will enable organizational researchers to make better decisions on fundamental issues such as what to call alpha, how to use it, and even whether to use it at all. After the "misconception" section, we present the formulas and procedures of alternatives to coefficient alpha for reliability estimation when the prerequisite assumptions for alpha are suspected to be violated. In presenting this information, we show that reliability estimators based on structural equation modeling (SEM) may be effectively used in all of the situations in which violation of the assumptions for alpha is suspected.


Before addressing the misconceptions about alpha, however, we first present the true score model of the test score $X$ and the definition of test score reliability in classical test theory. Consider a test consisting of $k$ dichotomously or polytomously scored items. The test score $X$ is defined as the sum of the $k$ observed item scores $X_i$; that is, $X = \sum_{i=1}^{k} X_i$. The true score model states that $X$ is composed of two unobserved scores, the true score $T$ and the error $e$: $X = T + e$. The reliability of test scores is defined as the product-moment correlation ($\rho_{XX'}$) between the scores $X$ and $X'$ ($= T + e'$) from two parallel forms of a test (Lord & Novick, 1968; Novick & Lewis, 1967). In a population, $\rho_{XX'}$ is equal to the squared correlation ($\rho_{XT}^2$) between $X$ and $T$, which in turn equals the ratio of the true score variance ($\sigma_T^2$) to the test score variance ($\sigma_X^2$). Formally,

$$\rho_{XX'} = \rho_{XT}^2 = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2}. \qquad (1)$$

Given sample data from a single administration of a test, alpha estimates the reliability of test scores for a population of examinees of interest, as defined in Equation 1.
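To make Equation 1 concrete, the following Python illustration (ours, not from the paper) simulates two parallel forms and checks that their correlation approaches $\sigma_T^2/(\sigma_T^2 + \sigma_e^2)$; the variance values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
sigma_T, sigma_e = 2.0, 1.0                    # assumed population values
T = rng.normal(0, sigma_T, n)                  # true scores
X1 = T + rng.normal(0, sigma_e, n)             # parallel form 1
X2 = T + rng.normal(0, sigma_e, n)             # parallel form 2

print(np.corrcoef(X1, X2)[0, 1])               # empirical rho_XX'
print(sigma_T**2 / (sigma_T**2 + sigma_e**2))  # Equation 1: .8
```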

COMMON MISCONCEPTIONS ABOUT ALPHA

1. Common Misconception: Alpha was first developed by Cronbach.

Although subsequent researchers provide more elaborate interpretations, newer proofs, and more meaningful modifications of earlier works, academia gives the greatest recognition to the researcher who originated a formula. In our field, it is difficult to imagine not using the originator's name for a formula. Thus, Cronbach is commonly considered to have initially proposed the formula in question.


The Spearman-Brown formula was presented approximately 100 years ago. The formula was so named because it was published independently by Spearman (1910) and Brown (1910) in the same issue of the same journal, the British Journal of Psychology. Let $X_a$ and $X_b$ denote the first- and second-half test scores, respectively, such that $X = X_a + X_b$; let $\sigma_X^2$, $\sigma_a^2$, and $\sigma_b^2$ denote the variances of $X$, $X_a$, and $X_b$, respectively; and let $\sigma_{ab}$ and $\rho_{ab}$ denote the covariance and the product-moment correlation, respectively, between $X_a$ and $X_b$. The Spearman-Brown formula ($\rho_{SB}$) states that the reliability of the full-length test scores ($\rho_{XX'}$) can be estimated by correcting the correlation between the two half-test scores as follows:

$$\rho_{SB} = \frac{2\rho_{ab}}{1 + \rho_{ab}}. \qquad (2)$$
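As a quick illustration (ours, not from the paper), Equation 2 in code; the half-test correlation of .6 is arbitrary:

```python
def spearman_brown(rho_ab: float) -> float:
    """Equation 2: step the half-test correlation up to full test length."""
    return 2 * rho_ab / (1 + rho_ab)

print(spearman_brown(0.6))  # 0.75
```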

The Spearman-Brown formula is based on the assumption that the half tests are parallel and thus that the score variances of the split-halves are equal. General formulas that are applicable when the variances differ were independently proposed by Flanagan (1937), Rulon (1939), and Mosier (1941). However, these split-half reliability formulas yield different results depending on how the halves are split. To resolve this issue, Kuder and Richardson (1937) proposed a reliability formula (called "KR-20") that can be used for data on dichotomously scored items (e.g., $X_i = 0$ or $1$). When $p_i$ is the percentage of correct responses for item $i$ and $k$ is the number of items, the KR-20 formula is expressed as follows:

$$\rho_{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i(1 - p_i)}{\sigma_X^2}\right). \qquad (3)$$

Guttman (1945) believed that the preceding studies set too many assumptions to obtain a reliability coefficient and thus presented six reliability estimators. The only assumption required for Guttman's estimators was the independence of the error scores of the items (i.e., $\sigma_{e_ie_j} = 0$). Guttman (1945) referred to the six estimators as lower bounds of reliability rather than as reliability coefficients. Below is a list of three of the six lower bounds, $\lambda_2$, $\lambda_3$, and $\lambda_4$ ($\lambda_4$ was proposed as the lower bound of the split-half reliability):

$$\lambda_2 = \frac{\sum_{i \neq j}\sigma_{ij} + \sqrt{\dfrac{k}{k-1}\sum_{i \neq j}\sigma_{ij}^2}}{\sigma_X^2}; \qquad (4)$$

$$\lambda_3 = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right); \text{ and} \qquad (5)$$

$$\lambda_4 = 2\left(1 - \frac{\sigma_a^2 + \sigma_b^2}{\sigma_X^2}\right). \qquad (6)$$
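The following sketch (ours; the covariance matrix is invented for illustration) computes $\lambda_2$, $\lambda_3$ (= alpha), and $\lambda_4$ for every equal split of a four-item test:

```python
import itertools
import numpy as np

def lambda2(S):
    k = S.shape[0]
    off = S - np.diag(np.diag(S))                 # off-diagonal covariances
    return (off.sum() + np.sqrt(k / (k - 1) * (off**2).sum())) / S.sum()

def lambda3(S):                                   # Equation 5 (= alpha)
    k = S.shape[0]
    return k / (k - 1) * (1 - np.trace(S) / S.sum())

def lambda4(S, half):                             # Equation 6, for one split
    a = np.array(half)
    b = np.setdiff1d(np.arange(S.shape[0]), a)
    var_a = S[np.ix_(a, a)].sum()                 # variance of half-test a
    var_b = S[np.ix_(b, b)].sum()                 # variance of half-test b
    return 2 * (1 - (var_a + var_b) / S.sum())

S = np.array([[1.0, 0.5, 0.4, 0.3],
              [0.5, 1.0, 0.4, 0.3],
              [0.4, 0.4, 1.0, 0.3],
              [0.3, 0.3, 0.3, 1.0]])
l4 = [lambda4(S, h) for h in itertools.combinations(range(4), 2)]
print(lambda2(S), lambda3(S), max(l4), np.mean(l4))
```

For these data, the mean of the $\lambda_4$ values over all equal splits reproduces alpha, anticipating Cronbach's (1951) proof discussed below, and $\lambda_2$ is slightly larger than alpha, consistent with $\lambda_3 \leq \lambda_2$.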

Cronbach (1951) proposed alpha, as expressed in Equation 5, which enables KR-20 to be applied to polytomously scored item data (e.g., $X_i = 0, 1, 2,$ or $3$). The Spearman-Brown formula was most frequently used to calculate split-half reliability at that time, but Cronbach (1951) criticized the formula, stating that it increased calculation error when it was applied to cases in which the variance differed between the split-halves. Cronbach (1951) used Guttman's $\lambda_4$ to prove that alpha (i.e., $\lambda_3$) is the mean of the $\lambda_4$ values that are computed for all possible split-halves. Guttman's $\lambda_4$ is algebraically equivalent to the formulas proposed by Flanagan (1937), Rulon (1939), and Mosier (1941).

As we have shown previously, Guttman's $\lambda_3$ is equivalent to alpha, but it was not Guttman who first proposed the formula. Hoyt (1941) applied an analysis of variance (ANOVA) model to derive a reliability coefficient, and his derivation arrived at a formula identical to alpha (Cronbach, 1951). Additionally, the KR-20 formula is another expression of alpha when it is applied to dichotomously scored items [note that $\sigma_i^2 = p_i(1 - p_i)$ for a dichotomous item]. This argument does not suggest that Cronbach appropriated the achievements of previous researchers. Although the coefficient is typically called Cronbach's alpha, Cronbach never named the coefficient after himself. In fact, Cronbach (2004, p. 397) shied away from the name Cronbach's alpha, stating the following: "To make so much use of an easily calculated translation of a well-established formula scarcely justifies the fame it has brought me. It is an embarrassment to me that the formula became conventionally known as 'Cronbach's $\alpha$.'"

It is the users (or, rather, the "consumers") of the formula who credited Cronbach (1951) rather than Hoyt (1941) or Guttman (1945), both of whom preceded Cronbach in introducing algebraically equivalent formulas. The different positioning of these studies could have led users to perceive the formulas as different even though they yielded the same results. First, alpha was positioned as a general reliability coefficient. Cronbach (1951) mathematically proved that alpha, the general formula of KR-20, can also serve as the general formula for the split-half reliability formulas. That is, Cronbach's proof that formulas previously thought to be unrelated are actually connected to each other led alpha to be positioned as a representative and comprehensive formula, not merely one of many reliability coefficients. Such positioning of alpha sharply contrasts with that of $\lambda_3$, which was merely one of six formulas proposed by Guttman (1945). Second, alpha was positioned as a reliability coefficient. Whereas Guttman (1945) referred to $\lambda_3$ as a lower bound of reliability, Cronbach (1951) presented alpha as a reliability coefficient. Users likely preferred to view alpha as a reliability coefficient rather than as a lower bound of reliability.

Interestingly, alpha was not the best reliability coefficient even at the time of Cronbach's (1951) publication. To find a better alternative to alpha, one would naturally expect to have to consult recent studies on reliability. Contrary to such expectations, a superior alternative already existed even before the name alpha was proposed by Cronbach (1951). Guttman (1945) referred to $\lambda_1$ as "a simple lower bound," $\lambda_3$ (= alpha) as "an intermediate lower bound," and $\lambda_2$ as "a better lower bound" and proved that $\lambda_1 \leq \lambda_3 \leq \lambda_2 \leq \rho_{XX'}$. That is, under the condition of independence between item errors, $\lambda_2$ is always superior to or as good as alpha. From the modern perspective, $\lambda_3$, an inferior alternative, might not have merited publication. Guttman proposed $\lambda_3$ because its formula was simpler and easier to calculate than that of $\lambda_2$. In the 1940s, calculation with computers was unimaginable, and ease of calculation was a virtue in a world in which all calculations were performed by hand.

2. Common Misconception: Alpha equals reliability.

In the current organizational research literature, alpha is often treated as equivalent to the reliability of test scores. However, it is difficult to find explanations of whether alpha is larger or smaller than reliability when the prerequisites for alpha to be the reliability coefficient are not fulfilled. Therefore, alpha can easily be misconceived to always equal reliability, or at least to be an unbiased estimator of reliability.

(a) Under the assumption of uncorrelated item errors

Guttman (1945, p. 274) proposed that $\lambda_3$ (i.e., alpha) could be the reliability coefficient "if and only if the variances and covariances of the expected scores on the items are all equal." An even more sophisticated proof was proposed by Novick and Lewis (1967), who suggested the concept of the essentially tau-equivalent condition between items. Let us describe the classical true score model in terms of factor analysis to explain the proof by Novick and Lewis. In the classical true score model, the observed score for item $i$ ($= 1, \ldots, k$) can be decomposed into two or three components as follows:

$$X_i = T_i + e_i = \mu_i + \lambda_i T + e_i, \quad \text{where } T_i = \mu_i + \lambda_i T, \; \sum_i \lambda_i = 1, \text{ and } \sum_i \mu_i = 0. \qquad (7)$$

The expression $X_i = \mu_i + \lambda_i T + e_i$ in Equation 7 is referred to as a single-factor model in psychometric analysis. In the single-factor model, the $\mu_i$ are constants that allow for differences in item-score means and that sum to zero across the items, and the $\lambda_i$ are the factor loadings, which represent the proportionate functional lengths of the items. The similarity between the items may vary depending on which constraints are imposed on the $\lambda_i$ and the inter-item error covariances. For example, if $\mu_i = 0$, $\lambda_i = \lambda_j$, and $\sigma_{e_i}^2 = \sigma_{e_j}^2$ for all $i$ and $j$, the items are parallel, such that $\sigma_{X_i}^2 = \sigma_{X_j}^2$ and $\sigma_{X_iX_j} = \sigma_T^2/k^2$. If $\lambda_i = \lambda_j$ for all $i$ and $j$, the items are essentially tau-equivalent, such that the $\sigma_{X_i}^2$ may differ but $\sigma_{X_iX_j} = \sigma_T^2/k^2$. The term "essentially" indicates that the addition of a constant ($\mu_i$) has essentially no effect on the variances or covariances of the item scores. If no constraints (except $\sum_i \lambda_i = 1$ and $\sum_i \mu_i = 0$) are imposed, the items have congeneric similarity, such that the $\sigma_{X_i}^2$ and the $\sigma_{X_iX_j}$ may each be unequal.
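A small numerical sketch (ours, not from the paper) of the item covariance structures implied by these similarity conditions; for simplicity it uses the generic one-factor parameterization $\Sigma = \lambda\lambda'\sigma_T^2 + \mathrm{diag}(\theta)$ rather than the $\sum_i \lambda_i = 1$ convention above:

```python
import numpy as np

def implied_cov(lam, theta, var_T=1.0):
    """Item covariance matrix implied by a single common factor."""
    lam = np.asarray(lam, float)
    return np.outer(lam, lam) * var_T + np.diag(theta)

# essentially tau-equivalent: equal loadings, unequal error variances
# -> equal covariances, unequal variances
print(implied_cov([1, 1, 1], [0.5, 0.8, 1.1]))

# congeneric: unequal loadings -> unequal covariances and variances
print(implied_cov([0.6, 0.8, 1.0], [0.5, 0.5, 0.5]))
```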

Now consider another expression of alpha that is based on the average of all $k(k-1)$ observed-score covariances between items:

$$\alpha = \frac{k}{k-1}\left(\frac{\sum_{i \neq j}\sigma_{ij}}{\sigma_X^2}\right) = \frac{k^2\,\mathrm{Mean}(\sigma_{ij})}{\sigma_X^2}. \qquad (8)$$

This expression is simply derived from the fact that $\sigma_X^2 = \sum_i\sum_j \sigma_{ij} = \sum_i \sigma_i^2 + \sum_{i \neq j}\sigma_{ij}$, where $\sigma_i^2 = \sigma_{X_i}^2$ and $\sigma_{ij} = \sigma_{X_iX_j}$. If the items are "at least" essentially tau-equivalent (i.e., $\sigma_{X_iX_j} = \sigma_T^2/k^2$ for all $i$ and $j$), the numerator in Equation 8 becomes equal to the true score variance $\sigma_T^2$. That is, alpha is equal to $\rho_{XX'}$ when the essentially tau-equivalent condition holds among the items. Elaborating this statistical point, Novick and Lewis (1967) demonstrated that the necessary and sufficient condition for alpha to equal reliability is essential tau-equivalency and that, if this condition is not met, alpha is smaller than reliability.


Alpha has typically been referred to as a reliability coefficient rather than a lower-bound estimate; however, the latter is the more correct description in a strict sense. The concept of a lower bound enables us to better understand several characteristics of alpha that might seem counterintuitive if we considered alpha to be a reliability coefficient.

First, we rethink the meaning of Cronbach's (1951) mathematical proof. The average of the split-half reliability estimates that are acquired from all possible split-halves is an intuitively attractive concept. However, human intuition is occasionally deceived, and one important point is usually overlooked: Guttman's $\lambda_4$ is not a reliability coefficient; rather, it is a lower bound of reliability. The concept of a lower bound to reliability yields the understanding that $\mathrm{Mean}(\lambda_4)$, the mean of the $\lambda_4$ values from all possible split-halves, does not approximate the actual reliability; rather, $\mathrm{Max}(\lambda_4)$, the maximum of the $\lambda_4$ values, does.

Second, alpha is negative when the average of the inter-item score covariances is negative (see Equation 8; Cronbach & Hartmann, 1954). Many textbooks explain that alpha has a value between zero and one, and upon reading this explanation, one may assume that it is impossible for alpha to be negative, irrespective of any prerequisite condition. In practice, however, alpha may be negative in some situations, for example, when a negatively worded item in a personality scale is accidentally not reverse-scored (Sijtsma, 2009a) or when an item in a multiple-choice achievement test has a negative discrimination. Not all reliability coefficients share this problem with alpha; we will show later that the SEM estimators of reliability (i.e., Equations 16 and 19) are always non-negative.

Third, alpha may be smaller than one even when no measurement error exists. By definition, the reliability of test scores must be one when no measurement error exists. Thus, one might assume that alpha will also be one when the measurement error variance is zero. To examine this statement, let us take an exemplary case in which homogeneity (i.e., unidimensionality) across items is satisfied but the essentially tau-equivalent condition is not. Consider a three-item test with the following variance-covariance matrix, which meets the homogeneity condition:

$$\Sigma = \begin{pmatrix} 1.0 & 1.0 & 2.0 \\ 1.0 & 1.0 & 2.0 \\ 2.0 & 2.0 & 4.0 \end{pmatrix}.$$

The computation of alpha based on this matrix results in $\alpha = .9375$, which is less than one (i.e., perfect reliability). This example effectively demonstrates that alpha is a lower bound of reliability when the essentially tau-equivalent condition is not satisfied. The example also suggests that items should be standardized before an aggregate score is computed when items with radically different variances are combined. The value of standardized alpha for this example is one. The value of the congeneric reliability coefficient (described later) for this example is also one.
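These values can be checked numerically; a minimal sketch replicating the example:

```python
import numpy as np

S = np.array([[1.0, 1.0, 2.0],
              [1.0, 1.0, 2.0],
              [2.0, 2.0, 4.0]])
k = S.shape[0]

alpha = k / (k - 1) * (1 - np.trace(S) / S.sum())
print(round(alpha, 4))                       # 0.9375

d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)                       # correlation matrix
alpha_std = k / (k - 1) * (1 - np.trace(R) / R.sum())
print(round(alpha_std, 4))                   # 1.0 (standardized alpha)

# congeneric reliability: loadings (1, 1, 2) with zero error variances
lam = np.array([1.0, 1.0, 2.0])
print(lam.sum()**2 / S.sum())                # 1.0
```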

(b) Without the assumption of uncorrelated item errors

Thus far, we have assumed that item errors ($e_i$) are independent of each other (i.e., $\sigma_{e_ie_j} = 0$). We now consider the following question: "When the errors are correlated, would alpha increase or decrease, compared with when the errors are independent?" To answer the question, we first note that the variance of the observed item scores ($\sigma_{X_i}^2$) is not affected by a non-zero correlation between item errors, but when the errors are not independent, the inter-item covariance ($\sigma_{X_iX_j}$) changes to $\sigma_T^2/k^2 + \sigma_{e_ie_j}$, not $\sigma_T^2/k^2$. That is, collectively,

$$\sum_{i \neq j}\sigma_{X_iX_j} = \frac{k-1}{k}\,\sigma_T^2 + \sum_{i \neq j}\sigma_{e_ie_j}. \qquad (9)$$

Thus, test score reliability should be expressed as follows:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sum_i \sigma_{e_i}^2 + \sum_{i \neq j}\sigma_{e_ie_j}}. \qquad (10)$$

Lucke (2005) developed the ideas expressed in Equations 9 and 10 into more sophisticated proofs, which generate the following implications. First, correlated item errors affect alpha and classical reliability in opposite directions. Positively correlated item errors decrease the value of reliability but make alpha overestimate (i.e., inflate) the true reliability. Second, the impact of correlated item errors on alpha and classical reliability occurs strictly through the sum of the inter-item error covariances. The internal structure of the inter-item error covariances (e.g., autocorrelated, moving average) is irrelevant, and the similarity between the items (e.g., parallel, tau-equivalent, congeneric) is also irrelevant. Third, the necessary and sufficient condition for alpha to be equivalent to classical reliability is that the sum of the inter-item error covariances equals the deviation from tau-equivalency. If the former is smaller than the latter, alpha underestimates reliability; if the former is greater than the latter, alpha overestimates reliability.

Correlated errors may arise from various sources, such as common stimulus materials, consistency response sets, and transient errors (Green & Yang, 2009a). Under the assumption of uncorrelated item errors, alpha is a lower-bound estimate of reliability, which implies that a high value of alpha ensures a high value of reliability. Without this assumption, a high value of alpha no longer guarantees a given level of reliability. Positively correlated item errors reduce the level of reliability but increase the value of alpha. Thus, an alternative method is needed that substitutes for alpha and that provides a reliability estimate that is not overly distorted by correlated item errors.

Multiple-factor model. Before discussing an alternative, we first define the reliability of test scores by using the multiple-factor model, also known as the hierarchical factor model (Lord & Novick, 1968, p. 535; McDonald, 1999), and introduce two "omega" coefficients based on the model. With the hierarchical factor model, the vector of $k$ observed item scores (in deviation form), $\mathbf{x}$, may be decomposed as follows (Zinbarg, Revelle, Yovel, & Li, 2005):

$$\mathbf{x} = \mathbf{c}g + \mathbf{Af} + \mathbf{Ds} + \mathbf{e}, \qquad (11)$$

where $g$ is a general factor (common to all items), $\mathbf{c}$ is the $k \times 1$ vector of unstandardized general factor loadings, $\mathbf{f}$ is the $r \times 1$ vector of group factors (each applying to only some items), $\mathbf{A}$ is the $k \times r$ matrix of unstandardized group factor loadings, $\mathbf{s}$ is the $k \times 1$ vector of specific factors that are unique to each item, $\mathbf{D}$ is the $k \times k$ diagonal matrix of unstandardized specific factor loadings, and $\mathbf{e}$ is the $k \times 1$ vector of random item errors. (The multifactor measurement model in Equation 11 will be reexpressed in the SEM framework in a later section.) In this model, all factors ($g$, $\mathbf{f}$, and $\mathbf{s}$) are assumed to be uncorrelated with each other and with $\mathbf{e}$, and the variance of each factor is assumed to be one. Additionally, the item errors are assumed to be uncorrelated with each other and are not standardized. Based on the factor loadings in Equation 11, the reliability of test scores can be expressed as

$$\rho_{XX'} = \frac{\mathbf{1'cc'1} + \mathbf{1'AA'1} + \mathbf{1'DD'1}}{\sigma_X^2}. \qquad (12)$$

This equation suggests that all components except $\mathbf{e}$ in Equation 11 contribute to the true score. However, $\mathbf{e}$ and $\mathbf{s}$ are indistinguishable (i.e., confounded, such that $\mathbf{u} = \mathbf{Ds} + \mathbf{e}$) when item scores are obtained from a single test administration. Thus, McDonald (1999) proposed the omega coefficient (denoted $\omega$ or $\omega_t$) to estimate reliability. Drawing on work by McDonald (1970, 1999), Zinbarg et al. (2005) explicitly presented a multidimensional version of $\omega$:

$$\omega = \frac{\mathbf{1'cc'1} + \mathbf{1'AA'1}}{\sigma_X^2}. \qquad (13)$$

McDonald (1999) also identified the proportion of variance in the test scores that is accounted for by a general factor, called hierarchical omega ($\omega_h$), and Zinbarg et al. (2005) presented a formula for $\omega_h$ as

$$\omega_h = \frac{\mathbf{1'cc'1}}{\sigma_X^2}. \qquad (14)$$

The omega coefficient can be used as a general formula for computing reliability that does not require the assumption of uncorrelated item errors. Let us denote the variance-covariance matrix of the item errors ($\mathbf{e}$) in Equation 11 as $\boldsymbol{\Theta}$. In the multiple-factor model, $\boldsymbol{\Theta}$ is often treated as a diagonal matrix because the item errors are assumed to be uncorrelated with each other, as noted earlier. Of course, this assumption can be violated, and the item errors may be correlated with each other. However, the possibility of correlated item errors does not create the need for a formula other than Equation 13 to quantify reliability, because regardless of whether the errors are correlated, the total test variance is always expressed and computed as

$$\sigma_X^2 = \mathbf{1'cc'1} + \mathbf{1'AA'1} + \mathbf{1'DD'1} + \mathbf{1'\Theta 1}.$$
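For readers who prefer code to matrix notation, the following sketch (the loadings are illustrative, not from the paper) evaluates Equations 13 and 14 directly:

```python
import numpy as np

def omegas(c, A, D, Theta):
    one = np.ones(len(c))
    common = one @ np.outer(c, c) @ one          # 1'cc'1
    group = one @ A @ A.T @ one                  # 1'AA'1
    specific = one @ D @ D.T @ one               # 1'DD'1
    var_x = common + group + specific + one @ Theta @ one
    return (common + group) / var_x, common / var_x   # omega, omega_h

k = 6
c = np.full(k, 0.7)                   # general-factor loadings (illustrative)
A = np.zeros((k, 1)); A[:3, 0] = 0.5  # one group factor on the first 3 items
D = np.zeros((k, k))                  # no specific factors
Theta = np.diag(np.full(k, 0.3))      # uncorrelated item errors
print(omegas(c, A, D, Theta))         # (~.917, ~.813)
```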

Figure 1 illustrates the differential effects of inter-item error correlations on alpha and $\omega$, which were computed under the assumptions of unidimensionality and correlated item errors. The vertical line at a correlation of zero represents the condition of uncorrelated item errors, which is unrealistic and at best exceptional in an actual research environment. At that point, alpha in the tau-equivalent condition and $\omega$ both equal reliability, whereas alpha in the congeneric (but not tau-equivalent) condition underestimates reliability. As the inter-item error correlation increases, the values of alpha in the two conditions also increase, but that of $\omega$ decreases, in the same direction as the reliability defined in Equation 10.

---------------------------------------------Insert Figure 1 about here ----------------------------------------------

3. Common Misconception: A high value of alpha is an indication of internal consistency.

Alpha has a seemingly inextricable connection with internal consistency. Prestigious psychometricians refer to alpha as "a measure of internal consistency" (Nunnally & Bernstein, 1994, p. 290) or "an internal consistency estimate" (Thompson, 2003, p. 10). Popular textbooks on research methods offer more detailed descriptions of alpha, for example, "[t]he most commonly reported index of internal consistency" (Christensen, Johnson, & Turner, 2011, p. 144) or "[t]he most common and powerful method used today for calculating internal consistency reliability" (Rubin & Babbie, 2009, p. 184). The cognitive bond between the two terms is so strong that a considerable number of empirical studies substitute the expression internal consistency for alpha when reporting reliability information (Hogan, Benjamin, & Brezinski, 2000).

Despite its frequent use, surprisingly little consensus exists about the precise meaning of internal consistency. This study identifies three different definitions of internal consistency that can be found, either explicitly or implicitly, in the literature: homogeneity, the interrelatedness of a set of items, and general factor saturation; each of these definitions, in turn, raises further definitional questions. The present study defines homogeneity as the unidimensionality of a set of items, based on previous studies (Cortina, 1993; Green, Lissitz, & Mulaik, 1977; McDonald, 1981; Schmitt, 1996; Sijtsma, 2009a). Unidimensionality refers to the existence of one latent trait underlying a set of items (Hattie, 1985). The interrelatedness of a set of items is defined as the arithmetic mean of the inter-item correlation coefficients, that is, $\bar{r}_{ij}$ (Cronbach, 1951). General factor saturation refers to the proportion of test variance that is due to a general factor (Revelle & Zinbarg, 2009). Hierarchical omega ($\omega_h$) is the most recommended index of general factor saturation (Zinbarg et al., 2005).

The definition of internal consistency as homogeneity stems from Cronbach (1951), who used the two terms interchangeably (Schmitt, 1996). He also proposed that a high value of alpha is indicative of homogeneity. Green et al. (1977) and McDonald (1981) noted logical problems in the argument provided by Cronbach (1951), and other studies (e.g., Cortina, 1993; Schmitt, 1996; Ten Berge & Sočan, 2004) demonstrated through counterexamples that alpha cannot be an indication of homogeneity or unidimensionality. Nevertheless, such an interpretation persists (Green & Yang, 2009a).

The fact that alpha cannot evidence homogeneity may be demonstrated through another counterexample. Consider four tests, each consisting of eight items (V1–V8), whose observed variance-covariance matrices (A, B, C, and D) are presented in Table 1. Matrices A and B both illustrate test situations in which only a general factor is present and all items are homogeneous. Matrix C illustrates a situation in which no general factor exists but two group factors (one loaded on by V1–V4 and the other by V5–V8) are present. Matrix D represents a situation in which a general factor and two group factors are present. To avoid the confounding of specific factors and random errors, it is assumed that no specific factors exist. The two tests associated with matrices A and B are one-factor (i.e., homogeneous) tests, but they have different values of alpha, .7742 and .9492, respectively. This comparison indicates that the alpha value does not closely relate to test unidimensionality. Further, if we compute alpha values for matrices A and C, we obtain the same value of .7742, although matrix C is derived from a multidimensional test.

---------------------------------------------Insert Table 1 about here ----------------------------------------------


The definition of internal consistency as the interrelatedness of a set of items has been accepted by many experts (Cortina, 1993; Green et al., 1977; McDonald, 1981; Schmitt, 1996; Sijtsma, 2009a). First, a high level of alpha does not indicate internal consistency in this definition. Alpha is a function of both the item interrelatedness and the number of items in the set. Even when the average of the inter-item correlation coefficients is as low as .1, a satisfactory level of alpha can be obtained if there are enough items (e.g., if $k = 21$, $\alpha = .7$, whereas if $k = 36$, $\alpha = .8$; these values can be verified with the sketch below). Therefore, we cannot draw any conclusion about internal consistency solely from the level of alpha. Second, this definition is not congruent with the dictionary meaning of consistency in a situation in which strong group factors exist. According to the definition, we must declare that matrices A and C in Table 1 have the same level of internal consistency because they have the same value of $\bar{r}_{ij}$. This conclusion conflicts with the observation that matrix C is not internally consistent in an everyday sense.

The definition of internal consistency as general factor saturation was suggested by Revelle (1979). A notable point about this definition is that alpha is no longer closely related to internal consistency when strong group factors and a weak general factor exist. For matrices A and C in Table 1, the same value of alpha exhibits a sharp contrast with $\omega_h$, which yields the clearly distinguished values of .7742 and zero, respectively.

In summary, alpha does not indicate internal consistency under any of these definitions. In addition, there is little utility in using the term internal consistency from the perspective of clarity and usefulness; what internal consistency exactly means is ambiguous. The use of more descriptive terms, such as item interrelatedness, is more helpful for understanding content and context.
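A minimal sketch verifying the alpha values quoted above for $\bar{r}_{ij} = .1$ (standardized alpha, i.e., Equation 8 applied to a correlation matrix):

```python
def alpha_from_rbar(k: int, rbar: float) -> float:
    """Standardized alpha as a function of test length and mean correlation."""
    return k * rbar / (1 + (k - 1) * rbar)

print(alpha_from_rbar(21, 0.1))   # ~0.70
print(alpha_from_rbar(36, 0.1))   # ~0.80
```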

4. Common Misconception: Reliability will always be improved by deleting items using "alpha if item deleted."

Equation 8 clearly illustrates that, given the observed test score variance $\sigma_X^2$, alpha is essentially a function of the number of items and the average inter-item covariance. Theoretically, alpha increases as the number of items increases, with the average covariance being fixed. Peterson (1994) investigated the seemingly obvious relationship between the number of items and alpha and obtained the unexpected result that alpha does not significantly increase even when the number of items increases. Further, Peterson discovered that alpha increases as the number of items that are eliminated during scale development increases. This "paradox of the number of items" reveals two contrasting strategies that are commonly used to obtain an acceptable level of alpha. The first strategy is to increase the number of items, which compensates for the quality of the items with their quantity. The other strategy is to decrease the number of items: the higher inter-item correlation observed when the number of items is smaller may result from retaining only the higher-correlating items and deleting the lower-correlating ones. The "alpha if item deleted" information provided by statistical software packages facilitates this strategy.

Kopalle and Lehmann (1997) suggested that deleting items with lower inter-item correlations can lead to an overstatement of alpha, or "alpha inflation," in which the reported sample-level alpha is higher than the population-level alpha. They raised the need to separate the calibration sample, which is used to determine the scale, from the cross-validation sample, which is used to calculate reliability, and suggested that the deletion of items must follow a theoretical and logical basis. Raykov (2007, 2008) revealed a more critical problem associated with reducing the number of items on the basis of the "alpha if item deleted" information. Raykov (2007) showed that the actual reliability of a scale may decrease even though alpha appears to increase after the number of items is reduced, and Raykov (2008) proved that predictive validity can also decrease. Raykov (2008) proposed a latent variable modeling approach that produces point and interval estimates of criterion validity as well as reliability after the deletion of individual components. While Raykov (2008) raised the possibility that an increase in the sample-level value of alpha might be obtained at the expense of predictive validity, the potential tradeoff between reliability and content validity will be discussed in the next section.

We have no intention of discouraging the use of the "alpha if item deleted" function itself. It can be helpful for identifying and remedying dysfunctional items within a scale. However, we generally discourage mechanical reliance on the output of statistical software. Researchers should be well versed in the substance of what they are studying and use that knowledge in conjunction with statistical indices to make judgments about the makeup of a measure.

5. Common Misconception: Alpha should be greater than or equal to .7 (or, alternatively, .8).


The works of Nunnally (Nunnally, 1967, 1978; Nunnally & Bernstein, 1994) are the second-most-cited documents in our field, with only the Bible having more citations. Interestingly, the broad-ranging remarks spread across the hundreds of pages of Nunnally's works are not cited nearly as often as their recommendations on acceptable levels of alpha (.7 or .8). Nunnally most likely offered concrete numbers purely with the intention of providing his readers with some practical aid. Such advised levels create various problems, however.

First, the advised levels of alpha are neither the result of empirical research (Churchill & Peter, 1981; Peterson, 1994) nor the consequence of clear logical reasoning; instead, they were derived from Nunnally's personal intuition. For example, there is no evidence that .7 is a better standard than .69 or .71.

Second, people use Nunnally's authority as an immunity standard, which "legally" excuses them from having to think further about reliability when alpha values above .7 or .8 are obtained. Two intriguing facts demonstrate that Nunnally's work (1967, 1978) is cited for its usefulness in providing a "hall pass" or "certificate" rather than for its content. First, in the first edition of the work, Nunnally (1967) stated that a reliability of .5 or .6 is sufficient for exploratory research; however, the standard applied to exploratory research was raised to .7 in the second edition (Nunnally, 1978). People choose which edition of Nunnally's work to cite depending on whether their alpha is above or below .7 (Henson, 2001). Second, the second edition remains the most widely cited version of the work despite the existence of a third edition (Nunnally & Bernstein, 1994).

Third, artificial efforts to increase alpha above a certain level may harm reliability and validity. The strategy of deleting items on the basis of the "alpha if item deleted" information may reduce both reliability (Raykov, 2007) and criterion validity (Raykov, 2008). Another common strategy to increase alpha is to repeatedly present slightly different items that essentially measure the same component of a particular construct. Each item must correctly represent its whole to obtain high content validity; therefore, sacrificing the diversity of items to increase alpha hinders content validity. The phenomenon in which an increase in reliability is obtained at the cost of validity is known as the attenuation paradox (Humphreys, 1956; Loevinger, 1954). The phenomenon is called a paradox because a test with perfect reliability, despite seemingly being the epitome of an ideal test, is not valid. For example, all examinees who take a test with a reliability of one will obtain either a score of zero or a perfect score, because an examinee who answers one item correctly (or incorrectly) must answer every other item correctly (or incorrectly). In a similar context, Streiner (2003) argued that although a higher alpha is desirable, an excessively high level of alpha is not, because it accompanies unnecessary repetition and overlap. Some scholars have gone further and argued that a high level of alpha is fundamentally undesirable. For example, Kline (1986, pp. 118-119) asserted that "high internal consistency can be ... antithetical to high validity[;] ... the importance of internal-consistency reliability has been exaggerated." Boyle (1991, p. 291) criticized researchers' obsession with a high level of alpha, stating that "it may often be more appropriate to regard estimates such as the alpha coefficients as indicators of item redundancy and narrowness of a scale."


We recommend against mechanistically or automatically applying a cutoff criterion. When the importance of a decision made on the basis of a test score increases, the standard for reliability should also increase. Cortina's (1993, p. 101) advice that "the finer the distinction that needs to be made, the better the reliability must be" captures the essence of such a guideline. However, Lance, Butts, and Michels (2006) found that this guideline is rarely followed; most empirical studies have used .70 as a universal standard of reliability regardless of the stage or purpose of the research. One size does not fit all. The nature of the decision being made on the basis of a test should guide the acceptable level of reliability.¹

¹ We are indebted to anonymous reviewers for this idea.

6. Common Misconception: Alpha is the best choice among all published reliability coefficients.

Although alpha's presence overwhelmingly overshadows that of many other reliability coefficients, McDonald (1981, p. 113) claimed that "coefficient alpha cannot be used as a reliability coefficient." A number of reliability coefficients (i.e., estimators) have been proposed in an attempt to overcome the limitations of alpha, and several authors have compared the performance of these estimators. Osburn's (2000) simulation study included 11 reliability coefficients and reported $\mathrm{Max}(\lambda_4)$ as the most accurate estimator of reliability. Kamata, Turhan, and Darandari's (2003) investigation of four methods revealed that stratified-alpha was the best alternative. In Revelle and Zinbarg's (2009) analysis of 13 formulas, McDonald's $\omega$ was recommended as the best choice. The latent class reliability coefficient (LCRC) was ranked first in an analysis by van der Ark, van der Palm, and Sijtsma (2011), which considered five techniques.


Tang and Cui's (2012) comparison of three lower bounds supported the use of Guttman's $\lambda_2$. A common finding across all of these studies was that alpha received relatively poor scores. A more striking fact is that five different reliability coefficients were ranked first by the five comparison studies. The results of such studies give the impression that even among experts, there is no consensus on which methodology is superior. Until there is an explicit statement about which alternative method should replace alpha, users will continue to use it. To overcome this situation, Sijtsma (2009a, p. 107) recommended the use of the greatest lower bound (glb) (Jackson & Agunwamba, 1977; Woodhouse & Jackson, 1977), declaring that his paper was "meant to invite debate on" the issue. His intention of stimulating debate was successfully realized; in 2009, four comments on his paper (Bentler, 2009; Green & Yang, 2009a, 2009b; Revelle & Zinbarg, 2009), as well as his rejoinder (Sijtsma, 2009b), were published in Psychometrika. However, the recommendation of the glb in Sijtsma (2009a) was not easily accepted. Revelle and Zinbarg (2009) criticized how the glb, despite its name, yields a smaller value than McDonald's $\omega$. Tang and Cui (2012) noted that the glb not only tends to overestimate but also produces greater bias than $\lambda_2$. Moreover, Sijtsma excluded the glb from the list of alternatives in a comparative study in which he participated (van der Ark, van der Palm, & Sijtsma, 2011). The reactions to Sijtsma (2009a) indicated what top psychometricians have been considering as substitutes for alpha: all four comments proposed approaches based on SEM. Several authors have proposed methods for computing SEM estimates of reliability (Green & Yang, 2009b; Jöreskog, 1971; Miller, 1995; Raykov, 1997; Raykov & Shrout, 2002), including McDonald's (1999) $\omega$, as shown in Equation 13.


Alpha, which requires the more restrictive assumption of tau-equivalency, can be viewed as a special case of the SEM estimators of reliability, which are based on congeneric models. Alpha and the SEM estimates have the same value if the items are tau-equivalent. Moreover, violation of the tau-equivalency assumption can be tested by using SEM procedures (Fleishman & Benson, 1987; Graham, 2006; Jöreskog & Sörbom, 1996; Miller, 1995).

A FRAMEWORK FOR CHOOSING A RELIABILITY ESTIMATOR

1. Examinations of the assumptions of unidimensionality and tau-equivalency

Alpha should no longer be an unconditional and automatic choice for reliability estimation. As Cortina (1993) noted and as our previous discussion suggests, alpha should be used for reliability estimation when the following conditions are met: (a) the test measures a single factor, (b) the test items are essentially tau-equivalent in statistical similarity, and (c) the error scores of the items are uncorrelated. However, all of these conditions are rarely met in practice; one or more of the assumptions of unidimensionality, essential tau-equivalency, and uncorrelated errors may be violated to some degree. This study does not devote further attention to the detection and correction of correlated errors; interested readers are referred to Kim and Feldt (2011) and Raykov (2004). Nevertheless, regarding the selection of a reliability estimator, we recommend that the assumption of unidimensionality and tau-equivalency be examined


before the application of alpha and that SEM-based reliability estimators be substituted for alpha when one of these conditions is not satisfied. Figure 2 summarizes our guidelines.

---------------------------------------------Insert Figure 2 about here ----------------------------------------------

Unidimensionality. While various methods have been developed to test unidimensionality (Hattie, 1985), this study focuses on SEM approaches. The unidimensional model is nested within the multidimensional model in SEM, and the chi-square difference is usually used to test for statistical significance. Three models can be employed to conceptualize multidimensionality in SEM: the correlated factors model, the higher-order factor model, and the multiple-factor model (Figure 3).

---------------------------------------------Insert Figure 3 about here ----------------------------------------------

Although the correlated factors model is the most frequently used among organizational researchers, its popularity is not an indication of its superiority. As the exact opposite of the unidimensional model, the correlated factors model includes only subdomain constructs and omits a common construct, the hidden influence that causes the latent variables to correlate with each other. Moreover, paradoxically, the construct that most scale developers originally design to measure (Reise, 2012) and that most researchers primarily intend to study is excluded from the measurement model.

The higher-order factor model and the multiple-factor model, also known as the hierarchical factor model or bifactor model, share a commonality: they consider both subdomain factors and a common factor. While a general factor (i.e., $\xi_1$ in the multiple-factor model of Figure 3) is analogous to a second-order factor (i.e., $\xi_1$ in the higher-order factor model), group factors (i.e., $\xi_2$–$\xi_4$ in the multiple-factor model) are not analogous to first-order factors (i.e., $\eta_i$). Group factors are analogous to disturbances (i.e., $\varsigma_i$), as both are orthogonal to a general/second-order factor and both explain variance that is not explained by that factor (Reise, Moore, & Haviland, 2010). The two models are mathematically equivalent under some conditions (Yung, Thissen, & McLeod, 1999).

The two models nevertheless have some interesting differences. While a higher-order factor subjugates the lower-order factors, a general factor competes with the group factors in explaining the variances of the manifest variables. Whereas a general factor is directly linked with the manifest variables, a higher-order factor's connection with the manifest variables must be mediated by the lower-order factors (Reise et al., 2010).

Although the multiple-factor model is the least understood and least used model among organizational researchers, it has several advantages over the higher-order factor model (Chen, West, & Sousa, 2006). Multiple-factor models can easily detect a nonexistent domain-specific factor, because such a factor will cause an identification problem and a nonsignificant factor loading for the group factor; higher-order factor models, in contrast, fail to signal such anomalies, because nonsignificant variances of the disturbances of lower-order factors usually do not cause any estimation problems and can easily be overlooked by researchers. While group factors can be predicted by external variables independently of the general factor, estimating paths between the disturbances of first-order factors and external variables is difficult with second-order factor models. Because the higher-order model is nested within the multiple-factor model (Yung et al., 1999), the multiple-factor model functions as a baseline model for testing whether the chi-square differences between the models are statistically significant (Chen et al., 2006).

Let us describe the typical constraints of the multiple-factor model. Every manifest variable is assumed to load on the general factor and on one (and only one) group factor. The general factor is orthogonal to (i.e., uncorrelated with) the group factors by definition. The group factors are also generally constrained to be orthogonal to each other for identification and interpretability (Reise, 2012). In other words, the variance-covariance matrix of the latent variables (i.e., $\boldsymbol{\Phi}$) is usually restricted to a diagonal matrix (i.e., all off-diagonal elements, or covariances, are fixed at zero) or an identity matrix (i.e., all off-diagonal elements are fixed at zero and the variances of the latent variables are fixed at 1.0).

Tau-equivalency. More restrictions are placed on the tau-equivalent model than on the congeneric model. In the latter model, either the variance of the latent variable or a factor loading of one of the manifest variables must be fixed at a non-zero value (typically 1.0) to determine the scale of the latent variable.


The former model adds the constraint of equal factor loadings (i.e., $\lambda_i = \lambda_j$) to those of the latter model and thus requires that one of the following conditions be met: (a) the variance of the latent variable is fixed at a non-zero value, and every factor loading of the manifest variables that measure a common latent variable is constrained to be equal, or (b) every factor loading of the manifest variables that measure a common latent variable is fixed at an equal non-zero value. Figure 4 summarizes these requirements. Because the tau-equivalent model is nested within the congeneric model, the chi-square difference between the two models can be used to test for statistical significance.
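A sketch of the chi-square difference test just described; the chi-square and degrees-of-freedom values below are placeholders, not results from the paper:

```python
from scipy.stats import chi2

# placeholder fit results: (chi-square, df) for two nested models
chisq_restricted, df_restricted = 85.3, 24   # e.g., tau-equivalent model
chisq_free, df_free = 60.1, 19               # e.g., congeneric model

diff = chisq_restricted - chisq_free
df_diff = df_restricted - df_free
p = chi2.sf(diff, df_diff)
print(diff, df_diff, p)   # a small p suggests the added constraints are untenable
```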

---------------------------------------------Insert Figure 4 about here ----------------------------------------------

2. Reliability Estimators for Multidimensional Data

Among the statistical procedures that have been presented for estimating the reliability of multidimensional test scores, two estimators can be recommended to organizational researchers: the multidimensional version of $\omega$ (hereinafter referred to as $\omega_m$) and stratified-alpha (hereinafter referred to as $\alpha_s$).

$\omega_m$. This study offers formulas and computation examples of $\omega_m$ that are accessible to those who are not very familiar with matrix notation. Previous studies that have reported the formulas of the omega coefficients have not been very reader-friendly and have typically only briefly referenced matrix formulas, as we did in Equations 13 and 14.


Basically, $\omega_m$ is an SEM-based reliability estimator, and the value of $\omega_m$ is computed by fitting a multiple-factor measurement model to observed data. Equation 15 shows a computation-saving formula for $\hat{\omega}_m$, and Equation 16 displays an algebraically equivalent formula. That is,

$$\hat{\omega}_m = 1 - \frac{\sum_{i=1}^{k}\hat{\sigma}_{u_i}^2}{\hat{\sigma}_X^2}, \qquad (15)$$

or

$$\hat{\omega}_m = \frac{\left(\sum_{i=1}^{k}\hat{\lambda}_{i1}\right)^2 + \sum_{j=2}^{r+1}\left(\sum_{i=1}^{k}\hat{\lambda}_{ij}\right)^2}{\hat{\sigma}_X^2}, \qquad (16)$$

where $\hat{\lambda}_{i1}$ is the estimated unstandardized factor loading of item $i$ on the general factor, $\hat{\lambda}_{ij}$ is the estimated unstandardized factor loading of item $i$ on the $(j-1)$th group factor, $r$ is the number of group factors, $\hat{\sigma}_{u_i}^2$ is the estimated unique variance of item $i$, and $\hat{\sigma}_X^2$ is the sum of all elements of the fitted/reproduced variance-covariance matrix.

Let us consider an illustrative example. If we apply the orthogonal factor solution to the matrix D data in Table 1, we obtain $\hat{\lambda}_{i1} = \sqrt{.3}$ ($i = 1, \ldots, 8$), $\hat{\lambda}_{i2} = \sqrt{.4}$ ($i = 1, \ldots, 4$), $\hat{\lambda}_{i2} = 0$ ($i = 5, \ldots, 8$), $\hat{\lambda}_{i3} = 0$ ($i = 1, \ldots, 4$), $\hat{\lambda}_{i3} = \sqrt{.4}$ ($i = 5, \ldots, 8$), and $\hat{\sigma}_{u_i}^2 = .3$ for all items. These estimates lead to

$$\left(\sum_{i=1}^{8}\hat{\lambda}_{i1}\right)^2 + \sum_{j=2}^{3}\left(\sum_{i=1}^{8}\hat{\lambda}_{ij}\right)^2 = 8^2(.3) + \left[4^2(.4) + 4^2(.4)\right] = 19.2 + 12.8 = 32,$$

$$\sum\hat{\sigma}_{u_i}^2 = 8(.3) = 2.4, \quad \text{and} \quad \hat{\sigma}_X^2 = 34.4.$$

Finally, we obtain the estimate $\hat{\omega}_m = 1 - (2.4/34.4) = .9302$, according to Equation 15.
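The same computation in code (a sketch replicating the matrix D example):

```python
import numpy as np

lam_g = np.full(8, np.sqrt(0.3))                        # general factor
lam_1 = np.r_[np.full(4, np.sqrt(0.4)), np.zeros(4)]    # group factor 1
lam_2 = np.r_[np.zeros(4), np.full(4, np.sqrt(0.4))]    # group factor 2
u2 = np.full(8, 0.3)                                    # unique variances

true_var = lam_g.sum()**2 + lam_1.sum()**2 + lam_2.sum()**2  # 19.2 + 12.8 = 32
var_X = true_var + u2.sum()                                  # 34.4
print(round(1 - u2.sum() / var_X, 4))   # Equation 15: 0.9302
print(round(true_var / var_X, 4))       # Equation 16: 0.9302
```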


Stratified-alpha ($\alpha_s$). Our plan B is to use $\alpha_s$ if the SEM approach fails. Not all multidimensional data fit the multiple-factor model nicely; the model requires well-structured group factors to be properly estimated (Reise, 2012). For example, as we discussed previously, the existence of a trivial group factor is likely to cause identification and estimation problems (Chen et al., 2006). $\alpha_s$ is an easy-to-use alternative that is applicable when the model-based reliability estimator is not successful. Recall that Kamata et al.'s (2003) performance comparison reported $\alpha_s$ as the best method. To address the score reliability of stratified tests, Rajaratnam, Cronbach, and Gleser (1965) derived $\alpha_s$ from generalizability theory as

$$\alpha_s = 1 - \frac{\sum_{i=1}^{k}\sigma_{X_i}^2(1 - \alpha_i)}{\sigma_X^2}, \qquad (17)$$

where $\sigma_{X_i}^2$ and $\alpha_i$ are the observed score variance and the coefficient alpha, respectively, for part-test $i$, and $k$ is the number of part-tests. This formula for stratified-alpha entails the following estimation procedure: (a) obtain the observed score (unbiased) variance $\hat{\sigma}_{X_i}^2$ and coefficient alpha $\hat{\alpha}_i$ for each part-test, (b) estimate the error variance of each part-test by using the formula $\hat{\sigma}_{e_i}^2 = \hat{\sigma}_{X_i}^2(1 - \hat{\alpha}_i)$, and then (c) substitute the summed error variance and the total observed variance $\hat{\sigma}_X^2$ into the formula for $\alpha_s$.

Let us illustrate the $\alpha_s$ estimation procedure with the matrix D data in Table 1, for which the first and second part-tests comprise the sets of items V1–V4 and V5–V8, respectively. We find that $\hat{\sigma}_{X_i}^2 = 12.4$ and $\hat{\alpha}_i = .9032$ (thus $\hat{\sigma}_{e_i}^2 = 1.2$) for both part-tests. Substituting these component values into Equation 17 leads to $\hat{\alpha}_s = 1 - (2.4/34.4) = .9302$. Notice that $\hat{\alpha}_s = \hat{\omega}_m$ for the matrix D data.
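A sketch replicating this stratified-alpha computation (matrix D is rebuilt here from its implied values):

```python
import numpy as np

def alpha(S):
    k = S.shape[0]
    return k / (k - 1) * (1 - np.trace(S) / S.sum())

# matrix D of Table 1: 1.0 on the diagonal, .7 within parts, .3 between parts
D = np.full((8, 8), 0.3)
D[:4, :4] = 0.7
D[4:, 4:] = 0.7
np.fill_diagonal(D, 1.0)

parts = [np.arange(0, 4), np.arange(4, 8)]
err = sum(D[np.ix_(p, p)].sum() * (1 - alpha(D[np.ix_(p, p)])) for p in parts)
print(round(1 - err / D.sum(), 4))   # Equation 17: 0.9302
```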

3. Reliability Estimators Based on the Congeneric Measurement Model

When the tau-equivalency assumption is violated, an SEM-based congeneric reliability estimator is recommended as an alternative to coefficient alpha. The SEM-based congeneric reliability, presented by Jöreskog (1971) and McDonald (1999, p. 89), is simply the unidimensional version of $\omega$ (hereinafter referred to as $\omega_u$) with the one common factor $\xi$, and thus it can be expressed as

$$\omega_u = \frac{\mathbf{1'\lambda}\sigma_{\xi}^2\mathbf{\lambda'1}}{\mathbf{1'\Sigma_x1}} = \frac{\left(\sum\lambda_i\right)^2\sigma_{\xi}^2}{\left(\sum\lambda_i\right)^2\sigma_{\xi}^2 + \sum\sigma_{u_i}^2}, \qquad (18)$$

which is numerically equal to Equation 14. The estimate of $\omega_u$ is obtained by fitting the congeneric measurement model to sample data and substituting the estimates of the SEM parameters ($\lambda_i$ and $\sigma_{u_i}^2$) into Equation 18. When the SEM parameters are estimated, the value of $\sigma_{\xi}^2$ is usually set at 1.0 to resolve scale indeterminacy. Thus, in the literature, the estimate of $\omega_u$ is often expressed as in Equation 19. Equation 20 shows two other algebraically equivalent formulas for $\hat{\omega}_u$. That is,

$$\hat{\omega}_u = \frac{\left(\sum\hat{\lambda}_i\right)^2}{\left(\sum\hat{\lambda}_i\right)^2 + \sum\hat{\sigma}_{u_i}^2}, \qquad (19)$$

or

$$\hat{\omega}_u = \frac{\left(\sum\hat{\lambda}_i\right)^2}{\hat{\sigma}_X^2} = 1 - \frac{\sum\hat{\sigma}_{u_i}^2}{\hat{\sigma}_X^2}. \qquad (20)$$

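A sketch of Equation 19 (the loadings and unique variances below are illustrative, not estimates from the paper):

```python
import numpy as np

lam = np.array([0.7, 0.6, 0.8, 0.5])     # hypothetical estimated loadings
u2 = np.array([0.51, 0.64, 0.36, 0.75])  # hypothetical unique variances

omega_u = lam.sum()**2 / (lam.sum()**2 + u2.sum())  # Equation 19
print(round(omega_u, 4))
```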

Organizational researchers usually refer to these formulas as "composite reliability" (Peterson & Kim, 2013), supposedly shorthand for the reliability of composite scores. This designation is a misnomer, however, because the dictionary meaning of the term is too general to be limited to a specific method; it can encompass broad categories of reliability estimators, including alpha. Using such a designation is similar to referring to a proper noun (e.g., Chicago) by a common noun (e.g., city).

4. Examples of the Computation of Omega Coefficients

This section offers two examples that allow interested readers to replicate our computations.² Readers who are familiar with R, a free open-source statistical software platform, will find the psych package (Revelle, 2014) to be the most convenient tool for obtaining omega coefficients because its omega function estimates them automatically. We will assume that our typical readers use one of the SEM packages (e.g., LISREL, Mplus, and AMOS) that do not offer automated calculations of omega coefficients. Our LISREL program codes are displayed in Appendix 2.

² Our multidimensional data, originally from Thurstone and Thurstone (1941), are used as an example in the psych package and in the SAS/STAT 9.22 User's Guide. A text-format file can be downloaded at http://vincentarelbundock.github.io/Rdatasets/datasets.html. Our unidimensional data are shown in Appendix 2.

Table 2 presents an examination of the assumption of unidimensionality. The fit indices for the unidimensional model indicate unacceptable fit, suggesting that more than one latent trait underlies this set of items.


The chi-square difference between the unidimensional model and the multiple-factor model is significant at $\alpha = .05$, and $\omega_m$ is recommended as the proper reliability estimator for the data, according to the guidelines shown in Figure 2. Although the higher-order factor model is not necessary for a unidimensionality check, we included it because we compared it with the multiple-factor model in the previous section. The chi-square difference between the two models is significant at $\alpha = .05$, consistent with Chen et al.'s (2006) findings that the multiple-factor model usually has greater power than the higher-order factor model.

----------------------------------------------
Insert Table 2 about here
----------------------------------------------
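For readers replicating the model comparisons in Table 2, here is a small base-R sketch (our illustration, using only the rounded chi-square values reported in the table) of the chi-square difference tests:

chisq_diff <- function(chisq1, df1, chisq0, df0) {
  d   <- chisq1 - chisq0                        # difference in chi-square
  ddf <- df1 - df0                              # difference in df
  c(diff = d, df = ddf, p = pchisq(d, ddf, lower.tail = FALSE))
}
chisq_diff(233.54, 27, 24.21, 18)  # (a - c): 209.33 on 9 df, p < .001
chisq_diff(38.19, 24, 24.21, 18)   # (b - c): 13.98 on 6 df, significant at .05

Small discrepancies from the table (e.g., 209.33 vs. 209.32) arise because Table 2 was computed from unrounded fit statistics.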

Table 3 displays a step-by-step explanation of the computation of ω̂_m. STEP 1 is to sum the unique variances; the use of a spreadsheet program facilitates this process. STEP 2 is to sum the fitted/implied covariance matrix. Most SEM packages present only the lower triangular and diagonal elements of the fitted covariance matrix rather than all of its elements. We can obtain the sum of all the elements with the formula 2 × Σ(subdiagonal) + Σ(diagonal). STEP 3 is to compute ω̂_m; the value of ω̂_m for these data is .9312, according to Equation 15. We used the sample covariance matrix to compute the values of α̂ and α̂_s and obtained .8915 and .9260, respectively. A base-R sketch of these three steps follows the Table 3 callout below.

----------------------------------------------
Insert Table 3 about here
----------------------------------------------
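The promised sketch of STEPS 1–3 appears below (our illustration, entering the fitted values from Table 3; only the lower triangle is typed in, mirroring typical SEM output):

rows <- list(
  c(1.00),
  c(0.83, 1.00),
  c(0.78, 0.78, 1.00),
  c(0.47, 0.48, 0.46, 1.00),
  c(0.46, 0.47, 0.45, 0.67, 1.00),
  c(0.44, 0.45, 0.43, 0.59, 0.54, 1.00),
  c(0.44, 0.45, 0.43, 0.34, 0.34, 0.32, 1.00),
  c(0.51, 0.52, 0.50, 0.40, 0.40, 0.38, 0.56, 1.00),
  c(0.41, 0.42, 0.40, 0.32, 0.32, 0.30, 0.60, 0.45, 1.00))
L <- matrix(0, 9, 9)                             # fitted/implied covariance matrix
for (i in seq_along(rows)) L[i, seq_along(rows[[i]])] <- rows[[i]]
u2 <- c(.17, .17, .27, .25, .39, .52, .15, .50, .55)
sum(u2)                                          # STEP 1: 2.97
total <- 2 * sum(L[lower.tri(L)]) + sum(diag(L)) # STEP 2: about 43.2
1 - sum(u2) / total                              # STEP 3: omega_m, about .931

Because the matrix entries are rounded to two decimals, the result is approximately .931; Table 3 reports .9312 from the unrounded fitted matrix.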

Table 4 presents an examination of the assumption of tau-equivalency. The fit indices for the tau-equivalent model indicate poor fit, and its chi-square value is significantly greater than that of the congeneric model at α = .05. According to the guidelines in Figure 2, ω_u is recommended for these data.

----------------------------------------------
Insert Table 4 about here
----------------------------------------------

Table 5 displays the computation of ω̂_u. STEP 1 is to calculate the square of the sum of the factor loadings and the sum of the unique variances. We can skip STEP 2 if we apply the computation-saving formula in Equation 19. In STEP 3, we can use any one of three equivalent formulas, all of which produce the same value of .7261. The value of α̂ is .7009 for these data.

----------------------------------------------
Insert Table 5 about here
----------------------------------------------


CONCLUSION

We commonly observe cases in which the best-selling products are not those with the best quality. When switching costs are high, consumers tend to choose a more familiar alternative (e.g., the QWERTY keyboard layout) even when they know of a more efficient one (e.g., the Dvorak keyboard layout). Because of network externalities, in which other people’s choices affect the utility of an individual’s choice, the winner may in some situations take all. For example, an individual may face a considerable disadvantage by using a different spreadsheet program while everyone else uses Microsoft Excel. Alpha is a good example of how such marketing concepts apply to the choice of statistical analysis methods. Alpha is a relatively inferior method despite its widespread use. Even when users are aware of alpha’s inferiority, they may be unwilling to invest the effort needed to become familiar with other reliability coefficients. Moreover, even users willing to bear the personal costs of switching to another reliability coefficient may fear penalties for not using alpha in their studies, because dissertation committees and editors are likely familiar with alpha but may not be familiar with its alternatives. From the perspective of network externalities, replacing alpha with a superior alternative is not merely a matter of personal choice but a matter of academia consciously responding to the issue. It would be prudent for the editors of academic journals on organizational research to recommend that their contributors use superior alternatives alongside or in place of alpha.

References

Bentler, P. M. (2009). Alpha, dimension-free, and model-based internal consistency reliability. Psychometrika, 74(1), 137-143.
Boyle, G. J. (1991). Does item homogeneity indicate internal consistency or item redundancy in psychometric scales? Personality & Individual Differences, 12(3), 291-294.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3), 296-322.
Chen, F. F., West, S. G., & Sousa, K. H. (2006). A comparison of bifactor and second-order models of quality of life. Multivariate Behavioral Research, 41(2), 189-225.
Christensen, L. B., Johnson, R. B., & Turner, L. A. (2011). Research methods, design, and analysis (11th ed.). Boston, MA: Pearson.
Churchill, G. A., & Peter, J. P. (1984). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, 21(Nov.), 360-375.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391-418.
Cronbach, L. J., & Hartmann, W. (1954). A note on negative reliabilities. Educational and Psychological Measurement, 14(4), 342-346.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: American Council on Education and Macmillan.
Flanagan, J. C. (1937). A proposed procedure for increasing the efficiency of objective tests. Journal of Educational Psychology, 28(1), 17-21.
Fleishman, J., & Benson, J. (1987). Using LISREL to evaluate measurement models and scale reliability. Educational and Psychological Measurement, 47(4), 925-939.
Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66(6), 930-944.
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827-838.
Green, S. B., & Yang, Y. (2009a). Commentary on coefficient alpha: A cautionary tale. Psychometrika, 74(1), 121-135.
Green, S. B., & Yang, Y. (2009b). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74(1), 155-167.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255-282.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65-110). Westport, CT: American Council on Education and Praeger.
Harzing, A. W. (2013). Publish or perish. Available from http://www.harzing.com/pop.htm
Hattie, J. (1985). Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9(2), 139-164.
Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34(3), 177-189.
Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of use of various types. Educational and Psychological Measurement, 60(4), 523-531.
Hoyt, C. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6(3), 153-160.
Humphreys, L. (1956). The normal curve and the attenuation paradox in test theory. Psychological Bulletin, 53(6), 472-476.
Jackson, P. H., & Agunwamba, C. C. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: I: Algebraic lower bounds. Psychometrika, 42(4), 567-578.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36(2), 109-133.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User’s reference guide (2nd ed.). Chicago: Scientific Software International.
Kamata, A., Turhan, A., & Darandari, E. (2003). Estimating reliability for multidimensional composite scale scores. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
Kim, S., & Feldt, L. S. (2011). A comparative study on coefficient alpha and congeneric-model-based reliability estimators for tests composed of clusters of items. Journal of Educational Evaluation, 24(4), 1061-1084.
Kline, P. (1986). A handbook of test construction: Introduction to psychometric design. London: Methuen.
Kopalle, P. K., & Lehmann, D. R. (1997). Alpha inflation? The impact of eliminating scale items on Cronbach’s alpha. Organizational Behavior and Human Decision Processes, 70(3), 189-197.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151-160.
Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria. Organizational Research Methods, 9(2), 202-220.
Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51(5), 493-504.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lucke, J. F. (2005). “Rassling the hog”: The influence of correlated item error on internal consistency, classical reliability and congeneric reliability. Applied Psychological Measurement, 29(2), 106-125.
McDonald, R. P. (1970). The theoretical foundations of common factor analysis, principal factor analysis, and alpha factor analysis. British Journal of Mathematical and Statistical Psychology, 23(1), 1-21.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34(1), 100-117.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test theory and structural equation modeling. Structural Equation Modeling, 2(3), 255-273.
Mosier, C. I. (1941). A short cut in the estimation of split-halves coefficients. Educational and Psychological Measurement, 1(1), 407-408.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32(1), 1-13.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Osburn, H. G. (2000). Coefficient alpha and related internal consistency reliability coefficients. Psychological Methods, 5(3), 343-355.
Peterson, R. A. (1994). A meta-analysis of Cronbach’s coefficient alpha. Journal of Consumer Research, 21(Sep.), 381-391.
Peterson, R. A., & Kim, Y. (2013). On the relationship between coefficient alpha and composite reliability. Journal of Applied Psychology, 98(1), 194-198.
Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. (1965). Generalizability of stratified-parallel tests. Psychometrika, 30(1), 39-56.
Raykov, T. (1997). Estimation of composite reliability for congeneric measures. Applied Psychological Measurement, 21(2), 173-184.
Raykov, T. (2004). Behavioral scale reliability and measurement invariance evaluation using latent variable modeling. Behavior Therapy, 35(2), 299-331.
Raykov, T. (2007). Reliability if deleted, not “alpha if deleted”: Evaluation of scale reliability following component deletion. British Journal of Mathematical and Statistical Psychology, 60(2), 201-216.
Raykov, T. (2008). Alpha if item deleted: A note on criterion validity loss in scale revision if maximizing coefficient alpha. British Journal of Mathematical and Statistical Psychology, 61(2), 275-285.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9(2), 195-212.
Reise, S. P. (2012). Invited paper: The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667-696.
Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92(6), 544-559.
Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1), 57-74.
Revelle, W. (2014). Package ‘psych’. http://cran.r-project.org/web/packages/psych/psych.pdf
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74(1), 145-154.
Rubin, A., & Babbie, E. R. (2008). Research methods for social work (6th ed.). Belmont, CA: Thomson Brooks/Cole.
Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8(4), 350-353.
Sijtsma, K. (2009a). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107-120.
Sijtsma, K. (2009b). Reliability beyond theory and into practice. Psychometrika, 74(1), 169-173.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3(3), 271-295.
Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80(1), 99-103.
Tang, W., & Cui, Y. (2012, April). A simulation study for comparing three lower bounds to reliability. Paper presented at the annual meeting of the American Educational Research Association, Vancouver, Canada.
Ten Berge, J. M. F., & Sočan, G. (2004). The greatest lower bound to the reliability of a test and the hypothesis of unidimensionality. Psychometrika, 69(4), 613-625.
Thompson, B. (2003). Understanding reliability and coefficient alpha, really. In B. Thompson (Ed.), Score reliability: Contemporary thinking on reliability issues (pp. 3-23). Thousand Oaks, CA: Sage.
Thurstone, L. L., & Thurstone, T. (1941). Factorial studies of intelligence. Chicago, IL: The University of Chicago Press.
van der Ark, L. A., van der Palm, D. W., & Sijtsma, K. (2011). A latent class approach to estimating test-score reliability. Applied Psychological Measurement, 35(5), 380-392.
Woodhouse, B., & Jackson, P. H. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: II: A search procedure to locate the greatest lower bound. Psychometrika, 42(4), 579-591.
Yung, Y. F., Thissen, D., & McLeod, L. D. (1999). On the relationship between the higher-order factor model and the hierarchical factor model. Psychometrika, 64(2), 113-128.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ω_H: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1), 123-133.


[Line graph; the x-axis is the inter-item error correlation, and the plotted curves are α_TE, α_C, and ω_TE = ω_C.]

Figure 1. Influence of inter-item error correlation on ω and α

TE represents the tau-equivalent condition, and C represents the congeneric (but not tau-equivalent) condition. True score variances and error variances were fixed, and only error covariances were changed. Computations are described in Appendix 1.


[Flowchart. Reliability estimation begins at STEP 1: Unidimensional? If No → The multiple-factor model? (Yes → ω_m; No → α_s). If Yes → STEP 2: Tau-equivalent? (No → ω_u; Yes → α).]

Figure 2. A framework for choosing a reliability estimator


[Path diagrams for nine indicators. (a) Correlated factors model: X1–X9, each with a unique factor u1–u9, load on three correlated factors ξ1–ξ3. (b) Higher-order factor model: Y1–Y9 load on three first-order factors η1–η3 with disturbances ς1–ς3, which in turn load on a second-order factor ξ1. (c) Multiple-factor model: X1–X9 load on a general factor ξ1 and on group factors ξ2–ξ4.]

Figure 3. Three models of multidimensionality in SEM


[Path diagrams for three indicators X1–X3 with unique factors u1–u3 and a single common factor ξ1. (a) Tau-equivalent model: all three loadings are fixed at 1. (b) Congeneric model: the loadings are freely estimated, with the factor variance fixed at 1.]

Figure 4. The tau-equivalent model and the congeneric model in SEM


Table 1
Four variance-covariance matrices with different internal structures

A (A General Factor Only, Less Saturated)
     V1   V2   V3   V4   V5   V6   V7   V8
V1   1    .3   .3   .3   .3   .3   .3   .3
V2   .3   1    .3   .3   .3   .3   .3   .3
V3   .3   .3   1    .3   .3   .3   .3   .3
V4   .3   .3   .3   1    .3   .3   .3   .3
V5   .3   .3   .3   .3   1    .3   .3   .3
V6   .3   .3   .3   .3   .3   1    .3   .3
V7   .3   .3   .3   .3   .3   .3   1    .3
V8   .3   .3   .3   .3   .3   .3   .3   1
ρ_XX′ ≡ ω = α = ω_h = .7742, r̄_ij = .3
(σ²_X = 24.8, 1′cc′1 = 19.2, 1′AA′1 = 0)

B (A General Factor Only, More Saturated)
     V1   V2   V3   V4   V5   V6   V7   V8
V1   1    .7   .7   .7   .7   .7   .7   .7
V2   .7   1    .7   .7   .7   .7   .7   .7
V3   .7   .7   1    .7   .7   .7   .7   .7
V4   .7   .7   .7   1    .7   .7   .7   .7
V5   .7   .7   .7   .7   1    .7   .7   .7
V6   .7   .7   .7   .7   .7   1    .7   .7
V7   .7   .7   .7   .7   .7   .7   1    .7
V8   .7   .7   .7   .7   .7   .7   .7   1
ρ_XX′ ≡ ω = α = ω_h = .9492, r̄_ij = .7
(σ²_X = 47.2, 1′cc′1 = 44.8, 1′AA′1 = 0)

C (Two Group Factors Only)
     V1   V2   V3   V4   V5   V6   V7   V8
V1   1    .7   .7   .7   0    0    0    0
V2   .7   1    .7   .7   0    0    0    0
V3   .7   .7   1    .7   0    0    0    0
V4   .7   .7   .7   1    0    0    0    0
V5   0    0    0    0    1    .7   .7   .7
V6   0    0    0    0    .7   1    .7   .7
V7   0    0    0    0    .7   .7   1    .7
V8   0    0    0    0    .7   .7   .7   1
ρ_XX′ ≡ ω = .9032, α = .7742, ω_h = 0, r̄_ij = .3
(σ²_X = 24.8, 1′cc′1 = 0, 1′AA′1 = 22.4)

D (A General Factor and Two Group Factors)
     V1   V2   V3   V4   V5   V6   V7   V8
V1   1    .7   .7   .7   .3   .3   .3   .3
V2   .7   1    .7   .7   .3   .3   .3   .3
V3   .7   .7   1    .7   .3   .3   .3   .3
V4   .7   .7   .7   1    .3   .3   .3   .3
V5   .3   .3   .3   .3   1    .7   .7   .7
V6   .3   .3   .3   .3   .7   1    .7   .7
V7   .3   .3   .3   .3   .7   .7   1    .7
V8   .3   .3   .3   .3   .7   .7   .7   1
ρ_XX′ ≡ ω = .9302, α = .8771, ω_h = .5581, r̄_ij = .4714
(σ²_X = 34.4, 1′cc′1 = 19.2, 1′AA′1 = 12.8)


Table 2
An examination of the unidimensionality assumption

                          CFI    TLI    RMSEA    df      χ²      p
a. Unidimensional         .80    .74     .21     27    233.54   .00
b. Higher-order factor    .98    .98     .05     24     38.19   .03
c. Multiple-factor        .99    .98     .03     18     24.21   .14
Difference (a – c)                                9    209.32   .00
Difference (b – c)                                6     13.98   .02

Note. CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation.


Table 3
A computation of ω̂_m for a multidimensional data example

STEP 1: Sum the unique variances.

       σ̂²_u
X1     0.17
X2     0.17
X3     0.27
X4     0.25
X5     0.39
X6     0.52
X7     0.15
X8     0.50
X9     0.55
Σ      2.97

STEP 2: Sum the fitted/implied covariance matrix.

       X1     X2     X3     X4     X5     X6     X7     X8     X9
X1    1.00
X2    0.83   1.00
X3    0.78   0.78   1.00
X4    0.47   0.48   0.46   1.00
X5    0.46   0.47   0.45   0.67   1.00
X6    0.44   0.45   0.43   0.59   0.54   1.00
X7    0.44   0.45   0.43   0.34   0.34   0.32   1.00
X8    0.51   0.52   0.50   0.40   0.40   0.38   0.56   1.00
X9    0.41   0.42   0.40   0.32   0.32   0.30   0.60   0.45   1.00

2 × Σ(subdiagonal) + Σ(diagonal) = 43.20

STEP 3: Compute ω̂_m.

ω̂_m = 1 − 2.97 / 43.20 = .9312


Table 4
An examination of the tau-equivalency assumption

                      CFI     TLI    RMSEA    df     χ²      p
a. Tau-equivalent     .79     .77     .14      9    64.64   .00
b. Congeneric        1.00    1.00     .01      5     5.21   .39
Difference (a – b)                             4    59.43   .00

Note. CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation.


Table 5
A computation of ω̂_u for a congeneric data example

STEP 1: Calculate the square of the sum of the factor loadings and the sum of the unique variances.

       λ̂_i     σ̂²_ui
X1     0.93     5.70
X2     1.67     3.46
X3     2.15     0.87
X4     0.94     4.08
X5     1.20     3.82
Σ      6.89     B = 17.93

(Σλ̂_i)² = A = 47.53

STEP 2 (Optional): Sum the fitted/implied covariance matrix.

       X1      X2      X3      X4      X5
X1    6.57
X2    1.55    6.24
X3    2.00    3.58    5.48
X4    0.88    1.57    2.03    4.97
X5    1.12    2.01    2.59    1.13    5.27

Sum of all elements: C = 65.46

STEP 3: Compute ω̂_u.

ω̂_u = A / (A + B) = A / C = 1 − B / C = .7261


Appendix 1

The computation uses Lucke’s (2005, p. 117) compound symmetric item error covariance model, which requires errors to be equally correlated among items. The item error variances are θ², the item error correlations are γ, the number of items is k, the deviance from tau-equivalency for test i is δ_i, and the average of the factor loadings is λ̄. α and ω are shown in the following:

$$\alpha = \frac{k(k-1)(\bar{\lambda}^{2} + \gamma\theta^{2}) - \delta}{(k-1)\{k\bar{\lambda}^{2} + [1 + (k-1)\gamma]\theta^{2}\}}; \qquad \omega = \frac{k\bar{\lambda}^{2}}{k\bar{\lambda}^{2} + [1 + (k-1)\gamma]\theta^{2}}.$$

Note that α is a monotonically increasing function of γ because the γ term in its numerator, k(k−1)γθ², is greater than the γ term in its denominator, (k−1)(k−1)γθ², while ω is a monotonically decreasing function of γ. Note also that α and ω have the same value of kλ̄²/(kλ̄² + θ²) if δ = 0 (i.e., tau-equivalent) and γ = 0 (i.e., uncorrelated errors). We consider a set of two hypothetical tests (X_TE and X_C), each with k = 8 items, λ̄ = 3, and θ² = 9 but with different vectors of factor loadings. These conditions are similar to those of Lucke (2005, p. 119), except for k.
1) X_TE with λ_TE = [3, 3, 3, 3, 3, 3, 3, 3] so that δ_TE = 0.
2) X_C with λ_C = [1, 1, 1, 1, 7, 7, 7, 7] so that δ_C = 128.
The values of α and ω for our hypothetical data can be expressed as in the following:

$$\alpha = \frac{8 \times 7 \times (3^{2} + 9\gamma) - \delta}{7\{8 \times 3^{2} + [1 + 7\gamma] \times 9\}}; \qquad \omega = \frac{8 \times 3^{2}}{8 \times 3^{2} + [1 + 7\gamma] \times 9}.$$
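The following base-R sketch (our own illustration, not part of the original computations) evaluates these two expressions over a grid of γ values; it reproduces the pattern plotted in Figure 1, with α increasing and ω decreasing as the inter-item error correlation grows:

k <- 8; lam <- 3; theta2 <- 9                  # items, average loading, error variance
alpha_fun <- function(gamma, delta) {
  (k * (k - 1) * (lam^2 + gamma * theta2) - delta) /
    ((k - 1) * (k * lam^2 + (1 + (k - 1) * gamma) * theta2))
}
omega_fun <- function(gamma) {
  k * lam^2 / (k * lam^2 + (1 + (k - 1) * gamma) * theta2)
}
gamma <- seq(0, .5, by = .1)
cbind(gamma,
      alpha_TE = alpha_fun(gamma, 0),          # tau-equivalent condition: delta = 0
      alpha_C  = alpha_fun(gamma, 128),        # congeneric condition: delta = 128
      omega    = omega_fun(gamma))             # identical for both conditions

At γ = 0 and δ = 0, both functions return kλ̄²/(kλ̄² + θ²) = 72/81 ≈ .889, consistent with the equality noted above.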


Appendix 2

MULTIDIMENSIONAL DATA

DA NI=9 NO=213 MA=CM
CM SY !CM=KM
1
0.828 1
0.776 0.779 1
0.439 0.493 0.46 1
0.432 0.464 0.425 0.674 1
0.447 0.489 0.443 0.59 0.541 1
0.447 0.432 0.401 0.381 0.402 0.288 1
0.541 0.537 0.534 0.35 0.367 0.32 0.555 1
0.38 0.358 0.359 0.424 0.446 0.325 0.598 0.452 1
MO NX=9 NK=4 PH=FI !CODE FOR MULTIPLE-FACTOR MODEL
VA 1 PH 1 1 PH 2 2 PH 3 3 PH 4 4
FR LX 1 1 LX 2 1 LX 3 1 LX 4 1 LX 5 1 LX 6 1 LX 7 1 LX 8 1 LX 9 1
FR LX 1 2 LX 2 2 LX 3 2 LX 4 3 LX 5 3 LX 6 3 LX 7 4 LX 8 4 LX 9 4
OU ME=ML RS EF ND=4

CONGENERIC DATA

DA NI=5 NO=270
CM SY
6.57
1.67 6.24
2.06 3.56 5.48
0.71 1.79 1.98 4.97
0.63 1.94 2.62 1.25 5.27
MO NX=5 NK=1
VA 1 PH 1 1 !CODE FOR THE CONGENERIC MODEL
FR LX 1 1 LX 2 1 LX 3 1 LX 4 1 LX 5 1
!EQ LX 1 1 LX 2 1 LX 3 1 LX 4 1 LX 5 1 !DELETE THE FIRST ! IF TAU-EQ
OU ME=ML RS EF ND=4
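For readers who prefer to work entirely in R, the following is a sketch using the lavaan package (an illustration we add here under the assumption that lavaan is installed; it is not the code used for the analyses above). It fits the congeneric model to the same covariance matrix and computes ω̂_u:

library(lavaan)
S <- getCov("
  6.57
  1.67 6.24
  2.06 3.56 5.48
  0.71 1.79 1.98 4.97
  0.63 1.94 2.62 1.25 5.27", names = paste0("X", 1:5))
fit <- cfa("xi =~ X1 + X2 + X3 + X4 + X5",
           sample.cov = S, sample.nobs = 270,
           std.lv = TRUE)                      # fixes the factor variance at 1
est <- lavInspect(fit, "est")
A <- sum(est$lambda)^2                         # square of the sum of loading estimates
B <- sum(diag(est$theta))                      # sum of unique variance estimates
A / (A + B)                                    # omega_u, approximately .726

The std.lv = TRUE argument plays the role of the VA 1 PH 1 1 line in the LISREL code above, resolving scale indeterminacy by standardizing the latent variable rather than fixing a loading.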