PSYCHOMETRIKA, VOL. 66, NO. 3, 373-388, SEPTEMBER 2001

EFFECT SIZE, POWER, AND SAMPLE SIZE DETERMINATION FOR STRUCTURED MEANS MODELING AND MIMIC APPROACHES TO BETWEEN-GROUPS HYPOTHESIS TESTING OF MEANS ON A SINGLE LATENT CONSTRUCT

GREGORY R. HANCOCK

UNIVERSITY OF MARYLAND, COLLEGE PARK

While effect size estimates, post hoc power estimates, and a priori sample size determination are becoming a routine part of univariate analyses involving measured variables (e.g., ANOVA), such measures and methods have not been articulated for analyses involving latent means. The current article presents standardized effect size measures for latent mean differences inferred from both structured means modeling and MIMIC approaches to hypothesis testing about differences among means on a single latent construct. These measures are then related to post hoc power analysis, a priori sample size determination, and a relevant measure of construct reliability.

Key words: structural equation modeling; effect sizes; structured means modeling; MIMIC models; power analysis; construct reliability.

I wish to convey my appreciation to the reviewers and Associate Editor, whose suggestions extended and strengthened the article's content immensely, and to Ralph Mueller of The George Washington University for enhancing the clarity of its presentation. Requests for reprints should be sent to Gregory R. Hancock, 1230 Benjamin Building, University of Maryland, College Park, MD 20742-1115. E-Mail: [email protected]

Introduction

Researchers wishing to investigate group differences on a latent construct, rather than on a single measured variable (ANOVA) or on a composite (MANOVA), have two structural equation modeling (SEM) methods at their disposal: Sörbom's (1974) structured means modeling (SMM), and a derivative of multiple-indicator multiple-cause (MIMIC) models (Jöreskog & Goldberger, 1975; Muthén, 1989). SMM, a large sample technique, models the variables' mean structure along with their covariance structure in a manner facilitating inference about the populations' underlying construct means. The MIMIC approach, on the other hand, employs group code (e.g., dummy) predictors within a structural equation model, which in turn is fitted to a single set of data across all groups of interest. Generally requiring the estimation of fewer parameters, MIMIC methods might also allow for smaller samples (Muthén, 1989). (For a didactic treatment of these two approaches, see, e.g., Hancock, 1997.) Because the latent constructs along which group differences are assessed in both methods do not have an inherent metric, differences are generally discussed only in terms of their statistical significance, or perhaps relative to the metric of an assigned variance or of a variable chosen as the construct's scale indicator. The purpose of the current article is to present standardized measures of effect size for latent mean differences inferred from both SMM and MIMIC approaches, paralleling those customarily reported as part of ANOVA results. These measures in turn facilitate post hoc power analysis and sample size determination for latent means analysis, as detailed herein.

Effect Sizes and Structured Means Modeling

For a set of $p$ observed Y indicators of construct $\eta$, Y values in a single group may be expressed in a $p \times 1$ vector $\mathbf{y}$ as follows: $\mathbf{y} = \boldsymbol{\tau} + \boldsymbol{\Lambda}\eta + \boldsymbol{\varepsilon}$, where $\boldsymbol{\tau}$ is a $p \times 1$ vector of intercept values, $\boldsymbol{\Lambda}$ is a $p \times 1$ vector of $\lambda$ loadings, and $\boldsymbol{\varepsilon}$ is a $p \times 1$ vector of normal errors. Thus, the first moment vector $\boldsymbol{\mu} = \mathrm{E}[\mathbf{y}] = \boldsymbol{\tau} + \boldsymbol{\Lambda}\mathrm{E}[\eta] = \boldsymbol{\tau} + \boldsymbol{\Lambda}\kappa$, where $\kappa$ is the mean of


factor $\eta$; this reduces to $\mathrm{E}[\mathbf{y}] = \boldsymbol{\Lambda}\kappa$ if all variables can be assumed to contain no measurement bias (i.e., $\boldsymbol{\tau} = \mathbf{0}$). The second moment matrix, assuming $\eta$ and the errors to be independent, is $\mathrm{E}[(\mathbf{y}-\boldsymbol{\mu})(\mathbf{y}-\boldsymbol{\mu})'] = \boldsymbol{\Sigma} = \boldsymbol{\Lambda}\phi\boldsymbol{\Lambda}' + \boldsymbol{\Theta}$, where $\phi$ is the variance of $\eta$ and $\boldsymbol{\Theta}$ is the $p \times p$ covariance matrix of the errors in $\boldsymbol{\varepsilon}$. Groups' observed covariance and mean information for the variables in $\mathbf{y}$ are used in making inferences about latent means.

The Case of $J = 2$

With two groups, SMM offers a test of construct mean equivalence: $H_0: \kappa_1 = \kappa_2$. This test is often conducted under the assumption that observed variables have the same degree of bias (or none) across groups (i.e., $\boldsymbol{\tau}_1 = \boldsymbol{\tau}_2$), as well as the assumption that both groups' measurement models are tau-equivalent (i.e., $\boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2$). However, while SMM may typically proceed by constraining corresponding intercept and loading parameters to be equivalent across groups, Byrne, Shavelson, and Muthén (1989) suggested that SMM's invariance assumptions may be relaxed (within limits of model identification) and still yield a meaningful comparison of latent means. Examples of SMM include work by Kinnunen and Leskinen (1989) on teacher stress, work by Aiken, Stein, and Bentler (1994) on treatment for drug addiction, and work by Dukes, Ullman, and Stein (1995) evaluating drug abuse resistance education with elementary school children.

When conducting univariate mean comparisons with two groups, the customary index of effect size is the standardized difference between two population means, $d = |\mu_1 - \mu_2|/\sigma$, where $\mu_1$ and $\mu_2$ are population means and $\sigma$ is the standard deviation of scores, typically assumed to be homogeneous across both populations (see, e.g., Cohen, 1988). The $d$ value is then estimated from sample data as $\hat{d} = |T_1 - T_2|/s$, where $T_1$ and $T_2$ are sample means and $s$ is the square root of the within-groups pooled variance estimate, $s = \{[(n_1-1)s_1^2 + (n_2-1)s_2^2]/(n_1+n_2-2)\}^{1/2}$.

Turning to SMM, then, a population-level analog to this measure would be $d = |\kappa_1 - \kappa_2|/\phi^{1/2}$, where $\kappa_1$ and $\kappa_2$ are population means on the $\eta$ construct and $\phi$ is the variance of $\eta$ scores, assumed to be homogeneous across both populations. The $d$ value may then be estimated from sample data as $\hat{d} = |\hat\kappa_1 - \hat\kappa_2|/\hat\phi^{1/2}$, where $\hat\kappa_1$ and $\hat\kappa_2$ are sample means on the $\eta$ construct and $\hat\phi$ is a within-groups pooled variance estimate for scores on $\eta$. A value for $\hat\phi$ may be determined as $\hat\phi = (n_1\hat\phi_1 + n_2\hat\phi_2)/(n_1 + n_2)$, where $\hat\phi_1$ and $\hat\phi_2$ are estimated variances of $\eta$ from Group 1 and Group 2, respectively; the weights in $\hat\phi$ are sample sizes rather than degrees of freedom because SEM typically assumes large sample properties in its estimation procedures.

It should be mentioned that a similar approach to standardized effect size estimation in SMM has been employed in the aforementioned applied research by Aiken et al. (1994) and Dukes et al. (1995), with both drawing from meta-analysis work by Hedges and Olkin (1985). The index for measured variables proposed by Hedges and Olkin (1985) seeks to eliminate bias in $\hat{d}$ as an estimate of $d$ using a multiplicative adjustment of $\{1 - 3/[4(n_1+n_2)-9]\}$. Practically speaking, however, even a fairly small total sample size of $n_1 + n_2 = 100$ would yield less than a 1% adjustment to $\hat{d}$. Further, Hedges and Olkin (1985) went on to discuss that a maximum likelihood estimate of $d$ would technically require a multiplicative degrees of freedom adjustment to $\hat{d}$ of $[(n_1+n_2)/(n_1+n_2-2)]^{1/2}$, although this too effectively vanishes with large samples. In the current article, as SMM is a large sample technique, such minute adjustments will be ignored and congruity with measures presented by Cohen (1988) will be preserved.

A few additional practical points are worth noting here as well; a numerical sketch of the $\hat{d}$ computation follows this discussion. First, the limits of identification in the means portion of a two-group means model require that one sample serve as a reference group with its latent mean constrained, customarily to zero (e.g., $\kappa_1 = 0$); this reduces $\hat{d}$ to $|\hat\kappa_2|/\hat\phi^{1/2}$.
Second, some SEM software packages (e.g., EQS; Bentler, 1997) model the latent means as paths from a unit predictor vector. This makes the variance of $\eta$ in each sample reside entirely in the variance of a construct disturbance term $\zeta$ (i.e., $\psi_\zeta$), which may or may not have


been constrained to be equal across groups. Third, one need not entirely subscribe to the notion of homogeneity of construct variance across populations; however, standardizing effect sizes requires a common yardstick, and the one chosen here derives from a weighted average of construct variances. Finally, the $d$ measure may be interpreted just as in univariate analyses. For example, a value of $\hat{d} = .25$ would indicate that the two populations' means on the latent construct $\eta$ are estimated to be one-quarter of an error-free standard deviation apart. Still, Cohen's (1988) social science guidelines of small, medium, and large effects being approximated by $d = .2$, $.5$, and $.8$, respectively, should be re-evaluated in the context of the latent standardized effect size measure derived here before endorsing or modifying their application. Cohen (1988) acknowledged that his work focused only on observed scores rather than true scores (i.e., manifest variables rather than latent), noting explicitly (pp. 535-537) the effect of measurement error on measured variable and latent variable effect size estimates. The issue of reliability (specifically construct reliability) and its impact on standardized effect size interpretation will be addressed more thoroughly later in the article.
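Before turning to the general $J$-group case, the following is a minimal sketch of the two-group $\hat{d}$ computation just described. The parameter estimates below are purely hypothetical; in practice $\hat\kappa_2$ and the $\hat\phi_j$ would be taken from a fitted two-group structured means model.

```python
# A sketch of the two-group SMM standardized latent effect size, using
# hypothetical estimates (not from any example in the article).

n1, n2 = 150, 150                      # group sample sizes (hypothetical)
kappa2_hat = 0.35                      # estimated latent mean of Group 2 (Group 1 fixed at 0)
phi1_hat, phi2_hat = 1.10, 0.95        # estimated construct variances per group

# Sample-size-weighted pooled construct variance, as in the text
phi_hat = (n1 * phi1_hat + n2 * phi2_hat) / (n1 + n2)

# Standardized latent effect size: d-hat = |kappa2_hat - 0| / sqrt(phi_hat)
d_hat = abs(kappa2_hat) / phi_hat ** 0.5
print(round(d_hat, 3))                 # about 0.35 / 1.012 = 0.346
```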

The General Case of $J \ge 2$

Although applied examples and even technical explications of SMM with more than two groups are scarce, certainly general effect size measures for $J$-group SMM analyses may be offered. The null hypothesis tested is $H_0: \kappa_1 = \cdots = \kappa_J$, which may be assessed under customary assumptions of intercept and loading invariance (i.e., $\boldsymbol{\tau}_1 = \cdots = \boldsymbol{\tau}_J$ and $\boldsymbol{\Lambda}_1 = \cdots = \boldsymbol{\Lambda}_J$, respectively) or under less restrictive assumptions. In the ANOVA arena the common index of standardized effect size is $f$, the standardized standard deviation of population means (Cohen, 1988). That is, $f = \sigma_m/\sigma$, where $\sigma$ is the standard deviation of scores (treated as homogeneous across all $J$ populations) and $\sigma_m$ is the standard deviation of population means

$$\sigma_m = \left[\sum_{j=1}^{J}(\mu_j - \mu_.)^2/J\right]^{1/2},$$

with $\mu_.$ being the grand mean of scores in all $J$ populations combined. The $f$ value is estimated from sample data as $\hat{f} = s_m/s$, where $s$ is the square root of the estimate of the within-groups variance pooled across all $J$ groups,

$$s^2 = \sum_{j=1}^{J}(n_j - 1)s_j^2 \Big/ \sum_{j=1}^{J}(n_j - 1),$$

and $s_m$ is the square root of the weighted variance of sample means,

$$s_m = \left[\sum_{j=1}^{J} n_j(T_j - T_.)^2 \Big/ \sum_{j=1}^{J} n_j\right]^{1/2},$$

with $T_.$ being the grand mean of all scores across the $J$ samples. For SMM, then, a population analog to this measure would be $f = \sigma_\kappa/\phi^{1/2}$, where $\sigma_\kappa$ is the standard deviation of population means on the construct $\eta$,

$$\sigma_\kappa = \left[\sum_{j=1}^{J}(\kappa_j - \kappa_.)^2/J\right]^{1/2},$$

with $\kappa_.$ being the grand mean of construct scores in all $J$ populations combined, and $\phi$ is the variance of scores on $\eta$ (treated as homogeneous across all $J$ populations). The $f$ value may then be estimated from sample data as


$\hat{f} = s_\kappa/\hat\phi^{1/2}$, where $\hat\phi$ is a within-groups pooled variance estimate for scores on $\eta$, and $s_\kappa$ is the standard deviation of sample means on the $\eta$ construct,

$$s_\kappa = \left[\sum_{j=1}^{J} n_j(\hat\kappa_j - \hat\kappa_.)^2 \Big/ \sum_{j=1}^{J} n_j\right]^{1/2},$$

with $\hat\kappa_.$ being the grand mean of construct scores in all $J$ samples combined (i.e., $\hat\kappa_. = \sum_{j=1}^{J} n_j\hat\kappa_j \big/ \sum_{j=1}^{J} n_j$).

Similar to the two-group case, the value for $\hat\phi$ may be determined as

$$\hat\phi = \sum_{j=1}^{J} n_j\hat\phi_j \Big/ \sum_{j=1}^{J} n_j,$$

where $\hat\phi_j$ is the estimated variance of $\eta$ from the $j$-th group.

Again, a few points are noteworthy. First, as in the two-group case, one sample must serve as a reference group with its latent mean constrained, customarily to zero. This will in no way affect the estimation of $\hat{f}$; a zero is simply used as the construct mean for the group treated as the reference group. Second, software packages modeling latent means as paths from a unit predictor vector will have the variance of $\eta$ represented by the variance of a construct disturbance term in each group. Third, the $f$ measure may be interpreted as in univariate analyses, but differently from the $d$ measure. For example, a value of $\hat{f} = .25$ would roughly indicate that the $J$ populations' means are estimated to deviate one-quarter of an error-free standard deviation, on average, from the grand population mean. Note that if $f$ were used in the $J = 2$ case it would be half the value of $d$ when $n_1 = n_2$, and more generally in the $J = 2$ case $f = [(n_1 n_2)^{1/2}/(n_1 + n_2)]d$. Cohen (1988) offered interpretive guidelines for $f$: small, medium, and large effects are $f = .1$, $.25$, and $.4$, respectively. Note again, however, that these are specifically for measured variables; guidelines for latent variables will be addressed later in the article.

Effect Sizes and MIMIC Approaches

Like SMM, MIMIC approaches for testing hypotheses about latent means also start by positing a factor measurement model $\mathbf{y} = \boldsymbol{\tau} + \boldsymbol{\Lambda}\eta + \boldsymbol{\varepsilon}$. However, whereas SMM fits the model simultaneously to the $J$ groups' separate data sets, the MIMIC approach combines data from all groups into a single set and incorporates into the structural model a $(J-1) \times 1$ vector $\mathbf{x}$ of dichotomous (e.g., dummy, contrast, or effect coded) predictors reflecting group membership. Specifically, $\eta = \boldsymbol{\gamma}\mathbf{x} + \zeta$, where $\boldsymbol{\gamma}$ is a $1 \times (J-1)$ row vector containing paths representing the predictive influence of the dichotomous variables in $\mathbf{x}$ on the construct $\eta$, and $\zeta$ is the construct residual unexplained by the group code variables in $\mathbf{x}$. Nonzero parameters in $\boldsymbol{\gamma}$ are indicative of differences among population means on the construct $\eta$. Unlike SMM, where flexibility exists to release bias and tau-equivalence constraints, a MIMIC approach results in only one model for the combined data from all groups and thus implicitly assumes that all sources of bias and (co)variation among observed variables are equivalent across groups. The MIMIC approach is thus a special case of SMM, and identical when MIMIC's more stringent invariance assumptions hold (Muthén, 1989). The extent to which MIMIC approaches are robust to violations of these implicit assumptions remains to be investigated within the SEM literature, and is beyond the


scope of this article. An example of a MIMIC model for the purpose of inferring latent mean differences may be found in work by Gallo, Anthony, and Muthén (1994).

The Case of $J = 2$

With two groups, the MIMIC approach tests the null hypothesis of construct mean equivalence ($H_0: \kappa_1 = \kappa_2$) utilizing a single dichotomous predictor $X$ impinging upon $\eta$ (making the second M in the MIMIC acronym a technical misnomer here). Given that $X$ utilizes codes of 0 and 1 for Group 1 and Group 2, respectively, the lone $\gamma$ parameter in $\boldsymbol{\gamma}$ represents the difference between the two groups' construct means. For Group 1, $\mathrm{E}[\eta] = \kappa_1 = \gamma\mathrm{E}[X] = \gamma(0) = 0$; for Group 2, $\mathrm{E}[\eta] = \kappa_2 = \gamma\mathrm{E}[X] = \gamma(1) = \gamma$; thus, $\gamma = \kappa_2 - \kappa_1$. Therefore, in the MIMIC approach with two groups the null hypothesis is more simply expressed as $H_0: \gamma = 0$. Also, because $\gamma = \kappa_2 - \kappa_1$, $\gamma$ already represents an unstandardized effect size. To facilitate standardization, it must be divided by the within-groups population standard deviation of the construct $\eta$. The residual construct variance captured in the variance of the disturbance $\zeta$, $\psi_\zeta$, represents that variance in $\eta$ that is unexplained by the dichotomous predictor variable $X$, and hence $\psi_\zeta$ is the within-groups variance pooled across both groups. Thus, in the two-group MIMIC model the standardized effect size is $d = |\gamma|/\psi_\zeta^{1/2}$, and its sample-based estimate is derived from the model's parameter estimates: $\hat{d} = |\hat\gamma|/\hat\psi_\zeta^{1/2}$. This measure is interpreted just as in univariate analyses and as the SMM measure $d$ proposed above.
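In the MIMIC case the arithmetic is especially simple, since $\hat\gamma$ and $\hat\psi_\zeta$ come directly from the fitted model. A minimal sketch with hypothetical estimates:

```python
# Two-group MIMIC standardized effect size: d_hat = |gamma_hat| / sqrt(psi_zeta_hat).
# Both estimates would come from a fitted MIMIC model; these values are hypothetical.

gamma_hat = 0.42       # estimated path from the 0/1 group code X to the construct
psi_zeta_hat = 4.41    # estimated disturbance (within-groups construct) variance

d_hat = abs(gamma_hat) / psi_zeta_hat ** 0.5
print(round(d_hat, 2))  # 0.42 / 2.1 = 0.20
```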

The General Case of $J \ge 2$

With $J \ge 2$ groups, the MIMIC approach utilizes the vector $\mathbf{x}$ of $J - 1$ dichotomous predictors capturing group membership information to test construct mean equivalence: $H_0: \kappa_1 = \cdots = \kappa_J$. A general effect size measure, applicable for two or more groups, may be derived using (in part) the parameters in $\boldsymbol{\gamma}$ (i.e., the path values from the group code predictors in $\mathbf{x}$ to $\eta$). However, a simpler approach to a general effect size measure is to consult the standardized parameter value of the path to $\eta$ from its disturbance $\zeta$. Whereas in the unstandardized model this path is customarily fixed to 1, in the standardized model it conveys information regarding the amount of variance in $\eta$ explained by the dichotomous predictors in $\mathbf{x}$, that is, regarding the magnitude of differences among the $J$ populations' construct means. Specifically, if the standardized path from $\zeta$ to $\eta$ is denoted as $\psi$, then the quantity $1 - \psi^2$ is the proportion of variance in $\eta$ that is between-group variability (i.e., an $R^2$ value). As shown by, for example, Cohen (1988), the standardized effect size $f$ may be expressed as $f = [R^2/(1 - R^2)]^{1/2}$. For the population latent mean differences in a MIMIC model, this standardized effect size measure translates into $f = [(1 - \psi^2)/\psi^2]^{1/2}$, which is estimated by the quantity $\hat{f} = [(1 - \hat\psi^2)/\hat\psi^2]^{1/2}$ using the sample parameter estimate $\hat\psi$ from the standardized MIMIC model solution. This measure is interpreted just as in univariate analyses (i.e., ANOVA) and as with the SMM $f$ measure proposed above; also, applied to the case of $J = 2$ groups with $n_1 = n_2$, the value of $f$ would be half the value of $d$.
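Likewise, $\hat{f}$ requires only the standardized disturbance path. The following sketch uses a hypothetical value of $\hat\psi = .995$ (the same value that appears in the omnibus power example later in the article):

```python
# J-group MIMIC standardized effect size from the standardized disturbance path.
# psi_hat is the standardized path from the disturbance zeta to the construct eta.

psi_hat = 0.995                                   # hypothetical standardized disturbance path
r_squared = 1.0 - psi_hat ** 2                    # proportion of construct variance between groups
f_hat = (r_squared / (1.0 - r_squared)) ** 0.5    # f = sqrt(R^2 / (1 - R^2))
print(round(r_squared, 4), round(f_hat, 3))       # 0.01 and about 0.100
```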

Effect Sizes, Construct Reliability, and Power

Foundations

While power analysis methods have emerged for assessing entire models (e.g., MacCallum, Browne, & Sugawara, 1996), the focus of the current article is on power associated with tests of specific model parameters (i.e., in the mean structure). The standardized effect size measures developed in the previous sections are related to the statistical power to reject the null hypothesis of equivalent latent construct population means. Because MIMIC is a special case of SMM (and identical when MIMIC's more stringent invariance assumptions hold), derivations will be presented using the two-moment framework for the more general method of SMM. Standard assumptions of conditional multivariate normality and proper model specification will be made. To clarify the latter, the derivations will be valid for the MIMIC approach under total second moment invariance across populations, and under intercept invariance or under noninvariance that has been properly accommodated through additional paths from the group code predictors directly to the indicator variables. For the SMM approach the derivations will be sufficiently general so as to be valid under less restrictive invariance assumptions; further simplified derivations will also be offered, holding under more restrictive SMM invariance conditions. Analytical and/or empirical extensions of these derivations to specific noninvariant conditions are left to future endeavors.

In SMM, the multisample discrepancy function $G$ across $J$ groups may be expressed generically as

$$G = \sum_{j=1}^{J}(n_j/N)F_j, \quad \text{where } N = \sum_{j=1}^{J} n_j$$

and $F_j$ is the joint first and second moment fit function for the $j$-th group. Specifically, for maximum likelihood estimation,

$$F_j = [\ln|\hat{\boldsymbol\Sigma}_j| + \mathrm{tr}(\mathbf{S}_j\hat{\boldsymbol\Sigma}_j^{-1}) - \ln|\mathbf{S}_j| - p] + (\mathbf{m}_j - \hat{\boldsymbol\mu}_j)'\hat{\boldsymbol\Sigma}_j^{-1}(\mathbf{m}_j - \hat{\boldsymbol\mu}_j),$$

where for the $j$-th group $\mathbf{S}_j$ is the observed covariance matrix, $\hat{\boldsymbol\Sigma}_j$ is the model-implied covariance matrix based on optimum parameter estimates, $p$ is the number of indicator variables, $\mathbf{m}_j$ is the vector of observed sample means of the indicator variables, and $\hat{\boldsymbol\mu}_j$ is the model-implied mean vector. Given that $\mathrm{E}(\mathbf{y}_j) = \boldsymbol\mu_j = \boldsymbol\tau_j + \boldsymbol\Lambda_j\kappa_j$, the mean structure portion of $F_j$ could be expressed as $(\mathbf{m}_j - \hat{\boldsymbol\tau}_j - \hat{\boldsymbol\Lambda}_j\hat\kappa_j)'\hat{\boldsymbol\Sigma}_j^{-1}(\mathbf{m}_j - \hat{\boldsymbol\tau}_j - \hat{\boldsymbol\Lambda}_j\hat\kappa_j)$. Thus, when using samples to fit the properly specified covariance and mean structure model (with only that mean constraint necessary for identification), the model test statistic would be

$$T_1 = (N-1)G = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)\left\{[\ln|\hat{\boldsymbol\Sigma}_j| + \mathrm{tr}(\mathbf{S}_j\hat{\boldsymbol\Sigma}_j^{-1}) - \ln|\mathbf{S}_j| - p] + (\mathbf{m}_j - \hat{\boldsymbol\tau}_j - \hat{\boldsymbol\Lambda}_j\hat\kappa_j)'\hat{\boldsymbol\Sigma}_j^{-1}(\mathbf{m}_j - \hat{\boldsymbol\tau}_j - \hat{\boldsymbol\Lambda}_j\hat\kappa_j)\right\},$$

where for large samples $T_1 \sim \chi^2(\nu_1; \lambda_1)$ with $\nu_1$ degrees of freedom and noncentrality parameter $\lambda_1 = 0$ under standard distributional assumptions.

Now imagine constraining $\kappa_j = \kappa_.$ for $j = 1$ to $J$, where

$$\kappa_. = \sum_{j=1}^{J} n_j\kappa_j/N;$$


doing so will introduce badness of fit into the multisample mean structure if the null hypothesis of equality of construct means is false. For samples, values of the model test statistic would be

$$T_0 = (N-1)G_0 = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)\left\{[\ln|\hat{\boldsymbol\Sigma}_j| + \mathrm{tr}(\mathbf{S}_j\hat{\boldsymbol\Sigma}_j^{-1}) - \ln|\mathbf{S}_j| - p] + (\mathbf{m}_j - \hat{\boldsymbol\tau}_j - \hat{\boldsymbol\Lambda}_j\hat\kappa_.)'\hat{\boldsymbol\Sigma}_j^{-1}(\mathbf{m}_j - \hat{\boldsymbol\tau}_j - \hat{\boldsymbol\Lambda}_j\hat\kappa_.)\right\},$$

where under large sample conditions and multivariate normality $T_0 \sim \chi^2(\nu_0; \lambda_0)$ with degrees of freedom $\nu_0$ and noncentrality parameter $\lambda_0$. The expected value of the noncentrality parameter $\lambda_0$ is $(N-1)g_0$, where $g_0$ is the multisample fit function value arising if all correct population parameters (but with the $\kappa_.$-constrained latent means) were substituted into the fit function:

$$g_0 = \sum_{j=1}^{J}\left(\frac{n_j}{N}\right)\left\{[\ln|\boldsymbol\Sigma_j| + \mathrm{tr}(\boldsymbol\Sigma_j\boldsymbol\Sigma_j^{-1}) - \ln|\boldsymbol\Sigma_j| - p] + (\boldsymbol\mu_j - \boldsymbol\tau_j - \boldsymbol\Lambda_j\kappa_.)'\boldsymbol\Sigma_j^{-1}(\boldsymbol\mu_j - \boldsymbol\tau_j - \boldsymbol\Lambda_j\kappa_.)\right\}.$$

This would reduce to

$$g_0 = \sum_{j=1}^{J}\left(\frac{n_j}{N}\right)(\boldsymbol\mu_j - \boldsymbol\tau_j - \boldsymbol\Lambda_j\kappa_.)'\boldsymbol\Sigma_j^{-1}(\boldsymbol\mu_j - \boldsymbol\tau_j - \boldsymbol\Lambda_j\kappa_.)$$

because the covariance structure is still correctly specified. From the previous two paragraphs, then, when fitting sample data to models with free and constrained values of $\kappa_j$, the difference between model fit test statistics $T_0 - T_1$ would be expected to be distributed (under multivariate normality) as $\chi^2(\nu_0 - \nu_1; \lambda_0 - \lambda_1)$, where $\nu_0 - \nu_1 = J - 1$ degrees of freedom and noncentrality

$$\lambda_0 - \lambda_1 = (N-1)g_0 - 0 = (N-1)g_0,$$

just as for the $\kappa_.$-constrained model. This in turn implies that the power to reject the omnibus null hypothesis of equality of $J$ latent means at a chosen $\alpha$ level may be expressed as

$$\Pr[(T_0 - T_1) > {}_{(1-\alpha)}\chi^2(J-1; 0)],$$

where ${}_{(1-\alpha)}\chi^2(J-1; 0)$ is the $\alpha$-level critical value from the upper tail of the central $\chi^2$ distribution with $J-1$ degrees of freedom; this power is simply the area of the $\chi^2(J-1; \lambda_0)$ distribution exceeding the critical value from the corresponding central distribution. That said, the practical question is how this might be used to facilitate post hoc power analysis and a priori sample size determination. To address these issues first requires a greater understanding of the noncentrality parameter $\lambda_0$ and its component parts. Recall that under multivariate normality the noncentrality parameter

$$\lambda_0 = (N-1)g_0 = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)[(\boldsymbol\mu_j - \boldsymbol\tau_j - \boldsymbol\Lambda_j\kappa_.)'\boldsymbol\Sigma_j^{-1}(\boldsymbol\mu_j - \boldsymbol\tau_j - \boldsymbol\Lambda_j\kappa_.)].$$

The previous relation $\boldsymbol\mu_j = \boldsymbol\tau_j + \boldsymbol\Lambda_j\kappa_j$ implies that $\boldsymbol\mu_j - \boldsymbol\tau_j = \boldsymbol\Lambda_j\kappa_j$, where $\kappa_j$ is the true value of the unconstrained latent mean in the $j$-th population; thus, the noncentrality parameter $\lambda_0$ may be expressed as

$$\lambda_0 = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)(\kappa_j - \kappa_.)^2[\boldsymbol\Lambda_j'\boldsymbol\Sigma_j^{-1}\boldsymbol\Lambda_j].$$

The value of $\kappa_j - \kappa_.$, representing the difference between the $j$-th population's latent mean and that of the combined population, is still in the metric assigned to the associated construct $\eta$. To standardize, the construct variance $\phi$ (assuming homogeneity across populations) will be incorporated into the noncentrality expression:

$$\lambda_0 = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)[(\kappa_j - \kappa_.)^2/\phi][\phi\boldsymbol\Lambda_j'\boldsymbol\Sigma_j^{-1}\boldsymbol\Lambda_j] = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)[(\kappa_j - \kappa_.)^2/\phi]H_j,$$

where the coefficient $H_j = \phi\boldsymbol\Lambda_j'\boldsymbol\Sigma_j^{-1}\boldsymbol\Lambda_j$. Much more about this coefficient, which is a useful index of construct reliability, appears in the next section. To relate these results to the standardized latent effect size indices $d$ and $f$ requires the further assumption that the $H_j$ coefficients are invariant across populations. Because $H_j = \phi\boldsymbol\Lambda_j'\boldsymbol\Sigma_j^{-1}\boldsymbol\Lambda_j$ (where $\boldsymbol\Sigma_j = \boldsymbol\Lambda_j\phi\boldsymbol\Lambda_j' + \boldsymbol\Theta_j$), such invariance would occur if corresponding loadings and error (co)variances were equal across groups (in addition to $\phi_j = \phi$ for $j = 1$ to $J$); however, other specific combinations of noninvariant conditions could yield invariance of $H_j$ coefficients as well. The tolerability of the assumption of homogeneity of $H_j$ coefficients will be addressed in the next section. For now, if $H_j = H$ for $j = 1$ to $J$, then

$$\lambda_0 = (N-1)\sum_{j=1}^{J}\left(\frac{n_j}{N}\right)[(\kappa_j - \kappa_.)^2/\phi]H = (N-1)f^2H,$$

where $f$ is the previously defined $J$-group standardized latent effect size measure. For the simpler two-group case, as noted previously, $f = [(n_1 n_2)^{1/2}/(n_1+n_2)]d$; thus,

$$\lambda_0 = (N-1)[n_1 n_2/N^2]d^2H.$$
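The identity $\lambda_0 = (N-1)f^2H$ can be checked numerically by computing $g_0$ directly from the population mean structure of a small artificial example. The sketch below, assuming numpy and purely hypothetical population loadings, error variances, and latent means (with equal group sizes so that the weighted and unweighted forms of $\sigma_\kappa$ coincide), compares the fit-function-based noncentrality with $(N-1)f^2H$.

```python
# Numerical check that (N-1)*g0 equals (N-1)*f^2*H for a properly specified model,
# using hypothetical population values (not from the article's examples).
import numpy as np

lam = np.array([0.8, 0.7, 0.6])          # factor loadings (Lambda), invariant across groups
theta = np.diag([0.36, 0.51, 0.64])      # error variances (Theta)
phi = 1.0                                # construct variance
kappa = np.array([0.0, 0.3, 0.5])        # population latent means for J = 3 groups
n = np.array([200, 200, 200])            # equal group sizes
N = n.sum()

sigma = phi * np.outer(lam, lam) + theta           # Sigma = Lambda*phi*Lambda' + Theta
H = phi * lam @ np.linalg.solve(sigma, lam)        # H = phi * Lambda' Sigma^{-1} Lambda

kappa_dot = (n * kappa).sum() / N                  # weighted grand latent mean
# g0 reduces to its mean-structure part because the covariance structure is correct:
g0 = sum((n[j] / N) * (kappa[j] - kappa_dot) ** 2 * (lam @ np.linalg.solve(sigma, lam))
         for j in range(len(n)))

f = np.sqrt(((n / N) * (kappa - kappa_dot) ** 2).sum() / phi)  # standardized latent f
print(np.isclose((N - 1) * g0, (N - 1) * f ** 2 * H))          # True
```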

Coefficient H: A Measure of Construct Reliability

Before addressing post hoc power analysis and a priori sample size determination, the coefficient $H_j = \phi\boldsymbol\Lambda_j'\boldsymbol\Sigma_j^{-1}\boldsymbol\Lambda_j$ deserves greater attention. To make it more intuitive, it may first be translated into unit-free components. Because for the $j$-th population $\boldsymbol\Sigma_j = \mathbf{D}_j\mathbf{P}_j\mathbf{D}_j$, where $\mathbf{D}_j$ is a diagonal matrix containing the indicator variables' standard deviations and $\mathbf{P}_j$ is the population correlation matrix relating those indicator variables,

$$H_j = \phi\boldsymbol\Lambda_j'(\mathbf{D}_j\mathbf{P}_j\mathbf{D}_j)^{-1}\boldsymbol\Lambda_j = \phi\boldsymbol\Lambda_j'\mathbf{D}_j^{-1}\mathbf{P}_j^{-1}\mathbf{D}_j^{-1}\boldsymbol\Lambda_j = (\phi^{1/2}\mathbf{D}_j^{-1}\boldsymbol\Lambda_j)'\mathbf{P}_j^{-1}(\phi^{1/2}\mathbf{D}_j^{-1}\boldsymbol\Lambda_j).$$

The latter expressions in parentheses simply form a column vector of the indicator variables' standardized loadings on the construct $\eta$ for the $j$-th population (again, assuming homogeneity of $\phi_j$), referred to hereafter as $\mathbf{L}_j$. Thus, $H_j = \mathbf{L}_j'\mathbf{P}_j^{-1}\mathbf{L}_j$; and because $\mathbf{L}_j\mathbf{L}_j'$ represents the portion of $\mathbf{P}_j$ explained by the single factor model, $H_j$ conveys information about construct


reliability (as discussed later). In fact, as shown in Appendix A, when at least one standardized loading is nonzero, $H_j$ can be simplified to

$$H_j = 1\Big/\left[1 + 1\Big/\sum_{i=1}^{p}\frac{\ell_{ij}^2}{1-\ell_{ij}^2}\right],$$

where $\ell_{ij}$ is the standardized loading of the $i$-th indicator variable on the construct $\eta$ within the $j$-th population. Note that the term $\ell_{ij}^2/(1-\ell_{ij}^2)$ represents a ratio of the proportion of variance in the $i$-th variable explained by the construct (i.e., the variable's reliability) to the proportion unexplained, thereby making $H_j$ an aggregate function of such information across the construct's $p$ indicator variables within the $j$-th population. In fact, as demonstrated in Appendix B, coefficient $H_j$ equals the proportion of variability in $\eta$ explainable by its indicator variables. Note also that measures equivalent to this coefficient have arisen in other contexts as well (alpha factor analysis, Bentler, 1968; interrater reliability, Drewes, 2000; Li, Rosenthal, & Rubin, 1996).

The maximum value for coefficient $H_j$ is 1, occurring when a single standardized loading is 1 or $-1$ (note that if two standardized loadings were perfect, this would imply two completely collinear variables and thus $\mathbf{P}_j$ would be singular). Readers can easily demonstrate that $H_j \ge \max(\ell_{ij}^2)$ for $i = 1$ to $p$, meaning that $H_j$ will never be smaller than the reliability ($\ell^2$) of the best indicator variable. As an example, for three standardized loadings of .90, .50, and .50, the value of $H_j$ will be .8314 (just over $.90^2 = .8100$). Also, as the number of indicators increases, so does $H_j$; for factors with $p = 3$, 4, 5, 6, and 7 indicators all with $\ell_{ij} = .70$, the values of $H_j$ would be .7424, .7935, .8277, .8522, and .8706, respectively. These properties of $H_j$ have intuitive appeal in that, as latent means modeling approaches employ multiple indicators, they should never be worse than using the best variable alone. To the extent that $H_j$ draws information from the additional indicators, the latent procedures should make even further gains in distributional noncentrality, and in turn power. (A brief numerical sketch of these properties appears at the end of this section.)

For purposes of testing means, we should also recognize the consonance of coefficient $H$ with prior reliability discussions. Cohen (1988, p. 536), drawing from Cleary and Linn (1969), noted that for a measured variable the standardized effect size $ES = (ES^*)(r_{YY})^{1/2}$, where $ES^*$ is the error-free (i.e., latent) standardized effect size measure ($d$ or $f$) and $r_{YY}$ is the reliability of a single measured variable Y. Squaring this expression yields $ES^2 = (ES^*)^2(r_{YY})$, precisely parallel to the expression $f^2H$ occurring in the previous noncentrality parameter derivation. Coefficient $H$, just as $r_{YY}$, attempts to convey how much dampening the true standardized effect size experiences as a result of measurement error in its manifest reflections.

Further, coefficient $H$ can help in evaluating Cohen's (1988) interpretive guidelines for measured variable effect sizes as applied to latent variables. For the two-group case with measured variables, .2, .5, and .8 are common standards for small, medium, and large standardized effects ($d$), respectively; for the $J$-group case the corresponding $f$ values are .1, .25, and .4. However, given that $ES = (ES^*)(r_{YY})^{1/2}$, the troublesome implication is that any standard for gauging $ES$ could correspond to different underlying latent differences, depending upon the quality of the measured variable chosen to operationalize the construct of interest.
And while arguments have been advanced that latent units are not a natural or stable metric for effect sizes (e.g., Maxwell, 1980), such a position applies mainly to the case of a single dependent variable and would seem less defensible with the enhanced stability arising from the introduction of multiple indicators of a latent construct. The latent effect $ES^*$ is, after all, the theoretical constant in the equation, and the stability of its estimation using latent mean techniques (captured in part by $H$) would seem reasonable given at least one high quality indicator among the set of $p$. So what standards should be used for gauging standardized latent effect sizes? Given that the measured variable standards allow for measurement error, the error-free analogs should be somewhat higher. But how much higher? With a single outcome variable, Nunnally and Bernstein (1994) recommended that for researchers "concerned with ... mean differences among experimental treatments ... a reliability of .80 is adequate" (p. 265). This would translate into stan-


dards for the latent $d$ of $.2/(.80)^{1/2}$, $.5/(.80)^{1/2}$, and $.8/(.80)^{1/2}$, or roughly .23, .56, and .89 for small, medium, and large standardized latent effects, respectively; for $f$ they would be .11, .28, and .45. Obviously, lower reliabilities would translate into even higher standards, while higher reliabilities would yield values even closer to Cohen's measured variable standards. Practically speaking, though, the small difference between these values and Cohen's will not likely result in a change in how one describes an effect, measured or latent. Further, one may note that within studies employing latent methods, achieving a coefficient $H_j$ estimate of .80 within any group should not be difficult. A value of $H_j = .80$ would be achieved if any single indicator has $\ell_{ij} = .894$, if two indicators have $\ell_{ij} = .816$, if three indicators have $\ell_{ij} = .756$, if four indicators have $\ell_{ij} = .707$, if five indicators have $\ell_{ij} = .667$, and in general if $t$ out of $p$ indicators have $\ell_{ij} = [4/(t+4)]^{1/2}$. One would suspect that a researcher entering into a latent mean analysis (SMM or MIMIC) is confident in the construct and its indicators, having ascertained through prior study the existence of a reasonable measurement model and thus ensuring a high coefficient $H_j$. For these reasons, Cohen's standards for gauging measured variable effect sizes are believed to suffice for standardized effect sizes involving reasonably reliable latent variables as well. It must be added, however, that the practical worth of latent mean differences, just as with measured variables, should ultimately be gauged within the specific research context.

Finally, recall that in order to relate distributional noncentrality to the standardized effect size measures derived previously, in addition to multivariate normality the assumption of invariance of construct reliability (as measured by $H_j$) is necessary. As $H_j = \phi\boldsymbol\Lambda_j'\boldsymbol\Sigma_j^{-1}\boldsymbol\Lambda_j$ (where $\boldsymbol\Sigma_j = \boldsymbol\Lambda_j\phi\boldsymbol\Lambda_j' + \boldsymbol\Theta_j$), such invariance would occur if corresponding loadings and error (co)variances were equal across groups (in addition to $\phi_j = \phi$ for $j = 1$ to $J$), as is assumed in the MIMIC approach to hypothesis testing of latent means. This total invariance is not required, however, as other combinations of noninvariant $\boldsymbol\Lambda_j$ and $\boldsymbol\Theta_j$ could also yield invariance of $H_j$ coefficients. Still, in general, assumptions of parameter invariance across populations are likely to be false, including that of $H_j$. Notwithstanding, such an assumption may be tolerable; if a researcher has chosen construct indicators wisely so that, say, $H_j \ge .80$ for $j = 1$ to $J$, and if a pooled aggregate of $H_j$ values is utilized in practice (more on this later), then the adverse impact of true $H_j$ heterogeneity on post hoc power estimates should be controlled.
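The numerical sketch below illustrates coefficient $H$ and the properties noted in this section, using the closed form from Appendix A and the loading sets mentioned above.

```python
# Coefficient H from a set of standardized loadings, via the Appendix A closed form:
# H = 1 / (1 + 1 / sum(l_i^2 / (1 - l_i^2))).
def coefficient_h(loadings):
    s = sum(l ** 2 / (1.0 - l ** 2) for l in loadings)
    return s / (1.0 + s)            # algebraically identical to 1 / (1 + 1/s)

print(round(coefficient_h([0.90, 0.50, 0.50]), 4))   # 0.8314, just over .90^2 = .81
print([round(coefficient_h([0.70] * p), 4) for p in range(3, 8)])
# [0.7424, 0.7935, 0.8277, 0.8522, 0.8706]
print(round(coefficient_h([(4.0 / (1 + 4)) ** 0.5]), 2))        # one loading of .894  -> 0.80
print(round(coefficient_h([(4.0 / (3 + 4)) ** 0.5] * 3), 2))    # three loadings of .756 -> 0.80
```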

Post Hoc Power Analysis

The case of $J = 2$. Post hoc power analysis may be conducted in several ways when $J = 2$. Most simply, and as done elsewhere (e.g., Kaplan & George, 1995), one could square the $z$-value associated with the test of the unconstrained $\kappa_2$ (i.e., compute a Wald test statistic) for use as an estimated noncentrality parameter of the distribution $\chi^2(\nu; \lambda)$, where $\nu = 1$ and $\hat\lambda = z^2$, under multivariate normal conditions. Power $\pi$ of an $\alpha$-level test of the latent mean difference could then be estimated as

$$\hat\pi = \Pr[\chi^2(1; z^2) > {}_{(1-\alpha)}\chi^2(1; 0)],$$

where ${}_{(1-\alpha)}\chi^2(1; 0)$ is the $\alpha$-level critical value from the upper tail of the central $\chi^2$ distribution with 1 degree of freedom. This process may be facilitated by using tables provided by Haynam, Govindarajulu, and Leone (1973), or appropriate software such as SAS (SAS Institute, 1996) or GAUSS (Aptech Systems, 1996). Equivalently, and requiring no special tables or software, one could estimate power as $\hat\pi = \Pr[|X| > {}_{(1-\alpha/2)}Z]$, where $X \sim N(z, 1)$, $z$ is the observed test statistic associated directly with the test of $\kappa_2$, and ${}_{(1-\alpha/2)}Z$ is the two-tailed $\alpha$-level critical value from the upper tail of the standard normal distribution. As a final strategy, and relating back to the previous noncentrality derivation, an estimated noncentrality parameter

$$\hat\lambda_0 = (N-1)(n_1n_2/N^2)\hat{d}^2\hat{H}$$


could be employed for post hoc power analysis just as described above, where $\hat{d}$ is computed as shown previously and $\hat{H}$ is the value derived from the single set of standardized loadings in a MIMIC analysis or

$$\hat{H} = \sum_{j=1}^{J} n_j\hat{H}_j/N$$

from the multiple sets of standardized loadings in an SMM approach ($\hat{H}$ estimates will tend to differ across methods unless the stringent invariance assumptions of the MIMIC approach hold). Thus, the power of an $\alpha$-level test of the latent mean difference could be estimated as

$$\hat\pi = \Pr[\chi^2(1; \hat\lambda_0) > {}_{(1-\alpha)}\chi^2(1; 0)],$$

or more simply as $\hat\pi = \Pr[|X| > {}_{(1-\alpha/2)}Z]$,

where

$X \sim N(\hat\lambda_0^{1/2}, 1)$.

Note that these last strategies to estimate power (using $\hat\lambda_0$), while asymptotically equivalent to those derived from the Wald test statistic, would likely differ slightly because the Wald values rely on the asymptotic parameter covariance matrix that precipitated the standard error in the denominator of the relevant $z$ test statistic.

As a two-group example for SMM, imagine $n_1 = 160$, $\hat\kappa_1 = 0$ (constrained for identification), $\hat\phi_1 = 4.22$, and $\hat{\mathbf{L}}_1' = [.77\ \ .88\ \ .74]$ (yielding $\hat{H}_1 = .8591$); for the second group, $n_2 = 240$, $\hat\kappa_2 = .42$, $\hat\phi_2 = 4.54$, and $\hat{\mathbf{L}}_2' = [.81\ \ .85\ \ .80]$ (yielding $\hat{H}_2 = .8628$). The test of the latent mean difference was conducted using a two-tailed test at $\alpha = .05$, and failed to find a statistically significant difference. The pooled reliability is $\hat{H} = [160(.8591) + 240(.8628)]/400 = .8613$, and $\hat{d} = .42/\{[160(4.22) + 240(4.54)]/400\}^{1/2} = .20$; thus, the noncentrality parameter estimate is $\hat\lambda_0 = (400-1)(160 \times 240/400^2)(.20^2)(.8613) = 3.299$. A normal distribution with unit variance and mean $(3.299)^{1/2} = 1.816$ would have a proportion of approximately .443 exceeding $\pm 1.96$; that is, the estimated post hoc power is $\hat\pi = .443$.
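The following sketch reproduces this post hoc power calculation, assuming scipy only for the normal and noncentral chi-square distributions; the model estimates are those of the example above.

```python
# Post hoc power for the two-group SMM example: n1 = 160, n2 = 240, kappa2_hat = .42,
# per-group construct variances 4.22 and 4.54, and H_hat values .8591 and .8628.
from scipy.stats import norm, ncx2, chi2

n1, n2 = 160, 240
N = n1 + n2
kappa2_hat = 0.42
H_hat = (n1 * 0.8591 + n2 * 0.8628) / N                 # pooled reliability, about .8613
phi_hat = (n1 * 4.22 + n2 * 4.54) / N                   # pooled construct variance
d_hat = kappa2_hat / phi_hat ** 0.5                     # about .20

lam0_hat = (N - 1) * (n1 * n2 / N ** 2) * d_hat ** 2 * H_hat

# Normal approximation: X ~ N(sqrt(lam0_hat), 1), two-tailed alpha = .05
power_norm = norm.sf(1.96 - lam0_hat ** 0.5) + norm.cdf(-1.96 - lam0_hat ** 0.5)
# Noncentral chi-square version with 1 df
power_chi2 = ncx2.sf(chi2.ppf(0.95, df=1), df=1, nc=lam0_hat)
print(round(lam0_hat, 3), round(power_norm, 3), round(power_chi2, 3))
# roughly 3.30 and .44 either way, matching the text's 3.299 and .443
```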

The general case of $J > 2$. For more than two groups, the method of post hoc power analysis depends upon how the mean differences were assessed. If the researcher simply evaluated each comparison or contrast individually, the power of individual tests at a chosen $\alpha$ level may be determined as outlined above. (And certainly the issues of multiple comparison procedures and Type I error control within latent means modeling are highly relevant and well worth formal articulation; they remain, however, beyond the scope of this article.) On the other hand, the overall power of an omnibus test of latent mean differences may be assessed using the noncentrality parameter estimate $\hat\lambda_0 = (N-1)\hat{f}^2\hat{H}$, where $\hat{f}$ is as defined previously and $\hat{H}$ is the value derived from the single set of standardized loadings in a MIMIC analysis or

$$\hat{H} = \sum_{j=1}^{J} n_j\hat{H}_j/N$$


from the multiple sets of standardized loadings in an SMM approach. Specifically, the power of an $\alpha$-level omnibus test may be estimated as

$$\hat\pi = \Pr[\chi^2(J-1; \hat\lambda_0) > {}_{(1-\alpha)}\chi^2(J-1; 0)]$$

using tabled or computer-generated values from the appropriate noncentral $\chi^2$ distribution, assuming multivariate normality.

Consider as an example a MIMIC model with two group code predictors being used in an omnibus assessment of overall latent mean differences in the combined data set from subsamples of $n_1 = 100$, $n_2 = 125$, and $n_3 = 150$. The standardized path from the disturbance to the construct $\eta$ is estimated as $\hat\psi = .995$, while the standardized loading estimates are $\hat{\mathbf{L}}' = [.73\ \ .68\ \ .82]$. As derived previously in the article, the estimated standardized effect size is computed as $\hat{f} = [(1 - .995^2)/.995^2]^{1/2} = .10$, and $\hat{H}$ is determined as .8021; thus the noncentrality parameter estimate is $\hat\lambda_0 = (375-1)(.10^2)(.8021) = 3.000$. The power of a .05-level omnibus test is estimated as

$$\hat\pi = \Pr[\chi^2(J-1; \hat\lambda_0) > {}_{(1-\alpha)}\chi^2(J-1; 0)] = \Pr[\chi^2(2; 3.000) > {}_{(.95)}\chi^2(2; 0)].$$

Using the power tables provided by Haynam et al. (1973, pp. 1-42), the estimate is $\hat\pi = .3215$.
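The same omnibus power estimate can be obtained from the noncentral chi-square distribution directly. A sketch assuming scipy, with the values of the example above:

```python
# Omnibus post hoc power for the three-group MIMIC example.
from scipy.stats import chi2, ncx2

n = (100, 125, 150)
N = sum(n)
psi_hat = 0.995                                   # standardized disturbance path
loadings = (0.73, 0.68, 0.82)                     # standardized loadings

f_hat = ((1 - psi_hat ** 2) / psi_hat ** 2) ** 0.5          # about .10
s = sum(l ** 2 / (1 - l ** 2) for l in loadings)
H_hat = s / (1 + s)                                         # about .8021

lam0_hat = (N - 1) * f_hat ** 2 * H_hat                     # about 3.0
crit = chi2.ppf(0.95, df=2)                                 # 5.991
power = ncx2.sf(crit, df=2, nc=lam0_hat)
print(round(lam0_hat, 3), round(power, 3))  # close to 3.0 and the tabled value of .3215
```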

A Priori Sample Size Determination

A priori determination of the necessary sample size for conducting a between-groups latent mean investigation is just as important as for its well-established measured variable counterpart, and possibly more so given that data from more variables must be gathered. Whether dealing with the two-group case or the more general $J$-group case, the initial goal is to determine the noncentrality parameter needed to achieve a desired level of power, and from that parameter to infer the necessary sample size per group.

The case of $J = 2$. In the two-group case, sample size determination may be accomplished using either noncentral $\chi^2$ or noncentral normal distributions. First, the researcher must choose an $\alpha$-level and a desired minimum level of power $\pi$, such as the customary $\alpha = .05$ and $\pi = .80$. Second, a priori values for $d$ and $H$ must be determined based on educated estimation, prior research, or conservative desirable minimum values. Third, a noncentrality parameter $\lambda_0$ must be found as the solution to the equation

$$\Pr[\chi^2(1; \lambda_0) > {}_{(1-\alpha)}\chi^2(1; 0)] = \pi,$$

where ${}_{(1-\alpha)}\chi^2(1; 0)$

is the $\alpha$-level critical value from the upper tail of the central $\chi^2$ distribution with 1 degree of freedom; this will require tables of noncentral $\chi^2$ distributions or appropriate statistical software. Finally, by rearranging the expression $\lambda_0 = (N-1)(n_1n_2/N^2)d^2H$, the necessary sample size per group (assuming equal $n$) can be determined as $n = [2\lambda_0/(d^2H)] + 0.5$.

Because normal distributions are much more accessible and familiar, the third and fourth steps above may be replaced with the following. Third, a noncentrality parameter $k$ must be found for $X \sim N(k, 1)$ such that $\Pr[|X| > {}_{(1-\alpha/2)}Z] = \pi$, where ${}_{(1-\alpha/2)}Z$ is the two-tailed $\alpha$-level critical value from the upper tail of the standard normal distribution. Fourth, because for the single degree of freedom case the $\chi^2$ noncentrality parameter $\lambda_0$ equals $k^2$ (the square of the normal distribution noncentrality parameter), the necessary sample size per group (still assuming equal $n$) can be determined as $n = [2k^2/(d^2H)] + 0.5$.

As an example, imagine that a researcher wishes to conduct a two-group test of latent means at the $\alpha = .05$ level and with power of at least $\pi = .80$. Further, the researcher would like


to be able to detect a minimum standardized latent difference of $d = .25$ on a five-indicator construct with expected standardized loadings in both groups no smaller than .70 (yielding a minimum possible $H = .8277$). Using a table of the standard normal curve, a distribution with a proportion of .80 exceeding $\pm 1.96$ is centered roughly at $k = 2.80$. The minimum sample size required per group is thus determined as $n = [2(2.80)^2/(.25^2(.8277))] + 0.5 = 303.6$; rounding up to ensure adequate power yields $n_1 = n_2 = 304$.
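These two-group steps translate directly into a few lines of code. A sketch assuming scipy for the normal quantiles, with the inputs of the example just given:

```python
# A priori sample size per group for a two-group latent means test.
import math
from scipy.stats import norm

alpha, power = 0.05, 0.80
d, H = 0.25, 0.8277        # minimum effect of interest and minimum coefficient H

# Normal-approximation noncentrality: k such that Pr(|X| > z_{1-alpha/2}) = power
k = norm.ppf(1 - alpha / 2) + norm.ppf(power)     # about 2.80
n = 2 * k ** 2 / (d ** 2 * H) + 0.5               # about 304 before rounding up
print(round(k, 2), math.ceil(n))                  # 2.8, 304 (the text, using k = 2.80, obtains 303.6)
```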

The general case of $J \ge 2$. For more than two groups, the method of sample size determination depends upon how the mean differences are to be assessed. If the researcher plans to evaluate each comparison or contrast individually at a set $\alpha$-level, minimum sample sizes may be derived using the method presented immediately above. On the other hand, if the researcher is interested in achieving a desired level of power for an omnibus test of mean differences, the following steps are necessary. First, the researcher must choose an $\alpha$-level and a desired minimum level of power $\pi$. Second, a priori values for $f$ and $H$ must be determined. Third, a noncentrality parameter $\lambda_0$ must be found as the solution to the equation

$$\Pr[\chi^2(J-1; \lambda_0) > {}_{(1-\alpha)}\chi^2(J-1; 0)] = \pi,$$

where ${}_{(1-\alpha)}\chi^2(J-1; 0)$

is the $\alpha$-level critical value from the upper tail of the central $\chi^2$ distribution with $J-1$ degrees of freedom; this will again require tables of noncentral $\chi^2$ distributions or appropriate statistical software. Finally, by rearranging the expression $\lambda_0 = (N-1)f^2H = (Jn-1)f^2H$ (assuming equal $n$), the necessary sample size per group can be determined as $n = [\lambda_0/(Jf^2H)] + (1/J)$.

Consider, for example, a researcher planning a study with three groups, in which overall differences among latent means will be assessed at the $\alpha = .05$ level and with power of at least $\pi = .80$. The researcher would like to be able to detect a minimum standardized latent difference of $f = .10$ on a four-indicator construct, for which past research has shown standardized loadings in all groups to be no smaller than .75 (yielding a minimum $H = .8372$). Using tables of noncentrality parameters for $\chi^2$ distributions (Haynam et al., 1973, pp. 43-78), a distribution with a proportion of .80 exceeding $_{.95}\chi^2(2; 0) = 5.991$ would have a noncentrality parameter of approximately $\lambda_0 = 9.635$. The minimum sample size required per group is thus determined as $n = [9.635/(3(.10^2)(.8372))] + (1/3) = 383.95$; rounding up to ensure adequate power yields $n_1 = n_2 = n_3 = 384$.
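For the omnibus $J$-group case, the required noncentrality can also be found by solving the power equation numerically rather than from tables. A sketch assuming scipy, with the inputs of the three-group example:

```python
# A priori sample size per group for a J-group omnibus test of latent means.
import math
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

J, alpha, target_power = 3, 0.05, 0.80
f, H = 0.10, 0.8372

crit = chi2.ppf(1 - alpha, df=J - 1)                       # 5.991 for J = 3
# Solve Pr[chi2(J-1; lam0) > crit] = target_power for lam0
lam0 = brentq(lambda lam: ncx2.sf(crit, df=J - 1, nc=lam) - target_power, 0.01, 100)
n = lam0 / (J * f ** 2 * H) + 1.0 / J
print(round(lam0, 3), math.ceil(n))                        # about 9.63 and 384
```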

As a final note, practitioners should not accept the sample size recommendations arising through these methods without considering other sample size requirements. Specifically, as SEM is based on asymptotic theory, large enough samples must be used no matter how small a sample is determined to achieve a desired level of power. While what constitutes "large enough" is rather ill-defined, minimum recommendations such as five cases per model parameter (Bentler & Chou, 1987) should certainly be consulted in conjunction with power considerations.

Conclusions

The current article has offered standardized effect size measures for gauging differences among groups' latent means, and methods for using those measures in post hoc power analysis and a priori sample size determination. These measures and methods may be applied when using structured means modeling techniques, or the simpler MIMIC models with group code predictors. As with virtually all power analysis methods, their accuracy relies on satisfying distributional assumptions. To the extent that multivariate normality does not hold, power analysis and sample size determination may be inaccurate. The viability of, for example, asymptotically distribution-free (e.g., Browne, 1984) extensions of these methods, or of using rescaled test statistics and/or robust standard errors (Satorra & Bentler, 1994), remains to be explored.

Although peripheral to the focus of the current article, the power derivations also led to a useful measure of construct reliability. Coefficient H describes the relation between the latent

construct and its measured indicators in a manner more consistent with discussions of power and reliability in other research contexts. And unlike other proposed measures of construct reliability (e.g., Fornell & Larcker, 1981), where the index might not exceed its best indicator's reliability, coefficient H is never less than its best indicator's reliability. This makes intuitive sense because employing a latent variable approach should only improve upon using the best single indicator alone; additional indicators merely serve to enhance the construct in a manner commensurate with their own ability to reflect the construct. More investigation and application of this simple yet intriguing index is certainly warranted.

Finally, it is hoped that, as latent means methods become increasingly popular, sample size determination efforts will precede such analyses, and that the measures and post hoc power estimates presented here for both two-group and $J$-group cases will accompany results as a means for communicating the magnitude of latent effects. Further, while the current article has focused on one-way between-subjects designs under specific invariance conditions, expansion of these principles to designs of greater complexity (and noninvariance) as part of a general latent variable experimental design paradigm is eagerly awaited.

Appendix A

Derivation of Formula for Coefficient H

The proof below shows that, for a given population, $H = \mathbf{L}'\mathbf{P}^{-1}\mathbf{L}$ implies that

$$H = 1\Big/\left[1 + 1\Big/\sum_{i=1}^{p}\ell_i^2/(1-\ell_i^2)\right]$$

when at least one loading is nonzero.

$$\begin{aligned}
\mathbf{I} &= \mathbf{P}^{-1}\mathbf{P}\\
\mathbf{I} &= \mathbf{P}^{-1}\mathbf{P} - \mathbf{P}^{-1}\mathbf{L}\mathbf{L}' + \mathbf{P}^{-1}\mathbf{L}\mathbf{L}'\\
\mathbf{I} &= \mathbf{P}^{-1}(\mathbf{P} - \mathbf{L}\mathbf{L}') + \mathbf{P}^{-1}\mathbf{L}\mathbf{L}'\\
(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1} &= \mathbf{P}^{-1} + \mathbf{P}^{-1}\mathbf{L}\mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\\
\mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L} &= \mathbf{L}'\mathbf{P}^{-1}\mathbf{L} + \mathbf{L}'\mathbf{P}^{-1}\mathbf{L}\mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L}\\
\mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L} &= \mathbf{L}'\mathbf{P}^{-1}\mathbf{L}[1 + \mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L}]\\
\mathbf{L}'\mathbf{P}^{-1}\mathbf{L} &= \mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L}\big/\big[1 + \mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L}\big].
\end{aligned}$$

If $\mathbf{L} \ne \mathbf{0}$, then

$$\mathbf{L}'\mathbf{P}^{-1}\mathbf{L} = 1\big/\big\{[1/\mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L}] + 1\big\},$$

and because for the single-factor model $\mathbf{P} - \mathbf{L}\mathbf{L}'$ is the diagonal matrix of standardized error variances, $\mathrm{diag}(1-\ell_1^2, \ldots, 1-\ell_p^2)$, so that $\mathbf{L}'(\mathbf{P} - \mathbf{L}\mathbf{L}')^{-1}\mathbf{L} = \sum_{i=1}^{p}\ell_i^2/(1-\ell_i^2)$,

$$\mathbf{L}'\mathbf{P}^{-1}\mathbf{L} = 1\Big/\left[1 + 1\Big/\sum_{i=1}^{p}\ell_i^2/(1-\ell_i^2)\right]. \qquad \square$$
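As a numerical sanity check on this derivation (and on the Appendix B interpretation of $H$ as the variance in $\eta$ explained by its indicators), one can build the single-factor correlation matrix $\mathbf{P} = \mathbf{L}\mathbf{L}' + \mathrm{diag}(1-\ell_i^2)$ for arbitrary loadings and compare $\mathbf{L}'\mathbf{P}^{-1}\mathbf{L}$ with the closed form. A sketch assuming numpy:

```python
# Verify numerically that L' P^{-1} L equals 1 / (1 + 1 / sum(l_i^2/(1-l_i^2)))
# for a single-factor correlation structure P = L L' + diag(1 - l_i^2).
import numpy as np

L = np.array([0.9, 0.5, 0.5, 0.7])                 # arbitrary standardized loadings
P = np.outer(L, L) + np.diag(1 - L ** 2)           # implied correlation matrix

h_matrix = L @ np.linalg.solve(P, L)               # L' P^{-1} L
s = np.sum(L ** 2 / (1 - L ** 2))
h_closed = 1 / (1 + 1 / s)

print(np.isclose(h_matrix, h_closed))              # True
print(round(h_matrix, 4), round(max(L ** 2), 4))   # H is at least the best indicator's l^2
```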

Appendix B

Coefficient H as a Measure of Explained Variability in the Construct $\eta$

Assume that construct $\eta$ and its $p$ observed indicators in $\mathbf{y}$ are standardized. If data for a population of size $N$ ($N \to \infty$) existed on $\eta$ and $\mathbf{y}$, a regression model predicting $\eta$ from its indicators could be constructed as $\boldsymbol\eta = \mathbf{Y}\boldsymbol\beta + \mathbf{u}$, where $\boldsymbol\eta$ is an $N \times 1$ vector of true construct scores, $\mathbf{Y}$ is an $N \times p$ matrix of scores on the indicator variables, $\boldsymbol\beta$ is the $p \times 1$ vector of standardized population regression weights, and $\mathbf{u}$ is the $N \times 1$ vector of residuals. The $N \times 1$


vector represented by the product $\mathbf{Y}\boldsymbol\beta$ is the part of the construct $\eta$ reproducible (predictable) by the observed variables, and as such is hereafter labeled $\hat{\boldsymbol\eta}$. As per the well-known regression method of estimating factor scores, $\boldsymbol\beta = \mathbf{P}^{-1}\mathbf{L}$, where $\mathbf{P}^{-1}$ is the inverse of the $p \times p$ population correlation matrix for the indicator variables and $\mathbf{L}$ is the $p \times 1$ vector of standardized loadings of the Y variables on $\eta$. Thus, by substitution, $\boldsymbol\eta = \hat{\boldsymbol\eta} + \mathbf{u} = \mathbf{Y}\mathbf{P}^{-1}\mathbf{L} + \mathbf{u}$.

The extent to which the true construct $\eta$ is captured by the information in the observed indicators is reflected in the correlation between $\boldsymbol\eta$ and $\hat{\boldsymbol\eta}$, a correlation between a latent variable and its optimum observable counterpart. Once squared, the value becomes a reliability coefficient indicating the proportion of variance in the construct explained by the observed scores (or, equivalently and more traditionally, the proportion of variance in the observed scores' composite explained by the construct). Recalling the expression $\boldsymbol\eta = \hat{\boldsymbol\eta} + \mathbf{u}$, where scores in $\boldsymbol\eta$ are standardized, the proportion of variance in $\boldsymbol\eta$ accounted for by $\hat{\boldsymbol\eta}$ is simply the variance of $\hat{\boldsymbol\eta}$. This is:

$$\mathrm{E}[\hat{\boldsymbol\eta}'\hat{\boldsymbol\eta}] = \mathrm{E}[(\mathbf{Y}\mathbf{P}^{-1}\mathbf{L})'(\mathbf{Y}\mathbf{P}^{-1}\mathbf{L})] = \mathrm{E}[\mathbf{L}'\mathbf{P}^{-1}\mathbf{Y}'\mathbf{Y}\mathbf{P}^{-1}\mathbf{L}] = \mathbf{L}'\mathbf{P}^{-1}\mathrm{E}[\mathbf{Y}'\mathbf{Y}]\mathbf{P}^{-1}\mathbf{L} = \mathbf{L}'\mathbf{P}^{-1}\mathbf{P}\mathbf{P}^{-1}\mathbf{L} = \mathbf{L}'\mathbf{P}^{-1}\mathbf{L} = H.$$

Thus, coefficient $H$ is a measure of the relation between a construct and its indicators, termed here a measure of "construct reliability" (e.g., Crocker & Algina, 1986), as it is the proportion of variance shared by the latent construct and the observed variables.

References

Aiken, L.S., Stein, J.A., & Bentler, P.M. (1994). Structural equation analyses of clinical subpopulation differences and comparative treatment outcomes: Characterizing the daily lives of drug addicts. Journal of Consulting and Clinical Psychology, 62, 488-499.

Aptech Systems. (1996). GAUSS system and graphics manual. Maple Valley, WA: Author.
Bentler, P.M. (1968). Alpha-maximized factor analysis (alphamax): Its relation to alpha and canonical factor analysis. Psychometrika, 33, 335-345.
Bentler, P.M. (1997). EQS structural equations program. Encino, CA: Multivariate Software.
Bentler, P.M., & Chou, C. (1987). Practical issues in structural modeling. Sociological Methods & Research, 16(1), 78-117.
Browne, M.W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62-83.
Byrne, B.M., Shavelson, R.J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.
Cleary, T.A., & Linn, R.L. (1969). Error of measurement and the power of a statistical test. British Journal of Mathematical and Statistical Psychology, 22, 49-55.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Crocker, L.M., & Algina, J. (1986). Introduction to classical and modern test theory. New York, NY: Holt, Rinehart and Winston.
Drewes, D.W. (2000). Beyond the Spearman-Brown: A structural approach to maximal reliability. Psychological Methods, 5, 214-227.
Dukes, R.L., Ullman, J.B., & Stein, J.A. (1995). An evaluation of D.A.R.E. (Drug Abuse Resistance Education), using a Solomon four-group design with latent variables. Evaluation Review, 19, 409-435.
Fornell, C., & Larcker, D.F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18, 39-50.
Gallo, J.J., Anthony, J.C., & Muthén, B.O. (1994). Age differences in the symptoms of depression: A latent trait analysis. Journal of Gerontology: Psychological Sciences, 49, 251-264.
Hancock, G.R. (1997). Structural equation modeling methods of hypothesis testing of latent variable means. Measurement and Evaluation in Counseling and Development, 30, 91-105.
Haynam, G.E., Govindarajulu, Z., & Leone, F.C. (1973). Tables of the cumulative non-central chi-square distribution. In H.L. Harter & D.B. Owen (Eds.), Selected tables in mathematical statistics (Vol. 1, pp. 1-78). Chicago: Markham Publishing.


Hedges, L.V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York, NY: Academic Press.
Jöreskog, K.G., & Goldberger, A.S. (1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70, 631-639.
Kaplan, D., & George, R. (1995). A study of the power associated with testing factor mean differences under violations of factorial invariance. Structural Equation Modeling: A Multidisciplinary Journal, 2, 101-118.
Kinnunen, U., & Leskinen, E. (1989). Teacher stress during the school year: Covariance and mean structure analyses. Journal of Occupational Psychology, 62, 111-122.
Li, H., Rosenthal, R., & Rubin, D.B. (1996). Reliability of measurement in psychology: From Spearman-Brown to maximal reliability. Psychological Methods, 1, 98-107.
MacCallum, R.C., Browne, M.W., & Sugawara, H.M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Maxwell, S.E. (1980). Dependent variable reliability and determination of sample size. Applied Psychological Measurement, 4, 253-260.
Muthén, B.O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54, 557-585.
Nunnally, J.C., & Bernstein, I.H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
SAS Institute. (1996). SAS statistical software. Cary, NC: Author.
Satorra, A., & Bentler, P.M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C.C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399-419). Thousand Oaks, CA: SAGE Publications.
Satorra, A., & Saris, W.E. (1985). Power of the likelihood ratio test in covariance structure analysis. Psychometrika, 50, 83-90.
Sörbom, D. (1974). A general method for studying differences in factor means and factor structure between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.

Manuscript received 9 FEB 1999 Final version received 8 AUG 2000