
A Comparison of Uniform DIF Effect Size Estimators Under the MIMIC and Rasch Models

Educational and Psychological Measurement 73(2) 339–358
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164412462705
epm.sagepub.com

Ying Jin1, Nicholas D. Myers1, Soyeon Ahn1, and Randall D. Penfield1

Abstract

The Rasch model, a member of a larger group of models within item response theory, is widely used in empirical studies. Detection of uniform differential item functioning (DIF) within the Rasch model typically employs null hypothesis testing with a concomitant consideration of effect size (e.g., signed area [SA]). Parametric equivalence between confirmatory factor analysis under the multiple indicators, multiple causes (MIMIC) model and the Rasch model has been established. Unlike the Rasch approach to DIF, however, the parallel MIMIC approach to DIF detection has relied exclusively on null hypothesis testing. This study derived an effect size estimator for DIF under the MIMIC model (MIMIC-ES) and then investigated the ability of MIMIC-ES to correctly estimate the magnitude of DIF as compared to the SA approach under the Rasch model (Rasch-ES) in a Monte Carlo study. Variables manipulated were sample size, mean ability difference, DIF size, number of DIF items, and item difficulty levels. Results indicated that MIMIC-ES performed well when there was no mean ability difference. When mean ability difference was present, MIMIC-ES became increasingly imprecise and unstable when the sample size was small, for all DIF sizes and numbers of DIF items. MIMIC-ES outperformed Rasch-ES when the number of DIF items reached 30%.

Keywords: MIMIC, Rasch, DIF, effect size

1University of Miami, Coral Gables, FL, USA

Corresponding Author: Ying Jin, Department of Educational and Psychological Studies, University of Miami, Merrick Building, Coral Gables, FL 33124-2040, USA. Email: [email protected]


Measurement invariance implies an independent relationship between group membership and the probability of a correct response after conditioning on the target ability that an item is intended to measure (Millsap & Meredith, 1992). In contrast, a dependent relationship indicates the possible presence of secondary abilities whose conditional distribution, given the target ability, differs between groups (Camilli & Shepard, 1994). Differential item functioning (DIF) analysis tests for the existence of such secondary abilities. For dichotomous items, there are two general types of DIF: uniform and nonuniform. Determining the type of DIF can help the researcher understand where and why DIF occurs. For example, if one group of examinees consistently scores higher on an item than another group by a constant amount along the target ability continuum, this implies the presence of uniform DIF. Uniform DIF can be a result of a mean difference in the conditional distribution of secondary abilities between groups. When the relative advantage is not constant, or is inconsistently higher or lower, along the entire range of the target ability continuum, nonuniform DIF is likely to occur. The presence of nonuniform DIF can be a result of unequal variance of secondary abilities between groups (Ackerman, 1992). Having determined where DIF exists, the researcher will then need to consider properties of the secondary ability distribution to determine why DIF exists.1

The previous description of DIF in terms of the relative advantage of one group over another can be expressed mathematically by an odds ratio function (ORF; Hanson, 1998):

$$\mathrm{ORF}(\theta) = \frac{P_R(Y = 1 \mid \theta)\,/\,[1 - P_R(Y = 1 \mid \theta)]}{P_F(Y = 1 \mid \theta)\,/\,[1 - P_F(Y = 1 \mid \theta)]}, \qquad (1)$$

where θ is the target ability and the subscripts R and F represent the reference and focal groups, respectively. By convention, the reference group refers to the mainstream of the population of interest, whereas the focal group represents the potentially disadvantaged population. P(Y = 1 | θ) is the item response function (IRF) representing the probability of a correct response (i.e., 1 indicates a correct response to a dichotomous item) conditional on θ. An item is said to have uniform DIF if the ORF is a constant and does not depend on θ. An item is said to have nonuniform DIF if the ORF is not a constant and depends on θ or other item parameters. Normally, the ORF is placed on the natural logarithm scale to facilitate the interpretation of DIF between groups, so that there is no DIF when the LORF (i.e., LORF = ln(ORF)) is equal to zero.

DIF detection frequently starts with hypothesis testing, where the null hypothesis of no DIF (i.e., LORF = 0) is evaluated using a significance test. For example, the likelihood ratio test (Thissen, Steinberg, & Wainer, 1993), which is based on the chi-square test of the likelihood difference obtained after fitting nested item response theory (IRT) models, flags an item as DIF-present when the chi-square test is significant. The power to reject a false null hypothesis, however, is heavily influenced by sample size, which when large can result in items with negligible, though nonzero, DIF being flagged as DIF-present (Penfield & Camilli, 2007).2
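As a concrete illustration, the following minimal Python sketch (assuming NumPy; the function names are illustrative, not code from the article) evaluates the LORF for two logistic IRFs that differ only in difficulty and shows that it is constant in θ, the signature of uniform DIF:

```python
import numpy as np

def irf(theta, b):
    """Logistic (Rasch-form) item response function P(Y = 1 | theta)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def lorf(theta, b_ref, b_foc):
    """Log odds ratio function, i.e., ln(ORF) from Equation (1)."""
    p_r, p_f = irf(theta, b_ref), irf(theta, b_foc)
    return np.log((p_r / (1.0 - p_r)) / (p_f / (1.0 - p_f)))

theta = np.linspace(-3.0, 3.0, 7)
# Constant 0.5 (= b_foc - b_ref) at every theta: uniform DIF.
print(lorf(theta, b_ref=0.0, b_foc=0.5))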


DIF effect size (ES) estimates have been developed so that items with negligible DIF will not be flagged. For example, the Mantel-Haenszel approach (Holland & Thayer, 1988) uses a DIF ES estimator, the weighted average odds ratio along the θ continuum. The widely used ETS DIF classification scheme is based on the combination of both a significance test and an ES estimator from the Mantel-Haenszel approach. The benefit of employing null hypothesis testing with a concomitant consideration of effect size, such as flagging only items with meaningfully large DIF, has been demonstrated with traditional approaches to DIF detection (Jodoin & Gierl, 2001; Kim, Cohen, Alagoz, & Kim, 2007).

The multiple indicators, multiple causes (MIMIC) model has been put forth as a competitive, if nontraditional, detection method for DIF (e.g., Muthén, 1989; Muthén & Asparouhov, 2002). Simulation studies have shown that the MIMIC model was able to detect DIF as efficiently as other methods (e.g., the Mantel-Haenszel approach) under most conditions and was superior to other methods, in terms of Type I error rate and power, when there was DIF contamination in the anchor items (scores obtained from these items are used to match θ between groups; Finch, 2005). Moreover, there is evidence indicating that the MIMIC model is more powerful for the detection of uniform DIF (Woods, 2009). However, previous studies relied exclusively on significance tests to detect DIF. Furthermore, and as will be detailed in a subsequent section, the parameterization of the MIMIC model is such that it is ideal for uniform DIF detection only. It should be noted, however, that nonuniform DIF has been parameterized within the MIMIC model (Woods & Grimm, 2011).

Arising from the previously mentioned benefits, growing interest in employing the MIMIC model for uniform DIF analysis has called attention to the lack of an ES estimator. Unlike some uniform DIF methods developed under models that do not require a large sample size (e.g., the Rasch model), the MIMIC model falls within the framework of structural equation modeling (SEM), which normally requires a large sample size. As discussed previously, with a large sample size, items with negligible, but nonzero, uniform DIF are likely to be flagged due to high power. Therefore, a DIF ES estimator under the MIMIC model is particularly important for evaluating the magnitude of uniform DIF.

The purpose of the current study is threefold. First, a MIMIC ES estimator (MIMIC-ES) for uniform DIF is derived based on a parallel ES estimator for the one-parameter IRT model (i.e., the Rasch model). Second, the performance of MIMIC-ES is examined. Third, the performance of MIMIC-ES is compared with that of the ES estimator under the Rasch model (Rasch-ES). Prior to accomplishing these purposes, the following sections briefly review (a) DIF analysis in the Rasch model and the formulation of Rasch-ES and (b) the parametric relationship between the Rasch and MIMIC models and the derivation of MIMIC-ES. Finally, because both the MIMIC model and the Rasch model are in parametric form, and the accuracy of parameter estimation is crucial to ES estimates, parameter estimation methods are discussed briefly.


DIF Analysis in the Rasch Model

The Rasch model is widely used in educational and psychological studies (Andrich, 2004) and is formulated as follows:

$$P(Y_{ij} = 1 \mid \theta_j) = \frac{e^{(\theta_j - b_i)}}{1 + e^{(\theta_j - b_i)}}, \qquad (2)$$

where b_i is the difficulty parameter for item i, and θ_j is person j's ability measure. Because only one b is estimated for each item, and all items are assumed to discriminate equally among subjects with different θ, the Rasch model generally does not require a large sample for parameter estimation (De Ayala, 2009). Furthermore, van de Vijver's (1986) simulation study provided evidence that both b and θ estimates were robust to heterogeneity of item discrimination and the presence of guessing when item discrimination and guessing were within a reasonable range. Under the Rasch model, uniform DIF is quantified by the difference in b between groups. After substituting Equation (2) into Equation (1),

$$\mathrm{LORF} = \ln\frac{P_R(Y = 1 \mid \theta)\,/\,[1 - P_R(Y = 1 \mid \theta)]}{P_F(Y = 1 \mid \theta)\,/\,[1 - P_F(Y = 1 \mid \theta)]} = b_F - b_R. \qquad (3)$$

For testing the null hypothesis of no DIF under the Rasch model (i.e., LORF = b_F − b_R = 0), the IRT likelihood ratio test, for example, is based on the likelihood difference obtained after fitting a model with b_F and b_R constrained to equality as compared to a model in which b_F and b_R are freely estimated. A significant likelihood difference indicates the presence of uniform DIF (i.e., b_F ≠ b_R). As discussed previously, items with negligible, though nonzero, DIF can be flagged as DIF-present because power can be high with a large sample size. ES estimates should therefore be considered simultaneously so that the magnitude of uniform DIF is also taken into account. Raju's (1988) signed area index (SA) computes the area between the IRFs of the two groups, which can serve as an ES estimator. Under the Rasch model,

$$SA = \int_{-\infty}^{\infty} \left[P_R(Y = 1 \mid \theta) - P_F(Y = 1 \mid \theta)\right] d\theta = b_F - b_R. \qquad (4)$$

The integrated area is the difference between b_F and b_R, which is the same as the LORF in Equation (3). The risk of flagging negligible DIF by significance tests can be well controlled if SA is considered (i.e., the magnitude of DIF must be meaningful). In the current study, SA was selected because it assumes a parametric form, and the derivation of MIMIC-ES will also be based on a parametric form that will be shown to be equivalent to SA.
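A quick numerical check of Equation (4), again a sketch rather than code from the study, integrates the area between two Rasch IRFs with SciPy and recovers b_F − b_R:

```python
import numpy as np
from scipy.integrate import quad

def irf(theta, b):
    """Rasch IRF from Equation (2)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

b_ref, b_foc = 0.0, 0.5  # hypothetical difficulties; true SA = 0.5
sa, abs_err = quad(lambda t: irf(t, b_ref) - irf(t, b_foc), -np.inf, np.inf)
print(round(sa, 6))  # ~0.5, matching b_foc - b_ref
```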


DIF Analysis in the MIMIC Model

For dichotomous items, DIF detection under the MIMIC model is implemented via a latent response variable formulation (Muthén & Asparouhov, 2002):

$$y_i = \begin{cases} 1, & \text{if } y_i^* \geq \tau_i, \\ 0, & \text{if } y_i^* < \tau_i, \end{cases} \qquad (5)$$

where y_i is the dichotomous response, y_i^* is the continuous latent response, and τ_i is the threshold for item i. A correct response to an item becomes the more probable outcome when the latent response exceeds the item threshold. Under the MIMIC model, τ_i is conceptually equivalent to b_i under the Rasch model (see Takane & de Leeuw, 1987, for the parametric equivalence).

DIF detection under the MIMIC model is a fixed effects approach that models between-group differences (Muthén, 1989). The fixed effects enter the model as direct effects of the grouping variable on the items. A significant direct effect indicates DIF. Equations (6) and (7) illustrate a single-factor MIMIC model with a grouping variable as a covariate:

$$y_i^* = \lambda_i \eta + \beta z + \varepsilon_i, \qquad (6)$$

$$\eta = \gamma z + \zeta, \qquad (7)$$

where λ_i is the factor loading for item i. The regression coefficient γ is the direct effect of the grouping variable z on the latent ability η (θ in IRT models). The regression coefficient β is the direct effect of z on item i. ε_i and ζ are the error terms. The direct effect β represents the change in threshold after controlling for the mean ability difference between groups.

In the MIMIC model, multiple items can be tested for DIF simultaneously by significance tests of the direct effects, although not all items can be tested due to identification constraints (e.g., the parameters of at least one item have to be constrained equal across groups). There is evidence that the performance of DIF detection by the MIMIC model can be optimized by including anchor items and that the number of anchors can be as few as one (Wang & Shih, 2010). The MIMIC model can also detect DIF under an iterative DIF purification procedure by fixing and relaxing parameter constraints (e.g., Muthén, 1989; Wang, Shih, & Yang, 2009). DIF detection, however, can be affected by the fact that the matching variable can be contaminated by DIF-present items when iterative procedures are applied. The current study uses the simultaneous approach of the MIMIC model to avoid such an effect.

Muthén, Kao, and Burstein (1991) established parametric equivalence between the MIMIC model and the IRT model with respect to both the difficulty and discrimination parameters. Because direct effects of the grouping variable are not modeled in the discrimination parameters (Muthén & Asparouhov, 2002), the parametric equivalence in the difficulty parameter is given by


$$b_i = \frac{(\tau_i - \beta_i z)\lambda_i^{-1} - \mu}{\sigma^{1/2}}, \qquad (8)$$

where τ_i, β_i, and λ_i are the threshold, direct effect, and factor loading for item i, respectively. The mean and variance of the latent ability are μ and σ, respectively. To distinguish between the reference and focal groups, the dummy variable z is included, where 1 indicates the focal group. Using this formulation, MIMIC-ES is derived based on SA after fixing the latent ability to a unit normal distribution, and it is expected to be an unbiased ES estimator of uniform DIF.3 If SA = b_{Fi} − b_{Ri}, then

$$\text{MIMIC-ES} = \frac{\tau_i - \beta_i}{\lambda_i} - \frac{\tau_i}{\lambda_i} = -\frac{\beta_i}{\lambda_i}. \qquad (9)$$
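The algebra behind Equations (8) and (9) can be verified numerically. The sketch below (hypothetical parameter values; not the article's code) converts MIMIC parameters to group-specific IRT difficulties via Equation (8), with μ = 0 and σ = 1, and confirms that their difference equals −β_i/λ_i:

```python
import numpy as np

def irt_difficulty(tau, beta, lam, z, mu=0.0, var=1.0):
    """Equation (8): MIMIC parameters -> IRT difficulty for group z (1 = focal)."""
    return ((tau - beta * z) / lam - mu) / np.sqrt(var)

def mimic_es(beta, lam):
    """Equation (9): uniform DIF effect size under the MIMIC model."""
    return -beta / lam

tau, beta, lam = 0.8, -0.35, 1.2  # hypothetical threshold, direct effect, loading
b_foc = irt_difficulty(tau, beta, lam, z=1)
b_ref = irt_difficulty(tau, beta, lam, z=0)
print(b_foc - b_ref, mimic_es(beta, lam))  # both ~0.2917: SA = -beta/lam
```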

Parameter Estimation

Because both MIMIC-ES and Rasch-ES are functions of model parameters, accurate estimation of those parameters determines how capable they are of accurately estimating the magnitude of uniform DIF. It is important to note that different estimation methods could affect, and thereby confound, the results of studies comparing DIF methods. Such a confounding effect is sometimes inevitable, particularly if more than one software program is used.

Under the IRT framework, marginal maximum likelihood (MML) normally outperforms joint maximum likelihood (JML) estimation because the item parameters are estimated first with a prior on the latent ability distribution. JML can result in inconsistent item parameters as sample size increases because of the iterative procedure of estimating both item and person parameters together (Bock & Aitkin, 1981). In the Rasch model, however, the inconsistency issue of JML should not affect item parameters because the total score is a sufficient statistic for θ. JML groups persons with the same total score, so only a limited number of person parameters need to be estimated: if a test contains n items, at most n − 1 person parameters will be estimated regardless of the increase in sample size (Baker & Kim, 2004). Because of this favorable property of JML under the Rasch model, many Rasch software packages (e.g., WINSTEPS; Linacre, 2011) employ JML.4

Mplus (Muthén & Muthén, 2010) is a software package that is frequently used for DIF detection via the MIMIC model. Several parameter estimators are available in Mplus, including maximum likelihood (ML) estimators. Because the latent ability in the MIMIC model is often assumed to have a unit normal distribution, MML is employed. For dichotomous data, parameters are estimated from bivariate tetrachoric correlations between pairs of y*, where y* is assumed to be normally distributed in order to enable the estimation of the tetrachoric correlations. The normality assumption of y* cannot be tested for dichotomous data due to sparse cells, which can result from small sample size (Muthén, 1993). Violation of this normality assumption of y* can result in inaccurate tetrachoric correlations, therefore leading to biased parameter estimates. With ML applied in both the Rasch and MIMIC approaches, it is expected that MIMIC-ES using MML will not perform as well as Rasch-ES using JML when sample size is small.
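To make the MML idea concrete (the latent ability is integrated out against a standard normal prior before the item parameters are optimized), here is a self-contained Python sketch for the Rasch model using Gauss-Hermite quadrature. It is illustrative only; the study used Mplus and dedicated Rasch software, not this code, and the five-item setup is a hypothetical example:

```python
import numpy as np
from scipy.optimize import minimize

def rasch_p(theta, b):
    """Rasch IRF, Equation (2)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def neg_marginal_loglik(b, data, nodes, weights):
    """-log marginal likelihood: theta integrated out at the quadrature nodes."""
    p = rasch_p(nodes[:, None], b[None, :])                   # (n_nodes, n_items)
    logp = data @ np.log(p).T + (1 - data) @ np.log(1 - p).T  # (n_persons, n_nodes)
    return -np.sum(np.log(np.exp(logp) @ weights))

# Probabilists' Gauss-Hermite rule rescaled for a standard normal density.
nodes, weights = np.polynomial.hermite_e.hermegauss(41)
weights = weights / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(1)
true_b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])  # hypothetical difficulties
theta = rng.normal(size=500)
data = (rng.random((500, 5)) < rasch_p(theta[:, None], true_b[None, :])).astype(float)

fit = minimize(neg_marginal_loglik, np.zeros(5), args=(data, nodes, weights))
print(np.round(fit.x, 2))  # MML difficulty estimates, close to true_b
```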

Method

Previous studies comparing the MIMIC model and other DIF methods have illustrated other potential factors (beyond estimation method) affecting the performance of uniform DIF detection (e.g., Finch, 2005). Monte Carlo methods were used in the current study to manipulate these factors while examining the performance of MIMIC-ES and its comparison to Rasch-ES.

Forty dichotomous items were generated in IRT-Lab (Penfield, 2003). Parameter values were obtained from the SAT verbal items used in Finch's (2005) study to reflect reality, with b ranging from −1.5 to 2.4. Factors manipulated were sample size, mean ability difference between the focal and reference groups, magnitude of DIF, number of DIF items, and level of the difficulty parameter. There were 216 conditions in total (3 sample sizes × 2 mean ability differences × 4 DIF sizes × 3 numbers of DIF items × 3 levels of the difficulty parameter). Each condition was replicated 1,000 times.

Three levels of sample size were generated: small (NF = 50, NR = 200), medium (NF = 100, NR = 400), and large (NF = 250, NR = 1,000). The unbalanced sample sizes between the focal and reference groups reflected real-life situations, and the selected levels of sample size and the ratio of sample size between the focal and reference groups were within the range of those manipulated in previous simulation studies (e.g., Finch, 2005; Woods, 2009).

Difference in mean ability had two conditions. First, both focal and reference groups were generated from a standard normal distribution, N(0, 1). Second, the focal group was generated from N(−1, 1), whereas the reference group was generated from N(0, 1). A review of the existing literature indicated that a one-unit deviation in mean ability between groups has been used most frequently (e.g., Finch, 2005; Narayanan & Swaminathan, 1996; Oort, 1998).

As for the magnitude of DIF, sizes of 0, 0.3, 0.5, and 0.7 were used to represent no, small, medium, and large DIF for the studied item specifically.5 The number of DIF items was manipulated to reflect the situation where multiple DIF items can exist in a test. Four (10%), 8 (20%), and 12 (30%) out of 40 items were selected to represent different numbers of DIF items. The DIF sizes for these DIF items (other than the studied item) were randomly selected from the following values: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7. The selection of these conditions was within the range of those manipulated in the existing literature (e.g., Finch, 2005; Woods, 2009).

Another important factor to evaluate is the level of the item parameters (i.e., the difficulty parameter in the Rasch model). Studies have investigated the effect of the levels of item parameters on DIF detection, especially when DIF detection relied on the accuracy of item parameter estimation (e.g., Donoghue & Allen, 1993; Rogers & Swaminathan, 1993). In the current study, b = −1.5, b = 0, and b = 1.5 were manipulated to represent low, medium, and high difficulty levels, respectively.

In the MIMIC model, the scale was defined by fixing the factor variance to 1. The factor was regressed on the grouping variable, and all items were regressed on the factor. Factor loadings were freely estimated for each item. All items except for the prespecified DIF-free item were regressed on the grouping variable to alleviate the effect of model misspecification caused by assuming DIF-present items to be DIF-free (Oort, 1998). The first item was the studied item. The last item was designated as the anchor item, which also served to identify the model. The MIMIC model estimated 120 parameters (i.e., 40 τ, 40 λ, 39 β, and 1 γ). Mplus 6.1 (Muthén & Muthén, 2010) was used to obtain the parameter estimates from which MIMIC-ES was computed for each item based on Equation (9), using the MML estimator. To avoid the confounding effect of parameter estimation methods discussed in the previous section, Mplus 6.1 was also used to obtain the parameter estimates from which Rasch-ES was computed for each item based on Equation (4), using the MML estimator. In the Rasch model, the difficulty parameters of all items except the studied item were constrained to be equal between groups.6 The Rasch model estimated 44 parameters (i.e., 39 b constrained equal across groups, 1 b for the reference group, 1 b for the focal group, 1 discrimination parameter, 1 latent mean ability estimate, and 1 latent group mean estimate).7

MIMIC-ES and Rasch-ES were evaluated via bias and mean square error (MSE):

$$\mathrm{Bias}(\widehat{ES}) = E(\widehat{ES}) - ES, \qquad (10)$$

$$\mathrm{MSE}(\widehat{ES}) = \left[\mathrm{Bias}(\widehat{ES})\right]^2 + \mathrm{Var}(\widehat{ES}), \qquad (11)$$

where $\widehat{ES}$ is the estimated MIMIC-ES or Rasch-ES, and ES is the true DIF size. Mean bias and MSE for each condition were the average bias and MSE across replications. Mean bias smaller than 0.05 was considered negligible.8 The accumulated bias and MSE across conditions were used to conduct a four-way analysis of variance (ANOVA). Main effects, all orders of interaction effects, and effect sizes (i.e., partial eta-squared, η̂p²) were investigated to evaluate the performance of MIMIC-ES. Two additional full-factorial ANOVAs were conducted on the accumulated bias and MSE of both approaches to compare the performance of MIMIC-ES and Rasch-ES. Significant main effects and interaction terms with η̂p² greater than .001 were reported.
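To convey the flavor of the design, the sketch below generates one replication of Rasch data with uniform DIF on the studied item and implements Equations (10) and (11). It is a simplified stand-in (a single DIF item and a hypothetical seed), not the IRT-Lab/Mplus pipeline actually used:

```python
import numpy as np

rng = np.random.default_rng(2012)

def simulate_group(n, b, theta_mean):
    """Dichotomous Rasch responses for one group, theta ~ N(theta_mean, 1)."""
    theta = rng.normal(theta_mean, 1.0, size=n)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def one_replication(dif_size, n_foc=50, n_ref=200, ability_diff=-1.0):
    """One data set: 40 items, uniform DIF of size dif_size on item 0."""
    b = rng.uniform(-1.5, 2.4, size=40)  # difficulty range used in the study
    b_foc = b.copy()
    b_foc[0] += dif_size                 # focal group finds the studied item harder
    return simulate_group(n_ref, b, 0.0), simulate_group(n_foc, b_foc, ability_diff)

def bias_and_mse(estimates, true_es):
    """Equations (10) and (11) over the replications of one condition."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_es
    return bias, bias**2 + est.var()

y_ref, y_foc = one_replication(dif_size=0.5)
print(y_ref.shape, y_foc.shape)  # (200, 40), (50, 40)
```

In the actual study, the estimates passed to a function like bias_and_mse would be the MIMIC-ES or Rasch-ES values obtained by fitting each replication in Mplus 6.1.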

Results

Convergence

For MIMIC-ES and Rasch-ES estimation, especially with small sample size, some replications converged only after the program fixed a parameter estimate to a constant in order to avoid a singular weight matrix (e.g., a maximum of approximately 18% of the replications converged with a fixed parameter estimate in any particular cell). The singularity was caused by an item with an extreme cell probability. In all such cases, the problematic item was the most difficult one, and the proportion of correct responses was around 5%. These flagged replications were removed before computing the mean bias and MSE for each condition, to avoid artifacts caused by the flagged replications (Bandalos, 2006).

Bias

When there was no mean ability difference, the pattern was clear, as shown in Table 1: mean bias of MIMIC-ES was smaller than the cutoff value across almost all levels of the manipulated factors except under the conditions of small sample size, high b, and medium to large DIF sizes. MIMIC-ES and Rasch-ES performed equivalently when the number of DIF items was less than 20%. Once the number of DIF items reached 30%, Rasch-ES generally produced larger mean bias, which was expected because in the Rasch model the matching variable (i.e., the total score) can be contaminated by DIF-present items (Penny & Johnson, 2010). MIMIC-ES was not affected by a large number of DIF items because the simultaneous approach was implemented, in which all items (except the anchor item) were regressed on the grouping variable to test for DIF.

When there was mean ability difference between groups, as shown in Table 2, sample size had the greatest effect on mean bias of MIMIC-ES. Specifically, the main effects of mean ability difference (p < .01, η̂p² = .008) and sample size (p < .01, η̂p² = .006) were significant.9 With increasing sample size in both groups, mean bias was reduced noticeably, especially when sample size increased from small to medium. When sample size was small, mean bias was considerably greater than 0.05 across all levels of other factors. With medium sample size, mean bias under the condition of high b was larger as compared with the conditions with easier items; mean bias in these cells, however, was generally smaller than mean bias under the small sample size conditions. When sample size was large, under both low and medium b conditions, mean bias was negligible across levels of number of DIF items and DIF sizes. In summary, MIMIC-ES became increasingly imprecise as sample size decreased.

The comparison between MIMIC-ES and Rasch-ES under the condition of mean ability difference provided evidence that Rasch-ES outperformed MIMIC-ES under certain conditions (i.e., MIMIC-ES differed significantly from Rasch-ES, p < .01, η̂p² = .008). MIMIC-ES exhibited larger mean bias than Rasch-ES when sample size was small. When sample size was medium to large, mean bias across levels of number of DIF items and DIF size was below or around the cutoff value for both approaches with relatively easy items. MIMIC-ES exhibited larger bias than Rasch-ES with difficult items. The three-way interaction of mean ability difference and sample size with type of ES estimator (i.e., MIMIC-ES vs. Rasch-ES) was significant (p < .01, η̂p² = .001), which confirmed the differing performance of MIMIC-ES versus Rasch-ES under different levels of mean ability difference and sample size.


Table 1. Mean Bias for MIMIC-ES and Rasch-ES Under the Condition of No Mean Ability Difference.

| Number of DIF Items | Sample Size (NF:NR) | Difficulty | Null DIF | Small DIF | Medium DIF | Large DIF |
|---|---|---|---|---|---|---|
| 4 (10%) | 50:200 | Low b | −0.050 (−0.046) | −0.022 (−0.034) | 0.013 (−0.005) | 0.010 (−0.027) |
| | | Medium b | 0.017 (−0.002) | 0.017 (0.004) | 0.014 (−0.014) | 0.019 (−0.009) |
| | | High b | 0.012 (0.001) | 0.050 (0.024) | **0.102** (0.040) | **0.084** (0.047) |
| | 100:400 | Low b | −0.014 (−0.020) | 0.004 (−0.004) | −0.033 (−0.040) | 0.005 (−0.025) |
| | | Medium b | −0.004 (−0.010) | 0.014 (−0.002) | 0.000 (−0.002) | −0.006 (−0.011) |
| | | High b | −0.009 (−0.004) | −0.024 (−0.024) | **0.053** (0.021) | 0.010 (−0.005) |
| | 250:1,000 | Low b | 0.003 (−0.014) | 0.011 (−0.006) | 0.010 (−0.012) | 0.008 (−0.014) |
| | | Medium b | 0.015 (−0.007) | 0.000 (−0.016) | 0.011 (−0.018) | −0.001 (−0.021) |
| | | High b | 0.005 (−0.009) | 0.017 (−0.006) | 0.025 (−0.005) | 0.033 (−0.001) |
| 8 (20%) | 50:200 | Low b | **−0.053** (**−0.077**) | −0.024 (−0.046) | 0.013 (−0.017) | 0.009 (−0.040) |
| | | Medium b | 0.013 (−0.036) | 0.017 (−0.011) | 0.015 (−0.030) | 0.020 (−0.024) |
| | | High b | 0.010 (−0.036) | 0.050 (0.006) | **0.103** (0.022) | **0.085** (0.029) |
| | 100:400 | Low b | −0.017 (**−0.051**) | 0.003 (−0.016) | −0.032 (**−0.052**) | 0.002 (−0.038) |
| | | Medium b | −0.005 (−0.044) | 0.013 (−0.017) | 0.000 (−0.018) | −0.009 (−0.027) |
| | | High b | −0.011 (−0.042) | −0.026 (−0.042) | **0.052** (0.002) | 0.007 (−0.023) |
| | 250:1,000 | Low b | 0.005 (−0.045) | 0.012 (−0.017) | 0.010 (−0.024) | 0.000 (−0.038) |
| | | Medium b | 0.017 (−0.041) | 0.001 (−0.031) | 0.016 (−0.027) | 0.017 (−0.026) |
| | | High b | 0.008 (−0.046) | 0.018 (−0.024) | 0.029 (−0.016) | 0.031 (−0.015) |
| 12 (30%) | 50:200 | Low b | −0.018 (**−0.076**) | −0.023 (**−0.090**) | −0.028 (**−0.077**) | 0.004 (**−0.090**) |
| | | Medium b | −0.031 (**−0.077**) | −0.004 (**−0.067**) | 0.006 (**−0.063**) | 0.015 (**−0.064**) |
| | | High b | 0.021 (−0.046) | **0.071** (−0.030) | **0.068** (−0.023) | **0.083** (−0.005) |
| | 100:400 | Low b | −0.038 (**−0.074**) | −0.016 (**−0.089**) | −0.006 (**−0.074**) | −0.027 (**−0.075**) |
| | | Medium b | −0.014 (**−0.060**) | −0.001 (**−0.058**) | −0.011 (**−0.076**) | −0.002 (**−0.059**) |
| | | High b | 0.000 (**−0.055**) | −0.030 (**−0.080**) | 0.022 (−0.048) | 0.002 (**−0.066**) |
| | 250:1,000 | Low b | 0.005 (**−0.064**) | 0.008 (**−0.069**) | 0.008 (**−0.068**) | 0.007 (**−0.074**) |
| | | Medium b | 0.000 (**−0.063**) | 0.010 (**−0.070**) | 0.008 (**−0.070**) | 0.010 (**−0.070**) |
| | | High b | −0.003 (**−0.062**) | 0.024 (**−0.062**) | 0.016 (**−0.069**) | 0.045 (**−0.053**) |

Note. The first number in each cell is the mean bias for MIMIC-ES; the number in parentheses is the mean bias for Rasch-ES. Values greater than the cutoff of 0.05 in absolute value are bolded.


Table 2. Mean Bias for MIMIC-ES and Rasch-ES Under the Condition of Mean Ability Difference.

| Number of DIF Items | Sample Size (NF:NR) | Difficulty | Null DIF | Small DIF | Medium DIF | Large DIF |
|---|---|---|---|---|---|---|
| 4 (10%) | 50:200 | Low b | **0.121** (−0.020) | **0.128** (−0.018) | **0.146** (0.008) | **0.158** (−0.016) |
| | | Medium b | **0.159** (0.018) | **0.149** (0.019) | **0.189** (0.007) | **0.157** (0.013) |
| | | High b | **0.204** (0.046) | **0.268** (**0.073**) | **0.316** (**0.098**) | **0.210** (0.028) |
| | 100:400 | Low b | 0.014 (−0.019) | 0.023 (−0.003) | 0.008 (−0.025) | 0.044 (−0.017) |
| | | Medium b | 0.026 (−0.002) | 0.037 (0.006) | **0.061** (0.019) | 0.046 (0.003) |
| | | High b | **0.067** (0.018) | **0.079** (0.027) | **0.114** (0.045) | **0.096** (0.050) |
| | 250:1,000 | Low b | 0.023 (−0.013) | 0.032 (−0.007) | 0.038 (−0.004) | 0.028 (−0.016) |
| | | Medium b | 0.040 (−0.006) | 0.029 (−0.005) | 0.027 (−0.006) | 0.029 (−0.006) |
| | | High b | 0.029 (0.002) | 0.039 (−0.005) | **0.051** (0.017) | **0.059** (0.011) |
| 8 (20%) | 50:200 | Low b | **0.143** (**−0.057**) | **0.139** (−0.020) | **0.172** (−0.011) | **0.139** (−0.013) |
| | | Medium b | **0.130** (−0.016) | **0.182** (0.006) | **0.155** (−0.008) | **0.187** (0.031) |
| | | High b | **0.225** (0.010) | **0.210** (0.035) | **0.215** (0.035) | **0.266** (0.024) |
| | 100:400 | Low b | 0.025 (−0.030) | 0.011 (−0.031) | 0.047 (−0.016) | 0.016 (−0.027) |
| | | Medium b | 0.031 (−0.024) | 0.044 (0.003) | 0.048 (−0.020) | **0.051** (0.001) |
| | | High b | **0.051** (−0.014) | **0.081** (0.018) | **0.110** (0.048) | **0.107** (0.039) |
| | 250:1,000 | Low b | 0.040 (−0.029) | 0.016 (−0.025) | 0.025 (−0.017) | 0.029 (−0.012) |
| | | Medium b | 0.022 (−0.039) | 0.026 (−0.020) | 0.043 (−0.006) | 0.029 (−0.009) |
| | | High b | **0.064** (−0.013) | **0.051** (−0.008) | **0.056** (−0.004) | **0.075** (0.019) |
| 12 (30%) | 50:200 | Low b | **0.120** (**−0.063**) | **0.123** (**−0.074**) | **0.145** (−0.047) | **0.159** (**−0.071**) |
| | | Medium b | **0.159** (−0.026) | **0.150** (−0.036) | **0.192** (−0.048) | **0.160** (−0.041) |
| | | High b | **0.205** (0.002) | **0.269** (0.018) | **0.313** (0.043) | **0.214** (−0.027) |
| | 100:400 | Low b | 0.015 (**−0.062**) | 0.022 (**−0.058**) | 0.007 (**−0.080**) | 0.043 (**−0.072**) |
| | | Medium b | 0.026 (−0.046) | 0.038 (−0.049) | **0.061** (−0.035) | 0.046 (**−0.052**) |
| | | High b | **0.067** (−0.027) | **0.078** (−0.028) | **0.114** (−0.010) | **0.094** (−0.005) |
| | 250:1,000 | Low b | 0.025 (**−0.055**) | 0.032 (**−0.062**) | 0.040 (**−0.059**) | 0.028 (**−0.070**) |
| | | Medium b | 0.041 (−0.049) | 0.037 (**−0.060**) | 0.032 (**−0.061**) | 0.034 (**−0.061**) |
| | | High b | 0.033 (−0.042) | 0.042 (**−0.060**) | **0.055** (−0.038) | **0.053** (−0.035) |

Note. The first number in each cell is the mean bias for MIMIC-ES; the number in parentheses is the mean bias for Rasch-ES. Values greater than the cutoff of 0.05 in absolute value are bolded.


Figure 1. Significant interaction with meaningful effect: sample size, mean ability difference, and difficulty levels.

MSE

Under the condition of no mean ability difference, mean MSE of MIMIC-ES was larger than that of Rasch-ES across all levels of other factors, as shown in Figures 1, 2, and 3, where mean MSE of MIMIC-ES and Rasch-ES across levels of sample size is compared between different levels of b, number of DIF items, and DIF sizes, respectively, categorized by levels of mean ability difference. Significant interactions with meaningful effects are presented graphically. For both approaches, mean MSE decreased as sample size increased; mean MSE was highest when items were difficult under small sample size conditions; DIF size and number of DIF items showed no meaningful effect on the change of mean MSE across levels of sample size. In summary, MIMIC-ES was less stable than Rasch-ES under the condition of no mean ability difference.


Figure 2. Significant interactions with meaningful effect: sample size, mean ability difference, and number of DIF items.

When there was mean ability difference between groups, the decrease in mean MSE of MIMIC-ES caused by increased sample size was more noticeable than the change under the condition of no mean ability difference (i.e., the main effects of mean ability difference [p < .01, η̂p² = .045] and sample size [p < .01, η̂p² = .216] were significant). The largest mean MSE (i.e., greater than 1) occurred under the small sample size and high b conditions. When sample size increased to medium and large, mean MSE was reduced dramatically, especially for the high b conditions. Mean MSE increased gradually as DIF size increased; the magnitude of change, however, was not as large as that for sample size and mean ability difference (p < .01, η̂p² = .005). Number of DIF items showed no meaningful effect on the change of mean MSE across levels of sample size. MIMIC-ES, therefore, became increasingly stable as sample size increased under the condition of mean ability difference, which was confirmed by the significant two-way interaction effect between mean ability difference and sample size (p < .01, η̂p² = .02).

The comparison between MIMIC-ES and Rasch-ES indicated that Rasch-ES outperformed MIMIC-ES with respect to stability across all levels of the manipulated factors. Mean MSE for both approaches became more stable under the condition of mean ability difference as sample size increased.


Figure 3. Significant interactions with meaningful effect: sample size, mean ability difference, and DIF sizes.

As depicted in the figures, with medium and large sample size, mean MSE was below 0.5 for both approaches except for the high b condition, although mean MSE of Rasch-ES was consistently smaller than that of MIMIC-ES (i.e., MIMIC-ES differed significantly from Rasch-ES, p < .01, η̂p² = .06). Mean MSE of MIMIC-ES decreased dramatically as sample size increased across levels of DIF size and number of DIF items. The two-way interactions of mean ability difference (p < .01, η̂p² = .008) and sample size (p < .01, η̂p² = .021) with type of ES estimator (i.e., MIMIC-ES vs. Rasch-ES) were significant, which confirmed the differing (and poorer) performance of MIMIC-ES versus Rasch-ES under different levels of mean ability difference and sample size. In addition, the significant three-way interaction between type of ES estimator, mean ability difference, and sample size (p < .01, η̂p² = .004) indicated that the extent to which Rasch-ES outperformed MIMIC-ES differed under each combined condition of mean ability difference and sample size.

A Post Hoc Follow-Up Study

As described in the previous section, MIMIC-ES became imprecise and unstable when sample size was small, there was mean ability difference between groups, and the studied item was difficult. As shown in Equation (9), MIMIC-ES has two components: the direct effect of the grouping variable on an item (β) and the factor loading of that same item (λ). It was observed that factor loadings with small sample size were considerably smaller than those with medium or large sample size, which may account for the imprecision. The standard errors of the factor loadings were larger with small sample size than with medium or large sample size, which may account for the instability.

A follow-up study was conducted under the condition of small sample size and the presence of mean ability difference for difficult studied items, to investigate the number of parameters estimated as a potential reason for the relatively large mean bias and MSE of MIMIC-ES. Because data were generated from the Rasch model using IRT-Lab under the IRT parameterization, discrimination parameters were fixed to 1.00 for all items. To control for the effect of underestimated factor loadings and increased standard errors, MIMIC-ES was computed with the factor loading fixed to 1.00 for the studied item, consistent with the discrimination constraint under the Rasch model. For MIMIC-ES, mean bias was reduced by at least 20% and up to 41% as compared with the previously computed mean bias; mean MSE was reduced by at least 25% and up to 43% as compared with the previously computed MSE. Despite improvement in both mean bias and MSE for MIMIC-ES, however, both sets of values were still greater than the corresponding values from Rasch-ES under the same condition. Large mean bias and MSE were attributable to both direct effects (β) and factor loadings (λ) under the condition of small sample size, mean ability difference, and high b.

Discussion

The current study used the Monte Carlo method to investigate the performance of MIMIC-ES and showed that MIMIC-ES can be an unbiased ES estimator of uniform DIF under most conditions studied. The ES estimator is proposed to assist significance tests in evaluating the magnitude of uniform DIF. Based on the results, MIMIC-ES can be expected to estimate DIF size consistently when there is no mean ability difference between groups, across levels of sample size, DIF contamination, and DIF size. When there is mean ability difference, MIMIC-ES may be imprecise and unstable when sample size is small and items are difficult. Once the focal group sample size exceeds 100, MIMIC-ES appears to perform well (even when there is mean ability difference), which is consistent with previous studies with respect to Type I error rate and power (e.g., Woods, 2009).

The implementation of MIMIC-ES has several potential benefits. First, MIMIC-ES can be implemented in large-scale analysis, where there is generally high power to detect negligible, though nonzero, DIF. Second, the lack of ES estimates within the SEM framework for investigating factorial invariance has been recognized as an important area to address in future research (Millsap, 2005). MIMIC-ES provides an illustration for future studies on ES estimates (within the SEM framework) by incorporating methods from other frameworks (e.g., IRT) based on established parametric equivalence. Last, unlike the Rasch model, which addresses only the measurement part, the MIMIC model provides simultaneous examination of both the measurement and structural parts. The efficiency of the MIMIC approach can be especially valuable because some other DIF methods generally employ an iterative procedure for the measurement model (e.g., multiple-group CFA; Muthén & Asparouhov, 2002; Reise, Widaman, & Pugh, 1993).

The performance of an established Rasch-ES also was examined to serve as a baseline for comparison with MIMIC-ES. MIMIC-ES and Rasch-ES performed equally well under most conditions (e.g., medium to large sample size across conditions). However, Rasch-ES outperformed MIMIC-ES under the condition of small sample size. The superior performance of Rasch-ES in this case could be expected because a simpler model tends to fit limited data better (Bollen, 1989). For example, in the current study, 76 additional parameters were estimated under the MIMIC model in each condition (120 parameters were estimated from the MIMIC model; 44 parameters were estimated from the Rasch model). Under small sample size conditions, Rasch-ES turned out to be more precise and stable than MIMIC-ES because fewer parameters were estimated. MIMIC-ES outperformed Rasch-ES once the number of DIF-present items reached 30% of the test under most conditions. The superior performance of MIMIC-ES under such conditions was attributed to the fact that the simultaneous procedure was implemented, in which the direct effects of all items (except the anchor item) were examined simultaneously. In the follow-up study, after fixing the factor loadings, the reduced number of parameters estimated from the MIMIC model resulted in reduced mean bias and MSE under conditions of small sample size. Researchers, therefore, should use caution when MIMIC-ES is used together with hypothesis testing for uniform DIF detection when sample size is small.

The superior performance of Rasch-ES over MIMIC-ES under certain conditions is not widely generalizable due to limitations of the current study. First, because the data were generated under the Rasch model, it was not a surprise that Rasch-ES performed better than MIMIC-ES under some of the simulated conditions. Second, the better performance of Rasch-ES may be explained by the test length generated in the current study (i.e., 40 items). Rasch parameter estimates have been shown to be quite accurate when tests contain at least 40 items and less accurate as test length decreases (Lord, 1986; van de Vijver, 1986; Wang & Chen, 2005). MIMIC-ES under MML, on the other hand, might perform consistently with shorter tests. Simulation studies have provided evidence that at least four items per latent construct, with a sample size greater than or equal to 100, was adequate for relatively accurate parameter estimates (e.g., Marsh, Hau, Balla, & Grayson, 1998). Third, the current study generated the response data based on a unit normal distribution of the latent ability. Studies have shown that DIF analysis can be affected by a nonnormal distribution of the latent ability. Woods (2011) examined the robustness of several DIF methods against the violation of normality of the latent ability. In Woods's study, the model-based IRT likelihood ratio test performed promisingly. The superior performance of the IRT likelihood ratio test can shed light on further research on the model-based MIMIC model with respect to both significance tests and ES estimates.

To ensure the future implementation of MIMIC-ES as an effective and comprehensive DIF method, future studies should focus on four aspects. First, the current study focused on the investigation of bias for MIMIC-ES and was not directly combined with hypothesis testing for DIF detection. The examination of the MIMIC approach with respect to both hypothesis testing and ES estimates can provide more solid evidence for the MIMIC approach being an established DIF method. Second, uniform DIF was investigated exclusively due to the formulation of the MIMIC model. Attempts have been made to detect nonuniform DIF (Woods & Grimm, 2011) with respect to hypothesis testing. The derivation of a nonuniform DIF ES estimator under the MIMIC approach would also advance the literature. Third, although it is relatively common that data are generated from one framework when two frameworks are being compared (e.g., Reise, Widaman, & Pugh, 1993; Woods, 2009), results can be biased toward the framework from which the data are generated. In the current study, the data were generated under the Rasch model, and therefore results may be biased toward Rasch-ES. Future research that investigates the effect of the data generation model, together with the factors studied in the current article, on the performance of Rasch-ES and MIMIC-ES may be useful. Last but not least, MIMIC-ES estimates for polytomous items and under other types of IRT models (e.g., 2PL) are possible. For example, the common odds ratio approach for polytomous items (Penfield, 2007; Penfield & Algina, 2003) and the signed area index for 2PL IRT models (Raju, 1988) could be used to extend the findings of the current study when item discrimination varies.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Notes

1. More detailed descriptions of both types of DIF can be found in the literature (Camilli & Shepard, 1994; Osterlind & Everson, 2009).
2. Although revised hypothesis testing such as the range-null hypothesis (Wells, Cohen, & Patton, 2009), which incorporates a tolerable DIF size within the hypothesis, will alleviate the risk, an effect size estimator directly estimating DIF size would reduce the risk efficiently.
3. In the standard parameterization of the MIMIC model, the DIF ES is estimated as the difference of thresholds between groups, and item intercepts are fixed to 0 (Millsap, 2005).


4. A brief introduction to Rasch software packages can be found in Bond and Fox (2001).
5. In the current study, all items except the anchor item (i.e., the last item) were regressed on the grouping variable. The first item, however, was designated as the studied item because bias and MSE were computed from this item exclusively.
6. The formulation of the Rasch model in Mplus 6.1 is the closest form to that in the widely used Rasch software (J. M. Linacre, personal communication, June 3, 2012).
7. In Mplus 6.1, the factor loadings (equal across all items) and item thresholds were estimated and converted to the IRT parameterization (Muthén & Asparouhov, 2002).
8. The cutoff value of 0.05 was selected based on the established criterion (Muthén & Muthén, 2002) determined by percentage bias (i.e., Percentage bias(θ̂) = [(E(θ̂) − θ)/θ] × 100%). Percentage bias lower than 10% was considered negligible. This criterion was transformed to approximately 0.05 on the nonpercentage scale according to the levels of DIF sizes manipulated in the current study.
9. The relatively small η̂p² values from the ANOVA results were consistent with previous studies (e.g., Finch, 2005).

References

Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42, 1-7.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques. New York, NY: Marcel Dekker.
Bandalos, D. L. (2006). The use of Monte Carlo studies in structural equation modeling research. In R. C. Serlin (Series Ed.) & G. R. Hancock & R. O. Mueller (Vol. Eds.), Structural equation modeling: A second course (pp. 385-462). Greenwich, CT: Information Age.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
De Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford.
Donoghue, J. R., & Allen, N. L. (1993). Thin versus thick matching in the Mantel-Haenszel procedure for detecting DIF. Journal of Educational Statistics, 18, 131-154.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29, 278-295.
Hanson, B. A. (1998). Uniform DIF and DIF defined by differences in item response functions. Journal of Educational and Behavioral Statistics, 23, 244-253.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Erlbaum.


Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349.
Kim, S.-H., Cohen, A. S., Alagoz, C., & Kim, S. (2007). DIF detection and effect size measures for polytomously scored items. Journal of Educational Measurement, 44, 93-116.
Linacre, J. M. (2011). Winsteps (Version 3.72.0) [Computer software]. Beaverton, OR: Winsteps.com. Retrieved from http://www.winsteps.com/
Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157-162.
Marsh, H. W., Hau, K. T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral Research, 33, 181-220.
Millsap, R. E. (2005). Four unresolved problems in studies of factorial invariance. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics (pp. 155-173). Mahwah, NJ: Erlbaum.
Millsap, R. E., & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias. Applied Psychological Measurement, 16, 389-402.
Muthén, B. O. (1989). Dichotomous factor analysis of symptom data. In M. Eaton & G. W. Bohrnstedt (Eds.), Latent variable models for dichotomous outcomes: Analysis of data from the Epidemiological Catchment Area Program [Special issue]. Sociological Methods & Research, 18, 19-65.
Muthén, B. O. (1993). Goodness of fit with categorical and other non-normal variables. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 205-243). Newbury Park, CA: Sage.
Muthén, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus. Retrieved from http://statmodel2.com/download/webnotes/CatMGLong.pdf
Muthén, B. O., Kao, C.-F., & Burstein, L. (1991). Instructional sensitivity in mathematics achievement test items: Applications of a new IRT-based detection technique. Journal of Educational Measurement, 28, 1-22.
Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9, 599-620.
Muthén, L. K., & Muthén, B. O. (2010). Mplus: Statistical analysis with latent variables (Version 6.1) [Computer software]. Los Angeles, CA: Author.
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20, 257-274.
Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 5, 107-124.
Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning. Thousand Oaks, CA: Sage.
Penny, J., & Johnson, R. L. (2010). How group difference in matching criterion distribution and IRT item difficulty can influence the magnitude of the Mantel-Haenszel chi-square DIF index. Journal of Experimental Education, 67, 343-366.
Penfield, R. D. (2003). IRT-Lab: Software for research and pedagogy in item response theory. Applied Psychological Measurement, 27, 301-302.
Penfield, R. D. (2007). Assessing differential step functioning in polytomous items using a common odds ratio estimator. Journal of Educational Measurement, 44, 187-210.


Penfield, R. D., & Algina, J. (2003). Applying the Liu-Agresti estimator of the cumulative common odds ratio to DIF detection in polytomous items. Journal of Educational Measurement, 40, 353-370.
Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. Handbook of Statistics, 26, 125-167.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Rogers, H., & Swaminathan, H. (1993). A comparison of the logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Erlbaum.
van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57.
Wang, W.-C., & Chen, C.-T. (2005). Item parameter recovery, standard error estimates, and fit statistics of the Winsteps program for the family of Rasch models. Educational and Psychological Measurement, 65, 376-404.
Wang, W., & Shih, C.-L. (2010). MIMIC methods for assessing differential item functioning in polytomous items. Applied Psychological Measurement, 34, 166-180.
Wang, W., Shih, C.-L., & Yang, C.-C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69, 713-731.
Wells, C. S., Cohen, A. S., & Patton, J. (2009). A range-null hypothesis approach for testing DIF under the Rasch model. International Journal of Testing, 9, 310-332.
Woods, C. M. (2009). Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44(1), 1-27.
Woods, C. M. (2011). DIF testing for ordinal items with Poly-SIBTEST, the Mantel and GMH tests, and IRT-LR-DIF when the latent distribution is nonnormal for both groups. Applied Psychological Measurement, 35, 145-164.
Woods, C. M., & Grimm, K. J. (2011). Testing for nonuniform differential item functioning with multiple indicator multiple cause models. Applied Psychological Measurement, 35, 339-361.
