Adv in Health Sci Educ DOI 10.1007/s10459-010-9233-8

Using multivariate generalizability theory to assess the effect of content stratification on the reliability of a performance assessment

Lisa A. Keller • Brian E. Clauser • David B. Swanson

Received: 7 December 2009 / Accepted: 26 April 2010
© Springer Science+Business Media B.V. 2010

L. A. Keller · B. E. Clauser · D. B. Swanson
University of Massachusetts Amherst, Amherst, MA, USA
e-mail: [email protected]

Abstract In recent years, demand for performance assessments has continued to grow. However, performance assessments are notorious for low reliability, and in particular for low reliability resulting from task specificity. Since reliability analyses typically treat the performance tasks as randomly sampled from an infinite universe of tasks, these estimates of reliability may not be accurate. For tests built according to a table of specifications, tasks are randomly sampled from different strata (content domains, skill areas, etc.). If these strata remain fixed in the test construction process, ignoring this stratification in the reliability analysis results in an underestimate of "parallel forms" reliability and an overestimate of the person-by-task component. This research explores the effect of representing, or misrepresenting, the stratification in the estimation of reliability and the standard error of measurement. Both multivariate and univariate generalizability studies are reported. Results indicate that the proper specification of the analytic design is essential in yielding the proper information both about the generalizability of the assessment and about the standard error of measurement. Further, illustrative D studies present the effect under a variety of situations and test designs. Additional benefits of multivariate generalizability theory in test design and evaluation are also discussed.

Keywords  Generalizability theory · Performance assessment · Reliability

Introduction

During the last two decades, there has been a growing demand for more "authentic" assessment. That is, test users would like assessments to be more closely aligned to the specific skills of interest than is the case with traditional multiple-choice tests. With the growth in technology and computer-based testing, new and innovative testing formats, such as computer simulations, are becoming increasingly attractive. In the area of licensure, these simulation tasks can mimic tasks that would be encountered "in the field," making them
especially desirable, as successful completion should provide greater assurance that candidates are well prepared. However, these more authentic assessments come at a cost. Historically, a major concern about performance assessments has been low generalizability, and, in fact, this was the primary reason for replacing them with multiple-choice items during the twentieth century (Hartog and Rhodes 1936; Stalnaker 1951; Coffman 1971). Many recent studies investigating the generalizability of current performance assessments, including computerized simulation tasks, indicate that generalizability is still an issue (Linn and Burton 1994; Swanson et al. 1995; Clauser et al. 1999; Shavelson et al. 1999; Clauser et al. 2006; Margolis et al. 2006).

In a licensing environment, where tests are high stakes, score reliability is of the utmost importance. Because important decisions are made based on test scores, it is essential that scores are not heavily influenced by the particular sample of items an examinee sees or the particular raters who score the examinee. In a test comprising multiple-choice items, it is possible to administer many items to each examinee, potentially allowing for enough items to produce a score with acceptable reliability. Performance tasks typically take longer to complete and are more expensive to score; consequently, examinees complete fewer items and the associated scores are generally less reliable.

Reliability is a critical issue with these assessments, but the complexity of performance assessment formats may raise challenges for practitioners attempting to evaluate the reliability of the resulting scores. When measurement error stems from multiple sources, classical test theory may be inadequate. Generalizability theory represents an appropriate alternative; it provides the conceptual and analytic tools for partitioning variation in total scores into components associated with examinees, items, raters, and other facets of the test delivery design (Brennan 2001; Cronbach et al. 1972). By determining which sources contribute most to measurement error, steps can be taken in test design to reduce those sources of error and improve the precision of scores on the assessment.

In generalizability theory, the specification of the design (e.g., person by task (p × t), or person by task nested within strata (p × (t:s))) forces the analyst to define the universe of admissible observations, and while the choice does depend upon the specific data collected, many different designs can be applied to the same data set. The choice of the analytic design depends on the nature of the generalizations that are to be made. In the generalizability theory framework, tasks are typically considered to be randomly sampled from a universe of possible tasks. For examinations built according to a table of specifications, the universe of possible tasks could also include a stratification of those tasks. The strata could be considered fixed in the test construction process, or random, depending upon the nature of the test design. Therefore, there are many different possible designs that could be applied to data arising from a test built from a table of specifications: person by task (p × t; the tasks are not stratified), person by task nested within strata (p × (t:s)) where the strata are considered a random facet, or p × (t:s) where the strata are considered a fixed facet.
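As a sketch of the notation (the paper's analyses also include a rater facet, which is omitted here for brevity), the linear models implied by the two univariate designs are

\[
X_{pt} = \mu + \nu_p + \nu_t + \nu_{pt,e},
\qquad
\sigma^2(X_{pt}) = \sigma^2(p) + \sigma^2(t) + \sigma^2(pt,e)
\]

for the p × t design, and

\[
X_{p(t:s)} = \mu + \nu_p + \nu_s + \nu_{t:s} + \nu_{ps} + \nu_{pt:s,e},
\qquad
\sigma^2(X) = \sigma^2(p) + \sigma^2(s) + \sigma^2(t:s) + \sigma^2(ps) + \sigma^2(pt:s,e)
\]

for the p × (t:s) design, in which the stratum main effect and the person-by-stratum interaction are separated from the task effects.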
The different designs reflect differences in the conceptualization of the universe of admissible observations and the universe of generalization. In the univariate p × t case, the implicit assumption is that the tasks are sampled randomly from a universe of possible tasks and the tasks can be assumed to be interchangeable (literally randomly equivalent). In the case of a test built according to a table of specifications, this assumption is not appropriate; tasks are stratified into different categories specifically because there is a difference between the different types of tasks. The construction of a table of specifications in licensure and certification fields is typically the result of a job analysis where experts in the field agree on the relative importance of the different categories (e.g., Kane 1997; Raymond 2001). The number of
items assigned to each category is then based on the relative importance of those categories. More complex model-based approaches have also been developed to collect data from a practice analysis and construct the table of specifications (e.g., Kane 1997); however, the end result is a determination of the number of items that should be assigned to each category based on the perceived importance of that category. The logic of this methodology is that more of the score points are determined by the more important categories, which is an inherent weighting of the categories in the total score.

The challenge that arises in the current scheme for constructing test specifications is that it fails to take into consideration the variability of items within a content category. While proportional representation may be important, if the variability of items within some content categories is greater than the variability within other content categories, then sampling variability is not equal across the content categories. The result is that the choice of specific items from content categories with larger variability is more important than the choice of specific items from content categories with less variability; the impact of the particular selected items on the resulting test score is greater when variability is higher. Fitzpatrick (2008) demonstrated how sampling variability could affect the inferences based on test scores within educational achievement testing. The same principle applies here. When high-stakes decisions such as credentialing are made, the test specifications should attend not only to the number of items within each category based on perceived importance, but also to the variability of the items within the category. When measurement error varies significantly across content categories, the effective weight (i.e., impact on the test score) may be very different from the nominal weight (i.e., number of items representing the content area) (Brennan 2001; Clauser et al. 2006).

Cronbach (2004) argued that generalizability theory was developed to be a random effects theory; the need for fixed effects was originally a limitation of the theory. This limitation was overcome with the development of the more general multivariate generalizability theory model, in which the particular levels of the fixed facet are modeled as separate dependent variables in a multivariate design. Consistent with the theory's preference for random effects, the proper modeling of a test constructed from a table of specifications is therefore the multivariate design in which the categories included in the table of specifications are the separate dependent variables. In this way, the multivariate design is the theoretically more appropriate design for assessing reliability when the test is constructed based on a table of specifications. Multivariate generalizability theory, however, provides a methodology not only to explore the reliability of the test score when a test is constructed according to a table of specifications, but also to examine the differences and relationships that exist between the different categories defined by the specifications. Although a table of specifications is often employed in test development, it is not always clear that the categories within the table of specifications measure unique skills, and often they do not.
There is typically commonality in the skills measured by each of the categories. If the commonality is large, then the items may all be measuring the same skills, and the creation of specifications may only provide face validity. If, however, there is less commonality in the skills measured by each category, then the existence of the specifications is essential to construct tests that will have the same interpretation across administrations. This information provides important insight into the construct measured by the test and into the relationships of the separate categories to each other.

Little research has been done to examine the practical consequences of specifying the data design in the estimation of the reliability and standard error of measurement for
performance assessments. In one study by Shavelson et al. (1999), taking content stratification into account resulted in lower estimates of reliability. However, in this study, strata were viewed as random, rather than as a fixed facet dictated by test specifications and controlled in the test construction process (Jarjoura and Brennan 1982, 1983). Additionally, while generalizability studies have been performed on data that result from performance assessments, there is little research that examines the issues associated with the assessment of reliability of performance assessments that are built according to a set of test specifications. Further, the variability of items within the different content categories within the test specifications has received even less attention in the literature.

The purpose of this study is threefold: first, to further explore the effect of the analysis design on estimates of variance components, reliability, and the standard error of measurement; second, to investigate the differences and relationships between the different content categories; and third, to illustrate the benefits of using a multivariate generalizability theory framework to inform the construction of the test and/or revision of the table of specifications. By modeling the data in a multivariate framework, greater insight into the sources of measurement error is available than in the more common univariate analysis. The paper also offers a practical example of the use of multivariate generalizability theory to inform test design.

Method

Data

The National Board of Medical Examiners has developed a computer simulation to evaluate physicians' patient management skills. These simulations are used in medical licensure testing (Clyman et al. 1999; FSMB and NBME 1999; Margolis and Clauser 2006). The simulation allows the physician to manage a patient in an essentially unprompted patient-care environment. After being presented with a brief opening scenario, the examinee uses free-text entry to order tests, treatments, consultations, and the like. When an examinee finishes managing a case, the system produces a record referred to as a transaction list. The transaction list can then be rated by experts to determine a score for an examinee on a particular case or scored using a computer-automated algorithm (Margolis and Clauser 2006).

For the data in this study, 179 senior medical students from across the United States each completed the same 16 cases. The transaction lists from these simulations were rated by four primary-care physicians, all of whom were medical school faculty familiar with the system and all of whom had prior experience rating these cases. The transaction lists were rated on a nine-point scale with case-specific anchors developed by the raters prior to beginning the rating process.

Analyses

Although the exam was not built according to a table of specifications, the cases could be classified in a variety of ways, and various stratifications could be imposed on the test. As recommended by Cronbach et al. (1972), cases were classified in several different ways: by medical discipline (OB-GYN, pediatrics, medicine, surgery), by organ system, by urgency, and by a combination of specialty and urgency. These stratifications were based on expert judgments, and not on statistical considerations, since in practice test specifications are based on expert judgments. By imposing the various stratifications of cases, it is possible to
determine whether or not the different types of cases performed differently from one another, and whether or not a table of specifications would be useful for this assessment. If it were discovered that there were differences in the scores for the separate specialties (e.g., pediatrics vs. surgery), the stratification would need to be maintained in future form construction.

To determine the effect of the analytic design on the estimate of the reliability, three generalizability analyses were conducted for each of the stratifications using the programs GENOVA (Brennan 1983), urGENOVA (Brennan 1999a), and mGENOVA (Brennan 1999b): (1) a person × case × rater (p × c × r) random-effects analysis in which the strata are a hidden facet, (2) a person × (case:strata) × rater univariate analysis in which the strata were represented as a fixed facet, and (3) a person × case × rater multivariate analysis in which the scores on the various subtests are the multiple dependent variables, again representing fixed strata. As noted in the introduction, these designs imply different conceptualizations of the nature of the sampling.

The multivariate analysis was also used to investigate the relationships among the different content categories, as specified by the stratification of the cases into various specialty areas. The correlations of the universe scores (i.e., true scores) for the categories were examined to determine whether or not the cases within the different categories represented different skills, and whether the assessment should be designed according to a table of specifications. Additionally, the variance components of the separate categories were compared to determine how different the variability of raters and cases was within each of the categories.

Following the G studies, a series of decision studies, or D studies, was conducted. D studies are typically performed to gain insight into how the precision of test scores would be affected by manipulating the various facets (or factors) of the assessment; in this case they indicate the impact on reliability and SEM if the number of raters or cases were changed. The D study allows the test developer to use the information provided by the G study to project the effects of modifications in test design on reliability before making any changes. To determine which facets should be manipulated in the D study, it is sensible to consider the facets that are responsible for the majority of the error variance in the current test design; in so doing, the effect of reducing that error on the precision of scores can be ascertained. After each of the G studies was conducted, several D studies were conducted to examine the impact of increasing or decreasing the number of cases, both within content categories and overall, on the generalizability coefficients and the relative and absolute standard errors of measurement (SEM).
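To make the G- and D-study computations concrete, the following is a minimal sketch (not the GENOVA/mGENOVA code actually used for the analyses) of how the G coefficient, Phi coefficient, and the relative and absolute SEMs are assembled from the variance components of the fully crossed p × c × r design. The d_study function name is purely illustrative; the values plugged in are the single-observation estimates reported in Table 1 below, and the call assumes the 16 cases and four raters described in the Data section.

```python
# Illustrative D-study computation for a fully crossed p x c x r design.
# This is a sketch of the standard G-theory formulas, not the programs used in the paper.

def d_study(vc, n_c, n_r):
    # Relative error: person interactions, divided by the D-study sample sizes.
    rel_err = vc["pc"] / n_c + vc["pr"] / n_r + vc["pcr,e"] / (n_c * n_r)
    # Absolute error adds the effects that shift all examinees' scores equally.
    abs_err = rel_err + vc["c"] / n_c + vc["r"] / n_r + vc["cr"] / (n_c * n_r)
    g = vc["p"] / (vc["p"] + rel_err)      # generalizability (G) coefficient
    phi = vc["p"] / (vc["p"] + abs_err)    # dependability (Phi) coefficient
    return g, phi, rel_err ** 0.5, abs_err ** 0.5

# Single-observation variance components from Table 1 (strata hidden).
table1 = {"p": 0.520, "c": 0.411, "r": 0.025,
          "pc": 1.826, "pr": 0.003, "cr": 0.086, "pcr,e": 0.628}

# With 16 cases and 4 raters these formulas reproduce the Table 1 coefficients
# (G ~ 0.81, Phi ~ 0.77, relative SEM ~ 0.35, absolute SEM ~ 0.40).
print(d_study(table1, n_c=16, n_r=4))
```

Changing n_c and n_r in the call is exactly the D-study manipulation described above.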

Results

The results from the various methods of stratification were very similar, and so only the results based on medical specialty are presented. The results pertaining to the first purpose, examining the differences in the estimates from the different analytic designs, are presented first.

Estimation of variance components and reliability for different analytic designs

Tables 1, 2 and 3 provide variance components, G coefficients, Phi coefficients, relative SEMs, and absolute SEMs for the three different designs. Because the third design was a multivariate design, variance components are estimated for each of the four strata separately, making a direct comparison to the other designs more difficult. To aid in the comparison across designs, the composite variance components are tabulated in Table 3.
Table 1 Variance components for the P × C × R univariate design (strata hidden)

Source              Single observation (SE)   % Variance
Person (P)          0.520 (0.272)             14.8
Case (C)            0.411 (0.010)             11.7
Rater (R)           0.025 (0.005)              0.7
PC                  1.826 (0.003)             52.2
PR                  0.003 (0.001)              0.1
CR                  0.086 (0.000)              2.4
PCR,E               0.628 (0.000)             18.0
G coefficient       0.807
Phi coefficient     0.767
Relative SEM        0.353
Absolute SEM        0.397

Table 2 Variance components for the P × (T:S) × R univariate design (strata fixed)

Source              Single observation        % Variance
Person (P)          0.522                     14.9
Strata (S)          0.063                      1.7
Case:Strata (C:S)   0.362                     10.2
Rater (R)           0.025                      0.9
PC:S                1.747                     49.7
PR                  0.004                      0.0
SR                  0.000                      0.0
CR:S                0.088                      2.6
PSR                 0.000                      0.0
PCR:S,E             0.629                     17.9
G coefficient       0.812
Phi coefficient     0.804
Relative SEM        0.348
Absolute SEM        0.358

Looking across the results of the three analyses, some comparisons can be made. As in the case of most performance assessments, the person-by-case interaction was the largest source of variability, accounting for approximately 50% of the variance in all cases. This is the problem of task specificity. It is worth noting that the estimate was slightly higher when the strata were ignored in the analysis. As mentioned above, this is due to the person-by-stratum variability, which contributes to the person variance in the case where the stratification is made explicit; by ignoring the stratification, the estimate of the person-by-case variance component was inflated. Although the differences in the estimate of the person-by-stratum component were not large in this situation, for a different assessment this may not be the case. If the person-by-stratum component were larger, the estimates of the person-by-case component would differ more between the designs where the strata were made explicit and the design where the strata were ignored.
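In rough terms, hiding the stratification folds the stratum effects into the case effects, a standard hidden-facet result sketched here in the paper's notation:

\[
\hat\sigma^2_{c(\mathrm{hidden})} \approx \sigma^2(s) + \sigma^2(c:s),
\qquad
\hat\sigma^2_{pc(\mathrm{hidden})} \approx \sigma^2(ps) + \sigma^2(pc:s).
\]

With the Table 1 and Table 2 estimates above, 0.411 ≈ 0.063 + 0.362 for the case effect, and the inflation of the hidden person-by-case component (1.826 versus 1.747 for pc:s) reflects the absorbed person-by-stratum variability.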


Table 3 Composite variance components for the multivariate P × T × R design (strata fixed)

Source              Composite variance component   % Variance
Person (P)          0.525                          15.2
Case:Strata (C:S)   0.424                          12.3
Rater (R)           0.034                           1.0
PC:S                1.752                          50.7
PR                  0.000                           0.0
SR                  0.000                           0.0
CR:S                0.091                           2.6
PSR                 0.000                           0.0
PCR:S,E             0.632                          18.3
G coefficient       0.815
Phi coefficient     0.779
Relative SEM        0.345
Absolute SEM        0.386

Larger differences would be expected to occur if the strata were measuring more distinct skills. The case-within-strata (C:S) components for the univariate and multivariate designs were slightly different: 0.362 (10.4%) for the univariate design and 0.420 (12.2%) for the multivariate design. This is because in the univariate nested design a strata (S) component is also estimated, and in this case it was approximately 0.06 (1.7%), which accounts for the difference in the C:S components. Therefore, the difference between the designs is in the decomposition of error: in the univariate design there are two components, (C:S) and S, while in the multivariate composite there is the (C:S) component alone, which is the sum of the (C:S) and S components within the univariate design. Although the same total amount of error would be estimated in both cases, and the difference between the two designs is merely the decomposition of the error into different components, there are implications for the estimation of the SEM and reliability. Since these sources of error are factored into the absolute error and not the relative error, this difference is reflected in the Phi coefficient and the absolute SEM.

The difference in the computation of the absolute SEM (and hence the Phi coefficient) between the univariate and multivariate designs lies in the weighting of the error components. In the univariate design, the (C:S) component is weighted by the geometric mean of the number of cases per stratum, and the S component is weighted by the number of strata; the implication is that the strata are treated as homogeneous. In the multivariate design, the (C:S) component is a composite component that is determined by weighting the case component (C) for each stratum by the number of cases within that stratum. In so doing, the strata are treated as unique and are weighted individually. It is the degree of commonality between the different strata that determines the magnitude of the effect of treating the strata as the same (as in the univariate design) as opposed to treating the strata as unique (as in the multivariate design). Since there is evidence that the strata exhibited a large degree of commonality, the differences in the absolute SEM and Phi coefficients between the two designs were not substantial, although there are some slight differences that are the result of the slight differences between the strata. If the strata were more distinct, however, larger differences would have been observed.
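The weighting difference can be sketched in general form as follows, with w_v the nominal weight of stratum v (here, the proportion of its cases) and n_{c,v} the number of cases sampled from it; the exact unbalanced-design formulas used by urGENOVA and mGENOVA differ in detail, so this is only an outline:

\[
\sigma^2_C(p) = \sum_{v}\sum_{v'} w_v\, w_{v'}\, \sigma_{p}(v,v'),
\qquad
\text{case contribution to } \sigma^2(\Delta)_{\mathrm{multivariate}} = \sum_{v} \frac{w_v^{2}\, \sigma^2_v(c)}{n_{c,v}},
\]

whereas the univariate nested D study divides a single pooled \(\sigma^2(c:s)\) by an average (here the geometric mean) number of cases per stratum and divides \(\sigma^2(s)\) by the number of strata, treating the strata as homogeneous.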


When the G and phi coefficients (measures of reliability) were compared across the designs, it is clear that for the G coefficient the multivariate design produced the largest coefficient, although the difference among the various designs was small. The corresponding error of measurement, the relative SEM, was also smallest for the multivariate design, although again the differences were small. For the Phi coefficient, however, the largest value was in the univariate nested design, with the correspondingly smaller absolute SEM. The differences were not large in this case either, but were larger than in the case of the G coefficient/relative SEM. The absolute error and phi coefficient were more affected because the components that exhibited the largest differences were the C:S (or C) and S components, which are factored into the absolute error variance and phi coefficient but not into the relative error and G coefficient. In a different application this pattern of results might be reversed. The reason that the differences are small is again that there is likely not a large difference between the various strata.

It should be noted that the rater and person-by-rater components were quite small, accounting for less than 1% of the variance in all cases. Further, the case-by-rater interaction was also very small, accounting for approximately 2% of the variance in all cases. This result is not typical of performance assessments of clinical skills; often considerable variability among raters is observed, as well as variability in the way that raters respond to different prompts and examinee styles (Clauser et al. 2008).

As noted above, the multivariate design allows for the greatest flexibility and the least restrictive assumptions, which would suggest that the results of the multivariate design were more accurate. The results so far have indicated that the strata are quite similar and a large degree of commonality exists. To further investigate the strata, a deeper look into the results of the multivariate design was conducted, as specified in the second purpose of the research study.

Comparisons of the content categories

Table 4 below provides the full results of the multivariate analysis, rather than the composite results. An examination of Table 4 provides insight into how the different strata, and the cases within the different strata, function. The results in this case are more complex and require more discussion. Looking at the person variance, more information is provided: the variance for the scores on each stratum, the covariances of the scores between the various strata, and the correlations between the scores on the various strata. The diagonal elements represent the variances for each of the strata; the covariances are the values below the diagonal, and the correlations are presented above the diagonal. The covariances and correlations provide similar information, although the correlations are more easily interpreted because the scale is well understood.

First, looking at the variances, it is clear that the person variance was largest for stratum two (0.970) and smallest for stratum three (0.335). Since the person variance indicates the degree to which the variability in the scores is due to the examinee, there were greater differences in examinees' performance on the cases that were included in stratum two, as compared to their performance on the cases in stratum three, where the performance of the examinees was much more similar.
Looking at the correlations, however, the scores on stratum two correlate perfectly with the scores on stratum three. Thus, although there is greater variability in the scores on the cases in stratum two relative to the cases in stratum three, the examinees were rank-ordered in the same way regardless of whether the scores from stratum two or three were used. Examination of all of the correlations indicates a high degree of commonality between the scores on the various strata.
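For reference, the correlations shown above the diagonal of the person block in Table 4 (below) are the universe-score (disattenuated) correlations implied by the variance and covariance estimates,

\[
\rho_{vv'} = \frac{\sigma_{p}(v,v')}{\sqrt{\sigma^2_{p}(v)\,\sigma^2_{p}(v')}},
\]

so that, for example, for strata one and two, \(0.504/\sqrt{0.432 \times 0.970} \approx 0.78\), the tabled value.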


Table 4 Variance components for the P × T × R multivariate design (strata fixed). Variances (with standard errors) are on the diagonal and covariances are below the diagonal; for the person block, universe-score correlations are shown above the diagonal.

Source          Stratum 1 (SE)   Stratum 2 (SE)   Stratum 3 (SE)   Stratum 4 (SE)
Person (P)
  Stratum 1     0.432 (0.093)    0.779            0.906            0.748
  Stratum 2     0.504 (0.061)    0.970 (0.202)    1.000            0.814
  Stratum 3     0.345 (0.038)    0.641 (0.064)    0.335 (0.083)    0.852
  Stratum 4     0.392 (0.049)    0.639 (0.076)    0.394 (0.045)    0.637 (0.099)
Task (C)        0.061 (0.012)    0.438 (0.053)    0.238 (0.044)    1.640 (0.678)
Rater (R)
  Stratum 1     0.072 (0.005)
  Stratum 2     0.030 (0.018)    0.010 (0.001)
  Stratum 3     0.036 (0.018)    0.009 (0.004)    0.000 (0.001)
  Stratum 4     0.038 (0.025)    0.011 (0.008)    0.009 (0.004)    0.020 (0.001)
PC              1.159 (0.014)    2.530 (0.028)    1.472 (0.025)    1.787 (0.102)
PR
  Stratum 1     0.014 (0.001)
  Stratum 2     0.013 (0.000)    0.000 (0.001)
  Stratum 3    -0.010 (0.001)    0.000 (0.000)    0.006 (0.001)
  Stratum 4     0.019 (0.001)    0.000 (0.000)   -0.010 (0.000)    0.002 (0.001)
CR              0.130 (0.003)    0.022 (0.001)    0.138 (0.002)    0.040 (0.003)
PCR,E           0.625 (0.001)    0.572 (0.001)    0.736 (0.001)    0.560 (0.004)

This information is unavailable in the univariate analysis, where there is no differentiation across strata.

The greatest variability among the strata was observed in relation to the case (C) component. There was little variability among cases in the first stratum (2%) while there was substantial variability for the fourth stratum (35%). It is appropriate to note that the fourth stratum had only two cases, relative to the four or five cases in the other strata. Again, this information is not provided in the univariate analyses, and it provides information regarding the nature of the cases within each of the strata. In this case, the large amount of variability in stratum four indicates that the particular choice of the two cases would have a greater impact on the resulting score as compared to the choice of cases in the first stratum, where the cases are more homogeneous. The relatively large amount of variability in the cases in stratum four may be due to the small number of cases, or may be due to the definition or understanding of that particular stratum.

Looking across the other components, the one of greatest interest is the person-by-case interaction. The person-by-case interaction accounts for the majority of the variance in the scores regardless of stratum, accounting for between 38 and 56% of the variance. Additionally, it appears that stratum two performed differently than the other three strata, with examinee scores varying more across cases. In contrast, scores varied relatively little across cases in stratum four.

For the rater and person-by-rater components, the variances of the strata and the covariances between the strata are presented as they were for the person variance. In this case, there was very little variability due to rater for all strata, as indicated by the small variance components for each of the strata. However, by having the estimates for each stratum, it would be possible to identify whether there was greater variability among the raters for some strata relative to other strata. Given that the variances were small, it is not surprising that the covariances are small as well. The person-by-rater components are likewise small for all of the strata, indicating that the raters rated the different examinees
consistently within each stratum. Again, even though these values are small in this instance, the availability of this information allows the test developer to better understand whether the raters are behaving consistently across the different categories defined by the table of specifications. In a case where the strata were not so similar, there may be greater differences in all components across the various strata.

The case-by-rater component was not large for any of the strata, although the relative size of the components varies across the strata. The variance for strata one and three is larger than that for strata two and four, which indicates that there was more variability in the ratings across the cases in strata one and three, as compared to strata two and four.

The analyses performed here indicate that although the differences were not always large, there were some differences across the strata. In particular, there were differences in how variable the cases were within the different strata. These results indicate that it may be important to increase the number of cases from stratum four, where the variability among the cases is greatest and the choice of cases is likely to have a large impact on the reliability of the assessment. This information is not available from a univariate analysis.

Improving reliability through changes in test construction: D studies

As noted in the methodology section, the G studies were followed by a series of D studies to investigate the impact of increasing or decreasing the number of cases or raters on the estimates of SEM and reliability. Because the raters were not a large source of variability, increasing or decreasing the number of raters would not have much of an impact on the estimates of reliability or SEM; therefore, the D studies centered on manipulating the number of cases.

The results of the multivariate analysis provide insight into some potential ways that the reliability of the assessment could be improved. As noted in Table 4, stratum four had the largest variability due to the cases. Therefore, increasing the number of cases in stratum four might be advantageous from a reliability standpoint. Furthermore, the high level of variability in that stratum indicates that there might also be a validity reason for increasing the number of items within that stratum, as the construct might not be fully represented with two cases and a high level of variability. If there is interest in retaining the current test length, there is evidence that it might be acceptable to decrease the number of cases in stratum one, where there was little variability among the cases. By reallocating the cases to different strata, the reliability can be increased without increasing testing time.

To illustrate the effect of changing the number of cases in different strata, several test lengths were explored and the resulting G and Phi coefficients were computed, along with the associated SEMs. Table 5 provides the number of cases in each stratum for the test lengths considered. The first configuration, configuration A, considers a design where an equal number of items is allocated to each stratum. Configuration B represents the current test design. Configurations C and D provide two different scenarios where the number of items in stratum four was increased and the number of items in stratum one was decreased.
Figure 1 provides a graphical representation of the effect of changing the distribution of the cases on the resulting G coefficient, and Fig. 2 provides the effect on the phi coefficient. In the present example, the balanced design with four cases in each stratum has the lowest generalizability (0.82); redistributing the cases as represented in configuration D produced a slightly higher generalizability (0.83). The other two configurations fall between these two values. A change of 0.01 in the generalizability coefficient may seem like a modest improvement, but it is equivalent to the gain from an increase of one case, or a 6% increase in test length.
Table 5 Number of items per stratum for various potential test designs

Test design   Stratum 1   Stratum 2   Stratum 3   Stratum 4
A             4           4           4           4
B             5           5           4           2
C             2           5           4           5
D             2           4           2           8
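A rough multivariate D-study sketch over these allocations can be assembled from the per-stratum components in Table 4 (person variances and covariances, PC, and PCR,E). The tiny rater-related terms are omitted for brevity, so the coefficients are only approximations of the mGENOVA values plotted in Figs. 1 and 2, but the sketch reproduces the ordering reported above (configuration A lowest, configuration D highest on the G coefficient).

```python
# Sketch of a multivariate D study over the case allocations in Table 5.
# Rater-related terms are omitted for brevity; values are approximate.
import numpy as np

# Person universe-score variance/covariance matrix (strata 1-4), from Table 4.
person = np.array([[0.432, 0.504, 0.345, 0.392],
                   [0.504, 0.970, 0.641, 0.639],
                   [0.345, 0.641, 0.335, 0.394],
                   [0.392, 0.639, 0.394, 0.637]])
pc    = np.array([1.159, 2.530, 1.472, 1.787])   # person-by-case, per stratum
pcr_e = np.array([0.625, 0.572, 0.736, 0.560])   # residual, per stratum
n_r = 4                                          # raters

def composite_g(n_cases):
    n_cases = np.asarray(n_cases, dtype=float)
    w = n_cases / n_cases.sum()                  # nominal weights = proportion of cases
    univ = w @ person @ w                        # composite universe-score variance
    rel_err = np.sum(w**2 * (pc / n_cases + pcr_e / (n_cases * n_r)))
    return univ / (univ + rel_err)

for label, alloc in {"A": (4, 4, 4, 4), "B": (5, 5, 4, 2),
                     "C": (2, 5, 4, 5), "D": (2, 4, 2, 8)}.items():
    print(label, round(composite_g(alloc), 3))
```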

Fig. 1 G coefficients for different distribution of items, two analytic designs

Fig. 2 Phi coefficients for different distributions of items, two data designs

Given the cost of creating, administering, and scoring the cases, increasing the test length even by one item might not be a practical option. However, a redistribution of the cases across strata potentially provides this increase in generalizability with no increase in testing time or scoring effort.

When considering the effect on the phi coefficient, however, the picture is slightly different. While configuration D produced the highest G coefficient, it produced the lowest phi coefficient. This is due to the difference in the specification of error: the case-within-strata component was large and contributes to the absolute error but not the relative error, so the phi coefficient is affected differently than the G coefficient. In this case, the case component was largest for stratum four, and since configuration D weights
the fourth stratum most heavily, there is a heavier weight for the case component associated with stratum four in the calculation of the composite error and, hence, the composite phi coefficient. By sampling more heavily from the stratum with the most variability, more error is incorporated into the estimate of generalizability. This example illustrates the difference between the G and phi coefficients and shows that the appropriate coefficient should be considered in selecting a test design. Since the phi coefficient uses the absolute error, it can be interpreted as representing the more accurate estimate of reliability in a criterion-referenced testing situation (Cronbach 2004), where all examinees are judged relative to the content itself rather than in relation to each other. In a norm-referenced testing environment, however, where the desire is to rank order candidates relative to each other, the G coefficient is more meaningful, as the relative error represents only the error that would affect different examinees differently (Cronbach 2004). Since a licensure/credentialing test represents a criterion-referenced test, the phi coefficient is the more appropriate measure of reliability in this case; given that, configuration B, the current test configuration, would provide the highest reliability.

Although the generalizability of the test is of concern, and the test design can be manipulated to maximize generalizability, validity concerns and content representation must also be taken into account. Therefore, while it may increase the G coefficient of the test to load more cases into the strata with more variability, it may not make sense from a content representation point of view. However, given that there is more variability in some content strata, oversampling from those strata may also be sensible from a validity perspective as well as a reliability perspective.

In addition to examining how the reliability would change by redistributing cases across the strata, it is also possible to investigate the effect of increasing the total number of cases. From the results of the multivariate G study, it is clear that increasing the number of cases in stratum two would produce beneficial results, given the relatively large PC and C components. Therefore, the composite G coefficient was computed for the scenario where the number of cases in stratum two was increased. The resulting G and Phi coefficients can be compared to those that would be obtained from the univariate p × c × r design, where the strata are treated as hidden. Figure 3 provides the G and Phi coefficients that would be obtained with an increased test length for these two designs.

Fig. 3 G and Phi coefficients for various test lengths for two data analysis designs

It is clear from Fig. 3 that increasing the number of cases increases both the G and phi coefficients. The multivariate design allows for the ideal placement of the extra cases and, as a result, produced the higher G and phi coefficients.

While reliability coefficients are of interest, the standard error of measurement may be a more useful metric. Because the person (universe) score variance changes slightly with the different analytic designs, a higher generalizability coefficient does not necessarily mean a smaller error component. Figure 4 provides the relative SEM for the various test lengths, and Fig. 5 provides the absolute SEM. As can be seen in both figures, the SEM was larger for the multivariate design as compared to the univariate design that ignores the strata, indicating that the univariate designs may underestimate the error in the model. The differences between the relative SEMs are due to how the PC component is treated in the computation of the error. As noted above, the PC component is divided by the number of cases when computing the relative (and absolute) error. For the design with the hidden strata, all cases are treated as equivalent, while in the multivariate design, cases from different strata are not treated as equivalent and may be weighted differently. The weighting of the different strata produces a different contribution of that term to the overall error. Because the multivariate design does not assume that the error within a given stratum should be similar to that in a different stratum, the estimate based on the multivariate design is the more appropriate estimate in the case where the strata are assumed to be meaningful and are used in the construction of the test. The differences were larger for the absolute SEM, which is not surprising given the differences in the estimates for the variability due to the case.

Clearly the choice of the model used for the analysis of the data affects not only the estimate of the generalizability coefficients but also the estimate of the SEM. By misspecifying the design, the error was not estimated properly, nor was the generalizability of the scores: the G and Phi coefficients were underestimated when the univariate design was used, and the error was slightly underestimated as well. Although the differences in this instance were not large, under other conditions, especially where the categories specified in the test specifications are not as highly related, larger differences would be expected.

In addition to providing more accurate estimates of variance components and the resulting reliability coefficients and SEMs, the multivariate design allowed for the estimation of additional information that can provide insight into the relationship of subdomains to each other and to the construct of interest.

Fig. 4 Relative SEM for four test lengths and two analytic designs

Fig. 5 Absolute SEM for four test lengths and two analytic designs

The results of the D studies showed that using the information from the multivariate analysis could lead to more efficient choices of how to increase the generalizability of the test, by either lengthening the test in the optimal way or allocating the same number of cases differently to different strata.

Discussion

One of the goals of this study was to examine the effect of the specification of the analytic design on the estimation of generalizability coefficients and SEMs. While the differences were not dramatic in this case, small differences were observed. The differences would likely be greater in a case where the strata were not so highly correlated. Although scores from fixed strata used in test construction may often be highly correlated, this is not always the case; Clauser et al. (2006) present the results of a multivariate analysis in which the fixed strata have a substantially lower correlation. As a result, when considering the most appropriate design for estimating reliability for a test built according to a test blueprint, ignoring the specifications in the design may lead to an improper estimate of the reliability of the test. The degree to which the reliability is improperly estimated will depend in part on the correlation of the different strata. When the strata are highly correlated, it may not be necessary to incorporate the individual strata into the computation of the reliability; when correlations are lower, however, it is essential to take the structure of the test into account when computing reliability. One consequence of ignoring the structure of the test is the possible underestimation of reliability. In these instances, efforts may be made to increase the reliability of the exam. Given the expense of writing, administering, and scoring additional items, resources may be improperly allocated if the estimate of reliability is incorrect.

A premise of the paper was that, by including the stratification in the design, the assessment of reliability would be more accurate, because in a p × c design the person × case (PC) component may be overestimated: that design fails to account for the fact that an examinee may differ in proficiency across content areas. The results of this study show that there was a decrease in the estimate of this effect in the designs that included the strata. There was also an increase in the person variance. Further, decomposing the PC
component into PC and PS components (technically, into the PC:S and PS components) allows for a more thorough analysis. The first analysis indicated that the cases were responsible for approximately 12% of the variance. This component does not provide any insight into whether the variability came from the different cases or the different strata. However, the subsequent univariate analysis, which considered the cases as being nested within strata, revealed that the cases within the strata accounted for a majority of that variance, while the strata themselves were not responsible for much of the variance (2%). The implication is that the strata themselves are not responsible for much variance and are similar to each other; to reduce the case variance, more cases should be sampled. Given that the test was not originally built according to a table of specifications, the results of these analyses indicate that it may not be necessary to impose that level of structure on the test, since there seems to be very little difference among the strata, regardless of the type of stratification used. A note of caution is warranted: the existence of highly correlated strata indicates that there will be little difference in the estimates of reliability if the strata are ignored; however, correlations between the strata are not the only criterion for determining whether it is important to recognize the strata in the analysis.

In addition to supporting a more accurate assessment of the reliability of the scores, the benefits of using the multivariate technique rather than a univariate design are clear from a test development perspective, as was shown in the D studies. The multivariate design provided additional information about the relative advantages of different approaches to test construction; the analysis provided insight regarding both the effect of lengthening the test and the potential advantages of restructuring the test to maximize reliability. A note of caution is necessary in that restructuring the test can change the meaning of the total score, because the percent of cases within each stratum would change. Therefore, it is necessary that the test not be restructured in a way that raises concerns about the content balance; the percent of the total test that stems from each of the strata should be consistent with the intended score interpretations. However, if the definition of content representation includes not only the proportion of items but also the proportion of variance accounted for by those items (i.e., the effective weight as well as the nominal weight; Brennan 2001; Clauser et al. 2006), it may be sensible to view content representation more broadly (e.g., Fitzpatrick 2008).

Not only does multivariate generalizability theory offer the most theoretically correct solution to the problem of estimating reliability when test specifications are used (Cronbach 2004), but it also provides valuable information about the internal structure of the test. First, it provides an estimate of how highly correlated the strata are; in this case, it indicates that the various strata are highly correlated (correlations of .75–1.00). Additionally, even in this situation, where the strata are highly correlated, valuable information was obtained regarding the variance components for each section.
It was clear that the performance of examinees on some strata was more variable than on other strata and that the different strata (levels of the fixed facet) had different degrees of variability in scores across cases. In particular, stratum two had greater variability than the other strata. Again, this can lead to insight into the construct being measured and can help inform future test development. To reduce the overall error variance associated with the test scores, it might be desirable to include more cases in the second stratum, where the person-by-case variance component is particularly large. Looking at the case component can provide information as to how much variability there is in case difficulty within each stratum. If strata are identified with considerable variability (such as stratum four), the test developer may wish to add more cases to support improved precision in that component of the test.

While the addition of more cases to the assessment would increase the reliability of the scores, it is not always feasible to add more cases (tasks) to the test. The cases are expensive to develop, deliver, and score, and so there are practical limitations to how many additional tasks can be added to the assessment. Further, there are likely limits to the length of the test and to how long it takes an examinee to complete the exam. These practical concerns might make it impossible to lengthen the test sufficiently to achieve a desired reliability. For this reason, it may be especially important to consider the within-stratum variability to determine whether it is allowable to change the number of items within a given stratum to increase reliability. Multivariate generalizability theory provides a framework to guide this adjustment.

Lastly, it is useful to consider how the results of this study might be affected if the strata were not so highly correlated. If the strata were less highly correlated, including the stratification in the analysis would have a greater impact. Consider the extreme situation where the cases within a stratum are very highly correlated, while the cases across the different strata are not. If the analysis of these cases, ignoring the stratification, were compared with the analysis of the same cases wherein the strata are made explicit, several results would be expected. In terms of reliability, the classical reliability estimate that includes the stratification would be significantly larger than the one that does not (Cronbach et al. 1965). The same logic applies to generalizability coefficients.
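For reference, the stratified coefficient alpha of Cronbach et al. (1965) mentioned here has the form

\[
\alpha_{\mathrm{strat}} = 1 - \frac{\sum_{j} \sigma^2_{j}\,(1 - \alpha_j)}{\sigma^2_{X}},
\]

where \(\sigma^2_j\) and \(\alpha_j\) are the observed-score variance and coefficient alpha for stratum j and \(\sigma^2_X\) is the total-score variance. When items correlate more highly within strata than across them, this stratified estimate exceeds the ordinary (unstratified) alpha, which is the classical analogue of the effect described above.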

Conclusions

It is clear that the method of analysis will affect the estimate of reliability, and choosing the most appropriate method will yield the most accurate estimate. If a test is built according to a table of specifications, then this stratification should be made explicit in the analysis. However, the effect of including the stratification may not be dramatic, depending on how highly correlated the strata are, and thus it may not seem worth the extra effort. If the stratification is ignored in the design, the estimate obtained will likely be an underestimate of the true reliability. This is fortunate in the sense that the estimate would be conservative; however, it may lead to a misguided decision to lengthen the test in order to increase reliability. Furthermore, by analyzing the data using multivariate G theory, substantial information regarding the structure of the test can be acquired to aid in test development. Although making the strata explicit in the design had little impact on the original G coefficients, valuable information regarding the structure of the test was obtained from the multivariate analysis. As a result, multivariate G theory may be not only a more appropriate method for estimating reliability for tests built according to a table of specifications, but also a valuable tool for drawing inferences about the proficiencies underlying performance on the test.

References

Brennan, R. L. (1983). GENOVA [Computer software]. Iowa City, IA: University of Iowa.
Brennan, R. L. (1999a). Manual for urGENOVA. Iowa testing programs occasional papers number 46.
Brennan, R. L. (1999b). Manual for mGENOVA. Iowa testing programs occasional papers number 47.
Brennan, R. L. (2001). Generalizability theory. New York: Springer.
Clauser, B. E., Harik, P., & Margolis, M. J. (2006). A multivariate generalizability analysis of data from a performance assessment of physicians' clinical skills. Journal of Educational Measurement, 43, 173–191.

Clauser, B. E., Harik, P., Margolis, M. J., Mee, J., Swygert, K., & Rebbecchi, T. (2008). The generalizability of documentation scores from the USMLE Step 2 clinical skills examination. Academic Medicine (RIME Supplement).
Clauser, B. E., Swanson, D. B., & Clyman, S. G. (1999). A comparison of the generalizability of scores produced by expert raters and automated scoring systems. Applied Measurement in Education, 12, 281–299.
Clyman, S. G., Melnick, D. E., & Clauser, B. E. (1999). Computer-based case simulations from medicine: Assessing skills in patient management. In A. Tekian, C. H. McGuire, & W. C. McGahie (Eds.), Innovative simulations for assessing professional competence (pp. 29–41). Chicago: University of Illinois, Department of Medical Education.
Coffman, W. E. (1971). Essay examinations. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 271–302). Washington, DC: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., Schonemann, P., & McKie, D. (1965). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291–312.
Federation of State Medical Boards of the U.S., Inc. (FSMB), & National Board of Medical Examiners (NBME). (1999). Step 3 general instructions, content description, and sample items. Philadelphia: FSMB and NBME.
Fitzpatrick, A. R. (2008). NCME 2008 presidential address: The impact of anchor test configuration on student proficiency rates. Educational Measurement: Issues and Practice, Winter 2008, 34–40.
Hartog, P., & Rhodes, E. C. (1936). The marks of examiners. London: MacMillan and Co.
Jarjoura, D., & Brennan, R. L. (1982). A variance components model for measurement procedures associated with a table of specification. Applied Psychological Measurement, 6, 161–171.
Jarjoura, D., & Brennan, R. L. (1983). Multivariate generalizability analysis for tests developed from tables of specification. In L. J. Fyans (Ed.), Generalizability theory: Inferences and practical applications. San Francisco: Jossey-Bass.
Kane, M. (1997). Model-based practice analysis and test specifications. Applied Measurement in Education, 10(1), 5–18.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13, 5–8, 15.
Margolis, M. J., & Clauser, B. E. (2006). A regression-based procedure for automated scoring of a complex medical performance assessment. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring for complex tasks in computer-based testing (pp. 123–167). Hillsdale, NJ: Lawrence Erlbaum Associates.
Margolis, M. J., Clauser, B. E., Cuddy, M. M., Ciccone, A., Mee, J., Harik, P., et al. (2006). Use of the MiniCEX to rate examinee performance on a multiple-station clinical skills examination: A validity study. Academic Medicine (RIME Supplement), 81, S56–S60.
Raymond, M. R. (2001). Job analysis and the specification of content for licensure and certification examinations. Applied Measurement in Education, 14(4), 369–415.
Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. W. (1999). Note on sources of sampling variability in science performance assessments. Journal of Educational Measurement, 36, 61–71.
Stalnaker, J. M. (1951). The essay type of examination. In E. F. Lindquist (Ed.), Educational measurement (pp. 495–530). Washington, DC: American Council on Education.
Swanson, D. B., Norman, G., & Linn, R. (1995). Performance-based assessment: Lessons from the health professions. Educational Researcher, 24, 5–11, 35.
