Advances in Health Sciences Education 4: 67–106, 1999. © 1999 Kluwer Academic Publishers. Printed in the Netherlands.


Clinical Skills Assessment with Standardized Patients in High-Stakes Tests: A Framework for Thinking about Score Precision, Equating, and Security

DAVID B. SWANSON*, BRIAN E. CLAUSER and SUSAN M. CASE
National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104, USA
E-mail: [email protected] (*Corresponding author: Phone: (215) 590-9570; Fax: (215) 590-9442)

Over the past decade, there has been a dramatic increase in the use of standardized patients (SPs) for assessment of clinical skills in high-stakes testing situations. This paper provides a framework for thinking about three inter-related issues that remain problematic in high-stakes use of SP-based tests: methods for estimating the precision of scores; procedures for placing (equating) scores from different test forms onto the same scale; and threats to the security of SP-based exams. While generalizability theory is now commonly employed to analyze factors influencing the precision of test scores, it is very common for investigators to use designs that do not appropriately represent the complexity of SP-based test administration. Development of equating procedures for SP-based tests is in its infancy, largely utilizing methods adapted from multiple-choice testing. Despite the obvious importance of adjusting scores on alternate test forms to reduce measurement error and ensure equitable treatment of examinees, equating procedures are not typically employed. Research on security to date has been plagued by serious methodological problems, and procedures that seem likely to aid in maintaining security tend to increase the complexity of test construction and administration, as well as the analytic methods required to examine precision and equate scores across test forms. Recommendations are offered for improving research and use of SP-based assessment in high-stakes tests.

Over the past decade, the use of standardized patients (SPs) for assessment of clinical skills has increased dramatically. In North America, it is now common for SPs to be used in high-stakes tests. Dozens of medical schools have now instituted “Clinical Practice Exams” that students take during their senior year (Association of American Medical Colleges, 1998). At many of these schools students must pass these exams to graduate; those who fail are typically assigned to remedial work before retesting.


In North America, there has also been increasing use of SP-based tests for licensure and certification examinations. In 1990, the Corporation Professionnelle des Medecins du Quebec incorporated an SP-based test into the examination used for licensure and certification of family physicians. In 1992, the Medical Council of Canada established an SP-based assessment of clinical skills as the second component of their licensing examination (Reznick et al., 1993, 1997). In 1998, the Educational Commission for Foreign Medical Graduates (ECFMG) added a passing score on an SP-based assessment of clinical and communication skills to the requirements for its certification, which is a prerequisite for graduates of foreign medical schools to enter graduate training in the United States (Whelan, in press). For the past 25 years, the National Board of Medical Examiners has engaged in research and development work on SP-based tests (Case, 1992; Klass, 1994; Klass et al., 1994), with the goal of adding an assessment of clinical skills into the United States Medical Licensing Examination (Klass et al., in press).

Despite the increased use of high-stakes SP-based tests, some knotty problems remain. In this paper, we address three of these: methods for estimating the precision of scores, procedures for placing (equating) scores from different test forms onto the same scale, and threats to the security of SP-based exams. Because of the diversity of the procedures used for SP-based test administration, differences in the numbers of examinees to be tested, and variation in the likelihood and consequences of security problems, this paper attempts to provide a framework for thinking about precision, equating, and security problems, rather than a cookbook of recipes for solving them.

In order to make discussion more concrete, the first section of the paper describes alternate approaches (“scenarios”) for the design and administration of large-scale SP-based tests; most high-stakes administrations can be viewed as a variant of one of these. The next three sections separately address precision, equating, and security issues. In the last section, we provide a more integrated discussion of these issues, using the scenarios to illustrate key points. We emphasize the interplay and tradeoffs among precision, equating, and security considerations: in general, procedures that aid in maintaining security increase the complexity of test construction and administration, as well as the analytic methods required to examine precision and equate scores across test forms.

SP-based tests sometimes include “post-encounter probes” (aka inter-station exercises) in which examinees respond to questions (generally in writing) regarding history and physical findings, differential diagnosis, and next steps in patient care (Van der Vleuten and Swanson, 1990). Though the score precision, equating, and security issues associated with post-encounter probes are similar to those for assessment of hands-on clinical skills, we have chosen to focus exclusively on the latter to simplify the paper.


Scenarios for Administration of Large-Scale, High-Stakes SP-Based Tests

The procedures used for construction and administration of SP-based tests are extremely diverse. The simplest approach involves the use of a single test form, with the same cases completed by each examinee, and, across examinees, the same SP portraying each case role and completing associated checklists and rating forms. At the other extreme, a large pool of cases may be used, with individual test forms constructed from the pool by selecting cases according to a test blueprint and selecting SPs for each case role from among several SPs who have been trained to portray the role. Further complicating matters, the SP portraying a case role may not record performance; instead, one or more raters (who may be physicians, other SPs, etc.) may be assigned systematically or non-systematically to cases, examinees, or case-examinee combinations. This can result in large numbers of overlapping (or non-overlapping) test forms on each date of test administration. In addition, there may be multiple dates and sites of administration, with cases, SPs, and/or raters confounded with sites and dates.

Because of this diversity, it is difficult to discuss issues in large-scale, high-stakes SP-based tests in the abstract: security considerations and procedures for evaluating score precision and equating are strongly dependent on the specifics of test administration. To provide a concrete context for discussion of precision, equating, and security, we describe five scenarios for administration of large-scale, high-stakes SP-based tests; four of the five are then used as illustrations in subsequent sections.

SCENARIO 0: ONE TEST FORM / SINGLE SP PER CASE ROLE / MULTIPLE TEST DATES

This is the simplest approach to SP-based test construction and administration; based upon published reports, it appears to be very commonly used for school-based assessment.1 In this scenario, a single test form is used. It consists of a single set of cases, with each case role portrayed by the same SP for all examinees; SPs also complete checklists and rating forms associated with cases. To handle large numbers of examinees, the test is administered repeatedly for several days or weeks until all examinees have been tested. Random assignment may be used to assign examinees to test administration dates, but, probably more commonly, assignment reflects geographic considerations (e.g., students are tested in groups based upon training site or clerkship rotation) or examinee preferences. Typically, the number of examinees tested concurrently is a multiple of the number of stations on the test form. For example, for a 12-station test form, examinees are likely to be tested in groups of 12 or 24, and the number of test dates will reflect the number of groups of this size that need to be assessed in order to test all eligible examinees.

In a sense, the frequency with which this approach to test administration is used is a matter of perspective. For example, clinical practice exams required for graduation are typically administered annually to successive cohorts of students,


often with some reuse of cases from year to year. The test administration for any single class resembles Scenario 0, but, viewed across years of test administration, this approach to test administration is similar to Scenario 2 or 4 below, depending upon the pattern of reuse of cases and SPs over time. It is often appropriate to view test administrations longitudinally, rather than as separate, independent events, particularly for studies focusing on score precision, equating, or security. As a consequence, Scenario 0 is only included to set the stage for other scenarios.

SCENARIO 1: TWO TEST FORMS WITH NO COMMON CASES / SINGLE SP PER CASE ROLE / ONE TEST DATE PER FORM

In this scenario, two test forms are used. Each consists of a single, fixed set of cases, with each case role portrayed by the same SP for all examinees taking that form. For purposes of discussion, it is assumed that SPs also complete checklists and rating forms associated with the case. A different test form is used for each test date, as illustrated in Figure 1, which graphically depicts the design that results from using this approach for two 12-case test forms administered to 48 examinees. In this example, two groups of 12 examinees would complete the same test form on each day, with the groups sequestered at mid-day for security reasons. Test administration would be spread across two days, with Form 1 used on the first day and Form 2 used on the second day. This approach prevents examinees taking the test during the first day from providing advance information about cases to examinees taking the test the second day, since they take different (non-overlapping) test forms.

To improve comparability of content coverage on the two forms and reduce differences in form difficulty, it is very useful to develop similar cases (e.g., two cases of chest pain that present quite differently) in pairs, with one case from each pair randomly assigned to each test form.

Test administration designs in which subgroups of examinees complete different test forms composed of non-overlapping sets of cases (or SPs) are termed “disconnected.” When a test administration design is disconnected, it is impossible to analytically determine if observed differences in test score distributions from form to form are due to differences in form characteristics (e.g., difficulty) or group proficiency, because the two are confounded; it is also impossible to partition overall variation in examinee scores into the sources causing the variation (Searle, 1971), which affects the analytic methods appropriate for investigation of factors influencing the reproducibility of scores and limits the equating methods that can be used. In contrast, if examinees are assigned randomly to test forms and the group taking each form is reasonably large, form-to-form differences in the distribution of test scores should be primarily due to differences in the forms, rather than the examinee groups, making it possible to statistically adjust (equate) scores on the forms so they are more comparable, as discussed in the section on equating.
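Whether a given administration pattern is connected can be checked mechanically: treat examinee groups and case/SP combinations as the two node types of a bipartite graph, add an edge whenever a group is administered a case/SP, and count connected components. The short sketch below illustrates the idea; it is not part of the analyses described in this paper, and the group and case labels are hypothetical.

```python
from collections import defaultdict

def connected_components(design):
    """design: dict mapping an examinee-group label to the set of case/SP
    units administered to that group. Returns the number of connected
    components in the bipartite group-by-unit graph."""
    # Build adjacency between group nodes ("G:...") and unit nodes ("U:...").
    adj = defaultdict(set)
    for group, units in design.items():
        for unit in units:
            adj[f"G:{group}"].add(f"U:{unit}")
            adj[f"U:{unit}"].add(f"G:{group}")

    seen, components = set(), 0
    for node in adj:
        if node in seen:
            continue
        components += 1
        stack = [node]                 # depth-first search over one component
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(adj[current] - seen)
    return components

# Two 12-case forms with no common cases (as in Figure 1) -> disconnected.
scenario1 = {"Day 1 group": {f"case{i}" for i in range(1, 13)},
             "Day 2 group": {f"case{i}" for i in range(13, 25)}}

# A connected variant in which the two forms share four cases (cf. Figure 2 below).
scenario2 = {"Form 1 group": {f"case{i}" for i in range(1, 13)},
             "Form 2 group": {f"case{i}" for i in range(9, 21)}}

for label, design in [("No common cases", scenario1), ("Common cases", scenario2)]:
    n = connected_components(design)
    print(label, "is", "connected" if n == 1 else f"disconnected ({n} components)")
```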


Figure 1. Assignment of cases and SPs for Scenario 1 – two test forms with no common cases, a single SP per case, and one test date per form (disconnected design).

SCENARIO 2: TWO TEST FORMS WITH COMMON CASES / SINGLE SP PER CASE ROLE / MULTIPLE TEST DATES

As in Scenario 1, two test forms are used in Scenario 2; it differs from Scenario 1 because there is overlap in cases from form to form. For this scenario, each form consists of a single set of cases, with each case role portrayed by the same SP for all examinees; that SP also completes checklists and rating forms associated with the case. To handle large numbers of examinees, the test is administered repeatedly for several days or weeks until all examinees have been tested. Examinees may be assigned randomly to test administration dates (and forms), they may sign up for whatever date they wish, or they may be assigned to dates based on geographic or other factors. Typically, clinical practice exams required for graduation are based upon this scenario: they are administered annually to successive cohorts of students, with some reuse of cases from year to year. Similarly, clerkship-based exams administered after each rotation generally result in some variant of this scenario.

Figure 2 graphically depicts an example in which a 12-case test form is administered to 48 examinees. If one group of 12 examinees were tested each day, test administration would be spread across four dates. Because a common set of cases and SPs is used across test forms and examinees, the design is connected.

SCENARIO 3: SINGLE TEST FORM WITH TWO REPLICATIONS / TWO SPS PER CASE ROLE / SINGLE TEST DATE

In this scenario, a single set of cases is used, but two (or more) SPs are trained to portray each case role; the SP portraying the case also completes associated


Figure 2. Assignment of cases and SPs for Scenario 2 – two test forms with common cases, a single SP per case, and multiple test dates (connected design). Both test forms include 12 stations. Test Form 1 consists of cases/SPs 1 to 12, and Test Form 2 consists of cases/SPs 9 to 20. Thus, the two forms have four cases/SPs in common (cases/SPs 9 to 12).

checklists and rating forms. All examinees are tested concurrently on the same date. Each examinee sees the same set of cases, but may see different combinations of SPs. Depending upon the strategy used for assignment of examinees to SPs that are portraying each case role, several different designs can result for this scenario. For example, the entire test form may be replicated (possibly at a different site), with examinees assigned (preferably randomly) to replications (often termed “circuits”) of the test form. This results in a design that is similar to the one illustrated in Figure 1, except test forms are differentiated by the SPs portraying the (common) case roles, rather than by the cases included in the form.

The “connectedness” of the design is a matter of perspective. One could view the design as connected because the same cases are used in each replication. One could also view the design as disconnected because the test form completed by each group of examinees consists of non-overlapping sets of SPs. For several reasons, we think the latter perspective should generally be adopted. First, research has shown that the use of different SPs portraying the same case role does introduce additional measurement error (Swanson and Norcini, 1990; Colliver et al., 1990, 1991a, 1994, 1998). Second, if examinees have been randomly assigned to replications, it is possible to eliminate some of the additional measurement error through equating. Third, it increases the likelihood that users of SP-based tests will use designs that actually are connected, as discussed next.

If test administration takes place at a single site, assignment of examinees to SPs portraying the same case role can be deliberately manipulated to produce a


Figure 3. Assignment of cases and SPs for Scenario 3 – single test form with two SPs per case role and a single test date (connected version of design).

connected design, as illustrated in Figure 3. This makes it possible, with sufficiently large sample sizes, to partition variation in examinee performance into sources responsible for the variation (e.g., cases, SPs) and use that information to estimate and adjust for differences in test form difficulty, improving the precision of scores. In principle, it is also possible to use statistical procedures in quality control analyses to identify cases and SPs that are performing aberrantly, though only very large discrepancies are detectable with the examinee sample sizes that are likely with this design.

SCENARIO 4: MULTIPLE FORMS WITH COMMON CASES / MULTIPLE SPS PER CASE ROLE / MULTIPLE TEST DATES

In this scenario, a large pool of cases is developed, and several SPs are trained to portray each of the associated roles. Multiple test forms are assembled by (stratified random) sampling from this pool, using a fixed blueprint to ensure that forms are content-parallel. Figure 4 illustrates how test forms might be constructed using this approach; it also depicts one (connected) variant of the test administration pattern that might result if all SPs were trained and assigned to test forms/administrations from a central site. Viewed over time, virtually all large-scale, high-stakes administrations of SP-based tests use some variant of this


Figure 4. Assignment of cases and SPs for Scenario 4 – multiple test forms with common cases, multiple SPs per case role, and multiple test dates (connected version of design).

scenario. Because of practical considerations in test administration, the resulting datasets are generally disconnected. A disconnected example of this scenario is the test administration procedure used until recently by the Medical Council of Canada (MCC) for assessment of clinical skills in the second component of their licensing examination (Reznick et al., 1993, 1997). Each year, test administration took place over two days at multiple regional sites. For security reasons, one set of cases was used on the first day of testing, with a second (non-overlapping) set of cases on the second day of testing; at some sites, multiple circuits were also used. Physician-observers completed checklists and rating forms. Across years of test administration, cases were reused in limited numbers. Thus, across regional sites, multiple SPs played each case role, and multiple physician-observers rated examinee performance for each case. Connectedness is again a matter of perspective. If one is willing to make


the (almost certainly incorrect) assumption that SPs portraying the same case role and physician-observers assigned to a case are comparable across sites, the design is connected within each date of test administration, and viewed longitudinally, it is connected across administrations through reuse of cases. For reasons discussed earlier, we prefer to view the design as disconnected because there are no overlapping SPs and raters across sites: this approach should provide a more useful framework for considering steps that might be taken to improve precision and equating by making the actual design “visible” in analytic work. The ECFMG is using a connected version of Scenario 4 for its new SP-based assessment of the clinical and communication skills of examinees trained at schools outside the U.S. and Canada. For this assessment, a pool of hundreds of cases is under construction. The examination is presently administered year-round at a single site, with the forms for each date of test administration constructed from the pool according to an exam blueprint (Whelan, in press). Though multiple SPs do, on occasion, portray the same case role, each SP is tracked separately in the definition of test forms. As a consequence, differences in SP stringency and form difficulty can be estimated, and this information can be used for adjustment of scores if the differences are large. Taken together, using a single test administration site, separately tracking SPs portraying the same case role, and constructing test forms to maintain connectedness over time make it possible to partition variation in performance into separate sources, to use sophisticated equating procedures for adjustment of scores, and to make pass/fail decisions that appropriately reflect variation in form difficulty. However, as discussed later, security problems may result as cases and SPs are (re)used over time.

Estimating the Precision of Scores

The purpose of high-stakes SP-based assessment is to draw inferences about the clinical skills of an examinee that extend beyond the particular test form administered to the larger domain of cases, SPs, and raters that might have been used on the test (Van der Vleuten and Swanson, 1990). Scores are reproducible if an examinee’s score is reasonably stable across similar but not identical (randomly parallel) samples of cases, SPs, and raters. Variation in an examinee’s performance across cases, inconsistency in SP portrayal (both for the same SP across examinees and across SPs portraying the same case role), and lack of inter-rater agreement all have an adverse effect on the reproducibility of scores. For an estimate of an examinee’s proficiency to achieve a desired level of reproducibility, an adequate number and mix of cases, SPs, and raters must be included in the sample of which the test is composed. In order to design an SP-based assessment that yields scores with the desired level of precision, it is necessary to quantify the nature and magnitude of the multiple sources of measurement error that are present.

Because multiple sources of measurement error are present, the statistical methods that classical test theory provides for investigating the reproducibility


of scores (e.g., separate indices of internal consistency and inter-rater agreement) are not adequate. Instead, use of the conceptual and analytic tools of generalizability theory (G-theory) is mandatory (Cronbach et al., 1972; Brennan, 1992). Using analysis of variance (ANOVA) techniques, G-theory provides 1) flexible analytic methods for estimating the magnitude of multiple sources of measurement error, 2) tools for modeling both norm- and domain-referenced decision-making situations, 3) multiple indices of score precision tailored to the purpose of an assessment procedure, and 4) practical approaches for evaluating alternate test designs to guide development of cost-effective assessment procedures (Shavelson et al., 1989). Because the purpose of this section is to provide a framework for examining the precision of scores, we will focus on the first of these, but all four are important in analyzing and improving SP-based testing procedures. After a brief overview of G-theory, the remainder of this section illustrates a G-theory “view” of score precision. Consult Brennan (1992), Shavelson (1989), and Brennan (1995) for readable and more complete treatments of G-theory.

OVERVIEW OF GENERALIZABILITY THEORY

To illustrate the application of G-theory, consider a hypothetical SP-based test in which all examinees complete all stations and the same raters rate all examinees at every station. While this kind of test is not used in practice for logistical reasons, the design is useful to illustrate the basic concepts of G-theory. From an ANOVA perspective, this is a completely crossed persons by stations by raters random-effects design. The first two columns of Figure 5 present the decomposition of observed checklist scores and associated variance components of the linear model for this design. Each observed checklist score on a station from a rater can be viewed as the sum of the grand mean, a person effect, a station effect, a rater effect, three two-way interaction effects, and a residual that reflects the three-way interaction plus additional (confounded) sources of measurement error.

The third column provides variance components for each effect on a (squared) percent-correct scale; while the values in this column are fictional, they are consistent with those reported for checklist scores in the literature. Cells in the fourth column provide the square root of the variance component – the standard deviation (SD) for the associated effect. These are directly interpretable as indices of variability on the percent-correct scale of scores. For example, the entry of 5 for person proficiency indicates that person proficiency has a (true) SD of 5%. Thus, if the mean score on the exam was 60, roughly 95% of examinees’ true scores (the scores that would be obtained for a very long test in which proficiency was measured without error) would fall within two SDs of the mean – between 50 and 70. The entry of 10 for stations is the SD for station difficulties, implying that 95% of all station means would fall between 40 and 80 if these could be estimated without error. The SD for rater stringencies is 3, a smaller value: raters tend to vary less in stringency than persons do in proficiency, and persons, in turn, vary less than stations do in difficulty.
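For reference, the decomposition just described (the first two columns of Figure 5) can be written as the usual G-theory random-effects model; the notation below is generic G-theory notation rather than a reproduction of the figure:

$$X_{psr} = \mu + \nu_p + \nu_s + \nu_r + \nu_{ps} + \nu_{pr} + \nu_{sr} + \nu_{psr,e}$$

$$\sigma^2(X_{psr}) = \sigma^2_p + \sigma^2_s + \sigma^2_r + \sigma^2_{ps} + \sigma^2_{pr} + \sigma^2_{sr} + \sigma^2_{psr,e}$$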


Figure 5. Decomposition of observed scores on hypothetical SP-based test. The table reflects a completely crossed design in which all examinees complete all stations and are rated by all raters at each station. Variance components with interpretation information in italics contribute to measurement error for norm-referenced score interpretation. All components except for person proficiency contribute to measurement error for domain-referenced score interpretation.
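To make the use of such variance components concrete, the following is a minimal D-study sketch for the completely crossed design just described. The variance components are illustrative values only (the first three are the squares of the SDs quoted above; the interaction components are hypothetical, with person by station set as the largest), not the entries of Figure 5, and the computation is a sketch rather than the analysis reported later for Figures 7 and 8 (which used a raters-nested-in-stations design). It nevertheless shows the same qualitative pattern: adding stations improves precision more than adding raters per station.

```python
def d_study_crossed(var, n_s, n_r):
    """D-study for a completely crossed persons x stations x raters design.

    var: dict of variance components (squared percent-correct scale) with keys
         'p', 's', 'r', 'ps', 'pr', 'sr', 'psr_e'.
    n_s, n_r: numbers of stations and raters per station on the test form.
    """
    rel_err = var['ps'] / n_s + var['pr'] / n_r + var['psr_e'] / (n_s * n_r)
    abs_err = rel_err + var['s'] / n_s + var['r'] / n_r + var['sr'] / (n_s * n_r)
    return {
        'Ep2': var['p'] / (var['p'] + rel_err),   # generalizability coefficient
        'Phi': var['p'] / (var['p'] + abs_err),   # dependability coefficient
        'rel_SEM': rel_err ** 0.5,                # relative SEM
        'abs_SEM': abs_err ** 0.5,                # absolute SEM
    }

# Illustrative variance components only (NOT the values in Figure 5): the first
# three are the squares of the SDs quoted in the text (5, 10, 3); the interaction
# components are hypothetical, with person x station as the largest source.
components = {'p': 25, 's': 100, 'r': 9, 'ps': 105, 'pr': 4, 'sr': 4, 'psr_e': 30}

for n_stations, n_raters in [(10, 1), (10, 2), (20, 1)]:
    result = d_study_crossed(components, n_stations, n_raters)
    print(n_stations, "stations,", n_raters, "rater(s):",
          {k: round(v, 2) for k, v in result.items()})
```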


The largest source of variability is the person by station interaction, often termed “case (or content) specificity” in the medical problem-solving literature (Swanson, 1987; Van der Vleuten, 1996). The relatively large value indicates that the quality of an examinee’s performance varies substantially from one station to the next. As a consequence, a test must include large numbers of stations to obtain a reproducible index of examinee proficiency. Variance components for other interaction terms can be interpreted similarly, though they are smaller in magnitude.

A variety of indices of score reproducibility can be derived from the variance components; some of these are defined in Figure 6. Appropriate computational formulas for a test in which examinees, stations, and raters are completely crossed are provided in the third column of the figure. While these formulas change for different designs, they do illustrate which variance components typically influence the indices. In general terms, if a source of measurement error is sampled “n” times for an examinee, the contribution of that source to total measurement error is equal to the variance component divided by “n”. For example, if a test form includes 20 stations, the contribution of stations to total measurement error will be the variance component for stations divided by 20.

The most commonly reported G-theory index of precision for SP-based tests is the generalizability coefficient, which provides an index of score reproducibility suitable for relative (norm-referenced) score interpretation. The generalizability coefficient is an intraclass correlation, and it can be interpreted like any other correlation. For example, a value of 0.7 indicates that the expected correlation between a pair of similar but not identical (randomly parallel) test forms is 0.7. The dependability coefficient is an analogous index for domain-referenced score interpretation. Unlike norm-referenced testing situations, where examinees’ test scores are interpreted relative to one another, in domain-referenced situations, test scores are interpreted in relation to an absolute level of performance in the domain of interest. As a consequence, for domain-referenced score interpretation, variance components that affect form difficulty (e.g., station difficulty, rater stringency, and the interaction of the two) must also be included in measurement error, and dependability coefficients are always less than (or equal to) generalizability coefficients calculated from the same dataset.

Although generalizability coefficients are the most commonly reported index of precision, the standard error of measurement (SEM) is more useful in practical testing situations because it is easier to interpret. For example, consider a 20-station exam with a pass/fail standard of 50 on a percent-correct scale and an SEM of 2.7. The latter value can be used to form confidence intervals around examinee scores, making possible a direct evaluation of whether or not a test score is sufficiently reproducible to serve its intended purpose. If an examinee had a (failing) score of 48, the 95% confidence interval around that score would be from 42.6 to 53.4 (48 plus/minus 2 × 2.7). Whether or not the score is sufficiently precise to make a high-stakes decision like graduation is a matter of judgment, but the SEM and


Figure 6. Selected indices of score reproducibility provided by generalizability theory. Indices in italics are appropriate for norm-referenced score interpretation; non-italicized entries are appropriate for domain-referenced score interpretation. Formulas will vary as a function of the specifics of the test administration design and the intended interpretation of scores.


associated confidence interval are more readily understood in practical terms than a generalizability coefficient.2

G-theory also provides an absolute SEM for domain-referenced testing. Like the dependability coefficient, the absolute SEM reflects sources of measurement error that affect form difficulty, and the absolute SEM is always greater than (or equal to) the relative SEM. By using equating procedures to place scores from different test forms on the same scale, it is sometimes possible to reduce the magnitude of the absolute SEM to the point where it is roughly the same size as the relative SEM.

G-theory makes it possible to evaluate alternate test designs to guide development of more cost-effective assessment procedures. Because SP-based tests normally “nest” raters within stations (i.e., raters do not move from station to station, and there are different raters at each station), for the following illustrative analyses, we assumed that raters would be nested within stations, and, using the variance components in Figure 5, we performed a series of analyses to evaluate alternate test designs (various combinations of numbers of stations and raters per station) that could be employed. Results are shown graphically in Figures 7 and 8. Not surprisingly, generalizability coefficients in Figure 7 increase as the numbers of stations and raters per station increase. As one would expect from the relative magnitude of variance components in Figure 5, generalizability is increased more by using large numbers of stations than by large numbers of raters per station. For example, the generalizability coefficient for 10 stations with two raters per station is 0.67; in contrast, the coefficient for 20 stations and one rater per station is 0.78. This increase is large enough to be of considerable practical importance; this is particularly notable since the same amount of rater time is required for both designs. Similar results are shown for relative SEMs in Figure 8: SEMs become lower as the number of stations increases, and increasing the number of raters per station has a limited impact.3 For a fixed amount of rater time, it is best to use as many stations as resources permit, with one rater per station (Swanson and Norcini, 1989; Van der Vleuten and Swanson, 1990).

The hypothetical test used as an illustration here does not appropriately reflect all of the sources of measurement error that are present in SP-based tests. When a source of measurement error is not directly represented in analysis, it is commonly termed a “hidden facet” in G-theory. For example, if multiple SPs portray the same case role, it is better to conceptualize the variance component for stations as reflecting components for both cases and SPs-nested-in-cases – we used the term stations in the example above to make this distinction clear. Station difficulty is a composite of the difficulty of the case and the difficulty/stringency of the SP portraying the case. For example, for a depression station, in addition to the inherent difficulty associated with the case, depending upon whether the SP portrays the case as a very depressed, uncommunicative patient or a mildly depressed, communicative patient, the station may be more or less difficult. Similarly, for a peripheral neuropathy station, difficulty may shift as a function of the


Figure 7. (Norm-referenced) generalizability of scores as a function of the number of stations and the number of raters per station (raters nested in stations design).

Figure 8. (Norm-referenced) standard error of measurement as a function of the number of stations and the number of raters per station (raters nested in stations design).


accuracy with which an SP can simulate abnormal neurological findings. While, ideally, all SPs would portray a case role in exactly the same way, even with excellent training, some variation in portrayal is inevitable, and this introduces additional measurement error. Though it may not be possible to obtain the estimates, at least conceptually, each of the station-related interactions in Figure 5 can be broken down into case- and SP-related components as well.

As another example of a hidden facet, in some SP administrations, a physician-observer completes checklists and rating forms. If different groups of raters participate in test administrations scheduled in the morning and afternoon, examinees demonstrating the same proficiency level on the same cases portrayed by the same SP are likely to receive different scores because of variation in rater stringency. If raters are not included as a factor in analyses of precision, the accuracy of estimated variance components will be reduced. For example, if more stringent raters tend to be present for morning test administrations, the magnitude of person (true score) variance for combined morning and afternoon administrations will be inflated as a result: the variance of total test scores will reflect both variation in person proficiency and systematic differences in rater stringency.

For high-stakes testing situations, it is important to recognize hidden facets that may be present and to attempt to develop estimates of their magnitude. This is particularly true if the hidden facet is systematically related to other factors in the test administration, since indices of reproducibility can be markedly affected.

Equating4

SPs are commonly used for educational purposes in medical schools, with the interaction between a student and an SP used as a basis for instruction (Association of American Medical Colleges, 1998). If the primary purpose of SP use is instructional, there is little need to ensure that different students receive comparable assessments. However, when SPs are used in large-scale, high-stakes tests with important consequences, use of comparable tests that produce equivalent scores becomes a necessity to ensure that consistent, fair, and appropriate decisions are made.

The previous section on the generalizability of scores on SP-based tests emphasizes that several facets of the testing process introduce measurement error (e.g., variability among cases, SPs and raters). In high-stakes situations, it is desirable to control as many of these factors as possible. Unfortunately, practical administrative issues (including the need to maintain security) run counter to this intention. For example, a straightforward determination of the relative proficiency level of a group of examinees may be obtained by administering a single set of cases to all examinees, with a single SP trained to portray each case role, with testing at a single location, on a single date. However, for large-scale testing, this approach is likely to prove impractical. It may be necessary to train multiple SPs for each case in order to test a substantial group of examinees in a reasonable time period. When


testing occurs over multiple days, security concerns may necessitate development of multiple test forms so that different examinee groups complete different sets of cases. The need to vary the conditions under which test administration takes place in order to satisfy practical and security requirements does not reduce the importance of controlling the various sources of measurement error; it simply makes this task more difficult. One solution to this problem is to place scores from different test forms onto the same scale by statistically modeling form-to-form variation in score distributions. Relatively little work has been done on the development of equating procedures for use with SP-based tests. Van der Vleuten and Swanson (1990) includes an overview of some rough-and-ready methods that have been used, along with some suggestions for procedures that might be explored. A paper by Rothman and colleagues (1992) outlines the use of one of these procedures (linear equating based upon common stations) to link five forms of a two-day clinical skills examination. The authors argued that raw scores across forms could be treated as equivalent because (low-power) statistical tests failed to demonstrate significant differences among form means. However, the equated and observed scores for some forms differed by more than 0.5 SDs. In the context in which the test scores were used (selection of graduates of foreign medical schools for a pre-internship program), this was not problematic. However, in more typical high-stakes administrations, differences of this magnitude could have a meaningful impact on score interpretation and pass/fail outcomes. The purpose of this section is to describe some of the procedures that are available for equating. We focus on procedures that appear applicable to SP-based tests. After the procedures are described, we present the various designs to which the procedures can be applied. The limitations of the various procedures are then discussed, along with recommendations for research on the appropriateness of the various methods for SP-based tests. We then return to a discussion of the procedures in the last section of this paper, where they are related to the four scenarios for large-scale use of SP-based tests.

EQUATING PROCEDURES

The purpose of equating is to produce the condition in which scores from alternate forms of a test can be considered equivalent. The common-sense test of equivalence is that it should be a matter of indifference to the examinee which form (s)he completes (Lord, 1980). To fully meet this requirement, the score distributions must be identical on all forms of a test, so that the percentile rank associated with a given score would be the same on all forms; in addition, conditional standard errors must be identical for all test forms. In practice, these requirements can never be achieved fully through statistical adjustment of scores. The various procedures approximate these conditions to greater or lesser degrees, and differences among them can be conceptualized in terms of the moments of the score distribution


(mean, variance, skewness, etc.) for which adjustments are made in the equating process.

Identity equating. Identity equating is the simplest approach to producing equivalent scores for multiple test forms. It is also the approach that appears to be in widest use with SP-based tests. The assumption is made that if the forms are constructed to the same set of specifications, they will be statistically equivalent, and no adjustment is required. While this approach is generally inadequate for high-stakes assessment, it need not be equivalent to ignoring the problem of equating. For example, taking the need for equated scores seriously can lead to highly structured test development procedures. Multiple cases can be developed for each cell in the test blueprint, and forms can then be constructed by randomly assigning the appropriate number of cases from each cell to each form. To the extent that psychometric characteristics of cases covary systematically with cells in the blueprint, this procedure should reduce differences in form difficulty.

Equipercentile equating. With this approach, the population percentile rank is estimated for each form. Scores on each form are then adjusted to achieve correspondence to percentile ranks. This approach represents the most complete statistical adjustment possible (after forms have been constructed). However, accurate equipercentile equating requires extremely large samples, and, therefore, it is not likely to be useful for SP-based tests. It is presented here to provide a context for examining other procedures for making statistical adjustments and because equipercentile correspondence is often accepted as the definition of equivalence for scores from alternate test forms (Angoff, 1971), although this definition is less demanding than that described previously.

Mean equating. The simplest approach to statistical adjustment of scores from alternate test forms is to attempt to set the population means to be equal. As will be examined in a subsequent section, the design used for data collection significantly influences the specifics of how this is done. Conceptually, what is required is a sample selected from the population of interest. The sample mean is used to estimate the population parameter, and the adjustment is made. If Y is the score on Form 1 and X is the score on Form 2 (and My and Mx are means on Forms 1 and 2 respectively), then

Y − My = X − Mx.

With this method, the equated score Y will be produced as

Y = X + (My − Mx).

For this procedure to produce results identical to those from equipercentile equating requires that all higher moments of the score distributions for the forms


be identical (a special case of these conditions exists when score distributions are normal and the standard deviations of the forms are equal).

Linear equating. With this approach, scores on alternate test forms are adjusted for differences in both means and standard deviations. Again using estimated population parameters, this approach produces equivalence in terms of z-scores. If Sy is the standard deviation for Form 1 and Sx is the standard deviation for Form 2, then

(Y − My)/Sy = (X − Mx)/Sx.

The equated score Y corresponding to a Form 2 score X is then obtained by solving the above equation for Y:

Y = (Sy/Sx)X + My − (Sy/Sx)Mx.

This approach will produce a result identical to equipercentile equating under the circumstances that all higher moments (e.g., skew and kurtosis) of the score distributions are identical across forms (or under the condition that scores are normally distributed).

Item response theory. In addition to the procedures already described, there is a range of equating procedures based upon item response theory (IRT). These models differ from those described above in several respects. First, the modeled equating parameters are functions of scores on “items” (i.e., case scores) rather than total scores. Second, when the IRT model fits, these procedures produce a score scale that is linear. A complete description of these models is beyond the scope of this paper. The simplest of these models is referred to as the one-parameter model or Rasch model. For readers unfamiliar with IRT and the Rasch model, there are several readily accessible texts (Hambleton and Swaminathan, 1985; Van der Linden and Hambleton, 1997; Wright and Masters, 1982; Wright and Stone, 1979). See Clauser (1998) for an example of the application of the Rasch model to performance-based testing in medical education. In addition to the fact that they produce a linear score scale, IRT models have the advantage that (when the model fits the data) proficiency estimates are invariant to the set of cases the examinee saw and the difficulty estimates for cases are invariant to the set of examinees used in estimation. As will be discussed, this can greatly increase the flexibility with which examinees, cases, and SPs are assigned to test forms.

“Equating” by setting equivalent pass/fail standards. One “equating” approach commonly used for SP-based tests is to attempt to locate equivalent pass/fail standards on alternate test forms. Rather than equating scores, an overall pass/fail decision is made for each examinee by comparing his/her total test score with the


pass/fail standard for the associated form. Because SP-based tests are commonly intended to serve as mastery tests in which pass/fail results are important but scores are not, this approach seems very logical. For this approach to work – for pass/fail results to be consistent across forms – it is necessary to set a pass/fail standard for each form that represents the same level of proficiency. However, most of the standard-setting procedures developed for SP-based tests attempt to accomplish this task by asking content experts to determine a “clinical standard” for each case – the lowest score that represents provision of adequate medical care. Typically, clinical standards are then averaged across cases, with the resulting mean used as the pass/fail standard for the test form, perhaps with an adjustment based upon the SEM. We think there are some serious problems with this approach to “equating.” Suppose that test form A consists of predominantly easy cases in which it is straightforward to provide adequate medical care, and test form B consists of predominantly hard cases for which it is challenging to provide adequate medical care. Examinees at the same marginal level of proficiency will tend to pass form A and fail form B, and examinees taking form B will be at a significant disadvantage. To be effective, an equating procedure must result in adjustments that reflect differences in form difficulty, and the use of clinical standards does not produce this outcome. Before equating can be achieved by setting equivalent pass/fail standards on alternate test forms, better standard-setting procedures must be developed. At the present time, it appears to us that it is more appropriate to set a well-justified pass/fail standard on one form, and then adjust scores on alternate forms to place them on the same scale.
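As a concrete illustration of the mean and linear equating formulas given above, the sketch below maps Form 2 scores onto the Form 1 scale, assuming examinees have been randomly assigned to the two forms (the randomly equivalent groups situation discussed in the next subsection). The score vectors are hypothetical, and sample means and SDs stand in for the population parameters.

```python
import statistics

def mean_equate(x, form_x_scores, form_y_scores):
    """Express a Form 2 score x on the Form 1 (Y) scale by matching means."""
    return x + (statistics.mean(form_y_scores) - statistics.mean(form_x_scores))

def linear_equate(x, form_x_scores, form_y_scores):
    """Express a Form 2 score x on the Form 1 (Y) scale by matching z-scores."""
    my, mx = statistics.mean(form_y_scores), statistics.mean(form_x_scores)
    sy, sx = statistics.stdev(form_y_scores), statistics.stdev(form_x_scores)
    return my + (sy / sx) * (x - mx)

# Hypothetical percent-correct totals for two randomly equivalent groups;
# Form 2 appears harder (lower mean), so its scores are adjusted upward.
form1 = [62, 58, 71, 66, 60, 69, 64, 73, 57, 68]
form2 = [55, 61, 50, 64, 58, 53, 60, 49, 63, 56]

for score in (50, 57, 64):
    print(score, "->", round(mean_equate(score, form2, form1), 1),
          "(mean) /", round(linear_equate(score, form2, form1), 1), "(linear)")
```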

EQUATING DESIGNS

Each of the statistical equating procedures described above makes adjustments to test scores based on population parameters. These population parameters are never known; instead, for computational purposes, sample-based estimates of these parameters are used. These estimates are produced by collecting data in the context of an equating design. When examinees are randomly assigned to forms, mean or linear equating parameters may be estimated using the observed form means (and standard deviations) from these randomly equivalent groups. If examinees were randomly assigned to forms, Figure 1 would represent this type of design. The key feature of this “equivalent groups design” is that the groups must actually be randomly equivalent. Examinees from different sites (e.g., schools, cities, etc.), different cohorts (e.g., senior class of 1996, senior class of 1997), or even different testing dates (e.g., those who signed up to take the test early versus late; those taking a make-up examination six weeks after the main administration) should not be assumed to be randomly equivalent. Where random equivalence of groups is not a defensible assumption, a “nonequivalent groups design” will be needed. In essence, a framework is needed


to estimate group differences before adjusting for form differences. This can be accomplished either by administering both test forms to a single, common (sub)group or by administering a common subset of cases to both groups. With the common group design, separate test forms are administered to each group, but a sample of examinees completes both forms. This approach has advantages in terms of security, because no common cases are administered across forms. However, because it is typically very difficult to arrange for some examinees to take both test forms, this approach is generally infeasible. Alternatively, one variant on the common test design is shown in Figure 2. In this example, each group completes a twelve-case test but a subset of the cases is administered in common to both groups.5 When a common subset of cases is administered to both groups, the linking test (overlapping cases) may be scored on both forms (as suggested in Figure 2), or the cases may be unscored, provided the examinees are unaware of which cases these are. Regardless of the design used, however, performance on the linking test is used as a basis for estimating the magnitude of differences between groups. Differences in form difficulty can then be estimated, controlling for differences in group proficiency, and appropriate score adjustments can be made to place scores from different forms on the same scale.

As presented in Figure 2, this process is reasonably simple. In practice, it can be extremely complex. To this point, the presentation has proceeded as though there were two forms to be equated. The issue of how a test form is defined has been largely ignored. As the previous section on generalizability indicates, not only does the selection of cases introduce measurement error (i.e., non-equivalence) into the test scores, but so does the choice of the SP portraying a case role and the choice of the rater scoring the performance. For Scenario 3, in which a single set of cases is administered but multiple patients are trained to portray each part, this means that forms should be defined in terms of the specific SPs that an examinee saw. If the test administration is planned so that examinees follow preset tracks and the number of forms is equal to the number of patients per case (and examinees are assigned randomly), an equivalent groups design may be appropriate. With other administration strategies, it may be possible to identify common sets of patients seen by examinees completing different forms, approximating the conditions in Figure 3.

The limitation of this approach is that whether or not the examinees are randomly assigned to forms, the error associated with estimating the parameters used for equating will be a function of sample sizes. As a given sample of examinees completes an increasing number of “forms,” there is a decrease in the accuracy with which adjustments can be made. Scenario 4 is likely to exacerbate this problem by introducing test forms that consist of multiple sets of cases as well as multiple sets of patients, though this may be necessary in order to accommodate the number of examinees to be tested and/or security requirements. This difficulty highlights one of the potential advantages of equating procedures based on IRT models. If the model fits, because the item (case) parameter estimates


are invariant over both the sample of examinees used for estimation and the cases on test forms, the complexity of the design – which examinees saw which specific cases and patients – is less problematic. The type of design shown in Figure 4, in which each examinee completes twelve cases, with each case sampled from a blueprint area, with multiple SPs for some case roles, would present a nearly intractable problem for linear equating, because of the large number of forms. With IRT models, this design is not problematic because the parameters can be estimated in a single analysis that considers the entire data array simultaneously: there is no need to sequentially link pairs of forms and sets of examinees. As useful as IRT models may be, they do not solve all the problems related to SP-based tests. If test forms are administered at geographically separated sites, there will be no naturally occurring links between sites, and the resulting data sets will be disconnected. Without links between forms from different sites, these (and all other equating) models will be essentially useless. A partial resolution to this problem may be to transport SPs between sites to establish such links. This approach is attractive, but evidence is required to demonstrate that there is no interaction between the patients and sites at which the performance occurs. This approach is also likely to add significantly to the cost of the examination.
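One way to operationalize the common-case (nonequivalent groups) linking described above is chained linear equating: Form 2 totals are linearly mapped to the scale of the common (linking) cases using the Form 2 group, and the linking-case scale is then mapped to the Form 1 scale using the Form 1 group. This is only a sketch of one of several possible common-item methods, not a procedure prescribed in this paper, and the score arrays are hypothetical.

```python
import statistics as st

def linear_link(scores_from, scores_to):
    """Return a linear function mapping the scale of `scores_from` onto the
    scale of `scores_to` by matching means and standard deviations."""
    m_from, s_from = st.mean(scores_from), st.stdev(scores_from)
    m_to, s_to = st.mean(scores_to), st.stdev(scores_to)
    return lambda value: m_to + (s_to / s_from) * (value - m_from)

def chained_linear_equate(x, group2_form2, group2_anchor, group1_anchor, group1_form1):
    """Express a Form 2 total score x on the Form 1 scale, chaining through the
    common (anchor) cases taken by both groups."""
    to_anchor = linear_link(group2_form2, group2_anchor)   # Form 2 -> anchor scale (Group 2)
    to_form1 = linear_link(group1_anchor, group1_form1)    # anchor -> Form 1 scale (Group 1)
    return to_form1(to_anchor(x))

# Hypothetical data: total scores on each form and scores on the four common
# (anchor) cases, for the two nonequivalent groups in Figure 2.
group1_form1  = [64, 59, 70, 66, 61, 72, 58, 67]
group1_anchor = [60, 55, 68, 63, 58, 70, 54, 64]
group2_form2  = [57, 62, 51, 66, 59, 54, 61, 50]
group2_anchor = [59, 65, 52, 69, 61, 55, 63, 51]

print(round(chained_linear_equate(60, group2_form2, group2_anchor,
                                  group1_anchor, group1_form1), 1))
```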

STATISTICAL MODERATION

If the test administration design does not permit the use of true equating procedures, statistical moderation may provide a practical alternative. This procedure involves the use of an external examination to adjust scores on alternate forms of a test or even tests designed to measure different skills (Angoff, 1971; Mislevy, 1992; Linn, 1993). The most common example is the use of scores on a national test to adjust performance on locally developed tests. For example, scores on writing tests developed and administered in different states could be rescaled so that the mean and SD for each school are equal to the mean and SD for the school on a nationally administered, standardized achievement test. This would increase or decrease scores for all students within a school without changing the relative ranking of students within a school.

If the locally developed tests are similar to one another and bear a similar, strong relationship with the external test, this method can be effective. In countries where national examinations are administered, these conditions may be met for alternate SP-based test forms, if the forms show consistent, moderate to high correlations with the external examination. For example, in the United States, some (but not all) groups have reported strong true correlations between written licensing examinations and SP-based tests (Van der Vleuten and Swanson, 1990). In this situation, the licensing examination can be treated as the “linking test,” with differences in performance on the linking test used as a basis for estimating the magnitude of differences between groups. Differences between forms can then be determined, controlling for group differences, and appropriate


score adjustments can be determined to place scores from different forms on the same scale. While this procedure can be used to make scores more comparable, the comparability is not perfect, and better results should be achievable through the use of equating procedures.
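A minimal sketch of the rescaling just described: within each site, local SP-based scores are re-expressed so that the site's mean and SD equal the site's mean and SD on the external examination, leaving the rank order of examinees within the site unchanged. The sites and scores are hypothetical.

```python
import statistics as st

def moderate_site(local_scores, external_scores):
    """Rescale one site's local test scores so their mean and SD equal the
    site's mean and SD on the external examination (within-site rank order
    is unchanged)."""
    m_loc, s_loc = st.mean(local_scores), st.stdev(local_scores)
    m_ext, s_ext = st.mean(external_scores), st.stdev(external_scores)
    return [m_ext + s_ext * (x - m_loc) / s_loc for x in local_scores]

# Hypothetical scores for two sites: local SP-based form and external exam.
sites = {
    "Site A": {"local": [62, 70, 58, 66, 74], "external": [205, 220, 198, 212, 230]},
    "Site B": {"local": [55, 63, 49, 60, 68], "external": [210, 225, 200, 218, 235]},
}

for name, scores in sites.items():
    print(name, [round(v, 1) for v in moderate_site(scores["local"], scores["external"])])
```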

LIMITATIONS AND NEEDED RESEARCH

There has been little research on the use of equating procedures (or statistical moderation) for SP-based tests. All equating procedures require estimation of form and/or case parameters from examinee performance data, and the estimation process will not be error-free. Attempting to place scores from different test forms on the same scale using poorly estimated parameters can do more harm than good: more measurement error may be introduced than is removed. Two major issues will influence the magnitude of the resulting error. The first of these is sample size (for both examinees and cases/SPs/raters), and the second is the extent to which the equating model fits the data. Results of equating-related research using other types of performance assessments should provide useful insights regarding appropriate sample sizes and reasonable expectations about model fit, but it is clearly necessary to conduct research using SP-based test scores obtained under the specific conditions in which the procedure is to be applied.

One valuable approach to examining the magnitude of the equating error as a function of sample size is through bootstrap replication of the equating procedure. The equating process can be repeated multiple times (50 to 500), randomly selecting examinees from the original sample, with replacement. At selected score points, the standard deviation of the equated values, across replications, will provide an estimate of the standard error associated with equating using the selected procedure and sample size.

It may also be useful to examine results based on data simulations. Data can be simulated to approximate actual test conditions. The procedures can then be applied. Deviations between the known (simulated) parameters and the observed parameter estimates can then be examined. This approach has the advantage that it allows for examination not only of the deviation of observed scores from the mean, but also of the deviation from the known value. This may reveal the presence of bias, as well as random error, in the estimation process.

In addition to the need to produce accurate estimates of the necessary parameters, the utility of the various procedures is limited by the extent to which the model fits the data. For example, when the score distributions are approximately normal, linear equating may result in a close approximation of the definition of equivalence established using equipercentile equating. If the score distributions for the forms are skewed, in terms of the equipercentile definition of equivalence, mean and linear equating may do more harm than good. There are also important assumptions associated with IRT models. Most of these models require the assumption that the test is unidimensional. (Multidimensional models exist, but are not likely to be appropriate for use with the sample sizes

90

DAVID B. SWANSON ET AL.

commonly available in SP-based testing.) The one-parameter model, described previously, also requires the assumption of equal case discrimination. When this assumption is not met, the model may result in inappropriate score adjustments. Empirical tests of these assumptions are needed, both to provide general guidelines and as a basis for individual equating efforts. Finally, recently developed alternate procedures also merit serious consideration. The procedures described in this paper have all been adapted from tests based on multiple-choice items. Recent work using approaches based on structural equation modeling (Gessaroli, Swanson and De Champlain, 1998) or latent class analysis (Luecht and De Champlain, 1998) may prove to be superior in the specific context in which SP-based tests are used. Current information on SP-based assessment does demonstrate that equating must be given very serious consideration for high-stakes administrations. What remains is to do the research needed to determine how equating should be accomplished.
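As a concrete illustration of the bootstrap replication approach described above, the sketch below estimates the standard error of a simple mean-equating adjustment; the data are simulated, and for linear or equipercentile procedures the same loop would instead track equated values at selected score points.

```python
import numpy as np

def mean_equate(x_scores, y_scores):
    """Constant that places form-Y scores on the form-X scale (mean equating)."""
    return x_scores.mean() - y_scores.mean()

def bootstrap_equating_se(x_scores, y_scores, n_reps=500, seed=1):
    """SD of the equating adjustment across bootstrap resamples of examinees."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(n_reps):
        xb = rng.choice(x_scores, size=x_scores.size, replace=True)
        yb = rng.choice(y_scores, size=y_scores.size, replace=True)
        reps.append(mean_equate(xb, yb))
    return float(np.std(reps, ddof=1))

# Illustrative data: two randomly equivalent groups, 150 examinees per form.
rng = np.random.default_rng(7)
form_x = rng.normal(63, 6, 150)
form_y = rng.normal(60, 6, 150)
print("adjustment:", round(mean_equate(form_x, form_y), 2),
      "bootstrap SE:", round(bootstrap_equating_se(form_x, form_y), 2))
```

For the simulation-based variant, the same skeleton applies: generate data from known parameters, run the equating, and compare the estimates with the generating values to separate bias from random error.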

Security

For at least 40 years, performance tests have been known to pose serious security problems. The classic 1971 edition of Educational Measurement, edited by Thorndike, contains a chapter on performance testing by Fitzpatrick and Morrison that devotes considerable attention to security issues. They note that security is a more difficult problem for performance tests than traditional written exams because performance tests typically include only small numbers of tasks (cases in the SP world). The authors suggest that the most appropriate solution may be to "tell students in advance exactly what will constitute evidence of sufficient learning" (p. 263), noting, at the same time, that this solution is only appropriate when the set of performances included on the test comprehensively covers all learning objectives. Advice, quoted from Highland (1955), is then provided: to include as many cases as you can practically administer; to administer parallel forms of the test; to arrange test administration so that examinees taking a test earlier cannot disclose information to examinees taking it subsequently; and to forego feedback so that examinees will not know how well they did on individual cases.

Despite some recent research suggesting that examinees benefit little from advance access to information about an SP-based test, we think this advice from 45 years ago is remarkably current and appropriate for the high-stakes SP-based tests of today.

Most research concerning security issues in SP-based tests has been "studies of convenience" using one of two approaches. In the first approach, the performance of subgroups of examinees on the same test form is tracked over time. These have typically been done as observational/correlational studies in situations where multiple dates of test administration are required to assess all eligible examinees. Most commonly, these studies are done in the context of clinical practice exams in which test administration takes place over a period of days or weeks (Williams et al., 1987, 1992; Rutala et al., 1991; Colliver et al., 1991b, 1991c; Stillman et al., 1991; Skakun et al., 1992; Battles et al., 1994). Less commonly, similar studies have been done using end-of-clerkship exams taken by successive subgroups of students as they complete clerkship rotations (Niehaus et al., 1996; Furman et al., 1997). In some of these studies, examinees were asked whether or not they had solicited information from or provided information to other examinees6 (Williams et al., 1992; Furman et al., 1997). In the second approach, rather than tracking the performance of examinee subgroups, mean scores on individual (reused) stations are tracked across administrations of different test forms taken by cohorts of examinees over a period of years (Jolly et al., 1993; Cohen et al., 1993).

Study results have been variable. Researchers using the first approach have generally concluded that there is little evidence of security problems in SP-based tests, though, in the studies in which this information was collected, information solicited from examinees documented that information sharing did occur on a fairly large scale. Researchers using the second approach have concluded that there is evidence of security problems because mean scores on reused stations tend to be somewhat higher on later uses.

In our view, there are serious methodological problems with both of these approaches. In order to study security by monitoring trends in performance over time, four requirements must be met for results to be interpretable. First, subgroups taking the test (or reused stations) must be equivalent in the proficiency measured by the test, or an independent measure of proficiency must be available. Second, members of groups tested initially must actually share information with those tested subsequently. Third, statistical analyses must be sufficiently sensitive to detect non-trivial improvements in performance if they occur, which means that sample sizes (for both cases and subjects) must be relatively large. Last, scores must accurately reflect the quality of performance. For most previous research, one or more of these requirements are not met.

First, it is difficult to ensure that subgroups are equivalent in proficiency unless true random assignment is used to form the subgroups. Even with random assignment, subgroup equivalence is undermined if test administration takes place over a period of weeks or months in which examinees are engaged in related learning activities (e.g., medical school courses or clerkships). It is reasonable to expect that scores should be higher for later administrations because examinees have the advantage of additional training. In fact, the absence of such an effect calls into question the power of the study to detect differences, the validity of test scores, or both.7

The second requirement, that members of the initial groups tested are sharing information with subsequent groups, may not be met if examinees do not perceive the test as a high-stakes examination. Examinees must be motivated to obtain the information and to use it actively in preparation, and this requires the perception of high stakes. If examinees view an assessment as low-stakes, it is impossible to draw inferences about the susceptibility of SP-based tests to security breaches, because the breaches may not occur, or, if they do, there may be little motivation for examinees to use the information as a guide for test preparation. Even clinical practice examinations for which a passing score is required for graduation may be perceived as low-stakes if 1) examinees know that almost no one fails or 2) the only consequences of failing are re-taking all or selected portions of the examination after a brief period of study. There are many competing pressures for students' study time, and students prioritize their learning activities in response to those pressures. It seems unlikely that the results of studies conducted in a relatively low-stakes context will generalize to a high-stakes context in which poor performance has significant consequences for examinees.

The third requirement, that statistical procedures are sufficiently sensitive to detect non-trivial improvements as a result of security breaches, has rarely been met in previous research. Sample sizes in examinee subgroups tend to be small, and, even if relatively large groups are available, group means are not likely to be particularly sensitive indicators for estimating the effect of security breaches. For example, if only the bottom 10% of examinees (those most likely to fail) are motivated to take advantage of a security breach and only half of those examinees actually have access to useful information, even if all of those examinees improve their scores by a full standard deviation (a very large effect size), the impact on the group mean will only be 0.05 SDs, which is only detectable with extremely large sample sizes (the arithmetic is sketched at the end of this section). Even if a larger percentage of examinees receive advance information about test content and are motivated to study, because higher-proficiency examinees are likely to benefit less from the information, the effect size may still be relatively small for the group as a whole. Within-subject experimental designs are more appropriate, both because the effect of security breaches can be analyzed as a function of examinee proficiency and because estimates of effect sizes will be more precise if subjects serve as their own controls. We return to this topic below.

The fourth requirement, that test scores validly reflect quality of performance, has been challenged by several researchers (e.g., Norman et al., 1991; Williams et al., 1992; Swartz et al., 1995). For example, if examinee performance is assessed using checklist scores that (over)reward thoroughness in data gathering, examinees who efficiently gather key patient findings may receive lower scores.8 This effect could easily occur for both more proficient examinees and those with advance knowledge of test material due to a security breach. In this situation, because scores are not valid indicators of the quality of performance, the occurrence (and true impact) of security breaches on performance will not be detected.

The results of previous research appear to have convinced some researchers and proponents of SP-based tests that security is not an important issue. Simply put, we believe it is completely unreasonable to assume that security is a non-issue for SP-based tests, particularly given the limitations of most research to date. If examinees are motivated to perform well, and if they are capable of learning how to perform better, and if scores accurately reflect level of performance, it is a reasonable expectation that scores would improve as a result of security breaches. For example, numerous individuals typically have access to station checklists (and, therefore, scoring criteria) and other case-related materials in advance. If this information is shared with an examinee in advance of testing, given the specificity of checklist items, a higher score seems inevitable. If a checklist indicates that an examinee should ask when the pain started and if the pain was reduced by taking aspirin, even completely incompetent examinees should be able to memorize and ask the appropriate questions when they take a history.

To be useful, research on security issues cannot simply ask whether or not SP-based tests are vulnerable to security problems. Rather, a program of research should focus on the impact of advance access to different kinds of information (e.g., station topics, opening scenarios, the checklists themselves, post-encounter probes of various types). This is more likely to inform the development of policies and procedures that reduce the vulnerability of tests to security breaches. In addition, studies should be designed to ensure that examinees are motivated to perform at their highest level, that security breaches of various kinds actually do occur, and that examinees attempt to make use of the information. To aid in unambiguous interpretation of results, experimental designs, with subjects randomly assigned to treatment groups (defined by the type and specificity of information "leaked" to subjects, the format of instruction provided by expert "coaches", etc.), seem preferable. Ideally, for each study subject, breaches in security would occur for some cases but not others, allowing subjects to serve as their own controls in analysis, thus improving the precision with which effect sizes could be estimated for each study condition. And, of course, sample sizes (of both cases and subjects) must be large enough as well.

We are aware of only one study that used a design along these general lines, though it was smaller in scale than seems desirable (DeChamplain et al., under editorial review). In this work, 84 medical students were assigned to a control condition or one of two treatment conditions; to improve the comparability of the three groups, assignment was based upon a previously administered SP-based test (though randomization would probably have been sufficient). In one treatment condition, examinees were provided with checklists for three "exposed" cases and given 90 minutes to study the material prior to the test administration. In the other treatment condition, examinees participated in a 90-minute "test preparation course" in which a clinician presented general case information and suggested strategies for maximizing performance on the three exposed cases. Subjects in all three groups completed the same six-case exam, with test administration arranged to ensure that only the intended security breaches occurred. Subjects in groups with advance access to information performed more than a full SD better on exposed cases than subjects in the control condition; no group differences were observed on unexposed cases. Additional work, using similar strong research designs, seems badly needed.
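The 0.05 SD figure cited earlier in this section is simply the product of the proportion of examinees affected and the size of their gain, and the sample size needed to detect so small a shift (assuming, for illustration, a two-sided alpha of 0.05 and 80% power for a simple two-group comparison) shows why group means are insensitive indicators:

$$
\Delta_{\bar{X}} \;=\; \underbrace{0.10}_{\text{bottom decile motivated}} \times \underbrace{0.50}_{\text{half actually exposed}} \times \underbrace{1.0\ \text{SD}}_{\text{gain for those exposed}} \;=\; 0.05\ \text{SD},
$$

$$
n_{\text{per group}} \;\approx\; \frac{2\,(z_{1-\alpha/2}+z_{1-\beta})^{2}}{d^{2}} \;=\; \frac{2\,(1.96+0.84)^{2}}{(0.05)^{2}} \;\approx\; 6{,}300 \ \text{examinees per group}.
$$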


Figure 9. Variance components for a persons by SPs nested in cases (p × [SPs: cases]) design. Variance component estimates are hypothetical, but they are consistent with published and unpublished reports of studies investigating the magnitude of measurement error introduced by using multiple SPs to portray the same case role (Swanson and Norcini, 1989; Colliver et al., 1991a, 1994, 1998).

Relationships among Precision, Equating and Security

In this section, for each of the four scenarios, we outline the sources of measurement error that affect the precision of scores, the likely results of generalizability analyses, equating procedures that could be used, security considerations that arise, and the interrelationships among these. To make the discussion more concrete, we need one last set of variance components for use across all scenarios; these are provided in Figure 9. The figure gives variance components for Persons, Cases, SPs within Cases,9 and Residual, both for checklist-based measures of data-gathering skills and for ratings-based measures of communication skills. While the values are hypothetical, they are based on the variance component estimates resulting from investigations of the measurement error introduced by using multiple SPs to portray the same case role and to score examinee performance (Swanson and Norcini, 1989; Colliver et al., 1991a, 1994, 1998; DeChamplain, 1997, in press).

SCENARIO 1: TWO TEST FORMS WITH NO COMMON CASES / SINGLE SP PER CASE ROLE / ONE TEST DATE PER FORM

Regardless of the test administration scenario used, the first step in the analysis of the precision of test scores is estimation of variance components. This can generally be accomplished with commonly used statistical packages.10 Once variance component estimates are available, various indices of reproducibility can be calculated, both for the test as it was administered and for alternate test designs, in an effort to identify a more cost-effective (or more reproducible) approach to test administration.
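For a simple fully crossed persons-by-stations (p × s) design with no missing data, the estimation can be sketched directly from the usual random-effects expected mean squares. The code below is illustrative only: the data are simulated, the component values are loosely patterned on Figure 9, and the residual value is an assumption; unbalanced designs require the specialized software discussed in note 10.

```python
import numpy as np

def p_by_s_variance_components(scores):
    """scores: persons x stations matrix of percent scores (no missing data)."""
    n_p, n_s = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    station_means = scores.mean(axis=0)

    ss_p = n_s * ((person_means - grand) ** 2).sum()
    ss_s = n_p * ((station_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_s

    ms_p = ss_p / (n_p - 1)
    ms_s = ss_s / (n_s - 1)
    ms_res = ss_res / ((n_p - 1) * (n_s - 1))

    var_res = ms_res                      # persons x stations interaction + error
    var_p = (ms_p - ms_res) / n_s         # persons (true-score) component
    var_s = (ms_s - ms_res) / n_p         # stations (difficulty) component
    return var_p, var_s, var_res

# Simulated 200 examinees x 12 stations; the residual value (~170) is our guess.
rng = np.random.default_rng(2)
n_p, n_s = 200, 12
true = dict(p=25, s=97, res=170)
scores = (70 + rng.normal(0, np.sqrt(true["p"]), (n_p, 1))
             + rng.normal(0, np.sqrt(true["s"]), (1, n_s))
             + rng.normal(0, np.sqrt(true["res"]), (n_p, n_s)))
var_p, var_s, var_res = p_by_s_variance_components(scores)
print(f"persons {var_p:.1f}  stations {var_s:.1f}  residual {var_res:.1f}")
```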


For Scenario 1, because the dataset resulting from the administration design is disconnected, variance components should be estimated separately for each persons by stations dataset.11 The resulting variance component estimates can then be averaged to obtain pooled estimates. Cases, SPs within cases, and raters are hidden, confounded facets in this design, and the expected value for the variance component estimate for stations, representing variation in station difficulty, is the sum of these components. That is, the effects of differences in case difficulty, SP portrayal and SP stringency all have an impact on the variance component estimate for stations, but their separate effects cannot be estimated with this design. For example, if the true values for the variance components are those shown in Figure 9 (with the variance component for SPs reflecting both the effects of differences in portrayal and stringency), the stations variance component estimate should be roughly 97 (81 + 16) for data-gathering and 45 (9 + 36) for communication skills. Similarly, the obtained residual variance component would include various (confounded) interactions among persons, cases, SPs, and raters, though these would not be separately estimable with this design.

The variance components also provide a basis for estimating the likely magnitude of variability in test form difficulty. If a test form consists of 12 stations as illustrated in Figure 1, the expected mean and standard deviation of test form difficulties constructed by randomly assigning 12 stations to each form are 0 and 2.8% (the square root of 97/12), and the expected difference in mean difficulty for a pair of forms is roughly 4% (2.8% times the square root of 2), though, clearly, it can be much larger for some pairs of forms. Because the variance component for persons is only 25, implying a true-score SD of 5%, an expected difference of 4% in form difficulties is quite large, making it desirable to use an equating procedure to adjust scores for differences in form difficulty.

If no equating procedures are used and scores from both forms are to be treated interchangeably for making grading or promotion decisions, both the stations and residual variance components contribute to measurement error, even for norm-referenced score interpretation. As a consequence, without equating, the reproducibility of scores across forms is quite poor. For example, using the variance components for checklist-based data-gathering scores in Figure 9, the generalizability coefficient and norm-referenced SEM for the 12-station test forms in Figure 1 are roughly 0.53 and 4.7%. Analogous values for communication skills scores are somewhat better (0.72 and 3.7%) because the persons variance component is somewhat larger and the stations (cases plus SPs nested in cases) and residual components are somewhat smaller. The reproducibility of both the data-gathering and communication skills scores for this test design is too poor for most high-stakes decision-making situations.

This scenario represents a marked improvement in security relative to a single test form that is repeatedly used across multiple test dates. However, this improvement in security comes at a price. First, it is necessary to develop twice as many cases and to train twice as many SPs. Second, the measurement characteristics of the test forms may differ (e.g., one form may be easier than the other), introducing additional measurement error and making it desirable to statistically adjust (equate) scores to improve comparability. Because the design is disconnected, equating procedures that require common examinees are ruled out. If examinees have been randomly assigned to forms, mean (or linear) equating parameters may be estimated using the observed form means (and standard deviations) from the randomly equivalent groups. Of course, the groups must really be randomly equivalent. Examinees from different sites, cohorts, or testing dates (e.g., those signing up early versus late) should not be assumed to be randomly equivalent. Alternatively, a common external test taken by all examinees (e.g., because of the timing of the test administration, Step 1 might be suitable in the United States) can be used to place scores from different forms on roughly the same scale through statistical moderation.
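The arithmetic behind the Scenario 1 figures can be reproduced from the components quoted from Figure 9. Note that the residual components used below (about 170 for data gathering and 120 for communication ratings) and the persons component for communication (about 36) are not stated in the text; they are assumptions chosen only so that the quoted coefficients are recovered.

```python
from math import sqrt

def scenario1_indices(var_p, var_c, var_sp, var_res, n_stations=12):
    """Without equating, cases, SPs-within-cases, and residual all act as error."""
    err = (var_c + var_sp + var_res) / n_stations
    g_coef = var_p / (var_p + err)                 # generalizability coefficient
    sem = sqrt(err)                                # norm-referenced SEM (% points)
    form_sd = sqrt((var_c + var_sp) / n_stations)  # SD of form difficulties
    pair_diff = form_sd * sqrt(2)                  # the text's rule of thumb for a pair of forms
    return round(g_coef, 2), round(sem, 1), round(form_sd, 1), round(pair_diff, 1)

# Data-gathering components as cited in the text: persons 25, cases 81, SPs 16;
# residual ~170 is assumed.
print(scenario1_indices(25, 81, 16, 170))   # -> (0.53, 4.7, 2.8, 4.0)
# Communication skills: persons ~36 (assumed), cases 9, SPs 36; residual ~120 assumed.
print(scenario1_indices(36, 9, 36, 120))    # -> (0.72, 3.7, 1.9, 2.7)
```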

SCENARIO 2: TWO TEST FORMS WITH COMMON CASES / SINGLE SP PER CASE ROLE / MULTIPLE TEST DATES

Because the dataset resulting from use of this design is connected, as long as the software used can handle missing cells correctly, variance components can be estimated in a single persons by stations (p × s) random-effects analysis of variance. Because there are multiple dates of test administration, date of test administration is an additional factor that can be represented directly in analyses (e.g., persons nested in dates), though it is typically left hidden in research reports unless trends in examinee performance (e.g., due to security breaches) are of substantive interest. Even if dates of test administration are included in analysis, unless examinees are randomly assigned to dates (and no security breaches occur), it is not clear what value to expect for date-to-date variation in examinee performance, and the magnitude of the variance component for dates is uninterpretable. Even a value of zero for the dates variance component is uninterpretable, since this could represent the sum of opposing effects that cancel (e.g., if weaker students deliberately schedule for later test administration dates in anticipation of security breaches). If security problems do occur across test administration dates, the magnitude of any or all variance components could be affected. Because several stations are used in common, the magnitude of differences in form difficulty should be reduced relative to Scenario 1. Because the overlapping stations are, in effect, a linking test, the associated mean and linear equating procedures can be used to place scores on a common scale, and IRT-based methods are also applicable. Because a linking test is used, it is not necessary for examinees to be randomly assigned to groups: procedures for non-equivalent groups can be applied. However, random assignment of examinees to groups should still produce better equating results, because errors in estimation of differences in group proficiency should be reduced.
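A minimal sketch of equating through the common stations for non-equivalent groups is shown below, using chained mean equating for simplicity; the function and data are illustrative, and an operational program would more likely use the linear or IRT procedures cited earlier.

```python
import numpy as np

def chained_mean_equate(form1, form2, link1, link2):
    """
    form1, form2 : total percent scores on forms 1 and 2 (different examinee groups).
    link1, link2 : the same groups' percent scores on the common (linking) stations.
    Returns the constant to add to form-2 scores to place them on the form-1 scale,
    chaining each form to the linking stations via its own group.
    """
    return (form1.mean() - link1.mean()) - (form2.mean() - link2.mean())

# Illustrative data: group 2 is more proficient AND form 2's material is easier.
rng = np.random.default_rng(3)
theta1, theta2 = rng.normal(0, 5, 200), rng.normal(2, 5, 200)            # proficiencies
link1 = 60 + theta1 + rng.normal(0, 4, 200)                              # linking stations
link2 = 60 + theta2 + rng.normal(0, 4, 200)
form1 = 62 + theta1 + rng.normal(0, 4, 200)                              # form 1 (harder)
form2 = 66 + theta2 + rng.normal(0, 4, 200)                              # form 2 (easier)
adjust = chained_mean_equate(form1, form2, link1, link2)
print(f"add {adjust:.2f} points to form-2 scores")   # about -4: form 2 was ~4 points easier
```

Because the group-proficiency difference is estimated from the shared stations, it cancels out of the adjustment, which is exactly why random assignment of examinees to forms is helpful but not required in this scenario.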


If an equating procedure fully corrects for differences in form difficulty resulting from variation in case difficulty and patient portrayal/stringency, the associated measurement error is removed, and the reproducibility of test scores is improved. Of the variance components for assessment of data-gathering skills in Figure 9, only the Residual variance component contributes to measurement error. For a 12-station test, the resulting generalizability coefficient and norm-referenced SEM are 0.64 and 3.8%. Analogous values for ratings of communication skills are 0.78 and 3.2%. While these values may still be somewhat low for a high-stakes assessment, they are substantially improved over those for Scenario 1.

Unfortunately, while Scenario 2 can improve measurement precision through straightforward application of an equating procedure, the likelihood of security problems is substantially increased because test administration takes place on multiple test dates. In particular, if security problems occur for the overlapping cases (which is likely if the test forms are used sequentially), the linking test no longer provides accurate information concerning differences in group proficiency, and adjustments to scores through equating may reflect the security breach rather than true differences in group proficiency or form difficulty. There are clearly tradeoffs among measurement precision, applicability of equating procedures, and security. Improving one tends to introduce problems in another, increase the complexity of test administration, or both.
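In generalizability-theory terms, the gain comes from removing the station-difficulty components from the error term. Using the data-gathering components quoted from Figure 9 (with a residual of roughly 170, inferred from the coefficients reported in the text rather than stated), the norm-referenced calculations are approximately:

$$
E\rho^{2}_{\text{no equating}} \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \dfrac{\sigma^{2}_{c} + \sigma^{2}_{sp:c} + \sigma^{2}_{res}}{n_{s}}} \;=\; \frac{25}{25 + \tfrac{81+16+170}{12}} \;\approx\; 0.53, \qquad SEM \;=\; \sqrt{\tfrac{81+16+170}{12}} \;\approx\; 4.7\%,
$$

$$
E\rho^{2}_{\text{equated}} \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \dfrac{\sigma^{2}_{res}}{n_{s}}} \;=\; \frac{25}{25 + \tfrac{170}{12}} \;\approx\; 0.64, \qquad SEM \;=\; \sqrt{\tfrac{170}{12}} \;\approx\; 3.8\%.
$$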

SCENARIO 3: SINGLE TEST FORM WITH TWO REPLICATIONS / TWO SPS PER CASE ROLE / SINGLE TEST DATE

The generalizability analysis appropriate for this scenario depends upon whether or not the approach used in test administration results in a connected dataset. If it is connected, as illustrated in Figure 3, a persons by SPs nested in cases (p × [SPs: Cases]) random-effects analysis of variance can be performed,12 and the overall variation in scores is broken down into the sources of variance shown in Figure 9. More typically, one group of examinees is tested by one set of SPs in one circuit and the second group of examinees is tested by a second set of SPs in a second circuit, so neither examinees nor SPs overlap, producing a disconnected dataset. In this situation, the analysis is similar to that for Scenario 1, with SPs, rather than cases, defining the two test forms.

As for the first two scenarios, the variance components in Figure 9 can be used to estimate the likely magnitude of variation in test form difficulty. For the disconnected design and checklist-based data-gathering scores, if forms are constructed by randomly assigning one SP for each case to each form, the standard deviation of test form difficulties for the two forms should be 1.15% (the square root of 16/12), and the expected difference in difficulty for a pair of forms is roughly 1.6% (1.15% times the square root of 2). Relative to Scenario 1, these values represent a significant improvement in the comparability of forms, reflecting the fact that the same cases are used in both. For the same reason, the generalizability coefficient and norm-referenced SEM (0.62 and 3.9%) will be better for Scenario 3 than Scenario 1. The advantage of Scenario 3 over Scenario 1 for communication skills scores is much smaller (generalizability coefficient of 0.73; SEM of 3.6%), because the variance component for cases is relatively smaller and the variance component for SPs nested in cases is relatively larger for these scores.

With random assignment of examinees to groups of SPs, equating procedures for equivalent groups can be applied to adjust scores for differences in patient portrayal/stringency, improving the comparability of scores. However, statistical procedures for equivalent groups require large sample sizes, and the actual improvement in score comparability through use of an equating procedure may be modest. If sample sizes are too small, the adjusted scores may actually be less precise than the unadjusted scores. If a connected design is used, as illustrated in Figure 3, a linking test becomes available, making it possible to achieve a more precise equating of scores. In addition, it is no longer necessary (though still desirable) that forms are taken by equivalent groups. If the test administration pattern results in complexly connected test forms, IRT-based equating methods, in particular, are attractive. This approach to equating can provide an estimate of station difficulty and SP stringency that is independent of the proficiency of the group of examinees who worked with each SP. This makes it possible to project the difficulty of test forms consisting of any combination of SPs, greatly increasing flexibility and improving the precision of the adjusted scores, since variation in test form difficulty can be taken into account in the adjustment process. As a consequence, the precision of the adjusted scores should be similar to that for Scenario 2. In practice, results should be somewhat better than for Scenario 2, because there is no need to adjust for differences in case difficulty: the same set of cases is used for all examinees. This eliminates error introduced into the equating process by the estimation of differences in form difficulty resulting from different samples of cases.

Scenario 3 appears to be an excellent compromise that is responsive to precision, equating, and security considerations. Using multiple SPs to portray the same case roles makes it possible to test small to moderate numbers of examinees (dozens but not hundreds) on the same day, substantially reducing security concerns by limiting opportunities for examinees to communicate about cases and SPs. Of course, this improvement does come at a price: more SPs must be trained, and the logistics of test administration become more complex, particularly if a well-connected dataset is to result so that score adjustments for differences in SP portrayal/stringency can be made more precisely. While IRT-based equating procedures introduce additional analytic complexity, the additional measurement error introduced by the use of multiple SPs can be substantial, so it is desirable to adjust scores to improve comparability, with the choice of procedure dependent on connectedness and sample sizes.
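The IRT-based adjustment itself is beyond a short sketch, but the underlying idea of separating SP stringency from examinee proficiency in a connected dataset can be illustrated with a simple additive least-squares analogue. This is our own simplification, not the procedure described in the text, and all names and data are hypothetical.

```python
import numpy as np

def estimate_sp_stringency(scores, person_ids, sp_ids):
    """
    scores     : one encounter-level percent score per row.
    person_ids : examinee index for each row.
    sp_ids     : SP index for each row (multiple SPs per case, connected design assumed).
    Fits score = person effect + SP effect by least squares and returns the SP
    effects (stringency/portrayal shifts), centered to mean zero; subtracting
    the relevant effect from each raw score puts all SPs on a common scale.
    """
    n_obs = scores.size
    n_p, n_sp = person_ids.max() + 1, sp_ids.max() + 1
    X = np.zeros((n_obs, n_p + n_sp))
    X[np.arange(n_obs), person_ids] = 1.0
    X[np.arange(n_obs), n_p + sp_ids] = 1.0
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)   # rank-deficient; contrasts are estimable
    sp_effects = beta[n_p:]
    return sp_effects - sp_effects.mean()

# Tiny demo: 3 examinees, one case portrayed by 2 SPs, connected by overlapping encounters.
scores     = np.array([70., 64., 75., 68., 60.])
person_ids = np.array([0, 0, 1, 2, 2])
sp_ids     = np.array([0, 1, 0, 0, 1])
print(estimate_sp_stringency(scores, person_ids, sp_ids))
```

Because SP contrasts are only estimable when the dataset is connected, the adjustment is meaningful only for connected designs, and with small samples it may do more harm than good, as noted above.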


SCENARIO 4: MULTIPLE FORMS WITH COMMON CASES / MULTIPLE SPS PER CASE ROLE / MULTIPLE TEST DATES

For large-scale test administration (hundreds to thousands of examinees to be tested), Scenario 4 is really the only viable option. As for Scenario 3, the generalizability analysis appropriate for Scenario 4 depends upon whether or not the dataset is connected. If it is connected, as shown in Figure 4, a persons by SPs nested in cases (p × [SPs: Cases]) random-effects analysis of variance can again be performed, and variance component estimates like those in Figure 9 result. Use of a connected design (achieved through centralization of test administration, by moving SPs across sites, or by rating videotapes of SPs from other sites) also makes it possible to use equating procedures to estimate and adjust for differences in form difficulty. For large-scale administrations at geographically dispersed sites, however, the resulting overall dataset will generally not be connected. Instead, a connected dataset will be produced for each site, but these will not interconnect across sites unless SPs move from site to site.13 If the dataset is disconnected, separate analyses can be performed for each of the smaller connected datasets, and the resulting variance components can be averaged to obtain pooled estimates. However, unless examinees are randomly assigned to sites of test administration (very unlikely), variance components for sites (and dates, if they are represented directly in the analysis) will not be interpretable without strong assumptions about the equivalence of SPs (and raters, if these are not the SPs portraying the case role) across sites.

From the standpoint of expected variation in the difficulty of test forms, Scenario 4 is similar to Scenarios 1 and 2: because both cases and SPs within cases vary from form to form, sizable differences in form difficulty are likely to result. For the same reason, the reproducibility of test scores will be adversely affected. Further, if SPs are trained at multiple sites, variation in training across sites may well result in large, systematic differences in form difficulty because of training-related variation in portrayal and stringency.14 If the test administration results in a disconnected dataset, the precision of scores will be adversely affected: for a 12-station test, indices of precision will be no better than for Scenario 1, and they may be worse because of training-related variation in portrayal and stringency. As a consequence, much longer test forms will be needed to achieve results comparable to those for a connected design in which an equating procedure is used.

Because test administration takes place over an extended time with reuse of cases and SPs, security problems are likely. The magnitude of these problems should be heavily influenced by the size of the case and SP pool from which test forms are constructed and the frequency with which cases and SPs are reused. Large pools of cases and SPs make it more difficult for examinees, over time, to focus their studies on case material represented in the pool.15 Further, individual test forms are more likely to include cases with which examinees are unfamiliar, providing a better basis for assessment of proficiency. However, large banks with infrequent reuse of cases and SPs make equating more difficult and less precise, even if the resulting dataset is connected. As the number of forms increases because of the use of a large case pool and/or multiple SPs per role, the advantages of equating procedures based on IRT (or structural equation) models become more apparent. Over time, interconnections among test forms become very complex. With these models, large numbers of test forms can still be linked in a single analytic process.
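Because so much in Scenarios 3 and 4 hinges on whether the dataset is connected, it may be worth checking connectedness directly before attempting any equating. The sketch below treats examinees and SPs as nodes of a bipartite graph linked by encounters; it is purely illustrative.

```python
from collections import defaultdict

def is_connected(encounters):
    """
    encounters: iterable of (examinee_id, sp_id) pairs, one per SP encounter.
    Returns True if every examinee and SP can be linked to every other through
    shared encounters, i.e., the dataset is connected enough to place SP
    stringency and form difficulty estimates on a common scale.
    """
    graph = defaultdict(set)
    for person, sp in encounters:
        graph[("p", person)].add(("sp", sp))
        graph[("sp", sp)].add(("p", person))
    if not graph:
        return True
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:                       # simple depth-first traversal
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(graph)

# Two sites that never share SPs or examinees form a disconnected dataset:
site1 = [(1, "sp_a"), (1, "sp_b"), (2, "sp_a")]
site2 = [(3, "sp_c"), (4, "sp_c")]
print(is_connected(site1))            # True
print(is_connected(site1 + site2))    # False until an SP (or examinee) crosses sites
```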

Discussion

Over the past decade, the use of SP-based tests has increased dramatically. Through research, we have gained a much better understanding of the strengths and weaknesses of SP-based testing. In this paper, we have addressed three inter-related topics: methods for estimating the precision of scores, procedures for placing (equating) scores from different test forms onto the same scale, and threats to the security of SP-based exams, focusing on situations in which high-stakes decisions depend upon test results. We have tried to provide a framework for thinking about precision, equating, and security problems and have explored the inter-relationships among these. Unfortunately, there is no straightforward formula that can be used to work out the best design for every testing situation. We have emphasized the interplay and tradeoffs among precision, equating, and security considerations: procedures that aid in maintaining security increase the complexity of test construction and administration, as well as the analytic methods required to examine precision and equate scores across test forms.

While generalizability theory has increasingly been employed to analyze factors influencing the precision of test scores, it is very common for investigators to use a simple persons by stations design for analysis when the actual structure of the test administration design is much more complex. The resulting variance components are, at best, imprecise; they can also be misleading. More sophisticated analytic procedures are needed, and, preferably, they should be applied to large, highly connected datasets that provide more useful information for test design.

Development of equating procedures for SP-based tests is in its infancy, with most of the methods adapted from multiple-choice testing. Research on the applicability of these methods for SP-based tests (and performance-based tests in general) is definitely needed. Despite the obvious importance of adjusting scores on alternate test forms to reduce measurement error and ensure equitable treatment of examinees, even basic information on sample size requirements (for both stations and examinees) is not currently available. For SP-based assessment to be used effectively in high-stakes tests, this situation needs to change. Test administration designs that result in connected datasets appear to be an obvious first step in the right direction, because better equating techniques become possible as a consequence.


In addition, more and better research on security-related issues is badly needed. Almost all research to date has been grafted onto operational testing programs. These uncontrolled, correlational studies have been plagued with methodological problems and, for the most part, have produced uninterpretable results. Well-controlled experimental research is required. This work should not simply ask "are there security problems with SP-based tests?" Instead, research should focus on the impact of advance access to different kinds of information and on measures that can be taken to enhance security. This kind of work is much more likely to inform the development of policies and procedures that reduce the vulnerability of SP-based tests to security breaches.

Despite the problems we have identified, we continue to think that SP-based tests, used properly, are potentially the best available method for assessment of clinical skills, both in low- and high-stakes contexts. We also think that it is more appropriate "to view the glass as half empty, rather than half full." In the long run, this perspective seems more likely to lead to improvement in SP-based testing (and we think there is much room for improvement). Given the increased use of SP-based assessment in high-stakes tests, the exams (and research on the exams) must become stronger technically.

Notes

1. Descriptions of test administration procedures in the SP assessment literature (including papers on which we are coauthors) are often vague, even when they are directly relevant to the issues under study. For example, discussions of the number of SPs portraying each case role, the number of test dates, and the procedure used to assign examinees to test dates are commonly omitted, even in studies of score reproducibility and security. Yet, these details affect both the analytic methods that should be used and the interpretation of results.
2. The indices discussed in this section all relate to the reproducibility of scores. Increasingly, SP-based assessment is used for mastery testing, in which pass/fail outcomes are of primary interest. In these situations, an index of decision consistency is more appropriate, though the SEM provides very useful information in this context as well. G-theory does provide an index termed the adjusted dependability coefficient for mastery-testing situations. However, this index is heavily influenced by the location of score distributions in relation to the pass/fail point, and it tends to provide misleadingly encouraging results. See Livingston and Lewis (1995) for a decision consistency index that is appropriate for SP-based assessments (and other performance-based tests) in mastery-testing situations.
3. The results shown in the graphs are likely for most SP-based test administrations. However, if raters are poorly selected or poorly trained, rater-related sources of measurement error can be much larger. In this circumstance, it may be desirable to have multiple raters per station (or to provide better training).
4. The presentation of equating procedures and designs that follows is intended to provide a conceptual understanding of the process. It is insufficient as a basis for implementing these procedures. The interested reader is referred to Angoff (1984), Hambleton and Swaminathan (1985), and Kolen and Brennan (1995) for a more detailed description.
5. A sufficiently large subsample from both groups can also be used.
6. This strikes us as a very good idea to aid in the interpretation of results.
7. For example, in studies using multiple-choice tests to measure learning across clerkship rotations, research has typically shown that students taking a clerkship in later rotations perform better than students taking earlier rotations (Woolliscroft et al., 1995; Ripkey et al., 1997; Whelan and Moses, 1990). Over the course of an academic year, performance typically improves by 0.5 SDs or more. It is unclear why increases in scores should not also be observed for SP-based tests, though the skills may mature at a different rate than for multiple-choice tests.
8. This has been a very common problem in tests involving written and computer-based clinical simulations (Swanson et al., 1987, 1995).
9. The variance component in Figure 9 for SPs within Cases is intended to reflect the combined impact of SP-to-SP variation in portrayal and stringency. To our knowledge, no one has developed separate estimates of these effects, so, rather than guessing the magnitude of each individually, we used estimates of the combined value from the literature.
10. For balanced designs, the GENOVA software (Crick and Brennan, 1983) is easiest to use, because it provides tools tailored for estimating variance components and calculating indices of reproducibility for alternate test designs. The most recent versions of BMDP, SAS, and SPSS include procedures for estimation of variance components; these will handle unbalanced designs, but no tools are available for calculating indices of reproducibility. A practical strategy is to estimate variance components with a statistical package, with GENOVA (which accepts variance components as input) used subsequently to perform more specialized analyses.
11. Alternatively, the design can be specified as a random-effects Persons by Stations nested in Blocks ([p × s]:b) ANOVA, which will generally produce similar estimates of variance components. However, the variance component for blocks will not be interpretable because the design is disconnected. If examinees are randomly assigned to blocks and the variance component for blocks is close to zero, this does not mean that forms are equally difficult; it simply means that the observed differences in form difficulty are no larger than would be expected, given the magnitude of the stations variance component.
12. To obtain correctly estimated variance components, the statistical package used must be able to estimate variance components for unbalanced designs with many missing cells.
13. Even if SPs do move from site to site, unless this is done on a large scale, the design will only be weakly connected. As a consequence, estimates of parameters needed for equating may be very imprecise, and score adjustments based on them may do more harm than good. Research on sample size requirements and equating procedures to achieve good results in this setting is needed.
14. Other than a few small-scale studies (e.g., Reznick et al., 1992; Tamblyn, 1989; Tamblyn et al., 1991a, 1991b), little is known about the impact of multiple training and test administration sites on the accuracy of portrayal and the reproducibility of scores. More work, focused on cross-site quality control procedures, is needed.
15. Some may argue that, if examinees prepare for the test by studying case materials in the pool, they are simply learning medicine, and this is a beneficial side effect of using SP-based tests. We do not agree. For this argument to hold, it would be necessary to show that preparation for one collection of cases improves performance on other cases: that score gains resulting from study would occur for both the studied cases and others as well. We are unaware of any empirical demonstrations that this is true. "Content specificity" – lack of transfer across cases – is the more common outcome of research in this area.

References

Angoff, W. H. (1971). Scales, Norms, and Equivalent Scores. In: R. L. Thorndike (ed.) Educational Measurement (2nd edn.). Washington, DC: American Council on Education, 508–600.
Association of American Medical Colleges (1998). Emerging Trends in the Use of Standardized Patients. Contemporary Issues in Medical Education 1(7), 1–2.
Battles, J. B., Carpenter, J. L., McIntire, D. & Wagner, J. M. (1994). Analyzing and Adjusting for Variables in a Large-Scale Standardized-Patient Examination. Academic Medicine 69(5), 370–376.
Brennan, R. L. (1992). Elements of Generalizability Theory (rev. ed.). Iowa City, IA: American College Testing Program.
Brennan, R. L. (1995). Generalizability of Performance Assessments. Educational Measurement: Issues and Practice 14(4), 9–12, 27.
Case, S. M., Templeton, B., Samph, T. & Best, A. M. III (1992). Comparison of Observation-Based and Chart-Based Scores Derived from Standardized Patient Encounters. In: R. Harden, I. Hart & H. Mulholland (eds.) Approaches to Assessment of Clinical Competence. Norwich, England: Page Brothers, 471–475.
Clauser, B. (1998). Equating Performance Assessments with the Rasch Rating-Scale Model Using Internal and External Links. Paper Presentation, Annual Meeting of the American Educational Research Association.
Cohen, R. et al. (1993). Impact of Repeated Use of Objective Structured Clinical Examination Stations. Academic Medicine (October Supplement), S73–S75.
Colliver, J. A. et al. (1989). Reliability of Performance on Standardized-Patient Cases: A Comparison of Consistency Measures Based on Generalizability Theory. Teaching and Learning in Medicine 1(1), 31–37.
Colliver, J. A. et al. (1990). Three Studies of the Effect of Multiple Standardized-Patients on Intercase Reliability of Five Standardized-Patient Examinations. Teaching and Learning in Medicine 2(4), 237–245.
Colliver, J. A. et al. (1991a). Effects of Using Two or More Standardized-Patients to Simulate the Same Case on Case Means and Case Failure Rates. Academic Medicine 66(10), 616–618.
Colliver, J. A. et al. (1991b). Test Security in Examinations That Use Standardized-Patient Cases at One Medical School. Academic Medicine 66(5), 279–282.
Colliver, J. A. et al. (1991c). Test Security in Examinations Using Standardized-Patient Cases for Five Classes of Senior Medical Students. Academic Medicine 66, 279–282.
Colliver, J. A. et al. (1994). Effect of Using Multiple Standardized Patients to Rate Interpersonal and Communication Skills on Intercase Reliability. Teaching and Learning in Medicine 6(1), 45–48.
Colliver, J. A. et al. (1998). The Effect of Using Multiple Standardized Patients on the Inter-Case Reliability of a Large-Scale Standardized-Patient Examination Administered over an Extended Testing Period. Academic Medicine 73 (October Supplement), S81–S83.
Crick, J. E. & Brennan, R. L. (1983). The Manual for GENOVA. Iowa City, Iowa: American College Testing Program.
Cronbach, L. J., Gleser, G. C., Nanda, H. H. & Rajaratnam, N. (1972). Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York: John Wiley and Sons Inc.
DeChamplain, A. F. et al. (1997). Standardized Patients' Accuracy in Recording Examinees' Behaviors Using Checklists. Academic Medicine 72 (October Supplement), S85–S87.
DeChamplain, A. F. et al. (in press). Do Standardized Patients' Recording Discrepancies Impact upon Case and Examination Mastery-Level Decisions? Academic Medicine.
DeChamplain, A. F. et al. (under editorial review). Modeling the Effects of a Security Breach and Test Preparation on a Large-Scale Performance-Based Assessment.
Fitzpatrick, R. & Morrison, E. J. (1971). Performance and Product Evaluation. In: R. L. Thorndike (ed.) Educational Measurement. Washington, DC: American Council on Education, 237–270.
Furman, G. E. et al. (1997). The Effect of Formal Feedback Sessions on Test Security for a Clinical Practice Examination Using Standardized Patients. In: A. J. J. A. Scherpbier, C. P. M. Van der Vleuten, J. J. Rethans & A. F. W. Van der Steeg (eds.) Advances in Medical Education. Dordrecht, The Netherlands: Kluwer Academic Publishers, 433–436.
Gessaroli, M. E., Swanson, D. B. & DeChamplain, A. F. (1998). Equating Performance Assessments Using Structural Equation Models. Paper Presentation, Annual Meeting of the American Educational Research Association.
Grand'Maison, P. et al. (1992). Large Scale Use of an Objective Structured Clinical Examination for Licensing Family Physicians. Can Med Assoc J 146(10), 1735–1740.
Hambleton, R. K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer Academic Publishers.
Highland, R. W. (1955). A Guide for Use in Performance Testing in Air Force Technical Schools. Armament Systems Personnel Research Laboratory. Colorado: Lowry Air Force Base.
Jolly, B. (1993). Learning Effect of Reusing Stations in an Objective Structured Clinical Examination. Teaching and Learning in Medicine 6(2), 66–71.
Klass, D. J. (1994). High-Stakes Testing of Medical Students Using Standardized Patients. Teaching and Learning in Medicine 6, 23–27.
Klass, D. J. et al. (1994). Progress in Developing a Standardized Patient Test of Clinical Skills at The National Board of Medical Examiners: Prototype Two. Proceedings of The Sixth Ottawa Conference on Medical Education. Toronto, Canada: University of Toronto Bookstore Custom Publishing, 324–326.
Klass, D. J. et al. (in press). Development of a Performance-Based Test of Clinical Skills for the United States Medical Licensing Examination. Proceedings of the 8th Annual Ottawa Conference.
Kolen, M. J. & Brennan, R. L. (1995). Test Equating: Methods and Practices. New York: Springer.
Linn, R. (1993). Linking Results of Distinct Assessments. Applied Measurement in Education 6(1), 83–102.
Livingston, S. & Lewis, C. (1995). Estimating the Consistency and Accuracy of Classifications Based on Test Scores. Journal of Educational Measurement 32(2), 179–197.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Luecht, R. M. & DeChamplain, A. F. (1998). Applications of Latent Class Analysis to Mastery Decisions Using Complex Performance Assessments. Paper Presentation, Annual Meeting of the American Educational Research Association.
Mislevy, R. (1992). Linking Educational Assessments: Concepts, Issues, Methods, and Prospects. ERIC Document #ED353302.
Niehaus, A. H., DaRosa, D. A., Markwell, S. J. & Folse, R. (1996). Is Test Security a Concern when OSCE Stations Are Repeated across Clerkship Rotations? Academic Medicine 71 (October Supplement), S287–S289.
Norman, G. R., Van der Vleuten, C. P. M. & de Graaff, E. (1991). Pitfalls in the Pursuit of Objectivity: Issues of Validity, Efficiency and Acceptability. Medical Education 25, 119–126.
Reznick, R. K., Smee, S. M., Rothman, A. I., Chalmers, A., Swanson, D. B. & Dufresne, L. et al. (1992). An Objective Structured Clinical Examination for the Licentiate: Report of the Pilot Project of the Medical Council of Canada. Academic Medicine 67, 487–494.
Reznick, R. K., Blackmore, D. E., Cohen, R., Baumber, J., Rothman, A. I., Smee, S. M., Chalmers, A., Poldre, P., Birdwhistle, R., Walsh, P., Spady, D. & Berard, M. (1993). An Objective Structured Clinical Exam for the Licentiate of the Medical Council of Canada: From Research to Reality. Academic Medicine 68 (Suppl.), S4–S6.
Reznick, R. K., Blackmore, D. E., Dauphinee, W. D., Smee, S. M. & Rothman, A. I. (1997). An OSCE for Licensure: The Canadian Experience. In: A. J. J. A. Scherpbier et al. (eds.) Advances in Medical Education. Dordrecht: Kluwer Academic Publishers, 458–461.
Ripkey, D. R., Case, S. M. & Swanson, D. B. (1997). Predicting Performances on the NBME Surgery Subject Test and USMLE Step 2: Effects of Surgery Clerkship Timing and Length. Academic Medicine 72 (October Supplement), S31–S33.
Rothman, A. I., Cohen, R., Dawson-Saunders, E., Poldre, P. P. & Ross, J. (1992). Testing the Equivalence of Multiple Station Tests of Clinical Competence. Academic Medicine 67 (October Supplement), S40–S41.
Rutala, R. J. (1991). Sharing of Information by Students in an OSCE. Archives of Internal Medicine 151, 541–544.
Searle, S. R. (1971). Linear Models. New York: John Wiley and Sons.
Shavelson, R., Webb, N. & Rowley, G. (1989). Generalizability Theory. American Psychologist 44(6), 922–932.
Skakun, E. N., Cook, D. A. & Morrison, J. C. (1992). Test Security on Sequential OSCE and Multiple-Choice Examinations. In: I. R. Hart, R. M. Harden & J. Des Marchais (eds.) Current Developments in Assessing Clinical Competence. Montreal, Canada: Can-Heal Publications, 711–718.
Stillman, P. L. et al. (1991). Is Test Security an Issue in a Multistation Clinical Assessment? – A Preliminary Study. Academic Medicine 66 (October Supplement), S25–S27.
Swanson, D. B. (1987). A Measurement Framework for Performance-Based Tests. In: I. Hart & R. Harden (eds.) Further Developments in Assessing Clinical Competence. Montreal: Can-Heal Publications, Inc., 13–42.
Swanson, D. B. & Norcini, J. J. (1989). Factors Influencing the Reproducibility of Tests Using Standardized Patients. Teaching and Learning in Medicine 1, 158–166.
Swanson, D. B., Norcini, J. J. & Grosso, L. J. (1987). Assessment of Clinical Competence: Written and Computer-Based Simulations. Assessment and Evaluation in Higher Education 12(3), 220–246.
Swanson, D. B., Norman, G. R. & Linn, R. (1995). Performance-Based Assessment: Lessons from the Health Professions. Educational Researcher 24(5), 5–11, 35.
Swartz, M. H. et al. (1995). The Effect of Deliberate, Excessive Violations of Test Security on a Standardized-Patient Examination: An Extended Analysis. In: Proceedings of The Sixth Ottawa Conference on Medical Education. Toronto, Canada: University of Toronto Bookstore Custom Publishing, 280–284.
Tamblyn, R. M. (1989). The Use of Standardized Patients in the Evaluation of Clinical Competence: The Evaluation of Selected Measurement Properties. Doctoral Thesis, McGill University, Department of Epidemiology, Montreal.
Tamblyn, R. M. et al. (1991a). The Accuracy of Standardized Patient Presentation. Medical Education 25, 100–109.
Tamblyn, R. M. et al. (1991b). Sources of Unreliability and Bias in Standardized-Patient Rating. Teaching and Learning in Medicine 3, 74–85.
Van der Linden, W. J. & Hambleton, R. K. (1997). Handbook of Modern Item Response Theory. New York: Springer.
Van der Vleuten, C. P. M. (1996). The Assessment of Professional Competence: Developments, Research, and Practical Implications. Advances in Health Sciences Education 1, 41–67.
Van der Vleuten, C. P. M. & Swanson, D. B. (1990). Assessment of Clinical Skills with Standardized Patients: State of the Art. Teaching and Learning in Medicine 2, 58–76.
Whelan, G. P. et al. (in press). Educational Commission for Foreign Medical Graduates Clinical Skills Assessment. Proceedings of the 8th Annual Ottawa Conference.
Whelan, G. P. & Moses, V. K. (1990). The Effect on Grades of the Timing and Site of Third-year Internal Medicine Clerkships. Academic Medicine 65(11), 708–709.
Williams, R. G. et al. (1987). Direct Standardized Assessment of Clinical Competence. Medical Education 21, 482–489.
Williams, R. G., Lloyd, J. S. & Simonton, D. K. (1992). Sources of OSCE Examination Information and Perceived Helpfulness: A Study of the Grapevine. In: I. R. Hart, R. M. Harden & J. Des Marchais (eds.) Current Developments in Assessing Clinical Competence. Montreal, Canada: Can-Heal Publications, 363–370.
Woolliscroft, J. O., Swanson, D. B., Case, S. M. & Ripkey, D. R. (1995). Monitoring the Effectiveness of the Clinical Curriculum: Use of a Cross-Clerkship Exam to Assess Development of Diagnostic Skills. In: A. I. Rothman & R. Cohen (eds.) Proceedings of the Sixth Ottawa Conference on Medical Education. Toronto, Canada: University of Toronto Bookstore Custom Publishing, 476–478.
Wright, B. D. & Masters, G. N. (1982). Rating Scale Analysis. Chicago: MESA Press.
Wright, B. D. & Stone, M. H. (1979). Best Test Design. Chicago: MESA Press.