Measurement and Evaluation in Counseling and Development

ISSN: 0748-1756 (Print) 1947-6302 (Online) Journal homepage: http://www.tandfonline.com/loi/uecd20

To cite this article: Gerta Bardhoshi & Bradley T. Erford (2017). Processes and Procedures for Estimating Score Reliability and Precision. Measurement and Evaluation in Counseling and Development, 50(4), 256–263. https://doi.org/10.1080/07481756.2017.1388680

Published online: 24 Oct 2017.



METHODS PLAINLY SPEAKING

Processes and Procedures for Estimating Score Reliability and Precision

Gerta Bardhoshi (University of Iowa, Iowa City, IA, USA) and Bradley T. Erford (Vanderbilt University, Nashville, TN, USA)

ABSTRACT

Precision is a key facet of test development, with score reliability determined primarily according to the types of error one wants to approximate and demonstrate. This article identifies and discusses several primary forms of reliability estimation: internal consistency (i.e., split-half, KR–20, α), test–retest, alternate forms, interscorer, and interrater reliability.

KEYWORDS

Internal consistency; reliability; test–retest reliability

Measurement precision is an important prerequisite for meaningful application and interpretation of psychological assessment results in both research and practice. A central task in determining precision when measuring unobservable characteristics pertinent to counseling and development is determining the inherent error in measuring such psychological constructs. Reliability communicates to practitioners and researchers alike the degree to which scores are consistent and accurate over repeated administrations (Erford, 2013; Meyer, 2010). This consistency ultimately influences the interpretations and conclusions of an assessment. Reliability influences validity, and both influence the ability to use the assessment to make clinical or academic decisions. The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 2014) broadly categorized sources of random error into internal conditions (e.g., being hungry or tired while completing the assessment) or external conditions (e.g., level of noise during administration). Because no set of test scores is perfectly reliable, and internal and external conditions cannot be completely controlled, professional counselors and researchers estimate reliability under standardized conditions on varied samples (Erford, 2013; Meyer, 2010).

Using classical test theory (Mellenbergh, 2011), when discussing scores obtained through an assessment, one is referring to the participant's observed score (X), which is likely to vary somewhat higher or lower with each test administration. This differs from the client's true score (T), abstractly defined as the score the client would have obtained if the assessment and measurement conditions were completely free of random error (E). That is, if there were no error (i.e., test scores were perfectly reliable), the participant's observed score (X) and true score (T) would be identical every time the test is administered. The well-known formula X = T + E explains how these concepts are related. The reliability of measurements (rXX) denotes the proportion of the observed score variance that is true score variance, with the formula rXX = TSV / (TSV + EV), where TSV is true score variance and EV is error variance. Therefore, reliability is defined as the ratio of true score variance to observed score variance in a population of persons. This means that reliability coefficients are interpreted directly as a percentage of true score variance (e.g., .90 means 90% true score variance and 10% error variance); you do not square the coefficient to obtain the coefficient of determination (r2) as you would to interpret a correlation coefficient or validity coefficient.
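
To make the variance-ratio definition concrete, the brief Python sketch below (not part of the original article; the simulated true-score and error variances are illustrative assumptions) builds observed scores as X = T + E and shows that the ratio TSV / (TSV + EV) matches the correlation between two error-laden administrations.

```python
import numpy as np

# A minimal sketch of classical test theory (X = T + E); the true-score and
# error standard deviations below are illustrative assumptions, not values
# from the article.
rng = np.random.default_rng(0)
n = 100_000

true_scores = rng.normal(50, 10, n)    # T: true-score SD = 10, so TSV = 100
error_time_1 = rng.normal(0, 5, n)     # E: error SD = 5, so EV = 25
error_time_2 = rng.normal(0, 5, n)     # independent error on a second administration

observed_1 = true_scores + error_time_1   # X at Time 1
observed_2 = true_scores + error_time_2   # X at Time 2

# Reliability as the ratio of true score variance to observed score variance:
# r_XX = TSV / (TSV + EV) = 100 / 125 = .80
print(true_scores.var() / observed_1.var())        # approximately 0.80
# Correlating two error-laden administrations estimates the same quantity.
print(np.corrcoef(observed_1, observed_2)[0, 1])   # approximately 0.80
```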


Furthermore, the type of reliability estimated will determine the type of error variance in each analysis, as will the score variability obtained on different samples of participants or clients. This is why one never says, "The test is reliable." Instead, in addition to the reliability indexes and statistical method used, one reports sampling procedures and sample descriptive statistics to allow informed judgments on the trustworthiness of scores (AERA, APA, & NCME, 2014). Thus, one says, "Test scores are reliable"; reliability is a property of the set of scores involved in the analysis (i.e., sample-dependent results; Erford, 2013; Meyer, 2010). As such, different samples under different measurement conditions yield different reliability estimates.

Reliability coefficients range from 0.00 (no reliability; all error variance) to 1.00 (perfect reliability; error free). Typically, reliability coefficients of 0.80 or higher are suitable for screening-level clinical decisions (Erford, 2013). Diagnostic assessments have a higher criterion because the decisions involved are more consequential, so the desired reliability coefficient is 0.90 or higher. Although reliability coefficients of less than 0.80 denote substantial error variance and therefore inconsistent conclusions, this metric is not the only framework for interpreting reliability (see the discussion of the standard error of measurement later). The most common types of reliability estimates involve internal consistency, test–retest, alternate forms, interscorer, and interrater reliability. Following are explanations, examples, and illustrations for each.

Internal Consistency

Internal consistency estimates of score reliability refer to the interrelatedness of items within a given assessment or subscale: How well do the items hang together and measure the same construct? For example, one would expect a set of depression items to be more internally consistent than a mixed set of depression and defiance items. A major advantage of internal consistency studies is that only a single administration of the test is required. The items are then split and analyzed using various methodologies. Three primary methods have historically been used to estimate internal consistency: split-half reliability (r12), Kuder–Richardson Formula 20 (KR–20), and coefficient alpha (α; Erford, 2013; Meyer, 2010). These can be calculated easily with most statistical programs' reliability analysis and bivariate correlation functions. For r12, α, and KR–20, error variance is composed primarily of heterogeneity of the item set.

The Split-Half Method

A simple way to estimate internal consistency with a group of examinees is split-half reliability. This parallel subtest construction involves one of the following recommended methods: (a) dividing the test into two halves by randomly selecting half the items to create each test form; (b) creating matched item pairs based on similar item means and item–total correlations; or (c) having one item from each pair randomly assigned to each half of the test (Erford, 2013). Because the two halves of the test are assumed to be parallel (i.e., equal true scores and error variances), these half-test scores are correlated to obtain an estimate of half-test reliability. One of the golden rules of psychometrics is that, all other things being equal, the more items on the test, the more reliable the scores (Erford, 2013). Subsequently, readjusting the half-test score reliability coefficient to reflect a whole-test score reliability estimate is essential, and requires use of the Spearman–Brown prophecy formula (i.e., rxx = 2r12 / (1 + r12)), where r12 is the Pearson correlation between the two halves of the test.

As an example of split-half analysis, see the scores provided for the 20 participants on Items 1 through 8 in Table 1. The items were split into two comparable halves (Items 1–4 and Items 5–8) and the correlation computed between the two half scores for the 20 participants, obtaining r12 = .933. Applying the Spearman–Brown prophecy formula, one obtains rxx = 2(0.933) / (1 + 0.933) = 0.965. An interpretation of this coefficient is that true score variance is 96.5% and error variance is 3.5% of scores. Because the coefficient exceeds both the .80 (screening) and .90 (diagnostic) benchmarks, these scores are adequate for both screening and diagnostic decisions.
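
A minimal Python sketch of this computation is shown below (not from the article; `split_half_reliability` and the hypothetical `table1_items` array are names introduced here for illustration, so the commented usage line assumes access to the Table 1 item scores).

```python
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    """Split-half reliability with the Spearman-Brown correction.

    `items` is an (n_examinees, n_items) array; the first and second halves of
    the columns serve as the two half-tests, mirroring the Items 1-4 versus
    Items 5-8 split used in the example above.
    """
    n_items = items.shape[1]
    half_1 = items[:, : n_items // 2].sum(axis=1)   # total score on first half
    half_2 = items[:, n_items // 2 :].sum(axis=1)   # total score on second half
    r_12 = np.corrcoef(half_1, half_2)[0, 1]        # half-test correlation
    return 2 * r_12 / (1 + r_12)                    # Spearman-Brown prophecy formula

# Usage (hypothetical): `table1_items` would be the 20 x 8 matrix of Item 1-8
# scores from Table 1; with those data this should return approximately .965.
# r_xx = split_half_reliability(table1_items)
```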


Table . Sample Scores for Various Reliability Example Computations. Score No.

T

T

A

B

R

R

Item

Item

Item

Item

Item

Item

Item

Item

                   

                   

                   

                   

                   

                   

                   

                   

                   

                   

                   

                   

                   

                   

                   

Note. T = score at Time  for Scorer ; T = score at Time ; A = Scorer  result; B = alternative version test score result; R = Rater  result; R = Rater  result. Items  through  are the raw scores for a sample set of  participants used in computation examples for split-half, Kuder–Richardson Formula , and α.

By the way, the Spearman–Brown prophecy formula can also be helpful to test developers who are designing a scale that undershoots a desired reliability coefficient (e.g., .80, .90). For example, if a four-item scale has a reliability coefficient of .70, how many equally homogeneous items would need to be added to boost the reliability above the threshold of adequacy? Doubling the number of items (to 8) would result in rxx = 2(0.70) / (1 + 0.70) = 0.82.

You might have noticed our choice to split Items 1 through 4 and 5 through 8 for the example rather than splitting randomly. This has been a historic problem with the split-half method: It gave investigators some leeway in determining the item split, which could introduce some bias. This potential researcher bias, along with greater access to high-speed statistical computer programs, has minimized use of the split-half method in favor of less biased approaches: α and KR–20.

Coefficient Alpha (α)

Cronbach's (1951) coefficient α represents the theoretical average of all potential split-half reliability estimates among a set of item scores. Coefficient alpha is calculated when item response formats are multiscored (e.g., Likert-type scales) and is one of the most well-known reliability coefficients for multi-item test data. Given the Items 1 through 8 data in Table 1, and assuming the scores were multiscaled (i.e., 0, 1, 2), the coefficient alpha derived from SPSS 24 was α = .928. This result is interpreted as 92.8% true score variance (i.e., consistency) and 7.2% error variance (i.e., inconsistency: heterogeneity of the item set).
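
As a companion to the SPSS analysis described above, the following is a small Python sketch of the standard coefficient alpha formula (an illustrative implementation, not the article's procedure; `cronbach_alpha` and the hypothetical `table1_items` array are names introduced here).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_examinees, n_items) array of item scores.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
    """
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of examinee totals
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Usage (hypothetical): with the 20 x 8 matrix of Item 1-8 scores from Table 1,
# this should reproduce a value close to the reported alpha = .928.
# alpha = cronbach_alpha(table1_items)
```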

Kuder–Richardson Formula 20

For dichotomously scored items (e.g., right–wrong, yes–no, true–false), a special method of estimating internal consistency is the KR–20. It should be noted that the KR–20 is the same coefficient as the Cronbach's α described earlier; the only difference is that when the scores are dichotomous, the equation is properly referred to as KR–20. Given the Items 1 through 8 data in Table 1, all raw scores of 2 were recoded to scores of 1 to make them dichotomous, and the reliability analysis was rerun using SPSS 24. Please note that this recoding was done only for the purpose of this example; researchers would not do this in practice. The resulting KR–20 = .964 is interpreted as 96.4% true score variance (i.e., consistency) and 3.6% error variance (i.e., inconsistency: heterogeneity of the item set).
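
Because KR–20 is simply coefficient alpha applied to dichotomous item scores, a sketch can reuse the `cronbach_alpha` function from the previous example (again, function and variable names are illustrative and assume access to the Table 1 data).

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20: coefficient alpha applied to dichotomously scored (0/1) items."""
    dichotomous = (items > 0).astype(int)   # e.g., recode raw scores of 2 down to 1
    return cronbach_alpha(dichotomous)      # same formula as coefficient alpha

# Usage (hypothetical): kr20(table1_items) should be close to the reported .964.
```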

Test–Retest Reliability

Test–retest reliability (rtt; i.e., temporal stability) refers to the consistency of responses on the same assessment administered to the same group of individuals on two different occasions. It is a coefficient of stability over time, with the error variance being due either to fluctuations over time in the construct being assessed (e.g., changes in the condition of the examinees) or to internal or external testing conditions (e.g., distractibility, guessing, administration times). Illustrating the concept of test–retest reliability, consider again the database provided in Table 1. Using SPSS 24, the scores in column T1 were correlated with the scores in column T2 (for Time 1 and Time 2, respectively) to reflect scores on the same teacher-made test administered 1 month apart. The resulting correlation was rtt = .968, which is interpreted as 96.8% true score variance and 3.2% error variance, in this case due to instability over time either because of testing circumstances or changes in the condition of the examinees at Time 1 or Time 2.

Accurate interpretation of the test–retest reliability coefficient depends on the time interval between the two administrations and the nature of the construct being measured. As assessments in counseling are often used to track the therapeutic progress of clients in treatment, test–retest reliability can provide helpful insights regarding the effects of test readministration on client scores. Matching the test–retest time interval to the construct being measured, however, is crucial, as the construct itself might fluctuate across periods of time, with the examinee systematically changing on the characteristic being measured between administrations. For example, if the construct being measured varies over time (e.g., depression), a long time interval between two administrations might lead to differences due to biological maturation, cognitive development, or changes in experience or moods. On the other hand, readministration within a short time interval (e.g., a few days) might lead to carryover effects due to client memory or practice. To further illustrate this, a client with a diagnosis of depression who was administered the Beck Depression Inventory–Second Edition (BDI–II; Beck, Steer, & Brown, 1996) might receive a lower score if the second administration of the BDI–II occurred 6 months later, regardless of treatment effectiveness.

Reported test–retest reliability estimates for a given assessment are sample dependent and can only be expected to be comparable across similar evaluation periods. Using the BDI–II example again, the originally reported (Beck et al., 1996) rtt estimate of .93 with a clinical sample retested after only 1 week cannot be expected to apply to a 6-week evaluation period. Indeed, a recent psychometric synthesis of the BDI–II reported substantially lower test–retest reliability estimates for clinical samples across a period of 6 weeks (rtt = .68; Erford, Johnson, & Bardhoshi, 2016). Researchers should always report the time interval between the first and second administrations given its influence on the reliability estimate.
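
A minimal sketch of this correlation-based estimate follows (illustrative only; the function name is mine, and the commented usage line assumes stand-ins for the T1 and T2 columns of Table 1).

```python
import numpy as np

def test_retest_reliability(time_1: np.ndarray, time_2: np.ndarray) -> float:
    """Pearson correlation between scores from two administrations (r_tt)."""
    return np.corrcoef(time_1, time_2)[0, 1]

# Usage (hypothetical): with the T1 and T2 columns from Table 1, a call such as
# test_retest_reliability(t1_scores, t2_scores) would return approximately .968.
```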

Alternate Forms Reliability (Equivalent Forms Reliability)

The potential influence of practice effects on test–retest reliability can be reduced by having examinees take two equivalent versions of a given assessment, with a different, but equivalent, version used in each administration period. Measures that assess the same construct, and are also similar in observed score means, variances, and correlations with other assessments, are referred to as alternate forms (or equivalent forms). Some examples are Forms A, B, and C of the Woodcock–Johnson IV: Tests of Achievement (Schrank, McGrew, & Mather, 2014) and the Blue and Tan forms of the Wide Range Achievement Test (WRAT–4; Wilkinson & Robertson, 2006). Because these assessments measure achievement, creating alternate forms is easier due to the larger selection of test items. Alternate forms for psychological constructs (e.g., self-efficacy) are far less common, making estimations of alternate forms reliability less applicable. Note that long-form and short-form versions of an assessment are not equivalent forms.

To illustrate the concept of alternate forms reliability (rab), refer again to Table 1. Using SPSS 24, the scores in column T1 were correlated with the scores in column B2 (for Form A at Time 1 [T1] and Form B at Time 2 [B2], respectively) to reflect scores on two alternate forms of a teacher-made test administered 2 weeks apart. The resulting correlation was rab = .952, which is interpreted as 95.2% true score variance and 4.8% error variance. The error variance in alternate forms reliability studies is due to four primary sources: differences in content between Forms A and B; scorer reliability; the internal consistency of each version; and instability over time, due primarily to either uncontrolled testing circumstances or changes in the condition of the examinee at Time 1 or Time 2 (similar to what occurs in a test–retest reliability study). The error variance in this case is due mainly to the forms themselves, although one could not discount other influences such as fatigue from being assessed on the same construct twice. To reduce carryover effects, a recommendation is to employ a 2-week administration interval. When test forms are similar and there are no adverse effects from internal or external conditions, the alternate forms reliability coefficient will be high, indicating that the forms may be used interchangeably. However, if the alternate forms reliability coefficient is much lower than the internal consistency coefficient (e.g., a difference of about .20 or more), it is likely that too much content differentiation is present, and the claim of equivalence (alternate forms) is in doubt.
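
The "about .20 or more" rule of thumb can be expressed as a small check, sketched below (the function name and default threshold are mine; the usage lines reuse coefficients reported in this article's examples alongside one hypothetical low value).

```python
def equivalence_warning(r_ab: float, r_internal: float, gap: float = 0.20) -> bool:
    """Flag possible non-equivalence when the alternate forms coefficient
    falls well below the internal consistency coefficient."""
    return (r_internal - r_ab) >= gap

# Coefficients reused from this article's examples (alpha = .928, r_ab = .952),
# plus one hypothetical low alternate forms coefficient for contrast.
print(equivalence_warning(0.952, 0.928))   # False: forms look interchangeable
print(equivalence_warning(0.70, 0.928))    # True: claim of equivalence in doubt
```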

Interscorer Reliability

Interscorer reliability is relevant when the scoring of assessments relies heavily on rater judgment and refers to the consistency of different individuals scoring the same assessment set (AERA, APA, & NCME, 2014). Measures that employ objectivity in scoring (e.g., multiple-choice, true–false) tend to produce higher agreement among scorers than those relying on more subjective scoring methods (e.g., essay, extended response). Obviously, if scoring relies heavily on the subjective judgments of professionals, the potential for variation (inconsistency) from one scorer to the next is higher. Therefore, one can anticipate those scores to be less consistent across scorers even when rating the same individual. For example, a self-efficacy measure scored on easily differentiated Likert-type items can be expected to produce more consistent results between two independent counselors than one with a scoring method involving interpreting a client's essay on his or her perceived self-efficacy. This is not to say that a less structured assessment method, such as one relying on scoring an essay response, could not produce a direct measure of self-efficacy; however, the expanded variation in this type of interpretation would increase the potential measurement error (AERA, APA, & NCME, 2014). Importantly, all the other types of reliability discussed earlier also contain interscorer reliability as a potential source of error.

Computing interscorer reliability (ris) is simply accomplished by correlating (usually using Pearson's r) scores derived by Scorer A and Scorer B on the same set of tests. For example, using SPSS 24, the scores in column T1 were correlated with the scores in column A2 to reflect scores derived by two different exam scorers on the same set of multiple-choice and short-answer items. The resulting correlation was ris = .998, which is interpreted as 99.8% true score variance and 0.2% error variance. This slight error variance is caused by scoring (judgment) errors by one or more scorers; human error is not uncommon in test scoring, which is why properly programmed optical scanning devices can yield even higher degrees of precision. Aside from a "higher is better" guideline, no hard and fast rules for interscorer reliability coefficients exist. That said, one should reasonably expect highly objective scoring formats, like multiple choice or true–false, to yield coefficients very close to 1.00. Subjective and qualitatively judgmental scoring formats should achieve an interscorer coefficient of .90 or higher. Scoring systems that contribute more than 10% error variance should be avoided.

Interrater Reliability

Interrater reliability is pertinent in cases when two independent raters (e.g., two parents or guardians of the same child, two teachers of the same student, a teacher and a parent or guardian of the same child) use an assessment to rate the same individual. For example, when two parents complete a behavior assessment on a school-aged youth, one can calculate interrater reliability by using one set of scores (e.g., mothers') as the criterion variable for the other set of scores (e.g., fathers'). This type of reliability estimate might be controversial, as it could also be considered criterion-related (concurrent) evidence, although this criticism should in no way dissuade researchers from reporting interrater reliability coefficients (see the article by Balkin, 2017, in this issue for more information regarding evidence of criterion validity).

Computing interrater reliability (rir) is simply accomplished by correlating (usually using Pearson's r) scores derived by Rater 1 and Rater 2 observing the same set of participants. For example, using SPSS 24, the scores in column R1 (teacher rating) were correlated with the scores in column R2 (parent rating) to reflect scores derived by two different raters on the same set of clients. The correlation was rir = .493, interpreted as 49.3% true score variance and 50.7% error variance. This substantial error variance is due to differing perspectives between the teacher and parents, and is influenced by the setting, role, and relational objectivity in which the observations occur. Although the Pearson r (rir) allows the comparison of two sets of interval or ratio ratings, the kappa coefficient allows comparison of multiple rater score sets for categorical data.
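
For categorical ratings from two raters, Cohen's kappa can be computed as sketched below (an illustrative implementation of the usual observed-versus-chance-agreement definition, not a procedure from the article; the teacher and parent labels in the usage lines are invented for demonstration).

```python
import numpy as np

def cohens_kappa(rater_1: np.ndarray, rater_2: np.ndarray) -> float:
    """Cohen's kappa for two raters assigning categorical labels to the same cases."""
    categories = np.union1d(rater_1, rater_2)
    observed_agreement = np.mean(rater_1 == rater_2)   # proportion of exact agreement
    chance_agreement = sum(                            # agreement expected by chance
        np.mean(rater_1 == c) * np.mean(rater_2 == c) for c in categories
    )
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)

# Usage with invented labels (for demonstration only, not data from the article):
teacher = np.array(["low", "high", "high", "low", "medium", "high"])
parent = np.array(["low", "medium", "high", "low", "medium", "low"])
print(round(cohens_kappa(teacher, parent), 2))   # moderate agreement (~.52)
```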

Estimating Precision in an Individual's Score: Standard Error of Measurement

Up until this point, score reliability has been discussed and reported as a coefficient derived from group scores. However, professional counselors ordinarily assess one client or student at a time. This is complicated by the notion that, if all measurements have error, how do you really know whether the score the client received is his or her actual true score (T)? Theoretically, an individual's true score (T) can only be known if the test measures the construct error free, which is exceedingly improbable. Alternatively, the individual's true score (T) could be discerned empirically by administering 100 alternate forms of the same test, because these 100 scores would distribute normally around the likely T, which would be the mean, median, and mode of the score distribution. Again, it is practically improbable for a counselor to go to this length. Fortunately, one can estimate the precision of a client's or student's score interpretation by applying what one knows about group-based reliability results using the concept of the standard error of measurement (SEM), which creates a confidence interval within which the true score (T) probably lies. In classical test theory (Erford, 2013; Meyer, 2010), SEM = SD√(1 − rtt), where SD is the standard deviation of the distribution and rtt is the test–retest reliability, although internal consistency estimates (α, KR–20) are often used instead. Note that if the reliability estimate is 1.00 (perfectly consistent), then SEM = 0; there is no error, and every time the test is administered the observed score is the true score. Likewise, if the reliability estimate is 0.00 (all error and no consistency), then SEM = SD.

An important consideration is the level of confidence with which observed test scores should be interpreted: 1 SEM provides a 68% confidence level, 2 SEMs provide a 95% confidence level, and 2.58 SEMs provide a 99% confidence level. For research purposes, the 68% level of confidence suffices; for educational and counseling decision-making purposes, the 95% level of confidence should be used. When making decisions about people's lives, the 95% level of confidence means being wrong on 1 in 20 decisions, which, although not perfect, is acceptable. Conversely, the 68% level of confidence means being wrong on about one in three decisions, which is not an acceptable rate of success for scores informing important decisions about clients and students. Thus, at the 95% level of confidence, the true score has a 95% probability of falling within the range of the observed score, X ± 2(SEM).

For example, assume the reliability coefficient for a given set of scores on an IQ test (M = 100, SD = 15 standard score points) was α = .89. Applying the formula SEM = SD√(1 − rα), one can see that SEM = 15√(1 − .89) = 4.97, which rounds to 5 standard score points. If 1 SEM is 5 standard score points, then 2 SEMs is 10 (9.94) standard score points. Thus, if a student's IQ score was 110, at the 95% level of confidence, the range of scores within which the true score probably lies is 110 ± 10, or an inclusive standard score range of 100 to 120. This means that, given an IQ score of 110, there is a 95% probability (level of confidence) that the student's true score will fall within the given standard score range of 100 to 120. This also means that there is only a 5% probability that the true score is less than 100 or greater than 120. The importance of the SEM application is obvious.
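
The worked IQ example above can be reproduced in a few lines of Python (a sketch only; the function names are mine, and the 2-SEM multiplier mirrors the 95% convention used in this article).

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def true_score_interval(observed: float, sd: float, reliability: float,
                        n_sems: float = 2.0) -> tuple[float, float]:
    """Confidence band for the true score: observed +/- n_sems * SEM."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sems * sem, observed + n_sems * sem

# The article's example: IQ test with SD = 15, alpha = .89, observed score 110.
sem = standard_error_of_measurement(15, 0.89)     # ~4.97, rounds to 5
low, high = true_score_interval(110, 15, 0.89)    # ~ (100.05, 119.95)
print(round(sem, 2), round(low), round(high))     # 4.97 100 120
```
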
As the reliability coefficient decreases further from 1.00, it becomes less likely that the observed score (X) is the true score (T). This is a "truth in advertising" principle essential in test score interpretation. It is inappropriate to portray the observed score as the true score; however, it is very appropriate to report the range of scores within which the true score probably lies, and at the given level of probability.

Interrelationships Among the Types of Reliability

The sources of error in a reliability analysis are not necessarily discrete. For example, all tests need to be scored, so scoring error is a potential source of error variance in every type of reliability analysis. Interscorer error variance is sometimes compounded when tests are used in research studies because, after the test is scored, the scores are usually entered into a database. Thus, coding and input errors could add another layer of inconsistency. This is why it is an excellent practice to score all protocols twice and to check all data entry twice, to eliminate "dirty data."

Sometimes it is helpful to think of reliability in an additive fashion. Any time a test is scored, scorer error is possible. Any time a test is administered, internal consistency is a potential source of error variance. When a test is administered twice, such as with test–retest reliability, internal consistency errors could affect each sample of scores, although one would hope that much of the item heterogeneity error would overlap across the two administrations. Test–retest reliability also adds error due to instability over time. If two alternate forms (e.g., A and B) are administered, there is no item overlap, so the internal consistency estimates come from two different item pools, providing two even more widely diverging potential sources of internal consistency error. Add to this the content inconsistency between Forms A and B and the potential error due to temporal instability, and one can easily see the potential effects of these additive sources of error.

Given these additive effects, it is also easier to understand why different types of reliability usually produce different magnitudes of true score variance. In general, interscorer reliability estimates are the highest, with coefficients in the very high .90s, and often very close to 1.00, being typical for totally objectively scored assessments. Optical scanning and computer scoring technologies are not perfect, but are very close when appropriately programmed. Internal consistency is generally the next highest type of reliability coefficient reported, followed by test–retest reliability estimates, with rtt for 1-week intervals usually greater than for 1-month intervals, which in turn are usually greater than for 1-year intervals. Although few assessments have alternate forms, rab tends to be lower than other forms of reliability, becoming even lower when the time span between administration of Forms A and B is lengthened, just as when estimating test–retest reliability. Introducing two different raters usually drops interrater reliability estimates even lower. The greater the experiential gap between the raters, the lower the interrater reliability estimate becomes. Indeed, Erford, Butler, and Peacock (2015) reported that interrater coefficients between two teachers in the same school rating the same students were close to .70, as were those for two parents rating the same children. However, when a teacher and a parent rated the same students, interrater coefficients dropped to about .50, because teachers and parents observe students in very different environmental contexts. In summary, the greater the number of additive sources of potential error, the lower the reliability coefficient becomes. Finally, and perhaps most important, score reliability establishes the upper boundary for score validity.
So, choosing assessments with high score reliability, although not a guarantee, substantially improves one’s chances of making more meaningful decisions about people’s lives (Erford, 2013).

Conclusions and Recommendations

Given that score reliability varies by sample and conditions, authors of empirical studies must report the score reliability for all assessments used (AERA, APA, & NCME, 2014). A very common limitation in evaluating the psychometric properties of assessments used in counseling is the lack of relevant psychometric information in published studies (Bardhoshi et al., 2016). Relying on test manual estimates does not provide an adequate or accurate measure of assessment stability with all samples, and a lack of information regarding measurement procedure and major sources of error further limits interpretation of results. Reporting comprehensive reliability estimates, including the standard error of measurement, and describing the collection of data and potential conditions influencing reliability are necessary standards for providing context on a measure (Meyer, 2010) and inspiring confidence in the selection of assessments.

Given the central role of error variance in the estimation of reliability, remember that reliability coefficients are influenced greatly by the examinee population. Therefore, documentation of the examinee population's demographics and descriptive statistics (i.e., mean and standard deviation) is essential in interpreting the consistency of scores on measures across populations (Meyer, 2010). Reporting both group-specific estimates of reliability and the SEM is necessary to evaluate estimates of reliability across subgroups, especially because counselors are interested in the well-being of underrepresented and diverse groups and reliability could be negatively affected by a small, homogeneous sample.

Notes on Contributors

Gerta Bardhoshi, PhD, is an assistant professor in the counselor education and supervision program of the Rehabilitation and Counselor Education Department at the University of Iowa.

Bradley T. Erford, PhD, is a professor in the Department of Human and Organizational Development in the Peabody College of Education at Vanderbilt University.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Balkin, R. S. (2017). Evaluating evidence regarding relationships with criteria. Measurement and Evaluation in Counseling and Development, 50, 264–269. https://doi.org/10.1080/07481756.2017.1336928

Bardhoshi, G., Erford, B. T., Duncan, K., Dummett, B., Falco, M., Deferio, K., & Kraft, J. (2016). Choosing assessment instruments for posttraumatic stress disorder screening and outcome research. Journal of Counseling & Development, 94, 184–194. https://doi.org/10.1002/jcad.12075

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory (2nd ed.). San Antonio, TX: Psychological Corporation.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. https://doi.org/10.1007/BF02310555

Erford, B. T. (2013). Assessment for counselors (2nd ed.). Belmont, CA: Cengage Wadsworth.

Erford, B. T., Butler, C., & Peacock, E. (2015). The Screening Test for Emotional Problems–Teacher-Report Version (STEP–T): Studies of reliability and validity. Measurement and Evaluation in Counseling and Development, 48, 152–160. https://doi.org/10.1177/0748175614563315

Erford, B. T., Johnson, E., & Bardhoshi, G. (2016). Meta-analysis of the English version of the Beck Depression Inventory–Second Edition. Measurement and Evaluation in Counseling and Development, 49, 3–33. https://doi.org/10.1177/0748175615596783

Mellenbergh, G. J. (2011). A conceptual introduction to psychometrics. The Hague, Netherlands: Eleven International.

Meyer, P. (2010). Reliability: Understanding statistics measurement. New York, NY: Oxford University Press.

Schrank, F. A., McGrew, K. S., & Mather, N. (2014). Woodcock–Johnson IV. Rolling Meadows, IL: Riverside.

Wilkinson, G. S., & Robertson, G. J. (2006). Manual for the Wide Range Achievement Test (WRAT–4). Los Angeles, CA: Western Psychological Services.