Image Interpretation: Test the Candidate not the Test

Kirstie Wilby BSc(Hons) & Dr Chris Wright PhD, MSc, HDCR, FHEA [email protected]


INTRODUCTION

Mean test difficulty calculated using IRT was 0.90 (KW1) versus 0.86 (KW2).

Image interpretation tests are core to radiography and radiology education, providing a measure of performance and competence. Each educational organisation designs its own tests, and typically only one summative test is used per assessment, which raises a question about the efficacy of the test design. Item response theory (IRT), widely used in general education, is a quantitative approach to determining test difficulty, question difficulty and discrimination. This research applies the approach to image interpretation testing in higher education. Ideally, all tests should be of similar difficulty, and the weaker students should be more likely to get the more difficult questions wrong.

Both tests had a question difficulty range of 0.21 to 1.00, and each contained 24 relatively easy questions (p > 0.86). Each had one very difficult question (p = 0.21). The remaining questions ranged from 0.64 to 0.79 (KW1) and 0.43 to 0.75 (KW2).

METHODOLOGY

Traditionally, image interpretation research has been relatively small scale and typically cohort based, using a single image test bank. Wright & Reeves (2017) identified that, when testing the same cohort, two image banks can produce different results. This research supports that finding and begins the process of explaining why, using item response theory (IRT).

Final year radiography students (n = 28) took part in two image interpretation tests to evaluate their accuracy scores. Each test was subjectively judged to be of equal difficulty and contained 30 MSK images, with a 50% incidence of abnormality, chosen from a blind double-reported RadBench® database. Item response theory (IRT) was applied to the results using the following formulae:

Question difficulty (p) = number of candidates answering the question correctly / total number of candidates
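The two indices can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes a 1/0 (correct/incorrect) response matrix and computes discrimination as the difference in p between the upper- and lower-scoring halves of the cohort; the response matrix shown is hypothetical.

```python
# Classical item-analysis indices, as used in the poster. Rows of
# 'responses' are students, columns are questions, scored 1 (correct)
# or 0 (incorrect). The data below are hypothetical.

def item_difficulty(responses):
    """Difficulty p per question: proportion of candidates answering correctly."""
    n = len(responses)
    return [sum(col) / n for col in zip(*responses)]

def item_discrimination(responses):
    """Discrimination D per question: p in the upper-scoring half of the
    cohort minus p in the lower-scoring half (even cohort size assumed)."""
    ranked = sorted(responses, key=sum, reverse=True)  # best scorers first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[half:]
    p_upper = [sum(col) / half for col in zip(*upper)]
    p_lower = [sum(col) / half for col in zip(*lower)]
    return [u - l for u, l in zip(p_upper, p_lower)]

responses = [  # 4 hypothetical students x 3 questions
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(item_difficulty(responses))      # proportion correct per question
print(item_discrimination(responses))  # upper-lower difference per question
```

With n = 28 students, p values are multiples of 1/28 and D values multiples of 1/14, which matches the figures in the results tables.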

Neither test was 100% discriminating: Test KW1 had 2 non-discriminating questions, and Test KW2 had 6. In all cases, the higher performing students were more successful with the more difficult questions. All non-discriminating questions fell in the 'easy' category.
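Flagging such questions is easily automated. A small audit sketch follows; it is not the authors' code, it reads 'non-discriminating' as D < 0 (one plausible reading, which matches the counts of 2 and 6 in the results tables), and the difficulty bands are illustrative, based on the poster's use of 'easy' for p > 0.86.

```python
# Flag questions with negative discrimination and report their
# difficulty band. Band cut-offs are illustrative assumptions.

def band(p):
    if p > 0.86:
        return "easy"
    if p < 0.40:
        return "very difficult"
    return "moderate"

def flag_non_discriminating(difficulty, discrimination):
    """Return (question number, p, band) for items with D < 0."""
    return [(i + 1, p, band(p))
            for i, (p, d) in enumerate(zip(difficulty, discrimination))
            if d < 0]

# Example with the first ten Test 1 (KW1) values from the results table.
p_kw1 = [1.00, 1.00, 0.71, 1.00, 0.68, 0.96, 1.00, 0.93, 0.96, 0.96]
d_kw1 = [0.00, 0.00, 0.29, 0.00, 0.21, 0.07, 0.00, 0.14, 0.07, -0.07]
print(flag_non_discriminating(p_kw1, d_kw1))  # Q10 is the only flagged item
```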

DISCUSSION

RadBench® has a large bank of images that have been tested over many years, with IRT applied to determine difficulty. This information was deliberately not used in this experiment, in order to assess the human factor in image selection. Whilst the difficult images were relatively simple to identify and include in both tests, the 'easy' images were much less so. Assessment within undergraduate radiography courses is typically raw scored, producing a range of values. Each university sets its own assessments, so test design could impact results; a student should perform within the same grade boundary regardless of their chosen university. Whilst the pass mark for university assessments is typically 40%, the professional expectation is that upon graduation the student will be able to make 'reliable' decisions (SCoR, 2013). A minimum accuracy of 90% is recommended prior to participation in abnormality signalling or preliminary clinical evaluation (PCE) systems (Wright & Reeves, 2017).

Discrimination (D) = p for the upper-scoring half of the cohort minus p for the lower-scoring half

RESULTS

Test 1 (KW1): range 77–97%, mean 90%. Test 2 (KW2): range 73–100%, mean 85.8%.

Multiple tests would provide a fairer approach for module assessment. Professional competence is a different measurement: radiography could, in principle, adopt a single benchmark test, as is used for the FRCR Part B Rapid Reporting, to evidence PCE readiness. IRT provides an objective assessment of test efficacy, which is particularly important for pass/fail examinations where the candidate needs to score 90%, equating to 27 correct answers out of 30. Tests should be of similar difficulty, and the weaker students should be more likely to get the more difficult questions wrong. Regular audit of accuracy in practice (Higgins & Wright, 2016) is an important element of clinical governance. Neep et al (2014) also highlighted that graduates may not be ready for frontline radiographer decision signalling systems, linked to confidence and ability, and need a period of preceptorship post-graduation to further develop their skills. Benchmark testing is a key prerequisite to participation in PCE schemes, in order to provide reliable decision making and fulfil the vision of the SCoR 2013 policy.

A paired-samples t-test (t = 3.746) indicated a significant difference between the two tests (p = 0.001) at the 95% confidence level (CI: 0.019–0.066).

Test 1 (KW1)
Q    Region        Difficulty (p)  Discrimination (D)
Q1   Ankle         1.00             0.00
Q2   Ankle         1.00             0.00
Q3   Ankle         0.71             0.29
Q4   Ankle         1.00             0.00
Q5   Calcaneum     0.68             0.21
Q6   Elbow         0.96             0.07
Q7   Foot          1.00             0.00
Q8   Foot          0.93             0.14
Q9   Shoulder      0.96             0.07
Q10  Wrist         0.96            -0.07
Q11  Elbow         0.96             0.07
Q12  Elbow         1.00             0.00
Q13  Elbow         0.89             0.21
Q14  Forearm       0.79             0.14
Q15  Forearm       0.96             0.07
Q16  Hand          0.93             0.00
Q17  Hand          0.96             0.07
Q18  Hand          1.00             0.00
Q19  Hand          0.86             0.29
Q20  Hand          0.21             0.14
Q21  Hand          0.96            -0.07
Q22  Hand          1.00             0.00
Q23  Facial Bones  0.89             0.07
Q24  Shoulder      1.00             0.00
Q25  Shoulder      1.00             0.00
Q26  Shoulder      0.96             0.07
Q27  Thumb         0.64             0.29
Q28  Wrist         0.79             0.29
Q29  Wrist         0.93             0.14
Q30  Foot          1.00             0.00
Mean               0.90

Test 2 (KW2)
Q    Region        Difficulty (p)  Discrimination (D)
Q1   Ankle         0.89             0.21
Q2   Ankle         1.00             0.00
Q3   Ankle         0.89             0.07
Q4   Ankle         0.61             0.36
Q5   Calcaneum     0.96             0.07
Q6   Elbow         0.96             0.07
Q7   Elbow         0.93             0.14
Q8   Elbow         0.75             0.21
Q9   Facial Bones  0.93            -0.14
Q10  Foot          0.86             0.29
Q11  Foot          0.96            -0.07
Q12  Foot          0.96             0.07
Q13  Foot          0.43             0.14
Q14  Forearm       0.86             0.29
Q15  Forearm       1.00             0.00
Q16  Hand          0.96             0.07
Q17  Hand          0.96            -0.07
Q18  Hand          1.00             0.00
Q19  Hand          0.96            -0.07
Q20  Hand          0.96            -0.07
Q21  Hand          0.21             0.00
Q22  Knee          0.93             0.14
Q23  Shoulder      0.86             0.29
Q24  Shoulder      1.00             0.00
Q25  Shoulder      0.54             0.36
Q26  Shoulder      0.46             0.50
Q27  Thumb         1.00             0.00
Q28  Thumb         1.00             0.00
Q29  Wrist         0.96            -0.07
Q30  Wrist         0.93             0.14
Mean               0.86
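The significance test reported above can be sketched with only the standard library. This is a minimal illustration of a paired-samples t statistic (mean difference over its standard error); the per-student accuracy pairs below are hypothetical, not the study data.

```python
import math

def paired_t(x, y):
    """Paired-samples t statistic: mean of the pairwise differences
    divided by the standard error of those differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Hypothetical accuracy pairs (test 1 score, test 2 score) per student.
test1 = [0.90, 0.93, 0.87, 0.97, 0.83, 0.90, 0.93, 0.87]
test2 = [0.87, 0.90, 0.83, 0.93, 0.80, 0.87, 0.93, 0.83]
print(round(paired_t(test1, test2), 3))  # positive t: test 1 scored higher
```

In the study itself, pairing is by student (n = 28), each of whom sat both KW1 and KW2.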

CONCLUSION

Test bank design can be subjective and can impact the result of image interpretation accuracy scores. IRT is a useful approach to assessing test efficacy. Assessment via multiple test banks is recommended within undergraduate modules, in order to average out performance. A discriminating national test, similar to the FRCR Part B, with a defined pass mark (e.g. 90%), could provide equality in assessment for all radiography candidates, regardless of the educating university.

REFERENCES

Chong Ho Yu (2013) A Simple Guide to the Item Response Theory (IRT) and Rasch Modeling. http://www.creative-wisdom.com
The College of Radiographers (2013) Preliminary Clinical Evaluation and Clinical Reporting by Radiographers: Policy and Practice Guidance. 11 February.
Neep, M. et al. (2014) A survey of radiographers' confidence and self-perceived accuracy in frontline image interpretation and their continuing educational preferences. Journal of Medical Radiation Sciences, 61(2), 69-77.
The Royal College of Radiologists (2016) Final FRCR Part B Examination. https://www.rcr.ac.uk/clinical-radiology/examinations/final-frcr-part-b-examination-0
Higgins, S. & Wright, C. (2016) Traffic Light: An Alternative Approach to Abnormality Signalling. UKRC, Liverpool.
Wright, C. & Reeves, P. (2017) Image Interpretation Performance: A Longitudinal Study from Novice to Professional. Radiography, 23(1), e1–e7. http://dx.doi.org/10.1016/j.radi.2016.08.006

This work was conducted as part of a final year dissertation at Sheffield Hallam University.