Error-Flagging Support and Higher Test Scores

Amruth N. Kumar
Ramapo College of New Jersey, Mahwah, NJ 07430, USA
[email protected]

Abstract. Previously, providing error-flagging support during tests was reported to lead to higher scores. A follow-up controlled study was conducted to examine why, using a partial crossover design. Two adaptive tutors were used in fall 2009 and spring 2010, and the data collected during their pre-test stage was analyzed. The findings are: (1) When a student solves a problem correctly on the first attempt, error-flagging support helps the student move on to the next problem more quickly, without pausing to reconsider the answer. But it may also encourage students to use error-flagging as an expedient substitute for their own judgment. (2) Given error-flagging support, many more students arrive at the correct answer by revising their answer, which explains why students score higher with error-flagging. (3) Students use error-flagging to reach the correct answer through trial and error even though the problems are not multiple-choice in nature. However, at least some students may engage in informed (as opposed to brute-force) trial and error. (4) Error-flagging support provided during tests could cost students time. (5) Given how often students move on after solving a problem incorrectly, without ever reconsidering their answer, providing error-flagging support during testing is still desirable.

Keywords: Error-flagging, Testing, Adaptation, Evaluation.

1 Introduction and Experiment

Studies on the effect of providing error-flagging feedback during testing have yielded mixed results. Multiple studies of paper-and-pencil testing have reported lower performance due to increased anxiety (e.g., [3, 5]) or no difference (e.g., [9]) when feedback about the correctness of answers was provided. Studies with early Computer Assisted Instruction/Testing showed better performance with such feedback during testing than without (e.g., [2, 10]). Later studies with computer-based multiple-choice testing showed no relative advantage or performance gain from providing such feedback [8, 9]. In a recent study, researchers found that there was little difference among the types of feedback provided during testing with the ACT Programming Tutor [4]. In a more recent study of online tests that did not involve multiple-choice questions [6], we found that students scored better on tests with error-flagging support than without. We conducted a follow-up study to find out why they scored better – a question of interest since we use online pre-tests to prime the student model used by our adaptive tutors [7].


In fall 2009 and spring 2010, we used two problem-solving software tutors for the study. The tutors were on functions, an advanced programming concept. One tutor dealt with debugging, and the other with predicting the behavior of programs with functions. The debugging tutor targeted 9 concepts; the behavior tutor targeted 10 concepts. The tutors presented problems on these concepts, each problem containing a program that had to be debugged or whose output had to be determined by the student. Each software tutor went through a pre-test-practice-post-test protocol as follows (a sketch of this adaptation logic follows the list):

• It first administered a pre-test to evaluate the prior knowledge of students and build the student model. The pre-test consisted of one problem per concept – 9 problems in the debugging tutor and 10 problems in the behavior tutor.
• Subsequently, it provided practice problems on only those concepts on which students had solved problems incorrectly during the pre-test [7].
• Finally, it administered post-test problems on only those concepts on which students had solved a sufficient number of problems during practice, as indicated by the student model.
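The selection logic behind this protocol can be illustrated with a minimal sketch. This is not the tutors' actual implementation; the function names, the concept labels, and the "sufficient practice" threshold are assumptions made only for illustration.

# Illustrative sketch of the pre-test -> practice -> post-test adaptation
# described above. All identifiers, concept names, and the practice threshold
# are assumptions, not the tutors' published code.

def select_practice_concepts(pretest_results):
    """pretest_results: dict mapping concept -> True if solved correctly."""
    # Practice is offered only on concepts missed during the pre-test.
    return [c for c, correct in pretest_results.items() if not correct]

def select_posttest_concepts(practice_counts, sufficient=2):
    """practice_counts: dict mapping concept -> problems solved in practice."""
    # The post-test covers only concepts with sufficient practice, as judged
    # by the student model (the threshold here is a placeholder).
    return [c for c, n in practice_counts.items() if n >= sufficient]

# Example: a student who missed two of the tutor's concepts on the pre-test.
pretest = {"param-passing": True, "scope": False, "return-value": False}
print(select_practice_concepts(pretest))                           # ['scope', 'return-value']
print(select_posttest_concepts({"scope": 3, "return-value": 1}))   # ['scope']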

The three stages were administered online, back-to-back without any break in between. The software tutors allowed 30 minutes for the three stages combined. Since we wanted to study the effect of error-flagging on tests, data from only the pre-test portion of the tutor was considered for analysis. The evaluations were in-vivo. The tutors were used in introductory programming courses at 12 institutions, which were randomly assigned to one of two groups: A or B. Subjects, i.e., students, accessed the tutors over the web, typically after class. The tutors remotely collected the data for analysis. A partial cross-over design was used: students in group A served as control subjects on the debugging tutor and test subjects on the behavior tutor, while students in group B served as test subjects on the debugging tutor and control subjects on the behavior tutor. All else being equal, error-flagging feedback was provided during the pre-test to students in the test group, but not the control group. Error-flagging, i.e., error-detection but not error-correction support, was provided before the student submitted the answer.

Debugging tutor: In order to identify a bug, the student had to select the line of code which had the bug, the programming object on that line to which the bug applied, and finally, the specific bug that applied to the programming object on the line. For example, the student would select line 8, the variable count on line 8, and the bug that count was being referenced before it was assigned a value. After the student identified all three, the summary of the bug would appear in the panel that displayed the student’s answer. After the summary was displayed, students had the option of deleting the entire bug and starting over, whether or not error-flagging support was provided. In addition, students had the option to click a button that said that the code had no bugs. This button was presented only when the student had selected no bugs.

Behavior tutor: Students identified the output of the program one step at a time, e.g., if the program printed 5 on line 9, followed by 9 on line 13, students had to enter this answer in two steps. In each step, they entered the output free-hand, and selected the line of code from a drop-down menu. For each step, a button was provided for students to delete it if they so wished. They also had the option to change the output or the line number in-situ, without deleting the entire step. In addition, students had the option to click a button that said that the code had no output. This button was presented only when the student had not yet identified any output for the program.

When error-flagging feedback was provided, each answer component (bug or output step) was displayed on a red background if incorrect and a green background if correct. When error-flagging support was not provided, the answer was always displayed on a white background. When error-flagging support was provided, no facility was provided for the student to find out why the answer (bug or output step) was incorrect, or how it could be corrected. The online instructions presented to the students before using each tutor explained the significance of the background colors. Whether or not the tutor provided error-flagging feedback, students had the option to revise their answer as often as necessary before submitting it. Once again, the instructions presented to the students before using each tutor explained the user interface facilities provided for revising an answer.

On a multiple-choice test question, with error-flagging support, a student could repeatedly guess until arriving at the correct answer. Given n choices in the question, the student would need no more than n guesses. In the debugging tutor, though, the number of choices was more than 20 on each problem, and the choices were not arranged as a flat list, but as a hierarchy of selections: line, object and type of bug being the three levels of the hierarchy. In the behavior tutor, although there were limited choices for the line number, the output itself was entered free-hand, making the number of choices infinite. So, neither the debugging tutor nor the behavior tutor presented problems that could be considered multiple-choice, and therefore susceptible to gaming when error-flagging feedback was provided.
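The error-flagging behavior described above amounts to checking each answer component against the answer key and merely coloring it, with no explanation of the error. The sketch below illustrates this under assumed answer representations (tuples for bugs, (line, text) pairs for output steps); the tutors' real data structures are not described in the paper, so these names are illustrative only.

# Minimal sketch of the error-flagging behavior described above: an answer
# component is checked against the key and colored red or green, with no
# indication of why it is wrong or how to fix it. The representations below
# are assumptions for illustration.

RED, GREEN, WHITE = "red", "green", "white"

def flag_bug(bug, correct_bugs, error_flagging=True):
    """bug is a (line, object, bug_type) triple chosen through the three-level
    selection hierarchy; correct_bugs is the problem's answer key."""
    if not error_flagging:
        return WHITE            # control group: no feedback before submission
    return GREEN if bug in correct_bugs else RED

def flag_output_step(step, correct_steps, error_flagging=True):
    """step is a (line_number, printed_text) pair entered by the student."""
    if not error_flagging:
        return WHITE
    return GREEN if step in correct_steps else RED

# Example: the bug 'count referenced before assignment on line 8'.
key = {(8, "count", "referenced-before-assignment")}
print(flag_bug((8, "count", "referenced-before-assignment"), key))  # green
print(flag_bug((9, "count", "referenced-before-assignment"), key))  # red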

2 Results

For analysis, only those students were considered who had used both the debugging tutor and the behavior tutor, and who attempted most of the pre-test problems: at least 6 of the 9 problems on the debugging tutor and 8 of the 10 problems on the behavior tutor. Students who scored 0 or 100% on either pre-test were excluded. This left 40 students in Group A and 59 students in Group B. In order to factor out the effect of the difference in the number of problems solved by students, the average score per pre-test problem, which can range from 0 to 1, was considered for analysis rather than the total score.

Score Per Problem: A 2 x 2 mixed-factor ANOVA of the score per pre-test problem was conducted with the topic (debugging versus behavior) as the repeated measure and the group (group A with error-flagging on the behavior pre-test versus group B with error-flagging on the debugging pre-test) as the between-subjects factor. A significant main effect was found for error-flagging [F(1,97) = 44.107, p < 0.001]: students scored 0.519 ± 0.048 without error-flagging and 0.689 ± 0.035 with error-flagging (at the 95% confidence level). The difference was statistically significant [t(98) = -3.069, p = 0.003]. The effect size (Cohen's d) is 0.46, indicating a medium effect. Students scored more with error-flagging support during the test than without.
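For readers who want to reproduce this kind of analysis on their own data, the sketch below shows one way to compute the t-test, Cohen's d, and a mixed-factor ANOVA. It is not the study's actual analysis script: the file name, column names, and the use of scipy/pandas/pingouin are all assumptions.

# Sketch of the reported statistics (not the study's analysis script).
# Assumes a hypothetical long-format file with one row per student per topic
# and columns 'student', 'group', 'topic', 'flagged', 'score'.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("pretest_scores.csv")          # hypothetical file name

with_ef = df.loc[df["flagged"] == 1, "score"]
without_ef = df.loc[df["flagged"] == 0, "score"]

# Independent-samples t-test on score per problem, with vs. without flagging.
t, p = stats.ttest_ind(with_ef, without_ef)

# Cohen's d from the pooled standard deviation (the paper reports d = 0.46).
n1, n2 = len(with_ef), len(without_ef)
pooled_sd = np.sqrt(((n1 - 1) * with_ef.var(ddof=1) +
                     (n2 - 1) * without_ef.var(ddof=1)) / (n1 + n2 - 2))
d = (with_ef.mean() - without_ef.mean()) / pooled_sd
print(t, p, d)

# One option for the 2 x 2 mixed ANOVA: topic as the repeated measure,
# group as the between-subjects factor (pingouin, if installed).
import pingouin as pg
aov = pg.mixed_anova(data=df, dv="score", within="topic",
                     subject="student", between="group")
print(aov)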


The between-subjects effect for group (A versus B) was not significant [F(1,97) = 2.340, p = 0.129], indicating that the two groups were comparable, whether they got error-flagging support on the debugging pre-test or on the behavior pre-test. A large significant interaction was found between treatment and group [F(1,97) = 123.022, p < 0.001]. As shown in Table 1, the group with error-flagging scored statistically significantly more than the group without error-flagging on both the debugging pre-test [t(97) = -2.638, p = 0.01] and the behavior pre-test [t(97) = 5.604, p < 0.001]. It turned out that students found the debugging pre-test to be harder than the behavior pre-test, scoring significantly less on it (average 0.4736) than on the behavior pre-test (average 0.7241) [t(98) = -8.336, p < 0.001]. This explains why group B scored less with error-flagging on the debugging pre-test (0.521) than without error-flagging on the behavior pre-test (0.635).

Table 1. Average Pre-test Score with and Without Error-Flagging

                        Debugging pre-test   Behavior pre-test
Without Error-Flagging  0.403 ± 0.074        0.635 ± 0.061
With Error-Flagging     0.521 ± 0.045        0.856 ± 0.054

In order to answer why error-flagging support led to better scores, we considered four cases (the sketch after this list shows how attempts partition into these cases):

1. Students solved a problem correctly without any revisions – did students with error-flagging support solve such problems faster, because they were not tempted to reconsider their answer?
2. Students solved a problem incorrectly without any revisions – since students in the experimental group did not take advantage of error-flagging feedback, the two groups should be comparable in how quickly they solved problems.
3. Students solved a problem correctly with revisions – did students with error-flagging support take longer to solve the problem? Did they revise more often?
4. Students solved a problem incorrectly with revisions – if this category applied to students with error-flagging support, it would suggest that error-flagging support is not a substitute for knowing the correct answer.
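The four cases partition each logged attempt by two binary outcomes: final correctness and whether the answer was ever revised. A small sketch makes the partition explicit; the Attempt fields are assumptions about the logged data, not the study's actual schema.

# Sketch of how pre-test attempts partition into the four cases above.
from dataclasses import dataclass

@dataclass
class Attempt:
    correct: bool      # was the submitted answer correct?
    revisions: int     # how many times was the answer revised before submission?
    seconds: float     # time spent on the problem

def case_of(a: Attempt) -> int:
    if a.correct and a.revisions == 0:
        return 1   # correct without revision
    if not a.correct and a.revisions == 0:
        return 2   # incorrect without revision
    if a.correct:
        return 3   # correct after at least one revision
    return 4       # incorrect despite revisions

print(case_of(Attempt(correct=True, revisions=0, seconds=54.0)))    # 1
print(case_of(Attempt(correct=False, revisions=4, seconds=128.0)))  # 4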

On the debugging tutor, students had only one mechanism to revise their answer once they got error-flagging feedback: delete the entire bug. On the behavior tutor, students had two mechanisms to revise their answer: either delete the entire step, or edit the step in-situ. Since in-situ editing events were collected only in spring 2010 and not in fall 2009, revision data for the behavior tutor was incomplete and was dropped from further analysis.

Case 1: When solving problems correctly without revision, students with error-flagging support solved them faster (53.86 seconds) than those without (70.61 seconds), and this difference was significant [t(231) = 2.057, p = 0.041]. This supports the hypothesis of our first case: given the positive reinforcement that an answer is correct, students with error-flagging feedback move on to the next problem more quickly without pausing to reconsider their answer. Table 2, where data of the control group is shown as “No” and of the experimental group with error-flagging as “EF”, shows this to be the case for all but two problems (4 and 7). The differences on problems 5 and 6 were marginally significant (p = 0.09), whereas the rest of the differences were not statistically significant.

Table 2. Time spent (in seconds) per problem when solving the problem correctly without revision

Problem       1      2      3      4      5      6      7      8      9
No (N=40)   81.23  74.73  75.31  60.0   94.69  128.0  40.80  55.40  53.94
EF (N=59)   64.33  63.69  46.29  70.0   52.19  63.40  53.35  42.54  45.20

However, the news is not all positive. As shown in Table 3, on every problem, the percentage of students who solved the problem correctly without revision was smaller with error-flagging than without. Given that the two groups were comparable, this could suggest that when error-flagging feedback is provided, students may be using it as a crutch, as an expedient replacement for their own judgment. In other words, in at least some cases, they arrived at the correct answer through revisions even though they could have done so with additional deliberation instead: they resorted to revising their answer even when they did not need to, just because they could.

Table 3. Percentage of students who correctly solved the problem without revision

Problem       1      2      3      4      5      6      7      8      9
No (N=40)   32.50  37.50  40.0   30.0   32.50  25.0   37.50  50.0   42.50
EF (N=59)   20.34  22.03  11.86   6.78  27.12   8.47  28.81  22.03  25.42

Case 2: In contrast, when solving problems incorrectly without revision, there was no significant difference in the time taken by students with (80.5 seconds) or without (75.37 seconds) error-flagging support [t(265) = -0.636, p = 0.525]. This supports our second case – since students in the experimental group did not take advantage of error-flagging feedback even when their answer was incorrect, the two groups should be comparable in how quickly they solved problems. Table 4 shows the average time taken by the two groups to solve each problem when they solved it incorrectly without revision. The difference between the two groups was statistically significant only on problem 9.

Table 4. Time spent (in seconds) per problem when solving the problem incorrectly without revision

Problem       1      2      3      4      5      6      7      8      9
No (N=40)   92.52  73.0   77.56  78.59  64.08  93.17  74.12  50.65  67.06
EF (N=59)   86.63  98.85  59.57  60.33  82.30  98.0   95.60  59.50  46.0

Table 5 shows that when error-flagging support is provided, a far smaller percentage of students solve a problem incorrectly without revising it. In other words, students take advantage of error-flagging to fix an incorrect answer. The percentage of students without error-flagging support who moved on after solving a problem incorrectly, without revising their answer even once, is rather large (40%-60%). Prompting such a large percentage of students to reconsider their answer is the goal of providing error-flagging feedback during tests. If a student knows the material and solves the problem correctly, but enters the answer incorrectly, this would help the student uncover incidental or accidental mistakes. If a student knows the material, but did not solve the problem correctly, this would prompt the student to go over the steps of solving the problem again. If a student does not know the material, this would make the student aware of his/her lack of knowledge, which is also desirable.

Table 5. Percentage of students who incorrectly solved the problem without revision

Problem       1      2      3      4      5      6      7      8      9
No (N=40)   52.50  50.0   40.0   55.0   60.0   60.0   42.50  42.50  40.0
EF (N=59)   13.56  33.90  23.73  10.17  16.95  23.73   8.47  13.56   8.47

Case 3: Table 6 lists the percentage of students who solved problems correctly after revising their answer at least once. As could be expected, the percentage of students who solved problems correctly by revising their answers was much greater with error-flagging than without. This explains the significantly better score of students with error-flagging than without.

Table 6. Percentage of students who correctly solved the problem with revision

Problem       1      2      3      4      5      6      7      8      9
No (N=40)    5.0    2.50   5.0    2.50   0      0      2.50   2.50   0
EF (N=59)   25.42  22.03  16.95  40.68  45.76  16.95  18.64  30.51  38.98

Since so few students without error-flagging support actually revised their answers (at most 2), in the next analysis, we considered the time spent per problem and the number of revisions per problem of only those who got error-flagging feedback. Table 7 lists these as “Rev Time” and “Revisions” respectively. For comparison purposes, it also lists the average time spent per problem by the students who solved the problem correctly without any revision (with or without error-flagging) as “NoRevTime”. Note that in order to revise and answer correctly, students with error-flagging support spent more time, often twice as much, as those who did not revise their answer. This difference was statistically significant: 104.6 seconds with revisions versus 63.3 seconds without revisions [t(390) = -6.11, p < 0.001].

Table 7. Students with error-flagging support who correctly solved a problem with revisions (times in seconds)

Problem       1      2      3      4      5      6      7      8      9
N            15     13     10     24     27     10     11     18     23
NoRevTime   73.1   69.6   66.5   62.5   71.2  106.5   47.5   50.3   49.8
Rev Time   135.6  108.5  132.6  125.8   81.7  104.9  105.5   93.2   80.5
Revisions   4.27   5.62   7.40   8.63   4.22  13.0    6.0    4.17   4.04
Errors       28     41     50     55     23     45     55     41     31


As shown in the “Revisions” row in Table 7, students on average revised their answer at least 4 times per problem. Each problem had only one correct answer (although this was not communicated to the students). So, an average of 4 revisions per problem indicates that students used error-flagging support to reach the correct answer through trial and error, which is clearly undesirable. The last row, titled “Errors”, lists the number of possible error options for each problem. Each problem contained 13-18 lines of code over which these error options were spread. So, while an average of 4 attempts to identify one bug is excessive, it represents less than 18% of the total number of possible error options for each problem, suggesting that at least some students may have engaged in informed (as opposed to brute-force) trial and error.

Case 4: Table 8 lists the percentage of students who solved problems incorrectly even after revising their answer at least once. Curiously, this percentage is much larger with error-flagging than without. This suggests that error-flagging support did not always help students arrive at the correct answer, and is not a substitute for knowing the answer at the outset.

Table 8. Percentage of students who incorrectly solved problems with revision

Problem       1      2      3      4      5      6      7      8      9
No (N=40)    5.0    7.50   5.0    7.50   7.50  10.0   12.50   2.50   0
EF (N=59)   23.73  15.25  33.90  40.68  10.17  47.46  37.29  22.03   8.47

For the follow-up analysis, once again, we excluded the data of students without error-flagging since too few of them revised their answers (usually 2 or 3). As shown in Table 9, even after 3 or more revisions facilitated by error-flagging support, a large number of students solved problems incorrectly anyway. The table lists the time spent per problem as “RevTime” and the average number of revisions as “Revisions”. For comparison purposes, we listed the time spent per problem by students who solved each problem incorrectly, but without any revisions, as “NoRevTime”. We compared against this group because, if our experimental group was going to solve a problem incorrectly anyway, we wanted to find out the time penalty, if any, of all the revisions prompted by error-flagging support. When solving problems incorrectly, students who revised took significantly more time per problem (127.72 seconds) than those who did not revise their answer (77.10 seconds) [t(429) = -7.414, p < 0.001]. As the table shows, even when the answer eventually turned out to be incorrect, students who revised spent up to twice as long as the group that did not. Since all the additional time spent on these problems did not increase the students’ score on the test, error-flagging support provided during tests could cost students time by encouraging fruitless speculation. Once again, note that students revised their answer at least 3 times per problem. While it is common knowledge that students solve problems through trial and error if error-flagging support is provided on multiple-choice questions, the finding of this study is that they resort to trial and error even when the problem is not multiple-choice in nature. One mechanism to discourage or minimize excessive revisions might be to limit the number of revisions allowed per problem. This would prevent students from arriving at the correct answer through repeated trials, as well as reduce the time they spend speculating on problems for which they do not eventually find the correct answer.
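One way such a revision limit could be enforced in a tutor's interface is sketched below. This is a hypothetical mechanism, not a feature of the tutors described here; the limit value and all names are assumptions.

# Hypothetical sketch of the proposed mechanism: cap the number of revisions
# allowed per problem so error-flagging cannot be used for unbounded trial
# and error. The limit of 3 and the class name are assumptions.

class RevisionBudget:
    def __init__(self, limit=3):
        self.limit = limit
        self.used = 0

    def may_revise(self) -> bool:
        return self.used < self.limit

    def record_revision(self):
        if not self.may_revise():
            raise RuntimeError("revision limit reached; answer must be submitted")
        self.used += 1

budget = RevisionBudget(limit=3)
for _ in range(3):
    if budget.may_revise():
        budget.record_revision()   # student edits a flagged answer
print(budget.may_revise())          # False: further revisions are blocked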


Table 9. Students with error-flagging support who incorrectly solved a problem with revisions (times in seconds)

Problem       1      2      3      4      5      6      7      8      9
N            14      9     20     24      6     28     22     13      5
NoRevTime   90.9   85.9   69.2   74.7   69.4   95.0   79.0   53.5   62.0
RevTime    109.6  123.4  137.7  109.1  116.8  158.3  137.9  103.2   61.6
Revisions   4.93   4.44   8.79   4.58   6.0    9.21  13.1    5.62   3.0

Students spent less time with than without error-flagging in case 1, and more time in cases 3 and 4. Since there was no significant difference with versus without error-flagging in the overall time taken on either the debugging or the behavior tutor pre-test, error-flagging support led students to save time on the problems they knew how to solve and spend it attempting problems for which they did not readily know the solution. This re-allocation of time is desirable in tests and tutors, which makes the case for providing error-flagging support. But with error-flagging support, students often used trial and error to arrive at the correct solution, and spent significantly more time futilely revising their answers, neither of which is desirable. In order to address these concerns and further elucidate how error-flagging support can be beneficially used in tutors and tests, we plan to conduct a follow-up study after imposing a limit on the number of revisions allowed per problem.

Acknowledgments. Partial support for this work was provided by the National Science Foundation under grant DUE-0817187.

References

1. Aïmeur, E., Brassard, G., Dufort, H., Gambs, S.: CLARISSE: A Machine Learning Tool to Initialize Student Models. In: Cerri, S.A., Gouardéres, G., Paraguaçu, F. (eds.) ITS 2002. LNCS, vol. 2363, pp. 718–728. Springer, Heidelberg (2002)
2. Anderson, R.C., Kulhavy, R.W., Andre, T.: Feedback procedures in programmed instruction. J. Educational Psychology 62, 148–156 (1971)
3. Bierbaum, W.B.: Immediate knowledge of performance on multiple-choice tests. J. Programmed Instruction 3, 19–23 (1965)
4. Corbett, A.T., Anderson, J.R.: Locus of feedback control in computer-based tutoring: impact on learning rate, achievement and attitudes. In: Proc. SIGCHI, pp. 245–252 (2001)
5. Gilmer, J.S.: The Effects of Immediate Feedback Versus Traditional No-Feedback in a Testing Situation. In: Proc. Annual Meeting of the American Educational Research Association, pp. 8–12 (April 1979)
6. Kumar, A.N.: Error-Flagging Support for Testing and Its Effect on Adaptation. In: Aleven, V., Kay, J., Mostow, J. (eds.) ITS 2010. LNCS, vol. 6094, pp. 359–368. Springer, Heidelberg (2010)
7. Kumar, A.N.: A Scalable Solution for Adaptive Problem Sequencing and Its Evaluation. In: Wade, V.P., Ashman, H., Smyth, B. (eds.) AH 2006. LNCS, vol. 4018, pp. 161–171. Springer, Heidelberg (2006)
8. Plake, B.S.: Effects of Informed Item Selection on Test Performance and Anxiety for Examinees Administered a Self-Adapted Test. Educational and Psychological Measurement 55(5), 736–742 (1995)
9. Shermis, M.D., Mzumara, H.R., Bublitz, S.T.: On Test and Computer Anxiety: Test Performance Under CAT and SAT Conditions. J. Education Computing Research 24(10), 57–75 (2001)
10. Tait, K., Hartley, J.R., Anderson, R.C.: Feedback procedures in computer-assisted arithmetic instruction. British Journal of Educational Psychology 43, 161–171 (1973)