
Evaluating the Performance of a Diagnosis System in School Algebra

Naima El-Kechaï (1), Élisabeth Delozanne (1), Dominique Prévit (1), Brigitte Grugeon (2), Françoise Chenevotot (2)

(1) LIP6, UPMC-Sorbonne Universités, Paris, France, {Naima.El-Kechai, Elisabeth.Delozanne}@lip6.fr, [email protected]
(2) LDAR – Université Paris Diderot, Paris, France, [email protected], [email protected]

Abstract. This paper deals with PépiMep, a diagnosis system for school algebra. Our proposal to evaluate students’ open-ended answers is based on a mixed theoretical and empirical approach. First, researchers in mathematics education list the different types of anticipated answer patterns and the way to evaluate them. Second, this information is stored in an XML file that the system uses to match a student’s input with an anticipated answer. Third, since it is impossible to anticipate every student’s answer, the system can be improved: when an unknown form is detected, it is added to the XML file after expert inspection. Results from testing with 360 students showed that, in comparison with human experts, PépiMep (1) was very effective in recognizing the different types of solutions when the students' input was an algebraic expression, but (2) was less effective when students entered a reasoned response expressed as a mix of algebraic expressions and natural language utterances.

Keywords: Assessment of open questions, Cognitive Diagnosis, School Algebra, System evaluation.

1 Introduction

Modeling a student’s knowledge is known to be a difficult problem, and much research has been dedicated to it. The Pépite project intends to diagnose students' difficulties in order to adapt the learning process [1-2]. Its objective is to design an intelligent aid that supports math teachers when they monitor learning in a classroom context, taking into account their students’ cognitive diversity. This paper deals with diagnosing students’ cognitive profiles in algebra. We assume that it is necessary to assess how students produce algebraic expressions by themselves. From our computational point of view, we focus especially on one of the key difficulties of such work: modeling the wide variety of types and levels of students’ knowledge. We designed a system, called PépiMep, that automatically diagnoses students’ answers, even when students express their answers in their own ways. PépiMep is used online in MathEnPoche (MeP), a web-based platform developed by Sésamath [11], a community of mathematics teachers. This free platform is used by thousands of students and teachers in French-speaking countries.

In this paper, we examine the following specific research questions:
- Question 1: How to typify anticipated correct and incorrect solutions, and how to specify the different patterns of answers for each type of solution?
- Question 2: How to ensure the quality of the automatic diagnosis when it is impossible to anticipate every student’s production?

To answer these questions, we propose a mixed theoretical and empirical approach. A trade-off was made so that the system would be generic enough to apply to many classes of algebraic problems, and specific enough to detect students’ personal conceptions. Thus the main points in our proposal are (i) to anticipate the most common student approaches to solving one class of questions through detailed and accurate epistemological and empirical studies [3, 4, 10], (ii) to design and implement a specific Computer Algebra System (CAS), called Pépinière, to deal with correct and incorrect algebraic rules [2], (iii) to generate a set of answer patterns representing each solution approach, stored in an XML file, and (iv) to design a system that can be improved by incrementally augmenting the anticipated answer pattern file when new patterns of answers are collected.

We begin with a brief review of the research on diagnosing students’ difficulties, especially in mathematics, followed by an overview of the Pépite project. Then we present the diagnosis system PépiDiag and an evaluation of our proposal. We end with a discussion and plans for future research.

2 Background and related projects

Assessment and student modeling are active research topics in the e-learning and ITS (Intelligent Tutoring Systems) communities. The most popular approach to on-line assessment is based on tests involving closed questions that are specially tailored to provide evidence about targeted competencies or knowledge. The student’s knowledge is described as a part of the reference competency model with the same structure, but with missing or incompletely mastered items. For instance, the evidence-based design used in ACED [5] relies on: a competency model; a task model to collect evidence on variables from the competency model (generally closed questions); a scoring model to interpret a student’s answer (generally a score that estimates whether it is correct or not); and a statistical model that aggregates scores across tasks to link observable variables (scores) to the competency model variables [5]. Many of these systems rely on a sound statistical grounding, Item Response Theory (IRT), and propose adaptive testing. E-assessment aims to deliver a final evaluation or sometimes to provide feedback to students.

In the Pépite project, the objective of the assessment is different: the diagnosis is the first step in monitoring, in a classroom, learning paths adapted to different students’ knowledge states and levels. As in evidence-based design, the diagnosis in the Pépite project is based on three models, but they differ in many ways. The competency model includes both the referential competence and common misconceptions. The task model includes open-ended exercises. The analysis of each answer (called local diagnosis) is based on a multidimensional model expressed by a set of criteria (see section 3.2) and on using a CAS to match students' answers with patterns linked to the criteria. Then, a heuristics-based model aggregates the results of the local diagnosis to draw a student’s cognitive profile.

A second approach aims at drawing a description of the student’s knowledge that includes personal knowledge building and misconceptions. The Pépite project is a contribution in that direction. Some ITS analyze open answers when they are numerical or reduced to a simple algebraic expression (Algebra Tutor, ASSISTments [6], LeActiveMath [7]). Very few analyze the whole reasoning. From this point of view, the systems most closely related to our work are Diane [8] and Aplusix [9]. Diane is a diagnosis system that detects adequate or inadequate problem solving strategies for some classes of arithmetic problems at elementary school level. Diane analyzes open-ended numerical calculations according to several criteria. However, for more complex domains such as Physics or Algebra, researchers had to use a standard CAS, or to develop one that is specific to the type of students’ inputs and the intended diagnosis. For instance, Aplusix provides a fine-grained analysis of students’ use of algebraic rewriting rules. Pépite does not analyze the algebraic writing dimension as deeply, but assesses a broader range of skills on other dimensions (see section 3.2). Its objective is to link formal algebraic processing with students’ other conceptions related to the meaning of letters or the meaning of algebra. Thus, the Pépite project involves very different diagnosis tasks, covering not only algebraic expressions but also geometric figures and calculation programs.

3 Project design and development

The objective of the Pépite project is that students develop or strengthen correct conceptions, and question wrong or unsuitable ones that interfere with, and sometimes prevent, learning [10]. The key point of the Pépite assessment approach is that students’ answers to problems are not simply interpreted as errors or lack of skills but as indicators of incomplete, naive and often inaccurate conceptions that the students themselves have built. A fine analysis of the student’s work is required to understand the coherence of their personal conceptions. Without the help of automatic reasoning on the student’s performance, detecting these conceptions is a very complex task that requires special training and a lot of time.

3.1 Project Overview

We developed such a cognitive diagnosis tool. Our research approach is a bottom-up approach informed by educational theory and field studies [3, 4, 11]. In previous work, we started from a paper and pencil diagnosis tool grounded in mathematics education research and empirical studies. Then we automated it in a first prototype, also called Pépite, and tested it with dozens of teachers and hundreds of students in different school settings [1]. In more recent work, we implemented PépiGen, which generalizes this first design to create a framework for authoring similar diagnosis tools, offering configurable parameters and options [2]. In 2010 and 2011, we implemented PépiMep to deploy the Pépite diagnosis tool on MathEnPoche (MeP), a web-based platform widely used by math teachers. The next sections deal with the design, implementation and evaluation of PépiDiag, which carries out the automatic characterization of students' answers. The whole project relies heavily on the quality of the local diagnosis implemented in PépiDiag (Figure 1).

Fig. 1. The automatic diagnosis of a student’s answer in PépiMep

3.2 The Diagnosis Process

In Pépite, as in other diagnosis tools [12], the diagnosis is a three-stage process. In this paper we focus on the first stage. The next two are described in more detail in [1]; we introduce them only briefly to situate the work described here within the whole process.

First, for each student’s answer to a diagnosis task, a local diagnosis provides two kinds of information: (i) a type that characterizes the answer in the context of the exercise and (ii) a set of codes referring to the different general evaluation criteria involved in the question. Types are specific to the exercise, whereas codes apply to the whole set of exercises at this school level. Codes are used to situate the answer within the more general level of a multidimensional model of algebraic competence. In Pépite, the set of codes gives an interpretation of the student’s answer according to a set of 36 criteria on 6 assessment dimensions (see Table 1 and Table 2 for examples).

Second, Pépite builds a detailed report of the student’s answers by collecting similar criteria across different exercises to obtain a higher-level view of the student’s activity. At this stage, the diagnosis is expressed (i) by success rates on three components of the algebraic competence (usage of algebra, translation from one representation to another, algebraic calculation), and (ii) by the student’s strong points and weak points in these three components. This level is called the personal features of the student’s cognitive profile.

Third, Pépite evaluates a level of competence in each component with the objective of situating a student in relation to the whole class. This third level is called the stereotype part of students’ profiles. Stereotypes were introduced to support the personalization of learning paths in the context of whole-class management, and to facilitate the creation of student working groups. Figure 2 shows a cognitive profile built by PépiMep.

Fig. 2. An overview of Colin’s cognitive profile automatically built by PépiMep.
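To make the first two stages more concrete, the following Python sketch stores the output of a local diagnosis (a type plus a set of codes) and collects the codes across exercises. It is only an illustration under stated assumptions: the class `LocalDiagnosis`, the function `summarize`, the exercise identifiers other than "3", and the definition of a success rate as the share of non-empty answers coded V1 or V2 are all hypothetical. The heuristics actually used by Pépite are described in [1] and are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalDiagnosis:
    """Stage 1 result for one answer (hypothetical structure)."""
    exercise: str      # exercise identifier, e.g. "3"
    answer_type: str   # type specific to the exercise, e.g. "3.1"
    codes: tuple       # general criteria codes, e.g. ("V3", "EA3", "T332")


def summarize(diagnoses):
    """Count codes across exercises and compute an assumed validity rate."""
    code_counts = Counter(code for d in diagnoses for code in d.codes)
    # Assumption: V0 marks an empty answer, V1 and V2 mark valid answers.
    answered = [d for d in diagnoses if "V0" not in d.codes]
    valid = [d for d in answered if {"V1", "V2"} & set(d.codes)]
    rate = len(valid) / len(answered) if answered else 0.0
    return code_counts, rate


# Made-up answers for one student; only the code names come from the paper.
sample = [
    LocalDiagnosis("3", "1", ("V1", "T132")),
    LocalDiagnosis("ex-A", "3.1", ("V3", "EA3", "T332")),
    LocalDiagnosis("ex-B", "0", ("V0",)),
    LocalDiagnosis("ex-C", "2", ("V2", "T132")),
]
counts, rate = summarize(sample)
```

Keeping the exercise-specific type separate from the general codes mirrors the distinction made above: only the codes are meaningful across exercises, so only they are aggregated.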

[Figure 3: student interface. Question 1: "Show how to find out the area of the blue (big) rectangle". Labels shown: Approach; Result (algebraic or numerical expression); Area of the blue rectangle.]

Fig. 3. The student's interface for exercise 3 in PépiMep

The following sections examine the iterative process used to design and test PépiDiag, the system that implements the local diagnosis. For readability, we illustrate our approach with a simple example where answers are expressed as a single algebraic expression. Figure 3 shows the student interface of this exercise, called exercise 3.

The first iteration began with an educational research project that defined a multidimensional model of competency and developed a paper and pencil diagnosis test with a very detailed analysis of anticipated students’ solutions [4]. This test was taken by 600 students. The team used this corpus to refine the first analysis and to specify an electronic version of the test and a first prototype called Pépite. In the second iteration, we tested this first Pépite prototype in different school settings. In the third iteration, we developed Pépinière, a Computer Algebra System (CAS) designed specifically to simulate students’ reasoning based on correct and incorrect algebraic rules [2]. Using Pépinière, it was possible to code most answers along six dimensions automatically (Table 1). However, many answers were left undiagnosed and some correct answers were wrongly diagnosed. In the fourth iteration, we developed a new diagnosis system, which is described in the next section.

Table 1. The multidimensional model of algebraic assessment (partial view).

Dimension | Evaluation criteria and their codes (partial view)
Validity | V0: No answer; V1: Valid and optimal answer; V2: Valid non-optimal answer; V3: Invalid answer; Vx: Unidentified answer
Use of letters | … L5: No use of letters …
Algebraic writing | EA1: Correct use of algebraic rules …; EA3: Incorrect use of transformation rules but correct identification of the role of the operators × and +; EA4: Incorrect identification of the role of the operators × and +; EA41: Incorrect rules make expressions linear (a² -> 2a); EA42: Incorrect rules gather terms
Translation from geometry to algebra | … T132: Correct translation; T332: Incorrect translation taking into account relationships; T432: Incorrect translation without taking into account relationships; Tx: No interpretation
Type of justification | …
Numerical calculation | …

3.3 The Diagnosis System PépiDiag

Each diagnosis exercise in PépiMep includes an XML coding prescription file that describes every type of anticipated answer for a question (Table 2). A type is characterized by a number, a label, and a set of codes for the evaluation criteria; this information is used in the next stage of the diagnosis process to build the student’s cognitive profile. A type is also characterized by a list of the different answer patterns that the system uses to assess a student’s answer. After a syntactic analysis of the student’s algebraic answer, PépiDiag compares it to each pattern described in the coding prescription file. The comparison is made by Pépinière (the project's CAS), which handles the commutativity of operations. When the student’s answer matches a pattern, PépiDiag returns the type and the set of codes. They are stored in the student’s log file along with the student’s answers.
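The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the PépiDiag or Pépinière implementation: the XML element and attribute names (`exercise`, `type`, `number`, `codes`, `pattern`) and the functions `canonical` and `diagnose` are assumptions, and the comparison only neutralizes operand order for + and ×. It deliberately does not expand or simplify expressions, since algebraically equivalent answers such as (a+3)×(b+a) and ab+a²+3b+3a belong to different types.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical coding prescription; patterns and codes follow Table 2.
PRESCRIPTION = """<exercise id="3">
  <type number="1" codes="V1 T132">
    <pattern>(a+3)×(b+a)</pattern>
  </type>
  <type number="2" codes="V2 T132">
    <pattern>ab+a²+3b+3a</pattern>
    <pattern>a(a+3)+b(a+3)</pattern>
  </type>
  <type number="4" codes="V3 T432">
    <pattern>(a+b)+(a+3)</pattern>
  </type>
</exercise>"""

TOKEN = re.compile(r"\d+|[a-z]|[()+×*²]")


def canonical(text):
    """Return a string that ignores the order of operands of + and ×."""
    stripped = text.replace(" ", "")
    tokens = TOKEN.findall(stripped)
    if "".join(tokens) != stripped:
        raise ValueError("unsupported character in " + repr(text))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_sum():
        terms = [parse_product()]
        while peek() == "+":
            take()
            terms.append(parse_product())
        return terms[0] if len(terms) == 1 else "(" + "+".join(sorted(terms)) + ")"

    def parse_product():
        factors = [parse_factor()]
        while peek() is not None and peek() not in "+)":
            if peek() in "×*":          # explicit sign; otherwise implicit product
                take()
            factors.append(parse_factor())
        return factors[0] if len(factors) == 1 else "(" + "*".join(sorted(factors)) + ")"

    def parse_factor():
        tok = peek()
        if tok is None or tok in "+×*)²":
            raise ValueError("malformed expression " + repr(text))
        take()
        if tok == "(":
            node = parse_sum()
            if peek() != ")":
                raise ValueError("missing ')' in " + repr(text))
            take()
        else:
            node = tok
        if peek() == "²":
            take()
            node += "^2"
        return node

    result = parse_sum()
    if pos != len(tokens):
        raise ValueError("malformed expression " + repr(text))
    return result


def diagnose(answer, prescription_xml=PRESCRIPTION):
    """Return (type number, codes) for the first matching pattern, else None."""
    try:
        target = canonical(answer)
    except ValueError:
        return None                      # unreadable input stays undiagnosed
    for type_el in ET.fromstring(prescription_xml).iter("type"):
        for pattern in type_el.iter("pattern"):
            if canonical(pattern.text) == target:
                return type_el.get("number"), type_el.get("codes").split()
    return None
```

In this sketch, `diagnose("(b+a)×(3+a)")` matches the type 1 pattern despite the different operand order, while an input it cannot parse, or one that matches no pattern, is returned as undiagnosed rather than being forced into a type.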

4 Evaluation

In line with our research questions on the quality of the diagnosis, we state three criteria to evaluate the new PépiDiag: (i) no correct answer is wrongly diagnosed, (ii) no diagnosis is preferable to a wrong one, and (iii) the number of answers left undiagnosed is minimal. To assess the system's validity, we set up a study with three experts. For each exercise of the test, we sorted the collected answers by type and by answer. Then, we asked the three experts to evaluate the automatic local diagnosis. We asked them to check, for both correct and incorrect answers, whether they were (i) diagnosed correctly, (ii) diagnosed incorrectly, or (iii) undiagnosed. For undiagnosed answers, we asked them whether they could complete the automatic diagnosis. They found this task quite easy when working on a single exercise. They made very few interpretation mistakes and, when they disagreed, they reached a consensus very quickly. We were therefore able to compare the system's diagnosis to the experts’ consensual diagnosis, even though the experts were puzzled by some answers and could not interpret them.

Table 2. Patterns of anticipated students' answers classified by type and code.

Type | Label | Codes | Patterns of equivalent expressions
1 | Correct expression of the area as a product Length × Width | V1, T132 | (a+3)×(b+a)
2 | Correct expression of the area by adding the areas of the different rectangles | V2, T132 | ab+a²+3b+3a; (a×(a+b))+(3×(a+b)); a(a+3)+b(a+3)
3.1 | Recognition of sub-figures with parentheses errors (e.g. a+3×a+b) | V3, EA3, T332 | (a+3)×b+a; a+3×(b+a); a+3×b+a; a+3×a+b; a+b×a+3
3.2 | Recognition of sub-figures with gathering transformations | V3, EA42, T332 | (a+3)(ab); (b+a)×(3a); a+3×ab; ab×a+3; 3a×a+b; a+b×3a
3.3 | Recognition of sub-figures with linearization transformations (a² -> 2a) | V3, EA41, T332 | 5a+ab+3b; ab+2a+3b+3a
4 | Confusion between area and perimeter | V3, T432 | (a+b)+(a+3); (a+b+a+3)×2
4.1 | Transformations by gathering | V3, EA42, T432 | 6a+4b; a²+3b; ab+3a; ab+4a; a(3+b); 2a+3b; 2ab+3
5 | Translation by gathering the figure items | V3, EA42, T432 | a²b+3ab; 3a²b; 3a²+3b; 3a²+3ab; a²+4ab; a²+3ab; 3ab; 3+a²+b; 3+a²b; a²+7ab
6 | Confusion between the operators + and × (e.g. 3a×3b×a²×ba) | V3, EA4, T432 | 3a×3b×a²×ba; ab+a²×3b+3a; (a²+3a)×(ba+3b); ba×b×3+a²+a×3
7.1 | Partial formulae | V3, T332 | a(b+a); 3a+3b; b(a+3); ba+3b; 3(b+a); ab+a²; a(a+3); a²+3a
7.2 | Wrong operations (division, cube) | V3, T432 | (a²×3b):2; 3b²+a³+3a; (b+a)/(a+3); a+3×b+a/2
8 | Numerical values | V3, L5, T432 | (no use of letters)
9 | No interpretation | Vx |

4.1 Results

The following tables report the results of comparing the PépiDiag diagnosis with the experts’ consensual diagnosis on the working example. For this exercise, Table 3 shows that a quarter of the 360 students did not answer the question. On the whole, PépiMep diagnosed 82% of the answers and failed to diagnose 18%. To assess the first quality criterion, we distinguish between the correct answers (149) and the incorrect ones (117).

Table 3. Numbers of answers automatically analyzed/unanalyzed by PépiDiag in the first testing of the diagnosis on exercise 3 with 360 students.

Empty answers | 94/360 = 26%
Students’ answers | 266/360 = 74%
Automatic coding: analyzed answers | 218/266 = 82%
Automatic coding: unanalyzed answers | 48/266 = 18%
Correct answers | 149
Correct answers analyzed by PépiDiag | 136/149 = 91% (experts' agreement: 136/136)
Correct answers unanalyzed by PépiDiag | 13/149 = 9%
Correct answers unanalyzed by human experts | 0
Interface problems: use of letters in place of operators | 5/13
Interface problems: mix of algebraic expressions and natural language | 8/13

Correct Answers. Table 3 shows that, out of the 149 correct answers collected, PépiDiag never diagnosed a correct answer as a wrong one. Only 13 correct answers were left unanalyzed (Table 3), for the following reasons: 5 students used the letters "x" or "X" instead of the sign "×", e.g. (b+a)X(a+3); 8 students mixed algebraic expressions and natural language, e.g. “(a+b)(3+a)=area of the blue rectangle” or “(a+b) times (3+a)”.

Incorrect Answers. Table 4 shows that PépiDiag diagnosed 82 incorrect answers and that the experts fully agreed on 80 of them. The two remaining answers ("b+a×a+3" and "3+a×a+b") were assessed as incorrect by both the system and the experts, but the experts classified them as parentheses errors (type 3.1), while PépiDiag matched them with 3+a²+b (Translation by gathering the figure items, type 5). Indeed, Pépinière converted "a × a" into "a²".

Table 4. Incorrect answers in comparison with human experts.

Incorrect answers analyzed by PépiDiag | 82/117 = 70%
Experts’ agreement | 80/82
Pépinière problems | 2/82
Incorrect answers unanalyzed by PépiDiag | 35/117 = 30%
Incorrect answers unanalyzed by experts | 24/35
Total | 117

Table 4 also shows that 35 incorrect answers were left undiagnosed by PépiDiag. Eleven answers were expressed using either letters in place of operators, or a mix of natural language and algebraic expressions; 24 answers were not anticipated in the coding prescription file, and the three human experts could not typify them (e.g. "3a+3b+a²+b²" or "a²×6×b²×b²").
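The percentages reported in Tables 3 and 4 can be rechecked from the raw counts. The short Python sketch below recomputes them; the variable names and the helper `pct` are ours, and only the counts come from the tables.

```python
# Counts reported for exercise 3 (Tables 3 and 4).
total_students = 360
empty = 94
correct, incorrect = 149, 117
correct_analyzed, incorrect_analyzed = 136, 82


def pct(part, whole):
    """Percentage rounded to the nearest integer, as in the tables."""
    return round(100 * part / whole)


answered = total_students - empty                    # non-empty answers
analyzed = correct_analyzed + incorrect_analyzed     # diagnosed by PépiDiag
unanalyzed = answered - analyzed

rates = {
    "empty": pct(empty, total_students),
    "answered": pct(answered, total_students),
    "analyzed": pct(analyzed, answered),
    "unanalyzed": pct(unanalyzed, answered),
    "correct_analyzed": pct(correct_analyzed, correct),
    "correct_unanalyzed": pct(correct - correct_analyzed, correct),
    "incorrect_analyzed": pct(incorrect_analyzed, incorrect),
    "incorrect_unanalyzed": pct(incorrect - incorrect_analyzed, incorrect),
}
```

The counts are mutually consistent: the 149 correct and 117 incorrect answers add up to the 266 non-empty answers, and the 13 correct plus 35 incorrect unanalyzed answers add up to the 48 answers left undiagnosed.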

4.2 Discussion

An important result of this study is that PépiDiag was never mistaken on correct answers. This meets the first quality criterion. For incorrect answers, the result of the comparison is also very good: there is only a very small discrepancy, caused by Pépinière rewriting "a×a" as "a²". We are still studying whether it is worth changing this feature without disrupting the rest of the analysis.

For both correct and incorrect answers, to minimize undiagnosed answers, teachers and educational researchers suggested constraining the interface to prevent students from typing letters other than those relating to the exercise. They considered this an acceptable trade-off between imposing no constraint and obtaining a better automatic diagnosis of expressions that students could enter by themselves. We tested with 30 students a new interface that prevents students from entering letters other than a and b (in the example). We did not notice any problems when students were solving the exercise. Three new patterns of answers were collected. They were added to the coding prescription XML file, after which no answers were left undiagnosed and there was full agreement with the experts.

The study showed that PépiDiag is a valid diagnosis tool for assessing algebraic expressions, but it is not yet certain that it can assess a whole reasoning process. In [2] we reported how we dealt with analyzing a particular reasoning process. However, the results on a large scale do not fully satisfy the three criteria at present, mostly when students mix algebra and natural language. We are currently working on that question.

Having tested the validity, we are now working on evaluating the usefulness of this diagnosis system. This is a hard challenge and, at the moment, we have not carried out any formal evaluation. In our opinion, a first sign of usefulness is that Sésamath, an important community of math teachers, asked for the Pépite diagnosis tool to be deployed on their web-based platform. Another sign is that, in a validation cycle, it took between seven and ten hours for human experts to diagnose a whole class of 30 students on a whole test. The three experts found this task very tedious and made many slips. They felt the need for an automatic system. Currently, educational researchers are using PépiMep to investigate students’ misconceptions and to develop, with teachers, learning paths tailored to fit the PépiMep diagnosis.
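The incremental improvement step mentioned in this discussion, where new patterns are added to the coding prescription file after expert inspection, might look like the following Python sketch. It is an assumption-laden illustration: the XML layout, the function `add_pattern`, and the example pattern are hypothetical (the paper does not list the three patterns that were actually added).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a coding prescription file.
root = ET.fromstring(
    '<exercise id="3">'
    '<type number="2" codes="V2 T132"><pattern>ab+a²+3b+3a</pattern></type>'
    '</exercise>'
)


def add_pattern(root, type_number, pattern_text):
    """Append an expert-approved pattern to a type; return False if already listed."""
    for type_el in root.iter("type"):
        if type_el.get("number") == type_number:
            existing = {p.text for p in type_el.iter("pattern")}
            if pattern_text in existing:
                return False
            ET.SubElement(type_el, "pattern").text = pattern_text
            return True
    raise KeyError("unknown answer type " + repr(type_number))


# An expert has reviewed an undiagnosed answer and assigned it to type 2.
added = add_pattern(root, "2", "a×a+a×b+3×a+3×b")
updated_xml = ET.tostring(root, encoding="unicode")
```

Because the expert chooses the type, the existing codes of that type apply to the new pattern automatically, so nothing else in the diagnosis process has to change.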

5 Conclusion

In this paper, we presented the design of PépiMep and a study to evaluate its performance. We benefited from empirical and theoretical educational studies to design and implement PépiDiag, a system that automatically diagnoses students' open-ended answers even when students enter their solutions in their own ways. Anticipated patterns of correct and incorrect solutions are typified according to six evaluation dimensions based on detailed and accurate epistemological and empirical studies. Pépinière, a Computer Algebra System (CAS), was designed to simulate a student's reasoning based on correct and incorrect algebraic rules. PépiDiag, which relies heavily on this CAS, can diagnose most students' inputs. We evaluated the performance of PépiDiag by testing it with 360 students. In this paper, we presented the results on a simple example where students have to express their reasoning by producing one algebraic expression. We described how we reify, in a coding prescription file, the way experts typified anticipated solutions. For each type of solution, this file lists different patterns of answers that the system can match with a student’s answer. We showed that the system can improve with usage when unknown answers occur. We also showed that PépiDiag performed very well according to the three quality criteria.

Currently, we are working in two directions. On the diagnosis side, we are improving and generalizing the analysis of students’ reasoning. On the learning side, we are designing a software assistant to display learning tasks adapted to the students' cognitive profiles.

Acknowledgements. The PépiMep project is funded by the Region Ile de France. We thank Christian Vincent, Aso Darwesh, Josselin Allys and Arnaud Rommens for their contribution to the implementation of PépiMep, and Julia Pilet and the math teachers from Sésamath for testing and suggestions. We thank John Wisdom for proofreading the English.

References

1. Delozanne, É., Vincent, C., Grugeon, B., Gélis, J.-M., Rogalski, J., Coulange, L.: From errors to stereotypes: Different levels of cognitive models in school algebra. In E-Learn, pp. 262-269, Vancouver (2005).
2. Delozanne, É., Prévit, D., Grugeon, B., Chenevotot, F.: Automatic Multi-criteria Assessment of Open-Ended Questions: A Case Study in School Algebra. In ITS, pp. 101-110, Montreal (2008).
3. Kieran, C.: Learning and teaching algebra at the middle school through college levels. Second Handbook of Research on Mathematics Teaching and Learning. pp. 707-762. Frank K. Lester (2007).
4. Grugeon, B.: Design and development of a multidimensional grid of analysis in algebra (in French). RDM Journal. 17, 167-210 (1997).
5. Shute, V.J., Hansen, E.G., Almond, R.G.: You Can’t Fatten A Hog by Weighing It–Or Can You? Evaluating an Assessment for Learning System Called ACED. IJAIED. 18, 289-316 (2008).
6. Feng, M., Heffernan, N., Koedinger, K.: Addressing the assessment challenge with an online system that tutors as it assesses. UMUAI. 19, 243-266 (2009).
7. Melis, E., Ullrich, C., Goguadze, G., Libbrecht, P.: Culturally Aware Mathematics Education Technology. The Handbook of Research in Culturally-Aware Information Technology: Perspectives and Models. pp. 543-557. Blanchard E. and Allard D. (2009).
8. Hakem, K., Sander, E., Labat, J.-M., Richard, J.-F.: DIANE, a diagnosis system for arithmetical problem solving. In AIED, pp. 258-265, Amsterdam (2005).
9. Nicaud, J.-F., Chaachoua, H., Bittar, M.: Automatic Calculation of Students’ Conceptions in Elementary Algebra from Aplusix Log Files. In ITS, pp. 433-442, Jhongli (2006).
10. Artigue, M., Grugeon, B., Lenfant, A.: Teaching and Learning Algebra: approaching complexity through complementary perspectives. In ICMI, pp. 21-32, Melbourne (2001).
11. Sésamath, http://www.sesamath.net/. (05/28/2011)
12. Delozanne, É., Le Calvez, F., Merceron, A., Labat, J.-M.: A Structured set of Design Patterns for Learners’ Assessment. JILR. 18, 309-333 (2007).