Evaluation of a Constraint-Based Error Diagnosis System for Logic Programming

Nguyen-Thinh Le
Department of Informatics, University of Hamburg, Germany
[email protected]

Abstract. We applied the constraint-based approach to develop a web-based diagnosis system for Prolog. In this paper, we present the results of our formative evaluation, which reflects the current effectiveness of the system. We gathered 261 log files, created by 99 users, containing records of interactions with the tutoring system. In addition, we describe the common problems of Prolog novice programmers as well as the strengths and limitations of our system.

Keywords: Formative evaluation, Constraint-based error diagnosis, Logic programming.

Introduction

Error diagnosis is one of the essential components of Intelligent Tutoring Systems (ITS), because understanding the current difficulties of a student is indispensable for providing him with guiding help. This information is necessary irrespective of how the feedback is going to be presented. Thus, error diagnosis and feedback generation can and should be separated to a certain degree. Numerous approaches to error diagnosis in programming languages have been proposed. Many of them turned out to be difficult to apply, do not provide enough diagnostic information, or are restricted to a particular pedagogic strategy. The constraint-based approach was introduced by Ohlsson [1] and has been shown to be successful for building Intelligent Tutoring Systems, for instance in the domain of SQL [2]. Since a logic programming language can be understood as the relational calculus enriched with recursion and function symbols, the question naturally arises whether constraint-based diagnosis techniques can also be used in the more general case of a declarative programming language.

To investigate this issue, we developed a diagnosis component and integrated it into a web-based tutoring system [3]. This system is intended to help first-year university students overcome difficulties while doing their homework assignments in programming with Prolog. Students are provided with a database of exercises, from which they can select the exercises for which they need help. The system attempts to diagnose their solution and returns feedback which indicates possibilities for remedy. Thus, students can improve their solutions successively.

A preliminary evaluation has been carried out at the University of Hamburg. System use was not mandatory, but recommended in case a student wanted additional help. Deliberately, we did not use an authentication mechanism. Hence, students could log in under different names. No reliable user identification was possible; this was not even desired, in order to respect the privacy of the students and to encourage group work.

In the next section, we briefly introduce our diagnosis system. In Section 2, we describe the formative evaluation method which we applied to optimise our system. Thereafter, we present the evaluation results obtained by analysing the log data. In the last section, we discuss the weaknesses and strengths of our evaluation method and summarize our suggestions for improving the system.

1. Description of the error diagnosis system

When a human tutor corrects a student's solution, he first observes the structure of the solution and tries to guess what kind of approach the student is following. We refer to a particular way to solve a programming problem as a program pattern. A pattern imposes several conditions that must be fulfilled in order to ensure the semantic correctness of the program code. By examining these conditions, the tutor can look deeper into the student's solution and search for possible misconceptions. If a condition is not fulfilled, the tutor marks the erroneous position and writes his feedback beside the error. Based on this scenario, we have developed our constraint-based error diagnosis approach. First, it tries to "guess" the pattern the student is following, based on a generalized description of the corresponding program structure. Thereafter, it identifies possible errors by evaluating the conditions associated with that pattern.

In Prolog, a pattern is a generalization of a class of programs. Such a class of programs shares a common underlying structure and embodies the same general programming techniques. The following predicate definitions member/2 and nested_list/1, for example, apply the so-called pattern "test-for-existence" [3]:

member/2:
member(H, [H|T]).
member(P, [H|T]) :- member(P, T).

nested_list/1:
nested_list([H|T]) :- islist(H).
nested_list([H|T]) :- nested_list(T).

The pattern "test-for-existence" determines that some collection of objects has at least one object with a specified property, e.g. a list of terms has at least one term which is also a list. We generalize the structures of the two predicate definitions above and specify a structure for this pattern as follows:

pred(<Args>, [V2|V3]) :- subgoal(<Args1>).
pred(<Args>, [V5|V6]) :- pred(<Args2>, V8).

In this representation, the expression <Args> stands for a specific number of arguments, and subgoal(<Args1>) is replaced by a task-dependent subgoal. This generalized structure for "test-for-existence" is accompanied by two programming techniques which establish the semantics of the pattern. First, as the first n arguments are used to determine that a single input element has the desired property, they must be used according to the so-called "same" technique. It requires the arguments in <Args> and <Args2> to be co-referenced, i.e. to share the same value. The second technique, applied to the last argument of the second clause, is the "list head" technique, which requires that V6 and V8 have the same value [5]. Techniques capture semantic relationships between variables within a clause. As such, they say something about the computation being undertaken rather than simply providing a syntactic pattern.

We apply the constraint-based approach to model programming techniques. A constraint consists of two parts: a relevance part and a satisfaction part [1]. The first part identifies the structural elements for which a constraint is relevant. The latter examines whether these elements satisfy the conditions of the constraint. For instance, the statement "if the solution follows the pattern 'test-for-existence', then a 'same' technique must be applied to the arguments which represent the property" can be described by a constraint. The "if" phrase corresponds to the relevance part and the "then" phrase to the satisfaction part of the constraint. Hence, constraints express the semantic requirements of techniques.
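Such a constraint could, for instance, be encoded as a Prolog term with an explicit relevance and satisfaction part. The following is a minimal sketch of this idea; the predicate and argument names are our own illustration, not the system's actual encoding:

% Hypothetical encoding of the constraint described above.
% Relevance part:    the solution follows the "test-for-existence" pattern.
% Satisfaction part: the "same" technique holds on the property arguments.
constraint(same_on_property_args,
           relevance(follows_pattern(test_for_existence)),
           satisfaction(applies_technique(same, property_args))).

A constraint analyser would check the satisfaction part only for those solutions for which the relevance part holds.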

The first step of our diagnosis is to identify the pattern. When a student's solution has been submitted for evaluation, a list of appropriate patterns is created. Beginning with the first pattern, a reference structure is instantiated based on the selected pattern. The student's solution and the reference structure are normalized and then matched. This process carries out a heuristic search to map clause to clause, head to head, subgoal to subgoal, argument to argument, and operator to operator of the two structures. Unmatched structural elements incur a cost. The structure of the pattern which can be matched against the student's solution with the least cost is taken as the most plausible hypothesis for the pattern the student followed. After the matching process is finished, the constraint analyser evaluates the pattern-specific and the task-specific constraints. If a constraint is violated, the constraint analyser forwards error information to the Feedback Generator.
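To give a flavour of cost-based matching on the level of subgoal lists alone, consider the following deliberately simplified sketch. It is our own illustration, not the system's actual matcher, which also aligns clauses, heads, arguments and operators:

% Align the student's subgoal list with a pattern's subgoal list;
% every superfluous or missing subgoal incurs a cost of 1.
align_cost([], [], 0).
align_cost([G|Gs], [G|Ps], C) :-            % matched subgoal, no cost
    align_cost(Gs, Ps, C).
align_cost([_|Gs], Ps, C) :-                % superfluous subgoal
    align_cost(Gs, Ps, C0), C is C0 + 1.
align_cost(Gs, [_|Ps], C) :-                % missing subgoal
    align_cost(Gs, Ps, C0), C is C0 + 1.

% The least-cost alignment rates the plausibility of a pattern.
pattern_cost(StudentGoals, PatternGoals, Cost) :-
    aggregate_all(min(C), align_cost(StudentGoals, PatternGoals, C), Cost).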

2. Evaluation method

We conducted an evaluation of the system 1) to find out which difficulties most Prolog novice programmers have, 2) to determine the efficacy of the system, and 3) to detect its deficiencies. To reach these goals, we applied the formative evaluation method [6], which uses log files to analyse the effectiveness of the system. This kind of evaluation is usually applied in the development phase of the learner-integrated quality management model [7]. We integrated into our system a logging component which records every interaction of the users with the system. First of all, users have to enter an identity name. This name is used only for evaluation purposes and has no security relevance. That means that the same user may use several different identity names, which results in several log files under different user names. Our log files contain the following data: Who has solved which task? What kind of solution did the user submit for evaluation? What kind of errors did the system detect? What kind of hints did the user make use of? From these data, we can infer the following information: Which errors were fixed? Did the errors occur again? Did the user reach a correct solution?

We gave first-semester students four exercise assignments, for which they could request help from the system:

Task 1: Please write a predicate which examines whether List2 begins with List1: prefix(List1, List2). You may use the built-in predicate append/3.
Task 2: Please write a predicate which converts Peano numbers (s(s(...(0)))) into integer numbers: peano_int(Peano, Number).
Task 3: Please write a predicate to examine whether a Peano number is even: peano_even(Peano).
Task 4: Please write a predicate which computes the end amount of an investment with compound interest, for a given amount, an interest rate (e.g. 0.03) and a duration in years: interest(Amount, Rate, Duration, EndAmount).
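For illustration, the four tasks admit, among others, the following correct solutions. These samples are our own and are not taken from the evaluation data; the system also accepts other strategies:

% Task 1: List2 begins with List1.
prefix(List1, List2) :- append(List1, _, List2).

% Task 2: convert a Peano number to an integer.
peano_int(0, 0).
peano_int(s(P), N) :- peano_int(P, M), N is M + 1.

% Task 3: test whether a Peano number is even.
peano_even(0).
peano_even(s(s(P))) :- peano_even(P).

% Task 4: compound interest, formulated recursively.
interest(Amount, _Rate, 0, Amount).
interest(Amount, Rate, Duration, EndAmount) :-
    Duration > 0,
    Duration1 is Duration - 1,
    Amount1 is Amount * (1 + Rate),
    interest(Amount1, Rate, Duration1, EndAmount).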

We collected 261 log files created by 99 users in total. Each log file contains the records of the interactions of one user working on one task on one day. That means that even if a user engages with the same task several times on one day, there is only one log file. In order to reach the goals mentioned above, we analysed the log files from three different perspectives: task-oriented, remedy-oriented and error-oriented analysis.

3. Evaluation results

3.1 Task-oriented analysis

Table 1: Number of false and correct trials for each task

Task | trials/user | trials for a correct solution | task solved | task not solved
  1  |    6.07     |             4.33              |     11      |        7
  2  |    6.21     |             6.54              |     22      |       23
  3  |    5.72     |             6.83              |     27      |       17
  4  |    6.21     |            74.5               |      1      |       24

The first goal of the evaluation is to find out the difficulties most users encountered. This can be done by analysing the log files from the task-oriented perspective. Table 1 shows the problem-solving results for each task. The 2nd and 3rd columns show how many trials a user carried out on average, and how many trials he needed to reach a correct solution. The last two columns tell us how many log files contained correct solutions and how many did not. As Table 1 shows, most users could not solve Task 4. This was due to the fact that the majority of users followed a pattern not covered by our system: instead of trying a recursive approach, they tended to use the closed formula sum = Amount*(1+Interest)^Duration. In addition, many users were not able to derive a correct formula for the computation of compound interest.

Table 2: Which errors did most users make while solving each task?

Task | Syntax errors (8.24%) | Match errors (73.53%)        | Constraint errors (18.24%)
  1  |        4.07%          | missing_argument (13.82%)    | unmatched_check_value (6.5%)
  2  |        5.61%          | missing_subgoal (20.05%)     | wrong_arithmetic (8.56%)
  3  |       18.13%          | superfluous_subgoal (19.55%) | unmatched_baseconstructor (3.97%)
  4  |        2.21%          | superfluous_subgoal (17.61%) | wrong_arithmetic (7.23%)

From Table 2 we can see that most errors were detected by the pattern identification process. missing_argument and missing_subgoal indicate that the user's solution does not contain an argument or a subgoal as expected. superfluous_subgoal errors occur if a user's solution contains an extra subgoal which is not required. By investigating the log files, we found that most users had the following problems:

• Many users did not really know what a Peano number is, being familiar neither with its definition nor with its data structure. Some of them simply used "Peano" or peano(X) as an argument and expected it to be a Peano number.

• The arithmetic evaluation mechanism in Prolog also poses a considerable problem for many users. Some of them placed an arithmetic expression at an argument position, expecting a functional evaluation. Others used "=" instead of "is" for arithmetic evaluation, as is common in mathematical notation (see the example after this list).

• Users called auxiliary predicates without defining them, in the hope that they are built-in predicates. Or they used arbitrary material at an argument or subgoal position and expected the system to be able to provide helpful hints.
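The difference between the two operators is easy to demonstrate; the following transcript is our own illustration:

?- X = 3 + 4.
X = 3+4.          % "=" merely unifies X with the unevaluated term 3+4

?- X is 3 + 4.
X = 7.            % "is" evaluates the arithmetic expression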

Constraint errors are detected during the constraint evaluation process. As the frequency of each individual constraint error type is not considerably high, we chose wrong_arithmetic, which had the highest frequency (4.65%), and investigated the corresponding users' solutions. Most errors of this type were detected in users' solutions of Task 2. We noted that users had the following problems:

• Many users applied arithmetic expressions without making sure that the arguments were sufficiently instantiated. Moreover, they sometimes transposed the positions of the operands and the result argument, or used the operands incorrectly.

• Instead of decomposing an input argument, many novice programmers composed it in recursive subgoals. Or they decomposed an input argument and processed it, but then did not know how to return the result of the processed input value. This indicates that Prolog novices are not familiar with composition and decomposition.

3.2 Remedy-oriented analysis

The remedy-oriented analysis is used to determine whether the system helped users. Our system provides different levels of feedback. If the system detects an error in the user's solution, it first notifies the user that the solution is not correct. On request, it shows the problem location, gives an explanation and, finally, provides suggestions to remove the error. We evaluate the efficacy of the system by determining whether errors disappeared after users had seen the error location or a remedial hint. A remedial hint includes an error explanation and correction proposals. Only in 40.2% of the total error occurrences did users request the error location or a remedial hint. This is due to the fact that users usually inspected only the first error of an error list before they submitted a new solution for evaluation. Users requested remedial hints in 60.5% and the error location in 39.5% of the total requests. In 68.3% of the cases in which users saw the error location without requesting a remedial hint, they were able to remove the error. In 75.8% of the cases after requesting remedial hints, the error was eliminated. As expected, the efficacy of remedial hints is higher than that of the error location, because remedial hints give more information. It is also interesting to note that in 11.3% of the cases an already fixed error occurred again.

3.3 Error-oriented analysis

Among the evaluation goals, we are particularly interested in detecting deficiencies of the system. Therefore, we analysed the log files from an error-oriented perspective. First, we focus on the errors which are identified during the pattern identification process. Table 3 shows the three error types with the highest frequency. The efficacy of remedial hints for these error types is presented in the third column, where the values for missing_subgoal and missing_recursivecase are lower than the average (75.8%). A high quotient of error frequency and remedial hint efficacy indicates that the error was caused by a false diagnosis or that the remedial hint was not helpful. The last column provides information about the rate of errors which were produced by a false diagnosis.

Table 3: Error types most frequently identified during the pattern identification process

error type            | occurrence frequency | remedial hints helped? | frequency / efficacy | false diagnosis
superfluous_subgoal   |        12.92%        |         76.92%         |         0.17         |      3.27%
missing_subgoal       |        11.74%        |         46.81%         |         0.25         |     24.14%
missing_recursivecase |         9.71%        |         72.22%         |         0.13         |      7.83%

We selected the users' solutions in the log files which contained the three error types of Table 3 and detected the following deficiencies of our system:

• Currently, the pattern database of our system does not cover all possible solution strategies. For instance, it lacks a pattern using an accumulator for Task 2 and Task 4.

• In some cases, the matching mechanism did not hypothesize the correct pattern. Either we did not have enough matching rules for data structures such as Peano numbers, or the heuristic costs for unmatched structures caused a false diagnosis.

• Another shortcoming of our system is its limited transformation capability. Many users did not cope with the arithmetic evaluation in Prolog: instead of applying "is", they used "=". Our transformation component interprets "=" as a unification and replaces the variable on the left-hand side of the operator by the variable on the right-hand side. In addition, the transformation component is not able to transform a clause like peanoeven(Peano) :- Peano is 0. into a canonical form (see the example at the end of this subsection).

In order to determine the shortcomings of our system, we also investigated the remedial hints and errors which were detected by evaluating constraints. The efficacy of remedial hints and of the error location for constraint errors is 94.2% and 85.2%, respectively. As the constraints are specified by human tutors, the efficacy of remedial hints depends on the expertise of the exercise author and the wording of his feedback.
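To make the transformation shortcoming concrete: in the clause cited above, "is" merely binds a constant, so a canonicalizing transformation would presumably fold the constant into the clause head. The rewriting below is our own illustration of the intended normalization:

% submitted by the student:
peanoeven(Peano) :- Peano is 0.

% an equivalent canonical form the transformation component cannot yet derive:
peanoeven(0).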

4. Discussion and Future Directions

Our evaluation method has several limitations. System use was not monitored, so we do not know the intentions of our users. The system has been designed as an additional development tool, not as a means to examine students. Thus, we are not able to determine whether a student failed to obtain a correct solution or simply obtained one without submitting it. For quite similar reasons, we tried to avoid any strong authentication mechanism. Of course, there are other, more precise evaluation methods. They are, however, much more expensive (like the individual monitoring and questioning of students) or would even require a substantial change of the existing learning culture, which favours self-determined problem solving over external control. Thus, we deliberately content ourselves with the limitations of the log file analysis described above. In general, it only yields pessimistic estimates of user acceptance and of the utility of feedback messages.

Our evaluation results have shown the typical problems of Prolog novices. As they are familiar with mathematical notation, they tend to apply "=" instead of "is" for arithmetic evaluation. They do not have a deep understanding of the notion of unification, so they use arithmetic expressions inside an argument. Moreover, humans tend to reason forwards rather than back to the past. A recursive definition requires a recursive case that counts back to the base case, which stands for the past. Recursive computation therefore represents a great difficulty for Prolog newcomers.

From our evaluation, we identified the weakness of our system at the pattern identification step. In contrast, the constraint evaluation process has proven highly effective. With a set of five constraint types [3], the exercise author can easily define constraints and specify remedial hints. In order to improve the system, the pattern database as well as the collection of exercises needs to be extended. To better support the reuse of definitions, we are currently working on organizing the knowledge base as a hierarchy.

References

[1] S. Ohlsson. Constraint-based student modelling. In J. E. Greer, G. I. McCalla (eds.), Student Modelling: The Key to Individualized Knowledge-based Instruction, 167-189. Berlin, 1994.
[2] A. Mitrovic, M. Mayo, P. Suraweera, and B. Martin. Constraint-based tutors: a success story. In L. Monostori, J. Vancza (eds.), Proc. of the 14th Int. Conf. on Industrial Eng. App. of AI and Expert Systems, 931-940, Budapest, 2001.
[3] N. T. Le, W. Menzel. Constraint-based Error Diagnosis in Logic Programming. Proc. of the 13th Int. Conference on Computers in Education, 2005.
[4] P. Brna. Prolog Programming: A First Course. 2001.
[5] A. Bowles, P. Brna. Introductory Prolog: A suitable selection of programming techniques. In P. Brna, B. D. Boulay, H. Pain (eds.), Learning to Build and Comprehend Complex Information Structures: Prolog as a Case Study, 167-177. Ablex Publishing Corp., Stamford, Connecticut, 1999.
[6] S. A. Nan. Formative evaluation. http://www.beyondintractability.org/m/formative_evaluation.jsp
[7] U. Ehlers. Quality of E-Learning (in German). In MedienPädagogik, 2002. http://www.medienpaed.com/021/ehlers1.pdf