Automatic Grading of Graphical User Interface Programs Exploiting Jemmy

Sami Surakka
Helsinki University of Technology
P.O. Box 5400, FI-02015 HUT, Finland
+358 9 451 3316
[email protected]

Janne Auvinen
Helsinki University of Technology

Petri Ihantola
Helsinki University of Technology
P.O. Box 5400, FI-02015 HUT, Finland
+358 9 451 3294
[email protected]

ABSTRACT

A new approach to the automatic grading of graphical user interface (GUI) exercises is presented. As an example of this approach, a small set of exercises and a grading system intended for basic programming courses that use Java are presented. The system is implemented by using (a) the open source Jemmy GUI testing module and (b) the Ceilidh automatic grading program. The use of existing systems was an important goal: only 1,250 lines of code were programmed. The system was used in teaching in 2004-2005, and this use is evaluated using various methods and data sources; for example, the results of automatic grading were compared with the grading of a teaching assistant. During the literature search, only two other systems for the automatic grading of GUI exercises were found. Some differences between these three systems are considered.

Keywords

Automatic assessment, automatic grading, graphical user interface, GUI, introductory course in programming, Java

1. INTRODUCTION

The present paper can be classified as a resource paper or as a system paper. We provide a new approach to implementing the automatic grading of graphical user interface (GUI) programs and an example implementation of this approach. The name of the system is Gui G. At the beginning of the project we were simply curious whether the automatic grading of GUI programs would even be technically possible. Now we know it is possible. Programming exercises with text-based user interfaces have been graded automatically since at least the year 1961 [6], but we found only two earlier publications about the automatic grading of exercises with GUIs. This is an interesting observation because the automated testing of GUIs is a commonly studied subject (see e.g., [9; 12]). The related work is presented later, in Section 2. The GUI assignment and the automatic grading system presented in the present paper are intended for courses that use Java. Java is a common language for introductory programming courses, at least in accredited programs in the USA. McCauley and Manaris [14] reported that 56% of the departments expected to use Java as their first language for the academic year 2002-2003.

1.1 Description of Case Course

The automatic grading system was used on the basic course in programming (level CS2) during the springs of 2004 and 2005 at the Helsinki University of Technology in Finland. 330-340 students enrolled for the course per year. The course was compulsory for most of the students. The extent of the course was approximately 3.3 American semester credits (four Finnish credits, which is equivalent to 160 working hours). The programming language used on the course was Java. The Basic Course in Programming 1 was a prerequisite for this course; the prerequisite course also used Java. The course had four compulsory assignments and each assignment was intended to take 10-20 hours. Graphical user interfaces were the topic of the second compulsory assignment. In addition, the course included a compulsory examination, approximately 25 hours of voluntary lectures, and three voluntary assignments. In the spring of 2005, the course book was Jia's Object-Oriented Software Development Using Java [11]. In addition, some compendia were distributed to the students.

1.2 Origins of Project

At our university, most basic programming courses have several hundred students and automatic grading has been widely used for the past ten years. The automatic grading systems have been developed as medium-budget research projects that last several years (e.g., TRAKLA2 [13]), as low-budget projects that probably last 1-2 years (e.g., Scheme-robo [17]), or as part of teaching assistants' work (e.g., the reprogramming of Ceilidh). In addition, our laboratory has used or purchased systems developed outside the laboratory (e.g., Ceilidh [7] and Goblin [8]). The development of Gui G was a low-budget project.

On the case course in the spring of 2002, the GUI assignment was problematic; the workload and the standard deviation of hours used were greater than in the other compulsory assignments. Therefore we decided to divide the single GUI assignment into a set of smaller exercises. Thus, one (pedagogical) principle behind the current assignment was "divide and conquer" or "divide and manage." The second (pedagogical) decision or principle was to separate layouts from event handling and to begin with basic components and layouts; that is, during the first exercise, students have to learn how to use basic components, layouts, and panels without implementing any event handling. The same separation was also used during the lectures. A third decision was to provide part of the whole program to the students so that they could concentrate on implementing the GUI. This can also be classified as a small-scale example of two-tier architecture in which the students had to implement only the GUI while the application logic was provided.

These smaller exercises were used during the spring of 2003 but the grading was still manual. On the basis of student feedback, the effect was as planned: the mean and the standard deviation of reported hours decreased. During the spring of 2003, it was realized that the exercises were now small and simple enough that it might be possible to automate their grading. In the summer of 2003, the project of automating the grading was started.
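The separation of layouts from event handling can be illustrated with a small sketch. The following listing is not taken from the course material; it is merely a hypothetical example of the kind of layout-only code expected before any event handling is introduced (the component names are invented for the example):

import java.awt.BorderLayout;
import java.awt.GridLayout;
import javax.swing.*;

/** Hypothetical layout-only window: components, panels, and layouts, but no listeners. */
public class LayoutOnlyExample {
    public static void main(String[] args) {
        JFrame frame = new JFrame("Layout Example");
        // A panel of buttons laid out in a single column.
        JPanel buttonPanel = new JPanel(new GridLayout(0, 1, 0, 5));
        buttonPanel.add(new JButton("OK"));
        buttonPanel.add(new JButton("Exit"));
        // A text area in the middle, the buttons on the right.
        frame.getContentPane().add(new JScrollPane(new JTextArea(10, 30)), BorderLayout.CENTER);
        frame.getContentPane().add(buttonPanel, BorderLayout.EAST);
        frame.pack();
        frame.setVisible(true);
    }
}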

1.3 Project Goals and Some Decisions Made During Project

In order to reduce the amount of work for us, one goal was to implement the system for the automatic grading of GUI exercises by using existing systems as much as possible. First, we selected a GUI testing framework. Second, we programmed the testers based on this framework. Finally, the GUI test programs were combined with the system that took care of evaluating the test results and the administration of submissions.

In June 2003, approximately 50 different GUI testing programs and frameworks were listed on the web page [21]. We decided to choose one that (a) was free for noncommercial use, (b) was portable to Unix platforms, and (c) used Java as the programming language for defining tests. For example, some programs had their own scripting language for implementing tests. We decided to use Java because, in the near future, part-time student assistants who can at least program in Java might maintain the tests. In June 2003, we found only three programs, Abbot [1], Jemmy [16], and jfcUnit [10], that fulfilled these requirements. We compared these three programs in more detail. jfcUnit was ruled out because it did not provide all the features we were looking for. We chose Jemmy because it had a wider selection of features relevant to our needs and probably a stronger organization behind it than Abbot had. We also considered using both Abbot and Jemmy, because some of their features complemented each other. However, we decided to use only one testing program in order to keep things simple.

The implemented system for automatic GUI grading was named Gui G. Seriously, the name is an acronym of "GUI Grader." Humorously, the stage name "Ali G" used by the British comedian Sacha Baron Cohen, and other names of rappers, had some influence when the name was selected.

1.4 Structure of Paper

Related work is presented in Section 2. Section 3 indicates how other lecturers can use our resources. The system and its use on our case course are evaluated in Section 4. Finally, in Section 5, the automatic grading of GUI programs is discussed and further work is considered.

2. RELATED WORK

We found several publications about the automatic testing of GUIs, but these publications were about automatic testing per se, not about how automatic testing has been applied in education. For instance, Fewster and Graham's [5] book on software test automation has a chapter about GUI testing [9]. Recently, Li and Wu [12] have also published a book about the same topic. On the web page [21] over 50 different GUI testing programs are listed. As mentioned previously, we found only two publications [3; 19] that are directly relevant to our work.

In GUI testing, two main technical approaches are capture/replay tools and script/program coding. In the first one, a user interacts with the system under test or with a reference implementation and all the interaction is recorded. When the test is repeated, the user interactions are replayed and the behavior of the system under test is evaluated against what has been recorded. However, according to Li and Wu [12, p. 21], in practice, testing with such tools also requires additional tedious manual work; that is, programming. The latter approach is like traditional automatic testing: all the interaction is programmed and the tester performing the interaction also gives the test result. Programming of testers also gives the freedom to concentrate on aspects other than pure functionality (e.g., the object hierarchy in the student solution), which we believe is sometimes useful for educational purposes. The approaches can also be combined when a capture/replay tool is used to implement an initial version of a hand-coded tester.

2.1 JEWL

English's paper [3] presents JEWL, which is an abbreviation for "John English's Window Library." It is a set of Java packages aimed at complete novices. He also provides a version that is intended for automatic grading. JEWL is available on the web under the GNU General Public License [4]. According to English [3, p. 140], automatic grading was used for the first time during the spring of 2004. Thus, the period during which English's system was used in education was the same as in our project. The work makes two main contributions: (1) it simplifies GUI programming to a level that is more suitable for novices, and (2) it also provides an automatic grading system for GUI programs. We consider these two issues to be separate because simplifying GUI programming is not necessary for automatic grading, or vice versa. English's approach is more suitable than our approach if a lecturer wants to use GUIs from the very beginning of the first programming course. Our approach is more suitable if a lecturer prefers to provide a standard application programming interface (API) for students, because JEWL provides a non-standard API.

2.2 AJGUITE

The work by Sun and Jones [19] is technically closer to ours; their system is also built on Jemmy. The abbreviation of their system is AJGUITE, from "automated JAVA GUI testing environment." AJGUITE has been used since the spring of 2003 but is not freely available. Their goal was to automate the coding of test programs using a specification-driven approach, whereas we programmed the necessary test programs by hand. In their approach, a lecturer has to write an XML-based test specification that is automatically converted into a Java test program. In other words, in their approach the level of abstraction is higher than in our approach. Sun and Jones' work seems to be targeted at the automated testing of GUI programs in general, not educational settings in particular. However, as a case study, they tested 66 GUI exercises automatically in order to show that their approach works. Some differences between AJGUITE and Gui G are: (a) AJGUITE is not capable of producing immediate feedback, (b) AJGUITE cannot handle course management issues such as multiple submissions, and (c) AJGUITE has the functionality to test basic user actions such as pushing a button and writing text in a text field but cannot handle multiple frames [19, p. 144] or test layout issues. These aspects are present in Gui G.

3. HOW CAN OTHER LECTURERS USE THESE RESOURCES?

This section shows how other lecturers can benefit from our resources. The GUI assignment, as HTML files, and the test programs, as Java source code, are freely available at [20] under the GNU Lesser General Public License and the GNU Free Documentation License. Besides these parts, one has to download Jemmy from the web page [16]. In addition, one must have a course management program; that is, a program that takes care of students' submissions, of compiling and running them against the test programs, and of announcing the test results. This management program also counts each student's points and keeps records of how many times a student has submitted each exercise. On the case course, the name of the management program was Ceilidh. Apparently many institutions are using text-based programs like Ceilidh for automatic grading. Lecturers can ask for the source code of the model solutions via e-mail from the first author of the present paper. The course management program Ceilidh used on the case course is not distributed because of copyright issues. The resources are described in more detail in the following two subsections.

3.1 Description of GUI Assignment

We have used the following assignment to demonstrate our approach to the automatic grading of GUI exercises. During the assignment, a student had to implement a GUI for a small stock control application. The assignment was divided into four exercises. In the first exercise, the students had to implement the main window without event handling. Figure 1 presents an example of the main window of the program that students had to implement. In the second exercise, event handling was implemented. For example, if the Exit button was pressed, the text "You pressed the Exit button" had to be printed to System.out. The third exercise was about opening a new window and changing the information pertaining to a single product. In the fourth exercise, menus were implemented. In addition, the user had to confirm that he or she wanted to close the application when the Exit button was pressed. The detailed specification of the assignment can be found on the web page [20].
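As an illustration of the second exercise, the following sketch shows the kind of event handling that was expected. It is not the model solution; the class name and the method are only assumptions made for this example:

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JButton;

/** Hypothetical excerpt: printing a message when the Exit button is pressed. */
public class ExitButtonExample {
    static JButton createExitButton() {
        JButton exitButton = new JButton("Exit");
        exitButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                // In the second exercise the action only had to be reported on System.out.
                System.out.println("You pressed the Exit button");
            }
        });
        return exitButton;
    }
}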

Figure 1. Main window of the program which students had to implement.

The topics of the different exercises and the maximum points for them are presented in Table 1. We graded the assignment on a fail/pass scale. The maximum number of points for the whole assignment was 110 and a student had to get at least 65 points to pass. The limit (i.e., 59% of the maximum) was lower than normal because the tests were being used for the first or second time and we expected some bugs or other problems. This was decided before the assignment was even opened. If a lecturer wants to adopt the assignment, he or she can obviously use a different limit or different numbers of points for the exercises.

Table 1. The four exercises of the GUI assignment.

Number  Topics                                            Points
1       Different layouts, panels, and basic components   30
2       Event handling                                     20
3       Handling of several windows, dialog windows        30
4       Menus and confirmation dialogs                     30

The grading criterion was approximately 60 simple test cases. For each test case, a student got 1 or 2 points if the test was passed. For brevity, only two examples are given: one test case tested whether the button "Handle Order..." existed and another that the main window contained exactly one text area. Some details of the test cases can be found in the commented source files that are available at the web page [20]. In addition, the test cases of the first exercise are listed later, in Figure 5.
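A structural test case of the kind mentioned above can be sketched as follows. This is a simplified illustration rather than the actual tester code (the real tests work through Jemmy and Ceilidh); it walks the Swing component tree of the student's main window and checks that exactly one text area exists:

import java.awt.Component;
import java.awt.Container;
import javax.swing.JTextArea;

/** Hypothetical structural check: count the text areas inside a container. */
public class TextAreaCountCheck {
    static int countTextAreas(Container container) {
        int count = 0;
        for (Component child : container.getComponents()) {
            if (child instanceof JTextArea) {
                count++;
            }
            if (child instanceof Container) {
                count += countTextAreas((Container) child);  // descend into panels, scroll panes, etc.
            }
        }
        return count;
    }

    static void check(Container mainWindowContentPane) {
        if (countTextAreas(mainWindowContentPane) == 1) {
            System.out.println("Exactly one text area exists ... OK");
        } else {
            System.out.println("Exactly one text area exists ... FAILED");
        }
    }
}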

3.2 Description of Automatic Grading System

The Gui G automatic grading system includes three parts: Ceilidh, Jemmy, and the test programs. During the project, Jemmy was installed, the test cases were programmed, and the corresponding configuration files for Ceilidh were written. Ceilidh was already in use on the case course before the project was started. These parts of Gui G are presented in the following subsections. The relations between the different subsystems are illustrated in Figure 2.

[Figure 2 is a block diagram: the Gui G system consists of Ceilidh, Jemmy, and the test programs, which exercise the student's programs under test.]

Figure 2. Subsystems of Gui G automatic grading system.

3.2.1 Jemmy and Test Programs

Jemmy [16] is a free GUI testing framework for Java. It supports most of the AWT/Swing components and their operations. Here we will provide a short introduction to the framework. See, for example, Skrivanek and Sotona's text [18] for a more complete but still compact overview of Jemmy. The philosophy behind Jemmy is to provide proxies for different GUI components. Through these proxies a tester program can perform all the same interactions as a human tester could. If creating such a proxy required a reference to the original GUI component (e.g., a button or a menu item), it would not support the creation of independent testers. Thus, Jemmy provides a way to search for components in a given container by a title and/or component type. For example, a proxy for a window object existing on the same virtual machine as Jemmy can be created just by giving the title of that window. This approach is also called lookup constructors [18]. There are several lookup constructors for each proxy class. The example tester in Figure 3 demonstrates how lookup constructors work.

When proxies are created, different strategies are sometimes needed in order to distinguish otherwise identical components from each other. In Gui G, for example, we used the coordinates of text fields to recognize the different input fields. After that we could give a correct input into a correct field. Coordinates are also useful when the layout of a GUI is graded; for example, all buttons should be in line and there should be a small gap between the buttons. To get a better understanding of testers written with Jemmy, see the web page [20], where the source code for all the testers can be downloaded.

import org.netbeans.jemmy.*;
import org.netbeans.jemmy.explorer.*;
import org.netbeans.jemmy.operators.*;
import java.awt.*;
import javax.swing.*;
import java.util.*;

/**
 * This program tests if the main window can be
 * opened and the button 'Handle Order...' can be
 * pressed.
 */
public class TestExample implements Scenario {

    /**
     * Method pressButton tries to press the
     * button 'Handle Order...'.
     */
    public void pressButton(JFrameOperator mainWindow) {
        try {
            JButtonOperator handleOrderButton =
                new JButtonOperator(mainWindow, "Handle Order...");
            handleOrderButton.doClick();
            System.out.println("Button 'Handle Order...' was pressed.");
        } catch (Exception e) {
            System.out.println("Button 'Handle Order...' was not pressed.");
        }
    }

    /**
     * Method runIt is compulsory when the class
     * Scenario is implemented. The student
     * program is started and actual testing will
     * begin.
     */
    public int runIt(Object param) {
        try {
            ClassReference applicationReference =
                new ClassReference("StockControlApplication");
            applicationReference.startApplication();
            JFrameOperator mainWindow =
                new JFrameOperator("Stock Control Application v0.0.1");
            pressButton(mainWindow);
        } catch (Exception e) {
            System.out.println("The main window was not opened.");
            return(1);
        }
        System.out.println("The main window was opened.");
        return(0);
    }

    public static void main(String[] args) {
        String[] params = {"TestExample"};
        org.netbeans.jemmy.Test.main(params);
    }
}

Figure 3. Example of test program.
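The coordinate-based layout checks mentioned above can be outlined as follows. This is a simplified sketch, not the actual tester code; it reads the coordinates of the underlying Swing buttons (reached through the operators' getSource() method) and assumes, for the sake of the example, that "in line" means the buttons are stacked in one column and therefore share the same x coordinate within their parent:

import java.awt.Component;
import org.netbeans.jemmy.operators.JButtonOperator;
import org.netbeans.jemmy.operators.JFrameOperator;

/** Hypothetical layout check: are all buttons aligned in one column? */
public class ButtonAlignmentCheck {
    static void checkButtonsInLine(JFrameOperator mainWindow, String[] buttonTitles) {
        boolean inLine = true;
        int firstX = Integer.MIN_VALUE;
        for (String title : buttonTitles) {
            // Look up the proxy by title, then read the real component's coordinates.
            JButtonOperator operator = new JButtonOperator(mainWindow, title);
            Component button = operator.getSource();
            if (firstX == Integer.MIN_VALUE) {
                firstX = button.getX();
            } else if (button.getX() != firstX) {
                inLine = false;  // this button is not aligned with the first one
            }
        }
        System.out.println("Layout of the buttons (buttons are in line) "
                           + (inLine ? "OK" : "FAILED"));
    }
}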

3.2.2 Ceilidh

Ceilidh [7] is an online grading system used to evaluate test results and keep records of students' submissions and points. In this article we will use the name Ceilidh, even though we are not using the original Ceilidh any more but an ad hoc Perl program that does roughly the same. Our Ceilidh-like system was implemented several years ago and copyright issues prevent us from publishing it. However, many institutions use similar grading systems. To set up our automatic GUI grading system, Ceilidh can be replaced with any similar system.

The grading in Ceilidh is based on searching the standard output with regular expressions (defined in configuration files). This is a typical approach to grading text-based programs. The input for the program is typically varied between test cases (defined in configuration files). The feedback is immediate and the number of possible resubmissions can be defined. See [7] for a general introduction to Ceilidh. Our test programs produce textual output that is evaluated by Ceilidh. For example, if the tester found the button "Exit" in the main window, it printed the text "Button title: Exit." As this text matched a certain line of Ceilidh's configuration file, the sum of points was increased by one. The configuration files for the testers can be downloaded from the web page [20].

On the case course the students used Ceilidh from their own user accounts. Ceilidh ran the test programs, from which a student's GUI was opened and tested. All this was done with the account rights of the student. This means that when submitting an exercise the student actually saw the windows popping up and the GUI being used while the tester worked. An obvious drawback is that some kind of window system and a display are needed. However, it is possible to do the testing without a display by using a virtual frame buffer (e.g., Xvfb). This also makes it possible to use our approach if the grading is done on the server side.
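The principle of awarding points by matching the testers' output can be illustrated with the following sketch. It does not use Ceilidh's actual configuration syntax, which we do not reproduce here; it merely shows, in Java, the idea of matching lines of test output against regular expressions and summing the points. The rules and the output lines below are made up for the illustration:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of regex-based grading of test output (not Ceilidh's format). */
public class OutputMatchingGrader {
    public static void main(String[] args) {
        // Assumed mapping from a regular expression to the points it is worth.
        Map<Pattern, Integer> rules = new LinkedHashMap<Pattern, Integer>();
        rules.put(Pattern.compile("Button title: Exit"), 1);
        rules.put(Pattern.compile("Button 'Handle Order\\.\\.\\.' was pressed\\."), 2);

        // In the real system the tester's standard output would be read here.
        String[] testerOutput = {
            "The main window was opened.",
            "Button title: Exit",
            "Button 'Handle Order...' was pressed."
        };

        int points = 0;
        for (String line : testerOutput) {
            for (Map.Entry<Pattern, Integer> rule : rules.entrySet()) {
                if (rule.getKey().matcher(line).find()) {
                    points += rule.getValue();
                }
            }
        }
        System.out.println("Total points granted: " + points);
    }
}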

3.3 Automatic Grading from Viewpoint of Student

A student submitted a GUI exercise in the same way as text-based exercises. Ceilidh's user interface is text-based. Figure 4 presents an example of what the user interface looked like when a student was doing the first exercise of the GUI assignment (in Ceilidh, assignments are called units).

==================================================
Ceilidh - Exercise Menu - Unit 2 Exercise 1
==================================================
vd    View exercise Description
set   Set up exercise (retrieves exercise files)
sub   Submit solution for grading
std   Show Test Data (which is used by Ceilidh for testing)
lx    List eXercises in this unit
sx #  Select eXercise (enter exercise number in place of #)
vm    View your Marks
q     Return to main menu
q!    Quit Ceilidh
==================================================
Unit 2/Exercise 1 >>

Figure 4. Ceilidh's user interface before submitting.


In Ceilidh, a maximum of ten submissions per exercise was allowed. However, the test programs and the JAR file for Jemmy were also distributed to the students. Thus, the students were able to test their programs against the test programs as many times as they wanted. This was allowed so as to support programming at home. The previous results of feedback surveys from the basic programming courses at our university indicate that most students program mainly at home if possible.

During the submission, window events were shown on the screen, but much faster than any human could test manually. Normally, testing one exercise took 1-5 seconds; that is, students got their results immediately. Figure 5 shows a test report for the first exercise when the test program could not find the button "Handle Order..." (e.g., the three dots were missing). As a consequence, the three other button-related tests also failed. This report is very similar to Ceilidh's test reports for text-based assignments.

Compiling your program, please wait...
Compiled OK

You have made 0 out of 10 submissions for this exercise.
Starting submission 1/10 now:
---[ Running test ]-------------------------------
Button "Exit" exists .......................... OK
Button "Handle Order..." exists ........... FAILED
Button "Add New Product..." exists ............ OK
Button "Change Product Info..." exists ........ OK
Button "Delete Product" exists ................ OK
There are exactly 5 buttons altogether ..... FAILED
Exactly one text area exists .................. OK
Scrollable list exists ........................ OK
Size of the window ............................ OK
Shape of the window (roughly like a square) ... OK
Layout of the buttons (buttons are in line) FAILED
Layout of the buttons (order and spacing) . FAILED
Layout in general (space between left and
right side of window) ......................... OK
Layout of window's left side (e.g., there is
only 1 scroll pane) ........................... OK
--------------------------------------------------
Total points granted: 22/30
--------------------------------------------------

Figure 5. Ceilidh's test report.

If some part of the test failed, students did not get any other feedback from Ceilidh than the word "FAILED", as in Figure 5. There was no WWW page or other documentation that explained each test in detail. Students were able to read the source code of the test programs. In some cases, Jemmy threw an exception that was not caught and, as a consequence, Jemmy terminated and an error message produced by Jemmy appeared on the screen. The output was often 100-200 lines of text, whereas a normal Ceilidh test report was only 20-25 lines, as in Figure 5. A short example of this kind of output is presented in Figure 6. In these situations, students probably had more difficulties in understanding the feedback.

---[ Running test ]-------------------------------
Your program encountered the following runtime
error during this test:
java.lang.ArrayIndexOutOfBoundsException: 0
  at Tester24.testThree(Tester24.java:321)
  at Tester24.runIt(Tester24.java:26)
  at org.netbeans.jemmy.Test.launch(Test.java:353)
  at org.netbeans.jemmy.ActionProducer.launchAction(ActionProducer.java:312)
  at org.netbeans.jemmy.ActionProducer.run(ActionProducer.java:269)
Your program outputted 33 lines. Here are few of
the last ones:
  at org.netbeans.jemmy.ActionProducer.launchAction(ActionProducer.java:312)
  at org.netbeans.jemmy.ActionProducer.run(ActionProducer.java:269)
Trace: "Test Tester24 finished" action has been
produced in 2044 milliseconds with result : 1

Figure 6. Example of Jemmy's error message.

4. RESULTS AND ANALYSIS

We evaluated Gui G using various methods and data sources: (a) system size and the use of existing systems; (b) a teaching assistant versus Gui G; (c) statistics from the submissions; (d) a feedback questionnaire to the students; and (e) an analysis of the articles in the course newsgroup. The results are presented in the following subsections. In addition, the students' behavior was observed during the laboratory sessions and feedback was collected from the course staff. However, these results are not reported here, for brevity and because they contain only a little new information compared with the other results.

4.1 System Size and Use of Existing Systems

System size and the use of existing systems are not normally used as an evaluation criterion in the area of computer-aided instruction. However, we have used this criterion because the use of existing systems to reduce the amount of work for us was an important goal. The four test programs for the GUI exercises had 1,214 lines of code (LOC) written in Java and the configuration files for Ceilidh had 34 LOC. The other parts of the system were Ceilidh and Jemmy, which existed before the project was started. The size of Ceilidh is approximately 2,500 LOC written in Perl, and the size of Jemmy is approximately 58,000 LOC written in Java. Obviously, only a (small) part of Jemmy's source code is needed to run Gui G.

4.2 Gui G versus Teaching Assistant

A sample of 56 submissions was evaluated by Gui G and by a teaching assistant. The sample included every sixth submission of the spring 2005 course. The teaching assistant graded the submissions after the course had ended and knew that the purpose was to compare his grading against Gui G. As was explained previously, the maximum number of points for the whole assignment was 110 and a student had to get at least 65 points to pass. The means were: Gui G 68.3 points and the teaching assistant 69.0 points. According to the paired t test [15, pp. 349-350], this difference is not statistically significant (p = 0.47; i.e., p > 0.05). Gui G graded 26.8% and the teaching assistant 23.2% of the submissions as failed. According to the z-test for proportions [15, p. 324], this difference was not statistically significant either (p = 0.66; i.e., p > 0.05). Thus, no major differences were found between Gui G and the teaching assistant.
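For reference, the two tests used above follow the standard textbook formulas; the notation below is ours, not the paper's. With paired differences $d_i = x_i - y_i$ over the $n = 56$ submissions, and with failure proportions $\hat{p}_1$ and $\hat{p}_2$ for the two graders, the test statistics are

$$ t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
   z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)}},
   \quad \text{where } \hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}. $$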

The sums of points for both grading methods are presented in Figure 7. Thirteen submissions were given zero points by both methods. These data points are not shown in the figure because the scales are 50-110. No sum was between 1 and 50 points. The Spearman rank correlation coefficient rs [2, p. 314] was 0.93, which indicates that the sums correlated positively, as was expected. It can be noticed from the figure that there are three clearly differing points A, B, and C that are circled. Actually, point C includes two identical data values. Thus, Gui G gave considerably fewer points for four (7.1%) submissions than the teaching assistant did. Otherwise the results matched reasonably well. In order to find possible grading mistakes by Gui G or by the teaching assistant, (a) the points of these four submissions were compared in detail, and (b) the students' programs in question were tested manually more thoroughly than previously. The problems in Gui G were mostly in the grading of exercises 3 and 4. On some occasions Gui G was unable to locate dialogs that the students had correctly created with JOptionPane, and graded the exercise as 0 points whereas the teaching assistant gave more points. In addition, small deviations were found when students approached the problem in a different manner than intended (e.g., using a hand-made dialog). Obviously, Gui G could not react to the new situation, and graded the exercise as 0 points whereas the teaching assistant gave more points.

Figure 7. Sums of points from all four exercises when the sample of submissions (n = 56) was graded by the teaching assistant and Gui G. (The three clearly differing data points A, B, and C are circled in the figure.)

4.3 Submissions

The numbers of enrolled students and the proportions of students who passed the GUI assignment in the years 2003-2005 are presented in Table 2. The submissions were graded manually in 2003 and automatically in 2004-2005. It can be noticed that the proportion in 2003 is a little greater than in 2004-2005. However, the differences between the years are not statistically significant when calculated using the z-test for proportions [15, p. 324]: p = 0.06, p = 0.10, and p = 0.81; that is, all three p > 0.05.

Table 2. Proportions of students who passed the GUI assignment in 2003-2005.

Year   Enrolled   Passed (%)
2003   340        86.8
2004   334        82.3
2005   337        81.6

In 2004, the students submitted on average 2.4 times per exercise when the maximum was ten. Nobody used the maximum of ten submissions. There were no big differences between the different exercises; for example, the means varied between 2.3 and 2.6. This indicates that the automatic grading worked equally well for all four exercises. If the automatic grading had worked worse for a certain exercise, its mean number of submissions would probably have differed considerably from the other exercises. Similar statistics were not counted in 2005.

4.4 Feedback from Students

In 2004 and 2005, the students had to answer the feedback questionnaire as a compulsory part of the assignment. We did not ask if a student had any previous experience of automatic grading, but probably all students had at least some experience, because automatic grading is used on the prerequisite course and in the first assignment of the case course. In addition, many students had experience of different automatic grading systems because at our university the Basic Course in C/C++ Programming has used the Goblin course management and automatic grading system [8] for the past four years. Goblin has a WWW-based graphical user interface, whereas Ceilidh has a text-based user interface.

The results are presented in Table 3. The means of the question "Give an overall grade for the automatic grading of this assignment" are presented in the column "GUI assignment." The means of the question "Give an overall grade for the automatic grading of the first assignment" are presented in the column "First assignment." The first assignment was about Java's input/output, fully text-based, and the students used Ceilidh to submit the assignment. The mean response to the question "Give an overall grade for the assignment A2 (Graphical User Interfaces) as a whole" is presented in the column "Overall grade to GUI assignment." The letter pairs indicate that the difference of two means was statistically significant (p < 0.01) when calculated using the Student t test and the Smith-Satterthwaite procedure [15, pp. 347-349]. Thus the students were more satisfied with the automatic grading of the Input/Output assignment than with that of the GUI assignment. This difference was as expected, because the Input/Output assignment was graded automatically for the third time in the spring of 2004 and the bugs in the test programs had already been corrected, whereas the test programs of the GUI assignment still contained some bugs or other problems. For the spring 2005 course most of the bugs were corrected and the satisfaction towards the automatic grading of the GUI assignment increased, as was expected.

Table 3. Means related to students' satisfaction in years 2003-2005. Scale: 1 Poor ... 5 Excellent.

       Automatic grading:                  Overall grade to
Year   First assignment   GUI assignment   GUI assignment
2003   —                  —                3.9 (d,e)
2004   3.4 (a)            2.8 (a,c)        3.4 (d)
2005   3.1 (b)            3.6 (b,c)        3.4 (e)

Dash (—) indicates that the topic was not asked or relevant. Letter pairs (a-e) indicate that the difference between two items was statistically significant (p < 0.01).

The differences of the overall grades were statistically significant (p < 0.01) as well. However, it is impossible to conclude whether the main reason for these differences was the automatic grading, because some other changes were also made to the assignment. In the spring of 2003, the assignment also included two further exercises, 5 and 6, which were excluded in the springs of 2004 and 2005 because they were not suitable for automatic grading. In addition, in 2003, the assignment was graded on a scale from 0 to 5, whereas in 2004 and 2005, the grading was fail/pass.


The students were also asked how many hours they had used for the different exercises. In 2005, the mean of the sums was 13 hours, while it was stated on the course web pages that the assignment should take 10-20 hours. The standard deviation was 6.5, the minimum 2.5, and the maximum 50 hours. 91% of the students reported that they used 20 hours or less. The mean numbers of hours for the different exercises were: 4.2 for the first exercise, 3.3 for the second, 4.2 for the third, and 1.0 for the fourth. 52% of the students reported that they did not use any time for the fourth exercise, which means that they got enough points to pass the assignment from exercises 1-3.

4.5 News Articles

The students were able to ask questions in the course newsgroup. These articles were analyzed during the spring of 2004, but not during the spring of 2005. Altogether, 138 news articles were about the GUI assignment. Sixty-two percent of the articles were questions or comments from the students and 38% were answers to these questions from the course staff. The articles were classified into the following categories: problems with the automatic grading 68%, questions about the assignment per se 14%, how to submit from home over a network 10%, and other articles 8%. The category "questions about the assignment per se" refers to problems that had nothing to do with automatic grading.

The distribution of the articles was also analyzed over time. The exercises were opened in Ceilidh on February 9 and the submission deadline was February 23. After the opening of Ceilidh, there was a peak in the number of articles because 31% of the articles were sent between February 9 and 11. Between February 10 and 16 the course instructor sent five articles about corrections to the test programs and one article about corrections to the specification of the assignment. Eighty-five articles concerned the automatic grading and a particular exercise of the assignment. The proportions of these articles were: the first exercise 38%, the second exercise 9%, the third exercise 38%, and the fourth exercise 14%. This distribution indicates that the automatic grading of the first and third exercises did not function as well as that of the second and fourth exercises.

5. DISCUSSION AND FURTHER WORK

We agree with English [3, p. 137] and Sun and Jones [19, p. 140] that the automatic grading of GUI programs is technically more difficult or complicated than that of text-based programs. Jemmy, however, takes care of this complexity for the most part. After getting familiar with Jemmy, the programming of GUI tests was similar to programming tests for text-based exercises. The focus in the case course and in our assignment was not usability. We believe that the automatic grading of usability issues is more difficult, or maybe not even possible.

Next, we mention two differences between automatic grading and automatic testing in industry. However, we are not able to draw any conclusions or make any suggestions based on these differences. First, we assume that in education there are several dozen or even hundreds of more or less different implementations of the same specification, a case not typical in industry. Second, in industry some software professionals might specialize in automatic (GUI) testing, whereas lecturers probably use only a very small proportion of their total working time for this task. In other words, lecturers are, at the most, casual test programmers.

The automatic grading of GUI programs is partly difficult because there are often several possible and equally good solutions. Defining the exercises carefully can reduce this problem. For example, in the present assignment, the students did not have to decide the location and the order of the buttons because a picture of the GUI was given as part of the description of each exercise. However, some students might find detailed descriptions less motivating.

Using static testing could reduce some problems of Gui G. The static tests would be run before the dynamic GUI tests in order to check for some mistakes that caused the whole GUI test to fail and the execution to terminate. In these cases, a student might have had severe difficulties in understanding the problem because the program functioned well when tested manually. Some mistakes that could be noticed using static tests were: (a) the name of the main window did not match the expected string; (b) the required class was not found (typically, a student used a slightly different class name); (c) the student's class did not contain the main method as required; or (d) an AWT component was used instead of a Swing component. A sketch of such checks is given at the end of this section.

The implemented test programs are more procedural than object-oriented, even though the programming language used is object-oriented. We think this is a consequence of the task itself, because we automated a situation where there is only one user, who acts in exactly the same manner every time. This might be a common feature of GUI testing because many GUI testing programs have their own scripting language and Jemmy's example programs are more procedural than object-oriented. Object-orientation would be more natural or necessary if the testing situation included different kinds of users (e.g., expert users, normal users, and novices) who would use the application in different ways. In that case, the test programs could function differently according to the users' properties.

We will probably make no changes to the system in the near future because the case course will be discontinued after the spring of 2005 and it is unclear whether these GUI exercises are suitable for the other programming courses at our university. Nevertheless, the following list presents some possible changes to be made if some other institution wishes to use the system. The changes are listed so that the most important change is presented first: (a) the specification of the assignment should be changed so that the testing criteria are also explained; for example, the minimum space between the components should be defined exactly or this test should be excluded; (b) the test program for the third exercise should be changed, because it throws exceptions that terminate the execution too often; (c) some static tests should be added; and (d) the window events could be directed to a virtual frame buffer such as Xvfb. This should make it easier to submit the exercises over the network from a student's home PC because then, for example, a normal SSH terminal connection should be enough, as for text-based exercises.
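The static checks suggested above could be sketched roughly as follows. This is only an illustration of the idea, not part of Gui G; the expected class name is an assumption taken from the assignment, and a real implementation would report the results in whatever format the course management program expects:

import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Hypothetical static checks run before the dynamic Jemmy tests. */
public class StaticChecks {
    public static void main(String[] args) {
        try {
            // (b) Does the required class exist under the expected name?
            Class<?> appClass = Class.forName("StockControlApplication");
            System.out.println("Class StockControlApplication was found.");

            // (c) Does the class contain the required public static main method?
            Method main = appClass.getMethod("main", String[].class);
            if (Modifier.isStatic(main.getModifiers())) {
                System.out.println("Method main was found.");
            } else {
                System.out.println("Method main is not static.");
            }
        } catch (ClassNotFoundException e) {
            System.out.println("Class StockControlApplication was not found.");
        } catch (NoSuchMethodException e) {
            System.out.println("Method main was not found.");
        }
        // Checks (a) and (d), the window title and the use of Swing instead of AWT
        // components, would require inspecting the created frame or the source code.
    }
}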

6. ACKNOWLEDGEMENTS We thank Doctor A. Korhonen for commenting on the manuscripts of the present paper.


7. REFERENCES

[1] Abbot Framework for Automated Testing of Java GUI Components and Programs. 2004. Retrieved on December 21, 2004, from the SourceForge.net web site: http://abbot.sourceforge.net/.

[2] Conover, W. J. Practical Nonparametric Statistics (3rd ed.). John Wiley & Sons, New York, 1999.

[3] English, J. Automated assessment of GUI programs using JEWL. In Proceedings of the 9th Annual Conference on Innovation and Technology in Computer Science Education (ITiCSE'04) (Leeds, United Kingdom, June 28-30, 2004). ACM Press, New York, NY, 2004, 137-141.

[4] English, J. Download JEWL. 2004. Retrieved on November 18, 2004, from the University of Brighton web site: http://www.it.brighton.ac.uk/staff/je/java/jewl/download.html.

[5] Fewster, M., and Graham, D. Software Test Automation: Effective Use of Test Execution Tools. Addison-Wesley, New York, 1999.

[6] Forsythe, G. E. Automatic machine grading programs. In Proceedings of the 1964 19th ACM National Conference. ACM Press, New York, NY, 1964, N1.4-1.

[7] Foxley, E., Higgins, C., and Tsintsifas, A. The Ceilidh system: a general overview. In Second Annual Computer Assisted Assessment Conference (Loughborough, UK, June 17-18, 1998). 1998.

[8] Goblin System Homepages. 2004. Retrieved on December 11, 2004, from the Helsinki University of Technology web site: http://goblin.automationit.hut.fi/.

[9] Groder, C. Building maintainable GUI tests. In Fewster, M., and Graham, D., Software Test Automation: Effective Use of Test Execution Tools. Addison-Wesley, New York, 1999, 517-536.

[10] Introduction to JFCUNIT. Retrieved on December 21, 2004, from the SourceForge.net web site: http://jfcunit.sourceforge.net/.

[11] Jia, X. Object-Oriented Software Development Using Java: Principles, Patterns, and Frameworks (2nd ed.). Addison-Wesley, Reading, MA, 2003.

[12] Li, K., and Wu, M. Effective GUI Test Automation: Developing an Automated GUI Testing Tool. Sybex Inc., Alameda, 2004.

[13] Malmi, L., Karavirta, V., Korhonen, A., Nikander, J., Seppälä, O., and Silvasti, P. Visual Algorithm Simulation Exercise System with Automatic Assessment: TRAKLA2. Informatics in Education, 3, 2, 2004, 267-288.

[14] McCauley, R., and Manaris, B. Comprehensive Report on the 2001 Survey of Departments Offering CAC-accredited Degree Programs. Technical report CoC/CS TR# 2002-9-1, Department of Computer Science, College of Charleston, 2002. Retrieved on February 11, 2004, from the College of Charleston web site: http://stono.cs.cofc.edu/~mccauley/survey/report2001/CompRep2001.pdf.

[15] Milton, J., and Arnold, J. Introduction to Probability and Statistics (4th ed.). McGraw-Hill, New York, 2003.

[16] Netbeans.org. Jemmy Module. Retrieved on March 16, 2004, from the netbeans.org web site: http://jemmy.netbeans.org/.

[17] Saikkonen, R., Malmi, L., and Korhonen, A. Fully automatic assessment of programming exercises. In Proceedings of the 6th Annual Conference on Innovation and Technology in Computer Science Education (ITiCSE 2001) (Canterbury, UK). ACM Press, New York, NY, 2001, 133-136.

[18] Skrivanek, J., and Sotona, A. Testing Forte for Java: Quality Assurance with User Interface Test Libraries, Jemmy and Jelly. 2002. Retrieved on December 20, 2004, from a Sun Developer Network Site - Technical Articles & Tips: http://developers.sun.com/prodtech/javatools/jsstandard/reference/techart/JemmyJelly.html.

[19] Sun, Y., and Jones, E. L. Specification-driven automated testing of GUI-based Java programs. In Proceedings of the 42nd Annual ACM Southeast Regional Conference (ACMSE'04) (Huntsville, Alabama, April 2-3, 2004). ACM Press, New York, NY, 2004, 140-145.

[20] Surakka, S. Gui G: Download. Helsinki University of Technology, Laboratory of Information Processing Science, 2005. URL: http://www.cs.hut.fi/Software/GuiG/.

[21] Testingfaqs.org. GUI Test Drivers. 2004. Retrieved on March 17, 2004, from the testingfaqs.org web site: http://www.testingfaqs.org/t-gui.htm.
