Evaluating Program Visualisation Systems: An Information-Based Methodology

Paul Mulholland
Human Cognition Research Laboratory, The Open University, Milton Keynes, MK7 6AA, UK.
email: [email protected]

Technical Report 107, Human Cognition Research Laboratory, The Open University.

Abstract: A number of Program Visualisation (PV) systems have been developed to aid the debugging process by providing a detailed account of program execution. Despite their development, empirical claims for their efficacy are few. A review of related empirical evaluations identifies the lack of a principled and integrated approach as a significant problem. This criticism, combined with a description of Software Visualisation (SV) within the context of the programming environment, is used to develop a set of methodological criteria for future work. These requirements have motivated the implementation of the Prolog Program Visualisation Laboratory (PPVL), which provides an integrated tracing environment comprising four Prolog tracers. An information-based methodology is then presented for evaluating Prolog tracers within the realistic programming environment provided by PPVL. Finally, a survey of PV design claims is used to focus future research.

Keywords: Program Visualisation, Evaluation, Methodology, HCI.

Acknowledgements: This work was supported by an EPSRC postgraduate research studentship.

1 Introduction

Computer programming can be an extremely complex cognitive activity. The first draft of a program is rarely correct, and it is estimated that approximately half of the time required to develop a system is devoted to testing and debugging (Brooks, 1975). To aid this process, a number of Software Visualisation (SV) systems have been developed. SVs provide graphical and textual representations of programs and their execution to support program comprehension and debugging.

SVs can be divided into Program Visualisation (PV) and Algorithm Animation (AA) systems. AA systems abstract some predefined algorithm from the code, showing execution nearer the task domain. PV systems, which are the focus here, provide an account of execution close to the underlying code. PV systems (or tracers) are often used to trace the execution of the program at a fine-grained level, though many provide navigational aids to "prune" the amount of information displayed. Although a number of PV systems have been built, few empirical evaluations have been carried out, and those that have contain a number of methodological problems.

This paper provides a critique of previous research and presents a new information-based methodology which deals with the problems outlined. It is argued that an informational account can be used to shed light on the way the code, task, user and PV system interact during program debugging or comprehension. As a number of PV systems have been developed for Prolog, and numerous studies have looked into the problems associated with using the language, the methodology has initially been applied to Prolog PV systems. The methodology includes an outline of how debugging in a Prolog environment that includes a PV system can be defined in an integrated way. The Prolog Program Visualisation Laboratory (PPVL), which incorporates a number of PV systems within a single environment and includes recording facilities to aid data collection, is then described. Some claims made by PV designers are then presented to provide a focus for research.

2 Previous evaluation studies

Empirical work has been carried out in the fields of computer-aided learning and software visualisation to compare the efficacy of different types of representation for learning and information access. Many of these studies have looked at particular features such as animation or the use of colour. A large number of the relevant studies have taken place in the domain of computer-aided learning (CAL), and only more recently has the state of development in software visualisation been sufficient to allow empirical studies and comparisons. The range of studies undertaken is discussed in the following two sections. These provide a basis for the new information-based methodology described later.

2.1 Computer-Aided Learning (CAL)

A large number of studies have been undertaken to investigate the relative advantages of various types of display for computer-based instruction (CBI) or computer-aided learning (CAL). Most of these have focused on determining the efficacy of particular display features such as graphics, animation or colour. This is a common empirical approach which also pervades some of the studies that aim to evaluate software visualisation or programming notations. A number of measurement techniques are reported, including the amount of learning that has taken place, the efficiency with which information can be accessed from the display, and the ability to simulate the kind of machine behaviour presented within the study.

Rigney and Lutz (1976) and Alesandrini and Rigney (1981) investigated the usefulness of graphical representations for the presentation of chemistry concepts. Both studies found animation to be an advantageous feature. Other studies have failed to find any benefit for animation in educational technology. For example, in a study by Reed (1985), subjects were given rules enabling them to estimate how long it would take the computer to perform algebra word problems. Those receiving a dynamic simulation of the behaviour of the computer performed no better than those viewing a static representation. Such apparently contradictory findings are a common occurrence within CBI evaluation. Carpenter and Just (1992) found only a limited effect in favour of dynamic displays for teaching the structure and functions of a mechanical system, and only when the animation was accompanied by a verbal explanation. Subjects with little relevant previous knowledge were primarily dependent on the supporting explanation and paid little attention to the display. Those subjects with a higher level of prior knowledge were able to make inferences from the display and had little need for the supporting explanations. This is an important finding, as it illustrates the effect of user characteristics such as domain expertise on the ability to interpret and use non-textual information.

Baek and Layne (1988) considered the role of colour as well as animation in the presentation of educational material. They classified displays as to whether they were colour or monochrome, and whether the display incorporated text, graphics or animation. A pre-test and post-test were administered to derive a measure of the amount of learning that had occurred. The results showed no effect of colour, but graphics scored significantly higher than text or animation. Baek and Layne were aware of the disparate nature of the findings in the area, particularly among a number of studies investigating the possible benefits of animation. For example, Peters and Daiker (1982) found no advantage for the use of either animation or graphics for CAL. Baek and Layne noted a number of related explanations given for failures to find an advantage for the use of new presentation techniques: the basic elements of the graphics failing to focus on the correct materials, the subjects' failure to interpret the information in the graphical display, and a lack of experience with graphical material.

Study                         | Text v. graphics | Static v. dynamic                                     | Colour v. monochrome
Rigney and Lutz (1976)        | -                | Animation better                                      | -
Alesandrini and Rigney (1981) | -                | Animation better                                      | -
Peters and Daiker (1982)      | No difference    | No difference                                         | -
Reed (1985)                   | -                | No difference                                         | -
Baek and Layne (1988)         | Graphics better  | Animation combined with graphics good                 | No difference
Carpenter and Just (1992)     | -                | Animation slightly better (with textual explanation)  | -

Figure 1: A summary of studies evaluating the effectiveness of different display characteristics for CBI applications.

Baek and Layne's (1988) significant results were probably due to their "careful use of colour, graphics and animation" to ensure "students notice the characteristics of the graphics and animation are related to the subject matter" (p. 131). This is an important point, as it indicates that effectiveness is a function of how a particular display feature (such as graphics) is used and what information it is used to present. The question should not be "Should animation be used in the presentation of information?" but rather "Under what conditions should animation be used and what should it be used to present?". The above findings are summarised in figure 1.

2.2 Program notation and SV

A few studies have been carried out to evaluate SV technology and the benefits of graphical programming notations. These are summarised in figure 2. Possibly the first attempt to empirically evaluate a Prolog tracer was conducted by Rajan (1986), who attempted to ascertain the educational benefits of using APT (the Animated Prolog Tracer) to explore the execution of a program. A questionnaire was used to ascertain what meaning subjects conferred upon particular rules within a set of Prolog programs. Subjects were either given an APT representation of Prolog execution and the code, or just the code and program output. The experiment investigated how well the subjects understood the run-time behaviour of a program by asking them to provide a description of the execution. Subjects using APT had a more in-depth knowledge of the program's execution, though there was a large variation between subjects.

Study                     | Text v. graphics | SV v. no SV              | SV comparison
Rajan (1986)              | -                | Helped clarify execution | -
Cunniff and Taylor (1987) | Graphics better  | -                        | -
Badre and Allen (1989)    | Text better      | -                        | -
Höök et al (1990)         | -                | -                        | Reasoning differences between OR-tree and Spy
Patel et al (1991)        | -                | -                        | EPTB certain advantages over TPM and Spy
Brusilovsky (1993)        | -                | SV useful                | -
Stasko et al (1993)       | -                | No effect of SV          | -
Patel et al (in press)    | -                | -                        | TPM and TTT better than Spy

Figure 2: Summary of studies into the effect of SV or notation on program comprehension.

A further study (Rajan, 1991) looked at the effect using APT could have on common novice misconceptions of Prolog. It was found that APT could help to reduce the number of misconceptions novices had of Prolog execution, though once again there was a large degree of variation. Further questions raised by the author related to what, at a more fine-grained level, APT was providing. Firstly, though APT had been of benefit, particularly to some subjects, it was not clear how their conceptual model of the particular program, and of Prolog execution in general, had changed as a result of the demonstration. For example, using APT appeared to reduce misconceptions of the language; further work may be able to clarify whether using APT reduced misconceptions in general or whether APT was more successful at tackling particular kinds of misconception. Furthermore, it was not clear why some

students had been helped considerably more than others. These students may have had a greater aptitude for programming in general or, for example, may have been more able to draw information from a different representation of program execution.

Cunniff and Taylor (1987) investigated the effect of textual and graphical presentation on novice code comprehension. The study investigated the comprehension of equivalent code written in one graphical (FPL) and one textual (Pascal) language. Subjects had to perform three tasks thought to be central to computer program comprehension: the recognition and counting of types of program construct, determining the values assigned to specific variables, and determining the number of times particular program segments would be executed. They found faster response times with FPL than with Pascal. The accuracy of responses was also superior in FPL, particularly for questions requiring the comprehension of program conditionals. In particular, there were far more errors using Pascal than FPL on questions relating to the values of variables. Prior to the test, subjects were tested for visual and verbal ability. A high correlation was found between visual aptitude and the comprehension of both graphical and textual code. The authors postulated that visual aptitude may therefore be an important factor in general programming ability, regardless of the notation used.

A rather different result was gained by Badre and Allen (1989) in their comparison of a textual and a diagrammatic programming notation. Overall they found no difference in bug location time between the two notations. They then separately analysed the performance of the novice and expert programmers within the subject sample: they found no effect of notation for experts, but a superior performance for the textual notation among the novice subjects.

Höök et al (1990) compared a graphical (OR-tree) and a textual (Spy [1]) notation of Prolog execution. The analysis considered the effect of the two notations on programs that differed both in their complexity and in their meaningfulness. The presence of novice misconceptions and the application of novice meta-rules (Taylor, 1988) were noted in each experimental condition. Taylor (1988) had found that Prolog novices can be distinguished as to how they use formal and real-world problem solving during program comprehension. Different types of formal reasoning can also be distinguished: subjects may use a formal logical (i.e. declarative) account of the program or a mechanistic (i.e. procedural) account. The subjects in Höök et al's experiment appeared to fall into four distinct groups. Two subjects stayed in the mechanistic domain throughout the experiment; these can be thought of as exhibiting 'stable' reasoning. A second group had insufficient knowledge of the mechanistic domain and used real-world knowledge to support their reasoning. A further group of

[1] Spy is introduced in section 5.

'unstable' subjects tended to reason in the wrong domain. A final group (excluded from this analysis) appeared to have no Prolog misconceptions. Höök et al's analysis considered the effect of using either an OR-tree or a Spy representation of Prolog. The OR-tree representation appeared more able to combat the control flow misconceptions common in backtracking, and also helped to show variable unbinding. It was also found that few subjects were able to transfer lessons learnt about Prolog using one representation to the other, suggesting their knowledge was still highly dependent on the particular display and had not yet been generalised. It was also found that subjects using Spy tended to follow the execution through the program step by step. Both representations had difficulty providing a clear story, particularly during backtracking, where a number of related misconceptions were observed.

Rather than investigating whether a particular SV was more useful than no SV, or attempting to isolate comprehension differences between graphical and textual forms of presentation, Patel et al (1991) performed a direct comparison between three Prolog trace formats: Spy, TPM [2] and EPTB [3]. Snapshots of the tracers were used to remove navigational differences between them. Each subject was asked to solve five problems using each trace format. The problems were disguised on each presentation to remove learning effects. Three of the problems referred to backtracking, and two more difficult problems were concerned with recursion, system primitives (primarily the cut) and list manipulation. Performance differences between tracers on the five problems, in terms of the speed and accuracy of response, did not conform to the distinction between backtracking and recursion. Overall there was a slight advantage for EPTB, though there was a strong interaction between trace format and problem type. This was explained in terms of the extent to which the problem exploited the particular strengths and weaknesses of each notation. Spy, which provides only a minimal account of execution, was found to be inferior unless the problem required only the kind of information that Spy presents explicitly. TPM was thought to perform better on problems that could readily be solved using the overall Gestalt provided by the graphical notation, but suffered in its presentation of the execution history. EPTB's performance was explained by its explicit presentation of current information and execution history.

[2] TPM is introduced in section 5.
[3] EPTB (the Enhanced Prolog Tracer for Beginners) (Ditchev and du Boulay, 1987) is a linear textual tracer emphasising information about data structures, bindings and reasons for goal failure.

A further study (Patel et al, in press) compared the performance of Spy, TTT [4] and TPM across the problem types used in the previous study. Overall, TPM and TTT performed similarly well, though differently on individual problems. The differences on individual problems were once again explained in terms of the apparent strengths and weaknesses of the two notations. TPM seemed to suffer from the lack of a clear execution history. TTT, which provides a textual tree, attempts to combine the benefits of a tree structure with an explicit presentation of the execution history. Though TTT seemed to benefit from its presentation of the execution history, subjects seemed to find the resulting tree hierarchy less clear.

Tool                 | Percentage solved
Before tools         | 6
Demonstrate results  | 19
Visualization        | 39
Explanation          | 20
Further              | 16

Figure 3: Percentage of problems solved at each stage.

Brusilovsky (1993) investigated the potential role of SV as a tool for novice program debugging. The experiment was carried out as part of a computer programming practical class. When students noticed that their solution to a programming problem was working incorrectly, four stages of explanation were used successively until the student managed to understand the reason for the buggy behaviour. Firstly, the student was shown the disparity between their own result and the correct result. Secondly, they were shown a visualization of the execution of their program. Thirdly, a verbal simulation of the execution of the program was provided. If the student still failed to understand, the assistant used their own knowledge to explain the error. The percentage of problems solved at each stage is shown in figure 3. As over a third of the problems were solved at the visualization stage, the results indicate a plausible role for SV within a full teaching environment, though it is not clear to what extent having already attempted to solve the problem twice by this stage affected the results.

Stasko et al (1993) also considered the educational benefits of SV. In their study an AA of a priority queue algorithm was used. Half of the students were provided with the AA

[4] TTT is introduced in section 5.

and the program, while the other half were just given the program. The AA was found to assist student comprehension only slightly. The authors suggested that one reason the AA may have been of little help was that students were not aware of how their knowledge would be tested prior to viewing the visualization; the students therefore had no clear goal to pursue during the comprehension phase of the experiment. They suggest a clear task motivation may have helped the students to gain far more from the representation than they actually did.

From this review it can be seen that much of the empirical work suffers from the same shortfalls found in the work evaluating CBI. Many of the studies derive performance results without deriving the information necessary to interpret those results effectively. Observations are usually not made of, for example, how the subjects approached the task and which features of the execution they found confusing. Because of this, the work in evaluation seems to provide a set of isolated findings which can often appear contradictory on the surface. As only global measures of performance have been used in many of the studies, it is necessary to rely on anecdotal explanations of any apparent contradictions. A research methodology that sought to focus more heavily on providing a qualitative account of why a particular result occurred would hopefully be able to move away from isolated observations toward building an overall picture of what is occurring.

3 A model of SV use

In order to evaluate SVs it is necessary to know how they relate to the other components of the programming scenario. This is important for two reasons. Firstly, the evaluation must consider what role the SV plays within programming, in order that the evaluation may encompass the kinds of tasks for which the SV is used. Secondly, an understanding of its role will indicate how the nature of the SV may be expected to affect the performance of the programmer. This can be used to motivate the selection of appropriate performance measures. The main components of the programming scenario to be considered are: the SV, the program code and its behaviour, the task, and the user's knowledge and expertise. Their interactions are summarised in figure 4, where lines of modification between components are shown in black and channels of information flow between components are shown in grey. As can be seen from the diagram, the incorporation of the SV adds complexity to the conventional understanding of programming activity, as the SV becomes a further information source which the programmer may assimilate in their understanding of the program and its execution. The user is also able to modify the SV as well as both the program code and the goals or subgoals of the task being performed. The execution history is provided for the SV from the program code. The user may then navigate or customise the SV in order to retrieve the

information they wish to see. Currently the program code can only be modified directly by the programmer, though it can be envisaged, from the progress of current research, that in the future it may be possible to alter the code itself by manipulating the image provided by the SV: certain bugs could be fixed directly by adjusting the image to fit how the program should behave, with the system then automatically modifying the source code to produce the desired execution history. As yet such features are not available, and programs can only be modified directly.

[Figure 4 diagram: the components SV, Code, Task, User and Knowledge, connected by lines of modification (black) and channels of information flow (grey).]

Figure 4: A model of the role of SV within programming.

From the diagram it can be seen that there are four possible ways in which the features of the SV may affect the cognitive or physical actions of the programmer. Firstly, the nature of the SV may obviously alter the way in which the programmer modifies or navigates the SV itself. Secondly, information provided by the SV may motivate alterations to the source code, should the SV be used to support debugging. Thirdly, the SV may alter the nature or composition of the task. For example, if the task is to comprehend a program, the facilities provided by the SV may alter the approach taken by the programmer: in order to make use of the SV, the programmer may have to alter the approach taken during program comprehension and the nature of the strategies employed. Fourthly, the use of the SV may affect the user's knowledge. This change in knowledge could relate to the particular program with which the SV is being used, or it could be a more long-term alteration of the programmer's understanding of the language and its execution. The studies of Rajan, Höök et al and Stasko et al reviewed in the previous section can be thought of as investigating potentially long-term changes in knowledge resulting from use of the SV. The studies of Patel et al focus exclusively on the information channel between the SV and the user. The study of Brusilovsky considers how the SV may affect the novice programmer's modification of the program code.

Not only may the features of the SV affect the other components of the scenario; the other components may in turn alter the use and effectiveness of the SV itself. Firstly, the nature of the program being visualized may affect how the SV has to be used and the extent to which the SV can adequately present it. Patel et al (1991, in press) were aware of this possibility in their classification of code segments as to whether they primarily incorporated recursion or backtracking. Similarly, Höök et al (1990) considered the effect of code complexity on the benefits of particular SV notations. Another important issue may be the size of the program, regardless of its complexity. A larger program, producing a larger execution history, will place greater navigational demands on the user in locating the desired information. The ease with which this can be done will be affected by the kind of SV notation used and the navigational tools available.

The efficacy of the SV may also be affected by the nature of the task. For example, a particular SV may be appropriate for explaining the workings of a program to a novice but may make a poor debugging tool. It is also possible that different SVs may be most appropriate at different stages of a task. For example, within debugging, a particular SV may be useful in the initial data gathering stages but less useful once the approximate location of the bug has been ascertained. This leads to a side issue: if different representations are more appropriate at different stages, should multiple representations be used (as argued by Brayshaw (1994)), or should a "best bet" representation be used throughout, removing the need to translate between a large range of representations?

Similar issues arise if differences within the user population are considered. Brayshaw and Eisenstadt (1991), in their design of the Prolog SV TPM, aimed to provide a cradle-to-grave system equally suited to both novices and experts. This may not be realisable, however, and the needs of programmers of different levels of expertise may not be satisfiable by a single system. Within an educational setting the multiple versus "best bet" representation issue is once again present: if different representations are optimal at different stages of the learning process, should a range of representations be used to teach particular aspects of the programming course, or should a single representation be used throughout? Brayshaw (1994) argues that as long as the different representations all share a common execution model, the use of a range of representations should not create problems, though this remains to be tested. A connected issue is the effect of familiarity with the SV, as distinct from programming expertise. Programmers with greater experience may be able to use and access information from an SV in different ways. More complex SV notations may have a steeper learning curve before any adequate use can be made of them. This may make the SV inappropriate for those only using the programming language for a short time, but of great benefit to professional programmers who may use it over a number of years.

It is unlikely that any single study (or even series of studies) could hope to explore this range of interactions in any depth, though it is worth noting in advance the ways in which the SV could affect and be affected by the other components of the programming scenario. This range of issues illustrates the inherent limitations of evaluation. No research project can hope to do justice to such a range of possibilities, and empirical studies must be performed within definite constraints. Attempts should be made, for example, to use a representative program or range of programs, but there is always the possibility that changes in the experimental materials could greatly affect the results. For this reason, the aim of this research will be to find out how particular SVs perform in particular situations, and to draw conclusions from that, rather than pursuing the traditional goal of evaluation, which is to find out "which one is best".

4 Methodological requirements

The review of previous studies, the outline of how an SV can be used within programming, and the issues these entail lead to a number of requirements for future empirical work. A first important characteristic is that the results should provide some integrated understanding of the subjects' performance in relation to the task and their individual characteristics. For example, in terms of understanding performance within the context of the task, Patel et al (1991) raised the issue of whether performance differences between subjects using different SVs were due to information access or task modification. By this they meant, on the one hand, that the nature of the SV may affect the rate at which necessary information can be accessed from the display, without altering the strategic approach taken to the task; on the other, that the nature of the SV may actually affect the approach taken by the subject in performing the task. A desirable quality of the methodology would be that it could produce results that shed light on this kind of issue.

Another issue is how the personal characteristics of the subject affect their ability to perform the task in relation to the nature of the SV. For example, Höök et al (1990) highlighted how subjects' ability could be affected by any misconceptions they may have, which may be counteracted to differing extents depending on the nature of the SV used. Another important issue may be the expertise and experience of the subjects. Expertise is used to describe the level of knowledge and skill the programmer has in the programming language in question, in this case Prolog. More than one kind of experience may also affect performance. Firstly, particularly for Prolog novices, experience with other programming languages may affect performance. Additionally, the level of experience the subject has with the SV may affect their performance qualitatively, rather than just increasing the speed with which the task is performed. These two factors may also interact: experience may have differing effects

on performance depending on the level of expertise. It is therefore important that the methodology makes provision for recording these characteristics and is able to integrate that information with what is understood about how the task is performed, to provide an overall picture.

In order to provide an integrated understanding of performance it will be necessary to derive a far more fine-grained account of the subject, rather than relying solely on gross performance measures. An accepted method of gaining a fine-grained account of the cognitive activities occurring when performing some task is protocol analysis, as developed by Ericsson and Simon (1984). This requires subjects to "think aloud" while performing the task. Their protocols can then be transcribed and coded to give an indication of the number and distribution of particular kinds of conscious cognitive activity occurring during the task. One problem with protocol analysis is that it places far greater demands on the experimenter, in terms of the effort required to transcribe and encode protocols, than more conventional empirical methods. This can be thought of as a trade-off between gaining a small amount of data from each member of a large subject pool (as in conventional approaches) and gaining a large amount of data from each member of a small subject pool (as in protocol analysis). In this case, the latter is preferable for two reasons. Firstly, there are limits on how many Prolog experts can be found, and on how many people can feasibly be trained in the use of particular SVs. Secondly, as little is known about how SVs are used and how they may be expected to affect the performance of subjects, protocol analysis, with its emphasis on data collection rather than hypothesis testing, seems more appropriate.

It has been found that protocol analysis can affect the way a task is performed. Davies (1995) found that plan jumps during a program design activity differed depending on whether the subjects were thinking aloud during the task. It is not clear what effects protocol analysis would have on the kinds of tasks in which an SV could be incorporated, such as program debugging or comprehension, though this is clearly a cautionary tale. These problems can hopefully be avoided by making the act of thinking aloud as naturalistic a part of the task as possible. In many experimental situations the act of providing a think-aloud protocol places demands on the subject which are contrary to the demands of the actual task. For example, if a subject were required to solve a problem as efficiently as possible while thinking aloud, the demands would conflict: the subject may be more interested in attaining a high score on the test and therefore provide only a minimal protocol, or, conversely, may provide such a full protocol that they become diverted from solving the problem. Techniques such as having subjects work in pairs and talk between themselves during the task allow the recording of a protocol without placing undue demands on the subject. The effectiveness

of using subject pairs to evaluate human-computer interfaces has already been expounded by Suchman (1987).

The methodology should also have a theoretical basis in what is already known from previous evaluation studies and general research within the psychology of programming. Research in the psychology of programming can be used to motivate the kinds of things to look for or expect within the protocols. Previous research can also be used to show which kinds of experimental hypotheses are likely or unlikely to yield meaningful results. One aspect of previous studies to which this particularly applies is classification according to features of the display or source code. As was considered in section 2.1, global classifications of display features such as the use of colour or animation tend to miss the key issue: the important point is what certain features are used to represent, rather than whether they are used at all. Additionally, this form of display classification could not hope to distinguish between many Prolog SVs, such as Spy, PTP and TTT, though there may be large performance differences between them. Patel et al (1991) found EPTB to be significantly better than Spy, though both are textual tracers with very similar dynamics. The focus of the methodology will therefore be on providing a framework for understanding what occurs, rather than on the testing of global hypotheses.

It is also hoped that the methodology will have a reasonable level of generalizability. This applies both to the empirical findings and to the methodology itself. The more fine-grained account of how the subjects perform should provide a picture not only of how well subjects did in some particular situation but also why. This information will allow justifiable assertions as to the range of situations or subjects to which the findings are likely to apply. This may permit some basic prescriptions as to how suitable particular SVs are likely to be for certain situations. The results of a fine-grained account could also be used to motivate improvements to existing SV systems. Ideally the methodology itself should also be reasonably generalizable, allowing the application of the approach to new or different programming languages or types of SV.

The methodology should be flexible enough for application to relatively realistic empirical settings. As has already been stated, realistic methods of collecting protocols, such as subjects working in pairs, should be tried. Subjects should also be required to perform realistic tasks closely related to the activities for which an SV could be used, such as comprehension and debugging. The adoption of realistic tasks also requires the use of realistic software environments. Some studies have looked at the properties of SV formats in isolation; the new approach will concentrate on providing an overall picture of how the SV can be and is used. Particularly for novices, the way in which the display has to be navigated in order to derive the necessary information may be as important as the notation itself. Current knowledge in the area is not sufficient to permit issues such as navigation to be allowed for in the results. As the case was argued for SV features such as colour or animation,

at this stage individual aspects should not be analysed separately, but as part of a whole. Only by seeking a fine-grained overall account of SV use can a picture be built up of their potential role within programming for novices and experts. The methodological requirements of the empirical work can be summarised as follows:

• Provide an integrated understanding;
• Provide a fine-grained account of performance;
• Have a strong theoretical basis;
• Be a generalizable methodology producing generalizable results;
• Be flexible enough to allow evaluation within realistic settings.

5 The Prolog Program Visualisation Laboratory (PPVL)

PPVL incorporates four Prolog PV systems, providing the first opportunity to study a number of fully implemented tracers within the same environment. PPVL is implemented in MacProlog™ version 4.5 running under Macintosh™ System 7.1. Spy, PTP and TTT were fully implemented within PPVL; TPM was ported into the PPVL environment and then modified to allow uniform navigation and recording.

PPVL has a number of advantages over separate implementations compatible with different software environments. Firstly, it provides a common interface for all systems, so differences in performance due to the ease of use of different interface technologies are eradicated. The trace required is selected from a common query box, on which three navigational aids are provided to allow the user to prune the trace. The user may "select start points" at which the trace will begin, ignoring any activity happening before that point. Conversely, the user may "compress nodes", hiding from view any of the subgoals of that node. Both PTP and TTT dynamically show non-invocation in their raw states; the user can choose to hide these steps of the trace from view.

Some minor trade-offs had to be made between faithfulness to the original design specifications and providing a uniform, easy to use interface. Pre-query navigational tools are not present in the same form in the original design specifications of PTP and TTT, though the availability of similar navigational aids was suggested. Two additions have been made which increase trace usability. Rajan (1990) argued the importance of providing backward as well as forward tracing for novice programming environments, though this feature is only found in TPM. Providing playback facilities for all the PV systems removes an unfair advantage which is not an inherent part of the representation, and also makes debugging behaviour easier to observe: if a user had to restart a trace on having stepped too far, this would affect their ability to learn more about the PV system and the code through exploration. Also, as TTT has a highly non-linear mode of execution, a pointer was added to the trace to help focus the user's attention on the most recent point of change; this was not present in the original design proposal.

All four representations are derived from the same post-mortem account of execution. This allowed the tracers to be developed more readily, by translating the core account into the required representation. The process of deriving the representations from a single history also motivated the information-based account of the PV systems discussed earlier, by examining to what extent different kinds of information in the underlying history had to be enhanced or suppressed to provide a faithful reconstruction.

PPVL also features built-in recording facilities to aid data collection. A time-stamped textual history is kept of the queries made and of the navigational aids active during each session with the PV system. A time-stamped record is also made of the steps of the trace visited by the user. PPVL also stores changes to the underlying code, allowing the version of the program to be matched against the activities of the programmer. This facility provides much needed support for the collection and interpretation of data.

The Prolog PV systems considered here are the four implemented within PPVL, which provide an almost exhaustive census of the types of Prolog PV system available. As a guide to the following discussion, figures 6 to 9 show the last-state representations of the four PV systems when tracing a simple example program (figure 5).

The Spy tracer is a stepwise, linear, textual PV system which adopts the Byrd Box model of Prolog execution (Byrd, 1980). The model uses a procedural interpretation of Horn clause logic: the head of a clause is classed as a procedure and the tail is treated as one or more sub-procedures. Each procedure or sub-procedure can have one of four states: call, exit, fail or redo. On invocation the procedure is classed as a call. If successful, it is redisplayed prefixed by exit; if unsuccessful, it is prefixed by fail. Procedures retried on backtracking are shown using redo. The hierarchical position of the procedure is shown using indentation. Spy uses a data flow view of variables, the query variable "What" permeating through the trace.

p(X) :- q(X), r(X).
q(a).
q(b).
r(b).

:- p(What).

Figure 5: The example program.


call p(_1)
UNIFY 1 []
  call q(_1)
  UNIFY 1 [_1 = a]
  exit q(a)
  call r(a)
  fail r(a)
  redo q(a)
  UNIFY 2 [_1 = b]
  exit q(b)
  call r(b)
  UNIFY 1 []
  exit r(b)
exit p(b)

Figure 6: Spy trace representation of the example program.
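The mechanics behind a trace of this kind can be illustrated with a meta-interpreter. The sketch below is illustrative only, not how Spy or PPVL is implemented: it prints call, exit, fail and redo lines for pure programs without built-ins or the cut, and in some Prolog systems the traced predicates would need to be declared dynamic for clause/2 to inspect them. Run against the program of figure 5 with ?- trace_goal(p(What), 0). it produces the sequence of ports shown in figure 6, minus the UNIFY lines and with machine-chosen variable names.

    % A minimal four-port (Byrd Box) tracer written as a Prolog
    % meta-interpreter. Illustrative sketch: handles conjunctions and
    % user-defined predicates only.
    trace_goal(true, _) :- !.
    trace_goal((A, B), Depth) :- !,
        trace_goal(A, Depth),
        trace_goal(B, Depth).
    trace_goal(Goal, Depth) :-
        port(call, Goal, Depth),
        Deeper is Depth + 1,
        (   clause(Goal, Body),            % try each matching clause
            trace_goal(Body, Deeper),
            (   port(exit, Goal, Depth)    % body solved: report exit
            ;   port(redo, Goal, Depth),   % backtracking past the exit
                fail
            )
        ;   port(fail, Goal, Depth),       % no (more) matching clauses
            fail
        ).

    % Print one port line, indented to reflect the goal hierarchy.
    port(Port, Goal, Depth) :-
        Indent is Depth * 2,
        tab(Indent),
        write(Port), write(' '), writeq(Goal), nl.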

PTP (the Prolog Trace Package) was developed by Eisenstadt (1984) to provide a more detailed and readable account of Prolog execution than is found in the Spy trace. PTP distinguishes 19 different execution states, as opposed to the 4 found in Spy. It also distinguishes between the goal to be satisfied and the clause within the program which relates to that goal, by using different symbols for the entering and exiting of clauses and for the satisfaction of goals. The goal hierarchy is shown using indentation. As in Spy, variable names are used to reflect data flow through the program.

1: ? p(_1)
2: > p(_1) [1]
3: ? q(_1)
4: +*q(a) [1]
5: ? r(a)
6: -~r(a)
7: ^ q(a)
8: < q(a) [1]
9: +*q(b) [2]
10: ? r(b)
11: +*r(b) [1]
12: + p(b) [1]

Figure 7: A PTP representation of the example program.

The Transparent Prolog Machine (TPM) uses an AND/OR tree model of Prolog execution (Eisenstadt and Brayshaw, 1988; Brayshaw and Eisenstadt, 1991). Execution is shown as a depth-first search of the execution tree. Unlike the other PV systems, TPM incorporates two levels of granularity. The Coarse Grained View (CGV) diagram forms the overview of how clauses are interrelated during execution. Fine Grained Views (FGVs), giving the unification history for a particular node, are obtained by clicking on the node in question. The fine grained view uses a lozenge notation to show variable binding, making explicit the cross-variable dependencies between subgoals. TPM gives special treatment to the cut primitive: all goals frozen by the cut are clouded over on the CGV diagram.

Figure 8: A TPM representation of the example program showing the CGV (left) and FGV diagrams for nodes p, q, and r respectively.

>>>1: p(What) 1S
|1 What = b
***2: q(What) 1SF/2S
|1 What ≠ a
|2 What = b
***3: r(a) Fm
***4: r(b) 1S

Figure 9: A TTT representation of the example program.

The Textual Tree Tracer (TTT) has an underlying model similar to that of TPM, but uses a sideways textual tree notation to provide a single view of execution (Taylor et al, 1991). Unlike in linear textual tracers such as Spy and PTP, current information relating to a previously encountered goal is displayed with or over the previous information, rather than as a linear development of the trace. This keeps all information relating to a particular goal in the same location. Seven symbols relating to clause status are employed, five of them distinguishing types of failure. The variable binding history is shown directly below the goal to which it relates. TTT uses the "!" symbol to cloud out goals frozen by the cut. One major difference between TPM and TTT is that the nodes of the TPM tree relate to clauses whereas the nodes of the TTT tree relate to goals; hence, as r is an instantiated call, TTT has two nodes relating to it rather than one.

6 Empirical approach and data analysis

As outlined above, the motivation behind the methodology is to allow the overall performance measures to be explained in terms of a fine-grained account using protocol

analysis. For the analysis to take account of the information access versus task modification issue, it will be necessary to analyse the protocols both in terms of how much information is derived when using a particular SV and in terms of the kinds of strategic approach to the task employed. In order to take account of the interrelation between the SV and the kinds of misconceptions the subjects have of Prolog, it will be necessary to undertake a further analysis of the protocols for the presence of misconceptions. As was considered in section 4, the background experience of the subject may also affect the way in which the subject is able to use the SV. The protocol data will therefore need to be supplemented with information regarding subjects' relevant previous experience. The different kinds of analysis to be drawn from the experiments, their methods of observation and the related literature are summarised in figure 10.

Previous research in the psychology of programming can be used to determine the kinds of cognitive activities we can expect to identify for each type of protocol analysis. Some studies have already considered the kinds of information that are accessed during program comprehension activities. Pennington (1987) recorded the use of different kinds of information during Pascal debugging, such as source-level operations, control flow and data flow. Similarly, Bergantz and Hassell (1991) used protocol analysis to measure information access during Prolog comprehension using four main information types (which they termed relations): control flow, data flow, program structure and program function. Clearly these codes will have to be extended and modified to take into account the new tasks and the incorporation of the SV, though they outline some basic information types which are central to program comprehension and should therefore be included in some form.

Some work has also looked at the use of programming strategies during program comprehension and debugging. Sime et al (1977) identified forward and backward simulation as common program comprehension strategies. Also, both Gugerty and Olson (1986) and Katz and Anderson (1988) used the analysis of forward and backward reasoning in their discrimination between novice and expert program comprehension. It is therefore possible that forward and backward styles of reasoning could be employed when using an SV during program comprehension. More specific strategies relating to SVs may also be observed.

Fortunately, a great deal of work has been carried out on the kinds of misconceptions novices tend to have of the Prolog execution model. Many revolve around Prolog control flow, such as 'Try Once and Pass' and 'ReDo Body from the Left' (see Fung et al, 1990), or are related to data flow and unification, such as those outlined by van Someren (1990). The protocols will give an insight into how the subject interprets the SV and the execution of the program, allowing an investigation into the range of misconceptions and how they are affected by the SV. The work of Höök et al (1990) suggests that different

SVs could alleviate misconceptions to differing extents. Rajan (1990) also found that using the Prolog tracer APT reduced misconceptions. It would be expected that using an SV would reduce many of the well documented misconceptions, though the extra demands of dealing with the SV could create new misconceptions which have not previously been investigated. Within the methodology these will be referred to as misunderstandings. The term misconception suggests a relatively stable, though incorrect, model of the language. The work of Payne and Squibb (1990) on students' errors in algebra found that errors tend to be unstable and used irregularly in response to the problem situation, given the level of knowledge the student has. The SV may provide the student with a view of aspects of Prolog execution which they had not previously considered; an incorrect interpretation in such situations would be an ad hoc explanation of the execution rather than a stable misconception of it. For this reason the term misunderstanding will be used to classify errors without making any assumptions regarding their stability.
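As a concrete illustration, consider the 'Try Once and Pass' misconception, under which a novice assumes, roughly, that a goal which has already succeeded is simply passed over rather than re-satisfied on backtracking. The following fragment is invented for this sketch and is not taken from the cited studies:

    % Invented program illustrating the 'Try Once and Pass' misconception.
    likes(alice, prolog).
    likes(alice, logic).

    formal(logic).

    query :- likes(alice, X), formal(X).

A novice holding this misconception predicts that ?- query. fails: likes(alice, X) first binds X = prolog, formal(prolog) fails, and, on their model, likes/2 is not re-tried. The interpreter in fact redoes likes(alice, X), binds X = logic, and the query succeeds. A tracer that makes the redo step explicit confronts such mispredictions directly.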

Type of description | Examples | Sources | Method of observation
Experience | Experience with other languages; prior knowledge of code, task and interface | Soloway & Ehrlich (1984); Katz & Anderson (1987/8) | Questionnaire; pre-test activities
Misunderstanding | Try Once and Pass (TOAP); ReDo Body from Left (RDBL) | Taylor (1986); Fung et al (1990); Rajan (1990); Höök et al (1990); van Someren (1990) | Questionnaire; pre-test activities
Strategies | Simulation, forward, backward and causal reasoning | Sime et al (1974); Green (1977); Gugerty & Olson (1986); Katz & Anderson (1987/8) | Activity patterns plus support from protocols
Information | Data flow, program structure, control flow, function | Pennington (1987); Bergantz & Hassell (1991) | Protocol analysis

Figure 10: Types of description used in the analysis of program comprehension and debugging.
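To make the intended protocol analysis concrete, the sketch below shows how coded protocol segments might be stored and tallied. The segment/3 representation and the example codes are assumptions for illustration; the categories follow the information types of figure 10.

    % Illustrative sketch: protocol segments coded as
    % segment(Subject, Seq, Type) facts, where Type is an information
    % category after Pennington (1987) and Bergantz and Hassell (1991).
    :- dynamic segment/3.

    segment(s1, 1, control_flow).
    segment(s1, 2, data_flow).
    segment(s1, 3, control_flow).
    segment(s1, 4, function).

    % tally(+Subject, +Type, -N): how often Subject's protocol was
    % coded with the given information type.
    tally(Subject, Type, N) :-
        findall(Type, segment(Subject, _, Type), Matches),
        length(Matches, N).

For example, ?- tally(s1, control_flow, N). gives N = 2.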

Prior experience may also affect the programmer's ability to make use of the SV. Most relevant details regarding previous experience with programming languages or tools can be

obtained from a questionnaire or from activities used to measure familiarity with the interface. Previous work has indicated not only the effects of programming expertise, in the same language (e.g. Gugerty and Olson, 1986) as well as in others (e.g. Wu and Anderson, 1991), but also the effects of familiarity with certain kinds of notation. Petre and Green (1993) found experience to be a major factor in the ability to use graphical notations. Experience may then be an important indicator, particularly for graphical SVs such as TPM.

As well as the new kinds of description, it will also be necessary to gain more conventional measures of performance, such as the speed with which a task was performed and whether, or to what extent, a task was completed. Given the reliance on protocols, timing data should be treated with some caution, as speed and quality of protocol run contrary to each other. Timing data should therefore only be used as a rough indicator of performance, or to provide a cut-off so that a subject does not spend an inordinate amount of time on a single task, which could cause fatigue or low motivation on further aspects of the task. Due to the observations of Davies (1995) discussed earlier, timing measures should be avoided altogether if the act of providing a protocol does not comprise a realistic component of the task.

At this point it is worth reconsidering the methodological requirements outlined in section 4, for comparison against the proposed methodology. Central to the methodology is the notion of a fine-grained account able to explain the gross performance measures of the subject. This will be provided by a multiple analysis of the protocols to derive the amount and kinds of information, strategies and misunderstandings used by the subject (see figure 10). The development of codes for the three kinds of analysis will be strongly influenced by relevant work in the psychology of programming, in order to provide a sound theoretical basis. The findings from the protocols will be supplemented by information regarding the experience and expertise of the subject, in order that the findings can be integrated with knowledge differences between subjects. The emphasis on explaining the observations will hopefully provide a reasonable level of generalizability. This section has outlined the basic structure of the methodology in order that it could be applied to other forms of SV or programming language. The literature review showed that issues such as information use, strategy, misunderstandings, and previous experience and expertise are important determiners of performance across all kinds of programming language. The final criterion is that the methodology should be flexible enough to allow the study of relatively realistic scenarios. The precise nature of the task to be performed will depend on the subject population, though attempts will be made to strike a good balance between empirical tractability and realism in the design of experimental tasks. Attempts will also be made to collect protocols in realistic ways which do not run contrary to the demands of the task. In order that the tasks can be carried out on real SVs, four Prolog SVs were developed within an umbrella environment, PPVL, described in section 5.


7 Focus for future research

The information-based accounts of TPM and TTT can now be related to the claims made in the literature by their designers, to provide a focus for future research. A number of claims are made by the designers of TPM and TTT. These can be roughly classified into five categories:

• Control flow: Brayshaw and Eisenstadt (1991) claim that the dynamic search through the AORTA diagram in TPM provides a very clear model of execution, particularly when showing complex control flow due to backtracking and the cut. Taylor et al (1990) claim that their non-linear textual representation shows a clear flow of control.

• Debugging: The designers of both TPM and TTT claim that their PV systems will speed bug location (Brayshaw and Eisenstadt, 1991; Taylor et al, 1990). The designers of TPM believe this is related to the overall visual Gestalt of the display; for TTT the compactness of the trace is thought to be a central component.

• User population: Eisenstadt and Brayshaw (1990) claim that TPM is equally suited to both novice and expert Prolog programmers. No explicit claim is made with respect to TTT.

• Code relatedness: Taylor et al (1990) claim that TTT is more easily related back to the source code because of the closer resemblance between the display and the code. Brayshaw and Eisenstadt (1991) claim that the benefits of their notation far outweigh any translation distance to the source code.

• Task relatedness: Brayshaw and Eisenstadt (1991) claim the visual overview of the execution found in TPM helps the user to gain a task perspective on the code.

The design claims relate closely to the key differences identified in the information-based account of TPM and TTT. As would be expected, the designers of TTT are claiming that maintaining a close resemblance to the code is worth losing some clarity of the overall execution model. As discussed earlier, understanding control flow is central to Prolog comprehension, particularly for novices, who may have an incomplete or incorrect model of the interpreter. TTT may therefore have made a poor trade-off with respect to novices, though not for experts. The comparative effects of using a less than optimal representation and of having a large translation distance between representations should be a central issue of future evaluations.


8 Conclusion

An information-based approach permits an integrated description of the code, task, user and PV system. This provides an approach to empirical evaluation which can obtain a fine-grained account of how debugging tools affect performance. The results will inform future PV system design decisions.

References Alesandrini, K. L. and Rigney, J. W. (1981). Pictorial presentation and review strategies in science learning. Journal of Research in Science Teaching, 18 (5), 465-474. Badre, A. N. and Allen, J. (1989). Graphic language representation and programming behaviour. In G. Slavendy and M. J. Smith (Eds.), Designing and using human-computer interfaces and knowledge based systems. Amsterdam: Elsevier. Baek, Y. K., & Layne, B. H. (1988). Color, graphics and animation in a computer-assisted learning tutorial lesson. Journal of Computer-Based Instruction, 15(4), 131-135. Bergantz, D., & Hassell, J. (1991). Information Relationships in PROLOG programs: how do programmers comprehend functionality? International Journal of Man-Machine Studies, 35, 313-328. Brayshaw, M. (1994). Information Management and Visualization for Debugging Logic Programs. PhD Thesis, Human Cognition Research Laboratory, The Open University, Walton Hall, Milton Keynes, UK. Brayshaw, M., & Eisenstadt, M. (1991). A Practical Tracer for Prolog. International Journal of Man-Machine Studies, 35(5), 597-632. Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering. London: Addison-Wesley. Brusilovsky, P. (1993). Program visualization as a debugging tool for novices. In Proceedings of INTERCHI '93. Amsterdam: Addison-Wesley. Byrd, L. (1980). Understanding the control flow of Prolog programs. In S-A Tarnlund (Ed.), Proceedings of the Logic Programming Workshop, Debrecen, Hungary. Carpenter P. A. and Just M. A. (1992). The role of working memory in language comprehension. In D. Klahr and K. Kotovsky (Eds.), Complex information processing: The impact of Herbert A. Simon. Hillsdale, NJ: Lawrence Erlbaum Associates. Cunniff, N., & Taylor, R. P. (1987). Graphical vs. Textual Representation: An Empirical Study of Novices' Program Comprehension. In G. M. Olson, S. Sheppard, & E. Soloway (Eds.), Empirical studies of programmers: Second workshop, Norwood, NJ: Ablex.

Davies, S. P. (1995). Effects of concurrent verbalization on design problem solving. Design Studies, 16, 102-116.

Ditchev, C., & du Boulay, J. B. H. (1987). An enhanced trace tool for Prolog. In Proceedings of the Third International Conference, Children in the Information Age. Sofia, Bulgaria.

Eisenstadt, M. (1984). A powerful Prolog trace package. In Proceedings of the 6th European Conference on Artificial Intelligence. Pisa, Italy.

Eisenstadt, M., & Brayshaw, M. (1988). The Transparent Prolog Machine (TPM): An execution model and graphical debugger for logic programming. Journal of Logic Programming, 5(4), 277-342.

Ericsson, K. A., & Simon, H. A. (1984). Protocol Analysis. Cambridge, MA: MIT Press.

Fung, P., Brayshaw, M., du Boulay, B., & Elsom-Cook, M. (1990). Towards a taxonomy of novices' misconceptions of the Prolog interpreter. Instructional Science, 19(4/5), 311-336.

Gilmore, D. (1991). Does the notation matter? Research Paper, Department of Psychology, University of Nottingham.

Green, T. R. G. (1977). Conditional program statements and their comprehensibility to professional programmers. Journal of Occupational Psychology, 50, 93-109.

Green, T. R. G., Bellamy, R. K. E., & Parker, J. M. (1987). Parsing and gnisrap: A model of device use. In G. M. Olson, S. Sheppard, & E. Soloway (Eds.), Empirical studies of programmers: Second workshop. Norwood, NJ: Ablex.

Green, T. R. G., Petre, M., & Bellamy, R. K. E. (1991). Comprehensibility of visual and textual programs: A test of superlativism against the 'match-mismatch' conjecture. In J. Koenemann-Belliveau, T. G. Moher, & S. P. Robertson (Eds.), Empirical Studies of Programmers: Fourth Workshop. New Brunswick, NJ: Ablex.

Gugerty, L., & Olson, G. M. (1986). Comprehension differences in debugging by skilled and novice programmers. In E. Soloway & S. Iyengar (Eds.), Empirical Studies of Programmers: First Workshop. Washington, DC: Ablex.

Höök, K., Taylor, J., & du Boulay, B. (1990). Redo "TRY ONCE AND PASS": The influence of complexity and graphical notation on novices' understanding of Prolog. Instructional Science, 19(4/5), 337-360.

Katz, I. R., & Anderson, J. R. (1988). Debugging: An analysis of bug location strategies. Human-Computer Interaction, 3, 351-399.

Kessler, C. M., & Anderson, J. R. (1986). A model of novice debugging in LISP. In E. Soloway & S. Iyengar (Eds.), Empirical Studies of Programmers: First Workshop. Washington, DC: Ablex.

Ormerod, T. C., Manktelow, K. I., Steward, A. P., & Robson, E. H. (1990). The effects of content and representation on the transfer of Prolog reasoning skills. In K. J. Gilhooly, M. T. G. Keane, R. H. Logie, & G. Erdos (Eds.), Lines of thinking: Reflections on the psychology of thought, Volume 1. Chichester: Wiley.

Pain, H., & Bundy, A. (1987). What stories should we tell novice PROLOG programmers? In R. Hawley (Ed.), Artificial Intelligence Programming Environments, 119-130.

Patel, M. J., du Boulay, B., & Taylor, C. (1991). Effect of format on information and problem solving. In Proceedings of the 13th Annual Conference of the Cognitive Science Society, Chicago.

Patel, M. J., Taylor, C., & du Boulay, B. (in press). Textual Tree (Prolog) Tracer: An experimental evaluation. In D. Gilmore & R. Winder (Eds.), User-Centred Requirements for Software Engineering Environments. Berlin: Springer-Verlag.

Payne, S. J., & Squibb, H. R. (1990). Algebra mal-rules and cognitive accounts of error. Cognitive Science, 14(3), 445-481.

Pennington, N. (1987). Comprehension strategies in programming. In G. M. Olson, S. Sheppard, & E. Soloway (Eds.), Empirical studies of programmers: Second workshop. Norwood, NJ: Ablex.

Peters, H. J., & Daiker, K. C. (1982). Graphics and animation as instructional tools: A case study. Pipeline, 7(1), 11-13.

Petre, M., & Green, T. R. G. (1993). Learning to read graphics: Some evidence that 'seeing' an information display is an acquired skill. Journal of Visual Languages and Computing, 4, 55-70.

Rajan, T. (1986). APT: A principled design for an animated view of program execution for novice programmers. HCRL Technical Report 19a, The Open University, Walton Hall, Milton Keynes, UK.

Rajan, T. (1990). Principles for the design of dynamic tracing environments for novice programmers. Instructional Science, 19(2/3), 377-406.

Rajan, T. (1991). An evaluation of APT: An Animated Program Tracer for novice Prolog programmers. Instructional Science, 20(2/3), 89-110.

Reed, S. K. (1985). Effect of computer graphics on improving estimates to algebra word problems. Journal of Educational Psychology, 77(3), 285-298.

Rigney, J. W., & Lutz, K. A. (1976). Effect of graphic analogies of concepts in chemistry on learning and attitude. Journal of Educational Psychology, 68(3), 305-311.

Schertz, Z., Goldberg, D., & Fund, Z. (1990). Cognitive implications of learning Prolog: Mistakes and misconceptions. Journal of Educational Computing Research, 6(1), 89-110.

Sime, M. E., Green, T. R. G., & Guest, D. J. (1977). Scope marking in computer conditionals: A psychological evaluation. International Journal of Man-Machine Studies, 9, 107-118.

Soloway, E., & Ehrlich, K. (1984). Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, SE-10(5), 595-609.

Spohrer, J. G., & Soloway, E. (1986). Analysing the high frequency bugs in novice programs. In E. Soloway & S. Iyengar (Eds.), Empirical Studies of Programmers: First Workshop. Washington, DC: Ablex.

Stasko, J., Badre, A., & Lewis, C. (1993). Do algorithm animations assist learning? An empirical study and analysis. In Proceedings of INTERCHI '93.

Suchman, L. A. (1987). Plans and Situated Actions: The Problem of Human-Machine Communication. Trowbridge, UK: Redwood.

Taylor, J. A. (1988). Programming in Prolog: An in-depth study of the problems for beginners learning to program in Prolog. PhD Thesis, University of Sussex.

Taylor, C., du Boulay, B., & Patel, M. (1991). Outline proposal for a Prolog 'Textual Tree Tracer' (TTT). Cognitive Science Research Paper 177, University of Sussex.

van Someren, M. W. (1990). What's wrong? Understanding beginners' problems with Prolog. Instructional Science, 19(4/5), 257-282.

Vessey, I. (1986). Expertise in debugging computer programs: An analysis of the content of verbal protocols. IEEE Transactions on Systems, Man and Cybernetics, 16(5), 621-637.

Wu, Q., & Anderson, J. R. (1991). Knowledge transfer among programming languages. In Proceedings of the 13th Annual Conference of the Cognitive Science Society.

Youngs, E. A. (1974). Human errors in programming. International Journal of Man-Machine Studies, 6, 361-376.