A Grammar Based Fault Classification Scheme and its Application to the Classification of the Errors of TeX

Richard A. DeMillo
Aditya P. Mathur

Software Engineering Research Center and Department of Computer Sciences
Purdue University
W. Lafayette, IN 47907

November 27, 1995

Abstract

We present a novel scheme for categorizing coding faults. Our grammar based scheme uses the notion of syntactic transformers and is automatable. The classification that results from our scheme can be used by researchers investigating the effectiveness of software testing techniques. In these respects our scheme is significantly different from several proposed in the past by other researchers. We have used it to categorize the ten year log of errors of TeX reported by Knuth. For each fault classified, we also provide, wherever possible, the precise substring that constitutes the fault. The entire error log and the associated program are in the public domain and hence our categorization can be verified. We also provide a fault classification algorithm that uses a top-down strategy to find the differences between two parse trees, annotated with syntactic transformers, to classify various faults. We claim that such an algorithm can be integrated within a software development environment and used as a low cost mechanism for monitoring and classifying faults.

Keywords: faults, errors, fault classification, TeX, grammar based scheme, software development environment.

 This manuscript supersedes an earlier version dated February 15, 1991. This research was supported in part by NSF grants 9102311-CCR and 9311862-CCR.

Contents

1 Introduction

2 Past Work
    2.1 Endres' scheme
    2.2 Basili and Perricone's scheme
    2.3 Ostrand and Weyuker's scheme
    2.4 Goodenough and Gerhart's scheme
    2.5 Knuth's scheme
    2.6 Marick's scheme

3 A Grammar Based Fault Classification Scheme
    3.1 Definitions and terminology
    3.2 Fault characterization
    3.3 Syntactic transformers
    3.4 Modelling faults using syntactic transformers
    3.5 A grammatical basis for fault classification
    3.6 A fault hierarchy
    3.7 Prioritizing transformers
    3.8 Special purpose transformers
    3.9 Fallibility of the classification scheme
    3.10 Semantic interpretation of faults

4 Classification of the Errors of TeX
    4.1 Knuth's error log and the classification process
    4.2 The fault classification scheme
    4.3 Fault persistence
    4.4 Comparison with other studies

5 Automating Fault Classification
    5.1 Definitions
    5.2 A fault classification algorithm

6 Miscellaneous issues
    6.1 Context dependency
    6.2 Hidden faults
    6.3 Missing versus incorrect entity faults
    6.4 Language dependency

7 Summary and future work

Acknowledgement

APPENDIX A: Syntactic Transformers for a Pascal-subset

APPENDIX B: Cross Listing of Faults and their Categories

List of Figures

1 A generic transformer.
2 Generic transformers: (a) Ti to model the simple incorrect entity fault and (b) Tm to model the missing entity fault.
2 Generic transformers (continued): (c) Ts to model the spurious entity fault and (d) Tp to model misplaced entity fault.
3 Identifying a (a) misplaced entity fault and (b) missing entity fault.
3 (Continued) Identifying (c) incorrect entity fault.
4 Modelling the missing expression fault using the grammar in Example 2.
5 Fault classification scheme used for classifying the errors of TeX.
6 Fault classification process.
7 Classifying a missing statement fault.
8 A fault classification algorithm: the main routine.
8 A fault classification algorithm (Contd.): procedures find_fnp and is_fnp.
8 A fault classification algorithm (Contd.): procedures apply and applym.
8 A fault classification algorithm (Contd.): procedures applys and applyi.
8 A fault classification algorithm (Contd.): procedure applyp.
9 Sample program P' in subset-Pascal.
10 Parse tree Pt' of P'.
11 Subtree for the actual parameter list y; x.
12 Subtree generating a misplaced statements sequence.
13 Subtree generating (a) an expression and (b) an actual parameter list.
14 Subtree generating an expression.
15 Subtree for identifier list within a type declaration.
16 Subtree for the if statement in Pt that has two missing statements, one incorrect identifier, and one incorrect operator.
17 Subtree containing an incorrect precedence fault that will be classified by our algorithm as a missing entity fault.

List of Tables

1 Various proposed fault classification schemes.
1 Various proposed fault classification schemes (Contd.).
2 Atomic faults in Example 2.
2 Errors of TeX not classified.
3 Some characteristics of data analysed in different fault studies.
4 Comparison with data from other classification schemes.
5 Syntactic transformers used for modelling faults using a Pascal-subset
6 Incorrect identifier faults
6 Incorrect identifier faults (Contd.)
7 Incorrect operator faults
8 Incorrect constant faults
9 Incorrect expression faults
10 Missing case faults
11 Spurious statement faults
12 Incorrect precedence faults
13 Incorrect loop faults
14 Missing condition faults
14 Missing condition faults (Contd.)
15 Missing conditional statement faults
16 Missing initialization faults
17 Missing assignment faults
18 Missing or incorrect goto faults
19 Incorrect placement faults
20 Incorrect procedure calls
21 Incorrect type faults
22 List of compound faults

1 Introduction

Errors creep almost invariably into every phase of the software life-cycle. They often manifest as faults in the code in simple and complex ways. These faults have the potential of causing program failure. Software testing research involves investigation into methods for determining if a program has any faults. Program debugging research involves investigation into methods to locate a fault once it is known to exist in a program. Often such investigations need a knowledge of the types and frequency of faults that can occur in a program. Such knowledge, for example, helps in estimating the effectiveness of a testing technique in revealing frequently occurring faults. In this paper we report the results of a study undertaken to (a) develop a fault classification scheme and (b) classify the errors of TeX reported by Knuth [17] using our classification scheme.

We use the standard IEEE [15] definitions of error, fault, and failure. An error is an action committed by the software developer that results in a program containing a fault. A fault (also bug) is a manifestation of an error. A failure is an incorrect behavior of the program with respect to some specifications. For example, suppose that a programmer assumed that the initial value of variable X should be -1 when the correct initial value is +1. This error resulted in a program containing the initialization X = -1 instead of X = 1. It is this statement that contains a fault. When the program executes it may result in an incorrect behavior for certain inputs. We do not consider syntax errors in programs as faults.

Our fault classification scheme, presented in Section 3, is significantly different from schemes reported earlier. We consider syntax to be the carrier of semantics. Syntax by itself is notation and semantics is expressed in a given notation. We believe this fact to be valid in the domain of software development. Thus any error, whether it arises in the requirements, design, or coding phase of software development, manifests itself as a syntactic aberration in the code comprising the program developed. This also implies that correction of a fault requires one or more editing operations to be performed on the code. A correction could be as simple as changing a "+" to a "-" or as complex as inserting an entirely new procedure and calls to it.

It is the above reasoning that forms the basis of our classification scheme. It makes our scheme significantly different from others reported so far in the literature in the sense that our scheme (i) permits specification of fault categories and (ii) can be automated. Thus, by using our fault classification scheme, a fault can be classified by a tool that observes the editing operations performed to make corrections in the program being developed. One such tool is described in Section 5.

Section 2 reviews earlier attempts at fault classification. Section 3 presents the notion of syntactic transformers and the grammatical basis of the fault classification scheme. The results of classifying the errors of TeX using our scheme appear in Section 4. We conclude in Section 7. Appendix A contains a detailed example of a fault classification scheme using a subset of Pascal. Appendix B provides details of the faults in TeX classified using our scheme.

Table 1: Various proposed fault classification schemes†.

Endres
    Group A:
        Machine configuration and architecture
        Dynamic behaviour and communication
        Functions offered
        Output listings and format
        Diagnostics
        Performance
    Group B:
        Initialization
        Addressability
        Reference to names
        Counting and calculating
        Masks and comparisons
        Estimation of range limits
    Group C:
        Placing of an instruction within a module
        Spelling errors in messages and commentaries
        Missing commentaries or flowcharts
        Incompatible status of macros or modules
        Not classifiable

Goodenough/Gerhart
        Logic
        Construction
        Specification
        Design
        Requirements
    Manifestation of logic error:
        Missing control flow path
        Inappropriate path selection
        Inappropriate or missing action

Ostrand/Weyuker
    Major category:
        Data definition
        Data handling
        Decision
        Decision and Processing
        Documentation
        System
        Not an error
    Type:
        Address
        Control
        Data
        Loop
    Presence:
        Omitted
        Superfluous
        Incorrect
    Use:
        Initialize
        Set
        Update

† The names of main error categories (set in bold in the original table) are the entries followed by a colon.

2 Past Work

Several researchers [5, 12, 13, 18, 20, 22] have reported categorization of errors/faults (see footnote 1) found in different phases of the software life-cycle. Even though each of the proposed classifications is different from the others, they all share two common problems: they are ambiguous and difficult to automate. Thus, for each scheme, one can find at least one example of a fault that could be classified into more than one category. Furthermore, each of these schemes requires a human to classify the fault, a process which could lead to errors in the classification itself. To corroborate the above claims let us briefly examine five of the fault classification schemes proposed earlier. Table 1 lists these five schemes.

Footnote 1. Researchers often claim to have categorized errors whereas they often categorize faults. Therefore we have referred to all the schemes reported in the literature and considered in this paper as fault classification schemes.

Table 1: Various proposed fault classification schemes (Contd.).

Basili/Perricone
        Initialization
        Control structure
        Interface
        Data
        Computation

Knuth
        A: Algorithm
        B: Blunder
        C: Cleanup
        D: Data structure debacle
        E: Efficiency enhancement
        F: Forgotten function
        G: Generalization
        I: Interactive improvement
        L: Language liability
        M: Mismatch between modules
        P: Promotion of portability
        Q: Quest for quality
        R: Reinforcement of robustness
        S: Surprising scenario
        T: Trivial typo

2.1 Endres' scheme

The fault classification scheme of Endres is application and machine dependent. Further, the software studied by Endres is a program that had several assembly language routines such as I/O drivers. Thus, Endres' scheme is almost unusable beyond the domain considered. To illustrate why, consider Endres' fault category B2, named addressability faults. For example, forgetting to initialize a given machine register is a fault in this category. This category is machine dependent and hence not useful for the categorization of faults in programs written in high level languages.

2.2 Basili and Perricone's scheme

Like the schemes proposed by other researchers, Basili and Perricone's (hereafter referred to as B/P) classification scheme [5] also suffers from ambiguities. One of their categories is control errors, which consists of errors that cause an incorrect path to be taken in a program module. Another category, named data errors, consists of errors that are the result of an incorrect use of a data structure. To point out the ambiguity, consider the following program segment containing one fault:

    ...
    A[j] = 0; i = 1;        (F1)
    ...
    if (A[i] < 0) ...
    ...

In the above segment, A[j] should be A[i]. As per the definition given above, this fault belongs to the data error category because an incorrect subscript, namely j, has been used to index A. However, the effect of this fault is that an incorrect path may be taken when the following if is executed. Thus, this fault also belongs to the control error category. One can find several such examples of faults which belong to more than one of the five categories listed by B/P.

2.3 Ostrand and Weyuker's scheme

Ostrand and Weyuker [20] (hereafter referred to as O/W) have proposed an attribute categorization scheme. They propose four attributes of a fault: its major category, its type, its presence, and its use. There are seven major categories. With each fault they associate one or more of the four attributes and corresponding values. The attributes are not independent. To illustrate possible ambiguity in this classification scheme, consider the following erroneous program fragment written in C:

    if (a = 1) p = q;        (F2)

The condition a = 1 in the above fragment should be a == 1. The incorrect program modifies the value of data, namely the variable a, and hence corresponds to major category data handling. However, the fault appears in the condition part of the if statement. Hence, it also falls under the major category decision, in spite of the fact that the fault has no effect on the truth value of the condition and the path followed thereafter. Notice that using B/P this fault can be classified as a control structure or a data fault. As another example, consider the following fragment that has a fault inside a loop:

    ...
    while (x == 0)
    {
        ...
        if (a < b) x = 0; else x = x + 1;
    }

The condition in the if statement should be a ≤ b instead of a < b. The major category of this fault is decision, as the fault is in code which evaluates a condition and branches according to the result. The type of this fault is branch, as the decision controls a 2-way branch, namely the one due to the if statement. However, the decision also controls the while loop because it updates the value of x, which in turn controls the loop. Hence one may also classify the type of this fault as loop.

One may argue that because the fault is not in the condition that directly controls the loop, namely x == 0, the type of this fault is not loop; instead it is branch and is therefore unique. This argument is certainly valid if the fault type is defined with respect to the syntax of the language and clearly implies that only faults inside the condition part of a while are of type loop. In the absence of such a rigorous definition one is likely to face the inherent ambiguity in the classification scheme.
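The loop example can be traced with a short sketch. It assumes that the intended relation in the if statement is `<=`; the function name `iterations`, the `limit` guard, and the input values are hypothetical and serve only to keep the illustration finite.

```python
def iterations(cond, a, b, limit=5):
    """Count iterations of `while (x == 0)` when the inner `if` uses `cond`.

    `limit` is a hypothetical guard so the sketch always terminates.
    """
    x, count = 0, 0
    while x == 0 and count < limit:
        count += 1
        if cond(a, b):
            x = 0          # the loop condition stays true
        else:
            x = x + 1      # the loop exits at the next test of x == 0
    return count


faulty = iterations(lambda a, b: a < b, 3, 3)     # the fragment as written
intended = iterations(lambda a, b: a <= b, 3, 3)  # assumed intended condition
print(faulty, intended)
```

With a equal to b, the faulty branch condition ends the loop after one pass while the assumed correct one keeps it running until the guard stops it, so a fault of type branch also changes the behaviour of the loop.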

2.4 Goodenough and Gerhart's scheme

The ambiguity in Goodenough and Gerhart's scheme [13] is illustrated by their own example. Consider the incorrect statement if (A) ... which should actually be if (A .AND. B) ... . Clearly there is a failure on the part of the programmer to test for a condition, namely B. Thus, this fault is of type missing control flow path. However, this is also of type inappropriate path selection because the condition has been expressed incorrectly. We admit, however, that from [13] it is not clear whether the authors are attempting to categorize the faults or sources of errors. They refer to an error category both as an error type and a source of error in implementation terms. Further, they have not distinguished between an error and a fault.
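As a quick check of this example, the following sketch enumerates the inputs on which the faulty test and the intended test disagree. It is only an illustration; the variable name `differing` is an assumption of the sketch.

```python
from itertools import product

# Inputs on which the faulty test `if (A)` and the intended test
# `if (A .AND. B)` select different paths.
differing = [(a, b) for a, b in product([False, True], repeat=2)
             if a != (a and b)]
print(differing)
```

The two tests disagree only when A holds and B does not. On that input the program takes the "then" path although the intended program would not, which can be read either as a path that was never tested for or as a path that was selected inappropriately.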

2.5 Knuth's scheme

Knuth's classification scheme also suffers from ambiguity. For example, categories B (blunder) and S (surprise) are difficult to use if someone other than the programmer is the fault classifier. Even a programmer may classify the same fault as either a surprise or a blunder. Categories G, I, P, Q, and E were useful in our error analysis. These enabled us to discard those entries from our study which were most likely due to a change in the specification of TeX. Sometimes it is not clear whether an entry in one of these categories should instead be considered as an error that crept into code due to either a design flaw or missing specifications. Entry 406 provides an example. It states: "Add new extra space parameter to all text fonts ... ." This has been categorized by Knuth into type Q (quality improvement). The change made in the code effectively resulted in extra space being added at the end of a sentence, following the period, in addition to the normal inter-word space. One could classify this error, using Basili's [5] scheme, into incorrect or misinterpreted requirements.

2.6 Marick's scheme

Marick's informally specified categorization scheme [19] is based on mutation operators such as the ones proposed in [9] and in [2]. Thus, the scheme is syntax-based. In this sense it is similar to our scheme. However, our scheme is more fine grained and we provide a formal mechanism for classifying a fault. Further, as claimed earlier, our mechanism is amenable to automation.

3 A Grammar Based Fault Classification Scheme

3.1 Definitions and terminology

Let $G = (N, \Sigma, R, S)$ be a context free grammar that defines the syntax of a programming language $L'$ under consideration (see footnote 2). Here $N$ denotes the set of non-terminal symbols, $\Sigma$ the set of terminal symbols, $R$ the set of rules, also known as productions, and $S \in N$ the start symbol. Each rule $r$ in $R$ is a mapping defined as $r : N \to (N \cup \Sigma)^*$. We use Greek symbols $\alpha, \beta, \ldots$ to denote strings over $(N \cup \Sigma)^*$. A Greek symbol with an overbar such as $\bar{\alpha}$ denotes a string in $\Sigma^*$ derivable from $\alpha$. Lower case letters $x, y, z, \ldots$ denote strings over $\Sigma^*$. We use the terms sentential form, sentence, derivation, leftmost derivation, rightmost derivation, derivation tree, and unambiguous grammar as in the literature [4]. $\Rightarrow$ stands for "derives a string in one step using leftmost derivation". $\Rightarrow^*$ and $\Rightarrow^+$, respectively, correspond to zero or more and one or more steps of derivation. Without any loss of generality, we provide all proofs and definitions using leftmost derivations.

Let $L(G)$ denote the language generated by $G$. Let $L' \equiv L'(G) \subseteq L(G)$. Informally, $L'(G)$ contains only those elements of $L(G)$ that satisfy all the static semantic constraints imposed by the language definition. We shall refer to an element of $L'$ as a valid program. Note that a valid program is not necessarily a correct program, i.e. one that satisfies its specification. In the remainder of our discussion we consider only valid programs. We also assume that $G$ is unambiguous and does not have any useless symbols. The latter assumption implies that each nonterminal in $N$ appears in a sentential form and derives a terminal string. We denote by $P_t$ the derivation tree of program $P$. $\epsilon$ denotes the empty string.

Footnote 2. The terminology used here to refer to a grammar and associated syntactic elements is borrowed from [4].

3.2 Fault characterization

Our fault classification scheme requires a more precise definition of a fault than the IEEE definition. The following example illustrates the ambiguity in the IEEE definition of a fault. Suppose that due to an error a statement has been written as a = b + c. The correct statement should have been a = b - c. We now ask: What exactly is the fault in the program containing this statement? Is it the fact that the above statement is incorrect? Or, is the incorrect statement itself the fault? Or, is the substring "+" the fault? One can perhaps imagine several other answers to the above question. The ambiguity arises because the IEEE definition does not ascribe any specific meaning to the word "manifestation" which occurs in the definition of a fault. Without any such meaning a fault remains an ambiguous entity. The following definition of a fault is used throughout this paper.

Definition 1 Let the incorrect program (see footnote 3) $P$ consist of the string $x_1 y_1 x_2 y_2 \ldots x_{k-1} y_{k-1} x_k y_k$ where any of the $y_i$ could be empty. Let $P'$ be obtained by substituting $y_i$ with $y_i'$, $1 \le i \le k$. A fault is characterized by $y_i$ if $y_i$ is nonempty, or by $y_i'$ if $y_i$ is empty.

Notice that our definition is purely syntactic. However, one may add semantic interpretations to different kinds of substrings that characterize faults. Our definition assumes that there are $k$, $k > 0$, faults in $P$. In practice these faults are likely to be detected a few at a time and corrected. In either case, whether the faulty substrings are replaced simultaneously by the correct substrings or not does not affect our definition of a fault. The fault classification scheme presented later in this section permits the identification of the substrings that characterize the faults.

Example 1 To illustrate our definition of a fault, consider an incorrect $P$ which consists of one statement a = b + c. Let $P'$ be obtained by editing $P$. $P$ can be written as $x_1 y_1 x_2 y_2$, where (see footnote 4) $x_1$: a = b, $y_1$: +, $x_2$: c, $y_2$: $\epsilon$. If $P'$ is a = b - c then $y_1'$: - and $y_2'$: $\epsilon$, and the fault is characterized by the string +. In case $P'$ is a = b - c + d then $y_1'$: - and $y_2'$: + d, and the two faults are characterized by the strings + and + d: the first because $y_1$ is nonempty, the second because $y_2$ is empty. If $P'$ is a = b + c; b = b - a then $x_1$: a = b + c, $y_1$: $\epsilon$, $y_1'$: ; b = b - a, and the fault is characterized by the substring ; b = b - a.

We now present our approach for characterizing faults in a program. It is based on the grammar of the programming language used for writing code.
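Definition 1 and the first and third cases of Example 1 can be sketched as follows. This is a minimal illustration that takes the decomposition into x and y substrings as given; the helper names `assemble` and `characterize` are assumptions of the sketch, and finding the decomposition automatically is the job of the parse-tree based algorithm of Section 5, not of this code.

```python
def assemble(xs, ys):
    """Interleave x1 y1 x2 y2 ... xk yk into one program string."""
    return "".join(x + y for x, y in zip(xs, ys))


def characterize(ys, ys_correct):
    """Apply Definition 1: y_i if it is nonempty, otherwise y_i'."""
    return [y if y else y_c
            for y, y_c in zip(ys, ys_correct)
            if y != y_c]          # unchanged positions carry no fault


# First case of Example 1: a = b + c  versus  a = b - c
xs, ys = ["a = b ", " c"], ["+", ""]
print(assemble(xs, ys))                    # the incorrect program P
print(assemble(xs, ["-", ""]))             # the corrected program P'
print(characterize(ys, ["-", ""]))

# Third case: a statement is missing from P, so y_1 is empty
print(characterize([""], ["; b = b - a"]))
```

In the sketch the empty Python string plays the role of the empty string, and a position where the incorrect and corrected substrings agree is treated as carrying no fault.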

3.3 Syntactic transformers

A syntactic transformer $T$ (hereafter referred to simply as transformer) is a mapping $T : \Sigma^* \to \Sigma^*$ such that if $T(x) = y$ and $axb \in L'(G)$ for arbitrary strings $a$ and $b$, then $ayb \in L'(G)$. Informally stated, a transformer, when applied to a substring of a valid program, results in another valid program. We require a transformer to be definable by an algorithm. An application of a transformer that leaves its input unchanged is input preserving.

Let $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ be a set of $n$ syntactic transformers. We associate each transformer with a nonterminal or a terminal symbol from $N \cup \Sigma$. The notation $\langle X \rangle T_i$, or $X\,T_i$ when there is no confusion, denotes that $T_i$ is associated with nonterminal $X \in N$. In sentences, $\langle x \rangle T_i$ implies that $T_i$ transforms the string $x \in \Sigma^*$ to a string in $\Sigma^*$, where $x$ is derivable from the $X$ associated with $T_i$. We allow multiple transformers to be associated with a non-terminal or a terminal. For example, $\langle X \rangle T_i T_j$ means that any one of the two transformers $T_i$ and $T_j$ may be applied to $X$.

We introduce transformers into the grammar by rewriting the set of rules $R$ to obtain the set $R_f$ such that each rule $r_f \in R_f$ is a mapping defined as $r_f : N \to (N \cup \Sigma \cup \mathcal{T})^*$. The modified grammar so obtained is a quintuple $G_f = (N, \Sigma, R_f, S, \mathcal{T})$. The domain of a transformer is the set of all strings that can be generated by the associated non-terminal. A derivation in $G_f$ proceeds exactly as a derivation in $G$ except that a sentential form derivable from $S$ using $R_f$ may contain embedded transformers. The next example illustrates the notion of transformers using a simple grammar.

Footnote 3. We say that a program $P$ is incorrect if for at least one test case it does not behave as intended.

Footnote 4. Read $x$: as "$x$ is the string".

Example 2 Let $G$ be a CFG that generates a program consisting of a sequence of one or more assignments and if statements. Each assignment is composed of scalar variables and the +, -, (, and ) symbols. Let $N = \{prog, aseq, stmt, cond, asgn, expr\}$, $\Sigma = \{\mathit{id}, +, -, (, ), ;, \mathit{if}, \mathit{then}\}$, $S = prog$, and the rule set $R$ is:

    prog ::= aseq
    aseq ::= aseq ; stmt | stmt
    stmt ::= asgn | cond
    asgn ::= id := expr
    expr ::= expr + id | expr - id | id | ( expr )
    cond ::= if expr then stmt

Let us consider the set $\{T_1, T_2, T_3, T_4, T_5, T_6, T_7\}$ of transformers defined below. We use the notation $\mathcal{S}\langle X \rangle$ to denote the set of all strings derivable from the nonterminal $X$.

    T1 : {+, -}    ->  {+, -}
    T2 : S<asgn>   ->  S<asgn> ∪ {ε}
    T3 : S<expr>   ->  S<expr>
    T4 : id        ->  id
    T5 : S<expr>   ->  S<expr>
    T6 : S<aseq>   ->  S<aseq>
    T7 : S<aseq>   ->  S<aseq>

The above mappings are defined such that $T_1$ maps an arithmetic operator to an arithmetic operator; $T_2$ maps an assignment to an empty string or does not change it; $T_3$ either removes a pair of parentheses from an expression or leaves the expression unchanged; $T_4$ maps an id to an id; $T_5$ removes the outermost parentheses in an expression or leaves it unchanged; $T_6$ permutes the input sequence of assignments; and $T_7$ maps a statement sequence to a statement sequence by changing a statement from assignment to conditional or vice-versa. We assume that for each transformer an input preserving application exists. Thus, for example, $T_3(x + y) = x + y$ is an input preserving application of $T_3$ whereas $T_3(x + y) = (x + y)$ is not (note the added parentheses). Hereafter, we assume that each transformer has an input preserving application and hence, instead of defining $T_i : X \to X \cup Y$, we define it as $T_i : X \to Y$. Using the above transformers, the rule set $R$ of $G$ can be modified to $R_f$ given below:

     1  prog ::= aseq T6 T7
     2  aseq ::= aseq T6 T7 ; stmt
     3  aseq ::= stmt
     4  stmt ::= asgn T2
     5  stmt ::= cond
     6  asgn ::= id T4 := expr T3 T5
     7  expr ::= expr T3 T5 + T1 id T4
     8  expr ::= expr T3 T5 - T1 id T4
     9  expr ::= id T4 | ( expr T3 T5 )
    10  cond ::= if expr T3 T5 then stmt
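The relationship between $R$ and $R_f$ can be sketched in a few lines of Python. The representation below (a list of left-hand side and right-hand side pairs, with transformers written as tokens T1 to T7) and the helper names `strip_transformers` and `attached` are assumptions made for illustration; they are not the data structures of the classification algorithm in Section 5.

```python
import re

# R_f from Example 2, one string per alternative; T1..T7 are transformers.
R_F = [
    ("prog", "aseq T6 T7"),
    ("aseq", "aseq T6 T7 ; stmt"),
    ("aseq", "stmt"),
    ("stmt", "asgn T2"),
    ("stmt", "cond"),
    ("asgn", "id T4 := expr T3 T5"),
    ("expr", "expr T3 T5 + T1 id T4"),
    ("expr", "expr T3 T5 - T1 id T4"),
    ("expr", "id T4"),
    ("expr", "( expr T3 T5 )"),
    ("cond", "if expr T3 T5 then stmt"),
]

TRANSFORMER = re.compile(r"T\d+")


def strip_transformers(rhs):
    """Drop transformer symbols, leaving the right-hand side used in R."""
    return " ".join(tok for tok in rhs.split()
                    if not TRANSFORMER.fullmatch(tok))


def attached(rhs):
    """Pair each grammar symbol with the transformers that follow it."""
    pairs = []
    for tok in rhs.split():
        if TRANSFORMER.fullmatch(tok) and pairs:
            pairs[-1][1].append(tok)   # transformer belongs to previous symbol
        else:
            pairs.append((tok, []))
    return pairs


R = [(lhs, strip_transformers(rhs)) for lhs, rhs in R_F]
print(R[6])
print(attached(R_F[5][1]))
```

Erasing the transformer tokens recovers the rules of $G$, which mirrors the statement that a derivation in $G_f$ proceeds exactly as a derivation in $G$ apart from the embedded transformers.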

Using $R_f$ we derive the following program consisting of a sequence of two assignments interspersed with the transformers.

    ⟨ (((id T4 := ⟨ (id T4 T3 T5 - T1 id T4) ⟩ T3 T5) T2) T6 T7 ;        (1)
    ⟨ (id T4 := (id T4 T3 T5 - T1 id T4) ⟩ T3 T5) T2) ⟩ T6 T7            (2)

Substituting arbitrary identifiers for the id's in the above program and using the input preserving application for each of the transformers, we obtain the following program:

    x := x - y;        (3)
    y := y - x         (4)

Through the above example we have shown how transformers can be incorporated into a CFG and how the modified grammar can be used to derive programs interspersed with transformers. For any $z_f \in L(G_f)$, we use $z$ to denote the string obtained by selecting an input preserving transformation for each transformer in $z_f$. We state the following lemma without proof.

Lemma 1 For any sentence $x \in L'(G)$ there exists a sentence $y_f \in L(G_f)$ such that $x = y$.

3.4 Modelling faults using syntactic transformers

Let $P, P' \in L'(G)$ be two valid programs where $P'$ is obtained by applying one or more editing operations (see footnote 5) to $P$. From Lemma 1 we can find a program $P_f' \in L(G_f)$ such that $P_f'$ reduces to $P'$ if the input preserving application is used for each occurrence of a transformer in $P_f'$. Let $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ be the set of $n$ transformers used in $G_f$.

Definition 2 If $y$ is a substring of $P = xyz$ that characterizes a fault $f$ in $P$ and there exists a substring $y'$ in $P' = xy'z$ such that for some $T_i \in \mathcal{T}$, $T_i(y') = y$, then $T_i$ is said to model $f$. Such a fault is considered to be an atomic fault in $P$.

Example 3 Consider the grammars $G$ and $G_f$ in Example 2 and the two-statement program $P'$ with assignments listed in (3) and (4). Now suppose that $P'$ is correct and the incorrect program $P$ is:

x := x - y;    (5)
y := y + x     (6)

A comparison of (6) with (4) reveals that the string characterizing the fault in $P$ is +. Examining (2), we see that by selecting the application $T_1(-) = +$ for the second occurrence of $T_1$, and input preserving applications for all other occurrences of transformers in (2), we obtain $P$. As another example, assume instead that the incorrect program $P$ is:

x := x - y

In this case the substring characterizing the fault is y := y - x. If we select $T_2(y := y - x) = \epsilon$, where $\epsilon$ denotes the empty string, for the second occurrence of $T_2$ and input preserving transformations for all other occurrences of transformers, we obtain $P$. In yet another example, assume that the incorrect program $P$ is:

y := y - x;    (7)
x := x - y     (8)

In this case applying $T_6$ to $P'$ leads to $P$. This is an example of a misplaced assignment. Table 2 lists the atomic faults modelled by the transformers in Example 2.

⁵ Our discussion does not depend on what edit operations are applied. We assume that these could be any of the operations provided by a typical editor used by the programmer during program development.

Table 2: Atomic faults in Example 2.

Syntactic transformer    Fault modelled
T1                       Incorrect operator
T2                       Missing or spurious assignment
T3 and T5                Incorrect precedence
T4                       Incorrect identifier
T6                       Misplaced statement
T7                       Incorrect statement
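The three faulty programs of Example 3 can be reproduced with a small Python sketch. It is a simplified illustration under stated assumptions: programs are lists of token lists, and the helper names `apply_t1`, `apply_t2`, `apply_t6`, and `render` are hypothetical, standing in for one non-input preserving application of $T_1$, $T_2$, and $T_6$ respectively while every other transformer is applied in its input preserving form. Modelling $T_6$ as a swap of two statements is an assumption made for this example.

```python
def render(statements):
    """Join token lists into program text."""
    return "; ".join(" ".join(tokens) for tokens in statements)


def apply_t1(statements, index, new_operator):
    """T1 at statement `index`: replace its additive operator.

    Each statement in this example contains exactly one operator.
    """
    out = [list(tokens) for tokens in statements]
    out[index] = [new_operator if tok in ("+", "-") else tok for tok in out[index]]
    return out


def apply_t2(statements, index):
    """T2 at statement `index`: map the assignment to the empty string."""
    return [list(tokens) for k, tokens in enumerate(statements) if k != index]


def apply_t6(statements, i, j):
    """T6: exchange statements i and j (assumed model of a misplaced statement)."""
    out = [list(tokens) for tokens in statements]
    out[i], out[j] = out[j], out[i]
    return out


# The correct program P' formed by assignments (3) and (4).
correct = [["x", ":=", "x", "-", "y"], ["y", ":=", "y", "-", "x"]]

print(render(apply_t1(correct, 1, "+")))  # program (5)-(6)
print(render(apply_t2(correct, 1)))       # the single-assignment program
print(render(apply_t6(correct, 0, 1)))    # program (7)-(8)
```

Each call leaves `correct` unchanged, so the same correct program can be used to generate all three faulty variants.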

Definition 3 If $x$ is a substring of $P$ that characterizes a fault $f$ in $P$ and there exists a substring $y$ in $P'$ such that a non-empty subset $\mathcal{T}' = \{T_{i_1}, T_{i_2}, \ldots, T_{i_k}\}$, $k > 1$, of $\mathcal{T}$ is needed for non-input preserving transformation to transform $y$ to $x$, then $\mathcal{T}'$ is said to model fault $f$. Such a fault is considered to be a compound fault in $P$.

Example 4 Referring again to Example 2, suppose now that the correct program $P'$ is:

x := x + y    (9)

As before, let the sequence of assignments in (3) and (4) constitute the incorrect program $P$. By examining the program shown in (2) we note that the following transformations model this compound fault:

$T_1(+) = -$    (10)
$T_2(y := y - x) = \epsilon$    (11)

where $\epsilon$ denotes the empty string. Here we have used the first of the two occurrences of $T_1$ and the second of the two occurrences of $T_2$ in (2); input preserving applications are used for the remaining transformers. Thus $T_1$ and $T_2$ together model a compound fault in $P$.

3.5 A grammatical basis for fault classification

To develop a set of fault categories, we examine the program development process from an abstract viewpoint. Composition of a program can be viewed as composition of a string belonging to the set of valid programs in the underlying programming language. The composition itself is carried out using multiple applications of two operations: selection and sequencing. Selection involves selecting a syntactic entity from the set of valid syntactic entities. Examples of valid syntactic entities in the language specified in Example 2 are an assignment, an additive expression, and an identifier. Sequencing involves placing the selected entities in an order. For example, having selected two assignments x := x - y and y := y - x, one places them in a suitable order to obtain the program composed of (3) and (4).

Figure 1: A generic transformer.

Faults are corrected by a sequence of editing transformations such as "replace a string by another". One question that arises while modelling faults is: Is it possible to model every fault that can be corrected by any sequence of editing transformations by a syntactic transformer? The next lemma provides an answer in the affirmative.

Lemma 2 Any fault in a program can be modelled by a syntactic transformer.

Proof: Let $X \in N$ be a non-terminal defined as:

$X ::= \alpha_1 \mid \alpha_2 \mid \ldots \mid \alpha_{n-1} \mid \alpha_n$    (12)

We associate a generic transformer $T_g$ with each nonterminal defined as above. $T_g$ is defined in Figure 1. Let $P' = xy'z$ be the program obtained by removing a fault from $P = xyz$. Assuming that both $P'$ and $P$ are valid programs, for some $X \in N$ and $i, j \le n$, $i \ne j$, there must exist the following two derivations in $G$ and $G_f$, respectively:

$S \Rightarrow \delta X \gamma \Rightarrow x X \gamma \Rightarrow x \alpha_j \gamma \Rightarrow x y \gamma \Rightarrow xyz$    (13)
$S \Rightarrow \delta_f X T_g \gamma_f \Rightarrow x_f X T_g \gamma_f \Rightarrow x_f \alpha_i T_g \gamma_f \Rightarrow x_f y'_f \gamma_f \Rightarrow x_f y'_f z_f$    (14)

Applying $T_g$ to $X$, thereby replacing $\alpha_i$ by $\alpha_j$, and using input-preserving transformations for all other occurrences of $T_g$, (14) reduces to (13). This shows that $T_g$ models any fault in $P$. ⟨End of proof⟩

The above lemma lays the foundation for the grammatical modelling of faults. As $P'$ can be obtained from $P$ by a sequence of editing transformations that correct one or more faults in $P$, the lemma ensures that any fault that can be corrected can be modelled by a syntactic transformer.

However, the lemma is not directly useful for fault classification. $T_g$ models just one fault category, namely the selection fault. It may be of no practical use to categorize all faults in a program as selection faults. However, as shown in the next section, a variety of faults can be considered as special cases of the selection fault. Below we provide a finer fault hierarchy.

3.6 A fault hierarchy

Let $P$ be the program under test and $P'$ the program obtained by the removal of one or more faults in $P$ using a sequence of editing operations. As before, we assume that both $P$ and $P'$ are valid programs. Let $X \in N$ be a nonterminal defined by the following $n$, $n > 1$, rules:

$X ::= \alpha_1 \mid \alpha_2 \mid \ldots \mid \alpha_n$    (15)

Let $P'$ be the string derived from the start symbol $S$ as follows:

$S \Rightarrow^{+} r X \gamma \Rightarrow r \alpha_i \gamma \Rightarrow rst$ where $1 \le i \le n$, $\alpha_i \Rightarrow s$ and $\gamma \Rightarrow t$    (16)

Notice that in the above derivation $X$ has been replaced⁶ by the string $\alpha_i$. Given an unambiguous grammar and a parsing strategy (top down or bottom up), the above derivation is unique for $P'$.

Definition 4 $P$, consisting of the string $rs't$, is said to contain a simple incorrect entity fault w.r.t. $P'$ if $P'$ is derived as in (16) and $P$ can be derived as follows:

$S \Rightarrow^{+} r X \gamma \Rightarrow r \alpha_j \gamma \Rightarrow rs't$ where $1 \le i, j \le n$, $i \ne j$, $\alpha_j \Rightarrow s'$ and $\gamma \Rightarrow t$    (17)

Next, let us assume that nonterminal $X$ is recursively defined by the following rules:

X ::= X j

(18)

Let $P'$ be derived as:

S =) uX =)i ui+1 X i =) r i =) rstv where i  0; u i  ui+1 ; ui+1 =) r; =) s; i =) t; and  =) v

(19)

Definition 5 $P$ is said to contain a missing entity fault w.r.t. $P'$ if $P'$ is derived as in (19) and $P$ can be derived as follows:

⁶ Note that replacing $\alpha_i$ by $\alpha_j$ implies the transformation $T_g(x) = y$ where $\alpha_i \Rightarrow x$ and $\alpha_j \Rightarrow y$.

S =) uX =)(i ? 1) uiX i?1  =) ui i?1 =) r0 st0 v where u i?1 =) ui ; i?1 =) t0 ; ui =) r0; =) s; i?1 =) t0 ; and  =) v

(20)

Notice that the derivation of $P$ contains one less application of the recursive rule defining $X$.

Definition 6 $P$ is said to contain a spurious entity fault w.r.t. $P'$ if $P'$ is derived as in (19) and $P$ can be derived as:

S =) uX =)i ui+1 X i =) r X i  =) rr0 i =) rr0st0 tv (21) where i  0; u i  ui+1 ; ui+1 =) r; =) s; i =) t;  =) v; =) s0 ; and =) t0

Definition 7 Program $P$ is said to contain a misplaced entity fault w.r.t. $P'$ if $P$ is a permutation of $P'$.

Definitions 4 through 7 do not provide a unique definition of any fault type. Thus, as an example, a "misplaced entity" fault can also be considered a "simple incorrect entity" fault. We return to this problem in Section 3.7. The next four lemmas guarantee the existence of transformers that model each one of the four faults defined above.

Lemma 3 There exists a transformer that models the simple incorrect entity fault.

Proof: For each rule in the rule set $R$ of $G$, which is of the type:

$Z ::= \delta X \gamma$    (22)

and the nonterminal $X$ is defined as:

$X ::= \alpha \mid \beta$    (23)

we rewrite (22) as

$Z ::= \delta X T_i \gamma$    (24)

We then add all such rewritten rules of type (24) to rule set $R_f$ of $G_f$. The remaining rules in $R$ are also added to $R_f$. To show that $T_i$ defined as in Figure 2(a) models the simple incorrect entity fault, suppose that $P$ is the string $uv'w$ and $P'$ is $uvw$. We can derive $P'_f$ as:

S =) uZ =) uX Ti  =) u Ti  =) u0 vTi w

(25)

Applying $T_i$ as per its definition, we transform (25) to a derivation for $P$. Hence $T_i$ models the simple incorrect entity fault.

Figure 2: Generic transformers: (a) $T_i$ to model the simple incorrect entity fault and (b) $T_m$ to model the missing entity fault.

Figure 2: Generic transformers (continued): (c) $T_s$ to model the spurious entity fault and (d) $T_p$ to model the misplaced entity fault.

Lemma 4 There exists a transformer that models the missing entity fault. Proof : For each rule in the rule set R of G, which is of the form:

Z ::= X

(26)

and there are at least two rules of the type:

X ::= Y j 

(27)

Z ::= X Tm

(28)

where Y =) X 0 . We rewrite (26) as

We then add all such rewritten rules of the type (28) to rule set $R_f$ of $G_f$. All the remaining rules in $R$ are also added to $R_f$. We show that $T_m$ as defined in Figure 2(b) models the missing entity fault. Let programs $P'_f$ and $P$ be the strings $uxy'ywz'zv$ and $uxywzv$. Consider the following derivation for $P'_f$:

S =) uZ =) uX Tm  =) u Y Tm  =) u  X Tm 0 =) u( 0 )i X (Tm 0 )i  =) u( 0 )i 0 X Tm 0 (Tm 0 )i  =) uxy 0 ycz 0zv where x = ( 0 )i ; y 0 = 0 ; X =) ywz; 0 =) z 0 ; and ( 0 )i =) z;

 =) v; i  0

(29)

Selecting a non-input preserving application only for the underlined occurrence of $T_m$, the above derivation reduces to that for $P$. ⟨End of proof⟩

Lemma 5 There exists a transformer that models the spurious entity fault.

Figure 2(c) provides a definition of $T_s$. Derivation (29) can be used to show that $T_s$ models the spurious entity fault.

Lemma 6 There exists a transformer that models the misplaced entity fault.

Figure 2(d) provides a definition of $T_p$. Derivations (16) and (17) can be used to show that $T_p$ models the misplaced entity fault.

The next theorem guarantees that any fault that can be corrected by a sequence of one or more editing operations can be modelled, and hence classified, using the grammar based scheme.

Theorem 1 The set of transformers consisting of $T_i$, $T_m$, $T_s$, and $T_p$ models any fault in program $P$ that can be corrected by a sequence of one or more editing operations.

The proof of the above theorem follows from Lemmas 3 through 6.

3.7 Prioritizing transformers

It is clear from the definitions of the transformers that several non-terminals may have more than one associated transformer. It is possible that more than one of these transformers can be applied to reduce a correct program $P'_t$ to its incorrect version $P_t$. This may lead to ambiguity in fault classification. For example, the two rules defining $X$ in (23) have the same pattern as the rules defining $X$ in (27). Thus, if a node in the parse tree of $P'_t$ is labelled $X$, both $T_i$ and $T_p$ can be used to transform the subtree rooted at $X$ to the corresponding subtree in $P_t$.

To avoid any ambiguity in fault classification, we prioritize the transformers. Thus, if two or more transformers are associated with a nonterminal, the one with the highest priority will be applied first. If its application succeeds in reducing the subtree to the desired subtree, then the fault is of the corresponding type. If not, then the transformer with the next highest priority is applied, and so on. We accord priorities in the following descending order: $T_p$, $T_m$, $T_s$, and $T_i$.

Figure 3 illustrates the application of transformers assuming the correct priority. In Figure 3(a) we apply $T_6$ to reduce $P'_t$ to $P_t$. Thus, the fault in $P$ is a misplaced entity. Note that $T_7$ can also be applied to achieve the same tree reduction, which would classify the fault in $P$ as an incorrect entity. Similarly, in Figure 3(b) we use $T_2$ to perform the tree reduction and classify the fault as a missing statement fault. In Figure 3(c) we show the identification of an incorrect entity fault. Notice that in this case no transformer other than $T_7$ is applicable.
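The priority rule can be illustrated with a short Python sketch. It is a deliberately simplified model, not the tree-based procedure of the paper: programs are flat lists of statement strings, and the predicates `t_p`, `t_m`, `t_s`, `t_i` together with `classify` are hypothetical stand-ins for "some non-input preserving application of this transformer reduces the correct program to the faulty one".

```python
from collections import Counter


def t_p(correct, faulty):
    """Misplaced entity: the same statements in a different order."""
    return correct != faulty and Counter(correct) == Counter(faulty)


def t_m(correct, faulty):
    """Missing entity: faulty equals correct with exactly one statement removed."""
    return any(correct[:k] + correct[k + 1:] == faulty for k in range(len(correct)))


def t_s(correct, faulty):
    """Spurious entity: faulty equals correct with exactly one extra statement."""
    return t_m(faulty, correct)


def t_i(correct, faulty):
    """Incorrect entity: same number of statements, but the text differs."""
    return len(correct) == len(faulty) and correct != faulty


# Descending priority, as stated in the text: T_p, T_m, T_s, T_i.
PRIORITY = [
    ("misplaced entity", t_p),
    ("missing entity", t_m),
    ("spurious entity", t_s),
    ("incorrect entity", t_i),
]


def classify(correct, faulty):
    """Return the category of the first transformer, in priority order, that applies."""
    for name, applies in PRIORITY:
        if applies(correct, faulty):
            return name
    return "unclassified"


correct = ["x := x - y", "y := y - x"]
swapped = ["y := y - x", "x := x - y"]
print(classify(correct, swapped))
```

In this sketch both `t_p` and `t_i` hold for the swapped program, mirroring the situation in Figure 3(a) where two transformers achieve the same reduction; the ordering of `PRIORITY` is what resolves the ambiguity in favour of the misplaced entity category.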

3.8 Special purpose transformers

So far we have defined only four fault classes. It is possible to define transformers that are special cases of one of $T_i$, $T_m$, $T_s$, and $T_p$ and thus model a smaller subclass of faults. For example, a missing expression fault is a special case of the missing entity fault modelled by $T_m$. To identify such a fault in programs written using the grammar of Example 2, we can associate $T_9$ with the non-terminal expr. The definition of $T_9$ appears in Figure 4. The transformation performed by $T_9$ is identical to that of $T_m$ except that $T_9$ is applicable only to tree nodes that are labelled by expr. Thus, a non-input preserving application of $T_9$ implies a missing expression fault. We consider the "meaning" associated with a transformer, such as missing expression fault, as its semantic interpretation. Appendix A provides examples of several such special purpose transformers used for classifying faults in programs written in a subset of Pascal. Transformers for all fault categories in Figure 5 are shown in Appendix A.

Figure 3: Identifying (a) a misplaced entity fault and (b) a missing entity fault.

Figure 3 (continued): Identifying (c) an incorrect entity fault.

Figure 4: Modelling the missing expression fault using the grammar in Example 2.

3.9 Fallibility of the classification scheme

To categorize a fault $f$ in $P$, our classification scheme relies on an examination of the syntax of program $P'$ obtained by possibly removing $f$ from $P$. Such an examination determines the string that characterizes $f$. To understand an inherent problem with this approach, suppose that the statement a := a - b in $P$ is incorrect and the correct statement should be a := a + b. One would therefore expect "-" to be the characteristic string. Assume now that, for some reason, the correction made replaces a := a - b by the sequence a := a - b; a := a + 2 * b. Even though the error that caused the fault has been removed, the string characterizing the fault is "; a := a + 2 * b". This string will be categorized as a missing entity fault. We believe, however, that corrections that mask the true nature of the fault, such as the one above, would be rare in practice and therefore, when used, our scheme will provide a fairly accurate fault classification.

3.10 Semantic interpretation of faults

Notice that we have not given any semantic interpretation to the four fault categories. Thus, for example, an initialization or a data handling fault can occur due to any one of the four types of errors. Similarly, a missing path could arise due to a fault that belongs to any of the four categories. Initialization of data, handling of data, and missing path are all semantic interpretations of a fault. They describe the effect of a fault on program behavior and not the fault itself.

The B/P scheme, which categorizes all errors into those of commission and omission, has a close correspondence with the above four categories. Thus, the errors of omission correspond to our missing entity fault category. Also, their errors of commission correspond to all of the three remaining fault categories in our scheme. However, the more detailed categorization of B/P as shown in Table 1 has no correspondence with ours. The O/W scheme also has the omitted and superfluous categories that correspond to, respectively, our missing and spurious entity categories. However, the remaining categories of O/W in Table 1 do not have any such correspondence with ours.

In the next section we further subdivide our four major categories based on the syntactic entity within which a fault manifests. This subdivision leads to a hierarchical fault classification scheme.

4 Classification of the Errors of TeX

We now describe the fault classification scheme used for classifying the errors of TeX. Our scheme is based on the notion of transformers introduced in the previous section. We then use this scheme to categorize the errors reported by Knuth.

4.1 Knuth's error log and the classification process

The error log reported by Knuth in [17] spans about ten years and consists of 867 entries. Almost every entry refers to a section number⁷ in the source code listed in [16]. The first entry was logged on March 10, 1978 and the last one on November 6, 1988. We used three documents to help us in the classification process. The starting document was the error log [17]. After examining a log entry we looked at the program listing [16]. This helped us, in many cases, find the category of the fault and the incorrect string that characterized the fault. However, to categorize log entries 841 to 867 we examined the source listing of version 2.95 of TeX. A comparison of the version 2.95 source and the source given in [16] was generally sufficient to classify the fault and find the characteristic strings.

Entries categorized by Knuth in the following categories were not classified: cleanup, efficiency, generality, interaction, quality, and portability. This by no means implies that the use of formal software testing methods will not reveal sections of code that could eventually lead to program changes to improve, for example, the efficiency or generality of the program. We decided not to include these entries in our classification as they are not due to program failures. Instead, they appear to have been motivated by Knuth's desire to "improve" aspects of TeX such as the quality of the output produced, the mode of interaction with the user of TeX, features provided, and portability to machines other than the one on which TeX was first tested. Thus, it was not clear whether these entries should be called errors or not. The total number of such entries, as shown in Table 2, is 495. In addition, a total of 80 entries could not be classified with the help of the three documents mentioned above.

One difficulty we faced was that the program listed in [16] and the source of version 2.95 correspond to TeX82. However, entries 1 to 519 correspond to TeX78.
Even though several parts of the code remain unchanged in both versions of TeX, one often finds cases where either (i) the feature was obsolete, e.g. entry 62, or (ii) the section number listed in the entry does not appear to contain any code that matches the text of the entry, e.g. entry 109. Reason (ii) was also why we could not classify some entries that correspond to TeX82, e.g. entry 855. Below are some examples of entries that could not be classified⁸:

1. Entry 62: "Make start input set up job name in the form needed by shipout; it uses obsolete conventions." In sections 532 and 537 of [16] we could not identify the code segment that was modified.
2. Entry 109: "Make more error checks in endv, e.g. it should not occur in macro definition or call." Once again we could not identify the code segment corresponding to this entry.
3. Entry 855: "Fix a typo in the initialization of hyphenation tables ..." This entry does not provide sufficient information to determine the characteristic string. One could perhaps classify this entry as an incorrect entity fault, but it is not possible to find, from the documents we have, which entity was incorrect.

Some entries could be classified but the substring characterizing the fault could not be determined. For example, entry 322 has been classified as an incorrect constant fault. However, the incorrect constant is not clear from the description. As another example, entry 134 says: "Don't test for no pages output by looking at the channel status." An examination of the code revealed that in the correct version the test is the condition total pages = 0, and thus this has been classified as an expression (condition) fault. However, the incorrect expression (condition) could not be identified. In Appendix B, such entries are marked with the dagger (†) symbol.

Table 2: Errors of TeX not classified.

Error type                  Total found
Ignored/Not classified      570
Cleanup                     120
Efficiency                  41
Generality                  109
Interaction                 135
Quality                     67
Portability                 23
Could not be classified     80

⁷ This systematic reference to section numbers in the TeX source proved to be an extremely useful feature during the classification process. Without this information, locating the regions that contained the fault would have been practically impossible for us. However, there are several entries where the section number listed does not correspond to the section number in [16], perhaps due to the evolution of the program. Such entries did create significant problems in fault classification.

4.2 The fault classification scheme

Fig. 5 shows the different fault types and their hierarchy. As mentioned earlier, we classify a fault into one of the four major categories: missing entity, spurious entity, misplaced entity, and incorrect entity. An "entity" in our naming convention refers to the substring that characterizes the fault. Let us examine each of these major categories in more detail.

⁸ The entries reproduced from [17] are inside quote marks.

Spurious entity. A fault whose correction requires the removal of its characteristic substring falls under the spurious entity category. As an example, entry 27 is classified into the spurious entity category. It states "Delete spurious call to flush list in end token list."

Missing entity. A fault whose correction requires the insertion of a syntactic entity into the incorrect program falls under the missing entity category. A missing entity could be a sequence of statements, a single statement, an expression, or a unary operator. These four syntactic entities form the sub-categories of the missing entity category. For example, entry 851 is classified into missing code sequence. The missing code sequence (see section 806 of [16] for the code that precedes the following code) is listed below:

    if o ≠ 0 then
    begin
      r := link(q); link(q) := null;
      q := hpack(q, natural);
      shift_amount(q) := o;
      link(q) := r; link(s) := q;
    end;

If only a single statement such as an assignment or a conditional statement is missing, the fault is categorized as a missing statement. A missing initialization is also counted as a missing statement because the initialization is carried out by an assignment. For example, entry 201 is classified as a missing statement. The missing statement is listed in Table 16. A fault in which a part of an expression on the right side of the assignment or a part of a condition is missing is classified as a missing expression. Table 14 lists several examples. A missing unary operator is the last category under the missing entity hierarchy. We did not find any entries that could be classified into this category.

Misplaced entity. If the correction of a fault requires a change in its position within the code, it is classified as a misplaced entity fault. Entry 421 is classified as an incorrect placement fault. The entry states "Move command and 'first mark -1' from vpackage to fire up." Both vpackage and fire up are procedures. Similarly, entry 836 reports a conditional statement being misplaced within procedure normalize selector.

Incorrect entity. When a fault cannot be classified as missing entity, misplaced entity, or spurious entity, it is classified into the incorrect entity category. Here we distinguish between an incorrect type and an incorrect algorithm. Incorrect array size and incorrect type declaration are classified as type faults. Other faults in the program are classified as incorrect algorithm. We have a hierarchy of fault types under the incorrect algorithm category. Use of an incorrect variable, constant, or operator is at the bottom of this hierarchy. Tables 6, 7, and 8 provide several examples of such faults found in TeX. If there is more than one such fault within an expression, or a combination of these and/or precedence faults, then we classify it into the expression fault category. Table 9 provides examples of such faults. If the fault is within a statement but cannot be classified into the expression fault category, then it is classified as a statement fault. A call to an incorrect procedure is one example of a statement fault. Entry 228, for example, reveals that procedure free node was called instead of free avail.

When a fault cannot be classified into any one of the categories mentioned above, it is considered an algorithm fault. Such a fault may arise, for example, due to a missing procedure and its corresponding call, a combination of missing and/or incorrect code, etc. Entry 854, for example, reports an incorrect algorithm for fixed point multiplication. Due to lack of sufficient description, we found it almost impossible to work out the details of several entries classified as incorrect algorithm. However, some of these faults, notably entries 554 and 854, have been described in some detail by Knuth [17]. As an example, entry 858 is classified as an algorithm error⁹. The correction made as a result of this fault was to add a sequence of statements (an assignment and a loop) just before a loop and an assignment just after this loop.

4.3 Fault persistence

We define the persistence of a fault as the time period over which it remained undetected. Below we examine different fault categories and their persistence characteristics. In all, a total of 291 faults were classified. There was a comparatively high percentage (11.3%) of incorrect identifiers. An examination of Table 6 in Appendix B reveals that 29 of 33 such faults were detected in the first four months of testing. The remaining four were detected over the next four years. No such fault was reported during the last five years of testing. From this data we can conclude that even though the number of incorrect identifier faults is significant in a program during the initial stages of testing, they do not appear to persist over a relatively long period.

⁹ This code was missing from section 260 in [16].

[Figure 5 is a tree rooted at Fault (291/100); each node is annotated with count/percentage. Top-level nodes: Incorrect entity (155/53.26), Missing entity (118/40.54), Misplaced entity (16/5.5), Spurious entity (2). Lower-level node labels: Type (9); Algorithm (74/25); Identifier (33/11.3); Constant (6/2); Operator (7/2.4); Precedence (6/2); Compound (15/5.1); Parameter (2); Call (2); Loop (1); Pointer use (2); Statement (14/4.8); Case (3); Others (26/8.9); Call (7/2.4); Initialization (14/4.81); Others (22/7.5); Goto (2); Conditional (12/4.1); Condition (32/11); Others (0); Unary operator (0); Parameter (0).]

Figure 5: Fault classification scheme used for classifying the errors of TeX.

Table 3: Some characteristics of data analysed in different fault studies.

Characteristic                         B/P        O/W          Marick      DeMillo/Mathur
Programming language                   Fortran    High level†  C and C++   SAIL/WEB/Pascal¶
Program size (lines of code)           90,000‡    10,000§      Unknown     4600‖
Number of faults                       215        173          102         291
Duration of data collection (months)   33         10           Unknown     128

† Language not specified in [20].
‡ Comment lines included.
§ In addition, 1000 assembler instructions. Whether comments were included is not mentioned in [20].
¶ TeX78 in SAIL, TeX82 in WEB/Pascal.
‖ 23,965 including comments. 21,573 lines of equivalent C code without comments.

We now examine the persistence over the last five years of testing of the missing condition, missing initialization, and missing conditional statement faults from Tables 14, 15, and 16. There is a total of 32 missing condition faults, which is approximately the same as the number of incorrect identifier faults. However, 12 out of these (37.5%) have persisted over the last five years. Also, 5 out of 12 (41.6%) of the missing conditional statement faults and 6 out of 14 (42.8%) of the missing initialization faults persisted over the last five years of testing. Of the compound faults, 12 out of 26 (46.1%) missing code sequence faults and 16 out of 74 (21.6%) of incorrect algorithm faults persisted during this period. The above observations lead us to claim that the percentage of faults in a category is not the only attribute that lends importance to the category. The persistence of a fault type appears to be an important fault characteristic.
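The persistence figures quoted above are simple ratios and can be recomputed directly. The following Python sketch uses only the counts given in the text; the dictionary `PERSISTENCE` and the function `persistence_percent` are names introduced here for illustration. The recomputed values agree with the quoted ones up to the last digit, which suggests the quoted percentages were truncated rather than rounded.

```python
# (faults still being found in the last five years, total faults in category),
# using the counts quoted in the text.
PERSISTENCE = {
    "incorrect identifier": (0, 33),
    "missing condition": (12, 32),
    "missing conditional statement": (5, 12),
    "missing initialization": (6, 14),
    "missing code sequence": (12, 26),
    "incorrect algorithm": (16, 74),
}


def persistence_percent(persisted, total):
    """Percentage of a category's faults that persisted into the last five years."""
    return 100.0 * persisted / total


for name, (persisted, total) in PERSISTENCE.items():
    print(f"{name}: {persisted}/{total} = {persistence_percent(persisted, total):.1f}%")
```

Sorting the categories by this ratio rather than by raw count is one way to make the point of the paragraph concrete: incorrect identifiers are frequent but short-lived, while missing code sequences are both frequent and persistent.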

4.4 Comparison with other studies

A comparison of the size of the software considered in other studies and the number of faults classified appears in Table 3. Table 4 compares the results of our analysis (column 4) with those of B/P and O/W. We have carried out the comparison with only those schemes whose categorization matches ours in at least some major categories. We found that less than 1% of the faults belong to the spurious entity category. This compares well with the O/W analysis, where less than 1% of the total faults are classified into the superfluous category. There was no such category in the B/P study. Notice that the data in the other columns of Table 4 is within 14 percentage points of our data. This comparison leads us to conclude that errors that lead to missing entities are indeed quite common.

Table 4: Comparison with data from other classification schemes.

Category            B/P                  O/W     DeMillo/Mathur
Superfluous         Category not used    < 1%    < 1%
Incorrect entity    64%                  43%     53%
Missing entity      35%                  54%     40%

We could not compare our data with that of some of the other categories of O/W and B/P due to a mismatch between the categories. For example, neither O/W nor B/P has a misplaced entity category. It seems, however, that there is indeed a noticeable percentage of programmer errors (5.5%) that lead to a misplaced program entity.

Marick [19] found that 43% of the faults were confined to a single expression (let us call this category C1) and 57% required changes to more than one statement or addition of a statement (let us call this category C2). From Figure 5 we find that category C2 of Marick consists of faults in the categories incorrect algorithm, missing code sequence, and missing statement. The total number of such faults found in TeX is 160, which is 54% of the total number of faults classified. This compares well with Marick's data.

5 Automating Fault Classification

In this section we outline how faults can be classified automatically by a system built around the notion of syntactic transformers. To understand how such a system may function, consider the following scenario. Program $P$ has been developed and is currently under test. A test case has resulted in a failure of $P$. This in turn led to debugging and the fault has been isolated. The correction is to be carried out using an editor. It is at this point that an automated system can isolate and classify the fault.

Figure 6 shows the block diagram of a system that can be used for fault classification. As a result of a change made in $P$ the editor generates a modified version of $P$, namely $P'$. Without loss of generality we assume that $P'$ is a valid program. Both $P$ and $P'$ are now input to the fault classifier. $P$ is parsed using $G$, the grammar for the programming language under use, and $P'$ is parsed using $G_f$. Let $P_t$ and $P'_t$, respectively, denote the parse trees for $P$ and $P'$. $P'_t$ contains syntactic transformers labelling various tree nodes as specified by the rules in $G_f$. The fault classifier now attempts to reduce $P'_t$ to $P_t$ by applying various transformers on their inputs. If such a reduction is successful, then the transformers for which a non-input preserving transformation was selected determine the fault categories. A failure to reduce $P'$ to $P$ implies that the fault has not been modelled in $G_f$. In this situation one may either update $G_f$ and incorporate the new $G_f$ into the fault classifier or simply classify such a fault into a miscellaneous category.

[Block diagram: P, Editor, Parse P and P', Parse trees of P and P', Fault Classifier, Fault types]

Figure 6: Fault classification process.

Figure 7 illustrates how a missing statement fault can be classified in a program that conforms to the syntax given by the grammar in Example 2. The incorrect program, $P$, in this case consists of a single assignment x := x - y. The corrected program $P'$ is x := x - y; a := a - b. Figure 7(a) and (b), respectively, show $P_t$ and $P'_t$ as obtained by the fault classifier. The classifier now attempts to apply each transformer on its input and determine if the transformed sentence matches $P$. In this example, applying $T_2$ according to its definition given in Section 3.3 reduces $P'_t$ to $P_t$. Figure 7(b) highlights that portion of $P'_t$ which is affected by such an application of $T_2$.

5.1 Definitions

Let $P_t$ and $P'_t$ denote the derivation trees, rooted at $S$ and $S'$, of programs $P$ and $P'_f$, respectively. Using the terminology introduced in Section 3.3, we assume that $P_t$ and $P'_t$ have been obtained from $G$ and $G_f$, respectively. Each node in $P'_t$ is associated with zero or more transformers. For any node $Z$ ($Z'$) in $P_t$ ($P'_t$) we order its $k$, $k \ge 0$, immediate descendents $X_1, X_2, \ldots, X_k$ ($X'_1, X'_2, \ldots, X'_k$) such that $X_1$ ($X'_1$) is the leftmost descendent, $X_i$ ($X'_i$), $1 < i < k$, is immediately to the right of $X_{i-1}$ ($X'_{i-1}$), and $X_k$ ($X'_k$) is the rightmost descendent. $\mathit{ndesc}(Z, i)$ denotes the $i$th immediate descendent of node $Z$. $\mathit{iasc}(X)$ and $\mathit{idesc}(X)$ denote, respectively, the immediate ancestor and the set of immediate descendents of node $X$ in a parse tree. $\mathit{asc}(X)$ and $\mathit{desc}(X)$ denote, respectively, the set of all ancestors and descendents of node $X$. $\mathit{left}(X)$ and $\mathit{right}(X)$ are the sets of, respectively, all left and right siblings of node $X$. $\mathit{front}(X)$ denotes the frontier of the subtree rooted at node $X$. The frontier of the subtree rooted at $X$ is the string formed by catenating the leaves of the subtree from left to right. The length of a frontier is the number of leaves in it.
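The navigation functions defined above can be illustrated with a minimal Python sketch. The `Node` class and its fields are assumptions made for illustration only; the function names follow the text. The example tree for $x := x - y$ uses an assumed, simplified shape rather than the grammar of Example 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical parse-tree node; `label` is a grammar symbol or a token."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def __post_init__(self) -> None:
        # Record the immediate ancestor of every child.
        for child in self.children:
            child.parent = self


def idesc(x: Node) -> List[Node]:
    """Immediate descendents of x, ordered left to right."""
    return list(x.children)


def ndesc(z: Node, i: int) -> Node:
    """The i-th immediate descendent of z (1-based, as in the text)."""
    return z.children[i - 1]


def iasc(x: Node) -> Optional[Node]:
    """Immediate ancestor of x, or None at the root."""
    return x.parent


def left(x: Node) -> List[Node]:
    """All left siblings of x."""
    if x.parent is None:
        return []
    siblings = x.parent.children
    index = next(i for i, s in enumerate(siblings) if s is x)
    return siblings[:index]


def front(x: Node) -> List[str]:
    """Frontier of the subtree rooted at x: leaf labels from left to right."""
    if not x.children:
        return [x.label]
    leaves: List[str] = []
    for child in x.children:
        leaves.extend(front(child))
    return leaves


# Assumed, simplified tree for the assignment x := x - y.
expr = Node("expr", [Node("x"), Node("-"), Node("y")])
asgn = Node("asgn", [Node("x"), Node(":="), expr])
print(front(asgn))       # frontier of the assignment
print(len(front(asgn)))  # length of the frontier
```

Nodes are compared by identity (`eq=False`), so two distinct nodes with the same label remain distinguishable, which the sibling functions rely on.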

Definition 8 Let $X$ in $P_t$ and $X'$ in $P'_t$ be arbitrary nodes. We define corresponding nodes as:

1. $S$ and $S'$ are corresponding nodes.
2. $X$ and $X'$ are corresponding nodes iff:
   (a) $X = X'$
   (b) $\mathit{iasc}(X)$ and $\mathit{iasc}(X')$ are corresponding nodes
   (c) If $X_j$ and $X'_j$, $1 \le j \le k$, are siblings of $X$, then $X_1 X_2 \ldots X_{j-1} X X_{j+1} \ldots X_k = X'_1 X'_2 \ldots X'_{j-1} X X'_{j+1} \ldots X'_k$, $k \ge 0$.

[Figure 7: panel (a) shows the parse tree $P_t$ of $x := x - y$; panel (b) shows the parse tree $P'_t$ of $x := x - y;\ a := a - b$, with nodes annotated by syntactic transformers.]

Figure 7: Classifying a missing statement fault.

We write $X \Leftrightarrow X'$ when $X$ and $X'$ are corresponding nodes.
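As an illustration of Definition 8, the following Python sketch checks whether two nodes correspond. The `Node` class, the helper names, and the flat tree shapes are assumptions for illustration; condition (c) is read here as requiring the same sibling string with the node at the same position.

```python
from typing import List, Optional


class Node:
    """Hypothetical parse-tree node with a parent link."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None) -> None:
        self.label = label
        self.children: List["Node"] = children or []
        self.parent: Optional["Node"] = None
        for child in self.children:
            child.parent = self


def labels(nodes: List[Node]) -> List[str]:
    return [n.label for n in nodes]


def position(x: Node) -> int:
    """Index of x among its siblings, found by identity."""
    assert x.parent is not None
    return next(i for i, s in enumerate(x.parent.children) if s is x)


def corresponding(x: Node, xp: Node) -> bool:
    # Rule 1: the two roots correspond.
    if x.parent is None or xp.parent is None:
        return x.parent is None and xp.parent is None
    # Rule 2: (a) same label, (b) corresponding parents,
    # (c) same sibling string with the node at the same position.
    return (
        x.label == xp.label
        and corresponding(x.parent, xp.parent)
        and position(x) == position(xp)
        and labels(x.parent.children) == labels(xp.parent.children)
    )


def make_asgn(target: str, lhs: str, rhs: str) -> Node:
    expr = Node("expr", [Node(lhs), Node("-"), Node(rhs)])
    return Node("asgn", [Node(target), Node(":="), expr])


# P: x := x - y        P': x := x - y ; a := a - b   (assumed flat shapes)
p_asgn = make_asgn("x", "x", "y")
p_root = Node("prog", [p_asgn])
pp_first = make_asgn("x", "x", "y")
pp_root = Node("prog", [pp_first, Node(";"), make_asgn("a", "a", "b")])

print(corresponding(p_root, pp_root))   # the roots correspond by rule 1
print(corresponding(p_asgn, pp_first))  # sibling strings differ under prog
```

In this assumed example the two roots correspond while their immediate descendents do not, which is the situation the next definition singles out.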

Definition 9 For any two nodes $X$ in $P_t$ and $X'$ in $P'_t$, the pair $\langle X, X' \rangle$ is said to be a faulty node pair iff:

1. $X \Leftrightarrow X'$
2. There exist nodes $Y \in \mathit{idesc}(X)$ and $Y' \in \mathit{idesc}(X')$ such that $Y \Leftrightarrow Y'$, $Y \in \mathit{desc}(X)$, and $Y' \in \mathit{desc}(X')$.

The set of all faulty node pairs is denoted by FNP.

Definition 10 For any non-leaf node $X \in P'_t$ such that $X \Rightarrow^{+} \ldots X \ldots$, we define a recursive path starting at $X$ to be a sequence of edges $(X, X_1), (X_1, X_2), \ldots, (X_n, Y)$, $n \ge 0$, such that $Y \ne X$ and $X \notin \mathit{desc}(Y)$. The length of such a path is $k$, $k \ge 1$, where $k$ is the number of occurrences of $X$ along the recursive path.


Note that there can be several recursive paths starting at a node. However, for our purpose, we are concerned with only some of these. To identify which ones, we make some assumptions. Let the non-terminal $X$ be defined in $G$ as:

$$X ::= \alpha_1 \mid \alpha_2 \mid \ldots \mid \alpha_{n-1} \mid \alpha_n \mid \beta_1 \mid \beta_2 \mid \ldots \mid \beta_{m-1} \mid \beta_m, \quad \text{where } n, m > 0 \tag{30}$$

Let each $\alpha_i$ contain at least one occurrence of $X$ and each $\beta_i$ contain no occurrence of $X$. In a derivation, replacement of $X$ by $\alpha_i$ is termed a recursive application of $X$, and that by $\beta_i$ a terminating application of $X$. Notice that even though the string $\beta_i$ does not contain any $X$, it may derive another string, in one or more steps, that contains one or more occurrences of $X$. The next definition restricts the recursive paths to the ones we desire.

Definition 11 For any non-leaf node $X \in P'_t$ such that $X \Rightarrow^{+} \ldots X \ldots$, we define a restricted recursive path starting at $X$ to be the longest sequence of $n$ edges $(X, X), (X, X), \ldots, (X, Y)$, $n > 0$, such that $Y \ne X$ and $X \notin \mathit{desc}(Y)$. The length of such a path is $n$. Such a path is denoted by $RP(X)$ and its length by $RL(X)$.
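One possible reading of $RL(X)$ is sketched below in Python: follow the longest chain of immediate descendents that carry the same label as $X$, and count the edges up to and including the edge that leaves the chain. The `Node` class and the function name are assumptions, and the condition $X \notin \mathit{desc}(Y)$ is not checked in this sketch. The two trees mirror the simple expressions $x * y + 1$ and $x * y$ discussed in Example 5, with subtrees below `term` omitted.

```python
from typing import List, Optional


class Node:
    """Hypothetical parse-tree node."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None) -> None:
        self.label = label
        self.children: List["Node"] = children or []


def restricted_recursive_length(x: Node) -> int:
    """Edges on the longest chain (X, X), ..., (X, Y) starting at x.

    Assumes x is a non-leaf node. Only direct recursion (a child with
    the same label as its parent) is followed.
    """
    same_label = [c for c in x.children if c.label == x.label]
    # One edge leaves the chain at its end; each same-label step adds one more.
    return 1 + max((restricted_recursive_length(c) for c in same_label), default=0)


# s_exp for "x * y + 1": s_exp -> s_exp addop term, inner s_exp -> term
corrected = Node("s_exp", [Node("s_exp", [Node("term")]), Node("addop"), Node("term")])
# s_exp for "x * y": s_exp -> term
faulty = Node("s_exp", [Node("term")])

print(restricted_recursive_length(corrected))
print(restricted_recursive_length(faulty))
```

Under this reading the two lengths come out as 2 and 1, which agrees with the values of $l_1$ and $l_2$ quoted for node 86 in Example 5.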

Definition 12 Let $P = uxvyz$ and $P' = u'x'yv'z'$, where $y, v, v' \ne \epsilon$. If there is a recursive non-terminal $X \in N$, $X \ne Y$, such that the following derivations exist:

$$S \Rightarrow rY \Rightarrow^{+} uX \Rightarrow uxvy \Rightarrow uxvyz \tag{31}$$

$$S \Rightarrow r'Y \Rightarrow^{+} u'X \Rightarrow u'x'yv' \Rightarrow u'x'yv'z' \tag{32}$$

then the string $xvy$ is known as the edit region and $X$ is known as the context of this edit region. String $y$ is known as a misplaced entity.

5.2 A fault classification algorithm

An algorithm for fault classification appears in Figure 8. The classification begins in the main routine. It is passed the roots of trees $P_t$ and $P'_t$. It returns the set of transformers, $F$, for which a non-input preserving application is found. Each of these transformers denotes a fault. Below we provide a brief description of the main routine and the associated procedures.

The main routine

The main routine begins by unmarking all nodes in $P'_t$ and then computing the set of context nodes, each of which is associated with a transformer of type misplaced. A call to applyp then finds any fault that belongs to the misplaced entity category. applyp marks all nodes in $P'_t$ at which a non-input preserving transformer application was found.

The faulty node pair set, fnp_set, is now computed by find_fnp to determine any remaining faults of type other than misplaced. The main routine now enters a loop to process each FNP $\langle r, r' \rangle$ in fnp_set. Within the loop body, trset is computed to be the set of transformers associated with $r'$. An empty trset implies that even though there is a fault associated with $\langle r, r' \rangle$, it cannot be classified. All such FNPs are added to the set of unclassifiable FNPs, denoted by u_fnp_set, and the nodes in $P'_t$ are marked. Each element of trset is now processed in the order of its priority. Thus, transformers of type missing will be processed first, followed by those of type superfluous, and lastly those of type incorrect. A call to procedure apply is used to find if a non-input preserving application exists for transformer $T_{i_j}$. If it does, then $T_{i_j}$ is added to $F$; otherwise the FNP is added to u_fnp_set. Having processed all FNPs and the transformers associated with each FNP, the main routine returns the set $F$ as the set of transformers that denote the categories of faults found in $P$.
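The priority-ordered loop described above might be sketched in Python as follows. This is a simplified illustration, not the routine of Figure 8: the names `classify`, `transformers_of`, and `demo_apply` are hypothetical, transformers are modelled as (name, kind) pairs, and an FNP is recorded as unclassified only when none of its transformers applies.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple

Transformer = Tuple[str, str]  # (name, kind), e.g. ("Tm7", "missing")

# Processing order stated in the text: missing, then superfluous, then incorrect.
PRIORITY = {"missing": 0, "superfluous": 1, "incorrect": 2}


def classify(
    fnp_set: Iterable[Hashable],
    transformers_of: Callable[[Hashable], Iterable[Transformer]],
    apply: Callable[[Hashable, Transformer], Set[str]],
) -> Tuple[Set[str], List[Hashable]]:
    """Return (fault set F, unclassified FNPs)."""
    faults: Set[str] = set()
    unclassified: List[Hashable] = []
    for pair in fnp_set:
        trset = sorted(transformers_of(pair), key=lambda t: PRIORITY[t[1]])
        marked = False
        for transformer in trset:
            found = apply(pair, transformer)
            if found:
                faults |= found
                marked = True  # a marked node is not processed further
                break
        if not marked:
            unclassified.append(pair)
    return faults, unclassified


# Hypothetical stand-in for apply: only transformers of kind "incorrect" succeed.
calls: List[str] = []


def demo_apply(pair: Hashable, transformer: Transformer) -> Set[str]:
    name, kind = transformer
    calls.append(name)
    return {name} if kind == "incorrect" else set()


table = {"<87,87>": [("Ti5", "incorrect"), ("Tm7", "missing")], "<5,5>": []}
faults, unclassified = classify(table, lambda pair: table[pair], demo_apply)
print(faults, unclassified, calls)
```

The stub shows only the ordering: the missing-type transformer is tried before the incorrect-type one, and a pair with an empty trset ends up in the unclassified list.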

The find_fnp routine

This routine traverses $P'_t$, starting at the root, to determine the set fnp_set. If a node is marked then it, and all its descendents, are ignored. In case a node pair $\langle X, X' \rangle$ is not an FNP, the routine attempts to find, recursively, if the immediate descendents of $X$ and $X'$ could form FNPs. If $\langle X, X' \rangle$ is an FNP then the descendents of $X$ and $X'$ are not examined.

The is_fnp routine

This routine finds if an input node pair $\langle r, r' \rangle$ constitutes an FNP. It first checks if the labels associated with the two nodes are identical. If not, then the pair cannot be an FNP by definition, as they cannot be corresponding nodes. If yes, then the labels of the descendants of $r$ and $r'$ are compared. If any one of these is different then $\langle r, r' \rangle$ is an FNP, otherwise not. is_fnp assumes that $\langle \mathit{iasc}(r), \mathit{iasc}(r') \rangle$ is not an FNP and that $\mathit{iasc}(r) \Leftrightarrow \mathit{iasc}(r')$.
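The two routines just described can be sketched in Python as below. The `Node` class is an assumption, the fnp_set is passed explicitly instead of being global, and marks are kept as a set of node identities for the corrected tree (the text marks nodes in $P'_t$). The example trees are simplified versions of p := x * y and p := x * y + 1.

```python
from typing import List, Optional, Set, Tuple


class Node:
    """Hypothetical parse-tree node."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None) -> None:
        self.label = label
        self.children: List["Node"] = children or []


def is_fnp(r: Node, rp: Node) -> bool:
    """True if the labels agree but the immediate descendents differ."""
    if r.label != rp.label:
        return False
    if len(r.children) != len(rp.children):
        return True
    return any(a.label != b.label for a, b in zip(r.children, rp.children))


def find_fnp(r: Node, rp: Node, marked: Set[int],
             fnp_set: List[Tuple[Node, Node]]) -> None:
    """Top-down search; descendents of an FNP or of a marked node are skipped."""
    if id(rp) in marked:
        return
    if is_fnp(r, rp):
        fnp_set.append((r, rp))
        return
    for child, child_p in zip(r.children, rp.children):
        find_fnp(child, child_p, marked, fnp_set)


# Faulty P: p := x * y      Corrected P': p := x * y + 1   (simplified trees)
faulty = Node("asgn", [Node("id"), Node(":="), Node("s_exp", [Node("term")])])
corrected_s_exp = Node("s_exp", [Node("s_exp", [Node("term")]),
                                 Node("addop"), Node("term")])
corrected = Node("asgn", [Node("id"), Node(":="), corrected_s_exp])

pairs: List[Tuple[Node, Node]] = []
find_fnp(faulty, corrected, set(), pairs)
print([(a.label, b.label) for a, b in pairs])
```

Because the search stops at the first differing pair on each path, the inner `s_exp` of the corrected tree is never compared.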

The applyp routine

This routine finds if there are any faults in the misplaced entity category. The outer loop examines each context node pair, $(X, X')$, in the context and context' arrays, respectively. For each $X$, function permute is invoked to find if $\mathit{front}(X)$ is a permutation of $\mathit{front}(X')$. If it is, then the transformer associated with $X$, $TR(X)$, is added to misp_faults.
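The permutation test on frontiers can be sketched in a few lines of Python. Representing a frontier as a list of token strings is an assumption; the function name follows the text.

```python
from collections import Counter
from typing import Sequence


def permute(frontier: Sequence[str], frontier_p: Sequence[str]) -> bool:
    """True if the two frontiers contain the same tokens with the same counts."""
    return Counter(frontier) == Counter(frontier_p)


# Frontiers of the actual parameter lists in readln(y, x) and readln(x, y).
print(permute(["y", ",", "x"], ["x", ",", "y"]))
# A parameter list with a missing entry is not a permutation.
print(permute(["x"], ["x", ",", "y"]))
```

Note that two identical frontiers also pass this test, so a caller that wants to report only genuine reorderings would additionally check that the two sequences differ.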

The apply, applym, applys, and applyi routines

apply invokes one of applym, applys, and applyi depending on the type of the input transformer $T$. We refer to the type of transformers $T_i$, $T_m$, $T_s$, and $T_p$ as incorrect, missing, spurious, and misplaced, respectively.


Algorithm: FAULT CLASSIFIER

Input:
  1. Parse tree Pt of P containing fault(s) to be classified.
  2. Parse tree P't of P'f, where P' is obtained by removing fault(s) from P. Recall that selecting input preserving applications for all occurrences of transformers reduces P'f to P'.
  3. Edit region.

Output:
  • Fault set, denoted by F, consisting of zero or more syntactic transformers.
  • Set of FNPs not classified, denoted by u_fnp_set.

Method:

  begin
    • Unmark all nodes in P't.
    • Compute context, context', and TP in the edit region.
    F = applyp(|context|, context, context');
    find_fnp(root(Pt), root(P't));
    if fnp_set = ∅ then return(F)
    for each ⟨r, r'⟩ ∈ fnp_set do
    begin
      • Let trset = {Ti1, Ti2, ..., Til} be the set of transformers associated with node r' such that Tij > Tij+1, 1 ≤ j < l, 1 ≤ ij ≤ |T|, where |T| is the maximum number of transformers associated with any node in P't.
      if l = 0 then u_fnp_set = u_fnp_set ∪ {⟨r, r'⟩};
      j = 1;
      while (j ≤ l ∧ unmarked(r')) do
      begin
        F = F ∪ apply(r, r', Tij);
        j = j + 1;
        if success then mark(r')
        else u_fnp_set = u_fnp_set ∪ {⟨r, r'⟩};
      end /* of while */
    end /* of for */
    return(F)
  end /* End of FAULT CLASSIFIER. */

Figure 8: A fault classification algorithm: the main routine.


Procedure: find_fnp

Input: Nodes r ∈ Pt and r' ∈ P't. The first call is from the main routine with r and r' being, respectively, the roots of Pt and P't.
Output: The possibly empty fnp_set consisting of FNP(s).

Method:

  find_fnp(r, r' : node)
  begin
    if marked(r) then return;
    if is_fnp(r, r') then fnp_set = fnp_set ∪ {⟨r, r'⟩}
    else for i = 1 to |idesc(r)| do
      find_fnp(ndesc(r, i), ndesc(r', i));
  end /* of find_fnp. */

Procedure: is_fnp

Input: Nodes r ∈ Pt and r' ∈ P't.
Output: true if nodes r and r' constitute an FNP, false otherwise.

Method:

  is_fnp(r, r' : node)
  begin
    if (r ≠ r') then return(false)
    else if (|idesc(r)| ≠ |idesc(r')|) then return(true)
    else begin
      for i = 1 to |idesc(r)| do
        if (ndesc(r, i) ≠ ndesc(r', i)) then return(true);
      return(false);
    end /* of else. */
  end /* End of is_fnp. */

Figure 8: A fault classification algorithm (Contd.): procedures find_fnp and is_fnp.


Each of the routines called by apply either returns an empty set, if $T$ is not applicable, or the set consisting of $T$ if $T$ is applicable. This set is then returned to the main routine, which adds it to the fault set being computed. apply expects that if a node has multiple transformers associated with it, the transformers will be input to apply in the order of their relative priority. applym compares the lengths of the restricted recursive paths associated with input nodes $r$ and $r'$ and determines if there exist one or more missing entities. It returns an empty set, the set $\{T\}$, or the set $\{T^{+}\}$ depending on, respectively, whether there are zero, one, or more than one missing entities. applys is similar to applym except that it compares the lengths of the restricted recursive paths differently to check for zero or more spurious entities. applyi compares the strings obtained by catenating, from left to right, the descendents of nodes $r$ and $r'$. If they differ then the set $\{T\}$ is returned, otherwise the empty set is returned. In case a fault has been categorized, apply sets success to true. The main routine uses success to find if the FNP that is input to apply has been classified or not. If not, then the FNP is added to u_fnp_set; otherwise the transformer is added to $F$.
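The behaviour described for applym and applyi can be sketched in Python as follows. Several choices here are assumptions: the routines take precomputed frontier lengths, path lengths, and descendent labels instead of tree nodes; a transformer is a plain string and $T^{+}$ is written as the name followed by "+"; and, following the worked values in Example 5, the first path length is taken from the corrected tree and the second from the faulty one. The success flag of the text corresponds to the returned set being non-empty.

```python
from typing import FrozenSet, Sequence

EMPTY: FrozenSet[str] = frozenset()


def applym(t: str, front_len_faulty: int, front_len_corrected: int,
           rl_corrected: int, rl_faulty: int) -> FrozenSet[str]:
    """Missing-entity check based on restricted recursive path lengths."""
    if front_len_faulty == front_len_corrected:
        return EMPTY                      # equal frontier lengths: nothing missing
    if rl_corrected <= rl_faulty:
        return EMPTY                      # zero missing entities
    if rl_corrected == rl_faulty + 1:
        return frozenset({t})             # exactly one missing entity
    return frozenset({t + "+"})           # more than one missing entity


def applyi(t: str, desc_faulty: Sequence[str],
           desc_corrected: Sequence[str]) -> FrozenSet[str]:
    """Incorrect-entity check: compare the catenated descendent labels."""
    if list(desc_faulty) != list(desc_corrected):
        return frozenset({t})
    return EMPTY


# p := x * y versus p := x * y + 1: frontiers of length 3 and 5, paths 2 and 1.
print(applym("Tm7", 3, 5, 2, 1))
# Equal frontier lengths: applym does not apply.
print(applym("Tm7", 3, 3, 1, 1))
# Differing descendent strings: applyi applies.
print(applyi("Ti5", ["term"], ["s_exp", "addop", "term"]))
```

applys is left out of this sketch; per the text it mirrors applym with the comparison of path lengths reversed.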

Example 5 This example illustrates the fault classification algorithm described above. A correct subset-Pascal program (see footnote 10), denoted by $P'$, is shown in Figure 9. We assume that $P'$ has been obtained by editing program $P$ to remove one or more faults. Several examples of faults, and their classification using the above algorithm, are given below. The parse tree $P'_t$ for $P'$, derived using the grammar rules given in Appendix A, appears in Figure 10. For ease of reference, we have numbered the nodes in $P'_t$ from 1 to 119 in preorder. The transformers associated with nonterminals in the grammar are associated with the corresponding nodes in $P'_t$. Below we show how different types of faults in $P$ can be classified using the fault classification algorithm.

Misplaced actual parameter: Suppose that statement readln(x, y) was mistakenly written as readln(y, x) in the incorrect program $P$. Figure 11 shows the subtree with frontier y, x. The remaining part of the parse tree will be identical to that of $P'_t$ and is therefore not shown in the figure or in the remaining examples. We also assume that the edit region has been specified to be the string y, x in $P$. In accordance with Definition 12, exp_list (node 47) is a context node of this edit region. The transformer associated with this node is $T_{p1}$. When nodes 47 from $P'_t$ and $P_t$, and $T_{p1}$, are input to applyp, it returns $\{T_{p1}\}$ as the set of misplaced faults in $P$ because permute(exp_list, exp_list) is true. This fault is interpreted as a misplaced actual parameter.

Misplaced statement: Suppose that $P$ had the if and readln statements in the reverse order. Figure 12 exhibits the subtree rooted at s_list. The edit region consists of the string if x <

10 This program is only for illustrative purposes.

Procedure: apply

Input:
  1. Nodes r ∈ Pt and r' ∈ P't.
  2. T, a transformer associated with r'.
Output:
  • {T} if one non-input preserving application of T exists, {T+} if more than one non-input preserving applications of T exist, ∅ otherwise.
  • Global variable success is set to true if the output is not an empty set, else it is set to false.

Method:

  apply(r, r' : node; T : transformer): set of transformer
  begin
    case type(T) of
      missing : return(applym(r, r', T));
      spurious : return(applys(r, r', T));
      incorrect : return(applyi(r, r', T));
    end /* of case. */
  end /* of apply. */

Procedure: applym

Input:
  1. Nodes r ∈ Pt and r' ∈ P't.
  2. T of type missing, a transformer associated with r'.
Output: Same as for apply.

Method:

  applym(r, r' : node; T : transformer): set of transformer
  begin
    if |front(r)| = |front(r')| then begin success = false; return(∅) end
    l1 = RL(r); l2 = RL(r');
    if (l1 ≤ l2) begin success = false; return(∅) end
    if (l1 = l2 + 1) begin success = true; return({T}) end
    if (l1 > l2 + 1) begin success = true; return({T+}) end
  end /* of applym. */

Figure 8: A fault classification algorithm (Contd.): procedures apply and applym.


Procedure: applys

Input:
  1. Nodes r ∈ Pt and r' ∈ P't.
  2. T of type spurious, a transformer associated with r'.
Output: Same as for apply.

Method:

  applys(r, r' : node; T : transformer): set of transformer
  begin
    l1 = RL(r); l2 = RL(r');
    if (l1 = l2 − 1) begin success = true; return({T}) end
    if (l1 > l2) begin success = false; return(∅) end
    if (l1 < l2 − 1) begin success = true; return({T}) end
  end /* of applys. */

Procedure: applyi

Input:
  1. Nodes r ∈ Pt and r' ∈ P't.
  2. T of type incorrect, a transformer associated with r'.
Output: Same as for apply.

Method:

  applyi(r, r' : node; T : transformer): set of transformer
  begin
    • Let r1 r2 ... rk and r'1 r'2 ... r'k' denote the descendents of nodes r and r', respectively.
    if (r1 r2 ... rk ≠ r'1 r'2 ... r'k') then begin success = true; return({T}) end
    else begin success = false; return(∅) end;
  end /* of applyi. */

Figure 8: A fault classification algorithm (Contd.): procedures applys and applyi.


Procedure: applyp

Input:
  1. Array context consisting of those n, n ≥ 0, context nodes associated with each of which there is a transformer of type misplaced.
  2. Array context' where context[i] ⇔ context'[i], 1 ≤ i ≤ n.
Output: The set misp_faults consisting of 0 or more syntactic transformers of type misplaced.

Method:

  applyp(n : integer; context, context' : array of node; misp_faults : set of transformer);
  begin
    misp_faults = ∅;
    for l = 1 to n do
    begin
      for j = 1 to k do
      begin
        • Find N = {n1, n2, ..., nr}, r ≥ 0, where each node in N is in the subtree rooted at context[l] in P and is associated with at least one transformer of type misplaced.
        • Find N' = {n'1, n'2, ..., n'r'}, r' ≥ 0, where each node in N' is in the subtree rooted at context'[l] in P' and is associated with at least one transformer of type misplaced.
        if r = r' then
          if permute(N, N') then
          begin
            misp_faults = misp_faults ∪ {TR(r')};
            mark(r')
          end
      end /* of for loop. */
    end /* of for loop. */
    return(misp_faults);
  end /* of applyp */

Figure 8: A fault classification algorithm (Contd.): procedure applyp.

  program faults(input, output);
  var p, x, y: int;
  begin
    readln(x, y);
    if x < y then p := x * y + 1;
    writeln(p)
  end.

Figure 9: Sample program $P'$ in subset-Pascal.


[Figure 10: parse tree of the sample program, with nodes numbered 1 to 119 in preorder and annotated with the transformers of types $T_i$, $T_m$, and $T_p$ associated with their nonterminals. The tree itself is not reproduced here.]

Abbreviations used in the figure: id_list: identifier_list; s_exp: simple_expression; sub_decls: subprogram_declarations; decls: declarations; exp: expression; proc_stmt: procedure_statement; c_stmt: compound_statement; op_stmts: optional statements; s_list: statement_list.

Figure 10: Parse tree $P'_t$ of $P'$.


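[Figure 11: exp_list (node 47) has the immediate descendents exp_list (node 48), "," (node 55), and exp (node 56). Node 48 derives y through nodes 49 to 54 (exp, s_exp, term, factor, id, y), and node 56 derives x through nodes 57 to 61 (s_exp, term, factor, id, x).]

Figure 11: Subtree for the actual parameter list y, x.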

y then p := x * y + 1; readln(x, y) and the context node is s_list. Once again it is easy to see that applyp will compute the misplaced fault set to be $\{T_{p3}\}$, which is interpreted as a misplaced statement.

Missing expression: Suppose that the statement p := x * y + 1 was incorrectly formulated in $P$ as p := x * y. The subtree generating the subexpression x * y is shown in Figure 13(a). The edit region is empty and hence applyp will return an empty set. However, find_fnp will return $\langle 86, 86 \rangle$ as the only faulty node pair. This is because node 86 in $P'_t$ corresponds with node 86 in $P_t$ but the immediate descendents of the two nodes differ. The for loop in the main routine is executed for $\langle 86, 86 \rangle$. It computes trset to be $\{T_{m7}\}$. Thus, $l = 1$. The while loop now invokes apply with 86, 86, and $T_{m7}$ as inputs. As $T_{m7}$ is of type missing, apply invokes applym. applym computes $l_1$ and $l_2$ to be 2 and 1, respectively, as the lengths of the restricted recursive paths starting at nodes 86 in $P'_t$ and $P_t$. As $l_1 = l_2 + 1$, applym returns $\{T_{m7}\}$ as the output. The while loop terminates as $l = 1$, and $F$ is computed to be $\{T_{m7}\}$, which is interpreted as a missing expression.

Figure 13(b) is a subtree generating an actual parameter list. This is a subtree of $P_t$ assuming that readln(x, y) was mistakenly formulated as readln(x) in $P$. Using the logic described in the paragraphs above, one can conclude that the main routine of the fault classifier will compute $F$ to be $\{T_{m1}\}$, which can be interpreted as a missing actual parameter fault.

Incorrect entity: Suppose that the statement p := x * y + 1 was incorrectly formulated in $P$ as p := x + y - 1. Figure 14 shows the corresponding subtree in $P_t$. In this example find_fnp computes $\langle 87, 87 \rangle$ and $\langle 99, 99 \rangle$ as the two FNPs. The trset associated with node 87 in $P'_t$ is $\{T_{m7}, T_{i5}\}$. The while loop invokes apply with $\langle 87, 87 \rangle$ and $T_{m7}$ as inputs. apply invokes applym, which compares the lengths of the frontiers of the two nodes. In $P'_t$, we have $|\mathit{front}(87)| = 3$, which is also $|\mathit{front}(87)|$ in $P_t$. Thus applym returns an empty set and node 87 in $P'_t$ remains unmarked.
