Carleton University, Technical Report SCE-04-02

September 2004

Assessing and Improving State-Based Class Testing: A Series of Experiments

Lionel C. Briand (1), Massimiliano Di Penta (2) and Yvan Labiche (1)

(1) Software Quality Engineering Laboratory, Department of Systems and Computer Engineering, Carleton University, 1125 Colonel By Drive, Ottawa, ON K1S 5B6, Canada — {briand, labiche}@sce.carleton.ca

(2) RCOST - Research Centre on Software Technology, Department of Engineering, University of Sannio, Piazza Roma, I-82100 Benevento, Italy — [email protected]

Abstract

This paper describes an empirical investigation of the cost effectiveness of well-known state-based testing techniques for classes or clusters of classes that exhibit a state-dependent behavior. This is practically relevant as many object-oriented methodologies recommend modeling such components with statecharts which can then be used as a basis for testing. Our results, based on a series of three experiments, show that in most cases state-based techniques are not likely to be sufficient by themselves to catch most of the faults present in the code. Though useful, they need to be complemented with black-box, functional testing. We focus here on a particular technique, Category Partition, as this is the most commonly used and referenced black-box, functional testing technique. Two different oracle strategies have been applied for checking the success of test cases. One is a very precise oracle checking the concrete state of objects whereas the other one is based on the notion of state invariant (abstract states). Results show that there is a significant difference between them, both in terms of fault detection and cost. This is therefore an important choice to make that should be driven by the characteristics of the component to be tested, such as its criticality, complexity, and test budget.

1 Introduction

State models, such as UML statecharts [9, 22], are widely used to document the design of object-oriented systems, in particular to specify the state-dependent behavior of the most critical and complex classes. Such statecharts help to carefully specify and review class behavior but they are also extremely useful to facilitate class design, for example using the state design pattern. Another useful application is related to the definition of test strategies and criteria based on state models.


Indeed, state models provide a precise definition of the expected class behavior and are therefore very useful when deriving test cases and oracles. The literature reports a number of criteria, relying on state models, for the unit testing of (clusters of) classes [7, 8, 14, 25, 36]. However, little is known about the effectiveness in practice of such criteria. Since faults are much less expensive when caught during unit testing than during subsequent testing phases (integration testing or system testing), it is important to ensure that effective strategies are being used to detect them. But the cost associated with a coverage criterion has to be acceptable given the specifics of the class(es) under test. It is thus paramount to provide an initial body of results that can serve as useful guidelines for people who, in organizations that develop object-oriented systems, have to plan and perform class (cluster) testing. In order to study the effectiveness of these testing criteria, and to understand whether it can be possible and convenient to apply them, two separate questions have to be answered: (1) to what extent do people who have adequate training properly apply test strategies (some may be more difficult to apply than others)? and (2) what is the fault detection effectiveness of a test strategy when properly applied? This paper addresses the latter issue. This entails the investigation of different aspects, for example:

- What are the factors affecting such effectiveness?
- How can effectiveness be improved by combining different strategies?
- What is the influence of the chosen oracle in terms of cost and effectiveness?
- What, in general, are the cost-effectiveness tradeoffs to be considered?

Such investigations, however, cannot be performed by analytical means: as for other types of criteria (e.g., data flow criteria [19]), experimental evaluation is required. This paper reports and discusses results from a series of three experiments, performed with graduate students in a laboratory setting. The purpose of the experiments is to determine the cost-effectiveness of one of the most widespread state-based testing strategies: the round trip strategy [7], which derives from the W-Method [14] for regular state machines and has been extensively studied over the last 20 years in the context of protocol testing [32]. We also investigate whether black-box testing can help complement state-based testing and how the two could be combined. Last, we compare two different oracle strategies in the context of these testing strategies.


For each experiment, we asked student teams to produce test cases applying the strategies under investigation (round trip [7], category partition [37]) to a given set of classes and methods. Then, we produced mutant versions of the class(es) under test (CUT), each containing one seeded fault, using mutation operators [16]. Finally, we ran the test cases over the mutant versions of the CUT to assess their capability to “kill” mutants, that is, to uncover faults by triggering a failure. Results were also analyzed to investigate variability among teams and the reasons for the remaining live mutants. The main contributions of this paper are:

• a thorough experimental evaluation of a common state-based testing technique, round trip testing; we look at both the fault detection effectiveness and the cost of the technique;

• an investigation into combining round trip testing with black-box testing, namely the category partition technique;

• an empirical investigation of two oracle strategies to determine the occurrence of a failure, one based on state invariants (as defined for the statechart) and the other being a complete, precise oracle looking at the concrete states of objects (i.e., checking all attribute values). The latter is more expensive to code in test drivers and we aim to determine whether it makes a significant difference in practice.

The paper is organized as follows. Section 2 discusses the related literature and Section 3 introduces the fundamental notions about the testing strategies under investigation and the mutation operators. Section 4 provides a detailed description of the experiments, while Section 5 presents and discusses the results. Overall conclusions and future work are provided in Section 7.

2 Related work

One of the earliest works on state-based testing is the W-method proposed by Chow for finite state machines [14], which was later adapted to a UML context by Binder [7] and denoted as round-trip path testing. The W-method and its adaptation by Binder require the construction of a transition tree, for instance using a breadth-first traversal of the finite-state machine (or statechart). The W-method then consists of two steps: the first step traverses the transition tree so that each path in that tree will be covered by the test cases, and the second step appends a state identification sequence to each transition tree sequence (i.e., test case) in order to check the state that was reached. Other similar techniques for state-based testing from finite state machines exist
(e.g., the technique proposed by Lee and Yannakakis [32]), but they all test the same sequences (from the tree) and only differ with respect to the sequence added for the state identification problem. In the context of class testing, the adaptation performed by Binder reuses the first step and assumes it is possible to directly check the state invariant, e.g., by defining/using a particular method that verifies it, and thus replaces the identification sequence with a call to a state invariant checking method. In other words, the implementation must have a trusted ability to report the resultant states [7]. Binder’s adaptation is used during class unit testing when the statechart describes the behavior of a single class, or during cluster unit testing when the statechart describes the behavior of a cluster of classes (e.g., the statechart describes the behavior of a linked list, comprising classes LinkedList and Node). Note that in order to apply Binder’s adaptation, it is necessary to remove all concurrency or hierarchy in the statechart [9], a process referred to as statechart flattening [7]. Additional testing strategies have been defined for statecharts. Offutt and Abdurazik [35] provide definitions for the following three testing criteria: all transitions (requires the test set to cover every transition in the statechart), all transition pairs (the test set must contain every pair of adjacent transitions), and full-predicate, which targets guard conditions on transitions (test cases must cause each clause in every guard condition to control the value of the guard and to drive it to true). More recently, Bogdanov and Holcombe [8] proposed a method to derive test sequences (based on the all transitions criterion) in the presence of hierarchical statecharts. Hong et al. [25] provide a way to derive extended finite state machines from statecharts to devise test criteria based on control and data flow analysis. Ball et al. [5] propose a test strategy for container classes that randomly generates instances using combinatorial algorithms that are specific to every data structure encapsulated in the class. But the W-method and its extensions remain the most widely used and studied state-based technique for software testing as it is, in terms of cost, a compromise between simple, weak test criteria (e.g., all transitions) and expensive, strong ones (e.g., transition pairs), as shown in [10]. Empirical studies have been conducted to study the cost-effectiveness of testing strategies, in a white-box or structural context [19, 20, 40], in a black-box or functional context [41], or even for regression testing [23]. As in the current paper, many of these works used mutation operators and fault seeding to study the cost-effectiveness of test strategies, and the number of test cases as a surrogate measure of cost.


More recently, a precise simulation and analysis procedure has been proposed and used to study the cost-effectiveness of four statechart-based coverage criteria [10], namely all-transitions, all-transition-pairs, full-predicate [35] and round trip paths [7]. Case studies by Offutt et al. [36] showed that all-transitions is as effective as random testing but significantly cheaper. Full-predicate is as expensive as random testing but significantly more effective. Similar to the work in [10, 35], we also aim to analyze the cost effectiveness of state-based testing, but we focus on different techniques and perform experiments with human subjects instead of simply resorting to simulation.

3 Background Concepts and Definitions

This section briefly introduces the test strategies we investigate in our experiments. Section 3.1 provides further details on round trip path testing whereas Section 3.2 reminds the reader about the well-known category partition black-box testing technique. Section 3.3 discusses alternatives for the implementation of oracles: precise oracles that check all attribute values or oracles that check state invariants. Section 3.4 then provides a brief description of the taxonomy of mutants seeded in the code under test.

3.1 Round-trip path testing

In Binder’s adaptation [7] of Chow’s W-method [14], the transition tree includes all the transition sequences that begin and end with the same state, as well as simple paths (i.e., sequences of transitions that contain only one iteration of any loop present in the statechart) from the initial state to the final state. Binder details a procedure based on a breadth-first traversal of the statechart for deriving the transition tree. More precisely, during the traversal of the graph corresponding to the statechart, a tree node is considered terminal when the state it represents is already present anywhere in the tree or is the final state (a code sketch of this construction is given after the notes below). It is worth noting that:

1. for guard conditions containing only logical 'and' operators, only one transition is required in the tree;
2. once a guard condition is transformed into disjunctive normal form, i.e., so that it is composed of conjunctive expressions (disjuncts) combined using the 'or' operator, a transition for each disjunct is required in the tree.
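
The following sketch illustrates the tree-construction procedure described above. It is our own illustration, not code from Binder [7] or from the experiment material: all type and method names are hypothetical, and the handling of disjuncts (note 2 above) is omitted for brevity.

    import java.util.*;

    // Minimal sketch of transition-tree construction by breadth-first traversal.
    class TransitionTreeBuilder {

        record Transition(String source, String event, String target) {}

        static class TreeNode {
            final String state;
            final String incomingEvent;               // event labeling the edge from the parent
            final List<TreeNode> children = new ArrayList<>();
            TreeNode(String state, String incomingEvent) {
                this.state = state;
                this.incomingEvent = incomingEvent;
            }
        }

        /** Builds the transition tree rooted at initialState. */
        static TreeNode build(String initialState, String finalState, List<Transition> transitions) {
            Set<String> statesInTree = new HashSet<>();   // states already present anywhere in the tree
            TreeNode root = new TreeNode(initialState, null);
            statesInTree.add(initialState);
            Deque<TreeNode> queue = new ArrayDeque<>();
            queue.add(root);
            while (!queue.isEmpty()) {
                TreeNode node = queue.poll();
                for (Transition t : transitions) {
                    if (!t.source().equals(node.state)) continue;
                    TreeNode child = new TreeNode(t.target(), t.event());
                    node.children.add(child);
                    // Terminal rule: stop expanding if the state is already in the tree or is the final state.
                    if (statesInTree.add(t.target()) && !t.target().equals(finalState)) {
                        queue.add(child);
                    }
                }
            }
            return root;
        }
    }

Each root-to-leaf path in the resulting tree corresponds to one test case, i.e., one sequence of method invocations to be exercised on the class under test.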


Note that either a breadth-first or a depth-first traversal can be used when constructing the transition tree, and that the obtained tree depends on the way one traverses the graph corresponding to the statechart. However, these transition trees are all equivalent in the sense that they enable the coverage of all the round trip paths in the statechart. For many statecharts that model the behavior of non-trivial classes in real world systems, the number of paths could be too high (e.g., too expensive to execute given available resources). Therefore, we can define two alternative levels of testing:

1. first level: we simply refer to this level as round trip (RT) path coverage. Each guard condition is treated as a whole and only one corresponding transition is present in the tree; in other words, we only test one situation where the guard condition evaluates to true;
2. second level: we strictly apply the definition above, with one transition per disjunct of each guard condition. We refer to this as disjunct coverage (DC).

Let us illustrate how RT and DC work using an example from one of our subject classes, namely the class Name, whose transition tree is shown in Figure 18, Appendix 8.3. To cover the path (8) that brings the object from the initial state to the PFQ state and then again to PFQ, methods Name(String) and append(Name) need to be invoked. It is worth noting, from Figure 1, that the guard condition of method append(Name) is in disjunctive normal form.

    append(d: Name) [ (d.labels + self.labels ...) or (... union(d.name) ...) ]

Figure 1 – Example of guard condition in disjunctive normal form

To cover path (8), the RT strategy indicates that the path needs to be covered and this implies that at least one disjunct in the guard condition of append(Name) must be satisfied. To also achieve DC coverage, the other disjuncts need to be satisfied in additional test cases. In this case, where we only have two disjuncts, if the first disjunct has been satisfied during RT coverage, DC requires that path (8) be exercised again, this time satisfying the second disjunct.
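
To make the difference between the two levels concrete in code, consider the following self-contained sketch. The class and its guard are hypothetical (this is not the dnsjava Name class or its actual guard); the point is only that RT needs a single test case in which the guard holds, while DC needs one test case per disjunct.

    // Hypothetical class with an operation guarded by a condition in disjunctive normal form:
    // withdraw(a) is enabled when [ (a <= balance) or (overdraftAllowed) ].
    class Account {
        private int balance = 100;
        private final boolean overdraftAllowed;
        Account(boolean overdraftAllowed) { this.overdraftAllowed = overdraftAllowed; }
        boolean withdraw(int a) {
            if (a <= balance || overdraftAllowed) { balance -= a; return true; }
            return false;
        }
        int balance() { return balance; }
    }

    public class RtVsDcExample {
        public static void main(String[] args) {
            // RT (level 1): the path is covered once, with the guard made true via its first disjunct.
            Account a1 = new Account(false);
            System.out.println(a1.withdraw(40) + " " + a1.balance());   // prints: true 60

            // DC (level 2): the same path is exercised again, this time via the second disjunct.
            Account a2 = new Account(true);
            System.out.println(a2.withdraw(150) + " " + a2.balance());  // prints: true -50
        }
    }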

3.2 Category partition

A common approach to generating test cases from functional specifications is to partition the input domain of the function/method being tested, and then to select an input for each class of the
partition, according to the principle that all elements belonging to an equivalence class are interchangeable for testing purposes. The Category Partition (CP) strategy [37] is the most well-known extension of input domain partitioning and can be thought of as a sequence of two phases: the first phase performs specification analysis, while the second one generates tests. First, the specification is decomposed into functional units to be tested independently (e.g., system operations or class public operations, depending on the context of application). Second, parameters and environment conditions affecting the function’s behavior are identified. Categories for such parameters and conditions are defined such that they trigger different behaviors of the functionality, and they are chosen so as to maximize the chances of finding errors. A category can be seen as a major property of the parameter or condition. Then, each category is partitioned into a series of distinct choices, i.e., partition classes, covering all the possible values for the category. The set of categories and choices constitutes the test specification, from which it is possible to derive the test frames, that is, templates from which test cases are derived. Each test frame is composed of a set of choices from the categories, where each category contributes at most one choice. Possible interactions among choices belonging to different categories can be annotated in the test specification as constraints. A constraint indicates, for instance, that a choice belonging to one category cannot appear in the same test frame as a choice belonging to another category. This allows testers to reduce the number of possible test cases. A thorough example of how CP can be used to complement state-based testing is reported in [4].
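
As a brief illustration (a hypothetical example of ours, not one of the test specifications used in the experiments), a CP test specification for a method insert(int value) of an ordered-set class could look as follows:

    Functional unit: insert(int value)

    Categories and choices:
      Value with respect to current contents:
        V1: value already in the set
        V2: value smaller than all elements
        V3: value larger than all elements
        V4: value between two existing elements
      Set state before the call:
        S1: empty
        S2: partially filled
        S3: filled (the insertion triggers a resizing)

    Constraints:
      V1 cannot be combined with S1 (an empty set contains no duplicates).

    Example test frame: (V4, S3)
      -> test case: set filled to capacity with {10, 20, 30, 40}; insert(25);
         expected: 25 inserted, set resized, ordering preserved.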

3.3 Precise oracles versus oracles that check state invariants

One important issue in testing software systems is the development of test oracles that can automatically determine, within the test driver, whether a test case is successful or not. Developing such test oracles manually is difficult, expensive, and represents a major cost when writing test drivers. Furthermore, there is no general solution or methodology for their construction [42], which makes them frequently complex and error-prone [6, 7]. In our state-based testing context, the oracle has to check the states reached by the object whose behavior is described by a statechart, at different moments during the execution of test cases, as well as any output it generates. As for checking states, a possible, systematic solution is to dump
the value of all the class attributes and compare them to what is expected, i.e., to check the concrete state [7]. We refer to this kind of oracle as a precise oracle, as it corresponds to the most accurate verification possible. In a state-based testing context, an alternative to precise oracles was proposed by Meyer [33]. It entails checking the state invariants of the states that are expected to be reached during the execution of the test cases, i.e., checking abstract states [7]. The objective is to reuse, during testing, the state invariants defined during the analysis or design of the state-based behavior. Not only does this allow one to reuse existing state invariants, but checking abstract states is also less expensive, both during development and at run-time, than checking concrete states (fewer attribute values are checked). Additionally, there exist techniques to automatically instrument contracts (pre- and post-conditions for operations and class invariants), for instance described in OCL in the context of the UML, into program source code [11]. Such techniques can easily be adapted for state invariants. To illustrate the difference between concrete and abstract states, let us suppose we are testing a class encapsulating an ordered set (one of the subject classes used in Section 4.1). Checking the abstract state of an object instance of the ordered set class simply involves checking which state the object is in (Empty, Partially Filled, Filled, Overflow). On the other hand, checking the concrete state entails checking the values of all class attributes which, in this example, are the actual values contained in the set, the set size, and the number of resizing operations performed.
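
The sketch below contrasts the two oracle styles on a deliberately simplified ordered-set class. This is our own illustration: the real OrdSet class, its Overflow state, and the static checking classes given to the teams (Section 4.3.1) differ from what is shown here. The state-invariant oracle maps the attributes to an abstract state name, while the precise oracle dumps all attribute values.

    import java.util.Arrays;

    // Simplified, hypothetical ordered-set class used only to illustrate the two oracle styles.
    class SimpleOrdSet {
        int[] values = new int[4];
        int size = 0;
        int resizeCount = 0;

        void insert(int v) {
            if (size == values.length) {                     // resize when full
                values = Arrays.copyOf(values, values.length * 2);
                resizeCount++;
            }
            int i = size++;
            while (i > 0 && values[i - 1] > v) { values[i] = values[i - 1]; i--; }
            values[i] = v;
        }
    }

    class Oracles {
        // State-invariant oracle: returns the abstract state (Empty, Partially Filled, Filled).
        static String abstractState(SimpleOrdSet s) {
            if (s.size == 0) return "Empty";
            if (s.size < s.values.length) return "Partially Filled";
            return "Filled";
        }

        // Precise oracle: dumps the concrete state, i.e., all attribute values.
        static String concreteState(SimpleOrdSet s) {
            return "values=" + Arrays.toString(Arrays.copyOf(s.values, s.size))
                    + " size=" + s.size + " resizeCount=" + s.resizeCount;
        }

        public static void main(String[] args) {
            SimpleOrdSet s = new SimpleOrdSet();
            s.insert(7);
            s.insert(3);
            System.out.println(abstractState(s));   // Partially Filled
            System.out.println(concreteState(s));   // values=[3, 7] size=2 resizeCount=0
        }
    }

A mutant that, for instance, fails to increment resizeCount would be caught by the precise oracle but not by the state-invariant oracle, which is exactly the kind of difference investigated in Section 5.2.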

3.4 Mutation analysis and applied mutation operators

To investigate the effectiveness of testing strategies, we use mutation operators [16] to seed faults in the CUT. Mutation operators are syntactic variations that are applied in a systematic way to the program to create faulty versions (mutants) that exhibit different behavior. The idea behind the mutation-based evaluation of test techniques is to generate a set Π = {π1, π2, ..., πn} of programs similar to the program P to be tested, and then to execute each test suite T (generated according to the test strategies) on the original program P and on each mutant πi ∈ Π. The higher the number of mutants generating at least one failure across all test runs (we say, in that case, that the mutant has been “killed”), the more effective the testing strategy. A common measure of effectiveness is the percentage of killed mutants, referred to as the mutation score. One requirement is that the number of mutants generated must be sufficient for us to assess and compare testing
strategies. An assumption behind mutation analysis is that mutants should be representative of possible faults in practice and that the resulting fault detection rate should therefore be an estimate of what to expect in practice. This of course raises all kinds of issues as, in practice, the distribution of faults varies across environments and probably depends on human factors, systems, and application domains. Mutation operators, however, ensure that a systematic and varied sample of faults is generated. A discussion of related issues can be found in the work of Acree et al. [2], Budd et al. [13] and DeMillo [16]. We chose to ensure that mutants are uniformly distributed in the source code, except when particular components/features require particular attention from an evaluation perspective (see, for example, Section 5.7). Finally, complete automation for generating mutant versions of a program is a desirable feature, but it is currently not entirely feasible, due to the possibility of generating equivalent mutants, i.e., mutants that are semantically equivalent to the original program and that are thus impossible to kill by any test case. In our experiments, mutants were seeded according to a set of mutation operators proposed by Kim et al. [27, 28] (see Table 1), which was initially used in their experiment [29] and was the most complete set of mutation operators for Java at the time of writing. Some of these mutation operators are general and apply to all imperative programming languages (e.g., the replacement of one operator by another). Others are specific to object-oriented languages (e.g., the removal of an overriding declaration) and others are even more specific and only apply to Java. Since those operators have been defined for Java, we had to adapt them to C++ (one of the subject classes is implemented in C++): this did not cause any difficulty as Java and C++ are very similar and the mutation operators we selected do not target language constructs that are specific to Java or C++. Last, note that additional mutation operators focusing on the inheritance relationship and polymorphism have also been proposed [3]. They were not used because only the third experiment involves a simple inheritance relationship between two classes and because they were published after our first experiment was planned. In order to perform a comparative study of results obtained for all subject classes, we needed to use the same mutation operators across the three experiments. Future work will look into using those mutation operators.


Table 1 – Kim’s mutation operators [27, 28]

VRO (Variable Replacement Operator): replaces a variable name with other names of the same type and of compatible types.
CFD (Control Flow Disrupt operator): disrupts normal control flow: adds/removes break, continue, return.
SSO (Swap Statement Operator): swaps case block statements in a switch statement, swaps the first and second contained statements in an if-then-else statement, and so on.
LCO (Literal Change Operator): increases/decreases numeric values or swaps Boolean literals.
LOR (Language Operator Replacement): replaces a language operator with other legal alternatives.
AOC (Argument Order Change): changes the number or order of arguments in method invocations.
VMR (oVerloading Method Removal): removes the declaration of an overloading method.
POC (Parameter Order Change): changes method parameter order in method declarations.
ICE (Instance Creation Expression changes): changes an instance creation expression to another instance creation expression of the same and/or compatible class types.
CRT (Compatible Reference Type replacement): replaces a class type with compatible types.
AND (Argument Number Decrease): decreases arguments one by one.
HFR (Hiding Field variable Removal): removes a field variable declaration when it hides the variable in super-classes.
HFA (Hiding Field variable Addition): adds a field variable of the same name as an inherited field variable.
OMR (Overriding Method Removal): removes the declaration of an overriding method.
AMC (Access Modifier Change): replaces an access modifier with other modifiers.
EHR (Exception Handler Removal): removes an exception handler.
EHC (Exception Handler Change): changes an exception handler.
SMC (Static Modifier Change): adds/removes a static modifier.
SCO (Scope Change Operator): changes the scope of a variable.
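
As an illustration of how such operators act on code, the snippet below shows a hypothetical method (not taken from the subject classes) and two mutants of it, each containing exactly one seeded fault produced by an operator from Table 1.

    // Hypothetical method and two of its mutants (illustration only).
    class MutantExample {
        // Original version.
        static int indexOf(int[] a, int key) {
            for (int i = 0; i < a.length; i++) {
                if (a[i] == key) return i;
            }
            return -1;
        }

        // LCO (Literal Change Operator): the literal 0 is increased to 1, so a[0] is never examined.
        static int indexOf_LCO(int[] a, int key) {
            for (int i = 1; i < a.length; i++) {
                if (a[i] == key) return i;
            }
            return -1;
        }

        // CFD (Control Flow Disrupt operator): a 'break' is added, so only the first element is examined.
        static int indexOf_CFD(int[] a, int key) {
            for (int i = 0; i < a.length; i++) {
                if (a[i] == key) return i;
                break;
            }
            return -1;
        }
    }

A test suite kills the LCO mutant, for example, only if it contains a test case in which the key occurs at index 0 of the array; the mutation score is the proportion of mutants killed in this way.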

4 Experiment description

This section describes and motivates the design of our experiment and discusses the threats to validity. Its structure follows the template provided by Wohlin et al. [43] in their well-known textbook on software engineering experimentation. We first define the purpose of the experiment and discuss the context in which it took place (Section 4.1). Section 4.2 reports on the experiment planning, specifying the hypotheses, independent and dependent variables, treatments, and the experiment design. Section 4.3 then outlines the preparation and execution of the experiment. Finally, Section 4.4 discusses threats to validity.

4.1 Experiment definition and context

The overall objective of the experiment, as stated in the introduction, is to provide an initial body of results that can serve as useful guidelines for people who work in organizations that develop
object-oriented systems and have to plan and perform class or class cluster testing. This objective comprises the evaluation of:

1. the fault detection effectiveness of the Round Trip strategy for class (cluster) testing (RT and DC);
2. how effectively the Round Trip strategy can be complemented with black-box testing, namely Category Partition testing;
3. two alternative solutions to implement test oracles in the above context: (1) state invariants, (2) precise oracles (Section 3.3). Precise oracles are implemented by dumping and comparing the values of all the class attributes and are therefore more expensive, though potentially more precise, than just checking state invariants.

Three experiments took place in the context of graduate classes as part of the Master of Software Technology at the University of Sannio, Italy. The first author of this paper was the instructor of the course. We performed our three experiments three years in a row in the context of the laboratory work associated with the abovementioned graduate classes. The three experiments involved testing both C++ and Java classes:

1. The first experiment was performed on a container class handling ordered sets (OrdSet), which was part of a C++ software system, developed in an academic environment and used for code analysis;
2. The second experiment was performed on a class from a Java open source Domain Name Server (DNS), named dnsjava. Release 1.2.3 was used for the experiments. The class used was the class Name, a container class for handling DNS names;
3. The third experiment was performed on a hierarchy of two classes which also come from dnsjava: class NameSet, for handling sets of DNS names, and its subclass Cache, for handling caches of DNS names.

The class diagram for class OrdSet (it comes with an iterator class) is shown in Figure 12. Figure 13 shows an excerpt of the dnsjava class diagram. This class diagram includes the class Name (subject of the second experiment), as well as classes NameSet and Cache (subjects of the third experiment). The classes to be tested have been extracted from working software systems. Although the class used in the first experiment (OrdSet) is quite simple and comes from an academic software system, the classes used in the last two experiments (Name, NameSet, and Cache) come from a
real DNS system and exhibit complex statecharts and transition trees. The first two subject classes (OrdSet and Name) represent container classes. The classes used in the third experiment (NameSet and Cache) represent, instead, control classes [12]. The third experiment focuses on the application of state-based testing to a class hierarchy. Table 2 reports a number of relevant details about the three experiments. Additional information (CUT statecharts and transition trees) can be found in the appendices. To be able to perform all the tasks at hand within time constraints (12 hours of lab work), students were randomly grouped into teams and student teams were asked to write test cases following the abovementioned strategies (the number of teams is reported in Table 2). For each experiment, the table reports the class size in LOC (lines of code), the number of class methods, the number of methods affecting the state of class instances, the number of states and transitions in the class statecharts, the number of paths in the transition trees to achieve RT (Level 1) coverage, and the number of additional paths to be covered for exercising all disjuncts and therefore achieve DC (Level 2) coverage. Note that for OrdSet, the same number of paths achieves Level 1 and Level 2 coverage.

Table 2 – Experiment participants and subject classes' characteristics

Exp.  Teams  Class(es)  Language  LOC  Methods  Methods affecting state  States  Transitions  Paths in the tree  Disjuncts to be covered
1     4      OrdSet     C++       450  45       20                       5       24           26                 0
2     9      Name       Java      462  25       11                       5       32           28                 19
3     5      NameSet    Java      145  11       6                        2       11           9                  4
3     5      Cache      Java      495  19       14                       2 (4)*  31 (36)*     37                 15

(*) In parentheses, the number of states (or transitions) of the flattened statechart.

For each experiment, mutants were created by seeding faults in the CUTs and test sets generated by the student teams were executed on them. The fault detection effectiveness was then computed with the two alternative oracles. Furthermore, we also investigated the cost-benefit of test strategies using surrogate measures for the cost of test case/driver development and execution. All these tasks were performed offline by the experimenters. A total of 42 mutants for class OrdSet, 81 for class Name, 24 for class NameSet, and 25 for class Cache were seeded (Table 3). The table also reports the number of methods that were seeded with mutants, the average method size in classes (LOC), the corresponding amount of
code in those methods, and the density of mutants seeded in those methods (only methods whose behavior affects the object state were seeded with mutants). Their distribution is shown in Figure 2, where we use the acronyms defined in Table 1. Differences across classes are simply due to the fact that mutant seeding is strongly driven by the characteristics of the code under test. The number of seeded mutants has to be sufficient so that differences among test techniques are likely to be visible and, from that perspective, the numbers above were deemed large enough (a difference of 20% in fault detection effectiveness corresponds, on average, to a difference of approximately 4 to 16 killed mutants, depending on the subject class). It is also important to note that, to avoid any positive bias in the results, faults were seeded randomly by one of the authors before the test specifications (transition trees, test frames) were specified by the other two authors; seeding was done by browsing through the code, identifying opportunities to seed mutants, and then randomly selecting a subset, so the mutants seeded were driven by the characteristics of the code. Those were only precautionary measures as there is no straightforward relationship between test specifications and mutants. Finally, fault locations were also selected to obtain fault frequencies that are roughly proportional to the size of the mutated methods.

Table 3 – Data on mutants seeded in the three experiments

Exp.  Class    Number of mutants  Number of mutated methods  Avg. method LOC  LOC of mutated methods  Ratio mutants / mutated LOC
1     OrdSet   42                 13                         11               196                     0.21
2     Name     81                 9                          15               158                     0.49
3     NameSet  24                 5                          7                56                      0.45
3     Cache    25                 6                          9                71                      0.32

Moreover, in order to obtain unbiased results, only faults that cannot be detected by the compiler and that can possibly be found by running test cases were considered. If we take the example of the Access Modifier Change (AMC) mutation operator [28], where an access modifier is replaced by another one (e.g., private by public), it is difficult to imagine how a test case can uncover such a fault through testing. We therefore did not use this particular mutation operator. Such faults should be addressed by systematic design and code reviews. As a result, only 14 of the 19 mutation operators in Table 1 appear in Figure 2. After random mutant injection, we first manually verified the presence of equivalent mutants by analyzing the resulting source code. But a few equivalent mutants remained undetected until we
realized that no test case was able to kill them, thus warranting further analysis. Finally, we paid particular attention to mutating those methods that would be subject to CP testing after RT testing. For such methods, which tend to be of high complexity, the number of seeded mutants was higher (see Section 5.7).

Figure 2 – Mutant distribution on the three systems (number of mutants seeded per mutant category, as defined in Table 1, for OrdSet, Name, NameSet, and Cache)

4.2 Experiment planning

4.2.1 Context selection

All the subjects were graduate students in software engineering with a previous 5-year laurea degree in computer science, computer engineering, or electronics engineering. Every year, these students are carefully selected from around 300 applicants based on their transcripts and experience. Students were very well versed in the programming languages used in the experiments (i.e., C++ and Java), and they had studied UML, including the Object Constraint Language (OCL) [39], in previous courses. Before the experiments, students attended courses on software testing with a particular focus on object-oriented testing.

4.2.2 Research questions and hypotheses formulation

Based on the objectives stated in Section 4.1, we summarize the list of research questions to be investigated and, when hypothesis testing is relevant, the null hypothesis to be tested.


1. When applying DC, is there a significant improvement in terms of the number of mutants killed over the RT strategy? (H0: DC does not significantly improve the number of mutants killed by RT.)

2. When applying CP after DC, is there a significant improvement in terms of the number of mutants killed? (H0: CP does not significantly improve the number of mutants killed when applied after DC.)

3. What are the differences, in terms of numbers and percentages of mutants killed, between using state invariants and precise oracles? (H0: precise oracles do not increase the number of mutants killed when compared to state invariant oracles.)

4. Are there differences among teams in terms of mutation scores, using all three test strategies and the two oracles? The goal here is to assess the consistency of results across teams. (H0: for a given test technique and oracle, proportions of mutants killed across teams do not vary.)

5. What is the relative cost-effectiveness of applying the three different test strategies and two different oracles, both in terms of test preparation and test execution? The goal here is to provide initial data that people can use to decide whether or not to use a particular technique.

6. Can we use a measure of code complexity, such as cyclomatic complexity, to decide where to apply CP, i.e., on which subset of class methods? It is, in practice, neither feasible nor desirable to apply CP to all methods. We need a selection criterion, and the underlying assumption here is that the higher the complexity, the more likely state-based techniques are to miss some of the faults and the more CP will be required.

4.2.3 Variable selection and experiment design

Our experiments can be considered one-factor, multi-treatment experiments [26, 43]. The factor (or independent variable) is the combination of test strategy and oracle to be used. There are a priori eight possible treatments: RT, RT+CP, DC, DC+CP, each with state invariant or precise oracles. However, we do not see RT+CP (with the two oracle strategies) as practically relevant treatments to study since, if one wants to complement RT and has the resources to do so, DC should be achieved first (so that state-based testing is completed) before considering CP. Additionally, DC+CP has to be used with precise oracles, as functional, black-box testing would
not work well with state invariant oracles: CP test cases most of the time do not affect the object state, so no change of state is guaranteed and we thus have to resort to precise oracles. This leads to five treatments: RT and DC combined with the two oracle strategies, and DC+CP combined with precise oracles only. Given that the three testing strategies are applied sequentially, i.e., RT, DC and then DC+CP (Section 4.3), that the tasks have been designed to fit in the time available, and that no serious learning or fatigue effects are expected, the experiment design is quite simple: for each experiment, all five treatments are assigned to all subjects (i.e., to all teams of students). Data is collected off-line for each treatment by running the test cases developed by the subjects on mutant programs, capturing failures, and computing mutation scores. Students were randomly assigned to teams of three to four people each. For the three experiments, as reported in Table 2, we were able to form 4, 9 and 5 teams, respectively. We gave the subjects a total of 12 hours, divided across 3 labs of 4 hours each, held on three consecutive afternoons. Students were told they were evaluated based on the conformance of their work to the test strategies; they were also told not to exchange information about each other's work and they knew we would carefully compare their test drivers. The testers' skills, as well as the slight variability in the number of people composing each team, do not significantly affect the results of our experiment, since we simply asked them to produce test cases according to the aforementioned test strategies, which they all managed to do. Additionally, as detailed in Section 4.3.1, we provided the teams with statecharts, transition trees, frames for CP (devised before seeding mutants and intended, based on our experience in using CP, to be as realistic as possible, assuming an experienced tester), and driver templates. In particular, the provided RT and DC driver templates contain invocations to methods of static classes that check whether the path followed was the correct one. The motivation was to make sure they would properly apply the techniques and to limit the chances of a mistake. We also did not want individual abilities to affect the depth with which CP was applied as this was not the purpose of our investigation. Our work assumes that techniques are properly and fully applied. Recall that we are interested in the cost-effectiveness of the techniques when properly applied, rather than the ability of subjects to properly apply them. In order to ensure our assumptions were correct, test drivers were checked
when delivered by the students, and they were asked to correct them when not fully compliant with the test strategies. Based on our research questions, the dependent variables to be measured in the experiments are:

1. the (cumulative) number of mutants killed when applying each test strategy. The purpose here is of course to measure a strategy's effectiveness at uncovering faults;

2. the number of test cases that testers need to write for each strategy. The number of test cases corresponds to the number of paths in the transition tree, the number of disjuncts in all guard conditions, and the number of test frames for RT, DC, and CP, respectively. This is a surrogate measure for the cost of test case preparation and execution. Though different test cases will of course take varying levels of effort to code and time to execute, the assumption is that, overall, the number of test cases will be roughly proportional to the amount of time needed to write the test drivers and run them;

3. the number of lines of code in the test drivers, for each test strategy and oracle combination. This is another surrogate measure for the cost of test case preparation and execution, but it accounts for the differing costs of state invariant and precise oracles. The latter are more costly and the corresponding code is larger;

4. the amount of CPU time consumed by a test suite execution. This is another surrogate measure for the cost of test case execution that accounts for the differing costs of state invariant and precise oracles.

Our assumption is that if we manage to show consistent results for different size measures, then the credibility of our conclusions is strengthened. Note that test preparation effort is not considered in our experiment as no such data could be collected. As an attempt to quantify the relative cost-effectiveness of test strategies when applied sequentially, we measured the ratio between incremental numbers of mutants killed and incremental costs. More specifically, we look at three ratios, based on the three test cost measures discussed in points 2, 3 and 4 above:

R1 = (incremental # of mutants killed) / (incremental # of test cases executed)

R2 = (incremental # of mutants killed) / (incremental # of KLOC developed)

R3 = (incremental # of mutants killed) / (incremental execution time, in seconds)
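
For instance, with purely illustrative numbers (not taken from our data): if moving from DC to DC+CP kills 6 additional mutants at the cost of 20 additional test cases, then R1 = 6/20 = 0.3 additional mutants killed per additional test case.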


4.3 Operation

This section describes, in practical terms, the preparation and execution of the experiments.

4.3.1 Preparation

Right before the experiment, the system under test and the tasks to be performed were presented to the students. In particular, we briefly explained the main functionalities of the system (to help people understand the context), including a presentation of the classes under test, also showing some simple code making use of the class and methods to be tested. Then, we explained the class statechart and transition tree, including examples of OCL guard conditions. To provide operational insights on how to perform state-based testing, we showed some examples of partial test drivers (i.e., template code), carefully avoiding any reference to the CUT as this could have biased the results. Finally, the principles of CP testing were also restated to students by showing examples of test frames and test cases derived from them. The students were then provided with the following material before the experiment:

1. the material (slides, examples) used to prepare the experiment;
2. the source code of the CUT;
3. HTML documentation of the CUT;
4. static classes for printing the current state of an object according to the two different oracle strategies, as mentioned in Section 3.3;
5. the class diagram, statecharts, transition trees, OCL state invariants and guard conditions of the CUT;
6. the test frames for the methods on which to perform CP.

Note that the subjects were not given the mutant programs and were, therefore, not aware of the faults their test drivers were intended to detect.

4.3.2 Execution

Subjects were asked to apply the testing strategies in a specific order, that is, first RT, then DC, and last DC+CP for the experiments where we considered that they would have enough time to complete this last task given time constraints (Table 4). Given our research questions (Section 4.2.2) we must execute the techniques in that specific order. When producing test drivers, students were also asked to distinguish drivers that implement precise oracles from drivers based
on state invariants. Note from Table 4 that they were not asked to apply both oracle strategies in all experiments. As a result we only have observations for all five treatments for classes Name and NameSet. The reason is that we did not consider they would have enough time to perform everything for OrdSet and Cache given the laboratory time constraints.

Table 4 – Testing and oracle techniques applied during each experiment

Exp.  Class    Order of testing techniques applied  Oracle used
1     OrdSet   RT, DC                               Precise only
2     Name     RT, DC, DC+CP                        Invariant (except with CP), Precise
3     NameSet  RT, DC, DC+CP                        Invariant (except with CP), Precise
3     Cache    RT, DC                               Invariant, Precise

Once all test drivers were developed by the student teams, we collected them and, after a careful inspection to verify their correctness, we ran them using an automatic script that, for each treatment (i.e., combination of testing technique and oracle approach):

1. retrieves the list of test drivers to be executed;
2. executes each driver (i.e., test cases) over the original version of the class(es), recording the output to be used to reveal failures when executing mutants;
3. executes each driver over each mutant version. Each time the output produced (which differs depending on the oracle) does not correspond to that of the original version, the mutant is considered killed (the diff UNIX utility was used for that purpose).

Note that in the case of classes NameSet and Cache (Cache being a sub-class of NameSet), since a parent class is unit tested before its child classes [24], the mutation score of test sets executed on Cache accounts for mutants seeded in inherited methods from NameSet but not killed during NameSet unit testing. For test drivers related to the RT strategy, a method of a static class (provided to test teams) is invoked after each state transition has been triggered. According to the type of oracle chosen (Section 3.3), such a method returns a string identifying the current state of the class (state invariant oracle) or the values of all the class attributes (precise oracle). (Note that in the latter

case, since the subject classes do not produce outputs, the precise oracles only check attribute values.) This string is then printed while executing the driver. We asked students (showing them an example) to develop test drivers according to this format, and to verify that, on the correct version of the software system, the path to be covered on the transition tree was
correctly followed. For CP testing, after the method under test is invoked with parameters and environment settings satisfying the corresponding test frame, the driver invokes a method of a static class that, similarly to what is done for the RT strategy, outputs the values of all class attributes. Test case execution time was measured using the time Unix utility, which measures the CPU user time of a Unix process. Measurements were made on a Dell Inspiron 5150 with a 3 GHz Intel Pentium IV, 512 MB of RAM, and Linux Mandrake 9.2.
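
The diff-based kill decision and the resulting mutation score can be summarized by the following sketch. It is our own illustration, not the actual script used in the experiments; the file names and layout are hypothetical.

    import java.nio.file.*;
    import java.util.Arrays;
    import java.util.List;

    // For each mutant, the output of the driver run on the mutant is compared with the
    // output of the same driver run on the original class; any difference kills the mutant.
    public class MutationScore {
        public static void main(String[] args) throws Exception {
            Path originalOutput = Path.of("out/original.txt");
            List<Path> mutantOutputs = List.of(
                    Path.of("out/mutant01.txt"),
                    Path.of("out/mutant02.txt"));          // ... one output file per mutant

            byte[] expected = Files.readAllBytes(originalOutput);
            int killed = 0;
            for (Path p : mutantOutputs) {
                // Plays the role of the diff utility used in the experiments.
                if (!Arrays.equals(expected, Files.readAllBytes(p))) {
                    killed++;
                }
            }
            double mutationScore = 100.0 * killed / mutantOutputs.size();
            System.out.printf("Mutation score: %.1f%%%n", mutationScore);
        }
    }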

4.4 Threats to validity

We briefly summarize below the threats to validity in our experiments [43].

4.4.1 Conclusion validity

Conclusion validity is related to our ability to determine whether there is a relationship between the test strategies and fault detection effectiveness. It is closely related to the concept of statistical power and can be improved by good measurement reliability and proper treatment implementation [38]. With respect to the latter, we provided the participants with all the material needed to ensure they would properly apply the testing strategies, and we subsequently checked that they did so. Recall that, due to the time required for the tasks and the time constraints typical of such controlled experiments, we had to group participants in teams. This, however, reduces the number of observations we have to work with, though it helps us ensure the completion of the tasks. As a result, our statistical power is rather limited and we chose to work at a level of significance α = 0.05. For CP, there may be a validity problem, since the strategy was not extensively applied to all methods (and not applied at all in the first experiment). We tried to limit the problem by (i) applying CP where we believe it was most useful, i.e., on methods exhibiting higher cyclomatic complexity (results shown in Sections 5.7 and 5.8 support such a conjecture) and (ii) analyzing the potential effectiveness (see Section 5.8) of an extensive application of CP.

4.4.2 Internal validity

Internal validity is related to the risk that confounding factors, rather than the treatments, cause the observed effect. Each of our three experiments was performed with different teams of students, so we do not have learning or fatigue effects taking place from one
experiment to the next. Within each experiment, we made sure everybody had enough time to perform the tasks properly by trying them ourselves beforehand. Based on our observations, as we were monitoring the labs, all the teams had enough time to complete their tasks properly. Students were only graded based on the proper application of the test techniques. Recall that they did not know the effectiveness of their test cases would be evaluated and were not aware of which faults were seeded. Inter-team collaboration was prevented through monitoring the lab, though considering the maturity and quality of these graduate students, this did not turn out to be a problem. The three experiments took place in similar lab settings, following similar steps, though their objectives, as described in Section 4.1, were slightly different and complementary.

4.4.3 Construct validity

Construct validity is related to the validity of our measurements ('constructs'), especially when we use surrogate measures for attributes that are difficult or impossible to measure directly. We use mutation scores as a measure of fault detection effectiveness. One well-known issue in mutation testing is whether mutants are representative of real faults [16]. A number of papers discuss this issue [15, 28, 30]. But in any case the distribution of real faults is likely to vary from one environment to another, and at least mutation operators ensure that a wide variety of faults is considered. Mutation operators tend to introduce “small” faults, but the assumption is that this can only produce conservative results as larger faults would be easier to detect for the testing techniques being evaluated [16, 34]. Furthermore, mutants were randomly seeded before any test case was defined or run and were not known to the student teams. Another issue is related to our surrogate measures for test preparation effort and execution time. Since our experiment did not take place in a real testing environment, and since we provided material to the student teams, it was not possible for us to measure preparation effort in any realistic manner. The number of test cases or the size of the test drivers should be a good surrogate measure, as code size has been shown to be proportional to effort in numerous studies [18]. Using two different size measures should decrease the chances of accidentally getting spurious or biased results which may result from using surrogate measures. Although the number of mutants killed after a given strategy also depends on the performance of the previous strategies (e.g., the number of mutants killed after CP also depends on the number
of mutants killed with DC), the three strategies are to be used sequentially, as DC implies RT, and CP is used as a complement to state-based testing, which remains the basic strategy for complex classes with a state-based behavior.

4.4.4 External validity

As usual, the biggest threats when performing controlled experiments with students are related to external validity. In our experiments, however, the problem was alleviated as follows:

1. As previously mentioned, participants were all carefully selected graduate students who had performed internships with the companies who sponsored their degree, and some of them had additional experience as software developers;
2. Participants were given enough material and training to ensure their tasks would be properly performed.

The classes used in our experiments are representative of classes one can find in real systems, both in terms of size and type. The corresponding statecharts are rather complex, especially if we consider the complexity of the guard conditions and the number of paths in the transition trees (Table 2), when compared to the statecharts we have encountered in many designs, and the resulting test drivers are definitely not trivial, as shown by their size. However, we do not know whether our results would remain realistic if one were to use statecharts to model larger components (e.g., subsystems) and test them. We would expect that in those cases statecharts would be an even more abstract representation of the code, thus reinforcing the results presented in the next section. But this conjecture remains to be investigated.

5 Experimental Results

In the first subsection we report on the mutation scores of the three experiments for the round trip strategy (Section 5.1). Second, we investigate the differences in results when using state invariant assertions or more precise test oracles (Section 5.2). In a subsequent step, we perform an analysis to determine the practical gains of more demanding test strategies (Section 5.3): gain of DC over RT and gain of DC+CP over DC. The variability of results across teams is then analyzed in order to determine plausible causes (Section 5.4). Mutants that appear difficult to kill are carefully investigated to better understand the limitations of the investigated test techniques (Section 5.5). In Section 5.6, we analyze the cost-effectiveness of those test techniques so that
we can assess the gains in light of additional costs. We then try to investigate how method cyclomatic complexity can help explain some of our results and whether it can help guide and focus the application of CP testing in the context of state-based testing (Section 5.7). Last, we attempt to predict what would have been the results if we had had the time to apply CP more broadly, in a more systematic manner (Section 5.8). Note that, when reporting statistical results, every significant p-value (i.e., lower than α = 0.05) is written in bold face.

5.1 Mutation scores of the RT and DC coverage strategies

Table 5 and Table 6 report the mutation score for each team, for all three experiments, using the RT and DC strategies, respectively, both with a precise oracle. These are the only treatments that were consistently applied throughout all three experiments (Table 4) and these results provide insights into what would be the best mutation scores one could hope to obtain with the best oracle possible. Note that for class OrdSet the results are identical in the two tables as there were no disjunctive guard conditions in the statechart (Table 2), and thus the DC strategy was automatically satisfied after covering all paths in the transition tree. To avoid any misinterpretation, recall that these results were obtained from different experiments and thus figures in the same column do not belong to the same test teams (except for the last two rows, which are both related to the third experiment).

Table 5 – Mutation scores for each team using RT

                     Teams                                                 Statistics
                     1     2     3     4     5     6     7     8     9     Avg.   Min    Max
Exp. 1  OrdSet       83%   93%   79%   88%   -     -     -     -     -     86%    79%    93%
Exp. 2  Name         70%   79%   73%   67%   74%   65%   69%   73%   91%   74%    65%    91%
Exp. 3  NameSet      79%   79%   75%   75%   75%   -     -     -     -     77%    75%    79%
Exp. 3  Cache        76%   76%   60%   60%   60%   -     -     -     -     66%    60%    76%

Table 6 – Mutation scores for each team using DC

                     Teams                                                 Statistics
                     1     2     3     4     5     6     7     8     9     Avg.   Min    Max
Exp. 1  OrdSet       83%   93%   79%   88%   -     -     -     -     -     86%    79%    93%
Exp. 2  Name         86%   91%   91%   86%   93%   89%   89%   91%   93%   90%    86%    93%
Exp. 3  NameSet      79%   79%   79%   79%   75%   -     -     -     -     78%    75%    79%
Exp. 3  Cache        76%   76%   76%   64%   64%   -     -     -     -     71%    64%    76%

Both tables show some variability in mutation score averages, ranging from 66% to 86% for RT and 71% to 90% for DC. Results obtained for class Cache were significantly lower and, as


further discussed in Section 5.5, many of the undetected mutants could only be killed by a black-box technique such as CP. Put simply, for class Cache there is a higher chance of not fully covering the code when testing based on its statechart. In general, results show that, even for classes whose behavior is carefully modeled by a statechart, state-based testing kills a significant proportion of mutants but there is significant room for improvement. In most contexts, such detection rates are not likely to be satisfactory during class (cluster) testing as they would lead to a high number of faults slipping to integration and system testing.

5.2 Comparison between precise oracle and state invariants

The results above are obtained when using the best oracle possible. Those oracles, as discussed in Section 3.3, tend to be expensive and it was suggested that state invariant assertions could be used. This section investigates the impact of adopting such assertions as oracles. The variability graph shown in Figure 3 highlights the differences in results when using state invariant assertions and precise oracles after performing DC testing: boundaries indicate the maximum and minimum values, the middle line the average value, and the dots are observations. These results are not available for experiment 1 (i.e., OrdSet class) since only a precise oracle was used in that case (Table 4).


Figure 3 – Comparison between state invariant and precise oracle for DC

Figure 3 clearly shows a mutation score improvement when using a precise oracle. However, results obtained for class Cache warrant further investigation: when using state invariants, all the teams obtained the exact same results whereas there was some variance when using a precise oracle. Recall that mutants are killed with the state invariant assertions when, upon executing a test case on a mutant version of the system, the sequence of reached states differs from the


sequence obtained when executing the test case on the original system. In that particular case, the test cases written by different teams yielded, when executing mutants, the same sequence of wrong states, and thus the same set of mutants was killed. However, when using a precise oracle, only some test cases yielded, for some mutants, a different behavior than the one expected by the oracle. This resulted in some teams killing more mutants than others. To obtain statistical evidence of the difference between the two oracles, we performed a one-tailed Wilcoxon paired test [17], a non-parametric test comparing each team's mutation score for the two oracles. The test is one-tailed as the precise oracle can only perform at least as well as the state invariant one. This test was performed for RT, DC, and DC+CP (see footnote 6). Results are reported in Table 7, for the three classes where both oracle strategies are used, and clearly show that the differences are statistically significant.

Table 7 – One-tailed Wilcoxon paired test results. H0: there is no significant difference between precise oracle and state invariants

           Median number of killed mutants (state invariant / precise oracle) and p-value, by strategy
System     RT                             DC                             DC+CP
Name       45 / 59 / p = 0.002            59 / 74 / p = 0.0002           71 / 80 / p = 0.0003
NameSet    17 / 18 / p = 0.009            17 / 19 / p = 0.007            20 / 20 / p = 0.46
Cache      11 / 15 / p = 0.0032           11 / 19 / p = 0.0032           -  / -  / -

Note, however, that Table 7 shows an exception: NameSet. It turns out that DC enables the detection of most mutants that remained undetected with RT, even with invariant assertions. After investigation, a plausible explanation is that, because NameSet is a much simpler class than Name (i.e., its behavior is fully captured by its statechart), it is easier for DC to kill most of the remaining live mutants. As a result, there are few mutants left to be killed by CP and the type of oracle used does not really change the end result.

6 Recall that CP is always applied using a precise oracle (Section 4.2.3). But performing this test is still useful in order to determine whether the difference between oracles is still significant after applying CP or whether most defects that remained undetected with invariant assertions are detected with CP.
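To make the contrast between the two oracle strategies concrete, the following minimal sketch applies both to a small, self-contained toy class; ToySet, abstractState(), and the checked values are invented for this illustration and are not taken from the experimental drivers, whose structure may differ.

    #include <cassert>
    #include <vector>

    class ToySet {                                  // stand-in for a class such as OrdSet
    public:
      explicit ToySet(int capacity) : capacity_(capacity) {}
      void add(int v) { if ((int)elems_.size() < capacity_) elems_.push_back(v); }
      int size() const { return (int)elems_.size(); }
      int capacity() const { return capacity_; }
      int at(int i) const { return elems_[i]; }
    private:
      int capacity_;
      std::vector<int> elems_;
    };

    // Abstract (statechart) states and the invariants that characterize them.
    enum AbstractState { EMPTY, PARTIALLY_FILLED, FILLED };

    AbstractState abstractState(const ToySet &s) {
      if (s.size() == 0) return EMPTY;
      if (s.size() == s.capacity()) return FILLED;
      return PARTIALLY_FILLED;
    }

    int main() {
      ToySet s(4);
      s.add(7);                                     // transition under test

      // State invariant oracle: only the abstract state reached is checked.
      assert(abstractState(s) == PARTIALLY_FILLED);

      // Precise oracle: the concrete state is checked as well; a mutant that
      // stores a wrong value but preserves the abstract state is only caught here.
      assert(s.size() == 1);
      assert(s.at(0) == 7);
      return 0;
    }

A driver limited to the state invariant oracle would stop at the abstract-state assertion; the last two assertions illustrate the additional checking, and hence the additional driver code, that a precise oracle requires.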


5.3 The Impact of disjunctive coverage and category partition testing

This section focuses, for the classes tested in experiments 2 and 3, on the benefits of using the DC and CP strategies when RT, DC, and CP are applied in a sequence. Our goal is to look at the relative, additional benefit of applying additional test cases as we use additional techniques. Though, due to time constraints, CP was performed on only one method in class Name and class NameSet, our analysis will nevertheless show how CP is able to kill a significant number of

mutants not discovered by other strategies. In a complementary analysis, Section 5.8 will show how we attempted to generalize our analysis to the entire classes. To investigate the impact of CP, we proceeded as follows:
1. for the three classes, CP was applied on the method having the highest cyclomatic complexity;
2. a higher number of mutants was seeded (see Section 5.7) in those methods so as to obtain more precise estimates of increases in mutation scores.
Figure 4 (a) suggests that, for class Name, when considering all teams together and regardless of the oracle, each test technique in the sequence yields a significant increase in mutation scores. It is also clearly visible that RT alone yields very variable results, which suggests that, in practice, it would be difficult to accurately predict what mutation score RT alone would yield.


Figure 4 – Class Name: cumulative mutation scores (a) by treatments and (b) for the different teams, using state invariants.

Figure 4 (b) looks at the cumulative mutation scores obtained for each team individually when testing Name and using state invariant oracles. Team 9 performed significantly better when


using RT, but mutation scores in all teams subsequently converge when applying DC. A detailed analysis of the test cases shows that the different disjuncts selected by team 9 when following RT allowed them to kill a higher percentage of mutants. However, those disjuncts were subsequently exercised by test cases developed by other teams when applying DC, thus yielding similar scores for all teams. This (arbitrary) choice of disjuncts therefore explains the high variability in mutation scores for RT in Figure 4 (a). Figure 4 (b) also shows that team 4, while performing slightly better than others (except team 9) with respect to RT, obtained no benefits from DC. As for team 9, the higher RT scores were due to specific disjuncts selected for RT; however, there was no score improvement due to DC for team 4. As confirmed by a manual inspection of the test cases, the low mutation scores are due to the choice of inputs leading to weak code coverage. Though Figure 4 (b) was obtained with state invariants, the above discussion is consistent with what we obtain for precise oracles.


Figure 5 – Classes (a) NameSet and (b) Cache: cumulative results

Figure 5 shows cumulative mutation scores obtained for classes NameSet and Cache. These results differ from those of Name in several ways.
1. DC did not yield a significant improvement, especially when using state invariants. This may be in part explained by the fact that DC represents a smaller increase in test cases for NameSet (+44%) and Cache (+40%) than for Name (+68%), due to a lower number of disjuncts in guard conditions. This may also be due to choices of inputs leading to weak code coverage, as previously observed.
2. There is a high variability in scores across teams when using CP. Such results are mainly driven by the choice of representative inputs by the testers and can probably be explained


as follows: because the percentage of live mutants after DC is overall larger than for class Name, there is more room for variation in scores when applying CP.

3. After applying CP to NameSet, we can observe that cumulative mutation scores are similar for state invariant and precise oracles. Again, this may be due to the relative simplicity of this class (Table 2 shows that NameSet is simpler than Name and Cache). Simpler classes may not require a precise oracle: if most of their behavior is modeled by the statechart, using the state invariant oracle is enough to catch failures.
Point 3 above warrants further investigation. Most of the mutants still alive after DC when using state invariants were subsequently uncovered using CP. This can lead, at least for this particular subject class, to the following conclusions:
1. mutants that affect state transitions and, in general, the sequence of states followed when executing the test cases, can be discovered by applying DC and state invariant oracles. The class being relatively simple, most of its behavior is well captured by the statechart, and thus using the state invariant oracle is sufficient to detect failures;
2. the remaining mutants can only be killed using a more precise oracle or, alternatively, by performing CP, as they do not affect the state transition behavior of the class instances.
Results obtained for class Cache (Figure 5 (b)), when compared to NameSet, show similar scores and variability for RT and DC with precise oracles. Again, DC did not bring significant benefits when used on top of RT. But when using state invariants, the mutation score is significantly lower than that of NameSet, there is a much larger difference with the scores of precise oracles, and there is no variability across teams. These results can be explained as follows:
1. Even if there are 15 disjunctive conditions to be covered (Table 2), about half of them (8) correspond to statements that belong to methods inherited from NameSet. Thus, as this class is already tested when Cache is under test, we do not expect a significant improvement in mutation score as most mutants in inherited methods were already killed;
2. When applying RT in the presence of disjunctive conditions, the same disjunct was selected by all teams, as it was the most common situation one would expect to encounter, and thus yielded the same results.
3. Mutation scores for DC with state invariant oracles did not improve over RT because mutants did not generate differences in the state transitions resulting from test executions,


but only exhibited a different concrete state (e.g., different values stored in the class data structures). Such differences could only be detected by using a precise oracle. This also explains the variability observed with precise oracles, as the differences in concrete states are driven by the inputs selected by the testers.
To obtain statistical evidence that the mutation score improvements observed in the figures above are significant, we performed a one-tailed Mann-Whitney test [17] comparing, in a pairwise manner, the scores of each team (Table 8). More precisely, this non-parametric test is used to compare:
1. the number of mutants killed with RT vs. the number of mutants killed with DC; and
2. the number of mutants killed with DC vs. the number of mutants killed with DC+CP.

Table 8 – One-tailed Mann-Whitney test. H0: there is no significant difference between two strategies' mutation scores

            Median number of killed mutants for each comparison, with the associated p-value
            State invariant oracle                          Precise oracle
Class       RT vs. DC           DC vs. DC+CP               RT vs. DC           DC vs. DC+CP
Name        45 - 59 (p=0.003)   59 - 71 (p=0.0009)         59 - 74 (p=0.002)   74 - 80 (p=0.007)
NameSet     17 - 17 (p=0.6)     17 - 20 (p=0.02)           18 - 19 (p=0.34)    19 - 20 (p=0.18)
Cache       11 - 11 (p=1)       11 - n/a (-)               15 - 19 (p=0.25)    19 - n/a (-)

Table 8 shows the p-values resulting from comparing mutation scores, as well as the median values (across teams) for each of the techniques compared (e.g., 45 versus 59 mutants killed for RT and DC on Name, respectively). The former tells us about the statistical significance of the difference whereas the latter can help us assess its practical significance, i.e., whether the difference would be worth noting in practice. Results highlight that, for class Name, there is always a statistically significant score improvement from RT to DC and from DC to DC+CP, for both oracle strategies. For NameSet, covering DC did not significantly help, whereas CP, though it was only partially applied (on only one method), significantly improves scores when using state invariant oracles (p-value = 0.02). This is due to the smaller number of mutants killed with DC in this case, which left a large number of live mutants for CP to kill. Finally, for class Cache, DC did not help increase scores for either oracle strategy. The conclusions we drew

above from Figure 4 and Figure 5 are thus confirmed by statistical testing.


The general conclusion we can draw from the above analysis is that applying DC and CP in a sequence, after RT, can help improve mutation scores (from 6% to 31% for DC, except, as explained above, for classes NameSet and Cache using a state invariant oracle, and from 5% to 20% for CP) and therefore fault detection rates. However, we expect the gain for DC to depend on the complexity of the class under test in terms of disjunctive conditions and input values to be chosen by the tester.
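To illustrate what covering disjuncts entails in practice, the sketch below shows a transition whose guard contains two disjuncts together with the test cases needed under RT and DC; the Buffer class and its methods are invented for this example and are not one of the subject classes.

    #include <cassert>

    class Buffer {
    public:
      explicit Buffer(int cap) : cap_(cap), n_(0) {}
      bool isEmpty() const { return n_ == 0; }
      bool isFull()  const { return n_ == cap_; }
      void put(int)  { if (n_ < cap_) ++n_; }
      // Transition under test: reset() is enabled when [isEmpty() or isFull()].
      bool reset() {
        if (isEmpty() || isFull()) { n_ = 0; return true; }
        return false;
      }
    private:
      int cap_, n_;
    };

    int main() {
      // RT: the transition is exercised once; the tester picks a single
      // disjunct (here the first one), so a fault affecting only the isFull()
      // path of the guard, or the code it protects, may remain undetected.
      Buffer a(2);
      assert(a.reset());        // guard true through isEmpty()

      // DC: one test case per disjunct, so both ways of enabling the
      // transition are exercised.
      Buffer b(2);
      b.put(1); b.put(2);
      assert(b.reset());        // guard true through isFull()
      return 0;
    }

Under RT the tester may legitimately pick either disjunct, which is precisely the source of the team-to-team variability discussed above; DC removes that choice by requiring one test case per disjunct.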

5.4 Consistency across teams

In this section, we compare teams in order to determine whether (1) the relative mutation scores of state invariants and precise oracles are approximately the same across teams and (2) the mutation scores of a given test strategy and oracle combination are approximately the same across teams. The former question can give us an indication on whether the relative effectiveness of the two oracle strategies depends on team/individual factors, for each test strategy. In a similar way, the latter question tells us whether the effectiveness of a specific test strategy/oracle combination depends on team/human factors. We use proportion tests [17] to investigate the two above questions. Given two vectors a and b of n elements, where n is the number of teams to be considered, the null hypothesis H0 is that the n teams have the same proportions between successes (vector a) and trials (vector b). For

question (1), those two vectors represent the number of mutants killed with state invariants and precise oracles (see footnote 7), respectively. For question (2), they represent the number of mutants killed with a given oracle and strategy and the total number of mutants, respectively. As highlighted in Table 9, which reports p-value results for question (1), the null hypothesis was not rejected for any of the strategies, except for RT on class Name. This means that the ratios of mutation scores obtained with state invariants and precise oracles were significantly different across teams only for RT on class Name (otherwise, results are consistent across teams). This exception is due to team 4, which had a significantly lower score than other teams after using RT with state invariant oracles, whereas the difference was not significant when using a precise oracle. This just confirms that the scores obtained with RT can be inherently variable in the presence of disjunctive conditions, as

possibly arbitrary choices of disjuncts need to be made, and that a precise oracle can help attenuate differences in scores across disjuncts.

7 Note that the set of mutants killed with a precise oracle always includes those killed with state invariants. In our tests, the former constitute the trials, while the latter are the successes, as we compare how the two oracles fare in relation to each other.

Table 9 – p-values for proportion test. H0: proportions of mutants killed across the two oracle strategies do not vary across teams (question 1)

Class      RT                                  DC       DC+CP
Name       0.001 (0.15 removing team 4)        0.63     0.59
NameSet    0.68                                0.92     0.34
Cache      0.70                                0.90     -

In Table 10, where question (2) is addressed, the null hypothesis was not rejected, except again when using RT on class Name. This exception is due to team 9, as illustrated in the table: if we do not consider data from team 9, then the null hypothesis is not rejected for RT on class Name. Recall that team 9 covered different disjuncts than other teams, which allowed this team to kill a higher number of mutants (Figure 4 (b)). This further confirms that RT is bound to be more variable in mutation scores as it depends on human choices regarding the selection of disjuncts. For simpler classes exhibiting no or few disjuncts in their statecharts (such as NameSet and Cache), this is, however, not a problem.

Table 10 – p-values for proportion test. H0: for a given test and oracle strategy, all the teams killed the same proportion of mutants (question 2) Using state invariants Class RT DC OrdSet 0.019 Name 0.73 (0.96 removing team 9) NameSet 1 0.99 Cache 1 1

DC+CP 0.77 0.09 -

Using precise oracle RT DC 0.27 0.27 0.008 0.83 (0.63 removing team 9) 0.99 1 0.49 0.72

DC+CP 0.20 0.14 -

5.5 Qualitative analysis of mutants

Through a systematic qualitative analysis of mutants, we seek to confirm whether some mutants can only be killed by using a precise oracle when state-based testing is performed and, if mutants are still alive after using precise oracles, whether CP would likely kill them. The goal is to ensure that the quantitative results we observe are really due to the need to use precise oracles or black-


box testing. Such analyses, despite their inherent subjectivity, bring qualitative insights into our results and help explain them in a more precise manner. We first use, as examples, two mutants from the OrdSet class, as they are more intuitive and easier to explain. They are representative examples of the mutants we found that required CP to have a reasonable chance to be detected. The reader is referred to Table 11 for a complete reporting of the number of mutants in that category, for each subject class. Let us start with mutant A (VRO) that was missed by all the test teams. This particular mutant was seeded in the

operator “- (OrdSet& s1, Ordset& s2)” that computes the difference set between two ordered sets s1 and s2 (Figure 6). The maximum size of the difference set is the size of s1. Due to the mutation, it is wrongly initialized with the size of s2. So a failure can only occur if the real difference set is larger than the size of s2. None of the team’s test cases initialized s1 and s2 that way and they were, as a consequence, not able to kill the mutant. As for mutant A, mutant B (LCO) was also seeded in the ‘-’ operator and was found by only one team. The mutation in this case consisted in setting a loop iterator to 1 instead of 0 (Figure 6). That particular loop searches for elements in s1 that are not part of s2. So a failure can only be detected if the first element in s1 is part of the difference with s2. OrdSet * operator -(OrdSet& s1, OrdSet& s2) { int *set1=s1._set; int *set2=s2._set; int size1=s1._last+1; int size2=s2._last+1; OrdSet *set=new OrdSet(size1); // Mutant A: uses size2 instead of size1 int *array=set->_set; register int k,pos=0; for(k=0;k4 V(g)>4 (median)


Figure 10 – Complete CP: classes a) OrdSet b) Name c) NameSet d) Cache


Based on a careful analysis of mutants and CP test frames, we can determine that CP applied to class OrdSet is able to kill all the remaining live mutants, except one (mutant A in Section 5.5). To look at the cost-effectiveness of complete CP, it suffices to consider the number of test cases as a surrogate cost measure, as we are not concerned here with the costs of different oracles. (Results for LOCs and execution times are similar and will not be presented.) The results in Figure 11 are obtained assuming that CP is systematically applied to all methods where V(g) > 4. The resulting additional cost is to develop 82 test cases (frames) for a gain of four additional mutants killed (Figure 11 (a)). We expect class Name to also require additional CP testing as two of its methods show higher complexity but have not been tested with CP: compact (V(g) = 5) and fromDNAME (V(g) = 3). Figure 11 (b) shows that CP enables the detection of six additional mutants at the expense of 36 test cases. Results are similar to those of OrdSet as we also see here a decrease in the cost-effectiveness of test cases for CP. In Figure 11 (c) and (d), for classes NameSet and Cache, we see that DC is less cost-effective than for Name and, as a result, CP is more cost-effective. The graph for class Cache shows two curves: one includes reused test cases (from NameSet, its parent class) whereas the other does not account for them, thus showing an increased cost-effectiveness.


Figure 11 – Cost-effectiveness of full CP: OrdSet (a), Name (b), NameSet (c) and Cache (d)

Overall, we can conclude that though CP entails significant additional cost, it is definitely helpful in detecting a significant number of additional faults. The cost-effectiveness of CP is lower than that of RT and DC (based on comparing ratios). But this is to be expected as it is used last, when the mutants that are most difficult to kill are left to detect. Furthermore, considering that we obtain significant gains (relative increases in mutation scores of 11%, 8%, 11%, and 128%) for a reasonable additional cost in all but one case (OrdSet) (relative increases in test cases of 315%, 76%, 54%, and 52%), it seems practical to use complexity measurement to decide where to apply CP. Complete data can be found in Table 15, where we report mutation score averages (across teams) and cost (in terms of number of test cases and LOC) after applying DC (i.e., we add the test cases required for both RT and DC and the respective numbers of mutants they kill) and after applying DC + full CP. It is noteworthy that using LOC as a surrogate cost measure yields the same results and leads to the same conclusions.


Table 15 – Increases in mutation scores and cost when applying CP

                    Mutation score (mutants killed)     Cost (test cases)                 Cost (LOC)
                    DC     Added by   Increase          DC     Added by   Increase        DC      Added by   Increase
                           full CP                             full CP                            full CP
OrdSet              36     4          11%               26     82         315%            1112    2460       212%
Name                72     6          8%                47     36         76%             1283    647        50%
NameSet             18.4   2          11%               13     7          54%             1489    455        31%
Cache (no reuse)    18     23         128%              52     27         52%             5449    1758       32%
Cache (reuse)       18     23         128%              31     27         87%             4728    1758       37%

(The "Added by full CP" columns report the additional mutants killed, test cases, and LOC obtained when CP is applied to all methods with V(g) > 4, on top of DC.)
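The selection rule used in this analysis (apply CP only to methods whose cyclomatic complexity exceeds a threshold) is simple enough to automate; the sketch below illustrates it with invented method names and complexity values.

    #include <iostream>
    #include <string>
    #include <vector>

    // Minimal sketch of the prioritization rule discussed in this section:
    // apply CP only to methods whose cyclomatic complexity V(g) exceeds a
    // chosen threshold. Names and complexities below are invented.
    struct MethodInfo {
      std::string name;
      int vg;                                   // cyclomatic complexity V(g)
    };

    std::vector<MethodInfo> selectForCP(const std::vector<MethodInfo> &methods,
                                        int threshold) {
      std::vector<MethodInfo> selected;
      for (const MethodInfo &m : methods)
        if (m.vg > threshold) selected.push_back(m);
      return selected;
    }

    int main() {
      std::vector<MethodInfo> methods = {
          {"append", 6}, {"compact", 5}, {"labels", 1}, {"toString", 2}};
      for (const MethodInfo &m : selectForCP(methods, 4))   // threshold: V(g) > 4
        std::cout << "apply CP to " << m.name << " (V(g) = " << m.vg << ")\n";
      return 0;
    }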

6 Lessons Learned

This section reflects on the lessons learned from the abovementioned experiments. We focus on what is particularly important when experimenting with test techniques and we try to discuss issues that will likely be encountered by other researchers in empirical testing research. First, to assess the fault detection effectiveness of test techniques we need to work with known faults for the component under test. But when dealing with actual development faults, we often do not have enough faults to be able to draw statistical conclusions from our analyses. Recall that in the above experiments we used several dozen seeded faults per class so that we had a reasonable chance of observing differences across techniques. Mutation operators are thus often utilized to seed large numbers of faults and create mutant programs [16]. The question that arises is then whether the results we obtain with mutants are comparable to the ones we would obtain with real development faults. This is a difficult question as the profile of fault distributions varies a great deal across development environments and there is probably no such thing as a universal fault population. This means that even if we had enough actual development faults, the results we would obtain would not necessarily be more generalizable. And despite the drawbacks of using mutation operators, there are also advantages. First, they allow us to seed large numbers of faults, as mentioned above. Second, this also implies that we are more likely to seed a wide variety of faults. Last, mutation faults are low-level faults and there is no obvious relationship with testing techniques, except perhaps for white-box testing such as control and data flow coverage. It is therefore hard to seed faults that would bias results one way or another when assessing techniques based on a model representation of the CUT.
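For readers unfamiliar with how the mutation scores used throughout this paper are computed, the sketch below shows the basic bookkeeping: each mutant is executed against a test suite and counted as killed as soon as one test case detects a deviation. The types and the detection callback are invented for illustration and do not represent the actual experimental harness.

    #include <functional>
    #include <iostream>
    #include <vector>

    struct Mutant   { int id; };
    struct TestCase { int id; };

    // Hypothetical callback: returns true if the given test case fails on the
    // given mutant, i.e., the oracle detects a deviation from the original program.
    using KillCheck = std::function<bool(const Mutant &, const TestCase &)>;

    double mutationScore(const std::vector<Mutant> &mutants,
                         const std::vector<TestCase> &suite,
                         const KillCheck &detects) {
      int killed = 0;
      for (const Mutant &m : mutants) {
        for (const TestCase &t : suite) {
          if (detects(m, t)) { ++killed; break; }   // one failing test suffices
        }
      }
      return mutants.empty() ? 0.0 : 100.0 * killed / mutants.size();
    }

    int main() {
      std::vector<Mutant> mutants = {{1}, {2}, {3}, {4}};
      std::vector<TestCase> suite = {{1}, {2}};
      // Toy detection rule standing in for real test execution: here only
      // mutants with an even id are ever detected.
      KillCheck detects = [](const Mutant &m, const TestCase &) {
        return m.id % 2 == 0;
      };
      std::cout << "mutation score: " << mutationScore(mutants, suite, detects)
                << "%\n";                           // prints 50%
      return 0;
    }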


Because we seed a wide variety of faults, analyzing live mutants (i.e., determining why faults remained undetected) becomes a very interesting exercise. For example, in our first experiment, through such an analysis we were able to determine that some degree of black-box testing was probably missing. Through such analyses we are then able to determine the types of faults that will be hard to detect. We can then decide, in a given environment, whether such faults would be common enough to justify complementary testing or the refinement of a technique. To summarize, quantitative analysis of test experiments is not enough to fully exploit the results of an experiment. Systematic qualitative analysis is key, and we need to learn how to perform and report it in a rigorous manner. Another important practical issue when planning a testing experiment is that, when assessing a test technique, we need to ensure that by the time the experiment completes, the test technique has been fully and properly applied. In other words, if a coverage criterion is involved, we need to make sure we achieve 100% coverage if we want the cost-effectiveness results to be usable. In the context of controlled experiments, with usually strictly limited time, this means that the tasks have to be carefully defined and that the right balance has to be struck between the complexity of the tasks and the number of tasks to be studied. When assessing a testing technique, a clear choice has to be made as to the conditions under which the technique is being investigated. Are we investigating it under ideal conditions where no mistakes are committed on the part of testers, assuming they have full technical competence? Or are we trying to account for human factors and evaluate a technique under "representative" conditions, with imperfect tools and training? The latter is difficult as what "representative" means will again depend on the specifics of the development environment. It is likely that controlled experiments will focus on assessing the upper-bound cost-effectiveness under ideal conditions, whereas additional field studies, in various organizations, will then be necessary to account for human factors. However, even in controlled, academic settings, human factors will play a role and create a certain level of indeterminism in the resulting cost-effectiveness. For example, in the techniques investigated, the testers still have to decide on the input parameters of the methods called by the test drivers. This is why it is necessary, in order to grasp some of that statistical variation, to experiment with several teams of testers.


7 Conclusions

This paper investigates, in controlled experiment settings, the effectiveness of state-based testing for classes or class clusters modeled with statecharts. The practical importance of this research stems from the common use of statecharts to model complex components in object-oriented software. Such modeling is recommended by most methodologies [12, 22] and leads to the use of good design practices such as the state design pattern [21]. Our results show that the most referenced and used state-based test technique (round-trip path testing (RT) [7], an adaptation of the well-known W-method [14]) is not likely to be sufficient in most situations as significant numbers of faults remain undetected (from 10% to 34%, on average, across subject classes). This is especially true when using a weaker form of RT where, when guard conditions contain several disjuncts, only one of them is exercised (from 14% to 34%, on average). To address this issue we investigated whether a functional testing technique could be used in combination with state-based testing in order to achieve significantly better results with respect to fault detection. Based on a careful analysis of undetected faults, our assumption was that it might be useful to complement state-based techniques with functional testing because the class behavior is typically not fully captured by statecharts. We decided to determine whether category partition (CP) testing, by far the most commonly used black-box technique, could be used in addition to RT testing. CP testing can be used to test methods in isolation, as opposed to sequences of method executions like state-based testing. It is, however, not suitable to use CP alone to test complex classes with state-dependent behavior, as discussed in [31]. The question is then to decide which subset of methods to apply this technique to, as resources are not usually available to apply CP to all methods. We decided to explore the use of control flow complexity, measured with the cyclomatic number (V(g)), to prioritize which methods should undergo CP testing. Our results show that a large percentage of the faults left undetected by state-based testing could be detected by CP (from 5% to 25% of the total number of mutants). The price of applying CP testing can, however, be significant (increases from 52% to 315% were observed) and it may therefore be necessary to select a subset of methods to which to apply it. Using a selection threshold based on V(g) is supported by our results, which suggest that the number of latent faults after state-based testing tends to increase as V(g) increases.


Another important result concerns the oracle strategy adopted to detect failures while executing test cases. It was suggested by Binder [7] that state invariants could be checked, especially in the context of state-based testing, to detect failures. Assertions are instrumented in the code so that each state transition is verified by checking the expected state invariant. Our results show that oracles based on state invariant assertions are not as effective as precise oracles checking the exact resulting state of the objects on which test cases are run. However, the significant increase in fault detection (from 11% to 72% for RT) can come, in some cases, at a significant expense, i.e., a substantial increase in driver size (e.g., for the weaker form of RT: from a few percentage points for Name up to 300% for NameSet), one of our surrogate measures for cost. Again, it is then a matter of choice whether to use one oracle strategy or the other. This choice may be determined by the complexity and criticality of the class (cluster), as well as the test budget available. When using a precise oracle is too expensive, one possible alternative is to complement state-based testing with CP, which has been shown to help uncover some of the mutants still alive due to the use of state invariant oracles. In other words, CP reduces the gap between the results obtained with the two different oracles. The results above are, with hindsight, rather intuitive. But one has to consider that there are very few practical guidelines that testers can follow to help them test complex classes (clusters) with state-dependent behavior. There are very few tangible results they can rely upon. A number of techniques exist but their cost-effectiveness is not well understood and strategies to combine techniques have not been fully investigated. We evaluated here a number of well-known techniques (two variants of RT state-based testing, and CP) and proposed a way to combine them that is logical and intuitive, and that can be tailored to the test budget available. The results we obtained, through a series of three controlled experiments, clearly indicate the usefulness of combining state-based and functional testing, and also clarify the choices the user has to face concerning the test oracle to be used in test drivers. The limitations of our study are related, as for most controlled experiments, to the artificial settings we are working with. Though our experiments represent a substantial effort, they involved only 4 classes. Those classes were, however, carefully selected to be representative of many real-world system classes, and so that reasonably complex statecharts could be derived and used. But it is quite possible that other results could have been obtained with classes showing very different characteristics. Furthermore, the students were provided with all the


support they needed to properly apply the techniques. This was motivated by our objective to assess the techniques when properly used. Under real project constraints, however, it may turn out that some of the techniques are difficult to apply for a significant portion of mainstream software engineers. Despite their significant advantages from the point of view of experimentation, we also face the usual limitations of using mutation operators to seed faults [16].

Acknowledgments

This work was partly supported by a Canada Research Chair (CRC) grant. Lionel Briand and Yvan Labiche were further supported by NSERC operational grants. This work is part of a larger project on testing object-oriented systems with the UML (www.sce.carleton.ca/Squall). We would like to thank the students involved in the three experiments as well as the organizers of the Master's degree in Software Technology at the University of Sannio. Lionel Briand was the instructor for these courses. Last but not least, we would like to thank Gerardo Canfora for his constructive comments on an early draft of this paper.

References

[1] dnsjava Home Page, http://www.dnsjava.org (Last accessed 4 Apr. 2004).
[2] Acree, A.T., et al., Mutation Analysis, School of Information and Computer Science, Georgia Institute of Technology, Technical Report GIT-ICS-79/08, 1979.
[3] Alexander, R.T. and A.J. Offutt, Criteria for testing polymorphic relationships, Proc. 11th IEEE International Symposium on Software Reliability Engineering, pp. 15-23, 2000.
[4] Antoniol, G., et al., A Case Study Using the Round-Trip Strategy for State-Based Class Testing, Proc. 13th IEEE International Symposium on Software Reliability Engineering, pp. 269-279, 2002.
[5] Ball, T., et al., State Generation and Automated Class Testing, Software Testing, Verification and Reliability, vol. 10 (3), pp. 149-170, 2000.
[6] Beizer, B., Software Testing Techniques, 2nd ed, Van Nostrand Reinhold, 1990.
[7] Binder, R.V., Testing Object-Oriented Systems - Models, Patterns, and Tools, Object Technology, Addison-Wesley, 1999.
[8] Bogdanov, K. and M. Holcombe, Statechart Testing Method for Aircraft Control Systems, Software Testing, Verification and Reliability, vol. 11 (1), pp. 39-54, 2001.
[9] Booch, G., J. Rumbaugh, and I. Jacobson, The Unified Modeling Language User Guide, Addison Wesley, 1999.
[10] Briand, L., Y. Labiche, and Y. Wang, Using Simulation to Empirically Investigate Test Coverage Criteria, Proc. IEEE/ACM International Conference on Software Engineering, pp. 86-95, 2004.
[11] Briand, L.C., W. Dzidek, and Y. Labiche, Using Aspect-Oriented Programming to Instrument OCL Contracts in Java, Carleton University, Technical Report SCE-04-03, http://www.sce.carleton.ca/squall, 2004.
[12] Bruegge, B. and A.H. Dutoit, Object-Oriented Software Engineering - Conquering Complex and Changing Systems, Prentice Hall, 2000.
[13] Budd, T.A. and D. Angluin, Two Notions of Correctness and their Relation to Testing, Acta Informatica, vol. 18 (1), pp. 31-45, 1982.
[14] Chow, T.S., Testing Software Design Modeled by Finite-State Machines, IEEE Transactions on Software Engineering, vol. SE-4 (3), pp. 178-187, 1978.
[15] Delamaro, M.E., J.C. Maldonado, and A.P. Mathur, Interface Mutation: An Approach for Integration Testing, IEEE Transactions on Software Engineering, vol. 27 (3), pp. 228-247, 2001.
[16] DeMillo, R.A., Hints on Test Data Selection: Help for the Practicing Programmer, IEEE Computer, pp. 34-41, 1978.
[17] Devore, J.L., Probability and Statistics for Engineering and the Sciences, 5th ed, Duxbury Press, 1999.
[18] Fenton, N.E. and S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, 2nd ed, PWS Publishing, 1998.
[19] Frankl, P.G. and S.N. Weiss, An Experimental Comparison of the Effectiveness of Branch Testing and Data Flow Testing, IEEE Transactions on Software Engineering, vol. 19 (8), pp. 774-787, 1993.
[20] Frankl, P.G., S.N. Weiss, and C. Hu, All-Uses versus Mutation Testing: An Experimental Comparison of Effectiveness, Journal of Systems and Software, vol. 38 (3), pp. 235-253, 1997.
[21] Gamma, E., et al., Design Patterns - Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.
[22] Gomaa, H., Designing Concurrent, Distributed, and Real-Time Applications with UML, Object Technology, Addison Wesley, 2000.
[23] Graves, T.L., et al., An Empirical Study of Regression Test Selection Techniques, ACM Transactions on Software Engineering and Methodology, vol. 10 (2), pp. 184-208, 2001.
[24] Harrold, M.J., J.D. McGregor, and K.J. Fitzpatrick, Incremental Testing of Object-Oriented Class Structures, Proc. 14th IEEE International Conference on Software Engineering, pp. 68-80, 1992.
[25] Hong, H.S., et al., A Test Sequence Selection Method for Statecharts, Software Testing, Verification and Reliability, vol. 10 (4), pp. 203-227, 2000.
[26] Juristo, N. and A.M. Moreno, Basics of Software Engineering Experimentation, Kluwer, 2001.
[27] Kim, S., J.A. Clark, and J.A. McDermid, The Rigorous Generation of Java Mutation Using HAZOP, Proc. 12th International Conference on Software & Systems Engineering and their Applications, pp. 9-10(11), 1999.
[28] Kim, S., J.A. Clark, and J.A. McDermid, Class Mutation: Mutation Testing for Object-Oriented Programs, Proc. Net.ObjectDays, 2000.
[29] Kim, S., J.A. Clark, and J.A. McDermid, Investigating the effectiveness of Object-Oriented Testing Strategies with the Mutation Method, Proc. Mutation, 2000.
[30] King, K.N. and A.J. Offutt, A Fortran Language System for Mutation-Based Software Testing, Software - Practice and Experience, vol. 21 (7), pp. 686-718, 1991.
[31] Kung, D., et al., On Object State Testing, Proc. IEEE International Computer Software and Application Conference, pp. 222-227, 1994.
[32] Lee, D. and M. Yannakakis, Principles and Methods of Testing Finite State Machines - A Survey, Proceedings of the IEEE, vol. 84 (8), pp. 1090-1123, 1996.
[33] Meyer, B., Design by Contracts, IEEE Computer, vol. 25 (10), pp. 40-52, 1992.
[34] Offutt, A.J., Investigations of the Software Testing Coupling Effect, ACM Transactions on Software Engineering and Methodology, vol. 1 (1), pp. 3-18, 1992.
[35] Offutt, A.J. and A. Abdurazik, Generating Tests from UML specifications, Proc. 2nd International Conference on the Unified Modeling Language, pp. 416-429, 1999.
[36] Offutt, A.J., Y. Xiong, and S. Liu, Criteria for Generating Specification-Based Tests, Proc. 5th International Conference on Engineering of Complex Computer Systems, pp. 119-129, 1999.
[37] Ostrand, T.J. and M.J. Balcer, The Category-Partition Method for Specifying and Generating Functional Test, Communications of the ACM, vol. 31 (6), pp. 676-686, 1988.
[38] Trochim, W.M., The Research Methods Knowledge Base, http://trochim.human.cornell.edu/kb/index.htm (Last accessed 4 Apr. 2004).
[39] Warmer, J. and A. Kleppe, The Object Constraint Language, Addison-Wesley, 1999.
[40] Weyuker, E., The Cost of Data Flow Testing: An Empirical Study, IEEE Transactions on Software Engineering, vol. 16 (2), pp. 121-128, 1990.
[41] Weyuker, E., Automatically Generating Test Data from a Boolean Specification, IEEE Transactions on Software Engineering, vol. 20 (5), pp. 353-363, 1994.
[42] Weyuker, E.J., On Testing Non-testable Programs, The Computer Journal, vol. 25 (4), pp. 465-470, 1982.
[43] Wohlin, C., et al., Experimentation in Software Engineering - An Introduction, Kluwer, 2000.


8 APPENDICES

8.1 Class Diagrams

Figure 12 - OrdSet subsystem class diagram


Figure 13 - dnsjava subsystem used for experiments 2 and 3 (class Name, NameSet and Cache)


8.2 Statecharts


Figure 14 – Class OrdSet statechart


Figure 15 – Class Name statechart (events and guard conditions omitted)


Figure 16 – a) Class NameSet statechart – b) Class Cache statechart – c) Class Cache flattened statechart (events and guard conditions omitted)


8.3 Transition Trees


Figure 17 – Class OrdSet transition tree


Figure 18 – Class Name transition tree


Figure 19 – Class NameSet transition tree


Figure 20 – Class Cache transition tree


8.4 Class Cache cost-effectiveness analysis


Figure 21 – Class Cache: average number of killed mutants vs. number of test cases executed (a), execution time (b), and driver's LOC (c)
