Using Program Data-State Scarcity to Guide Automatic Test Data Generation

Mohammad Alshraideh, Department of Computer Science, University of Jordan, Amman 11942, Jordan, [email protected]
Leonardo Bottaci, Department of Computer Science, The University of Hull, Hull, HU6 7RX, UK, [email protected]
Basel A. Mahafzah, Department of Computer Science, University of Jordan, Amman 11942, Jordan, [email protected]

May 27, 2009

Abstract

Finding test data to cover structural test coverage criteria such as branch coverage is largely a manual and hence expensive activity. A potential low cost alternative is to generate the required test data automatically. Search based test data generation is one approach that has attracted recent interest. This approach is based on the definition of an evaluation or cost function that is able to discriminate between candidate test cases with respect to achieving a given test goal. The cost function is implemented by appropriate instrumentation of the program under test. The candidate test is then executed on the instrumented program. This provides an evaluation of the candidate test in terms of the “distance” between the computation achieved by the candidate test and the computation required to achieve the test goal. Provided the cost function is able to discriminate reliably between candidate tests that are close to or far from covering the test goal, and the goal is feasible, a search process is able to converge to a solution, i.e. a test case that satisfies the coverage goal.


For some programs, however, an informative cost function is difficult to define. The operations performed by these programs are such that the cost function returns a constant value for a very wide range of inputs. A typical example of this problem arises in the instrumentation of branch predicates that depend on the value of a Boolean-valued (flag) variable, although the problem is not limited to programs that contain flag variables. Although methods are known for overcoming the problems of flag variables in particular cases, the more general problem of a near constant cost function has not been tackled. This paper presents a new heuristic for directing the search when the cost function at a target branch is not able to differentiate between candidate test inputs on the basis of instrumentation of the test goal. The heuristic directs the search towards test cases that produce rare or scarce data states. Scarce data states can provide different outputs for the cost function, thereby allowing it to discriminate between candidate tests. The proposed method is evaluated empirically for a number of example programs for which existing methods are inadequate.

Keywords: Test data generation; Test Data Search; Data State Scarcity; Data State Distribution; Diversity; Fitness Function

1

Introduction

The development of high quality software depends crucially on testing. During structural testing, the aim is to test the program under test with test data that covers given elements of the program; an example is branch coverage, in which each branch of the program should be executed by some test. Generating test sets with the required coverage property is typically done manually and is therefore expensive. A number of different automatic software test data generation methods for structural testing have been investigated [27]. These methods may be categorized as either static or dynamic. Static methods aim to analyze the static structure of the program under test in order to compute suitable test cases. Static methods exploit control and data-flow information and may use symbolic execution [19], [20], [3], but the program under test is not executed. Dynamic test data generation methods use information from the execution of the program under test. A simple example of a dynamic method is random test data generation [14]. In this method, candidate test data is generated randomly by sampling values from the input domain. Given that a candidate test case is generated randomly, it may or may not cover the required program element. Each candidate test case is therefore executed and only those that cover a required program element are retained. Unfortunately, the likelihood that a candidate test, generated randomly, will execute a difficult to reach statement or branch can be very low. As an example, consider the problem of generating an input to execute the target branch A of the Flag program shown in Figure 1. The target branch A is executed when a = 1. The problem is to find input values, x and y, such that a is set to 1. Note that f may be a complex and poorly understood function of x and y. Depending on the size of the domains of x and y and also on the behaviour of f, it is possible that there is only a very small probability that a randomly generated input will set the variable a to 1 and thus execute the target branch A.

In general, random test data generation is considered to be ineffective at covering all branches in realistic programs [9].

void Flag(int x, int y) {
    a = f(x, y);
    if (a == 1) {
        flag = true; // target A
    }
    ...
    if (flag) {
        // target B
    }
}

Figure 1: An example of a program with a “flag variable” problem.

Search-based software testing is a dynamic method of test data generation in which search methods or optimisation techniques are used to generate tests; it has been successfully applied in structural testing [7], [1], [29], [15], [22]. Search-based software testing methods rely crucially on an evaluation or cost function that can be used to compare candidate test cases. To illustrate the operation of a cost function, consider once more the problem of generating test data to execute the target branch A of the Flag program. Assume the cost function is required to discriminate between the two candidate tests, (3, 2) and (7, 9). Assume also that these inputs produce values for a as shown below:

Input     value of a     abs(a − 1)
(3, 2)    8              7
(7, 9)    -1             2

Given that -1 is closer to the required value of 1 than is 8, a reasonable heuristic is that a search for inputs in the region of (7, 9) is more likely to be successful than a search in the region of (3, 2). The distance of the value of a to 1, e.g. abs(a − 1), can be used as a cost function since it increases the further a is from the required value of 1 and is zero when a solution is found. A cost function such as abs(a − 1) allows the search method to discriminate between candidate tests. If the information provided by the cost function is reliable, and it is not always so, the search will generate candidate tests with progressively lower values of abs(a − 1). If a solution exists (i.e. the execution of the target branch is feasible) then a search method will converge to a solution, i.e. a test case for which a = 1 and A is executed. Consider now the situation in which a test is required to execute the target branch B in the program Flag. B is executed when the Boolean variable flag is true. The definition of a cost function to guide the search to an input to execute the branch at B is difficult. The execution of the target branch is controlled by a Boolean variable flag. Such a variable, irrespective of the input, will take one of just two values. If the inputs (3, 2) and (7, 9), for example, both produce the same false value for the variable flag, they cannot be discriminated on the basis of the value of flag.

In contrast to the cost function defined at the predicate expression controlling execution of the target branch A, no guidance can be provided to the search at the predicate expression controlling execution of the target branch B. The difference between the cost functions available at A and B is illustrated in Figure 2.


Figure 2: The cost function of abs(a − 1) at A in program Flag is informative because it measures the distance of a to 1. The functions available at B to measure the distance of false to true are necessarily restricted to two values.

There is a plateau corresponding to the “false” value of the flag at the branch B. If the search compares any two inputs that both produce a false value, the cost values returned for these inputs are exactly the same. The search therefore becomes random and likely to be unsuccessful if the input domain is very large. Although branches predicated on flag variables lead to ineffective cost functions, such cost functions also arise in situations not involving flag variables. As an example, consider once more the target branch A. Depending on the behaviour of the function f in Flag, the cost function abs(a − 1) may also be ineffective. Consider, for example, the situation in which f is 0 for the vast majority of the inputs x and y. As a result, the cost function abs(a − 1) is constant for the vast majority of the inputs. When f has a very limited range of values, a will be limited to the same range of values and consequently the cost function abs(a − 1) will also be limited and, as a consequence, provide little or no guidance to the search. This paper presents a heuristic for guiding the search when the cost function at a Boolean expression designed to measure the distance of an input to a test goal returns a constant or very limited range of values and is therefore unable to discriminate between candidate inputs. The heuristic is to search for inputs that produce scarce data states. The heuristic can be applied to programs that contain flag variables but is also applicable to a much wider class of programs. A method is given to instrument the program under test in order to identify those inputs that produce scarce data states. Cost functions for scarce data states are also given.

The rest of the paper is organized as follows: Section 2 presents the background and related work. Section 3 describes constant cost functions and information loss in a program. Section 4 presents data state scarcity as a search strategy. Section 5 discusses data state scarcity in relation to the related concept of diversity. Sections 6 and 7 describe the implementation of data state scarcity search.

Section 8 presents an empirical investigation and Section 9 concludes the paper and presents further work.

2

Background and Related Work

This section presents an overview of search-based test data generation, evolutionary algorithms and the computation of the cost function for branch coverage. It also presents, by considering a number of example programs, the problem of a cost function that is constant for most of the input domain. Existing techniques are shown to be inadequate for this problem.

2.1

Search-Based Software Testing

Search-based software testing is a general approach towards the automatic generation of test data for structural test criteria and depends on a heuristically guided iterative search of the input domain. There are a number of different search methods that can and have been used, each with different characteristics; this paper does not address the differences between these search methods. The basis of any search method, however, is the guidance provided by a heuristic cost function of some description. This paper presents an effective cost function for a class of test data generation problems for which, so far, no effective cost function has been available. The heuristic cost function presented in this paper has been validated empirically. In order to do this, it was necessary to use a particular search method, and an evolutionary search algorithm was used. It should be stressed that the contribution of this paper is the cost function or heuristic and that there is no reason to suppose that the heuristic can be used only with an evolutionary search algorithm. Evolutionary algorithms are search methods based on theories of natural selection and survival of the fittest in the biological world. Evolutionary algorithms maintain a “population” of candidate solutions rather than a single candidate. Genetic algorithms, a subclass of evolutionary algorithms, have been used for search-based software testing to search for structural test data [20], [35], [32]. Figure 3 shows the basic steps of a genetic algorithm of the steady-state variety [10]. First the population is initialized with candidate solutions, either randomly or with user-defined candidates. The genetic algorithm then iterates through an evaluate-select-produce cycle until either a solution is found or some other stopping criterion applies. In the evaluate step, any newly created candidate tests are evaluated by the cost function. In search-based software testing, this is done by executing the candidate test on an instrumented version of the program under test. Once every candidate test has been evaluated, one or two candidates are selected from the population. The candidates are selected randomly but with a bias towards those candidates with the lowest cost, i.e. estimated to be closer to a solution. A simple way to do this is to rank the candidates of the population in terms of lowest cost and select a candidate with a probability that increases with rank. From these candidates, known as parents, are constructed new candidate solutions, known as offspring. The offspring are constructed to resemble their parents who, because of the biased selection, are expected to be closer to a solution than average for the population.

(Figure 3 flowchart: initialise population with random candidate tests; evaluate candidate tests; if a solution is found or a timeout occurs, stop; otherwise delete the least fit candidates to limit the population size, select parent(s), use mutation and crossover to produce offspring, add them to the population and evaluate again.)

Figure 3: Flowchart of search-based test data generation using a genetic algorithm.

For example, an offspring may be created by copying a parent and making a small modification, known as a mutation operation. An offspring may also be created by combining elements from two parents, known as a crossover operation. The newly created offspring are evaluated and ranked with the other members of the population. The population size is limited and so the lowest ranked candidates that exceed that limit are discarded. Over many iterations of the evaluate-select-produce cycle, the lowest cost candidates are retained and the higher cost candidates are discarded. The population will contain progressively lower cost candidates until a solution is found or the search is not able to find any candidates with a lower cost than that of the existing candidates. The success of the search depends crucially on the reliability of the guidance provided by the cost function. The cost function evaluates each candidate in terms of its “closeness” to a solution. In the context of evolutionary algorithms, an evaluation function is known as a fitness function because of the analogy with evolutionary fitness. To avoid the suggestion that the use of the evaluation function presented in this paper requires a commitment to evolutionary search, the more general term ‘cost function’ is used.
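To make the evaluate-select-produce cycle concrete, the sketch below shows one possible steady-state loop written in the style of the example programs; the two-integer input representation, the ranges used for random generation and mutation, and the placeholder cost function (the abs(a − 1) distance from the Flag example, with x + y standing in for f) are illustrative assumptions rather than the implementation used in this paper.

import java.util.*;

class SteadyStateSearch {
    static final Random rnd = new Random();

    // Placeholder cost: distance from the required value 1, using x + y as a
    // stand-in for f(x, y); in practice this is the instrumented branch distance.
    static double cost(int[] input) {
        return Math.abs((input[0] + input[1]) - 1);
    }

    static int[] search(int populationSize, int maxOffspring) {
        List<int[]> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) {
            population.add(new int[] { rnd.nextInt(201) - 100, rnd.nextInt(201) - 100 });
        }
        for (int produced = 0; produced < maxOffspring; produced++) {
            // rank candidates, lowest cost (fittest) first
            population.sort(Comparator.comparingDouble(SteadyStateSearch::cost));
            if (cost(population.get(0)) == 0) {
                return population.get(0); // solution found
            }
            // biased selection: parents drawn from the better-ranked half
            int[] p1 = population.get(rnd.nextInt(populationSize / 2));
            int[] p2 = population.get(rnd.nextInt(populationSize / 2));
            // crossover followed by a small mutation
            int[] offspring = { p1[0], p2[1] };
            offspring[rnd.nextInt(2)] += rnd.nextInt(11) - 5;
            population.add(offspring);
            // discard the least fit candidate to keep the population size constant
            population.sort(Comparator.comparingDouble(SteadyStateSearch::cost));
            population.remove(population.size() - 1);
        }
        return null; // no solution within the offspring budget
    }
}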

2.2

Branch Cost Functions

A conditional statement (an if-statement or a while-statement) contains a predicate expression and a pair of branches: the true branch, executed when the predicate expression is true, and the false branch, executed otherwise. The so-called multi-way or switch statements are considered to be a sequence of conditional statements. In some programs, the execution of a target branch may require the execution of a sequence of nested branches. The problem of nested branches is not significantly different from the point of view of this paper. The problem decomposes into a sequence of sub-goals, in each of which a single branch must be executed in the context of a solution for the enclosing branch. The cost function of abs(a − 1) applied to the Boolean expression a == 1 of the Flag program illustrates the cost function for the equality operator. Similar cost functions are available for the other relational operators as shown below, where a, b are numbers and ε is the smallest positive constant in the domain (i.e. 1 in the case of integer domains and the smallest number greater than zero in the particular real number representation).

Predicate expression    Cost of not satisfying predicate expression
a ≤ b                   a − b
a < b                   a − b + ε
a ≥ b                   b − a
a > b                   b − a + ε
a = b                   abs(a − b)
a ≠ b                   ε − abs(a − b)
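These costs can be implemented directly; the sketch below assumes integer operands (so ε = 1) and clamps the cost at zero once the predicate is satisfied, a convention the table leaves implicit.

class BranchDistance {
    static final int EPSILON = 1; // smallest positive constant for integer domains

    // Cost of not satisfying each relational predicate; a result of zero
    // means the predicate is already satisfied.
    static int lessEqual(int a, int b)    { return Math.max(0, a - b); }                     // a <= b
    static int lessThan(int a, int b)     { return Math.max(0, a - b + EPSILON); }           // a < b
    static int greaterEqual(int a, int b) { return Math.max(0, b - a); }                     // a >= b
    static int greaterThan(int a, int b)  { return Math.max(0, b - a + EPSILON); }           // a > b
    static int equal(int a, int b)        { return Math.abs(a - b); }                        // a == b
    static int notEqual(int a, int b)     { return Math.max(0, EPSILON - Math.abs(a - b)); } // a != b
}

For the Flag example, equal(a, 1) reproduces the abs(a − 1) cost discussed earlier.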

These cost functions are effective only in so far as they are able to take on a range of values, and this is possible only if the operands of the relational operator are able to take a range of values. A particular example in which an informative cost function is not available is that of the Boolean expression consisting of a flag variable. Since a Boolean variable may take only two values, a cost function is necessarily restricted to only two values. Techniques for tackling this special case are discussed first. They are shown to be inadequate, however, for the more general problem of a restricted range cost function.

2.3

Boolean Flag Variable Instrumentation

In the program Flag of Figure 1, note that a potential route to the execution of the target branch B is to input values of x and y for which f(x, y) is 1. This satisfies the condition at the conditional statement A and sets the flag true. This is a potential rather than a certain route because the value of flag used at B is not necessarily the value defined when the true branch at A is executed. In general, there may also be other assignments to flag before it is used at B. A number of techniques have been proposed to define an effective cost function in programs that contain Boolean flag variables. Bottaci [6] gives a technique for programs in which the flag variable is assigned a predicate expression (as opposed to a constant true or false). The technique relies on a program transformation that replaces Boolean variables with real-valued variables.

The sign of the real value indicates the Boolean value; the magnitude of the real value is a predicate expression cost. To illustrate the transformation, consider the AllTrue program of Figure 4 that checks if all the Boolean values in an array are true. At any iteration of the loop, the flag variable alltrue will be set false when the array element is false. The execution of the target is controlled by the flag variable alltrue.

AllTrue(boolean[] a) {
    boolean alltrue = true;
    for (i = 0; i < 64; i++) {
        alltrue = alltrue && a[i];
    }
    if (alltrue) {
        // target
    }
}

Figure 4: Example program AllTrue with a flag variable problem.

The transformation requires that the Boolean expression that is used to set the flag value is instrumented to compute a Boolean expression cost rather than a Boolean value. The logical constants are instrumented as -1.0 and 1.0 so that cost(a[i]) returns -1.0 when a[i] is true and 1.0 when a[i] is false. The Boolean flag variable is replaced by a floating point variable that can hold the Boolean expression cost rather than a Boolean value. The transformation is illustrated in Figure 5.

AllTrueTransDouble(boolean[] a) {
    double alltrue = cost(true); // -1.0 is the cost of constant true
    for (i = 0; i < 64; i++) {
        alltrue = costAnd(alltrue, cost(a[i]));
    }
    if (alltrue < 0) { // any negative value represents true
        // target
        // instrument as alltrue - 0 + epsilon
    }
}

Figure 5: A testable transformation of the AllTrue program of Figure 4 using the transformation in [6].

The real-valued function costAnd replaces the logical-and. This function is defined as the sum of the costs of the two operands in the case that the operands are both false, i.e. are greater than zero. If only one operand is false then costAnd returns the cost of the false operand. When both operands are true, the resultant cost is the product of the operand costs divided by the sum of the operand costs. So, for example, if a[0] is true then after the first iteration alltrue is assigned the value -1/2. If a[1] is false then after the second iteration alltrue is assigned the value 1.0 and then the value 2.0 if a[2] is also false. Once alltrue is false, it remains false but its value increases with each false element of the array. A detailed and general justification for the costAnd function is given in [5, 4].
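A minimal sketch of cost and costAnd with the behaviour just described, in which positive values represent false and negative values represent true; the full definitions and their justification are those given in [5, 4], and the class and method layout here is only illustrative.

class BooleanCost {
    // Instrumented value of a Boolean constant: negative means true, positive means false.
    static double cost(boolean b) {
        return b ? -1.0 : 1.0;
    }

    // Combined cost of a logical-and over two instrumented operands.
    static double costAnd(double left, double right) {
        if (left > 0 && right > 0) {
            return left + right;                    // both operands false: sum of the costs
        } else if (left > 0 || right > 0) {
            return Math.max(left, right);           // exactly one operand false: its cost dominates
        } else {
            return (left * right) / (left + right); // both operands true: result remains negative
        }
    }
}

With these definitions, costAnd(-1.0, -1.0) is -0.5, a subsequent false element gives 1.0 and a further false element gives 2.0, matching the worked example above.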

For the purpose of this paper, it is sufficient to appreciate that the loss of information that occurs when a flag variable is assigned the value of a predicate expression can be avoided by replacing flag variables with real-valued variables. Any predicate expressions that use the Boolean flag variable are rewritten to test the sign of the instrumentation value. These predicate expressions can then be instrumented as relational operator expressions using the cost functions given in Section 2.2. Note that the transformation is not applicable when the flag variable alltrue is assigned a Boolean constant as opposed to the value of an expression involving relational operators. This is the situation that is shown in the example program Flag. Harman et al. [18], [19] present a program transformation to remove internal flag variables from branch predicates, replacing them with the expressions that led to their determination. In the transformed version of the program, the branch predicate is flag-free and can therefore be instrumented in a straightforward way. Their approach, however, does not handle assignment to flags within loops. In particular, 5 levels of program difficulty are identified and the given transformations are effective only for the first 4 levels. The fifth level consists of programs in which assignments are made to flag variables inside a loop that does not also contain the target branch. The example program AllTrue of Figure 4 is an example of a level 5 problem. A testability transformation for loop-assigned flags is, however, given by Baresel et al. [2], where they extend the transformation approach for internal flags assigned within loop structures. To illustrate this transformation, consider a modification of the AllTrue program of Figure 4 in which the flag alltrue is set to a constant, i.e. the statement alltrue = alltrue && a[i]; is replaced with

if (alltrue && a[i]) {
    alltrue = true;
} else {
    alltrue = false;
}

The result of applying the transformation is shown in Figure 6. The first step of the transformation is to insert two numeric variables, counter to count the number of loop iterations and fitness to accumulate a fitness value. Both are initialised to zero. The next step is to copy the flag variable that controls execution of the target branch to be the condition of an if-statement inserted immediately after the statement at which the flag variable alltrue is assigned a value. In the true-block of this if-statement is inserted a statement to set the value of fitness to that of counter. Note that on each iteration in which the flag is assigned the desired value, the difference between fitness and counter is zero. If, on the last iteration of the loop, the desired value is assigned to the flag variable, the fitness variable will be assigned the value of the counter. Consequently, the flag variable alltrue can be represented by the expression counter == fitness and the transformation finally replaces the flag variable that controls execution of the target branch with the expression counter == fitness. It is clear that when the target is not executed, the value of the cost function abs(counter − fitness) represents the number of iterations since the last assignment of true to the flag. This will vary in value and hence provide guidance to the search.

AllTrueTransCounter(boolean[] a) {
    boolean alltrue = true;
    counter = fitness = 0;
    for (i = 0; i < 64; i++) {
        if (alltrue && a[i]) {
            alltrue = true;
        } else {
            alltrue = false;
        }
        if (alltrue) {
            fitness = counter;
        }
        counter++;
    }
    if (counter == fitness) {
        // target
    }
}

Figure 6: A testability transformation of the AllTrue program of Figure 4.

The transformation shown in Figure 6 is the coarse-grained transformation presented in [2]. The difference between the coarse-grained transformation and the fine-grained transformation lies in the modification of the fitness variable within the loop when the flag is false. The fine-grained approach adds to fitness the cost, a number less than 1, of the failed branch predicate. This approach has also been extended by Wappler et al. [?] to instrument programs with function-assigned flags. Programs with function-assigned flags are difficult to instrument because in polymorphic languages the specific function call that results in the assignment of a flag value cannot be determined statically. The chaining method of Korel [23] can be an effective technique for programs in which the cost function at a branch predicate is unable to guide the search. The chaining method attempts to find inputs to execute a path from each last definition of each variable used in the unsatisfied branch predicate. In broad terms, the heuristic is that the execution of a new path may produce a different value at the goal branch expression. In the example AllTrue program in Figure 4, there is only one path through the program up to the target branch and hence the chaining method is ineffective. There are many programs for which the transformations of Bottaci [6], Harman et al. [18], [19], Baresel et al. [2] and Wappler et al. [?] are inapplicable. These transformations rely on the identification of a flag variable and this is not always possible in general. As an example of such a program in which a “flag variable” problem nonetheless occurs, consider the program Orthogonal shown in Figure 7.

This program, which contains no Boolean variables, determines whether two binary vectors are orthogonal by computing their inner product. Each of the two input arrays consists of integers with the value 0 or 1. This is a precondition of the program and not a property that can be determined from a static analysis of the program.

Orthogonal(int []a, int []b) { // for all i, a[i] and b[i] in [0, 1]
    int product = 0;
    for (i = 0; product == 0 && i < 64; i++) {
        product = a[i] * b[i];
    }
    if (product == 0) {
        // target
    }
    ...

Figure 7: A difficult to execute branch in a program with an integer “flag” variable.

The target branch is difficult to execute because for almost all random inputs, the value of the integer product is set to 1. Even though product is not a Boolean variable, a “flag” variable problem arises because product may take one of only two integer values, 0 or 1. Since these values are only multiplied, they are essentially Boolean values. The transformation in [2] applies only to variables that can be identified as flag variables; product is not such a variable and so the transformation is not applicable. The method of data-flow graph search of Korel [24] searches for paths that are selected from an examination of the data dependency graph of the variables that appear in the target branch predicate expression. In the example program of Figure 7, all such paths begin with the initial assignment to product and then take one or more iterations of the loop. By searching the space of paths, which in this case is relatively small, the solution is found. However, these paths, except for the solution path, all produce the same branch distance value and so the search is random. A random search would pose a problem if the arrays were very large.

3

Constant Cost Functions and Information Loss

It is clear that any cost function will produce a limited set of values at a branch predicate expression if the set of values reaching that expression is limited. The use of flag variables is only one way this can occur. In general, it occurs because of information loss [34]. For an example of a program that has no flag variables but exhibits information loss that makes cost function instrumentation ineffective, consider the program Log10 shown in Figure 8. Log10 computes the log10 of the input x (an integer between 1 and 100,000) and then takes the ceiling of this value to produce an integer between 0 and 5 which is then used to access an integer array a. The first element of this array is 0 but the remaining five elements are 1. The behaviour of the Log10 program can be seen from Figure 9. Note how the value of y varies continuously as the log10 function of the input x but that k takes only a small number of values.

The fact that most of the values in the array a are equal means that the branch distance function is constant for all inputs except when x is 1.

void Log10(int x) { // x in [1, 100,000]
    a[0] = 0;
    a[1] = a[2] = a[3] = a[4] = a[5] = 1;
    double y = log10(x);
    int k = ceiling(y); // y in [0, 5]
    if (a[k] == 0) {
        // target
    }
}

Figure 8: A difficult to execute branch in a program for which no existing transformation technique is effective.


Figure 9: The distribution of the data state values in the program Log10. The figure shows that the function that computes k produces a larger set of values than the branch cost function; hence it is easier to search for different values of k.

Note that none of the techniques discussed previously are applicable to this program. The technique of substituting cost values for Boolean values [6] is not applicable since there are no Boolean expressions that can be effectively instrumented. The only Boolean expression is the expression that controls the execution of the target branch and is the expression that is almost constant. Also, there are no flag variables that can be used in the transformations of Harman [19] and Baresel et al. [2]. There is a single path through the program and so the path search methods of Korel [23, 24] are not applicable. Voas [34] introduces the notions of information loss and the domain/range ratio. The information loss of a mapping is the ratio of the size of the domain to the size of the range.

Log10 computes a mapping from x in [1, 100 000] to a[k] in [0, 1]. The information loss from the input x to the value of a[k] is extreme, at 100 000/2. It is the almost complete loss of information that is responsible for the ineffective cost function at a[k] == 0. A necessary condition for any cost function to be able to guide the search is that it should produce a range of values as it is evaluated during the execution of different inputs. When this necessary condition is not met, i.e. the cost function is constant over the input region so far explored, the search degenerates into a random search. Given no success in the search for values to satisfy a specific search goal, a plausible strategy is to generalise the goal, i.e. modify it in order that it is more easily satisfied. More specifically, a search sub-goal is created to search for inputs that produce a range of values at the cost function rather than search for inputs to minimise the cost function. The idea behind this generalisation is that once an input region has been found in which the cost function varies, it will be possible to resume the search for inputs that minimise the cost function. At the test goal, however, any cost function is constant and so instrumentation here cannot guide the search for even the more general goal of producing a range of values. The advantage of adopting the more general search goal, however, is that a suitable cost function is easier to construct. To consider how this instrumentation can be done in a particular example, note that the mapping of Log10 may be decomposed into a number of computational steps. Each step of the computation implements a mapping. The map from x to y is performed by log10. The ceiling function maps y to k and the array a maps k to a[k]. Each of these mappings has its own information loss so that the high information loss computed by the program Log10 is spread among these mappings. The information loss from x to k, for example, is such that 90% of the inputs map to a single value k = 5, 9% of the inputs map to a single value k = 4 and so on. Since the information loss from x to k is much less than that from x to a[k] == 0 it is easier to locate scarce values of k than scarce values of a[k] == 0. The notion of progressive information loss suggests that stepping back in the statements of a program can yield more information for discriminating between candidate tests. The search for scarce values of k is motivated by the heuristic that assumes that when k takes a scarce value then so will a[k] == 0. The underlying data scarcity search heuristic is that inputs that produce scarce intermediate values will also produce scarce values at the cost function expression and a scarce value at the cost function expression is more likely to be different to the cost values so far encountered. Figure 9 shows the distribution of these values. For example, in a population of 100 individuals, selected randomly, there is a reasonable probability of encountering inputs that produce values of k that are 4 or 3. If the search is directed to search in the region of these inputs, then inputs that produce a value of 2 will be found. Such inputs are scarcer than those that produce 4 or 3 and so the search is directed to generate inputs similar to those that produced a value of 2, with the result that inputs producing 1 and finally 0, will eventually be found.
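The distribution sketched in Figure 9 can be checked by exhaustive enumeration; the short program below, written purely as an illustration, tallies k = ceiling(log10(x)) over the whole input domain and confirms that inputs producing small values of k are scarce.

import java.util.TreeMap;

class KDistribution {
    public static void main(String[] args) {
        TreeMap<Integer, Integer> frequency = new TreeMap<>();
        for (int x = 1; x <= 100_000; x++) {
            int k = (int) Math.ceil(Math.log10(x));
            frequency.merge(k, 1, Integer::sum);
        }
        // prints {0=1, 1=9, 2=90, 3=900, 4=9000, 5=90000}
        System.out.println(frequency);
    }
}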

4

Data State Scarcity as a Search Strategy

In abstract terms, a program p can be considered to be a function from an input domain D to an output domain O, i.e. p : D → O. For a given input, the program produces the corresponding output by means of a computation d1, d2, . . . , dn, which is a sequence of data states, starting with the input state d1 and ending with the output state dn.


Figure 10: An illustration of the notion of the scarcity of a value r with respect to a mapping f.

The scarcity of a value depends on the relative frequency of its occurrence in some given context. Taking a mapping as the context, the scarcity of a value r produced by a mapping may be defined as the proportion of the values in the domain that do not map to r. More formally, let f : D → R be a mapping from domain D to range R. The image of a subset S of the domain under a mapping f is the set of values to which members of the subset are mapped, denoted f(|S|). If the inverse relation is denoted f⁻¹ then the image of a value r in the range under the inverse relation is the set of values in the domain that map to r, denoted f⁻¹(|{r}|). The set of values that do not map to r is D \ f⁻¹(|{r}|) and corresponds to the shaded region in Figure 10. The scarcity of r with respect to the mapping f is the proportion of values in the domain that do not map to r, which is denoted

s(r, f) = |D \ f⁻¹(|{r}|)| / |D|

The scarcity of values rarely produced by f approaches 1 and the scarcity of common values is small, with the scarcity of the value of a constant function equal to zero. Consider now a mapping consisting of the composition of two mappings f : D → I followed by g : I → R, to give the composition denoted fg : D → R. The scarcity of r with respect to the mapping fg is therefore

s(r, fg) = |D \ (fg)⁻¹(|{r}|)| / |D|

Figure 11 illustrates the composition of two mappings fg. The data scarcity heuristic is to search for inputs that produce scarce values in I with respect to f in the hope that such values are mapped to scarce values in R with respect to fg. Let i in I have scarcity s(i, f). Consider now the scarcity of g(i) in R with respect to fg, i.e. s(g(i), fg). If i is scarce with respect to f, i.e. s(i, f) is close to 1, it cannot be assumed that g(i) in R is also scarce.


Figure 11: An illustration of the notion of the scarcity of a value with respect to a composition of mappings. Note that all the values in I that map to a scarce value in R are themselves scarce, but not all values in I that map to a prevalent value in R are prevalent; i is a counter-example.

The value i in Figure 11 is an example of a value that is scarce in f but not in fg because g maps i to a non-scarce value with respect to fg. To consider a more extreme example, note that if f is the identity function over a large domain and g is a constant function then every value in I is scarce but the value in R is not. A more likely situation in practice, however, is that the information loss (domain/range ratio) is spread more evenly between the two mappings f and g, in which case not all scarce values in I map to non-scarce values in R. It can be shown, however, that if r in R is scarce with respect to fg then intermediate values that map to r must also be scarce, i.e. for all i in I such that g(i) = r, i is scarce with respect to f. In Figure 11, j is an example of such a scarce intermediate value. This can be shown by the argument that the set of values that map to i, i.e. f⁻¹(|{i}|), must be a subset of those that map to r, i.e. (fg)⁻¹(|{r}|). Since this latter set must be small if r is scarce with respect to fg, the subset f⁻¹(|{i}|) must also be small. The heuristic is reliable when inputs that map to scarce values with respect to f are close in the domain to inputs that are mapped to scarce values in R with respect to fg. It is plausible that values close in the range have similar scarcity because values close in the range are likely to be produced by the application of the same function or operation and that function will have an information loss that does not vary greatly with small differences in input. An example is the log10 function. Small values, similar in size, are also scarce, i.e. similar in scarcity. Larger values, also similar in size, are not scarce; again, similar in scarcity.
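For a small finite domain, the definition of scarcity can be evaluated directly by enumeration, as in the sketch below; this is purely illustrative, since in practice scarcity must be estimated from the data states produced by the inputs executed so far rather than computed over the whole domain.

import java.util.function.IntUnaryOperator;

class Scarcity {
    // s(r, f): proportion of the domain [lo, hi] whose image under f is not r.
    static double scarcity(int r, IntUnaryOperator f, int lo, int hi) {
        int notMappingToR = 0;
        for (int d = lo; d <= hi; d++) {
            if (f.applyAsInt(d) != r) {
                notMappingToR++;
            }
        }
        return (double) notMappingToR / (hi - lo + 1);
    }

    public static void main(String[] args) {
        IntUnaryOperator k = x -> (int) Math.ceil(Math.log10(x));
        System.out.println(scarcity(5, k, 1, 100_000)); // 0.1: k = 5 is a common value
        System.out.println(scarcity(0, k, 1, 100_000)); // 0.99999: k = 0 is scarce
    }
}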

Pursuing data state scarcity is also an effective strategy for finding an input to execute the required branch in the AllTrue and Orthogonal programs of Figures 4 and 7. Moving back in the data flow graph from the value of alltrue used in the branch predicate leads to the mapping from input to the values of alltrue assigned within the loop. To take a specific example, consider the assignment after the loop has iterated just half of the 64 iterations. Although this is likely to be an assignment of false, the probability of this value being false is not as high as that of a false value for alltrue in the branch condition. In general, a number of true values will be assigned followed by a number of false values to make a total of 64 values. The number of true values is equal to the number of true values in a before the first false value. High numbers of true values will be scarce. Moreover, any two scarce inputs will be similar, i.e. close in the input domain. Within a loop, it is often more convenient, and probably more effective, to sample all values assigned to a variable such as alltrue. This is equivalent to sampling values from a number of data states. There is a straightforward generalisation of the concept of the scarcity of a single value to the concept of the scarcity of a set of values. Given a function f consisting of the composition of functions f1, f2, . . . , fn, the aim is to define the scarcity of a set of values V where each vi ∈ V is in the range of exactly one function fi, i ∈ [1, n], i.e. vi is in the range of fi. Let f̄i denote the composition of all functions up to i, i.e. f1 f2 . . . fi, and then the scarcity of vi w.r.t. f̄i is

s(vi, f̄i) = |D \ f̄i⁻¹(|{vi}|)| / |D|

and the scarcity of V with respect to f is the proportion of the values in the domain that do not map to all the values in V, i.e.

|D \ ∩_{i∈V} f̄i⁻¹(|{vi}|)| / |D|

In the Orthogonal program, consider the values assigned to product within the loop. In three out of four cases, i.e. (0, 0), (0, 1) and (1, 0), product is assigned the value 0 and in the remaining case, product is assigned the value 1 and the loop terminates. Without considering probabilities in detail, we may assume a 1/4 chance of loop termination at each iteration. In this situation, low numbers of loop iterations are more likely than high numbers. Inputs that produce high numbers of iterations may therefore be identified as producing scarce data states and hence the search may be directed to the region in the vicinity of these scarce inputs. Directing the search towards inputs that produce scarce patterns of assignments is an effective strategy for changing the final value assigned since this value changes when the maximum number of zeros is assigned. The program Mask shown in Figure 12 is another example that exhibits information loss and for which existing techniques are inadequate. This program checks that each character in an array conforms to the bit mask 1010101. There are a total of 8 characters that have the four mask bits set and 16 values that may be assigned to x. Mask repeatedly performs a bit-wise-and operation and so, once a bit in x becomes zero, it will remain zero. In general, we may expect the bits within x to tend towards zero as the array is iterated (assuming the data in the array is relatively random). This means that although x within the loop may take a number of values, x in the branch predicate expression is almost always 0.

Mask(char[] a) {
    char x = 0x55; // 1010101
    for (i = 0; i < 10; i++) {
        ...
        x = x & a[i]; // bitwise and
    }
    if (x == 0x55) {
        // target
    }
}

Figure 12: Program Mask checks that each character in an array has the odd bits set.

Although x in the predicate expression is almost constant, a greater range of values is assigned within the loop, where it is thus possible to identify relatively scarce values. In a reasonably large population of inputs, a small number of inputs will assign the relatively scarce value of 1010101 to x a higher number of times than is typical among the inputs that have so far been executed. By directing the search towards these scarce inputs, the search is directed to inputs that produce more than a single value for x in the branch predicate expression. Once this occurs, the cost function at the branch predicate expression may guide the search to a solution. In summary, the data state scarcity search strategy may be outlined as follows. When the search is no longer able to make progress because the population of candidate tests all have equal cost function values, i.e. the cost function surface across the input region so far encountered is flat, then it is assumed that there is a constant mapping from the majority of the input domain values to the values that are supplied as arguments to the cost function. At this point, the search goal is generalised to search for inputs that produce a range of values used in the cost function for the test goal. There is also the assumption that constant cost values result from a progressive information loss and so there is a greater likelihood that inputs can be discriminated in terms of data state values produced prior to their use in the constant cost expression. Directing the search towards inputs that produce scarce intermediate values is not necessarily directing the search towards inputs that solve the test goal. The cost functions are certainly different. The purpose of directing the search towards inputs that produce scarce intermediate values is to provide inputs that produce a variety of arguments to the cost function that instruments the test goal. Only when this cost function receives a range of values can it produce a range of values which can be used to discriminate between candidate tests. An often useful way of gaining insight into the general behaviour of a heuristic is to construct a counter-example, i.e. a program for which the heuristic is not only ineffective but is in fact deceptive. A deceptive heuristic for a given problem guides the search away from a solution rather than towards it. Consider therefore a modification of the Log10 program, the program Log10Deceptive, shown in Figure 13. This program has a domain that extends to 100 001 rather than 100 000. This allows k to take the value 6 for just one value in the input domain and 6 is the only index into the array a that contains a 0.

void Log10Deceptive(int x) { // x in [1, 100,001]
    a[0] = a[1] = a[2] = a[3] = a[4] = a[5] = 1;
    a[6] = 0;
    double y = log10(x);
    int k = ceiling(y); // y in [0, 6]
    if (a[k] == 0) {
        // target
    }
}

Figure 13: A program for which data scarcity search is a deceptive heuristic. The target branch is executed when x is 100 001 but progressively scarce values of k lead towards low values of x.

The target branch is difficult to execute because of the very small part of the domain that extends beyond 100 000. The target branch would not be so difficult to execute if the domain were to be extended to, say, [1, 1 000 000].

5

Data State Scarcity and Diversity

The heuristic search literature contains a number of studies of the concept of diversity applied to a population of candidate solutions. In particular, it has been suggested that increased diversity in the population of candidate solutions improves the performance of the search in certain situations; population diversity has been used, for example, to investigate the problems of premature convergence and the avoidance of local minima. It is plausible that these methods could be used to improve the diversity of the branch distance costs, which is a necessary condition for a solution to be found. There are also a number of approaches in the literature for generating random test data that is biased by a diversity measure of some kind. It will be shown, however, that with respect to search, scarcity and diversity are different properties.

5.1

Diversity measures

The term “variety” was used by Koza [25] to represent the number of different genotypes in a population. In the context of search-based testing, a genotype is a candidate test case. A simple measure of genotype diversity can be obtained from the number of unique individuals or candidate test cases [26] in a population. Genotype diversity in this sense is a necessary condition for the presence of fitness diversity since unless candidates differ they cannot have different cost values. In addition to the relatively simple measure of the number of unique individuals, some researchers have also considered the dissimilarity between individuals. To this end, an edit distance based on string matching was used by O’Reilly [31]. He uses single node insertions, deletions and substitutions to transform two tree structured genotypes to be equal in structure and content.

The measure represents the number of node changes that need to be made to either tree to make them equal in structure and content. De Jong et al. [11] used a similar distance measure, normalised by dividing the sum of all different nodes by the size of the smaller tree. To incorporate such a distance measure in the context of search-based testing, it would be necessary to define a distance measure between candidate test cases, which could be done on the basis of a metric defined over the program input domain. In the case of the Log10 program, for example, the input space consists of the integers in the range [1, 100000] and the absolute difference between two integers could be used as a distance metric. For more complex domains of more than one dimension, an n-dimensional metric could be used. Metrics are difficult to define, however, over unordered sets. Keijzer [21], in the context of genetic programming where a candidate solution is a tree structured program, used the ratio of unique genotype sub-trees over total sub-trees to measure sub-tree variety and the ratio of the number of unique individuals over the size of the population as program variety. Keijzer also used a distance measure between two individuals defined as the number of distinct genotype sub-trees the individuals share. Genotype diversity that also considers a measure of genotype dissimilarity would appear to be more effective in producing diversity when searching for test data, in that maximising the distance between candidate test cases would cover more of the input domain than is required simply to ensure that there are no duplicate candidates in the population. In addition to genotype diversity, fitness diversity has also been investigated. Fitness diversity, also known as phenotype diversity, measures the diversity of fitness values in a population. In the context of this paper, fitness diversity is the diversity of branch distance values produced by the population of candidate test cases. Fitness entropy (in the sense of Shannon entropy) has been used to investigate the problems of premature convergence and local minima. Fitness entropy is calculated by grouping the fitness values into equivalence classes [33]. Let pk be the proportion of the current population which belongs to class k. Fitness entropy is then defined as shown in Equation (1).

Fitness entropy = − Σk pk log2(pk)    (1)
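A small sketch of Equation (1) applied to a population of branch distance values; grouping the costs into classes by exact value is simply one convenient choice of equivalence class and is an assumption of this example.

import java.util.HashMap;
import java.util.Map;

class FitnessEntropy {
    // Shannon entropy of the distribution of cost values in the population (Equation 1).
    static double entropy(double[] costs) {
        Map<Double, Integer> classCounts = new HashMap<>();
        for (double c : costs) {
            classCounts.merge(c, 1, Integer::sum);
        }
        double entropy = 0.0;
        for (int count : classCounts.values()) {
            double p = (double) count / costs.length;
            entropy -= p * Math.log(p) / Math.log(2); // log base 2
        }
        return entropy;
    }
}

A population in which every candidate produces the same branch distance forms a single class with proportion 1 and therefore has entropy 0.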

Figure 14 illustrates how entropy increases as the number of classes increases and decreases as the distribution becomes less uniform. High fitness entropy indicates the presence of many unique fitness values in the population, with the population evenly distributed over those values. Low fitness entropy, in contrast, indicates a population which contains fewer unique fitness values, where many individuals have the same fitness. When the cost function is constant across the population, fitness entropy is 0.

Figure 14: Four different distributions showing how entropy varies according to the number and distribution of classes.

5.2

Diversity control methods

Some common methods for improving the diversity of a population of candidate test cases are listed below:


1. Restricting the selection procedure (crowding models) [3]. Crowding induces niches by forcing new individuals to replace those that are similar in genotype. In the crowding algorithm, an individual is selected for replacement by selecting a subset of the population randomly and then selecting the member of that subset that is most similar to the individual.

2. Restricting the mating procedure (assortative mating) [12]. Assortative mating algorithms encourage crossover to occur between similar individuals. The general idea behind retaining diversity by restricted crossover is that when crossover is unrestricted, the successful characteristics in any individual are quickly spread to all the individuals of the population, with the result that the individuals become quite similar.

3. Explicitly dividing the population into sub-populations. Each sub-population is separated from the others in the sense that it evolves independently with occasional migration of individuals from one sub-population to another.

Based on concepts from population genetics, random genetic drift should cause each sub-population to search different regions of the domain and the migration will communicate important discoveries among the sub-populations. Local mating algorithms [8] arrange the population geometrically (e.g., in a two-dimensional plane) and crossovers occur only between individuals that are “near” one another geometrically. McMinn et al. [30] presented a unique species per path technique. They showed how, in the program under test, different program paths to the target branch can be assigned to species statically on the basis of slicing and transformation. By allocating a separate species to each path, each path is searched.

4. Modifying the way fitness is assigned (fitness sharing). Fitness sharing [13], [17], [16] induces sub-populations by penalizing individuals for the presence of other similar individuals in the population, thereby encouraging individuals to find uncrowded niches.

The methods discussed above, although effective at maintaining diversity in a population, are not likely to be effective for the problem presented in this paper. Genotype diversity in the sense of unique candidates is likely to be present already during a random search. Diversity control methods are useful to avoid situations in which the population contains many similar candidates. For the typical test data generation problem, the search space is large and avoiding premature convergence will not in itself provide a solution. Genotype diversity in the sense of a population of genotypes that are distant as well as distinct from each other may be more effective. The literature describes a number of approaches to random test data generation in which the generation of tests, although random, is biased by a consideration of diversity. In particular, the Antirandom Testing approach of Malaiya [?] generates sequences of test cases in which each successive test case is chosen to maximise the total distance between it and all the previous test cases in the sequence. The generation of such a test case may require the exhaustive examination of the input domain. The generation of Antirandom test sequences requires a “distance” measure to be defined over the input domain, although it is unclear how to define metrics over unordered sets. Adaptive Random Testing, proposed by Chen et al. [?], is similar to Antirandom Testing but relaxes the requirement that the successive test case be that member of the input domain that is maximally distant from all previous test cases. Instead, a set of candidate test cases is generated randomly and, from these candidates, the test case that is maximally distant from all previous test cases is selected as the next test case. In addition, a test case is considered to be maximally distant when the distance to the closest member of the test sequence is greater than the distance of any other candidate to its closest member of the test sequence, i.e. it has the largest nearest neighbour distance. In Antirandom test generation and in Adaptive Random Testing, a test is maximally distant from the preceding tests in the generation sequence but not necessarily maximally distant from all tests in the sequence. The Diversity Oriented Test Generation approach of Bueno et al. [?]
requires that all the test cases within an initial random set of test cases be iteratively “repositioned” until they are maximally distant from each other.

In this approach, there is no distinction between tests that are early in the sequence and hence relatively unconstrained, and tests that are late in the sequence and more constrained. This allows a more uniform spread of test cases and an increased likelihood of achieving a distribution in which every test case is maximally distant from every other test case. As a simple illustration of this effect, consider the selection of three tests from an interval. Using a sequential process, if the first test is close to but not at the upper boundary of the interval, the next must be placed at the lower boundary and the third must be placed at the upper boundary even though it may be very close to the first test. The Diversity Oriented Test Generation approach would place the three tests at the start, middle and end of the interval to achieve a more uniform spread. A significant property of all diversity oriented test generation methods is that they tend to produce tests that are closer to the domain boundary than would otherwise be produced by purely random generation or by guided search. Consider, for example, how the Antirandom method would be used to generate test data for the Log10 program. The first test is selected randomly from the interval [1, 100 000]; assume it is 40 000. The second test will be 100 000 since this is the value that is furthest from 40 000. The third test will be 1 since this value maximises the distance to 40 000 and to 100 000. After just three tests, the values on the boundary of the domain have been generated. Note that this test executes the target branch. This is because the test that executes the required branch lies on the boundary of the input domain. The data scarcity search searches for inputs that produce scarce data states, which may or may not lie on the domain boundary. The solution is not on the boundary of the domain in the case of the other example programs. The domain of the AllTrue program, for example, is a 64 dimensional binary-valued space. Using Hamming distance, the most distant array to a given array of Boolean values is the array that contains the negation of each value. The array that executes the target branch, i.e. the array in which every element is true, is the most distant array from only one other array, that in which every element is false. A significant difference between diversity biased random test data generation and data state scarcity test generation is that data state scarcity is based on the examination of intermediate program data states rather than program inputs. The processing of inputs by a program to produce intermediate data states can mean that distances between test cases in the input domain space need not be closely correlated with distances between test cases in a space of intermediate data states. Although a diversity guided random search may be more efficient than a purely random search, it does nonetheless aim to sample uniformly the entire input domain. For very large input domains, however, this can be inefficient compared to a guided search that is able to focus on the region that contains a solution.
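As a concrete illustration of the Adaptive Random Testing selection rule described earlier in this section, the sketch below picks the next test for a one-dimensional integer domain; the candidate-set size of 10 and the absolute-difference distance are assumptions made for the example.

import java.util.List;
import java.util.Random;

class AdaptiveRandomTesting {
    static final Random rnd = new Random();

    // From a small random candidate set, pick the candidate whose nearest
    // neighbour among the already-executed tests is furthest away.
    static int nextTest(List<Integer> executed, int lo, int hi) {
        int best = lo;
        long bestNearest = -1;
        for (int c = 0; c < 10; c++) {
            int candidate = lo + rnd.nextInt(hi - lo + 1);
            long nearest = Long.MAX_VALUE;
            for (int t : executed) {
                nearest = Math.min(nearest, Math.abs((long) candidate - t));
            }
            if (nearest > bestNearest) {
                bestNearest = nearest;
                best = candidate;
            }
        }
        return best;
    }
}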

6 Implementing Data State Scarcity Search

6.1 Initiating Data State Scarcity Search

Data scarcity search should be initiated only when the branch cost function has become locally constant and program transformation techniques and path search techniques are either inapplicable or ineffective. Detecting that the search has stopped converging can be done by monitoring the average or best population branch distance cost. In this work, the branch distance was considered locally constant if there was no improvement in the best cost value after 50 offspring. At this point, data state scarcity search was introduced immediately since it was known that, for the selected example programs, no transformation technique was applicable.

Recording data state information is a computational cost that need not be incurred until data state scarcity search is initiated. This can be done by re-instrumenting the program under test. In order to guide the search towards inputs that produce scarce data values, it is necessary to record the values produced by specific expressions during the execution of a given input. A histogram, a set of (value, frequency-count) pairs, is used for recording the values of a given expression. Once the distribution of data values associated with the execution of an input has been obtained, it can be compared with the distributions associated with other inputs. This will allow the identification of inputs that produce scarce distributions of values.
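As a rough illustration of this triggering condition, the following Java sketch monitors the best branch distance seen so far and reports stagnation after a fixed number of consecutive offspring without improvement. The class name is illustrative; only the threshold of 50 offspring is taken from the description above.

public class StagnationMonitor {
    private final int limit;          // number of offspring without improvement (50 in this work)
    private double bestCost = Double.POSITIVE_INFINITY;
    private int sinceImprovement = 0;

    public StagnationMonitor(int limit) { this.limit = limit; }

    // Called once per offspring with its branch distance cost.
    // Returns true when the branch distance is judged locally constant.
    public boolean record(double branchCost) {
        if (branchCost < bestCost) {
            bestCost = branchCost;
            sinceImprovement = 0;
        } else {
            sinceImprovement++;
        }
        return sinceImprovement >= limit;
    }
}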

6.2

Recording program data state values

Figure 15 shows the instrumentation required so that data values can be recorded for the example program Log10 shown earlier in Figure 8.

void Log10(int x) { // x in [0, 100000]
    a[0] = 0;
    a[1] = a[2] = a[3] = a[4] = a[5] = 1;
    double y = Inst(log10(x), "y1");
    int k = Inst(ceiling(y), "k1");
    if (a[k] == 0) {
        // target
    }
}

Figure 15: Data-state instrumentation of the Log10 program.

Instrumentation is performed by the function Inst. Inst is the identity function with respect to its first argument but has the side effect of adding its first argument to a histogram identified by its second argument, a label that identifies the expression for which values are recorded.
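One possible realisation of Inst, assuming that histograms are held in a global map keyed by the label passed as the second argument, is sketched below in Java. The class and method names are illustrative and are not the instrumentation code used in this work.

import java.util.HashMap;
import java.util.Map;

public class Instrument {
    // One histogram per instrumented expression, keyed by its label (e.g. "y1", "k1").
    private static final Map<String, Map<Object, Integer>> HISTOGRAMS = new HashMap<>();

    // Identity function with the side effect of recording the value in the named histogram.
    public static double inst(double value, String label) {
        record(value, label);
        return value;
    }

    public static int inst(int value, String label) {
        record(value, label);
        return value;
    }

    private static void record(Object value, String label) {
        HISTOGRAMS.computeIfAbsent(label, k -> new HashMap<>())
                  .merge(value, 1, Integer::sum);
    }

    // The histograms are read (and cleared) after each execution of a candidate test.
    public static Map<String, Map<Object, Integer>> takeHistograms() {
        Map<String, Map<Object, Integer>> snapshot = new HashMap<>(HISTOGRAMS);
        HISTOGRAMS.clear();
        return snapshot;
    }
}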

Figure 16 shows how data values can be recorded for the example program Orthogonal shown earlier in Figure 7. The value of the variable product in the Boolean expression that controls execution of the target branch is defined by the expression within the loop. It is at this point that the instrumentation function Inst is added.

OrthogonalDataState(int[] a, int[] b) { // for all i, a[i] and b[i] in [0, 1]
    int product = 0;
    for (i = 0; product == 0 && i < 64; i++) {
        product = Inst(a[i] * b[i], "s1");
    }
    if (product == 0) {
        // target
    }
}

Figure 16: Data-state instrumentation of the Orthogonal program.

In general, the identification of the expressions to instrument may be done by determining the data flow predecessors of the expression producing a constant value. Each definition of a variable used in the expression (e.g. an assignment statement), providing it is not the assignment of a compile-time constant, is associated with a histogram in which is recorded the values assigned and the number of times any particular value is assigned. The execution of each input is thus associated with a set of histograms. Figure 17 shows, in general, the relation between the population of inputs and associated sets of histograms.

[Figure 17 is a diagram: each candidate test case (test case 1, test case 2, ..., test case n) is associated with the values sampled in the program (y at y1 and k at k1 for the Log10 program) and with a histogram for each sampled value, e.g. {(4.7, 1)} and {(5, 1)}.]

Figure 17: The structure of the data state histograms associated with each candidate test case for the Log10 program.

Experience with the programs investigated in this paper shows that typically it is only the immediate predecessors of the constant expression that require instrumentation. If the immediate predecessor expressions are also found to be constant then attention can be directed at their immediate predecessors and so on. In the Log10 program example, however, instrumentation is necessary only for k and not y.

Data-state values are recorded for basic value types only (e.g. Boolean, integer, character and string). Arrays of these types are not instrumented because it is not clear how array values could be recorded without excessive memory use.

The recording of floating point values poses a problem in that the likelihood of equal values being produced by different inputs is low. A discretisation process can be applied, however, to group these values and thereby identify scarce regions if not scarce values. Initially, the discretisation quantum, d, would be assigned a small positive value. Each floating point value, r, is mapped to the value nd, where n is an integer and (n − 1)d < r ≤ nd. In programs with loops, this may nonetheless lead to the recording of a large number of values, each with a frequency count of 1. Consequently, if the size of any histogram exceeds a given bound, the discretisation quantum would be doubled and the values in all existing histograms regrouped accordingly.

A program with a loop may also generate a large number of different values of discrete type. This would lead to an impractically large number of histogram classes. To limit the number of classes in the histograms of discrete types, the rate at which assigned values are sampled is progressively reduced as the number of classes increases. Initially, all assigned values are recorded until the total number of classes equals a positive constant s, set to 1000 for this work. At this point, the sampling rate is halved so that only every second value is recorded. This does not directly limit the number of new classes but it does reduce the rate at which they may be created. If the number of classes grows to 2s then the sampling rate is again halved, and so on. This scheme biases data state sampling towards states that are produced early in the computation, and there are likely to be programs for which this bias is advantageous and programs for which it is not. Given that the performance implications of the scheme are unclear, at present the scheme can be justified only on the basis of its simplicity of implementation.
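A minimal Java sketch of the floating point discretisation described above is given below. The initial quantum, the bound on the number of classes and the class and method names are assumptions made for illustration; only the mapping of r to the class n with (n − 1)d < r ≤ nd and the doubling of the quantum are taken from the text.

import java.util.HashMap;
import java.util.Map;

public class FloatHistogram {
    private double quantum;              // discretisation quantum d
    private final int maxClasses;        // bound on the number of histogram classes (assumed)
    private Map<Long, Integer> counts = new HashMap<>();

    public FloatHistogram(double initialQuantum, int maxClasses) {
        this.quantum = initialQuantum;
        this.maxClasses = maxClasses;
    }

    // Map r to the class index n such that (n - 1) * d < r <= n * d.
    private long classOf(double r) {
        return (long) Math.ceil(r / quantum);
    }

    public void record(double r) {
        counts.merge(classOf(r), 1, Integer::sum);
        if (counts.size() > maxClasses) {
            regroup();                   // double the quantum and merge existing classes
        }
    }

    private void regroup() {
        quantum *= 2;
        Map<Long, Integer> regrouped = new HashMap<>();
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            // class n under the old quantum maps to class ceil(n / 2) under the doubled quantum
            long n = (long) Math.ceil(e.getKey() / 2.0);
            regrouped.merge(n, e.getValue(), Integer::sum);
        }
        counts = regrouped;
    }
}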

6.3

Cost functions for data state scarcity

In general, a number of expressions within the program under test will be instrumented and so the execution of an input will be associated with a set of histograms, one for each instrumented expression. Two histogram sets are equal if they contain the same histograms. A simple method to identify test cases which produce scarce data state values is to group test cases into classes on the basis of equal histogram sets. Consequently, for those inputs where the branch distance costs are equal, the inputs are ranked according to the smallest histogram set equivalence class (as shown in Table 1). This is a strong criterion for grouping histogram sets but can be justified on the grounds that a test case may produce a lower branch distance when just one of the values produced at an instrumented expression differs from the values produced by other test cases. Empirically, it has proved effective for a number of the example programs investigated in this paper.

To guide the search towards scarce distributions of data states, a cost function must evaluate data states so that inputs that produce scarce data states are ranked highly in the population. One way to do this is to group inputs according to equal data state distributions. The inputs in the smallest equivalence class produce the rarest data states. This is in effect a binary-valued, equal or not equal, distance measure between data state distributions. More formally, let D be a domain of data values on which an equality operator is defined. Let a histogram on D be defined as a set of (value, frequency-count) pairs, where value is a member of D and count is a non-negative integer.

Table 1: Population before and after ranking using Data State Distribution equivalence class size.

Population before ranking:

candidate   branch cost   data state distribution
1           c             {(a, 1)}
2           c             {(b, 2)}
3           c             {(a, 1)}
4           c             {(b, 2)}
5           c             {(a, 1)}
6           c             {(d, 4)}
7           c             {(b, 2)}
8           c             {(a, 1)}
9           c             {(a, 1)}
10          c             {(c, 3)}
11          c             {(b, 2)}
12          c             {(a, 1)}
13          c             {(d, 4)}
14          c             {(d, 4)}
15          c             {(c, 3)}
16          c             {(a, 1)}
17          c             {(a, 1)}
18          c             {(e, 5)}
19          c             {(d, 4)}
20          c             {(b, 2)}

Population after ranking:

class       count   rank   candidates
{(e, 5)}    1       1      18
{(c, 3)}    2       2      10, 15
{(d, 4)}    4       3      6, 13, 14, 19
{(b, 2)}    5       4      2, 4, 7, 11, 20
{(a, 1)}    8       5      1, 3, 5, 8, 9, 12, 16, 17
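The ranking illustrated in Table 1 amounts to grouping candidates by an equality key over their histogram sets and ordering the classes by size. A rough Java sketch is given below; the representation of a histogram set as a map from expression labels to histograms is an assumption of the sketch, not the data structure used in the implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EquivalenceClassRanking {
    // A histogram set maps an expression label to a histogram (value -> frequency count).
    // Candidates whose histogram sets are equal fall into the same equivalence class.
    public static <C> Map<C, Integer> rank(Map<C, Map<String, Map<Object, Integer>>> histogramSets) {
        // Group candidates by equal histogram sets.
        Map<Map<String, Map<Object, Integer>>, List<C>> classes = new HashMap<>();
        for (Map.Entry<C, Map<String, Map<Object, Integer>>> e : histogramSets.entrySet()) {
            classes.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Smaller classes produce scarcer data states and so receive a better (lower) rank.
        List<List<C>> ordered = new ArrayList<>(classes.values());
        ordered.sort(Comparator.comparingInt(List::size));
        Map<C, Integer> rank = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            for (C candidate : ordered.get(i)) rank.put(candidate, i + 1);
        }
        return rank;
    }
}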

For each value v ∈ D, the histogram contains a pair (v, f) where f ≥ 0. An initial histogram is a histogram in which every value of the domain of the histogram has a zero frequency count. Two histograms are equal if they contain the same set of value-count pairs. If an input does not execute an instrumented expression then that expression will be associated with an initial histogram.

Other dissimilarity distance measures may be defined provided they are symmetric, non-negative, zero for two equal data state distributions, and such that the distance between any two data state distributions is less than or equal to the sum of the distances of these two data state distributions to any other distribution. Two distance measures were considered; both are present in the literature on population diversity [28], but here they are applied to data state distributions.

1. Hamming Distance (HD): Two histograms over the same domain may be compared by the number of pairs in each histogram that differ. For example, if D = {a, b, c, d, e}, then given x = {(a, 2), (b, 3), (c, 1), (d, 0), (e, 0)} and y = {(a, 2), (b, 5), (c, 0), (d, 4), (e, 0)}, HD(x, y) = 3, since the counts for b, c and d differ. More formally, HD(x, y) = |x \ y|. The Hamming distance is not sensitive to the magnitude of the difference in frequency counts, but two inputs with a large difference in frequency count might be expected to be more diverse than two inputs with a small difference in frequency count. This motivates the definition of a Euclidean distance measure between histograms.

2. Euclidean Distance (ED): Two histograms over the same domain may be compared by the sum of squares of the frequency count differences. For each pair (v, f) in x and (v, g) in y, i.e. f and g are corresponding frequency counts for the same value v, the contribution is (f − g)^2, hence

$$ED(x, y) = \sum_{(v,f) \in x \,\wedge\, (v,g) \in y} (f - g)^2$$

The Euclidean distance between x and y in the previous example is ED(x, y) = (2 − 2)^2 + (3 − 5)^2 + (1 − 0)^2 + (0 − 4)^2 + (0 − 0)^2 = 21.
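The two measures can be made concrete as follows, assuming a histogram is represented as a Java map from values to frequency counts (a representation chosen for this sketch, not necessarily that used in the implementation). The example in main reproduces the x and y histograms used above; values with a zero count are simply omitted from the maps.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HistogramDistance {
    // Hamming distance: the number of values whose frequency counts differ
    // (a value absent from a histogram is treated as having a count of zero).
    public static int hamming(Map<Object, Integer> x, Map<Object, Integer> y) {
        Set<Object> domain = new HashSet<>(x.keySet());
        domain.addAll(y.keySet());
        int differing = 0;
        for (Object v : domain) {
            if (!x.getOrDefault(v, 0).equals(y.getOrDefault(v, 0))) differing++;
        }
        return differing;
    }

    // Euclidean-style distance: the sum of squared differences of frequency counts.
    public static long euclidean(Map<Object, Integer> x, Map<Object, Integer> y) {
        Set<Object> domain = new HashSet<>(x.keySet());
        domain.addAll(y.keySet());
        long sum = 0;
        for (Object v : domain) {
            long d = x.getOrDefault(v, 0) - y.getOrDefault(v, 0);
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Object, Integer> x = new HashMap<>(Map.of("a", 2, "b", 3, "c", 1));
        Map<Object, Integer> y = new HashMap<>(Map.of("a", 2, "b", 5, "d", 4));
        System.out.println(hamming(x, y));    // 3: the counts for b, c and d differ
        System.out.println(euclidean(x, y));  // 21: 0 + 4 + 1 + 16
    }
}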

In general, an input will be associated with a set of histograms, hence the distance between two inputs is the sum of the distances between corresponding histograms. The sum of the distances from each individual to all other individuals in the population is a measure of how similar an individual is to the population as a whole. Let d be a distance function (e.g. using the Hamming distance or the Euclidean distance between individual histograms) between two individuals i and j. The total distance between individual i and all other individuals in the population, called the population distance Pd_i, is shown in equation (2), where n is the population size.

$$Pd_i = \sum_{j=1}^{n} d(i, j) \qquad (2)$$

Using the total distance to all other members of the population can result in a candidate with a high total distance that is nonetheless close to another candidate. An alternative distance measure between candidate i and a population is the distance from i to the closest individual. An advantage of using the distance to the closest individual is that any candidate with a high population distance will be distant from all individuals.

Pd_i can be used to rank individuals within the population, i.e. the individual with the largest Pd_i has the highest rank. This is called the maximum population distance measure. During the search, this leads to a replacement strategy based on the contribution to scarcity made by the offspring to the population in which it will be included. In more detail, let x be a newly created offspring. x is added to the population and the population distance of each member of the population is recomputed to include x. The individual with the lowest population distance is then removed from the population. If this individual is x then the population remains unchanged. This operation is of order n, the population size, because it requires calculation of the distance between x and each current member of the population. Distance is symmetric and so the distance from each current member to the new individual is then also known. For each previous member of the population, the distance to the new member is added to produce an updated population distance for that member.
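A possible Java sketch of this replacement strategy is shown below. The distance function is supplied as a parameter (for example, the summed histogram distance between two candidates); the final subtraction step, which keeps the stored population distances consistent after a member is removed, is an assumption of the sketch rather than something stated in the text.

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

public class ScarcityReplacement<C> {
    private final List<C> population = new ArrayList<>();
    private final List<Double> popDistance = new ArrayList<>();   // Pd_i for each member
    private final ToDoubleBiFunction<C, C> distance;              // e.g. summed histogram distance

    public ScarcityReplacement(List<C> initial, ToDoubleBiFunction<C, C> distance) {
        this.distance = distance;
        population.addAll(initial);
        for (int i = 0; i < population.size(); i++) {
            double d = 0;
            for (int j = 0; j < population.size(); j++)
                if (i != j) d += distance.applyAsDouble(population.get(i), population.get(j));
            popDistance.add(d);
        }
    }

    // Add the offspring, update all population distances incrementally (one distance
    // evaluation per member), then remove the member with the lowest population distance.
    public void offer(C offspring) {
        double offspringPd = 0;
        for (int i = 0; i < population.size(); i++) {
            double d = distance.applyAsDouble(offspring, population.get(i));
            popDistance.set(i, popDistance.get(i) + d);   // distance is symmetric
            offspringPd += d;
        }
        population.add(offspring);
        popDistance.add(offspringPd);

        int worst = 0;
        for (int i = 1; i < population.size(); i++)
            if (popDistance.get(i) < popDistance.get(worst)) worst = i;

        C removed = population.remove(worst);
        popDistance.remove(worst);
        // Subtract the distances to the removed member so Pd stays consistent (an assumption).
        for (int i = 0; i < population.size(); i++)
            popDistance.set(i, popDistance.get(i) - distance.applyAsDouble(population.get(i), removed));
    }
}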

6.4

Data-state distribution histogram

A cost function based on the size of the equivalence class of data state distributions directs the search to those individuals that are in the equivalence classes of smallest size. With the progress of the search, however, the size of the equivalence classes is progressively reduced. This may continue for a while until all the individuals are in singleton equivalence classes and yet no solution has been found. At this point, all individuals will be assigned the same rank, since the histogram set equivalence classes are all the same size, and this will lead to random search. Such a situation occurs in the Orthogonal program when the size of the array is increased from 64 to 128, which exceeds the population size of 100. In practice, a population in which all equivalence classes have the same size is likely to occur only when the number of classes is equal to the population size. Otherwise, different data state values are likely to have non-uniform distributions. For example, in program Log10, it is very unlikely that the equivalence class for k = 5 will contain as many individuals as the equivalence class for k = 3, say. A possible remedy is to increase the population size but this is clearly not scalable.

After a period of search directed towards scarce inputs, although all inputs may belong to an equivalence class of size 1, it does not follow that all histogram sets are equally scarce. Some of the histogram sets will have been encountered by previous inputs no longer in the population whereas other histogram sets will be new and genuinely scarce. Keeping a record of all histogram sets encountered during a search, rather than just those in the current population, will allow the identification of scarce histogram sets. A record can be kept by associating with the population a histogram of histogram sets, recording the histogram sets produced by any input encountered during a particular search. These data state distributions are recorded in a histogram called a Data State Distribution histogram, or DSD histogram for short. The population rank of an input can now depend on the size of the histogram set frequency count in the DSD histogram rather than the size of the histogram set equivalence class in the population.

A possible DSD histogram is illustrated in Table 2 for the example program Orthogonal. In this program, the histogram set for any input consists of a single histogram for the variable product with a frequency count for the value 0 and a count of 1 for the value 1. The population member column shows the candidate in the population with the corresponding histogram set. Table 2 shows the situation in which the search has encountered three inputs that have produced the histogram set {{(0, 4), (1, 1)}}. Just one of these inputs, input8, is in the population. The search has also encountered seven inputs that have produced the histogram set {{(0, 2), (1, 1)}}. No member of the population produces this histogram set.

The DSD histogram is constructed when the individuals in the population all belong to their own equivalence class (necessarily of size 1). The DSD histogram then contains one histogram set with a frequency count of 1 for each input. New candidates with histogram sets that are not in the DSD histogram are added to the population and an arbitrary individual is removed, since all individuals in the population have the same histogram scarcity. The histogram set of the new candidate is added to the DSD histogram. This part of the search is random.
After some time, however, new individuals may have histogram sets that have already been seen before, i.e. they are present in the DSD histogram. These individuals are less fit than any in the current population and are not added to the population, because their histogram sets have been generated at least twice whereas all individuals in the population have a histogram set that has been seen just once. The repeat occurrence of the histogram set is nevertheless noted in the DSD histogram by incrementing the frequency count for the relevant histogram set. In this way, a variety of frequency counts will arise in the DSD histogram, allowing the individuals in the population to be ranked according to data state distribution scarcity. When a newly generated candidate is added to the population and has a rank equal to the lowest rank, and there are other members of the population with that rank, then a candidate other than the newly added candidate is removed. This provides a mechanism for the population to change when the new candidate has the lowest rank and might otherwise be discarded, leaving the population unchanged.

Table 2: Example of DSD histogram used to rank inputs for the Orthogonal program.

histogram set         count   population member
{{(0, 4), (1, 1)}}    3       input8
{{(0, 7), (1, 1)}}    1       input3
{{(0, 2), (1, 1)}}    7       null
...                   ...     ...
{{(0, 5), (1, 1)}}    2       input9
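The DSD histogram itself can be sketched in a few lines of Java, assuming a histogram set is represented as a map from expression labels to histograms; the class and method names are illustrative. Histogram sets are assumed not to be modified after they have been recorded.

import java.util.HashMap;
import java.util.Map;

public class DsdHistogram {
    // Frequency count of every histogram set encountered so far in the search,
    // keyed by the histogram set itself (label -> (value -> count)).
    private final Map<Map<String, Map<Object, Integer>>, Integer> counts = new HashMap<>();

    // Called for every input executed during data scarcity search.
    public void record(Map<String, Map<Object, Integer>> histogramSet) {
        counts.merge(histogramSet, 1, Integer::sum);
    }

    // Scarcity of an input: how often its histogram set has been seen in the whole search.
    // Inputs with a lower count are ranked more highly by the cost function.
    public int scarcityCount(Map<String, Map<Object, Integer>> histogramSet) {
        return counts.getOrDefault(histogramSet, 0);
    }
}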

7

Clustering histograms

There is a potential problem with the use of the Data State Distribution (DSD) histogram in that it can become very large, since every unique set of histograms produced by any input in the search must be stored. This problem was not encountered for any of the example problems of the empirical evaluation. However, an alternative, less memory intensive method for discriminating between inputs once they are all in singleton equivalence classes is to cluster the members of the population. Clustering is the grouping of individuals according to a distance or similarity measure. If the number of clusters is less than the population size, then cluster size (or rather its inverse) can be used to measure the scarcity of the individuals that each cluster contains. There is no guarantee that the cluster sizes will not all be equal, but this is unlikely. Compared to the DSD histogram, clustering requires more time to compute the clusters but only a fixed amount of additional memory space is needed. To perform clustering, a distance measure between data state distributions is required. The Hamming and Euclidean distance measures described earlier could be used. The following Sections 7.1 and 7.2 show how two common clustering techniques may be adapted to cluster the individuals in the population on the basis of histogram set similarity. Both algorithms can use the squared Euclidean distance between data state distributions.

7.1

Hierarchical clustering

The bottom-up hierarchical clustering algorithm takes as input the number of desired classes k and the distances between every pair of individuals. The algorithm begins with n classes (n equal to the population size), each containing a single candidate test and its associated data state histogram set. The distance between class A and some other class B is the average distance between the elements of A and the elements of B. The algorithm then iterates as follows, for i = n − 1 down to k:

• Find the closest pair of classes, call these A and B, and remove them from the set of classes.

• Generate a new class C, containing the members of A and B.

• Generate new distances from class C to all the other remaining classes.

[Figure 18 is a dendrogram over six data state distributions dsd1, ..., dsd6: dsd2 and dsd3 are merged first, then dsd4 and dsd5, then dsd6 joins {dsd4, dsd5}, then the two clusters are merged, and finally dsd1 joins to form the root cluster.]

Figure 18: Hierarchical clustering example, using 6 data state distributions.

Figure 18 shows a possible hierarchical clustering for a set of 6 data state distributions. In this figure, k is 1 so that all classes ultimately belong to the root cluster. Initially, dsd2 and dsd3 are placed in the same cluster. Next, dsd4 and dsd5 are placed in the same cluster. At the stage at which dsd6 is added to the dsd4, dsd5 cluster, the ranking is as follows:

data state distribution   rank
dsd1                      1
dsd2                      2
dsd3                      2
dsd4                      3
dsd5                      3
dsd6                      3
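A rough Java sketch of this bottom-up procedure, using average linkage over a precomputed distance matrix between data state distributions, is shown below. The matrix representation and the class name are assumptions of the sketch.

import java.util.ArrayList;
import java.util.List;

public class AverageLinkClustering {
    // dist[i][j] is the distance between data state distributions i and j.
    // Returns k clusters, each a list of the original indices.
    public static List<List<Integer>> cluster(double[][] dist, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = averageDistance(clusters.get(a), clusters.get(b), dist);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            // Merge the two closest clusters (remove the higher index first).
            List<Integer> merged = new ArrayList<>(clusters.get(bestA));
            merged.addAll(clusters.get(bestB));
            clusters.remove(bestB);
            clusters.remove(bestA);
            clusters.add(merged);
        }
        return clusters;
    }

    private static double averageDistance(List<Integer> a, List<Integer> b, double[][] dist) {
        double sum = 0;
        for (int i : a) for (int j : b) sum += dist[i][j];
        return sum / (a.size() * b.size());
    }
}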

7.2

K-means algorithm

K-means [27] is one of the simplest unsupervised learning algorithms for clustering. The algorithm requires as input a number k, which is the number of clusters to be produced. The main idea of the K-means algorithm is to assign each point to the cluster whose centre is nearest to that point. The centre is the average of all the points in the cluster. Again, using the Euclidean distance between histogram sets:

• Choose the number of classes, k.

• Remove k individuals, chosen randomly, from the population to form the centres of k classes.

• For i = 1 to n − k:

  1. Assign each remaining individual to the nearest class centre.

  2. Recompute the new class centres.

The main advantage of this algorithm is its simplicity. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial k random selections. To address this problem, the initial k individuals can be selected according to the total Euclidean distance (Pd_i) from those individuals to all other members of the population. Starting the clusters with the most distant individuals is a heuristic for selecting widely spaced clusters.
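The single pass variant described above may be sketched in Java as follows. Histogram sets are assumed to have been flattened into fixed-length vectors of frequency counts over a common value domain, and the initial centres would be the k individuals with the largest total Euclidean distance Pd_i; both representation choices are assumptions of the sketch.

import java.util.ArrayList;
import java.util.List;

public class KMeansSketch {
    // points[i] is a fixed-length vector of frequency counts for candidate i.
    // initialCentres holds the indices of the k individuals chosen as initial centres.
    public static List<List<Integer>> cluster(double[][] points, int[] initialCentres) {
        int k = initialCentres.length;
        double[][] centres = new double[k][];
        List<List<Integer>> clusters = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            centres[c] = points[initialCentres[c]].clone();
            clusters.add(new ArrayList<>());
            clusters.get(c).add(initialCentres[c]);
        }
        // Single pass, as described above: each remaining individual is assigned to the
        // nearest centre and that centre is recomputed as the mean of its members.
        for (int i = 0; i < points.length; i++) {
            if (contains(initialCentres, i)) continue;
            int nearest = 0;
            for (int c = 1; c < k; c++)
                if (squaredDistance(points[i], centres[c]) < squaredDistance(points[i], centres[nearest]))
                    nearest = c;
            clusters.get(nearest).add(i);
            centres[nearest] = mean(clusters.get(nearest), points);
        }
        return clusters;
    }

    private static boolean contains(int[] xs, int x) {
        for (int v : xs) if (v == x) return true;
        return false;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    private static double[] mean(List<Integer> members, double[][] points) {
        double[] m = new double[points[members.get(0)].length];
        for (int i : members)
            for (int d = 0; d < m.length; d++) m[d] += points[i][d];
        for (int d = 0; d < m.length; d++) m[d] /= members.size();
        return m;
    }
}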

8

Empirical Investigation

The data state scarcity search strategy was investigated empirically by generating test data for the example programs AllTrue, Orthogonal, Log10 and Mask, and for three additional programs, Error, CountEqual and FloatRegEx. If the analysis of the example programs presented so far is correct then it is clear that a data scarcity heuristic will guide a search to a solution. What is not clear, however, is how long such a search would take. In addition, therefore, to confirming the correctness of the analysis, the empirical investigation will provide information about the efficiency of data scarcity search.

8.1

Experimental programs

The program Error is shown in Figure 19; it counts the number of points in a supplied sequence of points that differ from the sequence of points 0, 1, 2, ..., 15. The information in this program is lost by the min function, which returns either 0 or 1. In addition, the integer division operation ensures that errorsum / 4 does not vary when errorsum varies between 0 and 3. The program CountEqual, shown in Figure 20, determines if more than half the characters in a 64 character string are equal to the respective preceding character.

Error(int[] a) { // a[i] in [-100000, 100000]
    int error = 0;
    int errorsum = 0;
    for (i = 0; i < 16; i++) {
        error = a[i] - i;
        errorsum = errorsum + min(1, abs(error));
    }
    if ((errorsum / 4) < 1) { // integer division
        // target
    }
}

Figure 19: Program Error compares a sequence of points with the sequence 0, 1, 2, ..., 15. The count of errors is accumulated in errorsum. The target branch is executed when fewer than one quarter of the points disagree with the sequence.

CountEqual(char[] a) {
    int equal = 0;
    for (i = 0; i < 64; i++) {
        string s = match(a[i] + "+", a, i);
        equal = equal + s.Length - 1;
        i = i + s.Length - 1;
    }
    if ((equal / 32) == 1) {
        // target
    }
}

Figure 20: CountEqual determines if more than half the characters in a 64 character string are equal to the respective preceding character.

For randomly selected inputs, the variable equal is likely to be zero or close to zero. In such cases, given integer division, the value of equal / 32 is invariably zero and the histogram of values assigned to equal will be skewed towards zero. Directing the search towards inputs that produce scarce histograms will direct the search towards inputs in which the histogram of values assigned to equal is less skewed towards zero, which increases the probability of finding an input with a relatively high value for equal. The target branch is executed when equal has the value 32.

The FloatRegEx program is a simple table-driven finite state machine. The table, in which the state transitions are defined, consists of a two dimensional array of states (the next-state-array), indexed by input and current state. The program consists essentially of a single loop that reads each character of the input string and, together with the current state, accesses the next state from the next-state-array. The target branch is executed when the final state is reached. The next-state-array is initialised to define the transitions corresponding to the regular expression [0-9].[0-9]+[e][-+]?[1-9][0-9]* which is intended to recognise a floating point number with an exponent and one digit before the decimal point and one or more after the decimal point. For random character strings, the accepting state, S7, is difficult to reach.


Figure 21: Flowchart of FloatRegEx program with a difficult to reach state S7. Unlabeled transitions are executed when the input does not match any labeled transition.

Clearly, the set of test programs assembled is a biased collection but the purpose of the investigation is to show the effectiveness of data scarcity search for a specific class of programs for which existing techniques are not effective.
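The table-driven structure of FloatRegEx can be illustrated with the following Java sketch, which drives the regular expression [0-9].[0-9]+[e][-+]?[1-9][0-9]* with a next-state array. The state numbering, the character classes and the class name are assumptions made for the sketch and do not necessarily match the program used in the experiments or the state names in Figure 21.

public class FloatRegExSketch {
    // Character classes used to index the next-state array.
    private static final int NONZERO = 0, ZERO = 1, DOT = 2, EXP = 3, SIGN = 4, OTHER = 5;
    private static final int REJECT = 7;      // sink state for unlabeled transitions
    private static final int ACCEPT = 6;      // final state (numbering assumed for this sketch)

    // NEXT[state][characterClass] gives the next state.
    private static final int[][] NEXT = {
        //           1-9      0       .       e       +,-     other
        /* 0 */ {      1,      1, REJECT, REJECT, REJECT, REJECT },  // expect first digit
        /* 1 */ { REJECT, REJECT,      2, REJECT, REJECT, REJECT },  // expect '.'
        /* 2 */ {      3,      3, REJECT, REJECT, REJECT, REJECT },  // expect digit after '.'
        /* 3 */ {      3,      3, REJECT,      4, REJECT, REJECT },  // more digits or 'e'
        /* 4 */ {      6, REJECT, REJECT, REJECT,      5, REJECT },  // optional sign or leading 1-9
        /* 5 */ {      6, REJECT, REJECT, REJECT, REJECT, REJECT },  // leading 1-9 of exponent
        /* 6 */ {      6,      6, REJECT, REJECT, REJECT, REJECT },  // accepting: trailing digits
        /* 7 */ { REJECT, REJECT, REJECT, REJECT, REJECT, REJECT },  // reject sink
    };

    private static int classify(char c) {
        if (c >= '1' && c <= '9') return NONZERO;
        if (c == '0') return ZERO;
        if (c == '.') return DOT;
        if (c == 'e') return EXP;
        if (c == '+' || c == '-') return SIGN;
        return OTHER;
    }

    public static boolean matches(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][classify(input.charAt(i))];
        }
        return state == ACCEPT;               // corresponds to the target branch being reached
    }

    public static void main(String[] args) {
        System.out.println(matches("3.14e-2"));   // true
        System.out.println(matches("3.14"));      // false: an exponent is required
    }
}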

8.2

Experimental Conditions

Full branch coverage was attempted for each of the programs under test. Each branch was taken as the individual target of a search, unless it was fortuitously covered during the search for test data for another branch. In this work, a population size of 100 was always used. This parameter was not “tuned” to suit any particular program under test. A candidate is represented as a sequence of basic values, double, int, char, etc. The probability of performing mutation or crossover was 0.5. The mutation operators are type-specific, so for example a mutation to a double value is performed by adding an offset value selected from a Gaussian distribution centred at the original value.
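The type-specific mutation operators can be illustrated as follows. Only the Gaussian offset for double values is taken from the description above; the treatment of int and char genes, the standard deviation parameter and the class name are assumptions of the sketch.

import java.util.Random;

public class TypeSpecificMutation {
    private static final Random RNG = new Random();

    // Mutation of a double gene: add an offset drawn from a Gaussian distribution
    // centred on the original value. The standard deviation is an assumption of this sketch.
    public static double mutate(double gene, double stdDev) {
        return gene + RNG.nextGaussian() * stdDev;
    }

    // Mutation of an int gene, again by a Gaussian offset rounded to the nearest integer.
    public static int mutate(int gene, double stdDev) {
        return (int) Math.round(gene + RNG.nextGaussian() * stdDev);
    }

    // Mutation of a char gene: replace it with a random character drawn from a given alphabet
    // (the replacement strategy for characters is an assumption of this sketch).
    public static char mutate(char gene, char[] alphabet) {
        return alphabet[RNG.nextInt(alphabet.length)];
    }
}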

8.3

Results and Discussions

In this section, two sets of results are presented. In the first set, the number of distinct histogram sets did not exceed the population size. This allowed histogram set equality to be used to define equivalence classes of candidate tests. In the second set of results, the number of histogram sets was equal to or exceeded the population size. In this situation, histogram set equality could not be used to define equivalence classes of candidate tests, since it was likely to lead to equivalence classes of equal size. Instead, candidates were ranked (in separate experiments) according to the frequency count in the DSD (data state distribution) histogram and according to both K-means and hierarchical cluster size.

Table 3: The number of executions of the program under test required to find test data to achieve branch coverage using three measures of data scarcity (average over 50 trials).

Program      Equivalence class size   PopDist-HD   PopDist-ED
AllTrue      4651                     6563         8218
Orthogonal   8004                     10814        9325
Log10        2184                     1641         1641
Mask         1119                     3251         2074
CountEqual   9421                     9652         9936
Error        8719                     12362        12251
FloatRegEx   11081                    12354        11832
Total        45179                    56637        55277

8.3.1

Number of histogram sets less than population size

The aim was to assess the performance of the three ranking methods, i.e. the size of the histogram set equivalence class, and the total population distance Pd_i from a candidate to all other members of the population computed using either the Hamming distance or the Euclidean distance between histogram sets. Test data was generated for each program and the number of program executions required to find branch coverage test data for a given program was noted. This was done for 50 trials. The results in Table 3 show the average number of executions required to find branch coverage test data when candidates were ranked according to histogram set equivalence class size and according to the distance between histogram sets (for both Hamming distance and Euclidean distance).

It is clear that in all cases the search is effective since, for each program, the required test data is found with a number of program executions well below that which would be required if the search were unguided. The probability that a solution can be found for Log10 by selecting tests at random from the domain is 1/100000. The corresponding probabilities for the remaining programs are smaller still. In an experiment in which all candidates were always given the same rank, i.e. the search was random, no solutions were found after 100 000 executions of the program under test. There is some evidence to suggest that counting equal data state distributions is the most efficient method. This method also requires less computation than using either the Hamming distance or the Euclidean distance.

Figure 22 shows how the population entropy decreases during the progress of the search for a single run to find test data for program Log10. The branch cost is 1 for all but one point, i.e. when the solution is found. Figure 23 shows the increasing population entropy and number of unique data state distributions (scaled between 0 and 1 for simplicity of representation) during the progress of the search for a single run to find test data for the Orthogonal program of Figure 7.


Figure 22: Plot showing the branch distance and the population entropy (candidates grouped by histogram equivalence class) for the program Log10 during the progress of the search.

8.3.2

Number of data state distributions is greater than or equal to the population size

In order to assess the three methods of candidate ranking when the number of histogram sets is greater than or equal to the population size, the sample of programs was modified to increase the domain sizes. For example, AllTrue128 is the AllTrue program of Figure 4 except that the array length is 128 instead of 64. The modifications made to the other programs are analogous and may be identified from the program names given in Table 4. The DSD histogram for the AllTrue128 program is shown in Figure 24. Specific histogram sets are marked on the horizontal axis and the frequency of each set is marked on the vertical axis. Figure 24 part (a) shows the situation after a period of data scarcity search in which candidates have been ranked using the data state distribution equivalence class size and each class has a size of 1. Figure 24 part (b) shows the situation after a DSD histogram has been used to rank candidates for 520 program executions. Figure 24 part (c) shows the situation after a DSD histogram has been used to rank candidates for 11000 program executions.

For K-means clustering and hierarchical clustering, the number of clusters is selected to be equal to n/2, where n is the population size. This value was chosen without detailed analysis or experimentation and proved effective for the example programs. The results in Table 4 show the number of program executions required to find input data to achieve branch coverage (averaged over 50 trials). Note again that the required test data was found, thus validating the data scarcity search strategy. The domain sizes of these programs are so large that there is negligible probability of finding a solution with an unguided search within the number of executions shown.

These results provide some evidence that the DSD histogram is the most efficient of the three methods. A possible reason for the poorer performance of the clustering methods is that sometimes there is more than one closest cluster. For example, if cluster A is equidistant from two clusters B and C, a single cluster, B or C, is chosen arbitrarily. Depending on the choice of cluster, the new cluster will be closer to or further from other clusters. A possible improvement is to merge not just two but all clusters that lie at the closest distance. There is a small difference in performance between the two forms of clustering but no significance is claimed for such a small difference.


Figure 23: The diversity in the program Orthogonal. For clarity of presentation, the number of unique data state distributions is normalised between 0 and 1.

9

Conclusions and Further Work

Search based test data generation depends crucially on an evaluation or cost function that is able to discriminate between candidate test cases with respect to achieving a given test goal. Typically, the cost function is constructed by instrumenting the program at the test goal. For some programs, however, an informative cost function at this location is difficult to define. The operations performed by these programs are such that the cost function returns a constant value for a very wide range of inputs. A typical example of this problem arises in the instrumentation of branch predicates that depend on the value of a Boolean-valued (flag) variable, although the problem is not limited to programs that contain flag variables.

Table 4: The number of executions of the programs under test required to find test data to achieve branch coverage (averaged over 50 trials), when the number of data state distributions is greater than or equal to the population size.

Program         K-means   Hierarchical   DSD histogram
AllTrue128      28325     23847          20005
AllTrue256      66135     58964          48632
Orthogonal128   37945     33264          30987
Orthogonal256   73218     69258          62847
CountEqual128   41367     40988          37564
CountEqual256   83214     82586          78645
Mask64          11387     12631          10456
Error64         46254     48299          39874
Total           387845    369837         329010

Although some transformation techniques have been developed to overcome the problems arising from the use of flag variables in particular situations, they are not applicable to all programs that contain flag variables. Moreover, the problem of the almost constant cost function arises in programs that do not use flag variables.

This paper presents a new method for directing the search when the cost function constructed by instrumentation of the test goal is not able to differentiate between candidate test inputs. The method has two parts. In the first part, the program under test is re-instrumented to sample the values that are inputs to the constant expression. The variables to instrument can be identified from the data flow graph. The rationale for this is that the cost function at the test goal is constant because of the information loss in the program. The expectation is that this information loss is progressive, hence the inputs to the expression are less likely to be constant. Given diversity in the values of the expression inputs, it is possible to use these values to discriminate between candidate tests and this allows a search to once more receive guidance.

The second part of the method is to refine the test goal to direct the search towards scarce data values while test goal guidance is not available. In the most general terms, the justification for this heuristic is that if, after a sustained period of search, a solution has not been found, the search should be directed to a region that differs from that which has so far been explored. In the context of searching for software test cases, the sampling of intermediate values that are the inputs to the cost function at the test goal may not allow the search to discriminate between candidates in terms of satisfying the test goal, but it may allow the search to discriminate between scarce and non-scarce intermediate data states. This provides a dimension in which to apply the heuristic "search in a different place to that which has so far been searched". So far the search has produced non-scarce data states and so scarce intermediate values represent a different place to search.

In this paper, two specific cost functions were investigated; one is based on grouping equal data state distributions, and another is based on the distance between data state distributions. A limitation emerged when grouping of equal data state distributions and the population distance (Pd) are used and the number of data state distributions is greater than or equal to the population size. A possible solution is to record the data state distributions in a Data State Distribution histogram and to base the cost function on the size of the data state distribution frequency count in this histogram, rather than on the size of the class in the population.

However, since it is possible for the Data State Distribution histogram to become very large, an alternative solution based on clustering is also presented. The individuals in a population are clustered on the basis of data state distribution similarity. The rank of an individual in the population is then determined by the size of the cluster to which it belongs, since the individuals with the most scarce data states will belong to the smallest clusters.

Data state scarcity search was evaluated empirically for a number of example programs for which the existing methods are otherwise inadequate. A number of experiments have shown that the method is effective. There is some evidence to suggest that measuring scarcity by the number of equal data state distributions is more efficient than using a distance measure between histograms.

As further work, it is intended to investigate the effectiveness of data scarcity search in medium to large programs. Data scarcity search relies on the identification of data flow predecessors at which a range of values can be observed. In the small example programs considered in this paper, it has not been difficult to find such predecessors, but this may not be the case for a large program in which there are many variables between the input and the constant cost expression. For large programs, finding suitable predecessors may itself become a search problem. This paper has focussed on data scarcity search, but there is also the dual question of whether the notion of scarcity search can be applied to program control flow. It is also intended to define the population distance (Pd) in terms of the distance to the nearest member of the population rather than the sum of the distances to each member. Furthermore, the use of a top-down algorithm for hierarchical clustering, instead of the agglomerative (bottom-up) algorithm, is also a candidate for investigation.

Acknowledgments

The authors would like to express their gratitude to the anonymous referees for their valuable comments and suggestions for improving the paper.

References

[1] M. Alshraideh and L. Bottaci, Automatic software test data generation for string data using heuristic search with domain specific search operators, Software Testing, Verification and Reliability 16 (2006), no. 3, 175–203.

[2] A. Baresel, D. Binkley, M. Harman, and B. Korel, Evolutionary testing in the presence of loop-assigned flags: a testability transformation approach, Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis (2004), 108–118.

[3] L. Booker, Intelligent behavior as an adaptation to the task environment, Ph.D. thesis, The University of Michigan, Ann Arbor, MI, 1982.

[4] L. Bottaci, Use of branch cost functions to diversify the search for test data, Proceedings of the UK Software Testing Workshop (UKTest 2005), University of Sheffield, UK, September 5-6, 2005, pp. 151–163.

[5] L. Bottaci, Predicate expression cost functions to guide evolutionary search for test data, Genetic and Evolutionary Computation Conference (GECCO 2003), July 2003, pp. 2455–2464.

[6] L. Bottaci, Instrumenting programs with flag variables for test data search by genetic algorithm, Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2002 (2002), 1337–1342.

[7] A. Bouchachia, An immune genetic algorithm for software test data generation, Proceedings of the 7th International Conference on Hybrid Intelligent Systems (HIS 2007), Washington, DC, USA, 2007, pp. 84–89.

[8] R. J. Collins and D. R. Jefferson, Selection in massively parallel genetic algorithms, Proceedings of the Fourth International Conference on Genetic Algorithms (San Mateo, CA, Morgan Kaufmann) (R. K. Belew and L. B. Booker, eds.), 1991, pp. 249–256.

[9] P. Coward, Symbolic execution and testing, Information and Software Technology 33 (1991), no. 1, 229–239.

[10] L. Davis, Handbook of genetic algorithms, International Thomson Computer Press, 1996.

[11] E. D. De Jong, R. Watson, and J. Pollack, Reducing bloat and promoting diversity using multi-objective methods, Proceedings of the Genetic and Evolutionary Computation Conference (San Francisco, CA, Morgan Kaufmann) (L. Spector et al., eds.), 2001, pp. 11–18.

[12] K. A. De Jong, An analysis of the behavior of a class of genetic adaptive systems, Ph.D. thesis, The University of Michigan, Ann Arbor, MI, 1975.

[13] K. Deb and D. Goldberg, An investigation of niche and species formation in genetic function optimization, Proceedings of the Third International Conference on Genetic Algorithms (San Mateo, CA, Morgan Kaufmann), 1989, pp. 42–50.

[14] J. Duran and S. Ntafos, An evaluation of random testing, IEEE Transactions on Software Engineering 10 (1984), no. 4, 438–443.

[15] R. Feldt, R. Torkar, T. Gorschek, and W. Afzal, Searching for cognitively diverse tests: Towards universal test diversity metrics, ICST '08: 1st IEEE International Conference on Software Testing, Verification and Validation, 2008.

[16] D. E. Goldberg, Genetic algorithms in search, optimization and machine learning, Addison Wesley, 1989.

[17] D. E. Goldberg and J. Richardson, Genetic algorithms with sharing for multimodal function optimization, Proceedings of the Second International Conference on Genetic Algorithms (Morgan Kaufmann), 1987, pp. 148–154.

[18] M. Harman, L. Hu, R. Hierons, A. Baresel, and H. Sthamer, Improving evolutionary testing by flag removal, Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2002, July 2002, pp. 1359–1366.

[19] M. Harman, L. Hu, R. Hierons, J. Wegener, H. Sthamer, A. Baresel, and M. Roper, Testability transformation, IEEE Transactions on Software Engineering 30 (2004), no. 1, 73–81.

[20] B. Jones, H. Sthamer, and D. Eyres, Automatic structural testing using genetic algorithms, Software Engineering Journal 11 (1996), no. 5, 299–306.

[21] M. Keijzer, Advances in genetic programming 2, chapter 13, pp. 259–278, MIT Press, MA, USA, 1996.

[22] B. Korel, Automated software test data generation, IEEE Transactions on Software Engineering 16 (1990), no. 8, 870–879.

[23] B. Korel and R. Ferguson, The chaining approach for software test data generation, ACM Transactions on Software Engineering and Methodology 5 (1996), no. 1.

[24] B. Korel, M. Harman, et al., Data dependence based testability transformation in automated test generation, 16th IEEE International Symposium on Software Reliability Engineering (ISSRE'05), 2005, pp. 245–254.

[25] J. Koza, Genetic programming: On the programming of computers by means of natural selection, MIT Press, Cambridge, MA, USA, 1992.

[26] W. Langdon, Data structures and genetic programming: Genetic programming + data structures = automatic programming!, Genetic Programming 1 (1999).

[27] J. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Berkeley, University of California Press), vol. 1, 1967, pp. 281–297.

[28] C. Mattiussi, M. Waibel, and D. Floreano, Measures of diversity for populations and distances between individuals with highly reorganizable genomes, Evolutionary Computation 12 (2004), no. 4, 495–515.

[29] P. McMinn, Search-based software test data generation: A survey, Software Testing, Verification and Reliability 14 (2004), no. 2, 105–156.

[30] P. McMinn, D. Binkley, M. Harman, and P. Tonella, The species per path approach to search-based test data generation, Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2006), Portland, ME, USA, July 17-20, 2006, pp. 13–24.

[31] U. O'Reilly, Using a distance metric on genetic programs to understand genetic operators, IEEE International Conference on Systems, Man, and Cybernetics, Computational Cybernetics and Simulation 5 (1997), 4092–4097.

[32] M. Roper, Computer aided software testing using genetic algorithms, Proceedings of the 10th International Software Quality Week, San Francisco, USA (1997).

[33] J. Rosca, Entropy-driven adaptive representation, Proceedings of the Workshop on Genetic Programming: From Theory to Real-World Applications (1995).

[34] J. Voas and K. Miller, Software testability: The new verification, IEEE Software 12 (1995), no. 3, 17–28.

[35] J. Wegener, R. Pitschinetz, and H. Sthamer, Automated testing of real-time tasks, Proceedings of the 1st International Workshop on Automated Program Analysis, Testing and Verification, Limerick, Ireland (2000).


[Figure 24 consists of three bar charts, (a), (b) and (c); the horizontal axis shows data state distributions (the numbers of false and true values recorded) and the vertical axis shows the frequency of each distribution.]

Figure 24: (a) The DSD histogram for the AllTrue128 program, constructed when the number of data state distributions is equal to the population size; (b) the DSD histogram after 520 program executions; (c) the DSD histogram after 11000 program executions.
