Software Qual J DOI 10.1007/s11219-012-9174-y

Evolutionary functional black-box testing in an industrial setting Tanja E. J. Vos • Felix F. Lindlar • Benjamin Wilmes • Andreas Windisch • Arthur I. Baars • Peter M. Kruse • Hamilton Gross • Joachim Wegener

© Springer Science+Business Media, LLC 2012

Abstract During the past years, evolutionary testing research has reported encouraging results for automated functional (i.e. black-box) testing. However, despite promising results, these techniques have hardly been applied to complex, real-world systems, and as such, little is known about their scalability, applicability, and acceptability in industry. In this paper, we describe the empirical setup used to study the use of evolutionary functional testing in industry through two case studies, drawn from serial production development environments at Daimler and Berner & Mattner Systemtechnik, respectively. Results of the case studies are presented, and the research questions are assessed based on them. In summary, the results indicate that evolutionary functional testing in an industrial setting is both scalable and applicable. However, the creation of fitness functions is time-consuming. Although in some cases this is compensated by the results, it is still a significant factor preventing functional evolutionary testing from more widespread use in industry.

Keywords: Evolutionary computation · Functional testing · Empirical assessment · Case study · Industrial practice · Test data generation

1 Introduction

Software and systems testing is currently the most important and most often used quality assurance technique applied in industry and can take over 50% of development cost and time (Beizer 1990). Even though many test automation tools are currently available to aid test planning and control, as well as test case execution and monitoring


(Fewster and Graham 1999), most of these tools, particularly those used in industrial practice, share a similar passive philosophy toward test case design, selection of test data, and test evaluation. They leave these crucial, time-consuming, and demanding activities to the human tester. This is not without reason; test case design and test evaluation are difficult to automate in an effective and scalable way with the techniques available in current industrial practice, since the domain of possible inputs (potential test cases), even for a trivial program, is typically too large to be exhaustively explored. However, while the test cases determine the nature and scope of the test, test case design is the activity that determines the quality and effectiveness of the whole testing process. The lack of automation for this important testing activity means that industry still spends a lot of effort and money on testing, while the quality of the resulting tests is sometimes low, since they fail to find important errors in the system.

Existing work on evolutionary testing has shown that stochastic optimization and intelligent search techniques (like evolutionary algorithms) constitute a successful way to automatically generate effective test cases. Most work has been done for white-box testing (Baresel et al. 2003; Gross et al. 2009; Harman et al. 2002; Jones et al. 1996; McMinn 2004; Pargas et al. 1999; Sthamer and Wegener 2002; Vos et al. 2010; Wegener et al. 2002). For black-box testing, a substantial part is related to real-time testing (Gros 2003; Mueller and Wegener 1998; Sthamer and Wegener 2002; Tlili et al. 2006; Tlili et al. 1917–1924), where the test is interpreted as an optimization problem, and evolutionary computation is employed to find test data with long or short execution times. Past research has shown that evolutionary real-time testing outperforms traditional methods. For functional black-box testing as well, previous work (Baresel et al. 2003; Bühler and Wegener 2008; Chan et al. 2004; Tracey et al. 2000; Windisch et al. 1943–1944) has led to promising results.

One shortcoming, however, is that evolutionary testing techniques have hardly been applied to real-world complex systems, and as such, little is known about their scalability, applicability, or acceptability in an industrial setting. The European project EvoTest (2006–2009) (IST-33472) (EvoTest, http://evotest.iti.upv.es, 2006) includes objectives aimed at improving this situation. These include: developing an Evolutionary Testing Framework (ETF) to provide general components and interfaces facilitating the automatic generation, execution, monitoring, and evaluation of effective test scenarios; improving the applicability of this ETF through a user interface that hides the underlying evolutionary computation techniques; and applying the resulting ETF to complex real-world systems (selected as case studies by the industrial partners Daimler and BMS) to empirically evaluate evolutionary testing.

This paper describes the functional testing case studies performed within the EvoTest project and reports on the scalability of the ETF for functional testing in an industrial setting (the empirical evaluation for structural testing can be found in other work, e.g. Gross et al. 2009; Vos et al. 2010). The cases examine the applicability of evolutionary functional testing in industry and provide clear insights into the barriers that might prevent evolutionary functional testing from being widely adopted in industry.
We invite other researchers to repeat the studies in other industrial settings.

Outline. The structure of this paper is as follows: First, an introduction to evolutionary functional testing is given. Next, in Sect. 3, the ETF is described. In Sect. 4, details are given about the case study setup, including research questions, design and data analysis, criteria for interpreting the findings, and threats to validity. Afterward, in Sect. 5, the two case studies from Daimler and BMS are described, and Sect. 5.2 contains the results. This paper finishes with an overall assessment and conclusion


based on our experience with evolutionary functional testing in an industrial setting (Sect. 6), as well as future work.

2 Evolutionary testing

The application of evolutionary algorithms (Goldberg 1989; Holland 1975) to test data generation is often referred to as evolutionary testing (Harman et al. 2002; Wegener et al. 2002; Bühler and Wegener 2004). The suitability of evolutionary algorithms for testing is based on their ability to produce effective solutions for complex and poorly understood search spaces with multiple dimensions. The dimensions of the search spaces are directly related to the number of input parameters of the system under test (SUT).

2.1 Evolutionary algorithms

Evolutionary algorithms represent a class of adaptive search techniques and procedures based on the processes of natural genetics and Darwin's theory of biological evolution (Goldberg 1989; Holland 1975). They are characterized by an iterative procedure and work in parallel on a number of potential solutions, in the form of a population of individuals. Permissible solution values for the variables of the optimization problem are encoded in each individual. A typical evolutionary algorithm proceeds in the following steps:

(1) A population of possible solutions to a problem is initialized, usually at random.
(2) Each individual within the population is evaluated by calculating its fitness.
(3) Pairs of individuals are selected from the population according to a pre-defined selection strategy and recombined so as to produce new individuals, analogous to biological reproduction. Subsequently, mutation may be applied.
(4) The new individuals are evaluated, and the reinsertion strategy decides which individuals are fit enough to make it into the next iteration.
(5) The algorithm iterates until the optimum is achieved or another stopping condition is fulfilled.

Evolutionary algorithms are generic and can be applied to a wide variety of optimization problems. In order to specialize an evolutionary algorithm for a specific problem, one needs to define a problem-specific objective (fitness) function. The objective function compares and contrasts solutions of the search with respect to the search goal. Using this information, the search is directed into potentially promising areas of the search space.

2.2 Evolutionary testing

In order to automate software tests using evolutionary algorithms, the test aim itself must be transformed into an optimization task. Depending on which test aim is pursued, different objective (fitness) functions emerge for test data evaluation. If an appropriate fitness function can be defined for the test aim, and evolutionary computation is applied as the search technique, then the evolutionary test proceeds as follows. First, the initial set of test data is generated, usually at random. Instead of random initialization, one might also use test data obtained by a previous systematic test (Wegener et al. 1996); in this way, the evolutionary test benefits from the tester's knowledge of the SUT. After initialization, each individual within the population represents a test datum with which the SUT is executed. For each test datum, the execution is monitored and the fitness value determined for the corresponding individual.


Next, test data with high fitness values are selected with a higher probability than test data with lower values and are subjected to combination and mutation processes to generate new offspring test data. A new population of test data is formed by merging offspring and parent individuals according to the laid-down survival procedures. From here on, the process repeats itself, starting with selection, until the test objective is fulfilled or another given stopping condition is reached. Which test objective is pursued depends on the type of testing to be performed. For structural testing, for example, the test objective will be to reach a high coverage percentage of the SUT source code. For functional testing, the test objective will be to find an error with respect to some selected functional property. Hence, the objective function must be defined to measure how "close" a test datum is to breaking the selected property.
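To make this procedure concrete, the following minimal sketch shows the evolutionary test loop described above. It is purely illustrative: the individual encoding, the operator choices, the parameter values, and the placeholder fitness function are our assumptions, not the ETF implementation.

```python
import random

# Minimal sketch of the evolutionary test loop described above.
# The SUT execution is abstracted into a fitness function mapping a
# test datum (a list of input values) to a value where lower means
# "closer to breaking the selected functional property".

POP_SIZE, GENERATIONS, N_PARAMS = 100, 100, 10
LOW, HIGH = 0.0, 100.0

def fitness(test_datum):
    # Placeholder objective: distance of the (monitored) SUT behavior
    # from a boundary condition; 0.0 would mean the property is broken.
    return abs(sum(test_datum) - 42.0)

def random_individual():
    return [random.uniform(LOW, HIGH) for _ in range(N_PARAMS)]

def select(population, k=2):
    # Tournament selection: the fitter (lower-valued) individual wins.
    return min(random.sample(population, k), key=fitness)

def recombine(a, b):
    # Uniform crossover of two parents.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(ind, rate=0.1):
    return [random.uniform(LOW, HIGH) if random.random() < rate else x
            for x in ind]

population = [random_individual() for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    offspring = [mutate(recombine(select(population), select(population)))
                 for _ in range(POP_SIZE)]
    # Reinsertion: merge parents and offspring, keep the fittest.
    population = sorted(population + offspring, key=fitness)[:POP_SIZE]
    if fitness(population[0]) == 0.0:   # property broken: error found
        break

print("best test datum:", population[0], "fitness:", fitness(population[0]))
```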

3 Evolutionary testing framework

To facilitate the development of evolutionary tests, the EvoTest project has implemented the ETF as an extension of the Eclipse IDE (if you are interested in trying out this automated testing tool on your own software, see http://evotest.iti.upv.es). The heart of the framework is the optimization engine, which is generated by GUIDE (http://gforge.inria.fr/projects/guide/, last accessed 2011) using the Evolving Objects library (EOlib) (Keijzer et al. 2001). In addition, the framework contains a GUI component for configuring and tuning the parameters of the algorithm. An extension point for the search engine is provided, so that engines for other meta-heuristic search techniques, such as hill climbing, simulated annealing, and particle swarm optimization, can easily be plugged in. Furthermore, the framework provides various ways of visualizing the search progress (e.g. best and average fitness values per generation).

To customize the framework for a particular test aim, the following domain-specific components need to be supplied: (1) an individual specification, (2) a test driver, and (3) an objective function. The individual specification describes the structure of the individuals, that is, the test data, in an XML file. The test driver provides the connection between the framework and the SUT. It converts the individuals from the search process into test data. Subsequently, the test driver executes the SUT using the test data and monitors the output of the SUT. The monitoring results are passed back to the framework and are used by the objective function to calculate the adequacy of the test data.

In the past, the ETF has mainly been used for performing structural (Gross et al. 2009; Vos et al. 2010) and functional tests (Kruse et al. 2009; Windisch et al. 1943–1944), yet it can easily be adapted to other test aims, such as searching for the worst-case execution time (non-functional testing). For that aim, a fitness function has to be implemented that assigns better fitness values to individuals that cause higher execution times. Depending on the type of SUT (e.g. an executable model or code) and the structure of its inputs and outputs, the test data encoding (individual specification) and the test driver (execution and observation of the SUT) need to be adapted and implemented. However, it is not necessary to alter the evolutionary algorithm itself.

If the SUT depends on continuous input signals, as in our case studies, the framework offers a signal generator tool as proposed by Windisch and Al Moubayed (2009). Signals are generated and optimized by stringing together a certain number of parameterized base signals. Windisch et al. make a distinction between optimization and simulation sequences.



Fig. 1 Representation of a signal as a sequence of parameterized signal segments

a signal, is used by the optimization engine. It can be transformed into a signal with a value for each time step, called simulation sequence, which in turn is used as input for the execution of the SUT. The approach makes use of further information about the input signals for the SUT, provided by the tester. This signal specification, on the one hand, includes general attributes, for example the length of the signals to be generated and their designated resolution. On the other hand, it also contains separate specifications of the signals for each input of the SUT. For each input, the signal amplitude boundaries can be specified, and it is possible to select a list of base signals with which the corresponding input signal may be composed of. The signal generator then creates signals by lining up a certain number of signal segments. As visualized in Fig. 1, each of these signal segments parameterizes (length and amplitude) a base signal transition (e.g. sine, step, impulse, and linear transitions). Thus, a signal is described according to its sequence of segments and their corresponding parameters. Using this form of abstraction, a signal consists of only 3  n parameters with n being the number of signal segments, plus one parameter that describes the starting amplitude of the first segment. It is worth noting that within the search engine’s process, the SUT is always executed with entire simulation sequences (so not with partial ones) before the fitness function is used to evaluate the individual represented by the corresponding optimization sequences. For further technical details of the signal generation approach, like the subsequent adaption of the evolutionary algorithm’s operators, we refer the reader to Windisch and Al Moubayed (2009).

4 Case study setup

Following the work of Perry et al. (2000, 2005) and Kitchenham et al. (2002), we define the research question, describe the criteria for the selection of the cases, present the context of the cases, specify details of the case study design, and discuss threats to validity.

4.1 Research question

Juristo et al. (2004) reviewed 25 years of testing technique experiments. One of the limitations they encountered was the non-representativeness of the programs chosen, either because of their size or the number of faults. Briand (2007) also analyzed the empirical research in software testing and pointed out problems like overly small and unrealistic programs. Briand also stressed the importance of doing empirical studies for evolutionary testing in order to investigate its capacity to achieve objectives, and to analyze its scalability relative to the complexity of the SUT and the inputs of the search algorithms.


Consequently, considerable interest has arisen in the empirical study of evolutionary testing techniques and their scalability for realistic, industrial programs. Thus, our research question is: "Is evolutionary functional black-box testing scalable to real-world industrial practice?"

In the context of industrial testing practice, evolutionary functional testing is an interesting approach for automating and tackling selected test case generation problems. In industry, functional-oriented test case derivation is, even nowadays, mostly a manual process. For each requirement of the system or system component under test, a tester usually defines one or more test cases to assure compliance with the requirement. Most such test cases, based on implicitly known use cases, are aimed at confirming that the system's behavior matches the expected behavior; they are rarely designed to reveal the violation of a requirement. This may be attributed to the fact that deriving or finding such test cases manually is a very difficult, if not infeasible, task. Evolutionary functional testing appears very promising for overcoming the challenges of developing such tests manually. This use of evolutionary functional testing within industrial testing practice makes clear that the approach has no ambition to replace current test case generation methods. Rather, evolutionary functional testing can extend and support functional-oriented testing activities by automating error-oriented test case derivation activities that are hard to realize manually. To the best of our knowledge, no other comparable technique or method exists or is used in industrial practice; thus, a comparison to other testing techniques is not part of this study.

4.2 The selection of the cases

To ensure that the cases were real-world complex industrial systems, the systems used for our case studies were chosen by the industrial partners Daimler and BMS. Daimler is a German automotive corporation and one of the largest car manufacturers in the world, while BMS is a German SME that can be contracted as a development or testing partner in the sectors of aeronautics, defense, automotive, and rail. The most important factor for selection was that these were systems for which the industrial partner decided that the successful application of evolutionary testing technologies would be most useful in their industrial setting. The industrial partners were aware of the basic process and underlying technique of evolutionary functional testing. In order to explore the potential, as well as the limitations, of this testing technique, the industrial partners decided to select challenging cases that go beyond the applications of evolutionary functional testing in previous publications and studies (McMinn 2011). Since successful applicability of evolutionary testing in an industrial setting was one of the most important objectives of the case studies described in this paper, the industrial partners were not burdened with many constraints when selecting a system that would constitute the perfect subject for an empirical study. Access to previous versions with corresponding information about their faults and the techniques used to find them, as well as the ability to inject faults into the systems, would have been ideal for our studies. However, this is not always a possibility in an industrial setting, nor does it always comply with the needs or wishes of the industrial partners.
Consequently, we did not have these possibilities because the companies chose black-box systems with no additional information.


4.3 The context of the case studies

The systems used in both case studies are real-world embedded control systems from the automotive domain. Both case study systems were taken from serial production developments originated by Daimler and BMS, and the studies were executed by embedded systems testers: three testers working for Daimler and three testers working for BMS. The testers involved were instructed in evolutionary testing principles, though some of them already had basic knowledge of how evolutionary algorithms work in general and how evolutionary computation can be used for test data generation. However, the testers had no experience in actually designing fitness functions, particularly not for applications of high complexity. They also had not used the ETF for evolutionary functional testing prior to these case studies.

4.3.1 Daimler's system: adaptive cruise control

The purpose of the Adaptive Cruise Control (ACC) system is not only to maintain a given speed but also to control the distance to preceding vehicles (measured using radar systems), which helps to prevent accidents. The cruise control maintains the setpoint speed intended by the driver independently of the engine load. Gear shifts are also performed automatically. In case of a violation of the minimum distance to a preceding vehicle, the distance control adjusts the speed by braking. If the allowed maximum braking power of the system is insufficient to maintain a safe distance, the driver is alerted by visual and acoustic signals. The driver then has to take back control of the car to resolve the situation.

The ACC system was implemented using the standard modeling language MATLAB SIMULINK. It consists of 3,284 blocks and 8 statecharts with a total of 25 states. Although the executable SIMULINK model served as the test object for the case study, the number of code lines of the auto-generated C code files of the ACC system (altogether 13,339, including comment lines) may be a better indicator of the complexity of the ACC system.

For this study, a basic yet complex scenario was chosen for testing the ACC system. Two vehicles are involved in the simulation: one equipped with the ACC system and a preceding vehicle driving along the same lane on the road. The chosen scenario is preluded by an initialization sequence, during which the vehicle equipped with the ACC system accelerates from standstill until it reaches a defined velocity. Adding this initialization sequence before every possible scenario reduced the search space, thereby supporting the evolutionary search in finding desirable solutions. Not including such an initialization sequence would have raised the complexity of the search problem and possibly compromised search efficiency. However, limiting the search space needs to be done with great care and attention, as possible solutions of the search could be inadvertently excluded. After the initialization sequence, the ACC system is activated and gains control over the velocity of the vehicle. The preceding vehicle drives at the speed provided by an input signal, which varies during the progress of the simulation sequence. The behavior of the ACC-equipped car with respect to the cruise control lever is another input signal that was to be optimized. The goal of the functional test was to find a scenario in which the minimum distance criterion is violated (i.e. an accident is inevitable) but where the system still does not raise a warning signal or raises it too late.
4.3.2 BMS' system: anti-lock braking system

An anti-lock braking system (ABS) prevents the wheels of a vehicle from locking while braking, in order to allow the driver to maintain steering control under heavy braking


and to shorten braking distances. ABS is very effective at braking in adverse weather conditions like ice, snow, or rain. When ABS-equipped brakes are depressed hard, as in an emergency braking situation, the ABS pumps the brakes several times per second. Sensors measure the speed at which the wheels are turning. If the speed decreases rapidly, the electronic control system reports a blocking danger. The pressure of the brake hydraulics is reduced immediately and then raised to just under the blocking threshold. The goal of the anti-lock control system is to maintain the slip of the wheels at a level that guarantees the highest braking power and the highest steerability of the vehicle.

The SUT selected for this case study is the electronic control unit of a heavy goods vehicle ABS, enriched with a vehicle dynamics model. In conjunction with this model, it was possible to simulate the effect of changes in several different input variables on the vehicle dynamics, including velocity and position. Since the SUT is an electronic control unit, it had to be executed with the generated test scenarios in real-time. The ABS control unit was available to us only as a black-box system, and we did not have a functional specification for it. As such, it was impossible to characterize observed behavior as correct or faulty. Consequently, the functional testing in this case study aimed at finding the maximum braking distance achieved by the ABS through varying the road friction profile. Our search for the maximum braking distance, therefore, approximates a search for performance errors of the braking system in the absence of a specification.

In order to give a rough estimate of the complexity of the ABS, a software model was derived by observing the input-output characteristics of the real ABS. As such, the model provides an approximation of the behavior of the real ABS, and we cannot claim that its behavior is identical. The derived model was implemented using MATLAB SIMULINK and consists of 37 blocks. The auto-generated code of the model, together with the vehicle dynamics model, contains about 5,300 lines of code, including comments. Note that the ABS control unit includes, in contrast to the derived model, additional software code, such as frame software.

For performing evolutionary testing of the anti-lock braking system, input signals have to be provided, encoded within the individuals generated during the search. Typical input signals for the anti-lock braking system are the wheel speeds for each wheel and the braking torque requirement resulting from the driver pressing the brake pedal (often supported by driver assistance systems controlling the overall vehicle performance). Additionally, the data encoded in the individuals have to control parameters of the simulation environment, for example the grip of the roadbed and the road conditions (icy, wet, gravel, etc.), or the temperature of the brake disks. Input signals and simulation parameters are defined for several seconds of real-time simulation of braking maneuvers.

4.4 The design of the case studies

This section presents the propositions that are used to answer the research question, describes how and which data are to be collected during the studies, and states which criteria are used to interpret the findings with respect to the propositions.

4.4.1 The propositions

In this section, we refine the research question into propositions that can be evaluated through the variables measured during the study (see Sect. 4.4.2).
In order to do this, we discussed with the industrial partners which factors determine both applicability and scalability of evolutionary testing in their industrial setting.


P1: It is possible to use the ETF without detailed knowledge of evolutionary computation to search for interesting test data.

P2: The ETF, applied to real-world sized examples in real-world test environments, is able to generate better test cases w.r.t. achieving the test goal than random testing. The argument here is that while fitness values are improving, the system is continuously being exercised closer and closer to a boundary condition (an optimal fitness value means breaking that boundary, i.e. the requirement).

P3: The ETF is more effective in generating/finding error-revealing test cases when applied to real-world systems for black-box testing, compared to random testing.

P4: Automated parameter tuning (DaCosta et al. 2008) improves the results of the search in terms of search effectiveness and efficiency.

P5: After installation of the ETF, the amount of time and effort it takes to configure the ETF in order to apply it to real-world systems for evolutionary functional testing is suitable within an industrial setting.

4.4.2 Case study procedure

The following common steps for carrying out the studies and collecting the data were defined:

1. Installation and configuration according to the ETF user manual (2011). Work-diaries are maintained that contain the tasks (including their date, time, and description) performed to set up the ETF according to the user manual (for example, finding an appropriate set of parameters for the evolutionary engine).
2. Implementation of case-study-specific components (e.g. individual specification and test drivers). Work-diaries are maintained in order to be able to estimate the necessary effort.
3. Definition, refinement, and implementation of the fitness function, and validation of its suitability for breaking the requirement; work-diaries are maintained.
4. Running each search 30 times to give the results statistical meaning, and collecting the data listed in the next section.
5. Interviews about the general suitability and acceptability in the specific industrial setting. These interviews are informal and not intended for statistical tests on the data since, evidently, the sample consisting of Daimler and BMS engineers is not representative of the population of interest. However, these interviews are still interesting to include (Lethbridge et al. 2005), since their objective is to gain insight into the experiences practitioners had when applying the evolutionary testing techniques and tools. These experiences are used to find bugs in the tool, areas that need improvement, or extensions necessary for the particular industrial setting. Typical questions used during the interviews can be found in Appendix A. However, respondents are encouraged to elaborate on areas important to them and so may deviate from the questions in the script.

4.4.3 Dependent and independent variables

The studies are run by controlling the independent variables and measuring the effect on the dependent variables. The dependent variables are chosen in such a way that answers can be given to the propositions in Sect. 4.4.1.


1. Independent variables

(a) Complexity of the industrial systems.
(b) The set of evolutionary parameters used for setting up the evolutionary engine of the ETF; for a detailed description of the parameters in question in evolutionary computation, please consult (Description of evolution engine parameters 2011; GUIDE 2011). We distinguish between four different parameter settings, defining different versions of the ETF:
   - ETF_Random: use random search instead of evolutionary search.
   - ETF_Default: a default set of parameters is selected by the evolutionary engine.
   - ETF_Manual: the parameters are chosen based on the expertise of the tester. This is done by choosing different sets of parameters and observing their performance.
   - ETF_Automated: automated parameter tuning techniques are run on the evolutionary engine in order to choose the "best" set of parameters.

2. Dependent variables

(a) Number of test cases evaluated. This is equivalent to the number of times the SUT has been executed for a specific requirement and to the number of fitness function evaluations (quantitative).
(b) Number of invalid test cases that were generated (quantitative).
(c) Number of errors found (quantitative).
(d) The progress of fitness values (quantitative).
(e) Time and effort needed to find an appropriate set of parameters for the evolutionary engine (measured in developer time) (quantitative).
(f) Time needed to define the fitness function (measured in developer time) (quantitative).
(g) Time and effort needed to install the ETF (measured in developer time) (quantitative).
(h) Time and effort needed to customize the framework for a particular test aim (i.e. development of the XML file for the individual specification and the Eclipse plugins for the test driver and objective function) (measured in developer time) (quantitative).
(i) Lines of code of the necessary case-study-specific components that need to be integrated in the ETF for the functional test (quantitative).
(j) Applicability and acceptability in an industrial setting (qualitative).

4.4.4 Plan for manipulating independent variables

For each of the two industrial systems selected (see Sects. 4.3.1 and 4.3.2), the studies were run with four different versions of the ETF, that is, ETF_Random, ETF_Default, ETF_Manual, and ETF_Automated. This means 8 studies overall.

4.4.5 Criteria for interpreting the findings

As indicated before, the dependent variables have been chosen in such a way that answers could be given to the propositions in Sect. 4.4.1.


In order to assess propositions P5 and P1, we will use the results obtained from the interviews and work-diaries (i.e. dependent variables 2e, 2f, 2g, 2h, 2i, and 2j).

For P2, the fitness value optimization progress (measured as dependent variable 2d) will be used to interpret the validity of the proposition. We will also investigate the best test cases within the problem-specific context in order to get more insight into the results. Finally, the number of invalid test cases (i.e. test cases with inputs that cannot occur in reality), measured as dependent variable 2b, will be used to assess this proposition.

For P4, two criteria will be used to evaluate the search results. The number of fitness evaluations (dependent variable 2a) will be used as a measure of efficiency, with an improved search (dependent variable 2d) being considered more efficient. The change, if any, in the number of errors found (dependent variable 2c) will be used to assess the effectiveness of the search.

For P3, the number of errors found (dependent variable 2c), compared to random testing and other testing techniques applied in the case studies, will be used. Note, however, that this information may not always be present and depends on the SUT the case study provider chooses for the studies (see Sect. 4.2).

4.5 Threats to validity

4.5.1 Construct validity threats

Construct validity threats relate to the admissibility of statements about the underlying constructs on the basis of the operationalization. Did we measure what we expected to measure? Possible threats to construct validity:

The extent to which the errors in systems are related to the requirements they try to break. To mitigate this risk, errors found will be investigated to determine whether they break the requirement that is tested.

The representativeness of the interviews. The interviews with the companies are not representative of the population of interest. Consequently, these interviews are informal and not intended for statistical tests on the data. As indicated before, the data are used to find bugs in the tool, areas that need improvement, or extensions necessary for the particular industrial setting.

The individual experience and education. The developer time measured for variables 2e, 2f, 2g, and 2h depends on the experience and education level of the people involved. To mitigate this threat, both companies selected essentially the same profile for the people involved in the case studies. This profile corresponds to the people who normally execute automated tests.

Generation of invalid test cases. The signal generator of the ETF may not allow for the generation of entirely realistic signals. To mitigate and evaluate this risk, the number of invalid test cases is measured as dependent variable 2b. Technically, the generated input signals are checked against certain constraints within the test driver implementation. The constraints are taken from specification documents of the SUT. The results of the checks do not affect the search itself; the checks are performed only to ensure that the chosen individual specifications (input signal specifications) already prevent the ETF from generating signals with unrealistic characteristics.

Comparability with random tests. Random tests (i.e. ETF_Random) are run at least until a comparable amount of test cases has been generated.
Afterward, random test data generation is stopped as soon as the search reaches a level of "stagnation", showing no further improvement of test results for a large number of additionally generated test sets.


From a mathematical point of view, for random testing, each test case is independent, and the number of trials until success follows a geometric distribution with parameter p (where p is the ratio of test cases with optimal fitness to the number of all possible test cases). As such, there is always a probability that running random searches longer will find a better solution. However, previous work (Grochtmann and Wegener 1998; Pohlheim 2000; Wegener et al. 2002) has shown that after "stagnation" has been reached, the probability of finding better results through random testing is very small, because only test sets with a very low execution probability lead to better results.
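To make this concrete (with a purely hypothetical ratio $p$): the probability that random testing hits an optimal test case within $N$ independent trials is

$$P(\text{success within } N) = 1 - (1 - p)^N$$

For $p = 10^{-4}$, the first 10,000 trials succeed with probability $1 - (1 - 10^{-4})^{10000} \approx 1 - e^{-1} \approx 0.63$, while the next 10,000 trials raise this only to about 0.86; each further block contributes less, which matches the stagnation behavior described above.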

4.5.2 Internal validity

Internal validity is concerned with the interpretability of the findings; it describes the certainty that changes in the dependent variables can be attributed to changes in the independent variables. It expresses the extent to which the design and analysis may have been compromised by the existence of confounding variables and other unexpected sources of bias. Possible threats to internal validity:

The experience with evolutionary computation or evolutionary testing. Obviously, people using the ETF who have experience in tuning evolutionary algorithms will do better than people who lack this experience, because they are apt to select better parameters that can lead to a more efficient search. Again, this threat is mitigated by selecting individuals with no more than a basic knowledge of evolutionary computation.

Quality of the ETF. Instabilities and bugs of the ETF in an early development stage could hinder the execution of the studies.

The intrinsic randomness of evolutionary algorithms. This threat is mitigated by running each case study multiple times.

4.5.3 External validity

Threats to external validity may influence the extent to which conclusions can be generalized. Possible threats to external validity:

Representativeness of the selected case study systems. The selected systems all come from the automotive domain. This may mean that the findings are valid only in these industrial contexts. However, repeating the case studies in other domains can give more information about the generalizability of the results.

Representativeness of the chosen test case scenarios. The scenario's complexity highly influences the complexity of the resulting search problem and, in turn, the difficulty level faced by the search procedure. One can argue that scenarios in other industrial applications might exhibit more complex characteristics that could result in unacceptable runtimes for industrial use. Nonetheless, by carefully selecting scenarios of similar complexity to typical scenarios in industrial testing practice, this risk can be mitigated. Furthermore, the chosen scenarios should lead to more complex search problems than those hitherto investigated in the context of evolutionary functional testing.

Strictly synthetic signals. Generated input signals may be strictly synthetic (e.g. containing no noise) and thus might not generalize to realistic systems.

Long execution time. The number of repetitions of the studies necessary to gain statistical meaning can take too long in an industrial setting.


5 Case study execution

5.1 Search configuration

In order to obtain a comparison baseline, random searches were carried out first (ETF_Random). As explained before, the random searches were run until a certain amount of test cases was generated and then were continued only as long as the search had not reached the stagnation stage. Afterward, searches were performed using the default parameter settings (ETF_Default). Then the results of our studies were optimized by adapting the optimization parameters to our specific optimization problem (ETF_Manual). Finally, a set of searches was carried out using ETF_Automated, which automates the expensive task of tuning the optimization parameters during the actual optimization process. The parameters that needed to be set in advance are thus limited to the population size and the number of generations evolved during optimization. Table 1 summarizes the respective configurations (details about the GUIDE parameters involved are available in the corresponding documentation, Description of evolution engine parameters 2011).

5.2 Data collection

The following two subsections report the data collected during the evolutionary functional testing case studies. They describe how the dependent variables were measured while varying the independent variables as described in Sect. 4.4.

Table 1 Optimization parameter settings used for evolutionary functional testing

Parameter               | ETF_Default | ETF_Manual (Daimler/BMS) | ETF_Automated
Algorithm
  Fitness goal          | Minimize    | Minimize                 | Minimize
  Generations           | 100         | 100                      | 100
Initial population
  Size                  | 100         | 100                      | 100
Offspring population
  Fertile               | 85%         | 35%/100%                 | -
  Elite                 | 15%         | 10%/0%                   | -
  Size                  | 100         | 100/500                  | -
  Selection scheme      | Tournament  | Tournament               | -
  Mutation probability  | 10%         | 85%/100%                 | -
  Crossover probability | 85%         | 85%/10%                  | -
Intermediate population
  Surviving offspring   | 100%        | 50%/100%                 | -
  Surviving non-elite   | 25%         | 50%/0%                   | -
  Reduction scheme      | Tournament  | Tournament/sequential    | -
Final population
  Type of elitism       | Strong      | Weak/strong              | -
  Final reduction       | Sequential  | Sequential/tournament    | -

(Cells marked "-" are not fixed in advance for ETF_Automated; it tunes these parameters automatically during the search, so only the population size and the number of generations are set beforehand.)




5.2.1 Adaptive cruise control

Developing case-study-specific components (dependent variables 2f, 2g, 2h, and 2i). As described in Sect. 3, to customize the framework for a particular test aim, the following domain-specific components need to be supplied (see the ETF user manual 2011): (1) an individual specification, (2) a test driver, and (3) an objective function.

1. Individual specification. The first step is to define the individual specification for the evolutionary engine. For this case study, only two input signals had to be generated to stimulate the SUT: the speed of the preceding vehicle and the interaction of the driver with the cruise control lever. As described in Sect. 3, for each of the signals, parameters had to be specified (see Table 2). These make up the individual specification for the evolutionary functional test.

2. Test driver. The task of the test driver is to execute the SUT with the generated test data. Therefore, the generated test data had to be mapped to the inputs of the SUT. After execution of the SUT, the output data required for evaluating the quality of each individual have to be transferred back to the evolutionary engine. Since the test object is implemented in the standard modeling language MATLAB SIMULINK, an adapter had to be implemented that allows for communication between the ETF and the test object (Fig. 2). However, this adapter is not really test-object specific and can be reused for other case studies with MATLAB SIMULINK test objects (Klimke et al. 2003). As such, it has been integrated into the ETF as a component.

3. Objective function. The final task to be completed was designing the objective function (fitness function). The objective function provides guidance for the evolutionary search toward desired scenarios as described in the previous section. For this, the time to collision, as well as the arrival time of the driver warning, must be taken into account. We need to distinguish two time-to-collision calculations. The first involves calculating the time to collision if both our own car and the preceding car do not brake. In this case, the time to collision can be calculated as follows:

$$TTC_{driving}(t) = \frac{d(t)}{\max(v_{rel}(t),\, v_{rel,min})} \qquad (1)$$

where $d(t)$ is the distance between the two cars, $v_{rel}(t)$ the relative speed, and $v_{rel,min} = 0.002\,\mathrm{m/s}$ the minimum relative speed, preventing division by zero and a negative TTC value.

Table 2 Individual specification used for the Adaptive Cruise Control case study

Signal                     | Length (in s) | Resolution (in s) | Amplitude range | Number of segments | Segment width | Base signals
Cruise control lever       | 60            | 0.01              | N[-1,1]         | 5                  | [1,10]        | Impulse
Speed of preceding vehicle | 60            | 0.01              | R[0,100]        | 10                 | [1,50]        | Spline, linear

Fig. 2 Test environment for the ACC system


The second time-to-collision calculation considers that the driver of the ACC-equipped car might actually brake because the red ACC warning light flashes up. Therefore, we need to account for the reaction time of the driver ($T_R$) and the possible deceleration. First, the distance to the preceding car after the reaction time of the driver, $d_R(t)$, needs to be calculated:

$$d_R(t) = d(t) - v_{rel}(t) \cdot T_R \qquad (2)$$

If the distance to the preceding car after the reaction time is already smaller than zero, a crash will inescapably occur; in this case, the non-braked time to collision $TTC_{driving}$ can be used. If the distance after reacting is greater than zero, there is time for the car to decelerate. Our approach for calculating the braked time to collision is as follows. The distance to the preceding car at the critical collision time $t_c$ can be calculated as:

$$d_C(t) = d_R(t) - s_{C,ego}(t) + s_{C,target}(t) \qquad (3)$$

where $s_{C,ego}(t)$ is the distance travelled by the ACC-equipped car from the beginning of the braking until standstill, and $s_{C,target}(t)$ the distance travelled by the preceding car during that same time period. Since the target distance is zero in case of a collision, we can simply set $d_C(t) = 0$ and determine the associated time $t_c(t)$:

$$t_c(t) = \frac{-v_{rel}(t) + \sqrt{v_{rel}(t)^2 + 2 \cdot a_{ego} \cdot d_R(t)}}{a_{ego}} \qquad (4)$$

where $a_{ego}$ is the deceleration evoked by the driver of the rear vehicle after the reaction time has passed, and $t_c(t)$ the time elapsing from the onset of braking (that is, after the reaction time) until the distance to the preceding car equals zero, that is, until a collision occurs. In order to calculate the entire time to collision from the maneuver starting time, i.e. including the reaction time, we simply add the reaction time to $t_c(t)$. Remember that we can use the time to collision for constant speeds, $TTC_{driving}(t)$, if the vehicles collide during the driver's reaction time.

$$TTC_{braking}(t) = t_c(t) + T_R \qquad (5)$$

$TTC_{braking}(t)$ is undefined if the term inside the square root is negative, that is, if $v_{rel}(t)^2 + 2 \cdot a_{ego} \cdot d_R(t) < 0$. This occurs if the distance to the preceding car is large enough that a collision will be avoided by braking. In this case, the time to collision without braking is consulted, since we still want to ensure search guidance. Hence, the overall time to collision can be formulated as:

$$TTC(t) = \begin{cases} TTC_{driving}(t), & d_R(t) \le 0 \\ TTC_{driving}(t), & v_{rel}(t)^2 + 2 \cdot a_{ego} \cdot d_R(t) < 0 \\ TTC_{braking}(t), & \text{else} \end{cases} \qquad (6)$$

The arrival time of the warning light still needs to be incorporated. The warning signal $Warn(t)$ is 0 if no warning is to be displayed and 1 if the driver must be alerted. Once the warning light shows up, the time to collision is no longer of importance to us, since the driver must take over control. Thus, we can simply multiply the warning signal by a large value $\delta$ and add it to the TTC signal to ensure that these values are not the minimum values chosen by the min function.


Additionally, to distinguish individuals in more detail, the minimum TTC value is divided by the gradient of the TTC signal at time step $t_{min}$, in order to incorporate the amount of signal change into the evaluation:

$$objValue = \frac{\min_t\left(TTC(t) + Warn(t) \cdot \delta\right)}{\left|\frac{\partial TTC}{\partial t}(t_{min})\right|} \qquad (7)$$
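For illustration, the following sketch implements the fitness calculation of Eqs. (1)-(7) over monitored signal traces. The constants (reaction time, deceleration, δ) and the discrete gradient approximation are hypothetical choices of ours; the actual objective function operates on the monitored SIMULINK signals. Note that $a_{ego}$ is modeled as a signed (negative) acceleration, consistent with the sign of the discriminant in Eq. (4).

```python
import math

V_REL_MIN = 0.002   # minimum relative speed (m/s), prevents division by zero
T_R = 1.0           # assumed driver reaction time (s); hypothetical value
A_EGO = -6.0        # assumed deceleration after reaction (m/s^2, signed)
DELTA = 1.0e6       # large constant added while the warning is shown

def ttc(d, v_rel):
    """Overall time to collision, Eq. (6)."""
    ttc_driving = d / max(v_rel, V_REL_MIN)          # Eq. (1)
    d_r = d - v_rel * T_R                            # Eq. (2)
    disc = v_rel ** 2 + 2.0 * A_EGO * d_r
    if d_r <= 0.0 or disc < 0.0:                     # first two cases of Eq. (6)
        return ttc_driving
    t_c = (-v_rel + math.sqrt(disc)) / A_EGO         # Eq. (4)
    return t_c + T_R                                 # Eq. (5)

def objective(distances, rel_speeds, warnings, dt=0.01):
    """Eq. (7): minimum of TTC + Warn*delta, scaled by the TTC gradient
    at the minimizing time step."""
    values = [ttc(d, v) + w * DELTA
              for d, v, w in zip(distances, rel_speeds, warnings)]
    i = min(range(len(values)), key=values.__getitem__)
    lo, hi = max(i - 1, 0), min(i + 1, len(values) - 1)
    # Central-difference approximation of the TTC gradient at t_min
    # (assumes the warning signal is constant near the minimum).
    gradient = abs(values[hi] - values[lo]) / (((hi - lo) or 1) * dt)
    return values[i] / (gradient or 1.0)  # guard against a zero gradient
```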

During these activities of customizing the framework for a particular test aim, data were collected for associated dependent variables. These data, as well as data measured during the case study execution and linked to the remaining dependent variables, are presented in the following table (pms stands for person-months):

Dependent variable                                                  | Measured
Generated test cases (2a)                                           | 300 k (Random), 300 k (Default), 300 k (Manual), 300 k (Automated)
Number of invalid test cases generated (2b)                         | 0
Number of errors found (2c)                                         | 0
Time and effort needed to find an appropriate set of parameters (2e)| 0.5 pms
Time needed to define and refine the fitness function (2f)          | 0.5 pms
Time needed to install the ETF (2g)                                 | 2 h
Time and effort needed to customize the ETF (2h)                    | 3 pms
Lines of code of the case-study-specific components (2i)            | 365

Execution of the evolutionary functional tests (dependent variables 2a, 2b, 2c, and 2d). Further examination is now given to the results obtained by performing multiple evolutionary functional test runs for the ACC system.

1. Generated test cases (dependent variable 2a). For ETF_Random, the minimum number of test cases to be generated and evaluated in each run was set to 10,000. Since the search had already stagnated at this point in all of the runs, no additional test cases were generated. For the other configurations, the number of evaluated test cases depended on the respective search configuration, particularly on the selected number of generations and the enclosed individuals to be evaluated. For the configurations shown in Table 1, the following values result: for each given configuration of the engine, a total of 300,000 test cases were generated and evaluated, composed of 100 individuals times 100 generations times 30 executions. In total, 1.2 million test runs of the ACC were evaluated. To give a picture of the time requirements of this case study, the following side note might be helpful: the execution of the ACC and the evaluation of the fitness function for one generated test case took about 5 s; in other words, it took about 14 h per run to evaluate 10,000 test cases.

2. Number of invalid test cases generated (dependent variable 2b). During all the search runs conducted, no invalid test case was generated, due to a very precise and applicable configuration of the signal generator.


Fig. 3 Input and output signals of one critical driving scenario detected by the ETF

3. Number of errors found (dependent variable 2c). The evolutionary testing approach was unable to find a fault in the SUT. This was not a surprise, considering that it is a production system already deployed in cars. However, even though the ETF did not find any input signals for the ACC system causing a violation of the safety requirement, it still found interesting driving scenarios that might be worth reusing in further testing activities like regression testing. Figure 3 shows one of these scenarios found during the evolutionary test. Apart from the generated signals for the ACC lever position (Lever Position) and the speed of the target vehicle (Target Speed), the figure shows additional signals that result directly from observing the behavior of the executed system, most of which are involved in calculating the scenario's fitness. During the initialization phase, the following car accelerates up to 23 m per second (82.8 km/h). Afterward, its speed is influenced by the speed of the preceding vehicle, the resulting distance, and the desired speed. The desired speed itself depends on how the driver uses the control lever. At 47 s, the system raises an acoustic warning to the driver and, shortly afterward, triggers a crucial visual warning due to the rapid speed reduction of the preceding vehicle, along with a noticeable shortening of the vehicle distance. The time to collision, used for calculating the fitness value, can be seen in the lower-right diagram. Even though it might look like it, the time to collision does not reach the value zero.

4. Optimization progress (dependent variable 2d). As is visible in Fig. 4, the ETF was able to find situations that can be classified as becoming more critical the further the optimization progresses. Figure 4 illustrates the results using three means of comparison (Fig. 4a: the arithmetic mean, Fig. 4b: the median, and Fig. 4c: the best objective found). For all of them, the x-axis shows the number of objective function evaluations, whereas the y-axis shows the calculated comparative value and is plotted on a logarithmic scale to enhance readability. For evaluating the ETF_Random runs, the same fitness function was used as for the evolutionary test runs. In order to compare the effectiveness of random testing with evolutionary testing, Fig. 4 shows the average, median, and best of the lowest fitness values found so far by ETF_Random.

Figure 4a compares the arithmetic mean of the results. It was calculated for each abscissa value of the 30 studies conducted.



Fig. 4 The convergence characteristics of ETF_Random, ETF_Default, ETF_Manual, and ETF_Automated compared using (a) the arithmetic mean, (b) the median, and (c) the best objective found over all 30 ACC studies

Comparing the two different settings of algorithm parameters, ETF_Manual obtained a better average fitness value than ETF_Default, although ETF_Default delivered slightly better results on average at the beginning. Results of Mann-Whitney-Wilcoxon (MWW) tests, based on the final fitness values achieved in all of the 30 repetitions, reveal that the difference between ETF_Manual and ETF_Default is of high statistical significance (P < 0.001, two-tailed test). As expected, random search showed poor progress toward the test objective compared to the evolutionary algorithm runs (high statistical significance for ETF_Default, ETF_Manual, and ETF_Automated when compared to ETF_Random). A first glance at the average values of ETF_Automated gives the impression that it performs badly. This is indeed true for around 50% of the searches; however, for the remaining 50% of the searches, it performed rather well. A comparison of ETF_Automated with ETF_Manual shows no statistically significant difference (P ≥ 0.05, two-tailed test). In the end, this results in poor average values, accompanied by large standard deviations.

To get a better impression of the performance using a comparison measure that is more robust against outliers, the median is illustrated in Fig. 4b. Again, random test data generation serves as a baseline, showing only minor convergence, as expected. In contrast, all evolutionary optimization techniques show considerable convergence, with ETF_Manual performing better than ETF_Default.


Here, ETF_Automated achieves both better search effectiveness and efficiency. Figure 4c shows the best objective value found so far as the optimization progresses, considering all 30 repetitions. ETF_Random was able to cumulatively find better objective values as the search progressed, yet it got stuck in local optima without sufficiently approaching the global one. It is worth mentioning that ETF_Default was able to find its best objective value within the first 2,000 objective function evaluations. After that, however, the search did not improve anymore; ETF_Default hence got trapped in local optima very quickly. In contrast, ETF_Manual exhibited a more consistent convergence curve, but in spite of that, it also got trapped in local optima and was not able to approach the optimal solution as closely as ETF_Default. The best results with respect to search effectiveness were achieved by ETF_Automated, which featured a steady improvement until it reached its best solution after 4,000 objective function evaluations.
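The significance tests reported above can be reproduced along the following lines: a two-sided Mann-Whitney-Wilcoxon test on the 30 final fitness values of two configurations. The data below are placeholders, not the measured values.

```python
# Sketch of the statistical comparison described above; the real inputs
# are the 30 final fitness values per configuration, not this fake data.
import random
from scipy.stats import mannwhitneyu

random.seed(0)
final_default = [random.gauss(0.8, 0.1) for _ in range(30)]  # placeholder
final_manual  = [random.gauss(0.5, 0.1) for _ in range(30)]  # placeholder

stat, p = mannwhitneyu(final_default, final_manual, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4g}")  # p < 0.001 would be highly significant
```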

5.2.2 Anti-lock braking system

Developing case-study-specific components (dependent variables 2f, 2g, 2h, and 2i).

1. Individual specification. For this case study, one input signal had to be optimized: the slip profile of the simulated road. The slip profile signal consists of 30 segments (see Sect. 3), where the segment widths are bounded to [1.0, 2.0] and the amplitude values to [0.5, 1.2]. In addition, the steering wheel angle was optimized as a single value (resulting in a constant signal) ranging over [0.0, 30.0]. Hence, a total of 61 values had to be optimized.

2. Test driver. A thorough description of the composition of the ETF with MESSINA and a HiL system for the SUT used in this case study (an electronic control unit of a heavy goods vehicle anti-lock braking system, enriched with a vehicle dynamics model) can be found in Kruse et al. (2009). Here we provide just a short summary: the ETF has been integrated into the model-based test tool MESSINA (Messina 2010). MESSINA's abstraction layer allows the user to seamlessly perform Model-in-the-Loop (MiL), Software-in-the-Loop (SiL), and Hardware-in-the-Loop (HiL) tests. Through the MESSINA signal pool, the input variables of the SUT, such as the steering wheel angle, the brake and accelerator pedal positions, as well as the road surface friction profile (per wheel), can be modulated from within the (Java) test case description. Figure 5 shows an overview of the test environment.

3. Objective function. Desired testing scenarios are those in which the ABS does not work as intended or required; that is, we want to find test cases (individuals) that cause a braking maneuver with too high a wheel slip. After several iterations of objective function improvements, the final version chosen for the case study focuses on maximizing the measured slip between the two front wheels and the road. Since it became clear that the measured slip correlates highly with the braking distance, focusing on maximizing the slip guides the search toward areas of worse system performance. The objective function had to be designed accordingly. The main values to be taken into account are the average wheel slip values of all four wheels, normalized over the braking maneuver duration. These are calculated as follows:

$$AWS = \sum_{k=1}^{n} \frac{s_k}{n} \qquad (8)$$

where s_k is the wheel slip at time step k. The divisor n is needed to prevent false optimization toward extremely long, time-consuming braking maneuvers, which would otherwise simply accumulate a larger sum of slip values. Together with completely iced road characteristics, for example, such a maneuver would have the same impact as a non-functional ABS. The average wheel slip is then maximized during the evolutionary evaluation. Since the ETF expects minimization problems, the objective value is transformed correspondingly:

\mathrm{objValue} = \min(1 - \mathrm{AWS}) \qquad (9)
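For illustration, the following is a minimal sketch of how this objective function could be realized; the real implementation is embedded in the ETF/MESSINA test driver and is not shown in the paper, so the function names and the decoding of the 61-value individual are hypothetical and follow only the description above.

```python
# Hypothetical sketch of the ABS objective function described above.

def decode_individual(genes):
    """Split the 61 optimization variables into the slip profile
    (30 segments, each with a width in [1.0, 2.0] and an amplitude in
    [0.5, 1.2]) and the constant steering wheel angle in [0.0, 30.0]."""
    widths = genes[0:30]
    amplitudes = genes[30:60]
    steering_angle = genes[60]
    return list(zip(widths, amplitudes)), steering_angle

def objective(slip_trace):
    """slip_trace: wheel slip values s_1..s_n sampled over the braking
    maneuver. Computes the average wheel slip (Eq. 8), normalized by the
    number of samples n so that long maneuvers are not favored, and
    transforms it into a minimization problem (Eq. 9)."""
    n = len(slip_trace)
    aws = sum(slip_trace) / n   # Eq. (8)
    return 1.0 - aws            # Eq. (9): minimizing this maximizes AWS
```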

During these activities of customizing the framework for a particular test aim, data were collected for associated dependent variables. These data, as well as data measured during the case study execution and linked to the remaining dependent variables, are presented in the following table (pms stands for person-months):

Dependent variable: Measured
Generated test cases (2a): 70 k (Random), 70 k (Default), 20 k (Manual), 60 k (Automated)
Number of invalid test cases generated (2b): 0
Number of errors found (2c): 0
Time and effort needed to find an appropriate set of parameters (2e): 0.5 pms
Time needed to define and refine the fitness function (2f): 1 pms
Time needed to install the ETF (2g): 2 h
Time needed to customize the ETF (2h): 4 pms
Lines of code of the case-study-specific components (2i): 5,195

Execution of the evolutionary functional tests (dependent variables 2a, 2b, 2c, and 2d). We now examine the results obtained by performing multiple evolutionary functional test runs for the ABS.

1. Generated test cases (dependent variable 2a). For ETF_Random, the minimum number of test cases to be generated and evaluated in each run was set to 10,000. Again, the search had already stagnated at this point in all of the runs, so no additional test cases were generated. It should be noted that the execution of the ABS and the evaluation of the fitness function for one generated test case took about 30 s; in other words, evaluating 10,000 test cases took about 83 h per run. The long execution times are caused by executing the ABS in real time within a Hardware-in-the-Loop (HiL) setting. For this reason, as well as overall time constraints for the study, fewer search runs than desired were accomplished for the different configurations. The numbers of test cases evaluated using the final version of the fitness function are 70,000 for ETF_Default, 60,000 for ETF_Automated, and 20,000 for ETF_Manual. Instead of a stopping criterion based on the fitness, the default stopping criterion (100 generations) was used to end the search.

Fig. 5 Test environment for the ABS system

2. Number of invalid test cases generated (dependent variable 2b). During all the search runs conducted, no invalid test case was generated.

3. Number of errors found (dependent variable 2c). The evolutionary testing approach was unable to find a fault in the SUT. This was not a surprise, considering that the SUT is a production system already embedded in trucks.

4. Optimization progress (dependent variable 2d). As Fig. 6 shows, the ETF generated a large set of problem-oriented test cases. Although the random data are limited, the ETF_Default, ETF_Manual, and ETF_Automated runs show a clear reduction in fitness compared with the random run. This reduction (improvement) in fitness corresponds to increased slip caused by the dynamic friction profile, which leads to a longer braking distance and hence brings the ABS closer to its specified performance limits. The search configurations ETF_Default, ETF_Manual, and ETF_Automated show an optimization progress in which the test cases generated by the ETF become more and more problem-oriented.2 ETF_Default and ETF_Manual delivered the best progress over the first 40–50 generations. ETF_Automated performed slightly worse during this first half of the optimization progress but then started to perform better than ETF_Default and ETF_Manual in the second half; ETF_Automated seems to need a certain amount of time to learn the problem and adapt to it. Only for the best individual found (Fig. 6c) did ETF_Default perform better over all 100 generations calculated.

2 Due to the small sample size of ETF_Manual (two optimization runs), the differences between the final fitness values of ETF_Default and ETF_Manual, as well as between ETF_Manual and ETF_Automated, turned out not to be statistically significant according to MWW tests. However, the difference between ETF_Random and ETF_Default is highly statistically significant.

Fig. 6 The convergence characteristics of ETF_Random, ETF_Default, ETF_Manual, and ETF_Automated compared using (a) the arithmetic mean, (b) the median, and (c) the best objective value found, over all ABS runs carried out

5.3 Assessment

Based on the dependent variables whose measures were described in Sect. 5.2, and through the set of propositions from Sect. 4.4.1, this section answers the underlying research question of how successful the application of EvoTest technologies to real systems in industrial practice is.

Proposition P1 "It is possible to use the ETF without detailed knowledge in evolutionary computation to search for interesting test data."

For functional black-box testing, it is not possible to use the evolutionary testing framework to search for interesting test data without detailed knowledge of evolutionary computation, as a sophisticated fitness function must be designed. The fitness function is one of the most crucial factors for the success of the evolutionary search, and thus for the evolutionary test data generation process in general. It is meant to guide the search by evaluating generated test cases and, in this process, assigns a real number (called fitness) to each of them.
A simple one-or-zero (yes-or-no) evaluation is insufficient: the fitness function must differentiate test cases in a fine-grained, continuous manner, and the fitness landscape it spans over the space of possible inputs should be as smooth as possible in order to improve search guidance. A lack of experience in the design of suitable fitness functions can easily lead to mistakes that create local optima or similar pitfalls, which can hinder the search for the desired test data. Even with knowledge of evolutionary computation, the experience gained during the two case studies shows that designing a sophisticated fitness function can be far from trivial; consider, for instance, the long time needed to design the fitness functions (measured by variable 2f, see Sect. 5.2). With respect to the parameters of the evolutionary engine, it is essential that support from evolutionary computation experts is readily available when advice is needed for manually tuning the evolutionary engine. However, if manual tuning is not to be performed, such skilled knowledge might not be necessary to search for interesting test data, as can be seen from the minor difference in effectiveness, with regard to the fitness improvements achieved, between ETF_Default and ETF_Manual.
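To illustrate the point, here is a small, hypothetical sketch contrasting an uninformative one-or-zero fitness with a continuous, distance-based one for a requirement such as "the output must stay below a threshold"; it is not taken from the case studies, and the threshold value is a made-up example.

```python
# Hypothetical illustration: why a one-or-zero fitness gives the search no
# guidance, while a continuous distance-based fitness does.

THRESHOLD = 100.0  # e.g. a safety limit the SUT output must not exceed

def fitness_boolean(output):
    # Flat landscape: every non-violating test case looks equally (un)fit,
    # so the search degenerates into random exploration.
    return 0.0 if output > THRESHOLD else 1.0

def fitness_continuous(output):
    # Smooth landscape: test cases that drive the output closer to the
    # threshold receive better (smaller) values, guiding the search toward
    # a potential violation.
    return max(0.0, THRESHOLD - output)
```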
Proposition P2 "The ETF applied to real-world-sized examples, in real-world test environments, is able to generate better test cases w.r.t. achieving the test goal than random testing."

Figures 4 and 6 show that the ETF generated a large set of problem-oriented test cases. Referring to Sect. 5.2 and dependent variable 2d, it is obvious that random testing was unable to achieve results comparable to those of the respective evolutionary engines. As expected, and in line with its usual capabilities, random testing gradually found slightly better solutions during optimization without reaching areas of the search space close to the optimal solution, let alone the global one. In contrast, the evolutionary engines, that is, ETF_Default, ETF_Manual, and ETF_Automated, showed solution-approaching behavior: after reaching interesting areas, they tried to approach the optimum by generating slightly different solutions, which may be described as highly problem-oriented due to their proximity to proper solutions already found.

Proposition P3 "The ETF is more effective in generating/finding error-revealing test cases when applied to real-world systems for black-box testing compared to random testing."

The SUTs chosen by the case study partners to investigate the scalability of evolutionary testing (see Sect. 4.2) are both taken from serial production developments and are meant to challenge the evolutionary testing technique. As such, finding errors was not necessarily likely in our studies; hence, we cannot evaluate this proposition within the context of our two case studies. We hope that repeated case studies in future work will be able to comment on this aspect. Nevertheless, although the ETF did not find any test cases that violate the considered requirements, the industrial partners were content with the outcome of the case studies: the ETF found test cases that came very close to violating safety-relevant properties, and such test cases can and will be reused for regression testing. Apart from that, in industry, testing is performed to gain confidence in the quality of developed systems. Within these case studies, the systems were tested against thousands of problem-oriented scenarios, which greatly increased the industrial partners' confidence in their systems (in terms of the considered requirements).

Proposition P4 "Automated parameter tuning improves the results of the search in terms of search effectiveness and efficiency."

Automated parameter tuning delivered encouraging results in terms of effectiveness and efficiency. On average, it provided results comparable to manual tuning, yet relieved the tester of the difficult, tedious task of manually specifying the evolutionary engine's parameters (around 0.5 person-months, see dependent variable 2e). Hence, although the effort of manually tuning the parameters is in general significantly lower than the effort of designing a fitness function and customizing the ETF, automated parameter tuning still makes a worthwhile contribution to the applicability of evolutionary testing in industry, since allocating human resources to tuning is far more expensive than allocating computational resources. However, it should be taken into account that if the execution of the SUT takes a long time (which is frequently the case with functional black-box testing), automated parameter tuning can consume a lot of computational resources.
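A minimal sketch of the idea behind automated parameter tuning is given below; it is not the tuning strategy implemented in the ETF, but it illustrates trading computational for human resources by letting a meta-level search evaluate candidate parameter settings on short evolutionary runs. All names, parameter ranges, and the random-search strategy are assumptions for illustration.

```python
# Hypothetical sketch of automated parameter tuning: a meta-level random
# search over evolutionary-engine parameters, scoring each candidate
# setting by the best (minimized) fitness a short evolutionary run achieves.
import random

def tune_parameters(run_short_ea, budget=20, seed=42):
    """run_short_ea: callable that executes a short evolutionary run with
    the given parameters and returns its best fitness (lower is better)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(budget):
        params = {
            "population_size": rng.choice([20, 50, 100]),
            "mutation_rate": rng.uniform(0.01, 0.3),
            "crossover_rate": rng.uniform(0.5, 1.0),
        }
        score = run_short_ea(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```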
Proposition P5 ‘‘The amount of time and effort it takes to configure the ETF in order to apply it to real-world systems for evolutionary functional testing is suitable within an industrial setting.’’
The amount of time and effort it takes to configure the already installed evolutionary testing framework in order to apply it to real-world systems for functional testing was found to be reasonable for the industrial settings studied. This includes the design and implementation of a suitable fitness function to be evaluated by the evolutionary algorithms. The creation of this fitness function is admittedly a time-consuming task for which specific evolutionary computation knowledge is required. Nonetheless, the industrial partners found this effort to be compensated by the large number of problem-oriented test cases generated and evaluated automatically afterward: as soon as the fitness function has been designed, the test data generation process can be carried out fully automatically, without any human effort, while the SUT is executed with an exceptionally large number of problem-specific and highly relevant test data, which would never be possible manually. As already indicated in Sect. 5.2, the two SUTs were executed with 900,000 and 150,000 different inputs, respectively.

Since time is always a crucial factor in industrial practice, the time necessary to complete the test data generation process for a certain test object and test goal is of major concern. The machine execution time for an evolutionary functional testing optimization run varies depending on the complexity of the SUT, the testbed used (e.g. real-time execution for hardware-in-the-loop testing), and the computational power of the hardware. In addition, the overall execution time is affected by the selected optimization parameters, the selected stopping criterion, the quality of the fitness function, and the difficulty of achieving the test goal. Because the execution time is hard to measure in a statistically sound manner, the experiment design uses the number of fitness function evaluations as an efficiency metric instead. Nevertheless, the real execution times were still measured (see "Generated test cases" in Sects. 5.2.1 and 5.2.2) and found to be reasonable by the industrial partners, particularly because the pure execution of evolutionary functional testing does not require any human interaction.

Some usability problems and ideas for further improvement that came out of the informal interviews were related to the need for more direct interaction between the user and the ETF. For example, the preference page provided to manually configure the evolutionary engine parameters could be more user-friendly. Moreover, some guidance on how to develop the fitness function would be very much appreciated.

6 Conclusion

This paper presented an empirical study investigating the scalability of the ETF's implementation of evolutionary functional testing within our specific automotive setting. Both industrial partners responded positively about the efficiency of the tool and found the results interesting. Automated parameter tuning can generally alleviate the need for users to have advanced evolutionary computation knowledge. However, in order to define and refine a suitable fitness function, it is necessary to have a certain level of evolutionary computation skill and to be able to devote a significant amount of time to this task. Although in some cases this effort is compensated by the results, it remains an important factor standing in the way of full acceptance of functional evolutionary testing in industry. For broader acceptance in an industrial setting, it is important to improve automated parameter tuning, as it offers a chance to enable inexperienced testers to use evolutionary testing and obtain valuable results. Lindlar et al. (2010) introduce an approach aimed at simplifying the process of designing the search space and fitness function for evolutionary functional testing, to further increase acceptability.

Moreover, the difficult task of designing a suitable fitness function could be supported by a wizard-based or question-and-answer approach. More advanced testers, on the other hand, will likely want the option of combining different test targets in one search, so-called multi-objective search (e.g. functional testing combined with response-time tests); this includes combinations of structural and functional testing as well. Different strategies for seeding test data (Arcuri et al. 2008) and the parallelization of tests (e.g. using multiple or distributed test targets) are also requested. In addition, a set of common development guidelines on when to use which optimization technique might further increase the industrial acceptance of evolutionary testing, for both average and advanced users. A real need exists in the embedded systems industry for guidelines recommending which test techniques apply to which testing objectives, testing levels, and system development phases; how these techniques contribute to the overall reliability and dependability of the embedded system; and how efficient and usable they are in application. To date, such guidelines, at least in a complete form, hardly exist for traditional testing techniques, and even less so for evolutionary testing. Because of this field's diversity, empirical studies are essential for laying the foundations for proper guidelines and hence for integrating them into a general test and quality assurance strategy for embedded systems. Having a central repository of exemplary embedded systems that can act as benchmarks for the evaluation of evolutionary testing techniques, for functional as well as non-functional properties, would also be helpful.

Acknowledgments This work is supported by EU grant IST-33472 (EvoTest). For their support and help, we would like to thank Mark Harman, Kiran Lakhotia and Youssef Hassoun from King's College London; Marc Schoenauer and Luis da Costa from INRIA; Jochen Hänsel from Fraunhofer FIRST; Dimitar Dimitrov and Ivaylo Spasov from RILA; and Dimitris Togias from European Dynamics.

Appendix A: Questions for informal interviews

Typical questions used for the informal interviews:

1. Was the installation and setup of the ETF as easy and quick as you would like it to be? What would you improve?
2. Would you recommend the use of the ETF to other testers you know? How?
3. Do you think that you could persuade your management to invest in a tool like the ETF?
4. Are there any additional functionalities you need in order for the tool to be suitable in your industrial context?
5. Do you have any other suggestions as to how the tools could be made more suitable for your industrial context?
6. Do you have any other comments, criticisms or suggestions relating to the usability (ease of use) of the tools?
7. Do you have confidence in the results of the ETF (perhaps compared to the testing techniques currently used)?

References

Description of evolution engine parameters. http://guide.gforge.inria.fr/eeparams/EEngineParameters.pdf. Last accessed April 19, 2011.
ETF user manual and cookbook. http://evotest.iti.upv.es. Last accessed April 13, 2011.
GUIDE. http://gforge.inria.fr/projects/guide/. Last accessed April 13, 2011.
Evotest. http://evotest.iti.upv.es (2006). Last accessed April 13, 2011.
Arcuri, A., White, D. R., Clark, J., & Yao, X. (2008). Multi-objective improvement of software using co-evolution and smart seeding. In X. Li, M. Kirley, M. Zhang, D. G. Green, V. Ciesielski, H. A. Abbass, Z. Michalewicz, T. Hendtlass, K. Deb, K. C. Tan, J. Branke, & Y. Shi (Eds.), Proceedings of the 7th international conference on simulated evolution and learning (SEAL '08), LNCS (Vol. 5361, pp. 61–70). Melbourne, Australia: Springer.
Baresel, A., Pohlheim, H., & Sadeghipour, S. (2003). Structural and functional sequence test of dynamic and state-based software with evolutionary algorithms. In GECCO (pp. 2428–2441).
Beizer, B. (1990). Software testing techniques. London: International Thomson Computer Press.
Briand, L. C. (2007). A critical analysis of empirical research in software testing. In First international symposium on empirical software engineering and measurement (ESEM 2007) (pp. 1–8).
Bühler, O., & Wegener, J. (2004). Automatic testing of an autonomous parking system using evolutionary computation. In Proceedings of SAE 2004 world congress (pp. 115–122).
Bühler, O., & Wegener, J. (2008). Evolutionary functional testing. Computers & Operations Research, 35(10), 3144–3160.
Chan, B., Denzinger, J., Gates, D., Loose, K., & Buchanan, J. (2004). Evolutionary behaviour testing of commercial computer games. In Proceedings of CEC 2004, Portland (pp. 125–132).
DaCosta, L., Fialho, A., Schoenauer, M., & Sebag, M. (2008). Adaptive operator selection with dynamic multi-armed bandits. In Proceedings of the 10th annual conference on genetic and evolutionary computation (GECCO '08) (pp. 913–920). New York, NY: ACM. http://doi.acm.org/10.1145/1389095.1389272.
Fewster, M., & Graham, D. (1999). Software test automation: Effective use of test execution tools. New York, NY: ACM Press/Addison-Wesley Publishing Co.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Boston: Addison-Wesley.
Grochtmann, M., & Wegener, J. (1998). Evolutionary testing of temporal correctness. In Proceedings of the 2nd international software quality week Europe (QWE 1998). Brussels, Belgium.
Gros, H. G. (2003). Evaluation of dynamic, optimisation-based worst-case execution time analysis. In Proceedings of the international conference on information technology: Prospects and challenges in the 21st century (Vol. 1, pp. 8–14).
Gross, H., Kruse, P. M., Wegener, J., & Vos, T. (2009). Evolutionary white-box software test with the EvoTest framework: A progress report. In ICSTW '09: Proceedings of the IEEE international conference on software testing, verification, and validation workshops (pp. 111–120). Washington, DC: IEEE Computer Society.
Harman, M., Hu, L., Hierons, R., Baresel, A., & Sthamer, H. (2002). Improving evolutionary testing by flag removal. In Proceedings of the genetic and evolutionary computation conference (GECCO 2002) (pp. 1233–1240). New York: Morgan Kaufmann.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Jones, B., Sthamer, H., & Eyres, D. (1996). Automatic structural testing using genetic algorithms. The Software Engineering Journal, 11(5), 299–306.
Juristo, N., Moreno, A., & Vegas, S. (2004). Reviewing 25 years of testing technique experiments. Journal of Empirical Software Engineering, 9(1), 7–44.
Keijzer, M., Merelo, J. J., Romero, G., & Schoenauer, M. (2001). Evolving objects: A general purpose evolutionary computation library. In Artificial evolution (pp. 231–244). http://citeseer.ist.psu.edu/keijzer01evolving.html.
Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., Emam, K. E., et al. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721–734.
Klimke, A. (2003). How to access Matlab from Java, IANS report 2003/005. Tech. rep., University of Stuttgart. http://preprints.ians.uni-stuttgart.de.
Kruse, P. M., Wegener, J., & Wappler, S. (2009). A highly configurable test system for evolutionary black-box testing of embedded systems. In GECCO '09: Proceedings of the 11th annual conference on genetic and evolutionary computation (pp. 1545–1552). New York, NY: ACM. http://doi.acm.org/10.1145/1569901.1570108.
Lethbridge, T. C., Sim, S. E., & Singer, J. (2005). Studying software engineers: Data collection techniques for software field studies. Empirical Software Engineering, 10(3), 311–341.
Lindlar, F., Windisch, A., & Wegener, J. (2010). Integrating model-based testing with evolutionary functional testing. In Proceedings of the 3rd international conference on software testing, verification, and validation workshops (ICSTW 2010) (pp. 163–172). Washington, DC: IEEE Computer Society.
McMinn, P. (2004). Search-based software test data generation: A survey. Software Testing, Verification and Reliability, 14(2), 105–156.
McMinn, P. (2011). Search-based software testing: Past, present and future. In Proceedings of the 4th international workshop on search-based software testing (SBST 2011).
Messina. http://www.berner-mattner.com/en/automotive-messina.php. Last accessed Feb 3, 2010.
Mueller, F., & Wegener, J. (1998). A comparison of static analysis and evolutionary testing for the verification of timing constraints. In RTAS '98: Proceedings of the 4th IEEE real-time technology and applications symposium (p. 144). Washington, DC: IEEE Computer Society.
Pargas, R. P., Harrold, M. J., & Peck, R. R. (1999). Test-data generation using genetic algorithms. Journal of Software Testing, Verification and Reliability, 9(4), 263–282.
Perry, D. E., Porter, A. A., & Votta, L. G. (2000). Empirical studies of software engineering: A roadmap. In ICSE '00: Proceedings of the conference on the future of software engineering (pp. 345–355). ACM.
Perry, D. E., Sim, S. E., & Easterbrook, S. (2005). Case studies for software engineers. In SEW '05: Proceedings of the 29th annual IEEE/NASA software engineering workshop (tutorial notes) (pp. 96–159). Washington, DC: IEEE Computer Society.
Pohlheim, H. (2000). Evolutionäre Algorithmen: Verfahren, Operatoren und Hinweise für die Praxis. Berlin, Heidelberg: Springer.
Sthamer, H., & Wegener, J. (2002). Using evolutionary testing to improve efficiency and quality in software testing. In Proceedings of the 2nd Asia-Pacific conference on software testing.
Tlili, M., Sthamer, H., Wappler, S., & Wegener, J. (2006). Improving evolutionary real-time testing by seeding structural test data. In Proceedings of the congress on evolutionary computation (CEC) (pp. 3227–3233). IEEE.
Tlili, M., Wappler, S., Sthamer, H., & Wegener, J. (2006). Improving evolutionary real-time testing. In Proceedings of the 8th annual conference on genetic and evolutionary computation (GECCO) (pp. 1917–1924). New York: ACM Press.
Tracey, N., Clark, J., Mander, K., & McDermid, J. (2000). Automated test-data generation for exception conditions. Software: Practice and Experience, 30(1), 61–79.
Vos, T., Baars, A., Lindlar, F., Kruse, P., Windisch, A., & Wegener, J. (2010). Industrial scaled automated structural testing with the evolutionary testing tool. In Proceedings of the 3rd international conference on software testing, verification and validation (ICST 2010), Paris, France (pp. 175–184). IEEE Computer Society.
Wegener, J., Buhr, K., & Pohlheim, H. (2002). Automatic test data generation for structural testing of embedded software systems by evolutionary testing. In GECCO '02: Proceedings of the genetic and evolutionary computation conference (pp. 1233–1240). San Francisco, CA: Morgan Kaufmann Publishers Inc.
Wegener, J., Grimm, K., Grochtmann, M., Sthamer, H., & Jones, B. (1996). Systematic testing of real-time systems. In Proceedings of the 4th European international conference on software testing, analysis and review. Amsterdam, The Netherlands.
Windisch, A., & Al Moubayed, N. (2009). Signal generation for search-based testing of continuous systems. In Proceedings of the 2nd international conference on software testing, verification, and validation workshops (pp. 121–130). Washington, DC: IEEE Computer Society.
Windisch, A., Lindlar, F., Topuz, S., & Wappler, S. (2009). Evolutionary functional testing of continuous control systems. In GECCO '09: Proceedings of the 11th annual conference on genetic and evolutionary computation (pp. 1943–1944). New York, NY: ACM.

Author Biographies

Tanja E. J. Vos studied computer science at the University of Utrecht (The Netherlands) and obtained her PhD on formal verification in 2000 at the same university. She has more than 10 years of experience with formal methods and software testing. She is a lecturer at the Computation and Information Systems Department (DSIC) of the Technical University of Valencia (Spain) and carries out her research at the Center for Software Production Methods (ProS), where she leads the Software Testing & Quality (STaQ) group. She is involved in many research projects on software testing in an industrial setting. She successfully coordinated the EU-funded EvoTest project from 2006 to 2009 and is currently coordinating the EU-funded FITTEST project on Future Internet Testing.

Felix F. Lindlar received his computer engineering degree from the Berlin Institute of Technology, Germany, in 2007. Since the beginning of 2008, he has been working as a Ph.D. student at the Daimler Center for Automotive IT Innovations. The focus of his research is in the domain of testing, in particular model-based testing and search-based testing of embedded system software. He worked on the EU-funded project EvoTest and has published several research papers in the field of search-based testing. In addition, he is managing testing activities at Mercedes-Benz Trucks for next-generation driver assistance systems.

Benjamin Wilmes received his diploma degree in computer science from the Berlin Institute of Technology (TU Berlin), Germany, in 2010. Since 2009, he has been working for the Daimler Center for Automotive IT Innovations, where he is currently pursuing his PhD degree in close collaboration with Daimler AG and the Software Engineering Department of TU Berlin. His research interests include industry-oriented approaches to software test automation, particularly the application of search-based testing.

Andreas Windisch has been working in the field of search-based testing for more than four years. Owing to his close collaboration with Daimler AG, he has outstanding expertise in search-based (evolutionary) testing with regard to its industrial applicability, which has resulted in several publications as well as several master's theses (his own and supervised ones). He worked on the EU-funded project EvoTest.

Arthur I. Baars studied computer science at the University of Utrecht (The Netherlands) and obtained his PhD on embedded compilers in 2009 at the same university. He is now a researcher at the Center for Software Production Methods (ProS) in the Software Testing & Quality (STaQ) group. He participated in the EU-funded project EvoTest and has published several research papers in the field of search-based testing. Currently, he is working on the EU-funded FITTEST project on Future Internet Testing.

Peter M. Kruse is a software engineer working in the domain of testing, including evolutionary testing and the classification tree method. He is an experienced software developer and tester in the German automotive industry. Peter's project experience includes Hardware-in-the-Loop (HiL) testing, model-driven development (MDD), and evolutionary structural and functional testing. He is responsible for the development of CTE XL, a very popular test design tool.

Hamilton Gross received his electrical engineering degree from the University of Bristol, England. He is currently working at Berner & Mattner in Berlin in the field of search-based testing.

Joachim Wegener studied computer science at the Technical University Berlin and obtained his PhD on the evolutionary testing of real-time systems at the Humboldt University of Berlin. This work gained him the Software Engineering Prize 2002, awarded by the Ernst Denert Foundation. Dr. Wegener is the local representative of Berner & Mattner in Berlin, where he leads the automotive department. He previously worked for Daimler AG, where he led the development of the world's first industrial evolutionary testing system. Joachim Wegener is a pioneer of search-based testing and was the first program chair of the GECCO Search-Based Software Engineering track. Furthermore, he played a central role in the development of the test system TESSY, the classification-tree editor CTE, and the Time Partition Testing tool TPT. He manages the "Embedded Systems Testing" research group of the German Informatics Society and is a member of the industrial advisory board of King's College.
