Constraint-based Evolutionary Testing of Autonomous Distributed Systems

Cu D. Nguyen, Anna Perini and Paolo Tonella
Fondazione Bruno Kessler
Via Sommarive 18, 38050 Trento, Italy
{cunduy, perini, tonella}@itc.it

Abstract

Distributed software systems are characterized by increasing autonomy. They often have the capability to sense the environment and react to it, discover the presence of other systems and take advantage of their services, and adapt and re-configure themselves in accordance with their internal as well as the global state. Testing this kind of system is challenging, and systematic and automated approaches are still missing. We propose a novel evolutionary testing framework for autonomous distributed systems, in which test cases are continuously generated and executed. Our current implementation of the framework provides two techniques for the automated, continuous generation of test cases: (1) random; (2) evolutionary-mutation. Preliminary experimental results, obtained on a case study, are encouraging and indicate that evolutionary testing can effectively complement manual testing.

1 Introduction

The availability of recognized and widely used platforms for distributed software programming is promoting the development of systems with increasing levels of autonomy and complexity. In the area of agent-oriented software development, mature agent platforms (e.g., JADE) are now available, based on open standards like FIPA. In the service-oriented world, developers can choose among a number of commercial, as well as free, enterprise service buses, built on top of the SOAP standard. Agents and services developed on such platforms can implement advanced functionalities that give them autonomy of behavior.

The capability to monitor what is happening in the surrounding environment gives the system the possibility to react to changes. On the basis of an interpretation of the information collected while monitoring, the system can reconfigure itself, e.g., choosing a variant of a given service, so as to adapt to the environment. The capability to discover the appearance of new systems providing useful functionalities (services) may trigger actions to integrate the invocation of such functionalities. Moreover, by interacting through asynchronous message passing, systems may share information and give rise to forms of cooperation, useful while executing complex tasks. We call ADS (Autonomous Distributed Systems) those systems which exhibit the properties described above, regardless of their agent or service-oriented nature¹. Whenever possible, we make no specific assumption on the underlying development and execution platform.

Testing ADS is a challenging task. All troublesome issues related to distributed software testing [20] apply to ADS as well. In addition, the autonomous behavior of ADS makes it more difficult to specify the correct behavior expected from these systems, and makes it hard to consider all relevant scenarios associated with behavior adaptation and reconfiguration. Moreover, the integration of external systems, which can be discovered and replaced dynamically, represents another complex source of variation that should be considered during testing [4]. Successive tests with the same test data may give different results [19] because of self-adaptation or learning. Only a few studies have addressed the issues of multi-agent system testing [2, 5] and of service-oriented testing [3, 17, 10].

Autonomy is a crucial property of ADS. In this paper, we approach autonomy testing by exploiting constraint enforcement. The ADS under test is free to evolve and to behave autonomously, but it needs to respect some constraints imposed on its behavior. Once a constraint is violated during continuous testing, a fault is reported to the development team. For example, an agent (an ADS) needs to move from one location to another, but it has to avoid any obstacles encountered; or a service (another ADS) may decide to sell a given good, but the final price must be greater than a specific amount.

In this paper, we propose a framework for the continuous testing of ADS that complements, but does not replace, the manual creation of test suites. Through continuous testing it is possible to exercise the system under test for a long time, since the testing process is fully automated and can proceed unattended. We take advantage of such long testing time to explore the space of possible behaviors of an ADS more extensively. Over time, we may observe self-adaptation, discovery of new functionalities, cooperation, and changes in the environment. Key to this approach is the capability to generate test cases automatically, so as to exercise as many execution conditions as possible.

In our framework, two new autonomous distributed programs are introduced, respectively called the Tester Agent (TA) and the Monitoring Agent (MA). The MA observes the behavior of the ADS under test and signals any deviation from the expected behavior. The TA stimulates the ADS under test by sending it messages that aim at exploring behavior conditions not yet considered and potentially fault exposing. TA data are based on the current state of the ADS under test as well as the surrounding environment.

We present two experiments with the proposed framework, which target respectively agent-oriented and service-oriented systems. The same case study was considered in both variants, providing preliminary results on the general viability and potential effectiveness of the approach. In turn, each implementation can use either random or evolutionary-mutation test case generation during the continuous testing of the ADS.

The paper is organized as follows: the next section introduces related work. Our approach is presented in Section 3. Then, Section 4 describes our framework for automated continuous testing of ADS. Section 5 discusses some experimental results. Finally, Section 6 summarizes the main outcomes of this work and anticipates our future investigation.

¹ The notion of autonomy has been largely studied in the Multi-Agent System community, leading to formal definitions and to requirements for the agent's internal architecture [12].

2 Related Work

Di Penta et al. [17] proposed a search-based approach for testing service level agreements (SLA), i.e., QoS constraints between service providers and consumers. A genetic algorithm (GA) was used to explore the search space for cases that violate the SLA. We share with this work the use of a GA and of monitoring. However, we open our framework to other kinds of constraints than SLA, and we use mutation testing to determine the fitness value used to evolve the test cases.

Rouff [19] discusses the challenges involved in testing single agents and agent communities. He proposed a special tester agent, used to test the other agents individually or within the community they belong to. We share with Rouff the natural choice of testing a MAS using an agent, and we go even further by separating the testing responsibility from the monitoring responsibility, the latter being assigned to monitoring agents. This makes our framework easily applicable to distributed MAS. Differently from Rouff, our final aim is to have a continuously running tester agent which autonomously generates new test cases.

Dikenelli et al. [22] proposed a test-driven MAS development approach that supports iterative and incremental MAS construction. A testing framework, built on top of JUnit and Seagent [7], is used to support the approach. The framework allows writing automated tests for agent behaviors and for interactions between agents. Similarly, Coelho et al. [5] introduced an approach for MAS unit testing built on top of JUnit and JADE [21]. Both approaches involve mock agents, which simulate real agents, to interact with the agents under test. The same choice applies to our testing framework whenever it is not practical or even possible to consider the agent under test in its real execution environment, because the other involved agents or services cannot be accessed for testing purposes only.

The main contribution of the present work to the state of the art in agent and service testing is threefold: (1) we address the autonomy of agents and services by means of constraint enforcement; (2) we propose a novel test case generation technique, tailored for ADS testing; (3) we provide a freely-available implementation of the proposed framework.

3 Continuous Testing of Autonomous Distributed Systems

The specific features of ADS (monitoring, self-adaptation, discovery, cooperation) call for a framework that supports extensive and possibly automated testing. We propose to complement manually derived test cases with automatically generated ones, which are continuously run in order to reveal errors associated with conditions that are hard to simulate and reproduce.

Testing of an ADS can be achieved by means of a dedicated program, the TA agent, which continuously interacts with the system under test, and of the monitoring agent MA, which checks the state of the ADS. Since ADS communicate primarily through message passing, the TA can send messages aimed at triggering some behavior that can potentially lead to fault discovery. The messages sent by the TA are continuously generated according to the algorithms described below. Such algorithms require a seed (i.e., an initial test suite), which can be either a test suite manually derived from the specifications (e.g., from goal diagrams, following the goal-oriented testing methodology [14]) or generated randomly. It is then the MA's responsibility to observe the reactions to the messages sent by the TA and, in case these are not compliant with the expected behavior (pre-/post-conditions violated) or the ADS crashes, to inform the development team that a fault was revealed.

Since the behavior of an ADS can change over time, a single execution of a test suite might be inadequate to reveal faults. Usage of the autonomous distributed program TA allows for an arbitrary extension of the testing time, which can proceed unattended and independently of any other human-intensive activity. Continuous testing of an ADS requires that the TA has the capability to evolve existing test suites and to generate new ones, with the aim of exercising and stressing the application as much as possible, the final goal being the possibility to reveal yet unknown faults. We propose a framework for continuous testing of ADS, called eCAT (environment for the Continuous ADS Testing), which includes the TA and MA. We consider two automated test case generation techniques for the TA: random generation and evolutionary mutation generation (called evol-mutation from now on).

3.1 Random Testing

The TA is capable of generating random test cases from scratch or based on existing manually-created test cases, following the random test data generation strategy [13]. To generate test cases from scratch, the TA first selects a communication protocol among those provided by the given platform (e.g., the FIPA Interaction Protocols [8] in JADE [21]). Then, messages are randomly generated and sent to the ADS under test. In order to insert meaningful data into the messages, a model of the domain data, coming from the business domain of the ADS under test, must also be supplied. The message format is that prescribed by the distributed platform of choice (such as the FIPA ACLMessage [9]), while the content is constrained by a domain data model (DDM). The DDM prescribes the range and the structure of the data that are produced randomly, either by means of generation rules or in the (simpler) form of sets of admissible data that are sampled randomly.

Alternatively, the TA can reuse existing test cases, which provide testing scenarios or communication protocols tailored to the system under test, to generate new test cases. To achieve this, the TA simply selects an existing test case and substitutes specified messages with randomly-generated ones. Again, a model of the domain data is useful to provide meaningful data. Randomly generated messages are then sent to the ADS under test, and it is the MA's responsibility to observe the reactions, i.e., communications, exceptions, etc. happening in the ADS. When a deviation from the expected behavior is found (pre-/post-condition violated or crash), it is reported to the development team.

The main limitation of random testing of an ADS is that long and meaningful interaction sequences are hardly generated randomly. However, when the interaction protocol needs only one trigger message, random testing is a cheap and efficient technique that can reveal faults. Evidence of this is provided in the experimental results section. For the generation of longer sequences that increase the likelihood of revealing faults, more sophisticated techniques need to be used, such as evol-mutation, described below.
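To make the generation step concrete, the following is a minimal sketch of how a TA might assemble one random test message. It is a sketch under simplifying assumptions: the DDM is reduced to a plain list of admissible content strings, and RandomMessageFactory and its next() method are hypothetical names, while ACLMessage and AID belong to the real JADE API.

import jade.core.AID;
import jade.lang.acl.ACLMessage;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: build one random test message for the agent under test.
public class RandomMessageFactory {
    private static final int[] PERFORMATIVES = {
        ACLMessage.REQUEST, ACLMessage.QUERY_REF,
        ACLMessage.PROPOSE, ACLMessage.SUBSCRIBE
    };

    private final Random rnd = new Random();
    private final List<String> admissibleContents; // stand-in for DDM sampling

    public RandomMessageFactory(List<String> admissibleContents) {
        this.admissibleContents = admissibleContents;
    }

    // Pick a random performative and random admissible content,
    // addressed to the agent under test.
    public ACLMessage next(String receiverLocalName) {
        ACLMessage msg = new ACLMessage(PERFORMATIVES[rnd.nextInt(PERFORMATIVES.length)]);
        msg.addReceiver(new AID(receiverLocalName, AID.ISLOCALNAME));
        msg.setContent(admissibleContents.get(rnd.nextInt(admissibleContents.size())));
        return msg;
    }
}

In eCAT, the sampled content would instead be produced by the DDM's generation rules, and the message would be embedded in the selected interaction protocol rather than sent in isolation.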

3.2 Evolutionary Mutation Testing

Mutation testing [6, 11] is a way to assess the adequacy of a test suite and to improve it. Mutation operators are applied to the original program (i.e., the agents under test) in order to artificially introduce known defects. The changed version of the program is called a mutant. For example, a mutant could be created by modifying branch conditions, e.g., the following JADE code [21]:

if (msg.getPerformative() == ACLMessage.REQUEST)

can be changed into

if (msg.getPerformative() == ACLMessage.REQUEST_WHEN)

or a mutant can be created by modifying a method invocation (e.g., receive() changed into blockingReceive()). Mutation operators for object-oriented programming languages have been studied, e.g., in [23], while they are still an open research topic for the agent-oriented community. In this paper, we adopt object-oriented mutation operators, since the programming language used by the JADE [21] and JADEX [18] platforms is Java. Agent-oriented mutation operators will be investigated in our future work.

A test case is able to reveal the artificial defects seeded into a mutant if the output of its execution on the mutant deviates from the output of its execution on the original program. In such a case, the mutant is said to have been killed. The adequacy of a test suite is measured as the ratio of all the killed mutants over all the mutants generated. When such a ratio is low, the test suite is considered inadequate and more test cases are added to increase its capability of revealing the artificially injected faults, under the assumption that this will also lead to revealing the "true" faults.

Evolutionary testing [16] is based on the possibility to evolve test suites by applying mutation operators to the test cases themselves. In order to guide the evolution towards better test suites, a fitness measure is defined as a heuristic approximation of the distance from achieving the testing goal (e.g., covering all statements or all branches in the program). Test cases with higher fitness values are more likely to be selected for evolution when a test suite is transformed into the next one.

We propose to use a combination of mutation and evolutionary testing for the automated generation of the test cases executed by the TA in a given ADS. Intuitively, we use the mutation adequacy score as a fitness measure to guide evolution, under the hypothesis that test suites that are better at killing mutants are also likely to be better at revealing real faults. The proposed technique consists of the following four steps.

Step 0: Preparation. Given the ADS under test S, we apply mutation operators to S to produce a set of mutants {S1, S2, ..., SN}. One or more mutations are applied to one or more (randomly chosen) modules of S.

Step 1: Test execution and adequacy measurement. The TA executes the test cases {TC1, TC2, ..., TCn} on all the mutants. Initially, test cases can be randomly generated or they can be those derived from goal analysis by the user. The TA then computes the adequacy of each test case (its fitness value) as F(TCi) = Ki / N, where Ki is the number of mutants killed by TCi. To increase performance, the executions of the test cases on the mutants are performed in parallel (e.g., on a cluster of computers, with one mutant per node).

The basic mechanisms used to evolve a given test case are mutation and crossover. Mutation consists of a random change of the data used in the messages exchanged in a test case, similarly to the random generation described above. A good test case (according to the fitness value) is selected, one of its messages is chosen randomly, and the content of the message is then changed randomly. Crossover consists of the combination of two test cases. Two good test cases are chosen. Some data in the second test case replace the data used in the first one, or an entire sequence of messages is taken from the second test case and is appended at the end of the first test case, possibly after truncating its message sequence at a randomly selected point.

Step 2: Test case evolution. The procedure for generating new test cases is as follows:

1: Select randomly whether to apply mutation or crossover
2: if crossover is chosen then
3:   Select two test cases (i, j) with probability proportional to F(TCi), F(TCj)
4:   Apply crossover to TCi and TCj
5: else
6:   Select a test case i with probability proportional to F(TCi)
7:   Apply mutation to TCi
8: end if
9: Add the new test cases to the new set of test cases
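As an illustration of Step 2, the following is a minimal sketch of fitness-proportional selection plus the two evolution operators. It is a sketch under simplifying assumptions: test cases are reduced to lists of strings (eCAT operates on ACL messages), sequences are assumed non-empty, and all class and method names are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of one Step-2 iteration over string-encoded messages.
public class EvolutionStep {
    private final Random rnd = new Random();

    // Roulette-wheel selection: probability of picking test case i
    // is F(TCi) / sum over all test cases of F.
    int select(double[] fitness) {
        double total = 0;
        for (double f : fitness) total += f;
        double r = rnd.nextDouble() * total;
        for (int i = 0; i < fitness.length; i++) {
            r -= fitness[i];
            if (r <= 0) return i;
        }
        return fitness.length - 1;
    }

    // Crossover: truncate the first sequence at a random point and
    // append a tail of the second sequence.
    List<String> crossover(List<String> tc1, List<String> tc2) {
        List<String> child = new ArrayList<>(tc1.subList(0, 1 + rnd.nextInt(tc1.size())));
        child.addAll(tc2.subList(rnd.nextInt(tc2.size()), tc2.size()));
        return child;
    }

    // Mutation: replace one randomly chosen message with one sampled
    // from the message database described below.
    List<String> mutate(List<String> tc, List<String> messageDb) {
        List<String> child = new ArrayList<>(tc);
        child.set(rnd.nextInt(child.size()), messageDb.get(rnd.nextInt(messageDb.size())));
        return child;
    }
}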

The encoding of test cases for the evolutionary algorithm is as follows. Since each test case specifies a test scenario that contains a sequence of messages, test case TCi is encoded as {Msg_i1, Msg_i2, ..., Msg_i_ni}, where ni is the number of messages specified in TCi. Crossover and mutation are realized by operating on these messages. In particular, mutation requires a random modification of a message. This is achieved by resorting to a database of messages, built by collecting all messages from the initial test cases and enriched with domain data, as in random testing. Moreover, the database is gradually enriched during testing with messages returned by the MAS under test and its mutants. Random sampling of this database is used to produce test case mutants. The size and diversity of the messages in the database are crucial properties that determine the ability of mutated test cases to reveal faults. This is the reason why we continuously grow the database as testing proceeds, by capturing and storing the exchanged messages.

Step 3: Decision.

1: if number of generations > max number of generations then
2:   DONE
3: else
4:   if max F(TCi) did not improve over the last L generations then
5:     Create new mutants for the next generation {substitute the current set of mutants with a new one, to increase the diversity of mutants (i.e., injected faults)}
6:   end if
7:   Goto Step 1
8: end if

The algorithm stops when the number of generations exceeds a given maximum. Otherwise, we go back to Step 1 and keep on testing continuously. When no improvement of the fitness values is observed for a number of evolutionary iterations, Step 0 (Preparation) is repeated and a new set of mutants, generated by different mutation operators, is produced, so that test cases are assessed on a different set of artificial defects. In fact, the occurrence of no progress for some time may indicate either that the residual mutations are too hard (maybe impossible) to reveal, or that all mutations are easily revealed by the current population of test cases. Hence, the time has come to change mutants.

As with random generation, each time the MA observes a deviation from the expected behavior, the development team is informed through a bug report submission. The mutated (or randomly-generated) test cases and message content can be valid or invalid with respect to the agents under test. However, both are useful to reveal faults, i.e., to make agents misbehave. Further discussion about the legality of message content and test cases, using ontologies, can be found in [15].
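The outer loop tying Steps 0-3 together can be summarized by the following sketch. TestBench and its methods are hypothetical stand-ins for the TA's execution machinery, and the concrete values of maxGenerations and the stagnation window L are illustrative only.

// Hypothetical sketch of the continuous-testing loop (Step 3): stop after
// maxGenerations, and refresh the mutant set when the best fitness has not
// improved for L consecutive generations.
public class ContinuousTestingLoop {
    int maxGenerations = 1000;
    int stagnationLimit = 10; // the "L" of Step 3

    void run(TestBench bench) {
        double bestSoFar = 0;
        int stagnant = 0;
        bench.createMutants(); // Step 0
        for (int gen = 0; gen < maxGenerations; gen++) {
            double best = bench.executeAndMeasureFitness(); // Step 1: F(TCi) = Ki / N
            if (best > bestSoFar) { bestSoFar = best; stagnant = 0; } else { stagnant++; }
            if (stagnant >= stagnationLimit) {
                bench.createMutants(); // new artificial defects restore diversity
                stagnant = 0;
            }
            bench.evolveTestCases(); // Step 2
        }
    }

    // Hypothetical interface standing in for the TA's test-execution machinery.
    interface TestBench {
        void createMutants();
        double executeAndMeasureFitness();
        void evolveTestCases();
    }
}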

3.3 Constraints as Test Oracles

The behavior of an autonomous ADS can change over time. This makes the evaluation of test results a non-trivial task. In many cases, it is impossible to give a verdict on a test case based on the comparison of the returned message with a gold standard, because the returned message may be different, even for the same input, at different times. We propose to use constraints that restrict the behaviors of an ADS as test verdicts. The underlying idea is that as soon as a constraint is violated, the MA reports the violation to the TA, which then gives the final verdict on the test case under execution.

Behavioral constraints are specified in terms of pre-, post-, and invariant conditions. We specify those conditions using the Object Constraint Language (OCL [1]) and generate monitoring guards (to check for constraint violations) automatically, using a tool called OCL4Java (http://www.ocl4java.org) and its user-defined handler. The following is an example of a pre-/post-condition specified in OCL, which requires item to be not empty and ensures that, after updating, the size of bibDB has increased by one:

@Constraint("pre: item->notEmpty\n" +
            "post: bibDB->size() = bibDB@pre->size() + 1")
public synchronized String updateBib(String item) {
    ....
}
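The paper does not show the guard code that OCL4Java generates, but its effect can be approximated by the hand-written sketch below. BibStore, report(), and the internal list are hypothetical; in eCAT, a violation would be signalled to the MA rather than printed.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a guard equivalent to the OCL annotation above:
// the pre-condition is checked on entry, bibDB@pre is captured, and the
// post-condition is checked on exit.
public class BibStore {
    private final List<String> bibDB = new ArrayList<>();

    public synchronized String updateBibGuarded(String item) {
        if (item == null || item.isEmpty())            // pre: item->notEmpty
            report("pre-condition violated: item is empty");
        int sizeAtPre = bibDB.size();                  // bibDB@pre->size()
        String result = updateBib(item);               // the guarded operation
        if (bibDB.size() != sizeAtPre + 1)             // post: size grew by one
            report("post-condition violated: bibDB did not grow by 1");
        return result;
    }

    private String updateBib(String item) { bibDB.add(item); return "ok"; }

    private void report(String msg) {                  // stand-in for the MA callback
        System.err.println("[MA] " + msg);
    }
}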

4 eCAT Framework

We developed a testing framework, called eCAT (http://sra.fbk.eu/people/cunduy/ecat), that implements our method for automated continuous testing of ADS. The framework facilitates test suite derivation from goal analysis, following the goal-oriented testing methodology introduced in [14]. The framework also provides GUIs to help human testers specify test inputs, scenarios, and expected outputs where applicable. More importantly, eCAT can evolve and generate test inputs by the evol-mutation or random testing techniques described in the previous section, and can run these test inputs continuously to extensively test the ADS.

Fig. 1 depicts the high-level architecture of eCAT, which consists of three main components: the Test Suite Editor, allowing human testers to derive test suites from analysis diagrams; the Tester Agent (TA), capable of automatically generating new test cases and executing them on an ADS; and the Monitoring Agent (MA), which monitors communication, constraints, and all events happening in the execution environment in order to trace and report errors. Multiple Remote MAs can be deployed, one for each environment of the ADS under test. For instance, an ADS under test may span two geographically distinct environments: one on a mobile phone, the other on a Web server. Two Remote MAs would then be deployed, one per environment. All the Remote MAs are under the control of a central MA, which is located on the same host as the TA. Together, the monitoring agents provide a global view of what is going on during testing and help the TA evaluate the ADS under test as well as the behavior of its mutants. In particular, the roles of the monitoring agents are threefold: (i) monitoring events and interactions taking place inside the ADS and in its environment; (ii) observing the ADS operations to check for constraint violations; (iii) providing execution traces of the ADS under test to the TA. Conditions are used to judge the operation of the ADS and to determine the test result correspondingly.

Figure 1. eCAT framework

The current version of eCAT is implemented as an Eclipse plug-in and uses JADE [21] agents. The TA can interact directly with, and eventually test, the ADS if this is implemented as a set of JADE or JADEX [18] agents. In case the ADS is implemented as a set of Web services, JADE agents that act as proxies are produced. They allow the TA to invoke and test the Web services. To test a Web service, the TA sends messages to the corresponding proxy agent. This agent invokes the service under test and sends the result back to the TA. Table 1 shows the testing techniques implemented in eCAT and their descriptions.

Table 1. Testing techniques supported by eCAT

Name  Description
G     Goal-oriented, manual test case creation. One derivation of G, called G+, contains all test cases of G plus some extra test cases created to satisfy statement coverage. G+ is discussed in more detail in Section 5.
R     Test cases are generated randomly from scratch.
R+    Test cases are generated randomly, based on manually-created test cases (based on G).
M     Test cases are generated by evolution; the initial population of test cases is generated randomly from scratch.
M+    Test cases are generated by evolution; the initial population of test cases is generated from G.
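As a sketch of how the proxy agents described above bridge the TA and a Web service, consider the following. The CyclicBehaviour/ACLMessage usage follows the standard JADE API, while invokeWebService() is a hypothetical placeholder for the generated JAX-WS client call.

import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

// Hypothetical sketch of a proxy agent: it receives test messages from the
// TA, forwards their content to the Web service under test over SOAP/HTTP,
// and replies with the result.
public class WebServiceProxyAgent extends Agent {
    @Override
    protected void setup() {
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage request = myAgent.receive();
                if (request == null) { block(); return; }
                ACLMessage reply = request.createReply();
                reply.setPerformative(ACLMessage.INFORM);
                reply.setContent(invokeWebService(request.getContent()));
                myAgent.send(reply);
            }
        });
    }

    // Stand-in for the generated JAX-WS port call to the service's operation.
    private String invokeWebService(String content) {
        return "<result/>"; // placeholder
    }
}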

5 Experimental Results

This section describes the experimental results obtained when using eCAT to test BibFinder. First, we introduce the features of BibFinder, the ADS under test. Then, we describe the different testing techniques applied to BibFinder, the testing results, and our evaluation.

BibFinder is an ADS for the retrieval and exchange of bibliographic information in BibTeX format (http://www.ecst.csuchico.edu/~jacobsd/bib). The features of BibFinder are: (1) scan the local drives of the host machine where it runs, to search for and consolidate bibliographic data in BibTeX format; (2) exchange bibliographic information with other deployments of BibFinder, in a peer-to-peer manner, thus augmenting its search capability with those provided by other peer systems; (3) perform searches on and extract BibTeX data from the Scientific Literature Digital Library (http://citeseer.ist.psu.edu), exploiting the Google search Web service (http://code.google.com/apis/soapsearch). BibFinder exhibits autonomic behavior whenever it searches for bibliographic items, since it decides whether to scan the local drives, to ask another peer for the searched information, or to search the Internet directly (via the Google Web service).

BibFinder can be compiled either as a Multi-Agent System (MAS) or as a Service-Oriented Application (SOA). The MAS version of BibFinder was designed and developed first, specifically on the JADE [21] platform. Then, it was redesigned to operate on the JAX-WS Web service platform (http://java.sun.com/webservices). Detailed specifications and design of BibFinder are given at http://sra.fbk.eu/people/cunduy/bibfinder.

We applied the testing techniques available in eCAT when testing BibFinder: (1) random testing (R, R+), mainly uncovering crashes; (2) goal-oriented testing (G, G+) [14], aimed at verifying whether the agents in BibFinder can fulfill their goals; and (3) evol-mutation testing (M, M+), aimed at revealing more faults thanks to the possibility of continuous execution and checking for constraint violations. We repeated each experiment more than 10 times in order to observe and measure the average time and the number of faults revealed. Different parameters, such as the population size and the mutation probability of M and M+, have been exercised.

5.1 Testing BibFinder MAS

The MAS implementation of BibFinder consists of three agents: BibFinderAgent, BibExchangerAgent, and BibExtractorAgent. Their roles are briefly described as follows: BibFinderAgent maintains the local BibTeX database and coordinates the operation of the system as a whole; BibExchangerAgent is in charge of querying the local database and exchanging data with external agents (e.g., with other instances of BibFinder); BibExtractorAgent crawls local storage devices looking for BibTeX files, and performs searches on and extracts BibTeX items from the Internet.

Goal-oriented testing - G. We derived 6 test suites to test the fulfillment of the associated goals. This derivation follows the goal-oriented software testing methodology discussed in [14]. These test suites contain 12 test cases specifying 12 different test scenarios. Each scenario challenges a goal with valid or invalid test input and follows a concrete interaction protocol.

Goal-oriented testing enhanced by coverage - G+. Given a test suite, such as the one derived through goal-oriented testing, statement coverage can be measured and used to make sure that all statements in the implementation have been exercised by at least one test case (excluding any unreachable code). We enhanced goal-oriented testing by manually adding 3 new test cases, in order to reach 100% statement coverage of the main packages. In other words, we complemented black-box testing with white-box testing.

Random testing - R. In order to apply R during continuous testing, we pre-defined a library of interaction protocols and a repository of domain data. The interaction protocols include the five FIPA protocols Propose, Request, Request-When, Subscribe, and Query [8], and twenty-one (simple) protocols, which are created from twenty-one different FIPA message performatives, such as AGREE, REQUEST, etc. Domain data have been collected from the test suites derived from the goal model and have been manually augmented with additional possible input values. The TA generates test cases by selecting domain data randomly and combining them with interaction protocols. The TA continuously generates test cases and executes them against BibFinder. The MA is in charge of observing the whole system, i.e., BibFinder and the JADE platform. Based on the intercepted information, it can recognize the situations in which bugs are revealed (e.g., agents crash or pre-/post-conditions are violated).

Random testing - R+. This technique takes the 12 test cases created by the manual technique G, and domain data similar to those used in R, as the basis to generate test cases during continuous testing.

Evol-mutation testing - M. The preparation step of M consists of creating initial test cases, as initial individuals, and creating mutants of the original BibFinder system. The initial population contains 12 randomly-generated test cases. Since the BibFinder agents are implemented in JADE, a Java platform, we were able to apply existing object-oriented mutation operators for Java on them in order to create mutants.

Evol-mutation testing - M+. This technique is similar to M, except for the initial population: in M+, the initial population contains the 12 test cases created by G.

To generate mutants from the original ADS, we used the tool MuClipse (http://muclipse.sourceforge.net), built on top of µJava [23], given the source code of the three agents BibFinderAgent, BibExchangerAgent, and BibExtractorAgent. The source code of the supporting classes was left untouched. 24 class-level and 15 statement-level mutation operators [23] were applied to the agents under test. After combining the results, we obtained 178 mutants of BibFinder and used 20 of them in evol-mutation testing.

The execution of G and G+ is also continuous: it consists of a number of cycles, and at each cycle each test case is executed once. Differently from random and evol-mutation testing, the test cases of G and G+ are unchanged. The benefit of continuous execution of the G and G+ test cases descends from the agents evolving or adapting their behaviors over time, possibly resulting in different outputs for the same inputs. So continuous execution of G and G+ can increase the chance of revealing faults in the presence of evolving or adapting agent behaviors.

Results. Table 2 shows a summary of the faults uncovered in BibFinder MAS. Since continuous testing consists of a number of cycles in which test cases are generated (or evolved) and then executed, the table shows the cycle at which each fault was first found by each technique. Most faults can be found by more than one technique, although not every technique can reveal every fault. In particular, G+ revealed most of the faults (except fault No. 7) at the first testing cycle. These faults were repeatedly found by other techniques as well. Finally, fault No. 7 was found by M+ after a very long testing time: 18 cycles, for around one hour. Figure 2 depicts the discovery of new faults by cycle.

Figure 2. BibFinder MAS: number of new faults revealed by cycle

Table 2. Results of testing BibFinder MAS

No  Bug description                                     Bug type  Cycle when bug was found
                                                                  G,G+  R   R+  M   M+
1   Data conversion fault                               Fatal     1     ×   1   5   1
2   Keyword constraint violation                        Moderate  1     8   1   6   1
3   EventFactory does not recognize SEARCH_BIB action   Moderate  1     ×   1   ×   1
4   EventFactory does not recognize UPDATE_BIB action   Moderate  1     ×   1   ×   1
5   EventFactory does not recognize SEARCH_URL action   Moderate  1     ×   1   ×   1
6   SEARCH_URL returns unexpected result                Moderate  1     ×   1   ×   1
7   Add new wrong BibTeX                                Fatal     ×     ×   ×   ×   18
8   Update wrong BibTeX                                 Fatal     1     11  1   2   1
9   BibTeX parsing                                      Fatal     1     9   1   4   1

Average time per test case for all techniques: 20 seconds (~0.33 minutes); × means the corresponding technique cannot find the bug.

5.2 Testing BibFinder Webservice

The BibFinder Web service (WS) provides one main operation, query, used to perform the bibliographic search supported by BibFinder. Another facility available in the Web service implementation is the operation getMonitorLog, used to retrieve the monitoring log. The XML content of a query contains two elements: the first defines the desired action, such as search-bib, while the second encloses the corresponding content, such as the keywords that the user wants to search for.

In order to exploit eCAT to test BibFinder WS, we created two simple proxy agents, BibFinderProxyAgent and RemoteMonitoringProxyAgent. They receive requests and invoke the corresponding operation of BibFinder WS, using the parameters in the content of the requests. The test configuration for BibFinder WS is depicted in Fig. 3. BibFinder WS was deployed on Apache Tomcat 6.0 with JAX-WS 2.0, while the agents ran in the JADE environment.

Figure 3. Test configuration for BibFinder WS

Goal-oriented (G, G+) and random testing (R, R+). For goal-oriented testing, we reused all the test suites derived to test BibFinder MAS. The only modification is that in testing BibFinder MAS, messages could be sent to different agents, while in the case of BibFinder WS, all messages are directed to the proxy agent. For random testing, we reused the same domain data that were used to test BibFinder MAS.

Evol-mutation testing (M, M+). Here again we exploited µJava [23] to generate mutants of BibFinder WS. Among the hundred mutants generated, we randomly chose 20 and deployed them on Apache Tomcat 6.0 on two cluster machines (4GB RAM, 4 Xeon 3GHz CPUs). Each mutant operates on a separate port and has its own proxy agents. The TA runs test cases on these mutants through the proxy agents.

Results. Table 3 shows a summary of the faults uncovered in the BibFinder Webservice. Similar to the presentation of faults for BibFinder MAS, we show the faults uncovered by cycle. In particular, G+ revealed 4 faults at the first testing cycle. These faults were repeatedly found by other techniques as well. Finally, the last two faults (No. 5 and 6) were first found by M; the other techniques also found them, but they needed more time. Figure 4 depicts the discovery of new faults by cycle.

Figure 4. BibFinder Webservice: number of new faults revealed by cycle

Table 3. Results of testing BibFinder Webservice

No  Bug description                                       Bug type  Cycle when bug was found
                                                                    G,G+  R   R+  M  M+
1   Exception when marshaling invalid XML request         Fatal     1     13  1   9  1
2   Return empty when no BibTeX found                     Minor     1     ×   1   ×  1
3   Return unexpected message when updating BibTeX        Moderate  1     ×   1   ×  1
4   Return empty when no URL found                        Minor     1     ×   1   ×  1
5   Constraint violated when keyword is empty             Moderate  ×     10  12  3  6
6   Constraint violated when keyword has length less      Moderate  ×     10  13  4  7
    than 3

Average time per test case for all techniques: 25 seconds (~0.42 minutes).

5.3 Discussion

In summary, the results in Tables 2 and 3 indicate that:

1. Goal-oriented testing (G, G+) is very effective in revealing real, high-severity faults.
2. Random and evol-mutation testing (R, R+, M, M+) are complementary techniques with respect to G and G+, and can reveal additional, otherwise unnoticed, faults.
3. Evol-mutation explores the input space more effectively than random testing, since most of the time it reveals the same faults earlier, or additional faults that would otherwise remain unrevealed.
4. Continuous testing seems to be particularly effective for ADS, in that the fault-exposing capability of random and evol-mutation testing becomes apparent only after substantial testing time.

Overall, the results indicate that ADS contain faults that are very hard to expose, in that they require long execution times. It is only under special conditions that the ADS reacts to a given request in a way that deviates from its intended behavior. Such conditions are hard to reproduce and are often ignored when test cases are derived from the specifications (in our case, from goal diagrams). Even coverage-adequate test suites may be insufficient to reveal such faults. Our proposal to execute an ADS continuously and to evolve the test suite so as to increase its capability to reveal faults is supported by the experimental data collected with BibFinder MAS and BibFinder WS. Despite the technological differences between the two implementations, the two systems exhibited a similar trait in terms of difficulty of fault exposure: some of their bugs required an execution time longer than the one allocated for black-box and white-box testing. On the other hand, black-box and white-box techniques were also quite effective. Hence, they should be used as the starting point for a more extensive testing phase, based on continuous test case execution and monitoring of the ADS. In BibFinder MAS, one Fatal fault went unnoticed after black-box and white-box testing. It is only when evol-mutation testing was executed, after 18 cycles, that we were able to expose such a fault.

The effect of continuous testing on system reliability can be appreciated if we consider the mean time between failures (MTBF) after each testing phase. Tables 4 and 5 report the MTBF for BibFinder. During goal-oriented testing, the MTBF is quite short (0.62 and 1.58 minutes). When we also use evol-mutation, we notice another major increase of the MTBF, jumping to a value as high as 84.15 minutes. We think that continuous testing, equipped with an evolutionary test case generation mechanism, can deliver high reliability even for systems that are intrinsically hard to test, such as ADS.

In a real development scenario, eCAT could be left running continuously, so as to try to reveal also those bugs that are associated with a very long MTBF and are thus hard to reveal⁹ in traditional testing sessions. Goal-oriented and random testing techniques are quite effective in the initial testing cycles, when bugs can be revealed by simple and short message sequences and the selection of the input data is not critical to expose them (i.e., there exist large equivalence classes of input data that can be used interchangeably to reveal a given fault). When the remaining bugs become hard to find (in the last testing cycles), goal-oriented and random testing become ineffective, and it is only through evol-mutation that additional faults can be revealed.

⁹ From the tester's perspective.

Table 4. BibFinder MAS: mean time between failures

Cycle  Technique  Time (m)  Number of bugs  MTBF (m)
1      G+         4.59      8               0.62
18     M+         89.1      1               84.15

Table 5. BibFinder Webservice: mean time between failures

Cycle  Technique  Time (m)  Number of bugs  MTBF (m)
1      G+         6.3       4               1.58
3      M          18.9      1               12.6
4      M          25.2      1               6.3

6 Conclusion

We introduced a novel approach for the continuous testing of Autonomous Distributed Systems (ADS). The two key features of our testing approach are: (1) the ability to execute testing continuously; and (2) the ability to generate test cases automatically, thus complementing manual derivation (goal-oriented testing). The eCAT framework, which supports this testing approach, has two main components, the Tester Agent and the Monitoring Agents. The Tester Agent generates test cases continuously, by means of two test case generation techniques: random generation and evolutionary-mutation generation. The Monitoring Agents report any discovered error through the bug tracking system.

Two experiments have been performed, respectively on an agent-oriented and on a service-oriented implementation of the same case study. The results obtained in these experiments indicate that continuous testing has a big potential to complement manual testing. In fact, especially for faults involving long message sequences and specific input data, continuous testing seems particularly suited to explore those states that can potentially lead to them. Whenever high reliability (i.e., a long mean time between failures) is the aim, evol-mutation can contribute to the discovery of hard-to-reveal faults, which would probably go unnoticed under goal-oriented and random testing.

In our future work, we will further investigate the pre- and post-conditions that can be checked by the Monitoring Agents. This can potentially contribute to guiding evol-mutation to reveal faults that violate the specified conditions. In addition, we plan to investigate faults that are more specifically related to the autonomous nature of ADS, in order to guide evol-mutation testing towards the generation of test cases that are more likely to reveal them.

References

[1] Object Constraint Language Specification. OMG Specification, version 2.0, May 2006.
[2] C. Bernon, M. Cossentino, and J. Pavón. An overview of current trends in European AOSE research. Informatica (Slovenia), 29(4):379-390, 2005.
[3] M. Bruno, G. Canfora, M. D. Penta, G. Esposito, and V. Mazza. Using test cases as contract to ensure service compliance across releases. In ICSOC, pages 87-100, 2005.
[4] G. Canfora and M. D. Penta. Testing services and service-centric systems: Challenges and opportunities. IT Professional, 8(2):10-17, 2006.
[5] R. Coelho, E. Cirilo, U. Kulesza, A. von Staa, A. Rashid, and C. Lucena. JAT: A test automation framework for multi-agent systems. In 23rd IEEE International Conference on Software Maintenance, 2007.
[6] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11(4):34-41, 1978.
[7] O. Dikenelli, R. C. Erdur, and O. Gumus. Seagent: a platform for developing semantic web based multi agent systems. In AAMAS '05: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1271-1272, New York, NY, USA, 2005. ACM Press.
[8] FIPA. Interaction Protocols Specifications. http://www.fipa.org/repository/ips.php3, 2000-2002.
[9] FIPA. ACL Message Structure Specification. http://www.fipa.org/specs/fipa00061, 2002.
[10] A. Ghose and G. Koliadis. Auditing business process compliance. In ICSOC, pages 169-180, 2007.
[11] R. G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering, 3(4):279-290, 1977.
[12] H. Hexmoor, C. Castelfranchi, and R. Falcone, editors. Agent Autonomy. Springer, 2003.
[13] H. D. Mills, M. D. Dyer, and R. C. Linger. Cleanroom software engineering. IEEE Software, 4(5):19-25, September 1987.
[14] C. D. Nguyen, A. Perini, and P. Tonella. A goal-oriented software testing methodology. In 8th International Workshop on Agent-Oriented Software Engineering, AAMAS, May 2007.
[15] C. D. Nguyen, A. Perini, and P. Tonella. Ontology-based test generation for multiagent systems (short paper). In Proc. of 7th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2008), 2008.
[16] R. Pargas, M. J. Harrold, and R. Peck. Test-data generation using genetic algorithms. Software Testing, Verification and Reliability, 9:263-282, September 1999.
[17] M. D. Penta, G. Canfora, G. Esposito, V. Mazza, and M. Bruno. Search-based testing of service level agreements. In GECCO '07, 2007.
[18] A. Pokahr, L. Braubach, and W. Lamersdorf. Jadex: A BDI reasoning engine. In Multi-Agent Programming. Kluwer, 2005.
[19] C. Rouff. A test agent for testing agents and their communities. IEEE, 2002.
[20] W. Schütz. Fundamental issues in testing distributed real-time systems. Real-Time Systems, 7(2):129-157, 1994.
[21] TILAB. Java Agent DEvelopment framework. http://jade.tilab.com/.
[22] A. M. Tiryaki, S. Oztuna, O. Dikenelli, and R. Erdur. SUnit: A unit testing framework for test driven development of multi-agent systems. In 7th International Workshop on Agent-Oriented Software Engineering, 2006.
[23] Y.-S. Ma, J. Offutt, and Y. R. Kwon. MuJava: An automated class mutation system. Software Testing, Verification and Reliability, 15(2):97-133, June 2005.