A Fault Tolerant Infrastructure for Mobile Agents

Summiya, Kiran Ijaz, Umar Manzoor, Arshad Ali Shahid
Department of Computer Science, National University of Computer & Emerging Sciences-FAST, Islamabad, Pakistan
{summiya.sam, kiranijaz, umarmanzoor}@gmail.com, [email protected]

Abstract
Mobile agent technology is a promising paradigm for a myriad of real world applications. Owing to their capabilities, multiagent systems have been considered for a large number of applications. However, issues related to fault tolerance can hamper the suitability of mobile agents in these real world systems. In this paper we propose an infrastructure which provides agent fault tolerance. An algorithm similar to the sliding window model ensures fault tolerant behavior. Different types of agents work in collaboration to provide the desired system behavior by tolerating faults. The proposed infrastructure is applicable to a variety of systems, including e-commerce and online banking. With the growing market for electronic commerce, the use of autonomous mobile agents for electronic business transactions becomes an interesting prospect.

International Conference on Computational Intelligence for Modelling Control and Automation,and International Conference on Intelligent Agents,Web Technologies and Internet Commerce (CIMCA-IAWTIC'06) 0-7695-2731-0/06 $20.00 © 2006

1. Introduction
Agent based computing is largely a networking paradigm with motivations similar to those of client-server systems and remote procedure calls (RPC). Client-server systems and RPCs form the cornerstone of distributed computing and networked systems. While the importance of these concepts is unquestionable, research in this field over the last few years has led to a number of new computing paradigms. Agent technology is one of the most progressive among these.

Agent technology offers a large number of unique features which make it a potential choice for implementing real world applications in general and distributed systems in particular. Agents are appropriate for networks with low bandwidth, high latency, and intermittent connectivity [1]. Since agents move the computation close to the data, they can also be used for solving problems with highly interactive client-server systems. Since an agent's interaction with the host doesn't pass over the network, better scalability is achieved. Agents also offer potential benefits by reducing the amount of encryption and authentication required. Compared with an RPC conversation, where each and every message has to be encrypted, an agent needs to be encrypted only once per host it visits. Agents also provide more reliable communication with host services in the presence of faulty networks, because an agent executes directly on the host it is communicating with. Another implication of this direct execution is negligible message latency [2, 3]. Clients with low processing power can take advantage of an agent's ability to preprocess data on the server before returning. Agents can also provide semantic routing, freeing the end user from remembering server addresses.

Agents have found wide application in the emerging concept of the electronic marketplace. It is easy to see how hosts can be made to represent shopping centers, with agents shopping and bartering for goods on the user's behalf. Agents' ability to filter through big databases on the host and their ability to roam from host to host make them well suited for shopping for items or services.

Beyond these individual capabilities, the strengths of agents are amplified when they work in collaboration with each other. A multiagent system is a society of autonomous computational entities which work together to achieve a specific goal. Multiagent platforms form a very strong basis for scalable and flexible decentralized software. Systems based on the multiagent paradigm can be modified dynamically: agent objectives can be changed, new agents can be introduced, and agents that are no longer useful can be set aside. This dynamicity and flexibility make the paradigm very suitable for real world applications [4].

This suitability is further supported by an orthogonal property of agents called mobility. Mobile agents are programs which are dispatched from a source computer and run among a set of networked servers until they are able to accomplish their task [2, 8]. An agent can simply get lost through errors of the network or the hosts. If an agent travels autonomously through the network, there is no instance which can guarantee that the agent reaches the next host correctly and does not get lost. Hence fault tolerance is fundamental to the further development of mobile agent applications. In the context of mobile agents, fault tolerance prevents a partial or complete loss of the agent, i.e., it ensures that the agent arrives at its destination. It is also important to distinguish between lost and delayed agents.

2. Related Work
Though the agent paradigm itself is quite strong, a technical hurdle restricting its adoption is the lack of reliability. This issue stems from the uncertain and heterogeneous nature of the highly dynamic underlying network. As a result, a great amount of research is being done in this field. Designing a reliable mobile agent system is especially challenging since a mobile agent is potentially affected by the failure of any host that it visits, or of any communication link that it needs to traverse [1].

One way of achieving fault tolerance is to detect the crash of an agent and, upon such an event, launch a new agent. However, this requires a correct and efficient failure detection mechanism. If an agent is mistakenly assumed to have failed, the highly desirable "exactly once" property of agent execution can be violated [11]. For example, the system proposed by D. Johansen et al. in [5] assumes a perfect failure detection mechanism. Since this assumption is unrealistic in real world applications, the "exactly once" property is not ensured. Some systems have also used the notion of transactions and leader selection to address the issue of exactly once execution [11]. A. Mohindra et al. in [1] and L. Silva et al. in [10] address the same problem by detecting duplicate agents at the end and undoing their effects at that time. This not only limits system throughput by adding overhead, but in certain cases it is also impossible to undo the actions of these duplicates. Some approaches have been proposed which involve consensus and agreement steps. Since real world systems need fault recovery within a minimal amount of time, the consensus process can introduce a significant delay.

Strasser et al. [6, 7] discuss two approaches for improving reliability in agent systems. The first approach allows an agent to specify a flexible itinerary, with the possibility to defer the visit to currently unavailable machines or to select alternate machines in case of machine failures. The second approach uses a fault-tolerant protocol to implement the exactly-once execution property for agents by forming explicit stages of computation, which results in a constrained, lock-step manner of execution. Such execution constraints, although useful for certain classes of applications, are not desirable in general [1].

3. Need for Fault Tolerance
Before elaborating on the actual mechanism for fault tolerance, let us consider the need for it. For the purpose of this paper we discuss only mobile agent fault tolerance. As already discussed, mobile agents are more fault prone since they move around in unreliable networks. While on their way to the final destination, these agents might get killed due to a failure of the underlying network. The loss of an agent means a transaction failure. A static agent, on the other hand, rarely experiences faults unless the machine on which it is running crashes. So to ensure guaranteed transactions, we need to introduce fault tolerance.

A very important issue in transactions, and especially financial transactions, is keeping the exactly once property. This means that each intended transaction should be executed exactly once. Many mobile agent applications require an agent to be executed exactly once. For example, assume a user launches a mobile agent to make a hotel and flight reservation for a forthcoming business trip. The agent is expected to make both reservations if possible and in any case to return a status message to the user. The agent must also guarantee that it makes each reservation exactly once and that it is not lost to a network partition or node failure.

Consider another scenario where a user starts a transaction for a withdrawal of $1000. If the transaction agent carrying the transaction dies on its way, resulting in a failure, the user has to reinitiate the transaction. But since the transaction status isn't reported back, the user at the ATM terminal doesn't know where the agent died. In the worst case, if the agent died right after performing the transaction and the user starts another transaction, a duplicate transaction is initiated. Similarly, if a ping agent fails, the Transaction Manager (TM, introduced in Section 4.1) will incorrectly assume that a path isn't available. To avoid such situations we introduce fault tolerance and continuous monitoring. Maximum effort is made to get the transaction through, and when this isn't possible the exact transaction status is reported.

4. System Architecture
We have developed a fault tolerant infrastructure using the Java Agent DEvelopment Framework (JADE). Figure 1 gives an overview of the system architecture. Different types of agents work collaboratively to provide an overall fault tolerant behavior. The system can be used for a number of real world applications, including online transactions and electronic markets. The following five agent types form the backbone of the system architecture.

• Transaction Manager
• Observer Agents
• Ping Agents
• Transaction Agent
• Statistical Agent

4.1 Transaction Manager (TM)
The Transaction Manager is responsible for application initialization. The TM gets the initial setup information by reading a configuration file written in XML. Once it has set up the environment for the application, a GUI for input is started and the application user enters the transaction information. After the complete transaction information is available, the TM starts setting up the transaction environment. The first step in this regard is the discovery of an appropriate path. The TM holds a list of possible paths on which it can route transactions. An observer agent is started for each path. These observer agents in turn start their corresponding ping agents. The two kinds of agents work together to determine the best available path for the transaction. The TM then starts a transaction agent, equipped with the appropriate transaction parameters, on the selected path. After this it waits for the completion of the transaction and, when it completes, displays the results on the graphical display.

4.2 Observer Agents (OA)
An observer agent is assigned to each possible path. Paths are the alternate routes available for a transaction. The observer agents sit at the edges and monitor the transactions. After creation, an OA's first job is to start a ping agent in order to determine the exact status of the associated path. The OA lets the TM know whether the path is available or unavailable. Once an OA's path is chosen for a transaction, the OA becomes responsible for monitoring the transaction on that path.

4.3 Ping Agents (PA)
These agents move from server to server to confirm path availability, hopping from machine to machine along the assigned path. Before moving to the next machine, each ping agent creates a clone of itself on its current machine. These clones serve the purpose of fault tolerance. After traversing their itinerary, the ping agents inform their corresponding OAs about the actual path status. In this way, the ping agents confirm path availability or unavailability to the OAs, which in turn report it to the TM.

Figure 1. Architecture Diagram of the Fault Tolerant Infrastructure
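The path discovery step described in Sections 4.1 to 4.3 can be illustrated with a small simulation. This is only a sketch in plain Python, not JADE code: the names `PathReport`, `ping_path`, `probe_paths` and `choose_path` are hypothetical, machine availability is modelled as a simple dictionary, and the selection rule (take the first available path in configuration order) is an assumption, since the paper does not state how the best path is chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PathReport:
    """What a ping agent reports to its observer agent (hypothetical structure)."""
    path_id: str
    available: bool
    failed_at: Optional[str] = None  # first unreachable machine, if any


def ping_path(path_id: str, machines: List[str], up: Dict[str, bool]) -> PathReport:
    """Simulate one ping agent hopping along `machines` in order."""
    for machine in machines:
        # A machine missing from `up` is treated as unavailable.
        if not up.get(machine, False):
            return PathReport(path_id, False, failed_at=machine)
    return PathReport(path_id, True)


def probe_paths(paths: Dict[str, List[str]], up: Dict[str, bool]) -> List[PathReport]:
    """One observer per path; each starts one ping agent and relays its report."""
    return [ping_path(pid, machines, up) for pid, machines in paths.items()]


def choose_path(reports: List[PathReport]) -> Optional[str]:
    """Assumed rule: pick the first available path in configuration order."""
    for report in reports:
        if report.available:
            return report.path_id
    return None


# Two alternate routes from M1 to M3; M2 is down, so only p2 is usable.
paths = {"p1": ["M1", "M2", "M3"], "p2": ["M1", "M4", "M3"]}
up = {"M1": True, "M2": False, "M3": True, "M4": True}
reports = probe_paths(paths, up)
selected = choose_path(reports)
print(selected, [(r.path_id, r.available, r.failed_at) for r in reports])
```

In the real system the observer agents run in parallel and the ping agents also leave clones behind; both aspects are left out here to keep the reporting chain (ping agent to OA to TM) visible.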

4.4 Transaction Agents (TA)
After the TM has set up the complete transaction environment, it starts a transaction agent. The transaction agent moves along the path that has already been checked. In this way fault tolerance is incorporated from the first step. If some machine in the path becomes unavailable later, the TA lets the TM know about it. The TM then immediately tells the TA which machine it has to move to next, and the TA takes the alternate path. On reaching the destination, the TA performs the actual transaction. On successful completion of the transaction, the TA sends a status message back to the TM. The TA also sends status messages to the TM after every predefined number of hops.
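The interaction between the TA and the TM can be sketched as a simple message log. The following plain Python simulation is illustrative only: the function `run_transaction_agent`, the `alternates` table (standing in for whatever the TM would answer), the `status_every` parameter and the message strings are all assumptions made for this sketch, and after a detour the agent is assumed to continue along the rest of its original path.

```python
from typing import Dict, List, Set


def run_transaction_agent(
    path: List[str],
    alternates: Dict[str, str],
    down: Set[str],
    status_every: int = 2,
) -> List[str]:
    """Walk `path`, asking the TM for a replacement hop when a machine is down.

    Returns the list of messages exchanged between the TA and the TM.
    """
    log: List[str] = []
    hops = 0
    current = path[0]
    for nxt in path[1:]:
        if nxt in down:
            log.append(f"TA->TM: {nxt} unavailable")
            replacement = alternates.get(nxt)
            if replacement is None or replacement in down:
                # No alternate route: report the exact status and stop.
                log.append(f"TA->TM: FAILED at {current}")
                return log
            log.append(f"TM->TA: move to {replacement}")
            nxt = replacement
        current = nxt
        hops += 1
        if hops % status_every == 0:
            # Periodic status message after a predefined number of hops.
            log.append(f"TA->TM: status after {hops} hops at {current}")
    log.append(f"TA->TM: transaction completed at {current}")
    return log


path = ["M1", "M2", "M3", "M4", "M5"]
log = run_transaction_agent(path, alternates={"M3": "M3b"}, down={"M3"})
failed_log = run_transaction_agent(path, alternates={}, down={"M3"})
for line in log:
    print(line)
```

The second call shows the case where no alternate machine exists: the log ends with a failure report naming the last machine reached, which corresponds to the requirement that the exact transaction status is reported when rerouting is impossible.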

5. The Fault Tolerant Process - A Complete Scenario
To explain the working of the fault tolerant framework, let us consider a scenario where a client initiates a transaction at an ATM terminal. If the application has been installed on the terminal and the TM is running, the TM continuously maintains a list of the registered machines and the possible paths. When the user starts interacting through the GUI, it asks for the type of transaction and other required parameters. After getting the input, the TM starts observer agents for each possible path. To harness the power of parallel processing, these observer agents are started in parallel. Each OA in turn starts a ping agent. The ping agent moves from the source to the destination and reports the path availability or unavailability back to the corresponding OA. If the path is available, the TM routes the transaction on that particular path. If a path is unavailable, the transaction is rerouted on an alternate path maintained by the TM. When this rerouting isn't possible, the TM reports the exact transaction status to the user.

A careful look at the above scenario highlights the fact that there are many fault prone points. To account for this, fault tolerance is ensured in mobile agents by using a concept similar to the sliding window protocol in computer communications. This concept can be understood with the help of Figure 2. Each mobile agent makes a clone of itself before moving. These clones sit on the machines and do nothing. Since they perform no computation, they place only a minimal load on the system. To further reduce the load on the system, a limit is kept on the maximum number of clones. This limit is defined by the window size. For example, if the window size is 3 and the currently active agent is A4, then A0 falls out of the window, as the window can accommodate only 3 agents, so A0 is killed. In this way, depending upon the window size, there is a limited number of clones in the system. The window size can also be adjusted as required.

If the currently active agent dies, its clone takes up its responsibility. If that clone also fails, the previous clone tries to perform the job, and so on. The clones periodically consult their local Agent Management Service (AMS), provided by JADE, to determine whether the parent agent took its next hop successfully. If the parent fails, the clone starts another agent with a similar name and task and moves it towards the destination. Performance gains are also achieved, as an agent doesn't have to start from scratch in case of failure. Furthermore, since each agent only has to consult its local AMS, responses are quicker and more accurate.

Figure 2. Fault Tolerant Mechanism
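The sliding window of clones can be sketched as follows. This is a plain Python illustration rather than JADE code; the class `SlidingWindowAgent` and its methods are hypothetical. It assumes that the window size bounds the number of idle clones (not counting the active agent), which matches the A4/A0 example with a window size of 3, and that machines and agents are indexed from 0.

```python
from collections import deque


class SlidingWindowAgent:
    """Tracks an active mobile agent and the clones it leaves behind."""

    def __init__(self, itinerary, window_size):
        self.itinerary = list(itinerary)
        self.window_size = window_size
        self.position = 0       # index of the machine hosting the active agent
        self.clones = deque()   # machine indices holding idle clones, oldest first
        self.killed = []        # clones removed because they left the window

    def hop(self):
        """Leave a clone on the current machine, then move to the next one."""
        if self.position >= len(self.itinerary) - 1:
            raise RuntimeError("already at destination")
        self.clones.append(self.position)
        if len(self.clones) > self.window_size:
            # The oldest clone falls out of the window and is killed.
            self.killed.append(self.clones.popleft())
        self.position += 1

    def recover(self, dead_machines=()):
        """The active agent was lost: the newest surviving clone takes over.

        Returns the machine index where execution resumes, or None if no
        clone inside the window survived.
        """
        while self.clones:
            candidate = self.clones.pop()
            if candidate not in dead_machines:
                self.position = candidate
                return candidate
        return None


# Window size 3: after four hops the active agent is at index 4
# and the clone at index 0 has been killed.
agent = SlidingWindowAgent([f"M{i}" for i in range(6)], window_size=3)
for _ in range(4):
    agent.hop()
clones_before_failure = list(agent.clones)

# The active agent is lost and the machine holding the newest clone is down too,
# so the clone before it takes over.
resumed_at = agent.recover(dead_machines={3})
print(agent.killed, clones_before_failure, resumed_at)
```

A larger window tolerates more consecutive failures at the cost of more idle clones; the recovering clone restarts from its own machine rather than from the source, which is the performance gain mentioned above.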

6. System Model
The system model for the fault tolerant mechanism can be expressed by the following specification.

A_i = i-th agent
Agent itinerary = {M_1, M_2, ..., M_i, M_{i+1}, ..., M_n}
Source = M_1; Destination = M_n
Agent A_i moves to the machine M_i
Agent A_i leaves a clone A_{i-1} on the machine M_{i-1}
Window size = WS = N
If (i - N) >= 0, Kill(A_{i-N})
If A_i at M_i fails, then A_{i-1} at M_{i-1} takes over the responsibility, and so on

To ensure proper status reporting, the transaction agents send status messages back to the transaction manager. The system also incorporates a mechanism that uses the search facilities of the AMS to determine whether an agent is lost or whether its execution has only been delayed due to slow processors or communication links. This kind of mechanism helps in tolerating crash failures. If any of the agents fails, clones are there to take up the failed agent's responsibility. In an asynchronous distributed multiagent system it is also hard for an agent to determine whether another agent has actually crashed. The proposed mechanism, by consulting the local AMS, addresses this problem as well.

7. Performance Analysis

7.1 Test Case 1
As our first test case we simulated the fault tolerant infrastructure with 20 online containers (machines). Machine 1 was the source while machine 20 was set as the agent destination. The window size was set to 5. As shown in the graph of Figure 3, initially there is just one agent in the system. This agent is the transaction manager. As the user input completes, the TM starts an observer agent and the agent count increases to 2. After a delay of 0.07 seconds the agent count increases to 3, as a ping agent is started. After a total delay of 0.48 seconds from transaction startup, the agent count increases to 8. As there were already 3 agents in the system and the window size was set to 5, a total of 3+5 agents remain in the system for the next 16.003 seconds. This upper limit of 8 agents is maintained since agents that fall out of the window get killed. As the ping agent confirms path availability, the agent count reduces to 2 again, since all ping agents are deliberately removed from the system. After confirmation of path availability, the transaction manager starts the transaction agent and the agent count increases to 3 again. At 16.754 seconds from startup the agent count increases to 8 again. After 22.252 seconds, the transaction completes and the agent count reduces to 1. So at the end of the transaction the only agent in the system is the TM.

Figure 3. Agents over Time (Scenario 1) [plot "Time Vs Agents": No. of Agents versus Time in Sec]

7.2 Test Case 2
In the second test case, the container count was set to 10 and the window size was 3. As in the previous test case, the agent count increased from 1 to 2 to 3 within a delay of 0.07 seconds from transaction startup. The agent count reaches a maximum value of 6 (3 + window size) after 0.49 seconds and stays so for the next 7.54 seconds. As the ping agents finish their job, the agent count becomes 2 again. The first transaction agent starts 31 seconds from startup, making the agent count 3. The agent count then once again reaches the maximum value of 6 and stays so for the next 3.234 seconds. After a total period of 11.686 seconds, the transaction completes, making the agent count 2 again.

Figure 4. Agents over Time (Scenario 2) [plot "Time vs Agents": No. of Agents versus Time in Sec]
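The peak agent counts reported in the two test cases follow directly from the window size. The short Python sketch below reproduces that arithmetic; the function names are hypothetical, and it assumes three base agents (the TM, one OA and the active ping or transaction agent) plus one clone per hop up to the window size.

```python
def agent_count(hops_done: int, window_size: int, base_agents: int = 3) -> int:
    """Agents alive after `hops_done` hops: base agents plus clones in the window."""
    return base_agents + min(hops_done, window_size)


def peak_agents(containers: int, window_size: int, base_agents: int = 3) -> int:
    """Largest agent count reached while traversing `containers` machines."""
    max_hops = containers - 1  # source to destination
    return agent_count(max_hops, window_size, base_agents)


# Test Case 1: 20 containers, window size 5. Test Case 2: 10 containers, window size 3.
print(peak_agents(20, 5), peak_agents(10, 3))
```

Under these assumptions the count rises by one per hop until the window is full and then stays flat, which is the plateau visible in Figures 3 and 4.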

7.3 Test Case 3
When ping agents or transaction agents reach the destination, they have two options: either they can traverse the whole path again, or they can simply send a message back to the associated observer agent. We tested both options by making the ping agents traverse the path back while the transaction agent sends back a message. The time taken by the two approaches is compared in the graph of Figure 5. As the graph indicates, message sending takes far less time than traversal. Since transaction and ping agents have done their job on reaching the destination, they should send a status message back and then be removed from the system.

Figure 5. Message vs. Path Traversal [bar chart "Messaging vs Traversal", Time in Sec: Traversal 9.283, Message 0.6]

8. Future Work
As future work we want to compare our system's performance with other available approaches. We also plan to examine the potential gain achieved by fault tolerance against the time and resources consumed. In doing so we will examine the trade-offs involved in achieving the desired agent survivability.

9. Conclusion
The inherent qualities of agent technology make it a strong choice for many real world applications. However, this usefulness can be undermined by problems arising from a lack of fault tolerance. Introducing fault tolerance in agent systems will not only provide the guarantee that an agent performs its assigned job well, but will also result in more dependable applications. Interfaced with different applications, fault tolerant frameworks for agents can contribute to the reliability of the overall system. The implications of such reliability and dependability are far-reaching.


10. References
[1] A. Mohindra, A. Purakayastha, and P. Thati. Exploiting nondeterminism for reliability of mobile agent systems. In Proc. of the International Conference on Dependable Systems and Networks, pages 144–153, New York, June 2000.
[2] Danny B. Lange and Mitsuru Oshima. Seven Good Reasons for Mobile Agents. Communications of the ACM, 42(3):88–89, March 1999.
[3] Jason D. Hartline. Mobile Agents: A Survey of Fault Tolerance and Security. November 19, 1998.
[4] Gerhard Weiss. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, Chapters 1–4. The MIT Press, Cambridge, Massachusetts; London, England, 1999.
[5] D. Johansen, K. Marzullo, F. B. Schneider, K. Jacobsen, and D. Zagorodnov. NAP: Practical fault-tolerance for itinerant computations. In Proc. of the 19th IEEE International Conference on Distributed Computing Systems (ICDCS), Austin, Texas, June 1999.
[6] M. Strasser and K. Rothermel. Reliability Concepts for Mobile Agents. International Journal of Cooperative Information Systems, 7(4), 1998.
[7] M. Strasser, K. Rothermel, and C. Maihofer. Providing Reliable Agents for Electronic Commerce. IEEE, 1998.
[8] Qiao Xin, Yang Yu, Yu Xu, and Zhanhai Qin. Fault Tolerance Issues in Mobile Agents. Project Report, University of California, San Diego, June 2000.
[9] K. Rothermel and M. Strasser. A fault-tolerant protocol for providing the exactly-once property of mobile agents. In Proc. of the 17th IEEE Symposium on Reliable Distributed Systems (SRDS), pages 100–108, West Lafayette, Indiana, Oct. 1998.
[10] L. Silva, V. Batista, and J. Silva. Fault-tolerant execution of mobile agents. In Proc. of the International Conference on Dependable Systems and Networks, pages 135–143, New York, June 2000.
[11] S. Pleisch and A. Schiper. Fatomas - A fault-tolerant mobile agent system based on the agent-dependent approach. In Proceedings of the IEEE International Conference on Dependable Systems and Networks, pages 215–224, Göteborg, Sweden, July 2001.
[12] Michael R. Lyu, Xinyu Chen, and Tsz Yeung Wong. Design and Evaluation of a Fault-Tolerant Mobile-Agent System. IEEE Intelligent Systems, 19(5):32–38, September 2004.