
An Efficient Recovery Scheme for Fault-Tolerant Mobile Computing Systems

Taesoon Park
Department of Computer Engineering
Sejong University
Seoul 143-747, KOREA
[email protected]

Namyoon Woo and Heon Y. Yeom
Department of Computer Science
Seoul National University
Seoul 151-742, KOREA
fnywoo, [email protected]

ABSTRACT

This paper presents an efficient recovery scheme that provides fault tolerance for mobile computing systems. The proposed scheme is based on message logging and independent checkpointing, and a movement-based scheme is suggested for the efficient management of the recovery information, such as checkpoints and message logs. A mobile host that carries its recovery information to the nearby mobile support station can recover instantly in case of a failure. However, the support stations visited by the mobile host then incur a high failure-free execution cost for transferring the recovery information and accessing the stable storage. On the other hand, the recovery cost can be too high if the recovery information remains dispersed over a number of mobile support stations. The movement-based scheme considers both the failure-free execution cost and the failure-recovery cost. While a mobile host moves within a certain range, its recovery information remains at the support stations where the information was first saved. If the mobile host moves out of the range, it transfers the recovery information to a nearby support station. As a result, the scheme can control the transfer cost as well as the recovery cost. The performance of the proposed scheme is evaluated with extensive simulation results.

Keywords: Distributed Systems, Fault-Tolerance, Mobile Computing Systems, Message Logging, Asynchronous Recovery.

An earlier version of this work appeared in the proceedings of the 2001 International Conference on Parallel and Distributed Systems.

1 Introduction

Many algorithms supporting distributed services are nowadays extended to continue their services in the mobile environment [5]. However, straightforward extensions of existing algorithms are not well adapted to the new environment, since the mobile computing system introduces new challenges such as the handling of mobile hosts. The mobile host, called an MH, has some special properties. For example, it keeps moving from one cell to another and is connected to the mobile support station, called an MSS, via a wireless network with low bandwidth and a very fragile connection. MHs usually have little memory and disk space, and their low battery capacity can also be a problem. Checkpointing-recovery is one of the distributed services that provide fault tolerance for the system. Since MHs are vulnerable to failures, it is very desirable for the mobile computing system to be equipped with a checkpointing-recovery facility. Many checkpointing-recovery schemes have been proposed for distributed systems; however, like most other distributed services, these schemes cannot be directly used in the mobile environment. In particular, the following properties of the mobile computing system introduce new design issues for the implementation of checkpointing-recovery. First, the low bandwidth of the wireless network rules out schemes that require a large number of system messages or a large amount of information carried in a message. Also, since the checkpoints of an MH must be transferred to the MSSs due to the lack of disk space on MHs, schemes enforcing frequent checkpointing cannot be used, and it is reasonable to let an MH take its checkpoints on its own schedule. While the MH moves around the cells, its checkpoints have to be stored on several MSSs, and hence, in case of a failure of the MH, a mechanism to trace and retrieve the proper checkpoint must be provided.
Moreover, checkpointing and recovery schemes enforcing any kind of synchronization or coordination among the MHs may not be usable, because MHs frequently disconnect to save power. Considering these properties, coordinated checkpointing and recovery schemes [9, 23, 12, 13], which require the processes in the system to synchronize their checkpointing and recovery activities, are not suitable for the mobile environment. The low network bandwidth cannot afford the large number of coordination messages exchanged between the MHs, and some MHs temporarily disconnected from the network may not be able to participate in the coordination. Some schemes have been proposed to achieve checkpointing coordination with less message overhead and fewer processes participating in the coordination [8, 19]. Communication-pattern based checkpointing [20] allows a process to take consistent checkpoints independently, whenever it changes its communication status from the sending mode to the receiving mode. This scheme does not require any coordination overhead for checkpointing and recovery, and an extended version of this scheme for the mobile environment has been proposed in [1]. However, since the frequency of checkpointing in this scheme depends on the communication pattern of an MH, the MH may have to transfer a checkpoint with every outgoing message in the worst case, and the low-bandwidth wireless network cannot afford it. Communication-induced checkpointing [7, 24, 26] is the scheme with the least checkpointing overhead, since an index or a timestamp carried in a message makes the processes eventually take globally consistent checkpoints. For the mobile computing environment, some extensions of this checkpointing scheme have been proposed [16, 15]. However, as far as recovery is concerned, checkpointing-only schemes have a common problem in rollback. Because of the livelock problem [13], which causes recursive rollbacks, either the rollback of the related processes has to be synchronized as proposed in [13] or a centralized coordination is required [6]. One way to guarantee asynchronous recovery [10, 22] is to use message logging in addition to checkpointing. The causal logging scheme [3, 4, 11], however, requires a large log space and a large amount of dependency information to be carried in a message, which can be a serious drawback in the mobile environment. Hence, instead of causal logging, a scheme to implement optimistic logging [21, 22, 25] with less message overhead and little interference with MHs has been proposed in [17].
Also, in spite of the heavy stable storage access overhead, pessimistic logging is considered a good design choice for the mobile environment because of its simple implementation and its capability of asynchronous recovery [18, 27]. In the mobile computing environment, MHs are frequently disconnected from the network without a failure, and the exchange of a large number of coordination messages is considered too costly. Hence, recovery coordination among the MHs is not practical and asynchronous recovery must be sought. With asynchronous recovery, a process can independently decide on its rollback, and after the rollback it can immediately resume the computation without waiting for any coordination message from another process. To achieve asynchrony in recovery, it is desirable to log the messages in addition to independent checkpointing.


Most checkpointing schemes for mobile computing systems have suggested that MSSs should manage checkpoints for the MHs, because of the storage limitation of MHs. The stable storage of the MSS is also a reasonable choice for the message log. Since messages heading to the MHs are routed through the MSSs, message logging by the MSS need not impose any extra communication overhead. However, the mobility of MHs introduces a new design issue for storage management. As an MH moves around the cells, the storage for its checkpoints and message logs becomes dispersed over a number of MSSs, and in case of a failure, the MH must locate the latest or any consistent checkpoint and also the sequence of logged messages, which in turn increases the recovery cost. Considering the relatively high failure rate of MHs, instant recovery from a failure is very important. However, checkpoints and logs distributed over the network may cause severe message traffic and delay in recovery. For an efficient implementation, the mobility of MHs must be carefully traced and a proper mechanism for gathering the distributed recovery information must be prepared. Also, since the size of the stable storage managed by each MSS is limited, checkpoints and logs no longer required for any recovery must be discarded so that the space can be reused. Such garbage collection may impose extra communication overhead if the logs are dispersed over a large number of cells. Some work related to distributed storage management has been proposed in [18, 27]. For fast recovery, it is desirable for the checkpoints and message logs to be near the MH at recovery time, and hence the checkpoints and logs of an MH in [18] keep moving as the MH performs a handoff between two cells. As a result, instant recovery is possible; however, the failure-free communication overhead is not negligible, considering the size of a checkpoint and the logged messages.
One suggestion made in [27] utilizes the home of each MH to maintain the recovery information. As an MH moves, it transfers checkpoints or logs to the home, and in case of a failure, it can find the recovery information at home. However, if the MH is far from home, the transfer cost can also be a problem.

This paper presents an efficient recovery scheme based on pessimistic message logging for the mobile computing environment. With message logging and periodic checkpointing, asynchronous recovery can be achieved even in case of multiple and concurrent failures. For the management of recovery information, such as checkpoints and message logs, the movement-based scheme is used. An MH carrying its recovery information to its current MSS can recover instantly in case of a failure. However, the MSSs visited by the MH then incur a high failure-free cost to transfer the recovery information and access the stable storage. On the other hand, the recovery cost can be too high if the recovery information is dispersed over a large number of support stations. The movement-based scheme considers both costs. While the MH moves within a certain range, the recovery information of the MH is not moved. However, if the MH moves out of the range, it transfers the recovery information nearby. As a result, the scheme controls the transfer cost as well as the recovery cost. The proposed scheme is compared with the earlier schemes in [18] through extensive simulation.

The rest of this paper is organized as follows: Section 2 describes the system model and the basic components of fault-tolerant mobile computing systems. Section 3 presents the proposed recovery scheme, and Section 4 compares its performance with that of the earlier schemes through extensive simulation. Finally, Section 5 concludes the paper.

2 Mobile Computing System Model

The mobile computing system considered in this paper follows the model presented in [1, 2]. The system consists of a set of mobile hosts (MHs) and static mobile support stations (MSSs). A set of dynamic, wireless communication links can be established between an MH and an MSS, and a set of high-speed, static, wired communication links is assumed between MSSs. The region covered by an MSS is called a cell. An MH residing in a cell can be connected to the MSS servicing the cell, and the MH can communicate with another MH or MSS only through the local MSS. The links in the dynamic network support FIFO communication in both directions; however, no assumption is made on the message delivery order of the static links.

Distributed computation in the mobile computing environment is performed by a set of processes running concurrently on MHs or MSSs in the network. A process experiences a sequence of state transitions during its execution, and the atomic action that causes a state transition is called an event. An event having no interaction with another process is called an internal event; message-sending and message-receipt are the external events. A sequence of state transitions within a process is called a computation, and the computation of a process is assumed to follow the piece-wise deterministic model, in which the process always produces the same sequence of states if the same sequence of message-receipt events occurs at the process.

Failures assumed in this paper are transient and independent; that is, a process is unlikely to fail again at the same execution point after it recovers from a failure, and the failure of a process does not affect the other processes in the system. The processes running on MHs as well as MSSs are also assumed to be fail-stop: in case of a failure, the process stops its execution and does not perform any malicious action. When a failure occurs, the contents stored in the volatile memory of the MH or MSS are lost; however, the stable storage survives the failure. Reliable message delivery is assumed; that is, there is no message loss or modification during normal operation. The message transmission delay in static and dynamic networks is assumed to be finite but arbitrary.

To support the mobility of MHs, hand-off and location update mechanisms are assumed as follows. For an MH to leave a cell and enter another cell, it first has to end its current connection by sending a leave(r) message to the local MSS, where r is the sequence number of the last message received from the MSS, and then establish a new connection by sending a join(MH-id, previous MSS-id) message to the new MSS. Usually, leaving a cell and entering another cell happen simultaneously when the MH crosses the boundary between two cells, and this is called a handoff. Each MSS maintains a list of identifiers, called an Active_MH_List, for the MHs which are currently connected to it. An MH can also disconnect itself from the local MSS without leaving the cell by sending a disconnect(r) message when the MH goes into the sleep mode for power conservation. When the MSS receives the disconnect message from an MH, it marks the MH as "disconnected" by setting a flag and maintains a list of disconnected MHs, called a Disconnect_MH_List. Later on, the MH can reconnect to any MSS by sending a reconnect(MH-id, previous MSS-id) message to the MSS. If the MH is reconnected to a new MSS, the new MSS informs the previous MSS of the reconnection so that the previous MSS can perform the proper hand-off procedures. The difference between leave and disconnect is that the MSS can immediately forget about the MH when receiving a leave message, while it has to maintain the necessary information after receiving a disconnect message.

Location management of MHs is based on a two-level data hierarchy consisting of the home location register (HLR) and the visitor location register (VLR). Each MH is associated with a home, which keeps track of the MH's current location in the HLR. Also, one or more MSSs are grouped and associated with a VLR, which maintains the location information for the MHs in the cells serviced by the MSSs of the group. During the handoff of an MH, the new MSS sends a registration query to the VLR. If the MH moves within the region managed by the same VLR, the VLR just updates the location of the MH; otherwise, the VLR sends a location registration message to the HLR of the MH for the location update. After the location update, the HLR sends a registration cancellation message to the old VLR. With this mechanism, the HLR maintains the current location of the MH. Hence, when a message for an MH is initiated, it is delivered to the MH based on the information in the HLR and VLR.
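The two-level location update can be illustrated with a minimal Python sketch. The class and method names (`HLR`, `VLR`, `registration_query`, `cancel`) are hypothetical, and the sketch assumes a single HLR serving all MHs, whereas the model above associates each MH with its own home.

```python
class HLR:
    """Home location register: remembers which VLR currently covers each MH."""

    def __init__(self):
        self.current_vlr = {}  # MH id -> VLR object

    def register(self, mh_id, new_vlr):
        old_vlr = self.current_vlr.get(mh_id)
        self.current_vlr[mh_id] = new_vlr
        # After the location update, cancel the registration at the old VLR.
        if old_vlr is not None and old_vlr is not new_vlr:
            old_vlr.cancel(mh_id)


class VLR:
    """Visitor location register for a group of MSSs."""

    def __init__(self, name, hlr):
        self.name = name
        self.hlr = hlr
        self.location = {}  # MH id -> id of the MSS currently serving it

    def registration_query(self, mh_id, new_mss):
        """Called by the new MSS during a handoff.

        Returns True if the MH was already known to this VLR (local update
        only) and False if the HLR had to be informed.
        """
        known = mh_id in self.location
        self.location[mh_id] = new_mss
        if not known:
            self.hlr.register(mh_id, self)
        return known

    def cancel(self, mh_id):
        self.location.pop(mh_id, None)
```

A handoff inside one VLR region only touches that VLR's table; a handoff across regions additionally updates the HLR and removes the stale entry from the old VLR.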

3 The Proposed Recovery Scheme

The proposed recovery scheme is based on independent checkpointing, pessimistic message logging and asynchronous rollback-recovery. For the efficient management of the distributed log storage, a scheme based on the movement of the MH is proposed.

3.1 Checkpointing and Logging

Each mobile host MH_i periodically takes a checkpoint, and the time period between two consecutive checkpoints of MH_i is determined by MH_i itself. For checkpointing, MH_i first saves its current state as a checkpoint and assigns a unique sequence number to the checkpoint. C_i^α denotes the α-th checkpoint of MH_i, and the pair of integers (i, α) is used as the identifier for C_i^α. MH_i then sends the checkpoint with the identifier to its current MSS, say MSS_p. MH_i also maintains a counter, m_i^{rcv_seq}, to count the number of messages MH_i has received. The latest counter value before checkpointing is also delivered with the checkpoint. On receipt of the checkpoint, MSS_p saves the checkpoint and the related information into the stable storage. Using the carried message counter value, MSS_p can decide the correct position of the checkpoint with respect to the logged messages.

In addition to the checkpoints, each MSS_p also has to maintain the message log for the MHs residing in the cell. Since each message delivered to MH_i in the cell is routed through MSS_p, logging the messages incurs no extra communication overhead between MSS_p and MH_i. Let M_i^α denote the α-th message delivered to MH_i; each message M_i^α is identified by the pair of integers (i, α). The messages headed to MH_i in the cell are first logged into the stable storage of MSS_p and then delivered to MH_i. Along with the application messages from MSS_p to the MHs in the cell, MSS_p also logs the messages related to mobility, such as the join, leave, disconnect and reconnect messages received from the MHs. Any of these messages sent from a mobile host MH_i must carry the value of m_i^{rcv_seq}, and the sequence number is also logged with the message. The log of these messages is used to trace the movement of each MH_i during the recovery.

A mobile host MH_i, after a failure, must locate its latest checkpoint and logged messages for consistent recovery. For fast retrieval of such information, a trace record for MH_i, called Trace_i, is maintained by the MSSs which MH_i has visited. Trace_i consists of two integer variables, cp_seq and cp_loc, and a list, log_set. cp_seq and cp_loc hold the sequence number of the latest checkpoint and the identifier of the MSS which saves that checkpoint, respectively. log_set holds the set of MSSs which carry the message logs for MH_i. During the handoff of MH_i, Trace_i is delivered from the old MSS to the new MSS, say MSS_p, and MSS_p saves Trace_i into the stable storage. On the first message logging for MH_i, MSS_p adds its identifier to Trace_i.log_set. When MSS_p saves a new checkpoint for MH_i, it puts the checkpoint sequence number into Trace_i.cp_seq and its identifier into Trace_i.cp_loc. MSS_p also makes Trace_i.log_set empty, so that the list includes only the MSSs which have saved logs since the latest checkpoint. The formal description of the checkpointing and message logging scheme is given in Figure 1.
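As an illustration of how an MSS might maintain the trace record while saving checkpoints and logging messages, consider the following Python sketch. It is only a sketch under stated assumptions: the names `Trace`, `MSS`, `save_checkpoint` and `log_and_deliver` are hypothetical, and in-memory dictionaries stand in for the stable storage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trace:
    """Trace record for one MH: latest checkpoint and where its logs live."""
    cp_seq: int = 0
    cp_loc: Optional[str] = None
    log_set: set = field(default_factory=set)


class MSS:
    def __init__(self, mss_id):
        self.mss_id = mss_id
        self.checkpoints = {}  # (mh_id, cp_seq) -> (state, rcv_seq); stand-in for stable storage
        self.log = {}          # mh_id -> list of (msg_seq, payload); stand-in for stable storage

    def save_checkpoint(self, trace, mh_id, cp_seq, state, rcv_seq):
        # Store the checkpoint with the receive counter carried by the MH.
        self.checkpoints[(mh_id, cp_seq)] = (state, rcv_seq)
        trace.cp_seq = cp_seq
        trace.cp_loc = self.mss_id
        # Only MSSs that log messages after this checkpoint belong in log_set.
        trace.log_set = set()

    def log_and_deliver(self, trace, mh_id, msg_seq, payload, deliver):
        # Pessimistic logging: write the log entry before delivering.
        self.log.setdefault(mh_id, []).append((msg_seq, payload))
        trace.log_set.add(self.mss_id)
        deliver(payload)
```

Here `deliver` is any callable that hands the payload to the MH; the ordering inside `log_and_deliver` reflects the rule that a message is logged first and delivered afterwards.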

3.2 Distributed Storage Management

As MH_i moves from one cell to another, the message logs of MH_i become distributed over the stable storage of a number of MSSs, and MH_i may be far away from the MSS which carries the latest checkpoint when a failure occurs. Considering the relatively high failure rate of MHs, the recovery cost to collect the checkpoint and the logs may be too high. However, carrying the checkpoint and logs along as MH_i moves can impose a significant communication overhead, and the cost of repetitive stable storage accesses is not negligible either. Considering both the recovery cost and the failure-free operation cost, two movement-based schemes are suggested in this paper. While MH_i moves around the cells within a small area, the checkpoint can be retrieved with little overhead. Hence, the checkpoint and message logs need to be moved to an MSS near MH_i only when the distance of MH_i from the MSS carrying the latest checkpoint exceeds a certain threshold. This scheme is called the distance-based scheme, which focuses on the distance between MH_i and the MSS carrying its latest checkpoint. On the other hand, the frequency-based scheme concerns the number of handoffs, since that number indicates the number of sites carrying the message logs and the frequency of communication for collecting the message logs in case of recovery. Hence, in this scheme, MH_i keeps counting the number of handoffs and transfers the checkpoint and logs when the number reaches a certain value, k. In both schemes, the recovery cost and the failure-free operation cost are adjustable through the threshold and the k value.

Checkpointing for MH_i Managed by MSS_p:

  When Checkpointing_Timer of MH_i Expires:
    cp_seq_i = cp_seq_i + 1;
      /* cp_seq_i counts the number of checkpoints taken by MH_i */
    Take a Volatile Checkpoint, C_i^{cp_seq_i};
      /* C_i^α indicates the α-th checkpoint taken by MH_i */
    Save the Entry (i, cp_seq_i, m_i^{rcv_seq}) with C_i^{cp_seq_i};
      /* m_i^{rcv_seq} counts the number of messages received by MH_i */
    Send [C_i^{cp_seq_i}, (i, cp_seq_i, m_i^{rcv_seq})] to MSS_p;

  When MSS_p Receives a Checkpoint, C_i^{cp_seq_i}, from MH_i:
    Save [C_i^{cp_seq_i}, (i, cp_seq_i, m_i^{rcv_seq})] into Stable Checkpoint Space;
    Trace_i.cp_seq = cp_seq_i; Trace_i.cp_loc = p; Trace_i.log_set = ∅;

Logging for MH_i Managed by MSS_p:

  When MSS_p Delivers a Message, M, to MH_i:
    msg_seq_i = msg_seq_i + 1;
      /* msg_seq_i counts the number of messages sent to MH_i */
    Save [M_i^{msg_seq_i}, (i, msg_seq_i)] into Stable Log Space;
    If (p ∉ Trace_i.log_set) Trace_i.log_set = Trace_i.log_set ∪ {p};
    Deliver M;

  When MH_i Enters into a Cell Managed by MSS_p:
    MSS_p Saves [(Trace_i, m_i^{rcv_seq})] into Stable Storage;

  When MSS_p Receives a Message, M, from MH_i:
    If (M ∈ {join, leave, disconnect, reconnect})
      Save [M, (m_i^{rcv_seq})] into Stable Log Space;

Figure 1: Checkpointing and Message Logging

Frequency-based scheme: Each MH_i maintains a handoff frequency counter, denoted by C_i^f, to count the number of handoffs, and the value of C_i^f is initialized to k. For each handoff of MH_i, the value of C_i^f is decremented by one. While the value of C_i^f remains positive, Trace_i is maintained as described before, and no extra action is taken. However, when the value of C_i^f becomes zero, the new MSS, say MSS_p, collects the latest checkpoint and message logs by sending the recovery information collection message to the MSSs recorded in Trace_i.cp_loc and Trace_i.log_set. On receipt of the message, each MSS replies with the checkpoint or the message log of MH_i. MSS_p, after saving the collected checkpoint and message logs, updates the value of Trace_i.cp_loc with its identifier and Trace_i.log_set to include only MSS_p, and then resets the value of C_i^f to k.

Distance-based scheme: The distance-based scheme works in a manner similar to the frequency-based scheme; the only difference is that a distance value, denoted by C_i^d, is used instead of the handoff frequency counter. C_i^d indicates the distance between the MSS in whose cell MH_i is now residing and the MSS carrying the latest checkpoint of MH_i. After each handoff of MH_i, the new MSS, say MSS_p, measures C_i^d and performs the normal handoff procedure as long as the value of C_i^d is less than a given threshold, TH_i. Only when the value of C_i^d reaches TH_i are the latest checkpoint and message logs transferred, as in the frequency-based scheme. To measure the distance between two MSSs, a simple table look-up is assumed. Each MSS maintains a table including the distance between any MSS and itself, and during the handoff, an MSS sets the value of C_i^d by table look-up using the value of Trace_i.cp_loc. The formal description of the distributed storage management scheme is given in Figure 2.
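The two transfer decisions can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, cells are assumed to be addressed by (x, y) grid coordinates, and the Chebyshev distance is used as a stand-in for the distance table that each MSS is assumed to keep.

```python
def frequency_handoff(counter, k):
    """Apply one handoff to the counter C_i^f.

    Returns (collect, new_counter): collect is True when the recovery
    information should be gathered at the new MSS, after which the
    counter is reset to k.
    """
    counter -= 1
    if counter == 0:
        return True, k
    return False, counter


def chebyshev(a, b):
    # Assumed distance measure for a mesh in which each cell has eight neighbors.
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def distance_handoff(current_cell, cp_cell, threshold, distance=chebyshev):
    """Return True when the recovery information should be gathered.

    The normal handoff is performed while the distance C_i^d between the
    current cell and the cell holding the latest checkpoint stays below
    the threshold TH_i.
    """
    return distance(current_cell, cp_cell) >= threshold
```

With k = 3, every third handoff triggers a collection; with a threshold of 3 cells, an MH may wander anywhere within two cells of its checkpoint without causing any transfer.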

Distributed Storage Management for MH_i Managed by MSS_p

  When MSS_p Receives a join Message from MH_i:
      /* MH_i performs a handoff from an Old MSS to MSS_p */
    Retrieve Trace_i from Old MSS;
    If ((mode == distance_based and C_i^d ≥ TH_i) or
        (mode == frequency_based and C_i^f = 0)) {
        /* C_i^d measures the distance between the MSS carrying the latest
           checkpoint of MH_i and MSS_p, TH_i is the given threshold for the
           distance-based scheme, and C_i^f is the handoff counter of MH_i. */
      Send (recovery_information_collection, Trace_i.cp_seq, i) to Trace_i.cp_loc;
      For (Each MSS_r ∈ Trace_i.log_set)
        Send (recovery_information_collection, i) to MSS_r;
    }
    Else  /* Normal handoff is performed. */
      Save [(Trace_i, m_i^{rcv_seq})] into Stable Storage;

  When MSS_q Receives (recovery_information_collection, Trace_i.cp_seq, i) from MSS_p:
    Send (C_i^{Trace_i.cp_seq}, M_i^α ∈ Log Space, if α > C_i^{Trace_i.cp_seq}.m_i^{rcv_seq}) to MSS_p;

  When MSS_r Receives (recovery_information_collection, i) from MSS_p:
    Send (M_i^α ∈ Log Space) to MSS_p;

  When MSS_p Collects (C_i^{Trace_i.cp_seq}, M_i^α s):
    Save C_i^{Trace_i.cp_seq} into Stable Checkpoint Space;
    For (α = C_i^{Trace_i.cp_seq}.m_i^{rcv_seq} + 1; ; α++)
      Save M_i^α into Stable Log Space;
    Trace_i.cp_seq = cp_seq_i; Trace_i.cp_loc = p; Trace_i.log_set = ∅;

Figure 2: Distributed Storage Management

3.3 Independent Recovery

With checkpointing and message logging, each MH can perform rollback-recovery independently. By independent rollback-recovery, we mean that only the MH which has failed rolls back to its latest checkpoint and replays the logged messages to achieve consistent recovery, and no other MHs need to roll back together. For MH_i to recover from a failure, it first sends the recovery message to its current MSS, say MSS_p. On receipt of the recovery message, MSS_p sends the checkpoint_retrieve message with Trace_i.cp_seq to the MSS indicated by Trace_i.cp_loc, say MSS_q, and also sends the log_retrieve message to each MSS_r in Trace_i.log_set. On receipt of the checkpoint_retrieve message, MSS_q replies with the checkpoint of MH_i saved with the received sequence number, and also sends any message log for MH_i saved after that checkpoint. Note that each checkpoint must be saved with the sequence number of the last message received before the checkpointing. Hence, MSS_q can correctly select the messages saved after that checkpoint. Meanwhile, each MSS_r replies with the message log of MH_i on receipt of the log_retrieve message.

Sometimes MSS_p itself is the one carrying the latest checkpoint and all the logs saved after the checkpoint. This situation is confirmed if the value of Trace_i.cp_loc equals p and Trace_i.log_set includes only p. In this case, the checkpoint_retrieve and log_retrieve messages need not be sent out. On receipt of the replies with the checkpoint and message logs, MSS_p transfers the checkpoint and the logged messages to MH_i. When transferring the logged messages, MSS_p first has to check the value of m_i^{rcv_seq} saved with the checkpoint, which indicates the largest sequence number of the messages logged before the latest checkpointing. MSS_p hence examines the message sequence number, msg_seq_i, logged with each message and sends only the messages with a sequence number larger than m_i^{rcv_seq}, in order. During the recovery of MH_i, new messages heading to MH_i may arrive; however, those messages are delivered to MH_i only after all the messages in the log have been consumed. The formal description of the independent recovery scheme is given in Figure 3.

Independent Recovery for MH_i Managed by MSS_p

  When MH_i Recovers from a Failure:
    Send recovery Message to MSS_p;

  When MSS_p Receives recovery Message from MH_i:
    Send (checkpoint_retrieve, Trace_i.cp_seq, i) to Trace_i.cp_loc;
    For (Each MSS_r ∈ Trace_i.log_set)
      Send (log_retrieve, i) to MSS_r;

  When MSS_q Receives (checkpoint_retrieve, Trace_i.cp_seq, i) from MSS_p:
    Send (C_i^{Trace_i.cp_seq}, M_i^α ∈ Log Space, if α > C_i^{Trace_i.cp_seq}.m_i^{rcv_seq}) to MSS_p;

  When MSS_r Receives (log_retrieve, i) from MSS_p:
    Send (M_i^α ∈ Log Space) to MSS_p;

  When MSS_p Collects (C_i^{Trace_i.cp_seq}, M_i^α s):
    Send C_i^{Trace_i.cp_seq} to MH_i;
    For (α = C_i^{Trace_i.cp_seq}.m_i^{rcv_seq} + 1; ; α++)
      Send M_i^α to MH_i;

  When MH_i Receives C_i^α from MSS_p:
    Restore C_i^α;
    Resume the Computation;

Figure 3: Independent Recovery
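The message selection and replay step of the recovery can be sketched as follows. The function names are hypothetical, log fragments are assumed to arrive as lists of (msg_seq, payload) pairs from the MSSs in the log set, and `apply` stands for whatever deterministic step the application performs on receiving a message.

```python
def select_replay(cp_rcv_seq, log_fragments):
    """Merge log fragments from several MSSs and return the payloads to replay.

    Only messages with a sequence number larger than the counter saved with
    the checkpoint are kept, and they are returned in sequence-number order.
    """
    merged = {}
    for fragment in log_fragments:
        for msg_seq, payload in fragment:
            if msg_seq > cp_rcv_seq:
                merged[msg_seq] = payload  # duplicates collapse on the same key
    return [merged[seq] for seq in sorted(merged)]


def recover(checkpoint_state, cp_rcv_seq, log_fragments, apply):
    """Restore the checkpoint state and replay the selected messages in order."""
    state = checkpoint_state
    for payload in select_replay(cp_rcv_seq, log_fragments):
        state = apply(state, payload)
    return state
```

Because the computation is assumed to be piece-wise deterministic, replaying the same messages in the same order from the restored checkpoint reproduces the pre-failure state without involving any other MH.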

3.4 Garbage Collection

As the amount of checkpoints and logged messages saved for the MHs which have visited an MSS increases, the MSS may experience a severe storage problem. To cope with this problem, the checkpoints and message logs which are no longer required for any recovery must be discarded, so that the stable storage space can be reused. In pessimistic logging, only the latest checkpoint and the messages logged after that checkpoint was taken are required for any failure recovery. Hence, logged messages and the previous checkpoint can safely be discarded when a new checkpoint is taken. For such garbage collection, when a new checkpoint of MH_i is saved by MSS_p, it sends the garbage collection message to the MSSs indicated in Trace_i.cp_loc and Trace_i.log_set. On receipt of the message, each MSS_q discards any checkpoint and message log saved for MH_i and then replies with an acknowledgement message. After collecting the acknowledgements, MSS_p updates Trace_i accordingly.
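A possible shape of this garbage collection round is sketched below in Python. The names and the storage layout are hypothetical. As an added assumption, each MSS filters by sequence number, discarding only checkpoints older than the new one and log entries the new checkpoint already covers, so that the sketch stays safe even when the MSS holding the new checkpoint is itself among the targets.

```python
def discard_obsolete(store, mh_id, new_cp_seq, new_rcv_seq):
    """Discard what the new checkpoint of MH mh_id makes unnecessary.

    store["checkpoints"] maps (mh_id, cp_seq) to a saved state, and
    store["log"] is a list of (mh_id, msg_seq, payload) entries.
    """
    store["checkpoints"] = {
        key: value for key, value in store["checkpoints"].items()
        if not (key[0] == mh_id and key[1] < new_cp_seq)
    }
    store["log"] = [
        (i, seq, msg) for (i, seq, msg) in store["log"]
        if not (i == mh_id and seq <= new_rcv_seq)
    ]
    return ("ack", mh_id)


def garbage_collect(stores, cp_loc, log_set, mh_id, new_cp_seq, new_rcv_seq):
    """Send the garbage collection request to every MSS named in the old trace.

    Returns True once every target has acknowledged.
    """
    targets = set(log_set)
    if cp_loc is not None:
        targets.add(cp_loc)
    acks = {
        mss_id for mss_id in targets
        if discard_obsolete(stores[mss_id], mh_id, new_cp_seq, new_rcv_seq)[0] == "ack"
    }
    return acks == targets
```

Entries belonging to other MHs are left untouched, which matters because one MSS keeps recovery information for every MH that has visited its cell.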

4 Performance Study

The performance of the proposed scheme is evaluated and compared with the earlier schemes proposed in [18].

4.1 Simulation Model

A mobile computing system with the mesh cell configuration [14], consisting of 10 × 10 square-shaped cells, was simulated. Each cell in the system has exactly eight neighbors, and all cells have the same size. Initially, one hundred MHs are randomly distributed over the cells and managed by the MSSs of the cells. Each MH moves next into one of the eight neighboring cells, and then the handoff procedure is performed between the two MSSs. The next cell for each movement is selected randomly out of the eight neighbors, and the time interval between two consecutive handoffs follows the exponential distribution with a mean of 1/λ_h. Each MH carries one process participating in a distributed computation. A process running on an MH can communicate with a process residing on another MH by sending and receiving messages. The message sending of a process follows a Poisson process with a given rate, and for each message sending event, the recipient of the message is selected randomly. Also, for the message delivery through MSSs, a fixed routing is used. Each process periodically takes a checkpoint, and a fixed value C is used for the checkpointing time interval. The failure of each MH follows a Poisson process with rate λ_f, and each MH performs the recovery action instantly after a failure.

To measure the relative cost of checkpointing, logging and recovery, the following variables are used. Let C_m be the average cost of transferring a control message over one hop of the wired network. The costs of transferring an application message and a checkpoint over one hop of the wired network are then each expressed as a multiple of C_m. A wireless network factor is also used, which is the ratio of the cost of transferring a message over one hop of the wireless network to the cost of transferring the message over one hop of the wired network. To measure the stable storage access cost for checkpointing and logging, C_s is used as the average cost of saving one log entry into the stable storage, and δ is used as the ratio of the cost of saving one checkpoint to the cost of saving one log entry.
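The mobility part of this model can be sketched as a small Python routine. This is an assumption-laden illustration rather than the simulator used in the paper: the names are hypothetical, and the grid is assumed to wrap around at its edges so that every cell has exactly eight neighbors.

```python
import random

GRID = 10  # 10 x 10 mesh of cells

# The eight neighboring offsets of a cell in the mesh.
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def next_handoff(cell, mean_interval, rng):
    """Pick the next cell uniformly among the eight neighbors and draw the
    waiting time until that handoff from an exponential distribution."""
    dx, dy = rng.choice(NEIGHBOURS)
    # Wrap-around at the border is an assumption of this sketch.
    new_cell = ((cell[0] + dx) % GRID, (cell[1] + dy) % GRID)
    wait = rng.expovariate(1.0 / mean_interval)
    return new_cell, wait


if __name__ == "__main__":
    rng = random.Random(42)
    cell, clock = (0, 0), 0.0
    for _ in range(5):
        cell, wait = next_handoff(cell, mean_interval=2.0, rng=rng)
        clock += wait
        print(f"t={clock:.2f} handoff into cell {cell}")
```

Passing the random generator explicitly keeps runs reproducible, which is convenient when the same movement trace has to be replayed under different storage management schemes.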

4.2 Performance Metrics

The performance metrics considered in this paper are the network cost and the stable storage access cost related to checkpointing, logging and rollback-recovery. The network cost includes the periodic checkpoint transfer cost over the wireless network (checkpointing cost, denoted by $C_{cp}$), the transfer cost of the checkpoint and message logs over the wired network during the handoff (handoff cost, denoted by $C_{ho}$), the garbage collection cost, denoted by $C_g$, the cost to collect the latest checkpoint and logged messages over the wired network for the recovery of a MH (retrieval cost, denoted by $C_{rt}$), and the cost to transfer the collected recovery information to the MH over the wireless network (restoring cost, denoted by $C_r$). Among these, the checkpointing cost $C_{cp}$, the handoff cost $C_{ho}$, and the garbage collection cost $C_g$ are imposed during failure-free operation and are hence called the failure-free operation cost. The retrieval cost $C_{rt}$ and the restoring cost $C_r$ are incurred during failure recovery and are hence called the recovery cost.

The stable storage access cost includes not only the cost for the first saving of a checkpoint or a message, but also the cost for the repetitive saving of the recovery information after it has been transferred.

The performance of the proposed scheme is compared with the following three schemes proposed in [18].


Pessimistic scheme: The checkpoint and message logs of a MH are transferred to the current MSS during the handoff of the MH, so the handoff cost is high. In return, the retrieval cost is low since the recovery information is always near the MH.

Lazy scheme: The checkpoint and message logs are not transferred during the handoff; they are collected only for failure recovery. Hence, although the failure-free operation cost is low, the retrieval cost is high.

Trickle scheme: To reduce the handoff cost, the transfer of checkpoints and logs is performed asynchronously with the handoff. After the handoff of a MH, the current MSS requests the transfer of the checkpoints and logs of the MH.
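The difference between the pessimistic and the lazy scheme can be made concrete with a small cost sketch. It is only an illustration under stated assumptions: the number of wired hops between two MSSs is taken to be the Chebyshev distance between their cells, the function names are hypothetical, and the per-hop costs are supplied by the caller. The trickle scheme is left out because its cost depends on the timing of the asynchronous transfer.

```python
def hops(a, b):
    """Wired hops between the MSSs of cells a and b.

    Assumption: hop count equals the Chebyshev distance on the mesh.
    """
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def pessimistic_costs(path, logs, ckpt_cost, msg_cost):
    """Return (handoff_cost, retrieval_cost) when everything follows the MH.

    path[i] is the i-th cell visited since the last checkpoint and
    logs[i] is the number of messages logged while the MH was there.
    """
    handoff, carried = 0, 0
    for i in range(1, len(path)):
        carried += logs[i - 1]  # logs accumulated before this handoff
        handoff += hops(path[i - 1], path[i]) * (ckpt_cost + carried * msg_cost)
    return handoff, 0  # recovery information is already at the current MSS


def lazy_costs(path, logs, ckpt_cost, msg_cost):
    """Return (handoff_cost, retrieval_cost) when nothing moves until a failure."""
    current = path[-1]
    retrieval = hops(path[0], current) * ckpt_cost  # checkpoint stays at path[0]
    for cell, n in zip(path, logs):
        retrieval += hops(cell, current) * n * msg_cost
    return 0, retrieval
```

For a MH that oscillates between two cells, `path = [(0, 0), (0, 1), (0, 0), (0, 1), (0, 0)]` with one logged message per cell, a per-hop checkpoint cost of 100 and a per-hop message cost of 10, the pessimistic scheme pays 500 during handoffs and nothing at recovery, while the lazy scheme pays nothing during handoffs and 20 at recovery. The pessimistic cost is paid on every handoff, whereas the lazy cost is paid only if a failure actually occurs.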

4.3 Simulation Results

Figure 4 first presents the failure-free operation costs of eight schemes with varying handoff frequency. The schemes are the pessimistic scheme, denoted by PL; the lazy scheme, denoted by LL; the trickle scheme, denoted by TL; the frequency-based schemes with $k$ values of 3, 5 and 10, denoted by 3-F, 5-F and 10-F, respectively; and the distance-based schemes with $T_H$ values of 3, 5 and 10 cells, denoted by 3-D, 5-D and 10-D, respectively. To obtain these results, the following parameter values are used: $\lambda = 10^{-1}$, $C = 100$, $\lambda_f = 10^{-2}$, $C_m = 1$, $\alpha = 10$, $\beta = 100$, $\gamma = 10$, $C_s = 1$, and $\delta = 10$.
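The unit costs implied by these parameter values can be worked out with a short sketch. The three factors that follow the control message cost are read here in the order in which Section 4.1 introduces them (application message factor 10, checkpoint factor 100, wireless factor 10); that pairing, and every Python name below, is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    c_m: float = 1.0               # control message over one wired hop
    msg_factor: float = 10.0       # application message relative to c_m (assumed pairing)
    ckpt_factor: float = 100.0     # checkpoint relative to c_m (assumed pairing)
    wireless_factor: float = 10.0  # wireless hop relative to wired hop (assumed pairing)
    c_s: float = 1.0               # saving one log entry to stable storage
    ckpt_save_ratio: float = 10.0  # checkpoint save relative to log entry save

    def wired(self, kind):
        """Cost of sending one item of `kind` over one wired hop."""
        factor = {"control": 1.0,
                  "message": self.msg_factor,
                  "checkpoint": self.ckpt_factor}[kind]
        return factor * self.c_m

    def wireless(self, kind):
        """Cost of sending one item of `kind` over one wireless hop."""
        return self.wireless_factor * self.wired(kind)

    def save(self, kind):
        """Stable storage cost of saving one log entry or one checkpoint."""
        factor = {"message": 1.0, "checkpoint": self.ckpt_save_ratio}[kind]
        return factor * self.c_s
```

Under this reading, `CostParams().wired("checkpoint")` is 100, `CostParams().wireless("checkpoint")` is 1000, and `CostParams().save("checkpoint")` is 10, so moving a checkpoint across the wireless link is the most expensive single operation in the model.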

From the figure, it can be seen that all the schemes other than scheme LL require higher network cost as the handoff rate increases. Scheme PL in particular shows a drastic cost increase, since a checkpoint and the logged messages are transferred at each handoff. Compared with this scheme, the frequency-based and distance-based schemes show adjustable performance. For example, the 3-F and 3-D schemes require relatively high network cost, whereas the 5-F, 5-D, 10-F and 10-D schemes show a slow increase in the network cost, since the recovery information is transferred only at intervals determined by the handoff frequency or the moving distance. Scheme LL does not transfer any recovery information during normal operation and requires only the periodic checkpointing cost.

Figure 4: Failure-free Operation Cost

The average network cost imposed on each handoff is shown in Figure 5. The costs of schemes PL and LL do not vary as the handoff rate changes, since scheme PL transfers the whole recovery information at each handoff and scheme LL transfers none of it. However, the frequency-based and distance-based schemes show a slight increase in the handoff cost as the handoff rate increases, and their performance is adjustable through the values of $k$ and $T_H$. Scheme TL basically transfers a checkpoint and the logged messages for each handoff, though the transfer is performed asynchronously with the handoff procedure. However, when a MH moves back and forth between two cells, the transfer of recovery information is not necessary, and hence scheme TL shows a lower handoff cost than scheme PL.

Figure 5: Average Handoff Cost

While scheme PL requires a high failure-free operation cost, the scheme requiring the highest recovery cost is scheme LL. This scheme performs no storage optimization during failure-free operation, and hence the MHs using scheme LL have to collect the latest checkpoint and message logs over a larger number of hops. Figure 6 shows the average recovery cost for the eight schemes. As expected, the recovery cost of PL is the lowest, and the recovery cost of the frequency-based and distance-based schemes is adjustable as the $k$ and $T_H$ values are varied. For all schemes except scheme PL, the recovery cost increases as the handoff rate increases, since with a high handoff rate the message logs are distributed over a wide range and the MSS holding the latest checkpoint is far away from the current MSS, which in turn increases the network cost.

Figure 6: Average Recovery Cost

Figure 7 shows the stable storage access cost for the schemes, which includes the cost for the first saving of a checkpoint or a message and also the cost for the repetitive saving of that recovery information. Hence, the schemes with frequent transfer of recovery information, such as schemes PL and TL, incur a higher stable storage access cost, while the others remain low.

Figure 7: Stable Storage Access Cost

Figures 8 and 9 show the total cost of the schemes with varying handoff rate and failure rate, respectively. The total cost increases with the handoff rate in Figure 8 and with the failure rate in Figure 9: a high handoff rate raises the failure-free operation cost, and a high failure rate raises the recovery cost. In both figures, the frequency-based and distance-based schemes show performance that is adjustable through the $k$ and $T_H$ values.
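How the handoff-count parameter k and the distance threshold make the cost adjustable can be sketched with two trigger rules. This is an illustration under assumptions, not the specification of the proposed protocol: here the frequency-based rule moves the recovery information on every k-th handoff, the distance-based rule moves it once the MH is at least the threshold number of cells away from the MSS holding it, and distance is the Chebyshev distance between cells. The function names are hypothetical.

```python
def cell_distance(a, b):
    """Distance in cells between a and b (assumed Chebyshev distance)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def frequency_based_transfers(path, k):
    """Indices in `path` at which recovery information is moved.

    Assumed rule: transfer on every k-th handoff.
    """
    return [i for i in range(1, len(path)) if i % k == 0]


def distance_based_transfers(path, t_h):
    """Indices in `path` at which recovery information is moved.

    Assumed rule: transfer when the MH is at least t_h cells away
    from the MSS that currently holds the recovery information.
    """
    holder, moved = path[0], []
    for i in range(1, len(path)):
        if cell_distance(holder, path[i]) >= t_h:
            holder = path[i]  # information now resides at the current MSS
            moved.append(i)
    return moved
```

For a MH moving in a straight line over ten handoffs, both rules with a parameter of 3 transfer at handoffs 3, 6 and 9. For a MH oscillating between two adjacent cells, the frequency-based rule still transfers at handoffs 3, 6 and 9, while the distance-based rule never transfers. Larger parameter values mean fewer transfers during failure-free operation and a longer retrieval path at recovery.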

Figure 8: Total Network Cost with Varying Handoff Rate

Figure 9: Total Network Cost with Varying Failure Rate

5 Conclusions

In this paper, we have presented an efficient recovery scheme based on message logging and checkpointing for mobile computing systems. A mobile host that carries its recovery information to its current mobile support station can recover instantly in case of a failure. However, the mobile support stations visited by the mobile host then experience a high failure-free execution cost to transfer the recovery information and access the stable storage. On the other hand, the recovery cost can be too high if the recovery information is dispersed over a wide range of cells. For efficient management of recovery information, such as checkpoints and message logs, the movement-based scheme is suggested, which considers both the failure-free operation cost and the recovery cost. In the proposed scheme, while the mobile host moves within a certain range, its recovery information is not moved. However, if the mobile host moves out of the range, it transfers the recovery information to a nearby support station. As a result, the scheme controls the transfer cost as well as the recovery cost. The performance of the proposed scheme has been evaluated with extensive simulation results, which show that the scheme provides various levels of failure-free operation cost and recovery cost by adjusting the movement factors.

References

[1] A. Acharya and B.R. Badrinath. Checkpointing Distributed Applications on Mobile Computers. In Proc. of the 3rd Int’l Conf. on Parallel and Distributed Information Systems, pp. 73–80, Oct. 1994.
[2] I.F. Akyildiz and J.S.M. Ho. On Location Management for Personal Communications Networks. IEEE Communications Magazine, pp. 138–145, Sep. 1996.
[3] L. Alvisi, B. Hoppe and K. Marzullo. Nonblocking and Orphan-free Message Logging Protocols. In Proc. of the 23rd Int’l Symp. on Fault Tolerant Computing Systems, pp. 145–154, Jun. 1993.
[4] L. Alvisi and K. Marzullo. Message Logging: Pessimistic, Optimistic and Causal. In Proc. of the 15th Int’l Conf. on Distributed Computing Systems, pp. 229–236, May 1995.
[5] B.R. Badrinath, A. Acharya and T. Imielinski. Structuring Distributed Algorithms for Mobile Hosts. In Proc. of the 14th Int’l Conf. on Distributed Computing Systems, pp. 21–28, Jun. 1994.
[6] B. Bhargava and S.R. Lian. Independent Checkpointing and Concurrent Rollback for Recovery: An Optimistic Approach. In Proc. of the Int’l Conf. on Data Engineering, pp. 182–189, 1988.
[7] D. Briatico, A. Ciuffoletti and L. Simoncini. A Distributed Domino-effect Free Recovery Algorithm. In Proc. of the 4th Symp. on Reliability in Distributed Software and Database Systems, pp. 207–215, 1984.
[8] G. Cao and M. Singhal. Low-cost Checkpointing with Mutable Checkpoints in Mobile Computing Systems. In Proc. of the 18th Int’l Conf. on Distributed Computing Systems, pp. 464–471, May 1998.
[9] M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, Vol. 3, No. 1, pp. 63–75, 1985.
[10] O.P. Damani and V.K. Garg. How to Recover Efficiently and Asynchronously When Optimism Fails. In Proc. of the 16th Int’l Conf. on Distributed Computing Systems, pp. 108–115, 1996.
[11] E.N. Elnozahy and W. Zwaenepoel. Manetho: Transparent Rollback-recovery with Low Overhead, Limited Rollback, and Fast Output Commit. IEEE Transactions on Computers, Vol. 41, No. 5, pp. 526–531, May 1992.
[12] J.L. Kim and T. Park. An Efficient Algorithm for Checkpointing Recovery in Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 8, pp. 955–960, Aug. 1993.
[13] R. Koo and S. Toueg. Checkpointing and Rollback-recovery for Distributed Systems. IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, pp. 23–31, Jan. 1987.
[14] J. Li, H. Kameda and K. Li. Optimal Dynamic Location Update for PCS Networks. In Proc. of the 19th Int’l Conf. on Distributed Computing Systems, May 1999.
[15] D. Manivannan and M. Singhal. Failure Recovery Based on Quasi-synchronous Checkpointing in Mobile Computing Systems. OSU-CISRC-796-TR36, Dept. of Computer and Information Science, The Ohio State University, 1996.
[16] N. Neves and W.K. Fuchs. Adaptive Recovery for Mobile Environments. Communications of the ACM, Vol. 40, No. 1, pp. 68–74, Jan. 1997.
[17] T. Park and H.Y. Yeom. An Asynchronous Recovery Scheme Based on Optimistic Message Logging for Mobile Computing Systems. In Proc. of the 20th Int’l Conf. on Distributed Computing Systems, pp. 436–443, Apr. 2000.
[18] D.K. Pradhan, P. Krishna and N.H. Vaidya. Recoverable Mobile Environment: Design and Tradeoff Analysis. In Proc. of the 26th Int’l Symp. on Fault Tolerant Computing Systems, pp. 16–25, Jun. 1996.
[19] R. Prakash and M. Singhal. Low-cost Checkpointing and Failure Recovery in Mobile Computing. IEEE Transactions on Parallel and Distributed Computing Systems, Vol. 7, No. 2, pp. 1035–1048, Feb. 1996.
[20] B.L. Randell, P.A. Lee and P.C. Treleaven. Reliability Issues in Computing System Design. ACM Computing Surveys, Vol. 2, pp. 123–166, 1978.
[21] A.P. Sistla and J.L. Welch. Efficient Distributed Recovery Using Message Logging. In Proc. of the 8th ACM Symposium on Principles of Distributed Computing, pp. 223–238, 1989.
[22] S.W. Smith, J.D. Tygar and D.B. Johnson. Completely Asynchronous Optimistic Recovery with Minimal Rollbacks. In Proc. of the 25th Int’l Symp. on Fault Tolerant Computing Systems, pp. 361–370, Jun. 1995.
[23] Y. Tamir and C.H. Sequin. Error Recovery in Multicomputers using Global Checkpoints. In Proc. of the Int’l Conf. on Parallel Processing, pp. 32–41, 1984.
[24] K. Venkatesh, T. Radhakrishnan and H.F. Li. Optimal Checkpointing and Local Recording for Domino-free Rollback Recovery. Information Processing Letters, Vol. 25, pp. 295–303, 1987.
[25] Y.M. Wang, O.P. Damani and V.K. Garg. Distributed Recovery with k-Optimistic Logging. In Proc. of the 17th Int’l Conf. on Distributed Computing Systems, pp. 60–69, 1997.
[26] Y.M. Wang and W.K. Fuchs. Lazy Checkpoint Coordination for Bounding Rollback Propagation. In Proc. of the 12th Symp. on Reliable Distributed Systems, pp. 78–85, Oct. 1993.
[27] B. Yao, K. Ssu and W.K. Fuchs. Message Logging in Mobile Computing. In Proc. of the 29th Symp. on Fault Tolerant Computing Systems, pp. 294–301, Jun. 1999.
