
Virtually Synchronous Methodology for Building Dynamic Reliable Services

Ken Birman, Cornell University, [email protected]
Dahlia Malkhi, Microsoft Research Silicon Valley, [email protected]
Robbert Van Renesse, Cornell University, [email protected]


ABSTRACT

There has been considerable interest in reliability services such as Google's Chubby and Yahoo!'s ZooKeeper, and in the State Machine Replication model, the standard way of formalizing them. Yet traditional SMR treatments omit a formal analysis of reconfiguration as actually implemented in production settings. We develop such a model; it ensures that members of the new configuration start with full knowledge of the finalized state of the prior configuration. To evaluate the approach, we develop a new implementation of atomic multicast and evaluate its performance under a range of reconfiguration scenarios.

Specifying a Dynamic Reliable Service

Here, we develop a formalism for Dynamic Reliable Services (DRS) in the same style that has become standard for reasoning about SMR and Paxos. There are three components: a service safety specification, a liveness model, and a solution for reconfiguring the service replicas.

1. INTRODUCTION

Our paper is motivated by 30 years of building reliable replicated systems, but more specifically responds to recent interest in massive cloud computing systems. These must guarantee availability even as nodes are added, fail, are taken out of service, and services are migrated. The paper focuses on service reconfiguration (sometimes called dynamic membership): the problem of replacing the set of replicas running in the old configuration with a new set, without disrupting service or introducing inconsistencies. In particular, new service replicas need to learn the final state of the terminated configuration. To be reliable, a service must trigger reconfigurations in response to failures, but a service may also reconfigure for other purposes, such as to adapt to changing loads. With this in mind, we explore the following questions:

- State Machine Replication (SMR) and protocols such as Paxos are widely popular, but should every reliable service be designed using SMR? If not, then what guarantees should a "dynamic reliable service" provide?
- To what extent can a dynamic reliable service be opaque to its clients? Must the clients of a service be aware of its dynamically evolving configuration?
- Reconfigurable Paxos is rarely deployed, and the powerful formal tools used to reason about Paxos and SMR under steady-state conditions are rarely applied to reconfiguration. Can reconfiguration be simplified, taken off the critical path, and reasoned about in a "modular" manner?
- At a foundational level, what distinguishes a service with dynamic, self-managed membership from one in which membership is confined to a statically defined set?


This work was supported, in part, by grants from NSF and AFRL and by Microsoft Corporation.

Consistent with good programming style, it is desirable to conceal the implementation of a service from its clients. Accordingly, our specification focuses on the service as a whole, as opposed to the server instances of which it is composed. More specifically, we offer clients a service API (e.g., read/write or send/deliver). Our safety requirement is that the service API satisfy linearizability, a widely accepted consistency guarantee [12]. As we will see below, linearizability does not necessitate SMR; we demonstrate that our approach gives more flexibility in implementing a linearizable service than classical SMR. Our approach to liveness draws heavily on prior work (by ourselves and others) on a dynamic group communication model called virtual synchrony (VS) [6][26]. DRS departs from this earlier work by introducing a new VS definition that maintains linearizability in the context of a formal model for dynamic systems introduced in [1]. The liveness model gives clients a handle for triggering reconfiguration through a Reconfig API; our objective is to carry out reconfiguration in a manner that maximizes liveness and preserves linearizability. The model minimizes constraints on the service, and has advantages relative to both earlier VS models and the SMR model.

2. The DRS Model in a Nutshell

The Need for a DRS Model

Consistency mechanisms play a central role in highly available applications. In cloud and data center settings, these mechanisms are often prepackaged for developer convenience; well-known examples include Microsoft's Boxwood [22], Google's Chubby [8] and Yahoo!'s ZooKeeper [14]. Under the surface, the SMR model, often realized using the Paxos protocol, plays a foundational role. Yet SMR is generally not used as the main replication engine in any of these systems, for several reasons. First, highly available services need to be reconfigurable, for example to permit service migration within the data center (while the service is running), to change protocol parameters, etc. A reconfigurable Paxos exists, but is rarely used: Lamport, Malkhi and Zhou observe in [21] that reconfigurable Paxos permits anomalous behaviors, for example by permitting multiple (possibly conflicting) reconfiguration decisions to be chosen by the current configuration. Second, SMR guarantees that all operations are applied in the same order at all replicas, a property that may be more than is needed and may waste valuable resources. Finally, in many replication settings a single process performs client operations, resulting in a stream of state updates that must be performed in the order in which they are produced. SMR guarantees that different replicas perform updates in the same order, but without guaranteeing that any particular order will be used. For example, Junqueira et al. in [24] present a practical scenario in which Paxos leaders attempt to execute client commands in accordance with an "intended" order, but Paxos fails to respect this intention, causing problems at the client layer. We will justify each of these points more carefully below.

Although SMR replication may seem to be a de facto standard today, the approach is not the only option. In particular, Virtual Synchrony group communication substrates provide a useful but different form of reliability than SMR. However, the VS model does not require linearizability, and some implementations fail to provide this guarantee. Additionally, VS ensures that members of the current configuration that survive into a succeeding configuration will agree on the commands chosen (e.g., messages delivered) in the current configuration; this guarantee does not extend to members excluded from the new configuration. VS also requires that the members of the next configuration constitute a majority of the current one, an unnecessary constraint as we shall see shortly.

Our DRS model subsumes and improves upon both branches of prior work. Relative to VS, DRS plugs the gaps just identified. In contrast to SMR, it can be applied to systems that are not constructed from deterministic state machines. Moreover, the solutions obtained with DRS are faster in some important cases, and eliminate the anomalous behaviors mentioned above. Finally, DRS exploits a recently proposed dynamic liveness model [1]. Informally, DRS offers a "virtually synchronous Paxos", combining the best aspects of each of these prior models to offer a fast, dynamically live, and consistent model.

The DRS Solution

Our methodology centers on two building blocks:
1. While terminating the current configuration, DRS forms a consensus decision on both the new configuration and on the exact set of commands selected by the current configuration (which is then frozen).
2. DRS then transfers the selected commands to the new configuration and starts it.

Note how our approach differs from VS. First, all members of the current configuration are bound by agreement on the set of chosen commands, not just the members that remain as participants in the new configuration. Second, the VS requirement that a majority of members of the current configuration remain as members of the new configuration is eliminated; instead, we only require that the members of the new configuration know the set of commands chosen in the current configuration.

To illustrate these points, we describe one example of an important service that does not require SMR's full strength: a Reliable Multicast Service. Reliable Multicast offers a Send and Deliver API. The Deliver call must return all the messages previously Sent or Delivered. An F-fault-tolerant implementation uses a group of 2F+1 servers to store messages. Send(M) stores M on at least F+1 servers. To implement Deliver, we need two exchanges with a group of F+1 servers: one exchange to learn the set of Sent messages, and a second to guarantee that each retrieved message is stored on F+1 servers. Interestingly, we can implement Deliver via a single exchange if we are willing to wait for responses from all servers. Although this approach blocks if any server fails, DRS offers a way out: the service can be reconfigured by a command to drop failed servers (or to add new ones). Progress is thus restored, and we can build a single-exchange protocol that also maintains liveness, so long as reconfiguration is able to make progress.

Contrast this approach with a multicast implemented using SMR. One would start with an opaque state machine that embodies a deterministic implementation of the basic API. SMR would then perform a consensus decision on each Send/Deliver operation. These will be more costly than our single exchange. Moreover, traditional ways of handling dynamic membership in SMR systems are awkward, as we will see below. Thus one arrives at a correct solution, but in so doing accepts several kinds of costs. Readers familiar with classic atomic Read/Write storage emulation techniques [5] may note that the DRS approach permits an elegant and lightweight solution with the same benefits as for multicast: instead of the standard two-phase read/write procedures, we can implement a single-round procedure that writes to all 2F+1 replicas, relying on DRS reconfiguration for progress in the same manner described above.
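The trade-off above can be made concrete with a minimal in-memory sketch (names invented for illustration): Send(M) succeeds once F+1 of the 2F+1 servers store M, while the single-exchange Deliver requires answers from all N servers, so a single crash blocks it until reconfiguration drops the failed server.

```python
# Illustrative sketch, not the paper's implementation: 2F+1 servers,
# quorum-based Send, and the blocking single-exchange Deliver.

F = 2
N = 2 * F + 1                        # 2F+1 servers tolerate F faults
servers = [set() for _ in range(N)]  # each server's stored messages

def send(msg, reachable):
    """Store msg on the reachable servers; succeeds once F+1 hold it."""
    acks = 0
    for s in reachable:
        servers[s].add(msg)
        acks += 1
    return acks >= F + 1

def deliver_single_exchange(reachable):
    """One round, but requires answers from ALL N servers."""
    if len(reachable) < N:
        return None                  # blocks: reconfiguration must step in
    # a message is delivered only if every server stored it
    return set.intersection(*(servers[s] for s in reachable))

assert send("m1", range(N))                           # all servers reachable
assert deliver_single_exchange(range(N)) == {"m1"}
assert deliver_single_exchange(range(N - 1)) is None  # one server down: stuck
```

The `None` return models the blocking described in the text; in the full protocol, a Reconfig command would remove the unresponsive server and restore progress.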

Reconfiguration: The Key to DRS

The above examples illustrate our central point. SMR performs a consensus decision for each and every command, and this may be more costly than necessary. DRS permits steady-state behavior that would block in the event of failure, but then employs a reconfiguration consensus decision in the event that progress stalls. The advantage is that the steady-state protocol is simplified without a performance cost (indeed, it may be quite a bit cheaper than a protocol built for reconfigurable SMR, as we demonstrate below). One thus obtains a faster service, and one more easily proved correct, perhaps using one of the machine-assisted theorem provers that have been applied to non-reconfigurable Paxos and other SMR solutions.

We treat reconfiguration as a command ("Reconfig") that can be issued by a component of the system that has noticed a problem, which we model as an "administrative" client. Reconfiguration must preserve linearizability in the face of dynamic configuration changes, which it does by suspending the current configuration from processing further client requests. It then takes a snapshot of commands executed in the current configuration, forms a consensus decision in the current configuration on these commands, and finally transfers the commands to the new configuration. If a command is not in the snapshot, we guarantee that no member of the current configuration will subsequently execute it as part of that configuration. Because reconfiguration is used to make progress in the event of component failures, the steady-state protocol can often be simplified, as we saw in the examples introduced above. Our Reliable Multicast needs just two message delays for Send/Deliver; if a failure occurs, it pauses, reconfigures the service to drop sluggish or faulty participants, and then moves on. Reconfig provides the new configuration with sufficient information about past messages to continue generating correct responses as new client requests are issued. Moreover, while reconfiguration is more costly than the steady-state protocol, the overall cost of the solution can be quite low, because steady-state operations will be frequent while reconfigurations should be rare.

We noted that one can also solve Reliable Multicast by a simple reduction from state machine replication. However, had we done this, the reduction would have introduced an extra message delay or additional processes. Our DRS model also has an interesting advantage relative to the usual way of handling reconfiguration in SMR protocols. In Paxos, reconfiguration is accomplished via control commands that are interjected into the sequence of state machine commands. To support this mechanism, Paxos includes a synchronization barrier that potentially delays command execution during normal, faultless periods: the protocol waits for delivery of a consensus decision on past offsets in the total order of commands, to determine whether the current configuration may continue processing client requests. But the synchronization barrier itself is costly, which may help explain why reconfigurable Paxos is rarely deployed.

Intuitively, one can understand DRS as a model that includes a synchronization barrier similar to the one in reconfigurable Paxos, but implements it off the critical path, an idea that traces to VS. We can apply DRS to traditional SMR, and when we do so, we obtain a rigorously formalized form of reconfigurable SMR in which the synchronization barrier mentioned above does not impact steady-state protocols. The resulting solutions will often have performance substantially better than that of traditional reconfigurable Paxos. This insight is not entirely new: the method has been discussed using the notion of a Stoppable State Machine in [21], and discussions of view change protocols in older systems such as Isis [6], Horus [26], and many other works in the field reflected similar motivations.

In summary, DRS leverages an idea from VS to obtain fast reconfigurable versions of important core functionality such as atomic multicast and atomic R/W storage, or more elaborate high-level services such as the data center components listed earlier. In doing so it overcomes weaknesses in VS (such as the failure to require linearizability). While DRS is expressed using an SMR formalism, the model is more general than SMR (because it can be applied to non-state-machine services), and it eliminates an unwieldy synchronization barrier that has prevented widespread deployment of the Paxos reconfiguration protocol.

3. Problem Statement and Model

3.1 Safety

Our interest is in services with persistent state that must survive reconfiguration, that support commands to manipulate the state, and that have state-dependent responses. In contrast to SMR, a reliable service need not be specified using a state machine, although the services do have state. In particular, a state machine deterministically executes commands one at a time. The state machine model seemingly precludes many styles of multicore parallelism, and rules out services that can concurrently execute operations on behalf of large numbers of clients. As we will see, the "pure" question of reliability on which we focus here has little to do with implementation decisions within the service members.

Recall from the introduction that many services require linearizability. This property looks at executions from the point of view of clients. Each command is started with an invoke event and ends with a response event. An execution history is a series of events at different processes, as if timestamped by an external observer whose clock granularity is infinitely fine (so that no two events coincide). A history is composed of any sequence of invocation/response events. A history is linear if every invoke is immediately followed by its response: operations are invoked one by one. We define a service by giving a sequential specification, specifying what responses are allowed for a command C given the linear history that precedes C's invocation. Adhering to a sequential specification requires us to have a way of mapping an execution with concurrent operations into an indistinguishable sequence in which operation responses are legal.

Linearization: Let Γ be a history. A linear history Γ′ is a linearization of Γ if the following holds:
1. Γ′ contains all of the invoke/response pairs that appear in Γ.
2. Some, or all, of the invoke events without a response in Γ may appear in Γ′ with matching responses.
3. Invoke/response events may be re-ordered in Γ′, as long as an invoke event that follows a response event in Γ follows it in Γ′ as well.

Linearizability: An execution history is linearizable if there exists a linearization in which operation responses adhere to the service specification. Implementing a Dynamic Reliable Service entails writing code that, when executed, can be proved to generate only executions that are linearizable with respect to the service specification.
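The definition can be exercised with a brute-force checker, instantiated here for the Reliable Multicast sequential spec (Deliver must return every message whose Send or Deliver completed earlier). This is an illustrative sketch only: it has factorial cost, and it simplifies rule 2 by considering only completed operations.

```python
# Brute-force linearizability check for the multicast sequential spec.
# History entries are (invoke_time, response_time, op, arg, result).

from itertools import permutations

def legal(seq):
    """Does a linear history obey the multicast sequential spec?"""
    sent = set()
    for op, arg, result in seq:
        if op == "send":
            sent.add(arg)
        elif result != sent:          # Deliver must return exactly the
            return False              # previously sent messages
    return True

def linearizable(history):
    for perm in permutations(history):
        # rule 3: re-ordering must respect real-time precedence
        ok = all(not (a[1] < b[0] and perm.index(a) > perm.index(b))
                 for a in perm for b in perm)
        if ok and legal([(op, arg, res) for _, _, op, arg, res in perm]):
            return True
    return False

h = [(0, 3, "send", "m1", None),       # Send(m1) overlaps the Deliver
     (1, 2, "deliver", None, set())]   # Deliver returned {} -- allowed
assert linearizable(h)
h2 = [(0, 1, "send", "m1", None),      # Send(m1) completed first...
      (2, 3, "deliver", None, set())]  # ...so Deliver must return {m1}
assert not linearizable(h2)
```

The first history is linearizable because the concurrent Deliver may be ordered before the Send; the second is not, since the completed Send precedes the Deliver in real time.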

3.2 Liveness

In a static system, a typical liveness requirement is that "no more than F out of N processes become faulty". In a long-lived system, we wish to reconfigure the system dynamically when failures occur, rather than relying on a fixed resilience threshold. We adopt a fault-tolerance model from [1] that captures dynamism. Clients initiate reconfiguration using the Reconfig command. This operation generates no output; a client simply invokes it and waits for its completion. Reconfig operations determine the current configuration, which in turn determines when progress is guaranteed, e.g., when a majority of the current configuration is alive. Whatever the system does internally in response to Reconfig requests is of no interest to the fault model.

The extension of the fault model to cover dynamic membership introduces some subtleties. A fault model specifies conditions under which progress is guaranteed. A dynamic fault model serves the same purpose, but must do so even in a service that can be reconfigured. In particular, when a service reconfigures, some members may permanently depart, and others may be added that were not participants at the outset. We do not want faults that crash departed processes to "count against" the fault-tolerance of our service, nor should faults impacting newly joined members be neglected, despite the fact that until they were added, they were not part of the service. Accordingly, our model specifies not just the threshold on failures that must hold, but also the relationship between the threshold and the Reconfig interface. We impose limitations on use of Reconfig: if those are respected, and fewer than a threshold of failures occur, the system will make progress.


Although more complex than the usual SMR fault model, these enhancements provide useful flexibility lacking in traditional SMR solutions. For example, an administrative client can deploy machines to replace faulty ones, and thereby enhance system longevity. With this power come some limits: if used carelessly, reconfiguration might cause the service to halt, as when servers are capriciously removed from the system, leaving too few for safe progress.

Whereas SMR is often portrayed as a “black box” solution for building a reliable service from a state machine component, implementing Reconfig has a service-specific element.

Formally, we define dynamic liveness as follows:
1. Denote by Current(t) the processes in the system at time t (defined by all the completed Reconfig commands).
2. Denote by AddPending(t) the processes whose Add is pending at t.
3. Likewise, denote by RemovePending(t) the processes whose Remove is pending at t.
4. Faulty(t) is the set of processes that have crashed by t.

We do not assume any bounds on message latencies or message processing times (i.e., the execution environment is asynchronous), and messages may get lost, re-ordered, or duplicated on the network. However, we assume that a message that is delivered was previously sent by some process and that correct processes can eventually communicate any message. We assume that processes fail only by crashing. Finally, we are obviously bound by the usual impossibility of consensus, and must rely on the possibility of electing an eventual leader, in order to guarantee termination for Reconfig decisions.

4. Reconfiguration using Virtual Synchrony





Choosing commands: We say that a command C was chosen if its response event has occurred. The core of the DRS approach entails capturing the set of chosen commands and transferring this information to the members of the new configuration. This action begins when a client invokes the Reconfig command. Carrying this out entails:


We require that at any time t, fewer than |Current(t)|/2 of the processes contained in Current(t) ∪ AddPending(t) are in Faulty(t) ∪ RemovePending(t). In practice, of course, this requirement is really an assumption: we cannot prevent a run from violating it, but if that were to occur, the system as a whole would not guarantee further progress. For purposes of initialization, we stipulate a set of Reconfigs that have "completed". This uniquely determines a non-empty initial configuration Current(0) which is known to all processes. Once "bootstrapped", membership changes move the system from configuration to configuration; they are not applied while a given configuration holds. Thus our system should be visualized as a series of well-defined configurations, which can accumulate pending requests in the manner just shown, eventually apply them to arrive at a new configuration, and then begin accumulating a new set of changes.
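The threshold requirement above is a direct predicate on sets, and can be transcribed as follows (a sketch; this is an assumption on runs, not something a protocol can enforce):

```python
# Transcription of the dynamic liveness threshold: at every time t, fewer
# than |Current(t)|/2 of Current(t) | AddPending(t) may lie in
# Faulty(t) | RemovePending(t).

def dynamic_liveness_holds(current, add_pending, remove_pending, faulty):
    suspects = (current | add_pending) & (faulty | remove_pending)
    return len(suspects) < len(current) / 2

# Three servers, one crashed: a majority of the configuration survives.
assert dynamic_liveness_holds({"A", "B", "C"}, set(), set(), {"C"})
# Removing B while C is crashed leaves only A: progress no longer guaranteed.
assert not dynamic_liveness_holds({"A", "B", "C"}, set(), {"B"}, {"C"})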


- Forming agreement among members of a configuration on a set Π of commands that are chosen within the configuration. DRS reconfiguration must guarantee that any command C that may ever be chosen in the configuration is included in Π.
- Performing an agreement decision to terminate the current configuration, and to agree on the Reconfig command itself.
- Transferring Π to the new configuration and then permitting the new configuration to start choosing commands of its own (outside Π). Formally, the transfer is done by choosing every command in Π again in the new configuration in legal order, but suppressing duplicate response events at the clients. However, this describes the way we model the action, not the way it would be implemented. An implementation would probably transfer a checkpoint, not the full command set. If a server belongs to successive configurations, which should be common, it might not need to "do" anything at all.

Fix some desired service, and assume that we have a linearizable implementation of the service within any fixed configuration. For a DRS to maintain correctness under reconfiguration, we:
1. Suspend the service execution.
2. Form agreement on a set Π of chosen commands; Π might include extraneous (partially chosen) commands. If there is any command C in Π which is not known to be chosen, wait until it is chosen; from this point onward, no unchosen command will ever be chosen.
3. Invoke again in the new configuration all commands in Π, in any linearization order, but without actually sending responses to clients. Then resume execution at the new configuration.

The above steps establish a single configuration, which chooses commands and possibly terminates with a reconfiguration decision. We then instantiate the solution again and again, so that each new configuration takes the state of the prior configuration as its starting state, and each ends with a reconfiguration that uniquely defines the next new configuration and its starting state.
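The numbered steps can be sketched as one transition between configurations. The `Config` class and its methods are invented for this illustration; in particular, real agreement on Π involves consensus, modeled here as simply freezing the log.

```python
# Illustrative sketch of the suspend / agree-on-Pi / replay cycle.

class Config:
    def __init__(self, members, replay=()):
        self.members = frozenset(members)
        self.log = list(replay)      # commands re-chosen in legal order
        self.suspended = False

    def suspend(self):
        self.suspended = True        # step 1: no further client requests

    def choose(self, cmd):
        assert not self.suspended, "terminated configurations choose nothing"
        self.log.append(cmd)

def reconfigure(old, new_members):
    old.suspend()                             # step 1
    pi = list(old.log)                        # step 2: agree on chosen set Pi
    return Config(new_members, replay=pi)     # step 3: replay Pi, resume

c0 = Config({"A", "B", "C"})
c0.choose("send m1")
c1 = reconfigure(c0, {"D", "E", "F"})
assert c1.log == ["send m1"]         # new configuration starts from Pi
```

Note how the replay in step 3 happens before the new configuration accepts fresh commands, mirroring the requirement that the new members start with full knowledge of the prior configuration's finalized state.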

In this section, we sketch the high level DRS approach to implementing Reconfig in a way that is integrated with the service implementation, and maintains linearizability. In subsequent sections, we exemplify the approach in full using a specific (reliable multicast) service.

5. Dynamic Reliable Multicast

The key step is to transfer the state of the service from the current configuration to the next, before the old one is abandoned. In our model, the members of the new configuration can be thought of as reenacting the full history of executed commands, but doing so within a new system configuration. Responses seen by clients remain valid, insulated from any reconfigurations the service may perform.

We now describe a complete solution to the Dynamic Reliable Multicast problem. We first give a sequential specification of the service.

Reliable Multicast: Clients invoke Send and Deliver commands. Send returns an ACK. Deliver should return all messages whose Send or Deliver has completed.

We note that in the literature, the Deliver command is often implicit: clients subscribe by supplying a method that the platform invokes for automatic delivery. In effect, there is always a pending "Deliver" command. The solution has two parts: a steady-state protocol for sending and delivering messages during normal, stable periods; and a reconfiguration protocol. Note that we provide a non-optimized version, which serves only to illustrate correctness. In the description below, we denote in bold the occurrence of a response to commands, including the Reconfig command. These are vital for correctness.

The Roles

The protocol distinguishes among the following roles: Clients submit commands, Send/Deliver, to the service. Choosing commands and maintaining persistent knowledge of chosen commands are done by a set of Servers and a set of Leaders. Multiple roles can be, and often are, performed on the same physical machine. Our liveness model stipulates an initial configuration, of which a majority of servers are alive (until a reconfiguration completes). We call any majority set of servers a quorum.

Recall that DRS is concerned with implementing the service within a single configuration up to, and including, a possible reconfiguration decision. Configurations are then instantiated one after another. This raises the question of ensuring that clients can learn the current configuration: perhaps the service ran on nodes {A,B,C} yesterday, but today it runs on {D,E,F}. This practical problem lies outside our scope, but can be addressed, e.g., by updating a record in the Domain Name Service (DNS) or by informing clients each time the service reconfigures.

Steady-State Protocol

Send: A client sends a message to all servers. Servers acknowledge the message directly to the client. When every server has responded to the client's message, the Send response event occurs and the client may return. Below, we say for brevity that the message was chosen (rather than saying that the Send command relevant to the message was chosen).

Deliver: A client sends a query to all servers. Servers respond with all messages they received (or, for efficiency, with those not yet delivered by the client). Deliver returns to the client every message that was acknowledged by all of the servers.
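A toy trace of the steady-state protocol may help fix the "chosen" terminology (server names are invented): a message is chosen, i.e., its Send response event occurs, only once every server has acknowledged it, and Deliver returns exactly the fully acknowledged messages.

```python
# Illustrative trace: Send is chosen iff all servers acked; Deliver
# returns only messages stored on every server.

servers = {"s1": set(), "s2": set(), "s3": set()}

def send(msg, reached):
    """Store msg on the servers that answered; chosen iff all answered."""
    for name in reached:
        servers[name].add(msg)
    return set(reached) == set(servers)        # Send response event?

def deliver():
    """Single exchange with all servers; returns messages stored everywhere."""
    return set.intersection(*servers.values())

assert send("m1", ["s1", "s2", "s3"])          # chosen: all acked
assert not send("m2", ["s1", "s2"])            # partially stored, not chosen
assert deliver() == {"m1"}                     # m2 is not returned
```

The partially stored "m2" is exactly the kind of "possibly chosen" message that the reconfiguration protocol below must account for.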

Reconfiguration

Recall that clients initiate reconfiguration by issuing a Reconfig command. Forming a configuration-changing decision entails forming a consensus decision on the next configuration and on a set of chosen messages. This is almost a standard consensus problem, with the additional requirement that it is tied to the steady-state multicast protocol so as to uphold linearizability of Send/Deliver. In addition, subsequent to a consensus decision we need to transfer the state of chosen messages from the old configuration to the new one, in order to cope with our dynamic failure model. The consensus decision at the core of the reconfiguration protocol is carried out by leaders and servers. Leaders make use of unique integer ranks, repeatedly trying increasingly higher ranks until a decision is reached. The leader protocol of a particular rank has two phases.

Phase 1: A leader performs one exchange with a quorum of servers. This is always possible under our liveness assumption. When the leader hears back from a quorum of servers, it learns: (1) either a reconfiguration command rc, which might have been chosen; rc includes a set φ_rc of possibly chosen messages (in case of multiple possibly chosen rc's, the one whose leader rank is highest is selected); or (2) that no reconfiguration command was chosen, along with a set φ of messages that might have been chosen. In this exchange, the leader also obtains a commitment from the servers not to respond to any Send/Deliver request from clients, and to ignore future proposals from any leader that precedes it in the leader order.

Phase 2: The leader performs another single exchange with a quorum of servers. If case (1) applies, it tells servers to choose rc and every message in the set φ = φ_rc. Otherwise, in case (2), it proposes the new configuration rc that its client requested, which includes the set φ; and it tells the servers to choose every message in φ. In either case, when a quorum of servers replies to the leader's proposal, the Send response event occurs (if it did not occur earlier) for every message in φ. We emphasize here that responses to different leaders and to Send requests cannot be "combined" to form a response event.

The server's protocol is to respond to a leader's messages, unless it was instructed by a higher-ranking leader to ignore this leader:
- In phase 1, it responds with the list of (possibly) chosen messages, and with the value of a reconfiguration proposal rc of the highest-ranking leader it knows of, or empty if none.
- In phase 2, it acknowledges a leader's proposal and stores it.
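A simplified, failure-free walk-through of the two leader phases may clarify the message flow. The data structures and message shapes are invented for illustration; real runs involve competing ranks, retries, and network messages.

```python
# Illustrative walk-through: one leader, rank 1, quorum of 2 out of 3.

servers = [
    {"stored": {"m1"}, "rc": None, "promised": 0},
    {"stored": {"m1", "m2"}, "rc": None, "promised": 0},
    {"stored": {"m1"}, "rc": None, "promised": 0},
]
QUORUM = 2

def phase1(rank, quorum):
    """Collect possibly chosen messages; servers promise to ignore lower
    ranks and to stop answering Send/Deliver."""
    phi, prior_rc = set(), None
    for s in quorum:
        s["promised"] = rank
        phi |= s["stored"]
        if s["rc"] is not None and (prior_rc is None or s["rc"][0] > prior_rc[0]):
            prior_rc = s["rc"]       # adopt the highest-ranked prior proposal
    return prior_rc, phi

def phase2(rank, quorum, new_members, phi):
    """Propose (rank, new config, phi); chosen once a quorum stores it."""
    acks = 0
    for s in quorum:
        if s["promised"] <= rank:
            s["rc"] = (rank, frozenset(new_members), frozenset(phi))
            acks += 1
    return acks >= QUORUM

prior, phi = phase1(1, servers[:2])
assert prior is None and phi == {"m1", "m2"}   # m2 only possibly chosen
assert phase2(1, servers[:2], {"D", "E", "F"}, phi)
```

Note how "m2", stored on only one quorum member, ends up in φ anyway: phase 1 must over-approximate the chosen set, and phase 2 then makes every message in φ chosen for real.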

State Transfer

Upon reconfiguration, the knowledge of all commands chosen in the past, held by the current set of Servers, needs to be transferred to the new set. Our liveness condition mandates that the transfer complete before reconfiguration is done. Accordingly, the reconfiguration procedure includes a state transfer part: any learner that learns (e.g., from a quorum of servers) a reconfiguration decision rc, along with the set φ of chosen messages, can start the new configuration by (i) transferring φ to every server in the new configuration rc, and (ii) upon receiving acknowledgement from all servers in rc, responding to the client's reconfiguration request. After step (ii) completes, the Reconfig(rc) response event occurs and the new configuration can commence operation. It may begin processing client commands, and may itself reconfigure. To reiterate a point made earlier, notice that after step (ii) has been performed, our failure model presumes (only) that a majority of rc remains available. In particular, at that time, there is no guarantee that a quorum of the previous configuration remains available. This poses no risk, because any knowledge held by that previous configuration has been transferred to the new configuration. Indeed, the previous configuration may be garbage-collected at this time.
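The two-step transfer can be sketched as follows (function and variable names invented; real servers would acknowledge over the network):

```python
# Illustrative sketch of steps (i) and (ii) of state transfer.

def state_transfer(chosen, new_servers):
    for s in new_servers:
        s.update(chosen)             # (i) transfer the chosen set to every
                                     #     server of the new configuration
    assert all(chosen <= s for s in new_servers)
    return "Reconfig response"       # (ii) all acked: response event occurs

new_cfg = [set(), set(), set()]
assert state_transfer({"m1", "m2"}, new_cfg) == "Reconfig response"
assert all(s == {"m1", "m2"} for s in new_cfg)
```

Requiring acknowledgements from *every* new server (not a quorum) is what lets the old configuration be garbage-collected: afterwards, a majority of the new configuration alone suffices to recover the transferred state.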

6. Comparing DRS with SMR and Other Work

6.2 A Direct Implementation: DRS-Stoppable State Machines

To build SMR, we need to form agreement among a set of Servers on a totally ordered sequence of state machine commands, which are requested by Clients. The sequence needs to be learned and executed by Replicas.

We can give a direct implementation of SMR that uses the DRS approach for reconfiguration. Although our steady-state mode could employ the Paxos Synod protocol, which is resilient to the failure of a minority of the current configuration, we may simply use a fixed leader. When the leader fails, we reconfigure to facilitate progress. Reconfiguration is implemented using a protocol similar to the Dynamic Reliable Multicast protocol above, and we do not repeat it here. This approach to implementing SMR reconfiguration has already been alluded to at a high level in [21] under the name Stoppable State Machines.

In a static, fault-free environment, it is easy to form a reduction from SMR to Reliable Multicast: Clients send command requests to a designated member of the configuration called the Sequencer. The Sequencer acts as a client, attaches a monotonic counter to its own requests, and sends them in multicast messages. Replicas deliver the requests (possibly out of order) and apply client commands to their local copy of the service state in sequence order, using the counter inside the messages for ordering, not the delivery order.

Implementing SMR in this manner in dynamic settings has been far from obvious. During the 90s, there was great interest in building layered middleware for reliable distributed applications; see, e.g., Horus [26], Transis [9], Cactus [13], and others. Several PhD theses, including [4], [15], and [23], were devoted to solving the following problem: given a VS membership-view service, how can we build dynamic SMR on top? With our DRS solution, we are in a position to address this challenge elegantly. In particular, we can use our Reliable Multicast solution to build a dynamic SMR solution exactly in the manner described above, i.e., going through the Sequencer within each configuration. The key point is that during reconfiguration of the set of servers, replicas learn the set of chosen commands in the current configuration.

It is worth noting that sequence gaps may exist in the set of chosen messages. That is, during steady-state operation, a replica might deliver a Sequencer's 28th message without having delivered the 27th. Normally, a replica delays execution of the 28th command until delivery of the 27th message. However, a reconfiguration decision may exclude the 27th message from the set of chosen commands so as to facilitate progress. In that case, the replica can deduce that existing gaps are never going to be filled and skip them.
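The sequencer reduction and the gap-skipping rule can be sketched as follows. This is illustrative Python; the Replica class and its method names are our own invention, not the paper's Erlang code.

```python
class Replica:
    """Toy replica: executes sequenced commands in counter order,
    buffering deliveries that arrive out of order."""
    def __init__(self, first_seq=1):
        self.next_seq = first_seq
        self.pending = {}        # seq -> command, delivered but not executed
        self.log = []            # commands executed, in sequence order

    def deliver(self, seq, cmd):
        self.pending[seq] = cmd
        self._drain()

    def _drain(self):
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def reconfigure(self, chosen_seqs):
        # The reconfiguration decision fixes the chosen set; any gap it
        # excludes (e.g., a missing 27th message) will never be filled,
        # so the replica skips it and resumes execution.
        while self.next_seq not in self.pending:
            if self.next_seq > max(chosen_seqs, default=0):
                break
            if self.next_seq not in chosen_seqs:
                self.next_seq += 1       # permanent gap: skip it
            else:
                break                    # chosen but not yet received
        self._drain()

r = Replica(first_seq=26)
r.deliver(26, "a")
r.deliver(28, "c")                    # the 27th message never arrives
r.reconfigure(chosen_seqs={26, 28})   # decision excludes 27
print(r.log)   # ['a', 'c']
```

During steady state, `deliver(28, ...)` alone would block behind the missing 27th message; only the reconfiguration decision licenses the skip.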

6.2 Complexity

This SMR protocol can be implemented on top of our Reliable Multicast protocol with a single additional message delay (from clients to the sequencer). In the general case², the overall number of message delays from a client's initiation to obtaining responses from F+1 replicas is 4: (1) clients submit a command request to the sequencer, (2) the sequencer communicates with the servers, (3) servers send acknowledgements to the replicas, (4) replicas execute commands and send responses to clients. Notice that this is precisely the same number of message delays as incurred in standard Paxos. We note that in many implementations, such as the Primary-Backup (PB) one we describe below (Section 8), Servers are the Replicas, and hence client commands are served in just three message delays.


² This delay can be avoided if the Sequencer role is played by one of the Clients, and that Client is also the initiator of the commands. With a bit of engineering ingenuity, it is even possible to arrange that this will be the usual situation and hence that sequencing can be "free".

6.3 A Note Concerning Paxos

Paxos derives a solution to SMR from the desired end result, namely, the unique order of commands executed by the state-machine replicas. This derivation makes it easy to argue correctness, and it ignores the different epochs and distinct leaders that choose these commands. On the other hand, one unfortunate consequence of this clean approach is that distinct leaders contending for the same commands may cause violation of some intuitively desired properties.

The problematic scenario is precisely the one described above, where our DRS reconfiguration resolves the fate of 'holes' per epoch. In Paxos, one leader proposes command A at offset 27 and command B at offset 28. Neither reaches a full quorum, and neither commits. Now another leader starts. In its phase 1, it finds no traces of commands A or B. It proposes commands C and D at 27 and 28, neither of which reaches a quorum either. Finally, a third leader starts. This third leader queries a quorum and finds traces of C at 27 and of B at 28. It must ask a quorum to accept C at 27 and B at 28. In this case, commands may be executed in an order that satisfies the consensus correctness statement, and yet the chosen order violates the "intended" order from the perspective of system clients, which may perceive the service as faulty. After all, the original plan was for command B to follow A.

In some sense the problem reflects a confusion of notation: from the perspective of Paxos, the "leaders" that actually issued the commands are viewed as clients, and the commands themselves are unrelated. Thus Paxos is operating correctly according to its specification. Yet, if we set semantic debates to the side, this particular case was sufficiently troublesome to convince the ZooKeeper engineers to move away from Paxos and to implement a virtually-synchronous type of leader election in their system.
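The contending-leaders scenario can be replayed with a toy model of Paxos acceptors. This is a simplified sketch: ballots and the phase-1 "adopt the highest-ballot value" rule are standard Paxos, but the helper names and the five-acceptor setup are our own, and leader 2's phase 1 is elided.

```python
acceptors = [dict() for _ in range(5)]   # per-acceptor map: slot -> (ballot, cmd)

def accept(ballot, slot, cmd, targets):
    # A subset of acceptors accepts (ballot, cmd) for a slot.
    for a in targets:
        if ballot > a.get(slot, (-1, None))[0]:
            a[slot] = (ballot, cmd)

def phase1_pick(slot, quorum):
    # A new leader must adopt the highest-ballot value its quorum reveals.
    seen = [a[slot] for a in quorum if slot in a]
    return max(seen)[1] if seen else None

# Leader 1 (ballot 1): A@27 and B@28 each reach only a minority.
accept(1, 27, "A", acceptors[0:2])
accept(1, 28, "B", acceptors[3:5])
# Leader 2 (ballot 2), whose phase 1 saw no traces of A or B, proposes
# C@27 and D@28, again reaching only minorities.
accept(2, 27, "C", acceptors[2:4])
accept(2, 28, "D", acceptors[0:2])
# Leader 3 (ballot 3) queries the quorum {2, 3, 4}: it is forced to adopt
# C at 27 and B at 28, breaking the intended A-before-B order.
quorum = acceptors[2:5]
print(phase1_pick(27, quorum), phase1_pick(28, quorum))  # C B
```

The toy model shows that no single leader chose the final pairing: it emerges from the leftovers of two failed ballots, exactly as described above.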
Our scenario turns out to have even more serious ramifications for reconfigurable Paxos, which we can now explain. Reconfigurable Paxos allows reconfiguration of the servers that implement the state machine by injecting configuration-changing commands into the sequence of state machine commands. Servers need to form agreement about the position of a configuration-changing command in the sequence of commands, just like any other state machine command. This is a natural use of the power of consensus inherent in the implementation of SMR. A limitation associated with Paxos reconfiguration is that commands are processed in the new configuration before the old configuration has terminated. With virtually synchronous SMR, this situation is eliminated: the state of the prior configuration is transferred to the Learners in the new configuration before the first command is chosen in the new configuration.

Earlier, we alluded to the consequences of this design, but now we can see precisely why it entails a concurrency barrier on the steady-state command path: Suppose we form a reconfiguration command, e.g., at index y in the sequence of commands. The command determines how the system will form the agreement decision on subsequent commands, e.g., at index y+1 onward. More generally, the configuration-changing command at y may determine the consensus protocol for commands starting at index y+α, for some pre-determined concurrency parameter α. We must wait for a reconfiguration decision at y to complete before we inject additional command requests into the system, at index y+1 onward (or y+α in the general case). Even though each index may contain a decision on a batch of actual commands, this barrier may not be desirable, and often reconfigurable Paxos (if deployed at all) is deployed with α greater than 1 [10].

Now, consider the behavior of a Paxos-based SMR solution in which the concurrency window α is larger than 1. Recall the scenario of contending leaders above, and suppose that commands C and B are reconfiguration commands, which are mutually incompatible. This may cause the current configuration to issue two (or more) Reconfig commands, up to a maximum of α commands. Suppose that all of these were intended to apply to the configuration in which they were issued. The first to be chosen will update the configuration as usual. But what are we to do when a second or subsequent command is chosen? These commands may no longer make sense. For example, a pending request to remove a faulty server from the configuration might be chosen after one that switches to a configuration in which that server is no longer a member. Executing these reconfigurations one after another is nonsensical.
Likewise, a command to change a protocol parameter might be executed in a context where the system has reconfigured and is now using some other protocol within which that parameter has a different meaning, or no meaning at all. If we use a window α larger than 1, then such events will be possible. An approach that seeks to prevent these commands from being chosen once they are no longer meaningful would require us to implement complex semantic rules. Allowing them to be chosen forces the application designer to understand that a seemingly "buggy" behavior is actually permitted by the protocol. In practice, many Paxos implementations (indeed, all that we know of) either don't support reconfiguration at all, or set α to 1, thus serializing command processing: only one command can be performed at a time. Batching commands can alleviate this cost only to a limited degree. Our remarks may come as a surprise to readers familiar with SMR and Paxos, because many published presentations of the model and protocols omit any discussion of the complexities introduced by reconfiguration. Readers interested in learning more might refer to [21] for a tutorial on state machine reconfiguration.
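A toy model makes the window-α hazard concrete. This is illustrative Python: the take-effect-at-y+α rule follows the description above, while the names and data structures are invented for the sketch.

```python
ALPHA = 3
log = {}                    # index -> chosen command
config_at = {0: "C0"}       # index from which a configuration applies

def choose(index, cmd):
    log[index] = cmd
    if cmd.startswith("reconfig"):
        # A configuration-changing command chosen at index y only
        # takes effect at index y + ALPHA.
        config_at[index + ALPHA] = cmd

def config_for(index):
    start = max(i for i in config_at if i <= index)
    return config_at[start]

# Both Reconfigs were issued against C0; with ALPHA = 3 both can be
# chosen before either takes effect, so both are applied in turn even
# though the second may no longer make sense under the first.
choose(1, "reconfig->remove(S3)")
choose(2, "reconfig->replace-all")
print(config_for(3), config_for(4), config_for(5))
```

With α = 1, `choose(2, ...)` could not have been injected until the first reconfiguration decision completed, which is exactly the serializing barrier discussed above.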

6.4 Related Work on SMR and Paxos

SMART [10] is an SMR approach that supports reconfiguration. Each "configuration" is an instantiation of Paxos with a static set of participants, responsible for a range of commands, so that the ranges of different configurations do not overlap. A configuration can decide on the next configuration, and the new configuration takes effect α commands later, not unlike the idea described originally in [19]. When the range of the current configuration has been filled up, the clients are notified about the new configuration. Before a server in the next configuration can execute requests, it must obtain the state from the prior configuration. The work is loosely based on experience from building the Petal Distributed Virtual Disks system [25].

Generalized Paxos [20] is a mechanism for agreement on Command Structure Sets (C-Structs), which grow over time. Such a structure consists of a generalized notion of a set with partial ordering. C-Structs can be applied to a range of problems, including the full Paxos problem, Consensus, and Reliable Multicast, and weaker problem structures admit lower-latency solutions. However, Generalized Paxos does not guarantee linearizability, nor does it handle dynamism.

The RAMBO system [11] implements atomic read/write storage in dynamic settings. Reconfiguration is done by an auxiliary's decree. DynaStore [1] solves the same problem without a consensus building block at all. In both systems, clients may continue performing read/write operations when reconfigurations occur, rather than blocking as in DRS. However, the penalty is that state transfer is performed on-the-fly by clients, who must access multiple active configurations until they expire. The DRS approach generalizes the specific dynamic problem model in RAMBO and DynaStore, and gives a generic construction for linearizable services in dynamic settings. As mentioned briefly in the introduction, DRS can address the atomic storage problem with single-exchange read/write operations. DRS reconfiguration would defer activation of a new configuration until state transfer completes, but this simplifies the clients' role: they need to access only one configuration. It is an interesting open question whether a consensus-less reconfiguration approach like DynaStore exists for generic problem models, and how it compares to DRS.

Our DRS model was stated using a traditional SMR formalism.
A benefit is that one can reason about and prove properties of DRS solutions much as one reasons about and proves properties for SMR, and one can leverage a growing body of work that applies theorem-proving systems to SMR protocols, or uses high-level languages in conjunction with the SMR methodology [3]. Indeed, although the work lies beyond the scope of this paper, we are starting an effort that will use Cornell's NuPRL theorem-proving system [7][2] to formally verify the correctness of DRS applications.

7. DRS IMPLEMENTATION

We have developed a reference implementation of DRS in 109 lines of Erlang, a functional language widely used in modern web-services platforms because of its elegant support for concurrency and its high performance (for example, Facebook, eBay, Amazon, Orbitz, and SAP all have services implemented in Erlang). In Erlang, a node is a virtual machine, identified by a network address. A process runs on a node and has a globally and temporally unique identifier that includes the node's network address. Processes can exchange messages, even if the processes reside on different nodes.

A DRS configuration consists of an ordered list of server processes, each running on a different node. Each server maintains a state object and a Boolean flag called wedged, initially false, and invokes functions in the application. The application provides a merge(list of state objects) function that returns a state object, as described below. The DRS module has the following API:

Figure 1: Unordered (UO) Replication Protocol

Figure 2: Primary-Backup (PB) Replication Protocol

Figure 3: Chain Replication (CR) Replication Protocol

- Start(list of nodes, initial state): spawns a process on each of the nodes with the specified state and returns the corresponding list of process identifiers as the initial configuration;

- Reconfig(new list of nodes, old configuration): wedges a quorum of processes in the old configuration, starts a new process on each of the given nodes (the set of old nodes and new nodes may overlap), and returns a new configuration.

The DRS client-side component sends requests directly to the servers, attaching to the request the client's process identifier, a unique request identifier, and the operation. Upon receipt of such a request, assuming the wedged flag is clear, a server invokes the application with the operation and the state that the server holds. The application returns an updated state and, optionally, a response to the client, which the server sends using the client's process identifier.

Any invoker of Reconfig( ) becomes a leader. It sends a phase 1 request to each of the servers in the old configuration. Each recipient sets the wedged flag, causing the server's state to become immutable. If the server has voted on another reconfiguration proposal, it returns the last such operation. Otherwise it returns the state object. The leader collects responses from a quorum of servers. If all those servers respond with a state object, the leader invokes the application's merge( ) function to create a single state object, and spawns processes on each of the nodes in the new list of nodes with that state object. The corresponding list of process identifiers becomes the proposal for the next configuration. The leader then attempts phase 2 to commit the proposal. Should phase 1 or phase 2 fail (because one of the servers returns an error message), the leader tries again using a new unique leader identifier.

It is worth noting that there are no timeouts or other tunable configuration parameters in this code. However, Reconfig( ) has to be invoked when there is suspicion of a failure, and this suspicion would typically be triggered by a configurable timeout after waiting for a response to a service request. The usual action is to reconfigure the service to remove the tardy, possibly faulty server, restoring responsiveness. To restore the same level of fault tolerance as before the failure, a new server has to be allocated and initialized with the state. This can happen in the background. When completed, another reconfiguration step adds the new server to the configuration. The details are tricky, as the state may be updated while the state transfer takes place, but usually this method is preferred over shutting down the service for the duration of the transfer.
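The leader's happy path (wedge a quorum, merge states, spawn the new configuration) can be sketched in a few lines. This is a simplified Python sketch that omits phase 2, leader contention, and prior votes; all names are hypothetical stand-ins for the Erlang processes described above.

```python
class ServerProc:
    """Toy stand-in for a server process holding a state object."""
    def __init__(self, state):
        self.state = state
        self.wedged = False

    def phase1(self):
        self.wedged = True     # state becomes immutable from here on
        return self.state

def reconfig(old_config, new_nodes, merge):
    quorum = old_config[: len(old_config) // 2 + 1]   # any majority suffices
    states = [s.phase1() for s in quorum]             # wedge and collect states
    merged = merge(states)                            # application-provided merge()
    return [ServerProc(merged) for _ in new_nodes]    # spawn the new configuration

old = [ServerProc({"x"}), ServerProc({"x", "y"}), ServerProc({"y"})]
new = reconfig(old, ["nodeA", "nodeB", "nodeC"],
               merge=lambda sts: set().union(*sts))
print(sorted(new[0].state))   # ['x', 'y']
```

Wedging before reading is the crucial ordering: once a majority is immutable, no new command can be chosen in the old epoch behind the leader's back.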

8. EVALUATION

We posit that ours is a convenient and efficient approach to building DRS services. In this section, we validate this claim by showing how such a service may be supported, evaluating the performance of various replication approaches, and measuring the overhead of reconfiguration. In particular, we are developing a cluster-management service that manages a collection of distributed services, including itself. The state consists of a version number (incremented upon each operation), the set of nodes in the cluster, and a map of service identifiers to configurations. A configuration of the service is a list of 2F+1 servers, although only the first F+1 are used to execute operations while ensuring availability of state in the face of at most F failures; the remaining F servers are used only for liveness of reconfiguration.

We evaluated three different replication schemes:

UO (Un-Ordered): clients send operations directly to all servers and wait for a response from all servers, but no attempt is made to keep replicas consistent with one another (consistency is restored at each reconfiguration operation). This approach involves 2F+2 messages, 2 message latencies (Figure 1), and 30 lines of Erlang code.

PB (Primary-Backup): clients send operations to the first server in the configuration, which orders operations and forwards them in FIFO order to the remaining F servers, all of which respond directly to the client (the primary need not respond). This involves 2F+1 messages, 3 message latencies (Figure 2), and 48 lines of Erlang code.

CR (Chain Replication): the servers are organized in a chain. Clients send operations to the head of the chain, which orders operations. The operations are forwarded in FIFO order along the chain, and the tail responds to the client. This involves F+2 messages, F+2 message latencies (Figure 3), and 46 lines of Erlang code.

The hardware consists of 19 Dell 1650 machines with dual 2.5 GHz Intel Xeon processors (5 bogoGips) and 2 GB of memory.
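For reference, the per-operation costs quoted above can be restated as functions of F, making the UO/PB/CR trade-off explicit: CR sends the fewest messages but its latency grows with F. This is a trivial helper; the formulas come from the text, the function itself is ours.

```python
def costs(scheme, F):
    """Messages and message latencies per operation, as claimed above."""
    return {
        "UO": {"messages": 2 * F + 2, "latencies": 2},
        "PB": {"messages": 2 * F + 1, "latencies": 3},
        "CR": {"messages": F + 2, "latencies": F + 2},
    }[scheme]

for scheme in ("UO", "PB", "CR"):
    print(scheme, costs(scheme, F=2))
```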
Each machine runs VMware and Linux 2.6.9. A 100 Mb/s Ethernet connects the machines. On each machine we run a single Erlang node (using the Erlang R13B03 distribution). We measure the overhead of a no-op operation that increments the version number but leaves the remainder of the state unchanged. For a single server (F = 0), the end-to-end latency of a single operation (client and server on separate machines) is approximately 275 microseconds. In all measurements below, variance is low and error bars have been omitted for clarity.

Figure 4 shows the throughput of no-ops as a function of the number of servers in the absence of failures or reconfigurations. Each client sends a no-op request and waits for responses before repeating with the next. We test both with a single client and with 10 client processes, each running concurrently on its own node. In the UO, 10-clients case, throughput is limited because the servers are CPU-saturated. For CR and PB, the limitation is due to CPU saturation of the primary/head node. While PB outperforms CR in the case of a single client for F > 1, due to the higher latency of CR, CR significantly outperforms PB in the face of client contention, due to the lower message overhead of CR compared to PB, as well as reduced network contention.

Figure 4: Throughput as a function of the number of servers.

In Figure 5 we graph the latency of reconfiguration as a function of the number of servers. Reconfiguration involves wedging a quorum of servers, obtaining and merging their state, spawning a new set of servers with the resulting state, and running consensus on the resulting proposal for a new configuration. In our experiments, we kept the state small, consisting of an empty list of nodes and services. The graph shows measurements in which all servers are up before reconfiguration, and measurements in which we crashed F = ceil((#servers + 1) / 2) of the servers just before reconfiguration. As can be seen from the graph, reconfiguration overhead grows approximately linearly with the number of servers, and is slightly faster in case some servers are faulty. In all cases, it is the reconfiguration client that is CPU-saturated; the load on the servers is modest (55% for a single server, 30% with 3 servers, and continuing to decrease quickly with system size). In the case of significant state size, state transfer would occur in the background. We clocked the speed of state transfer between two Erlang nodes in our system at 10.22 Mbytes/second (on a 100 Mbit Ethernet).

Figure 5: Latency for Reconfiguration.

9. CONCLUSIONS

Our paper proposes a new DRS methodology for service reconfiguration, borrowing from the Virtual Synchrony model and State Machine Replication. DRS improves on both of its ancestors, repairing some problems with the VS model, enforcing linearizability throughout, and eliminating performance-limiting synchronization barriers that reconfigurable Paxos imposes on steady-state operations. DRS can also support applications for which the State Machine model is constraining, such as atomic multicast and shared atomic registers. An Erlang implementation confirms that our work is of practical value and also permits evaluation; the methodology is capable of very high performance, and reconfiguration overheads are low.

ACKNOWLEDGMENTS

We thank the following people for helpful discussions, for reviewing earlier versions of this manuscript, and for contributing ideas that helped shape this work: Ittai Abraham, Marcos Aguilera, Mahesh Balakrishnan, Mark Bickford, Gregory Chockler, Bob Constable, Idit Keidar, Leslie Lamport, JP Martin, Alex Shraer, and Lidong Zhou.

10. Bibliography

[1] Dynamic Atomic Storage Without Consensus. M. Aguilera, I. Keidar, D. Malkhi, and A. Shraer. Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), 2009.

[2] Innovations in Computational Type Theory using Nuprl. S. F. Allen, M. Bickford, R. L. Constable, R. Eaton, C. Kreitz, L. Lorigo, and E. Moran. Journal of Applied Logic, Vol. 4, Issue 4, pp. 428-469, December 2006.

[3] Dedalus: Datalog in Time and Space. P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. Sears. PODS 2010, Indianapolis, IN, USA.

[4] Replication Using Group Communication Over a Partitioned Network. Y. Amir. PhD thesis, The Hebrew University of Jerusalem, 1995.

[5] Sharing Memory Robustly in Message Passing Systems. H. Attiya, A. Bar-Noy, and D. Dolev. Journal of the ACM, Vol. 42, No. 1, pp. 121-132, 1995.

[6] Exploiting Virtual Synchrony in Distributed Systems. K. Birman and T. Joseph. 11th ACM Symposium on Operating Systems Principles (SOSP), 1987.

[7] Knowledge-Based Synthesis of Distributed Systems Using Event Structures. M. Bickford, R. L. Constable, J. Y. Halpern, and S. Petride. In Logic for Programming, Artificial Intelligence, and Reasoning, Lecture Notes in Computer Science, Vol. 3452, pp. 449-465, 2005.

[8] The Chubby Lock Service for Loosely-Coupled Distributed Systems. M. Burrows. Seventh Symposium on Operating System Design and Implementation (OSDI), 2006.

[9] The Transis Approach to High Availability Cluster Communication. D. Dolev and D. Malkhi. Communications of the ACM, Vol. 39, No. 4, 1996.

[10] The SMART Way to Migrate Replicated Stateful Services. J. R. Lorch, A. Adya, W. J. Bolosky, R. Chaiken, J. R. Douceur, and J. Howell. Proceedings of ACM EuroSys, 2006.

[11] Rambo II: Rapidly Reconfigurable Atomic Memory for Dynamic Networks. S. Gilbert, N. Lynch, and A. Shvartsman. Proc. 17th Intl. Symp. on Distributed Computing (DISC), pp. 259-268, June 2003.

[12] Linearizability: A Correctness Condition for Concurrent Objects. M. Herlihy and J. Wing. ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 12, No. 3, pp. 463-492, 1990.

[13] The Cactus Approach to Building Configurable Middleware Services. M. Hiltunen and R. Schlichting. Workshop on Dependable System Middleware and Group Communication (DSMGC), 2000.

[14] The ZooKeeper Coordination Service (Poster). F. Junqueira, P. Hunt, M. Konar, and B. Reed. Symposium on Operating Systems Principles (SOSP), 2009.

[15] Consistency and High Availability of Information Dissemination in Multi-Processor Networks. I. Keidar. PhD thesis, The Hebrew University of Jerusalem, 1998.

[16] Efficient Message Ordering in Dynamic Networks. I. Keidar and D. Dolev. 15th ACM Symp. on Principles of Distributed Computing (PODC), 1996.

[17] Time, Clocks, and the Ordering of Events in a Distributed System. L. Lamport. Communications of the ACM, Vol. 21, No. 7, pp. 558-565, 1978.

[18] On Interprocess Communication, Part I: Basic Formalism; Part II: Algorithms. L. Lamport. Distributed Computing, pp. 77-101, 1986.

[19] The Part-Time Parliament. L. Lamport. ACM Transactions on Computer Systems, Vol. 16, pp. 133-169, 1998.

[20] Generalized Consensus and Paxos. L. Lamport. Microsoft Research Technical Report TR-2005-33, 2005.

[21] Reconfiguring a State Machine. L. Lamport, D. Malkhi, and L. Zhou. Microsoft Technical Report, 2009.

[22] Boxwood: Abstractions as the Foundation for Storage Infrastructure. J. MacCormick et al. Symposium on Operating System Design and Implementation (OSDI), pp. 105-120, 2004.

[23] System Support for the Development of Object-Oriented. A. Montresor. PhD thesis, University of Bologna, Italy, 2000.

[24] A Simple Totally Ordered Broadcast Protocol. B. Reed and F. Junqueira. LADIS '08: Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, 2008.

[25] Petal: Distributed Virtual Disks. E. K. Lee and C. A. Thekkath. Proceedings of the 7th International ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.

[26] Horus: A Flexible Group Communication System. R. Van Renesse, K. Birman, and S. Maffeis. Communications of the ACM, 1996.

11. Appendix A: Correctness Sketch

In this section, we provide a correctness sketch of our Dynamic Reliable Multicast implementation. For brevity, we limit ourselves to a partial proof of correctness, which addresses only a single reconfiguration. A full proof can be constructed by using this single-epoch proof as part of an argument that inducts over the sequence of configurations; interested readers are referred to an extended version of this paper, which we are preparing for eventual submission to a journal.

Refer to an initial configuration V0 and to one new configuration V1, which is defined by the first Reconfig event in the execution. Since the agreement part of reconfiguration is composed of the well-known Synod protocol [19], we do not repeat the correctness proof of the consensus protocol. We henceforth assume agreement on a unique new configuration and a set φ of messages.

Claim 1: If a message m is chosen in V0, then it is included in the agreement decision, and hence is transferred to (chosen in) V1.

Proof: There are two cases. A message may be chosen in V0 either by a client request or by a leader request (during reconfiguration). In the first case, every server in V0 received the client's Send(m) request and acknowledged it. Since during reconfiguration servers are instructed not to respond to client requests, these servers responded to the client before they ever received any leader's reconfiguration phase 1 message. Therefore, any leader's phase 1 reveals m, and m is included in the set φ, which is a component of the consensus decision. Consequently, m is sent again in V1 before V1 starts receiving client requests. In the second case, the message is chosen during phase 2 of reconfiguration by some leader. Since choosing m is done inseparably from choosing the agreement decision on φ, m is included in the reconfiguration decision, which means that m is sent again in V1 before V1 starts receiving client requests.

Claim 2: If a message m was chosen in V0, and a subsequent Deliver command is invoked, then m is included in the response of every server to the Deliver request.

Proof: There are two cases: m was chosen by a client request, or m was chosen through reconfiguration. In the first case, a client Deliver request that is subsequent to the completion of the Send command may occur either in V0 or in V1. In V0, every server already acknowledged m, hence its response includes m. In V1, a Deliver request arrives after φ was sent to every server in V1 and before V1 started. By Claim 1, φ contains m. Hence, the responses from the servers in V1 include m. In the second case, we first note that a Deliver request that arrives at V0 after m is chosen in V0 is not served by the servers that formed agreement on reconfiguration; hence, it does not complete. This leaves the case of a Deliver request that arrives at V1 after φ was sent to every server in V1 and before V1 started. As above, by Claim 1, φ contains m. Hence, the response from every server in V1 includes m.

Claim 3: If a message m was chosen in V1, and a subsequent Deliver command is invoked, then m is included in the response of every server to the Deliver request.

Proof: We first note that a Deliver request that arrives at V0 after m is chosen in V1 is not served by the servers that formed agreement on reconfiguration; hence, it does not complete. In V1, every server already acknowledged m, hence its response includes m.

Claim 4: If a message m was chosen, and subsequently a client invokes a Deliver request, then the Deliver response includes m.

Proof: By Claims 2 and 3, after m is successfully Sent or Delivered, a subsequent Deliver request returns m from all servers of the relevant configuration. Hence, the response to the Deliver call includes m.