In Proceedings of the 13th ACM Annual Symposium on Principles of Distributed Computing

Mixed Consistency: A Model for Parallel Programming (Extended Abstract)

Divyakant Agrawal    Manhoi Choy†    Hong Va Leong    Ambuj K. Singh†
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA 93106

A general purpose parallel programming model called mixed consistency is developed for distributed shared memory systems. This model combines two kinds of weak memory consistency conditions, causal memory and pipelined random access memory, and provides four kinds of explicit synchronization operations: read locks, write locks, barriers, and await operations. The resulting suite of memory and synchronization operations can be tailored to solve most programming problems in an efficient manner. Conditions are also developed under which the net effect of programming in this model is the same as programming with sequentially consistent memory. Several examples are included to illustrate the model and the correctness conditions.

Keywords: distributed shared memory, memory consistency, concurrency, synchronization.

1 Introduction

Programming parallel and distributed systems is a difficult task. The inherent concurrency of these systems poses a number of problems, viz., data placement, load balancing, race conditions, and non-determinism. Using message passing for interprocess communication adds a further degree of complexity since the programmer has to explicitly manage the shared data. Due to these considerations, the shared memory paradigm is becoming increasingly popular with hardware architects and application programmers. However, the latency of accessing shared variables can be much higher than that of sending or receiving messages. For example, before updating a variable, a process may need to obtain a write permission and invalidate any existing copies of the variable. For the shared memory paradigm to be useful for parallel and distributed computing, techniques for reducing this latency are needed.

[Work supported in part by NSF grant IRI-9117094 and by the Los Alamos National Laboratory grant UC94-B-A-223. † Work supported in part by NSF grant CCR-9223094.]

Copyright (C) 1994 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 689-0481.

A proposal in this regard has been the weakening of the abstraction of atomic memory. If multiple copies of a shared variable are maintained then atomicity amounts to keeping these copies identical at all times. Several definitions have been put forth in the literature for relaxing this strong coupling; in general, the degree of coupling is closely related to the latency of memory access and to the cost of maintaining consistency. Sequential consistency [19] weakens atomicity by not enforcing the ordering of operations across processes. For parallel applications where the final results of a computation are the only concern, sequential consistency suffices as the net effect is the same as that of using atomic memory. Sequential consistency has been widely adopted as the correctness criterion by hardware architects and implementors of distributed shared memory (DSM) [22]. However, even sequential consistency imposes very strict requirements and inhibits many optimizations such as pipelined writes and out-of-order reads. Consequently, a number of weaker consistency conditions have been proposed recently. These include weak ordering [10], release consistency [14], hybrid consistency [6], causal memory [3], and pipelined random access memory (PRAM) [23].

This paper proposes a new programming model called mixed consistency for DSM. The architecture that we target consists of a set of processors, each equipped with its own memory, interconnected by a message passing network. Our aim is to provide a programming model for this kind of architecture that facilitates simple reasoning, and is efficient. As opposed to most existing models that assume sequential processes, we use partial orders to model local computations by processes. This allows us to express concurrency within a process.
Building on the general idea that synchronization in user programs should be explicitly identified, we introduce explicit primitives for modeling different kinds of synchronization patterns. The definitions of weak ordering [1, 10] and release consistency [14] identify a limited kind of synchronization by labeling memory operations. Software implementations of these consistency models replace labels by lock operations [7, 18]. Some of the other memory models do not provide any synchronization operations [3, 6, 23]. (PRAM is not to be confused with the Parallel RAM computational model [11].)

The mixed consistency model combines two different abstractions of DSMs: PRAM and causal memory. PRAM admits efficient implementation as suggested by Lipton and Sandberg [23]. Causal memory on the other hand expresses the causality constraints of a user program very naturally and may simplify the design of parallel programs [3]. For programming convenience, our model also provides special synchronization primitives such as read locks, write locks, barriers, and await statements. The lock and unlock operations are useful for handling competing accesses to shared data, barrier operations are used for separating different phases of computation, and await operations are useful for producer/consumer types of interaction. The net outcome of combining the above constructs is a general purpose model that can be tuned to the programming task at hand.

A problem with some of the weaker consistency requirements proposed in the literature has been the absence of a clear semantics. To circumvent this problem, we define the mixed consistency model formally by identifying a clear interface between the user programs and the memory system. This model is presented in Section 3 and is general enough to permit non-blocking operations [15] and multi-threaded user processes. We also develop conditions under which programming in our model has the same final effect as that of using sequentially consistent memory. Such conditions can be useful for a programmer and also for a compiler, which can exploit these conditions to speed up a computation transparently to the programmer. Similar conditions have also been investigated for release consistency [13] and causal memory [4, 28]. The discussion of these conditions appears in Section 4.
In Section 5, we consider a number of examples that illustrate the applicability of our model. These include the solution of linear equations, computation of electromagnetic fields, and Cholesky factorization of sparse matrices. Section 6 considers implementation issues of mixed consistency. We conclude this paper with a brief discussion of our results in Section 7.

2 Related Work

Atomicity of shared variables is the most stringent requirement on the behavior of shared variables in a concurrent system. It requires that even in the presence of concurrent access, the shared variables should behave as if each concurrent access occurred atomically [20, 25]. This idea was later generalized to arbitrary shared objects and termed linearizability by Herlihy and Wing [16]. Sequential consistency [19] weakens the requirement of atomicity by not enforcing that the ordering of non-overlapping operations be maintained in the equivalent global history. Dubois, Scheurich, and Briggs presented the concept of weak ordering [10] in which coherence of caches is enforced only at user-defined synchronization points. This memory consistency condition was later refined by Adve and Hill [1]. The idea of identifying special synchronization points in the user program is extended further by Gharachorloo et al. [14]. Here a synchronization access is further classified as acquire or release. The execution of a release operation cannot be completed before all preceding accesses have completed, and the execution of any access cannot complete before all preceding acquire operations have completed. Based on the above classification scheme, they propose a new correctness condition called release consistency that permits greater concurrency than before. More recently, Gibbons and Merritt [15] have presented a generalization of release consistency in which shared accesses are non-blocking.

Release consistency, though originally proposed for the DASH architecture [21], has also been adopted in software implementations of DSM. In these systems, explicit lock and unlock operations play the role of acquire and release instructions. The Munin system [8] classifies shared variables based on their patterns of accesses and uses this information to implement release consistency. The Memo system [18] develops a more efficient implementation of release consistency by delaying updates until they are needed. This is referred to as lazy release consistency. As a part of the Midway system, Bershad et al. [7] restrict release consistency by explicitly associating synchronization variables with critical sections. The resulting consistency condition is called entry consistency and can be implemented more efficiently.
Though the mixed consistency model proposed in this paper is based on a different consistency condition, the combination of causal memory with locks and barriers provides a similar effect as the above implementations of release consistency. However, we also incorporate PRAM accesses and await statements that can be used to capture the producer/consumer paradigm in an efficient manner.

Pipelined random access memory (PRAM) [23], introduced by Lipton and Sandberg, uses a fully replicated representation of shared data objects. Every read operation returns the value of the local copy. A write operation is performed by updating the local copy and broadcasting the update to other processes in a FIFO manner; such updates are applied when received. Concurrent updates from two processes may be performed in a different order by different processes. This kind of memory has a low latency; however, very little can be ensured about the consistency of the multiple copies of a shared object.

For certain applications, the FIFO ordering of PRAM needs to be generalized to causal message delivery. Causal memory as defined by Ahamad et al. [3] is based on this idea. A causal order obtained from the program order and the reads-from order is defined for any history, and a read operation is constrained to return a value consistent with this causal order. We extend the prior work on causal memory and PRAM by combining them in the same model and by including explicit synchronization operations.

Hybrid consistency, proposed by Attiya and Friedman [6], classifies read/write operations into strong and weak kinds. All processes observe the same ordering between a strong and a weak operation on the same process, as well as the same ordering between any pair of strong operations across different processes. Adjacent weak operations, however, can be observed to occur in different orders by different processes. Recently, Attiya et al. have extended hybrid consistency and a few other consistency conditions by considering the control flow in user programs and allowing non-sequential executions [5]. The mixed consistency model is similar to hybrid consistency in that it combines PRAM and causal operations instead of weak and strong operations. However, unlike hybrid consistency, it also provides explicit synchronization operations that can be used to simplify the programming task.
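The operational description of PRAM above (reads return the local copy; writes update locally and are broadcast FIFO) is simple enough to simulate in a few lines. The following Python sketch is our own illustration, not taken from the paper; the class and method names are invented for exposition.

```python
from collections import deque

class PRAMMemory:
    """Toy simulation of the operational PRAM definition: every process
    keeps a full local copy, writes are applied locally and queued to
    every other process in FIFO order, and reads return the local copy."""

    def __init__(self, n_procs):
        self.copies = [dict() for _ in range(n_procs)]
        # channels[src][dst] is a FIFO queue of updates in flight
        self.channels = [[deque() for _ in range(n_procs)]
                         for _ in range(n_procs)]

    def write(self, pid, loc, val):
        self.copies[pid][loc] = val
        for dst in range(len(self.copies)):
            if dst != pid:
                self.channels[pid][dst].append((loc, val))

    def read(self, pid, loc):
        return self.copies[pid].get(loc)

    def deliver_one(self, src, dst):
        # apply the oldest pending update from src at dst (FIFO per channel)
        loc, val = self.channels[src][dst].popleft()
        self.copies[dst][loc] = val

# Concurrent writes by p0 and p1 may reach p2 in either order:
m = PRAMMemory(3)
m.write(0, 'x', 1)
m.write(1, 'x', 2)
m.deliver_one(1, 2)   # p2 applies p1's write first...
m.deliver_one(0, 2)   # ...then p0's, so p2's copy ends with x = 1
```

The example illustrates the weakness noted in the text: each channel is FIFO, but nothing constrains the interleaving of updates from different writers, so different processes may end up applying concurrent writes in different orders.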

3 System Model

A program is composed of a set of processes, each of which is specified by means of a text written in some programming language. Processes issue two kinds of operations during execution: memory operations and synchronization operations. Memory operations include read and write operations but may be extended to operations on abstract data types. Synchronization operations are used for synchronization and do not generally affect the content of a memory location. Synchronization operations discussed in this paper include read/write lock, barrier, and await operations. The lock and barrier operations access a set of synchronization objects disjoint from the memory locations, whereas the await operation reads a memory location.

We formalize the semantics of operations by considering the interface between each process and the remainder of the system. The execution of each operation is modeled by two events: an invocation event issued by the process and a matching response event issued by the system. We do not specify when a response is returned to a user process. In particular, we do not require that the effects of a write operation be globally visible before a response is returned to the write invocation; a memory system can buffer the update and issue a response locally before transmitting the updates to other processes. The execution of a process is modeled by a sequence of invocation and response events. The execution of an operation is blocking if no new invocations can be issued before a response to the pending operation is received. Otherwise, it is non-blocking. In order to permit maximum generality and to allow for optimizations such as pipelining, we concentrate on non-blocking operations in this paper.

The ordering of operations at the system interface is modeled by a partial order. For any two operations o1 and o2 of process p_i, we say that o1 precedes o2, denoted by o1 →_i o2, if the response event of o1 precedes the invocation event of o2. The local history of a process p_i consists of the set of operations of the process and the partial ordering →_i on the operations. Each read or write operation refers to a memory location and is associated with a value. An operation o issued by a process p_i on location x with an associated value v is denoted by o_i(x)v. For example, r_2(y)3 denotes a read operation issued by p_2 on location y, returning a value of 3, and w_1(z)4 denotes a write operation from p_1, storing the value 4 into location z. For convenience, the identity of the process issuing an operation and the value read/written by an operation may not be shown explicitly as a part of the operation when the extra information is not needed.

A local history should meet some constraints in order to be meaningful. We say that the local history of a process p_i is well-formed if it satisfies the following four conditions.
- The ordering of operations at the interface is consistent with the program of process p_i.
- At any time, process p_i has at most one pending invocation event on a given object.
- For any invocation of an unlock operation by process p_i on a lock object, there is a preceding matching lock operation by process p_i on the same lock object.
- Each barrier operation of p_i is totally ordered with respect to all operations of p_i.

The above formulation of the interface and the partial ordering is similar to the specification of non-blocking shared memories by Gibbons and Merritt [15]. A local history in which every invocation event has a matching response event is said to be complete. We consider only well-formed and complete local histories in the rest of this paper.

A history H of a program, consisting of processes p_1, ..., p_n, is a pair (Op, ⇝) consisting of the operations of the processes and a causality relation ⇝ on these operations. The causality relation ⇝ is defined by the transitive closure of the union of a program order (→), a reads-from relation (↪), and a synchronization order (↦). We restrict our attention to histories with acyclic causality relations. Program order → is the union of the partial orders →_i, one for each process p_i. The reads-from relation ↪ is defined as follows. We say that a read operation r(x) reads-from a write operation w(x), denoted w(x) ↪ r(x), if the value returned from the read operation r(x) is written by the write operation w(x). For simplicity, we assume that all write operations are associated with distinct values. This assumption is also made by Misra [25] and Ahamad et al. [3]. The definition of the synchronization order ↦ appears in the next subsection.
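The causality relation, being a transitive closure of a union of edge sets, is straightforward to compute for a finite history. The following Python sketch is our own illustration; operation names are arbitrary string labels, not notation from the paper.

```python
def transitive_closure(pairs):
    """Transitive closure of a relation given as a set of (a, b) edges."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# p1 writes z then y (program order); p2 reads p1's write of y (reads-from)
program_order = {('w1(z)4', 'w1(y)5')}
reads_from    = {('w1(y)5', 'r2(y)5')}
sync_order    = set()

causality = transitive_closure(program_order | reads_from | sync_order)
# transitivity relates p1's earlier write of z to p2's read of y
```

This quadratic fixpoint loop is only for illustration; any standard reachability algorithm over the union of the three edge sets computes the same relation.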

3.1 Synchronization Operations

The effect of a synchronization operation type s on a history is captured by its synchronization order ↦_s. We use o1 ↦_s o2 to denote that an operation o1 precedes another operation o2 in the synchronization order ↦_s. The union of the synchronization orders ↦_s is referred to as the synchronization order and denoted by ↦. In this paper, the synchronization operations that we consider are locks, barriers, and awaits. The synchronization orderings for these operations (↦_lock, ↦_bar, and ↦_await) are defined next.

3.1.1 Lock Operations

The usual semantics of read and write locks is assumed, namely read and write locks cannot be held simultaneously and a write lock can be held by at most one process at any given time. The lock and unlock operations for read and write locks are denoted by rl, ru, wl and wu respectively. A process holding a write lock must issue an unlock operation before another process can be granted a read or a write lock. Also, all processes holding a read lock must issue unlock operations before another process can be granted a write lock. Given a write lock operation wl and a matching unlock operation wu by a process p, the set of operations of process p that follow wl and precede wu constitutes a critical section. Relation ↦_lock on a lock object ℓ defines an ordering on the rl(ℓ), ru(ℓ), wl(ℓ), and wu(ℓ) operations such that the following three properties hold.

1. The wl(ℓ) and wu(ℓ) operations are totally ordered with respect to each other and with respect to all rl(ℓ) and ru(ℓ) operations.
2. No wl(ℓ) or rl(ℓ) operation is ordered between a wl(ℓ) operation and its matching wu(ℓ) operation.
3. No wl(ℓ) operation is ordered between a rl(ℓ) operation and its matching ru(ℓ) operation.

3.1.2 Barrier Operations

A barrier is a synchronization point where all the processes must arrive before any one of them can proceed. A generic barrier operation is of the form b^k_j, where index k refers to the k-th barrier in the history and index j identifies the issuing process p_j. Relation ↦_bar defines an ordering between barriers and other operations such that for any operation o of process p_j and any process p_i, if o →_j b^k_j then o ↦_bar b^k_i, and if b^k_j →_j o then b^k_i ↦_bar o. (A barrier can also be defined for a subset of processes by restricting the range of the universal quantification to the subset.) An example of the synchronization orders due to locks and barriers is illustrated in Figure 1.

[Figure 1: An Example of Lock and Barrier Synchronization Orders. The figure shows read lock/unlock and write lock/unlock operations related by ↦_lock, and operations of phase i separated from operations of phase i+1 by barrier operations.]

3.1.3 Await Operations

An await statement is of the form await(cond), where cond is a boolean condition. When a process p_i issues an await operation, its execution is blocked until cond becomes true. In general, condition cond may depend on more than one memory location. For simplicity, we restrict ourselves to a special form of await statements: await(x = v), where x is a memory location and v is a value that can be written to x. An operation await(x = v) issued by process p_i is denoted by a_i(x)v. Because of our assumption of unique write values, when a_i(x)v is executed there is a unique write operation w_j(x)v that has written the value v to location x. Relation ↦_await defines an ordering between write operations and await operations such that for any await operation a_i(x)v of process p_i, there exists a write operation w_j(x)v by some process p_j and w_j(x)v ↦_await a_i(x)v. This completes the definition of the synchronization order and the causality relation of a history. Next, we consider the semantics of memory operations.

3.2 Memory Operations

We first define sequential consistency. Define a history to be sequential if at any point during execution there is at most one pending invocation event by the set of processes and every read operation on a memory location returns the value written by the most recent write operation on that memory location. Note that the operations of a sequential history can be totally ordered and a state can be associated with each prefix of a sequential history. This state includes the values of the memory locations, synchronization variables, and program counters. A serialization of a history is defined to be a total order on the operations that respects the causality relation.

Definition 1 (Sequential Consistency) A history is sequentially consistent if at least one of its serializations is a sequential history.
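Definition 1 suggests a direct, if exponential, check for small histories: enumerate the total orders that respect the causality relation and test whether any of them is a sequential history. The Python sketch below is our own encoding (operations as tuples, causality as an edge set), not the paper's.

```python
from itertools import permutations

def is_sequential(serialization):
    """Each read returns the value of the most recent write to its location."""
    mem = {}
    for kind, loc, val in serialization:
        if kind == 'w':
            mem[loc] = val
        elif mem.get(loc) != val:
            return False
    return True

def sequentially_consistent(ops, causality):
    """Brute-force Definition 1: some total order of ops that respects
    every causality edge is a sequential history."""
    for perm in permutations(ops):
        pos = {op: i for i, op in enumerate(perm)}
        if all(pos[a] < pos[b] for (a, b) in causality) and is_sequential(perm):
            return True
    return False

w1, w2 = ('w', 'x', 1), ('w', 'x', 2)
# reading the later write is fine; reading the overwritten value is not
ok  = sequentially_consistent([w1, w2, ('r', 'x', 2)], {(w1, w2)})
bad = sequentially_consistent([w1, w2, ('r', 'x', 1)],
                              {(w1, w2), (w2, ('r', 'x', 1))})
```

The factorial enumeration is only a specification-level check; it ignores pending invocations and synchronization variables, which the full definition also constrains.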

A memory system is sequentially consistent if it admits only sequentially consistent histories. Though sequential consistency is the most prevalent programming model offered by hardware designers, it has stringent consistency requirements that lead to a high access latency [23]. A number of weaker consistency requirements have been proposed in the literature. As a part of the mixed consistency model, we focus on causal memory and PRAM.

To define a causal read operation, we need to define the causality observable to a process p_i. The set of operations that may affect process p_i are the operations of p_i and all write and synchronization operations of other processes. Let the causality relation restricted to this set of operations be ⇝_{i,C}.

Definition 2 (Causal Read) A read operation r(x)v on process p_i is a causal read if there exists a write operation w(x)v such that w(x)v ⇝_{i,C} r(x)v and there does not exist a read/write operation o(x)u, u ≠ v, with w(x)v ⇝_{i,C} o(x)u ⇝_{i,C} r(x)v.

Note that read operations of other processes are not included in this causality relation. Therefore, the o(x) operation in the above definition can be a read operation of process p_i but not a read operation of another process p_j. A history in which all reads are causal reads is called a causal history and a memory system that admits only causal histories is called causal memory. In the absence of synchronization operations, this definition of causal memory reduces to that proposed by Ahamad et al. [3].

As opposed to causal memory, the operational definition of PRAM considers only pairwise interactions. This means that we only need to consider direct dependencies between processes in the formal definition. The relation obtained by removing the transitive edges from the causality relation is called PRAM order. The PRAM order for process p_i, denoted ⇝_{i,P}, is defined as follows.

1. Define restrictions on synchronization orderings by removing transitive edges. Relations ↦_lock^p, ↦_bar^p, and ↦_await^p are obtained in this manner. Let ↦_PRAM = ↦_lock^p ∪ ↦_bar^p ∪ ↦_await^p.
2. Define a subrelation ↦_i of ↦_PRAM by considering only edges involving process p_i, i.e., those edges of ↦_PRAM that either emanate from or are incident upon operations of process p_i. Similarly, define ↪_i by restricting the reads-from relation ↪.
3. Construct the transitive closure of the union of →, ↦_i, and ↪_i. Define ⇝_{i,P} by projecting this transitive closure on the set of all operations excluding read operations not of process p_i.

With this relation, a PRAM read is defined in a manner similar to a causal read.

Definition 3 (PRAM Read) A read operation r(x)v of process p_i is a PRAM read if there exists a write operation w(x)v such that w(x)v ⇝_{i,P} r(x)v and there does not exist a read/write operation o(x)u, u ≠ v, with w(x)v ⇝_{i,P} o(x)u ⇝_{i,P} r(x)v.

A history in which all reads are PRAM reads is called a PRAM history and a memory system that admits only PRAM histories is called a PRAM. We would like to observe the following about the above definition. First, if there are only two processes then the transitive closure of ↦_i = the transitive closure of ↦_PRAM = the transitive closure of ↦, and ↪_i = ↪. Consequently, a PRAM read becomes synonymous with a causal read. Second, the definition can be easily generalized to maintain causality across an arbitrary group of processes; PRAM reads and causal reads form the two end points of the spectrum. Finally, in the absence of synchronization operations, this definition reduces to the original idea due to Lipton and Sandberg [23]. We could have defined PRAM reads so that they also preserve the operation orderings on account of previous causal operations. This would have been similar to the formalization of weak operations in hybrid consistency [6] and ordinary operations in release consistency [14]. However, such a definition complicates the implementation of PRAM.

The memory operations in our model consist of writes, and reads that are labeled either as "PRAM" or "Causal". We have defined the semantics of two different kinds of read operations but not discussed the write operations. This is because the semantics of a particular memory consistency model is defined by the read operations. Write operations only add to the set of possible values for reads. Based on the definitions of synchronization operations and memory operations, we can now define mixed consistency.

Definition 4 (Mixed Consistency) A history is mixed consistent if

1. all read operations that are labeled as "PRAM" are PRAM reads, and
2. all read operations that are labeled as "Causal" are causal reads.

A memory system is mixed consistent if it admits only mixed consistent histories.
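Definitions 2 and 3 share the same shape and can be checked mechanically once the relevant relation (⇝_{i,C} or ⇝_{i,P}) is given as an explicit edge set: the read must be reachable from the write, with no same-location, different-value operation causally between them. The following Python sketch is our own encoding of that check, with operation names as string labels.

```python
def reaches(edges, a, b):
    """Is b reachable from a (in one or more steps) in the edge set?"""
    frontier, seen = [a], {a}
    while frontier:
        n = frontier.pop()
        for (u, v) in edges:
            if u == n and v not in seen:
                if v == b:
                    return True
                seen.add(v)
                frontier.append(v)
    return False

def is_causal_read(read, write, other_ops, rel):
    """Definitions 2/3: write precedes read in rel, and no operation from
    other_ops (same location, different value) lies between them in rel."""
    if not reaches(rel, write, read):
        return False
    return not any(reaches(rel, write, o) and reaches(rel, o, read)
                   for o in other_ops)

# p1 performs w(x)1 then w(x)2 in program order; p2 then reads x
edges = {('w(x)1', 'w(x)2'), ('w(x)2', 'r(x)v')}
fresh = is_causal_read('r(x)v', 'w(x)2', ['w(x)1'], edges)  # returning 2 is legal
stale = is_causal_read('r(x)v', 'w(x)1', ['w(x)2'], edges)  # returning 1 is not
```

The same function decides both definitions; only the relation passed in differs, which mirrors how the paper makes PRAM and causal reads two points on one spectrum.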

4 Programming in Mixed Consistency

Weaker memory consistency requirements usually lead to implementations with lower access latency and more efficient programs. However, it is difficult to program in weak consistency models, and programs written assuming sequential consistency may not run correctly on a weaker consistency model. Therefore, it is important to characterize programs that behave similarly on weakly consistent memory and sequentially consistent memory (and, by transitivity, atomic memory). In this section, we isolate such a class of programs for our programming model. This is similar to the idea of properly-labeled programs for release consistency [14]. We begin with some definitions.

Define two sequential histories h1 and h2 to be equivalent if they consist of the same set of operations and result in the same final state.

Definition 5 (Commutativity) Two operations o and o′ commute if for any sequential history h, whenever h; o and h; o′ are sequential histories, both h; o; o′ and h; o′; o are equivalent sequential histories.

The above definition is similar to the idea of forward commutativity by Weihl [29], which is used to develop concurrency control protocols for abstract objects, and to the idea of loosely-coupled processes by Misra [26]. It is clear from the definition that any pair of operations on different memory objects commute. Furthermore, any pair of read operations commute, and operations that are never enabled simultaneously commute. Intuitively, all operations that commute with one another can be executed in any order without affecting the final state.

Theorem 1 A history H is sequentially consistent if every pair of operations not related by ⇝ commutes and every read operation is a causal read.

Proof (Sketch): It is sufficient to show that any serialization of H = (Op, ⇝) is a sequential history. This is proved by induction on the structure of the history. For the induction step, we will assume the hypothesis for all proper prefixes of H and prove it for H. For this purpose, let h be a serialization of H. We need to show that h is a sequential history. Let o be a maximal operation in H. Let H′ be the proper prefix of H obtained by removing o and H1 be the proper prefix of H obtained by considering operations that precede o in H. Let h1 be a serialization of H1 that is a subsequence of h. Let h2 be the subsequence of h consisting of operations in H′ − H1. Then h1; h2 is a serialization of H′ and consequently, by the induction hypothesis, a sequential history. Since operation o is concurrent with each operation in h2, it commutes with each operation in h2. Let h = h3; o; h4. Since h4 consists only of operations in H′ − H1 and h2 is a subsequence of h3; o; h4 obtained by considering operations in H′ − H1, it follows that h4 is a suffix of h2. Therefore, there exists an interleaving h5; o; h4 of h2 and o. On account of the reads being causal and histories respecting the semantics of synchronization operations, h1; o is a sequential history. Since h1; h2 is a sequential history and operation o commutes with every operation in h2, it follows that h1; h5; o; h4 is also a sequential history. Let history H3 consist of operations in h3, i.e., operations in h1; h5. Both h3 and h1; h5 are serializations of a proper prefix of H. Applying the induction hypothesis we obtain that h3 is a sequential history. Since h3 and h1; h5 are sequential histories consisting of the same set of operations, the states at the end of h3 and h1; h5 are identical. Since h1; h5; o; h4 is a sequential history, so is h3; o; h4. This establishes the desired condition that h = h3; o; h4 is a sequential history. □

The above theorem extends Singh's results for causal memory in [28] by considering synchronization operations. Ahamad et al. [4] have also developed similar conditions for causal memory with await statements and semaphore operations. Next, we discuss some consequences of the theorem. Call a program entry-consistent if the following four conditions hold:

1. the shared variables are partitioned into disjoint sets,
2. a unique lock is associated with each set,
3. all read accesses to a shared variable occur under a read or write lock of the corresponding lock variable, and
4. all write accesses to a shared variable occur under a write lock of the corresponding lock variable.

The above definition is motivated by the definition of entry consistency by Bershad et al. [7].
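The commutativity notion in Definition 5 can be made concrete for plain read/write operations on a store. The Python sketch below is our own simplification: it compares only final states, ignoring the definition's requirements about which histories are sequential in the first place.

```python
def apply_op(state, op):
    """Apply a ('w', loc, val) or ('r', loc, val) operation to a state copy.
    Reads do not change the state."""
    kind, loc, val = op
    new = dict(state)
    if kind == 'w':
        new[loc] = val
    return new

def commute(o1, o2, state):
    """Both application orders yield the same final state."""
    return (apply_op(apply_op(state, o1), o2) ==
            apply_op(apply_op(state, o2), o1))

s = {'x': 0, 'y': 0}
diff_locs = commute(('w', 'x', 1), ('w', 'y', 2), s)   # writes to different objects
two_reads = commute(('r', 'x', 0), ('r', 'x', 0), s)   # two reads
conflict  = commute(('w', 'x', 1), ('w', 'x', 2), s)   # conflicting writes
```

The three cases reproduce the observations following Definition 5: operations on different memory objects commute, pairs of reads commute, and two writes of distinct values to the same location do not, since the final state depends on their order.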

Corollary 1 Any history of an entry-consistent program in which all reads of shared variables are causal is sequentially consistent.

Let us call the computation between consecutive barriers a computation phase. Define a program to be PRAM-consistent if in any phase of any sequential history of the program, a variable is updated at most once and all reads of the variable follow the updates to the variable.

Corollary 2 Any history of a PRAM-consistent program in which all reads of shared variables are PRAM reads is sequentially consistent.

The definitions of entry-consistency and PRAM-consistency can be easily checked by a compiler for a given program. Consequently, the above corollaries can be used to speed up computations without the programmer being made aware of the existence of the weaker memories. We consider applications of these results in the next section.
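As one hedged illustration of such a compiler-style check, the per-phase part of the PRAM-consistency condition can be tested when a phase's shared accesses are available as an ordered list (a simplification of the definition, which quantifies over sequential histories; the encoding below is ours).

```python
def pram_consistent_phase(ops):
    """One phase of ('r'|'w', location) accesses in order: each location is
    written at most once, and a location written in the phase is never
    read before that write."""
    written, read = set(), set()
    for kind, loc in ops:
        if kind == 'w':
            if loc in written or loc in read:
                return False
            written.add(loc)
        else:
            read.add(loc)
    return True

read_phase   = [('r', 'x'), ('w', 'temp')]   # read old estimate, write scratch
update_phase = [('r', 'temp'), ('w', 'x')]   # install new estimate
bad_phase    = [('r', 'x'), ('w', 'x')]      # x read and then written: rejected
```

A real compiler check would conservatively approximate the phases and access sets from the program text rather than from a single execution, but the per-phase condition itself is this simple.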

5 Applications

In this section we illustrate the applicability of the mixed consistency model by considering some examples, mainly from scientific applications. The first two examples that we consider are iterative solution of linear equations [3] and computation of electromagnetic fields [24]. The computations in these examples can be decomposed into a set of phases so that updates made within a phase are made available in the subsequent phases. PRAM and causal reads along with barriers are used to solve these problems. The third example that we consider is Cholesky factorization of sparse matrices [12]. The computations in this case are non-uniform and cannot be decomposed easily into phases. Causal reads along with locks are used to solve this problem.

5.1 Linear Equation Solver

The algorithm discussed in this section consists of a coordinator process and a set of worker processes. The input matrix is partitioned among the worker processes which compute, in a sequence of phases, new estimates of the solution based on previous estimates. This computation requires reading of the entire matrix, and synchronization is needed in order to ensure that consistent values are read. The coordinator checks for convergence of the new estimates. First, we present a solution to the problem using barriers. We insert two barriers to split each computation phase into two subphases: one in which processes read the entire matrix and another in which processes install new estimates. The resulting algorithm is outlined in Figure 2. Initially, arrays A and b contain the input, and array x contains the initial estimate of the solution. Boolean variable done is initialized to false. Since no variable is both read and written in the same phase, the program is PRAM-consistent. It follows from Corollary 2 that PRAM reads can be used for the program.

Coordinator p0:
    while not done do
        if converged(x) then done := true;
        barrier;
        barrier;
    endwhile;

Worker pi:
    while not done do
        temp[i] := x[i] + (b[i] − Σj A[i,j]·x[j]) / A[i,i];
        barrier;
        if not done then x[i] := temp[i];
        barrier;
    endwhile;

Figure 2: Synchronous Iterative Equation Solver with Barriers (PRAM)
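The barrier structure of Figure 2 maps directly onto shared-memory threads. The following sketch is our own illustration (a fixed iteration count stands in for the coordinator's convergence test); the two barriers per phase separate the read subphase from the write subphase:

```python
import threading

def solve(A, b, x, iterations):
    """Jacobi-style iteration: two barriers per phase separate the
    read subphase (compute temp from the old x) from the write
    subphase (install temp into x), as in Figure 2."""
    n = len(x)
    temp = [0.0] * n
    barrier = threading.Barrier(n)

    def worker(i):
        for _ in range(iterations):
            temp[i] = x[i] + (b[i] - sum(A[i][j] * x[j] for j in range(n))) / A[i][i]
            barrier.wait()   # all workers have read the old estimate
            x[i] = temp[i]
            barrier.wait()   # all workers have installed the new estimate

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

Because no worker writes x before the first barrier and no worker reads x after it (until the second barrier), the phase discipline of Corollary 2 is preserved.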

If barriers are not available then the coordinator is required to synchronize the worker processes by appropriate handshaking. The current phase of the computation of a process pi (either a coordinator or a worker) is tracked using a local variable phase[i], which is initialized to zero and incremented at the beginning of each phase. In each phase, worker process pi reads the entire matrix, calculates the next estimate of the solution, and assigns the value of phase[i] to the handshaking variable computed[i] to inform the coordinator. The coordinator waits until all these variables are set to the value of phase[0] and then resets all of them to −phase[0]. Once computed[i] equals −phase[i], process pi installs the new values for its shared variables. Then it sets another handshaking variable updated[i] to the value of phase[i] to inform the coordinator. The coordinator waits until the entire array has been updated to the value of phase[0] and then resets updated[i] to −phase[0]. Process pi waits until

this is done and then starts the next phase. The algorithm is depicted in Figure 3. The forall statement is a fork and a join of parallel loop bodies, one for each value of the loop index. Though the handshakes between the coordinator and the workers play the role of the barriers, the reads of the input matrix in this solution cannot be PRAM. It is possible to show that inconsistent values of the matrix are read in that case. However, we can show by applying Theorem 1 that causal reads have the same effect as sequentially consistent reads, since all operations not related by the causality relation commute. Therefore, causal reads are used to read the matrix in the new program.

Coordinator p0:
    while not done do
        phase[0] := phase[0] + 1;
        forall i do await computed[i] = phase[0];
        forall i do computed[i] := −phase[0];
        forall i do await updated[i] = phase[0];
        if converged(x) then done := true;
        forall i do updated[i] := −phase[0];
    endwhile;

Worker pi:
    while not done do
        phase[i] := phase[i] + 1;
        temp[i] := x[i] + (b[i] − Σj A[i,j]·x[j]) / A[i,i];
        computed[i] := phase[i];
        await computed[i] = −phase[i];
        x[i] := temp[i];
        updated[i] := phase[i];
        await updated[i] = −phase[i];
    endwhile;

Figure 3: Synchronous Iterative Equation Solver with Handshaking (Causal Memory)
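The handshake protocol of Figure 3 can likewise be sketched with threads, implementing await as a busy-wait loop. This is our illustration: a fixed phase count replaces the convergence test, and work is a caller-supplied callback standing in for the compute and install steps:

```python
import threading
import time

def run(n_workers, n_phases, work):
    """Coordinator/worker handshake of Figure 3.  computed[i] and
    updated[i] are the handshaking variables; negated values carry
    the coordinator's acknowledgement back to worker i."""
    computed = [0] * n_workers
    updated = [0] * n_workers

    def await_(pred):                 # 'await' as a busy-wait loop
        while not pred():
            time.sleep(0)             # yield the processor

    def coordinator():
        for phase in range(1, n_phases + 1):
            for i in range(n_workers):
                await_(lambda i=i: computed[i] == phase)
            for i in range(n_workers):
                computed[i] = -phase
            for i in range(n_workers):
                await_(lambda i=i: updated[i] == phase)
            # convergence test would go here
            for i in range(n_workers):
                updated[i] = -phase

    def worker(i):
        for phase in range(1, n_phases + 1):
            work(i, phase, "compute")
            computed[i] = phase
            await_(lambda: computed[i] == -phase)
            work(i, phase, "install")
            updated[i] = phase
            await_(lambda: updated[i] == -phase)

    threads = [threading.Thread(target=coordinator)]
    threads += [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The handshakes guarantee that every compute step of a phase completes before any install step of that phase, and every install of phase p completes before any compute of phase p + 1.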

5.2 Computation of Electromagnetic Fields

The second application we consider is the problem of computing the electric field (E-field) and the magnetic field (H-field) in a certain space [24]. E-field and H-field values are sampled at different points of this space and stored in variables called E-nodes and H-nodes respectively. Each process pi holds a partition of E-nodes and H-nodes and requires read access to adjoining nodes in neighboring partitions. The computation consists of alternating phases in which adjoining H-node values are used to compute E-node values and adjoining E-node values are used to compute H-node values. Updates performed in a phase should be available in subsequent phases. We solve this problem by using barriers as shown in Figure 4. This program is also PRAM-consistent and therefore PRAM reads can be used while preserving correctness.

    while not done do
        forall E-nodes e do
            for each adjoining H-node h do
                update the value of e using h;
        barrier;
        forall H-nodes h do
            for each adjoining E-node e do
                update the value of h using e;
        barrier;
    endwhile;

Figure 4: Electromagnetic Field Computation for Process pi

This numeric solution is also discussed by Culler et al. [9] in the context of providing programming language primitives for pre-fetching shared variables. "Ghost copies" of shared variables are made and accessed in order to improve the performance of the algorithm. By using PRAM, the responsibility of providing these ghost copies shifts from the programmer to the underlying system.
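As a concrete toy rendering of this phase structure, the sketch below runs a one-dimensional analogue of Figure 4: E-nodes interleaved with H-nodes, each kind updated from its adjoining nodes of the other kind, with a barrier between the two subphases. This is our illustration only; the averaging update stands in for the actual field equations of [24]:

```python
import threading

def em_steps(E, H, workers, steps):
    """1-D analogue of Figure 4 with len(H) == len(E) + 1.  E[i]
    adjoins H[i] and H[i+1]; H[i] adjoins E[i-1] and E[i]
    (boundaries contribute 0)."""
    barrier = threading.Barrier(workers)

    def worker(w):
        for _ in range(steps):
            for i in range(w, len(E), workers):      # block-cyclic ownership
                E[i] = 0.5 * (H[i] + H[i + 1])       # E-phase
            barrier.wait()
            for i in range(w, len(H), workers):
                left = E[i - 1] if i > 0 else 0.0
                right = E[i] if i < len(E) else 0.0
                H[i] = 0.5 * (left + right)          # H-phase
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The barrier between the E-phase and the H-phase is exactly what makes the neighbors' updates from the previous phase safe to read with PRAM semantics.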

5.3 Cholesky Factorization

We now consider Cholesky factorization of large sparse matrices. Basically, the problem is to factorize an input symmetric sparse matrix A as a matrix product L · L^T where L is a lower triangular matrix. A symbolic factorization is first carried out to build a dependency tree of the columns [27]. A column k depends on column j if the values of column j are used to update column k. Each column j is associated with a count initialized to the number of columns on which it depends. Once count[j] equals 0, the process assigned to column j performs the following computation:

    Ljj := √Ljj;
    Lij := Lij / Ljj, for all i > j;
    Lik := Lik − Lij · Lkj, for all i ≥ k > j.

The computations in the first and the second lines are local to column j. The final computation requires updating of values in a remote column based on values in column j. This update is done inside a critical section guarded by a lock l[k] to ensure non-interference. The variable count[k] is also decremented in this critical section. The resulting algorithm is shown in Figure 5.

1: await count[j] = 0;
2: Ljj := √Ljj;
3: forall i := j + 1 to N do Lij := Lij / Ljj;
4: forall k := j + 1 to N do
5:     wlock(l[k]);
6:     for i := k to N do Lik := Lik − Lij · Lkj;
7:     count[k] := count[k] − 1;
8:     wunlock(l[k]);

Figure 5: Cholesky Factorization: Computation for Column j

On account of Theorem 1, causal reads can be used for reading the shared variables. Weakening these to PRAM reads may result in inconsistent values, as updates made by critical section entries prior to the previous one may not be observed. Since lock acquisition and release incur a high latency, a more efficient alternative is to view each matrix entry and count variable as an abstract object supporting read, write, and decrement operations. By adopting such a view, we can do away with all the critical sections: the operation on Lik in line 6 becomes a decrement by Lij · Lkj and the operation on count[k] in line 7 becomes a decrement by 1. Furthermore, all the operations can be shown to commute, allowing causal memory to be used without any critical sections.
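The commuting-decrement view can be sketched as an abstract object: all updates go through an atomic decrement, so the caller needs no surrounding critical section. This is our illustration; the internal lock merely stands in for whatever atomic read-modify-write the memory system provides:

```python
import threading

class DecrementCell:
    """Object supporting read, write, and decrement.  Decrements
    commute with one another, so concurrent updaters of Lik or
    count[k] need no user-level critical section."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # stands in for an atomic RMW

    def dec(self, amount=1):
        with self._lock:
            self._value -= amount

    def write(self, value):
        with self._lock:
            self._value = value

    def read(self):
        return self._value
```

Line 6 of Figure 5 then becomes a decrement of Lik by Lij · Lkj, and line 7 a decrement of count[k] by 1, with no wlock/wunlock pair around them.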

6 Implementing Mixed Consistency

In this section, we briefly discuss possible implementations of mixed consistency. We assume a message passing system with FIFO communication channels. The memory is maintained as a set of pages and each process keeps a local copy of the memory. Read operations are non-blocking and return local values. The implementation of write operations is similar to the implementation of causal memory [4]. Each process maintains a vector timestamp in order to define the causality between operations. The timestamp is updated after each write operation. Update messages for each variable (object) are broadcast along with the process vector timestamp to remote processes. Both causal and PRAM reads are implemented by reading local values. A causal read can return a value only if all preceding operations (in the timestamp order) have been performed locally. A PRAM read, on the other hand, returns the most recent value. In the absence of synchronization operations, the implementation of reads and writes above satisfies the causality relations due to the program order and reads-from relations. The extra overhead of sending a timestamp in each message and performing the updates in the timestamp order can be avoided if it can be shown that all read operations of the program following a write operation are PRAM operations. Note that the class of PRAM-consistent programs in which reads are PRAM reads (defined in Section 4) satisfies this criterion. The overhead of broadcasting messages for each update and of duplicating memory at each node may be avoided by making optimizations based on the patterns of accesses to shared variables [2]. The implementation of synchronization operations is more interesting as these operations impact the control
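The timestamp machinery can be sketched as a per-process vector clock (our illustration; the method names are hypothetical): a write bumps the local component and attaches the vector to the update message, and an update may be applied locally only once every causally prior update has been applied:

```python
class VectorClock:
    """Per-process vector timestamp for causal update delivery."""
    def __init__(self, n_procs, pid):
        self.v = [0] * n_procs
        self.pid = pid

    def on_write(self):
        """Bump our own component; the returned vector is broadcast
        along with the update message."""
        self.v[self.pid] += 1
        return list(self.v)

    def can_apply(self, ts, sender):
        """An update is applicable only if it is the sender's next
        event and we have already seen everything it depends on."""
        return (ts[sender] == self.v[sender] + 1 and
                all(ts[k] <= self.v[k]
                    for k in range(len(ts)) if k != sender))

    def applied(self, ts):
        """Merge after applying the update locally."""
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
```

A causal read at a process simply waits until every pending update with an applicable timestamp has been applied; a PRAM read skips that wait and returns the most recently received value.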

flow as well as the correctness of a read operation. Ensuring correct control flow is not too difficult. Every lock is mapped to a process called the lock manager which accepts the requests for locking and unlocking. Every barrier is also mapped to a barrier manager: each process sends a message to this manager upon reaching the barrier and the manager in turn signals the processes to go ahead when all of them have reached the barrier. An await operation is implemented by a busy-wait loop of PRAM reads until the correct value is read. Now, consider the impact of synchronization operations on the correctness of read operations. We first consider barrier operations. From the correctness condition of a causal or a PRAM read operation and the definition of the synchronization order ↦bar, it follows that all updates from a prior computation phase should be received before any read operation of the current phase. One way to implement this is to have each process pi reaching a barrier send a vector to the manager. Component j of this vector counts the number of update messages sent by process pi to process pj in the current phase. Upon receiving these vectors from each process, the manager constructs a new vector for each process pi; the jth component of this vector counts the number of messages sent by process pj to process pi in the current phase. A process waits until it receives this vector from the manager. Read operations following a barrier operation locally are blocked until the required number of messages are received from each process. Write operations following a barrier operation locally do not need to be blocked if the update messages due to these writes are performed at remote processes only after all operations from the previous phase have been performed. The semantics of await and lock/unlock operations (synchronization orders ↦await and ↦lock) requires that each read operation observe the effect of a set of prior updates. For a PRAM read, this set includes updates from the immediately preceding process (in the case of an await operation, the process writing the appropriate value, and in the case of a lock operation, the previous lock holder). A causal read operation, on the other hand, observes transitive dependencies and so the effect of all prior updates needs to be observed.
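The barrier manager's bookkeeping is just a transpose of the per-process send-count vectors. A minimal sketch of our own (hypothetical function name):

```python
def barrier_expectations(sent):
    """sent[i][j] = number of update messages pi sent to pj in the
    finishing phase.  The manager returns, for each pi, the vector
    of messages pi must receive before its next-phase reads may
    unblock: component j is the count of messages pj sent to pi."""
    n = len(sent)
    return [[sent[j][i] for j in range(n)] for i in range(n)]
```

Each process then blocks its post-barrier reads until its local receive counters reach the returned vector, component by component.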
The transmission of the updates in the mixed consistency model can be done in an eager manner, a lazy manner, or a demand-driven manner. For example, consider the synchronization order ↦lock due to lock/unlock operations. Consider a write unlock operation wu(ℓ) by process pi and a subsequent read lock operation rl(ℓ) by process pj. Let o1 be a write operation by process pi such that o1 →i wu(ℓ) and let o2 be a read operation by process pj such that rl(ℓ) →j o2. Thus, o1 →i wu(ℓ) ↦lock rl(ℓ) →j o2. The effect of the synchronization order of lock/unlock operations is to ensure that o2 observes the effect of o1. An eager implementation requires that the effect of o1 be observable (by every process) when wu(ℓ) is executed regardless of whether rl(ℓ) and o2 will be performed later; a lazy implementation [17] requires that the effect of o1 be observable when rl(ℓ) is executed regardless of whether o2 will be performed later; a demand-driven implementation requires that the effect of o1 be observable when o2 is performed. A simple eager implementation of lock/unlock operations is for process pi to release a write lock as follows. A message is broadcast

to all processes to flush all updates and the lock is released after receiving all acknowledgements. In a lazy implementation of lock/unlock operations, process pi sends update-message counts with a write unlock message to the lock manager. Upon acquiring the lock, process pj waits for the required number of messages before proceeding. Eager and lazy implementations of lock/unlock operations are similar to the eager and lazy implementations of release consistency [17]. Neither eager nor lazy implementations take into account whether data is actually accessed subsequently. Thus, unnecessary data transmission may occur in some executions. A demand-driven implementation alleviates these problems by ensuring only that the required updates are delivered prior to subsequent accesses. A possible demand-driven implementation of lock/unlock operations is to encode the set of variables updated inside a critical section in a small message. This message is sent to the manager when the lock is released by process pi. Process pj can later use this information to determine if local copies are valid before performing any read operations. Similar classifications of implementations are possible for await operations. Detailed discussion of these implementations is deferred to the full paper.
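The demand-driven variant can be sketched minimally: the release message carries only the set of variables dirtied in the critical section, and the next acquirer invalidates its local copies, refetching a variable only if it actually reads it. This is our illustration; all names are hypothetical:

```python
class DemandDrivenLock:
    """Write unlock ships the dirty-variable set (a small message to
    the lock manager); lock acquisition invalidates those local
    copies so they are refetched on demand."""
    def __init__(self):
        self._dirty = set()

    def wunlock(self, written_vars):
        self._dirty = set(written_vars)

    def lock(self, local_cache):
        for var in self._dirty:
            local_cache.pop(var, None)   # invalid: refetch on demand
        invalidated = self._dirty
        self._dirty = set()
        return invalidated
```

Variables not touched in the critical section keep their cached values, so no data moves for them at all.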

7 Discussion

In this paper, we have developed a mixed consistency model for distributed shared memory systems. This model combines a number of existing weak consistency conditions and explicit synchronization operations to provide a rich class of primitives for parallel programming. We also investigated conditions under which mixed consistency leads to the same final results as sequentially consistent memory. We illustrated the model and these conditions by considering several examples of scientific computations. Equivalence to a sequentially consistent computation may not always be necessary. For example, some asynchronous relaxation algorithms such as Gauss-Seidel iteration converge even with PRAM. The mixed consistency model allows the programmer to choose the appropriate level of consistency needed for a given application. We have developed a platform called Maya [2] to implement and evaluate different memory consistencies. Currently, it supports a limited form of mixed consistency including the implementations of PRAM reads, causal reads, writes, and barriers. The best way to implement await and lock/unlock operations is still under consideration. The performance of a number of algorithms, including some of those presented in Section 5, has been investigated using Maya. For example, the linear equation solver using barriers (Figure 2) performs better than the one with handshaking (Figure 3). Similarly for Cholesky factorization, an algorithm using counter objects significantly outperforms the lock-based algorithm (Figure 5). We are also investigating other applications in order to evaluate their performance on

weak memories in general and mixed consistency in particular.

References

[1] S.V. Adve and M.D. Hill. Weak ordering - a new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 2–14. IEEE, May 1990.
[2] D. Agrawal, M. Choy, H.V. Leong, and A.K. Singh. Investigating weak memories using Maya. In Proceedings of the Third International Symposium on High-Performance Distributed Computing, August 1994. To appear.
[3] Mustaque Ahamad, James E. Burns, Phillip W. Hutto, and Gil Neiger. Causal memory. In Proceedings of the 5th International Workshop on Distributed Algorithms, pages 9–30. LNCS, October 1991.
[4] Mustaque Ahamad, Gil Neiger, Prince Kohli, James E. Burns, and Phillip W. Hutto. Causal memory: Definitions, implementation, and programming. Technical Report 93/55, College of Computing, Georgia Institute of Technology, September 1993. Submitted for publication.
[5] Hagit Attiya, Soma Chaudhuri, Roy Friedman, and Jennifer Welch. Shared memory consistency conditions for non-sequential execution: Definitions and programming strategies. In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, 1993.
[6] Hagit Attiya and Roy Friedman. A correctness condition for high-performance multiprocessors. In Proceedings of the 24th Annual ACM Symposium on the Theory of Computing, pages 679–690, 1992.
[7] B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon. The Midway distributed shared memory system. In The 38th IEEE Computer Society International Conference. IEEE, 1993.
[8] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 152–164. ACM, 1991.
[9] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel programming in Split-C. Technical report, Computer Science Division, University of California, Berkeley, 1993.
[10] Michael Dubois, Christoph Scheurich, and Faye A. Briggs. Memory access buffering in multiprocessors. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 434–442, May 1986.
[11] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the 10th Annual ACM Symposium on the Theory of Computing, pages 114–118, 1978.
[12] A. George and J. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice Hall, 1981.
[13] K. Gharachorloo, S.V. Adve, A. Gupta, J.L. Hennessy, and M.D. Hill. Programming for different memory consistency models. Journal of Parallel and Distributed Computing, 15(4):399–407, August 1992.

[14] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J.L. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15–26. IEEE, May 1990.
[15] Phillip B. Gibbons and Michael Merritt. Specifying nonblocking shared memories. In Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 306–315, 1992.
[16] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990.
[17] Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13–21. IEEE, 1992.
[18] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. Technical Report COMP TR93-206, Rice University, June 1993.
[19] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, September 1979.
[20] Leslie Lamport. On interprocess communication: Parts I and II. Distributed Computing, 1(2):77–101, 1986.
[21] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, et al. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63–79, March 1992.
[22] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321–359, November 1989.
[23] Richard J. Lipton and Jonathan S. Sandberg. PRAM: A scalable shared memory. Technical Report CS-TR-180-88, Princeton University, Department of Computer Science, September 1988.
[24] N.K. Madsen. Divergence preserving discrete surface integral methods for Maxwell's curl equations using non-orthogonal unstructured grids. Technical Report 92.04, RIACS, February 1992.
[25] J. Misra. Axioms for memory access in asynchronous hardware systems. ACM Transactions on Programming Languages and Systems, 8(1):142–153, January 1986.
[26] J. Misra. Loosely-coupled processes. In E.H.L. Aarts et al., editor, Parallel Architectures and Languages Europe, volume II, pages 1–26, 1991.
[27] Edward Eric Rothberg. Exploiting the Memory Hierarchy in Sequential and Parallel Sparse Cholesky Factorization. PhD thesis, Stanford University, Department of Computer Science, December 1992.
[28] Ambuj K. Singh. A framework for programming using non-atomic variables. In Proceedings of the 8th International Parallel Processing Symposium, 1994.
[29] William E. Weihl. The impact of recovery on concurrency control. In Proceedings of the 8th ACM Annual Symposium on Principles of Database Systems, pages 259–269. ACM, 1989.