Distributed Window Concurrency Control for Distributed Database Systems

by William Perrizo¹
North Dakota State University
Fargo, ND 58105

ABSTRACT

In recent years, research in the area of distributed transaction management has been growing rapidly. In designing concurrency control components for such systems, the most common techniques have been locking, timestamping and variations of these. Despite their heavy use in the design of distributed systems, each of these techniques has its own specific disadvantages and overhead. Our primary goal is to describe an approach that has the advantages of each of these techniques and none of their disadvantages. In this paper, we describe the Distributed Window Request Order Linked List method and show how it embodies most of the advantages of the standard techniques without introducing serious difficulties or extra overhead of its own. We include an experimental comparison of transaction response time for this new method against those of locking and timestamping methods.

Keywords: distributed, database, concurrency, ROLL.

SECTION 1. Introduction

Locking [AGR93, BER87, BAR88, GAR90, GRA78, GRA93, HEL93, THO91] and timestamping [KUN81, BER87, GRA93, BAR88, BAS88] and various modifications and combinations of them have been proposed for use in distributed database systems. Each of these techniques has inherent shortcomings and advantages. Locking is based on inherently global data structures (Lock Tables and Wait-For Graphs) which force additional network communication just to do the concurrency control. Timestamping requires massive data structures (a Reads-in-Transit counter, a Writes-in-Transit counter and a wait-queue for each data item) and considerable processing time to accomplish the required handshake between the Scheduler and the Data Manager modules. Locking and timestamping use serializability as a correctness criterion, yet both methods reject certain serializable histories. In terms of advantages, locking methods require no Scheduler-to-DataManager handshake, timestamping methods have no global data structures for making the accept/reject decision on an operation (therefore, they are easily distributed), and Serialization Graph Testing (SGT) rejects no serializable history.

We propose a new technique called Distributed Window Request Order Linked List (DW-ROLL) and show how this technique embodies most of the advantages of the aforementioned techniques without their inherent shortcomings. In Section 2, we introduce the DW-ROLL method. In Section 3, we report on implementations. In Section 4, we present the results of an experiment conducted to compare the methods using a queuing model. The conclusions are in Section 5.

¹ Partially funded by USAF Grant F19628-86-K-0019 & IBM Grant 3341-IN-R-90233.

SECTION 2. Distributed Window-Request Order Linked List (DW-ROLL)

For the rest of this paper, we assume a reliable communication network connecting several sites and a distributed database system on this network which has no replicated data. We assume no replicated data simply to keep the discussion tractable, given the space limitations. The methods can be extended to the replicated distributed database setting, but many new issues would need to be dealt with.

The DW-ROLL is basically a priority queue of request vectors keyed on the global timestamp. We assume that there exists a mechanism by which each transaction (global or local) is assigned a unique global timestamp (gts). One such mechanism uses an ordered pair consisting of the clock time at the site where the transaction entered the system followed by the site number of that site (timestamp, site#). These global timestamp pairs are ordered by timestamp as the primary sort key and site# as the secondary sort key.

The query originating site sends to each of the participating sites a Request Vector (RV) requesting access to the data items at that site, along with the unique global timestamp as a transaction identifier. An RV is a string of 0s and 1s. Each pair of consecutive positions represents a unique data item at that site. The first bit of any pair is the read bit for that data item and the second bit is the write bit. A '0' at a particular position means that the corresponding data item is not required in that mode, while a '1' means that the data item is required in that mode. For simplicity we will deal with only a single lock mode and therefore assign only one bit per item (exclusive access only). Later we will use both Read and Write modes to represent a single data item ([PER91]).

 ________________________________________________
| 0001 1000 0000 0101 . . . 0001 |  gts  |  lpt  |
|________________________________|_______|_______|

The first field in the RV above is called the data portion, the second is the key portion and the third is the post-time portion. The data portion is the string of bits that specifies the data items needed at that site (we assume a dictionary mapping which assigns positions to data items, as in [PER91]). The key is the global timestamp of the submitting transaction and determines the RV's position in the priority queue, or Request Order Linked List (ROLL). The post-time portion is reserved for the local clock time at which the RV completes its enqueuing. For that reason we refer to the contents of this last field as the "local post time" or lpt.

At each local site, the RV has to be enqueued into a priority queue of request vectors keyed on the unique global timestamp. At each site, this ROLL priority queue structure is the main system object for concurrency control. There are three simple operations available to the transactions for access to this object: POST, CHECK and RELEASE. Using the POST operation, transactions make known their serialization partial order. The CHECK operation allows transactions to determine current resource availability. The RELEASE operation makes a resource available to other transactions once a transaction has completed access to it.

The basic ROLL protocol is neither a locking protocol nor a timestamp-based protocol. It is a modification of the basic Serialization Graph Testing protocol. However, no system scheduler module is involved and all operations can be done in parallel by concurrently executing Transaction Managers (TMs). By using ROLL, both waiting and restarting are minimized. In its basic form it is deadlock and livelock free. The ROLL is the only data structure needed for concurrency control, deadlock management and livelock management. The basic ROLL structure and protocol were first introduced in [PER89].

The Request Vector for basic ROLL is simply the data portion, as described above, together with a pointer field. The queue for basic ROLL is a FIFO queue. As we have already described it, the DW-ROLL structure is only slightly different. The three operations for the DW-ROLL object also differ from those with the same names in basic ROLL ([PER89]).
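The Request Vector layout and the (timestamp, site#) ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `RequestVector`, `GlobalTimestamp` and `bits` are hypothetical, the data portion is held as an integer bit string with one bit per item (the single exclusive mode), and nothing here is taken from the AS/400 or UNIX prototypes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical alias: (clock time at the originating site, site number).
GlobalTimestamp = Tuple[float, int]


@dataclass
class RequestVector:
    data: int                    # data portion: one bit per data item
    gts: GlobalTimestamp         # key portion: global timestamp
    lpt: Optional[float] = None  # post-time portion, filled in by POST


def bits(text: str) -> int:
    """Parse a bit string such as '0001 1000' (spaces ignored)."""
    return int(text.replace(" ", ""), 2)


# Python tuples compare lexicographically, which matches the ordering in
# the text: timestamp is the primary key, site number the secondary key.
a: GlobalTimestamp = (100.0, 3)
b: GlobalTimestamp = (100.0, 7)
c: GlobalTimestamp = (99.5, 9)
ordered = sorted([a, b, c])

rv = RequestVector(data=bits("0001 1000"), gts=a)
```

Here `ordered` places `c` first (earliest clock time) and breaks the tie between `a` and `b` by site number.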

During the DW-ROLL design process, we had the following issues in mind. The scheduler in a locking system is a critical section. That is, there is just one scheduler for the entire system and all transactions must make requests to it in a serial fashion. Thus, the scheduler can easily become a bottleneck in terms of system performance ([BER87]). Our approach is to offload scheduling responsibilities to the transactions themselves (or Transaction Managers, each acting concurrently on behalf of its respective transaction). The data structures used in concurrency control (lock tables, wait queues, dependency graphs, wait-for graphs, etc.) are all combined into one data structure, the ROLL, which can be partitioned and distributed so that the partition at a given site governs only the data at that site. Waiting and restarting are minimized (eliminated when possible).

A detailed description of the three DW-ROLL operations, POST, CHECK and RELEASE, follows. When a transaction Ti enters the system, a transaction manager TMi is assigned to it. TMi determines what data accesses are needed at which sites and creates subtransactions and Request Vectors (Tij and RVij for each site j). TMi then requests a global timestamp, gtsi, which it attaches to each RV as an identifier. Each subtransaction and RV is then sent to its site of operation. At site j, for instance, a local transaction manager, TMij, is assigned. TMij is responsible for executing the operations on the local DW-ROLLj object at its site, j. The following is a description of these operations.

The POST operation for DW-ROLL is a three-step process. The three steps need not be executed as an atomic unit, however, which allows concurrent posting by many transactions without system synchronization. The window parameter used in step 3 below is a global timeout parameter chosen by the database administrator.

1. Enqueue RVij to the DW-ROLLj at that site (where the global timestamp, gtsi, is the key which determines the RV's position in this priority queue).
2. Copy the local clock time to the local post time field (lpti) of RVij.
3. For each RVkj which is behind RVij in DW-ROLLj, test whether lptk + window > lpti. If the test holds for every such RVkj, the POST is "successful"; otherwise the POST is "unsuccessful", RVij is dequeued and the POST is rejected.
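The three POST steps above can be sketched as follows. This is a minimal single-threaded illustration, not the prototype code: the class `RV`, the function `post`, and passing the local clock in as `now` are all assumptions made for the example, and a plain sorted Python list stands in for the priority queue (a real implementation would need the atomic enqueue the paper requires).

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(order=True)
class RV:
    gts: Tuple[float, int]                       # sort key
    data: int = field(compare=False)             # bit string of items
    lpt: Optional[float] = field(default=None, compare=False)


def post(roll: List[RV], rv: RV, now: float, window: float) -> bool:
    """Try to POST rv into roll (kept sorted by gts); True on success."""
    # Step 1: enqueue at the position determined by the global timestamp.
    pos = bisect.bisect_left(roll, rv)
    roll.insert(pos, rv)
    # Step 2: record the local clock time as the local post time.
    rv.lpt = now
    # Step 3: every RV behind this one (larger gts) must still be inside
    # its grace window, i.e. lpt_k + window > lpt_i.
    behind = roll[pos + 1:]
    if all(k.lpt + window > rv.lpt for k in behind):
        return True
    del roll[pos]  # unsuccessful: dequeue; caller needs a new gts
    return False


roll: List[RV] = []
window = 5.0
young = RV(gts=(12.0, 2), data=0b0110)
old = RV(gts=(10.0, 1), data=0b0011)
late = RV(gts=(11.0, 3), data=0b1000)

ok_young = post(roll, young, now=100.0, window=window)  # nothing behind it
ok_old = post(roll, old, now=103.0, window=window)      # 100 + 5 > 103
ok_late = post(roll, late, now=106.0, window=window)    # 100 + 5 > 106 fails
```

In this run the older transaction arrives within the younger one's window and is placed ahead of it, while the late arrival is rejected and removed from the queue.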

The idea here is to give the POSTing transaction a window of time to be successfully placed in the queue even if a younger transaction has already POSTed. The younger transaction will not be allowed to check for availability of data items (and therefore will not be allowed to access data items) until that window of grace time has passed. An unsuccessful POST should be rare, since POSTing follows immediately after assignment of the global timestamp, leaving little time for a younger transaction (which would have been assigned its global timestamp later) to POST ahead of Ti. If, however, Ti's POST at site j is unsuccessful, it must get a new global timestamp.

The CHECK operation by Ti at site j can be initiated only after the grace window of time has elapsed for "older" transactions to successfully POST. Ti must wait to check until the local clock reaches lpti + window. The CHECK allows a transaction to check the ROLL for availability of the entire set of needed data items in one operation and without need for atomicity, critical sectioning or sequencing of needed items. A transaction CHECKs by simply accumulating the logical OR of all elements ahead of it in the ROLL. The accumulated OR represents an "access filter" for the transaction (a one means the item corresponding to that position is in use and a zero means it is available). CHECK can be repeated at any time to determine which needed items have become available (RELEASEd) since the last CHECK operation. Thus, transactions are never forced to wait idly for a response from the system scheduler. If the ROLL is quite lengthy, the multiple ORs required in the CHECK operation can be done in a binary fashion by materializing the pertinent ROLL elements into contiguous locations and performing the OR of the first half with the second, then the first half of the result with the second half of the result, etc. This reduces the number of OR steps to roughly log2 of the number of ROLL elements.
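The CHECK operation just described can be sketched as below. The function names (`check_allowed`, `access_filter`, `access_filter_binary`, `available`) are hypothetical, bit vectors are modeled as Python integers, and each pass of the halving loop stands in for the single wide OR that a contiguous, materialized buffer would allow; this is an illustration of the idea rather than the prototype's implementation.

```python
from functools import reduce
from typing import List


def check_allowed(now: float, lpt: float, window: float) -> bool:
    """CHECK may start only once the local clock reaches lpt + window."""
    return now >= lpt + window


def access_filter(ahead: List[int]) -> int:
    """Accumulate the logical OR of all request vectors ahead in the ROLL."""
    return reduce(lambda x, y: x | y, ahead, 0)


def access_filter_binary(ahead: List[int]) -> int:
    """Same result, computed by repeatedly ORing first half with second half."""
    vecs = list(ahead)
    if not vecs:
        return 0
    while len(vecs) > 1:
        half = (len(vecs) + 1) // 2
        first, second = vecs[:half], vecs[half:]
        second += [0] * (len(first) - len(second))  # pad an odd-length pass
        vecs = [a | b for a, b in zip(first, second)]
    return vecs[0]


def available(needed: int, filt: int) -> int:
    """Needed items whose filter bit is zero, i.e. not in use ahead of us."""
    return needed & ~filt


ahead = [0b0001, 0b0100, 0b0100, 0b0000, 0b0001]
filt = access_filter(ahead)          # 0b0101
needed = 0b0110
free_now = available(needed, filt)   # only the 0b0010 item is free
```

Five vectors are reduced in three halving passes here (5, then 3, then 2, then 1 element), which is the log2 behavior the text refers to.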
To RELEASE a data item once finished with it, the transaction simply flips the bit from a one to a zero (exclusive OR with a 1). If a subtransaction with a global timestamp greater than that of the subtransaction trying to post has already started the CHECK operation, then the posting subtransaction cannot post successfully; it must be aborted or returned to the originating site, and must come back with a new global timestamp. The only condition that we are enforcing is that the operation of enqueueing has to be atomic. An enqueue operation is provided as an atomic operation by some operating systems (IBM OS/400 [AST88]).

The CHECK operation allows a transaction to check the list of request vectors for the availability of the entire set of needed data items in one operation and without need for atomicity or critical sectioning. A subtransaction CHECKs by simply accumulating the logical OR of all elements ahead of it in the list. The accumulated OR represents an "access filter" for the transaction (a one means the item corresponding to that position is in use and a zero means that it is available). The CHECK operation can be repeated at any time thereafter to determine which needed items have become available (RELEASEd) since the last CHECK operation [PER91a].
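The interplay of RELEASE and a repeated CHECK can be illustrated with a small sketch. The helpers `release` and `access_filter` are hypothetical names, bit vectors are again modeled as Python integers, and the guard against releasing an item that is not held is an added assumption, not something the paper specifies.

```python
from functools import reduce
from typing import List


def access_filter(ahead: List[int]) -> int:
    """Logical OR of all request vectors ahead of the checking transaction."""
    return reduce(lambda x, y: x | y, ahead, 0)


def release(rv_data: int, item_mask: int) -> int:
    """Flip the held item's bit from one to zero by XOR with a 1."""
    if (rv_data & item_mask) != item_mask:
        raise ValueError("cannot RELEASE an item that is not held")
    return rv_data ^ item_mask


# T1 is ahead of T2 in the ROLL and holds two items; T2 needs one of them.
t1_data = 0b0101
t2_needed = 0b0100

before = t2_needed & ~access_filter([t1_data])  # item still in use by T1

t1_data = release(t1_data, 0b0100)              # T1 finishes with the item

after = t2_needed & ~access_filter([t1_data])   # repeated CHECK sees it free
```

The first CHECK yields no available items for T2; after T1's RELEASE the repeated CHECK reports the needed item as available, with no scheduler involved.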

Locking

In spite of the popularity, simplicity and proven performance of locking for concurrency control, there are difficulties in using it in a distributed setting. In locking, any transaction that needs access to a data item first obtains a lock (use privilege) for that data item, and relinquishes the lock only after it has used the data item and has obtained all the locks it needs on other data items. This is basic Two-Phase Locking (2PL). It has been shown that extending the 2PL technique to a distributed setting also involves extensive communication between the sites involved. The method employs a "lock table" data structure, which is usually implemented as a hash table. All requests for access to data must be made to a scheduler process, which must consult the lock table to determine whether the request can be granted or must be denied. The lock table is an inherently global, centralized structure and the scheduler is typically a performance bottleneck in the system. Methods have been proposed to partition or replicate this lock table structure around the distributed system, but high communication costs persist. On the other hand, 2PL has the distinct advantage over timestamp approaches that no separate "handshake" is needed between the Schedulers and the Data Manager (DM) processes (the processes that perform the actual accesses). That handshake requires much overhead, both in space to hold the large amount of ancillary timestamp data and in time to carry out the handshake.

The DW-ROLL method uses a partitioned data structure for concurrency control which requires no additional inter-site communication other than the required transport of the subtransactions and their RVs to their execution sites. Also, DW-ROLL retains the strong advantage of 2PL with regard to the Data Manager handshake (it is taken care of automatically through the release protocol, with no additional structure required).

Timestamping

A timestamp-based approach is a non-locking approach in which each transaction is given a unique timestamp upon entering the system. Conflicts are resolved by comparing the timestamp of the operation with the highest timestamp of any transaction that previously accessed the requested data item. Restart is commonly used to resolve detected conflicts. The main advantage is the ease of distribution. Global transactions (requiring access to data at more than one site) can piggy-back the timestamp on subtransactions destined for other sites. Once the subtransaction arrives, no further communication is required for the system to make the access decision (assuming non-replicated data, for simplicity of argument). The timestamp is simply compared with the highest timestamp of previously accessing transactions and the decision is made. That typical transaction processing requires no additional communication over the network is the main reason these methods have been popular in the distributed setting. The main disadvantages are that timestamping uses restart as the main conflict resolution methodology and that it requires elaborate, space- and time-consuming mechanisms to "handshake" with Data Managers.

The DW-ROLL method retains the advantage of timestamp methods for distributed systems, namely that a timestamp is piggy-backed on the transmission of subtransactions and no additional communication is required for normally executing transactions. Unlike timestamp methods, for normally executing transactions, DW-ROLL uses a waiting policy rather than a restart policy for conflict resolution. Also, unlike timestamping, DW-ROLL does not require an additional DM handshake, but uses an access release delay to oversee operation execution order for data management.

Serialization Graphs

Another non-locking technique that has been proposed is the Serialization Graphing technique. One of the major advantages of this technique is that it allows any serializable history to be accepted by the system (both locking and timestamping reject certain serializable, and therefore correct, executions). However, the serialization graph requires the same extensive DM handshaking as timestamping. Since serialization graphing is based on a Stored Serialization Graph (SSG), a global object which must be consulted to detect and resolve conflicts, serialization graphing does not distribute very well. As in 2PL, proposals have been advanced to partition the SSG, but then intersite communication is required whenever the conflict detection and resolution decision is remote from the data requested. If instead each site on the network maintains its own local serialization graph, then it has been shown that it is possible to construct executions in which all local serialization graphs are acyclic, while the global serialization graph contains a cycle. Yet another difficulty is handling global cycle detection and elimination, the cost of which could be prohibitive.

The DW-ROLL method retains the main advantage of serialization graphs, namely that no serializable history is rejected. In fact, the basic ROLL data structure contains a serialization graph. Unlike standard uses of serialization graphing, DW-ROLL does not require an additional DM handshake. Again, it uses an access release delay to oversee operation execution order at the Data Management level.
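The earlier remark that every local serialization graph can be acyclic while the global graph contains a cycle can be made concrete with a two-site example. The graph representation, the helper names `has_cycle` and `union`, and the two-transaction scenario are assumptions chosen for illustration; they are not drawn from the paper's references.

```python
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]  # edge T1 -> T2 means T1 is serialized before T2


def has_cycle(g: Graph) -> bool:
    """Depth-first search with white/grey/black coloring."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}

    def visit(node: str) -> bool:
        color[node] = GREY
        for nxt in g.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:          # back edge: cycle found
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(g))


def union(graphs: Iterable[Graph]) -> Graph:
    """Merge local graphs into the global serialization graph."""
    merged: Graph = {}
    for g in graphs:
        for node, succ in g.items():
            merged.setdefault(node, set()).update(succ)
    return merged


# At site 1, T1 conflicts before T2; at site 2, T2 conflicts before T1.
site1: Graph = {"T1": {"T2"}}
site2: Graph = {"T2": {"T1"}}

local_ok = not has_cycle(site1) and not has_cycle(site2)
global_cyclic = has_cycle(union([site1, site2]))
```

Each site alone sees a single edge and no cycle, yet the merged graph has T1 before T2 and T2 before T1, so the global execution is not serializable.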

SECTION 3. Implementations

Prototype implementations have been done on an IBM AS/400 B60 with the OS/400 [tm] database operating environment and on SUN Workstations with UNIX [tm]. The AS/400 platform was chosen as the primary implementation platform because of its extensive use world-wide as a data management platform and because, at the lowest level, OS/400 provides a completely object-oriented interface. Included in this interface is a very general, optimized "priority queue" object and a set of operations which are well suited to DW-ROLL. The operations include an atomic enqueue operation, which is used in the POST operation; a dequeue operation, which gives us the ability to remove bit vectors as soon as they are zeroed (substantially shortening the queue as compared to other implementations); a materialize operation and an OR for bit vectors, which allow easy implementation of the binary-OR CHECK operation in log2 steps; and a host of other logical operations to implement RELEASE and manage the access vector once it has been generated. DW-ROLL space reclamation is unnecessary, as logical space is never reused and the dequeue operation is atomic.

The UNIX implementation is done in the C language. The ROLL object and all operations were written from scratch. It should be noted that the object-oriented Machine Interface environment made implementation much easier on the AS/400 than it was in the UNIX environment. We attribute this difference to the fact that the objects we need to implement DW-ROLL are already there in OS/400.

DW-ROLL has the advantages of locking, timestamping and serialization graphs, while at the same time it does not have the disadvantages of those techniques. We also need to point out that the storage and communication requirements of DW-ROLL do not exceed those of the other methods.

SECTION 4. Experimentation

We have conducted three separate experiments to compare DW-ROLL with Two Phase Locking (2PL) and Timestamp Ordering (TO). In all three experiments, we used an open queuing model for a distributed database, in which transaction interarrival times were generated from a Poisson distribution with a mean of 4 seconds. Transactions are generated at each site and proceed through logical queues which determine the other logical queues to which transactions will be sent. In the logical model, transactions enter the concurrency control queue immediately. The subsequent action depends upon the concurrency control method being simulated (blocked | read/update for locking, restart | read/update for timestamping, and POST-CHECK-(blocked | read/update) or reject-POST/restart for DW-ROLL). The physical queuing model consists of the resources used by the logical model. Each queue in the logical model requires one or more of the queues in the physical model. For example, the logical concurrency control queue uses a physical CPU queue and a communication channel.

[Figure: Logical Queuing Model]

[Figure: Physical Queuing Model]

The following parameters were used in all three experiments.

Parameter                                                       Value
--------------------------------------------------------------  -------------------------------------------------
Cost of each locking or DW-ROLL operation (lock, POST, ...)     100 µsec
Cost of each timestamping operation (compare, handshake, ...)   20 µsec
I/O operation service time                                      (item_length / IO_channel_capacity) + avg_latency
    IO_channel_capacity                                         16 Mbytes/sec
    avg_latency                                                 0.003 seconds
    item_length                                                 2 Kbytes
Number of network sites                                         10
Number of data items                                            100,000
Mean inter-arrival time of transactions                         4 seconds
Transaction base set                                            8 data items
Data item size                                                  2,000 bytes
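As a quick check of the I/O service time formula in the parameter list, the arithmetic can be worked through directly. The sketch assumes decimal units (an item of 2,000 bytes and a channel of 16 × 10^6 bytes per second); the paper does not say whether Kbytes and Mbytes are decimal or binary, and the function name `io_service_time` is hypothetical.

```python
def io_service_time(item_length_bytes: float,
                    channel_bytes_per_sec: float,
                    avg_latency_sec: float) -> float:
    """(item_length / IO_channel_capacity) + avg_latency, in seconds."""
    return item_length_bytes / channel_bytes_per_sec + avg_latency_sec


# Assumed decimal units: 2,000 bytes over 16,000,000 bytes/sec, 3 ms latency.
transfer = 2_000 / 16_000_000          # transfer component, in seconds
service = io_service_time(2_000, 16_000_000, 0.003)
```

Under these assumptions the transfer component is 0.125 ms, so the service time of roughly 3.125 ms per I/O is dominated by the 3 ms average latency.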

In experiment 1, 100 concurrent read-only transactions were issued. The response times for the 3 methods were recorded and averaged. The experiment was run 100 times. The final average response times were compared. In experiment 2, 100 concurrent write-only transactions were issued. In experiment 3, 100 concurrent read-and-write transactions were issued. In the first experiment, DW-ROLL's performance was clearly superior. In the second experiment, DW-ROLL's performance was again superior. However, the difference in the averages was not so marked as in the first experiment. In the third experiment, DW-ROLL's response times were shown to be statistically better than those of Timestamp Ordering (TO), but it could not be concluded that DW-ROLL's response times are better than those of 2PL. However, the average response time for DW-ROLL over all runs was better than that of 2PL.

SECTION 5. Conclusion

In this paper, we have introduced a database concurrency control object, called DW-ROLL (Distributed Window-Request Order Linked List). The DW-ROLL object consists of a linked list of bit vectors and three simple operations available to transactions. Using the POST operation, transactions establish their serialization order. The CHECK operation allows transactions to determine current resource availability. The RELEASE operation makes a resource available to other transactions once a transaction has accessed it. DW-ROLL is neither a locking protocol nor a timestamp protocol and no system scheduler module is involved. Using DW-ROLL, both waiting and restarting are minimized. It is always deadlock and livelock free. Almost all system operations can be invoked in parallel by individual transaction manager modules. Other than the atomic enqueue operation, no critical section is needed. All other operations can be executed in parallel. DW-ROLL can be used in a heterogeneous distributed database environment.

We introduced the DW-ROLL object, discussed desirable extensions and showed DW-ROLL's superior performance through experimentation and statistical analysis of the results. The DW-ROLL concurrency control method is a special implementation of the Serialization Graph method. It has several notable advantages over the concurrency control protocols in wide use today, such as Two Phase Locking and Timestamp Ordering. It minimizes waiting and restarting, requires no additional network communication for concurrency control, requires no Data Manager-to-Scheduler handshake protocol, and adapts well to the heterogeneous distributed database environment.

We feel DW-ROLL is very promising for a variety of situations. DW-ROLL is applicable in a heterogeneous environment by managing global transactions on top of local autonomous database management systems. A global concurrency control manager would be required, as well as local agents at each site. Future research is planned in this direction.

Bibliography

[AST88] IBM Application System/400 Technology, SA21-9540-0.
[AGR93] D. Agrawal, S. Sengupta, "Modular Synchronization in Distributed, Multiversion Databases: Version Control and Concurrency Control", page 126, IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 1, February, 1993.
[BAR88] Bhargava, B. and Reidl, J., "A Model for Adaptable Systems for Transaction Processing", Proceedings of the IEEE International Conference on Data Engineering, IEEE, April, 1991.
[BAS88] Bassiouni, M., "Single-site and Distributed Optimistic Protocols for Concurrency Control", ACM Transactions on Software Engineering, August 1988.
[BER87] P.A. Bernstein, V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley Publishing Co., Inc., copyright 1987.
[GAR90] H. Garcia-Molina, B. Lindsay, Research Directions for Distributed Database, page 98, SIGMOD RECORD, Vol. 19, No. 4, December 1990.
[GRA78] Gray, J., "Notes on Database Operating Systems", Operating Systems: An Advanced Course, Springer-Verlag, 1978.
[GRA93] Gray, J. and Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.
[HEL93] Helal, A., Hu, T., Elmasri, R., and Mukherjee, S., "Adaptive Transaction Scheduling", Proceedings of the ACM Conference on Information and Knowledge Management, Washington, DC, November, 1993.
[KUN81] Kung, H. and Robinson, J., "Optimistic Concurrency Control", ACM Transactions on Database Systems, Vol. 6:2, June 1981.
[PER89] W. Perrizo, "IBM AS/400 Query Acceleration Techniques," Final Report #6319118-9-89, IBM Rochester, MN, Nov. 1, 1989.
[PER91] W. Perrizo, et al., "Transaction Management in HYDRO: A Multidatabase System", Proceedings of The First International Workshop on Interoperability in Multidatabase Systems, Kyoto, Japan, April, 1991.
[PER91a] William Perrizo, "Request Order Linked List (ROLL): A Concurrency Control Object", Proceedings of the IEEE International Conference on Data Engineering, IEEE, Japan, April 11, 1991.
[THO91] Thomasian, A., "Performance Limits of Two-Phase Locking", Proceedings of the IEEE International Conference on Data Engineering, IEEE, Japan, April 11, 1991.