Intermediate Checkpointing with Conflicting Access Prediction in Transactional Memory Systems*

M. M. Waliullah and Per Stenstrom
Department of Computer Science and Engineering
Chalmers University of Technology
SE 412-96 Göteborg, Sweden
Email: {waliulla,pers}@ce.chalmers.se

Abstract

Transactional memory systems promise to reduce the burden of exposing thread-level parallelism in programs by relieving programmers from analyzing complex inter-thread dependences in detail. By encapsulating large program code blocks and executing them as atomic blocks, dependence checking is deferred to run-time, at which point one of many conflicting transactions is committed whereas the others have to roll back and re-execute. In current proposals, a checkpoint is taken at the beginning of the atomic block, and all execution can be wasted even if the conflicting access happens at the end of the atomic block. In this paper, we propose a novel scheme that (1) predicts when the first conflicting access occurs and (2) inserts a checkpoint before it is executed. When the prediction is correct, the only execution discarded is the execution that has to be re-done. When the prediction is incorrect, the whole transaction has to be re-executed, just as before. Overall, we find that our scheme maintains high prediction accuracy and leads to a quite significant reduction in the number of cycles lost to roll-backs; the geometric mean speedup across five applications is 16%.

1. Introduction

As we embark on the multi-core roadmap, there is a major quest for strategies to make parallel programming easier. One of many difficulties faced when designing a parallel program is to orchestrate the program in such a way that dependences among threads are respected. While lock-based constructs, such as critical sections, have been popular, they can introduce serialization bottlenecks if the critical sections are too long, and they can also lead to deadlocks. Transactional memory (TM) [1,4,5,6,8,12,13,14,15] can avoid the serialization imposed by coarse critical sections. This is done by allowing threads to execute critical sections in parallel while preserving atomicity and isolation. As long as there are no data conflicts, thread-level parallelism is uncovered; otherwise, transactional memory forces some transactions to re-execute in a serial fashion.

* This research is sponsored by the SARC project funded by the EU under FET. The authors are members of HiPEAC – a Network of Excellence funded by the EU under FP6.


Transactional memory system proposals abound in the literature and can be classified broadly into hardware (HTM) [1,4,5,8,12] and software (STM) [14] transactional memory systems. Whereas the former performs data-conflict resolution in hardware, the latter emulates data-conflict resolution in software. Recently, some researchers have begun investigating hybrids between the two (HyTM) [6,13,15]. Based on the observation that STM imposes significant overhead, we target HTM systems in this paper, although our contributed concepts can be applied to STM and HyTM systems as well.

In the transactional memory systems proposed so far, data conflicts are detected either lazily or eagerly [8]. HTM systems built on lazy data-conflict detection, such as TCC [2], typically take a checkpoint when a transaction is launched for execution and record the read and the write set, i.e., the sets of locations that are speculatively read and written, respectively. Data-conflict resolution occurs when a transaction finishes and is about to commit. If the read set of an unfinished transaction intersects with the write set of the committing transaction, the unfinished transaction has to be squashed and rolled back to the beginning. Clearly, if the conflicting access happened at a late point, useful execution can get wasted, which may lead to a significant loss in performance and power.

In HTM systems built on eager data-conflict resolution, such as LogTM [8], ongoing transactions are notified immediately when a location is modified. As a result, the decision of which transaction should be re-executed is not postponed until the commit point. Instead, the faulting read access is sometimes delayed so that the conflicting transaction is not squashed. However, when two transactions each have a read access that conflicts with the other, one of them is squashed and has to be rolled back to the beginning.
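The lazy commit-time check described above amounts to a set intersection. The sketch below makes that explicit with plain address arrays; this is purely illustrative (a real HTM such as TCC encodes these sets as R/W bits in the private cache, not as arrays), and all identifiers are our own.

```c
#include <stdbool.h>
#include <stddef.h>

/* Lazy conflict detection, sketched at the granularity of address sets.
   An unfinished transaction must be squashed if and only if its read set
   intersects the committing transaction's write set. */
bool conflicts(const unsigned *commit_ws, size_t ws_len,
               const unsigned *read_set, size_t rs_len)
{
    for (size_t i = 0; i < ws_len; i++)
        for (size_t j = 0; j < rs_len; j++)
            if (commit_ws[i] == read_set[j])
                return true;   /* conflicting access found */
    return false;              /* read set and write set are disjoint */
}
```

In hardware, this intersection falls out of bus snooping: each committing address is compared against the local R bits, so no explicit loop is needed.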
Again, the faulting read access may have happened much earlier, and a lot of useful execution can get wasted. Of course, the longer the transactions are, the more useful execution can be wasted by conservatively forcing a transaction to re-execute from its beginning.

While recent proposals for supporting nested transactions [7,10] insert checkpoints at the start of each transaction nested inside another, thus reducing the amount of useful execution that is wasted, they do not solve the general problem of squashing only the execution that depends on the conflicting access. This paper provides a solution to this general problem.

In this paper, we propose a new HTM protocol that records all potentially conflicting accesses when a transaction is executed. When a transaction is squashed, the conflicting addresses that are part of the write set of the committing transaction are book-kept. The next time the transaction is executed, a checkpoint is inserted when any of the book-kept conflicting addresses is accessed. If the transaction is squashed, it is rolled back to the checkpoint associated with the first conflicting access. We show that this scheme can be supported with fairly limited extensions to a TCC-like protocol and that it saves a significant part of the useful execution done by a squashed transaction.

We first establish the baseline system and frame the problem we solve in Section 2. Section 3 presents our scheme for inserting intermediate checkpoints – our main contribution. We then evaluate our concept, describing the methodology in Section 4 and the experimental results in Section 5. We discuss how our contributions relate to prior work in Section 6 and finally conclude in Section 7.
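As a rough sketch of the bookkeeping this protocol implies, the fragment below keeps a small table of previously conflicting addresses and consults it before each speculative access to decide whether to take an intermediate checkpoint. The table size, the flat-array organization, and all names are our own assumptions for illustration, not the paper's hardware design.

```c
#include <stdbool.h>
#include <stddef.h>

#define PRED_TABLE_SIZE 16   /* illustrative capacity, not from the paper */

unsigned predicted_addr[PRED_TABLE_SIZE];
size_t num_predicted = 0;

/* Called when a transaction is squashed: book-keep an address that was
   read by this transaction and appeared in the committing transaction's
   write set, so it can be predicted as conflicting next time. */
void record_conflict(unsigned addr)
{
    for (size_t i = 0; i < num_predicted; i++)
        if (predicted_addr[i] == addr)
            return;                      /* already book-kept */
    if (num_predicted < PRED_TABLE_SIZE)
        predicted_addr[num_predicted++] = addr;
}

/* Called before each speculative access on re-execution: if the address
   was predicted to conflict, take an intermediate checkpoint so a later
   squash rolls back only to this point, not to the transaction start. */
bool should_checkpoint(unsigned addr)
{
    for (size_t i = 0; i < num_predicted; i++)
        if (predicted_addr[i] == addr)
            return true;
    return false;
}
```

If the prediction misses (the transaction conflicts on an address not in the table), the behavior degenerates to the baseline: a roll-back to the start of the transaction.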


2. Baseline System and Problem Statement

We first describe the baseline system assumed in the study in Section 2.1. Section 2.2 then describes the problem investigated in this paper.

2.1 Baseline Architecture

Without loss of generality, but to assume a concrete design point, we consider TCC [2] as the baseline system. In TCC, all processors are connected to main memory through a central bus. To support transactions, the private data cache is modified to track the speculative read and write operations of an ongoing transaction. The speculatively read and written locations of a whole transaction form the read and the write set, respectively. These sets are tracked with an R and a W bit associated with each cache block, as shown in Figure 1. When a read or a write operation occurs in a transaction, the respective R or W bit is set; the bits are reset on a commit or on a mis-speculation.

Before a transaction is launched, the internal state of the processor is checkpointed. On a commit, all modified values of the write set are propagated to main memory, and all information regarding the read and the write set is discarded by resetting the R and W bits. When the values associated with the write set are propagated to main memory through the central bus, all other caches snoop the associated addresses on the bus. If any address in the write set of the committing transaction conflicts with an address in the read or the write set of another cache, the processor attached to that cache squashes its current transaction and restarts it from the beginning. Squashing involves invalidating all speculatively modified cache blocks and resetting all R and W bits. While the metadata is allocated at the block level in the original implementation, it can also be maintained at word granularity to avoid false mis-speculations.
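The R/W bit bookkeeping described above can be modeled in a few lines of C. The direct-mapped toy cache and all identifiers below are illustrative assumptions for exposition, not TCC's actual cache organization.

```c
#include <stdbool.h>

#define NUM_BLOCKS 64   /* toy direct-mapped cache; illustrative only */

/* Per-block metadata mirroring Figure 1: valid bit, tag, and the
   speculative-read (R) and speculative-write (W) bits. */
typedef struct {
    bool valid;
    unsigned tag;
    bool r, w;
} cache_block_t;

cache_block_t tx_cache[NUM_BLOCKS];

static unsigned block_index(unsigned addr) { return addr % NUM_BLOCKS; }

/* A speculative read inside a transaction sets the block's R bit. */
void tx_read(unsigned addr)
{
    cache_block_t *b = &tx_cache[block_index(addr)];
    b->valid = true; b->tag = addr; b->r = true;
}

/* A speculative write inside a transaction sets the block's W bit. */
void tx_write(unsigned addr)
{
    cache_block_t *b = &tx_cache[block_index(addr)];
    b->valid = true; b->tag = addr; b->w = true;
}

/* On a commit or a squash, all R and W bits are reset in one sweep. */
void tx_reset(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++)
        tx_cache[i].r = tx_cache[i].w = false;
}
```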

[Figure 1 here: a processor with checkpoint registers and its private data cache; each cache block holds V (valid), Tag, R, W, and Data fields.]
Figure 1: The modified private data cache proposed in TCC. The cache is extended with two single-bit metadata fields per block: a speculative-read bit (R) and a speculative-modification bit (W).

Figure 2 shows the execution of two conflicting transactions. Processor P1 finishes and commits its first transaction. While P1 broadcasts its write set, which contains Wa, P2 detects a conflict with the read set of its ongoing transaction and squashes that transaction. The squashed transaction wastes a lot of execution time but nevertheless maintains the correctness of the program.


[Figure 2 here: execution timeline for processors P1 and P2. P1 commits transaction Tx0, whose write set contains Wa. P2's ongoing transaction Tx0, which has performed the read Ra, is squashed and re-executed, followed by Tx1. Legend: committed, squashed, and ongoing transactions.]
Figure 2: An execution pattern of transactions running on two processors.

In accordance with the TCC hardware model, and for clarity, we assume that the whole program is decomposed into transactions. Hence, a transaction starts immediately after a successful commit of the previous transaction.

2.2 Problem Statement

In Figure 2, we have seen that an access conflict between a modified location of a committing transaction and a read access to the same location by an ongoing transaction forces the ongoing transaction to squash and restart from the beginning. The restart ignores the position of the conflicting access, which is 'Ra' in this case. To preserve the correctness of the program, it is necessary and sufficient to restart from a position before the first conflicting access. Therefore, by restarting from the beginning, useful execution is wasted, which may impede performance and lead to power losses. The waste in execution bandwidth can be huge if the conflicting access occurs late in a long-running transaction. To concretely show the losses, let us consider an execution scenario illustrated by the following micro-benchmark:

    char *str, *vowels;
    int count;

    main() {
        /* loop bounds reconstructed: partition str into per-thread chunks */
        for (i = 0; i < str_size; i += chunk) {
            low = i;
            high = i + chunk;
            if (high > str_size)
                high = str_size;
            create_thread(worker, low, high);
        }
    }

    worker(low, high) {
        /* body reconstructed from the variable names: each thread is
           assumed to count the vowels in its chunk of str, updating the
           shared counter 'count' – the source of the conflicting accesses */
        for (i = low; i < high; i++) {
            if (strchr(vowels, str[i]))
                count++;
        }
    }