Dynamic Computation Scheduling for Load Balancing in Home-based Software DSMs

Weisong Shi and Zhimin Tang
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, 100080, P.R. China
E-mail: {wsshi, [email protected]}

Abstract

Load balancing is a critical issue for achieving good performance in parallel and distributed systems. However, this issue has been neglected in software DSM research over the past decade. In this paper, we present and evaluate a dynamic computation scheduling scheme for load balancing of iterative applications in software DSM systems. The experimental platform is a home-based DSM system named JIAJIA. Preliminary results show that this load balancing scheme is efficient and can be used in other software DSM systems. Compared with the simple chunk self-scheduling scheme, which works well for single iteration applications, system performance is improved by about 30% with the affinity-based self-scheduling proposed in this paper.

1 Introduction

Load balancing plays an important role in achieving good performance in parallel and distributed systems. This issue is even more critical in a metacomputing environment, where machines may be shared among many users. To maximize performance with the available resources, a parallel system must distribute the work not only according to the inherent computation and communication demands of the application, but also according to the computation resources available at run time. In a dynamic load balancing scheme, the run-time system needs an appropriate way to change the amount of work assigned to each processor. In many scientific applications, loops are the richest source of parallelism; therefore, changing the number of loop iterations performed by each processor can balance the load. We restrict our attention to this kind of program in this paper. Loop scheduling has been extensively studied in past years [10, 8, 1, 11, 6, 12]. However, most of this previous work focuses either on shared memory multiprocessors or on distributed systems. As far as we know,

The work of this paper is supported by the CLIMBING Program, and the National Natural Science Foundation of China under Grant No. 69896250-1.

there is only one paper addressing this issue in software DSM systems [5], and it requires compiler support to extract access patterns from the source code. In this paper, we present and evaluate a new approach that uses a dynamic computation migration technique to redistribute the load of iterative applications.1 In this scheme, all operations are done by the run-time system automatically, without any user intervention. Compared with traditional task queue based scheduling methods, our new approach reduces the synchronization overhead to a minimum and exploits the affinity between processor and memory to a great extent. For single iteration applications,2 we show that simple chunk self-scheduling works well, but remote data fetch time contributes much of the extra overhead. Therefore, we propose a new affinity-based self-scheduling for this kind of application. Our target software DSM system is JIAJIA [3]. With our new dynamic scheduling technique, the execution time of iterative applications can be improved by 20%-40% when the load of the machines varies greatly. Although the simple chunk self-scheduling scheme works well, system performance can be improved by about 30% for single iteration applications by the new affinity-based self-scheduling. The rest of the paper is organized as follows. Related work is discussed in Section 2. In Section 3, the design and implementation of the dynamic scheduling algorithm are described. Preliminary results and analysis are given in Section 4. Finally, concluding remarks and future work are presented in Section 5.
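To make the two application classes concrete, the following sketch (our illustration; N, STEPS, and compute() are placeholder names, not from the paper) shows the loop structure of each:

```c
#define N 1024
#define STEPS 10

static double a[N];

/* Placeholder loop body touching element i. */
static void compute(int i) { a[i] = a[i] * 0.5 + 1.0; }

/* "Iterative application": a parallelizable loop nested inside a
   sequential outer loop, so iteration i touches the same data on
   every time step -- the source of processor affinity exploited by
   the scheduler. */
static void iterative_app(void) {
    for (int step = 0; step < STEPS; step++)   /* sequential outer loop */
        for (int i = 0; i < N; i++)            /* parallelizable inner loop */
            compute(i);
}

/* "Single iteration application": the outermost loop itself can be
   parallelized, and each iteration's data is touched only once. */
static void single_iteration_app(void) {
    for (int i = 0; i < N; i++)                /* parallelizable outer loop */
        compute(i);
}
```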

2 Related Work

1 Here, we use iterative application to denote an application that has a sequential outer loop. 2 Here, we use single iteration application to denote an application whose outermost loop can be parallelized.

Many loop scheduling schemes have been proposed to exploit parallelism. Static scheduling minimizes run-time synchronization overhead, but does not balance the load dynamically. Dynamic scheduling algorithms include processor self-scheduling [10], guided self-scheduling [8], adaptive guided self-scheduling [1], trapezoid self-scheduling [11], affinity scheduling [6], adaptive affinity scheduling [12], etc. These algorithms fall into two distinct classes, central queue based algorithms and distributed queue based ones, according to the organization of the task queue. These schemes work well in the context of multiprocessors; however, their performance is not desirable in software DSMs [9]. Ioannidis and Dwarkadas [5] were the first to address load balancing in software DSM systems. Their scheme requires compiler support to extract access patterns from the source code, which differs from our scheme. Furthermore, our redistribution scheme is different from theirs as well. Markatos and LeBlanc [7] analyzed the importance of load balancing and locality management in shared memory multiprocessors. They argued that locality management is more important than load balancing in thread assignment. They introduced a policy called memory-conscious scheduling, which assigns threads to processors whose local memory holds most of the data the thread will access. Their results showed that the looser the interconnection network, the more important the locality management. Our algorithm is greatly inspired by this conclusion. Hsieh et al. proposed computation migration to enhance locality in distributed memory parallel systems [2]. They compared the performance of RPC and data migration. Since the goal of their computation migration is locality, it may result in load imbalance; this differs from our goal, which is to use computation migration to balance the load while trying to maintain processor affinity at the same time. Furthermore, the implementation methods differ: they implemented computation migration at the kernel level, while we implement it at the user level.
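To make the contrast among these scheduling schemes concrete, here is a small sketch (ours, not from the cited papers) of how the chunk handed out at each grab is computed under simple chunk self-scheduling and guided self-scheduling; the chunk size and processor count below are illustrative:

```c
/* Guided self-scheduling (GSS): each time a processor grabs work it
   takes ceil(remaining / P) iterations, so chunks shrink geometrically,
   large at first for low overhead, small at the end for balance. */
static int gss_next_chunk(int remaining, int nproc) {
    if (remaining <= 0) return 0;
    return (remaining + nproc - 1) / nproc;   /* ceil(remaining / P) */
}

/* Simple chunk self-scheduling: every grab takes the same fixed chunk
   (or whatever is left), trading balance for fewer synchronizations. */
static int css_next_chunk(int remaining, int chunk_size) {
    if (remaining <= 0) return 0;
    return remaining < chunk_size ? remaining : chunk_size;
}
```

With 100 iterations and 4 processors, GSS hands out 25 first, then 19 of the remaining 75, and so on, while chunk self-scheduling hands out a constant 16 (say) until the tail.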

3 Dynamic Computation Scheduling Algorithm

3.1 JIAJIA Software DSM System

JIAJIA characterizes itself with a new lock-based cache coherence protocol and a home-based memory organization scheme that combines the physical memories of multiple workstations to form a large shared address space. JIAJIA is implemented entirely as a user-level library and currently runs on many mainstream UNIX platforms. A multiple-writer technique is employed to alleviate false sharing. JIAJIA implements the scope consistency (ScC) memory model [4]. The unique characteristic of JIAJIA is the use of the home concept not only for data information, but also for coherence information such as write notices. Generally, home-based software DSM systems [4] propagate data information and apply it at homes eagerly, while this information is fetched by others lazily. On the other hand, these systems propagate coherence information and apply it either eagerly or lazily. In our lock-based cache coherence protocol, coherence information is processed in a similar way to data information, and each piece of coherence information has a static home according to the corresponding synchronization object (e.g., a lock or a barrier manager). So, coherence information is propagated to the corresponding home

at release time, and is lazily fetched by the next acquirer of the same synchronization object. Compared with directory-based protocols, all coherence-related actions in the protocol are applied at synchronization points. In this way, the lock-based protocol has the least coherence-related overhead for ordinary read or write misses. Moreover, the lock-based protocol is free from the overhead of maintaining a directory. JIAJIA provides three types of synchronization mechanisms: locks, barriers, and condition variables. JIAJIA also supports a home migration scheme based on information about access patterns collected at run time. Home migration is optional and is enabled by calling the jia_mbarrier() system call.

3.2 Rationale of the Dynamic Computation Scheduling Algorithm

The performance of a loop scheduling algorithm is mainly affected by three overhead sources: synchronization overhead, load imbalance, and remote data access latency. Remote data access latency can be characterized by processor affinity, which refers to the dependence of a task on data resident at a specific processor. One important context in which loop iterations have an affinity for a particular processor is when:

1. the same parallel loop is nested within a sequential loop, so the same data is repeatedly used by successive executions of an iteration (iterative applications satisfy this condition), and

2. the cache (in a software DSM, this means the software cache) is large enough to hold the data until it can be reused.

Many iterative scientific computing applications have these characteristics, and we restrict our attention here to this specific scenario of a parallel loop nested in an outer sequential loop. For applications without an outer sequential loop, we show that although simple chunk self-scheduling works well in a software DSM system, the remote data fetch time can be optimized further. Therefore, we propose an affinity-based self-scheduling in Section 4.3, and the performance evaluation results listed there validate our idea. Many proposed scheduling algorithms address the reduction of synchronization overhead and favor load balancing. However, previous work by Markatos and LeBlanc [7] reveals that locality is more important than load balancing in shared memory multiprocessors, since the remote memory access latency is too long. Based on this observation, they proposed the affinity scheduling algorithm [6], which clearly outperforms traditional algorithms. In software DSM systems, the gap between remote access and local access is even larger than in multiprocessors. Therefore, it is critical to preserve processor affinity when scheduling loops in software DSM systems. Experimental results will validate this observation later.
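As a concrete baseline, here is a minimal sketch of simple chunk self-scheduling, written against plain pthreads rather than JIAJIA itself (all constants and names are illustrative): each worker repeatedly grabs a fixed chunk from a shared loop index, paying one lock acquisition per chunk rather than per iteration. An affinity-based variant would instead let each processor first consume the iterations whose pages are homed locally, and only then steal from others.

```c
#include <pthread.h>

#define TOTAL_ITERS 1000
#define CHUNK 16
#define NWORKERS 4

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int next_iter = 0;                 /* shared loop index */
static int done_count[NWORKERS];          /* iterations executed per worker */

static void *worker(void *arg) {
    int id = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);       /* one lock per chunk, not per iteration */
        int start = next_iter;
        next_iter += CHUNK;
        pthread_mutex_unlock(&qlock);
        if (start >= TOTAL_ITERS) break;
        int end = start + CHUNK;
        if (end > TOTAL_ITERS) end = TOTAL_ITERS;
        for (int i = start; i < end; i++)
            done_count[id]++;             /* stand-in for the real loop body */
    }
    return 0;
}

/* Runs the parallel loop once; returns the total number of iterations
   executed, which must equal TOTAL_ITERS regardless of interleaving. */
static int run_chunk_self_scheduling(void) {
    pthread_t t[NWORKERS];
    int ids[NWORKERS];
    for (int p = 0; p < NWORKERS; p++) {
        ids[p] = p;
        pthread_create(&t[p], 0, worker, &ids[p]);
    }
    int total = 0;
    for (int p = 0; p < NWORKERS; p++) {
        pthread_join(t[p], 0);
        total += done_count[p];
    }
    return total;
}
```

The design point this illustrates is the one argued above: distributing or coarsening the queue reduces synchronization, but nothing in this baseline ties chunk i to the processor that cached its data in the previous outer iteration, which is exactly what the affinity-based schemes add.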
Compared with central queue based algorithms, the synchronization overhead caused by loop scheduling within each outer sequential iteration can be alleviated greatly by distributing the task queue. However, this synchronization overhead and related contention can


for (i=0; i