MYOAN: a Shared Virtual Memory Facility for the Intel Paragon Supercomputer

G. Cabillic, T. Priol, I. Puaut

Abstract

myoan is a shared virtual memory system designed for the Intel Paragon supercomputer. The main feature of myoan is its support for multiple consistency protocols suited to the applications' access patterns. This paper describes the design choices, implementation and performance of myoan on the Intel Paragon XP/S supercomputer.

1 Introduction

Shared virtual memory (svm) [1] is a software abstraction of shared memory on an architecture without hardware support for shared memory, such as a network of workstations or a distributed memory multicomputer. This abstraction is attractive because it simplifies programming; processors can access both local and remote data using the standard read and write operations. In most svm systems, consistency of shared data is managed by using distributed versions of multicache consistency protocols (e.g., [1]) that scale the unit of sharing to a virtual memory page in order to increase performance. Another option for improving performance is to define less restrictive definitions of memory consistency than the standard (strong consistency) criterion, which states that "reads return the most recent write" [2]. Proposals for relaxed definitions of memory consistency include weak consistency [3], release consistency [4], and entry consistency [5]. Studies on sharing and synchronization in parallel programs (e.g., [6]) have shown that each consistency protocol is better suited to a different class of shared data object. Munin [7] and koan [8] have the interesting property of employing different consistency mechanisms, each appropriate for a different class of memory access pattern. Moreover, both systems offer the ability to change the consistency protocol of a shared object dynamically. This paper describes myoan, an implementation on the Paragon xp/s of the koan shared virtual memory facility, initially designed for the Intel iPSC/2 Hypercube. The remainder of this paper is organized as follows. Section 2 describes the main features of myoan. Its implementation on the Paragon is detailed in Section 3. Performance measurements of the implementation are given in Section 4. Finally, a comparison with related work is given in Section 5.

The work described in this paper is supported by Intel SSD under an External Research and Development Program (INRIA contract no. 193C21431318012).


2 Main Features of MYOAN

myoan allows regions of virtual memory to be shared by processes running on distinct nodes. The default memory consistency semantics for memory regions in myoan is strong (atomic) consistency, which is implemented using an invalidation-based protocol similar to the one implemented in [1] (fixed distributed scheme). In addition, multiple consistency protocols are supported. The remainder of this section focuses on these consistency protocols. A full description of myoan can be found in [9].
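As background for what follows, the write-fault side of such an invalidation protocol can be sketched as below. This is our reconstruction of the general scheme of [1], not myoan's actual code; the message helpers and the 64-node bound are assumptions.

    /* Minimal sketch of the fixed-distributed-manager invalidation scheme
     * of [1]. The helpers send_invalidate and send_page are hypothetical;
     * message passing is abstracted away. */
    extern void send_invalidate(long node);     /* hypothetical helper */
    extern void send_page(long from, long to);  /* hypothetical helper */

    struct page_desc {
        long owner;             /* node holding the read-write copy     */
        unsigned long copyset;  /* bitmask of nodes holding read copies */
    };

    /* On a write fault, the manager invalidates every read copy, has the
     * current owner ship the page to the faulting node, and records the
     * faulter as the single read-write copy holder. */
    void handle_write_fault(struct page_desc *d, long faulter)
    {
        for (long node = 0; node < 64; node++)
            if (((d->copyset >> node) & 1UL) && node != faulter)
                send_invalidate(node);
        if (d->owner != faulter)
            send_page(d->owner, faulter);  /* transfer page and ownership */
        d->owner = faulter;
        d->copyset = 1UL << faulter;       /* back to a single r/w copy   */
    }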

2.1 Multiple-writers Consistency Protocol

Many relaxed consistency models have been proposed in order to increase the performance of svm systems. myoan proposes a relaxed form of consistency for parallel processes satisfying the well-known Bernstein conditions [10], which state that processes can execute in parallel if they act on independent data. If we call a parallel block a sequence of statements satisfying the Bernstein conditions, two synchronization barriers are required: one before the beginning of the parallel block, and the other after the end of its execution. As there is no data dependence between the elements of a parallel block, the new values of the variables modified by each element need not be seen immediately by the others. This permits myoan to increase the performance of applications, as detailed below. The consistency semantics of the memory region associated with a parallel block is set to multiple-writers at the beginning of the block. During the execution of the parallel block, each page of such a region is replicated in read-write access mode in the physical memories of the processors that write into the page. At the end of the parallel block, the consistency semantics of the region is reset to strong consistency. Consequently, each page of the region must again have a single read-write copy; this copy is obtained by merging the replicas created during the parallel block's execution. Multiple-writers consistency semantics eliminates conflicts when parallel processes modify independent data belonging to the same page (i.e., false sharing), and thus removes the resulting performance overhead due to page faults and invalidations.
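To make the pattern concrete, the sketch below shows what a parallel block might look like from the application's point of view. The names myoan_set_protocol, myoan_barrier and the protocol constants are illustrative placeholders, not the actual myoan interface.

    /* Hypothetical sketch of a parallel block under multiple-writers
     * consistency; all identifiers below are illustrative names. */
    enum { MYOAN_STRONG, MYOAN_MULTIPLE_WRITERS };
    extern void myoan_set_protocol(void *region, int protocol); /* hypothetical */
    extern void myoan_barrier(void);                            /* hypothetical */

    void parallel_block(double *shared, long nelems, long me, long nprocs)
    {
        long chunk = nelems / nprocs;  /* independent slice per process */
        long lo = me * chunk, hi = lo + chunk;

        myoan_set_protocol(shared, MYOAN_MULTIPLE_WRITERS);
        myoan_barrier();               /* barrier before the block */
        for (long i = lo; i < hi; i++)
            shared[i] = 2.0 * (double)i;  /* disjoint writes: pages may be
                                             replicated read-write, so false
                                             sharing causes no invalidations */
        myoan_barrier();               /* barrier after the block */
        myoan_set_protocol(shared, MYOAN_STRONG); /* replicas merged into one copy */
    }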

2.2 Read-Only Consistency Protocol

myoan offers a specific consistency protocol that avoids the cost of the base invalidation-based protocol for data that is never modified. This protocol is more efficient than the invalidation-based protocol because no data structure recording the page membership (copy set) has to be maintained, and no serialization of page faults is required.

2.3 Producer-Consumer Consistency Protocol

The producer-consumer consistency protocol was introduced to increase the performance of parallel programs that exhibit a producer-consumer scheme, where one producer process writes data in a shared region while several consumer processes, after synchronizing with the producer, read the data. Using strong consistency by means of an invalidation-based consistency protocol results in page faults for consumer processes, as well as message exchanges required for transferring the page(s) from the producer to the consumers. The basic idea of the producer-consumer consistency protocol of myoan is to broadcast the pages that have been modified by the producer to the consumers at the end of the production phase, which can be implemented efficiently if the underlying network provides a broadcast or multicast facility. This protocol is provided through two routines: begin_broadcast and end_broadcast. A call to begin_broadcast initiates a production phase and makes the producer record the list of pages it modifies; a call to end_broadcast initiates the beginning of a consumption phase and causes the pages that were modified by the producer to be sent to the consumers.
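The fragment below sketches how the two routines might be used. Only the routine names begin_broadcast and end_broadcast come from the text; their prototypes and the region argument are assumptions.

    /* Sketch of a production phase; prototypes are assumed. */
    extern void begin_broadcast(void *region);  /* assumed prototype */
    extern void end_broadcast(void *region);    /* assumed prototype */

    void produce(double *region, long n)
    {
        begin_broadcast(region);      /* start recording modified pages     */
        for (long i = 0; i < n; i++)
            region[i] = (double)i;    /* production phase                   */
        end_broadcast(region);        /* modified pages are broadcast to the
                                         consumers, which can then read them
                                         without taking page faults         */
    }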

3 Implementation of MYOAN on the Paragon

This section describes the implementation of myoan on the Intel Paragon XP/S. The description is divided into three steps. First, the target hardware and software environment is described. The design choices for building myoan on top of the Paragon osf/1 operating system are then highlighted. Finally, the implementation of myoan is sketched.

3.1 Hardware and Software Environment

The Intel Paragon supercomputer [11] is a distributed memory multicomputer composed of a large number of processors arranged in a two-dimensional array and linked by a high-speed (200 Mb/sec) interconnection network. Each node is a separate computer with two i860 processors, aimed respectively at running applications and handling inter-node communications. Our hardware configuration comprises 56 compute nodes, 3 service nodes, 3 I/O nodes and 3 RAID disks of 4.8 Gb each. Nodes execute the Paragon osf/1 operating system (precisely, osf release 1.0.4, which is based on the Mach 3.0 kernel Norma MK13 R1.1.4 and the Unix server 1.1 R1.1.4). Paragon osf/1 is made of the Mach 3.0 micro-kernel and a server implementing Unix features. The micro-kernel provides tasks, which include a protected address space; threads, which are lightweight execution entities; and communication ports and messages. The resources defined by a task are shared by the threads that run within it. Memory management relies on paged virtual memory. However, unlike traditional virtual memory designs, the kernel does not implement all the virtual memory software: user-mode tasks, called external pagers, can participate in the implementation of virtual memory management. External pagers provide the policy governing the relationship between the image of a set of pages while cached in memory (the physical memory contents of a memory region) and the image of that set of pages when not so cached. The Mach kernel and external pagers communicate through message passing.


3.2 Design Choices

3.2.1 Kernel-level versus user-level implementation

There are two approaches to implementing an svm facility in Paragon osf/1. The first consists in building the svm software at user level, as a set of external pagers. The alternative is to modify the Mach kernel to add an svm facility. myoan is implemented as an external pager for the following reasons. First, we are convinced that an operating system kernel should stay small and should include only facilities required for a wide range of applications. Second, implementing an svm by modifying an existing kernel, although better from the standpoint of efficiency, has a severe portability disadvantage. Finally, implementing an svm by extending an existing kernel requires access to the kernel's source code, which turns out to be difficult in many situations.

3.2.2 Interprocess communication: Norma versus NX

There are two ways by which threads can communicate with each other on the Paragon. One way consists in using the standard Mach/osf inter-process communication facility, called Norma [12], which runs in kernel mode and can be used for both intra-node and inter-node communications. An alternative is to use the Intel nx message-passing library. This paragraph compares the performance of Norma and nx for inter-node and intra-node communications. (At the time this paper is written, the second i860 processor of each node is not yet exploited as a dedicated communication processor by the Paragon osf/1 operating system; inter-process communication overhead is therefore likely to decrease in future releases of the operating system for both nx and Norma.) Performance is measured using a single procedure, called MinArg, that takes two integer arguments and returns the minimum of the two. This procedure is called by sending a message containing the parameters (using either Norma or nx) to the process implementing the procedure, and then waiting for a message containing the result. Measurements were made in multi-user mode. Table 1 shows the average time required for calling MinArg when: (i) the communicating threads execute on distinct nodes; (ii) the communicating threads run on the same node but belong to different tasks; (iii) the communicating threads belong to the same task. The given values were obtained by dividing the total elapsed time of 10000 calls to MinArg by the number of calls.

    Communication type        nx       Norma
    (i)   Inter-node        0.328      1.909
    (ii)  Inter-task     1996.2        0.230
    (iii) Intra-task     1631.1        0.058

Table 1: Average call time for MinArg (ms)

The average time for calling MinArg is 0.328 ms using nx, while 1.909 ms are required when using Norma. As Norma is about 6 times slower than nx for inter-node communications, our shared virtual memory facility uses nx whenever inter-node communication is required (e.g., when two external pagers communicate to resolve a page fault). Figures obtained for intra-node communication (both inter-task and intra-task) show that nx must not be used for inter-process communication when the two communicating processes run on the same node. While at the time this paper is written we cannot explain precisely the reason for such a poor figure, we suspect that the nx library makes intensive use of busy waiting to test for the arrival of messages.
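For concreteness, the nx variant of the benchmark might look as follows. csend, crecv, dclock and mynode are standard NX calls, but the message types and the structure of the loop are our reconstruction, not the code actually used for the measurements.

    #include <stdio.h>
    #include <nx.h>              /* Intel NX message-passing library */

    #define REQ   1L             /* message type: request */
    #define REP   2L             /* message type: reply   */
    #define CALLS 10000L

    /* Node 1 implements MinArg: receive two longs, reply with the minimum. */
    static void minarg_server(void)
    {
        long args[2], min;
        for (long i = 0; i < CALLS; i++) {
            crecv(REQ, (char *)args, sizeof args);
            min = args[0] < args[1] ? args[0] : args[1];
            csend(REP, (char *)&min, sizeof min, 0L, 0L);
        }
    }

    /* Node 0 times CALLS calls and prints the average, as in Table 1. */
    static void minarg_client(void)
    {
        long args[2] = { 3L, 7L }, min;
        double t0 = dclock();
        for (long i = 0; i < CALLS; i++) {
            csend(REQ, (char *)args, sizeof args, 1L, 0L);
            crecv(REP, (char *)&min, sizeof min);
        }
        printf("average call time: %.3f ms\n",
               (dclock() - t0) * 1000.0 / (double)CALLS);
    }

    int main(void)
    {
        if (mynode() == 0L) minarg_client(); else minarg_server();
        return 0;
    }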

3.3 Implementation

myoan is implemented as a set of external pagers. For a given application, there is one external pager per node on which the application runs, and the external pager shares the application's address space. A myoan external pager communicates with the kernel it is running on using Norma, while external pagers communicate with each other through nx. An external pager is made up of two Mach threads: the first thread receives messages from the kernel via Norma, while the second receives requests from remote pagers via nx. A Mach port is allocated to each shared region when it is created; a Mach port set allows the pager to listen for all the page fault messages coming from the kernel. A fixed distributed scheme, similar to the scheme described in [1], is used for maintaining consistency. Given the identification of a page, a statically known external pager, called the manager of the page, maintains a page descriptor containing the current consistency protocol of the page and, when required by the consistency protocol, the list of nodes having a copy of the page. The function applied to determine the manager of a page is selected when a shared region is mapped into a process address space; the function is either Modulo (page p is managed by the external pager of node p mod n, where n is the number of nodes used by the application) or Block (page p is managed by the external pager of node p div n).
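The two manager-selection functions are simple enough to restate in code. The names mgr_fn_t and manager_of are ours; only the Modulo and Block formulas come from the text.

    /* Manager selection (p = page number, n = number of nodes). */
    typedef enum { MODULO, BLOCK } mgr_fn_t;

    long manager_of(long p, long n, mgr_fn_t fn)
    {
        return (fn == MODULO) ? p % n   /* Modulo: pager of node p mod n */
                              : p / n;  /* Block:  pager of node p div n */
    }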

4 Performance

4.1 Basic Operation Costs

The basic operation costs of myoan are shown in Table 2. Measurements were obtained by executing loops of 200 page faults on pages belonging to the same shared region (with strong consistency semantics) and managed by the same external pager. The page size is 8 Kb and experiments were done in multi-user mode.
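A measurement loop of this kind can be sketched as follows. myoan_map_region is a hypothetical stand-in for whatever call creates and maps a shared region; only the methodology (one write fault per 8 Kb page, averaged over the loop) comes from the text.

    #include <stdio.h>

    #define PAGE_SIZE 8192L   /* 8 Kb pages, as on the Paragon */
    #define NFAULTS    200L

    extern char  *myoan_map_region(long nbytes);  /* hypothetical mapping call */
    extern double dclock(void);                   /* NX wall-clock timer       */

    /* The first write to each page of a freshly mapped, strongly consistent
     * region triggers one page fault that myoan must resolve; averaging over
     * NFAULTS pages gives the per-operation cost. */
    int main(void)
    {
        char  *region = myoan_map_region(NFAULTS * PAGE_SIZE);
        double t0 = dclock();
        for (long i = 0; i < NFAULTS; i++)
            region[i * PAGE_SIZE] = 1;            /* one write fault per page */
        printf("average fault cost: %.3f ms\n",
               (dclock() - t0) * 1000.0 / (double)NFAULTS);
        return 0;
    }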

    Operation                                                Cost (ms)
    1. Write page fault (page creation on manager's node)      0.957
    2. Write page fault (page creation on node != manager)     1.350
    3. Write page fault                                         4.158
    4. First read page fault                                    2.715
    5. Read page fault (give a copy)                            1.628
    6. Write page fault (RO page, 32 replicas)                 12.752

Table 2: Basic operation costs of myoan

The time required to resolve a page fault ranges from 0.957 ms in the best case (when a page is written for the first time on its manager's node) to 12.752 ms (when attempting to write to a read-only page with 32 replicas).

4.2 Applications

Performance of the multiple-writers consistency protocol is illustrated with a simple application: a matrix product (more complete performance measurements will be included in the full paper). Matrix products were done using from 1 to 32 nodes of the Paragon, each node computing a set of rows of the result matrix. The Block function is used for selecting the pages' managers. Table 3 shows the elapsed time and speedup obtained by multiplying two square matrices of double-precision floats. The two input matrices are accessed using the default strong consistency protocol, while the result matrix is accessed either using the default strong consistency protocol (top part of the table) or using the multiple-writers protocol (bottom part of the table). The sizes of the matrices range from 128x128 (16 Kb) to 512x512 (256 Kb).

                           Strong consistency
           128x128             256x256              512x512
     P  Time (ms) Speedup  Time (ms) Speedup   Time (ms) Speedup
     1      626      -        8818      -        73094      -
     2      313     2.00      4415     2.00      36623     2.00
     4      352     1.78      2213     3.98      19232     3.80
     8      380     1.65      1109     7.95       9159     7.98
    16      425     1.47      1437     6.14       4593    15.91
    32      432     1.45      1418     6.22       2299    31.79

                       Multiple-writers consistency
           128x128             256x256              512x512
     P  Time (ms) Speedup  Time (ms) Speedup   Time (ms) Speedup
     1      657      -        8820      -        73060      -
     2      329     2.00      5120     1.72      36686     1.99
     4      338     1.94      2658     3.32      18327     3.99
     8      247     2.66      1341     6.58       9183     7.96
    16      164     4.01      1158     7.62       4622    15.81
    32      118     5.57       734    12.02       2338    31.25

Table 3: Performance of the multiple-writers consistency protocol

With small matrices, no speedup is obtained when using the strong consistency protocol because of false sharing: computing the product of two 16 Kb matrices is only about 1.5 times faster on 32 nodes than on a single node. Good speedups are obtained in the absence of false sharing: a speedup of 31.79 is obtained with the strong consistency protocol on 32 nodes with 256 Kb matrices. In the presence of false sharing, gains are observed when using the multiple-writers protocol: a speedup of 12.02 (instead of 6.22 when using the strong consistency protocol) is obtained when multiplying two 64 Kb matrices on 32 nodes. Note that when there is no false sharing, only a small time overhead results from the use of the multiple-writers consistency protocol (1.7% more than the strong consistency protocol for computing the product of the largest matrices on 32 nodes).
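The row-partitioned product can be sketched as below, reusing the hypothetical myoan_set_protocol/myoan_barrier names from Section 2.1; only the partitioning scheme and the protocol switch on the result matrix come from the text.

    /* Sketch of the row-partitioned matrix product. A and B stay under the
     * default strong consistency; C is switched to multiple-writers for the
     * duration of the parallel block. All myoan names are hypothetical. */
    enum { MYOAN_STRONG, MYOAN_MULTIPLE_WRITERS };
    extern void myoan_set_protocol(void *region, int protocol); /* hypothetical */
    extern void myoan_barrier(void);                            /* hypothetical */

    void matmul(const double *A, const double *B, double *C,
                long N, long me, long P)
    {
        long rows = N / P;              /* each node computes a set of rows */
        long lo = me * rows, hi = lo + rows;

        myoan_set_protocol(C, MYOAN_MULTIPLE_WRITERS);
        myoan_barrier();
        for (long i = lo; i < hi; i++)
            for (long j = 0; j < N; j++) {
                double s = 0.0;
                for (long k = 0; k < N; k++)
                    s += A[i * N + k] * B[k * N + j];
                C[i * N + j] = s;       /* writes stay within this node's rows */
            }
        myoan_barrier();
        myoan_set_protocol(C, MYOAN_STRONG); /* replicas of C's pages merged */
    }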

Performance of the read-only consistency protocol is shown with an application that reads a sequence of pages of a shared region (see Table 4).

                              Time (ms)
           32 pages           128 pages            256 pages
     P  Strong  Read-only  Strong  Read-only   Strong  Read-only
     2    235      127       932      492       1923      985
     4    330      232      1299      959       2641     1998
     8    432      310      1937     1404       4755     2835
    16    530      329      3724     1643       8169     3393
    32    727      300      7002     1596      13799     3400

Table 4: Performance of the read-only consistency protocol

Results show that solving page faults when using the read-only consistency protocol is from 2 to 4 times faster than when the strong consistency protocol is used. The best results are obtained when the application runs on 32 nodes and 128 pages are read.

5 Related Work

Many svm systems have been designed since the first proposal of [1]. Most proposals (e.g., [4, 5]) were made mainly to avoid the performance bottleneck of strong consistency. In contrast with these works, our goal is not to propose a single consistency semantics appropriate for all applications, but rather to propose multiple consistency protocols, each suited to a particular class of memory access patterns. myoan has similarities with the Munin distributed shared memory system. Like Munin, myoan offers multiple consistency protocols. Protocols are provided for read-only objects, producer-consumer objects, and concurrently written objects in both systems. In addition, both systems allow the consistency protocol of a shared region to change dynamically. Unlike Munin, myoan does not include a programming environment requiring shared variable declarations to be annotated with their expected access patterns in order to define the appropriate consistency protocol. Rather, a Fortran programming environment, Fortran-S [13], provides annotations that are translated into calls to myoan.

References

[1] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-357, November 1989.

[2] L. M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Transactions on Computers, 27(12):1112-1118, December 1978.

[3] M. Dubois, C. Scheurich, and F. Briggs. Memory access buffering in multiprocessors. In Proc. of the 13th Annual International Symposium on Computer Architecture, pages 434-442, Tokyo, June 1986.

[4] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared memory multiprocessors. In Proc. of the 17th Annual International Symposium on Computer Architecture, pages 15-26, Seattle, Washington, May 1990.

[5] B. N. Bershad and M. J. Zekauskas. Midway: Shared memory parallel programming with entry consistency for distributed memory multiprocessors. Research Report CMU-CS-91-170, Department of Computer Science, Carnegie Mellon University, Pittsburgh, September 1991.

[6] J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: distributed shared memory based on type-specific memory coherence. In Principles and Practice of Parallel Programming, March 1990.

[7] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proc. of the 13th ACM Symposium on Operating Systems Principles, pages 152-164, 1991.

[8] Z. Lahjomri and T. Priol. KOAN: a shared virtual memory for the iPSC/2 hypercube. In CONPAR/VAPP 92, September 1992.

[9] G. Cabillic, T. Priol, and I. Puaut. MYOAN: an implementation of the KOAN shared virtual memory on the Intel Paragon. Research Report 812, IRISA, March 1994.

[10] A. J. Bernstein. Analysis of programs for parallel processing. IEEE Transactions on Computers, pages 746-757, October 1966.

[11] Intel Corporation. Paragon User's Guide, 1993.

[12] A. Langerman. Norma IPC version two: requirements. Open Software Foundation and Carnegie Mellon University, 1993.

[13] F. Bodin, L. Kervella, and T. Priol. Fortran-S: a Fortran interface for shared virtual memory architectures. In Proceedings of Supercomputing '93, pages 274-283, Portland, Oregon, November 1993.
