PPFS: AN EXPERIMENTAL FILE SYSTEM FOR HIGH PERFORMANCE PARALLEL INPUT/OUTPUT

BY JAMES VALENTINE HUBER, JR.
B.S., Northern Illinois University, 1992

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1995

Urbana, Illinois

In Memory of Emily Jane Weren 1970-1993


Acknowledgments

I would like to thank my research advisor, Dan Reed, for the guidance and encouragement he has provided throughout this project. My partner, Chris Elford, deserves more thanks than I could ever express. Chris has been a good friend during this entire effort, and has made it enjoyable on a day-to-day basis. We did not work alone, however. Tara Madhyastha and Chris Kuszmaul were instrumental in shaping the initial design and implementation of the system. More recently, David Blumenthal has made important contributions to the project overall, and to the documentation in particular.

Working in the Pablo group has been a genuine pleasure. Thanks to Ruth Aydt, Roger Noe, Phyllis Crandall, Keith Shields, Brian Totty, Tara Madhyastha, Steve Lamm, Louis Tavera, Will Scullin, Anuradha Rawal, and Anda Harney. And most of all, Phil Roth, for many hours of conversation and pinball.

My parents and brothers have given the love, encouragement, and support that have sustained me thus far. Thanks to them, I had the opportunity to explore the fascinating realm of computers at a young age, and later to fulfill my dreams of becoming a computer scientist.

Life in Champaign-Urbana was made enjoyable by Scott Maxwell, Tom Adams, Grant Haab, and especially Jasmin Price. Back home, my long-time friends Dan Alexander, Oscar Ernohazy, and Matt Laitala helped me "recharge" over weekends and holidays. Special thanks to Michal Friedrich for all the encouragement she has given me during our long friendship. Thanks also to my friends from NIU: Rick Berktold, Beth Edgerton, Rick Hornback, and Shawn Perkins. I would also like to thank the Laitala, Jones, Friedrich, Bowen, and Bowker families for their support over the years. In particular, for the kind advice they have provided, my deepest gratitude goes to Dave Jones and Sam Bowen.

Finally, I would like to thank the National Science Foundation for supporting this research under grant NSF ASC 92-12369.

Table of Contents

Chapter 1  Introduction
  1.1  Motivation
  1.2  Related Work
  1.3  Organization

Chapter 2  PPFS Design
  2.1  Parallel Files
    2.1.1  Access Modes
    2.1.2  Caching
    2.1.3  Data Placement
  2.2  PPFS System Elements
    2.2.1  Metadata Server
    2.2.2  Servers
    2.2.3  Clients
    2.2.4  Agents
  2.3  Summary

Chapter 3  PPFS Implementation
  3.1  The PPFS Environment
    3.1.1  Programming Platforms
  3.2  System Processes
    3.2.1  Metadata Server
    3.2.2  Data Servers
    3.2.3  Server Slaves
    3.2.4  Caching Agents
  3.3  Communication
    3.3.1  Process Identification
    3.3.2  Messages
    3.3.3  Message Bodies
  3.4  Parallel Files
    3.4.1  Clustering
    3.4.2  Data Distributions
    3.4.3  Variable Length Records
    3.4.4  Metadata
    3.4.5  Access Patterns
  3.5  Caching
    3.5.1  Caches
    3.5.2  Replacement Policies
    3.5.3  Prefetching Policies
    3.5.4  Write Back Policies
  3.6  PPFS Operation
    3.6.1  PPFS Configuration
    3.6.2  PPFS Control
  3.7  Summary

Chapter 4  Experiments
  4.1  Benchmarks
    4.1.1  Read Benchmark Results
    4.1.2  Write Benchmark Results
  4.2  Genome Pattern Matching
    4.2.1  Application Features
    4.2.2  Application Performance
    4.2.3  Performance Analysis
  4.3  Electron Scattering Code
    4.3.1  Application Features
    4.3.2  Application Performance
    4.3.3  Performance Analysis
  4.4  Summary

Chapter 5  Conclusions
  5.1  Summary
  5.2  Future Directions

Bibliography

List of Tables

4.1  Intel Paragon System Configurations
4.2  Sixteen Processor Read Benchmark (time in seconds)
4.3  PFS Gene Sequence Execution Times (seconds)
4.4  PPFS Gene Sequence Execution Times (seconds)

List of Figures

2.1   The Portable Parallel File System
3.1   Opening a Parallel File
3.2   Parallel File Creation
3.3   Closing a Parallel File
3.4   Reading From a Parallel File
3.5   PPFS Message Class Hierarchy
3.6   PPFS Class Hierarchy
3.7   Unit Striped Data Distribution
3.8   Blocked Data Distribution
3.9   The ParallelFile::open() Method
3.10  Example Remote Control Configuration File
3.11  Example Remote Control Interface
4.1   PPFS Benchmark Writer and Reader Kernel
4.2   Intel Paragon XP/S PFS Benchmark Performance
4.3   Intel Paragon XP/S PPFS Benchmark Performance
4.4   Intel Paragon XP/S PFS/UFS Benchmark Comparison
4.5   Genome Sequence Execution Time
4.6   Genome Sequence Speedup
4.7   PFS Genome Sequence Trace Data (64 nodes)
4.8   PPFS Genome Sequence Trace Data (64 nodes)
4.9   PFS Electron Scattering Trace Data (128 Nodes)
4.10  PPFS Electron Scattering Trace Data (128 Nodes)

Chapter 1

Introduction

This thesis presents the Portable Parallel File System (PPFS), a file system designed for experimenting with the input/output performance of parallel scientific applications. PPFS is an attempt to solve, or at least to learn more about, what is generally known as the I/O problem. The I/O problem stems from the fact that processor technology is improving at a much faster rate than disk technology. While processor performance has doubled every few years over the last decade, disk performance has barely managed to double in that time. Disk rotation speeds have recently increased from the formerly standard 3600 RPM to 5400 RPM and higher; disk seek times, however, have not even halved in the last ten years. What has doubled every few years is disk density, which increases capacity and throughput but does not reduce the latency dominated by seek time. Overall, this leads to an ever-widening performance gap between processors and I/O subsystems. The disparity is especially evident in multiprocessor systems, which multiply improvements in processor performance by the number of processors in the system. Until this gap is closed, input/output will continue to be one of the primary potential bottlenecks in high-performance parallel computing.

To improve I/O subsystem performance, many parallel systems employ a set of disks in parallel. The Intel Paragon XP/S, for example, supports multiple redundant arrays of inexpensive disks (RAIDs) [24, 14]. Such systems have an impressive peak I/O throughput, equal to the product of the throughput of each device and the number of devices. However, at the present time, their effective throughput in a parallel, multi-user scientific computing environment is not well understood.

Recently, there has been increased interest in alleviating this problem. Several applications in such areas as biology, chemistry, earth sciences, engineering, graphics, and physics are attracting the attention of researchers who wish to characterize, analyze, and improve the input/output behavior of these programs. It is hoped that the experience gained by studying these applications will lead to a solution to the I/O problem.

1.1 Motivation

The key to good I/O performance lies in finding, for a particular system and set of applications, the best combination of the following policies: cache size, location, and replacement policy; data distribution; and prefetching and write behind. To experiment with these policies, one needs a file system that offers the application a high degree of control. At the same time, a file system designed around the specific features of one parallel system architecture would necessarily yield conclusions with limited applicability to other architectures.

To investigate the wide range of possibilities in parallel I/O, we have designed a portable, user-level parallel file system, called PPFS. PPFS relies on an underlying UNIX file system for physical I/O operations. Parallel files under PPFS are organized as a set of UNIX files, ideally distributed across several devices. This approach has allowed us to quickly explore a variety of policies and performance tradeoffs on several architectures. The price of this flexibility and portability is that the system relinquishes control of some lower-level issues. For example, PPFS must rely on the UNIX file system to allocate disk blocks and to schedule disk arms.

1.2 Related Work

Recently, several new parallel file systems based on user-level libraries have emerged. We believe PPFS has a richer interface than these systems, supporting application control and extension of input/output system policies, and declaration of access information. PIOUS (Parallel Input/OUtput System) [20] is a portable input/output system primarily for use with PVM. Unlike PPFS, PIOUS enforces sequential consistency on file accesses to provide fault tolerance on unreliable, distributed systems. In PPFS, because the primary focus is on parallel systems, data consistency can be controlled by the application, enabling higher performance in many cases. PASSION [4] is another user-level library that provides many features found in PPFS. However, PASSION focuses on out-of-core problems and support for compiler-generated input/output operations, rather than multiple, autonomous input/output streams. The Vesta [5] parallel file system, developed by IBM, allows applications to define logical partitions and some access information. Applications can also exercise some control over the data distribution.

Related work also includes a number of commercial parallel file systems: the CM-5 Scalable Parallel File System [19, 18], the Intel Concurrent File System [10] for the iPSC/2 and iPSC/860, and the Intel Paragon's Parallel File System [14]. These provide data striping and a small set of parallel file access modes. In many cases, these access modes do not allow the application enough control to extract good performance from the input/output system. Distributed file systems, such as Zebra [12] and Swift [3], stripe data over distributed input/output servers, but do not provide distribution or policy control to the application layer. Further, because the performance requirements in this environment are quite different (users are not as willing to tune for input/output performance), these systems provide little control to the application program.

Several groups have proposed schemes for exploiting access pattern information in both sequential and parallel systems [25, 15]. In [15], Kotz uses pattern predictors to anticipate an application's future access patterns. A variety of data management strategies for parallel input/output systems are explored in [16, 8, 17].

1.3 Organization

The rest of this thesis presents the design of PPFS and its use in input/output experiments. Chapter 2 presents the logical organization of PPFS. Chapter 3 describes the current implementation of PPFS. In Chapter 4, we discuss our experiences using PPFS with two applications. We summarize our results in Chapter 5 and discuss future research directions.

Chapter 2

PPFS Design

The primary focus of PPFS is experimentation with the input/output behavior of parallel applications. We must therefore make some assumptions about the organization of these applications. PPFS is designed to support scientific programs that currently use a traditional UNIX file system or a vendor-specific parallel file system. It is important that the application use some variant of files, as opposed to memory-mapped input/output for instance, to simplify porting the code to run under PPFS and to ensure a simple isomorphism between the original and PPFS versions. We further assume that applications are based on a distributed-memory message-passing model, and do not exhibit fine-grained parallelism.

The design of PPFS can be divided into two broad areas: parallel files and the system elements which support them. In §2.1, we discuss parallel files, including access modes, caching, and data placement issues. Then in §2.2, we describe the system processes which cooperate to implement parallel files.

2.1 Parallel Files

When creating a file system, it is important to present the application with a simple yet flexible file notion. This decision determines how the system and application interact, and can affect the usability and performance of the system. There are a wide variety of options, including memory-mapped files, databases, object-oriented repositories, structured files, and traditional UNIX files. All of these ideas are appropriate for some application areas. However, to maintain similarity to existing scientific applications, we sought a relatively traditional approach for PPFS. Another consideration in the parallel file notion was to allow access patterns and data distributions to be easily expressed, as these characteristics are fundamental to parallel input/output research.

Thus, under PPFS, parallel files consist of a sequence of records, which are numbered starting from zero. The record is the unit of access; an application may read or write entire records only, typically specifying records by number. Record numbers are also used to express data distributions and access patterns. Because records represent the basic unit of access in PPFS, fixed-length records were considered too restrictive; PPFS therefore supports variable-length records as well.

2.1.1 Access Modes

PPFS offers several levels of functionality, and allows the application to control which features are used. To achieve this, and to allow the application to use special optimizations, PPFS defines a set of parallel file access modes. When creating or opening a parallel file, the application must specify a combination of options to indicate the file's access mode. This access mode is used to validate subsequent operations, and to guide optimizations applied to the file. The access mode of a parallel file, along with other information (see §3.4.4), is communicated between processes that deal with the file. Any or all of the options may be used simultaneously for a given file.

Most applications will use in and/or out to perform read and write operations; however, a small set of operations is supported without either of these options. The basic read and write primitives require the application to identify, by number, the records to be accessed. Thus, at the application level, no file pointers or seek operations are required, eliminating the need to treat a seek and subsequent read/write as an atomic operation.

For some applications, this approach is not acceptable. It can become a burden for the program to manage the sequence of accesses, and global access patterns require explicit coordination. The next and any options enable the use of special access modes designed for such situations. For a specific access pattern, the next mode is used, enabling use of the read_next and write_next operations. An application may specify an access pattern, a sequence of records, which specifies how the file will be accessed. This pattern may be used in two ways: a global access pattern determines how the application as a whole will access the file, and a local access pattern determines how one particular process accesses the file. A given file may have a global pattern or a local one, but not both. Local patterns establish a file pointer for each process, permitting the application to use the simpler primitives which do not specify record numbers. Global patterns work the same way, but require coordination among all participating processes. Although the file is still accessed in parallel, access to the shared file pointer must be serialized in this case.

To avoid the overhead associated with the global access pattern, the any option can be used in some situations. When records are accessed exactly once each, but the order is unimportant, the application can use the read_any and write_any operations. No global coordination is required for these primitives. Instead, records are accessed from a random portion of the file. A file created with write_any can only be read back with read_any; however, any file can be read with read_any.

Each of the access modes just described is implemented by a pair of PPFS primitives: one blocking and the other non-blocking. By providing the asynchronous calls, PPFS allows applications to overlap computation with I/O operations.

The remaining options, size and noextend, control how the system maintains and reports the size of a parallel file (the number of records it contains). An application can always find the initial size of a file, that is, the size of the file when it was opened or created. By default, the size of a file may change by writing new records, or by truncating the file. The noextend mode fixes the file size, making these operations illegal. In this case, PPFS avoids the overhead of maintaining the file size. On the other hand, the size option allows the application to find the current size of the file. Some global coordination is required to implement this operation, thus it is only available upon request.
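To make the interplay of these options concrete, the sketch below shows how an application might combine several of them when opening a file. It is an illustration only: the flag style follows the ParallelFile::in | ParallelFile::out idiom used in the creation example of Chapter 3, but the read/read_next/cur_size names and signatures are assumptions, not the documented PPFS interface.

    // Illustrative sketch only -- method names and signatures are assumed,
    // in the style of the creation example in Figure 3.2.
    ParallelFile file;
    file.open("results", ParallelFile::in | ParallelFile::next | ParallelFile::size);

    char buf[64*1024];
    file.read(7, buf);           // basic primitive: record identified by number
    file.read_next(buf);         // next mode: record chosen by the access pattern
    RecID n = file.cur_size();   // size mode: query the current number of records
    file.close();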

2.1.2 Caching

To avoid the latency of physical input/output operations, data may be cached for later use. Read operations may be avoided by referencing previously accessed or prefetched data, and write operations may be postponed or avoided using a writeback mechanism. Also, prefetching and writeback mechanisms can be used to combine several smaller requests into one large request, thereby reducing the number of physical input/output operations.

Many of the same issues which apply to traditional memory hierarchies are also important here, such as cache size and replacement policies. However, a parallel file system raises other questions, the most important of which is cache location. Since there are many I/O devices and many application tasks, it is possible for each device, as well as each application task, to maintain a cache. This immediately raises the problem of consistency, which may be addressed in part by the addition of a new layer of caching which is global to the entire system. Thus, PPFS employs three levels of caching. Each I/O device has an associated server cache, which only contains data residing on the given device. Each application node has a client cache which holds data accessed by the given user process. Finally, an application may use one or more global caches to enforce consistency. Alternatively, consistency may be ensured by using only server caches. Other forms of consistency may be implemented by the application, using the synchronization primitives provided by PPFS.

File caches can afford a greater degree of complexity than traditional CPU/memory caches because the penalty for a cache miss is much more expensive. Thus, more elaborate cache policies are feasible. PPFS may be extended by adding new policies to the system, and it allows applications to experiment to find the best combination of policies. Each cache may potentially have a different set of policies for each file. Furthermore, policies may be changed during program execution.

2.1.2.1 Replacement Policies

The basic operation of a cache requires choosing data to eject from the cache when it becomes full. PPFS caches are fully associative, so that any arbitrary set of records may reside in a cache (provided the overall capacity of the cache is not exceeded).

2.1.2.2 Prefetching

Prefetching is used to tolerate latency by overlapping physical input/output operations with computation. Prefetching is guided by application-supplied access patterns, which include local (per application process) and global (per file) patterns. Prefetching must not be overly aggressive; otherwise it may cause the cache to thrash by ejecting data which will soon be needed in favor of data that will be needed further in the future [1].

2.1.2.3 Write Behind

Write behind can often reduce the amount of data written to disk by eliminating multiple write operations to the same location. It can also reduce the number of write operations by combining smaller contiguous operations into one larger write. These factors can improve the performance of applications which perform many small writes. The decision procedure for flushing dirty records from a cache interacts with the prefetching and replacement policies. Another factor to consider is where in the memory hierarchy to write data. For example, when a client cache performs write behind, it may send the data directly to disk, or the data may be cached on a server, to be written later.

2.1.3 Data Placement

The latency of large input/output requests can be reduced by parallelism. By placing parts of a file on several disks, the file system can potentially access the data faster than if it were all on one disk. The ideal placement of data in a file depends on the access characteristics of the application and the configuration of the system. PPFS supports a rich set of data placement strategies, and allows an application to control which disks are used to store parts of a file.

2.1.3.1 Clustering

Under PPFS, a file is divided into segments, each of which resides on a particular I/O server. Each segment is simply a collection of records. A file's clustering associates a server with each segment in the file. A file must have at least one segment, and any number of segments may reside on the same server. Thus, a file with N segments consists of segments numbered from 0 to N-1, and the file's clustering is a mapping from segment number to server.

2.1.3.2 Distribution

A parallel file's distribution determines, for each record in the file, in which segment the record resides, and where in that segment. A distribution can be seen as a one-to-one mapping from file record number to a pair containing a segment number and a segment record number. PPFS provides built-in support for common distributions such as striped and blocked, as well as other HPF [9] distributions.

2.1.3.3 Indexing

To support parallel files with records of different lengths, another level of data placement is required. A parallel file may have an optional indexing scheme which determines the length and location of records. For files with fixed-length records, this information is computable given the record number. Variable-length records require an indexing scheme to map a record number to the offset and length (both in bytes) of the record. An indexing scheme may enforce predefined record sizes, or may allow the application to alter the size of records dynamically. In the latter case, gaps may be created when a record changes size.

2.2 PPFS System Elements

The organization of PPFS is based on the client/server model. There are data servers that manage an input/output device, and client processes that are part of the parallel application. In addition to these, there are two other elements in PPFS. The metadata server coordinates all open and close operations, and maintains the directory information about all parallel files in the system. The caching agents are optional processes that coordinate any global caching and access modes required for a particular parallel file. The conceptual organization of the system is illustrated in Figure 2.1. Each element is described in more detail below.

[Figure 2.1: The Portable Parallel File System -- clients (running user code), an optional caching agent, the metadata server, and the data servers, each element holding a prefetcher and a cache]

2.2.1 Metadata Server

The metadata server is so called because it manages parallel file metadata: information about a file, as opposed to information in a file. A persistent store of metadata is maintained to preserve parallel files when the system is not running. The metadata server coordinates the creation, opening, and closing of all parallel files.

2.2.2 Servers

The servers are responsible for performing all physical input/output in PPFS. They provide services to manage segments and to perform read and write operations on records in those segments. They also manage caches to reduce the latency involved in processing I/O requests. For maximum parallelism, separate servers can exist for each I/O device in the system. In fact, multiple servers may be used for a single I/O device, storing mutually exclusive sets of data; in this way, the memory available for caching purposes can be increased. Alternatively, a single server may manage several I/O devices, although only in a system-dependent manner. This configuration might be useful where compute nodes are a scarce commodity.

2.2.3 Clients

The client nodes are created by linking a PPFS library to the parallel application. The application can then perform I/O using PPFS primitives. The C++, C, and FORTRAN languages are supported by the PPFS programming interface. The PPFS library translates application requests into messages to the other components in the system. The library manages a cache of parallel file data on each client node, attempting to service application requests from this cache. When this is not possible, the library interacts with the data servers (or a caching agent) to complete an operation.

2.2.4 Agents

The purpose of the caching agents is to act as coordinators for parallel files which require global caching or access modes. When one or more caching agents are used for a parallel file, the clients do not interact directly with the data servers, but instead send requests to the caching agents. This allows the agents to maintain a consistent cache, as well as to make global optimizations of the application request stream.

2.3 Summary

This chapter examined the issues which shaped the design of PPFS, and how PPFS provides support for a wide range of application input/output requirements. We discussed how the application can control the way data is organized and accessed, and can experiment with caching at various levels of the system. In Chapter 3, we present our current implementation, based on this design.

Chapter 3

PPFS Implementation

This chapter describes the initial implementation of PPFS and discusses implementation decisions made to simplify or optimize the system. Because PPFS is intended to be a flexible and extensible tool for experimentation on a wide variety of platforms, it is essential that an implementation of PPFS be well-organized, efficient, and, so far as possible, easy to port. To achieve these goals, and to ease the burden of crafting a complex software system, the C++ programming language [30] was chosen to implement PPFS. The general availability of C++ makes it a reasonable choice for a portable system. C++ offers type safety, encapsulation, inheritance, and polymorphism, which allow the design concepts of PPFS to be expressed naturally in the implementation.

We begin by describing the PPFS environment in general, and the currently operational platforms in particular. We then discuss the detailed operation of the system processes in §3.2. The software components which comprise the implementation are presented in §3.3 through §3.5. We describe the configuration and control of PPFS in §3.6 and summarize this chapter in §3.7.

3.1 The PPFS Environment

This section describes the programming environment currently provided by PPFS. Briefly, PPFS runs as a collection of user-level UNIX processes. It uses standard UNIX system calls for input/output. Other file systems could be supported by implementing a new storage manager component (see §3.2.3) to perform physical input/output under a different file system. The system may be left running for an extended period of time, allowing several applications to be executed without shutting down PPFS.

The design identifies several types of processes in PPFS, each of which is implemented by one (or more) programs. These are referred to as system programs, in contrast to the user programs, which are applications ported to PPFS. All of the system programs, as well as the user programs, link with the PPFS library. This library provides support for the PPFS communication layer, and implements all of the C++ classes specified by the design.

The system programs are small user-level codes that execute on various nodes of the parallel machine. The run program is used to start up the other system programs and spawn an application. If no application is specified to the run program, it spawns a special control program which has the ability to accept remote connections to operate and shut down the system. In this way, several applications may be executed without incurring the PPFS start-up and shut-down overhead. The metadata server, caching agent, and data server nodes are implemented by the remaining system programs: mds, agent, server, and server slave. The data server is split into two separate programs, so that the server can service requests out of its cache while a blocking input/output operation is in progress on the associated server slave.

3.1.1 Programming Platforms

Two platforms were selected for the initial implementation: the Intel Paragon XP/S [14] and a cluster of Sun SPARC workstations. The workstations served as a development platform, as they provide a more stable and productive programming environment than the Paragon. The Paragon offers a native parallel file system (Intel PFS) to which PPFS can be compared for performance and ease of application porting.

A PPFS application is organized much as it would be under any distributed-memory parallel system. There are two general scenarios. In the simplest case, an application consists of a single program instantiated on several nodes of the machine. Each process has access to PPFS via the library, as well as to the underlying UNIX file system. A variation on this arrangement is where a special program, the host program, spawns and manages several copies of another program, the node program. This is commonly called a "host-node" application. Other organizations are possible, but for these an application must implement its own protocol for barriers and broadcasts.

Message passing and process management are provided by NX on the Paragon, but for other platforms, a portable library is needed to supply these features. Several message-passing libraries were considered, but most lacked what was considered vital functionality. For instance, MPI [21] lacks process management, and PVM [11] has no support for interrupt-driven messages. Both of these important features are provided by NXLib [29], which is a port of NX to various workstation platforms. Thus, NXLib was chosen for the SPARC platform.

3.2 System Processes

Each of the PPFS elements defined in §2.2 is implemented by one or more system processes. The run process spawns the other system processes as described in §3.1.1. A detailed description of these system processes is given in this section.

3.2.1 Metadata Server

The primary function of the metadata server is to service client requests to create, open, and close parallel files. To accomplish these tasks, a persistent store of file metadata (§3.4.4) is maintained. The metadata for each parallel file is kept in a metadata file of the same name. The metadata server caches these files, writing them to disk when the system shuts down.

The same parallel file may be opened multiple times concurrently. In this case, the metadata is shared, and a unique parallel file handle identifies each instance of the open file. This is analogous to opening a standard UNIX file twice: the file itself is identified by an inode, but a unique file descriptor is associated with each open operation.

3.2.1.1 System Shutdown

When the metadata server receives a shutdown request from the run process, it instructs the data servers and caching agents to terminate. It then writes all of the metadata files to disk, to be used during the next invocation of PPFS.

3.2.1.2 Opening Parallel Files

When a client process attempts to open a parallel file, it contacts the metadata server, which searches for the metadata file by name. It first consults its cache of metadata, then looks on disk. If the metadata cannot be found, a failure is signified to the client. On the other hand, if the metadata is found, a unique handle is given to this instance of the file. Then the file's clustering table, which is part of the metadata, is used to determine which data servers manage the segments of the file. For each segment in the file, the corresponding server is sent a request to open the segment. The metadata server then receives replies from the servers, in any order, until it has one reply per segment. At this point, the metadata server sends a reply to the client, including the parallel file handle and metadata. This process is illustrated in Figure 3.1, where arcs represent messages sent from one process to another, and numbers indicate the relative order of messages. In this example, the file is stored in two segments, each on a different server. The file is opened by the first client, then broadcast to the other clients.

[Figure 3.1: Opening a Parallel File]

3.2.1.3 Creating Parallel Files

To create a parallel file, an application must provide the metadata server with the following information:

- the name of the file to be created,
- the access mode (§2.1.1),
- the data distribution (§3.4.2),
- the record size (or zero for variable length records),
- the index scheme (only for variable length records, §3.4.3),
- the initial file size (in records),
- the number of segments, and
- the specific servers onto which the file is to be distributed.

Figure 3.2 shows an example of how the application specifies this information.

    char *     name      = "data";
    int        mode      = ParallelFile::in | ParallelFile::out;
    StripeDist dist      = 1;
    RecSize    rec_sz    = 64*1024;
    RecID      file_sz   = 0;
    DiskID     disks[]   = { 0, 2, 4, 6 };
    DiskID     num_disks = sizeof(disks)/sizeof(*disks);

    ParallelFile file;
    file.create(name, mode, dist, rec_sz, file_sz, num_disks, disks);

Figure 3.2: Parallel File Creation

The new file is named data and is opened for both input and output. The dist variable represents a striped distribution with a striping factor of one. The next two parameters specify that each record is 64K bytes long, and that the file is initially empty. The last two arguments control the selection of physical devices. The disks array identifies four particular devices to use (in round-robin fashion), and num_disks is the number of segments (one per device). The initial file size is only useful for files created with the noextend access mode flag, which can be used when an application does not intend to change the size of the file. At present, there is no way to reduce the size of parallel files, but when that becomes possible, this mode will be better described as noresize.

The last two items in the above list allow an application to exercise complete control over which servers are used to manage a parallel file. The servers are simply identified by number, and the application may supply an array of these numbers to indicate which server owns each segment. This is a very crude and very effective mechanism. Quite often, however, a program is content to let the system make this decision, in the interest of simplicity and portability. The exact choice made in this case is not specified, but should in some sense reflect the system hardware configuration. The current implementation makes rather general decisions based on the assumption that the servers are equivalent. The application may supply only the number of segments, in which case the metadata server assigns them to servers in a round-robin fashion. Even the number of segments may be omitted, in which case the metadata server creates one segment per server in the system. This policy can be customized per installation, for example to use only the faster disks in a system by default, and reserve the slower disks for applications explicitly requesting them.

The information supplied by the client is used to create a metadata object for the new file. Each segment in the new file must be created at the appropriate server. The metadata server sends requests to the servers, and the servers reply giving the segment handles of the newly created segments. This information is saved in the clustering table of the new metadata object, which can then be returned to the requesting client. The flow of information through the system when creating a parallel file resembles the protocol used to open parallel files, illustrated in Figure 3.1.
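The default round-robin assignment described above amounts to a one-line policy; a minimal sketch follows, reusing the SERVER_NODE array introduced in §3.3.1 (the loop and the NUM_SERVERS and cluster_table names are illustrative, not the actual metadata server code).

    // Illustrative sketch: when the application leaves the choice to the
    // system, segment i of a num_segs-segment file goes to server i mod
    // NUM_SERVERS.
    for (int i = 0; i < num_segs; i++)
        cluster_table[i].server = SERVER_NODE[i % NUM_SERVERS];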

3.2.1.4 Closing Parallel Files

To close a parallel file, the metadata server must receive a close request from each client that has the file's metadata. This includes the original client that requested the open (or create) operation, as well as other clients that (directly or indirectly) received the metadata from that node. Closing is implemented as a non-synchronizing operation to avoid performance bottlenecks. When all clients have closed a file, the metadata server contacts the servers involved, requesting them to close the segments of the file. This protocol is illustrated in Figure 3.3, for the same file discussed in §3.2.1.2.

3.2.1.5 Spawning Caching Agents

When an application enables global caching for a parallel file, the client process contacts the metadata server, which allocates a caching agent node and spawns an agent process on that node. The metadata server then sends a notification to the client that opened the file. As clients receive this message, which is interrupt-driven, each in turn sends the same message to any nodes to which it sent metadata. Thus the same tree that is created when the file is opened and distributed to the clients is used to notify all clients of changes in the global status of the file. Once a client detects that a parallel file is globally cached, it no longer communicates directly with the data servers, but sends requests to the caching agent instead.

[Figure 3.3: Closing a Parallel File]

3.2.2 Data Servers

The data servers are responsible for managing the data segments which comprise a parallel file. The two primary elements in a server are a cache and a request queue. When a server receives a request from a client or a caching agent, the server's cache is consulted. Many requests are serviced directly by the cache, but others may require accessing the disk. Any request (or part thereof) which cannot be serviced by the cache is placed on the queue to be sent, one at a time, to the slave. The server also responds to requests from the metadata server to open, create, and close segments. These are likewise placed on the queue to be processed by the slave. All input/output is done by the slave; the server acts only as an intermediary, with the ability to cache data and rearrange requests. The server was split into two processes because thread packages are not available on all architectures.

3.2.3 Server Slaves

The server slaves receive input/output requests from the data servers, and access the underlying file system to satisfy the requests. The servers issue read and write requests that specify a segment handle, a byte offset, and a byte count. The servers also send requests to open, create, close, and remove segments.

[Figure 3.4: Reading From a Parallel File]

To ease porting to other systems, the server slaves do not directly interact with the host file system. Instead, they operate through a storage management interface. The StorageManager is an abstract base class which defines an interface for managing segments of data. Each server slave uses this to communicate with the device that it controls. In this way, the slave code does not depend on the underlying device. Currently, PPFS defines a single type of storage manager for standard UNIX files. Other storage managers could be defined to support other file systems, including simulated devices.
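The thesis does not reproduce the interface itself; the sketch below shows one plausible shape for it, based on the operations the servers are described as issuing: open, create, close, and remove segments, plus reads and writes specified by byte offset and count. The method names and signatures are assumptions.

    // Sketch (assumed interface): the slave sees only this abstraction, so
    // supporting a new file system means deriving a new storage manager.
    class StorageManager {
    public:
        virtual ~StorageManager() {}
        virtual int  create(SegID seg) = 0;
        virtual int  open(SegID seg) = 0;
        virtual int  close(SegID seg) = 0;
        virtual int  remove(SegID seg) = 0;
        virtual long read(SegID seg, long offset, long count, char *buf) = 0;
        virtual long write(SegID seg, long offset, long count, const char *buf) = 0;
    };

    // UnixStorage would map each segment to an ordinary UNIX file and issue
    // standard lseek()/read()/write() system calls; a simulated device would
    // derive its own subclass instead.
    class UnixStorage : public StorageManager { /* ... */ };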

3.2.4 Caching Agents

The caching agents act as global control centers for parallel files. When an application requests specific services or access modes which require global coordination, a caching agent is allocated by the metadata server to carry out these operations. Presently, the only global service provided by the implementation of PPFS is global caching; this is the primary function of the caching agents. Other intended functions include managing global access patterns and maintaining global consistency. To support global caching, each agent manages a cache, attempting to service client requests from this cache and forwarding requests to the data servers when necessary. The agent is relatively simple, handling only read, write, and shutdown requests from the clients.

The cooperation of clients, caching agents, data servers, and server slaves during a typical input/output operation is illustrated in Figure 3.4. Here, one client performs a read. The PPFS library resident on the client node first checks its local cache. It then sends a read request for any remaining data to the appropriate caching agent. The agent immediately returns to the client any data already present in its cache. The agent then sends read requests for the remaining data to the appropriate servers. The servers immediately return any cached data to the agent, and then issue read requests to their associated slaves. The server slaves perform physical reads and return the data to the servers, which update their caches and return the data to the agent. When the agent has received the full set of data from all servers and updated its cache, it returns the data to the client. The PPFS library updates the local cache and returns the data to the application.

3.3 Communication

The classes discussed in this section encapsulate the functionality of the NX message-passing environment. Thus, porting PPFS to another message-passing platform would require modifying these classes. There are also several global functions which would need rewriting. One such function is ppfs_gop(), which performs various global operations, such as a synchronization barrier. Finally, the run program depends heavily on the NX environment. These classes are illustrated in Figure 3.5.

3.3.1 Process Identification

Every PPFS program (system or user) begins by calling ConfigureNodes(), which receives a system configuration message from the run program. This message specifies how many processes of each type are in the system, and from this each process can initialize its NodeID variables. Under Intel NX, a process is identified by a pair of integers, process type and node number, given by the myptype() and mynode() functions. To make the system more portable, the NodeID class is used to identify PPFS processes. The Message class uses NodeIDs to specify source and destination. A set of NodeID variables and arrays keeps track of the location of all components in the system. For example, MY_NODE is the identifier of the current process, MDS_NODE is the location of the metadata server, and SERVER_NODE[i] is the location of the ith data server. In this way the processes may be relocated with a minimum of effort. A set of NodeID methods exists to further shield the system from dependencies on process layout.

[Figure 3.5: PPFS Message Class Hierarchy -- the Message class handles a Message_body, from which the 23 concrete message types (MB_open_file0, MB_read0, MB_write1, and so on) are derived]

For example, a server process may call MY_NODE.server2slave() to find its associated slave process. A process may also determine its type, for example when it is in a part of the library shared by many process types, via the methods is_server(), is_mds(), is_client(), etc.
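A brief sketch of how these identifiers might be used in practice follows; the calls are those named above, while the surrounding control flow is purely illustrative.

    // Illustrative use of the process-identification layer.
    ConfigureNodes();             // receive the system configuration message

    if (MY_NODE.is_server()) {
        NodeID slave = MY_NODE.server2slave();  // locate the associated slave
        // ... forward queued physical I/O requests to 'slave' ...
    } else if (MY_NODE.is_client()) {
        // ... clients direct open/create/close requests to MDS_NODE ...
    }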

3.3.2 Messages

The Message class encapsulates the low-level NX functionality, including blocking and non-blocking message send and receive, message polling, posting interrupt handlers, enabling and disabling interrupt handling (for critical sections), and spawning processes. Messages under NX are simply buffers of arbitrary length; it is up to the application to prepare and interpret the data in a message buffer. This is not suitable for the object-oriented PPFS implementation, hence a Message operates by streaming an application-level object, a Message_body.

3.3.3 Message Bodies

The Message_body is an abstraction that provides an interface for streaming objects to and from the lower-level buffers which go across the network. A Message acts as a handler for a Message_body-derived object. There are presently 23 classes derived from Message_body, and thus 23 types of high-level messages supporting the internal communication needs of PPFS. All Message_body-derived types are named MB_..., and most come in pairs. For example, the MB_read0 and MB_read1 message bodies are used to communicate a read request and reply, respectively.
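As an illustration of this pairing, the sketch below outlines what a request/reply pair might look like. The class names MB_read0 and MB_read1 are from the text, and ByteStream appears in the class hierarchy of Figure 3.6; the field names and the pack/unpack streaming interface are assumptions, not the actual PPFS code.

    // Sketch of a request/reply message-body pair (fields and streaming
    // interface assumed). A Message streams one of these objects into the
    // flat NX buffer that actually crosses the network.
    class MB_read0 : public Message_body {   // read request
    public:
        int   file;        // parallel file handle
        RecID first_rec;   // first record wanted
        int   num_recs;    // how many records to read
        virtual void pack(ByteStream &s)   { s << file << first_rec << num_recs; }
        virtual void unpack(ByteStream &s) { s >> file >> first_rec >> num_recs; }
    };

    class MB_read1 : public Message_body {   // read reply, carrying record data
        // ... record identifiers, lengths, and the data bytes ...
    };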

3.4 Parallel Files

The ParallelFile class brings together all information needed for an application to access records in a parallel file. An application can create a parallel file object, invoke the open or create methods, and then read or write records. This section discusses the data structures and abstractions which a ParallelFile manages for the application. Note that many of these exist on other system processes, but ParallelFile itself is only instantiated by the application. Figure 3.6 illustrates the classes discussed here, as well as those pertaining to caching, the topic of §3.5.

[Figure 3.6: PPFS Class Hierarchy -- the abstract base classes Cache (ClientCache, ServerCache, AgentCache), Distribution (StripeDist, BlockDist, FortranDist, FnDist), Index (FileIndex, FnIndex, FixedIndex), Pattern (UniformPattern, RelationPattern, SeriesPattern), ReplacePolicy (FIFO_Policy), StorageManager (UnixStorage), AsyncOp (AsyncReadAny), and Statistic (AgentStatistic, ClientStatistic, ServerStatistic), together with standalone classes such as MetaData, Cluster_Table, NodeID, ParallelFile, and Prefetcher]

3.4.1 Clustering

The term clustering refers to the way a parallel file is divided into segments, and how those segments are distributed to various servers. A file striped across eight servers would most likely have eight segments, one per server. A segment is a sequence of records managed by a single server. Every parallel file has an associated clustering table, which has an entry describing each segment. This information is kept in a Cluster_Table object, which is an array of Seg objects. A Seg holds the NodeID of the server, a SegID which uniquely identifies the segment on that server, and a SegRecID which represents the number of records in the segment. In the context of a specific parallel file, segments are normally identified by their location in the file's clustering table.
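The description above translates almost directly into a data structure; a minimal sketch follows (the field names are assumptions, the types are those named in the text).

    // Sketch of the clustering structures described above.
    struct Seg {
        NodeID   server;    // data server that owns this segment
        SegID    seg;       // identifies the segment uniquely on that server
        SegRecID num_recs;  // number of records currently in the segment
    };

    // A Cluster_Table is essentially an array of Seg entries, one per
    // segment: entry k describes segment k of the parallel file.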

3.4.2 Data Distributions

A parallel file consists of a sequence of logical records, numbered from zero. Similarly, a segment contains segment records numbered from zero. This raises the question: given logical record R, in which segment does R reside, and where in that segment? It is the purpose of the Distribution class to answer this question. The Distribution class provides an interface for performing a two-way mapping between the logical record number and the segment record information. The precise mapping functions are described below.

Distribution::map() converts a triplet into a pair:

    [ logical record number, number of segments, number of records ]
        --> [ segment number, segment record number ]

Distribution::unmap() converts a quadruplet into a scalar:

    [ segment number, segment record number, num_segs, num_recs ]
        --> [ logical record number ]

To illustrate the map() and unmap() functions, consider a unit striped distribution. Here the records are assigned to segments in round-robin fashion, as shown in Figure 3.7. Thus, reading 12 consecutive records from this file causes three records to be read from each segment. In particular, records 0 through 11 of the file correspond to records 0 through 2 of each segment.

[Figure 3.7: Unit Striped Data Distribution -- records 0-19 of a file assigned round-robin to four segments: segment 0 holds records 0, 4, 8, 12, 16; segment 1 holds 1, 5, 9, 13, 17; and so on]

The map() function for this distribution is

    map [R, S, NR] --> [SN, SR] {
        SN = R % S;
        SR = R / S;
    }

which places logical record R in slot R div S of segment R mod S, where S is the total number of segments. The unmap() function for this distribution is

    unmap [SN, SR, S, NR] --> [R] {
        R = SN + SR*S;
    }

which says that the record in slot SR of segment SN corresponds to logical record SN+SR*S, where S again is the number of segments. Notice that map() and unmap() are inverses. The unused parameter in the above examples is NR, which represents the total number of records in the file; the file consists of logical records 0 through NR-1. The striped distribution does not depend on the number of records in the file, but other distributions do. One such example is the well-known block distribution, where each segment has a contiguous portion of the file. Figure 3.8 illustrates this layout.

[Figure 3.8: Blocked Data Distribution -- records 0-19 of a file divided contiguously among four segments: segment 0 holds records 0-4, segment 1 holds 5-9, segment 2 holds 10-14, and segment 3 holds 15-19]

The map() function for the block distribution is

    map [R, S, NR] --> [SN, SR] {
        const block = NR/S + (NR%S ? 1 : 0);
        SN = R/block % S;
        SR = R/(block*S)*block + R%block;
    }

Here the value block is the number of records in each segment. It is found by dividing the number of records in the file by the number of segments, and adding one if there is a remainder. The unmap() function for the same distribution is

    unmap [SN, SR, S, NR] --> [R] {
        const block = NR/S + (NR%S ? 1 : 0);
        R = SR/block*block*S + SN*block + SR%block;
    }

Presently PPFS implements four kinds of data distributions. Block distributions are implemented by the BlockDist class. The StripeDist class supports striped distributions with an arbitrary striping factor: a striping factor of one corresponds to unit striping, and a striping factor of N places sequences of N contiguous records in each segment, in round-robin fashion. The UniformDist class provides for HPF [9] style distributions, and anything else can be implemented with the generic FnDist, which allows the user to specify function pointers for the map() and unmap() operations. Note that these user-defined functions must be linked into the client, server, agent, and metadata server, as PPFS does not support any RPC mechanism.
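For example, the unit striped mapping shown above could be supplied to FnDist as a pair of ordinary functions. The sketch below assumes a particular function-pointer signature, which the thesis does not specify; the arithmetic is exactly the map()/unmap() pseudocode given earlier.

    // Sketch: user-supplied map()/unmap() functions for FnDist, implementing
    // the unit striped distribution (function-pointer signatures assumed).
    void stripe_map(RecID R, int S, RecID NR, int &SN, SegRecID &SR)
    {
        SN = R % S;    // segment number
        SR = R / S;    // slot within that segment
    }

    RecID stripe_unmap(int SN, SegRecID SR, int S, RecID NR)
    {
        return SN + SR * S;
    }

    // FnDist dist(stripe_map, stripe_unmap);
    // These functions must be linked into every PPFS process that uses the file.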

3.4.3 Variable Length Records

Every parallel file has an associated record size, specified when the file is created, which determines the number of bytes in each record. If a file containing records of different sizes is needed, a zero is given for this parameter. In this situation, the file is said to have variable-sized records. When a particular record is read or written, some mechanism must be used to determine the record's size and its offset into the segment. This is the purpose of the Index class. The Index abstraction provides an interface for setting and getting the size of a record in a parallel file with variable-sized records. When an application attempts to read a record from such a file, the file's associated index is consulted to find the size of the record. Similarly, when a record is written, the index is used to set or update the size of the record.

There are three types of index schemes supported by PPFS at present. The FixedIndex class implements fixed-length records, and the FnIndex class allows the user to specify function pointers. These are included primarily for testing purposes. The more useful variant is the FileIndex class, which uses an index file to maintain the record sizes. In this case, each segment in the parallel file has an associated index file, maintained by the data server. To avoid byte-ordering issues, these index files are stored in ASCII, and consist of an 18-byte line per record. Each line holds the offset and length of the record, in eight-byte hexadecimal fields; a space separates the fields and a newline terminates the line. Because the entries have the same size, the index file can be accessed in random order. For efficiency, the servers cache the index file during normal operation, which requires only two words per record. By storing the offset and length of each record, this scheme allows for gaps in the segment file, which arise when a record changes size.
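As an illustration, the following sketch encodes and decodes one such 18-byte line. The helper names are hypothetical; the format itself -- eight hex digits of offset, a space, eight hex digits of length, and a newline -- is as described above, so entry i lives at byte 18*i of the index file.

    #include <stdio.h>

    /* Sketch: one fixed-width index-file entry per record. */
    void encode_entry(char line[19], long offset, long length)
    {
        sprintf(line, "%08lx %08lx\n", offset, length);   /* 18 chars + NUL */
    }

    int decode_entry(const char *line, long *offset, long *length)
    {
        return sscanf(line, "%8lx %8lx", offset, length) == 2;
    }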

3.4.4 Metadata

Every parallel file has an associated MetaData object, which holds all of the information required to access the data in the file. This information is further divided into two parts: persistent and transient. The persistent metadata is saved to disk when the system shuts down. The transient metadata, on the other hand, is only required for the duration that the file is open. In fact, if a parallel file is opened more than once concurrently, the persistent metadata will be identical across all such instances, and the transient metadata holds the information specific to each instance.

The persistent contents of a MetaData object are:

- the name of the parallel file,
- the number of records in the file,
- the size of the records (a zero indicates variable-length),
- the clustering table,
- the data distribution, and
- the index (present only for variable-length record files).

The transient contents of a MetaData object are:

- the parallel file handle,
- a flag indicating if a caching agent is active for the file,
- a NodeID which identifies the caching agent (if any), and
- the access mode used to open the file.

MetaData objects originate on the metadata server, and from there can be sent to the servers, caching agents, and client processes. When an application opens a parallel file, only one client process should perform the open operation; the others wait to receive the metadata. Thus the application determines which of its processes will have the file open, and how the metadata is transmitted. The canonical case, illustrated in Figure 3.9, involves the lowest-numbered client opening the file and sending it to the others.

    ParallelFile file;
    if (mynode() == ppfs_client0()) {
        file.open("myfile", ParallelFile::in);
        file.send();
    } else {
        file.recv();
    }

Figure 3.9: The ParallelFile::open() Method


3.4.5 Access Patterns

We define an access pattern to be the sequence of records that an application accesses (i.e., reads). There are two things to note here. First, the access pattern does not take into consideration the write operations the application performs, as the access pattern serves to guide prefetching. Second, there are two types of access patterns: global and local. Both are expressed in the same manner, but each client node can have a different local access pattern (which guides prefetching into that client's cache), whereas the global access pattern is common to all client nodes (and guides prefetching on the I/O nodes and the agent).

Fundamentally, an access pattern gives a sequence of logical record numbers. This can be done in two slightly different ways. One is to specify the next record in terms of the previous record accessed. The other is to specify the next record in terms of the access number. A user-supplied function can be used for either of these, but a more convenient approach is to specify n and k in the read n, skip k strategy.

The abstract base class, Pattern, defines an interface for using access patterns (see Figure 3.6). There are two virtual methods: next(), which gets the next logical record number in the access pattern, and reset(), which resets the pattern to the beginning. Currently, PPFS supports three kinds of patterns. The UniformPattern class implements the read n, skip k strategy: n records are read, then k are skipped. This encompasses a wide range of patterns, from sequential to reverse, and much in between, including strides and stripes. The other two, SeriesPattern and RelationPattern, employ user-defined function pointers to specify arbitrary patterns. To use one of these access patterns, the user function must be linked with the PPFS element which contains the cache.
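The sketch below shows how a read n, skip k pattern can generate logical record numbers through the next()/reset() interface just described; the class and member names are illustrative, not the actual UniformPattern implementation.

    // Illustrative "read n, skip k" pattern in the style of the
    // Pattern interface (next() and reset()).
    class ReadNSkipK {
    public:
        ReadNSkipK(long n, long k) : n_(n), k_(k), pos_(0), inRun_(0) { }

        long next()                    // next logical record number
        {
            long record = pos_;
            if (++inRun_ == n_) {      // finished a run of n records:
                inRun_ = 0;
                pos_ += k_ + 1;        // skip k records
            } else {
                pos_ += 1;
            }
            return record;
        }

        void reset() { pos_ = 0; inRun_ = 0; }  // restart the pattern

    private:
        long n_, k_, pos_, inRun_;
    };

    // ReadNSkipK p(2, 3) yields 0, 1, 5, 6, 10, 11, ...: read 2, skip 3.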

3.5 Caching

This section discusses the support for caching in PPFS. First, the basic cache mechanism is described; then various related policies are discussed.

3.5.1 Caches

There are several levels of caching in PPFS. A caching component exists on the client nodes, the server nodes, and the agent nodes. The Cache class (see Figure 3.6) combines the similar features of each into a common base class. Upon this foundation, the three derived caches are built: ClientCache, ServerCache, and AgentCache.

Each cache is an associative memory, storing some number of records from perhaps several parallel files. Each record is stored in a CacheBlock object, which holds the record data along with a flag word, used to identify dirty records, etc. In all types of caches, records are indexed via a CacheKey object, which is also stored in the CacheBlock for each record. The interpretation of a CacheKey depends on the type of cache. For example, a CacheKey in the ClientCache would indicate the parallel file handle and the logical record number. In a ServerCache, it would contain the segment number and segment record number. The file's distribution provides a mapping between these two representations (see §3.4.2).
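The two interpretations might look like the following sketch; the field names are assumptions chosen to match the description above, not the actual CacheKey layout.

    // Client-side key: identifies a record by file handle and logical
    // record number (hypothetical field names).
    struct ClientKey {
        int  fileHandle;
        long logicalRecord;   // R, in the notation of §3.4.2
    };

    // Server-side key: identifies a record by segment and slot within
    // the segment.
    struct ServerKey {
        int  segmentNumber;   // SN
        long segmentRecord;   // SR
    };

    // The file's data distribution maps between the two forms via
    // map() and unmap().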

3.5.2 Replacement Policies

PPFS maintains caches at various locations throughout the system. Each one has a maximum capacity in bytes. When a cache is full, it must eject one or more cache blocks to accommodate incoming data. The ReplacePolicy class is an abstraction for deciding the order in which blocks are to be ejected. This functionality is provided by the next() method, which returns a pointer to the next cache block to remove. Each cache consults its associated ReplacePolicy to determine the next block to free. There is also a set of methods which the cache uses to notify the replacement policy of events which may affect the outcome of future calls to next(). The replacement policy is informed when any of the following occur:

- a record is added to the cache,
- a record is ejected from the cache,
- a record in the cache is read,
- a record in the cache is written, or
- a dirty record in the cache is flushed.

These methods all take a pointer to the CacheBlock in question. This flexible design allows new cache replacement policies to be added to the system without disturbing the underlying caching components. Also, a new replacement policy will integrate equally well into all three types of caches: client, agent, and server.
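As an example of what can be built on this interface, here is a minimal LRU policy sketch; the notification method names are assumptions based on the event list above, not the actual ReplacePolicy signatures.

    #include <list>

    class CacheBlock;  // record storage, as described in §3.5.1

    // Minimal LRU replacement policy sketch.
    class LRUPolicy {
    public:
        CacheBlock * next()   // the block the cache should eject next
        {
            return lru_.empty() ? 0 : lru_.front();
        }
        void recordAdded(CacheBlock *b)    { lru_.push_back(b); }
        void recordEjected(CacheBlock *b)  { lru_.remove(b); }
        void recordRead(CacheBlock *b)     { touch(b); }
        void recordWritten(CacheBlock *b)  { touch(b); }
        void recordFlushed(CacheBlock *)   { }  // flushing does not reorder

    private:
        void touch(CacheBlock *b)          // move b to most-recently-used end
        {
            lru_.remove(b);
            lru_.push_back(b);
        }
        std::list<CacheBlock *> lru_;      // least recently used at the front
    };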

3.5.3 Prefetching Policies

The design of PPFS includes various prefetching policies. These are not part of the present implementation; however, a ReplacePolicy can perform this task. The main problem with this arrangement is that, to implement M replacement policies and N prefetching policies, one needs M x N classes, whereas splitting them into two separate interfaces would lower this number to M + N. The only prefetching policy currently employed is on the servers, during read-any mode. Because the access pattern on each server is known (the servers read the file sequentially), a server can prefetch records which are guaranteed to be accessed soon. An application can exercise some control over this prefetching strategy using the ServerConfig class (see §3.6.1).

3.5.4 Write Back Policies

One important facet of managing a cache is deciding when to write data to the next level of the memory hierarchy. The current implementation supports the full set of write-through levels: to disk, to server, to agent, and to client. When a record is written to a cache with a destination level that is not lower in the memory hierarchy, the record is not forwarded by the cache. Instead, the cache commits the record to memory, marking it as "dirty". When the cache becomes full and needs to eject a record, the cache's replacement policy (§3.5.2) is consulted to select a record for ejection. The replacement policy may also decide to flush some or all of the dirty records in the cache. To support this, the Cache class defines several flush methods, which also allow the application to explicitly flush records for consistency or performance reasons. There are three such methods: one flushes a specific dirty record, one flushes all dirty records from the same file, and one flushes all dirty records in the entire cache. All three methods accept an argument which specifies the destination level to which the records should be flushed. As mentioned in §3.5.3, the coupling of this policy with the basic replacement policy is less than ideal, but quite effective.
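The following compilable sketch suggests how those three flush variants might look to an application; all names here are hypothetical, since the text does not give the actual signatures.

    // Hypothetical destination levels, mirroring the write-through
    // levels listed above.
    enum FlushLevel { ToClient, ToAgent, ToServer, ToDisk };

    // Hypothetical interface for the three flush methods: one record,
    // all dirty records of one file, or the entire cache.
    class FlushInterface {
    public:
        virtual void flushRecord(long recordKey, FlushLevel dest) = 0;
        virtual void flushFile(int fileHandle, FlushLevel dest) = 0;
        virtual void flushAll(FlushLevel dest) = 0;
        virtual ~FlushInterface() { }
    };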


3.6 PPFS Operation

3.6.1 PPFS Configuration

To experiment with various resource management policies, an application needs the ability to reconfigure the servers and caching agents. For example, an application may wish to disable server prefetching (which is enabled by default). The ServerConfig class allows an application to configure various parameters on the server. A similar class, AgentConfig, will be created when the implementation of the caching agent matures.

A ServerConfig object is a collection of options which affect the behavior of the servers. An application may send this object to one or more servers as follows:

    ServerConfig config;
    config.setCacheSize(1024*1024);
    config.send();

In this example, each server is instructed to use one megabyte for caching. Several options may be indicated before sending the configuration object, and other forms of send() allow the configuration object to be sent to specific servers. The full set of options available is:

- setFlowControl() enables or disables a special flow control protocol used by the servers.
- setCacheSize() sets the maximum size of the cache, in bytes.
- setCacheBins() indicates the number of bins in the cache's hash table.
- setQueuePause() specifies the size, in bytes, that the server request queue must reach before client write requests are blocked.
- setQueueResume() sets the point at which the server will resume blocked client write requests.
- setUsePrefetching() enables or disables server prefetching.
- setPrefetchUnit() specifies how many records are prefetched when a server performs a prefetch operation.
- setPrefetchAhead() sets the number of records which the server attempts to prefetch ahead of the client request stream.

Any options not specifically set by the application are not altered at the server.
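As a further illustration, a run that disables prefetching and tunes the request-queue thresholds might be configured as follows; the setter names come from the list above, but the argument values and types are assumptions.

    ServerConfig config;
    config.setUsePrefetching(0);        // disable server prefetching
    config.setQueuePause(256 * 1024);   // block client writes at 256 KB queued
    config.setQueueResume(64 * 1024);   // resume once the queue drains to 64 KB
    config.send();                      // broadcast to every server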


3.6.2 PPFS Control

The run program initializes PPFS and spawns application programs. It starts by spawning the metadata server, the data servers, and the server slaves. Caching agents are spawned by the metadata server whenever an application uses global caching facilities. run recognizes a set of command-line options for configuring the servers. If any of these options are present, the run program will send a ServerConfig object to each of the servers. The run program also has a set of options for enabling various levels of trace data capture on the system and application processes.

3.6.2.1 Direct Operation

A single application may be executed directly as follows:

    run 128 16 myprog myfile

The first two arguments specify the number of client nodes and the number of server nodes, respectively. The next argument is the name of the application, and the remaining arguments are passed to this program. If the application, or any of the system programs, are not located in the current directory, they must be in the directory indicated by the PPFS_BIN environment variable. An optional -h flag may precede the application name, which indicates it is the host part of a host-node program. In this case run loads the host program onto a single client node, and the host is responsible for loading the node programs onto the remaining nodes.
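For example, a host-node application with the same node counts might be started as follows (the program name is hypothetical):

    run 128 16 -h myhost_prog myfile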

3.6.2.2 Remote Operation

If no application is specified, run spawns the control program, which in turn accepts TCP socket connections from a remote operation utility. The remote program can request control to spawn applications and to shut down PPFS. The control program communicates via NX with run, which carries out these commands. Because run deals with NX messaging and not TCP, it is free to receive messages from either the application or the control program.

The remote program begins by parsing a configuration file which describes a set of graphical control elements to present to the user. This file also specifies the name of the machine on which the PPFS control program is running. After establishing a connection, the remote program

displays a control panel which allows the user to start applications and shut the system down. A typical configuration file is shown in Figure 3.10. As this example illustrates, the file is primarily a specification of various user interface elements, such as push-buttons and check boxes. The first few lines specify the machine name and port number. The control elements are defined by giving the type of element (group, text, button, etc.) followed by a number, then a set of attributes enclosed in braces. The number serves to uniquely identify the control (the remote program also allows controls to be named, but this is not supported by the PPFS control program). The attributes enclosed in braces provide further details about the control element, such as a label and an initial value. Figure 3.11 shows the interface elements produced by this example.

3.7 Summary

This chapter discussed the current implementation of PPFS and provided rationale for our design decisions. We described the software environment used to support the file system. We enumerated the various processes which form the system, and described how they communicate and cooperate. Finally, we presented a detailed description of how the major components of the system are implemented, including parallel files, data distributions, access patterns, metadata, and caching.


    # PPFS remote control configuration file
    Host bagpipe.cs.uiuc.edu
    Port 6002
    Group 190 {
        label "Start Application"
        controls {
            Text 191 {
                label "Application Name"
                value "genome_host"
            }
            Separator { }
            Numeric 192 {
                label "Number of Clients"
                value 16
                min 0
                max 28
                step 1
            }
            Separator { }
            Text 193 {
                label "Application Args"
                value "/home/reed/genome/datafiles/gbpri.seq"
            }
            Separator { }
            Toggle 194 {
                label "Host/Node"
                value 0
            }
        }
    }
    Text 181 {
        label "Trace/Config Options"
    }
    Separator { }
    Button 147 {
        label "Quit Session"
    }
    Button 146 {
        label "Shutdown PPFS"
    }
    Button 191 {
        label "Application Crash"
    }

Figure 3.10: Example Remote Control Configuration File


Figure 3.11: Example Remote Control Interface


Chapter 4

Experiments

In this chapter, we evaluate the performance of PPFS, first on some simple benchmarks, then on two scientific applications. The experiments were performed on three Paragon XP/S systems, each with a different configuration (see Table 4.1). We compare the performance of PPFS atop a UNIX file system (UFS) and atop the native Intel parallel file system (PFS).

4.1 Benchmarks

Before experimenting with PPFS on real programs, it is useful to analyze it in the context of simple benchmarks. Using programs that consist solely of input/output operations, we can compare PPFS with the native file system [13]. The benchmarks consist of two programs: one creates and writes a 64 MB file, and the other reads the same file. In the first case, each node writes a disjoint set of records: writer i begins by writing record i and then writes every Nth record, where N is the number of writers. For the read test, each node reads the entire file sequentially. The file size is kept fixed at 64 MB, while the record size and the number of participating nodes are varied. The data written in each record consists of a (binary) four-byte integer representing the record number, and the remainder is filled with the (ASCII) hex digit of the record number mod 16. Thus, it is easy for the reader to verify that the data is correct. The code which implements the readers and writers is shown in Figure 4.1. For all three machines, the record sizes used (represented by the variable REC_SIZE in the code) were 1K, 2K, 4K, 8K, 16K, 32K, 64K, and 128K. The number of readers or writers

    Machine   Compute Nodes   I/O Nodes
    bagpipe        28              1
    raptor         52              2
    trex          512             16

Table 4.1: Intel Paragon System Configurations

    void bench_write(ParallelFile & file)
    {
        static char code[] = "0123456789ABCDEF";
        char buffer[REC_SIZE];
        // Writer MY_ID writes its own record, then every Nth record,
        // where N is the number of writers.  (The loop body below is
        // reconstructed from the description in the text; the original
        // listing is truncated here, and MY_ID, NUM_RECS, NUM_WRITERS,
        // and the write() call are assumed names.)
        for (int i = MY_ID; i < NUM_RECS; i += NUM_WRITERS) {
            *(int *) buffer = i;                        // binary record number
            memset(buffer + sizeof(int), code[i % 16],  // ASCII hex fill
                   REC_SIZE - sizeof(int));
            file.write(i, buffer);
        }
    }