CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE, Concurrency Computat.: Pract. Exper. 2008; 1:1

Designing and Implementing Lightweight Kernels for Capability Computing

Rolf Riesen1, Ron Brightwell1, Patrick G. Bridges2, Trammell Hudson3, Arthur B. Maccabe2, Patrick M. Widener2, and Kurt Ferreira1

1 Sandia National Laboratories, Albuquerque, NM 87185, {rolf,rbbrigh,kbferre}@sandia.gov
2 Department of Computer Science, University of New Mexico, Albuquerque, NM 87131-1386, {bridges,maccabe,pmw}@cs.unm.edu
3 OS Research, 1527 16th St. NW #5, Washington, DC 20036, [email protected]

SUMMARY

In the early 1990s, researchers at Sandia National Laboratories and the University of New Mexico began development of customized system software for massively parallel “capability” computing platforms. These lightweight kernels have proven to be essential for delivering the full power of the underlying hardware to applications. This claim is underscored by the success of several supercomputers, including the Intel Paragon, Intel ASCI Red, and the Cray XT series of systems, each having established a new standard for high-performance computing upon introduction. In this paper, we describe our approach to lightweight compute node kernel design and discuss the design principles that have guided several generations of implementation and deployment. A broad strategy of operating system specialization has led to a focus on user-level resource management, deterministic behavior, and scalable system services. The relative importance of each of these areas has changed over the years in response to changes in applications, hardware, and system architecture. We detail our approach and the associated principles, describe how our application of these principles has changed over time, and provide design and performance comparisons to contemporaneous supercomputing operating systems.

Contract/grant sponsor: Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Copyright © 2008 John Wiley & Sons, Ltd.


Figure 1. The nodes in a partitioned capability computing system are divided into dedicated compute nodes along with supporting I/O and service nodes.

1. Introduction

1.1. Capability Computing Systems

In the early 1990s, Sandia National Laboratories began the design and deployment of new “capability” computing platforms under the auspices of the Accelerated Strategic Computing Initiative (ASCI) sponsored by the United States Department of Energy (DOE). These platforms emphasized scalability and performance for mission-critical applications, aiming to satisfy the demanding requirements of the most sophisticated modeling and simulation applications used in science and engineering. This contrasted with so-called “capacity” platforms, which were designed to accommodate increasing numbers of small jobs. ASCI’s mission was to accelerate the technology required to achieve new levels of scalability and performance, which in turn would provide the computing resources necessary to achieve new levels of scientific discovery.

To address these needs, capability systems for the ASCI program have been designed around partitioned architectures (Figure 1), in which sets of nodes are dedicated to specific tasks. Specifically, a large set of nodes is dedicated to strictly computational tasks, while other nodes handle I/O, interactive processing, and other service requirements. This architecture allows commodity hardware to be used efficiently to meet the reliability and performance demands of these large-scale machines.

Unfortunately, general-purpose operating systems scale poorly in capability environments for (at least) two reasons. First, application codes performing mostly computational activities on a compute node have no need of most system processes found in UNIX flavors (a line printer daemon, for example). While some such processes are relatively easily deactivated at boot time, not all can easily be removed without interdependency issues, and the kernel code supporting these processes still adds to the memory footprint of the kernel. Second, general-purpose multiprocessing activity and scheduling policies vary the amount of processing time made available to an application during execution. While not generally an issue in single-processor situations, this can lead to performance-sapping synchronization issues in the distributed-memory architectures of most capability computing platforms. For example, unpredictable processor availability can cause some processors to block at a synchronization step while waiting for others to finish a compute phase, increasing overall execution time significantly.

1.2. Scalable System Software

Sandia and the University of New Mexico (UNM) began developing new scalable system software targeted for these platforms, tailoring this software specifically to the hardware architecture, the functional requirements and parallel programming model of key applications, and the intended usage model of the machine. The result is a family of carefully crafted operating systems, termed lightweight kernels, that focus on providing exactly and only those services needed by specific hardware and a small set of mission-critical applications. This work resulted in the following milestones in lightweight kernel development:

1991: The Sandia/UNM Operating System (SUNMOS), the first lightweight kernel, is developed as a research-oriented replacement for the nCUBE nCX and Vertex systems. SUNMOS is deployed on the nCUBE-2, one of the first commercially available distributed memory MPP machines designed specifically for scientific computing.

1993: SUNMOS is ported to an 1824-node Intel Paragon.

1994: Puma, an enhanced version of SUNMOS, is deployed on the Paragon. Puma also features the first implementation of the Portals communication architecture.

1996: Puma is ported to ASCI Red (4500 nodes), the world’s first general-purpose terascale computing system. Puma is later productized by Intel under the name Cougar.

1997: Due to the prohibitive level of effort necessary to port Puma to commodity workstation hardware (DEC Alpha), Linux is used as the compute node OS for the Computational Plant project. As a result, Portals becomes an independent component rather than being bound to an OS.

2002: Sandia and Cray design and build the 13,000-node Red Storm, the prototype of Cray’s XT series of machines. Sandia creates the Catamount lightweight kernel by porting Cougar to Red Storm. Portals is also modified to work with SeaStar, Red Storm’s dedicated network/routing hardware.

Lightweight kernels developed as part of this work, as well as others indirectly related to this work, continue to be used in large-scale production supercomputers. Catamount, for example, is still used in daily production on ASC Red Storm at Sandia, as well as on other XT3 systems around the world. Similarly, IBM BlueGene-series machines also run lightweight kernels inspired in part by the original lightweight kernels constructed for the Intel Paragon and ASCI Red.

1.3. Paper Overview

In this paper, we discuss the lightweight kernel philosophy, the principles that have guided us in designing these kernels, and our empirical and practical experiences with lightweight kernels. Section 2 describes our philosophy of lightweight kernel design, the principles we have used in mapping this philosophy to actual system implementations, and the high-end system software architecture developed based on these principles. Section 3 illustrates the advantages of our approach to kernel design for high-end operating systems using comparisons with contemporaneous operating systems, and Section 4 follows with a description of our experiences and lessons learned. Section 5 describes related work, placing the lightweight approach to operating system design in the context of broader research themes. Finally, Section 6 discusses future directions for research on lightweight operating systems, and we summarize and conclude in Section 7.

2. Lightweight Kernel Architecture and Evolution

To address the system software needs of the partitioned hardware architectures, we adopted a philosophy of operating system specialization: we partitioned and specialized system software functionality, both across nodes and between processors within nodes. Based on this philosophy, we also have identified some fundamental principles of lightweight kernel design. These include:

• User-level resource management: Provide applications with user-level mechanisms that can be used to manage and protect system resources, allowing applications to determine the most effective resource management strategies.

• Predictable performance: Construct the operating system so that provided system services run in a predictable amount of time, enabling user-level performance tradeoff decisions and aiding in system and application performance debugging.

• Scalable system services: Provide the application only those system services that will scale to the size of the system, encouraging scalable application development.

In the remainder of this section we elaborate on the basic architecture of our lightweight kernels and describe how the principles described above have applied to the design and implementation of lightweight kernel subsystems.

2.1. Basic Architecture

The basic lightweight kernel (LWK) architecture, as shown in Figure 2, consists of a Quintessential Kernel (QK) and a Process Control Thread (PCT). The QK is the lowest level of the operating system. It sits on top of the hardware and performs hardware services on behalf of the PCT and user-level processes.


Figure 2. Lightweight Kernel Architecture: applications (linked against libraries such as libc.a, libmpi.a, libnx.a, and libvertex) and the Process Control Thread (PCT) run in user space, separated by protection boundaries, above the Q-Kernel, which provides message passing and memory protection.

The QK supports a small set of tasks that require execution in privileged supervisor mode, including servicing network requests, interrupt handling, and fault handling. It also fulfills privileged requests made by the PCT, including running processes, context switching, and virtual address translation and validation. However, the QK does not manage the resources on a compute node. It simply provides the necessary mechanisms to enforce policies established by the PCT and to perform specific tasks that must be executed in supervisor mode. The QK provides a trap mechanism for communicating with the PCT of an application. A list of these traps, along with their descriptions, is shown in Table I.

The PCT is a privileged user-level process that performs functions traditionally associated with an operating system. It has read/write access to all memory in user space and manages all compute node resources. It handles process creation, memory management, and scheduling. While QKs do not communicate with each other, the PCTs on the nodes within a parallel job communicate with each other to start, manage, and shut down the job. Overall, the PCT and the QK work together to provide a complete operating system, with the QK providing resource management mechanisms and the PCT implementing resource management policies that use these mechanisms. In short, the PCT is responsible for setting policies and the QK is responsible for enforcing them.

2.2. User-Level Resource Management

For compute nodes, the user is best able to make decisions about resource management and implement the policies that maximize use of available resources. These decisions involve scheduling, processor allocation, and memory management.


Table I. QK Entry Points

    Entry Point           Description
    null_trap             Null trap. Handler is never called.
    setuid                Set user ID
    lputs                 Print a string to the console
    quit_quantum          Quit application quantum. Start the PCT
    init_proc             Initialize a process
    get_cache             Get cache table entry
    run_process           Run the specified process context
    install_pct           Set the start address for PCT
    init_region           Initialize a memory region
    memlogcntrl_trap      Control for memory log capability
    nop_trap              Return CPU ID
    cpu_migrate_trap      Migrate process from one CPU to another
    dual_proc             Check for second CPU
    rcad_ioctl            Interface to the RAS system
    ptl_kernel_fwd        Portals dispatch system call

2.2.1. Scheduling

Maximizing the amount of processor time delivered to a parallel application process is critical to achieving high performance and scalability. Our lightweight kernels have been designed to minimize the amount of processing the operating system takes away from the application process and to maximize the flexibility the application has in managing the processors on a node. This is accomplished by allowing users to specify scheduling policy, providing simple support for multiprocessor systems, and by carefully managing timers.

For scheduling, the QK up-calls to the PCT at each scheduling point. The PCT makes the scheduling decision (policy) and tells the kernel via the function run_process() which process to run next and for how long. The QK then flushes caches as appropriate, sets up the hardware registers, and runs the selected process (mechanism). This approach allows applications with multiple threads or processes on a single node to customize the scheduling policy as necessary. This strategy is similar in spirit to how scheduler activations [1] have been used to interface between kernel- and user-level thread systems in some recent UNIX and microkernel operating systems.

Commentary
In SUNMOS, the earliest version of our LWK design, there was no PCT; the kernel gained control whenever a message arrived, or when the application trapped into the kernel. This design decision was made because a PCT was not strictly necessary to support MPI message passing applications and our initial goal was to build as simple a kernel as possible; the complexities and resulting scaling problems of OSF/1 strongly motivated this initial decision.


In Puma, we introduced the PCT to provide a more general process and scheduling model, which made it possible to run two or more application processes concurrently or otherwise adapt scheduling to user needs. The PCT in Puma still only ran when a user process or something external to the node made a request of the PCT; all such requests went through the local kernel, which scheduled the PCT to run as necessary. This change was introduced to provide a more general process and scheduling model (e.g., in future applications) while still minimizing OS interference with MPI applications.

Although Puma allowed more than one application process per node, additional processes were rarely used due to the added complexity of starting multiple processes. By default, all memory on a node that remained after allocating space for text (instructions), initialized data, and stack was given to the heap. The user had to specify explicitly at job launch time that the remaining memory should not be completely allocated, leaving space for subsequent processes. To support multiple processes, the PCT in Puma would set a time quantum (one second was common), and the QK would run the PCT briefly at quantum expiration. The PCT would check for requests, handle them, set the next quantum length, and tell the kernel which application process to run next [2, 3].
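To make the policy/mechanism split concrete, the following minimal C sketch shows how a PCT-style handler might respond to the quantum-expiration upcall described above. The process list, the request-handling helpers, and the exact signature of run_process() are illustrative assumptions; only the overall QK/PCT interaction follows the text and Table I.

    /* Illustrative sketch of the PCT side of the QK/PCT scheduling split.
     * The QK invokes this handler when the application's quantum expires; the
     * PCT (policy) picks the next process and hands it back to the QK
     * (mechanism) through a run_process()-style entry point as in Table I.
     * All type and function names here are assumptions, not the real API. */

    #include <stddef.h>

    #define QUANTUM_MS 1000            /* one-second quanta were common in Puma */

    struct process {
        int pid;
        int runnable;
        struct process *next;
    };

    /* Assumed QK entry point: run this process for the given quantum. */
    extern void run_process(int pid, unsigned int quantum_ms);

    /* Assumed helpers for requests from local processes or remote PCTs. */
    extern int  pending_request(void);
    extern void handle_request(void);

    static struct process *proc_list;   /* processes managed on this node */

    /* Conceptually called by the QK whenever the application's quantum expires. */
    void pct_quantum_expired(void)
    {
        /* 1. Service any requests that arrived since the last upcall. */
        while (pending_request())
            handle_request();

        /* 2. Policy: pick the next runnable process (trivial first-fit scan). */
        struct process *next = proc_list;
        while (next != NULL && !next->runnable)
            next = next->next;

        /* 3. Mechanism: the QK flushes caches, loads the hardware context,
         *    and resumes the chosen process for the next quantum. */
        if (next != NULL)
            run_process(next->pid, QUANTUM_MS);
    }

In this sketch the PCT owns the policy (which process runs, and for how long), while the run_process() entry point supplies the QK's mechanism for actually dispatching it.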

2.2.2. Multiprocessor Support

Traditional multiprocessor operating systems give applications symmetric access to multiple processors, but must deal with the resulting kernel-level synchronization issues. Our lightweight kernels, in contrast, leverage the distributed memory nature of scalable parallel applications to make asymmetry explicit in applications, simplifying kernel design. Accordingly, applications may choose at launch to use system processors in one of several different processor modes [4].

The simplest mode of operation, colloquially referred to as heater mode, runs both the kernel and the user-level process on a single processor and simply ignores additional processors. System calls from a user process are initiated by a trap instruction to the kernel, which handles the request and re-establishes the context of the user process. This mode does not provide any performance advantages, but it is the simplest mode to make operational and has historically been the default mode.

In kernel co-processor mode, the kernel runs on one processor (the “system” processor) and the user-level process on another (the “user” processor). In this mode, the QK polls the external devices and looks for system call requests from the user-level process in a shared-memory mailbox. Since the time to transition between user mode and kernel mode can be significant, this mode offers the possibility of increased performance for handling system calls and servicing devices. This mode is also sometimes referred to as message co-processor mode, because the LWK generally spends most of its time handling messaging traffic on compute nodes in a partitioned system.

In user co-processor mode, the kernel and the user process run on both processors. However, the kernel and user codes that run on the user processor do so in a very limited way. The kernel code running on the user processor does not perform any management activities, but simply notifies the system processor of kernel requests. The user code that runs on the user processor must run within the same context as the process on the system processor and is limited in the system calls that it can make. Specifically, an application using this mode runs a coroutine on the user processor by passing a function pointer to the kernel, which hands the function off to the user processor. Because of this non-standard interface, application programs have rarely used this mode; those that do use it run a specialized version of a standard math library on the user processor.

The most frequently used mode is virtual node mode, in which the kernel and a user process run on the system processor and a separate user process runs on the user processor. As the name implies, this mode treats each processor as a separate node. The available memory on a node is divided in half and two separate user address spaces are created. All kernel services are fully supported on both processors, but the kernel runs only on the system processor. There is no support for shared memory, so data transfers between the two processes on a node must be done via the network. This mode has the advantage of using all of the processors on a node in a manner that is transparent to the application code or user.

The ability to choose the way in which multiple processors on a node are used can have a great impact on performance. For codes that are limited by memory or network bandwidth as opposed to processor performance, for example, kernel co-processor mode allows application scientists to trade away processors for improved networking performance. Conversely, applications that are processor-bound can choose to use user co-processor mode or virtual node mode. The user is given explicit control over which mode the nodes will use when submitting a job for launch on the system.

Commentary
Intel provided a second, and in later board versions a third, i860 CPU on each node of its Paragon XP. The additional processor was intended to control the network interface and perform message passing tasks. The machine was marketed with the peak flop rating of a single CPU per node; in the race for higher performance numbers and a high ranking on the Top 500 list, this was unusual. Since SUNMOS was designed for the single-CPU-per-node nCUBE-2, it was easier initially to simply ignore the second CPU (heater mode, as described above).

The additional multi-processor modes described above were implemented primarily on ASCI Red. By the end of its lifetime, virtual node mode was the predominant processor mode for this system. Virtual node mode has become the default for using Red Storm as well, and because Red Storm nodes are fitted with an intelligent network interface that is able to handle most message transmission tasks, message co-processor mode is deprecated on this system.
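As a rough illustration of the shared-memory mailbox used in kernel co-processor mode, the sketch below shows one way a system-call request could be passed from the user processor to the polling QK without a trap. The mailbox layout, field names, and dispatch routine are assumptions made for illustration; the actual Puma/Cougar structures are not described at this level of detail in the text.

    /* Illustrative sketch of a kernel co-processor mode mailbox: the user
     * processor deposits a system-call request in shared memory, and the QK,
     * polling on the system processor, services it without a trap.
     * Field names and the request encoding are assumptions, not the real
     * layout; a real implementation would also need memory barriers. */

    #include <stdint.h>

    struct syscall_mailbox {
        volatile uint32_t pending;     /* 0 = empty, 1 = request posted          */
        volatile uint32_t syscall_no;  /* which service is requested             */
        volatile uint64_t arg[4];      /* arguments, e.g. buffer address, length */
        volatile int64_t  result;      /* filled in by the QK                    */
        volatile uint32_t done;        /* set by the QK when result is valid     */
    };

    /* Stand-in for the QK's internal system-call dispatcher. */
    extern int64_t dispatch_syscall(uint32_t no, volatile uint64_t *args);

    /* User-processor side: post a request and spin until the QK answers. */
    static int64_t mailbox_syscall(struct syscall_mailbox *mb, uint32_t no,
                                   uint64_t a0, uint64_t a1)
    {
        mb->syscall_no = no;
        mb->arg[0] = a0;
        mb->arg[1] = a1;
        mb->done = 0;
        mb->pending = 1;               /* hand off to the system processor */

        while (!mb->done)
            ;                          /* busy-wait; no kernel trap involved */
        return mb->result;
    }

    /* System-processor (QK) side: one iteration of the polling loop. */
    static void qk_poll_mailbox(struct syscall_mailbox *mb)
    {
        if (mb->pending) {
            mb->pending = 0;
            mb->result  = dispatch_syscall(mb->syscall_no, mb->arg);
            mb->done    = 1;
        }
    }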

2.2.3. Memory Management

We did not want to introduce overly general memory management facilities that parallel applications would not use and that would reduce system performance. In particular, we sought to provide as much memory to applications as possible, to maximize memory bandwidth available to applications, and to use layouts that would facilitate fast networking and other critical system services.

One such general memory management facility is demand-paged virtual memory, which is intentionally not provided for several reasons. In particular, demand-paged virtual memory has a large potential performance impact. Most nodes in large-scale systems are diskless to maximize reliability, and most well-designed parallel applications avoid paging even in environments where it is supported. Scalable applications that need to run out-of-core generally manage data movement explicitly using high-performance I/O and network access, and do so more efficiently than a general memory page replacement strategy implemented in the operating system could.

To achieve this, we chose a simple virtual memory system that linearly maps virtual pages to physical pages. This approach provides address space isolation between processes, the PCT, and the QK. In this architecture, the QK gives the PCT a linear address space mapping that it can divide up between user processes using the init_region() call into the QK. In this way, physical addresses are mapped one-to-one to virtual addresses. This approach allows the LWK to easily use large-page translation lookaside buffer (TLB) entries to efficiently map large portions of the address space, to do virtual-to-physical translations for setting up network transfers very quickly, and potentially to completely do away with page tables on machines with software-loaded TLBs. This linear mapping can potentially increase an application's effective memory bandwidth by reducing TLB misses, as well as reducing the OS memory footprint.

Commentary
In the early 1990s, memory was a very precious resource; our goal was to give as much of the main memory to the application as possible. This meant we should keep our kernel as small as possible. It also meant that as many services as possible should reside in libraries that were only loaded by applications on demand.

The design of the network interface also influenced our memory management strategies. The earliest network interface controllers (NICs) used simple DMA engines that required physical memory addresses and large messages to achieve peak bandwidth. Since our goal was to provide user-level processes with the highest bandwidth possible (only a few percent less than peak), we had to organize memory so that we could hand large areas of physically contiguous memory to the DMA engines of the NIC.

The price of memory has dropped significantly in the last decade, but memory can still dominate the cost of a large-scale machine. We expect the emergence of multi-core processors to once again emphasize memory frugality in the operating system.
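The appeal of the linear mapping is that virtual-to-physical translation collapses to a range check and an offset addition, as in the hedged sketch below. The region structure, helper names, and DMA call are illustrative assumptions, not the actual QK code.

    /* Sketch of the one-to-one (offset) virtual-to-physical mapping described
     * above.  With a contiguous linear mapping, translating an address for DMA
     * setup is a range check plus an addition; no page-table walk is needed.
     * The region structure and helper names are illustrative assumptions. */

    #include <stdint.h>
    #include <stddef.h>

    struct region {
        uintptr_t virt_base;   /* start of the process's virtual region     */
        uintptr_t phys_base;   /* physical address it is linearly mapped to */
        size_t    length;      /* region size in bytes                      */
    };

    /* Stand-in for programming the NIC's DMA engine. */
    extern void start_nic_dma(uintptr_t phys_addr, size_t len);

    /* Translate a virtual address inside the region to a physical address.
     * Returns 0 on failure (0 is assumed never to be a valid target here). */
    static uintptr_t virt_to_phys(const struct region *r, uintptr_t va, size_t len)
    {
        if (va < r->virt_base || va + len > r->virt_base + r->length)
            return 0;                          /* outside the mapped region */
        return r->phys_base + (va - r->virt_base);
    }

    /* Setting up a network transfer from user memory then needs no pinning and
     * no page-table walk: the physically contiguous buffer is handed to the
     * NIC directly. */
    static int dma_setup(const struct region *r, uintptr_t user_buf, size_t len)
    {
        uintptr_t pa = virt_to_phys(r, user_buf, len);
        if (pa == 0)
            return -1;
        start_nic_dma(pa, len);
        return 0;
    }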

2.3. Predictable Performance

Massively parallel systems must exhibit predictable performance, and processors must be available to the application as much as possible.∗ Timer interrupts are carefully managed to minimize application perturbation, while nodes communicate using the Portals high-performance, low-latency message-passing system.

2.3.1. Timer Management

Unlike traditional full-featured operating systems based on UNIX, our lightweight kernels do not continually take timer interrupts to analyze the state of a node.

∗ This characteristic in fact motivated a real-time variant of Cougar, the Parallel Real-Time OS for Secure Environments (PROSE), developed in collaboration with Intel and Hughes Aircraft [5].


For example, a Linux 2.4 kernel running on an Alpha processor takes an interrupt every millisecond to assess the state of the machine and perform housekeeping activities, and the Linux scheduler is built around daemon processes which have been shown to reduce application performance at scale [6]. In contrast, lightweight kernels and their schedulers only require timer processing to perform critical activities. We detail the benefits of this careful timer management in the benchmark results in Sections 3.1 and 3.3.

Commentary
We explained above how scheduling in the LWK evolved from no scheduling in SUNMOS to rather rudimentary scheduling in today's LWK. Timer management is strongly tied to scheduling. Puma would run a process uninterrupted until it had used its entire quantum. The only interrupts that were generated were those from the communication subsystem, either when messages arrived or when the outgoing network FIFO queues became empty.

In Cougar, regular timer interrupts became necessary. Specifically, ASCI Red had a Reliability, Availability and Serviceability (RAS) system requiring the operating system on each node to periodically declare its liveness, and nodes that failed to report were rebooted. Because the longest settable timer interrupt on the ASCI Red network interface card, the CNIC, was 100 ms, Cougar had to take a timer interrupt ten times per second to service the RAS system. This did not significantly impact application performance, but the disruption was measurable.

2.3.2. Portals Communication Subsystem

Message passing performance is a critical aspect of massively parallel, distributed memory machines; even a single memory copy in the network stack can severely impact performance. For this reason, we focused much of our work in the LWK architecture on a scalable, flexible communication mechanism called Portals that allows data transfers directly from user memory on one node to user memory on another.†

A Portal is referenced through an index into a Portal table, where each table entry refers to either a match list or a memory descriptor. A memory descriptor describes the layout of the memory associated with the Portal. Matching lists provide Portals with a matching semantic that can be used to bind specific messages to specific memory regions. Each entry of the matching list has a memory descriptor associated with it. A Portal may refer to a memory descriptor directly or to multiple memory descriptors indirectly through a matching list.

To maximize performance, all message-passing structures in Portals reside in user space. On receive, the user simply polls a receive queue for information about new incoming messages. On send, the user appends a message to a queue and notifies the kernel through a system call that it wishes to send a message. The kernel is only responsible for traversing the user space data structures and depositing messages into user space based on the content of the structures. The kernel does not enforce any specific protocol or provide any buffering of messages.

† See [7] for a more complete discussion of Portals.


This allows the kernel to remain at a fixed size regardless of the size of the parallel job or the amount of network resources a job requires.

An important feature resulting from this design is Portals' ability to deliver messages without the direct involvement of the user-level process. Once the Portal structures have been put in place to receive a message, a process need not poll the network in order for the data transfer to complete. The process need not even be running for the kernel to deposit the message directly into the process' memory, a capability known as application by-pass. In conjunction with kernel co-processor mode or an intelligent NIC, this capability provides the ability to fully overlap computation and communication even for somewhat complex protocols like those required by an MPI implementation.

Portals are also connectionless, which eliminates the overhead associated with connection establishment and also helps to reduce the amount of state that is needed for message passing. This design, along with the contiguous physical-to-virtual memory mapping described in Section 2.2.3, reduces end-to-end network latency by eliminating page pinning costs, reducing DMA setup times (by avoiding page-table walks to do address translations), and avoiding unnecessary kernel traps. In addition, a latency-bound application can use specialized processor modes, for example the kernel co-processor mode described in Section 2.2.2, to reduce the impact of kernel trap times and polling times on message transmission and reception, respectively. Section 3.2.1 provides comparison results between Portals performance on Cougar and Linux on ASCI Red, illustrating the reduced latencies that Cougar could achieve compared to Linux because of how our lightweight kernels manage memory and processors.

Commentary
The Portals message passing mechanism evolved from the nCUBE API for message transmission. We wanted to let an application open portions of its address space to the network so it could send and receive messages quickly with little overhead. Using data structures in user space to control the kernel on the Paragon was very efficient. While the kernel had to take some measures to protect itself, e.g., to avoid walking infinite loops in user-level linked lists, no traps into the kernel were necessary to set up or change the way incoming messages were handled. When a new message arrived, the kernel would, from within the interrupt handler, examine the data structures in user space and set up the DMA engines to deposit the message exactly where the user wanted it.

This tight integration of kernel functionality and user-level data structures became a burden when it became necessary to move Portals data structures from user space to another processor or into the network interface (e.g., for the Red Storm SeaStar NIC), neither of which may share the same address map as the application. Performing address translation and validation on the fly, for example, was prohibitively expensive. To solve this problem, we introduced an API in Portals 3.0 to manipulate the Portals building blocks [8, 9]. This thin API hid the data structures, making it possible to move them from one address space to another. On Cplant, the data structures moved into the Linux kernel, and we experimented with moving them into a programmable network interface [10].

Two implementations of Portals exist for Red Storm: an interrupt-driven implementation where Portals data structures are in kernel space and a completely offloaded implementation where Portals data structures are contained in the memory of the SeaStar network interface [11]. It is now possible to evaluate the tradeoffs of trapping into the kernel or allowing applications to directly manipulate a NIC, allowing us to place the core functionality of Portals wherever it is most efficient for a given architecture [12].
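The following is a deliberately simplified sketch of the kind of user-space building blocks described in this section: a portal table whose entries refer to match lists and memory descriptors, and an event queue that the application polls. It is meant only to convey the shape of the design; the real Portals API and data structures (see [7, 8, 9]) differ in names and detail.

    /* Simplified, illustrative shapes of the Portals building blocks described
     * above: a portal table entry points at a match list or directly at a
     * memory descriptor, and the application polls an event queue for
     * completions.  These declarations sketch the concept, not the Portals API. */

    #include <stdint.h>
    #include <stddef.h>

    struct memory_descriptor {
        void    *start;          /* user buffer the message lands in / comes from */
        size_t   length;
        uint64_t events_seen;    /* bumped by the kernel/NIC on completion        */
    };

    struct match_entry {
        uint64_t match_bits;             /* e.g. MPI tag and communicator bits    */
        uint64_t ignore_bits;            /* wildcard mask                         */
        struct memory_descriptor *md;    /* where a matching message is deposited */
        struct match_entry *next;
    };

    struct portal_entry {
        struct match_entry        *match_list;  /* used when matching is needed   */
        struct memory_descriptor  *md;          /* or a direct memory descriptor  */
    };

    struct event {
        uint32_t type;           /* put, get, send-complete, ...                  */
        uint32_t portal_index;
        size_t   length;
        void    *start;          /* where the data was deposited                  */
    };

    struct event_queue {
        struct event *ring;      /* lives in user memory; written by the kernel   */
        uint32_t      size;
        volatile uint32_t head;  /* advanced by the kernel/NIC                    */
        uint32_t      tail;      /* advanced by the polling application           */
    };

    /* Receive side: the application simply polls its event queue; the message
     * itself was already deposited in user memory without its involvement. */
    static int poll_event(struct event_queue *eq, struct event *out)
    {
        if (eq->tail == eq->head)
            return 0;                            /* nothing new */
        *out = eq->ring[eq->tail % eq->size];
        eq->tail++;
        return 1;
    }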

2.4. Scalable System Services

The third principle guiding our system software design is to build scalable system services, particularly a scalable parallel application launcher and runtime system. Since we have removed many of the services of a traditional operating system from our compute node kernel, the runtime system plays an integral role in providing services to parallel applications. In our environment, the parallel job launcher not only interacts with the PCTs on the compute nodes to start the job, it also provides services to the application while it is running.

The parallel job launcher component of our LWK runtime system is called yod. Yod contacts a compute node allocator (which is topology aware to optimize networking performance) to obtain a set of compute nodes, and then communicates with a primary PCT to move the user's environment and executable out to the compute nodes. The primary PCT works with the secondary PCTs in the job to efficiently distribute the data to all of the compute nodes participating in the job.

Once a job has started, yod serves as an I/O proxy for all standard I/O functions. Compute node applications are linked with a library that redefines the standard I/O library routines and some system calls. This library implements a remote procedure call interface to yod, which actually performs the operation locally and then sends the result to the compute node process. Yod also disseminates some UNIX signals that it receives out to the processes running in the parallel job. When yod receives a signal, it sends a message to the primary PCT, which distributes the message to the other PCTs in the job and delivers the desired signal to the application process. This feature can be very useful for operations such as user-level checkpointing and killing jobs.

Commentary
SUNMOS originated on the nCUBE-2, where a Sun workstation served as a front-end and was used to boot the nCUBE, load the OS, and compile and launch applications. This meant that SUNMOS needed functionality in the kernel to accept the code of an application in the form of messages. The application loader yod ran on the front-end and communicated with SUNMOS on the nCUBE. Yod moved the application code from the front-end to SUNMOS, which then used a fan-out tree to distribute the executable to the rest of the sub-cube allocated to this run.

A large portion of the SUNMOS kernel was devoted to this capability. Whenever the binary format changed, for example from COFF to ELF, the application loader code in the kernel had to change. The kernel loader had to understand the format so it could place the image segments into the appropriate places in memory: the text segment needed to go into read-only memory, the BSS segment had to be initialized to zero, etc. After several such changes to that portion of the LWK, it became clear that it would be easier to test and integrate new versions of the application loader if the loader was moved to the PCT. While the location of the application loader changed, the basic principle has not. Even in Catamount, the PCT is still responsible for distributing and initializing the executable.


Front-end workstations are now obsolete. Partitioned architectures with service nodes integrated into the machine are commonplace. Most parallel computing platforms, especially commodity clusters, use specialized nodes that are typically similar in architecture to compute nodes, but with extra devices like Ethernet interfaces to communicate with the outside world. These nodes typically run full-featured operating systems like Linux to provide interactive login environments. Using the compute-node kernel or the PCT to distribute the executable has proven scalable and efficient, even in systems where other methods, such as loading through a network file system, are possible.
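The fan-out distribution used by the PCTs is simple to express. The sketch below assumes a fan-out of four and a hypothetical send_image() primitive; it is an illustration of the idea rather than the actual loader code.

    /* Sketch of the fan-out tree used to distribute the executable image:
     * yod hands the image to the primary PCT (rank 0), and each PCT forwards
     * it to up to FANOUT children.  The fan-out width and the send_image()
     * helper are illustrative assumptions. */

    #define FANOUT 4

    /* Assumed primitive: transfer the image to another node's PCT. */
    extern void send_image(int dest_rank, const void *image, unsigned long len);

    /* Called on each PCT after it has received the image. */
    static void distribute_image(int my_rank, int nranks,
                                 const void *image, unsigned long len)
    {
        for (int i = 1; i <= FANOUT; i++) {
            int child = my_rank * FANOUT + i;
            if (child < nranks)
                send_image(child, image, len);
        }
    }

With a fixed fan-out, the image reaches all nodes in a number of rounds that grows only logarithmically with the job size, instead of requiring yod to send it to every node itself.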

3. Performance Evaluation

To evaluate the performance of our lightweight kernels, we have performed a variety of micro-benchmark and application comparisons between each lightweight kernel and contemporaneous traditional HPC operating systems. We also show performance numbers collected by others that compare our LWKs with other operating systems. We include comparisons between Puma and OSF/1 on the Intel Paragon [13], between Cougar and Linux on ASCI Red [14], between Catamount and Cray's Compute Node Linux on Red Storm [15], and between Catamount and a modified version of Catamount with OS noise scaled up to simulate OS CPU overhead in other systems.

3.1. SUNMOS and OSF/1 on Intel Paragon

There are only a few published research papers that provide a direct comparison of SUNMOS and OSF/1 on the Intel Paragon. Only a few sites had large Paragon machines, and both the hardware and software environment of the Paragon were constantly evolving. We summarize the results of these papers here.

The most detailed performance comparisons were conducted by Saini and Simon [13, 16], who measured memory footprint, network performance, linear algebra library performance, and the performance of the NAS parallel benchmark suite [17]. In addition to numerical data, they also contrasted the user experience in interacting with the two different operating systems. They found that OSF consumed 8 MB of memory on each node, while SUNMOS consumed only 250 KB. Communication micro-benchmarks revealed that SUNMOS significantly outperformed OSF for both latency and bandwidth. OSF zero-byte latency was measured at 100 µs, while SUNMOS zero-byte latency was 70 µs. OSF was only able to achieve a peak bandwidth of 35 MB/s, while the peak bandwidth of SUNMOS was nearly 170 MB/s. The authors credited this higher network bandwidth with allowing SUNMOS to achieve significantly better scalability and performance on several application benchmarks. One particular benchmark, a 3D FFT code, did not scale beyond 32 processors with OSF, but scaled to the full machine size of 256 nodes with SUNMOS.

In addition to raw performance, the authors also pointed out the marked difference in determinism between OSF and SUNMOS. The reliance of OSF on demand-paged memory resulted in drastic performance differences between successive application runs.


Performance increases of up to 40% between the first and second run were observed with OSF, while SUNMOS performance was consistently high regardless of how many times the benchmarks were run.

In a similar study, Cook, Enbody, and Herland [18] also measured raw network performance using micro-benchmarks and compute performance using a standard parallel matrix multiplication application. They measured OSF zero-byte latency at 94 µs compared to 67 µs latency with SUNMOS. In their network bandwidth measurements, OSF peaked at 20 MB/s with a message size of nearly 1 MB. SUNMOS was able to achieve 20 MB/s at a message size of only 3 KB. SUNMOS was also able to outperform OSF on the matrix multiplication benchmark by nearly 56%. This study also examined the issue of network link contention. They observed that SUNMOS was able to easily saturate a 4x4 network submesh while OSF could not. Network link contention caused a saturation benchmark to slow down from the ideal of 82 seconds to 122 seconds. The same benchmark took OSF slightly less than 700 seconds.

Finally, Sandia also conducted an in-house comparison of network performance [19] between OSF and SUNMOS. For this particular Paragon, zero-byte latency was measured at 52 µs for OSF and 33 µs for SUNMOS. This version of SUNMOS also supported the use of a message co-processor, which improved the latency to 24 µs. Ping-pong bandwidth tests showed that SUNMOS achieved 100 MB/s at a message size of 64 KB and was able to peak at 160 MB/s. Again, OSF network bandwidth peaked at 30 MB/s for 1 MB messages. This study also observed non-deterministic performance using OSF, noting a significant performance difference in network latency: the latency of the first message fluctuated between a maximum of 12000 µs and a minimum of 101 µs.

3.2. Cougar and Linux on ASCI Red

Our experimental evaluations on ASCI Red compared Cougar with two versions of Linux ported to the ASCI Red CNIC: Linux-TCP, which used TCP/IP on top of the ASCI Red CNIC for message passing, and Linux-Portals, which used a port of the Portals 3.2 message passing interface to Linux. Both Linux implementations used a minimal Linux configuration with no unnecessary daemon processes or libraries, and all tests were run on a 128-node ASCI Red development system, where each node contained dual 333 MHz Pentium II Xeon processors.

3.2.1. Micro-Benchmarks

One of the key advantages of our lightweight kernels compared to commodity systems like Linux is that they were optimized for low-latency, high-bandwidth communication. To quantify the benefits of these optimizations, we ran simple ping-pong communication micro-benchmarks and application benchmark tests on Cougar, Linux-TCP, and Linux-Portals. These results, shown in Table II, include zero-byte message latency and maximum end-to-end large-message bandwidth. This maximum bandwidth is obtained with 128KB messages for Cougar and 16KB messages for Linux-TCP and Linux-Portals.

These results demonstrate the performance advantage of a lightweight kernel over commodity kernels, even when the commodity kernel is equipped with a highly-optimized message-passing system (the Linux-Portals variant).


Table II. Cougar and Linux Latency and Bandwidth Micro-Benchmarks on ASCI Red.

                      Latency (µs)    Bandwidth (MB/s)
                                      4KB/message    128KB/message
    Cougar                 14             168             308
    Linux-Portals          46             82.4            262
    Linux-TCP             100             45              45

In particular, the simple memory layout of our lightweight kernels described in Section 2.2.3 results in much lower memory management overheads during message transmission and reception in Cougar than in Linux. Specifically, Linux must walk page tables during address translation, both when sending DMA commands to the NIC and when updating network event structures in user memory. Cougar's linear page mapping strategy avoids these overheads.
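For reference, the style of ping-pong micro-benchmark used to produce numbers like those in Table II can be sketched in a few lines of MPI C. This is a generic illustration, not the exact benchmark code used in the study.

    /* Generic MPI ping-pong sketch of the latency micro-benchmark style used
     * above: rank 0 sends a message to rank 1 and waits for the echo; half
     * the average round-trip time approximates the one-way latency.
     * Run with two ranks.  This is an illustration, not the exact code used. */

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000;
        const int msg_size = 0;            /* 0 bytes for the latency test */
        char *buf = malloc(msg_size > 0 ? msg_size : 1);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msg_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }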

3.2.2. NAS Parallel Benchmarks

The performance of various NAS parallel benchmark codes [17] running under Cougar and Linux on an ASCI Red development system‡, shown in Figure 3, illustrates three interesting points in the comparison space. For each benchmark, the results are presented in total MOPS (millions of operations per second).

For CG (Figure 3(a)), Linux slightly outperforms Cougar on a small number of nodes, where communication is a less significant factor. This result was a bit of a mystery at first, since there seemed to be little reason for a lightweight kernel to perform worse than Linux; however, we quickly realized that the application was compiled with a much newer compiler on Linux than on Cougar. As the number of processors increases, however, scalability issues override the difference in single node performance, and Cougar demonstrates significantly better performance.

Figure 3(b) shows that Cougar and Linux performance is almost identical for MG with small numbers of processors. In this case, differences in the compiler are outweighed by advantages of the LWK approach. Although communications issues do not seem to limit MG on Linux at 16, 32, or 64 processors, Cougar begins to have a significant advantage at 128 processors.

Finally, Figure 3(c) illustrates that Cougar significantly outperforms Linux at small numbers of nodes when running IS. This application highlights two of the advantages of Cougar: physically contiguous virtual memory and advanced collective operations. With contiguous memory, there is drastically lower pressure on the processor's TLB. Under Linux, the performance effects of TLB thrashing are clear. This gap only widens as the number of processors increases because the collective operations (specifically MPI_Alltoall) implemented for ASCI Red were significantly better than the default collectives available in MPICH 1.2.5.

‡ Unfortunately, obtaining time for large-scale OS performance tests on production systems like ASCI Red, particularly for non-production operating systems like our Linux port to ASCI Red, is difficult. This limited our evaluations in this case to small numbers of nodes on a development system.


Figure 3. Cougar and Linux NAS Parallel Benchmark Version 2.4 Class B Performance on ASCI Red: (a) CG performance, (b) MG performance, and (c) IS performance. Each plot shows total MOPS versus number of processors for Cougar, Linux/TCP, and Linux/Portals (IS was run only under Cougar and Linux/TCP).


Figure 4. CTH Grind Time Performance (lower is better): grind time in microseconds versus number of processors for Cougar, Linux/TCP, and Linux/Portals.

3.2.3. CTH Application
Figure 4 shows the compute performance of Sandia's CTH application under the two operating systems. CTH [20] is a multi-material, large deformation, strong shock wave, solid mechanics code developed at Sandia. CTH has models for multi-phase, elastic viscoplastic, porous and explosive materials. It uses second-order accurate numerical methods to reduce dispersion and dissipation and to produce accurate, efficient results. CTH is used for studying armor/anti-armor interactions, warhead design, high explosive initiation physics, and weapons safety issues.

The results in Figure 4 are presented in total grind time, the amount of time to complete an internal CTH time step. This does not include CTH startup overhead, but does include communication time on multiprocessor runs. For this real application, Cougar has a significant performance advantage at one node and maintains that advantage out to several processors. Unlike some of the NAS benchmarks, the scalability difference is not dramatic because CTH is highly compute bound, spending 90% of its time computing.


3.3. Catamount and OS Noise

Following the initial problems encountered in scaling applications like SAGE on the Los Alamos ASCI Q system [6], a great deal of attention has been paid to the cost of running a full-featured operating system like Tru64, Linux, or AIX on high-end systems [14, 21]. Recent presentations by developers at Cray [15] have shown that their highly optimized Linux implementation for the Cray XT3 is competitive with our Catamount kernel on some applications, but can perform 10-30% worse on other applications.

To more fully quantify the performance advantage that our lightweight kernel designs present over those of full-featured operating systems on modern HPC hardware, we modified the Catamount OS to allow us to control the frequency and duration for which the operating system takes the CPU away from the application. Because of the low base overhead of Catamount as compared to operating systems like Linux, we can compare the performance of native Catamount versus an operating system with different noise signatures. It is important to note that while this framework does allow us to account for and control various noise parameters that affect application performance, it does not account for all differences between Catamount and other commodity operating systems (i.e., differences in the way the memory subsystem is configured). Since the structure of Catamount is optimized for the usage patterns of our applications, accounting for these differences would further benefit Catamount in our comparison.

The following experiments were performed using the Cray Red Storm [22] machine at Sandia. Red Storm is an XT3/4 series machine consisting of over 13,000 nodes, each of which runs the Catamount lightweight kernel along with our noise injection framework. Each compute node on Red Storm contains a 2.4 GHz dual-core AMD Opteron processor and between 2 and 4 GB of RAM. Each node also contains a Cray SeaStar network interface and high-speed router. The SeaStar is connected to the AMD Opteron processor via a HyperTransport link. The current generation SeaStar device is capable of sustaining a peak unidirectional injection bandwidth of more than 2 GB/s and a peak unidirectional link bandwidth of more than 3 GB/s.

Figure 5 shows the performance penalty of Catamount with two different Linux-like noise signatures. The results are presented in terms of performance degradation versus native Catamount, are averaged over 3 runs, and have a variance of less than 2% due to the predictable performance characteristics of Catamount. We illustrate this penalty for three representative applications: CTH, SAGE [23], and the Parallel Ocean Program (POP) [24].

Figure 5(a) illustrates application performance running with an idle Linux noise signature. This signature corresponds to the frequency and duration of a timer interrupt on a modern architecture. For CTH and SAGE, the impact of the noise remains nearly constant as the scale increases. Therefore, at these scales these applications are relatively insensitive to this noise signature. POP, on the other hand, performs dramatically worse as scale increases, by as much as 30% under these circumstances. This slowdown is due to the greater number of MPI collectives POP performs between each computation step compared to SAGE and CTH.

Figure 5(b) shows application run times when applications must compete for the CPU with a simulated intermittent asynchronous process, for example a kernel thread. In this configuration the kernel thread takes 2.5% of the CPU time.
Figure 5. Application Performance for two Linux-like Noise Signatures: (a) idle noise signature and (b) kernel thread noise signature. Each plot shows percent slowdown relative to native Catamount for POP, SAGE, and CTH as a function of scale.

This noise signature does not greatly affect the performance of CTH, again due to the fact that CTH spends over 90% of its time in computation. SAGE, on the other hand, slows by 50% at this scale from our 2.5% CPU time noise signature. Most impressive is the performance of POP. At this scale, POP slows by a factor of 20 against this noise signature. As in the idle signature cases, these performance impacts are caused by the number of MPI collectives performed by each of the applications. From this we see that POP and SAGE benefit greatly from our lightweight approach given this noise signature and scale.
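A noise-injection hook of the sort described above can be expressed very simply: at a chosen frequency, take the CPU away from the application for a chosen duration. The sketch below illustrates the idea under assumed names (now_us(), the timer hook, and the parameter structure); it is not the actual Catamount modification.

    /* Illustrative noise-injection hook: every 'period_us' of application
     * time, spin for 'duration_us' to emulate an OS daemon or interrupt
     * handler taking the CPU.  All names and the timer hook are assumed. */

    #include <stdint.h>

    struct noise_params {
        uint64_t period_us;    /* how often the interference fires        */
        uint64_t duration_us;  /* how long each interference event lasts  */
        uint64_t next_fire;    /* absolute time of the next event         */
    };

    extern uint64_t now_us(void);   /* assumed monotonic clock in microseconds */

    /* Called from an (assumed) periodic timer path while an application runs. */
    void noise_tick(struct noise_params *np)
    {
        uint64_t t = now_us();
        if (t < np->next_fire)
            return;                          /* not time to interfere yet */

        uint64_t end = t + np->duration_us;
        while (now_us() < end)
            ;                                /* burn CPU the app would have used */

        np->next_fire = end + np->period_us; /* schedule the next event */
    }

For example, a period of 1,000,000 µs with a duration of 25,000 µs yields the 2.5% CPU-time signature used for Figure 5(b).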

4. Experiences and Lessons Learned

Over the nearly twenty years of LWK development, we have accumulated a great deal of experience and learned many valuable lessons relating to the development, deployment, and productive use of system software. While many of these are not new (see, for example, Lampson's well-known paper [25]), we believe that certain ones are particularly relevant to OS development for modern capability computing systems.

Experience: LWKs enable heroic users to take heroic steps.
When we first began developing lightweight kernels, a handful of application scientists were beginning to port and develop applications for the newly-emerging distributed-memory supercomputers (e.g., the Intel Paragon) and the associated message-passing paradigm (e.g., MPI). Because these users required the full capabilities of the system for their scientific simulations, they were willing to work closely with us to understand the system and to optimize their codes. This experience drove us to push as much functionality into user space as possible while still preserving application isolation and meeting machine reliability demands (e.g., a node crash could bring down the high-speed mesh network on some systems).


The result of this tension was the QK/PCT split described in Section 2; essential isolation, reliability, and resource allocation mechanisms were protected, but everything else had to be built, in the form of libraries, in user space. Applications that had their own resource management strategies could opt out by not linking with the general libraries we provided. A number of key applications in fact did so, their developers working with us to optimize performance by, for example, overlapping communication and computation in the application when appropriate.

Lesson: Leave a path for non-hero users.
While capability applications and their developers drove machine acquisition and development of system software for these systems, a broader set of application scientists eventually desired and were permitted to use these systems, presenting additional challenges to system software development. In particular, these users were frequently not interested in or able to squeeze the last drop of performance out of a machine, and were willing to sacrifice parallel efficiency and scalability in order to shorten application development time.

The primary challenge these users presented to us was their desire for general-purpose system services like those provided by most desktop operating systems. TCP/IP socket services are frequently requested as an enhancement, for example, by application developers porting code that used TCP/IP for communication or for connecting from a compute node directly to their workstation to display graphics or debugging information. Another such request is to support dynamically loadable libraries, generally by users who need to run script interpreters with large, complex dynamic loading schemes, for example Python. However, application developer time can be precious, and so addressing these demands when they match the machine's usage model is desirable. Some of these requests require substantial development time (for instance, the construction of a user-level TCP/IP layer). Others require resource-heavy code to be ported to the operating system itself, compromising its capability computing mission.

For example, Cray's recent development of Compute Node Linux (CNL) [15] was motivated partly by the need to support such applications on XT-series systems. CNL performance results have been mixed: a number of important applications slow by at least 50% on CNL compared to their performance on Catamount, and remain at least 10% slower even after substantial CNL optimization effort. We are attempting to address these issues in our lightweight kernels, as described later in Section 6, by adding configurability features to the lightweight kernel which will allow it to be customized based on changing application, programming model, usage model, and hardware demands. Other systems that use lightweight kernels, for example the IBM BlueGene/L system, have addressed this difficulty by forwarding some difficult-to-handle requests to service nodes (e.g., socket communication), while forbidding others (e.g., fork()). Regardless, we have learned that providing a mechanism for handling such requests is important.

Experience: A small team can develop and maintain a custom OS kernel.
Advocates for the use of commodity operating systems for capability computing have claimed that lightweight kernels are difficult to develop and maintain. Our experience has been that developing and maintaining a custom OS kernel for high-end computing systems is practical.
In particular, we have found that lightweight kernels are easy to both port and extend. We note the contrast between our experience and that of a number of OS researchers who have recently described their difficulties in researching, developing, and maintaining a custom operating system [26]. The code base that comprises our lightweight kernel is small, its overall architecture is simple, and our hardware and application environment is highly specialized. The combination of these factors makes a lightweight kernel easy to port, extend, tune, and maintain. It can take a large number of developers to design a full-featured, general-purpose OS; Intel employed more than 100 people for OSF/1 AD development at its height, for example. However, this is not true for lightweight kernels that are specialized for a few machines and a small user community. In contrast to Intel's OSF/1 AD porting effort, fewer than 10 people were needed to port SUNMOS to the Intel Paragon, most of them part-time students. Similarly, Sandia used approximately four full-time employees to port Catamount to Red Storm.

5. Related Work

Over the past several decades, an extensive amount of research into lightweight operating systems has been conducted. Much of this research has not specifically targeted massively parallel processing machines and has not addressed many of the important performance and scalability problems that these platforms must overcome. The commodity supercomputing world has moved away from single, tightly integrated MPPs to large clusters of small- to medium-sized building blocks. As such, most large-scale general-purpose parallel computers on the market today run some variant of UNIX developed for the individual pieces. However, there are some exceptions in both the commercial and research space that address some of these issues for both general- and special-purpose machines.

The Cray T3 series of machines ran a microkernel version of the UNICOS operating system called UNICOS/mk [27]. UNICOS/mk is a distributed OS that provides single-system-image capability to the machine through a three-level architecture. At the lowest level, the microkernel running on the compute processors can satisfy OS requests by contacting a local server. The microkernel can also satisfy requests for global system information by contacting a global server running on dedicated processors throughout the system. OS services are not replicated on every processor, but are distributed throughout the machine. This distributed approach frees up much of the local memory on a processor; early versions of UNICOS/mk used less than 4 MB of local memory on a processor.

The approach of UNICOS/mk is similar to that of Puma. However, rather than having OS requests satisfied by servers distributed throughout the machine, Puma pushes this functionality up to user-level libraries. System calls are satisfied via a remote procedure call mechanism with the application launcher (a sketch of this forwarding appears at the end of this section). This strategy allows OS functionality to be added easily without modifying fundamental parts of the underlying operating system.

In terms of general-purpose research operating systems, IBM Research, in collaboration with the University of Toronto, has developed a lightweight operating system as part of the K42 project that uses a building-block approach to achieve modularity and customizability [28, 29]. The project is aimed at developing a high-performance general-purpose OS for large-scale cache-coherent NUMA multiprocessor systems. Since the design initially targeted multiprocessor systems, it is able to service OS requests on the same processor on which they are issued and is able to handle requests to various system resources without acquiring common locks. This spatial locality results in substantial performance advantages over typical multiprocessor OSs. The OS is designed to scale to hundreds of processors to support large commercial applications such as databases and web servers. A key feature of the OS is the ability for applications to customize and optimize required OS services. The design allows the OS implementers to exploit specific architectural features on a given system in order to maximize performance.

There has been research into lightweight operating systems for special-purpose MPPs as well. The QCDSP MPP [30, 31] developed by Columbia University was designed specifically to solve computations for quantum chromodynamics. The machine is a collection of digital signal processors (DSPs) connected by a custom communications and memory controller chip. The lightweight kernel that runs on the majority of the DSPs in the system can service a fundamental set of requests, and more complex requests are handled by a front-end machine through a standard UNIX I/O system call interface. The kernel does not provide memory protection and the machine has a very limited runtime environment. The entire machine is space shared: only one application may be running on any of the nodes at a given time, but the entire machine can be scheduled to run different processes.

More recently, IBM developed a custom lightweight compute node kernel (CNK) for their Blue Gene/L series [32] of massively parallel systems. IBM's CNK has much in common with our lightweight kernels, from both design and implementation perspectives. Finally, principles of lightweight design have been adopted by other, non-kernel development projects, such as the Lightweight File System project at Sandia National Laboratories [33].
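To make the forwarding mechanism mentioned above concrete, the following sketch shows, in C, how a user-level library might satisfy an open() call by shipping it to the application launcher. The message layout, the msg_send()/msg_recv() primitives, and the constants are hypothetical names introduced only for illustration; they are not the actual Puma library or Portals interfaces.

    /* Hypothetical sketch of Puma-style system call forwarding: a user-level
     * library packages an open() request, ships it to the application
     * launcher, and waits for the reply.  The message layout, msg_send(),
     * msg_recv(), and the constants are illustrative assumptions, not the
     * actual Puma or Portals interfaces. */
    #include <string.h>

    struct open_request { int tag; int flags; char path[256]; };
    struct open_reply   { int result; int error; };

    enum { LAUNCHER_NODE = 0, TAG_OPEN = 7 };

    /* Assumed user-level message primitives (e.g. built on Portals). */
    int msg_send(int node, const void *buf, unsigned len);
    int msg_recv(int node, void *buf, unsigned len);

    int forwarded_open(const char *path, int flags)
    {
        struct open_request req = { .tag = TAG_OPEN, .flags = flags };
        struct open_reply   rep;

        strncpy(req.path, path, sizeof(req.path) - 1);

        /* The quintessential kernel never sees this request: the library
         * forwards it, and the launcher performs the real open() on a node
         * running a full OS and returns the result. */
        msg_send(LAUNCHER_NODE, &req, sizeof(req));
        msg_recv(LAUNCHER_NODE, &rep, sizeof(rep));
        return rep.result;
    }

The important property illustrated here is that the compute node kernel stays out of the path entirely; adding a new service amounts to defining a new request type that the launcher understands.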

6. Future Directions

In this section, we discuss several topics that we consider important for the future of lightweight kernel design and implementation.

6.1. Configurable Lightweight Kernels

Lightweight kernels have the advantage of being specifically constructed for particular hardware. However, as we have described, supporting new application features, system services, and hardware can be difficult. To address these problems, we are exploring the use of component-based software engineering techniques to make lightweight kernel construction less complex. These techniques will allow us to configure lightweight kernels according to application needs, hardware characteristics, and programming model requirements. We briefly describe some of the design issues for configurable lightweight kernels in this section (see [34] for an expanded discussion).

We are basing our design activity on the Fractal component model [35] and its associated Think framework [36, 37]. Think provides a component library, component definition and extension interfaces, and a runtime engine. However, several design and implementation challenges remain before this framework will satisfy all our goals for configurable lightweight kernels. Think currently does not support run-time configuration, provides only a function-call pattern for intercomponent communication, and has a limited set of components for addressing capability computing needs. We are addressing these problems by defining new component subtrees (such as memory management), by designing run-time interface definition and introspection facilities, and by extending the component communication facilities to include asynchronous (event-based) communication. For example, we have factored the Puma QK/PCT design into components with event-based, asynchronous interfaces to explore the resulting implications on LWK-based applications.
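As a purely illustrative example of the style of interface such a factoring produces, the sketch below (in C) defines a memory-management component that exposes both a conventional function-call interface and an asynchronous, event-based one. The structure and names are assumptions made for this sketch; they do not correspond to the Fractal model or the Think framework.

    /* Hypothetical sketch of an LWK component with both a function-call and
     * an asynchronous, event-based interface.  Names are assumptions for
     * illustration and do not correspond to the Fractal or Think APIs. */
    #include <stddef.h>

    enum { MEM_EVT_LOW_PAGES = 1, MEM_EVT_REGION_FREED = 2 };

    typedef void (*mem_event_cb)(int event, void *arg);

    struct mem_component {
        /* synchronous (function-call) interface */
        void *(*alloc_region)(struct mem_component *self, size_t bytes);
        void  (*free_region) (struct mem_component *self, void *region);
        /* asynchronous (event-based) interface */
        int   (*subscribe)   (struct mem_component *self, int event,
                              mem_event_cb cb, void *arg);
    };

    /* A client component reacting to a low-memory event raised by the
     * memory-management component. */
    static int low_pages_events;
    static void on_low_pages(int event, void *arg)
    {
        (void)event; (void)arg;
        low_pages_events++;   /* a real client might shrink its buffer pools here */
    }

    int bind_memory_client(struct mem_component *mem)
    {
        /* In a configurable kernel, bindings such as this one would be
         * generated from an architecture description at configuration time
         * rather than written by hand. */
        return mem->subscribe(mem, MEM_EVT_LOW_PAGES, on_low_pages, NULL);
    }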

6.2. Virtualization

We are also exploring virtualization as a means of supporting different kinds of operating system functionality on top of lightweight kernels. Several commodity processor vendors have recently implemented extended hardware support for virtual machines, and there has been a resurgence of virtualization research in the OS community. We are exploring the changes necessary to allow the Catamount lightweight kernel to act as a virtual machine monitor (VMM) or hypervisor. There are several motivations and possible benefits of this work. First, the design of Catamount is already very much like a VMM. As shown in Figure 2 on page 5, the QK provides the hardware abstraction layer that is typical of most VMMs, while the PCT provides the traditional services provided by a host OS. We believe that minimal extensions are needed to the QK and the PCT to support hosting a guest OS.

Unlike traditional VMMs, we are not focused on providing the support necessary for running several guest operating systems at once. Rather, we expect that in order to maintain performance, we will only run a single guest OS; Linux is our current target. Our approach is to replicate the interfaces of Linux's Kernel Virtual Machine in Catamount so that we can host a Linux guest OS without modification to Linux itself. This approach allows us to explore virtualization specifically for HPC environments. Using Catamount as a VMM ensures that a lightweight kernel is always available on the compute nodes to run high-performance applications tailored for a minimal OS environment. For applications that need a more full-service OS environment, a Linux guest OS can be booted when the parallel application is launched.

This approach may also have several other benefits. For example, applications may run across a mixture of nodes running both Catamount and Linux. The system administrators are no longer required to set aside a fixed number of Catamount nodes as compute nodes and a fixed number of service nodes running Linux. The user can determine at job launch time the breakdown that is most suitable, without rebooting nodes.

There are also some potential benefits beyond application functionality. We are exploring whether the PCT could provide a /proc interface to the Linux guest OS to allow standard Linux tools, such as debuggers, to be used with Catamount processes. One of the more burdensome aspects of our lightweight kernel environment is the specialization needed to support a variety of performance tools and debuggers. For example, Cougar's PCT was designed specifically to work with Intel's Parallel Debugger, but the requirement for Catamount was to support the TotalView parallel debugger. Providing a /proc interface could eliminate the need for the PCT to interface with specific debuggers. Another possible optimization is to provide a way to expose the physically contiguous memory map in Catamount to processes running under the Linux guest OS. By doing this, Linux processes may be able to take advantage of the increased network performance available in the Accelerated Portals implementation for the SeaStar.

Finally, virtualization can potentially provide several benefits for supporting fault tolerance capabilities. There are several ongoing research projects in high-performance computing that are exploring virtualization as a means of avoiding or reacting to machine errors that cause parallel jobs to fail. The ability to capture the state of a virtual machine and restore it without any application involvement is a promising approach for dealing with the reliability issues that petascale machines will face.
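For readers unfamiliar with the interface we intend to replicate, the sketch below shows the minimal ioctl() sequence a user-space monitor performs against Linux's stock /dev/kvm device to create and run a guest. This is the surface a Catamount-based VMM would need to present; it is not code taken from Catamount, and error handling and guest setup are omitted.

    /* Minimal use of the stock Linux /dev/kvm ioctl interface (error handling
     * and guest setup omitted).  This is the interface the text proposes to
     * replicate inside Catamount; it is not Catamount code. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int run_tiny_guest(void *guest_mem, size_t mem_size)
    {
        int kvm = open("/dev/kvm", O_RDWR);        /* the device a VMM talks to  */
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);    /* one fd per virtual machine */

        /* Hand the guest a region of page-aligned host memory as its RAM. */
        struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0,
            .memory_size     = mem_size,
            .userspace_addr  = (unsigned long)guest_mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu     = ioctl(vm, KVM_CREATE_VCPU, 0);
        int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* Register state and the guest image would be loaded here. */
        ioctl(vcpu, KVM_RUN, 0);                   /* enter the guest until it exits */
        return run->exit_reason;                   /* e.g. KVM_EXIT_HLT, KVM_EXIT_IO */
    }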

6.3. Multiple Cores and Other Architectural Changes

The advent of multi-core processors and heterogeneous processors is changing the landscape of high-performance computing. As the number of cores per processor continues to increase, so do the challenges and complexities of continuing to deliver performance and scalability. From the operating system perspective, we view lightweight kernels as a necessary vehicle for exploring basic operating system functionalities that are traditionally untenable in a monolithic full-service operating system. Studies have already shown that applications running on multi-core processors are much more sensitive to the allocation and placement of resources, especially with regard to the memory subsystem. Operating systems will have to be much more aware of the memory hierarchy: which cores share a socket, which cores share which levels of cache, and whether cores have special high-speed scratch memory.

One of the initial approaches that we are exploring in Catamount is to provide a virtual memory map that allows all of the processes in the same parallel job on a node to easily access each other's memory. This approach maintains the independence of traditional processes, but provides the shared memory capability of threads without the inherent complexity. This is ideal for a large number of applications at Sandia that do not support the mixed-mode programming model using MPI and threads or MPI and OpenMP. Application developers can still run a traditional MPI application, but can select portions of the application to use shared-memory-style data accesses.
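A minimal sketch of how such a mapping might appear to an application is shown below, under the assumption that every process in the job sees its peers' memory at a fixed per-rank offset in its own virtual address space. RANK_STRIDE and remote_addr() are hypothetical names for this illustration, not an existing Catamount interface, and synchronization is left to the application.

    /* Hypothetical sketch: direct load/store access to another process's
     * memory when all processes of a job share one virtual memory map.
     * RANK_STRIDE and remote_addr() are illustrative assumptions, not an
     * existing Catamount interface. */
    #include <stdint.h>

    #define RANK_STRIDE (1ULL << 39)   /* assumed: each on-node rank's image is
                                          visible at a fixed virtual offset */

    static inline void *remote_addr(int local_rank, void *my_ptr)
    {
        /* The same object in on-node rank `local_rank` is assumed to appear
         * at a fixed offset from its address in the calling process. */
        return (void *)((uintptr_t)my_ptr + (uintptr_t)local_rank * RANK_STRIDE);
    }

    double sum_neighbor_halo(int neighbor_rank, double *my_halo, int n)
    {
        /* Instead of exchanging an MPI message, read the neighbour's halo in
         * place.  Synchronization (e.g. a barrier) remains the application's
         * responsibility. */
        double *theirs = remote_addr(neighbor_rank, my_halo);
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += theirs[i];
        return sum;
    }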

6.4. Applications

Hardware architecture is not the only aspect of high-performance computing that continues to evolve and impact lightweight kernel research. The applications designed for today's large-scale parallel computing platforms are much different from the applications for which the original lightweight kernel was designed. In the early 1990s there were relatively few applications that could effectively utilize the full capability of a large-scale machine, and specialization in HPC was commonplace. HPC platform vendors used custom hardware and custom software environments in an effort to differentiate themselves from the competition. Application performance was the ultimate quest, and other features, such as portability, were secondary.

This perspective began to change in the mid-1990s with the widespread adoption of MPI as the de facto standard parallel application programming interface. The expectation of portability, and, to some extent, performance portability, meant that vendors could no longer try to stand apart using custom parallel languages, programming interfaces, or functionality. This trend continued into the late 1990s as the acceptance of Linux in the HPC community created a de facto standard application development environment. Application developers no longer needed access to an expensive parallel computer to develop codes. They could develop applications on the desktop and run them on a large parallel computer that featured the same interfaces and libraries. The ubiquity of Linux clusters in science and engineering has led to a rapid increase in the number of applications. Parallel application development is no longer just an activity of research centers and national laboratories. Today's parallel application developers expect to be able to develop applications on the desktop and run them on a wide variety of machines with virtually no modifications. To a large extent, the desire for ultimate performance has taken a back seat to portability and other application development activities. This is most evident in the recent DARPA High Productivity Computing Systems program, the goal of which was to focus HPC vendors on providing platforms that emphasize application developer productivity over application performance.

Applications are also becoming inherently more complex. Fifteen years ago, modeling and simulation codes were typically written in FORTRAN 77, and each was focused on a particular area, such as fluid flow, heat transfer, or radiation transport. These applications have evolved into complex C++ frameworks that tie together multiple aspects of simulation. The "multi" prefix dominates their description: multi-dimensional, multi-phase, multi-material, multi-physics. Support libraries have also become an important component of many applications. Complex solver libraries that provide a variety of different methods and approaches are in widespread use. In many cases, the configuration of the application is not known until the code is running and can change dynamically during the run.

All of these factors are influencing our approach to lightweight kernel design and development. Moving our focus away from ultimate application performance calls into question several fundamental lightweight kernel design decisions. The need for portability and support for Linux system calls and programming interfaces can lead to inherent scalability issues that are not easily overcome. Supporting functionality such as dynamic linking and loading can add significant complexity to a lightweight kernel. The needs of applications and application development environments will have a much greater impact on the design of future lightweight kernels than those of the past.

7. Conclusion

Lightweight kernels have been a key element in the success of high-performance computing at Sandia National Laboratories over the past two decades, enabling scalability to unprecedented levels. These kernels have, for example, allowed Sandia's high-end computing systems to avoid the OS interference problems that have plagued other kernels in recent years. This robustness has also led other leading supercomputer companies to develop lightweight kernels for their systems, for example the CNK kernel for IBM Blue Gene systems.


Key to the success of these lightweight kernels has been the overall philosophy of OS specialization along with the specific principles of user-level resource management, predictable performance, and scalable system services. For example, developing a specialized OS has allowed a small team of developers to address Sandia’s HPC system software issues, while other groups that have tried to either customize an existing OS or build a general-purpose OS that can meet HPC demands have struggled [26]. Similarly, having clear principles guiding the design and implementation of our kernels has allowed them to evolve with changing hardware and software demands over the years, while still maintaining leading-edge performance.

Acknowledgments

The following people have contributed in various ways to the projects described in this paper: Miguel Alvarez, Al Audette, Ed Barsis, Eric Barton, Peter Braam, Bob Benner, Bill Camp, Doug Cline, Dan Davenport, Sudip Dosanjh, Anthony Ferrara, Lee Ann Fisk, David Greenberg, Art Hale, Eric Hoffman, Kenneth Ingham, Gabi Istrail, Jeanette Johnston, Chu Jong, Clint Kaul, Sue Kelly, Lisa Kennicott, Mike Levenhagen, Kevin McCurley, Lance Mumma, Jim Otto, Kevin Pedretti, Paul Pierce, Steve Plimpton, Neil Pundit, Heather Richards, David Robboy, Brian Sanchez, Doug Sanchez, Mark Sears, Lance Shuler, Jim Schutt, John Shadid, Mack Stallcup, Judy Sturtevant, Jim Tomkins, Todd Underwood, David van Dresser, Noah van Dresser, Jeff Vandyke, John Vandyke, Bob van Sant, Dena Vigil, Lee Ward, Stephen Wheat, and David Womble. We have tried to be as comprehensive as memory allows in our listing of the people involved in these projects; any omissions are unintentional and regretted.

REFERENCES

1. Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: Effective kernel support for user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–70, 1992.
2. Stephen R. Wheat, Arthur B. Maccabe, Rolf Riesen, David W. van Dresser, and T. Mack Stallcup. PUMA: An operating system for massively parallel systems. Scientific Programming, 3:275–288, 1994.
3. Lance Shuler, Chu Jong, Rolf Riesen, David van Dresser, Arthur B. Maccabe, Lee Ann Fisk, and T. Mack Stallcup. The Puma operating system for massively parallel computers. In Proceedings of the 1995 Intel Supercomputer User's Group Conference. Intel Supercomputer User's Group, 1995.
4. Arthur B. Maccabe, Rolf Riesen, and David W. van Dresser. Dynamic processor modes in Puma. Bulletin of the Technical Committee on Operating Systems and Application Environments (TCOS), 8(2):4–12, 1996.
5. Harish Nag, Roberta Gotfried, David S. Greenberg, Chulsoo Kim, Arthur B. Maccabe, T. Mack Stallcup, Glenn Ladd, Lance Shuler, Stephen R. Wheat, and David van Dresser. PROSE: Parallel real-time operating system for secure environments. In Proceedings of the Intel Supercomputer Users Group Conference, June 1996.
6. Fabrizio Petrini, Darren Kerbyson, and Scott Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of SC'03, 2003.
7. Ron Brightwell, Arthur B. Maccabe, and Rolf Riesen. Design and implementation of MPI on Portals 3.0. In Dieter Kranzlmüller, Peter Kacsuk, Jack Dongarra, and Jens Volkert, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface: 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29 – October 2, 2002, Proceedings, volume 2474 of Lecture Notes in Computer Science, pages 331–340. Springer-Verlag, September 2002.
8. Ron Brightwell, Tramm Hudson, Rolf Riesen, and Arthur B. Maccabe. The Portals 3.0 message passing interface. Technical report SAND99-2959, Sandia National Laboratories, December 1999.
9. Ron Brightwell, William Lawry, Arthur B. Maccabe, and Rolf Riesen. Portals 3.0: Protocol building blocks for low overhead communication. In Workshop on Communication Architecture for Clusters (CAC'02), page 164, Fort Lauderdale, Florida, 2002. IEEE.
10. Ron Brightwell, Mike Levenhagen, Arthur B. Maccabe, and Rolf Riesen. A performance comparison of Myrinet protocol stacks. In Proceedings of the Third Linux Clusters Institute Conference on Linux Clusters, 2002.
11. Ron Brightwell, Trammell Hudson, Kevin T. Pedretti, and Keith D. Underwood. SeaStar Interconnect: Balanced bandwidth for scalable performance. IEEE Micro, 26(3), May/June 2006.
12. Rolf Riesen, Ron Brightwell, Arthur B. Maccabe, Trammell Hudson, and Kevin Pedretti. The Portals 3.3 message passing interface: Document revision 2.0. Technical report SAND2006-0420, Sandia National Laboratories, January 2006.
13. Subhash Saini and Horst D. Simon. Applications performance under OSF/1 AD and SUNMOS on Intel Paragon XP/S-15. In Supercomputing '94: Proceedings of the 1994 Conference on Supercomputing, pages 580–589, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.
14. Ron Brightwell, Rolf Riesen, Keith Underwood, Patrick G. Bridges, Arthur B. Maccabe, and Trammell Hudson. A performance comparison of Linux and a lightweight kernel. In Proceedings of the 2003 IEEE International Conference on Cluster Computing (Cluster 2003), December 2003.
15. David Wallace. Compute Node Linux: Overview, progress to date, and roadmap. In Proceedings of the 2007 Cray User Group Annual Technical Conference, May 2007.
16. Subhash Saini and Horst D. Simon. Applications performance on Intel Paragon XP/S-15. In High-Performance Computing and Networking, volume 797 of Lecture Notes in Computer Science, pages 493–498, 1994.
17. D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, Moffett Field, CA, 1994.
18. Jeremy Cook, Richard Enbody, and Bjarne G. Herland. SUNMOS on the Intel Paragon: Evaluation and early experience. Technical Report pl-jc-94.0, University of Bergen, Department of Informatics, March 1994.
19. David Greenberg, Barney Maccabe, Kevin S. McCurley, Rolf Riesen, and Stephen Wheat. Communication on the Paragon. In Proceedings of the Intel Supercomputer Users' Group 1993 Annual North America Users' Conference, pages 117–124, 1993.
20. E. S. Hertel, Jr., R. L. Bell, M. G. Elrick, A. V. Farnsworth, G. I. Kerley, J. M. McGlaun, S. V. Petney, S. A. Silling, P. A. Taylor, and L. Yarrington. CTH: A software family for multi-dimensional shock physics analysis. In Proceedings of the 19th International Symposium on Shock Waves, held at Marseille, France, pages 377–382, July 1993.
21. Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, and Scott Kirkpatrick. System noise, OS clock ticks, and fine-grained parallel applications. In ACM International Conference on Supercomputing, June 2005.
22. William J. Camp and James L. Tomkins. Thor's hammer: The first version of the Red Storm MPP architecture. In Proceedings of the SC 2002 Conference on High Performance Networking and Computing, Baltimore, MD, November 2002.
23. D. J. Kerbyson, H. J. Alme, Adolfy Hoisie, Fabrizio Petrini, H. J. Wasserman, and M. Gittings. Predictive performance and scalability modeling of a large-scale application. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, pages 37–48, Denver, CO, 2001. ACM Press.
24. Darren J. Kerbyson and Philip W. Jones. A performance model of the parallel ocean program. Int. J. High Perform. Comput. Appl., 19(3):261–276, 2005.
25. Butler W. Lampson. Hints on computer system design. In Proceedings of the 9th ACM Symposium on Operating Systems Principles, pages 33–48, 1983.
26. Robert W. Wisniewski, Dilma da Silva, Marc Auslander, Orran Krieger, Michal Ostrowski, and Bryan Rosenburg. K42: Lessons for the OS community. SIGOPS Oper. Syst. Rev., 42(1):5–12, 2008.
27. Cray Research, Inc. Cray Research MPP Software Guide, SG-2508 edition, January 1994.
28. M. Auslander, H. Franke, Orran Krieger, B. Gamsa, and M. Stumm. Customization-lite. In Proceedings of the 6th Workshop on Hot Topics in Operating Systems (HotOS-VI), pages 43–48, 1997.
29. http://www.research.ibm.com/K42.


30. Dong Chen. QCDSP: A teraflop scale massively parallel supercomputer. In Supercomputing '97, November 1997.
31. http://www.eecg.toronto.edu/parallel/tornado.html.
32. Jose Moreira, Michael Brutman, Jose Castanos, Tom Gooding, Todd Inglett, Derek Lieber, Pat McCarthy, Mike Mundy, Jeff Parker, Brian Wallenfelt, Mark Giampapa, Thomas Engelsiepen, and Roger Haskin. Designing a highly-scalable operating system: The Blue Gene/L story. In Proceedings of the 2006 ACM/IEEE International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC'06), November 2006.
33. Ron A. Oldfield, Arthur B. Maccabe, Sarala Arunagiri, Todd Kordenbrock, Rolf Riesen, Lee Ward, and Patrick Widener. Lightweight I/O for scientific applications. In Proceedings of the 2006 IEEE Conference on Cluster Computing, Barcelona, Spain, September 2006.
34. Jean-Charles Tournier, Patrick G. Bridges, Arthur B. Maccabe, Patrick M. Widener, Zaid Abudayyeh, Ron Brightwell, Rolf Riesen, and Trammell Hudson. Towards a framework for dedicated operating systems development in high-end computing. Operating Systems Review: Special Issue on System Software for High-End Computing Systems, 40(2):16–21, April 2006.
35. E. Bruneton, T. Coupaye, M. Leclerc, V. Quema, and J.-B. Stefani. An open component model and its support in Java. In 7th Int. Symp. CBSE, 2004.
36. J.-F. Fassino, J.-B. Stefani, J. Lawall, and G. Muller. Think: A software framework for component-based operating system kernels. In USENIX 2002 Annual Conference, 2002.
37. J.-C. Tournier and J.-P. Fassino. The Think components-based operating system. In Hermes Science and Lavoisier Company, editors, Model Driven Engineering for Distributed Real-time Embedded Systems. International Scientific and Technical Encyclopedia, 2005.
