A New Generation of Cluster Interconnect

Lawrence C. Stewart and David Gingold

December 2006

The SiCortex family of Linux® cluster systems takes High Performance Technical Computing (HPTC) a step beyond conventional clusters. SiCortex concentrates on power-efficient design and simultaneous tuning of silicon, microcode, and system software to deliver outstanding application performance per dollar, per watt, and per square foot. The Company’s initial product offering includes:

•The SC5832, which is a 5.8 Teraflop system with up to 8 Terabytes of memory. The SC5832 fits into a single cabinet and draws 18 kW.

•The SC648, which is a 648 Gigaflop system with up to 864 Gigabytes of memory. Two SC648 systems fit in a single 19” rack with room to spare. A single SC648 system draws 2 kW.

Abstract

This paper describes the SiCortex interconnect fabric, its software interface, and the communication software, including the Message Passing Interface (MPI), that uses it.

Introduction

The cluster interconnect for the SiCortex systems brings together a number of technologies and innovations, all designed to provide high-bandwidth, low-latency communication in a system with high reliability and low cost. The SiCortex cluster interconnect consists of three components: fabric links that provide direct connections between nodes; a fabric switch for routing traffic within each node; and a DMA Engine that serves as the software interface to the interconnect. All of these components are integrated within the SiCortex node chip. Key features of the interconnect are:

•The fabric links provide multiple 2-gigabyte/second direct connections between nodes, using no external cables or switches.

•The fabric’s Kautz topology provides low network diameter and fault-tolerant routing.

•The fabric’s DMA Engine is cache-coherent with the processors, so that message data need not pass through main memory.

•Application software initiates fabric messaging directly, without OS intervention.

•The fabric hardware delivers messages to applications reliably and in-order.

•The fabric provides for remote DMA operations to and from virtual memory, using a low-overhead memory registration scheme that requires no page pinning.

SiCortex System Architecture

A SiCortex SC5832 system is composed of 972 six-way SMP compute nodes connected by the interconnect fabric. Each node consists physically of a single SiCortex node chip and two industry-standard DDR2 memory modules. The node chip, shown in Figure 1, contains six 64-bit processors, their L1 and L2 caches, two memory controllers (one for each memory module), the interconnect fabric components, and a PCI Express® (PCIe®) interface. The PCIe controller is used for external I/O devices, not for the fabric itself.

The nodes in the SiCortex system are connected into a degree-3 directed Kautz network, which provides fault-tolerant features and a low network diameter for a large number of nodes. Physically, 27 node chips and their associated memory DIMMs are packaged on a single board, called a module. Of the 27 nodes on a module, three have their PCIe busses connected to PCI EXPRESSMODULE™ slots, and a fourth is attached to an on-board PCIe dual gigabit-Ethernet controller. The PCIe interfaces are disabled on the other nodes.


FIGURE 1. SiCortex Node (block diagram: six 64-bit MIPS CPUs with their L1 caches, a coherent L2 cache, the DMA Engine, the fabric switch with fabric links to and from other nodes, two DDR-2 controllers with their DDR-2 DIMMs, and a PCI Express controller for external I/O)

For more detailed information about the SiCortex systems, see the “SiCortex Technical Summary” at www.sicortex.com.

Interconnect Hardware

Within the node chip, the interconnect consists of three components: the DMA Engine, the fabric switch, and the fabric links. The DMA Engine connects the memory system to the fabric switch, and implements the processors’ software interface to the fabric. The fabric switch forwards traffic between incoming and outgoing links, and to and from the DMA Engine. The fabric links, three receivers and three transmitters per node, connect directly to other nodes in the system.

The fabric switches and links implement a packet-based network using source-routed packets. The system protects all packets with a combination of ECC and CRC, so that the fabric as a whole provides in-order and reliable transmission. All of these features are implemented in hardware.

The DMA Engine

The fabric switches and links are connected to the node’s processors through a microcode-driven DMA Engine. The DMA Engine implements the interconnect’s software interface, translating user requests into packet streams. It is connected to the memory system at the cache level so that all of its activities are coherent with respect to the processor cores.


Application messages do not need to pass through main memory during their lifetime, which substantially reduces message latency.

The DMA Engine microcode runs at 250 MHz, interleaving instructions from two of ten hardware threads. Six threads service the DMA’s transmit and receive ports, two service a copy port used for on-node communication, one responds to processor I/O cycles, and one manages the DMA’s queues and scheduling.

On the processor side, the DMA Engine communicates with the L2 caches and memory controllers using a hardware interface that allows for up to four 64-byte reads and four 64-byte writes outstanding to the memory system. On the fabric side, the DMA Engine passes packet streams to and from the switch via three transmit and three receive ports.

The Fabric Switch

Each node includes a fabric switch, shown in Figure 2. The fabric switch is a buffered 4x4 crossbar switch that connects the three link receivers and the DMA Engine transmit ports to the three link transmitters and the DMA Engine receive ports.

FIGURE 2. The Fabric Switch (4x4 crossbar connecting the fabric receive ports and the DMA Engine transmit ports to the fabric transmit ports and the DMA Engine receive ports, with store-and-forward packet buffers and replay buffers)

Each incoming link has a dedicated crosspoint and path to the DMA Engine, letting the local node receive on all ports simultaneously. Similarly, the DMA Engine has independent paths to dedicated crosspoints for each transmit link.


Each crosspoint contains 16 full-packet buffers with ECC. The switch implements a virtual channel cut-through router. Cut-through allows packets to pass through the switch with minimal delay, and the virtual channel implementation prevents deadlock.

The Fabric Links

The fabric links in the SiCortex systems are built from multiple lanes of 2 Gb/s SerDes PHYs. The forward (data) channel has eight lanes in parallel, while the reverse (control) channel has one lane. Thus, the raw link data rate is 2 GB/s. The links use a standard 8b/10b code for DC balance on each lane. This implementation leads to a minimum link data unit of 64 bits. Non-data codes on lane 0 indicate the start and end of packets.

Because the whole machine is frequency-locked to a master clock, the links need no elastic buffers. Each lane has a phase-locked loop for data recovery. The link as a whole has a framing function that delivers aligned 64-bit words to the fabric switch.

FIGURE 3. Fabric Packet (64-bit words: a header word with route and hardware control fields, an RDMA control word, payload words 0 through 15, and a trailer word with destination ID, type, and CRC-32)

Packets in the fabric (see Figure 3) are composed of a header word for routing, an optional control word for RDMA, up to 16 payload words, and a trailer word containing type and CRC information. Packets have a maximum length of 19 64-bit words, carrying a maximum of 128 bytes of payload.

Each link provides CRC-32 error detection, flow control, and retransmission from an ECC-protected replay buffer. In the event of an error, the link discards packets until a successful replay occurs, preserving the in-order behavior of the link.
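The layout can be sketched in C as follows. The field names and groupings here are ours for illustration; the hardware's actual bit assignments are not given in this paper, and a packet on the wire omits unused words rather than padding to a fixed length:

    #include <stdint.h>

    #define FABRIC_MAX_PAYLOAD_WORDS 16            /* 16 x 8 bytes = 128-byte payload */

    /* One fabric packet: a routing/control header word, an optional RDMA
     * control word, up to 16 payload words, and a trailer word carrying the
     * destination ID, packet type, and CRC-32 -- at most 19 words in all. */
    struct fabric_packet {
        uint64_t header;                            /* source route + hardware control */
        uint64_t rdma_control;                      /* present only on RDMA packets */
        uint64_t payload[FABRIC_MAX_PAYLOAD_WORDS]; /* message data; 0..16 words used */
        uint64_t trailer;                           /* destination ID, type, CRC-32 */
    };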

The Software Interface

The DMA Engine implements the software interface that lets applications use the fabric. This interface appears to user-level code as a set of in-memory data structures shared with the DMA Engine, as well as a small set of memory-mapped registers.¹


The bulk of an application’s interaction with the DMA Engine happens simply by the processor and DMA Engine reading and writing memory. The set of registers and data structures that an application shares with the DMA Engine is known as a DMA context. The DMA Engine implements 14 such contexts per node. The data structures include:

•A command queue (CQ), a circular memory buffer which the processor writes, providing commands for the DMA Engine to execute.

•An event queue (EQ), a circular memory buffer which the DMA Engine writes, delivering received short messages and events indicating RDMA completion.

•A heap, where the processor can place command chains, and to which the DMA can write additional messages.

•A route descriptor table (RDT), whose indices serve as handles specifying routes through the fabric to remote DMA contexts.

•A buffer descriptor table (BDT), whose indices serve as handles specifying pages in the user’s virtual memory.

In order to enforce proper OS protection (see OS Protection and Security below), only the kernel may write a process’s BDT and RDT structures.

DMA Primitives

Software issues DMA commands by writing them to the CQ and then writing an I/O register to report that new commands are present. The available commands include:

•send-event: Deliver data in a single packet to the EQ of a remote DMA context.

•write-heap: Write data in a single packet to a location in a remote context’s heap.

•send-command: Transmit an embedded DMA command in a packet to be executed by the remote context’s DMA Engine.

•do-command: Decrement a counter, and if the result is negative, execute a specified list of commands in the local heap.

•put-buffer: Transmit an entire memory segment to a remote context’s memory. Upon completion, optionally generate a remote event or execute remote commands.

The ability to remotely execute commands provides much of the power of these primitives. Software implements RDMA GET operations, for example, by issuing one or more send-command commands, each with an embedded put-buffer command.

Application software detects incoming events on the EQ either by polling the memory or by configuring the DMA Engine to interrupt the processor when new events arrive. The polled method provides the shortest message latency.
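To give a feel for this interface, the following C sketch shows a command being posted to the CQ, the doorbell register being written, and the EQ being polled. Every structure layout, command encoding, and register shown here is illustrative only; it is not the actual DMA context definition:

    #include <stdint.h>

    /* Illustrative DMA context: a command queue the processor writes, an event
     * queue the DMA Engine writes, and a doorbell register.  Each command slot
     * occupies two 64-bit words, so cq must hold 2 * cq_size words. */
    struct dma_context {
        volatile uint64_t *cq;          /* command queue ring (written by the CPU) */
        volatile uint64_t *eq;          /* event queue ring (written by the DMA Engine) */
        volatile uint64_t *doorbell;    /* memory-mapped register: "new commands present" */
        uint32_t cq_tail, cq_size;
        uint32_t eq_head, eq_size;
    };

    /* Post a send-event command: deliver one small payload word to the event
     * queue of a remote DMA context identified by a route-descriptor handle. */
    static void post_send_event(struct dma_context *ctx,
                                uint32_t route_handle, uint64_t payload)
    {
        uint32_t slot = ctx->cq_tail % ctx->cq_size;
        ctx->cq[2 * slot]     = ((uint64_t)route_handle << 32) | 0x1; /* invented encoding */
        ctx->cq[2 * slot + 1] = payload;
        ctx->cq_tail++;
        *ctx->doorbell = 1;             /* I/O register write: tell the engine to look */
    }

    /* Poll the event queue for the next event; returns 0 if none is pending. */
    static int poll_event(struct dma_context *ctx, uint64_t *event_out)
    {
        uint64_t e = ctx->eq[ctx->eq_head % ctx->eq_size];
        if (e == 0)                     /* illustrative "empty slot" convention */
            return 0;
        *event_out = e;
        ctx->eq[ctx->eq_head % ctx->eq_size] = 0;
        ctx->eq_head++;
        return 1;
    }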

1. The kernel-mode interface to the DMA Engine includes additional in-memory data structures and registers, and also allows the DMA hardware to interrupt the kernel.


Virtual Memory Integration

The DMA Engine, by design, allows user-level code to use it directly, without requiring system calls or interrupts in the critical path. To accomplish this for RDMA operations, which copy data in and out of application virtual memory, the DMA Engine relies on the fact that the virtual-to-physical memory associations are likely to stay intact during the application’s lifetime, and allows the software to recover from the uncommon case where that is not so.

The DMA Engine accomplishes virtual memory RDMA using buffer descriptors (BDs), which are entries in the BDT. The BDT accomplishes what a page table does in a virtual memory system, but with somewhat different mechanics. When software issues a put-buffer command to the DMA Engine, it refers to a virtual address by using a BD (specified as an index into the BDT) and an offset from that BD.

Software associates virtual pages with BDs using a kernel system call. The operation is lightweight, and more importantly, it needs to be done only once. The BD internally stores the corresponding physical page frame, and the DMA Engine accesses that to read and write memory. When the kernel unmaps a virtual page which has an associated BD, it invalidates that BD. When the DMA Engine attempts to use an invalid BD, it creates a fault event on the application’s EQ. The user code recovers from the fault by re-validating the BD (a system call) and re-starting the DMA command. (A code sketch of this flow appears after the list in the next section.)

OS Protection and Security

In designing the DMA Engine, we’ve taken care to keep intact the basic process protection and security mechanisms that Linux provides. In doing so, we formed a security model that allows two assumptions: the OS kernels running on nodes in the SiCortex system trust one another in cooperatively managing the interconnect, and processes in a parallel application trust one another in cooperatively using their DMA contexts. But kernels do not trust applications, and applications do not trust one another. To implement this model, we rely on DMA Engine mechanisms and operating system support. In particular:

•Only the kernel constructs fabric routes. User-level software specifies routes as indices in its RDT, but only the kernel writes the table’s contents. This prevents one process from receiving commands or events from a non-authorized process, and also protects against various misuses of fabric routes.

•Only the kernel writes the physical address in buffer descriptors. The DMA Engine can access any physical memory address, but an application specifies RDMA operations only by references to buffer descriptors. Thus the kernel prevents DMA access to memory not mapped to the application.

•The hardware matches packets before accepting them. Packets arriving for an application carry a process key that must match at the receiving DMA Engine. If an application crashes and its resources are reassigned, leftover packets from the old application are not received by the new one, even if the routes are the same.
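To make the flow concrete, here is a user-level sketch of the register-once, recover-on-fault pattern described under Virtual Memory Integration above. Every function name below (scdma_register_page, scdma_revalidate_bd, scdma_put_buffer, scdma_wait) is a hypothetical stand-in, not the actual SiCortex API:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical kernel and DMA calls, named here for illustration only. */
    extern int scdma_register_page(void *vaddr, uint32_t *bd_index_out); /* syscall: page -> BD */
    extern int scdma_revalidate_bd(uint32_t bd_index);                   /* syscall: fix stale BD */
    extern int scdma_put_buffer(uint32_t route, uint32_t bd_index,
                                size_t offset, size_t length);           /* enqueue put-buffer */
    extern int scdma_wait(void);        /* poll the EQ: 0 = completed, -1 = BD fault event */

    /* Send one page-resident buffer via RDMA.  Registration happens once; if the
     * kernel later unmapped the page and invalidated the BD, the fault event
     * tells us to re-validate the BD and retry the command. */
    int rdma_put(uint32_t route, void *vaddr, size_t offset, size_t length)
    {
        static uint32_t bd = 0;          /* one BD per buffer in real code; static keeps the sketch short */
        if (bd == 0 && scdma_register_page(vaddr, &bd) != 0)
            return -1;

        for (;;) {
            if (scdma_put_buffer(route, bd, offset, length) != 0)
                return -1;
            if (scdma_wait() == 0)       /* completion event: transfer finished */
                return 0;
            scdma_revalidate_bd(bd);     /* fault event: repair the BD, then retry */
        }
    }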


MPI

For SiCortex, MPI is the critical software API where our customers’ applications meet our communication hardware.²,³ In keeping with SiCortex’s mission of providing purpose-built systems with standard interfaces, we deliver not just an MPI implementation tuned to the hardware, but also hardware tuned to MPI.

SiCortex’s MPI implementation is derived from the popular MPICH2 software from Argonne National Laboratory (see http://www-unix.mcs.anl.gov/mpi/mpich2). At present we support all MPI-1 and selected MPI-2 features. For many users, this may well be all they need to know about SiCortex MPI. Existing MPI applications, re-compiled to run on the SiCortex systems, take immediate advantage of the SiCortex interconnect fabric. MPI applications need not know about the DMA Engine and fabric, because that code is contained entirely within the SiCortex MPI library.

SiCortex MPI Internals

The MPICH2 code base, from which we built SiCortex MPI, includes a sophisticated “channel” interface intended to simplify targeting the library to new communication architectures. But in order to build a system with microsecond message latencies, we dispensed with this internal abstraction and instead wrote software that interfaced to the higher-level ADI3 layer in MPICH2. The result leaves very little code between applications calling MPI send and receive operations and the DMA Engine hardware itself.⁴

For point-to-point MPI operations, the internal messaging protocols that SiCortex MPI uses are not unusual. The software sends small messages using an eager protocol that copies the data at both ends, and larger messages using a rendezvous protocol that uses RDMA for zero-copy transfers. The receivers perform MPI matching in software, using optimized code to traverse posted-receive and early-send queues.

The SiCortex interconnect fabric simplifies these operations in several ways. For eager messages, the fabric’s reliable, in-order message primitives allow the sending software to complete send operations once the data is passed to the DMA Engine, and streamline the receiver’s re-assembly of messages. For rendezvous messages, the DMA Engine provides an efficient way for software to register virtual memory pages, and allows the receiver itself to initiate RDMA fetch operations immediately after it establishes an MPI match. For sufficiently large RDMA transfers, software on the receive end schedules several RDMA GET operations across multiple available fabric paths between the source and destination nodes.
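The protocol split in the send path can be pictured with a short C sketch; the threshold constant and helper functions below are invented for illustration and are not the library’s actual internals:

    #include <stddef.h>

    #define EAGER_LIMIT 1024   /* illustrative cutoff only, not the real tuning value */

    /* Hypothetical helpers standing in for the DMA-level operations. */
    extern void send_eager(int dest_rank, int tag, const void *buf, size_t len);
    extern void send_rendezvous_request(int dest_rank, int tag, const void *buf,
                                        size_t len);   /* receiver pulls data via RDMA GET */

    /* Point-to-point send path: small messages are copied eagerly; large
     * messages send only a rendezvous request, and the matched receiver then
     * fetches the data with zero-copy RDMA GET operations. */
    void mpi_send_sketch(int dest_rank, int tag, const void *buf, size_t len)
    {
        if (len <= EAGER_LIMIT)
            send_eager(dest_rank, tag, buf, len);
        else
            send_rendezvous_request(dest_rank, tag, buf, len);
    }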

2. M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra. MPI -- The Complete Reference: Volume 1, The MPI Core. MIT Press, 1998.

3. W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, M. Snir. MPI -- The Complete Reference: Volume 2, The MPI Extensions. MIT Press, 1998.

4. For short messages, we are able to complete MPI send and receive operations with roughly 250 instructions in the critical path. This stands in contrast to the 1200-instruction path length of typical MPI implementations.


This method helps avoid fabric congestion and can triple the fabric bandwidth available to an application.

MPI messages generally complete without OS intervention. The MPI implementation uses polling, rather than interrupts, in order to shorten message latencies when waiting to complete operations. For RDMA operations, the MPI library must initially make system calls to set up DMA buffer descriptors (in effect registering virtual memory for use by the DMA Engine). But once these buffer descriptors are set up, their associated memory can be used for RDMA operations without further OS intervention during the life of the application.

On-Node Communication

Each node in the SiCortex architecture is a six-core SMP system running a single Linux kernel. SiCortex Linux supports the standard Linux multi-threading interfaces, but the normal way to run MPI on our system puts individual single-threaded MPI processes on individual processor cores. The MPI library uses only the DMA Engine, not shared memory, for on-node communication.

This approach indeed benefits MPI applications. For very large messages, using the DMA Engine is clearly the best approach because it can copy data from memory to memory faster than a processor can. For moderately large messages that fit within a processor’s L2 cache, the DMA Engine has the additional advantage of being able to copy data directly from one processor’s cache to another’s. For very short messages, while it might be possible to pass data through shared memory more efficiently than via the DMA Engine, doing so in the MPI library would necessarily compromise the efficiency of off-node short messages by adding additional checks to the software on the sending and receiving ends.

DMA-Driven Collectives

The MPI collective operations coordinate the activity of the processes in parallel applications. The SiCortex DMA Engine includes facilities that can accelerate many of these collective operations by allowing multiple communication steps to proceed autonomously in the DMA Engines rather than involving the processors at every step. The following sections explain how these mechanisms work for two critical MPI operations: broadcast and barrier.

Broadcast

In an MPI broadcast operation, one application process sends a message to many processes. For small messages, the issue in collective design is minimizing latency. For large message broadcasts, the issue typically is resource scheduling to minimize contention for node and network resources.

With the Kautz interconnect topology, it is possible to overlay a ternary tree on the set of nodes participating in a broadcast, such that no two links in the tree contend for a physical link in the fabric. Using such a tree, the software constructs pre-built DMA command chains to run at the tree’s internal nodes. For short broadcast messages, a processor at the root of the tree issues a single DMA command, triggering a cascade of messages which traverse each link of the tree exactly once and which carry the broadcast to all ranks in a maximum of six steps (six being the Kautz network diameter for a 972-node system).


Requiring no processor intervention, the operation proceeds at hardware speeds. For large broadcast messages, the same tree provides optimal resource scheduling for pipelined RDMA transfers of the entire broadcast message to all nodes.

Barrier

In an MPI barrier operation, no process completes the operation until all participating processes have initiated it. In the SiCortex system, we implement the barrier as a reduction followed by a broadcast of a zero-size message.

For the reduction phase, the MPI software uses a special DMA Engine operation that, when a command packet arrives, decrements a counter within the DMA Engine and triggers a command sequence when the counter value becomes negative. The software constructs a combining tree of these commands. On entry to the barrier, processes initiate the reduction by each issuing a single DMA command. Without further processor intervention, the reduction operation culminates in a command executed at the root of the tree which initiates the broadcast phase. The broadcast works as described above. The entire barrier operation completes in, at most, twelve communication steps, again requiring no processor intervention.
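The counter mechanism at the heart of the reduction phase can be modeled in a few lines of C. This is a software model of the documented do-command behavior, not the microcode itself, and the structure and names are ours:

    /* Conceptual model of the do-command primitive the combining tree is built
     * from: each arriving command packet decrements a counter, and the node's
     * pre-built command chain runs only when the counter goes negative. */
    struct do_command {
        int   counter;            /* initialized to (expected arrivals - 1) */
        void (*chain)(void *arg); /* pre-built command chain in the heap */
        void *arg;
    };

    /* Invoked, conceptually, each time a command packet reaches this node.
     * At an interior node the chain forwards a command up the tree; at the
     * root it launches the broadcast phase that releases the barrier. */
    static void do_command_arrive(struct do_command *dc)
    {
        if (--dc->counter < 0)
            dc->chain(dc->arg);
    }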

Other Communication APIs

For direct user-mode fabric communication, MPI is the standardized API that SiCortex delivers in our initial software releases. But we recognize that other communication interfaces are of interest to the high-performance computing community, notably the shmem and GASNet communication primitives and global address space languages such as UPC (see http://upc.lbl.gov) and Co-Array Fortran. We believe the SiCortex architecture is well suited to implement such communication interfaces.

New communication implementations will likely follow the strategy of the MPI software, building libraries directly on top of the primitives provided by the SiCortex DMA Engine, particularly when implementing low-latency operations. Our software internally uses an interface called scdma which exposes these DMA primitives to user-level software. With these interfaces provided as open source software in the SiCortex system, plenty of opportunities exist for interested researchers and users to participate in implementing new APIs on the machines. Since the DMA Engine’s software interface is largely defined by microcode, the system allows for some flexibility to extend this interface, introducing new DMA primitives to support other communication APIs.

TCP/IP Fabric Communication

While MPI uses the SiCortex fabric for direct user-mode communication, the Linux kernel itself also uses the fabric to provide inter-node IP communication. We accomplish this with a Linux network device driver that transmits network frames over the fabric.


The driver is called SCethernet, although it supports only IP protocols. The SCethernet driver sends small network frames directly as eager messages to the receiving driver’s event queue, and transmits large frames starting with a rendezvous request message. As with an MPI rendezvous, the receiving node uses RDMA to fetch a large frame’s data. Unlike MPI, SCethernet communication is interrupt-driven. The driver provides for network broadcasts using a software-driven spanning tree overlay on the fabric.

Although SCethernet communication is not as efficient as MPI, particularly for short messages, it nonetheless provides a useful, high-performance communication substrate for any distributed TCP/IP application. The fact that the kernel transmits frames over the fabric is, of course, completely transparent to applications.

Interconnect Topology

Each node in the SiCortex system has three fabric link inputs and three fabric link outputs. The nodes are wired up to form a Kautz network—a directed graph first described by William Kautz of SRI in 1968⁵ and researched heavily in the following years.⁶,⁷

Many different network topologies have been used for computer interconnect networks, including hypercubes (Thinking Machines CM-2), multi-dimensional meshes and tori (Illiac-IV, Cray T3E™, BlueGene/L), fat trees (CM-5, InfiniBand®), Flat Neighborhood Networks, and so on. The choice of a network topology is a complex trade-off among complexity, wiring, congestion, latency, fault tolerance, and other issues. We chose the Kautz graph topology because:

•Kautz graphs have the largest known number of nodes for a given diameter using nodes which have fixed degree.

•Kautz graphs have redundant routing. For a degree-3 graph, between any pair of nodes there are always three distinct routes that do not share any intermediate links or switches.

Kautz graphs have low latency due to their low diameter. They have good congestion and redundancy properties due to the existence of redundant routes. However, Kautz networks are very difficult to wire; in order to get low diameter, many of the links must connect to distant nodes. The key breakthrough which led to our choosing the Kautz graph was defining how to tile the graph using a fixed subgraph. This lets us build a range of systems using only a single processor module design with backplanes of manageable complexity.

5. W. H. Kautz, “Bounds on directed (d,k) graphs,” Theory of cellular logic networks and machines, AFCRL-68-0668 Final report, pp. 20-28, 1968.

6. G.J.M. Smit, P.J.M. Havinga and M.J.P. Smit, “Rattlesnake: a network for real-time multimedia communications,” ACM SIGCOMM Computer Communication Review, Volume 23, Issue 3, pp. 29-30, July 1992.

7. S. Banerjee, V. Jain, and S. Shah, “Regular Multihop Logical Topologies for Lightwave Networks,” IEEE Communications Surveys, Volume 2, Number 1, 1st Quarter, 1999.


One unusual fact about our design is that fabric links are one-way; a node can transmit to three other nodes, but generally receives from a different three nodes.

The Kautz Graph

Kautz graphs have the largest number of nodes for a fixed degree and diameter. Degree refers to the number of links per node, and diameter refers to the maximum hop count to get from any node to another. In a Kautz network, the diameter grows as the logarithm of the number of nodes. Kautz graphs can be built from nodes with two or more links. We chose to use nodes with three links because four uses too many pins and two leads to diameters that are larger than we wanted for interesting machine sizes.

FIGURE 4. 12-node Kautz Graph (nodes labeled 01, 02, 03, 10, 12, 13, 20, 21, 23, 30, 31, 32)

Table 1 displays the number of nodes for given degree and diameter. The SC5832 is degree 3, diameter 6, and, thus, has 972 nodes.

              Diameter
              2       3       4       5       6       7
  Degree 2    6       12      24      48      96      192
  Degree 3    12      36      108     324     972     2916
  Degree 4    20      80      320     1280    5120    20480

TABLE 1. Number of Nodes in Kautz Networks
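The entries in Table 1 are consistent with the Kautz node count N = (degree + 1) * degree^(diameter - 1); as a quick check, a few lines of C reproduce the table:

    #include <stdio.h>

    /* Number of nodes in a Kautz network of the given degree and diameter:
     * (d + 1) * d^(D - 1). */
    static long kautz_nodes(long degree, long diameter)
    {
        long n = degree + 1;
        for (long i = 1; i < diameter; i++)
            n *= degree;
        return n;
    }

    int main(void)
    {
        for (long d = 2; d <= 4; d++) {
            printf("Degree %ld:", d);
            for (long D = 2; D <= 7; D++)
                printf(" %ld", kautz_nodes(d, D));
            printf("\n");               /* e.g. degree 3: 12 36 108 324 972 2916 */
        }
        return 0;
    }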


There are several ways to describe the connectivity of a Kautz network. Figure 4 shows a 12-node, degree-3 Kautz graph. Visually, the links from each side connect to the other three sides.

In a mathematical representation, nodes in the degree-k, diameter-D Kautz graph are represented by D-digit strings in base k+1, with the rule that no two adjacent digits may be equal. The shortest Linux script (that we know of) to generate the node numbers of the 972-node machine is

  seq -w 4e5 | egrep -v '[4-9]|(.)\1'

The connectivity is defined by left-shifting the node number, discarding the leading digit, and adding a new digit on the right. From this construction, it is clear that any node can reach any other node within D steps, simply by shifting in the destination node’s digits one at a time.
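The same construction can be written out in a short C program: enumerate the valid labels, and form a node's three successors by the left shift just described. This is purely an illustration of the graph definition, not the system's actual routing code:

    #include <stdio.h>

    #define DIGITS 6   /* label length D for the degree-3, diameter-6 system */
    #define BASE   4   /* digits 0..3, i.e. base (degree + 1) */

    /* A digit string is a valid node label if no two adjacent digits match. */
    static int valid_label(const int d[DIGITS])
    {
        for (int i = 1; i < DIGITS; i++)
            if (d[i] == d[i - 1])
                return 0;
        return 1;
    }

    /* The three successors of a node: shift the label left, drop the leading
     * digit, and append any digit that differs from the new final digit. */
    static void successors(const int d[DIGITS], int succ[3][DIGITS])
    {
        int n = 0;
        for (int last = 0; last < BASE; last++) {
            if (last == d[DIGITS - 1])
                continue;                      /* would repeat the last digit */
            for (int i = 0; i < DIGITS - 1; i++)
                succ[n][i] = d[i + 1];         /* left shift */
            succ[n][DIGITS - 1] = last;        /* shift in the new digit */
            n++;
        }
    }

    int main(void)
    {
        /* Count the valid labels: 4 * 3^5 = 972 nodes for the SC5832. */
        int count = 0;
        for (int x = 0; x < 4096; x++) {       /* 4^6 candidate strings */
            int d[DIGITS], v = x;
            for (int i = DIGITS - 1; i >= 0; i--) { d[i] = v % BASE; v /= BASE; }
            if (valid_label(d))
                count++;
        }
        printf("valid node labels: %d\n", count);

        /* The three nodes reachable from node 010101 in one hop. */
        int ex[DIGITS] = { 0, 1, 0, 1, 0, 1 }, s[3][DIGITS];
        successors(ex, s);
        for (int n = 0; n < 3; n++) {
            for (int i = 0; i < DIGITS; i++)
                printf("%d", s[n][i]);
            printf(n < 2 ? " " : "\n");        /* prints 101010 101012 101013 */
        }
        return 0;
    }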

Conclusion

The interconnect in the SiCortex systems reflects a careful design aimed to provide very low-latency, high-bandwidth communications for high-performance applications. The interconnect fabric and DMA Engine are deliberately suited to run MPI as well as other communication models. Together with other innovations of the SiCortex architecture, this new generation of interconnect represents an important development for high-performance computing: the SiCortex systems are designed to run existing applications, substantially increasing delivered performance without requiring users to re-write their codes.


Copyrights and Trademarks

Copyright © 2006 SiCortex, Inc. All rights reserved.

The following are trademarks of their respective companies or organizations: Cray and CrayT3E are the registered trademarks of Cray Inc. EXPRESSMODULE is the trademark of PCI-SIG. InfiniBand is the registered trademark of the InfiniBand Trade Association. PCI, PCI Express, and PCIe are the registered trademarks of PCI-SIG. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. The registered trademark Linux is used pursuant to a sublicense from the Linux Mark Institute, the exclusive licensee of Linus Torvalds, owner of the mark in the U.S. and other countries.
