PaScal – A New Parallel and Scalable Server I/O Networking Infrastructure for Supporting Global Storage/File Systems in Large-size Linux Clusters

Gary Grider, Hsing-bung Chen, James Nunez, Steve Poole, Rosie Wacha, Parks Fields, Robert Martinez, Paul Martinez, Satsangat Khalsa
Los Alamos National Laboratory, Los Alamos, New Mexico 87545 USA

Abbie Matthews, Garth Gibson
Panasas File Systems, 6530 Kaiser Drive, Fremont, CA 94555

Abstract

This paper presents the design and implementation of a new I/O networking infrastructure, named PaScal (Parallel and Scalable I/O Networking Framework). PaScal supports high data bandwidth IP-based global storage systems for large-scale Linux clusters. PaScal has several unique properties. It employs (1) a multi-level switch-fabric interconnection network that combines high-speed interconnects for computing Inter-Process Communication (IPC) requirements with a low-cost Gigabit Ethernet interconnect for global IP-based storage/file access, (2) a bandwidth-on-demand scaling I/O networking architecture, (3) open-standard IP networks (routing and switching), (4) multipath routing for load balancing and failover, (5) Open Shortest Path First (OSPF) routing software, and (6) support for a global file system in multi-cluster and multi-platform environments. We describe both the hardware and software components of PaScal. We have implemented the PaScal I/O infrastructure on several large-size Linux clusters at LANL and conducted a sequence of parallel MPI-IO benchmarks on LANL's Pink 1024-node Linux cluster and the Panasas global parallel file system. Performance results from these benchmarks demonstrate that the PaScal I/O infrastructure is robust and capable of scaling in bandwidth on large-size Linux clusters.

Keywords: Parallel and Scalable I/O, Cluster computing, Global Storage/File system, Multipath routing, Load balancing, High-speed interconnect

1 INTRODUCTION

Commercial off-the-shelf cluster computing systems have delivered reasonable performance in technical and commercial areas for years. High-speed computing, global storage, and networking (IPC and I/O) are the three most critical elements in building a large-scale HPC cluster system. Unless these three elements are well balanced, we cannot fully utilize an HPC cluster. High data bandwidth I/O networking provides a data super-highway to meet the needs of constantly increasing computation power and storage capacity.

In the past couple of years, several high-speed interconnect technologies (Quadrics [3], Myrinet [4], and Infiniband [5]) have been introduced into the HPC community. These interconnects are used in 90% of the Top 500 Supercomputer sites [1] and provide a significant improvement in latency-sensitive IPC. However, the lack of industry-wide support for data storage over these technologies makes it extremely difficult or impossible to build a fully functional, long-term HPC cluster with computing, scalable I/O, and global storage/file system components. As a result, most supercomputers use (a) IP-based [13][14][15], (b) Fibre Channel (FC) based, or (c) iSCSI-based global storage systems for scalability. An IP-based Ethernet network (minimized total cost of ownership, simplified management, scalable, and shared) is about the only approach that makes sense for very large scale system deployment.

In this paper, we propose a new parallel and scalable I/O networking infrastructure, called PaScal, that supports high-performance, high data bandwidth, and high-availability IP-based global storage systems on large-scale Linux clusters. PaScal integrates several scale-up and scale-out [2][7][8][9][10][17] advanced networking features. It has several unique properties:
1) Multi-level switch-fabric interconnected systems that combine high-speed interconnects for computing IPC requirements with low-cost Gigabit Ethernet interconnects for IP-based global storage access,
2) A bandwidth-on-demand scaling I/O network architecture that requires no re-wiring or reconfiguration of the system,
3) Open-standard IP networks (routing and switching),
4) Equal-Cost Multi-Path routing for load balancing and failover [17],
5) Open Shortest Path First (OSPF) routing software [18][19] with low over-provisioning requirements, and
6) Support for a global parallel file system in multi-cluster and multi-platform environments [13][14][15].

The ultimate goal of PaScal is to provide high-performance, efficient, reliable, parallel, and scalable I/O capabilities for scientific applications run on Terascale/Petascale clusters. This type of scientific simulation-based analysis normally requires efficient transfer of a huge volume of complex data among simulation, visualization, and data manipulation functions. We also want to provide a balanced I/O subsystem by developing parallel and scalable I/O techniques for high-end computing communities, as well as to leverage progress and advancements in academic and industry research and development.

The remainder of the paper is organized as follows: In Section 2, we present the requirements and problems of building a fully functional HPC cluster computing system. We provide a system overview of the proposed PaScal I/O networking infrastructure in Section 3. Our deployment of the PaScal I/O network infrastructure is presented in Section 4. In Section 5, we present performance evaluation results from parallel MPI-IO benchmark testing and a cost comparison. In Section 6, we discuss the impact of PaScal. In Section 7, we conclude and describe our current and future work.

The University of California operates Los Alamos National Laboratory for the National Nuclear Security Administration of the United States Department of Energy. LAUR-05-7620

1-4244-0198-4/06/$20.00 ©2006 IEEE

2 REQUIREMENTS AND PROBLEMS OF BUILDING A FULLY FUNCTIONAL HPC CLUSTER COMPUTING SYSTEM

Generally, a fully functional HPC cluster requires high-speed, large-memory compute nodes (or server nodes) and a high-bandwidth global storage/file system. A global storage/file system employs a single global namespace, removes physical and logical boundaries, provides parallel data paths to compute nodes, scales client network capacity, grows seamlessly within a single namespace, and supports dynamic load balancing. High-performance I/O networking is the data super-highway used to balance data traffic between computing and storage.

In this paper we define two levels of interconnect used in HPC clusters:
1) The Level-1 interconnect uses high-speed interconnect systems with a non-blocking tree topology (Fat tree or Clos tree, scaling up), such as Quadrics, Myrinet, or Infiniband, to fulfill the requirements of low-latency, high-speed, high-bandwidth cluster communication.
2) The Level-2 interconnect uses IP-based networks to support latency-tolerant I/O communication and global IP-based storage systems.

Generally, there are several different I/O networking approaches to connect network storage to a single cluster or a multi-cluster environment:
1) Approach-1 uses only the Level-1 interconnect and connects the data storage system to the cluster's server nodes directly.
2) Approach-2 connects the data storage system to a common Federated Ethernet Fabric Network (FEFN, Level-2) or other networks such as FC and iSCSI. It then connects all cluster compute server nodes to a high-performance interconnect system (Level-1) such as Quadrics, Myrinet, or Infiniband.
3) Approach-3 connects the data storage system to a common Linear Scaling Ethernet Fabric Network (LS-EFN). It uses I/O nodes as area border routers to route data traffic for multiple compute nodes and connects all cluster server nodes (compute and I/O) to a high-performance interconnect system (Level-1) such as Quadrics, Myrinet, or Infiniband.

2.1 The Approach-1 solution - Connect the data storage system to the cluster directly

Figure-1 presents the "Direct connected I/O architecture-1" using Approach-1. This approach builds a non-blocking tree topology (Fat tree or Clos tree is most common) at Level-1 between compute server nodes and I/O server nodes. Storage is attached to the I/O server nodes directly. This architecture does not reduce bandwidth from any cluster node to the storage system. However, it has several disadvantages: (1) full bi-section bandwidth between storage I/O nodes is wasted, (2) system cost grows exponentially when more storage I/O ports are needed, and (3) it is not applicable to a multi-cluster environment.

Figure-1: Directly attached I/O network infrastructure-1 (compute server nodes and I/O server nodes on the Level-1 fabric, with global storage directly attached)

Another direct attached I/O architecture from Approach-1, suggested by high-speed interconnect system vendors, is shown in Figure-2. In this layout, the network configuration uses additional expansion Gigabit Ethernet line cards on a high-speed interconnect switch (the Level-1 network) to support the Ethernet/FC/iSCSI storage network.


Clearly, this layout eliminates some of the external Gigabit Ethernet switches, Network Interface Cards (NICs), and cables. However, it cannot scale data traffic for the Ethernet storage network because of the limited data bandwidth of the plug-in Ethernet bridge modules in existing Quadrics/Myrinet/Infiniband switches. The cost of Ethernet bridge modules for a high-end Quadrics/Myrinet/Infiniband switch is also prohibitive compared to a regular Ethernet network system.

Figure-2: Directly attached I/O network infrastructure-2 (Ethernet expansion line cards on the Level-1 switch connect compute server nodes to global storage over GigE)

2.2 The Approach-2 solution - "Federated" switch fabrics

Approach-2 normally uses "federated" Ethernet switches in the Level-2 network [22][23][24][25], configured as non-blocking, Constant Bi-sectional Bandwidth (CBB) tree topologies (Fat tree). This is typically proposed and used in a multi-cluster environment to support I/O networking, as shown in Figure-3. Each server node is equipped with two NICs: one NIC is connected to the Level-1 network for latency-sensitive IPC communication, and the other is connected to the Level-2 network for data I/O traffic. A "single path routing" [17] policy is adopted in the Level-2 network.

Figure 3: Federated GigE switch fabrics (server nodes connected to both the Level-1 fabric and a Level-2 network of federated GigE switches attached to global storage)

The advantage of this storage network infrastructure is that it can scale to support thousands of cluster nodes while continuing to provide high-bandwidth connectivity to the rest of the network. In multi-switch cluster configurations, CBB is achieved when the aggregate uplink bandwidth is equal to or greater than the aggregate nodal bandwidth on each access switch and the uplinks are evenly distributed among the non-blocking core switches. Larger clusters can be constructed by adding more levels to the tree structure.

This approach has many problems: (a) it duplicates the network, (b) it is very expensive and complex to grow this type of federated switch network, (c) it forces compute server nodes to participate in "single path" I/O routing, (d) it provides no load balancing between the Level-1 and Level-2 networks, (e) it cannot scale well as the system keeps growing, (f) it needs a complicated Ethernet routing/switching configuration in the Level-2 network, and (g) its NIC cable installation and management are error-prone.

2.3 Approach-3 - Connect the I/O to a common Ethernet Fabric and use I/O nodes

We adopt the Approach-3 solution in PaScal. In PaScal, we separate server nodes into two categories based on their functional purposes: Compute nodes are used only for cluster computation, and I/O nodes are used only for I/O routing and access to global storage systems. Figure-4 shows the outline of our proposed PaScal multi-level switch-fabric interconnect system. We describe the details of the PaScal I/O infrastructure in Section 3.

Figure-4: The proposed PaScal I/O networking infrastructure (Compute nodes and I/O nodes on the Level-1 fabric; I/O nodes connected to Level-2 linear scaling GigE switches attached to global storage)

3 SYSTEM OVERVIEW OF THE PROPOSED PASCAL I/O NETWORKING INFRASTRUCTURE

3.1 Multi-level switch-fabric interconnect system

PaScal adapts and enhances a multi-level switch-fabric interconnect system.

3.1.1 Level-1 interconnect: The Level-1 interconnect uses (a) high-speed interconnect systems (scale up) such as Quadrics, Myrinet, or Infiniband to fulfill the requirements of low-latency, high-speed, high-bandwidth cluster IPC communication, and (b) aggregated multipath I/O routes (scale out) for load balancing and failover.
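The destination-address-based multipath routing that PaScal sets up on Compute nodes can be sketched as follows. This is an illustrative model, not LANL's actual code: the gateway names, the hash function, and the four-gateway count are assumptions made for the example; the real mechanism is the Linux kernel's equalized multipath route selection.

```python
# Illustrative sketch of destination-address-based multipath next-hop
# selection, in the spirit of the Linux equalized multipath routes used
# on PaScal Compute nodes. Gateway names and count are hypothetical.
import hashlib

IO_GATEWAYS = ["io-node-%02d" % i for i in range(4)]  # assumed 4 I/O nodes

def select_gateway(dst_ip: str, gateways=IO_GATEWAYS, alive=None):
    """Pick one gateway per destination address: all traffic to the same
    destination sticks to one I/O node, while distinct destinations are
    spread across the N-way routes."""
    candidates = [g for g in gateways if alive is None or g in alive]
    if not candidates:
        raise RuntimeError("no live I/O gateway (dead-gateway detection)")
    digest = hashlib.md5(dst_ip.encode()).digest()
    return candidates[digest[0] % len(candidates)]

# Destinations spread over the available I/O nodes; failover (dead-gateway
# detection) simply shrinks the candidate set and re-selects.
dests = ["10.1.2.%d" % i for i in range(1, 9)]
spread = {select_gateway(d) for d in dests}
```

The key property mirrored here is that route choice is a pure function of the destination, so no per-flow state is needed on the Compute node, and removing a dead gateway from the candidate set transparently re-routes its destinations.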


3.1.2 Level-2 interconnect: The Level-2 interconnect uses multiple Gigabit Ethernet switches with Layer-3 routing support (scale out) to support latency-tolerant I/O communication and global IP-based storage systems. Without using the "federated switch" solution, we can linearly expand the Level-2 network by employing a global-domain multicasting feature in the metadata server. With this support we can maintain a "single name space" global storage system while continuing to grow the storage capacity.

3.2 Components in the PaScal I/O Infrastructure

PaScal integrates several hardware and software components to provide a unique and scalable computing and networking infrastructure for large-size Linux-based clusters.

3.2.1 Hardware components
1) Compute node: A Compute node is equipped with at least one high-speed interface card connected to the high-speed interconnect fabric in Level-1, and is set up with Linux multipath equalized routing to multiple routing-available I/O nodes for load balancing and failover (high availability). A Compute node is used for computing only.
2) I/O node: An I/O node has two network interfaces. One high-speed interface card is connected to the Level-1 network and communicates with Compute nodes. One or more Gigabit Ethernet interface cards (bondable) are connected to the Level-2 linear scaling Gigabit switches. An I/O node works like an Area Border Router (ABR) and handles inbound and outbound traffic for multiple Compute nodes. It runs the open-source Zebra/Quagga/OSPF routing software.
3) Level-1 high-speed interconnection system: Uses a latency-sensitive IPC switch fabric such as Quadrics, Myrinet, or Infiniband.
4) Level-2 IP-based Ethernet interconnection system: Provides high data bandwidth for IP-based storage systems.

3.2.2 Software components
1) Multipath routing for load balancing and failover: Multipath routing is used to provide balanced outbound traffic to the multiple I/O gateways. It also supports failover and dead-gateway detection for choosing good routes from active I/O gateways. Linux multipath routing is a destination-address-based load-balancing algorithm. Each Compute node is set up with N-way multipath routes through N I/O nodes. We use Linux multipath routing to balance and match up the bandwidth difference between the Level-1 and Level-2 interconnects and create a unique scaling architecture.
2) Open Shortest Path First (OSPF) routing software: Zebra and Quagga [18][19] are open-source packages that manage TCP/IP-based routing protocols. They run on each I/O node to provide regional data routing services between the Level-1 Compute nodes and the Level-2 Gigabit Ethernet switches/routers. We run both the Zebra and OSPF processes as background daemons. They work together on (a) sending routing requests, (b) receiving similar requests from neighbor routers, and (c) eventually updating the routing table accordingly. The OSPF protocol maintains multiple equal-cost routes, used in the Gigabit Ethernet switches to provide load balancing in both directions.

3.2.3 Bi-directional multipath routing and load balancing for outbound and inbound traffic

Multipath routing (RFC 2991, RFC 2992) is mostly used in wireless data networks, where it may help relieve unreliable wireless links; it is seldom used in the high-performance computing community. Multipath routing utilizes several available routing paths to distribute traffic from a source to a destination. It should improve system performance through load balancing and reduce end-to-end delay. Load balancing overcomes the capacity constraint of "single path routing" and routes traffic through less congested paths.

Open Shortest Path First (OSPF, RFC 2328) is designed for exchanging routing information within a large or very large internetwork. The biggest advantage of OSPF is that it is efficient: OSPF routing requires very little network overhead, even in very large networks. OSPF uses a Shortest Path First (SPF) algorithm to compute routes in the routing table. The SPF algorithm computes the shortest (least-cost) path between the router and all the networks, and SPF-calculated routes are always loop-free.

We employ a multipath routing policy both for load balancing Compute node outbound (writing/sending) traffic and for Level-2 Gigabit Ethernet switch inbound (reading/receiving) traffic. With this bi-directional multipath routing we can sustain parallel data paths for both write and read data transfers. This is especially useful when applied to concurrent socket I/O sessions on IP-based storage systems: PaScal can evenly allocate socket I/O sessions to routing-available I/O nodes. We then utilize the OSPF routing capability of the I/O nodes and Ethernet switches to efficiently manage the inbound and outbound traffic for extreme load balancing.

3.3 PaScal I/O on Multi-cluster & Multi-platform environments

With (a) network Layer-2 and Layer-3 failover support from Linux kernel routing and Gigabit switches and (b) global multicast domain support from scalable metadata servers, the PaScal I/O networking infrastructure [25] can support a global storage system in a multi-cluster and multi-platform environment. By "PaScal I/O to support a multi-cluster and multi-platform environment" we mean setting up several independent large-size (thousand-node) HPC clusters, possibly managed by separate research organizations, and providing the ability to mount a single-name-space global file system across all clusters. Figure-5 illustrates the top-level view of PaScal in a multi-cluster and multi-platform environment. Each cluster maps its I/O routing paths through multiple "Lanes" (subnets managed by a GigE switch) and participates in a global domain routing activity with a global parallel file system. Global domain multicast support from a parallel file system [13][14][15] makes it possible to maintain a single global name space file system for multi-cluster environments using the PaScal I/O infrastructure. We can linearly add a "Lane" (subnet) to the PaScal I/O infrastructure to meet the growing needs of a global parallel file system.

Figure 5: PaScal I/O on multi-cluster environments (Clusters 1..N map their I/O paths onto Lanes 1..M, multiple subnets of the Level-2 linear scaling IP routing network; the global domain (single name space) is the union of the Lane subnets, backed by expandable global storage)

4 PASCAL EXPERIMENTAL CLUSTER – LANL PINK 1024-NODE LINUX CLUSTER WITH PANASAS GLOBAL & PARALLEL FILE SYSTEM

We have implemented the PaScal I/O infrastructure on several large-size Linux clusters at LANL (Lightning, Pink, TLC, Flash, Gordon, Bolt) in the past twelve months. The following describes the PaScal experience on LANL's Pink Linux cluster (1024 nodes, dual 2.4 GHz Xeon CPUs) and includes MPI-IO read and write scaling results.

4.1 Pink Linux Cluster

The Pink cluster is the second largest single-image Linux cluster in the world. Pink is equipped with:
1) 960 Compute nodes, each with one Myrinet PCI-X D-card,
2) 64 I/O nodes, each with one Myrinet PCI-X D-card and one Gigabit Ethernet interface; 8 GB/sec of bandwidth from the 64 I/O nodes is reserved for the future growth of the Panasas ActiveScale Storage Cluster (up to 24 Panasas shelves – 120 TB capacity),
3) A Myrinet switch fabric (Clos tree) for the Level-1 interconnect,
4) One Extreme Networks BlackDiamond 6808 Gigabit Ethernet switch for the Level-2 interconnect, and
5) Eight shelves of the Panasas ActiveScale Storage system [13]. Two mount points, scratch1 and scratch2, are created under a global name space.

4.2 Panasas ActiveScale Storage Cluster

The Panasas ActiveScale Storage system is composed of five major components:

1) The primary component is the object, which contains the file data and enough additional metadata for the OSD to autonomously manage it. An object can be viewed as an offset and length in a file, with each object located on a different OSD.
2) The network-attached Object Storage Device (OSD) (a.k.a. Storage Blade), a more intelligent evolution of today's disk drive that autonomously manages the object, including its layout.
3) The Panasas File System (PanFS) client module, which accepts POSIX file system commands and data from the client operating system, addresses the OSDs directly, and stripes objects across multiple OSDs. It is currently available as a Linux kernel module.
4) The PanFS Metadata Server (MDS) (a.k.a. Director Blade), which is distinct from the OSD and mediates among multiple clients, allowing them to share data while maintaining consistency on all nodes. The MDS provides clients with signed capabilities that the clients then present to an OSD to access object contents.
5) A high-bandwidth network fabric that ties the clients to the OSDs and the MDSs.

(The Panasas ActiveScale Storage Cluster is a commercial product from Panasas Inc. [13]. Currently we use the Panasas file system and plan to use the Lustre file system [14], the PVFS2 file system [15], GPFS, and NFSv4 in the near future.)
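The striping of a file's data across OSDs can be illustrated with a minimal round-robin mapping. This is a simplified sketch, not the actual PanFS layout: real PanFS layouts carry stripe-unit sizes, RAID information, and placement maps, and the 64 KB stripe unit below is purely an assumed value for the example.

```python
# Minimal sketch of striping a file across OSDs, in the spirit of PanFS
# object striping. Stripe unit size is an illustrative assumption.
STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (assumed)
NUM_OSDS = 10             # OSDs the file is striped across

def locate(offset: int, stripe_unit=STRIPE_UNIT, num_osds=NUM_OSDS):
    """Map a file byte offset to (osd_index, offset_within_that_object)."""
    unit = offset // stripe_unit       # global stripe-unit index
    osd = unit % num_osds              # round-robin over OSDs
    stripe_row = unit // num_osds      # full stripe rows before this unit
    obj_off = stripe_row * stripe_unit + offset % stripe_unit
    return osd, obj_off
```

Because consecutive stripe units land on consecutive OSDs, a large sequential read engages all OSDs concurrently, which is the property that lets aggregate bandwidth scale with the number of OSDs in the storage cluster.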

At the heart of the Panasas ActiveScale Storage system is a method of decoupling the data path (e.g., read, write) from the control path (e.g., metadata). This separation allows clients direct and parallel access to the storage devices, providing high bandwidth to both individual clients and workstation clusters. It also distributes the system metadata, allowing shared file access without a central bottleneck.

Clients communicate directly with the OSDs to access the stored data. Since the OSD manages all low-level layout issues, there is no need for a file server to mediate every data transaction. To provide data redundancy and increase I/O throughput, PanFS stripes the data of one file system across a number of objects (one per OSD). When PanFS stripes a file across 10 OSDs attached to a Gigabit Ethernet network, PanFS can deliver up to 400 MB/sec split among the clients accessing the file.

The independent parallel access by clients allows the cluster's total I/O bandwidth to scale linearly with the total number of OSDs in the storage cluster. This property also allows us to effectively load balance I/O traffic across multiple paths in the PaScal I/O networking infrastructure.

5 PERFORMANCE EVALUATION OF THE PASCAL I/O INFRASTRUCTURE

The two synthesized parallel I/O access patterns tested here are referred to as "N-to-1" and "N-to-N". These are the two parallel I/O access patterns most used in message-passing-based scientific application programs such as BLAST, FLASH, BTIO, the Parallel Ocean Program model, quantum chemistry, terrain rendering, and electron scattering [26][27][28][29][30][31]. The "N-to-1" tests (Figure 6-a, all processes read/write one file) consist of N processes writing to non-overlapping regions of a single file; "N-to-N" (Figure 6-b, one file per process) refers to N processes each writing to a single individual file. All data was checked on read against the known write pattern for data integrity; this checking is included in the read timing calculations.

Figure 6-a: "N-to-1" - a single file concurrently accessed by N processes, each PE reading/writing its own non-overlapping block within one large shared file

Figure 6-b: "N-to-N" - N individual files concurrently accessed by N processes, each PE reading/writing its own separate file

We have conducted a sequence of parallel MPI-IO [6][8] benchmarks on Pink to assess the impact of the PaScal framework on performance and scaling. We developed a parallel MPI-IO benchmark that uses the MPI-IO API to write and then read file(s) in a variety of patterns that mimic the I/O profiles of scientific simulation codes.

We created two file partitions, both under the same single global name space: scratch1 and scratch2. Each file partition is equipped with four Panasas ActiveScale Storage shelves (20 TB capacity). All of the testing runs were against:
1) One file partition of a 20 TB Panasas ActiveScale Storage cluster with an estimated theoretical maximum bandwidth of 1600 MB/sec (40 MB/sec per disk drive × 10 disks per shelf × 4 shelves) for both read and write access, or
2) Two file partitions of a 40 TB Panasas ActiveScale Storage cluster with an estimated theoretical maximum bandwidth of 3200 MB/sec (40 MB/sec per disk drive × 10 disks per shelf × 8 shelves) for both read and write access.

For all tests, the minimum effective bandwidth, or the speed of the slowest process, is reported. By "effective" we mean that the file create/open and close times are all factored into the calculation of bandwidth. Even with open and close times factored into the bandwidth calculation, initial results are very promising in terms of scaling and performance.
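The reported metric can be made concrete with a small sketch: per-process effective bandwidth charges the open and close times against the transfer, and the benchmark reports the slowest process. The per-process timings below are made-up illustrative numbers, not measured data from Pink.

```python
# Sketch of the reported metric: minimum "effective" bandwidth, where
# create/open and close times count against each process. Timings here
# are hypothetical illustration values only.
def effective_bw(bytes_moved: int, open_s: float, io_s: float, close_s: float):
    """MB/sec including open and close time, per the paper's definition."""
    return bytes_moved / (open_s + io_s + close_s) / 1e6

def reported_bw(per_process):
    """The benchmark reports the speed of the slowest process."""
    return min(effective_bw(*p) for p in per_process)

procs = [
    (4 * 10**9, 0.5, 40.0, 0.1),   # 4 GB file, hypothetical timings
    (4 * 10**9, 0.7, 44.0, 0.2),
    (4 * 10**9, 2.0, 47.0, 0.5),   # slow open under metadata contention
]
```

This definition explains why N-to-1 runs report lower numbers than N-to-N at the same transfer rate: a single slow open, such as one caused by contention on the shared file's control path, lowers the reported minimum for the whole run.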

5.1 "N-to-N" and "N-to-1" concurrent Write tests

Figure 7-a shows the effective write bandwidth for the "N-to-N" case using 2 through 256 processes with write message block sizes between 0.5 and 64 MB. Each process wrote a single 4 GB file with sequential (not random) write access, and all N files were created under one subdirectory. The write bandwidth steadily approaches 1 GB/sec as more processors are included. Note that at 256 processors, 256 files are all being created and opened at approximately the same time. One nice characteristic of the Panasas file system is the wide range of message sizes that achieve the maximum bandwidth; bandwidth does not decrease with larger message sizes. Such consistent performance is not commonly seen in many file systems.

For the "N-to-1" write bandwidth, shown in Figure 7-b, we use 2 to 256 processes with write message block sizes between 0.5 and 64 MB. As in the "N-to-N" case, we see fairly consistent performance across a large range of message sizes and an increase in overall bandwidth as more processes are writing. The maximum write bandwidth peaks at approximately 810 MB/sec, which is less than in the "N-to-N" case. Essentially, all processes are opening, writing, and closing a single file at the same time. This creates contention on the control path in the file system and explains the reduced peak bandwidth.

5.2 "N-to-N" and "N-to-1" concurrent Read tests

In the "N-to-N" concurrent read test case, we launch N read processes, each reading a separate 4 GB file using message block sizes ranging from 32 KB to 4 MB. In the "N-to-1" read case, N read processes access a single 256 GB file using message block sizes ranging from 32 KB to 4 MB. From 4 to 256 processes are used in both the "N-to-N" and "N-to-1" read tests. The scaling read result for the "N-to-N" case in Figure 8-a shows that we almost reach the estimated theoretical maximum bandwidth of 1600 MB/sec across four Panasas storage shelves as the message block size approaches 4 MB. This is because larger message block sizes imply fewer transactions, reducing packet transmission overhead. The overall result of the "N-to-1" read, shown in Figure 8-b, is slightly worse (10%) than the "N-to-N" case, due to the root causes mentioned for the "N-to-1" write case.
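The effect of message block size on read bandwidth can be illustrated with a simple amortization model: each transaction carries a fixed cost, so small blocks pay that cost many times over. Both the model and the 0.2 ms per-transaction overhead are assumptions for illustration, not measurements from Pink.

```python
# Illustrative model of why 4 MB blocks approach the peak while 32 KB
# blocks do not: a fixed per-transaction cost amortized over each block.
# The per-transaction overhead is an assumed figure, not a measured one.
PEAK_MBPS = 1600.0         # ETMB of four shelves
PER_TXN_OVERHEAD_S = 2e-4  # assumed fixed cost per read transaction

def modeled_bw(block_bytes: int) -> float:
    """MB/sec after charging each block a fixed transaction overhead."""
    transfer_s = block_bytes / (PEAK_MBPS * 1e6)
    return block_bytes / (transfer_s + PER_TXN_OVERHEAD_S) / 1e6

bw_32k = modeled_bw(32 * 1024)    # many transactions -> overhead dominates
bw_4m = modeled_bw(4 * 1024**2)   # few transactions -> near the ETMB
```

Under these assumed numbers the model reproduces the qualitative shape of Figure 8-a: bandwidth rises steeply with block size and flattens near the theoretical maximum as the per-transaction cost becomes negligible.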

Figure 7-a: N-to-N effective write bandwidth (minimum bandwidth in MB/sec vs. message size, 0.5-64 MB, for 2-256 processes)

Figure 8-a: N-to-N effective read bandwidth (MB/sec vs. message block size, 32 KB-4 MB, for 4-256 PEs)

Figure 8-b: N-to-1 effective read bandwidth (MB/sec vs. message block size, 32 KB-4 MB, for 4-256 PEs)

Figure 7-b: N-to-1 effective write bandwidth (minimum bandwidth in MB/sec vs. message size, 0.5-64 MB, for 2-256 processes)

5.3 Scaling tests: single read on one file partition vs. dual read on two file partitions

We also conducted bandwidth scaling/aggregation tests that covered:
1) Single concurrent read on one file partition (scratch1, with an estimated theoretical maximum bandwidth (ETMB) of 1600 MB/sec),
2) Dual concurrent read on two file partitions (scratch1 and scratch2, with an ETMB of 3200 MB/sec),
3) 128 processes and 256 processes, and
4) Message sizes of 2 MB and 4 MB.
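The ETMB figures used throughout these tests follow directly from the stated per-disk rate, and can be checked with a one-line calculation:

```python
# ETMB derivation as stated in the paper: 40 MB/sec per disk drive,
# 10 disk drives per Panasas shelf.
def etmb_mbps(shelves: int, per_disk_mbps: int = 40, disks_per_shelf: int = 10) -> int:
    """Estimated theoretical maximum bandwidth of a Panasas partition."""
    return per_disk_mbps * disks_per_shelf * shelves

one_partition = etmb_mbps(4)   # scratch1: four shelves -> 1600 MB/sec
two_partitions = etmb_mbps(8)  # scratch1 + scratch2: eight shelves -> 3200 MB/sec
```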

As the performance result in Figure 9 (“N-to-N” read) and Figure 10 (“N-to-1” read) show, we gained almost 9098% bandwidth growth when we applied dual concurrent read on two separate file partitions. In the “N-to-N” read test, performance results showed almost linear bandwidth improvement from the comparison of using four shelves Panasas ActiveScale storage (scratch1) and using eight shelves of Panasas ActiveScale storage (scratch1 and scratch2).

[Figure 9-a: "N-to-N" read, 2 MByte message size - bandwidth (MB/sec) for 128PE, 256PE, and ETMB, reading one file partition vs. two file partitions]

[Figure 9-b: "N-to-N" read, 4 MByte message size - bandwidth (MB/sec) for 128PE, 256PE, and ETMB, reading one file partition vs. two file partitions]

[Figure 10-a: "N-to-1" read, 2 MByte message size - bandwidth (MB/sec) for 128PE, 256PE, and ETMB, reading one file partition vs. two file partitions]

[Figure 10-b: "N-to-1" read, 4 MByte message size - bandwidth (MB/sec) for 128PE, 256PE, and ETMB, reading one file partition vs. two file partitions]

5.4 I/O network consolidation and cost reduction

The I/O infrastructures shown in Approach-1 (directly attached storage) and Approach-2 (storage attached to a federated Ethernet switch) are used in several large-size clusters, such as LLNL's (Lawrence Livermore National Laboratory) Multi-Cluster Global Scalable Storage System, LLNL's Blue Gene/L system, LANL's ASC/Q machine, and a previous configuration of LANL's Lightning cluster. We have done a cost/performance comparison between the Approach-2 federated switch-fabric infrastructure and the PaScal infrastructure for the deployment of LANL's Pink 1024-node cluster equipped with eight shelves of the Panasas File System.

  Equipment              Approach-2           PaScal
  Extreme Switch 6816    6 for $1,344,000     N/A
  Extreme Switch 6808    N/A                  1 for $116,000
  GigE NIC cards         1024 - $51,200       64 - $3,200
  GigE cables            1600 - $8,000        64 - $320
  Port counts            1600                 180
  Total Cost             $1,403,200           $119,520

Table-1: Cost comparison of Approach-2 and PaScal infrastructures

Table-1 lists the I/O equipment and prices used to build both infrastructures. PaScal can save up to 91.5% on networking equipment, reduce the Gigabit Ethernet port count by 89%, reduce the number of Gigabit NIC cards by 93.75%, and decrease the Gigabit Ethernet cable management complexity ratio from 25 to 1. Table-2 lists the comparison results of the PaScal vs. Approach-2 architectures. The major cost reduction of the PaScal I/O infrastructure comes from eliminating superfluous high-end, high-port-count Gigabit switches while maintaining the same I/O bandwidth.

  Comparison Category                      PaScal vs. Approach-2
  Cost saving                              91.5%
  Gigabit NIC card reduction               93.75% (1:16)
  Cable management complexity ratio        1:25
  Gigabit port count reduction             88.75%
  Bandwidth reduction from accessing       No bandwidth reduction: PaScal reaches the maximum
  the 8 shelves of the Panasas File        allowable hardware bandwidth of the 8 Panasas shelves
  System                                   (two file systems), compared to the Approach-2
                                           infrastructure (see note 5)

Table-2: Comparison of PaScal vs. Approach-2

Note 5: We have run both Approach-2 (a previous configuration) and the PaScal I/O infrastructure (the current configuration) on LANL's Lightning 1600+ node cluster and observed performance results from both approaches.
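The reduction percentages in Table-2 can be reproduced directly from the Table-1 line items; the following check recomputes them (all figures taken from Table-1):

```python
# Per-infrastructure equipment figures from Table-1.
approach2 = {"switches": 1_344_000, "nics": 51_200, "cables": 8_000,
             "nic_count": 1024, "ports": 1600}
pascal    = {"switches":   116_000, "nics":  3_200, "cables":   320,
             "nic_count":    64, "ports":  180}

def reduction(before, after):
    """Percent reduction going from Approach-2 to PaScal."""
    return round(100 * (before - after) / before, 2)

total_a2 = approach2["switches"] + approach2["nics"] + approach2["cables"]
total_ps = pascal["switches"] + pascal["nics"] + pascal["cables"]

print(total_a2, total_ps)                                   # 1403200 119520
print(reduction(total_a2, total_ps))                        # 91.48 (Table-2 rounds to 91.5%)
print(reduction(approach2["nic_count"], pascal["nic_count"]))  # 93.75
print(reduction(approach2["ports"], pascal["ports"]))          # 88.75
```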


6 DISCUSSIONS

Performance results from our parallel MPI-IO benchmarks on the Pink cluster demonstrate that the proposed PaScal I/O infrastructure is robust and capable of scaling bandwidth on large-size Linux-based clusters. The scaling/aggregating test results show that the PaScal I/O infrastructure can preserve bandwidth on I/O nodes and meet the seamless capacity growth required of a global storage/file system.

To date, PaScal has been implemented on several Linux-based clusters at LANL: Lightning, a 1,600+ node cluster; Pink, a 1,024 node cluster; Flash/Gordon, a 900+ node multi-cluster; and the Turquoise Linux Cluster (TLC), a 128 node cluster. PaScal is being implemented on LANL's Bolt, a 2,000+ node AMD Opteron cluster that will be equipped with 72 Panasas shelves (360 TB capacity) and an expected 30 GB/sec of I/O bandwidth in a production environment. We have also applied the PaScal I/O infrastructure in a multi-cluster, multi-platform environment spanning the Pink cluster (Intel IA32 based) and the TLC cluster (AMD Opteron based). We have learned that each PaScal deployment can be recursively and effectively revised and applied to the next iteration of PaScal deployment, and this experience guides our direction for the next generation of PaScal I/O infrastructure research.

Our performance and scaling results also demonstrate the advantages of PaScal in utilizing 1) Linux multipath equalized routing for load balancing, 2) Gigabit Ethernet Layer-3 switch Equal Cost Multi-Path (ECMP) routing, 3) robust OSPF routing capability, and 4) concurrent read/write capability. More I/O nodes and Gigabit Ethernet switches can be added linearly to the Level-2 interconnect system when more storage access bandwidth is needed from the Level-1 compute nodes.
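The ECMP behavior referenced above typically selects among equal-cost next hops by hashing flow-identifying header fields, so each TCP connection is pinned to one path while different connections spread across the available I/O routes. A simplified sketch of that selection (the hash shown is illustrative, not any particular switch's algorithm):

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow's
    addressing fields, so a given flow always follows a single path
    (no per-packet reordering) while distinct flows are load-balanced."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha1(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical I/O-node gateways at Level-2:
io_routes = ["10.1.2.1", "10.1.2.2", "10.1.2.3", "10.1.2.4"]

# The same flow always maps to the same route...
a = ecmp_next_hop("10.0.0.5", "10.2.0.9", 40000, 3260, io_routes)
b = ecmp_next_hop("10.0.0.5", "10.2.0.9", 40000, 3260, io_routes)
assert a == b and a in io_routes
# ...while many flows spread across all of them.
```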

To leverage the PaScal capabilities, we are currently conducting research in the following areas: (1) fault-tolerant application support, (2) fault protection management [11], (3) using SCTP multi-homing failover [16] in file system client software for TCP/socket connection migration, (4) applying 10-Gigabit Ethernet in the I/O node group, (5) using OpenIB [20] software: IPoIB, SDP, and SRP, (6) using iSER [20] based global storage, (7) supporting jumbo frames for end-to-end communication, (8) supporting Ethernet bonding in the I/O node group, (9) applying PaScal to multi-clusters on the order of hundreds of teraflops of scientific computing, (10) enhancing Linux multipath routing with dead gateway detection capability [17], (11) tuning Linux networking and routing parameters for RAS and performance [17], and (12) deploying the PaScal I/O infrastructure with the Lustre [14], PVFS2 [15], and NFSv4 parallel file systems.

Compared to the architecture in Approach-2, the proposed PaScal architecture clearly provides a great deal of network consolidation and cost saving by reducing unnecessary (1) Ethernet interfaces, (2) Ethernet cables, and (3) high-end Ethernet switches. PaScal also eliminates the problems with the architectures described in Approach-1 and Approach-2. It is known that challenging scientific computing requires deep data [21]. PaScal is a scalable parallel I/O infrastructure that provides the networking capability to satisfy constantly increasing computing power and global storage/file needs. PaScal not only provides good I/O bandwidth at scale, but it also scales with very little incremental cost. When federated switch fabrics are scaled in the Level-2 interconnect (Approach-2), the cost does not grow merely with port count; it grows much faster than that. PaScal avoids the federated-switch effect on cost by taking advantage of the non-blocking Fat-Tree or Clos tree (Level-1) already present in the clusters. PaScal also provides a way to connect multiple clusters to global storage with linear scaling, load balancing, and failover, such that adding more I/O bandwidth in the Level-2 interconnect does not increase cost at the same rate as growing a federated switch fabric.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed an innovative parallel and scalable I/O networking framework called PaScal. The PaScal I/O infrastructure successfully integrates several hardware and software components to achieve high-performance scale-up and scale-out networking capabilities on industry-standard Linux-based clusters. PaScal employs several unique properties: (1) multi-level switch-fabric interconnect systems, (2) a bandwidth-on-demand scaling I/O networking architecture, (3) open-standard IP networks (routing and switching), (4) multipath routing for load balancing and failover, (5) open source OSPF routing software, and (6) support for a global file system in multi-cluster and multi-platform environments.

8 REFERENCES

[1] Douglas L. Farley, "Performance Comparison of Mainframe, Workstations, Clusters, and Desktop Computers", NASA/TM-2005-213505, 2005.
[2] Rob Latham, Rob Ross, and Rajeev Thakur, "The Impact of File Systems on MPI-IO Scalability".
[3] Fabrizio Petrini et al., "The Quadrics Network: High-Performance Clustering Technology", IEEE Micro, Feb. 2002.
[4] N.J. Boden et al., "Myrinet: A Gigabit-per-Second Local Area Network", IEEE Micro, Feb. 1995.
[5] Tony Shanley, InfiniBand Network Architecture, MindShare, 2002.
[6] Rajeev Thakur, William Gropp, and Ewing Lusk, "On Implementing MPI-IO Portably and with High Performance", Proc. of the 6th Workshop on I/O in Parallel and Distributed Systems, May 1999.
[7] Andy D. Hospodor and Ethan L. Miller, "Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems", Proc. of the 2004 IEEE Goddard Conference on Mass Storage Systems and Technologies.
[8] Rajeev Thakur and William Gropp, "Parallel I/O", Chapter 13 of the CRPC Handbook of Parallel Computing, J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, and A. White (eds.), Morgan Kaufmann Publishers.
[9] Micah Beck, Terry Moore, and James S. Plank, "An End-to-End Approach to Globally Scalable Network Storage", SIGCOMM'02, August 2002.
[10] Charles Blake and Rodrigo Rodrigues, "High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two", MIT Laboratory for Computer Science.
[11] Noel Burton-Krahn, "HotSwap - Transparent Server Failover for Linux", Proc. of LISA 2002.
[12] Renato John Recio, "Server I/O Networks Past, Present, and Future", Proc. of the ACM SIGCOMM 2003 Workshop.
[13] David Nagle, Denis Serenyi, and Abbie Matthews, "The Panasas ActiveScale Storage Cluster - Delivering Scalable High Bandwidth Storage", Proc. of SC'04, Nov. 2004.
[14] Evan J. Felix et al., "Active Storage Processing in a Parallel File System", 2005 Linux HPC Revolution Conference.
[15] Rob Latham et al., "PVFS-2: File Systems for Linux Clusters", Linux World, January 2004.
[16] Randall R. Stewart and Qiaobing Xie, Stream Control Transmission Protocol (SCTP), Addison-Wesley.
[17] Bert Hubert, Linux Advanced Routing & Traffic Control HOWTO, Version 1.0.0.
[18] Michael Petry, "The Paths of the Zebra", Linux Magazine, June 2003.
[19] Bob Woodruff et al., "Introduction to the InfiniBand Core Software", Linux Symposium 2005, Vol. 2.
[20] W.T.C. Kramer, A. Shoshani, D.A. Agarwal, B.R. Draney, G. Jin, G.F. Butler, and J.A. Hules, "Deep Scientific Computing Requires Deep Data", IBM Journal of Research and Development, Vol. 48, No. 2, March 2004.
[21] R.L. Brady, J.E. Garlick, and R.J. Goldstone, "Achieving Order through CHAOS: the LLNL HPC Linux Cluster Experience", UCRL-JC-153559, June 2003.
[22] Kim Cupps, "Blue Gene/L System Software Update", UCRL-PRES-210094, Feb. 23, 2005.
[23] Fabrizio Petrini, Juan Fernandez, Eitan Frachtenberg, and Salvador Coll, "Scalable Collective Communication on the ASCI Q Machine", Hot Interconnects 11, August 2003.
[24] Aaron E. Darling, Lucas Carey, and Wu-chun Feng, "The Design, Implementation, and Evaluation of mpiBLAST".
[25] Jianwei Li, Wei-keng Liao, Alok Choudhary, et al., "Parallel netCDF: A High-Performance Scientific I/O Interface", SC'03.
[26] Robert Ross and Daniel Nurmi, "A Case Study in Application I/O on Linux Clusters", SC 2001.
[27] Heshan Lin, Xiaosong Ma, Praveen Chandramohan, Al Geist, and Nagiza Samatova, "Efficient Data Access for Parallel BLAST".
[28] Chris H.Q. Ding and Yun He, "Data Organization and I/O in a Parallel Ocean Circulation Model", SC'99.
[29] Phyllis E. Crandall, Ruth A. Aydt, Andrew A. Chien, and Daniel A. Reed, "Input/Output Characteristics of Scalable Parallel Applications", SC'95.

Appendix-A

NumCompNode = M
NumIONode = N
BprocComp[i], i = 1..M
BprocIO[j], j = 1..N
LaneComp[a,b], a = 1..A, b = 1..B, M = A*B
LaneIO[a,c],  a = 1..A, c = 1..C, N = A*C

// Part-A: Compute node logical ID and IP address mapping
i = 1
for (a = 1; a <= A; a++)
  for (b = 1; b <= B; b++) {
    map BprocComp[i] to the IP address of LaneComp[a,b]
    i = i + 1
    if (i > M) break
  }

// Part-B: IO node logical ID and IP address mapping
j = 1
for (a = 1; a <= A; a++)
  for (c = 1; c <= C; c++) {
    map BprocIO[j] to the IP address of LaneIO[a,c]
    j = j + 1
    if (j > N) break
  }

// Part-C: Compute node C-way multipath routing
for (a = 1; a <= A; a++)
  ...
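The Appendix-A mapping can be read as follows, sketched here in Python (variable names mirror the pseudocode; the exact assignment rule inside the loops is an assumption, since the loop bodies did not survive typesetting): compute nodes are dealt across A lanes of B nodes each, I/O nodes across the same A lanes with C nodes each, and each compute node is given C equal-cost routes, one through each I/O node of its own lane.

```python
def build_lanes(num_nodes, A, per_lane):
    """Deal node logical IDs 1..num_nodes across A lanes,
    per_lane nodes to a lane (Parts A and B of the pseudocode)."""
    lanes = {a: [] for a in range(1, A + 1)}
    node = 1
    for a in range(1, A + 1):
        for _ in range(per_lane):
            if node > num_nodes:
                return lanes
            lanes[a].append(node)
            node += 1
    return lanes

def multipath_routes(comp_lanes, io_lanes):
    """Part C: give every compute node C equal-cost routes,
    one via each I/O node in its own lane."""
    return {n: list(io_lanes[a])
            for a, nodes in comp_lanes.items() for n in nodes}

# Small example: A=2 lanes, B=3 compute nodes and C=2 I/O nodes per lane,
# so M = A*B = 6 and N = A*C = 4.
A, B, C = 2, 3, 2
comp = build_lanes(6, A, B)          # {1: [1, 2, 3], 2: [4, 5, 6]}
io   = build_lanes(4, A, C)          # {1: [1, 2],    2: [3, 4]}
routes = multipath_routes(comp, io)
print(routes[5])                     # compute node 5 (lane 2) -> [3, 4]
```

Under this reading, adding I/O bandwidth means adding lanes (or I/O nodes per lane), and each compute node's route table grows by exactly the new paths in its lane, which matches the linear-scaling claim in the discussion.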