An Intelligent Parallel and Scalable Server I/O Networking Environment for High Performance Cluster Computing Systems

Hsing-bung (HB) Chen, Parks Fields, Alfred Torrez
HPC-5, Los Alamos National Lab, Los Alamos, New Mexico 87545
{hbchen, parks, atorrez}@lanl.gov, 505-665-3591

Abstract

We developed an intelligent server I/O fault-management mechanism used with Equal Cost Multi-Path routing to enable LANL's PaScalBB-based High Performance Computing cluster systems to run computational jobs 24x7 without service interruption during unexpected physical I/O link failures. This mechanism, named Dead Server I/O Gateway Detection and Recovery (DGD), can detect a data path connectivity problem within seconds of its occurrence, remove the entry of a dead I/O gateway from a Multi-Path routing table, migrate the connecting I/O path to an available entry in the Multi-Path routing table, and then resume the existing I/O data stream. DGD can tolerate multiple single points of failure, keep the streaming I/O data moving, and seamlessly continue and finish computation jobs. We have developed a proof-of-concept implementation of this DGD mechanism on a large-scale Linux cluster as a blueprint for a production-type Reliability-Availability-Serviceability (RAS) solution.

Keywords: Reliability, Dependability, Cluster Computing, Parallel File System, Parallel I/O

1. Introduction

A key challenge in Petascale computing systems is to supply highly reliable/dependable and high throughput I/O access so we can keep feeding a sufficient amount of data to run computational jobs 24x7 without service interruption when unexpected physical I/O link failures happen. Many research and development activities focus on providing failure detection and avoiding single points of failure in the networking subsystem. We are aware of no prior research work attempting to address both failure detection and automatic failover to a working connection path. The challenging problem is to detect

and correct network routes end to end, from compute nodes to the global storage. It is in this area that we are unaware of anyone doing end-to-end routing recovery in parallel [1] [2] [3] [4] [5] [6] [7] [8] [9]. In this paper we develop an intelligent server I/O fault-management mechanism used with Equal Cost Multi-Path routing to enable LANL's PaScalBB [10] based High Performance Computing systems to run computational jobs 24x7 without service interruption during unexpected physical I/O link failures and connection loss. The coming of the Petascale computing era poses multidimensional challenges (system software, application software, computing system, high speed interconnect, high bandwidth server I/O, global storage and file system) and opportunities for supercomputing. LANL's Parallel Scalable Back Bone (PaScalBB) architecture was designed to meet those challenges and has proven to provide a feasible, affordable pathway that easily accommodates our growing scale. The proposed intelligent server I/O fault-management mechanism, named Dead Server I/O Gateway Detection and Recovery (DGD), can detect a data path connectivity problem within seconds of its occurrence, remove the entry of a dead I/O gateway from a Multi-Path routing table, migrate the connecting I/O path to an available entry in the Multi-Path routing table, and then preserve/resume the existing I/O data stream. It can tolerate multiple single points of failure, keep the streaming I/O data moving, and seamlessly continue and finish computation jobs. We have developed a proof-of-concept implementation of this proposed DGD mechanism on several PaScalBB based Linux clusters as blueprints for a production-type Reliability-Availability-Serviceability (RAS) solution. We are currently

This work was carried out under the auspices of the National Nuclear Security Administration of the U.S. Department of Energy at Los Alamos National Laboratory under Contract No. DE-AC52-06NA25396. LA-UR-07-8225

planning to deploy this proposed DGD mechanism on all of LANL's PaScalBB based HPC production clusters.

2. LANL's PaScalBB Server I/O backbone architecture

LANL's PaScalBB adopts several hardware and software components to provide a unique and scalable server I/O networking architecture. Figure-1 shows the system components used in PaScalBB.

2.1 Hardware Components used in PaScalBB

2.1.1 Level-1 High Speed Interconnection Network
The Level-1 interconnect uses (a) high speed interconnect systems such as Quadrics, Myrinet, or Infiniband to fulfill the requirements of low latency, high speed, high bandwidth cluster IPC communication and (b) aggregated I/O-aware multi-path routes for load balancing and failover.

2.1.2 Level-2 IP based Interconnection Network
The Level-2 interconnect uses multiple Gigabit Ethernet switches/routers with layer-3 network routing support to provide latency-tolerant I/O communication and global IP based storage systems. Without using a "Federated network" solution, we can linearly expand the Level-2 IP based network by employing a global host domain multicasting feature in the metadata servers of a global file system. With this support we can maintain a "single name space" global storage system and provide a linearly growing cost path for I/O networking.

2.1.3 Compute node
A Compute node is equipped with at least one high-speed interface card connected to a high-speed interconnect fabric in Level-1. The node is set up with Linux multi-path equalized routing to multiple available I/O nodes for load balancing and failover (high availability). A Compute node is used for computing only and is not involved in any routing activities.

2.1.4 I/O node
An I/O routing node has two network interfaces. One high-speed interface card is connected to the Level-1 network for communication with Compute nodes. One or more Gigabit Ethernet interface cards (bondable) are connected to the Level-2 linear scaling Gigabit switches. I/O nodes serve as the routing gateways between the Level-1 and Level-2 networks. Every I/O node has the same networking capability.

2.2 System Software Components used in PaScalBB

2.2.1 Equal Cost Multi-Path Routing for load balancing
Multi-path routing is used to provide balanced outbound traffic to the multiple I/O gateways. It also supports failover and dead-gateway detection for choosing good routes from active I/O gateways. Linux multi-path routing is a destination address-based load-balancing algorithm. Multi-path routing should improve system performance through load balancing and reduced end-to-end delay. Multi-path routing overcomes the capacity constraint of "single-path routing" and routes through less congested paths. Each Compute node is set up with N-way multi-path routes through "N" I/O nodes (a minimal sketch of such a route appears at the end of this section). Multi-path routing also balances the bandwidth gap between the Level-1 and Level-2 interconnects. We use the Equal Cost Multi-Path (ECMP) routing strategy on compute nodes so compute nodes can evenly distribute traffic workloads over all I/O nodes. With this bi-directional multi-path routing (Figure-2) we can sustain parallel data paths for both write (outbound) and read (inbound) data transfers. This is especially useful when applied to concurrent socket I/O sessions on IP based storage systems. PaScalBB can evenly allocate socket I/O sessions to the available I/O routing nodes.

2.2.2 OSPF routing used in I/O nodes
OSPF routing capability in I/O nodes and Level-2 Ethernet switches is used to efficiently manage the inbound and outbound traffic for bi-directional load balancing. We evenly assign I/O nodes into multiple subnets and create corresponding VLANs in the Level-2 Ethernet switches to work with each I/O node subnet. Instead of advertising the whole compute node community from each I/O node, I/O node sub-netting drastically reduces compute node route advertising overhead: each I/O node advertises only the routes within its subnet. OSPF routing overhead is not seen here due to the limited number of routing hops used in the PaScalBB server I/O network. We also designate each I/O Subnet/VLAN as a "stub area with no summary". This prevents the I/O node and the switch's VLAN from advertising route summaries (external routes, optional inter-area routes) into the I/O node's subnet stub area. Using "stub area with no summary" can significantly reduce the size of the routing tables in I/O nodes and provide some isolation in the area/subnet from changes in topology outside the area/subnet. This also eliminates type-3/type-4/type-5 LSA-Summary messages and reduces 70%~75% of

OSPF routing overhead. With OSPF dynamic routing capability, we can gradually grow I/O node subnets to accommodate a multi-cluster environment without any impact on the existing compute node community [10].
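
To make the N-way ECMP setup from Section 2.2.1 concrete, the following Python sketch builds the iproute2 multipath command a compute node might use; the gateway addresses, the interface name, and the choice of rewriting the default route (rather than a storage-subnet route) are illustrative assumptions, not the production configuration.

import subprocess

# Hypothetical I/O gateway addresses reachable over the Level-1 fabric (e.g. IPoIB)
IO_GATEWAYS = ["10.10.0.1", "10.10.0.2", "10.10.0.3", "10.10.0.4"]
LEVEL1_DEV = "ib0"   # assumed Level-1 interface name on the compute node

def build_ecmp_route_cmd(gateways, dev):
    """Build an 'ip route replace' command with one equal-weight nexthop per I/O gateway."""
    cmd = ["ip", "route", "replace", "default", "scope", "global"]
    for gw in gateways:
        cmd += ["nexthop", "via", gw, "dev", dev, "weight", "1"]
    return cmd

if __name__ == "__main__":
    cmd = build_ecmp_route_cmd(IO_GATEWAYS, LEVEL1_DEV)
    print(" ".join(cmd))                    # inspect the command before running it
    # subprocess.run(cmd, check=True)       # would require root on a compute node

Installing one equal-weight nexthop per I/O gateway is what lets the kernel spread outbound socket I/O sessions across all I/O nodes, as described above.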

2.3 PaScalBB I/O Environment on Multi-clusters

With (a) network Layer-2 and Layer-3 fail-over support from the Linux kernel routing implementation and Ethernet switch capabilities and (b) global multicast domain support from scalable metadata servers, the PaScalBB I/O networking architecture can support a global storage system in a heterogeneous multi-cluster and Grid environment. Figure-3 illustrates the top-level view of PaScalBB in a heterogeneous multi-cluster and Grid environment. We can apply PaScalBB to support a heterogeneous multi-cluster environment that consists of several independent large-scale cluster systems, possibly managed by separate research organizations. PaScalBB provides the ability to mount a single name space global file system across all clusters. Each cluster maps its I/O routing paths through multiple "IO-Lanes". An "IO-Lane" consists of a group of I/O nodes managed and routed by an individual Gigabit/10-Gigabit Ethernet switch. Each IO-Lane provides accessibility to a set of storage subnets. We then use global domain multicasting to maintain a global file system with a single name space. With this we can support a Petascale global file system accessible to multi-cluster environments using the PaScalBB I/O architecture. We can linearly add more "IO-Lanes" into the PaScalBB I/O architecture to meet the increasing bandwidth demand of global parallel file systems. Remote Grid computing facilities, with PaScalBB IO Lanes using multiple Gigabit Ethernet links or multiple 10-Gigabit Ethernet links, can participate in the sharing of a distance-less remote/global storage/file system through long-haul optical links. The purpose of using IO Lanes is to mitigate the bandwidth limitation of a single switch and provide a linear growth path for the IP storage network. Currently all of LANL's production HPC clusters are implemented with the PaScalBB Server I/O backbone architecture.

3. Dead Server I/O Gateway Detection and Recovery (DGD)

A key challenge on PaScalBB based HPC systems is to supply highly reliable/dependable and high throughput I/O access so we can keep feeding a sufficient amount of data to run computational jobs 24x7

without service interruption during unexpected physical I/O link failures and connection loss. Figure-4 shows the six major steps involved in the proposed DGD mechanism. The active Master node is a typical head node or service node that manages the whole computing cluster, both compute nodes and I/O nodes. A standby Master node is also equipped to provide high-availability support for the active Master node. The active Master node has connections to both the Level 1 and Level 2 networks. The DGD mechanism is implemented and runs as a background daemon process in the Master node.

3.1 State diagram of the DGD's path connectivity checking

For each path-connectivity check (Level-1 connectivity, Level-2 NIC connectivity, Level-2 switch connectivity, cluster management network connectivity, Global File system connectivity, and OSPF routing module status) in the DGD, there is a state machine associated with it (Figure-5). There are two modes for path connectivity: Active and InActive. The DGD daemon process starts up and immediately goes to Active mode. Each path connectivity is checked periodically; the interval between two consecutive checks depends on the number of I/O gateways. State changes are based on the number of consecutive failures of a path connectivity check. The state is set to "Active" when all types of path connectivity pass their checks; otherwise it is set to "InActive".


Figure-5: DGD state diagram for each path connectivity checking

The state is transferred from "Active" to "InActive" when one or more I/O gateways are detected as having no connectivity during a path connectivity checking period. Dead route(s) of those I/O gateway(s) are removed from the ECMP route table immediately. The state itself changes only after the count of consecutive failed connectivity checks exceeds the limit. The state is transferred from "InActive" to "Active" when one or more dead I/O gateways are detected as "alive" (i.e., they pass all path-connectivity checks). New route(s) for the detected alive I/O gateways are added back to each Compute node's ECMP route table immediately.
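
A minimal sketch of the per-path state machine described above is shown below. The fault-count threshold and the shape of the check callback are assumptions for illustration; note that the real DGD also removes dead routes immediately on a failed check rather than waiting for the state change.

class PathState:
    """Tracks one path-connectivity check (e.g. Level-1, Level-2, global file system)."""
    ACTIVE, INACTIVE = "Active", "InActive"

    def __init__(self, name, max_fault_count=3):
        self.name = name
        self.max_fault_count = max_fault_count   # assumed threshold, not the production value
        self.fault_count = 0
        self.state = self.ACTIVE                 # DGD starts every path in Active mode

    def update(self, check_passed):
        """Apply one periodic check result and return True if the state changed."""
        previous = self.state
        if check_passed:
            self.fault_count = 0
            self.state = self.ACTIVE
        else:
            self.fault_count += 1
            if self.fault_count >= self.max_fault_count:
                self.state = self.INACTIVE
        return previous != self.state

# Example: three consecutive failed checks flip the path to InActive
level1 = PathState("Level-1 connectivity")
for ok in (True, False, False, False):
    level1.update(ok)
print(level1.state)   # "InActive"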

3.2 DGD processing steps

The DGD daemon process periodically checks the status of path connectivity between the master nodes, compute nodes, I/O nodes, Level-2 network, and Global File systems, and the health status of the OSPF routing module on each I/O node. The DGD processing steps are described as follows:

Step 0: initDGD( );  # Initialize DGD processing data structures
  ECMPRouteChanged ← NO
  PreviousStatus(I/O node) i=1 to n ← Active
  CurrentStatus(I/O node) i=1 to n ← Active
  OSPFRoutingStatus(I/O node) i=1 to n ← Active

Step 1: Check Management network connectivity
  o Master node → I/O nodes (Master node health status: Active vs. InActive)

Step 2: Check Level-1 connectivity:
  o Master node → Level 1 Network → I/O nodes (Level 1 interface)
  o Master node → Level 1 Network → I/O nodes (Level 2 interface)

Step 3: Check Level-2 connectivity:
  o Master node → Level 2 Network → I/O nodes (Level 2 interface)
  o Master node → Level 2 Network → I/O nodes (Level 1 interface)

Step 4: Check Global File System connectivity:
  o Master node → Level 1 Network interface → Global File System

Step 5: Check status of the Level-2 Interconnect:
  o Master node → Level 2 Storage network

Step 6: Check OSPF status on both the I/O nodes and the Level-2 interconnect:
  o OSPFRoutingStatus: Active vs. InActive
  If (OSPFRoutingStatus is InActive)
    Set OSPFRoutingFlag ← RestartOSPFModule

Step 7: Summarize the step 1 - step 6 checks, set the current I/O node status (CurrentStatus(I/O node)), update the I/O node state machine, and detect ECMP route fault events
  If (PreviousStatus(I/O node) equals CurrentStatus(I/O node)) {
    Active → Active: No ECMP route change
    InActive → InActive: No ECMP route change
    ECMPRouteChanged ← NO
  }
  If (PreviousStatus(I/O node) not equal CurrentStatus(I/O node)) {
    Active → InActive: Remove dead/inactive I/O node from ECMP route
    InActive → Active: Add new Active I/O node back to ECMP route
    ECMPRouteChanged ← Yes
  }

Step 8: Adaptive Event-handling
  If (ECMPRouteChanged) {
    + Generate syslog information
    + Email a DGD status update notification
    + Record DGD history status
    + Adaptively change the Equal Cost Multi-Path (ECMP) route
    + Add or remove I/O nodes from the ECMP route
  }
  If (OSPFRoutingFlag is RestartOSPFModule) {
    Stop the OSPF routing module; Start the OSPF routing module
  }

Step 9: sleep N seconds;  // kick off the DGD process every N seconds
Step 10: Repeat Step 1 through Step 9

The proposed DGD mechanism has the following intelligent properties:
1) It is dependable
  a. Checks I/O node connectivity status periodically in the Level-1 network, the Level-2 network, and the Global File system,
  b. Checks OSPF routing status periodically both in I/O nodes and the Level-2 network, and
  c. Records I/O network status history for event handling
2) It is adaptive
  a. Dynamically adjusts the ECMP route in each compute node based on the latest Server I/O network status
    i. when I/O nodes lose connectivity in the Level-1 or Level-2 networks
    ii. when I/O nodes return to the active service state,
  b. Avoids dead I/O paths during computing,
  c. Seamlessly migrates I/O paths to available I/O nodes, and
  d. Continues and finishes computation jobs without any service interruption
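
Collapsing Steps 0-10 into a single loop, a much-simplified sketch of the DGD daemon could look like the following; the host names, the single ICMP probe standing in for the Step 1-6 checks, and the route-update helper are all hypothetical placeholders rather than the actual implementation.

import subprocess
import time

IO_NODES = ["io1", "io2", "io3", "io4"]   # hypothetical I/O gateway host names
CHECK_INTERVAL = 30                        # assumed N seconds between DGD passes

def is_reachable(host):
    """Single ICMP probe with a short timeout; a stand-in for the full Step 1-6 checks."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def push_ecmp_route(active_gateways):
    """Placeholder for rebuilding the compute nodes' ECMP route from the live gateways."""
    print("rebuilding ECMP route with:", active_gateways)

def dgd_loop():
    previous = {node: True for node in IO_NODES}                     # Step 0: all start Active
    while True:
        current = {node: is_reachable(node) for node in IO_NODES}    # Steps 1-6 (collapsed)
        if current != previous:                                      # Step 7: fault event detected
            push_ecmp_route([n for n, ok in current.items() if ok])  # Step 8: adapt routes
            previous = current
        time.sleep(CHECK_INTERVAL)                                   # Step 9: kick off again

if __name__ == "__main__":
    dgd_loop()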

4. Implementation, Experiments, and Verification

We have implemented this proposed DGD mechanism on a 256-node cluster, BlueSteel. The BlueSteel cluster has (a) 248 Compute nodes, (b) 8 I/O nodes, (c) a Voltaire Infiniband switch as the Level-1 interconnect, (d) a BlackDiamond-10000 Gigabit Ethernet switch as the Level-2 interconnect, (e) one Master node, and (f) 6 Panasas ActiveStorage shelves (30TB capacity, 24 Gigabit links). Every compute node is set up with 8-way ECMP routing. The Master node runs a DGD daemon process. We conduct a sequence of concurrent Write-then-Read testing cases: we first launch an 8-process concurrent N-to-N write operation to create eight 16GB data files on the Panasas File System, and we then kick off an 8-process N-to-N concurrent read operation that reads the files back from the Panasas File System.
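
For illustration only, an N-to-N concurrent write-then-read workload of this shape can be scripted as below; the file size is scaled down from the 16GB used in the experiment and the mount point is a hypothetical path, so this is a sketch of the workload pattern rather than the benchmark we ran.

import os
from multiprocessing import Pool

NUM_PROCS = 8
FILE_SIZE = 64 * 1024 * 1024            # 64 MB here; the paper's test used 16 GB per file
CHUNK = 1024 * 1024
TEST_DIR = "/panfs/scratch/dgd_test"    # hypothetical mount point of the global file system

def write_file(rank):
    # Each process writes its own file (N-to-N pattern)
    path = os.path.join(TEST_DIR, "file_%d.dat" % rank)
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE // CHUNK):
            f.write(os.urandom(CHUNK))
    return path

def read_file(rank):
    # Each process reads back the file it wrote
    path = os.path.join(TEST_DIR, "file_%d.dat" % rank)
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass
    return path

if __name__ == "__main__":
    os.makedirs(TEST_DIR, exist_ok=True)
    with Pool(NUM_PROCS) as pool:
        pool.map(write_file, range(NUM_PROCS))   # N-to-N concurrent write
        pool.map(read_file, range(NUM_PROCS))    # then N-to-N concurrent read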

4.1 N-to-N concurrent write then read testing

Figure-6 shows a successful 8-process concurrent N-to-N File Write then Read operation test.

Figure-6: Write/Read with no dead I/O gateway

4.2 N-to-N concurrent write then read testing with fault insertions and DGD operation

We then repeat the same 8-process concurrent N-to-N Write then Read testing and insert various fault events, such as disabling/enabling the Infiniband interface on an I/O node, disabling/enabling a Gigabit Ethernet interface on an I/O node, removing/restarting the OSPF daemon process on an I/O node, and shutting down/restarting an I/O node. The DGD is involved in the fault event detection and ECMP route recovery. Here is the detailed description of the DGD fault-event management (detection and recovery) in the concurrent Write-then-Read testing shown in Figure-7.

Step: Description of testing sequences
1. Start a Write operation; disable an Infiniband interface on an I/O node; File I/O traffic stops; DGD resumes the system to active status; then the Write operation continues
2. Shutdown two I/O nodes; File I/O stops; then DGD resumes the system to active status and the Write operation finishes
3. Start a Read operation; disable a Gigabit Ethernet interface on an I/O node; File I/O stops; then DGD resumes the system to active status and the Read operation continues
4. Disable the OSPF process on an I/O node; File I/O stops; DGD resumes the system to active status and the Read operation continues
5. Disable an Infiniband interface on an I/O node; File I/O stops; then DGD resumes the system to active status and the Read operation continues, and
6. Shutdown two I/O nodes; File I/O stops; then DGD resumes the system to active status and the Read operation finishes.

We also demonstrated how the DGD process minimizes the system suspended-time:
• Ti is the system down-time/suspended-time between the moment an unexpected failure event happens and the moment the DGD rediscovers the correct system I/O routing path.
• The system down-times (Ti, i=1 to 6, for the six fault-insertion events) illustrate why the DGD process plays an important role in sustaining a highly reliable and available computing environment for a High Performance Computing cluster.

Verified testing cases:
1) T1, T4, T5, & T6: Applying the DGD process to handle fault events within a reasonable time can put the system back into the ACTIVE service state in time and minimize the service-interruption time.
2) T2 & T3: Handling the fault-insertion events slowly, or without the DGD process, would cause longer or even unbounded system down-time.

Figure-7: Write/Read with DGD involved (duration of system suspended-time for T1 through T6)

4.3 Summary of DGD testing

There was no job hang or system corruption during our sequence of DGD fault-insertion tests. The DGD daemon process detected all fault-insertion events, adjusted the ECMP routes on all compute nodes to reflect the actual I/O gateway status, reported system event log information, notified system users by email, helped resume the file I/O operations of the interrupted compute nodes (interrupted because of dead I/O node gateways), and eventually helped all involved compute nodes finish the concurrent write-then-read operations.

5. Conclusions and Future Works

Many research and development activities focus on providing failure detection and avoiding single points of failure in the networking subsystem, and we are aware of no prior work attempting to address both failure detection and fault recovery in parallel. In this paper, we have presented the design and implementation of an adaptive and dependable Server I/O networking fault-management mechanism named DGD. Our proof-of-concept implementation, experiments, and verification demonstrate that this adaptive DGD server I/O fault-management mechanism is a significant enhancement to our PaScalBB based HPC systems and a best-in-class solution for sustaining the Reliability, Availability, and Serviceability requirements of Petascale computing systems. The DGD mechanism:
• Discovers critical Server I/O failures in time and seamlessly resumes suspended File I/O operations without service interruption on compute nodes,
• Ensures continuous availability of an HPC system,
• Tolerates multiple points of failure in Server I/O networking,
• Sustains a trustworthy service environment for data-intensive applications, and
• Incorporates a self-managed capability on Equal Cost Multi-Path Routing.

There are many useful and portable features that we plan to implement in the DGD process:
• A uniform, table-driven and database-backed implementation of this DGD mechanism on all of LANL's production HPC clusters, so we can have a unified DGD fault management community among all of LANL's production HPC clusters,
• Incorporation of this DGD process into existing general cluster resource management systems (such as Zenoss and Nagios), and



• Reports of historical DGD events (failure detection and fault recovery) for future HPC reliability and availability studies.

6. References
[1] Chokchai Leangsuksun, Tong Liu, Yudan Liu, Stephen L. Scott, Richard Libby, and Ibrahim Haddad, "Highly Reliable Linux HPC Clusters: Self-awareness Approach," Proceedings of the ISPA 2004 Conference, pp. 217-222.
[2] Victor C. Zandy and Barton P. Miller, "Reliable Network Connections," Proceedings of the MOBICOM 2002 Conference.
[4] Anshumal Sinha, Sandeep Sarat, and Jonathan S. Shapiro, "Network Subsystems Reloaded: A High-Performance, Defensible Network Subsystem," Proceedings of the USENIX Annual Technical Conference 2004.
[5] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He, "Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations," High Availability and Performance Computing Workshop 2006.
[6] Bianca Schroeder and Garth A. Gibson, "A Large-scale Study of Failures in High-Performance Computing Systems," Proceedings of the International Conference on Dependable Systems and Networks 2006.
[7] Bianca Schroeder and Garth A. Gibson, "Understanding Failures in Petascale Computers," Journal of Physics: SciDAC 2007, Conference Series 78 (2007).
[8] Christian Engelmann, Stephen L. Scott, Chokchai Leangsuksun, and Xubin He, "Symmetric Active/Active High Availability for High-Performance Computing System Services," Journal of Computers, vol. 1, no. 8, December 2006.
[9] Christian Engelmann et al., "MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems," ACM SIGOPS Operating Systems Review (OSR), 40(2), pages 63-72, 2006.
[10] Hsing-bung Chen, Gary Grider, and Parks Fields, "A Cost-Effective, High Bandwidth Server I/O Network Architecture for Cluster Systems," Proceedings of the IEEE 2007 IPDPS Conference, March 2007.

Figure-1: LANL's PaScalBB Infrastructure (Compute nodes perform outbound N-way load-balancing multi-path routing over the Level-1 interconnect network to the I/O nodes; the Level-2 interconnect switch performs inbound M-way Equal Cost Multi-Path routing toward the Global File System; I/O nodes/VLANs use OSPF to route inbound and outbound traffic between the Level-1 and Level-2 networks)

Figure-2: Extreme bi-directional load balancing between compute nodes and storage (a Compute node uses N-way multi-path routing for outbound "write" traffic through I/O nodes Vi, i=1 to n; the Layer-3 Gigabit Ethernet switch provides M-session Equal Cost Multi-Path (ECMP) routing for inbound "read" traffic through switch ports Wj, j=1 to m; the I/O node cloud uses OSPF to route read and write traffic between the Level-1 and Level-2 networks)

Figure-3: LANL's PaScalBB in a multi-cluster and Grid environment (Cluster-1 through Cluster-N and remote Grids A-D reach an expandable Petabyte-scale global storage through multiple IO Lanes, each a subnet behind its own switch or router, with 10GigE trunks and long-haul optical links for remote access; Global Domain (single name space) = ∑ Subnet(Lane i, i=1..M), the Level-2 linearly scaling IP routing network)

Figure-4: DGD operation steps diagram (the Master node checks, in steps 1-6, the management network, the Infiniband Level-1 network with compute-node ECMP routing, the 1/10 Gigabit Ethernet Level-2 network with I/O node OSPF routing, and the Global File System)