S.J. Cox et al. / Commodity High Performance Computing at Commodity Prices


Commodity High Performance Computing at Commodity Prices

Simon J. COX, Denis A. NICOLE, and Kenji TAKEDA

High Performance Computing Centre, Department of Electronics and Computer Science, University of Southampton, SO17 1BJ, UK

Abstract. The entry price of supercomputing has traditionally been very high. As processing elements, operating systems, and switch technology become cheap commodity parts, building a powerful supercomputer at a fraction of the price of a proprietary system becomes realistic. We have recently purchased, in support of both our local and national collaborations, a dedicated computational cluster of eight DEC Alpha workstations. Each node has a 500 MHz AXP21164A processor with 256 Mbytes of memory, runs Windows NT 4.0, and cost under £6000. The nodes are connected by 100Mb/s switched ethernet. In this paper we discuss some of the issues raised by our choice of processor, operating system and interconnection network. The results we present indicate that the cluster is fully competitive with systems from major vendors for a wide range of engineering and science applications, and at a cost lower by at least a factor of three. Indeed the only current area of under-performance relative to these vendors' high-end offerings is the inter-node network bandwidth and latency. We give some initial results indicating how the network performance might be improved under Windows NT.

1. Introduction

The convergence of high-end workstations and commodity personal computers has been particularly rapid over the last few years. It is now possible to build cheap, powerful supercomputer-level machines using commodity parts at a fraction of the cost of proprietary systems. In this paper we give some of our early experiences with a commodity cluster which we have recently purchased. In section 2, we discuss the configuration of our cluster and some of the design considerations. We examine a number of issues about the performance of our system in section 3. The results in section 4 indicate that the parallel performance of our cluster may be enhanced significantly by exploiting the networking facilities provided by Windows NT. We draw our conclusions in section 5.

2. The Technology

The best value for money clearly lies in commodity desktop PC technology. Here we can take advantage of economies of scale not only in the corporate market, but also in the rapidly increasing take-up of PCs in the home. Even greater leverage can be obtained by using the DEC Alpha microprocessor. These Fortran-optimised chips are priced to compete in the Windows NT marketplace against Intel and offer twice the price/performance of comparable Pentium-based systems.

The price/performance of a system also depends upon the operating system and compiler costs. Whilst there are free operating systems for Pentiums and Alphas, the performance of the compilers on these platforms is generally poor. On the Alpha platform, Digital UNIX and Windows NT are the obvious commercial products; while Windows NT is almost free, the effective cost of UNIX, including upgraded disk hardware, is around £2000 per node. The commercial uptake of Windows NT in the server market (a tenfold increase in the last two years) has allowed significant additional functionality to be added, whilst driving down the “total cost of ownership” for a system.

We have therefore recently purchased an eight node commodity cluster of workstations. The configuration is shown in Table 1. The total system cost was just under £50,000, and the cluster currently represents the biggest single computational resource at Southampton University.

Table 1 Configuration of Commodity Supercomputer

Eight Nodes of:
    500 MHz DEC Alpha 21164 processor
    256 Mbytes RAM
    2.5 Gbyte EIDE drive
    Windows NT 4.0

Configured as:
    (a) 2×2 nodes for development
        Visual C++ 4.2 (Risc) (shortly to be upgraded to 5.0)
        Digital Visual Fortran
        21” and 14” monitor
    (b) Compute nodes (4 nodes total)
        1 × 14” monitor with 4-way switch

Disk Server:
    200 MHz Pentium
    32 Mbytes RAM
    20 Gbytes IDE drive
    DLT Backup
    Debian Linux

The SPEC benchmark ratings of each node are 16.5 and 13.5 for SPECfp95 and SPECint95 respectively [1]. We chose Linux for the server node to support our existing Sun, AIX, and Windows 95 clients.

Digital Visual Fortran (DVF) for Alphas running Windows NT gives identical performance to DVF without the KAP optimizing preprocessor under Digital Unix [2]. We achieve 110 Mflops on the Linpack100 benchmark. The KAP preprocessor, which enhances the speed of serial code by around 10%, is not currently available for Alphas running Windows NT, so at present there is a small performance penalty for running under Windows NT. Digital Visual Fortran uses Microsoft Developer Studio as an integrated development environment and comes complete with the IMSL numerical libraries. Naturally it is priced to compete with comparable Pentium compilers, in contrast to DVF for Digital Unix, and thus offers excellent value for money.


3. Issues

3.1 Operating system issues

Windows NT 4.0 provides little facility for remote access to resources, such as inbound Telnet and rlogin. We have experimented with the Microsoft Beta Telnet daemon [3] and have found that it offers little real user support and is prone to exit without warning. More fundamental problems of security also exist. Remote users and the local user share the same drive maps. If one user changes the mapping of a drive, all the other users are affected by this change.

Windows NT 4.0 also provides little support for remote graphical services; there is no analogue of X under Unix. This means that it is not easy to provide graphical debugging and profiling tools. A limited range of X client and server third party software exists; however, at present there is no third party software supporting remote graphical windows applications under NT 4.0 on Alphas.

Many of these issues will be addressed by Windows NT 5.0, which we currently have under Beta testing. This provides facilities for remote administration, a wide range of thick and thin client/server models and the ability to access graphical services remotely. Most of these are functions which have long been enjoyed under Unix operating systems!

Windows NT 4.0 is a 32-bit operating system and can only address up to 2 Gbytes of user process memory and 4 Gbytes in total. Linux will address 3-4 Gbytes. This is a fundamental limitation for running memory intensive applications. Windows NT 5.0, which is fully 64-bit, will not suffer from such problems.

3.2 Parallel performance

At present we are using 100Mbit switched ethernet for the interconnection network between processors. We have run PVM [4] and the Mississippi port of MPI for Windows NT [5] and measured the performance. The results are summarised in Table 2.

Table 2 Results for Interprocessor communication using PVM and MPI

Configuration                        Message Size   Bandwidth         Latency
PVM (two processes, one machine)     800 kB         2.34 Mbyte/s      20 ms
PVM (two machines)                   800 kB         3.52 Mbyte/s      25 ms
MPI (two processes, one machine)     (rinf)         10.8 Mbyte/s      0.8 ms
MPI (two machines) Beta release      (rinf)         (59.2 kbyte/s)    2 ms
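
To compare the rows of Table 2 on a common footing, one can use the usual linear model in which sending n bytes takes latency + n / bandwidth. The Python sketch below is an illustration rather than the authors' analysis: it assumes that the tabulated bandwidth is the asymptotic rate, that the latency is the start-up cost of a zero-length message, and that 1 Mbyte is 10^6 bytes. The function names are hypothetical.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to send one message under the linear model t = t0 + n / r."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def half_performance_length(latency_s, bandwidth_bytes_per_s):
    """Message size at which half of the asymptotic bandwidth is achieved.

    The effective rate n / (t0 + n / r) equals r / 2 exactly when n = t0 * r.
    """
    return latency_s * bandwidth_bytes_per_s


# "PVM (two machines)" row of Table 2: 25 ms latency, 3.52 Mbyte/s.
# ASSUMPTION: 1 Mbyte = 1e6 bytes and 800 kB = 800e3 bytes.
pvm_latency = 25e-3
pvm_bandwidth = 3.52e6

pvm_time_800k = transfer_time(800e3, pvm_latency, pvm_bandwidth)
pvm_n_half = half_performance_length(pvm_latency, pvm_bandwidth)
print(f"800 kB message: {pvm_time_800k:.3f} s")
print(f"half-performance length: {pvm_n_half / 1e3:.0f} kB")
```

Under these assumptions an 800 kB PVM message between two machines takes about 0.25 s, and messages shorter than about 88 kB spend more time on start-up than on data transfer.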

The figures in Table 2 should be compared with the native file transfer speed between processors, which we have measured to be 4-5 Mbyte/s for files larger than 100Mb. Whilst the figures for PVM and shared memory MPI are consistent with this, the figure for MPI between processors is much worse. Our results should not be interpreted as a benchmark of the performance of MPI on NT. At the time of writing (Feb 1998), the 0.92 Beta release MPI implementation on NT 4.0 [5] which we are using limits the efficiency for running parallel jobs; it was designed for use on shared memory systems. We have supplied our own MPI Fortran bindings for this implementation and achieved a small speedup (15%) of real scientific code on two processors [6]. The main reason for the inefficiency is excessive copying of data into buffers before transfer using the TCP/IP protocol. The penalty for inter-process communication is exacerbated by the raw performance of the Alpha nodes. Whilst transputers gave excellent speedup results due to a relatively slow processor with a fast interconnection network, our scenario is quite the reverse! In section 4 we present some initial results about how this problem might be addressed using native Windows protocols.

There are a few other problems with the current Beta implementations of MPI and PVM. MPI runs under the Administrator account (the NT analogue of root under Unix) with full system privileges, since it was originally intended as a shared memory implementation. It can also leave dead processes hanging on remote machines which must be killed off manually. PVM currently requires pvmd3 daemons to be started manually on remote processes and fails to redirect I/O properly. We are working hard with the implementors and hope that at least some of these problems may be fixed in the final releases.

3.3 Network hardware

We have chosen to use commodity switched ethernet for communication between the processors. A number of proprietary systems already exist which offer high bandwidth and low latency. Amongst these is the PCI Memory Channel Interconnect from Digital, which provides applications with a cluster-wide address space. Applications map portions of this address space into their own virtual address space as 8kbyte pages and can then read from or write to this address space just like normal memory. At present this technology is aimed at (and priced for) the server market and can connect up to eight systems through a shared bus. There is support for MPI and PVM with a latency of around 8 µs and a bandwidth of > 60 Mb/s under Digital Unix [7]. The current cost is around £1600 per node. Another sensible choice is Myrinet [8] which offers a latency of 200 Mb/s) at around £1500 per host. This is a proprietary PCI based system with Unix (and clone) support for MPI and PVM [9]; however, little support is currently available for Windows NT.

At the time of writing there is no clear choice amongst these (and other) emerging technologies. We must further remember that to achieve our goal of a truly commodity system, we require mass-market components. For commodity networking we should perhaps look to technology which will be used to connect PCs to the internet and exploit gigabit ethernet when it becomes more widely available.

3.4 Other commodity systems

The Beowulf initiative [10] has concentrated on using Intel-based machines running Linux and connected by switched ethernet to provide cost-effective production machines for a number of applications. A sensible choice for Alpha-based systems is to use Digital Unix compilation nodes, and transfer binary compatible executables to compute nodes running Linux. A commodity system consisting of 200 DEC Alphas running Linux and Windows NT (as file servers) connected by 100Mbit ethernet was used to add post-production digital effects to films such as Titanic [11, 12].

Industries and city institutions which run commercially supported operating systems in the rest of their organisation may be unwilling to use Linux-based high performance computing software. Our early adoption of Windows NT as an operating system for high performance computing should be seen in the long term context of providing suitable HPC platforms for these customers.


4. Results: Improving parallel performance

Whilst much of the inefficiency of the Beta implementation of MPI on NT is a result of unnecessary copying of data, a number of other issues remain. The TCP/IP protocol may not be the most appropriate choice for inter-processor communication between Windows NT machines. To exploit the native Microsoft NetBEUI protocol we may use Named Pipes. They are similar in many ways to standard Unix pipes, the most important difference being that they are bidirectional (more like a Unix Stream Pipe) and independent of the transport protocol (Unix pipes use TCP/IP). Named Pipes are an application-level communication construct which may be used for interprocessor communication. The Named Pipe consists of a server end and a client end. Once a connection has been established between client and server, both may use normal read and write I/O functions on the Named Pipe to transmit data.

In Figure 1 and Table 3 we present some results indicating the performance of our 100Mbit switched ethernet using Named Pipes and the NetBEUI protocol. The experiments use a simple echo server and client [13]. The timed results are for the message travelling to the echo server and back. The largest single message which may be sent is 64kB.
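
As a rough, portable illustration of the echo experiment described above, the Python sketch below times a ping-pong exchange over a bidirectional connection. It is not the authors' code, which follows [13] and uses Windows Named Pipes over NetBEUI. It relies on the standard library module multiprocessing.connection: when no address is given, its Listener picks a platform default, which is a named pipe on Windows and a Unix domain socket elsewhere. Both ends run on one machine here, so it corresponds most closely to the single-machine case. The function names echo_server and ping_pong are hypothetical.

```python
import threading
import time
from multiprocessing.connection import Client, Listener


def echo_server(listener, n_messages):
    """Accept one client and send each received message straight back."""
    with listener.accept() as conn:
        for _ in range(n_messages):
            conn.send_bytes(conn.recv_bytes())


def ping_pong(size_bytes, repeats=20):
    """Return the mean round-trip time in seconds for one message size."""
    payload = bytes(size_bytes)
    # No address given: the library chooses a named pipe on Windows
    # and a Unix domain socket on other platforms.
    with Listener() as listener:
        server = threading.Thread(target=echo_server, args=(listener, repeats))
        server.start()
        with Client(listener.address) as conn:
            start = time.perf_counter()
            for _ in range(repeats):
                conn.send_bytes(payload)      # message travels to the server
                reply = conn.recv_bytes()     # and comes back unchanged
                if len(reply) != size_bytes:
                    raise RuntimeError("echo returned a message of the wrong size")
            elapsed = time.perf_counter() - start
        server.join()
    return elapsed / repeats


if __name__ == "__main__":
    for kilobytes in (8, 16, 32, 64):
        rtt = ping_pong(kilobytes * 1024)
        print(f"{kilobytes:2d} kB: {rtt * 1e3:.3f} ms round trip")
```

Timing the full round trip, as here, matches the paper's description of the message travelling to the echo server and back; absolute numbers will of course depend on the machine and operating system.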

[Figure 1: plot of Time (ms), 0 to 30, against Message Size (kB), 0 to 64. Series: Same Machine (Default Name), Same Machine (Explicit Name), Point to Point (Average), Simultaneous Pair.]

Figure 1 Ping-Pong Communication using Named Pipes, see text for more details.
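
A bandwidth and a latency can be extracted from ping-pong timings like those plotted in Figure 1 by fitting a straight line, time = latency + size / bandwidth. The paper does not say how its own summary figures were derived, so the Python sketch below shows only an assumed method. The timings in it are synthetic: they are generated from the model itself with illustrative parameters chosen to echo the point-to-point row of Table 3 (1.9 ms and 5.5 Mbytes/s, taking 1 Mbyte as 10^6 bytes), and they are not measurements. The function name is hypothetical.

```python
def fit_latency_bandwidth(sizes_bytes, times_s):
    """Least-squares fit of t = latency + size / bandwidth.

    Returns (latency in seconds, bandwidth in bytes per second).
    """
    n = len(sizes_bytes)
    mean_size = sum(sizes_bytes) / n
    mean_time = sum(times_s) / n
    sxx = sum((x - mean_size) ** 2 for x in sizes_bytes)
    sxt = sum((x - mean_size) * (t - mean_time)
              for x, t in zip(sizes_bytes, times_s))
    slope = sxt / sxx                      # seconds per byte
    latency = mean_time - slope * mean_size
    return latency, 1.0 / slope


# SYNTHETIC data, not measurements: eight message sizes from 8 kB to 64 kB,
# with times computed from the model using illustrative parameters.
sizes = [k * 8 * 1024 for k in range(1, 9)]
times = [1.9e-3 + size / 5.5e6 for size in sizes]

latency, bandwidth = fit_latency_bandwidth(sizes, times)
print(f"latency   = {latency * 1e3:.2f} ms")
print(f"bandwidth = {bandwidth / 1e6:.2f} Mbytes/s")
```

With real measurements the points will scatter around the line, and whether the fitted intercept is quoted as a one-way or round-trip latency is a convention that has to be stated alongside the result.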

Table 3 Results for ping-pong communication using Named Pipes, see text for more details.

Configuration                        Bandwidth (Mbytes/s)   Latency (ms)
Single Machine, default name         71                     0.37
Single Machine, explicit name        26                     1.6
Point to Point                       5.5                    1.9
Point to Point (Loaded Switch)       5.5                    4.9
Simultaneous Pair                    5.3                    3.0
Simultaneous Pair (Loaded Switch)    4.9                    2.9
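
The "Simultaneous Pair" rows of Table 3 have each machine sending and receiving at the same time, which is the traffic pattern of a global reduction by pairwise exchange. The paper does not spell out which algorithm it has in mind, so the Python sketch below simulates, as an assumption, one common variant: a butterfly exchange in which, at each round, every node swaps its partial sum with the node whose rank differs in one bit. No network is involved; the list stands in for the values held by the nodes, and the function name is hypothetical.

```python
def pairwise_exchange_sum(values):
    """Simulate a global sum by pairwise exchange (butterfly pattern).

    values[rank] is the number initially held by node `rank`.
    Returns (list of final values per node, number of exchange rounds).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of nodes must be a power of two")
    current = list(values)
    stride = 1
    rounds = 0
    while stride < n:
        # Every node exchanges its partial sum with the partner whose rank
        # differs in exactly one bit; both then hold the combined sum.
        current = [current[rank] + current[rank ^ stride] for rank in range(n)]
        stride *= 2
        rounds += 1
    return current, rounds


# Eight nodes, as in the cluster described in the paper.
totals, rounds = pairwise_exchange_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(totals, rounds)
```

For eight nodes the sum reaches every node after three rounds, and in each round every node sends one message and receives one message at the same time, which is the situation the simultaneous-pair experiment measures.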


For shared memory communication on the same machine, it is possible to set up a default parameter for the client to connect to the local echo server (“Single Machine, default name”). This gives the highest bandwidth and lowest latency. When an explicit name is used (“Single Machine, explicit name”) a name resolution step is required, which increases the latency. The “point to point” performance is consistent with the native file transfer rate of 4-5 Mbyte/s and indicates that the hardware performance is much better than the PVM and MPI results in Table 2 suggest.

The “loaded network” results were obtained by timing the point to point performance between a pair of nodes, whilst the other 6 nodes were sending 64kB messages across the switch. The bandwidth is unchanged, indicating that the switched ethernet gives scalable performance when loaded on eight out of its twelve ports.

Whilst a single pipe allows bidirectional communication, it is much more realistic to have multiple independent pipes connecting processors. In the final “simultaneous pair” experiments, we measured the performance of a pair of such pipes both with and without a loaded network. Each machine communicated down two pipes simultaneously, acting as server for one, and client on the other. The performance dropped by only 5% (network not loaded) and 12% (network loaded) relative to the single pipe bandwidth, indicating that a simple global reduction by pairwise exchange may be performed using this technique with little loss of efficiency. A more sophisticated pairwise exchange may exploit the bidirectional nature of the pipe.

In summary, the results indicate that it is possible to achieve acceptable bandwidths and latencies under Windows NT 4.0 using commodity network hardware and native Windows protocols. In the future we will investigate other Windows networking communication mechanisms, in particular Windows Sockets.

5. Conclusions

Commodity high performance computing aims to deliver cost-effective computational resources by using commodity parts. In this paper we have discussed our initial experiences with a cluster of DEC Alpha workstations running Windows NT connected together using 100Mbit switched ethernet. Whilst the development environment, compiler technology and single node performance are excellent, providing high bandwidth and low latency under Windows NT at commodity prices is less straightforward. Current implementations of PVM and MPI use TCP/IP protocols. In this paper we have demonstrated that using efficient native Windows protocols may yield improved message passing performance in commodity systems using commodity networking hardware.

References

[1] Standard Performance Evaluation Corporation, 1998. Currently at: http://www.specbench.org/

[2] HENNING, J.L., 1997. NT: CPU Speed Without Heroics. This article is being submitted to Digital Systems Report, Computer Economics Inc., Digital Equipment Corporation. It is currently available on: http://www.digital.com/fortran/ntspeed.html

[3] Microsoft Windows NT 4.0 Workstation Resource Kit, 1996. ISBN 1-57231-343-9 (Washington: Microsoft Press)

[4] PVM for Windows NT, http://www.epm.ornl.gov/pvm/NTport.html

[5] SKJELLUM, A., PROTOPOPOV, B., HEBERT, S., BRENNAN, P.J. and SEEFELD, W., 1997. MPI on Windows NT. 0.92 Beta release. Currently available at: http://www.erc.msstate.edu/mpi/mpiNT.html

[6] COX, S.J., DANIELL, G.J., and NICOLE, D.A., 1998. Maximum Entropy, Parallel Computation and Lotteries. Submitted to 1998 International Conference on Parallel and Distributed Processing Techniques and Applications.

[7] See Digital’s High Performance Technical Computing Centre for more details: http://www.digital.com/info/hpc/

[8] Myricom corporation: http://www.myri.com/

[9] ANDERSON, D., CHASE, J., GADDE, S., GALLATIN, A., YOCUM, K., and FEELEY, M., 1998. Cheating the I/O Bottleneck: Network Storage with Trapeze/Myrinet. To appear at USENIX, June 1998.

[10] The Beowulf Project: http://cesdis.gsfc.nasa.gov/beowulf/

[11] Titanic, 1997. Film. Paramount Pictures and Twentieth Century Fox. Digital Domain added postproduction digital visual effects.

[12] STRAUSS, D. and WOOK, 1998. Linux Helps Bring Titanic to Life. Linux Journal, Issue 46. http://www.linuxjournal.com/issue46/2494.html

[13] SINHA, A.K., 1996. Network Programming in Windows NT. (New York: Addison-Wesley). ISBN 0201-59056-5
