Efficient Techniques for Distributed Computing

Thomas Dramlitsch, Gabrielle Allen, Edward Seidel

Abstract- We discuss a set of novel techniques we are developing, which build on standard tools, to make distributed computing for large-scale simulations across multiple machines (even scattered across different continents) a reality. With these techniques we demonstrate here that we are able to scale a tightly coupled scientific application in metacomputing environments. Such research and development in metacomputing will lead the way to routine, straightforward and efficient use of distributed computing resources anywhere around the world. This work applies not only to the large-scale simulations in astrophysics which provide the motivation for this work, but also opens the way for new, innovative application scenarios.

In this paper we describe new techniques we are developing to perform efficiently a single distributed simulation across multiple machines that are defined from the beginning, assuming the same resources are available throughout the calculation, and we present preliminary results achieved by implementing them. In a companion paper we discuss how these techniques may be combined with new techniques under development to handle scenarios where the availability of resources or networks changes during the simulation, or the needs of the simulation itself change, and technologies to adapt to the present state [2].

I. INTRODUCTION

Scientists from many different fields need to allocate large resources for their scientific applications. For example, an astrophysics application such as the collision of two neutron stars requires of the order of TBytes of memory and TFlops of CPU power for an accurate numerical simulation. Typically, the larger the requested resource, the longer it takes before it is available. However, scientists often have access to more than one supercomputer, as well as a growing number of smaller local resources such as Beowulf clusters. Distributed computing could then be beneficial, running the application not just on a single machine, but on a virtual supercomputer built by connecting multiple resources together.

There are three main scenarios where distributing applications across multiple machines can provide a real benefit. The first is where a scientist needs to perform a large scale computation that is beyond the reach of the resources that can be provided by any single machine. In this case, the motivation is simply the possibility of performing the needed simulation, whatever the cost or performance. The second scenario is where the concern is the time to solution. For example, if a scientist requires 128 CPUs for a calculation, she may have to wait some days in scheduled queues. An alternative could be to use 32 processors from each of 4 different machines, which are immediately available. The final motivation arises when the driving force is the cost of a simulation; here a scientist may be willing to suffer poor performance, taking advantage instead of free or cheap resources.

The basic problem is this: the more processors a scientist wants to allocate, the greater the amount of communication. A real application (solving equations of particle physics, relativity, climate models, etc.) which needs to achieve up to one TFlop/s (distributed across a couple of thousand CPUs) will need to send data of the order of 50-100 MB/s across the network. This is feasible within one supercomputer, but since wide area networks today are still not that fast, the scaling of such an application would end up being under 10%.

[Figure 1 plot: scaling as a function of the number of CPUs, comparing the standard run (dashed line) with the improved run (solid line); see caption below.]
Fig. 1. The scaling behavior of a numerical code run between two SGI Origins (MIPS R10000 CPUs) connected with a 100 MBit Fast Ethernet adapter. The problem size per processor is kept fixed, and we have a regular domain decomposition. The dashed line shows the results for a standard distributed run: communications are done every iteration, with many uncompressed messages. The scaling here is about 40%. The solid line represents a run with some of the techniques discussed in Sec. II applied: 10 messages of 245 kB per processor were coalesced into one 2.35 MB message. This message chunk was compressed down to 158 kB before being sent over the network (about 7% of the original size). The scaling is thereby improved to about 80% on 16 processors.

II. ADVANCED TECHNIQUES FOR DISTRIBUTED COMPUTING

Not so long ago, experiments in metacomputing had just two goals: to see how it is possible to run across multiple supercomputers, and to show that it is possible. Many infrastructure tools have been built to make this possible, such as Globus and MPICH-G, and applications such as the Cactus framework have been developed to provide actual implementations. These experiments showed that distributed computing is indeed possible, and identified current difficulties and inefficiencies. Our goal now is to explore advanced techniques to achieve adequate performance. In the following we describe some of the methods currently being investigated and compared, which could be used to improve communication performance, together with short illustrative sketches.

Buffering. Since different networks have different characteristics, such as latency and bandwidth, the simulation code should ideally be able to adapt dynamically to these characteristics. In one particular example we studied, the number of messages could vary from about 60 per 10 iterations (each with a size of about 15 kB) down to just one (with a size of up to one MByte).

Overlap of Computation and Communication. Most scientific codes using the communication protocol MPI use non-blocking send/receive operations, since usually one processor communicates simultaneously with many others, and in this way an overlap between different communication channels can be achieved. Instead of simply waiting to receive data, processors should continue to perform computations. With this overlap, one can reduce the idle time in which a CPU is waiting for data.

Compression. Numerical simulation codes typically deal with floating point data, and this is the type of data which must be communicated between neighboring processors. Usually a numerical code performs computations on smooth data, which for us means that the data does not change very much from gridpoint to gridpoint. If we compress the data, using standard compression techniques (e.g. from the UNIX library libz), we can reduce the size of the data to be sent significantly, down to 5% in tests we have performed on simulations of Einstein's equations using Cactus. In most cases, and in combination with buffering, this improves the overall performance considerably.

Incremental Update. Another idea to reduce the amount of data to be sent over the network (while of course leaving the numerics unchanged) is to send incremental data. This technique is often used in backup procedures, where only the data which has changed since the last backup is archived. During a numerical simulation, especially when the simulation is sufficiently accurate in space and time, the data on a gridpoint does not change very much after one iteration. If a code sends the difference of those two floating point numbers, and applies compression afterwards, the reduction in the amount of data is even more significant.
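
To make the buffering and compression ideas concrete, the following sketch coalesces ten small ghost-zone messages into one contiguous buffer, compresses it with libz, and exchanges it as a single MPI message, much as in the improved run of Figure 1. This is a minimal illustration, not the actual Cactus/MPICH-G code: the message counts, sizes, tags and the smooth test data are placeholder choices of our own.

```c
/* Sketch: coalesce several small double-precision messages into one buffer,
 * compress it with zlib, and exchange it as a single MPI message between
 * rank 0 and rank 1.  Compile e.g.:  mpicc coalesce.c -lz -lm -o coalesce
 */
#include <mpi.h>
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NMSG   10          /* number of small messages to coalesce */
#define MSGLEN 4096        /* doubles per small message            */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    const size_t rawbytes = NMSG * MSGLEN * sizeof(double);
    double *raw = malloc(rawbytes);

    if (rank == 0) {
        /* Fill the NMSG small messages with smooth data and pack them
         * back to back into one contiguous buffer ("buffering").       */
        for (int m = 0; m < NMSG; m++)
            for (int i = 0; i < MSGLEN; i++)
                raw[m * MSGLEN + i] = sin(0.001 * i) + m;

        /* Compress the coalesced buffer before sending ("compression"). */
        uLongf zlen = compressBound(rawbytes);
        Bytef *zbuf = malloc(zlen);
        compress(zbuf, &zlen, (const Bytef *)raw, rawbytes);

        unsigned long header = zlen;          /* tell the peer the size  */
        MPI_Send(&header, 1, MPI_UNSIGNED_LONG, 1, 0, MPI_COMM_WORLD);
        MPI_Send(zbuf, (int)zlen, MPI_BYTE, 1, 1, MPI_COMM_WORLD);

        printf("sent %zu raw bytes as %lu compressed bytes (%.1f%%)\n",
               rawbytes, (unsigned long)zlen,
               100.0 * (double)zlen / (double)rawbytes);
        free(zbuf);
    } else if (rank == 1) {
        unsigned long header;
        MPI_Recv(&header, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        Bytef *zbuf = malloc(header);
        MPI_Recv(zbuf, (int)header, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Uncompress, then split the buffer back into NMSG messages. */
        uLongf outlen = rawbytes;
        uncompress((Bytef *)raw, &outlen, zbuf, (uLong)header);
        printf("rank 1 received %lu bytes, first value %g\n",
               (unsigned long)outlen, raw[0]);
        free(zbuf);
    }

    free(raw);
    MPI_Finalize();
    return 0;
}
```

Besides reducing the volume of data, the single large message pays the wide-area latency only once instead of ten times, which is the point of the coalescing.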
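The overlap of computation and communication can be sketched with non-blocking MPI as follows: each process posts MPI_Irecv/MPI_Isend for its ghost zones, updates the interior points that do not depend on ghost data while the messages are in flight, and only then waits and updates the boundary points. The 1-D decomposition, the simple averaging update and the array size below are placeholders chosen for brevity, not the scheme used in Cactus.

```c
/* Sketch: overlap ghost-zone exchange with interior computation using
 * non-blocking MPI.  1-D domain decomposition, one ghost point per side.
 */
#include <mpi.h>
#include <stdlib.h>

#define N 1000000                /* local grid points, excluding ghosts */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[N+1] are ghost points, u[1..N] are owned points. */
    double *u    = calloc(N + 2, sizeof(double));
    double *unew = calloc(N + 2, sizeof(double));
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];

    /* 1. Post the ghost-zone exchange (non-blocking). */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* 2. Update interior points, which need no ghost data, while the
     *    messages are in flight -- this hides the network latency.    */
    for (int i = 2; i <= N - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* 3. Wait for the exchange, then update the two boundary points
     *    that depend on the ghost values.                             */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    unew[1] = 0.5 * (u[0] + u[2]);
    unew[N] = 0.5 * (u[N - 1] + u[N + 1]);

    free(u); free(unew);
    MPI_Finalize();
    return 0;
}
```

The longer the interior update runs relative to the network transfer time, the larger the fraction of the communication that is hidden.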
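Incremental update combines naturally with the compression above: the sender keeps a copy of the ghost-zone values it transmitted last iteration and sends only the compressed difference. A minimal sketch, with hypothetical helper names and the MPI transport omitted:

```c
/* Sketch: incremental update of a ghost zone.  The sender keeps a copy of
 * the values it sent last iteration ("prev"), transmits only the compressed
 * difference, and the receiver adds the difference to its stored copy.
 * Illustrative helpers; buffer management and MPI calls are omitted.
 */
#include <zlib.h>
#include <stdlib.h>

/* Sender: produce a compressed buffer holding cur - prev. */
size_t pack_increment(const double *cur, double *prev, size_t n, Bytef **out)
{
    double *diff = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) {
        diff[i] = cur[i] - prev[i];   /* small numbers for smooth data  */
        prev[i] = cur[i];             /* remember what the peer now has */
    }
    uLongf zlen = compressBound(n * sizeof(double));
    *out = malloc(zlen);
    compress(*out, &zlen, (const Bytef *)diff, n * sizeof(double));
    free(diff);
    return (size_t)zlen;              /* bytes to hand to MPI_Send      */
}

/* Receiver: apply a compressed increment to its stored ghost values. */
void unpack_increment(double *ghost, size_t n,
                      const Bytef *zbuf, size_t zlen)
{
    double *diff = malloc(n * sizeof(double));
    uLongf outlen = n * sizeof(double);
    uncompress((Bytef *)diff, &outlen, zbuf, (uLong)zlen);
    for (size_t i = 0; i < n; i++)
        ghost[i] += diff[i];          /* reconstruct the actual values  */
    free(diff);
}

int main(void)
{
    enum { NG = 8 };
    double cur[NG], prev[NG] = {0}, ghost[NG] = {0};
    for (int i = 0; i < NG; i++) cur[i] = 1.0 + 0.001 * i;

    Bytef *z;
    size_t zlen = pack_increment(cur, prev, NG, &z);
    unpack_increment(ghost, NG, z, zlen);   /* ghost now equals cur */
    free(z);
    return 0;
}
```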

III. FUTURE WORK

Many large distributed computations will be done in the near future. A tightly coupled numerical computation is planned to be run across 1500 CPUs at NCSA and SDSC. In order to estimate the benefit of our improvements, we scaled a distributed run between two SGI Origins at NCSA connected by a 100 MBit Fast Ethernet link. The first preliminary results can be seen in Figure 1. Since there is an OC12 network link between those two centers, and the estimated number of processors talking over the WAN will be about six times higher, one can regard this scaling behavior as representative for such a big run. Indeed, it turned out that exactly those techniques discussed in Sec. II improved the performance even more significantly than in our test case: the speedup was increased from about 10% up to 70% and higher. More details will be discussed in future papers.

IV. CONCLUSION

In this paper we motivated the use of distributed computing for real world applications, and discussed the current status and existing difficulties. We illustrated how the major hindrance arises from poor communication performance between the individual machines in a virtual supercomputer. In addressing this problem we presented several techniques, currently being researched and developed, which improve the performance significantly. Using these techniques it was possible to drastically improve performance for a test case, increasing the scaling from below 50% to about 80% (two SGI Origins connected by 100 MBit Fast Ethernet). In a following paper we will show that we could use our techniques to make a distributed run over 4 supercomputers, including up to 1500 CPUs, possible with a speedup of up to 80%.

ACKNOWLEDGMENTS

It is a pleasure for us to thank Ian Foster, Tom Goodale, Roy Heimbach, Nick Karonis, Gerd Lanfermann, Thomas Radke, Matei Ripeanu, John Shalf and Brian Toonen for their advice and help. Computing resources and technical support have been provided by AEI, ANL and NCSA.

REFERENCES

[1] The Globus project, http://www.globus.org
[2] G. Lanfermann, G. Allen, T. Radke, and E. Seidel, "Nomadic Migration: A New Tool for Dynamic Grid Computing", submitted to HPDC10.
[3] W. Benger, I. Foster, J. Novotny, E. Seidel, J. Shalf, W. Smith, and P. Walker, "Numerical Relativity in a Distributed Environment", Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999.
[4] M. Ripeanu, "Issues of Running Large Scientific Applications in a Distributed Environment", Master's Thesis, University of Chicago, 2000.
[5] G. Allen, T. Dramlitsch, I. Foster, T. Goodale, N. Karonis, M. Ripeanu, E. Seidel, and B. Toonen, "Cactus-G Toolkit: Supporting Efficient Execution in Heterogeneous Distributed Computing Environments", Proceedings of the 4th Globus Retreat, July 30-August 1, 2000, Pittsburgh.
[6] http://www.supercomp.org/sc98/hpcc/
[7] I. Foster and N. Karonis, "A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems", Proc. Supercomputing 98 (SC98), Orlando, FL, November 1998.
[8] G. Allen, W. Benger, T. Goodale, H.-C. Hege, G. Lanfermann, A. Merzky, T. Radke, E. Seidel, and J. Shalf, "The Cactus Code: A Problem Solving Environment for the Grid", Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, 2000.
[9] I. Foster, Designing and Building Parallel Programs, http://www.mcs.anl.gov/dbpp
[10] S. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.
[11] W. Gropp and E. Lusk, "Reproducible Measurements of MPI Performance Characteristics", Argonne National Laboratory, http://www-unix.mcs.anl.gov/gropp/papers.html
[12] http://www-unix.mcs.anl.gov/mpi/mpich/

