Javelin: Parallel Computing on the Internet

Michael O. Neary, Bernd O. Christiansen, Peter Cappello, and Klaus E. Schauser

Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106

{neary, bernd, cappello, schauser}@cs.ucsb.edu

Abstract

Java offers the basic infrastructure needed to integrate computers connected to the Internet into a seamless distributed computational resource: an infrastructure for running coarse-grained parallel applications on numerous, anonymous machines. First, we sketch such a resource's essential technical properties. Then, we present a prototype of Javelin, an infrastructure for global computing. The system is based on Internet software that is interoperable, increasingly secure, and ubiquitous: Java-enabled Web technology. Ease of participation is seen as a key property for such a resource to realize the vision of a multiprocessing environment comprising thousands of computers. Javelin's architecture and implementation require participants to have access to only a Java-enabled Web browser. Experimental results are given in the form of a Mersenne Prime application and a ray-tracing application that run on a heterogeneous network of several parallel machines, workstations, and PCs. Two key areas of current research, fault tolerance and scalability, are subsequently explored briefly.

Keywords: Distributed Computing, High Performance Computing, Java, Internet, World Wide Web.

1 Introduction

We want to solve computational problems that cannot be solved on existing NOWs or supercomputers, problems that simply cannot be solved now. Imagine that you have a huge computational problem that you want to solve. You have an "embarrassingly" parallel algorithm for it. However, it is too large to solve on a network of workstations or a supercomputer. You don't want to wait until networks of workstations get big enough, or until processors and/or supercomputers get fast enough. What do you do?

Our vision is to harness the combined resources of millions of computers connected to the Internet, forming a powerful heterogeneous computing environment for running coarse-grain parallel applications. We believe that the time is ripe for this next step. First, global computing has made impressive strides recently. For example, on June 17, 1997, 56-bit DES [27] was cracked by the DESCHALL project, using approximately 78,000 computers, as many as 14,000 per day. The press release [14] states:

This project demonstrates the kind of supercomputing power that can be harnessed on the Internet using nothing but "spare" CPU time. "Imagine what might be possible using millions of computers connected to the Internet!" Aside from cryptography and other obvious mathematical uses, supercomputers are used in many fields of science. "Perhaps a cure for cancer is lurking on the Internet?", said Verser, "Or perhaps the Internet will become Everyman's supercomputer."

Clearly, large numbers of users are willing to participate in the global computation of suitable applications. While dramatic and encouraging, this example shows that, of the tens of millions of computers connected to the Internet, only a small percentage actually participate. There are several reasons for this. No incentive is provided beyond curiosity, helping a worthy cause, or fame (if one is lucky enough to find the key). Security, specifically the lack thereof, is a major inhibitor: prospective participants must trust the program server to serve them a program that does not contain bugs or viruses that destroy or spy on local data.

Second, participating in global computing is administratively complex for both the program server and the participants. Global computing applications run on different architectures and operating systems, requiring many executable versions of the application as well as different downloading mechanisms. We believe that lack of security, lack of potent tools for developing and deploying global computing applications, and lack of incentive all diminish the number of actual participants. The advent of a safe and portable language system such as Java has provided tools and mechanisms to address some of these issues and to interconnect computer systems around the globe in cooperative work. We progress towards our vision by developing a flexible, easily installed infrastructure for running coarse-grained distributed applications on any computer connected to the Internet.

In the next section, we enumerate some fundamental issues in global computing. Then, we review the Javelin vision, as well as the existing Javelin prototype. Following that, we discuss the research issues of highest priority: fault tolerance and scalability. Then, we briefly review existing research efforts towards an infrastructure for global computing, concluding that there is a basic need for research on fault tolerance and scalability in the context of global computing.

2 Goals and Issues

In this section we briefly describe some of the most important research issues in the field of global computing. The list is by no means complete, since this is a relatively new area and more topics are likely to arise in the near future. We have divided the issues into three distinct groups. The first group contains subjects that are already being addressed in the context of secure, portable languages like Java; we are less concerned about these issues, since significant progress has already been made. The second group contains the topics that we consider most important for our current research, whereas the third group contains issues that can be dealt with only after solving some of the problems in the second group, and hence must be placed in the more distant future. The first group consists of the following topics:

Ease of Use: We view ease of use as a key property of the proposed infrastructure, since it relies on the participation of thousands of users. We envision a web-centric approach where a user only needs ubiquitous software such as a Java-enabled web browser to participate.

Security: Executing an untrusted piece of code poses integrity and security threats to the host. To protect the host from buggy or malicious code, any untrusted program must be run within an environment that allows only limited access to system resources such as the file system. Host security has already been addressed by a variety of mechanisms, including Software Fault Isolation [12], Secure Remote Helper Applications [19], and interpreted languages such as Java.

Interoperability: In heterogeneous systems like the Internet, hosts and clients may have different instruction sets, word sizes, or operating systems. The proposed infrastructure must provide the means to overcome this heterogeneity. This issue has been addressed either by employing machine-independent languages, such as Java, E [15], and Limbo [24], or by providing multiple binary executables [33]. Machine-independent languages achieve portability at the expense of some performance; binary executables achieve performance at the expense of portability. It is thus desirable to support both approaches in order to meet the demands of as many applications as possible.

Performance: Since our infrastructure aims at providing better performance than is available locally, the efficient execution of anonymous code is essential. The interpretation overhead of portable languages is being overcome by modern compilation techniques such as just-in-time compilation [36, 31] that allow for an execution speed close to that of compiled C or C++ code.

The second group contains the four issues that are currently highest on our priority list. In particular, fault tolerance and scalability will be discussed in greater detail in Section 4.

Scalability: As performance relies heavily on the number of participants, scalability is a key issue. We intend to provide an infrastructure that is scalable with respect to communication (i.e., limitations imposed by subsidiary technologies, such as Java applet security, must be overcome), computation (i.e., resource allocation must be distributed), and administration (i.e., requiring neither login accounts on hosts nor operating system changes).

Fault Tolerance: In a potentially Internet-wide distributed system, fault tolerance is crucial; hosts may be untrusted and communication is unreliable. It can be provided either transparently by the underlying architecture or explicitly by the application itself.

Incentive: To grow the infrastructure to thousands of hosts, we need to give potential hosts an incentive to participate. We can capture the interactions between clients and hosts in a microeconomic model of trading in computing resources. We feel that the application of market mechanisms is the right way of solving the problem of resource allocation in potentially world-wide distributed systems. Important underlying technologies required for the implementation of a market mechanism, such as cyber-money, authentication schemes, and secure transactions, are maturing rapidly.

Correctness: Economic incentives have a dark side: the specter of hosts faking computations and returning wrong or inaccurate results. To deal with this problem, clients may be able to apply algorithm-specific techniques that cheaply verify a computation's correctness (e.g., it is simple to verify a proposed solution to a system of linear equations), or cheaply verify its correctness with high probability. Other techniques include statistical methods (e.g., in an image rendering computation, the client can compute some random pixels and compare these to the corresponding pixels returned by hosts; a sketch follows below), checksum computations that are inserted into the code to assure that the code was actually run, or redundant computations (i.e., performing a computation on several hosts). Reputation services, similar to credit and bond ratings, also give hosts an economic disincentive for faking computations.

Resource allocation using market-based mechanisms is one of our more long-term goals in its full scale, although a simpler form of incentive, simply exploiting natural curiosity and the desire to become famous, is very much a part of our current research.
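To make the statistical spot-check concrete, the following Java sketch re-renders a few random pixels on the client and compares them with a host's returned image. The renderPixel() routine and the array layout are illustrative assumptions, not part of any Javelin API.

```java
import java.util.Random;

// Minimal spot-check sketch: the client re-renders a few random pixels
// and compares them with the host's returned image. renderPixel() is a
// stand-in for the client's own trusted renderer.
public class SpotCheck {

    // returned is indexed [x][y], width by height. A host that faked a
    // fraction f of the pixels escapes detection with probability
    // (1 - f)^samples, so a handful of samples already deters cheating.
    static boolean verify(int[][] returned, int width, int height, int samples) {
        Random rng = new Random();
        for (int i = 0; i < samples; i++) {
            int x = rng.nextInt(width);
            int y = rng.nextInt(height);
            if (returned[x][y] != renderPixel(x, y)) {
                return false;            // mismatch: reject the whole result
            }
        }
        return true;                     // all sampled pixels agree
    }

    // Placeholder for the client's local renderer.
    static int renderPixel(int x, int y) {
        return (x * 31 + y) % 256;
    }
}
```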

Besides these, there are other important issues we might consider in the future but do not intend to address in the near term:

Programming Models: The envisioned infrastructure provides a substrate on which various communication, data, and programming models may be implemented; different models are suitable for different applications. We intend to provide the programmer with abstractions such as a global file system, shared memory, and reliable communication channels.

Locality: At present, neither the latency nor the bandwidth across the Internet is satisfactory for communication-intensive parallel applications. Thus, both must be taken into account when mapping applications to the available resources. Ultimately, this implies the need for methods to determine or forecast the communication latency and bandwidth requirements and execution time constraints for given problems [37].

Client Privacy: A company's internal data and "know-how" represent a value that usually is protected from unauthorized access. The proposed infrastructure must provide mechanisms that enable clients to hide the data, and possibly the algorithms, that are passed to untrusted hosts (see, e.g., Feigenbaum [16]). Although encrypted computing might not be possible for all applications, a number of important practical problems can be encrypted [1]. Another way of ensuring client privacy is to split the computation into fragments such that no part by itself reveals any useful information about the complete computation.

Quality of Service: Quality of service must be incorporated and ensured. A host negotiating for a task should be able to calculate accurate runtime estimates based on the task's profile as well as its own machine characteristics, in order to decide whether it can meet the client's runtime constraints without actually performing the computation. One possibility is through micro-benchmarks that characterize the host as a point in a multidimensional performance space [28]. A suitable benchmark must be representative and quickly computable by the host, and the broker must be able to evaluate the result quickly.

In the following sections we will explain to what extent we have already solved some of the given problems and how we intend to attack others.

3 The Javelin Infrastructure

In this section, we describe Javelin [11], our prototype infrastructure for Internet-based parallel computing using Java. Our system is based on Internet software technology that is essentially ubiquitous: Web technology. The current prototype already provides some of the properties listed in Section 2, such as ease of use and interoperability. Others, like scalability and fault tolerance, will be addressed in the near future (see Section 4). Once these fundamental technological issues have been solved, we will be in a situation where computation is truly a tradable commodity, and buying and selling of computational resources will become feasible. At that point we will integrate market-based protocols and algorithms into our design, leading eventually to a system that is naturally load-balanced around the globe and offers users a true incentive to participate.

The basic system architecture is shown in Figure 1. There are three system entities: clients, brokers, and hosts. A client is a process seeking computing resources; a host is a process offering computing resources; a broker is a process that coordinates the allocation of computing resources. Clients register the tasks they want run with their local broker; hosts register their intention to run tasks with the broker. The broker assigns tasks to hosts that, then, run the tasks and send results back to the clients. The role of a host or a client is not fixed. A machine may serve as a Javelin host when it is idle (e.g., during night hours), while being a client when its owner wants additional computing resources.

Figure 1: The Javelin architecture (clients, brokers, hosts).

3.1 Applications

One of the most important goals of our research is to find and classify new applications for our infrastructure. We intend to further refine and optimize the applications we have already implemented, as well as port some new applications that promise further insight into what is required to make the infrastructure as useful as possible.

3.1.1 Structure of an Application

Before we list some actual applications, we need to make a few assumptions about the type of application and programming model that will be supported by the system:

- Ideally, applications should be adaptively parallel, i.e., they should not demand a fixed number of hosts at startup, should be able to change the number of hosts at runtime, and should generally be oblivious to the number of hosts at any given moment.

- The primary programming model supported is a so-called dynamic bag-of-work model (we also like to call it Master-Worker with Subcontracting, a term we find slightly more intuitive), where initially the client has a large task which can be split into smaller tasks and assigned to other hosts. These smaller tasks can in turn be split and reassigned, and so on, until the granularity is so small that further splitting no longer makes sense. The task is then executed. It is important that each task be independent and all tasks be equally important, to avoid prioritization. For fault tolerance purposes we assume all tasks to be idempotent, i.e., after the first execution of a task, repeated execution of the same task does not change the result. For more details, see Section 4. A sketch of this splitting discipline appears after this list.

- If an application does need to specify a fixed number of processes, it should not expect its local broker to actively find the desired number of hosts for a one-to-one process-to-host mapping by contacting neighboring brokers, since this could easily flood the network. Instead, the local broker may opt to schedule more than one process to a single host, and will decide immediately or after a certain period of time whether the host demand can be met. This restriction greatly simplifies the task of a broker and should be considered a starting point that can be modified in the future.

- As the most general parallel programming model, the system will also support message passing based on send and receive primitives. However, in this context fault tolerance will be limited, since the system cannot know the semantics of the application and so will not know how to react to host and network failures on its behalf.
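The following sketch illustrates the splitting discipline just described; the Task abstraction and its methods are hypothetical names for illustration, not the Javelin API, and the minimum-granularity test is hidden inside splittable().

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical task abstraction: split() divides the work, splittable()
// returns false once the minimum granularity is reached, and execute()
// is idempotent, so re-running a task cannot corrupt the result.
abstract class Task {
    abstract boolean splittable();
    abstract List<Task> split();
    abstract void execute();
}

// A host keeps splitting until tasks reach the minimum granularity,
// then executes them; the pieces pushed back into the bag could equally
// be handed to other hosts through the broker.
class BagOfWork {
    private final Deque<Task> work = new ArrayDeque<>();

    void process(Task root) {
        work.push(root);
        while (!work.isEmpty()) {
            Task t = work.pop();
            if (t.splittable()) {
                for (Task sub : t.split()) {
                    work.push(sub);      // smaller pieces go back into the bag
                }
            } else {
                t.execute();             // small enough: run it
            }
        }
    }
}
```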

3.1.2 Current Applications

We have a pool of applications which vary in structure, complexity, and difficulty of implementation. The list of applications we are currently considering contains:

Mersenne Primality Testing: This is our structurally easiest application. The computation-to-communication ratio is huge for large primes. It fits both the message passing and the bag-of-work paradigm.

Image Rendering: Fitting the bag-of-work model, this application is an ideal starting point for experiments with various fault tolerance methods (see Section 4).

Seismic Data Processing: Traditionally a cluster computing application fitting the message passing model, this is our most demanding application to date. Implementing it on top of Javelin requires the use of Java applications, since local file I/O is required. However, this also means the development of a new host security model for Javelin applications.

Beyond those applications listed, our goal is to find as many additional applications as possible that can benefit from the proposed architecture. In the near future, however, we intend to concentrate on those listed and demonstrate that the infrastructure can greatly improve their performance, opening up the possibility of running applications much larger than anyone would have deemed possible a few years ago.

3.2 Preliminary Results

3.2.1 Javelin Prototype

Our most important goal is simplicity, i.e., to enable everyone connected to the Internet or an intranet to easily participate in Javelin. To this end, our design is based on widely used components: Web browsers and the portable language Java. By simply pointing their browser at a known URL of a broker, users automatically make their resources available to host parts of parallel computations. This is achieved by downloading and executing an applet that spawns a small daemon thread that waits and "listens" for tasks from the broker.
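The daemon thread just described might be structured as in the following sketch. The broker endpoint name, frame target, and polling interval are illustrative assumptions; the actual prototype's protocol is described only at the level of the steps below.

```java
import java.applet.Applet;
import java.net.URL;

// Sketch of the host-side daemon applet. Applet security confines us to
// our origin server, which is exactly the broker that served this applet.
// The "task" endpoint, frame target, and polling interval are assumptions.
public class HostDaemon extends Applet implements Runnable {

    public void start() {
        new Thread(this).start();        // spawn the small daemon thread
    }

    public void run() {
        while (true) {
            try {
                // Point the browser at a broker page that embeds the
                // current task applet, if the broker has work for us.
                URL task = new URL(getCodeBase(), "task?host=" + hashCode());
                getAppletContext().showDocument(task, "taskFrame");
                Thread.sleep(5000);      // back off before asking again
            } catch (Exception e) {
                // Broker unreachable or URL malformed: retry later.
            }
        }
    }
}
```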

Figure 2: Steps involved in the remote execution of an applet (client, HTTP server, broker/router, host).

The simplicity of this approach makes it easy for a host to participate: all that is needed is a Java-capable Web browser and the URL of the broker. Tasks are represented as applets that are embedded in HTML pages. This design decision implies certain limitations due to Java applet security: e.g., all communication must be routed through the broker, and every file access involves network communication. Therefore, in general, coarse-grained applications with a high computation-to-communication ratio are well suited to Javelin.

Figure 2 shows the steps involved in the remote execution of an applet. These steps are:

1. The client uploads the applet and an embedding HTML page to an HTTP server. Clients running their own HTTP server may skip this first step.
2. The client registers the corresponding URL with a broker.
3. The host registers with the broker its intention to execute tasks, and retrieves the URL of a task.
4. The host downloads the HTML page from the HTTP server, and executes its embedded applet.
5. The host stores the result at the server site. If communication between cooperating hosts is required, messages are stored at, and retrieved from, the server site.
6. The client retrieves the result.

In the following we briefly discuss performance numbers from our prototype. We conducted our performance measurement experiments in a heterogeneous environment: Pentium PCs, Sparc-5s, a 64-node Meiko CS-2 whose individual nodes are Sparc-10 processors, and single- and dual-processor UltraSparcs, connected by 10 and 100 Mbit Ethernet. For a more detailed presentation of the performance results and the applications used, the reader is referred to [11].

3.2.2 Raytracing Measurements

We have ported a sequential raytracer written in Java to Javelin as an example of a real-world application that benefits from additional processors even if communication is relatively slow. To evaluate the performance and dynamic behavior of our infrastructure, we raytraced images of size 1024x1024 pixels for the two scenes shown in Figure 3, and for a randomly generated scene.


Figure 3: Raytraced images: (a) simple, (b) cone89.

Figure 4(a) shows the speedup curve for our parallel raytracer running on simple and cone435 in a cluster of two Sparc-5s, five UltraSparcs, and one dual-processor UltraSparc. Figure 4(b) gives the speedup curve for raytracing random on the 64 Sparc-10 nodes of our Meiko CS-2. The graphs illustrate that the achievable speedup depends on the computation-to-communication ratio: the more complex the scene, the longer it takes to compute the color of a pixel, while communication costs stay the same and become less important overall.

Figure 4: Speedup curves for raytraced images: (a) simple and cone435 on up to 8 Sparc processors, (b) random on up to 64 Sparc-10 processors.

3.2.3 Mersenne Prime Measurements

As our second application we implemented a parallel primality test which is used to search for Mersenne prime numbers. This type of application is well suited to Javelin, since it is very coarse-grained with a high computation-to-communication ratio when testing large Mersenne primes. For our measurements, we chose to test Mersenne primality for all 119 prime exponents between 4000 and 5000. We selected this range because, on the one hand, we tried to make the numbers large enough to simulate the true working conditions of the application, and on the other hand, we wanted to keep them small enough to complete our set of measurements in a reasonable amount of time.

The first set of measurements was performed on clusters of identical machines. Figure 5(b) presents the speedup curve for test runs on a 64-node Meiko CS-2, while Figure 5(a) shows the curve for a cluster of 8 Sun UltraSparcs. In both cases, the speedup was close to linear as long as the ratio of job size to number of processors was big enough. For large numbers of processors, communication through a single router becomes a bottleneck. In a more realistic setting where huge primes are checked, we do not expect this to be a problem, since the computation will be extremely coarse-grained. For our tests, we chose a strategy where the biggest tasks (those with the largest amount of computation) were enqueued at the beginning of the task queue, ensuring that the hosts that participate first get the largest amount of work. This led to an even task distribution.
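For illustration, the largest-first ordering just described could be realized as follows; using the exponent itself as the cost estimate is an assumption for this sketch (larger exponents mean substantially more work).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Largest-first ordering sketch: enqueue exponents in descending order so
// that the earliest hosts draw the biggest jobs.
class LargestFirstQueue {
    static Deque<Integer> build(List<Integer> exponents) {
        List<Integer> sorted = new ArrayList<>(exponents);
        sorted.sort(Collections.reverseOrder());     // biggest first
        return new ArrayDeque<>(sorted);             // hosts poll from the front
    }
}
```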

Figure 5: Speedup curves for the Mersenne primality test: (a) up to 8 UltraSparc processors, (b) up to 64 Meiko processors.

In conclusion, the initial results from our prototype are highly encouraging. The measurements show that we can achieve almost linear speedup for small to medium numbers of hosts with this type of coarse-grained application. The next step is to seek ways to avoid the saturation of the routing service and other bottlenecks in the system, so that the result will be an infrastructure that truly scales to the size of the Internet. In the long term, it is our belief that the most natural way to avoid bottlenecks on a global scale is the market-based resource allocation approach, since it has already proven itself superior to any centralized scheme in the real world.

4 Current Research

In this section, we present some possible approaches to tackling two of the most important issues in designing a global computing infrastructure: scalability and fault tolerance. Without fault tolerance, such a system is doomed to fail because of the very nature of wide area networks, with unstable communication and frequent host failures. Without scalability, the system loses its claim to the predicate "global", thus defeating its own purpose of bringing together large numbers of hosts on the Internet. Our goal in this phase of the project is to create a reliable, stable infrastructure that scales to a larger number of hosts than any comparable system has achieved so far.

We begin by giving an overview of work that has been done in the field of fault tolerance and thread migration, followed by our proposal of a combination of various techniques. Although the migration of threads at runtime is not primarily linked to fault tolerance, it is extremely useful in the setting of a so-called "graceful retreat", i.e., when a user suddenly reclaims a host and the ongoing computation has to be saved in a portable state that can be restarted later on some other machine. As a side product, once the problem of checkpointing and migrating a thread has been solved, the problem of checkpointing for the sake of saving parts of a computation without actually migrating becomes a subproblem of the more general case. Later, we will discuss a possible way of enhancing the scalability of the current Javelin prototype based on a network of brokers that operate autonomously.

4.1 Fault Tolerance and Thread Migration

Traditional fault tolerance techniques like checkpointing and transaction systems have been studied extensively in the database community; a good general reference is [6]. However, these mechanisms are generally perceived as costly and require the use of file I/O (logging), which is often prohibitive in the global computing setting. One example of such a relatively costly approach is FT-Linda [2]. The original Linda definition ([35], see also Section 5) does not consider fault tolerance mechanisms. FT-Linda is a version of Linda that addresses these concerns by providing two kinds of enhancements: stable tuple spaces, in which tuple values are guaranteed to persist across failures, and atomic execution of tuple space operations, which allows collections of tuple space operations to be executed in an all-or-nothing fashion despite failures and concurrent operations by other processes. To achieve these features, FT-Linda utilizes a replicated state machine approach based on ordered atomic multicast. This works fine as long as the tuple space is replicated to only a few machines in a LAN. Clearly, if the system consists of several thousand hosts, all with their own local share of the storage space, atomic multicasts are not a good solution.

A similar concept is that of JavaSpaces [32], an extension to the Java language close to Linda. Here, fault tolerance is achieved through a transaction mechanism. Although this is generally a very useful approach to enhancing the fault tolerance of Java itself, we believe that the following techniques are more suitable in a global computing setting.

4.1.1 Notification

A simple, straightforward approach to a fault tolerance strategy is to leave the policy up to the application and only provide a mechanism to detect faulty components. In many cases this might even be the only feasible solution, since the application semantics remain largely hidden from the infrastructure. For instance, many traditional message passing applications have semantics that can only be handled by the application programmer, since only he or she knows how to handle lost messages or host failures. The underlying system can only help by trying to detect such failures. A common tool for failure detection is a notification mechanism that lets applications register interest in certain events and have the system call a specified routine, a so-called callback function, when an event occurs. The callback function can then handle the event according to the application semantics.
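Such a notification mechanism could be as small as the following sketch; the listener and detector types are our own illustrative names, not a published Javelin interface.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative callback interface: the application decides what a host
// or broker failure means for its own semantics.
interface FailureListener {
    void onFailure(String component, String reason);
}

// The runtime side only detects failures (e.g., missed heartbeats) and
// notifies whoever registered interest.
class FailureDetector {
    private final List<FailureListener> listeners = new ArrayList<>();

    void register(FailureListener l) {
        listeners.add(l);
    }

    // Invoked by the runtime when a failure is detected.
    void notifyFailure(String component, String reason) {
        for (FailureListener l : listeners) {
            l.onFailure(component, reason);   // application callback
        }
    }
}
```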

4.1.2 TIES/Eager Scheduling

TIES stands for Two-Phase Idempotent Execution Strategy. It is a fault tolerance mechanism that was first developed as a theoretical concept to transform arbitrary PRAM programs to execute on synchronous fail-stop PRAMs [23]. Later, the results were improved by combining probabilistic and deterministic algorithms, and then extended to execution on asynchronous PRAMs where processors can be infinitely slow. The basic idea of TIES is that faster processors get rescheduled to subsume the work of slower processors, leading to correct execution since each step is idempotent. In the asynchronous setting an additional problem arises with respect to late writers, i.e., processes that clobber memory locations by rewriting older results. This can be overcome either by a central instance that maintains some notion of global time and rejects late results, or by not letting two processes that work on the same parallel step write their results to the same location (memory evasion). The first practical implementation of TIES was sketched in [13]. The term eager scheduling was coined around that time, too, and is used as a synonym for TIES. The strategy is also used in the Charlotte/Knitting Factory system (see Section 5). The main advantages of eager scheduling are its low overhead in the absence of failures, and that it brings together parallel programming, load balancing, and fault tolerance in a natural way that fits the adaptively parallel programming model.
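A minimal sketch of eager scheduling under the idempotence assumption: every task remains eligible for re-issue until some host reports its result, so fast hosts automatically subsume the work of slow or failed ones. Integer task identifiers stand in for real task objects.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Eager scheduling sketch: tasks stay eligible until a result arrives,
// so idle hosts may re-execute work already assigned to a slow host.
// Idempotence guarantees that duplicate executions are harmless.
class EagerScheduler {
    private final Deque<Integer> eligible = new ArrayDeque<>();
    private final Set<Integer> done = new HashSet<>();

    EagerScheduler(int numTasks) {
        for (int i = 0; i < numTasks; i++) {
            eligible.addLast(i);
        }
    }

    // Called by an idle host: rotate through the unfinished tasks.
    synchronized Integer nextTask() {
        while (!eligible.isEmpty()) {
            Integer id = eligible.pollFirst();
            if (done.contains(id)) {
                continue;                // finished meanwhile: drop it
            }
            eligible.addLast(id);        // keep eligible for re-issue
            return id;
        }
        return null;                     // every task has a result
    }

    // The first result for a task is kept; later duplicates change nothing.
    synchronized void reportResult(int id) {
        done.add(id);
    }
}
```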

4.1.3 Portable Checkpoints at Bytecode Level

Portable Checkpointing [26] is a compiler-based technique that, by doing a source-to-source translation of ANSI C code, instruments the source code to save, and recover from, checkpoints that capture the state of a computation in a machine-independent format. This technique provides stack environment portability, enables the conversion of complex data types, and makes pointers portable. The result is an efficient and comfortable way of migrating threads between different architectures. As a side product, traditional checkpointing without thread migration comes for free.

Compared to C, Java has several advantages with respect to this technique. First, since Java has no pointers, the whole problem of making pointers portable need not be considered. Second, with Object Serialization already built into the language, one does not have to deal with the task of storing data structures in a portable manner. Java's multithreaded design also seems more appropriate from the beginning. All this leads us to believe that the technique can be extended to Java in a fairly straightforward manner. In order to hide this code conversion from the application programmer, we are currently investigating Portable Checkpointing at the bytecode level. If this is successful, it would open up the possibility of making the conversion at the broker instead of on the client's machine, and the bytecode could be shipped as a method parameter in an RMI call. The broker could then make selective decisions on which hosts should receive unmodified bytecode and which hosts should use a checkpointed version due to their instability.
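Java's Object Serialization already covers the data half of this problem, as the following sketch shows; unlike true Portable Checkpointing it does not capture the execution stack, so the restartable state must live in an explicit object, which is our simplifying assumption here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Checkpointing sketch using Java Object Serialization. Unlike full
// Portable Checkpointing, the execution stack is not captured: the
// application must keep its restartable state in an explicit object.
class WorkerState implements Serializable {
    long iteration;          // how far the computation has progressed
    long partialResult;      // data needed to resume after a failure

    byte[] checkpoint() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject(this);           // machine-independent snapshot
        out.flush();
        return buf.toByteArray();        // e.g., shipped to the broker
    }

    static WorkerState restore(byte[] snapshot)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(snapshot));
        return (WorkerState) in.readObject();   // resume on any host
    }
}
```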

4.1.4 Unified Approach

After studying the various techniques for fault tolerance and their costs and benefits in a global computing setting, we have come to the following conclusions:

- Any broadcast/multicast-based technique involving replication is too costly, due to network latency and bandwidth constraints and the potentially unlimited number of hosts. Logging-based techniques are also ruled out, due to high overhead and the lack of local file I/O.

- Checkpointing is a good way to preserve partial results and restart applications after failures. However, it normally requires local file I/O, which collides with the Java security model. Otherwise, if checkpointing involves communication, it may be too costly depending on the nature of the application. In the case of long-running applications like Mersenne Primes, checkpointing might be essential, because restarting jobs after failures would be unacceptable. All this leads us to believe that a good approach would offer checkpointing as a tool for the application to select and configure.

- A notification mechanism that leaves the fault tolerance policy up to the application is considered highly beneficial. Having a separate thread monitor the status of a host and take action based on a well-defined event model is also beneficial to the system itself, e.g., in case of broker failures or network problems.

- Eager scheduling is a relatively simple and efficient way to ensure fault tolerance. However, it might be costly for long-running applications in the presence of frequent failures. Consider once again the Mersenne Prime example: if failures are frequent, jobs might run for several hours, then fail. If, in the meantime, the same job were scheduled over and over again without knowing whether the first host will complete it successfully, the system would be wasting valuable resources. In this case, checkpointing seems the superior method. In fact, checkpointing and eager scheduling seem to be at opposite ends of the same scale: the longer the individual job, the more valuable checkpointing becomes, and vice versa.

- Portable Checkpointing is highly desirable, since it opens up the possibility of a true "graceful retreat": if a user reclaims a host and the system needs to vacate all resources quickly, it can pack up the computation thread and move its state to the broker. From there, it can be rescheduled to the next idle host and restarted exactly where it left off.

As a consequence, we suggest a combined approach to the fault tolerance problem that will lead to a Toolkit API for the application programmer. The unified strategy will consist of three parts:

1. An Event Manager thread that runs on every Javelin host and monitors events that applications and the system are interested in. If an event is triggered, an asynchronous message is sent to the interested party and an appropriate callback method is invoked to handle the event. An example of such an event model, although somewhat elaborate and designed for a very different setting, is given in [25].

2. Eager Scheduling, which will serve as the default mechanism if nothing else is specified by the application.

3. Portable Checkpointing, which the application will be able to select and configure, setting the regular checkpointing interval according to its needs.

Once the implementation is ready, further experiments are needed to evaluate the different techniques and their suitability for specific applications.

4.2 Scalability

To be clear about what we mean by scalable: if a global computing infrastructure is scalable, its components have bounded power, i.e., bounded computational rate, bounded communication rate, and bounded state. In particular, for Javelin to be scalable, its clients, brokers, and hosts have bounded power. These bounds imply, for example, that clients, brokers, and hosts can communicate with only a fixed number of other components during a fixed interval of time. Thus, at any point in time, there are bounds on the number of connections between hosts, between brokers, between brokers and hosts, and between the client and brokers. Bounded state similarly implies bounds on the number of brokers that a broker can know about at any point in time.

The prototype offers just a single broker/router, which becomes a bottleneck when too many hosts participate in a computation. Clearly, a network of brokers must be realized in order to achieve scalability. Internet-wide participation means that all hosts must be largely autonomous and able to work in the presence of node and network failures. Scalability implies that the architecture cannot be centralized. Bounded state implies that no site can, in general, have a global system view (e.g., a table with the names of all participating brokers). We have identified two key problems in building a scalable architecture:

1. Host allocation and code distribution: How does a client find hosts for its computation, and how does the code get distributed efficiently to a potentially very large number of hosts?

2. Data communication at runtime: How is data exchanged between participating hosts after an application has been successfully started?

In the following we describe how these problems might be solved. Technically, the proposed system will be implemented entirely in Java, with Java RMI (Remote Method Invocation) as the underlying distributed object technology. The following discussion is based on a scenario where all hosts execute code as Java applets. Therefore, the system must deal with applet security restrictions, e.g., no local file I/O. We will, however, assume that applets can communicate directly with one another once their RMI handles have been passed along through the server. (This method has been proposed and successfully tested in the Knitting Factory project [4] and runs with Sun's JDK 1.1 and the HotJava browser. Since the two leading web browsers, Netscape Communicator and Microsoft Internet Explorer, currently implement their own security policies and subsets of JDK 1.1, there is no agreed standard security model at this point.)

The rest of this section is structured according to the different states a Javelin host can be in during its lifetime. The complete state transition diagram is shown in Figure 6. There are four states: NoHost, Standby, Ready, and Running. If a host has not joined Javelin, it is in state NoHost. The transition to Standby is made by contacting a broker, registering, and downloading the daemon thread. In the next section we describe how hosts can be allocated and code can be shipped so that an application is ready to start up, causing a state transition from Standby to Ready. In Section 4.2.2 we present the data exchange mechanism, based on a distributed, double-ended queue and address hashing, which allows the host to run the application and therefore transition to Running. The diagram has two more sets of transitions: a "natural" way back from each state to the previous state when a phase has terminated, and a set of "interrupt" transitions (shown in dashed lines) that lead back to the NoHost state when a user withdraws the host from the system.

Figure 6: State transition diagram for Javelin hosts (states: NoHost, Standby, Ready, Running).

4.2.1 Finding Hosts and Shipping Code

As stated above, a client and its local broker do not actively look for hosts to join a computation. Hosts can join at any time, either by contacting the same broker as the client or indirectly through some other broker. Hosts are managed by brokers according to the farming principle: each broker can only accommodate a certain number of hosts, depending on its performance characteristics. Brokers can communicate with other brokers, but they will only be able to hold a limited number of broker addresses in their working set at any time, since the number of participants is potentially very large and the address table is bounded in size.

If every host that participates in a computation had to go to the client to download the code, this would soon lead to a bottleneck for large numbers of hosts. Therefore, first the local broker and then every other broker that joins in a computation will act as a cache on behalf of the client. Thus, client code has to be uploaded first to the local broker and then passed on to every other broker requesting work on behalf of its idle hosts. The uploading can be realized by recursing through the class call graph and packing all necessary classes into a single archive file. Upon termination of a computation, the host that detects termination (usually the original client) must send an invalidation message to its broker, which in turn passes it on to all other brokers involved, so that the code can be deleted.

Technically, each broker will be implemented as a Java RMI server, so that it can directly call methods on another broker object. Java Object Serialization will be helpful in the actual shipping of the code. Hosts will run as applets, but will also be RMI servers so they can be contacted by other hosts. In the following we describe the sequence of steps from the moment a client application is willing to execute until the moment when a host has received the code to participate in the computation.

1. The client contacts its local broker.
2. If the broker is willing to accept jobs, the client wraps up all necessary code and uploads it to the local broker. Depending on the type of application, the client may now start up and execute on its own.
3. The local broker puts the code to disk and creates an HTML page so that the code can be downloaded as an applet. In doing so, the broker acts as a disk cache for the client.
4. A host joins the system by pointing its browser to its local broker's URL. In doing so, it downloads a daemon applet from the broker that will manage host computation and communicate with the broker.
5. The host daemon contacts the local broker, asking for code to execute.
6. (a) If the local broker has work, it sends a code URL back to the host. If not, it contacts another broker and asks for code. To preserve the principle of locality and avoid flooding the network with concurrent requests from a single broker, we suggest using breadth-first search as the search algorithm (a sketch of this search appears at the end of this subsection).

(b) If the other broker has code, it sends back the code in reply. The local broker saves the code to disk and creates an HTML wrapper as before. The local broker can now send the new URL back to the host daemon.
7. The host daemon executes showURL() to force the browser to load the code applet.
8. The application starts to execute on this host.

We believe that this method will meet the system goals. To back our claim, consider the following:

- Scalability is potentially unlimited, since each broker is responsible only for a bounded number of hosts and can communicate with a bounded set of other brokers.
- Locality is maximized, because the system will always find hosts in the vicinity of the client first by using BFS (ideally, the local hosts will suffice, but next it will find hosts that are one broker hop away, then two hops, and so on).
- Ease of use remains optimal for participating hosts: the mechanism of retrieving code from other brokers is completely transparent.
- Autonomy is guaranteed, since each broker can make its own decisions on when and how often to call other brokers for jobs.
- The chosen network topology, an unrestricted graph of bounded degree, can result in a high degree of network connectivity if the broker connections are chosen carefully to prevent articulation points. This compares favorably to any strictly tree-based or hierarchical scheme, like the one used in the ATLAS project (see Section 5), and allows the Javelin network to remain functional even in the presence of failures, greatly enhancing its fault tolerance capabilities.
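The breadth-first search mentioned in step 6(a) might look like the following sketch; hasWork() and neighbors() are assumed methods on a broker's interface, simplified here to local calls instead of RMI.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical broker interface: in the real design these would be
// remote (RMI) calls on neighboring brokers.
interface Broker {
    boolean hasWork();
    List<Broker> neighbors();
}

// Breadth-first search for the nearest broker that has cached code,
// bounded by maxHops so a single broker cannot flood the network.
class BrokerSearch {
    static Broker findWork(Broker local, int maxHops) {
        Queue<Broker> frontier = new ArrayDeque<>();
        Set<Broker> seen = new HashSet<>();
        frontier.add(local);
        seen.add(local);
        for (int hop = 0; hop <= maxHops && !frontier.isEmpty(); hop++) {
            int levelSize = frontier.size();
            for (int i = 0; i < levelSize; i++) {
                Broker b = frontier.poll();
                if (b.hasWork()) {
                    return b;            // nearest broker with work wins
                }
                for (Broker n : b.neighbors()) {
                    if (seen.add(n)) {   // visit each broker at most once
                        frontier.add(n);
                    }
                }
            }
        }
        return null;                     // nothing found within maxHops
    }
}
```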

4.2.2 Distributing and Exchanging Data

As stated before, Javelin supports two different programming models: Fixed Parallelism with message passing, and Adaptive Parallelism with master-worker subcontracting. The following discussion is restricted to the latter case, since the message passing model is more geared towards letting the application choose all its data structures and communication patterns; the system cannot make any decisions at that level, because the application semantics are hidden from it.

Figure 7: Distributed double-ended task queue. The application pushes and pops work at one end, while I_WANT_WORK / X_WANTS_WORK requests from the Javelin runtime are served at the other.

In the subcontracting model we base our strategy on two main data structures that are local to every host: a hash table of host addresses (technically, Java RMI handles), and a distributed, double-ended task queue containing "chunks of work". From the point of view of scalability, using a hash table allows for fast retrieval in the average case and scales to very large numbers. In addition, there is no centralized site in this setup, and host autonomy is guaranteed, since sufficient information is kept locally to remain functional in the presence of failures. It is important to observe that the address table is bounded in size: the hash table is preallocated to some fixed size that remains manageable. At startup, a broker's connections are preconfigured to some default setup, whereas a host's first connection is always its local broker. Over the lifetime of the host, the address table will fill up through interaction with other hosts. Should the table eventually overflow, the need to evict "older" connections will arise, which can be taken care of by a standard replacement policy like LRU. All this will result in a working set of connections for each host.

The task queue is double-ended because we follow the concept of randomized work stealing, which was first introduced in the Cilk project ([7], see also Section 5). The local host picks work off one end of the queue, whereas remote requests get served at the other end. Jobs get split until a certain minimum granularity determined by the application is reached; then they are processed. When a host runs out of local jobs, it picks one of its neighbors at random from its hash table and issues a work request to that host. In doing so, the host piggybacks its own address information onto the request and selects the number of hops that the request shall be forwarded if unsuccessful. This makes sure unsuccessful work requests eventually die and do not flood the network. Figure 7 shows the principle of the task queue.
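A single-process sketch of this design follows: the local host pops work from the front of the deque, steal requests are served from the back, and each request carries a hop count so that unsuccessful requests die out instead of flooding the network. Direct object references stand in for the RMI handles kept in the address hash table.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Distributed double-ended task queue, sketched with local objects.
// PUSH_WORK/POP_WORK act on the front; steal requests hit the back.
class WorkStealingDeque {
    interface Task { void execute(); }

    private final Deque<Task> deque = new ArrayDeque<>();
    private final List<WorkStealingDeque> neighbors;   // address table stand-in
    private final Random rng = new Random();

    WorkStealingDeque(List<WorkStealingDeque> neighbors) {
        this.neighbors = neighbors;
    }

    synchronized void pushWork(Task t) { deque.addFirst(t); }

    // The local host takes its next job from the front...
    synchronized Task popWork() { return deque.pollFirst(); }

    // ...while remote steal requests are served from the back,
    // where the larger, older chunks accumulate.
    synchronized Task serveSteal() { return deque.pollLast(); }

    // Out of local work: ask a random neighbor, forwarding the request
    // at most ttl hops so it eventually dies instead of flooding.
    Task steal(int ttl) {
        if (ttl <= 0 || neighbors.isEmpty()) {
            return null;
        }
        WorkStealingDeque victim = neighbors.get(rng.nextInt(neighbors.size()));
        Task stolen = victim.serveSteal();
        return (stolen != null) ? stolen : victim.steal(ttl - 1);
    }
}
```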

5 Related Work

In this section we summarize the research efforts that are most closely related to our work. At present, none of these projects have shown that they will scale to very large numbers of participants. In fact, just like the current Javelin prototype, most of these designs are hampered by some centralized system component that becomes a bottleneck for large numbers of hosts.

Charlotte/Knitting Factory [5, 4] supports distributed shared memory and uses a fork-join model for parallel programming. A distinctive feature of this project is its eager scheduling of tasks, where a task may be submitted to several servers, providing fault tolerance and ensuring timely execution. Charlotte provides fault tolerance based on the fact that each task is atomic: changes to the shared memory become visible only after a task completes successfully. This allows a task to be resubmitted to a different server in case the original server fails.

ATLAS [3] provides a global computing model based on Java and on the Cilk programming model [7] that is best suited to tree-like computations. Cilk is a C-based runtime system for multithreaded parallel programming on traditional MPPs, although variants have been implemented in other languages and on more loosely coupled workstation clusters (see ATLAS above). It provides a distinct programming model, where a Cilk program consists of a collection of procedures, each of which is broken into a sequence of nonblocking threads. Since threads are nonblocking, they can always run to completion once invoked. Threads can spawn either children or successors, with the latter needed to pick up data dependencies from any children of a predecessor. The programming model is well suited to asynchronous, tree-like computations. Cilk also introduces a work-stealing scheduler that enables a processor that runs out of work to steal work from another busy processor, leading to efficient load balancing in the system. ATLAS ensures scalability using a hierarchy of managers. The current implementation uses native libraries, which may raise some portability problems. Like Charlotte, ATLAS provides fault tolerance based on the fact that each task is atomic. Each subtask is computed by a subtree in the hierarchy of servers. Any subtask that does not complete times out and is recomputed from a checkpoint file local to its subtree.

Popcorn [9] provides a Java API for writing parallel programs for Internet distribution. Applications are decomposed by the programmer into small, self-contained subcomputations, called computelets. The application does not specify a destination on which the computelet is to execute. Rather, a centralized entity called a "market" brings together buyers and sellers of CPU time and determines which seller will run the computelet. Market incentives are supported, e.g., two different types of auction for CPU cycles. User participation as a seller is made extremely easy: just point a Java-enabled browser at the market web site and fill in a user profile to open an account for so-called "Popcoins", the micro-currency used by Popcorn.

Bayanihan [29] identifies issues and concepts of a Java-based system like Javelin. It classifies participation in a global computing framework into different levels of volunteer computing, touching on economic concepts (e.g., barter for CPU time, without proposing a broker model). The current prototype provides a general framework for executing tasks within a so-called "chassis object" that can be either a Java applet

or a Java application. Tasks are dealt out by a centralized server (work manager) and executed by work engines in the client. Communication in Bayanihan is based on the HORB [22] distributed object library. Scheduling and fault tolerance schemes are left to the application programmer.

Ninflet [34] is a Java-based global computing system with an architecture very similar to Javelin's: clients correspond to Javelin clients, dispatchers correspond to Javelin brokers, and servers correspond to Javelin hosts. One major difference is that the system is based on Java applications instead of applets, thus overcoming applet security restrictions and letting the computational units, called Ninflets, establish point-to-point RMI communication. A special Ninflet security model is provided to address security concerns that result from the use of applications. The programming model is a master-slave model in which the client acts as the master. A special feature of the Ninflet system is the ability to checkpoint ongoing computations in order to provide fault tolerance as well as graceful evacuation of hosts reclaimed by their users.

Historically, the forerunners of the current group of global or metacomputing systems have been LAN-based parallel computing packages using the message passing paradigm, such as PVM [33] and MPI [30]. Recently, some research projects have built upon this approach by keeping the programming model and enhancing interoperability through languages like Java and object orientation. Successful examples are JPVM [17], IceT [20], and ParaWeb [8]. The following systems require the user to have login access to all participating machines and binaries for all architectures used in the computation.

Piranha [10] is one of the first projects aiming at utilizing idle computers in a LAN for parallel applications. It introduces adaptive parallelism, in which the number of processes participating in a computation may vary at runtime. The programming model is a master-worker approach; task communication is handled by a Linda tuple space [35]. The system is fault tolerant by reassigning tasks that have not been completed due to host withdrawal. Linda is a language for programming parallel applications whose most notable feature is a distributed shared memory called tuple space. A collection of primitives operating on this tuple space allows for interprocess communication and synchronization. Linda implementations are available on a number of different architectures and for a number of different languages.

Globus [18] is viewed as a networked virtual supercomputer, also known as a metacomputer: an execution environment in which high-speed networks are used to connect supercomputers, databases, scientific instruments, and advanced display devices. The project aims to build a substrate of low-level services, such as communication, resource location and scheduling, authentication, and data access, on which higher-level metacomputing software can be built.

Like Globus, Legion [21] strives to provide a single, coherent virtual machine that can accommodate large numbers of hosts. Some of the design goals are scalability, programming ease, fault tolerance, site autonomy, and multi-language support. To achieve these goals and a single, persistent name space, Legion introduces an object model that wraps around every component of the system.

6 Conclusion

We have designed and implemented a prototype of Javelin, an infrastructure for Internet-based parallel computing using Java. Javelin allows machines connected to the Internet to make a portion of their idle resources available to remote clients, and at other times to use resources from other machines when more computational power is needed. In order to be successful, an infrastructure for transforming the Web into an immense parallel computing resource must be easy to install and use. By requiring clients and hosts to have access to only a Java-enabled Web browser, Javelin achieves ease of participation: no OS or compiler modifications are needed, and system administrative costs are zero.

We presented experimental results for two coarse-grain, compute-intensive applications: a raytracer and a Mersenne primality test. For both applications we achieved good speedups on Javelin. We expect that future versions of Javelin will see a performance boost due to optimized JIT compilers. We also outlined pressing research issues in fault tolerance and scalability, properties that are essential to any global computing infrastructure. Overall, we believe that Javelin can be used successfully for Internet-based parallel computation, and future versions will address the remaining challenges to establishing a robust global computing infrastructure.

References

[1] A. Alexandrov, M. Ibel, K. E. Schauser, and C. Scheiman. SuperWeb: Research Issues in Java-Based Global Computing. Concurrency: Practice and Experience, 9(6):535-553, June 1997.
[2] D. E. Bakken and R. D. Schlichting. Supporting Fault-Tolerant Parallel Programming in Linda. IEEE Transactions on Parallel and Distributed Systems, 6(3):287-302, Mar. 1995.
[3] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.
[4] A. Baratloo, M. Karaul, H. Karl, and Z. M. Kedem. An Infrastructure for Network Computing with Java Applets. In Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, CA, Feb. 1998.
[5] A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the Web. In Proceedings of the 9th Conference on Parallel and Distributed Computing Systems, 1996.
[6] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[7] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), pages 207-216, Santa Barbara, CA, July 1995.
[8] T. Brecht, H. Sandhu, M. Shan, and J. Talbot. ParaWeb: Towards World-Wide Supercomputing. In Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.
[9] N. Camiel, S. London, N. Nisan, and O. Regev. The POPCORN Project: Distributed Computation over the Internet in Java. In 6th International World Wide Web Conference, Apr. 1997.
[10] N. Carriero, D. Gelernter, D. Kaminsky, and J. Westbrook. Adaptive Parallelism with Piranha. Technical Report YALEU/DCS/TR-954, Department of Computer Science, Yale University, New Haven, Connecticut, 1993.
[11] B. O. Christiansen, P. Cappello, M. F. Ionescu, M. O. Neary, K. E. Schauser, and D. Wu. Javelin: Internet-Based Parallel Computing Using Java. Concurrency: Practice and Experience, 9(11):1139-1160, Nov. 1997.
[12] Colusa Software. Omniware Technical Overview, 1995. http://www.colusa.com.
[13] P. Dasgupta, Z. Kedem, and M. Rabin. Parallel processing on networks of workstations: A fault-tolerant high performance approach. In Proceedings of the 15th IEEE International Conference on Distributed Computing Systems, 1995.
[14] DESCHALL. Internet-Linked Computers Challenge Data Encryption Standard. Press release, 1997. http://www.frii.com/~rcv/despr4.htm.
[15] Electric Communities. The E Programming Language, 1996. http://www.communities.com/e/epl.html.
[16] J. Feigenbaum. Encrypting Problem Instances: Or, ..., Can You Take Advantage of Someone Without Having to Trust Him? In Proceedings of the CRYPTO '85 Conference, 1985.
[17] A. J. Ferrari. JPVM: Network Parallel Computing in Java. In Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, CA, Feb. 1998.
[18] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 1997.
[19] I. Goldberg, D. Wagner, R. Thomas, and E. A. Brewer. A Secure Environment for Untrusted Helper Applications: Confining the Wily Hacker. In Proceedings of the 1996 USENIX Security Symposium, 1996.
[20] P. A. Gray and V. S. Sunderam. Native-Language-Based Distributed Computing Across Network and Filesystem Boundaries. In Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, CA, Feb. 1998.
[21] A. S. Grimshaw, W. A. Wulf, and the Legion team. The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, 40(1), Jan. 1997.
[22] S. Hirano. HORB: Extended Execution of Java Programs. In First International Conference on World-Wide Computing and its Applications (WWCA 97), 1997. http://ring.etl.go.jp/openlab/horb/.
[23] Z. Kedem, K. Palem, and P. Spirakis. Efficient robust parallel computations. In Proceedings of the 22nd ACM Symposium on Theory of Computing, 1990.
[24] Lucent Technologies Inc. Inferno. http://inferno.bell-labs.com/inferno/.
[25] M. O. Neary and D. Schumacher. DOMS: A Prototype of a Distributed, Object Oriented, Active Database Kernel as a Framework for Cooperative Transactions. Master's thesis, Universität-Gesamthochschule Paderborn, Germany, March 1995.
[26] B. Ramkumar and V. Strumpen. Portable Checkpointing for Heterogeneous Architectures. In 27th International Symposium on Fault-Tolerant Computing: Digest of Papers, pages 58-67, June 1997.
[27] RSA Data Security, Inc. The RSA Data Security Secret-Key Challenge, 1997. http://www.rsa.com/rsalabs/97challenge.
[28] R. Saavedra-Barrera, A. Smith, and E. Miya. Machine Characterization Based on an Abstract High-Level Language. IEEE Transactions on Computers, 38(12), December 1989.
[29] L. F. G. Sarmenta. Bayanihan: Web-Based Volunteer Computing Using Java. In 2nd International Conference on World-Wide Computing and its Applications, Mar. 1998.
[30] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MPI: The Complete Reference. MIT Press, Nov. 1995.
[31] Softway. Guava: Softway's just-in-time compiler for Sun's Java language. http://guava.softway.com.au/.
[32] Sun Microsystems, Inc. JavaSpaces Specification, Mar. 1998. Revision 1.0.
[33] V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Technical Report ORNL/TM-11375, Dept. of Mathematics and Computer Science, Emory University, Atlanta, GA, USA, Feb. 1990.
[34] H. Takagi, S. Matsuoka, H. Nakada, S. Sekiguchi, M. Satoh, and U. Nagashima. Ninflet: A Migratable Parallel Objects Framework using Java. In Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, CA, Feb. 1998.
[35] R. A. Whiteside and J. S. Leichter. Using Linda for Supercomputing on a Local Area Network. Technical Report YALEU/DCS/TR-638, Department of Computer Science, Yale University, New Haven, Connecticut, 1988.
[36] T. Wilkinson. Kaffe: A free virtual machine to run Java code, 1997. http://www.tjwassoc.demon.co.uk/kaffe/kaffe.htm.
[37] R. Wolski, N. Spring, and C. Peterson. Implementing a Performance Forecasting System for Metacomputing: The Network Weather Service. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC97), San Jose, CA, Nov. 1997.