A General Purpose Discrete Event Simulator

Homi Bodhanwala, Luis Miguel Campos, Calvin Chai, Chris Decoro, Kevin Fowler, Per Franck, Huy Nguyen, Nilesh Patel, Isaac Scherson, Fabricio Silva
{hbodhanw, lcampos, cchai, cdecoro, kfowler, pfranck, hnguyen6, patelnp, isaac, [email protected]}

Dept. of Information and Computer Science, University of California, Irvine, Irvine, CA 92697, U.S.A.

Abstract

Building simulators that are simultaneously general purpose, modular, extensible, distributed and portable is a key challenge in system simulation. This paper describes one such implementation in the form of a general purpose discrete event-driven simulator. Our implementation is truly general purpose, allowing it not only to execute diverse algorithms in the field of computer science (for instance, load balancing and dynamic scheduling for parallel and distributed systems) but also to simulate problems found in other areas of research (for instance, cooperation between ants in an ant colony). The modularity and extensibility features of the simulator promote ease of change, multi-person development and testing, and allow component-based simulations to be built. The distributed nature allows for substantially shorter simulation execution times. Portability permits the same code to be executed in a multitude of environments, which reduces both development time and cost. The simulator has been used to study resource allocation policies in multiprogrammed distributed systems.


1 Introduction

With the advancements of system simulation, it is beneficial to have a simulator capable of many types of simulation. It is desirable to have a simulator with a diverse set of tools that is also general purpose. Many simulator projects, both underway and completed, are limited to specific types of simulation models. In general, these simulators are not easily modified to model other types of problems. Modular designs lay the framework for easy modifications: it becomes relatively simple to exchange the various models used for a specific simulation. The use of generic modules, as well as carefully designed interfaces between modules, leaves room for further improvements and expansion.

Building general-purpose simulators that include diverse sets of qualities is a key challenge in system simulation. This paper examines an implementation of a general purpose, modular, extensible, distributed, fault tolerant simulator. The simulator environment consists of a core module and a specification for the development of application-dependent modules. In order to validate our approach we used the simulator to study the problem of how to efficiently manage the resources of a parallel/distributed system in a multiprogrammed environment. In particular, we evaluated the performance of algorithms in the areas of dynamic scheduling and dynamic load balancing. The simulation environment provides support for most, if not all, characteristics found in today's Massively Parallel Processor systems. We grouped those characteristics into four models, namely the architectural model, the machine execution model, the communication model and the computational model. In terms of workload modeling we provide an integrated set of tools that allow for three different descriptive models: the probabilistic model, the algorithmic model and the directed acyclic graph model. This paper is organized as follows: the next section briefly describes other simulators currently available. Section 3 describes the simulator environment, the principles behind its development, detailed module descriptions and interactions, message structure, message submission, and message flow. Section 4 describes in detail a particular simulation implementation, namely of a distributed system. Section 5 concludes the paper with a discussion of future enhancements.

2 Related Work

In this section we briefly discuss several simulators, each with unique characteristics. These simulation tools are described here in order to illustrate the need for a modular and extensible general-purpose event-driven simulator capable of simulating complex models, often requiring the collaboration of researchers from diverse areas. A typical case would be the simulation of a parallel/distributed system, which involves at least the following modules: architecture description, workload description, and system level algorithms. Each module can be further subdivided into different components. For instance, the system level module can be composed of a Load Balancer, a Scheduler, etc. It should be possible to test each of these components not only individually but also in conjunction with the others, since the behavior of all components cannot always be inferred from the individual components' behaviors.

2.1 Pyrros

Pyrros is a compile-time scheduling and code generation tool. Pyrros consists of a task graph language with an interface to C, a scheduling system, a graphic displayer, and a code generator. The task graph is utilized for outlining partitioned programs and statistics. The scheduling system is used for clustering, load balancing, and computation. The graphic displayer illustrates the results. The code generator goes through the code and finds the most optimal version for various parallel machines. In an attempt to help the user confirm the accuracy of the flow dependence between tasks, Pyrros has the ability to display the input dependence graph. The user can check the scheduling result by letting Pyrros display the schedule chart in the graph window and the statistics information in a separate text window.

2.2 Hypertool

Hypertool takes a user-partitioned program as its input, automatically allocates the partitions to processors and inserts proper synchronization primitives where needed. Once an algorithm is developed, it is partitioned by its designer, who then writes the program as a set of procedures. The program is then optimized and converted into a parallel program for distributed systems through parallel code synthesis. Hypertool can then generate performance estimates, including execution time, communication time, suspension time for each processor, and network delay for each communication channel. If the results are not adequate, the designer has the option to re-strategize the partitioning.

2.3 Parallax

Parallax is a software tool that aids in parallel software design by automating a number of scheduling heuristics and performance analysis tools. With this tool, a user is able to model a parallel program as a task graph, produce a schedule by automatic optimization of the scheduling heuristic, design a topology of the desired parallel processor, and, finally, obtain an estimate of the performance that results from scheduling the task graph onto the target machine.

3 Simulator Description

The simulator works by relaying messages between modules and a Core engine. The Core's responsibilities include receiving events, queuing them, sending events to other modules that requested them, and synchronizing all the modules. The Core does not and should not understand any of the semantics of these messages. Its purpose is simply to control the flow of execution. It is this property that allows us to ensure that the simulator is general-purpose and can be adapted to any process that can be modeled in an event-driven fashion.

3.1 Simulator Principles

The simulation environment described in this paper was developed with four fundamental principles in mind:

Modularity/Extensibility: This goal is achieved by separating the core features, which are required by all types of algorithms being tested, from algorithm-specific features. In addition, a clear interface for the exchange of information between user-defined modules is established. New modules, implementing specific user needs and adhering to the corresponding interface, can be effortlessly added to the simulation environment.

Portability: This goal is achieved by using Java as the implementation language. The language itself provides many useful features that make portability an attainable software quality.

Applicability: This goal is achieved by defining a common event-passing interface between the simulation engine and any algorithm-specific module. Different algorithms will listen for different events and will respond in a user-defined manner. By virtue of this, it is possible to concurrently simulate several algorithms, each from a different research area, working in tandem to solve a particular problem. We are not aware of any other simulation tool that provides such a capability.

Distributed: Rather than running a simulation on a single machine, we have realized the many advantages of allowing a single simulation to be run across several machines and platforms. Each machine may execute a particular function, and all machines are synchronized, only if needed, by the Core. This reduces the running time of the simulation by exploiting parallelism and increasing the amount of work done simultaneously. Since the Java programming language is used, the modules may run on any platform.

Indeed, the full power of the simulator is realized when it combines these features with the discrete event-driven simulation (DEVS) model. The ultimate result is a fast, powerful, and general-purpose simulation tool, which can execute a simulation on a single machine or distribute it across machines located halfway around the globe from each other. A block diagram of the simulator is given in figure 1. It is simply composed of a discrete-event simulation engine, also known as the Core, and a variable number of user-defined modules. The number and the behavior of the user-defined modules are naturally domain dependent. Figure 1 shows a block diagram for a distributed system simulation.

Figure 1: The Logical Structure of a distributed system simulation

3.2 Design and Implementation

When designing a general-purpose simulator, two fundamental entities need to be defined. The first is the message hierarchy, which enables domain-independent Module-to-Core communication and Core-to-Module interaction, in addition to providing an infrastructure for domain-dependent events to simulate any process. The second is a general Core that handles domain-independent administrative tasks, such as module registration, synchronization, and the control of the event queue.

3.2.1 The Message Hierarchy

The simulator operates by transferring messages to and from modules. As such, a format for these messages must be defined. Figure 2 represents the hierarchy of messages. This hierarchy is general enough to support any arbitrary simulation, yet detailed enough to handle all administrative tasks by the Core.

Figure 2: The Message Hierarchy

It supports two fundamental types of communication:

Module-to-Core communication: All modules can communicate directly with only one entity, the Core. Modules are not permitted to send messages directly to each other. This property allows for centralized control of the simulation. Such communication is supported by the ModuleMessage object and all its subclasses.

Core-to-Module communication: Once a module communicates with the Core, the Core may need to send a response to the module. Such communication is supported by the CoreResponse object and its subclasses.


The Message Super Class

The Message module is the parent of the entire message hierarchy. It contains only one data field, namely the time at which the message was created, or, if it is a UserEvent, the time at which the event needs to be scheduled.
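To make the hierarchy concrete, the following sketch shows one possible shape for these classes. Only the class names, the single time field, and the Serializable requirement on user events come from the paper; everything else (types, constructors, modifiers) is an assumption made for illustration.

    // Sketch of the message hierarchy (details assumed, see the note above).
    abstract class Message implements java.io.Serializable {
        protected final long time;            // creation time, or scheduled time for a UserEvent
        protected Message(long time) { this.time = time; }
        public long getTime() { return time; }
    }

    abstract class ModuleMessage extends Message {      // Module-to-Core messages; concrete
        protected ModuleMessage(long time) { super(time); }   // subclasses such as RegistrationMsg
    }                                                          // are declared final

    abstract class CoreResponse extends Message {        // Core-to-Module messages
        protected CoreResponse(long time) { super(time); }
    }

    abstract class UserEvent extends Message {           // domain-specific events
        protected UserEvent(long scheduledTime) { super(scheduledTime); }
    }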


Module-to-Core Communication and the ModuleMessage Class

The ModuleMessage class provides a structure for Module-to-Core communication. It inherits the time field from its parent, the Message class. The ModuleMessage class itself does not provide much functionality; its purpose is more cosmetic in that it serves as an encapsulation for all its subclasses, which provide specific formats for specific Module-to-Core messages. Each ModuleMessage subclass is final; that is, no user can define a message type that inherits from any ModuleMessage subclass. The reason for this is obvious: any change to this part of the message hierarchy could result in destroying the Core's ability to facilitate interaction among modules. There are four types of ModuleMessage(s):

RegistrationMsg: Before a module can participate in a simulation, it must register with the Core. Thus, an obvious part of the RegistrationMsg is the module's name. If a module with an unauthorized module name tries to register with the Core, the request is declined and no connection is established. Once a module is authorized, the Core extracts a list of all event types the registering module wishes to accept. These event types must be of type UserEvent (more on this later). Whenever a particular event is processed, the Core will forward the event to all modules that expressed interest in receiving such events. Each request is represented as an ordered pair of the form (module name, event type). The third and final component of the RegistrationMsg is a list of initial events to schedule. These events, which must also be of type UserEvent, are extracted and inserted into the Core's event queue for processing. The event queue is sorted by the schedule time of events.

AcceptMsg: Although not currently used, the AcceptMsg provides a means for a module to express interest in a particular type of UserEvent. This task is assumed to be part of the registration of a module and therefore seems obsolete. Nevertheless, the AcceptMsg provides a means for a module to accept different types of events throughout the life of a simulation, making the final product more extensible.

SubmitMsg: The fundamental role of each module is to (1) process events and (2) request the scheduling of new ones. The SubmitMsg provides a mechanism for implementing the latter function. Each module schedules events by encapsulating them into a SubmitMsg. The Core parses the SubmitMsg and all events are inserted into the event queue, which is sorted by time. A SubmitMsg may contain one or more events to schedule. A SubmitMsg is acknowledged by a response from the Core (this issue will be discussed in detail in section 3.2.1). An empty SubmitMsg is one in which there are no events to schedule. This message serves as an acknowledgment so that the Core can proceed.

CancelMsg: Modules may occasionally express a desire to cancel an event. In such a situation, a CancelMsg is used. Each CancelMsg contains a list of unique event ids, assigned by the Core. The Core then removes all events in the event queue corresponding to the event ids in the CancelMsg. If no event in the event queue corresponds to a specified event id, the event id is ignored. An event in the event queue may not correspond to an event id either because the module provided an event id that never existed, or because the event with the corresponding event id was already processed by the Core. In either case, the event cannot be cancelled.
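As an illustration of this exchange, a module might schedule a new event and later cancel it roughly as follows; the constructors, collection types and the sendToCore helper are assumptions for illustration, not the simulator's actual signatures.

    // Hypothetical module-side usage; only the message class names come from the paper.
    List<UserEvent> newEvents = new ArrayList<>();
    newEvents.add(someUserEvent);                        // any domain-specific UserEvent
    sendToCore(new SubmitMsg(now, newEvents));           // the Core replies with a SubmitResponse
                                                         // carrying the assigned event ids
    // Later: ask the Core to cancel one of those events by its id.
    sendToCore(new CancelMsg(now, List.of(assignedEventId)));
    // The CancelResponse indicates whether the cancellation actually succeeded.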

Core-to-Module Communication and the CoreResponse Class

Since modules can communicate directly only with the Core, the Core is responsible for providing all forms of acknowledgment and the forwarding of all types of events to modules. Similar to the ModuleMessage class, the CoreResponse class is more of an encapsulation for a particular type of communication than a functional entity. Its subclasses are responsible for providing specific functionality. Like the subclasses of the ModuleMessage class, the subclasses of the CoreResponse are final. There are two subtypes of CoreResponse:

SubmitResponse: Every simulation is dependent on a module's ability to submit events for processing. As aforementioned, this is done with a SubmitMsg. As the Core inserts the events into the event queue, it assigns each event a unique event id. These event ids are then forwarded to the module in a SubmitResponse message. The ids can be used by the module to request the cancellation of an event in the future. It is the module's responsibility to keep track of these event ids.

CancelResponse: A CancelMsg requires an acknowledgment simply because the Core cannot guarantee that an event was cancelled. Therefore, the CancelResponse indicates whether each cancellation request was successful or unsuccessful (either because no such event id was ever defined, or because the event had already been processed).

Notice that the RegistrationMsg also results in a response being sent back, whereas the AcceptMsg does not. For the case of a RegistrationMsg, a SubmitResponse is returned, as the RegistrationMsg may contain a list of events that require scheduling. An AcceptMsg does not trigger a response since a module does not need any additional information upon sending such a message. There is no need to simply acknowledge the arrival of a message, as TCP connections guarantee delivery. The RegistrationMsg, SubmitMsg, and CancelMsg have responses only because they provide the module with important status information.

UserEvents

Every simulation makes use of user-defined events that require processing. Since these events are domain-specific, we leave the definition of user events to those using the simulator. However, all user-defined events must satisfy the following conditions:

- Inherit from the UserEvent class
- Implement the java.io.Serializable interface to allow for the sending of events over TCP socket connections.

3.2.2 The Core

The Core is the centralizing agent of the simulator. It is responsible for starting and ending simulations, as well as registering modules, keeping track of the current time, and handling the distribution of events. The Core's responsibilities are domain-independent, and therefore module designers need not worry about its implementation details.

3.2.3 User Modules

The design and implementation of modules, unlike the Core, is entirely domain-dependent. It is the user's responsibility to write modules that serve a particular function. However, there are a few tasks that every module needs to complete. First, each module must read from a text file that contains: the module's name (used for Registration), the IP address of the machine on which the Core is running, and the port number to which the module should request a connection. Second, the module must read a separate text file containing a list of ordered pairs of the form (module name, event type). These ordered pairs will be used to indicate the types of events the module wishes to accept (as defined in the AcceptMsg specification). These processes must be completed prior to Registration. Since these two functions are standard for all modules, they have been defined in the Module class. Overall, every user-defined module must:

1. Inherit from the Module class to take advantage of the predefined methods described above.
2. Be able to accept and understand CoreResponse messages, which include its subclasses, CancelResponse and SubmitResponse.
3. Be able to construct ModuleMessage objects (RegistrationMsg, AcceptMsg, SubmitMsg, and CancelMsg), as this is the module's only means of communicating with the Core.

Once a module is connected to the Core, it must listen for input and act appropriately. The Module class provides the listening functionality as well as a set of declared, but undefined, methods which are used to handle the different types of messages from the Core. When inheriting from the Module class, every user-defined module promises to create definitions for these handler methods.
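A user-defined module might therefore look roughly like the following skeleton. The handler-method names and the exact Module constructor are assumptions made for illustration; the paper only requires that modules extend Module, read the two configuration files, and define the handlers the base class declares.

    // Hypothetical skeleton of a user-defined module (method names are assumed).
    class DynamicSchedulerModule extends Module {

        DynamicSchedulerModule(String configFile, String acceptFile) {
            super(configFile, acceptFile);   // module name, Core IP/port, and the
                                             // (module name, event type) pairs to accept
        }

        @Override
        protected void onUserEvent(UserEvent e) {
            // React to a forwarded event; reply with a (possibly empty) SubmitMsg
            // so the Core knows this module has finished processing the event.
        }

        @Override
        protected void onSubmitResponse(SubmitResponse r) {
            // Record the event ids assigned by the Core for possible later cancellation.
        }

        @Override
        protected void onCancelResponse(CancelResponse r) {
            // Check which cancellation requests succeeded.
        }
    }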

3.3 Simulation Life Cycle

Every simulation follows a domain-independent life cycle, which is administered by the Core:

1. The Core reads the text file of module names that will be registering and waits for modules to register.
2. The Core waits until all modules have registered. Every time a module attempts to establish a connection, the Core verifies the module name. If the name matches a name in the list, the module name is removed from the list (to prevent multiple registrations of the same module), the module is allowed to connect, and the rest of the RegistrationMsg is parsed.
3. Once all modules have registered (i.e. the list of modules left to register has a size of 0), the simulation begins.
4. The first event in the event queue is popped and distributed to all modules that requested events of that type. Once the event is sent to all modules, the Core blocks until a response is returned by all receiving modules. In the chance that a particular module does not have a new event to schedule, it must send back a SubmitMsg with no events, to acknowledge that although it has no schedule request, it has completed processing of the initial event. Note that the Core cannot pop the next event until all modules respond, as some module's response may change the state of the event queue.
5. The above step is executed until the size of the event queue is 0, at which point the simulation completes.
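The life cycle above boils down to a simple event loop; the sketch below covers steps 4 and 5. The data structures and method names are assumptions, not the Core's actual interface.

    // Hypothetical sketch of the Core's main loop (steps 4 and 5).
    void runSimulation(PriorityQueue<ScheduledEvent> queue) {
        while (!queue.isEmpty()) {                            // step 5: stop when the queue is empty
            ScheduledEvent next = queue.poll();               // step 4: pop the earliest event
            List<Module> listeners = subscribersOf(next.type());
            for (Module m : listeners) {
                m.forward(next);                              // deliver the event to each listener
            }
            for (Module m : listeners) {
                SubmitMsg reply = m.awaitSubmitMsg();         // block until every listener answers,
                scheduleAll(reply, queue);                    // possibly with an empty SubmitMsg
            }
        }
    }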

4 A Distributed System Implementation

4.1 Overview

To validate our approach we have implemented a distributed system simulation and studied the performance of several load balancing and dynamic scheduling algorithms. At the core of the simulation environment developed by us is the event-driven simulation engine described in section 3. Algorithm-specific modules, aimed at solving a user's particular problem, can be added to the environment at compile time. In addition, several frequently used probability distributions are provided as an external module and can be accessed through a GUI. These and other tools, such as a Directed Acyclic Graph Editor and Generator, are integral components of the environment. A block diagram of the environment is given in figure 1.

The input to the simulator, both the architectural and workload descriptions, is given in the form of ASCII files, although GUI tools that generate those files automatically are available. Both the architectural and workload types of input are described in sections 4.3.1 and 4.3.2 respectively. The mechanism by which external modules, implementing a particular algorithm, communicate with the simulation engine is explained in section 4.2.

The output of the simulator is implemented by a module independent of the simulation engine itself. This way we can have several different "views", simultaneously or not, each of which can be customized according to a user's preferences. The output module exchanges information with the simulation engine through a well defined interface. This interface allows for one-way communication only. Information flows from the simulation engine to the output module, or, to be more precise, to an instantiation of a particular user's view. All possible instantiations must listen to a particular event. They differ in two ways: in the way they implement the actions associated with the triggering of the event, and in how they display the information associated with the event. The information associated with each occurrence of the event can, this way, be displayed in real time. Additionally, all information being tracked by a particular simulation is made available to the output module, at the conclusion of the simulation, in the form of ASCII files. These files can then be processed to display relevant statistical information. By separating the output module from the engine module, we allow programmers to develop their own output modules, providing the functionality they desire, instead of us trying to develop a monolithic module using a one-size-fits-all approach.

4.2 Simulation Events

The individual algorithm implementations communicate with the Core via events. Every module implementing a particular algorithm must listen to one or more of the five events defined for this simulation. Once an event is generated, it is passed by the Core to all registered algorithms. Some of the events are generated regardless of which actions the algorithm may take; for instance the event JobArrival depends only on the workload, in particular the choice of probability distribution for the inter-arrival time. Others, however, are a direct result of the interaction between the algorithm and the Core. For instance, the event PEIdle is obviously a direct result of the actions taken by the algorithm when acting upon a previously generated event. A typical example would be a dynamic scheduler algorithm that, in response to the event EndOfTimeSlice, assigns a particular task to the Processing Element (PE) in question, which in turn will lead to the triggering of the event PEIdle when the task running on that PE performs an I/O operation. The five events defined are:

- PEIdle
- EndOfTimeSlice
- TaskArrival
- JobArrival
- TaskStateChange
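In terms of the framework of section 3, each of these is a UserEvent subclass carrying whatever identifiers its listeners need; the sketch below is illustrative only, and the field names are assumptions.

    // Illustrative only: one of the five simulation events as a UserEvent subclass.
    final class JobArrival extends UserEvent {
        public final int jobId;                  // identifier of the arriving job (assumed field)
        JobArrival(long time, int jobId) {
            super(time);                         // time at which the arrival is scheduled
            this.jobId = jobId;
        }
    }
    // PEIdle, EndOfTimeSlice, TaskArrival and TaskStateChange follow the same pattern,
    // each carrying the relevant PE, task or state identifiers.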

4.3 Inputs

As previously mentioned, the simulator accepts two types of input. The first type describes the architectural features of the machine being simulated, whereas the second type aims at describing the workload that is to be run. Each type of input can in turn be expressed in several different ways depending, among other factors, on the degree of detail sought, what the programmer wants to measure/model, etc. In section 4.3.1 we explain how to correctly specify the architecture input and in section 4.3.2 we present the different ways of characterizing the workload.

4.3.1 Architectural Description

There are currently two ways of describing the input required to characterize the machine architecture to the simulator. We can either use a high-level model, where we hide for instance all the details related to a particular interconnect topology, or we can use a low-level model where we describe in detail all the components of the PP system being modeled.

High Level Description

Currently, the simulator provides support for one high-level description model only, namely, the LogP model [5, 7, 4, 11]. The main goal of the LogP model is to serve as a basis for the design and analysis of fast, portable parallel algorithms. Many parallel algorithms developed under current parallel models are either impractical, because they exploit artificial factors not present in any reasonable machine, such as zero communication delay, or overly specialized, because they are tailored to the idiosyncrasies of a single machine, such as a particular interconnect topology. LogP is a realistic parallel model, yet simple enough to be used to design algorithms that work predictably well over a wide range of machines. It allows the algorithm designer to address key performance issues without specifying unnecessary detail. It allows machine designers to give a concise performance summary of their machine against which algorithms can be evaluated.

LogP models an asynchronous, distributed memory multicomputer in which processors communicate by point-to-point messages. The main parameters of the model are the following [5] (see figure 3):

Figure 3: The LogP performance parameters

L: An upper bound on the latency, or delay, incurred in communicating a message containing a word (or small number of words) from its source processor/memory module to its target processor/memory module.

o: The overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message. During this time, the processor cannot perform other operations.

g: The gap, defined as the minimum time interval between consecutive message transmissions or consecutive message receptions at a processor.

P: The number of processors/memory modules.
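As a concrete reading of these parameters (a standard property of the LogP model rather than anything specific to this simulator), the delivery time of a single small point-to-point message, and a lower bound for a pipelined sequence of $n$ such messages between the same pair of processors, are:

$$T_{1\ \mathrm{message}} = o + L + o = L + 2o, \qquad T_{n\ \mathrm{messages}} \ge L + 2o + (n-1)\,g \quad (\text{assuming } g \ge o).$$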

Low Level Description

Using a low level description, all significant machine parameters can be specified, such as network topology, routing algorithm, individual PE-to-PE bandwidth, etc. This type of description is often used when a very realistic simulation of a particular machine is sought. It allows for easy identification of bottlenecks for a particular combination of hardware architecture and algorithm being tested. The main drawback is that results obtained in this fashion may not be applicable to other machine architectures.

The machine architecture configuration file is used to represent both high level and low level description models. The notation used to describe the format of each entry in the file is:


[Token1] : Token2 | ... | Tokenn

where Token1 corresponds to a Property and Token2 ... Tokenn are possible Values. The character ( | ) between tokens indicates mutually exclusive values. A token enclosed in ( [ ] ) indicates an optional parameter or value. A list of all currently available properties and associated possible values is given in tables 1 and 2.

    Property : Architectural Description    Value : High Level | Low Level
    Property : Model ID                     Value : LogP
    Property : Latency                      Value : Real
    Property : Overhead                     Value : Real
    Property : Gap between Messages         Value : Real
    Property : Machine Execution Model      Value : Synchronous | Asynchronous
    Property : Network Topology             Value : Ring | Mesh | Hypercube | Fully Connected
    Property : Link Bandwidth               Value : Real

Table 1: Machine architecture file format. Properties specific to the high level and low level models.

    Property : Number of PEs                Value : Integer
    Property : Time Slice Duration          Value : Integer
    Property : FrontEnd                     Value : Boolean
    Property : Blocking                     Value : Boolean
    Property : Spin Duration                Value : Integer

Table 2: Machine architecture file format. Properties common to the high level and low level models.
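Purely as an illustration, an architecture description built from the properties in tables 1 and 2 could look like the following; the concrete file syntax is an assumption based on the Property/Value notation above, not a verbatim sample from the simulator.

    Architectural Description : High Level
    Model ID                  : LogP
    Latency                   : 5.0
    Overhead                  : 1.5
    Gap between Messages      : 4.0
    Number of PEs             : 64
    Time Slice Duration       : 10
    FrontEnd                  : True
    Blocking                  : True
    Spin Duration             : 2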

4.3.2 Workload Description

To describe the workload submitted to the simulator we can use three different description models, namely the probabilistic model, the algorithmic/programming language model and the directed acyclic graph (DAG) model, each described below. The reason for allowing such variety in the way the workload can be described stems from the fact that different users have different requirements for their simulations; instead of forcing them to cast their problem in one particular format, we provide different modules that convert a user's favorite descriptive model into the one internally used by the simulator. The three different models aim at solving different users' needs, usually associated with the mechanism by which the workload information was gathered.

The probabilistic model is often used when the user abstracts from the goal of any individual job composing the workload and focuses instead on a set of characteristics descriptive of their behavior.

The algorithmic/programming language model is used when the user knows the functionality of each job submitted to the system and wants to study how combinations of jobs/algorithms influence the performance of the resource allocation policies being tested.

Finally, the directed acyclic graph model gives the programmer absolute control over the behavior of each individual task that constitutes a parallel job. This information can, for instance, be generated by a parallelizing compiler with the aim of statically scheduling tasks in order to minimize the overall execution time.


Probabilistic Model

Two basic approaches are usually used when modeling workloads using the probabilistic model [14]:


1. Use of real workload traces gathered from scientific production runs on real supercomputers and carefully reconstructed for use in simulation testing.

2. Use of flexible synthetic workload models that use probability distributions to generate workload data.


The focus when using this model is on job-oriented resource management and job-oriented workloads, where a job consists of a collection of one or more computational tasks to be run in parallel. There is general agreement on which job characteristics are important when describing a workload.

A workload description consists of two major components: job arrival and job structure. Each job arrives at a specific time and requires a specific amount of processing time. Thus, there is a model for the distribution of the arrival process and a separate model for the distribution of each particular job's work requirement.

The first component describes how jobs are submitted to the system over a period of time. This can be somewhat involved, as a distinction has to be made between short interactive jobs and long batch jobs. In addition, there are daily and weekly cycles in the arrival process, due to the working patterns of the human users of the system.

The second component is that of modeling the work requirements of each job. This can be done in a monolithic manner, or else the internal structure of each job can be specified. As additional internal job structure is modeled, more sophisticated resource management features can be evaluated, presumably resulting in a more efficient system. The most common and clearly identifiable structures are the computational structure (parallelism and barrier synchronization), interprocess communication, memory requirements, runtime, I/O needs, etc. Unfortunately, there is not much hard data that has been measured about typical internal structural distributions. Modeling job arrivals and internal job structure are active areas of research; for detailed information about both, the reader should consult [1, 8, 15, 9, 10, 12, 6].

The internal job characteristics supported by the simulator are:

- Degree of Parallelism
- Initial Processor Arrival
- I/O Distribution
- Communication and Computation Distributions
- Synchronization Distribution
- Task Execution Distribution
- Initial Task Distribution
- Spawn Distribution

A detailed description of all the above characteristics can be found in [2]. The example in table 3 describes a very simple, and unrealistic, workload composed of jobs of one type only. The jobs can be characterized as irregular, highly parallel and communication-intensive. The granularity of synchronization is coarse and all jobs are submitted to a single PE that acts as front end for the PP system.

    Inter Arrival Time        : UniformDistribution(2,20)
    Initial Processor Arrival : FrontEnd
    Job                       : Job
    Execution Time            : UniformDistribution(500,800)
    Communication             : UniformDistribution(10,20)
    Global Synchronization    : 0.001
    Initial Task              : UniformDistribution(10,15)
    Spawn                     : UniformDistribution(100,200)
    Spawn Size                : UniformDistribution(1,4)
    Task Execution Time       : UniformDistribution(10,100)

Table 3: A workload example using the probabilistic model

Simulation Language Model

When the user has prior knowledge of the purpose of each parallel job, he can describe it algorithmically using a simulation language developed by us. This simulation language is composed of a minimal set of primitives/statements. Those primitives can be classified into two classes:

Computational Statements

Calculation: This statement represents any operation that needs the CPU to be performed. Obvious examples are assignments and arithmetic operations. This statement has an optional parameter, either a constant value representing the fixed duration (in time-slices) of the operation, or a probability distribution that allows the length of its execution to be conditioned on some variable, for instance remaining execution time. Used this way, this statement can realistically simulate loops. If no parameter is given, its value defaults to a constant value of one time-slice.

Input/Output: This statement represents the execution of an I/O operation, for instance a read or write in traditional programming languages. It has an optional argument, either a constant value representing the fixed duration (in time-slices) of the operation, or a probability distribution that allows the length of its execution to be variable. If no parameter is given, its value defaults to a constant value of one time-slice.

Fork: This statement is the only way to express parallelism in the program. It has a variable number of arguments (one for each task to be spawned), each of which uniquely identifies the new task being spawned (task id). No spawned task is allowed to start execution of its own statements before all tasks given as parameters are created.

Communication Statements

Send: There are four different forms of the send statement. They all follow the same semantics as the PVM message passing primitives [3]. Synchronous Send returns only when the receiver has posted a receive operation. Asynchronous Send does not depend on the receiver calling a matching receive. Each of the above send statements can be either a point-to-point send or a multicast send, leading to the four possible send primitives mentioned. The simulator knows whether a send is point-to-point or multicast by checking the number of parameters of the send primitive, and therefore only two forms of the send are actually defined. Additionally, the programmer can specify how long it will take to transmit the message. This optional parameter is not to be confused with the time it will take the receiver to actually receive the message. The latter is a function of the network delay, whereas the former is used to quantify how much data is being transferred. The optional parameter is either a constant value or a probability distribution.

Receive: There are four different forms of the receive statement. They all follow the same semantics as the PVM message passing primitives [3]. Blocking Receive returns as soon as data is ready, i.e. a matching send has been posted. Non-Blocking Receive returns as soon as possible. This type of receive is provided for compatibility reasons but is seldom used. Each of the above receive statements can be either a point-to-point receive or a multicast receive. The simulator knows whether a receive is point-to-point or multicast by checking the number of parameters of the receive primitive, and therefore only two forms of the receive are actually defined.

Barrier Synchronization: This statement is implemented in terms of the two previous communication statements. A barrier synchronization is nothing more than a multicast asynchronous send (to all tasks belonging to a job) followed by a multicast blocking receive. This statement is only provided as a convenience to the programmer, since no parameter is required and therefore there is no need to create and maintain the potentially very long list of task ids necessary for the multicast operation.

Although the simulation language is quite simple, it can successfully be used to describe a surprisingly large number of applications. Its simplicity coupled with its broad descriptive power is in fact one of its strengths, since it allows for very efficient yet realistic simulations. In order to exemplify how the language can be used in practice, the following example shows the code necessary to describe a very common parallel application, namely, matrix multiplication. In this example we show a very simple algorithm for matrix-vector multiplication. Although simple, it serves as the basis for many vector-processing routines. It is also particularly well suited for use in popular parallel architectures such as arrays, rings and tori. The problem can be defined as: given an $N \times N$ matrix $A = (a_{ij})$ and an $N$-vector $\vec{x} = (x_j)$, suppose that we wish to compute the matrix-vector product $\vec{y} = A\vec{x}$ defined by $\vec{y} = (y_i)$ and $y_i = \sum_{j=1}^{N} a_{ij} x_j$ for $1 \le i \le N$. The simple sequential method for doing this takes $2N^2 - N$ steps: $N$ multiplications and $N-1$ adds for each $y_i$. Using an $N$-cell linear array, however, the entire product can be calculated in $2N-1$ multiply/add steps, thereby providing a reasonably efficient speedup over the naive sequential algorithm. In [13] Leighton presents a simple algorithm for matrix-vector multiplication on a linear array. We transcribe that algorithm here so that the reader can have a good understanding of the problem at hand. However, our simulation language is platform independent and the solution we will present using the primitives provided by the language reflects that fact. It completely abstracts the user from the machine topology, allowing him to focus on the high level algorithmic solution. The algorithm for matrix-vector multiplication on a linear array is quite simple. The $x_j$'s are input one-per-step from the left end of the array (starting with $x_1, x_2, \ldots$) and the $a_{ij}$'s are input from the top of the array as shown in figure 4. The $i$th cell of the linear array computes $y_i$ by multiplying the $\vec{x}$-value input from the left by the $A$-value input from the top, and adding the product to its local memory at each step. Note that $x_j$ and $a_{ij}$ arrive in cell $i$ at the same time (specifically, at step $i+j-1$) so that the value computed at the $i$th cell is precisely $\sum_{j=1}^{N} a_{ij} x_j = y_i$. The computation of $y_i$ is completed at step $N+i-1$, after which it may be output. Hence, all values are computed after $2N-1$ steps.
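For the $N = 4$ instance used in the example below, these counts work out to:

$$2N^2 - N = 2\cdot 4^2 - 4 = 28 \ \text{sequential multiply/add steps}, \qquad 2N - 1 = 7 \ \text{steps on the 4-cell linear array}.$$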


Figure 4: Computing the matrix-vector product $\vec{y} = A\vec{x}$ on an N-cell linear array for N = 4

Using the primitives provided by the simulation language we approach the problem a little differently. We start by realizing that we can subdivide the problem into N simpler subproblems, where each subproblem, or task, is responsible for multiplying one row of the original matrix by the vector, and that all these tasks can be executed in parallel. Each of these tasks can be further subdivided into N simpler tasks, each of which is responsible for multiplying only two values, one taken from the original row and the other taken from the vector. Once every individual multiplication is completed, in one single step, the individual products have to be added on a row basis, i.e. in the body of the original N subtasks (this step could also be parallelized), which in turn are forwarded to the original task to produce the resulting vector. A diagram showing the various steps and the resulting parallel code are shown in figure 5 and table 4 respectively. For drawing simplicity we chose N to be equal to four. As is clear from the above discussion, the solution using the simulation language is completely independent of the parallel system topology, number of PEs available, architectural model, execution model and computational model. Moreover, it does not even favor the message passing paradigm over the shared memory one, although the naming of the communication primitives may lead one to think otherwise. To understand why, one just needs to keep in mind that any send/receive can easily be implemented using shared memory by writing/reading directly in memory.

T0:
    Fork(T1, T2, T3, T4)                  Spawn one task for each row of the matrix
    Async Send(T1, T2, T3, T4, const)     Pass a row and the vector to every task created
    Block Receive(T1, T2, T3, T4)         Wait for all children to communicate their partial results
    Calculation(const)                    Combine the partial results in a vector
    I/O(const)                            Output the final result
    END                                   Exit

T1:
    Block Receive(T0)                     Wait for the parameters to be sent by the parent task
    Fork(T11, T12, T13, T14)              Spawn one task per element in the row assigned to it
    Async Send(T11, T12, T13, T14)        Pass the parameters to each of its children
    Block Receive(T11, T12, T13, T14)     Wait for the partial products
    Calculation(const)                    Add all partial results
    Async Send(T0)                        Pass the result to the parent task
    END

The code for T2, T3 and T4 is similar to T1 and is therefore not shown here.

T11:
    Block Receive(T1)                     Wait for the two values to be multiplied
    Calculation(const)                    Multiply one element from the matrix by one element from the vector
    Async Send(T1)                        Communicate the product of the above calculation to its parent
    END

For simplicity we do not show the code for the remaining 15 tasks at this level (tasks T12 to T44), since it is similar to the one shown here for T11.

Table 4: Simulation language code for the matrix-vector multiplication example (TaskID, Code, Comments).

Figure 5: Different steps necessary to parallelize the matrix-vector problem

Directed Acyclic Graph Model

This form of job modeling is often used when the user has complete knowledge of the job's execution. Some of the fundamental parameters that must be known (or at least estimated) are:

- Job Parallelism (over time)
- Communication Patterns
- Individual Task Execution Time

These parameters are usually described by a directed acyclic graph (DAG), where the dependencies among the various individual tasks that compose the parallel job are fully specified. This type of representation can be generated by a parallelizing compiler, and this type of modeling is often used by researchers studying the general problem of static scheduling. For our purposes, only true dependencies are considered and scheduling is done at the task level, not the instruction level.

The simulator includes a tool called the Directed Acyclic Graph Editor/Generator. The input to the DAG generator is a set of criteria that define the behavior of the parallel problem being modeled. The set of criteria are:

Maximum Degree of Parallelism, that is, the maximum number of tasks that can be executed simultaneously at any given time.

Average Degree of Parallelism, representing the average number of tasks that can be executed in parallel at any given time.

Communication/Computation Ratio, used to characterize the communication patterns of the job.

Best Possible Execution Time, that is, the total clock time the job would require to finish if it had an unlimited amount of resources available. Naturally, a job will often take longer than its best possible execution time, since it will have to compete with other jobs for the limited available resources. Another way of interpreting this parameter is to see it as the time it takes to execute the tasks belonging to the critical path, where the critical path is defined as the path whose execution time determines the minimum execution time of the entire sequence. No amount of machine parallelism can make the execution time shorter.

The output of our DAG generator is naturally a DAG equivalent to the program being modeled, with proper functional dependencies, execution time and communication delay estimations. A question that often arises when using this type of modeling is: if we cannot estimate the execution times and communication delays accurately, can we still get good performance via static scheduling? If the task graph is coarse-grain, the answer is yes. Since in the simulations we ran each task is associated with a non-trivial set of operations, the above assumption is valid. A snapshot of the DAG Editor is given in figure 6.

Figure 6: A snapshot of the DAG Editor
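Since the best possible execution time is just the length of the DAG's critical path, it can be computed with a single longest-path pass over the tasks in topological order. The sketch below is illustrative only (it ignores communication delays and assumes task costs and successor lists are given); it is not part of the simulator's actual DAG tools.

    // Length of the critical path of a DAG whose tasks are given in topological order.
    // cost[i] = execution time of task i; succ[i] = indices of task i's successors.
    static long criticalPath(long[] cost, int[][] succ, int[] topoOrder) {
        long[] start = new long[cost.length];        // earliest start time of each task (0 for sources)
        long best = 0;
        for (int i : topoOrder) {
            long finish = start[i] + cost[i];
            best = Math.max(best, finish);
            for (int s : succ[i]) {
                start[s] = Math.max(start[s], finish);   // a task waits for all its predecessors
            }
        }
        return best;   // best possible execution time (communication delays ignored)
    }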


5 Conclusion

In this paper we described in detail a general purpose, extensible, modular, distributed and portable simulation environment. The backbone of the environment is a module named the Core, which is responsible for relaying events among the diverse user-provided modules that compose a particular simulation. We have used the environment to build a distributed system simulation and studied the impact on performance of several operating system level algorithms. Future work includes adding fault tolerance capabilities to the environment and creating a database of reusable modules that can easily be used to create powerful off-the-shelf simulations.

References

[1] M. Calzarossa and G. Serazzi. A characterization of the variation in time of workload arrival patterns. IEEE Transactions on Computers, 34(2):156-162, February 1985.

[2] Luis Miguel Campos. Resource Management Techniques for Multiprogrammed Distributed Systems. PhD thesis, University of California, Irvine, 1999.

[3] Henri Casanova, Jack Dongarra, and Weicheng Jiang. The performance of PVM on MPP systems. Technical report, University of Tennessee, July 1995.

[4] David Culler et al. LogP: Towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[5] David Culler et al. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78-85, November 1996.

[6] A. Downey. A parallel workload model and its implications for processor allocation. In High Performance Distributed Computing Conference, 1997.

[7] A. Dusseau et al. Fast parallel sorting under LogP: Experiences with CM-5. IEEE Transactions on Parallel and Distributed Systems, 7(8), August 1996.

[8] D. Feitelson. Packing schemes for gang scheduling. In Proceedings of the 2nd Workshop on Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 89-110, 1996.

[9] D. Feitelson and L. Rudolph. Metrics and benchmarking for parallel job scheduling. In Proceedings of the 4th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, pages 1-24, 1998.

[10] J. Jann, P. Pattnaik, H. Franke, F. Wang, J. Skovira, and J. Riordan. Modeling of workload in MPPs. In Proceedings of the 3rd Workshop on Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 95-116, 1997.

[11] R. M. Karp et al. Optimal broadcast and summation in the LogP model. In 5th Annual ACM Symposium on Parallel Algorithms and Architectures, July 1993.

[12] L. Kleinrock. Power and deterministic rules of thumb for probabilistic problems in computer communications. International Conference on Communications, 3:43.1.1-43.1.10, June 1979.

[13] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, 1992.

[14] Virginia Lo, Jens Mache, and Kurt Windisch. A comparative study of real workload traces and synthetic workload models for parallel job scheduling. In Proceedings of the 4th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, pages 25-46, 1998.

[15] K. Windisch, V. Lo, D. Feitelson, B. Nitzberg, and R. Moore. A comparison of workload traces from two production parallel machines. In 6th Symposium on the Frontiers of Massively Parallel Computation, 1996.