Efficient Parallel Simulation of an Individual-Based Fish Schooling Model on a Graphics Processing Unit

Hong Li, Department of Computer Science, University of California, Santa Barbara, CA 93106, [email protected]
Allison Kolpas, Department of Mathematics, University of California, Santa Barbara, CA 93106, [email protected]
Linda Petzold, Department of Computer Science and Department of Mechanical Engineering, University of California, Santa Barbara, CA 93106, [email protected]
Jeff Moehlis, Department of Mechanical Engineering, University of California, Santa Barbara, CA 93106, [email protected]

April 8, 2008

Abstract

Due to their low cost and high performance processing capabilities, graphics processing units (GPUs) have become an attractive alternative to clusters for some scientific computing applications. In this paper we show how stochastic simulation of an individual-based fish schooling model can be efficiently carried out on a general-purpose GPU. We describe our implementation and present computational results to illustrate the power of this new technology.

1 Introduction

Driven by the video gaming industry, the Graphics Processing Unit (GPU) has evolved into an inexpensive yet powerful high-speed computing device for scientific applications. The GPU has a highly parallel structure, high memory bandwidth, and more transistors devoted to data processing, rather than to data caching and flow control, than a CPU [8]. Problems that can be implemented with stream processing and limited memory are well suited to the GPU. Single Instruction Multiple Data (SIMD) computation, in which a large number of independent records are processed by the same sequence of operations simultaneously, is an ideal type of GPU application. In recent years, computation on the general-purpose GPU (GPGPU) has become an active research field, with a wide range of applications including cellular automata, particle systems, fluid dynamics, and computational geometry [3, 10, 5, 4]. Previous-generation GPUs required non-graphics applications to be recast as graphics computations through a graphics application programming interface (API). Last year, NVIDIA released the Compute Unified Device Architecture (CUDA) toolkit for their GPUs, providing general-purpose functionality with a C-like language for non-graphics applications.

Fish schooling is an important example of self-organized collective motion in animal groups. The collective behavior of the group emerges from local interactions of individuals with their neighbors, without any leader, template, or other external cue. Such groups can be composed of hundreds to millions of members, with all individuals responding rapidly to their neighbors to maintain the collective motion. In this paper we show how long-time stochastic simulations of an individual-based fish schooling model can be efficiently performed on a CUDA-enabled GPU. In the schooling model, each organism or "agent" is treated individually, with rules specifying its dynamics and interactions with other agents. Noise is included to account for imperfect sensing and processing. For different values of the parameters, different schooling behaviors emerge, including a swarm state, in which individuals move incoherently about a central location, and a mobile state, in which individuals travel in an aligned, polarized group. A large number of realizations is necessary to accurately determine the statistics of the collective motion. Simulations of the schooling model carry a high computational cost and can benefit from parallel processing. In the model, each agent updates its direction of travel based on the positions and directions of travel of all other agents. This computation can be performed in parallel across individuals within a single realization. In addition, the realizations can be performed in parallel. The architecture of the GPU is quite suitable for both types of parallel processing. In the following, we first describe the details of the individual-based fish schooling model. We then review the features of the GPU and show how parallel processing, both across individuals within a single realization and across realizations, can be implemented on a CUDA-enabled GPU to efficiently perform long-time ensemble simulations of the schooling model.


2 Fish Schooling Model

Many organisms move and travel together in self-organizing groups, including flocks of birds, schools of fish, and swarms of locusts [1]. Individual-based models (IBMs) are frequently used to describe the dynamics of such groups, since they can incorporate biologically realistic social interactions and behavioral responses as well as relate individual-level behaviors to emergent population-level dynamics. Here we consider a two-dimensional individual-based model for fish schooling. This model is similar to that considered in [2], but without an informed leader, and with different weights of orientation and attraction response. Groups are composed of N individuals with positions p_i(t), unit directions v̂_i(t), constant speed s, and maximum turning rate θ. At every time step of size τ, individuals simultaneously determine a new direction of travel by considering neighbors within two behavioral zones: a "zone of repulsion" of radius r_r about the individual and a "zone of orientation and attraction" of inner radius r_r and outer radius r_p. The latter includes a blind area, defined as a circular sector with central angle (2π − η), within which neighbors are undetectable. These zones are used to define behavioral rules of motion. First, if individual i finds agents within its zone of repulsion, it moves away from them by orienting its direction away from their average relative directions. Its desired direction of travel in the next time step is given by

\[
v_i(t+\tau) = -\sum_{j \neq i} \frac{p_j(t) - p_i(t)}{|p_j(t) - p_i(t)|}, \qquad (1)
\]

normalized as v̂_i(t+τ) = v_i(t+τ)/|v_i(t+τ)|, assuming v_i(t+τ) ≠ 0. If v_i(t+τ) = 0,

agent i maintains its previous direction of travel, giving v̂_i(t+τ) = v̂_i(t). If no agents are found within individual i's zone of repulsion, then it will align with (by averaging the directions of travel of itself and its neighbors) and feel an attraction towards (by orienting itself towards the average relative directions of) agents within the zone of orientation and attraction. Its desired direction of travel is given by the weighted sum of two terms:

\[
v_i(t+\tau) = \omega_a \sum_{j \neq i} \frac{p_j(t) - p_i(t)}{|p_j(t) - p_i(t)|} + \omega_o \sum_{j} \frac{v_j(t)}{|v_j(t)|}, \qquad (2)
\]

where ω_a and ω_o are the weights of attraction and orientation, respectively. This vector is normalized assuming v_i(t+τ) ≠ 0. If v_i(t+τ) = 0, then agent i maintains its previous direction of travel. Noise effects are incorporated by rotating agent i's desired direction v̂_i(t+τ) by an angle drawn from a circularly wrapped normal distribution with mean 0 and standard deviation σ. Also, since individuals can only turn θτ radians in one time step, if the angle between v̂_i(t) and v̂_i(t+τ) is greater than θτ, individuals do not achieve their desired direction and instead rotate θτ towards it. Finally, each agent's position is updated simultaneously as

\[
p_i(t+\tau) = p_i(t) + s\,\hat{v}_i(t+\tau)\,\tau. \qquad (3)
\]

To begin a simulation, individuals are placed in a bounded region with random positions and directions of travel. Simulations are run for approximately 3000 time steps. The fish schooling model is very well suited for the GPU because of its high arithmetic intensity, relatively simple data structure needs, and its complete data parallelism for ensemble simulations. One may also quite easily assess the performance and accuracy of simulations on the GPU by comparing with results from the host workstation.
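As a concrete illustration of these rules, the following is a minimal sequential sketch of one agent's update following Eqs. (1)-(3). The data structures (Agent, Params) and function names are ours, not the code used in the paper, and the blind-zone and zero-vector handling are simplified:

    // Host-side reference sketch of one agent's update, following Eqs. (1)-(3).
    #include <cmath>
    #include <vector>

    struct Agent  { float px, py, vx, vy; };      // position and unit direction
    struct Params {
        float rr, rp;      // repulsion radius r_r and outer radius r_p
        float eta;         // field of view (radians); blind angle is 2*pi - eta
        float wa, wo;      // attraction weight omega_a, orientation weight omega_o
        float s, tau;      // speed and time-step size
        float theta;       // maximum turning rate
    };

    // Desired direction of agent i for the next time step (Eqs. (1) and (2)),
    // before noise and the turning-rate limit are applied.
    void desiredDirection(const std::vector<Agent>& a, int i, const Params& prm,
                          float& dx, float& dy)
    {
        float repx = 0.0f, repy = 0.0f;               // Eq. (1) accumulator
        float attx = 0.0f, atty = 0.0f;               // attraction term of Eq. (2)
        float orix = a[i].vx, oriy = a[i].vy;         // orientation term includes self
        bool repulsed = false;
        for (int j = 0; j < (int)a.size(); ++j) {
            if (j == i) continue;
            float rx = a[j].px - a[i].px, ry = a[j].py - a[i].py;
            float d  = std::sqrt(rx*rx + ry*ry);
            if (d < 1e-6f || d > prm.rp) continue;    // coincident or out of range
            // Unsigned angle between agent i's heading and the direction to agent j,
            // used to test the blind zone of the orientation/attraction region.
            float ang = std::fabs(std::atan2(rx*a[i].vy - ry*a[i].vx,
                                             rx*a[i].vx + ry*a[i].vy));
            if (d >= prm.rr && ang > 0.5f*prm.eta) continue;
            if (d < prm.rr) { repulsed = true; repx -= rx/d; repy -= ry/d; }
            else {
                attx += rx/d; atty += ry/d;
                orix += a[j].vx; oriy += a[j].vy;     // v_j is already a unit vector
            }
        }
        if (repulsed) { dx = repx; dy = repy; }                               // Eq. (1)
        else { dx = prm.wa*attx + prm.wo*orix; dy = prm.wa*atty + prm.wo*oriy; }  // Eq. (2)
        float n = std::sqrt(dx*dx + dy*dy);
        if (n > 1e-6f) { dx /= n; dy /= n; }          // normalize
        else           { dx = a[i].vx; dy = a[i].vy; }// keep previous heading
    }

    // Noise rotation, turning-rate limit, and position update (Eq. (3)).
    void applyUpdate(Agent& ai, float dx, float dy, const Params& prm, float noiseAngle)
    {
        const float PI = 3.14159265f;
        float desired = std::atan2(dy, dx) + noiseAngle;          // rotate by the noise angle
        float current = std::atan2(ai.vy, ai.vx);
        float diff = std::remainder(desired - current, 2.0f*PI);  // in [-pi, pi]
        float maxTurn = prm.theta * prm.tau;
        if (std::fabs(diff) > maxTurn) diff = (diff > 0.0f ? maxTurn : -maxTurn);
        float h = current + diff;
        ai.vx = std::cos(h); ai.vy = std::sin(h);
        ai.px += prm.s * ai.vx * prm.tau;                         // Eq. (3)
        ai.py += prm.s * ai.vy * prm.tau;
    }

In a full simulation all desired directions are computed from the current state before any position is updated, so that the update of the group is simultaneous.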


3 The Graphics Processing Unit - A New Data Parallel Computing Device

The GPU was originally designed as a dedicated graphics rendering device for workstations and PCs. Driven by the strong economic pressure of the fast-growing interactive entertainment industry, it has evolved rapidly into a very powerful general-purpose computing device (see Figure 1 [8]). The modern GPU is small enough to fit into most desktops and workstations, putting a supercomputer on the desktop, and it can be programmed with a high-level language. Although it currently supports only 32-bit floating point precision, that will change soon. The GPU is best suited to parallel processing applications: stream-processing computations with high floating-point arithmetic intensity and a high ratio of computation to memory accesses [6]. It is especially well suited to SIMD applications, since the calculation is able to hide the memory access latency. It remains a specialized processor, however, and is not efficient for applications that are memory-access intensive, require double precision, or involve many logical operations on integer data or many branches [6].

Figure 1: Observed peak GFLOPS on the GPU, compared with theoretical peak GFLOPS on the CPU [8].

The NVIDIA 8800 GTX that we use belongs to the new generation built around stream processing rather than the pipelining that was characteristic of previous generations of GPUs. The card carries 768 MB of RAM, and its 480 mm² chip contains 681 million transistors forming 128 stream processors, grouped into 16 multiprocessors of 8 streaming processors each, as shown in Figure 2 [8]. The 8 processors in a multiprocessor share 16 KB of low-latency shared memory, which brings data closer to the arithmetic units. The maximum observed bandwidth between system and device memory is about 2 GB/second. The global memory can be accessed at the same speed but with much higher latency.

Figure 2: Hardware model: a set of SIMD multiprocessors with on-chip shared memory [8].

Until recently, researchers interested in employing GPUs for scientific computing applications had to use a graphics application programming interface (API) such as OpenGL. They would first recast their model into a graphics API and then trick the GPU into running it as graphics code. This is a very uncommon programming pattern and made migrating non-graphics applications onto the GPU a significant challenge. Just last year, NVIDIA introduced the CUDA Software Development Kit (SDK), which supplies an essential high-level development environment for general-purpose computation on NVIDIA GPUs. This reduces the learning curve for accessing the low-level hardware and gives the user much more development flexibility than graphics programming languages [8]. A productive way of computing on the device is to maximize the use of the fast but small shared memory and to minimize accesses to the slow global memory. To do this, we must first develop a strategy to make the data fit into the limited shared memory. In our computation, to obtain memory-level parallelism in each block, the threads first load their target agents' information from global memory to shared memory simultaneously, and then perform the computation involving the target agents entirely on data in shared memory. After this computation, each thread copies the data of its target agent back to global memory in parallel. A kernel is able to run on many thread blocks in parallel without any communication among the blocks, which is important because cooperation among different thread blocks is not yet supported. Within one thread block, all of the threads can synchronize through the hazard-free shared memory.
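Schematically, this load-compute-store pattern can be written as the following simplified kernel skeleton; the state layout (one float4 per agent) and the constant N_AGENTS are illustrative choices rather than the paper's actual code:

    // Schematic CUDA kernel illustrating the shared-memory usage pattern:
    // load from global memory, synchronize, compute, write back.
    // Launched as stepKernel<<<numBlocks, N_AGENTS>>>(dState);
    #define N_AGENTS 100

    __global__ void stepKernel(float4* globalState)
    {
        __shared__ float4 state[N_AGENTS];          // low-latency on-chip copy

        int tid = threadIdx.x;
        int gid = blockIdx.x * N_AGENTS + tid;      // one realization per block

        // 1. Each thread loads its target ("goal") agent into shared memory.
        state[tid] = globalState[gid];
        __syncthreads();                            // all agents now visible to all threads

        // 2. Compute using shared memory only; the influence calculation over
        //    state[0..N_AGENTS-1] would go here.
        float4 updated = state[tid];

        __syncthreads();                            // finish all reads before overwriting

        // 3. Each thread writes its updated agent back to global memory.
        globalState[gid] = updated;
    }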

4 Parallelization

There are two ways to parallelize the simulation. One is to parallelize across realizations: independent realizations with different initial conditions are performed in parallel. Such realizations are uncoupled and therefore do not need to communicate with each other; each realization communicates with the host only to get its initial data and to transfer results at the end of the simulation. The other is to parallelize within a single realization: the domain of state variables is partitioned into smaller sub-domains, and all the threads cooperate with each other while manipulating their own sub-domains in parallel to advance the simulation. We use both methods of parallelization for the fish schooling model. In the parallelization within a single realization, the different sub-domains of the fish schooling model are tightly coupled. Even for an application which is ideal for the GPU architecture, the user still needs to deal carefully with the limited shared memory and the large latency of the global memory. Efficient concurrent use of the shared memory and global memory can supply data to the arithmetic units at a reasonable rate and results in good performance. We will show how our computation for the fish schooling model can be structured to fit within the GPU architecture's constraints.

4.1 Parallelization Across Realizations

Parallelization across realizations is a straightforward and effective way to improve performance for this application. Using multiple blocks for ensemble simulations keeps the GPU's large arithmetic capacity fully used and hides the shared and global memory access latency, yielding good performance. The kernel instructions are distributed among thread blocks, each of which is in charge of a single realization. The system state for the fish schooling model is defined as Px_i^j(t), Py_i^j(t), Vx_i^j(t), Vy_i^j(t), where Px_i^j(t) and Py_i^j(t) are the x and y components of the position of a fish agent, Vx_i^j(t) and Vy_i^j(t) are the x and y components of its velocity, i is the block id, j is the id of the fish agent, and t is time. Each thread block with id i = c computes on the subset Px_c^j(t), Py_c^j(t), Vx_c^j(t), Vy_c^j(t) and generates its own initial state variables Px_c^j(0), Py_c^j(0), Vx_c^j(0), Vy_c^j(0) using the parallel Mersenne Twister (MT) random number generator [7, 9]. The desired results are stored in the final state vectors Px_i^j(t_f), Py_i^j(t_f), Vx_i^j(t_f), Vy_i^j(t_f) after t_f simulation steps. We use an intermediate data structure on the device to minimize transfers between the host and the device, and group several small transfers into one large transfer to reduce the per-transfer overhead.
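A rough host-side sketch of this layout is shown below: one thread block per realization, with all realizations packed into single device arrays so that one grouped transfer replaces many small ones. The buffer names, the block count, and the kernel name simulateEnsemble are illustrative assumptions, not the paper's code:

    // Host-side sketch of the across-realizations layout.
    #include <cuda_runtime.h>

    int main()
    {
        const int N = 100;                    // agents per realization
        const int numRealizations = 160;      // one thread block per realization
        const int total = N * numRealizations;
        size_t bytes = total * sizeof(float);

        float *dPx, *dPy, *dVx, *dVy;         // Px_i^j, Py_i^j, Vx_i^j, Vy_i^j
        cudaMalloc((void**)&dPx, bytes); cudaMalloc((void**)&dPy, bytes);
        cudaMalloc((void**)&dVx, bytes); cudaMalloc((void**)&dVy, bytes);

        // Each block initializes its own realization on the device (e.g. with a
        // parallel Mersenne Twister), so no initial-condition transfer is needed.
        // simulateEnsemble<<<numRealizations, N>>>(dPx, dPy, dVx, dVy, 3000);

        // One grouped copy of the final states back to the host.
        float* hState = new float[4 * total];
        cudaMemcpy(hState + 0 * total, dPx, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(hState + 1 * total, dPy, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(hState + 2 * total, dVx, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(hState + 3 * total, dVy, bytes, cudaMemcpyDeviceToHost);

        delete[] hState;
        cudaFree(dPx); cudaFree(dPy); cudaFree(dVx); cudaFree(dVy);
        return 0;
    }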


4.2 Parallelization Within A Single Realization

We briefly review the sequential algorithm before describing our strategy for parallelization across individuals. To begin a simulation, individual agents are placed in a bounded region with random positions and headings. The simulation is then advanced for 3000 steps to reach a steady state; by steady state we mean that the group has self-organized and arrived at a particular type of collective motion. During a time step, each agent has to compute the influence of all other agents on itself. To compute the influence on an agent, we first calculate the distance between this "goal agent" and all other agents, and then use a few variables to record the influence coming from the different zones. We then calculate the net influence on this "goal agent", including a random noise term, and save it to an influence array. In a system with N fish agents, the time complexity of the distance calculation on a CPU is O(N²), and that of the influence calculation is O(N). After determining the influences for all of the agents in the fish school, we update the positions and directions of the agents simultaneously, based on the influence array at the current time step. The on-chip shared memory is very limited but has much lower latency for memory instructions than the global memory: an access to shared memory takes about 2 clock cycles, while a read from or write to global memory incurs an additional 200-300 clock cycles of memory latency [8]. To use the GPU effectively, the on-chip shared memory must be used as much as possible. In theory it is better to have at least 128 threads per block to get the best performance [8], but we also need to take the limited shared memory size into account. Thus, there are restrictions on the total number of fish agents stored in shared memory and on the number of threads in each block. The main method of parallelizing within a single realization is to decompose the problem domain into smaller sub-domains that are computed in parallel.
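In outline, the sequential time step just described has the following two-pass structure (building on the illustrative Agent/Params sketch from Section 2); note that all influences are stored before any agent is updated, so the update is simultaneous:

    // Outline of one sequential time step: an O(N^2) influence pass that fills
    // an influence array, followed by a simultaneous update of all agents.
    // Agent, Params, desiredDirection, and applyUpdate refer to the illustrative
    // sketch given earlier in Section 2.
    #include <vector>

    void sequentialStep(std::vector<Agent>& agents, const Params& prm,
                        const std::vector<float>& noiseAngles)
    {
        const int N = (int)agents.size();
        std::vector<float> infX(N), infY(N);     // influence (desired direction) array

        for (int i = 0; i < N; ++i)              // O(N^2): each agent scans all others
            desiredDirection(agents, i, prm, infX[i], infY[i]);

        for (int i = 0; i < N; ++i)              // all agents updated simultaneously
            applyUpdate(agents[i], infX[i], infY[i], prm, noiseAngles[i]);
    }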

Figure 3: Decomposition of the fish schooling model into CUDA threads and blocks for simulating schools of size N = 100: 160 one-dimensional thread blocks of 100 threads each are used to maximize the use of GPU resources, with each thread responsible for one individual fish agent.

In our simulation, the system state vectors Px_i^j(t), Py_i^j(t), Vx_i^j(t), Vy_i^j(t) for each block i are partitioned into subsets, each manipulated by one thread. Assume that there are N fish agents in one realization, simulated with n threads; then each thread must handle k = N/n agents to calculate the influences on every agent. To begin a simulation, all agents are initialized with random positions and directions of travel in parallel. At each time step, n agents are processed at a time, with each thread managing one agent, and this is repeated k times. For schools of size N = 100 we set k = 1 for best performance; when simulating larger schools, one must take k > 1. The decomposition of the schooling model for schools of size N = 100 is shown in Figure 3. The system state variables loaded into shared memory at this stage are called the "goal agents". Each thread holds only one "goal agent" at a time and is responsible for computing the influences on that agent. Thus, at each time step, each thread operates on one k-element subvector. To calculate the distance between a "goal agent" and all the other agents, the positions and directions of all the other agents are needed. A temporary array of size n in shared memory is used to load n agents at a time, with each thread loading one agent. If k > 1, the loading continues until all the required data have been loaded and used in the calculation for all the "goal agents". Each thread can then calculate the influence information for its own "goal agent". At the last step, each thread records the results into its own influence subvector. Thus the distance calculation complexity for each thread is reduced to O(N), and the influence calculation complexity to O(1). After all the influence calculations are completed, each thread updates its system state subvector using its influence subvector. The process of parallelizing across realizations and within a single realization is shown in Figure 4. Concurrent reads of the same data are supported by the hardware, so during each calculation every thread can read the full system state vector without conflicts.


Figure 4: m blocks and n threads are used to parallelize across realizations and within a single realization. First, each thread loads the information of one agent, its "goal agent", from device memory to shared memory. Second, each thread loads its "goal agent" into an intermediate array in shared memory. Third, each thread uses all the data in the intermediate array to compute the influences on its own "goal agent". The loading continues until all of the influences on the "goal agent" have been computed. Then each thread saves its influence to the influence record array (which reuses the previous array to save shared memory). Finally, each thread writes the updated information of its "goal agent" back to the system state vector on the device.

In summary, during the entire simulation most calculations operate on data in the low-latency shared memory, and multiple blocks running in parallel hide the memory access latency and maximize the use of the arithmetic units. This results in a substantial performance improvement.
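A simplified device-side sketch of this scheme, for the k = 1 case, is given below. The names, the float4 state layout (x, y, vx, vy), and the constants are our illustrative choices, and the noise, blind zone, and turning-rate limit are omitted for brevity:

    // Simplified sketch of the within-realization kernel for k = 1 (one "goal
    // agent" per thread, N threads per block, one realization per block).
    #define R_REPULSION  1.0f     // r_r
    #define R_PERCEPTION 7.0f     // r_p

    __device__ void accumulateInfluence(float4 goal, float4 other,
                                        float* rep, float* att, float* ori)
    {
        float rx = other.x - goal.x, ry = other.y - goal.y;
        float d  = sqrtf(rx*rx + ry*ry);
        if (d < 1e-6f || d > R_PERCEPTION) return;
        if (d < R_REPULSION) { rep[0] -= rx/d; rep[1] -= ry/d; }
        else { att[0] += rx/d; att[1] += ry/d; ori[0] += other.z; ori[1] += other.w; }
    }

    // Launch as: schoolStep<<<numRealizations, N, N*sizeof(float4)>>>(dState, N);
    __global__ void schoolStep(float4* globalState, int N)
    {
        extern __shared__ float4 tile[];             // n agents loaded per pass

        int tid  = threadIdx.x;
        int base = blockIdx.x * N;                   // this block's realization

        // Each thread keeps its goal agent in registers for the whole step.
        float4 goal = globalState[base + tid];
        float rep[2] = {0, 0}, att[2] = {0, 0}, ori[2] = {goal.z, goal.w};

        // Tile over all N agents, blockDim.x at a time (a single pass when k = 1).
        for (int start = 0; start < N; start += blockDim.x) {
            if (start + tid < N)
                tile[tid] = globalState[base + start + tid];   // cooperative load
            __syncthreads();

            int m = min((int)blockDim.x, N - start);
            for (int j = 0; j < m; ++j)              // every thread reads the whole tile
                if (start + j != tid)
                    accumulateInfluence(goal, tile[j], rep, att, ori);
            __syncthreads();                         // finish reads before the next load
        }

        // Combining rep/att/ori into the new heading, adding noise, applying the
        // turning limit, and writing the updated agent back to globalState would
        // follow here, as in the sequential sketch of Section 2.
    }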

4.3 Random Number Generation

A large number of high-quality pseudorandom numbers is required to generate initial conditions and to add noise to the calculations at each time step. Statistical results can be trusted only if the independence of samples can be guaranteed. Pregenerating a large random number sequence is not a good choice because it would consume too much time in memory accesses. To generate random numbers for our application, we use the Mersenne Twister (MT) algorithm [7, 9]. It has passed many statistical randomness tests, including the stringent Diehard tests, generates high-quality, long-period random sequences with a high order of dimensional equidistribution, and makes efficient use of memory. MT is therefore well suited to our application. A modified version of Eric Mills' multithreaded C implementation [9] of the MT algorithm was employed to generate the random numbers for each thread in parallel. To generate a large quantity of random numbers efficiently, the shared-memory-based implementation is employed.
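As an illustration, the per-agent noise angle can be drawn from a wrapped normal distribution with a Box-Muller transform on top of any per-thread uniform generator. The nextUniform routine below is only a minimal linear-congruential placeholder to keep the sketch self-contained; the simulation itself uses the parallel Mersenne Twister [7, 9]:

    // Sketch of drawing the noise angle from a wrapped normal distribution.
    __device__ float nextUniform(unsigned int* s)
    {
        // Minimal linear-congruential placeholder for the per-thread MT state.
        *s = 1664525u * (*s) + 1013904223u;
        return ((*s >> 8) + 1u) / 16777217.0f;       // uniform in (0, 1]
    }

    __device__ float noiseAngle(unsigned int* rngState, float sigma)
    {
        const float PI = 3.14159265f;
        float u1 = nextUniform(rngState);
        float u2 = nextUniform(rngState);
        float g  = sqrtf(-2.0f * logf(u1)) * cosf(2.0f * PI * u2);   // N(0, 1)
        float a  = sigma * g;                                        // N(0, sigma^2)
        a = fmodf(a + PI, 2.0f * PI);                // wrap to (-pi, pi]
        if (a < 0.0f) a += 2.0f * PI;
        return a - PI;
    }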

5 Simulation Results and Performance

Our simulations were performed on an NVIDIA GeForce 8800 GTX installed in a host workstation with an Intel Pentium 3.00 GHz CPU and 3.50 GB of RAM with Physical Address Extension. We explored the effects of varying the ratio of orientation to attraction tendencies, r = ω_o/ω_a, for schools of N = 100 members. For each r considered, 1120 steady-state simulations of the schooling model were performed on the GPU. The remaining parameters were set to r_r = 1, r_p = 7, η = 350°, s = 1, τ = 0.2, σ = 0.01, and θ = 115°. For r close to zero, attraction tendencies dominate over alignment tendencies and groups exhibit swarming behavior. As r is increased past 1 (equal orientation and attraction), schools become increasingly polarized, forming highly aligned mobile groups for large r. Group polarization, defined as

\[
P(t) = \frac{1}{N}\left|\sum_{i=1}^{N} \hat{v}_i(t)\right|,
\]

serves as a good measure of the mobility of a school. See Figure 5 for a plot of the average group polarization as a function of r. The performance in generating these results was extraordinary: the parallel GPU simulation is about 230-240 times faster than the corresponding sequential simulation of the same model on the CPU. The polarization curve in Figure 5 took a few minutes to generate on the GPU; the corresponding serial version on the CPU would have taken more than 8 hours.
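For reference, the polarization of a single realization can be computed on the host from the final unit directions with a few lines; the function below is an illustrative sketch, not the paper's post-processing code:

    // Host-side sketch of the group polarization P(t) = (1/N) | sum_i v_i(t) |.
    #include <cmath>
    #include <vector>

    float polarization(const std::vector<float>& vx, const std::vector<float>& vy)
    {
        float sx = 0.0f, sy = 0.0f;
        for (size_t i = 0; i < vx.size(); ++i) { sx += vx[i]; sy += vy[i]; }
        return std::sqrt(sx*sx + sy*sy) / static_cast<float>(vx.size());
    }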

6 Conclusions

We showed how stochastic steady-state simulations of an individual-based fish schooling model can be efficiently performed in parallel on a general-purpose GPU, and observed speedups of about 230-240 times for our parallelized code on the GPU over the corresponding sequential code running on the CPU of the host workstation. With such processing capabilities at our fingertips, it is easy to compare the effects of different modeling assumptions and parameters, allowing us to refine our understanding of how population-level behavior arises from individual-level interactions. The GPU has the power to revolutionize the way we do scientific computation by bringing the processing power of a cluster to the home desktop.

Figure 5: Group polarization as a function of r, averaged over 1120 steady-state replicate simulations run in parallel on the GPU. (a) A realization of the swarm state (r = 0.125), (b) a realization of the dynamic parallel state (r = 2), (c) a realization of the highly parallel state (r = 1000).

7 Acknowledgments

This work was supported in part by the U.S. Department of Energy under DOE award No. DE-FG02-04ER25621, by NIH Grant EB007511, by the Institute for Collaborative Biotechnologies through Grant DAAD19-03-D004 from the U.S. Army Research Office, and by National Science Foundation Grant NSF-0434328.

References

[1] S. Camazine, J. L. Deneubourg, N. R. Franks, J. Sneyd, G. Theraulaz, and E. Bonabeau. Self-Organization in Biological Systems. Princeton University Press, Princeton, 2003.


[2] I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin. Effective leadership and decision making in animal groups on the move. Nature, 433:513-516, 2005.

[3] GPGPU-Home. GPGPU homepage, 2007. http://www.gpgpu.org/.

[4] H. Li, A. Kolpas, L. Petzold, and J. Moehlis. Parallel simulation for a fish schooling model on a general-purpose graphics processing unit. Concurrency and Computation: Practice and Experience, 2008. To appear.

[5] H. Li and L. Petzold. Stochastic simulation of biochemical systems on the graphics processing unit. Technical report, Department of Computer Science, University of California, Santa Barbara, 2007. Submitted.

[6] W.-m. Hwu. GPU computing: programming, performance, and scalability. In Proceedings of the Block Island Workshop on Cooperative Control, 2007.

[7] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8:3-30, 1998.

[8] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007. http://developer.download.nvidia.com.

[9] NVIDIA Forums members. NVIDIA forums, 2007. http://forums.nvidia.com.

[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, pages 21-51, Aug. 2005.
