Parallel Hierarchical Radiosity Algorithms: Case Study on a DSM-COMA Architecture

Chegu Vinod and Vipin Chaudhary
TR-95-01-27

Parallel and Distributed Computing Laboratory (PDCL)
Department of Electrical and Computer Engineering
Wayne State University, Detroit, MI 48202
Ph: (313) 577-5802  FAX: (313) 577-1101

Parallel Hierarchical Radiosity Algorithms: Case Study on a DSM-COMA Architecture

Chegu Vinod and Vipin Chaudhary

Abstract

A major problem in realistic image synthesis is the solution of the rendering equation. Research in this area aims at faster solutions of the rendering equation that satisfy the desired numerical and visual accuracy. Radiosity methods provide an effective solution of the rendering equation for Lambertian diffuse environments. Hierarchical radiosity algorithms apply techniques developed for the N-body problem and incorporate a global error bound to arrive at a faster approximate convergence to the radiosity solution. We evaluate the complexity of the hierarchical radiosity algorithm and suggest methods to solve the problem in parallel. In this paper we study three task allocation/mapping schemes (static, dynamic and SD) required to exploit data locality and hence efficiently parallelize this dynamic and irregular application on a 64-node distributed shared memory, cache-only memory access (DSM-COMA) architecture. The SD scheme was found to yield better speedups than the static or dynamic schemes.

Keywords: Global illumination, rendering equation, radiosity, multiprocessors, COMA, message-passing.

I. Introduction

Scientific visualization today demands parallel implementation, since it combines the need for large databases of 3D structures or scanned images, very high computation rates to compute the motion of the millions of objects that make up a complex scene, and very high bandwidth communication to send many multi-megabyte pictures to a remote display device.

This study was conducted using the resources of the University of Michigan CPC, funded under NSF Grant CDA-92-14296.


Photorealism in a computer-generated image is a critical requirement for important scientific visualization applications. The problem is often termed realistic image synthesis, which involves taking into account the interdependence of the illumination of the scene while rendering. Two popular approaches suggested in the literature are ray tracing and radiosity. However, these approaches are highly computationally expensive. Two issues of concern to designers and implementers today are:

- Design of efficient algorithms that provide a near-realistic image.
- Parallelizing the computation on multicomputers/multiprocessors to render the realistic image in near real time.

In recent years hierarchical N-body algorithms have been applied to a wide range of scientific and engineering problems to arrive at faster approximate solutions. But even these hierarchical approaches do not promise real-time performance for larger and more complex problem data sets, necessitating the design of efficient parallel algorithms. Hierarchical radiosity algorithms apply techniques developed for the N-body problem, incorporating a global error bound and allowing surfaces/patches to interact only when they are within the specified error, resulting in faster convergence. The extent of parallelism that can be exploited from a given application also depends on the type of multiprocessor architecture being used. Recent research in large-scale shared-memory machines has led to two interesting variants, namely the cache-coherent non-uniform memory access (CC-NUMA) machines and the cache-only memory access (COMA) architectures. These two variants share some common features: distributed main memory, a scalable interconnection network, and directory-based cache coherence. They provide scalable memory bandwidth, and cache coherence using only a small fraction of the system bandwidth. Since COMA architectures (Swedish DDM, KSR) are known to perform better than CC-NUMA (MIT Alewife, Stanford DASH) for dynamic applications with irregular fine-grained data accesses (e.g., N-body algorithms) [14], we decided to conduct a case study of this radiosity algorithm on a KSR2-64. In this paper we study the effect of three task mapping/allocation schemes on a dynamic, irregular, and loosely synchronous N-body application in an attempt to exploit data locality and hence efficiently parallelize it on a 64-node distributed shared memory, cache-only memory access (DSM-COMA) architecture. The following section introduces the problem of global illumination, followed by a brief review of a few popular radiosity solutions in section 3. For a more detailed review, please

refer to [10]. Section 4 gives an overview of a general radiosity-based rendering system. The hierarchical radiosity algorithm is reviewed in section 5 and its complexity is analyzed in section 6. A parallel hierarchical radiosity algorithm is presented in section 8. Section 9 briefly discusses the implementation details, followed by a study of the effect of task mapping/allocation schemes on a DSM-COMA architecture in section 10. A model for a radiosity-based rendering system and a block schematic of the framework of its parallel implementation are also shown.

II. Global Illumination models

The realism of a computer-generated image depends on the type of illumination model used. Non-physically based local lighting models such as Phong's, although computationally simple, fail to capture the true interdependency in the shading model. Physical lighting models reproduce better effects by incorporating the illumination and shading effects due to other surfaces in the scene. Interdependency in the global illumination is governed by an integral equation known as the rendering equation [1]. Given the geometry and the emissive, reflective, and transmissive properties of all the surfaces, the outgoing radiance at each surface is expressed as:

L_{out}(x, \theta_o, \phi_o) = L_{emit}(x, \theta_o, \phi_o) + \int_{\omega} \frac{\cos\theta_i \, \cos\theta_o'}{r^2} \, V_{x,x'} \, \rho_{bd}(x, \theta_i, \phi_i, \theta_o, \phi_o) \, L_{out}(x', \theta_o', \phi_o') \, dx'    (1)

where L(x, \theta, \phi) is the radiance (energy/time/projected area/solid angle) leaving a point x in the direction (\theta, \phi). V_{x,x'} is the visibility function, which equals 1 if x and x' are inter-visible or 0 if they are occluded from each other's view; \omega is the set of all surfaces in the scene. \rho_{bd}(x, \theta_i, \phi_i, \theta_o, \phi_o) is the bidirectional reflectance distribution function (BRDF) describing the reflective properties of the surface at a point x. It is a spectral quantity and hence is a function of the wavelength \lambda. The variables \theta_o, \phi_o, \theta_i, \phi_i, V_{x,x'} and r are functions of x and x'. The outgoing radiance L_{out} for the surface x can be computed from the above equation. The rendering equation is an example of a Fredholm equation of the second kind. The kernel of the integral is given by:

K(x, x') = \rho_{bd}(x, \theta_i, \phi_i, \theta_o, \phi_o) \, \frac{\cos\theta_i \, \cos\theta_o'}{r^2} \, V_{x,x'}    (2)

For specular scenes the non-smoothness and sparseness of the kernel can be exploited by using Monte Carlo techniques to arrive at approximate solutions. For diffuse scenes the

smoothness of the kernel can be exploited by using finite element methods (projection methods: collocation and Galerkin) [8] to arrive at an approximate solution of the integral. The former approach is often characterized as ray tracing, the latter as radiosity. The rendering equation simplifies if we assume that all the surfaces are Lambertian diffuse: the radiance emitted and the diffuse reflectance are then functions of x alone. Radiosity is the sum of the emitted and reflected radiation over the surface under consideration:

B(x) = E(x) + \rho(x) \int_{\omega} \frac{\cos\theta_i \, \cos\theta_o'}{\pi r^2} \, V_{x,x'} \, B(x') \, dx'    (3)

The kernel in the above integral can be approximated by using projection methods. The unknown radiosity function can be expressed in terms of a set of basis functions with limited support, resulting in a set of n linear equations, where n is the number of discrete elements in the environment being rendered.
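To make the step from equation (3) to this linear system explicit, the following is a brief sketch assuming the simplest piecewise-constant basis (the "uniform constant" choice of figure 2); the symbols B_i, E_i, \rho_i, A_i and F_{ij} anticipate the definitions given with equations (4) and (5) in the next section.

B(x) \approx \sum_{j=1}^{n} B_j N_j(x), \qquad N_j(x) = \begin{cases} 1 & x \in \text{element } j \\ 0 & \text{otherwise} \end{cases}

Substituting this expansion into equation (3) and averaging over element i (a Galerkin projection onto the same basis) gives

B_i = E_i + \rho_i \sum_{j=1}^{n} F_{ij} B_j ,

with F_{ij} as in equation (5) (visibility included). Multiplying through by A_i and using the reciprocity relation A_i F_{ij} = A_j F_{ji} recovers the energy-balance form of equation (4).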

III. Radiosity solutions

The radiosity of a diffuse material has no directional dependence and can be denoted by a single number. We assume that the scene description has been discretized into n surface elements, where each element is small enough that the brightness does not vary significantly across its surface. The energy leaving each element i is given by

B_i A_i = E_i A_i + \rho_i \sum_{j=1}^{n} F_{ji} B_j A_j    (4)

where B_i is the brightness/radiosity of element i (energy/unit area/unit time), E_i is the emissivity of element i (energy/unit area/unit time), A_i is the area of element i, \rho_i is the diffuse reflectivity of element i, and F_{ij} is the form factor from element i to j (the fraction of the energy arriving at patch j from patch i). The form factor is a purely geometric term and is proportional to the solid angle subtended by the emitter from the vantage point of the element receiving the light, assuming all points on the emitter are visible from the receiver (figure 1). The differential form factor between two finite areas is given as

F_{ij} = \frac{1}{A_i} \int_{A_i} \int_{A_j} \frac{\cos\theta_i \, \cos\theta_j}{\pi r_{ij}^2} \, dA_j \, dA_i    (5)
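Equation (4) is a set of n linear equations, and figure 2 lists Gauss-Seidel among the ways of solving it. As a concrete illustration (and not the hierarchical solver studied in this paper), the sketch below performs Gauss-Seidel sweeps on the per-unit-area form B_i = E_i + \rho_i \sum_j F_{ij} B_j, assuming the full n x n form-factor matrix, with visibility folded in, has already been computed (e.g. by the hemicube or by ray casting).

```c
#include <math.h>
#include <stddef.h>

/* Classical full-matrix radiosity solve for the system of equation (4),
 * written in its per-unit-area form  B_i = E_i + rho_i * sum_j F_ij * B_j.
 * F is a row-major n x n array of precomputed form factors (visibility
 * included).  Illustrative sketch only; the hierarchical algorithm of
 * section V avoids ever building this O(n^2) matrix. */
void radiosity_gauss_seidel(size_t n, const double *F, const double *E,
                            const double *rho, double *B,
                            int max_iters, double tol)
{
    for (size_t i = 0; i < n; i++)
        B[i] = E[i];                      /* start from the emitted radiosity */

    for (int it = 0; it < max_iters; it++) {
        double change = 0.0;
        for (size_t i = 0; i < n; i++) {  /* one Gauss-Seidel sweep */
            double gather = 0.0;
            for (size_t j = 0; j < n; j++)
                gather += F[i * n + j] * B[j];   /* light gathered by element i */
            double newB = E[i] + rho[i] * gather;
            change = fmax(change, fabs(newB - B[i]));
            B[i] = newB;                  /* updated value is used immediately */
        }
        if (change < tol)                 /* converged */
            break;
    }
}
```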


Figure 1: Form factor geometry

The equation states that the contribution of element i to the brightness of element j is equal to its own brightness times a form factor, which gives the ratio of the light leaving element i and reaching j, times the diffuse scattering coefficient of i. The steps and approaches involved in arriving at a solution of the radiosity rendering equation are shown in figure 2 [8]. Two factors affect the final result of radiosity-based global illumination algorithms: the computation of the form factors, and the meshing strategy used to decompose the given polygons in the scene to arrive at faster and better approximations.

Several sources of error arise while computing the form factors using the hemicube [3]. Most of them are due to aliasing, or beating, that occurs because of uniform sampling. Other sources of error arise from the fact that a hemicube placed on a source can only determine the form factors to receiving surfaces with finite areas (the elements) and not to point receivers (differential surface areas). Thus the radiosities at the element vertices, which are actually used to render the final image, cannot be determined directly. Additional imprecision is introduced, especially when shading curved surfaces, when the radiosities at the vertices are determined by averaging the radiosity of the surrounding elements. Ray-traced form factors try to address the above problems.

The other common problem, earlier ignored in computing radiosity solutions, is the method adopted for subdividing the surfaces into elements. Meshes with a constant grid of rectangular or triangular elements, which approximate the true radiosity function with functions that are constant over an element, exhibit jagged shadow edges. Better meshing schemes can be created with user interaction. A posteriori meshing constructs approximate the solution with a coarse, uniform mesh and then refine the mesh where the gradient appears to be large. These methods solve the problem to some extent, but the accuracy depends upon the fineness of the subdivision. A priori meshing constructs, also known as discontinuity meshing [8], employ object-space techniques to predict, before the solution, where the shadow edges and other discontinuities occur; a mesh is then constructed accordingly. Another meshing improvement is to vary the mesh resolution according to the range of the interaction: a fine mesh is needed for nearby elements but a coarse mesh suffices for distant elements. The brightness-weighted hierarchical algorithm focuses the effort on the computation of significant form factors (nearby elements) and tries to arrive at faster solutions by approximating the insignificant ones (distant elements). The hierarchical algorithm reduces the total number of interactions from O(n^2) to O(n + m^2), where n is the total number of patches after subdivision and m is the number of initial top-level polygonal patches. A combined approach using the hierarchical algorithm and discontinuity meshing has also been suggested [5]. It uses the hierarchical algorithm to subdivide the input polygonal patches at the discontinuities, arriving at the desired numerically accurate solution during the first (global) pass, and then uses discontinuity meshing and quadratic interpolation during the second (local) pass to arrive at a visually accurate solution. The resulting algorithm was found to be both numerically and visually accurate and can be used to produce images without jagged shadows. Wavelets have been used to solve the rendering equation (a Fredholm integral equation of the second kind) and have been found to provide a better understanding of the hierarchical radiosity algorithm. Recent improvements to the hierarchical radiosity (HR) algorithm include the use of clustering algorithms [6] to reduce the complexity due to the creation of initial links for complex environments, and the use of partitioning and ordering [7] for polygonal environments much larger than can be stored in main memory (and which therefore have to be stored on disk). This paper deals only with the issues related to an efficient parallel implementation of the original hierarchical radiosity algorithm; the same concepts could be extended to the recent improvements to HR [6, 7].
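The paragraph above contrasts hemicube and ray-traced form factors. The following is a minimal sketch of a sampled estimate of the form factor integral in equation (5) between two finite patches, assuming uniform sampling of both patches and a caller-supplied visibility test (for example, a ray query against the BSP tree of section V); the names vec3, sample_i, sample_j and visible are hypothetical, and the estimator is an illustration, not the one used by the authors.

```c
#include <math.h>

#define PI 3.14159265358979323846

typedef struct { double x, y, z; } vec3;

static double dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* Monte Carlo estimate of the form factor of equation (5) from patch i to
 * patch j.  sample_i / sample_j draw uniformly distributed points on the two
 * patches, n_i / n_j are the unit patch normals, area_j is the area of patch
 * j, and visible() returns 1 if the segment between the two points is
 * unoccluded (e.g. a BSP-tree ray query).  Illustrative sketch only. */
double form_factor_estimate(int nsamples,
                            vec3 (*sample_i)(void), vec3 (*sample_j)(void),
                            vec3 n_i, vec3 n_j, double area_j,
                            int (*visible)(vec3, vec3))
{
    double sum = 0.0;
    for (int s = 0; s < nsamples; s++) {
        vec3 p = sample_i();
        vec3 q = sample_j();
        vec3 d = { q.x - p.x, q.y - p.y, q.z - p.z };
        double r2 = dot(d, d);
        double r  = sqrt(r2);
        double cos_i =  dot(n_i, d) / r;          /* angle at patch i */
        double cos_j = -dot(n_j, d) / r;          /* angle at patch j */
        if (cos_i > 0.0 && cos_j > 0.0 && visible(p, q))
            sum += cos_i * cos_j / (PI * r2);
    }
    /* Averaging the integrand over the samples leaves a factor A_j from the
     * integral over patch j; the 1/A_i of equation (5) cancels against the
     * uniform sampling density on patch i. */
    return (sum / nsamples) * area_j;
}
```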

IV. Overview of the Radiosity based rendering system

The radiosity-based rendering problem can be split into three main parts, namely the pre-processing part, the radiosity solution part and the rendering part (figure 3). The three parts can be represented by a simple block schematic diagram (figure 2). This model allows us to treat the radiosity solution part independently of the pre-processing and rendering stages. We now try to compute the computational requirement for implementing this model in real time. Radiosity computations can be done off-line and the results used to render the

image using the graphics pipeline.

V. Hierarchical radiosity algorithm: A review

The radiosity problem shares many similarities with the N-body problem [4]. In the N-body problem each body exerts a force on every other body, resulting in n(n-1)/2 pairs of interactions. Similarly, in the radiosity problem each patch may scatter light to every other patch, resulting in n^2 interactions. In the N-body problem the forces obey Newton's third law and are therefore equal but opposite; in the radiosity problem there is no requirement that the amount of light being transported between two surfaces is the same in both directions. Moreover, just as the gravitational or electro-magnetic forces fall off as 1/r^2, the magnitude of the form factor between two patches also falls off as 1/r^2. For this reason hierarchical clustering ideas like those of the N-body problem apply to the radiosity problem. The hierarchical radiosity algorithm [4, 9], which comprises three components: 1) a hierarchical description of the environment, 2) a criterion for determining the level in the hierarchy at which two objects can interact, and 3) a means of estimating the energy transfer between the objects, proceeds as follows. The input polygons that comprise the scene are first inserted into a Binary Space Partitioning (BSP) tree to facilitate the visibility computation [2] between pairs of patches. The algorithm then iterates over the following steps until the total radiosity of the scene converges to within a fixed tolerance:

- For every polygon, compute its radiosity due to all the polygons on its interaction list, subdividing it or other polygons hierarchically as necessary.

- Add all the area-weighted polygon radiosities together to obtain the total radiosity of the scene, and compare it with that of the previous iteration to check for convergence.

Every input polygon can be viewed as the root of the quadtree that it will be decomposed into. In every iteration, each of these quadtrees is traversed depth-first, starting from its root. At every quadtree node visited, the interactions of that node (patch i, say) are computed with all the patches j on its interaction list. The interaction between two patches involves computing both the visibility and the unoccluded form factor between them, and multiplying the two to obtain the actual form factor. The actual form factor (F_ji) is then multiplied by the radiosity B_j of patch j to determine the magnitude of the light transfer between j and i. If the magnitude of BF is greater than the predefined threshold, the patch with the larger area is

subdivided. Children are created in its quadtree if they do not already exist. If the patch being visited in the quadtree (patch i) is subdivided, patch j is removed from its interaction list and added to its children's lists. If patch j is subdivided, it is replaced by its children on patch i's interaction list. Patch i's interaction list is completely processed in this manner before its children are visited in the tree traversal. Finally, after the traversal of a quadtree is completed, an upward pass is made to accumulate the area-weighted radiosities of a patch's descendants into its own radiosity. If the radiosity of the scene has not converged, the next iteration performs similar traversals of all quadtrees starting at their roots. At the beginning of an iteration, the interaction list of a patch in any quadtree is exactly as it was left at the end of the previous iteration: it contains the patches with which its interaction did not cause any subdivision.
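The refinement step just described can be summarized in code. The sketch below is a simplified illustration, not the authors' implementation: the Patch type, the helpers unoccluded_form_factor, visibility, subdivide and ilist_add, and the thresholds BF_EPS and A_EPS are all hypothetical, and a real implementation would gather radiosity into a separate accumulator and perform the area-weighted upward pass afterwards.

```c
#include <stdlib.h>

#define BF_EPS  0.01   /* illustrative brightness-weighted refinement threshold */
#define A_EPS   1e-3   /* illustrative minimum patch area */

typedef struct Patch Patch;
struct Patch {
    double   area, radiosity, rho;   /* patch area, radiosity, reflectivity */
    Patch   *child[4];               /* quadtree children (NULL until subdivided) */
    Patch  **ilist;                  /* interaction list */
    int      nilist;
};

/* Hypothetical helpers assumed to exist elsewhere. */
double unoccluded_form_factor(const Patch *to, const Patch *from);
double visibility(const Patch *a, const Patch *b);   /* 0..1, via the BSP tree */
void   subdivide(Patch *p);                          /* creates four children  */
void   ilist_add(Patch *p, Patch *q);                /* appends q to p's list  */

/* Process one interaction (i gathers from j), refining if the estimated
 * transfer is too coarse: the larger patch is subdivided and the link is
 * pushed down one level, exactly as in the text above. */
static void refine(Patch *i, Patch *j)
{
    double Fji = unoccluded_form_factor(i, j) * visibility(i, j);
    double BF  = j->radiosity * Fji;

    if (BF < BF_EPS || (i->area < A_EPS && j->area < A_EPS)) {
        i->radiosity += i->rho * Fji * j->radiosity;  /* accept at this level */
        return;
    }
    if (i->area >= j->area) {                 /* subdivide the receiver */
        if (!i->child[0]) subdivide(i);
        for (int c = 0; c < 4; c++)
            ilist_add(i->child[c], j);        /* j moves to the children's lists */
    } else {                                  /* subdivide the source */
        if (!j->child[0]) subdivide(j);
        for (int c = 0; c < 4; c++)
            ilist_add(i, j->child[c]);        /* children replace j on i's list */
    }
}
```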

VI. Complexity of the hierarchical radiosity algorithm

In the case of the hierarchical algorithm the number of patches, and hence the number of resulting interactions, is not known a priori. They can, however, be estimated as functions of A_eps and F_eps as follows. Since quadtrees are used to subdivide a patch into its sub-patches, we can arrive at the following expression. We know that the number of interactions is of the order of the number of nodes in the tree, which in turn is a function of A_eps and F_eps. Each polygon is assigned a list of polygons that are directly visible to it and with which it has to interact. Two patches can interact with one another only if F_eps is greater than the form factor between the two patches. The number of interactions between a pair of polygons i and j can be given by the following expression,

n_c = \sum_{a=1}^{4^{d}} \sum_{b=1}^{4^{e}} \left( f(F_{eps}, F_{ij}) + f(F_{eps}, F_{ji}) \right)    (6)

where

f(F_{eps}, F_{ij}) = \begin{cases} 1 & \text{if } F_{eps} > F_{ij} \\ 0 & \text{if } F_{eps} < F_{ij} \end{cases}

and d = \lceil \log_4 (A_i / A_{eps}) \rceil, e = \lceil \log_4 (A_j / A_{eps}) \rceil bound the quadtree subdivision depths of the two polygons. Let p_i^{(n)} be the number of polygons that are visible to a given polygon p_i. Then the total number of interactions between the polygon p_i and the ones on its interaction list is

N_i = \sum_{c=1}^{p_i^{(n)}} n_c    (7)

Let there be k polygons in the environment; the total number of interactions in the environment is N = \sum_{i=1}^{k} N_i. The number of interactions, N, gives a measure of the number of computations required for arriving at a solution. For a scene of medium complexity (100,000 polygonal patches), the computational requirement for providing real-time solutions varies from 150-200 MFLOPS or even more.

VII. KSR: DSM-COMA architecture

The KSR [20] is a cache-only memory access (COMA) machine built on a fat-tree interconnection of a hierarchy of rings. The level:0 ring has 32 64-bit custom super-scalar processor cells interconnected by a pipelined unidirectional slotted ring, and is connected through two ring routing cells (RRC) to the higher level:1 ring. The COMA system presents an invalidation-based, sequentially consistent shared-memory model. It comprises a system virtual address (SVA) space, which is global to the entire system. The programmer, however, sees the shared memory in terms of a context address (CA) space which is unique to each processor. The segment translation tables (STT) map the CA space to the SVA space. The consistency information is stored in the ALLCACHE routing directory cell (ARD) of each level:0 ring, and these ARDs are connected together in the level:1 ring (figure 5). Each processing node has 32 MB of local cache and 0.5 MB of sub-cache. The unit of data transfer in the system is 128 bytes (a subpage); the unit of data transfer between the sub-cache and the local cache is half a subpage. Allocation in the local cache is done on a 16 KB page basis, and in the sub-cache in 2 KB blocks at a time. In order to reduce the latency of data accesses the KSR provides the asynchronous operations prefetch and post-store. This study was done on a 64-node (two level:0 rings) KSR2.
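Since coherence and transfer on the KSR happen at the 128-byte subpage granularity, two logically independent data items that share a subpage can ping-pong between processor caches. The following is an illustrative C fragment, not taken from the paper, showing one common way to pad per-processor state to a subpage boundary; KSR_SUBPAGE and the GCC-style alignment attribute are assumptions made for the sketch.

```c
#include <pthread.h>

#define KSR_SUBPAGE 128   /* coherence/transfer unit on the KSR, in bytes */

/* Per-processor task-queue bookkeeping. */
struct task_queue_head {
    pthread_mutex_t lock;
    int             head;
    int             tail;
};

/* Round each instance up to (at least) one subpage, so that updates made by
 * one processor do not invalidate a neighbouring processor's structure.
 * The union assumes the struct fits in a single subpage; illustrative only. */
union padded_queue {
    struct task_queue_head q;
    char                   pad[KSR_SUBPAGE];
};

/* One subpage-aligned entry per processor (GCC-style attribute). */
static union padded_queue queues[64] __attribute__((aligned(KSR_SUBPAGE)));
```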

VIII. Parallel hierarchical radiosity algorithm

The analysis of the computational requirements of the hierarchical radiosity algorithm, keeping in mind the objectives of realistic image synthesis, demands an average performance of a few hundred MFLOPS. Such an immense data-manipulation requirement cannot be met even

with the fastest workstations. An obvious way of getting around the problem is to use more than one processor working at peak performance [17]. Although the hierarchical radiosity algorithm is based on the same approach as classical N-body methods (Barnes-Hut, FMM) [11, 12], there are several differences between the two. First, the starting point in the hierarchical algorithm is not the n finally-divided elements, or "bodies", but the large input polygons that the scene is initially decomposed into. These input polygons are subdivided hierarchically as the algorithm proceeds, generating a quadtree per input polygon, and the number of resulting leaf-level elements (n) is not known a priori. The second difference is that the interaction law is not simply a function of the interacting entities and the distance between them, since the transport of light between two patches might be occluded; a visibility computation is therefore also necessary. The third difference is that while an N-body computation proceeds over hundreds of time steps, the radiosity computation proceeds over a smaller number of iterations, until the total radiosity of the scene converges. Parallelism can be exploited across the input polygons, across the patches the input polygons are subdivided into, and across the interactions computed for these patches. Of the three, we choose to parallelize across the input polygons. The algorithm is briefly described below; a minimal sketch of the task-queue discipline follows the list.

1. Every process has a task queue assigned to it.
2. In every iteration of the radiosity algorithm, provide every task queue with an initial set of polygon-polygon interactions.
3. If a processor subdivides a patch during the course of its computation, it creates a task for each created patch, which includes all the patch-patch interactions it must perform.
4. When a processor finds no more tasks in its task queue, it can steal tasks from other processors.
5. To preserve data locality, created tasks are added to the head of the task queue, while other processors steal from the tail end of the queue. This maximizes the efficient utilization of processor idle time and ensures data locality to some extent.
6. Steps (3-5) are repeated till the radiosity of the scene converges.
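Below is a minimal sketch of the task-queue discipline of steps 3-5, assuming a fixed-capacity array-based queue per processor protected by a pthread mutex; task_t, push_head, pop_head and steal_tail are hypothetical names, and the implementation on the KSR differs in detail. The owner works in LIFO order at the head of its queue, so freshly created patches stay in its cache, while idle processors steal the oldest (and typically largest-grain) tasks from the tail.

```c
#include <pthread.h>
#include <stdbool.h>

#define QCAP 4096

typedef struct { int patch_i, patch_j; } task_t;   /* one patch-patch interaction */

typedef struct {
    pthread_mutex_t lock;
    task_t          buf[QCAP];
    int             head, tail;     /* owner works at head, thieves at tail */
} task_queue_t;

/* Owner pushes newly created tasks at the head (step 5): the patches it just
 * subdivided stay hot in its own cache. */
bool push_head(task_queue_t *q, task_t t)
{
    pthread_mutex_lock(&q->lock);
    bool ok = (q->head - q->tail) < QCAP;
    if (ok) q->buf[q->head++ % QCAP] = t;
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Owner also pops from the head (LIFO order preserves locality). */
bool pop_head(task_queue_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);
    bool ok = q->head > q->tail;
    if (ok) *t = q->buf[--q->head % QCAP];
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* An idle processor steals from the tail of a busy queue (step 4): it takes
 * the oldest tasks and disturbs the owner's working set the least. */
bool steal_tail(task_queue_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);
    bool ok = q->head > q->tail;
    if (ok) *t = q->buf[q->tail++ % QCAP];
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```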

IX. Implementation details

The problem graph for this algorithm is of the form shown in figure 6. Tasks A and B constitute initialization and visibility computation. Tasks C, D and E, which constitute the interactions between the polygons/patches (form factor and radiosity computation), are performed in parallel. The average global radiosity is computed from the patch radiosities and checked for convergence (tasks F, G). The framework of our implementation on the KSR2-64 is shown in figure 8. We have designed and implemented our parallel radiosity algorithm closely along the lines of the modular framework of the radiosity-based rendering problem illustrated in figure 3. The pre-processing and the radiosity computation were implemented on the KSR2-64 and the rendering part was implemented on an RS6000 graphics front end. PVM was used for communication between the graphics front end and the KSR2-64. The execution timings presented exclude this PVM daemon communication; only the timings of the computation are reported in the paper. All timings are in seconds. The two data sets used (room and large room) comprise 174 and 248 input polygons, respectively, and the solution in each case converged after 6-7 iterations. The room input polygon data set, the hierarchically meshed room, and the rendered radiosity solution are shown in figures 7a, 7b and 7c.
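For readers unfamiliar with PVM, the following is a minimal sketch of the master side of the arrangement in figure 8: the graphics front end enrols in PVM, spawns the solver task on the remote host, and ships it the model name and the processor count. The host name, executable name, file name and message tags are placeholders, and error handling is mostly omitted.

```c
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int slave_tid;
    int nprocs = 32;                    /* number of solver threads to create */
    char model[] = "room.dat";          /* placeholder model file name */

    pvm_mytid();                        /* enrol the front end in PVM */

    /* Spawn one solver task on the KSR host (placeholder host name). */
    if (pvm_spawn("hrad_solver", NULL, PvmTaskHost, "ksr2.cpc.umich.edu",
                  1, &slave_tid) != 1) {
        fprintf(stderr, "failed to spawn solver\n");
        pvm_exit();
        return 1;
    }

    /* Send the model name and processor count (message tag 1). */
    pvm_initsend(PvmDataDefault);
    pvm_pkstr(model);
    pvm_pkint(&nprocs, 1, 1);
    pvm_send(slave_tid, 1);

    /* Block until the converged radiosity solution comes back (tag 2),
     * then hand it to the Motif/OpenGL renderer (not shown). */
    pvm_recv(slave_tid, 2);

    pvm_exit();
    return 0;
}
```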

X. Allocation/Mapping strategies

While mapping the parallel algorithm onto a multiprocessor/multicomputer it is necessary to minimize the schedule length, defined as the sum of all computation and communication time between the processes. The mapping scheme chosen should minimize this objective function [15, 16]. Having already found an expression for the number of interactions (N_i) between the patches, the time complexity for computing the hierarchical radiosity solution can be approximated by a similar expression,

T_1 = T_{init}(N_i) + T_{vis}(N_i) + T_{ff}(N_i) + T_{rad}(N_i) + T_{back}(N_i)    (8)

The total time complexity of the algorithm is given as T = \sum_{i=1}^{k} T_1, where T_1 is the time required for one iteration; it comprises T_{init}(N_i), T_{vis}(N_i), T_{ff}(N_i), T_{rad}(N_i) and T_{back}(N_i), the times required for initialization, visibility computation, form factor computation, global average radiosity computation, and the test for convergence identifying the need for further refinement, respectively.

Three allocation schemes are used: static, dynamic, and SD.

The static scheme allocates patches to individual processors, and subsequent patches (resulting from hierarchical subdivision) stay in the queue of that processor. Since certain patches need no further radiosity computation, and the computation for patches that have become small is itself very small, this scheme has the following disadvantage and advantage: a) the processors have an unbalanced load; b) if the patches are relatively equal in computation and, on average, there is an equal number of patches per processor, the scheme works well (since its overhead is small).

The dynamic scheme allocates from a global queue of patches, and patches created by subsequent subdivision are also put into this global queue. As soon as a processor finishes its current job it gets a patch from the queue. This has an advantage and a disadvantage: a) it leads to load balancing; b) when the patches become small the overhead increases and the speedup drops (there is contention for the queue, and most processors are busy trying to access the queue rather than doing useful work; the overhead of accessing the global data structure is also very high). Both the static and dynamic schemes were found to saturate beyond 20 processors.

The SD scheme starts off as a static allocation scheme, assigning tasks to the individual processors, and then steals tasks from the tails of the queues of busy processors as and when processors become free. The granularity of the tasks should be chosen carefully so as not to affect data locality. The SD scheme shows some improvement over the other two schemes. Execution timings and speedup graphs for the static, dynamic and SD schemes are shown in figures 9, 10 and 11, respectively. Speedups for the three schemes are compared in figure 12 (for the room data).


XI. Conclusion and future research

The time complexity of the hierarchical radiosity-based rendering algorithm was computed and a parallel hierarchical radiosity algorithm was presented. Three allocation strategies (static, dynamic, and SD) were used, and the performance of this irregular and dynamic algorithm on a DSM-COMA architecture was studied. The SD scheme was found to give better speedups. Since the number of iterations required for convergence of the two data sets was small, and most of the time was taken by the first iteration, the performance improvement due to the SD scheme was limited. Latency-hiding techniques using the prefetch and post-store instructions available on the KSR have been found to yield better performance for other N-body applications with larger iteration counts (e.g., the Barnes-Hut method) [13], but not in the case of radiosity. Emphasis in future research will be on efficient parallel algorithms for hierarchical radiosity incorporating the recent improvements [6, 7], and on extending the same to deal with three dimensions (i.e., in the presence of participating media). There is also a need to develop heuristics for mapping and scheduling tasks onto the network to improve the efficiency of parallel image synthesis algorithms.

Acknowledgments

We would like to thank Dr. Pat Hanrahan, Princeton University for making the source code for the hierarchical radiosity algorithm and data sets publicly available. Special thanks to Dr. Edward Davidson and Dr. Paul S. McClay for giving us access to the computational resources at the University of Michigan Center for Parallel Computing.


References

[1] James T. Kajiya, "The rendering equation", ACM Computer Graphics (Proc. SIGGRAPH 86), vol. 20, no. 4, 1986, pp. 143-150.

[2] H. Fuchs et al., "On visible surface generation by a priori tree structures", ACM Computer Graphics (Proc. SIGGRAPH 80), pp. 124-133.

[3] M. F. Cohen and D. P. Greenberg, "The hemi-cube: a radiosity solution for complex environments", ACM Computer Graphics (Proc. SIGGRAPH 85).

[4] Pat Hanrahan, D. Salzman and Larry Aupperle, "A rapid hierarchical radiosity algorithm", ACM Computer Graphics (Proc. SIGGRAPH 91), pp. 165-174.

[5] Dani Lischinski, Filippo Tampieri and Donald P. Greenberg, "Combining hierarchical radiosity and discontinuity meshing", ACM Computer Graphics (Proc. SIGGRAPH 93), pp. 199-208.

[6] Brian Smits, James Arvo and Donald Greenberg, "A clustering algorithm for radiosity in complex environments", ACM Computer Graphics (Proc. SIGGRAPH 94), pp. 435-442.

[7] Seth Teller, Celeste Fowler, Thomas Funkhouser and Pat Hanrahan, "Partitioning and ordering large radiosity computations", ACM Computer Graphics (Proc. SIGGRAPH 94), pp. 443-450.

[8] Paul S. Heckbert, "Simulating global illumination using adaptive meshing", Ph.D. thesis, University of California, Berkeley, 1991.

[9] Brian E. Smits, James R. Arvo and David H. Salesin, "An importance-driven radiosity algorithm", ACM Computer Graphics (Proc. SIGGRAPH 92), pp. 273-282.

[10] C. Vinod, V. Chaudhary and J. K. Aggarwal, "Computational requirements for accelerating radiosity algorithms", PDCL-TR-93-20-20, under review for publication.

[11] J. K. Salmon, "Parallel hierarchical N-body methods", Ph.D. thesis, California Institute of Technology, December 1990.

[12] J. P. Singh, C. Holt, T. Totsuka, A. Gupta and J. L. Hennessy, "Load balancing and data locality in hierarchical N-body methods", Computer Systems Laboratory, Stanford University, CSL-TR-92-505.

[13] C. Tumuluri and A. N. Choudhary, "Exploitation of latency hiding on the KSR1, case study: the Barnes-Hut algorithm", Technical report, Syracuse University, 1994.

[14] Anoop Gupta, Truman Joe and Per Stenstrom, "Comparative performance of cache-coherent NUMA and COMA architectures", Computer Systems Laboratory, Stanford University.

[15] Michael G. Norman and Peter Thanisch, "Models of machines and computation for mapping in multicomputers", ACM Computing Surveys, September 1993.

[16] V. Chaudhary and J. K. Aggarwal, "Generalized mapping of parallel algorithms onto parallel architectures", IEEE Transactions on Parallel and Distributed Systems, March 1993.

[17] Rodney J. Recker, David W. George and Donald P. Greenberg, "Accelerating techniques for progressive refinement radiosity", ACM Computer Graphics (Proc. SIGGRAPH 90), pp. 59-66.

[18] Michael John Muss, "Workstations, networking, distributed graphics and parallel processing", Ballistic Research Laboratory, Maryland, 1991.

[19] Kurt Akeley and Tom Jermoluk, "High-performance polygon rendering", ACM Computer Graphics (Proc. SIGGRAPH 88).

[20] Kendall Square Research Corporation, "KSR Technical Summary", 1992.


Figure 2: Radiosity solutions [8]. (Stages: integral equation -> mesh and basis (uniform constant / uniform linear / adaptive linear) -> constrain (Galerkin / collocation), giving an approximation of the integral equation -> integrate (hemicube / analytic / ray tracing), giving a linear system -> solve (Gauss-Seidel / successive overrelaxation / progressive), giving a discrete solution -> display (Z-buffer / ray tracing) -> picture.)

Figure 3: Radiosity based rendering problem. (Pre-processing: quick model, mesh pre-processing, polygonalized model, primitive mesher, adaptive refinement, meshed model. Radiosity solution part: radiosity input model, find shooter, compute form factors, distribute energy, radiosity output model. Rendering part: movement script, render, display / video output / workstation display.)

Figure 4: Hierarchical subdivision and corresponding quadtree (root polygon; uniform versus hierarchical subdivision; resulting quadtree).

Figure 5: KSR DSM-COMA architecture. (A level:1 ring of ALLCACHE directories connects up to 34 level:0 rings; each level:0 ring holds 32 processor cells, each with a processor and sub-cache, a local cache, and a local cache directory.)

Figure 6: Task graph. (Tasks A and B, followed by parallel C-D-E chains, joined by tasks F and G.)

Figure 7: Room data: 174 input polygons. (a) Input polygon data; (b) hierarchically meshed; (c) rendered radiosity solution.

Figure 8: Framework of the parallel hierarchical radiosity algorithm implementation on the KSR2-64. (The master on the RS6000-520H graphics front end at PDCL-WSU starts the master PVM daemon, spawns the slave PVM daemon on the KSR at CPC-UofM, sends the required data model and number of processors, and renders interactively using Motif/OpenGL; the slave initializes the global data structures and the BSP tree, creates pthreads, assigns task queues, and executes the radiosity computation (visibility, form factors, area-weighted radiosity, subdivision of patches into four children) with refinement until global radiosity convergence, keeping the PVM ear open for communication.)

Figure 9: Static allocation scheme. (a) Execution time in seconds vs. number of processors for the room and large-room data sets; (b) speedup vs. number of processors against the ideal.

Figure 10: Dynamic allocation scheme. (a) Execution time; (b) speedups.

Figure 11: SD scheme. (a) Execution time; (b) speedups.

Figure 12: Speedups on the KSR2-64 (DSM-COMA) for the room data (ideal, static, dynamic and SD).