Parallel and Distributed Photo-Realistic Rendering

Alan Chalmers and Erik Reinhard
Dept. of Computer Science
University of Bristol
Bristol, BS8 1UB
United Kingdom

Course #3

Orlando, Florida 19-24 July, 1998

Contents

1 Introduction
  1.1 Concepts
    1.1.1 Dependencies
    1.1.2 Scalability
    1.1.3 Control
  1.2 Classification of Parallel Systems
    1.2.1 Flynn's taxonomy
    1.2.2 Parallel versus Distributed systems
  1.3 The Relationship of Tasks and Data
    1.3.1 Inherent difficulties
    1.3.2 Tasks
    1.3.3 Data
  1.4 Evaluating Parallel Implementations
    1.4.1 Realisation penalties
    1.4.2 Performance metrics
    1.4.3 Efficiency

2 Realistic Rendering
  2.1 Ray tracing
  2.2 Radiosity
    2.2.1 Form factors
    2.2.2 Full radiosity
    2.2.3 Progressive refinement
    2.2.4 Hierarchical radiosity
  2.3 Stochastic techniques
    2.3.1 Variance reduction techniques
    2.3.2 Stochastic ray tracing
    2.3.3 Stochastic radiosity
    2.3.4 Particle tracing
    2.3.5 Hybrid methods
    2.3.6 Participating media

3 Task Scheduling
  3.1 Problem Decomposition
    3.1.1 Algorithmic decomposition
    3.1.2 Domain decomposition
    3.1.3 Abstract definition of a task
    3.1.4 System architecture
  3.2 Computational Models
    3.2.1 Data driven model
    3.2.2 Demand driven model
    3.2.3 Hybrid computational model
  3.3 Task Management
    3.3.1 Task definition and granularity
    3.3.2 Task distribution and control
    3.3.3 Algorithmic dependencies
  3.4 Task Scheduling Strategies
    3.4.1 Data driven task management strategies
    3.4.2 Demand driven task management strategies
    3.4.3 Task manager process
    3.4.4 Distributed task management
    3.4.5 Preferred bias task allocation

4 Data Management
  4.1 World Model of the Data: No Data Management Required
  4.2 Virtual Shared Memory
    4.2.1 Implementing virtual shared memory
  4.3 The Data Manager
    4.3.1 The local data cache
    4.3.2 Requesting data items
    4.3.3 Locating data items
  4.4 Consistency
    4.4.1 Keeping the data items consistent
    4.4.2 Weak consistency: repair consistency on request
    4.4.3 Repair consistency on synchronisation: Release consistency
  4.5 Minimising the Impact of Remote Data Requests
    4.5.1 Prefetching
    4.5.2 Multi-threading
    4.5.3 Profiling
  4.6 Data Management for Multi-Stage Problems

5 Caching techniques
  5.1 Coherence
  5.2 Placement policies
  5.3 Replacement policies
  5.4 What to store?
  5.5 Geometric simplification and grouping
  5.6 Levels of Detail
  5.7 Environment mapping and impostors
  5.8 Directional caching

6 Parallel ray tracing
  6.1 Demand driven ray tracing
  6.2 Data parallel ray tracing
  6.3 Hybrid scheduling

7 Parallel Radiosity
  7.1 Parallel Gathering
  7.2 Parallel Shooting
  7.3 Parallel hierarchical radiosity
  7.4 Case Study: Parallel radiosity using the hemi-cube approach
    7.4.1 Data Management
    7.4.2 Configurations
    7.4.3 Results
    7.4.4 Conclusions

8 Parallel Particle Tracing
  8.1 Parallel Characteristics
    8.1.1 Scene data
  8.2 Task Scheduling
    8.2.1 Task packet variations
  8.3 Data Management
    8.3.1 Data Manager
    8.3.2 Coherence
    8.3.3 Effects of data management
    8.3.4 Data Allocation
    8.3.5 Data Fetch Mechanism
    8.3.6 Demand Fetch
    8.3.7 Prefetch
    8.3.8 Prefetch horizon
    8.3.9 Prefetch particle tracing
    8.3.10 Multi-threading
    8.3.11 Combining Latency Hiding Techniques
    8.3.12 Profiling
    8.3.13 Task migration
    8.3.14 Results handling
  8.4 Conclusions

Appendix A: Slides

Appendix B: Included papers
  Jansen & Chalmers, "Realism in Real Time?"
  Reinhard & Jansen, "Rendering Large Scenes Using Parallel Ray Tracing"
  Renambot et al., "Towards Efficient Parallel Radiosity for DSM-based Parallel Computers Using Virtual Interfaces"
  Singh et al., "Parallel Visualization Algorithms: Performance and Architectural Implications"

About the authors

Dr. Alan Chalmers

Dr Alan Chalmers is a lecturer in the Department of Computer Science at the University of Bristol, UK. He has published over 60 papers in journals and international conferences on parallel photo-realistic graphics. He was the chairman of the First Eurographics Workshop on Parallel Graphics & Visualisation, held in Bristol in September 1996, and is co-chairman of the second of these workshops, to be held in Rennes in 1998. Recently he and Prof. F. W. Jansen were the guest editors of the journal Parallel Computing for its special edition on Parallel Graphics & Visualisation. His current research interests include the application of parallel photo-realistic graphics to archaeological site visualisation in order to provide a flexible tool for investigating site reconstruction and utilisation.

Ir. Erik Reinhard

Ir Erik Reinhard is a researcher at the University of Bristol in the field of parallel and distributed ray tracing, radiosity and particle tracing. He received a 'TWAIO' diploma (comparable to an MSc) in parallel computer graphics from Delft University of Technology in 1996. As part of his PhD, he was involved in researching and establishing an internet-based modelling and parallel rendering service at TNO in The Netherlands. His current research interests include task scheduling and data management techniques for (photo-)realistic rendering, different data representation schemes to facilitate inter-processor communication and load balancing, and visual perception issues. He has published more than 10 papers on parallel rendering and is currently writing a book on this subject with Alan Chalmers.


Acknowledgements

We are very grateful to International Thomson Computer Press for permission to use material from: Chalmers A.G. and Tidmus J.P., Practical Parallel Processing: An Introduction to Problem Solving in Parallel, International Thomson Computer Press, 327 pages, ISBN 1-85032-135-3, March 1996.

We would also like to thank John Tidmus very much for allowing us to include much of his work in our chapter on Parallel Particle Tracing: Tidmus J.P., Task and Data Management for Parallel Particle Tracing, PhD thesis, A. Chalmers & R. Miles (supervisors), University of the West of England, December 1997.

We are grateful to IEEE and Elsevier for permission to include the key papers. Thanks are also due to Arjan Kok for supplying the model of the Macintosh lab, Jan Eek for the beer crate model and Greg Ward Larson for the conference room model; we also used his Radiance rendering package [185] to render most of the images in these notes. Finally, we would like to thank the authors of the papers that are included at the back of these course notes for giving us copyright permission.

Erik Reinhard's work is partially sponsored by the European Commission under TMR grant number ERBFMBICT960655.


1 Introduction

"Parallel processing is like a dog's walking on its hind legs. It is not done well, but you are surprised to find it done at all."
[Steve Fiddes (University of Bristol), with apologies to Samuel Johnson]

Realistic computer graphics is an area of research which develops algorithms and methods to render images of artificial models or worlds as realistically as possible. Such algorithms are known for their unpredictable data accesses and their high computational complexity. Rendering a single high quality image may take several hours, or even days. Parallel processing offers the potential for solving such complex problems in reasonable times. However, there are a number of fundamental issues, namely task scheduling, data management and caching techniques, which must be addressed if parallel processing is to achieve the desired performance when computing realistic images. These issues apply to all three rendering techniques presented in this tutorial: ray tracing, radiosity and particle tracing. This chapter introduces the concepts of parallel processing and considers the difficulties associated with solving problems in parallel.

Parallel processing is an integral part of everyday life. The concept is so ingrained in our existence that we benefit from it without realising. When faced with a taxing problem, we involve others to solve it more easily. This co-operation of more than one worker to facilitate the solution of a particular problem may be termed parallel processing. The goal of parallel processing is thus to solve a given problem more rapidly, or to enable the solution of a problem that would otherwise be impracticable for a single worker.

The principles of parallel processing are, however, not new; evidence suggests that such concepts were recognised and exploited in the computational devices used by the Greeks over 2000 years ago. In the nineteenth century, Babbage used parallel processing in order to improve the performance of his Analytical Engine [118]. Indeed, the first general purpose electronic digital computer, the ENIAC, was conceived as a highly parallel and decentralised machine with twenty-five independent computing units co-operating towards the solution of a single problem [73]. However, the early computer developers rapidly identified two obstacles restricting the widespread acceptance of parallel machines: the complexity of construction and the seemingly high programming effort required [20]. As a result of these early setbacks, the developmental thrust shifted to computers with a single computing unit, to the detriment of parallel designs. Additionally, the availability of sequential machines resulted in the development of algorithms and techniques optimised for these particular architectures.

The evolution of serial computers may finally be reaching its zenith due to the limitations imposed on the design by its physical implementation and inherent bottlenecks [7]. As users continue to demand improved performance, computer designers have been looking increasingly at parallel approaches to overcome these limitations. All modern computer architectures incorporate a degree of parallelism. Improved hardware design and manufacture, coupled with a growing understanding of how to tackle the difficulties of parallel programming, has re-established parallel processing at the forefront of computer technology.

1.1 Concepts

Parallel processing is the solution of a single problem by dividing it into a number of sub-problems, each of which may be solved by a separate worker. Co-operation will always be necessary between workers during problem solution, even if this is only a simple agreement on the division of labour. These ideas can be illustrated by the simple analogy of tackling the problem of emptying a swimming pool using buckets. The job may be sub-divided into the repeated task of removing one bucket of water. A single person completing all the tasks will finish the job in a certain time. This process may be speeded up by utilising additional workers. Ideally, two people should be able to empty the pool in half the time. Extending this argument, a large number of workers should be able to complete the job in a small fraction of the original time. In practice, however, there are physical limitations which prevent this hypothetical situation from being reached.

Figure 1.1: Emptying a pool by means of a bucket

The physical realisation of this solution necessitates a basic level of co-operation between workers. This arises from the contention for access to the pool and the need to avoid collisions. The time required to achieve this co-operation involves inter-worker communication, which detracts from the overall solution time and as such may be termed an overhead.

1.1.1 Dependencies

Another factor preventing an ideal parallel solution is dependencies. Consider the problem of constructing a house. In simple terms, building the roof can only commence after the walls have been completed. Similarly, the walls can only be erected once the foundations are laid. The roof is thus dependent upon the walls, which are in turn dependent on the foundations. These dependencies divide the whole problem into a number of distinct stages. The parallel solution of each stage must be completed before the subsequent stage can start.

The dependencies within a problem may be so severe that it is not amenable to parallel processing. A strictly sequential problem consists of a number of stages, each comprising a single task, and each dependent upon the previous stage. For example, in figure 1.2, building a tower of toy blocks requires a strictly sequential order of task completion. This situation is the antithesis of dependency-free problems, such as placing blocks in a row on the floor. In that case, the order of task completion is unimportant, but the need for co-operation still exists.

Figure 1.2: Building with blocks: (a) strictly sequential; (b) dependency-free

Pipelining is the classic methodology for minimising the effects of dependencies. This technique can only be exploited when a process, consisting of a number of distinct stages, needs to be repeated several times. An automotive assembly line is an example of an efficient pipeline. In a simplistic form, the construction of a car may consist of four linearly dependent stages, as shown in figure 1.3: chassis fabrication, body assembly, wheel fitting and windscreen installation. An initial lump of metal is introduced into the pipeline; then, as the partially completed car passes each stage, a new section is added until finally the finished car emerges from the factory. Consider an implementation of this process consisting of four workers, each performing their task in one time unit. Having completed its task, a worker passes the partially completed car on to the next stage and is then free to repeat the task on a new component fed from the previous stage. The completion of the first car occurs after four time units, but each subsequent car is completed every time unit, as sketched in the example below. Completion is, of course, sensitive to the time taken by each worker. If one worker were to take longer than one time unit to complete a task, then the worker after this difficult task would stand idle awaiting the next component, whilst those before it would be unable to move their components on to the next stage of the pipeline. The other workers would thus also be unable to do any further work until the difficult task was completed. Should there be any interruption in the input to the pipeline, then the pipeline would once more have to be "refilled" before it could operate at maximum efficiency.
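To make the pipeline arithmetic concrete, the following small sketch (our own illustration in Python, not part of the original course material; the stage names are simply those of figure 1.3) computes when each car leaves a pipeline in which every stage takes exactly one time unit: the first after four time units, and one more every time unit thereafter.

    # Minimal pipeline timing sketch: k stages, each taking one time unit.
    # With n items, the first result appears after k time units and a further
    # result appears every subsequent time unit, so item i finishes at k + i.
    STAGES = ["chassis fabrication", "body assembly",
              "wheel fitting", "windscreen installation"]

    def pipeline_completion_times(n_items, n_stages=len(STAGES)):
        """Time unit at which each of n_items leaves an n_stages pipeline."""
        return [n_stages + i for i in range(n_items)]

    if __name__ == "__main__":
        for car, t in enumerate(pipeline_completion_times(6), start=1):
            print(f"car {car} completed at time unit {t}")
        # A 4-stage pipeline completes 6 cars at t = 4, 5, 6, 7, 8, 9.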

1.1.2 Scalability

Every problem has an upper bound on the number of workers which can be meaningfully employed in its solution. Additional workers beyond this number will not improve solution time, and can indeed be detrimental. This upper bound provides an idea as to how suitable a problem is for parallel implementation: a measure of its scalability. A given problem may only be divided into a finite number of sub-problems, corresponding to the smallest tasks. The availability of more workers than there are tasks will not improve solution time. The problem of clearing a room of 100 chairs may be divided into 100 tasks, each consisting of removing a single chair. A maximum of 100 workers can each be allocated one of these tasks and hence perform useful work.

Figure 1.3: Pipeline assembly of a car (stages: chassis fabrication, body assembly, wheel fitting and windscreen installation; the diagram shows the motion of cars through the pipeline with increasing time)

The optimum solution time for clearing the room may not in fact occur when employing 100 workers due to certain aspects of the problem limiting effective worker utilisation. This phenomenon can be illustrated by adding a constraint to the problem, in the form of a single doorway providing egress from the room. A bottleneck will occur as large numbers of workers attempt to move their chairs through the door simultaneously, as shown in figure 1.4.

Figure 1.4: Bottleneck caused by doorway

The delays caused by this bottleneck may be so great that the time taken to empty the room of chairs by this large number of workers may in fact be longer than the original time taken by the single worker. In this case, reducing the number of workers can alleviate the bottleneck and thus reduce solution time.

1.1.3 Control

All parallel solutions of a problem require some form of control. This may be as simple as the control needed to determine what will constitute a task and to ascertain when the problem has been solved satisfactorily. More complex problems may require control at several stages of their solution. For example, solution time could be improved when clearing the room if a controller were placed at the door to schedule its usage. This control would ensure that no time was wasted by two (or more) workers attempting to exit simultaneously and then having to "reverse" to allow a single worker through. An alternative to this explicit centralised control would be some form of distributed control. Here the workers themselves could have a way of preventing simultaneous access; for example, if two (or more) workers reach the door at the same time, then the biggest worker always goes first while the others wait.

Figure 1.5(a) shows the sequential approach to solving a problem. Computation is applied to the problem domain to produce the desired results. The controlled parallel approach shown in figure 1.5(b) achieves a parallel implementation of the same problem via three steps. In step 1, the problem domain is divided into a number of sub-problems, in this case four. Parallel processing is introduced in step 2 to enable each of the sub-problems to be computed in parallel to produce sub-results. In step 3, these sub-results must be collated to achieve the desired final result. Control is necessary in steps 1 and 3 to divide the problem amongst the workers and then to collect and collate the results that the workers have independently produced.

Figure 1.5: Control required in (a) a sequential versus (b) a parallel implementation (step 1: subdivision of the problem; step 2: independent computation; step 3: collation of results)

1.2 Classification of Parallel Systems

A traditional sequential computer conforms to the von Neumann model. Shown in figure 1.6, this model comprises a processor, an associated memory, an input/output interface and various busses connecting these devices. The processor in the von Neumann model is the single computational unit responsible for the functions of fetching, decoding and executing a program's instructions. Parallel processing may be added to this architecture through pipelining, using multiple functional units within a single computational unit, or by replicating entire computational units (which may themselves contain pipelining). With pipelining, each functional unit repeatedly performs the same operation on data which is received from the preceding functional unit. So, in the simplest case, a pipeline for a computational unit could consist of three functional units: one to fetch the instructions from memory, one to decode these instructions and one to execute the decoded instructions. As we saw with the automobile assembly example, a pipeline is only as effective as its slowest component. Any delay in the pipeline has repercussions for the whole system.

Figure 1.6: Von Neumann model architecture (processor, memory and I/O interface connected by control, data and address busses)

Vector processing was introduced to provide efficient execution of program loops on large array data structures. By providing multiple registers as special vector registers to be used alongside the central processing unit, a vector processor is able to perform the same operation on all elements of a vector simultaneously. This simultaneous execution on every element of large arrays can produce significant performance improvements over conventional scalar processing. However, problems often need to be reformulated to benefit from this form of parallelism. A large number of scientific problems, such as weather forecasting, nuclear research and seismic data analysis, are well suited to vector processing.

Replication of the entire computational unit, the processor, allows individual tasks to be executed on different processors. Tasks are thus sometimes referred to as virtual processors which are allocated a physical processor on which to run. The completion of each task contributes to the solution of the problem. Tasks which are executing on distinct processors at any point in time are said to be running in parallel. It may also be possible to execute several tasks on a single processor. Over a period of time the impression is given that they are running in parallel, when in fact at any point in time only one task has control of the processor. In this case we say that the tasks are being performed concurrently, that is, their execution is being shared by the same processor. The difference between parallel tasks and concurrent tasks is shown in figure 1.7.

Figure 1.7: (a) Parallel tasks (b) Concurrent tasks

The workers which perform the computational work and co-operate to facilitate the solution of a problem on a parallel computer are known as processing elements, often abbreviated as PEs. A processing element consists of a processor, one or more tasks, and the software to enable co-operation with other processing elements. A parallel system comprises more than one processing element.

1.2.1 Flynn's taxonomy

The wide diversity of computer architectures that have been proposed, and in a large number of cases realised, since the 1940s has led to a desire to classify the designs to facilitate evaluation and comparison. Classification requires a means of identifying distinctive architectural or behavioural features of a machine. In 1972 Flynn proposed a classification of processors according to a macroscopic view of their principal interaction patterns relating to instruction and data streams [48]. The term stream was used by Flynn to refer to the sequence of instructions to be executed, or data to be operated on, by the processor. What has become known as Flynn's taxonomy thus categorises architectures into the four areas shown in figure 1.8.

Since its inception, Flynn's taxonomy has been criticised as being too broad and has thus been enlarged by several other authors, for example Shore in 1973 [158], Treleaven, Brownbridge and Hopkins in 1982 [174], and Basu in 1984 [11]; perhaps one of the most detailed classifications was given by Hockney and Jesshope in 1988 [79]. Real architectures are, of course, much more complex than Flynn suggested. For example, an architecture may exhibit properties from more than one of his classes. However, if we are not too worried about the minute details of any individual machine, then Flynn's taxonomy serves to separate fundamentally different architectures into four broad categories. The classification scheme is simple (which is one of the main reasons for its popularity) and thus useful for providing an overview of the concepts of multiprocessor computers.

Figure 1.8: Flynn's taxonomy for processors

                                  Single Data stream    Multiple Data stream
    Single Instruction stream     SISD                  SIMD
    Multiple Instruction stream   MISD                  MIMD

SISD: Single Instruction Single Data embraces the conventional sequential, or von Neumann, processor. The single processing element executes instructions sequentially on a single data stream. The operations are thus ordered in time and may be easily traced from start to finish. Modern adaptations of this uniprocessor use some form of pipelining technique to improve performance and, as demonstrated by the Cray supercomputers, minimise the length of the component interconnections to reduce signal propagation times [152].

SIMD: Single Instruction Multiple Data machines apply a single instruction to a group of data items simultaneously. A master instruction is thus acting over a vector of related operands. A number of processors, therefore, obey the same instruction in the same cycle and may be said to be executing in strict lock-step. Facilities exist to exclude particular processors from participating in a given instruction cycle. Vector processors, for example the Cyber 205, Fujitsu FACOM VP-200 and NEC SX1, and array processors, such as the DAP [141], the Goodyear MPP (Massively Parallel Processor) [12] or the Connection Machine CM-1 [78], may be grouped in this category.

MISD: Multiple Instruction Single Data. Although part of Flynn's taxonomy, no architecture falls obviously into the MISD category. One of the closest architectures to this concept is the pipelined computer. Another is the systolic array, which derives its name from the medical term "systole", used to describe the rhythmic contraction of chambers of the heart. Data arrives from different directions at regular intervals to be combined at the "cells" of the array. The Intel iWarp system was designed to support systolic computation [4]. Systolic arrays are well suited to specially designed algorithms rather than general purpose computing [101, 102].

MIMD: Multiple Instruction Multiple Data. The processors within the MIMD classification autonomously obey their own instruction sequence and apply these instructions to their own data. The processors are, therefore, no longer bound to the synchronous method of the SIMD processors and may choose to operate asynchronously. By providing these processors with the ability to communicate with each other, they may interact and therefore co-operate in the solution of a single problem. This interaction has led to MIMD systems sometimes being classified as tightly coupled if the degree of interaction is high, or loosely coupled if the degree of interaction is low. Two methods are available to facilitate this interprocessor communication. Shared memory systems allow the processors to communicate by reading and writing to a common address space. Controls are necessary to prevent processors updating the same portion of the shared memory simultaneously. Examples of such shared memory systems are the Sequent Balance [170] and the Alliant FX/8 [39]. In distributed memory systems, on the other hand, processors address only their private memory and communicate by passing messages along some form of communication path. Examples of MIMD processors from which such distributed memory systems can be built are the Intel i860 [137], the Inmos transputer [81] and the Analog Devices SHARC processor.

Figure 1.9: Systems of MIMD processors: (a) shared memory, with the processors connected to a shared memory through an interconnection method; (b) distributed memory, with MIMD processors with private memory connected by an interconnection method

The conceptual difference between shared memory and distributed memory systems of MIMD processors is shown in figure 1.9. The interconnection method for the shared memory system, figure 1.9(a), allows all the processors to be connected to the shared memory. If two or more processors wish to access the same portion of this shared memory at the same time, then some arbitration mechanism must be used to ensure that only one processor accesses that memory portion at a time. This problem of memory contention may restrict the number of processors that can be interconnected using the shared memory model. The interconnection method of the distributed memory system, figure 1.9(b), connects the processors in some fashion and, if one or more processors wish to access another processor's private memory, they can only do so by sending a message to the appropriate processor along this interconnection network. There is thus no memory contention as such. However, the density of the messages that result in distributed memory systems may still limit the number of processors that may be interconnected, although this number is generally larger than that of shared memory systems.

Busses have been used successfully as an interconnection structure to connect low numbers of processors together. However, if more than one processor wishes to send a message on the bus at the same time, an arbiter must decide which message gets access to the bus first. As the number of processors increases, so the contention for use of the bus grows. Thus a bus is inappropriate for large multiprocessor systems. An alternative to the bus is to connect processors via dedicated links to form large networks. This removes the bus-contention problem by spreading the communication load across many independent links.

1.2.2 Parallel versus Distributed systems

Distributed memory MIMD systems consist of autonomous processors, each with its own memory, which co-operate in the solution of a single complex problem. Such systems may consist of a number of interconnected, dedicated processor and memory nodes, or of interconnected "stand-alone" workstations. To distinguish between these two, the former configuration is referred to as a (dedicated) parallel system, while the latter is known as a distributed system, as shown in figure 1.10.

Figure 1.10: (a) Parallel system (b) Distributed system

The main distinguishing features of these two systems are typically the computation-to-communication ratio and the cost. Parallel systems make use of fast, "purpose-built" (and thus expensive) communication infrastructures, while distributed systems rely on existing network facilities such as ethernet, which are significantly slower and susceptible to other, non-related traffic. The advantage of distributed systems is that they may consist of a cluster of existing workstations which can be used by many (sequential) users when not employed in a parallel capacity. A number of valuable tools have been developed to enable these workstations to act in parallel, such as Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI). These provide an easy framework for coupling heterogeneous computers, including workstations, mainframes and even parallel systems. However, while some of the properties of a distributed computing system may be different from those of a parallel system, many of the underlying concepts are equivalent. For example, both systems achieve co-operation between computational units by passing messages, and each computational unit has its own distinct memory. Thus, the ideas presented in this tutorial should prove equally useful to the reader faced with implementing his or her realistic rendering problem on either system.
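As a concrete illustration of the message-passing style common to both kinds of system, the sketch below uses the mpi4py Python bindings to MPI. This is our own minimal example, not taken from the course notes: the task packet and its contents are invented for illustration, and the code assumes an MPI runtime plus mpi4py are installed (run with, for example, mpirun -np 2 python demo.py).

    # Process 0 sends a small task description to process 1, which sends
    # back a result; all co-operation happens through explicit messages.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        task = {"scanlines": list(range(16))}      # hypothetical task packet
        comm.send(task, dest=1, tag=0)             # hand the task to the worker
        result = comm.recv(source=1, tag=1)        # block until the result arrives
        print("master received:", result)
    elif rank == 1:
        task = comm.recv(source=0, tag=0)          # wait for work from the master
        result = f"rendered {len(task['scanlines'])} scanlines"
        comm.send(result, dest=0, tag=1)           # return the sub-result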

1.3 The Relationship of Tasks and Data

The implementation of any problem on a computer comprises two components:

- the algorithm chosen to solve the problem; and
- the domain of the problem, which encompasses all the data requirements for that problem.

The algorithm interacts with the domain to produce the result for the problem, as shown diagrammatically in figure 1.11. A sequential implementation of the problem means that the entire algorithm and domain reside on a single processor. To achieve a parallel implementation it is necessary to divide the problem’s components in some manner amongst the parallel processors. Now no longer resident on a single processor, the components will have to interact within the multiprocessor system in order to obtain the same result. This co-operation requirement introduces a number of novel difficulties into any parallel implementation which are not present in the sequential version of the same problem.

Figure 1.11: The components of a problem (the application of the algorithm to the problem domain produces the results)

1.3.1 Inherent difficulties

User confidence in any computer implementation of a problem is bolstered by the successful termination of the computation and the fact that the results meet design specifications. The reliability of modern computer architectures and languages is such that any failure of a sequential implementation to complete successfully points automatically to deficiencies in either the algorithm used or the data supplied to the program. In addition to these possibilities of failure, a parallel implementation may also be affected by a number of other factors which arise from the manner of the implementation:

Deadlock: An active parallel processor is said to be deadlocked if it is waiting indefinitely for an event which will never occur. A simple example of deadlock is when two processors, using synchronised communication, attempt to send a message to each other at the same time. Each process will then wait for the other process to perform the corresponding input operation, which will never occur. (A small runnable sketch of such a mutual wait is given at the end of this subsection.)

Data consistency: In a parallel implementation, the problem's data may be distributed across several processors. Care has to be taken to ensure that:

- if multiple copies of the same data item exist, then the value of this item is kept consistent;
- mutual exclusion is maintained to avoid several processors accessing a shared resource simultaneously; and
- data items are fetched from remote locations efficiently in order to avoid processor idle time.

While there is meaningful computation to be performed, a sequential computer is able to devote 100% of its time to this purpose. In a parallel system it may happen that some of the processors become idle, not because there is no more work to be done, but because current circumstances prevent those processors from performing any computation. Parallel processing also introduces communication overheads. The effect of these overheads is to introduce latency into the multiprocessor system. Unless some way is found to minimise communication delays, the percentage of time that a processor can spend on useful computation may be significantly affected. So, as well as the factors affecting the successful termination of the parallel implementation, one of the fundamental considerations facing parallel programmers is the computation to communication ratio.
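The deadlock scenario described above can be mimicked on a single machine with two threads. This is our own illustration, not part of the original notes, and the roles are inverted slightly (each thread waits to receive before it sends, since buffered Python queues cannot model a blocking send), but the circular wait is the same; a timeout is used so that the program reports the deadlock rather than hanging forever.

    # Two "processing elements" each wait indefinitely for a message the
    # other will only send after receiving one itself: a circular wait.
    import queue
    import threading

    def worker(name, inbox, outbox, report):
        try:
            msg = inbox.get(timeout=2)      # wait for the other side's message
            outbox.put(f"{name} replies to: {msg}")
        except queue.Empty:
            report.append(f"{name} deadlocked waiting for a message that never came")

    a_to_b, b_to_a = queue.Queue(), queue.Queue()
    report = []
    t1 = threading.Thread(target=worker, args=("A", b_to_a, a_to_b, report))
    t2 = threading.Thread(target=worker, args=("B", a_to_b, b_to_a, report))
    t1.start(); t2.start(); t1.join(); t2.join()
    print("\n".join(report))                # both threads report the deadlock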

1.3.2 Tasks

Subdividing a single problem amongst many processors introduces the notion of a task. In its most general sense, a task is a unit of computation which is assigned to a processor within the parallel system. In any parallel implementation a decision has to be taken as to what exactly constitutes a task. The task granularity of a problem is a measure of the amount of computational effort associated with any task. The choice of granularity has a direct bearing on the computation to communication ratio. Selection of too large a granularity may prevent the solution of the problem on a large parallel system, while too fine a granularity may result in significant processor idle time while the system attempts to keep processors supplied with fresh tasks.

On completion of a sequential implementation of a problem, any statistics that may have been gathered during the course of the computation may be displayed in a straightforward manner, and the computer is in a state ready to commence the next sequential program. In a multiprocessor system, the statistics will have been gathered at each processor, so after the solution of the problem the programmer is still faced with the task of collecting and collating these statistics. To ensure that the multiprocessor system is in the correct state for the next parallel program, the programmer must also ensure that all the processors have terminated gracefully.

1.3.3 Data

The problem domains of many rendering applications are very large. The size of these domains is typically far more than can be accommodated within the local memory of any processing element (or indeed in the memory of many sequential computers). Yet it is precisely these complex problems that we wish to solve using parallel processing. Consider a multiprocessor system consisting of sixty-four processing elements, each with 4 MBytes of local memory. If we were to insist that the entire problem domain reside at each processing element, then we would be restricted to solving problems with a maximum domain of 4 MBytes. The total memory within the system is 64 × 4 = 256 MBytes. So, if we were to consider the memory of the multiprocessor system as a whole, then we could contemplate solving problems with domains of up to 256 MBytes in size; a far more attractive proposition. (If the problem domain were even larger than this, then we could also consider the secondary storage devices as part of the combined memory, and that should be sufficient for most problems.)

There is a price to pay in treating the combined memory as a single unit. Data management strategies will be necessary to translate between the conceptual single memory unit and the physical distributed implementation. The aim of these strategies is to keep track of the data items so that an item will always be available at a processing element when required by the task being performed. The distributed nature of the data items will thus be invisible to the application processes performing the computation. However, any delay between the application process requesting an item and this request being satisfied will result in idle time. As we will see, it is the responsibility of data management to avoid this idle time.
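At its simplest, the data management described here reduces to a local cache in front of a remote fetch. The sketch below is our own minimal illustration (the class, the item identifiers and the "remote" dictionary standing in for the other processing elements are all invented for this example); the strategies discussed in chapter 4 elaborate on this basic idea.

    # A per-PE data manager: return locally cached items immediately and
    # fetch non-local items from their owner, caching them for later reuse.
    class DataManager:
        def __init__(self, local_items, fetch_remote):
            self.cache = dict(local_items)    # data items resident on this PE
            self.fetch_remote = fetch_remote  # stand-in for a remote request
            self.remote_requests = 0          # simple profiling counter

        def get(self, item_id):
            if item_id not in self.cache:     # a cache miss costs communication
                self.remote_requests += 1
                self.cache[item_id] = self.fetch_remote(item_id)
            return self.cache[item_id]

    # Usage with a fake remote store standing in for the other PEs.
    remote_store = {"patch-17": "geometry of patch 17"}
    dm = DataManager({"patch-3": "geometry of patch 3"},
                     fetch_remote=remote_store.__getitem__)
    print(dm.get("patch-3"))       # local hit: no communication
    print(dm.get("patch-17"))      # remote fetch, then cached locally
    print(dm.remote_requests)      # -> 1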

1.4 Evaluating Parallel Implementations

The chief reason for opting for a parallel implementation should be to obtain answers faster. The time that the parallel implementation takes to compute results is perhaps the most natural way of determining the benefits of the approach that has been taken. If the parallel solution takes longer than any sequential implementation, then the decision to use parallel processing needs to be re-examined. Other measurements, such as speed-up and efficiency, may also provide useful insight into the maximum scalability of the implementation. Of course, there are many issues that need to be considered when comparing parallel and sequential implementations of the same problem, for example:

- Was the same processor used in each case? If not, what is the price of the sequential machine compared with that of the multiprocessor system?
- Was the algorithm chosen already optimised for sequential use, that is, did the data dependencies present preclude an efficient parallel implementation?

1.4.1 Realisation penalties

If we assume that the same processor was used in both the sequential and parallel implementations, then we should expect the time to solve the problem to decrease as more processing elements are added. The best we can reasonably hope for is that two processing elements will solve the problem twice as quickly, three processing elements three times faster, and n processing elements n times faster. If n is sufficiently large, then we should expect our large scale parallel implementation to produce the answer in a tiny fraction of the sequential computation time, as shown by the "optimum time" curve in the graph in figure 1.12.

Figure 1.12: Optimum and actual parallel implementation times (computation time against number of PEs)

However, in reality we are unlikely to achieve these optimum times as the number of processors is increased. A more realistic scenario is that shown by the "actual time" curve in figure 1.12. This curve shows an initial decrease in the time taken to solve the example problem on the parallel system up to a certain number of processing elements. Beyond this point, adding more processors actually leads to an increase in computation time. Failure to achieve the optimum solution time means that the parallel solution has suffered some form of realisation penalty. A realisation penalty can arise from two sources:

- an algorithmic penalty; and
- an implementation penalty.

The algorithmic penalty stems from the very nature of the algorithm selected for parallel processing. The more inherently sequential the algorithm, the less likely it is to be a good candidate for parallel processing.

Aside: It has also been shown, albeit not conclusively, that the more experience the writer of a parallel algorithm has in sequential algorithms, the less parallelism that algorithm is likely to exhibit [25].

This sequential nature of an algorithm and its implicit data dependencies will translate, in the domain decomposition approach, into a requirement to synchronise the processing elements at certain points in the algorithm. This can result in processing elements standing idle awaiting messages from other processing elements. A further algorithmic penalty may also come about from the need to reconstruct sequentially the results generated by the individual processors into an overall result for the computation.

Solving the same problem twice as fast on two processing elements implies that those two processing elements must spend 100% of their time on computation. We know that a parallel implementation requires some form of communication. The time a processing element is forced to spend on communication will naturally impinge on the time it has available for computation. Any time that a processor cannot spend doing useful computation is an implementation penalty. Implementation penalties are thus caused by:

- The need to communicate. As mentioned above, in a multiprocessor system processing elements need to communicate. This communication may not only be that which is necessary for a processing element's own actions; in some architectures, a processing element may also have to act as an intermediary for other processing elements' communication.

- Idle time. Idle time is any period of time when an application process is available to perform some useful computation, but is unable to do so because there is no work locally available, or because its current task is suspended awaiting a synchronisation signal or a data item which has yet to arrive. It is the job of the local task manager to ensure that an application process is kept supplied with work. The computation to communication ratio within the system will determine how much time a task manager has to fetch a task before the current one is completed. A load imbalance is said to exist if some processing elements still have tasks to complete while the others do not. While synchronisation points are introduced by the algorithm, the management of data items for a processing element is the job of the local data manager. The domain decomposition approach means that the problem domain is divided amongst the processing elements in some fashion. If an application process requires a data item that is not available locally, then this must be fetched from some other processing element within the system. If the processing element is unable to perform other useful computation while this fetch is being performed, for example by means of multi-threading as discussed in section 4.5.2, then the processing element is said to be idle.

- Concurrent communication, data management and task management activity. Implementing each of a processing element's activities as a separate concurrent process on the same processor means that the physical processor has to be shared. When a process other than the application process is scheduled, the processing element is not performing useful computation, even though its current activity is necessary for the parallel implementation.

The fundamental goal of the system software is to minimise the implementation penalty. While this penalty can never be removed, intelligent communication, data management and task scheduling strategies can avoid idle time and significantly reduce the impact of the need to communicate.

1.4.2 Performance metrics

Solution time provides a simple way of evaluating a parallel implementation. However, if we wish to investigate the relative merits of our implementation, then further insight can be gained by additional metrics. A range of metrics will allow us to compare aspects of different implementations and perhaps provide clues as to how overall system performance may be improved.

Speed-up

A useful measure of any multiprocessor implementation of a problem is speed-up. This relates the time taken to solve the problem on a single processor machine to the time taken to solve the same problem using the parallel implementation. We will define the speed-up of a multiprocessor system in terms of the elapsed time that is taken to complete a given problem, as follows:

    speed-up = (elapsed time of a uniprocessor) / (elapsed time of the multiprocessors)    (1.1)

The term linear speed-up is used when the solution time on an n processor system is n times faster than the solution time on the uniprocessor. This linear speed-up is thus equivalent to the optimum time shown in section 1.4.1. The optimum and actual computation times in figure 1.12 are represented as a graph of linear and actual speed-ups in figure 1.13. Note that the actual speed-up curve increases until a certain point and then subsequently decreases. Beyond this point we say that the parallel implementation has suffered a speed-down.
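Given measured elapsed times, equation (1.1) is a one-line computation. The sketch below uses made-up timings of our own purely to illustrate the metric, including the speed-down region where adding processing elements increases the elapsed time again.

    # Speed-up from measured elapsed times, as defined in equation (1.1).
    def speed_up(t_uniprocessor, t_multiprocessor):
        return t_uniprocessor / t_multiprocessor

    # Hypothetical elapsed times in seconds for 1, 2, 4, 8 and 16 PEs.
    timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0, 16: 21.0}
    previous = None
    for n, tn in sorted(timings.items()):
        s = speed_up(timings[1], tn)
        note = "  (speed-down)" if previous is not None and s < previous else ""
        print(f"{n:2d} PEs: speed-up = {s:.2f}{note}")
        previous = s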

Figure 1.13: Linear and actual speed-ups (curves: linear, actual and super-linear speed-up against number of PEs)

The third curve in figure 1.13 represents so-called super-linear speed-up. In this example, the implementation on 20 processors has achieved a computation time which is approximately 32 times faster than the uniprocessor solution. It has been argued, see [43], that it is not possible to achieve a speed-up greater than the number of processors used. While in practice it certainly is possible to achieve super-linear speed-up, such implementations may have exploited "unfair" circumstances to obtain those timings. For example, most modern processors have a limited amount of cache memory with an access time significantly faster than that of a standard memory access. Two processors have double the amount of this cache memory. Given that we are investigating a fixed size problem, this means that a larger proportion of the problem domain is in the cache in the parallel implementation than in the sequential implementation. It is not unreasonable, therefore, to imagine a situation where the two processor solution time is more than twice as fast as the uniprocessor time.

Although super-linear speed-up is desirable, in this tutorial we will assume a "fair" comparison between uniprocessor and multiprocessor implementations. The results that are presented in the case studies thus make no attempt to exploit any hardware advantages offered by the increasing number of processors. This will enable the performance improvements offered by the proposed system software extensions to be highlighted without being masked by any variations in the underlying hardware. In practice, of course, it would be foolish to ignore these benefits, and readers are encouraged to "squeeze every last ounce of performance" out of their parallel implementation.

Two possibilities exist for determining the "elapsed time of a uniprocessor". This could be the time obtained when executing:

1. an optimised sequential algorithm on a single processor, Ts; or,
2. the parallel implementation on one processing element, T1.

SC

Processor

PE 1

Ts

PE n

Tn

(a)

SC

PE 1

SC

T1

PE 1

PE n

Tn

(b)

Figure 1.14: Systems used to obtain Tn and (a) Ts (b) T1 The time taken to solve the problem on n processing elements we will term Tn . The difference between how the two sequential times are obtained is shown in figure 1.14. There are advantages in acquiring both these sequential times. Comparing the parallel to the optimised sequential implementation highlights any algorithmic efficiencies that had to be sacrificed to achieve the parallel version. In addition, none of the parallel implementation penalties are hidden by this comparison and thus the speed-up is not exaggerated. One of these penalties is the time taken simply to supply the data to the processing element and collect the results. The comparison of the single processing element with the multiple processing element implementation shows how well the problem is “coping” with an increasing number of processing elements. Speed-up calculated as TTn1 , therefore, provides the indication as to the scalability of the parallel implementation. Unless otherwise stated, we will


use this alternative for speed-up in the case studies in this book, as it better emphasises the performance improvements brought about by the system software we shall be introducing.

As we can see from the curve for "actual speed-up" in figure 1.13, the speed-up obtained for that problem increased to a maximum value and then subsequently decreased as more processing elements were added. In 1967 Amdahl presented what has become known as "Amdahl's law" [3]. This "law" attempts to give a maximum bound on the speed-up obtainable, given the nature of the algorithm chosen for the parallel implementation. Suppose the proportion of time that needs to be spent on the purely sequential parts of the algorithm is s, and the proportion of time that may be done in parallel is p. The total time for the algorithm on a single processor is s + p = 1 (where the 1 is for algebraic simplicity), and the maximum speed-up that can be achieved on n processors is:

\[ \textrm{maximum speed-up} = \frac{s + p}{s + \frac{p}{n}} = \frac{1}{s + \frac{p}{n}} \tag{1.2} \]


Figure 1.15: Example maximum speed-up from Amdahl and Gustafson's laws

Figure 1.15 shows the maximum speed-up predicted by Amdahl's law for a sequential portion of an algorithm requiring 0.1%, 0.5%, 1% and 10% of the total algorithm time, that is s = 0.001, 0.005, 0.01 and 0.1 respectively. For 1000 processors the maximum speed-up that can be achieved for a sequential portion of only 1% is less than 91. This rather depressing forecast put a serious damper on the possibilities of massively parallel implementations of algorithms and led Gustafson in 1988 to issue a counter claim [69]. Gustafson stated that a problem size is virtually never independent of the number of processors, as it appears in equation (1.2), but rather:

. . . in practice, the problem size scales with the number of processors.

Gustafson thus derives a maximum speed-up of:

\[ \textrm{maximum speed-up} = \frac{s + p \cdot n}{s + p} = n + (1 - n) \cdot s \tag{1.3} \]

This maximum speed-up according to Gustafson is also shown in figure 1.15. As the curve shows, the maximum achievable speed-up is nearly linear when the problem size is increased as more processing elements are added. Despite this optimistic forecast, Gustafson's premise is not applicable in a large number of cases. Most scientists and engineers have a particular problem they want to solve in as short a time as possible. Typically, the application already has a specified size for the problem domain. For example, in parallel radiosity we will be considering the diffuse lighting within a particular environment subdivided into a necessary number of patches. In this example it would be inappropriate for us to follow Gustafson's advice and increase the problem size as more processing elements are added to the parallel implementation, because to do so would mean either:

- the physical size of the three dimensional objects within the environment would have to be increased, which is of course not possible; or,
- the size of the patches used to approximate the surfaces would have to be reduced, thereby increasing the number of patches and thus the size of the problem domain.

This latter case is also not an option, because the computational method is sensitive to the size of the patches relative to their distances apart. Artificially decreasing the size of the patches significantly may introduce numerical instabilities into the method. Furthermore, artificially increasing the size of the problem domain may improve speed-up, but it will not improve the time taken to solve the problem.

For fixed sized problems it appears that we are left with Amdahl's gloomy prediction of the maximum speed-up that is possible for our parallel implementation. However, all is not lost, as Amdahl's assumption that an algorithm can be separated into a component which has to be executed sequentially and a part which can be performed in parallel may not be totally appropriate for the domain decomposition approach. In this model, we retain the complete sequential algorithm and exploit the parallelism that exists in the problem domain. So, in this case, an equivalent to Amdahl's law would imply that the data can be divided into two parts: that which must be dealt with in a strictly sequential manner and that which can be executed in parallel. Any data dependencies will certainly imply some form of sequential ordering when dealing with the data; however, for a large number of problems such data dependencies may not exist. It may also be possible to reduce the effect of dependencies by clever scheduling.

The achievable speed-up for a problem using the domain decomposition approach is, however, bounded by the number of tasks that make up the problem. Solving a problem comprising a maximum of twenty tasks on more than twenty processors makes no sense. In practice, of course, any parallel implementation suffers from realisation penalties which increase as more processing elements are added. The actual speed-up obtained will thus be less than the maximum possible speed-up.
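To make the two bounds concrete, the following short sketch (in Python, written for these notes rather than taken from the case studies) evaluates equations (1.2) and (1.3) for a few processor counts; the serial fractions used are illustrative assumptions, not measured values.

    def amdahl_speedup(s, n):
        """Maximum speed-up for a fixed-size problem, equation (1.2)."""
        p = 1.0 - s
        return 1.0 / (s + p / n)

    def gustafson_speedup(s, n):
        """Maximum speed-up when the problem scales with n, equation (1.3)."""
        return n + (1 - n) * s

    for n in (10, 100, 1000):
        # with a 1% sequential portion, Amdahl caps the speed-up below 91 on
        # 1000 processors, while Gustafson's bound remains nearly linear
        print(n, amdahl_speedup(0.01, n), gustafson_speedup(0.01, n))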

1.4.3 Efficiency

A relative efficiency, based on the performance of the problem on one processor, can be a useful measure of what percentage of a processor's time is being spent in useful computation. This, therefore, determines what the system overheads are. The relative efficiency we will measure as:

\[ \textrm{efficiency} = \frac{\textrm{speed-up}}{\textrm{number of processors}} \times 100 \tag{1.4} \]
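As a small illustration, the sketch below computes speed-up (T1/Tn) and the relative efficiency of equation (1.4) from a set of hypothetical timings; the numbers are invented for the example.

    def speedup(t1, tn):
        """Speed-up relative to the single processing element time T1."""
        return t1 / tn

    def efficiency(t1, tn, n):
        """Relative efficiency in percent, equation (1.4)."""
        return speedup(t1, tn) / n * 100.0

    t1 = 1200.0                                    # seconds on one PE (assumed)
    measured = {10: 135.0, 30: 52.0, 100: 1050.0}  # hypothetical Tn values
    for n, tn in measured.items():
        print(n, round(speedup(t1, tn), 1), round(efficiency(t1, tn, n), 1))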

Figure 1.16 shows the optimum and actual computation times given in figure 1.12 represented as processing element efficiency. The graph shows that optimum computation time, and therefore linear speed-up, equates to an efficiency of 100% for each processing element.

Figure 1.16: Optimum and actual processing element efficiency

This again shows that to achieve this level of efficiency every processing element must spend 100% of its time performing useful computation. Any implementation penalty would be immediately reflected by a decrease in efficiency. This is clearly shown in the curve for the actual computation times. Here the efficiency of each processing element decreases steadily as more are added, until by the time 100 processing elements are incorporated, the realisation penalties are so high that each processing element is only able to devote just over 1% of its time to useful computation.

Optimum number of processing elements

Faced with implementing a fixed size problem on a parallel system, it may be useful to know the optimum number of processing elements on which this particular problem should be implemented in order to achieve the best possible performance. We term this optimum number n_opt. We shall judge the maximum performance for a particular problem with a fixed problem domain size as the shortest possible time required to produce the desired results for a certain parallel implementation. This optimum number of processing elements may be derived directly from the "computation time" graph. In figure 1.12 the minimum actual computation time occurred when the problem was implemented on 30 processing elements. As figure 1.17 shows, this optimum number of processing elements is also the point on the horizontal axis in figure 1.13 at which the maximum speed-up was obtained.


Figure 1.17: Optimum number of processing elements related to speed-up

The optimum number of processing elements is also the upper bound for the scalability of the problem for that parallel implementation. To improve the scalability of the problem it is necessary to re-examine the decisions concerning the algorithm chosen and the make-up of the system software that has been adopted for supporting the parallel implementation. As we will see in the subsequent chapters, the correct choice of system software can have a significant effect on the performance of a parallel implementation.



Figure 1.18: Speed-up graphs for different system software for the same problem


Figure 1.18 shows the speed-up graphs for different system software decisions for the same problem. The goal of a parallel implementation may be restated as: "to ensure that the optimum number of processing elements for your problem is greater than the number of processing elements physically available to solve the problem!"

Other metrics

Computation time, speed-up and efficiency provide insight into how successful a parallel implementation of a problem has been. As figure 1.18 shows, different implementations of the same algorithm on the same multiprocessor system may produce very different performances. A multitude of other metrics have been proposed over the years as a means of comparing the relative merits of different architectures and of assessing their suitability as the chosen multiprocessor machine.

The performance of a computer is frequently measured as the rate of some number of events per second. Within a multi-user environment the elapsed time to solve a problem will comprise the user's CPU time plus the system's CPU time. Assuming that the computer's clock is running at a constant rate, the user's CPU performance may be measured as:

\[ \textrm{CPU time} = \frac{\textrm{CPU clock cycles for a program}}{\textrm{clock rate (e.g. 100 MHz)}} \]

The average number of clock cycles per instruction (CPI) may be calculated as:

\[ \textrm{CPI} = \frac{\textrm{CPU clock cycles for a program}}{\textrm{instruction count}} \]

We can also compute the CPU time from the time a program took to run:

\[ \textrm{CPU time} = \frac{\textrm{seconds}}{\textrm{program}} = \frac{\textrm{instructions}}{\textrm{program}} \times \frac{\textrm{clock cycles}}{\textrm{instruction}} \times \frac{\textrm{seconds}}{\textrm{clock cycle}} \]

Such a performance metric is dependent on:

Clock rate: this is determined by the hardware technology and the organisation of the architecture;

CPI: a function of the system organisation and the instruction set architecture; and,

Instruction count: this is affected by the instruction set architecture and the compiler technology utilised.

One of the most frequently used performance metrics is the MIPS rating of a computer, that is, how many Million Instructions Per Second the computer is capable of performing:

\[ \textrm{MIPS} = \frac{\textrm{instruction count}}{\textrm{execution time} \times 10^6} = \frac{\textrm{clock rate}}{\textrm{CPI} \times 10^6} \]

However, the MIPS value is dependent on the instruction set used and thus any comparison between computers with different instruction sets is not valid. The MIPS value may even vary between programs running on the same computer. Furthermore, a program which makes use of hardware floating point routines may take less time to complete than a similar program which uses a software floating point implementation, but the first program will have a lower MIPS rating than the second [75]. These anomalies have led to MIPS sometimes being referred to as "Meaningless Indication of Processor Speed".

Similar to MIPS is the "Mega-FLOPS" (MFLOPS) rating for computers, where MFLOPS represents Million FLoating point OPerations per Second:

\[ \textrm{MFLOPS} = \frac{\textrm{number of floating point operations in a program}}{\textrm{execution time} \times 10^6} \]
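A sketch of both ratings, using invented instruction and operation counts purely to show how the two formulas above are applied:

    def mips(instruction_count, execution_time):
        """Million Instructions Per Second."""
        return instruction_count / (execution_time * 1e6)

    def mflops(fp_operation_count, execution_time):
        """Million FLoating point OPerations per Second."""
        return fp_operation_count / (execution_time * 1e6)

    instructions = 4.2e9    # assumed instruction count
    fp_ops = 9.5e8          # assumed floating point operation count
    seconds = 12.0          # assumed execution time
    print(mips(instructions, seconds), mflops(fp_ops, seconds))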

MFLOPS is not universally applicable; for example, a word processor utilising no floating point operations would register no MFLOPS rating. However, the same program executing on different machines should be comparable because, although the computers may execute a different number of instructions, they should perform the same number of operations, provided the set of floating point operations is consistent across both architectures. The MFLOPS value will vary for programs running on the same computer which have different mixtures of integer and floating point instructions as well as a different blend of "fast" and "slow" floating point instructions. For example, the add instruction often executes in less time than a divide instruction. A MFLOPS rating for a single program can not, therefore, be generalised to provide a single performance metric for a computer. Suites of benchmark programs, such as the LINPACK or Livermore Loops routines, have been developed to allow a more meaningful method of comparison between machines. When examining the relative performance of computers using such benchmarks it is important to discover the sustained MFLOPS performance, as a more accurate indication of the machines' potential, rather than merely the peak MFLOPS rating, a figure that "can be guaranteed never to be exceeded".

Other metrics for comparing computers include:

Dhrystone: A CPU intensive benchmark used to measure integer performance, especially as it pertains to system programming.

Whetstone: A synthetic benchmark without any vectorisable code for evaluating floating point performance.

TPS: Transactions Per Second, a measure for applications, such as airline reservation systems, which require on-line database transactions.

KLIPS: Kilo Logic Inferences Per Second, used to measure the relative inference performance of artificial intelligence machines.

Tables showing the comparison of the results of these metrics for a number of architectures can be found in several books, for example [80, 86].

Cost is seldom an issue that can be ignored when purchasing a high performance computer. The desirability of a particular computer or even the number of processors within a system may be offset by the extraordinarily high costs associated with many high performance architectures. This prompted an early "law" by Grosch that the speed of a computer is proportional to its cost [65, 64]. Fortunately this is no longer completely true, although multiprocessor machines are nevertheless typically more expensive than their general purpose counterparts. The parallel computer eventually purchased should provide acceptable computation times for an affordable price, that is, maximise "the bangs per buck" (performance per unit price).


2 Realistic Rendering

Physically correct rendering of artificial scenes requires the simulation of light behaviour. Such simulations are computationally very expensive, which means that a good lighting simulation may take from a couple of hours to several days. The best lighting simulation algorithms to date are ray tracing [191], radiosity [55] and particle tracing [130]. The difference between these algorithms lies in which light paths are approximated and which are correctly simulated. This means that the lighting effects obtainable with ray tracing are slightly different from those obtainable with radiosity. As pointed out by Kajiya [92], all rendering algorithms aim to model the same lighting behaviour, i.e. light scattering off various types of surfaces, and hence try to solve the same equation, termed the rendering equation. Following the notation adopted by Shirley [156], it is given in equation 2.1.

\[ L_o(x, \Theta_o) = L_e(x, \Theta_o) + \int_{\textrm{all } x'} v(x, x') \, f_r(x, \Theta_o', \Theta_o) \, L_o(x', \Theta_o') \, \frac{\cos\theta_i \cos\theta_o'}{\| x' - x \|^2} \, dA' \tag{2.1} \]

This equation simply states that the outgoing radiance L_o at surface point x in direction \Theta_o is equal to the emitted radiance L_e plus the incoming radiance from all points x' reflected into direction \Theta_o. In this equation, v(x, x') is a visibility term, being 1 if x' is visible from surface point x and 0 otherwise. The material properties of surface point x are represented in the bi-directional reflection distribution function (BRDF) f_r(x, \Theta_o', \Theta_o), which returns the amount of radiance reflected into direction \Theta_o as a function of incident radiance from direction \Theta_o'. The cosine terms translate surface points in the scene into projected solid angles.

The rendering equation is an approximation to Maxwell's equations for electromagnetics [92] and therefore does not model all optical phenomena. For example, it does not include diffraction and it also assumes that the media in between surfaces do not scatter light. This means that participating media, such as smoke, clouds, mist and fire, are not accounted for without extending the above formulation. There are two reasons for the complexity of physically correct rendering algorithms. One stems from the fact that the quantity to be computed, L_o, is part of the integral in equation 2.1, turning the rendering equation into a recursive integral equation. The other is that, although fixed, the integration domain can be arbitrarily complex. Recursive integral equations with fixed integration domains are called Fredholm equations of the second kind and have to be solved numerically [41]. The following sections briefly discuss three numerical approximations to the rendering equation: ray tracing, radiosity and particle tracing.

2.1 Ray tracing

All (photo-)realistic rendering algorithms approximate the rendering equation (equation 2.1). The differences lie in the type of error introduced by the different methods. One such approximation is called ray tracing [191]. The basic ray tracing algorithm follows, for each pixel of the image, one or more rays into the scene (figure 2.1). If such a primary ray hits an object, the light intensity of that object is assigned to the corresponding pixel. In order to model shadows, new rays are spawned from the intersection point of the ray and the object towards each of the light sources (figure 2.2). These shadow rays are used to test whether there are other objects between the intersection point and the light sources, indicating the intersection point is in shadow, or whether the light sources are hit directly, which means that the intersection point is directly lit.
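The sketch below (Python, written for these notes rather than taken from any renderer) shows the recursive structure just described for a scene of spheres: primary rays, shadow rays towards a single point light, and a mirror reflection ray. The scene set-up and shading model are simplifying assumptions.

    import math

    def normalise(v):
        l = math.sqrt(sum(x * x for x in v))
        return [x / l for x in v]

    class Sphere:
        def __init__(self, centre, radius, colour, mirror=0.0):
            self.c, self.r, self.colour, self.mirror = centre, radius, colour, mirror

        def intersect(self, o, d):
            """Nearest positive ray parameter t, or None (d must be normalised)."""
            oc = [o[i] - self.c[i] for i in range(3)]
            b = 2.0 * sum(oc[i] * d[i] for i in range(3))
            c = sum(x * x for x in oc) - self.r * self.r
            disc = b * b - 4.0 * c
            if disc < 0.0:
                return None
            t = (-b - math.sqrt(disc)) / 2.0
            return t if t > 1e-4 else None

    def trace(o, d, scene, light, depth=0):
        hit, t_hit = None, float("inf")
        for obj in scene:                         # closest intersection wins
            t = obj.intersect(o, d)
            if t is not None and t < t_hit:
                hit, t_hit = obj, t
        if hit is None:
            return [0.0, 0.0, 0.0]                # background colour
        p = [o[i] + t_hit * d[i] for i in range(3)]
        n = normalise([p[i] - hit.c[i] for i in range(3)])
        to_light = normalise([light[i] - p[i] for i in range(3)])
        dist_light = math.sqrt(sum((light[i] - p[i]) ** 2 for i in range(3)))
        # shadow ray: is any object between the hit point and the light source?
        shadowed = False
        for obj in scene:
            ts = obj.intersect(p, to_light)
            if ts is not None and ts < dist_light:
                shadowed = True
                break
        diffuse = 0.0 if shadowed else max(0.0, sum(n[i] * to_light[i] for i in range(3)))
        colour = [diffuse * c for c in hit.colour]
        if hit.mirror > 0.0 and depth < 3:        # spawn a reflection ray
            dn = sum(d[i] * n[i] for i in range(3))
            refl = [d[i] - 2.0 * dn * n[i] for i in range(3)]
            colour = [c + hit.mirror * r
                      for c, r in zip(colour, trace(p, refl, scene, light, depth + 1))]
        return colour

    scene = [Sphere([0, 0, -5], 1.0, [1, 0, 0], mirror=0.3),
             Sphere([2, 0, -6], 1.0, [0, 1, 0])]
    print(trace([0, 0, 0], normalise([0, 0, -1]), scene, [5, 5, 0]))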



Figure 2.1: Shooting primary rays.

Mirror reflection and transparency may be modelled similarly by shooting new rays into the reflected and/or transmitted directions (figure 2.2). These reflection and transparency rays are treated in exactly the same way as primary rays. Hence, ray tracing is a recursive algorithm.


Figure 2.2: Modelling reflection and shadowing.

In terms of the rendering equation, ray tracing can be defined more formally as [98]:


\[ L_o(x, \Theta_o) = L_e(x, \Theta_o) + \sum_{\textrm{all } x_l \in L} \int_L v(x, x_l) \, f_{r,d}(x) \, L_e(x_l, \Theta_o') \cos\theta_l \, d\omega_l + \int_{\Theta_s \in \Omega_s} f_{r,s}(x, \Theta_s, \Theta_o) \, L(x_s, \Theta_s) \cos\theta_s \, d\omega_s + \rho_d L_a(x) \]

Here, the second term on the right hand side computes the direct contribution of the light sources L. The visibility term is evaluated by casting shadow rays towards the light sources. The specular contribution is computed by evaluating the third term. If the specular component (the same holds for transparency) intersects a surface, this equation is evaluated recursively. As normally no diffuse interreflection is computed in ray tracing, the ambient component is approximated by a constant, the fourth term.

This recursive process has to be carried out for each individual pixel separately. A typical image therefore costs at least a million primary rays and a multiple of that in the form of shadow rays and reflection and transparency rays. The most expensive parts of the algorithm are the visibility calculations. For each ray, the object that intersects the ray first must be determined. To do this, a potentially large number of objects will have to be intersected with each ray. One of the first, and arguably one of the most obvious, optimisations is to spatially sort the objects as a pre-process, so that for each ray, instead of intersecting all the objects in the scene, only a small subset of the objects need to be tested. Sorting techniques of this kind are commonly known as spatial subdivision techniques [54].

The simplest of these is the grid (figure 2.3a), which subdivides the scene into a number of cells (or voxels) of equal size. Tracing a ray is now performed in two steps. First, the cells that the ray intersects must be determined. This is called ray traversal. In the second step the objects contained within these cells are intersected. Once an intersection in one cell is found, subsequent cells are not traversed any more. The objects in the cells that are not traversed are not tested at all.

Although the grid is simple to implement and cheap to traverse, it does not adapt itself very well to the quirks of the particular model being rendered. Complex models usually concentrate a large number of objects in a few relatively small areas, whereas the rest of the scene is virtually empty. Figure 2.2 is one such example of a complex scene in which a large concentration of objects is used to model the musical equipment and the couches. The floor and the walls, however, consist of just a few objects. Adaptive spatial subdivisions, such as the octree [53] and the bintree (figures 2.3b and 2.3c give 2D examples), are better suited for complex scenes. Being tree structures, space is recursively subdivided into two (bintree) or eight (octree) cells whenever the number of objects in a cell is above a given threshold and the maximum tree depth is not yet reached. The cells are smaller in areas of high object concentration, but the number of objects in each of these cells should be more or less the same. The cost of intersecting a ray with the objects in a cell is therefore nearly the same for all cells in the tree.

Experiments have shown that, as a rule of thumb, the number of cells in a spatial subdivision structure should be of the same order as the number of objects N in the scene [143]. Given this assumption, an upper bound for the cost (in seconds) of tracing a single ray through the scene is derived for the three spatial subdivision structures as follows [145]:

Grid The number of grid cells is N, so that in each of the orthogonal directions x, y and z the number of cells will be \sqrt[3]{N}. A ray travelling linearly through the structure will therefore cost

Figure 2.3: 2D examples of spatial subdivision techniques: (a) grid, (b) octree, (c) bintree.

\[ T = \sqrt[3]{N} \, (T_{cell} + T_{int}) = O(\sqrt[3]{N}) \]

In this and the following equations T_{cell} is the time it takes to traverse a single cell and T_{int} is the time it takes on average to intersect a single object.

Bintree Considering a balanced bintree with N leaf cells, the height of the tree will be h, where 2^h = N. The number of cells traversed by a single ray is then O(2^{h/3}), giving

\[ T = 2^{h/3} \, (T_{cell} + T_{int}) = \sqrt[3]{N} \, (T_{cell} + T_{int}) = O(\sqrt[3]{N}) \]

Octree In a balanced octree with N leaf cells, the height is h, where 8^h = N. A ray traversing such an octree intersects O(2^h) cells:

\[ T = 2^h \, (T_{cell} + T_{int}) = \sqrt[3]{N} \, (T_{cell} + T_{int}) = O(\sqrt[3]{N}) \]

Although the asymptotic behaviour of these three spatial subdivision techniques is the same, in practice differences may occur between the grid and the tree structures due to the grid's inability to adapt to the distribution of data

in the scene.


Spatial subdivision techniques have reduced the number of intersection tests dramatically from O(N) to O(\sqrt[3]{N}), but a very large number of intersection tests is still required due to the sheer number of rays being traced and due to the complexity of the scenes, which has only increased over the years. Other sorting mechanisms that improve the speed of rendering, such as bounding box strategies, exist, but differ only in the fact that objects are now bounded by simple shapes that need not be in a regular structure. This means that bounding spheres or bounding boxes may overlap and may be of arbitrary size. The optimisation is due to the fact that intersecting a ray with such a simple shape is often much cheaper than intersecting with the more complex geometry it encapsulates. Bounding spheres or boxes may be ordered in a hierarchy as well, leading to a tree structure that removes the need to test all the bounding shapes for each ray. Because bounding boxes (and spheres) are quite similar to spatial subdivision techniques (the differences being their improved adaptability to the scene and their possibly more expensive ray traversal cost), these techniques are not considered any further. The reduction in intersection tests is of the same order as for spatial subdivision techniques. As other optimisations that significantly reduce the time complexity of ray tracing are not imminent, the most viable route to improve execution times is to exploit parallel processing (chapter 6).
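As an illustration of the pre-processing sort, the sketch below bins objects into a uniform grid by their axis-aligned bounding boxes, choosing roughly N^(1/3) cells per axis in line with the rule of thumb above. The object representation is an assumption made for this example; a real ray tracer would also need the traversal step.

    import math

    def build_grid(bounds, scene_min, scene_max):
        """bounds: list of (lo, hi) axis-aligned boxes; returns (resolution, cell map)."""
        res = max(1, math.ceil(len(bounds) ** (1.0 / 3.0)))   # ~N^(1/3) cells per axis
        size = [(scene_max[a] - scene_min[a]) / res for a in range(3)]
        cells = {}
        for idx, (lo, hi) in enumerate(bounds):
            lo_c = [min(res - 1, max(0, int((lo[a] - scene_min[a]) / size[a]))) for a in range(3)]
            hi_c = [min(res - 1, max(0, int((hi[a] - scene_min[a]) / size[a]))) for a in range(3)]
            for x in range(lo_c[0], hi_c[0] + 1):             # an object may span several cells
                for y in range(lo_c[1], hi_c[1] + 1):
                    for z in range(lo_c[2], hi_c[2] + 1):
                        cells.setdefault((x, y, z), []).append(idx)
        return res, cells

    # two unit boxes in a 10 x 10 x 10 scene
    print(build_grid([([1, 1, 1], [2, 2, 2]), ([7, 7, 7], [8, 8, 8])],
                     [0, 0, 0], [10, 10, 10]))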

2.2 Radiosity

The rendering equation (equation 2.1) provides a general expression for the interaction of light between surfaces. No assumptions are made about the characteristics of the environment, such as surface and reflectance properties. However, if it is assumed that surfaces are perfect diffuse reflectors or emitters, then the rendering equation can be simplified. A Lambertian surface [107] has the property that it reflects light in all directions in equal amounts. Radiance is then independent of outgoing direction and only a function of position:

\[ L_{out}(x, \Theta_{out}) = L(x) \tag{2.2} \]

In addition, the relation between a diffuse reflector and its bi-directional reflection distribution function is given by f_r = \rho_d / \pi [98], so that the rendering equation can be simplified to yield the radiosity equation [38]:

\[ L(x) = L_e(x) + \rho_d(x) \int_{\textrm{all } x'} L(x') \, \frac{\cos\theta_i \cos\theta_o}{\pi \, \| x' - x \|^2} \, v(x, x') \, dA' \]

Here, the radiance L(x) for a point x on a surface is the sum of the self-emitted radiance L_e(x) plus the reflected energy that was received from all other points x' in the environment. Unfortunately, it is not practically possible to solve this equation for all points in the scene. Therefore the surfaces in the scene are normally subdivided into sufficiently small patches (figure 2.4), where the radiance is assumed to be constant over each patch. If x is a point on patch i and x' a point on patch j, the radiance L_i for patch i is given by:

\[ L_i = L_{e_i} + \rho_{d_i} \sum_j \frac{L_j}{A_i} \int_{A_i} \int_{A_j} \frac{\cos\theta_i \cos\theta_j}{\pi r^2} \, \delta_{ij} \, dA_j \, dA_i \]

In this equation, r is the distance between patches i and j and \delta_{ij} gives the mutual visibility between the delta areas of the patches i and j. This equation can be rewritten as:

Figure 2.4: Subdivision of a polygon into smaller patches.

Figure 2.5: Form factor computation by (a) hemisphere and (b) hemicube methods.

\[ L_i = L_{e_i} + \rho_{d_i} \sum_j L_j \, f_{i \to j} \tag{2.3} \]

\[ f_{i \to j} = \frac{1}{A_i} \int_{A_i} \int_{A_j} \frac{\cos\theta_i \cos\theta_j}{\pi r^2} \, \delta_{ij} \, dA_j \, dA_i \tag{2.4} \]

In these equations, the form factor f_{i \to j} is the fraction of power leaving patch i that arrives at patch j. Form factors

depend solely on the geometry of the environment, i.e. the size and the shape of the elements and their orientation relative to each other. Therefore, the radiosity method is inherently view-independent.

2.2.1 Form factors

Computing form factors is generally the most expensive part of radiosity algorithms. It requires visibility computations to determine which elements are closest to the target element. One way of doing these computations is by means of ray tracing [184]. First, a hemisphere of unit radius is placed over the element (figure 2.5a). The surface of this hemisphere is (regularly or adaptively) subdivided into a number of cells. From the centre point of the element rays are shot through the cells into the environment. This process yields a delta form factor for every cell. Summing the delta form factors then gives the form factor for the element. As rays are shot in all directions, this method is called undirected shooting.

More sophisticated hemisphere methods direct more rays into interesting regions, for example by explicitly shooting towards patches (directed shooting) or by adaptive shooting. In the adaptive variant the delta form factors are compared to each other to see where large differences between them occur. In these directions the cells on the hemisphere are subdivided and for each cell a new delta form factor is computed. Both directed shooting and adaptive

refinement are more efficient than plain undirected shooting.

Instead of placing half a sphere above a patch to determine the directions in which to shoot rays, by Nusselt's analogue half a cube can also be used (figure 2.5b). The five sides of this cube are (possibly adaptively) subdivided and for every grid cell on the cube a delta form factor is computed. Because the sides of the cube can be viewed as image planes, z-buffer algorithms are applicable to compute the delta form factors. The only extension to a standard z-buffer algorithm is that with every z-value an ID of a patch is stored instead of a colour value. In this context the z-buffer is therefore called an item-buffer. After applying the z-buffer algorithm for a patch, the grid cells are traversed and, for each grid cell, its delta form factor is added to the form factor of the patch. The advantage of the hemi-cube method is that standard z-buffering hardware may be used.
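A minimal sketch of the stochastic variant of this computation: rays are distributed uniformly over the hemisphere above a patch and each ray that hits another patch contributes a delta form factor of 2 cos(theta)/n. The nearest_patch query is an assumed stand-in for the scene's ray caster.

    import math, random

    def estimate_form_factors(centre, frame, nearest_patch, n_rays=1000):
        """Delta form factors from one patch; frame = (t1, t2, n) is an orthonormal basis."""
        t1, t2, n = frame
        totals = {}
        for _ in range(n_rays):
            u1, u2 = random.random(), random.random()
            cos_t = u1                              # uniform over the hemisphere
            sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
            phi = 2.0 * math.pi * u2
            d = [sin_t * math.cos(phi) * t1[a] +
                 sin_t * math.sin(phi) * t2[a] +
                 cos_t * n[a] for a in range(3)]
            hit = nearest_patch(centre, d)          # assumed ray-cast scene query
            if hit is not None:
                # the weight 2*cos_t makes the sum an unbiased estimate of F_{i->j}
                totals[hit] = totals.get(hit, 0.0) + 2.0 * cos_t / n_rays
        return totals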

2.2.2 Full radiosity

Equations 2.3 and 2.4 are the basis of the radiosity method [55] and describe how the radiance of a patch is computed by gathering incoming radiance from all other patches in the scene. Strictly speaking, it is an illumination computation method that computes the distribution of light over a scene. As opposed to ray tracing, it is not a rendering technique, but a pre-processing stage for some other rendering algorithm. For a full radiosity solution (also known as gathering), equations 2.3 and 2.4 must be solved for each pair of patches i and j. Therefore, if the scene consists of N patches, a system of N equations must be solved (see equation 2.5):

\[
\begin{pmatrix}
1 - \rho_{d_1} f_{1 \to 1} & -\rho_{d_1} f_{1 \to 2} & \cdots & -\rho_{d_1} f_{1 \to N} \\
-\rho_{d_2} f_{2 \to 1} & 1 - \rho_{d_2} f_{2 \to 2} & \cdots & -\rho_{d_2} f_{2 \to N} \\
\vdots & \vdots & \ddots & \vdots \\
-\rho_{d_N} f_{N \to 1} & -\rho_{d_N} f_{N \to 2} & \cdots & 1 - \rho_{d_N} f_{N \to N}
\end{pmatrix}
\begin{pmatrix} L_1 \\ L_2 \\ \vdots \\ L_N \end{pmatrix}
=
\begin{pmatrix} L_{e_1} \\ L_{e_2} \\ \vdots \\ L_{e_N} \end{pmatrix}
\tag{2.5}
\]

Normally this system of equations is diagonally dominant, which means that Gauss-Seidel iterative solutions are appropriate [37]. However, this requires computing all the form factors beforehand to construct the full matrix. It is therefore also known as Full Matrix radiosity. The storage requirements of this radiosity approach are O(N^2), as a form factor must be computed and stored between every pair of elements. This clearly can be a severe problem, restricting the size of the model that can be rendered.
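For a small system the gathering solution can be written down directly; the sketch below performs Gauss-Seidel sweeps over equation (2.5). The reflectivities, emission values and (precomputed) form factors are illustrative assumptions.

    def gauss_seidel_radiosity(Le, rho, F, sweeps=50):
        """Solve L_i = Le_i + rho_i * sum_j F[i][j] * L_j by Gauss-Seidel iteration."""
        N = len(Le)
        L = list(Le)                               # initial guess: self-emission only
        for _ in range(sweeps):
            for i in range(N):
                gathered = sum(F[i][j] * L[j] for j in range(N))
                L[i] = Le[i] + rho[i] * gathered   # uses the most recent estimates
        return L

    Le  = [1.0, 0.0, 0.0]                          # patch 0 is the light source
    rho = [0.0, 0.6, 0.4]                          # diffuse reflectivities
    F   = [[0.0, 0.3, 0.2],                        # F[i][j]: form factor from i to j
           [0.3, 0.0, 0.4],
           [0.2, 0.4, 0.0]]
    print(gauss_seidel_radiosity(Le, rho, F))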

2.2.3 Progressive refinement

To avoid the O(N^2) storage requirement of the full radiosity method, there is another approach for calculating the radiances L_i which reduces storage requirements to O(N). It is called progressive radiosity or the shooting method. The latter name stems from its physical interpretation, which is that for each iteration the element with the highest unshot energy is selected. This element shoots its energy into the environment, instead of gathering it from the environment. This process is repeated until the total amount of unshot energy drops below a specified threshold. As most patches will receive some energy after only a few iterations, this method gives a quick initial approximation of the global illumination, with subsequent refinements resulting in incremental improvements of the radiosity solution [38].

All patches are initialised to have a radiance equal to the amount of light L_e that they emit. This means that only the light sources initially emit light. In progressive radiosity terminology the light sources are said to have the most unshot radiance. For each iteration the patch with the largest unshot radiance is selected. This patch shoots its radiance to all other elements. The elements which are visible to the shooting patch therefore gain (unshot) radiance.

Finally the shooting element's unshot radiance is set to zero, since there is no unshot radiance left. This completes a single iteration. Thus after each iteration the total amount of radiance is redistributed over all elements in the environment, and an image of the results can be generated before a new iteration commences. By selecting the element with the largest amount of unshot radiance at the beginning of each iteration, the largest contributions to the final result are added first. This greatly improves the convergence rate in the early stages of the algorithm. Moreover, fewer iterations are needed to have the residual error in the solution drop below a specified threshold [36].

When intermediary results have to be displayed, an ambient term can be added to the solution vector. This is comparable to the over-relaxation technique for classical radiosity methods in that the estimation of the solution is exaggerated. The ambient term in progressive radiosity is only added for display purposes. This term is not used to improve the solution vector for the next iteration, as is done in the over-relaxation technique. The ambient radiosity term is derived from the amount of unshot energy in the environment. As the solution vector converges, the ambient term decreases. This way the sequence of displayed intermediary images gracefully yields the final image.

In progressive radiosity form factors are not stored (as this would lead to a storage requirement of O(N^2)); only radiances and unshot radiances are stored for every element (O(N) storage). The downside of this approach is that visibility between elements sometimes has to be recomputed. This disadvantage is compensated for by the small number of iterations necessary to arrive at a good approximation [36]. However, if the full solution is needed, the convergence rate of the gathering method is better [29].
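The shooting iteration can be sketched as follows; form_factor(i, j) is an assumed function that computes (or recomputes) a form factor on demand, so only O(N) radiance values need to be stored.

    def progressive_radiosity(Le, rho, area, form_factor, threshold=1e-3):
        N = len(Le)
        L = list(Le)                    # current radiance estimate per patch
        unshot = list(Le)               # unshot radiance per patch
        while True:
            i = max(range(N), key=lambda k: unshot[k] * area[k])  # most unshot energy
            if unshot[i] * area[i] < threshold:
                break
            for j in range(N):          # shoot patch i's unshot radiance into the scene
                if j == i:
                    continue
                # reciprocity: the fraction received by j scales with A_i / A_j
                dL = rho[j] * unshot[i] * form_factor(i, j) * area[i] / area[j]
                L[j] += dL
                unshot[j] += dL
            unshot[i] = 0.0             # patch i has now shot everything it held
        return L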

2.2.4 Hierarchical radiosity

In order to minimise insignificant energy exchanges between patches, a hierarchical variant of the radiosity algorithm was derived. Instead of computing form factors between individual patches, radiosity exchanges are computed between groups of patches at various levels in the hierarchy [72, 71, 70, 159]. Therefore it is possible to perform a minimal amount of work to obtain the best result within a specified error bound. This is accomplished by selecting the coarsest subdivision in the hierarchy for the desired level of precision. As an example, in figure 2.6 (an example loosely based on the one in [159]) a reference patch on the left wall interacts with some other patches in the scene. Several different situations may occur, based on the distance between a patch and the reference patch:

- For patches that are close together, as patch 1 and the reference patch in figure 2.6 are, a subdivision into smaller patches may be appropriate.
- For more distant patches, such as patch 2 and the reference patch in the same figure, the form factor can be approximated with no additional subdivision of the reference patch.
- Finally, for very distant patches (patch 3 and the reference patch), the reference patch may be merged with its surrounding patches without affecting precision much.

In effect, the form factor matrix is subdivided into a number of blocks, where each of the blocks represents an interaction between groups of patches. The total number of blocks is O(N), which is a clear improvement over the O(N^2) complexity of the regular form factor matrix.

2.3 Stochastic techniques

Monte Carlo techniques are generally used to efficiently evaluate complicated integral functions in a probabilistic manner. As the environment to be rendered can be viewed as a complicated function, Monte Carlo methods are applicable. They allow for solving difficult integrals, such as the rendering equation 2.1. There is almost no additional storage space required beyond the scene geometry itself and these techniques can be applied to arbitrary surface



Figure 2.6: Radiosity exchanges between a reference patch and different surfaces in the scene

geometry and reflectance functions. Also, methods exist to quantify the error in the solution [15]. To evaluate an integral F = \int_\Omega f(x) \, dx using Monte Carlo techniques, the following steps are necessary [130, 179]:

- Let p(x) dx be the probability distribution function of the points generated in the domain \Omega.

- In Monte Carlo integration, a sample X is generated. Then f(X)/p(X) is used as an estimate of F. This sample value F = f(X)/p(X) needs to be finite for the estimate E to be unbiased:

\[ E\left[\frac{f(X)}{p(X)}\right] = \int \frac{f(x)}{p(x)} \, p(x) \, dx = \int f(x) \, dx = F \]

- In practice, a number of samples, distributed according to p, are taken and F is estimated to be the average of those samples:

\[ F \approx \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{p(X_i)} \]

The variance in the sample value F is given by

\[ V[F] = E[F^2] - E^2[F] = \int \frac{f^2(x)}{p(x)} \, dx - F^2 \]

It is an estimate for the difference between the estimate E and the actual solution F . Different schemes to choose samples have been devised. All of them attempt to direct the sampling towards areas that contribute most to the final result [178, 177, 180]. This is known as variance reduction (see section 2.3.1). If n samples are taken to estimate F , the variance V is reduced by a factor of n. Unfortunately, the running time is increased by a factor of n as well. Also, the variance is affected by the choice of point distribution function p. 37
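A small sketch of the estimator above for a one-dimensional integral, first with uniform samples and then with a density chosen to resemble the integrand (a choice discussed below as importance sampling); the integrand is an invented example whose exact value is 1 - cos(1).

    import math, random

    def f(x):
        return math.sin(x)                    # integrate sin(x) over [0, 1]

    def estimate(n, sample, pdf):
        """(1/n) * sum f(X_i)/p(X_i) with X_i drawn by sample() from density pdf."""
        return sum(f(x) / pdf(x) for x in (sample() for _ in range(n))) / n

    # uniform sampling: p(x) = 1 on [0, 1]
    uniform = estimate(10000, random.random, lambda x: 1.0)

    # sampling with p(x) = 2x on [0, 1], which roughly follows sin(x) there;
    # sampled by inverting the CDF x^2, i.e. x = sqrt(u)
    weighted = estimate(10000, lambda: math.sqrt(random.random()), lambda x: 2.0 * x)

    print(uniform, weighted, 1.0 - math.cos(1.0))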


Figure 2.7: Correlated (left) versus uncorrelated (right) two-dimensional sampling.

Ideally, p should be chosen such that p = f/F. Then p is proportional to f, and the variance V[F] is zero. However, this requires knowledge of the quantity F that we want to compute, making this approach impractical. If p is chosen such that its shape resembles f, then this is known as importance sampling. Stochastic techniques can be applied to both ray tracing and radiosity (sections 2.3.2 and 2.3.3). They have also been used to solve the rendering equation directly (section 2.3.4).

2.3.1 Variance reduction techniques

When sampling a function stochastically, a large number of samples are required to reduce the amount of noise in the estimate of the function. A large number of samples may take a long time to compute. However, when the samples are not distributed randomly, but are biased towards areas where their contribution is high, an accurate estimate can be obtained using fewer samples. In [155] the characteristics of sampling are summarised. To sample accurately and to avoid aliasing as much as possible, the following issues have to be observed [155]:

Adaptivity This technique is not particular to stochastic sampling. The idea is to direct more sampling effort to directions that contribute most to the solution. A number of refinement criteria are possible, which may be based upon statistics, confidence, signal theory or the characteristics of the human eye [155].

Irregularity Irregular sampling patterns replace obvious aliasing in a scene by noise. It has been shown that noise is less perceptible to the human eye than regular aliasing effects.

Complete stratification If N samples are to be taken in an interval of length L, then complete stratification techniques split the interval into N intervals of length L/N and in each of these intervals one sample is taken randomly.

Importance sampling If the function that is to be sampled has an associated weighting function, then it is more efficient to distribute the samples according to the weighting function.

Uncorrelation Uncorrelated sampling allows multi-dimensional sampling at the same cost as one-dimensional sampling. In figure 2.7 an example is given of correlated two-dimensional sampling (left) and uncorrelated two-dimensional sampling (right), which is needed to avoid visual artifacts.

Fast reconstruction When samples are taken non-uniformly, the reconstruction of the sampled scene to form an image becomes more complex than in the case of uniform sampling. Uniformly distributed samples only need a limited-support filter to reconstruct the original signal [121].

These characteristics should be combined into a single sampling strategy to arrive at an anti-aliasing method with optimal efficiency. Very few sampling algorithms proposed in the literature include all of these characteristics [155]. The remainder of this section is devoted to a more detailed description of the stratification and importance sampling techniques. The usefulness of these variance reduction techniques lies in the fact that fewer stochastically

determined samples are needed to obtain a solution with an error similar to that obtained with brute force methods, i.e. when samples are taken randomly without considering extra available information.

In [109], stratified sampling is used to direct most effort to regions where the difference between neighbouring samples is large, in other words, where the variance is high. The region that is to be sampled is broken up into a number of smaller regions. The combination of the distribution functions of these regions is the same as the original region. This way it is possible to arrive at a good estimate of the integrand by sampling the subregions. Fewer samples are necessary [109] than by picking samples randomly. There are a number of strategies available for splitting up the region of interest. These are summed up in [92]. For example, the subdivision may be fixed and samples within the subregions taken randomly [109]. In a hierarchical integration method, the regions are repeatedly split in half, as are the contributions of the regions to the integrand. A tree of subregions is built and after each split the integral values of parent nodes are updated. Instead of always subdividing each region in the hierarchical integration, splitting may be done only when the variance is locally high. In [92] it is suggested that a probability threshold could be used. At each node a random number is chosen and a threshold is calculated. If the random number falls below the threshold, the left sub node is chosen, otherwise the right sub node is processed. Different criteria can be applied to compute the threshold. This method is called adaptive hierarchical integration.

Another variance reduction method is called importance sampling. In this case the samples are not distributed evenly, but according to a distribution function or a filter. This means that more samples are taken in the regions of interest (where the filter value is high) and less computation power is spent on less important areas. The difference between importance sampling and adaptive strategies is that the latter gives criteria for which regions to split in half, whereas the former splits regions into unequal parts. To give an idea of the effect of importance sampling, consider a spherical light source [157]. When sampled uniformly, there is a probability of one half that a point on the far side of the light source is taken, yielding unnecessarily high variance. Therefore, it is better only to sample from the visible part of the sphere as seen from a point x in space. The third and best sampling strategy is to sample uniformly with respect to the solid angle as seen from x.
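Of these techniques, complete stratification is the simplest to illustrate in code. The sketch below compares N purely random samples over an interval with N completely stratified samples, one per sub-interval of length L/N; the integrand is an illustrative choice.

    import random

    def plain_samples(n, L):
        return [random.uniform(0.0, L) for _ in range(n)]

    def stratified_samples(n, L):
        w = L / n                                  # width of each stratum
        return [(i + random.random()) * w for i in range(n)]

    def mc_estimate(samples, f, L):
        return L * sum(f(x) for x in samples) / len(samples)

    f = lambda x: x * x                            # exact integral over [0, 2] is 8/3
    print(mc_estimate(plain_samples(64, 2.0), f, 2.0),
          mc_estimate(stratified_samples(64, 2.0), f, 2.0))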

2.3.2 Stochastic ray tracing

In ray tracing, instead of choosing sampling points at regular intervals, the image sampling point of a primary ray may be computed using a distribution function. Usually (pseudo) random numbers that are generated by a computer are uniformly distributed in the unit interval. This distribution has the disadvantage of tending to leave large gaps in some areas, while bunching together in others [54]. To avoid this, alternative distributions may be used, such as the Poisson disk distribution. This is a random distribution with the restriction that no two samples are closer to each other than a specified minimum distance, see figure 2.8 left. A disadvantage of such a global distribution pattern is that it is expensive to compute. A cheap approximation to a Poisson disk distribution can be obtained by applying local variations to a regular grid of sample points. Regularly spaced sample points are offset by a small random distance, thus approximating the Poisson disk distribution. This process is called jittering (figure 2.8 right). Both distribution functions reduce the effect of aliasing, turning it into noise. Where regular sampling patterns are easily discovered by the human eye, noise appears to be less offending. This is why noise (variance) is more acceptable in an image than aliasing.

When a ray hits a surface, secondary rays are spawned according to the surface features of the object. Material properties are normally modelled with bi-directional reflection distribution functions (figure 2.9), which indicate in which direction new rays should be traced. The spread of secondary rays allows the visualisation of, for example, glossy reflection and translucency. Generally BRDFs are used to model surface features which would be displayed at a resolution higher than one per pixel. Larger surface features could be displayed by texture maps and bump maps. On the largest scale, the object geometry is explicitly modelled [190]. However, which representation to use depends on context. The frequency of

Figure 2.8: Poisson disk distribution and jittering in screen space.

Figure 2.9: Bi-directional reflection distribution function (brdf) used to simulate glossy reflection.

sampling across a surface determines the upper limit, and the area to be integrated determines the lower limit, of the size of features that the chosen representation should model. Because the result of evaluating a BRDF depends on the angle of incidence and the surface properties, the evaluation of such functions is complicated and expensive. In deterministic algorithms this problem is overcome by simplifying surface properties, or fixing the angle of incidence, thus resulting in a much simpler and easier to evaluate BRDF. On the other hand, stochastic methods are well suited to evaluating complicated functions. As is shown in [190], a BRDF represented by spherical harmonics can be efficiently evaluated using Monte Carlo simulations. No simplifying assumptions are necessary, which ensures an accurate physically based modelling of surface properties.

Diffuse interreflection can be rendered using a Monte Carlo approach called path tracing [92]. Here, at each surface intersection at most one new ray and a light ray are spawned instead of a whole branch of rays. The primary rays and the light rays are the most important rays in that they contribute most to the final pixel colour. By ignoring the less important rays, a large reduction of computation time can be achieved. The direction of the new ray is chosen probabilistically. A disadvantage of path tracing is that, as fewer secondary rays are traced, many more primary rays have to be traced to avoid aliasing.
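For the image plane, jittering amounts to perturbing a regular sub-pixel grid; a minimal sketch, with the sample count per pixel chosen arbitrarily:

    import random

    def jittered_pixel_samples(px, py, k=4):
        """k*k jittered sample positions (in pixel units) for pixel (px, py)."""
        step = 1.0 / k
        samples = []
        for i in range(k):
            for j in range(k):
                sx = px + (i + random.random()) * step   # random offset inside the cell
                sy = py + (j + random.random()) * step
                samples.append((sx, sy))
        return samples

    print(jittered_pixel_samples(10, 20))                # 16 primary-ray positions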

2.3.3 Stochastic radiosity

As for ray tracing, stochastic methods can also be applied to radiosity algorithms. Stochastic methods have been developed to solve the radiosity matrix [124, 123] and have been applied to hierarchical radiosity [173]. They have also been used to compute form factors. As in deterministic undirected form factor computations, a hemisphere is placed over a patch. From the centre of the hemisphere rays are shot into the environment, each contributing a delta form factor. If the sample distribution is chosen with a density proportional to a cosine, importance sampling is performed over the hemisphere [38]. Malley's method provides a way to arrive at such a distribution. In this method a set of jittered sample points is chosen on


the base of the hemisphere. A ray direction is determined by projecting these sample points straight up onto the hemisphere. The ray is shot from the centre of the hemisphere through the projected point. Because the density of the rays is proportional to the differential form factor, the result of every ray carries equal weight [38]. When ray directions are determined stochastically, small objects may be missed, so that they may not receive any radiance during the entire radiosity pre-processing. These objects then may appear to be black when subsequently rendered. A solution for this problem is grouping small patches and treating them as a single larger patch [97]. This method is discussed in paragraph 5.5, where it is presented as an efficiency improvement scheme. In a stochastic radiosity algorithm it has the additional advantage that groups of patches are large enough not to be missed by randomly shot rays.
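A sketch of Malley's method as described above: jittered points on the unit disk at the base of the hemisphere are projected straight up, giving cosine-distributed ray directions in the local frame where the patch normal is the z axis.

    import math, random

    def malley_directions(k=8):
        dirs = []
        for i in range(k):
            for j in range(k):
                u1 = (i + random.random()) / k          # jittered point in the unit square
                u2 = (j + random.random()) / k
                r = math.sqrt(u1)                       # map to a uniform point on the disk
                phi = 2.0 * math.pi * u2
                x, y = r * math.cos(phi), r * math.sin(phi)
                z = math.sqrt(max(0.0, 1.0 - x * x - y * y))   # project up onto the hemisphere
                dirs.append((x, y, z))
        return dirs

    dirs = malley_directions()
    print(len(dirs), dirs[0])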

2.3.4 Particle tracing

In previous sections stochastic methods were applied to ray tracing and to form factor calculations in radiosity. Monte Carlo simulations can also be applied to solve the rendering equation directly. This is then called particle tracing [130, 41]. In this model light is viewed as particles being sent out from light emitting surfaces. These particles are traced from the light source and followed through the scene, bouncing off surfaces, until they are absorbed by a surface. The direction in which a particle leaves the light emitter, the wavelength of the particle and its position on the emitter are determined stochastically according to the point spread function describing the behaviour of the light emitter. A powerful light source is assigned a higher probability than weaker light sources; thus, more particles are assigned to powerful light sources. The same link exists between the wavelength of the particles, the direction in which the particle travels and the position on the light source, and their associated point spread functions. More important wavelengths, for example, are chosen more often because of their higher probability of occurring.

After a particle is emitted from a light source it travels in a straight path until it hits a surface. If the particle encounters participating media on its way, the direction of the particle may be altered before it hits a surface. This process is called scattering and occurs when light is reflected, refracted or diffracted due to dust particles floating around. If light is scattered uniformly in all directions, it is called isotropic scattering. If light is scattered according to a distribution of directions, it is called anisotropic scattering. In section 2.3.6 particle tracing in participating volumes is discussed more elaborately.

When particles hit a surface, they may be reflected or refracted. First, according to a distribution function, it is determined whether the particle is absorbed, reflected or refracted. If the particle is reflected, a new direction of the particle is determined according to a point spread function which describes the properties of the surface. If it is decided that the particle is refracted, the angle of refraction is computed using a similar distribution function. The number of particles a patch receives determines its brightness, i.e. its radiance.

In particle tracing, a very large number of particles have to be traced before a reasonable approximation to the actual solution of the rendering equation is reached. This follows from the law of large numbers, which states that the larger the number of samples (traced particles), the better the agreement of the estimator with the actual value. In [130] a particle tracer is compared with a standard radiosity algorithm. The results obtained from the Monte Carlo simulation and the radiosity algorithm turn out to be comparable with respect to execution times and image quality. It is expected that particle tracing compares more favourably with the radiosity method when larger environments are rendered [130]. Another advantage of particle tracing is that this method is inherently capable of handling more complex effects such as participating media.
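The life of a single particle can be sketched as below. The scene interface (emit_particle, nearest_hit, reflectance, sample_direction, deposit) is an assumption standing in for the real emission, geometry and material code; absorption is decided by comparing a random number with the surface reflectance.

    import random

    def trace_particle(emit_particle, nearest_hit, reflectance, sample_direction,
                       deposit, max_bounces=1000):
        origin, direction = emit_particle()        # stochastic emission from a light
        for _ in range(max_bounces):
            patch, point = nearest_hit(origin, direction)
            if patch is None:
                return                             # particle left the scene
            deposit(patch)                         # this patch receives the particle
            if random.random() >= reflectance(patch):
                return                             # absorbed at this surface
            origin = point
            direction = sample_direction(patch)    # new direction from the surface's distribution

Running many such particles and counting the deposits per patch gives an estimate of each patch's brightness.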

2.3.5 Hybrid methods

Deterministic ray tracing and radiosity algorithms have been combined into hybrid rendering methods, possibly including importance driven techniques. The same is possible for stochastic rendering. An example of such a stochastic hybrid rendering technique is bi-directional path tracing [105]. Here particles are traced from the light sources into the scene. The points at which the particles interact with the environment are stored for later use. From


Figure 2.10: Bi-directional path tracing

the eye point, rays are shot into the environment. The intersection points of the rays with patches are stored as well. The eye point and the light sources are thus treated in similar ways. The points that were found when tracing light paths are then connected to the points stored during eye path tracing. This is equivalent to casting shadow rays to the most important light sources and to important patches that are lit by those light sources. This way the contributions of the light paths are combined with the results found on the eye paths, thus resulting in the final pixel colours, see figure 2.10. Bi-directional path tracing can be extended to include participating media [104].

2.3.6 Participating media

In environments containing participating volumes, particles can originate from light sources, i.e. emitting surfaces, and from emitting volumes. Some particles travelling through a participating medium can be absorbed or scattered, while others continue their paths unhindered. The optical properties of the medium determine which percentage of particles are absorbed or scattered. Scattered particles continue their paths until they are absorbed by a surface or a volume. Participating media are characterised by their shape and by the way they interact with light. Usually a participating volume is in the shape of a cube or a sphere. The boundaries of the participating medium can be found by using a standard ray casting algorithm [150]. The following interactions with light are possible [38] (see also figure 2.11):

Absorption Light travelling through the medium is partially absorbed.

Outscattering The light which is traced through the participating medium is partially scattered in different directions.

Emission The medium itself emits light. This occurs for example when flames are modelled.

Inscattering The radiance of light may be increased when it travels through a medium due to light coming from other directions being scattered into the path of the light which is traced.

Particle tracers can be extended to deal with participating media by allowing volumes as well as surfaces to act as light sources. In addition, when a particle is travelling through a participating volume, a position on its path is computed (probabilistically) at which the particle will be scattered. If this position is before the nearest surface, a



Figure 2.11: The energy a ray carries is affected by the participating medium the ray is travelling through (the thickness of each ray indicates its brightness).

new direction is determined according to a probability function. In the other case, the particle will not interact with the volume but with the nearest surface instead. The interaction of a volume with a particle is first expressed as a function of the distance S the particle travels through the volume. The radiance of light travelling a distance S through a medium is reduced by the factor e^{-\int_0^S K(u) \, du}, which is known as Bouguer's Law. For a medium with high opacity, the extinction factor K(u) will be high. In a homogeneous medium, where the interaction of light with the medium is spatially independent, i.e. the probability of interaction is the same everywhere in the medium, the opacity equals KS. For non-homogeneous media the probability of a particle interacting with the medium before it has travelled a distance S is given by:

\[ P(S) = 1 - e^{-\int_0^S K(u) \, du} \]

Evaluating this equation may be difficult. For this reason the path traversed by a particle is broken up into a number of areas ΔS, where every area is assumed homogeneous. Once the point of interaction has been determined, the direction in which the particle will continue must be computed. If the interaction takes place in an isotropic medium, the new direction can be determined by picking a uniformly distributed random variable. Isotropic scattering is an idealisation of the scattering process [130]. For anisotropic media, other distribution functions may be used.

Though non-stochastic methods for computing the global illumination problem exist that account for participating media, see for example the zonal method in [150], particle tracing is an extremely elegant method to solve the rendering equation. However, particle tracing has a number of disadvantages when used in a complex environment. The first is that the number of particles that have to be traced may be prohibitively large. This is especially true for participating media with high opacity [130].
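For a homogeneous medium the interaction distance can be chosen by inverting P(S) = 1 - e^{-KS}, giving S = -ln(1 - u)/K for a uniform random number u; a minimal sketch with assumed values:

    import math, random

    def sample_interaction_distance(K):
        """Distance to the next interaction event for extinction factor K."""
        return -math.log(1.0 - random.random()) / K

    K = 0.5                      # assumed extinction factor of the medium
    d_surface = 3.0              # assumed distance to the nearest surface
    s = sample_interaction_distance(K)
    if s < d_surface:
        print("particle scatters inside the medium after", round(s, 2), "units")
    else:
        print("particle reaches the surface without interacting")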


3 Task Scheduling

The efficient solution of a problem on a parallel system requires the computational performance of the processing elements to be fully utilised. Any processing element that is not busy performing useful computations is degrading overall system performance. Task scheduling strategies may be used to minimise these potential performance limitations.

3.1 Problem Decomposition

A problem may be solved on a parallel system by either exploiting the parallelism inherent in the algorithm, known as algorithmic decomposition, or by making use of the fact that the algorithm can be applied to different parts of the problem domain in parallel, which is termed domain decomposition. These two decomposition methods can be further categorised as shown in figure 3.1.


Figure 3.1: Methods of decomposing a problem to exploit parallelism

Over the years, an abundance of algorithms have been developed to solve a multitude of problems on sequential machines. A great deal of time and effort has been invested in the production of these sequential algorithms. Users are thus loath to undertake the development of novel parallel algorithms, and yet still demand the performance that multiprocessor machines have to offer. Algorithmic decomposition approaches to this dilemma have led to the development of compilers, such as those for High Performance Fortran, which attempt to parallelise these existing algorithms automatically. Not only do these compilers have to identify the parallelism hidden in the algorithm, but they also need to decide upon an effective strategy to place the identified segments of code within the multiprocessor system so that they can interact efficiently. This has proved to be an extremely hard goal to accomplish. The domain decomposition approach, on the other hand, requires little or no modification to the existing sequential algorithm. There is thus no need for sophisticated compiler technology to analyse the algorithm. However, there will be a need for a parallel framework in the form of system software to support the division of the problem domain amongst the parallel processors.


3.1.1 Algorithmic decomposition

In algorithmic decomposition the algorithm itself is analysed to identify which of its features are capable of being executed in parallel. The finest granularity of parallelism is achievable at the operation level. Known as dataflow, at this level of parallelism the data “flows” between individual operations which are executed in parallel [1]. An advantage of this type of decomposition is that little data space is required per processor [76]; however, the communication overheads may be very large due to the very poor computation to communication ratio. Fork & join parallelism, on the other hand, allocates portions of the algorithm to separate processors as the computation proceeds. These portions are typically several statements or complete procedures. The difference between the two algorithmic forms of decomposition is shown for a simple case in figure 3.2.

Sample problem:
begin
  a := 5 + 6
  b := (6+4)*2
  c := a + b
end
[Figure: (a) the dataflow graph for this fragment; (b) the fork & join decomposition, in which a := 5 + 6 and b := (6+4)*2 are forked to separate processors and joined before c := a + b is computed.]

Figure 3.2: Algorithmic decomposition: (a) dataflow (b) fork & join

3.1.2 Domain decomposition

Instead of determining the parallelism inherent in the algorithm, domain decomposition examines the problem domain to ascertain the parallelism that may be exploited by solving the algorithm on distinct data items in parallel. Each parallel processor in this approach will, therefore, have a complete copy of the algorithm and it is the problem domain that is divided amongst the processors. Domain decomposition can be accomplished using either a data driven or demand driven approach. As we shall see, given this framework, the domain decomposition approach is applicable to a wide range of problems. Adoption of this approach to solve a particular problem in parallel consists of two steps:

1. Choosing the appropriate sequential algorithm. Many algorithms have been honed over a number of years to a high level of perfection for implementation on sequential machines. The data dependencies that these highly sequential algorithms exhibit may substantially inhibit their use in a parallel system. In this case alternative sequential algorithms which are more suitable to the domain decomposition approach will need to be considered.


2. Analysis of the problem in order to extract the criteria necessary to determine the optimum system software. The system software provides the framework in which the sequential algorithm can execute. This system software takes care of ensuring each processor is kept busy, the data is correctly managed, and any communication within the parallel system is performed rapidly. To provide maximum efficiency, the system software needs to be tailored to the requirements of the problem. There is thus no general purpose parallel solution using the domain decomposition approach, but, as we shall see, a straightforward analysis of any problem’s parallel requirements will determine the correct construction of the system software and lead to an efficient parallel implementation.

Before commencing the detailed description of how we intend to tackle the solution of realistic rendering problems in parallel, it might be useful to clarify some of the terminology we shall be using.

3.1.3 Abstract definition of a task

The domain decomposition model solves a single problem in parallel by having multiple processors apply the same sequential algorithm to different data items from the problem domain in parallel. The lowest unit of computation within the parallel system is thus the application of the algorithm to one data item within the problem domain. The data required to solve this unit of computation consists of two parts:

1. the principal data items (or PDIs) on which the algorithm is to be applied; and
2. additional data items (or ADIs) that may be needed to complete this computation on the PDIs.

For example, in ray tracing, we are computing the value at each pixel of our image plane. Thus these pixels would form our PDIs, while all the data describing the scene would constitute the ADIs. The problem domain is thus the pixels plus the scene description. The application of the algorithm to a specified principal data item may be regarded as performing a single task. The task forms the elemental unit of computation within the parallel implementation. This is shown diagrammatically in figure 3.3.

3.1.4 System architecture

This tutorial concentrates on implementing realistic rendering techniques on distributed memory systems (either a dedicated parallel machine or a distributed system of workstations). These processors may be connected together in some manner to form a configuration. A process is a segment of code that runs concurrently with other processes on a single processor. Several processes will be needed at each processor to implement the desired application and provide the necessary system software support. A processing element consists of a single processor together with these application and system processes and is thus the building block of the multiprocessor system. (We shall sometimes use the abbreviation PE for processing element in the figures and code segments.) When discussing configurations of processing elements, we shall use the term links to mean the communication paths between processes.

Structure of the system controller

To provide a useful parallel processing platform, a multiprocessor system must have access to input/output facilities. Most systems achieve this by designating at least one processing element as the system controller (SC) with the responsibility of providing this input/output interface, as shown in figure 3.4. If the need for input/output facilities becomes a serious bottleneck then more than one system controller may be required. Other processing elements perform the actual computation associated with the problem. In addition to providing the input/output facilities, the system controller may also be used to collect and collate results computed by the processing elements. In this case the system controller is in the useful position of being able to determine when the computation is complete and gracefully terminate the concurrent processes at every processing element.


Figure 3.3: A task: the processing of a principal data item


Figure 3.4: The system controller as part of a parallel system



3.2 Computational Models

The computational model chosen to solve a particular problem determines the manner in which work is distributed across the processors of the multiprocessor system. In our quest for an efficient parallel implementation we must maximise the proportion of time the processors spend performing necessary computation. Any imbalance may result in processors standing idle while others struggle to complete their allocated work, thus limiting potential performance. Load balancing techniques aim to provide an even division of computational effort to all processors.

The solution of a problem using the domain decomposition model involves each processing element applying the specified algorithm to a set of principal data items. The computational model ensures that every principal data item is acted upon and determines how the tasks are allocated amongst the processing elements. A choice of computational model exists for each problem. To achieve maximum system performance, the model chosen must see that the total work load is distributed evenly amongst the processing elements. This balances the overheads associated with communicating principal data items to processing elements with the need to avoid processing element idle time.

A simplified ray tracing example illustrates the differences between the computational models. A sequential solution to this problem may be achieved by dividing the image plane into twenty-four distinct regions, with each region constituting a single principal data item, as shown in figure 3.5, and then applying the ray tracing algorithm at each of these regions in turn. There are thus twenty-four tasks to be performed for this problem, where each task is to compute the pixel value at one area of the image plane. To understand the computational models it is not necessary to know the details of the algorithm; suffice it to say that each principal data item represents an area of the image plane on which the algorithm can be applied to determine the value for that position. We will assume that no additional data items are required to complete any task.

Figure 3.5: Principal data items for calculating the pixels in the image plane

3.2.1 Data driven model

The data driven model allocates all the principal data items to specific processing elements before computation commences. Each processing element thus knows a priori the principal data items to which it is required to apply the algorithm. Providing there is sufficient memory to hold the allocated set at each processing element, then, apart from the initial distribution, there is no further communication of principal data items. If there is insufficient local memory, then the extra items must be fetched as soon as memory space allows. This fetching of remote data items will be discussed further when data management is examined in Chapter 4.

Balanced data driven

In balanced data driven systems (also known as geometric decompositions), an equal number of principal data items is allocated to each processing element. This portion is determined simply by dividing the total number of principal data items by the number of processing elements:

portion at each PE = (number of principal data items) / (number of PEs)

If the number of principal data items is not an exact multiple of the number of processing elements, then

(number of principal data items) MOD (number of PEs)

processing elements will each have one extra principal data item, and thus perform one extra task. The required start task and the number of tasks are communicated by the system controller to each processing element, and these can then apply the required algorithm to their allotted principal data items. This is similar to the way in which problems are solved on arrays of SIMD processors.

In this example, consider the simple ray tracing calculation for an empty scene. The principal data items (the pixels) may be allocated equally to three processing elements, labelled PE1, PE2 and PE3, as shown in figure 3.6. In this case, each processing element is allotted eight principal data items.
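As a concrete illustration, a system controller might compute each processing element’s contiguous portion as in the following C sketch (the function and parameter names are illustrative, not part of the notes):

/* Balanced (geometric) allocation: PE 'pe' receives a contiguous range of
   principal data items.  The first (num_items MOD num_pes) PEs receive one
   extra item each. */
void balanced_portion(int num_items, int num_pes, int pe,
                      int *start, int *count)
{
    int base  = num_items / num_pes;    /* items every PE receives        */
    int extra = num_items % num_pes;    /* PEs that receive one item more */

    *count = base + (pe < extra ? 1 : 0);
    *start = pe * base + (pe < extra ? pe : extra);
}

With 24 principal data items and 3 processing elements this yields portions of 8 items each, as in figure 3.6.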


Figure 3.6: Equal allocation of data items to processing elements

As no further principal data item allocation takes place after the initial distribution, a balanced work load is only achieved for the balanced data driven computational model if the computational effort associated with each portion of principal data items is identical. If not, some processing elements will have finished their portions while others still have work to do. With the balanced data driven model the division of principal data items amongst processing elements is geometric in nature, that is, each processing element may simply be allocated an equal number of principal data items irrespective of their position within the problem domain. Thus, to ensure a balanced work load, this model should only be used if the computational effort associated with each principal data item is the same, and preferably


where the number of principal data items is an exact multiple of the number of processing elements. This implies a priori knowledge, but given this, the balanced data driven approach is the simplest of the computational models to implement.

Using figure 3.6, if the computation of each pixel takes 1 time unit to complete, then the sequential solution of this problem would take 24 time units. The parallel implementation of this problem using the three processing elements, each allocated eight tasks, should take approximately 8 time units, a third of the time required by the sequential implementation. Note, however, that the parallel solution will not be exactly one third of the sequential time as this would ignore the time required to communicate the portions from the system controller to the processing elements. This also ignores the time required to receive the results back from the processing elements and for the system controller to collate the solution. A balanced data driven version of this problem on the three processing elements would more accurately take:

Solution time = initial distribution + ⌈24/3⌉ + result collation

Assuming low communication times, this model gives the solution in approximately one third of the time of the sequential solution, close to the maximum possible linear speed-up. Solution of the same problem on five processing elements would give:

Solution time = initial distribution + ⌈24/5⌉ + result collation

This will be solved in even longer than the expected 4.8 time units as, in this case, one processing element is allocated 4 principal data items while the other four have to be apportioned 5. As computation draws to a close, one processing element will be idle while the four others complete their extra work. The solution time will thus be slightly more than 5 time units.

Unbalanced data driven

Differences in the computational effort associated with the principal data items will increase the probability of substantial processing element idle time if the simplistic balanced data driven approach is adopted. If the individual computation efforts differ, and are known a priori, then this can be exploited to achieve optimum load balancing. The unbalanced data driven computational model allocates principal data items to processing elements based on their computational requirements. Rather than simply apportioning an equal number of tasks to each processing element, the principal data items are allocated so as to ensure that each processing element will complete its portion at approximately the same time.

For example, the complexity introduced into the ray tracing calculations by placing objects into the scene, as shown in figure 3.7, will increase the computational effort required to solve the portions allocated to PE1 and PE2 in the balanced data driven model. This will result in these two processing elements still being busy with their computations long after the other processing element, PE3, has completed its less computationally complex portion. Should a priori knowledge be available regarding the computational effort associated with each principal data item then the items may be allocated unequally amongst the processing elements, as shown in figure 3.8. The computational effort now required to process each of these unequal portions will be approximately the same, minimising any processing element idle time.

The sequential time required to solve the ray tracing with objects in the scene is now 42 time units. To balance the work load amongst the three processing elements, each processing element should compute for 14 time units. Allocation of the portions to each processing element in the unbalanced data driven model involves a preprocessing step to determine precisely the best way to subdivide the principal data items. The optimum compute time for each processing element can be obtained by simply dividing the total computation time by the number of processing elements. If possible, no processing element should be allocated principal data items whose combined computation time exceeds this optimum amount. Sorting the principal data items in descending order of computation time can facilitate the subdivision.
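One simple way of realising this preprocessing step, assuming the per-task computation times really are known, is the greedy heuristic sketched below: sort the tasks by descending cost and always give the next task to the least loaded processing element. This is only one possible strategy; the type and function names are illustrative.

#include <stdlib.h>

typedef struct { int id; double cost; } Task;   /* known per-task time */

static int by_cost_desc(const void *a, const void *b)
{
    double d = ((const Task *)b)->cost - ((const Task *)a)->cost;
    return (d > 0) - (d < 0);
}

/* Assign every task to the currently least loaded PE.  owner[task.id]
   receives the chosen PE; load[] must be zero-initialised by the caller. */
void unbalanced_allocate(Task *tasks, int num_tasks,
                         double *load, int num_pes, int *owner)
{
    qsort(tasks, num_tasks, sizeof(Task), by_cost_desc);
    for (int i = 0; i < num_tasks; i++) {
        int best = 0;
        for (int pe = 1; pe < num_pes; pe++)
            if (load[pe] < load[best])
                best = pe;
        owner[tasks[i].id] = best;
        load[best] += tasks[i].cost;
    }
}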


Figure 3.7: Unequal computational effort due to presence of objects in the scene


Figure 3.8: Unequal allocation of data items to processing elements to assist with load balancing


The total solution time for a problem using the unbalanced data driven model is thus:

Solution time = preprocessing + distribution + longest portion time + result collation

So, comparing with the naive balanced distribution from section 3.2.1:

Balanced solution time = distribution + 21 + result collation

Unbalanced solution time = preprocessing + distribution + 14 + result collation

The preprocessing stage is a simple sort requiring far less time than the ray tracing calculations. Thus, in this example, the unbalanced data driven model would be significantly faster than the balanced model due to the large variations in task computational complexity. The necessity for the preprocessing stage means that this model will take more time than the balanced data driven approach should the tasks have the same computational requirements. However, if there are variations in computational complexity and they are known, then the unbalanced data driven model is the most efficient way of implementing the problem in parallel.

3.2.2 Demand driven model

The data driven computational models are dependent on the computational requirements of the principal data items being known, or at least being predictable, before actual computation starts. Only with this knowledge can these data items be allocated in the correct manner to ensure an even load balance. Should the computational effort of the principal data items be unknown or unpredictable, then serious load balancing problems can occur if the data driven models are used. In this situation the demand driven computational model should be adopted to allocate work to processing elements evenly and thus optimise system performance.

In the demand driven computational model, work is allocated to processing elements dynamically as they become idle, with processing elements no longer bound to any particular portion of the principal data items. Having produced the result from one principal data item, the processing elements demand the next principal data item from some work supplier process. This is shown diagrammatically in figure 3.9 for the simple ray tracing calculation.

Unlike the data driven models, there is no initial communication of work to the processing elements; however, there is now the need to send requests for individual principal data items to the supplier and for the supplier to communicate with the processing elements in order to satisfy these requests. To avoid unnecessary communication it may be possible to combine the return of the results from one computation with the request for the next principal data item. The optimum time for solving a problem using this simple demand driven model is thus:

Solution time = 2 × total communication time + (total computation time for all PDIs) / (number of PEs)

This optimum computation time, (total computation time for all PDIs) / (number of PEs), will only be possible if the work can be allocated so that all processing elements complete the last of their tasks at exactly the same time. If this is not so then


Figure 3.9: A demand driven model for a simple ray tracing calculation

some processing elements will still be busy with their final task while the others have completed. It may also be possible to reduce the communication overheads of the demand driven model by overlapping the communication with the computation in some manner. This possibility will be discussed later in section 3.3.

On receipt of a request, if there is still work to be done, the work supplier responds with the next available task for processing. If there are no more tasks which need to be computed then the work supplier may safely ignore the request. The problem will be solved when all principal data items have been requested and all the results of the computations on these items have been returned and collated. The dynamic allocation of work by the demand driven model will ensure that while some processing elements are busy with more computationally demanding principal data items, other processing elements are available to compute the less complex parts of the problem. Using the computational times for the presence of objects in the scene as shown in figure 3.8, figure 3.10 shows how the principal data items may be allocated by the task supplier to the processing elements using a simple serial allocation scheme. Note that the processing elements do not complete the same number of tasks. So, for example, while processing elements 2 and 3 are busy completing the computationally complex work associated with principal data items 15 and 16, processing element 1 can compute the less computationally taxing tasks of principal data items 17 and 18.

The demand driven computational model facilitates dynamic load balancing when there is no prior knowledge as to the complexity of the different parts of the problem domain. Optimum load balancing is still dependent on all the processing elements completing the last of the work at the same time. An unbalanced solution may still result if a processing element is allocated a complex part of the domain towards the end of the solution. This processing element may then still be busy well after the other processing elements have completed computation on the remainder of the principal data items and are now idle as there is no further work to do. To reduce the likelihood of this situation it is important that the computationally complex portions of the domain, the so-called hot spots, are allocated to processing elements early on in the solution process. Although there is no a priori knowledge as to the exact computational effort associated with any principal data item (if there were, an unbalanced data driven approach would have been adopted), nevertheless, any insight as to possible hot spot areas should be exploited.



Figure 3.10: Allocation of principal data items using a demand driven model


The task supplier would thus assign principal data items from these areas first. In the ray tracing example, while the exact computational requirement associated with the principal data items in the proximity of the objects in the scene may be unknown, it is highly likely that the solution of the principal data items in that area will be more complex than those elsewhere. In this problem, these principal data items should be allocated first. If no insight is possible then a simple serial allocation, as shown in figure 3.10, or spiral allocation, as shown in figure 3.11, or even a random allocation of principal data items will have to suffice. While a random allocation offers perhaps a higher probability of avoiding late allocation of principal data items from hot spots, additional effort is required when choosing the next principal data item to allocate to ensure that no principal data item is allocated more than once.
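If a random allocation order is chosen, the requirement that no principal data item is handed out more than once can be met by shuffling the task indices once, up front, and then allocating them serially from the shuffled list. A minimal C sketch (the use of rand() is illustrative only):

#include <stdlib.h>

/* Fill order[] with a random permutation of 0..num_tasks-1 (Fisher-Yates),
   so each principal data item appears exactly once in the allocation order. */
void shuffle_task_order(int *order, int num_tasks)
{
    for (int i = 0; i < num_tasks; i++)
        order[i] = i;
    for (int i = num_tasks - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}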


Figure 3.11: Allocation of principal data items in a spiral manner

As with all aspects of parallel processing, extra levels of sophistication can be added in order to exploit any information that becomes available as the parallel solution proceeds. Identifying possible hot spots in the problem domain may be possible from the computation time associated with each principal data item as these become known. If this time is returned along with the result for that principal data item, the work supplier can build a dynamic profile of the computational requirements associated with areas of the domain. This information can be used to adapt the allocation scheme to send principal data items from the possible hot spot regions. There is, of course, a trade off here between the possible benefits to load balancing in the early allocation of principal data items from hot spots, and the overhead that is introduced by the need to:

- time each computation at the processing elements;
- return this time to the work supplier;
- develop the time profile at the work supplier; and,
- adapt the allocation strategy to take into account this profile.


The benefits gained by such an adaptive scheme are difficult to predict as they are dependent on the problem being considered and the efficiency of the scheme implementation. The advice in these matters is always: “implement a simple scheme initially and then add extra sophistication should resultant low system performance justify it.”

3.2.3 Hybrid computational model

For most problems, the correct choice of computational model will either be one of the data driven strategies or the demand driven approach. However, for a number of problems, a hybrid computational model, exhibiting properties of both data and demand driven models, can be adopted to achieve improved efficiency. The class of problem that can benefit from the hybrid model is one in which an initial set of principal data items of known computational complexity may spawn an unknown quantity of further work. In this case, the total number of principal data items required to solve the problem is unknown at the start of the computation; however, there are at least a known number of principal data items that must be processed first. If the computational complexity associated with these initial principal data items is unknown then a demand driven model will suffice for the whole problem, but if the computational complexity is known then one of the data driven models, with their lower communication overheads, should at least be used for these initial principal data items. Use of the hybrid model thus requires the computational model to be switched from data driven to demand driven mode as required.
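From a processing element’s point of view the hybrid model amounts to a data driven phase followed by a demand driven phase. The C sketch below illustrates this switch; the Task type and the helper functions are placeholders for whatever the real system provides.

#include <stdio.h>

typedef struct { int id; } Task;

static void process_task(const Task *t)      /* placeholder for the real work */
{
    printf("processing task %d\n", t->id);
}

/* Placeholder: request a dynamically spawned task from the system
   controller; returns 0 when no further work exists. */
static int request_task(Task *t)
{
    (void)t;
    return 0;
}

/* Data driven phase for the initial, pre-allocated tasks of known
   complexity, then a demand driven phase for work spawned during the run. */
void hybrid_pe_loop(const Task *initial, int num_initial)
{
    for (int i = 0; i < num_initial; i++)
        process_task(&initial[i]);

    Task t;
    while (request_task(&t))
        process_task(&t);
}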

3.3 Task Management

Task management encompasses the following functions:

- the definition of a task;
- controlling the allocation of tasks;
- distribution of the tasks to the processing elements; and,
- collation of the results, especially in the case of a problem with multiple stages.

3.3.1 Task definition and granularity

An atomic element may be thought of as a problem’s lowest computational element within the sequential algorithm adopted to solve the problem. As introduced in section 3.1.2, in the domain decomposition model a single task is the application of this sequential algorithm to a principal data item to produce a result for the sub-parts of the problem domain. The task is thus the smallest element of computation for the problem within the parallel system. The task granularity (or grain size) of a problem is the number of atomic elements which are included in one task. Generally, the task granularity remains constant for all tasks, but in some cases it may be desirable to alter this granularity dynamically as the computation proceeds. A task which includes only one atomic element is said to have the finest granularity, while a task which contains many is coarser grained, or has a coarser granularity. The actual definition of what constitutes a principal data item is determined by the granularity of the tasks.

A parallel system solves a problem by its constituent processing elements executing tasks in parallel. A task packet is used to inform a processing element which task, or tasks, to perform. This task packet may simply indicate which tasks require processing by that processing element, thus forming the lowest level of distributed work. The packet may include additional information, such as additional data items, which the tasks require in order to be completed.

To illustrate the differences in this terminology, consider again the simple ray tracing problem. The atomic element of a sequential solution of this problem could be to perform a single ray-object intersection test. The principal data item is the pixel being computed and the additional data item required will be the object being considered.


A sequential solution of this problem would be for a single processing element to consider each ray-object intersection in turn. The help of several processing elements could substantially improve the time taken to perform the ray tracing. The finest task granularity for the parallel implementation of this problem is for each task to complete one atomic element, that is, perform one ray-object intersection. For practical considerations, it is perhaps more appropriate that each task should instead be to trace the complete path of a single ray. The granularity of each task is now the number of ray-object intersections required to trace this single ray and each pixel is a principal data item. A sensible task packet to distribute the work to the processing elements would include details about one or more pixels together with the necessary scene data (if possible, see Chapter 4). To summarise our choices for this problem:

atomic element: to perform one ray-object intersection;
task: to trace the complete path of one ray (may consist of a number of atomic elements);
PDI: the pixel location for which we are computing the colour;
ADI: the scene data; and,
task packet: one or more rays to be computed.

Choosing the task granularity for the parallel implementation of a problem is not straightforward. Although it may be fairly easy to identify the atomic element for the sequential version of the problem, such a fine grain may not be appropriate when using many processing elements. Although the atomic element for ray tracing was specified as computing a single ray-object intersection in the above example, the task granularity for the parallel solution was chosen as computing the complete colour contribution at a particular pixel. If one atomic element had been used as the task granularity then additional problems would have been introduced for the parallel solution, namely, the need for processors to exchange partial results. This difficulty would have been exacerbated if, instead, the atomic element had been chosen as tracing a ray into a voxel and considering whether it does in fact intersect with an object there. Indeed, apart from the higher communication overhead this would have introduced, the issue of dependencies would also have to be checked to ensure, for example, that a ray was not checked against an object more than once.

As well as introducing additional communication and dependency overheads, the incorrect choice of granularity may also increase computational complexity variations and hinder efficient load balancing. The choice of granularity is seldom easy; however, a number of parameters of the parallel system can provide an indication as to the desirable granularity. The computation to communication ratio of the architecture will suggest whether additional communication is acceptable to avoid dependency or load balancing problems. As a general rule, where possible, data dependencies should be avoided in the choice of granularity as these imply unnecessary synchronisation points within the parallel solution which can have a significant effect on overall system performance.
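These choices might be expressed as data structures along the following lines. This is only an illustrative sketch; the names, sizes and fields are assumptions, not definitions taken from the notes.

#define MAX_RAYS_PER_PACKET 16

typedef struct {           /* principal data item: one pixel location */
    int x, y;
} PixelPDI;

typedef struct {           /* additional data items: the scene description */
    int   num_objects;
    void *objects;         /* geometry, materials, acceleration structure... */
} SceneADI;

typedef struct {           /* task packet: one or more rays to be computed */
    int      num_pdis;
    PixelPDI pdis[MAX_RAYS_PER_PACKET];
    /* The ADIs (or just their identities) may be included here, or omitted
       entirely if the scene is replicated at every processing element. */
} TaskPacket;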

3.3.2 Task distribution and control

The task management strategy controls the distribution of packets throughout the system. Upon receipt, a processing element performs the tasks specified by a packet. The composition of the task packet is thus an important issue that must be decided before distribution of the tasks can begin. To complete a task a processing element needs a copy of the algorithm, the principal data item(s), and any additional data items that the algorithm may require for that principal data item. The domain decomposition paradigm provides each processing element with a copy of the algorithm, and so the responsibility of the task packet is to provide the other information.

The principal data items form part of the problem domain. If there is sufficient memory, it may be possible to store the entire problem domain as well as the algorithm at each processing element. In this case, the inclusion of the principal data item as part of the task packet is unnecessary. A better method would be simply to include the identification of the principal data item within the task packet. Typically, the identification of a principal data item is considerably smaller, in terms of actual storage capacity, than the item itself. The communication overheads associated with sending this smaller packet will be significantly less than sending the principal data item with the packet.

On receipt of the packet the processing element could use the identification simply to fetch the principal data item from its local storage. The identification of the principal data item is, of course, also essential to enable the results of the entire parallel computation to be collated. If the additional data items required by the task are known then they, or if possible their identities, may also be included in the task packet. In this case the task packet would form an integral unit of computation which could be directly handled by a processing element. However, in reality, it may not be possible to store the whole problem domain at every processing element. Similarly, numerous additional data items may be required which would make their inclusion in the task packet impossible. Furthermore, for a large number of problems, the additional data items which are required for a particular principal data item may not be known in advance and will only become apparent as the computation proceeds. A task packet should contain as a minimum either the identity, or the identity and actual principal data items, of the task. The inability to include the other required information in the packet means that the parallel system will have to resort to some form of data management. This topic is described fully in Chapter 4.

3.3.3 Algorithmic dependencies

The algorithm of the problem may specify an order in which the work must be undertaken. This implies that certain tasks must be completed before others can commence. These dependencies must be preserved in the parallel implementation. In the worst case, algorithmic dependencies can prevent an efficient parallel implementation, as shown with the tower of toy blocks in figure 1.2. Amdahl’s law, described in section 1.4, shows the implications to the algorithmic decomposition model of parallel processing of the presence of even a small percentage of purely sequential code. In the domain decomposition approach, algorithmic dependencies may introduce two phenomena which will have to be tackled:

- synchronisation points which have the effect of dividing the parallel implementation into a number of distinct stages; and,
- data dependencies which will require careful data management to ensure a consistent view of the data to all processing elements.

Multi-stage algorithms

Many problems can be solved by a single stage of computation, utilising known principal data items to produce the desired results. However, the dependencies inherent in other algorithms may divide computation into a number of distinct stages. The partial results produced by one stage become the principal data items for the following stage of the algorithm, as shown in figure 3.12. For example, many scientific problems involve the construction of a set of simultaneous equations, a distinct stage, and the subsequent solution of these equations for the unknowns. The partial results, in this case elements of the simultaneous equations, become the principal data for the tasks of the next stage.

Even a single stage of a problem may contain a number of distinct substages which must first be completed before the next substage can proceed. An example of this is the use of an iterative solver, such as the Jacobi method [52, 84], to solve a set of simultaneous equations. An iterative method starts with an approximate solution and uses it in a recurrence formula to provide another approximate solution. By repeatedly applying this process a sequence of solutions is obtained which, under suitable conditions, converges towards the exact solution. Consider the problem of solving a set of six equations for six unknowns, Ax = b. The Jacobi method will solve this set of equations by calculating, at each iteration, a new approximation from the values of the previous iteration. So the values for the x_i at the nth iteration are calculated as:

x_1^n = \frac{b_1 - a_{12} x_2^{n-1} - \cdots - a_{16} x_6^{n-1}}{a_{11}}


Figure 3.12: The introduction of partial results due to algorithmic dependencies


x_2^n = \frac{b_2 - a_{21} x_1^{n-1} - \cdots - a_{26} x_6^{n-1}}{a_{22}}

\vdots

x_6^n = \frac{b_6 - a_{61} x_1^{n-1} - \cdots - a_{65} x_5^{n-1}}{a_{66}}


Figure 3.13: Solving an iterative matrix solution method on two processing elements

A parallel solution to this problem on two processing elements could allocate three rows to be solved to each processing element, as shown in figure 3.13. Now PE1 can solve the nth iteration values x_1^n, x_2^n and x_3^n in parallel with PE2 computing the values of x_4^n, x_5^n and x_6^n. However, neither processing element can proceed onto the (n + 1)st iteration until both have finished the nth iteration and exchanged their new approximations for the x_i^n. Each iteration is, therefore, a substage which must be completed before the next substage can commence. This point is illustrated by the following code segment from PE1:

PROCEDURE Jacobi() (* Executing on PE 1 *)
Begin
  Estimate x[1] ... x[6]
  n := 0 (* Iteration number *)
  WHILE solution not converged DO
  Begin
    n := n + 1
    Calculate new x[1], x[2] & x[3] using old x[1] ... x[6]
    PARALLEL
      SEND new x[1], x[2] & x[3] TO PE 2
      RECEIVE new x[4], x[5] & x[6] FROM PE 2
  End
End (* Jacobi *)
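The "Calculate new x[1], x[2] & x[3]" step above is simply the Jacobi update restricted to the locally held rows. A minimal C sketch of that step, assuming the complete vector of previous-iteration values is available locally (the exchange of new values is not shown):

#define N 6

/* One Jacobi iteration for rows lo..hi-1 of an N x N system, using only
   values from the previous iteration (x_old). */
void jacobi_local_rows(const double A[N][N], const double b[N],
                       const double x_old[N], double x_new[N],
                       int lo, int hi)
{
    for (int i = lo; i < hi; i++) {
        double sum = b[i];
        for (int j = 0; j < N; j++)
            if (j != i)
                sum -= A[i][j] * x_old[j];
        x_new[i] = sum / A[i][i];
    }
}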

Data dependencies

The concept of dependencies was introduced in section 1.1.1 when we were unable to construct a tower of blocks in parallel as this required a strictly sequential order of task completion. In the domain decomposition model, data dependencies exist when a task may not be performed on some principal data item until another task has been completed. There is thus an implicit ordering on the way in which the task packets may be allocated to the processing elements. This ordering will prevent certain tasks being allocated, even if there are processing elements idle, until the tasks on which they are dependent have completed. A linear dependency exists between each of the iterations of the Jacobi method discussed above. However, no dependency exists within an iteration for the calculation of each x_i^n, for all i, as all the values they require, the x_j^{n-1} for j ≠ i, will already have been exchanged and thus be available at every processing element.

The Gauss-Seidel iterative method has long been preferred in the sequential computing community as an alternative to Jacobi. The Gauss-Seidel method makes use of new approximations for the x_i as soon as they are available rather than waiting for the next iteration. Provided the methods converge, Gauss-Seidel will converge more rapidly than the Jacobi method. So, in the example of six unknowns above, in the nth iteration the value of x_1^n would still be calculated as:

x_1^n = \frac{b_1 - a_{12} x_2^{n-1} - \cdots - a_{16} x_6^{n-1}}{a_{11}},

but the x_2^n value would now be calculated by:

x_2^n = \frac{b_2 - a_{21} x_1^{n} - a_{23} x_3^{n-1} - \cdots - a_{26} x_6^{n-1}}{a_{22}}

Although well suited to sequential programming, the strong linear dependency that has been introduced makes the Gauss-Seidel method poorly suited for parallel implementation. Now, within each iteration, no value of x_i^n can be calculated until all the values for x_j^n, j < i, are available: a strict sequential ordering of the tasks. The less severe data dependencies within the Jacobi method thus make it a more suitable candidate for parallel processing than the Gauss-Seidel method, which is more efficient on a sequential machine.

It is possible to implement a hybrid of these two methods in parallel, the so-called “Block Gauss-Seidel - Global Jacobi” method [28]. A processing element which is computing several rows of the equations may use the Gauss-Seidel method for these rows, as they will be computed sequentially within the processing element. Any values for x_i^n not computed locally will assume the values of the previous iteration, as in the Jacobi method. All new approximations will be exchanged at each iteration. So, in the example, PE2 would calculate the values of x_4^n, x_5^n and x_6^n as follows:

x_4^n = \frac{b_4 - a_{41} x_1^{n-1} - a_{42} x_2^{n-1} - a_{43} x_3^{n-1} - a_{45} x_5^{n-1} - a_{46} x_6^{n-1}}{a_{44}}

x_5^n = \frac{b_5 - a_{51} x_1^{n-1} - a_{52} x_2^{n-1} - a_{53} x_3^{n-1} - a_{54} x_4^{n} - a_{56} x_6^{n-1}}{a_{55}}

x_6^n = \frac{b_6 - a_{61} x_1^{n-1} - a_{62} x_2^{n-1} - a_{63} x_3^{n-1} - a_{64} x_4^{n} - a_{65} x_5^{n}}{a_{66}}

3.4 Task Scheduling Strategies

3.4.1 Data driven task management strategies

In a data driven approach, the system controller determines the allocation of tasks prior to computation proceeding. With the unbalanced strategy, this may entail an initial sorting stage based on the known computational complexity, as described in section 3.2.1. A single task packet detailing the tasks to be performed is sent to each processing element. The application processes may return the results upon completion of their allocated portion, or return individual results as each task is performed, as shown in this code segment:


PROCESS Application Process()
Begin
  RECEIVE task packet FROM SC via R
  FOR i = start task id TO finish task id DO
  Begin
    result[i] := Perform Algorithm(task[i])
    SEND result[i] TO SC via R
  End
End (* Application Process *)

In a data driven model of computation a processing element may initially be supplied with as many of its allocated principal data items as its local memory will allow. Should there be insufficient storage capacity a simple data management strategy may be necessary to prefetch the missing principal data items as computation proceeds and local storage allows. This is discussed further when considering the management of data in Chapter 4.

3.4.2 Demand driven task management strategies

Task management within the demand driven computational model is explicit. The work supplier process, which forms part of the system controller, is responsible for placing the tasks into packets and sending these packets to requesting processing elements. To facilitate this process, the system controller maintains a pool of already constituted task packets. On receipt of a request, the work supplier simply dispatches the next available task packet from this task pool, as can be seen in figure 3.14.
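The dispatch loop of such a work supplier might look like the following C sketch. The packet layout and the communication primitives (recv_request, send_packet) are illustrative placeholders for whatever message-passing layer is used; result collation is omitted.

typedef struct { int first_pdi, num_pdis; } Packet;   /* simplified packet */

extern int  recv_request(void);                 /* id of the requesting PE */
extern void send_packet(int pe, const Packet *p);

void work_supplier(Packet *pool, int pool_size)
{
    int next = 0;                         /* next available packet in pool */
    while (next < pool_size) {
        int pe = recv_request();          /* a processing element is idle  */
        send_packet(pe, &pool[next++]);   /* dispatch the next task packet */
    }
    /* Requests arriving after the pool is exhausted are simply ignored;
       the problem is solved once all results have been collated. */
}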


Figure 3.14: Supplying task packets from a task pool at the system controller

The advantage of a task pool is that the packets can be inserted into it in advance, or concurrently as the solution proceeds, according to the allocation strategy adopted. This is especially useful for problems that create work dynamically, such as those using the hybrid approach as described in section 3.2.3. Another advantage of the task pool is that if a hot spot in the problem domain is identified, then the ordering within the task pool can be changed dynamically to reflect this and thus ensure that potentially computationally complex tasks are allocated first. More than one task pool may be used to reflect different levels of task priority. High priority tasks contained in the appropriate task pool will always be sent to a requesting processing element first. Only once this high priority task pool is (temporarily) empty will tasks from lower priority pools be sent. The multiple pool approach ensures that high priority tasks are not ignored as other tasks are allocated.

In the demand driven computational model, the processing elements demand the next task as soon as they have completed their current task. This demand is translated into sending a request to the work supplier, and the demand is only satisfied when the work supplier has delivered the next task. There is thus a definite delay period from the time the request is issued until the next task is received. During this period the processing element will be computationally idle. To avoid this idle time, it may be useful to include a buffer at each processing element capable of holding at least one task packet. This buffer may be considered as the processing element’s own private task pool. Now, rather than waiting for a request to be satisfied from the remote system controller, the processing element may proceed with the computation on the task packet already present locally. When the remote request has been satisfied and a new task packet delivered, this can be stored in the buffer waiting for the processing element to complete the current task. Whilst avoiding delays in fetching tasks from a remote task pool, the use of a buffer at each processing element may have serious implications for load balancing, especially towards the end of the problem solution. We will examine these issues in more detail after we have considered the realisation of task management for a simple demand driven system: the processor farm.

A first approach: The processor farm

Simple demand driven models of computation have been implemented and used for a wide range of applications. One realisation of such a model, often referred to in the literature, is that implemented by May and Shepherd [117]. This simple demand driven model, which they term a processor farm, has been used for solving problems with high computation to communication ratios. The model proposes a single system controller and one or more processing elements connected in a linear configuration, or chain. The structure of a processing element in this model is shown in figure 3.15. The application process performs the desired computation, while the communication within the system is dealt with by two router processes, the Task Router (TR) and the Result Router (RR). As their names suggest, the task router is responsible for distributing the tasks to the application process, while the result router returns the results from the completed tasks back to the system controller. The system controller contains the initial pool of tasks to be performed and collates the results. Such a communication strategy is simple to implement and largely problem independent.
To reduce possible processing element idle time, each task router process contains a single buffer in which to store a task so that a new task can be passed to the application process as soon as it becomes idle. When a task has been completed the results are sent to the system controller. On receipt of a result, the system controller releases a new task into the system. This synchronised releasing of tasks ensures that there are never more tasks in the system than there is space available. On receipt of a new task, the task router process either:

1. passes the task directly to the application process if it is waiting for a task; or
2. places the task into its buffer if the buffer is empty; or, otherwise
3. passes the task onto the next processing element in the chain.

The processor farm is initialised by loading sufficient tasks into the system so that the buffer at each task router is full and each application process has a task with which to commence processing. Figure 3.16 shows the manner in which task requests are satisfied within a simple two processing element chain.
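The three cases above can be captured directly in code. The following C sketch assumes helper routines for the buffer and channel operations; these names are illustrative and not part of the original processor farm implementation.

typedef struct { int id; } Task;

extern int  ap_is_waiting(void);               /* application process idle? */
extern int  buffer_is_empty(void);
extern void pass_to_ap(const Task *t);
extern void store_in_buffer(const Task *t);
extern void forward_down_chain(const Task *t); /* to the next PE in the chain */

/* Decision taken by the task router on receipt of a new task. */
void task_router_on_new_task(const Task *t)
{
    if (ap_is_waiting())
        pass_to_ap(t);           /* 1. the application process gets it at once */
    else if (buffer_is_empty())
        store_in_buffer(t);      /* 2. keep it ready for the next local request */
    else
        forward_down_chain(t);   /* 3. pass it on down the chain */
}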


Figure 3.15: A processing element for the processor farm model

The simplicity of this realisation of a demand driven model has contributed largely to its popularity. Note that because of the balance maintained within the system, the only respect in which the last processing element differs from any other processing element in the chain is that it must ensure the closedown command is not passed any further. However, such a model does have disadvantages which may limit its use for more complex problems.

The computation to communication ratio of the desired application is critical in order to ensure an adequate performance of a processor farm. If this ratio is too low then significant processing element idle time will occur. This idle time occurs because the computation time for the application process to complete its current task and the task buffered at the task router may be lower than the combined communication time required for the results to reach the system controller plus the time for the new tasks released into the system to reach the processing element. This problem may be partially alleviated by the inclusion of several buffers at each task router instead of just one. However, without a priori knowledge as to the computation to communication ratio of the application, it may be impossible to determine precisely what the optimum number of buffers should be. This analysis is particularly difficult if the computational complexity of the tasks varies; precisely the type of problem that demand driven models are more apt at solving. The problem independence of the system will also be compromised by the use of any a priori knowledge. If the number of buffers chosen is too small, then the possibility of application process idle time will not be avoided. Provision of too many buffers will certainly remove any immediate application process idle time, but will re-introduce the predicament as the processing draws to a close. This occurs once the system controller has no further tasks to introduce into the system and processing must now continue only until all tasks still buffered at the processing elements have been completed. Obviously, significant idle time may occur as some processing elements struggle to complete their large number of buffered tasks.

The computation to communication problem of the processor farm is severely exacerbated by the choice of the chain topology. The distance between the furthest processing element in the chain and the system controller grows linearly as more processing elements are added. This means that the combined communication time to return a result and receive a new task also increases. Furthermore, this communication time will also be adversely affected by the message traffic of all the intermediate processing elements which are closer to the system controller.



Figure 3.16: Task movement within a two PE processor farm



3.4.3 Task manager process

The aim of task management within a parallel system is to ensure the efficient supply of tasks to the processing elements. A Task Manager process (TM) is introduced at each processing element to assist in maintaining a continuous supply of tasks to the application process. The application process no longer deals with task requests directly, but rather indirectly using the facilities of the task manager. The task manager process assumes the responsibility for ensuring that every request for additional tasks from the application process will be satisfied immediately. The task manager attempts to achieve this by maintaining a local task pool.

In the processor farm, the task router process contains a single buffered task in order to satisfy the next local task request. As long as this buffer is full, task supply is immediate as far as the application process is concerned. The buffer is refilled by a new task from the system controller, triggered on receipt of a result. The task router acts in a passive manner, awaiting replenishment by a new task within the farm. However, if the buffer is empty when the application process requests a task then this process must remain idle until a new task arrives. This idle time is wasted computation time, and so to improve system performance the passive task router should be replaced by an “intelligent” task manager process more capable of ensuring new tasks are always available locally.

The task management strategies implemented by the task manager and outlined in the following sections are active, dynamically requesting and acquiring tasks during computation. The task manager thus assumes the responsibility of ensuring local availability of tasks. This means that an application process should always have its request for a task satisfied immediately by the task manager unless:

- at the start of the problem the application processes make a request before the initial tasks have been provided by the system controller;
- there are no more tasks which need to be solved for a particular stage of the parallel implementation; or,
- the task manager's replenishment strategy has failed in some way.

A local task pool

To avoid any processing element idle time, it is essential that the task manager has at least one task available locally at the moment the application process issues a task request. This desirable situation was achieved in the processor farm by the provision of a single buffer at each task router. As we saw, the single buffer approach is vulnerable to the computation to communication ratio within the system. Adding more buffers to the task router led to the possibility of serious load imbalances towards the end of the computation.

The task manager process maintains a local task pool of tasks awaiting computation by the application process. This pool is similar to the task pool at the system controller, as shown in figure 3.14. However, not only will this local pool be much smaller than the system controller's task pool, but it may also be desirable to introduce some form of “status” for the number of available tasks at any point in time.

Satisfying a task request will free some space in the local task pool. A simple replenishment strategy would be for the task manager immediately to request a new task packet from the system controller. This request has obvious communication implications for the system. If the current message densities within the system are high, then as long as there are still tasks available in the local task pool this request will place an unnecessary additional burden on the already overloaded communication network. As an active process, it is quite possible for the task manager to delay its replenishment request until message densities have diminished. However, this delay must not be so large that subsequent application process demands will deplete the local task pool before any new tasks can be fetched, causing processor idle time to occur. There are a number of indicators which the task manager can use to determine a suitable delay. Firstly, this delay is only necessary if current message densities are high. Such information should be available from the router. Given a need


Given a need for delay, the number of tasks in the task pool, the approximate computation time each of these tasks requires, and the probable communication latency in replenishing the tasks should all contribute to determining the request delay. In a demand driven system, the computational complexity variations of the tasks are not known. However, the task manager will be aware of how long previous tasks have taken to compute (the time between application process requests). Assuming some form of preferred biased allocation of tasks, in which tasks from similar regions of the problem domain are allocated to the same processing element, as discussed in section 3.4.5, the task manager will be able to build up a profile of task completion time which can be used to predict approximate completion times for tasks in the task pool. The times required to satisfy previous replenishment requests will provide the task manager with an idea of likely future communication responses. These values are, of course, mere approximations, but they can be used to assist in determining reasonable tolerance levels for the issuing of replenishment requests.


Figure 3.17: Status of the task manager's task pool

The task manager's task pool is divided into three regions: green, orange and red. The number of tasks available in the pool determines the current status level, as shown in figure 3.17. When faced with the need to replenish the task pool, the decision can be taken based on the current status of the pool:

green: only issue the replenishment request if the current message traffic density is low;

orange: issue the replenishment request unless the message density is very high; and,

red: always issue the replenishment request.

The boundaries of these regions may be altered dynamically as the task manager acquires more information. At the start of the computation the task pool will be empty and thus all red. The computation to communication ratio is critical in determining the boundaries of the regions of the task pool. The better this ratio, that is, when computation times are high relative to the time taken to replenish a task packet, the smaller the red region of the task pool needs to be. This gives the task manager greater flexibility and the opportunity to contribute to minimising communication densities.
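The replenishment decision can be captured in a few lines. The following Python sketch is illustrative only: the threshold values, the status names and the normalised message-density measure (assumed to be obtainable from the router) are assumptions rather than part of any particular implementation.

from enum import Enum

class PoolStatus(Enum):
    GREEN = "green"    # plenty of tasks buffered locally
    ORANGE = "orange"  # running low
    RED = "red"        # nearly empty: replenish regardless of traffic

class TaskPool:
    def __init__(self, capacity, red_limit, orange_limit):
        assert red_limit < orange_limit <= capacity
        self.capacity = capacity
        self.red_limit = red_limit        # boundaries may be tuned dynamically
        self.orange_limit = orange_limit
        self.tasks = []

    def status(self):
        n = len(self.tasks)
        if n <= self.red_limit:
            return PoolStatus.RED
        if n <= self.orange_limit:
            return PoolStatus.ORANGE
        return PoolStatus.GREEN

def should_request_replenishment(pool, message_density, low=0.3, high=0.8):
    """Decide whether the task manager should ask for more tasks now.

    message_density is a normalised estimate (0..1) of current traffic,
    assumed to be obtainable from the router.
    """
    status = pool.status()
    if status is PoolStatus.RED:
        return True                        # always replenish
    if status is PoolStatus.ORANGE:
        return message_density < high      # unless traffic is very high
    return message_density < low           # green: only when traffic is low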

3.4.4 Distributed task management

One handicap of the centralised task pool system is that all replenishment task requests from the task managers must reach the system controller before the new tasks can be allocated. The associated communication delay in satisfying these requests can be significant. The communication problems can be exacerbated by the bottleneck arising near the system controller. Distributed task management allows task requests to be handled at a number of locations remote from the system controller. Although all the tasks originate from the system controller, requests from processing elements no longer have to reach it in order to be satisfied.

Figure 3.18: Task request propagating towards the system controller (Key: AP - Application Process, TM - Task Manager, R - Router, TS - Task Supplier)

The closest location from which a task manager can replenish a task packet is the task pool at the task manager of one of its nearest neighbours. In this case, a replenishment request no longer proceeds directly to the system controller, but simply via the appropriate routers to the neighbouring task manager. If this neighbouring task manager is able to satisfy the replenishment request then it does so from its task pool. Depending on its current status, this task manager may now decide in turn to replenish its own task pool, and so it will also request another task from one of its neighbouring task managers, although obviously not the same neighbour to which it has just supplied the task. One sensible strategy is to propagate these requests in a "chain-like" fashion towards the main task supplier at the system controller, as shown in figure 3.18. This distributed task management strategy is referred to as a producer-consumer model. The application process is the initial consumer and its local task manager the producer. If a replenishment request is issued then this task manager becomes the consumer and the neighbouring task manager the producer, and so on. The task supplier process of the system controller is the overall producer for the system. If no further tasks exist at the system controller then the last requesting task manager may change the direction of the search. This situation may occur towards the end of a stage of processing and facilitates load balancing of any tasks remaining in task manager buffers. As well as reducing the communication distances for task replenishment, an additional advantage of this "chain reaction" strategy is that the number of request messages in the system is reduced. This plays a major rôle in helping to maintain a lower overall message density within the system. If a task manager is unable to satisfy a replenishment request because its task pool is empty, then to avoid "starvation" at the requesting processing element, this task manager must ensure that the request is passed on to another processing element. A number of variants of the producer-consumer model are also possible:

- Instead of following a path towards the system controller, the "chain reaction" could follow a predetermined Hamiltonian path (the system controller could be one of the processors on this path).

Aside: A Hamiltonian path visits each processor in the network exactly once; closed into a circuit, it starts and finishes at the same processing element. Such a path would ensure that a processing element would be assured of replenishing a task if one was available, and there would be no need to keep track of the progress of the "chain reaction" to ensure no task manager was queried more than once per chain.

- In the course of its through-routing activities a router may handle a task packet destined for a distant task manager. If that router's local task manager has an outstanding "red" request for a task then it is possible for the router to poach the "en route" task by diverting it, so satisfying its local task manager immediately. Care must be taken to ensure that the task manager for whom the task was intended is informed that the task has been poached, so that it may issue another request. In general, a task should only be poached to satisfy a "red" replenishment request, and then only if doing so would avoid local application process idle time.
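As an illustration of the "chain reaction", the sketch below models the processing elements as a line with the system controller's task supplier at position 0. The class and the network interface it relies on are assumptions made for this example, not a prescribed design.

class TaskManager:
    def __init__(self, pe_id, pool, red_limit=1):
        self.pe_id = pe_id          # 0 is the system controller's task supplier
        self.pool = pool            # list of locally buffered tasks
        self.red_limit = red_limit

    def handle_replenishment_request(self, network):
        """Called when a neighbour further from the controller asks for a task.

        network.send_request is a hypothetical routing interface.
        """
        if self.pool:
            task = self.pool.pop()
            # Having given a task away, this manager may itself become the
            # consumer and ask its own upstream neighbour (towards the
            # controller) for a replacement.
            if len(self.pool) <= self.red_limit and self.pe_id > 0:
                network.send_request(to=self.pe_id - 1, frm=self.pe_id)
            return task
        # Empty pool: pass the request on so the requester does not starve.
        if self.pe_id > 0:
            network.send_request(to=self.pe_id - 1, frm=self.pe_id)
        return None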

3.4.5 Preferred bias task allocation

The preferred bias method of task management is a way of allocating tasks to processing elements which combines the simplicity of the balanced data driven model with the flexibility of the demand driven approach. To reiterate the difference between these two computational models as they pertain to task management:


- Tasks are allocated to processing elements in a predetermined manner in the balanced data driven approach.
- In the demand driven model, tasks are allocated to processing elements on demand. The requesting processing element will be assigned the next available task packet from the task pool, and thus no processing element is bound to any area of the problem domain.

Provided no data dependencies exist, the order of task completion is unimportant. Once all tasks have been computed, the problem is solved. In the preferred bias method the problem domain is divided into equal regions, with each region being assigned to a particular processing element, as is done in the balanced data driven approach. However, in this method these regions are purely conceptual in nature. A demand driven model of computation is still used, but tasks are no longer allocated in an arbitrary fashion to the processing elements. Rather, a task is dispatched to a processing element from its conceptual portion. Only once all tasks from a processing element's conceptual portion have been completed will that processing element be allocated its next task from the portion of another processing element which has yet to complete its own conceptual portion of tasks. Generally this task should be allocated from the portion of the processing element that has completed the least number of tasks. So, for example, from figure 3.19, on completion of the tasks in its own conceptual region, PE3 may get allocated task number 22 from PE2's conceptual region. Preferred bias allocation is sometimes also referred to as conceptual task allocation. The implications of preferred bias allocation are substantial. The demand driven model's ability to deal with variations in computational complexity is retained, but now the system controller and the processing elements themselves know to which processing element any task conceptually belongs. As we will see in section 4.6, this can greatly facilitate the even distribution of partial results at the end of any stage of a multi-stage problem. The exploitation of data coherence is a vital ploy in reducing idle time due to remote data fetches. Preferred bias allocation can ensure that tasks from the same region of the problem are allocated to the same processing element. This can greatly improve the cache hit ratio at that processing element. This topic will be expanded when discussing caching in section 5.
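A minimal sketch of this allocation policy follows, assuming consecutively numbered tasks divided into equal conceptual regions as in figure 3.19; the function names and data structures are illustrative assumptions.

def build_regions(num_tasks, num_pes):
    """Divide task ids 1..num_tasks into equal conceptual regions."""
    per_pe = num_tasks // num_pes
    return {pe: list(range(pe * per_pe + 1, (pe + 1) * per_pe + 1))
            for pe in range(num_pes)}

def next_task(requesting_pe, regions):
    """Serve a task request, preferring the requester's own conceptual region."""
    own = regions[requesting_pe]
    if own:
        return own.pop(0)
    # Own region exhausted: take a task from the region of the processing
    # element that has completed the least work (most tasks still waiting).
    donor = max(regions, key=lambda pe: len(regions[pe]))
    if regions[donor]:
        return regions[donor].pop(0)
    return None   # all tasks allocated

# Example: 24 tasks and 3 processing elements, as in figure 3.19.
regions = build_regions(24, 3)
for _ in range(8):            # the third PE exhausts its own region ...
    next_task(2, regions)
print(next_task(2, regions))  # ... and is then given a task from another region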


Figure 3.19: Partial result storage balancing by means of conceptual regions (the task supplier's pool of task ids 1-24 is divided into conceptual regions for PE1, PE2 and PE3, from which the processing elements demand tasks)


4 Data Management

The data requirements of many problems may be far larger than can be accommodated at any individual processing element. Rather than restricting ourselves to only solving those problems that fit completely within every processing element's local memory, we can make use of the combined memory of all processing elements. The large problem domain can now be distributed across the system, and even secondary storage devices if necessary. For this class of application some form of data management will be necessary to ensure that data items are available at the processing elements when required by the computations. Virtual shared memory regards the whole problem domain as a single unit in which the data items may be individually referenced. This is precisely how the domain could be treated if the problem was implemented on a shared memory multiprocessor system. However, on a distributed memory system, the problem domain is distributed across the system, and hence the term virtual. Virtual shared memory systems may be implemented at different levels, such as in hardware or at the operating system level. In this chapter we will see how the introduction of a data manager process at each processing element can provide an elegant virtual shared memory at the system software level of our parallel implementation.

4.1 World Model of the Data: No Data Management Required

Not all problems possess very large data domains. If the size of the domain is such that it may be accommodated at every processing element then we say that the processing elements have a "world model" of the data. A world model may also exist if all the tasks allocated to a processing element only ever require a subset of the problem domain and this subset can be accommodated completely. In the world model, all principal and additional data items required by an application process will always be available locally at each processing element and thus there is no need for any data item to be fetched from another remote location within the system. If there is no requirement to fetch data items from remote locations as the solution of the problem proceeds then there is no need for any form of data management. The processor farm described in section 3.4.2 is an example of a parallel implementation which assumes a world model. In this approach, tasks are allocated to processing elements in an arbitrary fashion and thus there is no restriction on which tasks may be computed by which processing element. No provision is made for data management and thus, to perform any task, the entire domain must reside at each processing element. Data items do not always have to be present at the processing element from the start of computation to avoid any form of data management. As discussed in section 3.3.1, both principal and additional data items may be included within a task packet. Provided no further data items are required to complete the tasks specified in the task packet then no data management is required, and this situation may also be said to be demonstrating a world data model.

4.2 Virtual Shared Memory

Virtual shared memory provides all processors with the concept of a single memory space. Unlike a traditional shared memory model, however, the physical memory is distributed amongst the processing elements. Thus, a virtual shared memory environment can be thought of as providing each processing element with a virtual world model of the problem domain. So, as far as the application process is concerned, there is no difference between requesting a data item that happens to be local and one that is remote; only the speed of access can be (very) different.


Virtual shared memory can be implemented at any level in the computer hierarchy. Implementations at the hardware level provide a transparent interface to software developers, but require a specialised machine, such as the DASH system [111]. There have also been implementations at the operating system and compiler level. However, as we shall see, in the absence of dedicated hardware, virtual shared memory can also be provided quite easily at the system software level. At this level, a great deal of flexibility is available to provide specialised support to minimise any implementation penalties when undertaking the solution of problems with very large data requirements on multiprocessor systems. Figure 4.1 gives four levels at which virtual shared memory (VSM) can be supported, together with examples of systems that implement VSM at each particular level.

Level (highest to lowest)    Example implementations
System Software              Provided by the Data Manager process
Compiler                     High Performance Fortran [85], ORCA [10]
Operating System             Coherent Paging [113]
Hardware                     DDM [187], DASH [111], KSR-1 [99]

Figure 4.1: The levels where virtual shared memory can be implemented.

4.2.1 Implementing virtual shared memory

At the hardware level, virtual shared memory intercepts all memory traffic from the processor and decides which memory accesses are serviced locally and which need to go off-processor. This means that everything above the hardware level (machine code, operating system, etc.) sees a virtual shared memory with which it may interact in exactly the same manner as a physically shared memory. Providing this so-called transparency to the higher levels means that the size of the application's data is not constrained by the hardware level. In hardware, however, a data item becomes a fixed consecutive number of bytes, typically around 16-256. By choosing the size to be a power of 2, and by aligning data items in memory, the physical memory address becomes the concatenation of an "item identifier" and a "byte selection". This strategy is easy to implement in hardware.

bits 31 ... 6 : item identifier
bits  5 ... 0 : byte selection

In this example, the most significant bits of a memory address locate the data item, and the lower bits address a byte within the item. The choice of 6 bits for the byte selection is arbitrary. If a data structure of some higher level language containing two integers of four bytes each happened to be allocated from, say, address ...1100 111100 to ...1101 000100, then item ...1100 will contain the first integer and item ...1101 will contain the other one. This means that two logically related integers are located in two physically separate items (although they could fit in a single data item). Considered another way, if two unrelated variables, say x and y, are allocated at addresses ...1100 110000 and ...1100 110100, then they reside in the same data item. If they are heavily used on separate processors, this can cause inefficiencies when the machine tries to maintain sequentially consistent copies of x and y on both processors. The machine cannot put x on one processor and y on the other, because it does not recognise x and y as different entities; it observes them as a single item that is shared between two processors. If sequential consistency has to be maintained, the machine must propagate every write to x or y to both processors, even though the variables are not shared at all. This phenomenon is known as false sharing.

Virtual shared memory implemented at the operating system level also uses a fixed size for data items, but these are typically much larger than at the hardware level. By making an item as large as a page of the operating system (around 1-4 KByte), data can be managed at the page level. This is cheaper, but slower, than a hardware implementation. When the compiler supports virtual shared memory, a data item can be made exactly as large as any user data structure. In contrast with virtual shared memory implementations at the hardware or operating system level, compiler based implementations can keep logically connected variables together and distribute others.

The detection of logically related variables is in the general case very hard, which means that applications written in existing languages such as C, Modula-2 or Fortran cannot be compiled in this way. However, compilers for specially designed languages can provide some assistance. For example, in High Performance Fortran the programmer indicates how arrays should be divided and the compiler then provides the appropriate commands to support data transport and data consistency. Implementing virtual shared memory at the system software level provides the greatest flexibility to the programmer. However, it requires the explicit development of system features to support the manipulation of the distributed data items. A data manager process is introduced at each processing element especially to undertake this job.

4.3 The Data Manager

Virtual shared memory is provided at the system software level by a data manager process at each processing element. The aim of data management within the parallel system is to ensure the efficient supply of data items to the processing elements. The data manager process manages data items just as the task manager was responsible for maintaining a continuous supply of tasks. Note that the data items being referred to here are the principal and additional data items as specified by the problem domain, and not every variable or constant the application process may invoke for the completion of a task. The application process now no longer deals with the principal and additional data items directly, but rather indirectly using the facilities of the data manager. The application process achieves this by issuing a data request to the data manager process every time a data item is required. The data manager process assumes the responsibility for ensuring that every request for a data item from the application process will be satisfied. The data manager attempts to satisfy these requests by maintaining a local data cache. The data management strategies implemented by the data manager and outlined in the following sections are active, dynamically requesting and acquiring data items during computation. This means that an application process should always have its request for a data item satisfied immediately by the data manager unless:


- at the start of the problem the application processes make requests before any initial data items have been provided by the system controller; or,
- the data manager's data fetch strategy has failed in some way.

4.3.1 The local data cache

The concept of data sharing may be used to cope with very large data requirements [27, 58]. Data sharing implements virtual shared memory by allocating every data item in the problem domain a unique identifier. This allows a required item to be "located" from somewhere within the system, or from secondary storage if necessary. The size of problem that can now be tackled is, therefore, no longer dictated by the size of the local memory at each processing element, but rather only by the limitations of the combined memory plus the secondary storage. The principal data item required by an application process is specified by the task it is currently performing. Any additional data item requirements are determined by the task and by the algorithm chosen to solve the problem. These additional data items may be known a priori from the nature of the problem, or they may only become apparent as the computation of the task proceeds. To avoid any processing element idle time, it is essential that the data manager has the required data item available locally at the moment the application process issues a request for it. In an attempt to achieve this, the data manager maintains a local cache of data items, as shown in figure 4.2. The size of this cache, and thus the number of data items it can contain, is determined by the size of a processing element's local memory. Each data item in the system is a packet containing the unique identifier, shown in figure 4.2 as id, together with the actual data which makes up the item. The data items may be permanently located at a specific processing element, or they may be free to migrate within the system to where they are required. When a data manager requires a particular data item which is not already available locally, this data item must be fetched from some remote location and placed into the local cache.

Figure 4.2: The local cache at the data manager

This must occur before the application process can access the data item. The virtual shared memory of the system is thus the combination of the local caches at all the processing elements plus the secondary storage which is under the control of the file manager at the system controller. In certain circumstances, as will be seen in the following sections, rather than removing the data item from the local cache in which it was found, it may be sufficient simply to take a copy of the data item and return this to the requesting local cache. This is certainly the case when the data items within the problem domain are read-only, that is, the values of the data items are not altered during the course of the parallel solution of the problem (and indeed the same would be true of the sequential implementation). This means that it is possible for copies of the same data item to be present in a number of local caches. Note that there is no advantage in having more than one copy of any data item in a single local cache. There is a limited amount of space in any local cache. When the cache is full and another data item is acquired from a remote location, one of the existing data items in the local cache must be replaced by this new data item. Various strategies for implementing this replacement will be described in section 5. Care must be taken to ensure that no data item is inadvertently removed from the system completely by being replaced in all local caches. Should this happen then, assuming the data items are read-only, a copy of the entire problem domain still resides on secondary storage, from where the data items were initially loaded into the local caches of the parallel system, and so another copy of the destroyed data item can be retrieved from the file manager (FM) of the system controller. If the data items are read-write then their values may be altered as the computation progresses. In this case, the data managers have to beware of consistency issues when procuring a data item. The implications of consistency will be discussed in section 4.4. As we will now see, the strategies adopted in the parallel implementation for acquiring data items and storing them in the local caches can have a significant effect on minimising the implementation penalties and thus improving overall system performance. The onus is on the data manager process to ensure these strategies are carried out efficiently.


4.3.2 Requesting data items

The algorithm being executed at the application process will determine the next data item required. If the data items were all held by the application process, requesting the data item would be implemented within the application process as an "assignment statement". For example, a request for data item i would simply be written as x := data_item[i]. When all the data items are held instead by the data manager process, this "assignment statement" must be replaced by a request from the application process to the data manager for the data item, followed by the sending of a copy of the data item from the local cache of the data manager to the waiting application process, as shown in figure 4.3.

(a) Without a data manager:

    AP:  x := data_item[i]
         y := x * 5

(b) With a data manager:

    AP:  SEND i TO DM                DM:  RECEIVE i FROM AP
         RECEIVE x FROM DM                SEND data_item[i] TO AP
         y := x * 5

Figure 4.3: Accessing a data item (a) without, and (b) with a data manager

The data item's unique identifier enables the data manager to extract the appropriate item from its local cache. If a data item requested by the application process is available, it is immediately transferred, as shown in figure 4.4(a). The only slight delay in the computation of the application process is caused by the need to schedule the concurrent data manager and for this process to send the data item from its local cache. However, if the data item is not available locally then the data manager must "locate" this item from elsewhere in the system. This will entail sending a message via the router to find the data item in another processing element's local cache, or to obtain it from the file manager of the system controller. Having been found, the appropriate item is returned to the requesting data manager's own local cache and then, finally, a copy of the item is transferred to the application process. If the communicated request from the application process is asynchronous and this process is able to continue with its task while awaiting the data item then no idle time occurs. However, if the communication with the data manager is synchronous, or if the data item is essential for the continuation of the task, then idle time will persist until the data item can be fetched from the remote location and a copy given to the application process, as shown in figure 4.4(b). Unless otherwise stated, we will assume for the rest of this chapter that an application process is unable to continue with its current task until its data item request has been satisfied by the data manager.
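As an illustration of the exchange in figure 4.3(b), the following Python sketch pairs an application process with a data manager thread, using queues as stand-ins for the communication channels; the remote_fetch callback is an assumption representing the interaction with the router and file manager.

import queue
import threading

class DataManager(threading.Thread):
    def __init__(self, requests, replies, local_cache, remote_fetch):
        super().__init__(daemon=True)
        self.requests = requests        # data item ids requested by the AP
        self.replies = replies          # data items returned to the AP
        self.local_cache = local_cache  # dict: id -> data item
        self.remote_fetch = remote_fetch

    def run(self):
        while True:
            item_id = self.requests.get()           # RECEIVE i FROM AP
            if item_id not in self.local_cache:
                # Not available locally: locate it elsewhere in the system
                # (or at the file manager) and place it in the local cache.
                self.local_cache[item_id] = self.remote_fetch(item_id)
            self.replies.put(self.local_cache[item_id])   # SEND item TO AP

# Application process side (synchronous request):
requests, replies = queue.Queue(), queue.Queue()
dm = DataManager(requests, replies, {1: "patch data"},
                 remote_fetch=lambda i: f"item {i}")
dm.start()
requests.put(2)          # SEND i TO DM (a miss, so it is fetched "remotely")
x = replies.get()        # RECEIVE x FROM DM; the AP idles here until it arrives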


Figure 4.4: AP idle time due to: (a) data item found locally; (b) remote data item fetch


4.3.3 Locating data items

When confronted with having to acquire a remote data item, two possibilities exist for the data manager. Either it knows exactly the location of the data item within the system, or this location is unknown and some form of search will have to be instigated.

Resident sets

Knowing the precise location of the requested data item within the system enables the data manager to instruct the router to send the request for the data item directly to the appropriate processing element. One of the simplest strategies for allocating data items to each processing element's local cache is to divide all the data items of the problem domain evenly amongst the processing elements before the computation commences. Providing there is sufficient local memory, and assuming there are n processing elements, this means that each processing element would be allocated 1/n th of the total number of data items. If there is not enough memory at each processing element for even this fraction of the total problem domain then as many data items as possible could be allocated to the local caches and the remainder would be held at the file manager of the system controller. Such a simplistic scheme has its advantages. Provided these data items remain at their predetermined local cache for the duration of the computation, the processing element from which any data item may be found can be computed directly from the identity of the data item. For example, assume there are twelve data items, given the unique identification numbers 1, ..., 12, and three processing elements, PE1, PE2 and PE3. A predetermined allocation strategy may permanently allocate data items 1, ..., 4 to PE1, data items 5, ..., 8 to PE2 and data items 9, ..., 12 to PE3. Should PE2 wish to acquire a copy of data item 10, it may do so directly from the processing element known to have that data item, in this case PE3. It is essential for this simple predetermined allocation strategy that the data items are not overwritten or moved from the local cache to which they are assigned initially. However, it may be necessary for a processing element to also acquire copies of other data items as the computation proceeds, as we saw with PE2 above. This implies that the local cache should be partitioned into two distinct regions:


- a region containing data items which may never be replaced, known as the resident set; and,
- a region for data items which may be replaced during the parallel computation.
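For the twelve-item, three-processing-element example above, the home of a data item can be computed directly from its identifier; the following tiny sketch is illustrative only.

NUM_ITEMS = 12
NUM_PES = 3
ITEMS_PER_PE = NUM_ITEMS // NUM_PES

def resident_pe(item_id):
    """Return the processing element (1-based) holding item_id in its resident set."""
    return (item_id - 1) // ITEMS_PER_PE + 1

assert resident_pe(3) == 1    # items 1..4 live at PE1
assert resident_pe(10) == 3   # so PE2 fetches item 10 directly from PE3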

The size of the resident set should be sufficient to accommodate all the pre-allocated data items, as shown for PE2 from the above example in figure 4.5. The remaining portion of the local cache will be as large as allowed by the local memory of the processing element. Note that this portion needs to have sufficient space to hold a minimum of one data item, as this is the maximum that the application process can require at any specific point during a task's computation. To complete a task an application process may require many data items. Each of these data items may in turn replace the previously acquired one in the single available space in the local cache. The balanced data driven model of computation is well suited to a simple predetermined even data item allocation scheme. In this model the system controller knows, prior to the computation commencing, precisely which tasks are to be assigned to which processing elements. The same number of tasks is assigned to each processing element and thus the principal data items for each of these tasks may be pre-allocated evenly amongst the local caches of the appropriate processing elements. Similar knowledge is available to the system controller for the unbalanced data driven model, but in this case the number of tasks allocated to each processing element is not the same and so different numbers of principal data items will be loaded into each resident set. Note that the algorithm used to solve the problem may be such that, even if a data driven model is used and thus the principal data items are known in advance, the additional data items may not be known a priori. In this case, these additional data items will have to be fetched into the local caches by the data managers as the computation proceeds and the data requirements become known. More sophisticated pre-allocation strategies, for example some form of hashing function, are possible to provide resident sets at each processing element.


Figure 4.5: Resident set of the local cache

It is also not necessary for each data item to be resident at only one processing element. Should space permit, the same data item may be resident at several local caches. This is particularly useful when a few data items are used frequently by most tasks, as discussed in chapter 5 and section 4.5.3 on caching and profiling. The pre-allocation of resident sets allows the location of a data item to be determined from its unique identifier. However, a pre-allocated resident set may occupy a significant portion of a local cache and leave little space for other data items which have not been pre-allocated. This shortage of space would require these other data items to be replaced constantly as the computation proceeds. It is quite possible that one data item may be needed often by the same application process, either for the same task or for several tasks. If this data item is not in the resident set for that processing element, then there is the danger that the data item will be replaced during the periods that it is not required and thus will have to be re-fetched when it is required once more. Furthermore, despite being pre-allocated, the data items of a resident set may in fact never be required by the processing element to which they were allocated. In the example given earlier, PE2 has a resident set containing data items 5, ..., 8. Unless there is a priori knowledge about the data requirements of the tasks, there is no guarantee that PE2 will ever require any of these data items from its resident set. In this case, a portion of PE2's valuable local cache is being used to store data items which are never required, thus reducing the available storage for data items which are needed. Those processing elements that do require data items 5, ..., 8 are going to have to fetch them from PE2. Not only will the fetches of these data items imply communication delays for the requesting data managers, but the need for PE2's data manager to service these requests will also imply concurrent activity by this data manager which will detract from the computation of the application process. The solution to this dilemma is not to pre-allocate resident sets, but to build up such a set as computation proceeds and information is gained by each data manager as to the data items most frequently used by its processing element. Profiling can also assist in establishing these resident sets, as explained in section 4.5.3. The price to pay for this flexibility is that it may no longer be possible for a data manager to determine precisely where a particular data item may be found within the system.


Searching for data at unknown locations

Acquiring a specific data item from an unknown location will necessitate the data manager requesting the router process to "search" the system for this item. The naive approach would be for the router to send the request to the data manager process of each processing element in turn. If the requested data manager has the necessary data item it will return a copy and there is no need for the router to request any further processing elements. If the requested data manager does not have the data item then it must send back a not found message to the router, whereupon the next processing element may be tried. The advantage of this one-to-one scheme is that as soon as the required data item is found, no further requests need be issued and only one copy of the data item will ever be returned. However, the communication implications of such a scheme for a large parallel system are substantial. If by some quirk of fate (or Murphy's law) the last processing element to be asked is the one which has the necessary data item, then one request will have resulted in 2 x (number of PEs - 1) messages, a quite unacceptable number for large systems. Furthermore, the delay before the data item is finally found will be large, resulting in long application process idle time. An alternative to this communication intensive one-to-one approach is for the router process to issue a global broadcast of the request; a one-to-many method. A bus used to connect the processing elements is particularly suited to such a communication strategy, although, as discussed in section 1.2.1, a bus is not an appropriate interconnection method for large multiprocessor systems. The broadcast strategy may also be used efficiently on a more suitable interconnection method for large systems, such as interconnections between individual processors. In this case, the router issues the request to its directly-connected neighbouring processing elements. If the data managers at these processing elements have the required data item then it is returned; if not, these neighbouring processing elements in turn propagate the request to their neighbours (excluding the one from which they received the message). In this way, the request propagates through the system like ripples on a pond. The message density inherent in this approach is significantly less than that of the one-to-one approach; however, one disadvantage is that if the requested data item is replicated at several local caches, then several copies of the same data item will be returned to the requesting data manager when only one is required. For very large multiprocessor systems, even this one-to-many approach to discovering the unknown location of a data item may be too costly in terms of communication latency and its contribution to message density within the system. A compromise between the direct access capabilities of the pre-allocated resident set approach and the flexibility of the dynamic composition of the local caches is the notion of a directory of data item locations. In this approach, it is not necessary to maintain a particular data item at a fixed processing element. We can introduce the notion of a home-processing element that knows where that data item is, while the data item is currently located at the owner-processing element. The home-processing element is fixed and its address may be determined from the identifier of the data item. The home-processing element knows which processing element currently owns the data item.
Should this data item be subsequently moved and the copy at the previous owner-processing element removed, then either the home-processing element must be informed of the new location of the data item, or the previous owner-processing element must now maintain a pointer to this new location. The first scheme has the advantage that a request message may be forwarded directly from the home-processing element to the current owner, while the second strategy may be necessary, at least for a while after the data item has been moved from an owner, to cope with any requests forwarded by the home-processing element before it has received the latest location update. Finally, it is also possible to do away with the notion of a home-processing element by adding a hierarchy of directories. Each directory on a processing element "knows" which data items are present on that processing element. If the required data item is not present, a directory higher up in the hierarchy might know if it is somewhere nearby. If that directory does not know, yet another directory might if it is further away. This is much like the organisation of libraries: you first check the local library for a book; if they do not have it, you ask the central library, and so on until you finally query the national library. With this organisation there is always a directory that knows the whereabouts of the data item, but it is very likely that the location of the data item will be found long before asking the highest directory. (The Data Diffusion Machine [187] and KSR-1 [99] used a similar strategy implemented in hardware.)
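The home/owner bookkeeping can be sketched as follows. For brevity a single global directory object stands in for the per-processing-element directories, and the modulo mapping from identifier to home is an assumption made for the example.

class Directory:
    def __init__(self, num_pes):
        self.num_pes = num_pes
        self.owner = {}                 # item id -> PE currently holding it

    def home_pe(self, item_id):
        return item_id % self.num_pes   # fixed, derived from the identifier

    def record_move(self, item_id, new_owner):
        # Scheme 1 from the text: the home is told about every move, so a
        # request can be forwarded directly to the current owner.
        self.owner[item_id] = new_owner

    def locate(self, item_id):
        return self.owner.get(item_id, self.home_pe(item_id))

directory = Directory(num_pes=8)
directory.record_move(item_id=42, new_owner=5)
print(directory.locate(42))   # a request for item 42 is forwarded to PE 5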


4.4 Consistency

Copies of read-only data items may exist in numerous local caches within the system without any need to "keep track" of where all the copies are. However, if copies of read-write data items exist then, in a virtual shared memory system, there is the danger that the data items may become inconsistent. The example in figure 4.6 illustrates this problem of inconsistency. Suppose that we have two processing elements, PE1 and PE2, and a data item y with a value 0 that is located at processing element PE1. Processing element PE2 needs y, so it requests and gets a copy of y. The data manager on processing element PE2 decides to keep this copy for possible future reference. When the application at processing element PE1 updates the value of y, for example by overwriting it with the value 1, processing element PE2 will have a stale copy of y. This situation is called inconsistent: if the application running at processing element PE1 requests y it will get the new value (1), while the application at processing element PE2 will still read the old value of y (0). This situation will exist until the data manager at processing element PE2 decides to evict y from its local memory.


Figure 4.6: An example of how an inconsistency arises. There are two processing elements, PE1 and PE2, and a data item y. PE2 keeps a copy of y, while y is updated at PE1.

The programming model of a physical shared memory system maintains only one copy of any data item: the copy in the shared memory. Because there is only one copy, the data items cannot become inconsistent. Hence, naive virtual shared memory differs from physical shared memory in that virtual shared memory can become inconsistent. To maintain consistency, all copies of the data items will have to be "tracked down" at certain times during the parallel computation.

Once again, one-to-one or one-to-many methods could be used to determine the unknown locations of copies of the data items. If the directory approach is used then it will be necessary to maintain a complete "linked list" through all copies of any data item, where each copy knows where the next copy is, or knows that there are no more copies. A consistency operation is performed on this list by sending a message to the first copy on the list, which then ripples through the list. These operations thus take a time linear in the number of copies. This is expensive if there are many copies, so it can be more efficient to use a tree structure (where the operation needs logarithmic time). (A combination of a software and hardware tree directory of this form is used in the LimitLESS directory [23].) There are several ways to deal with this inconsistency problem. We will discuss three options: data items are kept consistent at all times (known as sequential consistency); the actual problem somehow copes with the inconsistencies (known as weak consistency); and finally, inconsistent data items are allowed to live for a well-defined period (the particular scheme discussed here is known as release consistency).

4.4.1 Keeping the data items consistent

The first option is for the data manager to keep the data items consistent at all times. To see how the data items can be kept consistent, observe first that there are two conditions that must be met before a data item can become inconsistent. Firstly, the data item must be duplicated; as long as there is only a single copy of the data item, it cannot be inconsistent. Secondly, some processing element must update one of the copies without updating the other copies. This observation leads to two protocols that the data manager can observe to enforce consistency, while still allowing copies to be made:

1. Ensure that there is no more than a single copy of the data item when it is updated. This means that before a write all but one of the copies must be deleted. This solution is known as an invalidating protocol.

2. Ensure that all copies of the data item are replaced when it is updated. This solution is known as an updating protocol.

It is relatively straightforward to check that the invalidating option will always work: all copies are always identical, because a write only occurs when there is only a single copy. In the example, the copy of y at processing element PE2 will be destroyed before y is updated on processing element PE1. For the updating protocol to be correct, the protocol must ensure that all copies are replaced "at the same time". Suppose that this is not the case: in the example the value on processing element PE1 might be updated while processing element PE2 still has an old value for y. If the data managers running on processing elements PE1 and PE2 communicate, they can find out about this inconsistency. In order for the updating protocol to work, the updating data manager must either ensure that no other data manager is accessing the data item while it is being updated, or ensure that it is impossible for any communication (or other update) to overtake this update.

It is not easy to decide in general whether an invalidating or an updating protocol is better. Below are two examples that show that invalidating and updating protocols both have advantages and disadvantages. In both cases we assume that the problem is running on a large number of processing elements, and that there is a single shared data item that is initially replicated over all processing elements.

1. A task, or tasks, being performed by an application process at one processing element might require that the data item be updated at this data manager many times, without any of the other processing elements using it. An updating protocol will update all copies on all processing elements during every update, even though the copies are not being used on any of the other processing elements. An invalidating protocol is more efficient, because it will invalidate all outstanding copies once, whereupon the application process can continue updating the data item without extra communication.

2. Suppose that instead of ignoring the data item, all other processing elements do need the updated value. An invalidating protocol will invalidate all copies and update the data item, whereupon all other processing elements have to fetch the value again. This fetch is on demand, which means that they will have to wait on the data item. An updating protocol does a better job since it distributes the new value, avoiding the need for the other processing elements to wait for it.

There is a case for (and against) both protocols. It is for this reason that these two protocols are sometimes combined. This gives a protocol that, for example, invalidates all copies that have not been used since the last update, and updates the copies that were used since the last update. Although these hybrid protocols are potentially more efficient, they are unfortunately often more complex than a pure invalidating or updating protocol.
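The bookkeeping of the two protocols can be contrasted in a few lines. In this illustrative sketch the caches of all processing elements are visible in one dictionary and the actual message passing is omitted, so it shows only the effect on the copies, not the communication cost discussed above.

def write_invalidating(copies, writer_pe, item_id, value):
    """Delete all other copies, then update the single remaining one."""
    for pe, cache in copies.items():
        if pe != writer_pe:
            cache.pop(item_id, None)        # invalidate remote copies
    copies[writer_pe][item_id] = value      # safe: only one copy remains

def write_updating(copies, writer_pe, item_id, value):
    """Replace every existing copy with the new value."""
    for pe, cache in copies.items():
        if item_id in cache or pe == writer_pe:
            cache[item_id] = value          # all copies stay identical

# Example with the data item y from figure 4.6 replicated on two PEs:
copies = {1: {"y": 0}, 2: {"y": 0}}
write_invalidating(copies, writer_pe=1, item_id="y", value=1)
print(copies)   # {1: {'y': 1}, 2: {}} - PE2 must re-fetch y on demand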

4.4.2 Weak consistency: repair consistency on request

The option to maintain sequential consistency is an expensive one. In general, an application process is allowed to proceed with its computation only after the invalidate or update has been completed. In the example of the invalidating protocol, all outstanding copies must have been erased and the local copy must have been updated before the application process can proceed. This idle time may be an unacceptable overhead. One of the ways to reduce this overhead is to forget about maintaining consistency automatically. Instead, the local cache will stay inconsistent until the application process orders the data manager to repair the inconsistency. There are two important advantages of weak consistency. Firstly, the local cache is made consistent at certain points in the task execution only, reducing the overhead. Secondly, local caches can be made consistent in parallel. Recall, for example, the updating protocol of the previous section. In a weakly consistent system we can envisage that every write to a data item is asynchronously broadcast to all remote copies. "Asynchronously" means that the processing element performing the write continues whether the update has been completed or not. Only when a consistency command is executed must the application process wait until all outstanding updates are completed. In the same way, a weakly consistent invalidating protocol can invalidate remote copies in parallel. These optimisations lead to further performance improvement. The disadvantage of weak consistency is the need for explicit commands within the algorithm at each application process so that, at the appropriate point in the execution of a task, the data manager can be instructed to make the local cache consistent.
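A weakly consistent updating data manager might be sketched as follows; the thread pool and the sleep stand in for the asynchronous network update, and the class and method names are assumptions made for the example.

from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor()

def send_update(item_id, value):
    time.sleep(0.01)        # stands in for the real network traffic

class WeaklyConsistentDM:
    def __init__(self):
        self.local = {}
        self.pending = []    # futures for in-flight remote updates

    def write(self, item_id, value):
        self.local[item_id] = value
        # Asynchronous: the application process continues immediately.
        self.pending.append(executor.submit(send_update, item_id, value))

    def make_consistent(self):
        # Explicit command from the application process: wait for all
        # outstanding updates before proceeding to the next phase.
        for f in self.pending:
            f.result()
        self.pending.clear()

dm = WeaklyConsistentDM()
dm.write("y", 1)
dm.write("z", 2)
dm.make_consistent()   # only here does the AP pay for the communication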

4.4.3 Repair consistency on synchronisation: release consistency

A weak consistency model as sketched above requires the programmer of the algorithm to ensure consistency explicitly at the appropriate moments. Release consistency is based on the observation that algorithms do not go from one phase to the next without first synchronising, so it suffices to make the local caches consistent during the synchronisation operation. This means that immediately after each synchronisation the local caches are guaranteed to be consistent. This is in general slightly more often than strictly necessary, but it is far less often than would be the case when using sequential consistency. More importantly, the application process itself does not have to make the local caches consistent any more; it is done "invisibly". Note that, although invisible, consistency is only restored during an explicit synchronisation operation; release consistency still behaves very differently from sequential consistency. As an example, an application process at PE1 can poll a data item in a loop, waiting for the data item to be changed by the application process at PE2. Under sequential consistency any update to the data item will be propagated and cause the application process at PE1 to exit the loop. Under release consistency updates do not need to be propagated until a synchronisation point, and because the system does not recognise that the polling loop is actually a synchronisation point, the application process at PE1 might loop forever.


4.5 Minimising the Impact of Remote Data Requests

Failure to find a required data item locally means that the data manager has to acquire this data item from elsewhere within the system. The time to fetch this data item and, therefore, the application process idle time, can be significant. This latency is difficult to predict and may not be repeatable due to other factors, such as current message densities within the system. The overall aim of data management is to maximise effective processing element computation by minimising the occurrence and effects of remote data fetches. A number of techniques may be used to reduce this latency:

Hiding the latency (overlapping the communication with the computation), by:
- Prefetching: anticipating data items that will be required.
- Multi-threading: keeping the processing element busy with other useful computation during the remote fetch.

Minimising the latency (reducing the time associated with a remote fetch), by:
- Caching and profiling: exploiting any coherence that may exist in the problem domain.

4.5.1 Prefetching

If it is known at the start of the computation which data items will be required by each task then these data items can be prefetched by the data manager so that they are available locally when required. The data manager thus issues the requests for the data items before they are actually required and in this way overlaps the communication required for the remote fetches with the ongoing computation of the application process. This is in contrast with the simple fetch-upon-demand strategy, where the data manager only issues the external request for a data item at the moment it is requested by the application process and is not found in the local cache. By treating its local cache as a "circular buffer" the data manager can be loading prefetched data items into one end of the buffer while the application process is requesting the data items from the other end, as shown in figure 4.7. The "speed" at which the data manager can prefetch the data items will be determined by the size of the local cache and the rate at which the application process is "using" the data items.

Figure 4.7: Storing the prefetched data items in the local cache

This knowledge about the data items may be known a priori from the nature of the problem. For example, in the parallel solution of the hemi-cube radiosity method discussed in section 7.4, the data manager knows that each task, that is, the computation of a single row of the matrix of form factors, requires all the environment's patch data. The order in which these data items are considered is unimportant, as long as all data items are considered. The data manager can thus continually prefetch those data items which have yet to be considered by the current task. Note that in this problem, because all the data items are required by every task and the order is unimportant (we are assuming that the local cache is not sufficiently big to hold all these data items), those data items which remain in the local cache at the end of one task are also required by the subsequent task. Thus, at the start of the next task, the first data item in the local cache can be forwarded to the application process and prefetching can commence once more as soon as this has happened. The choice of computation model adopted can also provide the information required by the data manager in order to prefetch. The principal data items for both the balanced and unbalanced data driven models will be known by the system controller before the computation commences. Giving this information to the data manager will enable it to prefetch these data items. A prefetch strategy can also be used for principal data items within the preferred bias task allocation strategy for the demand driven computation model, as described in section 3.4.5. Knowledge of its processing element's conceptual region can be exploited by the data manager to prefetch the principal data items within this region of the problem domain.
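A sketch of this circular-buffer behaviour is given below. The capacity, the schedule of item identifiers and the fetch function are assumptions, and in a real implementation the prefetching would run concurrently with the application process rather than being interleaved as here.

from collections import deque

class PrefetchingCache:
    def __init__(self, capacity, schedule, fetch):
        self.capacity = capacity        # limited by local memory
        self.buffer = deque()           # local cache used as a circular buffer
        self.schedule = iter(schedule)  # item ids known to be needed, in order
        self.fetch = fetch              # remote fetch of one data item

    def top_up(self):
        """Prefetch until the buffer is full or the schedule is exhausted."""
        while len(self.buffer) < self.capacity:
            item_id = next(self.schedule, None)
            if item_id is None:
                break
            self.buffer.append((item_id, self.fetch(item_id)))

    def next_item(self):
        """Called on behalf of the application process."""
        self.top_up()
        return self.buffer.popleft() if self.buffer else None

cache = PrefetchingCache(capacity=4,
                         schedule=range(1, 13),          # e.g. all patch data
                         fetch=lambda i: f"patch {i}")
while (item := cache.next_item()) is not None:
    pass    # the application process would use the data item here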

4.5.2 Multi-threading

Any failure by the data manager to have the requested data item available locally for the application process will result in idle time, unless the processing element can be kept busy doing some other useful computation. One possibility is for the application process to save the current state of a task and commence a new task whenever a requested data item is not available locally. When the requested data item is finally forthcoming, either this new task could be suspended and the original task resumed, or processing of the new task could be continued until it is completed. This new task may itself become suspended awaiting a data fetch, in which case the original task may be resumed. Saving the state of a task may require a large amount of memory and, indeed, several states may need to be saved before one requested data item finally arrives. Should the nature of the problem allow these stored tasks to be considered in turn as task packets, then this method has the additional advantage that these task packets could potentially be completed by another processing element in the course of load balancing, as explained in section 3.4.4 on distributed task management. Another possible option is multi-threading. In this method there is not only one, but several application processes on each processing element, controlled by an application process controller (APC), as shown in figure 4.8. Each application process is known as a separate thread of computation. Now, although one thread may be suspended awaiting a remote data item, the other threads may still be able to continue. It may not be feasible to determine just how many of these application processes will be necessary to avoid the case where all of them are suspended awaiting data. However, if there are sufficient threads (and of course sufficient tasks) then the processing element should always be performing useful computation. Note that multi-threading is similar to the Bulk Synchronous Parallel paradigm [176]. One disadvantage of this approach is the overhead incurred by the additional context switching between all the application processes and the application process controller, as well as the other system software processes, the router, task manager and data manager, that are all resident on the same processor. A variation of multiple active threads is to have several application processes existing on each processing element, but to only have one of them active at any time, and to have the application process controller manage the scheduling of these processes explicitly from information provided by the data manager. When an application process's data item request cannot be satisfied locally, that process will remain descheduled until the data item is forthcoming. The data manager is thus in a position to inform the application process controller to activate another application process, and only reactivate the original application process once the required data has been obtained. Note that the application process controller schedules a new application process by sending it a task to perform.
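The explicitly scheduled variant might look as follows in outline; the class and method names are assumptions, intended only to show the bookkeeping performed by the application process controller in cooperation with the data manager.

class ApplicationProcess:
    def __init__(self, task):
        self.task = task
        self.waiting_for = None    # data item id blocking this AP, if any

class APController:
    def __init__(self, tasks):
        self.ready = [ApplicationProcess(t) for t in tasks]
        self.blocked = []          # APs descheduled awaiting remote data

    def on_remote_fetch(self, ap, item_id):
        """The data manager reports that ap's request cannot be met locally."""
        ap.waiting_for = item_id
        self.blocked.append(ap)
        return self.ready.pop(0) if self.ready else None   # activate another AP

    def on_data_arrival(self, item_id):
        """The data manager reports that a fetched item has arrived."""
        for ap in list(self.blocked):
            if ap.waiting_for == item_id:
                self.blocked.remove(ap)
                ap.waiting_for = None
                self.ready.append(ap)      # eligible to be rescheduled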


Figure 4.8: Several application processes per processing element. (Key: AP = Application Process, APC = Application Process Controller, TM = Task Manager, DM = Data Manager, R = Router.)

the application process controller (and not the task manager as discussed in section 3.4.2), an application process will remain descheduled until explicitly rescheduled by the application process controller.

Both forms of multi-threading have other limitations. The first of these is the extra memory requirement each thread places on the processing element's local memory. The more memory each thread requires, for local constants, variables etc., the less memory will be available for the local cache, and thus fewer data items can be kept locally by the data manager. A "catch 22" (or is that "cache 22") situation now arises, as fewer local data items implies more remote data fetches and thus the possible need for yet more threads to hide this increase in latency. The second difficulty with a large number of threads running on the same processing element is the unacceptably heavy overhead that may be placed on the data manager when maintaining the local cache. For example, a dilemma may exist as to whether a recently fetched data item for one thread should be overwritten before it has been used if its "slot" in the local cache is required by the currently active thread.

Figure 4.9 shows results for a multi-threaded application. The graph shows the time in seconds to solve a complex parallel ray tracing problem with large data requirements using more than one application process per processing element. As can be seen, increasing the number of application processes per processing element produces a performance improvement until a certain number of threads have been added. Beyond this point, the overheads of having the additional threads are greater than the benefit gained, and thus the times to solve the problem once more increase. The number of threads at which the overheads outweigh the benefits gained is lower for larger numbers of processing elements. This is because the more application processes there are per processing element, the larger the message output from each processing element will be (assuming an average number of remote fetches per thread). As the average distance the remote data fetches have to travel in larger systems is greater, the impact of the increasing number of messages on the overall system message density is more significant and thus the request latency will be higher. Adding more threads now no longer helps overcome communication delays; in fact, the increasing number of messages actually exacerbates the communication difficulties. Ways must be found of dynamically scheduling the optimum number of application processes at each processing element depending on the current system message densities.

Despite these shortcomings, multi-threading does work well, especially for low numbers of threads, and is a useful technique for avoiding idle time in the face of unpredictable data item requirements.


Figure 4.9: Problem solution time in seconds against the number of application processes per processing element, for 16, 32 and 63 processing elements.

Remember that multiple threads are only needed at a processing element if a prefetch strategy is not possible and the data item required by one thread is not available locally. If ways can be found to guess which data items are likely to be required next then, if the data manager is right at least some of the time, the number of remote fetches-on-demand will be reduced. Caching and profiling assist the data manager with these predictions. Caching is described in detail in section 5.
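
The "single active thread" variant can be sketched as a simple scheduler. The code below is only an illustration under strong simplifying assumptions: each task touches a known sequence of data items, a remote fetch is modelled by a queue of outstanding requests, and all names (Task, outstanding, and so on) are invented for the example.

```cpp
// Sketch: an application process controller that keeps exactly one task
// active and switches tasks on a local cache miss.
#include <deque>
#include <iostream>
#include <set>
#include <utility>
#include <vector>

struct Task {
    std::vector<int> items;   // data items this task will touch, in order
    size_t next = 0;          // index of the next item to consume
    int waitingFor = -1;      // item id we are stalled on, or -1 if runnable
    bool done() const { return next == items.size(); }
};

int main()
{
    std::set<int> cache = {1, 2};
    std::vector<Task> tasks = {{{1, 7, 2}}, {{2, 8}}, {{1, 2}}};
    std::deque<std::pair<int,int>> outstanding;          // (item, task index)

    size_t remaining = tasks.size();
    while (remaining > 0) {
        bool progressed = false;
        for (size_t t = 0; t < tasks.size(); ++t) {
            Task& task = tasks[t];
            if (task.done() || task.waitingFor != -1) continue;
            // Run this task until it blocks or finishes.
            while (!task.done()) {
                int item = task.items[task.next];
                if (cache.count(item)) { ++task.next; continue; }
                task.waitingFor = item;                   // suspend on a miss
                outstanding.push_back({item, (int)t});
                std::cout << "task " << t << " stalled on item " << item << "\n";
                break;
            }
            if (task.done()) { --remaining; std::cout << "task " << t << " done\n"; }
            progressed = true;
            break;                                        // only one active task at a time
        }
        if (!progressed && !outstanding.empty()) {
            // Nothing runnable: deliver the oldest remote fetch and wake its task.
            auto [item, t] = outstanding.front(); outstanding.pop_front();
            cache.insert(item);
            tasks[t].waitingFor = -1;
            std::cout << "item " << item << " arrived, task " << t << " resumed\n";
        }
    }
}
```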

4.5.3 Profiling

Although primarily a task management technique, profiling is used explicitly to assist with data management, and so is discussed here. At the start of the solution of many problems, no knowledge exists as to the data requirements of any of the tasks. (If this knowledge did exist then a prefetching strategy would be applicable.) Monitoring the solution of a single task provides a list of all the data items required by that task. If the same monitoring action is carried out for all tasks then, at the completion of the problem, a complete "picture" of the data requirements of all tasks would be known. Profiling attempts to predict the data requirements of future tasks from the list of data requirements of completed tasks. Any spatial coherence in the problem domain will provide the profiling technique with a good estimate of the future data requirements of those tasks from a similar region of the problem domain. The data manager can now use this profiling information to prefetch those data items which are likely to be used by subsequent tasks being performed at that processing element. If the data manager is always correct with its prediction then profiling provides an equivalent situation to prefetching, in which the application process is never delayed awaiting a remote fetch. Note that in this case there is no need for multi-threading.

A simple example of spatial coherence of the problem domain is shown in figure 4.10. This figure is derived from figure 3.3, which showed how the principal data item (PDI) and additional data items (ADIs) made up a task. In figure 4.10 we can see that task i and task j come from the same region of the problem domain, and spatial coherence of the problem domain has meant that these two tasks have three additional data items in common. Task k, on the other hand, is from a different region of the problem domain and requires only one additional data item, which is not common to either task i or task j.

Figure 4.10: Common additional data items due to spatial coherence of the problem domain. (The problem domain comprises all the data items associated with the problem; tasks i, j and k each apply the algorithm to their PDI and ADIs to produce a result.)

Thus, the more successful the predictions are from the profiling information, the higher the cache-hit ratios will be. From figure 4.10 we can see that if the completion of task i was used to profile the data item requirements for task j then, thanks to the spatial coherence of tasks i and j in the problem domain, the data manager would have a 66% success rate for the additional data items of task j. However, a similar prediction for the additional data items of task k would have a 0% success rate and result in a 100% cache-miss, that is, all the additional data items for task k would have to be fetched-on-demand.
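
A minimal sketch of such a profiling scheme is given below; it simply records the data items used by completed tasks against a one-dimensional region identifier and predicts the requirements of a new task by merging the profiles of neighbouring regions. The region numbering and distance measure are purely illustrative.

```cpp
// Sketch: profiling the additional data items used by completed tasks and
// predicting the requirements of a new task from neighbouring regions.
#include <cstdlib>
#include <iostream>
#include <map>
#include <set>

using Region = int;
using ItemSet = std::set<int>;

std::map<Region, ItemSet> profile;               // filled in as tasks complete

void recordCompletedTask(Region r, const ItemSet& itemsUsed)
{
    profile[r].insert(itemsUsed.begin(), itemsUsed.end());
}

// Predict items for a task in region r by merging the profiles of regions
// within `radius` of it. These are candidates for prefetching.
ItemSet predict(Region r, int radius = 1)
{
    ItemSet guess;
    for (const auto& [region, items] : profile)
        if (std::abs(region - r) <= radius)
            guess.insert(items.begin(), items.end());
    return guess;
}

int main()
{
    recordCompletedTask(4, {10, 11, 12});
    recordCompletedTask(5, {11, 12, 13});
    for (int item : predict(5))                   // merges regions 4 and 5
        std::cout << item << ' ';                 // prints: 10 11 12 13
    std::cout << '\n';
}
```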

4.6 Data Management for Multi-Stage Problems

In section 3.3.3 we discussed the algorithmic and data dependencies that can arise in problems which exhibit more than one distinct stage. In such problems, the results from one stage become the principal data items for the subsequent stage, as was shown in figure 3.12. So, in addition to ensuring the application processes are kept supplied with data items during one stage, the data manager also needs to be aware of how the partial results from one stage of the computation are stored at each processing element in anticipation of the following stage.

This balancing of partial result storage could be achieved statically by all the results of a stage being returned to the system controller. At the end of the current stage the system controller is in a position to distribute this data evenly as the principal and additional data items for the next stage of the problem. The communication of these potentially large data packets twice, once during the previous stage to the system controller and again from the system controller to specific processing elements, obviously may impose an enormous communication overhead. A better static distribution strategy might be to leave the results in place at the processing elements for the duration of the stage and then have them distributed from the processing elements in a manner prescribed by the system controller. Note that in such a scheme the local cache of each processing element must be able to hold not only the principal and additional data items for the current stage, but also have space in which to store these partial results in anticipation of the forthcoming stage. It is important that these partial results are kept separate so that they are not inadvertently overwritten by data items during the current stage. In a demand driven model of computation the uneven computational complexity may result in a few processing elements completing many more tasks than others. This exposes a flaw in the second static storage strategy: the individual processing elements may simply not have sufficient space in their local cache to store more than their fair share of the partial results until the end of the stage.

Two dynamic methods of balancing this partial result data may also be considered. Adoption of the preferred bias task management strategy, as discussed in section 3.4.5, can greatly facilitate the correct distribution of any partial results. Any results produced by one processing element from another's conceptual portion, due to task load balancing, may be sent directly to this other processing element. The initial conceptual allocation of tasks ensures that the destination processing element will have sufficient storage for the partial result. If this conceptual allocation is not possible, or not desirable, then balancing the partial results dynamically requires each processing element to be kept informed of the progress of all other processing elements. This may be achieved by each processing element broadcasting a short message to all other processing elements on completion of every task. To ensure that this information is as up to date as possible, it is advisable that these messages have a special high priority so that they may be handled immediately by the router processes, by-passing the normal queue of messages.
Once a data manager’s local cache reaches its capacity the results from the next task are sent in the direction of the processing element that is known to have completed the least number of tasks and, therefore, the one which will have the most available space. To further reduce the possible time that this data packet may exist in the system, any processing element on its path which has storage capacity available may absorb the packet and thus not route it further.
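
A minimal sketch of this selection rule, assuming each processing element keeps an array of the broadcast progress counts, might look as follows (the names are illustrative, and a real implementation would route the packet towards the chosen processing element rather than address it directly):

```cpp
// Sketch: pick the destination for an overflowing partial result.
// tasksCompleted[i] holds the latest broadcast progress count of PE i.
#include <iostream>
#include <vector>

int destinationFor(const std::vector<int>& tasksCompleted, int self)
{
    int best = -1;
    for (int pe = 0; pe < (int)tasksCompleted.size(); ++pe) {
        if (pe == self) continue;                          // keep nothing more locally
        if (best < 0 || tasksCompleted[pe] < tasksCompleted[best]) best = pe;
    }
    return best;   // PE with the fewest completed tasks, hence most spare storage
}

int main()
{
    std::vector<int> progress = {12, 7, 9, 15};            // counts for PEs 0..3
    std::cout << destinationFor(progress, /*self=*/0) << '\n';   // prints 1
}
```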


5 Caching techniques

Cache memories are used in conventional computers as high-speed buffers between the processor and main memory. The access time of a cache is typically ten times faster than that of a conventional memory access. Arranging that the data items frequently in use by the processor are kept in the cache can thus significantly improve the average memory access time. A cache normally consists of two parts: a directory of the items in the cache, and the cache memory itself. The cache memory is divided into a number of uniform sized block frames. Each of these corresponds to one of the blocks which make up the main memory. The design of a cache memory is determined by its placement policy, which specifies the mapping from the blocks of main memory to those of the cache, and the replacement policy, which determines which entries in the cache must be overwritten to make way for new arrivals.

In data management strategies, the local caches at each data manager assume the rôle of the cache memories of conventional processors (hence the choice of the name "local cache"). The local caches of other processing elements and the secondary storage of the file manager are the equivalent of the main memory of conventional processors. The access time for a data manager to fetch a data item from these external memories will be substantially higher than a fetch from its local cache. We may also use the terminology of traditional memory management and say that a cache-hit occurs if a data item is found in the local cache when required by an application process, and a cache-miss occurs when the data item is not found and a remote fetch must be initiated.

5.1 Coherence

In many problems the sequence in which references are made to data items is not random, but occurs in a somewhat predictable manner. This is due to a property which has been termed coherence (also known as locality); it is observed that the data item requests generated by a typical program tend to be restricted to a small subset of the whole problem. Conventional virtual memory management systems depend on exploiting this feature for their success. Coherence refers to three concepts:

1. Temporal coherence. If a problem has good temporal coherence, then if a task uses a data item at time T, it is likely that the task will use it again in the near future, at time T + Δt.

2. Spatial coherence, in the problem domain. If a problem shows good problem domain coherence, then if a data item at position A of the problem domain is accessed, it is likely that items in the neighbourhood of A (A ± a) will be accessed soon.

3. Spatial coherence, in the network. If an application has good network coherence, then data items that are available on processing element PE1 are more likely to be used at processing element PE1 and the processing elements in the neighbourhood of PE1 than on processing elements further away.

All virtual shared memory systems benefit from this last form of coherence. If data is used that is already nearby, there will be less network traffic. The definition of neighbourhood is highly configuration dependent: there are few neighbours in a ring network, but many neighbours in the hypercube or AMP [26, 19] configurations.

As an example of the significance of spatial coherence in the problem domain, in [57] a normalised cumulative usage frequency was defined and used to measure the frequencies of reference of the problem domain for a ray tracer application. The graphs produced illustrated immediately the coherence present in image space: far from references being made uniformly to entries of the domain, a small subset of the entries accounts for a very large fraction of the references.


Figure 5.1: Ray tracing example of spatial coherence in the problem domain.

In some cases more than 70% of all references were made to as little as 10% of the domain. Although the temporal effects were excluded, these results indicate that some uniformity amongst data references is evident. Techniques for exploiting coherence of reference in virtual memory systems can be applied to solutions when only limited memory resources are available. A simple ray tracing example of spatial coherence of the problem domain is shown in figure 5.1. In this figure we can see that both rays in region B intersect the same lamp. The ray from region A, on the other hand, is from a different part of the problem domain and requires access to the other lamp, which is not needed by the first two rays.

Temporal coherence is important for any system that relies on caching. A cache improves performance only if data is reused, which is the case when the application has good temporal coherence. Almost all implementations of virtual shared memory rely on caching, so the more temporal coherence a problem exhibits, the better suited it is to a parallel implementation on a system utilising virtual shared memory.

In parallel rendering, coherence has a slightly different meaning, but amounts to the same thing. Coherence could be described as the extent to which an environment or an image is locally constant [59].

Object coherence. Objects consist of separate connected pieces bounded in space, and distinct objects are disjoint in space. This is the main form of coherence; the others are derived from object coherence [168]. Spatial subdivision techniques, such as grids, octrees and bintrees, directly exploit this form of coherence, which explains their success.

Image coherence. When a coherent model is projected onto a screen, the resulting image should exhibit local constancy as well. This was effectively exploited in [188].

Ray coherence. Rays that start at the same point and travel in similar directions are likely to intersect the same objects. An example of ray coherence is given in figure 5.2, where most of the plants do not intersect the viewing frustum. Only a small percentage of the plants in this scene are needed to intersect all of the primary rays drawn into it.


Figure 5.2: Ray coherence: many of these primary rays will intersect the same objects.

Frame coherence. In animations, the coherence which exists between successive frames could be exploited. In most animations, the appearance between frames tends to be similar, so that only small parts of the scene need to be re-rendered.

Data coherence. Rendering algorithms tend to access data in a somewhat predictable way [59]. Not only the data items themselves, but also the order in which they are accessed, are predictable (or can be made predictable). A good example of a sequential memory-coherent ray tracing algorithm is presented in [131].

5.2 Placement policies

Of the four basic placement policies which have been used in memory management systems, namely direct, fully-associative, set-associative and sector mappings, only the first three are suitable for data management in message passing systems, so sector mapping is not discussed further here [57]. In the following description of direct, fully-associative and set-associative mappings for the cache structure, we will assume a local cache with M frames and a problem domain of N data items, with N > M.

Direct mapping. This placement policy can be described by the simple mapping:

object $i \rightarrow$ slot $(i \bmod M)$

There is a disadvantage when using direct mapping. Because two or more memory locations can map to the same cache location, a cache replacement may occur even though the cache is not full. This is inefficient, especially when the data items concerned are used frequently.


Fully-associative mapping. With this mapping strategy, an entry in memory can map to any position in the cache:

object $i \rightarrow$ any slot

Such a flexible placement strategy can be complex and expensive to implement. The whole cache may need to be examined in order to determine the appropriate cache location of any particular data item. The inefficiency of this extensive search can result in longer access times than those of the direct mapping, despite the higher hit ratio of this approach.

Set-associative mapping. This mapping represents a compromise between the simplicity of the direct mapping and the flexibility of the fully-associative mapping. In this scheme, the cache is divided into K partitions, each partition consisting of $E = M/K$ slots. The mapping is then described by:

object $i \rightarrow$ any slot in partition $(i \bmod K)$

A straightforward mapping, comparable with direct mapping, determines the partition in which a data item may reside. This partition must then be searched to test if the data item is present.
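
The placement mappings translate directly into code. The sketch below is an illustration only, assuming a cache of M frames and, for the set-associative case, K partitions of E = M/K slots each; fully-associative mapping corresponds to the special case K = 1.

```cpp
// Sketch: cache slot selection for the placement policies described above.
#include <cassert>
#include <iostream>

// Direct mapping: data item i always maps to one fixed slot.
int directSlot(int i, int M) { return i % M; }

// Set-associative mapping: item i may occupy any of the E = M/K slots in
// partition (i mod K); the partition must be searched on lookup.
int partitionStart(int i, int M, int K)
{
    assert(M % K == 0);
    int E = M / K;                 // slots per partition
    return (i % K) * E;            // first slot of the partition for item i
}

int main()
{
    const int M = 8, K = 4;        // 8 frames, 4 partitions of 2 slots each
    std::cout << directSlot(13, M) << '\n';          // 5
    std::cout << partitionStart(13, M, K) << '\n';   // item 13 -> partition 1 -> slot 2
}
```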

5.3 Replacement policies

A replacement policy is necessary to determine which local cache positions, and thus which data items in the local cache, will be overwritten when the cache becomes full. If direct mapping is used, then the local cache position to be overwritten is known. If either of the other mapping strategies is used, then an appropriate cache position must be determined before the replacement (a so-called cache-write) can occur. A popular replacement policy is the Least Recently Used (LRU) strategy. Each time a data item is referenced in the cache, it is marked as the most recently used, and all other data items are adjusted accordingly. When a replacement is necessary, the data item which was least recently used is chosen for overwriting.

The difficulty of deciding which data item to replace is greatly exacerbated when multi-threading is used. An important data item for one thread may not be used by another. When using the policy of only one active thread, the least recently used replacement strategy is not appropriate for those data items not used by the active thread. It is particularly important that a data item which has recently been fetched for a delayed application process is not overwritten before there has been a chance to forward it to the stalled process. One solution to this dilemma is to subdivide the local cache into separate portions for use by each thread, but of course this limits the size of the local cache "available" to each thread. This is likely to increase the number of remote fetches per thread unless there is good temporal coherence within each task or good spatial coherence within the problem domain between tasks. The preferred bias task allocation strategy is one way of trying to ensure coherence between tasks on the same processing element; profiling is another.
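
A minimal sketch of an LRU local cache is shown below (a list ordered by recency plus an index; a common textbook implementation, not the one used in any particular system described in these notes).

```cpp
// Sketch: a least-recently-used local cache of fixed capacity.
// Keys are data item ids; the string values stand in for the item data.
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruCache {
    using List = std::list<std::pair<int, std::string>>;   // front = most recent
    size_t capacity_;
    List items_;
    std::unordered_map<int, List::iterator> index_;
public:
    explicit LruCache(size_t capacity) : capacity_(capacity) {}

    // Returns nullptr on a cache-miss; the caller would then fetch remotely.
    const std::string* get(int key)
    {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // mark most recent
        return &it->second->second;
    }

    void put(int key, std::string value)
    {
        if (auto it = index_.find(key); it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() == capacity_) {                    // evict the LRU entry
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }
};

int main()
{
    LruCache cache(2);
    cache.put(1, "patch 1");
    cache.put(2, "patch 2");
    cache.get(1);                      // touch item 1
    cache.put(3, "patch 3");           // evicts item 2, the least recently used
    std::cout << (cache.get(2) ? "hit" : "miss") << '\n';    // prints: miss
}
```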

5.4 What to store?

When scene data is distributed across a number of processors, it may become necessary to fetch data from other processors. That data would then be used and is ideally stored in a cache for re-use later. Memory coherence should be observed for best cache performance (see also [131]). However, it is not always necessary to store geometry at the full resolution. If the environment is highly clustered, then the data that is fetched from remote processors may have too much detail to handle efficiently. Fetching a large volumetric cluster of geometry may lead to extra communication overhead, and the cache may fill up rather quickly too. This in turn may have a severe impact on cache performance. In addition, under certain circumstances using simplified geometry may increase the quality of sampling as well [97]. An example where the full resolution of geometry data is not always required is given in figure 5.3. Assuming that the plants are distributed across a number of processors, the processors responsible for the area where the viewpoint

Figure 5.3: The plants close-by need to be sampled at full resolution, whereas the plants far away can be sampled at a far lower resolution.

is, may occasionally need to access data that is far away. The plants at the back of the greenhouse could be fetched at a far lower resolution without impacting the quality of the sampling. There are many different ways in which geometry can be simplified or stored at different resolutions. In the next sections these will be briefly described, as well as their usability in parallel rendering. The methods considered are ones where geometry is replaced with simpler geometry (geometric simplification and level of detail techniques), techniques where geometry is augmented with an extra data structure (impostors), and a caching method where the geometry is replaced with directional irradiance values (directional caching). Despite their usefulness in optimising parallel rendering algorithms, to our knowledge none of these techniques has actually been implemented in a parallel rendering system.

5.5 Geometric simplification and grouping

Geometric simplification attempts to reduce the polygon count of objects by replacing large groups of small surfaces by a small group of larger surfaces [82, 51]. This could be accomplished by surrounding the geometry to be replaced by a cube. Each face of the cube is discretised into a number of patches. The average reflectance and transparency of this geometry can be sampled by firing test rays towards the cube. Transparency could be modelled by shrinking the


cube by the percentage of rays that do not intersect any geometry. The patches on the cube are set to the reflectance measured by the rays [151]. (Preserving average material properties is important for geometric simplification algorithms to be useful in realistic rendering algorithms; this is also true for the level of detail techniques described below.)

Usually geometric simplification is used in interactive rendering to eliminate excessive processing of relatively unimportant detail. However, this technique is also useful in reducing the computational cost of indirect illumination [151]. It is therefore useful in radiosity applications, as well as complete ray tracing algorithms such as Radiance [186, 185].

Grouping of patches [97, 98] is a similar technique, but instead of replacing geometry prior to rendering, this method, implemented in a progressive refinement radiosity algorithm, shoots energy from a whole group of patches into the environment. Similarly, a group of patches can receive energy per group, rather than per patch. This energy can then be distributed over the patches within the group. Grouping of patches reduces both aliasing and the number of iterations needed to converge to a solution. In parallel radiosity implementations, the amount of data sent to each of the processors could be greatly reduced if energy is exchanged only between groups [49].

In parallel rendering, both geometric simplification and grouping of patches would be pre-processing algorithms prior to rendering. Whenever a remote processor requests data, the geometry sent to this processor could be derived from its distance in the network. When a processor receives a ray task to intersect with locally stored scene data, simplified geometry could also be used, based on the origin of the ray.

5.6 Levels of Detail

When objects are close to the origin of rays (whether in radiosity, ray tracing or particle tracing), the full resolution of the model is required. However, when objects are more distant, coarser resolutions may be sufficient. Extending the idea of having the original resolution plus one level of refinement, as in the geometric simplification example above, many levels of detail may be created as a pre-processing step and stored in a hierarchy [74, 35, 30, 32, 192, 133]. During the rendering stage, the appropriate level of detail may be chosen for each task. In ray tracing based renderers, the level in the hierarchy would be chosen according to the distance between the origin of the ray and the geometry to be intersected. Note that most of the level of detail based methods available today address the problem of geometry simplification, but ignore the problems arising from having multiple reflection distribution functions within the geometry to be simplified. For (photo-)realistic rendering, reflection should be averaged as well. Parallel renderers may benefit from level of detail approaches for the same reason that geometric simplification works.
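
As a small illustration, the sketch below chooses a level in such a hierarchy from the distance between the ray origin and the geometry; the thresholds are invented for the example, and a real renderer might instead use projected size or solid angle.

```cpp
// Sketch: choosing a level of detail from the distance between the origin of
// a ray and the geometry to be intersected.
#include <cstddef>
#include <iostream>
#include <vector>

// level 0 = full resolution, higher levels = coarser representations
std::size_t chooseLevel(double distance, const std::vector<double>& thresholds)
{
    std::size_t level = 0;
    while (level < thresholds.size() && distance > thresholds[level]) ++level;
    return level;
}

int main()
{
    std::vector<double> thresholds = {5.0, 20.0, 100.0};   // illustrative units
    std::cout << chooseLevel(2.0, thresholds)  << '\n';    // 0: full resolution
    std::cout << chooseLevel(50.0, thresholds) << '\n';    // 2: coarse
}
```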

5.7 Environment mapping and impostors

Environment mapping [14, 63, 146] can be used as a pre-processing step for either radiosity, ray tracing or particle tracing algorithms. It replaces complex geometry that is far away with a texture mapped sphere or cube (figure 5.4). Tracing of rays is then performed from within the cube or sphere until the boundary is reached. Instead of tracing the ray further into the scene, a table lookup is performed. In a parallel renderer, this not only saves complex intersections with remote geometry, but may also reduce the amount of communication required. If the environment maps coincide with processor boundaries, the method is called virtual walls [193] (see section 7.2 for a discussion of a parallel progressive refinement algorithm using virtual walls).


Figure 5.4: An environment map (left) is created for sampling reflections or form factors for one table in the environment (right); a reflected ray is terminated at the boundary and a table lookup is done in the environment map.

Figure 5.5: Replacing intricate detail (left) with a texture (right).

Impostors [50, 116] work in much the same way as environment maps, the difference being that rather than replacing the surroundings of an object with a box plus texture map, the object itself is replaced with a simple cube plus texture map (figure 5.5). Therefore, impostors and their usage do not differ that much from the geometric simplification and the grouping of patches algorithms described above.

5.8 Directional caching

While sampling a hemisphere, either to sample diffuse interreflection in a ray tracing package à la Radiance [186, 185] or in a ray tracing based progressive refinement scheme, not all directions contribute equally to the final result. Usually, the rays that contribute the least are also the most expensive rays, because they traverse the entire scene and potentially enter an area of high geometric complexity. Sampling a hemisphere is therefore in most cases rather wasteful, in the sense that too much work is spent on rays that contribute relatively little to the final shading result.

The lesson to be learned is that when computing diffuse interreflection, it is wise to at least sample a local neighbourhood around the intersection point. The potential contribution to the final shading result is highest in that area. For greater distances from the intersection point, we would like to be able to predict in advance which directions are important and which are not. That would allow us to avoid sampling intricately detailed areas which are expensive and do not contribute much.

Figure 5.6: The surroundings of a voxel are discretised into a full sphere using a quadtree. The right image shows a fully developed cache for the voxel shown in figures 5.4 (right) and 5.7.

The important directions that are not local need to be sampled completely in order to prevent losing too much accuracy. To distinguish between important (high intensity and/or high variance) and unimportant directions, a standard octree data structure could be augmented with a 2D tree in each leaf cell (figure 5.6). This cache may be filled either as a pre-processing step or during rendering. A coarse pre-processing pass has the advantage of starting the rendering with an already up to date cache. The directional values are stored directly in the 5D structure and are not attached to the objects available in the leaf nodes. This introduces the potential danger of making errors when reading the cache.

Tracing rays is now performed in stages. Each ray is traced locally, until either an intersection is found, or the ray leaves the local neighbourhood (see figure 5.7). If the ray is no longer considered to be local, the directional cache is read to determine whether the ray is travelling in an important direction. If so, the ray is continued until an intersection is found. If the ray is not going in a worthwhile direction, the directional cache also provides an estimate for the shading result of the ray. If no information for a particular direction is present in the cache, the ray is traced. All fully traced ambient rays are used to update the cache as well, so that future cache evaluations provide better estimates.

When reading the cache, a small error will usually be made, because the cached values are stored as if all previous rays emanated from the centre of the voxel, which is hardly ever the case. The error can be reduced by increasing the radius, and so can be made arbitrarily small. The distance from the intersection point to the perimeter then becomes large relative to the distance between the intersection point and the centre of the voxel. Controlling the radius therefore provides a convenient way of trading accuracy for speed and vice versa. The advantage of such a scheme would be that no pre-processing is needed: the cache is built during the rendering stage, as opposed to environment maps, which need to be created prior to rendering. In parallel rendering, as the cache builds up, fewer rays need to be migrated to other processors, or less data will have to be fetched from other processors.
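
The read decision can be sketched as follows. For brevity this illustration discretises directions into a fixed theta/phi grid rather than the adaptive quadtree described above, and all names are invented; it only shows the choice between a cheap table lookup and tracing the ray fully.

```cpp
// Sketch: reading a per-voxel directional cache. An entry stores a shading
// estimate plus a flag marking the direction as "important" (high intensity
// or variance), which must always be traced fully.
#include <algorithm>
#include <array>
#include <cmath>
#include <iostream>
#include <optional>

struct DirEntry { bool important = false; bool valid = false; double estimate = 0.0; };

constexpr int THETA_BINS = 8, PHI_BINS = 16;
constexpr double PI = 3.14159265358979323846;
using DirectionalCache = std::array<std::array<DirEntry, PHI_BINS>, THETA_BINS>;

// Map a unit direction (x, y, z) to a cache cell.
DirEntry& lookup(DirectionalCache& cache, double x, double y, double z)
{
    double theta = std::acos(std::max(-1.0, std::min(1.0, z)));      // 0 .. pi
    double phi   = std::atan2(y, x) + PI;                            // 0 .. 2*pi
    int ti = std::min(THETA_BINS - 1, int(theta / PI * THETA_BINS));
    int pj = std::min(PHI_BINS - 1, int(phi / (2.0 * PI) * PHI_BINS));
    return cache[ti][pj];
}

// Returns a cached estimate, or std::nullopt if the ray must be traced fully
// (after which the caller would update the corresponding cache entry).
std::optional<double> shadeFromCache(DirectionalCache& cache,
                                     double x, double y, double z)
{
    const DirEntry& e = lookup(cache, x, y, z);
    if (!e.valid || e.important) return std::nullopt;   // trace fully
    return e.estimate;                                   // cheap table lookup
}

int main()
{
    DirectionalCache cache{};                            // all entries start invalid
    auto r = shadeFromCache(cache, 0.0, 0.0, 1.0);
    std::cout << (r ? "lookup" : "trace fully") << '\n'; // prints: trace fully
}
```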


Figure 5.7: The cache is used to determine which rays are traced fully (high variance or intensity, updating the cache) and which rays are terminated early with a table lookup (low variance or intensity).


6 Parallel ray tracing

The object of parallel processing is to find a number of preferably independent tasks and execute these tasks on different processors. Because in ray tracing the computation of one pixel is completely independent of any other pixel, this algorithm lends itself very well to parallel processing. As the data used during the computation is read, but not modified, the data could easily be duplicated across the available processors. This would then lead to the simplest parallel implementation of a ray tracing algorithm possible. The only issue left to be addressed is that of load balancing. Superficially, ray tracing does not seem to present any great difficulties for parallel processing.

However, in massively parallel applications, duplicating data across processors is very wasteful and limits the problem size to that of the memory available with each processor. When the scene does not fit into a single processor's memory, the problem of parallelising ray tracing suddenly becomes a lot more interesting, and the following sections address the issues involved. Three different modes of operation have been tried for ray tracing: the demand driven, data parallel and hybrid scheduling approaches. They are discussed in sections 6.1 through 6.3.

6.1 Demand driven ray tracing

The most obvious parallel implementation of ray tracing would simply replicate all the data with each processor and subdivide the screen into a number of disjoint regions [128, 59, 134, 61, 62, 22, 119, 181, 87], or adaptively subdivide the screen and workload [125, 126]. Each processor then renders a number of regions using the unaltered sequential version of the ray tracing algorithm, until the whole image is completed. Whenever a processor finishes a region, it asks a master processor for a new task (figure 6.1). In terms of parallel processing, this is called the demand driven approach. In computer graphics terms this would be called a screen space subdivision. The speed-ups to be expected with this type of parallelism are near linear, as the overhead introduced is minimal. Because each processor runs the unaltered sequential algorithm on independent tasks, this approach falls into the class of embarrassingly parallel algorithms.

Communication is generally not a major problem with this type of parallel ray tracing. After finishing a task, a processor may request a new task from a master processor. This involves sending a message to the master, which in turn will send a message back. The other communication that will occur is that of writing the partial images to either the frame buffer or to a mass storage device (figure 6.1). Load balancing is achieved dynamically by only sending new tasks to processors that have just become idle. The biggest problems occur right at the beginning of the computation, when each processor is waiting for a task, and at the end of the computation, when some processors are still finishing their tasks while others have already finished (see figure 6.2 for a typical plot of efficiency over time). One way of minimising load imbalances would be task stealing, where tasks are migrated from overloaded processors to ones that have just become idle [8]. If dynamic load balancing is too expensive in terms of communication costs, the subimages (or regions) may be distributed before the computation starts [87].

In order to facilitate load balancing, it would be advantageous if each task took approximately the same number of computer cycles. In a screen space subdivision based ray tracer, the complexity of a task depends strongly on the number of objects that are visible in that region (figure 6.3). Various methods exist to circumvent this problem. The left image in figure 6.4 shows a single task per processor approach. This is likely to suffer from load imbalances, as clearly the complexity of each of the tasks is different. The middle image shows a good practical solution by having multiple smaller regions per processor.


Figure 6.1: Demand driven ray tracing. Each processor (slave) requests a task from the master processor. When the master receives a request, it sends a task to the requesting processor. After this processor finishes its task, it sends the resulting pixel data to the master for collation and requests a new task.

Figure 6.2: Typical efficiency over time plot for demand driven ray tracing. Near the end of the computation, some processors are still finishing their old tasks, while others have finished already and are unable to obtain more work.


Figure 6.3: Different areas of the screen have different complexities.

Load balancing can be static, by assigning the regions to the processors prior to the rendering stage, or dynamic, by allowing a processor to request a region upon completion of its previous region. This is likely to give smaller, but still significant, load imbalances at the end of the computation. Finally, the right image in figure 6.4 shows how each region may be adapted in size to create a roughly similar workload for each of the regions. Profiling, by subsampling the image to determine the relative workloads of different areas of the image, would be necessary (and may also be used to create a spatial subdivision should the scene be distributed over the processors [135]).

Unfortunately, parallel implementations based on image space subdivisions assume that the local memory of each processor is large enough to hold the entire scene. If this is the case, then this is also the best possible way to parallelise a ray tracing algorithm. Shared memory (or virtual shared memory) architectures would best adopt this strategy too, because good speed-ups can be obtained using highly optimised ray tracers [106, 160, 93]. It has the additional advantage that the code hardly needs any rewriting to go from a sequential to a parallel implementation. However, if very large models need to be rendered on distributed memory machines or on clusters of workstations, or if the complexity of the lighting model increases, the storage requirements will increase accordingly. It may then become impossible to run this embarrassingly parallel algorithm efficiently and other strategies will have to be found.

First of all, the data will have to be distributed. Data access will incur different costs depending on whether the data is stored locally or with a remote processor. It is suddenly very important to store locally the data that is frequently accessed, while less frequently used data may be kept at remote processors. If the above screen space subdivision is to be maintained, caching techniques may be helpful to reduce the number of remote data accesses. The unpredictable nature of the data access patterns that ray tracing exhibits makes cache design a rather non-trivial task [59, 62]. However, for certain classes of rays, ray coherence is easily exploited. These are mainly bundles of primary rays (see figure 5.2) and bundles of shadow rays (assuming that the light sources used are area light sources). It is entirely possible to select the data necessary for all of these rays by intersecting a bounding pyramid with a spatial subdivision structure [195]. The resulting list of voxels can then be communicated to the processor requesting the data.
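
A minimal sketch of the basic demand driven task farm described at the start of this section is given below, with threads standing in for processors, a shared queue standing in for the master of figure 6.1, and renderRegion() as a placeholder for the unaltered sequential ray tracer; it is an illustration, not a description of any of the cited systems.

```cpp
// Sketch: a demand driven task farm for screen space subdivision.
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct Region { int x0, y0, x1, y1; };

class Master {
    std::vector<Region> tasks_;
    std::mutex m_;
public:
    explicit Master(std::vector<Region> tasks) : tasks_(std::move(tasks)) {}
    bool nextTask(Region& out)                      // "task request" / "task" messages
    {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
};

void renderRegion(const Region& r) { /* trace all pixels in region r */ }

int main()
{
    std::vector<Region> regions;
    for (int y = 0; y < 512; y += 64)               // 8x8 grid of 64x64 regions
        for (int x = 0; x < 512; x += 64)
            regions.push_back({x, y, x + 64, y + 64});

    Master master(std::move(regions));
    std::atomic<int> rendered{0};

    auto worker = [&]() {
        Region r;
        while (master.nextTask(r)) {                // demand a task, render, repeat
            renderRegion(r);
            ++rendered;
        }
    };

    std::vector<std::thread> slaves;
    for (int i = 0; i < 4; ++i) slaves.emplace_back(worker);
    for (auto& t : slaves) t.join();
    std::cout << rendered << " regions rendered\n"; // prints: 64 regions rendered
}
```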


Figure 6.4: Image space subdivision for four processors. (a) Static load balancing: one subregion per processor. (b) Dynamic load balancing: multiple regions per processor. (c) Dynamic load balancing with adaptive regions: each region should bring about approximately the same workload.

Figure 6.5: Tracing and shading in a data parallel configuration. The spatial subdivision is distributed over processors 1 to 8; primary, reflection and shadow rays are traced by the processors whose subspaces they traverse.

6.2 Data parallel ray tracing

A different approach to rendering scenes that do not fit into a single processor's memory is called data parallel rendering. In this case, the data is distributed over the processors. Each processor will own a subset of the scene database and trace rays only when they pass through its own subspace [33, 95, 96, 17, 59, 135, 88, 162, 132, 182, 183, 94, 144]. If a processor detects an intersection in its own subspace, it will spawn secondary rays as usual. Shading is normally performed by the processor that spawned the ray. In the example in figure 6.5, all primary rays are spawned by processor 7. The primary ray drawn in this image intersects a chair, which is detected by processor 2, and a secondary reflection ray is spawned, as well as a number of shadow rays. These rays are terminated by processors 1, 3 and 5 respectively. The shading results of these processors are returned to processor 2, which will assemble the results and shade the primary ray. This shading result is subsequently sent back to processor 7, which will eventually write the pixel to screen or file.

In order to exploit coherence between data accesses as much as possible, usually some spatial subdivision is used to decide which parts of the scene are stored with which processor. In its simplest form, the data is distributed according to a uniform distribution (see figure 6.6a). Each processor will hold one or more equal sized voxels [33, 132, 182, 183, 144]. (Note that Cleary et al. [33] employed simulated parallelism on a uni-processor to test their algorithm.) Having just one voxel per processor allows the data decomposition to be nicely mapped onto a



2D or 3D grid topology. However, since the number of objects may vary dramatically from voxel to voxel, the cost of tracing a ray through each of these voxels will vary, and therefore this approach may lead to severe load imbalances. A second, and more difficult, problem to tackle is the fact that the number of rays passing through each voxel is likely to vary. Certain parts of the scene attract more rays than other parts. This has mainly to do with the view point and the location of the light sources. Both the variation in cost per ray and the number of rays attracted by each voxel indicate that having multiple voxels per processor is a good option, as it is likely to reduce any load imbalances.

Another approach is to use a hierarchical spatial subdivision, such as an octree [95, 96, 59, 142], bintree (see figures 6.6b and 6.6c) or hierarchical grids [162], and subdivide the scene according to some cost criterion. Three criteria are discussed by Salmon and Goldsmith [153]: the data should be distributed over the processors such that the computational load generated at each processor is roughly the same; the memory requirements should be similar for each processor as well; and finally, the communication cost incurred by the chosen data distribution should be minimised. Unfortunately, in practice it is very difficult to meet all three criteria. Therefore, a simple criterion is normally used, such as splitting off subtrees such that the number of objects in each subtree is roughly the same. This way at least the cost of tracing a single ray will be the same over all processors. A method for estimating the cost per ray on a per voxel basis is presented in [145]. Memory permitting, a certain degree of data duplication may be very helpful as a means of reducing load imbalances. For example, data residing near light sources may be duplicated with some or all processors, or data from neighbouring processors may be stored locally [162, 144].

In order to address the second problem, such that each processor will handle roughly the same number of ray tasks, profiling may be used to achieve static load balancing [135, 88]. This method attempts to equalise both the cost per ray and the number of rays over all processors. It is expected to outperform other static load balancing techniques at the cost of an extra pre-processing step.

If such a pre-processing step is to be avoided, the load in a data parallel system could also be dynamically balanced. This involves dynamic redistribution of data [40]. (Dippé and Swensen simulated their algorithm on a uni-processor.) The idea is to shift data from heavily loaded processors to their neighbours, provided that these have a lighter workload. This could be accomplished by shifting the voxel boundaries. Alternatively, the objects may be randomly distributed over the processors (and thus not according to some spatial subdivision) [94]. A ray will then have to be passed from processor to processor until it has visited all the processors. If the network topology is ring based, communication could be pipelined and remains local. Load balancing can be achieved by simply moving some objects along the pipeline from a heavily loaded processor to a less busy processor. In general, the problem with data redistribution is that data accesses are highly irregular, both in space and in time. Tuning such a system is therefore very difficult. If data is redistributed too often, the data communication between processors becomes the bounding factor. If data is not redistributed often enough, a suboptimal load balance is achieved.
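
As an illustration of the simple "roughly equal number of objects per subtree" criterion, the sketch below recursively splits the object list at the median of the centroids along a cycling axis until there is one group per processor; it ignores the load and communication criteria and is not taken from any of the cited systems.

```cpp
// Sketch: distributing objects over processors so that each holds roughly the
// same number of objects, by recursive median splits on object centroids.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Object { double cx, cy, cz; };               // centroid only, for brevity

void split(std::vector<Object>& objs, int groups, int axis,
           std::vector<std::vector<Object>>& out)
{
    if (groups == 1) { out.push_back(objs); return; }
    auto key = [axis](const Object& o) {
        return axis == 0 ? o.cx : axis == 1 ? o.cy : o.cz;
    };
    std::sort(objs.begin(), objs.end(),
              [&](const Object& a, const Object& b) { return key(a) < key(b); });
    std::size_t mid = objs.size() / 2;
    std::vector<Object> left(objs.begin(), objs.begin() + mid);
    std::vector<Object> right(objs.begin() + mid, objs.end());
    split(left,  groups / 2,          (axis + 1) % 3, out);   // cycle the split axis
    split(right, groups - groups / 2, (axis + 1) % 3, out);
}

int main()
{
    std::vector<Object> scene;
    for (int i = 0; i < 1000; ++i)
        scene.push_back({i * 0.1, (i % 37) * 1.0, (i % 11) * 2.0});

    std::vector<std::vector<Object>> perProcessor;
    split(scene, 8, 0, perProcessor);               // 8 processors, ~125 objects each
    for (std::size_t p = 0; p < perProcessor.size(); ++p)
        std::cout << "PE " << p << ": " << perProcessor[p].size() << " objects\n";
}
```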
In summary, data parallel ray tracing systems allow large scenes to be distributed over the processors' local memories, but tend to suffer from load imbalances, a problem which is difficult to solve with either static or dynamic load balancing schemes.

6.3 Hybrid scheduling

The challenge in parallel ray tracing is to find an algorithm which allows large scenes to be distributed without losing too much efficiency due to load imbalances (data parallel rendering) or communication (demand driven ray tracing). Combining data parallel and demand driven aspects into a single algorithm may lead to implementations with a reasonably small amount of communication and an acceptable load balance.



Figure 6.6: Data distributions for data parallel ray tracing: (a) grid, (b) octree, (c) bintree.


Hybrid scheduling algorithms have both demand driven and data parallel components running on the same set of processors, each processor being capable of handling both types of task [154, 91, 110, 89, 144]. The data parallel part of the algorithm then creates a basic, albeit uneven, load. Tasks that are not computationally very intensive but require access to a large amount of data are ideally suited for data parallel execution. On the other hand, tasks that require a relatively small amount of data could be handled as demand driven tasks. By assigning demand driven tasks to processors that attract only a few data parallel tasks, the uneven basic load can be balanced. Because it is assumed that these demand driven tasks do not access much data, the communication involved in the assignment of such tasks is kept under control.

A scene subdivision similar to Green and Paddon's [59] is presented by Scherson and Caspary [154]: the algorithm has a preprocessing stage in which a hierarchical data structure is built. The objects and the bounding boxes are subdivided over the processors, whereas the hierarchical data structure is replicated over all processors. During the rendering phase, two tasks are discerned: ray-cell intersection and ray-object intersection. Demand driven processes, which compute the intersection of rays with the hierarchical data structure, can therefore be executed on any processor. Data driven processes, which intersect rays with objects, can only be executed by the processor holding the specified object.

Another data and demand driven approach is presented by Jevans [91]. Again each processor runs two processes: the intersection process operates in demand driven mode and the ray generator process works in data driven mode. The environment is subdivided into sub-spaces (voxels) and all objects within a voxel are stored with the same processor. However, the voxels are distributed over the processors in random fashion. Also, each processor holds the entire sub-division structure. The ray generator that runs on each processor is assigned a number of screen pixels. For each pixel rays are generated and intersected with the spatial sub-division structure. For all the voxels that the ray intersects, a message is dispatched to the processor holding the object data of that voxel. The intersection process receives these messages, which contain the ray data, and intersects the rays with the objects it holds locally. It also performs shading calculations. After a successful intersection, a message is sent back to the ray generator. The algorithm is optimistic in the sense that the generator process assumes that the intersection process concludes that no object is intersected. Therefore, the generator process does not wait for the intersection process to finish, but keeps on intersecting the ray with the sub-division structure. Many messages may therefore be sent in vain. To be able to identify and destroy the unwanted intersection requests, all messages carry a time stamp.

The ability of demand driven tasks to effectively balance the load depends strongly on the amount of work involved with each task. If the task is too light, then the load may remain unbalanced. As the cost of ray traversal is generally deemed cheap compared with ray-object intersection, the effectiveness of the above split of the algorithm into data parallel and demand driven tasks needs to be questioned.
Another hybrid algorithm was proposed by Jansen and Chalmers [89]. Rays are classified according to the amount of coherence that they exhibit. If much coherence is present, for example in bundles of primary or shadow rays, these bundles are traced in a demand driven mode of operation, one bundle per task. As the objects that such a bundle may intersect are distributed over many processors, data communication will occur. Caching techniques can be employed to minimise this amount of communication. The main data distribution can be according to a grid or octree, where the spatial subdivision structure is replicated over the processors. The spatial subdivision either holds the objects themselves in its voxels, or identification tags indicating which processor stores the data for those voxels. If a processor needs access to a part of the spatial subdivision that is not locally available, it reads the identification tag and, in the case of data parallel tasks, migrates the task at hand to that processor or, in the case of demand driven tasks, sends a request for data to that processor.


7 Parallel Radiosity

Radiosity is one of the most compute intensive tasks in computer graphics, making it an obvious candidate for parallel processing. In contrast to ray tracing, where load balancing is likely to be the bottleneck, parallel radiosity algorithms tend to suffer from both communication and load balancing problems. This is due to the fact that in order to compute a radiosity value for a single patch, visibility calculations involving all other patches are necessary. In some more detail, the problems encountered in parallel radiosity are:

- The form factor computations are normally the most expensive part in radiosity. They involve visibility calculations between each pair of elements. If these elements are stored on different processors, then communication between processors will need to occur.

- During the radiosity computations, energy information is stored with each patch. If the environment is replicated with each processor (memory permitting), the energy computed for a certain patch must be broadcast to all other processors. Again, this may well lead to a communication bottleneck, even if each processor stores the whole scene database.

- If caching of objects within a radiosity application is attempted, problems with cache consistency may occur. The reason is that a processor may compute an updated radiosity value for an element it stores. If this element resides in a cache at some other processor, it should not be used any further without updating the cache.

- If the scene is highly segmented, such as a house consisting of a set of rooms, there will not be much energy exchange between the rooms. Without paying proper attention to load balancing issues, distributing such a scene over a number of processors may lead to severe load imbalances.

- Another reason for load imbalances is that the time needed to compute a form factor can vary considerably depending on the geometric relationships between patches.

Two of the key papers provided at the end of these notes are concerned with parallel radiosity methods, using virtual walls [147] and the hierarchical approach [160]. Thus, in the following sections, the numerous parallel radiosity techniques will only be briefly described. Finally a case study of a parallel implementation of the gather method using the hemi-cube approach is given.

7.1 Parallel Gathering

Conceptually the gather method may be thought of as consisting of two distinct stages:

1. calculation of the form factors between all the patches of the environment, thus setting up a matrix of these form factors; and

2. solving this matrix of form factors for the patch radiosities in each of the desired colour bands.

Setting up the Matrix of Form Factors

The computational effort required for the calculation of the form factors ranges in complexity from the need to use a full analytical procedure so as to reduce inaccuracies, to a sufficient approximation obtained by means of the hemi-cube technique, to an evaluation of the form factor from a previously calculated value via the reciprocity relationship

and finally to the simplest case in which the form factor is known to be zero when the two patches concerned face away from each other. The calculation of a single form factor only requires information about two patches, the receiving patch and the projecting patch, and may thus proceed in parallel with all other form factor calculations [129]. This parallel computation may proceed either as a data driven model or a demand driven model.

A Data Driven Approach

In the data driven approach, each processing element may be initially allocated a certain number of patches for which it will be responsible for calculating the necessary form factors. The data supplied for each patch may consist of: the vertices of the patch; the reflectivity of the patch in each of the desired colour bands; and, if the patch is a light source, the emission quantities of the patch for each of the desired colour bands (if not, then these values will be zero). Additional information relating to translucent properties, directional reflectivity etc. may also be provided. Now, acting upon the information of a single projecting patch, the processing element is able to calculate the form factors for all its allocated patches, thereby producing a partial column of the full form factor matrix. So, for example, if a processing element is allocated $p$ patches, $k, k+1, \ldots, k+p$, then from the data for the projecting patch $j$ the form factors $F_{(k,j)}, F_{(k+1,j)}, \ldots, F_{(k+p,j)}$ may be calculated, as shown in figure 7.1.
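
The data driven allocation can be sketched as a pair of nested loops: for each projecting patch j, the processing element fills in the partial column of the rows it owns. In the sketch below, formFactor() is merely a placeholder for a real hemi-cube or analytic evaluation, and the patch structure is left empty.

```cpp
// Sketch: the data driven form factor stage for one processing element that
// has been allocated patches k .. k+p-1.
#include <cstddef>
#include <vector>

struct Patch { /* vertices, reflectivity, emission, ... */ };

double formFactor(const Patch& receiver, const Patch& projector)
{
    return 0.0;   // placeholder for the actual visibility/geometry computation
}

// rows[r][j] will hold F(k+r, j): the rows of the matrix owned by this PE.
std::vector<std::vector<double>>
computeAllocatedRows(const std::vector<Patch>& allPatches, int k, int p)
{
    std::vector<std::vector<double>> rows(p, std::vector<double>(allPatches.size()));
    // Each projecting patch j is processed exactly once by this PE and
    // contributes one partial column F(k,j) .. F(k+p-1,j).
    for (std::size_t j = 0; j < allPatches.size(); ++j)
        for (int r = 0; r < p; ++r)
            rows[r][j] = formFactor(allPatches[k + r], allPatches[j]);
    return rows;
}

int main()
{
    std::vector<Patch> scene(100);
    auto rows = computeAllocatedRows(scene, /*k=*/10, /*p=*/5);
    return rows.size() == 5 ? 0 : 1;
}
```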

Figure 7.1: Calculation of a partial column of the form factor matrix

The processing element may now process the next projecting patch, and so on until all patches have been acted upon, by which stage the complete rows of the matrix of form factors, corresponding to the allocated patches, will have been computed. Once all such rows have been calculated the matrix is ready to be solved. The advantage of this data driven approach is that the data of each projecting patch only has to be processed once per processing element [136]. The obvious disadvantage is the load balancing difficulties that may occur due to the variations in computational complexity of the form factor calculations. These computational variations may result in those processing elements which have been allocated patches with computationally easy form factor calculations standing idle while the others struggle with their complex form factor calculations. This imbalance may be addressed by "sensible" allocation of patches to processing elements, but this typically requires a priori knowledge as to the complexities of the environment which may not be available. An alternative strategy to reduce the imbalance may be to dynamically distribute the work from the processing elements that are still busy to those that have completed their tasks.

tentially large, partially completed, row of form factors to the recipient processing element. A preferable technique would be to simply inform the recipient processing element which projecting patches have already been examined. The recipient processing element need then only perform the form factors for those projecting patches not yet processed. Once calculated, the form factors may be returned to the donor processing element for storage. A Demand Driven Approach No initial allocation of patches to processing elements is performed with the demand driven approach. Instead, a processing element “demands” the next task to be performed from the system controller. Each task requires the processing element to calculate all the form factors associated with the receiving patch specified by the task. The granularity of the task is usually a single receiving patch, but may include several receiving patches. The data included in the task packet may typically consist of the vertices, reflectivities, emission quantities etc [28]. The processing element thus calculates a single row (or several rows) of the matrix of form factors as the result of each task. So, for example if a processing element receives patch k in a task packet then, as shown in figure 7.2, the row of form factors for patch k will be produced before the next task is requested.
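Organisationally this is a simple task farm: the system controller hands out receiving patches on request, and each worker returns one row of form factors. The following sketch illustrates one possible arrangement; compute_row(), the patch representation and the use of multiprocessing queues are assumptions made purely for illustration, not the implementation described in the text.

    # A minimal sketch of the demand driven scheme: a controller process hands
    # out receiving patch indices on request and each worker returns one row of
    # form factors.  compute_row() is a placeholder for the hemicube or
    # analytic form factor calculation.
    from multiprocessing import Process, Queue

    def compute_row(receiving_patch, all_patches):
        # Placeholder: a real version would project the other patches onto a
        # hemicube centred on the receiving patch.
        return [0.0 if p is receiving_patch else 1.0 / (len(all_patches) - 1)
                for p in all_patches]

    def worker(tasks, results, all_patches):
        while True:
            k = tasks.get()          # "demand" the next receiving patch index
            if k is None:            # sentinel: no more work available
                break
            results.put((k, compute_row(all_patches[k], all_patches)))

    def demand_driven_gather(all_patches, n_workers=4):
        tasks, results = Queue(), Queue()
        procs = [Process(target=worker, args=(tasks, results, all_patches))
                 for _ in range(n_workers)]
        for p in procs:
            p.start()
        for k in range(len(all_patches)):    # one task per receiving patch
            tasks.put(k)
        for _ in procs:
            tasks.put(None)
        matrix = dict(results.get() for _ in range(len(all_patches)))
        for p in procs:
            p.join()
        return [matrix[k] for k in range(len(all_patches))]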


Figure 7.2: Calculation of a row of the form factor matrix

The demand driven approach reduces the imbalance that may result from the data driven model by no longer binding any patches to particular processing elements. Processor idle time may still result when the final tasks are completed, and this idle time may be exacerbated by a granularity of several receiving patches per task. The disadvantage of the demand driven approach is that the data for the projecting patches have to be fetched for every task that is performed by a processing element. This data fetching may, of course, be interleaved with the computation.

Solving the Matrix of Form Factors
From the form factors calculated in the first stage of the gather method, the matrix produced must be solved in the second stage of the method to produce the radiosities of every patch for each of the desired colour bands. For patch i the row of this matrix for the λ colour band is:

    ( -ρ_λ F(i,1), ..., 1 - ρ_λ F(i,i), ..., -ρ_λ F(i,n) ).

For an environment consisting only of convex patches, F(i,i) = 0 for all patches. If this is not the case, then the patches may be subdivided until they are all convex. The matrix to be solved is diagonally dominant, because Σ_{j=1}^{n} F(i,j) = 1 and the reflectivities ρ_λ for each colour band λ only take values greater than or equal to 0 and less than 1. Parallel iterative solvers may, therefore, be used. As the computational effort associated with each row of the matrix does not vary significantly, a data driven approach may be used. This is particularly appropriate if the rows of form factors remain stored at the processing elements at the end of the first stage of the method. Each processing element may, therefore, be responsible for a number of rows of the matrix. The matrix must be solved for each colour band separately. Initially the solution vector at each processing element is set to the emission values, for the particular colour band, of the patches for which each processing element is responsible. At each iteration the processing elements update their solution vector and then these updated solution vectors are exchanged. If each iteration is synchronised then at the end of an iteration the system controller can determine if the partial results have converged to the desired level of tolerance. Once convergence has occurred the iterative process may be repeated to solve for the radiosity values of the next colour band. When the radiosity values for all the desired colour bands have been obtained the processing is complete and the image may be displayed. The Jacobi method is most often used in parallel gathering, because it is directly applicable and inherently parallel. Unfortunately, it has a slow convergence rate (O(N^2)) [129]. However, it is possible to transform the original coefficient matrix into a suitable form that allows different iterative solution methods, such as a pre-conditioned Jacobi method [120] or the scaled conjugate gradient method [103].

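As a concrete illustration of the solver stage, the sketch below performs plain Jacobi iterations on the radiosity system B = E + ρFB for a single colour band. The inputs and the convergence tolerance are illustrative assumptions; in the parallel version each processing element would update only the rows it stores and exchange its part of B after every synchronised iteration, exactly as described above.

    # A minimal sketch of a Jacobi solve of B = E + rho * F * B for one colour
    # band.  F is the n x n form factor matrix, E the emission vector and rho
    # the reflectivities; all inputs here are assumptions for illustration.
    def jacobi_radiosity(F, E, rho, tol=1e-6, max_iters=1000):
        n = len(E)
        B = E[:]                          # initial guess: the emitted radiosities
        for _ in range(max_iters):
            B_new = [E[i] + rho[i] * sum(F[i][j] * B[j] for j in range(n))
                     for i in range(n)]
            if max(abs(a - b) for a, b in zip(B_new, B)) < tol:
                return B_new              # converged to the desired tolerance
            B = B_new
        return B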

7.2 Parallel Shooting
In parallel progressive refinement methods, the main problem is that each shooting patch may update most, if not all, of the remaining patches in the scene. This means that all the geometry data needs to be accessed for each iteration, as well as all patch and element data. If the data is replicated at each processor, then updates for each element must be broadcast to all other processors. If the data is distributed, then data fetches will be necessary. A number of parallel implementations therefore duplicate the geometry data so that visibility tests can be carried out in parallel. The patch and element information could then be distributed to avoid data consistency problems and to allow larger scenes to be rendered [44, 194]. Most parallel progressive refinement approaches use ray tracing to compute form factors [90, 44, 164, 165, 166, 194], with only a few exceptions that either use a hemicube algorithm [140, 29] or analytic form factors [29]. As with parallel gathering algorithms, parallel progressive refinement methods can be solved both in data parallel and in demand driven mode. A data parallel approach requires the patches to be initially distributed amongst the processors. Each processor may now assume responsibility for selecting, from its allocated patches, the next shooting patch [29, 66, 16, 18, 67, 166, 194]. The energy of that patch is then shot to all patches visible from the source patch. For remote patches, this involves communication of energy to neighbouring processors. One technique which allows the scene data to be distributed across the available processors is the virtual walls method [193]. Here the scene is distributed according to a grid. When a processor shoots its energy and part of it would leave that processor's subspace, the energy is temporarily stored at one or more of the grid walls, which are subdivided into small patches. As opposed to the original virtual walls technique (figure 7.3a), these patches retain directional information as well (figure 7.3b) [114, 6, 119]. After a number of iterations, the energy stored at a processor's walls is transferred to neighbouring processors and from there shot into the next subscene. Moreover, the computation and communication stages may overlap, i.e. each processor communicates energy to neighbouring processors while at the same time running its local progressive refinement algorithm. Recently, the virtual walls technique has been augmented with visibility masks and renamed virtual interfaces [5, 148, 147]. When a source shoots its energy, it records which parts of its hemisphere project onto the boundaries of its sub-environment.

Figure 7.3: Virtual walls with and without storing direction vectors. (a) Loss of directional information when crossing a virtual wall; (b) storing directional information at a virtual wall prevents this problem.


This information is stored in a visibility mask which allows directional energy to be transferred to neighbouring processors. This is accomplished without accumulating energy over multiple iterations of the shooting algorithm. One of the problems with a data parallel approach is that careful attention must be paid to the way in which light sources are distributed over the processors. If one or more processors do not have a light source patch in their subset of the scene, then these processors may remain idle until late in the calculations. Also, during the computations, a processor may run out of patches with unshot energy. Without proper redistribution of tasks, this may lead to load imbalances. In a demand driven approach, a master processor would select shooting patches and send them to the processors that require more work [140, 45, 120, 164, 165, 163, 56]. The issue of load balancing can then be addressed either by poaching tasks from busy neighbouring processors [166] or by dynamically redistributing data [165]. In this computing method, there is a master processor which selects a number of patches to shoot. These patches are then communicated to the slave processors. Communication between slave processors will occur if the geometry is distributed across the processors, since shooting energy from a patch requires access to the entire scene database. After shooting, either the results are communicated back to the master processor, or the results are broadcast directly to all other processors. In the former case, the master processor usually keeps track of patches and elements, while the surface data is distributed. In the latter case, both geometry and patches are distributed. Master-slave configurations like this tend to have the disadvantage that there is a single processor controlling the entire computation. This limits scalability, as this processor is bound to become the bottleneck if the number of processors participating in the computation is increased. If the master also holds all the patch and element data, then it may suffer from a memory bottleneck as well. For this reason, master-slave configurations do not seem to be appropriate for parallel shooting methods.
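Whichever scheme is used, the core operation is the distribution of a shooting patch's unshot energy. The sketch below shows this step for the patches held locally by one processor; the patch representation and form_factor() routine are assumptions, and energy destined for another subspace would in practice be stored at a virtual wall or described by a visibility mask rather than updated directly.

    # A minimal sketch of one progressive refinement shooting step, run by a
    # processor over its locally stored patches.  form_factor(src, rcv) and the
    # patch fields ("rho", "area", "unshot", "radiosity") are assumptions.
    def shoot_once(patches, form_factor):
        # Select the patch with the most unshot energy (radiosity times area).
        src = max(patches, key=lambda p: p["unshot"] * p["area"])
        for rcv in patches:
            if rcv is src:
                continue
            df = (rcv["rho"] * form_factor(src, rcv) * src["unshot"]
                  * src["area"] / rcv["area"])   # reciprocity: F(rcv,src) = F(src,rcv)*A_src/A_rcv
            rcv["radiosity"] += df               # gathered energy becomes radiosity ...
            rcv["unshot"] += df                  # ... and is also available for reshooting
        src["unshot"] = 0.0                      # all of the source's energy has been shot
        return src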

7.3 Parallel hierarchical radiosity
In hierarchical radiosity, the surface geometry is subdivided as the need arises. This leads to clusters of patches in interesting areas, whereas the subdivision remains coarse in other areas. If the environment is subdivided over a number of processors, there is a realistic chance that some processors find their local geometry far further subdivided than the geometry stored at other processors. For hierarchical radiosity, the main issue therefore seems to be load balancing. As opposed to progressive refinement algorithms, only a few parallel hierarchical radiosity algorithms have been implemented to date. One is implemented on virtual shared memory architectures [160, 161, 149], and one is implemented on a cluster of workstations [46]. Both are discussed briefly in this section. The virtual shared memory implementation is attractive in the sense that the algorithm itself needs hardly any modification. Each processor runs a hierarchical radiosity algorithm. Whenever a patch is selected for subdivision, it is locked before subdivision. Such a locking mechanism is necessary to avoid two processors updating the same patch concurrently. Other than that, there are no changes to the algorithm. A task is defined to be either a patch plus its interactions or a single patch-patch interaction. Each processor has a task queue, which is initialised with a number of initial polygon-polygon interactions. If a patch is subdivided, the tasks associated with its subpatches are enqueued on the task queue of the processor that generated these subpatches. A processor takes new tasks from its task queue until no more tasks remain. When a processor is left with an empty queue, it will try to steal tasks from other processors' queues. This simple mechanism achieves load balancing. If no (virtual) shared memory machine is available, the scene data will have to be distributed over the available processors. Load balancing by means of task stealing then involves communication between two processors [46]. Communication will also occur if energy is to be exchanged between patches stored at different processors. Hence, the behaviour of a parallel hierarchical radiosity algorithm is likely to be similar to parallel implementations of progressive refinement radiosity.
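The task stealing mechanism described above can be sketched as follows. This is a shared-memory simplification in which locking is omitted for clarity; the task granularity (a patch or a single interaction) and process_task() are assumptions used purely for illustration.

    # A minimal sketch of per-processor task queues with task stealing: each
    # processor works from its own queue and, when it runs dry, steals a task
    # from a randomly chosen non-empty victim queue.
    import random
    from collections import deque

    def run_with_stealing(my_id, queues, process_task):
        my_queue = queues[my_id]
        while True:
            if my_queue:
                task = my_queue.popleft()
            else:
                victims = [q for i, q in enumerate(queues) if i != my_id and q]
                if not victims:
                    return                              # no work left anywhere
                task = random.choice(victims).pop()     # steal from the opposite end
            for subtask in process_task(task):          # e.g. interactions of new subpatches
                my_queue.append(subtask)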

7.4 Case Study: Parallel radiosity using the hemi-cube approach
Many scientific problems involve more than one distinct stage of computation. Any parallel implementation must ensure that the preliminary results from one stage are correctly distributed in anticipation of the following stages. The "gather" radiosity method is such a problem, consisting of two distinct stages:
1. The formulation of a set of simultaneous equations to describe the interaction of surfaces within the environment under consideration; and,
2. The solution of these equations for the intensity of each surface's radiation.
Using the hemi-cube approach, these stages become:
1. Set up the matrix of form factors.
2. Solve for the unknown radiosities.

Stage 1: Examination of the problem description reveals that a task for the first stage of the problem could be the projection of one patch onto the hemi-cube positioned at the centre of another patch, that is, the calculation of one element of the form factor matrix. However, the possibility of occluded surfaces requires the algorithm to check, every time a projection is performed onto a hemi-cube positioned at a patch, that no other patch has projected onto the same portion of the hemi-cube. Thus there are data dependencies present in the calculation of the individual elements of one row of the form factor matrix, and it is therefore preferable to treat this as a sequential operation. A task should therefore be the calculation of one row of the form factor matrix.
The choice of task now identifies the principal and additional data items. Each receiving patch is a principal data item. The algorithm positions a hemi-cube about the centre of this patch and then projects all other patches onto this hemi-cube. All the other patches are therefore the additional data items for this task.
With the task and the principal and additional data items established, we are now in a position to choose the correct computational model. The number of patches in the environment is fixed and thus so is the number of tasks. Patches which face away from the receiving patch do not have to be projected, and having determined this from the corresponding normals, these patches do not need to be considered further. The computational complexities associated with a simple comparison of normals and a projection of a patch onto a hemi-cube are very different. Furthermore, these variations in computational complexity are unknown for each task, as they depend entirely on the nature of the environment. Thus, a demand driven approach must be chosen as the computational model.
There is another stage of the algorithm after the form factor matrix has been established, and that is the solution of this matrix. The resultant rows of the form factor matrix will be the principal data items of the second stage. A preferred bias allocation strategy will ensure these partial results are positioned correctly in anticipation of the second stage.

Stage 2: Stage 2 of the algorithm is the solution of the matrix of form factors to determine the unknown radiosities. The sequential solution of this problem typically uses a Gauss-Seidel iterative solver. We have already seen in section 3.3.3 that this method is inappropriate for parallel processing, so we will use the "Block Gauss-Seidel - Global Jacobi" method. The tasks for this second stage of the problem will be the solution of one row of the form factor matrix. There are a fixed number of tasks and the computational complexity to solve one row is the same for all rows, so we should adopt a balanced data driven model. There is no subsequent stage once the matrix has been solved, and so we should use a pre-allocation of tasks as the task management strategy.


7.4.1 Data Management
Despite the very large data requirements, the data management strategy for the parallel implementation is straightforward. With the hemi-cube approach we know a priori, from the nature of the problem (the computation of a single row of the matrix of form factors), that each task requires all the environment's patch data. The order in which these data items are considered is unimportant, as long as all data items are considered. The data manager can thus continually prefetch those data items which have yet to be considered by the current task. Note that in this problem, because all the data items are required by every task and the order is unimportant (we are assuming that the local cache is not sufficiently big to hold all these data items), those data items which remain in the local cache at the end of one task are also required by the subsequent task. Thus, at the start of the next task, the first data item in the local cache can be forwarded to the application process, and prefetching can commence once more as soon as this has happened.
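A minimal sketch of such a data manager is given below. Here the prefetching is shown sequentially for clarity; in a real system the remote fetches would be overlapped with the application's computation by a separate data manager process. fetch_remote(), the cache size and the patch identifiers are assumptions.

    # A minimal sketch of a data manager for the hemi-cube gather: every task
    # needs every patch, order is unimportant, so items left in the cache by the
    # previous task are served first and the remainder is (pre)fetched.
    from collections import OrderedDict

    class PrefetchingDataManager:
        def __init__(self, all_patch_ids, fetch_remote, cache_size=64):
            self.all_ids = list(all_patch_ids)
            self.fetch_remote = fetch_remote
            self.cache_size = cache_size
            self.cache = OrderedDict()                 # patch id -> patch data

        def patches_for_task(self):
            # Serve whatever the previous task left in the cache first.
            todo = [pid for pid in self.all_ids if pid not in self.cache]
            for _pid, data in list(self.cache.items()):
                yield data
            # Then fetch (in a real system: prefetch) the remaining patches,
            # purging the oldest cached items when the cache is full.
            for pid in todo:
                data = self.fetch_remote(pid)
                if len(self.cache) >= self.cache_size:
                    self.cache.popitem(last=False)
                self.cache[pid] = data
                yield data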

7.4.2 Configurations
The performance of a distributed memory multiprocessor depends in large part on the efficiency of the message transfer system that provides the interface between the co-operating processors. Two of the important components of this message transfer system are the underlying processor interconnection network, which provides the routes along which the messages may travel, and the routing strategy, which establishes the communication protocols. Whilst a number of physical factors may be beyond our control, such as the architecture of the multiprocessor system, we must nevertheless make the best possible choice of implementation within these constraints in this most crucial aspect of parallel processing. The suitability of the underlying processor interconnection network, also known as the topology, plays a significant part in determining the overall system performance. In a fully connected configuration every processor is adjacent to every other processor, connected by a single link. The time required to send a message from a source processor to a destination processor is simply the time it takes to transfer the message physically across the link. However, for an n-processor system, a fully connected configuration requires each processor to have (n - 1) links available for interconnection. For large values of n this may not be feasible, in which case processors will have to communicate with each other not directly, but via intermediate processors. For a constant number of links per processor, the number of intermediate processors through which a message must be routed increases as the number of processors in the configuration increases. This increase in the number of intermediate processors between any two processors is a function of the interconnection network. Figure 7.4 shows examples of configurations which are frequently used in multiprocessor systems. A Minimum Path (AMP) configuration, on the other hand, is an irregular configuration which is designed to minimise the number of links a message has to traverse between any source and any destination processor. Figure 7.5 shows the 32-processor AMP configuration (which also includes the system controller). Table 7.1 gives the diameters of a variety of numbers of processors arranged in some of these configurations. As can be seen from the table, for a given number of processors, the diameter of the AMP configurations (which in this case have all been constructed with a valency of 4) is smaller than that of any of the other configurations. Table 7.2 gives the average interprocessor distance values for the different configurations. This average interprocessor distance is equivalent to the average number of links a message has to cross from any source processor to its desired destination processor. So, for example, a message from a source processor in a 64-processor AMP would have to cross 2.92 links on average while, for a processor on a 64-processor ring, the same message would have to traverse 16 links on average.
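For the regular configurations, the diameter and average interprocessor distance can be computed directly from the adjacency structure by breadth first search, as sketched below. The ring() and chain() builders are illustrative assumptions (the irregular AMP configurations are not reproduced here); the average is taken over all ordered source/destination pairs, including a processor to itself, which reproduces the ring and chain entries of tables 7.1 and 7.2.

    # A minimal sketch: diameter and average interprocessor distance of a
    # configuration, computed by breadth first search over its adjacency lists.
    from collections import deque

    def distances_from(adj, src):
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        return dist

    def diameter_and_average(adj):
        n = len(adj)
        all_dists = [d for s in adj for d in distances_from(adj, s).values()]
        return max(all_dists), sum(all_dists) / (n * n)

    def ring(n):   # each processor linked to its two neighbours
        return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

    def chain(n):  # as a ring, but without the wrap-around link
        return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

    # diameter_and_average(ring(64))  -> (32, 16.0)
    # diameter_and_average(chain(64)) -> (63, 21.33...)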

7.4.3 Results
The radiosity problem was solved on different sizes of AMP configuration, ranging from 1 to 63 processing elements, for a small environment of 448 patches. Table 7.3 shows the time in seconds for the setup and solution stages of the problem.



Figure 7.4: (a) 8-processor chain, (b) 8-processor ring, (c) 16-processor hypercube, (d) 16-processor mesh, (e) 32-processor torus, (f) 13-processor ternary tree



Figure 7.5: The 32-processor AMP configuration

                                    Processors
                   8    13    16    23    32    40    53    64    128
    AMP            2     2     3     3     3     4     4     4      5
    Hypercube      3     -     4     -     5(a)  -     -     6(a)   7(a)
    Torus          3     -     4     -     6     7     -     7     12
    Ternary Tree   4     4     5     6     6     6     8     8     10
    Mesh           4     -     6     -    10    11     -    14     22
    Ring           4     6     8    11    16    20    26    32     64
    Chain          7    12    15    22    31    39    52    63    127
    (a) valency greater than 4

Table 7.1: Comparison of configuration diameters

                                        Processors
                   8      13     16     23     32      40     53     64      128
    AMP           1.28   1.55   1.73   2.05   2.31    2.53   2.76   2.92    3.58
    Hypercube     1.50    -     2.00    -     2.50(a)  -      -     3.00(a) 3.50(a)
    Torus         1.50    -     2.00    -     3.00    3.50    -     4.00    5.64
    Ternary Tree  1.97   2.56   2.91   3.39   3.93    4.25   4.77   5.01    6.25
    Mesh          1.75    -     2.50    -     3.88    4.23    -     5.26    7.94
    Ring          2.00   3.23   4.00   5.74   8.00   10.00  13.25  16.00   32.00
    Chain         2.63   4.31   5.31   7.65  10.67   13.99  17.66  21.33   42.66
    (a) valency greater than 4

Table 7.2: Comparison of average interprocessor distances

    PEs      Setup       Solve       Total
      1    1874.460     78.859    1953.527
      2     938.570     42.819     981.597
      4     473.570     23.749     497.533
      8     238.452     22.924     261.595
     13     147.626     20.918     168.771
     16     121.294     18.946     140.453
     23      85.561     24.151     109.935
     32      63.487     28.096      91.807
     42      49.911     35.525      85.664
     53      40.830     43.732      84.782
     63      35.053     49.956      85.229

Table 7.3: Times in seconds for 448 patches on different AMP configurations


The high computational effort required by the matrix setup stage allows good parallel performance to be achieved for this stage. Looking at the graph of speed-up for this stage in figure 7.6, we see that a 63-processing element AMP shows a speed-up of 53.5. The speed-up graph has yet to reach its peak and we can, therefore, say that the optimum number of processing elements for this matrix setup stage is greater than 63.


Figure 7.6: Speed-up for the matrix setup stage of the 448 patch problem

If we now examine the efficiency values for this stage of the computation in figure 7.7, we see that the setting up of the form factor matrix on a 63-processing element AMP is 85% efficient. This means that during this first stage, each processing element is able to devote 85% of its time to performing useful computation. The communication overheads for the matrix solution stage of the gather method are considerably higher than for the setup stage, because every processing element has to exchange its partial results with all other processing elements at each iteration. As can be seen from table 7.3, the time taken to solve the matrix initially decreases as the number of processing elements is increased up to a certain point, and then increases as more processing elements are added. The maximum speed-up achieved for this stage of the gather method is just over 4. Figure 7.8 shows the speed-up achieved for this matrix solution stage, while figure 7.9 shows the corresponding efficiencies. Figure 7.10 shows the problem solution time in seconds for an increasing problem size, from 448 to 2268 patches, for each of three configurations of 63 processing elements. The effect of the configurations can clearly be seen: on the AMP configuration the solution time rises from 87 seconds for 448 patches to 585 seconds for 2268 patches, compared with 97 and 615 seconds on the torus, and 194 and 806 seconds on the ring.
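The quoted speed-up and efficiency figures follow directly from the timings in table 7.3, using the usual definitions of speed-up S = T(1)/T(n) and efficiency E = S/n. A short check (the helper names are ours, purely for illustration):

    # A quick check of the quoted figures, using the timings of table 7.3.
    def speed_up(t_one, t_n):
        return t_one / t_n

    def efficiency(t_one, t_n, n_pes):
        return speed_up(t_one, t_n) / n_pes

    # Matrix setup stage on 63 processing elements:
    print(speed_up(1874.460, 35.053))        # ~53.5
    print(efficiency(1874.460, 35.053, 63))  # ~0.85, i.e. 85% efficient
    # Matrix solution stage at its best point (16 processing elements):
    print(speed_up(78.859, 18.946))          # ~4.2, the "just over 4" quoted above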

7.4.4 Conclusions
The correct choice of the computational model and task management strategy resulted in a speed-up of 53 and an efficiency of 85% for the first stage of the problem in the parallel implementation on 63 processing elements, but only a maximum speed-up of 4 for the second stage.


Figure 7.7: Efficiencies of the matrix setup stage of the 448 patch problem



Figure 7.8: Speed-up for the matrix solution stage of the 448 patch problem



Figure 7.9: Efficiencies of the matrix solution stage of the 448 patch problem



Figure 7.10: Time in seconds for increasing problem size on 63-processing element configurations



Figure 7.11: Percentages of time taken for a problem size of 448 patches on AMP configurations

These discrepancies in the speed-ups and efficiencies between the matrix setup and matrix solution stages are important insofar as they affect the scalability of this application. From figure 7.6 we can see that the optimum number of processing elements for the setup stage is more than 63, whereas, from figure 7.8, this optimum number is somewhere in the region of 20 processing elements for the solution stage. Closer examination reveals that, for this 448 patch problem, on a single processing element the setup stage occupies 96% of the total solution time, and the solution stage just 4% of the total solution time, but for the 63-processing element AMP solution, this ratio has changed to 42% for the setup stage and 58% for the solution stage. We, therefore, need to re-examine the algorithm we have chosen to solve the matrix, because the communication requirements for the current algorithm grow rapidly as more processing elements are added, while, for the fixed size problem, the computational requirements decrease. Only by selecting a matrix solution algorithm with lower communication requirements, such as a conjugate gradient method, can we hope to maintain some form of balance in the computation to communication ratio.
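The shift in the setup/solve ratio can be read straight from table 7.3; the short check below reproduces the quoted percentages.

    # Fraction of the total solution time spent in each stage (from table 7.3).
    def stage_fractions(setup, solve):
        total = setup + solve
        return setup / total, solve / total

    print(stage_fractions(1874.460, 78.859))   # 1 PE:   ~0.96 setup, ~0.04 solve
    print(stage_fractions(35.053, 49.956))     # 63 PEs: ~0.41 setup, ~0.59 solve,
                                               # i.e. the roughly 42%/58% split quoted above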


8 Parallel Particle Tracing
The particle tracing model traces the paths of photons as they are emitted from the surfaces of the light sources, and uses the reflected/refracted/emitted particle flux given by a large number of these particles per unit time as a measure of the illumination of points in the environment [130]. This model accurately simulates the physical propagation of light within the environment, and can be used for complex scenes including interaction with the medium within the environment. A solution based on every photon is obviously computationally intractable; however, Monte Carlo methods can be used to simulate this particle model to obtain an estimate of the global illumination. The accuracy of the solution will improve as the number of particles traced is increased, that is, the more particles used, the "more realistic" the result. Naturally, the computation time associated with the particle tracing method depends on this number of particles. Experience, based on a sequential implementation of the particle tracing method, has shown that even for relatively simple environments the number of particles that have to be considered in the simulation can be of the order of a few hundred thousand. On a single processor machine this can amount to many minutes and even hours of computing time. The paths of the particles through the environment may be traced independently and the resultant fluxes then combined to produce the global illumination. It is this independent path tracing that suggests that the particle tracing method is suitable for implementation in parallel. However, the particle tracing method also exhibits certain characteristics which complicate its solution on multiprocessor systems. These characteristics include very large data requirements and variations in the computational complexity associated with the data. The simplified particle tracing algorithm is as follows. For each particle, repeat the following steps:

- Choose the particle's wavelength by Discrete Distribution Sampling of the cumulative emission spectrum.
- In the presence of multiple light sources, choose the emitter from which the particle will originate by Discrete Distribution Sampling of the emitter strength distribution at the chosen wavelength.
- Choose the position on the emitter at which the particle originates by position sampling the emitter surface geometry.
- Update the outgoing particle flux at the emitter's surface.
- REPEAT
  - find the nearest surface or medium that results in an interaction
  - ascertain the interaction type (e.g. absorption, reflection, refraction, scattering etc.)
  - determine the particle and surface/medium status after the interaction
  UNTIL particle absorbed
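A minimal sketch of this loop is given below. The scene interface (sample_wavelength, sample_emitter, nearest_interaction, scatter, absorptance) and the flux bookkeeping are assumptions made purely for illustration; a real implementation would accumulate flux into per-surface illumination meshes rather than a simple counter.

    # A minimal sketch of the particle tracing loop above, under an assumed
    # scene interface.
    import random
    from collections import Counter

    def trace_particles(scene, n_particles):
        flux = Counter()
        for _ in range(n_particles):
            wavelength = scene.sample_wavelength()          # discrete distribution sampling
            emitter = scene.sample_emitter(wavelength)      # pick a light source
            position, direction = emitter.sample_surface()  # origin on the emitter
            flux[emitter.id] += 1                           # outgoing flux at the emitter
            while True:
                surface, position = scene.nearest_interaction(position, direction)
                if surface is None:                         # particle leaves the scene
                    break
                flux[surface.id] += 1                       # record the interaction
                if random.random() < surface.absorptance(wavelength):
                    break                                   # particle absorbed
                direction = surface.scatter(direction, wavelength)
        return flux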

8.1 Parallel Characteristics
First impressions of particle tracing seem to suggest that it is within the class of "embarrassingly parallel" problems. The path of each particle may be computed independently of all other particles. However, the global illumination solution is determined by the total history of the particles. Each object interaction results in a cumulative update of a data structure representing the scene's global illumination. These updates must occur during the simulation for two reasons. Firstly, storage of all interactions is prohibited by their number and the accuracy required; accumulation must occur at the data structure to minimise storage requirements, but the entire data structure is too large to replicate. Secondly, the desire for interactive rendering utilising a progressive refinement approach creates the need for correct illumination values during the simulation.
The exploitation of ray coherence is an important factor in parallel ray tracing [57]. However, the random nature of the Monte Carlo method restricts any similar approach based on the coherence of particle paths. Precomputational analysis using the position and intensity of light sources may provide some indication as to which voxels are likely to be requested by particle paths leaving the light sources. The subsequent path of the particle is determined by the nature of the environment. In summary:

- spatial subdivision is essential to reduce particle-object intersection computations.
- particle coherence is low due to the random walk.
- there are large variations in voxel and object usage.
- object utilisation is difficult to predict due to the random walk and the complexity of environments.
- there are large variations in the computation associated with individual particles, or even groups of particles, due to the stochastic nature of the simulation.

8.1.1 Scene data
Seven scenes have been considered in this parallel implementation of particle tracing:
MirroredRoom: A simple room containing several pieces of furniture and a doorway. Illumination comes from two spherical light sources attached to the ceiling. In addition, a three-dimensional array of "floating", highly reflective spheres occupies the centre of the room. Figure 8.1 shows an image of this scene.

Figure 8.1: Image of MirroredRoom scene.

Room1 and Room2: These scenes represent a single-room office containing several pieces of furniture and doors. Room1 contains two ceiling mounted rectangular light sources. Room2 has additional table lights.
LargeRoom: A more complex office environment containing furniture and doors. The scene contains eight ceiling mounted rectangular light sources.


DiningRoom1 and DiningRoom2: Two scenes describing a room containing four tables and a chair. Illumination is provided by a small square source in the ceiling. DiningRoom2 has two additional spherical light sources near the floor in one corner of the room. Figure 8.2 shows an image of DiningRoom2.

Figure 8.2: Image of DiningRoom2 scene.

LectureHall: A large hall containing rows of bench seats facing a platform. Sixteen ceiling mounted rectangular lights illuminate the room.
Table 8.1 summarises the structure of the scenes.

    Name           Objects   Sources   Polygons   Spheres
    MirroredRoom       124         2         22       102
    Room1              234         2        234         0
    Room2              240         8        240         0
    LargeRoom          380         8        380         0
    DiningRoom1        402         1        402         0
    DiningRoom2        404         3        402         2
    LectureHall       3180        16       3180         0

Table 8.1: Overview of scene structure.

8.2 Task Scheduling
As discussed in section 3.3.1, the first challenge in task scheduling is the identification of an individual task. In order to avoid any data dependencies in parallel particle tracing, the task granularity is chosen as the computation of the complete contribution to the flux in the environment of a single particle. Our choices for this problem were as follows:
atomic element: to perform one particle-object intersection;
task: to trace the complete path of one particle (may consist of a number of atomic elements);
PDIs: the surfaces in the environment to which the particle makes a contribution;
ADIs: the scene data; and,
task packet: one or more particles to be computed.
Although adopting the task granularity of tracing a complete path for each particle does remove any data dependencies, it does introduce significant variations in computational complexity between different tasks, as some particles may be absorbed after only one or a few intersections, while others could intersect with many objects. Variations in particle path length are shown in figure 8.3a for two scenes. These differences in path lengths result in variations in the computation performed during the simulation of a particle, shown by the distribution of object intersections associated with each particle in figure 8.3b.


Figure 8.3: Distribution of particle path lengths and intersection calculations ((a) particle path lengths; (b) particle object intersections).

These large (unknown) variations in task computational complexity imply that a demand-driven model of computation must be adopted for parallel particle tracing. To illustrate this point, figure 8.4 compares a data-driven and a demand-driven implementation for the MirroredRoom. The error bars display the differences between the completion time of the first processor and the eventual problem solution; this highlights the load imbalances occurring near problem completion. The speed-up of the data-driven model is poor for large configurations as distribution and allocation times increase. The demand-driven approach performs considerably better, with performance approaching linear speed-up even for large systems.

8.2.1 Task packet variations
Varying the task packet size can supplement the task allocation strategy to minimise load-balancing problems. These packet variations can occur within the framework of the actual allocation strategy adopted. Smaller packets at the start of the computation will enable processors to start performing tasks rapidly, while towards the end of the computation, reducing the packet size will minimise the chance of some processors being left with large numbers of tasks whilst others have completed their portion. Figure 8.5 shows how varying the task packet size during the computation benefits load-balancing during the initial and final stages of the problem. Often the system has a maximum available storage size allocated to tasks within each processor. This forms a natural ceiling, limiting the task packet size within the system, because individual processors will not be able to handle the concurrently arriving tasks:



Figure 8.4: Data- versus demand-driven computational model for a world data model particle tracing application.


Figure 8.5: Variations in task packet size during computation.

Fixed: constant task packet size during the computation.
Linear: the packet size varies linearly at the beginning and end of the computation.
Logarithmic: the packet size varies logarithmically with the computation.
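One possible way of generating such schedules is sketched below. The exact shapes used in the experiments are not specified here, so the ramp length and the minimum/maximum packet sizes are illustrative assumptions.

    # A minimal sketch of fixed, linear and logarithmic task packet schedules.
    import math

    def packet_size(done, total, schedule="logarithmic",
                    p_min=1, p_max=32, ramp=0.2):
        """Packet size after 'done' of 'total' tasks have been handed out."""
        if schedule == "fixed":
            return p_max
        edge = min(done, total - done) / (ramp * total)   # 0 at either end of the run
        edge = min(edge, 1.0)
        if schedule == "linear":
            scale = edge                                  # linear ramp up and down
        else:
            scale = math.log1p(edge * (math.e - 1))       # logarithmic ramp: 0 -> 0, 1 -> 1
        return max(p_min, round(p_min + scale * (p_max - p_min)))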


Figure 8.6: The effect of task packet variations in a demand-driven (world-model) particle tracing application.

Figure 8.6 illustrates the effect of varying the task packet size dynamically during solution of the MirroredRoom problem. The packet schedule using logarithmically increasing and decreasing sizes seems to cope better with the variations in task computational complexity. The implementations perform extremely well, demonstrating the implicitly parallel nature of particle tracing.

8.3 Data Management
Problems with small data requirements may be solved by having this data replicated and stored at each processing element, the so-called world model. However, if the data requirements are larger than can be accommodated at each processing element, then this data needs to be distributed, and data management becomes necessary to control the resultant movement of data within the system. In particle tracing, the environmental model, consisting of object information and the spatial subdivision, forms the majority of the data space. Each object within the environment consists of a number of fixed parameters describing its physical properties, such as shape and surface properties, and variable coefficients altered by the ongoing computation, such as surface flux values. Within the algorithm, individual objects are accessed independently, so it seems natural to make each object an individual data item. Spatial subdivisions for complex scenes may contain a large number of nodes. Within an octree, nodes may be categorised as internal, empty or terminal. Internal nodes contain pointers to lower level nodes, and as such form the branching structure of the tree. Terminal nodes represent voxels containing objects and require storage for the identification of the objects within that voxel. An empty node corresponds to a region with no object occupancy. Table 8.2 gives approximate storage requirements for a number of scenes created with a maximum of ten objects in each voxel. The small size of the octree relative to the object storage requirements means that it is certainly possible to replicate the octree at each processing element. This approach has frequently been taken, as octree access during path tracing is high [60]. However, data management does not prohibit the decomposition of the spatial subdivision into data items which can be distributed and accessed in the same manner as environmental objects.


                                           Octree
    Scene Name     Objects   Nodes   Depth   Internal   Empty   Terminal   Size
    Room1              234     841       9        105     270        466    15K
    DiningRoom1        402    1289       9        161     406        722    25K
    LectureHall       3180    9769       9       1221    2332       6216   200K

Table 8.2: Comparison of scene spatial subdivision structure.

The size of the meshes attached to each object suggests that their location within a multiprocessor system should remain fixed unless movement is absolutely necessary. The communication overhead necessary to move this amount of information would adversely affect the computation. One solution is for an object to be stored in an unpurgeable cache location together with its relevant grid information. Precluding the movement of objects with attached mesh structure creates a resident set within the cache, but does not prevent the replication of the information required to perform intersection calculations. The adoption of resident sets has the advantage that destruction of all copies of a particular object by cache over-write is no longer possible: at least one copy of each object will remain within a resident set somewhere within the system. A data management strategy can now be structured to take advantage of the additional information that the resident set provides: the location of one copy of each data item is always known – the processor containing the mesh for that data item. Obviously, other copies may be at closer locations in network space due to replication.

8.3.1 Data Manager
The data manager process at each processing element performs local caching of data as well as holding the surface illumination meshes associated with certain data items. The cache now contains two areas: a resident set containing unpurgeable objects with their meshes, and a non-resident area holding transient copies of data held in other data managers' resident sets. Application access to data held at the data manager occurs by means of a communication between the two processes. For example, data requests are generated as particles enter voxels and intersection calculations require object data contained within the voxel. With replicated, writeable data there is the potential for cache coherence problems to occur. All accesses to the replicated spatial subdivision and fixed object parameters are read-only, removing one of the criteria for consistency problems. Object coefficients may be used at separate locations independently and, because this information merely augments the intersection calculations, consistency problems may reduce efficiency but will not cause erroneous computations. For example, a mail-box test could fail because a replicated object does not have the new intersection parameter. This will result in the need to perform additional computation, which obviously reduces performance, but will not affect the validity of the results. Access to the surface illumination mesh is only possible through the data manager holding that particular mesh. Potential consistency problems could occur if simultaneous access were made using two sequences of instructions such as "read-modify-write". Methods for eliminating this form of inconsistency include using locks, or converting the read-modify-write sequence into an atomic update command. This removes the need for a cache coherence protocol to control data access, which would inevitably reduce performance due to the additional computation and communication required to maintain consistency. The nature of the access to illumination meshes means that update operations provide sufficient write access during the particle simulation and rendering phases: it is sufficient to update a surface mesh after each interaction between particle and object. Where this update occurs will depend on the physical location of the illumination mesh associated with that object.
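The read-modify-write issue can be illustrated as follows: the mesh update is expressed as a single atomic operation executed only by the data manager that owns the mesh, so no coherence protocol is needed. The mesh layout, the lock-based variant and the message format are assumptions used purely for illustration.

    # A minimal sketch of converting the read-modify-write sequence into an
    # atomic update executed by the data manager that owns the illumination mesh.
    from threading import Lock

    class IlluminationMesh:
        def __init__(self, n_elements):
            self.flux = [0.0] * n_elements
            self._lock = Lock()

        def atomic_add(self, element, delta):
            # The owning data manager applies the whole update in one step, so a
            # remote processor never reads, modifies and writes back the value.
            with self._lock:
                self.flux[element] += delta

    # A remote processor simply sends ("update", object_id, element, delta) to
    # the data manager holding the mesh instead of fetching and rewriting it.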

8.3.2 Coherence
Throughout their development, computer graphics algorithms have exploited the concept of local similarities to enhance and optimise performance [167]. Path tracing systems, such as ray tracing and particle tracing, rely heavily on the coherence between objects arising due to their distribution in environmental space. This enables acceleration techniques to benefit from knowledge of the local neighbourhood structure, reducing the overall computation involved in following the path of a particle or ray. Further levels of coherence are identifiable, such as the consistency within areas of an image or within sequences of images. However, the strong object coherence associated with path tracing gives rise to locality of data reference and provides great potential for improving data management strategies. Green acknowledges the difficulty associated with quantifying the degree of coherence within an application [57]. However, he defined a normalised cumulative usage frequency based on the frequency of reference within a sequence of data usage. This measure ignores local temporal information by considering the total number of references to data items during problem computation. Consider the frequency of reference f of the N individual data items d within the environmental database D (d_i ∈ D), ordered such that:

    f(d_1) ≥ f(d_2) ≥ ... ≥ f(d_N)

The normalised cumulative usage frequency (cuf) is defined as:

    cuf(d_n) = ( Σ_{i=1}^{n} f(d_i) ) / ( Σ_{i=1}^{N} f(d_i) )

Results for a ray tracer operating on complex scenes consisting of up to 8400 objects demonstrate the coherence associated with the data references. In these images, up to 70% of all references were made to as little as 10% of the entire data space. Figure 8.7 shows similar results for a particle tracing application using three different scenes.
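The measure is straightforward to compute from a recorded trace of data item references; a short sketch follows (the reference trace is an assumed input).

    # A minimal sketch of the normalised cumulative usage frequency defined
    # above, computed from a recorded sequence of data item references.
    from collections import Counter

    def cumulative_usage_frequency(reference_trace):
        counts = sorted(Counter(reference_trace).values(), reverse=True)  # f(d1) >= f(d2) >= ...
        total = sum(counts)
        cuf, running = [], 0
        for f in counts:
            running += f
            cuf.append(running / total)
        return cuf   # cuf[n-1]: fraction of all references covered by the n most used items

    # Example: cumulative_usage_frequency(["a", "a", "a", "b", "b", "c"])
    #          -> [0.5, 0.833..., 1.0]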


Figure 8.7: Locality of data request within particle tracing (normalised cumulative usage frequency against the fraction of the database referenced, for the LargeRoom, LectureHall and MirroredRoom scenes).

In common with Green's ray tracing results, figure 8.7 demonstrates a non-uniform distribution of overall data references. In all examples there is a high usage of a small subset of the scene data space, but the actual usage distribution varies according to the scene being processed. The LectureHall and LargeRoom scenes show at least 70% of all data references being made to less than 5% of the data space. However, the MirroredRoom scene demonstrates a more uniform distribution of data requests. This can be explained by examining the nature of the scenes being simulated. Both the LargeRoom and LectureHall are diffuse environments (containing up to 3180 objects); this contrasts with the simple, highly reflective nature of the MirroredRoom scene. Due to this reflectance, the paths of the particles within the latter scene are longer. This, coupled with the particular geometry of the scene (containing an array of mirrored spheres illuminated from above), results in a more even distribution of particles within the environment. This, in turn, results in more uniform object usage.

Considering the data references made by individual particles gives another indication of the coherence within the problem. Each particle has an associated sequence of data references r_i, ..., r_j within the entire sequence of data references made by the problem solution. Figure 8.8 shows the frequency with which particles reference subsets of the data set for the MirroredRoom scene.


Figure 8.8: Locality of requests for individual particles.

These results show that within small temporal regions (the processing associated with a particle) the usage is mainly limited to a small proportion of the entire data set. However, with this environment, a significant proportion of the particles (9%) require access to at least 50% of the data set in order to complete their computation. This has serious implications for the data management strategy. In summary, particle tracing exhibits locality of reference due to the coherence in object space being exploited by the path tracing computations. Scenes demonstrate spatial locality, clustering object references within the data set, and temporal locality, favouring the reuse of limited subsets of the data. However, the degree to which these phenomena appear is dependent on the nature of the scene. For example, in highly reflective environments the computation involved with a single particle may require access to a large proportion of the data set.

8.3.3 Effects of data management
To show the effects of different data management strategies, a simulated parallel particle tracer is used. The advantage of this approach is that parameters, such as the period of computation and the distribution of data requests, are under the control of the user, rather than responding to the combination of the stochastic simulation and the properties of the scene adopted. Accurately simulating the requests for data arising from the particle tracing algorithm was achieved by basing the request distribution on information gathered from actual scene processing. The overall request distribution for a given scene can be evaluated by following the paths of a large number of particles. The implicit ordering of data items, created during scene construction, was removed by shuffling the distribution entries. This left a distribution showing the overall usage of data, even though it contains no temporal information about requests. The simulator applied an additional parameter to this distribution, constraining it to a given coherence in object identifiers. The range of object identifiers varies across the processors in the system, although overlap will occur. This models a successful task management strategy allocating work based on the data necessary for its completion. Although this may not be exactly the situation arising in an efficient parallel implementation, it provides a rough model. Figures 8.9 and 8.10 show results at two different levels of coherence for a 16-processor system. The simulator ran with a random task and data allocation strategy using a demand-driven computational model with routing optimisation techniques [24].
Achieving an efficient software implementation provides a different scenario from most VSM systems. On a multiprocessor system, the balance of miss penalty to hit time increases due to the latency of remote fetches. Minimising both characteristics is vital to increasing overall system performance. The performance of fully- and set-associative caches will be shown to be much higher than that of direct-mapped caches due to improved location usage. However, the software-based search mechanisms required for a pure associative cache would nullify this performance advantage. It was thus decided to trade off some potential cache storage space for an index of cached items in the associative cache structure. This enables instant location of entries within the cache, in a similar way to the direct-mapped approach. A set-based structure, attempting to reduce search delays, is no longer required as the search time is fixed for any size of cache. Set-associative caches have therefore not been considered, as under these constraints the fully-associative technique will perform better for a given replacement strategy.
For a given level of coherence, large data manager caches are capable of storing enough items to eliminate remote fetches. The adoption of cache warm starting enhances performance by enabling the cache to fill rapidly, anticipating the data requests from the application. The overheads associated with warm starting and placement mean the performance drops slightly as soon as a world data model is rejected. Reducing the cache size beyond a certain point has an adverse effect on solution time as cache misses occur due to there being insufficient space to store all the data items demanded by the coherence level. This gives rise to a situation where data must be fetched from other processors, resulting in a miss penalty due to the latency associated with this remote fetch. Figure 8.9 shows the situation for a 10% coherence level at each processor. For large cache sizes the direct mapping strategy is the most efficient due to the simplicity of its implementation. However, as the cache size reduces, the performance drops rapidly as poor cache utilisation causes high levels of cache miss. Random replacement on a fully-associative cache causes cache overwrite as incoming data is arbitrarily allocated a slot. This results in relatively poor performance in larger caches. The FIFO and FINUFO policies attempt to replace older cache items, anticipating that temporal coherence will favour more recently utilised data. This significantly out-performs the direct-mapped approach, although the advantage over random replacement decreases as the cache size is reduced. At very small cache sizes, these three fully-associative replacement strategies perform similarly, as their usage predicting techniques fail due to the small number of available slots. The LRU policy provides a better prediction of temporal usage and the overheads associated with maintaining this knowledge are rapidly outweighed by the increased cache utilisation. At this level of coherence, LRU provides similar or better cache performance than the other cache strategies. Increasing the level of fetch coherence should favour the more advanced replacement strategies due to the reduction in data variation. This is demonstrated in figure 8.10, which utilises a coherence of 5%, with an increase in the benefits of LRU over the other replacement strategies.
For this coherence level, as expected, FINUFO shows an advantage over FIFO for all but the smallest cache sizes. Increasing the system size should exacerbate these performance differences as the fetch latency would be expected to increase. Badouel et. al. utilise cache-based VSM in a distributed-memory ray tracing application [9] using distributed data. They utilised a similar system with a software cache partitioned into resident and non-resident areas. Reducing the cache size relative to the data set dramatically increased the communication burden, rapidly increasing the overall solution time. They suggest increasing the node storage capacity or number of nodes to overcome this problem. However, available storage is usually fixed by the physical memory associated with a processor, and will decrease relative to the data set as the scenes being modelled become inevitably larger and more detailed. Similarly, increasing the number of nodes in the system may exacerbate communication problems rather than improve them. In summary, adopting a full-associated cache in the data manager using a least-recently-used replacement policy maximises cache performance. The introduction of a cache index means the cache policies evaluated are not pure techniques as described within the terms of (usually hardware-based) VSM. The desire for increased performance of a software implementation leads to the optimisation of standard approaches for the given application.

8.3.4 Data Allocation Movement of data within a system consisting of distributed caches can be categorised as:

131

3.0

Direct mapping Random fully associative

2.5 Relative execution time

FIFO fully associative FINUFO fully associative LRU fully associative 2.0

1.5

1.0

20

19

18

17

16 15 Cache size (% of dataset)

14

13

12

Figure 8.9: Caching strategy performance at a coherence level of 10% .

3.5 Direct mapping Random fully associative FIFO fully associative

3.0

FINUFO fully associative

Relative execution time

LRU fully associative

2.5

2.0

1.5

1.0

13

12

11 10 Cache size (% of dataset)

9

Figure 8.10: Caching strategy performance at a coherence level of 5% .

 

data allocation – initial partitioning of data space across system. data migration – movement of data in response to ongoing computation.

132

8

The initial distribution of data across the processors becomes important in systems that require resident set to be fixed. This distribution aims to reduce the impact of remote data accesses that occur throughout the system during problem computation. This is based on the knowledge that the latency associated with these accesses will adversely affect performance. Remotely updating an illumination data structure requires a communication to the processor holding the structure. Ideally, these must also be minimised by performing work close to the location of the data structure. The aim of data allocation is thus to minimise the number of remote data fetches and remote mesh updates occurring during problem solution. This distribution must be achieved within the constraints that are imposed by the fixed storage capacity at each node. This problem requires that available storage area is partitioned into space for cache entries and space for the object illumination structures. The total illumination storage is determined by the scene and is known prior to the computation starting. Data allocation can be achieved with or without information about the problem under consideration:

  

randomly – the benchmark strategy. uninformed – attempting to predict data usage prior to computation starting. informed – utilising additional information gained from computation such as from a profiling run.

Having information concerning the form of the problem solution enables it to be utilised by the data allocation strategy. This information may include a rough idea of data item usage patterns and illumination update distributions. The availability of information enables a cost function to be allocated to each data item, approximating the implications associated with its allocation. Although complex cost functions can be created incorporating location and neighbour parameters, a simple weighting of data items has proven sufficient. Allocation then becomes a problem of distributing cost over the processors.

8.3.5 Data Fetch Mechanism

The data fetch mechanism determines how the data manager obtains requested data items. This includes the fetch policy and the path of the request. The fetch policy determines when the fetch takes place with respect to the application request: either fetch-on-demand, occurring after the application request, or prefetch, occurring prior to the demand from the application. The adoption of caches containing resident sets means that one location of each data item (the home location) is always known. This fact can be exploited, although other copies may exist closer to the requesting processor. The data fetch path can be:

• naive – following a preset path through the processors.
• broadcast – broadcast to all processors.
• home – direct to the home data location.

The broadcast request guarantees minimum request latency, but has the drawback that multiple copies of the requested data item could be located and returned simultaneously to the source processor. However, redundant message elimination techniques can help reduce the effect of these unwanted duplicates [24]. Due to replication of data within the system, processors on the search path may hold copies of the required items within their caches. In this case, the snooping performed at this processor results in the data being returned and the termination of the request far short of its anticipated destination.


8.3.6 Demand Fetch

Demand fetch occurs in response to the request for data item(s) by the application. Pure naive requests would follow a fixed Hamiltonian path through the processors, visiting each in turn. This can be achieved in a cyclic manner, with all requests circulating in the same direction, or by utilising independent Hamiltonian paths designed to spread the communication density over the links. Figure 8.11 shows how the path taken by the fetch request affects overall solution time in a simulated application on a 32-processor configuration. The linear request path follows a Hamiltonian circuit through the processors, searching until it finds a copy of the required data. As the cache size is reduced, the miss ratio increases, resulting in more remote fetches. The relative scarcity of replicated data limits the effectiveness of techniques such as poaching [24], causing both the message density and the fetch latency to increase. Broadcasting the request causes messages to propagate in all directions from the requesting processor. This potentially reduces the fetch latency at the expense of increasing overall message density in comparison to a single request. At low cache sizes the performance of this strategy improves over the Hamiltonian path, as low data replication results in lower latency for approximately the same total number of messages.

[Plot: relative solution time against cache size (% of dataset) for the Hamiltonian path, broadcast request and home location path fetch mechanisms.]

Figure 8.11: Solution time for varying demand fetch request paths.

The final mechanism shown in figure 8.11 utilises the known location of the item within a resident set, and significantly outperforms the other mechanisms. It reduces the average number of messages per fetch, and the decrease in overall communication is seen as reduced execution time. The results demonstrate that utilising the additional information provided by a resident set approach can reduce fetch latencies within the virtual shared memory system.
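The difference between the Hamiltonian and home location request paths can be sketched as follows, under the assumption of a simple ring ordering of processors and a query that says whether a given processor currently holds an item. Both functions count request messages only and are purely illustrative.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical query: does processor p currently hold item 'id' in its cache
// or resident set?
using HoldsItem = std::function<bool(std::size_t p, std::size_t id)>;

// Demand fetch along a Hamiltonian ring: forward the request processor by
// processor until a copy is found. Returns the number of request messages.
std::size_t fetchViaRing(std::size_t source, std::size_t id,
                         std::size_t numProcessors, const HoldsItem& holds) {
    std::size_t hops = 0;
    std::size_t p = source;
    do {
        p = (p + 1) % numProcessors;    // next processor on the fixed circuit
        ++hops;
    } while (p != source && !holds(p, id));
    return hops;                         // worst case: a full tour of the ring
}

// Demand fetch using the resident set: the home of every item is known, so a
// single request message is sent directly to it.
std::size_t fetchViaHome(std::size_t /*source*/, std::size_t /*id*/) {
    return 1;
}
```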

8.3.7 Prefetch

Prefetching utilises knowledge about the expected data requirements of an application to move data close to where it is required before the data request is issued [122]. Essential elements of a VSM-based prefetch implementation are the prefetch horizon and the fetch path. The horizon determines how far ahead of the computation the prefetch is going to occur. To reduce cache space requirements, this should be the minimum value that eliminates stall. However, increasing the horizon gives the prefetch strategy more chance to hide fetch latencies. The prefetch can be application-initiated or data manager-initiated. The former requires the application to determine when data must be prefetched prior to its actual use during computation; the latter utilises a prediction mechanism within the data manager to bring data to the local cache prior to the application request. Tracing a particle through voxelised space results in the grouping of data requests as the particle enters occupied voxels, due to the need to test for intersection with all objects within the voxel. The stochastic nature of the particle tracing approach means that prediction of future data usage from previous requests is difficult, which reduces the likely performance of a data manager-initiated prefetch strategy. The computations occurring at the application process give more information as to potential data usage and suggest that prefetch should be initiated there. The algorithmic requirement of testing for intersection with all objects means that all objects in a voxel will be required and can be prefetched together. Removing the ordering of intersection tests within a voxel increases the tolerance of latency associated with particular items, but may have implications for the computation required in the intersection tests. Figure 8.12 shows a simulated particle tracing application utilising various prefetch request paths on a 32-processor configuration. Demand fetch using multiple requests heading towards the home location, as suggested by figure 8.11, provides the benchmark strategy. This and broadcast fetch generate a large number of messages, rapidly reducing performance. As with the pure demand fetch scenario, the relative performance of broadcast increases as the cache size reduces.
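A minimal sketch of this application-initiated prefetch is given below: on entering a voxel, all of the voxel's objects are requested in one grouped prefetch, and the (now unordered) intersection tests are performed on whichever objects are already resident, deferring the rest. The DataManager interface and the stub bodies are assumptions for illustration only.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical data-manager interface; the stub bodies stand in for the real
// asynchronous fetch machinery.
struct DataManager {
    void prefetch(const std::vector<std::size_t>&) { /* issue one grouped request */ }
    bool isResident(std::size_t) const { return true; /* assume the data arrived */ }
};

struct Ray { /* origin, direction, ... */ };

// Placeholder geometric test for one object.
bool intersect(const Ray&, std::size_t /*objectId*/) { return false; }

// Intersect a ray with the contents of a voxel, tolerating fetch latency by
// testing whichever objects are already resident and deferring the rest.
bool intersectVoxel(DataManager& dm, const Ray& ray,
                    const std::vector<std::size_t>& voxelObjects) {
    dm.prefetch(voxelObjects);                        // grouped prefetch on voxel entry

    std::deque<std::size_t> pending(voxelObjects.begin(), voxelObjects.end());
    bool hit = false;
    while (!pending.empty()) {
        std::size_t id = pending.front();
        pending.pop_front();
        if (!dm.isResident(id)) {                     // not arrived yet: defer it
            pending.push_back(id);                    // (degenerates to a stall if
            continue;                                 //  nothing else is resident)
        }
        hit = intersect(ray, id) || hit;              // order of tests no longer matters
    }
    return hit;
}
```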

[Plot: relative solution time against cache size (% of dataset) for demand fetch and for prefetch using broadcast, home location, nearest location and spanning tree request paths.]

Figure 8.12: Solution time for various prefetch request paths.

The home and nearest prefetch paths attempt to reduce overall communication density by using a single message during the fetch. This proves successful, with the more intelligent strategy of visiting the nearest home location first significantly improving performance over the demand and broadcast paths. However, a spanning tree based request path reduces solution time further by decreasing the overall number of messages involved in the fetch [24]. As the cache size decreases, the performance differential increases as poaching becomes less effective. A spanning tree based request path therefore seems advisable for fetches consisting of multiple data items.

8.3.8 Prefetch horizon

Increasing the prefetch horizon delays the onset of data stall, allowing the prefetch strategy more time to obtain remote items. However, this requires increased storage capability and may result in cache over-write in a fixed-size cache. Methods of altering the prefetch horizon during a voxelised particle tracing algorithm include:

• increasing the number of objects in a voxel.
• fetching data required for subsequent voxels.

Increasing the maximum voxel occupancy alters the spatial subdivision structure, as this forms one of the voxel division criteria, and potentially results in fewer voxels containing more objects. Prefetching data for subsequent voxels also increases the prefetch horizon. However, both techniques have drawbacks: changing the maximum voxel occupancy may adversely alter the computational balance between voxel traversal and intersection tests, and fetching data that may not be required is potentially performance reducing. An increase in voxel occupancy results in a smaller octree containing fewer voxels and requiring fewer traversal computations. However, higher occupancy means that more intersection tests must be performed during computation, resulting in an increase in overall sequential solution time. Figure 8.13 shows that this trend is duplicated in a parallel implementation: increasing the voxel occupancy above ten causes the solution times to increase.
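The role of the maximum voxel occupancy as a subdivision criterion can be sketched as below, where a voxel is split into eight children only while it holds more than maxOccupancy objects, subject to a depth limit standing in for a minimum voxel size. The geometric helpers carry placeholder bodies and the node layout is an assumption, not the octree used in the experiments above.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

struct Aabb { float min[3], max[3]; };                  // axis-aligned voxel bounds

// Placeholder geometric helpers: real code would clip each object against the
// voxel and halve the parent extents for each octant.
bool overlaps(const Aabb&, std::size_t /*objectId*/) { return true; }
Aabb childBounds(const Aabb& parent, int /*octant*/)  { return parent; }

// Octree voxel: a leaf holds object references, an interior node eight children.
struct Voxel {
    Aabb bounds{};
    std::vector<std::size_t> objects;
    std::array<std::unique_ptr<Voxel>, 8> children;
};

// Subdivide a voxel while it exceeds the maximum occupancy, up to a depth limit.
// A larger maxOccupancy gives fewer, fuller leaf voxels.
void subdivide(Voxel& v, std::size_t maxOccupancy, int depthLeft) {
    if (v.objects.size() <= maxOccupancy || depthLeft == 0)
        return;                                         // occupancy criterion satisfied
    for (int octant = 0; octant < 8; ++octant) {
        auto child = std::make_unique<Voxel>();
        child->bounds = childBounds(v.bounds, octant);
        for (std::size_t id : v.objects)
            if (overlaps(child->bounds, id))
                child->objects.push_back(id);           // an object may span several children
        subdivide(*child, maxOccupancy, depthLeft - 1);
        v.children[octant] = std::move(child);
    }
    v.objects.clear();                                  // interior nodes hold no objects
}
```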

[Plot: solution time (seconds) against cache size (% of dataset) for maximum voxel occupancies of 9, 10, 11, 12 and 14.]

Figure 8.13: Parallel solution time for variations in octree voxel occupancy.

It should be noted that the lower bound on occupancy is formed by the maximum number of objects meeting at any point in space. A perfect cubic arrangement of boxes results in eight objects being adjacent to certain points. The maximum voxel occupancy must be higher than the maximum occupancy of any point within the scene unless minimum voxel size limits are enforced. Within architectural environments this occupancy can be exceeded, for example between floors, giving support to the adoption of a maximum voxel occupancy of ten objects. Altering the prefetch horizon by precomputing the voxel traversal order means that speculative prefetch occurs – fetching data that may not be utilised.


8.3.9 Prefetch particle tracing

The distribution of voxel occupancy will affect the efficacy of voxel-based prefetch particle tracing. Figure 8.14 shows prefetch versus demand fetch for a particle tracing application on a 32-processor configuration. In this example the parallel system allocates data and tasks randomly over the processors. The adoption of a prefetch voxel intersection algorithm improves cache hit ratios at the data manager, as the latency associated with fetching the required data is hidden. Reducing the cache size causes higher miss ratios, enabling the prefetch version to improve its performance relative to demand fetch.

[Plots: cache miss ratio (%) and relative solution time against cache size (% of dataset) for demand fetch and for prefetch.]

Figure 8.14: Cache miss ratios and solution time for demand fetch versus prefetch particle tracing.

Prefetch data requests can reduce the stall associated with remote fetch latency. However, as cache sizes reduce, two problems become apparent:

• cache overwrite.
• the prefetch horizon needs to alter dynamically.

For small cache sizes, overwrite of prefetched data means that prefetch cannot cope as a solitary latency hiding technique. Highly reflective scenes [172] demonstrate this phenomenon.

8.3.10 Multi-threading

Multi-threading can be used to hide fetch latencies by context switching between active threads. Data stall is avoided during the fetch by allowing another application thread to execute, providing alternate computation that is not affected by the unavailability of particular data items.


In practice, the problem with multi-threading is that increasing the number of threads to counteract large latencies actually increases the latency, and hence is self-defeating. The single thread approach is efficient until the cache can no longer contain the items demanded. The solution time then rises sharply as the hit ratio drops and remote fetch latency impinges on application idle time. Adopting a multiple thread model introduces the concomitant overheads associated with controlling and scheduling multiple threads. This overhead may become significant to overall execution time due to the large number of communications occurring during problem solution. As cache space is reduced, two application threads are capable of hiding fetch latency by performing computation on present data items. The advantages of having alternate computation to perform outweigh the control overheads, and below this “break-even” point the solution time is lower. Reducing the cache size further results in increased cache misses and the solution time rises. Operating a switch-on-cycle policy increases the request density despite the data manager attempting to aggregate requests. Even if combined with communication optimisation techniques such as redundant message elimination, the performance of switch-on-cycle is lower than block-switched multi-threading. Introducing further threads onto the processing element increases the control overhead as the thread controller monitors the additional threads. Even with a world data model, the monitoring overhead attached to every Application Process to Data Manager communication has a considerable impact. The relative effect of an additional thread reduces as the number of threads increases due to the fixed overhead associated with monitoring each extra thread. Providing extra threads enables the processing element to absorb fetch latencies more effectively. However, this is always offset by the additional control overhead, meaning that for this particular simulation no benefit is derived from having more than three threads available. Figure 8.15 shows how the request coherence affects the performance of multi-threaded processing elements.
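A minimal sketch of block-switched scheduling is shown below, modelling the application threads as cooperatively scheduled contexts: a context runs until it completes or stalls on a non-resident item, at which point the processing element switches to another runnable context (switch-on-cycle would instead switch after every data request). The Context interface and its stub bodies are illustrative assumptions.

```cpp
#include <deque>

// Hypothetical application context (thread). The stub bodies stand in for the
// real particle tracing computation and data-manager queries.
struct Context {
    bool finished = false;
    bool blocked  = false;
    // Placeholder: a real context would compute until it completes or stalls
    // on a cache miss; here it simply completes.
    void runUntilBlockedOrDone() { finished = true; }
    // Placeholder: a real context would ask the data manager whether the item
    // it is waiting for has arrived.
    bool dataArrived() const { return true; }
};

// Block-switched scheduling: whenever a context blocks on a remote fetch, the
// processing element switches to another runnable context so that the fetch
// latency is absorbed by useful computation.
void scheduleBlockSwitched(std::deque<Context*> contexts) {
    while (!contexts.empty()) {
        Context* c = contexts.front();
        contexts.pop_front();
        if (c->blocked && !c->dataArrived()) {
            contexts.push_back(c);          // still waiting: run someone else
            continue;
        }
        c->blocked = false;
        c->runUntilBlockedOrDone();         // context-switch overhead is paid here
        if (!c->finished) {
            c->blocked = true;              // it stopped on a cache miss
            contexts.push_back(c);          // re-queue until its data arrives
        }
    }
}
```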

[Plots: relative solution time against cache size (% of dataset) for a single AP and for 2, 3, 4 and 5 threads, at (a) 10% request coherence and (b) 20% request coherence.]

Figure 8.15: Solution time for a multi-threaded application at different coherence levels.

Multi-threading provides a mechanism for absorbing fetch latency within other processing. Its efficacy is limited by:

• control overheads.
• cache overwrite.
• increased communication burden.

So, in summary, multi-threading can improve the performance of applications at low cache sizes, and the adoption of large environmental models means that such low relative cache sizes will inevitably occur. Anticipating the optimum number of threads for a particular problem is, however, difficult.

8.3.11 Combining Latency Hiding Techniques

The latency hiding techniques described in the previous sections may be combined in a multiply-threaded, prefetching application. Utilising block-switched threads to overcome the data stall arising from prefetch failures may reduce overall processing time. However, issues such as cache overwrite may reduce the effectiveness of this approach. Gupta et al. evaluated the performance benefits of multi-threading in combination with prefetch in a coherent cached, shared memory scenario. They determined that success was dependent on low context switch overhead and minimal cache interference (e.g. overwrite); the combination of both techniques could actually reduce overall performance [68]. Minimisation of cache interference is achieved if the threads are performing tasks with similar data requirements [77]. The adoption of a multi-threaded application emphasises this fundamental guideline for task management strategies.

The combination of prefetch and multiple contexts demands that a consistent prefetch strategy is adopted – aware of the duplicated requests made by individual threads. Allowing the Data Manager control over prefetch actions ensures that this occurs. Prefetch requests arising from the threads are combined and scheduled at the Data Manager. The time to process these requests adds to the overall context switch time provided by the processor, meaning they must be handled efficiently. Figure 8.16 illustrates a multiply-threaded simulator utilising prefetch on a 16-processor configuration. The multiple thread overhead causes lower performance for large caches, but for small caches the advantage of reduced cache miss ratios proves beneficial. Figure 8.17 shows a multiply-threaded particle tracing application utilising prefetch. The DiningRoom2 scene is implemented on a 16-processor system using random allocation of tasks and data. This results in very low request coherence between tasks and hence between contexts, which hampers a multi-threaded approach as prefetched data items are overwritten by data from other contexts. In this case the overhead associated with adopting multiple threads is not recouped by the reduction in cache miss ratio.

In summary, for small caches, combining prefetch and multi-threading can enhance performance. However, this is dependent on the task and data management strategies maintaining cache coherence and reducing interference between contexts. Random data and task allocation strategies cannot provide high enough coherence to enable the combination of latency reduction techniques to improve solution time over a prefetch-only application.

8.3.12 Profiling

Profiling attempts to gauge the computational structure of a problem by performing selected examples of the problem's calculations. These calculations are deemed typical of the total problem computation, and a problem profile is extrapolated from the measured characteristics. For a particle tracing application, the stochastic nature of the simulation means that no one particle can be considered typical of the entire computation. However, simulating a number of particles may provide information about the overall computation [171]. Figure 8.18 shows the distribution of terminal voxel accesses for the fifty most referenced voxels within the LargeRoom scene. The distribution for various numbers of particles demonstrates that accurate voxel usage may be determined using low numbers of particles.
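A sketch of such a profiling run is given below: a modest number of particles is traced and the accesses to each terminal voxel are tallied, yielding an estimated usage distribution that could later feed an informed data allocation. The particle-tracing callback is an assumed stand-in for the real simulation.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for tracing one particle: it reports every terminal voxel the
// particle visits through the supplied 'visit' callback.
using TraceParticle =
    std::function<void(const std::function<void(std::size_t voxel)>& visit)>;

// Run a short profiling pass: trace 'numParticles' particles and return the
// fraction of all terminal-voxel accesses made to each voxel.
std::vector<double> profileVoxelUsage(std::size_t numVoxels,
                                      std::size_t numParticles,
                                      const TraceParticle& traceParticle) {
    std::vector<std::size_t> accesses(numVoxels, 0);
    std::size_t total = 0;
    for (std::size_t i = 0; i < numParticles; ++i)
        traceParticle([&](std::size_t voxel) {        // tally each voxel access
            ++accesses[voxel];
            ++total;
        });

    std::vector<double> fraction(numVoxels, 0.0);
    if (total > 0)
        for (std::size_t v = 0; v < numVoxels; ++v)
            fraction[v] = static_cast<double>(accesses[v]) / static_cast<double>(total);
    return fraction;
}
```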

8.3.13 Task migration

The migration strategy must be tolerant of varying conditions, providing beneficial task movement at all stages of computation.

[Plots: cache miss ratio (%) and relative solution time against cache size (% of dataset) for a single prefetch AP and for 2 prefetch APs.]

Figure 8.16: Cache miss ratios and solution time for prefetch and prefetching, multi-threaded simulator.

The migration criterion, determining when migration takes place, is based on the availability of the data required to continue processing the task. The strategy is altered during the later stage of computation by the arrival of a broadcast message from the system controller indicating that no more tasks will be generated. The load-balancing strategy then enters a redistribution phase while the remaining tasks are processed. The migration threshold is defined as the ratio of available data to the data required to enable the next stage of processing. A threshold near 0% results in data moving towards the tasks; higher migration thresholds cause tasks to move to the processors containing the data. Figure 8.20 illustrates how the adoption of task migration is beneficial to system performance. The results show a simulation of 50000 particles within the DiningRoom1 scene on systems of 16, 32 and 63 processors. The benchmark strategy was a random task and data allocation strategy using a demand-driven computational model. This was compared with conceptual task allocation based on overlapping regions created utilising techniques outlined in the previous sections. The data allocation strategy used a precomputed usage prediction to assign data items within conceptual regions whilst attempting to reduce the number of remote flux updates occurring within the system. This technique utilised a greedy bin-packing algorithm to minimise anticipated remote updates within the constraints of the chosen regions. Figure 8.20 shows that the adoption of task migration was beneficial at all times. The performance advantage improved as the cache size was reduced and the likelihood of data not being present in the cache increased. This would be expected, as the random task and data allocation approach suffers from low coherence coupled with the additional overhead of a high proportion of flux updates being remote. Both these phenomena result in additional communication, adding to the communication burden.
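The migration criterion itself can be sketched as below: before the next stage of a task is processed, the fraction of the required data already held locally is compared against the migration threshold and, if it falls short, the task is sent towards the processor that is home to most of that data. The interfaces carry placeholder bodies and the load-balancing overrides mentioned above are omitted.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical view of the local data manager; the stub bodies stand in for
// real residency queries and the known home location of each item.
struct LocalStore {
    bool holds(std::size_t itemId) const { return itemId % 2 == 0; }
    std::size_t homeOf(std::size_t itemId) const { return itemId % 16; }
};

// Decide whether to migrate a task, given the data items its next stage needs.
// Returns the destination processor, or 'self' if the task should stay put.
std::size_t migrationTarget(const LocalStore& local, std::size_t self,
                            const std::vector<std::size_t>& requiredItems,
                            double threshold /* e.g. 0.35 for a 35% threshold */) {
    if (requiredItems.empty()) return self;

    std::size_t present = 0;
    std::vector<std::size_t> votes(64, 0);           // assumed maximum processor count
    for (std::size_t id : requiredItems) {
        if (local.holds(id)) ++present;
        ++votes[local.homeOf(id)];                   // tally the home of every required item
    }
    double available = static_cast<double>(present) / requiredItems.size();
    if (available >= threshold) return self;         // enough data is local: keep the task

    // Otherwise move the task towards the processor holding the largest share.
    std::size_t best = 0;
    for (std::size_t p = 1; p < votes.size(); ++p)
        if (votes[p] > votes[best]) best = p;
    return votes[best] > 0 ? best : self;
}
```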


[Plots: cache miss ratio (%) and relative solution time against cache size (% of dataset) for a single prefetch AP and for 2 prefetch APs.]

Figure 8.17: Cache miss ratios and solution time for prefetch and prefetching, multi-threaded particle tracing.

[Plot: voxel access (% of all accesses) over the terminal voxels, profiled with 100, 500, 1K, 5K, 10K, 50K and 100K particles.]

Figure 8.18: Distribution of voxel access for variable length profile runs.

8.3.14 Results handling

The processing of individual tasks leads to a number of surface flux updates as the particle traverses the environment. These flux updates may be considered as partial results because they each contribute to the final solution – the global illumination of the environment.

[Plot: solution time (seconds) against migration threshold (% of voxel data present) for 4 AMPs and for 8 AMPs with equal local or equal total cache space.]

Figure 8.19: Task migration threshold.

[Plot: solution time (seconds) against cache size (% of data set) for 16, 32 and 63 AMP configurations, with random task management (no migration) and with 35% task migration.]

Figure 8.20: Solution time for conceptual allocation with task migration against random task allocation on 16, 32 and 63-processor AMP configurations.

In order to compute the final illumination, all flux updates pertinent to each object must be combined to give an overall object illumination. This stage may be performed just prior to the illumination calculation, such as within the rendering phase. However, the adoption of a progressive refinement approach and the need to store large numbers of updates mean that on-the-fly object flux update is preferable.


This requires partial results to be moved to the data manager holding the flux information of that object. If the local data manager is storing the relevant surface flux information, then a local update can be performed. Remote flux updates require the update result to be communicated to the processor holding the object. However, in line with the virtual shared memory adopted within the system, as far as the application is concerned the update takes place as if the information were local. The adoption of the resident set approach within the data manager means the data management strategy knows the location of the processor containing the surface flux of the required object, so the construction of a message containing the necessary update information is possible. The actual update information is minimal, and consists of:

• object identifier – the object causing the interaction.
• object location – the location within the object.
• flux change – the parameter of the change of flux.

The small size of this update message means that an individual communication for each update is wasteful. Unnecessary communication can be eliminated by grouping these updates together into bunches. Burst-mode flux updates require storing all updates destined for individual processors at their source until they can be released within a single message to the given processor. The introduction of a remote update cache within the data manager can hold updates, sorted by final destination, until triggered by lack of space. This requires the allocation of a storage area and the adoption of a trigger criterion. The desire to spread communications over the processing time meant that a simple fixed cache space system was implemented. Each processor was assigned the same space, and would send the updates when this storage area was full, the intention being that the essentially random nature of update locations would result in an even release of messages. The adoption of a cached update system requires a mechanism for releasing all held updates into the system at any particular time. For example, it may be required to ensure flux continuity prior to rendering in a progressive refinement approach. This can be achieved by broadcasting a flush-update-cache message to all processors; the arrival of the message at a data manager triggers a complete release of the remaining cached updates.
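A minimal sketch of such a remote update cache is given below: flux updates are buffered per destination processor and released as a single burst either when that destination's fixed storage area fills or when a flush is requested, for example on receipt of a flush-update-cache message. The FluxUpdate fields and the send callback are illustrative assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// One partial result: a change of flux at a location on an object.
struct FluxUpdate {
    std::size_t objectId;     // object causing the interaction
    float       u, v;         // location within the object
    float       deltaFlux;    // parameter of the change of flux
};

// Buffers remote flux updates per destination and releases them in bursts.
class UpdateCache {
public:
    using Send = std::function<void(std::size_t dest, const std::vector<FluxUpdate>&)>;

    UpdateCache(std::size_t perDestCapacity, Send send)
        : capacity_(perDestCapacity), send_(std::move(send)) {}

    // Queue an update destined for the processor holding the object's flux data.
    void add(std::size_t destProcessor, const FluxUpdate& update) {
        auto& bucket = buckets_[destProcessor];
        bucket.push_back(update);
        if (bucket.size() >= capacity_) {      // storage area full: burst it out
            send_(destProcessor, bucket);
            bucket.clear();
        }
    }

    // Release every held update, e.g. on receipt of a flush-update-cache message.
    void flush() {
        for (auto& [dest, bucket] : buckets_)
            if (!bucket.empty()) {
                send_(dest, bucket);
                bucket.clear();
            }
    }

private:
    std::size_t capacity_;
    Send send_;
    std::unordered_map<std::size_t, std::vector<FluxUpdate>> buckets_;
};
```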

8.4 Conclusions

Realistic image synthesis by particle tracing exhibits characteristics of typical complex scientific and engineering problems in that it is computationally intensive and has large data requirements. Increasing current levels of image realism will require more accurate simulations of the intricate interactions between light and objects with complex optical characteristics. This fact, coupled with the desire for larger and more detailed environmental models, means that computational and data requirements are inevitably rising. The independent nature of particle path simulation seems amenable to parallel processing. However, three factors determine that this cannot be considered an “embarrassingly parallel” problem: the variations in computational complexity associated with each particle; the adoption of distributed data access; and the implications of illumination storage. Distributed data access removes the limitation that the entire problem data domain must be stored on each processor. This becomes necessary as problem data requirements outgrow local storage capabilities. Additionally, precision in the lighting represented in the image demands high resolution of illumination information. The ensuing distributed illumination storage requires updating during ongoing computation to enable interactive simulation.

Data management

The stochastically-driven particle path determines that individual particles may require access to a large proportion of the environmental data set during their computation. The adoption of software-based Virtual Shared Memory enables the application to transparently access items forming the global problem data domain. Distribution of this data space across the parallel resources leads to discrepancies in the latency associated with data access from remote locations versus local access.

Stalling application computation during remote fetch operations rapidly leads to serious performance degradation. The latency associated with remote data fetches must be reduced to minimise the detrimental effect of this data stall. The locality of data reference demonstrated by the particle tracing application suggests the use of caches to store data close to where it will be used in the future. For some scenes, up to 90% of all data references were made to less than 5% of the data space. This clustering of references in time and space results in the cache reducing average data access times. Restructuring of the data access methods within the particle tracing application enables the overheads associated with maintaining system-wide cache coherence to be avoided. Within multiprocessor systems, the latency occurring on a cache miss can be unpredictable and exhibit large variations due to the need to fetch required data from remote locations. The high miss penalty suggests the use of advanced placement and replacement strategies to maximise cache performance. The implementational overheads associated with these policies were rapidly outweighed by the increased cache utilisation, resulting in improved performance. Reduction of the available cache size emphasises these benefits, with improved cache utilisation resulting in up to a 50% reduction in total solution time.

Prefetching data items can minimise the occurrence and effect of data stall by issuing requests before the application needs particular data items. This strategy was enhanced by a reformulation of the algorithm determining the particle's path: relaxing the strict ordering of calculations enabled the application to be more tolerant of fetch latency. The efficacy of data prefetch was improved by utilising a request path through the system configuration based on an intelligent spanning tree. The ensuing reduction in messages over fetches based on broadcast or Hamiltonian paths reduces overall communication density and solution time. With a small cache enforcing the acquisition of remote data, a spanning tree prefetch policy resulted in a 30% reduction in solution time for a simulated particle tracing application on a 32-processor configuration. Increasing the prefetch horizon improves the likelihood of fetch latencies being hidden by other processing. However, this strategy can be self-defeating if cache overwrite deletes useful data items. Speculative prefetching of potentially unused data items proves detrimental with reduced cache size as overwrite becomes more prevalent. Prefetching improves system performance, although failures still occur as unpredictable requests or computation lead to stall when data requirements are not anticipated.

Multi-threading provides a mechanism for minimising the effect of data stall by providing alternate computation that is not affected by the unavailability of particular data items. Theoretically, the introduction of additional threads allows arbitrary amounts of stall to be hidden. However, practical considerations such as the increasing overheads associated with controlling multiple contexts eventually nullify the benefits. The incorporation of extra threads also increases the contention for available cache slots if the coherence between tasks is not high. Network traffic resulting from the additional requests for data also proved detrimental. These factors mean that the benefits of increasing the number of threads are rapidly outweighed unless data request coherence is high.
The combination of these data management techniques has been shown to be beneficial if high request coherence can be maintained. The minimisation of cache interference due to low request coherence depends on threads performing tasks with similar data requirements. This data coherence between threads is determined by the adopted task management strategy.

Task scheduling

Task scheduling attempts to distribute work in order to maintain processor utilisation throughout the problem solution. A load-balanced system utilises an equal division of work across homogeneous system resources to minimise idle time. A successful task management strategy also attempts to maximise local data coherence to aid the data management techniques. Defining what constitutes a task, the elemental form of distributed work, is always difficult and has far-reaching implications. The variations in computational complexity associated with the chosen particle tracing tasks were shown to span several orders of magnitude. The task generation time means that distributing tasks in packets containing variable numbers of tasks increases performance.


Using logarithmic tapering of the initial and final packet sizes enabled processor utilisation to climb rapidly as computation commenced, and aided load-balancing near problem completion. Allocation of tasks to processors may be achieved statically or dynamically, represented by the data-driven and demand-driven computational models respectively. The demand-driven approach copes better with variations in task computational complexity due to the explicit load-balancing associated with the allocation mechanism. The data-driven strategy failed to cope with the unpredictable variations in complexity, a failing that was accentuated by increasing system size. A world data model application demonstrated these failings with a speed-up of only 15, against 56 achieved by a demand-driven approach on 63 processors. A particle's path can result in the processor performing the calculation not holding the requisite data items. This will naturally happen as particle motion leads away from the given conceptual region, although blurring of the region boundaries also occurs due to the size of objects and the transient nature of the cache contents. At a given point, it may become beneficial to perform the task elsewhere due to the lack of appropriate data. The migration of tasks between processors occurs to balance the computational load, due to the distributed task acquisition strategy, and, in a separate process, to maintain coherence between cache and tasks. A migration threshold based on the availability of local data, with additional overriding criteria to assist load-balancing, reduces solution times.


Bibliography [1] W. B. Ackerman. Data flow languages. In N. Gehani and A. D. McGettrick, editors, Concurrent Programming, chapter 3, pages 163–181. Addison-Wesley, 1988. [2] G. S. Almasi and A. Gottleib. Highly Parallel Computing. Benjamin/Cummings, Redwood City, California, 2nd edition, 1994. [3] G. M. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS, volume 30, Atlantic City, Apr. 1967. AFIPS Press, Reston, Va. [4] M. Annaratone et al. Warp architecture and implementation. In 13th Annual International Symposium on Computer Architecture, pages 346–356, Tokyo, June 1986. [5] B. Arnaldi, T. Priol, L. Renambot, and X. Pueyo. Visibility masks for solving complex radiosity computations on multiprocessors. Parallel Computing, 23(7):887–897, jul 1997. Special Issue on Parallel Graphics and Visualisation. [6] B. Arnaldi, X. Pueyo, and J. Vilaplana. On the division of environments by virtual walls for radiosity computation. In Photorealism in Computer Graphics, pages 198–205, 1991. Proceedings 2nd EG Rendering Workshop. [7] J. Backus. Can programming be liberated from the von Neumann style functional style and its algebra of programs. Communications of the ACM, 21(8):613–641, 1978. [8] D. Badouel, K. Bouatouch, and T. Priol. Distributed data and control for ray tracing in parallel. IEEE Computer Graphics and Applications, 14(4):69–77, 1994. [9] D. Badouel, K. Bouatouch, and T. Priol. Distributing data and control for ray tracing in parallel. 146

IEEE Computer Graphics & Applications, pages 69–77, 1994. [10] H. Bal. Programming Distributed Systems. Silicon Press, Summit, New Jersey, 1990. [11] A. Basu. A classification of parallel processing systems. In ICCD, 1984. [12] K. Batcher. Design of a massively parallel processor. IEEE Transactions on Computers, 29(9):836– 840, Sept. 1980. [13] M. Ben-Ari. Principles of Concurrent and Distributed Programming. Addison-Wesley, Wokingham, England, 1990. [14] J. F. Blinn and M. E. Newell. Texture and reflection in computer generated images. Communications of the ACM, 19(10):542–547, okt 1976. [15] M. R. Bolin and G. W. Meyer. An error metric for monte carlo ray tracing. In J. Dorsey and P. Slusallek, editors, Rendering Techniques ’97 (EG Workshop on Rendering 1997), pages 57– 68, St. Etienne France, jun 1997. Eurographics, Springer Wien. [16] K. Bouatouch, D. Menard, and T. Priol. Parallel radiosity using a shared virtual memory. In First Bilkent Computer Graphics Conference, ATARV93, pages 71–83, Ankara, Turkey, jul 1993. [17] K. Bouatouch and T. Priol. Parallel space tracing: An experience on an iPSC hypercube. In N. Magnenat-Thalmann and D. Thalmann, editors, New Trends in Computer Graphics (Proceedings of CG International ’88), pages 170–187, New York, 1988. Springer-Verlag. [18] K. Bouatouch and T. Priol. Data management scheme for parallel radiosity. Computer-Aided Design, 26(12):876–882, dec 1994.

[19] C. J. Burgess and A. G. Chalmers. Optimum transputer configurations for real applications requiring global communications. In 18th World Occam and Transputer Users Group Conference, Manchester, apr 1995. IOS Press. [20] A. W. Burks. Programming and structural changes in parallel computers. In W. H¨andler, editor, Conpar, pages 1–24, Berlin, 1981. Springer. [21] N. Carriero and D. Gelernter. How to Write Parallel Programs. MIT Press, Cambridge, Massachusetts, 1990. [22] M. B. Carter and K. A. Teague. The hypercube ray tracer. In D. Walker and Q. Stout, editors, The 5th Distributed Memory Computing Conference Vol. I, pages 212–216. IEEE Computer Society Press, apr 1990. [23] D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS directories: A scalable cache coherence scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-IV, pages 224–234, Apr. 1991. [24] A. Chalmers and J. Tidmus. Practical Parallel Processing: An Introduction to Problem Solving in Parallel. International Thomson Computer Press, London, 1996. ISBN 1-85032-135-3. [25] A. G. Chalmers. Occam - the language for educating future parallel programmers? Microprocessing and Microprogramming, 24:757–760, 1988. [26] A. G. Chalmers and S. Gregory. Constructing minimum path configurations for multiprocessor systems. Parallel Computing, pages 343–355, apr 1993. [27] A. G. Chalmers and D. J. Paddon. Communication efficient MIMD configurations. In 4th SIAM Conference on Parallel Processing for Scientific Computing, Chicago, 1989. [28] A. G. Chalmers and D. J. Paddon. Parallel radiosity methods. In D. L. Fielding, editor, Transputer Research and Applications – NATUG 4, pages 183–193, Amsterdam, oct 1990. Proceedings of the Fourth Conference of the North American Transputer Users Group, IOS Press. Held october 11–12, Ithaca, NY, ISSN 0925-4986. 147

[29] A. G. Chalmers and D. J. Paddon. Parallel processing of progressive refinement radiosity methods. In 2nd EG Workshop on Rendering, pages 1–11, Barcelona, Spain, may 1991. [30] B. Chamberlain, T. DeRose, D. Lischinski, D. Salesin, and J. Snyder. Fast rendering of complex environments using a spatial hierarchy. In W. A. Davis and R. Bartels, editors, Graphics Interface ’96, pages 132–141. Canadian Information Processing Society, Canadian Human-Computer Communications Society, may 1996. ISBN 09695338-5-3. [31] P. Chaudhuri. Parallel Algorithms: Design and analysis. Prentice-Hall, Australia, 1992. [32] A. Ciampalini, P. Cignoni, C. Montani, and R. Scopigno. Multiresolution decimation based on global error. The Visual Computer, 13:228–246, 1997. [33] J. G. Cleary, B. M. Wyvill, G. M. Birtwistle, and R. Vatti. Multiprocessor ray tracing. Computer Graphics Forum, 5(1):3–12, mar 1986. [34] B. Codenotti and M. Leonici. Introduction to parallel processing. Addison-Wesley, Wokingham, England, 1993. [35] J. Cohen, A. Varshney, D. Manocha, G. Turk, H. Weber, P. Agarwal, F. P. Brooks, Jr., and W. Wright. Simplification envelopes. In H. Rushmeier, editor, SIGGRAPH 96 Conference Proceedings, Annual Conference Series, pages 119– 128. ACM SIGGRAPH, Addison Wesley, aug 1996. held in New Orleans, Louisiana, 04-09 August 1996. [36] M. F. Cohen, S. E. Chen, J. R. Wallace, and D. P. Greenberg. A progressive refinement approach to fast radiosity image generation. In J. Dill, editor, Computer Graphics (SIGGRAPH ’88 Proceedings), volume 22, pages 75–84, aug 1988. [37] M. F. Cohen and D. P. Greenberg. The hemi-cube: A radiosity solution for complex environments. In B. A. Barsky, editor, Computer Graphics (SIGGRAPH ’85 Proceedings), volume 19, pages 31– 40, jul 1985. [38] M. F. Cohen and J. R. Wallace. Radiosity and Realistic Image Synthesis. Academic Press, Inc., Cambridge, MA, 1993.

[39] A. L. DeCegama. The Technology of Parallel Processing: Parallel Processing Architectures and VLSI Design. Prentice-Hall International Inc., 1989. [40] M. A. Z. Dipp´e and J. Swensen. An adaptive subdivision algorithm and parallel architecture for realistic image synthesis. In H. Christiansen, editor, Computer Graphics (SIGGRAPH ’84 Proceedings), volume 18, pages 149–158, jul 1984. [41] P. Dutr´e. Mathematical Frameworks and Monte Carlo Algorithms for Global Illumination in Computer Graphics. PhD thesis, Katholieke Universiteit Leuven, Belgium, sept. 1996.

[49] T. A. Funkhouser. Coarse-grained parallelism for hierarchical radiosity using group iterative methods. In H. Rushmeier, editor, SIGGRAPH 96 Conference Proceedings, Annual Conference Series, pages 343–352. ACM SIGGRAPH, Addison Wesley, aug 1996. held in New Orleans, Louisiana, 04-09 August 1996. [50] T. A. Funkhouser and C. H. S´equin. Adaptive display algorithm for interactive frame rates during visualization of complex virtual environments. In J. T. Kajiya, editor, Computer Graphics (SIGGRAPH ’93 Proceedings), volume 27, pages 247– 254, aug 1993.

[42] D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, Mar. 1989.

[51] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In T. Whitted, editor, SIGGRAPH 97 Conference Proceedings, pages 209–216. ACM SIGGRAPH, Addison Wesley, aug 1997.

[43] V. Faber, O. M. Lubeck, and A. B. White Jr. Super-linear speedup of an efficient sequential algorithm is not possible. Parallel Computing, 3:259–260, 1986.

[52] C. F. Gerald and P. O. Wheatley. Applied numerical analysis. World Student Series. AddisonWesley, Reading, MA, 5th edition, 1994.

[44] M. Feda. Parallel radiosity on transputers with low communication overhead. In S. Ferenczi and P. Kacsuk, editors, Proceedings of the 2nd Austrian-Hungarian Workshop on Transputer Applications, pages 62–70, Budapest, Hungaria, sep– oct 1994. Hungarian Transputer Users Group and Austrian Centre for Parallel Computing. Report KFKI-1995-2/M, N. [45] M. Feda and W. Purgathofer. Progressive refinement radiosity on a transputer network. In 2nd EG Workshop on Rendering, Barcelona, Spain, may 1991. held in Barcelona, Spain; 13-15 May 1991. [46] C.-C. Feng and S.-N. Yang. A parallel hierarchical radiosity algorithm for complex scenes. In 1997 Symposium on Parallel Rendering, pages 71–77. ACM SIGGRAPH, oct 1997. ISBN 1-58113-0104. [47] H. P. Flatt and K. Kennedy. Performance of parallel processors. Parallel Computing, 12:1–20, 1989. [48] M. J. Flynn. Some computer organisations and their effectiveness. IEEE Transactions on Computers, 21(9):948–960, 1972. 148

[53] A. S. Glassner. Space subdivision for fast ray tracing. IEEE Computer Graphics and Applications, 4(10):15–22, oct 1984. [54] A. S. Glassner, editor. An Introduction to Ray Tracing. Academic Press, San Diego, 1989. [55] C. M. Goral, K. E. Torrance, D. P. Greenberg, and B. Battaile. Modeling the interaction of light between diffuse surfaces. In Computer Graphics (SIGGRAPH ’84 Proceedings), volume 18, pages 213–222, jul 1984. [56] P. Green and E. Morgan. Parallelisation schemes for the progressive refinement radiosity method for the synthesis of realistic images. In P. Nixon, editor, Transputer and Occam Developments (Proceedings of the 18th World Occam and Transputer User Group Technical Meeting, pages 97– 112, Amsterdam, apr 1995. IOS Press. ISBN 905199-222-X. [57] S. A. Green. Parallel Processing for Computer Graphics. Research Monographs in Parallel and Distributed Computing. Pitman Pulishing, London, 1991. ISBN 0-273-08834-3.

[58] S. A. Green and D. J. Paddon. A non-shared memory multiprocessor architecture for large database problems. In M. Cosnard, M. H. Barton, and M. Vanneschi, editors, Proceedings of the IFIP WG 10.3 Working Conference on Parallel Processing, Pisa, 1988. [59] S. A. Green and D. J. Paddon. Exploiting coherence for multiprocessor ray tracing. IEEE Computer Graphics and Applications, 9(6):12–26, nov 1989. [60] S. A. Green and D. J. Paddon. Exploiting coherence for multiprocessor ray tracing. IEEE Computer Graphics & Applications, 9(6):12–26, Nov. 1989. [61] S. A. Green and D. J. Paddon. A highly flexible multiprocessor solution for ray tracing. Technical Report TR-89-02, Computer Science Department, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, mar 1989. [62] S. A. Green and D. J. Paddon. A highly flexible multiprocessor solution for ray tracing. The Visual Computer, 6(2):62–73, mar 1990. [63] N. Greene. Environment mapping and other applications of world projections. IEEE Computer Graphics and Applications, pages 21–29, nov 1986. [64] H. A. Grosch. High speed arithmetic: The digital computer as a research tool. Journal of the Optical Society of America, 43(4):306–310, Apr. 1953. [65] H. A. Grosch. Grosch’s law revisited. Computerworld, 8(16):24, Apr. 1975. [66] P. Guitton, J. Roman, and C. Schlick. Two parallel approaches for a progressive radiosity. In 2nd EG Workshop on Rendering, pages 1–11, Barcelona, Spain, may 1991. [67] P. Guitton, J. Roman, and G. Subrenat. Implementation results and analysis of a parallel progressive radiosity. In 1995 Parallel Rendering Symposium, pages 31–38. ACM SIGGRAPH, oct 1995. ISBN 0-89791-774-1. [68] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W. Weber. Comparative evaluation of latency reducing and tolerating techniques. 149

In 18th Annual International Symposium on Computer Architecture, pages 254–263, Toronto, May 1991. [69] J. L. Gustafson. Re-evaluating Amdahl’s law. Communications of the ACM, 31(5):532–533, May 1988. [70] P. Hanrahan and D. Saltzman. A rapid hierarchical radiosity algorithm for unoccluded environments. In C. Bouville and K. Bouatouch, editors, Photorealism in Computer Graphics, Eurographics Seminar Series, New York, 1992. Springer Verlag. [71] P. Hanrahan, D. Saltzman, and L. Aupperle. A rapid hierarchical radiosity algorithm. Computer Graphics, 25(4):197–206, aug 1991. [72] P. Hanrahan and D. Salzman. A rapid hierarchical radiosity algorithm for unoccluded environments. Technical Report CS-TR-281-90, Department of Computer Science, Princeton University, aug 1990. [73] D. R. Hartree. The ENIAC, an electronic computing machine. Nature, 158:500–506, 1946. [74] T. He, L. Hong, and A. Varshney. Controlled topology simplification. IEEE Transactions on Visualization and Computer Graphics, 2(2):171– 184, jun 1996. [75] J. L. Hennessy and D. A. Patterson. Computer Architecture: A quantitative approach. Morgan Kaufmann, San Mateo, CA, 1990. [76] T. Hey. Scientific applications. In G. Harp, editor, Transputer Applications, chapter 8, pages 170– 203. Pitman Publishing, 1989. [77] M. D. Hill and J. R. Larus. Cache considerations for multiprocessor programmers. Communications of the ACM, 33(8):97–102, Aug. 1990. [78] D. W. Hillis. The Connection Machine. The MIT Press, 1985. [79] R. W. Hockney and C. R. Jesshope. Parallel Computers 2: Architecture, Programming and Algorithms, chapter 1, pages 60–81. Adam Hilger, 1988.

[80] R. W. Hockney and C. R. Jesshope. Parallel Computers 2: Architecture, Programming and Algorithms. Adam Hilger, Bristol, 1988. [81] M. Homewood, M. D. May, D. Shepherd, and R. Shepherd. The IMS T800 transputer. IEEE Micro, pages 10–26, 1987. [82] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Mesh optimization. In J. T. Kajiya, editor, Computer Graphics (SIGGRAPH ’93 Proceedings), volume 27, pages 19–26, aug 1993. [83] R. M. Hord. Parallel Supercomputing in MIMD Architectures. CRC Press, Boca Raton, 1993. [84] R. J. Hosking, D. C. Joyce, and J. C. Turner. First steps in numerical analysis. Hodder and Stoughton, Lonfon, 1978. [85] HPF Forum. High Performance Fortran language specification. Scientific Programming, 2(1), June 1993. [86] K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill Series in Computer Engineering. McGraw-Hill, Inc., New York, 1993. [87] S. E. Hyeon-Ju Yoon and J. W. Cho. Image parallel ray tracing using static load balancing and data prefetching. Parallel Computing, 23(7):861–872, jul 1997. Special Issue on Parallel Graphics and Visualisation. ¨ uc¸. Subdivision [88] V. ˙I sler, C. Aykanat, and B. Ozg¨ of 3D space based on the graph partitioning for parallel ray tracing. In Proceedings of the Second Eurographics Workshop on Rendering, Barcelona, Spain, may 1991. [89] F. W. Jansen and A. Chalmers. Realism in real time? In M. F. Cohen, C. Puech, and F. Sillion, editors, 4th EG Workshop on Rendering, pages 27– 46. Eurographics, jun 1993. held in Paris, France, 14–16 June 1993. [90] J. P. Jessel, M. Paulin, and R. Caubet. An extended radiosity using parallel ray-traced specular transfers. In 2nd Eurographics Workshop on Rendering, pages 1–12, Barcelona, Spain, may 1991. held in Barcelona, Spain; 13-15 May 1991.


[91] D. A. J. Jevans. Optimistic multi-processor ray tracing. In R. A. Earnshaw and B. Wyvill, editors, New Advances in Computer Graphics (Proceedings of CG International ’89), pages 507– 522, New York, 1989. Springer-Verlag. [92] J. T. Kajiya. The rendering equation. In D. C. Evans and R. J. Athay, editors, Computer Graphics (SIGGRAPH ’86 Proceedings), volume 20, pages 143–150, aug 1986. held in Dallas, Texas, August 18–22, 1986. [93] M. J. Keates and R. J. Hubbold. Accelerated ray tracing on the KSR1 virtual shared-memory parallel computer. Technical Report UMCS-94-2-2, Department of Computer Science, University of Manchester, Oxford Road, Manchester, UK, feb 1994. [94] H.-J. Kim and C.-M. Kyung. A new parallel ray-tracing system based on object decomposition. The Visual Computer, 12(5):244–253, 1996. ISSN 0178-2789. [95] H. Kobayashi, T. Nakamura, and Y. Shigei. Parallel processing of an object space for image synthesis using ray tracing. The Visual Computer, 3(1):13–22, feb 1987. [96] H. Kobayashi, T. Nakamura, and Y. Shigei. A strategy for mapping parallel ray-tracing into a hypercube multiprocessor system. In N. MagnenatThalmann and D. Thalmann, editors, New Trends in Computer Graphics (Proceedings of CG International ’88), pages 160–169, New York, 1988. Springer-Verlag. [97] A. J. F. Kok. Grouping of patches in progressive radiosity. In M. Cohen, C. Puech, and F. Sillion, editors, Fourth Eurographics Workshop on Rendering, pages 221–231, Paris, France, jun 1993. [98] A. J. F. Kok. Ray Tracing and Radiosity Algorithms for Photorealistic Image Synthesis. PhD thesis, Delft University of Technology, The Netherlands, may 1994. Delft University Press, ISBN 90-6275-981-5. [99] KSR. KSR Technical Summary. Kendall Square Research, Waltham, MA, 1992.

[100] V. Kumar, A. Grama, A. Gupta, and G. Karyps. Introduction to Parallel Computing. Benjamin/Cummings, Redwood City, California, 1994.
[101] H. T. Kung. VLSI array processors. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[102] H. T. Kung and C. E. Leiserson. Systolic arrays (for VLSI). In Duff and Stewart, editors, Sparse Matrix proceedings, Philadelphia, 1978. SIAM.
[103] T. M. Kurç, C. Aykanat, and B. Özgüç. A parallel scaled conjugate-gradient algorithm for the solution phase of gathering radiosity on hypercubes. The Visual Computer, 13(1):1–19, 1997.
[104] E. Lafortune and Y. D. Willems. Rendering participating media with bidirectional path tracing. In X. Pueyo and P. Schröder, editors, Rendering Techniques ’96 (EG Workshop on Rendering 1996), pages 91–100, Porto Portugal, jun 1996. Eurographics, Springer Wien.
[105] E. P. Lafortune and Y. D. Willems. Bi-directional path tracing. In Proceedings of CompuGraphics ’93, Alvor, Portugal, dec 6-10 1993.
[106] Z. Lahjomri and T. Priol. KOAN: A shared virtual memory for the iPSC/2 hypercube. Technical Report 597, IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France, jul 1991.
[107] Lambert. Photometria sive de mensura et gradibus luminis, colorum et umbrae, 1760.
[108] C. Lazou. Supercomputers and Their Use. Clarendon Press, Oxford, revised edition, 1988.
[109] M. E. Lee, R. A. Redner, and S. P. Uselton. Statistically optimized sampling for distributed ray tracing. In B. A. Barsky, editor, Computer Graphics (SIGGRAPH ’85 Proceedings), volume 19, pages 61–67, jul 1985. held in San Francisco, CA; 22–26 July 1985.
[110] W. Lefer. An efficient parallel ray tracing scheme for distributed memory parallel computers. In 1993 Parallel Rendering Symposium, pages 77–80. ACM-SIGGRAPH, 1993. ISBN 0-89791-618-2.
[111] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH prototype: Logic overhead and performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41–61, Jan. 1993.
[112] T. Lewis and H. El-Rewini. Introduction to parallel computing. Prentice-Hall, 1992.
[113] K. Li. Ivy: A shared virtual memory system for parallel computing. Proceedings of the 1988 International Conference on Parallel Processing, 2:94–101, Aug. 1988.
[114] R. Liere. Divide and conquer radiosity. In Photorealism in Computer Graphics, pages 191–197, 1991. Proceedings 2nd EG Rendering Workshop.
[115] G. J. Lipovski and M. Malek. Parallel Computing: Theory and comparisons. John Wiley, New York, 1987.
[116] P. W. C. Maciel and P. Shirley. Visual navigation of large environments using textured clusters. In P. Hanrahan and J. Winget, editors, 1995 Symposium on Interactive 3D Graphics, pages 95–102. ACM SIGGRAPH, apr 1995. ISBN 0-89791-736-7.
[117] M. D. May and R. Shepherd. Communicating process computers. Inmos technical note 22, Inmos Ltd., Bristol, 1987.
[118] L. F. Menabrea and A. Augusta (translator). Sketch of the Analytical Engine invented by Charles Babbage. In P. Morrison and E. Morrison, editors, Charles Babbage and his Calculating Engines. Dover Publications, 1961.
[119] K. Menzel. Parallel rendering techniques for multiprocessor systems. In Proceedings of Spring School on Computer Graphics, pages 91–103. Comenius University Bratislava, jun 1994. Held june 6–9 in Bratislava, Slovakia.
[120] S. Michelin, G. Maffeis, D. Arquès, and J. C. Grossetie. Form factor calculation: a new expression with implementations on a parallel t.node computer. Computer Graphics Forum, 12(3):C421–C432, 1993. Eurographics ’93.


[121] D. P. Mitchell. Generating antialiased images at low sampling densities. In M. C. Stone, editor, Computer Graphics (SIGGRAPH ’87 Proceedings), volume 21, pages 65–72, jul 1987. held in Anaheim, California; 27 – 31 July 1987.
[122] T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, pages 53–72, June 1991.
[123] A. Neumann, L. Neumann, P. Bekaert, Y. D. Willems, and W. Purgathofer. Importance-driven stochastic ray radiosity. In Rendering Techniques ’96 (EG Workshop on Rendering 1996), pages 111–121, Porto Portugal, jun 1996. Eurographics, Springer Wien.
[124] L. Neumann, W. Purgathofer, R. F. Tobler, A. Neumann, P. Eliás, M. Feda, and X. Pueyo. The stochastic ray method for radiosity. In P. M. Hanrahan and W. Purgathofer, editors, Rendering Techniques ’95 (EG Workshop on Rendering 1995), pages 206–218, Dublin Ireland, jun 1995. Eurographics, Springer Wien.
[125] I. Notkin and C. Gotsman. Parallel adaptive ray-tracing. In V. Skala, editor, Proceedings of the Third International Conference in Central Europe on Computer Graphics and Visualisation 95, volume 1, pages 218–226, Plzeň, Czech Republic, feb 1995. University of West Bohemia. WSCG 95.
[126] I. Notkin and C. Gotsman. Parallel progressive ray-tracing. Computer Graphics Forum, 16(1):43–56, march 1997.
[127] D. Nussbaum and A. Argarwal. Scalability of parallel machines. Communications of the ACM, 34(3):56–61, Mar. 1991.
[128] D. E. Orcutt. Implementation of ray tracing on the hypercube. In G. Fox, editor, Third Conference on Hypercube Concurrent Computers and Applications, pages 1207–1210, 1988. vol. 2.
[129] D. Paddon and A. Chalmers. Parallel processing of the radiosity method. Computer-Aided Design, 26(12):917–927, dec 1994. ISSN 0010-4485.
[130] S. N. Pattanaik. Computational Methods for Global Illumination and Visualisation of Complex 3D Environments. PhD thesis, National Centre for Software Technology, Bombay, India, feb 1993.
[131] M. Pharr, C. Kolb, R. Gershbein, and P. Hanrahan. Rendering complex scenes with memory-coherent ray tracing. In T. Whitted, editor, SIGGRAPH 97 Conference Proceedings, Annual Conference Series, pages 101–108. ACM SIGGRAPH, Addison Wesley, aug 1997. ISBN 0-89791-896-7.
[132] P. Pitot. The voxar project. IEEE Computer Graphics and Applications, 13(1):27–33, jan 1993.
[133] J. Popović and H. Hoppe. Progressive simplicial complexes. In T. Whitted, editor, SIGGRAPH 97 Conference Proceedings, Annual Conference Series, pages 217–224. ACM SIGGRAPH, Addison Wesley, aug 1997. ISBN 0-89791-896-7.
[134] M. Potmesil and E. M. Hoffert. The pixel machine: A parallel image computer. In Computer Graphics (SIGGRAPH ’89 Proceedings), volume 23, pages 69–78, jul 1989.
[135] T. Priol and K. Bouatouch. Static load balancing for a parallel ray tracing on a MIMD hypercube. The Visual Computer, 5(1/2):109–119, mar 1989.
[136] W. Purgathofer and M. Zeiller. Fast radiosity by parallelization. In Proceedings Eurographics Workshop on Photosimulation, Realism and Physics in Computer Graphics, pages 173–183, Rennes, France, jun 1990.
[137] B. Purvis. Programming the Intel i860. Parallelogram International, pages 6–9, Oct. 1990.
[138] M. J. Quinn. Parallel Computing: Theory and practice. McGraw-Hill, New York, 1994.
[139] V. Rajaraman. Elements of parallel computing. Prentice-Hall of India, New Dehli, 1990.
[140] R. J. Recker, D. W. George, and D. P. Greenberg. Acceleration techniques for progressive refinement radiosity. In R. Riesenfeld and C. Sequin, editors, Computer Graphics (1990 Symposium on Interactive 3D Graphics), pages 59–66, mar 1990. held in Snowbird, Utah; 25-28 March 1990.
[141] S. F. Reddaway. DAP - a Distributed Array Processor. In 1st Annual Symposium on Computer Architecture, 1973.

[142] E. Reinhard and A. G. Chalmers. Message han- [150] H. E. Rushmeier. Solution methods for radiadling in parallel radiance. In M. Bubak, J. Dontively participating media. In Global Illumination garra, and J. Wa´sniewski, editors, Proceedings Course Notes, SIGGRAPH 92, Chicago, 1992. EuroPVM-MPI’97 - Recent Advances in Paral[151] H. E. Rushmeier, C. Patterson, and A. Veerasamy. lel Virtual Machine and Message Passing InterGeometric simplification for indirect illumination face (Fourth European PVM-MPI Users’ Group calculations. In Proceedings of Graphics Interface Meeting), Lecture Notes in Computer Science ’93, pages 227–236, Toronto, Ontario, may 1993. (1332), pages 486–493. Springer - Verlag, nov Canadian Information Processing Society. 1997. ISBN 3-540-63697-8. [152] R. M. Russel. The CRAY-1 computer system. [143] E. Reinhard and F. W. Jansen. Pyramid clipping. Communications of the ACM, 21:63–72, 1978. Ray Tracing News, volume 8, number 2, may [153] J. Salmon and J. Goldsmith. A hypercube ray1995. tracer. In Proceedings of the 3rd Conference on [144] E. Reinhard and F. W. Jansen. Rendering large Hypercube Concurrent Computers and Applicascenes using parallel ray tracing. Parallel Comtions Vol. II, pages 1194–1206. ACM Press, 1988. puting, 23(7):873–886, July 1997. Special issue [154] I. D. Scherson and C. Caspary. Multiprocesson Parallel Graphics and Visualisation. ing for ray tracing: A hierarchical self-balancing [145] E. Reinhard, A. J. F. Kok, and F. W. Jansen. approach. The Visual Computer, 4(4):188–196, Cost prediction in ray tracing. In X. Pueyo and 1988. P. Schroeder, editors, Rendering Techniques ’96, pages 41–50, Porto, june 17-19 1996. Eurograph- [155] C. Schlick. An adaptive sampling technique for multidimensional integration by ray-tracing. In ics, Springer Wien. ISBN 3-211-82883-4. Second Eurographics Workshop on Rendering, [146] E. Reinhard, L. U. Tijssen, and F. W. Jansen. EnBarcelona, Spain, may 1991. vironment mapping for efficient sampling of the diffuse interreflection. In G. Sakas, P. Shirley, and [156] P. Shirley. Physically Based Lighting Calculations for Computer Graphics. PhD thesis, University of S. M¨uller, editors, Photorealistic Rendering TechIllinois, Urbana-Champaign, nov 1991. niques, pages 410–422, Darmstadt, jun 1994. Eurographics, Springer Verlag. proceedings of the [157] P. Shirley, K. Sung, and W. Brown. A ray trac5th Eurographics Workshop on Rendering. ing framework for global illumination systems. Graphics Interface ’91, pages 117–128, 1991. [147] L. Renambot, B. Arnaldi, T. Priol, and X. Pueyo. Towards efficient parallel radiosity for DSM- [158] J. E. Shore. Second thoughts on parallel processbased parallel computers using virtual interfaces. ing. Comput. Elect. Eng., 1:95–109, 1973. In 1997 Symposium on Parallel Rendering, pages [159] F. X. Sillion and C. Puech. Radiosity and Global 79–86. ACM SIGGRAPH, oct 1997. ISBN 1Illumination. Morgan Kaufmann Publishers, Inc., 58113-010-4. San Francisco, California, 1994. ISBN 1-55860277-1. [148] L. Renambot, B. Arnaldi, T. Priol, and X. Pueyo. Towards efficient parallel radiosity for DSM[160] J. S. Sing, A. Gupta, and M. Levoy. Parallel vibased parallel computers using virtual intersualization algorithms: Performance and architecfaces. Technical Report 3245, Institut National tural implications. IEEE Computer, 27(7):45–55, de Recherche en Informatique et en Automajul 1994. tique (INRIA), Campus Universitaire de Beaulieu, [161] J. P. Singh, C. Holt, T. Totsuka, A. 
Gupta, and 35042 Rennes Cedex, France, sep 1997. J. Hennessy. Load balancing and data locality [149] J. Richard and J. P. Singh. Parallel hierarchical in adaptive hierarchical N-body methods: Barnescomputation of specular radiosity. In 1997 Symhut, fast multipole and radiosity. Journal of Parposium on Parallel Rendering, pages 59–69. ACM allel and Distributed Computing, 27(1):118–141, SIGGRAPH, oct 1997. ISBN 1-58113-010-4. jun 1995. ISSN 0743-7315. 153

[162] S. Spach and R. Pulleyblank. Parallel raytraced image generation. Hewlett-Packard Journal, 43(3):76–83, jun 1992.

photo-realistic image synthesis. In R. Miles and A. Chalmers, editors, Proceedings of WoTUG-17: Progress in Transputer and occam Research, volume 17, pages 139–147. WoTUG, IOS Press, Apr. 1994. Bristol, UK. ISBN 90-5199-163-0.

[163] W. St¨urzlinger, G. Schaufler, and J. Volkert. Load balancing for a parallel radiosity algorithm. In 1995 Parallel Rendering Symposium, pages 39– [172] J. P. Tidmus, R. Miles, and A. Chalmers. Prefetch 45. ACM SIGGRAPH, oct 1995. ISBN0-89791data management for parallel particle tracing. In 774-1. A. Bakkers, editor, Parallel Programming and Java, Proceedings of WoTUG-20, volume 50 of [164] W. St¨urzlinger and C. Wild. Parallel progresConcurrent Systems Engineering, pages 130–137, sive radiosity with parallel visibility calculations. University of Twente, Netherlands, 1997. World In V. Skala, editor, Winter School of Computer occam and Transputer User Group (WoTUG), IOS Graphics and CAD Systems, pages 66–74. UniverPress, Netherlands. sity of West Bohemia, jan 1994. [173] R. F. Tobler, A. Wilkie, M. Feda, and W. Pur[165] W. St¨urzlinger and C. Wild. Parallel visibility gathofer. A hierarchical subdivision algorithm for computations for parallel radiosity. In B. Buchstochastic radiosity methods. In J. Dorsey and berger and J. Volkert, editors, Parallel Processing: P. Slusallek, editors, Rendering Techniques ’97 CONPAR 94 - VAPP VI (Third Joint International (EG Workshop on Rendering 1997), pages 193– Conference on Vector and Parallel Processing), 203, St. Etienne France, jun 1997. Eurographics, volume 854 of Lecture Notes in Computer Science, Springer Wien. pages 405–413, Berlin, sep 1994. Springer-Verlag. ISBN 3-540-58430-7. [174] P. C. Treleaven, D. R. Brownbridge, and R. P. Hopkins. Data driven and demand-driven com[166] D. Stuttard, A. Worral, D. Paddon, and C. Willis. puter architecture. Communications of the ACM, A parallel radiosity system for large data sets. In 14(1):95–143, Mar. 1982. V. Skala, editor, The Third International Conference in Central Europe on Computer Graphics [175] A. Trew and G. Wilson, editors. Past, Present and and Visualisation 95, volume 2, pages 421–429, Parallel: A survey of available parallel computer Plze˘n, Czech Republic, feb 1995. University of systems. Springer-Verlag, London, 1991. West Bohemia. [176] L. G. Valiant. A bridging model for paral[167] I. E. Sutherland. The ultimate display. In Proceedlel computation. Communications of the ACM, ings of IFIP Congress, volume 65, pages 505–508, 33(8):103–111, Aug. 1990. 1965. [177] E. Veach. Non-symmetric scattering in light trans[168] I. E. Sutherland, R. F. Sproull, and R. A. Schuport algorithms. In X. Pueyo and P. Schr¨oder, macker. A characterization of ten hidden-surface editors, Eurographics Rendering Workshop 1996, algorithms. Computing Surveys, 6(1):1–55, mar pages 81–90. Springer Wien, jun 1996. ISBN 31974. 211-82883-4. [169] R. J. Swam, S. H. Fuller, and D. P. Siewiorek. [178] E. Veach and L. J. Guibas. Bidirectional estima‘Cm —A Modular, Multi-Microprocessor’. In tors for light transport. In G. Sakas, P. Shirley, and Proc. AFIPS 1977 Fall Joint Computer ConferM¨uller, editors, Photorealistic Rendering Techence 46, pages 637–644, 1977. niques (EG Workshop on Rendering 94), pages 145–167. Eurographics, Springer Heidelberg, jun [170] S. Thakkar, P. Gifford, and G. Fiellamd. The 1994. Balance multiprocessor system. IEEE Micro, 8(1):57–69, Feb. 1988. [179] E. Veach and L. J. Guibas. Optimally combining sampling techniques for monte carlo rendering. In [171] J. P. Tidmus, A. Chalmers, and R. Miles. Distributed Monte Carlo techniques for interactive 154

Computer Graphics Proceedings, Annual Confer- [188] H. Weghorst, G. Hooper, and D. P. Greenberg. ence Series, 1995 (ACM SIGGRAPH ’95 ProceedImproved computations methods for ray tracing. ings), pages 419–428, 1995. Transactions on Graphics, 3(1):52–69, jan 1984. [180] E. Veach and L. J. Guibas. Metropolis light trans- [189] R. P. Weicker. Dhrystone: A synthetic systems port. In T. Whitted, editor, SIGGRAPH 97 Conprogramming benchmark. Communications of the ference Proceedings, Annual Conference Series, ACM, 27(10):1013–1030, 1984. pages 65–76. ACM SIGGRAPH, Addison Wesley, [190] S. H. Westin, J. R. Arvo, and K. E. Torrance. Preaug 1997. dicting reflectance functions from complex sur[181] I. Verd´u, D. Gim´enez, and J. C. Torres. Ray tracing faces. ACM Computer Graphics, 26(2):255–264, for natural scenes in parallel processors. In H. Lidjul 1992. dell, A. Colbrook, B. Hertzberger, and P. Sloot, editors, Highe-Performance Computing and Net- [191] T. Whitted. An improved illumination model for shaded display. Communications of the ACM, working, volume 1067 of Lecture Notes in Com23(6):343–349, jun 1980. puter Science, pages 297–305. Springer-Verlag, apr 1996. ISBN 3-540-61142-8. [192] J. C. Xia, J. El-Sana, and A. Varshney. Adaptive real-time level-of-detail-based rendering for ˇ [182] J. Z´ara, A. Holeˇcek, and J. Pˇrikryl. Parallelisapolygonal models. IEEE Transactions of Visution of the ray-tracing algorithm. In V. Skala, edialization and Computer Graphics, 3(2):171–183, tor, Winter School of Computer Graphics and CAD apr-jun 1997. Systems 94, volume 1, pages 113–117. University of West Bohemia, jan 1994. WSCG 95. [193] H. Xu, Q. Peng, and Y. Liang. Accelerated radiosˇ ara, A. Holeˇcek, and J. Pˇrikryl. When the ity method for complex environments. In Euro[183] J. Z´ graphics ’89, pages 51–61, Amsterdam, sep 1989. parallel ray-tracer starts to be efficient? In ProElsevier Science Publishers. Eurographics ’89. ceedings of Spring School on Computer Graphics, pages 108–116. Comenius University Bratislava, [194] Y. Yu, O. H. Ibarra, and T. Yang. Parallel progresjun 1994. Held jun 6–9 Bratislava, Slovakia. sive radiosity with adaptive meshing. In A. Ferreira, J. Rolim, Y. Saad, and T. Yang, editors, Par[184] J. Wallace, K. Elmquist, and E. Haines. A ray tracallel Algorithms for Irregularly Structured Probing algorithm for progressive radiosity. In Comlems (Third International Workshop, IRREGUputer Graphics (ACM SIGGRAPH ’89 ProceedLAR ’96), volume 1117 of Lecture Notes in Comings), volume 23, pages 315–324, jul 1989. puter Science, pages 159–170. Springer-Verlag, [185] G. J. Ward. The RADIANCE lighting simulation aug 1996. ISBN 3-540-61549-0. and rendering system. In A. Glassner, editor, Proceedings of SIGGRAPH ’94 (Orlando, Florida, [195] M. van der Zwaan, E. Reinhard, and F. W. Jansen. Pyramid clipping for efficient ray traversal. In July 24–29, 1994), Computer Graphics ProceedP. Hanrahan and W. Purgathofer, editors, Renderings, Annual Conference Series, pages 459–472. ing Techniques ’95, pages 1–10. Trinity College, ACM SIGGRAPH, ACM Press, July 1994. ISBN Dublin, Springer - Vienna, June 1995. proceedings 0-89791-667-0. of the 6th Eurographics Workshop on Rendering. [186] G. J. Ward, F. M. Rubinstein, and R. D. Clear. A ray tracing solution for diffuse interreflection. ACM Computer Graphics, 22(4):85–92, aug 1988. [187] D. H. D. Warren and S. Haridi. The Data Diffusion Machine—a scalable shared virtual memory multiprocessor. 
In Proceedings of the 1988 International Conference on Fifth Generation Computer Systems, pages 943–952, Tokyo, Japan, Dec. 1988. 155

Parallel & Distributed Photo-Realistic Rendering
Alan Chalmers

Erik Reinhard

Department of Computer Science University of Bristol Bristol, BS8 1UB United Kingdom [email protected]

[email protected]

Acknowledgements
We are very grateful to the following:
• International Thomson Computer Press
• IEEE, Elsevier
• John Tidmus
• Erik Jansen, Luc Renambot, Mark Levoy
• The authors of the key papers

Introduction
"Parallel processing is like a dog's walking on its hind legs. It is not done well, but you are surprised to find it done at all." [Steve Fiddes, apologies to Samuel Johnson]
• Concepts
– Dependencies, Scalability, Control
• Classification of Parallel Systems
– Flynn's taxonomy, Parallel vs Distributed
• Relationship between Data and Tasks
– Difficulties

Parallel Realistic Graphics
Strive for Realism
• Over … million objects
• … MBytes of memory
• … hours of computing

Cooperation
Solution of a single problem
• One person takes a certain time to solve the problem
• Divide the problem into a number of subproblems
• Each subproblem solved by a single worker
• Reduced problem solution time
BUT
• Cooperation ⇒ OVERHEADS

Working Together
Overheads
• Access to pool, collision avoidance

Dependencies
Divide a problem into a number of distinct stages
• Parallel solution of one stage before the next can start
• May be too severe ⇒ no parallel solution
– each subproblem dependent on the previous stage
• Dependency-free problems
– order of task completion unimportant
– BUT cooperation still required

Building with Blocks
[figure: building with blocks, contrasting a strictly sequential construction with a dependency-free one]

Pipelining
Classic method for minimising dependency effects
• Can only be exploited when there are:
– a number of distinct stages
– repeated several times
• Completion of a task is sensitive to the time taken by each worker
• If worker times differ ⇒ idle time
• Interruption ⇒ pipeline must be refilled

Pipeline: Assembly of Cars
[figure: cars 1-9 move through the pipeline stages (initial lumps of metal, chassis fabrication, body assembly, wheel fitting, windscreen installation, finished cars), one stage per time step t1…t6, showing the motion of cars through the pipeline over increasing time]

Scalability
Upper bound on the number of workers
• Additional workers will NOT improve solution time
• Shows how suitable a problem is for parallel processing
• Given problem ⇒ finite number of subproblems
– more workers than tasks
• Upper bound may be a lot less than the number of tasks
– bottlenecks

Bottleneck at Doorway
More workers may result in LONGER solution time

Control
Required by all parallel implementations
• What constitutes a task
• When has the problem been solved
• How to deal with multiple stages
• Forms of control
– centralised
– distributed

Control Required
[figure: the problem domain is subdivided into subproblems 1…n (Step 1: subdivision of problem); each subproblem is computed independently to give subresults 1…n (Step 2: independent computation); the subresults are gathered into the final results (Step 3: collation of results)]

Classification of Parallel Systems
Traditional sequential von Neumann model:
• processor
• associated memory
• input/output interface
[figure: processor, memory and I/O interface connected by control, data and address buses]

Parallel vs Concurrent
Parallelism: replication of the computational unit
• Tasks, sometimes referred to as …, are allocated a physical processor
• Tasks executing on distinct processors ⇒ parallel
• Tasks sharing the same processor ⇒ concurrent
[figure: (a) parallel tasks, one per processor (Processor 1, Processor 2); (b) concurrent tasks sharing a single processor]

Flynn's Taxonomy
Classification by instruction and data streams
• SISD: Single Instruction, Single Data
– conventional sequential processors
• SIMD: Single Instruction, Multiple Data
– number of processors obeying the same instruction
• MISD: Multiple Instruction, Single Data
– closest equivalent: systolic array architectures

Four Broad Architecture Categories

                         Single Instruction stream   Multiple Instruction stream
Single Data stream       SISD                        MISD
Multiple Data stream     SIMD                        MIMD

Real architectures are much more complex

MIMD
Autonomously obey their own instructions
• With communication
– cooperate in the solution of a single problem
• Tightly coupled
– high level of interaction between processors
• Loosely coupled
– little or no interaction between processors

Shared vs Distributed Memory
Two methods for providing communication
• Shared memory
– communication via common address space
– e.g. Alliant FX, multiprocessor PCs
• Distributed memory
– communication by passing messages
– e.g. Inmos Transputer, Analog Devices SHARC

Systems of MIMD Processors
Memory access
• Shared memory
– processors share the same memory
– contention as the number of processors increases
• Distributed memory
– processors address only their private memory
– higher density of messages as more processors are added

Conceptual Difference
[figure: (a) shared memory, with MIMD processors connected to a common memory through an interconnection method; (b) MIMD processors with private memory linked by an interconnection method]
Interconnection networks
• Busses ⇒ contention
• Dedicated links ⇒ spreads communication

Parallel vs Distributed Systems
Both are distributed memory systems
• Parallel
– interconnected dedicated processor and memory nodes
– fast purpose-built communication infrastructure
• Distributed
– interconnected "stand-alone" workstations
– relies on existing network facilities, e.g. ethernet

Many Similar Underlying Concepts
Both are message passing systems ⇒ parallel languages, PVM, MPI
Different computation-to-communication ratios
• Parallel
• Distributed

Relationship of Tasks & Data
Implementation of a problem comprises two parts
• Algorithm
• Domain
– encompasses all data requirements for the problem
• Sequential ⇒ all components on a single processor
• Parallel ⇒ components divided amongst processors

Components of Problem
[figure: The Problem: problem domain + application of algorithm → results]
The algorithm interacts with the domain to produce the result.

Inherent Difficulties
Failure to successfully complete
• Sequential solution
– deficiencies in algorithm or data
• Parallel solution
– deficiencies in algorithm or data
– deadlock
– data consistency

Novel Difficulties
Factors arising from implementation
• Deadlock
– processor waiting indefinitely for an event
• Data consistency
– data is distributed amongst processors
• Communication overheads
– latency in message transfer

Evaluating Parallel Implementations
Realisation penalties
• Algorithmic penalty
– nature of the algorithm chosen
• Implementation penalty
– need to communicate
– concurrent computation & communication activities
– idle time

Solution Times
[graph: computation time versus number of PEs (10-100), comparing the optimum time with the actual time achieved]

Speed-up
[graph: speed-up versus number of PEs (10-100), showing linear, actual and super-linear speed-up]

speed-up = elapsed time of a uniprocessor / elapsed time of the multiprocessor

Efficiency
[graph: efficiency versus number of PEs (10-100), comparing optimum with actual efficiency]

efficiency = (speed-up × 100) / number of processors

Amdahl & Gustafson's "Laws"
Upper bound on speed-up
• Amdahl:    max speed-up = 1 / (s + (1 - s)/n)
• Gustafson: max speed-up = n + (1 - n)s
where s is the serial fraction of the computation and n the number of PEs.
[graph: speed-up versus number of PEs (up to 1000) for linear speed-up, Amdahl's law with s = 0.001, 0.005, 0.01 and 0.1, and Gustafson's law]
(a small numerical sketch of these two bounds follows below)
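As a quick illustration of the two bounds above (my own sketch, not part of the original notes; the sample values of s and n are taken from the graph's legend), the following Python fragment evaluates both limits:

    def amdahl_speedup(s, n):
        # Amdahl: fixed problem size, serial fraction s
        return 1.0 / (s + (1.0 - s) / n)

    def gustafson_speedup(s, n):
        # Gustafson: problem scales with n, serial fraction s
        return n + (1.0 - n) * s

    if __name__ == "__main__":
        for s in (0.001, 0.005, 0.01, 0.1):
            print(f"s={s}: Amdahl={amdahl_speedup(s, 1000):.1f}, "
                  f"Gustafson={gustafson_speedup(s, 1000):.1f}")

For s = 0.001 and 1000 PEs, Amdahl's bound is roughly 500 while Gustafson's scaled bound stays close to 999, which is the contrast the graph is making.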

Optimum Number of PEs
To improve scalability
• re-examine the algorithm adopted
• alter the make-up of the system software
Goal: "to ensure that the optimum number of processing elements for your problem is greater than the number of PEs physically available to solve the problem!"

Realistic Rendering
All rendering algorithms model light behaviour and therefore try to solve the same equation:

$$L_o(x,\theta_o) = L_e(x,\theta_o) + \int_{\mathrm{all}\ x'} v(x,x')\, f_r(x,\theta_o',\theta_o)\, L_o(x',\theta_o')\, \cos\theta_i\, \frac{\cos\theta_o'}{\lVert x'-x\rVert^{2}}\, dA'$$

• Rendering equation
– Fredholm equation of the second kind
– can only be solved numerically

Realistic Rendering
Approximations to the rendering equation:
• Ray Tracing
• Radiosity
• Particle Tracing

Ray Tracing
Approximates the rendering equation:

$$L_o(x,\theta_o) = L_e(x,\theta_o) + \sum_{L}\int_{\mathrm{all}\ x_l \in L} v(x,x_l)\, f_{r,d}(x)\, L_e(x_l,\theta_o')\, \cos\theta_i\, d\omega_l + \int_{\theta_s \in \Omega_s} f_{r,s}(x,\theta_s,\theta_o)\, L(x_s,\theta_s)\, \cos\theta_s\, d\omega_s + \rho_d L_a(x)$$

Ray Tracing
In this equation:
• The first term accounts for hitting light sources directly
• The second term is the direct contribution of light sources
• The third term computes specular reflection
• The fourth term is an ambient component
– crudely approximates everything not accounted for by the first three terms

Ray Tracing: Primary Rays
[figure: primary rays shot from the viewpoint through the image plane into the scene]

Primary Rays
From the viewpoint:
• Per pixel one or more rays are shot into the scene
• These rays may hit a number of objects
• The colour of the object closest to the viewpoint is selected and assigned to the pixel
• This step is repeated for each pixel, making up the image

Ray Tracing: Secondary Rays
[figure: secondary rays spawned at an intersection point]

Once an Intersection With an Object is Found
From the intersection point secondary rays are shot:
• Shadow rays towards light sources
• Reflection rays in reflected directions
• Refraction rays in transmitted directions
• Possibly diffuse inter-reflection rays into a hemisphere of directions
Shading values of these rays affect the shading of their parent rays

Dealing With Complexity

Example of a Complex Scene
• High complexity
• Geometry unevenly spread over space
• Large memory requirements

Greenhouse Example
• High complexity
• Instancing

Sorting Object Space
Spatial subdivision techniques spatially sort objects
• Regular subdivisions (grid) perform relatively poorly
• Adaptive spatial subdivisions deal with complexity better
• Bintrees and octrees (a sketch of the recursive build follows below):
– recursively subdivide when too many objects in a voxel
– limit depth of tree as well
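To make the bintree/octree idea concrete, here is a minimal, hypothetical Python sketch (not from the course notes) of adaptive subdivision driven by an object-count threshold and a maximum depth; points stand in for objects and the threshold values are assumptions:

    from dataclasses import dataclass, field

    MAX_OBJECTS = 4   # subdivide when a voxel holds more than this
    MAX_DEPTH = 8     # limit depth of tree as well

    @dataclass
    class OctreeNode:
        lo: tuple                                      # minimum corner of the voxel
        hi: tuple                                      # maximum corner of the voxel
        objects: list = field(default_factory=list)    # points stored while a leaf
        children: list = field(default_factory=list)   # eight child voxels once split

        def insert(self, p, depth=0):
            if self.children:
                self._child_for(p).insert(p, depth + 1)
                return
            self.objects.append(p)
            if len(self.objects) > MAX_OBJECTS and depth < MAX_DEPTH:
                self._split()
                for q in self.objects:
                    self._child_for(q).insert(q, depth + 1)
                self.objects = []

        def _split(self):
            mid = tuple((a + b) / 2 for a, b in zip(self.lo, self.hi))
            for ix in range(8):
                lo = tuple(self.lo[d] if not (ix >> d) & 1 else mid[d] for d in range(3))
                hi = tuple(mid[d] if not (ix >> d) & 1 else self.hi[d] for d in range(3))
                self.children.append(OctreeNode(lo, hi))

        def _child_for(self, p):
            mid = tuple((a + b) / 2 for a, b in zip(self.lo, self.hi))
            ix = sum((1 << d) for d in range(3) if p[d] >= mid[d])
            return self.children[ix]

Building the tree is then a matter of creating a root voxel that bounds the scene and inserting each object's reference point.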

Sorting Object Space
Other spatial subdivision techniques exist
• e.g. hierarchical grids
• May reduce the number of voxels
• But the overhead is larger as well
• Overall performance is similar to octrees

Grid Spatial Subdivision
Assume the number of voxels is N:
• In each orthogonal direction there will be $\sqrt[3]{N}$ voxels
• The upper bound of the time taken to trace a ray is:
$$T = \sqrt[3]{N}\,(T_{cell} + T_{int}) = O(\sqrt[3]{N})$$

Spatial Subdivisions: Grid
[figure: a scene subdivided by a regular grid]

Bintree Spatial Subdivision
If there are N leaf voxels:
• The height of the tree will be h, where $2^h = N$
• The number of voxels traversed is then $O(2^{h/3})$
• The upper bound for the traversal time is then
$$T = 2^{h/3}\,(T_{cell} + T_{int}) = \sqrt[3]{N}\,(T_{cell} + T_{int}) = O(\sqrt[3]{N})$$

Spatial Subdivisions: Bintree
[figure: a scene subdivided by a bintree]

Octree Spatial Subdivision
If there are N leaf voxels:
• The height of the tree is h, where $8^h = N$
• A ray would therefore traverse $O(2^h)$ voxels
• The time taken by a single ray is therefore:
$$T = 2^{h}\,(T_{cell} + T_{int}) = \sqrt[3]{N}\,(T_{cell} + T_{int}) = O(\sqrt[3]{N})$$

Spatial Subdivisions: Octree
[figure: a scene subdivided by an octree]

Radiosity
Derived from radiative heat transfer
• Approximates the rendering equation (for patch i):

$$L_i = L_{e_i} + \rho_{i_d} \sum_{j} L_j\, f_{i\to j}$$

$$f_{i\to j} = \frac{1}{A_i}\int_{A_i}\int_{A_j} \frac{\cos\theta_i\,\cos\theta_j}{\pi r^{2}}\,\delta_{ij}\; dA_j\, dA_i$$

Radiosity
In this equation:
• The first term is the self-emitted radiance of patch i
• The second term is the sum of radiances from all other patches incident on patch i
• $f_{i\to j}$ is the form factor between patches i and j
This equation is evaluated for all patches

Patches
[figure: a scene subdivided into patches]

Energy Exchange Between Patches
[figure: energy exchange between two patches]

Hemisphere Method
Computing a form factor to determine visibility:
• Places a hemisphere over an element
• Discretises the hemisphere (regularly or adaptively)
• Shoot rays through each partition to determine the closest element, yielding a delta form factor
• Sum all delta form factors into a form factor
[figure: the hemisphere method]

Hemicube Method
Alternative visibility method
• Places a hemicube over an element
• Compute delta form factors for each grid cell
• Sum delta form factors
– suitable for hardware z-buffer algorithms
[figure: the hemicube method]

Radiosity Methods
Three main radiosity algorithms:
• Full matrix radiosity (or gathering)
• Progressive refinement (or shooting)
• Hierarchical radiosity

Full Matrix Radiosity
• First a matrix of form factors is created
• Then this matrix is solved using a numerical method, such as Gauss-Seidel
• Can be seen as gathering energy for each patch from all other patches
• After each iteration the solution for one more patch is known
• Requires $O(N^2)$ storage space
(a small gathering sketch follows below)
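As an illustration only (my sketch, not the authors' code), the following Python fragment gathers radiosity with Gauss-Seidel iteration, assuming the form-factor matrix F, emission vector E and diffuse reflectances rho are already available; the toy two-patch scene at the bottom is invented:

    import numpy as np

    def gather_radiosity(E, rho, F, iterations=50):
        """Solve B = E + diag(rho) F B by Gauss-Seidel gathering."""
        n = len(E)
        B = np.array(E, dtype=float)            # start from the emitted radiosity
        for _ in range(iterations):
            for i in range(n):
                # gather energy for patch i from all other patches
                B[i] = E[i] + rho[i] * np.dot(F[i], B)
            # (a convergence test on the change in B could stop early)
        return B

    # toy example: two patches facing each other (F[i][i] = 0)
    E = [1.0, 0.0]
    rho = [0.5, 0.8]
    F = np.array([[0.0, 0.4],
                  [0.4, 0.0]])
    print(gather_radiosity(E, rho, F))

Because each sweep reuses the most recently updated B values, one more patch converges with every pass, which is the "after each iteration the solution for one more patch is known" behaviour described above.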

Progressive Refinement
• The most radiant patches distribute their energy into the environment
• After each iteration all patches have received some energy
• Only a few iterations are required before an approximation can be displayed
• Complete convergence is slower than in gathering methods, because some form factors have to be calculated more than once

Hierarchical Radiosity
• Computes energy exchanges between groups of patches at various levels in a hierarchy
• The coarsest level in the hierarchy with the desired level of precision is chosen to control the error in the solution

Stochastic Rendering
Monte Carlo techniques:
• Suitable for solving difficult integral equations such as the rendering equation
• No significant additional storage space required beyond the scene geometry itself
• Can be applied to arbitrary surface geometry and reflection functions

Monte Carlo Techniques
To evaluate an integral $F = \int_{\Omega} f(x)\,dx$:
• $p(x)\,dx$ is the probability density function of points generated in domain $\Omega$
• Generate samples $X$, where $\dfrac{f(X)}{p(X)}$ is an estimate for $F$
• Different schemes exist to choose samples such that the difference between estimate and actual value is minimised
• This is known as variance reduction
(a minimal estimator sketch follows below)
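Purely as an illustrative sketch (the test integrand, pdf and sample count are assumptions of mine, not from the notes), a Monte Carlo estimator of F with a non-uniform pdf might look like:

    import random
    import math

    def mc_estimate(f, sample, pdf, n=100_000):
        """Average f(X)/p(X) over n samples X drawn with density pdf."""
        total = 0.0
        for _ in range(n):
            x = sample()                 # draw X ~ p
            total += f(x) / pdf(x)       # single-sample estimate of F
        return total / n

    # example: F = integral of x^2 over [0,1] = 1/3,
    # sampled with the linear pdf p(x) = 2x (a simple importance sampler)
    f = lambda x: x * x
    sample = lambda: math.sqrt(1.0 - random.random())   # inverse-CDF of p(x) = 2x
    pdf = lambda x: 2.0 * x
    print(mc_estimate(f, sample, pdf))                  # ~0.3333

Choosing p to resemble f (as in this toy importance-sampling pdf) is exactly the kind of variance-reduction scheme the bullet points refer to.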

Particle Tracing
Solves the rendering equation directly
• Light sources emit particles at different wavelengths and in different directions
• Particles get absorbed or reflected at surfaces
• Reflection is chosen according to a surface brdf
• Each patch in the environment so receives a number of particles determining its brightness

Particle Tracing
• Participating media are easily incorporated
• Same for caustics
• A large number of particles is necessary to get reliable results

Task Scheduling
Providing tasks to the processors
• Problem decomposition
– algorithmic decomposition
– domain decomposition
– definition of a task
– system architecture

Distributing Tasks
Computational models
• Data driven
– balanced
– unbalanced
• Demand driven

Problem Decomposition
Exploit parallelism
• Inherent in the algorithm
– algorithmic decomposition
– parallelising compilers
• Applying the same algorithm to different data items
– domain decomposition
– need for explicit system software support

Different Decomposition Methods
Decomposition method:
• Algorithmic: Fork & Join, Data Flow
• Domain: Data Driven, Demand Driven

Algorithmic Decomposition
Exploiting inherent parallelism in the algorithm
• Analyse the algorithm
• Finest grain of parallelism at operator level
– dataflow
• Portions of the algorithm executing in parallel
– fork & join

Different Algorithmic Approaches
Sample problem:  Begin  a := …;  b := …;  c := a + b  End
[figure: the same computation expressed (a) as a dataflow graph in which the operators producing a and b feed the addition that yields c, and (b) as a fork & join program in which a and b are computed in parallel between the fork and the join before c := a + b]

Domain Decomposition
Exploiting parallelism in the data domain
• Each processor has the same algorithm
• Two step approach:
1. Choose an appropriate sequential algorithm
– avoid any data dependencies
2. Extract the important criteria for parallel exploitation
– no general purpose parallel processing

Abstract Definition of a Task
[figure: the principal data item specified by the task, plus the additional data items required, are fed to the application of the algorithm, producing a result]
• Principal Data Item (PDI): on which the algorithm is applied
• Additional Data Items (ADIs): needed to complete the computation

System Architecture
Distributed memory systems
• Processor ⇒ hardware; a number of interconnected processors
• Process ⇒ code running concurrently on a processor
• Processing element (PE) ⇒ building block
• Configuration
– single processor + application process(es) + system software processes

System Controller
[figure: the system controller (SC) connected to PE 1 … PE n]
• Provides input/output facilities
• Controls the parallel solution

Computational Models
Determines the manner in which tasks are allocated to PEs
• Maximise PE computation time
• Minimise idle time
– load balancing
• Evenly allocate tasks amongst the processors

Ray Tracing a Simple Scene
Example problem considered
[figure: the problem domain, area of interest and principal data items for a simple scene]
PDIs: pixels in the image plane ⇒ 24 tasks

Data Driven Models
• All PDIs allocated to specific PEs before computation starts
• Each PE knows a priori which PDIs it is responsible for
Balanced (geometric decomposition)
• Evenly allocate tasks amongst the processors
• If the PDIs are not an exact multiple of the number of PEs then some PEs do one extra task

portion at each PE = number of PDIs / number of PEs
(a small allocation sketch follows below)
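A minimal sketch of this balanced split (mine, not the authors'), including the "one extra task" case when the division is not exact:

    def balanced_portions(num_pdis, num_pes):
        """Return the number of PDIs each PE is responsible for."""
        base, extra = divmod(num_pdis, num_pes)
        # the first `extra` PEs take one extra task each
        return [base + 1 if pe < extra else base for pe in range(num_pes)]

    print(balanced_portions(24, 3))   # [8, 8, 8]
    print(balanced_portions(25, 3))   # [9, 8, 8]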

Balanced Data Driven
[figure: the task supplier divides the "pool" of 24 available tasks evenly among PE 1, PE 2 and PE 3]
solution time = initial distribution + computation of 24/3 tasks per PE + result collation

Unbalanced Data Driven
Task allocation using known computation times
• PEs may do different numbers of tasks
• PEs should finish at the same time
[figure: the individual computation times of the 24 tasks, mostly 1 unit each with a few more expensive tasks of 2 to 6 units, 42 units in total]

Unequal Allocation of Tasks
computation at each PE = total computation time / number of PEs
[figure: the 24 tasks partitioned into PDIs for PE 1, PE 2 and PE 3 so that each portion represents roughly the same amount of computation]
solution time = preprocessing + initial distribution + longest portion time (14) + result collation

Demand Driven Model
Task computation time unknown
• Work is allocated dynamically as PEs become idle
– PEs no longer bound to particular PDIs
• PEs explicitly demand new tasks
• Task supplier process must satisfy these demands

Dynamic Allocation of Tasks
[figure: PEs demand tasks from the task supplier, which hands out tasks from the "pool" of available tasks]
solution time = 2 × total comms time + total comp time for all PDIs / number of PEs

Task Supplier Process

PROCESS Task_Supplier()
Begin
  remaining_tasks := total_number_of_tasks
  (* initialise all processors with one task *)
  FOR p = 1 TO number_of_PEs
  Begin
    SEND task TO PE[p]
    remaining_tasks := remaining_tasks - 1
  End
  WHILE results_outstanding DO
  Begin
    RECEIVE result FROM PE[i]
    IF remaining_tasks > 0 THEN
    Begin
      SEND task TO PE[i]
      remaining_tasks := remaining_tasks - 1
    End
    ENDIF
  End
End (* Task_Supplier *)

Simple demand driven task supplier
(a Python rendering of the same idea follows below)
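For comparison, here is a hedged, minimal Python sketch of the same demand-driven idea using multiprocessing queues in place of explicit SEND/RECEIVE messages; the worker computation and the task list are invented for illustration:

    from multiprocessing import Process, Queue

    def worker(pe_id, tasks, results):
        # each PE repeatedly "demands" a task by taking one from the queue
        while True:
            task = tasks.get()
            if task is None:                           # sentinel: no more tasks
                break
            results.put((pe_id, task, task * task))    # stand-in computation

    def task_supplier(task_list, num_pes=3):
        tasks, results = Queue(), Queue()
        pes = [Process(target=worker, args=(p, tasks, results)) for p in range(num_pes)]
        for pe in pes:
            pe.start()
        for t in task_list:            # the "pool" of available tasks
            tasks.put(t)
        for _ in pes:                  # one sentinel per PE
            tasks.put(None)
        collated = [results.get() for _ in task_list]
        for pe in pes:
            pe.join()
        return collated

    if __name__ == "__main__":
        print(task_supplier(list(range(24))))

The shared queue plays the role of the task pool: faster PEs simply come back for work more often, which is the load-balancing property the demand-driven model is after.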

Allocation of PDIs
[figure: the order in which the 24 tasks in the pool are allocated to PE 1, PE 2 and PE 3 as each PE returns a result and demands the next task]

Load Balancing
All PEs should complete at the same time
• Some PEs busy with complex tasks
• Other PEs available for easier tasks
• Computation effort of each task unknown
– hot spot at end of processing ⇒ unbalanced solution
• Any knowledge about hot spots should be used

Spiral Allocation
[figure: the 24 tasks numbered in the order of a spiral allocation starting from the centre of the image plane]
• Random allocation
• Profiling

Task Management
Encompasses the following functions
• Definition of a task
• Controlling the allocation of tasks
• Distribution of the tasks to PEs
• Collation of the results
– especially with multiple stage problems

Task Definition & Granularity
Computational elements
• Atomic element: ray-object intersection
– the sequential problem's lowest computational element
• Task: trace the complete path of one ray
– the parallel problem's smallest computational element
• Task granularity
– number of atomic units in one task

Task Packet
Unit of task distribution
• Informs a PE of which task(s) to perform
• Task packet may include
– indication of which task(s) to compute
– data items: the PDI and possibly ADIs
• Task packet for a ray tracer ⇒ one or more rays to be traced (see the sketch below)
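One possible shape for such a packet, written as a hypothetical Python dataclass (the field names are my own, not the course's):

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    Ray = Tuple[float, float, float, float, float, float]   # origin + direction

    @dataclass
    class TaskPacket:
        task_ids: List[int]                  # which task(s) to compute
        pdis: List[Ray]                      # principal data items: rays to trace
        adis: Optional[list] = None          # additional data items, if sent along

    packet = TaskPacket(task_ids=[7], pdis=[(0.0, 0.0, 0.0, 0.0, 0.0, 1.0)])
    print(packet)

Whether the ADIs travel inside the packet or are fetched later by the data manager is exactly the design choice the data management section discusses.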

Task Distribution & Control
Completion of a task requires:
• a copy of the algorithm (at the PE)
• + the PDI(s) (identified in the packet)
• + any ADIs (specified by the algorithm)

Algorithmic Dependencies
Algorithm adopted for parallel implementation
• May specify order of task completion
• Dependencies MUST be preserved
• Algorithmic dependencies introduce
– synchronisation points ⇒ distinct problem stages
– data dependencies ⇒ careful data management

Multi-stage Algorithms
[figure: in stage 1 of the algorithm, the principal data item specified by the task plus the additional data items required produce a partial result; the partial results of stage n become the principal data items for stage n+1, which, with further additional data items, feed stage 2 of the algorithm]

Jacobi Method
Iterative technique
• Solving the equation Ax = b for 6 unknowns, the values for the $x_i$ at the nth iteration are:

$$x_1^{n} = \frac{b_1 - a_{12}x_2^{(n-1)} - \dots - a_{16}x_6^{(n-1)}}{a_{11}}$$
$$x_2^{n} = \frac{b_2 - a_{21}x_1^{(n-1)} - \dots - a_{26}x_6^{(n-1)}}{a_{22}}$$
$$\vdots$$
$$x_6^{n} = \frac{b_6 - a_{61}x_1^{(n-1)} - \dots - a_{65}x_5^{(n-1)}}{a_{66}}$$
(a short code sketch follows below)
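As an illustration (my sketch, not the course's code), one Jacobi sweep over all unknowns reads directly off the formulas above; the small test system is invented and chosen to be diagonally dominant so the iteration converges:

    import numpy as np

    def jacobi(A, b, iterations=100):
        """Plain Jacobi iteration for Ax = b (A assumed diagonally dominant)."""
        n = len(b)
        x = np.zeros(n)
        for _ in range(iterations):
            x_old = x.copy()                       # all updates use the previous iterate
            for i in range(n):
                s = sum(A[i, j] * x_old[j] for j in range(n) if j != i)
                x[i] = (b[i] - s) / A[i, i]
        return x

    A = np.array([[4.0, 1.0], [2.0, 5.0]])
    b = np.array([1.0, 2.0])
    print(jacobi(A, b), np.linalg.solve(A, b))     # the two answers should agree closely

Because every update uses only the previous iterate x_old, the n unknowns can be split across PEs and computed in parallel, which is what the two-PE solution on the next slide exploits.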

Two PE Solution
[figure: the 6×6 system Ax = b partitioned between two PEs, with PE 1 holding rows 1-3 of A, x and b, and PE 2 holding rows 4-6]
nth iteration, in parallel:
• PE 1 computes x1, x2, x3
• PE 2 computes x4, x5, x6

Jacobi Process

PROCESS Jacobi()   (* Executing on PE_1 *)
Begin
  Estimate x[1] ... x[6]
  n := 0   (* Iteration number *)
  WHILE solution_not_converged DO
  Begin
    n := n + 1
    Calculate new x[1], x[2] & x[3] using old x[1]...x[6]
    PARALLEL
      SEND new x[1], x[2] & x[3] TO PE_2
      RECEIVE new x[4], x[5] & x[6] FROM PE_2
  END
END (* Jacobi *)

• Neither PE can proceed onto the (n+1)st iteration until both have finished the nth iteration

Data Dependencies
Specific order for task completion
Gauss-Seidel: the $x_1^n$ value is calculated as with Jacobi:
$$x_1^{n} = \frac{b_1 - a_{12}x_2^{(n-1)} - \dots - a_{16}x_6^{(n-1)}}{a_{11}}$$
But $x_2^{n}$ is calculated as:
$$x_2^{n} = \frac{b_2 - a_{21}x_1^{n} - \dots - a_{26}x_6^{(n-1)}}{a_{22}}$$
"Block Gauss-Seidel - Global Jacobi"

Task Scheduling Strategies
Data driven
• SC determines allocation prior to computation starting
• Single task packet with all details sent to each PE

PROCESS Application_Process()
Begin
  RECEIVE task_packet FROM SC via R
  FOR i = start_task_id TO finish_task_id DO
  Begin
    result[i] := Perform_Algorithm(task[i])
    SEND result[i] TO SC via R
  END
END (* Application_Process *)

Demand Driven Task Management
[figure: PEs demand tasks from the task supplier, which draws task packets (id, PDIs, ADIs) from the "pool" of available task packets]
• Task management is explicit
• Task pool allows flexible allocation

Simple Demand Driven System
Processor farm
• Simple to implement and thus widely used
• Single SC and PEs arranged in a chain
• AP performs the computation
• TR and RR route tasks and results respectively
• Single buffer at each PE
• New task supplied by SC when a result is received
• Never more tasks in the system than space available

Processor Farm
[figure: a chain of PEs (each with an application process plus task and result routers) attached to the System Controller; the time sequence shows tasks being loaded from the SC until the farm is fully loaded, the first result being produced, and each returning result causing a new task to enter the farm. Key: SC - System Controller, AP - Application Process, TR - Task Router, RR - Result Router]

Problems with Processor Farm
Simple to implement, BUT
• Computation-to-communication ratio critical
– too low ⇒ significant idle time
• Buffers help, BUT
– too few ⇒ idle time problem not solved
– too many ⇒ reintroduces the problem at the end of processing
• Chain configuration severely exacerbates difficulties

Task Manager Process
Aim: to ensure a sufficient supply of tasks to PEs
• Separate TM process introduced at each PE
• Each TM maintains its own local "task pool"
• AP now deals with the TM and not directly with the SC
• TM responsible for ensuring tasks are available for the AP
– active process ⇒ adapts to changing circumstances

Requesting Tasks
AP should always have its task request satisfied immediately by the TM, unless:
• Initial start-up delay
• No more tasks available
• TM's strategy has failed

Local "Task Pool"
[figure: the TM's local task pool of task packets (id, PDI, ADIs) with a current-status marker of Green, Orange or Red depending on how close the pool is to running dry]
• Green: only issue a request if the current message density is low
• Orange: issue a request unless the message density is very high
• Red: always issue a request
(a small sketch of this policy follows below)
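A toy Python rendering of that traffic-light policy (the thresholds and the message-density test are assumptions of mine, not values from the notes):

    def pool_status(tasks_in_pool, red=1, orange=3):
        """Classify the local task pool by how close it is to running dry."""
        if tasks_in_pool <= red:
            return "red"
        if tasks_in_pool <= orange:
            return "orange"
        return "green"

    def should_request(tasks_in_pool, message_density):
        """Decide whether the TM should issue a task request now."""
        status = pool_status(tasks_in_pool)
        if status == "red":
            return True                       # always request
        if status == "orange":
            return message_density < 0.9      # unless traffic is very high
        return message_density < 0.3          # green: only if traffic is low

    print(should_request(5, 0.5), should_request(2, 0.95), should_request(1, 0.99))
    # False False True

The point of the graded thresholds is that prefetch requests are only deferred while the pool still has slack, never once the AP is about to starve.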

Distributed Task Management
Centralised task supply
• All requests for new tasks go to the SC ⇒ bottleneck
• Significant delay in fetching new tasks
Distributed task supply
• Task requests handled remotely from the SC
• Spread of communication load across the system
• Reduced time to satisfy a task request

Propagating Task Requests
[figure: a task request from the AP passes through the local TM and router, on to neighbouring PEs, and ultimately to the Task Supplier at the System Controller; the task is returned along the same route. Key: AP - Application Process, TM - Task Manager, R - Router, TS - Task Supplier]
• Task requests propagate towards the Task Supplier at the SC
• Producer-consumer model

Preferred Bias Allocation
Combining data driven and demand driven
• Balanced data driven
– tasks allocated in a predetermined manner
• Demand driven
– tasks allocated dynamically on demand
– enables the exploitation of any coherence
• Preferred bias: regions are purely conceptual

Conceptual Regions
[figure: the pool of 24 task ids is divided into conceptual regions for PE 1, PE 2 and PE 3; each PE's demands are preferentially satisfied from its own region]
• Task allocation is no longer arbitrary

Data Management
Providing data to the processors
• World model
• Virtual shared memory
• Data manager process
– local data cache
– requesting & locating data
• Consistency

Remote Data Fetches
Advanced data management
• Minimising communication latencies
– prefetching
– multi-threading
– profiling
• Multi-stage problems

Data Requirements
Requirements may be large
• Fit in the local memory of each processor
– world model
• Too large for each local memory
– distributed data
– provide a virtual world model

World Model
Data requirements are such that
• Fits in the local memory of each processor
– replication of the entire data domain at every processor
– no need for data management
• Each processor only needs a subset of the data
– requisite data stored at each processor
– data may be included with the task packet

Virtual Shared Memory (VSM)
Providing a conceptual single memory space
• Memory is in fact distributed
• Request is the same for both local and remote data
• Speed of access may be very different

Levels of implementation (higher to lower):
  System Software    provided by DM process
  Compiler           HPF, ORCA
  Operating System   coherent paging
  Hardware           DDM, DASH, KSR-1

Implementing VSM
VSM may be implemented at a number of levels
• Hardware
– size of data not determined by hardware
• Operating system
– data managed at page level
• Compiler
– data item as large as any user data structure

Data Manager Process
VSM implemented as system software
• Greatest flexibility
• Explicit features necessary to manipulate data
• Data manager process at each PE
– ensures supply of data items to the PE
– provides the virtual world model

DM Process at each PE
[figure: the AP, TM and DM processes at a PE, all connected to the router R. Key: AP - Application Process, TM - Task Manager, DM - Data Manager, R - Router]
• AP now requests data from the DM

Local Data Cache
Data sharing
• Each data item has a unique number
– the required item can be "located" within the system
• No limit on the size of problem that can be tackled
• PDI specified by the task
• ADIs determined by the algorithm and task

Local Cache at each PE
[figure: the DM's local cache of (id, data item) pairs at a PE, connected to the AP, TM and router]
• DM holds data items up to the limit of the local cache

Virtual World Model (VWM)
Total available storage for data items
• Sum of all local caches + secondary storage
• No limit on the size of problem that can be tackled
• Read-only data
– values not altered during computation
• Read/write data
– care must be taken to ensure consistency

Composition of VWM
[figure: the local caches of PE 1 … PE n plus the secondary storage attached to the SC (via a file manager, FM) together make up the virtual world model]

Requesting Data Items
Algorithm determines the next required data item
• Unique identifier enables the DM to locate the data item
• If the data item is available locally
– it can be transferred directly to the AP
• If the data item is NOT available locally
– the DM must "locate-and-fetch" it from elsewhere in the system

Accessing a Data Item

(a) without DM, the AP accesses the item directly:
  x := data_item[i]
  y := compute(x)

(b) with DM, the AP requests the item from the DM:
  AP:  SEND i TO DM
       RECEIVE x FROM DM
       y := compute(x)
  DM:  RECEIVE i FROM AP
       SEND data_item[i] TO AP

Data Manager Process

PROCESS Data_Manager()
Begin
  busy := TRUE
  WHILE busy DO
    PRIORITISED INPUT ALTERNATIVES
    1. RECEIVE id FROM AP                        (* data request from AP *)
       IF (data_item[id] in local_cache) THEN
         SEND data_item[id] TO AP
       ELSE
         SEND external_request_for_id TO R
       ENDIF
    2. RECEIVE data_item[id] FROM R              (* remote data item *)
       Add_To_Local_Cache(data_item[id])
       (* data must have been requested previously *)
       SEND data_item[id] TO AP
    3. RECEIVE external_request_for_id FROM R    (* external request *)
       IF (data_item[id] in local_cache) THEN
         SEND data_item[id] TO requesting_PE via R
       ELSE
         SEND not_found_id TO R
       ENDIF
End (* Data_Manager *)

AP Idle Time
[figure: (a) a local fetch, where the AP is idle only briefly while the DM returns an item from its local cache; (b) a remote fetch, where the AP is idle for the whole fetch time while the DM locates and fetches the item from another PE]

Locating Data Items
Fetching a remote data item
• DM knows exactly where the data item is
– fetch directly
• Location unknown
– instigate a search
– naive approach ⇒ search every PE in turn
– issue a global broadcast

Resident Sets
Each PE has a predetermined portion of the domain
• Advantages
– location directly determined from the id
• Disadvantages
– no guarantee the PE will require the data in its resident set
– less local cache for required data items
⇒ dynamically create the resident set

Consistency
Read/write data can result in inconsistencies
• Distributed memory
– multiple copies of the same data item
• Updating such a data item
– update all copies of this data item, or
– invalidate all other copies of this data item

Example of Inconsistency
[figure, three stages]
Initial situation: PE 1 has data item y with the initial value 0.
Next: the AP at PE 2 needs the data item y; the DM of PE 2 obtains a copy of y from PE 1.
Finally: PE 1 updates the value of y to 1. The copy being held by the DM on PE 2 is now stale; PE 1 and PE 2 have an inconsistent view of y.

Minimising Impact of Remote Data
Failure to find a data item locally ⇒ remote fetch
• Time to find the data item can be significant
• AP idle during this time
• Latency difficult to predict
– e.g. depends on current message densities
• Data management must minimise this idle time

Data Management Techniques
Hiding the latency
• Overlapping the communication with computation
– prefetching
– multi-threading
Minimising the latency
• Reducing the time of a remote fetch
– profiling
– caching

Prefetching
Exploiting knowledge of data requests
• A priori knowledge of data requirements
– nature of the problem
– choice of computational model
• DM can prefetch them up to some specified horizon
– available locally when required
– overlapping communication with computation

Storing Prefetched Data Items

[Figure: the local cache treated as a circular buffer of (id, data item) slots; the AP takes the next data item from one end while the latest prefetched data item is added at the other]

• DM loading prefetched data items at one end
• AP requesting data items from the other end

Data Management 24
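The circular buffer can be expressed with a fixed-size list and two indices, the DM filling one end and the AP consuming from the other. This is a minimal single-threaded sketch; the class name and capacity handling are assumptions for illustration.

# Minimal circular buffer for prefetched (id, data item) pairs.
class PrefetchBuffer:
    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0        # next slot the AP will read
        self.tail = 0        # next slot the DM will fill
        self.count = 0

    def dm_load(self, item_id, data):
        if self.count == len(self.buf):
            raise RuntimeError("buffer full: prefetch horizon too large")
        self.buf[self.tail] = (item_id, data)
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1

    def ap_next(self):
        if self.count == 0:
            return None                      # AP would stall (remote fetch needed)
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return item

buf = PrefetchBuffer(4)
for i in range(3):
    buf.dm_load(i, f"data_{i}")              # DM prefetches ahead of the AP
print(buf.ap_next())                          # (0, 'data_0')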

Multi-Threading

Keeping the PE busy with useful computation
• Remote data fetch ⇒ current task stalled
• Start another task
  – separate threads of computation (BSP)
• Advantages: PE kept busy
• Disadvantages: overheads

Data Management 25

Several APs per PE

[Figure: a PE containing several Application Processes (AP) connected through an Application Process Controller (APC) to the Task Manager (TM), the Data Manager (DM) and the Router (R)]

Key: AP - Application Process, APC - Application Process Controller, TM - Task Manager, DM - Data Manager, R - Router

Data Management 26

Difficulties with Multi-Threading

Overheads of multiple APs
• Context switches between threads
• Increased message densities
• Reduced local cache for each thread

A "Cache-22" situation
• Less local cache ⇒ more data misses ⇒ more threads

Data Management 27

Results for Multi-Threading

[Graph: performance against the number of application processes per PE (up to 30), for 16, 32 and 63 processing elements]

• Performance increases until a certain number of threads
• More threads reduces performance

Data Management 28

Profiling

Reducing the remote fetch time
• At the end of the computation all data requests are known
  – if known, then they can be prefetched
• Monitor data requests for each task
  – build up a "picture" of possible requirements
• Exploit spatial coherence with preferred bias allocation
  – prefetch those data items likely to be required

Data Management 29

Spatial Coherence

[Figure: the problem domain is the set of all data items associated with the problem; each of tasks i, j and k applies the algorithm to its PDI together with its ADIs to produce its result]

Data Management 30

Caching

Used to minimise inter-processor communication
• A cache consists of two parts
  – a directory of items in the cache
  – the cache memory itself
• Cache design is determined by
  – a placement policy
  – a replacement policy

Caching 1

Placement and Replacement

Placement:
• Specifies the mapping from blocks of main memory to those in the cache

Replacement:
• Determines which cache entries are to be replaced when the cache is full and a new item needs to be placed

Caching 2

Placement Policies

• Direct mapping
  – object I → slot I mod M
• Fully-associative mapping
  – object I → any slot
• Set-associative mapping
  – object I → any slot in partition I mod K
• Sector mapping

Caching 3
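The first three policies can be written down directly as functions that return the legal slots for an object id. The sketch below assumes a cache of M slots divided into K equal partitions; the parameter values in the demo are arbitrary.

# Candidate cache slots for object id I under each placement policy.
def direct_mapped_slot(i, m):
    return [i % m]                               # exactly one legal slot

def fully_associative_slots(i, m):
    return list(range(m))                        # any slot

def set_associative_slots(i, m, k):
    set_size = m // k
    first = (i % k) * set_size                   # partition I mod K
    return list(range(first, first + set_size))

print(direct_mapped_slot(13, 8))                 # [5]
print(set_associative_slots(13, 8, 4))           # partition 1 -> [2, 3]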

Replacement Policies

Strategies:
• LIFO - Last In First Out
• LRU - Least Recently Used

Problems occur when multi-threading is used
• A data item may be important for one thread but not for another

Caching 4
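As an illustration of a replacement policy, here is a small LRU cache built on Python's OrderedDict. It is a generic sketch rather than the cache used for the course results.

# Least Recently Used replacement: a hit makes the item most recent;
# a miss on a full cache evicts the least recently used entry.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()       # oldest entry first

    def get(self, key):
        if key not in self.items:
            return None                  # cache miss -> caller fetches remotely
        self.items.move_to_end(key)      # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)   # evict least recently used
        self.items[key] = value

cache = LRUCache(2)
cache.put('a', 1); cache.put('b', 2)
cache.get('a')                            # 'a' is now most recent
cache.put('c', 3)                         # evicts 'b'
print(list(cache.items))                  # ['a', 'c']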

Coherence (or Locality)

A small set of data items is used very often, while a large set of data items is accessed only occasionally

Parallel processing jargon:
• Temporal coherence
• Spatial coherence
• Spatial coherence in the network

Caching 5

Spatial Coherence

Caching 6

Coherence in Rendering

Slightly different jargon:
• Object coherence
• Image coherence
• Ray coherence
• Frame coherence
• Data coherence

Caching 7

Object Coherence

The main form of coherence
• Objects consist of separate connected pieces bounded in space
• Distinct objects are disjoint in space
• Is directly exploited by spatial subdivision techniques
• Other forms of coherence are derived from object coherence

Caching 8

Ray Coherence

Rays starting at the same origin and travelling in similar directions are likely to intersect the same objects
• For example, primary rays in ray tracing
• Sampling area light sources

Caching 9


Ray Coherence

Caching 10

What to Store?

Normally geometry is stored in caches
• Takes up valuable cache memory
• Communication cost relatively high

Data compression sometimes possible
• Use simplified geometry
• Or use a different format altogether

Caching 11

Examples Drawn from Graphics

• Geometric simplification
• Grouping
• Levels of Detail
• Environment mapping
• Impostors
• Directional caching

Caching 12


Environment Mapping

Caching 13

Environment Mapping

Caching 14

Impostors

Caching 15


Impostors

Caching 16

Directional Caching

Can be built separately for each leaf voxel in a scene
• Is created on the fly
• Stores directional information indicating which areas are important (high variance, high intensity)
• Important directions are always fully sampled
• Undersampled directions are fully sampled
• Directions with low variance of intensity are not sampled but approximated with a cache lookup

Caching 17

Directional Caching

Caching 18

Parallel Ray Tracing

Three different flavours of parallel ray tracing:
• Image space subdivision
• Object space subdivision
• Hybrids of the above two

Parallel Ray Tracing 1

Parallel Ray Tracing

(Equivalent) parallel processing terms:
• Demand/data driven scheduling
• Data parallel processing
• Hybrid scheduling

Parallel Ray Tracing 2

Data Driven Ray Tracing

Screen space subdivision is most common
• Screen areas are distributed over the processors prior to rendering
• For optimal performance, geometry will have to be replicated with each processor
• Different workload per image region, hence load imbalances may occur

Parallel Ray Tracing 3

Data Driven Ray Tracing — image space subdivision, static load balancing

Parallel Ray Tracing 4

Demand Driven Ray Tracing

Similar to data driven ray tracing
• But tasks are handed out by a master processor on demand
• When a processor finishes a subregion, it requests a task from the master processor
• Largely solves load balancing problems, except at start-up time and near the end of the computation

Parallel Ray Tracing 5

Demand Driven Ray Tracing — image space subdivision, dynamic load balancing

Parallel Ray Tracing 6

Demand Driven Ray Tracing

If the screen space is irregularly subdivided as a result of a profiling pass:
• Workload for each screen region is more or less the same
• Load balancing at start-up and at the end of the computation may be solved
• Unfortunately, geometry still has to be replicated for optimal performance

Parallel Ray Tracing 7

Demand Driven Ray Tracing — image space subdivision, dynamic load balancing, with unequal task sizes

Parallel Ray Tracing 8

Issues Related to Demand Driven and Data Driven Scheduling

• If scene geometry can be replicated:
  – communication is minimal (no bottleneck)
  – load balancing problems can be solved (no bottleneck)
  – but memory usage can be excessive
  – best option for small scenes that fit into a single processor's memory

Parallel Ray Tracing 9

Issues Related to Demand Driven and Data Driven Scheduling

• If scene geometry cannot be replicated:
  – data fetches will be necessary, hence communication may become a bottleneck
  – waiting for requested data causes idle times and load imbalances
  – caching is therefore vitally important
  – but memory usage problems are resolved
  – can be the only option for large scenes

Parallel Ray Tracing 10

Data Parallel Ray Tracing

Attempts to reduce memory requirements
• Scene data is distributed over the available processors
• Rays are traced by the processor that holds the relevant data
• If no intersection occurs, a ray task is migrated to a neighbouring processor

Parallel Ray Tracing 11
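The ray-migration idea can be sketched with a deliberately simplified one-dimensional "scene": each processor owns one voxel, and a ray task hops from owner to owner until it hits something or leaves the scene. The layout and object test below are illustrative assumptions only.

# Toy data-parallel ray tracer: a ray is traced by the owner of its current
# voxel and migrated to the neighbouring processor when nothing is hit.
def trace_data_parallel(ray_pos, ray_dir, voxel_objects, voxel_size=1.0):
    num_procs = len(voxel_objects)
    proc = int(ray_pos // voxel_size)            # processor owning the start voxel
    migrations = 0
    while 0 <= proc < num_procs:
        hit = next((obj for obj in voxel_objects[proc]
                    if obj['lo'] <= ray_pos <= obj['hi']), None)
        if hit is not None:
            return hit['name'], migrations
        ray_pos += ray_dir * voxel_size          # step into the next voxel
        proc += 1 if ray_dir > 0 else -1         # migrate the ray task
        migrations += 1
    return None, migrations                      # ray left the scene

scene = [[], [], [{'name': 'sphere', 'lo': 2.2, 'hi': 2.8}], []]
print(trace_data_parallel(0.5, +1.0, scene))     # ('sphere', 2)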

Data Parallel Ray Tracing

Parallel Ray Tracing 12

Data Parallel Ray Tracing

Pros:
• Memory is not wasted: each object is located in only one processor's memory
• Therefore, almost arbitrarily large scenes can be rendered
• Task communication is between neighbouring processors, and is therefore local by nature

Parallel Ray Tracing 13

Data Parallel Ray Tracing

Cons:
• Workload is distributed very unevenly
• There is very little control over workload distribution
• Therefore, load balancing is a severe problem
• Leading to a performance that is little better than a sequential ray tracer

Parallel Ray Tracing 14

Hybrid Scheduling

In a hybrid demand driven/data parallel ray tracer:
• Retain the advantages of both demand driven and data parallel ray tracing
• Eliminate the disadvantages of both

Parallel Ray Tracing 15

Hybrid Scheduling

Overlay demand driven and data parallel components on each processor!
• If a processor has a choice:
  – execute data parallel tasks first
  – else execute demand driven tasks
• The data parallel part will generate a basic but uneven load
• The demand driven component will balance the load

Parallel Ray Tracing 16

Hybrid Scheduling

How to differentiate between the different tasks
• Demand driven tasks should:
  – bring about a relatively high workload
  – use a small subset of the scene geometry
• Data parallel tasks should:
  – handle the remainder of the tasks

Parallel Ray Tracing 17

Hybrid Scheduling

Scherson and Caspary:
• Demand driven ray traversal
• Data parallel ray-object intersection tests
• Unfortunately, ray traversal is relatively cheap
• Therefore, ray traversal has insufficient load balancing ability

Parallel Ray Tracing 18

Hybrid Scheduling

Reinhard and Jansen:
• Distinguish between coherent and non-coherent tasks
• Coherent tasks:
  – bundles of primary rays
  – bundles of shadow rays
• Non-coherent tasks:
  – everything else

Parallel Ray Tracing 19

Hybrid Scheduling

• Coherent tasks:
  – use a small set of scene data
  – have a substantial workload
  – ideal demand driven tasks
• Non-coherent tasks:
  – difficult to predict which objects will be intersected
  – use task migration (data parallel)

Parallel Ray Tracing 20
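A hybrid scheduler in this style only needs a predicate that separates coherent ray bundles from everything else. The classification below is a schematic sketch of the Reinhard and Jansen idea; the task records and field names are invented for illustration.

# Schematic hybrid scheduler: coherent tasks (bundles of primary or shadow
# rays) are handed out demand driven; the rest are data parallel and follow
# the scene data via task migration.
def is_coherent(task):
    return task['kind'] in ('primary_bundle', 'shadow_bundle')

def schedule(tasks):
    demand_driven = [t for t in tasks if is_coherent(t)]       # pulled by idle PEs
    data_parallel = [t for t in tasks if not is_coherent(t)]   # migrate with the data
    return demand_driven, data_parallel

tasks = [
    {'kind': 'primary_bundle', 'rays': 64},
    {'kind': 'reflection_ray', 'rays': 1},
    {'kind': 'shadow_bundle', 'rays': 16},
]
dd, dp = schedule(tasks)
print(len(dd), len(dp))   # 2 1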

Parallel Radiosity

Three algorithms considered:
• Full matrix radiosity
• Progressive refinement
• Hierarchical radiosity

Parallel Radiosity 1

Parallel Radiosity

Issues related to all three algorithms:
• Communication tends to be the main problem
• Scene geometry is updated during the course of the computation, therefore cache inconsistencies may occur
• At each iteration, each processor needs access to all scene geometry

Parallel Radiosity 2

Full Matrix Radiosity

If the full solution is required, then gathering has advantages over progressive refinement
• Assume the geometry is distributed across the processors
• The algorithm consists of two steps:
  – form factor computation
  – solving the resultant set of equations

Parallel Radiosity 3

Full Matrix Radiosity

Form factor computation:
• Data requirements are known a priori:
  – all data is needed at some stage during the computation
• A systolic approach may be used:
  – scene data is passed from processor to processor

Parallel Radiosity 4

Full Matrix Radiosity

Solving the set of equations:
• The Jacobi method is often used because it is:
  – inherently parallel
  – directly applicable
• But Jacobi has a slow convergence rate
• Other methods include group iterative methods

Parallel Radiosity 5
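For the radiosity system B_i = E_i + rho_i * sum_j F_ij B_j, a Jacobi sweep computes every new B_i from the previous iterate only, which is why each row can be assigned to a different processor. The sketch below is a plain sequential illustration with made-up emission, reflectance and form factor values.

# Jacobi iteration for the radiosity system  B_i = E_i + rho_i * sum_j F_ij * B_j.
# Each B_i of the new iterate depends only on the old iterate, so the rows
# are independent tasks; this sketch simply runs them in a loop.
def jacobi_radiosity(E, rho, F, iterations=50):
    n = len(E)
    B = E[:]                                          # start from the emission
    for _ in range(iterations):
        B = [E[i] + rho[i] * sum(F[i][j] * B[j] for j in range(n))
             for i in range(n)]
    return B

E   = [1.0, 0.0, 0.0]                                 # one emitting patch
rho = [0.3, 0.6, 0.5]
F   = [[0.0, 0.4, 0.3],                               # illustrative form factors
       [0.2, 0.0, 0.5],
       [0.3, 0.4, 0.0]]
print([round(b, 3) for b in jacobi_radiosity(E, rho, F)])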

Full Matrix Radiosity

Demand driven approach:
• Upon completion of a form factor, a new task is requested from a master processor
• The master processor may form a communication bottleneck

Data parallel approach:
• Uses task migration between processors
• Suffers mostly from load balancing problems

Parallel Radiosity 6

Case Study: Gathering using the Hemi-cube Approach

Form factor computation:
• One task is the computation of a whole row of form factors
• Computational variations in computing delta form factors suggest a demand driven approach

Matrix solution:
• Gauss-Seidel is inappropriate for parallel computing
• Use 'Block Gauss-Seidel - Global Jacobi' instead

Parallel Radiosity 7

Case Study: Data Management in Parallel Gathering

• For computing a single row of form factors, all the environment's patch data is required
• Therefore, prefetching of data is appropriate

Parallel Radiosity 8

Case Study: Configurations

The underlying network topology affects performance
• Processor chain
• Ring
• Hypercube
• Mesh
• Torus
• Ternary tree

Parallel Radiosity 9

Case Study: Configurations

[Figure: an example interconnection of 32 processing elements (numbered 0-31) together with a system controller (SC)]

Parallel Radiosity 10

Case Study: Diameters

Network diameter for each configuration:

Processors     AMP   Hypercube   Torus   Ternary Tree   Mesh   Ring   Chain
8              2     3           3       4              4      4      7
13             2     -           -       4              -      6      12
16             3     4           4       5              6      8      15
23             3     -           -       6              -      11     22
32             3     5           6       6              10     16     31
40             4     -           7       6              11     20     39
53             4     -           -       8              -      26     52
64             4     6           7       8              14     32     63
128            5     7           12      10             22     64     127

Parallel Radiosity 11

Case Study: Distances

Average inter-processor distance for each configuration:

Processors     AMP    Hypercube   Torus   Ternary Tree   Mesh   Ring    Chain
8              1.28   1.5         1.5     1.97           1.75   2       2.63
13             1.55   -           -       2.56           -      3.23    4.31
16             1.73   2           2       2.91           2.5    4       5.31
23             2.05   -           -       3.39           -      5.74    7.65
32             2.31   2.5         3       3.93           3.88   8       10.67
40             2.53   -           3.5     4.25           4.23   10      13.99
53             2.76   -           -       4.77           -      13.25   17.66
64             2.92   3           4       5.01           5.26   16      21.33
128            3.58   3.5         5.64    6.25           7.94   32      42.66

Parallel Radiosity 12

Progressive Refinement

Main problem:
• Each shooting patch updates most remaining patches in the scene
• Therefore, all geometry needs to be accessed at each iteration
• If data is replicated with each processor, then updates must be broadcast to all processors

Parallel Radiosity 13

Progressive Refinement

Possible solution:
• Duplicate geometry data with each processor:
  – allows parallel visibility computations
• Distribute patch and element data:
  – avoids data consistency problems

Parallel Radiosity 14

Progressive Refinement

Data parallel approach:
• Patches are distributed amongst the processors
• Each processor selects a shooting patch from its own selection of patches
• Local patches are updated directly
• Remote patches are updated by communicating energy to the relevant processors

Parallel Radiosity 15

Progressive Refinement

Demand driven approach:
• A master processor keeps track of patches with unshot energy
• It globally selects the patch with the most unshot energy whenever a processor requests a new task
• Load balancing can be solved by poaching tasks from busy neighbouring processors
• The master processor may form a communication bottleneck

Parallel Radiosity 16

Communication in Data Parallel Progressive Refinement

If a processor shoots energy that leaves its local subspace:
• Energy may be temporarily stored at grid boundaries
• When enough is accumulated, it is communicated to a neighbouring processor
• This is called the 'Virtual Walls' technique
• Loses directionality information at the boundaries

Parallel Radiosity 17

Virtual Walls

Parallel Radiosity 18

Virtual Interfaces

• Voxel boundaries are subdivided into small patches (as in the virtual walls method)
• Directional information is retained at these patches

Problems with virtual walls/interfaces:
• The position of the light sources is crucial to the efficiency of the algorithm in the early stages
• During the computation, a processor may run out of patches with unshot energy, hence task redistribution may be required

Parallel Radiosity 19

Virtual Interfaces

Parallel Radiosity 20

Parallel Particle Tracing

Embarrassingly Parallel?
• Paths of every particle are independent
• BUT
  – low coherence due to the stochastic nature
  – large variations in computational complexity between particles
  – complex scenes ⇒ large data requirements
  – need to accumulate all particle interactions

Parallel Particle Tracing 1

Particle Tracing Algorithm

For each particle:
• Choose wavelength, emitter and position on the emitter by DDS
• Update outgoing particle flux at the emitter's surface
• REPEAT
  – find nearest interacting surface or medium
  – ascertain interaction type (absorption, reflection, etc.)
  – determine particle and surface/medium status after the interaction
• UNTIL particle absorbed

Parallel Particle Tracing 2
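The loop above translates almost directly into code. The sketch below uses a trivial surface list and a fixed absorption probability purely to show the control flow; DDS emitter selection and the real interaction model are not reproduced here.

# Skeleton of the particle tracing loop: emit a particle, then repeatedly
# find the next interaction and accumulate flux UNTIL the particle is absorbed.
import random

def trace_particle(num_surfaces, flux, absorb_prob=0.4, rng=random):
    emitter = rng.randrange(num_surfaces)         # stand-in for DDS emitter selection
    flux[emitter] += 1.0                          # outgoing flux at the emitter's surface
    while True:                                   # REPEAT ... UNTIL particle absorbed
        surface = rng.randrange(num_surfaces)     # stand-in for "nearest interacting surface"
        flux[surface] += 1.0                      # particle contributes to this surface
        if rng.random() < absorb_prob:
            break

flux = [0.0] * 3
rng = random.Random(1)
for _ in range(1000):                             # particle paths are independent
    trace_particle(3, flux, rng=rng)
print([round(f) for f in flux])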

Scene Data

Several scenes considered:

Scene          Objects   Sources   Polygons   Spheres
MirroredRoom   124       2         22         102
Room1          234       2         234        0
Room2          240       8         240        0
LargeRoom      380       8         380        0
DiningRoom1    402       1         402        0
DiningRoom2    404       3         402        2
LectureHall    3180      16        3180       0

Parallel Particle Tracing 3

Example Scenes

[Images of the MirroredRoom and DiningRoom2 scenes]

Parallel Particle Tracing 4

Task Scheduling

Identification of a task
• Atomic element: one particle/object intersection
• Task: trace the complete path of a particle
• PDIs: the surfaces to which the particle makes a contribution
• ADIs: the scene data
• Task packet: one or more particles to be computed

Parallel Particle Tracing 5

Computational Complexity

[Graphs: particle frequency (%) against path length (left: particle path lengths) and against the number of objects intersected (right: particle object intersections), for the DiningRoom2 and MirroredRoom scenes]

Parallel Particle Tracing 6

Computational Model

Large unknown variations
• To ensure load balancing ⇒ demand driven
• The graph compares the models for the world model
  – error bars show the difference between the first processor's completion time and the eventual problem solution
• The correct choice of computational model can make a significant difference

Parallel Particle Tracing 7

Data vs Demand Driven Model

[Graph: speed-up against the number of processors (2-62) for the MirroredRoom scene, comparing data driven and demand driven scheduling with linear speed-up]

Parallel Particle Tracing 8

Dynamic Task Packet Variations

[Diagrams: system utilisation and task packet size over time up to problem completion, for a fixed task packet size and for a variable task packet size]

Parallel Particle Tracing 9
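A simple way to realise a variable packet size is to make each packet a fraction of the work still outstanding, so packets shrink towards problem completion and the last processors to finish are not left holding large packets. The rule below is only one plausible choice; the constants are illustrative.

# Illustrative task packet sizing: hand out a fraction of the remaining
# particles per request, bounded by minimum and maximum packet sizes.
def next_packet_size(remaining, num_pes, max_packet=256, min_packet=1):
    size = max(remaining // (2 * num_pes), min_packet)   # shrinks near completion
    return min(size, max_packet, remaining)

sizes = []
remaining, num_pes = 10000, 32
while remaining > 0:
    packet = next_packet_size(remaining, num_pes)
    sizes.append(packet)
    remaining -= packet
print(sizes[0], sizes[-1])    # large packets early (156), size 1 near completion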

Effect of Task Packet Variations

[Graph: speed-up against the number of processors (2-62) for demand driven scheduling (world model), comparing fixed, linearly varying and logarithmically varying packet sizes with linear speed-up]

Parallel Particle Tracing 10

Data Management

Controls the movement of data
• Environmental model: the majority of the data space
  – object information and spatial subdivision
• Meshes associated with each object
  – size ⇒ should not be moved unless absolutely necessary
• Two-section cache organisation

Parallel Particle Tracing 11

Cache Organisation at the DM

[Figure: a two-section cache — the Data Manager cache holds, per item identifier, each object's fixed parameters and variable coefficients, while a separate surface flux storage area holds the surface flux grid for each object]

Parallel Particle Tracing 12

Coherence

[Graph: normalised cumulative usage frequency (cuf) against fraction of the database for the LargeRoom, LectureHall and MirroredRoom scenes, showing the locality of data requests]

Parallel Particle Tracing 13

Locality of Requests

[Graph: particle frequency (%) against fraction of the database, showing the frequency of references]

Parallel Particle Tracing 14

Caching Strategies

[Graph: relative execution time against cache size (% of dataset, 20% down to 12%) at the 10% level of coherence, comparing direct mapping with random, FIFO, FINUFO and LRU fully associative replacement]

Parallel Particle Tracing 15

Caching and Coherence

[Graph: relative execution time against cache size (% of dataset, 13% down to 8%) at the 5% level of coherence, comparing direct mapping with random, FIFO, FINUFO and LRU fully associative replacement]

Parallel Particle Tracing 16

Demand Fetch

[Graph: relative solution time against cache size (% of dataset, 30% down to 5%) for varying demand fetch request paths: Hamiltonian path, broadcast request and home location path]

Parallel Particle Tracing 17

Prefetch

[Graph: relative solution time against cache size (% of dataset, 30% down to 5%) for varying prefetch request paths (broadcast, home location, nearest location, spanning tree) compared with demand fetch]

Parallel Particle Tracing 18

Prefetch Limitations

Prefetching can reduce idle time, BUT
• As the cache size is reduced
  – cache overwrite
  – the prefetch horizon must be dynamic
• For small cache sizes prefetch by itself is not sufficient

Parallel Particle Tracing 19

Demand Fetch vs Prefetch

[Graphs: cache miss ratio (%) and relative solution time against cache size (% of dataset, 30% down to 5%) for demand fetch and prefetch]

Parallel Particle Tracing 20

Multi-Threading

[Graph: relative solution time against cache size (% of dataset, 20% down to 7%) at 10% request coherence, for a single AP and for 2, 3, 4 and 5 threads]

Parallel Particle Tracing 21

Combining Strategies

[Graphs: cache miss ratio (%) and relative solution time against cache size (% of dataset, 30% down to 5%), comparing prefetch with a single AP against prefetch combined with multi-threading (2 prefetch APs)]

Parallel Particle Tracing 22

Profiling

[Graph: object access (% of all accesses) per object, for profiles of 100, 500, 1K, 5K, 10K, 50K and 100K particles, showing the distribution of data usage]

Parallel Particle Tracing 23

Task Migration

[Graph: solution time (seconds) against cache size (% of dataset, 50% down to 0%) on 16, 32 and 63 AMPs, comparing random task allocation with no migration against 35% task migration (conceptual vs random allocation)]

Parallel Particle Tracing 24

Conclusions

Data management
• Stochastic nature of particles
  – a large percentage of the problem domain is required
• Caching strategies effective
  – …% of all references to less than …% of the domain
• Combined strategies more effective
  – prefetch + multi-threading

Parallel Particle Tracing 25

Conclusions (contd)

Task scheduling
• Computational model important
  – variations in complexity ⇒ demand driven
• Profiling provides clues to data usage
  – used for predictive prefetching
• Task migration
  – a threshold based on local data availability assists load balancing

Parallel Particle Tracing 26

Summary

Photo-Realistic Graphics
• Desirable
• BUT: high computational requirements
⇒ Parallel Processing

BUT
• Large data requirements, variations in computational complexity, etc.

Summary

Careful attention must be paid to:
• Data dependencies ⇒ task definition and granularity
• Computational variations ⇒ computational model
• Distinct problem stages ⇒ task scheduling
• Data requirements ⇒ data management
• Communication requirements ⇒ system communication


Included papers: F.W. Jansen and A.G. Chalmers, “Realism in Real Time?”, Proceedings Fourth Eurographics Workshop on Rendering, Paris, France, June 1993, pp 27–46. E. Reinhard and F.W. Jansen, “Rendering Large Scenes Using Parallel Ray Tracing”, reprinted from Parallel Computing, 23(7), July 1997, pp 873–885, with permission from Elsevier Science.

L. Renambot, B. Arnaldi, T. Priol and X. Pueyo, “Towards Efficient Parallel Radiosity for DSM-based Parallel Computers Using Virtual Interfaces”, © 1997 IEEE. Reprinted, with permission, from Proceedings of the 1997 Symposium on Parallel Rendering, pp 79–86, October 1997.

J.P. Singh, A. Gupta and M. Levoy, “Parallel Visualisation Algorithms: Performance and Architectural Implications”, © IEEE. Reprinted, with permission, from IEEE Computer, 27(7), pp 45–55, July 1994.


Invited presentation 4th EG Workshop on Rendering, Paris, June 1993, p. 27-46.

Realism in Real Time? Frederik W. Jansen, Delft University of Technology* Alan Chalmers, University of Bristol** Abstract With the continuous improvement in ray tracing and radiosity algorithms, image synthesis quality has reached the level of photo realism. However, efforts to achieve real-time performances by implementing ray tracing and radiosity algorithms on parallel processors and dedicated hardware have not been very successful so far. Increasing the number of processors introduces a corresponding growth in inter-processor communication. Caching could be effective to reduce communication, if sufficient coherence would be available in subsequent data requests. Unfortunately, standard ray tracing and radiosity do not provide enough coherence. In this paper we review the different parallel approaches and we try to ascertain those issues that are crucial for further improvement. In particular, we will focus on load and data management strategies that effect the amount of data coherence in ray tracing, and on methods to improve ray and object coherence. Keywords and phrases: rendering, ray tracing, radiosity, graphics hardware, parallel processing, data coherence 1. Introduction Over the last decade, computer graphics research has been very successful in achieving two goals: real-time display and increased realism in display. Real-time display has been accomplished by implementing the viewing pipeline of the projective depth-buffer hidden-surface algorithm in dedicated hardware, and increased realism has been accomplished by extending the traditional ray tracing algorithm to include also diffuse interreflection and indirect specular reflection. Unfortunately, the combination of both goals, realism in real time, is still waiting to be realised. Since the introduction of the geometry engine (Clark 1982), the depth-buffer based display systems have shown a steady increase in performance, both in speed and quality. Starting with a display rate of 30,000 polygons per second in the early eighties, current systems are now able to display more than a million polygons per second, allowing display of reasonable complex scenes in real-time (Torborg 1987; Akeley and Jermoluk 1988; Kirk and Voorhies 1990). Although the depth-buffer algorithm, inherently a pro*

Faculty of Technical Mathematics and Informatics, Julianalaan 132, 2628 BL Delft, The Netherlands. Email: [email protected] ** Department of Computer Science, Queen's Building, University Walk, Bristol BS8 1TR, United Kingdom, Email: [email protected]


jective algorithm, is not able to handle optical effects such as shadows, highlights and mirroring reflections in a natural way, several techniques have been developed to enhance the realism by adding textures, anti-aliasing, motion-blur, depth-of-field, etc. (Heaberli and Akeley 1991). Also diffuse interreflection, and area light sources have been incorporated by adding a radiosity pre-processing that subdivides the scene into a mesh of small surface patches and elements, and calculates the exchange of energy between these patches to account for the diffuse interreflection between surfaces (Cohen and Greenberg 1985; Nishita and Nakamae 1985). Display of these elements makes it possible to walk through interiors and still maintain a high-degree of shading accuracy (Baum and Winget 1990; Baum et al. 1991). The accurate representation of shading and shadow gradients is very much dependent on the accuracy of the element mesh. A resolution too high will be too expensive while a resolution too low will not adequately represent the shading gradients. Adaptive meshing techniques have been developed to provide locally a higher resolution to accommodate shading discontinuities (Cohen et al. 1986). Further improvements have been sought in exact meshing techniques that align the boundaries of elements with shading discontinuities (Campbell and Fussell 1990; Heckbert 1992; Lischinski et al. 1992). In this case, however, it is inevitable that a priori knowledge about shading discontinuities is available, obtained for instance from projecting surface contours onto these patches. This is a very expensive and complex kind of (object-space-oriented) pre-processing, particularly if curved surfaces are involved. Nevertheless, the pre-processing will be worthwhile if one wants to display a scene in real time with depth-buffer based display hardware for use in walk-through applications (Baum and Winget 1990). Thus, within the paradigm of depth-buffer-based projective display of polygons a whole set of techniques has been developed to increase realism without sacrificing real-time performance. True realism, however, will still prove to be difficult to achieve with the projective approach because of the lack of specular and mirroring reflections and of the limited accuracy of even the most advanced exact meshing and interpolation technique. The other display paradigm, ray tracing, has always been appreciated for its high-quality rendering capabilities. The initial 'recursive' ray tracing algorithm (Whitted 1980) did model effectively cast shadows and optical effects such as mirroring reflection and transparancy. With stochastic ray tracing, the repertoire of optical effects was further expanded to soft shadows, motion blur and depth-of-focus, and the image quality was improved by anti-aliasing (Cook 1986; Dippé and Wold 1985; Lee et al. 1985; Mitchell 1987). The addition of Monte Carlo sampling techniques to capture also the indirect light has even further increased the realism and accuracy of the illumination calculation (Kajiya 1986), as did improved reflection models (Cook and Torrance 1984) and texture filtering (Heckbert 1986). A radiosity preprocessing has been introduced to ray tracing too (Wallace et al. 1989; Sillion and Puech 1989) to account for the indirect diffuse reflection. 
The image quality is here less dependent on the accuracy of the mesh because most important shading and shadow continuities can be re-sampled during rendering, either for specular reflection only (Sillion and Puech 1989), most important direct light (Shirley 1990; Chen et al. 1991; Kok and Jansen 1991) or for all direct and indirect light (Rushmeier 1988; Chen et al. 1991). However, the more sampling is done, the more computation times will tend to explode.


Efficiency improving techniques such as adaptive ray tracing (Painter and Sloan 1989) and spatial subdivision techniques (Glassner 1989) are effective, but processing times for complex scenes are still in the order of minutes. For that reason, ray tracing has always been a popular subject for parallel processing and good results have been achieved (Scherson and Caspary 1989; Green and Paddon 1989), but not in the sense that it has brought ray tracing of complex scenes within reach of interactive display. The alternative of designing special VLSI hardware, the popular route for the depth-buffer approach mentioned above, has not been tried so much for ray tracing. The efforts of Kedem and Ellis (1984, 1989) and Pulleyblank and Kapenga (1986, 1987) and the more recent developments by Shen and Deprettere (1992) are the notable exceptions so far. The major bottleneck in parallel processing appears to be the data communication between processors. Communication can be reduced by use of caching. However, caching is only effective when enough coherence is available in subsequent data requests (Green and Paddon 1989). Unfortunately, standard ray tracing as such does not provide much coherence between subsequent ray intersection tasks. In this paper we will therefore review different ray tracing algorithms and we will analyse how the amount of coherence can be increased by adapting the order in which ray intersections are scheduled, and by eliminating as much as possible ray intersections that will tend to destroy potential coherence. The paper is structured as follows: In section 2, the requirements for realistic rendering are summarised, the state-of-the-art in global illumination reviewed, and an outline of a family of ray tracing algorithms with radiosity preprocessing is given. In section 3, the different hardware and parallel processing approaches are discussed and the role of caching in relieving the communication bottlenecks is emphasised. In section 4, several forms of coherence are discussed and in section 5, techniques to improve data coherence in ray tracing and a hybrid task scheduling strategy are proposed. In section 6, we discuss the open issues. 2. Realism in computer graphics Realism can only be achieved by combining sophisticated modelling and rendering techniques, such as those for modelling curved surfaces, specifying procedural models, applying texture mapping and filtering, light source models, local reflection models (isotropic/anisotropic, diffuse/specular reflection, refraction, absorption, etc.) and global reflection models (interreflection patterns between surfaces, simulation of soft shadows, mirroring reflections and participating media). Although all these subjects are of equal importance, global reflection (global illumination or inter-reflection) is currently considered to be most crucial, in particular in applications for architecture and interior design. To give an indication of the complexity of the interreflection problem, some of the paths travelled by the light leaving a light source before it reaches the eye are shown in figure 1. The situation is simplified in the sense that surfaces are assumed to be either purely diffuse or purely specular. Path 1 represents the direct diffuse reflection, path 2 the diffuse-specular reflection, path 3 the diffuse-diffuse reflection and path 4 the speculardiffuse reflection. Other possible paths, e.g. only specular (highlight) or specularspecular reflection are not included in the figure.


[Figure: a scene with diffuse and specular surfaces, a light source, the image plane and the eyepoint; the numbered arrows show the reflection paths 1-4 discussed in the caption below]

Figure 1. Different paths of light reflection. Standard projective algorithms (depth-buffer, scan-line, etc.) will only account for light following path 1 and for the direct specular reflection of light, however, without shadow testing. Standard ray tracing (Whitted 1980; Glassner 1989) does sample light following path 1 (including shadow detection) and 2, but it does not account for the indirect reflection of light as a result of the interreflection between surfaces in the scene (path 3), nor for the light that is first reflected by a specular surface before it is diffusely reflected by a visible surface (path 4); see also (Arvo 1986). To capture this light it will be necessary to cast secondary rays into all directions, but now these secondary rays will hit other surfaces for which no intensity is known, and thus the sampling has to be done recursively (Kajiya 1986). Sampling efficiency can be improved by applying importance sampling strategies (Kajiya 1986; Shirley and Wang 1991; Arvo and Kirk 1990) and by exploiting coherence, for instance in the form of illuminance caching (Ward et al. 1988). To avoid the expensive recursive viewpoint dependent sampling, a viewpoint independent pre-processing - also known as the radiosity pass - can be done to access the global light distribution in a scene and to precompute the amount of light that each surface receives from its environment (Cohen and Greenberg 1985; Nishita and Nakamae 1985). The radiosity pass can be done either by calculating the energy exchange between surfaces in the scene by simultaneous solving a set of linear equations, or by a progressive radiosity method that 'shoots' light from light sources to other surfaces; light which in turn is re-shot to other surfaces, and so on, until a good approximation of the final light distribution is achieved (Cohen et al. 1988). Just as with ray tracing, shooting is preferably done in a stochastic and recursive (but also expensive) way, which is also known as 'particle tracing' (Pattanaik and Mudur 1992).


algorithm   meshing           display                     light source sampling             shadow accuracy     time
a           exact/extensive   ray tracing without spec.   no                                dependent on mesh   short
b           extensive         ray tracing                 no                                dependent on mesh   long
c           moderate          ray tracing                 shadow testing primary sources    good                longer
d           low               ray tracing                 also secondary sources            better              even longer
e           no/low            ray tracing                 recursive sampling                best                longest

Figure 2. Different versions of the two-pass radiosity algorithm. There are several versions of two-pass algorithms that combine ray tracing-based rendering with a ray tracing-based radiosity pre-processing (Wallace et al. 1987, 1989); see algorithm a, b, c, d, and e in fig. 2. The first (a) displays the pre-computed radiosity values, just as the depth-buffer algorithm. The second version (b) takes the precomputed radiosity value as the diffuse intensity of the patch and adds the specular reflection component to it by tracing secondary rays (Sillion and Puech 1989). This version still requires an extensive radiosity pre-processing because the shadows from the major (point) light sources are implicitly included in the radiosity shading. The third version (c) only uses the pre-computed radiosity intensity as an improved ambient term and it resamples the light from the most important light sources and patches to calculate more accurate shadows (Shirley 1990; Chen et al. 1991; Kok and Jansen 1991; Kok et al. 1991). This version performs a source selection or source classification during the radiosity pass to determine which patches can be considered as important light sources. The contributions of these selected sources are then not included in the pre-computed radiosity values; during the rendering pass not only specular rays are traced, but also shadow rays are cast to the selected light sources to accurately calculate their contribution to the shading of the patches. The fourth version (d) re-samples all the light by shooting secondary rays to all directions (Rushmeier 1988); now the radiosity shading is not used at all for display, but only to quantify the light that is diffusely reflected by each patch and that is sampled during the rendering by the secondary (shadow) rays. The last version can be generalised in the aspect that sampling can be continued recursively when the pre-computed radiosity value is not accurate enough (version e). Whether recursion is only done until the first level of recursion, 'one-level path tracing' (Rushmeier 1988), or deeper can be made dependent on the detection of shading discontinuities (highlights, shadow boundaries, etc.) in the neighborhood, or can be a function of the required image quality. The ultimate algorithm for realism will be a combined radiosity-rendering algorithm that uses a recursive stochastic sampling technique both for shooting (particle tracing) and sampling (ray tracing). With an importance-driven technique (Smits et al. 1992; Pattanaik and Mudur 1993), the radiosity refinement can be focused on the surfaces that


are visible. Of course, the amount of ray intersections needed for these algorithms will be too large to be performed in real time by one processor for some time to come, and we will have to defer to parallel processing. 3. Parallel processing The dedicated hardware and parallel processing approaches that have been so successfully applied to the depth-buffer algorithm, cannot be applied directly to ray tracing, because the ray-object intersection is much more complex than scan conversion and both the objects and rays cannot be processed in a strict linear order suitable for pipe-lining. In fact, ray tracing can be seen as three quite separate tasks: shading, ray traversal, and ray patch intersection. The shading task initiates primary and secondary rays, performs local light reflection, texture filtering, and anti-aliasing, and is responsible for the final pixel color computation. In the radiosity pre-processing the shading task initiates the hemisphere shooting and updates the patch radiosities. The ray traversal task takes the rays and intersects them with the cells of the spatial subdivision structure, and finds the patches that are candidate for ray intersection. The ray intersection task performs the actual ray-patch intersection and returns the results to the shading task. The shading and ray traversal tasks are very data intensive and not so much computing intensive. The intersection task is both data and processing intensive. This task breakdown was reason for Gaudet et al. (1988) implementing these tasks separately on micro-coded processors which are connected to each other and to the frame buffer to exchange (intermediate) results. In addition, shading, spatial subdivision, and object data are continuously broadcast over three separate buses. The processors take the information from the buses when they need it and when it comes by. The system can be smoothly scaled up by adding more processors, but communication time will increase proportionally with the size of the object data. In general, these three different tasks are seldom considered in total. In most cases, the main focus is on the ray traversal and ray intersection task. Dedicated VLSI implementation of the ray intersection task has first been considered by Kedem and Ellis (1984). A proto-type of their ray casting engine for quadratic surfaces has actually been built (Kedem and Ellis, 1989). A design for a VLSI chip for bicubic patch intersection was published by Pulleyblank and Kapenga (1986, 1987). Work on this project has since then been continued and extended to (two-pass) radiosity algorithms (Yilmaz et al. 1989). A design has been made for a 'radiosity engine' in the form of a plug-in board to enhance the performances of standard workstations for high-quality rendering (Shen et al. 1990, 1991). The board contains several intersection processors, each equiped with several ray traversal units. As the basic computational primitive for hardware implementation was chosen the intersection of a frustum (part of a hemisphere) of rays with a set of bicubic patches. Being a more compact surface representation, the communication for bicubic patches is considerably reduced compared to polygons. However, given several very fast pipe-lined computational units to calculate many ray-patch intersections in parallel, the ray-intersection will be extremely fast and the communication between the patch database at the host and the intersection computation units will be the bottleneck. 
Besides, by exploiting coherence for neighbouring rays (see the following section),


reduction in communication is also expected from a hierarchy of caches. See for details and simulation results (Shen and Deprettere 1992). Many publications have appeared on ray tracing using general purpose parallel processors, in particular using distributed memory systems (e.g. transputer systems), with many showing excellent speedups. However, these good results are usually obtained provided the object data base is replicated on each processor. For complex scenes the data requirements may be very large, far larger than can be accommodated locally at each processor, and now the complete data base must either only be kept by the host and the data sent on request, or it can be distributed over the processors' local memory. This second strategy is preferable due to the communication bottlenecks at the host that are bound to occur with the first strategy. Assuming that the combined memory of the multiprocessor system is sufficient to contain the whole data base then an initial approach may be to allocate each processor an equal portion of the data base up to the limit of its local memory. Processing load distribution may now be performed in one of two ways. Firstly the tasks can be assigned to the processors that contain the relevant data. In ray tracing, the object data can, for instance, be distributed on the basis of an object space subdivision (Cleary et al. 1983) or an hierarchical box method (Scherson and Caspary 1988). Each processor stores the object data of one (or more) partitions of a spatial subdivision. When a ray intersects one of these portions, the task packet for that ray is sent to this processor and if no intersection occurs or when additional secondary rays are generated then the ray(s) is sent to the processor that stores the appropriate next cell. The object data distribution is thus static and the task distribution follows the 'ray flow' through the system. Although not strictly 'data driven' this approach is often characterised as such. This allocation method will suffer from potential load imbalances should certain areas of the scene attract the majority of the rays. A low-resolution pre-processing can be applied to make a first estimation of the expected load distribution (Salmon and Goldsmith 1988; Priol and Bouatouch 1989) and to correct for this. However, as a result of the different ray directions, the processing load for each partition will vary over time. To keep all the processors busy in such a situation it is necessary to dynamically adjust the size of the partitions and redistribute the object data accordingly. One such system was proposed by Dippé and Swensen (1984) and others (Nemoto and Omachi 1986; I` s, ler et al. 1991) have proposed further improvements. A drawback of these methods is that if the ray flow varies quite some bit (and it mostly does) then re-adjusting the object data does not pay and the efficiency is reduced instead of improved. Kobayashi et al. (1988) try to avoid this by distributing a larger number of space partitions over the processors. In this way, load balancing is better, but the amount of data communication will increase. The second strategy of work allocation uses a demand-driven approach: tasks are assigned to processors when they are ready to accept new tasks (Green and Paddon 1989; Badouel et al. 1990). The object data required to execute the task is not available locally then it has to be requested from the appropriate remote processor that stores the data. 
To avoid repeated requests for the same data item, frequently used data items are cached at the local memory of the processor. Part of the memory is thus allocated to store a segment of the data base, the other part is used as a local cache. It is of course


preferable to schedule the tasks not completely at random to the processors but instead to take into account the cache contents. Tasks that use the same data should be preferably scheduled to the same processor to allow the cache contents to be re-used. Each processor receives segments of a coherent task and only when other processors run out of their work, segments are assigned to these processors. This can also be implemented by a 'task stealing' strategy. A drawback of the demand-driven approach is that much requested data items will be resident in many caches and much communication is lost on the less used data items. Another drawback is that as the number of processors increases, the amount of communication will grow accordingly. A hybrid approach is proposed by Scherson and Caspary (1988). Object data is distributed over the processors according to a spatial subdivision or (in their implementation) a bounding box method. Intersection tasks are assigned in a data-driven manner to the processors that contain the relevant data. Ray traversal tasks, however, are assigned in a demand-driven mode: tasks are assigned to processors that are less busy with intersection calculations. This hybrid strategy has the advantage that the intersection computations provide a base load although different for each processor, and that the unbalance in intersection tasks can be compensated with ray traversal tasks, thus avoiding expensive load balancing strategies involving large amounts of data communication. Unfortunately, the ray traversal task is relatively simple and increasing the ray traversal task (by further descending the bounding box hierarchy or by increasing the level of spatial subdivision) increases the amount of communication. Thus, the work load is perfectly balanced and the speed-up is almost linear, but the total performance is low, making this solution less cost effective for a large number of processors. Summarising, a pure data-driven approach does not seem to be very desirable because of its dependency on the object distribution in space and associated (probably unevenly distributed) processing demand. Far worse, the processing load distribution will probably vary rapidly over time, introducing a severe overhead for dynamic load balancing. Further, because of the large number of rays, the communication overhead for the ray messages will be large. A demand-driven approach, on the other hand, has few problems with load balancing and seems more appropriate for very large data bases. But, here the problem is to keep the data communication as a result of the data requests by the processing tasks as low as possible. As most requests will concern the same (small set) of data items, exploitation of coherence by use of caching is recognised by the proponents of the demand-driven approach as an all important factor for efficiency (Green and Paddon 1989; Badouel et al. 1990; Shen and Deprettere 1992). In the rest of the paper we will discuss this issue in more detail. For the following discussion, we will partly abstract from the different hardware architectures and consider a system consisting of a high-performance workstation enhanced with a set of intersection processors, either in the form of general-purpose parallel processors or of dedicated VLSI hardware, as the context for our simulation. The workstation itself could be a shared-memory multi-processor system with enough memory to contain the model and shading database. 
The shading and ray traversal tasks, which are assumed to be data intensive but less computing intensive, are done on the workstation. The computing-intensive ray-patch intersection is delegated to the intersection processors. The intersection processors are connected to the host through a
controller. The controller is connected to the host with a bus or a fast link and has a relatively large cache. Each intersection processor has its own local memory (fig. 3). The basic idea is very much the same as described in (Green and Paddon 1989) and (Shen and Deprettere 1992).


Figure 3. Conceptual architecture (Green and Paddon 1989).

The processor inter-connection could be a bus or a network. With a larger number of processors, the message latency becomes an important factor, and a number of techniques have been proposed to avoid a processor standing idle while object data is fetched from a remote location. Minimum path configurations (AMPs) and message reduction schemes such as 'poaching' and 'snooping' have been shown to be successful in reducing communication overheads and thus the latency for fetching a remote data item (Chalmers 1991; Chalmers and Paddon 1992). Nevertheless, for a large multiprocessor system this delay may still be significant. Multi-threading allows each processor to trace the paths of a number of rays concurrently, so that should the computation associated with one ray be delayed awaiting a remote data item, the other computations may still continue (Chalmers et al. 1993). However, it has been shown that the maximum number of threads that can be supported efficiently at each processor is limited and, worse, that this limit is lower for larger numbers of processors due to increasing message densities. While these methods improve the performance of multiprocessor systems, it is only in combination with caching that we are likely to achieve performance on parallel systems approaching the real time that we desire. Caching is able to exploit any coherence within a scene to significantly reduce the number of remote data item fetches, and so once more allow good speed-ups even when the data is distributed across all the processors. Green and Paddon (1989) discuss
several caching strategies. Their starting point is a demand-driven approach on the basis of image space subdivision. Tasks assigned to processors make requests for object data, which are satisfied either by a local resident set, a local cache, the cache of the controller, the resident sets and caches of neighboring processors, or by the host. A low-resolution ray tracing pass is proposed as a profiling method to select the resident set and the initial cache filling. A similar strategy is proposed in (Shen and Deprettere 1992).
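As a rough illustration of such a lookup chain (not the implementation of any of the cited systems; all names are ours), a request is first checked against the fixed resident set chosen by the profiling pass, then against a small LRU cache, and only on a double miss is the item fetched remotely (from the controller cache or the host) and inserted into the cache:

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <unordered_set>
#include <utility>

struct Patch { int id; /* geometry, material, ... */ };

// Placeholder for an expensive remote fetch (controller cache or host).
Patch fetchRemote(int id) { return Patch{id}; }

class ObjectStore {
public:
    ObjectStore(std::unordered_set<int> residentIds, std::size_t cacheCapacity)
        : resident_(std::move(residentIds)), capacity_(cacheCapacity) {}

    Patch get(int id) {
        if (resident_.count(id))                 // 1. fixed resident set (profiled)
            return Patch{id};
        auto it = cache_.find(id);               // 2. local LRU cache
        if (it != cache_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.second);  // mark most recently used
            return it->second.first;
        }
        Patch p = fetchRemote(id);               // 3. remote fetch (the costly case)
        lru_.push_front(id);
        cache_[id] = {p, lru_.begin()};
        if (cache_.size() > capacity_) {         // evict the least recently used entry
            cache_.erase(lru_.back());
            lru_.pop_back();
        }
        return p;
    }

private:
    std::unordered_set<int> resident_;
    std::size_t capacity_;
    std::list<int> lru_;                         // most recently used at the front
    std::unordered_map<int, std::pair<Patch, std::list<int>::iterator>> cache_;
};
```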

Figure 4. Caching strategy (Green 1991).

The effectiveness of caching is significantly dependent on the coherence that the algorithm can offer, and thus the overall effectiveness of parallel processing is also significantly dependent on the way the ray tracing and radiosity algorithms proceed and the order in which intersection calculations are scheduled.

4. Coherence

Green and Paddon (1989) discuss several forms of coherence. Besides image and frame coherence they discern the following types:
- ray coherence: neighbouring rays will likely intersect the same surfaces; this allows reuse of patch data used for previous rays (Speer et al. 1985; Hanrahan 1986; Arvo and Kirk 1987);
- object coherence: local neighborhoods of space tend to be occupied by only a small set of the total number of objects; this allows for a fast localization of the candidate set for intersection with a spatial subdivision technique (Glassner 1989);
- data coherence: requests to the object data base will tend to be restricted to a small subset of the whole data base and tend to show a large amount of spatial locality (Green and Paddon 1989).
Within a certain time span, data coherence can be interpreted as a more general form of ray and object coherence: given a sequence of data requests, many requests will be for the same data items. This form of coherence is very much dependent on the order in which the algorithm proceeds and traverses object space. We will now discuss the different forms of coherence in more detail.

4.1 Ray coherence

Clustering neighbouring rays into frustums has been proposed earlier to reduce ray tracing costs and to perform efficient anti-aliasing (Speer et al. 1985; Hanrahan 1986). Ray frustums are a very versatile computational primitive within ray tracing and radiosity.
The primary rays in ray tracing form a frustum with the eye point as origin. Ray frustums are further used to sample area light sources, and to simulate depth-of-focus and motion blur. In radiosity algorithms, hemispheres of rays are used for shooting. Intersection of a frustum of rays with a set of patches as a computational primitive is particularly attractive for a pipe-lined hardware intersection unit that can test a bundle of rays simultaneously against a set of patches (Shen et al. 1990). To further increase coherence and to distribute the processing over multiple intersection units, Shen et al. subdivide each frustum into sectors (see fig. 5). All rays in one sector will be loaded on one intersection computation unit. The size (angle) of the sectors is made dependent on the expected patch density and patch distribution to obtain a good load balancing over the different intersection units and to keep the number of calls to the database minimal.
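A simple way to picture this sector subdivision is to bucket the rays of a hemisphere by azimuth around the shooting patch normal, so that each bucket can be handed to one intersection unit. The sketch below uses equal-angle sectors for brevity; the adaptive sector angles based on patch density described above are omitted, and all names are ours.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
struct Ray  { Vec3 origin, dir; };

// Bucket rays into nSectors equal azimuthal sectors around the local z axis
// (the patch normal). Each sector can then be shipped to one intersection unit.
std::vector<std::vector<Ray>> makeSectors(const std::vector<Ray>& rays, int nSectors) {
    const double pi = 3.14159265358979323846;
    std::vector<std::vector<Ray>> sectors(nSectors);
    for (const Ray& r : rays) {
        double phi = std::atan2(r.dir.y, r.dir.x);   // azimuth in (-pi, pi]
        if (phi < 0.0) phi += 2.0 * pi;              // map to [0, 2*pi)
        int s = static_cast<int>(phi / (2.0 * pi) * nSectors);
        if (s == nSectors) s = 0;                    // guard against rounding at 2*pi
        sectors[s].push_back(r);
    }
    return sectors;
}
```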

[Figure 5 diagram labels: grid of regular space subdivision; patches; rays; sector plane; sector angle; shooting patch (preprocessing); sampling patch (display).]

Figure 5. Ray frustum and sectors.

The ray frustum intersection is indeed a versatile computational primitive for hardware implementation. However, it also has some drawbacks. The ray frustum method is mainly intended for undirected shooting and sampling (see fig. 6), in order to avoid the overhead at the host of searching the whole data base and clustering the rays into sectors. This means that rays are cast without aiming at a specific patch or a specific point (e.g. a vertex). This fits well within a Monte Carlo type of sampling, but not very well within a progressive radiosity method as it is usually implemented (with directed shooting). See (Wallace et al. 1989) for a discussion of the advantages and disadvantages of directed and undirected sampling, and (Shirley and Wang 1992), where these are called implicit and explicit sampling. The undirected sampling poses some additional constraints on the resolution of the mesh with respect to the resolution of the rays.
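As a concrete illustration of undirected shooting, the standard cosine-weighted hemisphere sampling used in Monte Carlo radiosity generates ray directions around the patch normal without aiming at any particular receiver. This is a minimal, local-frame sketch (transforming the direction into the patch's world frame is omitted); it is not the sampling code of the system discussed here.

```cpp
#include <cmath>
#include <random>

struct Vec3 { double x, y, z; };

// Cosine-weighted direction on the hemisphere around the local +z axis
// (the patch normal). u1, u2 are uniform random numbers in [0,1).
Vec3 cosineSampleHemisphere(double u1, double u2) {
    const double pi = 3.14159265358979323846;
    double r   = std::sqrt(u1);          // radius on the unit disc
    double phi = 2.0 * pi * u2;          // azimuth
    double x = r * std::cos(phi);
    double y = r * std::sin(phi);
    double z = std::sqrt(1.0 - u1);      // cos(theta); pdf = cos(theta)/pi
    return Vec3{x, y, z};
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    for (int i = 0; i < 4; ++i) {
        Vec3 d = cosineSampleHemisphere(uni(rng), uni(rng));
        (void)d;  // shoot an undirected ray from a random point on the patch along d
    }
}
```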


Figure 6. Directed (left) versus undirected shooting (right).

If the number of rays is too low and the mesh resolution too high, then some elements of the mesh will not receive a contribution, or the contribution will not be spread evenly. This is in fact bound to happen: as the distance over which the rays travel increases, the rays get further separated, and thus the mesh resolution can never be optimally adapted to the ray density. See for instance the disastrous effect of undirected shooting from a light source onto a patch that is regularly subdivided into 256 elements (fig. 7). Directed shooting takes 293 rays (fig. 7h). A comparable quality can only be obtained with more than 100,000 rays shot in an undirected way (see fig. 7a-g).

Figure 7. Results of the radiosity pass for a patch in front of a light source. Undirected shooting with a: 1, b: 10, c: 100, d: 1,000, e: 10,000, f: 100,000, g: 1,000,000 rays, and directed shooting, h: 293 rays.

To accommodate the resolution of the receiving mesh to the resolution of the rays, a hierarchical mesh data structure should be used that assigns intensity values to levels
corresponding to the density of the receiving rays (Asensio 1992; Languénou et al. 1992). At the end of the radiosity pass, the different levels can then be merged to obtain the radiosity values at the patch vertices. Further, the number of rays could be made dependent on the (expected) density of the patches in a sector, the sizes of the patches, the distance of the patches from the ray origin, the intensity of the patches (during display), the chance of shading discontinuities (both during radiosity preprocessing and display), the reflectance properties of the patches, etc. Source selection and source classification criteria could thus be applied here (Kok and Jansen 1991; Chen et al. 1991). The ray density would then reflect the 'importance' of the shooting/sampling direction of that sector.

Another problem with frustum tracing is that shooting is preferably not done in a hemisphere-type fashion, but by shooting individual rays from stochastically distributed positions on the patch, preferably reflecting the energy distribution over the patch. Ray directions should be stochastically distributed as well to avoid aliasing. Finally, for specular and mirroring reflection, a frustum of rays will spawn a large number of secondary rays, with different starting points and possibly different directions. All of this has a dramatic effect on the ray coherence. Therefore, a compromise will have to be sought between accuracy and coherence.

4.2 Object coherence

The fact that a certain neighborhood of space will tend to contain only a small subset of the total objects in a scene allows us to localize the candidate set for intersection with a given ray by applying a spatial subdivision technique. Regular grids as well as adaptive space subdivision schemes have been proposed (Glassner 1989). A uniform grid allows for fast ray traversal, but does not adapt well to an uneven distribution of objects. A two-level grid, where complex objects have their own internal grid, seems to combine the advantages of fast traversal and adaptive space subdivision. Unfortunately, for a frustum of rays the spatial subdivision is not nearly as effective as for single rays. Only near the frustum origin is object localization achieved. As the rays propagate through space, the distance between the rays increases and ray coherence diminishes. This is amplified by the fact that some of the rays will be intercepted by patches while others continue. Object coherence is also small for objects consisting of a large number of small patches (for instance a keyboard with individually modelled keys): each ray will hit a different patch. Again, as with the shooting resolution, the level of object intersection has to be adapted to the density of the rays. A large number of small patches should be grouped together if this set of patches is only intersected by a small number of rays (Kok 1993). Further, ray tracing should be stopped when the coherence drops below a threshold value. Rays could then be assigned values based on the average intensity of a group or subspace. As in the virtual walls technique (Xu et al. 1989), planes of the spatial subdivision structure could be given intensity values representing the scene behind them. Rays that lose coherence could be terminated by assigning them values from this object or space hierarchy.

4.3 Data coherence

For the more general form of coherence in subsequent data requests, Green and Paddon (1989) coined the term data coherence.
This type of coherence is caused by repetition in
the algorithm, for instance by shooting multiple hemispheres from the same patch, or by repeated shadow testing for the same light sources. In general this type of coherence is very much dependent on the order in which the algorithm proceeds, in particular on the order in which ray trees are traversed through space. Each primary ray will spawn, besides shadow rays, a number of secondary rays to account for specular reflection and transparency. Each of these secondary rays might in turn generate a number of tertiary rays, and so forth. The amount of coherence in the primary rays will be large, but will drop quickly with each new generation of rays (see fig. 8).


Figure 8. Ray tree.

However, if the density of the primary rays increases (say from 4 per pixel to 64 per pixel), then the coherence of the primary rays will improve only slightly (because it is already high), but the coherence of the secondary and following rays will grow considerably. This notion is the basic idea behind the adaptive rendering algorithm proposed in the following section.

Note also that a 'breadth-first' shooting of the ray tree may be preferable to 'depth-first' shooting (Hanrahan 1986). Breadth-first shooting requires that intermediate results be stored temporarily. The same holds for methods that try to benefit from coherence between neighbouring ray trees, for instance by shooting simultaneously to the same light source. In general, a breadth-first approach, i.e. first shooting all the primary rays, then all the secondary rays, etc., is beneficial to the coherence. Of course, in practice this strategy cannot be fully exploited because of the limited amount of processing and storage available for the intermediate results. Note, however, that progressive radiosity actually does this: ray recursion is only applied for specular reflection, and diffuse reflection is postponed until a patch becomes a shooting patch. This can be done because the diffuse reflection is more or less direction independent. Recursive shooting (particle tracing) can be characterised as a depth-first approach.
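The contrast between the two orders can be sketched as follows (all names are ours and the intersection step is stubbed out): breadth-first processes the ray tree one generation at a time from a stored batch, so each generation can be intersected against the database as one coherent group, at the price of storing the intermediate generations; depth-first recurses immediately.

```cpp
#include <vector>

struct Ray { int depth; /* origin, direction, weight, ... */ };

// Placeholder: intersect one generation of rays against the database and
// return the rays they spawn (the secondary/shadow rays of the next generation).
std::vector<Ray> traceGeneration(const std::vector<Ray>& rays) {
    std::vector<Ray> spawned;
    for (const Ray& r : rays)
        if (r.depth < 3)                        // illustrative recursion limit
            spawned.push_back(Ray{r.depth + 1});
    return spawned;
}

// Breadth-first: all primary rays first, then all secondary rays, and so on.
void traceBreadthFirst(std::vector<Ray> generation) {
    while (!generation.empty())
        generation = traceGeneration(generation);
}

// Depth-first (recursive shooting, as in particle tracing), for contrast.
void traceDepthFirst(const Ray& r) {
    for (const Ray& child : traceGeneration(std::vector<Ray>{r}))
        traceDepthFirst(child);
}

int main() {
    std::vector<Ray> primaries(16, Ray{0});     // e.g. a 4x4 pixel tile
    traceBreadthFirst(primaries);
    for (const Ray& r : primaries) traceDepthFirst(r);
}
```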
Finally, a more general form of spatial coherence could also be exploited. For instance, in progressive radiosity, speed of convergence could be sacrificed to optimise the coherence resulting from shooting from neighbouring patches, i.e. patches near to each other are processed directly after each other, ignoring for a while patches with a higher priority at a larger distance.

5. Coherence improving methods

At this point we bring together the different components of our analysis. Our main concern is a ray tracing algorithm with a radiosity pre-processing and with additional shadow sampling for the main light sources to improve shadow accuracy (see section 2). The algorithm will be implemented on a high-performance workstation enhanced with several intersection processors. Data communication between host and intersection processors is reduced by use of caching (section 3). Nevertheless, the communication between the host and the intersection processors is considered to be the bottleneck in the system (rather than the shading or ray traversal calculations by the host), and thus the performance of the system will be determined mainly by the amount of coherence in subsequent patch requests (section 4). What we want to achieve is that the amount of communication is kept in proportion to the amount of intersection computation, that is, the coherence should be kept at a constant, preferably high, level. There are several ways to achieve this: by choosing a suitable breadth- or depth-first ray tracing method, by improving object coherence with a hierarchical object data base, and by a task and data management strategy that optimally matches processing and data demands.

5.1 Recursive depth and sampling control

We can imagine different modes for the rendering (given an already calculated radiosity pre-processing), in the same style as the adaptive refinement method of Chen et al. (1991):
- direct display of the radiosity mesh; only primary rays are cast and shading is directly interpolated from the pre-calculated radiosity values; only the directly visible patches will be communicated to the intersection unit;
- ray tracing with shadow testing; shadow testing is added to improve shadow accuracy for the most important light sources; the intersection processors will now also request patches in between the visible patches and the light sources;
- ray tracing with secondary rays for simulating mirroring reflections and transparency; each of the secondary rays will in turn also spawn shadow rays;
- ray tracing with diffuse and specular inter-reflection; secondary rays are cast into all directions to sample the indirect light; at this stage we use the pre-computed radiosity shading only to answer these ray queries; in some cases, if required, sampling may be continued for another recursive level;
- in addition, a continuous shooting process could refine the radiosity solution, in particular for the visible part of the scene; also here a trade-off can be made between a simple progressive radiosity pass and a more recursive, stochastic method.
It is clear that the first mode (only primary rays) provides the highest degree of coherence, and that coherence will be minimal for the most elaborate form of sampling for the diffuse and specular reflection.
However, as the initial sampling rate of the primary rays increases, the coherence will increase as well, in particular for the secondary and tertiary rays, and more ray recursion can be allowed for a pre-specified amount of coherence (to balance the computation and communication). The basic idea, therefore, is to link the level of ray recursion to the amount of available coherence. The coherence in its turn is dependent on the initially chosen sampling rate of the primary rays and on local patch parameters such as the curvature of the surface, the number of relevant light sources, etc. Decisions to continue ray tracing or to start another level of ray recursion should therefore preferably be made on local criteria.

The algorithm could be implemented as follows. As the primary rays hit a patch, the intersections are collected on a patch-by-patch basis. Secondary and shadow rays are calculated for these points, and the amount of coherence is estimated on the basis of the specular coefficient, the number of rays, the position and direction of the rays, etc. Only if the coherence is high enough are (frustums of) rays sent to the intersection processors. Otherwise, ray tracing is deferred until enough rays have been collected, or alternatively, a pre-processed radiosity/ambient value is assigned as an estimate for the correct value. Further, each ray is only continued as long as there is enough coherence; if the coherence drops below a certain level, the ray is terminated and assigned a value from the virtual database (e.g. the cell planes of the spatial subdivision). Also, a hierarchical element mesh structure or a hierarchical object grouping will be applied to adapt the sampling resolution of the rays to the size of the elements, patches and objects.

5.2 Task and data management

So far, the proposed coherence techniques do not specify how the intersection tasks are assigned to the processors. Clearly, the primary rays within one frustum (or section of a frustum) will preferably be assigned to the same processor. This may also hold for some of the secondary and shadow rays. At a certain point, however, ray and object coherence drop below a certain level and only a more general form of spatial coherence, which is more difficult and expensive to grasp (e.g. the overlap in visible patches seen from different, possibly remote, patches), is left. When coherence becomes low, data communication will grow, possibly to the point where 'data flow' communication exceeds the amount of 'ray flow' communication that would occur with a pure data-driven approach. Therefore, we could imagine a hybrid approach in line with (Scherson and Caspary 1988) where each processor stores a partition of the spatial subdivision and processes the (randomly oriented) rays that traverse that partition. The more coherent ray frustums could then, in a demand-driven mode, be assigned to processors that are less busy with their own partition. The result will be that many-referenced items may be replicated over many processors, however without increasing the data communication for the less-used data items. In case more coherence is needed for the demand-driven tasks, a specialisation can also be applied: some processors could work purely or mainly in a demand-driven mode and others in a data-driven mode.
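The combination of the local coherence test of section 5.1 and the hybrid assignment of section 5.2 can be sketched as follows. All names, the coherence estimate and the threshold are our own illustrative assumptions, not taken from the paper: a bundle of rays collected at a patch is either dispatched demand driven, as a whole, to the least busy processor, or its rays are handed individually to the data-parallel owners of the voxels they start in.

```cpp
#include <cstddef>
#include <vector>

struct Ray { int startVoxel; /* origin, direction, ... */ };

struct Processor {
    std::size_t pendingTasks = 0;                 // current demand-driven backlog
};

// Crude coherence estimate: a large bundle leaving the same patch in similar
// directions is coherent; a handful of scattered rays is not. Purely illustrative.
double estimateCoherence(const std::vector<Ray>& bundle) {
    return static_cast<double>(bundle.size());    // refine with direction spread, specularity, ...
}

std::size_t leastBusy(const std::vector<Processor>& procs) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < procs.size(); ++i)
        if (procs[i].pendingTasks < procs[best].pendingTasks) best = i;
    return best;
}

void sendDemandDriven(std::size_t /*proc*/, const std::vector<Ray>&) { /* ship whole frustum */ }
void sendDataParallel(int /*voxelOwner*/, const Ray&)                { /* single ray to owner */ }
int  ownerOfVoxel(int voxel, std::size_t nProcs) { return voxel % static_cast<int>(nProcs); }

void schedule(const std::vector<Ray>& bundle, std::vector<Processor>& procs,
              double coherenceThreshold) {
    if (estimateCoherence(bundle) >= coherenceThreshold) {
        std::size_t p = leastBusy(procs);         // demand driven: coherent frustum
        procs[p].pendingTasks += 1;
        sendDemandDriven(p, bundle);
    } else {
        for (const Ray& r : bundle)               // data driven: rays follow the data
            sendDataParallel(ownerOfVoxel(r.startVoxel, procs.size()), r);
    }
}
```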
6. Conclusions and discussion

Data communication will be the main bottleneck in general-purpose parallel processing systems with distributed memories for some time to come. In custom-designed systems the communication bandwidth can be optimised, but data transfer rates will still be a limiting factor. Data communication requirements can be relieved with a suitable caching strategy. The effectiveness of caching will depend strongly on the amount of data coherence the algorithm can offer. We have discussed several sources of coherence and proposed some methods to increase object and spatial coherence. We have analysed possible data and task management strategies and proposed a hybrid algorithm that uses both a demand-driven and a data-driven task assignment strategy: demand driven for highly coherent ray intersection tasks and data driven for low-coherence tasks.

There are several issues still to be explored. One is the trade-off between regular and stochastic ray tracing. Regular sampling/shooting (standard ray tracing and progressive radiosity) proceeds in incremental order; while ray coherence is then optimal, the load balance may be strongly affected by the direction of the rays. A stochastic sampling approach, on the other hand, casts rays from random positions into random directions and will exhibit less data coherence, but the overall pattern of computation will be more stable. It may then be worthwhile to optimise not only the load balance, but also the communication flow between processors, for instance by minimising the average distance of the ray flow. It cannot be said in advance whether an optimal load balance and consistent ray flow will compensate for the loss of coherence; this will have to be verified by experiments.

Ray tracing in combination with particle tracing will provide the biggest challenge. The algorithm can be described as stochastic depth-first shooting and sampling. It can effectively model all kinds of anisotropic reflection and scattering in participating media. However, coherence will be minimal. This will probably mean that for parallel processing the load balance is perfect, but that both the ray flow and the object data flow will be large, unless local memory is sufficient to hold all object data that is relevant for a (semi-permanent) task.

A possible architecture is depicted in fig. 9. There are three clusters of processors: one works in a demand-driven mode, one in a data-driven mode, and one processes the ray tracing results and does the final image synthesis. The demand-driven cluster processes the first generations of rays for both the particle tracing and the rendering. Rendering rays have priority over particle rays; however, when the view point does not change, particle rays may be given a higher priority. As soon as coherence has dropped below a pre-specified level, the rays are passed to the data-driven cluster. The proposed architecture could be converted into a minimum-path configuration. For instance, with a 6x6 data-driven cluster, the maximum distance from any of the demand-driven/image processors to any of the data-driven processors could be minimised to four. Which configuration is optimal will depend very much on what the critical communication paths are.

In the discussion so far, we have neglected the shading task. Texture mapping and physically-based reflection models may involve quite some data. As long as the shading task can be done by the host, the data can reside at the host. If the shading task has to be distributed, then, given the amount of data, it can only be in data-driven mode. The same may apply for the radiosity mesh information.

Figure 9. Architecture for combined particle-ray tracing.

Our intention was to review the current state-of-the-art and outline possible directions for further research. A lot of work still has to be done before we will see photo-realistic images of complex scenes being generated in real time.

Acknowledgements

The ideas reported in this paper have grown out of discussions with Arjan Kok and the members of the Radiosity Engine Project, Ed Deprettere, Gerben Hekstra, Li-Sheng Shen and Theo Verelst from the Network Theory Section of the Faculty of Electrical Engineering of Delft University of Technology.

References

Akeley, K., Jermoluk, T. (1988), High-performance Polygon Rendering, Computer Graphics 22(4): 239-246, Siggraph'88.
Arvo, J. (1986), Backward Ray Tracing, Developments in Ray Tracing, Siggraph'86 course notes.
Arvo, J., Kirk, D. (1987), Fast Ray Tracing by Ray Classification, Computer Graphics 21(4): 55-64.
Arvo, J., Kirk, D. (1990), Particle Transport and Image Synthesis, Computer Graphics 24(4): 63-66, Siggraph'90.
Asensio, F. (1992), A Hierarchical Ray Casting Algorithm for Radiosity Shadows, Proceedings of the 3rd Eurographics Workshop on Rendering, Bristol.
Badouel, D., Bouatouch, K., Priol, T. (1990), Strategies for Distributing Data and Control for Ray-Tracing on Distributed Memory Parallel Computers, Siggraph Course Notes 28, 1990. To appear in Computer Graphics and Applications.
Baum, D.R., Winget, J.M. (1990), Real Time Radiosity Through Parallel Processing and Hardware Acceleration, Computer Graphics 24(2): 67-75.
Baum, D.R., Mann, S., Smith, K.P., Winget, J.M. (1991), Making Radiosity Usable: Automatic Preprocessing and Meshing Techniques for the Generation of Accurate Radiosity Solutions, Computer Graphics 25(4): 51-60.
Campbell, A.T., Fussell, D.S. (1990), Adaptive Mesh Generation for Global Diffuse Illumination, Computer Graphics 24(4): 155-164, Siggraph'90.
Chalmers, A.G. (1991), A Minimum Path System for Parallel Processing, PhD thesis, University of Bristol.
Chalmers, A.G., Paddon, D.J. (1989), Communication Efficient MIMD Configurations, Proc. of the 4th SIAM Conference on Parallel Processing for Scientific Computing, Chicago.
Chalmers, A.G., Paddon, D.J. (1991), Parallel Processing in the Progressive Refinement Radiosity Method, Proceedings of the 2nd Eurographics Workshop on Rendering, Barcelona.
Chalmers, A.G., Stuttard, D., Paddon, D.J. (1993), Data Management for Parallel Ray Tracing of Complex Images, International Conference on Computer Graphics, Bombay.
Chen, S.E., Rushmeier, H., Miller, G., Turner, D. (1991), A Progressive Multi-Pass Method for Global Illumination, Computer Graphics 25(4): 165-174, Siggraph'91.
Clark, J. (1982), The Geometric Engine: a VLSI Geometry System for Graphics, Computer Graphics 16(3): 127-133, Siggraph'82.
Cleary, J.G., Wyvill, B., Birtwistle, G., Vatti, R. (1983), Multiprocessor Ray Tracing, Tech. Report 83/128/17, Dept. of Computer Science, Univ. of Calgary; also in Computer Graphics Forum 5(1): 3-12.
Cohen, M.F., Greenberg, D.P. (1985), The Hemi-cube: A Radiosity Solution for Complex Environments, Computer Graphics 19(3): 31-40, Siggraph'85.
Cohen, M.F., Greenberg, D.P., Immel, D.S., Brock, P.J. (1986), An Efficient Radiosity Approach for Realistic Image Synthesis, IEEE Computer Graphics and Applications 6(3): 26-35.
Cohen, M.F., Chen, S.E., Wallace, J.R., Greenberg, D.P. (1988), A Progressive Refinement Approach to Fast Radiosity Image Generation, Computer Graphics 22(4): 75-84, Siggraph'88.
Cook, R.L. (1986), Stochastic Sampling in Computer Graphics, ACM Transactions on Graphics 5(1): 51-72.
Cook, R.L., Torrance, K.E. (1982), A Reflectance Model for Computer Graphics, ACM Transactions on Graphics 1(1): 7-24.
Dippé, M., Swensen, J. (1984), An Adaptive Subdivision Algorithm and Parallel Architecture for Realistic Image Synthesis, ACM Computer Graphics 18(3): 149-158.
Dippé, M., Wold, E.H. (1985), Antialiasing through Stochastic Sampling, Computer Graphics 19(3): 69-78, Siggraph'85.
Gaudet, S., Hobson, R., Chilka, P., Calvert, T. (1988), Multiprocessor Experiments for High-Speed Ray Tracing, ACM Transactions on Graphics 7(3): 151-179.
Glassner, A.S. (1989), Introduction to Ray Tracing, Academic Press.
Green, S., Paddon, D. (1989), Exploiting Coherence for Multiprocessor Ray Tracing, Computer Graphics and Applications 4(10): 15-22.
Green, S. (1991), Parallel Processing for Computer Graphics, Research Monographs in Parallel and Distributed Computing, Pitman Publishing, London.
Hanrahan, P. (1986), Using Caching and Breadth First Search to Speed Up Ray Tracing, Proc. Graphics Interface'86, Canadian Information Processing Society, Toronto, pp. 56-61.
Haeberli, P., Akeley, K. (1990), The Accumulation Buffer: Hardware Support for High-Quality Rendering, Computer Graphics 24(4): 309-318, Siggraph'90.
Heckbert, P.S. (1986), Survey of Texture Mapping, Computer Graphics and Applications 6(11): 56-67.
Heckbert, P.S. (1990), Adaptive Radiosity Textures for Bidirectional Ray Tracing, Computer Graphics 24(4): 145-154, Siggraph'90.
Heckbert, P.S. (1992), Discontinuity Meshing for Radiosity, Proceedings of the 3rd Eurographics Workshop on Rendering.
İşler, V., Aykanat, C., Özgüç, B. (1991), Subdivision of 3D Space Based on the Graph Partitioning for Parallel Ray Tracing, Proceedings 2nd EG Workshop on Rendering, Barcelona.
Jansen, F.W., Kok, A.J.F., Verelst, T. (1992), Hardware Challenges for Ray Tracing and Radiosity Algorithms, Proceedings 7th Workshop on Graphics Hardware, Cambridge, England, September 1992, Eurographics Technical Report, pp. 123-134.
Kajiya, J.T. (1986), The Rendering Equation, Computer Graphics 20(4): 143-150, Siggraph'86.
Kedem, G., Ellis, J.L. (1984), The Ray Casting Machine, Proc. IEEE Int. Conf. on Computer Design: VLSI in Computers (ICCD'84), IEEE Computer Society Press, 533-538.
Kedem, G., Ellis, J.L. (1989), The Ray-Casting Machine, in: Dew, P.M., Earnshaw, R.A., Heywood, T.R., Parallel Processing for Computer Vision and Display, Addison-Wesley Publishing Company, pp. 378-401.
Kirk, D., Voorhies, D. (1990), The Rendering Architecture of the DN10000VS, Computer Graphics 24(4): 299-307, Siggraph'90.
Kirk, D., Arvo, J. (1991), Unbiased Sampling Techniques for Image Synthesis, Computer Graphics 25(4): 153-156, Siggraph'91.
Kobayashi, H., Nishimura, S., Kubota, H., Nakamura, T., Shigei, Y. (1988), Load Balancing Strategies for a Parallel Ray-Tracing System Based on Constant Subdivision, The Visual Computer 4: 197-209.
Kok, A.J.F., Jansen, F.W., Woodward, C. (1991), Efficient Complete Radiosity Ray Tracing Using a Shadow Coherence Method, Report of the Faculty of Technical Mathematics and Informatics, nr. 91-63, 1991. To appear in The Visual Computer.
Kok, A.J.F., Jansen, F.W. (1991), Source Selection for the Direct Lighting Computation in Global Illumination, Proceedings of the 2nd Eurographics Workshop on Rendering. To be published by Springer Verlag.
Kok, A.J.F., Jansen, F.W. (1992), Adaptive Sampling of Area Light Sources in Ray Tracing Including Diffuse Interreflection, Computer Graphics Forum 11(3): C289-C298, Eurographics'92.
Kok, A.J.F. (1993), Grouping of Patches in Progressive Radiosity, Proceedings 4th EG Workshop on Rendering (this volume).
Languénou, E., Bouatouch, K., Tellier, P. (1992), An Adaptive Discretization Method for Radiosity, Computer Graphics Forum 11(3): C205-C216, Eurographics'92.
Lee, M.E., Redner, A., Uselton, S.P. (1985), Statistically Optimized Sampling for Distributed Ray Tracing, Computer Graphics 19(3): 61-67, Siggraph'85.
Lischinski, D., Tampieri, F., Greenberg, D.P. (1992), A Discontinuity Meshing Algorithm for Accurate Radiosity, Computer Graphics and Applications 12(6): 25-39.
Mitchell, D.P. (1987), Generating Antialiased Images at Low Sampling Densities, Computer Graphics 21(4): 65-72, Siggraph'87.
Nemoto, K., Omachi, T. (1986), An Adaptive Subdivision by Sliding Boundary Surfaces for Fast Ray Tracing, Proc. Graphics Interface'86, Computer Graphics Society, Montreal, pp. 43-48.
Nishita, T., Nakamae, E. (1985), Continuous Tone Representation of Three-Dimensional Objects Taking Account of Shadows and Interreflection, Computer Graphics 19(3): 23-30, Siggraph'85.
Painter, J., Sloan, K. (1989), Antialiased Ray Tracing by Adaptive Progressive Refinement, Computer Graphics 23(3): 281-288, Siggraph'89.
Pattanaik, S.N., Mudur, S.P. (1992), Computation of Global Illumination by Monte Carlo Simulation of the Particle Model of Light, Proceedings 3rd Eurographics Workshop on Rendering, Bristol.
Pattanaik, S.N., Mudur, S.P. (1993), The Potential Equation and Importance in Illumination Computations, to appear in Computer Graphics Forum.
Priol, T., Bouatouch, K. (1989), Static Load Balancing for a Parallel Ray Tracing on a MIMD Hypercube, The Visual Computer 5(12): 109-119.
Pulleyblank, R.W., Kapenga, J. (1986), VLSI Chip for Ray Tracing Bicubic Patches, in: Advances in Computer Graphics Hardware I, Springer Verlag, Proceedings First Eurographics Workshop on Hardware, 125-140.
Pulleyblank, R.W., Kapenga, J. (1987), The Feasibility of a VLSI Chip for Ray Tracing Bicubic Patches, Computer Graphics and Applications 7(3): 33-44.
Rushmeier, H. (1988), Realistic Image Synthesis for Scenes with Radiatively Participating Media, PhD thesis, Cornell University.
Salmon, J., Goldsmith (1988), A Hypercube Ray-tracer, Proc. of the 3rd Conference on Hypercube Concurrent Computers and Applications, Vol. II, ACM Press, pp. 1194-1206.
Scherson, I.D., Caspary, E. (1988), Multiprocessing for Ray Tracing: A Hierarchical Self-balancing Approach, The Visual Computer 4(4): 188-196.
Shen, L.-S., Deprettere, E., Dewilde, P. (1990), A New Space Partition Technique to Support a Highly Pipelined Parallel Architecture for the Radiosity Method, in: Advances in Graphics Hardware V, Springer Verlag, Eurographics Hardware Workshop 1990.
Shen, L.-S., Laarakker, F.A.J., Deprettere, E. (1991), A New Space Partition Technique to Support a Highly Pipelined Parallel Architecture for the Radiosity Method II, in: Advances in Graphics Hardware VI, Springer Verlag, Eurographics Hardware Workshop 1991.
Shen, L.-S., Deprettere, E. (1992), A Parallel-Pipelined Multiprocessor System for the Radiosity Method, Proceedings of the Eurographics Hardware Workshop 1992, Eurographics Technical Report Series, pp. 106-122.
Shirley, P. (1990), A Ray Tracing Method for Illumination Calculation in Diffuse Specular Scenes, Proceedings Computer Graphics Interface '90, pp. 205-212.
Shirley, P., Wang, C. (1991), Direct Lighting Calculation by Monte Carlo Integration, Proceedings of the 2nd Eurographics Workshop on Rendering, Barcelona.
Shirley, P., Wang, C. (1992), Distributed Ray Tracing: Theory and Practice, Proceedings of the 3rd Eurographics Workshop on Rendering, Bristol.
Sillion, F., Puech, C. (1989), A General Two Pass Method Integrating Specular and Diffuse Reflection, Computer Graphics 23(3): 335-344, Siggraph'89.
Smits, B.E., Arvo, J.R., Salesin, D.H. (1992), An Importance-Driven Radiosity Algorithm, Computer Graphics 26(2): 273-282, Siggraph'92.
Speer, L.R., DeRose, T.D., Barsky, B.A. (1985), A Theoretical and Empirical Analysis of Coherent Ray Tracing, Proceedings Graphics Interface'85, Springer Verlag, pp. 11-25.
Torborg, J.G. (1987), A Parallel Processor Architecture for Graphics Arithmetic Operations, Computer Graphics 21(4): 197-204, Siggraph'87.
Wallace, J.R., Cohen, M.F., Greenberg, D.P. (1987), A Two-Pass Solution to the Rendering Equation: A Synthesis of Ray Tracing and Radiosity Methods, Computer Graphics 21(4): 311-320, Siggraph'87.
Wallace, J.R., Elmquist, K.A., Haines, E.A. (1989), A Ray Tracing Algorithm for Progressive Radiosity, Computer Graphics 23(3): 315-324, Siggraph'89.
Ward, G.J., Rubinstein, F.M., Clear, R.D. (1988), A Ray Tracing Solution for Diffuse Interreflection, Computer Graphics 22(4): 85-92, Siggraph'88.
Ward, G.J. (1991), Adaptive Shadow Testing for Ray Tracing, Proceedings of the 2nd Eurographics Workshop on Rendering, Barcelona.
Whitted, T. (1980), An Improved Illumination Model for Shaded Display, Communications of the ACM 23(6): 343-349.
Xu, H., Peng, Q.-S., Liang, Y.-D. (1989), Accelerated Radiosity Method for Complex Environments, Proceedings Eurographics, Elsevier Science Publishers, pp. 51-59.
Yilmaz, A.C., Hagestein, C., Deprettere, E., Dewilde, P. (1989), A Hardware Solution to the Generalized Two-Pass Approach for Rendering of Artificial Scenes, in: Advances in Graphics Hardware IV, Springer Verlag, Proceedings Eurographics Hardware Workshop 1989, 65-79.


Rendering Large Scenes Using Parallel Ray Tracing

Erik Reinhard and Frederik W. Jansen
Faculty of Technical Mathematics and Informatics, Delft University of Technology, Zuidplantsoen 4, 2628BZ Delft, The Netherlands

Reprinted from Parallel Computing, 23(7), Erik Reinhard and Frederik W. Jansen, 'Rendering Large Scenes Using Parallel Ray Tracing', pp 873-885, Copyright 1997, with permission from Elsevier Science.

Ray tracing is a powerful technique to generate realistic images of 3D scenes. However, rendering complex scenes may easily exceed the processing and memory capabilities of a single workstation. Distributed processing offers a solution if the algorithm can be parallelised in an efficient way. In this paper a hybrid scheduling approach is presented that combines demand driven and data parallel techniques. Which tasks to process demand driven and which data parallel, is decided by the data intensity of the task and the amount of data locality (coherence) that will be present in the task. By combining demand driven and data driven tasks, a better load balance may be achieved, while at the same time the communication is spread evenly across the network. This leads to a scalable and efficient parallel implementation of the ray tracing algorithm with little restriction on the size of the model data base to be rendered.

1 Introduction

From many fields in science and industry, there is an increasing demand for realistic rendering. Architects, for example, need to have a clear idea of how their designs are going to look in reality. In theatres the lighting of the interior is important too, so it should be modelled as accurately as possible. It is evident that making such designs is an iterative and preferably interactive process. Therefore, next to realism, short rendering times are called for in these applications. Another characteristic of such applications is that the models to be rendered are typically very large and may contain a large number of light sources. Computer graphics techniques such as ray tracing and radiosity can provide the realism required, but the processing and memory capabilities needed for rendering (realistically) large scenes often exceed the capacity of single workstations. Using
parallel or distributed systems, either with distributed or shared memory architectures, seems a logical solution to these problems, but unfortunately it is not trivial to match the processing with the distributed data in an optimal way.

The most common way to parallelise ray tracing is the demand driven approach, where coherent subtasks are assigned to processors with a low load [6] [18]. This may be achieved by assigning sub-parts of the image to each processor [13] [3] [12]. This has the advantage of spreading the load evenly over the processors, but it requires either the data to be duplicated with each processor, thereby limiting the model size, or objects to be communicated on request, introducing a significant extra overhead. Caching mechanisms may be implemented to reduce the amount of communication, but their efficiency is highly dependent on the amount of coherence between subsequent data requests. This may be low in ray tracing, in particular for reflection rays. Also, with realistic rendering often too much data is involved (textures, radiosity meshes, etc.) to be communicated frequently. For very large scenes, these disadvantages are aggravated. The same holds for complex lighting models. Demand driven scheduling won't work in these cases.

Alternatively, scheduling could be performed by distributing the data over the processors according to a (regular) spatial subdivision. Rays are then traced through a voxel (cell) of the spatial subdivision and when a ray enters the next voxel, it is transferred as a task to the processor holding that voxel's objects [4] [2] [9]. This is called the data parallel or data driven approach and it is the basis of our parallel implementation. The advantage of such an approach is that there is virtually no restriction on the size of the model to be rendered. However, there are also some rather severe disadvantages, which include a load balancing problem and an increasing communication overhead with larger numbers of processors.

The problems with either pure demand driven or data driven implementations may be overcome by combining the two, yielding a hybrid algorithm [17] [8]. Part of the computation is done data parallel, which provides each processor with a basic, though possibly unequal, load. The other, demand driven, part is overlaid on these processors in such a way that the load unbalance is compensated for. Usually, computationally intense tasks that require only little data are good candidates for demand driven scheduling, while data intense tasks are preferably handled in a data parallel manner. Scherson and Caspary [17] propose to take the ray traversal task, i.e. the intersection of rays with the spatial subdivision structure, such as an octree, as the demand driven component. As an octree does not occupy much space, it may be replicated with each processor. A problem with the hybrid algorithm of Scherson and Caspary is that the demand driven component and the data parallel component are not well matched, i.e. the ray traversal tasks are computationally not expensive enough to fully compensate for the load balancing problem of the data parallel part. Therefore, our data parallel algorithm (presented in section 2) will be modified to
incorporate demand driven components that are computationally more intense (section 3).

A key notion for our implementation is coherence, which means that rays that have the same origin and almost the same direction are likely to hit the same objects. Both primary rays and shadow rays directed to area light sources exhibit this coherence. To benefit from coherence, primary rays can be traced in bundles, which are called pyramids in this paper. A pyramid of primary rays has the viewpoint as the top of the pyramid and its cross section will be square. First the pyramids are intersected with the spatial subdivision structure, which yields a list of cells containing objects that possibly intersect with the pyramids. Then the individual rays making up the pyramid are intersected with the objects in the cells. Because the necessary data is localised, these tasks may very well be executed in demand driven mode. Adding demand driven tasks (processing of primary rays) to the basic data driven algorithm will improve the load balance and increase the scalability of the architecture.

In this paper, first a description of the data parallel component of our parallel implementation is given in section 2. Next, the demand driven component and the handling of light rays are introduced in section 3. The complete algorithm is described in section 4. Experiments with two reasonably complex models are described in section 5. The data parallel and hybrid methods are compared in section 6. Finally, conclusions are drawn in section 7.

2 Data parallel ray tracing

To derive a data parallel ray tracer, the algorithm itself must be split up into a number of processes and the object data must be distributed over these processes. Both issues are addressed in this section, beginning with the data distribution. A suitable data distribution is to split the object space into (equal sized) voxels, exploiting local object coherence. Each processor is assigned a voxel and its objects. Ray tracing is now performed in a master-slave setup. After initialisation, for each primary ray the host determines which voxel it originates in (or enters) and sends this ray as a task to the associated trace process. This trace process reads the ray task from its buffer and traces the ray. Two situations may occur. First, the ray may leave the voxel without intersecting any object; in this case the ray is transferred to a neighbouring voxel. Otherwise, an intersection with some locally stored object is found and secondary rays are spawned. These rays are traced and, if necessary, communicated to neighbouring voxels. The results of these rays are returned to the process that spawned them. When all results for an intersection are complete, shading is performed and a colour value is returned. The host collects all
the shading information for each pixel and writes the result to an image file (see figure 1).


Figure 1. Host and slave processes.

The advantages of this way of parallel ray tracing are that very large models may be rendered as the object database does not need to be duplicated with each process, but can be divided over the distributed memories instead. Ray tasks will have to be transferred to other processes, but as each voxel borders on at most a few other voxels, communication for ray tasks is only local. This means that data parallel algorithms are in principle scalable, although the communication overhead grows with the number of processors. Disadvantages are that load imbalances may occur, due to ’hot spots’ in the scene (near the viewpoint and near light sources). These may be solved by either static load balancing [14], which most probably yields a suboptimal load balance, or by dynamic load balancing [4], which may induce extra overhead in the form of data communication.
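The data parallel trace process described above can be sketched as a simple message-driven loop. The names and the stubbed primitives below are ours, not the Rayshade-based implementation; message passing and the actual intersection code are abstracted away.

```cpp
#include <optional>

struct RayTask { /* ray, pixel id, recursion depth, ... */ };
struct HitInfo { int neighbourVoxel = -1; bool hit = false; };

// Stubbed primitives: receiving a task, intersecting against the local voxel's
// objects, and sending messages to other processes or the host.
std::optional<RayTask> receiveTask() { return std::nullopt; }   // stub: no more work
HitInfo intersectLocalObjects(const RayTask&) { return {}; }
void forwardToNeighbour(int, const RayTask&) {}
void spawnSecondaryRays(const RayTask&) {}
void returnResultToHost(const RayTask&) {}

void traceProcessLoop() {
    while (auto task = receiveTask()) {          // blocks until a ray task arrives
        HitInfo h = intersectLocalObjects(*task);
        if (h.hit) {
            spawnSecondaryRays(*task);           // traced locally or sent to neighbouring voxels
            returnResultToHost(*task);           // shading once all results are gathered
        } else if (h.neighbourVoxel >= 0) {
            forwardToNeighbour(h.neighbourVoxel, *task);  // ray leaves this voxel
        } else {
            returnResultToHost(*task);           // ray left the scene: background colour
        }
    }
}
```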

3 Demand driven pyramid tracing

In order to have more control over the average load of each process, and thereby overcome most of the problems associated with data parallel computing, some demand driven components may be added to this algorithm. In our approach, the demand driven tasks are selected on the basis of the amount of coherence between rays. For example, primary rays all originate from the eye point and travel in similar directions. These rays exhibit much coherence, as they are likely to intersect the same objects. This is also true for shadow rays that are sent towards area light sources. Reflection and transmission rays, on the other hand, do not usually form coherent sets. To exploit coherence, a bundle of rays can be enclosed in a pyramid or cone. This pyramid is then intersected with the spatial subdivision to retrieve all objects within the pyramid, prior to tracing the rays [18] [7]. A similar algorithm, called Pyra-Clip [19], was implemented by us to retrieve a list of cells in depth order from a bintree structure. Figure 2 shows the retrieved cells (1-6) for this example.


Figure 2. Pyramid traversal generates clip list (cells 1-6).

After the Pyra-Clip preprocessing is completed, the tracing of the individual rays within the pyramid only requires a relatively small number of objects, namely the objects that lie within the cells traversed. For this reason, pyramid tracing may be executed in a demand driven manner, as the data communication involved is limited.
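The cell selection behind such a clip list can be sketched as a standard, conservative frustum/box overlap test: the pyramid is represented by its bounding planes, and a cell is kept unless its axis-aligned box lies completely outside one of them. This is only an illustration under our own assumptions; the actual Pyra-Clip traversal of the bintree (and its depth ordering) is more involved.

```cpp
#include <cstddef>
#include <vector>

struct Vec3  { double x, y, z; };
struct Plane { Vec3 n; double d; };          // points p with n.p + d >= 0 are inside
struct Box   { Vec3 lo, hi; };               // axis-aligned cell of the subdivision

// A box is outside a plane if even its most "inside" corner (the p-vertex)
// lies on the negative side of the plane.
bool outside(const Box& b, const Plane& pl) {
    Vec3 p{ pl.n.x >= 0 ? b.hi.x : b.lo.x,
            pl.n.y >= 0 ? b.hi.y : b.lo.y,
            pl.n.z >= 0 ? b.hi.z : b.lo.z };
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d < 0.0;
}

// Conservative pyramid/cell overlap: reject the cell only if it is completely
// outside one of the pyramid's bounding planes.
bool pyramidOverlapsCell(const std::vector<Plane>& pyramid, const Box& cell) {
    for (const Plane& pl : pyramid)
        if (outside(cell, pl)) return false;
    return true;
}

// Build a clip list over a flat list of cells (depth sorting omitted here).
std::vector<int> clipList(const std::vector<Plane>& pyramid, const std::vector<Box>& cells) {
    std::vector<int> out;
    for (std::size_t i = 0; i < cells.size(); ++i)
        if (pyramidOverlapsCell(pyramid, cells[i])) out.push_back(static_cast<int>(i));
    return out;
}
```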

4 Hybrid system

Adding the demand driven components to the data parallel system can be done by giving the host the task of pre-processing the Pyra-Clip tasks and scheduling these to other processors. However, this would require a complete copy of the data base to be stored at the host, which puts a significant constraint on the size of the data base. Also, the processing involved may lead to a computational bottleneck at the host. We therefore choose to distribute the Pyra-Clip preprocessing as well. In the new set-up there is a host process that issues bundles of rays on request to the (distributed) trace processes. The trace processes are, next to data parallel ray tracing, capable of performing the demand driven task of preprocessing the Pyra-Clip tasks, and they can also execute these Pyra-Clip tasks (tracing the rays). To avoid the need for all trace processes to have access to the whole data base for the Pyra-Clip preprocessing, only the spatial subdivision structure is duplicated with each process. In the Pyra-Clip preprocessing, first an ordered list of intersected cells (the 'clip list') is created. Second, the actual contents of each cell are requested. A cache can be maintained for the objects most often called for. In fact, in our current implementation, the cache is pre-filled with a resident set consisting of the objects closest to the viewpoint (see figure 3). Rays that survive these objects are further processed in a data parallel manner. When during demand-driven tracing of primary rays an intersection is found, further processing is transferred to the process that keeps the intersected object (in the data parallel scheme). This process then spawns the secondary rays, which are traced data parallel. When the results of the secondary rays are gathered, the shading is performed and the result is returned to the host.


Figure 3. Data distribution for demand driven tasks. Each process holds the objects closest to the eye point and pre-selects from these objects to form demand driven tasks. An example of local and global light caches is given for one processor as well.

When no intersection is found during demand-driven tracing of primary rays, the result is returned directly to the host process.

In order to improve control over the workload and reduce the impact of bottlenecks that may exist near light sources, extra work may be shifted from data parallel to demand driven execution. This may be achieved by storing objects close to the light sources in a global light cache, which is duplicated with each processor. If a processor finds an intersection for a primary ray (a demand driven task), it can trace parts of the secondary shadow rays through the global light cache. If an intersection with an object in the cache is found, that particular shadow ray need not be traced through the data parallel voxels anymore. Second, each data parallel process stores objects between its voxel and the light sources, up to a specified maximum. Such a local light cache should not overlap with the global light cache. A data parallel light ray (which has not found an intersection with the global light cache) is now first traced through the voxel it originated in and then, if necessary, through the local light cache. If still no intersection is found, the light ray is transferred to a neighbouring processor, which will continue tracing it. The data distribution with local and global caches is depicted in figure 3.
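The shadow-ray handling just described can be sketched as a chain of increasingly expensive tests. All types, names and the stubbed intersection test are ours; the point is only the order in which the caches are consulted before a ray is forwarded.

```cpp
#include <vector>

struct ShadowRay { /* origin, direction to light sample, ... */ };
struct Object    { /* geometry */ };

// Stubbed intersection test against a set of cached objects.
bool hitsAny(const ShadowRay&, const std::vector<Object>&) { return false; }

enum class ShadowResult { Occluded, Unoccluded, ForwardToNeighbour };

// Demand-driven phase: test against the replicated global light cache first.
bool occludedByGlobalCache(const ShadowRay& r, const std::vector<Object>& globalLightCache) {
    return hitsAny(r, globalLightCache);
}

// Data-parallel phase on one processor: local voxel objects, then the local
// light cache (objects between this voxel and the lights), then forward.
ShadowResult traceDataParallel(const ShadowRay& r,
                               const std::vector<Object>& voxelObjects,
                               const std::vector<Object>& localLightCache,
                               bool reachesLightInThisVoxel) {
    if (hitsAny(r, voxelObjects))     return ShadowResult::Occluded;
    if (hitsAny(r, localLightCache))  return ShadowResult::Occluded;
    if (reachesLightInThisVoxel)      return ShadowResult::Unoccluded;
    return ShadowResult::ForwardToNeighbour;   // continue on the neighbouring processor
}
```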

In figure 4, the resulting setup of host, data parallel trace and demand driven Pyra-Clip tasks is depicted. Pyra-Clip tasks involve tracing bundles of primary rays and tracing shadow rays through the global light cache. The trace task includes tracing (single) rays through its voxel and tracing shadow rays through the local light cache.


Figure 4. Control and message flow between processes.

Further, there are a large number of difficult control issues involved, which deal with memory usage (buffer and cache sizes, etc.), communication (buffer overflow, message packing, saturation and deadlock), task scheduling (where to do Pyra-Clip tracing, shadow tracing and shading) and trade-offs between demand-driven and data-parallel processing. See Reinhard [15] for further details.
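One of these control issues, message packing, amounts to flattening many small ray tasks into one contiguous buffer so that they can travel in a single network message. The sketch below is only illustrative (plain byte copies of a trivially copyable struct, suitable for a homogeneous cluster); the actual implementation would use the packing routines of the message-passing library and worry about endianness and alignment.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct RayTask {
    float   origin[3];
    float   direction[3];
    int32_t pixel;
    int32_t depth;
};

// Append one task to a flat byte buffer (one "message" holds many tasks).
void pack(std::vector<std::uint8_t>& buf, const RayTask& t) {
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&t);
    buf.insert(buf.end(), p, p + sizeof(RayTask));
}

// Read task i back out of the buffer on the receiving side.
RayTask unpack(const std::vector<std::uint8_t>& buf, std::size_t i) {
    RayTask t;
    std::memcpy(&t, buf.data() + i * sizeof(RayTask), sizeof(RayTask));
    return t;
}

int main() {
    std::vector<std::uint8_t> message;
    pack(message, RayTask{{0, 0, 0}, {0, 0, 1}, 42, 0});
    RayTask t = unpack(message, 0);
    (void)t;
}
```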

5 Implementation and Experiments

The scheduling and data management techniques described were implemented with the (sequential) public domain ray tracer Rayshade [11] as a basis. For communication support we chose the PVM library [5]. Although standard libraries may be less efficient than hard coding for a specific architecture, using a standard library has great advantages in terms of portability. Our current system runs on a cluster of workstations as well as on dedicated multi-processor systems. The former is used for implementation work, while the latter is used to obtain the test results presented below.

Three sufficiently realistic models were used in our evaluation. These are a model of a computer room (10,929 polygons, 16 light sources), the conference room model (23,177 polygons), which we tested in two variants with 8 and 30 light sources respectively, and a model of a crate (12,668 triangles and 3 light sources). Images of these models are given in figure 5.

The experiments were carried out on a Parsytec PowerXplorer with 32 Motorola MPC 601 processors running asynchronously at 80 MHz. Each processor is connected to a T800 transputer, which is used for communication purposes. Each processor has 32 MB of local memory, adding up to a total of 1 Gbyte of RAM. Peak performance is rated at 2 Gflop/s. In all tests performed, the image size is 256 x 256.

Figure 5. Computer room, crate and conference room models.

6 Comparison between scheduling techniques

Tests were performed with different cache sizes for the global and local caches (hybrid and data parallel scheduling) and different cache sizes for primary rays (hybrid scheduling). By loading the caches with the entire model, a naive demand driven algorithm (without cache overhead) may be emulated. In this case, primary rays are
scheduled on demand and all secondary rays derived from these primary rays are traced by the same processor without task or data communication. The figures obtained with extremely large caches are in a sense optimal. They are presented here (figure 6) for reference only. Running data parallel with the same caching strategy performs rather poorly, because there is no control over the distribution of ray tasks.

[Plot: rendering time in seconds versus number of processors (up to 25) for the conference8, conference30 and computer room models, each with hybrid and with data parallel scheduling.]

Figure 6. Pseudo demand driven scheduling.

If the cache sizes are drastically reduced, the amount of task communication is expected to increase, and therefore the rendering times should increase. If the global light cache is turned off, while keeping the local light caches and the caches for primary rays (hybrid scheduling) large, the overall memory consumption is reduced by a few thousand objects, depending on the model size and the number of light sources. This allows larger models to be rendered, although with reduced performance, see figure 7. With reduced cache sizes, more rays would be traced data parallel, because a smaller resident set is likely to have a lower hit ratio. By reducing the primary cache size, the performance of hybrid and data parallel scheduling will eventually converge. (In a future implementation we will consider dynamic caching strategies to improve performance.)

Another problem that arises is that hotspots in the viewing pyramid and in the neighbourhood of light sources become more pronounced. This means that without proper memory management, message buffers may overflow. For this reason, the models with a larger number of light sources (the conference room with 30 lights and the computer room with 16 lights) cannot be rendered yet. This problem is currently under investigation.

The figures show that both hybrid and data parallel rendering are not very scalable and perform rather poorly, although the hybrid scheduling algorithm performs (slightly) better.

Figure 7. No global light caches: rendering time (in s) against the number of processors for the conference8, conference30, computer room and crate models, with hybrid and data parallel scheduling.

Figure 8. Demand driven rays complement the workload of the data parallel cluster (conference8 model; 16 processors). The left bars in the left figure represent the data parallel rays per processor. The right bars in the left figure represent the fraction of data parallel rays that was transferred to another processor. The right figure depicts the number of demand driven rays per processor.

In this figure the processors that have only a few data parallel tasks receive a large number of demand driven rays, and vice versa, yielding a better load distribution than in data parallel rendering. The communication that occurs during the computation does not appear to form a bottleneck, as figure 9 demonstrates. Communication is well spread over time, and there are few differences in communication behaviour between processors.

Figure 9. Number of messages sent as function of time (conference8 model, 16 processors). Each bar represents a 0.1 s time interval and the height of each bar indicates the total number of messages sent by all processors in that interval.

Finally, the overhead induced by our parallelisation scheme is rather low. For the conference room with 8 light sources, rendering with 16 processors gives an overhead penalty of around 5%. About 1% of this is accounted for by internal bookkeeping and 4% is due to everything related to sending and receiving messages (i.e. input and output buffer management and (un)packing messages). Moreover, the differences in overhead between processors are relatively small; bookkeeping shows a standard deviation of 1.3% and a maximum of 4.6%, and buffering gives a standard deviation of 1% and a maximum of 6.4%. If there is no excessive (communication) overhead, the question arises why hybrid scheduling does not yield the expected speed-ups for a large number of processors. This question is answered by figure 10, which gives the efficiency over time for the conference room model with 16 processors. The first part of the computation is very efficient, after which there is a sudden drop in efficiency. At this point the demand driven tasks are all finished and the data parallel network computes the remaining secondary rays; the load balancing capabilities are exhausted by then. Demand driven tasks could be handed out more sparingly, but then the performance in the early stages of the computation would be lower. A more realistic solution would be to execute more work as demand driven tasks, leave less work in the data parallel cluster, and improve the performance of the remaining rays in the data parallel cluster. Shooting shadow rays towards area lights, for example, should be handled as demand driven tasks. The pyraclip algorithm [19] may be used to pre-select the appropriate objects, and light rays and objects may be sent to another processor. The data parallel cluster, on the other hand, may be improved by having the voxels overlap. In such a scheme, a busy processor may reduce its load by sending some tasks to a selected neighbour processor. This way, the load should spread over the processors more evenly.

Figure 10. Efficiency over time (conference8 model; 16 processors). Each bar represents a 0.1 s time interval.

7 Conclusions

The two basic approaches to parallel rendering, demand driven and data parallel, both have advantages and shortcomings. Demand driven approaches suffer from a communication bottleneck due to insufficient cache performance for large models and non-coherent data requests. Data driven scheduling allows the algorithm to be scaled, but almost invariably leads to load imbalances. The experiments also show that the imbalance does not follow the distribution of objects over processors and, even worse, is not constant over time. The imbalance is thus hard to correct with traditional load adapting strategies such as dynamic redistribution of data [4] or static load balancing according to a low resolution rendering pass [1]. The hybrid approach of adding demand driven components to balance the load works fairly well given the abysmal results of data parallel rendering alone. As long as there are demand driven tasks available, each processor can be kept busy, yielding a very high efficiency for that part of the computation. However, halfway through the computation there are no demand driven tasks left, leaving the data parallel cluster with unacceptable load imbalances. This is the major cause of the efficiency loss that the hybrid algorithm exhibits, as there is no significant communication and the overhead due to the parallelisation is low as well. The global and local light caches reduce data communication and improve the control over the load distribution at the cost of extra memory usage. By adjusting the cache sizes, a balance between performance and memory usage may be found. Very large light caches turn the hybrid scheduling algorithm into a naive demand driven algorithm which needs very little communication; the results obtained this way provide a measure to compare against the results obtained with smaller caches. Currently, the caches are prefilled with object data. Dynamic cache management, however, is a must that would allow far more flexible memory usage. This issue is currently under investigation.

Our aim is to render much larger models than the examples given here. With an increasing amount of data, data parallel processing will gain in importance. Texture mapping and radiosity processing were not yet included in our experiments, and these will add to the data parallel component as well. Therefore, data parallel processing should be improved, both by better load balancing and by reducing the latency of the ray flow through the system. One way to improve data parallel processing would be to switch from a regular grid of voxels to a data structure that is better adapted to the model. We will also continue to experiment with hierarchical sampling techniques to reduce the number and length of secondary and shadow rays [16] [10].

Acknowledgement Computer power on the Parsytec PowerXplorer was kindly donated by IC3A. The tremendous modelling effort to create the conference room was undertaken by Greg Ward and Anat Grynberg, and the computer room was created by Arjan Kok. The crate model is courtesy of BPO Delft.

References [1] K. Bouatouch and T. Priol. Parallel space tracing: An experience on an iPSC hypercube. In N. Magnenat-Thalmann and D. Thalmann, editors, New Trends in Computer Graphics (Proceedings of CG International ’88), pages 170–187, New York, 1988. Springer-Verlag. [2] J. G. Cleary, B. M. Wyvill, G. M. Birtwistle, and R. Vatti. Multiprocessor ray tracing. Computer Graphics Forum, pages 3–12, 1986. [3] F. C. Crow, G. Demos, J. Hardy, J. McLaugglin, and K. Sims. 3d image synthesis on the connection machine. In Proceedings Parallel Processing for Computer Vision and Display, Leeds, 1988. [4] M. A. Z. Dipp´e and J. Swensen. An adaptive subdivision algorithm and parallel architecture for realistic image synthesis. ACM Computer Graphics, 18(3):149–158, jul 1984. [5] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM 3 User’s Guide and Reference Manual. Oak Ridge National Laboratory, Oak Ridge, Tennessee, may 1993. Included with the PVM 3 distribution. [6] S. A. Green and D. J. Paddon. Exploiting coherence for multiprocessor ray tracing. IEEE Computer Graphics and Applications, pages 12–27, nov 1989.


[7] N. Greene. Detecting intersection of a rectangular solid and a convex polyhedron. In P. Heckbert, editor, Graphics Gems IV, pages 74–82. Academic Press, Boston, 1994. [8] F. W. Jansen and A. Chalmers. Realism in real time? In 4th EG Workshop on Rendering, pages 1–20, may 1993. [9] H. Kobayashi, S. Nishimura, H. Kubota, T. Nakamura, and Y. Shigei. Load balancing strategies for a parallel ray-tracing system based on constant subdivision. The Visual Computer, 4(4):197–209, 1988. [10] A. J. F. Kok. Ray Tracing and Radiosity Methods for Photorealistic Image Synthesis. PhD thesis, Delft University of Technology, jan 1994. [11] C. E. Kolb. Rayshade User’s Guide and Reference Manual, jan 1992. Included in Rayshade distribution, which is available by ftp from princeton.edu:pub/Graphics/rayshade.4.0. [12] T. T. Y. Lin and M. Slater. Stochastic ray tracing using SIMD processor arrays. The Visual Computer, 7:187–199, 1991. [13] D. J. Plunkett and M. J. Bailey. The vectorization of a ray-tracing algorithm for improved execution speed. IEEE Computer Graphics and Applications, 5(8):52–60, aug 1985. [14] T. Priol and K. Bouatouch. Static load balancing for a parallel ray tracing on a MIMD hypercube. The Visual Computer, 5:109–119, 1989. [15] E. Reinhard. Hybrid scheduling for parallel ray tracing. TWAIO final report, Delft University of Technology, jan 1996. [16] E. Reinhard, L. U. Tijssen, and F. W. Jansen. Environment mapping for efficient sampling of the diffuse interreflection. In G. Sakas, P. Shirley, and S. M¨uller, editors, Photorealistic Rendering Techniques, pages 410–422, Darmstadt, jun 1994. Eurographics, Springer Verlag. proceedings of the 5th Eurographics Workshop on Rendering. [17] I. D. Scherson and C. Caspary. A self-balanced parallel ray-tracing algorithm. In P. M. Dew, R. A. Earnshaw, and T. R. Heywood, editors, Parallel Processing for Computer Vision and Display, volume 4, pages 188–196, Wokingham, 1988. Addison-Wesley Publishing Company. [18] L. S. Shen, E. Deprettere, and P. Dewilde. A new space partition technique to support a highly pipelined parallel architecture for the radiosity method. In Advances in Graphics Hardware V, proceedings Fifth Eurographics Workshop on Hardware. Springer-Verlag, 1990. [19] M. van der Zwaan, E. Reinhard, and F. W. Jansen. Pyramid clipping for efficient ray traversal. In P. Hanrahan and W. Purgathofer, editors, Rendering Techniques ’95, pages 1–10. Trinity College, Dublin, Springer - Vienna, June 1995. proceedings of the 6th Eurographics Workshop on Rendering.


Towards Efficient Parallel Radiosity for DSM-based Parallel Computers Using Virtual Interfaces

Luc Renambot, Bruno Arnaldi and Thierry Priol, IRISA/INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France, WWW: http://www.irisa.fr
Xavier Pueyo, Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Lluís Santaló s/n, E-17071 Girona, Spain

Abstract This paper presents the performance evaluation of a new technique for radiosity computation which aims at exploiting efficiently the different levels of the memory hierarchy of both sequential and parallel computers. Such an ability is essential when dealing with complex environments of several million polygons. The principle of our technique is to split the initial environment into several sub-environments and to compute the radiosity within each sub-environment. Exchange of energy between sub-environments is performed by means of virtual interfaces and visibility masks. The size of the sub-environments can be adapted to fit into a cache or a local memory. We performed several experiments on an SGI Origin 2000 to show the effectiveness of our solution. It improves both the sequential and the parallel execution of a progressive radiosity algorithm. Our technique decreases the execution time on one processor of an SGI Origin 2000 by a factor of more than 5 and leads to a very good efficiency for complex environments (1 million polygons) on a multiprocessor configuration.

CR Categories: D.1.3 [Programming Techniques]: Parallel Programming; I.3.0 [Computer Graphics]: General; I.3.6 [Computer Graphics]: Methodology and Techniques; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism.

Keywords: Parallel computing, radiosity, realistic image synthesis, global illumination, virtual interface, visibility mask.

1 Introduction

Among the different techniques to render high quality images, the radiosity method has become very popular. It computes the most important form of illumination, the indirect ambient illumination provided by inter-reflections of light between diffuse objects. Like ray-tracing, radiosity algorithms require large computing resources, both in terms of computing power and memory storage. Since the introduction of the radiosity method, a lot of research has dealt

with the design of new techniques to reduce the computations. However, despite several improvements, modern workstations cannot provide the required computing resources when dealing with large complex environments (on the order of several million polygons). With the decline of vector supercomputers, the use of parallel computers seems to be the only way left to achieve the required level of performance for radiosity computations. After ten years of constant evolution, parallel computers have reached the maturity required for their acceptance by the computer graphics industry. Two architectural trends stand out nowadays: clusters of workstations (IBM SP/2) and Distributed Shared Memory (DSM) parallel computers (SGI Origin). Both are distributed memory parallel computers (i.e. each node has its own local memory), but the two kinds of architecture have their own programming models. While clusters of workstations have to be programmed using a message passing paradigm, DSM parallel computers provide shared variables as a communication paradigm. Even so, programming these two types of parallel computers requires roughly the same methodology when designing an efficient parallel algorithm. It is mainly a two-dimensional optimization process: the first dimension is the identification of parallelism in the original algorithm in order to split it into several independent tasks; the second is data distribution in order to map data onto local memories. Nevertheless, a high degree of parallelism in a given algorithm does not always ensure good performance. Exploitation of parallelism may require a large amount of communication between processors; in that case, the parallel algorithm is said to have poor data locality. This applies to radiosity. The computation of inter-reflections between objects involves several tasks that can be run in parallel, and for progressive radiosity several levels of parallelism have been identified [3]. However, this parallelism is obtained at the expense of a loss of data locality, because each processor may have to access any part of the environment. When dealing with large environments (typically several million polygons), parallel radiosity algorithms offer poor performance. Lack of data locality also affects the performance of sequential radiosity algorithms, due to the multi-level memory hierarchy of modern workstations. Efficient exploitation of the memory hierarchy is crucial to obtain high performance. However, most studies [17] focus on the last levels of the memory hierarchy, main memory and disks, for the management of virtual memory. Our research addresses the problem of efficiently using the different levels of a memory hierarchy on both uniprocessor and multiprocessor computers. Our approach consists in designing a new radiosity algorithm with good data locality. Locality is exploited to speed up the computation on a single processor as well as on parallel computers. This technique has already been successfully applied to message-passing parallel computers [1]. In this paper, we show the impact of our technique on DSM parallel computers, in particular the SGI Origin 2000 [9]. The paper is organized as follows. Related work is discussed in section 2. Section 3 provides a brief description of the DSM

concept. Section 4 presents an overview of our technique. In section 5, a detailed description of the parallel algorithm is given. In section 6, we describe the experimental platform we used. Section 7 contains the experimental results obtained on an SGI Origin 2000. Finally, section 8 draws some conclusions.

2 Related work

We mainly analyze previous work from two points of view: parallel solutions and locality enhancement. A number of parallel solutions have been proposed using different architectures and different parallelization strategies. Approaches for parallelizing radiosity at several levels are presented in [3, 7]. Within the scope of the work presented here, we may distinguish between two kinds of solution. A first family of algorithms allows all processors to access the entire database. This may be implemented on shared memory systems, or on distributed memory systems where the database is duplicated in each local memory [10, 11], which limits the scene size. The second family consists of techniques based on distributed memory systems where the database is distributed among the different local memories, allowing the algorithm to render bigger scenes [4, 6, 16, 19]. The idea of considering sub-environments (or local environments) has been used from different points of view in order to reduce the computational cost of radiosity algorithms for complex environments. The first contribution in this direction was proposed in [20]. This method divides the environment into local environments and computes form factors in each of them. Afterwards, form factors between patches in neighboring local environments are computed from the previously computed local form factors. Similar approaches were presented in [2, 18], where energy is accumulated at the boundaries between neighboring local environments (called virtual walls) and afterwards transferred. In [8, 13, 14, 15], groups of surfaces are used to simplify energy exchange in complex environments by simplifying the geometry of the environment in a 3-D grouping approach. Clustering is also used in [12], where the environment is mapped onto the walls of the cluster. This map is used only for secondary rays, while primary rays do not use the clusters. The interest of parallelizing the algorithm is pointed out in [12], using the locality resulting from the mapping, although it is not exploited during the rendering step. All these methods introduce approximations in order to exploit locality, but bound their effects so as to keep the required accuracy. In [17], a technique is proposed to exploit locality in order to minimize the transfers between disk and main memory. This system is designed to compute radiosity efficiently in very large environments where the visibility relationships are locally dense but globally sparse. It is based on a hierarchical approach and exploits two basic ideas: partitioning and ordering. Partitioning means identifying subsets of the database, each composed of the clusters which interact with a given (receiver) cluster. Ordering means finding an order which minimizes the difference between two successive subsets. From this analysis, it appears that few prior works have dealt with the use of all the levels of the memory hierarchy of a modern computer. This led us to propose a new technique with better data locality to fully exploit the memory hierarchy.

3 Distributed Shared Memory

Distributed Shared Memory is a way to provide a logical global address space which is distributed among the local memories of a distributed memory parallel computer. Distribution of the global

address space is hidden from the user by either a software or a hardware layer. In both approaches, efficiency relies mainly on caching techniques when accessing data in a remote memory location. Implementing a hardware cache-coherent DSM has been a serious challenge for research and development projects, and two main approaches have been investigated. The first is the COMA (Cache-Only Memory Architecture) model, illustrated by Kendall Square Research's KSR-1 machine. In this approach, the local memory acts as a large cache (attraction memory) that stores the sub-pages containing the data that has been requested. One benefit of such an approach is that data (i.e. sub-pages) migrate to where they are used; there is no fixed physical location for a particular datum, so initial data placement has little impact on performance. Unfortunately, since KSR's commercial failure, no more COMA machines are available. The second approach is the CC-NUMA (Cache-Coherent Non-Uniform Memory Access) model. It has been adopted by several companies such as Convex/HP (Exemplar), Data General and Sequent (SCI based architectures) and more recently by SGI (Origin 2000). In this approach, the shared memory is distributed using information in the physical addresses of memory blocks. A virtual address space can thus be distributed using the virtual-to-physical address translation mechanism of the processor (i.e. a particular page of the virtual address space can be mapped onto a page frame located in the memory of a given processor using the MMU page tables). However, the performance of CC-NUMA machines is very sensitive to the memory placement policy. It is usually the responsibility of the operating system to migrate or replicate pages in order to decrease the number of remote memory accesses. Since migration and replication policies are based on heuristics, they are sometimes not suited to a user's application. In that case, to get efficiency, users have to design their parallel algorithms with an explicit data domain decomposition, similar to what is done on message-passing distributed memory parallel computers. However, since the data remains in the DSM, load balancing problems can be solved easily by migrating data and computations to processors which are not overloaded; such data migration is achieved transparently through the DSM.

4 Virtual Interface

This section summarizes briefly the virtual interface concept. A more detailed description can be found in [1].

4.1 Principles The first idea addresses the energy transfer between local environments. Instead of accumulating the energy from different sources onto a virtual wall before sending it to another environment, the virtual interface technique allows the management of a source separately from other sources in the same local environment. As soon as a source has emitted its energy, it is sent to the next environment. The second principle of the virtual interface describes how source transfer between local environments is taken into account. Sending the geometry and its emissivity to the next environment after a local processing implies the addition of a new structure, called the visibility mask, to the source (figure 1). The visibility mask stores in the source structure all the occlusions encountered during the processing in each local environment. With our virtual interface concept, the energy of each selected patch, called a source, is first distributed in its local environment. Then, its energy is propagated to other local environments. However, to propagate efficiently the energy of a given patch to another local environment, it is necessary to determine the visibility of the patch according to the current

local environment: an object along the given direction may hide the source (figure 2). We introduced the visibility mask, a subsampled hemisphere identical to the one used for form factor computation. To each pixel of the hemisphere corresponds a boolean value in the visibility mask. The visibility mask allows the distribution of energy to local environments on a step by step basis. If the source belongs to the local environment, a visibility mask is created; otherwise the visibility mask already exists and is updated during the processing of the source. Form factors are computed with the patches belonging to the local environment by casting rays from the center of the source through the hemisphere. If a ray hits an object in the local environment, the corresponding value in the visibility mask is set to false; otherwise it is left unchanged. Afterwards, the radiosities of local patches are updated during an iteration of the progressive radiosity algorithm. Finally, the source and its visibility mask are sent to the neighboring local environments to be processed later.

Figure 1: Virtual Interface (former Virtual Wall).

Figure 2: Initialization of the visibility mask for a local source.

4.2 Discussion Using virtual interfaces and visibility masks to solve the radiosity computation may raise questions about their impact on image quality. We think that our method does not degrade the results, since we ensure that all sources are processed correctly, equivalently to the classical algorithm; we discuss this point in section 5. Moreover, the visibility mask represents the hemisphere exactly, at its resolution. Sampling and aliasing problems inherent to hemisphere-based form factor computation methods are not within the scope of this paper. However, our technique raises new problems, such as the placement of the virtual interfaces, which may influence performance: the data locality obtained depends on the size of the working set associated with each sub-environment. Finding an appropriate placement may also limit the overhead of managing the visibility masks (for example, a virtual interface along a wall can stop light propagation).

5 Algorithms

Using the virtual interface and visibility mask concepts described in the previous section, we design both a sequential and a parallel radiosity algorithm which exhibit good data locality.

5.1 Sequential version The main idea of this algorithm is to apply a radiosity computation to each sub-environment generated by the projection of the virtual interfaces onto a given scene. In our case, we implement a classical radiosity method using a progressive approach. The computation of form factors is done through a hemisphere, using a ray-tracing algorithm accelerated by a regular 3D grid. The main difference is that we manage, for each sub-environment, a list of visibility masks which correspond to non-local sources. The selection of the most emissive patch therefore takes this list into account. The patch to be shot at a given iteration may be either a local source or a remote source represented as a visibility mask already processed by other sub-environments. The visibility mask of the current source is filled during the form factor computation. Once the source has shot its energy, the visibility mask is copied to the source lists of the neighboring sub-environments, if required. All sub-environments are processed one after another until a given convergence level is reached. The criterion is to shoot a patch if its energy is above a threshold (a fraction of the energy of the most emissive patch in the scene). The termination test ensures that all visibility masks are processed, to keep the radiosities coherent, so that the visibility mask lists of all environments are empty.
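The sketch below gives one possible realisation of the visibility mask and of its update during the local shooting step described in sections 4.1 and 5.1. The hemisphere resolution, the data layout and the cast_ray(), delta_form_factor() and add_energy() helpers are illustrative assumptions, not the authors' actual data structures.

```c
/*
 * Illustrative source patch with its visibility mask: one boolean per
 * hemisphere pixel, cleared as soon as a ray cast through that pixel
 * hits an object in the current sub-environment.
 */
#include <string.h>

#define HEMI_RES 64            /* assumed hemisphere resolution, not from the paper */

typedef struct {
    int           patch_id;                       /* the shooting patch */
    float         unshot_energy;
    unsigned char visible[HEMI_RES][HEMI_RES];    /* 1 = not yet occluded */
} Source;

/* Hypothetical helpers standing in for the renderer's own routines. */
extern int   cast_ray(const Source *s, int i, int j);    /* hit patch id, or -1 */
extern float delta_form_factor(int i, int j);
extern void  add_energy(int patch_id, float dff, const Source *s);

/* Create the mask when the source belongs to this sub-environment. */
void init_visibility_mask(Source *s)
{
    memset(s->visible, 1, sizeof(s->visible));
}

/*
 * One local shooting step: cast a ray through every hemisphere pixel
 * that is still visible, distribute energy to the local patches that
 * are hit, and record the new occlusions in the mask.  Pixels that are
 * not blocked stay visible for the neighboring sub-environments.
 */
void shoot_source_locally(Source *s)
{
    int i, j, hit;

    for (i = 0; i < HEMI_RES; i++) {
        for (j = 0; j < HEMI_RES; j++) {
            if (!s->visible[i][j])
                continue;                 /* occluded in an earlier sub-environment */
            hit = cast_ray(s, i, j);
            if (hit >= 0) {
                add_energy(hit, delta_form_factor(i, j), s);
                s->visible[i][j] = 0;     /* blocked from here on */
            }
        }
    }
    /* the source and its updated mask are then sent to the neighbors */
}
```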

5.2 Parallel version Our parallelization scheme uses a domain decomposition technique based on the virtual interface to generate independent computations and to distribute data. Next, we have to distribute these sub-environments onto a set of processors. The communication between distinct sub-environments is carried out by sending visibility masks. In a previous study [1], we showed that this can be expressed through message-passing between processors.

Processor management Each processor has to manage a set of sub-environments (one or several) and access visibility masks generated by other processors. Figure 3 shows an example of such a construction, where nine sub-environments are distributed among three processors.

Figure 3: Environment partitioning and mapping.

Memory management There are two ways of managing memory allocation for radiosity computations. The first one is to let the operating system allocate memory in the global shared space using default policies. The other is to organize memory using different regions for each sub-environment, knowing that each region is accessed by only one processor. Furthermore, it can be better to allocate each region in the local memory of the processor that is responsible for it. Once this context is created, different strategies can be applied to map regions onto the set of CPUs. One may use page migration (either by explicit calls or by operating system policies) or process allocation and migration.

Termination detection Termination detection is performed by managing a shared global array holding the state of each process. The state can be either not locally converged (currently processing a local source or a visibility mask), locally converged (the list of visibility masks is empty and there is no patch to shoot) or globally converged. The last state is reached when all sub-environments are locally converged and no visibility mask is being transferred.

Synchronization There is very little synchronization in our algorithm. Each processor can perform its computations asynchronously. However, in a shared memory context, some lock variables are necessary to ensure safe access to shared structures such as the visibility mask queues and the global state variables.

Load-balancing In a coarse grain parallel program, it is critical to pay attention to load balancing in order to achieve good performance. In our context, we have to find a trade-off between the data locality exhibited by each sub-environment and the corresponding workload. In extreme cases, we could have a small environment with a high energy flow and a large environment without any light source. Our first solution for load balancing is a cyclic distribution of all sub-environments. This can be a good heuristic for a regular scene structure and a uniform light distribution, but in most realistic cases it does not work well. A second approach is therefore to use information from a previous execution to generate a new distribution of the sub-environments; the criterion may be the number of rays handled during form factor computations.
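A minimal sketch of the termination test and of the lock-protected global state described above is given below. It is written with POSIX threads purely to keep the example self-contained; the original implementation uses IRIX shared-memory primitives, and the state encoding and helper names are our own.

```c
/*
 * Illustrative shared convergence state.  Each worker owns one entry of
 * the state array; a counter tracks visibility masks that have been
 * copied to a neighbor queue but not yet processed.
 */
#include <pthread.h>

#define MAX_PROCS 32

enum { NOT_CONVERGED = 0, LOCALLY_CONVERGED = 1 };

static int             proc_state[MAX_PROCS];
static int             masks_in_transit;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

void set_local_state(int proc, int state)
{
    pthread_mutex_lock(&state_lock);
    proc_state[proc] = state;
    pthread_mutex_unlock(&state_lock);
}

void mask_sent(void)
{
    pthread_mutex_lock(&state_lock);
    masks_in_transit++;
    pthread_mutex_unlock(&state_lock);
}

void mask_handled(void)
{
    pthread_mutex_lock(&state_lock);
    masks_in_transit--;
    pthread_mutex_unlock(&state_lock);
}

/* Global convergence: every sub-environment is locally converged and
 * no visibility mask is still travelling between queues. */
int globally_converged(int nprocs)
{
    int p, done = 1;

    pthread_mutex_lock(&state_lock);
    for (p = 0; p < nprocs; p++)
        if (proc_state[p] != LOCALLY_CONVERGED)
            done = 0;
    if (masks_in_transit > 0)
        done = 0;
    pthread_mutex_unlock(&state_lock);
    return done;
}
```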

6 Experimental platforms

The implementation of our parallel algorithm was carried out on a Silicon Graphics Origin 2000 [9]. This is a CC-NUMA architecture as introduced in section 3. In this section, we describe briefly this machine and the implementation of our parallel algorithm.

6.1 SGI Origin 2000 Our algorithm was implemented on a 32-processor configuration organized in 16 nodes, each of them having two R10000 CPUs with 4 Mbytes of second level cache (L2) and 256 Mbytes of local memory. The total physical memory available is 4 Gbytes. The interconnection network is a hypercube, as shown in figure 4. An interesting feature of the R10000 is its set of two performance counters. They allow performance analysis of a user's application with very little interference and at full program speed [21]. Each counter can track one event among sixteen available. Some of these events are associated with the memory hierarchy, such as the number of misses for the L1 and L2 caches and for the TLB. A software tool named perfex uses these counters to provide a performance analysis. It computes various information such as the estimated time per event and other statistics (Mflops, instructions per cycle, cache hit rates, ...).

Figure 4: Origin 2000 32-processor system (16 nodes, each with two 195 MHz R10000 CPUs, 4 MB L2 caches and 256 MB of local memory, connected through hubs and 800 MB/s links into a hypercube).

6.2 Implementation issues Two versions of our parallel algorithm were implemented. The first one is the parallelization of our sequential algorithm without data management; data is allocated processor by processor by calling the malloc function. In the second version, we manage data placement. Each processor uses a distinct segment of shared memory for each sub-environment it owns. The working set (geometries, radiosities) corresponding to one region is allocated in such a segment and is thus accessed by only one processor. This can be done with the arena mechanism provided by the Irix operating system. Each arena is linked to a thread (i.e. to a processor), but a thread can manage several arenas. A file holding the placement specification is used when the program is launched by the dplace tool, or loaded during execution for dynamic specifications. A valid placement specification contains descriptions of memory allocation and thread placements. Memory topologies as well as affinities to devices may also be specified. Shared virtual address ranges can be mapped to a specified memory. Page size and page migration thresholds can also be controlled. These features were used in the "optimized" version to obtain a better placement. The visibility mask buffers are allocated in shared memory because they represent communication between sub-environments and are shared among several processors, as shown in figure 5. Lock variables control accesses to these buffers. Buffers and lock variables are allocated on secondary cache line boundaries (128 bytes) using memalign calls and padding to reduce false sharing.

Figure 5: Data distribution and memory accesses.
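The alignment and padding mentioned above can be sketched as follows. The memalign call and the 128-byte secondary cache line size are taken from the text; the PaddedLock structure and its fields are a simplified illustration of our own.

```c
/*
 * Illustrative allocation of lock variables on 128-byte secondary cache
 * line boundaries, padded so that two locks never share a line and
 * therefore cannot cause false sharing between processors.
 */
#include <malloc.h>      /* memalign() on IRIX and other SVR4-style systems */
#include <string.h>

#define L2_LINE 128

typedef struct {
    volatile int lock;                          /* simplified lock word */
    char         pad[L2_LINE - sizeof(int)];    /* fill the rest of the cache line */
} PaddedLock;

/* One aligned, padded lock per visibility mask buffer. */
PaddedLock *alloc_lock_array(int n)
{
    PaddedLock *locks = (PaddedLock *)memalign(L2_LINE, n * sizeof(PaddedLock));
    if (locks != NULL)
        memset(locks, 0, n * sizeof(PaddedLock));
    return locks;
}
```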

7 Experiments

The main idea that we want to show is that radiosity computation on realistic scenes is memory bound, contrary to the studies on small scenes discussed in section 2. In this section, we conduct experiments showing that our method makes better use of the memory hierarchy in both sequential and parallel execution.

7.1 Experimental protocol We conduct several experiments using two large databases on the Origin 2000 configuration described in section 6.1. The first database is made of around 400,000 polygons and the second has more than 1,000,000 polygons, before the meshing process. For each database, several sub-environment decompositions were generated and studied with the sequential and parallel versions of our algorithm to show the influence of the scene structure, the decomposition into sub-environments and the number of processors on performance.

Figure 6: Sequential results, plotted against the number of sub-environments. (a) Execution times (compute and memory), Parking scene; (b) Execution times, Csb scene; (c) Memory hierarchy times (L1, L2, TLB), Parking scene; (d) Memory hierarchy times, Csb scene; (e) Cache miss rates (L1, L2), Parking scene; (f) Cache miss rates, Csb scene; (g) Instructions per cycle and Mflops, Parking scene; (h) Instructions per cycle and Mflops, Csb scene.

7.1.1 Metrics To analyze our method, we use a set of metrics that describe the behavior of each level of the memory hierarchy and the processor speed. The use of the hardware performance counters available on the Mips R10000 processor greatly simplifies this task.

L1 cache misses: The first level data cache is on chip and has a latency of 2 to 3 cycles. It is a 32 Kbyte two-way set associative cache with 32 bytes per line.

L2 cache misses: The second level cache is external and holds 4 Mbytes of data and instructions. It is relatively large compared to current workstations. The miss latency is about 10 cycles, compared with 61 cycles for the local memory and around 160 cycles for a non-local memory.

TLB misses: The Translation Look-aside Buffer is a cache holding the virtual-to-physical address translations of recently accessed pages. A miss in the TLB is a costly process, with a latency of around 70 cycles, and can lead to L2 misses. A program with a large working set and no data locality can generate a high TLB miss rate and cause a lot of operating system activity.

Number of instructions per cycle: Modern superscalar out-of-order execution processors like the Mips R10000 can overlap data requests, which are non-blocking during a data miss. The R10000 can process up to four instructions per cycle; typical values are between 0.5 and 2 because of limited instruction level parallelism and data access latency.

Memory overhead: With the previous metrics and the number of load and store operations, one can compute the overall time spent in the memory hierarchy.

MFLOPS: This shows the achieved performance of an application. The R10000 is rated at 390 Mflops (million floating point operations per second) peak.

Speed-up: The speed-up is defined as S = T1 / Tn, with T1 the best sequential time and Tn the time obtained in parallel using n processors.

7.1.2 Scenes We chose two scenes. The first, named Csb, represents the Soda Hall building. Its five floors are made of many furnished rooms, resulting in a scene of over 400,000 polygons; it is an occluded scene. The second scene, named Parking, represents an underground car park with accurate car models. The scene is over 1,000,000 polygons; it is a regular and open scene. Table 1 lists some characteristics of these databases.

Scene     Polygons x 1000   Patches x 1000   Sources   Shots   Memory in Mb
Csb            430.1            1202.3          342       960       495
Parking       1080.2            1288.2          140      1804       525

Table 1: Scenes

We use a straightforward decomposition algorithm which places virtual interfaces evenly along each axis, producing cubic sub-environments. In section 7.2, we studied several decompositions for each scene: from 1 to 100 sub-environments for the Parking scene (from 1 x 1 x 1 to 10 x 10 x 1 in an x-y-z 2D decomposition), and from 1 to 125 sub-environments for the Csb scene (from 1 x 1 x 1 to 5 x 5 x 5 in an x-y-z 3D decomposition). In section 7.3, we experimented with two decompositions for each scene, giving one or three regions per processor using a cyclic distribution (4 x 8 x 1 and 12 x 8 x 1 for the Parking scene, and 4 x 8 x 1 and 6 x 4 x 4 for the Csb scene).

overhead (data misses and load/store operations). Higher decompositions do not give important gains. For example, a gain factor of 4:2 can be achieved for the Csb scene with 100 sub-environments, and 5:5 for the Parking scene. The main gain is given by a reduction of memory overhead. This is illustrated in figures 6(c) and 6(d) showing misses in the memory hierarchy (data caches, TLB). We can see a dramatic reduction of secondary data cache access time up to a factor of 30 for the Parking scene and a factor of 11 for the Csb scene. With the reduction of the working set, we enhance data locality and make a better use of the L2 cache. One can notice that, even if TLB time decreases for a small number of sub-environments, it increases with higher decomposition showing that we introduce some overhead due to sub-environments management. The outcome of better data caches management is a clear improvement of miss rates as shown in figures 6(e) and 6(f). Data locality reduces memory latency and allows the processor to issue more instructions per cycle, which is a great challenge on a superscalar processor. This is clearly stated in figures 6(g) and 6(h) where the processor can only process 0:4 instruction per cycle for the Parking scene with one sub-environment and achieve 1:2 instructions per cycle in a 100 sub-environments decomposition. The overall performance goes from 10 Mflops to 28 Mflops. We obtain similar results for the Csb scene.

7.3 Parallel Results To compare our method to existing ones, we implement a classical parallelization of the radiosity algorithm where a set of processors shoot different patches in parallel [3]. The address space is unique and all the processors may access the entire database. Each processor selects a patch and shoots it onto the scene (i.e. computes form factors between the current patch and the entire database and then updates the radiosities which are stored into shared variables). Critical sections were used to ensure the coherence of radiosities values between selection and update phases. The experiments are named following these rules in figure 7 : p-shot for classical parallelization of radiosity, load or unload for our parallel versions with or without static load balancing, 32 or 96 for the number of sub-environments and finally mem or nomem according to data management scheme. Figures 7(a), 7(c), 7(b), 7(d) show speed-up results for the Parking scene and the Csb scene, with 32 or 96 sub-environments. The first comment is that the p-shoot method is always the least effective approach, having a very high memory overhead (figure 7(e)) and high lock times when the number of processors raises. It achieves a 7:3 speed-up on 32 processors on the Parking scene (figure 7(a)). We obtain the same behavior on the Csb scene up to 12 processors, then lock time becomes very large (figure 7(f)). The probability that processors interlock during computation on theCsb scene is higher than on the Parking scene because the Csb scene contains twice less polygons. With our method lock times are very small in all configurations. But, due to domain decomposition, our method is sensible to load-unbalance and produces less regular results. However, using a simple static load-balance strategy we obtain a speed-up of 21 on 32 processors for theParking scene (figure 7(c)). We can see comparing results for 32 and 96 sub-environments (figures 7(a),7(b),7(c),7(d)) that load-balancing is a key issue to achieve a good performance. This is clearly shown by figure 7(g) where we compare results with and without load-balancing. With load-balancing, we always obtain better results, and the more subenvironments we have the more we can distribute evenly the load. Figure 7(h) exposes another behavior, the impact of data allocation in the shared space as the number of processors grows. The impact on speedup depends on the number of processors. The reason is due to the increasing number of non local remote memory accesses when the data management is left to the operating system.

p-shot load mem

load nomem unload mem

unload nomem

p-shot load mem

14

14

12

12

10 P U

D

E E P 3

load nomem unload mem

unload nomem

10 P U

D

8

8

E E P 3

6

6

4

4

2

2 0

0 0

4

8

12

16

20

24

28

0

32

4

8

12

16

(a) Speed-up, Parking scene, 32 env. p-shot load mem

20

24

28

32

0ROCESSORS

0ROCESSORS

load nomem unload mem

(b) Speed-up, Csb scene, 32 env.

unload nomem

p-shot load mem

21

16

18

14

load nomem unload mem

unload nomem

12

15

P 10

U D E 8 E P 3 6

P U 12

D E E P 9 3

6

4

3

2

0

0 0

4

8

12

16

20

24

28

32

0

4

8

12

16

20

24

28

32

0ROCESSORS

0ROCESSORS

(c) Speed-up, Parking scene, 96 env.

(d) Speed-up, Csb scene, 96 env.

P SHOT

NOMEM



MEM

 SUB ENVIRONMENT

12000

2100

10000

1800 1500

8000

 E S

Idle

1200

IN E

Lock

6000

M I 4

Compute

900 4000

Mem

600

2000

300

0

0 1

2

4

8

12

16

24

32

1

2

4

8

12

16

24

0ROCESSORS

32

1

2

4

8

12

16

24

32

0ROCESSORS

(e) Times, Parking scene, 96 env. P SHOT

NOMEM



MEM

 SUB ENVIRONMENT

 E S

5000

1000

4000

800

3000

600

Idle Lock

IN E

Compute

IM 4

2000

400

1000

200

0

Mem

0 1

2

4

8

12

16

24

32

1

2

4

8

12

16

24

0ROCESSORS

32

1

2

4

8

12

16

24

32

0ROCESSORS

(f) Times, Csb scene, 96 env. Parking 96 load

Parking 96 unload

Csb 32 load

Csb 32 nomem

Csb 32 unload

Csb 32 mem

Parking 96 nomem

Parking 96 mem

21

21

18

18

15

15

P 12 U

D

P 12 U

D E E P 3 9

E E P 3

9

6

6

3

3

0

0 0

5

10

15

20

25

30

35

0

(g) Load-balancing effects, Parking & Csb scenes

5

10

15

20

25

30

0ROCESSORS

0ROCESSORS

(h) Memory managment effects, Parking & Csb scenes

Figure 7: Parallel results

35

8 Conclusion and further work

As we have shown in this paper, the virtual interface and the visibility mask are two efficient techniques to enhance data locality for radiosity computations. Such an enhancement is of considerable importance when rendering complex environments, since a large amount of time is spent accessing memory. Memory access time is and will remain the main bottleneck, since the performance of microprocessors increases at a higher rate than memory access times decrease. By using our technique, we were able to decrease the execution times by a factor of more than five. This factor may vary with scene complexity (in terms of the number of polygons). As regards the parallel version, our approach offers better performance than a more traditional approach. With the two scenes, we were able to reach 50% efficiency using 32 processors, whereas a traditional approach offers no more than 25% efficiency. Even so, further work needs to be done before we have a robust parallel solution. More experiments have to be performed to confirm these results. The lack of benchmarking scenes that are widely accepted by the computer graphics industry prevents us from presenting a larger amount of experimental results and from comparing our results with other, already investigated approaches. Next, we would like to address one of the main problems we have encountered: the positioning of the virtual interfaces and its influence on load-balancing methods. These techniques could be static or dynamic, using new mappings of the sub-environments onto processors, or even new placements of the virtual interfaces. One solution to be investigated is the use of graph partitioning techniques as proposed in [5]. For both techniques, a trade-off between load balancing and data locality will have to be found. Our study focuses on the traditional and well-known progressive radiosity method, which has nowadays been replaced by hierarchical methods. These methods seem to be even more memory consuming than previous ones. Thus, we think that the main concepts of our method (spatial decomposition and a compact representation of light transfer) could be applied to hierarchical and clustering radiosity, using several levels of interaction with the visibility mask. This issue forms the subject of future research work.

Acknowledgements The authors thank the Centre Charles Hermite (www.loria.fr/CCH), which provided the computing resources for the experiments presented in this paper, and Dan Truong, Pierre Michaud and Mounir Hahad for their comments.

References [1] Bruno Arnaldi, Thierry Priol, Luc Renambot, and Xavier Pueyo. Visibility Masks for Solving Complex Radiosity Computations on Multiprocessors. In Proc. First Eurographics Workshop on Parallel Graphics and Visualisation, pages 219– 232, Bristol, UK, September 1996. [2] Bruno Arnaldi, Xavier Pueyo, and Josep Vilaplana. On the Division of Environments by Virtual Walls for Radiosity Computation. In Proc. of the Second Eurographics Workshop on Rendering, pages 198–205, Barcelona, 1991. SpringerVerlag. [3] K. Bouatouch and T. Priol. Data Management Scheme for Parallel Radiosity. Computer-Aided Design, 26(12):876–882, December 1994. [4] Martin Feda and Werner Purgathofer. Progressive Refinement Radiosity on a Transputer Network. In Proc. of the Second Eurographics Workshop on Rendering, pages 139–148, Barcelona, 1991. Springer-Verlag.

[5] Thomas A. Funkhouser. Coarse-Grained Parallelism for Hierarchical Radiosity Using Group Iterative Methods. In ACM SIGGRAPH ’96 Proceedings, pages 343–352, New Orleans, August 1996. [6] P. Guitton, J. Roman, and Christophe Schlick. Two Parallel Approaches for a Progressive Radiosity. In Proc. of the Second Eurographics Workshop on Rendering, pages 160–170, Barcelona, 1991. Springer-Verlag. [7] Frederik W. Jansen and Alan Chalmers. Realism in Real Time? In Proc. of the Fourth Eurographics Workshop on Rendering, pages 27–46, Paris, June 1993. [8] Arjan J. F. Kok. Grouping of Patches in Progressive Radiosity. In Proc. of the Fourth Eurographics Workshop on Rendering, pages 221–232, Paris, France, June 1993. [9] James Laudon and Daniel Lenoski. The SGI Origin 2000: A CC-NUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecure, pages 241–251, Denver, June 1997. ACM Press. [10] Claude Puech, Francois Sillion, and Christophe Vedel. Improving Interaction with Radiosity-based Lighting Simulation Programs. In Computer Graphics (1990 Symposium on Interactive 3D Graphics), volume 24, pages 51–57, March 1990. [11] Rodney J. Recker, David W. George, and Donald P. Greenberg. Acceleration Techniques for Progressive Refinement Radiosity. In Computer Graphics (1990 Symposium on Interactive 3D Graphics), volume 24, pages 59–66, March 1990. [12] E. Reinhard, L.U. Tijssen, and F.W. Jansen. Environment Mapping for Efficient Sampling of the Diffuse Interreflection. In Proc. of Fifth Eurographics Workshop on Rendering, Darmstadt, June 1994. [13] Holly E. Rushmeier, Charles Patterson, and Aravindan Veerasamy. Geometric Simplification for Indirect Illumination Calculations. In Proc. of Graphics Interface ’93, pages 227–236, San Francisco, May 1993. [14] Francois Sillion and George Drettakis. Feature-Based Control of Visibility Error: A Multiresolution Clustering Algorithm for Global Illumination. In ACM SIGGRAPH ’95 Proceedings, pages 145–152, Los Angeles, 1995. [15] Brian Smits, James Arvo, and Donald Greenberg. A Clustering Algorithm for Radiosity in Complex Environments. In ACM SIGGRAPH ’94 Proceedings, pages 435–442, Orlando, 1994. [16] W. Sturzlinger, G. Schaufler, and J. Volkert Johannes. Load Balancing for a Parallel Radiosity Algorithm. In IEEE/ACM 1995 Parallel Rendering Symposium, pages 39–45, Atlanta, October 1995. [17] Seth Teller, Celeste Fowler, Thomas Funkhouser, and Pat Hanrahan. Partitioning and Ordering Large Radiosity Computations. In ACM SIGGRAPH ’94 Proceedings, pages 443– 450, Orlando, 1994. [18] R. van Liere. Divide and Conquer Radiosity. In Proc. of the Second Eurographics Workshop on Rendering, pages 191– 197, Barcelona, 1991. Springer-Verlag. [19] Amitabh Varshney and Jan F. Prins. An EnvironmentProjection Approach to Radiosity for Mesh-Connected Computers. In Proc. of the Third Eurographics Workshop on Rendering, pages 271–281, Bristol, UK, May 1992. [20] Hau Xu, Qun-Sheng Peng, and You-Dong Liang. Accelerated Radiosity Method for Complex Environments. In Eurographics ’89, pages 51–61. Elsevier Science Publishers, Amsterdam, September 1989. [21] Marco Zagha, Brond Larson, Steve Turner, and Marty Itzkowitz. Performance Analysis Using the MIPS R10000 Performance Counters. In Supercomputing ’96, Pittsburgh, 1996.

Parallel Visualization Algorithms: Performance and Architectural Implications Jaswinder Pal Singh, Anoop Gupta and Marc Levoy Computer Systems Laboratory Stanford University

1 Introduction

Several recent algorithms have substantially sped up complex and time-consuming visualization tasks. In particular, novel algorithms for radiosity computation [1] and volume rendering [2][3] have demonstrated performance far superior to earlier methods. Despite these advances, visualization of complex scenes or data sets remains computationally expensive. Rendering a 256-by-256-by-256 voxel volume data set takes about 5 seconds per frame on a 100 MHz Silicon Graphics Indigo workstation using the ray-casting algorithm in [2], and about a second per frame using a new shear-warp algorithm [3]. These times are much larger than the 0.03 seconds per frame required for real-time rendering or the 0.1 seconds per frame required for interactive rendering. Realistic radiosity and ray tracing computations are much more time-consuming. Multiprocessing provides an attractive solution to this computational bottleneck. It is well known that ray casting algorithms afford substantial parallelism, and we show that the same is true for the radiosity and shear-warp methods as well. However, all these visualization algorithms have highly irregular and unpredictable data access patterns. This makes data distribution and communication management very difficult in the explicit message-passing programming paradigm supported by most scalable multiprocessors (e.g. the Intel iPSC/860 and Paragon or the Thinking Machines CM-5), since these tasks have to be performed explicitly by the programmer. The need for explicit communication management leads to complicated parallel algorithms that look very little like their sequential counterparts, and to substantial performance inefficiencies. Recently, a new class of scalable, shared-address-space multiprocessors has emerged. Like message-passing machines, these multiprocessors also have a distributed interconnection network and physically distributed main memory. However, they provide hardware support for efficient implicit communication through a shared address space, and they automatically exploit temporal locality by caching both local and remote data in a processor’s hardware cache. In this paper, we show that these architectural characteristics make it much easier to obtain very good speedups on the best known visualization algorithms. Simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the applications have a high enough degree of temporal locality that there is no need for explicit data distribution and communication management. We demonstrate our claims through parallel versions of three state-of-the-art algorithms: a recent hierarchical radiosity algorithm by Hanrahan et al [1], a ray-casting volume renderer by Levoy (parallelized in [2]), and an optimized ray-tracer [12]. We also briefly discuss the parallelization of a new, shear-warp volume rendering algorithm [3] which results in what is to our knowledge the first demonstration of interactive frame rates for a 256-by-256by-256 voxel data set on a general-purpose multiprocessor. The rest of the paper is organized as follows. The next section describes cache-coherent shared address space multiprocessors in general and the specific machines that we use. We then describe the sequential algorithms and their parallelizations for these machines. 
We demonstrate excellent speedups on the Stanford DASH multiprocessor, an experimental cache-coherent shared address space machine with 48 processors, as well as on a commercially available, 16-processor Silicon Graphics Challenge machine. Finally, we contrast the simplicity of the shared address space parallelizations with the complexity and resulting performance losses of explicit message-passing implementations.

© 1994 IEEE. Reprinted, with permission, from IEEE Computer, 27(7), pp 79–86, July 1994


2 Cache-coherent Shared Address Space Multiprocessors

Figure 1: A shared address space multiprocessor with physically distributed memory.

Figure 1 shows the generic shared address space multiprocessor that we assume in our parallelization. The multiprocessor consists of a number of processing nodes, connected by a general interconnection network. Every node contains a processor, a cache and a portion of the total physical (main) memory on the machine. The address space is shared, so that any processor can reference any variable regardless of where it resides. When a processor reads or writes a word, that word is brought into the processor's cache. Modifying locally cached shared data introduces the cache coherence problem, which is solved by using a distributed directory-based protocol supported in hardware [4].

The two important goals in parallelizing an algorithm to run on such a machine are balancing the workload across the cooperating processors and preserving locality of data referencing. Locality is important because although memory is uniformly addressable, it is not uniformly accessible: The cost of accessing a data item increases with the distance the access has to travel from the issuing processor to be satisfied. Load balancing and data locality are often at odds with each other, and must be traded off for good performance. The generalized multiprocessor shown in Figure 1 affords locality at three levels of the memory and interconnection hierarchy:

• Cache Locality: This includes both the temporal locality exploited by reusing data that a processor brings into its cache (whether from its own local memory unit or from across the network), as well as the spatial locality provided by multiword cache lines.

• Memory Locality: If references miss in the cache, one would like to satisfy them in the local memory unit rather than have to communicate across the network. Memory locality can be provided by distributing data appropriately across physical memory units—either statically or dynamically—or by replicating data in main memory as well as in the caches. Both data distribution and replication in main memory require user intervention on most systems, and make the programming task more difficult.

• Network Locality: If references have to go across the network to be satisfied, one would like them to be satisfied as close as possible to the issuing processor in the network topology.

As we shall see, neither memory nor network locality is important in the visualization algorithms we examine. The temporal cache locality that obviates them falls naturally out of the spatial coherence in the applications, which is exploited by simple partitioning schemes designed to reduce communication.

The specific machine that we use in most of our experiments is the Stanford DASH multiprocessor, an experimental cache-coherent machine built at Stanford University [4]. The machine we use has 48 processors organized in 12 clusters.† A cluster consists of four 33 MHz MIPS R3000 processors connected by a shared bus, and clusters are connected together in a two-dimensional mesh network. Every processor has a 64 KB first-level cache and a 256 KB second-level cache, and every cluster has an equal fraction of the 256 MB of physical memory on the machine. To demonstrate performance on a commercially available machine with faster processors, we also use a Silicon Graphics Challenge multiprocessor. This has sixteen 150 MHz MIPS R4400 processors connected by a 1.2 GB/sec bus to one another and to a centralized shared memory. The centralized shared memory on the Challenge implies that locality in main memory is not an issue there.

† The prototype actually has 64 processors in 16 clusters, but is broken up into two separate machines in usual operation.


Let us now examine the parallelization of the visualization algorithms on these machines. For each application, we describe the sequential algorithm, discuss the parallelization, and present performance results and analyses. We begin with the hierarchical radiosity algorithm, and then proceed to the volume renderer and ray tracer.

3 Hierarchical Radiosity

The radiosity method computes the global illumination in a scene containing diffusely reflecting surfaces. It is a view-independent visualization method, which means that the radiosity does not have to be recomputed when the viewpoint is changed. In traditional radiosity approaches, the large polygonal surfaces that describe the scene (such as walls or desktops) are first subdivided into small enough elements or patches that the radiosity of a patch can be approximated as being uniform. The radiosity of a patch i can be expressed as a linear combination of the radiosities of all other patches j, leading to a linear system of equations. The coefficients in the linear combination are the "form factors" between the patches, where the form factor Fji from patch j to patch i is the fraction of light energy leaving j which arrives at i. The inherent form factor depends on the shapes of patches i and j, the angle the patches make with each other, and the distance between them. However, this must be modified by the presence of any intervening patches that occlude the visibility between i and j. Particularly given the need to test all patch pairs i and j for inter-visibility, the computation of form factors is the most time-consuming part of a radiosity algorithm. The number of form factors among all pairs of n patches is O(n²), which makes traditional radiosity methods (including progressive radiosity [7]) very expensive.

A new hierarchical method [1] dramatically reduces the complexity of computing radiosities. The method is inspired by recent advances in using hierarchical methods to solve the N-body problem. A scene is initially modeled as comprising a number, say k, of large input polygons (each representing a desktop or a wall in a room scene, for example). Light transport interactions are computed among these polygons, and polygons are hierarchically subdivided as necessary to improve accuracy. Each subdivision results in four subpatches, leading to a quadtree per input polygon. If the resulting final number of undivided subpatches is n, the number of interactions or form factors computed by this algorithm is O(n + k²). A brief description of the algorithm follows. Details can be found in [1][11].
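In standard notation (the emission and reflectance terms are not spelled out in the text above, but are the usual quantities), this linear system can be written as

\[ B_i \;=\; E_i \;+\; \rho_i \sum_{j=1}^{n} F_{ij}\, B_j, \qquad i = 1, \dots, n, \]

where B_i is the radiosity of patch i, E_i its emission, ρ_i its diffuse reflectance, and F_ij the form factor from patch i to patch j, related to the F_ji defined above by the reciprocity relation A_i F_ij = A_j F_ji.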


Figure 2: Refinements and interactions in the hierarchical radiosity application. Binary trees are shown instead of quadtrees for simplicity in (b), and only one polygon's interaction lists are shown.

3.1 Sequential Algorithm

The input polygons that comprise the scene are first inserted into a binary space partitioning (BSP) tree to facilitate efficient visibility computation between pairs of patches. Every input polygon is given an interaction list of other input polygons which are potentially visible from it, and with which it must therefore compute interactions. Then, polygon radiosities are computed by the following iterative algorithm:

1. For every polygon, compute its radiosity due to all polygons on its interaction list, subdividing it or other polygons hierarchically as necessary. Subdivided patches acquire their own interaction lists, and are processed recursively (see Figure 2).

2. Add all the area-weighted polygon radiosities together to obtain the total radiosity of the scene, and compare it with that of the previous iteration to check for convergence. If the radiosity has not converged to within a user-specified tolerance, return to step 1. Otherwise, go to step 3.

3. Smooth the solution for display by computing the radiosities at the vertices of the leaf-level elements. Since this phase is performed only once at the end of the algorithm, and since it is a very simple phase from the viewpoint of parallelism, we do not discuss it further.

Most of the time in an iteration is spent in step 1. In every iteration, each of the quadtrees is traversed depth-first, starting from the root. At every quadtree node visited in this traversal, interactions of the patch‡ at that node (patch i, say) are computed with all other patches, j, in its interaction list. An interaction may cause one of the interacting patches to be subdivided, and children to be created for the subdivided patch if they don't already exist. If patch i (the patch being visited) is subdivided, patch j is removed from i's interaction list and added to each of i's children's interaction lists. If patch j is subdivided, it is replaced by its children on patch i's interaction list. Figure 2(b) shows an example of this hierarchical refinement of interactions. Patch i's interaction list is completely processed in this manner before visiting its children in the tree traversal. At the beginning of an iteration, the interaction list of a patch in any quadtree is exactly as it was left at the end of the previous iteration: containing the patches with which its interaction did not cause a subdivision.

‡ A "patch" here can mean an input polygon, an intermediate patch, or a leaf element.
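The refinement of a single node's interaction list described above can be sketched as follows. This is our reconstruction under stated assumptions, not the authors' code: estimateFormFactor(), visibility(), subdivide(), and the thresholds bfEps/areaEps stand in for the BF and area refinement criteria of [1], and the reflectance and emission terms are omitted for brevity.

```cpp
#include <list>
#include <vector>

struct Patch {
    double area = 0.0;
    double radiosity = 0.0;
    std::vector<Patch*> children;    // empty until the patch is subdivided (quadtree)
    std::list<Patch*> interactions;  // patches this patch gathers radiosity from
};

double estimateFormFactor(const Patch* i, const Patch* j);  // unoccluded estimate
double visibility(const Patch* i, const Patch* j);          // sample rays through the BSP tree
void   subdivide(Patch* p);                                 // create four children if absent
extern double bfEps, areaEps;                               // assumed refinement thresholds

void refineAndGather(Patch* i) {
    for (auto it = i->interactions.begin(); it != i->interactions.end(); ) {
        Patch* j = *it;
        double ff = estimateFormFactor(i, j) * visibility(i, j);
        if (ff * j->radiosity < bfEps || (i->area < areaEps && j->area < areaEps)) {
            i->radiosity += ff * j->radiosity;   // interaction is accurate enough: gather
            ++it;
        } else if (i->area >= j->area) {         // refine the larger of the two patches
            subdivide(i);
            for (Patch* c : i->children) c->interactions.push_back(j);
            it = i->interactions.erase(it);      // j moves down to i's children
        } else {
            subdivide(j);
            for (Patch* c : j->children) i->interactions.push_back(c);  // j replaced by its children
            it = i->interactions.erase(it);
        }
    }
    for (Patch* c : i->children) refineAndGather(c);  // depth-first traversal of the quadtree
}
```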

3.2 Exploiting Parallelism

Parallelism is available at three levels in this application: across input polygons, across the patches that a polygon is subdivided into, and across the interactions computed for a patch. Since the patch quadtrees are constructed as the application proceeds, all three levels of parallelism involve communication and synchronization among processors. For example, a processor must lock a patch to ensure that it has exclusive access before subdividing the patch. Statically assigning polygons or polygon pairs to processors leads to severe load imbalances, since the workload distribution across polygon pairs is highly nonuniform. As in all the applications we consider, dynamic task stealing is needed for load balancing. We obtain the best performance by defining a task to be either a patch and all its interactions or a single patch-patch interaction, depending on the size of the problem and the number of processors (the difference is usually small). The parallel implementation provides every processor with its own task queue. A processor’s task queue is initialized with a subset of the initial polygon-polygon interactions. When a patch is subdivided, new tasks involving the subpatches are enqueued on the task queue of the processor that did the subdivision. A processor consumes tasks from its task queue until there are no tasks left. It then steals tasks from other processors’ queues, which it can directly access in the shared address space. While this task stealing provides load balancing, it can compromise data locality. However, locality is preserved as follows [11]. A processor inserts tasks at the head of its queue. It dequeues tasks from the head of its own queue (to yield a depth-first search of quadtrees and hence reuse portions of the BSP tree efficiently across visibility testing interactions) but steals from the tail of another processor’s task queue (increasing the likelihood of stealing a large patch, within which locality can be exploited).
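A minimal sketch of this distributed task-queue discipline follows (ours, not the paper's code; Task, process(), and the termination test are simplified placeholders):

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Task { /* a patch and its interactions, or one patch-patch interaction */ };
void process(const Task& t);   // may push new tasks onto the local queue

class TaskQueue {
    std::deque<Task> dq;
    std::mutex m;
public:
    void push(const Task& t) {            // owner inserts at the head
        std::lock_guard<std::mutex> g(m);
        dq.push_front(t);
    }
    std::optional<Task> pop() {           // owner removes from the head: depth-first order
        std::lock_guard<std::mutex> g(m);
        if (dq.empty()) return std::nullopt;
        Task t = dq.front();
        dq.pop_front();
        return t;
    }
    std::optional<Task> steal() {         // thief removes from the tail: likely a large patch
        std::lock_guard<std::mutex> g(m);
        if (dq.empty()) return std::nullopt;
        Task t = dq.back();
        dq.pop_back();
        return t;
    }
};

void worker(int me, std::vector<TaskQueue>& queues) {
    for (;;) {
        std::optional<Task> t = queues[me].pop();
        for (int p = 0; !t && p < (int)queues.size(); ++p)
            if (p != me) t = queues[p].steal();   // local queue empty: try to steal
        if (!t) break;                            // nothing left anywhere (termination simplified)
        process(*t);
    }
}
```

Popping from the head keeps each processor on a depth-first path through its quadtrees, while stealing from the tail tends to hand a thief a large patch whose work it can then keep local.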

3.3 Results and Discussion

This simple parallelization is conceptually natural and very easy to implement in a shared address space. As seen in Figure 4(a), it also yields very good speedups on the DASH multiprocessor. This is despite the fact that no attempt was made to distribute (or replicate) data in main memory, which is fortunate since appropriate data distribution at page granularity would have been very difficult given the irregular, dynamic data structures and fine-grained data sharing patterns of the algorithm. Good speedups are also obtained on the Challenge, although data distribution is not an issue there given its centralized shared memory. The speedups shown are for the relatively small room scene (174 input polygons, see Figure 3(a)) used by Hanrahan et al. in [1], which is why they scale more slowly after 32 processors on DASH. We expect to obtain even better speedups with larger input scenes, and that these results about the effectiveness of shared address space multiprocessors will extend to other radiosity algorithms, such as hierarchical radiosity with glossy surfaces, zonal radiosity, and even importance-driven radiosity (since there appears to be no need for data redistribution even if the viewpoint changes).

We now show that the reason we obtain good performance without attention to locality in main memory is the high degree of temporal locality in the application and the effectiveness of automatic caching in exploiting this locality transparently. To analyze temporal locality, we measure the size and impact of the important per-processor working sets of the applications. We measure working sets by using a simulated multiprocessor with fully associative caches to plot the read miss rate versus the cache size used, following the methodology described in [6]. Figure 4(b) indicates a very high degree of temporal locality: The important working set is about 4 KB of data for this input, and reduces the miss rate to a negligible quantity.

The algorithm spends the vast majority of its time computing the visibility between interacting patches (say i and j). Visibility for an interaction is computed by firing a number of "random" rays from i to j, and measuring the fraction of these rays that reach j without being occluded. The algorithm therefore repeatedly traverses the relevant portion of the BSP tree between the input polygons that are the ancestors of patches i and j. The next visibility interaction that the processor computes will likely be between patch i and a child of patch j, say, and will thus reuse the same portion of the BSP tree. As a result, the important working set for a processor is a fraction of the BSP tree. The BSP tree is very small compared to the entire data set of quadtrees. The size of the working set (the BSP tree) grows as the logarithm of the number of input polygons, and is independent of the number of processors used. Given the multi-megabyte cache sizes that people are building on shared-address-space machines today, there is little chance of encountering problems whose working sets will overflow these caches.




The use of hierarchy allows this algorithm to exploit temporal locality better than traditional radiosity algorithms, which sweep through all patches as they shoot radiosity to them. The hierarchical algorithm's use of gathering rather than shooting also results in better communication behavior—since only a processor that owns a patch writes the radiosity of that patch—and avoids the tradeoff between concurrency and preserving the sorted order of patches in a shooting approach [11]. In fact, gathering has been observed to work better than shooting in parallel even for traditional radiosity algorithms on message-passing machines [8]. Let us now turn our attention to volume rendering.

[Figure 4 plots: (a) speedup versus number of processors (Ideal; DASH: room scene; Challenge: room scene) and (b) read miss rate versus cache size for the room scene on one processor. Measured times: DASH, 290.0 s on 1 processor and 7.8 s on 44 processors; Challenge, 88.0 s on 1 processor and 6.1 s on 16 processors.]

Figure 4: Speedups and working sets for the radiosity application. The BF refinement threshold was set to 0.02, and the area refinement threshold to 2000 units [1]. The algorithm took 7 iterations to converge on a uniprocessor.

4 Volume Rendering

Volume rendering techniques are of key importance in the analysis and understanding of multidimensional sampled data, such as those generated in various scientific disciplines. The first parallel algorithm we use, developed in [2], renders volumes using optimized ray casting techniques. Until very recently, the sequential algorithm was one of the fastest known algorithms for volume rendering. In Section 4.4, we also examine a new shear-warp algorithm that is much faster; by parallelizing it, we are able to obtain interactive frame rates for a rotation sequence of a 256-by-256-by-256 voxel data set on a Silicon Graphics Challenge multiprocessor.

Figure 3: Images rendered by the three applications: (a) hierarchical radiosity image, (b) radiosity image showing the patches created, (c) volume rendering, (d) ray tracing. Figs. (a) and (b) courtesy of Pat Hanrahan, Fig. (c) courtesy of North Carolina Memorial Hospital, and Fig. (d) courtesy of Hewlett-Packard.

4.1 Sequential Ray-casting Algorithm

The volume to be rendered is represented by a cube of voxels (or volume elements). For each voxel, a color and a partial opacity have been computed during a prior shading operation. The outermost loop of the computation is over a sequence of viewing frames. In a typical sequence, successive frames correspond to changing the angle between the viewer and the volume being rendered. For each frame, rays are cast from the viewing position into the volume data through the pixels in the image plane that corresponds to that frame. Colors and opacities are computed for a set of evenly spaced sample locations along each ray, by trilinearly interpolating from the colors and opacities of surrounding voxels. These samples are blended together using digital compositing techniques to yield a single color for the ray and hence for the corresponding pixel. Rays in a volume renderer typically are not reflected, but pass straight through the volume unless they encounter too much opacity and are terminated early.

The algorithm uses three optimizations: (i) the early ray termination mentioned above, controlled by a user-defined opacity threshold, (ii) the use of an octree representation of space to avoid unnecessary sampling in transparent regions of the volume, and (iii) adaptive image sampling. Adaptive sampling introduces some synchronization at partition boundaries [2], and we present parallel performance both with and without it.
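As an illustration of the sampling and compositing step, here is a hedged sketch of the per-ray inner loop with front-to-back compositing and early ray termination; sampleOpacity() and sampleColor() are assumed stand-ins for the trilinear interpolation, not the renderer's actual interface.

```cpp
struct RGBA { float r, g, b, a; };

float sampleOpacity(int s);   // opacity at sample s along the ray (trilinear interpolation)
RGBA  sampleColor(int s);     // shaded color at sample s

RGBA castRay(int numSamples, float opacityThreshold /* e.g. 0.95 */) {
    RGBA acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int s = 0; s < numSamples; ++s) {
        float alpha  = sampleOpacity(s);
        RGBA  c      = sampleColor(s);
        float weight = (1.0f - acc.a) * alpha;   // attenuate by the opacity accumulated so far
        acc.r += weight * c.r;
        acc.g += weight * c.g;
        acc.b += weight * c.b;
        acc.a += weight;
        if (acc.a >= opacityThreshold) break;    // early ray termination
    }
    return acc;                                  // pixel color and accumulated opacity
}
```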

4.2 Exploiting Parallelism

In a shared address space, every processor can directly reference any voxel in the data set. Only one copy of the voxel data set is maintained, and it is distributed round-robin at the granularity of pages among the local memories of processors. No attempt is made at smart data distribution, since this is very difficult at page granularity and since it is in any case impossible to determine a good static distribution given that the viewpoint and hence the data affinity of processors changes across frames. The voxel data set is read-only. It is therefore very easy to exploit the most natural parallelism, which is across rays (or pixels in the image plane). Since an equal partitioning of the image plane among processors is not necessarily load balanced, owing to the nonuniformity of the volume data, task stealing is once again required for load balancing.

Given p processors, the image plane is partitioned into p rectangular blocks of comparable size [2]. Every image block or partition is further subdivided into fixed-size square image tiles, which are the units of task granularity and stealing. These tile tasks are initially inserted into the task queue of the processor that is assigned that block (a distributed task-queue system is used, as in the radiosity application). A processor ray traces the tiles in its block in scan-line order. When it is done with its block, it steals tile tasks from other processors that are still busy. Giving a processor a contiguous set of pixels preserves locality by exploiting the spatial coherence in the algorithm: Successive rays that a processor casts go through contiguous pixels and will tend to access much of the same voxel data. Figure 5 shows a four-processor example of the image plane partitioning.


Figure 5: Image plane partitioning in the volume renderer for four processors.
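The two-level decomposition can be sketched as follows. For brevity each processor's block is shown here as a vertical strip of the image, whereas the renderer uses roughly square rectangular blocks; all names are ours, not the renderer's.

```cpp
#include <algorithm>
#include <vector>

struct Tile { int x0, y0, x1, y1; };   // half-open pixel ranges [x0,x1) x [y0,y1)

std::vector<std::vector<Tile>> partitionImage(int width, int height,
                                              int nprocs, int tileSize = 8) {
    std::vector<std::vector<Tile>> blocks(nprocs);
    for (int p = 0; p < nprocs; ++p) {
        int x0 = p * width / nprocs;          // this processor's strip of columns
        int x1 = (p + 1) * width / nprocs;
        for (int y = 0; y < height; y += tileSize)       // tiles enumerated in scan-line order
            for (int x = x0; x < x1; x += tileSize)
                blocks[p].push_back({x, y,
                                     std::min(x + tileSize, x1),
                                     std::min(y + tileSize, height)});
    }
    return blocks;   // blocks[p] seeds processor p's task queue
}
```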

4.3 Results

Figure 6(a) shows speedups on DASH for both adaptive and non-adaptive rendering, and on the Challenge for non-adaptive rendering. The results measure rendering time only, and do not include the time to load in the data set, compute opacities and build the octree, or transfer the rendered image to the frame buffer. A 256-by-256-by-226 voxel data set showing a computed tomography rendering of a human head is used, the resulting image of which is shown in Figure 3(c). The image measures approximately 415-by-415 pixels, and the total data set size is about 29 megabytes. A tile size of 8-by-8 pixels is used as the unit of task stealing. Clearly, the parallel volume renderer yields very good speedups on both machines. Owing to the need for pixel sharing and additional synchronization with adaptive sampling, the speedups in this case are somewhat worse than with non-adaptive sampling. We are able to reach within a factor of 3 of interactive-time rendering on a 48-processor DASH or a 16-processor Challenge.

As in the radiosity algorithm, the observed speedups on DASH are very good despite the fact that we simply distribute data round-robin among physical memories. Figure 6(b) shows that the reason in this case also is the high degree of temporal locality on both private and shared data accesses. The important working set in this case is the amount of read-only voxel and octree data used in sampling a ray that is typically reused by the next ray. The reuse arises from the spatial coherence resulting from the contiguity of partitions in the image plane: Successive rays cast by a processor pass through adjacent pixels and tend to reference many of the same voxels in the volume. The important working set for the 30-megabyte data set we use (too large to be rendered at interactive rates) is only 16 kilobytes in size. The working set size is independent of the number of processors in this application as well, and is proportional to the number of voxels along a single dimension of the data set (along a ray), i.e. to the cube root of the data set size. Since the push in volume rendering is also toward real-time rendering rather than toward rapidly increasing data set sizes, the important working set for this algorithm is likely to remain small for some time to come.





[Figure 6 plots: (a) speedup versus number of processors (Ideal; DASH: adaptive; Challenge: adaptive; DASH: nonadaptive) and (b) read miss rate versus cache size, showing the important working set and another, less important one. Measured times: DASH, 11.2 s/frame on 1 processor and 0.36 s/frame on 44 processors; Challenge, 3.34 s/frame on 1 processor and 0.25 s/frame on 16 processors.]

Figure 6: Speedups and working sets for the ray-casting volume renderer. The threshold opacity value for early ray termination is set to 0.95, on a scale from 0 to 1.0. About 173,000 rays (618,000 samples) are traced in the case without adaptive sampling, and about 22,000 rays (245,000 samples) with adaptive sampling.

4.4 Interactive Frame Rates with the Parallel Shear-Warp Method

A new shear-warp algorithm has recently been developed that can render a 256-cube voxel data set in one second on a Silicon Graphics Indigo workstation [3]. We have parallelized this algorithm both on DASH and on the Challenge. The shear-warp algorithm proceeds in two phases. It first factors the viewing transformation into a three-dimensional shear parallel to the data slices, and projects the data to form a distorted intermediate (composited) image. Then, it performs a two-dimensional warp on the composited image to produce a final undistorted image. Unlike the image-order ray-casting algorithm, this is an object-order algorithm which streams through slices of the volume data set in front-to-back order and splats voxels on to the composited image. The shearing of the volume has the attractive property of exploiting spatial cache locality (with multiword cache lines) in both the object and image data, unlike the ray-casting approach which does not exploit spatial cache locality in the object data. The algorithm uses run-length encoding, min-max pyramids and multi-dimensional summed area tables to achieve its efficiency without sacrificing image quality. Its phases are depicted pictorially in Figure 7.


Figure 7: The shear-warp volume rendering method.

We parallelize the first (compositing) phase by partitioning the intermediate or composited image among processors. This ensures that only one processor writes a given pixel in the composited image. If the original voxel data set were partitioned among processors, different processors would write the same pixels (due to the shearing of the voxel data set), and synchronization would be required both to ensure mutual exclusion when updating pixels and to preserve dependences between processing slices of the data set. The composited image is divided into groups of scanlines (the optimal group size depends on the size of the problem and the cache line size on the machine), and the groups are assigned to processors in an interleaved manner (see Figure 7, which shows the partitioning for two processors). Instead of streaming through a full slice of the voxel data set before going to the slice behind it, as in the serial implementation, a processor now streams through the voxels in a sheared slice that correspond to one group of image scanlines that it is assigned, then proceeds to the similar group in the next slice, and so on. When it has gone through all the slices for one group of image scanlines, it processes the other groups that it is assigned, and finally steals groups from other processors if it is idle. The two-dimensional warp is also partitioned in groups of scanlines, by partitioning the final warped image among processors this time.

This parallelization achieves very good speedups, and allows us to obtain rendering rates of 12 frames a second for a rotation sequence on a 256-cube voxel human head data set. These speeds were obtained on a 16-processor Challenge machine (a single processor takes about 1 second per frame), and to our knowledge represent the first time that interactive frame rates have been achieved without sacrificing image quality on a 256-cube data set on general-purpose hardware. Thus, both image-order and object-order algorithms can be parallelized effectively on cache-coherent multiprocessors.
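A hedged sketch of the compositing-phase partitioning just described, with the interleaved assignment of scanline groups and the slice-by-slice traversal; compositeRow() and the other names are our assumptions, and stealing of leftover groups is omitted.

```cpp
#include <algorithm>

void compositeRow(int slice, int y);   // splat the sheared voxel row of 'slice' onto image row y

void compositePhase(int me, int nprocs, int numSlices,
                    int imageHeight, int groupSize) {
    int numGroups = (imageHeight + groupSize - 1) / groupSize;
    for (int g = me; g < numGroups; g += nprocs) {       // interleaved assignment of groups
        int y0 = g * groupSize;
        int y1 = std::min(y0 + groupSize, imageHeight);
        for (int slice = 0; slice < numSlices; ++slice)  // front-to-back through the volume
            for (int y = y0; y < y1; ++y)
                compositeRow(slice, y);
    }
    // An idle processor would then steal unfinished groups from other processors.
}
```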

5 Ray Tracing

Our final application is an optimized ray tracer. The ray tracer was originally developed in [12] for a message-passing machine, with duplication of the entire scene data set on every processing node, and was later adapted to the current implementation on a shared address space machine without data set duplication.

5.1 Sequential Algorithm

As in the ray-casting volume renderer, primary rays are fired from a viewpoint, through the pixels in an image plane, and into a space that contains the objects to be rendered. At the first object that a ray encounters, it is reflected toward every light source to determine whether it is in shadow from that light source and to compute the contribution of the light source otherwise. The ray is also reflected from and refracted through the object as appropriate, spawning new rays. The same operations are performed recursively on the new rays at every object that they encounter. Thus, each primary ray generates a tree of rays, the rays being terminated when they leave the volume enclosing the scene or by some user-defined criterion (such as the maximum number of levels allowed in a ray tree). A hierarchical uniform grid (similar to an octree but with not necessarily binary subdivisions) is used to traverse scene data efficiently [12], and early ray termination and adaptive sampling are implemented.
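To make the ray-tree structure concrete, the following generic recursive tracer is included as an illustration only; the declared helpers are placeholders and do not reflect the optimized implementation of [12].

```cpp
struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };
struct Hit  { bool found; Vec3 point, normal; /* material, etc. */ };

Hit  intersectScene(const Ray& r);   // traversal of the hierarchical uniform grid
Vec3 shadeLocal(const Hit& h);       // shadow rays toward each light source
Ray  reflectRay(const Ray& r, const Hit& h);
Ray  refractRay(const Ray& r, const Hit& h);

Vec3 trace(const Ray& ray, int depth, int maxDepth) {
    if (depth > maxDepth) return {0.0f, 0.0f, 0.0f};   // user-defined ray-tree depth limit
    Hit h = intersectScene(ray);
    if (!h.found) return {0.0f, 0.0f, 0.0f};           // ray left the scene volume
    Vec3 color = shadeLocal(h);                        // direct light-source contributions
    Vec3 refl  = trace(reflectRay(ray, h), depth + 1, maxDepth);  // spawned reflection ray
    Vec3 refr  = trace(refractRay(ray, h), depth + 1, maxDepth);  // spawned refraction ray
    color.x += refl.x + refr.x;
    color.y += refl.y + refr.y;
    color.z += refl.z + refr.z;
    return color;
}
```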

5.2 Exploiting Parallelism

Like the ray-casting volume renderer, the ray tracing algorithm affords substantial parallelism across rays, and the scene data are read-only. Here again, only a single copy of the scene database is maintained in shared space, and it is physically distributed round-robin at page granularity among the memories. The partitioning scheme used is almost identical to the one used for the ray-casting volume renderer, with a similar distributed task queue system for load balancing.

5.3 Results

Figure 8 shows the speedups for the parallel ray tracer. The scene being rendered is a car on a checker-boarded floor, as shown in Figure 3(d), and the image has 512-by-512 pixels. The data set size is about 10 megabytes. No anti-aliasing is used in these measurements. Excellent speedups are obtained, without attention to data distribution. Like volume rendering, the important working set in ray tracing consists of the data encountered in processing one primary ray (and the tree of rays it generates) that are reused in processing primary rays cast through neighboring pixels. The difference is that the working set is larger in ray tracing, and not so well-defined owing to the unpredictability of reflections (Figure 4). The working set size is once again independent of the number of processors. Its size depends on the hierarchical grid parameters discussed above, the reflectivity of the scene, and the number of levels allowed in the ray tree. Modern second-level caches should continue to keep the miss rate low enough to provide good performance.


[Figure 8 plots: (a) speedup versus number of processors (Ideal; DASH: car; Challenge: car) and (b) read miss rate versus cache size for the car scene on one processor. Measured times: DASH, 1158 s/frame on 1 processor and 25 s/frame on 48 processors; Challenge, 450 s/frame on 1 processor and 29 s/frame on 16 processors.]

Figure 8: Speedups and working sets for the ray tracing application. The hierarchical uniform grid is subdivided to a maximum of 3 levels, with 5-way subdivision in each dimension and a maximum of 60 primitive objects per leaf cell of the hierarchy. The size of a tile (the unit of task stealing) is 8-by-8 pixels.

On machines that require main memory to be managed at the granularity of pages and under software control¹, several characteristics of these applications would make it very difficult to manage data distribution and replication for locality in main memory. These include: (i) dynamic data structures (the quadtrees) in radiosity and changing viewpoints in the other applications, which make it extremely difficult to determine which processors access which data most often, (ii) fine-grained data sharing, which makes pages an inappropriate granularity for locality management, and (iii) dynamic task stealing. Thus, it is fortunate that caches work well. These same characteristics make it very difficult to program these visualization algorithms for effective parallel performance on message-passing machines that do not support a shared address space, as we shall now see.

¹ Note that this is not a necessary feature of cache-coherent machines. Some recent architectures, such as the Kendall Square Research KSR-1, manage main memory as a hardware-controlled cache at cache-line granularity as well, and are thus able to handle much larger working sets.

6 Cache-Coherent Shared Address Space versus Message-Passing

There are three primary aspects of communication management that distinguish the communication abstractions of a shared address space and message passing between private address spaces: (i) the naming of logically shared data, (ii) exploiting temporal locality on logically shared data, which includes both managing data replication and renaming as well as maintaining the coherence of replicated data, and (iii) the granularity and overhead of communication.

In a shared address space abstraction, any datum—local or nonlocal—can be referenced by any processor using the virtual address (name) of that datum in the shared address space. In the message-passing abstraction, on the other hand, a processor can directly reference only those data that are allocated in its private address space (local memory). A processor must therefore know or determine which processor's address space a datum resides in, and send a message to that processor requesting the datum if it is nonlocal. As we have seen, temporal locality on both local and nonlocal data is handled automatically in shared address space machines that cache shared data—if the caches are large enough—and machines like DASH automatically keep the cached shared data coherent as well. On message-passing machines, nonlocal data must be replicated explicitly by the user and kept coherent by explicit communication of messages in the application program. The replicated data are thus explicitly renamed in message-passing programs, while hardware transparently takes care of renaming in the cache-coherent approach. Finally, while hardware-coherent shared address space machines support communication efficiently at the fine granularity of cache lines, the overhead of initiating and receiving communication is much larger on message-passing machines (owing to software involvement), and it is therefore important to make messages large to amortize this overhead. Note that a coherent shared address space abstraction can be provided in software on a machine that does not provide any hardware support for it (such as an Intel iPSC/860 or Paragon message-passing machine); however, this is typically too inefficient for complex programs with fine-grained communication needs.

The disadvantage of cache-coherent machines is their cost and design complexity. However, recent efforts to build these machines have shown that the costs are quite small. In fact, the cost of the extra main memory which we find is often needed on message-passing machines for explicit replication of operating system and application code and data often dominates the hardware cost of cache coherence. And we argue that the cost of providing hardware support for a cache-coherent shared address space is more than justified by the ease of programming and performance it affords. Managing communication explicitly is not very difficult for applications with regular, predictable behavior (such as those that solve systems of equations on regular grids). However, this is not true of visualization applications. We now use the ray tracing and radiosity applications to discuss the difficulties of message-passing implementations for these irregular applications. The issues in volume rendering are similar to those in ray tracing.

6.1 Ray Tracing

The main problems for message-passing in the ray tracer are (i) managing the naming, replication, and fine-grained communication overhead issues in sharing the read-only scene data, and (ii) managing load balancing. A third problem arises in managing synchronization when adaptive sampling is used to reduce computation time.

Naming: Any processor may need to access any part of the scene data set with fairly unstructured access patterns. Replicating the entire data set on all nodes is not an acceptable solution, since it severely limits the size of problem that can be solved and is not scalable. A reasonable distribution for message-passing is to assign every processor (memory) a contiguous subvolume of the scene space, so that a processor P can determine which processor Q's partition a ray goes to when it leaves P's partition. Processor P then has two choices: it can send the ray to Q, which will then continue to trace the ray, or it can communicate with Q to obtain the volume data the ray needs, and continue to process the ray itself. Both approaches have been tried in the literature [12][13]. Managing the naming and naturally fine-grained communication in both approaches is complex and inefficient compared to a hardware-supported shared address space.

Replication: We have seen that replication of communicated scene data is very important to good performance. This is in fact accentuated on message-passing machines, where the overheads of communication are much larger. One approach to managing replication is to replicate every remote data structure that is touched and hold it locally for the duration of a frame, replacing data between frames. However, this can lead to large storage overheads without any benefits in complexity. The best approach for managing replication on a message-passing machine, used by Green and Paddon [13], is to emulate a fixed-size hardware cache for nonlocal data in the application program itself. Since this approach essentially amounts to implementing a restricted form of a shared address space with caches in the application program, it itself makes the argument for a shared address space machine (particularly since we have seen that realistic hardware caches are large enough to yield very good performance in such a machine). In fact, implementing this method of managing replication in software on a message-passing machine has significant overheads, since it introduces explicit renaming and in unstructured applications necessitates a check in software on every reference to volume data (to determine whether the referenced item is locally allocated, remotely allocated but in the local cache structure, or remote). None of this is required in a cache-coherent machine.

Communication Overhead and Granularity: All of the above approaches naturally generate fine-grained communication, which is very inefficient given the high message overhead on message-passing machines. Coalescing messages to make them larger requires substantial implementation overhead in such an unstructured application. While communication is fine-grained even in the cache-coherent shared address space approach, those machines support fine-grained communication efficiently in hardware.

Task Stealing and Load Balancing: In the shared address space implementation, the load balancing problem was resolved very simply by task stealing. All that was required to implement stealing were a lock per task queue and simple termination detection.
On message-passing machines, task stealing must be done through explicit messages, which must be handled by the application program while it is performing the main computation. Task stealing is therefore much more complex and incurs greater overheads on message-passing machines. In a survey of message-passing implementations, Green and Paddon [13] mention several attempts to address the load balancing problem but not one of them uses task stealing. Instead, they try to pre-partition the image and object space intelligently to improve load balancing over a uniform decomposition (see [14], for example). The approaches are quite complex and input- as well as view-dependent, and the best ones often require profiling low-resolution runs to determine a desirable partitioning. Finally, optimizations such as adaptive sampling (as used in the volume renderer) further complicate message-passing implementations, by requiring that the necessary synchronization for corner pixel values (see Section 4.2) be performed through explicit messages while the processes are in the midst of the main computation.
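To illustrate the per-reference cost discussed under Replication above, the following hypothetical sketch contrasts an emulated software cache on a message-passing node with the single hardware-cached load that suffices in a cache-coherent shared address space; every name in it is an assumption, not an existing API.

```cpp
struct Voxel { float opacity; /* color, gradient, ... */ };

bool   isLocallyAllocated(long index);              // falls in this node's partition?
Voxel  readLocal(long index);
Voxel* softwareCacheLookup(long index);             // already replicated here?
Voxel  requestFromOwner(long index);                // explicit message to the owning node
void   softwareCacheInsert(long index, const Voxel& v);

Voxel fetchVoxel(long index) {
    if (isLocallyAllocated(index))
        return readLocal(index);
    if (Voxel* cached = softwareCacheLookup(index))
        return *cached;                             // hit in the emulated cache
    Voxel v = requestFromOwner(index);              // miss: round-trip message
    softwareCacheInsert(index, v);                  // explicit replication and renaming
    return v;
}

// On a cache-coherent machine the same access is simply
//     Voxel v = volume[index];
// and the hardware fetches and caches the line if it is not already local.
```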

6.2 Radiosity

The hierarchical radiosity algorithm is much more complex to implement with explicit message passing. In addition to the irregular, unpredictable data accesses and the need for task stealing that it shares with the ray tracer and volume renderer, it has two other complicating properties: (i) the main data structures (quadtrees of patches) are dynamically changing, since they are built as the computation proceeds, and (ii) these data structures are not read-only but are actively read and written by different processors in the same computational phase, which complicates coherence management. We have had graduate students implement message-passing versions on an Intel iPSC/860 machine. However, it has been an exercise in frustration, and only yielded 11-fold speedups on 32 processors before the project was abandoned as not being worthwhile. We briefly describe some of the main problems here. Detailed descriptions and explanations can be found in [11].

Naming: Given the dynamic data structures, we solve the naming problem by giving every patch a unique identifier of the form quadtree.patch, where quadtree is the number of the quadtree or polygon which that patch is a part of, and patch is the (globally consistent) number of the patch within that quadtree. Thus, we essentially implement an application-specific shared address space in software.

Replication and Coherence: We have experimented with two approaches to manage replication and coherence. In the first approach, processors start a time-step with local copies of all the data corresponding to their patches and interaction lists. They modify these data (subdivide patches, etc.) locally as needed in an iteration, and communicate the modifications to other interested processors only at iteration boundaries. Coherence is thus maintained at a very coarse temporal granularity (an entire iteration), stale local information is often used or extrapolated from, and the amount of replication is typically very large. Special data structures have to be maintained dynamically to keep track of which patches are interested in updates made to a given patch. This is similar to maintaining an application-specific directory for cache coherence. The second approach is to emulate a shared address space and caches in the application program. A single "master" copy of the forest of quadtrees is maintained in distributed form and manipulated dynamically through the passing of messages. This approach leads to much finer-grained communication, and local/cached/remote checks at every reference to quadtree data.

Task Stealing and Load Balancing: The complexity of maintaining coherence is greatly increased by the need for task stealing, particularly in the local quadtrees approach. When a patch is stolen, it must be decided whether the ownership of the patch remains with the old processor or is passed on to the stealer, both of which complicate coherence and communication management. Although stealing does help load balancing, its communication and bookkeeping overheads are so large in our current implementation that it improves speedups from 10 to only 11 with 32 processors on an Intel iPSC/860 machine. The control and timing issues in handling messages for data, control, coherence, synchronization and load balancing while performing the computation are very difficult to program and debug in message-passing hierarchical radiosity, particularly given our results that cache-coherent shared address space machines solve this problem so well.
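As a small illustration of the naming scheme described above (our sketch; the owner mapping is an assumption for illustration only):

```cpp
struct PatchID {
    int quadtree;   // which input polygon's quadtree the patch belongs to
    int patch;      // globally consistent index of the patch within that quadtree
};

int ownerOf(const PatchID& id, int nprocs) {
    return id.quadtree % nprocs;   // e.g. quadtrees assigned to nodes round-robin
}
```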

7 Summary and Conclusions

We have shown that general-purpose multiprocessors that efficiently support a shared address space and cache shared data are very effective vehicles for speeding up state-of-the-art visualization and image synthesis algorithms. Excellent parallel speedups were demonstrated on some of the most efficient known sequential algorithms, including hierarchical radiosity, ray-casting and shear-warp volume rendering, and ray tracing. We have shown that a shared address space allows us to easily implement very natural parallelizations, and that transparent coherent caching suffices to exploit enough temporal locality to yield excellent parallel performance. On the other hand, the dynamic nature and unstructured access patterns of all the algorithms make it much more difficult to program them effectively in an explicit message-passing paradigm. We therefore believe that scalable multiprocessors should provide efficient support for a cache-coherent shared address space if they target computer graphics and visualization as one of their application domains. We believe that such general-purpose machines will be very effective at realizing real-time or interactive-time visualization of interesting data sets in the future. We have shown that they can already do this for volume rendering using the new shear-warp algorithm, which is a very encouraging result since it does not rely on special-purpose hardware.

Acknowledgments

We would like to thank several people for implementing and helping with implementations of the parallel versions: Takashi Totsuka, Jim Christy, Jason Nieh, Philippe Lacroute, Maneesh Agrawala, and David Ofelt. This research was funded by ARPA under Contract No. N00039-91-C-0138. Anoop Gupta is also supported by an NSF Presidential Young Investigator Award.

References

[1] Pat Hanrahan, D. Salzman and L. Aupperle. A Rapid Hierarchical Radiosity Algorithm. Proc. SIGGRAPH, 1991.
[2] Jason Nieh and Marc Levoy. Volume Rendering on Scalable Shared Memory MIMD Architectures. Proc. Boston Workshop on Volume Visualization, October 1992.
[3] Philippe Lacroute and Marc Levoy. Fast Volume Rendering Using a Shear-Warp Factorization of the Viewing Transformation. Proc. SIGGRAPH, 1994.
[4] Daniel E. Lenoski et al. The directory-based cache coherence protocol for the DASH multiprocessor. Proc. 17th Annual International Symposium on Computer Architecture, pages 148-159, 1990.
[5] Helen Davis, Stephen Goldschmidt and John L. Hennessy. Multiprocessor Simulation and Tracing using Tango. Proc. Intl. Conf. on Parallel Processing, August 1991.
[6] Edward Rothberg, Jaswinder Pal Singh and Anoop Gupta. Working Sets, Cache Sizes, and Node Granularity for Large-Scale Multiprocessors. Proc. 20th Annual International Symposium on Computer Architecture, 1993.
[7] Michael Cohen et al. A progressive refinement approach to fast radiosity image generation. Proc. SIGGRAPH, 1988.
[8] Alan G. Chalmers and Derek J. Paddon. Parallel processing of progressive refinement radiosity methods. Proc. Second Eurographics Workshop on Rendering, Barcelona, 1991.
[9] Jaswinder Pal Singh. Parallel Hierarchical N-body Methods and their Implications for Multiprocessors. Ph.D. thesis, Stanford University, Technical Report No. CSL-TR-93-563, February 1993. See also "Load Balancing and Data Locality in Hierarchical N-body Methods", Technical Report CSL-TR-92-505, Stanford University, to appear in Journal of Parallel and Distributed Computing.
[10] Susan Spach and Ronald Pulleyblank. Parallel Raytraced Image Generation. Hewlett-Packard Journal, vol. 43, no. 3, pages 76-83, June 1992.
[11] Stuart A. Green and Derek J. Paddon. A highly flexible multiprocessor solution for ray tracing. The Visual Computer, vol. 6, 1990, pp. 62-73.
[12] H. Kobayashi et al. Load balancing strategies for a parallel ray tracing system based on constant subdivision. The Visual Computer, vol. 4, no. 4, pp. 197-209.
