Distributed Parallel Volume Rendering on Shared Memory Systems

D.J. Hancock and R.J. Hubbold
Centre for Novel Computing (CNC)
Department of Computer Science
University of Manchester
Oxford Road, Manchester M13 9PL, England
email: {d.hancock, [email protected]}

February 14, 1997

Abstract

This paper reports on the remote use of MIMD parallel machines for compute-intensive visualisation tasks. As an alternative to dedicated rendering hardware, a system using conventional desktop machines connected to a visualisation server permits sharing of the resource between several users, and/or sites. The design of a parallel rendering system should consider not only how to achieve efficient parallel performance, but how to minimise response time, that is, the time taken from the specification of the viewing parameters to the display of the completed image. Progressive refinement and latency hiding techniques are proposed as a method of reducing the latency of a distributed rendering system. In distributed rendering, integrating the transmission of the image with the rendering process allows the latency introduced by the network to be hidden. Timing data are presented demonstrating the scalability of the parallel algorithm and its interactive use over LAN networks.

1 Introduction and Background

This paper reports on the remote use of MIMD parallel machines for compute-intensive visualisation tasks. We are investigating the suitability of distributed client/server systems for medical environments where we believe that, despite the latency introduced by the network connection, they have a number of potential benefits:

- low image generation times
- large memory capacity (suitable for high-resolution medical data sets)
- sharing of both data and the cost of the computing resource
- low-cost access to the system via PC or low-end workstation.


We begin by looking at the application which is the focus of our research; conformal radio-therapy treatment planning []. The purpose of the treatment is to deliver radiation to a target tumour without damaging the healthy tissue which surrounds it. To produce high levels of radiation within the target, multiple intersecting treatment beams are used. Adjustable beam cross-sections allow a more accurate delivery of radiation to the critical region, but make visualisation of the plan correspondingly more complicated. X-ray CT or MRI scans are used to build a model of the treatment area, and dose simulation is performed to model the distribution of energy through the tissue, allowing the radiation levels to which critical regions are exposed to be computed. In current clinical practice, data presentation and assessment is made by viewing multiple consecutive 2D slices of the computed dose values – a time-consuming process which is inherently difficult because of the 3D nature of the task. Our goal is to combine all of the pertinent data into a single 3D view showing dose distribution superimposed on scanned patient tissue data, and to permit the formulation and modification of treatment plans by interacting with this single 3D view.

Volume Rendering

The term volume rendering describes a family of algorithms which generate 2D views of 3D volumetric models. Medical scanner data is probably the most common example of volumetric data in use. These models record one or more sampled properties (e.g. X-ray attenuation, or computed dose level) for each of a set of regularly distributed points in a 3D space. Volume rendering falls into two broad categories: methods which extract surfaces from the raw data and then use traditional computer graphics surface shading techniques to display them, and direct volume rendering methods, in which no intermediate geometric model is constructed. This paper is concerned with direct volume rendering (DVR) techniques. DVR has attracted considerable interest for scientific visualisation [Fre89, Luc92, ASK92]. Off-the-shelf scientific visualisation systems such as AVS, Iris Explorer and IBM's Data Explorer all include DVR modules. It can be employed for a wide variety of problems, but is perhaps best known for its use in medical imaging [NFMD90, KPR+92, THB+90, NE93]. DVR is a notoriously time-consuming process, particularly if high-quality anti-aliased images are desired. On workstations, image generation times of several minutes are common, and much ingenuity has been applied to accelerating the basic algorithms. Dedicated hardware [] and, more recently, the texture mapping hardware found on high-end graphics workstations [] have also been used to perform high-speed rendering. Parallel computers can also be used to speed up the generation of volume-rendered images, and this is the approach we have taken for our visualisation system.

2 Design of parallel DVR algorithms

The principal concern in the design of our DVR system is the minimisation of response time, that is, the time taken from the specification of the viewing parameters to the display of the completed image. The design of the rendering component should consider not only how to achieve efficient parallel performance, but also how to hide effectively the fact that there is a network connection between the renderer and the user interface.


2.1 Image transmission

With high-performance renderers capable of generating images in a few seconds, the additional cost of transmitting the image data to the host computer for display becomes significant. High resolution colour images have large bandwidth requirements, e.g. a 512 × 512, 24-bit/pixel image is 0.76 Mbytes, and can take several seconds to transmit across a typical (e.g. Ethernet) network connection. Compression of image data is one method of reducing transmission latency. Progressive refinement of images over time (e.g. interlacing in GIF images) reduces the perceived latency by giving the user a low resolution preview of the image quickly, allowing rapid evaluation of the new viewing parameters.
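As a rough illustration of these figures, the following sketch computes the transmission time for a 512 × 512 true-colour image at a few effective throughput rates (the rates are arbitrary examples, not measurements):

    #include <stdio.h>

    int main(void)
    {
        const long bytes = 512L * 512L * 3L;                     /* 24 bits per pixel    */
        const double kbytes_per_s[] = { 100.0, 250.0, 1000.0 };  /* example throughputs  */

        printf("image size: %ld bytes\n", bytes);
        for (int i = 0; i < 3; i++)
            printf("at %4.0f Kbytes/s effective throughput: %.1f s\n",
                   kbytes_per_s[i], bytes / (kbytes_per_s[i] * 1024.0));
        return 0;
    }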

2.2 Parallelising DVR

Synthesising the image seen from a particular viewpoint requires that the energy transmitted to each pixel by each data point, or voxel, be integrated. Two classes of algorithm have been developed: ray-casting, which performs the integration by processing each pixel in the image, and splatting, which solves the same problem by processing each of the voxels in turn.

Figure 1: Ray-casting (left) and Splatting (right)

Parallelisation techniques have been developed for both ray-casting and splatting renderers, based on partitioning the image plane, the data volume, or some combination of both. The main factors influencing their design are common to the design of all parallel algorithms: achieving good scalability and load-balancing properties, minimising inter-processor communication, and exploiting any computational locality that exists. A good source of information on the design decisions that may be faced is [Neu93]; below we outline the principle behind firstly ray-casting and then splatting volume renderers.

Ray-casting means generating a virtual ray for each pixel in the image, along which the volume data is sampled at regular intervals. Unless the sample points coincide exactly with the known data values, new values must be reconstructed by filtering, e.g. by taking a weighted average of the values spatially close to the desired sample point. Different energy transport models can be used to accumulate the sample values to simulate physical phenomena such as X-ray attenuation and optical transmission. Each ray can be processed independently, leading to a straightforward method for parallelisation. The splatting algorithm is so called because each voxel is splatted onto the image plane, starting with the furthest away and moving towards the viewer in slices.

The task may be partitioned by sub-dividing the volume data and splatting the voxels in each portion independently. The set of partial images generated in this way must then be merged together in the correct order to preserve the depth ordering.
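To make the ray-casting formulation concrete, the minimal sketch below casts one orthographic ray per pixel through a small synthetic volume and accumulates intensity front to back with early ray termination. It illustrates only the basic accumulation loop; it is not the renderer described later, which reconstructs samples by trilinear interpolation along arbitrarily oriented rays.

    #include <stdio.h>

    #define NX 64
    #define NY 64
    #define NZ 64

    static unsigned char volume[NZ][NY][NX];   /* 8-bit scalar data     */
    static double image[NY][NX];               /* accumulated intensity */

    /* Trivial transfer function: map a raw value to an opacity per sample. */
    static double opacity(unsigned char v) { return (v / 255.0) * 0.05; }

    int main(void)
    {
        int x, y, z;

        /* Synthetic data: a dense spherical blob standing in for real CT values. */
        for (z = 0; z < NZ; z++)
            for (y = 0; y < NY; y++)
                for (x = 0; x < NX; x++) {
                    int dx = x - NX / 2, dy = y - NY / 2, dz = z - NZ / 2;
                    volume[z][y][x] = (dx * dx + dy * dy + dz * dz < 400) ? 200 : 10;
                }

        /* One orthographic ray per pixel, sampled front to back along z. */
        for (y = 0; y < NY; y++)
            for (x = 0; x < NX; x++) {
                double intensity = 0.0, transparency = 1.0;
                for (z = 0; z < NZ; z++) {
                    double a = opacity(volume[z][y][x]);
                    intensity += transparency * a * volume[z][y][x];
                    transparency *= 1.0 - a;
                    if (transparency < 0.01)   /* early ray termination */
                        break;
                }
                image[y][x] = intensity;
            }

        printf("centre pixel intensity: %.1f\n", image[NY / 2][NX / 2]);
        return 0;
    }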

2.3 Integrated image rendering and transmission

Although splatting is a more efficient method than ray-casting for DVR [Neu93], ray-casting methods possess advantages in the context of interactive, distributed synthesis [HH96]. The principal reason is that in splatting, final pixel values are not available until the entire rendering is completed, whereas they are generated continuously in ray-casting methods. By transmitting the image data as soon as it becomes available, instead of waiting until the end of the rendering, the network latency inherent in a distributed system can be partially masked. In splatting, the pixel values are accumulated as the energy from each voxel is imparted, so only partial pixel values are known until every voxel has been processed. These partial values can be transmitted for display as the rendering progresses, with updated values sent as more voxels are considered. This scheme requires that multiple values are sent for each pixel, providing a progressive refinement capability but increasing the bandwidth required to send the image data. Progressive refinement can be achieved easily in ray-casting by visiting the pixels in a non-raster order and by arranging for the display client to interpolate or replicate pixel values until they become available (see below). This method requires no more bandwidth than sending the entire image in one go, but does increase the number of individual network packets used in image transmission. Transmitting the image as a sequence of pieces is more time consuming than sending it as a single packet, due to greater network overheads. This has implications for the choice of both the partitioning granularity and the progressive refinement policy. If large numbers of small groups of pixel values are sent individually, the transmission costs increase and the overall response time may become significantly degraded.
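The integration of rendering and transmission can be sketched as follows. The record layout and helper name are illustrative assumptions rather than our actual wire format, but they show one refinement level of one tile being written to the socket as soon as it has been rendered:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical on-the-wire record: one refinement level of one tile,
     * sent as soon as that level has been rendered so the client can
     * display it while later levels are still being computed. */
    struct tile_update {
        uint16_t tile_id;     /* which tile of the image                */
        uint8_t  level;       /* refinement level just completed        */
        uint8_t  npixels;     /* number of pixel samples that follow    */
    };

    /* Write one completed refinement level to the display client.
     * 'rgb' holds 3 bytes per pixel, in the tile's level-specific order. */
    int send_tile_level(int sock, uint16_t tile_id, uint8_t level,
                        const unsigned char *rgb, uint8_t npixels)
    {
        unsigned char buf[4 + 3 * 255];
        struct tile_update hdr = { tile_id, level, npixels };

        memcpy(buf, &hdr, sizeof hdr);
        memcpy(buf + sizeof hdr, rgb, 3 * (size_t)npixels);

        /* A single write() per level keeps the number of network packets
         * proportional to tiles x levels rather than to individual pixels. */
        return write(sock, buf, sizeof hdr + 3 * (size_t)npixels) < 0 ? -1 : 0;
    }

Each rendering thread would call such a routine immediately after completing a refinement level of a tile, so that transmission overlaps with the remaining rendering work.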

3 Implementation of the distributed, parallel renderer

The user interface process, running on a colour workstation, communicates with the rendering process, which manages a set of parallel threads, via UNIX sockets. The user interface allows interactive mouse-based control of viewpoint and camera geometry. Other user interface components allow the rendering parameters to be manipulated. The remainder of this section describes in detail how we manage the interaction between the rendering and transmission components in our system.
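For illustration, a minimal display-client skeleton might connect to the rendering server and consume pixel updates as they arrive. The address, port and plain byte-stream handling are placeholders, not our actual protocol:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* Server address and port are illustrative defaults. */
        const char *server = argc > 1 ? argv[1] : "127.0.0.1";
        int port = argc > 2 ? atoi(argv[2]) : 7777;

        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        if (inet_pton(AF_INET, server, &addr.sin_addr) != 1) {
            fprintf(stderr, "bad address\n");
            return 1;
        }
        if (connect(sock, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        /* Read pixel updates as the renderer produces them; a real client
         * would decode tile/level headers and paint into the display window. */
        unsigned char buf[4096];
        ssize_t n;
        while ((n = read(sock, buf, sizeof buf)) > 0)
            printf("received %ld bytes of image data\n", (long)n);

        close(sock);
        return 0;
    }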

Partitioning

A dynamic scheduling strategy, with a local work queue associated with each thread in the system, is used. Having a local queue for each processor has two benefits over a single-queue method such as the processor farm. Firstly, the static partition can be chosen so as to have spatial locality, by assigning collections of adjacent pixels, called tiles, to the same processor. Secondly, having multiple queues reduces the chance of contention if tiles need to be rescheduled to adapt to load imbalance.


Refinement

Our progressive refinement scheme is based on square image tiles; it evaluates pixels in a pattern chosen to allow a fast preview capability and to maintain a regular distribution of pixels within the tile during refinement. Each tile is processed a fixed number of times (the number of levels of refinement), with additional pixels being addressed on each pass. Figure 2 illustrates the order in which the pixels of an 8 × 8 tile are visited; this tile is rendered in 4 passes. One goal in the design of the refinement scheme was to avoid the generation of large numbers of network packets (e.g. by sending each pixel separately). Each level of refinement contains more pixels than its predecessor, so the rate of refinement slows down, and thus becomes more efficient (in terms of the number of pixels per packet sent), as the rendering progresses.

Figure 2: Refinement levels in an 8 × 8 tile
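One plausible assignment of the pixels of an 8 × 8 tile to four refinement levels, consistent with the description above although not necessarily the exact pattern of Figure 2, uses progressively finer sub-grids so that each level contains more pixels than its predecessor (4, 12, 16 and 32):

    #include <stdio.h>

    #define TILE   8
    #define LEVELS 4

    /* Assign each pixel of a TILE x TILE tile to the first refinement level
     * at which it is rendered: coarse grid points first, then finer ones. */
    static int refinement_level(int x, int y)
    {
        if (x % 4 == 0 && y % 4 == 0) return 0;   /*  4 pixels */
        if (x % 2 == 0 && y % 2 == 0) return 1;   /* 12 pixels */
        if (y % 2 == 0)               return 2;   /* 16 pixels */
        return 3;                                 /* 32 pixels */
    }

    int main(void)
    {
        int count[LEVELS] = {0};
        for (int y = 0; y < TILE; y++) {
            for (int x = 0; x < TILE; x++) {
                int l = refinement_level(x, y);
                count[l]++;
                printf("%d ", l);
            }
            printf("\n");
        }
        for (int l = 0; l < LEVELS; l++)
            printf("level %d: %d pixels\n", l, count[l]);
        return 0;
    }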

Scheduling

Upon receipt of new viewing or rendering parameters, tiles are distributed statically to each queue so that each has approximately the same number of tiles. Each rendering thread takes the first tile from its local queue, renders the pixels composing the current refinement level of that tile (see Figure 2) and then marks that level of the tile as completed. The tile is placed at the tail of the local queue and the next tile on the queue is processed in the same way. In this manner each refinement level of each tile is processed in order. Once the local queue of a thread is exhausted, the thread finds the fullest of the other threads' local queues and moves some fraction (1/N, for N processors) of the work from that queue to its own local queue. The tile size (and hence the number of tiles into which the image is decomposed) is determined by the number of levels of refinement selected by the user. There is a trade-off in choosing a tile size which gives optimum load balancing with minimal scheduling costs, while providing suitable refinement performance (i.e. how fast an adequate initial image is formed and at what rate the resolution of the image improves) with a tolerable level of network overhead.
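A minimal sketch of the per-thread queues and the stealing step follows (POSIX threads). The data structures, locking and the handling of corner cases such as an emptied victim queue are simplified stand-ins, not our scheduler:

    #include <pthread.h>

    #define MAX_TILES   4096
    #define MAX_THREADS 64

    struct tile_queue {
        pthread_mutex_t lock;
        int tiles[MAX_TILES];   /* indices of tiles still to be rendered */
        int count;
    };

    static struct tile_queue queue[MAX_THREADS];
    static int nthreads;

    void init_queues(int n)
    {
        nthreads = n;
        for (int i = 0; i < n; i++) {
            pthread_mutex_init(&queue[i].lock, NULL);
            queue[i].count = 0;
        }
    }

    /* Return the next tile for thread 'me', or -1 if no work can be found.
     * When the local queue is empty, a fraction (1/nthreads) of the fullest
     * other queue is moved across, as described in the text. */
    int next_tile(int me)
    {
        pthread_mutex_lock(&queue[me].lock);
        if (queue[me].count > 0) {
            int t = queue[me].tiles[--queue[me].count];
            pthread_mutex_unlock(&queue[me].lock);
            return t;
        }
        pthread_mutex_unlock(&queue[me].lock);

        /* Pick the most heavily loaded victim (counts read without locks;
         * this is only a heuristic). */
        int victim = -1, best = 0;
        for (int i = 0; i < nthreads; i++)
            if (i != me && queue[i].count > best) { best = queue[i].count; victim = i; }
        if (victim < 0)
            return -1;

        /* Remove the stolen tiles while holding only the victim's lock. */
        int stolen[MAX_TILES], nsteal;
        pthread_mutex_lock(&queue[victim].lock);
        nsteal = queue[victim].count / nthreads;
        if (nsteal == 0 && queue[victim].count > 0)
            nsteal = 1;
        for (int i = 0; i < nsteal; i++)
            stolen[i] = queue[victim].tiles[--queue[victim].count];
        pthread_mutex_unlock(&queue[victim].lock);

        if (nsteal == 0)
            return -1;                   /* victim was drained in the meantime */

        pthread_mutex_lock(&queue[me].lock);
        for (int i = 0; i < nsteal; i++)
            queue[me].tiles[queue[me].count++] = stolen[i];
        pthread_mutex_unlock(&queue[me].lock);

        return next_tile(me);            /* retry with the newly acquired work */
    }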

Data distribution

In ray-casting, interpolation is used to generate sample values at arbitrary points in the data, i.e. at points in between those at the voxel vertices where values are actually known. Trilinear interpolation is a typical example of the reconstruction filter used; it requires the values of the 8 voxels neighbouring the sample point. If those 8 values are stored physically close to one another in memory then performance on a line-based cache memory system should be improved.


Volume data is normally arranged in memory as a 3D array, for example with the x index most rapidly changing. In this case fetching the line containing voxel [x, y, z] would also retrieve the data for voxels [x-1, y, z] and [x+1, y, z] to cache memory. This is because logically adjacent voxels in the x dimension have been placed adjacent in memory. The locality of reference exhibited in casting adjacent rays would certainly lead to some benefit from pre-cached values. However, greater locality exists in the trilinear interpolation and surface normal approximation, which also requires a computation based on a 3D neighbourhood of voxels. By way of illustration, Figure 3 shows these two ways of grouping voxels for a 6x6x6 volume. In Figure 3(a) voxels are ordered in normal array sequence. In Figure 3(b) the voxels have been re-ordered to exploit locality by grouping them in blocks of dimension 3x3x3. Different shading intensities show how the voxels would be arranged in memory, each shade representing a cache line capable of storing 27 data values.

Figure 3: Voxel re-ordering for improved cache re-use

If the data is arranged so that logically adjacent voxels in all dimensions are physically adjacent in memory, greater cache hit rates will be obtained. For 8-bit data, a 5x5x5 neighbourhood of the data can be stored in under 128 bytes, a typical size for a cache line. This does imply that 3 bytes are left empty out of every 128, and that volume dimensions which are not multiples of 5 must be padded, but this only increases data size by 2–4%.
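The address calculation for such a blocked layout might be implemented as sketched below, with 8-bit voxels stored in 5x5x5 blocks padded to 128-byte lines and lookup tables replacing per-access division and modulo; the table and function names are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    #define B    5              /* block edge length                     */
    #define LINE 128            /* cache line / coherence unit, in bytes */

    /* Volume dimensions, padded up to multiples of B. */
    static int nx, ny, nz;              /* in voxels              */
    static int bx, by, bz;              /* in blocks              */
    static unsigned char *blocked;      /* re-ordered voxel store */

    /* Lookup tables so that per-access integer division and modulo are
     * replaced by two table reads per coordinate. */
    static int *blk, *off;

    static void init_tables(int maxdim)
    {
        blk = malloc(maxdim * sizeof(int));
        off = malloc(maxdim * sizeof(int));
        for (int i = 0; i < maxdim; i++) {
            blk[i] = i / B;
            off[i] = i % B;
        }
    }

    /* Fetch voxel (x,y,z) from the blocked layout: each 5x5x5 block of 125
     * voxels occupies one 128-byte line (3 bytes of padding per block). */
    static unsigned char voxel(int x, int y, int z)
    {
        size_t block = (size_t)(blk[z] * by + blk[y]) * bx + blk[x];
        size_t inner = (size_t)(off[z] * B + off[y]) * B + off[x];
        return blocked[block * LINE + inner];
    }

    int main(void)
    {
        nx = ny = nz = 10;                       /* tiny example volume */
        bx = by = bz = (nx + B - 1) / B;
        init_tables(nx);
        blocked = calloc((size_t)bx * by * bz, LINE);

        blocked[0] = 42;                         /* voxel (0,0,0) */
        printf("voxel(0,0,0) = %d\n", voxel(0, 0, 0));
        return 0;
    }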

4 Performance Measurements

We have tested our implementation on three MIMD machines: a 128 processor Kendall Square Research KSR-1 [FBR93], an 8 processor Silicon Graphics Inc. (SGI) Challenge [SGI94] and a 14 processor SGI Origin 2000. The KSR-1 is a virtual shared memory machine, with a physically distributed memory system presented as a single virtual address space. The KSR memory system is based on a coherency unit of 128 bytes called a subpage. Subpages are demand-paged between different processors as required at run-time. The Challenge is a true shared memory SMP machine, with two levels of cache, also with a 128 byte (32 × 4-byte words) cache line size. The Origin 2000 is also a true shared memory SMP computer, based on the MIPS R10000 CPU. The test configuration used 196MHz CPUs, each with 32Kb instruction and 32Kb primary data caches, and a 4Mb second level cache. In the following experiments, the data set used is the 256x256x109 voxel CT head from the Chapel Hill Test Data Set Volume II. This requires 6.8 Mbytes of storage, which can be accommodated in the local memory of each KSR-1 processing element. In order to make the cache performance of our implementation more apparent, we scaled the data up by a factor of two using a trilinear filter. The resulting 56.5 Mb data-set, of size 512x512x218, is considerably larger than the memory available to any processor, so any performance increase due to exploitation of locality (i.e. improved cache hit-rate) will be evident.

4.1 Scalability

The first experiment measures scalability, to determine the efficiency of our partitioning and scheduling policy. A test series of four 512x512 24-bit colour images was rendered on varying numbers of processors on each machine. A very high degree of transparency was specified, meaning that almost every voxel had to be sampled at some point, leading to high image generation times. Figures 4, 5 and 6 show 1/t plotted together with the ideal value (single processor time / number of processors), illustrating the overhead incurred by our parallelisation strategy. It can be seen that on the KSR, the algorithm scales almost linearly on up to 63 processors. The percentage of linear speedup achieved on 63 processors is 96.5%.

Figure 4: KSR-1 (1/time against number of processors, compared with linear improvement)

Figure 5: SGI Challenge (1/time against number of processors, compared with linear improvement)

Figure 6: SGI Origin 2000 (1/time against number of processors, compared with linear improvement)

On small numbers of processors, the algorithm exhibits super-linear scaling due to the increased amount of cache memory available in the system. Super-linear speedup is achieved on up to 8 SGI Challenge processors. On the KSR, this behaviour does not continue indefinitely as the overheads of synchronisation for scheduling and the loss of locality due to the finer partitioning granularity start to become more significant. It is these costs which cause the increasing divergence between the ideal performance and that actually achieved.

4.2 Cache performance

The second experiment was conducted to measure how our arrangement of the voxel data contributed to cache hit rates. A pre-processing utility is used to reorder the data so that each 5x5x5 sub-volume present in the volume will be arranged on a sub-page of the system memory. Address calculation, i.e. converting voxel coordinate [x, y, z] into a memory address, becomes more complicated with data stored in this fashion, and lookup tables are used to accelerate this process. The KSR-1 memory hardware provides statistics about both main and sub-cache utilisation during the program execution. For the spatially re-ordered data, the lowest main cache hit-rate that was recorded during all of the runs was 98.5%. The sub-cache hit-rate was lower (as would be expected, as the sub-cache is only 0.5Mbytes), with an average of around 96%. For the normally ordered data, the sub-cache hit rate was more variable, and was typically in the range 60% to 80%. These rates demonstrate clearly the effect that the locality of reference is having, and suggest that the data alignment policy operates well. Considering the difference in access times between the local sub-cache and fetching from a remote PE (two orders of magnitude), the high cache utilisation achieved can be seen to make a valuable contribution to the scalability of the algorithm. The effect is clearly shown by the difference between the measured rendering times for different numbers of processors, shown in Table 1.

Data layout             Processors   Time (seconds)
normal array format     32           40.20
re-ordered format       32           21.73
normal array format     48           25.97
re-ordered format       48           15.63

Table 1: Cache performance data

4.3 Response time

In addition to scalability, the overall response time of a distributed rendering system is important. We have designed our system to minimise network latency by transmitting the image whilst the rendering is still underway [HH96], which we term overlapped display. We also use progressive refinement to reduce the perceived delay. Figure 7 shows how the resolution of the image increases during the rendering for a 24 processor case, on an Ethernet connection with a measured bandwidth of just under 100Kbytes/s.

Figure 7: Progressive refinement (percentage of pixels displayed against time; CT head 256x256x109, 16x16 tiles)

The additional cost of the tiling and progressive refinement techniques can be determined by measuring the time required to send the whole image as a single packet. On the LAN used for these experiments this took just under 4s. With 5 levels of refinement, and 1024 tiles, giving a total of 5120 packets, the image transmission took 10s. However, almost all of that transmission can occur concurrently with rendering when operating in overlapped mode. In practice, we find that useful image interpretation can begin once 20–30% of the pixels have been displayed [HH97]. With overlapped display mode, 30% of the pixels are displayed in 6s, just over the time required to send the complete image in non-overlapped mode. Figure 8 shows how image refinement progresses over time.
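Before turning to Figure 8, a back-of-envelope reading of the timings just quoted gives an estimate of the per-packet overhead on this LAN (a rough figure only):

    #include <stdio.h>

    int main(void)
    {
        double single_packet_time = 4.0;   /* whole image in one packet (s) */
        double tiled_time = 10.0;          /* 5 levels x 1024 tiles (s)     */
        int packets = 5 * 1024;

        /* Extra cost attributable to per-packet overhead. */
        printf("approx. overhead per packet: %.2f ms\n",
               (tiled_time - single_packet_time) / packets * 1000.0);
        return 0;
    }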


Figure 8: Progressive refinement sequence (t=1: 1% of pixels; t=3: 5%; t=5: 17%; t=7: 50%; t=9: 89%)

5 Conclusion

To overcome the large image generation times associated with DVR visualisation techniques, users can either acquire special purpose hardware (e.g. a high-end graphics workstation with 3D texture mapping), or can use a network connection to access a compute server. We believe that the latter approach can be effective, and has the additional benefit that the visualisation service can be shared – i.e. it can be accessed from more than one physical location by replicating the display clients, and the overall system cost can be amortised between multiple users. We have described the implementation of a distributed, parallel, direct volume rendering system. Our results demonstrate that task partitioning based on image tiling, coupled with a dynamic scheduling policy, leads to good scalability. To make the system responsive, we employ progressive refinement and present a low resolution version of the image to the user early in the rendering process. This technique can also partially mask the latency of the network and greatly enhances the interactive nature of the visualisation procedure.

6 Acknowledgements

This work was partly funded under EPSRC grant GR/L02685. The authors would like to express their thanks to Ridgway Scott and Susan Owens of the Texas Center for Advanced Molecular Computation at the University of Houston (for use of their KSR-1), and also to Nigel John and his colleagues at Silicon Graphics, UK (for time on the Challenge and Origin 2000 machines).

7 References

[ASK92] Ricardo S. Avila, Lisa M. Sobierajski, and Arie E. Kaufman. Towards a comprehensive volume visualization system. In Proceedings Visualization '92, pages 13–20. IEEE, October 1992.

[FBR93] Steven Frank, Henry Burkhardt III, and James Rothnie. The KSR1: Bridging the gap between shared memory and MPPs. In Proceedings Compcon '93, pages 285–294. IEEE, February 1993.

[Fre89] Karen A. Frenkel. Volume rendering. Communications of the ACM, 32(4):426–435, April 1989.

[HH96] David J. Hancock and Roger J. Hubbold. Efficient image synthesis on distributed architectures. In 3D and Multimedia on the Internet, WWW and Networks, 17–18 April 1996. (in press).

[HH97] Roger J. Hubbold and David J. Hancock. Autostereoscopic display for radiotherapy planning. In Stereoscopic Displays and Virtual Reality Systems IV, 1997. (accepted for publication).

[KPR+92] K.H. Höhne, A. Pommert, M. Riemer, Th. Schiemann, R. Schubert, and U. Tiede. Anatomical atlases based on volume visualization. In Proceedings Visualization '92, pages 115–122. IEEE, October 1992.

[Luc92] Bruce Lucas. A scientific visualization renderer. In Proceedings Visualization '92, pages 227–234. IEEE, October 1992.

[NE93] Thomas R. Nelson and T. Todd Elvins. Visualization of 3D ultrasound data. IEEE Computer Graphics and Applications, pages 50–57, November 1993.

[Neu93] U. Neumann. Volume Reconstruction and Parallel Rendering Algorithms: A Comparative Analysis. PhD thesis, Department of Computer Science, UNC at Chapel Hill, 1993.

[NFMD90] Derek R. Ney, Elliot K. Fishman, Donna Magid, and Robert A. Drebin. Volumetric rendering of computed tomography data: Principles and techniques. IEEE Computer Graphics and Applications, pages 24–32, March 1990.

[SGI94] SGI. Power Challenge technical report. Technical report, Silicon Graphics Inc., 1994.

[THB+90] Ulf Tiede, Karl Heinz Höhne, Michael Bomans, Andreas Pommert, Martin Riemer, and Gunnar Wiebecke. Investigation of medical 3D rendering algorithms. IEEE Computer Graphics and Applications, pages 41–53, March 1990.