A Generalized Framework for Interactive Volumetric Point-Based Rendering

A Dissertation Presented by Neophytos Neophytou to The Graduate School in Partial fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science Stony Brook University December 2006

Copyright by Neophytos Neophytou 2006

Abstract of the Dissertation

A Generalized Framework for Interactive Volumetric Point-Based Rendering by Neophytos Neophytou Doctor of Philosophy in Computer Science Stony Brook University 2006

Volume Visualization is the process of displaying volumetric data represented as sample points on a regular or irregular 3D grid. The data is currently produced by medical scanners such as MRI, CT, etc., and by numerical methods such as scientific simulations. Techniques that have been proposed for this purpose over the years include direct volume rendering, which seeks to capture a visual impression of the complete 3D dataset by accounting for the emission and light absorption effects of all the data elements. This technique is effective when rendering volumes of space-filling gases or volumes composed of many micro-surfaces, such as tissue in medical datasets. We have focused our efforts on Point-Based Volume Rendering and specifically on the image-aligned post-shaded splatting algorithm, which was proposed as a remedy to the drawbacks of existing algorithms with special focus on image quality. In the course of this dissertation, we will follow the evolution of this algorithm through several stages of maturity. Our contributions to this algorithm include the ability to render data stored on efficient grid topologies, with significant storage and rendering time gains for both 3D and time-varying datasets. We have also proposed a post-convolved volume rendering technique to accelerate magnified viewing. The framework has been ported to a hardware-accelerated implementation that has the ability to interactively slice point-based volumes and was optimized to take full advantage of the splatting algorithm's inherent advantages for empty-space skipping and early splat elimination. Finally, we have generalized our framework for the interactive rendering of irregular datasets consisting of ellipsoidal kernels of arbitrary size and orientation. Our suggested algorithm outperforms existing hardware approaches by about an order of magnitude in terms of throughput. Different modeling approaches have been explored for encoding the data using either RBF (Radial Basis Functions) or EBF (Elliptical Basis Functions), and new methods are proposed for ongoing/future work. The final goal is a truly general point-based hardware-accelerated framework for the interactive visualization of both regular and irregular volumetric data at high fidelity.


For my wife Odaly, my parents Rita and Savvas and my brother Panickos

Contents

List of Tables ... viii

List of Figures ... ix

Acknowledgements ... xi

Publications ... xii

1 Introduction and Background Review ... 1
  1.1 Introduction ... 1
  1.2 Volume as a signal ... 2
  1.3 Sampling grids ... 6
  1.4 Illumination and shading ... 6
  1.5 Direct volume rendering algorithms ... 8
    1.5.1 Ray-casting ... 8
    1.5.2 Shear-warp ... 9
    1.5.3 Splatting ... 10
    1.5.4 Hardware-accelerated volume rendering ... 14
  1.6 Time-varying volume visualization ... 16
  1.7 Irregular volume rendering ... 18
2 4D Splatting of Time-Varying Volumetric Data on Efficient Grids ... 20
  2.1 Introduction ... 20
  2.2 The 3D and 4D body centered cubic grids (BCC) ... 21
  2.3 Splatting in 4D ... 24
    2.3.1 Axis-aligned 4D splatting for CC grids ... 26
    2.3.2 Axis-aligned 4D splatting for BCC grids ... 26
    2.3.3 Motion blur effects ... 27
  2.4 Implementation ... 28
  2.5 Results ... 29
  2.6 Conclusions ... 33
3 Using Post-Convolved Splatting to Accelerate Magnified Viewing ... 35
  3.1 Introduction ... 35
  3.2 Theory ... 37
    3.2.1 Overview ... 38
    3.2.2 Stage 1: Splatting into a low-res buffer ... 38
    3.2.3 Stage 2: Post-convolution ... 42
    3.2.4 Potential of alternative grid topologies ... 42
  3.3 Implementation Details ... 43
  3.4 Results ... 45
  3.5 Conclusions ... 47
4 GPU Accelerated Image-Aligned Splatting ... 50
  4.1 Introduction ... 50
  4.2 Hardware-accelerated splatting ... 51
    4.2.1 Challenge 1: increased vertex traffic ... 52
    4.2.2 Challenge 2: increased voxel/pixel overdraw ... 53
    4.2.3 Challenge 3: shading of empty regions ... 55
    4.2.4 Challenge 4: shading opaque regions ... 57
    4.2.5 Putting it all together: the overall system ... 58
  4.3 Results ... 60
  4.4 Conclusions ... 60
5 Post-Shaded Splatting of Irregular Grids with GPU-Acceleration ... 64
  5.1 Introduction ... 64
  5.2 Method ... 66
    5.2.1 The modified image-aligned splatting algorithm ... 66
    5.2.2 Ellipsoid slicing ... 67
  5.3 Efficient ellipsoid slicing on the GPU ... 69
  5.4 Implementation ... 72
    5.4.1 Ellipsoid slicing and rendering on the GPU ... 73
    5.4.2 GPU-support for spatial decomposition ... 75
  5.5 Results ... 76
    5.5.1 Comparison with fragment-centric approaches ... 77
  5.6 Conclusions and future work ... 78
6 Practical Considerations on Radial Basis Functions ... 79
  6.1 Introduction ... 79
  6.2 Kernel fitting ... 82
    6.2.1 Least squares detection of constant and ramp-like areas ... 82
    6.2.2 The agglutination process ... 83
    6.2.3 The AG-Splat reconstruction primitives ... 84
  6.3 Rendering engine ... 87
    6.3.1 The Shear-warp splatting technique ... 87
    6.3.2 AG-Splatting specific codifications ... 88
  6.4 Results ... 89
  6.5 Conclusions ... 90
7 Volume Modeling ... 91
  7.1 Introduction ... 91
  7.2 Collaborative work on volume modeling ... 92
    7.2.1 RBF-based volume subdivision for splatting ... 92
    7.2.2 Constructing ellipsoids based on Delaunay triangulation ... 95
  7.3 RBF modeling for irregular gridded Data ... 98
    7.3.1 Analysis of regular volumetric grids for RBF representations ... 98
    7.3.2 Proposed RBF optimization approach for ongoing work ... 101
    7.3.3 Initial guess ... 102
    7.3.4 Iterative minimization ... 103
Conclusion ... 105

Bibliography ... 109

Tables

2.1. Comparison of the efficiency of the BCC grid versus the CC grid ... 29
2.2. Numerical results for the time-varying data-sets used in our study ... 33
4.1. Rasterization results for several data sets ... 62
5.1. Rendering results for Irregular Splatting ... 77
8.1. The contribution matrix ... 106


Figures

1.1. Sampling a two dimensional signal ... 3
1.3. Causes of aliasing ... 4
1.2. Typical sampling/reconstruction pipeline for one dimensional signal ... 5
1.4. Shear-Warp algorithm ... 10
1.5. Image-aligned sheet-buffered splatting (from [MC98]) ... 11
1.6. Pre-shaded versus Post-shaded pipeline ... 12
1.7. Anti-aliasing for Splatting (from [SMM+97]) ... 13
2.1. Two possible frequency domain lattices ... 21
2.2. Various grid cells, drawn in relative proportions ... 23
2.3. Rendering of the Marschner-Lobb test function on different grids ... 24
2.4. Contributions of neighboring grid points to sample point A ... 27
2.5. Compositing a motion blurred 4D slice ... 28
2.6. Renderings on the CC (top row) and the BCC grids (bottom row) ... 30
2.7. Vortex data-set. Left: 4D CC grid; right: 4D BCC grid ... 31
2.8. 4D turbulent jet data-set ... 31
2.9. Jet Shockwave data-set ... 32
2.10. Turbulent vortex data-set with motion trail ... 34
3.1. Two-phase splatting for magnification=2 ... 37
3.2. Interpolation of a slice from volume basis functions ... 38
3.3. PCVR in frequency space ... 39
3.4. 1-D Filter profiles ... 40
3.5. 2-D filter profiles ... 41
3.6. Radial profile of the 3D frequency responses ... 42
3.7. Frequency plot of various combinations of filters ... 43
3.8. Cubic cartesian and BCC cells in frequency space ... 44
3.9. Overview of the PCVR in our splatting pipeline ... 45
3.10. Impact of PCVR on our splatting pipeline ... 46
3.11. Full rendering results of CT Head ... 47
3.12. X-Ray rendering of the CT Head ... 48
4.1. Multiple density buffer pipeline ... 54
4.2. Applying the empty space skipping optimization ... 56
4.3. Applying opacity culling optimization ... 58
4.4. Pseudocode of the overall rendering process ... 59
4.5. Example renderings using the image aligned GPU splatting system ... 61
5.1. Modified image-aligned sheet-buffered splatting, as seen from the top ... 66
5.2. Illustration of the kernel warping process in 2D ... 68
5.3. Ellipsoidal kernel ... 69
5.4. Example renderings of irregular data-sets ... 76
6.1. Kernel fitting experiments in 1D and 3D ... 81
6.2. Homogenous areas inside the ctHead data-set detected in 1D ... 83
6.3. Comparison of gains for homogenous regions substituted by AG-splat ... 84
6.4. Overview of the agglutination process ... 85
6.5. AG-Splat kernel interactions ... 86
6.6. The AG-Splat Kernel in 1D ... 86
6.7. Overview of Shear-Warp splatting for agglutinated volumes ... 87
6.8. Rendering results obtained with our renderer ... 89
7.1. The subdivision kernel approximation process ... 92
7.2. Comparison of approximations to the MCQ subdivision kernel ... 94
7.3. Example of material density reconstruction in 2D (from [MCQ04]) ... 94
7.4. Sample visualizations of Sub-division Splatting ... 95
7.5. Fitting an ellipse to a grid point neighborhood (from [HNM06]) ... 96
7.6. Results of the ellipsoid construction algorithm on the Blunt Fin data-set ... 97
7.7. Reconstructing constant function f(x,y)=1 ... 98
7.8. Comparison of reconstruction results of function f(x,y)=1 ... 99
7.9. Comparison of 3D reconstruction results ... 100
7.10. Sibson Interpolation example in 2D (from [Ami02]) ... 102


Acknowledgements

First of all, I would like to take this opportunity to express my sincere gratitude to my thesis advisor and mentor, Professor Klaus Mueller. This work would never have been possible without his inspiring advice and continuous encouragement and guidance. During the past few years, Klaus’s office has been the place of the most interesting discussions and the main “hangout” for me and all his students, even at the “unthinkable” (for some people) late night/early morning hours. In addition to being a mentor that was able to nurture and stimulate my curiosity and the desire to work at the best of my ability and beyond, Klaus has provided an excellent example of research ethic and achievement to strive for. I am truly privileged that he played such a major role in my graduate life. Of course the one true friend that has been there for me throughout my whole Ph.D. process has been my wife Odaly. In addition to her support and all the help on everything I needed, her friendship and companionship, the delicious cakes and the cookies for my friends and colleagues at every social function I would ever attend, Ody has also given me something far more important: she completely understood the sometimes chaotic lifestyle of a CS grad student and has provided me with the space I needed to grow. I only hope to have the chance to do the same for her anytime soon. Thank you Ody. For their unconditional love and support and their confidence in everything I have ever done, I thank my family, Rita, Savvas, and Panickos. My parents Rita and Savvas have been there since the beginning, supporting all my educational aspirations. They have also struggled to make sure that I could concentrate on my studies while they provided me and my brother with financial support as well as spoiling us with home cooked meals and weekly “laundry service” while I was studying close to them. I am grateful for all their help as well as giving me a value system that encourages one to always do their best at what they do, but also channel their energy to the service of the community. I also thank Ramona and Menehildo Cruz and Elli and Andreas Vassiliou for being my family away from home during this journey. I would like to acknowledge my collaborators and co-authors in works derived

from this dissertation, as well as parallel research that I was working on during my graduate studies. Many thanks to Wei Hong, Dr. Ingmar Bitter and their adviser, professor Arie Kaufman, Dr. Kevin T. McDonnel and his adviser professor Hong Qin, Xin Guan, Warren Leung, Alexandros Panagopoulos, Fang Xu and Shengying Li, past collaborators professor Skevos Evripidu, Dr. Adrian Lahanas, Panickos Neophytou and Soulla Louka. An important part of my graduate studies has been the time of my close collaboration with the Proteomics Research Group at the department of Applied Mathematics and Statistics. My gratitude goes to professor Wei Zhu for her encouragement and complete confidence in my inputs to the group. In addition to her support, she has provided me and the rest of the proteomics group with a very stimulating environment where she allowed me to apply all my previous graphics systems and visualization experience towards a very challenging multi-disciplinary effort. In addition, I would like to also thank the rest of my collaborators in this effort, Kith Pradhan, Chen Ji, Yue Zhang, and professors Joe Mitchell and Esther Arkin for this wonderful experience. The largest part of my life at Stony Brook was spent in the Visualization lab, and it could not have been more pleasant thanks to the friends and colleagues at the lab. In addition to my collaborating colleagues mentioned above, I would like to thank Kyle Hegeman, Lujin Wang, Xin Guan, Fang Xu, Tom Welsh, Yiping Han, Aili Li, Shengying Li, Satprem Pamudurthy, Supriya Garg, Rong Su, Oleg Mishchenko, Christopher Carner, Zhe Fan, Susan Frank, Xiaohu Guo, Feng Qiu, Yang Wang, Lei Zhang, Sen Wang, Haitao Zhang, Ye Zhao, Peter Imrich, also Dr. Lapchung Lam and professor Dimitris Samaras. In a working environment as crowded as the Computer Science department at Stony Brook, there is one group of people that deserve much of the credit for making my graduate life easier. Stella Mannino, Kathy Germana, Edwina Osmaski, Shakeera Thomas, Betty Knittweis, Brian Tria, Bin Zhang, Dr. Ajay Gupta and Ashwin Nagrani have all given their best in providing logistical and systems support, most of the time at a very short notice and even at very late hours. I also very much appreciated all the help from Mrs. Elsy Arieta-Padro from the International Services Office. Thank you! Many thanks to professors Quynh Din, Arie Kaufman and Joe Mitchell for taking the time to serve in my committee and for their visiting and advising. I do appreciate their invaluable feedback during our interactions and visits. At this point, I would like to extend my gratitude to professor Larry Wittie for his invaluable influence and advice, and most importantly for standing up for me during some very difficult times. Last but not least, I would like to thank my first academic advisor during my undergraduate studies, professor Skevos Evripidu. Skevos, along with professors Elpida Keravnou and George Samaras have encouraged me to pursue this doctorate degree and an academic or research career to (hopefully) follow. Most of this dissertation work was supported by NIH grant 5R21EB004099-02, NSF-CAREER grant ACI-0093157 and DOE grant MO-068.

Publications

Publications directly derived from this dissertation

[1] W. Hong, N. Neophytou, K. Mueller, and A. Kaufman, Constructing 3D Elliptical Gaussians for Irregular Data, to appear in: Moeller, T., Hamann, B. and Russell, R.D., eds., Mathematical Foundations of Scientific Visualization, Computer Graphics, and Massive Data Exploration, Springer-Verlag, Heidelberg, Germany, 2006.
[2] N. Neophytou, K. Mueller, K. T. McDonnell, W. Hong, X. Guan, H. Qin and A. Kaufman, GPU-Accelerated Volume Splatting With Elliptical RBFs, Joint Eurographics - IEEE TCVG Symposium on Visualization 2006, Lisbon, Portugal, May 2006.
[3] N. Neophytou and K. Mueller, GPU Accelerated Image Aligned Splatting, Proceedings of the International Workshop on Volume Graphics, Stony Brook, NY, USA, June 2005.
[4] I. Bitter, N. Neophytou, K. Mueller, and A. Kaufman, SQUEEZE: Numerical-precision-optimized volume rendering, SIGGRAPH/Eurographics Workshop on Graphics Hardware 2004, Grenoble, France, August 2004.
[5] N. Neophytou and K. Mueller, Post-convolved Splatting, Joint Eurographics - IEEE TCVG Symposium on Visualization 2003, Grenoble, France, May 2003.
[6] N. Neophytou and K. Mueller, Space-time points: Splatting in 4D, Symposium on Volume Visualization and Graphics 2002, Boston, October 2002.
[7] N. Neophytou, K. McDonnell, and K. Mueller, On the simplification of Radial Basis Function fields for volume rendering: some practical insights, Technical Report, Stony Brook University, June 2006.


Currently in preparation

[8] Neophytos Neophytou and Klaus Mueller, A Generalized framework for the rendering of structured and unstructured point-based volumetric datasets, in preparation as a journal publication.
[9] Kevin McDonnell, Klaus Mueller, Neophytos Neophytou, Hong Qin, Subdivision Volume Splatting, in preparation.

Other Publications

[10] N. Neophytou and K. Mueller, Color-space CAD, Sketches & Applications Catalog, SIGGRAPH 2006, Boston, MA. ACM, August 2006.
[11] W. Leung, N. Neophytou and K. Mueller, SIMD-Aware Ray-Casting, Proceedings of the International Workshop on Volume Graphics, Boston, MA, USA, July 2006.
[12] P. Neophytou, N. Neophytou, P. Evripidou, Net-dbx-g: A Web-based Debugger of MPI Programs Over Grid Environments, IEEE International Symposium on Cluster Computing and the Grid, April 2004, pages 35-42.
[13] P. Neophytou, N. Neophytou, P. Evripidou, Debugging MPI Grid applications using Net-dbx, European Across Grids Conference 2004, pages 139-148.
[14] N. Neophytou, P. Evripidou, Net-dbx: A Web-Based Debugger of MPI Programs Over Low-Bandwidth Lines, IEEE Transactions on Parallel and Distributed Systems, 12(9): 986-995 (2001).
[15] S. Louca, N. Neophytou, A. Lachanas, P. Evripidou, MPI-FT: Portable Fault Tolerance Scheme for MPI, Parallel Processing Letters 10(4): 371-382 (2000).
[16] N. Neophytou, P. Evripidou, Net-dbx: A Java Powered Tool for Interactive Debugging of MPI Programs Across the Internet, in David J. Pritchard, Jeff Reeve (Eds.): Euro-Par '98 Parallel Processing, 4th International Euro-Par Conference, Southampton, UK, September 1998, Proceedings, Lecture Notes in Computer Science 1470, Springer 1998, ISBN 3-540-64952-2, pages 181-189.

Other submitted work

[17] N. Neophytou and Klaus Mueller, Color-Space CAD: An Environment for direct color-space manipulations in 3D, IEEE Computer Graphics & Applications, submitted September 2006.
[18] Neophytos Neophytou, Fang Xu and Klaus Mueller, Hardware acceleration vs. algorithmic acceleration: Can GPU-based processing beat complexity optimization for CT?, SPIE Medical Imaging 2007, submitted August 2006.
[19] Warren Leung, Neophytos Neophytou, Alexandros Panagopoulos and Klaus Mueller, SIMD-Aware Raycasting: Optimizing the ray-casting algorithm for the next generation GPU, Journal of Graphics Tools, submitted September 2006.



Chapter 1

Introduction and Background Review

1.1 Introduction

We base our work on established research in Volume Visualization, including regular point-based rendering techniques, time-varying volume visualization, and GPU-accelerated visualization techniques. Our proposed generalized point-based volume rendering framework extends these existing techniques and additionally introduces a novel set of algorithms for the interactive visualization of both regular and irregular volumetric datasets with high-quality resulting imagery. In this chapter we will briefly review some foundations and re-iterate our treatment of volumetric data as a signal. We will revisit some of the most popular direct volume rendering algorithms, including ray-casting [Lev88], shear-warp [LL94] and splatting [Wes90], with particular emphasis on the image-aligned splatting algorithm [MC98]. We then discuss hardware-accelerated approaches, and observe that they usually attempt to approximate these basic methods. In the past, only applications utilizing high-end graphics workstations could accurately approximate ray-casting with the use of 3D texture-mapping capabilities [CCF94]. This made it necessary for commodity PC based solutions to use either hybrid implementations, which render portions of the volume in software and composite in hardware [BJNN98], to substitute the missing 3D interpolation capabilities with pre-computed results [EKE01], or to utilize the programming capabilities of first-generation GPU hardware [RSEB+00]. Fortunately, the latest GPU generation allows both 3D texture mapping as well as full programmability on a floating point graphics pipeline. These developments have dramatically changed the landscape of interactive scientific visualization.

Next, we will discuss algorithms proposed for rendering time-varying data-sets. The current methods are classified into two categories, depending on their focus. The first group attempts to exploit the temporal coherence of the huge data-sets involved in this kind of visualization, and concentrates on compression techniques over the time axis [SJ94, AAW00, Wes95, SCM99, ECS00, SH99, WG92, LMC02, GS01]. The second group consists of methods designed for general n-D viewing, which handle the data as a 4D function [HC93, HH92b, HH92a, BPRS98, KP89, WB98, BWC00, SLM02]. We finally introduce some of the research on irregular volume rendering, in order to better understand the specifics of this visualization system and set the final requirements for our upcoming generalized framework.

1.2 Volume as a signal

Volume data, as mentioned above, usually comes in the form of an array of numbers (sample points). In order to render an image from the volume's contents, a continuous function that assigns a value to every point in the volume needs to be constructed from the data. Rendering operations such as simulating light propagation or extraction of surfaces can then be applied to the function. Given a discrete set of samples, the process of obtaining a density function that is defined throughout the volume is called reconstruction, and it is also referred to as interpolation in most literature. Different reconstruction schemes include trilinear interpolation as used in many ray-casting algorithms (such as that of Levoy [Lev88]), and interpolation using truncated Gaussian filters, as used in the splatting methods of Westover [Wes90] and Laur & Hanrahan [LH91]. All these methods use the standard signal processing framework of reconstruction by convolution with a filter. The rest of this section will provide a review of the basics of reconstruction and sampling theory and introduce some practical reconstruction issues. For a more detailed explanation of what follows, please refer to [FvDFH96, GW92, Hec89, ML94]. The theory and illustrations will be presented for 2D signals to make the diagrams more intuitive, but can easily be extended to 3D in order to be applied to volumes.

Fourier Analysis allows us to express a complex function $g: \mathbb{R}^2 \to \mathbb{C}$ as a sum of plane waves of the form $\exp(i(\omega_x x + \omega_y y))$. Periodic functions can be expressed with a discrete sum (Fourier series), but for an arbitrary function $g$, the following integral is needed:

$$g(x, y) = \frac{1}{2\pi} \int_{\mathbb{R}^2} \hat{g}(\omega_x, \omega_y)\, e^{i(\omega_x x + \omega_y y)}\, d\omega_x\, d\omega_y \qquad (1)$$

We can get $\hat{g}$ from $g$ using the following:

$$\hat{g}(\omega_x, \omega_y) = \int_{\mathbb{R}^2} g(x, y)\, e^{-i(\omega_x x + \omega_y y)}\, dx\, dy \qquad (2)$$

Intuitively, $\hat{g}(\omega_x, \omega_y)$ measures the correlation over all $(x, y)$ between $g$ and a complex sinusoid of frequency $(\omega_x, \omega_y)$. On the other hand, $g(x, y)$ can be thought of as the sum of sinusoids over all possible frequencies $(\omega_x, \omega_y)$ at point $(x, y)$. We call $\hat{g}$ the Fourier Transform of $g$, and $\hat{g}$ the spectrum of $g$. Additionally, we say that $g$ lies in the spatial domain and $\hat{g}$ lies in the frequency domain. The Fourier transform is reversible, so $g$ and $\hat{g}$ are two representations of the same function. Another important property of the Fourier Transform is that the Fourier transform of the product of two functions is the convolution of their individual Fourier transforms, and vice versa. That is, $\widehat{gh} = \hat{g} \otimes \hat{h}$ and $\widehat{g \otimes h} = \hat{g}\,\hat{h}$.

Sampling is basically the operation of multiplying the original source function $g$ with an impulse train $k$ (Dirac impulse function). The result of multiplying a function with a grid of impulses is another grid of impulses $gk$, with each sample point scaled by the value of the original function at that point (see Figure 1.1 for an illustration). In the frequency domain, the Fourier transform of the impulse sampling grid $k$ is another grid $\hat{k}$, with period equal to the frequency of the sampling grid in the spatial domain. So, the Fourier transform of the sampled signal is $\widehat{gk} = \hat{g} \otimes \hat{k}$. The convolution of $\hat{g}$ with the impulse grid $\hat{k}$ will duplicate $\hat{g}$ at every point of the grid $\hat{k}$. In sampling theory, we call the copy of $\hat{g}$ centered at zero the primary spectrum, and all the other copies around it alias spectra. In order to recover the original signal we need to fill in the gaps of the sampled signal in the spatial domain to get back a continuous function. This is equivalent to canceling out all the alias spectra in the frequency domain, leaving only the primary spectrum intact. To do that, we multiply $\widehat{gk}$ with a function $\hat{h}$ which is one inside the area that covers the primary spectrum and zero outside this region. This multiplication in the frequency domain is equivalent to a convolution of $gk$ with $h$ (the inverse Fourier transform of $\hat{h}$) in the spatial domain. Since this convolution allows us to reconstruct the original signal $g$, we call $h$ the reconstruction filter. Since $h$ only passes the lower frequencies close to the center of the primary spectrum, it is also called a low-pass filter. It is now obvious that, to recover the original signal, it is necessary for the alias spectra to be placed far enough apart so that they do not overlap. This means that the period of the grid $\hat{k}$ has to be at least twice the maximum frequency of $g$, to fit the whole spectrum with no overlaps. In the spatial domain, the period of $\hat{k}$ translates to the minimum necessary frequency of the sampling grid $k$, which is also known as the Nyquist frequency for the given signal, $f_N$. The ideal reconstruction filter described above is a box in the frequency domain, which corresponds to a sinc function (or a product of sincs in 2D/3D) in the spatial domain. Unfortunately, it cannot be fully realized in the spatial domain, because it has an infinite extent.
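As a concrete illustration of reconstruction by convolution, the short NumPy sketch below samples a band-limited 1D test signal above its Nyquist rate and reconstructs intermediate values by convolving the samples with a truncated sinc filter and, for comparison, a linear (tent) filter. The test signal, sampling rate, and filter radius are arbitrary choices made for this illustration and are not taken from the cited works.

```python
import numpy as np

# Band-limited 1D test signal: highest frequency 3 cycles/unit, so f_N = 6 samples/unit.
def g(x):
    return np.sin(2 * np.pi * 1.0 * x) + 0.4 * np.cos(2 * np.pi * 3.0 * x)

rate = 8.0                                    # samples per unit, above the Nyquist rate
xs = np.arange(0.0, 4.0, 1.0 / rate)          # sample positions (the impulse grid k)
samples = g(xs)                               # g * k: impulses scaled by the signal

def reconstruct(x, kernel="sinc", kernel_radius=4.0):
    """Evaluate the reconstruction sum  sum_i samples[i] * h((x - x_i) * rate)."""
    t = (x - xs) * rate                       # distance to each sample, in sample units
    if kernel == "sinc":                      # truncated ideal low-pass filter
        w = np.where(np.abs(t) <= kernel_radius, np.sinc(t), 0.0)
    else:                                     # linear (tent) filter
        w = np.clip(1.0 - np.abs(t), 0.0, None)
    return np.dot(w, samples)

for x in (1.3, 2.05, 3.7):
    print(x, g(x), reconstruct(x), reconstruct(x, kernel="linear"))
# The truncated sinc comes close to g(x); the tent filter attenuates the 3 cycles/unit
# component, a mild form of the "smoothing" artifact discussed later in this section.
```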

Figure 1.1: Sampling a two dimensional signal. Top: in the spatial domain and Bottom: in the frequency domain. (Figure reproduced from [ML94]).


Figure 1.3: Causes of aliasing. (a) Insufficient sampling frequency, or a signal that is not band-limited, causes pre-aliasing. (b) An imperfect reconstruction filter that is too wide allows parts of the alias spectra into the reconstruction, known as post-aliasing.

We have to use imperfect approximations of the sinc filter for reconstruction, which usually trade quality for efficiency. These are bound to introduce certain artifacts into the reconstructed signal, depending on their properties. Sometimes the signal is either not band-limited (see footnote 1), or the available sampling frequency is less than $f_N$. In these cases it is necessary to band-limit the signal to the frequency of the sampling grid. This can be done by convolving with a low-pass filter before sampling, and it gives rise to the typical signal processing pipeline shown in Figure 1.2. If the input signal is not sufficiently band-limited, then the alias spectra will overlap with the primary spectrum in the frequency domain, making it impossible to recover the original signal. This gives rise to what we call pre-aliasing (see Figure 1.3(a)). Post-aliasing (Figure 1.3(b)) occurs when the reconstruction filter is sufficiently wide to allow energy from neighboring alias spectra to "leak through" into the reconstruction. The result in both cases of aliasing is that frequency components of the original signal appear in the reconstructed signal at different frequencies (called aliases). Smoothing occurs when the signal is severely low-passed either at sampling or reconstruction. This results in band-limiting the spectrum much lower than the Nyquist limit, which means that the higher frequency components are attenuated and rapid variations are removed. Excessive smoothing in image processing results in blurry images, while in volumes it results in loss of fine density structure. Some degree of smoothing is always expected during reconstruction, since practical filters are imperfect and start to cut off before $f_N$. The Gibbs phenomenon occurs when a signal with step discontinuities is low-passed. The resulting signal will have oscillations, or ringing, just before and after the discontinuity. This can be avoided by making the reconstruction filter have a gradual cutoff, as demonstrated by Marschner and Lobb in [ML94]. Anisotropy occurs when the reconstruction filter is not spherically symmetric. It causes the amount of smoothing, post-aliasing and ringing to depend on the viewing angle. According to what we have discussed above, the perfect reconstruction filter is impossible to realize in practice.

1. A signal/function $g(x)$ is band-limited if its frequency spectrum $\hat{g}(\omega_x)$ is zero beyond a certain value $\Omega_x > 0$.
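To make the smoothing and ringing artifacts above tangible, the following NumPy sketch low-passes a step (edge) signal with two filters: a sharply truncated sinc, which produces Gibbs-style oscillations near the discontinuity, and a Gaussian with a gradual cutoff, which avoids ringing at the cost of extra smoothing. The filter widths are illustrative choices and not values taken from the text.

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 801)
dx = x[1] - x[0]
step = (x >= 0).astype(float)                    # signal with a step discontinuity

def lowpass(signal, kernel):
    kernel = kernel / kernel.sum()               # normalize so flat regions are preserved
    return np.convolve(signal, kernel, mode="same")

t = np.arange(-2.0, 2.0 + dx, dx)
sharp_cutoff = np.sinc(t / 0.25)                 # truncated sinc: abrupt frequency cutoff
gradual_cutoff = np.exp(-0.5 * (t / 0.25) ** 2)  # Gaussian: gradual cutoff

ringing = lowpass(step, sharp_cutoff)
smoothed = lowpass(step, gradual_cutoff)

print("overshoot with sharp cutoff  :", ringing.max() - 1.0)   # Gibbs overshoot (> 0)
print("overshoot with gradual cutoff:", smoothed.max() - 1.0)  # ~0, but the edge is blurrier
```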


Figure 1.2: Typical sampling/reconstruction pipeline for a one-dimensional signal. It can be applied to volume rendering by extension to three dimensions. (a) Original unbounded continuous input and its frequency response. (b) Low-pass pre-filter to be applied before sampling, in the spatial and frequency domains. (c) Resulting low-passed (bounded) signal. (d) Sampled (bounded) signal. (e) Low-pass filter needed for reconstruction. (f) Reconstructed bounded signal. (Figure reproduced from [FvDFH96])

Different filters used in volume visualization so far are associated with specific limitations and perform better in specific applications. Marschner and Lobb, in [ML94], have proposed numerical evaluation metrics that associate certain filter properties with perceived image quality. They also propose an analytical test function that determines the limits of proposed reconstruction filters, as far as perception and numerical error metrics are concerned. Möller et al. [MMK+98] propose a methodology for designing filters based on spatial smoothness and accuracy criteria. Their design criteria are based not only on error metrics for the reconstructed function, but also on its characterization as a function in the space $C^M$ of all M-times continuously differentiable functions.

1.3 Sampling grids

After the sampling stage, which in the case of medical data sets is usually performed by equipment such as MRI or CT scanners, the resulting data is stored according to the sampling grid that was used. Grid structures based on the Cartesian lattice are so far the de-facto standard for regular representations of volumetric data. Some applications, like computer simulations, partial differential equation solvers, etc., produce irregular grids. The usual approaches for handling these grids are to resample them onto a regular grid and then render, or to use object-order algorithms like splatting or Projected Tetrahedra (see later sections for further explanation). Recently, the application of multidimensional signal processing research [DM84], in reconstruction by Matej and Lewitt in [ML95], and in volume rendering by Theussl et al. in [TMG01], resulted in the introduction of the Body Centered Cubic (BCC) grid to the volume rendering community. For 3D volumes, the BCC grid can store the same signal content as its cartesian counterpart using about 30% fewer points. This translates to similar savings in performance, depending on the rendering algorithm. Further savings of up to 50% can be achieved when using the BCC grid for handling time-varying volume data sets, as demonstrated by Neophytou and Mueller in [NM02]. The BCC grid has also been successfully used to speed up the shear-warp algorithm [SM02]. According to multidimensional signal processing theory, a sampling grid that causes the closest possible packing of the alias spectra of the sampled signal in the frequency domain allows for the sparsest distribution of the sample points in the spatial domain. If the signal is spherically band-limited, then an efficient sampling grid can be constructed using lattice theory, to pack the spherical spectra in the frequency domain. The inverse Fourier transform of that grid can be used in the spatial domain to construct the sampling data structure. According to Conway and Sloane [CS93], the $D_n$ grids provide the closest lattice packings of hyper-spheres in two to five dimensions. The equivalent lattices for the spatial domain are the hexagonal grid for 2D, with 14% fewer sample points than the 2D cartesian grid, the BCC for 3D with about 30% fewer samples, and the 4D BCC for 4D with 50% fewer samples. A more complete discussion on BCC grids will follow in Chapter 2, where we show how these grids are used for efficient storage and rasterization of time-varying volumes.
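The BCC grid can be viewed as two interleaved Cartesian grids, the second offset by half the grid spacing in every dimension. The NumPy sketch below builds such a point set inside a unit cube and compares its size against a Cartesian grid of equivalent resolution. The spacing factor $\sqrt{2}$ is the value implied by the packing argument above; the routine names are our own and not from the cited works.

```python
import numpy as np

def cc_grid(spacing, extent=1.0):
    """Cartesian (CC) grid points inside a cube of side `extent`."""
    axis = np.arange(0.0, extent + 1e-9, spacing)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)

def bcc_grid(spacing, extent=1.0):
    """BCC grid: two interleaved CC grids, the second shifted by half the spacing."""
    primary = cc_grid(spacing, extent)
    secondary = primary + 0.5 * spacing
    secondary = secondary[np.all(secondary <= extent + 1e-9, axis=1)]
    return np.vstack([primary, secondary])

T = 0.01                             # CC spacing for some spherically band-limited signal
cc = cc_grid(T)
bcc = bcc_grid(np.sqrt(2.0) * T)     # equivalent BCC spacing implied by the packing argument
print(len(bcc) / len(cc))            # close to 1/sqrt(2) ~ 0.71, i.e. ~30% fewer samples
                                     # (slightly lower here due to finite-cube boundary effects)
```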

1.4 Illumination and shading

The next step after reconstruction is the projection of an image that captures the way the volume data generates, reflects, scatters and occludes light. Depending on the level of realism and desired complexity, there are several models which can bring out different features of the data. In volume visualization it is not important to shade as realistically as possible. Instead, the goal is to enhance the visual understanding of the data set by providing better spatial cues and structure information. The classification process assigns optical properties such as color and opacity as a function of the scalar values of the data elements. It can be applied to the interpolated scalar function f(X), which is defined for every point X in the volume. Alternatively, these properties can be computed only at the sampled volume grid points and then be interpolated for arbitrary points in the volume. The first approach, used in post-classified volume rendering pipelines, permits more rapid variations of the optical properties, even within a single volume element. The second one is used in pre-classified pipelines. It may suppress fine variations in the transfer functions, but tends to produce smooth images even when the data is noisy or gradients are poor around iso-surfaces. To compute an image, the effects of the optical properties must be integrated along each viewing ray originating from the screen. Piece-wise linear approximations include polyhedron compositing by Shirley and Tuchman [ST90] and Wilhelms and Van Gelder [WG91]. If a Riemann sum is used to approximate the integral, the result is Levoy's ray-casting [Lev88] or the plane-by-plane compositing method of Westover [Wes90] for splatting. The most common illumination model used is the Phong model [Pho75], which treats all light sources as point light sources and models three types of reflected light (ambient, diffuse and specular). The diffuse and specular terms are modelled as local components, and at each point to be illuminated they require a gradient to be computed, which will serve as the surface normal in the lighting calculations. The global illumination is modelled as a constant ambient term. A survey of optical models used for volume rendering is provided by Nelson Max in [Max95]. According to this study, the optical properties which affect the light passing through a medium are due to absorption, scattering or emission of light from small particles or individual molecules of the medium. The absorption only model assumes that all the particles absorb all the light they intercept and do not scatter nor emit any. The opacity (or extinction coefficient) defines the rate at which light is occluded. This model has been used for the rendering of iso-surfaces such as bone or other tissues for medical purposes. Surface shading models such as the Phong model can be used, which require the estimation of the gradients. The emission only model emulates hot soot particles in a flame. In the limit where the particle size approaches zero, while their emission reaches infinity, the particles only emit light. This is the case for a very hot tenuous gas which glows but is almost transparent. The absorption plus emission model emulates an actual cloud of particles which occlude incoming light as well as add their own glow. This model can be approximated using either a Riemann sum, which gives the familiar back-to-front compositing algorithm, or a particle integration model. More advanced models include multiple scattering of light and shadow casting models, which produce a more realistic illumination of the volume.
The relevant differential equations and the resulting integrals are described in detail in [Max95].
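As a minimal illustration of the absorption plus emission model, the sketch below approximates the volume rendering integral along a single ray with a Riemann sum, using the familiar back-to-front compositing recurrence and its equivalent front-to-back form. The per-sample colors and opacities stand in for values produced by a transfer function; all names and numbers are illustrative, not taken from the dissertation.

```python
import numpy as np

# Per-sample emission (color) and opacity along one viewing ray, ordered front-to-back;
# in a real renderer these come from a transfer function applied to interpolated densities.
colors = np.array([[0.9, 0.2, 0.1],
                   [0.8, 0.4, 0.2],
                   [0.1, 0.1, 0.9]])
alphas = np.array([0.2, 0.5, 0.8])

def composite_back_to_front(colors, alphas):
    """Riemann-sum approximation of the absorption-plus-emission integral."""
    accum = np.zeros(3)
    for c, a in zip(reversed(colors), reversed(alphas)):
        accum = a * c + (1.0 - a) * accum      # "over" operator, farthest sample first
    return accum

def composite_front_to_back(colors, alphas):
    """Equivalent recurrence when traversing front-to-back (allows early termination)."""
    accum, trans = np.zeros(3), 1.0            # accumulated color, remaining transparency
    for c, a in zip(colors, alphas):
        accum += trans * a * c
        trans *= (1.0 - a)
        if trans < 1e-3:                       # early ray termination threshold
            break
    return accum

print(composite_back_to_front(colors, alphas))
print(composite_front_to_back(colors, alphas))  # same result, up to the termination cutoff
```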

The common shading pipelines have been using, to date, the traditional RGB model to represent color. An interesting work by Bergner et al. [BMDF02] generalizes this model to use a spectral representation of color. A method for rendering spectral scenes independently from a light source is proposed, which enables re-illumination. The user can get different impressions of the data under different lights. In addition to providing improved realism, this method also allows the artificial manipulation of a scene. Metameric materials, which behave differently under various lighting conditions, can easily be forced to disappear from a scene when a user changes the lighting conditions and simply re-illuminates, without the need to re-render the volume.

1.5 Direct volume rendering algorithms

Algorithms for direct volume rendering apply all the steps described above to produce an image from the sampled data. Different approaches enhance specific characteristics of the volume data and have their own performance vs. quality trade-offs. An attempt to evaluate the most popular volume rendering algorithms has been made by Meissner et al. in [MHB+00], where the authors use some frequently used public data-sets to detect qualitative differences and compare rendering speed.

1.5.1 Ray-casting

Ray-casting for volume rendering was formalized by Levoy in [Lev88]. According to this first description, a data preparation phase is required, where a set of new arrays is created from the sampled data. After some corrections are applied to compensate for non-orthogonal sampling, the optical properties of all the volume elements are computed at each grid point. The result is an array of voxel colors $C_\lambda(x_i)$, $\lambda = r, g, b$, and an array of opacities $\alpha(x_i)$ for every voxel (sample point) $x_i$ in the volume. Each element is shaded using the Phong model [Pho75]. The gradient at the sample point is used instead of a surface normal in the lighting equations. Rays are then cast into the color and opacity arrays from the observer's eye point, and trilinear interpolation is used to acquire the array values at the ray sample points. The opacity of the image plane is integrated along each ray. According to what we have discussed previously in Section 1.4, this is a pre-classified pipeline. All the optical properties are first defined for the grid points, and then the actual ray points are interpolated from the shaded data. Alternative ray-casting implementations [AHH+94] use an improved post-shaded pipeline, which can give crisper images and allow for rapid color changes. This can be done by first interpolating the scalar values at the ray points, and then applying the color properties and shading computations at each ray point. Of course this is more computationally expensive for magnified images (where we have more ray points than voxels), but the result is higher image quality. Another complication is the need to either keep the values of adjacent rays, or to resample the neighboring ray points in order to calculate the gradients that are necessary for the shading computations at each ray point. The process may be sped up by using techniques like early ray termination [Lev88], octree decomposition [Lev90a] or adaptive sampling [Lev90b]. Early ray termination is used when the rays are traversed front-to-back, and basically ends the ray traversal after the accumulated opacity for that ray is above a certain threshold. Octree decomposition is a hierarchical spatial enumeration technique that permits fast traversal of empty space, which saves substantial traversal and interpolation time.
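The sketch below shows two of the numerical building blocks in the ray-casting description above: trilinear interpolation of the scalar volume at an arbitrary ray sample point, and a central-difference gradient used in place of a surface normal for a simple diffuse (Phong-style) shading term. The volume contents, sample position, and light direction are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.random((32, 32, 32))              # synthetic scalar volume (placeholder data)

def trilinear(vol, p):
    """Trilinearly interpolate the volume at continuous position p = (x, y, z)."""
    i = np.floor(p).astype(int)
    f = p - i                                  # fractional offsets inside the cell
    v = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((1 - f[0]) if dx == 0 else f[0]) * \
                    ((1 - f[1]) if dy == 0 else f[1]) * \
                    ((1 - f[2]) if dz == 0 else f[2])
                v += w * vol[i[0] + dx, i[1] + dy, i[2] + dz]
    return v

def central_difference_gradient(vol, p, h=1.0):
    """Gradient estimate, used as the surface normal in the lighting equations."""
    return np.array([trilinear(vol, p + np.array(d) * h) - trilinear(vol, p - np.array(d) * h)
                     for d in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]) / (2 * h)

p = np.array([10.3, 15.7, 8.2])                # one ray sample point
light = np.array([0.0, 0.0, 1.0])              # light direction (placeholder)
n = central_difference_gradient(volume, p)
n = n / (np.linalg.norm(n) + 1e-12)
diffuse = max(0.0, float(np.dot(n, light)))    # diffuse term of the Phong model
print(trilinear(volume, p), diffuse)
```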

Adaptive sampling tries to minimize work by taking advantage of the homogeneous parts of the volume: for each square in the image, one traverses the rays going out of the vertices of the bounding box and recursively re-partitions this square into smaller ones if the difference in the image pixel values is larger than a certain threshold. Later techniques to speed up ray-casting include object-order ray-casting [MJC02], which pre-calculates the rays for each non-empty volume cell and removes invisible volume segments. The skipping of empty space presents an opportunity to eliminate an enormous amount of unnecessary computations in ray casting. Yagel and Shi [YS93] proposed a technique for space leaping that uses a coordinate buffer, which takes advantage of inter-frame coherence during animated viewing. This buffer stores the coordinates of the closest non-empty voxel for each voxel, and it is transformed accordingly whenever the view point changes. Later improvements to this method include Wan et al. [WSK02], who employ a fast voxel-cell based re-projection scheme and replace the coordinate buffer with a pre-computed distance field that stores the Euclidean distance from each empty (background) voxel toward its nearest object boundary. Another approach is explored by Parker et al. [PSL+98], who use bricks to improve memory access coherence and a multi-level spatial hierarchy to accelerate the traversal of empty cells. Object-order methods such as splatting offer implicit space leaping.

1.5.2 Shear-warp

Although conceived as early as 1994, Shear-Warp as proposed by Lacroute and Levoy [LL94] still remains the world's fastest purely software-based volume rendering algorithm. The basic algorithm uses the fact that the viewing transform can be factored into a series of 2D shears and a final 2D warp. The volume is first sheared slice-by-slice, so that the ray direction is parallel to one of the major axes in sheared space, where the ray-casting is performed. Warping the intermediate image then gives the final result (see Figure 1.4). The RLE encoding scheme allows for efficient skipping of empty and fully opaque voxels during the traversal, and the alignment of the rays to an axis in sheared space allows for pre-calculation of the interpolation weights. The combination of these three advantages allows shear-warp to render $128^3$ volumes at rates exceeding 10 frames/sec on modern PC workstations without using any hardware acceleration. The speed, however, comes at the cost of reduced image quality and increased memory consumption. In a recent work, Sweeney and Mueller [SM02] identified some of the shortcomings of the shear-warp algorithm and proposed some practical ways to address them. They also proposed an adaptation of the algorithm to take advantage of the BCC sampling grids, which achieved the expected rendering time reduction of about 30%. One of the side-effects of shearing is that the ray sampling distance varies from 1 to $\sqrt{3}$, which leads to aliasing and stair-casing effects for viewing angles near 45°. This is compensated with the insertion of an interpolated intermediate slice on-the-fly during rendering. Another issue is that zooming can only occur at the warping stage, which leads to excessive blurring for magnification factors greater than 2. This is addressed by what the authors called matching sampling, which basically means that the size of the intermediate (base plane) image is increased to match the final image size. This of course runs much slower, but still takes advantage of the optimized ray traversal.
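A tiny numerical illustration of the shear factorization: for a principal viewing axis of z, shearing each slice by an offset proportional to its z coordinate makes every viewing ray parallel to the z axis in sheared space. The formulas below are the standard shear coefficients for an orthographic view direction; the variable names are ours.

```python
import numpy as np

view_dir = np.array([0.3, -0.2, 1.0])           # viewing direction, principal axis = z
sx, sy = -view_dir[0] / view_dir[2], -view_dir[1] / view_dir[2]   # shear coefficients

shear = np.array([[1.0, 0.0, sx],
                  [0.0, 1.0, sy],
                  [0.0, 0.0, 1.0]])

# In sheared object space the view direction becomes parallel to the z axis,
# so every ray visits exactly one sample per slice.
print(shear @ view_dir)                          # -> [0, 0, 1] (up to floating point)
```

The remaining part of the viewing transform, a 2D warp, is then applied to the composited base-plane image to produce the final image.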
The final enhancement proposed in this work was the modification of the rendering pipeline from pre-shaded to post-shaded, in order to further reduce some of the blurring in the images. This requires the creation of another volume to store the normals (gradients), and the replacement of


Figure 1.4: Shear-Warp algorithm. The volume is first transformed to sheared space so that the rays are aligned to one of the major axes. After the base plane is rendered, it is then warped to the final image.

three color interpolations with a shading calculation, which takes about the same in terms of rendering time.

1.5.3 Splatting

Splatting was proposed by Westover in [Wes90, Wes91]. It represents the volume as an array of overlapping basis functions, usually Gaussian kernels with the amplitudes scaled by the voxel values. An image is generated by projecting these basis functions to the screen. The projection of the radially symmetric basis functions can be achieved by rasterizing a pre-computed footprint lookup table. Each footprint table entry stores the analytically integrated kernel function along a traversing ray. In other words, the footprint table stores a 2D projection of the integrated 3D spherically symmetric kernel, and can be used for all the basis functions, needing only to be scaled by the value of the corresponding voxel. The major advantage of splatting is that only the voxels relevant to the desired image need to be projected and rasterized [MSHC99]. This significantly reduces both volume data storage and processing. The traditional splatting approach [Wes90] summed the voxel kernels within volume slices most parallel to the image plane. Although this solves the problem of bleeding (occluded voxels still contributing their color to the final image), it introduces popping artifacts or severe brightness variations in animated viewing. Mueller et al. in [MC98] address this problem by processing the voxel kernels within slabs that are parallel to the image plane. Image-aligned sheet-buffered splatting requires the slicing of the voxel kernels along the viewing direction. This is better illustrated in Figure 1.5. Instead of splatting the interpolation kernels as a whole, this approach slices the kernels by a series of cutting planes that are aligned parallel to the image plane. The kernel sections that fall within two cutting planes, or a thin slab, are summed into a sheet buffer. Consecutive buffers are composited front-to-back. Pre-integrated kernel sections are used for fast rasterization. This approach is equivalent to ray-casting in the sense that it mimics a set of parallel simultaneous rays that resample the volume into a set of parallel sheet images, spaced apart by $\Delta s$, which are then composited in front-to-back order. Westover's proposal of splatting uses a pre-shaded pipeline. The voxels are first classified and shaded, and then each shaded voxel is projected to the screen as a fuzzy ball.
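To make the footprint-table idea concrete, the sketch below numerically pre-integrates a truncated 3D Gaussian kernel along one axis to obtain its 2D footprint, which can then be rasterized (scaled by the voxel value) for every splat. The kernel radius, standard deviation, and table resolution are illustrative choices, and the splat placement ignores sub-pixel offsets and screen-space scaling.

```python
import numpy as np

sigma, radius, res = 1.0, 2.0, 64               # Gaussian kernel parameters and table size

# 2D footprint F(x, y) = integral over z of the truncated 3D Gaussian kernel.
# (For an untruncated Gaussian this reduces to a scaled 2D Gaussian, but the
# numerical integration below works for any radially symmetric, truncated kernel.)
xs = np.linspace(-radius, radius, res)
zs = np.linspace(-radius, radius, 257)
X, Y = np.meshgrid(xs, xs, indexing="ij")
footprint = np.zeros((res, res))
for z in zs:
    r2 = X**2 + Y**2 + z**2
    footprint += np.where(r2 <= radius**2, np.exp(-r2 / (2 * sigma**2)), 0.0)
footprint *= (zs[1] - zs[0])                    # complete the Riemann sum along z

# Splatting a voxel of value v at image position (px, py) amounts to adding
# v * footprint into the accumulation buffer at that location.
v, image = 0.7, np.zeros((256, 256))
px, py = 100, 120
image[px:px + res, py:py + res] += v * footprint
print(footprint.max(), image.sum())
```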

The gradients are calculated at each grid point using a central difference filter and, using the color assignment from the transfer function, each voxel is shaded once in a pre-processing step. After that, the same shaded volume can be used to interpolate new views, using the interpolation filter $h$. Although this pipeline is very efficient, it usually produces an image that has smooth surfaces but blurred boundaries (see Figure 1.6). Blurring is more pronounced in splatting than in ray-casting because of the reconstruction filter that is used. The radially symmetric Gaussian filter smooths more than the trilinear filter usually used in ray-casting, sometimes resulting in loss of crucial object detail. To address the fuzzy edge problem, Huang et al. [HCS98] proposed edge splats, which are specialized kernels with rapid interpolation function decay, to be used at object boundaries. This method places edge splats at iso-surfaces and regular splats in homogeneous volume regions. It requires, however, the detection of directional edges in volume space. This is done as a preprocessing step, where all voxels that are close to an iso-surface are augmented with values for edge direction, edge distance and gradient strength. These parameters are then used by the splatting algorithm at runtime to select the edge splat with the right decay. Although this algorithm is quite effective for the iso-surface edges, it has certain drawbacks: (i) micro-edges within the iso-range are not handled easily, (ii) the method has problems with discontinuities in the edge profile, e.g., sharp corners such as cube edges, where the edge direction is ambiguous and its perception is view-dependent, (iii) the edge splats are equivalent to applying a local high-frequency filter to the volume, causing

Figure 1.5: Image-aligned sheet-buffered splatting (from [MC98]). (a) All kernel sections that fall within the current slicing slab, formed by a pair of image-aligned planes spaced apart by the sampling interval Δs, are added to the current sheet buffer. The sheet buffers are composited in front-to-back order. (b) Array of pre-integrated overlapping kernel sections (shaded areas). The integration width of the pre-integrated sections is determined by the slab width Δs, while the z-resolution is determined by the number of kernel sections.

Figure 1.6: Pre-shaded versus post-shaded pipeline. (a) Rendering of a cube using a pre-shaded pipeline, (b) using a post-shaded pipeline. (c) Illustration of thresholding iso-edges in a post-shaded pipeline: (i) the original step function (the edge), (ii) the blurred edge after convolving it with the Gaussian filter h, (iii) the crisp edge after thresholding the blurred edge with the iso-value.

In a different approach to eliminating the blurring, an alternative post-shaded pipeline was suggested by Mueller et al. in [MMC99]. In this pipeline, the grid voxel densities are first interpolated and the interpolation result is then classified and shaded. The gradient calculation is now done using the interpolated densities at the intermediate sheet buffers. This is equivalent to calculating the gradients at the grid points and then interpolating them, as shown in [MMMY97]. In Figure 1.6c(ii) we can see that the interpolation filter h will again have a low-pass effect and blur the interpolated volume. However, this time the classification step has not been applied yet. Some of the blurring will be undone by clipping off the blurred image regions at the iso-thresholds of the applied transfer function. This results in the silhouette edge of Figure 1.6c(iii), which is considerably crisper and is located close to the true edge. A rendering result for a sharp iso-surface is also illustrated in Figure 1.6b, for both pre-shaded and post-shaded splatting pipelines.

So far we have seen the splatting method applied to orthographic projections. This assumes a constant sampling rate throughout the volume, and only requires a constant pre-integrated footprint splat regardless of viewing direction. However, if we attempt to use the same approach for perspective projection, then aliasing artifacts will occur towards the volume elements farther from the eye, since the available sampling frequency at those parts is less than the Nyquist frequency. An anti-aliased technique for splatting is proposed by Mueller et al. in [SMM+97, MMS+98]. Their approach is basically to low-pass the farther parts of the volume, which appear at a higher frequency due to the perspective distortion. To achieve this, a threshold is first calculated for the z distance at which the sampling rate drops below the volume's Nyquist rate; kernels beyond this distance are then progressively low-passed.

1.5.4 Hardware-accelerated volume rendering

An object-order approach that approximates ray-casting is the slicing of the volume into polygons, parallel to the image plane, at a regular sampling distance along the viewing direction, and the subsequent compositing of these polygons in back-to-front order to obtain the final image. Using 3D texture-mapping, this means that the slicing polygons are assigned texture coordinates into a 3D solid texture and are texture-mapped onto polygons orthogonal to the viewing direction, which are then blended into the frame buffer. This exact technique was implemented by Cabral, Cam, and Foran [CCF94]. They achieved a 10Hz frame rate on a graphics workstation equipped with SGI's RealityEngine. Wilson, Van Gelder and Wilhelms [WVW94] independently proposed the same method, and later Van Gelder and Kim [GK96] improved on this method by incorporating directional light, but they needed to recompute the shaded color volume whenever the directional light or the viewing angle changed. Westermann and Ertl [WE98] later enhanced the same technique using OpenGL extensions that had recently become supported in hardware. They introduce clipping geometries by means of stencil buffer operations, and render lighted and shaded iso-surfaces using a multi-pass approach on both cartesian and tetrahedral grids. They also demonstrate the utilization of hardware for rendering unstructured grids.

Advanced texture-mapping capabilities (2D and 3D) have long been taken for granted on expensive graphics workstations but have been available on very few specialized PC graphics boards. Only lately, thanks to the ever-evolving and ever-expanding PC games industry, has the gap been bridged between exotic SGI workstations and cheap, accessible commodity PC workstations. In some cases, PC-gaming graphics boards (NVidia GeForce series and ATI Rage series) are even more programmable and powerful than high-end workstations, and to date they offer 3D texture-mapping, or equivalent trilinear interpolation. An early hybrid approach to utilize PC hardware for direct volume rendering was proposed by Brady et al. in [BJNN98]. Their two-phase perspective ray-casting algorithm first casts a set of ray segments, computing a composite color and transparency for each segment, and then uses the segments to construct approximations of the full rays in the second phase. The segments are accumulated into textured polygons, which are then composited using 2D texture-mapping hardware to produce the final image. This achieves a good balance of CPU versus graphics board utilization, and is easily implemented on commodity PC hardware. The segments can then be reused for viewpoints near the original position, thus reducing the amount of required ray-casting during animated viewing. The segments are constructed in a way that allows some freedom of movement, including translation and rotation of the viewpoint, and the method is most suitable for navigation within big data sets, where only a small fraction of the volume is visualized at each frame. The navigation achieved frame rates of more than 10Hz on moderate PC workstations (PII-300MHz, and quad PPro 200MHz).

A cheap approximation of the 3D-texture assisted volume rendering method can be implemented on PC workstations with 2D texturing capabilities. It decomposes the volume into a set of object-aligned slices, thus reducing the need for trilinear interpolation to bilinear interpolation, which is well optimized on these workstations.
However, when this approach is used at higher magnifications it exhibits very strong visual artifacts. The closest software equivalent to this object-order method is the shear-warp factorization [LL94], which, as we have seen in the previous section, also presents some aliasing artifacts at

viewing angles that maximize the distance between ray samples. Rezk-Salama et al. [RSEB+00] proposed an interesting way to regain the trilinear interpolation by using multi-textures. Multi-textures were first introduced along with register combiners¹ in the NVidia GeForce graphics boards, and they allow the result of a blending operation to be directed into another texture, which can then be used to rasterize a textured polygon. Using this feature, a linear interpolation between two consecutive slices S_i and S_{i+1}, which should give S_{i+α}, can be defined as a blending operation of the form S_{i+α} = (1 - α)·S_i + α·S_{i+1}. The output is then redirected to become an input texture for the rasterization of the desired slice; hence the resulting slice is equivalent to a slice trilinearly interpolated from a 3D texture. Because of the multi-texturing feature, this can be rendered at almost the same rate as the original textures, giving this implementation frame rates even higher than an Onyx2 Base Reality workstation. Of course, the rendering speed is compromised once the data set is too big to fit in the graphics board's memory. The requirement to approximate image-aligned slices makes memory consumption even higher, since three sets of axis-aligned slices (one for each axis) have to be stored for a volume.

A different approach to correcting the artifacts from the lack of trilinear interpolation was suggested by Engel, Kraus and Ertl [EKE01]. Texture-based pre-integrated volume rendering is based on the pre-computation of all possible ray segments² in a preprocessing step, which are then looked up during rendering from a dependent texture. The first step calculates a table with all the possible integrations of ray segments with front value S_f and back value S_b for a given transfer function TF, which assigns color and opacity values to all the possible densities of S_f and S_b. The results are stored in a texture, which can later be indexed by (S_f, S_b) to recover the precalculated RGBA value. This lookup is enabled by the programmable pixel shaders (which allow the replacement of blending operations by coordinate lookup operations) and by dependent textures, which allow the result of a texturing operation to be used as a coordinate index into another texture and use that for the final rasterization. This algorithm facilitates the visualization of low-resolution volume data with high-frequency transfer functions.

In an effort to implement commercially feasible volume rendering systems, which do not require high-end expensive workstations, but also do not compromise quality because of the existing game-hardware limitations, dedicated hardware systems were proposed, and some of them were actually built and became commercially available. These include VERVE [Kni93], VIZARD [MKS98, KS97, KWHM02] and VolumePro [PHK+99]. The Voxel Engine for Real-time Visualization (VERVE) implements direct volumetric ray-casting with the use of a special 8-way interleaved memory. It achieved 2.5 fps; however, several modules can be combined to yield higher performance. VIZARD is a redesign of VERVE as a PCI coprocessor board achieving a performance of about 10 fps. It uses lossy data compression and it requires a preprocessing step to calculate distances and gradients. VolumePro is the commercial implementation of EM-Cube [OPL+97], the final version of the Cube family of architectures at the Center for Visual Computing in Stony Brook. It is a ray-casting system that uses memory skewing over its local SDRAM modules for maximum reading throughput, and it achieves a real-time frame rate of 30 fps for data sets of 256³.
1. Pixel shaders with limited programmability, which allow the programmer to define a set of blending operations and redirect texture operation output.
2. Ray segments are the partial rays between consecutive slices in a volume that is decomposed into a set of 2D slices.

Hardware (GPU) accelerated splatting is specifically addressed in [XC04], where several hardware-accelerated splatting algorithms are compared, including an efficient point-convolution method for X-ray projections. Their system achieves very high throughput rates using previous-generation hardware. Most relevant to our proposed work of Chapter 4 is the hardware-accelerated EWA splatting approach proposed in [CRZP04]. The authors achieve high speedups using retained-mode splatting, keeping all the volume data on the GPU, but in contrast to our work, they also use an axis-aligned buffer approach, which produces popping artifacts and also does not allow for high-quality post-shaded rendering [Wes90]. Further discussion of GPU-accelerated splatting will follow in Chapter 4, where we revisit the image-aligned splatting algorithm and adjust it to work around some of the limitations of the GPU architecture, while also taking advantage of the latest features available on latest-generation graphics boards to achieve high-quality images at interactive rates.

1.6 Time-varying volume visualization

A growing number of volume rendering applications exist that involve four dimensions and higher, for example, MRI and 3D Ultrasound motion studies in cardiology, time-varying data-sets in computational fluid dynamics, 4D shapes in solid models [Ros89], and n-manifolds in mathematics [HH92a, HC93, HH92b]. Currently, volume rendering follows along six broad paradigms: ray-casting [Lev88], splatting [Wes90], shear-warp [LL94], cell-projection [MHC90, ST90], texture-mapping hardware-assisted [CCF94, EKE01, RSEB+00], and custom hardware-accelerated [MKS98, PHK+99]. Some of these have been extended into 4D and higher. In order to classify these extended algorithms, one may distinguish them by their intended purpose. While some algorithms were specifically developed for time-varying data-sets and typically exploit time-coherency for compression and acceleration [AAW00, GS01, LMC01, SCM99, SJ94, SH99, Wes95], other methods have been designed for general n-D viewing [BPRS98, BWC00, HH92a, HH92b, KP90, WB98, HC93] and require a more universal data decomposition.

In n-D viewing, the direct projection from n-D to 2D (for n>3) is challenging. One major issue is that there are an infinite number of orderings to determine occlusion (for n=3 there are just two, the view from the front and the view from the back). In order to simplify the user interface and to reduce the amount of occlusion exploration a user has to do, Bajaj et al. [BPRS98] perform the n-D volume renderings as an X-ray projection, where ordering is irrelevant. The authors demonstrate that, despite the lack of depth cues, much useful topological information of the n-D space can be revealed in this way. On the other end of the spectrum are algorithms [BWC00] (and the earlier [WB98]) that first calculate an n-D hyper-surface (a tetrahedral grid in 4D) for a specific iso-value, which can then be interactively sliced along any arbitrary hyperplane to generate an opaque 3D polygonal surface for hardware-accelerated view-dependent display. This approach is quite attractive as long as the iso-value is kept constant. However, if the iso-value is modified, a new iso-tetrahedralization must be generated, which can take on the order of tens of minutes [BWC00].

Since 4D data-sets can become quite large, a variety of methods to compress 4D volumes have been proposed in recent years. Researchers have used wavelets [GS01], DCT-encoding [LMC01], RLE-encoding [AAW00], and images [SCM99, SJ94]. All are lossy to a certain degree, depending on a set tolerance. An alternative compression strategy is the use of more efficient sampling grids. A good candidate is the so-called Body-Centered Cubic (BCC) grid, which was recently employed for 3D volume rendering in [TMG01]. The BCC lattice can save almost 30% of the samples, without loss of signal fidelity, under certain conditions. BCC grids are particularly attractive for point-based volume rendering methods, such as splatting, since there the rendering time is directly proportional to the number of points. But, on the other hand, the BCC grid has also been useful in speeding up the shear-warp algorithm [SM02] and computed tomography [ML95].

Early work on 4D rendering includes a paper by Ke and Panduranga [KP90], who used the hyper-slice approach to provide views onto the on-the-fly computed 4D Mandelbrot set. Another early work is a paper by Rossignac [Ros89], who gave a more theoretical treatment of the options available for the rendering of 4D hyper-solids generated, for example, by time-animated or colliding 3D solids. Hanson and co-authors [HH92a, HC93, HH92b] wrote a series of papers that use 4D lighting in conjunction with a technique that augments 4D objects with renderable primitives to enable direct 4D renderings. The images they provide in [HH92b] are somewhat reminiscent of objects rendered with motion blur. Bajaj et al. [BPRS98] use octree-based splatting, accelerated by texture-mapping hardware, to produce 2D X-ray views of n-D objects. They also present a scalable interactive user interface that allows the user to change the viewpoint into n-D space by stretching and rotating a system of n axis vectors. The 4D iso-surface algorithms proposed by Weigle and Banks [WB98] and by Bhaniramka, Wenger, and Crawfis [BWC00] both use a Marching Cubes-type approach and generalize it into n-D.

Methods that focus more on the rendering of the time-variant aspects of 3D data-sets have stressed the issue of compression and time-coherence to facilitate interactive rendering speeds. Shen and Johnson [SJ94] use difference encoding of time-variant volumes to reduce storage and rendering time. Westermann [Wes95] uses a wavelet decomposition to generate a multi-scale representation of the volume. Shen, Chiang, and Ma [SCM99] propose the Time-Space Partitioning (TSP) tree, which allows the renderer to cache and re-use partial (node) images of volume portions that are static over a time interval. It also enables the renderer to use data from sub-volumes at different spatial and temporal resolutions. Anagnostou et al. [AAW00] extend the RLE data encoding of the shear-warp algorithm [LL94] into 4D, inserting a new run block into the data structure whenever a change is detected over time. They then composite the rendered run block with partial rays of temporally-unchanged volume portions. Sutton and Hansen [SH99] expand the Branch-On-Need Octree (BONO) approach of Wilhelms and Van Gelder [WVW94] to time-variant data to enable fast out-of-core rendering of time-variant isosurfaces. Lum, Ma, and Clyne [LMC01] recently advocated an algorithm that DCT-compresses time-runs of voxels into single scalars that are stored in a texture map. These texture maps, one per volume slice, are loaded onto a texture-map accelerated graphics board.
Then, during time-animated rendering, the texture maps are indexed by a time-varying color palette that relates the scalars in the texture map to the current color of the voxel they represent. Although the DCT affords only a lossy compression, their rendering results are quite good and can be produced interactively.

Another compression-based algorithm was proposed by Guthe and Straßer [GS01], who use a lossy MPEG-like approach to encode the time-variant data. The data are then decompressed on-the-fly for display with texture-mapping hardware.

In Chapter 2 we present our approach to 4D volume rendering. The research most related to our 4D work is the 3D approach by Theußl, Möller, and Gröller [TMG01], who introduced BCC grids to the visualization community. Rendering was achieved by way of a pre-classifying splat renderer with per-splat compositing. Pre-shaded splatting requires an estimate of the gradient at each voxel position, and they use central differences for this purpose. This, however, is complicated by the fact that a voxel's nearest BCC grid neighbor along an axis is 2 away, which is a greater distance than in the equivalent Cubic Cartesian (CC) grid. To use a closer distance, they compare this gradient estimation method with one that interpolates a point on each cell face and uses interpolated point pairs on opposing cell faces to estimate the gradient. These points are now closer together, i.e., 2/√2 apart, than on the equivalent CC grid. Nevertheless, there are only slightly noticeable differences in image quality for the three schemes. Our splatting algorithm (described in [MSHC99]), in contrast, avoids these issues altogether since it defers classification, gradient estimation, and shading until the volume has been sliced and interpolated into image-aligned sheets. Hence the gradient estimation scheme is identical for both BCC and CC grid rendering. Our image-aligned sheet-buffered approach also provides images with less blur and fewer popping and sparkling artifacts, however, at the expense of a higher rendering time [MC98, MMC99].

1.7 Irregular volume rendering

By far the most popular method used for the rendering of irregular grids is the Projected Tetrahedra (PT) algorithm developed by Shirley and Tuchman in 1991 [ST90]. A principal assumption in all PT-related algorithms is that the visual parameters vary linearly along the rays traversing a tetrahedral cell. More sophistication in this regard was brought on by the concept of pre-integrated volume rendering [EKE01], which uses the densities interpolated on the faces at each end of the cell to look up the pre-computed volume rendering integral from a 2D table, still assuming that the density function varies linearly across the cell. Although algorithms have been devised to locate and shade one or more isosurfaces traversing the cells, there is otherwise the assumption of a density-emitter model, that is, no shading occurs along the rays. This, in part, is due to the variable sample interpolation interval across each individual cell, and among the cells overall, and it leads to X-ray-like images, apart from the isosurfaces themselves.

The Projected Tetrahedra algorithm has inspired a number of GPU implementations [RKE00, WKE02, WMKE04, WE98, WMFC02]. An issue in that regard is the sorting of the cells in visibility order [MHC90, Wil92], which is needed for proper compositing and iso-surface occlusions. On the GPU, researchers have either used image-based fragment ordering for this [WMKE04] or a CPU-GPU balanced approach [CC05]. A common element of these algorithms is the accelerated rasterization of the triangles of the decomposed cells, interpolating densities, colors, and opacities across these faces, and optionally looking up the pre-integrated volume rendering integrals stored in a texture [RKE00].
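The pre-integration idea referenced above amounts to pre-computing, for every pair of densities entering and leaving a ray segment, the volume rendering integral under the linear-variation assumption and storing it in a 2D table. The following C++ sketch shows one simple way such a table could be built numerically (fixed segment length, emission-absorption model); it illustrates the principle and is not the exact formulation used in [EKE01] or the PT implementations cited here.

#include <cmath>
#include <vector>

struct RGBA { float r, g, b, a; };

// transfer(s) maps a normalized density s in [0,1] to a color and an extinction value.
std::vector<RGBA> buildPreintegrationTable(int N, float segmentLength,
                                           RGBA (*transfer)(float))
{
    std::vector<RGBA> table(N * N);
    const int steps = 64;                                  // numerical integration steps
    for (int f = 0; f < N; ++f)
        for (int b = 0; b < N; ++b) {
            float sf = f / float(N - 1), sb = b / float(N - 1);
            RGBA  out = {0, 0, 0, 0};
            float transparency = 1.0f;
            for (int k = 0; k < steps; ++k) {
                float s  = sf + (sb - sf) * (k + 0.5f) / steps;      // linear density profile
                RGBA  tf = transfer(s);
                float alpha = 1.0f - std::exp(-tf.a * segmentLength / steps);
                out.r += transparency * alpha * tf.r;
                out.g += transparency * alpha * tf.g;
                out.b += transparency * alpha * tf.b;
                transparency *= (1.0f - alpha);
            }
            out.a = 1.0f - transparency;
            table[f * N + b] = out;        // at run time indexed by the pair (sf, sb)
        }
    return table;
}

At run time the renderer only interpolates the two face densities of a segment and performs a single table lookup, which is why the technique copes well with high-frequency transfer functions.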
In the PT algorithms a cell, made up of several vertices, is considered the rendering primitive, and it is assumed that the fields encoded at the vertices vary linearly within the

cell. An alternative view is to consider the cell vertices themselves as the rendering primitives, and to model them as Gaussian splats or point sprites. The assembly of such kernels then creates a field in 3D space that is no longer cell-wise linear, but varies in a continuous fashion, afforded by the composite of the basis kernels. The various algorithms using the splatting paradigm for the rendering of irregular grids and scattered data vary in the isotropy of the kernels. Meredith and Ma [JM01] use spherical kernels that fit into a cube and are mapped to texture-mapped squares for projection. Jang et al. [JRSF02] fill a cell with one or more ellipsoidal kernels, which they render with elliptical splats. A similar approach was also taken by Mao [MHK95]. Hopf et al. apply splatting to very large time-varying data-sets and render the data as anti-aliased GL points. Common to all these approaches is that shading precedes splat projection, because the overlap of the kernels in volume space makes it difficult to interpolate the local information, such as gradient and density, needed for the shading computation. Doing so would be equivalent to a local reconstruction of the field function.

Such an approach was recently taken by Jang et al. [JWH+04], who use a GPU fragment program to evaluate the Gaussian function of all kernels that overlap at a given point. They perform these operations along a set of parallel slices, which then allows them to calculate the gradients needed for shading within and across adjacent slices. This post-shaded approach leads to imagery with much better lighting effects, but it suffers from the substantial per-fragment calculations needed to compute the distance of the sampling point to the kernel center and then using this and the kernel radius to evaluate the exponential kernel function (although a dependent texture could be used for this). Since they run an optimization algorithm to encode a volume into a minimal set of spherical kernels (radial basis functions or RBFs), the number of kernels they must render is not very large (not more than 1000), and therefore they can obtain interactive frame rates of up to 4 fps. However, by performing the local function reconstruction via analytic means in the fragment shaders, their approach does not exploit the much faster hard-wired interpolation facilities that exist on GPUs, and which have been exploited by even early splatting approaches run on SGI hardware [CM93]. A result equivalent (neglecting discretization errors) to slice- and pixel-wise kernel evaluation is to rasterize kernel sections, encoded as high-resolution floating-point textures. By using the new floating-point blending capabilities of the latest generation of graphics cards, all necessary operations can be executed within the hard-wired parallel rasterization and pixel processing units on these cards. In Chapter 5 we will see that by using these facilities a significantly higher point rendering rate can be obtained, while still maintaining the high visual quality enabled by slice-based post-shaded rendering.
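For concreteness, the slice-wise kernel evaluation described above amounts to reconstructing the field on each image-aligned slice as a sum of Gaussian basis functions, as in the following sketch. The isotropic-kernel assumption, the 3σ cutoff, and all names are ours; gradients for shading would then be taken by differencing neighboring samples within and across adjacent slices.

#include <algorithm>
#include <cmath>
#include <vector>

struct RBF { float cx, cy, cz; float sigma; float weight; };

// Evaluate f(p) = sum_i w_i * exp(-|p - c_i|^2 / (2*sigma_i^2)) on one slice at z = sliceZ.
void evaluateSlice(const std::vector<RBF>& kernels,
                   float sliceZ, float x0, float y0, float dx,
                   std::vector<float>& slice, int W, int H)
{
    std::fill(slice.begin(), slice.end(), 0.0f);
    for (const RBF& k : kernels) {
        float dz = sliceZ - k.cz;
        float cutoff = 3.0f * k.sigma;               // ignore negligible contributions
        if (std::fabs(dz) > cutoff) continue;
        for (int j = 0; j < H; ++j)
            for (int i = 0; i < W; ++i) {
                float dxp = x0 + i * dx - k.cx;
                float dyp = y0 + j * dx - k.cy;
                float r2  = dxp * dxp + dyp * dyp + dz * dz;
                if (r2 > cutoff * cutoff) continue;
                slice[j * W + i] += k.weight *
                    std::exp(-r2 / (2.0f * k.sigma * k.sigma));
            }
    }
}

Rasterizing pre-integrated kernel-section textures, as advocated in this dissertation, replaces the per-sample exponential of the inner loop with footprint rasterization that the GPU performs in hard-wired form.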


Chapter 2

4D Splatting of Time-Varying Volumetric Data on Efficient Grids

2.1 Introduction

After the implementation of an integrated volume visualization environment that utilizes the image-aligned sheet-buffered splatting algorithm, we have extended our 3D algorithm to directly render 4D data-sets. 4D volumetric data-sets, such as time-varying data-sets, usually come on 4D Cartesian Cubic (CC) grids. In this work, we explore the use of 4D Body Centered Cubic (BCC) grids to provide a more efficient sampling lattice. We use this lattice in conjunction with a point-based renderer that further reduces the data into an RLE-encoded list of relevant points. We achieve compression ranging from 50% to 80% in our experiments. The proposed 4D visualization approach follows the hyper-slice paradigm: the user first specifies a 4D slice to extract a 3D volume, which is then viewed using a regular point-based full volume renderer. The slicing of a 4D BCC volume yields a 3D BCC volume, which theoretically has 70% of the data points of an equivalent CC volume. We reach compressions close to this in practice. The visual quality of the rendered BCC volume is virtually identical to that obtained from the equivalent CC volume, at 70%-80% of the CC grid rendering time. Finally, we also describe a 3½-D visualization approach that uses motion blur to indicate the transition of objects along the dimension orthogonal to the extracted hyper-slice in one still image. Our approach uses interleaved rendering of a motion volume and the current iso-surface volume to add the motion blurring effect with proper occlusion and depth relationships.

\[
U = \frac{1}{\sqrt{2}\,T}\, M \qquad (2)
\]

To find the sampling matrix V in the spatial domain that gives rise to the lattice generated by U in the frequency domain, one can perform a Fourier transform and finds the following relationship: U^T V = I, where I is the identity matrix [DM90]. V is then given by V = (U^T)^{-1}. This results in the dual of the Dn lattice, termed the Dn* lattice [CS98]. The resulting V is not very intuitive, but we can obtain a better one by a suitable rotation of V. We use the one given by Conway and Sloane:

\[
M^{*} =
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0\\
0 & 1 & 0 & \cdots & 0 & 0\\
0 & 0 & 1 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 1 & 0\\
\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} & \cdots & \tfrac{1}{2} & \tfrac{1}{2}
\end{bmatrix} \qquad (3)
\]

Including the scaling factor of (2) we get:

\[
V = \sqrt{2}\,T\,M^{*} \qquad (4)
\]

For the 3D case, this yields the sampling pattern for the BCC lattice shown in Figure 2.2c. It consists of two interleaved CC lattices, each with spacing √2·T and offset by T/√2 in all three axis directions. Thus the BCC lattice has 1/2 of the CC lattice samples (shown in Figure 2.2a) within each slice, but √2 more slices are required. This amounts to an overall savings of samples of (1 - √2/2)·100% = 29.3%, which one may translate into an equivalent reduction in storage and rendering time.

In 4D we also get an interleaved embedding of two CC grids, but this time the interleaving is controlled by the fourth dimension. Offset and grid spacing are again T/√2 and √2·T, respectively. This is illustrated in Figure 2.2d, where we have untangled the grid along the fourth dimension and arranged the 3D sub-grids side by side, with the fourth dimension axis aligned with the axis of the first dimension. In general, the lattice is always constructed by taking the (n-1)-dimensional base lattice and offsetting and interleaving it for the n-th dimension.

The compression of the D4* lattice can be informally calculated as follows. Within each time step we now have only (1/√2)³ = 1/(2√2) of the CC samples, but the sampling distance along the fourth dimension is 1/√2 smaller, which is equivalent to having √2 more samples along the fourth axis. This means that the total fraction of samples is √2 · 1/(2√2) = 1/2. Thus, by sampling a time-varying data-set (or any other 4D data-set) into the D4* lattice we can get away with 50% of the samples without impairing the frequency content. For 5D, the number of samples reduces even further, to 35%. It should be noted, however, that lossless compression is only warranted if the sampled signal has a (hyper-)spherically band-limited frequency spectrum. If a signal already sampled into a CC grid is resampled into a Dn* grid, then it must be band-limited to a hyper-spherical spectrum at that time, or else aliasing will occur. If the signal has already been band-limited to such a spectrum before, then the compression is lossless; otherwise it is lossy. For example, a CT reconstruction performed with a Gaussian kernel will be spherically band-limited.
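As a small illustration of this construction, the sample positions of a 3D BCC lattice can be enumerated as two interleaved CC lattices; the grid dimensions and the scaling convention in this sketch are ours.

#include <cmath>
#include <vector>

struct Point3 { float x, y, z; };

// Two interleaved CC sub-lattices of spacing T*sqrt(2), the second offset by T/sqrt(2).
std::vector<Point3> bccSamples(int nx, int ny, int nz, float T)
{
    const float spacing = T * std::sqrt(2.0f);     // spacing of each CC sub-lattice
    const float offset  = T / std::sqrt(2.0f);     // half-cell shift of the second lattice
    std::vector<Point3> pts;
    for (int sub = 0; sub < 2; ++sub)              // the two interleaved sub-lattices
        for (int k = 0; k < nz; ++k)
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                    pts.push_back({ i * spacing + sub * offset,
                                    j * spacing + sub * offset,
                                    k * spacing + sub * offset });
    return pts;   // 2*nx*ny*nz samples, about 70.7% of the equivalent CC sample count
}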

The result is a list of 3D voxels, for which shading, visibility-ordering and depth-compositing will have to be performed each time a new 3D viewpoint and transfer function is specified by the user. We use the image-aligned sheet-buffered splatting method described in [MSHC99] for the 3D rendering.

The extraction of a hyper-slice can be done in two ways. The first method uses the fact that a sliced 4D Gaussian is a 3D Gaussian, weighted by a factor exp(-a·d²), where d is the distance of the 4D point from the hyper-slice and a determines the kernel's spread. All voxels for which |d| ≤ the kernel half-width will be extracted from the 4D volume and included in the list used for 3D rendering. The other method does not extract voxels from the 4D grid, but interpolates a 3D volume by way of a 4D filter with bandwidth (ωx, ωy, ωz, ωt). This allows the extraction of volumes at any resolution, at any hyperplane orientation, and into any grid, such as a 4D CC into a 3D BCC lattice or a 4D BCC into a 3D BCC lattice. The 4D-to-3D interpolation filter is then a 4D filter h4D aligned with the axes of the 4D data, convolved with the 3D Gaussian h3D used for splatting, aligned with the axes of the extracted volume:

\[
h(x,y,z,t) = h_{4D}(x,y,z,t) \;\ast\; \big(R^{-1}\, h_{3D}(x_v, y_v, z_v)\big) \qquad (6)
\]

where R is the orientation matrix of the hyper-slice and (x_v, y_v, z_v) is the coordinate system of the extracted volume. We can do this via a 3D splatting table if the kernel is rotationally symmetric. In that case, the march through the table is defined by the orientation of the hyperplane. We only splat those points that are within half the kernel width of the hyper-slice and we omit points with densities outside the interesting range. Note that if we interpolate the 3D volume at the same (or lesser) resolution as the 4D volume, which is probably the most reasonable choice, then the list of voxels of the second method is bound to be much smaller than the list of voxels for the first method. While for the first method the number of extracted voxels is proportional to the product of the kernel width w and the volume resolution, it is only proportional to the volume resolution in the second method. This will increase the 3D rendering time for the first method by a factor of w. On the other hand, the interpolation cost is proportional to w⁴, while the extraction cost is only proportional to w. For the general 4D slicing, we chose the second method for our work since we wanted to minimize the number of 3D voxels for rendering. But there are a few special cases that allow the more efficient extraction method to be applied without increasing the number of voxels in the list. These special cases arise when the hyperplane is aligned with 3 of the 4 hyper-volume axes, i.e., 3 of the 4 coefficients (a,b,c,d) in (6) are zero. Fortunately, these are likely to be the most popular rendering modes: they slice the volume along x, y, z, or t. A further condition is that the projected 4D grid must coincide with the desired 3D grid. This results in the class of volumes that have the same resolution and the same orientation as the projected 4D volume. However, this is not really a restriction since we can do the resampling and rotations later, by way of the viewing matrix during the 3D volume rendering. This restricted rendering mode has certain implications for different grid types, which we shall explain in the following.
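A minimal sketch of the first extraction method is given below. The hyper-slice is specified here by a unit normal n and an offset, a is the kernel spread from the text, and for an axis-aligned slice along t the projection into the 3D slice simply drops the t coordinate; for a general hyperplane the point would additionally be transformed into slice coordinates. All names are illustrative.

#include <cmath>
#include <vector>

struct Voxel4 { float x, y, z, t; float density; };
struct Voxel3 { float x, y, z;    float density; };

std::vector<Voxel3> extractHyperSlice(const std::vector<Voxel4>& volume,
                                      const float n[4],   // unit normal of the hyper-slice
                                      float offset,       // hyperplane: n·p = offset
                                      float halfWidth, float a)
{
    std::vector<Voxel3> slice;
    for (const Voxel4& v : volume) {
        float d = n[0]*v.x + n[1]*v.y + n[2]*v.z + n[3]*v.t - offset;
        if (std::fabs(d) > halfWidth) continue;            // outside the 4D kernel
        float w = std::exp(-a * d * d);                     // sliced 4D Gaussian weight
        // For a t-aligned slice the spatial coordinates carry over unchanged.
        slice.push_back({ v.x, v.y, v.z, v.density * w });
    }
    return slice;
}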

2.3.1 Axis-aligned 4D splatting for CC grids

In this special case we can efficiently collapse pairs of extracted voxels into a single voxel: two 3D volumes that are immediate neighbors of the hyper-slice can simply be combined by linear weighting along the projection direction. We can use a linear filter, which has a box shape in the frequency domain, since the Cartesian grid does not impose any radial band-limitedness conditions. The complexity for this projection is N, the number of voxels in the 3D volume.

2.3.2 Axis-aligned 4D splatting for BCC grids

The situation for the D4* grid is more complex. Here we need a radially symmetric filter, such as a Gaussian, for the 4D interpolation, since we need to preserve the spherical band-limitedness of the 4D signal. The Gaussian is a non-grid-interpolating filter, which means that, even if the projected 4D grid coincides with the desired 3D grid, not only samples along the fourth dimension will affect an interpolated grid point, but also the entire 4D neighborhood that falls into the extent of the Gaussian kernel placed at that grid point. So it seems that we are still stuck with the problems of the general configuration. However, the fact that the grid distances within the nested CC cells are relatively wide helps us here. Figure 2.4 serves as an illustration, where we use t as the fourth dimension, without loss of generality. There we see a 4D BCC grid (two nested CC grids) pulled apart along t. The vertical dashed line indicates the time slice at t=t1 that we would like to interpolate. We realize that the projection of a 4D BCC grid is a 3D BCC grid, and since we project the 4D grid such that it coincides with the desired 3D grid, we will retain the two nested (x,y,z) spatial CC grids and just interpolate the new grid values. It would save interpolation time if we could just use the original, unfiltered, grid values and only interpolate (weigh) them over time, i.e., just use a 1D Gaussian along the time axis (as we did for the axis-aligned CC case). To see if this is feasible and to estimate how much error we would commit, consider voxel A in Figure 2.4. The 4D Gaussian will overlap voxels C and D in the same time instance, and voxel B in the next time instance (as well as others). But realize that both B and C (and D) are a distance of 1.42 away. A relatively narrow Gaussian for interpolation, which still has good frequency behavior, is given by:

\[
h(r) = e^{-2 r^{2}} \qquad (7)
\]

For r=1.4 (points B and C) this Gaussian has decreased to less than 2% of its maximum value. If we are willing to commit this error, then the interpolation is almost twice as fast as for the Cartesian grid, given that we have to interpolate half the number of voxels. If we are not willing to commit this error, then we will need to use 8 B-voxels and 6 C-voxels for the interpolation of each grid point (the weights are constant and do not have to be computed). However, our experiments have shown that the results obtained with the approximate method are just as good as those with the accurate method, and therefore all results reported here use the approximate method. The interpolation yields a list of voxels for the resulting BCC grid. While the 4D BCC grid has only 50% of the voxels of the 4D CC grid, the extracted 3D BCC has 30% fewer voxels than the 3D CC. This is the case because we have two lists of N/(2√2) voxels each, which amounts to a saving of 29.3%. This was to be expected for the 3D BCC grid.
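The approximate method can be sketched as follows: every spatial sample of the extracted 3D BCC volume keeps its own (unfiltered) values and is merely weighted over time with the 1D Gaussian of Equation (7). The data layout and the normalization comment below are illustrative assumptions, not the exact code of our system.

#include <cmath>
#include <vector>

// timeSamples[k][i] is the density of spatial sample i at the k-th time step of
// that sample's own sub-lattice; dt is the temporal sample spacing.
std::vector<float> interpolateTimeSlice(const std::vector<std::vector<float>>& timeSamples,
                                        float t1, float dt)
{
    const size_t nVoxels = timeSamples.front().size();
    std::vector<float> out(nVoxels, 0.0f);
    int k0 = int(std::floor(t1 / dt));                    // bracketing time steps
    for (int k = k0; k <= k0 + 1; ++k) {
        if (k < 0 || k >= int(timeSamples.size())) continue;
        float r = (t1 - k * dt) / dt;                      // normalized distance in time
        float w = std::exp(-2.0f * r * r);                 // 1D Gaussian weight, Eq. (7)
        for (size_t i = 0; i < nVoxels; ++i)
            out[i] += w * timeSamples[k][i];
    }
    // If the two Gaussian weights do not sum to one, a renormalization pass
    // (dividing by their sum) keeps the interpolated densities unbiased.
    return out;
}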

the 4D viewing parameters triggers a new hyper-slice interpolation. We have also implemented direct slicing, where the construction of the (3D) RLE list is bypassed and the extracted 4D voxels are directly tossed into the bucket array. The modification of the 3D splatter to render BCC grids is straightforward. The only adjustment that is necessary takes place at the onset of the bucket-tossing stage of the algorithm. The transformation matrix that transforms the points into image space must be pre-multiplied by the following matrix:

\[
\begin{bmatrix}
\sqrt{2} & 0 & 0 & a\\
0 & \sqrt{2} & 0 & a\\
0 & 0 & \sqrt{2} & a\\
0 & 0 & 0 & 1
\end{bmatrix} \qquad (9)
\]

where a is set to 1/√2 when z is odd and to 0 otherwise.

2.5 Results

We now present experimental results that we have produced with our software. All images were rendered in a pure-software implementation on a Pentium 4, 2GHz machine with 512 MB of RAM. We first tested the 3D BCC volume rendering with the following data-sets: UNC Head, Visible Human Foot, and Engine. Since these data-sets are only available on CC grids, we resampled the data with a cubic spline filter into the corresponding BCC grids. Table 2.1 compares the speedups that are due to the BCC grid. Similar to [TMG01], we were able to achieve speedups (over the CC grid) of around 25% at almost no difference in visual quality. On the front page we show (in the right-most two images) the fuel injection and the neghip data-sets rendered on BCC grids, and Figure 2.6 shows side-by-side comparisons of the Engine, the Visible Human Foot, and the UNC Head data-sets.

data-set      Size                  Eff. points   Time
Head CC       128³ (2.1 M)          492 K         1.26
Head BCC      91² x 181 (1.5 M)     345 K         0.98
Foot CC       128³ (2.1 M)          208 K         1.26
Foot BCC      91² x 181 (1.5 M)     148 K         0.91
Engine CC     256³ (16.7 M)         1565 K        2.46

Table 2.1: Comparison of the efficiency of the BCC grid versus the CC grid. The Size column lists the number of points needed on both grids to store the equivalent data-set. The third column lists the number of points that effectively made it into the rendering pipeline (the relevant voxels). The Time column lists the time needed to render one frame at the resolution of the data-set. All timings are listed in seconds.

2.6 Conclusions

In this chapter, we have applied well-established mathematical theory on hyper-spherical lattices to losslessly compress 4D data-sets, under the condition that the frequency spectrum of the data is spherically band-limited. We found that the raw data-sets could be compressed to about 50% of their original size, a ratio that is on par with theory. Since we are using a point-based renderer, we can reduce the magnitude of the data even more by only storing an RLE list of relevant data points (those with values greater than "air" and noise). The RLE-encoded BCC point lists compress to a size of 20-50% of the raw CC data for the data-sets tested. Since the resampling of the data from the original 4D CC grid into the 4D BCC grid introduced some amount of low-passing, which required the inclusion of some of the air voxels into the point list, we believe that the compression ratios would be even better if the tested data-sets were sampled directly into the 4D BCC grid by the data generation process. In these regards, our research (as well as that of [TMG01]) has hopefully provided some pointers to the communities working in these fields of science. We have used a hyper-slice approach to explore the 4D data, i.e., the user first specifies a hyper-slice to extract a 3D volume, which is then rendered with the point-based splatting renderer.

dataset            # time   Total     Total # relevant   Size RLE      # relevant      # interpolated    Render time
                   steps    data size voxels (% of       encoding      voxels          voxels (% of      in s
                                      total size)        (% BCC/CC)    (% BCC/CC)      relevant voxels)  (% BCC/CC)
Turbulent CC       99       168.3 M   9.2 M (5%)         -             127 k           146 k (114%)      1.23
Turbulent BCC      139      87.0 M    7.4 M (8%)         80%           107 k (84%)     146 k (136%)      1.01 (82%)
Vortex CC          80       160.0 M   109.7 M (68%)      -             1.3 M           1.6 M (123%)      5.63
Vortex BCC         112      84.3 M    60.3 M (75%)       54%           986 k (75%)     1.35 M (137%)     4.58 (81%)
Jet Shockwave CC   56       89.6 M    38.0 M (42%)       -             727 k           719 k (98%)       5.42
Jet Shockwave BCC  80       48.0 M    20.0 M (41%)       52%           520 k (71%)     544 k (104%)      3.90 (71%)

Table 2.2: Numerical results for the time-varying data-sets used in our study. The relevant voxels are those voxels that have values above "air" and noise. The relevant interpolated voxels are the voxels interpolated for the arbitrarily chosen time step shown in Figure 2.4. These voxels are passed into the splat renderer. The RLE encoding is needed for efficient storage and transformation of these spatially non-connected points. The render time is the time (in seconds) to render the image.


Chapter 3

Using Post-Convolved Splatting to Accelerate Magnified Viewing

3.1 Introduction

One of the most expensive operations in volume rendering is the interpolation of samples in volume space. The number of samples, in turn, depends on the resolution of the final image. Hence, viewing the volume at high magnification incurs heavy computation. In this chapter, we explore an approach that limits the number of samples to the resolution of the volume, independent of the magnification factor, using a cheap post-convolution process on the interpolated samples to generate the missing samples. For X-ray, this post-convolution is needed only once, after the volume is fully projected, while in full volume rendering, the post-convolution must be applied before each shading and compositing step. Using this technique, we are able to achieve speedups of two and more, without compromising rendering quality. We demonstrate our approach using an image-aligned sheet-buffered splatting algorithm, but our conclusions readily generalize to any volume rendering algorithm that advances across the volume in a slice-based fashion.

Magnified viewing means that the resolution of the image exceeds that of the volume. The implications of this can be grasped by realizing that (i) the frequency content of a volume data-set is bounded from above by its grid sampling rate, and (ii) this upper frequency (the Nyquist frequency) cannot be exceeded, even if we resampled the volume at a higher rate for the purpose of slicing or magnified volume rendering. In other words, we cannot gain any high-frequency detail in the densities just by over-sampling, and if densities are all we are interested in for image generation, which is the case for X-ray, then we

may just as well 3D-sample the volume at the volume's Nyquist frequency and magnify the resulting image just before display. If, however, the interpolated densities are used as an input to another function, then this translation can potentially extend the frequency content of the interpolated signal beyond the volume's upper frequency bound. In fact, the color and opacity transfer functions as well as the shading operations used in full volume rendering represent such a frequency translation. This implies that we will not be able to perform the magnification after a full rendering, but only before the shading and coloring occurs. An example of an algorithm that does not observe this constraint is the shear-warp algorithm [LL94]. It gains a fair share of its famous speed by performing all interpolation, shading, and compositing operations at volume resolution and then scaling up the resulting image via an inexpensive 2D warp (see [SM02] for a closer analysis). This yields a relative insensitivity to magnification in terms of runtime, but it generates blur in places of high geometric and color detail. On the other hand, Westenberg and Roerdink [WR00] used this idea for their wavelet compression-space X-ray volume rendering approach. Due to their restriction to the X-ray transform, they did not observe significant blurring artifacts.

The magnification of an image is generally performed via a convolution with a suitable filter. Hence, we shall call the technique of obtaining a magnified image from an image that was rendered at volume resolution post-convolved volume rendering (PCVR). We found that PCVR, when used in the context of splatting [Wes90], raises a few interesting peculiarities, which are mainly due to the required use of spherically-symmetric kernels (for reasons of efficiency). These issues are explained in the following sections of this chapter, and our solutions for them are discussed as well.

Westenberg and Roerdink [WR00] used a PCVR scheme to accelerate their hierarchical wavelet X-ray splatting technique. In their work, they first calculate a decomposition of the original density volume into a wavelet basis, which gives rise to a set of wavelet coefficient volumes at decreasing resolution. Then, for display, they project each volume separately (after culling irrelevant voxels determined by the wavelet coefficients) into images that match the resolution of the corresponding volume. The projection itself is performed by spreading the projected point into its 2x2 image pixel neighborhood, using bilinear weighting. After all points have been splatted in this way, the images are convolved with a 2D filter corresponding to the projection of the wavelet function (also known as the footprint) at that level. This process is illustrated in Figure 3.1. Finally, the images are added to yield the final display. Significant speedups can be obtained because the bilinear spreading is much more efficient than a full footprint rasterization for each voxel. This is because bilinear spreading involves only 4 pixels, while the rasterization of the footprint for a kernel of radial extent 2 involves 16 pixels at magnification 1. For larger magnifications this number scales with the square of the magnification factor. For wavelet splatting the magnification factor is determined by the level of the wavelet volume and grows as a power of 2. So it is obvious why PCVR brought great speedups, since the rasterization of large wavelet footprints would take much longer than the bilinear spreading.
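The two stages of this scheme, bilinear spreading at volume resolution followed by a post-convolution up to screen resolution, can be sketched as follows for the X-ray case. The tent (bilinear) reconstruction filter used in the second function is only one possible choice; a Gaussian matched to the splat footprint could be substituted. All names are illustrative.

#include <algorithm>
#include <cmath>
#include <vector>

// Spread one projected point into its 2x2 pixel neighborhood with bilinear weights.
void spreadBilinear(float px, float py, float value,
                    std::vector<float>& img, int W, int H)
{
    int x = int(std::floor(px)), y = int(std::floor(py));
    float fx = px - x, fy = py - y;
    if (x < 0 || y < 0 || x + 1 >= W || y + 1 >= H) return;
    img[ y      * W + x    ] += value * (1 - fx) * (1 - fy);
    img[ y      * W + x + 1] += value *      fx  * (1 - fy);
    img[(y + 1) * W + x    ] += value * (1 - fx) *      fy ;
    img[(y + 1) * W + x + 1] += value *      fx  *      fy ;
}

// Post-convolution: magnify the volume-resolution image to screen resolution.
std::vector<float> postConvolveMagnify(const std::vector<float>& low, int W, int H, int mag)
{
    std::vector<float> high(W * mag * H * mag, 0.0f);
    for (int y = 0; y < H * mag; ++y)
        for (int x = 0; x < W * mag; ++x) {
            float sx = x / float(mag), sy = y / float(mag);   // position in low-res image
            int x0 = int(sx), y0 = int(sy);
            float fx = sx - x0, fy = sy - y0;
            int x1 = std::min(x0 + 1, W - 1), y1 = std::min(y0 + 1, H - 1);
            high[y * W * mag + x] =
                (1 - fx) * (1 - fy) * low[y0 * W + x0] + fx * (1 - fy) * low[y0 * W + x1] +
                (1 - fx) *      fy  * low[y1 * W + x0] + fx *      fy  * low[y1 * W + x1];
        }
    return high;
}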
Despite the obvious benefits and the innovative nature of Westenberg’s and Roerdink’s work, there are still a few open issues with this method. First, the authors only use a wavelet decomposition of 2-3 levels, and in their paper they only show the rendering

first stage, which has a more favorable frequency response. On the other hand, it is permissible to use an efficient bilinear filter in the post-convolution stage, and we also found that for integer magnifications an efficient discrete post-convolution scheme can be employed, which provides additional significant speedups for PCVR. A downside of the PCVR approach is that it may lead to a slight increase in blurring, since the combined filter is a convolution of two lowpass filters. Our experiments indicated, however, that the blurring is relatively minor.

The PCVR method has a number of application scenarios in which it can prove to be useful. The first is in software splatting, and, in fact, we use the PCVR scheme as a default option in our in-house software splatter. We think that in hardware implementations our approach will facilitate higher magnifications and therefore more levels, artifact-free.


Figure 3.12: X-Ray rendering of the CT Head. (a) Regular splatting, (b) PCVR with bilinear post-convolution, (c) PCVR with Gaussian post-convolution, (d) PCVR with bilinear spreading (pre-convolution). (e),(f),(g): amplified difference Images of (a-b), (b-c), (a-c). (h) PCVR without the post-convolution, (i) compares the image quality of PCVR with bilinear spreading at scales of 1X, 2X, 3X, 4X, (j) the image obtained with PCVR at 4X via Gaussian spreading and bilinear post-convolution.

PCVR may also be useful for custom hardware implementations. For example, the VolumePro 500 [PHK+99] could only cast one ray per voxel. To increase resolution without having to cast more rays, a post-convolution module could be inserted before the shading pipeline but after interpolation. Similarly, the shear-warp algorithm could be modified by inserting a post-convolution module after each slice interpolation. Tiles could be used to limit the convolution effort. Finally, our findings are also relevant for medical imaging, in particular for CT reconstruction algorithms, which often use bilinear spreading in their projectors/back-projectors. The added accuracy obtained from using a better kernel may help to increase reconstruction fidelity. In the next chapters we discuss our efforts to port our image-aligned sheet-buffered splatting approach to programmable hardware platforms, such as NVidia and ATI. In order to use PCVR, one could rasterize the texture splats into a low-resolution buffer and then project this buffer into a high-resolution buffer, all in hardware.


Chapter 4

GPU Accelerated Image-Aligned Splatting

4.1 Introduction

The past few years have seen a revolution in the development of graphics hardware. The advent of the GPU has completely changed the scenery of computer graphics and has given new hope and inspiration to researchers in scientific visualization and the non-gaming community. Although all new developments on consumer graphics architectures are motivated by the demands of the gaming industry, a very pleasant side-effect of the increasing level of hardware programmability is the numerous applications this offers in scientific visualization and specifically in volume rendering. The new processing model of the GPU uses aggregate pipelines that process very small independent SIMD computations in parallel. Any algorithm that can be built around this model can be made highly scalable and can be tremendously sped up by the use of graphics hardware. 3D texture-mapping accelerated volume rendering, as well as ray-casting based algorithms, have gained many benefits from the new GPU-based boards. That is because these algorithms fully utilize the ability of the hardware to handle large vector-based parallel computations. In each step of ray casting, all rays are processed in parallel and are completely independent, giving rise to high speedups if the application is constructed properly.

Point-based volume rendering, however, does not fall into the category of algorithms that can be easily divided into one independent computation element for every pixel. Splatting, to be specific, is based on the use of overlapping kernels. These overlapping kernels are rasterized onto an accumulation buffer, and the buffer may only be processed for shading after all the points have been rasterized.

Splatting is a "scatter" operation, which means that for every voxel the multiple destination pixels are calculated at runtime. This is the opposite of "gather" operations, where every pixel's destination is defined before rasterization, and every pixel depends on a predefined set of sources, which results in a predefined set of texture lookups. In splatting, every pixel depends on a variable number of points, as opposed to ray casting, where every pixel only depends on the previous ray step and the neighbors around it for gradient estimation. One could attempt to convert splatting into a "gather" operation by implementing it at the fragment level on the GPU. Such a scheme would have to gather, for each pixel, all the neighboring kernel contributions that need to be added to the pixel, depending on the distance from their centers. But even if the number of relevant points per area were restricted, this would cause so many lookups that it would effectively drain the texture-memory bandwidth of even the most powerful boards.

Our system addresses the major challenges of a hardware-based splatting system without converting it to a "gather" method. It achieves interactive rendering rates by exploiting features of the latest generation of GPU hardware. It is based on the sheet-buffered image-aligned splatting algorithm [MC98], which was initially introduced to better address the performance/quality concerns of the existing splatting algorithms. Splatting is especially attractive for the rendering of sparse data sets, but it can also handle irregular data sets very efficiently, as well as alternative grid topologies such as the BCC (Body Centered Cubic) grid. The system described in this chapter demonstrates the ability to interactively visualize both Cartesian and BCC regular data sets. In addition, the proposed system utilizes the newly introduced 16-bit floating-point processing, which is now present in the latest 6800 series of NVidia boards as well as the ATI Radeon 9800 Pro, throughout the rasterization pipeline. The floating-point pipeline is desirable since, although not very pronounced, quantization artifacts do appear in 8-bit pipelines. Such artifacts are especially visible in specular highlights, where the errors pass through exponentiation. Quantization artifacts in alpha compositing are more pronounced in the case of semi-transparent rendering. In their recent work [BNMK04], Bitter et al. concluded that the minimum precision required throughout the volume rendering pipeline is 16 bits per channel for displaying the rendering result on typical displays of 8-bit color precision per channel. Furthermore, this is also the upper limit of the requirement if the results are to be visualized using standard 24-bit displays. The presence of hard-wired floating-point blending capabilities is crucial to splatting implementations, since the blending of millions of points using custom floating-point fragment shader programs is impossible at interactive rates.

4.2 Hardware-accelerated splatting

We are using the image-aligned sheet-buffered splatting algorithm that was introduced in Chapter 1. This algorithm interpolates, via splatting of pre-integrated basis-function slices, a series of image-aligned density sheets from the volume. It then shades and composites the sheets in front-to-back order. The first hardware-based flavor of this algorithm took advantage of early SGI workstations with 2D texture-mapping capability.
The non-programmable OpenGL pipeline was used for the compositing of already colored and shaded voxels (pre-shaded splatting).

The final color of the voxel was used to modulate the 2D kernel that represented the point and was rasterized as a textured rectangle. Thus, for every slice that was composited, the textured splat was modulated by the color and a "section coefficient", which was precalculated by pre-integrating the represented slab for the sampling interval along the z-direction. These coefficients were calculated for an equidistant set of sampling positions for a given slab width. The post-shaded version of the algorithm addresses the blurring artifacts of the previous approaches. It consists of first splatting the densities of all the points that fall into an image-aligned slab and shading only after all the contents of the slab have been accumulated [MMC99]. For every pixel in the current sheet buffer the color is assigned using a transfer function, and the shading calculation applies Phong lighting using the computed gradient and position for that pixel. This approach allows blur-free zooms and was initially realized only in implementations that require the CPU for a number of operations. Programmable shader technology now makes per-pixel shading available and enables post-shaded splatting to be performed entirely on the GPU. In the following paragraphs we discuss the major challenges in porting this algorithm to the latest generation of GPU-based hardware, and we propose specific strategies and hardware-feature exploits to overcome each of the challenges.

4.2.1 Challenge 1: increased vertex traffic

Compared to traditional ray-casting and 3D slicing based approaches, splatting is at a great disadvantage as far as vertex traffic is concerned. At least one textured polygon has to be rasterized for every point (basis function) in the data set. This makes the number of vertices that have to be sent through the graphics pipeline for every frame four times greater than the number of data points, as opposed to one polygon per sampling slice for the ray-casting based approaches. In the case of image-aligned splatting this number is multiplied further by the number of slices of each point, since the volume is traversed via sliced kernel sections along the viewing direction.

The first available strategy for this purpose is the use of the Point Sprites extension, which was first introduced on NVidia boards, but was soon adopted on the ATI Radeon boards as well. This extension allows the definition of a texture that is applied to regular point primitives, effectively causing them to be rasterized as screen-aligned textured polygons. The points are defined using the GL_POINTS primitive, which only requires one vertex to be sent. This translates into tangible benefits in a parallel projection pipeline, since the vertex count and the associated transformations are reduced by a factor of four. The final rasterized polygon, however, is only created after the viewing and projective transformations are applied to the point vertex, so the shape of the polygon cannot be affected by any change to the matrices. This makes the extension slightly more difficult to use for perspective projection, since transformations can only be applied indirectly to the texture coordinates, by using fragment programs and devising a smarter way of passing parameters to them. The use of Point Sprites in our system has gained us a speedup of about 1.2-1.3; the Point Sprites extension is present on all NVidia boards from the GeForce4 on.
Although only a quarter of the total geometry is passed to the processor, the application still remains rasterization bound; hence the speedup from using point sprites is still quite low.

The second obvious strategy for tackling the vertex-traffic issue is the use of Vertex Arrays. This extension allows us to pack the points of each slice into a large array of vertices, which can be uploaded to the graphics board using the optimized memory-transfer techniques provided by the AGP and PCI Express memory interfaces. It also provides a large chunk of computation that can be processed asynchronously on the GPU, allowing better utilization of both the CPU and the GPU.
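A per-slice submission with vertex arrays could then look roughly as follows (again a sketch with hypothetical names; the per-slice arrays are assumed to be rebuilt whenever the viewing transform changes):

    // C++/OpenGL sketch: one glDrawArrays call submits all splats of a slice.
    #include <GL/gl.h>
    #include <vector>

    struct SliceArrays {
        std::vector<float> positions;   // x, y, z per point
        std::vector<float> colors;      // 4 floats per point (packed slab coefficients)
    };

    void drawSlice(const SliceArrays& s)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_COLOR_ARRAY);

        glVertexPointer(3, GL_FLOAT, 0, s.positions.data());
        glColorPointer(4, GL_FLOAT, 0, s.colors.data());

        // Combined with point sprites, this keeps the traffic at one vertex per basis function.
        glDrawArrays(GL_POINTS, 0, (GLsizei)(s.positions.size() / 3));

        glDisableClientState(GL_COLOR_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);
    }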

Despite all these memory-transfer improvements and the reduction of the vertex count, the optimal solution would be to keep the entire data set inside the graphics board. This approach, dubbed "retained mode splatting", was used in [CRZP04] and claimed speedups of 7-10 times compared to immediate-mode splatting. The referenced work, however, uses axis-aligned, predefined volume traversals, which produce popping artifacts during animated viewing, as mentioned in the previous section. For our application, a substantial amount of work would have to be done on the GPU to maintain a correct image-aligned volume traversal every time the viewing transform changes. The highly parallel architecture of the GPU only allows SIMD approaches to be used efficiently for this task. Some "all-GPU particle system" approaches, such as [Lat04, KSW04], use bitonic sort and constrain the application so that partial sorting suffices to maintain correct visibility order, but the throughput of these systems is currently close to about 1 million particles per second, which translates to an overall performance of only about 1 fps for a typical splatting application.

4.2.2 Challenge 2: increased voxel/pixel overdraw

As described above, the image-aligned post-shaded splatting algorithm first collects the density contributions of all points intersected by the current slab. For each point it multiplies the density with the appropriate slab coefficient and rasterizes the kernel texture, modulated by that factor, into the accumulation buffer. After all densities have been collected in this way, the slice is ready for classification and shading. The shading process is implemented in a fragment program and applied to the slice, leaving the final result to be composited into the frame buffer. During the shading calculation for every pixel, the gradient can be interpolated from a gradient volume along with the densities. In that case all four channels of the RGBA temporary buffer are utilized by encoding the modulated color as a tuple (Nx, Ny, Nz, Density). However, this method needs to draw a textured polygon for every slice of each kernel, which for a kernel radius of 2.0 translates to 4 slices per point. Unfortunately, this would drain the rasterization limits of current boards, which are at most 3 to 4 million textured points per second. At rendering scales higher than one, which translate to larger textured splats, this rate drops even further, since splatting is already a rasterization-bound application.

The alternative approach that we follow is to splat all 4 slices of a point at the same time. The temporary RGBA buffer is then utilized as 4 separate (but consecutive) density slices: at any given slice i, the temporary buffer actually holds the slices (i, i+1, i+2, i+3). This requires some extra accounting for deciding the 4 slice coefficients of the point being splatted, as well as a smarter set of fragment programs to perform the shading computations for the current slice.
In addition, the gradient of each pixel has to be computed on the fly. The cost of the 5 additional texture lookups per pixel required for the central-difference gradient calculation is quickly

indices long. We adopt the convention that the temporary buffer channels are ordered R, G, B, A, R, G, B, A, ... and hold the slabs 0, 1, 2, 3, 4, 5, 6, 7, 8, ... The relative ordering tuple (i, i+1, i+2, i+3) for slices 0..4 is then RGBA, GBAR, BARG, ARGB, RGBA, etc. Thus, a voxel that first intersects slice 6 will have its four slice contributions arranged in the order BARG. If we further assume that this voxel intersects slice 6 at kernel-section sampling position 5 and has density d, then the modulating color is defined by glColor4f(d*coeff[133], d*coeff[197], d*coeff[5], d*coeff[69=(5+64)]).

The next stage takes place right after a whole slice has been splatted into the active density buffers. The contents of the completed buffer are copied into a temporary copy buffer, which holds the latest density slices that have been completed. These slices are then used for the shading calculations, since the front and back neighbors of each slice are needed to calculate the pixel gradients. This part of the calculation is implemented using four different fragment programs (one for each of the RGBA, GBAR, BARG and ARGB orderings), which are activated according to the slice number. The gradient is calculated using central differences, so the six neighbors of each pixel need to be read in the program. However, since the front and back neighboring buffers lie in the same pixels but in different channels, they are already available once the current pixel has been sampled. Therefore, only four additional texture accesses are necessary for the gradient calculation. The shading program also reads the transfer function as a 1D texture, making a total of 6 texture accesses per fragment. Although the additional texture lookups and the shading computations are quite expensive, the coherence of the volume data within the slice is well exploited by the texture cache, making this implementation about 3 times faster than the initial solution that rasterized every slice separately.
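A compact sketch of this bookkeeping is given below. It is not the dissertation's code; the names are hypothetical, and it assumes, as the example indices 5, 69, 133 and 197 suggest, that the pre-integrated section coefficients are stored in a single array of 4 kernel sections x 64 sampling positions.

    // C++/OpenGL sketch: pack the four slab contributions of one voxel into RGBA.
    #include <GL/gl.h>

    extern float coeff[4 * 64];   // pre-integrated 1-D kernel section coefficients (assumed layout)

    // firstSlice : first image-aligned slab intersected by the voxel's kernel
    // subPos     : sampling position within the kernel section (0..63)
    // density    : the voxel's density value
    void setSplatColor(int firstSlice, int subPos, float density)
    {
        float channel[4];   // R, G, B, A
        for (int s = 0; s < 4; ++s) {
            // Kernel section s lands in temporary-buffer channel (firstSlice + s) mod 4,
            // which yields the RGBA, GBAR, BARG and ARGB orderings described above.
            int ch = (firstSlice + s) % 4;
            channel[ch] = density * coeff[s * 64 + subPos];
        }
        glColor4f(channel[0], channel[1], channel[2], channel[3]);
    }

For the example above (firstSlice = 6, subPos = 5), the loop reproduces the ordering BARG: B = d*coeff[5], A = d*coeff[69], R = d*coeff[133], G = d*coeff[197].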
4.2.3 Challenge 3: shading of empty regions

An important inefficiency that affects both the 3D-texture/ray-casting-based approaches and the slice-based splatting approach is the expensive processing of empty-space pixels. The slice-based splatting approach has an advantage here, because after the splatting stage for a slice we know which areas need to be composited and which parts of the slice did not receive any contributions. In order to exploit this feature of the slice-based splatting algorithm we use the OpenGL depth-test feature, which is further optimized on the FX, 6800 and Radeon boards by what is widely known as the "early z-rejection" test. The early z-test optimization performs the depth test before the fragment is processed in the fragment processor and effectively cancels the expensive computations for fragments declared empty. This feature has been widely used in volume rendering applications, which often need to perform their potentially expensive computations on only a very small fraction of the rasterized surfaces.

Unfortunately, the early-z test extension was not originally intended for applications in scientific computing and volume rendering; it is tuned to conditions prevalent in polygonal rendering and specifically gaming environments. A rather long list of rules, most of them undocumented and some discovered only by experimentation, determines whether the early-z test is actually performed. For example, frequent clearing of the depth buffer or frequent changes to the direction of the z-test (from GL_LESS to GL_GREATER) tend to cancel the optimization completely.

In our implementation the depth test is used in conjunction with the newly introduced depth-bounds test.

compositing computations. For the splatting phase of the next slice, the new value is then set to sliceDepth(n+1) = (1023 - (n+1)) / 1024, and the depth-bounds-test range is set back to (0...1) to allow splatting anywhere in the slice. The depth test has to remain active at all times in order for the early-z-test optimization to be applied. The depth for the new slice is still consistent with the depth test, since it is smaller than all the current z-values in the depth buffer, which now include 1 and sliceDepth(n+1).

Although the rendering order is front-to-back, we set the initial depth value to 1 and render the slices at successively smaller depth values, i.e., moving towards the screen in depth-buffer terms, which is counter-intuitive for front-to-back rendering. This is due to hard-wired conventions in the operation of the z-buffer, which associate the value 1.0 (furthest) with allowing all rendering and the value 0.0 (closest) with rejecting fragments via early-z culling. Thus, the depth value 0.0 is reserved for opaque fragments, as we will discuss in the next subsection. The scheme we have just described was chosen to avoid changes in the direction of the depth test (i.e., from GreaterEqual to LessEqual) as well as frequent clears of the depth buffer, both of which would cancel the very sensitive early-z culling optimization on NVidia boards, as mentioned above. The only limitation of our scheme is that it restricts the allowed number of slices to about 16K (for 24-bit depth-buffer granularity), which is actually more than sufficient for the volumes that the current hardware can handle.

4.2.4 Challenge 4: shading opaque regions

A final optimization that can be applied to our framework is also related to the depth-test feature and the early z-rejection test. A strategy quite similar to the early ray termination used in ray-casting eliminates all pixels that have already become opaque, in order to avoid unnecessary calculations in the splat rasterization stage. This technique is called "early splat elimination". The number of such pixels is actually very high for some isosurface renderings and for semi-transparent volumes with considerable accumulation. The speedups realized by the use of this extension vary among applications, but in our case they reached factors of about 2, because we use the extension to eliminate both the splatting and the shading operations on pixels that are already opaque.

We slightly expand our use of the extensions described in Section 4.2.3 and assign a depth value of 0 to all opaque pixels. The compositing fragment programs are slightly modified to read the current image buffer as an input texture. At the end, if the sampled pixel has already accumulated an alpha value above the predefined threshold (usually 0.98), the depth output is set to zero; otherwise it is set to the current sliceDepth, which allows normal operation at all stages. All pixels with depth 0 will then be excluded even from the splatting stage, since the depth-test feature is enabled throughout all stages of the rendering pipeline. Figure 4.3(a..c) illustrates how this optimization works.
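The depth-buffer bookkeeping of Sections 4.2.3 and 4.2.4 could be set up roughly along the following lines. This is a simplified sketch with hypothetical helper names; the GL_LEQUAL test direction and the 1024-step slice granularity are assumptions consistent with the description above.

    // C++/OpenGL sketch: fixed depth-test direction, decreasing slice depths,
    // and depth 0 reserved for opaque pixels so that early-z culls them.
    #include <GL/gl.h>
    #include <GL/glext.h>

    extern PFNGLDEPTHBOUNDSEXTPROC glDepthBoundsEXT;   // resolved from EXT_depth_bounds_test at startup

    inline float sliceDepth(int n) { return (1023.0f - n) / 1024.0f; }

    void beginFrame()
    {
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LEQUAL);            // the test direction never changes during the frame
        glClearDepth(1.0);                 // initial value 1: everything may still be rendered
        glClear(GL_DEPTH_BUFFER_BIT);      // cleared once per frame, not once per slice
    }

    void beginSliceSplatting(int n)
    {
        glDepthMask(GL_TRUE);
        glDepthBoundsEXT(0.0, 1.0);        // open the depth-bounds range for splatting
        // The point sprites of slice n are drawn at depth sliceDepth(n); pixels whose stored
        // depth is already 0 (opaque) fail the test, so splatting and shading are culled there.
    }

    void beginSliceCopy(int n)
    {
        glDepthMask(GL_FALSE);
        glDepthBoundsEXT(0.0, sliceDepth(n));  // copy only the fragments written in this slice
    }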
An additional level of opacity culling propagates information from the graphics board back to the CPU in order to exclude whole voxels whose textured splat falls entirely within an opaque region. This requires the creation of an "Opacity Buffer", as described in [HCSM00]. The alpha component of the buffer is convolved with an averaging filter equal in size to the splat. The result is a buffer that stores, for every pixel, the average alpha of the splat-sized region around it. This optimizes the query for checking whether a whole basis function is completely opaque, in which case the vertex itself is eliminated. The opacity buffer is

OnViewingTransformChange( )
    Traverse RLE
    For each slice create a VertexArray
    Initialize PBuffer (3 aux surfaces, associated z-buffer)
        // All surfaces share the same z-buffer
        // datatype of PBuffer is defined as 16-bit float
    Set SplattingPBuffer  = PBuffer.Surfaces[0]
    Set TmpCopyPBuffer    = PBuffer.Surfaces[1]
    Set FinalImagePBuffer = PBuffer.Surfaces[2]

    For each slice
        // Splatting Phase
        SetActiveSurface(SplattingPBuffer)
        Enable(DepthTest)
        DepthMask(true)
        SetDepthBoundsEXT(0, 1)
        Use Associated VertexArray of points
        Use Associated ColorArray of points
            // depth for each slice is set to sliceDepth(slice)
            // Rasterize primitive set to pointSprites
            // Color = density(slice, slice+1, slice+2, slice+3)
            // density( ) gives coefficients from 1-D gaussian
            // Set pointSprite texture to 2D-Gaussian
            // Set texture properties to Modulate color
        DrawArrays
            // After one whole slice is drawn, the Current
            // Channel (one of RGBA) will be complete.

        // Copying Phase
        SetDepthBoundsEXT(0, sliceDepth)
            // Allow only latest fragments to be copied
        DepthMask(false)
        SetActiveSurface(TmpCopyPBuffer)
        Copy CurrentChannel from SplattingPBuffer

        // Compositing Phase
        SetActiveSurface(FinalImagePBuffer)
            // Enable writing to depth, so that alpha-saturated
            // pixels are updated to z=0 by the pixel shader
        DepthMask(true)
        SetInputTexture(TmpCopyPBuffer)
            // Link current contents to read the alpha channel
            // to decide which pixels are saturated
        SetInputTexture(FinalImagePBuffer)
        Activate ShadingFragmentProgram
        Rasterize Polygon for full Size of Buffer
        Set Depth of Polygon to sliceDepth
            // Shading Program updates both depth and
            // final color, which is composited
        Deactivate ShadingFragmentProgram
    End Slice

    // Final Image is available in FinalImagePBuffer
    // TextureMap final image onto FrameBuffer
End Function

Figure 4.4: Pseudocode of the overall rendering process.

The associated depth-buffer information is shared among all three buffers. The rendering process is organized in three stages for each slice. First, all the basis functions intersecting the slice are splatted as textured point sprites, with their RGBA color modulating the active splatting buffers as described in Section 4.2.2. The depth of all splatted point sprites is set to a unique value sliceDepth(i), chosen for the current slice i such that sliceDepth(i)