Version 9-Sep-99

Large Data Management for Interactive Visualization Design

Michael Cox, MRJ/NASA Ames Research Center

The latest version of these course notes can be found at http://science.nas.nasa.gov/ mbc/home.html. The current notes build upon notes from previous SIGGRAPH courses, in particular [10] and [8]. Both can be found at the same URL.

1 Introduction

We first distinguish the problem of big data collections from that of big data objects. Big data collections are aggregates of many data sets. These are not the focus of the current notes, but the issues in big data collections are summarized for completeness. Big data objects are just that – single data objects that are too large to be processed by standard algorithms and software on the hardware one has available. Big data objects may comprise multiple individual files – the collection may be referred to as a data set. However, if all of the files or pieces of a data set are intimately related, and analyzed together (e.g. the individual files of a time series), then we refer to the set of files as a single data object.

There is a growing literature on techniques to handle big data objects (though not all authors have thought of their techniques as being relevant for "big data"). To understand what techniques can be used and when, we discuss some differences among big data applications. We then discuss the varying architectures that have been and might be applied to manage big data objects for analysis and visualization. We then proceed to a taxonomy of the techniques possible to manage big data, and discuss each of these in turn.

1.1 Big data collections

Big data collections are aggregates of many data sets. Typically the data sets are multi-source, often multi-disciplinary, and heterogeneous. Generally the data are distributed among multiple physical sites, and are often multi-database (that is, they are stored in disparate types of data repositories). At any one site, the size of the data may exceed the capacity of fast storage (disk), and so data may be partitioned between tape and disk. Any single data object or data set within the collection may be manageable by itself, but in aggregate the problem is difficult. To accomplish anything useful, the scientist must request information from multiple sources, and each such request may require tape access at the repository. Over many scientists the access patterns may not be predictable, and so it is not always obvious what can be archived on faster storage (disk) and what must be off-loaded to tape. In addition, there are the standard data management problems (aggravated here) of consistency, heterogeneous database access, and locating relevant data.


The Earth Observing System (EOS) whose development is overseen by NASA Goddard is an instructive example of the problem of big collections. The goal of EOS is to provide a long-term repository of environmental measurements (e.g. satellite images at various wavelengths; about 1K parameters are currently included in the requirements) for long-term study of climate and earth’s ecosystems. The data are intended to be widely available not only to scientists, but to the general public. Thus, EOS must acquire, archive and disseminate large collections of distributed data, as well as provide an interface to the data. Estimates for the data volume that must be acquired, processed, and made available are from 1 to 3 TBytes/day. These data arrive in the form of individual data objects that vary from about 10 to 100 MBytes (average about 50 MBytes), and are acquired and processed by about 10 Distributed Active Archive Centers (DAACs).

1.2 Big data objects

Big data objects are typically the result of large-scale simulations in such areas as Computational Fluid Dynamics (CFD), Structural Analysis, Weather Modeling, and Astrophysics. These simulations typically produce multi-dimensional data sets. There is generally some 3D grid or mesh on which data values are calculated; often the simulation computes 10x or 30x more time steps than are actually written to mass storage. In CFD, for example, curvilinear grids are commonly used. The grids themselves are regular lattices bent to conform to the structures around which flow is calculated (e.g. a wing). Multiple grids may be required to conform to all of the parts of the surface under study (e.g. the wing and fuselage). In other disciplines other grids and meshes are employed. Simulations may be steady, in which case time is not modeled, or unsteady, in which case there are solutions at multiple time steps. The grids or meshes themselves may move and/or change during the simulation; in CFD the results are referred to as unsteady grids and are required, for example, to conform to flap movement on the wing of a plane.

Even steady calculations today result in data sets around 1 GByte (e.g. 32 million nodes on a CFD curvilinear grid). It is common to generate hundreds or thousands of time steps as the result of unsteady simulation, leading quickly to TB-scale data sets. Typically these huge data sets are not analyzed while the supercomputer generates them (it is generally not cost-effective to use supercomputer cycles for human interaction). Rather, data sets are post-processed. Post-processing may involve user-driven visualization algorithms such as (in CFD) streamlines, streaklines, and examination of cutting planes, and may also involve off-line calculations such as vortex-core extraction. It is clear that data sets of hundreds of GBytes are too large to fit in the main memory of anything but a supercomputer. It is the rare installation that can afford supercomputer time for post-processing, and so these data must be disk-resident during analysis. But hundreds of GBytes is too large for local disk except on the most generously resourced server-class workstations. For extremely large single data objects (in the 500 GByte range) the data may not even fit entirely on remote mass storage! Coping with such extreme data sets is the venue of algorithms and architectures for managing big data objects.
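As a rough illustration of how such sizes arise, the following back-of-the-envelope sketch assumes single-precision (4-byte) values, five solution variables per node, and three grid coordinates per node. These are assumptions for illustration only; actual file formats vary.

    nodes = 32_000_000
    bytes_per_value = 4                              # single precision assumed

    solution = nodes * 5 * bytes_per_value           # 5 solution variables per node
    grid = nodes * 3 * bytes_per_value               # x, y, z coordinates per node

    print((solution + grid) / 1e9, "GB for one steady solution")      # about 1 GB
    print(1000 * solution / 1e12, "TB of solution data for 1000 time steps")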

2 Important differences among big data applications

To date a number of techniques have been published whose goal is to cope with the problems of big data objects. We abstract the common themes of these techniques and examine them in turn. However, we must first look at some differences between visualization (and data analysis) applications, because some properties of the application determine which techniques may or may not be productive. In general, the questions that must be asked of the application are:

- What is the data analysis model?
- Can the data be queried, or must they be browsed?
- Can the data be directly rendered, or must the data be algorithmically traversed?
- Are the data themselves static or dynamic? In particular, do the data comprise static fields or do they comprise dynamic fields calculated on demand by the application?
- Is there an appropriate algorithm for the dimensionality and organization of the data?
- How large is the data set?
- Is there a data analysis time budget?

2.1 Data analysis process – postprocessing vs. steering

There are generally two models or processes for data analysis, which we might call the postprocessing model and the computational steering model. Postprocessing is the more common model in large-scale computational simulation. A supercomputer or otherwise large and expensive machine simulates some phenomenon, taking advantage of fast floating point and extremely large main memory sizes, and writes data to mass storage. For time-varying simulations, only 1/10 or 1/30 of the time steps are actually stored (the other time steps are generally required as intermediate results in the simulation for high fidelity of the final results). Then, in a separate phase (today generally on a less expensive machine), the data are analyzed (post-processed).

Alternatively, parts of the research community have pursued techniques that allow the scientist to interact directly with the simulation, and even steer the computation. The reasoning is that the problem of "big data" can be avoided by not generating the data! Historically, scientists have peeked non-intrusively at the time step data as they have been written to disk, and have shut down simulations gone awry. Those who advocate computational steering generally envision more proactive (and intrusive) monitoring and modification of running codes [36]. Historically, scientists have consistently chosen to use supercomputer cycles for scientific computation rather than data exploration at (human) interactive rates. Although it seems unlikely scientists will suddenly change their views, there may yet be techniques discovered that do allow interaction with running simulations without consuming substantially more supercomputer cycles. It does seem very likely that as scientists find personal computers and workstations sufficient for their specific problems, techniques of computational steering will increasingly be used. On a single-user desktop machine it makes much more sense for the scientist to interact directly with the running simulation, and much less sense for the program to save data to disk for later postprocessing.

2.2 Applications that query vs. those that browse

A major difference between applications is in the kind of question asked during post-processing. When the questions and the form of the answers are well known in advance, it may be possible simply to extract the data-dependent answers with minimal user involvement. In fact, user involvement may be simply a query to the data for an answer that matches a specific question and specific parameters. When the data are not that well understood, when the field is not that well developed, or when algorithms to process queries do not yet exist, browsing or navigation must be employed.

Feature extraction and scientific data mining are two areas of research based on the query paradigm. A simple example is the extraction of locally maximum pressures; a more intricate example is vortex-core extraction from large CFD data sets. Feature extraction and data mining techniques tend to be more developed for non-research applications (for example, in the design of aircraft) than they are for research applications. In general, feature extraction and data mining work better for engineering than they do for science. It may be impossible to support the query paradigm when the questions or answers are not understood, but it may also be impossible if the algorithms to extract the requisite information do not yet exist. As a simple example, it may not even be obvious which summary statistics in a post-processed data set are of interest. A more interesting example is isosurface extraction. Segmentation algorithms are those that can extract interesting surfaces off-line. In medical work, algorithms to extract the surfaces of the kidney from CT data are known. These surfaces can be represented as triangle meshes stored for later viewing, and the original "big data set" can be set aside. Of course, if the static data extracted (the triangle meshes) are still too large for storage, rendering, and display, then those data may be amenable to off-line surface decimation and multiresolution techniques. On the other hand, some isosurfaces are not amenable to off-line segmentation. For example, an infinite family of isosurfaces may be present in the data; if it is not possible to determine in advance which of these are interesting, it is obviously not possible to extract them all off-line. As another example, segmentation algorithms for the heart in CT data are not known, and so it is not possible to extract the heart and store it as a set of surface meshes. In both examples it is not possible to store the extracted surfaces, and so the original data must be retained and manipulated interactively: the desired isosurfaces must be extracted on-line, interactively at user request. In addition, for both examples, surface decimation and level-of-detail algorithms are uninteresting since the data cannot be stored as surfaces.

In such cases, where off-line feature extraction, data mining, or segmentation is not possible, a common technology for browsing is exploratory visualization. Most visualization software and systems allow the user to apply very generic visualization techniques in order to understand the underlying data.

There are many decimation and level-of-detail techniques, e.g. multiresolution, for the representation of surface meshes. The current notes touch upon such techniques for 3-dimensional meshes in scientific visualization, but leave the summary of the rich graphics literature on surfaces to other sources.

While browsing is of course useful in engineering as well as in science, it is essential in the latter. The paradigm required by the application affects the management of data. Query-based paradigms in general allow significant off-line processing to be done so that answers to user questions can be delivered quickly; this off-line processing is in general possible because the questions and desired answers are known in advance. The requirement for on-line browsing makes off-line processing difficult: it may be unclear what questions to ask, and even when those are known, the algorithms to derive answers from the data may not yet exist.

2.3 Direct rendering of the data vs. algorithmic traversal

For some applications it is possible to render data directly, that is, to produce pixels of visualizations directly from the data. Volume rendering is one example: the user (or software) chooses a transfer function that maps data values directly to pixel colors, and during volume rendering the data are reconstructed, resampled, and mapped to pixel colors. Directly rendering 3D scenes (e.g. for architectural walkthroughs or virtual worlds) is another example. A common technique for data that can be rendered directly is to reduce the total amount of data actually touched or rendered, for example by down-sampling the underlying data and reconstructing and resampling based on output resolution and viewer position. Polygons of an architectural walkthrough may be aggregated into simpler polygons and these rendered instead if they are sufficiently far from the viewer – i.e. if only a few pixels of the initial polygons can be seen anyway. This is the standard level-of-detail approach followed in terrain rendering and flight simulators. These view-dependent and output-resolution-dependent techniques are possible because the rate at which the data must be sampled is driven directly by screen pixel size.

However, many visualization techniques employ algorithmic traversal of the underlying data. Frequencies in the data are not directly seen by the viewer – rather they are interpreted by algorithms. An obvious example is CFD particle tracing. A particle is injected into a field of velocity vectors and is integrated through the field as if it were smoke injected into the real flow. A particle may traverse the data set arbitrarily, and it is absolutely incorrect to restrict the particle's traversal by output screen resolution; if this were done the particle might well touch pixels significantly different from those it would otherwise have. A second example is the display of cutting planes through a data set. Consider a field derived by some nonlinear function of the raw data: even if a large section of the cutting plane resolves to a single pixel color over a few pixels, the derived field must be reconstructed and then resampled before that color can be correctly chosen. In these examples the results of traversal are user- and graphics-independent. The data are translated into graphics primitives by traversal, and these primitives are unrelated to screen resolution and viewer position. For such visualization techniques, where direct rendering of the data is not possible, it is not obvious that data reduction based on output resolution and camera position is possible.
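To make the distinction concrete, here is a hypothetical sketch of particle tracing as an algorithmic traversal: the cells visited depend only on the velocity data and the seed point, not on screen resolution or viewer position. A simple Euler integrator in computational space is used for brevity; real tracers use higher-order integration and physical-space coordinates.

    import numpy as np

    def trace(velocity, seed, dt=0.1, steps=200):
        """Integrate a particle through an (nx, ny, nz, 3) velocity field and
        return the set of cells it touches."""
        pos = np.array(seed, dtype=float)
        shape = np.array(velocity.shape[:3])
        touched = set()
        for _ in range(steps):
            cell = tuple(np.clip(pos.astype(int), 0, shape - 1))
            touched.add(cell)
            pos = pos + dt * velocity[cell]          # Euler step driven by the data
            if np.any(pos < 0) or np.any(pos >= shape):
                break
        return touched

    field = np.random.normal(size=(32, 32, 32, 3))   # synthetic velocity field
    print(len(trace(field, seed=(16.0, 16.0, 16.0))), "cells touched")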

2.4 Static vs. dynamic fields

We have already noted that most big data objects are the result of computational simulation. Simulations in general tend to write as few parameters as possible to mass storage. Given the choice between writing out parameters that can later be derived and writing out (say) more time steps of an unsteady computation, scientists in general choose to write out more time steps. We refer to the parameters actually stored as static fields and those that must be computed during post-processing (at browse or feature-extraction time) as dynamic fields or derived fields. Examples of static fields from CFD are density, momentum, and energy (5 fields); during post-processing it is common to derive vorticity, pressure, and more than 50 additional fields. Derived fields may be linear functions of the raw data, but most are nonlinear functions of the underlying static fields. Therein lies the rub, as nonlinear derived fields present great difficulty for many of the data reduction techniques that have been proposed by the research community. For example, multiresolution methods generally store integrated (average) values at lower resolutions, and some authors propose that these lower-resolution data sets can be traversed directly for visualization. However, a derived value over the average of a field is not the same as the average of the derived values over the same field! In particular, consider a velocity field and the derived vorticity (the curl of velocity). If vorticity is calculated over a lower-resolution field of average values, the result is very different than if vorticity were calculated over the raw underlying field and then averaged to produce a lower-resolution data set. Evidently the original data must be reconstructed before the nonlinear derived field is calculated! Schemes that do not reconstruct the original data (or that do not otherwise solve this difficult problem) before traversal by the visualization algorithm provide the wrong answers to their users. From another viewpoint, schemes that work only on static fields in CFD (5 fields) work on less than 10% of the fields of interest to the CFD scientist (5/55). From yet another viewpoint, this problem exists regardless of the resolution of the output device, and regardless of the user's camera position during the visualization.
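A small numeric illustration of this pitfall follows. For brevity it uses kinetic energy (proportional to |v|^2) rather than vorticity as the nonlinear derived quantity, and random velocity samples stand in for a real field; both choices are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    v = rng.normal(size=(8, 3))                  # eight full-resolution velocity samples

    derived_full = 0.5 * (v ** 2).sum(axis=1)    # derive at full resolution ...
    avg_of_derived = derived_full.mean()         # ... then average

    v_avg = v.mean(axis=0)                       # average first (multiresolution node) ...
    derived_of_avg = 0.5 * (v_avg ** 2).sum()    # ... then derive

    print(avg_of_derived, derived_of_avg)        # the two differ, often substantially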

2.5 Dimensionality and organization of data

Aside from the more fundamental differences between application requirements of the previous sections, there are practical differences as well. There are of course natural differences between applications in the dimensionality of the data (1D, 2D, 3D, etc.), but there are also more (perhaps) artificial differences in the stored organization of the data. Algorithms designed for feature extraction or interactive data browsing are in general targeted at specific user problems. Algorithms tend to work with restricted dimensionality of data (e.g. 3D only) and in general tend to work with specific schemes of data storage (e.g. on regular grids only). Some of the different organizations are defined briefly below. Always check the type of data organization for which a particular big data algorithm has been designed to work.

There is a fundamental difference between data sets with implicit storage and addressing and those with explicit storage and addressing. In implicitly addressed data, the relationships between vertices, edges, and faces are implicit in the data structure. Finding a vertex, edge, or face can be done with a deterministic address calculation, usually by calculating an offset into a multi-dimensional array. In explicitly addressed data, the relationships between vertices, edges, and so on must be stored explicitly. Finding a vertex, edge, or face usually requires traversal of the data (in particular, by following pointers either in memory or on disk).

In CFD in particular, there is a distinction between the address space in which the data are manipulated and the address space of physical (Euclidean) space. The former is generally referred to as computational space and is in terms of the storage data structures, for example three indices into a 3-dimensional array of data. The latter is generally referred to as physical space and is in terms of coordinates in Euclidean space, for example three floating point values representing a point in 3-dimensional space.

A regular grid usually denotes a multi-dimensional array of the underlying data, where storage and addressing are implicit. Rectilinear grids in medical imaging are regular grids. Addressing in a regular grid in computational space is typically done with a tuple of three indices into a multi-dimensional array. However, a grid that is regular in computational space may not be "regular" in Euclidean 3-space. Curvilinear grids in CFD are generally represented as a pair of regular grids: parameter values are stored in a 3-dimensional array, while a node-by-node mapping from computational space to physical space is stored in a separate 3-dimensional array. Irregular grids store and address the cells of a data set explicitly. The most common irregular grid comprises lists of the vertices and edges of the grid (edges usually specified by reference to an array of the vertices), and the faces of the tetrahedra of the volume if the data are 3-dimensional.
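The following sketch illustrates the storage distinctions above: implicit addressing by offset calculation for a regular grid, a curvilinear grid represented as a pair of regular arrays (solution values plus a computational-to-physical mapping), and explicit vertex and connectivity storage for an irregular grid. All array names and sizes are illustrative assumptions.

    import numpy as np

    ni, nj, nk = 64, 32, 16

    # Regular grid: implicit storage and addressing in computational space.
    density = np.zeros((ni, nj, nk), dtype=np.float32)
    def flat_offset(i, j, k):
        return (i * nj + j) * nk + k               # deterministic address calculation

    # Curvilinear grid: a pair of regular arrays -- parameter values plus a
    # node-by-node mapping from computational space (i, j, k) to physical (x, y, z).
    xyz = np.zeros((ni, nj, nk, 3), dtype=np.float32)

    # Irregular (here tetrahedral) grid: explicit storage and addressing.
    vertices = np.zeros((1000, 3), dtype=np.float32)   # physical coordinates
    tets = np.zeros((4000, 4), dtype=np.int32)         # four vertex indices per tetrahedron

    print(flat_offset(3, 4, 5), xyz.shape, vertices.shape, tets.shape)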

2.6 How large is the data set – quantity becomes quality

In many fields the domain scientist traditionally insists that errors "cannot be" introduced by data analysis (in particular by data traversal and visualization). The computer scientist has historically insisted that data analysis and visualization "must be" interactive. However, as data sets have grown beyond anything remotely manageable with previous techniques, it is interesting to see examples in both communities where such "hard" requirements have softened. On the ASCI program in particular, domain scientists profess that lossy data analysis techniques are of interest. And there are numerous examples (in the literature and in available software) of computer scientists who offer systems for data visualization at less than 5 Hz. Acceptable error and acceptable non-interactivity are functions of data size. When all data fit in main memory, both scientists agree that error must be low and interactivity high. As the data spill over onto local disk, and then onto a RAID disk farm, the computer scientist tends to accept non-interactivity. As the data spill over onto a RAID disk farm, and then onto the mass storage system that is typically only affordable at a government lab, the domain scientist begins to accept error.

We have not yet discussed the techniques that have been applied to reduce the effective size of extremely large data sets, but we can discuss their potential tradeoffs in terms of data reduction efficacy and potential error introduced. An estimate of both, based on reports in the literature and on some guesswork, appears in the following table.

Techniques                   Data reduction potential    Error introduced
Memory hierarchy             2x - 100x                   0
Indices                      2x - 100x                   0
Write-a-check                10x - 50x                   0
Compression                  10x - 100x ??               Arbitrary
Computational steering       Arbitrary ??                0 ??
Feature extraction           Arbitrary                   0 ??
Multi-resolution browsing    Arbitrary                   Arbitrary
View-dependent techniques    Arbitrary                   Arbitrary

The techniques grouped at the top of the table have zero error, but we can only expect them to reduce data size by some constant. For example, combining best-case expectations for the first three techniques, we may get a 250x reduction in data size. ASCI today produces data sets of 10 TBytes; it is clear that these fixed-reduction techniques are not sufficient. This is exactly why domain scientists are willing to consider lossy data reduction strategies. There are two techniques in the table that are claimed to offer arbitrary data reduction (computational steering and feature extraction); in general, the error they introduce into the data is unknown (such an evaluation must be domain-specific). The two techniques at the bottom of the table provide potentially arbitrary data reduction, with arbitrary error. These two techniques are active areas of research – presumably because they do offer arbitrary data reduction for projects such as ASCI that require something better than fixed-reduction approaches. One shortcoming of much of the published research on these two techniques, however, is that the error introduced is generally not characterized. In fact, very few tools have been developed to help the researcher characterize error in a new data reduction algorithm. Borrowing from Pang ([35]) we can define three spaces in which error might be characterized and quantified:

- Image-level
- Data-level
- Feature-level

To characterize and quantify error, we must compare the data-reduced set with the original data set. If the comparison is at the image level, we compare the images that result from visualization. There are several methods we may use to compare images:

- Simple visual inspection (unfortunately too common in the literature).
- Summary statistics of the images (e.g. RMS).
- Taking and evaluating the difference between the two images.
- Transforming both images and comparing in the transform space (e.g. Fourier or wavelet analysis).

If the comparison is at the data level, we compare the reduced data with the original data. Again there are several methods that have been reported:

- Summary statistics (again, e.g. RMS).
- Taking and evaluating the differences between the two data sets.

Finally, if the comparison is at the feature level, we compare the features that can be deduced from the data-reduced and the original data. The comparisons may be:

- Domain-specific feature comparisons (e.g. vortex cores from Computational Fluid Dynamics), or
- Domain-independent features (e.g. isosurface comparisons).

While it is difficult to characterize exactly the "information" that has been lost via lossy data reduction, more should be done in this area than has been. Feature-level comparisons are probably better than data-level comparisons. Data-level comparisons are probably better than image-level comparisons. Any quantitative metrics are almost certainly better than simple visual comparisons (especially when those comparisons are not performed by the domain scientist).
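As a small illustration of data-level comparison, the following sketch computes RMS and maximum absolute differences between an original field and a reconstruction from a toy reduction (2x block averaging followed by replication). The reduction scheme and the synthetic data are assumptions chosen only to make the metrics concrete.

    import numpy as np

    def rms(a, b):
        return float(np.sqrt(np.mean((a - b) ** 2)))

    original = np.random.rand(128, 128, 128).astype(np.float32)

    # Toy reduction: average 2x2x2 blocks, then reconstruct by replication.
    reduced = original.reshape(64, 2, 64, 2, 64, 2).mean(axis=(1, 3, 5))
    reconstructed = np.repeat(np.repeat(np.repeat(reduced, 2, 0), 2, 1), 2, 2)

    print("RMS error:        ", rms(original, reconstructed))
    print("max |difference|: ", float(np.abs(original - reconstructed).max()))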

2.7 Data analysis time budget

The Department of Energy's Accelerated Strategic Computing Initiative (ASCI) produces and plans to produce what are quite probably the largest data sets in the world, and produces and plans to produce more of these than anyone else. Today's "average" computer simulation generates 350 GBytes of data, the large simulations currently generate up to about 10 TBytes, and it is expected that in 2004 the average simulation will generate over 12 TBytes of data [19]. The magnitude of these data results in a qualitative difference in data analysis.

The architecture of ASCI's production system (following Heermann [18]) is shown in Figure 1. Note first about this architecture that there are no details of visualization data flow itself! The ASCI architecture represents a pipelined production environment that encompasses not only visualization but the analysis required before new simulations are begun, and the supercomputer simulations themselves. An important point that Heermann makes about the ASCI architecture and requirements is that "entire system throughput is as important as the efficiency of a single system component." In particular, techniques may not improve the interactivity of visualization (i.e. for browsing) at the expense of total throughput through the visualization stage! This creates another difference between applications: "interactivity vs. total processing time/cost." For many applications, computer scientists focus on techniques that may be expensive in off-line processing time but that result in better interactive on-line processing. For the ASCI application, this focus on interactivity of visualization is shifted to a focus on high bandwidth and low latency through the visualization stage of the ASCI cycle.

3 System architectures

The abstract data flow from simulation output to visualization image is shown in Figure 2.

Figure 1: The ASCI system architecture (a pipelined cycle of analysis, simulation, and visualization).

Figure 2: Abstract data flow in visualization of big data (simulation or data acquisition produces a large data set, which is either traversed to generate geometry, e.g. triangles, that is then rendered, or traversed with a transfer function and rendered directly, to produce an image).

The process begins with data acquisition, in general the output of computational simulation. This generates a very large data set which is written to mass storage. The data may then be accessed for visualization in one of two scenarios. On the left in Figure 2, an algorithm traverses the data and generates geometry, which is then rendered to an image (or animation) and displayed. This is the standard scenario in which polygonal surfaces are generated for rendering by graphics hardware. On the right in Figure 2, data are accessed and rendered directly. This is the standard scenario for volume rendering, where the data are "displayed directly".

The traversal rate may or may not be coupled to the rendering rate. For example, the data set may be traversed off-line, and geometry generated and stored for later perusal. As another example, traversal may generate an isosurface that the user examines and manipulates for many frames before requesting a new isosurface. The rendering rate may also be decoupled from the image display rate: several researchers are exploring the idea of generating many images from the data, from different points of view and at different resolutions, and then allowing exploration by image interpolation (or reprojection). This is an application of the now popular idea in the graphics community of "image-based rendering".

Given these scenarios, the architectural question is "where are the partitions drawn between different machines?" For example, are all steps from data to pixels computed on a single machine? Is data traversal done on one machine and rendering on another? Or perhaps data reduction allows data to be accessed remotely from mass storage, and traversed and rendered on a desktop workstation?

3.1 Supercomputer with the graphics workstation

Perhaps the oldest visualization system architecture is shown in Figure 3. This architecture was employed at NASA Ames Research Center (and most likely at other sites as well) around 1985 or 1986. The process begins with simulation, which generates very large data sets and writes them directly to mass storage. Simulation is on the supercomputer, as are the big fast disks to which the very large data sets are written. In this architecture the supercomputer also traverses the data when the scientist wishes to perform post-processing. This traversal generates geometry, which is shipped via a fast network to a graphics workstation whose job is primarily to provide fast rendering. This architectural model provided the initial demand for SGI graphics workstations. Architecturally, this is an expensive solution, using supercomputer time for what has more recently become possible on high-end graphics workstations.

Figure 3: System architecture of supercomputer / workstation (simulation, big fast disks, and traversal on the supercomputer; geometry shipped over a fast network to the workstation for rendering).

3.2 Supercomputer with the heroic workstation

Until recently the most common high-end visualization architecture combined a supercomputer with a "heroic workstation", as shown in Figure 4. In this architecture, the user "writes a check" for the largest high-end workstation that can be purchased (or perhaps only afforded). The supercomputer completes its simulation and writes data to its own mass storage. Typically the data are copied to the high-end workstation's own disks (which can be formidable – sometimes a high-end graphics workstation is a data server in its own right – often the disks are large RAID arrays). The workstation then performs traversal, generation of geometry, and rendering.

Figure 4: System architecture of supercomputer / heroic workstation (data copied over a fast network to the workstation's own big fast disks; traversal, geometry generation, and rendering on the workstation).

This solution requires a high-end graphics workstation that is a "complete" package: it must support large disk and memory configurations, both in capacity and bandwidth, while also supporting graphics rendering as fast as (or faster than) the fastest desktop workstation. The disk and memory capacity and bandwidth this package must support generally match commercial server capabilities; the graphics capabilities have historically exceeded those on the desktop. Heroic workstations in this class have historically been extremely expensive, and the budgets to procure such machines are decreasing. This combines with extremely competitive (and cheap) PC desktop graphics to put pressure on the "complete" package. We shall see over the next several years whether this historically "complete" machine can maintain market viability.

However, note that even in this architecture, which relies on brute capacity and bandwidth to solve the big data problem, there is opportunity for data reduction. Between the supercomputer mass storage and the high-end workstation mass storage, data can be compressed. Even within the workstation, if compressed data could be traversed directly, or if pages of compressed data were read from disk and decompressed into memory, there would be savings at least in disk footprint and bandwidth. Most of the opportunities discussed later in these notes are applicable even to the "write-a-check" architecture.

Figure 5: System architecture #1 supercomputer / commercial server / workstation PC (traversal on the commercial server; geometry sent over a fast network to the workstation PC for rendering).

3.3 Supercomputer, commercial data server, workstation PC

The recent advent of very fast but inexpensive graphics workstation PCs, combined with decreasing budgets, has driven many visualization workers to alternative architectures. The foundation of these new architectures is the workstation PC on the desktop. The difficulty of these new architectures for very large data sets is a decrease in memory and disk capacity and bandwidth. The "complete package" of the "heroic workstation" – fast graphics, big fast disks, big fast memory – is not available in the PC workstation marketplace, and it seems likely that there will never be sufficient market to sustain a "complete package" based on commodity PC components. New architectural solutions to the problems of big data must be found. Two speculative architectures that take advantage of commodity components are shown in Figures 5 and 6.

In Figure 5 the data set is moved from the supercomputer to a commercial server with large capacity and RAID-class bandwidth. Current CPUs used in commercial servers compete with those of high-end graphics workstations, and so both data serving and calculation (traversal) are possible on these servers. In this architecture the commercial server generates geometry that is sent over a fast network to a desktop workstation with fast graphics. Fast, big-capacity commercial servers are on the market today, as are workstation PCs with graphics capability that is up to the requirements of scientific visualization. The biggest component risk in this architecture is currently the network between the server and the desktop. Our own shopping experience within the Data Analysis group at NASA Ames has been that the networking bandwidths required between server and workstation PCs are available but are not commodity.

Figure 6: System architecture #2 supercomputer / commercial server / workstation PC (data reduction on the commercial server; data reconstruction, traversal, and rendering on the workstation PC).

An alternative to the architecture of Figure 5 is shown in Figure 6. This architecture again takes advantage of fast commercial servers and fast desktop graphics workstations, but provides data from the server to the desktop rather than geometry. It also employs any and all data reduction techniques on the data before shipping these across the network to the desktop. This architecture is clearly only as good as the data reduction techniques between server and desktop; data reduction is the topic of the remainder of these notes. Of course, data reduction is not restricted to this architecture or to the path between server disk and workstation memory: it may be employed to reduce the footprint of data storage on any disk, and may be employed as well between the supercomputer and the server.

4 Techniques

Eight techniques for coping with very large data sets can be identified:

- Memory hierarchy. These techniques share the property that they treat very large data sets as a virtual space that need not be memory-resident.
- Indexing. These techniques organize the data so that requisite data can be found and retrieved quickly.
- Write-a-check. While most researchers and practitioners are increasingly constrained by budget, there are applications for which the data are so large and budgets large enough to mitigate the problem of big data by buying the biggest systems available for data analysis.
- Computational steering. These techniques are currently more research than practice. Different definitions of computational steering appear in the literature and in workshop discussion, but the general idea is to avoid generation of large data sets by data analysis and "discovery" made during the computation.
- Compression. These techniques attempt to reduce the data to a smaller representation, either with loss (lossy) or without (lossless).
- Multiresolution browsing with drill-down. These techniques apply now-popular methods to represent and manipulate data hierarchically. The idea is that higher levels of the hierarchy retain important information but are smaller and easier to manipulate.
- Feature extraction and data mining. These techniques enable on-line queries by providing off-line processing that extracts relevant features or information from very large data sets. The idea is generally that the results of off-line processing are smaller and easier to manipulate.
- View-dependent techniques. These techniques share the property that they attempt to reduce arbitrarily large data sets to within some constant factor of the number of pixels on the visualization screen.

In the following subsections, each of these techniques is discussed in turn. Examples from the literature are used to demonstrate the technique and to help illuminate and enumerate the alternative approaches that may be employed. However, please note that the specific papers cited are intended to be illustrative of specific features of each technique, and the total collection of papers is not intended as a comprehensive survey of the literature.

4.1 Memory hierarchy

The memory hierarchy is a useful abstraction for developing systems solutions to the problem of big data. At the top of the memory hierarchy is the most expensive but fastest memory (e.g. registers on the CPU). Below this is less expensive but slightly slower memory (e.g. first-level cache), and so on (second-level cache, main memory, local disk, remote disk, remote tape). Data may be stored and retrieved in blocks. When the blocks are variable-sized they are referred to as segments; when they are fixed-size they are referred to as pages. Segments may be further organized into pages (paged segments). The idea is to retrieve only the segments or pages (or pages of segments) that will be needed by analysis or visualization, thus saving the memory that would otherwise be required for the whole data set (demand-driven access). This approach can also save memory and disk bandwidth. Ideally, a good demand-driven paging or segmentation strategy does not increase the footprint of the data on disk. Not all strategies for segmenting the data are low-overhead in terms of disk footprint. Doubling a 100-GB data set so that it can be analyzed on a low-end workstation is perhaps acceptable for some environments and applications, but it is clearly far superior to allow analysis by a low-end workstation with only a small increase in mass storage requirements. There are some applications for which doubling the data set is simply not feasible for long-term storage. On the other hand, there are also applications for which doubling the data set may be far more acceptable than doubling the pre-processing time (e.g. ASCI). In addition, storage organization and proper selection of the parameters of paging (such as page size) are important details requiring attention in memory hierarchy implementations.

Demand-driven strategies may be combined with judicious scheduling so that other work may be done while the data are read from disk (pipelining), and it may be possible to predict accesses so that requests to load data can be issued while previous data are still being processed (prefetching). Demand-driven paging and segmentation are most efficacious when not all of the data are required (sparse traversal). Sometimes sparse traversal is inherent in the visualization (or analysis) algorithms (e.g. particle tracing in CFD); sometimes algorithms can be designed explicitly for sparse traversal of the data (e.g. some isosurface algorithms discussed later in these notes). Derived fields are a particular difficulty for paging strategies, since most applications and tools simply pre-compute all derived data that will be needed (in particular, assuming that the entire data set fits in local memory). Lazy evaluation is a technique that may help manage big derived fields.

Examples of memory hierarchy techniques for big data can be further classified.

Sequential segments for time-varying data: Lane (who now goes by the name Kao) employed one segment per time step for unsteady flow visualization [22].

Paging using operating-system facilities: Naive reliance on operating-system virtual memory to manage very large data sets is demonstrably bad. The UNIX system call mmap() potentially offers an alternative and has been explored for CFD post-processing by [15] and [9]. Both report that it results in better performance than simple reliance on virtual memory; the latter reports that it is inferior to a user implementation that manages disk I/O and memory explicitly (as was demonstrated for database implementations 20 years ago).
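As a concrete illustration of demand-driven access through operating-system facilities, the following sketch uses numpy.memmap (which relies on mmap() underneath). The file name, field layout, and grid dimensions are illustrative assumptions, not a description of any particular system cited above.

    import numpy as np

    NX, NY, NZ = 128, 128, 128

    # Set-up only, so the sketch is self-contained: write one solution field as a
    # flat binary file of float32 values.
    np.arange(NX * NY * NZ, dtype=np.float32).tofile("pressure.f32")

    # Demand-driven access: the operating system reads only the pages that the
    # traversal actually touches, not the whole file.
    field = np.memmap("pressure.f32", dtype=np.float32, mode="r", shape=(NX, NY, NZ))

    # A sparse traversal -- here, sampling the single slab i = NX/2 -- touches
    # only the contiguous pages backing that slab, roughly 1/NX of the file.
    slab = np.asarray(field[NX // 2])
    print(slab.mean())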

Application-controlled paged segments: The addition of paging to the segment-management algorithm of [22] was found by [9] to result in good performance that does not significantly degrade on smaller-memory machines. In more recent work at NASA Ames, we have achieved interactive visualization rates by paging segments of unsteady data.

Lazy evaluation of derived fields: Globus first explored lazy evaluation of derived fields, and found it an effective mechanism to reduce memory requirements while visualizing derived values [15].

Lazy evaluation of derived fields + caching: Moran further quantified the gain and extended this work to include caching of derived fields [29]. From that paper, we note that "caching can improve the performance of a visualization based on a lazy field that is expensive to evaluate, but ... can hinder the performance when evaluation is cheap."

Sparse traversal: The paging, caching, and lazy evaluation techniques all take advantage of the fact that many algorithms in visualization sparsely traverse the data set. A viable research direction in managing big data is clearly to search for algorithms that result specifically in sparse traversal.

Indexed pages: One means of achieving sparse traversal is to provide an index into the pages of the data, so that exactly the pages required can be found and retrieved. This approach has been used explicitly by some (cf. [43], [24]) and implicitly by others (cf. [4]).

Paged index: If the index itself is too large, it may also be paged (as was done implicitly by [4]).

Paging in data flow architectures: A common problem with data flow visualization architectures is that they have exorbitant memory requirements. Song showed that if the granularity of modules is made sufficiently small, data flow architectures can be significantly more memory-efficient [40]. Schroeder has extended this basic approach and developed an architecture and working system for paging in the Visualization Toolkit [37].
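The following sketch illustrates the idea of lazy evaluation of a derived field with optional caching, in the spirit of the work cited above. The class, the field names, and the kinetic-energy derivation are hypothetical, chosen only to make the mechanism concrete.

    import numpy as np

    class LazyDerivedField:
        """Compute a derived value only when a cell is actually touched, and
        optionally remember it. Caching pays off only when derive() is expensive."""

        def __init__(self, static_fields, derive, cache=True):
            self.static = static_fields        # dict of (possibly memory-mapped) arrays
            self.derive = derive               # function(static_fields, index) -> value
            self.cache = {} if cache else None

        def __getitem__(self, index):
            if self.cache is not None and index in self.cache:
                return self.cache[index]
            value = self.derive(self.static, index)
            if self.cache is not None:
                self.cache[index] = value
            return value

    # Example derived field: kinetic energy per unit volume from density and momentum.
    def kinetic_energy(static, idx):
        rho = static["density"][idx]
        m = np.array([static["momentum_x"][idx],
                      static["momentum_y"][idx],
                      static["momentum_z"][idx]])
        return 0.5 * float(np.dot(m, m)) / float(rho)

    fields = {name: np.random.rand(16, 16, 16).astype(np.float32) + 0.1
              for name in ("density", "momentum_x", "momentum_y", "momentum_z")}
    ke = LazyDerivedField(fields, kinetic_energy)
    print(ke[(4, 5, 6)])    # computed on demand; cached for later requests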

4.2 Indexing

Indexing is a technique whereby the data are pre-processed for later accelerated retrieval. Any search structure can serve as an index, but the standard search structures have been octrees, k-d trees, interval trees, and home-brewed data structures in both 3 and 4 dimensions. When applied to the problem of big data, an index may allow sparse traversal of the data, thereby saving memory space and bandwidth and saving disk bandwidth. Some indices significantly increase the storage requirements of the data, and thus are less desirable for the management of big data; those indices that require little storage are obviously preferable for most environments.

Adapting a classification scheme from Cignoni [6], we can distinguish three general types of indices. Space-based methods partition physical space. Value-based methods partition the parameter values at the nodes (or cells) of the data set. Seed-based methods provide "starting points" for run-time traversal. To date, the only seed-based methods evident in the literature appear to be for isosurface generation. Most of the work on indices for scientific data has (perhaps unfortunately) been for volume rendering and isosurface generation. Cignoni claims that space-based methods for isosurfaces are better over regular grids because spatial coherence can be exploited, while value-based methods for isosurfaces are better for irregular grids where spatial coherence is difficult to exploit [7]. However, it is clear from the literature that the value-based methods increase the size of the data set by a factor of 2x or more, while the space-based methods increase data set size by perhaps 15%. The seed-based methods appear comparable to the space-based methods as measured by increase in data set size; increases reported range from 1% to 8%.

4.2.1 Coping with large indices

When the size of the index is too large, it may be possible to build the index over clusters of cells (pages or segments) rather than over individual cells themselves. This approach not only may reduce index size, but also has the advantage that the pages or segments themselves may be compressed for storage on disk. Pages may be reconstructed during traversal, or in some cases visualization algorithms may be applied to compressed data directly (this is discussed in the sections on compression and multiresolution). Other applications of an index over pages or segments of data include user browsing with later drill-down, and progressive refinement of network transmission or retrieval from disk. Browsing over the index may allow the user to find areas of interest within the data set, with subsequent requests to retrieve and examine the underlying data (drill-down). Browsing may also be enhanced by progressive refinement, whereby index (summary) data are displayed while the user moves through the data set, but the data themselves are retrieved and displayed when the user stops browsing. (Both techniques are discussed further in the section on multiresolution.) An alternative approach to cope with indices that are too large to be memory-resident is to page the index itself. This may be done explicitly, or implicitly by storing the index directly with the data.

4.2.2 Value and seed indices for isosurface extraction

Isosurfaces have been the goal of most development in value and seed indices. When the isosurfaces are known a priori, it is more efficacious to extract those of interest off-line (e.g. kidney or bone segmentation) and store the triangulated surfaces. However, when the underlying data have continuous isosurfaces with no obvious segmentation, the user must browse the isosurfaces of the data set to find features of interest. The idea behind the value and seed indices is that such browsing can be made more interactive by building an index off-line that can accelerate isosurface generation when new values are requested. Acceleration comes primarily from sparse traversal of the underlying data. In general, most techniques have been developed so that when the user requests an isosurface value incrementally different from the last isosurface value requested, the new isosurface can be generated quickly. That is, most techniques have been developed to exploit coherence between isosurface requests.

The value index algorithms work essentially by building a data structure that maintains two lists and the relationships between them (the minimum and maximum ranges for the underlying cells of the data). The seed index algorithms work by providing the seed cells from which all cells that contain the isosurface can be found for any isosurface query.

Problems with value and seed indices: As previously noted, the value index data structures explode data size, while the seed index data structures increase data set size more parsimoniously. A critical problem with indices in general, and with isosurface indices in particular, is that they provide sparse traversal only on a single parameter (i.e. the range of a single value within each cell). Some data sets store 50 parameters (e.g. typical data sets generated at Lawrence Livermore for the ASCI project) and would require an index for each one. A second problem with indices in general, and with isosurface indices in particular, is that they do not provide sparse traversal over derived parameters (in any obvious way). In the data sets commonly in use at NASA Ames, only 5 parameters are stored, and on a regular basis another 50 are derived at run-time.

4.2.3 Examples of indexing from the literature

Spatial indices – sparse traversal in 3D and 4D: Spatial indices work by organizing the points in physical space. Wilhelms applied an octree to data volumes, storing with each internal node the min/max interval spanned by the sub-volume. This allowed efficient pruning during isosurface generation and volume rendering while increasing storage by about 15% [45]. Wilhelms later extended this work to handle time-varying data [46]. Others have since applied octrees and other spatial data structures in similar ways (cf. [27]).

Value indices – sparse traversal in 3D and 4D: Value indices for isosurface generation have a rich literature. Most of the methods work essentially by storing two lists over the cells of the data set (minimum and maximum values) along with references between them. Finding an isosurface is then reduced to the problem of looking up the cells whose range covers the desired value. Most of the work has been in 3D ([14], [13], [26], [39], [6], [4]). Shen appears to be the first to extend value indices to 4D [38]. As previously mentioned, storage for a value index over the entire data set increases storage requirements by at least a factor of 2x. The 4D approach of [38] might be profitably combined with the approach of [4] to page the value index itself; however, work that does so has evidently not appeared in the literature.

Seed indices – sparse traversal in 3D and 4D: Seed indices appear to be more promising as data-reducing data structures than value indices. The idea is to store only the subset of the cells from which all other cells may be found (the "seed" cells) [20], [1]. The storage requirements of Itoh are not explicitly noted in the paper, but appear to be in the 10% to 20% range [20]. Bajaj in particular shows storage requirements that increase by only 1% to 8% [1].

Indexing over pages: Ueng built an octree over segments of the data, using values at intermediate nodes for reduced-resolution volume rendering [43]. The system allowed the user to drill down into areas of interest based on the results of the reduced-resolution images. The paper unfortunately does not report the increase in storage requirements for this spatial indexing. Leutenegger explored the same idea with an R-tree of the underlying segments and claimed that it would give better performance than an octree (but did not provide experimental results on real data sets) [24]. However, the scheme requires about 2.5x the disk space for storage of the tetrahedra in unstructured grids (though only about 12% more storage space for vertices).

Paged index: Chiang appears to be the first to build a value index over pages rather than over the individual cells [4]. The results appear promising, reducing the index size as a function of the underlying page size. This work also implicitly "paged the page table" (i.e. paged a value index for isosurface generation) by intermixing the value index with the data. The reported results are also promising in terms of the reduction of I/O, that is, the sparsity of traversal achieved.
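A minimal sketch of the value-index-over-pages idea follows: it records only the minimum and maximum of each block of cells, so that an isosurface query needs to touch only the blocks whose range spans the requested value. The block size, the synthetic field, and the data layout are assumptions for illustration, not a reconstruction of any cited system.

    import numpy as np

    PAGE = 8  # cells per page along each axis (an assumed, tunable parameter)

    def build_page_index(field):
        """Record the (min, max) of every PAGE^3 block of the field."""
        nx, ny, nz = field.shape
        index = {}
        for i in range(0, nx, PAGE):
            for j in range(0, ny, PAGE):
                for k in range(0, nz, PAGE):
                    block = field[i:i + PAGE, j:j + PAGE, k:k + PAGE]
                    index[(i, j, k)] = (float(block.min()), float(block.max()))
        return index

    def pages_spanning(index, isovalue):
        """Only pages whose [min, max] range spans the isovalue can contain the
        isosurface; the others need never be read from disk."""
        return [key for key, (lo, hi) in index.items() if lo <= isovalue <= hi]

    # Smooth synthetic field: an isosurface of x^2 + y^2 + z^2 is a sphere.
    x, y, z = np.meshgrid(*[np.linspace(-1.0, 1.0, 64)] * 3, indexing="ij")
    field = (x * x + y * y + z * z).astype(np.float32)

    index = build_page_index(field)
    touched = pages_spanning(index, 1.0)
    print(len(touched), "of", len(index), "pages can contain the isosurface")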

4.3 Write-a-check

For some applications, the size of the problem is so large that even when data analysis is done on a cluster of the largest workstation-class machines commercially available, there is still a large impedance mismatch between data size and memory capacity. An instructive example can be found in recent work by Painter et al. at Los Alamos [34]. They used roughly $10M of equipment to demonstrate volume rendering of 1 billion cell volumes at 3 to 5 Hz. The equipment comprised a 16-pipe RealityMonster, 30 fiber-channel disk interfaces, and collections of 72 GB RAID disks (supporting bandwidth of 2.4 GB/s to a single file). Some comparison of scales is useful to understand the size of the data problem that the ASCI project has undertaken. $10M is to 10 TB as a $10K workstation is to 10 GB. And even if bandwidth scaled linearly (which it does not), this would imply $10K desktop bandwidth of 2.4 MB/s. So, even with $10M of equipment, the ASCI project has at least as large a problem as workers analyzing supercomputer output on a standard workstation.
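The scale comparison can be written out explicitly using only the figures quoted above; no additional data are assumed.

    # Dollars, bytes, and bytes/s for the $10M configuration described above, and
    # the equivalent figures for a $10K workstation if capability scaled linearly
    # with cost (bandwidth, in particular, does not scale this way in practice).
    asci_cost, asci_data, asci_bandwidth = 10e6, 10e12, 2.4e9
    desktop_cost = 10e3

    scale = desktop_cost / asci_cost                       # 1/1000
    print(asci_data * scale / 1e9, "GB per $10K")          # 10 GB
    print(asci_bandwidth * scale / 1e6, "MB/s implied")    # 2.4 MB/s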

4.4 Computational steering

There are a number of techniques that have been labeled "computational steering". At the most basic level, non-intrusive taps into running code periodically write out time steps for perusal by the scientist; if the simulation has gone awry, the scientist can cancel the job, fix the problem, and restart. A somewhat more sophisticated version of computational steering envisions direct taps into running code so that analysis tools can analyze the progress and direction of the simulation. In the most advanced vision, the data analysis tools may be used to modify parameters of the running simulation and actually change its course.

As has already been noted, scientists have historically chosen to apply supercomputer cycles to simulations (rather than to interactive sessions), and infrastructure and funding have also supported this paradigm. It is not clear that usage patterns of supercomputers will change any time soon. However, what was once possible only on a supercomputer has become progressively more feasible on a desktop workstation (and even a PC workstation). It is likely that advances in computational steering will be utilized first (and possibly only) by the scientist doing simulations directly on his or her own desktop computer.
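As an illustration of the most basic level described above, the following sketch shows a simulation loop that periodically dumps a time step for non-intrusive inspection and checks for an externally created stop file, so that a run gone awry can be shut down. The file names, save interval, and toy update rule are assumptions for illustration only.

    import os
    import numpy as np

    SAVE_EVERY = 10       # write every 10th time step for perusal
    STOP_FILE = "STOP"    # the scientist creates this file to cancel the run

    def run(state, step_fn, nsteps):
        for n in range(nsteps):
            state = step_fn(state)
            if n % SAVE_EVERY == 0:
                np.save(f"step_{n:06d}.npy", state)   # non-intrusive tap: peek at these
            if os.path.exists(STOP_FILE):             # simulation gone awry: shut down
                break
        return state

    # Toy "simulation": repeatedly smooth a random field.
    def smooth(s):
        return 0.25 * (np.roll(s, 1, 0) + np.roll(s, -1, 0) +
                       np.roll(s, 1, 1) + np.roll(s, -1, 1))

    run(np.random.rand(64, 64), smooth, nsteps=100)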

4.5 Compression

There are two types of data that may be compressed: the grid or mesh data comprising the physical points or cells themselves, and the solution or parameter data comprising the values at those physical points or cells. Some data sets (in particular rectilinear volume data) may have only solution/parameter data. Others (in particular curvilinear grids and unstructured grids) have both data describing the physical points themselves (and sometimes their relationships – explicit vs. implicit addressing) as well as the solution/parameter data. The straightforward compression schemes (DCT, Fourier) have been applied only to solution/parameter data. The multiresolution schemes (discussed in the next section) have been applied to both grid/mesh and solution/parameter data.

Compression can be lossy or lossless. While there may be applications for lossy compression, many scientists will not even consider tools that visualize lossy data. (We note that some scientists working on ASCI report that their big data problems are so severe that they are willing to work with reduced-fidelity data.) In particular, CFD researchers and design engineers report that lossless compression is a requirement, and it is well known that lossless compression in medical imaging is absolutely essential. On the other hand, many scientists in such fields as CFD do report that lossy schemes that losslessly preserve features of interest are of interest to them. However, note that faithful derived fields over lossy raw data are very difficult to ensure. While scientists may accept feature-preserving lossy compression in principle, the guarantee must be extended to the derived fields the same scientists commonly use.

Some compression schemes are applied to the entire data set, some to sub-blocks of the data set. The latter are more amenable to progressive transmission and refinement (and therefore browsing and drill-down) and also to paging and other memory-hierarchy schemes. Some work has been done on applying visualization operators (e.g. volume rendering) directly to the compressed data. When this is not possible, the data must be reconstructed before traversal/rendering. While reconstruction requires memory bandwidth proportional to the data traversed, it still saves disk capacity and bandwidth. Compression has been reported as a technique in several contexts:

 Compressed storage with reconstruction before traversal. Two user-paradigms are apparent in previous work: progressive refinement, and browsing with drill-down. With the former, lowerfidelity data are traversed and displayed while the user changes viewpoint within the data, but then higher-fidelity data are traversed and displayed while the user viewpoint remains constant. With the latter, lower-fidelity data are traversed and displayed by the user, who explicitly chooses to select, traverse, and display higher-fidelity data when an interesting “feature” or “region” is detected. 3

We note that some scientists working on ASCI report that their big data problems are so severe that the scientists are willing to work with reduced-fidelity data.

SIGGRAPH 99

50

Version 9-Sep-99

• Feature extraction and/or data mining. In such cases, compression is achieved by representing the data by underlying core features. The compression is believed to retain the important features of the raw data. We distinguish between feature extraction techniques, which are inherently domain-specific (e.g. vortex-core extraction in CFD), and data mining techniques, which are inherently general (e.g. maximum divergence over a vector field). Feature extraction and data mining are also inherently multiresolution in nature, with the idea that some features are retained across scales of the data.

• Lossy compression. The proposal in some work is that "high frequencies" in the data are somehow less important than "lower frequencies", which are believed to carry more of the important information content. Under these circumstances the belief is that the data may be compressed to eliminate the "higher frequencies", thus reducing data size.

Before discussing the types of scientific data compression that have been explored, we first address error metrics for lossy compression.

Error metrics for lossy compression

The biggest gap in the data-reduction literature for scientific data has been in quality error metrics for lossy compression. A common metric is that of "image quality". This is probably the worst of all possible metrics to use. In general, it is not at all obvious that an image that "looks OK" has the same information content as the image generated from high-fidelity data (i.e. data which retain their "high frequencies"). If "image quality" is used, it must be tied to the underlying information for which the scientist or engineer is searching. Other metrics such as Signal-to-Noise Ratio (SNR) or Root-Mean-Square (RMS) error are indeed better, but still do not demonstrate that fundamental "information" has not been lost in the compression. User studies are one way to address this problem, but these admittedly are extremely time-consuming and also offer only statistical evidence of success.

Error-based progressive refinement

The work by Laur is an early example of metric-based progressive refinement of volume-rendered scientific data [23]. In this work an octree was built over the underlying volume data, and mean values were employed at intermediate nodes to volume-render reduced-resolution images. Associated with each node was an RMS error of the averages, and so the reduced-resolution renderings were data-based, not image-based. While the user manipulated the viewpoint, only reduced-resolution renderings within some data error tolerance were calculated. When the user stopped moving the viewpoint, the implementation progressively refined the rendering with higher-quality data closer to the leaves of the octree. Wilhelms extended this work to provide more arbitrary integration functions at intermediate nodes, and more arbitrary error functions [46].

Lossless compression on full data sets

Fowler provides lossless compression of 3D data by applying a technique common in 2D image compression: Differential Pulse-Code Modulation (DPCM). This technique works by predicting each voxel (pixel) value and encoding the differences between predicted and actual values with an entropy-coding scheme such as Huffman coding [12]. Fowler reports 2:1 compression without loss.
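
A minimal sketch of the DPCM idea (not Fowler's implementation) follows; it uses a trivial previous-voxel predictor and a generic entropy coder (zlib) in place of Huffman coding, but the structure (predict, take residuals, entropy-code the residuals) is the same.

    import numpy as np
    import zlib

    def dpcm_compress(volume):
        # Predict each voxel by the previous voxel in scan order and
        # entropy-code the (typically small) residuals.
        flat = volume.astype(np.int32).ravel()
        residuals = np.diff(flat, prepend=0)          # actual - predicted
        return zlib.compress(residuals.tobytes())

    def dpcm_decompress(blob, shape):
        residuals = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
        return np.cumsum(residuals).reshape(shape)    # invert the prediction

    # Lossless round trip on a synthetic 12-bit volume:
    # vol = np.random.randint(0, 4096, size=(64, 64, 64))
    # assert np.array_equal(dpcm_decompress(dpcm_compress(vol), vol.shape), vol)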


Zhu uses multiresolution analysis to achieve lossless compression [49]. The solution/parameter data are first wavelet-compressed, and the lower-resolution representation is used to detect "structures which persist across scales of the wavelet decomposition." These are then used in building an octree partition of the volume. Each subtree is then wavelet-compressed for transmission. The ordering of transmission is driven by a model of the human visual system – the intent is to transmit coefficients from most visible to least visible.

Lossy compression on data pages

The work by Ning is an early example of compressing the underlying pages of data (rather than the whole data set) using vector quantization [33]. Compression was reported as 5:1. However, the resulting images shown are quite striking in their distortion, and no results were reported showing that the underlying information was not also distorted (only "visual" evidence is presented). Yeo also reports compression on underlying pages of the data, with results between 10:1 and 100:1 [47]. Quality loss is reported by signal-to-noise ratio (SNR). While this metric is preferable to visual metrics, it is still not entirely clear how SNR corresponds to the fidelity of the underlying information for which a scientist is searching.

Operators on compressed data

Several papers have appeared on volume rendering directly in frequency space on Fourier-transformed scalar data ([25], [28], [41]). However, these authors were not concerned with data compression per se, and generally have not reported Fourier-transformed data sizes (one does report a data size of 2x the original [25]). Chiueh appears to be the first to operate in a transformed domain (Fourier) while also achieving compression [5]. The authors do so by compressing blocks of data (thereby minimizing the global effect of signals across the data), and they render directly from these blocks. The scheme is attractive in that it may be combined with memory-hierarchy techniques. However, the 30:1 compression ratios reported come at the apparent expense of serious degradation in image fidelity for non-linear transformation functions.

Compression of unsteady data

Ma combines difference-encoding between octree sub-trees of an unsteady data set with image-based rendering: only those sub-trees that change are re-rendered [27]. Rendering in this case is volume rendering.
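
The block- and page-oriented schemes above share a simple structural idea: compress the data in independent bricks, so that traversal, paging, and progressive transmission need only touch (and reconstruct) the bricks of interest. A minimal sketch of that structure follows; the brick size and the use of zlib are illustrative assumptions, not parameters of any of the cited systems.

    import numpy as np
    import zlib

    BRICK = 32  # voxels per brick edge (an assumed, illustrative choice)

    def compress_bricks(volume):
        """Store a scalar volume as independently compressed bricks."""
        bricks = {}
        nx, ny, nz = volume.shape
        for i in range(0, nx, BRICK):
            for j in range(0, ny, BRICK):
                for k in range(0, nz, BRICK):
                    block = np.ascontiguousarray(
                        volume[i:i + BRICK, j:j + BRICK, k:k + BRICK])
                    bricks[(i, j, k)] = (block.shape,
                                         zlib.compress(block.tobytes()))
        return bricks

    def fetch_brick(bricks, key, dtype):
        """Reconstruct one brick on demand; the rest stay compressed."""
        shape, blob = bricks[key]
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)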

4.6 Multiresolution

Most work in multiresolution analysis (in particular using wavelets) has been over triangulated surfaces. The techniques that have been developed have had success in Computer-Aided Design (CAD), architectural walkthroughs, and the exploration and display of virtual worlds. However, most of this work over 2D surfaces is not particularly useful for scientific data. With the latter, surfaces are rarely explicitly defined – at best they are discovered at run-time. That is, most work over 2D surfaces applies to static geometry that is defined once and then used; this is rarely the case in scientific visualization.

More recently, work has been undertaken that specifically applies multiresolution techniques to scientific data. The applications to date have been primarily for browsing and drill-down, and for progressive transmission and refinement. Most applications have been lossy; a few have provided lossless representations. Multiresolution analysis has also been used for feature extraction from scientific data, and there has been more recent work on information metrics for multiresolution compression – metrics that attempt to preserve the underlying information content of the data, or that attempt to preserve significant features of the data.
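
To make the multiresolution idea concrete, the sketch below performs one level of a Haar-style decomposition of solution data on a regular 1D grid: the averages form the coarse level used for browsing, and the detail coefficients allow the original samples to be reconstructed (exactly, up to floating-point rounding). This is a generic illustration, not the scheme of any particular paper cited here.

    import numpy as np

    def haar_level(values):
        """One level of decomposition; `values` has an even number of samples."""
        pairs = values.reshape(-1, 2)
        coarse = pairs.mean(axis=1)         # low-resolution representation
        detail = pairs[:, 0] - coarse       # what is needed to refine exactly
        return coarse, detail

    def haar_refine(coarse, detail):
        """Invert haar_level: recover the finer level from coarse + detail."""
        out = np.empty(coarse.size * 2)
        out[0::2] = coarse + detail
        out[1::2] = coarse - detail
        return out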

Visual error metrics for multiresolution representation of scientific data

The dearth of robust error metrics for lossy compression of scientific data afflicts most work in multiresolution analysis of scientific data as well. The use of image quality as the sole metric is even more ubiquitous in this literature than it is in the literature on lossy compression of scientific data. From one paper in the field: "... comparing the structures found in the original image with those computed from the optimized data set containing only 19% of the vertices, we find only very small details missing..." But were those "small details" the important ones?! While image quality may be sufficient for many applications in computer graphics, it is an insufficient metric when the application must ensure the aerodynamic stability of a plane under design or the correct diagnosis of a serious illness.

Better error metrics for multiresolution representation of scientific data

Standard error metrics reported in the literature are Root-Mean-Square (RMS) error and Signal-to-Noise Ratio (SNR) (cf. [23], [7]). A nagging difficulty with any generic error metric has already been discussed: how does 1% or 2% error relate to the information content of the data themselves? Wilhelms, perhaps recognizing the difficulty of defining the appropriate integration function (using a spatial index) and the appropriate integration error, developed a software architecture that allowed many integration functions and error metrics to be set by the user [46]. Trotts treats multiresolution error similarly – by ensuring that the aggregate error in multiresolution tetrahedral decimation over a grid does not exceed a user-specified tolerance [42]. Zhu presents an error metric driven by the data frequencies possibly visible to the human eye [49]. This is an interesting direction that may tie data errors to information content. Bajaj introduces multiresolution coarsening (and refinement) based on an error metric that preserves critical points and their relationships (in scalar data) [3], [2]. The feature preservation is intended to maintain correct calculation of isosurfaces at all scales of the multiresolution representation.
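
For reference, the two generic metrics mentioned above are straightforward to compute; a minimal sketch follows. As the discussion emphasizes, small values of either metric do not by themselves guarantee that the scientifically relevant information has been preserved.

    import numpy as np

    def rms_error(original, approx):
        """Root-mean-square error between a field and its approximation."""
        return float(np.sqrt(np.mean((original - approx) ** 2)))

    def snr_db(original, approx):
        """Signal-to-noise ratio, in dB, of the approximation."""
        noise = np.mean((original - approx) ** 2)
        signal = np.mean(original ** 2)
        return float(10.0 * np.log10(signal / noise))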

Example applications of multiresolution

Several authors have employed multiresolution for feature extraction (or data mining) and for progressive refinement. The common data mining operation appears to be "edge detection." Examples of these techniques can be found in [44], [17], and [31]. The first approaches to multiresolution were over rectilinear grids (cf. [30], [44], [17], [31]). An alternative that works on irregular grids is to compress the solution/parameter data only (leaving the grid/mesh uncompressed) [49]. A more recent approach to multiresolution has been to tetrahedralize a grid and then either refine the lowest-resolution representation, or coarsen the highest-resolution representation (cf. [16], [32], [48], [7]).


4.7 Feature extraction and data mining

All of the algorithms discussed so far have been primarily data reduction techniques. An alternative approach is to devise algorithms that compress the data specifically by extracting (off-line) the important answers, which presumably require significantly less storage than the raw data. Such feature extraction techniques are in general very domain-specific (cf. [21]). While these notes elide this promising direction, the scientist or engineer faced with very large data should consider feature extraction and data mining techniques a viable possibility.

4.8 View-dependent techniques

There are many techniques from computer graphics that reduce either the data that need be touched, or the computation, by using the viewer's point of view. Level-of-Detail (LOD) modeling, occlusion culling, view-frustum culling, etc., are examples of this approach. The risk of using such techniques for data culling in scientific visualization has always been that important information may be lost by image-space data reduction. Crawfis offers an interesting argument for (and provides an interesting example of) using view-dependent techniques to manage large data [11]. He argues that, especially for large data visualization, the scientist sees 1–50 MB of pixels but has 50 MB–10 TB of data: why not process only the data that are "visible"? The standard concern about this approach is that fidelity to the underlying data must be ensured (which is difficult). Crawfis argues that visualization does not maintain fidelity to the underlying data anyway (!), and that interactivity is the more important component of discovery. Based on this philosophy, his system applies image-based rendering to the 2D cross-sections of a splat volume renderer to accelerate visualization within some small cone of the viewing frustum. During the time that the user peruses this small cone (at interactive rates), the system uses dead reckoning to take additional cross-sections that will be used from subsequent viewpoints.
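
A minimal sketch of view-dependent data culling in this spirit follows: only the bricks of a large data set whose bounding spheres fall within a viewing cone are fetched and processed. The cone test is a deliberate simplification of full frustum culling, and all names and parameters are illustrative rather than taken from Crawfis's system.

    import numpy as np

    def visible_bricks(brick_centers, brick_radius, eye, view_dir, half_angle):
        """Return indices of bricks whose bounding spheres intersect a view cone.

        brick_centers: (N, 3) brick centers; brick_radius: bounding-sphere
        radius; eye, view_dir: viewer position and direction; half_angle:
        half-angle of the viewing cone, in radians.
        """
        view_dir = view_dir / np.linalg.norm(view_dir)
        to_brick = brick_centers - eye
        dist = np.maximum(np.linalg.norm(to_brick, axis=1), 1e-9)
        cos_angle = (to_brick @ view_dir) / dist
        angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
        # Widen the cone by roughly one brick radius (a conservative test)
        # so that partially visible bricks are kept.
        margin = np.arctan2(brick_radius, dist)
        return np.where(angle <= half_angle + margin)[0]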

5 Summary

In these notes we have discussed three major issues in the management of very large data sets for interactive analysis and visualization: application requirements and differences, end-to-end systems architectures, and major techniques and algorithms. The major techniques have been classified as memory hierarchy, data indices, write-a-check, computational steering, compression, multiresolution browsing with drill-down, and feature extraction and data mining. We have reviewed some of the more representative literature. Finally, we have discussed error metrics for lossy compression of scientific data, and for multiresolution representation and analysis. We have emphasized, even over-emphasized, that a major gap in the literature has been in good metrics for data loss due to compression or multiresolution representation.


References

[1] Bajaj, C. L., Pascucci, V., and Schikore, D. Fast isocontouring for improved interactivity. In 1996 Symposium on Volume Visualization (October 1996), pp. 39–46.
[2] Bajaj, C. L., Pascucci, V., and Schikore, D. Visualization of scalar topology for structural enhancement. In Proceedings of Visualization '98 (October 1998), pp. 51–58.
[3] Bajaj, C. L., and Schikore, D. Topology preserving data simplification with error bounds. Computers and Graphics (Spring 1998). Special issue on simplification.
[4] Chiang, Y. J., Silva, C. T., and Schroeder, W. J. Interactive out-of-core isosurface extraction. In Proceedings of Visualization '98 (October 1998), pp. 167–174.
[5] Chiueh, T., Yang, C., He, T., Pfister, H., and Kaufman, A. Integrated volume compression and visualization. In Proceedings of Visualization '97 (October 1997), pp. 329–336.
[6] Cignoni, P., Marino, P., Montani, C., Puppo, E., and Scopigno, R. Speeding up isosurface extraction using interval trees. IEEE Transactions on Visualization and Computer Graphics 3, 2 (April–June 1997), 158–.
[7] Cignoni, P., Montani, C., Puppo, E., and Scopigno, R. Multiresolution representation and visualization of volume data. IEEE Transactions on Visualization and Computer Graphics 3, 4 (October–December 1997).
[8] Cox, M. Managing big data for scientific visualization. In ACM SIGGRAPH '98 Course 2, Exploring Gigabyte Datasets in Real-Time: Algorithms, Data Management, and Time-Critical Design (August 1998).
[9] Cox, M., and Ellsworth, D. Application-controlled demand paging for out-of-core visualization. In Proceedings of Visualization '97 (October 1997), pp. 235–244.
[10] Cox, M., and Ellsworth, D. Managing big data for scientific visualization. In ACM SIGGRAPH '97 Course 4, Exploring Gigabyte Datasets in Real-Time: Algorithms, Data Management, and Time-Critical Design (August 1997). Los Angeles, CA.
[11] Crawfis, R. Parallel splatting and image-based rendering. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.
[12] Fowler, J. E., and Yagel, R. Lossless compression of volume data. In 1994 Symposium on Volume Visualization (October 1994), pp. 43–50.
[13] Gallagher, R. S. Span filtering: An optimization scheme for volume visualization of large finite element models. In Proceedings of Visualization '91 (October 1991), pp. 68–75.
[14] Giles, M., and Haimes, R. Advanced interactive visualization for CFD. Computing Systems in Engineering 1 (1990), 51–62.
[15] Globus, A. Optimizing particle tracing in unsteady vector fields. NAS RNR-94-001, NASA Ames Research Center, January 1994.
[16] Grosso, R., Luerig, C., and Ertl, T. The multilevel finite element method for adaptive mesh optimization and visualization of volume data. In Proceedings of Visualization '97 (October 1997), pp. 387–394.
[17] Guo, B. A multiscale model for structure-based volume rendering. IEEE Transactions on Visualization and Computer Graphics 1, 4 (December 1995), 291–301.
[18] Heermann, P. D. Production visualization for the ASCI One TeraFLOPS machine. In Proceedings of Visualization '98 (October 1998), pp. 459–462.
[19] Heermann, P. D. ASCI visualization: One TeraFLOPS and beyond. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.
[20] Itoh, T., and Koyamada, K. Automatic isosurface propagation using an extrema graph and sorted boundary cell lists. IEEE Transactions on Visualization and Computer Graphics 1, 4 (December 1995), 319–.
[21] Kenwright, D. N., and Haimes, R. Automatic vortex core detection. IEEE Computer Graphics and Applications 18, 4 (July/August 1998), 70–74.
[22] Lane, D. UFAT: A particle tracer for time-dependent flow fields. In Proceedings of Visualization '94 (October 1994), pp. 257–264.
[23] Laur, D., and Hanrahan, P. Hierarchical splatting: A progressive refinement algorithm for volume rendering. In Computer Graphics (Proceedings SIGGRAPH) (July 1991), pp. 285–288. Vol. 25, No. 4.
[24] Leutenegger, S. L., and Ma, K. L. Fast retrieval of disk-resident unstructured volume data for visualization. In DIMACS Workshop on External Memory Algorithms and Visualization (May 1998).
[25] Levoy, M. Volume rendering using the Fourier projection-slice theorem. In Proceedings of Graphics Interface '92 (May 1992), pp. 61–69.
[26] Livnat, Y., Shen, H.-W., and Johnson, C. R. A near optimal isosurface extraction algorithm using the span space. IEEE Transactions on Visualization and Computer Graphics 2, 1 (March 1996), 73–84.
[27] Ma, K.-L., Smith, D., Shih, M.-Y., and Shen, H.-W. Efficient encoding and rendering of time-varying volume data. NASA/CR-1998-208424, ICASE Report 98-22, 1998.
[28] Malzbender, T. Fourier volume rendering. ACM Transactions on Graphics 12, 3 (July 1993), 233–250.
[29] Moran, P., and Henze, C. Large field visualization with demand-driven calculation. In Proceedings of Visualization '99 (October 1999).
[30] Muraki, S. Approximation and rendering of volume data using wavelet transforms. IEEE Computer Graphics and Applications 13, 4 (July 1993), 50–56.
[31] Muraki, S. Multiscale volume representation by a DoG wavelet. IEEE Transactions on Visualization and Computer Graphics 1, 2 (June 1995).
[32] Neubauer, R., Ohlberger, M., Rumpf, M., and Schwirer, R. Efficient visualization of large-scale data on hierarchical meshes. In Proceedings of Visualization in Scientific Computing '97 (1997), Springer Wien.
[33] Ning, P., and Hesselink, L. Fast volume rendering of compressed data. In Proceedings of Visualization '93 (October 1993), pp. 11–18.
[34] Painter, J., McCormick, P., and McPherson, A. Reality Monster volume rendering. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.
[35] Pang, A., and Lodha, S. Towards understanding uncertainty in terascale visualization. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.
[36] Uselton, S. Panel: Computational steering is irrelevant to large data simulations. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.
[37] Schroeder, W. A multi-threaded streaming pipeline architecture for large structured data sets. In NSF/DOE Workshop on Large Scale Visualization and Data Management (May 1999). Presentations available at ftp://sci2.cs.utah.edu/pub/ldv99/.
[38] Shen, H.-W. Isosurface extraction from time-varying fields using a temporal hierarchical index tree. In Proceedings of Visualization '98 (October 1998), pp. 159–166.
[39] Shen, H.-W., Hansen, C. D., Livnat, Y., and Johnson, C. R. Isosurfacing in span space with utmost efficiency (ISSUE). In Proceedings of Visualization '96 (October 1996).
[40] Song, D., and Golin, E. Fine-grain visualization in data flow environments. In Proceedings of Visualization '93 (October 1993), pp. 126–133.
[41] Totsuka, T., and Levoy, M. Frequency domain volume rendering. In Computer Graphics (Proceedings SIGGRAPH) (August 1993), pp. 271–278. Vol. 27, No. 4.
[42] Trotts, I. J., Hamann, B., Joy, K. I., and Wiley, D. F. Simplification of tetrahedral meshes. In Proceedings of Visualization '98 (October 1998), pp. 287–295.
[43] Ueng, S. K., Sikorski, C., and Ma, K. L. Out-of-core streamline visualization on large unstructured meshes. IEEE Transactions on Visualization and Computer Graphics 3, 4 (October–December 1997).
[44] Westermann, R. A multiresolution framework for volume rendering. In Proceedings of the 1994 Symposium on Volume Visualization (October 1994), pp. 51–57.
[45] Wilhelms, J., and Gelder, A. V. Octrees for faster isosurface generation. ACM Transactions on Graphics 11, 3 (July 1992), 201–227.
[46] Wilhelms, J., and Gelder, A. V. Multi-dimensional trees for controlled volume rendering and compression. In Proceedings of the 1994 Symposium on Volume Visualization (October 1994), pp. 27–34.
[47] Yeo, B. L., and Liu, B. Volume rendering of DCT-based compressed 3D scalar data. IEEE Transactions on Visualization and Computer Graphics 1, 1 (March 1995).
[48] Zhou, Y., Chen, B., and Kaufman, A. Multiresolution tetrahedral framework for visualizing regular volume data. In Proceedings of Visualization '97 (October 1997), pp. 135–142.
[49] Zhu, Z., Machiraju, R., Fry, B., and Moorhead, R. Wavelet-based multiresolutional representation of computational field simulation datasets. In Proceedings of Visualization '97 (October 1997), pp. 151–158.