Compiling Object-Oriented Data Intensive Applications

Renato Ferreira*   Gagan Agrawal**   Joel Saltz*

*Department of Computer Science, University of Maryland, College Park MD 20742
**Department of Computer and Information Sciences, University of Delaware, Newark DE 19716
{renato,[email protected], [email protected]

Abstract

Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. High-level language and compiler support for developing applications that analyze and process such datasets has, however, been lacking so far. In this paper, we present a set of language extensions and a prototype compiler for supporting high-level object-oriented programming of data intensive reduction operations over multidimensional data. We have chosen a dialect of Java with data-parallel extensions for specifying collections of objects, a parallel for loop, and reduction variables as our source high-level language. Our compiler analyzes parallel loops and optimizes the processing of datasets through the use of an existing run-time system, called Active Data Repository (ADR). We show how loop fission followed by interprocedural static program slicing can be used by the compiler to extract required information for the run-time system. We present the design of a compiler/run-time interface which allows the compiler to effectively utilize the existing run-time system. A prototype compiler incorporating these techniques has been developed using the Titanium front-end from Berkeley. We have evaluated this compiler by comparing the performance of compiler generated code with hand customized ADR code for three templates, from the areas of digital microscopy and scientific simulations. Our experimental results show that the performance of compiler generated versions is, on the average, 21% lower than, and in all cases within a factor of two of, the performance of hand coded versions.
(This research was supported by NSF Grant ACR-9982087 and NSF CAREER award ACI-9733520.)

1 Introduction

Analysis and processing of very large multi-dimensional scientific datasets (i.e., datasets where data items are associated with points in a multidimensional attribute space) is an important component of science and engineering. Examples of

these datasets include raw and processed sensor data from satellites, output from hydrodynamics and chemical transport simulations, and archives of medical images. These datasets are also very large; for example, in medical imaging, the size of a single digitized composite slide image at high power from a light microscope is over 7GB (uncompressed), and a single large hospital can process thousands of slides per day. Applications that make use of multidimensional datasets are becoming increasingly important and share several important characteristics. Both the input and the output are often disk-resident. Applications may use only a subset of all the data available in the datasets. Access to data items is described by a range query, namely a multidimensional bounding box in the underlying multidimensional attribute space of the dataset. Only the data items whose associated coordinates fall within the multidimensional box are retrieved. The processing structures of these applications also share common characteristics. However, no high-level language support currently exists for developing applications that process such datasets. In this paper, we present our solution for allowing high-level, yet efficient, programming of data intensive reduction operations on multidimensional datasets. Our approach is to use a data parallel language to specify computations that are to be applied to a portion of disk-resident datasets. Our solution is based upon designing a prototype compiler using the Titanium infrastructure, which incorporates loop fission and slicing based techniques, and utilizing an existing run-time system called Active Data Repository [5]. We have chosen a dialect of Java for expressing this class of computations. Our chosen dialect of Java includes data-parallel extensions for specifying collections of objects, a parallel for loop, and reduction variables. However, the approach and the techniques developed are not intended to be language specific.
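To make the notion of a range query concrete, the selection step can be sketched in a few lines of plain Java. This is our own illustration, not code from the system described in this paper; all names (RangeQuery2D, Item, query) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/* Hypothetical sketch of a range query over a dataset in a
   two-dimensional attribute space: only the items whose associated
   coordinates fall within the bounding box are retrieved. */
class RangeQuery2D {
    static class Item {
        final int x, y;
        final double value;
        Item(int x, int y, double value) { this.x = x; this.y = y; this.value = value; }
    }

    /* Return the items of `dataset` inside the box [xlo, xhi] x [ylo, yhi]. */
    static List<Item> query(List<Item> dataset, int xlo, int xhi, int ylo, int yhi) {
        List<Item> result = new ArrayList<>();
        for (Item it : dataset) {
            if (it.x >= xlo && it.x <= xhi && it.y >= ylo && it.y <= yhi) {
                result.add(it);
            }
        }
        return result;
    }
}
```

In the applications targeted here the dataset is disk-resident rather than an in-memory list, so the run-time system performs this selection at the granularity of disk blocks, as described in Section 3.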
Our overall thesis is that a data-parallel framework will provide a convenient interface to large multidimensional datasets resident on persistent storage. Conceptually, our compiler design has two major new ideas. First, we have shown how loop fission followed by interprocedural program slicing can be used for extracting important information from general object-oriented data-parallel loops. This technique can be used by other compilers that use a run-time system to optimize for locality or communication. Second, we have shown how the compiler and the run-time system can use such information to efficiently execute data intensive reduction computations.

Our compiler extensively uses the existing run-time system ADR for optimizing resource usage during execution of data intensive applications. ADR integrates storage, retrieval and processing of multidimensional datasets on a parallel machine. While a number of applications have been developed using ADR's low-level API and high performance has been demonstrated [5], developing applications in this style requires detailed knowledge of the design of ADR and is not suitable for application programmers. In comparison, our proposed data-parallel extensions to Java enable programming of data intensive applications at a much higher level. It is now the responsibility of the compiler to utilize the services of ADR for memory management, data retrieval and scheduling of processes. Our prototype compiler has been implemented using the Titanium infrastructure from Berkeley [20]. We have performed experiments using three different data intensive application templates, two of which are based upon the Virtual Microscope application [9] and the third is based on water contamination studies [14]. For each of these templates we have compared the performance of compiler generated versions with hand customized versions. Our experiments show that the performance of compiler generated versions is, on average, 21% lower than, and in all cases within a factor of two of, the performance of hand coded versions. We present an analysis of the factors behind the lower performance of the current compiler and suggest optimizations that can be performed by our compiler in the future.

The rest of the paper is organized as follows. In Section 2, we further describe the characteristics of the class of data intensive applications we target. Background information on the run-time system is provided in Section 3. Our chosen language extensions are described in Section 4.
We present our compiler processing of the loops and slicing based analysis in Section 5. The combined compiler and run-time processing for execution of loops is presented in Section 6. Experimental results from our current prototype are presented in Section 7. We compare our work with existing related research efforts in Section 8 and conclude in Section 9.

2 Data Intensive Applications

In this section, we first describe some of the scientific domains which involve applications that process large datasets. Then, we describe some of the common characteristics of the applications we target. Data intensive applications from three scientific areas are being studied currently as part of our project.

Analysis of Microscopy Data: The Virtual Microscope [9] is an application to support the need to interactively view and process digitized data arising from tissue specimens. The Virtual Microscope provides a realistic digital emulation of a high power light microscope. The raw data for such a system can be captured by digitally scanning collections of full microscope slides under high power. At the basic level, it can emulate the usual behavior of a physical microscope, including continuously moving the stage and changing magnification and focus.

Water contamination studies: Environmental scientists study the water quality of bays and estuaries using long running hydrodynamics and chemical transport simulations [14]. The chemical transport simulation models reactions and transport of contaminants, using the fluid velocity data generated by the hydrodynamics simulation. This simulation is performed on a different spatial grid, and often uses significantly coarser time steps. This is achieved by mapping the fluid velocity information from the circulation grid, averaged over multiple fine-grain time steps, to the chemical transport grid and computing smoothed fluid velocities for the points in the chemical transport grid.

Satellite data processing: Earth scientists study the earth by processing remotely-sensed data continuously acquired from satellite-based sensors, since a significant amount of earth science research is devoted to developing correlations between sensor radiometry and various properties of the surface of the earth [5]. A typical analysis processes satellite data for ten days to a year and generates one or more composite images of the area under study. Generating a composite image requires projection of the globe onto a two dimensional grid; each pixel in the composite image is computed by selecting the "best" sensor value that maps to the associated grid point.

Data intensive applications in these and related scientific areas share many common characteristics. Access to data items is described by a range query, namely a multidimensional bounding box in the underlying multidimensional space of the dataset. Only the data items whose associated coordinates fall within the multidimensional box are retrieved. The basic computation consists of (1) mapping the coordinates of the retrieved input items to the corresponding output items, and (2) aggregating, in some way, all the retrieved input items mapped to the same output data items. The computation of a particular output element is a reduction operation, i.e. the correctness of the output usually does not depend on the order in which the input data items are aggregated.

3 Overview of the Runtime System

Our compiler effort targets an existing run-time infrastructure, called the Active Data Repository (ADR) [5], that integrates storage, retrieval and processing of multidimensional datasets on a parallel machine. We give a brief overview of this run-time system in this section. Processing of a data intensive data-parallel loop is carried out by ADR in two phases: loop planning and loop execution. The objective of loop planning is to determine a schedule to efficiently process a range query based on the amount of available resources in the parallel machine. A loop plan specifies how parts of the final output are computed. The loop execution service manages all the resources in the system and carries out the loop plan generated by the loop planning service. The primary feature of the loop execution service is its ability to integrate data retrieval and processing for a wide variety of applications. This is achieved by pushing processing operations into the storage manager and allowing processing operations to access the buffer used to hold data arriving from disk. As a result, the system avoids one or more levels of copying that would be needed in a layered architecture where the storage manager and the processing belong in different layers.

A dataset in ADR is partitioned into a set of (logical) disk blocks to achieve high bandwidth data retrieval. The size of a logical disk block is a multiple of the size of a physical disk block on the system and is chosen as a trade-off between reducing disk seek time and minimizing unnecessary data transfers. A disk block consists of one or more objects, and is the unit of I/O and communication. The processing of a loop on a processor progresses through the following three phases: (1) Initialization: output disk blocks (possibly replicated on all processors) are allocated space in memory and initialized; (2) Local Reduction: input disk blocks on the local disks of each processor are retrieved and aggregated into the output disk blocks; (3) Global Combine: if necessary, results computed in each processor in phase 2 are combined across all processors to compute final results for the output disk blocks.

ADR run-time support has been developed as a set of modular services implemented in C++. ADR allows customization for application specific processing (i.e., mapping and aggregation functions), while leveraging the commonalities between the applications to provide support for common operations such as memory management, data retrieval, and scheduling of processing across a parallel machine. Customization in ADR is currently achieved through class inheritance. That is, for each of the customizable services, ADR provides a base class with virtual functions that are expected to be implemented by derived classes. Adding an application-specific entry into a modular service requires the definition of a class derived from an ADR base class for that service and providing the appropriate implementations of the virtual functions. Current examples of data intensive applications implemented with ADR include Titan [5], for satellite data processing, the Virtual Microscope [9], for visualization and analysis of microscopy data, and coupling of multiple simulations for water contamination studies [14].

4 Java Extensions for Data Intensive Computing

In this section we describe a dialect of Java that we have chosen for expressing data intensive computations.
Though we propose to use a dialect of Java as the source language for the compiler, the techniques we develop are largely independent of Java and will also be applicable to suitable extensions of other languages, such as C, C++, or Fortran 90.

4.1 Data-Parallel Constructs

We borrow two concepts from object-oriented parallel systems like Titanium [20], HPC++ [2], and Concurrent Aggregates [6].

- Domains and Rectdomains are collections of objects of the same type. Rectdomains have a stricter definition, in the sense that each object belonging to such a collection has a coordinate associated with it that belongs to a pre-specified rectilinear section of the domain.

- The foreach loop, which iterates over objects in a domain or rectdomain, and has the property that the order of iterations does not influence the result of the associated computations. We further extend the semantics of foreach to include the possibility of updates to reduction variables, as we explain later.

We introduce a Java interface called Reducinterface. Any object of any class implementing this interface acts as a reduction variable [10]. The semantics of a reduction variable is analogous to that used in version 2.0 of High Performance Fortran (HPF-2) [10] and in HPC++ [2]. A reduction variable has the property that it can only be updated inside a

foreach loop by a series of operations that are associative and commutative. Furthermore, the intermediate value of the reduction variable may not be used within the loop, except for self-updates.

4.2 Example Code

Figure 1 outlines an example code with our chosen extensions. This code shows the essential computation in the virtual microscope application [9]. A large digital image is stored on disks. This image can be thought of as a two dimensional array or collection of objects. Each element in this collection denotes a pixel in the image. Each pixel comprises three characters, which denote the color at that point in the image. The interactive user supplies two important pieces of information. The first is a bounding box within this two dimensional image, which specifies the area within the original image that the user is interested in scanning. We assume that the bounding box is rectangular, and can be specified by providing the x and y coordinates of two points. The first 4 arguments provided by the user are integers and together, they specify the points lowend and hiend. The second piece of information provided by the user is the subsampling factor, an integer denoted by subsamp. The subsampling factor determines the granularity at which the user is interested in viewing the image. A subsampling factor of 1 means that all pixels of the original image must be displayed. A subsampling factor of n means that n^2 pixels are averaged to compute each output pixel. The computation in this kernel is very simple. First, a querybox is created using the specified points lowend and hiend. Each pixel in the original image which falls within the querybox is read and then used to increment the value of the corresponding output pixel.

4.3 Restrictions on the Loops

The primary goal of our compiler is to analyze and optimize (by performing both compile-time transformations and generating code for the ADR run-time system) foreach loops that satisfy certain properties.
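The subsampling computation described in Section 4.2 can also be written as an ordinary sequential Java sketch. This is an illustration we add here with hypothetical names, not the data-parallel kernel itself; it mirrors the division by avgf in the Accum method of Figure 1:

```java
/* Sequential sketch of the subsampling kernel: with a subsampling
   factor of n, each output pixel is computed by averaging n*n input
   pixels of a single-channel image. */
class Subsample {
    static int[][] subsample(int[][] image, int subsamp) {
        int outH = image.length / subsamp;
        int outW = image[0].length / subsamp;
        int[][] out = new int[outH][outW];
        for (int y = 0; y < outH * subsamp; y++) {
            for (int x = 0; x < outW * subsamp; x++) {
                /* accumulate each input pixel's contribution to its output pixel */
                out[y / subsamp][x / subsamp] += image[y][x] / (subsamp * subsamp);
            }
        }
        return out;
    }
}
```

Note that, as in the Accum method, each contribution is divided by the number of contributing pixels before being accumulated, so the update is a commutative and associative reduction.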
We assume the standard semantics of parallel for loops and reductions in languages like High Performance Fortran (HPF) [10] and HPC++ [2]. Further, we require that no Java threads be spawned within such loop nests, and that no memory locations read or written inside the loop nests may be touched by another concurrent thread. Our compiler also assumes that no Java exceptions are raised in the loop nest and that the iterations of the loop can be reordered without changing any of the language semantics. One potential way of enabling this is to use bounds checking optimizations [15].

5 Compiler Analysis

In this section, we first describe how the compiler processes the given data-parallel data intensive loop into a canonical form. We then describe how interprocedural program slicing can be used for extracting a number of functions which are passed to the run-time system.

5.1 Initial Processing of the Loop

Consider any data-parallel loop in our dialect of Java, as presented in Section 4. The memory locations modified in this loop are only the elements of collections of objects, or temporary variables whose values are not used in other iterations of the loop or any subsequent computations. The memory locations accessed in this loop are either elements

  Interface Reducinterface {
    /* Any object of any class implementing this interface is a reduction variable */
  }

  public class VMPixel {
    char colors[3];
    void Initialize() {
      colors[0] = 0; colors[1] = 0; colors[2] = 0;
    }
    /* Aggregation Function */
    void Accum(VMPixel Apixel, int avgf) {
      colors[0] += Apixel.colors[0]/avgf;
      colors[1] += Apixel.colors[1]/avgf;
      colors[2] += Apixel.colors[2]/avgf;
    }
  }

  public class VMPixelOut extends VMPixel implements Reducinterface;

  public class VMScope {
    static int Xdimen = ... ;
    static int Ydimen = ... ;
    /* Data Declarations */
    static Point[2] lowpoint = [0,0];
    static Point[2] hipoint = [Xdimen-1,Ydimen-1];
    static RectDomain[2] VMSlide = [lowpoint : hipoint];
    static VMPixel[2d] VScope = new VMPixel[VMSlide];

    public static void main(String[] args) {
      Point[2] lowend = [args[0],args[1]];
      Point[2] hiend = [args[2],args[3]];
      int subsamp = args[4];
      RectDomain[2] Outputdomain = [[0,0] : (hiend - lowend)/subsamp];
      VMPixelOut[2d] Output = new VMPixelOut[Outputdomain];
      RectDomain[2] querybox;
      Point[2] p;
      foreach (p in Outputdomain) {
        Output[p].Initialize();
      }
      querybox = [lowend : hiend];
      /* Main Computational Loop */
      foreach (p in querybox) {
        Point[2] q = (p - lowend)/subsamp;
        Output[q].Accum(VScope[p], subsamp*subsamp);
      }
    }
  }

Figure 1: Example Code

of collections or values which may be replicated on all processors before the start of the execution of the loop. For the purpose of our discussion, collections of objects whose elements are modified in the loop are referred to as left hand side or lhs collections, and the collections whose elements are only read in the loop are referred to as right hand side or rhs collections. The functions used to access elements of collections of objects in the loop are referred to as the subscript functions.

Definition 1: Consider any two lhs collections or any two rhs collections. These two collections are called congruent iff:

- The subscript functions used to access these two collections in the loop are identical.

- The layout and partitioning of these two collections are identical. By identical layout we mean that elements with the same indices are put together in the same disk block for both the collections. By identical partitioning we mean that the disk blocks containing elements with identical indices from these collections reside on the same processor.

Consider any loop. If multiple distinct subscript functions are used to access rhs collections and lhs collections, and these subscript functions are not known at compile-time, tiling output and managing disk accesses while maintaining high reuse and locality is going to be a very difficult task for the run-time system. In particular, the current implementation of ADR does not support such cases. Therefore, we perform loop fission to divide the original loop into a set of loops, such that all lhs collections in any new loop are congruent and all rhs collections are congruent. We now describe how such loop fission is performed. Initially, we focus on lhs collections which are updated in different statements of the same loop. We perform loop

fission, so that all lhs collections accessed in any new loop are congruent. Since we are focusing on loops with no loop-carried dependencies, performing loop fission is straightforward. An example of such a transformation is shown in Figure 2, part (a). We now focus on such a new loop in which all lhs collections are congruent, but not all rhs collections may be congruent. For any two rhs accesses in a loop that are not congruent, there are three possibilities:

1. These two collections are used for calculating values of elements of different lhs collections. In this case, loop fission can be performed trivially.

2. These two collections Y and Z are used for calculating values of elements of the same lhs collection. Such a lhs collection X is, however, computed as follows:

     X(f(i)) op_i= Y(g(i)) op_j Z(h(i))

   such that op_i and op_j are the same operation. In such a case, loop fission can be performed, so that the element X(f(i)) is updated using the operation op_i with the values of Y(g(i)) and Z(h(i)) in different loops. An example of such a transformation is shown in Figure 2, part (b).

3. These two collections Y and Z are used for calculating values of the elements of the same lhs collection and, unlike the case above, the operations used are not identical. An example of such a case is

     X(f(i)) += Y(g(i)) * Z(h(i))

   In this case, we need to introduce a temporary collection of objects to copy the collection Z. Then, the collection Y and the temporary collection can be accessed using the same subscript function. An example of such a transformation is shown in Figure 2, part (c).

After such a series of loop fission transformations, the original loop is replaced by a series of loops. The property of each loop is that all lhs collections are accessed with the same subscript function and all rhs collections are also accessed with the same subscript function. However, the subscript function for accessing the lhs collections may be different from the one used to access the rhs collections.

  foreach (r in R) {
    O_1[S_L(r)] op_1= A_1(I_1[S_R(r)], ..., I_n[S_R(r)])
    ...
    O_m[S_L(r)] op_m= A_m(I_1[S_R(r)], ..., I_n[S_R(r)])
  }

Figure 3: A Loop in Canonical Form

5.1.1 Terminology

After loop fission, we focus on an individual loop at a time. We introduce some notation about this loop which is used for presenting our solution. The terminology presented here is illustrated by the example loop in Figure 3. The range (domain) over which the iterator iterates is denoted by the function R. Let there be n rhs collections of objects read in this loop, which are denoted by I_1, ..., I_n. Similarly, let the lhs collections written in the loop be denoted by O_1, ..., O_m. Further, we denote the subscript function used for accessing right hand side collections by S_R and the subscript function used for accessing left hand side collections by S_L. Given a point r in the range for the loop, elements S_L(r) of the output collections are updated using one or more of the values I_1[S_R(r)], ..., I_n[S_R(r)], and other scalar values in the program. We denote by A_i the function used for creating the value which is used later for updating the element of the output collection O_i. The operator used for performing this update is op_i.

5.2 Slicing Based Interprocedural Analysis

We are primarily concerned with extracting three sets of functions: the range function R, the subscript functions S_R and S_L, and the aggregation functions A_1, ..., A_m. Similar information is often extracted by various data-parallel Fortran compilers. One important difference is that we are working with an object-oriented language (Java), which is significantly more difficult to analyze. This is mainly because the object-oriented programming methodology frequently leads to small procedures and frequent procedure calls.
As a result, analysis across multiple procedures may be required in order to extract range, subscript and aggregation functions. We use the technique of interprocedural program slicing for extracting these three sets of functions. Initially, we give background information on program slicing and give references to show that program slicing can be performed across procedure boundaries, and in the presence of language features like polymorphism, aliases, and exceptions.

5.2.1 Background: Program Slicing

The basic definition of a program slice is as follows. Given a slicing criterion (s, x), where s is a program point in the program and x is a variable, the program slice is a subset of statements in the program such that these statements, when executed on any input, will produce the same value of the variable x at the program point s as the original program.
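As an illustration of this definition (our own toy example, not part of the compiler), a backward slice can be computed over an explicit dependence graph by transitively following the dependences of the criterion statement:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/* Toy backward slicer: statements are integer ids, and deps maps each
   statement to the statements it is data- or control-dependent on.
   The slice for a criterion statement is the set of statements reached
   by transitively following these dependences. */
class Slicer {
    static Set<Integer> backwardSlice(Map<Integer, List<Integer>> deps, int criterion) {
        Set<Integer> slice = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(criterion);
        while (!work.isEmpty()) {
            int s = work.pop();
            if (slice.add(s)) {                       // first time we see s
                for (int d : deps.getOrDefault(s, List.of())) {
                    work.push(d);                     // follow its dependences
                }
            }
        }
        return slice;
    }
}
```

Real slicers operate on program dependence graphs built from the source and must handle procedure calls, aliases, and polymorphism; the worklist structure, however, is essentially the one above.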

The basic idea behind any algorithm for computing program slices is as follows. Starting from the statement p in the program, we trace any statements on which p is data or control dependent and add them to the slice. The same is repeated for any statement which has already been included in the slice, until no more statements can be added to the slice. Slicing has been used very frequently in software development environments, for debugging, program analysis, merging two different versions of the code, and software maintenance and testing. A number of techniques have been presented for accurate program slicing across procedure boundaries [18].

5.2.2 Extracting Range Function

We need to determine the rhs and lhs collections of objects for this loop. We also need to provide the range function R. The rhs and lhs collections of objects can be computed easily by inspecting the assignment statements inside the loop and in any functions called inside the loop. Any collection which is modified in the loop is considered a lhs collection, and any other collection touched in the loop is considered a rhs collection. For computing the domain, we inspect the foreach loop and look at the domain over which the loop iterates. Then, we compute a slice of the program using the entry of the loop as the program point and the domain as the variable.

5.2.3 Extracting Subscript Functions

The subscript functions S_R and S_L are particularly important for the run-time system, as they determine the size of the lhs collections written in the loop and the rhs disk blocks from each collection that contribute to the lhs collections. The function S_L can be extracted using slicing as follows. Consider any statement in the loop which modifies any lhs collection. We focus on the variable or expression used to access an element in the lhs collection. The slicing criterion we choose is the value of this variable or expression at the beginning of the statement where the lhs collection is modified.
The function S_R can be extracted similarly. Consider any statement in the loop which reads from any rhs collection. The slicing criterion we use is the value of the expression used to access the collection, at the beginning of such a statement. Typically, the value of the iterator will be included in such slices. Suppose the iterator is p. After first encountering p in the slice, we do not follow data dependencies for p any further. Instead, the functions returned by the slice use such an iterator as an input parameter. For the virtual microscope template presented in Figure 1, the slice computed for the subscript function S_L is shown at the left hand side of Figure 4 and the code generated by the compiler is shown on the left hand side of Figure 5. In the original source code, the rhs collection is accessed with just the iterator p; therefore, the subscript function S_R is the identity function. The function S_L receives the coordinates of an element in the rhs collection as a parameter (iterpt) from the run-time system and returns the coordinates of the corresponding lhs element. Titanium multidimensional points are supported by ADR as a class named ADR_Pt. Also, in practice, the command line parameters passed to the program are extracted and stored in a data structure, so that the run-time system does not need to explicitly read the args array.
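In ordinary Java, the subscript function S_L extracted for this template amounts to the following mapping, q = (p - lowend)/subsamp. This is a sketch with explicit parameters and hypothetical names; in the generated code of Figure 5, lowend and subsamp come from the stored command-line arguments and the function operates on ADR_Pt objects:

```java
/* Sketch of the extracted subscript function S_L: maps the coordinates
   of an input (rhs) element to the coordinates of the output (lhs)
   element it contributes to, using integer division. */
class SubscriptSketch {
    static int[] subscriptOut(int[] iterpt, int[] lowend, int subsamp) {
        return new int[] {
            (iterpt[0] - lowend[0]) / subsamp,
            (iterpt[1] - lowend[1]) / subsamp
        };
    }
}
```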

  (a) foreach (p in box) {
        A[f(p)] += C[p]
        B[g(p)] += C[p]
      }

      becomes

      foreach (p in box) { A[f(p)] += C[p] }
      foreach (p in box) { B[g(p)] += C[p] }

  (b) foreach (p in box) {
        A[f(p)] += B[g(p)] + C[h(p)]
      }

      becomes

      foreach (p in box) { A[f(p)] += B[g(p)] }
      foreach (p in box) { A[f(p)] += C[h(p)] }

  (c) foreach (p in box) {
        A[f(p)] += B[p] * C[g(p)]
      }

      becomes

      foreach (p in box) { T[p] = C[g(p)] }
      foreach (p in box) { A[f(p)] += B[p] * T[p] }

Figure 2: Examples of Loop Fission Transformations

  Left (slice for the subscript function S_L):

    Point[2] lowend = [args[0],args[1]];
    int subsamp = args[4];
    Point[2] q = (p - lowend)/subsamp;

  Right (slice for the aggregation function):

    foreach (p in querybox) {
      Point[2] q = (p - lowend)/subsamp;
      Output[q].Accum(VScope[p], subsamp*subsamp);
    }

Figure 4: Slice for Subscript Function (left) and for Aggregation Function (right)

5.2.4 Extracting Aggregation Functions

For extracting the aggregation function A_i, we look at the statement in the loop where the lhs collection O_i is modified. The slicing criterion we choose is the value of the element from the collection which is modified in this statement, at the beginning of this statement. The virtual microscope template presented in Figure 1 has only one aggregation function. The slice for this aggregation function is shown in Figure 4 and the actual code generated by the compiler is shown in Figure 5. The function Accum accessed in this code is obviously part of the slice, but is not shown here. The generated function iterates over the elements of a disk block and applies aggregation functions on each element, if that element intersects with the range of the loop and the current tile. The function is passed as parameters the current block (the disk block being processed), the current tile (the portion of the lhs collection which is currently allocated in memory), and the querybox, which is the iteration range for the loop. Titanium rectangular domains are supported by the run-time as ADR_Box. Further details of this aggregation function are explained after presenting the combined compiler/run-time loop processing.

6 Combined Compiler and Run-time Processing

In this section we explain how the compiler and run-time system work jointly towards performing data intensive computations.

6.1 Initial Processing of the Input

The system stores information about how each of the rhs collections of objects I_i is stored across disks. Note that after we apply loop fission, all rhs collections accessed in the same loop have identical layout and partitioning. The

compiler generates appropriate ADR functions to analyze the meta-data about the collections I_i, the range function R, and the subscript function S_R, and to compute the list of disk blocks of I_i that are accessed in the loop. The domain of each rhs collection accessed in the loop is S_R(R). Note that if a disk block is included in this list, it is not necessary that all elements in this disk block are accessed during the loop. However, for the initial planning phase, we focus on the list of disk blocks. We assume a model of parallel data intensive computation in which a set of disks is associated with each node of the parallel machine. This is consistent with systems like the IBM SP and clusters of workstations. Let the set P = {p_1, p_2, ..., p_q} denote the list of processors in the system. Then, the information computed by the run-time system after analyzing the range function, the input subscript function and the meta-data about each of the collections of objects I_i is the sets B_ij. For a given input collection I_i and a processor j, B_ij is the set of disk blocks b that contain data for collection I_i, are resident on a disk connected to processor p_j, and intersect with S_R(R). Further, for each disk block b_ijk belonging to the set B_ij, we compute the information D(b_ijk), which denotes the subset of the domain S_R(R) that is resident on the disk block b. Clearly, the union of the domains covered by all selected disk blocks will cover the entire area of interest, or in formal terms,

  for all i:   the union over all j, k of D(b_ijk) is a superset of S_R(R)

  ADR_Pt Subscript_out(ADR_Pt iterpt) {
    ADR_Pt outpoint(2);
    ADR_Pt lowend(2);
    lowend[0] = args[0];
    lowend[1] = args[1];
    int subsamp = args[4];
    outpoint[0] = (iterpt[0] - lowend[0])/subsamp;
    outpoint[1] = (iterpt[1] - lowend[1])/subsamp;
    return outpoint;
  }

  void Accumulate(ADR_Box current_block, ADR_Box current_tile, ADR_Box querybox) {
    ADR_Box box = current_block.intersect(querybox);
    ADR_Pt lowpt = box.getLow();
    ADR_Pt highpt = box.getHigh();
    ADR_Pt inputpt(2);
    ADR_Pt outputpt(2);
    int subsamp = args[4];
    for (i0 = lowpt[0]; i0 <= highpt[0]; i0++) {
      for (i1 = lowpt[1]; i1 <= highpt[1]; i1++) {
        inputpt[0] = i0; inputpt[1] = i1;
        if (project(inputpt, outputpt, current_tile)) {
          Output[outputpt].Accum(VScope[inputpt], subsamp*subsamp);