On Querying Spreadsheets

Laks V.S. Lakshmanan

Dept. of Computer Science, Concordia University Montreal, Canada [email protected]

Nita Goyal

Subbu N. Subramanian†

IBM Almaden Research Center San Jose, CA, USA [email protected]

Ravi Krishnamurthy

Picture Programming Lab, H.P. Labs, Palo Alto, CA, USA email:fgoyal, [email protected]

Abstract

Applications such as spreadsheets and word processors, besides traditional databases, are a source of vast amounts of electronic information. We consider the problem of querying the data in these applications. This problem has several motivations from the perspective of data integration, interoperability, and OLAP. In this paper, we provide an architecture for realizing interoperability among such diverse applications and address the challenges that arise specifically in the context of querying data stored in spreadsheet applications. A fundamental challenge here is the lack of a well-defined schema. We propose a framework in which the user can specify the layout of data in a spreadsheet, based on her perception of the important concepts underlying that data. Layout specifications can be viewed as the "physical schema" of a spreadsheet. We motivate the concept of an abstract database machine (ADM) that uses the layout specifications to provide a relational view of the data in spreadsheet applications and, like a DBMS, supports efficient querying of the spreadsheet data. We develop a methodology for building ADMs for spreadsheets, and describe our implementation of an ADM for Microsoft Excel applications, based on the above methodology. Our implementation platform is IBM PCs running Windows NT, MS Office, and OLE 2.0. We also demonstrate the generality and practicality of our approach by developing a formal characterization of the class of spreadsheets that can be handled in our framework. Our results show that the approach presented in this paper is capable of handling a broad class of naturally occurring spreadsheet applications. This work is part of an office tool integration project at Concordia University, in conjunction with the Picture Programming Project at H.P. Labs, Palo Alto, CA.

1 Introduction

Numerous applications require interaction with information sources that are dissimilar in nature. The sources could include databases, spreadsheets, documents, mail, fax, internet information servers, news groups, etc. Here is one example: Consider an application where a vice president of a company sends congratulatory/warning letters to the salespeople depending on their performance in the past month. The sales quota of each salesperson may be stored in a database, which may also store his coordinates (email address, fax number, voice mail number, etc.). Salespeople compile their sales figures on an ongoing basis in spreadsheets, a very commonly used tool. To facilitate the task (and, more importantly, save the time) of the vice president's secretary, template letters of either kind are stored as template documents. Yet, the only way the secretary can prepare the letters presently is to manually get the relevant data from the database and spreadsheets, analyze them, decide who should get what message, and use a facility like Microsoft mail merge to prepare the actual messages, one by one. In this, as well as in many other similar real-life applications, considerable time can be saved if the bulk of the activity could be somehow automated. Our work was inspired by the observation that for a whole class of practical applications requiring interoperation among heterogeneous tools (such as databases, mail tools, spreadsheets, and word processors, to name a few), certain key functionalities are required: (1) Querying the data in the tools; (2) Updating the data, possibly in a query dependent manner; (3) Automating the process of query dependent updates. In this paper, we argue that development of this whole breed of applications can significantly benefit from the application of database technology. We draw an analogy between the state of the art for such application development, and the state of the art of database technology in the pre-relational days. Presently, the approach to developing such applications can be characterized by the phrase one-application-at-a-time.

* This research was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada and a grant from Hewlett Packard Labs, Palo Alto, CA.

† Currently on assignment at IBM Santa Teresa Labs, San Jose, CA. Much of this work performed while at Concordia University, Montreal, Canada.
This approach has the following drawbacks: (1) Developing such interoperable applications from scratch requires that the access paths to the various tools be programmed from scratch on a per application basis, and this code will be mixed with the application dependent code itself. (2) There are situations where the way the data in a tool is perceived could depend on the application. This means that access path programming itself would have to depend on the particular application. The particular query and update methods to be implemented on the tools might also depend on the application, as this often depends on the application semantics.

[Figure 1: Architecture for Middleware for interoperation among heterogeneous data sources. Components shown: Application; View Manager; User's Layout Specs; Excel ADM, Access ADM, and Word ADM, each exposing an OR View; OLE 2.0; MS Excel, MS Access, and MS Word.]

We argue that perhaps the most effective way of realizing these functionalities is to make use of database technology and extend it to accommodate non-traditional information sources. The reasons for this belief are the following. (1) Productivity would be enhanced the most if we can build a language for application development which lets the programmer focus mainly on the semantic (rather than access path) details of the specific application. This is clearly one of the fortes of relational database technology and in particular, SQL. (2) A uniform database-like view enables the user to have a high level view of the individual tools against which s/he can develop the interoperating application, analogously to developing database applications against the relational conceptual model. (3) This approach to application development offers the potential of query optimization, similar in spirit to what has been done for the relational data model. (4) By borrowing from database technology, we can bring techniques such as materialized view maintenance to bear on the problem of building a technology for developing a class of interoperable applications.

2 Our Approach and Contributions

In this section, we describe our approach for enabling a technology for developing a class of applications requiring interoperation among heterogeneous tools. The crux of our approach is to maintain a semantic view of the data in the sources and to develop applications against this view, using a declarative language similar to SQL. This raises the following issues. (1) The semantics of source data obviously depends on the instance of the data, and also on the way it is laid out or structured within the tool. However, the underlying schematic information is not known a priori. Worse, even the notion of a schema does not exist explicitly. Thus, we need a language for specifying the layout of data in a manner that corresponds to the data semantics, as perceived by the user. (2) We should support a (semantic) modeling of source data whose "granularity" closely matches the kind of use (e.g. kind of queries/updates) anticipated. (3) We need an algorithm that, given a specification of the layout, translates it into the actual semantic view. (4) If the view is materialized then it has to be kept in sync with the source: updates on the view must be reflected to the source and externally effected updates on the source must be pulled in and used to refresh the view. (5) There should be a mechanism for managing the activities of reflection and refreshing in a systematic manner, without requiring the application programmer to write that code as part of every application. Our approach for tackling the above issues is based on the concept of an abstract database machine (ADM) that we propose. The ADM is essentially a module that is responsible for: (1) mapping physical source data to a logical form corresponding to its semantic view, given a specification of data layout; (2) refreshing; (3) reflection. Figure 1 shows the architecture we envision for achieving interoperability among semi-structured information sources, based on building semantic abstractions on top of them. Though the architecture makes reference to specific Microsoft tools, the platform on which we realized our implementation, we stress that it is quite general and is applicable to other platforms as well. In the rest of the paper, by an application, we mean an interoperating application requiring access to heterogeneous data sources/tools. About this paper: In this paper, we present the abstractions we have developed specifically for spreadsheets, and discuss algorithms for implementing queries posed against a semantic relational view, on the underlying spreadsheets.
Our contributions are: (a) We propose an approach for realizing the aforementioned class of interoperating applications, based on maintaining (object-relational) proxies of the data in the different sources, and discuss an architecture for it (Section 2). (b) We develop the concepts and techniques necessary for building relational views which capture the semantics of the information in spreadsheet tools such as MS Excel.

(a)
                  City      Montreal   LA      ...
  John    1990    nuts      1000       2000    ...
                  bolts     750        1750    ...
          ...     ...       ...        ...     ...
          1991    nuts      1200       2200    ...
                  bolts     900        2000    ...
  Mary    1992    nuts      2000       3000    ...
                  screws    3000       4000    ...
          ...     ...       ...        ...     ...
          1994    bolts     1200       1300    ...
                  screws    3100       4000    ...
  ...     ...     ...       ...        ...     ...

(b) [Sheet not fully recoverable from the source. Its top row carries the quarter labels Qtr1 and Qtr2, the row below carries region labels (east, west under Qtr1; east, north under Qtr2), and the rows below list salesperson names (John, Mary), years, parts, and sales figures per region for each quarter.]

Figure 2: Some Example Data Layouts in Spreadsheets. Both figures show sample sales reports. (a) shows sales by different salespersons for different years, "plotted" against parts and cities. (b) shows similar sales data, for each quarter, plotted against parts and regions.

To this end, we propose the concept of a layout specification, which specifies a mapping between source data and a (user) perceived relational view of that data. From a database point of view, a layout specification can be likened to physical schema specification (Section 4). (c) We provide algorithms for materializing semantic relational views of spreadsheets, given their layout specification (Section 5). (d) We establish the practical usefulness of our approach by precisely characterizing the class of spreadsheet applications that it handles (Section 5.1). (e) We describe the system architecture as well as implementation of a system for interoperability that is based on the ideas in this paper (Section 6). We also make an extensive comparison of our work with other related research (Section 3). Figure 2 shows sample spreadsheets. Notice the variety of ways in which data is laid out with no notion of an associated well-defined schema. The user has an idea about the "concepts" represented in a spreadsheet. Some questions are: how can these concepts be specified? How can the specification be used for building a semantic view of a spreadsheet?

3 Comparison with Related Work

In this section, we compare our work with four major lines of previous work, specifically, and then briefly survey other related work. In the context of many interoperability projects, the important concept of mediator, first proposed by Wiederhold [19], is used for developing a view of the data sources against which queries can be posed. Examples of such projects include the TSIMMIS project at Stanford University [7], the HERMES project at the University of Maryland [18], the Information Manifold project at AT&T Labs [16], and the SchemaLog project at Concordia University [15]. (1) In the conventional approaches for mediator based interoperability (adopted in all known projects), a layer of abstraction called a wrapper is typically used to build a syntactically higher level view of a source, which supports relatively higher level queries than the native queries of the source on which it is built. The mediator is the application specific program that makes use of the "wrapper ports" of the disparate information sources, to provide for interoperability among them. However, what the wrapper exposes is a purely structural view of the source, and is usually instance independent. It is left to the mediator to unearth the semantic view of the source, thus mixing this code with the application specific code in the mediator. Recognizing the labor intensive nature of mediator development, researchers (e.g. see Papakonstantinou et al. [17]) have proposed "tool kits" for rapid prototyping of wrappers for a general class of sources. These tools, while expediting wrapper development, offer little support for capturing source data semantics. (2) The HERMES approach [18], although different from the TSIMMIS approach in several respects, does not enable the wrapper to extract the semantic view of source data: it is still the responsibility of the mediator. (3) The text/document querying and updating projects at INRIA [8, 1] and at the University of Waterloo [3] are also relevant to our work, although their focus is on sources quite different from spreadsheets. Again, their abstraction and modeling of text essentially captures the structure of the text determined from a grammar for the language, rather than the application-relevant semantics of the text.
(4) OLE and OLE-DB have been claimed [4] to provide a "database-like view" of Microsoft tools and are thus expected to enable programming interoperating applications among such sources in SQL. However, these technologies only achieve a syntactic database-like view of the sources. The semantics are essentially left to the application developer to extract. On the other hand, these technologies are very useful in serving as a "conduit" for physical (or syntactic) interoperability between MS sources, and can thus be used as a platform on which to realize an approach such as ours. (5) Finally, in a substantially different vein, the Office-By-Example (OBE) project of Zloof [20] in principle considered applications similar in spirit to the ones discussed in Section 1. However, OBE assumed the sources were non-autonomous internal ones, under the complete control of the application, and did not develop any semantic abstractions of the sources. A major difference of our work compared with previous work is that, unlike them, we provide concepts and techniques for specifying the source data semantics and for mapping it to a semantic view, meaningful for an application developer. To appreciate the difference, assume for the sake of illustration that a particular source, storing relational data, only exposes a fixed low level view of the data, in the form view(Relation, TupleId, Attribute, Value). This view is fixed regardless of the instance of source data it contains. On the other hand, a semantic view of the data, showing the relations and their schemas in the familiar manner, is clearly instance dependent. Recently, Gyssens et al. [11] proposed a two-dimensional data model called the tabular data model and defined a tabular algebra corresponding to it. They observed the potential of the tabular model and algebra to serve as a unifying model for relations (as in databases) and spreadsheets. Given this, a natural question would be why not simply use tabular algebra for our requirements in this project. The key difference is that the tabular model deals with tables which are designed with some well-defined schema in mind. The problem we face in the work here is that we need to unearth the semantic view of a spreadsheet whose schema is not clearly visible. While tabular algebra can map between two-dimensional tables and relations, these relations are not guaranteed to correspond to the semantic view, unless the intended semantics is made clear. Other important related efforts include [5, 14, 9, 6], as well as the extensive work on multidatabases (see [2, 12] for surveys). But none of these addresses the issue of abstracting the source data semantics (in an application dependent manner).

4 Semantic Model for Spreadsheets

In this section, we discuss the issues that arise in building a semantic abstraction on top of spreadsheets, and present a methodology for mapping spreadsheet information to a relational view. The challenges are: (1) Spreadsheets are two-dimensional objects, whereas relations are one-dimensional. (2) Unlike relations, the "attributes" associated with the data in a spreadsheet do not have fixed positions; worse, they may not appear at all. We must provide some means for mapping the data in such a spreadsheet to a relation with a well-defined schema that is meaningful with respect to the application on hand. (3) Given the unbounded number of possibilities in which data may be laid out on a spreadsheet, we must come up with a means by which the user can clearly specify the layout of the spreadsheet using her perception of the underlying concepts in the spreadsheet. We need a general technique for translating a given specification of these concepts into a relation scheme corresponding to the semantic view of a spreadsheet. (4) We need to develop algorithms that make use of the user's layout specification to map the spreadsheet data to its semantic (relational) view. In attempting to map the information content of a spreadsheet to a relational schema, we are implicitly associating a schema with the sheet, as well as instances for each component of the schema. We formalize this notion below.

Definition 4.1 (concept) The attributes in the schema of the semantic view of a source (spreadsheet) are called concepts from the source's point of view. A concept is a row (column) concept if its instances range over rows (columns) in a spreadsheet.

In the spreadsheet of Figure 2(a) in Section 2, name, instances of which range over the rows in the spreadsheet, is a row concept, whereas city is a column concept. As we will see later in this section, the notion of row (column) concept plays an important role in the specification of the layout of a spreadsheet. Crucial to the human understanding of spreadsheet information is the "togetherness" of related information in the spreadsheet. E.g. in the spreadsheet in Figure 2(a), information related to John is present together in a "block" corresponding to John, within which information related to John's sales in 1990 is grouped together, etc. Also, a common operation performed on spreadsheets is aggregation or some formula application over such a block of information. In order to perform such aggregation operations, spreadsheet tools usually require rows (columns) of a block to be adjacent. Thus for many reasons, information blocks are an integral component of a spreadsheet. Below we formalize this notion.

Definition 4.2 (Relation $\equiv_C$) Let $c_1$ and $c_2$ be any two cell contents in a spreadsheet. We say $c_1$ is related to $c_2$, denoted $c_1 \sim c_2$, iff $c_1$ and $c_2$ are identical (i.e. $c_1 = c_2$) or $c_2$ is a blank. Let $C = (c_1, \ldots, c_k)$ and $D = (d_1, \ldots, d_k)$ be (row/column) vectors of cell contents. Then $C \sim D$ iff $c_i \sim d_i$, $1 \le i \le k$. Let $i$ be a row (column) number and $C$ be a list of row (column) concepts in a spreadsheet. Then $i.C$ is the tuple of data entries that are instances of $C$ in the $i$-th row (column). We define a relation $\sim_C$ on rows (columns) of the spreadsheet as follows. Let $i, j$ be two rows (columns) in the spreadsheet. Then $i \sim_C j$ iff $(i.C \sim j.C) \wedge j = i + 1$. The relation $\equiv_C$ is the symmetric, reflexive, transitive closure of $\sim_C$.

As an example, in the spreadsheet of Figure 2(a), row 1 through the row immediately above the first Mary row form the "block" corresponding to the instance John of the name concept.
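The block relation of Definition 4.2 can be illustrated with a short Python sketch. It is only one possible reading of the definition, not the implementation described in this paper: the names `BLANK`, `GRID`, `cell_related`, `rows_related`, and `block_of` are hypothetical, blanks are modeled as `None`, and the sample grid is a small excerpt in the style of Figure 2(a).

```python
BLANK = None  # assumption: an empty cell is represented by None

# Hypothetical excerpt in the style of Figure 2(a); row 0 is the header row.
GRID = [
    [BLANK,  BLANK, "City",   "Montreal", "LA"],
    ["John", 1990,  "nuts",   1000,       2000],
    [BLANK,  BLANK, "bolts",  750,        1750],
    [BLANK,  1991,  "nuts",   1200,       2200],
    [BLANK,  BLANK, "bolts",  900,        2000],
    ["Mary", 1992,  "nuts",   2000,       3000],
    [BLANK,  BLANK, "screws", 3000,       4000],
]


def cell_related(c1, c2):
    """c1 is related to c2 iff they are identical or c2 is blank."""
    return c1 == c2 or c2 is BLANK


def rows_related(grid, i, j, concept_cols):
    """Row i is related to row j iff j = i + 1 and the concept entries relate."""
    if j != i + 1:
        return False
    return all(cell_related(grid[i][k], grid[j][k]) for k in concept_cols)


def block_of(grid, i, concept_cols):
    """Return (first_row, last_row) of the block containing row i.

    This walks the symmetric, reflexive, transitive closure by extending
    the block upward and downward while adjacent rows are related.
    """
    lo = hi = i
    while lo - 1 >= 0 and rows_related(grid, lo - 1, lo, concept_cols):
        lo -= 1
    while hi + 1 < len(grid) and rows_related(grid, hi, hi + 1, concept_cols):
        hi += 1
    return lo, hi
```

Here `block_of(GRID, 3, [0])` yields the rows of the John block whether the name is factored out (as in the sample) or repeated on every row, since both cases satisfy the cell-level relation.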
It is important to note that our definition does not care if the name instance has been "factored out" (appears once) and the rest of the information corresponding to it is "nested" within the occurrence (as in our example) or if the name instance is explicitly multiplied out for each row on which there is information pertaining to the instance. Both scenarios are natural, and our definition takes care of both of them elegantly. One of the key pieces of information needed for extracting a semantic view is where each block begins and ends. This information plays a crucial role in the layout specification as well as the algorithm that maps it to the semantic view.

Definition 4.3 Let $C$ be a list of row (column) concepts. The end of block of $C$ in the context of a row (column) $i$, denoted $eob(C, i)$, is defined as $\max \{ j \mid j \equiv_C i \}$.

In our example spreadsheet, eob(name, 1) returns the row number immediately above the first Mary row. In the specification of a spreadsheet layout, we shall make use of the following built-in functions.

Definition 4.4 (Built-in Functions) (1) eor() and eoc(), which indicate the index of the last row (column) of a given worksheet. (2) eob(concept, i), which indicates the index of the last row (column) in the block containing the i-th row (column), that is associated with concept (as in Definition 4.3).

Of these, eor() and eoc() are usually directly provided by most spreadsheet tools, including MS Excel. On the other hand, eob() can be implemented on top of the tool. We are now ready to provide the formal definition of a layout specification. Intuitively, a layout specification associates a variable with concepts and lays down its associated range in the worksheet.

Definition 4.5 (Layout specification) A layout specification is a set of concepts, where each concept is associated with the following properties.
1. coordinate, which has two components: row index and column index. These indices are variables or constants.
2. range, which in turn has the properties lower bound and upper bound on the indices used in the coordinate specification of concepts as defined in (1). These bounds can be constants, variables, arithmetic expressions involving them, or built-in/user defined functions.
3. increment associated with the concept. This may be a constant, a variable, an arithmetic expression involving them, or built-in/user defined functions. This property is optional, and when not present, is assumed to be 1 by default.

The concepts in the layout specification form the schema of the semantic view. A concept's range indicates in which part of the rows/columns of the spreadsheet the instances of the concept occur. The coordinate provides the exact location of the instance, and increment helps step through the rows/columns mentioned in the range. Sometimes, the range associated with a concept's coordinate indices may already be specified along with the specification of a previous concept (see Example 4.1). From our experience, such a specification seems to be sufficient for capturing the layout of a whole class of naturally occurring spreadsheets. In Section 5, we precisely characterize the class of spreadsheets handled by our notion of layout specification.
Before we present an example of a layout specification, we define the following terms associated with the concepts of a worksheet.

Definition 4.6 (Measure/Parameter concepts) Let W be a worksheet and L be its layout specification. Concepts in W whose coordinates are of the form (X, Y), X and Y being variables, where the ranges associated with X and Y are already specified in L, are called measure concepts. The measure concept is said to be determined by the concepts associated with the variables X and Y. Remaining concepts that are not measure concepts are called parameter concepts.
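As an illustration of Definitions 4.5 and 4.6, a layout specification can be held in a small data structure and its concepts classified mechanically. The sketch below is an assumption-laden reading, not the representation used in our implementation: the class names `Range` and `Concept`, the function `classify`, and the choice to store bounds as text are all hypothetical, and "already specified" is interpreted as "specified by a concept listed earlier". The sample data is the layout that Example 4.1 gives for Figure 2(a).

```python
from dataclasses import dataclass
from typing import Optional, Union

Index = Union[int, str]  # a constant index, or the name of a variable


@dataclass(frozen=True)
class Range:
    var: str               # the variable being ranged over
    lower: str             # lower bound, kept as an uninterpreted expression
    upper: str             # upper bound, kept as an uninterpreted expression
    increment: str = "1"   # optional; defaults to 1 as in Definition 4.5


@dataclass(frozen=True)
class Concept:
    name: str
    row: Index
    col: Index
    rng: Optional[Range] = None  # absent when the ranges were given earlier


# Layout of Example 4.1, transcribed into the hypothetical structure above.
LAYOUT_4_1 = [
    Concept("name", "N", 0, Range("N", "1", "eor()")),
    Concept("year", "Y", 1, Range("Y", "N", "eob(name)")),
    Concept("city", 0, "C", Range("C", "3", "eoc()")),
    Concept("part", "P", 2, Range("P", "Y", "eob(year)")),
    Concept("sales", "P", "C"),
]


def classify(layout):
    """Label each concept as 'measure' or 'parameter' (Definition 4.6)."""
    ranged = set()  # variables whose ranges were specified by earlier concepts
    labels = {}
    for concept in layout:
        both_vars = isinstance(concept.row, str) and isinstance(concept.col, str)
        is_measure = both_vars and concept.row in ranged and concept.col in ranged
        labels[concept.name] = "measure" if is_measure else "parameter"
        if concept.rng is not None:
            ranged.add(concept.rng.var)
    return labels
```

Under this reading, `classify(LAYOUT_4_1)` marks sales as the only measure concept, determined by the concepts part and city.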

Example 4.1 For the spreadsheet in Figure 2(a), our layout specification would be the following.

  Concept   Coordinate   Range
  name:     N, 0         1 ≤ N ≤ eor()
  year:     Y, 1         N ≤ Y ≤ eob(name)
  city:     0, C         3 ≤ C ≤ eoc()
  part:     P, 2         Y ≤ P ≤ eob(year)
  sales:    P, C

Informally, this specification says the following.
(1) Entries corresponding to salesperson names can be found in column 0, from row 1, down to the last row.
(2) Years corresponding to the names above appear in rows N through eob(name), where N denotes the current "cursor" corresponding to name, and eob(name) is the last row corresponding to the current name. The fact that several rows may correspond to a given name is captured through the use of the variable N (ranging over name entries) to denote the first row where the name appears, and the method eob(name), which returns the last row corresponding to the name.
(3) Cities appear on row 0, column 3 onward, until the last column, indicated by the method eoc().
(4) Parts appear in column 2 within each block of rows corresponding to the same name and year. This is indicated by saying that the row index of part-values, P, ranges from the first row corresponding to the current year, namely the value of the variable Y, up to the end of block of the current year.
(5) Finally, sales-values appear in cells indexed by row P and column C, thereby capturing the connections between part, city, and sales.
(6) The optional increment property is not present in the specification and hence is assumed to be 1.

Notice that the ranges associated with the variables P and C, used to index the concept sales, are already defined before the specification of the concept sales; hence in this example, sales is a measure concept while the others are parameter concepts. Intuitively, the layout specification specifies how the spreadsheet should be "read" in order to materialize the semantic view. We present one more example of a layout specification.

Example 4.2 As a second example, consider the more sophisticated layout of the spreadsheet shown in Figure 2(b). The layout specification for this spreadsheet is:

  Concept   Coordinate   Range
  name:     N, 0         2 ≤ N ≤ eor()
  year:     Y, 1         N ≤ Y ≤ eob(name)
  qtr:      0, Q         2 ≤ Q ≤ eoc()
  region:   1, R         Q + 1 ≤ R ≤ eob(qtr)
  part:     P, Q         Y ≤ P ≤ eob(year)
  sales:    P, R

The specification above follows notations and conventions similar to those of the previous example. Having informally described the intuitive meaning of a layout specification, below we provide its formal semantics.

Definition 4.7 Let $L$ be a layout specification and $W$ be a worksheet conforming to $L$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be the coordinates and $\rho_1, \ldots, \rho_m$ be the range specifications occurring in $L$. The semantic view implied by $L$, $SEM_L(W)$, is the relation $\{ \langle (X_1, Y_1), \ldots, (X_n, Y_n) \rangle \mid \rho_1 \wedge \cdots \wedge \rho_m \}$. Recall that when the increment is left unspecified, it is assumed to be 1 by default.

[Figure 3: Variable Dependency Graph for the layout specification of Example 4.2, over the nodes N, Y, P, Q, and R.]

The above definition precisely captures the semantic view of a spreadsheet as expressed by a layout specification. Thus, the layout specification provides a clean mapping between the semi-structured information in the spreadsheet and an underlying "relational image". In this sense, the layout specification can be thought of as playing the role of a physical schema specification.

Example 4.3 Consider Example 4.1. Its semantic view according to Definition 4.7 is the relation $\{ \langle (N, 0), (Y, 1), (0, C), (P, 2), (P, C) \rangle \mid 1 \le N \le eor() \wedge N \le Y \le eob(name, Y) \wedge 3 \le C \le eoc() \wedge Y \le P \le eob(year, P) \}$.
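To make the relation of Example 4.3 concrete, the following Python sketch enumerates it for a small sheet. It is a hand-written illustration under stated assumptions rather than generated code: `SHEET_A` is a hypothetical excerpt in the style of Figure 2(a), blanks are `None`, the helper `eob` is a simplified single-column reading of Definition 4.3, and the cursors N and Y are advanced block by block so that each tuple is produced once.

```python
BLANK = None  # assumption: an empty cell is represented by None

# Hypothetical excerpt in the style of Figure 2(a).
SHEET_A = [
    [BLANK,  BLANK, "City",   "Montreal", "LA"],
    ["John", 1990,  "nuts",   1000,       2000],
    [BLANK,  BLANK, "bolts",  750,        1750],
    [BLANK,  1991,  "nuts",   1200,       2200],
    [BLANK,  BLANK, "bolts",  900,        2000],
    ["Mary", 1992,  "nuts",   2000,       3000],
    [BLANK,  BLANK, "screws", 3000,       4000],
]


def eob(sheet, col, i):
    """Last row of the block that starts at row i for the concept in col."""
    j = i
    while j + 1 < len(sheet) and (
        sheet[j + 1][col] == sheet[j][col] or sheet[j + 1][col] is BLANK
    ):
        j += 1
    return j


def semantic_view(sheet):
    """Enumerate <name, year, city, part, sales> tuples for the layout of Example 4.1."""
    eor = len(sheet) - 1      # index of the last row
    eoc = len(sheet[0]) - 1   # index of the last column
    tuples = []
    n = 1
    while n <= eor:                      # 1 <= N <= eor()
        name_end = eob(sheet, 0, n)
        y = n
        while y <= name_end:             # N <= Y <= eob(name)
            year_end = eob(sheet, 1, y)
            for p in range(y, year_end + 1):   # Y <= P <= eob(year)
                for c in range(3, eoc + 1):    # 3 <= C <= eoc()
                    tuples.append(
                        (sheet[n][0], sheet[y][1], sheet[0][c], sheet[p][2], sheet[p][c])
                    )
            y = year_end + 1             # jump to the next year block
        n = name_end + 1                 # jump to the next name block
    return tuples
```

Note how the name and year components are read at the cursors N and Y, so factored-out (blank) cells below them never appear in the result.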

5 Semantic View Materialization

In this section, we shall consider the problem of materializing the semantic view of a spreadsheet, given its layout specification. The task on hand is that of automating the process whereby, given an arbitrary layout specification, code is generated which, when executed on any worksheet conforming to the original specification, will materialize its semantic view. We need the following definition before presenting the algorithm.

Definition 5.1 Let $L$ be a layout specification. We let $Vars(L)$ denote the set of variables occurring in $L$. For variables $X, Y \in Vars(L)$, we say $Y$ depends on $X$, provided $X$ occurs in the range specification for $Y$ either in the lower bound or in the upper bound. The variable dependency graph induced by a layout specification $L$ is a directed graph $G(L) = (Vars(L), E(L))$, such that $E(L)$ contains a directed edge $\langle Y, X \rangle$ iff $Y$ depends on $X$.

For instance, the variable dependency graph for the layout specification of Example 4.2 is shown in Figure 3. Not all layout specifications are meaningful. For instance, when two concepts are used in each other's range specification, the cyclic dependency between them implies no practical worksheet layout can actually correspond to them. The following definition isolates layout specifications with a well-defined meaning.

Definition 5.2 (Well-defined layout specification) Let L be a layout specification. L is said to be well-defined provided it satisfies the following conditions. (1) For each variable in L, whenever the lower and upper bounds are both constants, the lower bound is no more than the upper bound. (2) No row (column) concept depends on a column (row) concept. (3) Whenever one of the bounds for a variable V is eob(concept, V), there is a variable C associated with the concept concept, and V depends on C. (4) The variable dependency graph induced by L is acyclic.

It follows from the definition of well-definedness that each variable can depend on at most one variable, which is necessarily different from itself. Among other things, the variable dependency graph associated with a well-defined layout specification is always acyclic and hence it is possible to order its nodes linearly by a topological sort that respects the edges of the graph. We now present our generic iterative algorithm for worksheet to semantic view mapping.
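The dependency graph of Definition 5.1 and the acyclicity condition of Definition 5.2 can be sketched in a few lines of Python. This is a minimal illustration, not our implementation: the dictionaries `RANGES_4_2` and `CONCEPT_VAR` and the functions `dependencies` and `loop_order` are hypothetical names, bounds are kept as plain text, and only condition (4) is checked. The standard-library `graphlib` module (Python 3.9 or later) performs the topological sort.

```python
import re
from graphlib import CycleError, TopologicalSorter

# Range specifications of Example 4.2: variable -> (lower bound, upper bound).
RANGES_4_2 = {
    "N": ("2", "eor()"),
    "Y": ("N", "eob(name)"),
    "Q": ("2", "eoc()"),
    "R": ("Q + 1", "eob(qtr)"),
    "P": ("Y", "eob(year)"),
}

# Variable associated with each ranged concept in Example 4.2.
CONCEPT_VAR = {"name": "N", "year": "Y", "qtr": "Q", "region": "R", "part": "P"}


def dependencies(ranges, concept_var):
    """Map each variable to the set of variables it depends on."""
    deps = {}
    for var, bounds in ranges.items():
        found = set()
        for bound in bounds:
            for token in re.findall(r"[A-Za-z_]\w*", bound):
                if token in ranges:
                    found.add(token)               # a variable used in a bound
                elif token in concept_var:
                    found.add(concept_var[token])  # eob(concept) refers to its variable
        found.discard(var)  # eob(concept, V) may mention V itself; ignored here
        deps[var] = found
    return deps


def loop_order(ranges, concept_var):
    """Return the variables so that each one follows those it depends on."""
    deps = dependencies(ranges, concept_var)
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        raise ValueError("layout specification not well defined") from None
```

For Example 4.2 the resulting order places N before Y before P, and Q before R, which matches the nesting of loops from outermost to innermost; two variables that appear in each other's bounds are rejected.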

Algorithm 5.1 (Semantic View Materialization)
Input: A layout specification L and a worksheet W conforming to it.
Output: (1) Layout specific code which, when executed on W, generates the semantic view of W implied by L. (2) Code for eob(concept, V), for various concepts and associated index variables.

Begin

1. Analyze L and generate the variable dependency graph G(L) (Definition 5.1).
2. Test whether L is well defined (Definition 5.2). If L is not well defined, then return("layout specification not well defined").
3. Use topological sorting to obtain a linear ordering of the nodes of G(L) such that for each edge Y → X in G(L), Y < X in the ordering. Using the ordering, generate while loops such that whenever Y < X, the loop for Y is nested within the loop for X.
4. Initialize each variable to the lower bound in its range specification; the exit condition for the associated while loop is given by the upper bound.
5. The increment for each while loop is based on the following rule: if the variable V associated with the while loop is a leaf node in G(L), then increment it by the increment value provided in the specification (the default is 1); else V is assigned a variable that depends on it.
6. The coordinates used for obtaining the components of a (semantic view) tuple from the spreadsheet are identical to the ones mentioned in the layout specification L. The innermost loop has the insert-tuple statement which lists the coordinates corresponding to all the concepts, as mentioned in the layout specification.
7. The eob() procedure: for each occurrence of an expression of the form LB ≤ V ≤ eob(concept) in some range specification in L, do: Suppose concept is a row concept and let C be the associated variable. (The procedure for a column concept is similar.) The range specification above indicates V must depend on the variable associated with concept, and hence the latter must exist. Let the coordinates for the variable associated with concept be (C, I), where I is either a variable or a constant. Then, the code for eob(concept, V) is as follows.

eob(concept; V )

bool function

f

If

(cell(V; I ) 6= b=) ^ cell(V; I ) 6= cell(V ? 1; I ))

true) false)

return( else return(

g End
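Steps 1 to 3 of Algorithm 5.1 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `loop_order` is hypothetical, the dependency edges in the usage note are inferred from the assignments in the loop code of Example 5.1 (Y := N, P := Y, R := Q + 1) rather than copied from Figure 3, and treating a cyclic graph as "not well defined" is an assumption, since Definition 5.2 may impose further conditions.

```python
from graphlib import CycleError, TopologicalSorter


def loop_order(depends_on):
    """Return the variables so that each one precedes its dependents.

    `depends_on` maps a variable name to the set of variables it
    depends on. Assumption: a cycle means the layout specification
    is not well defined; the paper's Definition 5.2 may ask for more.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError("layout specification not well defined") from exc


# Edges inferred (not taken from Figure 3) from Example 5.1's loop code.
example_graph = {"Y": {"N"}, "P": {"Y"}, "R": {"Q"}}
order = loop_order(example_graph)
```

Any order returned for `example_graph` places N before Y, Y before P, and Q before R; the listing ⟨N, Y, P, Q, R⟩ used in Example 5.1 is one such order. Topological orders are generally not unique, so the sketch makes no promise about which valid order is produced.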

Example 5.1 We illustrate our algorithm using the spreadsheet of Example 2(b) and its corresponding layout specification in Example 4.2. Figure 3 shows the variable dependency graph obtained by analyzing the layout specification (Step 1). One possible topological ordering for this example is ⟨N, Y, P, Q, R⟩ (Step 3), which is reflected in the order of nesting of the while loops in the code below. The initialization and exit condition for each loop are obtained directly from the layout specification (Step 4). The increment for each loop is obtained by inspecting the variable dependency graph (Step 5). The innermost loop, corresponding to variable R, contains the insert tuple statement, with its arguments obtained from the layout specification (Step 6). Below, we show the main code plus the code for the function eob(quarter, R) (note that quarter is a column concept). The code for eob(name, Y) is similar.

N := 2
while not eor() {
    Y := N
    while not eob(name, Y) {
        P := Y
        while not eob(year, P) {
            Q := 2
            while not eoc() {
                R := Q + 1
                while not eob(quarter, R) {
                    insert_tuple(<cell(N, 0), cell(Y, 1), cell(0, Q), cell(1, R), cell(P, Q), cell(P, R)>)
                    R++
                }
                Q := R
            }
            P++
        }
        Y := P
    }
    N := Y
}

bool function eob(quarter, R)
{
    If (cell(0, R) ≠ ␢ ∧ cell(0, R) ≠ cell(0, R − 1))
        return(true)
    else return(false)
}
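The loop pattern of Example 5.1 can be tried out on a much simpler sheet with the following Python sketch. Everything in it is an assumption made for illustration: the grid layout (year labels in row 0, quarter labels in row 1, names in column 0), the function names `cell`, `eob` and `materialize`, the use of an empty string for a blank cell, and the choice to treat running past the last column as the end of a block. The inner loop tests eob after advancing R, so that the first column of a block is not mistaken for the start of the next one; the paper's generated code may handle this differently.

```python
BLANK = ""  # assumption: blank cells are represented by empty strings


def cell(grid, row, col):
    """Read one cell; the only spreadsheet operation the sketch needs."""
    return grid[row][col]


def eob(grid, header_row, col):
    """True when column `col` opens a new block in `header_row`.

    Mirrors the test in the text: the cell is non-blank and differs
    from its left neighbour. Assumption: a column past the end of the
    sheet also closes the current block.
    """
    if col >= len(grid[header_row]):
        return True
    current = cell(grid, header_row, col)
    return current != BLANK and current != cell(grid, header_row, col - 1)


def materialize(grid):
    """Collect (name, year, quarter, value) tuples from the grid."""
    tuples = []
    n_rows, n_cols = len(grid), len(grid[0])
    p = 2                      # first data row
    while p < n_rows:          # stands in for "not eor()"
        q = 1                  # first column of the first year block
        while q < n_cols:      # stands in for "not eoc()"
            r = q
            while True:
                tuples.append((cell(grid, p, 0), cell(grid, 0, q),
                               cell(grid, 1, r), cell(grid, p, r)))
                r += 1         # leaf variable: plain increment
                if eob(grid, 0, r):
                    break
            q = r              # non-leaf variable: jump to its dependent
        p += 1
    return tuples


sheet = [
    ["",    "1995", "",   "1996", ""],
    ["",    "Q1",   "Q2", "Q1",   "Q2"],
    ["Ann", 10,     11,   12,     13],
    ["Bob", 20,     21,   22,     23],
]
rows = materialize(sheet)
```

For the four-row `sheet` above, `rows` holds one tuple per data cell, starting with `("Ann", "1995", "Q1", 10)`. The assignment `q = r` corresponds to the rule in Step 5 that a non-leaf variable is assigned a variable that depends on it.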

The following theorem establishes the correctness of our algorithm. For lack of space, its proof is omitted here; it can be found in [13].

Theorem 5.1 (Correctness of Algorithm 5.1) Let L be a layout specification and W a worksheet conforming to it. Then the code generated by Algorithm 5.1 with L as input, when executed on W, correctly generates SEM_L(W), the semantic view of W implied by L, in the sense of Definition 4.7. □

We note that since Algorithm 5.1 uses the "minimal instruction set" imaginable on spreadsheets, namely that of reading cells, it is fairly general and implementable on top of any spreadsheet. Before discussing our implementation, we present a formal result on the power of our approach to handle the variety of spreadsheets that occur in real life.

5.1 Expressive Power of Layout Specification

Our discussion so far of layout specifications and the associated algorithm for materializing the semantic view raises the following fundamental question: what kinds of spreadsheets can be handled using our notion of layout specification? The following theorem, the proof of which is omitted here for want of space (and given in [13]), gives an exact characterization of the class of spreadsheets captured using the notion of (well-defined) layout specifications.

Theorem 5.2 Let W be a worksheet, measure(W) be the set of measure concepts of W, and for m ∈ measure(W), let parameter(W, m) be the set of parameter concepts that determine the measure concept m of W. The class of worksheets whose layout can be captured by the layout specification of Definition 4.5 is the class of worksheets that satisfy the property: parameter(W, m) = parameter(W, n), ∀ m, n ∈ measure(W).

Intuitively, Theorem 5.2 states that the class of spreadsheets handled by our notion of layout specification is the class of all spreadsheets conforming to well-defined layout specifications that satisfy the property that all measures in the spreadsheet layout are determined by the same set of parameters. From our experience, a whole variety of naturally occurring spreadsheets belong to this class. Examples include spreadsheets containing accounting/financial information, various kinds of reports, experimental data, etc. Thus, Theorem 5.2 establishes the practical value of our approach for handling a variety of real-life spreadsheet applications.
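The condition in Theorem 5.2 is easy to state operationally. The following Python sketch checks only the equality parameter(W, m) = parameter(W, n) for all measures; it does not check conformance or well-definedness. The function name `same_parameters` and the concept names in the sample data are hypothetical.

```python
def same_parameters(parameters_of):
    """Check that every measure is determined by the same parameter set.

    `parameters_of` maps each measure concept to the collection of
    parameter concepts that determine it.
    """
    param_sets = [frozenset(params) for params in parameters_of.values()]
    return all(s == param_sets[0] for s in param_sets)


# Hypothetical worksheets: both measures share {name, year, quarter} ...
uniform = {
    "sales": {"name", "year", "quarter"},
    "profit": {"quarter", "year", "name"},
}
# ... versus a measure that is determined by fewer parameters.
mixed = {
    "sales": {"name", "year", "quarter"},
    "bonus": {"name", "year"},
}
```

Here `same_parameters(uniform)` is `True` and `same_parameters(mixed)` is `False`, so under the theorem only a worksheet like the first would fall in the class captured by layout specifications.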

6 Implementation

In our project, besides spreadsheets, we have implemented ADMs for other tools such as databases and word processors. Our implementation is done on Windows NT, and provides for interoperability among MS Excel spreadsheets, MS Access databases, and MS Word documents. The implementation essentially makes use of OLE 2.0 as a conduit for physical interoperability among these tools. The architecture of our implementation is the one sketched in Figure 1 of Section 1. Access, Excel, and Word are OLE servers, and our implementation is an OLE client application that programmatically controls these servers. The ADMs are implemented in MFC/VC++. The implementation makes use of the OLE interfaces available for manipulating Excel objects (such as worksheets and ranges), Access objects (such as recordsets), and the Word object (essentially the WordBasic language). In our implementation, the ADMs take as input the user's layout specification and the tool data (spreadsheets, databases, or documents) and map it to the object-relational proxy. The Excel ADM is based on Algorithm 5.1 of Section 5. ADMs for Access and Word are discussed in [13]. The interoperability application is written in the language Logic++ [10], a higher order Horn clause logic language on complex objects with object-oriented features. Our choice of Logic++ as the language for application development is based on (1) its programming model, which provides natural support for OO features in a declarative framework, (2) its intrinsic support for data structures such as sets, lists, and nested tuples, and behavioral aspects such as methods, and (3) its ability to provide generic managers for reflect and refresh that automatically keep the external source data and the proxy data in sync by invoking the update reflection and refreshing routines at the appropriate juncture. For an extended discussion of Logic++ as well as the reflect mechanism, see [10].

7 Discussion and Conclusions

In this paper, we motivated the problem of querying non-traditional data sources like spreadsheets and word processors, from the point of view of real applications that require interoperating among such tools. While this observation has been made by others earlier, we have brought out the unique nature of our work. Our approach is based on proxying the data in such non-traditional sources, but unlike previous work, we proxy the data from a point of view that is dictated by application semantics. While our work involved the integration of relational databases, spreadsheets, and word processors, in this paper we focused attention on the work we did on spreadsheets. Our ideas and algorithms have been implemented on MS Office running on Windows NT. To our knowledge, for the first time, we have proposed the necessary concepts and algorithms for materializing the semantic view of a spreadsheet by associating a physical schema called layout specification. We also established the relevance of our approach for handling real-life applications, by characterizing the class of spreadsheets that can be captured by our approach.

References

[1] Abiteboul, S., Cluet, S., and Milo, T. A database interface for file update. In ACM SIGMOD, 1995.

[2] ACM Computing Surveys, 22(3), Sept 1990. Special issue on HDBS.

[3] Blake, G.E., et al. Text/relational database management systems: Overview and proposed SQL extensions. Tech. report, CS Dept, Univ of Waterloo, Canada, 1995.

[4] Blakeley, J.A. Data access for the masses through OLE DB. In ACM SIGMOD, pp. 161-172, 1996.

[5] Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. A query language and optimization techniques for unstructured data. In ACM SIGMOD, 1996.

[6] Carey, M.J., et al. Towards Heterogeneous Multimedia Information Systems: the Garlic Approach. In IEEE RIDE-DOM, pp. 124-131, 1995.

[7] Chawathe, S., et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In IPSJ, Tokyo, 1994.

[8] Christophides, V., Cluet, S., Abiteboul, S., and Scholl, M. From structured documents to novel query facilities. In ACM SIGMOD, 1994.

[9] Colby, Latha, et al. Algorithms for deferred view maintenance. In ACM SIGMOD, pp. 469-480, 1996.

[10] Goyal, N., Hoch, C., Krishnamurthy, R., Meckler, B., and Suckow, M. Is GUI programming a database research problem? In ACM SIGMOD, 1996.

[11] Gyssens, Marc, Lakshmanan, L.V.S., and Subramanian, I.N. Tables as a paradigm for querying and restructuring. In Proc. ACM PODS, 1996.

[12] Hsiao, D.K. Federated databases and systems: Part one - a tutorial on their data sharing. VLDB Journal, 1:127-179, 1992.

[13] Lakshmanan, Laks V.S., Subramanian, Iyer N., Goyal, Nita, and Krishnamurthy, Ravi. On querying and updating the spreadsheets. Technical report, Concordia University, Montreal, Quebec, October 1996.

[14] Lakshmanan, L.V.S., Sadri, F., and Subramanian, I.N. A declarative language for querying and restructuring the web. In IEEE RIDE-IMS, 1996.

[15] Lakshmanan, L.V.S., Sadri, F., and Subramanian, I.N. Logic and algebraic languages for interoperability in multidatabase systems. Journal of Logic Programming, 33(2), pp. 101-149, 1997.

[16] Levy, A.Y., Rajaraman, A., and Ordille, J.J. Querying heterogeneous information sources using source descriptions. In VLDB, pp. 251-262, 1996.

[17] Papakonstantinou, Y., et al. A query translation scheme for rapid implementation of wrappers. In DOOD, 1995.

[18] Subrahmanian, V.S., et al. HERMES: Heterogeneous Reasoning and Mediator System. Tech. report, IACS & Dept of CS, Univ of Maryland, College Park, 1995.

[19] Wiederhold, G. Mediators in the Architecture of Future Information Systems. IEEE Computer, March 1992.

[20] Zloof, M.M. Office-by-example: A business language that unifies data and word processing and electronic mail. IBM Systems Journal, 21(3):272-304, 1982.