Benchmarking Temporal Databases: A Research Agenda

TECHNICAL REPORT 95-CSE-20

Margaret H. Dunham, Ramez Elmasri, Mario A. Nascimento (1), and Marion Sobol

Department of Computer Science and Engineering
Southern Methodist University
Dallas, TX 75275-0122
fmario, [email protected]

Computer Science and Engineering Department
University of Texas at Arlington
Arlington, TX 76019-0015
[email protected]

Management and Information System Department
Southern Methodist University
Dallas, TX 75275-0122
[email protected]

December, 1995.

(1) On leave of absence from CNPTIA-EMBRAPA, Campinas, Brazil ([email protected]), and supported by CNPq (Process 260088/92.7), Brasilia, Brazil.

Abstract

There has been a great deal of published research in the field of temporal databases (TDBs), and much of it has been devoted to indexing structures. There has also been some research on benchmarks for temporal databases, but it has focused on evaluating the semantic expressiveness of a query language and/or the capabilities of the TDB model. Thus far, however, little has been done to design benchmarks that evaluate indexing structures for TDBs. The goal of this paper is to provide a framework for benchmarking indexing structures and algorithms for TDBs. We also present ways to generate application-independent test data to be used in the benchmarking process. The framework presented can be applied to benchmark bitemporal databases as well.

Contents

1 Introduction
2 Temporal Data Model Overview
3 Components of a 2TDB Benchmark
  3.1 Conducting Surveys of Temporal Database Users
  3.2 Generating Data for the Temporal Database Benchmark
  3.3 The Benchmark Queries
    3.3.1 Querying Single Temporal Attributes
    3.3.2 Querying a Bitemporal Database
    3.3.3 Determining Queries Needed by TDB Benchmarks
4 Research Objectives

List of Figures

1 The Normal distribution and the lifespan spanning

List of Tables

1 Range and Point Operators
2 Benchmark VQ queries based on valid time values
3 Example queries based on valid time values
4 Benchmark TQ queries based on transaction time values
5 Example queries based on transaction time values
6 Benchmark queries on bitemporal (i.e. based on valid and transaction time) data

1 Introduction

Temporal database systems aim to provide database users with a framework for storing and retrieving all states of a database application over time. This includes past (historical) states, the current state of the database, and possibly planned future states as well. Temporal databases (TDBs) also store the times when changes and events occur, and allow queries that refer to this temporal information. Many existing database applications require temporal data to be stored on-line. These applications include medical histories, planning and scheduling, engineering design and manufacturing, financial databases, geographic information systems, and many others.

Most commercial database management systems (DBMSs) provide minimal support for temporal applications, usually in the form of special data types for date, time, and timestamp attributes. In this case, the design and implementation of the temporal database and the semantics of time are left entirely to the designers, users, and implementors of temporal database applications. The long-term goal of temporal database research is to provide concepts and techniques for creating temporal DBMSs that include features for supporting temporal database applications. This would make it easier for designers and implementors to create these applications. Towards this goal, much research has been done concerning temporal data models, query languages, and indexing structures [McK86, Soo91, Kli93]. However, little work has been done to provide benchmarks for TDBs. The work that has been done in temporal benchmarking is geared towards the semantic expressiveness of the TDB model and its query language [KR94, J+93, Sno87].

In this proposed research, we will attack the problem of generating benchmark data and queries for temporal databases. Creating performance benchmarks is important in many fields. In the database field, there are many benchmarks for non-temporal databases, which are used to compare the performance of commercial DBMSs under various operating conditions [Gra93]. The goal of our proposed research is to create realistic benchmarks for temporal database applications.

TDBs are able to support temporal attributes whose values change over time, as well as non-temporal ones. It has been recognized that two main time dimensions are relevant to recording changes to the temporal attributes [SA86]:

Valid time: the time when changes and events occur in the real world.

Transaction time: the time when changes or events are recorded in the database.

A third type of time, called user-defined time, allows users to define time dimensions whose semantics are different from both valid time and transaction time. Other time dimensions, such as decision time [KC93] [ND96] (or belief time), are also possible, but have not been thoroughly investigated.

TDB is a general name for a class of temporal databases with different characteristics. Depending on the time dimensions supported, the following categorization of TDBs [JCG+94] has been proposed:

Rollback (transaction-time) Databases: These support only transaction time. Thus, one can "roll" the database back to a previous state, where a state is defined whenever a transaction commits. Rollback databases allow queries with respect to the knowledge the database had at a given point in time in the past.

Historical (valid-time) Databases: These provide support for valid time only, and thus preserve the history of the application attribute values and objects/tuples. Therefore one can ask queries regarding the evolution of objects and, naturally, of the attributes within them.

Bitemporal Databases (2TDBs): Both valid time and transaction time are supported. In such a case one can ask queries based on different states of the database and/or object history. Hence, 2TDBs allow corrections to data while maintaining the history of the data.

The objective of this proposed research is to provide tools (queries, data, and software) that can be used to provide meaningful performance analysis for the three categories of TDBs discussed above. To our knowledge this is the first research on benchmarks that focuses on the actual performance of TDBs. Benchmarking temporal databases is unlike benchmarking conventional databases, because the semantics of the temporal data is involved. There are several aspects that may be evaluated in a TDB benchmark. These include:

1. Efficiency in accessing records valid at different epochs. With this criterion one is able to verify and compare the performance of different storage structures and access methods when a query refers to information that is valid in the past, present, or future. This allows searching past valid times "as of" the current transaction time.

2. Efficiency in accessing different states of the database. With this criterion one is able to measure how well an access structure handles "browsing" through previous states of the database as of a previous transaction time.

3. Efficiency in accessing different states of the database and then accessing records valid at different epochs. This is a combination of the two items above and thus is in the realm of bitemporal databases. In this case one is able to select one state (or a set of states) as of certain transaction times, and then examine the valid-time "contents".

4. Efficiency in inserting and (logically) deleting records, and archiving (vacuuming). Since temporal databases are "append-only", no records should be physically deleted if one wants to maintain a complete history of all changes and events. Hence, updates typically involve the "closing" of old versions and the "opening" of new versions. Archiving (vacuuming) refers to removing versions that are older than a specific cutoff time to offline storage media.

Item (1) above targets queries based on valid time only. Item (2) addresses transaction time queries. Item (3) covers both time dimensions for a bitemporal database. Item (4) covers the different categories of updates to different types of temporal databases. Any temporal database benchmark must be able to evaluate implementation alternatives based on these temporal concepts. For instance, a given access structure may behave very well when accessing records valid NOW (current database state) but poorly when accessing records that are valid in the far past (or in the far future). Ideally the performance should be good for all cases. In practice, though, one may have to choose one or a few time epochs with respect to which the structure will be more efficient. In such a case the overall performance is the weighted average performance, with heavier weights put on the cases that are more important for the particular application.

Efficiency is a customizable metric. In one case it may translate into number of I/Os, while in others it may translate into access time or even concurrency availability. One of the components of the benchmark will be efficiency measures which can be used to compare implementation alternatives.

In addition to specifying different classes of queries and updates for a TDB benchmark, a critical portion of the benchmark is to generate the temporal data itself that will be used in evaluating performance. Very little work has been done in this area. Generating data for performance benchmarks of non-temporal databases, such as relational databases and object-oriented databases, has been done in a number of (non-temporal) database benchmarks [Gra93, CDN93]. In such cases, each benchmark identifies a number of parameters, such as key distributions, size of records, number of records, structure of complex objects, and so on, which are then used to generate the various data benchmarks. An important part of our research will be to identify the various parameters that can be used to characterize temporal data.

In order to have practical significance, we will utilize a survey of three typical temporal database application domains: insurance, airlines, and medical. The surveys will be used to determine typical distributions of temporal data versions, so that the benchmark will generate data that fits the distributions in the practical application domains. The tool can be utilized to generate benchmark data for other application domains as their data distributions become known. Our survey of the application domains will also allow us to identify the typical temporal queries that are pertinent to each domain. We intend to generate a number of different data benchmarks for different classes of temporal database applications. A software tool that generates these data distributions will be created as part of the proposed research, and the tool will be highly parameterized so that different parameter values will produce temporal database distributions with different characteristics.
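The weighted average performance mentioned above can be illustrated with a minimal sketch. The epoch names, the cost figures, and the function name `weighted_efficiency` are hypothetical and serve only as an illustration; the cost could be the number of I/Os, access time, or any other efficiency measure.

```python
def weighted_efficiency(costs, weights):
    """Weighted average of per-epoch costs (e.g. I/Os per query).

    costs and weights are dicts keyed by epoch name; a heavier weight
    marks an epoch that matters more to the application.
    """
    if costs.keys() != weights.keys():
        raise ValueError("costs and weights must cover the same epochs")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(costs[e] * weights[e] for e in costs) / total_weight


# Hypothetical structure: cheap for records valid NOW, costly elsewhere.
costs = {"far_past": 40.0, "now": 4.0, "far_future": 30.0}
weights = {"far_past": 1.0, "now": 8.0, "far_future": 1.0}
score = weighted_efficiency(costs, weights)
```

With these illustrative weights, the structure's poor behavior in the far past and far future is largely offset by its good behavior at NOW, which is the trade-off the text describes.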

We will create benchmark databases for all three main types of TDBs: rollback databases, historical databases, and bitemporal databases. Since bitemporal is the general case, the next section gives an overview of the bitemporal (2TDB) model that we will utilize in our research. By suppressing either the valid time dimension or the transaction time dimension, we can create benchmarks for rollback and historical databases.

2 Temporal Data Model Overview

We make the following assumptions regarding our bitemporal model.

We assume time to be discrete and countable although not finite. Hence our representation of time is isomorphic to the integers. We assume the existence of three particular points of interest: $T_{\infty}^{-}$, $T_{\infty}^{+}$ and $NOW$. These represent, respectively, a point in time sufficiently far back in the past, a point in time sufficiently far in the future, and the current time. $NOW$ is constantly changing at each instant to the new current point in time.

The valid time range of a record or object is given by a time interval $V = [V_s, V_e]$, where $V_e - V_s \ge 0$. We assume that $V_s, V_e \in (-\infty, +\infty)$, thus $V_s$ and $V_e$ are bounded neither above nor below.

There are two proposed representations for transaction time. In one representation, transaction time is a point in time lower bounded by 0 (zero) and upper bounded by $NOW$, i.e., $T_t \in [0, NOW)$, and corresponds to the commit time of the transaction that created the data version. In the other representation, transaction time is a time range $[T_{ts}, T_{te}]$, which represents the period of validity of the version, so that $T_{ts}$ represents the commit time of the transaction that created the version, and $T_{te}$ represents the commit time of the transaction that (logically) deleted the version, if any. The first (point) representation assumes that $T_{te}$ of one version of an object is the same as $T_{ts}$ of the next version of the same object, and hence assumes that $T_t$ is stepwise constant. We will create data benchmarks for both representations of transaction time.

We assume tuple-versioning instead of attribute-versioning. This means that once a part of a record or object is updated we consider the whole record to be updated, and hence a new version of the object is created.

A record $R^j$ is composed of $n$ non-temporal attributes $(A^j_1, A^j_2, \ldots, A^j_n)$ and three time attributes $V^j_s$, $V^j_e$, and $T^j_t$ (which will be a point or an interval, depending on the transaction time representation). These time attributes are, respectively, the start and end of $R^j$'s valid time and $R^j$'s transaction time. For the sake of presentation and without loss of generality we divide $R^j$'s non-temporal attributes into $A^j_k$ and $A^j$, where $A^j_k$ is the (non-temporal) key that uniquely identifies all the $R^j$ versions that represent the same object, and $A^j$ represents the remaining attributes $A^j_1, A^j_2, \ldots, A^j_n$. Therefore $R^j = \langle A^j_k, A^j, V^j, T^j_t \rangle$, where $V^j = [V^j_s, V^j_e]$ is called $R^j$'s (valid-time) lifespan.

In the most general benchmark for bitemporal databases, we will allow corrections, i.e., we will allow two versions of the same record to have overlapping valid time intervals. However, we do require that such versions have different transaction times. This implies that we do not observe the 1TNF. This type of correction will not be allowed in the simpler bitemporal data benchmarks, nor in the single-dimensional temporal benchmarks.
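The assumptions above can be summarized in a minimal sketch of a tuple-versioned bitemporal record, with integer time and both transaction time representations. The class name `BitemporalRecord`, its field names, and the helper `is_correction` are hypothetical and are not part of the model definition; they only illustrate one way the record $\langle A_k, A, V, T_t \rangle$ might be held in memory.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BitemporalRecord:
    """One version of an object under tuple versioning (illustrative names)."""
    key: str                       # A_k: key shared by all versions of an object
    attrs: Tuple                   # A: the remaining non-temporal attributes
    vs: int                        # V_s: start of the valid-time lifespan
    ve: int                        # V_e: end of the valid-time lifespan
    tt_start: int                  # T_t (point form) or T_ts (interval form)
    tt_end: Optional[int] = None   # T_te; None for the point form or a live version

    def __post_init__(self):
        if self.ve < self.vs:
            raise ValueError("lifespan requires V_e - V_s >= 0")
        if self.tt_start < 0:
            raise ValueError("transaction time is lower bounded by 0")
        if self.tt_end is not None and self.tt_end < self.tt_start:
            raise ValueError("T_te must not precede T_ts")


def is_correction(a: BitemporalRecord, b: BitemporalRecord) -> bool:
    """True if a and b are versions of the same object whose valid-time
    lifespans overlap and whose transaction times differ."""
    overlap = a.vs <= b.ve and b.vs <= a.ve
    return a.key == b.key and overlap and a.tt_start != b.tt_start
```

A generator for the simpler benchmarks could reject any pair of versions for which `is_correction` holds, while the most general bitemporal benchmark would accept them.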

3 Components of a 2TDB Benchmark

There are two major components of a database benchmark: queries and data. To construct meaningful benchmarks, queries and data should reflect typical uses in existing application domains. In this section we first discuss how a survey will be used to determine actual user needs. In the next two subsections we then discuss how the tool that we plan to develop will generate the data distributions for the temporal database benchmarks, and the types of queries that will be generated for the benchmarks.

3.1 Conducting Surveys of Temporal Database Users

When one is constructing temporal databases, it is possible to store very large amounts of data and to pose an almost infinite number of queries. The design of realistic and useful temporal databases for benchmarking requires the specification of the most crucial and most frequent queries that will be addressed to the database. Setting up and benchmarking these databases and queries will require a survey of the need for, and current use of, temporal information in specific industries.

We plan to survey a number of different industries. We propose to start with a needs survey for users in the insurance industry, since their obligations to pay claims, assess rates, and make rate changes require a bitemporal database using both valid time and transaction time elements. In the future, we plan to study the temporal database needs of other industries such as the airline and healthcare industries. Insurance was chosen first as it seems to offer a simpler set of needs, while the airline industry allows for constantly changing rates and time schedules. In the healthcare industry there are many complex problems which arise from the need for a unified patient recording system. These industries, particularly healthcare, would be logical areas for a follow-on study once we have developed a benchmarking system.

In our survey we shall first conduct a pilot study, in which we interview in depth a small sample of chief information officers (CIOs), database designers, and employees who use the database to make decisions. These people will be interviewed in person, using open-ended questions and some closed-ended questions with rating scales. The issues to be covered will be:

1. What types of information do they collect? Does it refer to the past only or is there future data, such as inflation adjustments and scheduling?

2. How do they model or represent their temporal information? What are the time attributes they use (valid time, transaction time, other user-defined times)? What temporal data types are used (date, time, timestamp, expires, starts)? What type of conceptual model (schema structures) do they have (tuple versioning, attribute versioning, other)? What temporal aspects are needed?

3. What type of temporal queries do they apply? What are the most common queries? What are the most important queries? What are the relative frequencies of the queries?

4. What are the distributions of their temporal data over time? How often do they change data?

5. How long do they need to retain the data? (For example, tax records should be kept for 5 years, accident records for 3 years.)

6. What do they do about archiving (vacuuming) old data to off-line storage (tapes)?

7. What is the current database system they use, and what approach do they use to solve temporal problems?

We estimate that the pilot study will take from 2 to 6 months to formulate questions, locate appropriate respondents, and conduct in-depth interviews.

Using the results of interviews from the pilot study, we will then formulate a mail questionnaire, which will be sent to a much larger representative sample of CIOs, database designers, and users in the insurance industry. The results of this survey will provide us with information that statistically characterizes the database needs of the industry. This survey should take 2 months for the formulation of the questionnaire and the sample selection, 3 to 4 months to mail questionnaires and receive results, and 2 months for statistical data analysis. As the results of the study are received, they will be used in the formulation of one or more temporal database benchmarks for the industry using the tool being developed. The desired characteristics specified by the users can then be utilized for purposes of benchmarking.

3.2 Generating Data for the Temporal Database Benchmark

We already have experience in creating data for performance evaluation of valid-time-only temporal databases [EWK90, EJK92, K+94]. This data was generated in the context of simulation experiments for performance evaluation of temporal indexing techniques. Based on this experience, we have identified the following initial set of parameters that will be utilized to create the data for the benchmarks:

1. Total number of objects: This is the number of actual objects in the application. Each object will have multiple versions as its attribute values change over time.

2. Total number of versions: This is the number of all object versions in the application. Hence, the average number of versions per object is this value divided by the total number of objects.

3. Mean version lifespan: This is the mean lifespan $V_e - V_s$ for the valid time component of an object version.

4. Distribution of version lifespan: Different statistical distributions can be used, including the uniform distribution and the normal distribution. In the latter case, another parameter would be the standard deviation of the valid time for version lifespans. Our tool will generate benchmarks that follow various valid time lifespan distributions.

5. Long-lived versions: It is possible that some applications will have two or more types of objects: those that change rapidly and those that change infrequently. The latter case leads to a class of objects whose versions are called long-lived versions (LLVs). Our benchmark will allow two (or more) different distributions, one for regular versions and the other for long-lived versions. In addition, the percentage of long-lived versions will be a parameter to the benchmark generating tool.

6. Version size: The size in bytes of each version.

The above is a preliminary list of the parameters we will use to generate the various data sets for the benchmarks. As part of the research, additional parameters will be identified to cover the general bitemporal data model. The parameters will be chosen to generate data that conforms to the data distributions that emerge from the surveys of temporal database users that will be conducted during the research. These surveys will allow us to conduct an analysis of a number of practical temporal database applications. Our methodology for conducting the surveys was discussed in Section 3.1.

We will create a tool that generates temporal data for the performance benchmarks that match the distributions of the practical temporal databases analyzed during our surveys. By varying the parameters, temporal data conforming to different distributions can be generated by the tool. Hence, a suite of temporal benchmark data sets covering various data distributions will be the result of this part of the research project. We expect that domain-specific data sets will need to be generated for distinct temporal database application domains, as will domain-specific queries. Hence, a number of temporal benchmarks will be generated. Each domain-specific benchmark will contain its own set of data and queries.

For both data and queries, the existence of temporal attributes complicates the generation of the benchmarks. The actual data in the temporal databases may follow different distributions for different application domains. A major objective of this research will be to examine the actual distribution of valid time ranges. For example, we may find that some domains have valid time ranges which are uniformly short. Others may follow an exponential distribution. As an example, consider the length of time an employee is employed at a company. This is the valid time of employment. We could expect most employees to have a short valid time (less than 5 years), with very few having a long valid time (15 years or more) and also very few having an extremely short valid time (1 or 2 years). This valid time range distribution may follow an exponential curve, a normal distribution, or possibly some other distribution. We expect the distributions to be application domain specific. We will obtain the distribution information from the surveys conducted.
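As a rough illustration of how the parameters listed above might drive a generator, the following sketch produces valid-time lifespans for a given number of objects and versions. It is not the tool proposed here: the names `GenParams` and `generate_versions`, the round-robin assignment of versions to objects, the back-to-back placement of lifespans, and the reuse of one standard deviation for regular and long-lived versions are all simplifying assumptions. Version size is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class GenParams:
    num_objects: int                 # parameter 1
    num_versions: int                # parameter 2
    mean_lifespan: float             # parameter 3
    distribution: str = "uniform"    # parameter 4: "uniform" or "normal"
    stddev: float = 0.0              # used only by the normal distribution
    llv_fraction: float = 0.0        # parameter 5: share of long-lived versions
    llv_mean_lifespan: float = 0.0   # mean lifespan of long-lived versions
    seed: int = 0                    # makes a generated data set reproducible


def draw_lifespan(rng, mean, distribution, stddev):
    """Draw a non-negative integer lifespan length V_e - V_s."""
    if distribution == "uniform":
        length = rng.uniform(0, 2 * mean)   # uniform with the requested mean
    elif distribution == "normal":
        length = rng.gauss(mean, stddev)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return max(0, round(length))            # time is discrete; clamp at zero


def generate_versions(p):
    """Return a list of (object_id, V_s, V_e) tuples."""
    rng = random.Random(p.seed)
    next_start = [0] * p.num_objects        # next free valid-time point per object
    versions = []
    for i in range(p.num_versions):
        obj = i % p.num_objects             # assumption: round-robin over objects
        long_lived = rng.random() < p.llv_fraction
        mean = p.llv_mean_lifespan if long_lived else p.mean_lifespan
        length = draw_lifespan(rng, mean, p.distribution, p.stddev)
        vs = next_start[obj]
        ve = vs + length
        next_start[obj] = ve + 1            # no overlap, i.e. no corrections
        versions.append((obj, vs, ve))
    return versions
```

For example, `generate_versions(GenParams(num_objects=100, num_versions=5000, mean_lifespan=12, distribution="normal", stddev=3, llv_fraction=0.1, llv_mean_lifespan=120))` would yield on average 50 versions per object, about a tenth of them long-lived.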

3.3 The Benchmark Queries

In this subsection we give our preliminary classification of the types of queries that are needed in a benchmark to specify the types of access requests against temporal databases. In our discussion, we use the following convention. A capital letter in italic font (e.g. $X$) represents a range, with its end points being the same capital letter with subscripts $s$ and $e$ for its start and end (e.g. $X_s$ and $X_e$). A point is represented by a capital letter in a calligraphic font (e.g. $\mathcal{X}$).

We present in Table 1 several binary operators, their English reading and their mathematical meaning. Let $T$ and $P$ be ranges, i.e., $T = [T_s, T_e]$ and $P = [P_s, P_e]$, and let $\mathcal{S}$ and $\mathcal{Q}$ be points. Note that $T$, $P$, $\mathcal{S}$ and $\mathcal{Q}$ may be related to temporal or non-temporal values. This table defines operations which may be present in temporal database queries. One should note that additional operators can be derived from those in Table 1. For instance, an equal-range operator could be defined as a special case of the contains operator when both ranges have equal end points. Furthermore, other operators can be obtained by using the ones presented in conjunction with the AND and/or OR set operators; for instance, $\{\mathcal{S} < \mathcal{Q}\}$ can be expressed as $\{\mathcal{S} \le \mathcal{Q}$ AND NOT $\mathcal{S} = \mathcal{Q}\}$.

Table 1: Range and Point Operators

Type                   Reading                                          Meaning
Range-Range operators  $P$ contains $T$                                 $P_s \le T_s \le T_e \le P_e$
                       $P$ follows $T$                                  $T_s \le P_s \le T_e$ and $P_e > T_e$
                       $P$ precedes $T$                                 $P_s < T_s$ and $T_s \le P_e \le T_e$
Range-Point operators  $T$ spans $\mathcal{S}$                          $T_s \le \mathcal{S} \le T_e$
                       $T$ is-before $\mathcal{S}$                      $T_e < \mathcal{S}$
                       $T$ is-after $\mathcal{S}$                       $T_s > \mathcal{S}$
Point-Point operators  $\mathcal{Q}$ is-equal-to $\mathcal{S}$          $\mathcal{Q} = \mathcal{S}$
                       $\mathcal{Q}$ is-smaller-equal-to $\mathcal{S}$  $\mathcal{Q} \le \mathcal{S}$
                       $\mathcal{Q}$ is-greater-than $\mathcal{S}$      $\mathcal{Q} > \mathcal{S}$
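The range and point operators of Table 1 translate directly into predicates. The sketch below assumes that a range is a pair `(start, end)` of integers and a point is a single integer, and that the operator meanings are read as $P$ contains $T$ iff $P_s \le T_s \le T_e \le P_e$, $P$ follows $T$ iff $T_s \le P_s \le T_e$ and $P_e > T_e$, and $P$ precedes $T$ iff $P_s < T_s$ and $T_s \le P_e \le T_e$. The function names are illustrative; the point-point operators are omitted because they are ordinary integer comparisons.

```python
def contains(p, t):
    """P contains T: T lies entirely inside P."""
    return p[0] <= t[0] <= t[1] <= p[1]


def follows(p, t):
    """P follows T: P starts inside T and ends after T ends."""
    return t[0] <= p[0] <= t[1] and p[1] > t[1]


def precedes(p, t):
    """P precedes T: P starts before T and ends inside T."""
    return p[0] < t[0] and t[0] <= p[1] <= t[1]


def spans(t, s):
    """T spans S: the point S lies inside the range T."""
    return t[0] <= s <= t[1]


def is_before(t, s):
    """T is-before S: the range T ends before the point S."""
    return t[1] < s


def is_after(t, s):
    """T is-after S: the range T starts after the point S."""
    return t[0] > s
```

Under this reading, the derived equal-range operator mentioned in the text could be written as `contains(p, t) and contains(t, p)`.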

3.3.1 Querying Single Temporal Attributes

Now we show, via the operators in Table 1, how we can derive queries based on valid time (Table 2) and transaction time (Table 4). Let $P = [P_s, P_e]$ be a time range, $\mathcal{P}$ a time point, and $R^j = \langle A^j_k, A^j, V^j, T^j_t \rangle$ a data record. Assume that $V^j$ is a range and that $T^j_t$ is a point.

Table 2: Benchmark VQ queries based on valid time values

Notation  Meaning                                            Reading (Set of all records with lifespan ...)
VQ1       $\{R^j \mid V^j \text{ contains } P\}$             containing time range $P$
VQ2       $\{R^j \mid P \text{ contains } V^j\}$             contained in time range $P$
VQ3       $\{R^j \mid V^j \text{ is-after } \mathcal{P}\}$   beginning after $\mathcal{P}$
VQ4       $\{R^j \mid V^j \text{ is-before } \mathcal{P}\}$  ending before $\mathcal{P}$
VQ5       $\{R^j \mid V^j \text{ follows } P\}$              beginning after $P_s$ and containing $P_e$
VQ6       $\{R^j \mid V^j \text{ precedes } P\}$             containing $P_s$ and ending before $P_e$
VQ7       $\{R^j \mid V^j \text{ spans } \mathcal{P}\}$      containing $\mathcal{P}$

Table 3 provides some examples for the VQ queries above. The queries in Table 2 can be used whenever valid time is supported, that is, for historical or bitemporal databases. In the latter case, though, all records considered are those visible as of the current transaction time. Versions as of other transaction times can be retrieved from bitemporal databases by adding conditions on the transaction time. Table 4 shows benchmark queries that deal with transaction time. They can be used in either rollback or bitemporal databases. The most common query involving transaction time should be the one involving the state of the TDB from its creation until a given point in time. It is usually referred to as the TDB state as of a point in time. Note that TQ3 reflects this type of query.

Table 3: Example queries based on valid time values

Query  Example (Find all records of employees who ...)
VQ1    were working all of last month
VQ2    were working some time during last month only
VQ3    began working after last January 1st
VQ4    stopped working before last January 1st
VQ5    began working last year and are still doing so this year
VQ6    began working two years ago and were not employed this year
VQ7    were working last December 31st

Table 5 offers some examples where transaction time queries may be used.

Table 4: Benchmark TQ queries based on transaction time values

Notation  Meaning                                     Reading (Set of all records committed ...)
TQ1       $\{R^j \mid P \text{ spans } T^j_t\}$       during $P$
TQ2       $\{R^j \mid T^j_t > \mathcal{P}\}$          after $\mathcal{P}$
TQ3       $\{R^j \mid T^j_t \le \mathcal{P}\}$        before or at $\mathcal{P}$
TQ4       $\{R^j \mid T^j_t = \mathcal{P}\}$          exactly at $\mathcal{P}$

Table 5: Example queries based on transaction time values

Query  Example (Find all employee versions that were committed ...)
TQ1    in June, 1994
TQ2    after last December 31st
TQ3    on or before last December 31st
TQ4    last Monday

A last remark on queries based on transaction time is that they are constrained to lie within the $[0, NOW]$ range, i.e., one cannot pose queries with respect to times before the TDB creation or times later than the current time; such queries are thus always concerned with the past.
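The "as of" query and the $[0, NOW]$ constraint can be sketched as follows, assuming the point representation of transaction time and versions held as `(key, commit_time)` pairs. The function name `committed_as_of` and the sample data are hypothetical.

```python
def committed_as_of(versions, point, now):
    """Return the versions committed before or at `point` (the TDB state
    as of `point`), rejecting points outside the range [0, NOW]."""
    if not 0 <= point <= now:
        raise ValueError("transaction-time point must lie within [0, NOW]")
    return [v for v in versions if v[1] <= point]


versions = [("emp1", 1), ("emp1", 4), ("emp2", 7)]
state = committed_as_of(versions, point=4, now=10)
```

Here `state` holds the two versions of `emp1`, while asking for a point later than `now` raises an error rather than returning an empty result, which mirrors the remark that transaction time queries are always concerned with the past.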

3.3.2 Querying a Bitemporal Database Once we have presented the benchmark queries above for single time dimensions, establishing bitemporal queries is a matter of combining the above ones. Namely, we need the cartesian product of the 12

sets $\{VQ1, \ldots, VQ7\} \times \{TQ1, \ldots, TQ4\}$, which yields the set $\{BQ1, \ldots, BQ28\}$ of bitemporal queries. Table 6 shows this set.

There are two main issues to discuss regarding the set of bitemporal queries presented in Table 6. First, not all types of queries are expected to occur with the same frequency or probability. Second, following from this first issue, the types of queries posed most often in a particular application are the ones that should be chosen for the benchmark for that application domain.

Note that, although we have not shown it explicitly thus far, every benchmark query posed above can be extended by making use of non-temporal attributes as well. We left non-temporal conditions out of the tables for two reasons. First, the resulting tables would be too cluttered; second, this allows us to concentrate on the temporal aspects of the benchmark queries.

Table 6: Benchmark queries on bitemporal (i.e. based on valid and transaction time) data

Notation  Meaning                                                        Reading (Records ...)
BQ1       $\{R_j \mid V_j \supseteq P \wedge Q \ni T_j\}$                with lifespan containing P and committed during Q
BQ2       $\{R_j \mid V_j \supseteq P \wedge T_j > P\}$                  with lifespan containing P and committed after P
BQ3       $\{R_j \mid V_j \supseteq P \wedge T_j \le P\}$                with lifespan containing P and committed before or at P
BQ4       $\{R_j \mid V_j \supseteq P \wedge T_j = P\}$                  with lifespan containing P and committed at P
BQ5       $\{R_j \mid P \supseteq V_j \wedge Q \ni T_j\}$                with lifespan contained in P and committed during Q
BQ6       $\{R_j \mid P \supseteq V_j \wedge T_j > P\}$                  with lifespan contained in P and committed after P
BQ7       $\{R_j \mid P \supseteq V_j \wedge T_j \le P\}$                with lifespan contained in P and committed before or at P
BQ8       $\{R_j \mid P \supseteq V_j \wedge T_j = P\}$                  with lifespan contained in P and committed at P
BQ9       $\{R_j \mid V_j \text{ begins after } P \wedge P \ni T_j\}$    with lifespan beginning after P and committed during P
BQ10      $\{R_j \mid V_j \text{ begins after } P \wedge T_j > Q\}$      with lifespan beginning after P and committed after Q
BQ11      $\{R_j \mid V_j \text{ begins after } P \wedge T_j \le Q\}$    with lifespan beginning after P and committed before or at Q
BQ12      $\{R_j \mid V_j \text{ begins after } P \wedge T_j = Q\}$      with lifespan beginning after P and committed at Q
BQ13      $\{R_j \mid V_j \text{ ends before } P \wedge P \ni T_j\}$     with lifespan ending before P and committed during P
BQ14      $\{R_j \mid V_j \text{ ends before } P \wedge T_j > Q\}$       with lifespan ending before P and committed after Q
BQ15      $\{R_j \mid V_j \text{ ends before } P \wedge T_j \le Q\}$     with lifespan ending before P and committed before or at Q
BQ16      $\{R_j \mid V_j \text{ ends before } P \wedge T_j = Q\}$       with lifespan ending before P and committed at Q
BQ17      $\{R_j \mid V_j \text{ follows } P \wedge Q \ni T_j\}$         with lifespan following P and committed during Q
BQ18      $\{R_j \mid V_j \text{ follows } P \wedge T_j > P\}$           with lifespan following P and committed after P
BQ19      $\{R_j \mid V_j \text{ follows } P \wedge T_j \le P\}$         with lifespan following P and committed before or at P
BQ20      $\{R_j \mid V_j \text{ follows } P \wedge T_j = P\}$           with lifespan following P and committed at P
BQ21      $\{R_j \mid V_j \text{ precedes } P \wedge Q \ni T_j\}$        with lifespan preceding P and committed during Q
BQ22      $\{R_j \mid V_j \text{ precedes } P \wedge T_j > P\}$          with lifespan preceding P and committed after P
BQ23      $\{R_j \mid V_j \text{ precedes } P \wedge T_j \le P\}$        with lifespan preceding P and committed before or at P
BQ24      $\{R_j \mid V_j \text{ precedes } P \wedge T_j = P\}$          with lifespan preceding P and committed at P
BQ25      $\{R_j \mid V_j \ni P \wedge P \ni T_j\}$                      valid at P and committed during P
BQ26      $\{R_j \mid V_j \ni P \wedge T_j > Q\}$                        valid at P and committed after Q
BQ27      $\{R_j \mid V_j \ni P \wedge T_j \le Q\}$                      valid at P and committed before or at Q
BQ28      $\{R_j \mid V_j \ni P \wedge T_j = Q\}$                        valid at P and committed at Q
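The way the 28 bitemporal queries arise from the seven valid-time and four transaction-time queries can be sketched in a few lines of Python. This is only an illustration: the function name `bitemporal_queries` is hypothetical, and the numbering rule (each valid-time query paired in turn with TQ1 through TQ4) is an assumption read off the row order of Table 6.

```python
from itertools import product


def bitemporal_queries(n_valid=7, n_transaction=4):
    """Pair every valid-time query with every transaction-time query.

    Assumption: BQ numbers follow the row order of Table 6, i.e. the
    valid-time query varies slowest and the transaction-time query fastest.
    """
    pairs = product(range(1, n_valid + 1), range(1, n_transaction + 1))
    return {
        f"BQ{k}": (f"VQ{v}", f"TQ{t}")
        for k, (v, t) in enumerate(pairs, start=1)
    }


if __name__ == "__main__":
    for name, (vq, tq) in bitemporal_queries().items():
        print(name, "=", vq, "combined with", tq)
```

A domain-specific benchmark would then keep only the subset of these pairs that the target application actually uses.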

3.3.3 Determining Queries Needed by TDB Benchmarks

Developing (non-temporal) database benchmarks is a difficult task, with many potential queries to be included [Gra93]. The number of potential queries is even larger in a temporal database due to the temporal attributes and the multiple time dimensions. Thus, choosing the set of queries is not an easy task. Queries in a benchmark must be general enough to examine the most important types of operations performed, but must also be representative of typical TDB applications. We have shown seven queries based on valid time, four using transaction time, and 28 using both. It would be virtually impossible to show all the types of queries which could be performed against a temporal database. (Certainly we could add to this list temporal joins, queries based solely on non-temporal attributes, and update types of transactions.) This list is, however, representative of the types of queries which should be included in any temporal database benchmark.

A major aspect of determining the suite of queries to be used for each domain-specific benchmark is determining which temporal values are used most often. In effect, we need to determine, for each temporal query, what temporal values are used. In temporal databases, although queries may be stated based on any of the temporal values, we concentrate our discussion on five reference points in time, namely: FarPast, Past, NOW, Future and FarFuture, where NOW is the current time, $\mathit{Past} = \mathit{NOW} - \delta_p$, $\mathit{FarPast} = \mathit{NOW} - \Delta_p$, $\mathit{Future} = \mathit{NOW} + \delta_f$ and $\mathit{FarFuture} = \mathit{NOW} + \Delta_f$. We believe that using $\delta_p = \delta_f = \delta$ and $\Delta_p = \Delta_f = \Delta$, where $\Delta = k\delta$ ($k > 1$), may be enough. Nonetheless, we are not concerned with what the values of $\delta$ and $\Delta$ are; this would depend on the time granularity used for the TDB.

We feel that most queries (and updates) in a TDB center around the current time, NOW. Fewer queries will be asked based on points in the Future or Past, and even fewer queries will be based on the FarPast and FarFuture values. Some applications may focus on past data only (e.g. auditing) while other applications (e.g. scheduling or reservations) may focus on future data only. What we have to determine is the distribution of usage of the temporal values (as included in the 39 earlier queries). It may be based on a normal distribution with a mean of NOW, as depicted in Figure 1 (a), with different application domains having differing distributions. That is, they will have different values for $\delta$ and $\Delta$. One objective for our user survey will be to determine what the values of these points should be for each domain.
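As a rough illustration of how a benchmark generator might draw the temporal values used in queries, the following Python sketch samples time points from a normal distribution centered at NOW and maps each sample to the nearest of the five reference points. The near offset `delta`, the far offset `k * delta`, the standard deviation `sigma`, the nearest-point rule, and all function names are assumptions made for the example; the real values are meant to come from the user survey.

```python
import random

REFERENCE_POINTS = ["FarPast", "Past", "NOW", "Future", "FarFuture"]


def classify(t, now, delta, big_delta):
    """Map a time point to the nearest of the five reference points."""
    points = {
        "FarPast": now - big_delta,
        "Past": now - delta,
        "NOW": now,
        "Future": now + delta,
        "FarFuture": now + big_delta,
    }
    return min(points, key=lambda name: abs(points[name] - t))


def sample_reference_points(n, now=0.0, delta=1.0, k=3.0, sigma=1.0, seed=0):
    """Draw n query times from Normal(now, sigma) and count them per point.

    Assumption: the far offset is k * delta with k > 1.
    """
    rng = random.Random(seed)
    counts = dict.fromkeys(REFERENCE_POINTS, 0)
    for _ in range(n):
        t = rng.gauss(now, sigma)
        counts[classify(t, now, delta, k * delta)] += 1
    return counts


if __name__ == "__main__":
    print(sample_reference_points(10_000))
```

With these illustrative parameters most samples fall at NOW, fewer at Past and Future, and the fewest at FarPast and FarFuture. A domain that looks mostly backward or forward in time could be approximated by shifting the mean away from NOW.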

[Figure: three panels, (a) Insurance domain, (b) Checking Account Domain, (c) Airline Reservation Domain; time axes marked with the reference points FarPast, Past, NOW, Future and FarFuture]

Figure 1: The Normal distribution and the lifespan spanning

We intend to develop a suite of TDB benchmarks, each aimed at a specific application (e.g. insurance, airlines, medical). Each suite will consist of the queries (and updates) that are most used in that target domain. Thus, we feel that a major component of this research is to actually solicit feedback from the user community. As researchers, we cannot determine the best set of queries (or data) without talking to the users.

While the temporal values used in a specific query may tend to follow a normal distribution, we may find that the quantity of each type of query used follows an exponential distribution. Even though there are many possible queries, most may not be used or may be used seldom (following an 80/20 rule). Using an exponential distribution to choose the quantity of each query type will approximate this usage. We expect to find, however, that the queries involved differ based on the domain. Again, we anticipate that our user surveys can help us to determine the distribution of query types for each benchmark.
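To make the 80/20 remark concrete, here is a small Python sketch that assigns exponentially decaying weights to the 39 query types and measures how much of the workload the most frequent fifth of them covers. The ranking of query types, the decay `rate`, and the function names are all hypothetical; in practice the ranking and the rate would be estimated from the survey results for each domain.

```python
import math


def query_type_weights(query_ids, rate=0.2):
    """Weight the i-th ranked query type proportionally to exp(-rate * i).

    Assumption: query_ids is already ordered from most to least used.
    """
    raw = [math.exp(-rate * rank) for rank in range(len(query_ids))]
    total = sum(raw)
    return {q: w / total for q, w in zip(query_ids, raw)}


def share_of_top(weights, fraction=0.2):
    """Fraction of the workload covered by the top `fraction` of query types."""
    ranked = sorted(weights.values(), reverse=True)
    top_n = max(1, round(fraction * len(ranked)))
    return sum(ranked[:top_n])


if __name__ == "__main__":
    # Placeholder ranking only; the real order is domain specific.
    ids = (
        [f"VQ{i}" for i in range(1, 8)]
        + [f"TQ{i}" for i in range(1, 5)]
        + [f"BQ{i}" for i in range(1, 29)]
    )
    weights = query_type_weights(ids)
    print(f"top 20% of query types cover {share_of_top(weights):.0%} of queries")
```

With a decay rate of 0.2 the top fifth of the 39 query types accounts for close to 80% of the generated queries, which is the kind of skew the 80/20 rule describes.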

4 Research Objectives

This research has four major objectives:

1. Survey users in temporal database areas to determine their needs and uses for temporal data.
2. Develop a parameterized software tool for generating TDB benchmarks.
3. Using the results of the survey, develop a suite of benchmarks targeted at the particular user domains surveyed.
4. Use the benchmark tool to perform benchmarks against previously proposed temporal indexing structures.

The actual user survey will be conducted by an MIS researcher who is an expert at conducting user surveys. The application domain we will survey first is insurance; other application domains we expect to survey in the future are medical and airlines. The survey will be written and conducted to answer the questions discussed in Section 3.1.

The parameterized tool will be created based on the discussion in Section 3.1. We will identify the parameters needed to create temporal data, and include various distributions within the tool to customize the data benchmarks. We will then use the tool to create benchmark data that approximates the distributions of data determined by the surveys of the application domains. We will then generate the appropriate queries identified by the studies of the application domains, as discussed in Section 3.2.

To demonstrate the use of the benchmark tool, we will conduct experiments to compare and determine the efficiency of various TDB indexing techniques. We already have a survey of the indexing techniques proposed for temporal databases [Nas95], and as part of the research we will extend this survey to include any new indexing structures as well as others surveyed in [ST94]. We will then select a number of those structures for implementation and comparison using the generated benchmarks. Some of the indexing structures we will consider are the following (this is not an exhaustive list):

Those considering valid time ranges only: Time Index [EWK90, EJK92, K+94], MAP21 [NDK95], Time Polygon Index [SOL94] and Interval B-tree [AT95].

Those considering valid time but also assuming that input data is monotonically growing (append-only): Monotonic B+-tree [EJK92] and Append-Only Tree [GS93].

Those indexing transaction time only: Time Split B-tree [LS93] and Snapshot Index [TK93].

Those based on spatial-oriented data, such as the R-tree [Gut84] (and its derivatives).

And bitemporal structures: Incremental Valid Time Trees [NDE95] (and derivatives), Sharing Trees [NE95] and the Bitemporal Interval Tree and Bitemporal R-Tree [KTF95].
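How a benchmark tool might exercise such structures through a common interface can be sketched as follows. This is a minimal sketch under stated assumptions, not the tool proposed here: the record layout, the uniform data generator, the class `ScanIndex` (a full-scan baseline standing in for a real index), and the cost measure (records examined per query) are all hypothetical. The query implemented is the one read in Table 6 as "valid at P and committed before or at Q".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    valid_start: int  # first time point of the lifespan (valid time)
    valid_end: int    # last time point of the lifespan, inclusive
    committed: int    # transaction time


def generate_records(n, horizon=1000, max_span=50, seed=0):
    """Generate n records with uniformly drawn times (an assumed distribution)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        start = rng.randrange(horizon)
        end = start + rng.randrange(1, max_span + 1)
        records.append(Record(start, end, rng.randrange(horizon)))
    return records


class ScanIndex:
    """Baseline 'index': a full scan that counts the records it examines."""

    def __init__(self, records):
        self.records = list(records)
        self.examined = 0

    def valid_at_committed_by(self, p, q):
        """Return records valid at point p and committed before or at q."""
        hits = []
        for r in self.records:
            self.examined += 1
            if r.valid_start <= p <= r.valid_end and r.committed <= q:
                hits.append(r)
        return hits


def run_benchmark(index, queries):
    """Run (p, q) queries and return the mean number of records examined."""
    for p, q in queries:
        index.valid_at_committed_by(p, q)
    return index.examined / len(queries)


if __name__ == "__main__":
    data = generate_records(1000)
    print(run_benchmark(ScanIndex(data), [(500, 900), (10, 10)]))
```

Any structure from the list above that offered the same `valid_at_committed_by` method and `examined` counter could be passed to `run_benchmark` unchanged, so the same generated data and queries would drive every candidate.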

References

[AT95] C-H. Ang and K-P. Tan. The interval B-tree. Information Processing Letters, 53(2):85–89, January 1995.

[CDN93] M. J. Carey, D. J. DeWitt, and J. F. Naughton. The OO7 benchmark. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 12–21, Washington, D.C., May 1993.

[EJK92] R. Elmasri, M. Jaseemuddin, and V. Kouramajian. Partitioning of time index for optical disks. In F. Golshani, editor, Proceedings of the Eighth International Conference on Data Engineering, pages 574–583, Phoenix, AZ, February 1992. IEEE.

[EWK90] R. Elmasri, G. T. J. Wuu, and Y.-J. Kim. The time index: An access structure for temporal data. In Proceedings of the Sixteenth Very Large Databases Conference, pages 1–12, 1990.

[Gra93] J. Gray, editor. The Benchmark Handbook for Database and Transaction Processing Systems. Morgan Kaufmann, second edition, 1993.

[GS93] H. Gunadhi and A. Segev. Efficient indexing methods for temporal relations. Transactions on Knowledge and Data Engineering, 5(3):496–509, June 1993.

[Gut84] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, Jun 1984.

[J+93] C. S. Jensen et al. The TSQL benchmark. In Proceedings of the International Workshop on an Infrastructure for Temporal Databases, pages QQ-1–QQ-28, Arlington, TX, June 1993.

[JCG+94] C. S. Jensen, J. Clifford, S. K. Gadia, A. Segev, and R. T. Snodgrass. A consensus glossary of temporal database concepts. ACM SIGMOD Record, 23(1):52–64, Jan 1994.

[K+94] V. Kouramajian et al. The Time Index+: An incremental access structure for temporal databases. In Proceedings of 3rd International Conference on Knowledge and Management, November 1994.

[KC93] S. K. Kim and S. Chakravarthy. Modeling time: Adequacy of three distinct time concepts for temporal databases. In R. Elmasri, V. Kouramajian, and B. Thalheim, editors, Proceedings of the Twelfth International Conference on the Entity-Relationship Approach (ER'93), Arlington, TX, December 1993. Also published as Volume 823 of Springer Verlag's Lecture Notes in Computer Science.

[Kli93] N. Kline. An update of the temporal database bibliography. ACM SIGMOD Record, 22(4), Dec 1993.

[KR94] P.P. Kalua and E.L. Roberson. Benchmark queries for temporal databases. Technical Report TR-379, Computer Science Department, Indiana University, 1994.

[KTF95] A. Kumar, V.J. Tsotras, and C. Faloutsos. Access methods for bi-temporal databases. In Proceedings of the International Workshop on Temporal Databases, Workshop in Computing, pages 235–254, Zurich, Switzerland, September 1995. Springer and British Computer Society.

[LS93] D. Lomet and B. Salzberg. Transaction time databases. In A. Tansel et al., editors, Temporal Databases: Theory, Design and Implementation, chapter 16, pages 388–417. Benjamin/Cummings, Redwood City, CA, 1993.

[McK86] E. McKenzie. Bibliography: Temporal databases. ACM SIGMOD Record, 15(4):40–52, Dec 1986.

[Nas95] M.A. Nascimento. Indexing structures for bitemporal databases. Ph.D. research proposal, CSE - SEAS - Southern Methodist University, Dallas, TX, May 1995.

[ND96] M. A. Nascimento and M. H. Dunham. Indexing a transaction-decision time database. In Proceedings of the 1996 ACM Symposium on Applied Computing (SAC'96), Philadelphia, PA, February 1996.

[NDE95] M.A. Nascimento, M.H. Dunham, and R. Elmasri. Using incremental trees for space efficient indexing of bitemporal databases. In Proceedings of the Second International Conference on Application of Databases (ADB'95), Santa Clara, CA, December 1995.

[NDK95] M.A. Nascimento, M.H. Dunham, and V. Kouramajian. A mapping-based approach for range indexing. Technical Report 95-CSE-14, Southern Methodist University, Dallas, TX, August 1995. Submitted for publication.

[NE95] M.A. Nascimento and M.H. Eich. Indexing bitemporal databases via trees with shared leaves – the SLT approach. Technical Report 95-CSE-06, Southern Methodist University, Dallas, TX, May 1995. Submitted for publication.

[SA86] R. T. Snodgrass and I. Ahn. Temporal databases. IEEE Computer, 19(9):35–42, Sep 1986.

[Sno87] R. T. Snodgrass. The temporal query language TQuel. Transactions on Database Systems, 12(2):247–298, June 1987.

[SOL94] H. Shen, B.C. Ooi, and H. Lu. The TP-Index: A dynamic and efficient indexing mechanism for temporal databases. In Proceedings of the Tenth International Conference on Data Engineering, pages 274–281, Houston, TX, February 1994. IEEE.

[Soo91] M. D. Soo. Bibliography on temporal databases. ACM SIGMOD Record, 20(1):14–23, Mar 1991.

[ST94] B. Salzberg and V.J. Tsotras. A comparison of access methods for time evolving data. Technical Report NU-CCS-94-21, College of Computer Science, Northeastern University, 1994. (Also published as Technical Report CATT-TR-94-81 at Polytechnic University).

[TK93] V.J. Tsotras and N. Kangelaris. The snapshot index, an I/O optimal access method for timeslice queries. Technical Report CATT-TR-93-68, Polytechnic University, December 1993.