What can Partitioning do for your Data Warehouses and Data Marts?

Ladjel Bellatreche and Kamalakar Karlapalem

Department of Computer Science, University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, CHINA
{ladjel, [email protected]}

Mukesh Mohania

Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008-5021, U.S.A.
[email protected]

Michel Schneider

LIMOS, Universite Blaise Pascal, 63177 Aubiere Cedex, FRANCE
[email protected]

Abstract

Efficient query processing is a critical requirement for data warehousing systems, as decision support applications often require minimum response times for answering complex, ad-hoc queries involving aggregations and multi-way joins over vast repositories of data. This can be achieved by fragmenting warehouse data. The data fragmentation concept, in the context of distributed databases, aims to reduce query execution time and to facilitate the parallel execution of queries. In this paper, we propose a methodology for applying the fragmentation technique to a Data Warehouse (DW) star schema to reduce the total query execution cost. We present an algorithm for fragmenting the tables of a star schema. During the fragmentation process, we observe that the choice of the dimension tables used in fragmenting the fact table plays an important role in overall performance. Therefore, we develop a greedy algorithm for selecting the "best" dimension tables. We propose an analytical cost model for executing a set of OLAP queries on a fragmented star schema. Finally, we conduct experiments to evaluate the utility of fragmentation for efficiently executing OLAP queries.

Key Words: Data Warehouses, Star Schema, Fragmentation, Query Optimization, Performance Evaluation

1 Introduction

Usually, Data Warehouses (DWs) are built and owned by centrally coordinated organizations. A DW stores large volumes of data which are used frequently by decision support applications. DWs are usually dedicated to the processing of data analysis and decision support queries (OLAP queries). These queries are much more complex than transactional queries, and consequently their response times are much higher. A lot of work has been done to speed up OLAP query processing in DWs. Some of the techniques employed are: materialized views [14], advanced indexes [19], sampling, and parallel computing technologies [11]. Designing a warehouse database for an integrated enterprise is a very complicated and iterative process, since it needs data from many departments/units, requires data cleaning, and requires extensive business modeling. Therefore, some organizations have preferred to develop a data mart to meet requirements specific to a departmental or restricted community of users. Of course, the development of data marts entails lower cost and shorter implementation time. The role of the data mart is to present convenient subsets of a DW to consumers having specific functional needs. There are two approaches for developing a data mart: (1) it can be designed by integrating data from source data (the bottom-up approach), or (2) it can be designed by deriving the data from the warehouse (the top-down approach) [13]. In this paper, we advocate the top-down approach, as shown in Figure 1, because warehouse data can be fragmented based on the needs of each data mart. In this approach, fragmentation [5, 20, 1] can play an important role, by fragmenting the DW into a number of fragments that can be used as data allocation units for data marts. These fragments can be allocated to data marts so that most of the queries posed on a given data mart can be executed locally; thus communication cost can be minimized. This problem can be seen as data allocation in distributed databases. We use the words fragmentation and partitioning interchangeably.

Figure 1: The Top-Down Flow from DWs to Data Marts (the data warehouse feeds data marts A, B, ..., N)

Data partitioning is a technique aimed at reducing the number of disk accesses for query execution by minimizing accesses to irrelevant data [4, 20]. There are two types of partitioning: (i) Vertical partitioning (VP) of a relation R produces vertical fragments, each of which contains a subset of R's attributes as well as the primary key of R. (ii) Horizontal partitioning (HP) partitions a relation R along its tuples; each horizontal fragment (HF) contains a subset of the tuples of R. Two versions of HP are cited by researchers [5]: primary HP and derived HP. Primary HP of a relation is performed using predicates that are defined on that relation. On the other hand, derived HP is the partitioning of a relation that results from predicates defined on another relation. Compared to DWs, a lot of work has been done on partitioning in the relational model [5, 4, 20] and the object model [1]. Chaudhuri et al. [6] developed a technique called index merging that reduces the storage and maintenance costs of indexes using the concept of vertical partitioning. Recently, Datta et al. [12] developed a new indexing technique called "Curio" that speeds up query processing without requiring a lot of storage space. On HP, little work has been done. Noaman et al. [18] proposed a construction technique for a distributed DW by adapting the work done by [20], but they did not show how HP can be used to speed up query processing, nor how it helps in allocating the fragments to data marts. In this paper, we concentrate on partitioning the warehouse data horizontally. We consider that the warehouse data is modeled using a star schema [16]. This schema has two kinds of tables: dimension tables D = {D1, D2, ..., Dd}, where each table Di has a primary key KDi, and a fact table F whose primary key is composed of the concatenation of the keys of the dimension tables.

1.1 Motivation

Building indexes like join indexes on the whole DW schema can cause maintenance problems, because whenever we need to execute a query, we must load the whole index from disk into main memory, and the sizes of such indexes can be very large [2]. If the warehouse data is partitioned, we can build a join index for each partition, which is easier to maintain and to load. Although indexing provides good access support at the physical level, the amount of irrelevant data retrieved during query processing can still be very high. Horizontal Partitioning (HP) aims to reduce accesses to irrelevant data [1, 20]. Moreover, these partitions (fragments) can be allocated to data marts when the data in the data marts is derived from the warehouse data (i.e., the top-down approach). Another advantage of partitioning warehouse data is that OLAP queries can be executed in a parallel fashion. This has been accepted in real practice as well. For example, MCI Telecommunications' IT data center in Colorado Springs runs a massive 2TB decision support DW called warehouseMCI on a 104-node IBM RS/6000 SP massively parallel processing system; the database is growing at 100GB to 200GB per month [10].


Figure 2: An Example of a Star Schema. The fact table SALES (Cid: 4 bytes, Pid: 4 bytes, Tid: 2 bytes, Dollar_Sales: 8 bytes, Dollar_Cost: 8 bytes, Unit_Sales: 8 bytes; 34 bytes per tuple, 100,000,000 rows) has a foreign key to each dimension table: CUSTOMER (Cid: 4 bytes, Gender: 1 byte, City: 25 bytes, State: 25 bytes, Hobby: 4 bytes; 59 bytes per tuple, 3,000,000 rows), PRODUCT (Pid: 4 bytes, SKU: 25 bytes, Brand: 10 bytes, Size: 4 bytes, Weight: 4 bytes, Package_type: 4 bytes; 51 bytes per tuple, 300,000 rows), and TIME (Tid: 2 bytes, Date: 16 bytes, Month: 4 bytes, Year: 4 bytes, Season: 4 bytes; 30 bytes per tuple, 1,094 rows). Cid, Pid, and Tid are the key attributes.

1.1.1 Motivating Example

In this section we apply the concept of HP to an example given in [9]. It consists of three dimension tables, CUSTOMER, PRODUCT, and TIME, and one fact table, SALES. The tables and attributes of the schema are shown in Figure 2. Let us assume that the dimension table CUSTOMER is horizontally partitioned into two HFs Cust_1 and Cust_2 such that:

Cust_1 = σ_{Gender='M'}(CUSTOMER)   (1)
Cust_2 = σ_{Gender='F'}(CUSTOMER)   (2)

Based on these HFs, we can horizontally partition the fact table SALES into two derived HFs Sales_1 and Sales_2 such that:

Sales_1 = SALES ⋉ Cust_1   (3)
Sales_2 = SALES ⋉ Cust_2   (4)

where ⋉ represents the semi-join. After partitioning the SALES and CUSTOMER tables, our star schema can be represented by two sub-star schemas S1 and S2 such that: S1 = (Sales_1, Cust_1, PRODUCT, TIME), covering all sales activities for male customers only, and S2 = (Sales_2, Cust_2, PRODUCT, TIME), covering all sales activities for female customers only. Suppose we have six frequently asked OLAP queries on the DW, taken from the Informix paper [9] and listed in Figure 3.

- The queries Q1, Q2 and Q6 access the whole star schema (i.e., S1 ∪ S2).
- The query Q4 accesses only the sub-schema S2.
- The queries Q3 and Q5 access only the sub-schema S1.

Half of the queries (Q3, Q4 and Q5) access only a part of the whole schema; HP guarantees good performance for these queries. The other queries (Q1, Q2 and Q6) access the whole star schema; in this case, we can execute them in parallel. The OLAP queries we consider are based on the star join strategy [12]: each query is a join between the fact table and dimension tables, filtered by some selection operations and followed by aggregations (see Figure 3 for example queries). The syntax of these queries follows the pattern SELECT ... FROM ... WHERE ... GROUP BY ...
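To make the primary and derived fragmentation above concrete, here is a minimal sketch in Python using pandas (which the paper itself does not use); the tiny DataFrames are illustrative stand-ins for the CUSTOMER and SALES tables:

```python
import pandas as pd

# Illustrative stand-ins for the CUSTOMER and SALES tables of Figure 2.
customer = pd.DataFrame({"Cid": [1, 2, 3], "Gender": ["M", "F", "M"]})
sales = pd.DataFrame({"Cid": [1, 1, 2, 3],
                      "Pid": [10, 11, 10, 12],
                      "Dollar_Sales": [5.0, 7.5, 3.0, 9.0]})

# Primary HP of CUSTOMER: equations (1) and (2).
cust_1 = customer[customer["Gender"] == "M"]
cust_2 = customer[customer["Gender"] == "F"]

# Derived HP of SALES: equations (3) and (4). The semi-join SALES ⋉ Cust_i
# keeps exactly the SALES tuples whose Cid appears in the fragment.
sales_1 = sales[sales["Cid"].isin(cust_1["Cid"])]
sales_2 = sales[sales["Cid"].isin(cust_2["Cid"])]
```

Because Gender partitions CUSTOMER disjointly, each SALES tuple lands in exactly one derived fragment.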


Q1 (1 join, 1 selection predicate, frequency 15):
SELECT sum(S.Dollar_Sales)
FROM CUSTOMER C, SALES S
WHERE C.State = 'Illinois' AND S.Cid = C.Cid
GROUP BY C.Cid

Q2 (2 joins, 2 selection predicates, frequency 20):
SELECT sum(S.Dollar_Sales), sum(S.Unit_Sales)
FROM PRODUCT P, SALES S, TIME T
WHERE S.Pid = P.Pid AND S.Tid = T.Tid AND P.Package_type = 'Box' AND T.Season = 'Summer'
GROUP BY P.Pid

Q3 (3 joins, 3 selection predicates, frequency 20):
SELECT sum(S.Dollar_Sales), sum(S.Unit_Sales)
FROM SALES S, PRODUCT P, CUSTOMER C, TIME T
WHERE S.Pid = P.Pid AND S.Cid = C.Cid AND S.Tid = T.Tid AND P.Package_type = 'Paper' AND T.Season = 'Winter' AND C.Gender = 'M'
GROUP BY T.Tid

Q4 (2 joins, 2 selection predicates, frequency 15):
SELECT sum(S.Dollar_Sales)
FROM SALES S, CUSTOMER C, TIME T
WHERE S.Tid = T.Tid AND S.Cid = C.Cid AND C.Gender = 'F' AND T.Season = 'Summer'
GROUP BY C.Cid

Q5 (2 joins, 2 selection predicates, frequency 20):
SELECT sum(S.Dollar_Sales)
FROM SALES S, CUSTOMER C, PRODUCT P
WHERE S.Pid = P.Pid AND S.Cid = C.Cid AND C.Gender = 'M' AND P.Package_type = 'Box'
GROUP BY P.Pid

Q6 (1 join, 0 selection predicates, frequency 10):
SELECT max(S.Dollar_Sales)
FROM SALES S, CUSTOMER C
WHERE S.Cid = C.Cid
GROUP BY C.Cid

Figure 3: Example of OLAP Queries (each query is listed with its join/selection characteristics and its access frequency)

Figure 4: From an Unpartitioned Star Schema to a Partitioned Star Schema (CUSTOMER, PRODUCT, TIME and SALES are partitioned into the fragments Customer_1, ..., Customer_j; Product_1, ..., Product_i; Time_1, ..., Time_k; and Sales_1, ..., Sales_l; the original tables are reconstructed by union)

1.2 Contributions

The main contributions of this paper are the following:

- We first propose an algorithm for partitioning the dimension tables and the fact table of a star schema (Section 2).
- Fragmenting the fact table based on all predicates given in OLAP queries might be prohibitive. We therefore show that the dimension tables play a very important role in fragmenting the fact table, and we develop a greedy algorithm for selecting the best dimension tables for partitioning the fact table (Section 3).
- We then develop a cost model for executing the most frequent OLAP queries on partitioned and unpartitioned star schemas (Section 4).
- Finally, we evaluate our partitioning algorithm (Section 5).

2 Partitioning of Warehouse Data

In this section we discuss an algorithm for HP of warehouse data that has been modeled using a star schema. We first introduce some definitions.

Definition 1 A distributed join [4] is a join between horizontally partitioned relations.

When an application requires the join of two relations R and S, all tuples of R and S need to be compared; thus, in principle, it is necessary to compare all fragments {R1, R2, ..., Rn} of R with all fragments {S1, S2, ..., Sn} of S. However, it is sometimes possible to deduce that some of the partial joins (Ri ⋈ Sj) are intrinsically empty. This can happen when the relation R (or S) is partitioned using derived HP based on the partitioning schema of S (or R). A distributed join between two relations R and S can be represented by a graph called a join graph. The nodes of this graph represent the fragments of R and S; an edge between two nodes exists if these nodes are join-able.

Definition 2 A join graph is total [5] when it contains all possible edges between fragments of R and S. A join graph is partitioned if it is composed of two or more sub-graphs without edges between them. A join graph is simple if it is partitioned and each sub-graph has just one edge.
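As an illustration, here is a minimal Python sketch (the representation is our own, not the paper's) that classifies a join graph, given as a set of join-able fragment pairs, according to Definition 2:

```python
from itertools import product

def join_graph_kind(edges, r_frags, s_frags):
    """Classify a join graph per Definition 2.

    edges: set of (R-fragment, S-fragment) pairs that are join-able.
    """
    if set(edges) == set(product(r_frags, s_frags)):
        return "total"
    # Union-find over fragment nodes to isolate the connected sub-graphs.
    parent = {("R", f): ("R", f) for f in r_frags}
    parent.update({("S", f): ("S", f) for f in s_frags})

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r, s in edges:
        parent[find(("R", r))] = find(("S", s))
    groups = {}
    for r, s in edges:
        groups.setdefault(find(("R", r)), []).append((r, s))
    if len(groups) < 2:
        return "connected"   # one sub-graph: neither total nor partitioned
    if all(len(g) == 1 for g in groups.values()):
        return "simple"      # partitioned, with one edge per sub-graph
    return "partitioned"

# Derived HP of SALES on CUSTOMER (Section 1.1.1) yields a simple join graph:
print(join_graph_kind({("Sales_1", "Cust_1"), ("Sales_2", "Cust_2")},
                      ["Sales_1", "Sales_2"], ["Cust_1", "Cust_2"]))  # simple
```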

Determining that a join has a simple join graph is very important in database design [5]. The simple join graph concept has a great advantage in optimizing selection and join operations, because it enables partition elimination [8]. Partition elimination occurs when the database optimizer determines that some of the table fragments are unnecessary for satisfying the query. In the next section, we show how a simple join can improve the execution of OLAP queries. Partitioning warehouse data is more complex and challenging than partitioning relational and object databases, due to the several choices for partitioning a star schema: either the dimension tables, the fact table, or both can be fragmented. Since most OLAP queries first access the dimension tables and then the fact table, it is necessary to fragment both. We first partition some or all dimension tables using primary HP, and then use them to partition the fact table using derived HP. This approach is suitable because it takes into consideration the query requirements and the relationship between the fact table and the dimension tables.

2.1 Partitioning a Star Schema

In this section, we discuss an algorithm for fragmenting a star schema with one fact table F and d dimension tables {D1, D2, ..., Dd}. This algorithm partitions the dimension tables first and then uses their fragmentation schemas to derive the fragments of the fact table. To partition a dimension table, we use an affinity-driven algorithm that uses quantitative and qualitative [20] information about applications. Quantitative information describes the selectivity factors and the frequency of each query accessing the table; qualitative information describes the selection predicates defined on the table. A simple predicate p is defined by p: Ai θ Value, where Ai is an attribute, θ ∈ {=, <, ≤, >, ≥, ≠}, and Value ∈ Dom(Ai). The inputs to the proposed algorithm are the d dimension tables and one fact table, together with a set of most frequently asked OLAP queries Q = {Q1, Q2, ..., Qn} and their frequencies. The main steps of our algorithm are:

1. Enumerate all simple predicates used by the OLAP queries (Q1, ..., Qn).
2. Assign to each dimension table Di (1 ≤ i ≤ d) its set of simple predicates SSP_Di.
3. Each dimension table Di having SSP_Di = ∅ cannot be fragmented. Let Dcandidate be the set of all dimension tables whose SSP_Di is not empty, and let g be the cardinality of Dcandidate.
4. Apply the COM_MIN algorithm [20] to the simple predicates of each dimension table Di of Dcandidate. This algorithm takes a set of simple predicates and generates a set that is complete and minimal. The rule of completeness and minimality states that "a relation is partitioned into at least two fragments which are accessed differently by at least one application".
5. To fragment a dimension table Di, it is possible to use one of the algorithms proposed by [4, 20] for the relational model. These algorithms generate a set of disjoint fragments, but their complexity is exponential in the number of simple predicates used. As a result, we use the algorithm we proposed for the object model, which has polynomial complexity [1]. After the fragmentation process, each dimension table Di of Dcandidate will have mi fragments {Di1, Di2, ..., Dimi}, where each fragment Dij is defined as Dij = σ_{clij}(Di), with clij (1 ≤ i ≤ g, 1 ≤ j ≤ mi) a clause of simple predicates.
6. Derive the fragments of the fact table using the fragmentation schemas of the dimension tables. The number of fragments of the fact table is N = ∏_{i=1}^{g} mi (for details see [1]). The star schema S is thus decomposed into N sub-star schemas {S1, S2, ..., SN}, each of which satisfies one clause of predicates.

A code sketch of these steps follows Example 1 below.

Example 1 Let us consider the star schema in Figure 2 and the six OLAP queries in Figure 3. From these queries, we enumerate all selection predicates: p1: C.State = 'Illinois', p2: C.Gender = 'M', p3: C.Gender = 'F', p4: P.Package_type = 'Box', p5: P.Package_type = 'Paper', p6: T.Season = 'Summer' and p7: T.Season = 'Winter'. The sets of simple predicates per table are (step 2): SSP_CUSTOMER = {p1, p2, p3}, SSP_PRODUCT = {p4, p5} and SSP_TIME = {p6, p7}. For each set, we generate the complete and minimal set of simple predicates (step 4), obtaining: SSP_Min-Com^CUSTOMER = {p2, p3}, SSP_Min-Com^PRODUCT = {p4, p5} and SSP_Min-Com^TIME = {p6, p7}.


Figure 5: Simple Join in a Star Schema (fragmenting the fact table F and the dimension tables D1 and D3 of an initial star schema yields sub-star schemas with simple joins, (F1, D11, D31), (F2, D12, D32), ..., (FN, D1M, D3M'); each fact fragment joins exactly one fragment of each partitioned dimension table, and the original schema is reconstructed by union)

By applying the fragmentation algorithm [1] to each dimension table, we obtain the following fragments:

- CUSTOMER: Cust_1 = σ_{Gender='M'}(CUSTOMER) and Cust_2 = σ_{Gender='F'}(CUSTOMER),
- PRODUCT: Prod_1 = σ_{Package_type='Box'}(PRODUCT) and Prod_2 = σ_{Package_type='Paper'}(PRODUCT),
- TIME: Time_1 = σ_{Season='Winter'}(TIME) and Time_2 = σ_{Season='Summer'}(TIME).

Finally, the fact table can be horizontally partitioned into 8 (N = 2 × 2 × 2) fragments, giving 8 sub-star schemas.
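The enumeration steps of the algorithm, instantiated on Example 1, can be sketched as follows. This is a minimal sketch under simplifying assumptions: COM_MIN and the dimension-fragmentation algorithm of [1] are abstracted away, each predicate of the complete and minimal set is treated as one fragment clause, and the function and variable names are ours, not the paper's.

```python
from collections import defaultdict
from math import prod

def partition_star_schema(query_predicates, dimension_tables):
    """query_predicates: (dimension_table, simple_predicate) pairs drawn from
    the frequently asked OLAP queries (steps 1-2 of the algorithm)."""
    ssp = defaultdict(set)
    for table, predicate in query_predicates:
        ssp[table].add(predicate)
    # Step 3: tables with an empty set of simple predicates are not fragmented.
    candidates = [d for d in dimension_tables if ssp[d]]
    # Steps 4-5 (abstracted): COM_MIN plus the algorithm of [1] would turn each
    # predicate set into m_i disjoint fragments; here one predicate = one clause.
    fragments = {d: sorted(ssp[d]) for d in candidates}
    # Step 6: the fact table receives N = prod(m_i) derived fragments.
    n_fact_fragments = prod(len(clauses) for clauses in fragments.values())
    return fragments, n_fact_fragments

# Example 1 after COM_MIN (p1 on C.State is removed by COM_MIN):
predicates = [("CUSTOMER", "Gender='M'"), ("CUSTOMER", "Gender='F'"),
              ("PRODUCT", "Package_type='Box'"), ("PRODUCT", "Package_type='Paper'"),
              ("TIME", "Season='Summer'"), ("TIME", "Season='Winter'")]
frags, n = partition_star_schema(predicates, ["CUSTOMER", "PRODUCT", "TIME"])
print(n)  # 8 sub-star schemas, as in Example 1
```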

Our algorithm may generate a large number of fragments of the fact table. For example, suppose that CUSTOMER is partitioned into 50 fragments using the State attribute (one fragment per U.S. state), TIME into 12 fragments using the Month attribute, and PRODUCT into 2 fragments using the Package_type attribute; then the fact table is fragmented into 1200 (50 × 12 × 2) fragments, and hence 1200 sub-star schemas. It would clearly be very hard to maintain these sub-star schemas; therefore, it is important to reduce the number of fragments of the fact table. In the next section we discuss a greedy algorithm that reduces the number of fact table fragments.

Generally, when derived HP is used for partitioning a table in a database schema, two potential cases of join exist (simple and partitioned). In the DW context, when the fact table is horizontally partitioned based on the dimension tables, we never get a partitioned join (i.e., the case wherein a HF of the fact table has to be joined with more than one HF of a dimension table cannot occur) [3]. A simple join operation has three major advantages:

- It avoids a costly total distributed join of the fact table F with each and every HF Dij (1 ≤ j ≤ mi) of each and every dimension table Di.
- It guarantees the elimination of some partitions for join and selection operations. For example, if the fact table SALES has been partitioned into 12 fragments (using the attribute Month of the dimension table TIME), the system can satisfy a query that asks for only the last two months of data by processing only 2 of the 12 fragments.
- It facilitates parallel processing of multiple simple joins (each HF Fi of the fact table joins with exactly one HF Dik of Di).


2.2 The Correctness Rules

Any fragmentation algorithm must guarantee the correctness rules of fragmentation: completeness, reconstruction, and disjointness.

- Completeness ensures that all tuples of a relation are mapped into at least one fragment, without any loss. The completeness of the dimension tables is guaranteed by the use of the COM_MIN algorithm [20, 4]. The completeness of the derived horizontal fragmentation of the fact table is guaranteed as long as the referential integrity rule is satisfied between the dimension tables and the fact table.
- Reconstruction ensures that the fragmented relation can be reconstructed from its fragments [20]. In our case, the fact and dimension tables are reconstructed by the union operation, i.e., F = ∪_{i=1}^{N} Fi and Di = ∪_{j=1}^{mi} Dij.
- Disjointness ensures that the fragments of a relation are non-overlapping. This rule is satisfied for the dimension tables since we use a non-overlapping algorithm [1]. For the fragments of the fact table, the disjointness rule is guaranteed by the fact that any HF of the fact table joins with exactly one HF of each dimension table [3].

3 Selection of Dimension Tables

As we have seen in Section 2.1, the number of fragments of the fact table can be very large, especially when the dimension tables are partitioned into large numbers of fragments. Therefore, it is necessary to develop an algorithm for selecting the set of dimension tables used as a basis for generating the fragments of the fact table.

3.1 Selection Algorithm

Suppose we have a star schema with a set of dimension tables and a fact table, and let {Q1, Q2, ..., Qn} be the set of frequently asked queries. Our aim is to partition the dimension tables and the fact table so as to reduce both the query processing cost and the maintenance cost. To solve this problem, we develop a greedy algorithm that considers only dimension tables with SSP_Di ≠ ∅. In this algorithm, we fix a bound on the number of fragments, say W. Our greedy algorithm first selects one dimension table at random. Once the selection is done, we partition the fact table based on the fragmentation schema of the selected dimension table, compute the number of fragments N of the fact table, and then compute the cost of executing the set of OLAP queries (using the cost model described in Section 4). If N is less than W and the query processing cost has improved, our algorithm selects another dimension table and repeats the same process, as long as the two conditions hold; otherwise, we remove the selected dimension table, select another one, and repeat. The main steps of this algorithm are shown in Figure 6, and a code sketch of the loop follows the selection methods below. At the end of this algorithm, we obtain a partitioned DW ensuring good query processing and maintenance costs.

Figure 6: The Steps of the Greedy Algorithm (select a dimension table if possible; partition the fact table; compute the number of fact table fragments N and the current cost C_Cu over the set of queries; if N < W and C_Cu < C_Pr, the previous cost, continue with another table; otherwise remove the selected table; W is the number of fragments chosen by the DWA)

As described, our algorithm selects dimension tables randomly, and this random selection can affect the outcome of the greedy algorithm. To address this, we propose three methods for selecting the dimension tables:

Frequently Used Dimension Table. Each dimension table is accessed by a set of OLAP queries with certain frequencies. This method selects the dimension tables having the highest access frequencies.

Low Cardinality of Attribute Domains. This method selects dimension tables having simple predicates defined on an attribute with a low-cardinality domain. For example, suppose we have a simple predicate defined on the dimension table CUSTOMER of Figure 2 using the attribute Gender, which has exactly two values: Female and Male. This method ensures that queries over the resulting fragments of the fact table can be executed faster using indexes like bitmap indexes [19].

Small Set of Simple Predicates (SSP). This method computes the cardinality of the set of simple predicates SSP_Di for each dimension table Di and selects the dimension tables having the minimum cardinality. It ensures that the number of resulting fact table fragments stays small, which helps the DWA administer the partitioned DW and facilitates the allocation of fragments to data marts. Note that the first two methods try to satisfy the performance constraint, while the last one tries to satisfy the maintenance constraint.
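The greedy loop of Figure 6 can be sketched as follows; this is a minimal sketch with assumed helper names: cost() stands for the model of Section 4, fragment_count() for N = ∏ mi, and the candidate ordering for any of the three selection methods above.

```python
def greedy_select(candidates, w, cost, fragment_count):
    """Pick the dimension tables used for derived HP of the fact table.

    candidates: dimension tables with SSP_Di != {}, ordered by one of the
    three heuristics above (or randomly, as in the basic algorithm).
    w: bound W on the number of fact table fragments, chosen by the DWA.
    """
    selected = []
    prev_cost = cost(selected)        # cost on the unpartitioned star schema
    for d in candidates:
        trial = selected + [d]
        n = fragment_count(trial)     # N for the trial fragmentation schema
        cur_cost = cost(trial)
        # Keep d only while both conditions of Figure 6 hold; otherwise the
        # table is dropped and the next candidate is tried.
        if n < w and cur_cost < prev_cost:
            selected = trial
            prev_cost = cur_cost
    return selected
```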

3.2 Query Execution Strategy in a Partitioned Star Schema

Since the dimension tables and the fact table are horizontally partitioned, we need to ensure the data access transparency concept, whereby a user of the DW is unaware of the distribution of the data. Our goal is to present the DW users with the unpartitioned star schema; the query optimizer's task is then to translate OLAP queries posed on the unpartitioned star schema into queries on the partitioned star schemas. Before executing a query Q on a partitioned DW, we first need to identify the sub-schemas relevant to the query, as shown in Figure 7.

Definition 3 (Relevant Predicate Attribute (RPA)) A relevant predicate attribute is an attribute that participates in a predicate defining a HP. Any attribute that does not participate in a predicate defining a HP is called an irrelevant predicate attribute.

Definition 4 (Partitioning Specification Table) Suppose we have an initial star schema S horizontally partitioned into N sub-schemas {S1, S2, ..., SN}. The partitioning conditions for each partitioned table can be represented by a table called the partitioning specification table of S. This table has three columns: the first contains the table names, the second gives the fragments of the corresponding table, and the last reports the partitioning condition for each fragment.

Example 2 Let us consider the example of Section 2, where the dimension table CUSTOMER is partitioned into two fragments Cust_1 and Cust_2, and the fact table into two fragments Sales_1 and Sales_2. The corresponding partitioning specification table is illustrated in Table 1.

From the partitioning specification table, we can conclude that each attribute appearing in it is an RPA. In our example, Gender is the single RPA. Let SRPA denote the set of RPAs.

Fragment Identification. Let Q be a query with p selection predicates defined on a partitioned star schema S = {S1, S2, ..., SN}. Our aim is to identify the sub-schema(s) that participate in executing Q.

Table      Fragment   Fragmentation condition
CUSTOMER   Cust_1     Gender = 'M'
CUSTOMER   Cust_2     Gender = 'F'
SALES      Sales_1    SALES ⋉ Cust_1
SALES      Sales_2    SALES ⋉ Cust_2

Table 1: Partitioning Specification Table (case 1)

Based on the selection predicates of the query Q and the partitioning specification table, we proceed as follows:

- For each selection predicate SPi (1 ≤ i ≤ p), we define the function attr(SPi), which gives the name of the attribute used by SPi. The union of the attr(SPi) gives the names of all attributes used by Q; we call this set the query predicate attributes, denoted SPA(Q).
- Using SPA(Q) and SRPA(S), four scenarios are possible:

1. SPA(Q) = ∅: the query Q does not contain any selection predicate (the case of query Q6 in Figure 3). In this situation, two approaches can execute Q: (a) perform the union of all sub-schemas and then perform the join operations as in the unpartitioned star schema; or (b) perform the join operations in each sub-schema and then assemble the results with a union operation.
2. (SPA(Q) ≠ ∅) and (SPA(Q) ∩ SRPA(S) = ∅): the query has selection predicates on non-partitioned dimension tables, or its predicate attributes do not match any relevant predicate attribute. For example, the query Q1 has one selection predicate defined on the dimension table CUSTOMER using the attribute State, but this table is not partitioned using this attribute. To execute this kind of query, we use the two approaches presented in 1.(a) and 1.(b).
3. (SPA(Q) ∩ SRPA(S) ≠ ∅): some predicate attributes of the query Q match certain RPAs. In this case, we can easily determine the dimension tables and the fragments that participate in executing Q. For example, suppose the fact table is partitioned into 8 fragments (see Table 5); then SRPA(S) = {Gender, Package_type, Season}. To execute a query having one selection predicate defined on Gender, we need to access 4 fragments instead of 8.
4. (SPA(Q) ⊆ SRPA(S)): all predicate attributes of Q are RPAs. In this case, the query Q may be executed very fast.

To conclude this section, some observations can be made: (a) HP can deteriorate the performance of queries falling under cases 1 and 2, (b) it is good for queries under case 3, and (c) it is highly recommended for queries under case 4.
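The four-way case analysis can be sketched as a small function; this is a minimal sketch with our own names, where spa stands for SPA(Q), srpa for SRPA(S), and the more specific case 4 is tested before case 3:

```python
def classify_query(spa: set, srpa: set) -> int:
    """Return the applicable case (1-4) of Section 3.2 for a query."""
    if not spa:
        return 1                 # no selection predicates: access all sub-schemas
    if not (spa & srpa):
        return 2                 # predicates miss every RPA: treated like case 1
    if spa <= srpa:
        return 4                 # every predicate attribute is an RPA: fastest
    return 3                     # partial match: partition elimination applies

srpa = {"Gender", "Package_type", "Season"}          # case-3 example of the text
print(classify_query(set(), srpa))                   # 1 (query Q6)
print(classify_query({"State"}, srpa))               # 2 (query Q1)
print(classify_query({"Gender", "State"}, srpa))     # 3
print(classify_query({"Gender", "Season"}, srpa))    # 4
```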

4 Query Processing Cost Models

In this section, we present two cost models for processing a set of frequently asked OLAP queries: one for an unpartitioned star schema, and one for a partitioned star schema. The objective of our cost models is to calculate the cost of executing these queries in terms of disk page accesses (IO cost) during selection and join operations, which are the most used and most expensive operations in DWs [17]. To perform the selection and join operations, we make two major assumptions regarding the size of main memory, as it affects the number of disk accesses required [21]:


1. Large Memory Hypothesis (LMH): All dimension tables fit in main memory because their sizes are very small [7], while the fact table and the indexes reside on disk; the fact table and the indexes are loaded from disk into main memory only once. This assumption becomes more and more realistic as main memory sizes keep increasing with falling memory prices.

2. Medium Memory Hypothesis (MMH): Similar to the LMH, with the single difference that the intermediate result may not fit in memory, in which case it must be stored on disk and reloaded when required.

Figure 7: Sub-schema Identification (the query is matched against the N sub-star schemas (F1, D11, D21, ...), ..., (FN, D1m1, D2m2, ...); the valid sub-star schemas participate in executing the query, and the non-valid ones are eliminated)

4.1 Query Execution Strategy on an Unpartitioned DW

Note that the join operation is typically expensive, particularly when the sizes of the relations involved are larger than the available main memory [17]. The star join index [22, 19] has proved to be an efficient access structure for speeding up joins between fact and dimension tables in DWs. Our cost model therefore assumes a star join index on top of the star schema. The notation for the cost models is summarized in Table 2.

Symbol     Meaning
PS         Page size of the file system (in bytes)
w(cij)     Width, in bytes, of the column ci of a table Tj (dimension or fact)
w(KTj)     Width, in bytes, of a key attribute K of a table Tj
w(Tj)      Width, in bytes, of a tuple of a table Tj
||Tj||     Number of tuples present in a table Tj
|Tj|       Total number of pages occupied by a table Tj
SFTj       Selectivity factor on table Tj
B          Buffer size used for storing intermediate results

Table 2: Symbols and their Meanings

Let Qi be a query represented by a query graph QGi with a set of nodes denoted node(QGi) (t is the cardinality of node(QGi)). Suppose that we have a star join index G with a set of nodes denoted node(G). Suppose also that we have one B+-tree for accessing the tuples of G from a dimension table, and another for accessing the tuples of the fact table from the tuples of G, as shown in Figure 8. Then four scenarios are possible for executing Qi:

1. Qi accesses only one table (dimension or fact). In this case, we do not need a join index.
2. The star join index G does not cover the tables of the query Qi (i.e., node(QGi) ∩ node(G) = ∅). We call this strategy no cover. We execute Qi using standard techniques (e.g., hash join, sort-merge join); the cost of joining two relations using hash join or sort-merge join is given in [21].

3. The star join index G covers all tables of the query Qi (i.e., node(QGi) ⊆ node(G)). This scenario is called total cover. For example, an index between SALES, CUSTOMER and PRODUCT covers the query Q1. To execute a query under this scenario, we perform the following steps:
(a) Load all dimension tables used by the query Qi.
(b) Select the dimension table Dmin among the dimension tables of node(QGi) that has the minimum selectivity factor.
(c) Load the B+-tree between the dimension table Dmin and G. Once this B+-tree is loaded, we use the tuples of Dmin to load the useful tuples of G from disk into main memory (see Figure 8).
(d) Perform the selection operations on the tables of node(QGi) (if they have selection predicates). For each table of node(QGi), consider the primary keys (IDs) of the selected tuples, which can be used to access G. Each ID identifies a set of tuples of G; by intersecting these sets, we obtain the tuples of G satisfying the join operation.
(e) Load the B+-tree used to access the tuples of the fact table (see Figure 8).
(f) Load the useful tuples from the fact table.
4. The join index G covers only some tables of QGi (including the fact table), i.e., node(G) ∩ node(QGi) ≠ ∅. This scenario is called partial cover. For example, a join index G between SALES and TIME partially covers the query Q3, which accesses all tables of our star schema (Figure 3). To execute a query Qi in this scenario, we use G to perform the join between the tables covered by G (in the same way as in the total cover scenario), and then use the result of this join to complete the execution of Qi (in the same way as in the no cover scenario).
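Steps (c) and (d) amount to intersecting, per dimension table, the sets of star-join-index entries reached from the selected tuple IDs. Here is a minimal in-memory sketch; the dictionary-of-sets representation is our own simplification of the B+-tree structures of Figure 8:

```python
def total_cover_lookup(g_entries_by_dim_id, selected_ids_per_dim):
    """g_entries_by_dim_id[dim][tid]: set of star-join-index entries that
    contain tuple tid of dimension dim (the role of the first B+-tree).
    selected_ids_per_dim[dim]: IDs surviving the selections of step (d)."""
    surviving = None
    for dim, ids in selected_ids_per_dim.items():
        hits = set()
        for tid in ids:
            hits |= g_entries_by_dim_id[dim].get(tid, set())
        # Intersect across dimensions: a G entry must be reachable from a
        # selected tuple of every dimension table of the query.
        surviving = hits if surviving is None else (surviving & hits)
    # The surviving G entries drive the second B+-tree lookup of step (e).
    return surviving if surviving is not None else set()
```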

4.2 The Cost Models

Now we present a cost model for each scenario.

No cover: If the star join index G does not cover the query graph, we use the hash join technique, whose cost is given in [21].

Total cover: The cost of executing a query under this scenario is given by:

1. The loading cost of the t dimension tables used by the query Qi is:

   Load_Cost(D) = Σ_{i=1}^{t} |Di|   (5)

2. The cost of selecting the dimension table Dmin among the dimension tables of node(QGi) is null (all dimension tables are in main memory).

3. The loading cost of the B+-tree is the size of this B+-tree. As in [15], we consider the size of a B+-tree to be the size of all its leaf nodes (the number of leaf nodes is approximately the number of rows in the underlying table).

4. The loading cost of G is:

   Load_Cost(G) = ⌈SF_Dmin × ||G||⌉   (6)

5. The cost of performing the intersection operation is null, since the dimension tables and the index are in main memory.

6. The loading cost of the second B+-tree is similar to that of step 3.

7. The loading cost of the tuples of the fact table is:

   Load_Cost(F) = ⌈SF_F × ||F||⌉   (7)

Figure 8: Join Processing using a Star Join Index (a B+-tree maps the IDs of the selected tuples of Dmin to entries of the star join index, and a second B+-tree maps star join index entries to the tuples of the fact table; D1, ..., Dk are the dimension tables with keys K1, ..., Kk)

Partial cover: The cost for this case is derived from the no cover and total cover strategies.

We evaluate the best cost for each query Qi as C(Qi). The total query processing cost (TQPC) for a set of s queries with frequencies fi is then:

   TQPC = Σ_{i=1}^{s} fi × C(Qi)   (8)
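Equations (5)-(8) for the total cover scenario under the LMH can be combined into a small sketch; the parameter names are our own, with dimension sizes given in pages and index/fact sizes in tuples:

```python
from math import ceil

def total_cover_cost(dim_pages, btree1_pages, btree2_pages,
                     sf_dmin, g_tuples, sf_f, f_tuples):
    load_dims = sum(dim_pages)               # eq. (5): the t dimension tables
    load_index = ceil(sf_dmin * g_tuples)    # eq. (6): useful star-join-index tuples
    load_fact = ceil(sf_f * f_tuples)        # eq. (7): useful fact-table tuples
    # Steps 2 and 5 cost nothing under the LMH; the two B+-trees of
    # steps 3 and 6 are loaded once each.
    return load_dims + btree1_pages + btree2_pages + load_index + load_fact

def tqpc(frequencies, costs):
    """Eq. (8): frequency-weighted sum of the best per-query costs C(Qi)."""
    return sum(f * c for f, c in zip(frequencies, costs))
```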

4.3 Cost Model for a Partitioned DW

Assume we have N sub-star schemas, each of which adheres to a clause of simple predicates, and suppose a star join index exists for each sub-star schema. The cost model under these assumptions is quite similar to the previous one. The only differences are that (1) instead of loading whole tables, we load only the fragments of tables needed by a given query, and (2) we load only the indexes of the valid sub-star schemas. To identify the sub-star schema(s) needed by a query Q, we define a boolean variable valid(Q, Si):

   valid(Q, Si) = 1 if the sub-star schema Si is used by the query Q, and 0 otherwise.

Similarly, we define valid(Q, Dij) = 1 if the fragment Dij is used by the query Q, and 0 otherwise. Now we have all the ingredients to describe the cost model for the partitioned case:

1. The loading cost of the dimension table fragments of a given sub-star schema Sk is:

   Load_Cost(D) = Σ_{i=1}^{t} Σ_{j=1}^{mi} valid(Q, Dij) × |Dij|   (9)

2. The cost of selecting the fragment having the minimum selectivity is null (all fragments are in main memory).

3. The loading cost of the B+-tree used to access the tuples of the star join index defined on the valid sub-star schema is, as before, the size of its leaf nodes.

4. The loading cost of the star join index Gk of sub-star schema Sk is:

   Load_Cost(Gk) = ⌈SF_Dmin,j × ||Gk||⌉   (10)

5. The cost of performing the intersection operation is null, since the fragments of the dimension tables and the index are in main memory.

6. The loading cost of the second B+-tree is similar to that of step 3.


7. The loading cost of the tuples of the fact table fragment Fk (each sub-star schema Sk contains exactly one fact table fragment Fk) is:

   Load_Cost(Fk) = ⌈SF_Fk × ||Fk||⌉   (11)
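Equation (9) only charges for the fragments a query actually touches; a minimal sketch (names and page counts are ours, with valid playing the role of valid(Q, Dij)):

```python
def load_cost_dimension_fragments(fragment_pages, valid):
    """fragment_pages[(i, j)]: size |Dij| in pages of fragment j of table i.
    valid[(i, j)]: 1 if the query uses fragment Dij, else 0 (eq. (9))."""
    return sum(pages for key, pages in fragment_pages.items() if valid.get(key, 0))

# Case 1 of Section 5: CUSTOMER split on Gender; a query with C.Gender = 'M'
# loads only the Cust_1 fragment (illustrative page counts).
pages = {("CUSTOMER", 1): 100, ("CUSTOMER", 2): 100}
print(load_cost_dimension_fragments(pages, {("CUSTOMER", 1): 1}))  # 100
```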

5 Evaluation of a Partitioned DW

To show the utility of HP in the DW context, given the DW schema and data set from Informix [9], we conduct experiments comparing partitioned and unpartitioned DWs. As in Section 1, the schema consists of three dimension tables, CUSTOMER, TIME, and PRODUCT, and one fact table, SALES. The key characteristics of our experimental data warehouse are shown in Table 3. To characterize the performance improvement due to HP, we define a normalized IO metric as follows:

   Normalized IO = (Number of IOs for the Horizontally Partitioned Star Schema) / (Number of IOs for the Unpartitioned Star Schema)

Note that a normalized IO value of less than 1.0 implies that HP is beneficial.

Parameter      Description                 Value
||SALES||      Number of rows              100,000,000
w(SALES)       SALES width (in bytes)      34
||CUSTOMER||   Number of rows              3,000,000
w(CUSTOMER)    CUSTOMER width (in bytes)   59
||PRODUCT||    Number of rows              300,000
w(PRODUCT)     PRODUCT width (in bytes)    51
||TIME||       Number of rows              1,094
w(TIME)        TIME width (in bytes)       30
PS             Page size (in bytes)        8,192

Table 3: Parameters used in the Experiments

Suppose that the dimension tables are partitioned as in Example 2. To see the impact of the number of horizontally partitioned dimension tables, we concentrate on the following cases:

- Case 0: all tables are unpartitioned.
- Case 1: only one dimension table is partitioned, namely CUSTOMER. The partitioning specification table is given in Table 1.
- Case 2: two dimension tables are partitioned, namely CUSTOMER and PRODUCT. The partitioning specification table is given in Table 4.
- Case 3: all dimension tables are partitioned. The partitioning specification table is given in Table 5.

We assume a complete star join index on top of our star schema, i.e., a star join index covering all tables of the DW schema (SALES, CUSTOMER, TIME and PRODUCT). For all experiments, we assume that fragment sizes are uniform. Figures 9 and 10 present the results of executing the six queries for each case. Our observations can be summarized as follows:

- HP gives almost uniformly good performance compared to the unpartitioned case. In all cases, the query Q3 benefits from HP, because its selection predicates are exactly those used by the partitioning algorithm (see case 4 of Section 3.2).


Table      Fragment   Fragmentation clause
CUSTOMER   Cust_1     Gender = 'M'
CUSTOMER   Cust_2     Gender = 'F'
PRODUCT    Prod_1     Package_type = 'Box'
PRODUCT    Prod_2     Package_type = 'Paper'
SALES      Sales_1    SALES ⋉ Cust_1 ⋉ Prod_1
SALES      Sales_2    SALES ⋉ Cust_1 ⋉ Prod_2
SALES      Sales_3    SALES ⋉ Cust_2 ⋉ Prod_1
SALES      Sales_4    SALES ⋉ Cust_2 ⋉ Prod_2

Table 4: The Partitioning Specification Table for Case 2

Table      Fragment   Fragmentation condition
CUSTOMER   Cust_1     Gender = 'M'
CUSTOMER   Cust_2     Gender = 'F'
TIME       Time_1     Season = 'Winter'
TIME       Time_2     Season = 'Summer'
PRODUCT    Prod_1     Package_type = 'Box'
PRODUCT    Prod_2     Package_type = 'Paper'
SALES      Sales_1    SALES ⋉ Cust_1 ⋉ Time_1 ⋉ Prod_1
SALES      Sales_2    SALES ⋉ Cust_1 ⋉ Time_1 ⋉ Prod_2
...        ...        ...
SALES      Sales_8    SALES ⋉ Cust_2 ⋉ Time_2 ⋉ Prod_2

Table 5: The Partitioning Specification Table for Case 3

- HP may deteriorate the performance of certain queries. For example, none of the partitioning cases suits the queries Q1 and Q6. This is because (1) Q1 has a selection predicate (C.State = 'Illinois') that does not appear in any partitioning specification table (see Tables 1, 4 and 5), and (2) Q6 has no selection predicate at all. To evaluate these two queries, we must access all sub-star schemas and then perform a union operation.
- The number of partitioned dimension tables has a great impact on reducing the query processing cost. As Figures 9 and 10 show, HP gives the best results (for Q2, Q3, Q4, Q5) in case 3, where all three dimension tables are partitioned.
- The normalized IO is less than 1.0 in every case. This means that HP gives better results when we consider the overall performance of the DW. For example, HP is not good for the queries Q1 and Q6, but it benefits the queries Q2, Q3, Q4 and Q5; the bad performance of Q1 and Q6 is compensated by the good performance of the profitable queries (Q2, Q3, Q4 and Q5).
- HP always gives good results under the MMH (see Figure 10). This is because, when the DW is partitioned, the intermediate results are small enough to be kept in main memory.

Since the fact table reflects the dynamic aspect of the DW (new tuples keep being inserted), we have also studied the effect of varying the size of the fact table on HP. We assume that our star schema is partitioned into 2, 4 and 8 sub-star schemas, and we vary the size of the fact table from 10^8 to 10^9 tuples. We consider three of the six queries of Figure 3: (1) Q1 is executed under case 1 (only one dimension table is partitioned), (2) Q2 under case 2 (two dimension tables are fragmented), and (3) Q3 under case 3 (all dimension tables are fragmented). We observe that the normalized IO is constant for each query (see Figure 11); consequently, the normalized IO is independent of the cardinality of the fact table.

Figure 9: Query Processing Cost under LMH

Figure 10: Query Processing Cost under MMH

Figure 11: Effect of Varying the Size of the Fact Table (normalized IO versus fact table size, from 1e+08 to 1e+09 tuples, for (Q1, Case 1), (Q2, Case 2) and (Q3, Case 3); the normalized IO of each query stays constant)

6 Conclusion and Future Work

Executing an OLAP query in a data warehouse can be very expensive, particularly on large warehouse data, if the data is not modeled properly. Moreover, if OLAP queries need only a portion of the data, it is advisable to fragment the data so that each query can, as far as possible, be executed on a single fragment, thus minimizing query response time. In this paper we have studied the problem of partitioning warehouse data modeled as a star schema. We have shown that horizontal fragmentation of a star schema can facilitate parallelism, allowing OLAP queries to be executed more efficiently. Moreover, the resulting fragments can be allocated to data marts in such a way that each fragment is a functional unit of allocation. We have proposed an algorithm that first selects dimension tables for fragmentation and then fragments the fact table based on these dimension tables. We have also developed a cost model for executing the most frequent OLAP queries on the partitioned warehouse and evaluated its performance. Our experiments show that fragmentation almost always gives better results than the unpartitioned case.

References

[1] L. Bellatreche, K. Karlapalem, and A. Simonet. Algorithms and support for horizontal class partitioning in object-oriented databases. To appear in the Distributed and Parallel Databases Journal, 8(2), April 2000.
[2] L. Bellatreche, K. Karlapalem, and Q. Li. Algorithms for graph join index problem in data warehousing environments. Technical Report HKUST-CS99-07, Hong Kong University of Science & Technology, March 1999.
[3] L. Bellatreche, K. Karlapalem, and M. Mohania. OLAP query processing for partitioned data warehouses. In the International Symposium on Database Applications in Non-Traditional Environments (IEEE Computer Society Press), November 1999.
[4] S. Ceri, M. Negri, and G. Pelagatti. Horizontal data partitioning in database design. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 128-136, 1982.
[5] S. Ceri and G. Pelagatti. Distributed Databases: Principles & Systems. McGraw-Hill International Editions, 1984.
[6] S. Chaudhuri and V. Narasayya. Index merging. Proceedings of the International Conference on Data Engineering (ICDE), pages 296-303, March 1999.
[7] Oracle Corp. Star queries in Oracle8. White Paper, June 1997.
[8] Oracle Corp. Oracle8i enterprise edition partitioning option. Technical report, Oracle Corporation, February 1999.
[9] Informix Corporation. Informix-OnLine Extended Parallel Server and Informix-Universal Server: A new generation of decision-support indexing for enterprise data warehouses. White Paper, 1997.
[10] D. Simpson. Build your warehouse on MPP. Available at http://www.datamation.com/servr/12mpp.html, December 1996.
[11] A. Datta, B. Moon, and H. Thomas. A case for parallelism in data warehousing and OLAP. In the 9th International Workshop on Database and Expert Systems Applications (DEXA 98), pages 226-231, August 1998.
[12] A. Datta, K. Ramamritham, and H. Thomas. Curio: A novel solution for efficient storage and indexing in data warehouses. Proceedings of the International Conference on Very Large Databases, pages 730-733, September 1999.
[13] J. M. Firestone. Data warehouses and data marts: A dynamic view. White Paper 3, Executive Information Systems, Inc., March 1997.
[14] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. Data Engineering Bulletin, 18(2):3-18, June 1995.
[15] H. Gupta, V. Harinarayan, A. Rajaraman, and J. Ullman. Index selection for OLAP. Proceedings of the International Conference on Data Engineering (ICDE), pages 208-219, April 1997.
[16] R. Kimball. The Data Warehouse Toolkit. John Wiley & Sons, 1996.
[17] H. Lei and K. A. Ross. Faster joins, self-joins and multi-way joins using join indices. Data and Knowledge Engineering, 28(3):277-298, November 1998.
[18] A. Y. Noaman and K. Barker. A horizontal fragmentation algorithm for the fact relation in a distributed data warehouse. In the 8th International Conference on Information and Knowledge Management (CIKM'99), pages 154-161, November 1999.
[19] P. O'Neil and D. Quass. Improved query performance with variant indexes. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 38-49, May 1997.
[20] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1991.
[21] R. Ramakrishnan. Database Management Systems. WCB/McGraw Hill, 1998.
[22] Red Brick Systems. Star schema processing for complex queries. White Paper, July 1997.
