Integrating heterogeneous data warehouses using XML technologies

Journal of Information Science http://jis.sagepub.com

Integrating heterogeneous data warehouses using XML technologies Frank S.C. Tseng and Chia-Wei Chen Journal of Information Science 2005; 31; 209 DOI: 10.1177/0165551505052467 The online version of this article can be found at: http://jis.sagepub.com/cgi/content/abstract/31/3/209

Published by: http://www.sagepublications.com

On behalf of:

Chartered Institute of Library and Information Professionals


Downloaded from http://jis.sagepub.com at PENNSYLVANIA STATE UNIV on February 7, 2008 © 2005 Chartered Institute of Library and Information Professionals. All rights reserved. Not for commercial use or unauthorized distribution.

Integrating heterogeneous data warehouses using XML technologies

Frank S.C. Tseng and Chia-Wei Chen
Department of Information Management, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, ROC

Received 28 August 2004
Revised 12 January 2005

Abstract. Data warehousing has been widely adopted by contemporary enterprises. For inter-organizational information sharing, research on the integration of heterogeneous data warehouses is urgently needed, together with a systematic methodology for carrying out that integration via the Internet or proprietary extranets. Traditionally, researchers have employed a canonical format as the medium for logical data integration among heterogeneous systems. In this paper, to fully utilize the power of the Internet, we propose a framework and develop a prototype to integrate heterogeneous data warehouses by XML technologies. We first formally define the elements of data warehousing and discuss the semantic conflicts that occur among heterogeneous data cubes. Then we propose the system architecture and the resolution procedures for each kind of semantic conflict. For local data cubes with different schemas, we define a global XML Schema to integrate the local cube structures, and transform each local cube into an XML document conforming to the global XML Schema. The XML documents obtained from the local cubes are then manipulated by pre-defined XQuery commands to form a unified XML document, which can be regarded as the global cube. The integrated global cube can be easily stored and manipulated in native XML databases. The proposed methodology enables global users to browse the global cube, or pose multi-dimensional expressions (MDX) on it, and obtain results in the same way as they do locally.

Correspondence to: Department of Information Management, National Kaohsiung First University of Science and Technology, 1 University Rd, YenChao, Kaohsiung 824, Taiwan, ROC. E-mail: [email protected]

Keywords: heterogeneous data warehouse; XML; XML schema; XQuery; multi-dimensional expression

1. Introduction

Recently, data warehousing has played a significant role in supporting effective decision making for enterprises [1–4]. Contemporary enterprises are striving to provide on-line analytical processing of historical data by employing data warehouses. Thanks to the rapid evolution of communication and database technologies, numerous data warehouse systems have been established and distributed over the Internet. The Web can therefore be employed to expand data warehouses to inter-organizational applications. For instance, they can be applied to analyzing customer Web behavior for Customer Relationship Management (CRM), a promising trend in business affairs. However, the data in different data warehouse systems are usually built and expanded autonomously. In an environment connecting multiple data warehouses, each data warehouse is usually operated autonomously and cannot be accessed in a homogeneous way from the global point of view

Journal of Information Science, 31 (3) 2005, pp. 209–229 © CILIP, DOI: 10.1177/0165551505052467 Downloaded from http://jis.sagepub.com at PENNSYLVANIA STATE UNIV on February 7, 2008 © 2005 Chartered Institute of Library and Information Professionals. All rights reserved. Not for commercial use or unauthorized distribution.


[5, 6]. That is, few of the established data warehouses were designed with future integration with other data warehouse systems in mind. The result is multiple isolated ‘islands’ of data warehouse systems, which cannot interchange data, interoperate, or access each other's data. To fully utilize the power of the Internet, many researchers [5, 7–9] advocate studying the integration of heterogeneous, autonomous, and distributed data warehouse systems by XML technologies, in order to expand data warehouses to intra- or even inter-organizational applications. The reason for using XML technologies is that XML documents are widely used in Web information systems for contemporary applications [10]. For example, Bapst and Vanoirbeek [11] developed an electronic platform for the production of Requests for Proposals, and Royappa [12] implemented a common web-based storefront for a group of merchants in electronic commerce. Moreover, Kimball and Merz [3] advocate the term data webhouse and regard it as a rebirth of the data warehouse. To unify the whole process of database integration and data webhouse presentation, we choose XML as our data formatting medium.

In this paper, we formally define the elements of data warehousing and discuss the conflicts that may arise when cubes from heterogeneous data warehouses are integrated. We also employ a canonical data exchange standard to transform and store the desired data in data warehouses. Based on the analysis of the various conflicts, we propose solutions to resolve the semantic discrepancies. We elaborate on the process of transforming autonomous data cubes into XML documents and provide a general framework that employs XQuery [13] in native XML databases for data warehouse integration. By adopting our approach, users can retrieve multi-dimensional data in a global view for effective on-line analytical processing.

2. Related work

Before discussing the problems of heterogeneous data warehouse integration, we first explore the problems that occur in heterogeneous database integration.

2.1. Heterogeneous database integration

The general approaches in previous work on heterogeneous database integration can be classified as follows:

(1) Global schema approach. Such an approach provides a global schema for the independent databases by integrating their schemas. Dayal and Hwang [14] and Motro [15] adopted this approach based on functional models, while Breitbart et al. [16] and Deen et al. [17] based theirs on relational models. We have also proposed a global schema approach to integrate heterogeneous databases with different schema structures in [18].

(2) Federated approach. Such an approach does not require the creation of a global schema. Instead, for each application, the database administrator creates a schema describing only the data that the application may access in the local databases. Heimbigner and McLeod [19] adopted such an approach, and Hsiao [20, 21] provides a good survey of such approaches.

(3) Multi-database query language approach. Such an approach provides users with a multi-database query language [22]; users refer to the local schemas and pose their queries against them using that language. Czejdo et al. [23] and Litwin et al. [24, 25] fall into this category.

The schema integration process may present a large number of problems caused by semantic discrepancies that stem from the design autonomy of each participant. Fortunately, Lee et al. [26] have proposed a very concise classification scheme for the conflicts occurring in heterogeneous databases, as follows.

(1) Value-to-Value conflicts. These conflicts occur when databases use different representations for the same data. This type of conflict can be further divided into data representation conflicts, data scaling conflicts, and inconsistent data.

(2) Value-to-Attribute conflicts. These conflicts occur when the same information is expressed as attribute values in one database and as an attribute name in another database.

(3) Value-to-Table conflicts. These conflicts occur when the attribute values in one database are expressed as table names in another database.

(4) Attribute-to-Attribute conflicts. These occur when semantically related data items are named differently or semantically unrelated data items are named equivalently. The former case is also called synonyms and the latter case homonyms [27]. Some classification schemes call both cases naming conflicts.

(5) Attribute-to-Table conflicts. These conflicts occur if an attribute name of a table in one database is represented as a table name in another database.


(6) Table-to-Table conflicts. These conflicts occur when information in a set of semantically equivalent tables is represented in a different number of tables in another database. When integrating relations with such conflicts into a global relation, null values are usually generated. This phenomenon is also called missing data.

Based on a similar classification scheme, we will adopt the federated approach to integrate heterogeneous data warehouses and classify the conflicts among heterogeneous data warehouses. As XML Schema has been successfully adopted in [28] for resolving the conflicts in heterogeneous database systems, we will use XML as the canonical standard to store data derived from local data cubes and use XML Schema to define the global schema.

2.2. Related work on heterogeneous data warehouse integration

In [7], Golfarelli et al. pointed out that, owing to the popularity of XML, the information required for enterprise decision-making comes not only from data warehouses but also from XML documents. They advocate using XML documents as sources for data warehouses and propose a semi-automatic approach for building the conceptual schema of a data mart starting from XML sources.

Bruckner et al. [29] utilize XTM (XML Topic Maps) [30] to build a framework for integrating the information stored in distributed data warehouses. The framework addresses the semantic integration problem by describing the mapping between topic maps of local data warehouse resources and integrating them into global topic maps. The authors indicate four conflicts that have to be dealt with while integrating different data warehouses:

(1) Dimension hierarchies or levels with the same name, but different resources (identical dimension schemas).

(2) Different levels of detail for equivalent dimensions.

(3) Dimension levels with the same name but different meanings.

(4) Different dimension hierarchy names but the same meaning.

In [9], Pedersen et al. presented a theoretically well-founded approach to the logical federation of On-line Analytical Processing (OLAP) and XML data sources. The approach allows external XML data to be used as virtual dimensions, enabling three specific uses of XML data:

(1) OLAP query results may be decorated with XML data.

(2) External XML data may be used for selection.

(3) OLAP data may be grouped by external XML data when aggregation is performed.

In this approach, the authors make no assumptions about the existence of Document Type Definitions (DTDs) or XML Schema, and almost all data sources can be efficiently wrapped in XML format [31].

Another approach using XML to store fact data and metadata in a data warehouse is XCube [32]. It consists of XCubeSchema, XCubeDimension and XCubeFact, where XCubeSchema holds the multi-dimensional schema, XCubeDimension holds the hierarchical structure of the dimensions, and XCubeFact contains the fact data (i.e. the cells of a cube). Although it is proposed as a family of XML-based document templates for exchanging data warehouse data, the details of resolving semantic conflicts are not elaborated.

To resolve the problem of heterogeneous data warehouse integration, Mangisengi et al. [5, 6] have proposed a framework that provides interoperability of distributed, heterogeneous, and autonomous data warehouses based on a federated approach. The benefit of this work is that it preserves the autonomy of the particular data warehouses and their applications, and users outside the framework need not be aware of how many data warehouses exist under it. When a user poses a global query on the system, the system decomposes the global query and sends the resulting sub-queries to the mediators. The mediators then send those sub-queries to the corresponding local data warehouses. After the local data warehouses have processed the queries, they send the query results to the mediators. All mediators send the local query results to the federated layer, which integrates the result for users. However, this approach suffers from the complexity of the mediators and of the communication mechanism among them. It may place a heavy load on each local data warehouse and on the federated component of the framework. If users pose the same query at different times, the results must be recomputed. Besides, the schema conflicts or integration problems occurring in multiple data warehouses are not well addressed.

In our work, to alleviate the shortcomings of the framework developed by Mangisengi et al. [5, 6], we will combine the above approaches and propose our architecture to integrate heterogeneous data warehouses. We use XML as the canonical standard to store data derived from data cubes and use XML Schema [33]


to define the global cube schema. All the local cube data will be extracted and integrated according to the pre-defined global cube schema. Users can then pose their queries on the global cube to obtain a query result. Our approach does not require any mediator and thus reduces the complexity of the system. Users can pose queries on the XML documents over the Internet and obtain the global query results without disturbing local data warehouse operations. In this way our work overcomes the drawbacks of the framework proposed by Mangisengi et al. [5, 6] and avoids performance degradation in local cube query processing.

Our approach extends the idea of XCube to define an XML Schema for describing the schema of the XML documents that store the data cube metadata and fact data. We then explain how to integrate multiple data cubes from these XML documents by XQuery [13, 34, 35], which has been found to be feasible in [36] and [37].

The rest of this paper is organized as follows. In Section 3, we formally define the elements of data warehousing and classify the types of conflict that arise when integrating cubes from different data warehouses. In Section 4, we propose our system architecture and describe our cube representation by XML. The resolutions for all kinds of semantic conflicts are presented in Section 5; each resolution is materialized by a corresponding XQuery that resolves the semantic discrepancies among the XML documents generated from local cubes. In Section 6, we present some examples to illustrate our approach. Finally, we conclude and propose some future directions in Section 7.

3. The problems of heterogeneous data cube integration

3.1. Formal definitions of the elements for data warehousing

In the following, we give definitions of dimension, hierarchy of a dimension, and cube for data warehousing.

Definition 1: a dimension $D$ is a tree structure of $h$ levels, $h \geq 1$, which is used to represent the hierarchical relationships among a set of terms. A node in a dimension $D$ is called a member, and each internal node contains a special child called the summary member, denoted ‘*’, which denotes the total concept of the other children of the internal node.

Definition 2: for a dimension $D$, the $i$-th level member set, denoted $D(i)$, is defined as $D(i) = \{a \mid a$ is a member in the $i$-th level of $D$, but $a$ is not a summary member$\}$. Also, we use $D(0)$ to denote the set of all non-summary members in $D$, which is the union of all $i$-th level member sets in $D$. That is, $D(0) = \bigcup_{1 \leq i \leq h} D(i)$, where $h$ is the height of $D$. In practice, each $D(i)$ has a specific name, which will be called the $i$-th level name.

Definition 3: for a dimension $D$, the hierarchy of $D$, denoted $H_D = L_1 \rightarrow L_2 \rightarrow \cdots \rightarrow L_i \rightarrow \cdots \rightarrow L_h$, is the metadata of $D$, such that $L_i = D(i)$, where $L_i$ is the level name of the $i$-th level.

Practically, a dimension can be constructed from a relational table, where each level corresponds to an attribute in the relation and the attribute names are usually used as the corresponding level names. To illustrate the above definitions, we give an example as follows.

Example 1. Suppose there is a relation Region representing the regions of Taiwan, as shown in Table 1. This relation can be used to construct a dimension, denoted R and depicted in Figure 1, where the first level corresponds to the dimension itself (commonly denoted ‘(All Region)’), and the second and third levels are derived from the attributes Location and City, respectively. All nodes with label ‘*’ are summary members. That is, the summary member in the second level has the same meaning as all regions in Taiwan, which represents {South, North}. Besides, the summary

Table 1. A relation Region for constructing dimension R.

Region
...    Location    City         ...
...    North       Taipei       ...
...    North       Taoyun       ...
...    North       Hsinchu      ...
...    South       Tainan       ...
...    South       Kaohsiung    ...
...    South       Pingtong     ...


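[Figure: tree of dimension R. Level 1: (All Region). Level 2: North, South, and a summary member ‘*’. Level 3: Taipei, Taoyun, Hsinchu under North and Tainan, Kaohsiung, Pingtong under South, each group with its own summary member ‘*’.]

Fig. 1. An illustration of dimension R.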

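[Figure: tree of dimension R without summary members. Level 1 (level name All Region): (All Region). Level 2 (level name Location): North, South. Level 3 (level name City): Taipei, Taoyun, Hsinchu, Tainan, Kaohsiung, Pingtong.]

Fig. 2. A concise illustration of dimension R.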

[Figure: tree of dimension P without summary members. Level 1: (All Product). Level 2: Appliance, Communication, Computer. Level 3: TV, Refrigerator, Cellular Phone, Radio, Monitor, Printer.]

Fig. 3. A concise illustration of dimension P.


members under South and North respectively have the same meaning as South and North, which denote {Tainan, Kaohsiung, Pingtong} and {Taipei, Taoyun, Hsinchu}, respectively. Figure 1 is redrawn in Figure 2, omitting all the summary members. According to the illustration of dimension R, we know that $R(1) = \{$(All Region)$\}$, $R(2) = \{$South, North$\}$, $R(3) = \{$Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu$\}$, and $R(0) = \{$(All Region), South, North, Tainan, Kaohsiung, Pingtong, Taipei, Taoyun, Hsinchu$\}$.

In Figure 3, another dimension, denoted P, representing the products of a company manufacturing consumer electronics, is concisely depicted. Both dimensions will be used in the following examples.

For a dimension $D$, there are two basic operations called drill-down and roll-up, which are formally defined as follows.

Definition 4: for a dimension $D$, expanding an internal node to obtain all of its children is called drill-down, and shrinking a set of children to obtain their common parent is called roll-up.

The basic component of a data cube is called a cell, which is defined as follows.

Definition 5: a cell defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$ is denoted $c = (t_c, M_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, $c_i \in D_i(0) \cup \{$‘*’$\}$, $1 \leq i \leq n$, and $M_c = (m_1, m_2, \ldots, m_j, \ldots, m_k)$ is a tuple consisting of measures derived from the fact table based on $t_c$.

Definition 6: a cell $c = (t_c, M_c)$, where $t_c = (c_1, c_2, \ldots, c_i, \ldots, c_n)$, defined on $n$ dimensions $(D_1, D_2, \ldots, D_n)$ is called an $m$-d cell, $0 \leq m \leq n$, if and only if there are exactly $m$ non-summary members $c_i$ (i.e. $c_i \neq$ ‘*’). If $m = n$ and $c_i \in D_i(h_i)$, where $h_i$ is the height of $D_i$, for all $1 \leq i \leq n$, then $c$ is also called a base cell; otherwise $c$ is called a non-base cell.

Definition 7: a data cube $C = (M, D_1, D_2, \ldots, D_n)$ is a cube composed of all cells $c_i = (t_{c_i}, M_{c_i})$ with $t_{c_i} \in D_1(0) \times D_2(0) \times \cdots \times D_n(0)$, where $M = \bigcup \{M_{c_i}\}$ is called the measure dimension of $C$.

For a data cube $C$, we say that $C$ has $k$ measures if each $M_{c_i}$ is a $k$-tuple $(m_1, m_2, \ldots, m_j, \ldots, m_k)$. Besides, as $M$ is a set consisting of $k$-tuples, it can be regarded as a flat relation with $k$ attributes, denoted $M(A_1, A_2, \ldots, A_k)$, where each attribute corresponds to exactly one measure in $C$. In the following, we will use $M$ or $M(A_1, A_2, \ldots, A_k)$ interchangeably to indicate the measure dimension of a cube, which has $k$ measures $A_1, A_2, \ldots, A_k$.
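Definitions 1 to 4 can be made concrete with a small sketch. The Python below is only an illustration and not part of the proposed system: the `Member` class and the helper names `level_sets`, `drill_down`, and `roll_up` are hypothetical, and summary members are left implicit rather than stored as nodes. It builds dimension R from Example 1 and computes its level member sets.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """A node of a dimension tree; summary members '*' are implicit, not stored."""
    name: str
    children: list["Member"] = field(default_factory=list)


def level_sets(root):
    """Return {i: D(i)} for i = 1..h by walking the tree level by level."""
    levels, frontier, depth = {}, [root], 1
    while frontier:
        levels[depth] = {m.name for m in frontier}
        frontier = [c for m in frontier for c in m.children]
        depth += 1
    return levels


def find(node, name):
    """Depth-first search for the member called `name`; None if absent."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def drill_down(root, name):
    """Expand an internal node into the names of its children (Definition 4)."""
    node = find(root, name)
    return [c.name for c in node.children] if node else []


def roll_up(root, names):
    """Shrink a set of sibling members to their common parent (Definition 4)."""
    wanted, stack = set(names), [root]
    while stack:
        node = stack.pop()
        if wanted <= {c.name for c in node.children}:
            return node.name
        stack.extend(node.children)
    return None


# Dimension R from Example 1.
R = Member("(All Region)", [
    Member("North", [Member("Taipei"), Member("Taoyun"), Member("Hsinchu")]),
    Member("South", [Member("Tainan"), Member("Kaohsiung"), Member("Pingtong")]),
])

levels = level_sets(R)               # levels[i] plays the role of R(i)
d0 = set().union(*levels.values())   # R(0): union of all level member sets
```

Here `levels[2]` corresponds to R(2) = {South, North}, and `roll_up(R, ["Taipei", "Hsinchu"])` yields the parent member North.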

[Figure: a three-dimensional data cube with axes time, region, and product, in which one base cell and one non-base cell are marked.]

Fig. 4. A sample illustration of the elements in a data cube.
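The distinction between m-d cells, base cells, and non-base cells in Definitions 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the lowest-level member sets of R and P are taken from Figures 2 and 3, and the lowest-level members of the time dimension T are invented month labels, since the text does not list them.

```python
SUMMARY = "*"  # the summary member of Definition 1


def cell_degree(coords):
    """Return m, the number of non-summary coordinates, so the cell is an m-d cell."""
    return sum(1 for c in coords if c != SUMMARY)


def is_base_cell(coords, leaf_sets):
    """A base cell has every coordinate drawn from the lowest level D_i(h_i)."""
    return len(coords) == len(leaf_sets) and all(
        c in leaves for c, leaves in zip(coords, leaf_sets)
    )


# Lowest-level member sets for a cube C = (M, R, P, T).
R_LEAVES = {"Taipei", "Taoyun", "Hsinchu", "Tainan", "Kaohsiung", "Pingtong"}
P_LEAVES = {"TV", "Refrigerator", "Cellular Phone", "Radio", "Monitor", "Printer"}
T_LEAVES = {"1999-02", "1999-05", "1999-11"}  # hypothetical month members
LEAVES = [R_LEAVES, P_LEAVES, T_LEAVES]

base = ("Taipei", "TV", "1999-02")      # 3-d cell at the lowest level everywhere
rolled = ("North", SUMMARY, "1999-02")  # 2-d cell: product is summarized

degrees = (cell_degree(base), cell_degree(rolled))
flags = (is_base_cell(base, LEAVES), is_base_cell(rolled, LEAVES))
```

Note that a cell can have m = n and still be a non-base cell, for example ("North", "TV", "1999-02"), because North is not a lowest-level member of R.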

A sample illustration of a cube $C = (M, R, P, T)$ is shown in Figure 4, where R and P represent the aforementioned dimensions region and product, respectively. In addition, we assume T is a dimension representing time.

3.2. Types of semantic conflicts in heterogeneous data cube integration

First of all, we define the concept of semantically related data cubes as follows.

Definition 8: for two data cubes $C_A = (M_A, D_1^A, D_2^A, \ldots, D_n^A)$ and $C_B = (M_B, D_1^B, D_2^B, \ldots, D_n^B)$, where $M_A = M_A(A_1, A_2, \ldots, A_k)$ and $M_B = M_B(B_1, B_2, \ldots, B_k)$ are sets of $k$ measures, we say that $C_A$ and $C_B$ are semantically related if $D_i^A$ and $D_i^B$ are semantically related for all $1 \leq i \leq n$, and the corresponding measures $A_j$ and $B_j$, $1 \leq j \leq k$, in $M_A$ and $M_B$ are semantically related, respectively.

Note that in Definition 8 we can assume that both $C_A$ and $C_B$ have $n$ dimensions without loss of generality: if $C_A$ has $m$ dimensions but $C_B$ has $n$ dimensions, where $m < n$, then we can add $(n - m)$ dimensions to $C_A$, such that each added dimension contains only a summary member ‘*’, which represents ‘(All $D_i^A$)’, $(m + 1) \leq i \leq n$.
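The padding argument that follows Definition 8 can be illustrated with a short sketch. It assumes, purely for illustration, that a cube is held as a Python dictionary mapping coordinate tuples to measure tuples; the function name `pad_cells` and the sample cells are hypothetical.

```python
SUMMARY = "*"  # the only member of each added dimension


def pad_cells(cells, extra_dims):
    """Append `extra_dims` summary coordinates to every cell of a cube.

    `cells` maps a coordinate tuple t_c to its measure tuple M_c.
    """
    if extra_dims < 0:
        raise ValueError("extra_dims must be non-negative")
    return {coords + (SUMMARY,) * extra_dims: measures
            for coords, measures in cells.items()}


# A cube with m = 2 dimensions (region, product) and one measure.
cube_a = {
    ("Taipei", "TV"): (120,),
    ("Tainan", "Radio"): (45,),
}

# Align it with a cube that has n = 3 dimensions (region, product, time).
padded_a = pad_cells(cube_a, extra_dims=3 - 2)
```

After padding, every cell of `cube_a` carries ‘*’ in the time position, so both cubes can be compared dimension by dimension as Definition 8 requires.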


For semantically related cubes created from local sites with possibly different dimensional models, we classify the possible semantic conflicts as follows.

(1) Cube-to-cube conflicts: these conflicts occur when semantically related local cubes are created by different dimensional models. For example, a cube C1 may be modeled in star schema while another cube C2 is constructed in snowflake schema [4].

(2) Dimension-to-dimension conflicts: these conflicts occur when the dimension schema structures, dimension members, or the naming of semantically related dimensions have semantic discrepancies. Such conflicts can be further classified into the following sub-categories:

(a) Dimension schema conflicts: such conflicts occur when two data cubes have different dimension hierarchies, with possibly different dimension levels.

(b) Dimension member conflicts: such conflicts occur when two cubes have mismatched members, which correspond to the same level in their semantically related dimensions.

(c) Naming conflicts: such conflicts occur when two dimensions in local cubes C1 and C2 have mismatched names. We further distinguish the following sub-subcategories:

(i) Dimension naming conflicts: such conflicts occur when two local dimensions have mismatched dimension names.

(ii) Level naming conflicts: such conflicts occur when two local dimensions have mismatched level names.

(3) Measure-to-measure conflicts: these conflicts occur when the measures in different cubes have different names, different values (inconsistent measures), different formats, or even different units. Such conflicts can be further classified into three categories:

(a) Measure naming conflicts: such conflicts occur when two local cubes have semantically related measures with mismatched names.

(b) Inconsistent measures: such conflicts occur when two local cubes have semantically related measures with mismatched values.

(c) Measure scaling conflicts: such conflicts occur when two local cubes have semantically related measures with mismatched scales.

We illustrate our classification scheme in Figure 5. Notice that, since the dimensions as well as the measure dimension are all organized as independent objects, it is impossible for cube-to-dimension, cube-to-measure, or even dimension-to-measure conflicts to occur between two semantically related local cubes.

In the following, we use the example cubes C1 and C2, shown in Figure 6(a) and (b) respectively, to describe all types of conflicts. The tables listed after each cube's dimensional model are the sample data used to construct that cube.

[Figure: classification tree rooted at ‘Semantic Conflicts in Heterogeneous Data Cubes’ with three branches: Cube-to-Cube Conflicts; Dimension-to-Dimension Conflicts (Dimension Schema Conflicts, Dimension Member Conflicts, Naming Conflicts); and Measure-to-Measure Conflicts (Measure Naming Conflicts, Inconsistent Measures, Measure Scaling Conflicts).]

Fig. 5. A classification scheme of the semantic conflicts between local cubes.


(a) Cube C1

Dimension hierarchies: Time: Year | Quarter | Month. Bookstores: City | StoreName. Books: Publisher | Category | Name.

Tables: Time_D (TID, Year, Quarter, Month); Bookstores_D (SID, City, StoreName); Books_D (BID, Publisher, Category, Bookname); Sales_Fact (TID, SID, BID, Sales (NT$), Cost (NT$), Quantity).

Time_D
TID    Year    Quarter    Month
1      1999    1          2
2      1999    2          5
3      1999    4          11

Bookstores_D
SID    City      StoreName
1      Taipei    Elite Store
2      Tokyo     Oriental Store
3      Seoul     Liberty Store

Sales_Fact
TID    SID    BID    Sales    Cost    Quantity
1      3      2      3600     2800    30
2      1      3      4000     3200    40
3      2      1      4200     3400    21

(b) Cube C2

Dimension hierarchies: Time: Year | Month. Bookstores: Nation | StoreName. Books: Publisher | Category | Name.

Tables: Time_D (TID, Year, Month); Bookstores_D (SID, Nation, StoreName); Books_D (BID, CID, Publisher, Bookname); Category_D (CID, Category); Profits_Fact (TID, SID, BID, Sales (US$), Cost (US$), Amount).

Time_D
TID    Year    Month
1      1999    3
2      1999    7
3      1999    12

Bookstores_D
SID    Nation    StoreName
1      ROC       Elite Store
2      Japan     Oriental Store
3      Korea     Liberty Store

Profits_Fact
TID    SID    BID    Sales    Cost    Amount
1      1      3      150      125     50
2      3      1      120      108     36
3      2      2      160      120     40

Fig. 6. The dimensional model of (a) cube C1 and (b) cube C2 and its sample data.

Example 2. For the cubes depicted in Figure 6, we observe that cubes C1 and C2 have the following semantic conflicts:

(1) Cube-to-cube conflicts: in Figure 6, cubes C1 and C2 are two semantically related local cubes. However, cube C1 was modeled in star schema, whereas cube C2 was constructed in snowflake schema [4].

(2) Dimension-to-dimension conflicts: there are several dimension-to-dimension conflicts between cubes C1 and C2:

(a) Dimension schema conflict: there is a dimension schema conflict between the dimensions Time in cubes C1 and C2, since the level hierarchy of dimension Time in cube C1 is Year $\rightarrow$ Quarter $\rightarrow$ Month, but in cube C2 it is Year $\rightarrow$ Month.

(b) Dimension member conflict: there are two dimension member conflicts between cubes C1 and C2. The first occurs in dimension Bookstores, because the members of its first level in cube C1 stand for cities (e.g. ‘Taipei’, ‘Tokyo’, and ‘Seoul’), but the members of its first level in cube C2 represent nations (e.g. ‘ROC’, ‘Japan’, and ‘Korea’). The second occurs in dimension Books, because the members of Books.Publisher in cube C1 use the full name of a publisher (e.g. ‘Ten Shia Publisher’ and ‘Central Publisher’), but the members of Books.Publisher in cube C2 store the publisher name only (e.g. ‘Ten Shia’ and ‘Central’).

(c) Naming conflicts: there is no level naming conflict between the dimensions Bookstores in cubes C1 and C2, respectively.

(3) Measure-to-measure conflicts: there is a measure naming conflict between cubes C1 and C2, since the measure name Quantity in cube C1 corresponds to the measure name Amount in cube C2. Besides, there is also a measure scaling conflict between cubes C1 and C2, since the measures Sales and Cost in cube C1 use ‘NT$’ as the currency, but they are represented in ‘US$’ in cube C2.
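Two of the conflicts in Example 2 lend themselves to a short sketch: detecting the dimension schema conflict on Time, and mapping a fact row that uses C2's measure names and currency onto C1's. The code is only an illustration and is not the XQuery-based resolution described later in the paper. The function names are hypothetical, the choice of C1's names and NT$ as the common target is made only for this example, and the exchange rate of 32 NT$ per US$ is an invented figure.

```python
NT_PER_USD = 32.0  # hypothetical exchange rate, not taken from the paper

MEASURE_RENAMES = {"Amount": "Quantity"}  # C2 measure name -> C1 measure name
SCALED_MEASURES = {"Sales", "Cost"}       # measures stored in US$ in C2


def missing_levels(hierarchy_a, hierarchy_b):
    """Levels of hierarchy_a absent from hierarchy_b (a dimension schema conflict)."""
    return [level for level in hierarchy_a if level not in hierarchy_b]


def to_common_measures(row, rate=NT_PER_USD):
    """Resolve measure naming and scaling conflicts for one C2 fact row."""
    resolved = {}
    for name, value in row.items():
        if name in SCALED_MEASURES:
            value = value * rate  # US$ -> NT$
        resolved[MEASURE_RENAMES.get(name, name)] = value
    return resolved


time_c1 = ["Year", "Quarter", "Month"]
time_c2 = ["Year", "Month"]
schema_gap = missing_levels(time_c1, time_c2)

c2_row = {"Sales": 150, "Cost": 125, "Amount": 50}  # illustrative measure values
common_row = to_common_measures(c2_row)
```

In this sketch `schema_gap` reports that the Quarter level is missing from C2's Time hierarchy, and `common_row` carries the measure name Quantity with Sales and Cost expressed in NT$ at the assumed rate.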


4. The solutions for heterogeneous data cube integration

4.1. The general architecture to resolve semantic conflicts between cubes

The general architecture to resolve semantic conflicts between local cubes is depicted in Figure 7. Each local cube C is supposed to be composed of two parts, namely the cube metadata and the fact data, which will be transformed into XML formats and denoted XMLM(C) and XMLF(C), respectively. In addition, the site manager should analyze the semantic conflicts between the global and local cubes and provide the site metadata (to be discussed in Section 4.3) for heterogeneous cube integration. In our approach, the general process of integrating heterogeneous data cubes can be divided into the following steps:

(1) For two local cubes C1 and C2, we retrieve the cube metadata (i.e. the cube schemas) and fact data, and transform them into the corresponding XML documents, say XMLM(C1) and XMLF(C1), and XMLM(C2) and XMLF(C2), respectively, which will be stored in a native XML database (in our testbed, Tamino [38]).

(2) Then, according to the site metadata and the cube metadata stored in XML documents, the corresponding XQuery to resolve each specific conflict will be applied to update XMLM(C1) and XMLF(C1) (and/or XMLM(C2) and XMLF(C2)) and accordingly to transform each local cube into a structure conforming to the global cube structure.

(3) Finally, we integrate them into a global cube using XQuery, for users to pose their queries on the global cube.

To represent cube metadata and fact data in XML format, we have to define the XML representations conforming to cube metadata and the general structure of a cube. In the following, we will define two XML Schemas, one to represent the metadata of a data cube and the other to express the cube fact data.

Fig. 7. The system architecture for integrating heterogeneous data warehouses. (Diagram: Sites A, B and C each hold local data cubes with cube metadata, a native XML database storing XMLM and XMLF, and site metadata; XQuery statements pass the XML data to the global site, where the global cube is integrated for global users.)

4.2. Cube representation by XML

Although the CWM (Common Warehouse Meta-model) developed by OMG [39] has defined the metadata of data warehouses, the fact data and level information are completely left out in CWM (this has been pointed out in [32]). Furthermore, [40] and [41] used XML DTD to describe the schema of a data cube. However, they employ an element name to describe the cube name, measure name, dimension name, and the dimension members. Their approaches may lack flexibility because each data cube then needs its own XML DTD, and various data types cannot be described through DTD. As XML Schema supports various data types and has more advantages than DTD, we employ XML Schema to define our target. As our objective is to derive data from a data cube in a data warehouse and store them in an XML document, the defined XML document should contain the cube's metadata and fact data. The XML Schema we defined contains the following two parts:

Part 1. Metadata of cubes: this part includes the metadata of cubes. We depict its composition in Figure 8. In Figure 8, we observe that there are two elements, i.e. Measure and Dimension, to be defined in the metadata of cubes, namely CubeMeta. We list the XML Schema definition of CubeMeta in Figure 9 and explain the elements as follows:
(1) Measure is used to represent all measures in a data cube. This element appears at least once and contains only one attribute, Name, which stands for the measure name.
(2) Dimension is used to represent all of the dimensions in a data cube. The element contains one attribute, Name, which represents the corresponding dimension name, and one sub-element, Level, which sequentially contains the level names of the level hierarchy defined for the dimension. Level also contains two attributes, namely Name and Hierarchy, which respectively record the level name and the hierarchy level number, and a sub-element, Member, which contains all of its members. The attribute Hierarchy is a numeric value representing the level number. Member has an attribute UpperLevel used for indicating the parent level number.

Part 2. Fact data of cubes: this part describes all cells of a cube, i.e. the fact data of a cube. We depict its composition in Figure 10. In Figure 10, we observe that there is only one element, Cell, defined in the fact data of cubes, namely CubeFact. This is because a cube contains at least one measure and is formed by base cells defined on all of the involved dimensions. Base cells under some specific levels in different dimensions can be joined together and regarded as a non-base cell, as we have defined in Definition 6. Therefore, a cell is defined through the involved dimensions, together with the level names and level numbers on which the cell resides in the corresponding dimensions. That is, there are three attributes, namely Dimension, LevelName and Hierarchy, defined for Member, where the first represents the involved dimension, and the second and third respectively describe the level name and level number on which the cell resides in that dimension. Based on the above discussion, we list the XML Schema definition of CubeFact in Figure 11 and present an example to illustrate the above XML coding schemes.

Fig. 8. XML Schema for representing the metadata of data cubes. (Diagram: the element CubeMeta, with attribute CubeName, contains one or more Measure elements and Dimension elements, each with an attribute Name; each Dimension contains one or more Level elements with attributes Name and Hierarchy, and each Level lists its members, which carry the attribute UpperLevel.)

Fig. 9. XML Schema definition of CubeMeta.

Example 3. Suppose there is a cube composed of two dimensions, Region and Product, as depicted in Figure 12, and a measure Total Price. Then, the cube metadata is as Figure 13 shows. In Figure 14, we also list an example XML document that represents the data generated from the cube.

Our integration process utilizes the semantic knowledge of all local cubes in each participating site. We call this semantic knowledge site metadata, which should be prepared or discovered before integration.
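As a rough illustration of the two-part representation described in Section 4.2, the following Python sketch builds a CubeMeta and a CubeFact document for part of Example 3 using the standard library's xml.etree.ElementTree. The element and attribute names follow the text; the cube name "Example", the level numbers, the value "All Region" used as UpperLevel of the top members, and the restriction to the Region dimension are assumptions made for brevity and are not taken from the schema definitions of Figures 9 and 11.

```python
import xml.etree.ElementTree as ET

# Region -> City members of Example 3 (Figure 12a).
REGION = {"North": ["Taipei", "Hsinchu"], "South": ["Tainan", "Kaohsiung"]}


def build_cube_meta() -> ET.Element:
    """Build the metadata part (CubeMeta) for the Region dimension only."""
    meta = ET.Element("CubeMeta", CubeName="Example")  # cube name is assumed
    ET.SubElement(meta, "Measure", Name="Total Price")
    dim = ET.SubElement(meta, "Dimension", Name="Region")
    # Level numbers 1 and 2 are an assumed numbering of the hierarchy.
    region_level = ET.SubElement(dim, "Level", Name="Region", Hierarchy="1")
    city_level = ET.SubElement(dim, "Level", Name="City", Hierarchy="2")
    for region, cities in REGION.items():
        ET.SubElement(region_level, "Member", UpperLevel="All Region").text = region
        for city in cities:
            ET.SubElement(city_level, "Member", UpperLevel=region).text = city
    return meta


def build_cube_fact(cells) -> ET.Element:
    """Build the fact part (CubeFact) from (member, level, level number, value) tuples."""
    fact = ET.Element("CubeFact", CubeName="Example")
    for member, level, number, value in cells:
        cell = ET.SubElement(fact, "Cell")
        m = ET.SubElement(cell, "Member", Dimension="Region",
                          LevelName=f"[Region].[{level}]", Hierarchy=str(number))
        m.text = member
        ET.SubElement(cell, "Measure", Name="Total Price").text = str(value)
    return fact


meta = build_cube_meta()
# Values taken from Figure 14 (cells aggregated over all products).
fact = build_cube_fact([("North", "Region", 1, 66300),
                        ("Taipei", "City", 2, 34200),
                        ("Hsinchu", "City", 2, 32100)])
print(ET.tostring(meta, encoding="unicode"))
print(ET.tostring(fact, encoding="unicode"))
```

Keeping metadata and fact data in two separate documents mirrors the XMLM(C) and XMLF(C) split of Section 4.1, so that conflict resolution can update either part independently.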

Fig. 10. XML Schema for representing the fact data of data cubes. (Diagram: the element CubeFact, with attribute CubeName, contains one or more Cell elements; each Cell contains one or more Member elements and one or more Measure elements.)

Fig. 11. XML Schema definition of CubeFact.

Fig. 12. The dimensions Region and Product used in the example.
(a) Dimension Region, with levels All, Region and City: North (Taipei, Hsinchu) and South (Tainan, Kaohsiung).
(b) Dimension Product, with levels All, Category and Name: Computer (Monitor, Printer) and Communication (Cellular Phone, Radio).

Fig. 13. The XML Schema of the example cube Metadata. (Members listed: All Region; North, South; Taipei, Hsinchu, Tainan, Kaohsiung; All Product; Computer, Communication; Monitor, Printer, Cellular Phone, Radio.)

Fig. 14. An example XML document generated from the example cube. Cells listed (Region member, Product member, Total Price):
All Region, All Product, 101,100
North, All Product, 66,300
Hsinchu, All Product, 32,100
Taipei, All Product, 34,200
South, All Product, 34,800
Kaohsiung, All Product, 21,600
Tainan, All Product, 13,200
...
Taipei, Computer, 12,600
...
Tainan, Monitor, 3,600

4.3. Site metadata

For a local cube C = (M, D1, D2, ..., Dn) in each participating site, the site metadata should consist of the following components:

(1) The domain of each level in each dimension: that is, the i-th level member sets defined in Definition 2.
(2) The domain of each measure: for a measure dimension M(A1, A2, ..., Ak) in cube C, we use Dom(Ai) to denote the domain of each measure Ai.
(3) The semantic description of each measure: this is used to ensure that local autonomous systems agree with each other on the meaning of their exchanged data, which is a labor-intensive task. These semantic descriptions often depend on context information, the database origin, the applications, and so on. Very good work on formulating semantic information to support heterogeneous database integration has been established in [42]. By regarding context information as the metadata, the basic approach is based on the concept of a semantic value, which is defined as a piece of data together with its associated context. To convert a semantic value from one context to another, conversion functions are employed.

To describe a semantic description of a measure Ai in cube C, we employ the following notation adopted from [42]:

Des(Ai) = {S1, S2, ..., Sn}

to denote the necessary descriptions for schema integration and to supply the semantics of the measure, where S1 is called the primary description, which will be denoted Des*(Ai) = S1 in the following. It represents the 'actual' meaning of the i-th measure. When the measure is clear enough to be self-explanatory (i.e. there is no need for a primary description), the primary description can be replaced by an asterisk (*). The other Si's are called the auxiliary descriptions, which will be denoted Des'(Ai) = Des(Ai) − {Des*(Ai)} = {S2, ..., Sn} in the following. They are used to supply auxiliary information on the i-th measure. If there is no need for any primary description or auxiliary description for a measure Ai in cube C, then Des(Ai) = ∅, which can be regarded as an empty set or null. For example, if the actual meaning of a measure Ai is 'amount of money', then there may be an auxiliary description representing 'unit of currency', and Des(Ai) = {'Money', 'Currency'}.
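The notion of a semantic value (a piece of data together with its context) and of conversion functions between contexts can be sketched in Python as follows. This is only an illustration under stated assumptions: the class SemanticValue, the CONVERSIONS table and the function convert are hypothetical names introduced here, and the 35:1 rate is the illustrative NT$/US$ rate used later in Section 6.2.2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class SemanticValue:
    """A measure value together with its descriptions, e.g. 350 'Money' in NT$."""
    value: float
    primary: str              # primary description, e.g. "Money"
    context: Dict[str, str]   # auxiliary descriptions and their current values


# Hypothetical conversion functions, keyed by (auxiliary description, from, to).
CONVERSIONS: Dict[Tuple[str, str, str], Callable[[float], float]] = {
    ("Currency", "NT$", "US$"): lambda v: v / 35,
    ("Currency", "US$", "NT$"): lambda v: v * 35,
}


def convert(sv: SemanticValue, aspect: str, target: str) -> SemanticValue:
    """Return sv re-expressed so that its auxiliary description `aspect` equals `target`."""
    source = sv.context[aspect]
    if source == target:
        return sv
    fn = CONVERSIONS[(aspect, source, target)]
    return SemanticValue(fn(sv.value), sv.primary, {**sv.context, aspect: target})


sales = SemanticValue(350.0, "Money", {"Currency": "NT$"})
print(convert(sales, "Currency", "US$"))
```

Here Des(Ai) = {'Money', 'Currency'} corresponds to the primary field plus the keys of context; the site metadata would supply which conversion applies to each local cube.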

5. Resolving the semantic conflicts between local data cubes

In this section, we discuss our XML-based solutions for resolving the types of semantic conflicts between heterogeneous data cubes. The transformed XML documents will be integrated into a global data cube, which can be manipulated by XQuery commands for query processing.

5.1. Resolving the cube-to-cube conflicts

For two local cubes C1 and C2, if cube C1 is created by a star schema and C2 is generated via a snowflake schema, then we say there is a cube-to-cube conflict between cubes C1 and C2. As our approach derives the metadata and fact data of both local cubes and stores them in XML formats, we can directly transform the derived data from both sides into the same structure. That is, if a dimension was generated from a set of normalized relations, then these relations will be automatically joined together to create the dimension. Therefore, such conflicts will be automatically resolved in the data generation process.

5.2. Resolving the dimension-to-dimension conflicts

Since the dimension-to-dimension conflicts can be further classified into dimension schema conflicts, dimension member conflicts, and naming conflicts, we explain their resolution processes as follows. Note that all insert or update operations in the resolution processes are implemented by pre-defined XQuery statements.

(1) Dimension schema conflicts: as such conflicts occur when two local cubes C1 and C2 have different dimension hierarchies, with possibly different dimension levels, the resolution procedure is as follows.
(a) Suppose the number of levels of a dimension D in C2 is less than that of the corresponding dimension in C1; then the new level definitions and their corresponding members should be inserted into the dimension representation in XMLM(C2) and XMLF(C2), respectively.
(b) The hierarchy of the original levels under the newly added levels will also be updated accordingly.
(c) Update the attribute UpperLevel of the original members under the newly added level in XMLM(C2) accordingly.
(d) Finally, we should insert new cells into cube C2 to contain the derived fact data of the newly added level. The measure values of these cells should be re-aggregated according to the related measure values of existing cells.

(2) Dimension member conflicts: such conflicts occur when two local cubes C1 and C2 have mismatched members, which correspond to the same level in their semantically related dimensions. The conflicts occur because the levels with mismatched members use different granularities. The resolution procedure is as follows.
(a) As the mismatched members use different granularities, we should expand the levels in both dimensions to encompass both granularities. In this step, we may insert new level(s) and their corresponding members into a specific dimension according to the original order in XMLM(C1) or XMLM(C2).
(b) Update the hierarchy of the original levels under the newly added level(s).
(c) Moreover, the attribute UpperLevel of the original members under the newly added level(s) in XMLM(C1) (or XMLM(C2)) should be updated accordingly.
(d) After finishing the above work, we rename the original level names of both dimensions according to the target name.

(3) Naming conflicts: as such conflicts occur when two dimensions in local cubes C1 and C2 have mismatched dimension and/or level names, the resolution procedure is as follows.
(a) Choose the canonical dimension name (and/or level name), which will be used as the global dimension name (and/or level name) for the global cube derived from cubes C1 and C2.
(b) Then translate the local dimension (and/or level) names to agree with the global dimension (and/or level) names.

5.3. Resolving the measure-to-measure conflicts

Since the measure-to-measure conflicts can be further classified into measure naming conflicts, inconsistent measures, and measure scaling conflicts, we explain their resolution processes as follows.

(1) Measure naming conflicts: as such conflicts occur when two local cubes C1 and C2 have semantically related measures with mismatched names, the resolution procedure is as follows.
(a) Choose the canonical measure name, which will be used in the global cube derived from cubes C1 and C2.
(b) Then, translate the local measure names to agree with the global measure names.

(2) Inconsistent measures: such conflicts occur when two local cubes C1 and C2 have semantically related measures with mismatched values. To resolve such conflicts, the concept of partial value [43] can be employed to capture all the values derived from both local cubes. This makes the process of cube integration more complicated; such a case is beyond the scope of this paper, and we will elaborate on such an extension in the future. For the manipulation of partial values, interested readers are referred to [44] and our previous papers [45–49].

(3) Measure scaling conflicts: as such conflicts occur when two local cubes C1 and C2 have semantically related measures with mismatched scales, the resolution procedure is as follows.
(a) Choose the target scale to be used to represent the specific measure of the global cube derived from cubes C1 and C2 and define the corresponding mapping function for translation.
(b) Then, translate the local measure values to the global measure values.

6. An example to illustrate the resolution processes

In this section, based on the sample cubes depicted in Figure 6, we illustrate the process of integrating heterogeneous data cubes. As cube-to-cube conflicts will be automatically resolved in the data generation process (as discussed in the previous section), we will ignore the case of cube-to-cube conflict. That is, the dimension Books in cube C2 will be regarded as a dimension constructed from de-normalizing the relations Books_D and Category_D.

6.1. Resolving the dimension-to-dimension conflicts

There are many dimension-to-dimension conflicts between cubes C1 and C2. We illustrate the resolution process as follows.

6.1.1. Resolution for dimension schema conflicts. As the number of levels of dimension Time in C2, i.e. Year → Month, is less than that of dimension Time in C1, i.e. Year → Quarter → Month, there is a dimension schema conflict between the dimensions Time in cubes C1 and C2. Therefore, we have to insert a new level Quarter into the Time dimension in C2 by the following steps.

(1) Insert a new level Quarter and the corresponding members, Quarter 1, Quarter 2, Quarter 3 and Quarter 4, into the dimension Time in C2 by the following XQuery statement:

    update insert Quarter 1 Quarter 2 Quarter 3 Quarter 4
    following input()/CubeMeta[@CubeName="Profit"]/Dimension[@Name="Time"]/Level[@Name="Year"]

(2) Update the attribute Hierarchy of all Level elements under the newly added level. That is, the attribute Hierarchy of Time.Month in C2 should be updated from 2 to 3. The following XQuery statement accomplishes this:

    update replace input()/CubeMeta[@CubeName="C2"]/Dimension[@Name="Time"]/Level[@Name="Month"]/@Hierarchy
    with attribute Hierarchy {"3"}

(3) Then, the attribute UpperLevel of all Member elements under the newly added level should be updated accordingly. That is, the attribute UpperLevel of the members January, February and March should be updated to Quarter 1; that of April, May and June to Quarter 2; that of July, August and September to Quarter 3; and that of October, November and December to Quarter 4. This can be handled by the following XQuery command, where only the statement for updating Quarter 1 is presented:

    update replace input()/CubeMeta[@CubeName="C2"]/Dimension[@Name="Time"]/Level[@Name="Month"]/Member[. = "January" or . = "February" or . = "March"]/@UpperLevel
    with attribute UpperLevel {"Quarter 1"}
    ...

(4) Finally, we insert new cells into C2 to contain the derived fact data of the newly added level. The measure values of these cells should be re-aggregated according to the related measure values of existing cells. Figure 15 illustrates this process.
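The four steps above can be sketched outside the XML database as well. The following Python fragment is a minimal illustration using xml.etree.ElementTree, not the Tamino XQuery statements used in our prototype: the toy documents standing in for XMLM(C2) and XMLF(C2), the year member 2004, the function names, and the choice to renumber the Hierarchy attribute in the fact data as well are all assumptions. The monthly values are those shown in Figure 15, and each toy cell carries only a Time member, so the aggregation does not group by other dimensions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
QUARTER_OF = {m: f"Quarter {i // 3 + 1}" for i, m in enumerate(MONTHS)}

# Toy stand-ins for XMLM(C2) and XMLF(C2): one (assumed) year, three months.
meta = ET.fromstring(
    '<CubeMeta CubeName="C2"><Dimension Name="Time">'
    '<Level Name="Year" Hierarchy="1"><Member UpperLevel="(All)">2004</Member></Level>'
    '<Level Name="Month" Hierarchy="2">'
    + "".join(f'<Member UpperLevel="2004">{m}</Member>' for m in MONTHS[:3])
    + "</Level></Dimension></CubeMeta>")
fact = ET.fromstring(
    '<CubeFact CubeName="C2">'
    + "".join(
        f'<Cell><Member Dimension="Time" LevelName="[Time].[Month]" Hierarchy="2">{m}</Member>'
        f'<Measure Name="Sales">{v}</Measure></Cell>'
        for m, v in [("January", 120), ("February", 100), ("March", 80)])
    + "</CubeFact>")


def add_quarter_level(meta: ET.Element, year: str) -> None:
    time = meta.find("Dimension[@Name='Time']")
    year_level = time.find("Level[@Name='Year']")
    new_rank = int(year_level.get("Hierarchy")) + 1
    # Step 2: push every level at or below the new position one level down.
    for level in time.findall("Level"):
        rank = int(level.get("Hierarchy"))
        if rank >= new_rank:
            level.set("Hierarchy", str(rank + 1))
    # Step 1: insert the Quarter level, with its four members, right after Year.
    quarter = ET.Element("Level", Name="Quarter", Hierarchy=str(new_rank))
    for q in range(1, 5):
        ET.SubElement(quarter, "Member", UpperLevel=year).text = f"Quarter {q}"
    time.insert(list(time).index(year_level) + 1, quarter)
    # Step 3: point every month at its quarter.
    for member in time.find("Level[@Name='Month']"):
        member.set("UpperLevel", QUARTER_OF[member.text])


def add_quarter_cells(fact: ET.Element, measure: str) -> None:
    # Step 4: re-aggregate the month cells into new quarter cells.
    totals = defaultdict(float)
    for cell in fact.findall("Cell"):
        member = cell.find("Member[@Dimension='Time']")
        if member is not None and member.get("LevelName") == "[Time].[Month]":
            member.set("Hierarchy", "3")  # assumed: keep fact data in step with metadata
            totals[QUARTER_OF[member.text]] += float(
                cell.find(f"Measure[@Name='{measure}']").text)
    for quarter, total in totals.items():
        cell = ET.SubElement(fact, "Cell")
        ET.SubElement(cell, "Member", Dimension="Time",
                      LevelName="[Time].[Quarter]", Hierarchy="2").text = quarter
        ET.SubElement(cell, "Measure", Name=measure).text = str(total)


add_quarter_level(meta, "2004")
add_quarter_cells(fact, "Sales")
print(ET.tostring(meta, encoding="unicode"))
print(ET.tostring(fact, encoding="unicode"))
```

Summation is used here because Sales is additive; a measure with a different aggregate function would need its own re-aggregation rule.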

6.1.2. Resolution for dimension member conflicts. As we have discussed in Example 2, to resolve the first dimension member conflict between the dimensions Bookstores in cubes C1 and C2, we should expand the levels in both dimensions to encompass both granularities. That is, the dimension hierarchy in the global cube should be Nation → City → StoreName. The following steps can handle this.

(1) For cube C1, insert a new level Nation on top of the level City for Bookstores, together with the corresponding members ROC, Japan, and Korea, by XQuery statement 4; for cube C2, insert a new level City under the level Nation for Bookstores, together with the corresponding members Taipei, Tokyo, and Seoul, by XQuery statement 5.

XQuery statement 4:

    update insert ROC Japan Korea
    following input()/CubeMeta[@CubeName="C1"]/Dimension[@Name="Bookstores"]/Level[@Name="(All)"]

XQuery statement 5:

    update insert Taipei Tokyo Seoul
    following input()/CubeMeta[@CubeName="C2"]/Dimension[@Name="Bookstores"]/Level[@Name="Nation"]

Fig. 15. An illustration for inserting new cells to contain the derived fact data. (The existing cells for January (120), February (100) and March (80) are re-aggregated into a new cell for Quarter 1 (300).)

(2) Update the attribute Hierarchy of all Level elements under the newly added level.
(3) Then, update the attribute UpperLevel of all Member elements under the newly added level accordingly.
(4) Finally, we insert new cells into both cubes to contain the derived fact data corresponding to their newly added levels.

Note that the XQuery statements for steps 2, 3, and 4 are all similar to those of steps 2, 3, and 4 discussed in Section 6.1.1. Therefore, we omit the details.

For the second dimension member conflict, between the dimensions Books in cubes C1 and C2, the resolution is quite simple. We only have to append the string 'Publisher' at the end of all members in Books.Publisher of C1 to convert the data into a uniform format by using the following statement:

    update for $a in input()/CubeFact[@CubeName="C1"]/Cell/Member[@LevelName="[Books].[Publisher]"]
    do replace $a with {string-join(($a, "Publisher"), " ")}
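The same member-format normalization can be sketched in Python as follows. The helper name append_suffix and the two toy cells are hypothetical; unlike the statement above, this sketch skips members that already end with the suffix, a guard added here so that the operation can safely be repeated.

```python
import xml.etree.ElementTree as ET


def append_suffix(fact: ET.Element, level_name: str, suffix: str) -> int:
    """Append `suffix` to every member on `level_name` that lacks it.

    Returns the number of members that were changed.
    """
    changed = 0
    for member in fact.iter("Member"):
        if member.get("LevelName") == level_name and not member.text.endswith(suffix):
            member.text = f"{member.text} {suffix}"
            changed += 1
    return changed


# Toy fact data: one short publisher name and one that is already in full form.
fact = ET.fromstring(
    '<CubeFact CubeName="C1">'
    '<Cell><Member LevelName="[Books].[Publisher]">Ten Shia</Member></Cell>'
    '<Cell><Member LevelName="[Books].[Publisher]">Central Publisher</Member></Cell>'
    '</CubeFact>')
print(append_suffix(fact, "[Books].[Publisher]", "Publisher"))
```

Which cube needs the suffix depends on which one stores the short names; the function takes the fact document as a parameter for that reason.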

6.1.3. Resolution for dimension naming conflicts. In Figure 6, suppose the last level name of dimension Bookstores in cube C1 is Name, which is originally different from the level name StoreName of dimension Bookstores in cube C2; then we have to resolve this level naming conflict by using the following XQuery statements:

    update replace input()/CubeMeta[@CubeName="C1"]/Dimension[@Name="Bookstores"]/Level[@Name="Name"]/@Name
    with attribute Name {"StoreName"}

    update replace input()/CubeFact[@CubeName="C1"]/Cell/Member[@LevelName="[Bookstores].[Name]"]/@LevelName
    with attribute LevelName {"[Bookstores].[StoreName]"}
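A renaming of this kind has to touch both documents: the Name attribute in the metadata and the LevelName attribute in the fact data. A minimal Python sketch of this pairing is given below; the function rename_level, the toy documents and the member 'Store A' are hypothetical, and the "[Dimension].[Level]" form of LevelName is taken from the statements above.

```python
import xml.etree.ElementTree as ET


def rename_level(meta: ET.Element, fact: ET.Element,
                 dimension: str, old: str, new: str) -> None:
    """Rename a level in the cube metadata and in every fact cell that refers to it."""
    for level in meta.findall(f"Dimension[@Name='{dimension}']/Level[@Name='{old}']"):
        level.set("Name", new)
    old_ref = f"[{dimension}].[{old}]"
    new_ref = f"[{dimension}].[{new}]"
    for member in fact.iter("Member"):
        if member.get("LevelName") == old_ref:
            member.set("LevelName", new_ref)


# Toy stand-ins for XMLM(C1) and XMLF(C1).
meta = ET.fromstring(
    '<CubeMeta CubeName="C1"><Dimension Name="Bookstores">'
    '<Level Name="Name" Hierarchy="2"/></Dimension></CubeMeta>')
fact = ET.fromstring(
    '<CubeFact CubeName="C1"><Cell>'
    '<Member Dimension="Bookstores" LevelName="[Bookstores].[Name]">Store A</Member>'
    '</Cell></CubeFact>')

rename_level(meta, fact, "Bookstores", "Name", "StoreName")
print(ET.tostring(meta, encoding="unicode"))
print(ET.tostring(fact, encoding="unicode"))
```

Updating both documents in one function avoids leaving fact cells that refer to a level name no longer present in the metadata.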

6.2. Resolving the measure-to-measure conflicts

In Figure 6, there is a measure naming conflict between cubes C1 and C2, since the measure name Quantity in cube C1 corresponds to the measure name Amount in cube C2. Besides, there is also a measure scaling conflict between cubes C1 and C2, since the measures Sales and Cost in cube C1 use 'NT$' as the currency, but they are represented as 'US$' in cube C2. We illustrate the resolution process as follows.

6.2.1. Resolution for measure naming conflicts. To resolve the measure naming conflict, the process is similar to that of resolving the level naming conflict, as described in Section 6.1.3. Therefore, if we choose Quantity as the target measure name in the global cube, then the following XQuery statements can be employed to handle this:

    update replace input()/CubeMeta[@CubeName="C2"]/Measure[@Name="Amount"]/@Name
    with attribute Name {"Quantity"}

    update replace input()/CubeFact[@CubeName="C2"]/Cell/Measure[@Name="Amount"]/@Name
    with attribute Name {"Quantity"}

6.2.2. Resolution for measure scaling conflicts. To resolve the measure scaling conflict, suppose the currency exchange rate between 'NT$' and 'US$' is 35:1; then the following XQuery statements can accomplish this:

    update for $Sales in input()/CubeFact[@CubeName="C1"]/Cell/Measure[@Name="Sales"]
    let $NewSales := $Sales * 35
    do (replace $Sales with {$NewSales})

    update for $Cost in input()/CubeFact[@CubeName="C1"]/Cell/Measure[@Name="Cost"]
    let $NewCost := $Cost * 35
    do (replace $Cost with {$NewCost})
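The mapping function of Section 5.3 can be sketched in Python as a single rescaling helper. The function rescale_measure and the toy cell values are hypothetical. With a 35:1 rate between NT$ and US$, multiplying by 35 turns a US$ amount into NT$, and dividing by 35 does the reverse, so the factor and the cube to be rescaled both depend on the target currency chosen for the global cube; the sketch therefore leaves both as parameters.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def rescale_measure(fact: ET.Element, measure: str, factor: Decimal) -> None:
    """Multiply every value of `measure` in the fact data by `factor`."""
    for m in fact.iter("Measure"):
        if m.get("Name") == measure:
            m.text = str(Decimal(m.text) * factor)


# Toy fact data assumed to be expressed in US$.
fact = ET.fromstring(
    '<CubeFact CubeName="Local">'
    '<Cell><Measure Name="Sales">20</Measure><Measure Name="Cost">10</Measure></Cell>'
    '</CubeFact>')

NT_PER_US = Decimal(35)  # illustrative exchange rate from the text
for name in ("Sales", "Cost"):
    rescale_measure(fact, name, NT_PER_US)  # US$ -> NT$
print(ET.tostring(fact, encoding="unicode"))
```

Decimal is used instead of float so that currency amounts are not subject to binary rounding; converting in the other direction would pass 1 / NT_PER_US as the factor.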

7. Conclusion and future work

Data warehousing is gaining in popularity as organizations realize the benefits of being able to perform multi-dimensional analyses of cumulated historical business data to support administrative decision-making. However, the task of creating a global data warehouse for inter-organizational information sharing from scratch is generally labor-intensive, error-prone, and time-consuming. To alleviate such difficulties, in this paper, we have proposed a general framework and developed a prototype to integrate heterogeneous data warehouses using XML technologies. We have defined the elements of data warehousing and discussed the possible conflicts to be resolved for heterogeneous data warehouse integration. Our approach transforms all cube data into XML formats, resolves the semantic discrepancies, and then integrates the data into a global cube through XQuery statements to utilize the integration power of the Internet.

In this paper, we propose an approach to integrating semantically related data cubes in heterogeneous data warehouses into a unified virtual cube for sharing data over the Internet. Since data cubes in heterogeneous data warehouses are created unilaterally and spread over the Internet, XML is employed in our work, as it has been developed and advocated as the standard for interchanging data over the Internet. Therefore, based on the general architecture proposed in Figure 7, all cube data must be transformed into XML formats for data integration. Although the size of a data cube is usually very large, and the integration process may be time-consuming, the end users only face the integrated global cube, which can be further transformed and stored in relational databases or some kind of multidimensional file structure for efficient processing of global queries.

Heterogeneous data warehouse integration is a complex, tedious, and labor-intensive task. It is characterized by a need for extensive human interaction as well as intensive computation. While this has been recognized in the literature, past work has tended to suffer from the complexity of the system development. The contribution of our work can be summarized as follows.

(1) Simplicity, transparency and flexibility. We have proposed a simple and symmetric mapping scheme between cube schemas and XML documents to build loosely coupled data warehousing systems over the Internet. The mapping is shown to be effective, and contemporary database management systems (DBMSs), data warehouse systems, and native XML DBMSs supporting XML capabilities can be adopted to implement the whole process. Besides, since all produced modules are executed on the global side, they are transparent to all local users, who need not care about the details of the execution process.

(2) Semantics and efficiency. Based on the semantics of local data warehouses, each local site only has to prepare its cube metadata and site metadata to conform to the global cube schema. Such work can be performed at each local site in parallel after the global schema is established. With our approach, historical data sets, regardless of content or origin, can be efficiently consolidated in a vast number of different ways, which provides enterprises with capabilities to create, manipulate, animate, and synthesize multi-dimensional data for dynamic inter- and intra-enterprise analysis.

The integration of heterogeneous data warehouses can be regarded as a vertical integration of pre-existing data warehouses. From another point of view, we have to work toward the horizontal integration of heterogeneous data warehouses, which corresponds to inter-organizational workflow streamlining processes in the on-line analytical processing of business data. In the next step, we intend to enhance the proposed approach to manipulate and integrate XML documents through Web workflow applications in a more complete and subtle way. Moreover, we will extend our framework to conduct research tasks for the following aspects:

(1) To make the conflict resolution process more automated: in this paper, the conflict resolution process is semi-automatic; it still needs manual help from the local and global administrators. We will try to directly employ pre-defined enterprise domain ontology to help build the concept hierarchy of each global dimension to promote the automation of the conflict resolution process to a certain extent.

(2) To devise the conversion rules between multidimensional expression (MDX) [50] and XQuery: traditionally, users use the multi-dimensional query language MDX to query or derive data from data cubes. Since the integrated global cube is in XML format, it is helpful to develop the conversion rules between MDX and XQuery. Such an achievement would help the systems to receive MDX statements as input and translate them into XQuery statements to apply directly to the global cube for query execution.

(3) Data compression of XML documents: as the XML documents derived from a local data cube usually contain large amounts of data, it may become critical for system performance when the global network bandwidth is occupied by such data. Therefore, how to properly compress the generated XML data to reduce network traffic is yet another challenge.

(4) Extension for heterogeneous document warehouse integration: in [51] and [52], we have defined the concept and elements of the document warehouse, which organizes a set of documents into a multidimensional cube structure. Some of the related issues regarding document warehouses are also under investigation [53]. We are now extending the framework for integrating document cubes from heterogeneous document warehouses. The conflicts discussed in Section 3 will be reexamined to focus on document cubes.


Acknowledgement This research was partially supported by the National Science Council, Republic of China, under contract no. NSC 93–2416-H-327–007.

References

[1] S. Anahory and D. Murray, Data Warehousing in the Real World: A Practical Guide for Building Decision Support Systems (Addison-Wesley, Longman, Harlow, 1997).
[2] W.H. Inmon and C. Kelley, The 12 rules of data warehouse for a client/server world, Data Management Review 4(5) (1994) 6–16.
[3] R. Kimball and R. Merz, The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse (Wiley, New York, 2000).
[4] R. Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (Wiley, New York, 1996).
[5] O. Mangisengi, J. Huber, C. Hawel and W. Essmayr, Integration Issues for Heterogeneous, Distributed, and Autonomous Data Warehouses (SCCH, Hagenberg, 2000). [Technical Report, SCCH-TR-0075.]
[6] O. Mangisengi, J. Huber, C. Hawel and W. Essmayr, A framework for supporting interoperability of data warehouse islands using XML. In: Y. Kambayashi et al. (eds), Data Warehousing and Knowledge Discovery (DaWaK 2001) (Springer, London, 2001) 328–38.
[7] M. Golfarelli, S. Rizzi and B. Vrdoljak, Data warehouse design from XML sources. In: J. Hammer (ed.), Proceedings ACM Third International Workshop on Data Warehousing and OLAP (DOLAP’01) (ACM, Atlanta, 2001) 40–47.
[8] T. Niemi, M. Niinimäki, J. Nummenmaa and P. Thanisch, Constructing an OLAP cube from distributed XML data. In: D. Theodoratos (ed.), ACM Fifth International Workshop on Data Warehousing and OLAP (ACM, McLean, VA, 2002) 22–7.
[9] D. Pedersen, K. Riis and T.B. Pedersen, XML-extended OLAP querying. In: J. Kennedy (ed.), 14th International Conference on Scientific and Statistical Database Management (SSDBM’02), July 2002 (IEEE Computer Society, Edinburgh, 2002) 195–206.
[10] F.S.C. Tseng and W.C. Huang, An automatic load/extraction scheme for XML documents through object-relational repositories, Journal of Systems and Software 64(3) (2002) 207–18.
[11] F. Bapst and C. Vanoirbeek, XML documents production for an electronic platform of requests for proposals. In: X. Défago (ed.), Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, 1998 (IEEE Computer Society, Lausanne, 1999) 330–35.


[12] A.V. Royappa, Implementing catalog clearinghouses with XML and XSL. In: Proceedings of the 1999 ACM Symposium on Applied Computing, Sept. 1999, 616–21.
[13] D. Chamberlin, XQuery: an XML query language, IBM Systems Journal 41(4) (2002) 597–615.
[14] U. Dayal and H.Y. Hwang, View definition and generalization for database integration in a multi-database system, IEEE Transactions on Software Engineering 10(6) (1984) 628–44.
[15] A. Motro, Superviews: virtual integration of multiple databases, IEEE Transactions on Software Engineering 13(7) (1987) 785–98.
[16] Y. Breitbart, P.L. Olson and G.R. Thompson, Database integration in a distributed heterogeneous database system. In: Proceedings of the IEEE International Conference on Data Engineering, 1986 (IEEE Computer Society, Los Angeles, 1986) 301–10.
[17] S.M. Deen, R.R. Amin and M.C. Taylor, Data integration in distributed databases, IEEE Transactions on Software Engineering 13(7) (1987) 860–64.
[18] F.S.C. Tseng, J.J. Chiang and W.P. Yang, Integration of relations with conflicting schema structures in heterogeneous database systems, Data & Knowledge Engineering 27(2) (1998) 231–48.
[19] D. Heimbigner and D. McLeod, A federated architecture for information management, ACM Transactions on Office Information Systems 3(3) (1985) 253–78.
[20] D. Hsiao, Tutorial on federated databases and systems (part 1), International Journal on Very Large Data Bases 1(1) (1992) 127–79.
[21] D. Hsiao, Tutorial on federated databases and systems (part 2), International Journal on Very Large Data Bases 1(2) (1992) 285–322.
[22] R. Krishnamurthy, W. Litwin and W. Kent, Language features for interoperability of databases with schematic discrepancies. In: J. Clifford and R. King (eds), Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM, Denver, 1991) 40–49.
[23] B. Czejdo, M. Rusinkiewicz and D.W. Embley, An approach to schema integration and query formulation in federated database systems. In: Proceedings of the 3rd IEEE International Conference on Data Engineering, 1987 (IEEE Computer Society, Los Angeles, 1987) 477–84.
[24] W. Litwin, A. Abdellatif, B. Nicolas, P. Vigier and A. Zeronnal, MSQL: a multidatabase manipulation language, Information Sciences: an International Journal 49(1) (1989) 59–101.
[25] W. Litwin and A. Abdellatif, An overview of the multidatabase manipulation language MDSL, Proceedings of the IEEE 75(5) (1987) 621–32.
[26] C. Lee, C.J. Chen and H. Lu, An aspect of query optimization in multidatabase systems, ACM SIGMOD Record 24(3) (1995) 28–33.
[27] M.P. Reddy, B.E. Prasad, P.G. Reddy and A. Gupta, A methodology for integration of heterogeneous databases, IEEE Transactions on Knowledge and Data Engineering 6(6) (1994) 920–33.
[28] K.H. Lee et al., Conflict classification and resolution in heterogeneous information integration based on XML schema. In: B.Z. Yuan and X.F. Tang (eds), Proceedings of IEEE TENCON’02, 2002 (IEEE Computer Society, Beijing, 2002) 93–6.
[29] R.M. Bruckner, T. Wang Ling, O. Mangisengi and A.M. Tjoa, A framework for a multidimensional OLAP model using topic maps. In: C. Claramunt et al. (eds), Second International Conference on Web Information Systems Engineering (WISE’01), Vol. 2, Dec. 2001 (IEEE Computer Society, Kyoto, 2001) 109–18.
[30] S. Pepper and G. Moore, XML Topic Maps (XTM) 1.0 TopicMaps.Org Specification (2003). Available at: www.topicmaps.org/xtm/1.0/ (accessed 12 January 2005).
[31] V. Christophides, S. Cluet and J. Simeon, On wrapping query language and efficient XML integration. In: W.D. Chen et al. (eds), Proceedings of the ACM SIGMOD Conference, 2000 (ACM, Dallas, 2000) 141–52.
[32] W. Hümmer, A. Bauer and G. Harde, XCube-XML for data warehouses. In: Proceedings of the ACM 6th International Workshop on Data Warehousing and OLAP (DOLAP’03) (ACM, Louisiana, 2003) 33–40.
[33] World Wide Web Consortium, XML Schema, W3C Recommendation (2001). Available at: www.w3c.org/TR/xmlschema-0/ (accessed 12 January 2005).
[34] Howard Katz et al., XQuery from the Experts: A Guide to the W3C XML Query Language (Addison-Wesley, Boston MA, 2003).
[35] World Wide Web Consortium, XML Query (XQuery), W3C Working Draft (2003). Available at: www.w3.org/XML/Query (accessed 8 January 2005).
[36] G. Gardarin, A. Mensch, T. Tuyet Dang-Ngoc and L. Smit, Integrating heterogeneous data sources with XML and XQuery. In: V. Marik (ed.), 13th International Workshop on Database and Expert Systems Applications (DEXA’02) (IEEE Computer Society, Aix en Provence, 2002) 839–44.
[37] Y. Papakonstantinou and V. Vassalos, Architecture and implementation of an XQuery-based information integration platform, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 25(1) (2002) 18–26.
[38] Software AG Corporation, Tamino Native XML Database Management System. Available at: www.softwareag.com/tamino/technical/description.htm (accessed 12 January 2005).
[39] OMG, Common Warehouse Metamodel (CWM) Version 1.0 (2001). Available at: www.omg.org/docs/ad/01-0201.pdf (accessed 12 January 2005).

[40] M.R. Jensen, T.H. Moller and T.B. Pedersen, Specifying OLAP cubes on XML data, Journal of Intelligent Information Systems 17(2–3) (2001) 255–80.
[41] J. Pokorny, Modelling stars using XML. In: J. Hammer (ed.), Proceedings of the ACM Third International Workshop on Data Warehousing and OLAP (DOLAP’01) (ACM, San Antonio, 2001) 24–31.
[42] E. Sciore, M. Siegel and A. Rosenthal, Using semantic values to facilitate interoperability among heterogeneous information systems, ACM Transactions on Database Systems 19(2) (1994) 254–90.
[43] J. Grant, Partial values in a tabular database model, Information Processing Letters 9(2) (1979) 97–9.
[44] L.G. DeMichael, Resolving database incompatibility: an approach to performing relational operations over mismatched domains, IEEE Transactions on Knowledge and Data Engineering 1(4) (1989) 485–93.
[45] F.S.C. Tseng, A.L.P. Chen and W.P. Yang, Searching a minimal semantically-equivalent subset of a set of partial values, International Journal on Very Large Data Bases 2(4) (1993) 489–512.
[46] F.S.C. Tseng, A.L.P. Chen and W.P. Yang, Answering heterogeneous database queries with degrees of uncertainty, Distributed and Parallel Databases 1(1) (1993) 281–302.
[47] F.S.C. Tseng, A.L.P. Chen and W.P. Yang, Refining imprecise data by integrity constraints, Data & Knowledge Engineering 11(3) (1993) 229–316.
[48] A.L.P. Chen, J.S. Chiu and F.S.C. Tseng, Evaluating aggregation functions over imprecise data, IEEE Transactions on Knowledge and Data Engineering 8(2) (1996) 273–84.
[49] F.S.C. Tseng, A.L.P. Chen and W.P. Yang, Implementing the division operation on a database containing uncertain data, Journal of Information Science and Engineering 12(1) (1996) 51–78.
[50] G. Spofford, MDX Solutions with Microsoft SQL Server Analysis Services (Wiley, New York, 2001).
[51] F.S.C. Tseng and A.Y.H. Chou, The concept of document warehousing and its applications on managing enterprise business intelligence. In: C.P. Wei (ed.), Proceedings of the 8th Pacific Asia Conference on Information Systems: PACIS 2004, July 8–11, 2004, Shanghai, China (AIS, Shanghai, 2004) 563–74. [CD-ROM, track 52.]
[52] F.S.C. Tseng, Design of a multi-dimensional query expression for document warehouses, Information Sciences: an International Journal (in press).
[53] F.S.C. Tseng and W.P. Lin, D-Tree: a multi-dimensional indexing structure for constructing document warehouses, Journal of Information Science and Engineering (in press).
