XANCESTOR: A MAPPING APPROACH FOR STORING AND QUERYING XML DOCUMENTS IN RELATIONAL DATABASE USING PATH-BASED TECHNIQUE

AMJAD FAISAL ALI QTAISH

THESIS SUBMITTED IN FULFILLMENT OF THE DEGREE OF DOCTOR OF PHILOSOPHY

FACULTY OF INFORMATION SCIENCE AND TECHNOLOGY UNIVERSITI KEBANGSAAN MALAYSIA BANGI 2016

XANCESTOR: SATU PENDEKATAN PEMETAAN UNTUK PENYIMPANAN DAN PERTANYAAN DOKUMEN XML DALAM PANGKALAN DATA HUBUNGAN MENGGUNAKAN TEKNIK BERASASKAN LALUAN

AMJAD FAISAL ALI QTAISH

TESIS YANG DIKEMUKAKAN UNTUK MEMPEROLEH IJAZAH DOKTOR FALSAFAH

FAKULTI TEKNOLOGI DAN SAINS MAKLUMAT UNIVERSITI KEBANGSAAN MALAYSIA BANGI 2016


DECLARATION

I hereby declare that the work in this thesis is my own except for quotations and summaries which have been duly acknowledged.

22 Mar 2016

AMJAD FAISAL ALI QTAISH P68516


ACKNOWLEDGEMENT

First and foremost, praise be to Almighty Allah for all His blessings and for giving me patience and good health throughout the duration of this PhD research.

I would like to express my special appreciation and thanks to my supervisor, Prof. Madya Dr. Kamsuriah Ahmad. She has been a tremendous mentor for me. I would like to thank her for encouraging my research and for allowing me to grow as a research scientist. Her advice on my research as well as on my career has been invaluable. I would also like to thank all my friends who have supported me in writing and inspired me to strive towards my goal, and all the postgraduate students of the UKM power research group for their help and friendship, and for creating a pleasant working environment throughout my years at UKM.

A special thanks also goes to my family. Words cannot express how grateful I am to my mother and sisters for all of the sacrifices that they made on my behalf. In particular, I would like to express my appreciation to my beloved wife Hida’a and my sons Faisal and Aysam, who spent sleepless nights with me and were always there to give me support in the challenging periods of my study. Finally, I would like to dedicate this thesis to the pure spirit of my father, who encouraged and supported me in my life and my study but sadly passed away during the last months of my research.


ABSTRACT

XML has become a common language for data exchange on the web and therefore needs proper management. There are four central problems in XML management: capture, storage, retrieval, and exchange of data. Even though numerous database systems exist, relational databases (RDBs) are often used to store and query the contents of XML documents. Mapping from XML to an RDB and vice versa is therefore a frequent process. Numerous researchers have proposed mapping approaches that map the hierarchical structure of XML documents into the tabular format of an RDB. However, the existing approaches face problems in terms of storage and query response time. If the RDB is not designed efficiently, the number of join operations between tables increases when it is queried, which in turn increases the query response time. A new mapping approach for XML documents is therefore needed to overcome this limitation. This study proposes a new mapping approach, known as XAncestor, which consists of two algorithms: an XML mapping algorithm (XtoDB) and a query mapping algorithm (XtoSQL). XtoDB maps XML documents to a fixed RDB that requires less storage space, while XtoSQL translates XPath queries into corresponding SQL queries over the constructed RDB in order to reduce query response time. The basic idea of XAncestor is to map the distinct ancestor paths of all leaf nodes of the XML tree, together with their structural information, into the RDB, instead of mapping the whole document (inner and leaf nodes) with path information from the root (node-paths or leaf-paths). The proposed algorithms are able to reduce the RDB storage space, the search space, the number of join operations, and the nesting of queries in the translated SQL query. The two algorithms are developed as a prototype in order to test their effectiveness. Their results are compared with the results produced by five other similar approaches. The comparisons show that the proposed algorithms outperform the existing approaches in terms of effectiveness and scalability. The two algorithms are also verified in terms of correctness and completeness. It is hoped that the proposed XAncestor approach will help commercial organizations to exchange and manage their data efficiently on the web in terms of storage size and queries. The proposed approach also serves as introductory material for researchers who wish to enhance the mapping components and extend them into new research areas.


XANCESTOR: SATU PENDEKATAN PEMETAAN UNTUK PENYIMPANAN DAN PERTANYAAN DOKUMEN XML DALAM PANGKALAN DATA HUBUNGAN MENGGUNAKAN TEKNIK BERASASKAN LALUAN

ABSTRAK

XML has become a language frequently used for data exchange on the web, and it therefore requires proper management. There are four main problems in XML management: capturing, storing, retrieving, and exchanging data. Although various database systems now exist, relational databases (RDBs) are commonly used for storing and querying the contents of XML documents. The process of mapping from XML to an RDB and vice versa is therefore carried out frequently. Many researchers have proposed mapping approaches that map hierarchical XML documents to the tabular format of an RDB. However, the existing approaches face problems in terms of storage and of response time when queries are executed. If the RDB design is not done efficiently, the number of join operations between tables increases when queries are executed, which in turn affects the query response time. A new mapping approach for XML documents therefore needs to be proposed to overcome this weakness. This study aims to propose a new mapping approach named XAncestor, which consists of two algorithms: an XML mapping algorithm (XtoDB) and a query mapping algorithm (XtoSQL). XtoDB maps XML documents into a fixed RDB with a smaller storage space. The XtoSQL algorithm translates XPath queries into SQL queries based on the constructed RDB in order to produce a shorter query response time. The basic idea of XAncestor is to map the distinct ancestor paths of all leaf nodes in the XML tree, together with all structural information, into the RDB, rather than mapping the whole document (inner and leaf nodes) with path information from the root (node-paths or leaf-paths). The proposed algorithms are able to reduce the RDB storage space, the search space, the number of join operations, and the nested queries in the translated SQL query. The two algorithms in XAncestor are then developed as a prototype to test their effectiveness. The results produced by the two algorithms in XAncestor are compared with the results produced by five other similar approaches. The comparison shows that the two algorithms outperform the existing approaches in terms of effectiveness and scalability with respect to RDB storage space and query response time. Both algorithms in XAncestor are tested for correctness and completeness. It is hoped that the proposed XAncestor approach can help commercial organizations to exchange and manage data on the web efficiently in terms of storage and querying. The proposed approach also serves as introductory material for researchers who wish to improve the mapping components and extend them into new research areas.


TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF ILLUSTRATIONS
LIST OF ABBREVIATIONS

CHAPTER I    INTRODUCTION
1.1    Introduction
1.2    Research Background
1.3    Problem Statement
1.4    Motivating Example
1.5    Research Questions
1.6    Research Objectives
1.7    Research Scope
1.8    Research Methodology
1.9    Thesis Outline

CHAPTER II    LITERATURE REVIEW
2.1    Introduction
2.2    Preliminaries
       2.2.1    Extensible Markup Language (XML)
       2.2.2    XML Data Model
       2.2.3    XML Query Languages
2.3    XML Data Management
       2.3.1    Files
       2.3.2    Object-Oriented Database (OODB)
       2.3.3    Native XML Database (NXD)
       2.3.4    Relational Database (RDB)
2.4    Managing XML Data in a Relational Database (RDB)
       2.4.1    Structure Mapping Approach
       2.4.2    Model Mapping Approach
2.5    Discussion on Mapping From XML Document to Relations
       2.5.1    RDB Storage Space
       2.5.2    Query Response Time
2.6    Summary

CHAPTER III    RESEARCH METHODOLOGY
3.1    Introduction
3.2    Design-Based Research (DBR)
3.3    Adopted Research Methodology
       3.3.1    Phase 1 - Problem Identification
       3.3.2    Phase 2 - Proposed Approach Design
       3.3.3    Phase 3 - Prototype Development
       3.3.4    Phase 4 - Prototype Evaluation
3.4    Summary

CHAPTER IV    XANCESTOR: A MAPPING APPROACH
4.1    Introduction
4.2    Mapping Theory
       4.2.1    XML Tree Theory
4.3    XAncestor Mapping Approach
       4.3.1    XAncestor RDB Scheme
       4.3.2    XML Mapping Algorithm (XtoDB)
       4.3.3    Query Mapping Algorithm (XtoSQL)
4.4    Correctness of the Mapping Algorithm
4.5    Summary

CHAPTER V    RESULTS AND DISCUSSION
5.1    Introduction
5.2    Experimental Design
       5.2.1    Experimental Setup
       5.2.2    Experimental Datasets
       5.2.3    Experimental Queries
5.3    Experimental Evaluation
       5.3.1    Effectiveness Evaluation: RDB Storage Space
       5.3.2    Effectiveness Evaluation: Query Response Time
       5.3.3    Scalability Evaluation
5.4    Experimental Discussion
5.5    Completeness of XAncestor
       5.5.1    Checking by Number of Tuples Mapped
       5.5.2    Checking by Query Results
       5.5.3    Checking by Reconstructing XML Document From RDB
5.6    Summary

CHAPTER VI    CONCLUSION AND FUTURE WORKS
6.1    Introduction
6.2    Achievement of the Objectives
6.3    Research Contribution
6.4    Recommendations for Future Work
6.5    Concluding Remarks

REFERENCES

APPENDICES
APPENDIX A    XAncestor Prototype Requirements and Interfaces
APPENDIX B    Results of Pre-Experiment on Existing Model Mapping Approaches Based on XML Dataset (DBGroup)
APPENDIX C    Results of Experiments on Model Mapping Approaches Based on XML Datasets (SIGMOD, DBLP1, DBLP2, DBLP3)
APPENDIX D    Sample Results of Mapping XML Documents into Relational Database Using XAncestor Mapping Approach
APPENDIX E    Sample Results of Mapping XPath Queries into SQL Queries using XAncestor Approach
APPENDIX F    List of Publication


LIST OF TABLES

Table 1.1    RDB scheme of SMX/R after mapping DBGroup.xml using path-based technique
Table 1.2    RDB scheme of Ying-Cao after mapping DBGroup.xml using path-based technique
Table 1.3    RDB scheme of s-XML after mapping DBGroup.xml using node-based technique
Table 2.1    Examples of some XPath expressions
Table 2.2    Edge scheme: Edge table
Table 2.3    XRel scheme: (a) Path, (b) Element, (c) Text, and (d) Attribute tables
Table 2.4    SUCXENT scheme: (a) Document, (b) Path, (c) PathValue, (d) AncestorInfo, and (e) TextContent tables
Table 2.5    SMX/R scheme: (a) Path_Index_Table, and (b) Path_Table
Table 2.6    Xlight scheme: (a) Document, (b) Ancestor, (c) Path, (d) Attribute, and (e) Data tables
Table 2.7    XRecursive scheme: (a) Tag_Structure, and (b) Tag_Value tables
Table 2.8    Suri-Sharma scheme: (a) Node and (b) Data tables
Table 2.9    s-XML scheme: (a) ParentTable and (b) ChildTable
Table 2.10   XParent scheme: (a) LabelPath, (b) DataPath, (c) Data, (d) Element, and (e) Ancestor tables
Table 2.11   XPEV scheme: (a) Path, (b) Edge, and (c) Value tables
Table 2.12   XPred scheme: (a) Path, (b) Node, and (c) Data tables
Table 2.13   Ying-Cao scheme: (a) File, (b) Path, (c) InnerNodes, and (d) LeafNodes tables
Table 2.14   Wang et al. scheme: (a) No_Value_Table and (b) Value_Table
Table 2.15   LNV scheme: (a) Signatures and (b) Path tables
Table 2.16   Sainan-Caifeng scheme: (a) Signature and (b) Path tables
Table 2.17   XMap scheme: (a) Vertex, (b) Path, and (c) Data tables
Table 2.18   Summary of model mapping approaches
Table 2.19   Comparison of complexity on translated SQL query for Q1 by query mapping techniques in various approaches
Table 3.1    Results of evaluation of RDB storage space in (a) XRel, (b) XParent, (c) XPEV, (d) XRecursive, (e) Suri-Sharma, (f) Ying-Cao, and (g) s-XML
Table 3.2    Types of XPath queries expressions
Table 3.3    Results of evaluation of query response time (in ms) of (a) XRel, (b) XParent, (c) XPEV, (d) XRecursive, (e) Suri-Sharma, (f) Ying-Cao, and (g) s-XML
Table 3.4    Input, activities, and deliverables of Phase 1
Table 3.5    Input, activities, and deliverables of Phase 2
Table 3.6    Input, activities, and deliverables of Phase 3
Table 3.7    Input, activities, and deliverables of Phase 4
Table 4.1    Path Expressions Used in XPath Queries
Table 4.2    XAncestor RDB scheme: (a) Ancestor_Path and (b) Leaf_Node tables based on DBGroup.xml
Table 5.1    XML datasets used
Table 5.2    Query description for SIGMOD dataset
Table 5.3    Query description for DBLP dataset
Table 5.4    Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the SIGMOD dataset
Table 5.5    Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the DBLP1 dataset
Table 5.6    Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the DBLP2 dataset
Table 5.7    Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the DBLP3 dataset
Table 5.8    RDB storage space of the six approaches for various XML dataset sizes (MB)
Table 5.9    Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q1) for various XML dataset sizes
Table 5.10   Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q2) for various XML dataset sizes
Table 5.11   Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q3) for various XML dataset sizes
Table 5.12   Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q4) for various XML dataset sizes
Table 5.13   Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q5) for various XML dataset sizes
Table 5.14   Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q6) for various XML dataset sizes
Table 5.15   Comparison of approaches in terms of number of tuples mapped for SIGMOD dataset
Table 5.16   XAncestor scheme for DBGroup.XML: (a) Ancestor_Path and (b) Leaf_Node tables


LIST OF ILLUSTRATIONS

Figure 1.1    Trends and issues in data exchange in big data environment leading to problem formulation
Figure 1.2    XML document DBGroup.xml
Figure 1.3    Tree representation for twig query (Q)
Figure 1.4    Translated SQL query for twig query (Q) using SMX/R
Figure 1.5    Translated SQL query for twig query (Q) using s-XML
Figure 1.6    Organization of thesis
Figure 2.1    Part of XML document
Figure 2.2    XML data models: (a) edge-labelled and (b) node-labelled
Figure 2.3    Classification of queries for XML
Figure 2.4    Techniques used in model mapping approaches
Figure 2.5    Translated SQL query for Q1 using edge-based technique
Figure 2.6    Handling paths: (a) Node-Path and (b) Leaf-Path based on DBGroup.xml
Figure 2.7    Translated SQL query for Q1 using path-based technique
Figure 2.8    Translated SQL query for Q1 using node-based technique
Figure 2.9    Translated SQL query for Q1 using hybrid (edge- and path-based) technique
Figure 2.10   Translated SQL query for Q1 using hybrid (path- and node-based) technique
Figure 2.11   Translated SQL query for Q1 using indexing (signature) technique
Figure 3.1    Four phases of design-based research methodology
Figure 3.2    Phases of research methodology
Figure 3.3    Iterative stages of RAD method
Figure 4.1    DBGroup.xml: (a) document and (b) tree representation
Figure 4.2    Composite relationship for identifying P-C
Figure 4.3    Associative relationship for identifying A-D
Figure 4.4    Proposed XAncestor mapping approach
Figure 4.5    XAncestor RDB scheme
Figure 4.6    Entity relationship diagram of XAncestor RDB scheme
Figure 4.7    Flowchart of proposed XtoDB algorithm
Figure 4.8    Proposed XtoDB algorithm
Figure 4.9    Flowchart of proposed XtoSQL algorithm
Figure 4.10   Proposed XtoSQL algorithm
Figure 4.11   Translated SQL query for XPath queries Q1 and Q2 using XtoSQL
Figure 4.12   Translated SQL query for Q1 based on DBGroup.xml using XtoSQL
Figure 4.13   Translated SQL query for Q2 based on DBGroup.xml using XtoSQL
Figure 4.14   Translated SQL query for XPath queries Q3 and Q4 using XtoSQL
Figure 4.15   Translated SQL query for Q3 based on DBGroup.xml using XtoSQL
Figure 4.16   Translated SQL query for Q4 based on DBGroup.xml using XtoSQL
Figure 4.17   Translated SQL query for XPath query Q5 using XtoSQL
Figure 4.18   Translated SQL query for Q5 based on DBGroup.xml using XtoSQL
Figure 4.19   Translated SQL query for XPath query Q6 using XtoSQL
Figure 4.20   Translated SQL query for Q6 based on DBGroup.xml using XtoSQL
Figure 5.1    Snapshot of SIGMOD: (a) XML document, and (b) DTD
Figure 5.2    Snapshot of DBLP: (a) XML document, and (b) DTD
Figure 5.3    Comparison of RDB storage space of approaches using SIGMOD
Figure 5.4    Comparison of RDB storage space of approaches using DBLP1
Figure 5.5    Comparison of RDB storage space of approaches using DBLP2
Figure 5.6    Comparison of RDB storage space of approaches using DBLP3
Figure 5.7    Comparison of query response time of approaches using SIGMOD
Figure 5.8    Comparison of query response time of approaches using DBLP1
Figure 5.9    Comparison of query response time of approaches using DBLP2
Figure 5.10   Comparison of query response time of approaches using DBLP3
Figure 5.11   Comparison of six approaches in terms of scalability of RDB storage space for various XML dataset sizes using column graph
Figure 5.12   Comparison of six approaches in terms of scalability of RDB storage space for various XML dataset sizes using linear graph
Figure 5.13   Comparison of six approaches in terms of scalability of query response time for query (Q1) for various XML dataset sizes using column graph
Figure 5.14   Comparison of six approaches in terms of scalability of query response time for query (Q1) for various XML dataset sizes using linear graph
Figure 5.15   Comparison of six approaches in terms of scalability of query response time for query (Q2) for various XML dataset sizes using column graph
Figure 5.16   Comparison of six approaches in terms of scalability of query response time for query (Q2) for various XML dataset sizes using linear graph
Figure 5.17   Comparison of six approaches in terms of scalability of query response time for query (Q3) for various XML dataset sizes using column graph
Figure 5.18   Comparison of six approaches in terms of scalability of query response time for query (Q3) for various XML dataset sizes using linear graph
Figure 5.19   Comparison of six approaches in terms of scalability of query response time for query (Q4) for various XML dataset sizes using column graph
Figure 5.20   Comparison of six approaches in terms of scalability of query response time for query (Q4) for various XML dataset sizes using linear graph
Figure 5.21   Comparison of six approaches in terms of scalability of query response time for query (Q5) for various XML dataset sizes using column graph
Figure 5.22   Comparison of six approaches in terms of scalability of query response time for query (Q5) for various XML dataset sizes using linear graph
Figure 5.23   Comparison of six approaches in terms of scalability of query response time for query (Q6) for various XML dataset sizes using column graph
Figure 5.24   Comparison of six approaches in terms of scalability of query response time for query (Q6) for various XML dataset sizes using linear graph
Figure 5.25   Result of corresponding SQL query for Q using XRel approach
Figure 5.26   Result of corresponding SQL query for Q using SMX/R approach
Figure 5.27   Result of corresponding SQL query for Q using Ying-Cao approach
Figure 5.28   Result of corresponding SQL query for Q using XRecursive approach
Figure 5.29   Result of corresponding SQL query for Q using s-XML approach
Figure 5.30   Result of corresponding SQL query for Q using XAncestor approach
Figure 5.31   XML data model: (a) original XML document (DBGroup.XML) and (b) its tree
Figure 5.32   Reconstructed XML tree for DBGroup.xml
Figure 5.33   Reconstructed XML document for DBGroup.xml


LIST OF ABBREVIATIONS

XML             eXtensible Markup Language
RDB             Relational Database
SQL             Structured Query Language
W3C             World Wide Web Consortium
RDBMS           Relational Database Management System
OODB            Object-Oriented Database
NXD             Native XML Database
DTD             Document Type Definition
DBR             Design-Based Research
CLOB            Character Large Object
BLOB            Binary Large Object
API             Application Program Interface
XSLT            eXtensible Stylesheet Language Transformation
Transact-SQL    Transaction-Structured Query Language
DBMS            Database Management System
DBA             Database Administrator
CPI             Constraint Preserving Inlining
GUI             Graphical User Interface
P-C             Parent-Child
A-D             Ancestor-Descendant
ERD             Entity Relationship Diagram
DOM             Document Object Model
OLAP            Online Analytical Processing
RAD             Rapid Application Development
DAD             Document Access Definition


CHAPTER I

INTRODUCTION

1.1 INTRODUCTION

This chapter introduces the thesis, which investigates mapping approaches for storing and querying XML documents using a relational database (RDB). The large amount of data represented in extensible markup language (XML) on the Web today and the evolution of the big data environment raise the question of how to store and query this data effectively. Previous studies (Bousalem & Cherti 2015; Mourya & Saxena 2015; Ying et al. 2012; Fakharaldien et al. 2011a) have indicated that RDB technology is still the best available approach for managing XML content because of its capabilities in storage, retrieval, indexing, data integrity, and security. Therefore, the need to map XML to an RDB scheme has increased. The main problem in this context is to define the optimal approach for mapping (shredding and decomposing) XML documents into an RDB scheme. On the one hand, the optimal approach should require less RDB storage space and scale well as XML documents grow. On the other hand, the approach should map (translate) different types of XPath query into their corresponding structured query language (SQL) queries with the shortest possible query response time. Finding a mapping approach that achieves these aims has therefore been a central focus for database researchers in recent years (Mourya & Saxena 2015; Yuan et al. 2013; Subramaniam et al. 2012; Fakharaldien et al. 2011a; Zafari et al. 2010), and the search for such an approach motivates this study.
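To make the two requirements above concrete, the following Python sketch shreds a small XML document into two relational tables and translates a simple absolute XPath expression into SQL. It is an illustration under stated assumptions only: the sample document, the column names, and the functions `shred` and `simple_xpath_to_sql` are hypothetical, the table names merely echo the Ancestor_Path and Leaf_Node tables listed for the XAncestor scheme, and the logic is not the XtoDB or XtoSQL algorithm presented in Chapter IV. Attributes, mixed content, and XPath predicates are ignored.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document (not the thesis's DBGroup.xml).
XML_DOC = (
    "<lab>"
    "<member><name>Ann</name><age>30</age></member>"
    "<member><name>Bob</name><age>41</age></member>"
    "<project><title>XML storage</title></project>"
    "</lab>"
)


def shred(xml_text, conn):
    """Store every leaf element once, keyed by its distinct ancestor path."""
    # Assumed two-table layout; column names are illustrative.
    conn.executescript("""
        CREATE TABLE ancestor_path (
            id   INTEGER PRIMARY KEY,
            path TEXT UNIQUE
        );
        CREATE TABLE leaf_node (
            id     INTEGER PRIMARY KEY,
            name   TEXT,
            value  TEXT,
            anc_id INTEGER REFERENCES ancestor_path(id)
        );
    """)

    def walk(elem, ancestors):
        children = list(elem)
        if not children:
            # Leaf element: record its tag, text, and the path of its ancestors.
            path = "/" + "/".join(ancestors)
            conn.execute(
                "INSERT OR IGNORE INTO ancestor_path(path) VALUES (?)", (path,)
            )
            (anc_id,) = conn.execute(
                "SELECT id FROM ancestor_path WHERE path = ?", (path,)
            ).fetchone()
            conn.execute(
                "INSERT INTO leaf_node(name, value, anc_id) VALUES (?, ?, ?)",
                (elem.tag, (elem.text or "").strip(), anc_id),
            )
        for child in children:
            walk(child, ancestors + [elem.tag])

    walk(ET.fromstring(xml_text), [])


def simple_xpath_to_sql(xpath):
    """Translate an absolute child-axis path such as /lab/member/name."""
    *ancestors, leaf = xpath.strip("/").split("/")
    sql = (
        "SELECT l.value FROM leaf_node AS l "
        "JOIN ancestor_path AS a ON a.id = l.anc_id "
        "WHERE a.path = ? AND l.name = ? ORDER BY l.id"
    )
    return sql, ("/" + "/".join(ancestors), leaf)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    shred(XML_DOC, conn)
    sql, params = simple_xpath_to_sql("/lab/member/name")
    print([row[0] for row in conn.execute(sql, params)])
```

In this sketch the five leaf elements of the sample document share only two distinct ancestor paths, so the path table stays small, and the translated query needs a single join. Whether a real mapping approach achieves comparable savings depends on the document structure and on how inner nodes and ordering are represented.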


The remainder of this chapter is organized as follows. Section 1.2 provides the background to this study on mapping approaches for storing and querying XML documents using a relational database, and highlights the importance of XML technology in data exchange in the big data environment. Section 1.3 discusses the problems in the existing mapping approaches and defines the criteria for mapping XML into an RDB. Section 1.4 explains further the limitations of recent mapping approaches in terms of storing and querying XML documents. The research questions are identified in Section 1.5, the objectives of the study are outlined in Section 1.6, the research scope is stated in Section 1.7, and the research methodology is presented in Section 1.8. Finally, the overall organization of the thesis is presented in Section 1.9.

1.2 RESEARCH BACKGROUND

The twenty-first century is known as the era of information. Most enterprises and people manage most of their transactions and activities on the Internet. Internet-based activities include e-management, e-learning, e-mail, e-library, e-commerce, and e-business. This is leading to the creation of a huge amount of data, which is known as big data (Fu et al. 2015; Katal et al. 2013; Chen et al. 2012). The exchange of huge amounts of data between enterprises using XML technology has become a challenging issue for researchers and database vendors alike because of the variations in the data format and in the semantics of the metadata used to describe data (Bousalem & Cherti 2015; Arenas et al. 2014; Dweib & Lu 2012; Fakharaldien et al. 2011a; Arenas et al. 2010).

Since February 1998, XML has been the prominent standard for data interchange and data representation on the Web (Mourya & Saxena 2015; Raut & Atique 2014; Subramaniam et al. 2012; Ahmad 2011). In addition, XML has gained widespread support and adoption among all the leading vendors of software tools, servers, and databases (Adam et al. 2012). XML is a semi-structured data language with a nested, self-describing nature that uses a simple-to-write and simple-to-parse syntax to facilitate data interchange in a variety of systems on the Web (Bousalem & Cherti 2015; Kale & Deshmukh 2014; Fakharaldien et al. 2011a). It enables communication between different computing systems, a feat that had previously been extremely difficult or even impossible to achieve. XML therefore provides a global structure for the interchange of data regardless of the platforms and data models of the applications. The computing world now has a new way of implementing distributed application systems, and the openness of XML allows data to be exchanged between virtually any hardware, software, or operating system (Adam et al. 2012; Fakharaldien et al. 2011a). Simply put, XML opens the door for information exchange without restriction. As a result, considerable amounts of semi-structured data on the Web are represented in XML, and managing this data in terms of storing and querying it has become a major research challenge (Bousalem & Cherti 2015; Mourya & Saxena 2015; Yuan et al. 2013; Wang et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011; Dweib et al. 2009).

XML documents can be stored and queried using four approaches: a file system (Dweib & Lu 2012; Haw & Lee 2011; Kyu & Nyunt 2009), an object-oriented database (OODB) (Bousalem & Cherti 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Wang et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011), a native XML database (NXD) (Bousalem & Cherti 2015; Vaidya 2015; Raut & Atique 2014; Yuan et al. 2013; Ahmad 2011; Haw & Lee 2011), and a relational database (RDB) (Bousalem & Cherti 2015; Vaidya 2015; Yuan et al. 2013; Dweib & Lu 2012; Subramaniam et al. 2012; Fakharaldien et al. 2011a). Ahmad (2011) shows that a file system can be used with minimal effort to store XML data, but the file system has shortcomings in terms of XML data querying.
On the other hand, an OODB allows the clustering of XML elements and subelements, which might be useful for certain complex applications such as multimedia data and geographic information (Dweib & Lu 2012). However, the current generation of OODBs is insufficiently mature to evaluate queries on very large databases (Wang et al. 2012; Kyu & Nyunt 2009). In addition, the absence of a standard OODB query language has hindered the development of tests to compare various OODB implementations (Haw & Lee 2011). An NXD supports the XML data model and query languages directly, i.e. XML data can be stored and retrieved in its original format and no additional mapping is needed. However, the NXD approach still has performance problems when handling a large volume of XML data (Dweib & Lu 2012). Also, due to the document-centric nature of NXDs, complex searches or aggregations can be cumbersome (Ahmad 2011; Atay et al. 2010). Moreover, the NXD approach is not suitable when integration between various heterogeneous XML documents is needed (Haw & Lee 2011).

At present, the RDB overcomes the drawbacks of the other approaches (file system, OODB, and NXD) because of its capabilities in storage, indexing, data integrity and security, and query optimization. In addition, the RDB is expected to remain the prevalent storage mechanism for managing business data for the foreseeable future due to its powerful data management service capabilities (Mourya & Saxena 2015; Fakharaldien et al. 2011c; Ning & De 2010). Moreover, the RDB has garnered commercial strength from giant vendors; in fact, the RDB is the most appropriate XML data management mechanism to date (Bousalem & Cherti 2015; Mourya & Saxena 2015; Wang et al. 2012; Fakharaldien et al. 2011b; Ahmad 2011). Both Hashem et al. (2015) and Zhang et al. (2010) indicate that even though cloud computing is being used to store, process, and analyse data, RDBs are still being used and are considered part of current cloud computing architecture. This is because they play an important role in ensuring the easy transition of applications from old enterprise infrastructures to new cloud infrastructure architectures such as Cloud SQL (Google), SQL Azure (Microsoft), MySQL and Oracle (Amazon), and MySQL, Oracle, and PostgreSQL (Cloudera).
The major challenge in using a RDB approach for the managing of the huge amount of data represented in XML in a big data environment is to resolve the conflict between the hierarchical nature of XML data, which is semi-structured data, and the tabular format of relational data, which is structured data. This can be done by using a suitable mapping approach (Haw & Lee 2011; Atay et al. 2010). Therefore, developing mapping approaches to resolve this challenge is an active area of research


(Vaidya 2015; Dweib & Lu 2012; Haw & Lee 2011; Atay et al. 2010). An XML document needs to be mapped accurately so it can be represented as a RDB scheme. Users can then access the document by using the same mechanism used in RDBs. Users can also then issue XML queries over the document, which are translated into SQL queries over a RDB, and the result of the query is translated back into XML. All these processes are done internally (Fakharaldien et al. 2011a; Ahmad 2008). However, mapping XML data to a RDB is a difficult task for all IT organizations globally, so there is a need to develop a mapping prototype that contains interfaces and tools to ease the storing and querying of XML data in a RDB (Fakharaldien et al. 2011a). Numerous researchers (Bousalem & Cherti 2015; Subramaniam et al. 2012; Suri & Sharma 2012; Ying et al. 2012; Fakharaldien et al. 2011a; Ning & De 2010) have focused on improving the mapping process between XML documents and RDBs in terms of RDB storage space, scalability when XML documents become bigger, and query response time, by drawing on the advantages of the properties of both XML and RDBs. The approaches in previous studies have improved the storage space of RDBs and have reduced the number of join operations in the translated query, which lessens the query response time. In addition, based on the schema definition and the RDB design needed to map an XML document into a RDB scheme, researchers (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Dweib & Lu 2012; Wang et al. 2012) have classified the mapping approaches into two categories: structure mapping (also called schema-based mapping) and model mapping (also called schemaless-based or generic mapping) approaches. In structure mapping approaches (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Dweib & Lu 2012; Fakharaldien et al. 2011a), an XML document is mapped into a RDB scheme based on its structure, i.e. a document type definition (DTD) or XML-schema.
This approach results in additional complexity when managing different structured XML documents with different logical and physical designs of RDB. In contrast, in model mapping approaches (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Dweib & Lu 2012; Fakharaldien et al. 2011a),


an XML document’s nodes and edges are mapped to the RDB scheme. Therefore, the obtained RDB scheme is not related to the XML document structure. Thus, in this approach, a fixed RDB scheme is used to store the XML document structure. This approach can support a range of sophisticated applications as well as well-formed XML documents even if they do not have a DTD. During data exchange in the big data environment, numerous XML documents of semi-structured data might exist in different forms and schemes. This makes it more difficult to manage (store and query) these documents. Model mapping approaches overcome the disadvantages of structure mapping approaches to solve this problem (Bousalem & Cherti 2015; Ying et al. 2012; Fakharaldien et al. 2011a). This thesis explores the issues related to the management of XML documents, particularly storing and querying issues (Bousalem & Cherti 2015; Fakharaldien et al. 2012; Subramaniam et al. 2012; Haw & Lee 2011; Su-Cheng et al. 2010). As regards the storage issue, because the design of the RDB scheme has an effect on the mapping process and query performance (Subramaniam et al. 2012; Ying et al. 2012; Haw & Lee 2011), the first problem that arises is how to map (shred and decompose) the content and hierarchical structure of the XML document into an optimal fixed and predefined RDB scheme efficiently. This can be achieved by reducing the complexity of the RDB scheme design i.e. removing dangling (unnecessary) tables and columns. It can also be achieved by removing redundancy or unnecessary data and paths mapped (reducing the number of records mapped). Both of these actions reduce RDB storage space, particularly in the case of large XML documents (Vaidya 2015; Subramaniam et al. 2012; Ying et al. 2012; Ahmad 2011; Haw & Lee 2011). As for the querying issue, Subramaniam et al. 
(2012) state that it is essential that a mapping approach preserves the relationships among the XML nodes such as parent-child (P-C), ancestor-descendant (A-D), levels and siblings during the mapping process. Hence, accurate query retrieval is needed, and the responses provided to the user should be precise. Therefore, the second problem that arises is how to map different types of XML query (XPath) into their corresponding SQL query based on the predefined RDB scheme in order to shorten the query response time, particularly


for large XML documents. This can be done by reducing the search space, number of join operations, number of nesting queries (parent ID recursively), and number of SQL functions used in the translated SQL query. However, the P-C and A-D relationships among the nodes still need to be preserved efficiently in the translated SQL query (Vaidya 2015; Fakharaldien et al. 2012; Subramaniam et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011). The above-mentioned problems faced during the mapping process are the motivation for this study. Therefore, the main problem investigated in this research is how to extend and develop the existing approaches for mapping an XML document into a RDB scheme to overcome their limitations and better handle the issues of storing and querying. Figure 1.1 presents a summary of the current trends in using XML for data exchange in the big data environment. It also describes the issues faced in trying to meet the needs of mapping from XML to a RDB scheme in terms of storing and querying, which lead to the formulation of the research problem.


Trends in Data Exchange in Big Data Environment
• XML has become the dominant standard in data exchange and data representation
• XML is used as an international data standard for understanding business data and for translating and moving data from one platform to another
• Growing amounts of XML data need to be managed on the Web

Pressures on XML Data Management
• XML data needs to be managed efficiently to achieve better storing and querying
• XML comes in different forms and structures on the Web
• Different technologies have been proposed to manage XML data

File Technology: Limitations in XML data accessing and querying
OODB Technology: Less mature for evaluating queries on huge databases
NXD Technology: Performance problems when handling large amounts of XML data

RDB Technology
• Overcomes drawbacks of other approaches because of its capabilities in storage, indexing, data integrity and security, and query optimization
• Has garnered commercial strength from giant vendors
• Is the most appropriate XML data management mechanism to date

Mapping XML to RDB

Structure mapping approaches
• Map XML document based on its structure (DTD or XML-schema)
• Different structured XML documents have different RDB schemes

Model mapping approaches
• Use fixed RDB scheme to map all types of XML document structure
• Support any sophisticated applications and well-formed XML documents

Challenges in Model Mapping Process
• How to deal with XML documents that have irregular structures (with scheme or without scheme) and that come in different forms (well-formed or valid) as well as having nested elements
• How to map the content and hierarchical structure of an XML document into a fixed and predefined RDB scheme that is free of dangling tables and columns and of redundant or unnecessary data and mapped paths, in order to reduce RDB storage space, particularly for large XML documents
• How to map different types of XPath query into corresponding SQL queries that contain fewer join operations, nesting queries, and SQL functions in order to achieve a shorter query response time, particularly for large XML documents

The Problem
• As shown by the pre-experiment conducted for this study, recent model mapping approaches fail to map XML documents into a RDB with less storage space and high scalability when XML documents become bigger and fail to map different types of XPath query into SQL queries in less time in order to respond to (execute) these queries promptly

Figure 1.1 Trends and issues in data exchange in big data environment leading to problem formulation


1.3 PROBLEM STATEMENT

As outlined above, the managing of XML data in terms of storing and querying using a RDB is a current research challenge (Bousalem & Cherti 2015; Mourya & Saxena 2015; Yuan et al. 2013; Wang et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011; Dweib et al. 2009). Resolving the conflict between the hierarchical nature of XML data and the tabular format of a RDB requires an effective mapping approach (Vaidya 2015; Dweib & Lu 2012; Haw & Lee 2011; Atay et al. 2010). Many different model mapping approaches have been proposed to map an XML document to a fixed RDB scheme. These approaches have attempted to improve storing and querying (Bousalem & Cherti 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Suri & Sharma 2012; Ying et al. 2012; Fakharaldien et al. 2011a; Ning & De 2010; Zafari et al. 2010). The existing model mapping approaches concentrate on developing XML mapping algorithms to store the content and hierarchical structure of the XML document as efficiently as possible in order to reduce the storage space even when the size of XML documents is large. These approaches also focus on creating query mapping algorithms to translate different types of XPath query into the corresponding SQL query efficiently in order to enhance performance, i.e. reduce query response time and return the results in an XML format (tree view). These two focal areas of the existing approaches give rise to two important questions. The first question is: Are some XML mapping algorithms ‘better’ than others in terms of handling the storage issue? By way of explanation, let us say that two XML mapping algorithms (XM1 and XM2) can shred and decompose an XML document into two fixed RDB schemes (R1 and R2), respectively. How do we determine whether XM1 is better than XM2 or vice versa in handling the storage issue? In this study, the XML mapping algorithm is considered to be the best if it can produce the smallest RDB storage space for different sizes of XML document whether R1 or R2 is used. 
So, the answer to this question depends on which one of R1 and R2 has less storage space with high scalability when XML documents become bigger. A smaller RDB storage space can be achieved by finding the optimal way to handle the content and hierarchical structure of the XML document in an optimal fixed and


predefined RDB scheme that is free of dangling tables and columns and of redundant or unnecessary information and mapped paths (reducing the number of records mapped). The second question is: Are some query mapping algorithms ‘better’ than others in terms of handling the querying issue? To illustrate, suppose that two query mapping algorithms (QM1 and QM2) translate a certain XPath query (QT) into two SQL queries (Q1 and Q2), respectively. How do we determine whether QM1 is better than QM2 or vice versa for handling the querying issue? In this study, the query mapping algorithm is considered to be the best if it can translate QT into a corresponding SQL query and execute it in the shortest query response time. So, the answer to this question depends on which one of queries Q1 and Q2 has less query response time. A shorter query response time can be achieved by reducing the search space of the RDB scheme, reducing the number of join operations, removing the nesting queries, and removing the SQL functions used in the translated SQL query, i.e. by maintaining the P-C and A-D relationships among the nodes efficiently. To address the above questions, a pre-experiment (discussed in detail in Chapter 3) was conducted on five recent model mapping approaches: SMX/R (Ning & De 2010), XRecursive (Fakharaldien et al. 2011b), Suri-Sharma (Suri & Sharma 2012), Ying-Cao (Ying et al. 2012), and s-XML (Subramaniam et al. 2012). These approaches were evaluated in terms of RDB storage space and scalability when the XML documents become bigger and in terms of query response time based on their RDB. The initial results of the pre-experiment showed that these approaches suffer from the following issues in terms of storing and querying XML documents:

The different ways in which the approaches handle (store) the content and hierarchical structure of XML documents in their RDB scheme increase the complexity of the RDB scheme design (dangling tables and columns) and introduce redundant or unnecessary information and mapped paths (mapped records) into the RDB scheme. This leads to an increase in the RDB storage space with less scalability when the XML document becomes bigger. That is to say, the cost of RDB storage space and scalability is very high, particularly in the case of large XML documents. Therefore, the XML mapping methods used in these approaches are not effective.

The different ways in which the approaches translate different types of XPath query into their corresponding SQL query result in a high number of join operations, nested queries (recursively), and SQL functions in the translated SQL query. In other words, the methods used to preserve the P-C and A-D relationships among the nodes of the XML document in the translated SQL queries over their RDB scheme lead to an increase in the search space inside the RDB. This has an adverse effect on the time taken to execute and respond to these queries, and the query response time is increased, particularly for large XML documents. Therefore, the querying methods used in these approaches are not effective.

Due to the several limitations mentioned, recent approaches suffer from all the

above issues in terms of storing and querying XML documents in their RDB scheme. A more detailed discussion of these issues will be provided in Chapter 2. Therefore, it is the aim of this study to propose a correct and complete mapping approach that can handle the content and hierarchical structure of XML documents in its RDB scheme efficiently to address the storage issue. This can be achieved by reducing the complexity of the RDB scheme design by removing dangling tables and columns and by removing redundancy or unnecessary data and paths mapped (reducing the number of records mapped) in the RDB scheme. This will lead to less RDB storage space with high scalability, even when XML documents become bigger. In addition, the proposed approach will also translate different types of XPath query into their corresponding SQL query efficiently by maintaining the P-C and A-D relationships among the nodes of the XML document in the translated SQL query. This can be achieved by reducing the search space in the RDB scheme through


reducing the number of join operations, removing the nesting queries, and removing the SQL functions used inside the translated SQL queries. This will lead to a reduction in the time taken to execute queries (less query response time). A fuller description of the limitations of existing approaches in terms of storing and querying XML documents is given by way of example in the next section, to explain the motivation for this study.

1.4 MOTIVATING EXAMPLE

An XML document is represented as a rooted, nested, hierarchical, and ordered node-labelled data tree (Haw & Lee 2011). Figure 1.2 shows the XML document (DBGroup.xml) which is used as a motivating example to explain the storing and querying limitations of recent model mapping approaches.

Figure 1.2 XML document DBGroup.xml (Jiang et al. 2002)

The DBGroup document lists all the information about members and projects. Each member has information about the name, age, office, and project ID of the member. Each project has information about its identifier and title name. The edge-based technique is considered the first, simplest, and most straightforward technique that has been used to store and query XML data using a RDB (the Edge approach) (Florescu & Kossmann 1999). By this technique, an XML document can be modelled as a rooted,


nested, hierarchical, ordered edge-labelled data tree (Haw & Lee 2011), i.e. the labels of the edge-labelled data tree are on the edges. The edge-based technique requires a large amount of storage space to store an XML document, particularly a large one, because all edges between the nodes in the document are stored in a single table (Edge) (Fakharaldien et al. 2012; Subramaniam et al. 2012). Because this technique handles edges individually, i.e. it only maintains the edge label rather than the label paths (Gou & Chirkova 2007; Florescu & Kossmann 1999), the edges have to be concatenated to form a path when processing user queries. This increases the time taken to execute user queries, i.e. the query response time.

In contrast, path-based and node-based techniques are the most recent techniques that have been used in model mapping approaches (Subramaniam et al. 2012; Ying et al. 2012; Fakharaldien et al. 2011a; Ning & De 2010). The XMap approach (Bousalem & Cherti 2015) uses an indexing (signature) technique. However, the researchers did not evaluate its performance against other approaches. Hence it is not known whether the indexing (signature) technique would be the best technique for mapping an XML document into a RDB scheme.

Path-based and node-based techniques use different ways to handle the content and hierarchical structure of the XML document when mapping XML to a RDB. The path-based technique handles the content and hierarchical structure of the document by mapping all the information of inner and leaf nodes (whole document) with their distinct path information starting from the root into the RDB scheme. This is the main concept of this technique (Vaidya 2015; Vainio & Junkkari 2014). The path-based technique handles the hierarchical structure of the document by handling (storing) the path information in a path table. In this technique, two methods are used for handling path information, namely Node-Path and Leaf-Path. The Node-Path handles the distinct path information for every node in the document starting from the root (root-to-node). The Leaf-Path handles the distinct path information for only the leaf nodes in the document starting from the root (root-to-leaf).
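The difference between Node-Path and Leaf-Path can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the XML text below is a hypothetical reconstruction of DBGroup.xml inferred from the values listed in Tables 1.1 to 1.3 (the original figure may differ in detail), and the function name `collect_paths` is introduced here for illustration and does not belong to any of the cited approaches.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of DBGroup.xml, inferred from the table values;
# the original figure may differ in detail.
DBGROUP_XML = """
<DBGroup>
  <Member><Name>Fervvac</Name><Age>23</Age><Office>CSD 4212</Office></Member>
  <Member><Name>Daniel</Name>
    <Office><Building>CS</Building><Room>4215</Room></Office></Member>
  <Member Project="105"><Name>Ryan</Name><Age>24</Age><Office>CSD 4212</Office></Member>
  <Project Id="105"><Title>XML</Title></Project>
  <Project><Title>Cube</Title></Project>
</DBGroup>
"""


def collect_paths(xml_text):
    """Return (node_paths, leaf_paths): distinct paths in document order."""
    node_paths, leaf_paths = [], []

    def add(paths, path):
        if path not in paths:          # keep each distinct path only once
            paths.append(path)

    def visit(elem, parent_path):
        path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
        add(node_paths, path)          # Node-Path: every node gets a path
        for attr in elem.attrib:       # attributes are treated as leaf nodes
            attr_path = f"{path}/@{attr}"
            add(node_paths, attr_path)
            add(leaf_paths, attr_path)
        children = list(elem)
        if not children:               # Leaf-Path: only elements without child elements
            add(leaf_paths, path)
        for child in children:
            visit(child, path)

    visit(ET.fromstring(xml_text.strip()), "")
    return node_paths, leaf_paths


node_paths, leaf_paths = collect_paths(DBGROUP_XML)
print(len(node_paths), "distinct root-to-node paths")
print(len(leaf_paths), "distinct root-to-leaf paths")
```

Under this reconstruction the sketch yields 11 root-to-node paths and 8 root-to-leaf paths, which matches the number of rows in the path tables of the two path-based approaches discussed below.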


In the case of Figure 1.2, the path-based technique maps all the information of inner nodes (&1 - &6, &11) and leaf nodes (&7 - &18 except &11) of the DBGroup document into the RDB scheme of the model mapping approaches that apply this technique. The SMX/R approach (Ning & De 2010) is an example of a model mapping approach that applies a path-based technique. SMX/R maps all the information of the inner and leaf nodes (whole document) and their distinct path information starting from the root (Node-Path) into two tables (Path_Index_Table and Path_Table), as shown in Table 1.1.

Table 1.1 RDB scheme of SMX/R after mapping DBGroup.xml using path-based technique

(a) Path_Index_Table

PID  Path                            NodeName
1    DBGroup                         DBGroup
2    DBGroup.Member                  Member
3    DBGroup.Member.Name             Name
4    DBGroup.Member.Age              Age
5    DBGroup.Member.Office           Office
6    DBGroup.Member.Office.Building  Building
7    DBGroup.Member.Office.Room      Room
8    DBGroup.Member.@Project=105     @Project=105
9    DBGroup.Project                 Project
10   DBGroup.Project.@Id=105         @Id=105
11   DBGroup.Project.Title           Title

(b) Path_Table

DocID  PID  StartPos  EndPos  NodeLVL  NodeType  NodeValue
10     1    1         61      1        1
10     2    2         15      2        1
10     3    3         6       3        1
10     3    4         5       4        3         ‘Fervvac’
10     4    7         10      3        1
10     4    8         9       4        3         ‘23’
10     5    11        14      3        1
10     5    12        13      4        3         ‘CSD 4212’
10     2    16        31      2        1
10     3    17        20      3        1
10     3    18        19      4        3         ‘Daniel’
10     5    21        30      3        1
10     6    22        25      4        1
10     6    23        24      5        3         ‘CS’
10     7    26        29      4        1
10     7    27        28      5        3         ‘4215’
10     2    32        46      2        1
10     8    33        33      3        2         ‘105’
10     3    34        37      3        1
10     3    35        36      4        3         ‘Ryan’
10     4    38        41      3        1
10     4    39        40      4        3         ‘24’
10     5    42        45      3        1
10     5    43        44      4        3         ‘CSD 4212’
10     9    47        53      2        1
10     10   48        48      3        2         ‘105’
10     11   49        52      3        1
10     11   50        51      4        3         ‘XML’
10     9    54        60      2        1
10     11   55        58      3        1
10     11   56        57      4        3         ‘Cube’

From Table 1.1 it can be seen that SMX/R needs high RDB storage space to map an XML document, particularly a large XML document. There are three reasons for this. First, SMX/R maps the content of the whole XML document, i.e. all the information of the inner and leaf nodes which consists of elements, attributes, and texts, into a single table (Path_Table). Second, SMX/R contains a dangling column (NodeName) in the Path_Index_Table. Third, SMX/R maintains the hierarchical structure of the XML document by storing all the distinct path information for every node (11 paths). Consequently, redundant paths occur in the Path_Index_Table. The Ying-Cao approach (Ying et al. 2012) is another example of a model mapping approach that applies a path-based technique. Ying-Cao maps all the information of the inner and leaf nodes of the XML document (whole document) and the distinct path information for only leaf nodes (Leaf-Path) into four tables (File, Path, InnerNodes, and LeafNodes), as shown in Table 1.2.

Table 1.2 RDB scheme of Ying-Cao after mapping DBGroup.xml using path-based technique

(a) File Table

DocId  Name
10     DBGroup.XML

(b) Path Table

PathId  PathExp
1       DBGroup/Member/Name
2       DBGroup/Member/Age
3       DBGroup/Member/Office
4       DBGroup/Member/Office/Building
5       DBGroup/Member/Office/Room
6       DBGroup/Member/@Project
7       DBGroup/Project/@Id
8       DBGroup/Project/Title

(c) InnerNodes Table

DocId  INodeId  NodeName  ParentId  Level  SiblingNum
10     1        DBGroup   NULL      1      1
10     2        Member    1         2      1
10     3        Member    1         2      2
10     4        Member    1         2      3
10     5        Project   1         2      4
10     6        Project   1         2      5
10     7        Name      2         3      1
10     8        Age       2         3      2
10     9        Office    2         3      3
10     10       Name      3         3      1
10     11       Office    3         3      2
10     12       Name      4         3      1
10     13       Age       4         3      2
10     14       Office    4         3      3
10     15       Title     5         3      2
10     16       Title     6         3      1
10     17       Building  11        4      1
10     18       Room      11        4      2
10     19       @Project  4         3      4
10     20       @Id       5         3      1

(d) LeafNodes Table

DocId  LNodeId  PathId  ParentId  LeafValue
10     1        1       7         ‘Fervvac’
10     2        2       8         ‘23’
10     3        3       9         ‘CSD 4212’
10     4        1       10        ‘Daniel’
10     5        4       17        ‘CS’
10     6        5       18        ‘4215’
10     7        1       12        ‘Ryan’
10     8        2       13        ‘24’
10     9        3       14        ‘CSD 4212’
10     10       6       19        ‘105’
10     11       7       20        ‘105’
10     12       8       15        ‘XML’
10     13       8       16        ‘Cube’

As can be seen from Table 1.2, Ying-Cao also consumes high RDB storage space to map XML documents, and in particular, large XML documents. This happens for the following reasons. First, Ying-Cao has a complex RDB scheme, which consists


of more than two tables to map the content of the whole XML document. Second, Ying-Cao maps all the information of the inner nodes as data into the InnerNodes table, yet this information already exists as part of the path information for the leaf nodes. Third, Ying-Cao maintains the hierarchical structure of the XML document by storing all the distinct path information for every leaf node only. Consequently, redundant paths occur in the Path table. Therefore, the path-based technique in Ying-Cao stores all the inner nodes’ information in separate tables as data, but these inner nodes are already stored as part of the path information in the Path table. This leads to a greater number of tables and the usage of more storage space (Vaidya 2015).

In contrast, the node-based technique handles the content and hierarchical structure of the document by assigning node IDs or absolute positions to all nodes, storing all the node information of the document, and adding a parent ID field in the RDB scheme. This is considered the main concept of this technique (Vaidya 2015). In the case of Figure 1.2, the node-based technique maps all the information of the inner nodes (&1 to &6) and leaf nodes (&7 to &18) of the DBGroup document into the RDB scheme of the model mapping approaches that apply this technique. The s-XML approach (Subramaniam et al. 2012) is an example of a model mapping approach that applies a node-based technique. s-XML maps all the information of the inner and leaf nodes (whole document) into two tables (ParentTable and ChildTable), as shown in Table 1.3. s-XML uses a parent ID (LParent attribute) to handle the hierarchical structure of the document.

Table 1.3 RDB scheme of s-XML after mapping DBGroup.xml using node-based technique

(a) ParentTable

IdNode  pName    cName     Level  LParent  SelfLabel
1                DBGroup   0               [1,1]
2       DBGroup  Member    1      1        [1,1]
3       DBGroup  Member    1      1        [2,1]
4       DBGroup  Member    1      1        [3,1]
5       DBGroup  Project   1      1        [4,1]
6       DBGroup  Project   1      1        [5,1]
7       Member   Name      2      2        [1,1]
8       Member   Age       2      2        [2,1]
9       Member   Office    2      2        [3,1]
10      Member   Name      2      3        [1,1]
11      Member   Office    2      3        [2,1]
12      Member   Name      2      4        [1,1]
13      Member   Age       2      4        [2,1]
14      Member   Office    2      4        [3,1]
15      Project  Title     2      5        [2,1]
16      Project  Title     2      6        [1,1]
17      Office   Building  3      11       [1,1]
18      Office   Room      3      11       [2,1]
19      Member   @Project  2      4        [4,1]
20      Project  @Id       2      5        [1,1]

(b) ChildTable

IdNode  Level  pName     SelfLabel  LParent  Value
1       3      Name      [1,1]      7        ‘Fervvac’
2       3      Age       [1,1]      8        ‘23’
3       3      Office    [1,1]      9        ‘CSD 4212’
4       3      Name      [1,1]      10       ‘Daniel’
5       4      Building  [1,1]      17       ‘CS’
6       4      Room      [1,1]      18       ‘4215’
7       3      Name      [1,1]      12       ‘Ryan’
8       3      Age       [1,1]      13       ‘24’
9       3      Office    [1,1]      14       ‘CSD 4212’
10      3      @Project  [1,1]      19       ‘105’
11      3      @Id       [1,1]      20       ‘105’
12      3      Title     [1,1]      15       ‘XML’
13      3      Title     [1,1]      16       ‘Cube’

It is clear from Table 1.3 that s-XML needs high RDB storage space to map an XML document, and particularly a large XML document. It needs a lot of space for the following reasons. First, s-XML maps the content of the whole XML document, i.e. all the information of the inner and leaf nodes is mapped into two tables that have a high row size. Second, s-XML contains dangling columns, e.g. the pName column in ParentTable and ChildTable, and the IdNode and SelfLabel columns in the ChildTable. From the above, it can be concluded that recent model mapping approaches use ways to handle the content and hierarchical structure of XML documents in the RDB scheme that are not effective. The use of these ways leads to an increase in the complexity of the RDB scheme design (dangling tables and columns) and the presence of redundancy or unnecessary information and paths mapped (records mapped) in the


RDB scheme. This leads to an increase in RDB storage space with less scalability when XML documents become bigger. In other words, the cost of RDB storage space and scalability is very high, particularly for large XML documents. Therefore, this study aims to propose a correct and complete mapping approach (XML mapping algorithm) that can handle the content and hierarchical structure of XML documents in its RDB scheme efficiently to address the storage issue. This can be achieved by reducing the complexity of the RDB scheme design (reducing dangling tables and columns) and removing redundant or unnecessary data and mapped paths (reducing records mapped) in the RDB scheme. This will result in less RDB storage space with high scalability when XML documents become bigger. It is also necessary to reduce the space needed to store XML documents in order to improve data querying performance (Dweib & Lu 2012; Subramaniam et al. 2012).

The most important factor in efficient query mapping is the time required to produce a result when the translated SQL query is executed (Raut & Atique 2014). The effectiveness of path-based and node-based techniques is measured based on how efficiently they maintain the P-C and A-D relationships among the nodes of the XML document (Subramaniam et al. 2012). Path-based and node-based techniques handle P-C and A-D relationships in the translated SQL query for an XPath query over the RDB scheme differently according to the approach in which they are used. To explain the querying limitations clearly, let us take a type of XPath query, namely a twig query with ‘/’, to evaluate the branching query associated with P-C relationships. Referring to Figure 1.2, consider the twig query (Q) given below, whose tree representation is shown in Figure 1.3:

Q: Retrieve members’ names where members’ age is ‘24’

/DBGroup/Member/Name [Age = ‘24’]    (twig query)


Figure 1.3 Tree representation for twig query (Q)

SMX/R, which uses a path-based technique, translates the twig query (Q) into the corresponding SQL query as shown in Figure 1.4.

SELECT v2.NodeValue
FROM Path_Table e, Path_Index_Table p1, Path_Index_Table p2, Path_Index_Table p3,
     Path_Table v1, Path_Table v2
WHERE p1.Path = '.DBGroup.Member'
  AND p2.path = '.DBGroup.Member.Age'
  AND p3.path = '.DBGroup.Member.Name'
  AND e.PID = p1.PID
  AND v1.PID = p2.PID
  AND v2.PID = p3.PID
  AND e.StartPos < v1.StartPos AND e.EndPos > v1.EndPos
  AND e.StartPos < v2.StartPos AND e.EndPos > v2.EndPos
  AND v1.NodeValue = 24
  AND v2.NodeType = 3;

Figure 1.4 Translated SQL query for twig query (Q) using SMX/R
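The translated query of Figure 1.4 can be replayed against the rows of Table 1.1 with a small Python sketch that uses the standard `sqlite3` module. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the RDBMS used in the thesis, the NodeName column is left out because the query does not use it, the path literals are written without a leading dot so that they match the Path values listed in Table 1.1, and the age is compared as the string '24' because NodeValue is stored as text here.

```python
import sqlite3

# Path values copied from Table 1.1(a); the list index + 1 is the PID.
PATHS = [
    "DBGroup", "DBGroup.Member", "DBGroup.Member.Name", "DBGroup.Member.Age",
    "DBGroup.Member.Office", "DBGroup.Member.Office.Building",
    "DBGroup.Member.Office.Room", "DBGroup.Member.@Project=105",
    "DBGroup.Project", "DBGroup.Project.@Id=105", "DBGroup.Project.Title",
]

# (PID, StartPos, EndPos, NodeLVL, NodeType, NodeValue) copied from Table 1.1(b).
ROWS = [
    (1, 1, 61, 1, 1, None), (2, 2, 15, 2, 1, None), (3, 3, 6, 3, 1, None),
    (3, 4, 5, 4, 3, "Fervvac"), (4, 7, 10, 3, 1, None), (4, 8, 9, 4, 3, "23"),
    (5, 11, 14, 3, 1, None), (5, 12, 13, 4, 3, "CSD 4212"),
    (2, 16, 31, 2, 1, None), (3, 17, 20, 3, 1, None), (3, 18, 19, 4, 3, "Daniel"),
    (5, 21, 30, 3, 1, None), (6, 22, 25, 4, 1, None), (6, 23, 24, 5, 3, "CS"),
    (7, 26, 29, 4, 1, None), (7, 27, 28, 5, 3, "4215"),
    (2, 32, 46, 2, 1, None), (8, 33, 33, 3, 2, "105"), (3, 34, 37, 3, 1, None),
    (3, 35, 36, 4, 3, "Ryan"), (4, 38, 41, 3, 1, None), (4, 39, 40, 4, 3, "24"),
    (5, 42, 45, 3, 1, None), (5, 43, 44, 4, 3, "CSD 4212"),
    (9, 47, 53, 2, 1, None), (10, 48, 48, 3, 2, "105"), (11, 49, 52, 3, 1, None),
    (11, 50, 51, 4, 3, "XML"), (9, 54, 60, 2, 1, None), (11, 55, 58, 3, 1, None),
    (11, 56, 57, 4, 3, "Cube"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Path_Index_Table (PID INTEGER PRIMARY KEY, Path TEXT)")
conn.execute(
    "CREATE TABLE Path_Table (DocID INTEGER, PID INTEGER, StartPos INTEGER, "
    "EndPos INTEGER, NodeLVL INTEGER, NodeType INTEGER, NodeValue TEXT)"
)
conn.executemany("INSERT INTO Path_Index_Table VALUES (?, ?)",
                 list(enumerate(PATHS, start=1)))
conn.executemany("INSERT INTO Path_Table VALUES (10, ?, ?, ?, ?, ?, ?)", ROWS)

# Figure 1.4 with the two adaptations described in the lead-in.
TWIG_SQL = """
SELECT v2.NodeValue
FROM Path_Table e, Path_Index_Table p1, Path_Index_Table p2, Path_Index_Table p3,
     Path_Table v1, Path_Table v2
WHERE p1.Path = 'DBGroup.Member'
  AND p2.Path = 'DBGroup.Member.Age'
  AND p3.Path = 'DBGroup.Member.Name'
  AND e.PID = p1.PID AND v1.PID = p2.PID AND v2.PID = p3.PID
  AND e.StartPos < v1.StartPos AND e.EndPos > v1.EndPos
  AND e.StartPos < v2.StartPos AND e.EndPos > v2.EndPos
  AND v1.NodeValue = '24' AND v2.NodeType = 3
"""
names = [row[0] for row in conn.execute(TWIG_SQL)]
print(names)  # ['Ryan']
```

The query joins six table instances, and the two pairs of StartPos/EndPos comparisons are what tie the Age node and the Name node to the same Member element.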

According to Figure 1.4, SMX/R answers the twig query (Q) by extracting three sub-paths from Q: a path for the branching node ‘/DBGroup/Member’, a path for the node that has a value ‘DBGroup/Member/Age’, and a path for the node whose value is to be retrieved ‘DBGroup/Member/Name’. Then, SMX/R matches these sub-paths with the path information in the Path_Index_Table using the ‘=’ operator in order to get their path IDs (PIDs), which are used to look up the value of one node (Name) by the other node (Age) in the Path_Table. SMX/R takes a long time to execute Q because it requires a large search space to get the PID in the Path_Index_Table, which stores distinct path information for every node of the document. SMX/R also requires a large search space to match the PID and find the data in the Path_Table, which stores all the information about the elements, attributes, and texts. Hence, a number of join operations and containment relationships using ϴ-join (<, >) operators are needed in order to maintain the P-C and A-D relationships, which is costly in a relational database management system (RDBMS).

On the other hand, s-XML, which uses a node-based technique, translates the twig query (Q) into the corresponding SQL query as shown in Figure 1.5.

SELECT Value FROM ChildTable
WHERE LParent in (
    SELECT distinct IdNode from ParentTable
    WHERE Cname = 'Name'
      and Lparent in (
        SELECT distinct IdNode FROM ParentTable
        WHERE Cname= 'Member' AND Pname = 'DBGroup' )
      AND LParent in (
        SELECT distinct LParent FROM ParentTable
        WHERE Cname = 'Age'
          AND LParent in (
            SELECT distinct IdNode FROM ParentTable
            WHERE Cname= 'Member' AND Pname = 'DBGroup' )
          AND IdNode in (
            SELECT distinct LParent FROM ChildTable
            WHERE Value = 24 )))

Figure 1.5 Translated SQL query for twig query (Q) using s-XML
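The nested translation of Figure 1.5 can likewise be replayed against the rows of Table 1.3 with a short Python sketch. As before, this is only an illustration under stated assumptions: an in-memory SQLite database stands in for the RDBMS used in the thesis, the SelfLabel and Level columns are left out because the query does not use them, and the age is compared as the string '24' because Value is stored as text here.

```python
import sqlite3

# (IdNode, pName, cName, LParent) copied from Table 1.3(a).
PARENT_ROWS = [
    (1, None, "DBGroup", None), (2, "DBGroup", "Member", 1),
    (3, "DBGroup", "Member", 1), (4, "DBGroup", "Member", 1),
    (5, "DBGroup", "Project", 1), (6, "DBGroup", "Project", 1),
    (7, "Member", "Name", 2), (8, "Member", "Age", 2), (9, "Member", "Office", 2),
    (10, "Member", "Name", 3), (11, "Member", "Office", 3),
    (12, "Member", "Name", 4), (13, "Member", "Age", 4), (14, "Member", "Office", 4),
    (15, "Project", "Title", 5), (16, "Project", "Title", 6),
    (17, "Office", "Building", 11), (18, "Office", "Room", 11),
    (19, "Member", "@Project", 4), (20, "Project", "@Id", 5),
]

# (IdNode, LParent, Value) copied from Table 1.3(b).
CHILD_ROWS = [
    (1, 7, "Fervvac"), (2, 8, "23"), (3, 9, "CSD 4212"), (4, 10, "Daniel"),
    (5, 17, "CS"), (6, 18, "4215"), (7, 12, "Ryan"), (8, 13, "24"),
    (9, 14, "CSD 4212"), (10, 19, "105"), (11, 20, "105"),
    (12, 15, "XML"), (13, 16, "Cube"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ParentTable (IdNode INTEGER, pName TEXT, "
             "cName TEXT, LParent INTEGER)")
conn.execute("CREATE TABLE ChildTable (IdNode INTEGER, LParent INTEGER, Value TEXT)")
conn.executemany("INSERT INTO ParentTable VALUES (?, ?, ?, ?)", PARENT_ROWS)
conn.executemany("INSERT INTO ChildTable VALUES (?, ?, ?)", CHILD_ROWS)

# Figure 1.5, with the age compared as text.
NESTED_SQL = """
SELECT Value FROM ChildTable
WHERE LParent IN (
    SELECT DISTINCT IdNode FROM ParentTable
    WHERE cName = 'Name'
      AND LParent IN (
        SELECT DISTINCT IdNode FROM ParentTable
        WHERE cName = 'Member' AND pName = 'DBGroup')
      AND LParent IN (
        SELECT DISTINCT LParent FROM ParentTable
        WHERE cName = 'Age'
          AND LParent IN (
            SELECT DISTINCT IdNode FROM ParentTable
            WHERE cName = 'Member' AND pName = 'DBGroup')
          AND IdNode IN (
            SELECT DISTINCT LParent FROM ChildTable
            WHERE Value = '24')))
"""
names = [row[0] for row in conn.execute(NESTED_SQL)]
print(names)  # ['Ryan']
```

The five levels of subquery in this sketch correspond to the repeated parent ID lookups that the translated query needs in order to walk from the value ‘24’ up to the Member element and back down to its Name.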

According to Figure 1.5, s-XML answers the twig query (Q) by using the LParent attribute recursively in the translated query to form nested queries in order to maintain the P-C and A-D relationships among the nodes. This leads to an additional computation processing cost, which is also costly in a RDBMS.

Therefore, these recent approaches use ways that are not effective in maintaining the P-C and A-D relationships among the nodes of the document in the translated SQL query for a certain XPath query over the RDB scheme. The ways result in the use of a high number of join operations, nested queries (recursively), and SQL functions in the translated SQL queries. This leads to an increase in the search space inside the RDB scheme. This in turn has an adverse effect on the time taken to execute and respond to queries, particularly in the case of large XML documents.


Therefore, this study aims to propose a correct and complete mapping approach (query mapping algorithm) to translate different types of XPath query into their corresponding SQL query efficiently in order to address the querying issue. In other words, the approach will be designed to maintain the P-C and A-D relationships among the nodes of the XML document in the translated SQL queries efficiently. This will be achieved by reducing the search space in the RDB scheme, reducing the number of join operations, removing the nesting queries, and removing the SQL functions used inside the translated SQL queries. This will lead to a reduction in the time taken to execute the queries, i.e. less query response time.

Vaidya (2015) indicates that the path materialization approach (path-based technique) is the best solution for query execution compared to other techniques such as node-based and edge-based. The path-based technique requires fewer join operations in the translated SQL query. In light of the above discussion, this study has chosen to develop the path-based technique to handle these limitations in relation to storing and querying XML documents using a RDB. The study proposes a mapping approach that handles the content and hierarchical structure of the document efficiently. The proposed approach stores only the information for all the leaf nodes with their distinct ancestor paths, i.e. it stores one ancestor path for all leaf nodes that have the same parent. In addition, instead of storing all the information of the inner nodes in separate tables, the proposed approach stores them inside the ancestor paths of the leaf nodes. Therefore, the proposed approach reduces the storage space required for paths and data, the search space inside the RDB, and the number of join operations in the translated SQL query for an XPath query.

1.5 RESEARCH QUESTIONS

In light of the foregoing, this study is motivated to propose a mapping approach (XAncestor) to store XML documents in a RDB scheme with less RDB storage space and high scalability when the XML documents become bigger. In order to achieve this, the first aim of this study is to map (shred and decompose) the content and hierarchical structure of the XML document into an optimal fixed and predefined RDB scheme by reducing the complexity of the RDB scheme design (free of dangling tables and columns) and removing redundant or unnecessary data and paths (reducing the number of records mapped). The design of the optimal RDB scheme and how it affects the quality of the mapping process and query performance will be discussed in Chapter 4.

In brief, recent approaches such as SMX/R (Ning & De 2010), XRecursive (Fakharaldien et al. 2011b), Ying-Cao (Ying et al. 2012), s-XML (Subramaniam et al. 2012), and XMap (Bousalem & Cherti 2015) propose fixed and predefined RDB schemes to maintain and map the content and hierarchical structure of XML documents. However, all these approaches have some limitations. For instance, XRel, Ying-Cao, and XMap have limitations in terms of mapping XML documents into the RDB scheme because they store the whole document (all inner and leaf nodes) in more than two tables. Therefore the complexity of their RDB design is high. In addition, SMX/R and Ying-Cao store distinct path information from a root-to-any node or from a root-to-any leaf, respectively. Also, s-XML and XRecursive contain redundant and dangling columns. All these issues lead to an increase in the RDB storage space and degrade its scalability when the XML documents become bigger. Hence the first research question (RQ1) is:

RQ1: How can the content and hierarchical structure of an XML document be mapped (shredded and decomposed) into an optimal fixed and predefined RDB scheme in order to get less storage space with high scalability when the XML documents become bigger?

This study is also motivated to propose an approach (XAncestor) that can query XML documents that are mapped into a RDB scheme in less query response time, which is an indicator of query performance.
Different types of XPath query, including short simple path expression queries, long simple path expression queries, queries with one ‘//’, queries with two ‘//’, twig queries with ‘/’, and twig queries with ‘//’, are commonly selected in related research studies (Ying et al. 2012; Wu et al. 2011; Marks & Roantree 2009; Soltan & Rahgozar 2006; Yang et al. 2006; Schmidt et al. 2002; Yoshikawa et al. 2001) in order to evaluate the efficiency of their query mapping algorithms. These types of XPath query are categorized based on their complexity, i.e. from simple path query to twig query, in order to evaluate the P-C and A-D relationships among the XML nodes (Haw & Lee 2011). In the proposed approach, these types of XPath query are mapped into their corresponding SQL queries, which are executed more quickly based on a predefined RDB scheme by reducing the search space over the RDB scheme, reducing nested queries (parent ID recursively), and reducing the number of join operations in the translated SQL query. This results in efficiently maintaining the P-C and A-D relationships among the XML nodes.

Recent approaches such as SMX/R (Ning & De 2010), XRecursive (Fakharaldien et al. 2011b), Ying-Cao (Ying et al. 2012), s-XML (Subramaniam et al. 2012), and XMap (Bousalem & Cherti 2015) do not maintain the P-C and A-D relationships among XML nodes efficiently. This is because they translate different types of XPath query into an SQL query by using a high number of join operations and nested queries. This leads to an increase in the search space inside the RDB scheme, which has an adverse effect on the time taken to execute and respond to queries, i.e. it degrades query performance. Hence, the second research question is:

RQ2: How can different types of XPath query, including short simple path expression queries, long simple path expression queries, queries with one ‘//’, queries with two ‘//’, twig queries with ‘/’, and twig queries with ‘//’, be mapped into their corresponding SQL query based on the predefined RDB scheme in order to reduce the query response time, i.e. improve the query performance?
Thus, this study aims to answer two questions (RQ1, RQ2) by developing a proposed mapping approach (XAncestor) as a prototype that requires less RDB storage for different sizes of XML documents and that requires less time to execute and respond to different types of XPath query in XML documents.
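The storage idea behind these two questions can be sketched as follows. This is a minimal illustration under stated assumptions, not the XAncestor scheme itself, which is defined in Chapter 4: the table names `ancestor_path` and `leaf_node`, their columns, the function `shred`, and the second `Member` record in the sample document are all invented for the example. The sketch stores each distinct ancestor path once, stores only leaf nodes as rows, and answers a simple path query with a single join.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sample document; the second Member is invented test data.
XML = """<DBGroup>
  <Member><Name>Fervvac</Name><Age>23</Age><Office>CSD 4212</Office></Member>
  <Member><Name>Alice</Name><Age>31</Age><Office>CSD 4215</Office></Member>
</DBGroup>"""


def shred(xml_text, conn):
    """Store leaf nodes plus one row per distinct ancestor path."""
    conn.execute(
        "CREATE TABLE ancestor_path ("
        "path_id INTEGER PRIMARY KEY, path TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE leaf_node (leaf_id INTEGER PRIMARY KEY, "
        "path_id INTEGER, name TEXT, value TEXT)"
    )

    def walk(elem, ancestors):
        children = list(elem)
        if not children:  # leaf node: record it under its ancestor path
            path = "/" + "/".join(ancestors)
            conn.execute(
                "INSERT OR IGNORE INTO ancestor_path (path) VALUES (?)",
                (path,),
            )
            (path_id,) = conn.execute(
                "SELECT path_id FROM ancestor_path WHERE path = ?", (path,)
            ).fetchone()
            conn.execute(
                "INSERT INTO leaf_node (path_id, name, value) VALUES (?, ?, ?)",
                (path_id, elem.tag, (elem.text or "").strip()),
            )
        for child in children:  # inner nodes live only inside the paths
            walk(child, ancestors + [elem.tag])

    walk(ET.fromstring(xml_text), [])


conn = sqlite3.connect(":memory:")
shred(XML, conn)

path_count = conn.execute("SELECT COUNT(*) FROM ancestor_path").fetchone()[0]
leaf_count = conn.execute("SELECT COUNT(*) FROM leaf_node").fetchone()[0]

# /DBGroup/Member/Name becomes one join and two equality tests.
names = [
    row[0]
    for row in conn.execute(
        "SELECT l.value FROM leaf_node AS l "
        "JOIN ancestor_path AS p ON p.path_id = l.path_id "
        "WHERE p.path = '/DBGroup/Member' AND l.name = 'Name' "
        "ORDER BY l.leaf_id"
    )
]
```

Six leaf nodes share a single stored path in this example. The sketch deliberately omits the information needed to tell which `Member` instance a leaf belongs to, so it cannot answer twig queries as written.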


This study evaluates the effectiveness and scalability of the proposed XAncestor approach by comparing it with five recent related approaches: XRel (Yoshikawa et al. 2001), SMX/R (Ning & De 2010), XRecursive (Fakharaldien et al. 2011a), Ying-Cao (Ying et al. 2012), and s-XML (Subramaniam et al. 2012). Three of these approaches use a path-based technique, and two use a node-based technique to address the issues of RDB storage space and query response time. In addition, this study also proves the correctness and the completeness of the XAncestor approach (prototype) by comparing it with these five approaches in terms of storing and querying XML documents in a RDB scheme. Hence, the third research question is a threefold question:

RQ3: How can a mapping approach (XAncestor) be developed as a prototype based on the results from RQ1 and RQ2; how can its effectiveness and scalability be compared; and how can it be proved that it is a correct and complete approach in terms of RDB storage space and query response time?

1.6 RESEARCH OBJECTIVES

Based upon the problem previously discussed and the above three research questions, the objectives of the study are as follows:

RO1: To propose an XML mapping algorithm (XtoDB) for mapping different sizes of XML document into a fixed and predefined RDB scheme using a path-based technique in order to improve RDB storage space and scalability when XML documents become bigger.

RO2: To propose a query mapping algorithm (XtoSQL) for translating different types of XPath query, including short simple path expression queries, long simple path expression queries, queries with one ‘//’, queries with two ‘//’, twig queries with ‘/’, and twig queries with ‘//’, into SQL queries based on the RDB scheme using a path-based technique in order to reduce the query response time, i.e. improve query performance.


RO3: To develop a mapping approach (XAncestor) as a prototype based on the results of RO1 and RO2; to evaluate and compare its effectiveness and scalability; and to prove that the prototype is a correct and complete approach in terms of RDB storage space and query response time by comparing it with five recent related approaches.

1.7 RESEARCH SCOPE

Nowadays, the volume of data is increasing rapidly and it is available in different types such as structured, unstructured, and semi-structured. This data is exchanged between enterprises and individuals over the Internet, which is defined as a big data environment (Fu et al. 2015; Hashem et al. 2015; Chen et al. 2014; Katal et al. 2013). Extensible markup language (XML) is considered an important standard medium for exchanging and representing semi-structured data on the Internet (big data environment) (Bousalem & Cherti 2015; Vaidya 2015; Dweib & Lu 2012; Hong & Chen 2012; Subramaniam et al. 2012). Therefore, XML data management has become an important research issue (Bousalem & Cherti 2015; Mourya & Saxena 2015; Yuan et al. 2013; Wang et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011; Dweib et al. 2009).

Numerous researchers (Bousalem & Cherti 2015; Vaidya 2015; Raut & Atique 2014; Yuan et al. 2013; Dweib & Lu 2012; Subramaniam et al. 2012; Wang et al. 2012; Fakharaldien et al. 2011a; Ahmad 2011; Haw & Lee 2011) have introduced different solutions to manage the storing and querying of XML documents, including the file system, NXD, OODB, and RDB. Many of these approaches (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Fakharaldien et al. 2011c; Ahmad 2011) use a RDB to store and query XML documents because it is widely used as a storage and retrieval solution in the business domain. The existing approaches work by mapping between hierarchically ordered XML and tabular unordered RDB and draw on the advantages of both technologies. Some researchers (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Dweib & Lu 2012; Wang et al. 2012) have classified these approaches based on the schema definition and the RDB scheme design into two categories: structure mapping and model mapping.

This research study focuses only on managing the storing and querying of one type of data in the big data environment, namely semi-structured data (XML documents) with a data size of one megabyte and above. In addition, this research study focuses only on model mapping approaches (also called schemaless or generic mapping approaches) for storing and querying XML documents that use a RDB, and not structure mapping approaches, for the following reasons:

Model mapping approaches support any modern XML applications whether static (the DTDs are unchanged) or dynamic (the DTDs are changed frequently). These applications cannot be used with structure mapping approaches.



Model mapping approaches support well-formed XML applications (without a schema for data description) on the Web. In contrast, structure mapping approaches only support XML documents with a DTD or XML-schema structure.



The expressive power of database models is not needed to support XML documents in model mapping approaches. The database model in structure mapping approaches does not include constructs to describe the elements in the DTD content of the XML document. Thus, structure mapping approaches need to extend the expressive power of database models in order to map the data structures of XML documents into a RDB schema in a natural way.

Also, for querying an XML document, XPath is considered the most popular query language for expressing queries over XML data (Vainio & Junkkari 2014; Berglund et al. 2003). This research study focuses on XPath as the standard XML query language for the following reasons:

XPath is an XML standard query language that is recommended by W3C.


The functionality of XPath is covered by the expressive power of many query languages.



XPath is mandatory for the future standard query language.

Finally, because path expressions frequently appear in XML queries (XPath), it is essential to store and handle the path information of the XML document efficiently. However, storing all the path information for every tag (node) in an XML document incurs a lot of storage overhead because it creates redundant path information, in particular for a database with large XML documents. This increased data volume results in increased query response time, i.e. poor query performance (Vaidya 2015; Dweib & Lu 2012; Jeong & Lee 2006). Vaidya (2015) indicates that the path-based technique is the best solution for query execution compared to other techniques such as node-based and edge-based techniques. Therefore, this study has chosen to focus on developing a path-based technique that can be applied to a model mapping approach in order to handle the issues related to the storing and querying of XML documents using a RDB.

1.8 RESEARCH METHODOLOGY

Finding an appropriate research methodology is one of the most important issues for the success of any research project. The appropriate research methodology helps the researcher to establish objectives and achieve them successfully to make a contribution to their field of study. The aim of this study’s methodology is to provide a pathway to achieve the objectives of this study by identifying the research problem, i.e. discovering the research gap by analysing all previous relevant works. The identified problem then requires a solution which needs to be investigated carefully. The proposed solution then needs to be developed and implemented before it can be evaluated and compared with existing related works. These are the main steps in the research methodology.

For the purposes of this study, a generic system development research methodology, namely design-based research (DBR) (Herrington et al. 2007; Wang & Hannafin 2005; Barab & Squire 2004), is extended to meet the nature and needs of the research. This study adopts the DBR methodology to solve the research problem by means of a proposed mapping approach (XAncestor), which consists of two algorithms (XtoDB, XtoSQL) that are developed into a prototype. The research methodology for this study consists of four main phases (Reeves 2006), which will be explained in detail in Chapter 3. An overview of these phases is presented below:

Problem Identification: In this phase, the literature on mapping XML documents into a RDB is reviewed. In addition, a pre-experiment is conducted on recent model mapping approaches. The pre-experiment evaluates and analyses the approaches in terms of how they handle the storage issue. It does this by evaluating the RDB storage space and its scalability when XML documents become bigger. The pre-experiment also evaluates and analyses the approaches in terms of how they handle the querying issue (query response time) based on their RDB. The results of these evaluations lead to the identification of the research problem, research questions, and research objectives.



Proposed Approach Design: In this phase, a solution (new approach) for the identified research problems is proposed. In addition, the mathematical concepts and theories that are used in the proposed approach are identified. Then the proposed mapping approach (XAncestor), including its components and algorithms, is designed. Each part of the proposed approach, including the algorithms, is verified and validated with reference to the literature covered in the previous phase.



Prototype Development: In this phase, the proposed approach is developed as a prototype, namely the XAncestor prototype. This XAncestor prototype is developed using the Rapid Application Development (RAD) methodology. This methodology is a system development methodology created to radically decrease the time needed to design and implement information systems in an iterative manner (Dennis et al. 2012; Joseph et al. 2011). In RAD, every dedicated algorithm is programmed separately and tested alone. Then, these algorithms are joined to build the prototype.

Prototype Evaluation: In this phase, the evaluation environment, measurements, datasets, and types of XPath query are identified based on the standards in the literature. Then, a testing process is conducted. The results generated by the proposed XAncestor approach (prototype) are then compared to the results generated by recent related approaches.


1.9 THESIS OUTLINE

This thesis is organized into six chapters: 

Chapter 1 serves as an essential introduction to the study.



Chapter 2 reviews the XML document, XML data graph, XML query, and related literature on the various approaches for mapping XML to relations in terms of handling storing and querying issues. The discussion in this chapter leads to the formulation of this study.



Chapter 3 explains in detail the research methodology, starting from identifying the problem to achieving the objectives of this study.



Chapter 4 gives a full description of the proposed new mapping approach for mapping XML documents into a RDB in terms of handling the storing and querying issues using a path-based technique.



Chapter 5 evaluates the performance (effectiveness and scalability) of the proposed approach by comparing the RDB storage space and query response time generated by the proposed approach with the RDB storage space and query response time generated by recent related approaches.



Chapter 6 concludes the thesis and comprises a list of general and specific contributions in the three studied areas. Some conclusions and suggestions for further research are also made.

The organization of the thesis is illustrated in Figure 1.6.


[Figure 1.6: sequence of chapters: Chapter I Introduction; Chapter II Literature Review; Chapter III Research Methodology; Chapter IV XAncestor: A Mapping Approach; Chapter V Results and Discussion; Chapter VI Conclusion and Future Works]

Figure 1.6 Organization of thesis


CHAPTER II

LITERATURE REVIEW

2.1 INTRODUCTION

The main challenge in developing an approach to map extensible markup language (XML) to relations is to resolve the conflict between the hierarchical nature of the XML model and the flat presentation of the relational model that consists of rows and columns (Haw & Lee 2011; Atay et al. 2010). Therefore, to understand the advantages and drawbacks of existing mapping approaches, both XML and relational data representation models are discussed in this chapter. Since the existing mapping approaches are not efficient enough at storing and querying XML documents, in terms of RDB storage space, scalability when XML documents become bigger, and query response time (query performance), a formal mapping approach is very much needed.

In this chapter, which presents a review of the literature relevant to the research topic, the preliminaries regarding XML, the XML data model, and XML query languages are presented in Section 2.2. Then, various ways of managing XML data are discussed in Section 2.3. Since RDB technology is the most efficient way to manage data (Bousalem & Cherti 2015; Mourya & Saxena 2015; Fakharaldien et al. 2011a; Ahmad 2011; Haw & Lee 2011; Fan & Ma 2006; Tian et al. 2002; Florescu & Kossmann 1999), there have been tremendous efforts to find ways to use this technology to manage XML. Various approaches for mapping XML can be found in the literature, which can be classified into two different categories: structure mapping (also called schema-based mapping) and model mapping (also called schemaless-based or generic mapping) approaches. These existing approaches, their advantages, and their limitations are reviewed in detail in Section 2.4. Then the existing mapping approaches that have attempted to address the issues of RDB storage space and query response time are reviewed in Section 2.5. Finally, a summary of this chapter is presented in Section 2.6.

2.2 PRELIMINARIES

Some preliminaries on XML, the XML data model, and XML query languages are presented in the following subsections.

2.2.1 Extensible Markup Language (XML)

Extensible markup language or XML, which is recommended by the World Wide Web Consortium (W3C) (Bray et al. 1998; Bray et al. 2008), has rapidly become the dominant standard for data exchange and representation on the Web (Mourya & Saxena 2015; Raut & Atique 2014; Subramaniam et al. 2012; Fakharaldien et al. 2011b; Qin et al. 2005; Jiang et al. 2002). Extensible markup language helps in storing structured data and offers an independent form to describe the information. Consequently, the exchange of data between one platform and another is a simple process. Through its characteristics XML has enabled communication among diverse computing frameworks, which would have been difficult or impossible in the recent past. By using XML, a general structure for exchanging data is provided, irrespective of the data model and the form of the application. Therefore, XML has received widespread support and achieved extensive adoption among all the leading vendors of software tools, servers, and databases (Adam et al. 2012; Fakharaldien et al. 2012).

In addition, data can be transmitted to remote services anywhere on the Internet using XML-based Web services to take advantage of the new ubiquity of connected software applications (Fakharaldien et al. 2012). The amount of data on the Web is increasing rapidly, which is leading to the generation of big data (Fu et al. 2015; Katal et al. 2013; Chen et al. 2012). The openness of XML allows such data to be exchanged between virtually any hardware, software, or operating system (Fakharaldien et al. 2011a; Augeri et al. 2007). Simply, XML facilitates information interchange between enterprises operating on diverse platforms without restriction in the big data environment that exists across the Internet.

An XML document mainly consists of a strictly nested hierarchy of elements with a single root. Elements can contain character data, child elements or a mixture of both. Each element is written based on its type and may contain one or more attributes. Attributes contain only data, i.e. they cannot contain elements. Each element is bounded by matching starting and ending tags such as <Name> and </Name>. Figure 2.1 shows a part of an XML document (Jiang et al. 2002), which is later utilized as a motivational example when reviewing the existing model mapping approaches in Section 2.4.

Figure 2.1 Part of XML document (Jiang et al. 2002)

2.2.2 XML Data Model

An XML document can be modelled as a rooted, nested, hierarchical, ordered edge-labelled data tree or a node-labelled data tree (Haw & Lee 2011). These two models (trees) are similar except for the label location; the labels of the edge-labelled model are on the edges, whereas the labels of the node-labelled model are on the nodes. The node-labelled model is used in most research. The edge-labelled model is used in some situations, such as when representing most structural indices and shredding XML data into a RDB. Figure 2.2(a) and (b) shows an example of the edge-labelled model and the node-labelled model, respectively, based on the XML document in Figure 2.1.
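The difference in label location can be illustrated with a small sketch. The sample document below is an assumption reconstructed from the element names and text values mentioned in this section, and the functions `node_labelled` and `edge_labelled` are hypothetical helpers written for this illustration. Text nodes are left out to keep the output short.

```python
import itertools
import xml.etree.ElementTree as ET

# Assumed sample document, reconstructed from the names used in the text.
XML = (
    "<DBGroup><Member><Name>Fervvac</Name><Age>23</Age>"
    "<Office>CSD 4212</Office></Member></DBGroup>"
)


def node_labelled(root):
    """Return (node_id, label, parent_id) rows: the label sits on the node."""
    ids = itertools.count(1)
    rows = []

    def visit(elem, parent_id):
        node_id = next(ids)
        rows.append((node_id, elem.tag, parent_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return rows


def edge_labelled(root):
    """Return (source_id, label, target_id) rows: the label sits on the edge."""
    ids = itertools.count(1)
    rows = []

    def visit(elem, source_id):
        target_id = next(ids)
        rows.append((source_id, elem.tag, target_id))
        for child in elem:
            visit(child, target_id)

    visit(root, 0)  # 0 stands for the unlabelled object above the root
    return rows


root = ET.fromstring(XML)
node_rows = node_labelled(root)
edge_rows = edge_labelled(root)
```

Both functions walk the same tree in document order. Only the place where the tag name is attached differs, which is why the two row sets carry the same structural information.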


Figure 2.2 XML data models: (a) edge-labelled and (b) node-labelled

Previous studies have identified four main node types in the XML data model (tree) (Yoshikawa et al. 2001) as shown in Figure 2.2. The first node type is the root node, which represents the XML tree root. The second node type is the element node of the XML tree. An element node has an expanded name, which is the element type name specified in the tag. The element has zero or more children. Each child node is either an element node or a text node. In Figure 2.2 there are five element nodes: DBGroup, Member, Name, Age, and Office. These elements are regarded as tag names in the XML document shown in Figure 2.1. Figure 2.2(a) shows these tag names as labels of the XML tree edges, which are applied in the Edge approach (Florescu & Kossmann 1999). Figure 2.2(b) shows these five elements as labels of the XML tree nodes, which are applied in the XRel approach (Yoshikawa et al. 2001).

The third node type is the attribute node, which is related to the element node. An attribute node is associated with an element node but is not regarded as one of its children. An attribute name and an attribute value are assigned to each attribute node. Moreover, attribute nodes have no child nodes. For simplicity, the part of the XML document in Figure 2.1 does not include this node type. The fourth and last node type is the text node, which represents a string-value. A text node does not have an expanded name and has no child nodes. Figure 2.2 shows trees with three text nodes: ‘Fervvac’, ‘23’, and ‘CSD 4212’.

2.2.3 XML Query Languages

There has been extensive research into XML query languages. Some examples of these languages include Quilt (Chamberlin et al. 2001), XQuery (Boag et al. 2002), and XPath (Berglund et al. 2003) as well as XML query algebras such as XAL (Frasincar et al. 2002), YATL (Christophides et al. 2000), and Lore (McHugh et al. 1997). An XML algebra provides a solid grounding in determining the semantics of the query language due to its power of expression, whereas an XML query language provides a way to extract and manipulate data from XML documents. The majority of query languages differ in terms of aspects of their grammar and representation; however, they all share a common characteristic in that the queries utilize path expressions for query evaluation (Yao & Zhang 2004).

XPath is recommended by W3C and considered to be a major language for querying XML documents using path expressions (Vainio & Junkkari 2014; Dweib & Lu 2012; Berglund et al. 2003). XPath selects the nodes from an XML document from the root path (Vaidya 2015). It navigates through the XML document to get the results matching the predicates and conditions mentioned in the expression (Jandhyala 2006). XPath uses a library of standard functions such as numeric functions, string functions, and functions on nodes, sequences, and contexts, which help users to specify simpler queries. Table 2.1 provides examples of some XPath expressions for querying the XML document shown in Figure 2.1.

Table 2.1 Examples of some XPath expressions

| XPath Expression | Description |
| --- | --- |
| ./Member | Selects all <Member> elements within the current context |
| //Member | Selects all <Member> elements in the document |
| DBGroup/Member/Name | Selects all <Name> elements that are children of a <Member> element |
| DBGroup//Age | Selects all <Age> elements one or more levels deep in the <DBGroup> element (arbitrary descendants) |
| DBGroup/*/Office | Selects all <Office> elements that are grandchildren of <DBGroup> elements |
| DBGroup/Member[1] | Selects the first <Member> element that is a child of the <DBGroup> element |
| DBGroup/Member/* | Selects all elements that are children of the <Member> element that is a child of the <DBGroup> element |
| DBGroup/Member/Name[Age = 23] | Selects all <Name> elements of the <Member> elements of the <DBGroup> element that have an <Age> element with a value equal to 23 |
| //Name[Age > 20] | Selects all <Name> elements that have an <Age> element with a value greater than 20 |
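Several expressions of this kind can be tried with the limited XPath subset in Python's standard `xml.etree.ElementTree` module. The sketch below rests on assumptions: the sample document is reconstructed from the element names used in this section, and its second `Member` record is invented. Paths are written relative to the `DBGroup` root element, so `DBGroup/Member/Name` becomes `Member/Name`. Because `Age` is assumed here to be a sibling of `Name`, the value test is placed on `Member`. Comparisons such as `Age > 20` are outside the subset this module supports and are not shown.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; the second Member is invented test data.
XML = """<DBGroup>
  <Member><Name>Fervvac</Name><Age>23</Age><Office>CSD 4212</Office></Member>
  <Member><Name>Alice</Name><Age>31</Age><Office>CSD 4215</Office></Member>
</DBGroup>"""

root = ET.fromstring(XML)  # the context node is the <DBGroup> root

# ./Member: Member children of the current context
member_count = len(root.findall("./Member"))

# DBGroup/Member/Name: a chain of parent-child steps
names = [e.text for e in root.findall("Member/Name")]

# DBGroup//Age: Age elements at any depth below the context
ages = [e.text for e in root.findall(".//Age")]

# DBGroup/*/Office: Office grandchildren, whatever the child is called
offices = [e.text for e in root.findall("*/Office")]

# DBGroup/Member[1]: position predicate (1-based)
first_member_name = root.find("Member[1]/Name").text

# DBGroup/Member/*: all children of a Member
first_member_children = [e.tag for e in root.findall("Member[1]/*")]

# Value predicate on a child element
names_age_23 = [e.text for e in root.findall("Member[Age='23']/Name")]
```

A full XPath processor would be needed for the remaining expressions in the table.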


As XML is semi-structured data, there are usually two types of user queries: full-text queries (keyword based) and structural queries (complex queries in a tree-like structure). This classification of XML queries is depicted in Figure 2.3 (Haw & Lee 2011). A keyword search resembles the process of content retrieval used in information retrieval technology. In contrast, structural searches retrieve matches from a tree based on the tags and structure (relationship) specified within the query criteria. Structural queries can be further categorized into path queries (chain queries) and twig queries (branching queries). Each type of query may contain only parent-child (P-C) relationships, only ancestor-descendant (A-D) relationships, or a mix of both types.

Figure 2.3 Classification of queries for XML (Haw & Lee 2011)

A path query is a simple path expression query on one single element, i.e. it consists of one leaf node at a time. A twig query is a branch path expression query, which defines a query on two or more elements, i.e. it consists of two or more leaf nodes. In both cases, query nodes can be elements, attributes or text. However, the query edges for the path query are either P-C or A-D relationships, while the edges for the twig query can be P-C, A-D or sibling (preceding and following) relationships, where the sibling relationships determine the order of the nodes. In XPath notation (Berglund et al. 2003), the P-C relationship is depicted as ‘/’, whereas the A-D relationship is depicted as ‘//’. A complex query refers to a twig query that contains at least one branching edge and may contain multiple edges formed by basic P-C or A-D relationships.
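The distinctions above can be turned into a rough classifier. The following is a minimal sketch under assumptions, not a procedure from the cited studies: a query is treated as a twig query when it contains a predicate in square brackets, queries with more than two ‘//’ are grouped with the two-‘//’ category, and the boundary between short and long simple paths (`long_threshold`) is an arbitrary illustrative value. The function name `classify_xpath` is hypothetical.

```python
def classify_xpath(query, long_threshold=4):
    """Assign an XPath query to one of six illustrative categories."""
    q = query.replace(" ", "")
    ad_steps = q.count("//")  # number of ancestor-descendant steps

    # Assumption: a predicate introduces a branch, so the query is a twig.
    if "[" in q:
        if ad_steps:
            return "twig query with '//'"
        return "twig query with '/'"

    if ad_steps == 1:
        return "query with one '//'"
    if ad_steps >= 2:
        # Assumption: two or more '//' are grouped together.
        return "query with two '//'"

    # Only parent-child steps remain; count them to separate short and long.
    steps = [step for step in q.split("/") if step]
    if len(steps) >= long_threshold:
        return "long simple path"
    return "short simple path"


examples = [
    "/DBGroup/Member/Name",
    "/a/b/c/d/e",
    "DBGroup//Age",
    "//Member//Age",
    "DBGroup/Member[Age = 23]/Name",
    "//Name [Age > 20 ]",
]
labels = [classify_xpath(q) for q in examples]
```

A string-based check like this ignores many XPath features, such as axes written out in full, and is meant only to make the six categories tangible.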


2.3 XML DATA MANAGEMENT

There are several approaches for managing XML documents in terms of storing and querying, which range from using files to full-fledged database management systems such as object-oriented database (OODB), native XML database (NXD) or relational database (RDB) systems. The approaches can be classified into four categories: (i) Files, (ii) OODB, (iii) NXD, and (iv) RDB. Each of these categories is discussed in the following subsections.

2.3.1 Files

The files approach treats an XML document as a single object in a file and directly maps and stores it as a single text column, e.g. as a character large object (CLOB), similar to the storage of an image file (Dweib & Lu 2012; Ahmad 2011; Haw & Lee 2011; Kyu & Nyunt 2009; Yoshikawa et al. 2001; Florescu & Kossmann 1999). There are two main advantages of storing XML as a CLOB: there is no need for preprocessing and the original XML document is preserved. However, it does not allow an effective search of the contents of the document, nor does it provide support for querying the XML data because the data would need to be parsed, loaded into the memory, processed and (for update operations) dumped back to disk. Therefore, this approach is neither efficient nor practical for storing and querying XML documents (Dweib & Lu 2012; Ahmad 2011).

2.3.2 Object-Oriented Database (OODB)

The OODB approach is capable of dealing with complicated applications such as multimedia data and geographic information systems (Bousalem & Cherti 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Wang et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011). The OODB allows the clustering of XML elements and sub-elements, which can be useful. However, the current generation of OODB systems is not mature enough to process complicated queries from large databases (Wang et al. 2012; Kyu & Nyunt 2009). In addition, the absence of a standard OODB query language has hindered the development of tests to compare the various OODB system implementations (Haw & Lee 2011). Moreover, there are other limitations to the OODB. For instance, it is language dependent, i.e. a specific application program interface (API) is used to access the data. Also, it is schema dependent, i.e. any modification to the schema or to a class must also be applied to the classes that interact with instances of that class (Dweib & Lu 2012). However, there are some works on storing XML documents in an OODB because both XML and the OODB are hierarchical in terms of their structure (Chung & Jesurajaiah 2005).

2.3.3 Native XML Database (NXD)

The NXD approach was developed to directly support the XML data model and is queried by special purpose engines (Bousalem & Cherti 2015; Vaidya 2015; Raut & Atique 2014; Yuan et al. 2013; Ahmad 2011; Haw & Lee 2011). Examples of this approach include Timber (Jagadish et al. 2002), Sedna (Fomichev et al. 2006), eXist (Meier 2003), and Software AG’s Tamino (Harold 2005). The major advantage of the NXD approach is that the XML data can be stored and queried in its original format without additional mapping or translation. Additionally, most NXDs have the ability to optimize the query techniques. However, the main disadvantage of the NXD approach is that it still has performance problems when handling a large volume of XML data (Dweib & Lu 2012; Ahmad 2011). In addition, due to the document-centric nature of XML databases, complex searches or aggregations might be cumbersome (Ahmad 2011; Atay et al. 2010). Moreover, the NXD approach is not suitable when integration between various heterogeneous XML documents is needed (Haw & Lee 2011). Currently there is no firm standard or structure for NXDs (Harold 2005), which makes the approach immature and unsuitable for managing large XML data.

2.3.4 Relational Database (RDB)

When managing XML in a RDB, the XML data is mapped into rows and columns within a table inside the RDB scheme. This is known as shredding. The queries posed in semi-structured query languages are translated into SQL queries. The results are then translated back to XML, where all the processes are done internally. This process of mapping XML to the RDB scheme is widespread. Currently, most major database vendors such as IBM, Oracle, and Microsoft provide mechanisms in their database solutions to store and query XML data by extending the existing data model with an additional XML data type so that a column of this data type can be defined and used to store XML data. In addition, a set of methods is associated with this new XML data type to process, manipulate, and query stored XML data (Atay et al. 2010). Some of the major database solutions are outlined below:

a. IBM DB2 Extender

IBM DB2 Extender (IBM 2012) maps XML documents into a RDB using the XML Extender document access definition (DAD) as the XML schema. It provides two functions: the dxxShredXML() function, which decomposes an XML document and stores it in the RDB scheme, and the dxxGenXML() function, which rebuilds an XML document from the shredded data in the RDB scheme (Dweib & Lu 2012). IBM DB2 stores large XML documents as CLOBs or binary large objects (BLOBs) and small XML documents in XMLVARCHAR columns. When documents are stored outside DB2, XMLFile columns are used. In addition, IBM DB2 uses some additional tables for indexing structured data contained in the text of mixed content XML documents. These additional tables are updated automatically when new documents are added (Dweib & Lu 2012). Moreover, DB2 integrates a native mode for XML data management. However, relational data and XML data are processed using different engines, so DB2 does not fully integrate these two kinds of data (Yuan et al. 2013).

b. Oracle

Oracle 11g (Zeis et al. 2009) and Oracle 12c (Ruel 2013) are two solutions that are provided through the Oracle XML DB. In the first solution, an XML document is stored as a CLOB in a single special type field (XMLType), or by shredding the contents of the XML document into a set of rows (Vaidya 2015; Dweib & Lu 2012).


Storing the entire document in a CLOB introduces a performance overhead because the XML document needs to be read and parsed again for each new request to access a particular document node (Bulajic & Filipovic 2012). The second solution provides an option for XML document shredding, which is either automated or controlled by the user, depending on the XML schema. However, the shredding options require the XML schema to be available before mapping to the RDB scheme. When storing a complex XML schema, it may be necessary to define a table with more than 1000 columns. Because Oracle limits the number of columns per table to 1000, in such cases additional tables are required or partitioning needs to be implemented (Bulajic & Filipovic 2012). Also, Oracle solutions are designed only for Oracle systems, which are expensive, and are not available for other DBMSs (Dweib & Lu 2012).

c. Microsoft SQL Server

Microsoft’s SQL Server 2008 (Microsoft Corporation 2008) offers an XML data type in which an XML document can be stored. Optionally, an XML schema collection can be associated with an XML data column. The limitations of this XML data type are fivefold: (i) the size of an XML data type instance cannot exceed 2 GB; (ii) it cannot be compared or sorted; (iii) it cannot be used as a subtype of the sql_variant instance; (iv) it does not support casting and converting to text or ntext; and (v) it cannot be used as a key column in an index. The storing of XML documents was improved in SQL Server 2012 (Microsoft Corporation 2012) and SQL Server 2016 (Microsoft Corporation 2016) through the introduction of the OPENXML row set. The XML documents are compiled into an internal DOM representation using the sp_xml_preparedocument procedure. This representation is then loaded into the OPENXML row set using the SELECT statement. Microsoft SQL Server allows the storing of XML documents as CLOBs. However, unlike IBM DB2 Extender, no extra tables are provided for indexing mixed content data. In SQL Server, the RDB scheme is constructed from the XML schema definition (XSD), which makes it difficult to query XML data from other sources. Moreover, SQL Server XML cannot be applied to other DBMSs such as DB2 or Oracle.


Dweib et al. (2008) and Fakharaldien et al. (2011c) state that database vendors such as IBM, Oracle, and Microsoft provide mature and proven techniques for relational data processing; however, their XML extensions are not yet mature enough. The vendors provide a schema-free method of storing XML data in which the whole XML document is stored in a column as CLOB or BLOB data, which is not useful for querying (Vaidya 2015; Bulajic & Filipovic 2012). Additionally, no XML-to-SQL query translation is required, because XML queries are processed in much the same way as in a native XML database. Therefore, when using these vendors’ solutions, the database administrator (DBA) must specify how to map the XML data into the system, and the XML storage is tailored to one system and hard-coded to some default mapping on behalf of users, so that it cannot be utilized with another relational backend. As the DBA must carry out specific research for the development of XML support, the solutions are unique to that vendor’s product and cannot be utilized in other products. Therefore, further studies are required to utilize the advantages of RDB and XML technologies in order to develop a middleware for storing and querying XML data in any RDBMS.

Ahmad (2011) argues that RDB systems are mature (having been the subject of over 30 years of development), scale very well, and offer broad application, ease of expansion, superior interaction, abundant semantics, good formatting, and strong control over data and security. They also have extra features that allow XML data to co-exist with relational data, making it possible to build applications with little extra effort. Therefore, the RDB appears to be the most feasible and promising approach for XML data management (Bousalem & Cherti 2015; Mourya & Saxena 2015; Wang et al. 2012; Fakharaldien et al. 2011a; Ahmad 2011).
To optimize the use of RDB systems for managing XML data in terms of addressing storing and querying issues, recent works have focused on techniques and algorithms to map from XML to RDB and to map XPath queries to SQL queries. Furthermore, it is believed that efforts to find the best mapping approach will continue in the future. Currently, there are many existing approaches in the literature on the mapping process. These approaches have concentrated on XML mapping, i.e. storing the content and hierarchical structure of the XML document in the RDB scheme so that less RDB storage space is used and scalability remains high when XML documents become bigger. In addition, these approaches have focused on query mapping, i.e. translating different types of XPath queries into corresponding SQL queries in order to enhance query performance by shortening the query response time and returning the results as a tree view.

This focus of the existing approaches on XML mapping for storing and querying the XML document using RDB gives rise to the following two important questions. The first question is: Are some XML mapping algorithms ‘better’ than others in terms of handling the storing issue? The second question is: Are some query mapping algorithms ‘better’ than others in terms of handling the querying issue? This study attempts to answer these questions in the next section (Section 2.4) by analysing the existing mapping approaches and discussing the limitations of each of them in terms of RDB storage space, scalability when XML documents become bigger, and query response time (i.e. query performance).

In this context, a mapping approach is considered to be efficient in handling the storing issue if it can handle the content and hierarchical structure of the XML document in the RDB scheme in a way that uses less RDB storage space and scales well when XML documents become bigger. This can be achieved if the RDB scheme design is simplified (free of dangling tables and columns) and redundant or unnecessary data and path mappings are removed to reduce the number of records mapped. In addition, a mapping approach is considered to be efficient in handling the querying issue if it can map (translate) different types of XPath query into the corresponding SQL query in a way that shortens the query response time, i.e. achieves high query performance. This can be achieved by reducing the search space inside the RDB scheme, the number of join operations, the nesting of queries, and the number of functions used in the translated SQL queries, i.e. by maintaining the P-C and A-D relationships among the nodes efficiently.

2.4 MANAGING XML DATA IN A RELATIONAL DATABASE (RDB)

The mapping of an XML document to relations is not an easy task to accomplish because the data model of an XML document is fundamentally different from that of a RDB. In particular, the structure of an XML document is hierarchical, and the XML elements may be nested and repeated, while the relational model is a flat representation of data in tables and columns (Hong & Chen 2012; Haw & Lee 2011; Atay et al. 2010). Although some of the major IT vendors currently provide solutions for mapping XML documents to relations, this topic has attracted the attention of many researchers who have explored other possibilities. In this section, the existing solutions proposed by other researchers are discussed in detail and examples are provided.

On the Web, XML documents may come with a schema, i.e. with a document type definition (DTD) or XML schema that describes the data, or without a schema, in which case they are called well-formed XML documents. The existence or the absence of a schema greatly influences the mapping procedure. When a schema is available for an XML document, information from the schema on the structural constraints of the XML document is used to guide the mapping design (structure mapping approach). However, when a schema is not available, a generic storage mapping to the RDB scheme is used (model mapping approach). Therefore, the existing mapping approaches can be categorized into structure mapping (also called schema-based mapping) and model mapping (also called schemaless-based or generic mapping) approaches.

2.4.1 Structure Mapping Approach

This approach is based on the existence of a type definition, such as an XML DTD or XML schema, to which the XML document conforms. By analysing the structural properties of the schema, the approach converts the DTD or XML schema into relational schemas before mapping the XML data. Some of the structure mapping approaches are Inlining (Shanmugasundaram et al. 1999), LegoDB (Bohannon et al. 2002), ShreX (Amer-Yahia et al. 2004), and XML-to-Relational (Atay et al. 2007). These approaches are presented in the following subsections.

a. Inlining Approach

The Inlining approach refers to two specialized strategies (shared Inlining and hybrid Inlining), which aim to minimize data fragmentation (Shanmugasundaram et al. 1999). When using this method, the complex structure of the DTD is simplified by discarding information pertaining to the order of occurrence among the elements. The simplified DTD only preserves the semantics of the child elements that concern whether an element (i) can occur once or more than once and (ii) is mandatory or optional. A DTD graph is built from this simplified information. To transform a DTD graph into a relational scheme, two techniques can be applied (shared and hybrid). When the shared technique is used, relations are created for elements in the DTD graph that have an in-degree greater than one. Nodes with an in-degree equal to one are inlined into the relation of their parent node. For element nodes with an in-degree equal to zero (roots), a separate relation is created because they cannot be reached from any other node. In the DTD graph, edges marked with ‘*’ indicate that the element of the destination node can occur more than once. For each of these elements, a separate relation is created because a RDB cannot store set values. Element nodes that are reachable by a directed path from an element for which a relation is created are inlined as attributes of that relation, provided that the directed path does not contain ‘*’. For every relation, a system-generated ID is created, and the parent ID is used to preserve the P-C relationship between the elements. For elements that have an in-degree greater than one, a parent CODE is added to the relations. The disadvantage of this method is that it considers elements as distinct and stores all the (potentially duplicated) information in the tables. This possibly redundant information increases the size of the RDB storage space accordingly (Dweib et al. 2009).
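The shared rule described above can be illustrated with a minimal Python sketch. It is not the implementation of Shanmugasundaram et al. (1999); it only applies the three conditions stated in the text (in-degree zero, in-degree greater than one, or target of a ‘*’ edge) to decide which elements receive their own relation. The function name shared_inlining_relations and the small DTD graph, including the Project element, are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def shared_inlining_relations(edges, root):
    """Return the elements that get their own relation under the shared rule.

    edges: (parent, child, starred) triples of a simplified DTD graph,
    where starred is True when the edge is marked '*'.
    """
    in_degree = defaultdict(int)
    starred_targets = set()
    nodes = {root}
    for parent, child, starred in edges:
        nodes.update((parent, child))
        in_degree[child] += 1
        if starred:
            starred_targets.add(child)

    own_relation = set()
    for node in nodes:
        if in_degree[node] == 0:        # root: unreachable from other nodes
            own_relation.add(node)
        elif in_degree[node] > 1:       # shared by more than one parent
            own_relation.add(node)
        elif node in starred_targets:   # set-valued child of its parent
            own_relation.add(node)
    return own_relation


# Hypothetical DTD graph (an assumption, not taken from the thesis).
dtd_edges = [
    ("DBGroup", "Member", True),
    ("Member", "Name", False),
    ("Member", "Age", False),
    ("Member", "Office", False),
    ("DBGroup", "Project", True),
    ("Project", "Name", False),
]
relations = shared_inlining_relations(dtd_edges, "DBGroup")
print(sorted(relations))  # ['DBGroup', 'Member', 'Name', 'Project']
```

In this assumed graph, Age and Office have an in-degree of one and are not targets of a ‘*’ edge, so they would be inlined as attributes of the Member relation, whereas Name is shared by two parents and therefore receives its own relation.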
In addition, this approach offers limited structures to represent the features of XML data, such as nested relationships, the ordering of XML documents, and the DBMS schema representations. Querying these structures is usually complex because the end users are not familiar with them (Dweib & Lu 2012).

b. LegoDB Approach

The mapping technique used in LegoDB is based on the operation cost (Bohannon et al. 2002). LegoDB uses the information in the XML schema to derive several possible mapping options, and selects the one that leads to the lowest cost of executing a given query workload over sample documents (Amer-Yahia et al. 2004). The goal is to find a relational configuration with the lowest cost for processing a given query workload based on a given XML document. In this approach, a simple XML schema/DTD often produces a relational schema with many tables, which leads to an increase in the size of the RDB storage space and in the time taken for query execution (Fakharaldien et al. 2012).

c. ShreX Approach

The ShreX approach is an XML-to-relational mapping framework and system (Amer-Yahia et al. 2004). This system provides the first comprehensive solution to the relational storage of XML data available through Web transactions. The mappings in ShreX are defined through annotations to an XML schema (Subramaniam et al. 2012). The use of an XML schema simplifies the mapping process because the user does not need to understand a new specialized mapping language. The use of annotations allows mapping choices to be combined in multiple ways. ShreX provides generic (and automatic) document shredding and query translation capabilities, and it is also portable, i.e. its mapping specifications are independent of the database backend. However, shredding XML into a large number of tables can lead to a storage overhead and poor query performance (Vyas & Kolte 2014).

d. XML-to-Relational Approach

The XML-to-Relational approach improves the Inlining approach by mapping ordered XML data to unordered relational data (Atay et al. 2007). This approach contains three types of mapping: schema mapping, data mapping, and query mapping. To support the ordered nature of the XML data model, an order encoding scheme is used and an additional column (endID) is introduced to store the order of the elements in the document. Thus, any query, including one that requires the document order information, can be evaluated over the generated database scheme. The rest of the mapping is similar to that in the Inlining approach proposed by Shanmugasundaram et al. (1999). Consequently, the XML-to-Relational approach converts an XML document into a typically large number of tuples of different relations. This leads to a performance issue when querying some parts of an XML document, because the resulting SQL query contains a large number of join operations (Haw & Lee 2011).

Structure mapping approaches like Inlining, LegoDB, ShreX, and XML-to-Relational require a DTD or XML schema to develop the relational schema. Moreover, the approaches require a relational schema in order to store the XML schema. The schema is utilized during and after the shredding of XML documents. The data captured from the XML document is stored in the created relational tables. There are two main advantages to these approaches: (i) they restrict the XML structure to the defined schema (i.e. they assign and utilize markup elements and attributes according to the defined schema) and (ii) they simplify the mapping process because users do not have to become conversant with a new mapping language (Dweib & Lu 2012). However, some researchers (Dweib & Lu 2012; Maghaydah 2010) have identified several disadvantages of structure mapping approaches: (i) they are all heuristic; (ii) they do not consider multiple possible relational mappings to find the optimal selection; (iii) XML schemas may not be available, so a schema may have to be constructed to do the mapping; (iv) a reconstruction of the database schema is required if any change in the XML schema occurs, which can be expensive; (v) sometimes a large number of relations need to be created, depending on the XML schema; and (vi) many joins are required to retrieve XML document information.

2.4.2 Model Mapping Approach

This approach is based on a fixed (predefined) RDB scheme and it can map differently structured XML documents. This kind of approach can support any modern applications and well-formed XML documents even if they do not have any XML schema or DTD (Bousalem & Cherti 2015; Wang et al. 2012; Ying et al. 2012; Fakharaldien et al. 2011b). Basically, this approach maps XML documents into RDB schemes by traversing, shredding, and decomposing XML documents into relational tables and translating different types of XML queries (XPath) into SQL queries according to these tables. This section reviews and compares 16 of the most cited and most recent model mapping approaches. These approaches are Edge (Florescu & Kossmann 1999), XRel (Yoshikawa et al. 2001), XParent (Jiang et al. 2002), SUCXENT (Prakas et al. 2004), XPEV (Qin et al. 2005), LNV (El-Sharkawi & El-Hadi El Tazi 2005), Sainan-Caifeng (Sainan et al. 2008), XPred (Wu & Huang 2008), SMX/R (Ning & De 2010), Xlight (Zafari et al. 2010), XRecursive (Fakharaldien et al. 2011b; Fakharaldien et al. 2011a), Suri-Sharma (Suri & Sharma 2012), Ying-Cao (Ying et al. 2012), Wang et al. (Wang et al. 2012), s-XML (Subramaniam et al. 2012), and XMap (Bousalem & Cherti 2015). These approaches can be categorized according to the techniques used into Edge-based, path-based, node-based, indexing (signature), hybrid edge- and path-based, and hybrid path- and node-based, as shown in Figure 2.4. In the following subsections, the approaches under each category are reviewed in turn with respect to their advantages and their drawbacks in terms of RDB storage space and query response time, which might affect the query performance. The review also considers the RDB scheme that would be produced by applying the mapping technique of each approach based on the XML document shown in Figure 2.1 and using the XML data models (trees) shown in Figure 2.2.

Figure 2.4 Techniques used in model mapping approaches

a. Edge-Based Technique

The Edge-based technique handles edges individually, i.e. it only maintains the edge label rather than the label paths (Gou & Chirkova 2007; Florescu & Kossmann 1999). Thus, edges must be concatenated to form a path when processing user queries. According to Figure 2.2(a), this technique stores the edge information about the DBGroup node as follows: the source node is ‘0’, which is the root, the target node is ‘1’, the sibling ordinal is ‘1’, the flag is ‘reference’, and the value is ‘null’. Because all edges between the nodes in the document are stored, the Edge-based technique requires a large amount of storage space to store an XML document, particularly in the case of large XML documents (Fakharaldien et al. 2012; Subramaniam et al. 2012). For the translated SQL query of the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’, shown in Figure 2.5, the Edge-based technique needs a large number of join operations in its translated SQL to check the edge connections in order to maintain the P-C relationships among the nodes in the XML tree (Vaidya 2015; Maghaydah 2010; Jiang et al. 2002). In addition, this technique does not support twig queries with ‘//’ axes (A-D queries) because it does not know the number of tags between the elements on both sides of the ‘//’ (Vaidya 2015; Maghaydah 2010). Therefore, excessive self-joins are required to maintain the A-D relationships among the nodes in the XML tree, and the self-join is actually the most costly operation in a RDBMS (Suri & Sharma 2012). As a result, this technique increases the query response time, particularly in the case of complex (twig) queries on large XML documents.

SELECT name.Value
FROM Edge dbgroup, Edge member, Edge age, Edge name
WHERE dbgroup.Label = 'DBGroup'
  AND member.Label = 'Member'
  AND age.Label = 'Age'
  AND name.Label = 'Name'
  AND dbgroup.Source = '0'
  AND dbgroup.Target = member.Source
  AND member.Target = age.Source
  AND member.Target = name.Source
  AND age.Value = 23;

Figure 2.5 Translated SQL query for Q1 using Edge-based technique
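To make the join cost concrete, the query in Figure 2.5 can be run against a small in-memory Edge table. The following is only a sketch under stated assumptions: it uses Python’s sqlite3 module rather than any RDBMS evaluated in this study, the column types are assumed, the five rows are typed in by hand from the sample data, and the literals 0 and '23' are written to match the assumed column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; the thesis does not prescribe them.
conn.execute(
    "CREATE TABLE Edge (Source INTEGER, Ordinal INTEGER, Target INTEGER, "
    "Label TEXT, Flag TEXT, Value TEXT)"
)
conn.executemany(
    "INSERT INTO Edge VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, 1, 1, "DBGroup", "ref", None),
        (1, 1, 2, "Member", "ref", None),
        (2, 1, 3, "Name", "val", "Fervvac"),
        (2, 1, 4, "Age", "val", "23"),
        (2, 1, 5, "Office", "val", "CSD 4212"),
    ],
)

# One alias of Edge per step of the twig query: four aliases, three joins.
q1_sql = """
SELECT name.Value
FROM Edge dbgroup, Edge member, Edge age, Edge name
WHERE dbgroup.Label = 'DBGroup' AND member.Label = 'Member'
  AND age.Label = 'Age' AND name.Label = 'Name'
  AND dbgroup.Source = 0
  AND dbgroup.Target = member.Source
  AND member.Target = age.Source
  AND member.Target = name.Source
  AND age.Value = '23'
"""
edge_result = conn.execute(q1_sql).fetchall()
print(edge_result)  # [('Fervvac',)]
```

Each additional step in the path expression adds one more alias of the same Edge table and one more Target = Source condition, which is the growth in self-joins that the text attributes to this technique.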

The Edge approach is one example of a model mapping approach that employs this technique, and is discussed in detail below.

i. Edge Approach

The Edge approach (Florescu & Kossmann 1999) is the simplest and most basic in terms of shredding and loading XML data into a single Edge table. Each edge in the XML tree is mapped to a tuple in a single relation. Based on the XML document in Figure 2.2(a), the RDB scheme shown in Table 2.2 is yielded through the model mapping process in the Edge approach.

Table 2.2 Edge scheme: Edge table

Source  Ordinal  Target  Label      Flag  Value
0       1        1       ‘DBGroup’  ref
1       1        2       ‘Member’   ref
2       1        3       ‘Name’     val   ‘Fervvac’
2       1        4       ‘Age’      val   ‘23’
2       1        5       ‘Office’   val   ‘CSD 4212’
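The rows of Table 2.2 can be reproduced by a short shredding routine. The sketch below is an illustration in Python, not the algorithm of Florescu & Kossmann (1999). Several details are assumptions: the sample document is reconstructed from the values in Table 2.2, node identifiers are assigned in document order, and the ordinal is counted among siblings with the same label, which is the reading that yields the ordinals shown in the table.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, reconstructed from the values in Table 2.2.
XML_DOC = (
    "<DBGroup><Member><Name>Fervvac</Name><Age>23</Age>"
    "<Office>CSD 4212</Office></Member></DBGroup>"
)


def shred_to_edges(xml_text):
    """Return one (Source, Ordinal, Target, Label, Flag, Value) tuple per edge."""
    rows = []
    next_id = 1

    def visit(element, source, ordinal):
        nonlocal next_id
        target = next_id                # identifiers follow document order
        next_id += 1
        is_internal = len(element) > 0
        flag = "ref" if is_internal else "val"
        value = None if is_internal else (element.text or "").strip()
        rows.append((source, ordinal, target, element.tag, flag, value))
        seen = {}                       # ordinal among same-label siblings
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            visit(child, target, seen[child.tag])

    visit(ET.fromstring(xml_text), 0, 1)
    return rows


edge_rows = shred_to_edges(XML_DOC)
for row in edge_rows:
    print(row)
```

Every element contributes exactly one tuple, so the number of rows grows with the number of nodes in the document, which is the storage behaviour discussed for the Edge-based technique.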

The Edge table stores the identifiers of the source and target nodes of each edge of the XML tree, the edge label, a flag that indicates whether the edge points to an internal node or a value (i.e. a leaf), and an ordinal number that reflects the order of the edges. The main limitation of the Edge approach is that only the edge label is retained rather than the label path. For a path to be created, edges must be concatenated: an edge X can be concatenated to another edge Y if and only if X.Target = Y.Source. Therefore, the number of join operations needed to check the edge connections grows with the length of the path expression (Fakharaldien et al. 2012; Subramaniam et al. 2012; Suri & Sharma 2012; Haw & Lee 2011). These multiple join operations increase the query response time. In regular path queries, the whole XML data tree needs to be traversed, i.e. the search space is very large (Jiang et al. 2002). Therefore the Edge approach is only appropriate for simple XML documents that do not contain complicated combinations of tables. Applying it to large XML documents may degrade performance and lead to what is known as an ‘excessive table size error’, which arises from storing an entire XML document in a single table (Fakharaldien et al. 2012; Subramaniam et al. 2012).

b. Path-Based Technique

The main aspect of the path-based technique is the utilization of a path table in the RDB scheme of the mapping approach to handle the path information about the nodes of a document (Vaidya 2015; Vainio & Junkkari 2014; Mohammad 2011). Path expressions are a conventional and compact way to represent hierarchical relationships by embedding them within SQL queries and compiling them in standard SQL (Vainio & Junkkari 2014). The path identifier and distinctive path expression for a node are stored in the path table. The concept of storing path information in a path table reduces the search space for a node in other tables. In the path-based technique, there are two ways to handle the path information for the nodes: the Node-Path and the Leaf-Path. Based on the XML document in Figure 2.2(b), the handling of path information using the Node-Path and the Leaf-Path is shown in Figure 2.6(a) and 2.6(b), respectively.

(a) Node-Path (5 paths)
/DBGroup
/DBGroup/Member
/DBGroup/Member/Name
/DBGroup/Member/Age
/DBGroup/Member/Office

(b) Leaf-Path (3 paths)
/DBGroup/Member/Name
/DBGroup/Member/Age
/DBGroup/Member/Office

Figure 2.6 Handling paths: (a) Node-Path and (b) Leaf-Path based on DBGroup.xml
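The two path sets in Figure 2.6 can be derived with a single traversal. The following Python sketch is illustrative only; the function name collect_paths is hypothetical, and the sample document is an assumption in which a second Member with invented values is added to show that only distinct paths are kept.

```python
import xml.etree.ElementTree as ET

# Assumed sample; the second Member and its values are invented for illustration.
XML_DOC = (
    "<DBGroup>"
    "<Member><Name>Fervvac</Name><Age>23</Age><Office>CSD 4212</Office></Member>"
    "<Member><Name>Example</Name><Age>30</Age><Office>CSD 0000</Office></Member>"
    "</DBGroup>"
)


def collect_paths(xml_text):
    """Return (node_paths, leaf_paths) as lists of distinct root-based paths."""
    node_paths, leaf_paths = [], []

    def visit(element, prefix):
        path = prefix + "/" + element.tag
        if path not in node_paths:          # Node-Path: every node, root-to-node
            node_paths.append(path)
        if len(element) == 0 and path not in leaf_paths:
            leaf_paths.append(path)         # Leaf-Path: leaves only, root-to-leaf
        for child in element:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return node_paths, leaf_paths


node_paths, leaf_paths = collect_paths(XML_DOC)
print(len(node_paths), len(leaf_paths))  # 5 3
```

The Leaf-Path list omits the two inner-node paths (/DBGroup and /DBGroup/Member), which are prefixes of every leaf path; this is the redundancy that the Leaf-Path method removes from the path table.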

Using the Node-Path method shown in Figure 2.6(a), some researchers developed approaches that handle the distinct path information for every node in the document starting from the root (root-to-node) (Ning & De 2010; Qin et al. 2005; Jiang et al. 2002; Yoshikawa et al. 2001). However, part of the path information for these nodes becomes redundant using this method, which increases the storage space and adversely affects query processing (Vaidya 2015; Dweib & Lu 2012; Jeong & Lee 2006). Other researchers enhanced the Node-Path method by employing the Leaf-Path method. This method handles the distinct path information for only the leaf nodes in the document starting from the root (root-to-leaf) (Wang et al. 2012; Ying et al. 2012; Zafari et al. 2010; Prakas et al. 2004). This leads to a decrease in the storage space and search space in the path table, which has a positive effect on query response time (Ying et al. 2012; Zafari et al. 2010). However, the path-based technique still stores all the information about the inner nodes of the document in separate tables as data, even though this data is already stored as part of the path information in the path table. This leads to an increase in the number of tables (storage space), particularly in the case of large XML documents (Vaidya 2015; Senthilkumar et al. 2011). For the translated SQL query of the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’ shown in Figure 2.7, the path-based technique also requires a number of join operations between the path table and other data tables to handle the P-C and A-D relationships among the nodes. This has an effect on the query response time (Vaidya 2015; Suri & Sharma 2012), particularly in the case of complex queries (twig queries) on large XML documents. The θ-join is one of the join operations used in this technique in order to maintain the containment relationships among the nodes (Ning & De 2010; Yoshikawa et al. 2001). This type of join operation is costly in a RDBMS in terms of query execution (Vaidya 2015; Dweib & Lu 2012; Maghaydah 2010; Zafari et al. 2010; Zhang et al. 2001).

SELECT v2.NodeValue
FROM Path_Table e, Path_Index_Table p1, Path_Index_Table p2,
     Path_Index_Table p3, Path_Table v1, Path_Table v2
WHERE p1.Path = '.DBGroup.Member'
  AND p2.Path = '.DBGroup.Member.Age'
  AND p3.Path = '.DBGroup.Member.Name'
  AND e.PID = p1.PID
  AND v1.PID = p2.PID
  AND v2.PID = p3.PID
  AND e.StartPos < v1.StartPos AND e.EndPos > v1.EndPos
  AND e.StartPos < v2.StartPos AND e.EndPos > v2.EndPos
  AND v1.NodeValue = 23
  AND v2.NodeType = 3;

Figure 2.7 Translated SQL query for Q1 using path-based technique

Examples of model mapping approaches that employ the path-based technique are: (i) XRel, (ii) SUCXENT, (iii) SMX/R, and (iv) Xlight, which are discussed below.

i. XRel Approach

The XRel approach (Yoshikawa et al. 2001) employs the XPath data model (Clark & Derose 1999), where an XML document is represented as a tree of nodes. These nodes are stored in relations according to the node type, together with the distinct path information from the root to each node. For every node, the XRel approach preserves a simple path expression, which is the path information about the node from the root, and the region information about the node (a pair of start and end positions of the node in the XML tree). The containment relationship is determined from these regions, so the relationships among the XML nodes are easily maintained. Therefore, node identifiers are unnecessary for storing XML data graphs. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.3 is produced by the model mapping process of the XRel approach.

Table 2.3 XRel scheme: (a) Path, (b) Element, (c) Text, and (d) Attribute tables

(a) Path Table
PathID  Pathexp
1       #/DBGroup
2       #/DBGroup#/Member
3       #/DBGroup#/Member#/Name
4       #/DBGroup#/Member#/Age
5       #/DBGroup#/Member#/Office

(b) Element Table
DocID  PathID  Start  End  Ordinal
10     1       1      232  1
10     2       2      15   1
10     3       3      6    1
10     4       7      10   1
10     5       11     14   1

(c) Text Table
DocID  PathID  Start  End  Value
10     3       4      5    ‘Fervvac’
10     4       8      9    ‘23’
10     5       12     13   ‘CSD 4212’

(d) Attribute Table
DocID  PathID  Start  End  Value
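The way in which regions encode containment can be sketched in a few lines of Python. This is an illustration under assumptions rather than the XRel implementation: positions are taken from a simple counter over the tree, the sample document contains a single Member, and the helper names assign_regions, contains, and is_parent are hypothetical. With this document the counter reproduces the Start and End values of Table 2.3 for every node except the root, whose end position depends on document content that is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Assumed single-Member sample, so each path identifies exactly one node.
XML_DOC = (
    "<DBGroup><Member><Name>Fervvac</Name><Age>23</Age>"
    "<Office>CSD 4212</Office></Member></DBGroup>"
)


def assign_regions(xml_text):
    """Map each element path to a (start, end) region taken from one counter."""
    regions = {}
    counter = 0

    def visit(element, prefix):
        nonlocal counter
        path = prefix + "#/" + element.tag
        counter += 1
        start = counter
        if len(element) == 0 and element.text:
            counter += 2            # the text node takes its own start and end
        for child in element:
            visit(child, path)
        counter += 1
        regions[path] = (start, counter)

    visit(ET.fromstring(xml_text), "")
    return regions


def contains(outer, inner):
    """A-D test: the ancestor's region strictly encloses the descendant's."""
    return outer[0] < inner[0] and outer[1] > inner[1]


def is_parent(outer, inner, all_regions):
    """P-C test: containment with no other region lying in between."""
    if not contains(outer, inner):
        return False
    return not any(
        contains(outer, mid) and contains(mid, inner) for mid in all_regions
    )


regions = assign_regions(XML_DOC)
root = regions["#/DBGroup"]
member = regions["#/DBGroup#/Member"]
age = regions["#/DBGroup#/Member#/Age"]
print(member, age)                                # (2, 15) (7, 10)
print(contains(root, age))                        # True
print(is_parent(root, age, regions.values()))     # False
print(is_parent(member, age, regions.values()))   # True
```

The A-D test needs only two inequality comparisons, whereas the P-C test must additionally rule out any intermediate region, which illustrates why verifying the P-C relationship from positions alone is the more expensive check.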

The XRel schema stores the DocID, PathID, Start, End, Ordinal, and Value attributes. These are the document identifier, the simple path expression identifier, the start and end positions of a region, the ordinal number of the node among its siblings, and the string value of the node, respectively. The limitation of the XRel approach is that it produces a high number of tables to be joined and a high number of tuples, which increases the size of the RDB (Dweib & Lu 2012; El-Sharkawi & El-Hadi El Tazi 2005). This approach stores the paths for all existing nodes in the XML tree, which increases the storage space requirements and leads to a slower search in the path table. This approach uses the containment relationship to preserve the A-D relationships. In other words, it uses θ-join operators, which are costly to use in joining the tables because of the way in which a RDBMS processes joins (Dweib & Lu 2012; Zafari et al. 2010; Prakash 2005; Zhang et al. 2001). In the XRel approach, the querying cost for verifying the P-C relationship is also high, because verifying the P-C relationship between two nodes from their tag positions requires confirming that no other node exists between the two nodes (Fujimoto et al. 2005).

ii. SUCXENT Approach

The SUCXENT approach does not require knowledge of the DTDs of the XML documents being stored (Prakas et al. 2004). This approach is able to efficiently store and query XML data in a RDB schema by storing only the paths and values of the leaf nodes together with a list of their ancestors. SUCXENT solves the XRel (Yoshikawa et al. 2001) and XParent (Jiang et al. 2002) problems by maintaining ancestor information for only the leaf nodes. This allows the effective evaluation of queries without the need for θ-joins. Therefore, SUCXENT is able to perform three times better than XRel in terms of document insertion time and up to 10–15 times better in terms of query performance. SUCXENT can significantly outperform XParent in terms of insertion/extraction time, can store up to 20% more and can increase query performance by up to 25%. Also, SUCXENT is able to reconstruct documents up to twice as fast as Shared-Inlining (Shanmugasundaram et al. 1999). There are several reasons why SUCXENT is able to perform better than the above-mentioned approaches: (i) it requires a significantly smaller storage size; (ii) there is a lower input/output (I/O) cost associated with query processing; and (iii) it requires fewer joins in the corresponding SQL queries (Hyder & Hassan 2006). Based on the XML document in Figure 2.2(b), the RDB schema shown in Table 2.4 is produced using SUCXENT.


Table 2.4 SUCXENT scheme: (a) Document, (b) Path, (c) PathValue, (d) AncestorInfo, and (e) TextContent tables

(a) Document Table
DocId  Name
10     DBGroup.XML

(b) Path Table
PathId  PathExp
1       DBGroup.Member.Name
2       DBGroup.Member.Age
3       DBGroup.Member.Office

(c) PathValue Table
DocId  PathId  LeafOrder  SiblingOrder  LeftSibIxnLevel  LeafValue
10     1       1          1             -1               ‘Fervvac’
10     2       2          1             2                ‘23’
10     3       3          1             2                ‘CSD 4212’

(d) AncestorInfo Table
DocId  SiblingOrder  AncestorOrder  AncestorLevel
10     1             1              1
10     1             2              2

(e) TextContent Table
DocId  Linkid  Text
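The LeafOrder and LeftSibIxnLevel columns of the PathValue table can be illustrated with a short Python sketch. It is not the SUCXENT implementation (Prakas et al. 2004). The key assumption is the reading of LeftSibIxnLevel as the level of the deepest ancestor that a leaf shares with the leaf immediately before it in document order, with -1 for the first leaf; this reading was chosen because it reproduces the values in Table 2.4. The sample document and the function name path_value_rows are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, reconstructed from the values in Table 2.4.
XML_DOC = (
    "<DBGroup><Member><Name>Fervvac</Name><Age>23</Age>"
    "<Office>CSD 4212</Office></Member></DBGroup>"
)


def path_value_rows(xml_text):
    """Return (PathExp, LeafOrder, LeftSibIxnLevel, LeafValue) per leaf node."""
    rows = []
    previous_chain = None       # ancestor chain of the preceding leaf

    def visit(element, chain):
        nonlocal previous_chain
        chain = chain + [element]
        if len(element) == 0:
            if previous_chain is None:
                level = -1      # the first leaf has no left neighbour
            else:
                level = 0       # count ancestors shared by node identity
                for earlier, current in zip(previous_chain, chain):
                    if earlier is not current:
                        break
                    level += 1
            path = ".".join(node.tag for node in chain)
            value = (element.text or "").strip()
            rows.append((path, len(rows) + 1, level, value))
            previous_chain = chain
        for child in element:
            visit(child, chain)

    visit(ET.fromstring(xml_text), [])
    return rows


pv_rows = path_value_rows(XML_DOC)
for row in pv_rows:
    print(row)
```

Under this reading, Age and Office both receive level 2 because the deepest ancestor they share with the preceding leaf is Member, which sits at level 2 of the tree.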

In the Document table, the DocId and Name attributes represent the document identifier and its name, respectively. In the Path table, PathId and PathExp refer to the path identifier and the distinct simple path expression of an existing leaf node starting from the root node, respectively. In the PathValue table, the attributes LeafOrder, SiblingOrder, LeftSibIxnLevel, and LeafValue refer to the position of a leaf node in the XML document, the sibling information, the left sibling intersection level (the level of the highest common ancestor of a node), and the value of a leaf node, respectively. The ancestor information about a leaf node is stored in the AncestorInfo table. The TextContent table stores the value of a leaf node. The limitation of the SUCXENT approach is that XML documents are stored in five tables, namely: Document, Path, PathValue, AncestorInfo, and TextContent. In addition, this approach stores path information from root to leaf for each leaf node, which leads to an expansion of the storage space (Fakharaldien et al. 2011a; Prakash 2005). Moreover, the merging of these tables with paths from leaf-to-leaf nodes for data retrieval complicates the query process and also increases the search space and query response time (Hyder & Hassan 2006).

iii. SMX/R Approach

The SMX/R approach (Ning & De 2010) is an improved version of the XRel approach (Yoshikawa et al. 2001). The P-C and path information about the node is utilized by this approach to map XML documents into the RDB scheme. Hence, both the P-C and A-D relationships are effortlessly preserved and established. This approach is beneficial in terms of more generic solutions for storing any XML collection efficiently and/or precise filtering of large subsets of XPath expressions, and also for better extraction of fragments based on the XPath expressions. Start and end positions are provided for each node (tag) through the pre-order and post-order traversal numbers. Similar to the XRel approach, a NodeID is non-existent in the schema of SMX/R. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.5 is produced by the SMX/R approach.

Table 2.5

SMX/R scheme: (a) Path_Index_Table, and (b) Path_Table

(a) Path_Index_Table

PID  Path                   NodeName
1    DBGroup                DBGroup
2    DBGroup.Member         Member
3    DBGroup.Member.Name    Name
4    DBGroup.Member.Age     Age
5    DBGroup.Member.Office  Office

(b) Path_Table

DocID  PID  StartPos  EndPos  NodeLVL  NodeType  NodeValue
10     1    1         232     1        1
10     2    2         15      2        1
10     3    3         6       3        3         ‘Fervvac’
10     4    7         10      3        3         ‘23’
10     5    11        14      3        3         ‘CSD 4212’
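
The containment test that the start and end positions make possible can be sketched as follows. The rows are those of Table 2.5; the use of Python's sqlite3 module, the column types, and the helper name descendants are assumptions for illustration, and the query is not claimed to be the translation produced by SMX/R itself. The point of the sketch is the join condition: it compares positions with inequalities, which makes it a θ-join rather than an equijoin.

```python
import sqlite3

# In-memory database with the two SMX/R tables (rows from Table 2.5).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path_Index_Table (PID INTEGER PRIMARY KEY, Path TEXT, NodeName TEXT);
CREATE TABLE Path_Table (DocID INTEGER, PID INTEGER, StartPos INTEGER,
                         EndPos INTEGER, NodeLVL INTEGER, NodeType INTEGER,
                         NodeValue TEXT);
INSERT INTO Path_Index_Table VALUES
    (1, 'DBGroup', 'DBGroup'),
    (2, 'DBGroup.Member', 'Member'),
    (3, 'DBGroup.Member.Name', 'Name'),
    (4, 'DBGroup.Member.Age', 'Age'),
    (5, 'DBGroup.Member.Office', 'Office');
INSERT INTO Path_Table VALUES
    (10, 1, 1, 232, 1, 1, NULL),
    (10, 2, 2, 15, 2, 1, NULL),
    (10, 3, 3, 6, 3, 3, 'Fervvac'),
    (10, 4, 7, 10, 3, 3, '23'),
    (10, 5, 11, 14, 3, 3, 'CSD 4212');
""")


def descendants(node_name):
    """Hypothetical helper: names of all nodes contained in the named node."""
    rows = conn.execute(
        """
        SELECT di.NodeName
        FROM Path_Table a
        JOIN Path_Index_Table ai ON ai.PID = a.PID
        JOIN Path_Table d                      -- theta-join: inequality predicates
          ON d.DocID = a.DocID
         AND d.StartPos > a.StartPos
         AND d.EndPos < a.EndPos
        JOIN Path_Index_Table di ON di.PID = d.PID
        WHERE ai.NodeName = ?
        ORDER BY d.StartPos
        """,
        (node_name,),
    ).fetchall()
    return [r[0] for r in rows]


print(descendants("Member"))
```

Restricting the result to children rather than all descendants would only require the extra condition d.NodeLVL = a.NodeLVL + 1.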

In the Path_Index_Table, the PID, Path, and NodeName attributes are the path identifier, the path string from the root to each node, and the node name, respectively. The Path_Table consists of seven attributes. The DocID attribute is a distinct ID for the XML document, and the PID attribute is a unique ID for a path (a key to the Path_Index_Table). The StartPos attribute is the start location of the node (pre-order number), while EndPos is the end location of the node (post-order number). NodeLVL is the nesting depth within the document, NodeType is the type of node (1 = element, 2 = attribute, and 3 = text), and NodeValue depicts the value of the node. The limitation of the SMX/R approach is that the query response time increases as the storage space increases. This situation happens when the distinct paths for all existing nodes are stored as shown in Table 2.5, which increases the search space in the Path_Index_Table. Also, ϴ-joins are employed in this approach to check the edge connections, i.e. the containment relationship for preserving the P-C and A-D relationships. This checking process is expensive in an RDBMS (Vaidya 2015; Dweib & Lu 2012; Maghaydah 2010; Zafari et al. 2010; Zhang et al. 2001). Moreover, the search space for the elements, attributes or texts also expands because a single table (Path_Table) is used for storing them (Vaidya 2015).

iv. Xlight Approach

The Xlight approach (Zafari et al. 2010) stores the existing root-to-leaf path information about the XML document in its RDB scheme. Unlike XParent (Jiang et al. 2002), nodes with similar parents have equal LeafGroup values and ancestors. This reduces the use of ancestor tables. Only information on the ancestor is stored. Similar to XRel (Yoshikawa et al. 2001), the Path table is also employed in this approach, and this reduces the large number of joins during queries. Therefore the storage space is significantly reduced, while the search conducted in the path table is accelerated. The ϴ-join problem (as in XRel) is resolved through the ancestor table, where all of the ancestors of the leaf nodes are stored in a single table. Also, equijoins are utilized to replace the ϴ-joins over this set of ancestors. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.6 is produced using Xlight.


Table 2.6

Xlight scheme: (a) Document, (b) Ancestor, (c) Path, (d) Attribute, and (e) Data tables

(a) Document Table

DocId  Name
10     DBGroup.XML

(b) Ancestor Table

DocId  LeafGroup  AncestorPre  AncestorLevel
10     1          1            1

(c) Path Table

PathId  Path
1       /DBGroup/Member/Name
2       /DBGroup/Member/Age
3       /DBGroup/Member/Office

(d) Attribute Table

Name  Val  Id  Pre

(e) Data Table

DocId  PathId  LeafNo  LeafGroup  LinkLevel  LeafValue   hasAttrib
10     1       1       1          1          ‘Fervvac’   0
10     2       2       1          1          ‘23’        0
10     3       3       1          1          ‘CSD 4212’  0
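
One way to see why a shared LeafGroup value helps is to answer the twig query ‘/DBGroup/Member/Name [Age = 23]’ with equijoins only. The sketch below loads the Path and Data rows of Table 2.6 into SQLite and matches the Name and Age leaves through their common LeafGroup. The query is an illustration written for this example, under the assumption that leaves of the same parent always share a LeafGroup; it is not the SQL translation published for Xlight, and the helper name names_with_age is hypothetical.

```python
import sqlite3

# In-memory database with the Xlight Path and Data tables (rows from Table 2.6).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path (PathId INTEGER PRIMARY KEY, Path TEXT);
CREATE TABLE Data (DocId INTEGER, PathId INTEGER, LeafNo INTEGER,
                   LeafGroup INTEGER, LinkLevel INTEGER, LeafValue TEXT,
                   hasAttrib INTEGER);
INSERT INTO Path VALUES
    (1, '/DBGroup/Member/Name'),
    (2, '/DBGroup/Member/Age'),
    (3, '/DBGroup/Member/Office');
INSERT INTO Data VALUES
    (10, 1, 1, 1, 1, 'Fervvac', 0),
    (10, 2, 2, 1, 1, '23', 0),
    (10, 3, 3, 1, 1, 'CSD 4212', 0);
""")


def names_with_age(age):
    """Hypothetical helper: Name leaves whose sibling Age leaf has the given value."""
    rows = conn.execute(
        """
        SELECT n.LeafValue
        FROM Data n
        JOIN Path pn ON pn.PathId = n.PathId
        JOIN Data a                       -- siblings share DocId and LeafGroup
          ON a.DocId = n.DocId AND a.LeafGroup = n.LeafGroup
        JOIN Path pa ON pa.PathId = a.PathId
        WHERE pn.Path = '/DBGroup/Member/Name'
          AND pa.Path = '/DBGroup/Member/Age'
          AND a.LeafValue = ?
        ORDER BY n.LeafNo
        """,
        (age,),
    ).fetchall()
    return [r[0] for r in rows]


print(names_with_age("23"))
```

Every join predicate here is an equality, in contrast to the position comparisons needed when containment is checked with start and end positions.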

The path identifier and all of the existing root-to-leaf paths in the XML document are stored in the Path table in the PathId and Path attributes, respectively. The related information about the leaf nodes in the XML document is stored in the Data table, which consists of seven attributes. The DocId attribute represents the document identifier; PathId identifies the root-to-leaf path; LeafNo pertains to the order of the leaf nodes in the document; LeafGroup holds the same number for all leaf nodes that share the same parent; LinkLevel denotes the level where two adjacent paths should link to each other; LeafValue indicates the textual contents of the leaf node; and hasAttrib shows the number of attributes in each path. The ancestor information about each leaf node is preserved by the Ancestor table through the storage of the distinct ancestor information for the nodes with similar parents. The information about the existing attributes in the XML document is stored in the Attribute table. The limitation of the Xlight approach is that XML documents are stored in five tables, namely Document, Path, Data, Ancestor, and Attribute (Senthilkumar et al. 2011). This expands the storage space. In addition, the merging of multiple tables for data retrieval complicates the query process and increases the query response times (Vaidya 2015).

c. Node-Based Technique

The node-based technique maintains the nodes rather than the edges and paths of nodes. No edge information is explicitly maintained in this technique (Jiang et al. 2002). The efficiency of this type of mapping depends on the method used in the node-labelling of the XML tree. Node-labelling allocates a number (label) to each node to denote the identifier or the absolute position of the node in the document (Vaidya 2015; Mohammad 2011; Fujimoto et al. 2005). Numerous node-labelling methods have been proposed, such as Interval encoding, which relates to the number of words (Li & Moon 2001), Global order label (Tatarinov et al. 2002), Local order label (Tatarinov et al. 2002), Dewey order label (Tatarinov et al. 2002), Pre-order post-order (Grust 2002), ORDPATH (O'neil et al. 2004), Prime number labelling (Wu et al. 2004), Variable Length Endless Insertable (Kobayashi et al. 2005), Cluster-based order (Soltan & Rahgozar 2006), Dynamic interval-based labelling (Chung & Jesurajaiah 2005), and the Persistent labelling scheme (Khaing & Thein 2006). The main goal of these node-labelling methods is to preserve the nested hierarchical (structural) relationship among the nodes and minimize the re-labelling cost during XML updates (Vaidya 2015; Subramaniam et al. 2012; Haw & Lee 2011; Dweib et al. 2009). In the example of the node-labelled XML data model shown in Figure 2.2(b), the Global order labelling method for the nodes in the XML tree is depicted. This method assigns numbers (labels) to the nodes based on the sequence 1, 2, 3, 4, …, n. Thus in the figure, the DBGroup node is given the number ‘1’, the Member node ‘2’, the Name node ‘3’, the Age node ‘4’, and the Office node ‘5’. The node-based technique handles the contents and hierarchical structure of the XML document by utilizing the parent node label (according to the node-labelling method used) for each node in the document. This leads to storing of the whole document, i.e. all the information about the parent nodes (tag structures) and child nodes (tag values) of the document, which increases the number of records and label size (Vaidya 2015; Haw & Lee 2011; Maghaydah 2010). Therefore, the storage space requirement also increases, particularly for large XML documents. In the case of Figure 2.2(b), this technique stores information about the Member and Name nodes as follows: the Member node has the node_id = ‘2’, the parent_id = ‘1’, which is the node_id of DBGroup, and the level = ‘1’, which are all stored in the parent table, whereas the Name node has the node_id = ‘3’, the parent_id = ‘2’, which is the node_id of Member, the level = ‘2’, and the value = ‘Fervvac’. For the translated SQL query of the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’ shown in Figure 2.8, the node-based technique uses the parent of a label (LParent) to answer P-C queries. It also uses the same label in a recursive manner in forming nested queries to answer A-D queries. Therefore, this technique has an additional computation processing cost, which increases the query response time, particularly for complex queries (A-D queries) on large XML documents (Vaidya 2015; Haw & Lee 2011; Maghaydah 2010).

SELECT Value
FROM ChildTable
WHERE LParent IN (
    SELECT DISTINCT IdNode
    FROM ParentTable
    WHERE cName = 'Name'
      AND LParent IN (
          SELECT DISTINCT IdNode
          FROM ParentTable
          WHERE cName = 'Member' AND pName = 'DBGroup' )
      AND LParent IN (
          SELECT DISTINCT LParent
          FROM ParentTable
          WHERE cName = 'Age'
            AND LParent IN (
                SELECT DISTINCT IdNode
                FROM ParentTable
                WHERE cName = 'Member' AND pName = 'DBGroup' )
            AND IdNode IN (
                SELECT DISTINCT LParent
                FROM ChildTable
                WHERE Value = 23 ) ) )

Figure 2.8

Translated SQL query for Q1 using node-based technique
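
To make the nesting in Figure 2.8 concrete, the query can be executed against a small SQLite database. The sketch below assumes the ParentTable and ChildTable rows of the s-XML example (Table 2.9), with NULL standing in for the empty root cells; the sqlite3 setup and column types are assumptions for illustration. The only departure from the figure is that the value is compared as the string '23', because the values are stored as text here.

```python
import sqlite3

# In-memory database with node-based tables (rows patterned on Table 2.9).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ParentTable (IdNode INTEGER PRIMARY KEY, pName TEXT, cName TEXT,
                          Level INTEGER, LParent INTEGER, SelfLabel TEXT);
CREATE TABLE ChildTable (IdNode INTEGER PRIMARY KEY, Level INTEGER, pName TEXT,
                         SelfLabel TEXT, LParent INTEGER, Value TEXT);
INSERT INTO ParentTable VALUES
    (1, NULL, 'DBGroup', 0, NULL, '[1,1]'),
    (2, 'DBGroup', 'Member', 1, 1, '[1,1]'),
    (3, 'Member', 'Name', 2, 2, '[1,1]'),
    (4, 'Member', 'Age', 2, 2, '[2,1]'),
    (5, 'Member', 'Office', 2, 2, '[3,1]');
INSERT INTO ChildTable VALUES
    (1, 3, 'Name', '[1,1]', 3, 'Fervvac'),
    (2, 3, 'Age', '[1,1]', 4, '23'),
    (3, 3, 'Office', '[1,1]', 5, 'CSD 4212');
""")

# Q1 as in Figure 2.8: every step of the twig becomes another nested subquery.
Q1 = """
SELECT Value
FROM ChildTable
WHERE LParent IN (
    SELECT DISTINCT IdNode FROM ParentTable
    WHERE cName = 'Name'
      AND LParent IN (
          SELECT DISTINCT IdNode FROM ParentTable
          WHERE cName = 'Member' AND pName = 'DBGroup')
      AND LParent IN (
          SELECT DISTINCT LParent FROM ParentTable
          WHERE cName = 'Age'
            AND LParent IN (
                SELECT DISTINCT IdNode FROM ParentTable
                WHERE cName = 'Member' AND pName = 'DBGroup')
            AND IdNode IN (
                SELECT DISTINCT LParent FROM ChildTable
                WHERE Value = '23')))
"""

result = [row[0] for row in conn.execute(Q1)]
print(result)
```

The query touches ParentTable four times and ChildTable twice for a three-step path with one predicate, which illustrates the additional processing cost described above.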

Examples of the model mapping approaches that employ the node-based technique are: (i) XRecursive, (ii) Suri-Sharma, and (iii) s-XML, which are discussed in turn below.


i. XRecursive Approach

The XRecursive approach (Fakharaldien et al. 2011b; Fakharaldien et al. 2011a) uses ORDPATH (O'neil et al. 2004) as its node-labelling scheme. In this approach, storing the path value or path structure is unnecessary because the P-C relationships among the nodes are preserved through the parent ID (pId) attribute. The A-D relationships among the nodes are preserved through the same attribute. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.7 is produced using this approach.

Table 2.7

XRecursive scheme: (a) Tag_Structure, and (b) Tag_Value tables

(a) Tag_Structure Table

tagName  Id  pId
DBGroup  1   1
Member   2   1
Name     3   2
Age      4   2
Office   5   2

(b) Tag_Value Table

tagId  Value       Type
3      ‘Fervvac’   E
4      ‘23’        E
5      ‘CSD 4212’  E

In the Tag_Structure table, the tagName attribute pertains to the name of the node, the Id attribute represents the node ID, which is the primary key, and the pId attribute represents the parent ID of the node. In the Tag_Value table, the tagId attribute only represents the tag ID (elements or attributes) that contains a value. Value shows the tag value, Type ‘A’ denotes an attribute and ‘E’ pertains to the element. The limitations of the XRecursive approach are that it requires a large amount of storage space because it has to store the whole document (inner and leaf nodes) in its RDB, which increases the search space within the RDB. In this approach, the path information for the nodes is not stored in the RDB to maintain the P-C and A-D relationships among the nodes. Instead, the parent ID (pId) is used recursively in forming the path, which increases query processing time in a RDBMS.
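
The recursive use of pId can be sketched as a loop that climbs from a node to the root, issuing one lookup per level. The rows are those of the Tag_Structure table in Table 2.7, where the root is recorded as its own parent; the sqlite3 setup and the helper name path_of are assumptions for illustration, not code from the XRecursive authors.

```python
import sqlite3

# In-memory database with the Tag_Structure table (rows from Table 2.7).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tag_Structure (tagName TEXT, Id INTEGER PRIMARY KEY, pId INTEGER);
INSERT INTO Tag_Structure VALUES
    ('DBGroup', 1, 1),
    ('Member', 2, 1),
    ('Name', 3, 2),
    ('Age', 4, 2),
    ('Office', 5, 2);
""")


def path_of(node_id):
    """Hypothetical helper: rebuild the root-to-node path by following pId upwards.

    Returns the path string and the number of table lookups that were needed.
    """
    names = []
    lookups = 0
    current = node_id
    while True:
        row = conn.execute(
            "SELECT tagName, pId FROM Tag_Structure WHERE Id = ?", (current,)
        ).fetchone()
        lookups += 1
        if row is None:
            raise KeyError(node_id)
        name, parent = row
        names.append(name)
        # In Table 2.7 the root carries its own Id as pId, so stop there.
        if parent is None or parent == current:
            break
        current = parent
    return "/" + "/".join(reversed(names)), lookups


print(path_of(3))
```

The number of lookups grows with the depth of the node, which is the recursion cost that the limitation above refers to.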


ii. Suri-Sharma Approach

In the Suri-Sharma approach (Suri & Sharma 2012), the nodes, including the values, type, name, and position of the nodes, are stored in a single table. The Path concept is not utilized in this approach. Instead of storing distinct paths, the parent node ID for every node is stored in order to preserve the P-C relationships of the nodes. In addition, this approach uses the parent node ID recursively to preserve the A-D relationships among the nodes. This approach requires a smaller number of join operations for query processing compared to other approaches such as XRel (Yoshikawa et al. 2001) and XPEV (Qin et al. 2005). Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.8 is yielded through the model mapping process employed in the Suri-Sharma approach.

Table 2.8

Suri-Sharma scheme: (a) Node and (b) Data tables

(a) Node Table

NodeID  NodeName
1       DBGroup
2       Member
3       Name
4       Age
5       Office

(b) Data Table

DocID  NodeID  ParentID  NodeValue   NodeType  NodePos
10     1       NULL      NULL        Element   1
10     2       1         NULL        Element   1,1
10     3       2         ‘Fervvac’   Text      1,1,1
10     4       2         ‘23’        Text      1,1,2
10     5       2         ‘CSD 4212’  Text      1,1,3

The NodeID attribute, which pertains to the node identifier, and the NodeName, which depicts the node name, are stored in the Node table. In the Data table, the DocID attribute refers to the ID of the particular XML document; NodeID pertains to the node identifier; ParentID denotes the parent node ID of a node; NodeValue signifies the node value, i.e. the text values are stored in this node; NodeType indicates whether the node is an element, an attribute or a text; and NodePos is the node position among the siblings of the node in the XML data graph.
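
The recursive use of ParentID for A-D relationships can be sketched with a recursive common table expression. The rows are those of Table 2.8; the sqlite3 setup, the CTE name anc, and the helper name ancestors are assumptions introduced for this illustration, and the sketch does not claim to be the query form used by Suri and Sharma.

```python
import sqlite3

# In-memory database with the Node and Data tables (rows from Table 2.8).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node (NodeID INTEGER PRIMARY KEY, NodeName TEXT);
CREATE TABLE Data (DocID INTEGER, NodeID INTEGER, ParentID INTEGER,
                   NodeValue TEXT, NodeType TEXT, NodePos TEXT);
INSERT INTO Node VALUES
    (1, 'DBGroup'), (2, 'Member'), (3, 'Name'), (4, 'Age'), (5, 'Office');
INSERT INTO Data VALUES
    (10, 1, NULL, NULL, 'Element', '1'),
    (10, 2, 1, NULL, 'Element', '1,1'),
    (10, 3, 2, 'Fervvac', 'Text', '1,1,1'),
    (10, 4, 2, '23', 'Text', '1,1,2'),
    (10, 5, 2, 'CSD 4212', 'Text', '1,1,3');
""")

# Each recursion step joins Data with the rows found so far, one level per step.
ANCESTORS_SQL = """
WITH RECURSIVE anc(NodeID, ParentID, depth) AS (
    SELECT NodeID, ParentID, 0 FROM Data WHERE NodeID = ?
    UNION ALL
    SELECT d.NodeID, d.ParentID, anc.depth + 1
    FROM Data d JOIN anc ON d.NodeID = anc.ParentID
)
SELECT n.NodeName
FROM anc JOIN Node n ON n.NodeID = anc.NodeID
WHERE anc.depth > 0
ORDER BY anc.depth DESC
"""


def ancestors(node_id):
    """Hypothetical helper: ancestor names of a node, ordered from the root down."""
    return [row[0] for row in conn.execute(ANCESTORS_SQL, (node_id,))]


print(ancestors(3))
```

The recursion stops at the root because its ParentID is NULL and therefore matches no NodeID.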


The limitation of the Suri-Sharma approach is that the entire XML document (all inner and leaf nodes) is stored in the Data table, which increases the RDB storage space. Furthermore, the paths are not stored in the RDB scheme. Instead, the P-C and A-D relationships among the nodes are preserved through multiple joins of the tables or the recursive use of the parent ID, which is costly in an RDBMS. Therefore, query processing deteriorates in this approach because the query response time increases for any kind of query.

iii. s-XML Approach

The s-XML approach (Subramaniam et al. 2012) is based on the persistent-labelling schema (Khaing & Thein 2006). This labelling schema supports dynamic updates and enhances the efficiency of XML processing. It provides efficient support for the retrieval of structural queries as well as for complex chain and twig queries. The P-C relationship among the nodes is preserved through the utilization of the LParent attribute. Moreover, the A-D relationship among the nodes is preserved through the reduction of the level by one and the corresponding parent-label and self-label. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.9 is yielded by the model mapping process of s-XML.

Table 2.9

s-XML scheme: (a) ParentTable and (b) ChildTable

(a) ParentTable

IdNode  pName    cName    Level  LParent  SelfLabel
1       -        DBGroup  0      -        [1,1]
2       DBGroup  Member   1      1        [1,1]
3       Member   Name     2      2        [1,1]
4       Member   Age      2      2        [2,1]
5       Member   Office   2      2        [3,1]

(b) ChildTable

IdNode  Level  pName   SelfLabel  LParent  Value
1       3      Name    [1,1]      3        ‘Fervvac’
2       3      Age     [1,1]      4        ‘23’
3       3      Office  [1,1]      5        ‘CSD 4212’


The s-XML approach uses ParentTable and ChildTable to describe the relational schema. In the ParentTable, the IdNode attribute denotes a unique node identifier; pName stores the parent node name; cName preserves the child name; Level retains the level information; LParent conserves the parent label of the node, where the reference of the parent label (IdNode) is stored; and SelfLabel preserves the self-label or local label of the node, which is [n, d] in the persistent-labelling schema. In the ChildTable, IdNode is a unique node identifier; Level stores the node level information in the XML document; pName stores the element name of the parent node; SelfLabel preserves the self-label or local label of the node [n, d] in the persistent-labelling schema; LParent conserves the node parent-label; and Value stores the node value. The limitation of the s-XML approach is that it stores all the information about the inner and leaf nodes of the document together in two tables (ParentTable, ChildTable). These tables contain dangling columns and redundant data (the pName and SelfLabel attributes in ChildTable). This leads to an increase in the storage space and query response time. With respect to the query processing aspects, the paths for each node are not stored in a separate table. Instead, multiple joins of the tables or the recursive utilization of the LParent attribute (nested queries) are used, which is expensive in an RDBMS (Vaidya 2015). Hence, the s-XML approach increases the query response time for all kinds of queries.

d. Hybrid Technique (Edge- and Path-Based)

Some researchers have combined Edge-based and path-based techniques to form a hybrid technique (Gou & Chirkova 2007; Qin et al. 2005; Jiang et al. 2002). Like the Edge-based technique, this hybrid technique stores the hierarchical structure of the XML document based on the P-C relationships between the nodes (edges). In the study by Jiang et al. (2002), the P-C relationships are handled by using a DataPath table that contains the Pid and Cid attributes for the parent ID and child ID, respectively. Qin et al. (2005) handle the P-C relationships by using an Edge table that contains Source and Target attributes for every edge. Like the path-based technique, the path information from the root to any XML document node is stored as shown previously in Figure 2.6(a) (Wu & Huang 2008; Qin et al. 2005; Jiang et al. 2002). Using this hybrid technique, all the edge, ancestor, and path information for the existing nodes (root-to-node) of the document is stored. This results in an increase in the storage space requirement, particularly in the case of large XML documents (Dweib & Lu 2012; Fakharaldien et al. 2012; Ying et al. 2012; Zafari et al. 2010; Dweib et al. 2009; Prakas et al. 2004). For the translated SQL query of the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’ shown in Figure 2.9, this technique requires a large number of join operations in order to maintain the P-C relationships among the nodes (Wu & Huang 2008). In addition, finding the ancestor of a specific node to maintain the A-D relationships among the nodes is very costly in this technique because it needs to join tables with themselves frequently (Dweib & Lu 2012; Fakharaldien et al. 2012; Zafari et al. 2010). These factors lead to an increase in the time taken to execute SQL queries, particularly complex queries (twig queries) on large XML documents.

SELECT d2.Value
FROM   Data d1, Data d2, Element e1,
       LabelPath Lb1, LabelPath Lb2, LabelPath Lb3,
       DataPath p1, DataPath p2
WHERE  Lb1.Path = './DBGroup./Member./Age'
AND    Lb2.Path = './DBGroup./Member./Name'
AND    Lb3.Path = './DBGroup./Member'
AND    d1.PathID = Lb1.ID
AND    d2.PathID = Lb2.ID
AND    d1.Did = p1.Cid
AND    d2.Did = p2.Cid
AND    p1.Pid = p2.Pid
AND    e1.PathID = Lb3.ID
AND    p1.Pid = e1.Did
AND    d1.Value = 23;

Figure 2.9

Translated SQL query for Q1 using hybrid (Edge- and path-based) technique
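
The number of joins in Figure 2.9 can be checked by running the query on a small SQLite database. The sketch below assumes the LabelPath, DataPath, Data, and Element rows of the XParent example (Table 2.10); the sqlite3 setup and column types are assumptions for illustration. As the values are stored as text here, the predicate is written as d1.Value = '23' instead of the numeric comparison in the figure.

```python
import sqlite3

# In-memory database with four XParent tables (rows patterned on Table 2.10).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LabelPath (ID INTEGER PRIMARY KEY, Len INTEGER, Path TEXT);
CREATE TABLE DataPath (Pid INTEGER, Cid INTEGER);
CREATE TABLE Data (PathID INTEGER, Did INTEGER, Ordinal INTEGER, Value TEXT);
CREATE TABLE Element (PathID INTEGER, Did INTEGER, Ordinal INTEGER);
INSERT INTO LabelPath VALUES
    (1, 1, './DBGroup'),
    (2, 2, './DBGroup./Member'),
    (3, 3, './DBGroup./Member./Name'),
    (4, 3, './DBGroup./Member./Age'),
    (5, 3, './DBGroup./Member./Office');
INSERT INTO DataPath VALUES (1, 2), (2, 3), (2, 4), (2, 5);
INSERT INTO Data VALUES
    (3, 3, 1, 'Fervvac'), (4, 4, 1, '23'), (5, 5, 1, 'CSD 4212');
INSERT INTO Element VALUES
    (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1), (5, 5, 1);
""")

# Q1 as in Figure 2.9: eight table references are combined by equality predicates.
Q1 = """
SELECT d2.Value
FROM   Data d1, Data d2, Element e1,
       LabelPath Lb1, LabelPath Lb2, LabelPath Lb3,
       DataPath p1, DataPath p2
WHERE  Lb1.Path = './DBGroup./Member./Age'
AND    Lb2.Path = './DBGroup./Member./Name'
AND    Lb3.Path = './DBGroup./Member'
AND    d1.PathID = Lb1.ID
AND    d2.PathID = Lb2.ID
AND    d1.Did = p1.Cid
AND    d2.Did = p2.Cid
AND    p1.Pid = p2.Pid
AND    e1.PathID = Lb3.ID
AND    p1.Pid = e1.Did
AND    d1.Value = '23'
"""

result = [row[0] for row in conn.execute(Q1)]
print(result)
```

The DataPath table is referenced twice only to establish that the Age and Name leaves share the same Member parent, which is the P-C maintenance cost noted above.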

Examples of model mapping approaches that employ an edge- and path-based hybrid technique are: (i) XParent, (ii) XPEV, and (iii) XPred, which are discussed below.


i. XParent Approach

The XParent approach (Jiang et al. 2002) is an improved version of the Edge and XRel approaches that combines Edge- and path-based techniques. Similar to XRel, XParent uses a Path Index Table (LabelPath). The Edge and Edge-Value elements are used to store the P-C relationship, where a separate table (DataPath) is used to preserve this relationship. The XParent scheme is similar to the XRel scheme, except that XParent uses a data-path identifier (Did), while XRel uses start and end position elements to preserve the P-C relationship. In XParent, an Ancestor table (Did, Ancestor, Level) is used to store all of the ancestors of a specific node in a single table. The Ancestor table enables the query response time to be reduced by replacing the ϴ-joins with equijoins over the set of ancestors. The XPath data model is employed by the XParent schema to represent an XML document (Clark & Derose 1999). The XML document is modelled as an ordered tree, which is then mapped into four relational tables. Based on the XML document in Figure 2.2(b), the model mapping process in the XParent approach produces the RDB scheme shown in Table 2.10.

Table 2.10

XParent scheme: (a) LabelPath, (b) DataPath, (c) Data, (d) Element, (e) Ancestor tables

(a) LabelPath Table

ID  Len  Path
1   1    ./DBGroup
2   2    ./DBGroup./Member
3   3    ./DBGroup./Member./Name
4   3    ./DBGroup./Member./Age
5   3    ./DBGroup./Member./Office

(b) DataPath Table

Pid  Cid
1    2
2    3
2    4
2    5

(c) Data Table

PathID  Did  Ordinal  Value
3       3    1        ‘Fervvac’
4       4    1        ‘23’
5       5    1        ‘CSD 4212’

(d) Element Table

PathID  Did  Ordinal
1       1    1
2       2    1
3       3    1
4       4    1
5       5    1

(e) Ancestor Table

Did  Ancestor  Level
2    1         1
3    2         1
3    1         2
4    2         1
4    1         2
5    2         1
5    1         2

In the LabelPath table, the ID, Len, and Path attributes represent a unique label-path identifier, the number of edges of the label-path, and a distinct simple path expression, respectively. In the DataPath table, the Pid and Cid attributes represent the parent-node and child-node IDs of an edge, respectively. In the Element and Data tables, PathID is a foreign key that references the ID in the LabelPath table, Did is a data-path identifier, Ordinal is the ordinal number of the node among all siblings that share the same parent, and Value is the string value. The limitation of the XParent approach is that it stores every node in the document. In addition, it uses the Ancestor table to store every ancestor of a node. This leads to a high disk space requirement (Prakas et al. 2004). Also, using this approach may lead to an overhead because of redundant information in the Ancestor table (Subramaniam et al. 2012). XParent uses a similar method to that in XRel, where the paths for all the existing nodes are preserved, which increases the storage space requirements (Dweib & Lu 2012; Ying et al. 2012). The parent ID and ancestor information is preserved for all existing nodes in a document. Therefore, determining an ancestor of a specific node is costly because the tables need to be joined with themselves frequently (Dweib & Lu 2012; Fakharaldien et al. 2012; Zafari et al. 2010).


ii. XPEV Approach

The XPEV approach (Qin et al. 2005) is an improved version of the Edge and XRel approaches that uses Edge- and path-based techniques and has a three-table schema. The Path table stores distinct (root-to-any) paths, the Edge table preserves P-C relationships, and the Value table stores all of the element and attribute values. The XQuery 1.0 and XPath 2.0 data model (Fernández et al. 2002) is applied in this approach. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.11 is produced using XPEV.

Table 2.11

XPEV scheme: (a) Path, (b) Edge, and (c) Value tables

(a) Path Table

Pid  Pathexpress
1    #/DBGroup
2    #/DBGroup#/Member
3    #/DBGroup#/Member#/Name
4    #/DBGroup#/Member#/Age
5    #/DBGroup#/Member#/Office

(b) Edge Table

Pid  Source  Target  Label      Ordinal  Flag
1    0       1       ‘DBGroup’  1        ref
2    1       2       ‘Member’   1        ref
3    2       3       ‘Name’     1        val
4    2       4       ‘Age’      1        val
5    2       5       ‘Office’   1        val

(c) Value Table

Pid  Source  Target  Label     Ordinal  Value
3    2       3       ‘Name’    1        ‘Fervvac’
4    2       4       ‘Age’     1        ‘23’
5    2       5       ‘Office’  1        ‘CSD 4212’

In the Path table, the Pid and Pathexpress attributes represent the path identifier and the distinct simple path expression, respectively. In the Edge and Value tables, the Pid, Source, Target, Label, Ordinal, Flag, and Value attributes denote the path identifier, the identifier of the source node, the identifier of the target node, the label of the edge, the ordinal number of the node among all siblings that share the same parent (because the edges are ordered), a flag indicating an internal node (ref) or a value, i.e. a leaf (val), and the string value, respectively.


The limitation of the XPEV approach is that the RDB scheme is large because entire XML documents are stored in two tables, namely the Edge and the Value tables (Vyas & Kolte 2014). Similar to XRel, the XPEV approach stores paths for any existing nodes in the XML tree. This increases the storage space requirement and slows down the searches of the Path table. Also, the two tables (Edge, Value) need to be combined in this approach to determine A-D relationships. These two tables are joined using the Source and Target attributes. This leads to an increase in the number of join operations between the two tables, which increases the query response time, i.e. query performance deteriorates (Vaidya 2015; Vyas & Kolte 2014).

iii. XPred Approach

Like the XParent approach, a hybrid Edge and path-based technique is used in the XPred approach (Wu & Huang 2008). XPred stores structural information within the nodes to reduce the number of join operations when processing user queries. In particular, for every node in a given XML document, the path from its root and its predecessor’s information is stored in the RDB scheme. XPred can eliminate the join operation for P-C traversing. Therefore, this approach can improve query processing performance. In other words, the XPred approach reduces the potentially large number of join operations that would otherwise be required for processing user queries. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.12 is produced using XPred.

Table 2.12

XPred scheme: (a) Path, (b) Node, and (c) Data tables

(a) Path Table

PathID  Length  LabelPath
1       1       ./DBGroup
2       2       ./DBGroup./Member
3       3       ./DBGroup./Member./Name
4       3       ./DBGroup./Member./Age
5       3       ./DBGroup./Member./Office

(b) Node Table

NodeID  PathID  Ordinal  PredID
1       1       1        NULL
2       2       1        1

(c) Data Table

NodeID  PathID  Ordinal  PredID  Value
3       3       1        2       ‘Fervvac’
4       4       1        2       ‘23’
5       5       1        2       ‘CSD 4212’

In the Path table, PathID, Length, and LabelPath denote the unique label-path identifier of each label-path, the length of the corresponding label-path, and the distinct label-path expressions for each node starting from the root, respectively. The Node table stores the information about the elements and attributes of an XML document. In the Node table, the attributes NodeID, PathID, Ordinal, and PredID denote a unique ID of each node; the label-path identifier, i.e. the foreign key of the PathID in the Path table of each node; the ordinal number of the corresponding node among nodes with the same name and connected to the same parent; and the predecessor node identifier, i.e. parent node identifier, respectively. The Data table has the same attributes, but also stores an additional attribute, namely Value, which refers to the value of a node, whether the value is that of the corresponding element or that of the corresponding attribute in the XML document. The limitation of the XPred approach is that the high number of tables to be joined and tuples required increases the size of the RDB. This approach stores the paths for any existing nodes in the XML tree. This increases the storage space requirement and leads to a slower search of the Path table. Moreover, the Node table needs to be joined several times or recursively by the PredID to preserve the A-D relationships. Therefore, the query response time is increased, particularly in the case of a complex query (twig query) on large XML documents.

e. Hybrid Technique (Path- and Node-Based)

Path- and node-based techniques are combined to form this hybrid technique. The distinct path information for all the leaf nodes of the XML document, as shown in Figure 2.6(b), is handled and stored in the Path table used in this technique. In addition, all the inner node information and labels (according to one of the node-labelling schemes) of the XML document are stored as data in this technique, and this data is already stored as part of the path information in the Path table. These aspects lead to an increase in the number of tables and label size, which causes an increase in the storage space, particularly for large XML documents (Vaidya 2015; Haw & Lee 2011; Senthilkumar et al. 2011; Maghaydah 2010). In this technique, the P-C and A-D relationships among the nodes are preserved through applying a recursive process on the parent node label (parent ID) in the inner nodes table and through the path information about the leaf nodes. For the translated SQL query of the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’ shown in Figure 2.10, this technique has an additional computation processing cost, i.e. the number of join operations and nested queries is increased. This leads to an increase in the query response time, particularly for complex queries (A-D queries) on large XML documents (Vaidya 2015; Haw & Lee 2011; Maghaydah 2010).

SELECT v1.LeafValue
FROM   InnerNodes t1, LeafNode v1, Path p1
WHERE  p1.PathExp = 'DBGroup/Member/Name'
AND    v1.PathID = p1.PathID
AND    v1.ParentID = t1.INodeID
AND    t1.ParentID = (
           SELECT t2.ParentID
           FROM   InnerNodes t2, LeafNode v2, Path p2
           WHERE  p2.PathExp = 'DBGroup/Member/Age'
           AND    v2.PathID = p2.PathID
           AND    v2.ParentID = t2.INodeID
           AND    v2.LeafValue = 23 )

Figure 2.10

Translated SQL query for Q1 using hybrid (Path- and Node-based) technique
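
The query in Figure 2.10 can be tried out on a small SQLite database as a rough check of how the joins and the nested query interact. The sketch below assumes the Path, InnerNodes, and LeafNodes rows of the Ying-Cao example (Table 2.13). Two choices are assumptions of this illustration: the leaf table is named LeafNodes as in Table 2.13 (the figure writes LeafNode), and the value is compared as the string '23' because it is stored as text.

```python
import sqlite3

# In-memory database with three Ying-Cao tables (rows patterned on Table 2.13).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path (PathId INTEGER PRIMARY KEY, PathExp TEXT);
CREATE TABLE InnerNodes (DocId INTEGER, INodeId INTEGER, NodeName TEXT,
                         ParentId INTEGER, Level INTEGER, SiblingNum INTEGER);
CREATE TABLE LeafNodes (DocId INTEGER, LNodeId INTEGER, PathId INTEGER,
                        ParentId INTEGER, LeafValue TEXT);
INSERT INTO Path VALUES
    (1, 'DBGroup/Member/Name'),
    (2, 'DBGroup/Member/Age'),
    (3, 'DBGroup/Member/Office');
INSERT INTO InnerNodes VALUES
    (10, 1, 'DBGroup', NULL, 1, 1),
    (10, 2, 'Member', 1, 2, 1),
    (10, 3, 'Name', 2, 3, 1),
    (10, 4, 'Age', 2, 3, 2),
    (10, 5, 'Office', 2, 3, 3);
INSERT INTO LeafNodes VALUES
    (10, 1, 1, 3, 'Fervvac'),
    (10, 2, 2, 4, '23'),
    (10, 3, 3, 5, 'CSD 4212');
""")

# Q1 as in Figure 2.10: the outer block finds Name leaves, and the nested block
# finds the Member node whose Age leaf is 23, linked through the parent IDs.
Q1 = """
SELECT v1.LeafValue
FROM   InnerNodes t1, LeafNodes v1, Path p1
WHERE  p1.PathExp = 'DBGroup/Member/Name'
AND    v1.PathId = p1.PathId
AND    v1.ParentId = t1.INodeId
AND    t1.ParentId = (
           SELECT t2.ParentId
           FROM   InnerNodes t2, LeafNodes v2, Path p2
           WHERE  p2.PathExp = 'DBGroup/Member/Age'
           AND    v2.PathId = p2.PathId
           AND    v2.ParentId = t2.INodeId
           AND    v2.LeafValue = '23')
"""

result = [row[0] for row in conn.execute(Q1)]
print(result)
```

Both the outer and the nested block join the same three tables, so a single predicate in the twig query doubles the number of table references.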

Examples of the model mapping approaches that employ a hybrid path- and node-based technique are: (i) Ying-Cao and (ii) Wang et al., which are discussed below.


i. Ying-Cao Approach

In the Ying-Cao approach (Ying et al. 2012), a path-based technique and a node-based technique (for the inner nodes) are combined to form a hybrid technique for transferring the major content of XML documents to an RDB scheme. In this approach, the Path table is utilized to store each distinct path expression of the leaf node (i.e. the distinct path from the root node to the leaf node). This method of storage decreases the size of the RDB, especially for complex XML documents. Complex structural queries, such as twig and recursive queries, can be answered according to the relationship preserved by the labels (parent IDs) of the inner nodes in the InnerNodes table. A deeper structural search of the XML document can be undertaken by the Ying-Cao approach, which improves query processing performance (i.e. the query response time is reduced). Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.13 is yielded by the Ying-Cao approach.

Table 2.13

Ying-Cao scheme: (a) File, (b) Path, (c) InnerNodes, and (d) LeafNodes tables

(a) File Table

DocId  Name
10     DBGroup.XML

(b) Path Table

PathId  PathExp
1       DBGroup/Member/Name
2       DBGroup/Member/Age
3       DBGroup/Member/Office

(c) InnerNodes Table

DocId  INodeId  NodeName  ParentId  Level  SiblingNum
10     1        DBGroup   NULL      1      1
10     2        Member    1         2      1
10     3        Name      2         3      1
10     4        Age       2         3      2
10     5        Office    2         3      3

(d) LeafNodes Table

DocId  LNodeId  PathId  ParentId  LeafValue
10     1        1       3         ‘Fervvac’
10     2        2       4         ‘23’
10     3        3       5         ‘CSD 4212’


In the File table, DocId refers to the distinct ID for each XML document and Name pertains to the XML document name. In the Path table, PathId denotes the path identifier, while the distinct path expressions of the leaf node are stored in PathExp. In the InnerNodes table, INodeId refers to the label provided for each inner node, NodeName conveys the name of the inner node, ParentId denotes the INodeId of the parent node, Level conveys the level of the inner node, and SiblingNum depicts the position of each inner node among the siblings of the node in an XML tree. Aside from the attributes that are in the InnerNodes table, the LeafNodes table employs the LeafValue attribute to store the text content or attribute value. The limitation of the Ying-Cao approach is that a large amount of storage space may be needed for a complex RDB scheme. The InnerNodes table needs to be joined several times or recursively by the parent ID to preserve the A-D relationships. Therefore, the query response time is increased, particularly in the case of a complex query (twig query) on large XML documents.

ii. Wang et al. Approach

In the Wang et al. approach (Wang et al. 2012) a path-based technique and a node-based technique (for the inner nodes) are combined to map the contents and hierarchical structure of the XML document to the RDB schema. This approach introduces a unique utilization of the RDB for storing and querying XML data. Two relational tables are used in its RDB scheme to store the nodes of the XML document tree: one for the nodes that have a value and one for the nodes that do not. The significant characteristics of the Wang et al. approach ease XML querying because the index structure does not need to be determined, which is equivalent to the B+-tree (Yan et al. 2005) and R-tree (Yu et al. 2007) approaches. The information on the leaf nodes is stored with their paths in the same table. The P-C relationship among the XML nodes is preserved through the ParentID attribute, and this approach employs the same attribute recursively to preserve the A-D relationship among the nodes. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.14 is produced through the model mapping process in the Wang et al. approach.


Table 2.14 Wang et al. scheme: (a) No_Value_Table and (b) Value_Table

(a) No_Value_Table

ID  Name     ParentID  Level1
1   DBGroup  NULL      1
2   Member   1         2

(b) Value_Table

ID  Name    Value1      Path                   ParentID  Level1
1   Name    ‘Fervvac’   DBGroup/Member/Name    2         3
2   Age     ‘23’        DBGroup/Member/Age     2         3
3   Office  ‘CSD 4212’  DBGroup/Member/Office  2         3

The Value_Table is used to store the relevant information on the elements (or attributes) of the XML document that have a value. In this table the ID attribute refers to the node sequence number that reflects the order of appearance during pre-order traversal of the document tree; Name depicts the element (or attribute) name; Value1 denotes the element (or attribute) value; Path gives the path from the root node to the present node (leaf node); ParentID pertains to the parent node ID of the element (or attribute); and Level1 indicates the layer number of the node location. The No_Value_Table stores the relevant information about the elements (or attributes) of the XML document that do not have a value. There are four attributes in this table, namely ID, Name, ParentID, and Level1. The meanings of the contents stored in these four attributes are similar to those in the Value_Table. The limitation of the Wang et al. approach is that it increases the RDB storage size and query response time. The path in this approach is stored as a string in the same table where the values of the leaf nodes are stored. This introduces data and path redundancy, particularly when several leaf nodes have the same path. Furthermore, the approach stores all the information about the inner nodes, which increases the size of the RDB. With regard to query processing, the use of ParentID expands the search space for the path in the Value_Table. Moreover, a high number of joins is needed in the Wang et al. approach for complex queries (twig queries), which increases the query response time.
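The path redundancy described above can be made concrete with a small sketch. It compares the amount of path text stored when every leaf row carries its own path string, as in the Value_Table, with the amount stored when each distinct path is kept once in a separate table. The rows for a second Member element and the helper name `path_storage` are hypothetical and introduced only for this illustration; a separate path table would additionally need a path identifier in each leaf row, which the sketch does not count.

```python
# Leaf rows as (Name, Value1, Path). The first three rows follow Table 2.14;
# the last three describe a second, invented Member element.
value_rows = [
    ("Name", "Fervvac", "DBGroup/Member/Name"),
    ("Age", "23", "DBGroup/Member/Age"),
    ("Office", "CSD 4212", "DBGroup/Member/Office"),
    ("Name", "Example", "DBGroup/Member/Name"),
    ("Age", "30", "DBGroup/Member/Age"),
    ("Office", "Room 1", "DBGroup/Member/Office"),
]


def path_storage(rows):
    """Return (leaf rows, distinct paths, inline path chars, distinct path chars)."""
    # Characters of path text when each leaf row stores its full path string.
    inline_chars = sum(len(path) for _, _, path in rows)
    # Characters of path text when each distinct path is stored only once.
    distinct_paths = {path for _, _, path in rows}
    distinct_chars = sum(len(path) for path in distinct_paths)
    return len(rows), len(distinct_paths), inline_chars, distinct_chars


leaf_count, path_count, inline_chars, distinct_chars = path_storage(value_rows)
print(f"{leaf_count} leaf rows share {path_count} distinct paths")
print(f"inline path text: {inline_chars} chars, distinct path text: {distinct_chars} chars")
```

The gap between the two figures widens with every repeated element, which is the situation in which the inline path string becomes costly.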


f. Indexing (Signature) Technique

The indexing or signature technique traverses and parses the XML document just once. When the XML document is being traversed, the signatures of all of the distinct nodes (numerical IDs starting from 1) and their distinct names are stored in a separate table. This table is called the signatures table in the LNV and Sainan-Caifeng approaches and the vertex table in the XMap approach (Bousalem & Cherti 2015; Sainan et al. 2008; El-Sharkawi & El-Hadi El Tazi 2005), which are discussed in more detail below.

In the LNV and Sainan-Caifeng approaches, the signature technique stores all the path information for only the leaf nodes as a sequence of numerical IDs (signatures) with their information (values) in the RDB scheme. The LNV approach has a storage overhead because it uses dangling columns such as Path of Nodes and Type of Node (Sainan et al. 2008). In the case of XMap, this technique stores all the path information for the existing nodes (inner and leaf) as a sequence of numerical IDs (signatures) in the path table and their information in a single table (Data). This leads to an increase in the storage space requirement.

For the XML tree illustrated in Figure 2.2(b), this technique would store the signature information about the nodes in the signatures table by assigning the numbers 1, 2, 3, 4, 5 to ‘DBGroup’, ‘Member’, ‘Name’, ‘Office’, and ‘Age’, respectively. In addition, this technique would store the information (values) about the leaf nodes, namely ‘Name’, ‘Office’, and ‘Age’, with their path information as ‘1,2,3’, ‘1,2,4’, and ‘1,2,5’, respectively, in a single table (Path) in the case of LNV and Sainan-Caifeng. In the case of XMap, this technique would store the information about all the nodes ‘DBGroup’, ‘Member’, ‘Name’, ‘Office’, and ‘Age’ in the Data table and their path information ‘1’, ‘1,2’, ‘1,2,3’, ‘1,2,4’, and ‘1,2,5’, respectively, in the Path table.

For the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’, this technique needs to follow two steps before the query can be executed. First, the XPath query (Q1) is replaced by a path of signatures (numerical IDs) according to the label signatures of the nodes using functions. Thus, the results of this step are two paths of signatures: ‘1,2,3’ for ‘/DBGroup/Member/Name’ and ‘1,2,5’ for ‘/DBGroup/Member/Age’. In the second step the XPath query (Q1) is translated into a SQL query. Since the whole document (inner and leaf nodes) is stored in a single table, this technique needs to perform self-join operations and use functions such as substr, instr, and getParent in order to maintain the P-C and A-D relationships among the nodes in the SQL query shown in Figure 2.11 (Bousalem & Cherti 2015; Vaidya 2015). This leads to an increase in the time taken to execute SQL queries, particularly in the case of a complex query (twig query) on large XML documents.

SET @pathName = getOrdPath('/DBGroup/Member/Name')
SET @pathAge = getOrdPath('/DBGroup/Member/Age')
SELECT d2.Value
FROM Data d1, Data d2, Path p1, Path p2
WHERE p1.path = @pathAge
  AND p2.path = @pathName
  AND d1.pathId = p1.id
  AND d2.pathId = p2.id
  AND d1.Value = 23
  AND getParent(d1.ordpath) = getParent(d2.ordpath)

Figure 2.11 Translated SQL query for Q1 using indexing (signature) technique
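The first of the two query processing steps, in which the XPath query is replaced by paths of signatures, can be sketched as follows. The signature numbering is the one given in the text for the tree in Figure 2.2(b). The function names `to_signature_path` and `split_twig`, and the restriction to twig queries with a single equality predicate on a sibling of the result node, are assumptions made for this illustration and do not reproduce the functions used by the reviewed approaches.

```python
import re

# Label signatures as assigned in the text for the tree in Figure 2.2(b).
signatures = {"DBGroup": 1, "Member": 2, "Name": 3, "Office": 4, "Age": 5}


def to_signature_path(xpath, signatures, sep=","):
    """Replace each label of a simple root-to-node path by its numerical ID."""
    labels = [step for step in xpath.split("/") if step]
    return sep.join(str(signatures[label]) for label in labels)


def split_twig(xpath):
    """Split '/a/b/c [d = v]' into the result path, the predicate path, and v.

    Only one equality predicate on a sibling of the result node is handled.
    """
    match = re.match(
        r"^(?P<path>[^\[]+?)\s*\[\s*(?P<name>\w+)\s*=\s*(?P<value>[^\]]+?)\s*\]$",
        xpath,
    )
    if match is None:
        raise ValueError("unsupported query shape: " + xpath)
    result_path = match.group("path")
    # The predicate node hangs off the parent of the result node.
    parent_path = result_path.rsplit("/", 1)[0]
    predicate_path = parent_path + "/" + match.group("name")
    return result_path, predicate_path, match.group("value")


result_path, predicate_path, value = split_twig("/DBGroup/Member/Name [Age = 23]")
print(to_signature_path(result_path, signatures))     # signature path of the result node
print(to_signature_path(predicate_path, signatures))  # signature path of the predicate node
```

The two signature paths produced here play the role of @pathName and @pathAge in Figure 2.11; this conversion has to run before the SQL query can be issued.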

The model mapping approaches mentioned above that employ this technique (LNV, Sainan-Caifeng, and XMap) are discussed in more detail below.

i. LNV Approach

In the LNV approach (El-Sharkawi & El-Hadi El Tazi 2005), each distinctive node in the XML document is provided with a signature and these signatures are stored in the label’s Signatures table. For each leaf node in the XML document tree, a path expression signifies the path from the root to the leaf node, and this path expression can be represented as a list of the label’s signatures. Thus, a compact storage structure for large XML documents is achieved. The LNV storage structure is constructed while traversing the XML tree. During the query process this structure is analysed without traversing the XML document several times. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.15 is created by the model mapping process of the LNV approach.


Table 2.15 LNV scheme: (a) Signatures and (b) Path tables

(a) Signatures Table

Label    SignatureID
DBGroup  1
Member   2
Office   3
Name     4
Age      5

(b) Path Table

DocID  Path of Signatures  Path of Nodes  Values      Type of Node  Position
10     1,2,4               1,2,3          ‘Fervvac’   Element       1,1,1
10     1,2,5               1,2,4          ‘23’        Element       1,1,2
10     1,2,3               1,2,5          ‘CSD 4212’  Element       1,1,3

In the Signatures table, the Label and SignatureID attributes represent the name of a node and the signature identifier (a sequence of integer numbers starting from 1), respectively. The Path table consists of six attributes. The DocID attribute is a document identifier. The Path of Signatures attribute refers to the list of signatures of the path labels, ordered from the root. The Path of Nodes attribute is the list of nodes in the path, ordered from the root. The Values attribute denotes the value associated with the end of the path. The Type of Node attribute pertains to the leaf node type (element, attribute, comment, or text), and Position depicts the positions that represent the occurrence of each element node among its siblings. The limitation of the LNV approach is that an overhead in terms of the size of the Path of Signatures and Path of Nodes columns may occur when storing large XML documents (Sainan et al. 2008). The substr() and instr() functions are employed to calculate the relationships among the nodes (El-Sharkawi & El-Hadi El Tazi 2005). In addition, the XPath query is processed in two steps. First, the XPath query is translated into the path of the label’s signatures according to the Path table. Second, a SQL query is produced according to the RDB scheme. These steps increase the query response time, which adversely affects query performance, especially in the case of a complex query (twig query) on large XML documents.
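The single traversal on which the signature technique relies can be sketched in a few lines. The sketch assigns a signature to each distinct label on first encounter and emits one row per leaf element holding its path of signatures and its value. The XML text, the first-encounter numbering, and the name `map_leaf_paths` are assumptions made for this illustration, so the IDs need not match those in Table 2.15; the Path of Nodes, Type of Node, and Position columns, as well as XML attributes, are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element order of Figure 2.2(b) is assumed here.
XML_TEXT = (
    "<DBGroup><Member>"
    "<Name>Fervvac</Name><Age>23</Age><Office>CSD 4212</Office>"
    "</Member></DBGroup>"
)


def map_leaf_paths(xml_text):
    """Traverse the tree once; return (label signatures, leaf path rows)."""
    root = ET.fromstring(xml_text)
    signatures = {}  # label -> SignatureID, assigned on first encounter
    path_rows = []   # (path of signatures, value), one row per leaf element

    def visit(elem, prefix):
        sig = signatures.setdefault(elem.tag, len(signatures) + 1)
        path = prefix + [sig]
        children = list(elem)
        if not children:
            # Leaf element: store its root-to-leaf signature path and its text.
            path_rows.append((",".join(map(str, path)), (elem.text or "").strip()))
        for child in children:
            visit(child, path)

    visit(root, [])
    return signatures, path_rows


signatures, path_rows = map_leaf_paths(XML_TEXT)
print(signatures)
print(path_rows)
```

Because inner nodes contribute only to the signature paths of their leaves, no separate rows are produced for them, which is the source of the compact storage structure claimed for this technique.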


ii. Sainan-Caifeng Approach

The Sainan-Caifeng approach (Sainan et al. 2008) is an improved version of the LNV approach. Each distinctive node in the XML document has a signature, and these signatures are stored in the label’s Signatures table. For each leaf node in the XML document tree, a path expression characterizes the path from the root to the leaf node, and this path is represented as a list of the label’s signatures. In the Sainan-Caifeng approach, the LNV storage structure is enhanced by eliminating the Path of Nodes column from the Path table, thus decreasing the storage space. In addition, the query response time is improved by reducing the number of join operations and eliminating the instr() function from the WHERE clause of the translated SQL. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.16 is produced using the Sainan-Caifeng approach.

Table 2.16 Sainan-Caifeng scheme: (a) Signatures and (b) Path tables

(a) Signatures Table

Label    SignatureID
DBGroup  1
Member   2
Office   3
Name     4
Age      5

(b) Path Table

DocID  Path of Signature  Position of Signature  Values      Type of Value
10     1,2,4              1,1,1                  ‘Fervvac’   Element
10     1,2,5              1,1,2                  ‘23’        Element
10     1,2,3              1,1,3                  ‘CSD 4212’  Element

The Label and SignatureID attributes in the Signatures table pertain to the name of a node and a signature identifier (a sequence of integer numbers starting at 1), respectively. In the Path table, the DocID attribute is the number of the XML document. The Path of Signature attribute pertains to the list of signatures of the labels in the path, ordered from the root. The Values attribute is the value related to the end of the path. The Type of Value attribute refers to the leaf node type (element, attribute, comment, or text). The Position of Signature attribute denotes the positions pertaining to the occurrence of each node among its siblings. The limitation of the Sainan-Caifeng approach is that it suffers from an overhead in terms of the size of the Path of Signature and Position of Signature columns when large XML documents are stored in the RDB. Also, substr() functions are needed for string matching in queries and for joining the tables. Moreover, the XPath query is processed in two steps. First, the XPath query is converted into a label’s signature path according to the Path table. Second, the SQL statement is produced according to the RDB scheme. These steps may increase the query response time and have a detrimental effect on query performance, especially in the case of a complicated XPath query (twig query) on large XML documents.

iii. XMap Approach

The XMap approach (Bousalem & Cherti 2015) is an improved version of the LNV and Sainan-Caifeng approaches. This approach maps XML documents into a RDB scheme that contains three tables (Vertex, Path, and Data). The Vertex table stores all the distinctive node names in the XML document using unique IDs. The Path table stores the path information from the root to each node as a sequence of numerical IDs, taken from the Vertex table, for all the nodes used in the path. The Data table stores all the information about the inner and leaf nodes whether or not they have a value. This approach uses ORDPATH labelling for every node in the Data table (O'Neil et al. 2004). Also, instead of using a containment relationship to maintain the P-C and A-D relationships among the nodes as in XRel, XMap stores the ID of the parent in the first part of ORDPATH for every node. Then, equijoins are used to evaluate the P-C relationships. Based on the XML document in Figure 2.2(b), the RDB scheme shown in Table 2.17 is produced using XMap.


Table 2.17 XMap scheme: (a) Vertex, (b) Path, and (c) Data tables

(a) Vertex Table

ID  Name
1   DBGroup
2   Member
3   Name
4   Age
5   Office

(b) Path Table

ID  Path
1   1
2   1.2
3   1.2.3
4   1.2.4
5   1.2.5

(c) Data Table

OrdPath  Value       Order  NumberElements  NumberAttributes  PathID
1        -           1      1               0                 1
1.1      -           2      3               0                 2
1.1.1    ‘Fervvac’   3      0               0                 3
1.1.3    ‘23’        4      0               0                 4
1.1.5    ‘CSD 4212’  5      0               0                 5
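To show how the query in Figure 2.11 operates on the scheme in Table 2.17, the following minimal sketch loads the three tables into an in-memory SQLite database and evaluates Q1. Several points are assumptions made for this illustration rather than details of XMap itself: `get_parent` simply drops the last dot-separated component of the label, which ignores the additional components that the full ORDPATH scheme uses for later insertions; `get_ord_path` builds the numerical path from the Vertex table; the inner nodes are given NULL values; and the Order, NumberElements, and NumberAttributes columns are left out.

```python
import sqlite3


def get_parent(ordpath):
    """Simplified parent label: '1.1.3' -> '1.1'; the root has no parent."""
    head, sep, _ = ordpath.rpartition(".")
    return head if sep else None


conn = sqlite3.connect(":memory:")
# Expose the Python helper to SQL under the name used in Figure 2.11.
conn.create_function("getParent", 1, get_parent)
conn.executescript("""
CREATE TABLE Vertex (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Path (ID INTEGER PRIMARY KEY, Path TEXT);
CREATE TABLE Data (OrdPath TEXT, Value TEXT, PathID INTEGER);
INSERT INTO Vertex VALUES (1,'DBGroup'),(2,'Member'),(3,'Name'),(4,'Age'),(5,'Office');
INSERT INTO Path VALUES (1,'1'),(2,'1.2'),(3,'1.2.3'),(4,'1.2.4'),(5,'1.2.5');
INSERT INTO Data VALUES ('1',NULL,1),('1.1',NULL,2),('1.1.1','Fervvac',3),
                        ('1.1.3','23',4),('1.1.5','CSD 4212',5);
""")


def get_ord_path(xpath):
    """Hypothetical stand-in for getOrdPath: '/DBGroup/Member/Age' -> '1.2.4'."""
    ids = dict(conn.execute("SELECT Name, ID FROM Vertex"))
    return ".".join(str(ids[step]) for step in xpath.split("/") if step)


def run_q1():
    """Evaluate Q1 '/DBGroup/Member/Name [Age = 23]' in the style of Figure 2.11."""
    rows = conn.execute(
        """
        SELECT d2.Value
        FROM Data d1, Data d2, Path p1, Path p2
        WHERE p1.Path = ? AND p2.Path = ?
          AND d1.PathID = p1.ID AND d2.PathID = p2.ID
          AND d1.Value = '23'
          AND getParent(d1.OrdPath) = getParent(d2.OrdPath)
        """,
        (get_ord_path("/DBGroup/Member/Age"), get_ord_path("/DBGroup/Member/Name")),
    ).fetchall()
    return [value for (value,) in rows]


print(run_q1())
```

The query needs two path conversions before execution and two getParent() calls per candidate pair of rows, which is the function overhead that the comparison of translated queries attributes to this technique.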

In the Vertex table, the ID and Name attributes represent a unique identifier and the node name, respectively. In the Path table, the ID and Path attributes represent a unique identifier and the path that is constructed from the IDs of all the nodes on the way from the root to each node in the XML document, respectively. In the Data table, the OrdPath, Value, Order, NumberElements, NumberAttributes, and PathID attributes represent the ID of the node based on the ORDPATH labelling scheme, the value of the node, the order of the node among all the nodes in the document, the number of child elements of the node, the number of attributes of the node, and the path identifier, which is a foreign key referencing the ID in the Path table, respectively. The limitation of the XMap approach is that it increases the RDB storage space and query response time. XMap stores the whole document, i.e. all the inner and leaf nodes, in a single table (the Data table). In addition, XMap stores the OrdPath identifier for each node, which increases the column size. These factors expand the RDB storage space, particularly for large XML documents. Moreover, XMap uses functions to convert the path used in the XPath queries into the same format as the paths used in the Path table (a sequence of numerical IDs), and these numerical IDs are used in the translated SQL query. In addition, XMap uses the getParent() function on the OrdPath ID in its translated SQL query to get the ID of the parent in the first part of ORDPATH for a node in order to maintain the P-C and A-D relationships among the nodes. These factors lead to an increase in the time taken to execute queries, particularly in the case of complex queries (twig queries) on large XML documents.

In Table 2.18 below, the 16 approaches reviewed above are summarized and compared in terms of the techniques used, RDB scheme produced, relationships among the nodes, and effect of each approach on RDB storage space and query response time.


Table 2.18 Summary of model mapping approaches

Approach: Edge
Reference: (Florescu & Kossmann 1999)
Technique: Edge-Based
RDB Scheme: Edge (Source, Ordinal, Target, Label, Flag, Value)
Relationships Among Nodes: It maintains the P-C relationship using Source-Target. It keeps the A-D relationship using self-joins on the Edge table.
RDB Storage Space: It stores the whole document in a single table. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: Excessive self-joins (concatenation of edges to form a path) are needed. It needs more time to respond to all types of query.

Approach: XRel
Reference: (Yoshikawa et al. 2001)
Technique: Path-Based
RDB Scheme: Path (PathID, Pathexp); Element (DocID, PathID, Start, End, Index, Reindex); Text (DocID, PathID, Start, End, Value); Attribute (DocID, PathID, Start, End, Value)
Relationships Among Nodes: It maintains both P-C and A-D relationships using the containment relationship.
RDB Storage Space: It stores all the nodes of the document with their distinct path from root-to-any node. Its schema contains four tables. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It uses four tables in its RDB scheme. It stores distinct paths for all inner nodes, so the path search space increases. It uses θ-joins to check the edge connections. It does not use a special full-text search. It needs time to respond to all types of query.

Approach: XParent
Reference: (Jiang et al. 2002)
Technique: Hybrid (Edge- and Path-Based)
RDB Scheme: LabelPath (ID, Len, Path); DataPath (Pid, Cid); Element (PathID, Did, Ordinal); Data (PathID, Did, Ordinal, Value); Ancestor (Did, Ancestor, Level)
Relationships Among Nodes: It maintains the P-C relationship using the DataPath table. It maintains the A-D relationship using self-joins on the DataPath table and the Ancestor table.
RDB Storage Space: It stores all the nodes of the document with their distinct path from root-to-any node. Its schema contains four tables. It stores every node and its ancestor using the Ancestor table to preserve the A-D relationship. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It uses four tables to represent the schema. It stores distinct paths for all existing nodes, which increases the path search space. It needs to join tables frequently or use a recursive operation on the DataPath table. It needs more time to respond to all types of query.

Approach: SUCXENT
Reference: (Prakas et al. 2004)
Technique: Path-Based
RDB Scheme: Document (DocId, Name); Path (PathId, PathExp); PathValue (DocId, PathId, LeafOrder, SiblingOrder, LeftSibIxnLevel, LeafValue); AncestorInfo (DocId, SiblingOrder, AncestorOrder, AncestorLevel); TextContent (DocId, LinkId, Text)
Relationships Among Nodes: It maintains the P-C relationship using paths from root-to-leaf node. It maintains the A-D relationship using AncestorInfo.
RDB Storage Space: It stores the whole document in a complex schema (five tables). It stores paths from root-to-leaf nodes. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: The merging of five tables with paths from the root-to-leaf node for data retrieval complicates the query process and increases both the search space and query response time.

Approach: XPEV
Reference: (Qin et al. 2005)
Technique: Hybrid (Edge- and Path-Based)
RDB Scheme: Path (P_id, Pathexpress); Edge (P_id, Source, Target, Label, Ordinal, Flag); Value (P_id, Source, Target, Label, Ordinal, Value)
Relationships Among Nodes: It maintains the P-C relationship using Source-Target. It maintains the A-D relationship using self-joins on the Edge table.
RDB Storage Space: It stores the distinct path from root-to-any node (all existing nodes). It stores the whole document in two tables (Edge, Value) with redundant data. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It stores distinct paths for all existing nodes, so the path search space increases. It joins two tables (Edge, Value) using Source-Target to determine the P-C and A-D relationships. It needs more time to respond to all types of query.

Approach: LNV
Reference: (El-Sharkawi & El-Hadi El Tazi 2005)
Technique: Indexing (Signature)
RDB Scheme: Signatures (Label, SignatureID); Path (DocID, Path_of_Signatures, Path_of_Nodes, Values, Type_of_Node, Position)
Relationships Among Nodes: It maintains both P-C and A-D relationships using the Path_of_Nodes column and the substr() and instr() functions.
RDB Storage Space: It stores dangling columns (Path_of_Nodes, Type_of_Node). It stores paths of label signatures and paths of nodes. Size overhead for columns table may occur when storing a large document. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It stores distinct paths for leaf nodes. It calculates the relationships among the nodes using substr() and instr() functions. The XPath query is first converted to a path of numerical IDs before it is translated to a SQL query for execution. It needs more time to respond to all types of query.

Approach: Sainan-Caifeng
Reference: (Sainan et al. 2008)
Technique: Indexing (Signature)
RDB Scheme: Signature (Label, SignatureID); Path (DocID, Path_of_Signature, Position_of_Signature, Values, Type_of_Value)
Relationships Among Nodes: It maintains both P-C and A-D relationships using the Position_of_Signature column and the substr() function.
RDB Storage Space: It stores a dangling column (Type_of_Node). It stores paths of label signatures and their positions. Size overhead for columns table may occur when storing a large document. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It stores distinct paths for leaf nodes. It calculates the relationships among nodes using substr() and instr() functions. The XPath query is first converted to a path of numerical IDs before it is translated to a SQL query for execution. It needs more time to respond to all types of query.

Approach: XPred
Reference: (Wu & Huang 2008)
Technique: Hybrid (Edge- and Path-Based)
RDB Scheme: Path (PathID, Length, LabelPath); Node (NodeID, PathID, Ordinal, PredID); Data (NodeID, PathID, Ordinal, PredID, Value)
Relationships Among Nodes: It maintains both P-C and A-D relationships using Parent ID and the node’s path information.
RDB Storage Space: It stores all nodes and their paths in the XML tree in its RDB scheme. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It stores distinct paths for all existing nodes, which increases the path search space. The Node table needs to be joined several times or recursively by the PredID to preserve the A-D relationships. It needs more time to respond to all types of query.

Approach: SMX/R
Reference: (Ning & De 2010)
Technique: Path-Based
RDB Scheme: Path_Index_Table (PID, Path, NodeName); Path_Table (DocID, PID, StartPos, EndPos, NodeLVL, NodeType, NodeValue)
Relationships Among Nodes: It maintains both P-C and A-D relationships using the containment relationship.
RDB Storage Space: It stores the distinct path from root-to-any node (all existing nodes). It stores all the information about elements, attributes, and texts in a single table. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It stores distinct paths for all existing nodes, so the path search space increases. It uses θ-joins to check the edge connections (containment relationship). The search space for elements, attributes or texts is increased because they are stored in a single table. It needs more time to respond to all types of query.

Approach: Xlight
Reference: (Zafari et al. 2010)
Technique: Path-Based
RDB Scheme: Document (DocId, Name); Path (PathId, Path); Data (DocId, PathId, LeafNo, LeafGroup, LinkLevel, LeafValue, hasAttrib); Ancestor (DocId, LeafGroup, AncestorPre, AncestorLevel); Attribute (Name, Val, Id, Pre)
Relationships Among Nodes: It maintains the P-C relationship using paths from the root-to-leaf node. It maintains the A-D relationship using the Ancestor table.
RDB Storage Space: It stores the whole document in a complex schema (five tables). It stores paths from root-to-leaf nodes. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It uses five tables to represent the schema. It stores paths from the root-to-leaf nodes, which increases the searching time. It needs to search the path table twice for complex queries (with predicate). It needs more time to respond to all types of query.

Approach: XRecursive
Reference: (Fakharaldien et al. 2011a)
Technique: Node-Based
RDB Scheme: Tag_Structure (tagName, Id, pId); Tag_Value (tagId, Value, Type)
Relationships Among Nodes: It maintains both P-C and A-D relationships using parent ID (PID).
RDB Storage Space: It stores a dangling column (Type) and all the information about the inner nodes. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It does not store the paths in a table. It uses nested queries using pId recursively to form the path, which is costly. It does not support other types of query such as index queries, grouping and index queries, text matching queries, or ‘//’ and text matching queries. It does not store an ordinal column. It needs time to respond to all types of query.

Approach: Suri-Sharma
Reference: (Suri & Sharma 2012)
Technique: Node-Based
RDB Scheme: Node (NodeID, NodeName); Data (DocID, NodeID, ParentID, NodeValue, NodeType, NodePos)
Relationships Among Nodes: It maintains both P-C and A-D relationships using ParentID.
RDB Storage Space: It stores all the nodes of the document in its RDB scheme. It stores redundant data in two tables (Node, Data) and a dangling column (NodeType). It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It does not store paths in a table. It stores redundant data in two tables (Node, Data), which increases the searching space and time in the case of a complex XML document. It needs multiple join tables or uses the ParentID recursively to keep the A-D relationship and path. It needs more time to respond to all types of query.

Approach: Ying-Cao
Reference: (Ying et al. 2012)
Technique: Hybrid (Path- and Node-Based)
RDB Scheme: File (DocId, Name); Path (PathId, PathExp); InnerNodes (DocId, INodeId, NodeName, ParentId, Level, SiblingNum); LeafNodes (DocId, LNodeId, PathId, ParentId, LeafValue)
Relationships Among Nodes: It maintains both P-C and A-D relationships using ParentID and the node’s path information.
RDB Storage Space: It stores paths from root-to-leaf nodes and a dangling column (SiblingNum). It stores all inner nodes in a table. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It stores paths from root-to-leaf nodes. The length of the SQL is identified through the number of nodes in pathexp (it uses nested queries based on the number of nodes in pathexp). A complex query may be costly. It needs to join the InnerNodes table many times or recursively with the ParentID in order to find the A-D relationship. It needs more time to respond to all types of query.

Approach: Wang et al.
Reference: (Wang et al. 2012)
Technique: Hybrid (Path- and Node-Based)
RDB Scheme: Value_Table (ID, Name, Value1, Path, ParentID, Level1); No_Value_Table (ID, Name, ParentID, Level1)
Relationships Among Nodes: It maintains both P-C and A-D relationships using ParentID and the node’s path information.
RDB Storage Space: It stores the path as a string in the same table that stores the leaf nodes’ values (no Path table). It stores the path from the root to the present node (leaf node) and all inner nodes. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It needs a large search space for the path in the Value_Table. It uses the ParentID recursively to form the path from the inner nodes. It needs a high number of joins for a complex query on a large XML document. It needs more time to respond to all types of query.

Approach: s-XML
Reference: (Subramaniam et al. 2012)
Technique: Node-Based
RDB Scheme: ParentTable (IdNode, pName, cName, Level, LParent, SelfLabel); ChildTable (IdNode, Level, pName, SelfLabel, LParent, Value)
Relationships Among Nodes: It maintains both P-C and A-D relationships using parent ID (LParent).
RDB Storage Space: It stores the whole document (inner and leaf nodes) in its RDB scheme that has a high row size. It contains a dangling column (pName in ParentTable and ChildTable). Its columns table may be redundant. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It needs a large search space for querying, i.e. searching parent and child nodes (in two separate tables). It does not store paths in a table. It needs multiple joins on LParent recursively (nested queries) to determine the A-D relationship, which is costly. It needs more time to respond to all types of query.

Approach: XMap
Reference: (Bousalem & Cherti 2015)
Technique: Indexing (Signature)
RDB Scheme: Vertex (ID, Name); Path (ID, Path); Data (OrdPath, Value, Order, NumberElements, NumberAttributes, PathID)
Relationships Among Nodes: It maintains the P-C relationship using the getParent() function on the OrdPath column or the nested getParent() function on the same column to maintain the A-D relationships.
RDB Storage Space: It stores the whole document (all inner and leaf nodes) with their path information starting from the root as a sequence of numerical IDs. It stores the OrdPath identifier for each node, which increases its column size. It needs high storage space to store XML documents in its RDB scheme.
Query Response Time: It uses functions to convert the path used in the XPath queries into a path that is a sequence of numerical IDs, which are used in the translated SQL query. It uses the getParent() function on the OrdPath ID in its translated SQL query in order to maintain the P-C and A-D relationships among the nodes. It needs more time to respond to all types of query, particularly complex queries on large XML documents.


2.5 DISCUSSION ON MAPPING FROM XML DOCUMENT TO RELATIONS

Database vendors such as IBM, Oracle, and Microsoft are currently building tools to assist in the mapping of XML documents to RDB schemes. However, these tools are not yet able to store XML documents in a small amount of RDB storage space while remaining scalable as the documents grow, nor are they able to query XML documents with a short query response time, i.e. they do not have high query performance (Bulajic & Filipovic 2012; Fakharaldien et al. 2011c; Dweib et al. 2008). Therefore, many efforts have been made by researchers to propose better mapping approaches for XML documents. As discussed in Section 2.4, the existing mapping approaches can be categorized into structure mapping or model mapping approaches (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Dweib & Lu 2012; Wang et al. 2012). This study focuses on model mapping approaches. In the preceding sections of this chapter the limitations of these approaches have been identified in terms of RDB storage space and query response time. Both these issues are discussed further in the following subsections.

2.5.1 RDB Storage Space

The design of the RDB scheme and the way the mapping process handles the contents and hierarchical structure of the XML document are considered the main challenges in mapping an XML document to a RDB (Vaidya 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Ying et al. 2012; Ahmad 2011). It is necessary to reduce the amount of storage space required in the RDB for storing the XML document in order to improve the querying of XML data in terms of response time (Dweib & Lu 2012). According to the results of the literature review presented in this chapter, the existing model mapping approaches need a large amount of RDB storage space because the methods they use to handle the content and hierarchical structure of XML documents in their RDB scheme are inefficient. These inefficiencies, which are highlighted in Table 2.18, are explained in the following paragraphs.


Firstly, the Edge approach stores all the edges between the nodes of a document in a single table (Fakharaldien et al. 2012; Subramaniam et al. 2012). In the SUCXENT and Xlight approaches, the XML documents are mapped into a complex RDB scheme in which multiple tables and columns are created (Senthilkumar et al. 2011). Also, in the XRel, SMX/R, Xlight, Ying-Cao, and Wang et al. approaches, all the information about the internal nodes of the document is stored in a separate table as data that is already stored as part of the path information in the path tables (Vaidya 2015; Haw & Lee 2011; Senthilkumar et al. 2011; Maghaydah et al. 2010). These approaches also store the path information for all the existing nodes or only for the leaf nodes starting from the root in the path table, which causes redundant paths in the path table (Vaidya 2015; Dweib & Lu 2012; Jeong & Lee 2006). The XRecursive and s-XML approaches use a node-labelling method to assign a label for each node before starting their mapping process. These approaches store all the existing nodes mostly in two tables: one for the parent nodes and the other for the child nodes. They then handle the P-C relationships between the nodes by using a parent label (Vaidya 2015; Haw & Lee 2011; Maghaydah et al. 2010). The XParent, XPEV, and XPred approaches store all the edge, ancestor, and path information for every node of the document in the RDB (Fakharaldien et al. 2012; Ying et al. 2012; Zafari et al. 2010; Dweib et al. 2009; Prakas et al. 2004). The LNV, Sainan-Caifeng, and XMap approaches store all path information for the existing nodes or only for the leaf nodes as a sequence of numerical IDs (signatures) with their information in the RDB. The RDB scheme of the LNV approach is not designed efficiently because of the presence of a dangling column (Sainan et al. 2008). 
Thus, the above-discussed approaches map all the information about the XML documents into a complex RDB scheme, resulting in the creation of multiple tables and columns as well as the storing of unnecessary or redundant data and the mapping of unneeded paths. Consequently, a high number of records are mapped, which increases the storage space requirement of their respective RDB schemes, particularly for large XML documents.


2.5.2 Query Response Time

The most important factor for any efficient model mapping approach is the time required to produce a result when the translated SQL query is executed (Raut & Atique 2014). Hence, it is essential for an efficient model mapping approach to include a simple yet effective storage scheme and to have a way of handling the P-C and A-D relationships among the nodes of the XML document efficiently (Vaidya 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Maghaydah 2010; Gou & Chirkova 2007). This will lead to a reduction in the number of join operations in the translated SQL query and a smaller search space inside the RDB, which in turn leads to a reduction in the time taken to execute a query, i.e. less query response time.

The translated SQL queries for the XPath twig query (Q1) ‘/DBGroup/Member/Name [Age = 23]’, produced by the query mapping techniques that are used in the model mapping approaches as shown in Figures 2.5 and 2.7 to 2.11, are compared in terms of their complexity. The comparison covers the number of path selections, column selections, joined tables, join operations, nested queries, and SQL functions, all of which affect the query response time (i.e. query performance), as shown in Table 2.19 below.

Table 2.19 Comparison of complexity on translated SQL query for Q1 by query mapping techniques in various approaches

| Approach | Technique Used | No. of Path Selection | No. of Column Selection | No. of Joined Table | No. of Join Operation | No. of Nested Query | No. of SQL Function |
|---|---|---|---|---|---|---|---|
| Edge | Edge-based | 0 | 6 | 4 | 3 | 0 | 0 |
| SMX/R | Path-based | 3 | 2 | 6 | 7 | 0 | 0 |
| s-XML | Node-based | 0 | 7 | 2 | 5 | 5 | 0 |
| XParent | Hybrid | 3 | 1 | 8 | 7 | 0 | 0 |
| Ying-Cao | Hybrid | 2 | 1 | 3 | 5 | 1 | 0 |
| XMap | Indexing | 2 | 1 | 4 | 2 | 0 | 4 |

According to Table 2.19 and the results of this literature review, the existing model mapping approaches need a long query response time to execute the translated SQL queries for XPath queries because the methods they use to maintain the P-C and A-D relationships among the nodes of the XML documents in the translated SQL


queries are inefficient. The methods employed require that a high number of join operations, nested queries, and SQL functions are used in the translated SQL queries. This leads to an increase in the search space inside the RDB, which has an adverse effect on the time taken to execute these queries. In other words, the query response time is increased. This is particularly the case for large XML documents. The drawbacks of the 16 approaches in terms of query response time, which are highlighted in Table 2.18, are discussed in the following paragraphs.

The Edge approach employs the Edge-based technique to maintain P-C relationships by concatenating the edges of the XML tree to form a path (Zafari et al. 2010; Jiang et al. 2002). This leads to a need for multiple self-join operations in the translated SQL queries (Fakharaldien et al. 2012; Subramaniam et al. 2012; Zafari et al. 2010; Gou & Chirkova 2007). The Edge approach needs to partially compute the transitive closures of data trees by issuing expensive recursive SQL queries in order to maintain the A-D relationships between the data nodes (Gou & Chirkova 2007).

The approaches XRel, SUCXENT, SMX/R, and Xlight employ a path-based technique to maintain the P-C and A-D relationships among the nodes (Vaidya 2015; Vainio & Junkkari 2014; Mohammad 2011). XRel and SMX/R store the path information for each node of the XML tree in the Path table, while SUCXENT and Xlight store only the path information about each leaf node of the XML tree in the Path table. This leads to an increase in the search space in the Path tables and requires a number of join operations between the Path table and other data tables in the translated SQL queries, particularly for complex queries (twig queries) on large XML documents (Vaidya 2015; Suri & Sharma 2012). The θ-join is one of the join operations used in XRel and SMX/R in order to maintain the containment relationships among the nodes (Ning & De 2010; Yoshikawa et al. 2001).
This type of join operation is costly in a RDBMS in terms of query execution time (Vaidya 2015; Dweib & Lu 2012; Maghaydah 2010; Zafari et al. 2010; Zhang et al. 2001). The approaches that employ a node-based technique, namely XRecursive, Suri-Sharma, and s-XML use the parent of a label (like LParent) in forming nested


queries to preserve the P-C and A-D relationships among the nodes, which involves an additional computation processing cost (Vaidya 2015; Haw & Lee 2011; Maghaydah 2010). In the case of the XParent, XPEV, and XPred approaches that employ hybrid Edge- and path-based techniques, all the edge, ancestor, and path information for every node of the document is stored in the RDB. Therefore a high number of join operations are needed to maintain the P-C and A-D relationships among the nodes in the translated SQL queries in these three approaches (Fakharaldien et al. 2012; Zafari et al. 2010; Wu & Huang 2008). In the Ying-Cao and Wang et al. approaches that employ hybrid path- and node-based techniques, the table that stores the internal node information has to be joined several times or recursively by the parent ID when forming nested queries to maintain the P-C and A-D relationships among the nodes in the translated SQL queries. The LNV and Sainan-Caifeng approaches employ an indexing (signature) technique, where all the path information for the existing nodes is stored as paths of the label signatures (as a sequence of numerical IDs). These two approaches process the XPath query by replacing it with a path of numerical IDs using SQL functions, and then translate it into an SQL query. In the case of the XMap approach, the entire document (all the inner and leaf nodes) is stored in a single table. Therefore self-join operations and SQL functions such as substr, instr, and getParent are needed to maintain the P-C and A-D relationships among the nodes in the translated SQL query (Bousalem & Cherti 2015; Vaidya 2015). Thus, for all the above approaches there is an increase in the search space inside their respective RDB scheme, which causes an increase in the time taken to execute queries. In other words, the query response time increases. This is particularly the case for complex (twig) queries on large XML documents. 
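The self-join cost attributed to the Edge approach above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical edge table (the column names `source`, `target`, `label`, and `value` and the sample rows are illustrative, not the scheme of any published approach) and uses SQLite as a stand-in RDBMS. Each additional step or predicate in the XPath expression adds one more alias of the same table.

```python
import sqlite3

# Hypothetical, simplified edge table: one row per parent-child edge.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (source INTEGER, target INTEGER, label TEXT, value TEXT)"
)
sample_edges = [  # made-up sample data
    (0, 1, "DBGroup", None),
    (1, 2, "Member", None),
    (2, 3, "Name", "Alice"),
    (2, 4, "Age", "23"),
    (1, 5, "Member", None),
    (5, 6, "Name", "Bob"),
    (5, 7, "Age", "30"),
]
conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", sample_edges)

# /DBGroup/Member/Name: three steps need three aliases and two self-joins.
path_sql = """
SELECT e3.value
FROM edge e1
JOIN edge e2 ON e2.source = e1.target
JOIN edge e3 ON e3.source = e2.target
WHERE e1.label = 'DBGroup' AND e2.label = 'Member' AND e3.label = 'Name'
ORDER BY e3.target
"""
names = [row[0] for row in conn.execute(path_sql)]

# /DBGroup/Member/Name [Age = 23]: the predicate adds a fourth alias
# and a third self-join on the same table.
twig_sql = """
SELECT e3.value
FROM edge e1
JOIN edge e2 ON e2.source = e1.target
JOIN edge e3 ON e3.source = e2.target
JOIN edge e4 ON e4.source = e2.target
WHERE e1.label = 'DBGroup' AND e2.label = 'Member'
  AND e3.label = 'Name' AND e4.label = 'Age' AND e4.value = '23'
ORDER BY e3.target
"""
twig_names = [row[0] for row in conn.execute(twig_sql)]
print(names, twig_names)
```

The number of self-joins grows with the length of the path, which is why longer or branching queries become progressively more expensive under this kind of storage.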
Based on the literature review presented in this chapter, this research study concludes that the existing model mapping approaches that employ a variety of


techniques are all unable to provide efficient storage and query processing when XML documents are stored in the RDB. The design of the RDB scheme and the way in which the mapping process is used to handle the contents and hierarchical structure of the XML document needs to be more efficient (Vaidya 2015; Dweib & Lu 2012; Subramaniam et al. 2012; Ying et al. 2012; Ahmad 2011). Greater efficiency would lead to less storage space being needed in the RDB to store XML documents. This is considered essential to improve the querying of XML data in terms of speed (Dweib & Lu 2012). Since path expressions frequently appear in XML queries, particularly in XPath queries, it is essential to efficiently store and handle the path information of the XML document (Jeong & Lee 2006). Vaidya (2015) indicates that the path-based technique is the best solution for query execution. The query execution time of this technique is shorter than that of other existing techniques (see Section 2.4) because it requires fewer join operations to maintain the P-C and A-D relationships among the nodes of XML documents. However, this technique maps all the information about the inner and leaf nodes of the document with their distinct path information starting from the root, whether NodePath or Leaf-Path, into a complex RDB scheme (Ying et al. 2012; Zafari et al. 2010; Prakas et al. 2004; Yoshikawa et al. 2001). Moreover, redundant paths and inner node information are stored in the RDB using this technique, which increases the storage space and search space in the RDB. In addition, the path-based technique also requires a number of join operations between the path table and other data tables to handle the P-C and A-D relationships among the nodes of the translated SQL query, which has a detrimental effect on the query response time (Vaidya 2015; Suri & Sharma 2012). 
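The difference between storing a path for every node (Node-Path) and storing a path only for leaf nodes (Leaf-Path) can be made concrete with a small sketch. The sample document and the function name `distinct_paths` are assumptions made for illustration; attributes are ignored, and only element nodes are considered.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, loosely shaped like the DBGroup example.
SAMPLE = """<DBGroup>
  <Member><Name>Alice</Name><Age>23</Age></Member>
  <Member><Name>Bob</Name><Age>30</Age></Member>
  <Project><Dept>CS</Dept></Project>
</DBGroup>"""


def distinct_paths(xml_text):
    """Return (node_paths, leaf_paths): distinct root-to-node paths and
    the subset of those paths that end at a leaf element."""
    root = ET.fromstring(xml_text)
    node_paths, leaf_paths = set(), set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        node_paths.add(path)
        if len(elem) == 0:  # no child elements, so this is a leaf
            leaf_paths.add(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return sorted(node_paths), sorted(leaf_paths)


node_paths, leaf_paths = distinct_paths(SAMPLE)
print(len(node_paths), "distinct node paths")
print(len(leaf_paths), "distinct leaf paths")
```

Every leaf path is also a node path, so a Node-Path table is always at least as large as a Leaf-Path table for the same document. The inner-node paths that appear only in the Node-Path set are prefixes of leaf paths, which is the redundancy discussed above.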
Therefore, this study aims to enhance the path-based technique by addressing the storing and querying issues associated with this technique by proposing an efficient mapping approach, called XAncestor. The proposed approach handles the content and hierarchical structure of the document to reduce the storage space of the RDB as well as provide high scalability when XML documents become bigger. It also reduces the number of join operations and removes the nesting queries in the translated SQL query to enhance query performance, i.e. shorten the query response


time. The basic idea of XAncestor is that it stores only the information for all the leaf nodes of the XML document, instead of storing all the inner and leaf nodes of the document. This results in a need for less RDB storage space without affecting the query answers among these nodes. In addition, XAncestor stores the ancestor path information for every leaf node, i.e. it stores one ancestor path for all the leaf nodes that have the same parent. Therefore, XAncestor reduces the storage space for the paths and reduces the search space. It also enhances the performance of the XPath query by translating it into an SQL query that has fewer join operations and removes the nesting queries and functions used in order to reduce the query response time.

The proposed approach uses a two-step process. In the first step, it employs a path-based XML mapping algorithm to map the XML document into a fixed and optimal RDB. To achieve this objective, a fixed RDB scheme is identified and used to maintain and manage the relations of the document contents. In the second step, it employs a querying mapping algorithm to query and retrieve the document contents through the use of the XPath language. To achieve this objective, an editor is created to write the XPath expression, translate it into SQL, and then execute and display the results as a tree view and a grid view.

2.6 SUMMARY

The main objective of this chapter is to review and analyse the existing literature on mapping tools or algorithms for mapping XML documents to a RDB scheme. This review is important because it reports on the current state of the art in the development of mapping tools with a particular emphasis on the development of a RDB scheme that requires less storage space but has high scalability when XML documents become bigger and on the reduction of the time taken to execute different types of query to achieve high query performance. To this end, after covering the preliminaries regarding XML, namely XML itself, the XML data model, and XML query languages this study reviewed 16 approaches which are the most cited and the latest model mapping approaches for


storing and querying XML documents in RDBs found in the literature. These approaches are Edge, XRel, XParent, SUCXENT, XPEV, LNV, Sainan-Caifeng, XPred, SMX/R, Xlight, XRecursive, Suri-Sharma, Ying-Cao, Wang et al., s-XML, and XMap. These approaches are categorized under six main techniques. This review discussed the techniques used by each approach, the RDB scheme produced, and the issues that lead to a high storage space requirement and query response time. The review emphasized the methods used by each approach to preserve the P-C and A-D relationships. This preservation determines how efficient the model mapping approach is in terms of storing and querying XML documents through RDBs. This chapter also demonstrated the various techniques used and the drawbacks of each of the approaches in terms of RDB storage space and query response time.

From the results of the literature review, it became apparent that all of the 16 model mapping approaches have some drawbacks in terms of storing and querying XML documents through RDBs. First, all these approaches suffer from a high storage space requirement, particularly in the case of large XML documents. This is because of the complexity of the RDB design used in these approaches and because the methods that they use to handle the content and hierarchical structure of the XML document are inefficient. Second, due to the design of their RDB and the methods used in their mapping process, these approaches map (translate) XPath queries into corresponding SQL queries over their RDB with a high query response time, particularly for complex (twig) queries on large XML documents. This is because the methods that they use to maintain the P-C and A-D relationships among the nodes of the XML document in the translated SQL queries are inefficient.
Even though the previous studies indicate that the path-based technique is the best solution for query execution compared to other existing techniques (see Section 2.4), this technique has limitations in respect of storing and querying XML documents. Therefore, this study focuses on developing a path-based technique in order to handle the storing and querying issues associated with XML documents using a RDB. A new mapping approach, called XAncestor, will be proposed and explained in detail in the next chapter of this study.


CHAPTER III

RESEARCH METHODOLOGY

3.1 INTRODUCTION

This chapter presents a full description of the research methodology for this study. The chapter is based on the findings of a review of the relevant previous works on the mapping of XML documents into a RDB in terms of storing and querying. The review identified the most relevant and suitable methods that are available to achieve the research objectives. The aim of this chapter is to provide the appropriate pathway to achieve the objectives of this study. Therefore, first a description of the design-based research (DBR) methodology adopted in this study is provided in Section 3.2. Then a detailed discussion of the phases of the research methodology is presented in Section 3.3. Finally, a summary of this chapter is presented in Section 3.4.

3.2 DESIGN-BASED RESEARCH (DBR)

Design-based research is also commonly known as design research, development research, and design experiments (Parker 2011; Barab & Squire 2004; Barr & Wells 1990). The DBR methodology was presented by a small group of scholars in the DBR collective as a way to develop algorithms and systems (Barab & Squire 2004; Von Alan et al. 2004; Kelly 2003). The methodology is an iterative research process that includes analysis, design, development, and implementation in real-time settings (Parker 2011; Wang & Hannafin 2005). Design-based research is a systematic but flexible methodology that was proposed to help bridge the gap between research and practical implementations (Oates 2006; Aken 2004). The key aspects of the method


include the handling of complicated issues, integrating design principles with new technologies to develop practical solutions to the problems, and conducting effectiveness evaluations to improve the proposed solutions and to identify new design principles (Parker 2011; Reeves 2006). Figure 3.1 illustrates the four main phases of the DBR methodology.

Figure 3.1 Four phases of design-based research methodology (Reeves 2006)

In line with Figure 3.1, Parker (2011) states that the DBR methodology consists of four iterative phases, which are described below.

Phase 1 - Analysis of Practical Problems by Researchers and Practitioners in Collaboration

In this phase, the problems and the literature review are addressed. During this phase, the problems are clearly identified and investigations are conducted into the work that has already been carried out in the same or related fields. By the end of this phase, it should be possible to determine the preliminary research questions and objectives to guide the research (Herrington et al. 2006).

Phase 2 - Development of Solutions Informed by Existing Design Principles and Technological Innovations

This phase focuses on designing and developing solutions to the problem. During this phase, a more targeted literature review is conducted. Relevant theories as well as design principles for mapping approaches are explored and developed in depth in order to propose a mapping approach to solve the research problem (Herrington et al. 2006).

Phase 3 - Iterative Cycles of Testing and Refinement of Solutions in Practice

In this phase, a potential solution to the problem is implemented and tested internally in an iterative manner to assess its effectiveness and correctness (Reeves 2006).

Phase 4 - Reflection to Produce ‘Design Principles’ and Enhance the Implementation of the Solution

In this phase, the proposed solution that is implemented needs to be comprehensive, and this is confirmed by evaluating it against existing related works (Herrington et al. 2007).

3.3 ADOPTED RESEARCH METHODOLOGY

The stages, features, and attributes of the DBR methodology are customized and adopted as the research methodology for this study. Figure 3.2 presents the four interrelated phases of this study’s research methodology. These phases describe the research objectives/activities undertaken to create an efficient mapping approach (prototype) for storing and querying XML documents using RDB. The effectiveness of the approach in terms of storing is measured by its capability of mapping the content and hierarchical structure of XML documents into a RDB with less storage space and high scalability when XML documents become bigger. The effectiveness of the approach in terms of querying is measured by its capability of mapping XPath queries into corresponding SQL queries in a shorter query response time. Although the activities in the methodology appear to form a linear process, they are still applied iteratively across the literature review, theoretical knowledge, and mapping approaches.


Phase 1 Problem Identification
- Review and analyse the literature on mapping approaches for storing and querying data.
- Evaluate the recent model mapping approaches using a pre-experiment.
- Identify the research problem, questions, and objectives.

Phase 2 Proposed Approach Design
- Identify mathematical concepts and theories related to the XML tree structure to propose a new solution (approach).
- Propose and design a new mapping approach (XAncestor), which contains a RDB scheme, XML mapping algorithm, and query mapping algorithm.

Phase 3 Prototype Development
- Develop the XAncestor approach as a prototype, which contains two algorithms, using RAD.
- Programme and test each algorithm separately.
- Join the algorithms to build the prototype.
- Test the prototype’s functions.

Phase 4 Prototype Evaluation
- Identify the evaluation environment, data sets, measurements, and XPath queries.
- Conduct a testing process to get results.
- Compare the results generated by the XAncestor prototype with the results generated by current approaches.
- Verify and validate the approach.

Figure 3.2 Phases of research methodology

3.3.1 PHASE 1 – PROBLEM IDENTIFICATION

The main aims of the first phase of the research methodology are to identify the problems in the mapping of XML documents into a RDB and to determine the questions and objectives of this study. In order to achieve this, the relevant literature on the mapping of XML documents into a RDB in terms of storing and querying is reviewed and analysed. Also, a pre-experiment is used to evaluate and compare the recent model mapping approaches. This leads to the precise identification of the research problems and gaps. A review and in-depth analysis of the comprehensive literature on this topic and the performance of a pre-experiment are essential to facilitate the accurate determination of the research objectives. This phase looks into the importance of XML in a data exchange environment and the need to manage XML efficiently. Since the RDB approach is the best option for managing XML documents, it is necessary to map the XML to this model efficiently so that less RDB storage space with high scalability when XML documents become bigger and a shorter query response time, i.e. high query performance are


obtained. During this phase, the existing model mapping approaches for handling the storing and querying issues are reviewed and analysed. For the storing issue, an analysis of the strengths and weaknesses of the existing XML mapping algorithms is undertaken in a pre-experiment. The complexity of their RDB scheme design (multiple tables and columns) and the ways used to handle the content and hierarchical structure of XML documents in their RDB schemes are investigated because these issues affect RDB storage space and scalability when XML documents become bigger. For the querying issue, the existing query mapping algorithms are analysed in a pre-experiment in order to evaluate their strengths and weaknesses in terms of their maintenance of the parent-child and ancestor-descendant relationships among the XML nodes based on the search space and the number of join operations in the translated SQL statements. These aspects are investigated because they affect the query response time (query performance). In the pre-experiment, seven mapping approaches, XRel (Yoshikawa et al. 2001), XParent (Jiang et al. 2002), XPEV (Qin et al. 2005), XRecursive (Fakharaldien et al. 2012; Fakharaldien et al. 2011a), Suri-Sharma (Suri & Sharma 2012), Ying-Cao (Ying et al. 2012), and s-XML (Subramaniam et al. 2012), are compared in two evaluations. The first evaluation compared the storage space of the produced RDB of each approach. The second evaluation compared the time required by each approach to execute the translated SQL queries. The aim of this pre-experiment is to provide an initial view of the limitations of the existing model mapping approaches with respect to the RDB storage space required for the XML document and the time required to execute the translated SQL queries from XPath queries, i.e. the query response time. 
For the pre-experiment, the machine specifications are a MSI Intel(R) Atom(TM) CPU N450 @1.66 GHz 1.67 GHz, 256 GB HDD, 2 GB RAM, and a Microsoft Windows 7 (32 bits) operating system. In addition, the data loaders for these approaches are implemented using a DOM parser. The benchmark data is loaded into a Microsoft SQL Server 2012 database. The XML dataset DBGroup.xml (Jiang et al. 2002) mentioned in Section 1.4 of Chapter 1 is used in this pre-experiment. Table 3.1 shows the results of the evaluation of the storage space of the RDBs produced by the existing model mapping approaches. The table includes the name of each approach,


the names of the tables created in the RDB scheme of each approach, the number of tuples in each table of each RDB scheme, the total number of tuples in each RDB scheme, and the size of the storage space in each RDB scheme.

Table 3.1 Results of evaluation of RDB storage space in (a) XRel, (b) XParent, (c) XPEV, (d) XRecursive, (e) Suri-Sharma, (f) Ying-Cao, and (g) s-XML

| Approach | Tables in RDB Scheme | No. of Tuples | Total No. of Tuples | RDB Storage Space (KB) |
|---|---|---|---|---|
| (a) | Path, Element, Attribute, Text | 15, 67, 10, 44 | 136 | 416 |
| (b) | LabelPath, DataPath, Element, Data | 15, 76, 77, 54 | 222 | 568 |
| (c) | Edge, Value, Path | 77, 54, 15 | 146 | 260 |
| (d) | Tag_Structure, Tag_Value | 77, 54 | 131 | 128 |
| (e) | Node, Data | 15, 77 | 92 | 132 |
| (f) | File, InnerNodes, LeafNodes, Path | 1, 77, 54, 11 | 143 | 420 |
| (g) | ParentTable, ChildTable | 77, 54 | 131 | 132 |

According to Table 3.1, approaches such as XRel, XParent, and Ying-Cao consume the most storage space compared to the other approaches. This is because they store the whole document (all inner and leaf nodes) in their RDB scheme. The RDB scheme of these approaches is complex as it contains multiple tables and columns. XRel stores the whole document in four tables (Path, Element, Attribute, and Text) based on the node types. XParent stores the whole document in four tables (LabelPath, DataPath, Element, and Data). Ying-Cao stores the whole document in four tables (File, Path, InnerNodes, and LeafNodes).


While the XPEV approach requires less storage space than the above three approaches, it still needs a comparatively large amount of RDB storage space compared to XRecursive, Suri-Sharma, and s-XML because it stores all the edge information between the nodes of the document and their values in two tables (Edge, Value). Therefore redundant data appears in the RDB scheme of XPEV, which increases the storage space, particularly in the case of large XML documents.

In XRel, XParent, and XPEV, all the distinctive path information for every node in the document starting from the root (15 root-to-node paths) is stored in the RDB scheme, which increases the RDB storage space. Even though the Ying-Cao approach reduces the space required to store the path information compared to XRel, XParent, and XPEV by storing the distinctive path information for only the leaf nodes of the document starting from the root (11 root-to-leaf paths), it still needs a large volume of RDB storage space to store the whole document and the path information for the leaf nodes in multiple tables and columns, so its RDB scheme is also complex.

Approaches such as Suri-Sharma and s-XML consume slightly more RDB storage space than the XRecursive approach because they store the entire XML document in their RDB scheme. Similar to the limitation of the Edge approach, Suri-Sharma stores all the information of the inner and leaf nodes together in the Data table, which increases the storage space, particularly for large XML documents. s-XML stores all the information of the inner and leaf nodes of the document together in two tables (ParentTable, ChildTable) that have a high row size. These two tables contain dangling columns and redundant data (pName attribute in ParentTable and ChildTable, IdNode and SelfLabel attributes in ChildTable). This leads to an increase in the RDB storage space, particularly for large XML documents.
Finally, the pre-experiment showed that the XRecursive approach utilizes the smallest amount of storage space compared to the other methods because it stores the XML document in a simple RDB scheme that contains two tables (Tag_Structure, Tag_Value) and it has fewer columns in each table. However, XRecursive still has a


limitation that prevents the further reduction of its RDB storage space. It stores all the information of the inner and leaf nodes in the RDB scheme, which increases the storage space, particularly for large XML documents.

Therefore, the pre-experiment shows that the existing model mapping approaches suffer from a large RDB storage space, particularly for large XML documents. This is because in these approaches the design of the RDB scheme and the techniques of mapping the contents and hierarchical structure of the XML documents into the RDB scheme are not effective. The techniques employed by these approaches increase the complexity of the RDB scheme and also lead to redundancy or unnecessary information in terms of the paths (records) mapped. Most of the approaches store all the information contained in the XML documents, i.e. all the inner and leaf nodes, in the RDB scheme during the mapping process. The RDB schemes of these approaches are complex because they contain multiple tables and columns, some of which are dangling. This complexity leads to the consumption of more storage space to store XML documents and requires a high number of joins between tables, which increases the search space when answering queries. Also, redundancy occurs or unnecessary information is stored, such as the storing of inner node information as data in a separate table inside the RDB scheme. These inner nodes already exist inside the path information. Redundant path information is also stored in the RDB scheme, such as the storing of the path information for every node (Node-Path) and the path information for only leaf nodes (Leaf-Path).

For the pre-experiment on the existing query mapping approaches, six types of XPath query, ranging from the simple to complex, are identified to evaluate the query response time. The same set of queries is selected for XRel (Yoshikawa et al. 2001) and XParent (Jiang et al. 2002). A simple path expression query is a chain query containing only one leaf node.
This kind of query is divided, based on the length of the query path, into a short simple path expression query containing a few inner nodes in the path and a long simple path expression query containing more inner nodes in the path. In addition, queries with one ‘//’ and two ‘//’ identify the nodes’ parent-child and ancestor-descendant relationships in the path. Finally, a twig query with ‘/’ and a twig query with ‘//’ (branching queries), which contain two or more leaf nodes (i.e. with predicates), preserve the nodes’ parent-child and ancestor-descendant relationships, respectively.

Based on the document mentioned in Section 1.4 of Chapter 1, six types of XPath query are evaluated for each approach, as shown in Table 3.2. For each approach, the time recorded for each query is based on an average of ten consecutive executions, as shown in Table 3.3.
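The query categories described above can be distinguished mechanically from the expression text. The following is a minimal sketch under stated assumptions: the function name `classify_xpath` and the category labels are illustrative, and the short/long distinction for simple path expressions is deliberately omitted because the text gives no numeric threshold for it.

```python
import re


def classify_xpath(expr):
    """Assign a coarse feature label to an XPath expression (simplified)."""
    has_predicate = "[" in expr
    # Drop predicate contents so that only the location steps are counted.
    steps_only = re.sub(r"\[[^\]]*\]", "", expr)
    descendant_steps = steps_only.count("//")
    if has_predicate:
        return "twig with '//'" if descendant_steps else "twig with '/'"
    if descendant_steps == 1:
        return "query with one '//'"
    if descendant_steps >= 2:
        return "query with two or more '//'"
    return "simple path expression"


labels = {
    expr: classify_xpath(expr)
    for expr in [
        "/DBGroup/Project/Dept",
        "//Member/Age",
        "//Member//Office",
        "/DBGroup/Member/Name [Age = 27]",
        "/DBGroup//Member [Age > 20]/Name",
    ]
}
for expr, label in labels.items():
    print(f"{expr}: {label}")
```

This covers only the abbreviated syntax used in the queries here (‘/’, ‘//’, and bracketed predicates); it is not a general XPath parser.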

Table 3.2 Types of XPath query expressions

| Qn | Query Description | Query Expression | Feature |
|---|---|---|---|
| Q1 | Select all department names for all projects | /DBGroup/Project/Dept | Simple Path Expression (Short) |
| Q2 | Select all durations in days for all projects | /DBGroup/Project/Duration/Day | Simple Path Expression (Long) |
| Q3 | Select all ages for all members | //Member/Age | Query with One ‘//’ |
| Q4 | Select all offices for all members | //Member//Office | Query with Two ‘//’s |
| Q5 | Select all names for members whose age is equal to 27 | /DBGroup/Member/Name [Age = 27] | Twig Query with ‘/’ |
| Q6 | Select all names for members whose age is greater than 20 | /DBGroup//Member [Age > 20]/Name | Twig Query with ‘//’ |

Table 3.3 Results of evaluation of query response time (in ms) of (a) XRel, (b) XParent, (c) XPEV, (d) XRecursive, (e) Suri-Sharma, (f) Ying-Cao, and (g) s-XML

| Approach | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
|---|---|---|---|---|---|---|
| (a) | 105.7 | 111.0 | 115.0 | 101.4 | 133.7 | 131.7 |
| (b) | 129.0 | 113.5 | 146.2 | 108.4 | 138.3 | 160.3 |
| (c) | 80.8 | 93.4 | 110.6 | 62.1 | 129.0 | 112.4 |
| (d) | 90.1 | 112.9 | 77.4 | 159.5 | 121.0 | 127.1 |
| (e) | 128.0 | 129.1 | 139.9 | 305.4 | 144.6 | 168.5 |
| (f) | 87.6 | 60.4 | 58.5 | 75.0 | 118.5 | 254.6 |
| (g) | 89.7 | 119.7 | 155.0 | 149.3 | 127.1 | 402.5 |
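The measurement procedure (an average over ten consecutive executions of each translated SQL query) can be sketched as follows. This is an illustrative harness only: it uses SQLite in memory as a stand-in for the Microsoft SQL Server 2012 database used in the pre-experiment, and the helper name `average_response_ms`, the `leaf` table, and its contents are assumptions.

```python
import sqlite3
import time


def average_response_ms(conn, sql, runs=10):
    """Mean wall-clock time in milliseconds over `runs` consecutive executions."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the query fully runs
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


# Hypothetical table and data, only to give the harness something to time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaf (name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO leaf VALUES (?, ?)",
    [("Age", str(20 + i)) for i in range(100)],
)

avg_ms = average_response_ms(conn, "SELECT value FROM leaf WHERE name = 'Age'")
print(f"average over 10 runs: {avg_ms:.3f} ms")
```

Consecutive executions of the same query usually benefit from caching after the first run, so an average of this kind tends to reflect warm-cache behaviour rather than the cost of a first execution.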

According to Table 3.3, all the approaches tested in the pre-experiment took a long time to execute the queries (Q1–Q6). This is because of the storage limitations in their RDB schemes as well as the excessive join operations or nested queries needed in their translated SQL queries, which increases the search space inside the RDB schemes. With regards to the XParent approach, it stores distinct path information for every node, which increases the search space in the LabelPath table to find the PathID. In addition, XParent uses self-joins in the LabelPath table and uses the DataPath table

(Pid, Cid) to maintain the parent-child ‘/’ and ancestor-descendant ‘//’ relationships among the nodes inside the translated SQL query in its large RDB storage space.

The Suri-Sharma approach uses many self-joins in its Node table based on the number of nodes in the query in order to maintain the parent-child ‘/’ and ancestor-descendant ‘//’ relationships among the nodes inside the translated SQL query.

The XRel approach required more time to find the PathID in the Path table that stores all distinct path information for every node of the XML document and also to match this PathID with three other tables (Element, Attribute, Text). XRel maintains the parent-child relationships ‘/’ inside the translated SQL query by using containment relationships, i.e. θ-joins, in its large RDB scheme. This type of join operation is costly in a RDBMS in terms of query execution.

The XRecursive and s-XML approaches maintain the parent-child ‘/’ and ancestor-descendant ‘//’ relationships among the nodes inside the translated SQL query by using Pid and LParent columns as parent IDs in a recursive manner to form nested queries in their RDB scheme.

The XPEV approach has a shorter query response time compared to the other approaches when executing the corresponding SQL query for Q1, Q4, and Q6. However, XPEV still takes a long time to execute these queries because it stores distinct path information for every node, which increases the search space in the Path table to find the P_id. Furthermore, XPEV joins the Edge and Value tables to maintain the parent-child ‘/’ and ancestor-descendant ‘//’ relationships among the nodes inside the translated SQL query by using source and target columns.
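A containment θ-join of the kind mentioned for XRel can be sketched as below. The sketch assumes a region-style encoding in which each element row records the start and end positions of the element in the document; the table name `element`, the column names `start_pos` and `end_pos`, and the sample rows are hypothetical and are not taken from any of the reviewed schemes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element (label TEXT, start_pos INTEGER, end_pos INTEGER)"
)
sample_rows = [  # made-up regions; a child region lies inside its parent's
    ("DBGroup", 0, 100),
    ("Member", 1, 40),
    ("Name", 2, 10),
    ("Office", 11, 20),
    ("Member", 41, 80),
    ("Name", 42, 50),
    ("Project", 81, 99),
    ("Name", 82, 90),
]
conn.executemany("INSERT INTO element VALUES (?, ?, ?)", sample_rows)

# //Member//Name as a containment join: the join condition uses '<' and '>'
# rather than '=', which is what makes it a theta-join.
containment_sql = """
SELECT d.start_pos
FROM element a
JOIN element d
  ON a.start_pos < d.start_pos AND a.end_pos > d.end_pos
WHERE a.label = 'Member' AND d.label = 'Name'
ORDER BY d.start_pos
"""
contained = [row[0] for row in conn.execute(containment_sql)]
print(contained)
```

Because the join condition is a pair of inequalities rather than an equality on a key, the database generally cannot resolve it with a simple key lookup per row, which is consistent with the cost concern raised above.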
Even though the Ying-Cao approach has a shorter query response time compared to the other approaches when executing the corresponding SQL query for Q2, Q3, and Q5, it still takes a long time to execute them because it requires more time to find the PathID in the Path table that stores distinct path information for all the leaf nodes of the XML dataset and to match it with the InnerNodes and LeafNodes

tables that store all the inner and leaf nodes of the document, respectively, to get the data.

Therefore, the pre-experiment showed that the existing model mapping approaches all take a long time to execute their translated SQL queries over their RDB scheme, particularly for large XML documents. This is because the techniques they use to maintain the parent-child ‘/’ and ancestor-descendant ‘//’ relationships among the nodes inside their translated SQL queries are not effective. These techniques lead to a high number of join operations or nested queries in the translated SQL queries, which in turn increases the search space inside the various RDB schemes. From the results of the pre-experiment, it can be concluded that the path-based technique, which is applied in existing approaches such as XPEV and Ying-Cao, results in a shorter query response time than the techniques used in the other approaches. However, the path-based technique still has limitations in terms of storing and querying XML documents. Therefore, this study focuses on developing a path-based technique that can be applied to the existing model mapping approaches in order to handle the issues related to storing and querying XML documents using a RDB.

A detailed discussion of Phase 1 can be found in Chapter 1 and Chapter 2. Chapter 1 identifies the problems in the current model mapping approaches that motivated this study, namely a large RDB storage space with poor scalability when XML documents become bigger, and a high query response time (degraded query performance). Chapter 1 also presents the research objectives for this study. Chapter 2 discusses the existing model mapping approaches together with their strengths and weaknesses. Table 3.4 shows the overall process of this phase of the research.

Table 3.4  Input, activities, and deliverables of Phase 1

Input
• Books, journals, proceedings, published and unpublished papers, white papers, theses, online related articles, reports, case studies

Activities
• Review of the existing documentation relevant to XML and the Web (well-formed and valid XML documents, XML data models, Document Object Model (DOM) as XML parser, XPath query, etc.)
• Review of the existing documentation relevant to RDB (RDB model, RDB design, SQL query, etc.)
• Review of all the related issues regarding the mapping from XML to RDBs (especially storing and querying issues)
• Evaluation of the existing XML mapping and query mapping algorithms in terms of RDB storage space and query response time, respectively, using a pre-experiment
• Analysis of the existing XML mapping algorithms and their RDB scheme based on the complexity of the design of the RDB scheme (number of tables and columns) and the ways in which the content and hierarchical structure of XML documents are handled in the RDB scheme, which have an effect on RDB storage space and scalability when XML documents become bigger
• Analysis of the existing query mapping algorithms and their maintenance of the parent-child and ancestor-descendant relationships among XML nodes based on the search space and number of join operations or nested queries in the translated SQL statements, which have an effect on query response time (query performance)
• Identification of the research problems
• Identification of the research questions and objectives

Deliverables
• Literature and issues are summarized
• Research problems are identified
• Research questions and objectives are identified

3.3.2

Phase 2 – Proposed Approach Design

The aim of the second phase of the research methodology is to propose and design a new solution (approach) to solve the problems identified in the previous phase. In addition, this solution is needed to achieve the main objectives of the research study. Therefore, in order to achieve the goals of this phase, the mathematical concepts and theories related to the XML tree structure that are used to propose and design the solution are identified. Moreover, the components, algorithms, and tools of the proposed solution, which is a new model mapping approach (XAncestor), are proposed and designed. The two main activities in this phase are as follows: 

Identification of the mathematical concepts and theories for the new approach: The theories and mathematical concepts related to the XML tree structure are used as the basis for the proposed new mapping approach (XAncestor). For example, the composite and associative functions are used to identify the relationships between the XML nodes, such as parent-child

and ancestor-descendant relationships, which help to query the contents of the XML document as regular XPath expressions and to optimize the query process. In addition, some of the definitions are used to identify the concepts of the XML tree, dynamic fragment, root of the fragment, path expression, XAncestor RDB scheme, ancestor path expression, leaf nodes, Dewey order labelling, ancestor positions, XML mapping, and query mapping. 

Proposal and design of the approach: The proposed mapping approach (XAncestor) is constructed based on an understanding of the relevant previous works that have been reviewed and analysed, and on the mathematical concepts and theories related to the XML tree structure. The proposed mapping approach contains three components, namely a fixed RDB, an XML mapping algorithm (XtoDB), and a query mapping algorithm (XtoSQL). The fixed RDB stores the XML document efficiently. The XML mapping algorithm (XtoDB) maps the XML documents into the fixed RDB using the path-based technique. The query mapping algorithm (XtoSQL) translates the XML queries (XPath) into SQL queries using the path-based technique, and the results of the queries are then constructed into XML data format.

This phase of the research referred to published and cited research studies, which included recent comprehensive surveys on the mapping of XML documents into RDBs. These studies provided the theoretical base and the scientific methods for verifying the proposed approach. For example, the proposed approach adopted the path-based technique, like other recent approaches such as SMX/R (Ning & De 2010), Xlight (Zafari et al. 2010), and Ying-Cao (Ying et al. 2012). However, the proposed approach particularly referred to the XRel approach (Yoshikawa et al. 2001) because it involves the first theoretical and practical implementation of this type of model mapping approach for XML documents that handles storing and querying issues using a RDB.

A detailed discussion of Phase 2 is presented in Chapter 4. Section 4.2 covers the mathematical concepts and theories related to the XML tree structure that are used

in the proposed approach (XAncestor). Section 4.3 presents the proposed mapping approach (XAncestor), describing in detail the proposed RDB scheme, the proposed XML mapping algorithm (XtoDB), and the proposed query mapping algorithm (XtoSQL). Table 3.5 shows the overall process of this phase of the research.

Table 3.5  Input, activities, and deliverables of Phase 2

Input
• Deliverables from Phase 1

Activities
• Identification of the mathematical concepts and theories related to the XML tree structure to be used in the proposed solution (new approach)
• Proposal and design of the new approach (XAncestor), which includes:
  - Identifying the XAncestor mapping approach
  - Designing the RDB scheme of XAncestor
  - Identifying the XML mapping algorithm (XtoDB)
  - Identifying the query mapping algorithm (XtoSQL)

Deliverables
• Mathematical concepts and theories related to XML tree structure are identified
• The XAncestor mapping approach is identified
• The RDB scheme of XAncestor is designed
• The XML mapping algorithm is identified
• The query mapping algorithm is identified

3.3.3

Phase 3 – Prototype Development

The aim of the third phase of the research methodology is to develop a prototype containing the proposed solutions identified in the previous phase in order to solve the issues covered by the problem statement. Based on a review of the literature, the rapid application development (RAD) method is considered to be the best method by which to develop and implement XAncestor. The proposed main algorithms (XtoDB, XtoSQL) are developed as a prototype in a circular (iterative) development cycle (Dennis et al. 2012; Joseph et al. 2011). The RAD method is a well-defined and structured method that allows the fast delivery of a system. It is designed to deliver high-quality systems faster than the traditional development life cycle (Gerber et al. 2007). In RAD, the step-by-step development of traditional methods is compressed into an iterative process, i.e. the RAD method involves developing and improving the data models, process models, and prototype in parallel. It uses an iterative process that consists of improving user requirements, designing a solution, prototyping a solution, reviewing the prototype, and providing user input, before starting the process again. Powerful application development tools are used in the RAD method in order to develop high-quality applications rapidly. Prototyping is used to help users visualize and request changes to the system as it is being built, allowing applications to evolve iteratively. The RAD method has been successful in many cases, such as for unstable business requirements or for developing non-traditional systems.

The structure of the RAD life cycle is designed to ensure that developers build systems according to the users’ needs. The RAD life cycle consists of four stages that cover all the activities and tasks required to scope and define the business requirements, and to design, develop, and implement the application system that supports those requirements. These activities are combined into a single analysis-design-code-test process, which is considered the heart of the system’s development in the RAD method, as shown in Figure 3.3 (Joseph et al. 2011).

Figure 3.3  Iterative stages of RAD method (Joseph et al. 2011): a prototype cycle of analysing, designing, coding, and testing

The stages of the RAD method illustrated in Figure 3.3 are adopted to develop the proposed XAncestor approach as a prototype. The RAD stages applied in the study are described in the following subsections.

a.

Analysis stage

In this stage, as mentioned in the previous phases of the research methodology, the literature on model mapping approaches for storing and querying XML documents using a RDB is analysed. As a result, the requirements and functions for each proposed algorithm (process) are identified. In the RAD method, every development aspect is analysed. The proposed solutions (new approach) are identified to solve the identified problems of this study. These solutions are illustrated as a mapping design. Therefore, each part of the approach is analysed separately. Then, these parts are integrated in order to find the best way to implement them.

b.

Design stage

This stage deals with the design of two algorithms, namely the XML mapping algorithm (XtoDB) and the query mapping algorithm (XtoSQL). These two algorithms are used to map the XML document into a RDB scheme and to translate the XPath query into an SQL query, respectively. Two types of design for the development of the prototype are identified in this stage. The first is an internal design for the processes and algorithms. As the approach is an algorithmic approach, the main issue to address is the design of the algorithms themselves. There are many issues to consider in the algorithmic design, such as correctness, completeness, and the division of the small processes into functions and procedures. All these issues are taken into account in the development of the approach. The second is the design of the interfaces for the approach. The main issues in the design of the interfaces are the representation of the parameters of the algorithms as objects on the interfaces, and the display of the results of the XPath query in the XML hierarchical format.

c.

Coding stage

In this stage, the programming languages that are used to code and implement previous model mapping approaches are reviewed and analysed. Most of the related approaches used earlier versions of the Java programming language, which provided a good graphical user interface (GUI) that helped in developing their prototypes. In this study, the Java programming language in the NetBeans IDE 8.0.1 environment is used as a tool to create the GUI and to code and implement the prototype components. In this stage, the two proposed algorithms (XtoDB, XtoSQL) are coded, developed, and implemented piece by piece as a pipeline from the previous stage to build the XAncestor prototype.

d.

Testing stage

In line with the incremental development cycle of the RAD method, this stage is conducted after each addition or modification in the above-mentioned stages. The best way to test the parameters of the algorithms is to use a debugging tool because it enables the values of the algorithms to be evaluated for correctness at run-time. Without such a tool, it would be very hard to judge the correctness of the implemented algorithms.

A detailed discussion of Phase 3 can be found in Chapter 4. Section 4.4 describes and proves the correctness of the mapping algorithm. Appendix A presents the implementation of the XAncestor prototype and describes in detail the required tools and the interfaces. Table 3.6 shows the overall process of this phase of the research.

Table 3.6  Input, activities, and deliverables of Phase 3

Input
• Deliverables from Phase 2

Activities
• Analysis of previous XML mapping algorithms
• Coding and testing of the proposed XML mapping algorithm (XtoDB)
• Analysis of previous XML query mapping algorithms
• Coding and testing of the proposed query mapping algorithm (XtoSQL)
• Joining of the two algorithms (XtoDB, XtoSQL) to construct a prototype (XAncestor)
• Verification and validation of the correctness of the XAncestor approach

Deliverables
• Previous XML mapping algorithms are analysed
• XtoDB algorithm is coded and tested
• Previous XML query algorithms are analysed
• XtoSQL algorithm is coded and tested
• XAncestor prototype is constructed
• Correctness of XAncestor approach is verified and validated

3.3.4

Phase 4 – Prototype Evaluation

This final phase of the research methodology comprehensively evaluated and compared the results generated by the proposed XAncestor approach (prototype) with the results generated by XRel (2001), SMX/R (2010), XRecursive (2011), Ying-Cao (2012), and s-XML (2012). This comparison is made by conducting an experiment to assess the effectiveness and scalability of the RDB storage space and query response time, which are considered the performance metrics utilized in this study. The abovementioned five approaches are chosen for comparison because they are considered to be recent approaches for mapping XML documents into the RDB scheme and translating XPath queries into SQL queries. They use recently developed techniques, including path-based and node-based techniques, for storing and querying XML documents in the RDB scheme. The comparison results are produced by applying the approaches to two real datasets (SIGMOD, DBLP) and six different types of XPath query including short simple path expression queries, long simple path expression queries, queries with one ‘//’, queries with two ‘//’, twig queries with ‘/’, and twig queries with ‘//’, for each dataset. The approaches are compared and analysed in terms of RDB storage space and scalability when XML documents become bigger as well as query response time (query performance). The proposed approach (XAncestor) is also evaluated for completeness in this phase. A detailed discussion of Phase 4 can be found in Chapter 5. The overall process of this final phase of the research is illustrated in Table 3.7.

Table 3.7  Input, activities, and deliverables of Phase 4

Input
• Deliverables from Phase 3

Activities
• Generation of results from XAncestor using two datasets and six XPath queries in terms of RDB storage space and query response time
• Generation of results from XRel using two datasets and six XPath queries in terms of RDB storage space and query response time
• Generation of results from SMX/R using two datasets and six XPath queries in terms of RDB storage space and query response time
• Generation of results from XRecursive using two datasets and six XPath queries in terms of RDB storage space and query response time
• Generation of results from Ying-Cao using two datasets and six XPath queries in terms of RDB storage space and query response time
• Generation of results from s-XML using two datasets and six XPath queries in terms of RDB storage space and query response time
• Comparison of all the generated results in terms of the effectiveness and scalability of RDB storage and query response time
• Verification and validation of the completeness of the XAncestor approach

Deliverables
• Results of comparison analysis are provided
• Completeness of the XAncestor approach is verified and validated

3.4

SUMMARY

This chapter began with a brief description of the DBR methodology, which is adopted for this study. Then, the four phases of the research methodology for this study are described in detail. In the first phase, the existing documentation relevant to XML, Web, and RDB are reviewed. All the existing approaches (XML and query mapping algorithms) for mapping XML to a RDB to identify the storing and querying issues are also reviewed. Then, in a pre-experiment the identified model mapping approaches are evaluated and analysed in terms of their handling of storing and querying issues. The activities of this phase led to identification of the research problems, questions, and objectives. In the second phase, the proposed model mapping approach (XAncestor) is designed based on the mathematical concepts and theories related to the XML tree structure. The XAncestor approach contained a proposed XML mapping algorithm (XtoDB) and a proposed query mapping algorithm (XtoSQL). The XtoDB algorithm is designed to map the content and hierarchical structure of the XML document into a predefined and fixed RDB scheme using the path-based technique. The XtoSQL algorithm is designed to translate XML queries (XPath) into SQL queries and the results of the queries are then translated back to XML data format. In the third phase, the proposed approach is developed and implemented as a prototype following the RAD method. All the stages of the RAD method that are applied in a prototype development are described in detail. During this stage, the XML mapping algorithm (XtoDB) and query mapping algorithm (XtoSQL) of the XAncestor approach are coded and tested. The correctness of the XAncestor

approach is also verified in this phase. In the last phase of the research methodology, the proposed approach (XAncestor) is evaluated and compared with five recent model mapping approaches to assess the effectiveness and scalability of their RDB storage space and the speed of the query response time. The completeness of the XAncestor approach is also verified in this phase.

CHAPTER IV

XANCESTOR: A MAPPING APPROACH

4.1

INTRODUCTION

This chapter gives a full description of the proposed mapping approach, known as XAncestor, for mapping Extensible Markup Language (XML) documents into a relational database (RDB). This approach enhances the path-based technique, which is considered the best solution for query execution (Vaidya 2015). Section 4.2 presents the mathematical concepts and theories related to the XML tree structure that are used to propose the mapping approach (XAncestor). Section 4.3 describes in detail the proposed mapping approach (XAncestor), which contains the proposed XML mapping algorithm (XtoDB) that maps the XML document into the RDB scheme. It also gives a detailed description of the proposed query mapping algorithm (XtoSQL) that translates XPath queries into corresponding structured query language (SQL) queries. In addition, it provides flowcharts and running examples of the XAncestor mapping approach to illustrate how the XtoDB and XtoSQL algorithms work. Section 4.4 presents the proof of the correctness of the XAncestor approach. Finally, Section 4.5 provides a summary of the chapter.

4.2

MAPPING THEORY

The management of XML still faces some problems with regards to storing and querying data (Bousalem & Cherti 2015; Mourya & Saxena 2015; Yuan et al. 2013; Wang et al. 2012; Fakharaldien et al. 2011a; Haw & Lee 2011; Dweib et al. 2009). These problems can be overcome by using a RDB to efficiently store and query the content of XML documents (Bousalem & Cherti 2015; Mourya & Saxena 2015; Wang

et al. 2012; Fakharaldien et al. 2011b; Ahmad 2011). The mapping of an XML document into a RDB involves the mapping of ordered, hierarchical, and semi-structured information into unordered, tabular, and structured information. This is still a major challenge and research on this topic is ongoing (Vaidya 2015; Dweib & Lu 2012; Haw & Lee 2011; Atay et al. 2010).

For the efficient storing of XML documents, the content and hierarchical structure of XML documents need to be handled efficiently in the RDB scheme. This can be achieved by designing an efficient RDB scheme in such a way that its complexity is reduced (reducing dangling tables and columns) and by developing an XML mapping algorithm that removes redundant or unnecessary data and paths during the mapping process (reducing the number of records mapped). This in turn leads to less RDB storage space with high scalability when the XML documents become bigger.

For the efficient querying of XML documents, the parent-child (P-C) and ancestor-descendant (A-D) relationships among the nodes should be maintained efficiently in the translated SQL query of an XPath query. These P-C and A-D relationships are most commonly used in practical applications and have attracted the attention of many researchers (Gou & Chirkova 2007; El-Sharkawi & El-Hadi El Tazi 2005). The efficient maintenance of these relationships can be achieved by designing a query mapping algorithm that can translate XPath queries into corresponding SQL queries with fewer join operations, fewer nested queries, and fewer SQL functions. This leads to a reduction in the size of the search space inside the RDB scheme and thereby reduces the time taken to execute the queries (shorter query response time).

The path expression is considered the main component of any XPath query (Vainio & Junkkari 2014; Dweib & Lu 2012; Berglund et al. 2003). Some examples of path expressions are mentioned in Section 2.2.3 of Chapter 2.
It is essential that the proposed mapping approach can efficiently store and handle the path information of XML documents. Based on the literature review, the path-based technique is

considered to be the best solution for query execution compared to other techniques such as the node-based and edge-based techniques (Vaidya 2015). However, a lot of storage overhead is incurred in storing all the path information for every node or for only the leaf nodes. This is a particular issue for a database with large XML documents. In addition, the path-based technique stores the information of all the inner nodes of the document in separate tables as data, even though this information is already stored as part of the path information in the path table. This leads to an increase in the number of tables (storage space), which in turn increases the number of join operations needed in the translated SQL query. This increases the search space and has an adverse effect on query performance (Vaidya 2015; Dweib & Lu 2012; Jeong & Lee 2006). Hence, in the proposed mapping approach, all the information of the leaf nodes and their distinct ancestor path information are stored in a fixed and predefined RDB scheme.

However, the path expression of the XPath query is insufficient to determine and illustrate the topology of the XML tree (positions of the nodes in the XML tree) because more than one node may share the same path expression and the precedence relationship among the nodes is lost in the path expressions (Jeong & Lee 2006; Yoshikawa et al. 2001). This has a detrimental effect on the handling of the hierarchical structure of the document. Hence, the position of the ancestor for each leaf node of the document, starting from 1 (the position of the root), is required in the proposed mapping approach. This is achieved by using the Dewey order labelling method (Tatarinov et al. 2002), which is discussed later in the chapter.

4.2.1

XML Tree Theory

An XML document can be modelled as a rooted, nested, hierarchical, ordered graph using seven types of nodes: namely root, element, text, attribute, namespace, processing-instruction, and comment, based on the XQuery 1.0 and XPath 2.0 data model (Berglund et al. 2010). This study adopts the XQuery 1.0 and XPath 2.0 data model to represent XML documents and focuses on four types of node: root, element, text, and attribute, which is in line with previous studies (El-Sharkawi & El-Hadi El

Tazi 2005; Qin et al. 2005; Jiang et al. 2002; Yoshikawa et al. 2001). The XQuery 1.0 and XPath 2.0 data model can clearly represent the relationships between the nodes of the XML data tree such as the P-C and A-D relationships. Figure 4.1 presents the XML document DBGroup.xml and its tree representation. This document will be used as an example throughout this chapter (Jiang et al. 2002).

Figure 4.1  DBGroup.xml: (a) document and (b) tree representation (Jiang et al. 2002)
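The idea introduced in Section 4.2, that each leaf node is kept together with its ancestor path and the Dewey position of its ancestor, can be sketched in Java as follows. This is only an illustrative sketch and not the XtoDB algorithm itself: the class name `LeafPathSketch`, the method `leafRows`, the row layout, and the small sample document are assumptions made for this example (the sample merely reuses element names mentioned in this chapter), attributes are ignored, and sibling positions are counted over element children only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LeafPathSketch {

    // Hypothetical sample document; it is not the DBGroup.xml of Figure 4.1.
    static final String SAMPLE =
        "<DBGroup>"
      + "<Member><Name>Ann</Name><Age>24</Age>"
      + "<Office><Building>CS</Building><Room>301</Room></Office></Member>"
      + "<Member><Name>Bob</Name><Age>30</Age></Member>"
      + "</DBGroup>";

    /** Returns one row per leaf element: ancestorPath | ancestorDewey | leafName | value. */
    static List<String> leafRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> rows = new ArrayList<>();
            walk(doc.getDocumentElement(), "", "1", rows); // the root is labelled 1
            return rows;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    private static void walk(Element e, String parentPath, String dewey, List<String> rows) {
        // Collect element children only; text nodes are read via getTextContent().
        List<Element> children = new ArrayList<>();
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                children.add((Element) n);
            }
        }
        if (children.isEmpty()) {
            // Leaf element: keep the path and the Dewey label of its parent (its ancestor).
            int cut = dewey.lastIndexOf('.');
            String ancestorDewey = cut < 0 ? "" : dewey.substring(0, cut);
            rows.add(parentPath + " | " + ancestorDewey + " | " + e.getTagName()
                     + " | " + e.getTextContent().trim());
            return;
        }
        String path = parentPath + "/" + e.getTagName();
        for (int i = 0; i < children.size(); i++) {
            // The i-th child of a node labelled d receives the label d.(i+1).
            walk(children.get(i), path, dewey + "." + (i + 1), rows);
        }
    }

    public static void main(String[] args) {
        leafRows(SAMPLE).forEach(System.out::println);
    }
}
```

In this sketch the two `Member` subtrees share the single ancestor path `/DBGroup/Member` and are told apart only by their Dewey labels (1.1 and 1.2), which is the reason a path expression alone is not enough to recover the tree topology.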

From Figure 4.1 it can be seen that the XML document DBGroup.xml consists mainly of a strictly nested hierarchy of element nodes with a single root (DBGroup). The element nodes can contain character data, child element nodes or a mixture of both. Each element node is written based on its type and may contain zero or more attribute nodes. The attribute nodes contain only data, i.e. they cannot contain element

nodes and cannot have attribute nodes. Each element node is bounded by matching start and end tags that enclose its content. The content can contain other elements, can consist entirely of other elements, or might be empty. Attributes are named values that are given in the start tag, and the values are surrounded by single or double quotation marks. One of the important characteristics of an XML document is that it is ‘well-formed’. A well-formed XML document conforms to rules such as the following: it has only one root element, all start tags have matching end tags, elements are nested properly, attribute values are always quoted, and tags are case sensitive. These restrictions on the XML document structure make the shredding and storing of XML documents in a RDB easier (Dweib & Lu 2012).

Rules 1 and 2 below identify the composite and associative functions, which are used to identify the relationships between the XML nodes, namely the P-C and A-D relationships, respectively. These identifications help to query the content of the XML document as regular XPath expressions and to optimize the query process. Hence, the main mathematical concepts used in the proposed approach are as follows:

Rule 1: If $f : S \to T$ and $g : T \to U$, then the composite function is $g \circ f : S \to U$ (Oosten 2002; Barr & Wells 1990).

Definition 1: Composite Relationship for Identifying P-C

According to Rule 1, if $f$ is a P-C relationship between $S$ and $T$, where $f : S \to T$, and $g$ is a P-C relationship between $T$ and $U$, where $g : T \to U$, then it can be said that $h = g \circ f$ is the A-D relationship between $S$ and $U$, where $h : S \to U$ (Dweib et al. 2009). The definition above will be used to define the hierarchical structure of the XML document, particularly the P-C relationships between the nodes. Figure 4.2 illustrates this composite relationship for identifying P-C.

Figure 4.2  Composite relationship for identifying P-C: $f : S \to T$, $g : T \to U$, and $h = g \circ f : S \to U$

Rule 2: If $f : S \to T$, $g : T \to V$ and $h : V \to W$, then the associative function is $(h \circ g) \circ f = h \circ (g \circ f)$ (Oosten 2002; Barr & Wells 1990).

Definition 2: Associative Relationship for Identifying A-D

According to Rule 2, if $f$ is a P-C relationship between $S$ and $T$, where $f : S \to T$, $g$ is a P-C relationship between $T$ and $V$, where $g : T \to V$, $h$ is a P-C relationship between $V$ and $W$, where $h : V \to W$, the composition $i = (h \circ g)$ is an A-D relationship between $T$ and $W$, where $i : T \to W$, and the composition $j = (g \circ f)$ is the A-D relationship between $S$ and $V$, where $j : S \to V$, then $k = (h \circ g) \circ f = h \circ (g \circ f)$ (Dweib et al. 2009). The reason for defining the associative relationships is that it enables the definition of the hierarchical structure of the XML document, particularly the A-D relationships between the nodes. Figure 4.3 illustrates this associative relationship for identifying A-D.

Figure 4.3  Associative relationship for identifying A-D: $f : S \to T$, $g : T \to V$, $h : V \to W$, with $(h \circ g) \circ f = h \circ (g \circ f)$
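Definitions 1 and 2 can be illustrated with a minimal Java sketch that treats each P-C step as a function and composes the steps into A-D relationships. The class name `CompositionSketch` and the three path-extending functions are hypothetical and exist only for this example; they follow one path of the running example (DBGroup, Member, Office, Building) and are not part of the XAncestor implementation.

```java
import java.util.function.Function;

public class CompositionSketch {

    // Hypothetical P-C steps along one path S -> T -> V -> W, modelled as
    // functions that extend a path string by one child step.
    static final Function<String, String> F = s -> s + "/Member";   // f : S -> T
    static final Function<String, String> G = t -> t + "/Office";   // g : T -> V
    static final Function<String, String> H = v -> v + "/Building"; // h : V -> W

    /** (h o g) o f: build the A-D step from T to W first, then apply it after f. */
    static String leftGrouping(String s) {
        return H.compose(G).compose(F).apply(s);
    }

    /** h o (g o f): build the A-D step from S to V first, then apply h to its result. */
    static String rightGrouping(String s) {
        return H.compose(G.compose(F)).apply(s);
    }

    public static void main(String[] args) {
        String s = "/DBGroup";
        // g o f is the A-D relationship between S and V (Definition 1).
        System.out.println(G.compose(F).apply(s));
        // Both groupings reach the same descendant (Definition 2).
        System.out.println(leftGrouping(s));
        System.out.println(leftGrouping(s).equals(rightGrouping(s)));
    }
}
```

Because both groupings yield the same descendant, an A-D relationship over several levels can be assembled from P-C steps in any grouping, which is the property the two definitions rely on.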

Definition 3: The XML tree is composed of many subtrees of different levels, which can be defined as follows (Dweib et al. 2009; Atay et al. 2007):

$$T = \sum_{i=1}^{n} (E_i, A_i, X_i, r_{i-1}) \qquad (4.1)$$

where:
$i = 1, 2, \ldots, n$ represents the levels of the XML tree, and 0 represents the root
$E_i$ is a finite set of elements in level $i$
$A_i$ is a finite set of attributes in level $i$
$X_i$ is a finite set of texts in level $i$
$r_{i-1}$ is the root of the subtree of level $i$.

Definition 4: A dynamic fragment (shred), $df(i)$, is defined as the attributes and texts (i.e. values of leaf nodes) of the subtree $i$ of the XML tree plus its root $r_{i-1}$, as follows (Dweib et al. 2009):

$$df(i) = (A_i, X_i, r_{i-1}) \qquad (4.2)$$

where:
$A_i$ is a finite set of attributes in level $i$
$X_i$ is a finite set of texts in level $i$
$r_{i-1}$ is the root of the subtree of level $i$.

Definition 4 will be used to define the shredding process (dynamic fragment) of the XML document into the RDB scheme of the proposed mapping approach (XAncestor).

Definition 5: The root of the fragment (shred) is the node that has an out-degree of more than 1 (Dweib et al. 2009).

According to Figure 4.1, the element node ‘DBGroup’ is considered the root of the XML document, which has an out-degree of 5 connecting with ‘Member’, ‘Member’, ‘Member’, ‘Project’, and ‘Project’. The first element node ‘Member’ is considered a root of a subtree, which has an out-degree of 3 connecting with ‘Name’, ‘Age’, and ‘Office’, and so on.

Definition 6: Path Expression

A path expression, $P$, can be denoted as $a_1 n_1 a_2 n_2 \ldots a_k n_k$, where each $n_i$ is a node type and each $a_i$ is either a child axis ‘/’ denoting a P-C relationship or a descendant axis ‘//’ denoting an A-D relationship. Each $a_i n_i$ constitutes a navigation step of $P$, and $k$ is the number of steps in $P$ (Atay et al. 2007). Table 4.1 shows examples of path expressions used in different types of XPath query, based on the XML document sample in Figure 4.1.
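Definition 6 can be made concrete with a short Java sketch that splits a path expression into its navigation steps. The class `PathExpressionSketch`, the nested `Step` class, and the `parse` method are hypothetical names introduced for this illustration; the sketch handles only predicate-free expressions and is not the parser used by XtoSQL.

```java
import java.util.ArrayList;
import java.util.List;

public class PathExpressionSketch {

    /** One navigation step a_i n_i: an axis ("/" or "//") followed by a node type. */
    static final class Step {
        final String axis;
        final String nodeType;

        Step(String axis, String nodeType) {
            this.axis = axis;
            this.nodeType = nodeType;
        }

        @Override
        public String toString() {
            return axis + nodeType;
        }
    }

    /** Splits a predicate-free path expression into its k navigation steps. */
    static List<Step> parse(String path) {
        List<Step> steps = new ArrayList<>();
        int i = 0;
        while (i < path.length()) {
            if (path.charAt(i) != '/') {
                throw new IllegalArgumentException("Expected an axis at position " + i);
            }
            // '//' is the descendant axis (A-D); a single '/' is the child axis (P-C).
            String axis = path.startsWith("//", i) ? "//" : "/";
            i += axis.length();
            int end = path.indexOf('/', i);
            if (end < 0) {
                end = path.length();
            }
            if (end == i) {
                throw new IllegalArgumentException("Missing node type at position " + i);
            }
            steps.add(new Step(axis, path.substring(i, end)));
            i = end;
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(parse("/DBGroup/Member/Age"));
        System.out.println(parse("//Member//Building"));
    }
}
```

For the expression `/DBGroup/Member/Age` the sketch yields k = 3 steps, all on the child axis, whereas `//Member//Building` yields two steps on the descendant axis.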

Table 4.1  Path Expressions Used in XPath Queries

Qn | Query Description | Query Expression | Feature
Q1 | Select all ages for all members in DBGroup | /DBGroup/Member/Age | Simple path expression (short)
Q2 | Select all office rooms related to all members in DBGroup | /DBGroup/Member/Office/Room | Simple path expression (long)
Q3 | Select all buildings for all offices | //Office/Building | Query with one ‘//’
Q4 | Select all buildings for all members | //Member//Building | Query with two ‘//’
Q5 | Select all names of members whose age is equal to 24 | /DBGroup/Member[Age = ‘24’]/Name | Twig query with ‘/’
Q6 | Select all office rooms whose building name is CS | //Office[Building = ‘CS’]/Room | Twig query with ‘//’

4.3

XANCESTOR MAPPING APPROACH

The XAncestor mapping approach includes an XML mapping algorithm (XtoDB) to map XML documents into a fixed and predefined RDB using a path-based technique. The basic idea of the XtoDB algorithm is to map only the distinct ancestor path information for all the leaf nodes of the XML tree, together with their information, into a fixed RDB scheme. The path information of the inner nodes, with their information, is not mapped into the XAncestor RDB scheme because it is redundant: the inner nodes are already recorded in the ancestor path information of the leaf nodes. This aspect of the proposed XtoDB algorithm is unique and is not found in previous studies. Use of the proposed XtoDB algorithm leads to a reduction in the storage space of the RDB where high scalability is needed to store different sizes of XML document.

In addition, the XAncestor mapping approach includes a query mapping algorithm (XtoSQL) to translate different types of XML query (XPath) into their corresponding SQL query using a path-based technique. The result of the query is then constructed in XML data format. The uniqueness of this algorithm compared to those in other studies lies in the fact that it maintains the P-C and A-D relationships among the nodes in the translated SQL queries efficiently. This reduces the number of join operations and removes nested SQL queries and SQL functions. This leads to a reduction in the search space inside the predefined RDB scheme, which has a positive effect on the time taken to execute queries, i.e. it reduces query response time.

The proposed approach (XAncestor) considers well-formed XML documents that are shredded and decomposed into elements (leaf nodes that have a value) and attributes. These elements and attributes are then inserted into RDB tables. The proposed approach does not use a document type definition (DTD) or XML-schema to describe the structure of the XML document for the following reasons:

• Many applications deal with dynamic and structurally variant XML documents on the Web. These variant structures are difficult to define by a DTD or XML-schema, so a schema-less (model) mapping approach is better suited to such XML documents.

• The schema-less (model) mapping approach supports well-formed XML applications (without a schema that describes the data). This means it supports any XML document on the Web, whereas structure mapping approaches support only XML documents with DTD or XML-schema structures.

• It is not practical to design many candidate relational schemas for all potential XML data, which may have different XML-schemas.

Two well-known XML parsers, namely the Simple API for XML (SAX) (Megginson 2004) and the Document Object Model (DOM) (W3consortium 2000), can be used to extract XML data from an XML document before mapping it to a RDB. The SAX parser is a standard interface for event-based XML parsing. It parses an XML document as a sequence of events (i.e. startDocument, endDocument, startElement, endElement, etc.). The DOM parser first constructs the whole XML document as a tree structure (DOM tree) in memory, which the application then traverses. The DOM has an advantage over the SAX in that it provides an object model that can model any XML document regardless of how it is structured, which makes it easier to access the content of XML documents. Thus, the DOM parser offers full access to XML data (Tang et al. 2013). Moreover, manipulating a document using the DOM is quicker and simpler, whereas the SAX needs to use temporary buffers to keep information in memory. However, choosing which of the two parsers to use to process XML documents depends mostly on the project’s requirements (Oliveira et al. 2013).

Figure 4.4 shows the proposed XAncestor mapping approach, which maps the XML document into the RDB scheme (XML mapping) and maps the XPath query into the corresponding SQL query (query mapping). The result of the query is then returned in XML data format.

Figure 4.4 Proposed XAncestor mapping approach
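The difference between the two parsing styles described above can be sketched with the Python standard library. This is only an illustration of event-based versus tree-based access; the sample fragment, the class name `EventLog`, and the helper names are assumptions, and the XAncestor prototype itself uses a DOM parser as stated in Section 4.3.2.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical fragment used only for this illustration.
XML = "<Member><Name>Ryan</Name><Age>24</Age></Member>"


class EventLog(xml.sax.ContentHandler):
    """SAX style: the parser pushes events; nothing is kept unless we keep it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(text):
    handler = EventLog()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


def dom_child_names(text):
    """DOM style: the whole tree is in memory and can be navigated freely."""
    root = minidom.parseString(text).documentElement
    return [c.tagName for c in root.childNodes if c.nodeType == c.ELEMENT_NODE]


print(sax_events(XML))
print(dom_child_names(XML))
```

With SAX, any information needed later (such as the current path or position) has to be buffered by the handler, whereas the DOM tree lets the application reach parents and siblings directly.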

The XAncestor mapping approach consists of an XML mapping that maps the XML document into the RDB with less storage space and high scalability when the XML documents become bigger. In XML mapping, the XML document is loaded and passed to the proposed XML mapping algorithm (XtoDB), which maps the document into relational data that is stored as tuples in a fixed and predefined RDB scheme.

The XAncestor mapping approach also consists of a query mapping, which maps (translates) the XPath query into its corresponding SQL query in less query response time. In query mapping, the XPath query is entered by the user and is translated into its equivalent SQL query by means of the proposed query mapping algorithm (XtoSQL). The SQL query is then executed over the RDB scheme and the result of this query (tuples) is returned to the XtoSQL algorithm for translation into XML format.

The three main components of the XAncestor mapping approach (shaded grey in Figure 4.4) that are considered the main contributions of this study are as follows:

• Fixed RDB scheme: A fixed RDB scheme is proposed and designed in order to handle and store the content and hierarchical structure of XML documents efficiently. The proposed RDB scheme of the XAncestor mapping approach consists of two tables, namely Ancestor_Path and Leaf_Node, which are shown later. Section 4.3.1 discusses in detail the proposed XAncestor RDB scheme’s entities, attributes, and the relation types between them, which are represented using an entity relationship diagram (ERD).

• XtoDB mapping: The XtoDB mapping algorithm uses a path-based technique that loads the XML document, parses it using a DOM parser, and shreds and decomposes the content and hierarchical structure of the XML document into a predefined RDB scheme that contains the Ancestor_Path and Leaf_Node tables. The uniqueness of this algorithm compared to those in other studies lies in the fact that it handles and stores the distinct ancestor path information for all the leaf nodes of the document, with their information, in a predefined RDB scheme. This leads to a reduction in the RDB storage space and high scalability when the XML documents become bigger. Section 4.3.2 presents a full explanation of this algorithm by way of an example.

• XtoSQL mapping: The XtoSQL algorithm uses a path-based technique that translates the user’s XPath query into the corresponding SQL query and executes it against the predefined RDB scheme to get the results as tuples. These results are then translated from a relational table format (tuples) to an XML hierarchical format and returned to the user. The uniqueness of this algorithm compared to those in other studies lies in the fact that it maintains the P-C and A-D relationships among the nodes efficiently in the translated SQL queries. This reduces the number of join operations and removes nested SQL queries and SQL functions. In turn, this reduces the search space inside the predefined RDB scheme, which has a positive effect on the time taken to execute these queries, i.e. it results in less query response time. Section 4.3.3 presents a full explanation of this algorithm by way of an example.

4.3.1 XAncestor RDB Scheme

This section gives a full description of the proposed RDB scheme used in the proposed XAncestor mapping approach. This description starts by defining the concepts that are used to propose the XAncestor RDB and then presenting the proposed RDB scheme in the form of an ERD for ease of understanding. The definitions of the various concepts are explained below.

Definition 7: Leaf node. A leaf node in the XML tree is a node that has a value, whether it is an attribute or a text element. A leaf node does not contain child nodes.

The leaf node information of the XML tree is handled in the proposed RDB scheme of the XAncestor mapping approach. In Figure 4.1(b), Name, Age, Office, Building, Room, @Project, @id, and Title are examples of the leaf nodes that are handled in the proposed XAncestor RDB scheme.

Definition 8: Ancestor Path Expression. The ancestor path expression P for the leaf node n in the XML document D can be denoted as $/e_1/e_2/\ldots/e_k$, where each $e_i$ is the element node of a step and the axis of each step is a child axis ‘/’, which denotes a P-C relationship. The element node $e_1$ of the first step represents the root element of D and the element node $e_k$ of the last step represents the parent of the leaf node n in D.

Definition 8 is used to define the ancestor path information for a leaf node of the XML tree that is handled in the proposed XAncestor RDB scheme. In Figure 4.1(b), the ancestor path information ‘/DBGroup/Member’, ‘/DBGroup/Member/Office’, and ‘/DBGroup/Project’ for all leaf nodes is handled in the XAncestor RDB scheme.

Definition 9: Dewey Order Labelling. The Dewey order is based on Dewey Decimal Classification. Each node is assigned a vector that represents the path from the root of the document to the node, so each part of the path represents the local order of an ancestor node (Tatarinov et al. 2002).
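To make Definition 9 concrete, the following sketch assigns Dewey labels to the element nodes of a small tree, with the root labelled ‘1’. The function name `dewey_labels` and the sample fragment are assumptions for illustration; only element nodes are labelled here.

```python
import xml.etree.ElementTree as ET


def dewey_labels(xml_text):
    """Return (label, tag) pairs in document order; the root is labelled '1'."""
    labels = []

    def visit(element, label):
        labels.append((label, element.tag))
        # The local order of each child element is appended to the parent's label.
        for order, child in enumerate(element, start=1):
            visit(child, f"{label}.{order}")

    visit(ET.fromstring(xml_text), "1")
    return labels


# Hypothetical fragment shaped like the DBGroup example.
sample = "<DBGroup><Member><Name/><Age/></Member><Member><Name/></Member></DBGroup>"
for label, tag in dewey_labels(sample):
    print(label, tag)
```

Each label encodes the full root-to-node path, so the label of a node's parent is obtained by dropping the last component.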

Definition 10: The ancestor position for the leaf node is the position of its parent node starting from the root using Dewey order labelling (Tatarinov et al. 2002).

Definitions 9 and 10 are used in the proposed XAncestor RDB scheme to determine the ancestor position for each leaf node of the document starting from 1 (position of the root), which maintains the hierarchical structure of the XML tree efficiently. This is because handling the ancestor path information for only the leaf nodes is not sufficient to determine the topology of the XML tree in cases where more than one leaf node has the same ancestor path information but a different position in the XML tree. For instance, in Figure 4.1(b), two leaf nodes have the same name (Age) and share the same ancestor path information ‘/DBGroup/Member’, but they have different positions in the XML tree. Using Dewey order labelling, the positions of their ancestors are different, namely ‘1.1’ and ‘1.3’.

Definition 11: XAncestor RDB Scheme. The RDB scheme for XAncestor is a storage structure for storing XML documents in two relations. First, Ancestor_Path (Ances_PathID, Ances_PathExp), in which each distinct ancestor path of a leaf node in the XML tree is represented by a tuple, contains a unique ID generated for each distinct ancestor path in ‘Ances_PathID’ and the distinct ancestor path (root-to-parent) of a leaf node in ‘Ances_PathExp’. Second, Leaf_Node (Node_Name, Ances_PathID, Ances_Pos, Node_Value), in which each leaf node in the XML tree is represented by a tuple, contains the name of the leaf node in ‘Node_Name’, the ID of the root-to-parent ancestor path of the leaf node in ‘Ances_PathID’, the position of the ancestor of the leaf node based on Dewey order labelling in ‘Ances_Pos’, and the value of the leaf node in ‘Node_Value’.

For clarity, Figure 4.5 represents the RDB scheme used in the XAncestor mapping approach; it shows the two tables (Ancestor_Path and Leaf_Node) and their attributes. Figure 4.6 provides an ERD of the RDB scheme of XAncestor; it shows the entities, attributes, and the relation types between the entities.

Ancestor_Path (Ances_PathID, Ances_PathExp)
Leaf_Node (Node_Name, Ances_PathID, Ances_Pos, Node_Value)

Figure 4.5 XAncestor RDB scheme

[ERD: entity Ancestor_Path (attributes Ances_PathID, Ances_PathExp) is linked by the relationship ‘Has’, with cardinality 1 to n, to entity Leaf_Node (attributes Node_Name, Ances_PathID, Ances_Pos, Node_Value)]

Figure 4.6 Entity relationship diagram of XAncestor RDB scheme
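The two relations of Definition 11 could be declared as in the sketch below, which uses SQLite through Python's standard library. The column types, the UNIQUE and foreign-key constraints, and the index are assumptions made for this illustration; the source specifies only the table and attribute names and the 1-to-n relationship.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Ancestor_Path (
    Ances_PathID  INTEGER PRIMARY KEY,   -- one ID per distinct ancestor path
    Ances_PathExp TEXT NOT NULL UNIQUE   -- root-to-parent path, e.g. /DBGroup/Member
);
CREATE TABLE Leaf_Node (
    Node_Name    TEXT NOT NULL,
    Ances_PathID INTEGER NOT NULL REFERENCES Ancestor_Path (Ances_PathID),
    Ances_Pos    TEXT NOT NULL,          -- Dewey label of the parent node
    Node_Value   TEXT
);
-- Assumed helper index for the lookups by path ID and node name.
CREATE INDEX Leaf_Node_by_path ON Leaf_Node (Ances_PathID, Node_Name);
"""


def create_schema(conn):
    """Create the two XAncestor relations in the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Storing Ances_Pos as text keeps the Dewey label intact, at the cost of comparing positions as strings.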

The design of the proposed XAncestor RDB scheme is motivated by the need to reduce the usage of RDB storage space and improve scalability when the XML documents become bigger in order to deal with the query performance problems of existing model mapping approaches. In the proposed approach, only the information of the leaf nodes of the XML document is stored, in the Leaf_Node table. Hence, it is unnecessary to store the information for the inner nodes, and the RDB storage space needed to store the XML document can be reduced without affecting the query answers among these nodes. In addition, the distinct ancestor path information for every leaf node, i.e. one ancestor path for all the leaf nodes that have the same parent path, is stored in the Ancestor_Path table. This reduces the storage space that would otherwise be used to store unnecessary paths, such as the paths of the inner nodes. It also reduces the search space. This enhances the performance of the XPath query, i.e. it results in less query response time. This is achieved by looking for matches between the two tables based on the Ances_PathID attribute in the translated SQL query instead of using a high number of join operations, nested queries, and SQL functions, which are costly in a relational database management system (RDBMS).

4.3.2 XML Mapping Algorithm (XtoDB)

The XtoDB algorithm focuses on mapping an XML document into a fixed RDB by shredding and decomposing the hierarchical structure that is represented in the XML document into a flat (tuple) structure that is represented in the RDB by using a path-based technique. The basic idea of the XtoDB algorithm is to map only the distinct ancestor path information for all the leaf nodes of the XML tree, with their information, into a fixed RDB scheme. Therefore, the path information of the inner nodes is not mapped into the proposed RDB scheme. This leads to a reduction in the RDB storage space together with high scalability, which is beneficial for storing XML documents. The XtoDB algorithm is defined as follows:

Definition 12: XML Mapping (XtoDB). Extensible Markup Language mapping, XM, is a function that assigns to each triple (X, R, σ) a set of relational tuples, T, where X is a well-formed XML document, R is a RDB scheme, σ is a σ-mapping over R, and T is the result of shredding X into relational tuples according to the layout described by R and σ.

Figure 4.7 shows the flowchart of the proposed XML mapping algorithm (XtoDB). Figure 4.8 shows how the XtoDB algorithm maps an XML document into the RDB scheme (Ancestor_Path and Leaf_Node tables). An explanation of this algorithm (Figures 4.7 and 4.8) then follows, by way of an example based on the document in Figure 4.1(a).

Figure 4.7 Flowchart of proposed XtoDB algorithm

Algorithm: XtoDB
Input: an XML document
Output: Ancestor_Path and Leaf_Node relations
1.  Parse and read XML document using DOM parser
2.  During reading the document as DOM tree,
3.  class NodeInfo {
4.    public void ReadXML (Node m, String Position) {
5.      int j = 0
6.      String Ances_Pos
7.      String ParentPos = Position
8.      NodeList nodeList = m.getChildNodes()
9.      FOR currentNode in nodeList[ ] LOOP
10.       IF (currentNode == Element) Then
11.         j = j + 1
12.         Position = ParentPos + '.' + String.valueOf (j)
13.         attrs = currentNode.getAttributes()
14.         FOR attr in attrs[ ] LOOP
15.           Ances_Pos = Position
16.           Store_DB (attr)
17.         END LOOP
18.         ReadXML (currentNode, Position)
19.       Else IF (currentNode == Text)
20.         Ances_Pos = ParentPos.substring (0, ParentPos.lastIndexOf ('.'))
21.         Store_DB (currentNode.getParentNode())
22.       END IF
23.     END LOOP
24.   }   // End of ReadXML Function
25.   public void Store_DB (Node n) {
26.     Boolean Exist
27.     String Path_Exp = Get_Ancestor_PathExpression (n)   // Path from root-to-parent
28.     Exist = Check Path_Exp in Ancestor_Path table
29.     IF (Exist = False) Then
30.       Store Ances_PathID, Ances_PathExp in Ancestor_Path table
31.       Store Node_Name, Ances_PathID, Ances_Pos, Node_Value in Leaf_Node table
32.     Else
33.       Find Ances_PathID in Ancestor_Path table for Path_Exp
34.       Store Node_Name, Ances_PathID, Ances_Pos, Node_Value in Leaf_Node table
35.     END IF
36.   }   // End of Store_DB Function
37. }   // End of NodeInfo Class
38. public class XtoDB {
39.   public static void main(String args[]) {
40.     Document doc = documentBuilder.parse(xmlFile)
41.     NodeInfo Node_Child = new NodeInfo()
42.     Node_Child.ReadXML(doc.getDocumentElement(), '1')
43.   }   // End of void main
44. }   // End of XtoDB Class

Figure 4.8 Proposed XtoDB algorithm

The algorithm works as follows. First, the algorithm reads the root element of the document and passes it, together with its position (‘1’) based on Dewey order labelling, to the ReadXML function (line 42). In lines 5–6, the algorithm defines the variables (j, Ances_Pos). In line 7, the variable (ParentPos) is defined as a string and set to the argument (Position), which is equal to ‘1’ in the first call of the ReadXML function, i.e. the position of the root in the document. In line 8, the algorithm reads all the child nodes of the current element (initially the root) and stores them in the nodeList array. For each child node in the nodeList array (lines 9–23), the algorithm checks the node type. If the child node is an Element (line 10), then the algorithm increments the integer variable (j) by 1, concatenates the value of (j) with the parent position variable (ParentPos) separated by ‘.’, and stores the result in the variable (Position), as shown in lines 11–12. Then, the algorithm reads all the attributes of this Element node and stores them in the attrs array (line 13). For each attribute, which is a leaf node (lines 14–17), the algorithm stores the value of (Position), which is the position of the element node, in the (Ances_Pos) variable (line 15). Then, the algorithm calls the Store_DB function (line 16), which stores the ancestor’s path ID and ancestor’s path expression, i.e. the path from root to parent (if not already present), in the Ancestor_Path table, and also stores the name, ancestor’s path ID, ancestor’s position (Ances_Pos), and value of this attribute in the Leaf_Node table (lines 25–36).

However, if the child node is a Text node, i.e. its parent Element is a leaf node that has a text value (line 19), then the algorithm obtains the position of the parent of that leaf node by applying the substring function to the variable (ParentPos) and stores it in the variable (Ances_Pos), as shown in line 20. In line 21, the algorithm calls the Store_DB function, which stores the ancestor’s path ID and ancestor’s path expression, i.e. the path from root to parent (if not stored before), in the Ancestor_Path table, and also stores the name, ancestor’s path ID, ancestor’s position, and value of this leaf node in the Leaf_Node table (lines 25–36). After handling any attributes of an Element, the algorithm calls the ReadXML function recursively for that element (line 18) to read and check its children; for a non-leaf node (inner node), i.e. a node that has neither an attribute nor a text value, this recursive call is the only action taken. The recursion continues until the algorithm has read all the content (nodes) of the DOM tree of the XML document.

After executing the XtoDB algorithm in the XAncestor prototype, the XML document shown in Figure 4.1(a) is shredded, decomposed, and stored in the RDB scheme (Ancestor_Path and Leaf_Node tables) with less storage space. The XtoDB algorithm mapped all the leaf nodes of the document and only their distinct ancestor path information (root-to-parent) into the XAncestor RDB scheme shown in Table 4.2.
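The traversal described above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not the prototype's implementation: it uses `xml.dom.minidom` and an in-memory SQLite database, the sample document is a hypothetical two-member excerpt (Figure 4.1(a) is not reproduced here), whitespace-only text nodes are skipped, and attributes are handled when their owning element is visited, so attributes of the root are stored too. The names `path_of`, `store_db`, `read_xml`, and `xtodb` are chosen for this illustration.

```python
import sqlite3
from xml.dom import minidom

# Hypothetical excerpt in the spirit of DBGroup.xml.
SAMPLE_XML = """<DBGroup>
  <Member><Name>Fervvac</Name><Age>23</Age><Office>CSD 4212</Office></Member>
  <Member><Name>Daniel</Name>
    <Office><Building>CS</Building><Room>4215</Room></Office>
  </Member>
</DBGroup>"""


def path_of(element):
    """Root-to-element path, e.g. /DBGroup/Member."""
    names = []
    while element is not None and element.nodeType == element.ELEMENT_NODE:
        names.append(element.tagName)
        element = element.parentNode
    return "/" + "/".join(reversed(names))


def store_db(conn, name, path, pos, value):
    """Reuse the ID of a known ancestor path, or create one, then add the leaf."""
    row = conn.execute(
        "SELECT Ances_PathID FROM Ancestor_Path WHERE Ances_PathExp = ?", (path,)
    ).fetchone()
    if row is None:
        path_id = conn.execute(
            "INSERT INTO Ancestor_Path (Ances_PathExp) VALUES (?)", (path,)
        ).lastrowid
    else:
        path_id = row[0]
    conn.execute("INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)", (name, path_id, pos, value))


def read_xml(conn, element, position):
    """Depth-first walk; `position` is the Dewey label of `element`."""
    attrs = element.attributes
    for i in range(attrs.length):  # an attribute's ancestor is its owning element
        attr = attrs.item(i)
        store_db(conn, "@" + attr.name, path_of(element), position, attr.value)
    j = 0
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            j += 1
            read_xml(conn, child, f"{position}.{j}")
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            # Text makes `element` a leaf; its ancestor is the parent element.
            parent_pos = position.rsplit(".", 1)[0]
            store_db(conn, element.tagName, path_of(element.parentNode),
                     parent_pos, child.data.strip())


def xtodb(xml_text):
    """Shred an XML string into the two XAncestor relations and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT UNIQUE);"
        "CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER, Ances_Pos TEXT, Node_Value TEXT);"
    )
    read_xml(conn, minidom.parseString(xml_text).documentElement, "1")
    return conn


conn = xtodb(SAMPLE_XML)
for row in conn.execute("SELECT * FROM Leaf_Node"):
    print(row)
```

Only two ancestor paths are stored for the six leaf nodes of this excerpt, which is the space saving the approach relies on. Positions are built as in line 12 of Figure 4.8, so they carry no trailing ‘.’.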

Table 4.2 XAncestor RDB scheme: (a) Ancestor_Path and (b) Leaf_Node tables based on DBGroup.xml

(a) Ancestor_Path Table

| Ances_PathID | Ances_PathExp |
|--------------|---------------|
| 1 | /DBGroup/Member |
| 2 | /DBGroup/Member/Office |
| 3 | /DBGroup/Project |

(b) Leaf_Node Table

| Node_Name | Ances_PathID | Ances_Pos | Node_Value |
|-----------|--------------|-----------|------------|
| Name | 1 | 1.1. | ‘Fervvac’ |
| Age | 1 | 1.1. | ‘23’ |
| Office | 1 | 1.1. | ‘CSD 4212’ |
| Name | 1 | 1.2. | ‘Daniel’ |
| Building | 2 | 1.2.2. | ‘CS’ |
| Room | 2 | 1.2.2. | ‘4215’ |
| Name | 1 | 1.3. | ‘Ryan’ |
| Age | 1 | 1.3. | ‘24’ |
| Office | 1 | 1.3. | ‘CSD 4212’ |
| @Project | 1 | 1.3. | ‘105’ |
| @id | 3 | 1.4. | ‘105’ |
| Title | 3 | 1.4. | ‘XML’ |
| Title | 3 | 1.5. | ‘Cube’ |

4.3.3 Query Mapping Algorithm (XtoSQL)

The purpose of improving the mapping of XML documents into a RDB is not just to use less storage space and achieve high scalability when the XML documents become bigger. The data also needs to be stored in such a way that it can be queried efficiently. In the proposed algorithm (XtoSQL) of the XAncestor mapping approach, the XML path language (XPath) is used as the source language for querying the XML data stored in the RDB. Different types of XPath expression (see Table 4.1) are translated into their corresponding SQL query over the RDB scheme, with less query response time, by using a path-based technique to obtain the results from the RDB. These results are then translated to an XML hierarchical format and returned to the user. The XtoSQL algorithm is defined as follows:

Definition 13: Query Mapping (XtoSQL). Query mapping, QM, is a function that assigns to each tuple (Q, X, R, σ) a relational query Q’, where Q is an XML query, X is an XML document, R is a RDB, σ is the mapping from X to R, and Q’ is an equivalent set of relational queries to Q such that Q’(R) ≡ Q(X).

Figure 4.9 contains the flowchart of the proposed query mapping algorithm (XtoSQL). Figure 4.10 shows how the XtoSQL algorithm maps six types of XPath query into the corresponding SQL query. An explanation of this algorithm (Figures 4.9 and 4.10) then follows, by way of an example of the six types of XML query (XPath), which are shown in Table 4.1 and based on the document in Figure 4.1(a).

Figure 4.9 Flowchart of proposed XtoSQL algorithm

As shown in Figure 4.9, the query mapping algorithm (XtoSQL) translates different types of XML query (XPath) (see Table 4.1) into their corresponding SQL query over the RDB scheme. These XML queries include short and long simple path expression queries (Q1 and Q2) that evaluate short and long P-C relationships, queries with one ‘//’ (Q3) that evaluate one A-D relationship, queries with two ‘//’ (Q4) that evaluate two A-D relationships, twig queries with ‘/’ (Q5) that evaluate branching queries with P-C relationships, and twig queries with ‘//’ (Q6) that evaluate branching queries with A-D relationships. Figure 4.10 shows how the XtoSQL algorithm translates these six types of XPath query into SQL queries.

Algorithm: XtoSQL
Input: XML query (XPath), RDB scheme
Output: translated SQL query, relational result, XML result based on RDB scheme
1.  public void XtoSQL ( ) {
2.    Enter XPath Query (strXPathQuery, Query_Type)
3.    Parse & Read (strXPathQuery, Query_Type)
4.    IF (Query_Type == Q1 || Query_Type == Q2) Then
5.      Extract from strXPathQuery → strAncestor_Path, Last_Node
6.      strSQLQuery = 'SELECT L.Node_Value
7.                     FROM   Leaf_Node L, Ancestor_Path P
8.                     WHERE  P.Ances_PathExp = strAncestor_Path
9.                     AND    L.Ances_PathID = P.Ances_PathID
10.                    AND    L.Node_Name = Last_Node'
11.     Execute SQL query (strSQLQuery)
12.     Display the result in (relational and XML formats)
13.   Else IF (Query_Type == Q3 || Query_Type == Q4) Then
14.     Extract from strXPathQuery → strAncestor_Path, Last_Node
15.     Change '//' inside strAncestor_Path into '%'
16.     strSQLQuery = 'SELECT L.Node_Value
17.                    FROM   Leaf_Node L, Ancestor_Path P
18.                    WHERE  P.Ances_PathExp LIKE strAncestor_Path
19.                    AND    L.Ances_PathID = P.Ances_PathID
20.                    AND    L.Node_Name = Last_Node'
21.     Execute SQL query (strSQLQuery)
22.     Display the result in (relational and XML formats)
23.   Else IF (Query_Type == Q5) Then
24.     Extract from strXPathQuery → strAncestor_Path, Last_Node,
25.                                  Node_Condition, Value_Condition
26.     strSQLQuery = 'SELECT L2.Node_Value
27.                    FROM   Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
28.                    WHERE  P.Ances_PathExp = strAncestor_Path
29.                    AND    L1.Ances_PathID = P.Ances_PathID
30.                    AND    L1.Node_Name = Node_Condition
31.                    AND    L1.Node_Value = Value_Condition
32.                    AND    L2.Node_Name = Last_Node
33.                    AND    L1.Ances_PathID = L2.Ances_PathID
34.                    AND    L1.Ances_Pos = L2.Ances_Pos'
35.     Execute SQL query (strSQLQuery)
36.     Display the result in (relational and XML formats)
37.   Else IF (Query_Type == Q6) Then
38.     Extract from strXPathQuery → strAncestor_Path, Last_Node,
39.                                  Node_Condition, Value_Condition
40.     Change '//' inside strAncestor_Path into '%'
41.     strSQLQuery = 'SELECT L2.Node_Value
42.                    FROM   Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
43.                    WHERE  P.Ances_PathExp LIKE strAncestor_Path
44.                    AND    L1.Ances_PathID = P.Ances_PathID
45.                    AND    L1.Node_Name = Node_Condition
46.                    AND    L1.Node_Value = Value_Condition
47.                    AND    L2.Node_Name = Last_Node
48.                    AND    L1.Ances_PathID = L2.Ances_PathID
49.                    AND    L1.Ances_Pos = L2.Ances_Pos'
50.     Execute SQL query (strSQLQuery)
51.     Display the result in (relational and XML formats)
52.   END IF
53. }   // End of XtoSQL Function

Figure 4.10 Proposed XtoSQL algorithm

The algorithm works as follows. First, the algorithm requests the user’s XPath query and its type, which are stored in the strXPathQuery and Query_Type string variables, respectively (line 2). In line 3, the algorithm parses and reads these two string variables. The algorithm then translates the XPath query (see Table 4.1) based on the entered query type (line 4). The process for each of the six query types is explained below.

a. Short or Long Simple Path Expression Query (Q1, Q2)

These types of XPath query evaluate queries that have short or long P-C relationships. For these types, the algorithm extracts the ancestor’s path and last node from the strXPathQuery variable, which are stored in strAncestor_Path and Last_Node, respectively (line 5). Based on the strAncestor_Path and Last_Node that are extracted from strXPathQuery, the algorithm translates the entered XPath query into the corresponding SQL query by matching the ancestor’s path of the entered XPath query (whether Q1 or Q2) exactly with the path information in the Ancestor_Path table using the ‘=’ operator to obtain its Ances_PathID, which is used to look up the data stored in the Leaf_Node table (lines 6–10). It then stores this translated SQL query in strSQLQuery, as shown in Figure 4.11. In line 11, the algorithm executes the translated SQL query that is stored in strSQLQuery against the RDB. The result of the translated SQL query is displayed in relational and XML formats (line 12).

strSQLQuery = 'SELECT L.Node_Value
               FROM   Leaf_Node L, Ancestor_Path P
               WHERE  P.Ances_PathExp = strAncestor_Path
               AND    L.Ances_PathID = P.Ances_PathID
               AND    L.Node_Name = Last_Node'

Figure 4.11 Translated SQL query for XPath queries Q1 and Q2 using XtoSQL
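The template in Figure 4.11 can be exercised against a small in-memory copy of Table 4.2, as in the sketch below. The helper names `make_db`, `translate_simple`, and `run_xpath` are assumptions for this illustration, and the values are bound as SQL parameters rather than concatenated into the query string, which is a choice made here and not something the source states.

```python
import sqlite3


def make_db():
    """In-memory copy of the rows listed in Table 4.2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT);"
        "CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER, Ances_Pos TEXT, Node_Value TEXT);"
    )
    conn.executemany("INSERT INTO Ancestor_Path VALUES (?, ?)", [
        (1, "/DBGroup/Member"), (2, "/DBGroup/Member/Office"), (3, "/DBGroup/Project"),
    ])
    conn.executemany("INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)", [
        ("Name", 1, "1.1.", "Fervvac"), ("Age", 1, "1.1.", "23"),
        ("Office", 1, "1.1.", "CSD 4212"), ("Name", 1, "1.2.", "Daniel"),
        ("Building", 2, "1.2.2.", "CS"), ("Room", 2, "1.2.2.", "4215"),
        ("Name", 1, "1.3.", "Ryan"), ("Age", 1, "1.3.", "24"),
        ("Office", 1, "1.3.", "CSD 4212"), ("@Project", 1, "1.3.", "105"),
        ("@id", 3, "1.4.", "105"), ("Title", 3, "1.4.", "XML"),
        ("Title", 3, "1.5.", "Cube"),
    ])
    return conn


def translate_simple(xpath):
    """Split '/a/b/c' into ancestor path '/a/b' and last node 'c' (Q1/Q2 shape)."""
    ancestor_path, last_node = xpath.rsplit("/", 1)
    sql = (
        "SELECT L.Node_Value "
        "FROM Leaf_Node L, Ancestor_Path P "
        "WHERE P.Ances_PathExp = ? "
        "AND L.Ances_PathID = P.Ances_PathID "
        "AND L.Node_Name = ?"
    )
    return sql, (ancestor_path, last_node)


def run_xpath(conn, xpath):
    sql, params = translate_simple(xpath)
    return [row[0] for row in conn.execute(sql, params)]


conn = make_db()
print(run_xpath(conn, "/DBGroup/Member/Age"))
print(run_xpath(conn, "/DBGroup/Member/Office/Room"))
```

The long path (Q2) costs the same single join as the short one (Q1), because the whole ancestor path is matched as one string.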

The XPath queries ‘/DBGroup/Member/Age’ and ‘/DBGroup/Member/Office/Room’ (see Figure 4.1 and Q1 and Q2 in Table 4.1) are examples of short and long simple path expression queries (containing P-C relationships ‘/’), respectively. Using the XtoSQL algorithm, these queries are translated into the corresponding SQL queries shown in Figure 4.12 and Figure 4.13, respectively.

SELECT L.Node_Value
FROM   Leaf_Node L, Ancestor_Path P
WHERE  P.Ances_PathExp = '/DBGroup/Member'
AND    L.Ances_PathID = P.Ances_PathID
AND    L.Node_Name = 'Age'

Figure 4.12 Translated SQL query for Q1 based on DBGroup.xml using XtoSQL

SELECT L.Node_Value
FROM   Leaf_Node L, Ancestor_Path P
WHERE  P.Ances_PathExp = '/DBGroup/Member/Office'
AND    L.Ances_PathID = P.Ances_PathID
AND    L.Node_Name = 'Room'

Figure 4.13 Translated SQL query for Q2 based on DBGroup.xml using XtoSQL

b. Query with One or Two of ‘//’ (Q3, Q4)

These types of XPath query evaluate queries that have one or two A-D relationships. For these types, the algorithm extracts the ancestor’s path and last node from the strXPathQuery variable, which are stored in strAncestor_Path and Last_Node, respectively (line 14). Moreover, the algorithm changes all the A-D ‘//’ operators inside strAncestor_Path into the ‘%’ wildcard that is used in SQL pattern matching in many RDBMSs (line 15). Based on the strAncestor_Path (containing ‘%’ wildcards) and the Last_Node, the algorithm translates the entered XPath query into the corresponding SQL query by matching the ancestor’s path of the entered XPath query (whether Q3 or Q4) with the path information in the Ancestor_Path table by using the LIKE operator to get its Ances_PathID, which is used to look up the data stored in the Leaf_Node table (lines 16–20). It then stores this translated SQL query in strSQLQuery, as shown in Figure 4.14. In line 21, the algorithm executes the translated SQL query that is stored in strSQLQuery against the RDB. The result of the translated SQL query is displayed in relational and XML formats (line 22).

strSQLQuery = 'SELECT L.Node_Value
               FROM   Leaf_Node L, Ancestor_Path P
               WHERE  P.Ances_PathExp LIKE strAncestor_Path
               AND    L.Ances_PathID = P.Ances_PathID
               AND    L.Node_Name = Last_Node'

Figure 4.14 Translated SQL query for XPath queries Q3 and Q4 using XtoSQL
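The rewriting of ‘//’ into ‘%’ and the LIKE-based template of Figure 4.14 can be tried out as follows. This is a sketch under assumptions: the helper names are illustrative, the data is a subset of Table 4.2, and parameter binding is used in place of string concatenation. Note that SQLite's LIKE is case-insensitive for ASCII characters by default, and that a pattern such as ‘%Office’ matches any stored path ending in those characters, so the sketch mirrors the translation described in the text rather than full XPath semantics.

```python
import sqlite3


def make_db():
    """In-memory subset of Table 4.2 that is sufficient for Q3 and Q4."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT);"
        "CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER, Ances_Pos TEXT, Node_Value TEXT);"
    )
    conn.executemany("INSERT INTO Ancestor_Path VALUES (?, ?)", [
        (1, "/DBGroup/Member"), (2, "/DBGroup/Member/Office"), (3, "/DBGroup/Project"),
    ])
    conn.executemany("INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)", [
        ("Name", 1, "1.2.", "Daniel"), ("Building", 2, "1.2.2.", "CS"),
        ("Room", 2, "1.2.2.", "4215"), ("Title", 3, "1.4.", "XML"),
    ])
    return conn


def to_like_pattern(xpath):
    """Return (LIKE pattern for the ancestor path, last node name)."""
    head, last_node = xpath.rsplit("/", 1)
    if head.endswith("/"):
        # The final step used '//': restore the slash consumed by rsplit.
        head += "/"
    return head.replace("//", "%"), last_node


def run_descendant_query(conn, xpath):
    pattern, last_node = to_like_pattern(xpath)
    sql = (
        "SELECT L.Node_Value "
        "FROM Leaf_Node L, Ancestor_Path P "
        "WHERE P.Ances_PathExp LIKE ? "
        "AND L.Ances_PathID = P.Ances_PathID "
        "AND L.Node_Name = ?"
    )
    return [row[0] for row in conn.execute(sql, (pattern, last_node))]


conn = make_db()
print(to_like_pattern("//Office/Building"), run_descendant_query(conn, "//Office/Building"))
print(to_like_pattern("//Member//Building"), run_descendant_query(conn, "//Member//Building"))
```

Both queries still use a single join between the two tables; the A-D relationship is resolved entirely by the pattern match on the stored path string.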

The XPath queries ‘//Office/Building’ and ‘//Member//Building’ (see Figure 4.1 and Q3 and Q4 in Table 4.1) are examples of one and two A-D relationship queries (containing one or two of ‘//’), respectively. Using the XtoSQL algorithm, these queries are translated into the corresponding SQL queries shown in Figure 4.15 and Figure 4.16, respectively.

SELECT L.Node_Value
FROM   Leaf_Node L, Ancestor_Path P
WHERE  P.Ances_PathExp LIKE '%Office'
AND    L.Ances_PathID = P.Ances_PathID
AND    L.Node_Name = 'Building'

Figure 4.15 Translated SQL query for Q3 based on DBGroup.xml using XtoSQL

SELECT L.Node_Value
FROM   Leaf_Node L, Ancestor_Path P
WHERE  P.Ances_PathExp LIKE '%Member%'
AND    L.Ances_PathID = P.Ances_PathID
AND    L.Node_Name = 'Building'

Figure 4.16 Translated SQL query for Q4 based on DBGroup.xml using XtoSQL

c. Twig Query with ‘/’ (Q5)

This type of XPath query evaluates a branching query with P-C relationships. To retrieve the information of one node from another node with a known value, where both nodes have the same parent (the same ancestor path), the algorithm extracts four items from the strXPathQuery variable: the ancestor’s path, the last node (the node to be retrieved), the node inside the predicate, and the value of the node inside the predicate. These are stored in strAncestor_Path, Last_Node, Node_Condition, and Value_Condition, respectively (lines 24–25). Based on these variables, the algorithm decomposes the entered XPath query into a single sub-path, namely the ancestor’s path shared by the two nodes, which is stored in strAncestor_Path. This method differs from that adopted in previous approaches, which decompose the entered XPath query into three sub-paths: the first is the path of the branching node, the second is the path of the node to be retrieved, and the third is the path of the node that has a value. Using three sub-paths in the SQL query makes query processing in a RDBMS costly.

Then, the algorithm translates the entered XPath query in strXPathQuery into the corresponding SQL query. The ancestor’s path of the two nodes is matched exactly against the path information in the Ancestor_Path table by using the ‘=’ operator to obtain their Ances_PathID. This ID is used, together with the node that has a known value in the query’s predicate, to look up the data stored in the Leaf_Node table for the node to be retrieved (lines 26–34). The translated SQL is then stored in strSQLQuery, as shown in Figure 4.17. In line 35, the algorithm executes the translated SQL query stored in strSQLQuery against the RDB. The result of the translated SQL query is displayed in relational and XML formats (line 36).

strSQLQuery = ‘SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp = strAncestor_Path
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = Node_Condition
AND L1.Node_Value = Value_Condition
AND L2.Node_Name = Last_Node
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos’

Figure 4.17 Translated SQL query for XPath query Q5 using XtoSQL
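The translation of a P-C twig query can be sketched in Python as follows. This is a minimal illustration rather than the XtoSQL implementation itself: the regular expression, the helper names (split_twig, demo_db, run_twig), the column types, and the two sample Member records are all assumptions made for the example, and SQLite stands in for the RDBMS. Only the table names, the column names, and the shape of the SQL query are taken from the text above.

```python
import re
import sqlite3

# Hypothetical pattern for the twig shape /path[Node = 'value']/Last.
TWIG = re.compile(
    r"^(?P<anc>/.+)\[\s*(?P<node>\w+)\s*=\s*'(?P<val>[^']*)'\s*\]/(?P<last>\w+)$"
)

# Same join structure as the translated query, with bound parameters.
TWIG_SQL = """
SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp = ?
  AND L1.Ances_PathID = P.Ances_PathID
  AND L1.Node_Name = ?
  AND L1.Node_Value = ?
  AND L2.Node_Name = ?
  AND L1.Ances_PathID = L2.Ances_PathID
  AND L1.Ances_Pos = L2.Ances_Pos
"""


def split_twig(xpath):
    """Return (ancestor path, last node, predicate node, predicate value)."""
    m = TWIG.match(xpath)
    if m is None:
        raise ValueError(f"not a simple P-C twig query: {xpath}")
    return m["anc"], m["last"], m["node"], m["val"]


def demo_db():
    """Build a tiny in-memory instance of the two tables (sample data is invented)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT);
        CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER,
                                Ances_Pos TEXT, Node_Value TEXT);
    """)
    con.execute("INSERT INTO Ancestor_Path VALUES (1, '/DBGroup/Member')")
    con.executemany(
        "INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)",
        [("Name", 1, "1.1", "Ann"), ("Age", 1, "1.1", "24"),
         ("Name", 1, "1.2", "Bob"), ("Age", 1, "1.2", "30")],
    )
    return con


def run_twig(con, xpath):
    """Translate the twig query into one SQL statement and run it."""
    anc, last, node, val = split_twig(xpath)
    rows = con.execute(TWIG_SQL, (anc, node, val, last))
    return [row[0] for row in rows]
```

Calling run_twig(demo_db(), "/DBGroup/Member[Age = '24']/Name") selects the Name leaf that shares both the ancestor path ID and the ancestor position with the matching Age leaf, which is the role of the last two join conditions.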

The XPath query ‘/DBGroup/Member[Age = ‘24’]/Name’ (see Figure 4.1 and Q5 in Table 4.1) is an example of a twig query with a P-C relationship (containing ‘/’). Using the XtoSQL algorithm, this query is translated into the corresponding SQL query shown in Figure 4.18.

SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp = '/DBGroup/Member'
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = 'Age'
AND L1.Node_Value = '24'
AND L2.Node_Name = 'Name'
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos

Figure 4.18 Translated SQL query for Q5 based on DBGroup.xml using XtoSQL

d. Twig Query with ‘//’ (Q6)

This type of XPath query evaluates a branching query that has A-D relationships. To retrieve the information of one node from another node with a known value, where both nodes have the same parent (the same ancestor path), the algorithm extracts four items from the strXPathQuery variable: the ancestor’s path, the last node (the node to be retrieved), the node inside the predicate, and the value of the node inside the predicate. These are stored in strAncestor_Path, Last_Node, Node_Condition, and Value_Condition, respectively (lines 38–39). In addition, the algorithm replaces every A-D ‘//’ operator inside strAncestor_Path with the ‘%’ wildcard that is used in SQL queries in many RDBMSs (line 40). Next, the algorithm decomposes the entered XPath query into a single sub-path, namely the ancestor’s path shared by the two nodes, which is stored in strAncestor_Path. As mentioned above, this method differs from that of previous approaches, which decompose the entered XPath query into three sub-paths: the first is the path of the branching node, the second is the path of the node to be retrieved, and the third is the path of the node that has a value. The use of three sub-paths in the SQL query makes query processing in a RDBMS costly.

Then, the algorithm translates the entered XPath query in strXPathQuery into the corresponding SQL query. The ancestor’s path of the two nodes is matched against the path information in the Ancestor_Path table using the ‘LIKE’ operator to obtain their Ances_PathID. This ID is used, together with the node that has a known value in the query’s predicate, to look up the data stored in the Leaf_Node table for the node to be retrieved (lines 41–49). The translated SQL is stored in strSQLQuery, as shown in Figure 4.19. In line 50, the algorithm executes the translated SQL query stored in strSQLQuery against the RDB. The result of the translated SQL query for Q6 is displayed in relational and XML formats (line 51). Line 52 and line 53 mark the end of the END IF clause and the end of the XtoSQL algorithm, respectively.

strSQLQuery = ‘SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp LIKE strAncestor_Path
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = Node_Condition
AND L1.Node_Value = Value_Condition
AND L2.Node_Name = Last_Node
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos’

Figure 4.19 Translated SQL query for XPath query Q6 using XtoSQL
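The replacement of ‘//’ by the ‘%’ wildcard and the subsequent LIKE match can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the helper names (to_like_pattern, demo_paths_db, matching_path_ids) and the three sample ancestor paths are hypothetical, and SQLite is used in place of the RDBMS. Only the Ancestor_Path table and its two columns come from the text.

```python
import sqlite3


def to_like_pattern(ancestor_path):
    """Replace each A-D '//' step with the SQL LIKE wildcard '%'."""
    return ancestor_path.replace("//", "%")


def demo_paths_db():
    """Build an in-memory Ancestor_Path table with invented sample paths."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT)"
    )
    con.executemany(
        "INSERT INTO Ancestor_Path VALUES (?, ?)",
        [(1, "/DBGroup/Member"), (2, "/DBGroup/Member/Office"), (3, "/DBGroup/Project")],
    )
    return con


def matching_path_ids(con, xpath_ancestor):
    """Return the IDs of the stored ancestor paths matched by the LIKE pattern."""
    rows = con.execute(
        "SELECT Ances_PathID FROM Ancestor_Path "
        "WHERE Ances_PathExp LIKE ? ORDER BY Ances_PathID",
        (to_like_pattern(xpath_ancestor),),
    )
    return [row[0] for row in rows]
```

With the sample data, matching_path_ids(demo_paths_db(), "//Office") matches only the stored path that ends in Office. One caveat of this plain substitution is that a pattern such as '%Office' would also match a path whose last step merely ends with the same letters; whether the thesis guards against this is not stated in the text.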

The XPath query ‘//Office[Building = ‘CS’]/Room’ (see Figure 4.1 and Q6 in Table 4.1) is an example of a twig query with an A-D relationship (containing ‘//’). Using the XtoSQL algorithm, this query is translated into the corresponding SQL query shown in Figure 4.20.

SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%Office'
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = 'Building'
AND L1.Node_Value = 'CS'
AND L2.Node_Name = 'Room'
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos

Figure 4.20 Translated SQL query for Q6 based on DBGroup.xml using XtoSQL

4.4 CORRECTNESS OF THE MAPPING ALGORITHM

One of the challenging issues in proposing an approach in this field is to prove that it produces correct results when mapping XML documents into a RDB scheme. The proposed approach (XAncestor) produces a correct result if its XML mapping algorithm (XtoDB) is correct. The correctness of the XtoDB algorithm is proven below.

Definition 1. Let ƞr be the root of the XML document, which has child nodes ƞchild(i) under it. The mapping process starts from the root of the document and traverses the document using the breadth-first traversal method.

Definition 2. All subsequent layers of the XML document, which consist of the child nodes ƞnextchild(i) of node ƞchild(i) and the other descendants of ƞr, are read and traversed using breadth-first traversal. Different types of nodes are identified while the document is traversed. If the traversed node is an element node (inner node), then its name and its position in the document are identified so that they can be used in the ancestor path information and the ancestor position (based on Dewey order labelling) of a leaf node. If the traversed node is a leaf node, then its ancestor path information (root-to-parent) and ancestor path ID are stored in the Ancestor_Path table, and its information, including name, ancestor path ID, ancestor position, and value, is stored in the Leaf_Node table.
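The mapping described in Definition 2 can be sketched in Python as follows. This is a simplified illustration, not the XtoDB algorithm: the function name map_xml, the ‘@’ prefix on attribute names, and the rule that only element children are numbered (from 1) are assumptions of the example. The sketch also uses plain recursion, which visits nodes in document order, whereas the thesis describes its traversal as breadth-first; for a given document the two orders produce the same set of rows. Mixed content is ignored.

```python
import xml.etree.ElementTree as ET


def map_xml(xml_text):
    """Return (ancestor_paths, leaf_nodes) for an XML string.

    ancestor_paths maps each distinct ancestor path to its Ances_PathID.
    leaf_nodes holds (Node_Name, Ances_PathID, Ances_Pos, Node_Value) rows.
    """
    path_ids = {}
    leaf_nodes = []

    def path_id(path):
        # Store each distinct ancestor path once and reuse its ID.
        return path_ids.setdefault(path, len(path_ids) + 1)

    def visit(elem, parent_path, parent_pos, pos):
        path = parent_path + "/" + elem.tag
        # An attribute is a leaf whose ancestor is the element that owns it.
        for name, value in elem.attrib.items():
            leaf_nodes.append(("@" + name, path_id(path), pos, value))
        children = list(elem)
        text = (elem.text or "").strip()
        # A childless element with text is a leaf whose ancestor is its parent.
        if not children and text:
            leaf_nodes.append((elem.tag, path_id(parent_path), parent_pos, text))
        # Dewey label of a child: the parent's label plus the child's rank.
        for i, child in enumerate(children, start=1):
            visit(child, path, pos, f"{pos}.{i}")

    visit(ET.fromstring(xml_text), "", "", "1")
    return path_ids, leaf_nodes
```

For a document with one Member element that holds a Name and an Age, both leaves receive the same ancestor path ID and the same ancestor position, which is the property that the twig queries in Section 4.3 rely on.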


Lemma 1. If the traversed node is an element node (not a text node or an attribute node) of the XML document, then the name and position of this node are stored in the variables that are used to obtain the ancestor path information and ancestor position of a leaf node.

Proof. The intended XML document is parsed by using a DOM parser, so the elements in the document can be read and accessed, as shown in line 40 of the XtoDB algorithm. The function ReadXML(doc.getDocumentElement(), ‘1’) is called to read the root node of the document, ƞr, as shown in line 42 of the XtoDB algorithm. The root node ƞr is retrieved using doc.getDocumentElement(), which is a method of the document class in the DOM. The position of the root node is ‘1’ based on Dewey order labelling. Subsequently, the child nodes, ƞchild(i), ƞnextchild(i) of ƞr are read and stored in the NodeList array (1st array), as shown in line 8 of the XtoDB algorithm. For each child node in the NodeList array (1st array), the XtoDB algorithm checks in line 10 whether this child node ƞchild(i) is an element node. Then this child node is read and its position, followed by its parent’s position, are stored in the variable (Position), as shown in line 12. With the recursive function (breadth-first traversal) using ReadXML(currentNode, Position), as shown in line 18, the child nodes of child node ƞchild(i), ƞnextchild(i) are read and stored in the NodeList array (2nd array), as shown in line 8, until the leaf node or last node (whether attribute or text node) of the document is reached. Therefore, all the non-leaf nodes are read and traversed as in lines 10–18, where the function ReadXML() is repeated until the leaf node is reached.

Lemma 2. If the traversed node is a leaf node that has a value, whether an attribute or a text in the XML document, then the information of this node, its ancestor path information, and its ancestor position are mapped to the Leaf_Node table after the path and position of the ancestor nodes (inner nodes) of the leaf node have been identified as in Lemma 1.

Proof. The currentNode.getAttributes() method is invoked on the current element node in line 13 to get all the attributes of this element and store them in the attrs array. For each attribute (i.e. leaf node), as shown in lines 14–17, the attribute is traversed and read, and its ancestor’s path is stored in the Ancestor_Path table, while its name, its ancestor’s position (the value of the Position variable), and its value are stored in the Leaf_Node table, as shown in lines 25–36, by using the Store_DB() function. In addition, as shown in lines 19–22, when the current node is a leaf node (last node) that has a text, this node is traversed and read, and its ancestor’s path is stored in the Ancestor_Path table, while its name, its ancestor’s position (the value of the substring function on the ParentPos variable), and its value are stored in the Leaf_Node table, as shown in lines 25–36, by using the Store_DB() function.

Theorem 1. Given an XML document with multiple levels, the nodes are traversed node-by-node using breadth-first traversal and mapped to their respective tables.

Proof. The root node and its descendants (inner nodes) are traversed and their positions are read starting from the root using Dewey order labelling in order to store them as the ancestor path information and ancestor positions of the leaf nodes, as in Lemma 1. By Lemma 2, all the leaf nodes are traversed, read, and mapped by the XtoDB algorithm, with their ancestor paths stored in the Ancestor_Path table and their data stored in the Leaf_Node table. Thus, the XML document is traversed and mapped into the Ancestor_Path and Leaf_Node tables.

4.5 SUMMARY

In this chapter, a new mapping approach (XAncestor) is proposed in order to achieve the objectives of this research. First, a detailed description of the mathematical concepts and theories related to the XML tree structure used in this approach is provided. Then, the XAncestor mapping approach itself is presented in detail. The three main components of this mapping approach are explained as follows. First, the proposed RDB scheme, which contains two tables (Ancestor_Path, Leaf_Node), is explained and, for clarity, an ERD of the RDB scheme is also provided. Second, the XML mapping algorithm (XtoDB) is discussed. This algorithm maps XML documents into a fixed RDB using a path-based technique. The basic idea of the XtoDB algorithm is to map only the distinct ancestor path information of all the leaf nodes, together with the information of those leaf nodes in the XML tree, into the RDB scheme, which requires less storage space. Third, the query mapping algorithm (XtoSQL) is discussed. This algorithm uses a path-based technique to map different types of XPath query into corresponding SQL queries that run over the RDB scheme with a shorter query response time. The results obtained from the RDB are translated into an XML hierarchical format and returned to the user. Flowcharts and running examples of the XAncestor mapping approach are also presented to illustrate how the XtoDB and XtoSQL algorithms work. Lastly, the correctness of the proposed approach (XAncestor) is proven in terms of mapping XML documents into its RDB scheme (XtoDB algorithm).


CHAPTER V

RESULTS AND DISCUSSION

5.1 INTRODUCTION

In the previous chapter, the proposed mapping approach (XAncestor), which contains the XML mapping algorithm (XtoDB) and the query mapping algorithm (XtoSQL) for handling storing and querying issues, respectively, was developed. The XtoDB algorithm is designed to store an XML document in less storage space in the RDB scheme and to scale well as XML documents become bigger. This is achieved because the proposed approach reduces the complexity of the RDB scheme (dangling tables and columns) and removes redundant or unnecessary mapped data and paths; that is, it reduces the number of records mapped. The XtoDB algorithm handles only the ancestor path information of the leaf nodes and the information of those leaf nodes in the XML tree. The XtoSQL algorithm is designed to translate different types of XPath query into corresponding SQL queries that have a short query response time, that is, high query performance. This is achieved because the proposed approach reduces the search space in the RDB scheme: it reduces the number of join operations, removes nested queries, and removes the SQL functions used inside the translated SQL queries.

To evaluate the effectiveness and scalability of the proposed mapping approach (XAncestor) in terms of RDB storage space and query response time, this chapter experimentally compares the results generated by the proposed approach with those generated by five existing approaches. The compared approaches are XRel (Yoshikawa et al. 2001), SMX/R (Ning & De 2010), XRecursive (Fakharaldien et al. 2011b; Fakharaldien et al. 2011a), Ying-Cao (Ying et al. 2012), and s-XML (Subramaniam et al. 2012). These approaches are chosen for comparison because they are recent model mapping approaches that have been successfully used, and whose performance has been evaluated against other approaches, for mapping XML documents into a predefined RDB scheme in terms of storing and querying. They apply recent techniques such as path-based and node-based techniques for storing and querying XML documents in a RDB scheme. The approaches are applied to two real datasets and to six different types of query for each dataset. Each approach is compared and analysed with regard to its effectiveness and scalability in terms of RDB storage space and query response time.

For a fair comparison in the experiments, structure mapping approaches are omitted because their RDB schemes are built based on the structure of the XML document (DTD or XML-schema) (Bousalem & Cherti 2015; Mourya & Saxena 2015; Vaidya 2015; Fakharaldien et al. 2011a). In structure mapping approaches, different XML documents have different RDB schemes (Qin et al. 2005; Yoshikawa et al. 2001). Because the size and number of the mapped relational tables vary with the XML schema, querying the XML data is difficult (Wang et al. 2012; Jeong & Lee 2006). The aim of the experiments discussed in this chapter is to show that the proposed mapping approach (XAncestor) can outperform other similar existing approaches in terms of the effectiveness and scalability of RDB storage space and query response time.

This chapter is organized as follows. First, the experimental design, including the experimental setup, the experimental datasets, and the experimental queries, is presented in Section 5.2. Next, the experimental evaluations of the proposed mapping approach (XAncestor) and the other five approaches are described in Section 5.3. This section looks at the effectiveness and scalability of RDB storage space and query response time of the approaches when applied to the selected datasets. Then, a discussion of the results produced by XAncestor and the other five approaches is presented in Section 5.4. After that, the completeness of XAncestor is proved in Section 5.5. Finally, a summary of this chapter is provided in Section 5.6.


5.2 EXPERIMENTAL DESIGN

A sequence of experiments is performed to evaluate the effectiveness and scalability of the proposed approach (XAncestor) compared to the existing approaches in terms of RDB storage space and query response time after the mapping process has been completed. The compared approaches are XRel, SMX/R, XRecursive, Ying-Cao, and s-XML. In this experiment, the following five steps are used to evaluate the effectiveness of an approach in terms of RDB storage space and query response time:

1. Choose a well-formed XML document with known size.
2. Map the XML document into a predefined RDB.
3. Calculate the storage space of the RDB.
4. Translate six types of XPath query into their corresponding SQL query based on the predefined RDB scheme of the approach.
5. Execute each translated SQL query six times and calculate the average of these running times to get a more accurate query response time.

For a given XML document, the steps above are repeated to evaluate the effectiveness of each approach in terms of RDB storage space and query response time for the translated SQL queries. The steps above are also repeated on different sizes of XML dataset to evaluate each approach in terms of the scalability of the RDB storage space and the execution time for each query on small- to big-sized XML datasets. For ease of understanding, the results of the comparisons of RDB storage space and query response time are presented in column graphs and line graphs. The following subsections describe the experimental setup, datasets, and queries used in this study.


5.2.1 Experimental Setup

The experiment is conducted on a PC containing the following: a MSI Intel(R) Atom(TM) CPU N450 @ 1.66 GHz 1.67 GHz, 2 GB RAM, and 256 GB HDD running Microsoft Windows 7 (32 bits) Service Pack 2. The Microsoft SQL Server 2012 (Express Edition) is employed as the database server and Java is used as the programming language. NetBeans IDE 8.0.1 with a Java platform (JDK 8) and DOM parser are used to load the XML documents into the relational tables.

5.2.2 Experimental Datasets

Two datasets, namely SIGMOD and DBLP, are used to evaluate and compare the proposed mapping approach (XAncestor) with five recent model mapping approaches, namely XRel, SMX/R, Ying-Cao, XRecursive, and s-XML in terms of the effectiveness and scalability of the RDB storage space and query response time, i.e. query performance. The SIGMOD and DBLP datasets are obtained from the ACM SIGMOD website (Silva et al. 2007) and the Washington UW Repository (Miklau & Suciu 2003), respectively. Table 5.1 lists the features of XML datasets used in the experiments in this study, including the size, number of elements, number of attributes, and maximum depth.

Table 5.1 XML datasets used

Datasets | Size | Elements | Attributes | Max-Depth
SIGMOD | 467 KB | 11526 | 3737 | 6
DBLP1 | 31.5 MB | 796507 | 108626 | 5
DBLP2 | 63.4 MB | 1604967 | 206949 | 5
DBLP3 | 127 MB | 3332130 | 404276 | 6

As shown in Table 5.1, the proposed and existing approaches are applied to one size of SIGMOD dataset (467 KB) and three different sizes of DBLP dataset (DBLP1: 31.5 MB; DBLP2: 63.4 MB; and DBLP3: 127 MB). Different sizes of DBLP dataset, which have the same structure but different numbers of elements, attributes, and texts, are used to evaluate the scalability of the RDB storage space and the query response time (query performance) of the proposed approach compared to the existing approaches.

The ACM SIGMOD online website is a digital library for the database systems research community (Silva et al. 2007). The SIGMOD dataset is the XML version of a portion of the ACM SIGMOD website that records information in XML format. Many researchers have used the data from SIGMOD to evaluate their systems and algorithms (Fakharaldien et al. 2012; Li et al. 2010; Soltan & Rahgozar 2006). A snapshot of the XML document of the SIGMOD dataset and its DTD are shown in Figure 5.1.

Figure 5.1 Snapshot of SIGMOD: (a) XML document, and (b) DTD

The DBLP website is a popular tool for computer science researchers who use it to trace the work of colleagues and to retrieve bibliographic details. In June 2009, the DBLP computer science bibliographic library contained more than 1.2 million bibliographic records. Recently, many researchers have begun to use the data from DBLP to evaluate their systems and algorithms (Subramaniam et al. 2012; Ying et al. 2012; Wu et al. 2011; Li et al. 2010; Marks & Roantree 2009; Yang et al. 2006; Qin et al. 2005; Prakas et al. 2004). A snapshot of the XML document of the DBLP and its DTD are shown in Figure 5.2.

Figure 5.2 Snapshot of DBLP: (a) XML document, and (b) DTD

5.2.3 Experimental Queries

Six types of XPath query are evaluated by executing them on four different sizes of XML dataset (see Table 5.1). These queries are selected based on their usage in related research studies (Ying et al. 2012; Wu et al. 2011; Marks & Roantree 2009; Soltan & Rahgozar 2006; Yang et al. 2006; Schmidt et al. 2002; Yoshikawa et al. 2001) to evaluate the efficiency of query mapping algorithms. The queries are categorized by complexity, from simple path queries to twig queries, in order to evaluate the P-C and A-D relationships among the XML nodes (Haw & Lee 2011). These queries are:

1. A short simple path expression query that evaluates short parent-child relationships
2. A long simple path expression query that evaluates long parent-child relationships
3. A query with one ‘//’ that evaluates one ancestor-descendant relationship
4. A query with two ‘//’ that evaluates two ancestor-descendant relationships
5. A twig query with ‘/’ that evaluates the branching query associated with parent-child relationships
6. A twig query with ‘//’ that evaluates the branching query associated with ancestor-descendant relationships

Because the execution time of a query can vary from run to run, each query is executed seven times. The first run is eliminated because it calculates the execution plan of the query before executing the query itself, which makes it slow. Then, the average of the remaining times is calculated in order to get more accurate results (Maghaydah et al. 2010; Maghaydah 2010; Marks & Roantree 2009; Atay 2006). Two Transact-SQL commands (dbcc dropcleanbuffers, dbcc freeproccache) are executed before running each query to ensure that all results are warm cache numbers. Table 5.2 and Table 5.3 show the query description for the SIGMOD dataset and the DBLP dataset, respectively.
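The measurement protocol described above (seven runs, with the first one discarded and the rest averaged) can be sketched in Python as follows. The function names average_after_first and average_response_time are hypothetical, and run_query stands for any callable that executes one translated SQL query. The thesis experiments were run from Java against SQL Server, so this is only an illustration of the arithmetic.

```python
import time
from statistics import mean


def average_after_first(timings, discard=1):
    """Average the timings that remain after dropping the first `discard` runs."""
    if len(timings) <= discard:
        raise ValueError("need more runs than discarded runs")
    return mean(timings[discard:])


def average_response_time(run_query, runs=7, discard=1):
    """Time `run_query` `runs` times and average all but the first runs (seconds)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return average_after_first(timings, discard)
```

With seven runs and one discarded run, the reported value is the mean of six timings, which agrees with step 5 of the experimental design in Section 5.2.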


Table 5.2 Query description for SIGMOD dataset

Qn | Query Description | Query Expression | Feature
Q1 | Select all volumes for all issues in SigmodRecord | /SigmodRecord/issue/volume | Simple path expression (short)
Q2 | Select all authors for each article for all articles that are related to all issues in SigmodRecord | /SigmodRecord/issue/articles/article/authors/author | Simple path expression (long)
Q3 | Select all titles for each article | //article/title | One ‘//’
Q4 | Select all authors for all articles | //articles//author | Two ‘//’s
Q5 | Select all end pages for articles whose title is ‘Database Directions III Workshop Review.’ | /SigmodRecord/issue/articles/article[title = ‘Database Directions III Workshop Review.’]/endPage | Twig query with ‘/’
Q6 | Select all titles for articles whose initial page is greater than 50 | //article[initPage > 50]/title | Twig query with ‘//’

Table 5.3 Query description for DBLP dataset

Qn | Query Description | Query Expression | Feature
Q1 | Select all titles for all phd thesis in dblp | /dblp/phdthesis/title | Simple path expression (short)
Q2 | Select all superscript text for all titles that are related to all inproceedings in dblp | /dblp/inproceedings/title/sub/sup | Simple path expression (long)
Q3 | Select all authors for each master thesis | //mastersthesis/author | One ‘//’
Q4 | Select all monospace for all articles | //article//tt | Two ‘//’s
Q5 | Select all titles for all master thesis where year of publishing is 1992 | /dblp/mastersthesis[year = 1992]/title | Twig query with ‘/’
Q6 | Select all schools for all phd thesis where year of publishing is greater than 1979 | //phdthesis[year > 1979]/school | Twig query with ‘//’

5.3 EXPERIMENTAL EVALUATION

The effectiveness and scalability of the proposed mapping approach are evaluated and compared to those of five recent model mapping approaches (XRel, SMX/R, Ying-Cao, XRecursive, and s-XML) in terms of RDB storage space and query response time, i.e. query performance. These two evaluations are performed on the XML datasets described in Table 5.1. The effectiveness evaluation results for RDB storage space are presented in subsection 5.3.1, while the effectiveness evaluation results for query response time are presented in subsection 5.3.2. Finally, the scalability evaluation results for RDB storage space and query response time are presented in subsection 5.3.3.


5.3.1 Effectiveness Evaluation: RDB Storage Space

Four experiments are conducted on different sizes of XML dataset (SIGMOD, DBLP1, DBLP2, DBLP3; see Table 5.1). These experiments compare the RDB storage space consumed by the XML mapping algorithms of the XRel, SMX/R, Ying-Cao, XRecursive, and s-XML approaches with that consumed by the proposed XtoDB algorithm of the XAncestor approach. The storage space is calculated after the XML datasets are mapped into their respective RDB schemes. The RDB storage space of a certain approach is calculated using the total number of tuples mapped and by executing the Transact-SQL command (exec sp_spaceused).

a. RDB Storage Space for SIGMOD

First, an experiment is conducted on the smallest dataset, i.e. SIGMOD, which is 467 KB in size (see Table 5.1). Using the respective XML mapping algorithm, the content and structure of the XML document of the SIGMOD dataset are traversed, shredded, and decomposed in the RDB scheme for each approach. In addition, the XML parser is used to extract the values in the document. Table 5.4 shows the results of the RDB storage space evaluation for each approach when applied to the SIGMOD dataset. The table lists the name of each approach, the names of the tables created in the RDB scheme of each approach, the number of tuples in each table of the RDB scheme, the total number of tuples, and the size of the RDB storage space.

Table 5.4 Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the SIGMOD dataset

Approach | RDB Scheme | No. of Tuples | Total No. of Tuples | RDB Storage Space (MB)
(a) | Path, Element, Attribute, Text | 12, 11526, 3737, 8383 | 23658 | 1.84
(b) | Path_Index, Path | 12, 23646 | 23658 | 1.90
(c) | File, Path, InnerNodes, LeafNodes | 1, 7, 15263, 12120 | 27391 | 2.06
(d) | Tag_Structure, Tag_Value | 15263, 12120 | 27383 | 1.77
(e) | ParentTable, ChildTable | 15263, 12120 | 27383 | 2.36
(f) | Ancestor_Path, Leaf_Node | 4, 12120 | 12124 | 1.58

The results in Table 5.4 show that the XtoDB algorithm in the proposed approach (XAncestor) outperforms the other approaches because it maps the XML document of the SIGMOD dataset into its RDB using less storage space. In contrast, the other approaches use more storage space to map the XML document of the SIGMOD dataset into their RDB schemes. More space is used primarily because these approaches store all the information about the inner nodes in their RDB scheme. In addition, some approaches, such as XRel and Ying-Cao, store the document in more than two tables in their RDB scheme. Also, some approaches, such as XRel, SMX/R, and Ying-Cao, store distinct path information from the root to any node or from the root to any leaf. Moreover, some approaches, such as s-XML and XRecursive, contain redundant and dangling columns. Figure 5.3 shows a graphical comparison of the RDB storage space used by the various approaches when they are applied to the SIGMOD dataset.

[Column graph: RDB storage space (MB) for each model mapping approach: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor]

Figure 5.3 Comparison of RDB storage space of approaches using SIGMOD
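The differences reported in Table 5.4 can be expressed as relative savings with a short calculation. The sketch below uses the storage figures from Table 5.4; the helper name saving_percent and the choice of a percentage relative to each existing approach are assumptions of this illustration and are not a metric defined in the thesis.

```python
# RDB storage space in MB for the SIGMOD dataset, taken from Table 5.4.
storage_mb = {
    "XRel": 1.84,
    "SMX/R": 1.90,
    "Ying-Cao": 2.06,
    "XRecursive": 1.77,
    "s-XML": 2.36,
    "XAncestor": 1.58,
}


def saving_percent(baseline_mb, proposed_mb):
    """Storage saved by the proposed approach, as a percentage of the baseline."""
    return 100.0 * (baseline_mb - proposed_mb) / baseline_mb


# Saving of XAncestor relative to each existing approach, rounded to 0.1 %.
savings = {
    name: round(saving_percent(mb, storage_mb["XAncestor"]), 1)
    for name, mb in storage_mb.items()
    if name != "XAncestor"
}
```

On these figures the saving ranges from about 11 % relative to XRecursive to about 33 % relative to s-XML.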


b. RDB Storage Space for Different Sizes of DBLP

Next, experiments are conducted on the three different sizes of the DBLP dataset (DBLP1, DBLP2, and DBLP3; see Table 5.1). The aim of these experiments is to evaluate the RDB storage space of the XRel, SMX/R, Ying-Cao, XRecursive, s-XML, and XAncestor approaches. The results are presented in the following subsections.

i. RDB Storage Space for DBLP1

The dataset used for this experiment is taken from the DBLP website. It is called DBLP1 and has a size of 31.5 MB (see Table 5.1), which is larger than the SIGMOD dataset used in the previous experiment. Table 5.5 shows the results of the RDB storage space evaluation for each approach when applied to the DBLP1 dataset. The table includes the name of each approach, the names of the tables created in the RDB scheme of each approach, the number of tuples in each table of the RDB scheme, the total number of tuples, and the size of the RDB storage space.

Table 5.5 Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the DBLP1 dataset

Approach | RDB Scheme | No. of Tuples | Total No. of Tuples | RDB Storage Space (MB)
(a) | Path, Element, Attribute, Text | 85, 796507, 108626, 715648 | 1620866 | 80.92
(b) | Path_Index, Path | 85, 1620781 | 1620866 | 88.19
(c) | File, Path, InnerNodes, LeafNodes | 1, 78, 905133, 824274 | 1729486 | 88.60
(d) | Tag_Structure, Tag_Value | 905133, 824274 | 1729407 | 70.08
(e) | ParentTable, ChildTable | 905133, 824274 | 1729407 | 110.57
(f) | Ancestor_Path, Leaf_Node | 15, 824274 | 824289 | 46.41

The results in Table 5.5 show that the proposed approach (XAncestor), particularly the XtoDB algorithm, mapped the XML document of the DBLP1 dataset into its RDB using the smallest amount of storage space compared to the other approaches. On the other hand, the s-XML approach used the largest amount of storage space in its RDB scheme to map the document because it stores all the inner nodes with redundant and dangling columns. The Ying-Cao approach also needed a large amount of storage space in its RDB scheme to map the document. This is because its complex RDB scheme, which consists of more than two tables, stores distinct paths for all the leaf nodes from the root and also stores all the information about the inner nodes. The SMX/R approach stores a distinct path for every node in the document and all the information about the elements, attributes, and texts in a single table, which caused the RDB storage space for this approach to also be high. Even though the XRel approach handles distinct paths for every node in the document like SMX/R, it shreds and decomposes the document into four tables in its scheme, i.e. it stores all the information about the inner nodes. Consequently, the amount of storage space required by XRel is large. The XRecursive approach also needed a lot of storage space because it stores all the information about the inner nodes in its RDB scheme. Figure 5.4 shows a graphical comparison of the RDB storage space used by the approaches when applied to the DBLP1 dataset.

[Column graph: RDB storage space (MB) for each model mapping approach: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor]

Figure 5.4 Comparison of RDB storage space of approaches using DBLP1

ii. RDB Storage Space for DBLP2

This experiment is conducted on another dataset taken from the DBLP website. This dataset is called DBLP2 and is 63.4 MB in size (see Table 5.1). It is therefore larger than SIGMOD and DBLP1. Table 5.6 shows the results of the RDB storage space evaluation for each approach when applied to the DBLP2 dataset. The table provides the name of each approach, the names of the tables created in the RDB scheme of each approach, the number of tuples in each table of the RDB scheme, the total number of tuples, and the size of the RDB storage space.

Table 5.6 Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the DBLP2 dataset

Approach | RDB Scheme | No. of Tuples | Total No. of Tuples | RDB Storage Space (MB)
(a) | Path, Element, Attribute, Text | 92, 1604967, 206949, 1442099 | 3254107 | 162.93
(b) | Path_Index, Path | 92, 3254015 | 3254107 | 177.32
(c) | File, Path, InnerNodes, LeafNodes | 1, 85, 1811916, 1649048 | 3461050 | 179.95
(d) | Tag_Structure, Tag_Value | 1811916, 1649048 | 3460964 | 142.09
(e) | ParentTable, ChildTable | 1811916, 1649048 | 3460964 | 224.05
(f) | Ancestor_Path, Leaf_Node | 16, 1649048 | 1649064 | 92.82

According to the results in Table 5.6, the existing approaches use more storage space than the proposed approach (XAncestor) to map the XML document of the DBLP2 dataset into their RDB schemes. This is because these approaches store a lot of information that is simplified in XAncestor. The information simplified in XAncestor includes the information about the inner nodes of the document, which is stored by all the existing approaches. It also includes the path information for all the existing nodes in the document, which is stored by the XRel and SMX/R approaches, and the path information for all the leaf nodes, which is stored by the Ying-Cao approach. Moreover, XAncestor also simplifies the unnecessary information (e.g. dangling tables and columns) that is stored by the s-XML and XRecursive approaches. Therefore, the XAncestor approach, particularly the XtoDB algorithm, is able to use less storage space to map the XML document into its RDB. Figure 5.5 shows a graphical comparison of the RDB storage space consumed by the six approaches when applied to the DBLP2 dataset.

Figure 5.5 Comparison of RDB storage space of approaches using DBLP2 (x-axis: model mapping approaches; y-axis: RDB storage space in MB)

iii. RDB Storage Space for DBLP3

This experiment is conducted on the biggest dataset, which is also taken from the DBLP website. The dataset is called DBLP3 and has a size of 127 MB (see Table 5.1). Table 5.7 shows the results of the RDB storage space evaluation for each approach when applied to the DBLP3 dataset. The table lists the name of each approach, the names of the tables created in the RDB scheme of each approach, the number of tuples in each table of the RDB scheme, the total number of tuples, and the size of the RDB storage space.

Table 5.7 Results of RDB storage space evaluation for (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor approaches based on the DBLP3 dataset

| Approach | RDB Scheme | No. of Tuples | Total No. of Tuples | RDB Storage Space (MB) |
|---|---|---|---|---|
| (a) XRel | Path | 145 | 6742399 | 338.42 |
| | Element | 3332130 | | |
| | Attribute | 404276 | | |
| | Text | 3005848 | | |
| (b) SMX/R | Path_Index | 145 | 6742399 | 368.88 |
| | Path | 6742254 | | |
| (c) Ying-Cao | File | 1 | 7146667 | 372.49 |
| | Path | 136 | | |
| | InnerNodes | 3736406 | | |
| | LeafNodes | 3410124 | | |
| (d) XRecursive | Tag_Structure | 3736406 | 7146530 | 292.95 |
| | Tag_Value | 3410124 | | |
| (e) s-XML | ParentTable | 3736406 | 7146530 | 456.19 |
| | ChildTable | 3410124 | | |
| (f) XAncestor | Ancestor_Path | 29 | 3410153 | 189.92 |
| | Leaf_Node | 3410124 | | |

The results in Table 5.7 show that the existing approaches are less efficient than XAncestor in mapping the XML document of the largest dataset (DBLP3) into their RDB schemes. These approaches need more storage space because they store all the information about the inner nodes. Some of them also store the path information for every node (XRel and SMX/R), the path information for every leaf node (Ying-Cao), or dangling tables and columns (XRecursive and s-XML). In contrast, the XtoDB algorithm of the proposed XAncestor approach used the least storage space (189.92 MB) in its RDB scheme to map this largest document. Figure 5.6 shows a graphical comparison of the RDB storage space needed by the approaches when applied to the DBLP3 dataset.

Figure 5.6 Comparison of RDB storage space of approaches using DBLP3 (x-axis: model mapping approaches; y-axis: RDB storage space in MB)

5.3.2 Effectiveness Evaluation: Query Response Time

Four experiments are conducted on six types of XPath query (see Table 5.2 (SIGMOD) and Table 5.3 (DBLP)) in order to evaluate the time taken by the proposed and existing approaches to execute the corresponding SQL query. Specifically, the query mapping algorithms of the XRel, SMX/R, Ying-Cao, XRecursive, and s-XML approaches are compared with the proposed XtoSQL algorithm of the XAncestor approach on the SIGMOD and DBLP datasets. To recap, these six types of query are a short simple path expression query (Q1), long simple path expression query (Q2), query with one ‘//’ (Q3), query with two ‘//’ (Q4), twig query with ‘/’ (Q5), and twig query with ‘//’ (Q6). Because the execution time of a query can vary from run to run, each query is executed seven times. The first run is discarded because it includes the time needed to calculate the execution plan of the query, which makes it slow, and the average of the remaining runs is calculated in order to get more accurate results (Maghaydah et al. 2010; Maghaydah 2010; Atay 2006). Two Transact-SQL commands (dbcc dropcleanbuffers, dbcc freeproccache) are executed before running each query. These commands clear the buffer pool and the plan cache, so that the measurements are not affected by data or plans cached by previously executed queries. The results for the experiments on SIGMOD and on the three DBLP datasets are presented in the following subsections.
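The measurement protocol described above (seven runs, first run discarded, remaining runs averaged) can be sketched as follows. This is only an illustrative sketch in Python, not the harness used in the experiments; the function names `measure_runs_ms` and `average_after_discard` are hypothetical, and `run_query` stands for any callable that executes one translated SQL query.

```python
import statistics
import time


def measure_runs_ms(run_query, runs=7):
    """Execute run_query() `runs` times and return each elapsed time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical callable that runs one SQL query
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def average_after_discard(timings_ms, discard=1):
    """Drop the first `discard` timings (plan calculation) and average the rest."""
    kept = timings_ms[discard:]
    if not kept:
        raise ValueError("no timings left after discarding")
    return statistics.mean(kept)


if __name__ == "__main__":
    # Made-up timings: a slow first run followed by six steadier runs.
    sample = [900, 100, 110, 90, 100, 105, 95]
    print(average_after_discard(sample))
```

Discarding the first run keeps the one-off plan calculation out of the average, which is the reason given in the text for eliminating it.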

a. Query Response Time for SIGMOD

First, an experiment is conducted on the smallest dataset, SIGMOD (467 KB) in order to evaluate the time taken by the proposed and existing approaches to execute the corresponding SQL query for the six types of XPath query (see Table 5.2 for query description for SIGMOD). Appendix C shows the translated SQL queries and query response time to retrieve each query (Q1–Q6) of the six approaches when they are applied to the SIGMOD dataset. Figure 5.7 compares the six approaches based on the query response time for the Q1–Q6 queries on SIGMOD.

Figure 5.7 Comparison of query response time of approaches using SIGMOD

According to the results obtained (Appendix C and Figure 5.7) for the SIGMOD dataset, the proposed query algorithm (XtoSQL) of the XAncestor approach took the shortest time to execute all six types of query (Q1–Q6) on SIGMOD. This is because XAncestor extracts the ancestor’s path from the XPath query. It then matches it accurately with the ancestor’s path information in the Ancestor_Path table to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table. On the other hand, the SMX/R approach took the longest time to execute the short simple path expression query (Q1), the long simple path expression query (Q2), the query with one ‘//’ (Q3), the query with two ‘//’ (Q4), and the twig query with ‘//’ (Q6). This is because it requires the largest search space to get the PathID in the Path_Index table, which stores distinct path information for every node of the SIGMOD dataset. It also requires the largest search space to match the PathID and find the data in the Path table, which stores all the information about the attributes, inner nodes, and leaf nodes in a single table. In addition, the SMX/R approach also needs more time to handle the containment relationships using θ-join operators. The s-XML approach took the longest time to execute the twig query with ‘/’ (Q5) because it uses the parent ID (LParent column) recursively to form a nested query. This nested query is then executed over the largest RDB storage space of all the approaches on the SIGMOD dataset.

b. Query Response Time for DBLP datasets

Next, experiments are conducted on the DBLP datasets DBLP1, DBLP2, and DBLP3 (see Table 5.1) in order to evaluate the time taken by the six approaches to execute the corresponding SQL query for the six types of XPath query (see Table 5.3 for query description for DBLP). The results for each DBLP dataset are presented in the following subsections.

i. Query Response Time for DBLP1

First, an experiment is conducted on the DBLP1 dataset (31.5 MB). Appendix C shows the translated SQL queries and the query response time to retrieve each query (Q1–Q6) of the six approaches when applied to the DBLP1 dataset. Figure 5.8 compares the six approaches based on the query response time for the Q1–Q6 queries on DBLP1.

Figure 5.8 Comparison of query response time of approaches using DBLP1

According to the results obtained (Appendix C and Figure 5.8) for the DBLP1 dataset, the proposed query algorithm (XtoSQL) of the XAncestor approach outperformed the other approaches by executing all six types of query (Q1–Q6) in less query response time. As stated above, this is because XAncestor extracts the ancestor’s path from the XPath query. It then matches it accurately with the ancestor’s path information in the Ancestor_Path table to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table. On the other hand, XRecursive took the longest time to execute the short and long simple path expression queries (Q1, Q2) and the twig queries with ‘/’ or ‘//’ (Q5, Q6). This is because it uses the parent ID (Pid column) recursively to form a nested query, a method that becomes costly as the document size increases. In the case of the queries containing one or two ‘//’ (Q3, Q4), the SMX/R approach took the longest time to execute because it had to find the PathID in the Path_Index table, which stores distinct path information for every node of the DBLP1 dataset. It then had to match this PathID with the Path table, which stores the whole document. In addition, the SMX/R approach also needed more time to handle the containment relationships using θ-join operators. Therefore, it needed a lot of time to get the data required.
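The cost of the recursive parent-ID method mentioned above can be illustrated with a small sketch. It assumes a simplified parent-pointer layout: the table names Tag_Structure and Tag_Value and the Pid column come from the text, while the other column names (`Id`, `tagName`, `tagId`, `value`), the sample rows, and the function names are hypothetical. The sketch is written in Python with SQLite and is not the actual XRecursive translation; it only shows that one nested sub-query is needed per step of the path.

```python
import sqlite3


def nested_parent_query(path):
    """Build one nested sub-query per step of a simple path such as /a/b/c."""
    steps = path.strip("/").split("/")
    # Innermost level: the root element, which has no parent.
    sql = "SELECT Id FROM Tag_Structure WHERE tagName = ? AND Pid IS NULL"
    for _ in steps[1:]:
        # Each further step wraps the previous level as a filter on the parent ID.
        sql = f"SELECT Id FROM Tag_Structure WHERE tagName = ? AND Pid IN ({sql})"
    sql = f"SELECT value FROM Tag_Value WHERE tagId IN ({sql}) ORDER BY tagId"
    # Placeholders appear outermost-first, so the steps are bound in reverse.
    return sql, list(reversed(steps))


def build_parent_id_db():
    """Create a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tag_Structure (Id INTEGER PRIMARY KEY, Pid INTEGER, tagName TEXT)")
    conn.execute("CREATE TABLE Tag_Value (tagId INTEGER, value TEXT)")
    conn.executemany(
        "INSERT INTO Tag_Structure VALUES (?, ?, ?)",
        [(1, None, "dblp"), (2, 1, "phdthesis"), (3, 2, "title"),
         (4, 1, "article"), (5, 4, "title")],
    )
    conn.executemany(
        "INSERT INTO Tag_Value VALUES (?, ?)",
        [(3, "Thesis A"), (5, "Article B")],
    )
    return conn


def run_path(conn, path):
    sql, params = nested_parent_query(path)
    return [row[0] for row in conn.execute(sql, params)]
```

In this sketch a three-step path such as ‘/dblp/phdthesis/title’ produces three nested sub-queries over the structure table, and each level filters a table that holds every inner node, which is consistent with the explanation that the method becomes costly as the document grows.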

ii. Query Response Time for DBLP2

The same experiment is conducted on the DBLP2 dataset (63.4 MB). Appendix C shows the translated SQL queries and the query response time to retrieve each query (Q1–Q6) of the six approaches when applied to the DBLP2 dataset. Figure 5.9 compares the six approaches based on the query response time for the Q1–Q6 queries on DBLP2.

Figure 5.9 Comparison of query response time of approaches using DBLP2

According to the results obtained (Appendix C and Figure 5.9) for the DBLP2 dataset, the XRecursive approach consumed more time than the other five approaches to execute the short simple path expression query (Q1). This is due to the cost of the nested query that is formed by using the Pid column recursively. In the case of the long simple path expression query (Q2), the query containing one ‘//’ (Q3), the query containing two ‘//’ (Q4), and the twig query with ‘/’ (Q5), the SMX/R approach took the longest time to execute. This is because SMX/R needs more time to find the PathID in the Path_Index table that stores distinct path information for every node of the DBLP2 dataset. It also has to match this PathID with the Path table, which is a single table for storing the whole document. In addition, the SMX/R approach also needs more time to handle the containment relationships using θ-join operators. Thus, the approach takes a long time to get the data. In the case of the twig query with ‘//’ (Q6), s-XML consumed the most time because it uses the parent ID (LParent column) recursively to form a nested query, which executes over a large amount of storage space in its RDB scheme. On the other hand, the proposed query algorithm (XtoSQL) of the XAncestor approach outperformed the other approaches by executing all the queries (Q1–Q6) in less query response time. The reason for this result is that XAncestor extracts the ancestor’s path from the XPath query and then matches it accurately with the ancestor’s path information in the Ancestor_Path table to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table.

iii. Query Response Time for DBLP3

The same experiment is also conducted on the DBLP3 dataset, the largest tested (127 MB), to evaluate the query response time of the six approaches. Appendix C shows the translated SQL queries and the query response time to retrieve each query (Q1–Q6) of the six approaches when applied to the DBLP3 dataset. Figure 5.10 compares the six approaches based on the query response time for the Q1–Q6 queries on DBLP3.

Figure 5.10 Comparison of query response time of approaches using DBLP3

According to the results obtained (Appendix C and Figure 5.10) for the DBLP3 dataset, the XRecursive approach needed the most time to execute the short simple path expression query (Q1) and the twig query with ‘//’ (Q6). This is because the nested SQL queries are formed by using the parent ID (Pid column) recursively, which is a costly method for answering queries in an RDB. In the case of the long simple path expression query (Q2), the query containing one ‘//’ (Q3), and the twig query with ‘/’ (Q5), the SMX/R approach took the longest time to execute because it requires more time to find the PathID in the Path_Index table that stores distinct path information for every node of the DBLP3 dataset. It also needs time to match this PathID with a single table that stores the whole document (Path table). In addition, the SMX/R approach also needs more time to handle the containment relationships using θ-join operators. Thus, overall the approach consumes a lot of time to get data. In the case of the query containing two ‘//’ (Q4), s-XML took the longest time to execute because it uses the parent ID (LParent column) recursively to form a nested query over the largest RDB storage space. Therefore, these approaches do not scale well to larger datasets. In contrast, the proposed query algorithm (XtoSQL) of the XAncestor approach outperformed the other approaches by executing all the queries (Q1–Q6) in a shorter query response time. As stated above, this is because XAncestor extracts the ancestor path from the XPath query and matches it with the ancestor path information in the Ancestor_Path table to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table.

5.3.3 Scalability Evaluation

In this section, the scalability of the RDB storage space and of the query response time (i.e. query performance) of the six approaches is evaluated based on the results of the experiments described above. Scalability is assessed by examining how each measure grows as the XML document size increases. In the following subsections, the scalability of the RDB storage space is discussed, followed by the scalability of the query response time.

a. Scalability of RDB Storage Space

In this section, the results of the previous experiments on RDB storage space are used to compare the performance of the five existing approaches (XRel, SMX/R, Ying-Cao, XRecursive, and s-XML) with that of the proposed approach XAncestor, and particularly the XtoDB mapping algorithm. This comparison is made to assess the scalability of the RDB storage space that these approaches consume after mapping XML datasets of various sizes. To recap, the approaches are applied to the following XML datasets: SIGMOD with a size of 467 KB, DBLP1 with a size of 31.5 MB, DBLP2 with a size of 63.4 MB, and DBLP3 with a size of 127 MB (see Table 5.1). Table 5.8 shows the RDB storage space consumed by the six approaches when they are applied to the four different XML dataset sizes. Figure 5.11 and Figure 5.12 compare the six approaches in terms of the scalability of the RDB storage space for the various XML dataset sizes by using a column graph and a linear graph, respectively.

Table 5.8 RDB storage space of the six approaches for various XML dataset sizes (MB)

| Approach | SIGMOD (467 KB) | DBLP1 (31.5 MB) | DBLP2 (63.4 MB) | DBLP3 (127 MB) |
|---|---|---|---|---|
| XRel | 1.84 | 80.92 | 162.93 | 338.42 |
| SMX/R | 1.90 | 88.19 | 177.32 | 368.88 |
| Ying-Cao | 2.06 | 88.60 | 179.95 | 372.49 |
| XRecursive | 1.77 | 70.08 | 142.09 | 292.95 |
| s-XML | 2.36 | 110.57 | 224.05 | 456.19 |
| XAncestor | 1.58 | 46.41 | 92.82 | 189.92 |

According to the results obtained (Table 5.8, Figure 5.11, and Figure 5.12), the proposed algorithm (XtoDB) of the XAncestor approach mapped the XML datasets to its RDB by using less storage space compared to the other approaches. It showed better performance due to the simplicity of its RDB scheme that contains two tables. One table stores the distinct ancestor path information for only the leaf nodes instead of all the path information about the inner nodes, while the other table stores only the information about the leaf nodes. Therefore, the scalability of the RDB storage space of XAncestor is more efficient compared to the other approaches when the size of the XML dataset increases.
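The scalability claim can be checked with simple arithmetic on the values reported in Table 5.8. The sketch below, written in Python, divides the RDB storage space by the XML dataset size for the three DBLP datasets and computes the relative saving of one approach over another. The function names `expansion_ratio` and `saving_over` are hypothetical helpers introduced for illustration; the numbers are copied from Table 5.8.

```python
# XML dataset sizes (MB) and RDB storage space (MB), copied from Table 5.8.
XML_SIZE_MB = {"DBLP1": 31.5, "DBLP2": 63.4, "DBLP3": 127.0}

STORAGE_MB = {
    "XRel": {"DBLP1": 80.92, "DBLP2": 162.93, "DBLP3": 338.42},
    "SMX/R": {"DBLP1": 88.19, "DBLP2": 177.32, "DBLP3": 368.88},
    "Ying-Cao": {"DBLP1": 88.60, "DBLP2": 179.95, "DBLP3": 372.49},
    "XRecursive": {"DBLP1": 70.08, "DBLP2": 142.09, "DBLP3": 292.95},
    "s-XML": {"DBLP1": 110.57, "DBLP2": 224.05, "DBLP3": 456.19},
    "XAncestor": {"DBLP1": 46.41, "DBLP2": 92.82, "DBLP3": 189.92},
}


def expansion_ratio(approach, dataset):
    """RDB storage space needed per MB of XML input."""
    return STORAGE_MB[approach][dataset] / XML_SIZE_MB[dataset]


def saving_over(approach, baseline, dataset):
    """Fraction of the baseline's storage space that `approach` saves."""
    return 1.0 - STORAGE_MB[approach][dataset] / STORAGE_MB[baseline][dataset]


if __name__ == "__main__":
    for name in STORAGE_MB:
        ratios = [round(expansion_ratio(name, d), 2) for d in XML_SIZE_MB]
        print(name, ratios)
```

With these figures, XAncestor needs slightly less than 1.5 MB of RDB storage per MB of XML on all three DBLP datasets, whereas s-XML needs between 3.5 and 3.6 MB, and on DBLP3 XAncestor uses roughly 35% less storage than XRecursive, the most compact of the existing approaches.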

On the other hand, the s-XML approach consumed the largest amount of storage space in its RDB scheme to map the XML datasets because it stores all the inner nodes and the redundant and dangling columns. Hence, the scalability of its RDB storage space is the least efficient of all the approaches when the size of the XML dataset increases.

Figure 5.11 Comparison of six approaches in terms of scalability of RDB storage space for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: RDB storage space in MB)

The Ying-Cao approach is inefficient compared to the XRel, SMX/R, XRecursive, and XAncestor approaches in terms of the scalability of the RDB scheme because it has a complex RDB scheme (more than two tables), stores distinct paths for all leaf nodes from the root, and stores all the information about the inner nodes. On the other hand, the SMX/R approach is inefficient compared to the XRel, XRecursive, and XAncestor approaches because it stores a distinct path for every node in the document and all the information about the elements, attributes, and texts in a single table. Moreover, the scalability of the RDB storage space of the XRel approach is also inefficient compared to the XRecursive and XAncestor approaches because it stores distinct path information for every node in the document, shreds and decomposes the document into four tables in its RDB scheme, making it complex, and stores all the information about the inner nodes.

Finally, even though the XRecursive approach outperformed the other existing approaches (XRel, SMX/R, Ying-Cao, and s-XML) in terms of the scalability of the RDB scheme, the approach is still inefficient compared to the proposed approach (XAncestor) because it stores all the information about the inner nodes in its RDB scheme.

Figure 5.12 Comparison of six approaches in terms of scalability of RDB storage space for various XML dataset sizes using linear graph (x-axis: XML dataset size; y-axis: RDB storage space in MB)

b. Scalability of Query Response Time

In this section, the results of the previous experiments on query response time are used to compare the performance of the five existing approaches (XRel, SMX/R, Ying-Cao, XRecursive, and s-XML) with that of the proposed approach XAncestor, and particularly the XtoSQL mapping algorithm. This comparison is made to assess the scalability of the query response time of these approaches, i.e. in terms of the time taken to execute the corresponding SQL queries after translating them using the query mapping algorithms. The XML datasets that the approaches are applied to are the same as reported above, namely SIGMOD with a size of 467 KB, DBLP1 with a size of 31.5 MB, DBLP2 with a size of 63.4 MB, and DBLP3 with a size of 127 MB. For clarity, the results and graphs for each type of XPath query are presented separately in the following subsections.

i. Short Simple Path Expression Query (Q1)

This type of XPath query evaluates the short parent-child relationships of the nodes in the document (see Q1 in Table 5.2 and Table 5.3). Table 5.9 shows the time taken by the six approaches to execute the corresponding Q1, i.e. the query response time, for the four XML dataset sizes. Figure 5.13 and Figure 5.14 compare the six approaches in terms of the scalability of the query response time for Q1 for the various XML dataset sizes by using a column graph and a linear graph, respectively.

Table 5.9 Query response time (ms) of the six approaches for executing the corresponding SQL query of the XPath query (Q1) for various XML dataset sizes

| Approach | SIGMOD (467 KB) | DBLP1 (31.5 MB) | DBLP2 (63.4 MB) | DBLP3 (127 MB) |
|---|---|---|---|---|
| XRel | 1026 | 32426 | 64901 | 135182 |
| SMX/R | 2011 | 69152 | 139158 | 288184 |
| Ying-Cao | 191 | 6006 | 11994 | 23904 |
| XRecursive | 689 | 401856 | 797005 | 1657830 |
| s-XML | 567 | 17934 | 35680 | 72274 |
| XAncestor | 171 | 5348 | 10451 | 21269 |

Figure 5.13 Comparison of six approaches in terms of scalability of query response times for query (Q1) for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: query response time in ms, logarithmic scale)

According to the results obtained (Table 5.9, Figure 5.13, and Figure 5.14), the SQL query produced by the proposed algorithm (XtoSQL) of the XAncestor approach for the XPath query (Q1) had a shorter query response time than those of the other approaches, particularly when the size of the XML datasets increased. This is because the XAncestor approach first extracts the ancestor’s path expression from ‘/SigmodRecord/issue/volume’ in the SIGMOD dataset and from ‘/dblp/phdthesis/title’ in the DBLP datasets. It then matches each one accurately with its ancestor’s path information in the Ancestor_Path table using the ‘=’ operator to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table. In addition, its RDB scheme is simple: it contains only two tables, one storing the distinct ancestor path information for the leaf nodes only, instead of all the path information about the inner nodes, and the other storing only the information about the leaf nodes. This leads to a decrease in the search space inside the Ancestor_Path and Leaf_Node tables to get the Ances_pathID and the data, respectively.

On the other hand, the XRecursive approach took the longest time compared to the other approaches to execute the corresponding SQL query for Q1 on the large XML datasets DBLP1, DBLP2, and DBLP3. This is because it uses the parent ID (Pid column) recursively to form a nested query for its RDB scheme that stores all the information about the inner and leaf nodes.

As for the SMX/R approach, even though it uses a path-based technique, this approach took a longer time than the XRel, s-XML, Ying-Cao, and XAncestor approaches because it requires more time not only to handle the containment relationships using θ-join operators, but also to find the PathID in the Path_Index table that stores distinct path information for every node of the XML dataset. It also needs time to match the PathID with a single table that stores the whole document (Path table), particularly for large XML datasets like DBLP1, DBLP2, and DBLP3.

The XRel approach took a longer time than the s-XML, Ying-Cao, and XAncestor approaches to execute the corresponding SQL query for Q1. This is because it requires more time to find the PathID in the Path table that stores distinct path information for every node of the XML dataset and also to match the PathID with the other three tables (Element, Attribute, Text). It also needs time to handle the containment relationships using θ-join operators.

The s-XML approach took a longer time than the Ying-Cao and XAncestor approaches because it uses the parent ID (LParent column) recursively to form a nested query over the largest storage space of all the RDB schemes. This increases the search space it needs to cover to retrieve the data.

Even though the Ying-Cao approach had a shorter query response time than the other existing approaches when executing the corresponding SQL query for Q1, it still took a longer time than the XAncestor approach to get the data. This is because it requires more time to find the PathID in the Path table that stores distinct path information for all the leaf nodes of the XML dataset and to match it with the InnerNodes and LeafNodes tables that store all the inner and leaf nodes of the document, respectively.
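The two-step lookup described above for simple path queries can be sketched as follows. This is a minimal illustration in Python with SQLite rather than the XtoSQL algorithm itself. The table names Ancestor_Path and Leaf_Node and the column Ances_pathID come from the text; the remaining column names (`Ances_path`, `name`, `value`), the sample rows, and the assumption that the ancestor path is everything before the last step of the XPath expression are hypothetical.

```python
import sqlite3


def split_xpath(xpath):
    """Split a simple path into (ancestor path, leaf name) at the last '/'."""
    ancestor, _, leaf = xpath.rpartition("/")
    return ancestor, leaf


def build_demo_db():
    """Create a tiny in-memory two-table scheme with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Ancestor_Path (Ances_pathID INTEGER PRIMARY KEY, Ances_path TEXT)")
    conn.execute("CREATE TABLE Leaf_Node (Ances_pathID INTEGER, name TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO Ancestor_Path VALUES (?, ?)",
        [(1, "/SigmodRecord/issue"),
         (2, "/SigmodRecord/issue/articles/article")],
    )
    conn.executemany(
        "INSERT INTO Leaf_Node VALUES (?, ?, ?)",
        [(1, "volume", "11"), (1, "number", "1"),
         (2, "title", "A Title"), (1, "volume", "12")],
    )
    return conn


def lookup(conn, xpath):
    """Match the ancestor path with '=' and fetch the matching leaf values."""
    ancestor, leaf = split_xpath(xpath)
    sql = """
        SELECT l.value
        FROM Leaf_Node AS l
        JOIN Ancestor_Path AS a ON a.Ances_pathID = l.Ances_pathID
        WHERE a.Ances_path = ? AND l.name = ?
        ORDER BY l.rowid
    """
    return [row[0] for row in conn.execute(sql, (ancestor, leaf))]
```

Because the path table in this layout holds one row per distinct ancestor path (16 rows for DBLP2 and 29 for DBLP3 according to Tables 5.6 and 5.7), the equality match touches very few rows before the leaf table is consulted.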

Figure 5.14 Comparison of six approaches in terms of scalability of query response time for query (Q1) for various XML dataset sizes using linear graph

ii. Long Simple Path Expression Query (Q2)

This type of XPath query evaluates the long parent-child relationships of the nodes in the document (see Q2 in Table 5.2 and Table 5.3). Table 5.10 shows the time taken by the six approaches to execute the corresponding Q2 when applied to the XML datasets. Figure 5.15 and Figure 5.16 compare the six approaches in terms of the scalability of the query response time for Q2 for the various XML dataset sizes by using a column graph and a linear graph, respectively.

Table 5.10 Query response time (ms) of the six approaches for executing the corresponding SQL query of XPath query (Q2) for various XML dataset sizes

| Approach | SIGMOD (467 KB) | DBLP1 (31.5 MB) | DBLP2 (63.4 MB) | DBLP3 (127 MB) |
|---|---|---|---|---|
| XRel | 34982 | 33579 | 68832 | 148220 |
| SMX/R | 69779 | 72373 | 148550 | 312463 |
| Ying-Cao | 486 | 5930 | 11834 | 23984 |
| XRecursive | 4385 | 94284 | 89419 | 164378 |
| s-XML | 2768 | 29192 | 58643 | 121451 |
| XAncestor | 477 | 4560 | 9252 | 18549 |

Figure 5.15 Comparison of six approaches in terms of scalability of query response time for query (Q2) for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: query response time in ms, logarithmic scale)

According to the results obtained (Table 5.10, Figure 5.15, and Figure 5.16), the SQL query produced by the proposed algorithm (XtoSQL) of the XAncestor approach for the XPath query (Q2) had a shorter query response time than those of the other approaches, particularly when the XML datasets increased in size. This is because the XAncestor approach first extracts the ancestor’s path expression from ‘/SigmodRecord/issue/articles/article/authors/author’ in the SIGMOD dataset and ‘/dblp/inproceedings/title/sub/sup’ in the DBLP datasets and matches each one accurately with its ancestor’s path information in the Ancestor_Path table using the ‘=’ operator to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table. In addition, its simple RDB scheme contains only two tables: one stores the distinct ancestor path information for the leaf nodes only, instead of all the path information about the inner nodes, and the other stores only the information about the leaf nodes. This results in a reduction of the search space inside the Ancestor_Path and Leaf_Node tables to get the Ances_pathID and the data, respectively.

Even though it uses a path-based technique, the SMX/R approach took the longest time to execute the corresponding SQL query for Q2, particularly for SIGMOD, DBLP2, and DBLP3. This is because it requires more time to handle the containment relationships using θ-join operators and to find the PathID in the Path_Index table that stores distinct path information for every node of the XML dataset. It also consumes time to match the PathID with a single table that stores the whole document.

As for the XRecursive approach, it took a longer time than the XRel, s-XML, Ying-Cao, and XAncestor approaches to execute the corresponding SQL query for Q2, particularly for large XML datasets like DBLP2 and DBLP3. This is because it uses the parent ID (Pid column) recursively to form a nested query for its RDB scheme that stores all the information about the inner and leaf nodes.

The XRel approach took a longer time than the s-XML, Ying-Cao, and XAncestor approaches to execute the corresponding SQL query for Q2 because it requires more time to find the PathID in the Path table that stores distinct path information for every node of the XML dataset. It also consumes time to match the PathID with the other three tables (Element, Attribute, Text) and to handle the containment relationships using θ-join operators.

The s-XML approach also took a longer time than the Ying-Cao and XAncestor approaches because it uses the parent ID (LParent column) recursively to form a nested query over the largest storage space of all the RDB schemes, which increases the search space required to retrieve the data.

Even though the Ying-Cao approach used less query response time than the other existing approaches in executing the corresponding SQL query for Q2, it still took a longer time than the XAncestor approach. This is because it requires more time to find the PathID in the Path table, which stores distinct path information for all the leaf nodes of the XML dataset, and to match it with the InnerNodes and LeafNodes tables, which store all the inner and leaf nodes of the document.

Figure 5.16 Comparison of six approaches in terms of scalability of query response time for query (Q2) for various XML dataset sizes using linear graph (x-axis: XML dataset size; y-axis: query response time in ms, logarithmic scale)

iii. Query with One ‘//’ (Q3)

This type of XPath query evaluates one ancestor-descendant relationship of the nodes in the document (see Q3 in Table 5.2 and Table 5.3). Table 5.11 shows the time taken by the six approaches to execute the corresponding Q3 for the four selected XML datasets. Figure 5.17 and Figure 5.18 compare the six approaches in terms of the scalability of the query response times for Q3 for the various XML dataset sizes by using a column graph and a linear graph, respectively.

Table 5.11 Query response time (ms) of the six approaches to execute the corresponding SQL query of XPath query (Q3) for various XML dataset sizes

| Approach | SIGMOD (467 KB) | DBLP1 (31.5 MB) | DBLP2 (63.4 MB) | DBLP3 (127 MB) |
|---|---|---|---|---|
| XRel | 16713 | 32132 | 64690 | 136738 |
| SMX/R | 26591 | 68603 | 138861 | 287620 |
| Ying-Cao | 394 | 6175 | 11843 | 23875 |
| XRecursive | 1313 | 18593 | 36933 | 76592 |
| s-XML | 1405 | 18655 | 37130 | 76311 |
| XAncestor | 377 | 6136 | 11779 | 23831 |

Figure 5.17 Comparison of six approaches in terms of scalability of query response time for query (Q3) for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: query response time in ms, logarithmic scale)

According to the results obtained (Table 5.11, Figure 5.17, and Figure 5.18), the existing approaches also took more time to execute the corresponding SQL query for Q3 when the XML datasets increased in size. Even though they use a path-based technique, the SMX/R and XRel approaches took longer than the other approaches to execute the corresponding SQL query for Q3. This is because they both require more time to handle the containment relationships using θ-join operators and also to find the PathID in their tables that store distinct path information for every node of the XML dataset. They also take time to match the PathID with their tables that store all the inner and leaf nodes of the document to get data from the various XML datasets.
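To illustrate why containment relationships add cost, the following sketch shows a generic region-based containment test expressed as a θ-join (a join whose condition uses inequality comparisons instead of equality). It is written in Python with SQLite. The table `node`, its columns `start_pos` and `end_pos`, and the sample rows are hypothetical and are not the actual XRel or SMX/R schemes; the sketch only assumes that each node records the start and end positions of its region in the document.

```python
import sqlite3


def build_region_db():
    """Create a tiny in-memory table of nodes with made-up region positions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, tag TEXT, "
        "start_pos INTEGER, end_pos INTEGER)"
    )
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?, ?)",
        [(1, "dblp", 0, 100),
         (2, "mastersthesis", 1, 30),
         (3, "author", 2, 10),
         (4, "article", 31, 60),
         (5, "author", 32, 40)],
    )
    return conn


def descendants(conn, ancestor_tag, descendant_tag):
    """Return ids of descendant_tag nodes whose region lies inside an ancestor_tag node."""
    sql = """
        SELECT d.id
        FROM node AS a
        JOIN node AS d
          ON a.start_pos < d.start_pos AND d.end_pos < a.end_pos
        WHERE a.tag = ? AND d.tag = ?
        ORDER BY d.id
    """
    return [row[0] for row in conn.execute(sql, (ancestor_tag, descendant_tag))]
```

A join condition built from ‘<’ comparisons generally cannot be answered with a simple equality lookup, so the database may have to compare many pairs of rows, which is in line with the additional time attributed to the θ-join operators in the text.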

The s-XML and XRecursive approaches took a longer time than the Ying-Cao and XAncestor approaches to execute the corresponding SQL query for Q3 because they use the parent IDs (LParent and Pid columns) recursively to form a nested query over a large amount of storage space in their RDB schemes. This increases the search space that needs to be covered to retrieve the data.

Even though the Ying-Cao approach took less query response time than the other existing approaches in executing the corresponding SQL query for Q3, it still took a little longer than the XAncestor approach. This is because it requires more time to find the PathID in the Path table, which stores distinct path information for all the leaf nodes of the XML dataset. The approach also consumes time to match the PathID with the InnerNodes and LeafNodes tables that store all the inner and leaf nodes of the document.

Thus, the SQL query produced by the proposed algorithm (XtoSQL) of the XAncestor approach for the XPath query (Q3) had a shorter query response time than those of the other approaches, particularly when the size of the XML datasets increased. This is because the XAncestor approach first extracts the ancestor’s path expression from ‘//article/title’ in the SIGMOD dataset and from ‘//mastersthesis/author’ in the DBLP datasets. It then matches each one accurately with its ancestor’s path information in the Ancestor_Path table using the ‘Like’ operator after ‘//’ is replaced by ‘%’ in order to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table. In addition, its simple two-table RDB scheme stores distinct ancestor path information for the leaf nodes only, instead of all the path information about the inner nodes. By storing only the information about the leaf nodes, the XAncestor approach decreases the search space inside the Ancestor_Path and Leaf_Node tables to get the Ances_pathID and the data, respectively.
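The replacement of ‘//’ by ‘%’ described above can be sketched as follows. This is one possible reading of the step, written in Python with SQLite, and not the XtoSQL algorithm itself. The table names and Ances_pathID come from the text; the function names, the other column names, the sample rows, and the rule used here for separating the leaf name from the ancestor pattern are assumptions made for illustration.

```python
import sqlite3


def to_like_pattern(xpath):
    """Replace '//' by '%' and split the result into (ancestor pattern, leaf name)."""
    pattern = xpath.replace("//", "%")
    cut = max(pattern.rfind("/"), pattern.rfind("%"))
    leaf = pattern[cut + 1:]
    # Keep a trailing '%' when the leaf itself was reached through '//'.
    ancestor = pattern[:cut] if pattern[cut] == "/" else pattern[:cut + 1]
    return ancestor, leaf


def build_like_db():
    """Create a tiny in-memory two-table scheme with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Ancestor_Path (Ances_pathID INTEGER PRIMARY KEY, Ances_path TEXT)")
    conn.execute("CREATE TABLE Leaf_Node (Ances_pathID INTEGER, name TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO Ancestor_Path VALUES (?, ?)",
        [(1, "/SigmodRecord/issue"),
         (2, "/SigmodRecord/issue/articles/article"),
         (3, "/SigmodRecord/issue/articles/article/authors")],
    )
    conn.executemany(
        "INSERT INTO Leaf_Node VALUES (?, ?, ?)",
        [(1, "volume", "11"), (2, "title", "A Title"), (3, "author", "An Author")],
    )
    return conn


def lookup_like(conn, xpath):
    """Match the ancestor pattern with LIKE and fetch the matching leaf values."""
    ancestor, leaf = to_like_pattern(xpath)
    sql = """
        SELECT l.value
        FROM Leaf_Node AS l
        JOIN Ancestor_Path AS a ON a.Ances_pathID = l.Ances_pathID
        WHERE a.Ances_path LIKE ? AND l.name = ?
        ORDER BY l.rowid
    """
    return [row[0] for row in conn.execute(sql, (ancestor, leaf))]
```

In this sketch ‘//article/title’ yields the pattern ‘%article’ and ‘//articles//author’ yields ‘%articles%’. A pattern such as ‘%article’ would also match a path ending in a longer element name with the same suffix; whether the original algorithm guards against this is not stated in the text.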

Figure 5.18 Comparison of six approaches in terms of scalability of query response time for query (Q3) for various XML dataset sizes using linear graph

iv. Query with Two ‘//’ (Q4)

This type of XPath query evaluates two ancestor-descendant relationships of the nodes in the document (see Q4 in Table 5.2 and Table 5.3). Table 5.12 shows the time taken by the six approaches to execute the corresponding Q4 for the various XML datasets. Figure 5.19 and Figure 5.20 compare the six approaches in terms of the scalability of the query response time for Q4 for the various XML dataset sizes by using a column graph and a linear graph, respectively.

Table 5.12 Query response time (ms) of the six approaches for executing the corresponding SQL query for XPath query (Q4) for various XML dataset sizes

Approach      SIGMOD (467 KB)   DBLP1 (31.5 MB)   DBLP2 (63.4 MB)   DBLP3 (127 MB)
XRel          1604              34536             68942             161366
SMX/R         71122             72962             148048            316621
Ying-Cao      499               5989              11864             23992
XRecursive    2763              24377             140556            314096
s-XML         2813              53270             143432            388088
XAncestor     474               4529              9044              18319
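To make the relative differences in Table 5.12 easier to read, the short Python sketch below divides each approach's response time by that of XAncestor for the same dataset. The figures are copied from the table; the function name slowdown_vs and the dictionary layout are illustrative assumptions, and the ratios are simple arithmetic rather than additional measurements.

```python
# Response times (ms) copied from Table 5.12,
# ordered SIGMOD, DBLP1, DBLP2, DBLP3.
q4_ms = {
    "XRel": [1604, 34536, 68942, 161366],
    "SMX/R": [71122, 72962, 148048, 316621],
    "Ying-Cao": [499, 5989, 11864, 23992],
    "XRecursive": [2763, 24377, 140556, 314096],
    "s-XML": [2813, 53270, 143432, 388088],
    "XAncestor": [474, 4529, 9044, 18319],
}


def slowdown_vs(baseline, times):
    """Return, per approach, its time divided by the baseline's time per dataset."""
    base = times[baseline]
    return {
        name: [t / b for t, b in zip(values, base)]
        for name, values in times.items()
        if name != baseline
    }


ratios = slowdown_vs("XAncestor", q4_ms)
for name, values in ratios.items():
    print(name, [round(v, 2) for v in values])
```

Every ratio is greater than 1, which is consistent with the statement that XAncestor had the lowest response time for Q4 on all four datasets.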

Figure 5.19 Comparison of six approaches in terms of scalability of query response time for query (Q4) for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: query response time (ms), logarithmic scale; series: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor)

According to the results obtained (Table 5.12, Figure 5.19, and Figure 5.20), the proposed algorithm (XtoSQL) of the XAncestor approach translated XPath query (Q4) into its corresponding SQL query with a lower query response time than the other approaches, particularly when the XML datasets became large. This is because the XAncestor approach first extracts the ancestor’s path expression from ‘//articles//author’ in the SIGMOD dataset and from ‘//article//tt’ in the DBLP datasets and matches each one with its ancestor’s path information in the Ancestor_Path table using the ‘Like’ operator after ‘//’ is replaced by ‘%’ in order to get its Ances_pathID, which is used to look for the data stored in the Leaf_Node table. In addition, its simple two-table RDB scheme stores the distinct ancestor path information for only the leaf nodes instead of all the path information about the inner nodes. Storing only the information about leaf nodes decreases the search space inside the Ancestor_Path and Leaf_Node tables when retrieving the Ances_pathID and the data, respectively.

The existing approaches took more time to execute the corresponding SQL query for Q4 when the XML dataset size increased. Even though the SMX/R approach uses a path-based technique, it consumed the most time compared to the other approaches because it requires time to handle the containment relationships using ϴ-join operators. It also takes time to find the PathID in its table that stores distinct path information for every node of the XML dataset and to match the PathID with its single table (Path table) that stores all the inner and leaf nodes of the document. When applied in turn to the SIGMOD, DBLP1, and DBLP2 datasets, it required more and more time. The s-XML and XRecursive approaches took longer than the XRel, Ying-Cao, and XAncestor approaches to execute the corresponding SQL query for Q4 because both approaches use the parent ID columns recursively to form a nested query over the large amount of storage space in their RDB schemes. This increases the search space they cover to retrieve the data, particularly in the case of the largest XML dataset (DBLP3). Moreover, the XRel approach consumed more time than the Ying-Cao and XAncestor approaches for the same reasons as those cited for the SMX/R approach, except that XRel stores the whole document in four tables (Path, Element, Attribute, and Text) based on the node types, which reduces the search space inside these tables compared to SMX/R.

Figure 5.20 Comparison of six approaches in terms of scalability of query response time for query (Q4) for various XML dataset sizes using linear graph (x-axis: XML dataset size; y-axis: query response time (ms), logarithmic scale; series: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor)

Even though Ying-Cao took less time than the other existing approaches to execute the corresponding SQL query for Q4, it still took longer than the XAncestor approach to get the data. This is because it requires more time to find the PathID in the Path table, which stores distinct path information for all the leaf nodes of the XML dataset, and to match it with the InnerNodes and LeafNodes tables, which store all the inner and leaf nodes of the document.

v. Twig Query with ‘/’ (Q5)

This type of XPath query evaluates the branching query associated with the parent-child relationships of the nodes in the document, and particularly the nodes that share the same parent (see Q5 in Table 5.2 and Table 5.3). Table 5.13 shows the time taken by the six approaches to execute the corresponding Q5 when applied to the various XML datasets. Figure 5.21 and Figure 5.22 compare the scalability of the six approaches by using a column graph and a linear graph, respectively.

Table 5.13 Query response time (ms) of the six approaches for executing the corresponding SQL query for XPath query (Q5) for various XML dataset sizes

Approach      SIGMOD (467 KB)   DBLP1 (31.5 MB)   DBLP2 (63.4 MB)   DBLP3 (127 MB)
XRel          813               56846             72701             151318
SMX/R         1209              76877             155889            325039
Ying-Cao      954               26010             52324             105557
XRecursive    1316              79941             152034            316401
s-XML         2173              51223             102235            207490
XAncestor     296               10759             21189             43623

Figure 5.21 Comparison of six approaches in terms of scalability of query response time for query (Q5) for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: query response time (ms), logarithmic scale; series: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor)

According to the results obtained (Table 5.13, Figure 5.21, and Figure 5.22), the existing approaches also took more time to execute the corresponding SQL query for Q5 when the size of the XML dataset increased. The SMX/R and XRecursive approaches took the longest time compared to the other approaches, particularly when the XML dataset became bigger. The SMX/R approach took a long time to execute because it extracts three sub-paths from Q5: a path for the branching node, a path for the node that has the value, and a path for the node whose value is to be retrieved. It then matches these sub-paths with the path information in the Path_Index table using the ‘=’ operator in order to get their PathIDs, which are used to look up the value of one node by means of the other node in a single table (Path). In addition, the SMX/R approach handles the containment relationships between these nodes using ϴ-join operators. The XRecursive approach took a long time to execute because it generates excessive nested queries by using the parent ID (Pid column) recursively.

Like XRecursive, the s-XML approach took longer than XRel, Ying-Cao, and XAncestor to execute its corresponding SQL query for Q5 when the XML datasets increased in size because it needed to generate nested queries using the parent ID (LParent column) recursively in the largest storage space of its RDB scheme. The XRel approach also consumed more time than Ying-Cao and XAncestor to execute the corresponding SQL query for Q5. It did so for the same reasons as those cited above for SMX/R, except that XRel stores the whole document in four tables (Path, Element, Attribute, and Text) based on the node types, which reduces the search space inside these tables compared to SMX/R.

The Ying-Cao approach consumed more time than XAncestor to execute its corresponding SQL query for Q5 when the XML datasets were larger. This is because it extracts two sub-paths from Q5 (a path for the node that has the value and a path for the node whose value is to be retrieved). It then matches these sub-paths with the path information in the Path table using the ‘=’ operator in order to get their PathIDs, which are used to look up the value of one node by means of the other node in the LeafNodes table. Also, in order to retrieve accurate data, Ying-Cao joins the two records that are retrieved from the LeafNodes table using their ParentID in the InnerNodes table.

Thus, the proposed algorithm (XtoSQL) of the XAncestor approach translated XPath query (Q5) into its corresponding SQL query with a lower query response time than the other approaches, particularly when the XML datasets became bigger. This is because the XAncestor approach extracts only one ancestor’s path expression for the two nodes from ‘/SigmodRecord/issue/articles/article [title = ‘Database Directions III Workshop Review.’]/endPage’ in the SIGMOD dataset and ‘/dblp/mastersthesis [year = 1992]/title’ in the DBLP datasets. Then, it matches each one with its ancestor’s path information in the Ancestor_Path table using the ‘=’ operator. It does this in order to get its Ances_pathID, which is used to look up the value of one node by means of the other node in the Leaf_Node table, since both nodes share the same ancestor PathID and ancestor position.
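The twig evaluation described above can be sketched as a self-join on the Leaf_Node table, in which the condition node and the requested node must share the same Ances_PathID and Ances_Pos. The following Python and SQLite example is only an illustration under stated assumptions: the SQL text, the function name twig_query, the endPage values, and the second title are hypothetical and are not taken from the XtoSQL output.

```python
import sqlite3

# One '=' match on the ancestor path, then a self-join of Leaf_Node on the
# shared ancestor path ID and ancestor position (i.e. the same parent).
TWIG_SQL = """
SELECT wanted.Node_Value
FROM Ancestor_Path AS a
JOIN Leaf_Node AS cond   ON cond.Ances_PathID = a.Ances_PathID
JOIN Leaf_Node AS wanted ON wanted.Ances_PathID = cond.Ances_PathID
                        AND wanted.Ances_Pos = cond.Ances_Pos
WHERE a.Ances_PathExp = ?
  AND cond.Node_Name = ? AND cond.Node_Value = ?
  AND wanted.Node_Name = ?
ORDER BY wanted.rowid
"""


def twig_query(conn, ancestor_path, cond_name, cond_value, wanted_name):
    """Return values of wanted_name whose sibling cond_name equals cond_value."""
    rows = conn.execute(
        TWIG_SQL, (ancestor_path, cond_name, cond_value, wanted_name)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT);
CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER,
                        Ances_Pos TEXT, Node_Value TEXT);
""")
conn.execute(
    "INSERT INTO Ancestor_Path VALUES (1, '/SigmodRecord/issue/articles/article')"
)
# Hypothetical leaf rows for two articles; page numbers are made up.
conn.executemany("INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)", [
    ("title", 1, "1.1.3.1.", "Database Directions III Workshop Review."),
    ("endPage", 1, "1.1.3.1.", "12"),
    ("title", 1, "1.1.3.2.", "Another Title"),
    ("endPage", 1, "1.1.3.2.", "20"),
])

end_pages = twig_query(
    conn,
    "/SigmodRecord/issue/articles/article",
    "title",
    "Database Directions III Workshop Review.",
    "endPage",
)
```

Because both leaves hang from the same article element, a single ancestor path lookup is enough in this sketch, and no inner-node table is needed to pair the two records.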

Figure 5.22 Comparison of six approaches in terms of scalability of query response time for query (Q5) for various XML dataset sizes using linear graph

vi. Twig Query with ‘//’ (Q6)

This type of XPath query evaluates the branching query associated with the ancestor-descendant relationships of the nodes in the document, particularly the nodes that share the same parent (see Q6 in Table 5.2 and Table 5.3). Table 5.14 shows the time taken by the six approaches to execute the corresponding Q6. Figure 5.23 and Figure 5.24 compare the scalability of the six approaches by using a column graph and a linear graph, respectively.

Table 5.14 Query response time (ms) of the six approaches for executing the corresponding SQL query for XPath query (Q6) for various XML dataset sizes

Approach      SIGMOD (467 KB)   DBLP1 (31.5 MB)   DBLP2 (63.4 MB)   DBLP3 (127 MB)
XRel          7107              92837             188005            405745
SMX/R         32513             139178            328617            605035
Ying-Cao      2248              44715             98253             224189
XRecursive    5000              451237            874837            1809968
s-XML         5090              437761            878780            1789461
XAncestor     607               10511             21071             43818

Figure 5.23 Comparison of six approaches in terms of scalability of query response time for query (Q6) for various XML dataset sizes using column graph (x-axis: XML dataset size; y-axis: query response time (ms), logarithmic scale; series: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor)

According to the results obtained (Table 5.14, Figure 5.23, and Figure 5.24), the existing approaches also took more time to execute the corresponding SQL query for Q6 when the XML datasets increased in size. The s-XML and XRecursive approaches took the longest time to execute the corresponding SQL query for Q6, particularly when the XML dataset size increased. This is because both approaches need to generate excessive nested queries by using their parent ID columns recursively, and because they have the largest search space in their RDB schemes as they store all the information about the inner nodes of the document.

In addition, the SMX/R approach took longer than the XRel, Ying-Cao, and XAncestor approaches to execute its corresponding SQL query for Q6, particularly when the size of the XML dataset increased. This is because SMX/R extracts three sub-paths from Q6, namely: a path for the branching node, a path for the node that has the value, and a path for the node whose value is to be retrieved. Then it matches these sub-paths with the path information in the Path_Index table using the ‘Like’ operator after ‘//’ is replaced by ‘%’ in order to get their PathIDs, which are used to look up the value of one node by means of the other node in a single table (Path). In addition, the SMX/R approach handles the containment relationships between these nodes using ϴ-join operators. The XRel approach consumed more time than Ying-Cao and XAncestor to execute the corresponding SQL query for Q6 for the same reasons as SMX/R, except that XRel stores the whole document in four tables (Path, Element, Attribute, and Text) based on the node types, which reduces the search space inside these tables compared to SMX/R.

The Ying-Cao approach consumed more time than XAncestor to execute its corresponding SQL query for Q6 when the XML dataset became bigger because it extracts two sub-paths from Q6, namely: a path for the node that has the value and a path for the node whose value is to be retrieved. It then matches these sub-paths with the path information in the Path table using the ‘Like’ operator after ‘//’ is replaced by ‘%’ in order to get their PathIDs. These IDs are used to look up the value of one node by means of the other node in the LeafNodes table. Also, in order to retrieve accurate data, the Ying-Cao approach joins the two records that are retrieved from the LeafNodes table using their ParentID in the InnerNodes table.


In contrast to the existing approaches, the proposed algorithm (XtoSQL) of the XAncestor approach translated XPath query (Q6) into its corresponding SQL query with a lower query response time, particularly when the size of the XML dataset increased. This is because the XAncestor approach extracts only one ancestor’s path expression for the two nodes from ‘//article [initPage > 50]/title’ in the SIGMOD dataset and ‘//phdthesis [year > 1979]/school’ in the DBLP datasets. Then, it matches each one with its ancestor’s path information in the Ancestor_Path table using the ‘Like’ operator after ‘//’ is replaced by ‘%’ in order to get its Ances_pathID. This is then used to look up the value of one node by means of the other node in the Leaf_Node table, since both nodes share the same ancestor PathID and ancestor position.

Figure 5.24 Comparison of six approaches in terms of scalability of query response time for query (Q6) for various XML dataset sizes using linear graph

5.4 EXPERIMENTAL DISCUSSION

With respect to the scalability of the RDB storage space of the various approaches when applied to XML datasets of different size (see Table 5.1 for details of the XML datasets used), the results show that the XML mapping algorithms of the XRel, SMX/R, Ying-Cao, XRecursive, and s-XML approaches used more RDB storage space, with lower scalability, than the XML mapping algorithm (XtoDB) of the XAncestor approach. The main reason for this result is that the existing approaches store a lot of information that is simplified in XAncestor.

The s-XML approach consumed the most RDB storage space to map the different-sized XML documents. This is because it stores information about the inner nodes, has a high row size, and has redundant columns in its schema. The Ying-Cao approach needed a larger RDB storage space than the XRel, SMX/R, XRecursive, and XAncestor approaches because it stores distinct path information for every leaf node (root-to-leaf) of the document, a dangling column (SiblingNum), and all information about the inner nodes of the document in a complex RDB scheme that contains four tables. The SMX/R approach consumed more RDB storage space than XRel, XRecursive, and XAncestor as it stores distinct path information for every node (root-to-any-node) of the document and all the information about elements, attributes, and texts in a single table. Like SMX/R, the XRel approach consumed more RDB storage space than the XRecursive and XAncestor approaches because XRel stores the whole document in four tables (Path, Element, Attribute, and Text) based on the node types, i.e. it has a complex RDB scheme. Although the XRecursive approach used less RDB storage space than the other existing approaches because of the simplicity of its two-table RDB scheme, it still stores all the information about inner nodes and a dangling column (Type), which makes it less effective than the proposed approach (XAncestor).

The XAncestor approach outperformed the other approaches because it stores only the distinct ancestor path information for every leaf node (root-to-parent) of the document and does not store the information about the inner nodes. Therefore, from the results of the above evaluation, it can be concluded that the proposed approach (XAncestor) can store XML documents of different size in less RDB storage space, with higher scalability, than the existing approaches.

As regards the scalability of the query response time of the six approaches in translating six types of XPath query (Q1–Q6) into corresponding SQL queries for the various XML datasets (see Table 5.1), the results show that the existing approaches translated these types of XPath query into the equivalent SQL query with a higher query response time than the proposed approach. In other words, the query performance of the query mapping algorithms of the XRel, SMX/R, Ying-Cao, XRecursive, and s-XML approaches is worse than that of the XtoSQL algorithm of the proposed approach (XAncestor) when applied to XML datasets of different size.

The XRel and SMX/R approaches consumed more time to execute the queries than the other existing approaches because they store the path information for all the existing nodes and all the information about the inner nodes and leaf nodes of the XML document in their RDB schemes. This increases the search space inside the RDB of these two approaches. In addition, XRel and SMX/R maintain the parent-child and ancestor-descendant relationships using the containment relationship (ϴ-join) in their RDB schemes. The s-XML approach also needed a high query response time to execute the queries because it consumes a lot of space, with a high row size, to store the whole XML document, which increases the search space in its RDB scheme. It also uses the parent ID (LParent column) recursively to form nested queries in order to maintain the parent-child and ancestor-descendant relationships inside these queries.

The XRecursive approach is better than the s-XML approach in terms of the amount of storage space required, which reduces the search space inside its RDB scheme. However, like s-XML, it maintains parent-child and ancestor-descendant relationships by using the parent ID (Pid column) recursively to form nested queries, which is a costly procedure in terms of time in an RDBMS.
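To illustrate why recursive use of a parent ID leads to nested queries, the Python and SQLite sketch below builds one nested sub-query per step of a path over a generic parent-pointer table. The table Node(Id, Pid, Name, Value), the function names, and the sample rows are hypothetical; they are not the actual XRecursive or s-XML schemes or their generated SQL, and serve only to show how the nesting depth grows with the path length.

```python
import sqlite3


def nested_parent_query(steps):
    """Build SQL with one nested sub-query per path step (parent-pointer style)."""
    # Innermost query: the root element, which has no parent.
    sql = "SELECT Id FROM Node WHERE Name = ? AND Pid IS NULL"
    # Each further step wraps the previous query as a condition on Pid.
    for _ in steps[1:]:
        sql = f"SELECT Id FROM Node WHERE Name = ? AND Pid IN ({sql})"
    return f"SELECT Value FROM Node WHERE Id IN ({sql}) ORDER BY Id"


def query_by_parent_ids(conn, steps):
    """Evaluate a simple path such as ['dblp', 'mastersthesis', 'author']."""
    sql = nested_parent_query(steps)
    # Placeholders appear outermost-first, so parameters are the steps reversed.
    return [row[0] for row in conn.execute(sql, list(reversed(steps)))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Node (Id INTEGER PRIMARY KEY, Pid INTEGER, "
             "Name TEXT, Value TEXT)")
# Hypothetical rows for a tiny DBLP-like document.
conn.executemany("INSERT INTO Node VALUES (?, ?, ?, ?)", [
    (1, None, "dblp", None),
    (2, 1, "mastersthesis", None),
    (3, 2, "author", "A. Author"),
    (4, 2, "title", "Some Title"),
    (5, 1, "mastersthesis", None),
    (6, 5, "author", "B. Author"),
])

steps = ["dblp", "mastersthesis", "author"]
generated_sql = nested_parent_query(steps)
found_authors = query_by_parent_ids(conn, steps)
```

In this sketch a three-step path already produces three levels of nesting, each of which scans the table that also holds every inner node, whereas the ancestor-path design resolves the whole path with a single lookup in a small path table.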


The Ying-Cao approach is better than the other existing approaches in terms of query response time for all queries. However, it still took more time to execute these queries than the proposed approach (XAncestor). This is because it requires more time to find the PathID in the Path table, which stores distinct path information for all the leaf nodes of the XML dataset, and to match it with the InnerNodes and LeafNodes tables that store all the inner and leaf nodes of the document, respectively.

The proposed approach (XAncestor) outperformed the existing approaches for all queries (Q1–Q6) because it reduces the search space inside the Ancestor_Path table, which contains only the paths of the common ancestors of the leaf nodes. It uses this path to directly maintain parent-child and ancestor-descendant relationships. Furthermore, it joins this path with the Leaf_Node table, which stores only the information about the leaf nodes. Therefore, from the results of the above evaluation, it can be concluded that the proposed approach (XAncestor) can perform queries faster than the existing approaches.

5.5 COMPLETENESS OF XANCESTOR

One of the challenging issues in proposing an approach is to check that it can produce a result without loss of information when mapping XML documents into a RDB scheme. In this study, the completeness of the approach is defined as the ability of the XML mapping algorithm of the approach to map XML documents into a RDB scheme without missing any information while traversing the document. In this section, the proposed approach (XAncestor) is considered complete if its mapping algorithm (XtoDB) can map all the inner nodes of the document as the ancestor path information for all the leaf nodes, which is stored in the Ancestor_Path table, and if it can map all information about the leaf nodes (i.e. those that have a value, whether an attribute or text) and store it in the Leaf_Node table without loss of this information while traversing the document. The completeness of XAncestor is checked in three ways: by the number of tuples mapped, by the query results, and by reconstructing the XML document from the RDB. Each of these checks is dealt with in the following subsections.

5.5.1 Checking by Number of Tuples Mapped

After mapping the SIGMOD dataset containing 11526 elements and 3737 attributes (see Table 5.1) into a RDB scheme using the XRel, SMX/R, Ying-Cao, XRecursive, s-XML, and XAncestor approaches, the number of tuples mapped in the respective RDB schemes is calculated and the results are presented in Table 5.4. These results are then analysed and described to check the completeness of XAncestor, as shown in Table 5.15.

Table 5.15 Comparison of approaches in terms of number of tuples mapped for SIGMOD dataset

Approach: Number of Tuples Mapped

XRel: XRel mapped the SIGMOD dataset into Element, Attribute, Text, and Path tables. The Element table contained 11526 tuples, which is equal to the number of elements in this dataset. The Attribute table contained 3737 tuples, which is also equal to the number of attributes in this dataset. The Text table contained 8383 tuples, while the Path table contained 12 tuples, i.e. 12 distinct paths for all nodes (root-to-node).

SMX/R: SMX/R mapped the SIGMOD dataset into two tables: Path_Index and Path. The Path_Index table contained 12 tuples, which is equal to the number of tuples in the Path table of XRel. The Path table contained 23646 tuples, which is equal to the summation of 11526 elements and 3737 attributes in this dataset plus the 8383 tuples in the Text table of XRel.

Ying-Cao: Ying-Cao mapped the SIGMOD dataset into InnerNodes, LeafNodes, File, and Path tables. The InnerNodes table contained 15263 tuples, which is equal to the summation of 11526 elements and 3737 attributes in this dataset. The LeafNodes table contained 12120 tuples, which is equal to the summation of 3737 attributes in this dataset and 8383 tuples in the Text table of XRel. The File table contained one tuple, i.e. the name of the dataset. The Path table contained seven tuples, i.e. seven distinct paths for all leaf nodes (root-to-leaf).

XRecursive: XRecursive mapped the SIGMOD dataset into two tables: Tag_Structure and Tag_Value. The Tag_Structure table contained 15263 tuples, which is equal to the number of tuples in the InnerNodes table of the Ying-Cao approach, i.e. the summation of 11526 elements and 3737 attributes in this dataset. The Tag_Value table contained 12120 tuples, which is equal to the number of tuples in the LeafNodes table of the Ying-Cao approach, i.e. the summation of 3737 attributes in this dataset and 8383 tuples in the Text table of XRel.

s-XML: s-XML mapped the SIGMOD dataset into two tables: ParentTable and ChildTable. ParentTable contained 15263 tuples, which is equal to the number of tuples in the InnerNodes table of Ying-Cao and the Tag_Structure table of XRecursive, i.e. the summation of 11526 elements and 3737 attributes in this dataset. ChildTable contained 12120 tuples, which is equal to the number of tuples in the LeafNodes table of Ying-Cao and the Tag_Value table of XRecursive, i.e. the summation of 3737 attributes in this dataset and 8383 tuples in the Text table of XRel.

XAncestor: XAncestor mapped the SIGMOD dataset into two tables: Leaf_Node and Ancestor_Path. The Leaf_Node table stored all the leaf nodes (that had a value, whether attribute or text), which totalled 12120 tuples, i.e. the summation of 3737 attributes in this dataset and 8383 tuples in the Text table of XRel, like the LeafNodes, ChildTable, and Tag_Value tables in Ying-Cao, s-XML, and XRecursive, respectively. This approach also mapped all the inner nodes as distinct ancestor paths for all leaf nodes (root-to-parent) into the Ancestor_Path table, which contained four tuples only.
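The tuple counts quoted in Table 5.15 can be cross-checked with simple arithmetic. The short Python sketch below only re-adds the figures given in the table; the variable names are illustrative and no new measurements are introduced.

```python
# Node counts for the SIGMOD dataset as reported in Table 5.15.
elements = 11526
attributes = 3737
text_nodes = 8383  # tuples in the Text table of XRel

# InnerNodes (Ying-Cao), Tag_Structure (XRecursive), ParentTable (s-XML).
inner_tuples = elements + attributes

# LeafNodes (Ying-Cao), Tag_Value (XRecursive), ChildTable (s-XML),
# Leaf_Node (XAncestor).
leaf_tuples = attributes + text_nodes

# Path table of SMX/R, which holds elements, attributes, and texts together.
all_node_tuples = elements + attributes + text_nodes

print(inner_tuples, leaf_tuples, all_node_tuples)
```

The sums agree with the 15263, 12120, and 23646 tuples reported in the table, so the Leaf_Node table of XAncestor holds as many value-bearing tuples as the corresponding tables of the other approaches.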

Based on the analysis of the number of tuples mapped, the proposed approach (XAncestor) did not lose any information when traversing all the nodes of the SIGMOD dataset. XAncestor is able to achieve this because it mapped all the leaf nodes (attributes or texts that contained values) into the Leaf_Node table, which contained the same number of tuples as the number of attributes and texts in the SIGMOD dataset. In addition, XAncestor mapped all the inner nodes as distinct ancestor paths for all the leaf nodes (root-to-parent). Thus, the completeness of the XAncestor approach is checked on the basis of the number of tuples mapped.

5.5.2 Checking by Query Results

The proposed approach (XAncestor) can be regarded as complete if the translated SQL queries in its RDB scheme produce the same results as those of the other approaches (XRel, SMX/R, Ying-Cao, XRecursive, s-XML) when the same dataset (SIGMOD) is mapped into their RDB schemes. The translated SQL query of XAncestor for the XPath query (Q) is compared to that of the other approaches to ascertain whether the six approaches obtain the same results when applied to the same dataset (SIGMOD). The XPath query investigated is:

Q: //article/title

a. XRel Approach

The XRel approach translated the XPath query (Q) into the corresponding SQL query, which is shown with its result in Figure 5.25.

Figure 5.25 Result of corresponding SQL query for Q using XRel approach

b. SMX/R Approach

The SMX/R approach translated the XPath query (Q) into the corresponding SQL query, which is shown with its result in Figure 5.26.

Figure 5.26 Result of corresponding SQL query for Q using SMX/R approach

c. Ying-Cao Approach

The Ying-Cao approach translated the XPath query (Q) into the corresponding SQL query, which is shown with its result in Figure 5.27.

Figure 5.27 Result of corresponding SQL query for Q using Ying-Cao approach

d. XRecursive Approach

The XRecursive approach translated the XPath query (Q) into the corresponding SQL query, which is shown with its result in Figure 5.28.

Figure 5.28 Result of corresponding SQL query for Q using XRecursive approach

e. s-XML Approach

The s-XML approach translated the XPath query (Q) into the corresponding SQL query, which is shown with its result in Figure 5.29.

Figure 5.29 Result of corresponding SQL query for Q using s-XML approach

f. XAncestor Approach

The XAncestor approach also translated the XPath query (Q) into the corresponding SQL query, which is shown with its result in Figure 5.30.

Figure 5.30 Result of corresponding SQL query for Q using XAncestor approach

According to the results shown in Figures 5.25 to 5.30, the corresponding SQL query for XPath query (Q) using the XAncestor approach produced the same result (1504 tuples) as the other approaches. The existing approaches also obtained the same result (1504 tuples) after executing their corresponding SQL queries for the same XPath query (Q). Thus, based on this evaluation, the XAncestor approach is capable of producing the same results as the other approaches for queries over its RDB scheme after mapping the SIGMOD dataset without losing any information. Therefore, according to the query results, the completeness of the XAncestor approach is checked.

5.5.3 Checking by Reconstructing XML Document from RDB

The proposed approach (XAncestor) can be regarded as complete if the XML document can be reconstructed from its RDB scheme without losing information. In other words, the reconstructed XML document and the original XML document should be identical. By way of example, an original XML document (DBGroup.XML) and its tree are shown in Figure 5.31.

Figure 5.31 XML data model: (a) original XML document (DBGroup.XML) and (b) its tree

The proposed approach (XAncestor) mapped the above XML document (DBGroup.XML) into its RDB scheme, which is shown in Table 5.16.

Table 5.16 XAncestor scheme for DBGroup.XML: (a) Ancestor_Path and (b) Leaf_Node tables

(a) Ancestor_Path Table

Ances_PathID   Ances_PathExp
1              /DBGroup/Member
2              /DBGroup/Project
3              /DBGroup/Project/Duration

(b) Leaf_Node Table

Node_Name   Ances_PathID   Ances_Pos   Node_Value
Name        1              1.1.        ‘Fervvac’
Age         1              1.1.        ‘23’
Office      1              1.1.        ‘CSD 4212’
Name        1              1.2.        ‘Fedric’
Age         1              1.2.        ‘25’
Office      1              1.2.        ‘GRT 6432’
@id         2              1.3.        ‘105’
Title       2              1.3.        ‘XML’
Title       2              1.4.        ‘SQL’
Dept        2              1.4.        ‘Database’
Month       3              1.4.3.      ‘10’
Day         3              1.4.3.      ‘25’

From the RDB scheme of the XAncestor approach in Table 5.16, the XML tree and document are reconstructed, as shown in Figure 5.32 and Figure 5.33, respectively.

Figure 5.32 Reconstructed XML tree for DBGroup.xml

Figure 5.33 Reconstructed XML document for DBGroup.xml
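One way such a reconstruction could proceed is sketched below in Python, using the rows of Table 5.16. This is a minimal illustration and not the reconstruction procedure of the thesis: it assumes that the Leaf_Node rows are read in document order, that Ances_Pos has one component per step of Ances_PathExp, and that a node name starting with ‘@’ denotes an attribute. The function name rebuild and the in-memory data layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rows copied from Table 5.16.
ANCESTOR_PATH = {
    1: "/DBGroup/Member",
    2: "/DBGroup/Project",
    3: "/DBGroup/Project/Duration",
}
LEAF_NODE = [  # (Node_Name, Ances_PathID, Ances_Pos, Node_Value)
    ("Name", 1, "1.1.", "Fervvac"), ("Age", 1, "1.1.", "23"),
    ("Office", 1, "1.1.", "CSD 4212"), ("Name", 1, "1.2.", "Fedric"),
    ("Age", 1, "1.2.", "25"), ("Office", 1, "1.2.", "GRT 6432"),
    ("@id", 2, "1.3.", "105"), ("Title", 2, "1.3.", "XML"),
    ("Title", 2, "1.4.", "SQL"), ("Dept", 2, "1.4.", "Database"),
    ("Month", 3, "1.4.3.", "10"), ("Day", 3, "1.4.3.", "25"),
]


def rebuild(ancestor_path, leaf_node):
    """Rebuild an element tree from ancestor paths and positioned leaf rows."""
    elements = {}  # position prefix, e.g. ('1', '4') -> inner element
    root = None
    for name, path_id, pos, value in leaf_node:
        tags = ancestor_path[path_id].strip("/").split("/")
        parts = pos.strip(".").split(".")
        parent = None
        # Walk from the root down to the leaf's parent, creating inner
        # elements the first time their position prefix is seen.
        for depth, tag in enumerate(tags, start=1):
            key = tuple(parts[:depth])
            if key not in elements:
                if parent is None:
                    elements[key] = root = ET.Element(tag)
                else:
                    elements[key] = ET.SubElement(parent, tag)
            parent = elements[key]
        if name.startswith("@"):
            parent.set(name[1:], value)  # attribute leaf
        else:
            ET.SubElement(parent, name).text = value  # text leaf
    return root


root = rebuild(ANCESTOR_PATH, LEAF_NODE)
print(ET.tostring(root, encoding="unicode"))
```

In this sketch the ancestor path supplies the element names and the ancestor position distinguishes repeated elements such as the two Member and two Project nodes, which is why no separate inner-node table is needed to rebuild the tree.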

From the comparison of the original XML document and its tree in Figure 5.31 with the reconstructed tree and document in Figure 5.32 and Figure 5.33, respectively, it can be seen that the XML documents and trees are identical. Therefore, the XAncestor approach can map the original XML document (DBGroup.xml) into its RDB scheme without losing information. As a result, the completeness of the XAncestor approach is checked by successfully reconstructing the XML document from its RDB scheme.

5.6 SUMMARY

After the proposed mapping approach (XAncestor), which contains the XtoDB and XtoSQL algorithms, was successfully developed, its effectiveness and scalability were evaluated by comparing it with other approaches in terms of RDB storage space and query response time. This chapter compared the results generated by the proposed approach (XAncestor) with those generated by existing approaches, namely XRel, SMX/R, Ying-Cao, XRecursive, and s-XML, by means of experiments. These approaches were chosen for comparison with XAncestor because they are recent model mapping approaches that have been used for mapping XML documents into a predefined RDB scheme for storing and querying. In addition, they apply recent techniques, including path-based and node-based techniques, for storing and querying XML documents in a RDB scheme. The results produced by applying the six approaches to real XML datasets of different size and to six different types of query for each dataset were compared and analysed in terms of the effectiveness and scalability of the RDB storage space and the query response time (query performance). The analysis showed that the proposed approach (XAncestor) outperformed the other model mapping approaches in terms of the effectiveness and scalability of the RDB storage space and the time taken to execute the six types of XPath query, i.e. query response time (query performance). Lastly, the completeness of the proposed approach (XAncestor) was checked by analysing the number of tuples mapped, the query results, and the reconstruction of the XML document from its RDB scheme. The next chapter reports the conclusions of this study and highlights the contributions of this research for the XML community.


CHAPTER VI

CONCLUSION AND FUTURE WORKS

6.1 INTRODUCTION

Extensible markup language (XML) management is still an essential open research area. From a review of the literature, it is apparent that there are still some limitations regarding the storing and querying of information. These problems can be overcome by using a relational database (RDB) to store and query the content of XML documents efficiently. The mapping of an XML document into a RDB involves the mapping of ordered, hierarchical, and structured information into an unordered table. Therefore, in this study, a new model mapping approach, called XAncestor, is proposed to map the XML documents into a fixed RDB scheme using a path-based technique to address the issues of RDB storage space and query response time. The proposed approach (XAncestor) contained an XML mapping algorithm (XtoDB) for mapping different sizes of XML document into less storage space in the RDB scheme and a query mapping algorithm (XtoSQL) for translating six different types of XPath query into their corresponding SQL query in a shorter query response time. Then, the results of these SQL queries are translated into a hierarchical XML format and returned back to the user. The proposed approach (XAncestor) is evaluated successfully by implementing it and its algorithms as a prototype. The results of the proposed approach (XAncestor) are compared with those of the most-related approaches. The comparisons showed that the proposed approach (XAncestor) is able to outperform these related approaches in terms of the effectiveness and scalability of the RDB storage space and the query response time, particularly in the case of large XML documents. Finally, the correctness and completeness of XAncestor are confirmed by conducting three proofs.


The previous chapters discussed the issues that underpin this research. This final chapter concludes the thesis and is organized as follows. The achievement of the research objectives is discussed in Section 6.2. Next, the research contributions of this study are highlighted in Section 6.3. Then, some recommendations for further research are proposed in Section 6.4 to address the limitations of the current study. Finally, some concluding remarks are presented in Section 6.5.

6.2 ACHIEVEMENT OF THE OBJECTIVES

The aim of this research is to propose a mapping approach that uses two algorithms: an XML mapping algorithm and a query mapping algorithm. The XML mapping algorithm is designed to map XML documents of different sizes into a RDB scheme that uses less storage space and scales well. The query mapping algorithm is designed to translate different types of XPath query into their corresponding SQL queries with a shorter query response time and then translate the results into XML format. These algorithms are developed and implemented as a prototype called XAncestor, which is proven to be correct and complete. In addition, the XAncestor prototype is evaluated and compared with the most closely related approaches in terms of the effectiveness and scalability of the RDB storage space and the query response time.

The three main objectives of this research are outlined in Chapter 1. Based on the work conducted and the findings obtained during this research, each of the three objectives is achieved as stated below.

OBJ1: To propose an XML mapping algorithm (XtoDB) for mapping different sizes of XML document into a fixed and predefined RDB scheme using a path-based technique in order to improve the RDB storage space and its scalability when XML documents become bigger.

The first objective is achieved by proposing an approach that includes a RDB scheme without dangling tables and columns, the use of the ancestor path expression, and the development of an XML mapping algorithm (XtoDB). Using a path-based technique, the XtoDB algorithm is developed to map (shred and decompose) the content and structure of XML documents of different sizes into a fixed RDB scheme. With this technique, only the distinct ancestor path information of the leaf nodes of the XML tree, together with their information, is mapped into the fixed RDB scheme. The path information of the inner nodes and their structural information is therefore not mapped into the RDB scheme. This led to a reduction in the RDB storage space as well as high scalability when the XML document became larger.

OBJ2: To propose a query mapping algorithm (XtoSQL) for translating different XPath queries, namely short simple path expression queries, long simple path expression queries, queries with one ‘//’, queries with two ‘//’, twig queries with ‘/’, and twig queries with ‘//’, into SQL queries based on the RDB scheme using a path-based technique to improve the performance of these types of query, i.e. reduce the query response time.

The second objective is achieved by proposing a query mapping algorithm (XtoSQL). Using a path-based technique, the XtoSQL algorithm is developed to translate six different types of XPath query into their corresponding SQL queries. The results of the queries are then translated into XML format. The algorithm improves query performance by reducing the search space in the RDB scheme, reducing nested queries by using the parent ID attribute, reducing the number of join operations in the translated SQL queries, and maintaining the parent-child and ancestor-descendant relationships among the XML nodes efficiently. XtoSQL is designed to first extract the ancestor path from the XPath query and then match it against the ancestor path information in the Ancestor_Path table, using the ‘=’ operator or the ‘Like’ operator with ‘%’, to obtain its Ances_pathID. This ID is then used to look up the data stored in the Leaf_Node table.
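The lookup just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the XtoSQL algorithm: only the table names Ancestor_Path and Leaf_Node and the path ID come from the text, while the remaining column names (`Ances_PathExp`, `Node_ID`, `Node_Name`, `Node_Value`), the helper names, and the use of SQLite are hypothetical. The sketch covers only two of the six query types, namely simple paths (matched with `=`) and paths that start with a single `//` (matched with `LIKE` and `%`), and it rejects everything else.

```python
import sqlite3


def xpath_to_sql(xpath):
    """Translate /a/b/c or //b/c into (sql, params); other forms are rejected."""
    # The last step is the leaf name; everything before it is the ancestor path.
    ancestor, _, leaf = xpath.rpartition("/")
    if ancestor.startswith("//"):
        # A leading '//' allows any prefix, so match the path suffix with LIKE.
        operator, value = "LIKE", "%/" + ancestor[2:]
    else:
        operator, value = "=", ancestor
    unsupported = ("[" in xpath or not leaf or not value.strip("%/")
                   or "//" in value or value.endswith("/"))
    if unsupported:
        raise ValueError("unsupported XPath in this sketch: " + xpath)
    sql = ("SELECT l.Node_Value FROM Ancestor_Path AS p "
           "JOIN Leaf_Node AS l ON l.Ances_PathID = p.Ances_PathID "
           f"WHERE p.Ances_PathExp {operator} ? AND l.Node_Name = ? "
           "ORDER BY l.Node_ID")
    return sql, (value, leaf)


def demo_db():
    """Build a tiny in-memory database with hypothetical column names."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY,
                                    Ances_PathExp TEXT);
        CREATE TABLE Leaf_Node (Node_ID INTEGER PRIMARY KEY,
                                Ances_PathID INTEGER,
                                Node_Name TEXT, Node_Value TEXT);
        INSERT INTO Ancestor_Path VALUES (1, '/library/book');
        INSERT INTO Ancestor_Path VALUES (2, '/library/journal');
        INSERT INTO Leaf_Node VALUES (1, 1, 'title', 'A');
        INSERT INTO Leaf_Node VALUES (2, 1, 'title', 'B');
        INSERT INTO Leaf_Node VALUES (3, 2, 'title', 'C');
    """)
    return db


def run(db, xpath):
    """Execute the translated query and return the matching leaf values."""
    sql, params = xpath_to_sql(xpath)
    return [row[0] for row in db.execute(sql, params)]
```

With the demo data, `run(demo_db(), "/library/book/title")` returns the two book titles and `run(demo_db(), "//journal/title")` returns the journal title. The sketch joins the two tables in one statement rather than fetching the path ID first. SQLite's `LIKE` is case-insensitive for ASCII by default and treats `_` as a wildcard, so a real translation would need to account for both.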
This path-matching technique reduced the time taken to execute the queries and scaled well as the XML document increased in size.

OBJ3: To develop a model mapping approach (XAncestor) as a prototype from the results of OBJ1 and OBJ2, evaluate and compare its effectiveness and scalability, and prove that the prototype is a correct and complete approach in terms of RDB storage space and query response time by comparing it with five recent related approaches.

The third objective is achieved by developing a new prototype called XAncestor that contains the proposed algorithms (the results of OBJ1 and OBJ2). Experiments are conducted on XML datasets of various sizes in order to evaluate the effectiveness and scalability of XAncestor and compare it with five recent related approaches, namely: XRel (Yoshikawa et al. 2001), SMX/R (Ning & De 2010), XRecursive (Fakharaldien et al. 2011b), Ying-Cao (Ying et al. 2012), and s-XML (Subramaniam et al. 2012). Three of these approaches use a path-based technique and two use a node-based technique to address RDB storage space and query response time. The evaluations and comparisons showed that XAncestor outperforms the other approaches in terms of RDB storage space and scalability and the time taken to execute six types of XPath query, i.e. query response time (query performance). Lastly, the correctness and completeness of XAncestor are proven by calculating the number of tuples mapped, comparing the query results, and reconstructing the XML document from XAncestor’s RDB scheme.

6.3 RESEARCH CONTRIBUTION

During data exchange, more and more XML documents are being transferred with or without a schema (DTD or XML Schema). Considerable effort has been made by vendors and researchers to find an efficient approach for mapping XML documents to a RDB and for translating XPath queries into their corresponding SQL queries so that the results can then be translated into XML format. However, the mapping approaches proposed so far are still insufficient in terms of reducing RDB storage space, achieving high scalability of the RDB storage space, and reducing the time taken to execute the translated SQL queries, particularly for large XML documents.


The first contribution of this research is that it reported and critically analysed the current state of the art in the development of mapping approaches between XML and RDB (XML mapping and query mapping algorithms). Mapping approaches can be classified into two main categories: structure mapping and model mapping approaches. Since this study aimed to develop an approach for mapping any XML document into a RDB, whether it has a schema or not, it falls into the second category. The limitations of the existing model mapping approaches identified in the literature review motivated this study to store XML documents in less RDB storage space with high scalability and to query them with a shorter query response time, particularly for large XML documents.

The second contribution of this research is that it proposed an XML mapping algorithm (XtoDB) that maps XML documents into a fixed RDB scheme using a path-based technique. The proposed algorithm reduces the RDB storage space and achieves high scalability as the XML document becomes larger because only the distinct ancestor path information of the leaf nodes of the XML tree, together with their structural information, is mapped into the fixed RDB scheme. The path information of the inner nodes and their structural information is not mapped into the RDB scheme.

The third contribution of this research is that it proposed a query mapping algorithm (XtoSQL) that translates six types of XPath query into their corresponding SQL queries and then returns the results in XML format. The proposed algorithm improves query performance by reducing the search space in the RDB scheme, reducing nested queries by using the parent ID attribute, reducing the number of join operations in the translated SQL queries, and maintaining the parent-child and ancestor-descendant relationships among the XML nodes efficiently. This leads to a reduction in the time taken to execute these queries as well as high scalability when the XML document increases in size. The six types of XPath query are a short simple path expression query that evaluates short parent-child relationships, a long simple path expression query that evaluates long parent-child relationships, a query with one ‘//’ that evaluates one ancestor-descendant relationship, a query with two ‘//’ that evaluates two ancestor-descendant relationships, a twig query with ‘/’ that evaluates a branching query with parent-child relationships, and a twig query with ‘//’ that evaluates a branching query with ancestor-descendant relationships.

The fourth contribution of this research is that it developed a prototype called XAncestor that implemented the proposed algorithms (XtoDB and XtoSQL) to create a new approach for mapping XML documents to a RDB with improved storing and querying capabilities. XAncestor is evaluated and compared with recent related approaches in terms of its effectiveness and scalability in storing and querying various XML documents, and it outperforms these existing approaches. In addition, XAncestor is proven to be a correct and complete approach for mapping XML documents into a RDB scheme. Storing and querying XML documents efficiently is a challenge for IT organizations, so there is a clear need for such a tool.

In conclusion, the contributions of this research can help commercial organizations to exchange and manage their data on the web by giving them a tool (XAncestor) for storing and querying XML documents efficiently. The contributions can reduce the costs of exchanging and managing data on the web by reducing RDB storage space while achieving high scalability and by reducing the time taken to execute queries.

6.4 RECOMMENDATIONS FOR FUTURE WORK

This study focused on proposing an efficient model mapping approach for storing and querying XML documents that reduces RDB storage space and improves query response time. As recommendations for further research, the following issues may need to be considered in order to improve the proposed model mapping approach (XAncestor):


This study adopted the XQuery 1.0 and XPath 2.0 data model to represent XML documents and focused on four types of nodes: root, element, text, and attribute in line with existing studies (El-Sharkawi & El-Hadi El Tazi 2005; Qin et al. 2005; Jiang et al. 2002; Yoshikawa et al. 2001). In future work, other types of nodes may need to be considered such as namespace, processing-instruction, and comment in order to enhance the proposed approach.



This study focused on handling the storing and querying of XML documents using a RDB. Because XML data on the web is updated over time, dynamic updating is an issue that may need to be considered in future work. Efficient updating should not require modifying and re-labelling the existing data stored in the RDB, for example when a new element is inserted into an XML document. Numerous node-labelling schemes have been proposed, such as Interval Encoding Based on the Number of Words (Li & Moon 2001), Global Order Label (Tatarinov et al. 2002), Local Order Label (Tatarinov et al. 2002), Dewey Order Label (Tatarinov et al. 2002), ORDPATH (O'neil et al. 2004), Prime Number Labelling (Wu et al. 2004), Cluster-Based Order (Soltan & Rahgozar 2006), Dynamic Interval-Based Labelling (Chung & Jesurajaiah 2005) and Persistent Labeling Scheme (Khaing & Thein 2006). These schemes have been used in most of the existing model mapping approaches. However, these approaches still have drawbacks in terms of supporting dynamic updates efficiently.



XPath is currently the most popular query language for expressing queries over XML data using path expressions (Vainio & Junkkari 2014; Berglund et al. 2003). This research study focused on XPath as the standard XML query language because it uses path expressions, which are handled easily by the proposed approach. In future work, the proposed approach may need to be extended to other XML query languages such as XQuery.
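The re-labelling cost raised in the recommendation on dynamic updates can be illustrated with a small sketch of Dewey-style labels, in which each node's label is the sequence of sibling positions on the path from the root. The function name and the representation of labels as tuples are assumptions made for illustration; the sketch is not taken from any of the cited schemes' implementations.

```python
def insert_with_dewey(labels, parent, position):
    """Insert a new child of `parent` at 1-based `position`.

    labels is a list of Dewey labels given as tuples, e.g. (1, 2, 1).
    Returns (new_labels, relabelled), where relabelled counts the
    existing nodes whose label had to change.
    """
    depth = len(parent)
    new_labels, relabelled = [], 0
    for label in labels:
        in_subtree = label[:depth] == parent and len(label) > depth
        if in_subtree and label[depth] >= position:
            # Following siblings and all of their descendants shift by one.
            label = label[:depth] + (label[depth] + 1,) + label[depth + 1:]
            relabelled += 1
        new_labels.append(label)
    new_labels.append(parent + (position,))
    return sorted(new_labels), relabelled
```

Inserting a new second child under the root of a five-node tree labelled `(1,)`, `(1, 1)`, `(1, 2)`, `(1, 2, 1)`, `(1, 3)` changes three existing labels, which shows why an update-friendly scheme would be a useful extension.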


6.5 CONCLUDING REMARKS

The problem addressed in this study relates to the process of mapping from XML to a relational model. Despite the obstacles that exist in this process, there are two key practical benefits to taking this approach:

Considering that the present market is mostly dominated by RDB products, it is not easy or practical to abandon RDB to support XML. It is very likely that industries would be reluctant to adopt a new technology if it does not support the existing RDB techniques, just as they were reluctant to accept object-oriented databases in the past.



By using a RDB as an underlying storage system, a number of sophisticated techniques such as OLAP, data mining, and data warehousing that are developed for RDB can be applied to XML documents with minimal changes.

In conclusion, this research study found that using a RDB is a good option for managing XML data, particularly in relation to storing and querying data. Therefore, this research study will be useful to the research community and will serve as suitable introductory material for individuals who wish to enhance the mapping components or extend them into new research areas.


REFERENCES

Adam, M., Fakharaldien, I., Khalid, E., Jasni, M. Z. & Norrozila, S. 2012. Efficient algorithms to index, store and query XML documents. International Conference on Computational Science and Information Management (ICoCSIM), pp. 281-286. Ahmad, K. 2008. A method for mapping XML DTD to relational schemas in the presence of functional dependencies. PhD. Thesis. Universiti Putra Malaysia. Ahmad, K. 2011. A comparative analysis of managing XML data in relational database. Intelligent Information and Database Systems, pp. 100-108. Aken, J. E. V. 2004. Management research based on the paradigm of the design sciences: The quest for field-tested and grounded technological rules. Journal of Management Studies 41(2): 219-246. Amer-Yahia, S., Du, F. & Freire, J. 2004. A comprehensive solution to the XML-to-relational mapping problem. Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management, pp. 31-38. Arenas, M., Barcelo, P., Libkin, L. & Murlak, F. 2010. Relational and XML data exchange. Synthesis Lectures on Data Management 2(1): 1-112. Arenas, M., Barceló, P., Libkin, L. & Murlak, F. 2014. Foundations of data exchange. New York. Cambridge University Press.

Atay, M. 2006. XML2REL: An efficient system for storing and querying XML documents using relational databases. PhD. Thesis. Graduate School, Wayne State University. Atay, M., Chebotko, A., Liu, D., Lu, S. & Fotouhi, F. 2007. Efficient schema-based XML-to-relational data mapping. Information Systems 32(3): 458-476. Atay, M., Sun, Y., Liu, D., Lu, S. & Fotouhi, F. 2010. Mapping XML data to relational data: A DOM-based approach. http://arxiv.org/abs/1010.1746 [16 May 2014]. Augeri, C. J., Bulutoglu, D. A., Mullins, B. E., Baldwin, R. O. & Baird Iii, L. C. 2007. An analysis of XML compression efficiency. Proceedings of the 2007 Workshop on Experimental Computer Science, pp. 7. Barab, S. & Squire, K. 2004. Design-based research: Putting a stake in the ground. The Journal of the Learning Sciences 13(1): 1-14.


Barr, M. & Wells, C. 1990. Category theory for computing science. New York, Prentice Hall. Berglund, A., Boag, S., Chamberlin, D., Fernandez, M. F., Kay, M., Robie, J. & Siméon, J. 2003. XML Path language (XPath). World Wide Web Consortium (W3C). https://www.w3.org/TR/2003/WD-xpath20-20030502/ [20 September 2014] Berglund, A., Boag, S., Chamberlin, D., Fernández, M. F., Kay, M., Robie, J. & Siméon, J. 2010. XQuery 1.0 and XPath 2.0 data model (XDM). W3C Recommendation. https://www.w3.org/TR/xpath-datamodel/ [20 April 2015] Boag, S., Chamberlin, D., Fernández, M. F., Florescu, D., Robie, J., Siméon, J. & Stefanescu, M. 2002. XQuery 1.0: An XML query language. W3C Recommendation. https://www.w3.org/TR/2002/WD-xquery-20020816/ [11 February 2015] Bohannon, P., Freire, J., Roy, P. & Siméon, J. 2002. From XML schema to relations: A cost-based approach to XML storage. Proceedings of the 18th International Conference on Data Engineering, pp. 64-75. Bousalem, Z. & Cherti, I. 2015. XMap: A novel approach to store and retrieve XML document in relational databases. Journal of Software 10(12): 1389-1401. Bray, T., Paoli, J., Sperberg-Mcqueen, C. M., Maler, E. & Yergeau, F. 1998. Extensible markup language (XML). W3C Recommendation. https://www.w3.org/TR/1998/REC-xml-19980210 [10 July 2014]. Bray, T., Paoli, J., Sperberg-Mcqueen, C., Maler, E. & Yergeau, F. 2008. Extensible markup language (XML) 1.0. W3C Recommendation. 5th Edition. https://www.w3.org/TR/REC-xml/ [10 July 2014]. Bulajic, A. & Filipovic, N. 2012. Implementation of the tree structure in the XML and relational database. Proceedings of Informing Science and IT Education Conference (InSITE2012), pp. 27-51. Chamberlin, D., Robie, J. & Florescu, D. 2001. Quilt: An XML query language for heterogeneous data sources. The World Wide Web and Databases, pp. 1-25. Springer. Chen, H., Chiang, R. H. & Storey, V. C. 2012. Business intelligence and analytics: From big data to big impact. MIS quarterly 36(4): 1165-1188. 
Chen, M., Mao, S. & Liu, Y. 2014. Big data: A survey. Mobile Networks and Applications 19(2): 171-209.


Christophides, V., Cluet, S. & Simèon, J. 2000. On wrapping query languages and efficient XML integration. ACM SIGMOD Record, pp. 141-152. Chung, S. M. & Jesurajaiah, S. B. 2005. Schemaless XML document management in Object-Oriented databases. International Conference on Information Technology: Coding and Computing (ITCC2005), pp. 261-266. Clark, J. & Derose, S. 1999. XML Path language (XPath) Version 1.0. W3C Recommendation. https://www.w3.org/TR/1999/PR-xpath-19991008 [20 March 2014] Dennis, A., Wixom, B. H. & Roth, R. M. 2012. System analysis and design. 5th Ed. Hoboken, John Wiley & Sons, Inc. Dweib, I., Awadi, A., Elrhman, S. E. F. & Lu, J. 2008. Schemaless approach of mapping XML document into relational database. The 8th IEEE International Conference on Computer and Information Technology (CIT2008), pp. 167172. Dweib, I., Awadi, A. & Lu, J. 2009. MAXDOR: Mapping XML document into relational database. Open Information Systems Journal 3(1): 108-122. Dweib, I. M. I. & Lu, J. 2012. Automatic mapping of XML documents into relational database. PhD. Thesis, School of Computing and Engineering. The University of Huddersfield El-Sharkawi, M. E. & El-Hadi El Tazi, N. 2005. LNV: Relational database storage structure for XML documents. The 3rd ACS/IEEE International Conference on Computer Systems and Applications, pp. 49. Fakharaldien, M. a. I., Edris, K., Zain, J. M. & Sulaiman, N. 2012. Mapping extensible markup language document with relational database management system. International Journal of Physical Sciences 7(25): 4012-4025. Fakharaldien, M. I., Zain, J. M. & Sulaiman, N. 2011a. XRecursive: A storage method for XML document based on relational database. Software Engineering and Computer Systems, pp. 460-469. Springer. Fakharaldien, M. a. I., Zain, J. M. & Sulaiman, N. 2011b. XRecursive: An efficient method to store and query XML documents. Australian Journal of Basic & Applied Sciences 5(12): 2910-2916. Fakheraldien, M. A., Zain, J. M. & Sulaiman, N. 2011c. 
An efficient middleware for storing and querying XML data in relational database management system. Journal of Computer Science 7(2): 314-319.


Fan, W. & Ma, L. 2006. Selectively storing XML data in relations. Database and Expert Systems Applications, pp. 22-32. Fernández, M., Malhotra, A., Marsh, J., Nagy, M. & Walsh, N. 2002. XQuery 1.0 and XPath 2.0 data model. W3C working draft. https://www.w3.org/TR/2002/WD-query-datamodel-20021115/ [16 February 2015] Florescu, D. & Kossmann, D. 1999. Storing and querying XML data using an RDMBS. IEEE Data Engineering Bulletin 22(3): 27-34. Fomichev, A., Grinev, M. & Kuznetsov, S. 2006. Sedna: A Native XML DBMS. Sofsem 2006: Theory and Practice of Computer Science, pp. 272-281. Springer. http://www.sedna.org/ [25 March 2015] Frasincar, F., Houben, G.-J. & Pau, C. 2002. XAL: An algebra for XML query optimization. The 30th Australasian Database Conference (ADC2002), Conferences in Research and Practice in Information Technology, pp. 49-56. Fu, Q., Tan, J. & Xie, Y. 2015. The survey of big data. The 2014 International Conference on Computer Science and Electronic Technology (ICCSET 2014), pp. 403-407. Fujimoto, K., Shimizu, T., Kha, D. D., Yoshikawa, M. & Amagasa, T. 2005. A mapping scheme of XML documents into relational databases using schemabased path identifiers. Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI2005), pp. 82-90. Gerber, A., Van Der Merwe, A. & Alberts, R. 2007. Practical implications of rapid development methodologies. Proceedings of the 2007 Computer Science and IT Education Conference, pp. 233-245. Gou, G. & Chirkova, R. 2007. Efficiently querying large XML data repositories: A survey. IEEE Transactions on Knowledge and Data Engineering 19(10): 1381-1403. Grust, T. 2002. Accelerating XPath location steps. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 109-120. Harold, E. R. 2005. Managing XML data: native XML databases. Retrieved July 4(2008). http://www.ibm.com/us-en/ [15 June 2015] Hashem, I. a. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A. 
& Khan, S. U. 2015. The rise of “big data” on cloud computing: Review and open research issues. Information Systems 47(2015): 98-115. Haw, S. C. & Lee, C. S. 2011. Data storage practices and query processing in XML databases: A survey. Knowledge-Based Systems 24(8): 1317-1340.


Herrington, J., Mckenney, S., Reeves, T. & Oliver, R. 2007. Design-based research and doctoral students: guidelines for preparing a dissertation proposal. In C. Montgomerie & J. Seale (Eds.), Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 4089-4097. Herrington, J., Reeves, T. C. & Oliver, R. 2006. Authentic tasks online: A synergy among learner, task, and technology. Distance Education 27(2): 233-247. Hong, X. & Chen, W. 2012. Common data mapping system between XML and relational database. Proceedings of the 2012 International Conference on Computer Application and System Modeling, pp. 1181-1184. Hyder, S. & Hassan, S. 2006. A data mediator for storing and querying XML data in RDBMS using efficient schema-oblivious mapping strategies. The 2006 6th International Conference on ITS Telecommunications Proceedings, pp. 1311-1314. IBM. 2012. DB2 database software. http://www-01.ibm.com/software/data/db2/ [27 February 2014].

Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V., Nierman, A., Paparizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N. & Wu, Y. 2002. Timber: A native XML database. The International Journal on Very Large Data Bases (VLDB) 11(4): 274-291. Jandhyala, S. 2006. An automated XPath to SQL transformation methodology for XML data. Master. Thesis, Department of Computer Science, Georgia State University. Jeong, B. S. & Lee, Y. K. 2006. Storing and querying of XML documents without redundant path information. Computational Science and Its Applications-Iccsa 2006, pp. 505-514. Springer. Jiang, H., Lu, H., Wang, W. & Yu, J. X. 2002. Path materialization revisited: An efficient storage model for XML data. Australian Computer Science Communications 24(2): 85-94. Joseph, S. V., Joey, F. G. & Jeffrey, A. H. 2011. Essentials of systems analysis and design. 5th Ed. New Jersey. Pearson Education, Inc.,. Kale, R. A. & Deshmukh, V. 2014. Design and development of XML parser using tree-branch symbiosis algorithm-review. International Journal of Advance Research in Computer Science and Management Studies 2(1): 367-373.


Katal, A., Wazid, M. & Goudar, R. 2013. Big data: issues, challenges, tools and good practices. The 2013 Sixth International Conference on Contemporary Computing (IC3), pp. 404-409. Kelly, A. E. 2003. Research as design. Educational Researcher 32(1): 3-4. Khaing, A. A. & Thein, N. L. 2006. A persistent labeling scheme for dynamic ordered XML trees. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06), pp. 498-501. Kobayashi, K., Liang, W., Kobayashi, D., Watanabe, A. & Yokota, H. 2005. VLEI code: An efficient labeling method for handling XML documents in an RDB. Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pp. 386-387. Kyu, Z. M. & Nyunt, T. T. S. 2009. Storing DTD-independent XML data in relational database. The 2009 IEEE Symposium on Industrial Electronics and Applications (ISIEA 2009), pp. 197-202. Li, C., Ling, T. W. & Global, I. 2010. Advanced applications and structures in XML processing: label streams, semantics utilization, and data query technologies. New York. Information Science Reference. Li, Q. & Moon, B. 2001. Indexing and querying XML data for regular path expressions. Proceedings of the 27th Very Large Data Bases Conference, pp. 361-370. Maghaydah, M. A. 2010. Optimisation techniques for storing and querying XML data in relational database systems. PhD. Thesis. Department of Computing, Macquarie University, Australia. Maghaydah, M., Orgun, M. A. & Khazali, I. 2010. Optimizing XML twig queries in relational systems. Proceedings of the Fourteenth International Database Engineering & Applications Symposium, pp. 123-129. Marks, G. & Roantree, M. 2009. Metamodel-based optimisation of XPath queries. Proceedings of the 26th British National Conference on Databases, Dataspace: The Final Frontier, pp. 146-157. Springer. Mchugh, J., Abiteboul, S., Goldman, R., Quass, D. & Widom, J. 1997. Lore: A database management system for semistructured data. 
SIGMOD Record 26(3): 54-66. Megginson, D. 2004. Simple application interface for XML (SAX) 2.0.2. http://sax.sourceforge.net/ [11August 2015].


Meier, W. 2003. Exist: An open source native XML database. Web, Web-Services, and Database Systems, pp. 169-183. Springer. Microsoft Corporation. 2008. Implementing XML in SQL server 2008 R2. https://msdn.microsoft.com/en-us/library/ms189887.aspx [20 July 2015]. Microsoft Corporation. 2012. OPENXML (Transact-SQL) in microsoft SQL server 2012. https://msdn.microsoft.com/en-us/library/ms186918(v=sql.110).aspx [23 June 2015]. Microsoft Corporation. 2012. SQL server 2012's hardware and software requirements, SQL Server Pro Website. http://sqlmag.com/sql-server-2012/sql-server-2012shardware-and-software-requirements [23 June 2015]. Microsoft Corporation. 2016. OPENXML (SQL server) in microsoft SQL server 2016. https://msdn.microsoft.com/en-us/library/ms175160.aspx [03 August 2015]. Miklau, G. & Suciu, D. 2003. XML data repository. University of Washington. http://www.cs.washington.edu/research/xmldatasets/ [09 February 2015]. Mohammad, S. A. 2011. Index structures for XML databases. Doctor of Philosophy, Thesis. School of Computing, Queen’s University, Kingston, Ontario, Canada Mourya, M. & Saxena, P. 2015. Survey of XML to relational database mapping techniques. Advances in Computer Science and Information Technology (ACSIT) 2(2): 162-166. Ning, W. & De, X. 2010. SMX/R: Efficient way of storing and managing XML documents using RDBMSs based on paths. The 2nd International Conference on Computer Engineering and Technology (ICCET), pp. Vl-143. O'neil, P., O'neil, E., Pal, S., Cseri, I., Schaller, G. & Westbury, N. 2004. Ordpaths: insert-friendly XML node labels. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 903-908. Oates, B. J. 2006. New frontiers for information systems research: computer art as an information system. European Journal of Information Systems 15(6): 617626. Oliveira, B., Santos, V. & Belo, O. 2013. Processing XML with java–a performance benchmark. 
International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85. Oosten, J. 2002. Basic category theory. Lecture Notes. Utrecht University.


Parker, J. 2011. A design-based research approach for creating effective online higher education courses. Paper presented at the 26th Annual Research Forum: Education Possibilities (Western Australia Institute for Educational Research Inc), University of Notre Dame. Fremantle. Prakas, S., Bhowmick, S. S. & Madria, S. 2004. SUCXENT: An efficient path-based approach to store and query XML documents. Database and Expert Systems Applications, pp. 285-295. Prakash, S. 2005. Efficient storage and query processing of XML data in relational database systems. Master. Thesis. School of Computer Engineering, Nanyang Technological University. Qin, J., Zhao, S., Yang, S. & Dou, W. 2005. Efficient storing well-formed XML documents using RDBMS. Proceedings of International Conference on Services Systems and Services Management, Proceedings of ICSSSM'05, pp. 1075-1080. Raut, A. D. & Atique, M. 2014. A survey of indexing techniques for XML database. An International Journal of Advanced Computer Technology 3(1): 461-466. Reeves, T. C. 2006. Design research from a technology perspective. Educational design research 1(3): 52-66, London. Ruel, C. 2013. Oracle 12c for dummies. Hoboken, John Wiley & Sons. Sainan, L., Caifeng, L. & Liming, G. 2008. A storage method for XML document based on relational database. International Symposium on Computer Science and Computational Technology (ISCSCT'08), pp. 50-53. Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I. & Busse, R. 2002. XMark: A benchmark for XML data management. Proceedings of the 28th international conference on Very Large Data Bases, pp. 974-985. Senthilkumar, R., Suganya, N., Kiruthika, I. & Kannan, A. 2011. Indexing and querying the compressed XML data (IQCX). Advances in Computing and Information Technology, pp. 198, 497-506. Springer. Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., Dewitt, D. J. & Naughton, J. F. 1999. Relational databases for querying XML documents: Limitations and opportunities. 
Proceedings of the 25th International Conference on Very Large Data Bases, pp. 302-314. Silva, A., Merialdo, P., Crescenzi, V., Tomassi, G. D., Mecca, G. & Refrigeri, G. 2007. XML sigmodrecord. http://www.sigmod.org/publications/sigmodrecord /xml/SigmodRecord.xml/view [25 April 2015].


Soltan, S. & Rahgozar, M. 2006. A clustering-based scheme for labeling XML trees. IJCSNS International Journal of Computer Science and Network Security 6(9A): 84-89. Su-Cheng, H., Chien-Sing, L. & Mustapha, N. 2010. Bridging XML and relational databases: Mapping choices and performance evaluation. IETE Technical Review 27(4): 308-317. Subramaniam, S., Haw, S.-C. & Hoong, P. K. 2012. s-XML: An efficient mapping scheme to bridge XML and relational database. Knowledge-Based Systems 27(2012): 369-380. Suri, M. P. & Sharma, D. 2012. A model mapping approach for storing XML documents in relational databases. International Journal of Computer Science Issues (IJCSI) 9(3): 495-498. Tang, J., Liu, S., Liu, C., Gu, Z. & Gaudiot, J.-L. 2013. Acceleration of XML parsing through prefetching. IEEE Transactions on Computers 62(8): 1616-1628. Tatarinov, I., Viglas, S. D., Beyer, K., Shanmugasundaram, J., Shekita, E. & Zhang, C. 2002. Storing and querying ordered XML using a relational database system. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 204-215. Tian, F., Dewitt, D. J., Chen, J. & Zhang, C. 2002. The design and performance evaluation of alternative XML storage strategies. ACM SIGMOD Record 31(1): 5-10. Vaidya, A. V. 2015. Study of optimal indexing schema for XML query processing and effective storage management 2009 to 2013. PhD. Thesis. Faculty of Management of Tilak Maharashtra Vidyapeeth , Pune. Vainio, J. & Junkkari, M. 2014. SQL-based semantics for path expressions over hierarchical data in relational databases. Journal of Information Science 40(3): 293-312. Von Alan, R. H., March, S. T., Park, J. & Ram, S. 2004. Design science in information systems research. MIS Quarterly 28(1): 75-105. Vyas, S. & Kolte, S. 2014. A review paper for mapping of XML data to relational table. International Journal of Computer Science and Mobile Computing 3(12): 327-332. W3consortium. 2000. Document object model (DOM) 2.0. 
http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/ [12 March 2015]

221

Wang, F. & Hannafin, M. J. 2005. Design-based research and technology-enhanced learning environments. Educational Technology Research and Development 53(4): 5-23. Wang, Q., Ren, Z., Dong, L. & Sheng, Z. 2012. Path-based XML relational storage approach. Physics Procedia 33(2012): 1621-1625. Wu, H., Ling, T. W., Chen, B. & Xu, L. 2011. Twigtable: using semantics in XML twig pattern query processing. Journal on Data Semantics Xv, pp. 102-129. Springer. Wu, J. & Huang, S. Y. 2008. XPred: A new model-mapping-schema-based approach for efficient access to XML data. Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services (iiWAS2008), pp. 123-130. Wu, X., Lee, M. L. & Hsu, W. 2004. A prime number labeling scheme for dynamic ordered XML trees. Proceedings of the 20th International Conference on Data Engineering (ICDE’04), pp. 66-78. Yan, L., Liang, Z., Qiyang, D. & Baile, S. 2005. DTD-based XML indexing. Journal of Computer Research and Development 1(2005): 21-26. Yang, Y., Ng, W., Lau, H. L. & Cheng, J. 2006. An efficient approach to support querying secure outsourced XML information. Advanced Information Systems Engineering, pp. 157-171. Yao, J. & Zhang, M. 2004. A fast tree pattern matching algorithm for XML query. Web Intelligence. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI’04), pp. 235-241. Ying, J., Cao, S. & Long, Y. 2012. An efficient mapping approach to store and query XML documents in relational database. The 2nd International Conference on Computer Science and Network Technology (ICCSNT), pp. 2140-2144. Yoshikawa, M., Amagasa, T., Shimura, T. & Uemura, S. 2001. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology 1(1): 110-141. Yu, H., Wang, X.-K., Gao, Y.-P., Zhang, J.-Y. & Yang, N.-H. 2007. Query optimization based on complicated scheme indexes. 
Application Research of Computers 8(2007): 120-125. Yuan, X., Hu, X., Wu, D., Zhang, H. & Lian, X. 2013. XML data storage and query optimization in relational database by XPath processing model. Journal of Software 8(4): 809-816.

222

Zafari, H., Hasani, K. & Shiri, M. E. 2010. Xlight, an efficient relational schema to store and query XML data. The International Conference on Data Storage and Data Engineering (DSDE), pp. 254-257. Zeis, C., Ruel, C. & Wessler, M. 2009. Oracle 11g for dummies. Hoboken, John Wiley & Sons. Zhang, C., Naughton, J., Dewitt, D., Luo, Q. & Lohman, G. 2001. On supporting containment queries in relational database management systems. ACM SIGMOD Record 30(2): 425-436. Zhang, Q., Cheng, L. & Boutaba, R. 2010. Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications 1(1): 7-18.

223

APPENDIX A

XANCESTOR PROTOTYPE REQUIREMENTS AND INTERFACES



XAncestor Prototype Requirements

The requirements for the implementation of the XAncestor prototype are as follows:

i. Any version of Microsoft Windows that is suitable as a platform environment for implementing the prototype can be used.

ii. A well-formed XML document that conforms to the XML 1.0 grammar specified by the W3C can be used. It must have exactly one root element, which is called the document element. Each start tag must have a corresponding end tag, and the elements must be properly nested within one another. The tags and nesting rules allow XML to represent information in a hierarchical manner.

iii. XPath is used as the source XML query language in which users express their requests, and SQL is used as the target language that is executed against the RDB to answer users’ queries. XPath is used instead of XQuery for the following two reasons:

• XPath is simpler than XQuery, so it is better suited to the objective of testing the model mapping approach in the current context.

• The XPath structure is included in XQuery, so it is easier to upgrade to XQuery later.

iv. The Java programming language under the NetBeans IDE 8.0.1 environment is used as the tool to create the graphical user interface and to implement the prototype components.

v. Microsoft SQL Server 2012 or later is used as the RDBMS to store the XML content. The minimum hardware requirements for Microsoft SQL Server 2012 or later are given in Table A-1.

Table A-1  Minimum Hardware Requirements for Microsoft SQL Server 2012

Hardware        Minimum Requirement
CPU Processor   1.0 GHz CPU Processor for a 32-bit (x86) implementation and a 1.4 GHz CPU Processor for a 64-bit (x64) implementation. Microsoft’s recommended minimum processor speed is 2.0 GHz
Memory          Express edition requires a minimum of 512 MB of RAM, whereas the other editions require a minimum of 1 GB. Microsoft’s recommended minimum RAM for SQL Server is 4 GB
Hard Disk       811 MB (Database Engine and data files, Replication, Full-Text Search, and Data Quality Services), 345 MB (Analysis Services and data files), 304 MB (Reporting Services and Report Manager), 591 MB (Integration Services), 1823 MB (Client Components), 200 MB (SQL Server Books Online)
Drive           CD-ROM or DVD drive
Display         Super VGA (800 × 600) or higher-resolution monitor
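Requirement ii above (a well-formed XML document with exactly one root element and properly nested, closed tags) can be checked before a document is passed to the mapping step. The following is a minimal sketch in Python rather than the prototype's Java; the function name `is_well_formed` and the three sample documents are hypothetical and are not taken from the thesis datasets.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents (assumptions, for illustration only).
GOOD = "<DBGroup><Member><Name>Ali</Name><Age>27</Age></Member></DBGroup>"
NO_CLOSING_TAG = "<DBGroup><Member><Name>Ali</Member></DBGroup>"
TWO_ROOTS = "<DBGroup/><DBGroup/>"

if __name__ == "__main__":
    for doc in (GOOD, NO_CLOSING_TAG, TWO_ROOTS):
        print(is_well_formed(doc), doc)
```

The parser rejects both the missing end tag and the second root element, which are the two conditions stated in requirement ii.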



XAncestor Prototype Interfaces

The XAncestor prototype is developed in the NetBeans IDE 8.0.1 environment using the Java programming language and is connected to Microsoft SQL Server 2012. Figure A-1 shows the flowchart for the XAncestor prototype.

Figure A-1  Flowchart for XAncestor prototype

The features and functions of the XAncestor prototype interfaces are described below.

i. XAncestor Prototype Main Interface

The main interface of the XAncestor prototype is shown in Figure A-2. The name of the prototype, which is ‘XAncestor – A Mapping Approach’, is shown in the upper area of the interface.

Figure A-2  Main interface of XAncestor prototype

The main interface contains four buttons that are positioned in the middle of the interface:

• XML Mapping (XtoDB) Button: The XML mapping (XtoDB) interface appears in the middle of the desktop screen when this button is pressed. This interface represents the proposed XtoDB algorithm for mapping XML documents into the RDB scheme.

• Query Mapping (XtoSQL) Button: The query mapping (XtoSQL) interface appears in the middle of the desktop screen when this button is pressed. This interface represents the proposed XtoSQL algorithm for translating different types of XPath query into the corresponding SQL query based on the RDB scheme. The result is then translated into the XML hierarchical format and returned to the user.

• Display RDB Button: The Display RDB interface appears in the middle of the desktop screen when this button is pressed. This interface represents the content of the RDB scheme (Ancestor_Path and Leaf_Node tables) and allows the user to delete the content of these tables.

• Exit Button: The exit dialogue box, namely ‘Exit Confirm’, appears when the user presses this button in order to terminate the prototype. This dialogue box, which is shown in Figure A-3, asks the user to confirm that he/she wants to terminate the prototype.

Figure A-3  Exit dialogue box

ii. Interface of XML Mapping (XtoDB)

This interface appears when the user presses the XML Mapping (XtoDB) button that is shown in Figure A-2. As shown in Figure A-4, this interface allows the user to select the XML document that needs to be mapped into the RDB scheme by determining the location of the document using the small button shown in the upper area of the interface. The name of the document, together with its path location, then appears in the text box that is shown in the upper area of the interface.

Figure A-4  Interface of XML mapping (XtoDB)

The XML mapping (XtoDB) interface contains the following toolboxes:

• XML-scheme (Left JPanel): The XML-scheme that exists in the left panel of this interface displays the content of the selected XML document in a hierarchical format.

• RDB scheme (Right JPanel): The RDB scheme that exists in the right panel of this interface contains the Ancestor_Path and Leaf_Node tables, and displays the mapped XML data as tuples in these tables when the user selects the XML document and presses the mapping process button.

• Mapping Process Button: When the user presses this button, the XML mapping algorithm (XtoDB) is executed to map the selected XML document into the RDB stored in the Microsoft SQL Server and displays the mapped XML data as tuples in the Ancestor_Path and Leaf_Node tables in the right-hand panel.

• Main Window Button: This button allows the user to return to the main interface of the prototype.
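The mapping step described above (an XML document goes in, tuples in the Ancestor_Path and Leaf_Node tables come out) can be illustrated with a small sketch. This is not the XtoDB algorithm of the thesis: it uses Python with SQLite in place of the Java and Microsoft SQL Server prototype, it stores leaf elements only (attributes are ignored), and it assumes that Ances_Pos is a running number identifying the parent element occurrence. The table and column names are those that appear in the translated queries of Table C-11; the function names and the sample document are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

SCHEMA = """
CREATE TABLE IF NOT EXISTS Ancestor_Path (
    Ances_PathID  INTEGER PRIMARY KEY,
    Ances_PathExp TEXT
);
CREATE TABLE IF NOT EXISTS Leaf_Node (
    Node_Name    TEXT,
    Node_Value   TEXT,
    Ances_PathID INTEGER,
    Ances_Pos    INTEGER
);
"""

# Hypothetical sample document (assumption, for illustration only).
SAMPLE_XML = (
    "<DBGroup>"
    "<Member><Name>Ali</Name><Age>27</Age></Member>"
    "<Member><Name>Sara</Name><Age>31</Age></Member>"
    "</DBGroup>"
)


def _path_id(conn, path_exp):
    """Return the Ances_PathID of path_exp, inserting the path if it is new."""
    row = conn.execute(
        "SELECT Ances_PathID FROM Ancestor_Path WHERE Ances_PathExp = ?",
        (path_exp,),
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO Ancestor_Path (Ances_PathExp) VALUES (?)", (path_exp,)
    ).lastrowid


def map_xml_to_rdb(xml_text, conn):
    """Store each leaf element with the path expression of its ancestors."""
    conn.executescript(SCHEMA)
    parent_counter = 0  # assumed meaning of Ances_Pos: n-th parent element seen

    def visit(elem, ancestor_path, ances_pos):
        nonlocal parent_counter
        if len(elem) == 0:  # leaf element: one Leaf_Node tuple
            conn.execute(
                "INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)",
                (elem.tag, (elem.text or "").strip(),
                 _path_id(conn, ancestor_path), ances_pos),
            )
            return
        parent_counter += 1
        own_pos = parent_counter
        for child in elem:
            visit(child, f"{ancestor_path}/{elem.tag}", own_pos)

    visit(ET.fromstring(xml_text), "", 0)
    conn.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    map_xml_to_rdb(SAMPLE_XML, db)
    print(db.execute("SELECT * FROM Ancestor_Path").fetchall())
    print(db.execute("SELECT * FROM Leaf_Node").fetchall())
```

In this sketch the four leaf nodes of the sample share a single Ancestor_Path tuple, which is consistent with the small Ancestor_Path row counts reported for XAncestor in Table C-2.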

iii. Interface of Query Mapping (XtoSQL)

This interface appears when the Query Mapping (XtoSQL) button, which is shown in Figure A-2, is pressed. As shown in Figure A-5, this interface allows the user to enter the XPath query in the textbox and select its type from the radio buttons in the upper area of the interface. The interface then translates the entered XPath query, based on the selected query type, into the corresponding SQL query and executes it over the RDB scheme. Finally, the result is translated into the XML hierarchical format and displayed to the user.

Figure A-5  Interface of query mapping (XtoSQL)

The query mapping (XtoSQL) interface contains the following toolboxes:

• Translated SQL Query (Left JPanel): The translated SQL query is displayed inside the textbox located in this panel.

• RDB Results (Left JPanel): The result of the translated SQL query that is executed over the RDB scheme is displayed as tuples in this panel.

• XML Results (Right JPanel): The result of the translated SQL query that is executed over the RDB scheme is displayed in an XML hierarchical format in this panel.

• Mapping Process Button: When the user presses this button, the query mapping algorithm (XtoSQL) is executed, thus translating the entered XPath query based on its type into the corresponding SQL query, and executing it over the RDB scheme. Then, the SQL result is displayed and translated into the XML hierarchical format, which is displayed to the user.

• Reset Button: This button resets all the toolboxes in this interface: it clears the textboxes, unchecks the radio buttons, and re-enables the buttons.

• Main Window Button: This button allows the user to return to the main interface of the prototype.
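The translation performed by the Mapping Process button can be illustrated for the simplest query types. The sketch below is not the XtoSQL algorithm itself; it only reproduces the shape of the XAncestor queries listed as (f) for Q1 and Q3 in Table C-11, where the last step of the path becomes the Node_Name and the preceding steps become the ancestor path expression. The function name `xpath_to_sql` and the restriction to child steps with at most one leading ‘//’ are assumptions of this sketch; predicates and inner ‘//’ steps are rejected.

```python
import re

# Conservative element-name pattern (an assumption of this sketch; it also
# keeps quotes and predicates out of the generated SQL text).
_NAME = re.compile(r"[A-Za-z_][\w.\-]*$")


def xpath_to_sql(xpath: str) -> str:
    """Translate a simple XPath into SQL over Ancestor_Path and Leaf_Node."""
    if xpath.startswith("//"):
        descendant, body = True, xpath[2:]
    elif xpath.startswith("/"):
        descendant, body = False, xpath[1:]
    else:
        raise ValueError(f"unsupported XPath for this sketch: {xpath!r}")

    steps = body.split("/")
    if len(steps) < 2 or not all(_NAME.match(step) for step in steps):
        raise ValueError(f"unsupported XPath for this sketch: {xpath!r}")

    *ancestors, leaf = steps
    ancestor_path = "/" + "/".join(ancestors)
    if descendant:
        # A leading '//' becomes a suffix match on the ancestor path.
        condition = f"P.Ances_PathExp Like '%{ancestor_path}'"
    else:
        condition = f"P.Ances_PathExp = '{ancestor_path}'"

    return (
        "Select L.Node_Value From Leaf_Node L, Ancestor_Path P "
        f"Where {condition} "
        "And L.Ances_PathID = P.Ances_PathID "
        f"And L.Node_Name = '{leaf}';"
    )


if __name__ == "__main__":
    print(xpath_to_sql("/SigmodRecord/issue/volume"))
    print(xpath_to_sql("//article/title"))
```

The SQL is built as text because the interface displays the translated query to the user; code that only needs to execute the query would normally pass the path and node name as bound parameters instead.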


iv. Interface of RDB Display

This interface appears to the user when the Display RDB button is pressed, as shown in Figure A-2. As shown in Figure A-6, the title, ‘Relational Database (RDB) Schema’, is shown in the upper area of the interface.

Figure A-6  Interface of RDB display

This interface displays the content of the RDB scheme, which stores the mapped XML data as tuples in the tables (Ancestor_Path and Leaf_Node). These tables and their data are displayed in the middle panel of this interface. In addition, this interface contains two buttons, namely the Delete RDB and Main Window buttons. The Delete RDB button allows the user to delete the mapped data from the RDB scheme. The Main Window button allows the user to return to the main interface of the prototype.

APPENDIX B

RESULTS OF PRE-EXPERIMENT ON EXISTING MODEL MAPPING APPROACHES BASED ON XML DATASET (DBGROUP)

Experiment Environment

Intel(R) Atom(TM) CPU N450 @1.66GHz 1.67 GHz - 2GB RAM - 256GB HDD
Microsoft SQL Server 2012 - Microsoft Windows 7 - 32 bits

Experiment Measurements

• Storage Space (XML File: 2 KB)

Table B-1  Results of RDB Storage Space Evaluation of DBGroup for Existing Approaches

Approach     RDB Scheme                          No. of Rows     Total Rows  RDB Storage Space (KB)
XRel         Path, Element, Attribute, Text      15, 67, 10, 44  136         416
XParent      LabelPath, DataPath, Element, Data  15, 76, 77, 54  222         420
XPEV         Edge, Value, Path                   77, 54, 15      146         268
XRecursive   Tag_Structure, Tag_Value            77, 54          131         136
Suri-Sharma  Node, Data                          15, 77          92          140
Ying-Cao     InnerNodes, LeafNodes, Path         77, 54, 11      142         424
s-XML        ParentTable, ChildTable             77, 54          131         140

Figure B-1  Comparison of RDB Storage Space of Approaches for DBGroup (chart of RDB storage space (KB) per model mapping approach: XRel, XParent, XPEV, XRecursive, Suri-Sharma, Ying-Cao, s-XML)

• Query Response Time

Table B-2  Query Description for DBGroup Dataset

Query expression                        Feature
Q1: /DBGroup/Project/Dept               simple path expression (short)
Q2: /DBGroup/Project/Duration/Day       simple path expression (long)
Q3: //Member/Age                        one ‘//’
Q4: //Member//Office                    two ‘//’s
Q5: /DBGroup/Member[2]                  index
Q6: (/DBGroup/Member)[2]/Name           grouping and index
Q7: /DBGroup/Member/Name[Age = '27']    text matching
Q8: /DBGroup//Member[Age > 20]/Name     ‘//’ and text matching

i. XRel Approach

Table B-3  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using XRel

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   109  78   93   93   93   187  109  109  93   93    105.7
Q2   84   140  93   109  93   93   109  109  202  78    111.0
Q3   171  93   96   109  187  93   74   109  109  109   115.0
Q4   109  113  109  93   124  109  109  93   62   93    101.4
Q5   78   78   78   67   98   15   93   78   78   62    72.5
Q6   140  93   109  124  109  109  93   156  66   109   110.8
Q7   140  140  140  140  140  140  124  109  124  140   133.7
Q8   135  124  125  109  156  124  124  140  156  124   131.7
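The Average column of Table B-3 is consistent with the arithmetic mean of the ten runs, rounded to one decimal place. As a worked check, the sketch below recomputes two of the averages from the run values listed in the table; the function name `average_ms` is hypothetical.

```python
from statistics import mean

# Ten measured response times (ms) per query, copied from Table B-3 (XRel).
XREL_Q1_RUNS_MS = [109, 78, 93, 93, 93, 187, 109, 109, 93, 93]
XREL_Q5_RUNS_MS = [78, 78, 78, 67, 98, 15, 93, 78, 78, 62]


def average_ms(runs):
    """Arithmetic mean of the runs, rounded to one decimal place."""
    return round(mean(runs), 1)


if __name__ == "__main__":
    print(average_ms(XREL_Q1_RUNS_MS))  # sum is 1057 over 10 runs
    print(average_ms(XREL_Q5_RUNS_MS))  # sum is 725 over 10 runs
```

Both results (105.7 ms and 72.5 ms) agree with the values reported for Q1 and Q5 in the table.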

Figure B-2  Comparison of Query Response Times for Q1-Q8 on DBGroup using XRel (chart of query response time (ms) per query, Q1-Q8)

ii. XParent Approach

Table B-4  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using XParent

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   124  265  109  93   124  93   109  171  109  93    129.0
Q2   296  78   93   93   93   62   156  93   93   78    113.5
Q3   124  109  93   93   140  234  187  280  93   109   146.2
Q4   124  78   104  109  124  109  109  109  109  109   108.4
Q5   140  109  156  62   70   124  93   62   62   62    94.0
Q6   78   109  78   109  78   78   109  124  78   78    91.9
Q7   124  156  140  140  140  164  124  140  99   156   138.3
Q8   109  124  109  249  140  109  327  156  140  140   160.3

Figure B-3  Comparison of Query Response Times for Q1-Q8 on DBGroup using XParent (chart of query response time (ms) per query, Q1-Q8)

iii. XPEV Approach

Table B-5  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using XPEV

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   78   93   78   78   93   78   93   62   93   62    80.8
Q2   78   78   62   140  156  93   78   78   78   93    93.4
Q3   327  62   78   78   93   93   48   78   78   171   110.6
Q4   62   62   62   78   15   76   48   78   62   78    62.1
Q5   109  31   78   78   15   78   78   78   15   62    62.2
Q6   87   78   127  140  78   62   78   94   124  62    93.0
Q7   109  109  109  124  202  202  109  78   124  124   129.0
Q8   93   109  78   124  104  124  93   187  114  98    112.4

Figure B-4  Comparison of Query Response Times for Q1-Q8 on DBGroup using XPEV (chart of query response time (ms) per query, Q1-Q8)

iv. XRecursive Approach

Table B-6  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using XRecursive

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   93   78   93   93   78   93   78   93   93   109   90.1
Q2   109  93   109  156  140  87   124  109  93   109   112.9
Q3   75   78   93   109  31   93   78   78   93   46    77.4
Q4   171  156  171  157  171  146  156  171  156  140   159.5
Q5   171  171  124  208  171  171  131  124  156  140   156.7
Q6   140  156  109  124  140  140  140  156  136  140   138.1
Q7   109  140  109  78   140  124  109  144  187  70    121.0
Q8   124  95   124  101  127  139  140  202  95   124   127.1

Figure B-5  Comparison of Query Response Times for Q1-Q8 on DBGroup using XRecursive (chart of query response time (ms) per query, Q1-Q8)

v. Suri-Sharma Approach

Table B-7  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using Suri-Sharma

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   124  109  140  187  109  109  95   205  93   109   128.0
Q2   124  101  140  124  118  124  156  156  124  124   129.1
Q3   109  171  140  148  124  140  218  104  124  121   139.9
Q4   421  304  249  265  274  310  406  296  249  280   305.4
Q5   78   78   104  78   78   68   31   91   78   78    76.2
Q6   109  93   109  78   94   93   124  109  109  109   102.7
Q7   168  156  140  158  124  156  124  140  156  124   144.6
Q8   234  140  140  174  156  234  171  156  140  140   168.5

Figure B-6  Comparison of Query Response Times for Q1-Q8 on DBGroup using Suri-Sharma (chart of query response time (ms) per query, Q1-Q8)

vi. Ying-Cao Approach

Table B-8  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using Ying-Cao

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   100  124  62   43   62   62   193  78   90   62    87.6
Q2   51   36   78   62   85   45   62   50   73   62    60.4
Q3   38   156  62   62   40   54   46   78   31   18    58.5
Q4   78   46   93   62   61   14   194  62   78   62    75.0
Q5   78   62   93   56   78   62   58   57   78   62    68.4
Q6   75   78   82   117  109  77   109  108  78   57    89.0
Q7   140  93   141  124  72   124  153  124  64   150   118.5
Q8   203  271  234  201  249  234  234  234  374  312   254.6

Figure B-7  Comparison of Query Response Times for Q1-Q8 on DBGroup using Ying-Cao (chart of query response time (ms) per query, Q1-Q8)

vii. s-XML Approach

Table B-9  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using s-XML

Qn   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10]  Average
Q1   93   93   83   134  78   93   65   78   100  80    89.7
Q2   102  164  106  206  109  109  93   93   106  109   119.7
Q3   156  170  130  143  156  156  187  156  156  140   155.0
Q4   109  234  176  101  134  140  148  171  171  109   149.3
Q5   62   62   47   62   78   68   36   9    71   78    57.3
Q6   95   62   59   78   31   78   40   79   48   36    60.6
Q7   109  140  109  103  127  187  120  156  109  111   127.1
Q8   405  390  374  421  405  394  405  405  405  421   402.5

Figure B-8  Comparison of Query Response Times for Q1-Q8 on DBGroup using s-XML (chart of query response time (ms) per query, Q1-Q8)

• Query Response Time (All Approaches)

Table B-10  Query Response Times (ms) for Q1-Q8 on DBGroup Dataset using All Approaches

Approach     Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8
XRel         105.7  111.0  115.0  101.4  72.5   110.8  133.7  131.7
XParent      129.0  113.5  146.2  108.4  94.0   91.9   138.3  160.3
XPEV         80.8   93.4   110.6  62.1   62.2   93.0   129.0  112.4
XRecursive   90.1   112.9  77.4   159.5  156.7  138.1  121.0  127.1
Suri-Sharma  128.0  129.1  139.9  305.4  76.2   102.7  144.6  168.5
Ying-Cao     87.6   60.4   58.5   75.0   68.4   89.0   118.5  254.6
s-XML        89.7   119.7  155.0  149.3  57.3   60.6   127.1  402.5

Figure B-9  Comparison of Query Response Times (ms) for Q1-Q8 on DBGroup using All Approaches (chart of query response time (ms) per query, Q1-Q8, with one series per approach: XRel, XParent, XPEV, XRecursive, Suri-Sharma, Ying-Cao, s-XML)


APPENDIX C

RESULTS OF EXPERIMENTS ON MODEL MAPPING APPROACHES BASED ON XML DATASETS (SIGMOD, DBLP1, DBLP2, DBLP3)

Experiment Environment

Intel(R) Atom(TM) CPU N450 @1.66GHz 1.67 GHz - 2GB RAM - 256GB HDD
Microsoft SQL Server 2012 - Microsoft Windows 7 - 32 bits

a. Experiment Dataset (SIGMOD)

Table C-1  Description of SIGMOD Dataset Used

Dataset  Description           Size    Elements  Attributes  Max-Depth
SIGMOD   SIGMOD Record in XML  467 KB  11526     3737        6

• Storage Space

Table C-2  Results of RDB Storage Space Evaluation of SIGMOD for All Approaches

Approach    RDB Scheme                         No. of Rows            Total Rows  RDB Storage Space (MB)
XRel        Path, Element, Attribute, Text     12, 11526, 3737, 8383  23658       1.84
SMX/R       Path_Index, Path                   12, 23646              23658       1.90
Ying-Cao    File, Path, InnerNodes, LeafNodes  1, 7, 15263, 12120     27391       2.06
XRecursive  Tag_Structure, Tag_Value           15263, 12120           27383       1.77
s-XML       ParentTable, ChildTable            15263, 12120           27383       2.36
XAncestor   Ancestor_Path, Leaf_Node           4, 12120               12124       1.58

Figure C-1  Comparison of RDB Storage Space of Approaches for SIGMOD (chart of RDB storage space (MB) per model mapping approach: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor)

• Query Response Time

Table C-3  Query Description for SIGMOD Dataset

Query expression                                                                                              Feature
Q1: /SigmodRecord/issue/volume                                                                                simple path expression (short)
Q2: /SigmodRecord/issue/articles/article/authors/author                                                       simple path expression (long)
Q3: //article/title                                                                                           one ‘//’
Q4: //articles//author                                                                                        two ‘//’s
Q5: /SigmodRecord/issue/articles/article[title = 'Database Directions III Workshop Review.']/endPage          Twig query with “/”
Q6: //article[initPage > 50]/title                                                                            Twig query with “//”

i. XRel Approach

Table C-4  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using XRel

Qn   [1]    [2]    [3]    [4]    [5]    [6]    Average
Q1   1096   1014   997    1045   1006   998    1026
Q2   34756  35412  35911  35837  33898  34076  34982
Q3   16910  16198  16582  17800  16161  16629  16713
Q4   1694   1591   1539   1497   1700   1603   1604
Q5   766    826    735    936    873    741    813
Q6   6817   7347   7456   7176   6891   6957   7107

Figure C-2  Comparison of Query Response Times for Q1-Q6 on SIGMOD using XRel (chart of query response time (ms) per query, Q1-Q6, logarithmic axis)

ii. SMX/R Approach

Table C-5  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using SMX/R

Qn   [1]    [2]    [3]    [4]    [5]    [6]    Average
Q1   2028   2028   2018   2031   1923   2040   2011
Q2   68697  68237  69832  72519  69657  69729  69779
Q3   26829  25864  26150  26683  27768  26254  26591
Q4   70870  73710  67376  69647  73148  71978  71122
Q5   1107   1185   1346   1451   1072   1092   1209
Q6   33150  32682  31683  33680  32073  31808  32513

Figure C-3  Comparison of Query Response Times for Q1-Q6 on SIGMOD using SMX/R (chart of query response time (ms) per query, Q1-Q6, logarithmic axis)

iii. Ying-Cao Approach

Table C-6  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using Ying-Cao

Qn   [1]   [2]   [3]   [4]   [5]   [6]   Average
Q1   171   188   238   191   187   171   191
Q2   483   483   499   492   460   497   486
Q3   394   405   405   390   377   390   394
Q4   468   546   468   499   483   530   499
Q5   951   954   905   998   936   982   954
Q6   2224  2210  2168  2207  2496  2180  2248

Figure C-4  Comparison of Query Response Times for Q1-Q6 on SIGMOD using Ying-Cao (chart of query response time (ms) per query, Q1-Q6, logarithmic axis)

iv. XRecursive Approach

Table C-7  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using XRecursive

Qn   [1]   [2]   [3]   [4]   [5]   [6]   Average
Q1   702   717   733   650   661   670   689
Q2   4477  4227  4255  4274  4524  4555  4385
Q3   1372  1256  1272  1326  1326  1326  1313
Q4   2730  2752  2769  2761  2776  2792  2763
Q5   1248  1248  1280  1575  1263  1279  1316
Q6   4897  5132  5259  5132  4726  4851  5000

Figure C-5  Comparison of Query Response Times for Q1-Q6 on SIGMOD using XRecursive (chart of query response time (ms) per query, Q1-Q6, logarithmic axis)

v. s-XML Approach

Table C-8  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using s-XML

Qn   [1]   [2]   [3]   [4]   [5]   [6]   Average
Q1   530   561   605   624   514   566   567
Q2   2776  2741  2745  2745  2745  2854  2768
Q3   1372  1372  1326  1334  1606  1421  1405
Q4   3088  2776  2731  2792  2714  2774  2813
Q5   2043  2028  2043  2277  2324  2324  2173
Q6   5098  5163  5070  4982  5116  5113  5090

Figure C-6  Comparison of Query Response Times for Q1-Q6 on SIGMOD using s-XML (chart of query response time (ms) per query, Q1-Q6, logarithmic axis)

vi. XAncestor Approach

Table C-9  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using XAncestor

Qn   [1]  [2]  [3]  [4]  [5]  [6]  Average
Q1   181  187  148  187  168  156  171
Q2   452  468  539  483  468  452  477
Q3   452  358  358  358  390  343  377
Q4   452  468  468  468  552  436  474
Q5   312  327  312  280  249  296  296
Q6   639  608  553  608  611  624  607

Figure C-7  Comparison of Query Response Times for Q1-Q6 on SIGMOD using XAncestor (chart of query response time (ms) per query, Q1-Q6)

• Query Response Time (All Approaches)

Table C-10  Query Response Times (ms) for Q1-Q6 on SIGMOD Dataset using All Approaches

Approach    Q1    Q2     Q3     Q4     Q5    Q6
XRel        1026  34982  16713  1604   813   7107
SMX/R       2011  69779  26591  71122  1209  32513
Ying-Cao    191   486    394    499    954   2248
XRecursive  689   4385   1313   2763   1316  5000
s-XML       567   2768   1405   2813   2173  5090
XAncestor   171   477    377    474    296   607

Figure C-8  Comparison of Query Response Times for Q1-Q6 on SIGMOD using All Approaches (chart of query response time (ms), logarithmic axis, against the translated SQL queries Q1-Q6, with one series per approach: XRel, SMX/R, Ying-Cao, XRecursive, s-XML, XAncestor)
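To make the XAncestor figures in Table C-10 easier to relate to the underlying query, the following sketch runs the XAncestor translation of Q1 (query (f) for Q1 in Table C-11) against a toy database. It is only an illustration under stated assumptions: SQLite is used in place of Microsoft SQL Server, the column types are guesses, and the few tuples are hypothetical rather than taken from the SIGMOD dataset. Only the table names, column names, and the query text follow the thesis.

```python
import sqlite3

# XAncestor translation of Q1 (/SigmodRecord/issue/volume), as in Table C-11 (f).
Q1_XANCESTOR = (
    "Select L.Node_Value From Leaf_Node L, Ancestor_Path P "
    "Where P.Ances_PathExp = '/SigmodRecord/issue' "
    "And L.Ances_PathID = P.Ances_PathID "
    "And L.Node_Name = 'volume';"
)


def build_toy_db():
    """Create an in-memory database with a few hypothetical tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Ancestor_Path (
            Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT);
        CREATE TABLE Leaf_Node (
            Node_Name TEXT, Node_Value TEXT,
            Ances_PathID INTEGER, Ances_Pos INTEGER);
        INSERT INTO Ancestor_Path VALUES
            (1, '/SigmodRecord/issue'),
            (2, '/SigmodRecord/issue/articles/article');
        INSERT INTO Leaf_Node VALUES
            ('volume', '11', 1, 1),
            ('number', '1', 1, 1),
            ('title', 'A hypothetical title', 2, 1),
            ('volume', '12', 1, 2);
        """
    )
    return conn


def run_q1(conn):
    """Return the volume values selected by the translated query, sorted."""
    return sorted(row[0] for row in conn.execute(Q1_XANCESTOR))


if __name__ == "__main__":
    print(run_q1(build_toy_db()))
```

The query needs a single join between the two tables regardless of the path length, because the whole ancestor path is matched as one string in Ancestor_Path.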


Table C-11  Translated SQL Queries with their Query Response Time to Retrieve Each Query (Q1-Q6) Based on (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor Approaches on SIGMOD Dataset

Q1

(a) Select v.Value From Element e, Path p1, Path p2, Text v Where p1.PathExp = '#/SigmodRecord#/issue' And p2.PathExp = '#/SigmodRecord#/issue#/volume' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
    Time (ms): 1026

(b) Select v.NodeValue From Path_Index_Table p1, Path_Index_Table p2, Path_Table e, Path_Table v Where p1.Path = '.SigmodRecord.issue' And p2.Path = '.SigmodRecord.issue.volume' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
    Time (ms): 2011

(c) Select v.LeafValue From LeafNodes v, Path p Where p.PathExp = '/SigmodRecord/issue/volume' And p.PathID = v.PathID;
    Time (ms): 191

(d) Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'volume' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'issue' And t2.Pid in (Select distinct t3.id From Tag_Structure t3 Where TagName = 'SigmodRecord')));
    Time (ms): 689

(e) Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'volume' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'issue' And Pname = 'SigmodRecord'));
    Time (ms): 567

(f) Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp = '/SigmodRecord/issue' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'volume';
    Time (ms): 171

Q2

(a) Select v.Value From Element e, Path p1, Path p2, Text v Where p1.PathExp = '#/SigmodRecord#/issue#/articles#/article#/authors' And p2.PathExp = '#/SigmodRecord#/issue#/articles#/article#/authors#/author' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
    Time (ms): 34982

(b) Select v.NodeValue From Path_Index_Table p1, Path_Index_Table p2, Path_Table e, Path_Table v Where p1.Path = '.SigmodRecord.issue.articles.article.authors' And p2.Path = '.SigmodRecord.issue.articles.article.authors.author' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
    Time (ms): 69779

(c) Select v.LeafValue From LeafNodes v, Path p Where p.PathExp = '/SigmodRecord/issue/articles/article/authors/author' And p.PathID = v.PathID;
    Time (ms): 486

(d) Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'author' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'authors' And t2.Pid in (Select distinct t3.id From Tag_Structure t3 Where TagName = 'article' And t3.Pid in (Select distinct t4.id From Tag_Structure t4 Where TagName = 'articles' And t4.Pid in (Select distinct t5.id From Tag_Structure t5 Where TagName = 'issue' And t5.Pid in (Select distinct t6.id From Tag_Structure t6 Where TagName = 'SigmodRecord'))))));
    Time (ms): 4385

(e) Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'author' And LParent in (Select distinct IdNode From ParentTable Where Cname = 'authors' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'article' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'articles' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'issue' And Pname = 'SigmodRecord')))));
    Time (ms): 2768

(f) Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp = '/SigmodRecord/issue/articles/article/authors' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'author';
    Time (ms): 477

Q3

(a) Select v.Value From Element e, Path p1, Path p2, Text v Where p1.PathExp Like '%#/article' And p2.PathExp Like '%#/article#/title' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
    Time (ms): 16713

(b) Select v.NodeValue From Path_Index_Table p1, Path_Index_Table p2, Path_Table e, Path_Table v Where p1.Path Like '%.article' And p2.Path Like '%.article.title' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
    Time (ms): 26591

(c) Select v.LeafValue From LeafNodes v, Path p Where p.PathExp Like '%/article/title' And p.PathID = v.PathID;
    Time (ms): 394

(d) Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'title' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'article'));
    Time (ms): 1313

(e) Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'title' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'article'));
    Time (ms): 1405

(f) Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp Like '%/article' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'title';
    Time (ms): 377

Q4

(a) Select v.Value From Element e, Path p1, Path p2, Text v Where p1.PathExp Like '%#/articles' And p2.PathExp Like '%#/articles%#/author' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
    Time (ms): 1604

(b) Select v.NodeValue From Path_Index_Table p1, Path_Index_Table p2, Path_Table e, Path_Table v Where p1.Path Like '%.articles' And p2.Path Like '%.articles%.author' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
    Time (ms): 71122

(c) Select v.LeafValue From LeafNodes v, Path p Where p.PathExp Like '%/articles%/author' And p.PathID = v.PathID;
    Time (ms): 499

(d) Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'author' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where t2.Pid in (Select distinct t2.id From Tag_Structure t2 Where t2.Pid in (Select distinct t3.id From Tag_Structure t3 Where TagName = 'articles'))));
    Time (ms): 2763

(e) Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'author' And Lparent in (Select distinct IdNode From ParentTable Where LParent in (Select distinct IdNode From ParentTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'articles'))));
    Time (ms): 2813

(f) Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp Like '%/articles%' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'author';
    Time (ms): 474

Q5

(a) Select v2.Value From Element e, Path p1, Path p2, Path p3, Text v1, Text v2 Where p1.PathExp = '#/SigmodRecord#/issue#/articles#/article' And p2.PathExp = '#/SigmodRecord#/issue#/articles#/article#/title' And p3.PathExp = '#/SigmodRecord#/issue#/articles#/article#/endPage' And e.PathID = p1.PathID And v1.PathID = p2.PathID And v2.PathID = p3.PathID And e.Start < v1.Start And e.End > v1.End And e.Start < v2.Start And e.End > v2.End And v1.Value = 'Database Directions III Workshop Review.';
    Time (ms): 813

(b) Select v2.NodeValue From Path_Table e, Path_Index_Table p1, Path_Index_Table p2, Path_Index_Table p3, Path_Table v1, Path_Table v2 Where p1.Path = '.SigmodRecord.issue.articles.article' And p2.Path = '.SigmodRecord.issue.articles.article.title' And p3.Path = '.SigmodRecord.issue.articles.article.endPage' And e.PID = p1.PID And v1.PID = p2.PID And v2.PID = p3.PID And e.StartPos < v1.StartPos And e.EndPos > v1.EndPos And e.StartPos < v2.StartPos And e.EndPos > v2.EndPos And v1.NodeValue = 'Database Directions III Workshop Review.' And v2.NodeType = '3';
    Time (ms): 1209

(c) Select v1.LeafValue From LeafNodes v1, Path p1 Where p1.PathExp = '/SigmodRecord/issue/articles/article/endPage' And v1.PathID = p1.PathID And v1.ParentID in (Select distinct t1.INodeID From InnerNodes t1 Where t1.ParentId in (Select distinct t2.ParentId From InnerNodes t2 Where t2.INodeId in (Select distinct v2.ParentId From LeafNodes v2, Path p2 Where p2.PathExp = '/SigmodRecord/issue/articles/article/title' And v2.PathId = p2.PathId And v2.LeafValue = 'Database Directions III Workshop Review.')));
    Time (ms): 954

(d) Select v2.Value From Tag_Value v2 Where v2.TagID = (Select distinct t7.Id From Tag_Structure t7 Where t7.TagName = 'endPage' And t7.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'article' And t8.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'articles' And t8.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'issue' And t8.Pid in (Select distinct t9.id From Tag_Structure t9 Where t9.TagName = 'SigmodRecord')))) And t7.Pid = (Select distinct t1.Pid From Tag_Structure t1 Where t1.TagName = 'title' And t1.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'article' And t8.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'articles' And t8.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'issue' And t2.Pid in (Select distinct t3.id From Tag_Structure t3 Where t3.TagName = 'SigmodRecord')))) And t1.id = (Select distinct v1.TagID From Tag_Value v1 Where v1.Value = 'Database Directions III Workshop Review.')));
    Time (ms): 1316

(e) Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'endPage' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'article' And LParent in (Select distinct IdNode From ParentTable Where Cname = 'articles' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'issue' And Pname = 'SigmodRecord'))) And Lparent in (Select distinct LParent From ParentTable Where Cname = 'title' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'article' And LParent in (Select distinct IdNode From ParentTable Where Cname = 'articles' And Lparent in (Select distinct IdNode From ParentTable Where Cname = 'issue' And Pname = 'SigmodRecord'))) And IdNode in (Select distinct LParent From ChildTable Where Value = 'Database Directions III Workshop Review.')));
    Time (ms): 2173

(f) Select L2.Node_Value From Leaf_Node L1, Leaf_Node L2, Ancestor_Path P Where p.Ances_PathExp = '/SigmodRecord/issue/articles/article' And L1.Ances_PathID = P.Ances_PathID And L1.Node_Name = 'title' And L1.Node_Value = 'Database Directions III Workshop Review.' And L2.Node_Name = 'endPage' And L1.Ances_PathID = L2.Ances_PathID And L1.Ances_Pos = L2.Ances_Pos;

296

Q6

(a)
Select v2.Value From Element e, Path p1 , Path p2, Path p3 ,Text v1 , Text v2 Where p1.PathExp Like '%#/article' And p2.PathExp Like '%#/article#/initPage' And p3.pathExp Like '%#/article#/title' And e.PathID = p1.PathID And v1.PathID = p2.PathID And v2.PathID = p3.PathID And e.Start < v1.Start And e.End > v1.End And e.Start < v2.Start And e.End > v2.End And v1.Value > '50';
Time (ms): 7107

(b)
Select v2.NodeValue From Path_Table e, Path_Index_Table p1 , Path_Index_Table p2, Path_Index_Table p3 ,Path_Table v1 , Path_Table v2 Where p1.Path Like '%.article' And p2.path Like '%.article.initPage' And p3.path Like '%.article.title' And e.PID = p1.PID And v1.PID = p2.PID And v2.PID = p3.PID And e.StartPos < v1.StartPos And e.EndPos > v1.EndPos And e.StartPos < v2.StartPos And e.EndPos > v2.EndPos And v1.NodeValue > '50' And v2.NodeType = '3';
Time (ms): 32513

(c)
Select v1.LeafValue From LeafNodes v1, Path p1 Where p1.PathExp Like '%/article/title' And v1.PathID = p1.PathID And v1.ParentID in (Select distinct t1.INodeID From InnerNodes t1 Where t1.ParentId in (Select distinct t2.ParentId From InnerNodes t2 Where t2.INodeId in (Select distinct v2.ParentId From LeafNodes v2, Path p2 Where p2.PathExp Like '%/article/initPage' And v2.PathId = p2.PathId And v2.LeafValue > '50')));
Time (ms): 2248

(d)
Select v2.Value From Tag_Value v2 Where v2.TagID in (Select distinct t7.Id From Tag_Structure t7 Where t7.TagName = 'title' And t7.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'article') And t7.pid in (Select distinct t1.pId From Tag_Structure t1 Where t1.TagName = 'initPage' And t1.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'article') And t1.id in (Select distinct v1.tagid From Tag_Value v1 Where v1.Value > '50')));
Time (ms): 5000

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'title' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'article' ) And Lparent in (Select distinct LParent From ParentTable Where Cname = 'initPage' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'article') And IdNode in (Select distinct LParent From ChildTable Where Value > '50' )));
Time (ms): 5090

(f)
Select L2.Node_Value From Leaf_Node L1, Leaf_Node L2, Ancestor_Path P Where p.Ances_PathExp Like '%/article' And L1.Ances_PathID = P.Ances_PathID And L1.Node_Name = 'initPage' And L1.Node_Value > '50' And L2.Node_Name = 'title' And L1.Ances_PathID = L2.Ances_PathID And L1.Ances_Pos = L2.Ances_Pos;
Time (ms): 607
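The XAncestor (f) queries for Q5 and Q6 resolve a twig pattern with a single self-join on Leaf_Node: the two leaves must share the same Ances_PathID and the same Ances_Pos. The sketch below replays the Q5 (f) query on a toy in-memory SQLite database. The table and column names come from the queries above; the column types, the sample rows, the helper names `build_toy_db` and `end_page_for_title`, and the reading of Ances_Pos as "which occurrence of the ancestor element a leaf belongs to" are assumptions made for illustration. SQLite is used only as a convenient stand-in for the RDBMS of the experiments.

```python
import sqlite3

# Q5 (f) from the table above, with the title value turned into a parameter.
TWIG_SQL = """
Select L2.Node_Value From Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
Where P.Ances_PathExp = '/SigmodRecord/issue/articles/article'
And L1.Ances_PathID = P.Ances_PathID And L1.Node_Name = 'title'
And L1.Node_Value = ? And L2.Node_Name = 'endPage'
And L1.Ances_PathID = L2.Ances_PathID And L1.Ances_Pos = L2.Ances_Pos;
"""


def build_toy_db():
    """Create a tiny XAncestor-style database (schema types are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY,
                                    Ances_PathExp TEXT);
        CREATE TABLE Leaf_Node (Node_Name TEXT, Node_Value TEXT,
                                Ances_PathID INTEGER, Ances_Pos INTEGER);
        """
    )
    conn.execute(
        "INSERT INTO Ancestor_Path VALUES (1, '/SigmodRecord/issue/articles/article')"
    )
    # Two hypothetical <article> elements; Ances_Pos tells them apart.
    leaves = [
        ("title", "Paper A", 1, 1), ("initPage", "10", 1, 1), ("endPage", "20", 1, 1),
        ("title", "Paper B", 1, 2), ("initPage", "30", 1, 2), ("endPage", "45", 1, 2),
    ]
    conn.executemany("INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)", leaves)
    return conn


def end_page_for_title(conn, title):
    """Return the endPage values of articles whose title equals `title`."""
    return [row[0] for row in conn.execute(TWIG_SQL, (title,))]
```

Without the `L1.Ances_Pos = L2.Ances_Pos` condition the join would pair the matching title with the endPage of every article on the same path, which is why the position column is needed in addition to the path identifier.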


b. Experiment Dataset (DBLP1)

Table C-12  Description of DBLP1 Dataset Used

Dataset   Description         Size      Elements   Attributes   Max-Depth
DBLP1     DBLP Bibliography   31.5 MB   796507     108626       5

Storage Space

Table C-13  Results of RDB Storage Space Evaluation of DBLP1 for All Approaches

Approach     RDB Scheme (No. of Rows)                                         Total Rows   RDB Storage Space (MB)
XRel         Path (85), Element (796507), Attribute (108626), Text (715648)   1620866      80.92
SMX/R        Path_Index_Table (85), Path_Table (1620781)                      1620866      88.19
Ying-Cao     File (1), Path (78), InnerNodes (905133), LeafNodes (824274)     1729486      88.60
XRecursive   Tag_Structure (905133), Tag_Value (824274)                       1729407      70.08
s-XML        ParentTable (905133), ChildTable (824274)                        1729407      110.57
XAncestor    Ancestor_Path (15), Leaf_Node (824274)                           824289       46.41

Figure C-9  Comparison of RDB Storage Space of Approaches for DBLP1 (x-axis: Model Mapping Approaches; y-axis: RDB Storage Space (MB))
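To make the storage comparison easier to read, the relative saving of one approach over another can be computed directly from the storage column of Table C-13. The snippet below is a small sketch of that arithmetic; the dictionary simply restates the reported DBLP1 figures, and the function name `reduction_vs` is a hypothetical helper, not something defined in the thesis.

```python
# RDB storage space in MB for DBLP1, copied from Table C-13.
STORAGE_MB = {
    "XRel": 80.92,
    "SMX/R": 88.19,
    "Ying-Cao": 88.60,
    "XRecursive": 70.08,
    "s-XML": 110.57,
    "XAncestor": 46.41,
}


def reduction_vs(baseline, target="XAncestor", table=STORAGE_MB):
    """Percentage of storage saved by `target` relative to `baseline`."""
    return 100.0 * (table[baseline] - table[target]) / table[baseline]


if __name__ == "__main__":
    for name in STORAGE_MB:
        if name != "XAncestor":
            print(f"{name:<11} {reduction_vs(name):5.1f} % less space with XAncestor")
```

On these figures the saving ranges from roughly a third (against XRecursive) to well over half (against s-XML).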

Query Response Time

Table C-14  Query Description for DBLP Datasets

Query   Query expression                         Feature
Q1      dblp/phdthesis/title                     simple path expression (short)
Q2      dblp/inproceedings/title/sub/sup         simple path expression (long)
Q3      //mastersthesis/author                   one '//'
Q4      //article//tt                            two '//'s
Q5      dblp/mastersthesis [year = 1992]/title   Twig query with "/"
Q6      //phdthesis [year > 1979]/school         Twig query with "//"

i. XRel Approach

Table C-15  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using XRel

Qn   [1]     [2]     [3]     [4]     [5]     [6]     Average
Q1   32292   32377   32401   32697   32572   32214   32426
Q2   33633   34008   33586   33789   33274   33181   33579
Q3   32260   32182   31730   32276   32288   32058   32132
Q4   34663   34699   34600   34023   34788   34444   34536
Q5   56853   56574   57080   56986   56862   56721   56846
Q6   93490   92831   92395   92288   94167   91852   92837

Figure C-10  Comparison of Query Response Times for Q1-Q6 on DBLP1 using XRel
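The Average column in the response-time tables is the mean of the six runs [1] to [6]. As a quick check, the sketch below recomputes the XRel averages for DBLP1 from the six runs of Table C-15. The rounding rule (halves rounded up to the next millisecond) is an assumption: the thesis does not state it, but it reproduces the reported values, whereas Python's built-in `round` (round-half-to-even) would give 33578 instead of 33579 for Q2. The names `XREL_DBLP1_RUNS_MS` and `average_half_up` are hypothetical.

```python
# Six measured runs per query (ms), copied from Table C-15 (XRel, DBLP1).
XREL_DBLP1_RUNS_MS = {
    "Q1": [32292, 32377, 32401, 32697, 32572, 32214],
    "Q2": [33633, 34008, 33586, 33789, 33274, 33181],
    "Q3": [32260, 32182, 31730, 32276, 32288, 32058],
    "Q4": [34663, 34699, 34600, 34023, 34788, 34444],
    "Q5": [56853, 56574, 57080, 56986, 56862, 56721],
    "Q6": [93490, 92831, 92395, 92288, 94167, 91852],
}


def average_half_up(runs):
    """Integer mean of `runs`, rounding .5 upwards (assumed rounding rule)."""
    n = len(runs)
    # (2*sum + n) // (2*n) equals floor(sum/n + 0.5) without floating point.
    return (2 * sum(runs) + n) // (2 * n)


if __name__ == "__main__":
    for query, runs in XREL_DBLP1_RUNS_MS.items():
        print(query, average_half_up(runs))
```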

ii. SMX/R Approach

Table C-16  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using SMX/R

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   69654    69247    69061    68733    68640    69576    69152
Q2   72384    72584    72321    71679    72852    72415    72373
Q3   68842    68265    68421    68374    68422    69295    68603
Q4   73329    72524    72945    73304    72866    72805    72962
Q5   76515    77128    76364    77313    76845    77097    76877
Q6   140272   138918   138432   139981   137857   139609   139178

Figure C-11  Comparison of Query Response Times for Q1-Q6 on DBLP1 using SMX/R

iii. Ying-Cao Approach

Table C-17  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using Ying-Cao

Qn   [1]     [2]     [3]     [4]     [5]     [6]     Average
Q1   6224    6130    6130    5756    5943    5850    6006
Q2   6052    6068    5990    5787    5647    6037    5930
Q3   6130    5850    6318    6396    6037    6318    6175
Q4   6130    6021    5896    6052    5941    5892    5989
Q5   26317   25412   26083   26036   26083   26130   26010
Q6   44928   44241   44491   44959   44850   44818   44715

Figure C-12  Comparison of Query Response Times for Q1-Q6 on DBLP1 using Ying-Cao

iv. XRecursive Approach

Table C-18  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using XRecursive

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   401201   401682   414773   401607   392824   399048   401856
Q2   92476    96954    93631    93927    95394    93319    94284
Q3   18610    18121    18501    18684    18735    18907    18593
Q4   24148    24694    24070    24415    24710    24226    24377
Q5   82336    79170    79636    78577    78591    81338    79941
Q6   452278   446848   447705   450872   453450   456269   451237

Figure C-13  Comparison of Query Response Times for Q1-Q6 on DBLP1 using XRecursive

v. s-XML Approach

Table C-19  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using s-XML

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   18142    17893    17815    17908    17877    17971    17934
Q2   29234    28828    29328    28938    29310    29515    29192
Q3   18751    18517    18673    18829    18657    18504    18655
Q4   53352    53539    52618    52908    53227    53976    53270
Q5   51136    50419    51293    52666    51074    50751    51223
Q6   432949   444989   442525   441316   433134   431652   437761

Figure C-14  Comparison of Query Response Times for Q1-Q6 on DBLP1 using s-XML

vi. XAncestor Approach

Table C-20  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using XAncestor

Qn   [1]     [2]     [3]     [4]     [5]     [6]     Average
Q1   5148    5584    5007    5414    5538    5397    5348
Q2   4446    4446    4820    4508    4383    4758    4560
Q3   6021    6084    6271    6333    5912    6193    6136
Q4   4742    4508    4430    4555    4648    4290    4529
Q5   10842   10467   10592   10826   11029   10795   10759
Q6   10717   10327   10342   10779   10264   10639   10511

Figure C-15  Comparison of Query Response Times for Q1-Q6 on DBLP1 using XAncestor

Query Response Time (All Approaches)

Table C-21  Query Response Times (ms) for Q1-Q6 on DBLP1 Dataset using All Approaches

Approach     Q1       Q2      Q3      Q4      Q5      Q6
XRel         32426    33579   32132   34536   56846   92837
SMX/R        69152    72373   68603   72962   76877   139178
Ying-Cao     6006     5930    6175    5989    26010   44715
XRecursive   401856   94284   18593   24377   79941   451237
s-XML        17934    29192   18655   53270   51223   437761
XAncestor    5348     4560    6136    4529    10759   10511

Figure C-16  Comparison of Query Response Times for Q1-Q6 on DBLP1 using All Approaches (x-axis: Translated SQL Queries; y-axis: Query Response Time (ms), logarithmic scale)
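The summary table for DBLP1 can be turned into speed-up factors by dividing each approach's average response time by the XAncestor time for the same query. The sketch below does this for the DBLP1 averages; the dictionary restates the averages reported in Tables C-15 to C-20 (the SMX/R Q4 value is the 72962 ms average of Table C-16), and `speedup` is a hypothetical helper name.

```python
QUERIES = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]

# Average response times (ms) on DBLP1, from the Average columns of
# Tables C-15 to C-20.
AVG_MS = {
    "XRel":       [32426, 33579, 32132, 34536, 56846, 92837],
    "SMX/R":      [69152, 72373, 68603, 72962, 76877, 139178],
    "Ying-Cao":   [6006, 5930, 6175, 5989, 26010, 44715],
    "XRecursive": [401856, 94284, 18593, 24377, 79941, 451237],
    "s-XML":      [17934, 29192, 18655, 53270, 51223, 437761],
    "XAncestor":  [5348, 4560, 6136, 4529, 10759, 10511],
}


def speedup(approach, query, reference="XAncestor"):
    """How many times slower `approach` is than `reference` on `query`."""
    i = QUERIES.index(query)
    return AVG_MS[approach][i] / AVG_MS[reference][i]


if __name__ == "__main__":
    for approach in AVG_MS:
        if approach != "XAncestor":
            factors = ", ".join(f"{q} x{speedup(approach, q):.1f}" for q in QUERIES)
            print(f"{approach:<11} {factors}")
```

Read this way, the gap is smallest against Ying-Cao on the simple path queries (close to 1 on Q3) and largest against XRecursive and s-XML on Q1 and the twig query Q6.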

c. Experiment Dataset (DBLP2)

Table C-22  Description of DBLP2 Dataset Used

Dataset   Description         Size      Elements   Attributes   Max-Depth
DBLP2     DBLP Bibliography   63.4 MB   1604967    206949       5

Storage Space

Table C-23  Results of RDB Storage Space Evaluation of DBLP2 for All Approaches

Approach     RDB Scheme (No. of Rows)                                           Total Rows   RDB Storage Space (MB)
XRel         Path (92), Element (1604967), Attribute (206949), Text (1442099)   3254107      162.93
SMX/R        Path_Index_Table (92), Path_Table (3254015)                        3254107      177.32
Ying-Cao     File (1), Path (85), InnerNodes (1811916), LeafNodes (1649048)     3461050      179.95
XRecursive   Tag_Structure (1811916), Tag_Value (1649048)                       3460964      142.09
s-XML        ParentTable (1811916), ChildTable (1649048)                        3460964      224.05
XAncestor    Ancestor_Path (16), Leaf_Node (1649048)                            1649064      92.82

Figure C-17  Comparison of RDB Storage Space of Approaches for DBLP2 (x-axis: Model Mapping Approaches; y-axis: RDB Storage Space (MB))

Query Response Time

The DBLP2 dataset was queried with the same set of queries shown in Table C-14.

i. XRel Approach

Table C-24  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using XRel

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   64942    64927    64864    64896    64896    64880    64901
Q2   68858    68952    68593    68562    68718    69310    68832
Q3   64849    64615    64705    64755    64568    64646    64690
Q4   68796    69310    68874    68655    69232    68784    68942
Q5   73398    72555    72555    72415    72571    72711    72701
Q6   189649   187079   187699   186903   188346   188354   188005

Figure C-18  Comparison of Query Response Times for Q1-Q6 on DBLP2 using XRel

ii. SMX/R Approach

Table C-25  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using SMX/R

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   139336   138978   139136   139011   138401   140088   139158
Q2   148590   148871   149276   148116   147794   148652   148550
Q3   138247   139349   138793   139598   138605   138575   138861
Q4   148648   147201   147792   148434   148059   148153   148048
Q5   156156   155593   156610   156281   155292   155404   155889
Q6   327506   325837   329924   328633   333356   326446   328617

Figure C-19  Comparison of Query Response Times for Q1-Q6 on DBLP2 using SMX/R

iii. Ying-Cao Approach

Table C-26  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using Ying-Cao

Qn   [1]     [2]     [3]      [4]     [5]     [6]     Average
Q1   12012   12292   11918    11653   12012   12074   11994
Q2   12214   12246   11419    11481   12145   11497   11834
Q3   11887   11965   11746    11841   11746   11871   11843
Q4   11934   11887   12074    11606   11965   11715   11864
Q5   51667   53211   52195    51339   53804   51729   52324
Q6   97968   96501   100932   97016   99602   97500   98253

Figure C-20  Comparison of Query Response Times for Q1-Q6 on DBLP2 using Ying-Cao

iv. XRecursive Approach

Table C-27  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using XRecursive

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   796823   793932   806147   790047   799002   796076   797005
Q2   87828    90147    88498    91322    89557    89162    89419
Q3   38500    36934    36348    37262    36285    36270    36933
Q4   142835   140509   137476   139121   142700   140692   140556
Q5   154206   149791   151101   151179   153379   152545   152034
Q6   879077   860341   899481   890077   860825   859218   874837

Figure C-21  Comparison of Query Response Times for Q1-Q6 on DBLP2 using XRecursive

v. s-XML Approach

Table C-28  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using s-XML

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   35975    35933    35521    35443    35591    35614    35680
Q2   59267    59014    58890    58546    56940    59202    58643
Q3   36956    37221    37570    37026    36880    37128    37130
Q4   143504   143373   142771   142802   143613   144531   143432
Q5   100479   101805   103303   102293   103849   101680   102235
Q6   866821   879542   881089   884898   890158   870169   878780

Figure C-22  Comparison of Query Response Times for Q1-Q6 on DBLP2 using s-XML

vi. XAncestor Approach

Table C-29  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using XAncestor

Qn   [1]     [2]     [3]     [4]     [5]     [6]     Average
Q1   10367   10530   10252   10795   10557   10202   10451
Q2   8923    9172    9157    8980    9282    9995    9252
Q3   12214   11824   11840   11809   11466   11522   11779
Q4   8767    9090    9151    9157    9126    8970    9044
Q5   21637   21106   20997   21304   21231   20857   21189
Q6   20810   20919   20704   21606   20857   21528   21071

Figure C-23  Comparison of Query Response Times for Q1-Q6 on DBLP2 using XAncestor

Query Response Time (All Approaches)

Table C-30  Query Response Times (ms) for Q1-Q6 on DBLP2 Dataset using All Approaches

Approach     Q1       Q2       Q3       Q4       Q5       Q6
XRel         64901    68832    64690    68942    72701    188005
SMX/R        139158   148550   138861   148048   155889   328617
Ying-Cao     11994    11834    11843    11864    52324    98253
XRecursive   797005   89419    36933    140556   152034   874837
s-XML        35680    58643    37130    143432   102235   878780
XAncestor    10451    9252     11779    9044     21189    21071

Figure C-24  Comparison of Query Response Times for Q1-Q6 on DBLP2 using All Approaches (x-axis: Translated SQL Queries; y-axis: Query Response Time (ms), logarithmic scale)

d. Experiment Dataset (DBLP3)

Table C-31  Description of DBLP3 Dataset Used

Dataset   Description         Size     Elements   Attributes   Max-Depth
DBLP3     DBLP Bibliography   127 MB   3332130    404276       6

Storage Space

Table C-32  Results of RDB Storage Space Evaluation of DBLP3 for All Approaches

Approach     RDB Scheme (No. of Rows)                                            Total Rows   RDB Storage Space (MB)
XRel         Path (145), Element (3332130), Attribute (404276), Text (3005848)   6742399      338.42
SMX/R        Path_Index_Table (145), Path_Table (6742254)                        6742399      368.88
Ying-Cao     File (1), Path (136), InnerNodes (3736406), LeafNodes (3410124)     7146667      372.49
XRecursive   Tag_Structure (3736406), Tag_Value (3410124)                        7146530      292.95
s-XML        ParentTable (3736406), ChildTable (3410124)                         7146530      456.19
XAncestor    Ancestor_Path (29), Leaf_Node (3410124)                             3410153      189.92

Figure C-25  Comparison of RDB Storage Space of Approaches for DBLP3 (x-axis: Model Mapping Approaches; y-axis: RDB Storage Space (MB))

Query Response Time

The DBLP3 dataset was queried with the same set of queries shown in Table C-14.

i. XRel Approach

Table C-33  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using XRel

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   135876   134784   135127   135579   134316   135407   135182
Q2   146609   147157   147071   147014   154424   147045   148220
Q3   135737   134238   135252   135301   143941   135956   136738
Q4   148512   148758   150461   205331   160571   154564   161366
Q5   151511   150399   151897   151855   151429   150817   151318
Q6   398159   396693   412608   410499   408382   408127   405745

Figure C-26  Comparison of Query Response Times for Q1-Q6 on DBLP3 using XRel

ii. SMX/R Approach

Table C-34  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using SMX/R

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   287976   288319   287711   288163   288772   288163   288184
Q2   312998   310424   311220   313950   313794   312390   312463
Q3   288366   287399   287227   285215   287976   289536   287620
Q4   315640   316930   317320   316898   316072   316867   316621
Q5   325510   325213   327662   323154   323716   324979   325039
Q6   601193   605125   604937   606778   606597   605577   605035

Figure C-27  Comparison of Query Response Times for Q1-Q6 on DBLP3 using SMX/R

iii. Ying-Cao Approach

Table C-35  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using Ying-Cao

Qn   [1]      [2]      [3]      [4]      [5]      [6]      Average
Q1   23758    23821    24570    24070    23852    23353    23904
Q2   23799    24164    24226    23743    23790    24183    23984
Q3   23571    24320    23431    23836    24193    23899    23875
Q4   23961    24382    23992    23805    23696    24117    23992
Q5   105753   106360   104857   105806   104488   106080   105557
Q6   225576   223621   223782   224390   224791   222971   224189

Figure C-28  Comparison of Query Response Times for Q1-Q6 on DBLP3 using Ying-Cao

iv. XRecursive Approach

Table C-36  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using XRecursive

Qn   [1]       [2]       [3]       [4]       [5]       [6]       Average
Q1   1639870   1664344   1650242   1639721   1673586   1679218   1657830
Q2   165058    162287    165048    163925    165407    164540    164378
Q3   75769     76315     77251     76996     77282     75940     76592
Q4   312780    310253    318583    319067    310424    313466    314096
Q5   316050    314605    317335    316979    317398    316040    316401
Q6   1785970   1791296   1793921   1803987   1830101   1854531   1809968

Figure C-29  Comparison of Query Response Times for Q1-Q6 on DBLP3 using XRecursive

v. s-XML Approach

Table C-37  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using s-XML

Qn   [1]       [2]       [3]       [4]       [5]       [6]       Average
Q1   72805     72181     71744     72789     71463     72664     72274
Q2   120532    124145    121212    121929    121048    119839    121451
Q3   76079     76827     74839     77397     76174     76549     76311
Q4   397922    391683    382684    382897    391342    381997    388088
Q5   209462    205396    208663    207660    205421    208337    207490
Q6   1744401   1781441   1848057   1801637   1782097   1779130   1789461

Figure C-30  Comparison of Query Response Times for Q1-Q6 on DBLP3 using s-XML

vi. XAncestor Approach

Table C-38  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using XAncestor

Qn   [1]     [2]     [3]     [4]     [5]     [6]     Average
Q1   21294   21309   21512   21606   21278   20615   21269
Q2   18552   18720   18158   18532   18517   18813   18549
Q3   23228   23977   24445   23322   23805   24211   23831
Q4   18798   18205   18595   17924   18142   18252   18319
Q5   44460   44321   42494   43258   42822   44382   43623
Q6   42619   43165   44148   44148   44694   44132   43818

Figure C-31  Comparison of Query Response Times for Q1-Q6 on DBLP3 using XAncestor

Query Response Time (All Approaches)

Table C-39  Query Response Times (ms) for Q1-Q6 on DBLP3 Dataset using All Approaches

Approach     Q1        Q2       Q3       Q4       Q5       Q6
XRel         135182    148220   136738   161366   151318   405745
SMX/R        288184    312463   287620   316621   325039   605035
Ying-Cao     23904     23984    23875    23992    105557   224189
XRecursive   1657830   164378   76592    314096   316401   1809968
s-XML        72274     121451   76311    388088   207490   1789461
XAncestor    21269     18549    23831    18319    43623    43818

Figure C-32  Comparison of Query Response Times for Q1-Q6 on DBLP3 using All Approaches (x-axis: Translated SQL Queries; y-axis: Query Response Time (ms), logarithmic scale)
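With all three dataset sizes reported, one rough way to look at scalability is to normalise the XAncestor averages by document size. The sketch below computes milliseconds per megabyte for each query from the reported dataset sizes (31.5 MB, 63.4 MB, 127 MB) and the XAncestor averages for DBLP1, DBLP2 and DBLP3. The normalisation itself and the helper name `ms_per_mb` are illustrative assumptions; the thesis reports only the raw times.

```python
# Document sizes in MB, from the dataset description tables.
DATASET_MB = {"DBLP1": 31.5, "DBLP2": 63.4, "DBLP3": 127.0}

# XAncestor average response times (ms), from the per-dataset tables.
XANCESTOR_AVG_MS = {
    "DBLP1": {"Q1": 5348, "Q2": 4560, "Q3": 6136, "Q4": 4529, "Q5": 10759, "Q6": 10511},
    "DBLP2": {"Q1": 10451, "Q2": 9252, "Q3": 11779, "Q4": 9044, "Q5": 21189, "Q6": 21071},
    "DBLP3": {"Q1": 21269, "Q2": 18549, "Q3": 23831, "Q4": 18319, "Q5": 43623, "Q6": 43818},
}


def ms_per_mb(query):
    """Response time per MB of source XML for `query`, per dataset."""
    return {
        name: XANCESTOR_AVG_MS[name][query] / size
        for name, size in DATASET_MB.items()
    }


if __name__ == "__main__":
    for q in ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]:
        cells = ", ".join(f"{d}: {v:.1f}" for d, v in ms_per_mb(q).items())
        print(f"{q}  {cells} ms/MB")
```

A roughly constant value across the three datasets for a given query indicates that the response time grows about linearly with document size over this range.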

Table C-40  Translated SQL Queries with Their Query Response Times for Each Query (Q1-Q6) Based on (a) XRel, (b) SMX/R, (c) Ying-Cao, (d) XRecursive, (e) s-XML, and (f) XAncestor Approaches on All Sizes of DBLP (DBLP1, DBLP2, DBLP3)

Q1

(a)
Select v.Value From Element e, Path p1 , Path p2, Text v Where p1.PathExp = '#/dblp#/phdthesis' And p2.pathExp = '#/dblp#/phdthesis#/title' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
Time (ms): DBLP1 32426; DBLP2 64901; DBLP3 135182

(b)
Select v.NodeValue From Path_Index_Table p1 , Path_Index_Table p2 , Path_Table e, Path_Table v Where p1.Path = '.dblp.phdthesis' And p2.Path = '.dblp.phdthesis.title' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
Time (ms): DBLP1 69152; DBLP2 139158; DBLP3 288184

(c)
Select v.LeafValue From LeafNodes v, Path p Where p.PathExp = '/dblp/phdthesis/title' And p.PathID = v.PathID;
Time (ms): DBLP1 6006; DBLP2 11994; DBLP3 23904

(d)
Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'title' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'phdthesis' And t2.Pid in (select distinct t3.id From Tag_Structure t3 Where TagName = 'dblp')));
Time (ms): DBLP1 401856; DBLP2 797005; DBLP3 1657830

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname='title' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'phdthesis' And Pname = 'dblp'));
Time (ms): DBLP1 17934; DBLP2 35680; DBLP3 72274

(f)
Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp = '/dblp/phdthesis' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'title';
Time (ms): DBLP1 5348; DBLP2 10451; DBLP3 21269

Q2

(a)
Select v.Value From Element e, Path p1 , Path p2, Text v Where p1.PathExp = '#/dblp#/inproceedings#/title#/sub' And p2.pathexp = '#/dblp#/inproceedings#/title#/sub#/sup' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
Time (ms): DBLP1 33579; DBLP2 68832; DBLP3 148220

(b)
Select v.NodeValue From Path_Index_Table p1, Path_Index_Table p2, Path_Table e, Path_Table v Where p1.Path = '.dblp.inproceedings.title.sub' And p2.Path = '.dblp.inproceedings.title.sub.sup' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
Time (ms): DBLP1 72373; DBLP2 148550; DBLP3 312463

(c)
Select v.LeafValue From LeafNodes v, Path p Where p.PathExp = '/dblp/inproceedings/title/sub/sup' And p.PathID = v.PathID;
Time (ms): DBLP1 5930; DBLP2 11834; DBLP3 23984

(d)
Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'sup' And t1.Pid in (Select distinct t3.id From Tag_Structure t3 Where TagName = 'sub' And t3.Pid in(Select distinct t4.id From Tag_Structure t4 Where TagName = 'title' And t4.Pid in(Select distinct t5.id From Tag_Structure t5 Where TagName = 'inproceedings' And t5.Pid in(Select distinct t6.id From Tag_Structure t6 Where TagName = 'dblp')))));
Time (ms): DBLP1 94284; DBLP2 89419; DBLP3 164378

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'sup' And LParent in (Select distinct IdNode From ParentTable Where Cname='sub' And Lparent in (Select distinct IdNode From ParentTable Where Cname='title' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'inproceedings' And Pname = 'dblp'))));
Time (ms): DBLP1 29192; DBLP2 58643; DBLP3 121451

(f)
Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp = '/dblp/inproceedings/title/sub' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'sup';
Time (ms): DBLP1 4560; DBLP2 9252; DBLP3 18549

Q3

(a)
Select v.Value From Element e, Path p1 , Path p2, Text v Where p1.PathExp Like '%#/mastersthesis' And p2.pathExp Like '%#/mastersthesis#/author' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
Time (ms): DBLP1 32132; DBLP2 64690; DBLP3 136738

(b)
Select v.NodeValue From Path_Index_Table p1 , Path_Index_Table p2 , Path_Table e, Path_Table v Where p1.Path Like '%.mastersthesis' And p2.Path Like '%.mastersthesis.author' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
Time (ms): DBLP1 68603; DBLP2 138861; DBLP3 287620

(c)
Select v.LeafValue From LeafNodes v, Path p Where p.PathExp Like '%/mastersthesis/author' And p.PathID = v.PathID;
Time (ms): DBLP1 6175; DBLP2 11843; DBLP3 23875

(d)
Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'author' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'mastersthesis'));
Time (ms): DBLP1 18593; DBLP2 36933; DBLP3 76592

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname='author' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'mastersthesis'));
Time (ms): DBLP1 18655; DBLP2 37130; DBLP3 76311

(f)
Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp Like '%/mastersthesis' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'author';
Time (ms): DBLP1 6136; DBLP2 11779; DBLP3 23831

Q4

(a)
Select v.Value From Element e, Path p1 , Path p2, Text v Where p1.PathExp Like '%#/article' And p2.pathExp Like '%#/article%#/tt' And e.PathID = p1.PathID And v.PathID = p2.PathID And e.Start < v.Start And e.End > v.End;
Time (ms): DBLP1 34536; DBLP2 68942; DBLP3 161366

(b)
Select v.NodeValue From Path_Index_Table p1 , Path_Index_Table p2 , Path_Table e, Path_Table v Where p1.Path Like '%.article' And p2.Path Like '%.article%.tt' And e.PID = p1.PID And v.PID = p2.PID And e.StartPos < v.StartPos And e.EndPos > v.EndPos And v.NodeType = '3';
Time (ms): DBLP1 72962; DBLP2 148048; DBLP3 316621

(c)
Select v.LeafValue From LeafNodes v, Path p Where p.PathExp Like '%/article%/tt' And p.PathID = v.PathID;
Time (ms): DBLP1 5989; DBLP2 11864; DBLP3 23992

(d)
Select v.Value From Tag_Value v Where v.TagID in (Select distinct t1.id From Tag_Structure t1 Where t1.TagName = 'tt' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where t2.Pid in (Select distinct t3.id From Tag_Structure t3 Where t3.TagName = 'article')));
Time (ms): DBLP1 24377; DBLP2 140556; DBLP3 314096

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where cname='tt' And Lparent in (Select distinct IdNode From ParentTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'article')));
Time (ms): DBLP1 53270; DBLP2 143432; DBLP3 388088

(f)
Select L.Node_Value From Leaf_Node L, Ancestor_Path P Where p.Ances_PathExp Like '%/article%' And L.Ances_PathID = P.Ances_PathID And L.Node_Name = 'tt';
Time (ms): DBLP1 4529; DBLP2 9044; DBLP3 18319

Q5

(a)
Select v2.Value From Element e, Path p1 , Path p2, Path p3 ,Text v1 , Text v2 Where p1.PathExp = '#/dblp#/mastersthesis' And p2.PathExp = '#/dblp#/mastersthesis#/year' And p3.pathExp = '#/dblp#/mastersthesis#/title' And e.PathID = p1.PathID And v1.PathID = p2.PathID And v2.PathID = p3.PathID And e.Start < v1.Start And e.End > v1.End And e.Start < v2.Start And e.End > v2.End And v1.Value= '1992';
Time (ms): DBLP1 56846; DBLP2 72701; DBLP3 151318

(b)
Select v2.NodeValue From Path_Table e, Path_Index_Table p1 , Path_Index_Table p2, Path_Index_Table p3 ,Path_Table v1 , Path_Table v2 Where p1.Path = '.dblp.mastersthesis' And p2.path = '.dblp.mastersthesis.year' And p3.path = '.dblp.mastersthesis.title' And e.PID = p1.PID And v1.PID = p2.PID And v2.PID = p3.PID And e.StartPos < v1.StartPos And e.EndPos > v1.EndPos And e.StartPos < v2.StartPos And e.EndPos > v2.EndPos And v1.NodeValue= '1992' And v2.NodeType = '3';
Time (ms): DBLP1 76877; DBLP2 155889; DBLP3 325039

(c)
Select v1.LeafValue From LeafNodes v1, Path p1 Where p1.PathExp = '/dblp/mastersthesis/title' And v1.PathID = p1.PathID And v1.ParentID in (Select distinct t1.INodeID From InnerNodes t1 Where t1.ParentId in (Select distinct t2.ParentId From InnerNodes t2 Where t2.INodeId in (Select distinct v2.ParentId From LeafNodes v2, Path p2 Where p2.PathExp = '/dblp/mastersthesis/year' And v2.PathId = p2.PathId And v2.LeafValue = '1992')));
Time (ms): DBLP1 26010; DBLP2 52324; DBLP3 105557

(d)
Select v2.value From Tag_Value v2 Where v2.TagID in (Select distinct t7.Id From Tag_Structure t7 Where t7.TagName = 'title' And t7.Pid in (Select distinct t8.id From Tag_Structure t8 Where t8.TagName = 'mastersthesis' And t8.Pid in (Select distinct t9.id From Tag_Structure t9 Where t9.TagName = 'dblp' )) And t7.pid in (Select distinct t1.pId From Tag_Structure t1 Where t1.TagName = 'year' And t1.Pid in (Select distinct t2.id From Tag_Structure t2 Where TagName = 'mastersthesis' And t2.Pid in (Select distinct t3.id From Tag_Structure t3 Where t3.TagName = 'dblp')) And t1.id in (Select distinct v1.tagid From Tag_Value v1 Where v1.Value = '1992')));
Time (ms): DBLP1 79941; DBLP2 152034; DBLP3 316401

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'title' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'mastersthesis' And Pname = 'dblp') And Lparent in (Select distinct LParent From ParentTable Where Cname = 'year' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'mastersthesis' And Pname = 'dblp') And IdNode in (Select distinct LParent From ChildTable Where Value = '1992' )));
Time (ms): DBLP1 51223; DBLP2 102235; DBLP3 207490

(f)
Select L2.Node_Value From Leaf_Node L1, Leaf_Node L2, Ancestor_Path P Where p.Ances_PathExp = '/dblp/mastersthesis' And L1.Ances_PathID = P.Ances_PathID And L1.Node_Name = 'year' And L1.Node_Value = '1992' And L2.Node_Name = 'title' And L1.Ances_PathID = L2.Ances_PathID And L1.Ances_Pos = L2.Ances_Pos;
Time (ms): DBLP1 10759; DBLP2 21189; DBLP3 43623

Q6

(a)
Select v2.Value From Element e, Path p1 , Path p2, Path p3 ,Text v1 , Text v2 Where p1.PathExp Like '%#/phdthesis' And p2.PathExp Like '%#/phdthesis#/year' And p3.pathExp Like '%#/phdthesis#/school' And e.PathID = p1.PathID And v1.PathID = p2.PathID And v2.PathID = p3.PathID And e.Start < v1.Start And e.End > v1.End And e.Start < v2.Start And e.End > v2.End And v1.Value > '1979';
Time (ms): DBLP1 92837; DBLP2 188005; DBLP3 405745

(b)
Select v2.NodeValue From Path_Table e, Path_Index_Table p1 , Path_Index_Table p2, Path_Index_Table p3 ,Path_Table v1 , Path_Table v2 Where p1.Path Like '%.phdthesis' And p2.path Like '%.phdthesis.year' And p3.path Like '%.phdthesis.school' And e.PID = p1.PID And v1.PID = p2.PID And v2.PID = p3.PID And e.StartPos < v1.StartPos And e.EndPos > v1.EndPos And e.StartPos < v2.StartPos And e.EndPos > v2.EndPos And v1.NodeValue > '1979' And v2.NodeType = '3';
Time (ms): DBLP1 139178; DBLP2 328617; DBLP3 605035

(c)
Select v1.LeafValue From LeafNodes v1, Path p1 Where p1.PathExp Like '%/phdthesis/school' And v1.PathID = p1.PathID And v1.ParentID in (Select distinct t1.INodeID From InnerNodes t1 Where t1.ParentId in (Select distinct t2.ParentId From InnerNodes t2 Where t2.INodeId in (Select distinct v2.ParentId From LeafNodes v2, Path p2 Where p2.PathExp Like '%/phdthesis/year' And v2.PathId = p2.PathId And v2.LeafValue > '1979')));
Time (ms): DBLP1 44715; DBLP2 98253; DBLP3 224189

(d)
Select v2.value From Tag_Value v2 Where v2.TagID in (Select distinct t7.Id From Tag_Structure t7 Where t7.TagName = 'school' And t7.Pid in (Select distinct t8.Id From Tag_Structure t8 Where t8.TagName = 'phdthesis') And t7.Pid in (Select distinct t8.Pid From Tag_Structure t8 Where t8.TagName = 'year' And t8.Pid in (Select distinct t10.Id From Tag_Structure t10 Where t10.TagName = 'phdthesis') And t8.Id in (select distinct t11.TagId From Tag_Value t11 Where t11.Value > '1979')));
Time (ms): DBLP1 451237; DBLP2 874837; DBLP3 1809968

(e)
Select Value From ChildTable Where LParent in (Select distinct IdNode From ParentTable Where Cname = 'school' And Lparent in (Select distinct IdNode From ParentTable Where Cname= 'phdthesis') And Lparent in (Select distinct LParent From ParentTable Where Cname='year' And lparent in (Select distinct IdNode From parenttable Where Cname= 'phdthesis') And IdNode in (Select distinct LParent From ChildTable Where Value > '1979' )));
Time (ms): DBLP1 437761; DBLP2 878780; DBLP3 1789461

(f)
Select L2.Node_Value From Leaf_Node L1, Leaf_Node L2, Ancestor_Path P Where p.Ances_PathExp Like '%/phdthesis' And L1.Ances_PathID = P.Ances_PathID And L1.Node_Name = 'year' And L1.Node_Value > '1979' And L2.Node_Name = 'school' And L1.Ances_PathID = L2.Ances_PathID And L1.Ances_Pos = L2.Ances_Pos;
Time (ms): DBLP1 10511; DBLP2 21071; DBLP3 43818


APPENDIX D

SAMPLE RESULTS OF MAPPING XML DOCUMENTS INTO RELATIONAL DATABASE USING XANCESTOR MAPPING APPROACH

a. SIGMOD Dataset (467KB)

Table D-1  Description of SIGMOD Dataset

Dataset  Description           Size    Elements  Attributes  Max-Depth
SIGMOD   SIGMOD Record in XML  467 KB  11526     3737        6

Table D-2  Tuples Mapped in Ancestor_Path Table (4 Tuples)

Ances_PathID  Ances_PathExp
1             /SigmodRecord/issue
2             /SigmodRecord/issue/articles/article
3             /SigmodRecord/issue/articles/article/authors/author
4             /SigmodRecord/issue/articles/article/authors

Table D-3  Part of Tuple Mapped in Leaf_Node Table (12120 Tuples)

Node_Name  Ances_PathID  Ances_Pos      Node_Value
author     4             1.6.3.2.4.     Randy H. Katz
initPage   2             1.10.3.5.      46
author     4             1.11.3.19.4.   Dale Skeen
author     4             1.13.3.5.4.    Aaron M. Tenenbaum
initPage   2             1.13.3.25.     307
author     4             1.15.3.11.4.   Masaya Nakayama
author     4             1.15.3.32.4.   Timos K. Sellis
@position  3             1.17.3.6.4.2   01
initPage   2             1.17.3.29.     206
title      2             1.20.3.2.      Chairs Message.
endPage    2             1.22.3.11.     310
author     4             1.22.3.30.4.   Scott E. Hudson
@position  3             1.23.3.2.4.1   00
author     4             1.25.3.2.4.    Randy H. Katz
author     4             1.25.3.21.4.   Richard Hull
title      2             1.25.3.43.     Modeling Class Hierarchies with Contradictions.
volume     1             1.26.          17
author     4             1.27.3.10.4.   Beverly Jamison
@position  3             1.28.3.20.4.2  01
author     4             1.28.3.41.4.   Daniel Alexander Ford
@position  3             1.30.3.4.4.2   01
initPage   2             1.32.3.12.     153
endPage    2             1.32.3.30.     391
@position  3             1.33.3.1.4.1   00
author     4             1.33.3.1.4.    Li Lei
@position  3             1.33.3.1.4.2   01
author     4             1.33.3.1.4.    Georges-Henri Moll
@position  3             1.33.3.1.4.3   02
author     4             1.33.3.1.4.    Jacques Kouloumdjian
title      2             1.33.3.2.      Computing Transitive Closures of Multilevel Relations.
initPage   2             1.33.3.2.      17
endPage    2             1.33.3.2.      23
@position  3             1.33.3.2.4.1   00
author     4             1.33.3.2.4.    Bhavani M. Thuraisingham
title      2             1.33.3.3.      Database Research at Bellcore.
initPage   2             1.33.3.3.      45
endPage    2             1.33.3.3.      52
@position  3             1.33.3.3.4.1   00
author     4             1.33.3.3.4.    Amit P. Sheth
title      2             1.33.3.4.      Incomplete Information in Object-Oriented Databases.
initPage   2             1.33.3.4.      5
endPage    2             1.33.3.4.      16
@position  3             1.33.3.4.4.1   00
author     4             1.33.3.4.4.    Roberto Zicari
author     4             1.6.3.2.4.     Randy H. Katz
initPage   2             1.10.3.5.      46
author     4             1.11.3.19.4.   Dale Skeen
author     4             1.13.3.5.4.    Aaron M. Tenenbaum
initPage   2             1.13.3.25.     307
author     4             1.15.3.11.4.   Masaya Nakayama
author     4             1.15.3.32.4.   Timos K. Sellis
@position  3             1.17.3.6.4.2   01
initPage   2             1.17.3.29.     206
title      2             1.20.3.2.      Chairs Message.
endPage    2             1.22.3.11.     310
author     4             1.22.3.30.4.   Scott E. Hudson
@position  3             1.23.3.2.4.1   00
author     4             1.25.3.2.4.    Randy H. Katz
author     4             1.25.3.21.4.   Richard Hull
title      2             1.25.3.43.     Modeling Class Hierarchies with Contradictions.
volume     1             1.26.          17
author     4             1.27.3.10.4.   Beverly Jamison
@position  3             1.28.3.20.4.2  01
author     4             1.28.3.41.4.   Daniel Alexander Ford
@position  3             1.30.3.4.4.2   01
initPage   2             1.32.3.12.     153

Figure D-1  XML Mapping Interface for SIGMOD Dataset
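The two relations behind Tables D-2 and D-3 can be sketched as a small in-memory database. The following is a minimal sketch in Python with SQLite, not the thesis implementation: the column types, the key constraint and the choice of SQLite are assumptions, while the table names, column names and sample tuples are taken from Tables D-2 and D-3. Joining each leaf tuple with its ancestor path and appending Node_Name rebuilds the full root-to-leaf path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types and constraints are assumptions made for this sketch.
conn.executescript("""
CREATE TABLE Ancestor_Path (
    Ances_PathID  INTEGER PRIMARY KEY,
    Ances_PathExp TEXT NOT NULL
);
CREATE TABLE Leaf_Node (
    Node_Name    TEXT NOT NULL,
    Ances_PathID INTEGER NOT NULL REFERENCES Ancestor_Path (Ances_PathID),
    Ances_Pos    TEXT NOT NULL,
    Node_Value   TEXT
);
""")

# The four tuples of Table D-2.
ancestor_paths = [
    (1, "/SigmodRecord/issue"),
    (2, "/SigmodRecord/issue/articles/article"),
    (3, "/SigmodRecord/issue/articles/article/authors/author"),
    (4, "/SigmodRecord/issue/articles/article/authors"),
]
# A few of the tuples listed in Table D-3.
leaf_nodes = [
    ("volume", 1, "1.26.", "17"),
    ("title", 2, "1.33.3.2.", "Computing Transitive Closures of Multilevel Relations."),
    ("initPage", 2, "1.33.3.2.", "17"),
    ("endPage", 2, "1.33.3.2.", "23"),
    ("@position", 3, "1.33.3.2.4.1", "00"),
    ("author", 4, "1.33.3.2.4.", "Bhavani M. Thuraisingham"),
]
conn.executemany("INSERT INTO Ancestor_Path VALUES (?, ?)", ancestor_paths)
conn.executemany("INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)", leaf_nodes)

# Ancestor path + '/' + leaf name gives the full path of every leaf.
full_paths = [
    f"{path}/{name}"
    for name, path in conn.execute(
        "SELECT L.Node_Name, P.Ances_PathExp "
        "FROM Leaf_Node L JOIN Ancestor_Path P "
        "ON L.Ances_PathID = P.Ances_PathID "
        "ORDER BY L.rowid"
    )
]
for full_path in full_paths:
    print(full_path)
```

In this sample only four ancestor paths are needed to describe every leaf, which matches the tuple count reported for Table D-2.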

b. DBLP1 Dataset (31.5MB)

Table D-4  Description of DBLP1 Dataset

Dataset  Description        Size     Elements  Attributes  Max-Depth
DBLP1    DBLP Bibliography  31.5 MB  796507    108626      5

Table D-5  Tuples Mapped in Ancestor_Path Table (15 Tuples)

Ances_PathID  Ances_PathExp
1             /dblp/mastersthesis
2             /dblp/article
3             /dblp/phdthesis
4             /dblp/www
5             /dblp/proceedings
6             /dblp/proceedings/series
7             /dblp/inproceedings
8             /dblp/inproceedings/title
9             /dblp/inproceedings/cite
10            /dblp/proceedings/publisher
11            /dblp/proceedings/title
12            /dblp/inproceedings/title/sup
13            /dblp/inproceedings/title/i
14            /dblp/inproceedings/title/sub
15            /dblp/article/title

Table D-6  Part of Tuple Mapped in Leaf_Node Table (824274 Tuples)

Node_Name  Ances_PathID  Ances_Pos  Node_Value
@key       1             1.1        ms/Brown92
title      2             1.16.      SQL/Data System, Conpects and Facilities.
journal    2             1.31.      ANSI X3H2
school     3             1.57.      Boston Univerity
author     3             1.85.      Eric Simon
title      4             1.116.     Lycos
title      7             1.146.     An Anti-Adaptationist Approach to Genetic Algorithms.
pages      7             1.161.     384-389
url        7             1.177.     db/conf/icec/icec1994-2.html#KuboK94
author     7             1.194.     Yoshiaki Tsukamoto
pages      7             1.210.     273-277
title      7             1.227.     The Genetic Programming of an Artificial Brain Which Grows/Evolves at Electronic Speeds in a Cellular Automata Machine.
author     7             1.244.     Byoung-Tak Zhang
author     7             1.260.     R. C. Gatto
@key       7             1.276      conf/icec/ParthasarathyMS94
author     7             1.292.     Tadashi Horiuchi
cite       7             1.304.     ...
cite       7             1.314.     ...
@label     9             1.318.17   Ku
author     7             1.323.     Atsushi Ohori
cite       7             1.325.     journals/tse/Stonebraker79
cite       7             1.329.     ...
cite       7             1.332.     conf/ifip/Kowalski74
@label     9             1.335.30   ULL88
cite       7             1.340.     conf/pods/BeeriLR97
cite       7             1.343.     conf/sigmod/KuperV85
cite       7             1.347.     conf/vldb/Levy96
@label     9             1.352.12   DECdtm
pages      7             1.355.     58-69
booktitle  7             1.359.     ICDT
crossref   7             1.362.     conf/icdt/90
@label     9             1.365.11   CCF
title      7             1.369.     The Cost of Locking.
cite       7             1.373.     conf/sigmod/KornackerSH98
cite       7             1.373.     conf/pods/KoutsoupiasT98
cite       7             1.373.     conf/pods/PagelS96
cite       7             1.373.     conf/pods/PagelSTW93
cite       7             1.373.     conf/pods/PagelSW95
cite       7             1.373.     conf/pods/SamoladasM98
cite       7             1.373.     conf/vldb/SellisRF97
cite       7             1.373.     ...
cite       7             1.373.     ...
@key       7             1.374      conf/icdt/Lakshmanan86
author     7             1.374.     V. S. Lakshmanan
title      7             1.374.     Split-Freedom and MVD-Intersection: A New Characterization of Multivalued Dependencies Having Conflict-Free Covers.
pages      7             1.374.     221-241
cdrom      7             1.374.     ICDT1986/P221.pdf
year       7             1.374.     1986
booktitle  7             1.374.     ICDT
url        7             1.374.     db/conf/icdt/icdt86.html#Lakshmanan86
crossref   7             1.374.     conf/icdt/86
ee         7             1.374.     db/conf/icdt/Lakshmanan86.html
@label     9             1.374.10   B
cite       7             1.374.     journals/tods/Beeri80

Figure D-2  XML Mapping Interface for DBLP1 Dataset

c. DBLP2 Dataset (63.4MB)

Table D-7  Description of DBLP2 Dataset

Dataset  Description        Size     Elements  Attributes  Max-Depth
DBLP2    DBLP Bibliography  63.4 MB  1604967   206949      5

Table D-8  Tuples Mapped in Ancestor_Path Table (16 Tuples)

Ances_PathID  Ances_PathExp
1             /dblp/mastersthesis
2             /dblp/article
3             /dblp/phdthesis
4             /dblp/www
5             /dblp/proceedings
6             /dblp/proceedings/series
7             /dblp/inproceedings
8             /dblp/inproceedings/title
9             /dblp/inproceedings/cite
10            /dblp/inproceedings/title/sup
11            /dblp/inproceedings/title/i
12            /dblp/proceedings/title
13            /dblp/proceedings/cite
14            /dblp/proceedings/publisher
15            /dblp/inproceedings/title/sub
16            /dblp/article/title

Table D-9  Part of Tuple Mapped in Leaf_Node Table (1649048 Tuples)

Node_Name  Ances_PathID  Ances_Pos  Node_Value
@key       1             1.1        ms/Brown92
title      2             1.16.      SQL/Data System, Conpects and Facilities.
journal    2             1.31.      ANSI X3H2
school     3             1.57.      Boston Univerity
author     3             1.85.      Eric Simon
title      4             1.116.     Lycos
title      7             1.146.     Graphical Design of Reactive Systems.
year       7             1.160.     1998
ee         7             1.176.     http://link.springer.de/link/service/series/0558/bibs/1528/15280049.htm
title      7             1.193.     Designing Equvialent Semantic Models for Process Creation.
author     7             1.210.     Walter Fumy
url        7             1.226.     db/conf/ac/lan.html#Bux83
booktitle  7             1.244.     Advances in Petri Nets
year       7             1.262.     1986
pages      7             1.279.     87-107
title      7             1.294.     The Application of Reason Maintenance Systems in Planning and Scheduling.
@key       7             1.311      conf/ac/Lapidoth94
year       7             1.328.     1998
pages      7             1.344.     413-434
year       7             1.361.     2001
author     7             1.377.     Jan Tarski
url        7             1.394.     db/conf/ac/ds.html#Lann80c
url        7             1.411.     db/conf/ac/ds84.html#LamportS84
author     7             1.429.     Reinhold Heckmann
booktitle  7             1.445.     State of the Art in Applied Cryptography
booktitle  7             1.460.     Computer Security and Industrial Cryptography
@key       7             1.477      conf/ac/Smith96
title      7             1.491.     AI: Introduction, Paradigms, Applications (Including CBR), Impacts, Visions.
title      7             1.505.     Elementary Net Systems.
year       7             1.522.     1978
pages      7             1.537.     251-260
url        7             1.553.     db/conf/ac/esat1991.html#Fumy91a
booktitle  7             1.570.     Advanced Course: Net Theory and Applications
year       7             1.587.     1976
crossref   7             1.587.     conf/ac/1976ent
booktitle  7             1.587.     Komplexität von Entscheidungsproblemen 1976
url        7             1.587.     db/conf/ac/entscheid1976.html#Wuthrich76
@key       7             1.588      conf/ac/Goltz86
author     7             1.588.     Ursula Goltz
title      7             1.588.     Synchronic Distance.
pages      7             1.588.     338-358
year       7             1.588.     1986
booktitle  7             1.588.     Advances in Petri Nets
url        7             1.588.     db/conf/ac/petri86-1.html#Goltz86
@key       7             1.589      conf/ac/Valk86a
author     7             1.589.     Rüdiger Valk
title      7             1.589.     Infinite Behaiour and Fairness.
pages      7             1.589.     377-396
year       7             1.589.     1986
booktitle  7             1.589.     Advances in Petri Nets
url        7             1.589.     db/conf/ac/petri86-1.html#Valk86a
@key       7             1.590      conf/ac/Odersky00

Figure D-3  XML Mapping Interface for DBLP2 Dataset

d. DBLP3 Dataset (127MB)

Table D-10  Description of DBLP3 Dataset

Dataset  Description        Size    Elements  Attributes  Max-Depth
DBLP3    DBLP Bibliography  127 MB  3332130   404276      6

Table D-11  Tuples Mapped in Ancestor_Path Table (29 Tuples)

Ances_PathID  Ances_PathExp
1             /dblp/mastersthesis
2             /dblp/article
3             /dblp/phdthesis
4             /dblp/www
5             /dblp/proceedings
6             /dblp/proceedings/series
7             /dblp/inproceedings
8             /dblp/inproceedings/title
9             /dblp/inproceedings/cite
10            /dblp/inproceedings/title/sup
11            /dblp/inproceedings/title/i
12            /dblp/proceedings/title
13            /dblp/proceedings/cite
14            /dblp/proceedings/publisher
15            /dblp/inproceedings/title/sub
16            /dblp/book
17            /dblp/incollection
18            /dblp/book/publisher
19            /dblp/book/cite
20            /dblp/book/series
21            /dblp/incollection/publisher
22            /dblp/incollection/title
23            /dblp/incollection/cite
24            /dblp/article/title
25            /dblp/article/title/sub
26            /dblp/article/title/i
27            /dblp/article/title/sup
28            /dblp/article/cite
29            /dblp/article/title/sub/sup

Table D-12  Part of Tuple Mapped in Leaf_Node Table (3410124 Tuples)

Node_Name  Ances_PathID  Ances_Pos  Node_Value
@key       1             1.1        ms/Brown92
title      2             1.16.      SQL/Data System, Conpects and Facilities.
journal    2             1.31.      ANSI X3H2
school     3             1.57.      Boston Univerity
author     3             1.85.      Eric Simon
title      4             1.116.     Lycos
title      7             1.146.     Graphical Design of Reactive Systems.
year       7             1.160.     1998
ee         7             1.176.     http://link.springer.de/link/service/series/0558/bibs/1528/15280049.htm
title      7             1.193.     Designing Equvialent Semantic Models for Process Creation.
author     7             1.210.     Walter Fumy
url        7             1.226.     db/conf/ac/lan.html#Bux83
booktitle  7             1.244.     Advances in Petri Nets
year       7             1.262.     1986
pages      7             1.279.     87-107
title      7             1.294.     The Application of Reason Maintenance Systems in Planning and Scheduling.
@key       7             1.311      conf/ac/Lapidoth94
year       7             1.328.     1998
pages      7             1.344.     413-434
year       7             1.361.     2001
author     7             1.377.     Jan Tarski
url        7             1.394.     db/conf/ac/ds.html#Lann80c
url        7             1.411.     db/conf/ac/ds84.html#LamportS84
author     7             1.429.     Reinhold Heckmann
booktitle  7             1.445.     State of the Art in Applied Cryptography
booktitle  7             1.460.     Computer Security and Industrial Cryptography
@key       7             1.477      conf/ac/Smith96
title      7             1.491.     AI: Introduction, Paradigms, Applications (Including CBR), Impacts, Visions.
title      7             1.505.     Elementary Net Systems.
year       7             1.522.     1978
pages      7             1.537.     251-260
url        7             1.553.     db/conf/ac/esat1991.html#Fumy91a
booktitle  7             1.570.     Advanced Course: Net Theory and Applications
year       7             1.587.     1976
author     7             1.603.     Ernst Bovelander
booktitle  7             1.616.     Artificial Evolution
author     7             1.632.     Francisco Herrera
pages      7             1.645.     109-122
pages      7             1.659.     137-144
@key       7             1.674      conf/ae/Emereev99
ee         7             1.688.     http://link.springer.de/link/service/series/0558/bibs/2310/23100091.htm
crossref   7             1.702.     conf/ae/1999
@key       7             1.716      conf/ah/PetrelliBP00
title      7             1.730.     Knowledge Computing Method for Enhancing the Effectiveness of a WWW Distance Education System.
pages      7             1.743.     144-154
year       7             1.756.     2002
author     7             1.769.     Janne Pearson
author     7             1.782.     Charalampos Karagiannidis
url        7             1.795.     db/conf/ah/ah2000.html#KurhilaS00
title      7             1.808.     Preventing Misleading Presentations of XML Documents: Some Initial Proposals.
@key       7             1.822      conf/ah/GamperK00
author     7             1.834.     Ray Armstrong
url        7             1.847.     db/conf/ah/ah2002.html#MunOW02
author     7             1.861.     Veronica Rossano
url        7             1.873.     db/conf/ai/ai2002.html#MahdiM02
@key       7             1.889      conf/ai/SilaghiSF01
booktitle  7             1.903.     Canadian Conference on AI
crossref   7             1.918.     conf/ai/1998
author     7             1.932.     Manon J. Sanscartier
url        7             1.946.     db/conf/ai/ai96.html#DentM96
year       7             1.960.     2000
@key       7             1.975      conf/ai/Salle02
author     7             1.989.     Kay Wiese
author     7             1.1004.    Qihua Situ
@key       7             1.1018     conf/ai/BarriereP00
booktitle  7             1.1031.    Canadian Conference on AI
@key       1             1.1        ms/Brown92
title      2             1.16.      SQL/Data System, Conpects and Facilities.
journal    2             1.31.      ANSI X3H2

Figure D-4  XML Mapping Interface for DBLP3 Dataset

APPENDIX E

SAMPLE RESULTS OF MAPPING XPATH QUERIES INTO SQL QUERIES USING XANCESTOR APPROACH

a. SIGMOD Dataset

XPath Query 1: /SigmodRecord/issue/volume

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp = '/SigmodRecord/issue'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'volume'

Figure E-1  Translated SQL Query for XPath Query 1

Figure E-2  Query Mapping Interface for XPath Query 1 and its SQL Query

XPath Query 2: /SigmodRecord/issue/articles/article/authors/author

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp = '/SigmodRecord/issue/articles/article/authors'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'author'

Figure E-3  Translated SQL Query for XPath Query 2

Figure E-4  Query Mapping Interface for XPath Query 2 and its SQL Query

XPath Query 3: //article/title

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%/article'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'title'

Figure E-5  Translated SQL Query for XPath Query 3

Figure E-6  Query Mapping Interface for XPath Query 3 and its SQL Query

XPath Query 4: //articles//author

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%/articles%'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'author'

Figure E-7  Translated SQL Query for XPath Query 4

Figure E-8  Query Mapping Interface for XPath Query 4 and its SQL Query
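Queries 1 to 4 follow a visible pattern: the last location step becomes the Node_Name condition, and the remaining steps become a condition on Ances_PathExp, using = for a fully specified path and LIKE with % wherever // appears. The sketch below reproduces that pattern in Python. It is an illustration inferred from the four examples, not the translation algorithm of the thesis; the function name translate_path is hypothetical, only predicate-free paths are handled, and the SQL is built by plain string formatting for readability only.

```python
def translate_path(xpath: str) -> str:
    """Translate a predicate-free XPath (e.g. //a/b) into XAncestor-style SQL."""
    ancestors, _, leaf = xpath.rpartition("/")

    # For ".../x//leaf" the ancestor part keeps a trailing "/",
    # which signals a descendant step in front of the leaf.
    leaf_is_descendant = ancestors.endswith("/")
    if leaf_is_descendant:
        ancestors = ancestors[:-1]

    if leaf_is_descendant or "//" in ancestors:
        # Each "//" becomes a "%" wildcard; a descendant leaf adds a trailing "%".
        pattern = ancestors.replace("//", "%/")
        if leaf_is_descendant:
            pattern += "%"
        path_cond = f"P.Ances_PathExp LIKE '{pattern}'"
    else:
        path_cond = f"P.Ances_PathExp = '{ancestors}'"

    return (
        "SELECT L.Node_Value FROM Leaf_Node L, Ancestor_Path P "
        f"WHERE {path_cond} "
        "AND L.Ances_PathID = P.Ances_PathID "
        f"AND L.Node_Name = '{leaf}'"
    )


for query in (
    "/SigmodRecord/issue/volume",
    "/SigmodRecord/issue/articles/article/authors/author",
    "//article/title",
    "//articles//author",
):
    print(translate_path(query))
```

Note that a pattern such as '%/articles%' also matches any ancestor path containing a step that merely begins with "articles"; this is a property of LIKE-based matching in general.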

XPath Query 5: /SigmodRecord/issue/articles/article[title = 'Database Directions III Workshop Review.']/endPage

SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp = '/SigmodRecord/issue/articles/article'
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = 'title'
AND L1.Node_Value = 'Database Directions III Workshop Review.'
AND L2.Node_Name = 'endPage'
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos

Figure E-9  Translated SQL Query for XPath Query 5

Figure E-10  Query Mapping Interface for XPath Query 5 and its SQL Query
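In the predicate query, the self-join on Ances_PathID and Ances_Pos is what ties the tested leaf (title) and the returned leaf (endPage) to the same article. The following minimal sketch runs the same query shape against a few tuples copied from Table D-3. SQLite, the column types and the parameter placeholder are assumptions of the sketch; the title used in the predicate is one that appears in the Table D-3 sample rather than the one in XPath Query 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumptions; names follow Tables D-2 and D-3.
conn.executescript("""
CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT);
CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER,
                        Ances_Pos TEXT, Node_Value TEXT);
""")
conn.execute(
    "INSERT INTO Ancestor_Path VALUES (2, '/SigmodRecord/issue/articles/article')"
)
# Leaves of two articles, as listed in Table D-3.
conn.executemany(
    "INSERT INTO Leaf_Node VALUES (?, ?, ?, ?)",
    [
        ("title", 2, "1.33.3.2.", "Computing Transitive Closures of Multilevel Relations."),
        ("initPage", 2, "1.33.3.2.", "17"),
        ("endPage", 2, "1.33.3.2.", "23"),
        ("title", 2, "1.33.3.3.", "Database Research at Bellcore."),
        ("initPage", 2, "1.33.3.3.", "45"),
        ("endPage", 2, "1.33.3.3.", "52"),
    ],
)

PREDICATE_SQL = """
SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp = '/SigmodRecord/issue/articles/article'
  AND L1.Ances_PathID = P.Ances_PathID
  AND L1.Node_Name = 'title'
  AND L1.Node_Value = ?
  AND L2.Node_Name = 'endPage'
  AND L1.Ances_PathID = L2.Ances_PathID
  AND L1.Ances_Pos = L2.Ances_Pos
"""

result = conn.execute(PREDICATE_SQL, ("Database Research at Bellcore.",)).fetchall()
print(result)  # [('52',)]
```

Only the endPage tuple whose Ances_Pos equals that of the matching title (1.33.3.3.) is returned; the endPage of the other article is filtered out by the position equality.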

XPath Query 6: //article[initPage > 50]/title

SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%/article'
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = 'initPage'
AND L1.Node_Value > '50'
AND L2.Node_Name = 'title'
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos

Figure E-11  Translated SQL Query for XPath Query 6

Figure E-12  Query Mapping Interface for XPath Query 6 and its SQL Query
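One point worth checking when reusing the range predicate of Query 6 is how the DBMS compares Node_Value with '50'. This appendix does not state the declared type of Node_Value; if it is a character type, the comparison is lexicographic rather than numeric. The sketch below illustrates the difference under that assumption, using SQLite and the initPage values listed in Table D-3. The CAST variant is a possible adjustment for illustration, not part of the translated query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: Node_Value is stored as text, since it also holds names and titles.
conn.execute(
    "CREATE TABLE Leaf_Node (Node_Name TEXT, Ances_PathID INTEGER, "
    "Ances_Pos TEXT, Node_Value TEXT)"
)
# initPage tuples as listed in Table D-3.
init_pages = [
    ("1.10.3.5.", "46"),
    ("1.13.3.25.", "307"),
    ("1.17.3.29.", "206"),
    ("1.32.3.12.", "153"),
    ("1.33.3.2.", "17"),
    ("1.33.3.3.", "45"),
    ("1.33.3.4.", "5"),
]
conn.executemany("INSERT INTO Leaf_Node VALUES ('initPage', 2, ?, ?)", init_pages)

# Text comparison: '307' sorts before '50' because '3' < '5'.
as_text = [
    value
    for (value,) in conn.execute(
        "SELECT Node_Value FROM Leaf_Node "
        "WHERE Node_Name = 'initPage' AND Node_Value > '50' ORDER BY rowid"
    )
]

# Numeric comparison after an explicit cast.
as_number = [
    value
    for (value,) in conn.execute(
        "SELECT Node_Value FROM Leaf_Node "
        "WHERE Node_Name = 'initPage' AND CAST(Node_Value AS INTEGER) > 50 "
        "ORDER BY rowid"
    )
]

print(as_text)    # []
print(as_number)  # ['307', '206', '153']
```

Other database systems may apply implicit conversions differently, so the behaviour should be verified on the system actually used.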

b. DBLP Dataset



XPath Query 1: /dblp/phdthesis/title

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp = '/dblp/phdthesis'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'title'

Figure E-13  Translated SQL Query for XPath Query 1

Figure E-14  Query Mapping Interface for XPath Query 1 and its SQL Query

XPath Query 2: /dblp/inproceedings/title/sub/sup

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp = '/dblp/inproceedings/title/sub'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'sup'

Figure E-15  Translated SQL Query for XPath Query 2

Figure E-16  Query Mapping Interface for XPath Query 2 and its SQL Query

XPath Query 3: //mastersthesis/author

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%/mastersthesis'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'author'

Figure E-17  Translated SQL Query for XPath Query 3

Figure E-18  Query Mapping Interface for XPath Query 3 and its SQL Query

XPath Query 4: //article//tt

SELECT L.Node_Value
FROM Leaf_Node L, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%/article%'
AND L.Ances_PathID = P.Ances_PathID
AND L.Node_Name = 'tt'

Figure E-19  Translated SQL Query for XPath Query 4

Figure E-20  Query Mapping Interface for XPath Query 4 and its SQL Query
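To see which ancestor paths the pattern of Query 4 selects, the LIKE condition can be evaluated on its own against the Ancestor_Path tuples. The sketch below assumes the DBLP3 paths of Table D-11 and uses SQLite; the PATHS list and the single-table setup are illustrative scaffolding, not part of the thesis prototype.

```python
import sqlite3

# The 29 tuples of Table D-11, in Ances_PathID order.
PATHS = [
    "/dblp/mastersthesis", "/dblp/article", "/dblp/phdthesis", "/dblp/www",
    "/dblp/proceedings", "/dblp/proceedings/series", "/dblp/inproceedings",
    "/dblp/inproceedings/title", "/dblp/inproceedings/cite",
    "/dblp/inproceedings/title/sup", "/dblp/inproceedings/title/i",
    "/dblp/proceedings/title", "/dblp/proceedings/cite",
    "/dblp/proceedings/publisher", "/dblp/inproceedings/title/sub",
    "/dblp/book", "/dblp/incollection", "/dblp/book/publisher",
    "/dblp/book/cite", "/dblp/book/series", "/dblp/incollection/publisher",
    "/dblp/incollection/title", "/dblp/incollection/cite",
    "/dblp/article/title", "/dblp/article/title/sub", "/dblp/article/title/i",
    "/dblp/article/title/sup", "/dblp/article/cite",
    "/dblp/article/title/sub/sup",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Ancestor_Path (Ances_PathID INTEGER PRIMARY KEY, Ances_PathExp TEXT)"
)
conn.executemany(
    "INSERT INTO Ancestor_Path VALUES (?, ?)", list(enumerate(PATHS, start=1))
)

# Path condition of XPath Query 4 (//article//tt), evaluated in isolation.
matched = [
    path_id
    for (path_id,) in conn.execute(
        "SELECT Ances_PathID FROM Ancestor_Path "
        "WHERE Ances_PathExp LIKE '%/article%' ORDER BY Ances_PathID"
    )
]
print(matched)  # [2, 24, 25, 26, 27, 28, 29]
```

Only seven of the 29 path tuples qualify, so the subsequent join with Leaf_Node is restricted to leaves located at or below an article element.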

XPath Query 5: /dblp/mastersthesis[year = 1992]/title

SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp = '/dblp/mastersthesis'
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = 'year'
AND L1.Node_Value = '1992'
AND L2.Node_Name = 'title'
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos

Figure E-21  Translated SQL Query for XPath Query 5

Figure E-22  Query Mapping Interface for XPath Query 5 and its SQL Query

XPath Query 6: //phdthesis[year > 1979]/school

SELECT L2.Node_Value
FROM Leaf_Node L1, Leaf_Node L2, Ancestor_Path P
WHERE P.Ances_PathExp LIKE '%/phdthesis'
AND L1.Ances_PathID = P.Ances_PathID
AND L1.Node_Name = 'year'
AND L1.Node_Value > '1979'
AND L2.Node_Name = 'school'
AND L1.Ances_PathID = L2.Ances_PathID
AND L1.Ances_Pos = L2.Ances_Pos

Figure E-23  Translated SQL Query for XPath Query 6

Figure E-24  Query Mapping Interface for XPath Query 6 and its SQL Query

APPENDIX F

LIST OF PUBLICATIONS

Amjad Qtaish, Kamsuriah Ahmad. Model-Mapping Approaches for Storing and Querying XML Documents in Relational Database: A Survey. Journal of Convergence Information Technology (JCIT), vol. 9, no. 2, pp. 148-155, 2014 (Indexed by Scopus).

Amjad Qtaish, Kamsuriah Ahmad. Model Mapping Approaches for XML Documents: A review. Journal of Information Science (JIS), vol. 41, no. 4, pp. 444-466, 2015 (Indexed by Scopus & ISI, IF: 0.878, Q2).

Amjad Qtaish, Kamsuriah Ahmad. Query mapping techniques for XML documents: A comparative study. International Conference on Electrical Engineering and Informatics (ICEEI), pp. 529-534, 2015 (Indexed by IEEE).