Title: Formalisms on semi-structured and unstructured data schema computations
Author: Lee, Yau-tat, Thomas (李猷達)
Issue Date: 2009
URL: http://hdl.handle.net/10722/57586
Rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.

Abstract of thesis entitled “Formalisms on Semi-structured and Unstructured Data Schema Computations” Submitted by Thomas Yau-tat LEE for the Degree of Doctor of Philosophy at The University of Hong Kong in November 2009

The nature of software applications has been evolving rapidly over the past decade, since the World Wide Web was popularized. Some web applications are required to process large datasets that do not have well-defined structures, which challenges conventional data engineering methods. A conventional data engineering method typically requires that a system architect have prior knowledge of what data are processed in an application, and how, in order to design a database schema that optimizes data computations and storage. However, for a web application processing large-scale semi-structured and unstructured data, schema design tasks cannot always be handled entirely by humans, and need to be automated by software tools. In this thesis, I study the problems of schema computations for semi-structured XML data and unstructured RDF data. The thesis consists of two parts. In the first part, I investigate the XML data interoperability problem of web services. To address this problem, I develop a formal model for XML schemas called Schema Automaton, and derive computational techniques for schema compatibility testing and subschema extraction. In the second part, I investigate different types of databases for RDF data. For one particular database type called property tables, I propose a new data-mining technique, Attribute Clustering by Table Load, to automate the schema design for the database based on the underlying data patterns.

Formalisms on Semi-structured and Unstructured Data Schema Computations

by

Thomas Yau-tat LEE M.B.A. H.K.; M.Sc. H.K.; B.Eng. H.K.

A thesis submitted in partial fulfilment of the requirements for the Degree of Doctor of Philosophy at The University of Hong Kong November 2009

DEDICATION

to my parents, LEE Chu (李柱) and LAU Mee Ching (劉美貞), who are entirely devoted to my sister and me, and have got the best education possible for us

to my sister, Sharon, my brother-in-law, Kin Yip, and my nephew, Chun Hei, who share their every happiness with me,

and to my beloved one, Karen, who always supports me with patience and love.

Declaration

I declare that this thesis and the research work thereof represent my own work, except where due acknowledgement is made, and that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualification.

Signed: Thomas Yau-tat LEE


Acknowledgements

I would like to thank my doctoral supervisor, Professor David Cheung, for teaching me the methods and attitudes for doing academic research. He has provided essential guidance for me to formulate my research problems and solutions. Moreover, Professor Cheung gave me the opportunity to do research as my work duty, which let me focus on my study without worrying about financial survival. More than my supervisor, Professor Cheung is also my inspiring mentor at work and in life.

I am indebted to my research team in the Department of Computer Science of The University of Hong Kong (HKU): Saudan Lee, Patrick Yee, Jimmy Chiu, Haiyan Zhu, and Wenjun Yuan. Apart from debugging my theories, they have contributed a lot of ideas and effort to the Attribute Clustering by Table Load experiments. I also thank other colleagues in the department for providing me with all kinds of assistance. My work in the Center for E-Commerce Infrastructure Development (CECID) of HKU has inspired me to find research directions with practicality. I thank all CECID members for sharing technology and business ideas with me and giving me help.

My parents firmly believe that knowledge, not money, improves the quality of life. They have taught me the importance of lifelong learning. I understand that every precious opportunity of receiving education has indeed been given by them. This has become an important driving force for me to accomplish my doctoral study.

During my study, there were times when the feeling of frustration superseded the sense of satisfaction inside me. Whenever I was sinking, Karen always lifted me up with her tender care. I am grateful for her love and patience. I love her.


Contents

1 Introduction

I Schema Computations for XML Data Exchange

2 Introduction to Part I
  2.1 XML Schema and Standardized Schema Libraries
  2.2 Research Problems and Contributions on XML Schema Computations
  2.3 Organization of Part I

3 Preliminaries of XML Schema
  3.1 XML Message Exchange
    3.1.1 One-Way Message Transmission
    3.1.2 Two-Way Message Exchange
  3.2 W3C XML Schema (XSD)
  3.3 Related Work on XML Schema Computations
    3.3.1 Schema Matching
    3.3.2 DTD and Extended DTD
    3.3.3 XSchema

4 Data Tree and Schema Automaton
  4.1 Data Tree
    4.1.1 Modelling XML
  4.2 Schema Automaton
    4.2.1 Schema Automaton Validating Data Tree
  4.3 Modelling W3C XML Schema

5 XML Schema Computations
  5.1 Schema Minimization
    5.1.1 Usefulness of States (XML Schema Types)
    5.1.2 Schema Automaton Minimization
  5.2 Schema Equivalence Testing
    5.2.1 Subschema Testing
  5.3 Subschema Extraction
  5.4 Complexity Analysis and Performance Improvement
    5.4.1 Speeding Up Regular Expression Tests

6 Experiments on XML Schema Computations
  6.1 Experiment 1: xCBL Compatibility Testing
  6.2 Experiment 2: xCBL Subschema Extraction

7 Conclusions of Part I

II Schema Computations for RDF Data Storage

8 Introduction to Part II
  8.1 Resource Description Framework (RDF)
  8.2 RDF Query Language
  8.3 Research Problems and Contributions on RDF Database Schema Design Automation
  8.4 Organization of Part II

9 RDF Databases
  9.1 Triple Store
    9.1.1 Self-join Problem of Triple Store
  9.2 Horizontal Database
  9.3 Vertical Database
  9.4 Property Tables
    9.4.1 Problems of Property Tables Approach
    9.4.2 Automating Schema Design of Property Tables

10 Attribute Clustering by Table Load Problem
  10.1 Problem Definition
  10.2 NP-Completeness

11 Attribute Clustering by Table Load Algorithm
  11.1 Basic Agglomerative Clustering Algorithm
    11.1.1 Core Data Structure
    11.1.2 Complexity Analysis
  11.2 Pruning Techniques
    11.2.1 Transaction Group Equality Test
    11.2.2 Maximum Combined Load Factor
    11.2.3 Iterative Clustering
  11.3 Modified Clustering Algorithm
  11.4 Attribute Connectivity
  11.5 Measuring Schema Fitness
  11.6 Comparison with Other Clustering Techniques
    11.6.1 Frequent Patterns Mining
    11.6.2 HoVer Clustering Algorithm

12 Experiments on Attribute Clustering by Table Load
  12.1 Datasets for Experiments
  12.2 Experiment 1: Clustering Performance and Schema Fitness
  12.3 Experiment 2: Query Performance
  12.4 Experiment 3: Frequent Patterns Mining

13 Conclusions of Part II

III Appendix

A1 Queries Used in Experiment 2 of Part II

Doctoral Thesis, Thomas Yau-tat LEE, The University of Hong Kong

List of Figures

3.1 Schema compatibility between a sender and a receiver
3.2 Schema compatibility between an initiator and a responder
4.1 DT 1
4.2 DT 2
4.3 DT for Quote document
4.4 DT for Order document
4.5 SA example
4.6 SA 1 modelling XSD 1
4.7 SA 2 modelling XSD 2
5.1 Example of SA that contains useless states
5.2 RE tree of ((A+|(AB)+)+(B+|(BA)*))+
5.3 Boolean expression tree to test if A is mandatory
5.4 Boolean expression tree to test if B is mandatory
5.5 SA 3 modelling XSD 3
6.1 Performance of different filtering techniques for HLang RE tests
6.2 Subschema extraction on xCBL 3.0
6.3 Subschema extraction on xCBL 3.5
8.1 Example RDF graph
8.2 Triple store database
9.1 Jena1 triple store database
9.2 Graphs to represent triplet patterns and join order
9.3 Horizontal database
9.4 Vertical database
9.5 Property tables database
11.1 Cluster sorter data structure
11.2 Example on attribute connectivity
11.3 HoVer example
12.1 Wikipedia Infobox data
12.2 Performance of ACTL with pruning (Wikipedia Infobox dataset)
12.3 Performance of ACTL with pruning (Barton Libraries dataset)
12.4 Schema fitness (Wikipedia Infobox dataset)
12.5 Schema fitness (Barton Libraries dataset)

List of Tables

2.1 Numbers of schema files in different formats published on W3C and OASIS
2.2 XSD sizes of xCBL and UBL standards
4.1 Binding map of DT 1 against SA in Figure 4.5
6.1 Performance of different filtering techniques for HLang RE tests
6.2 Subschema extraction on xCBL 3.0
6.3 Subschema extraction on xCBL 3.5
9.1 Six indexes to cover 16 query patterns in YARS2
12.1 Cluster size distribution
12.2 Query performance of triple store (TS), vertical database (Vert) and ACTL-mined property tables (ACTL)
12.3 Frequent patterns generated by FP-Growth
A1.1 Attribute identifiers for query transformation

1 Introduction

The nature of software applications has been evolving rapidly over the past decade, since the World Wide Web was popularized. A web application in an organization is often required to interact with external users through the Internet rather than serving internal users and systems in a well-controlled environment. Outside the authority of the organization, these external users (computers or humans) may generate requirements that are dynamic and not well-defined. This requires a web application to be developed in an extensible and flexible way to handle these requirements.

This new application development trend has challenged conventional data engineering methods. A conventional data engineering approach usually assumes that what data are processed, and how, are known in advance, so that the data can be structured to optimize computations and storage. For example, when designing a traditional standalone application, a database architect may use the entity-relationship approach to model how data of a known structure should be organized in a database so that a known set of queries can be answered efficiently. Optimizing the processing speed and storage space of a system is usually a key data engineering objective.

Extensibility and flexibility have become important considerations in the development of web applications, and various semi-structured and unstructured data formats have arisen to meet these needs. The Extensible Markup Language (XML)[1] is the mainstream format for semi-structured data because of its extensible tree structure. An application that needs to exchange XML messages with other applications is commonly programmed against the Web Services standards.[2] Before two web services can interact, they must agree on the permissible structures and contents of the XML messages being exchanged, which can be specified in some XML schema language. Therefore, when n different web services talk to each other, at worst n(n−1)/2 XML schemas must be defined among the services. Because of this many-to-many data interoperability problem, many XML-based standards for different domains have been created as XML schemas. (See Table 2.1.) Different web services in the same domain may use the same standardized XML schema to avoid repetitive efforts in schema definition. To cater for a variety of applications, a standard XML schema is usually highly generalized, which makes the schema definition very large. However, since different web services use different subsets of the schema, the many-to-many interoperability problem still exists. Nevertheless, this problem is now mitigated by specializing one standard XML schema for different applications instead of creating many different schemas from scratch.

Besides Web Services, the Semantic Web[2] is another important development in Web technology. The Semantic Web was proposed to annotate web resources with machine-processable meanings and relationships.[3] The Resource Description Framework (RDF) was developed as a computer language to specify these annotations.[4] Basically, RDF is a graph-based language for describing unstructured semantic data. Unlike XML data, the structures of RDF data cannot be pre-defined by some schema language.1 Despite the flexibility of RDF, its lack of well-defined structures makes it difficult for applications to optimize data storage and query processing. Although we cannot specify the expected RDF data structures using some prescriptive schema before the data exist, we can mine the actual data patterns into some descriptive schema after the data exist.[5] Different from a prescriptive schema, which governs all data structures permitted by an application, a descriptive schema only summarizes the actual data patterns being processed by an application. In other words, the actual data patterns may not entirely conform to the schema, but do generally follow it.

For traditional databases, schemas are typically designed by humans because the structures and queries of the data are well-defined. In contrast, to process large-scale semi-structured and unstructured datasets, schema design tasks need to be aided or automated by computer programs. In this doctoral thesis, I will introduce the problems in modelling and computing semi-structured and unstructured data schemas. The thesis consists of two parts. In Part I, I will study the XML data interoperability problem of web services. To address this problem, I will propose a formal model for XML schemas called Schema Automaton, and will derive computational techniques for schema compatibility testing and subschema extraction. In Part II, I will investigate different architectures of RDF databases. For one particular database architecture called property tables, I will introduce a new data-mining technique called Attribute Clustering by Table Load to automate the schema design.

1 Despite its name, RDF Schema (RDFS) is not a schema language to prescribe or validate RDF data structures. Instead, RDFS is an RDF vocabulary description language for specifying meta-data descriptions for semantic inference.
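As a quick check on the worst-case count of n(n−1)/2 schemas quoted in this chapter, the sketch below (service names invented for illustration) enumerates the unordered pairs of n services, each of which would need its own agreed schema:

```python
from itertools import combinations

def worst_case_schema_count(n: int) -> int:
    """At worst, one agreed schema per unordered pair of services: n(n-1)/2."""
    return n * (n - 1) // 2

# Enumerate the pairs explicitly and compare with the closed form.
services = [f"service{i}" for i in range(1, 6)]
pairs = list(combinations(services, 2))
assert len(pairs) == worst_case_schema_count(len(services)) == 10
```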


Part I

Schema Computations for XML Data Exchange


2 Introduction to Part I

Today, software applications are commonly implemented as distributed services over the Internet. New distributed computing architectures, such as service-oriented architecture, software as a service, and cloud computing, enable applications to be delivered as web services for invocation by other web services. One key consideration in implementing these architectures is how to effectively manage the interoperability between interacting web services. Yet, the term "interoperability" is vaguely defined. Web service standards, such as SOAP[6] with the Web Services Description Language (WSDL)[7], and ebXML Messaging Service[8], provide the messaging protocols that let applications running on heterogeneous technology platforms interoperate. This technology interoperability is well addressed by many standards consortia, such as W3C[9], OASIS[10], and WS-I[11]. However, that two web services are interoperable cannot be guaranteed simply by their using the same web service standard.

Another dimension of interoperability is at the data level. The Extensible Markup Language (XML)[1] is typically used as the message format for web services to exchange data. Data interoperability concerns whether one web service is able to transmit XML data that can be processed by another web service. It is a more complex problem than technology interoperability because it is often resolved application by application. Various initiatives, such as the Universal Business Language (UBL)[12], have been established to standardize XML messages for business applications. Nevertheless, these data standards can only reduce the complexity of data interoperability between web services; they cannot provide true plug-and-play solutions.


Website          #XSDs        #DTDs     #RNGs+RNCs
w3.org           6,490        612       445+37=482
oasis-open.org   2,330        198       106+128=234
total:           8,820 (85%)  810 (8%)  716 (7%)

Table 2.1: Numbers of schema files in different formats published on W3C and OASIS

2.1 XML Schema and Standardized Schema Libraries

The structures of XML messages permitted by an application can be defined in an XML schema language. For example, a product quotation web service receives an RFQ (request for quote) document, and then sends a Quote document. The RFQ schema defines the set of all possible XML messages that can be accepted by this product quotation service, while the Quote schema defines the set of all possible XML messages that can be generated by the service. The data interoperability between two web services depends on the schemas they use. Popular XML schema languages include the Document Type Definition (DTD)[1], W3C XML Schema (XSD)[13], and RelaxNG[14]. Table 2.1 lists the numbers of schema files in four formats, namely XSD, DTD, RelaxNG XML (RNG), and RelaxNG compact (RNC), published on the W3C[9] and OASIS[10] websites.1 It shows that the majority of the schema files published on these websites are written in XSD.

Many e-business standards are defined in XSD; some of these are very large. Two popular e-business standards are the XML Common Business Library (xCBL)[15] and the OASIS Universal Business Language (UBL)[12]. Table 2.2 lists the numbers of datatypes (#types), element declarations (#edecls), document types (#doctypes), XSD files (#files), and the file sizes (size) of xCBL 3.0[16], xCBL 3.5[17], xCBL 4.0[18], UBL 1.0[19], and UBL 2.0[20].

XSD        xCBL 3.0  xCBL 3.5  xCBL 4.0  UBL 1.0  UBL 2.0
#types     1,290     1,476     830       226      682
#edecls    3,728     4,473     2,941     1,098    2,918
#doctypes  42        51        44        8        31
#files     413       496       709       27       43
size (MB)  1.8       2.0       6.3       0.9      2.7

Table 2.2: XSD sizes of xCBL and UBL standards

Such a standard is a schema library, which may contain thousands of datatype and element definitions. Many different document types (e.g., Quote, Order, Invoice) are usually specified in a single standard. Generally, each document type is not defined as an independent XSD; some datatype definitions may be shared among different document types. For example, the PostalAddress datatype may be defined only once but reused by many document types, e.g., Order and Invoice.

In real business cases, even when two web services apply the same data standard, they need not be interoperable with each other. Usually, a specific web service handles only several document types. For example, a UBL-based product quotation web service only needs to process RFQ and Quote documents and can safely ignore the other, irrelevant document types. In other words, this service only needs to process a subset of UBL instances. It is also typical that a web service restricts a data standard to meet its specific business requirements. For example, the <PostalCode> element might be defined as optional in UBL because not all countries use postal codes in their addresses. However, a web service specific to the USA environment may require <PostalCode> as a mandatory element in all received XML documents. In this example, even if a sender submits a UBL-compliant document to the web service, the service may still reject the document. Therefore, it is more practical to model an exchange of XML messages between two web services by a sending schema and a receiving schema. The sending schema specifies all possible XML messages the sender can send, and the receiving schema specifies all possible messages the receiver can receive. Then, whether two web services are able to exchange all possible messages is determined by whether the receiving schema accepts all possible instances of the sending schema. When the receiving schema accepts all instances of the sending schema, the receiving schema is said to be compatible with the sending schema.

1 The numbers were reported by Google search as of August 2008.
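The compatibility notion just described is set containment: a receiving schema is compatible with a sending schema exactly when its instance set is a superset of the sender's. A minimal sketch of the idea, with toy schemas modelled simply as finite sets of instances (the sample messages are invented; real XML schemas define infinite instance sets, which is what Part I's Schema Automaton handles):

```python
# Toy model: a "schema" here is just the finite set of messages it accepts.
sending_schema = {"<Quote/>", "<Quote currency='USD'/>"}
receiving_schema = {"<Quote/>", "<Quote currency='USD'/>", "<Quote currency='EUR'/>"}

def is_compatible(receiving: set, sending: set) -> bool:
    """The receiver accepts every message the sender can produce."""
    return sending <= receiving  # subset test

assert is_compatible(receiving_schema, sending_schema)      # receiver covers sender
assert not is_compatible(sending_schema, receiving_schema)  # but not vice versa
```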

2.2 Research Problems and Contributions on XML Schema Computations

In most cases, it is not feasible to manually verify the compatibility of large schemas like xCBL and UBL, and in turn the data interoperability between web services. In this part of the thesis, I study the following two schema computation problems.

Schema compatibility. There are two levels of schema compatibility. First, schema A is equivalent to schema B when they accept the same set of instances. Second, A is a subschema of B when B accepts every instance of A. The schema compatibility problem is how to verify whether one schema is equivalent to, or a subschema of, another schema. This problem is relevant to many applications; the following are two examples: (1) web service interoperability, and (2) schema version compatibility. On web service interoperability, if web service A needs to accept all messages sent from web service B, the sending schema of B must be a subschema of the receiving schema of A. On schema version compatibility, when a data standard schema is updated to a new version, the new version must be a superschema of the old version in order to maintain backward compatibility. This way, a new application using the new schema version can accept all data generated by an existing application using the previous version.

Subschema extraction. Before an application can use an XML schema to validate XML data, the application is required to load and parse the schema into a data structure in main memory. At run time, processing a huge schema may create considerable memory and performance overheads. At design time, it is difficult for a programmer to comprehend a huge schema that defines thousands of types and elements when developing an application. In reality, an application usually processes only a few document types defined in a huge schema. For example, a quotation application which processes only Quote and RFQ documents in xCBL 3.5 (i.e., 2 out of 51 document types) needs only a small subschema of the huge xCBL 3.5 schema. Therefore, the subschema extraction problem is how to extract a trimmed-down subschema that only recognizes a subset of the elements defined in the original schema. Precisely, given a schema and a set of elements, how can we compute a subschema that accepts an instance of the given schema if the instance contains only elements from the given set, and rejects it otherwise?

To solve the above problems, I have developed two formal models, namely Data Tree and Schema Automaton, for modelling hierarchical data instances and schemas respectively. In particular, these models represent XML documents and schemas well. Because of the popularity of XSD, our discussion will focus on how Schema Automata represent XSDs. I have also formulated two classes of schema computation operations, namely schema compatibility testing and subschema extraction. These operations are supported by four main algorithms: schema minimization, schema equivalence testing, subschema testing, and subschema extraction. I have implemented the models and algorithms, and have experimented with them on xCBL datasets. The first experiment refuted the claim of xCBL that v3.5 is compatible with v3.0. In the second experiment, xCBL XSDs were effectively trimmed down using subschema extraction.

The main contributions of this part of my research are summarized as follows:
1. How the data interoperability between web services is affected by XML schema compatibility has been studied.
2. New formal models called Data Tree and Schema Automaton have been developed to represent XML instances and schemas.
3. Several practical algorithms based on these models have been developed for schema compatibility testing and subschema extraction.
4. How these algorithms can be used in real industry cases has been demonstrated through experiments. One experiment has shown that xCBL 3.5 is actually not compatible with xCBL 3.0 despite the claimed backward compatibility. Another experiment has shown that the subschema extraction algorithm can effectively trim the large xCBL XSDs into small subschemas; the processing time for each subschema has been greatly shortened.
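On the toy set-of-instances model (a stand-in for the thesis's Schema Automaton machinery; the element and document names below are invented), the subschema extraction problem can be phrased as filtering the instance set down to the instances built only from allowed elements:

```python
import re

# Toy instances: flat XML-ish strings. A real solution operates on the schema
# itself rather than enumerating instances; this only illustrates the problem.
schema_instances = {
    "<Quote><Price/></Quote>",
    "<RFQ><Item/></RFQ>",
    "<Invoice><Price/><Tax/></Invoice>",
}

def elements_of(instance: str) -> set:
    """Element names occurring in a toy instance string."""
    return set(re.findall(r"<(\w+)", instance))

def extract_subschema(instances: set, allowed: set) -> set:
    """Keep exactly the instances that use only allowed element names."""
    return {i for i in instances if elements_of(i) <= allowed}

sub = extract_subschema(schema_instances, {"Quote", "RFQ", "Price", "Item"})
assert sub == {"<Quote><Price/></Quote>", "<RFQ><Item/></RFQ>"}
```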

2.3 Organization of Part I The rest of Part I of this thesis is organized as follows. Chapter 3 will elaborate how the data interoperability between two web services is affected by schema compatibility and will give some motivating XSD examples to illustrate the schema compatibility problems. I will also review other related research works on XML schema formalisms and computations. In Chapter 4, I will formalize the models of Data Tree (DT) and Schema Automaton (SA) to represent XML instances and schemas respectively for various schema computations. Chapter 5 will provide the theorems and algorithms for the following schema computations: Schema minimization Given some SA (i.e., an XSD), find an equivalent SA which ac-

cepts the same instance set accepted by the given SA and has the fewest states (i.e., XSD data types) among all equivalent SAs. Schema equivalence testing Given two SAs, decide whether they accept the same in-

stance set. Subschema testing Given SA A and B, decide whether the instance set accepted by A is

Doctoral Thesis, Thomas Yau-tat LEE

The University of Hong Kong

Chapter 2. Introduction to Part I

15

a subset of the instance set accepted by B. Subschema extraction Given some SA and some set of symbols (i.e., XML element

names), find a subschema of SA which accepts the instances of the given SA containing only the symbols in the given symbol set. Chapter 6 will describe two experiments and will analyze their results. The first experiment verified whether xCBL 3.0 was a xCBL 3.5 as claimed. The second experiment extracted different subschemas from xCBL 3.0 and xCBL 3.5, and compared the sizes and processing performances of these subschemas with those of the original schemas. Finally, Chapter 7 will describe potential extensions of this research and will conclude this part.


3 Preliminaries of XML Schema

In this chapter, I will discuss how schema compatibility affects the data interoperability between web services. Using some motivating examples, I will elaborate how schema compatibility is defined based on XSD. Then, I will review some related work on formalisms and computations on XML schemas.

3.1 XML Message Exchange

3.1.1 One-Way Message Transmission

The transmission of messages between two web services should be modelled by one sending schema and one receiving schema instead of only one schema between them. As shown in Figure 3.1, the sending schema (the gray arrow on the left) defines the set of all possible XML instances the sender can generate while the receiving schema (the gray arrow on the right) defines the set of all possible instances the receiver can accept. Therefore, if the receiver is required to accept all messages generated by the sender, the instance set of the receiving schema must be a superset of the instance set of the sending schema. In this case, it is said that the receiving schema is a superschema of the sending schema, or the sending schema is a subschema of the receiving schema. In other words, the receiving schema is compatible with the sending schema.

3.1.2 Two-Way Message Exchange

The interaction between two web services is more often a two-way message exchange than a one-way message transmission. In this situation, when web service A invokes another web service B, A initiates the invocation by sending a request message to B. B processes the request and then replies to A with a response message. A two-way message exchange is composed of two one-way message transmissions in different directions; each transmission involves one sending schema and one receiving schema.

Figure 3.1: Schema compatibility between a sender and a receiver

Figure 3.2: Schema compatibility between an initiator and a responder

As shown in Figure 3.2, the interoperability of these two web services requires:

1. the sending schema of the initiator for request messages must be a subschema of the receiving schema of the responder for request messages, and

2. the sending schema of the responder for response messages must be a subschema of the receiving schema of the initiator for response messages.

Let us now look at the conditions under which two web services using two different schema versions interoperate with each other when the new version is backward compatible with the old version. If the responder is required to interoperate with initiators using either of the two schema versions, the responder must:

1. receive request messages based on the new version, but

2. send response messages based on the old version.

On the other hand, if the initiator is required to interoperate with responders using either of the two schema versions, the initiator must:

1. send request messages based on the old version, but

2. receive response messages based on the new version.
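The two subschema conditions above can be phrased as a single boolean test. The following is a toy sketch in which a schema is modelled as an explicit set of instances and the subschema relation is plain set containment; `is_subschema`, `services_interoperate`, and the field names are illustrative, not from the thesis implementation:

```python
from types import SimpleNamespace

def is_subschema(a, b):
    # Toy stand-in: schemas are modelled as explicit instance sets, so the
    # subschema relation is plain set containment.
    return a <= b

def services_interoperate(initiator, responder):
    # Condition 1: the initiator's request sending schema must be a subschema
    # of the responder's request receiving schema.
    # Condition 2: the responder's response sending schema must be a subschema
    # of the initiator's response receiving schema.
    return (is_subschema(initiator.request_sending, responder.request_receiving) and
            is_subschema(responder.response_sending, initiator.response_receiving))

old = {"Quote"}            # instance set of the old schema version
new = {"Quote", "Order"}   # backward-compatible new version: a superset

# The responder receives by the new version and sends by the old version;
# the initiator sends by the old version and receives by the new version.
initiator = SimpleNamespace(request_sending=old, response_receiving=new)
responder = SimpleNamespace(request_receiving=new, response_sending=old)
print(services_interoperate(initiator, responder))  # True
```

Swapping either side to require the narrower schema on the receiving end breaks the first condition and the test correctly fails.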

3.2 W3C XML Schema (XSD)

An XSD consists of a collection of element declarations and datatype definitions. The elements declared at the top level of the XSD (immediately under ⟨xs:schema⟩) can be used as the root elements of XML instances. An element is bound to some datatype. A datatype can be defined as an anonymous type locally within an element declaration. An anonymous datatype can only be bound to its parent element declaration but cannot be reused by other element declarations. (See Listing 3.1.) A datatype can also be defined globally and assigned a name such that this named datatype can be reused by multiple element declarations. (See Listing 3.2.) Moreover, there are two kinds of datatypes: complex types and simple types. When a parent element contains some child element or attribute, this parent element must be declared with a complex type. In contrast, a simple type defines the value space for an element or an attribute. XSD has defined a set of built-in simple types for extension or restriction to user-defined simple types.

The following examples of XSDs help explain the research problems. Listing 3.1 (XSD 1) and Listing 3.2 (XSD 2) are two different XSDs that accept the same set of XML instances.¹ They are considered equivalent to each other. An XML instance must have the root element named either ⟨Quote⟩ or ⟨Order⟩. (The documents with two different root element names can be regarded as two different document types.) Inside a Quote document are one or more ⟨Line⟩ elements. Under each ⟨Line⟩, there are one ⟨Desc⟩ element and one ⟨Price⟩ element. A ⟨Desc⟩ contains a product description (string) while a ⟨Price⟩ contains a product price (decimal). In an Order document, there are one or more ⟨Line⟩ elements. Each ⟨Line⟩ contains one ⟨Product⟩ and one ⟨Qty⟩ (integer). A ⟨Product⟩ has one ⟨Desc⟩ (string) and one ⟨Price⟩ (decimal) as children. Listing 3.3 and Listing 3.4 are two instances of XSD 1 and XSD 2.

Listing 3.1: XSD 1 for Quote and Order documents

¹The design styles of XSD 1 and XSD 2 are called Russian Doll and Venetian Blind respectively.




Listing 3.2: XSD 2 for Quote and Order documents




Listing 3.3: A Quote document

<Quote>
  <Line><Desc>hPhone</Desc><Price>499.9</Price></Line>
  <Line><Desc>iMat</Desc><Price>999.9</Price></Line>
</Quote>

Listing 3.4: An Order document

<Order>
  <Line><Product><Desc>hPhone</Desc><Price>499.9</Price></Product><Qty>2</Qty></Line>
</Order>

However, XSD 1 is larger than XSD 2 despite their equivalence. XSD 1 defines 5 complex types and declares 10 elements while XSD 2 has only 4 complex type definitions and 8 element declarations. In XSD 1, each complex type is defined as an anonymous type; hence, there is no reuse of type definitions. On the contrary, XSD 2 defines each complex type as a named datatype so that multiple element declarations can reference the same type and reuse its content model. For example, elements ⟨Line⟩ and ⟨Product⟩ have reused complex type ProdType. In fact, XSD 2 has maximized type reuse and represents a minimal schema.

XSD 3 (Listing 3.5) is a subschema of both XSD 1 and XSD 2. XSD 3 accepts only the instances with Quote as the root element and rejects other instances. For example, XSD 3 accepts the XML document in Listing 3.3 but rejects the one in Listing 3.4. XSD 3 is even smaller than XSD 2 and contains only 2 complex types and 4 elements. Note that since the datatype names are only used for referencing within an XSD, these names (e.g., q0 and q1) do not affect the instance set of the XSD.

Listing 3.5: XSD 3 as subschema of XSD 1 and XSD 2

Regarding the above examples, I will provide the formal models and algorithms to solve the following problems:

Schema compatibility testing. (1) How to verify that XSD 1 and XSD 2 are equivalent. (2) How to verify that XSD 3 is a subschema of XSD 1 and XSD 2.

Subschema extraction. Given XSD 2 (or XSD 1), how to extract a smaller subschema XSD 3 when XSD 3 only needs to recognize the elements in a Quote document.


3.3 Related Work on XML Schema Computations

Various research projects have proposed different formalisms for XML schema languages. Although XSD is more popular in industry adoption and more powerful in expressiveness than DTD, DTD has attracted more research efforts than XSD. Martens et al. attributed this to the perceived simplicity of DTD and the alleged impenetrability of XSD.[21] The primary difference between DTD and XSD in expressiveness is due to the lack of element typing in DTD. XSD allows a content model to be defined as a type, which can be reused by multiple element declarations. For example, in the XSD in Listing 3.2, complex type ProdType is reused in complex types QuoteType and OrderType. However, in DTD, every element must be declared with its own content model. Moreover, DTD only permits all elements with the same name to use the same content model regardless of their contexts. In contrast, XSD allows two elements declared with the same name but under different complex types (contexts) to use different content models (types). For example, in Listing 3.2, the elements named ⟨Line⟩ under QuoteType and OrderType use the different types ProdType and OrderLineType respectively. This XSD example cannot be expressed in DTD.

3.3.1 Schema Matching

Many early schema research works have studied the schema matching problem. They were motivated by the interoperability of two applications using two schemas with different structures and elements to exchange data. Milo and Zohar introduced a system that aimed to automate the matching of elements in two schemas by their names and structures for data translation.[22] Madhavan et al. proposed a similar approach to perform schema matching in three phases.[23] In the first phase, called linguistic matching, the similarity coefficients of elements are calculated based on their names. In the second phase, called structural matching, the similarity coefficients of elements are calculated based on their structures. In the third phase, called mapping generation, mappings of elements in the two schemas are produced based on these similarity coefficients. For a detailed comparison among research projects on schema matching, the survey authored by Rahm and Bernstein serves as a useful reference.[24]

However, the schema models proposed by these projects are over-simplified because they are either proprietary or DTD-like, and cannot represent core XSD constructs. Moreover, these schema matching approaches usually share a common problem that they use “artificial” matching rules. They can only generate a meaningful element mapping when two schemas use very similar element names and very simple structures. For example, the elements ⟨author⟩ and ⟨Author⟩ in two schemas may be matched by these algorithms just because the two elements share the same name case-insensitively. A schema mapping automatically generated by these algorithms without manual adjustment is often unreliable because schemas are not semantically annotated.[24] While schema matching is done at the syntax level rather than the semantic level, two semantically different elements with an identical element name in two schemas may mislead the algorithms to link these unrelated elements together. For example, a ⟨line⟩ element may represent a purchase order line item in one schema but another ⟨line⟩ element may represent an address line in the other schema; it is not correct to match these two unrelated elements just because they have the same name. Also, these schema matching projects have not considered the schema compatibility and subschema extraction problems addressed by my research.

3.3.2 DTD and Extended DTD

A DTD is commonly abstracted as a set of production rules in the form of a → r, where a is an element and r is a regular expression over the element set.[25] Martens et al. formalized DTD as a tuple (Σ, d, s), where Σ is a finite alphabet called element names, d is a function that maps Σ to regular expressions over Σ, and s ∈ Σ is the start element. For example, a DTD that recognizes the XML document in Listing 3.3 can be written as follows:

⟨Quote⟩ → ⟨Line⟩*
⟨Line⟩ → ⟨Desc⟩ ⟨Price⟩
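The production rules above can be made executable by representing d as a map from element names to regular expressions over child-name sequences. The following is a minimal sketch for the Quote fragment; the dictionary encoding and tree representation are my own, not from the thesis:

```python
import re

# A DTD abstracted as (Sigma, d, s): d maps each element name to a regular
# expression over element names, encoded here as a Python regex over
# space-terminated child-name sequences.
dtd = {
    "Quote": r"(Line )*",     # <Quote> -> <Line>*
    "Line":  r"Desc Price ",  # <Line>  -> <Desc> <Price>
    "Desc":  r"",             # leaves have the empty content model
    "Price": r"",
}
start = "Quote"

def valid(name, children):
    """children: list of (child_name, grandchildren) pairs."""
    seq = "".join(c + " " for c, _ in children)
    if not re.fullmatch(dtd[name], seq):
        return False
    return all(valid(c, cc) for c, cc in children)

# The Quote document of Listing 3.3 as a nested tuple tree.
doc = ("Quote", [("Line", [("Desc", []), ("Price", [])]),
                 ("Line", [("Desc", []), ("Price", [])])])
print(valid(*doc))  # True
```

Note how every ⟨Line⟩, regardless of context, must use the single content model registered under "Line", which is exactly the DTD limitation the surrounding text describes.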

Since DTD has weaker expressive power than XSD, Papakonstantinou and Vianu[26] proposed a specialized or extended DTD (EDTD) model, adding element typing to DTD. EDTD is theoretically backed by the theory of tree automata[27] for unranked trees. Martens and Niehren studied the schema minimization problems for different types of unranked tree automata (UTAs).[28] An EDTD is formalized as a tuple (Σ, ∆, d, s, µ), where:

1. ∆ is a finite alphabet called types,

2. Σ is a finite alphabet called element names,

3. (∆, d, s) is a DTD, where d is a function that maps ∆ to regular expressions over ∆ and s ∈ ∆ is a start type, and

4. µ : ∆ → Σ is a function that maps each type to some element name.

An EDTD embeds a DTD and adds an additional alphabet and a mapping function. The alphabet used in the embedded DTD is called types, instead of element names. A new alphabet called element names is introduced to the EDTD, with a mapping function µ that maps each type to some element name. Intuitively, the mapping function transforms each XML instance X of the embedded DTD to another XML instance Y of the EDTD by renaming each type (in ∆) to some new element name (in Σ) through µ. Martens et al. published a detailed study on EDTD and its derivatives.[25] For example, an EDTD that recognizes both documents in Listing 3.3 and Listing 3.4 can be written as follows:

QuoteType → QuoteLineType*
QuoteLineType → DescType PriceType
OrderType → OrderLineType*
OrderLineType → ProductType*
ProductType → DescType PriceType

µ(QuoteType) = ⟨Quote⟩
µ(QuoteLineType) = ⟨Line⟩
µ(DescType) = ⟨Desc⟩
µ(PriceType) = ⟨Price⟩
µ(OrderType) = ⟨Order⟩
µ(OrderLineType) = ⟨Line⟩
µ(ProductType) = ⟨Product⟩
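The renaming role of µ can be sketched directly in code: a typed tree over ∆ is turned into an element-named tree over Σ by relabelling every node. The fragment below covers only the Quote half of the example EDTD, and the tuple encoding of trees is illustrative:

```python
# mu for the Quote half of the example EDTD: each type in Delta is renamed
# to an element name in Sigma. Two types may map to the same element name.
mu = {"QuoteType": "Quote", "QuoteLineType": "Line",
      "DescType": "Desc", "PriceType": "Price"}

def rename(tree):
    """Apply mu to every type label in a typed tree (type, children)."""
    t, children = tree
    return (mu[t], [rename(c) for c in children])

# An instance of the embedded DTD over types...
typed = ("QuoteType", [("QuoteLineType", [("DescType", []), ("PriceType", [])])])
# ...becomes an instance of the EDTD over element names.
print(rename(typed))
# ('Quote', [('Line', [('Desc', []), ('Price', [])])])
```

In the full example, µ sends both QuoteLineType and OrderLineType to ⟨Line⟩, which is precisely how EDTD lets one element name carry different content models in different contexts.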

Although EDTD has added types to DTD, it is different from XSD. In general, an EDTD can be non-deterministic. Yet, there is a special class of EDTDs called single-type and restrained-competition EDTDs[28], with which validation of XML trees is top-down deterministic. This class of EDTDs has been shown to have the same expressive power as XSD. Nevertheless, EDTD still differs from XSD in how types are reused. In EDTD, a type is associated with a regular expression over types, where each type is uniquely mapped to an element. In contrast, in XSD, a type is associated with a regular expression over elements and each element is mapped to a type. The type reuse in XSD is more efficient than that in EDTD. For example, in the above EDTD, QuoteLineType and ProductType must be defined as two separate types so that the two types can be mapped to two different element names although the two types share the same content model. As illustrated in the corresponding XSD in Listing 3.2, this content model only needs to be defined once in ProdType even though it is shared by two elements with different names, i.e., ⟨Line⟩ and ⟨Product⟩.

3.3.3 XSchema

In another paper, Martens et al. proposed a more accurate XSD abstraction called XSchema.[21] An XSchema is a tuple (ENames, Types, ρ, t0), where:

1. ENames is a set of element names,

2. Types is a set of types,

3. ρ is a function that maps each type in Types to some regular expression over the alphabet Elems(ENames, Types), where Elems(ENames, Types) = {a[t] : a ∈ ENames, t ∈ Types}, and

4. t0 ∈ Types is the start type.

For example, the XSD in Listing 3.2 can be modelled as follows:

t0 → ⟨Quote⟩[QuoteType] | ⟨Order⟩[OrderType]
QuoteType → ⟨Line⟩[ProdType]*
OrderType → ⟨Line⟩[OrderLineType]*
ProdType → ⟨Desc⟩[Empty] ⟨Price⟩[Empty]
OrderLineType → ⟨Product⟩[ProdType] ⟨Qty⟩[Empty]
Empty → ε

Because of the Element Declarations Consistent (EDC) constraint of XSD, which will be discussed in Section 4.3, Martens et al. defined a special class of XSchema called single-type XSchema as an abstraction of XSD. An XSchema is single-type when there are no elements a[t1] and a[t2] in any ρ(t) with t1 ≠ t2. The Schema Automaton (SA) model to be proposed in Chapter 4 resembles the single-type XSchema model in representing XSDs. Yet, SA provides a richer abstraction of XSD than XSchema does because of the following:

1. An XSchema does not validate data values inside elements. In other words, it does not model XSD simple types and built-in types. SA models simple types as value domains.

2. A formal model for XML documents is not defined with respect to the XSchema model to illustrate how an XSchema can operate on XML documents. With respect to the SA model, the Data Tree model is defined for XML instances.

3. SA uses a different formalism that better facilitates schema computations, such as schema minimization, schema compatibility testing, and schema extraction. These computational problems have not been studied for XSchema. Also, some important concepts proposed by my research, such as the usefulness of XSD types, have not been considered for XSchema.
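The single-type condition above can be checked mechanically by scanning each production for one element name paired with two different types. The following sketch runs over the XSchema for XSD 2, with each production flattened to the list of element atoms a[t] it mentions; the encoding is mine, not from the cited paper:

```python
def is_single_type(rho):
    """True when no production mentions one element name with two types."""
    for t, atoms in rho.items():
        seen = {}
        for name, typ in atoms:
            # setdefault records the first type seen for this name; any later
            # atom with the same name must agree with it.
            if seen.setdefault(name, typ) != typ:
                return False
    return True

# The XSchema modelling XSD 2: each atom is an (element name, type) pair.
rho = {
    "t0":            [("Quote", "QuoteType"), ("Order", "OrderType")],
    "QuoteType":     [("Line", "ProdType")],
    "OrderType":     [("Line", "OrderLineType")],
    "ProdType":      [("Desc", "Empty"), ("Price", "Empty")],
    "OrderLineType": [("Product", "ProdType"), ("Qty", "Empty")],
    "Empty":         [],
}
print(is_single_type(rho))  # True
```

⟨Line⟩ appears with two different types, ProdType and OrderLineType, but never inside the same production, so the schema is single-type; putting both atoms in one production would make the check fail.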


4 Data Tree and Schema Automaton

In this chapter, I will formalize the models of Data Tree (DT) and Schema Automaton (SA). A DT is a tree-form data model. An SA is a deterministic finite automaton (DFA) that recognizes DTs. I will also elaborate how DT and SA can be used to model XML documents and XSDs.

4.1 Data Tree

A DT is a generic tree-form data model. Each tree node is called a data node (d-node), which can store a data value. A d-node may have some child d-nodes. The parent is connected to each child by an edge called a data edge (d-edge). Each d-edge is labelled with a symbol. See Definition 1.

Definition 1. A Data Tree (DT) is a 7-tuple (N, E, Y, n0, CEdges, Val, Sym). N is a finite set of data nodes (d-nodes) connected by a finite set of data edges (d-edges) E. A d-edge e ∈ E is an ordered pair (nparent, nchild) where nparent ∈ N is the parent d-node and nchild ∈ N is the child d-node. CEdges : N → E* is a function that takes every d-node nparent ∈ N to a finite (possibly empty) sequence of child d-edges CEdges(nparent) = e1 e2 … ek, where e1, …, ek ∈ E. A DT has exactly one root d-node n0 ∈ N. Except for the root d-node, every other d-node has exactly one parent, and is a descendant of the root via a unique path of d-edges. Every d-node stores a data value. The function Val : N → V returns the data value Val(n) of d-node n. A d-node may store the null value denoted ε, i.e., the empty string. V denotes the universe of all possible data values, including ε. Every d-edge is labelled with a symbol and Y is the set of these symbols. Sym : E → Y is a function that returns the symbol Sym(e) of d-edge e. (Two different d-edges can be labelled with the same symbol.)

Figure 4.1 and Figure 4.2 show two DT examples. A box represents a d-node. A directed edge from a parent to its child represents a d-edge. Each d-node (e.g., n1) stores a value

(e.g., “ny”). Each d-edge is labelled with a symbol (e.g., B between n0 and n3). n0 is the root d-node.

Figure 4.1: DT 1 (root n0:"us" with child d-edges A → n1:"ny", A → n2:"ca", B → n3:"50"; under n2, C → n4:"sf" and C → n5:"la")

Figure 4.2: DT 2 (root n0:ε with child d-edges A → n1:"cn", B → n2:"3.14"; under n1, B → n3:"123", C → n4:"bj", C → n5:"hk")
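Definition 1's 7-tuple can be collapsed into a compact node structure in code, since CEdges, Val, and Sym can all hang off the node itself. The following is a sketch, with DT 1 from Figure 4.1 as the example; the class and field names are illustrative, not from the thesis:

```python
from dataclasses import dataclass, field

@dataclass
class DNode:
    """A d-node: its value plus an ordered list of (symbol, child) pairs.

    The child list encodes E, CEdges and Sym together; "" plays the role
    of the null value epsilon.
    """
    value: str = ""
    children: list = field(default_factory=list)

# DT 1 from Figure 4.1: root "us" with child d-edges A, A, B.
dt1 = DNode("us", [
    ("A", DNode("ny")),
    ("A", DNode("ca", [("C", DNode("sf")), ("C", DNode("la"))])),
    ("B", DNode("50")),
])

# The symbol sequence of the root's child d-edges.
print("".join(sym for sym, _ in dt1.children))  # AAB
```

Two child d-edges may carry the same symbol, as the two A edges (and the two C edges under "ca") show.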

4.1.1 Modelling XML

An XML document can be modelled as a DT. An XML element is represented by a d-edge together with its child d-node. The element name is given by the symbol of the d-edge. The content of an element is given by the child d-node. The value of the element is the textual value of the d-node. If the element has some child elements, these child elements are represented by the child d-edges and d-nodes in the next level. Since an XML document has exactly one root element, the DT modelling an XML document has exactly one child d-edge from the root d-node. Figure 4.3 and Figure 4.4 show the DTs representing the XML documents in Listing 3.3 and Listing 3.4 respectively.

4.2 Schema Automaton

A Schema Automaton (SA) defines the permissible structures and contents of DTs. Essentially, an SA uses a set of regular languages to define how d-edges can be sequenced and uses a set of value domains (VDoms) to constrain the data values of d-nodes. (Each VDom is a set of values.) First, an SA uses one regular language called the vertical language (VLang) to define the permissible sequences of the symbols on the d-edges along all paths from the root to the leaves in a DT. For example, in DT 1, these vertical symbol sequences are A, AC, AC, and B. Second, the SA uses a set of regular languages called horizontal languages (HLangs) to define the permissible symbol sequences of the child d-edges under a d-node.

Figure 4.3: DT for Quote document (root n0:ε; ⟨Quote⟩ → n1:ε; under n1, ⟨Line⟩ → n2:ε and ⟨Line⟩ → n3:ε; under n2, ⟨Desc⟩ → n4:"hPhone" and ⟨Price⟩ → n5:"499.9"; under n3, ⟨Desc⟩ → n6:"iMat" and ⟨Price⟩ → n7:"999.9")

Figure 4.4: DT for Order document (root n0:ε; ⟨Order⟩ → n1:ε; ⟨Line⟩ → n2:ε; under n2, ⟨Product⟩ → n3:ε and ⟨Qty⟩ → n4:"2"; under n3, ⟨Desc⟩ → n5:"hPhone" and ⟨Price⟩ → n6:"499.9")

For example, in DT 1, the symbol sequence of the child d-edges under the root n0 is AAB; the symbol sequence of the child d-edges under the leaf n4 is the null string ε because n4 has no child. The VLang is specified as a deterministic finite automaton (DFA) while the HLangs are specified as regular expressions (REs). Definition 2 formally defines SA.

Definition 2. A Schema Automaton (SA) is a 6-tuple A = (Q, X, q0, δ, HLang, VDom). Q is a finite set of states. q0 ∈ Q is the initial state. There is one implicit dead state ⊥ ∉ Q. X is a finite set of symbols. δ : Q × X → Q ∪ {⊥} is a function called the transition function that takes each state q ∈ Q and each symbol a ∈ X to the next state δ(q, a) (possibly ⊥). HLang : Q → P(X*) − {∅} is a function that takes every state in Q to a non-empty regular language over X, called its horizontal language (HLang). For any state q ∈ Q, if some symbol a does not occur in any string in HLang(q) then δ(q, a) must be set to ⊥; otherwise, δ(q, a) must be set to some state in Q. VDom : Q → P(V) − {∅} is a function that takes every state q ∈ Q to a finite and non-empty set of values VDom(q), called its value domain (VDom). Note that an SA does not explicitly define a set of final states. A state is final when its HLang accepts ε.

Figure 4.5 shows an SA example. The set Q of states is {q0, q1, q2, q3}. The set X of symbols is {A, B, C}. The initial state is q0. The transition function δ is defined by the arrows. For example, the SA transits from q0 to q1 on symbol A, i.e., δ(q0, A) = q1 or q0 −A→ q1; q0 also goes to the dead state ⊥ on symbol C. Cyclic transitions are possible, e.g., q0 −A→ q1 −C→ q3 −A→ q0. The table in the figure defines the HLang and VDom for each


Figure 4.5: SA example (transitions: q0 −A→ q1, q0 −B→ q2, q1 −C→ q3, q3 −A→ q0)

q     HLang(q)   VDom(q)
q0    A{2,5}B    STRINGS
q1    C*         STRINGS
q2    {ε}        INTEGERS
q3    A*         STRINGS

state. For example, the HLang for q0 is the regular language specified by the RE A{2,5}B, which accepts only the strings with 2 to 5 As followed by exactly one B; the HLang for q2 accepts only the null string. The VDom for q0 is the set of all possible strings while the VDom for q2 is the set of all possible integers.
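The final-state remark above can be checked mechanically when HLangs are held as regular expressions. The following is a small sketch for the SA of Figure 4.5; the dictionary representation is mine, not the thesis implementation:

```python
import re

# HLangs of the SA in Figure 4.5 written as Python regexes over one-letter
# symbols; the empty regex stands for the language {epsilon}.
hlang = {"q0": r"A{2,5}B", "q1": r"C*", "q2": r"", "q3": r"A*"}

# An SA defines no explicit final-state set: a state is final exactly when
# its HLang accepts the null string.
finals = sorted(q for q, r in hlang.items() if re.fullmatch(r, ""))
print(finals)  # ['q1', 'q2', 'q3']
```

q0 is the only non-final state here, because A{2,5}B cannot match the empty string, so a d-node bound to q0 must always have children.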

4.2.1 Schema Automaton Validating Data Tree

An SA validates a DT as follows. The SA first uses the initial state to validate the root d-node of the DT. Suppose the SA is currently validating some d-node n of the DT with some state q. If the value of n is outside the VDom of q or the symbol sequence of the child d-edges of n is outside the HLang of q, then the SA immediately rejects the DT. Otherwise, the SA proceeds to validate every child d-node nchild (if any) of n against the next state q′ of the transition from q on the symbol of the d-edge (n, nchild). If none of the descendant d-nodes in the DT subtree rooted at d-node n is rejected, then it is said that the DT subtree at n is accepted by q, or simply n is accepted by q. Ultimately, the entire DT is accepted by the SA if n0 is accepted by q0. In this case, it is also said that the DT is an instance of the SA. If an SA accepts a DT then each d-node n in the DT is bound to exactly one state q of the SA, where n is accepted by q. See Definition 3. Algorithm 1 (AcceptDT) describes the process for an SA to validate a DT. The set of all possible instances of the SA is collectively called the language of the SA. See Definition 4.

Definition 3. Let A = (Q, X, q0, δ, HLang, VDom) be an SA and T = (N, E, Y, n0, CEdges, Val, Sym) be a DT. T is accepted by A when there exists a unique binding map, Bind : N → Q, that binds every d-node n ∈ N to exactly one state q ∈ Q such that all of the following conditions hold.

1. Bind(n0) = q0.

2. For any n ∈ N, Val(n) ∈ VDom(Bind(n)).

3. For any n ∈ N, let CEdges(n) = e1 … ek, and ei = (n, ni) for i = 1, …, k. Define


CSeq : N → X* that takes a d-node n to the string CSeq(n) = Sym(e1) … Sym(ek), which specifies the symbol sequence of the child d-edges of n.

a) CSeq(n) ∈ HLang(Bind(n)).

b) Bind(ni) = δ(Bind(n), Sym(ei)), for i = 1, …, k.

It is said that n is accepted by q or the DT subtree at n is accepted by q when n is bound to q.

Algorithm 1. AcceptDT
Input: SA A = (Q, X, q0, δ, HLang, VDom)
Input: DT T = (N, E, Y, n0, CEdges, Val, Sym)
Output: if A accepts T then true is returned; otherwise false is returned
Output: if A accepts T then a well-defined binding map Bind[n] is computed
 1: initialize a list L to contain one tuple (q0, n0)
 2: while L is not empty do
 3:   pick (q, n) from L and remove (q, n) from L
 4:   if Val(n) ∉ VDom(q) then
 5:     return false
 6:   end if
 7:   if CSeq(n) ∉ HLang(q) then
 8:     return false
 9:   end if
10:   set Bind[n] to q
11:   for all e = (n, nchild) in the sequence CEdges(n) do
12:     qnext ← δ(q, Sym(e))
13:     add (qnext, nchild) to L
14:   end for
15: end while
16: return true
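Algorithm 1 translates almost line for line into code. The following is a runnable sketch against the SA of Figure 4.5 and DT 1, with d-nodes encoded as (value, children) pairs and missing δ entries standing for the dead state; the encoding is illustrative, not the thesis implementation:

```python
import re

# The SA of Figure 4.5 as plain dictionaries.
delta = {("q0", "A"): "q1", ("q0", "B"): "q2",
         ("q1", "C"): "q3", ("q3", "A"): "q0"}
hlang = {"q0": r"A{2,5}B", "q1": r"C*", "q2": r"", "q3": r"A*"}
vdom  = {"q0": lambda v: True,  # STRINGS
         "q1": lambda v: True,  # STRINGS
         "q2": lambda v: re.fullmatch(r"-?\d+", v) is not None,  # INTEGERS
         "q3": lambda v: True}  # STRINGS

def accept_dt(q_init, root):
    """AcceptDT: worklist traversal binding each d-node to one state."""
    bind, work = {}, [(q_init, root)]
    while work:
        q, n = work.pop()
        value, children = n
        if not vdom[q](value):          # Val(n) outside VDom(q)
            return False, {}
        cseq = "".join(sym for sym, _ in children)
        if not re.fullmatch(hlang[q], cseq):  # CSeq(n) outside HLang(q)
            return False, {}
        bind[id(n)] = q
        for sym, child in children:
            nxt = delta.get((q, sym))   # None models the dead state
            if nxt is None:
                return False, {}
            work.append((nxt, child))
    return True, bind

# DT 1 from Figure 4.1 as nested (value, children) pairs.
dt1 = ("us", [("A", ("ny", [])),
              ("A", ("ca", [("C", ("sf", [])), ("C", ("la", []))])),
              ("B", ("50", []))])
ok, _ = accept_dt("q0", dt1)
print(ok)  # True
```

Because the CSeq check passes before any child is enqueued, every symbol on a child d-edge occurs in HLang(q), so the dead-state branch only fires for transitions genuinely set to ⊥.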

Definition 4. Let A be an SA. The set of all instance DTs accepted by A is called the language of A, denoted L(A).

The SA in Figure 4.5 accepts DT 1 (Figure 4.1) but rejects DT 2 (Figure 4.2). Table 4.1 shows the binding map of DT 1 against the SA. DT 2 is rejected because:

1. CSeq(n0) = AB is not in HLang(q0) = L(A{2,5}B),


n     q = Bind(n)   Val(n)   VDom(q)    CSeq(n)   HLang(q)
n0    q0            “us”     STRINGS    AAB       A{2,5}B
n1    q1            “ny”     STRINGS    ε         C*
n2    q1            “ca”     STRINGS    CC        C*
n3    q2            “50”     INTEGERS   ε         {ε}
n4    q3            “sf”     STRINGS    ε         A*
n5    q3            “la”     STRINGS    ε         A*

Table 4.1: Binding map of DT 1 against the SA in Figure 4.5

2. CSeq(n1) = BCC is not in HLang(q1) = L(C*), and

3. Val(n2) = “3.14” is not in VDom(q2) = INTEGERS.

4.3 Modelling W3C XML Schema

SA can model the core features of XSD. For example, SA 1 (Figure 4.6) and SA 2 (Figure 4.7) model XSD 1 (Listing 3.1) and XSD 2 (Listing 3.2) respectively. A state in an SA represents an XSD data type, i.e., a complex type, a simple type, or a built-in data type (e.g., xs:string). A symbol represents an element name. A transition from an originating state represents a child element declaration under the complex type represented by this originating state. The destination state of a transition represents the type used by the element declaration. In XSD 2, complex type OrderLineType declares two child elements ⟨Product⟩ and ⟨Qty⟩. ⟨Product⟩ uses complex type ProdType, and ⟨Qty⟩ uses built-in type xs:int. States q4, q9, and q8 in SA 2 represent data types OrderLineType, ProdType and xs:int respectively. q4 has two transitions: (1) to the next state q9 on symbol ⟨Product⟩, and (2) to q8 on ⟨Qty⟩. Besides, the xs:sequence statement in complex type OrderLineType requires that exactly one ⟨Product⟩ followed by exactly one ⟨Qty⟩ must occur as the children of element ⟨Line⟩. Thus, the HLang of q4 is specified by the RE ⟨Product⟩⟨Qty⟩.

HLangs can represent more sophisticated content models. For example, the content model shown in Listing 4.1 defines that complex type ct1 has two to three X child elements, followed by either a series of three or more Ys, or zero to one X. This model can be specified as the HLang RE ⟨X⟩{2,3}(⟨Y⟩{3,}|⟨X⟩?). Note that XSD does not permit multiple child element declarations with the same element name to be assigned different types, because of the Element Declarations Consistent (EDC) constraint. In Listing 4.1, the first child


Figure 4.6: SA 1 modelling XSD 1

q     HLang(q)            VDom(q)
q0    ⟨Quote⟩ | ⟨Order⟩    {ε}
q1    ⟨Line⟩+              {ε}
q2    ⟨Line⟩+              {ε}
q3    ⟨Desc⟩ ⟨Price⟩        {ε}
q4    ⟨Product⟩ ⟨Qty⟩       {ε}
q5    {ε}                 STRINGS
q6    {ε}                 DECIMALS
q7    ⟨Desc⟩ ⟨Price⟩        {ε}
q8    {ε}                 INTEGERS

Figure 4.7: SA 2 modelling XSD 2

q     HLang(q)            VDom(q)
q0    ⟨Quote⟩ | ⟨Order⟩    {ε}
q1    ⟨Line⟩+              {ε}
q2    ⟨Line⟩+              {ε}
q9    ⟨Desc⟩ ⟨Price⟩        {ε}
q4    ⟨Product⟩ ⟨Qty⟩       {ε}
q5    {ε}                 STRINGS
q6    {ε}                 DECIMALS
q8    {ε}                 INTEGERS


element declaration of ⟨X⟩ uses built-in type xs:string, so if the last element declaration of ⟨X⟩ used a different type, e.g., xs:decimal, then an EDC violation error would occur. The EDC constraint makes the transitions of an SA always deterministic, i.e., a state would not move to more than one next state on one symbol.

A built-in data type (e.g., xs:string) or a simple type is modelled as a state that has no child. That means the HLang of this state accepts only the empty string ε. For example, states q5, q6, and q8 in SA 2 represent xs:string, xs:decimal, and xs:int in XSD 2 respectively. The VDoms for q5, q6, and q8 are therefore the set of all possible strings, the set of all decimal numbers, and the set of all integers respectively. A simple type is modelled similarly. For example, in Listing 4.2, st1 and st2 are both defined based on xs:decimal but the former accepts all decimals greater than zero while the latter accepts all decimals inclusively between -1 and 1. Therefore, st1 and st2 should be represented by two different states with two different VDoms as described.

Listing 4.1: Example of XSD complex type definition

Listing 4.2: Example of XSD simple type definition
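Both kinds of constraint above are directly machine-checkable: an HLang is an ordinary regular expression over child-name sequences, and a VDom is a membership predicate. The following sketch covers ct1 (Listing 4.1) and st1/st2 (Listing 4.2), with element names shortened to single characters; all names and encodings are illustrative:

```python
import re

# HLang of complex type ct1: two to three Xs, followed by either three or
# more Ys, or zero to one X.
hlang_ct1 = re.compile(r"X{2,3}(Y{3,}|X?)")
for seq in ["XX", "XXYYY", "XXXX", "XXYY"]:
    print(seq, bool(hlang_ct1.fullmatch(seq)))

# VDom predicates for the simple types st1 and st2: both restrict decimals,
# so they become two different states with two different VDoms.
vdom_st1 = lambda v: float(v) > 0            # decimals greater than zero
vdom_st2 = lambda v: -1 <= float(v) <= 1     # decimals in [-1, 1]
print(vdom_st1("0.5"), vdom_st2("-0.5"))     # True True
```

Note that "XXXX" is accepted: three Xs from the first factor plus one more from the optional ⟨X⟩? alternative.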




Nevertheless, SA cannot model some XSD features. For example, SA cannot express the xs:any content model, which permits a free structure of any descendant elements. Despite these limitations, SA can model the most commonly-used XSD features in industry XSDs, such as xCBL and UBL. I implemented a program to convert XSD to SA and SA to XSD, which accurately handled the xCBL XSDs for the experiments.


5 XML Schema Computations

This chapter will present several schema computation operations using SA. These operations include schema minimization, schema equivalence testing, subschema testing, and subschema extraction. I will also analyze the complexity of these operations and propose some techniques to improve their performance.

5.1 Schema Minimization

A key operation used in schema compatibility testing and subschema extraction is schema minimization. Given an SA, schema minimization computes an equivalent SA that has the fewest states among all equivalent SAs. This minimized SA can be regarded as the canonical SA for all SAs recognizing the same language. See Definitions 5 and 6 on schema equivalence and schema size.

Definition 5. Let A and A′ be two SAs. If L(A) = L(A′), A and A′ are said to be equivalent, denoted A ≡ A′.

Definition 6. Let A be an SA. The size of A, denoted |A|, is the number of states in A.
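Computing a fewest-state equivalent automaton is, for ordinary DFAs, the classical minimization problem solved by partition refinement (Moore's algorithm); the SA minimization developed in this chapter is analogous but not identical. For intuition only, here is a toy sketch on a plain DFA in which two states accept the same residual language and get merged; all names are illustrative:

```python
# Toy DFA: states 1 and 2 are both accepting and loop on every symbol, so
# they are equivalent and a minimal DFA merges them.
states = {0, 1, 2}
finals = {1, 2}
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 1,
         (2, "a"): 2, (2, "b"): 2}

def minimize(states, finals, delta, alphabet):
    """Moore-style partition refinement; assumes a complete delta."""
    parts = [finals, states - finals]
    changed = True
    while changed:
        changed = False
        new_parts = []
        for p in parts:
            # Split p by which part each state's transitions land in.
            buckets = {}
            for s in p:
                key = tuple(next(i for i, q in enumerate(parts)
                                 if delta[(s, a)] in q) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_parts.extend(buckets.values())
            if len(buckets) > 1:
                changed = True
        parts = new_parts
    return parts

print(sorted(sorted(p) for p in minimize(states, finals, delta, "ab")))
# [[0], [1, 2]]
```

The SA setting adds two wrinkles the sketch ignores: state equivalence must also account for HLangs and VDoms, and useless states must be removed first, which is the subject of the next subsection.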

5.1.1 Usefulness of States (XML Schema Types)

The first step of minimizing an SA is to remove all useless states, each representing a useless XSD type. Useless types can be safely discarded from the XSD while the instance set of the resultant XSD is unchanged. Given some SA A, I call a state of A useful if some d-node in some instance of A is bound to this state. See Definition 7.

Definition 7. Let A be an SA and q be a state of A. q is said to be useful if there exists some instance T of A and some d-node n in T such that Bind(n) = q, where Bind is the binding map for A to accept T. A is said to be a useful SA if all of its states are useful.


First, if a state is not accessible (Definition 8), then it does not contribute to recognizing any instance; hence, it is useless.

Definition 8. Let q be a state of an SA. q is said to be accessible if there exists some path of transitions from the initial state to q. Otherwise, q is said to be inaccessible.

Second, if a state is irrational, then it is useless too. A state is considered irrational if it is on a cycle of mandatory transitions. Intuitively, when an SA reaches an irrational state q on a cycle of mandatory transitions while validating some d-node n of a DT, q would require n to have infinite descendants. Since a DT is finite, an irrational state never accepts any DT subtree. Therefore, an irrational state is useless.

Definition 9. Let A = (Q, X, q0, δ, HLang, VDom) be an SA. A symbol a ∈ X is a mandatory symbol of a state q ∈ Q if a occurs in every string of the HLang of q. The transition q --a--> q′, where q′ ∈ Q (i.e., q′ ≠ ⊥), is called a mandatory transition. States q1, ..., qk ∈ Q are said to be irrational if there exists a cycle of mandatory transitions q1 --a1--> q2 --a2--> ··· --ak--> q1 for some symbols a1, ..., ak ∈ X.

It is possible for some accessible and rational states to be useless too. Each such state either (1) has a path of mandatory transitions to some irrational state or (2) can be reached only via useless states. Theorem 1 formalizes the conditions for a state to be useless.

Theorem 1. Let A be an SA and q be a state of A. q is useless if and only if any of the following conditions hold:
1. q is inaccessible.
2. q is irrational.
3. There is a path of mandatory transitions from q to some irrational state.
4. Every transition path from the initial state to q passes through some useless state.

[Figure 5.1: Example of an SA that contains useless states — a state diagram over q0, ..., q9 with the HLang and VDom of each state tabulated; e.g., HLang(q0) = A{2,5}BC? and HLang(q1) = C*.]

Proof. (If part) If q is inaccessible, then q cannot be reached from q0, so q can never accept any d-node of any instance; therefore, q is useless. Suppose q is accessible but irrational, and there is some DT T such that some d-node n of T is being validated by A against q. Since q is irrational, there exist q, q2, ..., qk ∈ Q and a1, ..., ak ∈ X such that a cycle of mandatory transitions q --a1--> q2 --a2--> ··· qk --ak--> q is formed. Since T is finite, n cannot be accepted by q, because q requires n to have an infinite chain of descendant d-edges a1, a2, ..., ak, a1, .... Therefore, q is useless. Suppose q is useful and there is a path of mandatory transitions from q to some irrational state q′. Let T be an instance of A such that some d-node n of T is validated against q. Then there exists another d-node n′ of T at which A, following the mandatory transitions from q, reaches q′. However, q′ requires n′ to have infinite descendants. This is a contradiction, because T then cannot be an instance of A; therefore, q is useless. Finally, if every transition path from the initial state to q passes through some useless state, then q cannot be reached from the initial state without passing through a useless state. Hence, q cannot accept any d-node of any instance; therefore, q is useless.

(Only-if part) Suppose (1) q is accessible, (2) q is rational, (3) there exists no path of

mandatory transitions from q to any irrational state, and (4) there is some transition path from q0 to q that contains only useful states. Since q is accessible via useful states, q can be reached from q0. Since q is rational and there exists no path of mandatory transitions to any irrational state, A must go from q to some state q′ with ε ∈ HLang(q′) along every possible path in a finite number of transitions. Thus, we can construct a finite DT accepted by q. Therefore, q is useful. □

Figure 5.1 shows an example of an SA with some useless states. q7 and q8 are inaccessible. q5 and q6 are irrational because they form a cycle of mandatory transitions. q4 is also useless because it has a mandatory transition to the irrational state q5. However, q0 is not useless, because its transition to q4 on symbol C is not mandatory. q9 is also useless, because its only transition path from q0 is blocked by the useless states q4, q5, and q6.

Algorithm 2 (MakeUsefulSA) removes all useless states from a given SA and produces a useful, equivalent SA. Running MakeUsefulSA on the SA in Figure 5.1 produces the useful SA in Figure 4.5.

Algorithm 2. MakeUsefulSA
Input: SA A = (Q, X, q0, δ, HLang, VDom)
Output: A is modified so that A is useful
1: create an empty list L to store all useless states
2: find all mandatory transitions in A
3: add all states on any cycle of mandatory transitions, i.e., the irrational states, to L /* handle condition 2 */
4: while L is not empty do
5:   pick a state q in L and remove q from L
6:   if q = q0 then
7:     report that no useful SA equivalent to A exists and halt /* this SA accepts no DTs */
8:   end if
9:   for all q′ ∈ Q − L where there exists a ∈ X such that δ(q′, a) = q is a mandatory transition do
10:    add q′ to L /* handle condition 3 */
11:  end for
12:  remove all transitions to q /* q becomes inaccessible */
13: end while
14: traverse A from q0 and add all inaccessible states to L /* handle conditions 1 and 4 */
15: for all q ∈ Q − L where there exist a1, ..., an ∈ X such that δ(q, a1), ..., δ(q, an) ∈ L do
16:   modify HLang(q) to a new regular language that is equivalent to the original except that it excludes all strings containing any symbol in {a1, ..., an} /* e.g., in Figure 5.1, HLang(q0) is modified from A{2,5}BC? to A{2,5}B */
17: end for
18: remove all states in L together with their incoming and outgoing transitions
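The conditions of Theorem 1 can be checked directly on a graph view of the SA: the irrational states are those on a cycle in the mandatory-transition graph, uselessness propagates backwards along mandatory transitions, and the surviving states are those reachable from q0 without crossing a useless state. The following is a minimal sketch of that check, not the thesis code; it assumes the full transition relation and the set of mandatory transitions are given as precomputed inputs, and it omits the HLang rewriting of step 16.

```python
from collections import defaultdict

def useless_states(states, q0, transitions, mandatory):
    """Return the useless states of an SA per Theorem 1.

    transitions: set of (q, a, q2) triples for all transitions.
    mandatory:   the subset of `transitions` that are mandatory.
    """
    mand_succ = defaultdict(set)
    for q, _a, q2 in mandatory:
        mand_succ[q].add(q2)

    def mand_reach(q):
        # All states reachable from q via mandatory transitions only.
        seen, stack = set(), [q]
        while stack:
            for nxt in mand_succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Conditions 2 and 3: on a mandatory cycle, or leading to one.
    irrational = {q for q in states if q in mand_reach(q)}
    doomed = {q for q in states if mand_reach(q) & irrational} | irrational

    # Conditions 1 and 4: reachable from q0 through non-doomed states.
    alive, stack = set(), ([q0] if q0 not in doomed else [])
    while stack:
        q = stack.pop()
        if q in alive:
            continue
        alive.add(q)
        for q1, _a, q2 in transitions:
            if q1 == q and q2 not in doomed and q2 not in alive:
                stack.append(q2)
    return set(states) - alive

# Hypothetical SA: q1 and q2 form a mandatory cycle, so they are
# irrational; q3 is reachable only through them, hence useless too.
S = {"q0", "q1", "q2", "q3", "q4"}
T = {("q0", "A", "q1"), ("q1", "B", "q2"), ("q2", "B", "q1"),
     ("q2", "C", "q3"), ("q0", "C", "q4")}
M = {("q1", "B", "q2"), ("q2", "B", "q1")}
```

On this toy input the sketch reports q1 and q2 (irrational) and q3 (blocked behind them) as useless, while q0 and q4 survive, mirroring the discussion of Figure 5.1.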


Mandatory Symbol Testing

When a symbol occurs in every string of a regular language, the symbol is said to be mandatory in the language. Checking whether a symbol is mandatory in the language specified by an RE can be done efficiently in PTIME. An RE can be represented by a tree of sub-REs. For example, the RE ((A+|(AB)+)+(B+|(BA)*))+ can be represented by the RE tree shown in Figure 5.2, where a sequence node means concatenation and a choice node means alternation. Algorithm 3 tests whether a symbol is mandatory in an RE. It constructs a boolean expression tree from the RE tree; if the boolean expression evaluates to true, then the symbol is mandatory in the RE.

Algorithm 3. IsSymbolMandatory
Input: regular expression tree R
Input: symbol a to test whether a occurs in every string of the language of R, i.e., whether a is mandatory in R
1: replace each subtree of R rooted at a node with zero minimum multiplicity (e.g., Kleene star *) by a false node
2: replace each remaining leaf of R whose symbol equals a by a true node and each remaining leaf whose symbol does not equal a by a false node
3: replace each remaining sequence node of R with an or node
4: replace each remaining choice node of R with an and node
5: evaluate the transformed boolean expression tree

The boolean expression tree shown in Figure 5.3 evaluates whether A is mandatory in ((A+|(AB)+)+(B+|(BA)*))+; the result is true. By contrast, B is not mandatory, as evaluated by the boolean expression tree shown in Figure 5.4.
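Algorithm 3 amounts to a single bottom-up pass over the RE tree: zero-minimum nodes become false, leaves are compared with the tested symbol, sequence nodes become or, and choice nodes become and. A minimal sketch follows; the encoding of RE-tree nodes as (kind, payload, min_occurs) tuples is my own assumption, not the thesis implementation.

```python
def is_symbol_mandatory(node, symbol):
    """Evaluate Algorithm 3 on an RE tree without materializing the
    intermediate boolean expression tree.

    node is ('sym', name, min_occurs) for a leaf, or
    ('seq' | 'choice', [children], min_occurs) for an inner node.
    """
    kind, payload, min_occurs = node
    if min_occurs == 0:            # step 1: e.g. Kleene star -> false
        return False
    if kind == 'sym':              # step 2: leaf comparison
        return payload == symbol
    results = [is_symbol_mandatory(child, symbol) for child in payload]
    if kind == 'seq':              # step 3: sequence -> or
        return any(results)
    return all(results)            # step 4: choice -> and

# The RE tree of ((A+|(AB)+)+(B+|(BA)*))+ from Figure 5.2.
tree = ('seq', [
    ('choice', [('sym', 'A', 1),
                ('seq', [('sym', 'A', 1), ('sym', 'B', 1)], 1)], 1),
    ('choice', [('sym', 'B', 1),
                ('seq', [('sym', 'B', 1), ('sym', 'A', 1)], 0)], 1),
], 1)
```

As in Figures 5.3 and 5.4, the sketch reports A as mandatory in this RE and B as not mandatory.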

5.1.2 Schema Automaton Minimization

The SA minimization process involves:
1. transforming a given SA into a useful SA,
2. partitioning the set of states into a set of equivalence classes of states,
3. combining all states in each equivalence class into one state in the minimized SA, and
4. re-mapping the transitions accordingly.


[Figure 5.2: RE tree of ((A+|(AB)+)+(B+|(BA)*))+]

[Figure 5.3: Boolean expression tree to test whether A is mandatory]

[Figure 5.4: Boolean expression tree to test whether B is mandatory]

See Definition 10 and Theorem 2 on state equivalence.

Definition 10. Let A be a useful SA and q1, q2 be two states of A. q1 and q2 are said to be equivalent if q1 and q2 accept the same set of DT subtrees in all instances of A.

Theorem 2. Let A = (Q, X, q0, δ, HLang, VDom) be a useful SA and q, q′ ∈ Q be two states of A. q and q′ are equivalent if and only if all of the following conditions hold:
1. HLang(q) = HLang(q′).
2. VDom(q) = VDom(q′).
3. For each a ∈ X, either δ(q, a) = δ(q′, a) = ⊥, or δ(q, a) and δ(q′, a) are equivalent.

Proof. (If part) Since A is useful, all its states are useful. Suppose there exist useful states q, q′ ∈ Q such that all of the above conditions hold. Condition 1 guarantees that q and q′ accept the same set of symbol sequences of child d-edges. Condition 2 guarantees that q and q′ accept the same set of values. Condition 3 guarantees that q and q′ share the same "future". Therefore, q and q′ accept the same set of DT subtrees; the two states are equivalent.

(Only-if part) Suppose q, q′ ∈ Q are equivalent but some of the above conditions do not hold. Since q and q′ are useful, each of them accepts at least one DT subtree. For conditions 1 and 2, if HLang(q) ≠ HLang(q′) or VDom(q) ≠ VDom(q′), then there must exist some d-node n of some instance T = (N, E, Y, n0, CEdges, Val, Sym) such that Val(n) or CSeq(n) is accepted by exactly one (but not both) of q and q′; q and q′ cannot be equivalent. Moreover, if condition 3 does not hold, q and q′ have different futures and cannot be equivalent either. □

A minimal SA of a language is an SA with the fewest states among all SAs accepting the same language. See Definition 11. In fact, this minimal SA is the minimum (canonical) SA because it is unique up to isomorphism, as stated in Theorem 4.

Definition 11. Let A be an SA. If there does not exist another SA A′ such that L(A′) = L(A) and |A′| < |A|, then A is called a minimal SA of its language.

Theorem 3. Given a useful SA A, for any SA A′ equivalent to A, there cannot be fewer states in A′ than the number of equivalence classes of states in A.


Proof. Since A is a useful SA, every state of A is useful and accepts some DT subtrees. Hence, every state in each equivalence class of A accepts some DT subtrees. For the sake of contradiction, suppose there is an SA A′ equivalent to A where A′ has fewer states than the number of equivalence classes of A. By the pigeonhole principle, there exists some state q of A such that no state of A′ accepts exactly the set of DT subtrees accepted by q. Since A is useful, q is useful and accepts some DT subtrees in some instances. We can find some instance T of A, which is also an instance of A′, such that while A is validating T, q is validating a d-node n of T. Now suppose A′ is validating T and state q′ of A′ is validating n. We can modify the DT subtree at n into one that is accepted by exactly one of q and q′. Hence, the modified T can be accepted by exactly one of A and A′, which contradicts the assumption that A′ is equivalent to A. □

Theorem 3 states that the number of equivalence classes of states in an SA of a language is a lower bound on the size of all SAs accepting the same language. Given any useful SA, Algorithm 4 computes an equivalent SA that has as many states as the equivalence classes of states in the given SA; the computed SA is therefore a minimal SA of the given SA's language. Essentially, the algorithm combines each class of equivalent states in the input useful SA into a single state in the output SA. First, all states in the input SA are partitioned into blocks of states sharing the same HLang and VDom. Then each block is examined. When a block contains two states that have transitions on the same symbol to states in different blocks, the block is split into new blocks so that all states in each new block have transitions on the same symbol to states in the same block. The partition is refined iteratively until no block needs to be split; at that point, every block contains an equivalence class of states. Finally, all transitions in the input SA on the same symbol from states in equivalence class B1 to states in equivalence class B2 are combined into a single transition on that symbol from new state B1 to new state B2 in the minimized SA.

Algorithm 4. MinimizeSA
Input: useful SA A = (Q, X, q0, δ, HLang, VDom)
Output: minimum SA A′ = (Q′, X′, q′0, δ′, HLang′, VDom′) equivalent to A
1: create a partition P = {B1, ..., Bk} of Q such that for any two states q1, q2 ∈ Q, HLang(q1) = HLang(q2) and VDom(q1) = VDom(q2) if and only if q1 and q2 are in the same Bi, where 1 ≤ i ≤ k
2: create an empty list L /* L maintains a list of blocks that may need to be split */
3: add each block B ∈ P to L if |B| > 1
4: while L is not empty do
5:   pick a block B from L and remove B from L
6:   if there exist two states q1, q2 in B and some symbol a ∈ X such that δ(q1, a) and δ(q2, a) are in different blocks in P then
7:     partition B into R = {C1, ..., Cm} such that for any two states q1, q2 ∈ B, q1 and q2 are in the same Ci if and only if δ(q1, a) and δ(q2, a) are in the same B′ ∈ P for all a ∈ X
8:     remove B from P and add each C ∈ R to P
9:     add each C ∈ R with |C| > 1 to L
10:  end if
11: end while
12: set X′ to X; set Q′ to P
13: set q′0 to the block B ∈ Q′ where q0 ∈ B
14: for all B ∈ Q′ do
15:   set HLang′(B) to HLang(q) where q ∈ B
16:   set VDom′(B) to VDom(q) where q ∈ B /* all states in the same B share the same HLang and VDom */
17:   for any a ∈ X, set δ′(B, a) to B′ where δ(q, a) = q′, q ∈ B, and q′ ∈ B′
18: end for
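The refinement loop of Algorithm 4 is the same Moore-style partition refinement used to minimize DFAs, except that the initial partition is keyed on (HLang, VDom) rather than on accepting versus non-accepting states. A compact sketch, illustrative only: HLang and VDom are taken as directly comparable values, and a missing δ entry stands for ⊥.

```python
from collections import defaultdict

def minimize_partition(states, symbols, delta, hlang, vdom):
    """Return the equivalence classes of states as a set of frozensets.

    delta: dict mapping (state, symbol) -> state; absent keys mean no
    transition (bottom). hlang/vdom: state -> comparable annotation.
    """
    # Step 1: split states by (HLang, VDom).
    groups = defaultdict(set)
    for q in states:
        groups[(hlang[q], vdom[q])].add(q)
    partition = [frozenset(g) for g in groups.values()]

    # Steps 4-11: refine until transition signatures agree per block.
    while True:
        block_of = {q: i for i, blk in enumerate(partition) for q in blk}
        refined = []
        for blk in partition:
            by_sig = defaultdict(set)
            for q in blk:
                # Which block does each symbol lead to? None means bottom.
                sig = tuple(block_of.get(delta.get((q, a)))
                            for a in sorted(symbols))
                by_sig[sig].add(q)
            refined.extend(frozenset(g) for g in by_sig.values())
        if len(refined) == len(partition):
            return set(refined)
        partition = refined

# Hypothetical SA: q1 and q2 have the same HLang/VDom and no outgoing
# transitions, so they collapse into one equivalence class.
states = {"q0", "q1", "q2"}
symbols = {"A", "B"}
delta = {("q0", "A"): "q1", ("q0", "B"): "q2"}
hlang = {"q0": "A|B", "q1": "{}", "q2": "{}"}
vdom = {"q0": "{}", "q1": "STRINGS", "q2": "STRINGS"}
classes = minimize_partition(states, symbols, delta, hlang, vdom)
```

The blocks returned here play the role of Q′ in Algorithm 4; re-mapping δ, HLang, and VDom onto them (steps 12-18) is mechanical and omitted.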

5.2 Schema Equivalence Testing

If two schemas are equivalent, they are compatible with each other. Theorem 4 states that the minimum SA is unique up to isomorphism. Hence, we can test whether two SAs are equivalent by testing whether their minimized forms are isomorphic. (Two SAs are isomorphic when they are "structurally identical", although their states may carry different labels.) Algorithm 5 checks the equivalence of two SAs by first minimizing them and then traversing them in parallel from their initial states to check whether they transit in the same way with all HLangs and VDoms matched.

Theorem 4. Let A and A′ be two equivalent SAs where A and A′ are minimal. Then A and A′ are isomorphic; in other words, the minimum SA of a language is unique up to isomorphism.

Proof. Let A = (Q, X, q0, δ, HLang, VDom) and A′ = (Q′, X′, q′0, δ′, HLang′, VDom′) be two useful, equivalent, and minimal SAs. By Theorem 3, both A and A′ contain the same number of states, which equals the number of equivalence classes of states. We construct


another SA Ā = (Q̄, X̄, q̄0, δ̄, HLanḡ, VDom̄) as follows.

Q̄ = Q ∪ Q′ ∪ {q̄0}, where q̄0 ∉ Q ∪ Q′.
X̄ = X ∪ X′ ∪ {a0, a′0}, where a0, a′0 ∉ X ∪ X′.
δ̄(q, a) =
  δ(q, a)   if q ∈ Q and a ∈ X,
  δ′(q, a)  if q ∈ Q′ and a ∈ X′,
  q0        if q = q̄0 and a = a0,
  q′0       if q = q̄0 and a = a′0,
  ⊥         if q = q̄0 and a ≠ a0 and a ≠ a′0.
HLanḡ(q) = HLang(q) if q ∈ Q; HLang′(q) if q ∈ Q′; a0|a′0 if q = q̄0.
VDom̄(q) = VDom(q) if q ∈ Q; VDom′(q) if q ∈ Q′; {} if q = q̄0.

Intuitively, Ā is an SA combining A and A′: it has one new initial state q̄0 with transitions to the initial states of A and A′ on the new symbols a0 and a′0 respectively. Let N = |A| = |A′|. The number of states of Ā is 2N + 1. The number of states of a minimum SA equivalent to Ā is N + 1, because we can construct such a minimum SA with N + 1 states by redirecting the two transitions of q̄0 on a0 and a′0 to q0 of A only, without using A′, as A′ is equivalent to A; since A is minimal and a0 and a′0 are new symbols, this SA cannot be minimized further. Now we can minimize Ā by grouping equivalent states. No other state of Ā is equivalent to q̄0, because a0 and a′0 are new symbols. By Theorem 3, since A and A′ are minimal, no state in either SA is equivalent to any other state of the same SA. Therefore, the only way to group the 2N + 1 states into N + 1 equivalence classes is to pair up one state q from A with one equivalent state q′ from A′, while q̄0 forms a distinct equivalence class on its own. By Theorem 2, each such q and q′ have the same HLang and VDom and share the same future. Therefore, A and A′ are isomorphic. □

Algorithm 5. EquivalentSA
Input: SA A = (Q, X, q0, δ, HLang, VDom)
Input: SA A′ = (Q′, X′, q′0, δ′, HLang′, VDom′)
Output: true is returned if A ≡ A′; false is returned otherwise
1: MakeUsefulSA(A); MakeUsefulSA(A′)
2: MinimizeSA(A); MinimizeSA(A′)
3: create a list L that contains one tuple (q0, q′0) /* L maintains a list of pairs of a state in A and a state in A′ to be visited, starting from the pair of initial states */
4: mark q0, q′0 visited
5: while L is not empty do
6:   pick (q, q′) from L and remove (q, q′) from L
7:   if VDom(q) ≠ VDom′(q′) or HLang(q) ≠ HLang′(q′) then
8:     return false
9:   end if
10:  for all a ∈ X do
11:    q1 ← δ(q, a); q′1 ← δ′(q′, a)
12:    if exactly one of q1, q′1 is ⊥ then
13:      return false
14:    else if both q1, q′1 are not ⊥ then
15:      if exactly one of q1, q′1 is marked visited then
16:        return false
17:      else if both q1, q′1 are not marked visited then
18:        put (q1, q′1) into L
19:        mark q1, q′1 visited
20:      end if
21:    end if
22:  end for
23: end while
24: return true
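After both SAs are minimized, the traversal of Algorithm 5 is a breadth-first walk over pairs of states that must behave identically. A sketch of that walk follows; it is illustrative, not the thesis code: HLang/VDom equality is plain value equality here, and both inputs are assumed to be already useful and minimized.

```python
from collections import deque

def equivalent_minimized(a, b):
    """Parallel traversal of two *minimized* SAs, per Algorithm 5.

    Each SA is a dict with keys 'q0', 'symbols', 'delta' (dict of
    (state, symbol) -> state), 'hlang', and 'vdom'.
    """
    if a['symbols'] != b['symbols']:
        return False
    pending = deque([(a['q0'], b['q0'])])
    visited_a, visited_b = {a['q0']}, {b['q0']}
    while pending:
        q, q2 = pending.popleft()
        if a['hlang'][q] != b['hlang'][q2] or a['vdom'][q] != b['vdom'][q2]:
            return False
        for sym in a['symbols']:
            n1 = a['delta'].get((q, sym))
            n2 = b['delta'].get((q2, sym))
            if (n1 is None) != (n2 is None):   # exactly one is bottom
                return False
            if n1 is not None:
                seen1, seen2 = n1 in visited_a, n2 in visited_b
                if seen1 != seen2:             # exactly one visited
                    return False
                if not seen1:
                    visited_a.add(n1)
                    visited_b.add(n2)
                    pending.append((n1, n2))
    return True

# Two isomorphic minimized SAs whose states carry different labels.
sa1 = {'q0': 0, 'symbols': {'A'},
       'delta': {(0, 'A'): 1},
       'hlang': {0: 'A', 1: '{}'},
       'vdom': {0: '{}', 1: 'STRINGS'}}
sa2 = {'q0': 'x', 'symbols': {'A'},
       'delta': {('x', 'A'): 'y'},
       'hlang': {'x': 'A', 'y': '{}'},
       'vdom': {'x': '{}', 'y': 'STRINGS'}}
```

The per-SA visited sets mirror the "mark q1, q′1 visited" bookkeeping of steps 15-19: two minimized SAs are isomorphic exactly when the walk never finds a mismatch.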

SA 1 (Figure 4.6) can be minimized to SA 2 (Figure 4.7), where states q3 and q7 in SA 1 are combined into q9 in SA 2. Thus, SA 1 and SA 2 are equivalent, which implies the equivalence of the XSDs they model, XSD 1 (Listing 3.1) and XSD 2 (Listing 3.2).


5.2.1 Subschema Testing

If one schema is a subschema of another schema, then the latter accepts all instances of the former, and thus the latter is compatible with the former. The subschema notion is formally defined as follows.

Definition 12. Let A and A′ be two SAs. If L(A) ⊆ L(A′), then A is said to be a subschema of A′, and A′ is said to be compatible with A.

The overall idea of testing whether SA A = (Q, X, q0, δ, HLang, VDom) is a subschema of A′ = (Q′, X′, q′0, δ′, HLang′, VDom′) is to test whether each possible path of transitions in A can be found in A′. Let q0 --a0--> q1 --a1--> ··· --ai--> qi+1 ··· be any transition path in A, where all qi ∈ Q and all ai ∈ X. In order for A′ to be a superschema of A, the corresponding transition path q′0 --a0--> q′1 --a1--> ··· --ai--> q′i+1 ··· must exist in A′, where all q′i ∈ Q′ and all ai ∈ X′. In addition, the HLang of each qi must be a subset of the HLang of the corresponding q′i, and the VDom of each qi must be a subset of the VDom of q′i; otherwise, some values and child sequences of d-nodes that can be accepted by A cannot be accepted by A′. Algorithm 6 (SubschemaSA) performs this subschema testing. Theoretically, one can apply SubschemaSA in both directions to verify schema equivalence instead of using EquivalentSA; however, running EquivalentSA is generally faster than running SubschemaSA twice.

Algorithm 6. SubschemaSA
Input: SA A = (Q, X, q0, δ, HLang, VDom)
Input: SA A′ = (Q′, X′, q′0, δ′, HLang′, VDom′)
Output: true is returned if A is a subschema of A′; false is returned otherwise
1: MakeUsefulSA(A) /* MakeUsefulSA(A′) is not needed */
2: create a list L that contains one tuple (q0, q′0) /* L maintains a list of pairs of a state in A and a state in A′ to be visited, starting from the pair of initial states */
3: mark the tuple (q0, q′0) visited
4: while L is not empty do
5:   pick (q, q′) from L and remove (q, q′) from L
6:   if VDom(q) ⊈ VDom′(q′) then
7:     report VDom incompatibility and return false
8:   end if
9:   if HLang(q) ⊈ HLang′(q′) then
10:    report HLang incompatibility and return false


11:  end if
12:  for all a ∈ X do
13:    q1 ← δ(q, a); q′1 ← δ′(q′, a)
14:    if q1 ≠ ⊥ then
15:      if q′1 = ⊥ then
16:        report transition incompatibility and return false
17:      else if (q1, q′1) is not marked visited then
18:        put (q1, q′1) into L
19:        mark (q1, q′1) visited
20:      end if
21:    end if
22:  end for
23: end while
24: return true

[Figure 5.5: SA 3 modelling XSD 3 — states q0, q1, q9, q5, and q6 with their HLang and VDom tabulated (e.g., HLang(q0) = ⟨Quote⟩, VDom(q5) = STRINGS, VDom(q6) = DECIMALS).]
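The same pair-wise walk as in the equivalence test checks inclusion instead of equality: at each reachable pair of states, the child-sequence language and value domain of the candidate subschema state must be contained in those of the superschema state, and every transition of the subschema must be matched. In the sketch below, the two containment tests are passed in as callables, since Section 5.4 discusses how to implement them efficiently; the rest is illustrative and not the thesis code, with HLang/VDom modelled as plain Python sets so that set inclusion can stand in for language containment.

```python
from collections import deque

def is_subschema(a, b, hlang_included, vdom_included):
    """Parallel walk of Algorithm 6: is SA `a` a subschema of SA `b`?

    a, b: dicts with keys 'q0', 'symbols', 'delta', 'hlang', 'vdom'.
    hlang_included / vdom_included: callables testing containment of
    a's annotation in b's annotation.
    """
    pending = deque([(a['q0'], b['q0'])])
    visited = {(a['q0'], b['q0'])}
    while pending:
        q, q2 = pending.popleft()
        if not vdom_included(a['vdom'][q], b['vdom'][q2]):
            return False      # VDom incompatibility
        if not hlang_included(a['hlang'][q], b['hlang'][q2]):
            return False      # HLang incompatibility
        for sym in a['symbols']:
            n1 = a['delta'].get((q, sym))
            if n1 is None:
                continue
            n2 = b['delta'].get((q2, sym))
            if n2 is None:
                return False  # transition incompatibility
            if (n1, n2) not in visited:
                visited.add((n1, n2))
                pending.append((n1, n2))
    return True

# Toy pair: `sub` allows only a Quote root; `sup` allows Quote or Order.
sub = {'q0': 0, 'symbols': {'Quote'},
       'delta': {(0, 'Quote'): 1},
       'hlang': {0: {'Quote'}, 1: set()},
       'vdom': {0: set(), 1: {'s'}}}
sup = {'q0': 'r', 'symbols': {'Quote', 'Order'},
       'delta': {('r', 'Quote'): 'q', ('r', 'Order'): 'o'},
       'hlang': {'r': {'Quote', 'Order'}, 'q': set(), 'o': set()},
       'vdom': {'r': set(), 'q': {'s'}, 'o': {'s'}}}
```

Running the walk in both directions shows the asymmetry of the subschema relation: sub is a subschema of sup, but not vice versa.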

For example, SA 3 (Figure 5.5) models the XSD in Listing 3.5. SubschemaSA can verify that SA 3 is a subschema of SA 1 as well as SA 2.

5.3 Subschema Extraction

Given a large XSD, if an application only needs to recognize a subset of elements, we can reduce the original schema by extracting a smaller subschema that contains only the needed elements, thereby improving schema processing performance. Given some SA A and a set of permissible symbols X′, Algorithm 7 (ExtractSubschema) computes another SA A′ such that A′ accepts the instances of A containing only the symbols in X′, and rejects


any other DTs. First, all "unwanted" transitions on any symbols outside X′ are found and put into a list L pending deletion. Then a loop iterates through list L and deletes each unwanted transition. If an unwanted transition q --a--> q′ is mandatory, state q should be removed from the extracted schema, because the HLang of q does not permit any d-node with no child d-edge carrying symbol a; in that case, all transitions going to q also need to be deleted. If q --a--> q′ is not mandatory, q need not be deleted, but the HLang of q needs to be modified to a new HLang equivalent to the original HLang minus any strings containing a. After all transitions in L are removed, the resultant schema is minimized into the required subschema.

Algorithm 7. ExtractSubschema
Input: SA A = (Q, X, q0, δ, HLang, VDom)
Input: a set of permissible symbols X′ ⊆ X
Output: SA A is modified so that the modified A is a subschema of the original A and accepts every instance T where T uses only the symbols from X′ to label d-edges
1: create a list L that contains all tuples (q, a) where q ∈ Q, a ∈ X − X′, and δ(q, a) ≠ ⊥ /* (q, a) represents a transition from state q on symbol a */
2: while L is not empty do
3:   pick (q, a) from L and remove (q, a) from L
4:   set δ(q, a) to ⊥ /* remove the transition */
5:   if (q, a) is a mandatory transition then /* q should be removed from the extracted schema */
6:     if q = q0 then
7:       report that no valid subschema can be extracted and halt /* the subschema becomes invalid if q0 is removed */
8:     end if
9:     for all (q′, a′) ∈ Q × X where δ(q′, a′) = q do /* all transitions to q are to be removed; q will become inaccessible and will be removed by MakeUsefulSA */
10:      put (q′, a′) into L if (q′, a′) is not in L
11:    end for
12:  end if
13:  modify HLang(q) to a new regular language that accepts the same set of strings except those containing symbol a /* e.g., when extracting SA 3 (Figure 5.5) from SA 2 (Figure 4.7), HLang(q0) is modified from ⟨Order⟩|⟨Quote⟩ to ⟨Quote⟩ */
14: end while
15: MakeUsefulSA(A)
16: MinimizeSA(A)
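The cascade in Algorithm 7 is a worklist over transitions: deleting a mandatory transition dooms its source state, which in turn queues every transition entering that state. A minimal sketch of steps 1-14 follows; it is illustrative, not the thesis code: the HLang rewriting of step 13 is abstracted into a callback, mandatory-ness is taken as precomputed, and the final MakeUsefulSA/MinimizeSA clean-up is omitted.

```python
def drop_symbols(states, q0, transitions, mandatory, permissible, shrink_hlang):
    """Remove all transitions on symbols outside `permissible`.

    transitions: set of (q, a, q2); mandatory: subset of transitions.
    shrink_hlang(q, a) is invoked when HLang(q) must exclude strings
    containing a (step 13). Returns the surviving transitions, or
    None if no valid subschema exists (q0 would be removed).
    """
    worklist = [t for t in transitions if t[1] not in permissible]
    queued = set(worklist)
    alive = set(transitions)
    while worklist:
        q, a, q2 = worklist.pop()
        alive.discard((q, a, q2))
        if (q, a, q2) in mandatory:
            if q == q0:
                return None          # subschema would lose its root
            for t in transitions:    # queue every transition into q
                if t[2] == q and t not in queued:
                    queued.add(t)
                    worklist.append(t)
        shrink_hlang(q, a)
        # (MakeUsefulSA / MinimizeSA would now clean up what remains.)
    return alive

# Mirror of the running example: q4 -<Product>-> q9 is mandatory, so
# deleting it cascades to q2 -<Line>-> q4; q0 -<Order>-> q2 is not
# mandatory, so the cascade stops there with an HLang rewrite on q0.
T = {("q0", "Quote", "q1"), ("q0", "Order", "q2"),
     ("q2", "Line", "q4"), ("q4", "Product", "q9")}
M = {("q2", "Line", "q4"), ("q4", "Product", "q9")}
rewrites = []
left = drop_symbols({"q0", "q1", "q2", "q4", "q9"}, "q0", T, M,
                    {"Quote", "Order", "Line"},
                    lambda q, a: rewrites.append((q, a)))
```

On this toy input, only the transition q0 --⟨Quote⟩--> q1 survives, matching the walkthrough below where q2, q4, and q9 drop out of the extracted subschema.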


Let us look at an example that illustrates this subschema extraction process. Suppose SA 2 (Figure 4.7) is given, and the permissible symbol set is the symbol set of SA 2 excluding ⟨Product⟩, i.e., {⟨Quote⟩, ⟨Order⟩, ⟨Line⟩, ⟨Qty⟩, ⟨Desc⟩, ⟨Price⟩}. The extracted subschema SA should accept all instances of the original SA except those with any ⟨Product⟩ d-edges. Now, we run ExtractSubschema. L is initialized to contain one transition, q4 --⟨Product⟩--> q9. In the first iteration of the while loop, this transition is removed. Since this transition is mandatory, q4 needs to be deleted, so transition q2 --⟨Line⟩--> q4 is put into L for future deletion. In the second iteration, q2 --⟨Line⟩--> q4 is removed. Since this transition is also mandatory, q2 needs to be deleted, and transition q0 --⟨Order⟩--> q2 is added to L. In the third iteration, q0 --⟨Order⟩--> q2 is deleted. This time, since the transition is not mandatory, no other transition needs to be deleted; instead, HLang(q0) is modified from L(⟨Quote⟩|⟨Order⟩) to L(⟨Quote⟩). Since the above transitions are deleted, states q2, q4, and q8 become inaccessible and are removed during MakeUsefulSA. The extracted subschema SA is shown in Figure 5.5, which corresponds to XSD 3 (Listing 3.5).

5.4 Complexity Analysis and Performance Improvement

This section analyzes the complexity of the algorithms MakeUsefulSA, MinimizeSA, EquivalentSA, SubschemaSA, and ExtractSubschema, and proposes some techniques to speed up their execution. Each algorithm has a while-loop whose maximum number of iterations is polynomial in the number of states. All operations inside these algorithms run in PTIME except two: (1) testing whether one RE r1 is equivalent to another RE r2 (i.e., L(r1) = L(r2)) and (2) testing whether r2 includes r1 (i.e., L(r1) ⊆ L(r2)); both tests are PSPACE-complete [29]. The RE equivalence test on HLangs is used in MinimizeSA and EquivalentSA, and the RE inclusion test on HLangs is used in SubschemaSA.

5.4.1 Speeding Up Regular Expression Tests

When processing large XSDs, EquivalentSA or SubschemaSA needs to execute a large number of RE tests, which can be very time-consuming. To tackle this issue, I have developed a filtering technique that leverages two common XSD usage patterns. First, most industry XSDs express xs:complexType content models (i.e., HLangs) in simple combinations of xs:sequence and xs:choice (i.e., simple REs). Bex et al. [30] suggested that 97% of XSDs express their content models in some simple forms of REs. Also, Martens et al. [29] showed that the equivalence and inclusion of some types of these simple REs can be decided in PTIME. I have implemented a weak RE test to handle the content models in which the occurrence of each xs:sequence or xs:choice must be one while the occurrence of each xs:element is not restricted; this weak test runs very fast, in PTIME. Second, the equality test can be used to conclude most positive cases of RE equivalence and inclusion. In practice, developers seldom express two equivalent content models differently, i.e., most equivalent HLangs are literally equal. (For example, A+ and AA* are equivalent but literally unequal.) Also, in an XSD version update, most complex types in the updated version are identical to those in the old version. Since the RE equality test is a sub-linear string-matching problem, it can efficiently filter many positive RE equivalence cases. Because of the above properties, the RE equivalence / inclusion test can be sped up as follows:

1. If the two REs are literally equal, then conclude that the REs are equivalent;
2. else, if the forms of the REs are supported by the weak test, then do the weak test on the REs and report the result;
3. else, do the full test and report the result.

One of my experiments has shown that algorithm SubschemaSA runs 13 times faster with this filtering technique than with the full test alone.
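The three-step strategy is just a tiered dispatcher in code. In the sketch below, the weak and full tests are supplied as callables (the real implementations are engine-specific), `is_simple` is a stand-in predicate for "supported by the weak test", and all names are my own assumptions rather than the thesis API.

```python
def re_equivalent(r1, r2, is_simple, weak_test, full_test, stats=None):
    """Tiered RE equivalence test: literal-equality filter first, then
    the PTIME weak test for simple REs, then the PSPACE full test as a
    last resort. `stats` (optional dict) counts which tier handled
    each comparison.
    """
    def count(tier):
        if stats is not None:
            stats[tier] = stats.get(tier, 0) + 1

    if r1 == r2:                        # tier 1: literal equality
        count('equality')
        return True
    if is_simple(r1) and is_simple(r2):
        count('weak')                   # tier 2: weak PTIME test
        return weak_test(r1, r2)
    count('full')                       # tier 3: full PSPACE test
    return full_test(r1, r2)

# Toy tiers: every RE counts as "simple", the weak test only knows the
# A+ vs AA* pattern, and the full test is never reached here.
simple = lambda r: True
weak = lambda a, b: {a, b} == {'A+', 'AA*'}
full = lambda a, b: False
stats = {}
same = re_equivalent('A+', 'A+', simple, weak, full, stats)
equiv = re_equivalent('A+', 'AA*', simple, weak, full, stats)
```

The `stats` dictionary makes the filtering effect visible, in the same spirit as the per-tier counts reported in Table 6.1.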


6 Experiments on XML Schema Computations

This chapter will present and analyze the results of two experiments: (1) schema compatibility testing and (2) subschema extraction. I selected two real datasets, the xCBL 3.0 and xCBL 3.5 XSDs, to conduct the experiments for the following reasons:
1. The two datasets are good representatives of very large industry XSDs.
2. xCBL 3.5 is claimed to be compatible with xCBL 3.0, and this claim can be verified by SubschemaSA.

The experiments were run on a PC with Quad Core [email protected], 4GB RAM, and Ubuntu 8.04 (x86) OS. I have implemented the algorithms in Java and have programmed a converter to transform XSD into SA, and SA to XSD.

6.1 Experiment 1: xCBL Compatibility Testing

The xCBL 3.5 website [17] claims its backward compatibility with xCBL 3.0 as follows: "The only modifications allowed to xCBL 3.0 documents were the additions of new optional elements and additions to code lists, to maintain interoperability between the two versions. An xCBL 3.0 instance of a document is also a valid instance in xCBL 3.5." This claim implies that the xCBL 3.0 XSD should be a subschema of the xCBL 3.5 XSD. This experiment aimed to verify the claim. Surprisingly, the result shows that xCBL 3.0 is in fact not a subschema of xCBL 3.5, refuting the compatibility claim. The experiment detected the following four incompatibility errors:


1. xCBL 3.0 declares a root element named Carrier, which does not exist in xCBL 3.5.
2. Under complex type CatalogSchema, element SchemaSource is declared before element ValidateAttributes in xCBL 3.0, but SchemaSource is declared after ValidateAttributes in xCBL 3.5.
3. Under complex type CatalogHeader, element CatalogProvider is declared with minOccurs="0" in xCBL 3.0 but minOccurs="1" in xCBL 3.5.
4. Under complex type SchemaCategory, element CategoryID is declared with minOccurs="1" in xCBL 3.5, but this element is not declared in xCBL 3.0.

After the above errors were fixed, the XSDs passed the subschema test. I believe the errors were caused by human mistakes when updating xCBL 3.0 to xCBL 3.5. It is very difficult to manually detect these few errors (0.3%) among thousands of XSD types and elements, yet they cause a substantial number of xCBL 3.0 instances not to conform to xCBL 3.5. One such instance is listed in Listing 6.1.

[Listing 6.1: Example xCBL 3.0 instance that does not conform to the xCBL 3.5 XSD]

The experiment also applied the following three filtering strategies to execute the RE inclusion test in algorithm SubschemaSA.

full-only: It did not use any filtering technique and performed only the full test on every RE comparison.

weak+full: It first used the weak inclusion test for simple REs and then used the full test for the REs not supported by the weak test.

equality+weak+full: It first used the equality test; second, the weak inclusion test for the unequal and simple REs; and lastly, the full test for the REs not supported by the weak test.

The performance results are compared in Table 6.1 and Figure 6.1, where the numbers of equality tests, weak tests, and full tests are shown, together with the running time of algorithm SubschemaSA. There were a total of 1,258 RE inclusion tests to execute. Using the weak+full strategy, the weak test could conclude 596 (47%) of the RE comparisons. Using the equality+weak+full strategy, the equality test could conclude 1,196 (95%) of the comparisons;

Doctoral Thesis, Thomas Yau-tat LEE

The University of Hong Kong

Chapter 6. Experiments on XML Schema Computations

strategy            | equality tests          | weak tests | full tests | time (ms)
full-only           | 0                       | 0          | 1,258      | 3,869
weak+full           | 0                       | 596        | 662        | 536
equality+weak+full  | 1,258 (1,196 concluded) | 59         | 3          | 272

Table 6.1: Performance of different filtering techniques for HLang RE tests

Figure 6.1: Performance of different filtering techniques for HLang RE tests (#equality tests, #weak tests, #full tests, and algorithm running time)

the weak test could conclude 59 out of the remaining 62 tests; finally, only 3 full tests were needed. The performance gain of the equality+weak+full strategy is over 13 times.
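The layered filtering strategy can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: the real tests operate on the REs of SA transitions, the weak rule shown is merely a hypothetical example of a cheap sufficient condition, and the bounded "full test" is only a toy stand-in for a proper PSPACE-complete inclusion decision.

```python
import itertools
import re

def bounded_full_test(r1, r2, alphabet="ab", max_len=6):
    # Toy stand-in for the real inclusion decision: checks the implication
    # L(r1) subset-of L(r2) on all strings up to max_len only.
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(r1, s) and not re.fullmatch(r2, s):
                return False
    return True

def weak_test(r1, r2):
    # Toy weak rule: r is trivially included in (r)* and in (r)?.
    # Returns None when the rule cannot handle the pair.
    if r2 == "(" + r1 + ")*" or r2 == "(" + r1 + ")?":
        return True
    return None

def layered_inclusion(r1, r2):
    """Decide L(r1) subset-of L(r2), cheapest layer first."""
    if r1 == r2:                       # layer 1: syntactic equality
        return True, "equality"
    verdict = weak_test(r1, r2)        # layer 2: weak inclusion test
    if verdict is not None:
        return verdict, "weak"
    return bounded_full_test(r1, r2), "full"   # layer 3: full test
```

The point of the layering is that most comparisons in practice are concluded by the two cheap layers, so the expensive full test runs only on the small residue, mirroring the 1,196 / 59 / 3 breakdown above.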

6.2 Experiment 2: xCBL Subschema Extraction

This experiment extracted various subschemas from xCBL 3.0 and 3.5, and examined the reduction in XSD size and processing time. The XSDs of xCBL 3.0 and 3.5 comprise 42 and 51 business document types respectively (e.g., Quote, Order, Invoice). These document types are grouped into different domains; for example, the quotation domain consists of RFQ and Quote. The ExtractSubschema program was first run to extract subschema XSDs from the xCBL 3.0 and xCBL 3.5 XSDs for five domains, namely invoice, order, quote, auction, and catalog. Then, the XMLBeans v2.3.0[31] schema compiler was run to compile each subschema XSD into a Java XML binding library. The number of document types (doctypes), the number of element names (enames), the number of data types (types) with the percentage of the original number of types, the number of element declarations (edecls), the XMLBeans compilation time (ctime) with the percentage of the original compilation time, and the ExtractSubschema running time (rtime) are compared in Table 6.2 and Table 6.3.


XSD      | doctypes | enames | types (%)   | edecls | ctime (s) | ctime (%) | rtime (s)
Original | 42       | 1,905  | 1,290 (100) | 3,728  | 29.1      | 100       | N/A
Invoice  | 8        | 904    | 412 (32)    | 1,154  | 14.1      | 48        | 3.11
Order    | 6        | 722    | 352 (27)    | 910    | 13.2      | 46        | 3.17
Quote    | 2        | 621    | 299 (23)    | 721    | 12.9      | 44        | 3.01
Auction  | 4        | 555    | 266 (21)    | 646    | 12.6      | 43        | 3.01
Catalog  | 1        | 156    | 81 (6)      | 190    | 9.6       | 34        | 2.74

Table 6.2: Subschema extraction on xCBL 3.0

XSD      | doctypes | enames | types (%)   | edecls | ctime (s) | ctime (%) | rtime (s)
Original | 51       | 2,263  | 1,476 (100) | 4,473  | 30.5      | 100       | N/A
Invoice  | 9        | 1,018  | 460 (31)    | 1,305  | 15.3      | 50        | 3.28
Order    | 7        | 820    | 384 (26)    | 1,052  | 13.7      | 45        | 3.18
Quote    | 2        | 621    | 319 (22)    | 786    | 12.7      | 42        | 3.32
Auction  | 4        | 612    | 291 (20)    | 711    | 12.4      | 40        | 3.25
Catalog  | 1        | 189    | 91 (6)      | 231    | 10.7      | 35        | 2.95

Table 6.3: Subschema extraction on xCBL 3.5

The number of document types (doctypes) in each domain is indicated in the doctypes column. For example, the original xCBL 3.0 XSD comprises 1,905 different element names (i.e., symbols), 1,290 data types (i.e., states), and 3,728 element declarations (i.e., transitions), while the subschema for the 8 invoice-related document types includes only 904 element names, 412 data types, and 1,154 element declarations. If we use the number of data types (i.e., states) to measure the size of a schema, ExtractSubschema can reduce the schema size to a fraction of 6–32%. The time required for XMLBeans to compile each subschema was significantly reduced, to a fraction of 34–50%. This evidently demonstrates that ExtractSubschema can effectively reduce the schema size and processing time.

Figure 6.2: Subschema extraction on xCBL 3.0 (#element names, #types, #element declarations, and XMLBeans compilation time)


Figure 6.3: Subschema extraction on xCBL 3.5 (#element names, #types, #element declarations, and XMLBeans compilation time)

Overall, the experiment took around 3 seconds to run the ExtractSubschema program in each case. The performance of the ExtractSubschema implementation is satisfactory even when subschema extraction is performed on complex and large schemas.


7 Conclusions of Part I

In this research, I have studied the data interoperability between two web services and its relationship to the compatibility between XML schemas. I have formalized the Schema Automaton (SA) and Data Tree models to represent W3C XML Schema Definitions (XSDs) and XML instances for performing various computations on XSDs. Based on SA, I have investigated two problems: schema compatibility testing and subschema extraction.

On the schema compatibility problem, I have developed algorithms for schema minimization, schema equivalence testing, and subschema testing. For schema minimization, I have proposed the concept of the usefulness of SA states and have developed an algorithm to compute the minimum (canonical) SA. Schema equivalence and subschema testing can be used to verify the data interoperability between two web services and the compatibility between two schema versions. On the subschema extraction problem, I have developed an algorithm to compute, from a given schema, a subschema that recognizes only a given subset of the symbols. Subschema extraction is useful for reducing the size of a huge schema when an application requires only a small subschema that recognizes some but not all of the element names. Moreover, I have proposed some effective filtering mechanisms to speed up the PSPACE-complete regular expression tests required by schema equivalence and subschema testing.

I have conducted two experiments to verify the practicality and effectiveness of the above algorithms, using the XSDs of xCBL 3.0 and xCBL 3.5 as real datasets. In the first experiment, I ran the subschema testing algorithm to check whether xCBL 3.5 was backward compatible with xCBL 3.0, i.e., whether xCBL 3.5 was a superschema of xCBL 3.0. Though backward compatibility is claimed on the xCBL 3.5 website, the experimental result has shown that xCBL 3.5 is actually not a superschema of xCBL 3.0, refuting the claim. Moreover, the experiment has also demonstrated that the filtering mechanisms for the regular expression tests are effective in improving the performance of the algorithm.


In the second experiment, I ran the subschema extraction algorithm to extract different subschemas from the xCBL 3.0 and xCBL 3.5 XSDs for different applications. The size of an extracted subschema was only 6–32% of the size of the original schema. The time required for XMLBeans to compile each subschema was reduced to 34–50% of the time required to compile the original schema.

Based on the SA model, I anticipate that other schema computation techniques can be derived. Possible extensions of this research are an XML schema inferencer and an XML transducer. The XML schema inferencer takes a collection of XML documents of unknown schema, learns their structures, and re-engineers a "good" XSD to describe the documents. The XML transducer transforms a variety of formats (e.g., structured text and database table formats) into XML documents by annotating the SA that defines the output XML format with the logic to extract data from the input data format. I believe these schema computation techniques can be applied to develop new web services design tools and runtime engines.

In the next part of this thesis, I will study the problem of how to automate the design of database schemas for RDF data by mining the data patterns so that the data can be stored and queried efficiently.


Part II

Schema Computations for RDF Data Storage


8 Introduction to Part II The World Wide Web was originally designed for humans to share information; most web resources were created merely for comprehension by humans rather than for processing by computers. In order for computers to perform intelligent tasks for humans using the Web, the Semantic Web[2] was proposed to annotate web resources with machine-processable meanings and relationships.[3] The Resource Description Framework (RDF) was developed as a computer language to represent these annotations.[4] This chapter will give a brief introduction to RDF and its query language called SPARQL.

8.1 Resource Description Framework (RDF)

RDF is essentially a graph language to specify labelled directed graphs. In such a graph, each node is a web resource (e.g., a webpage) or some static information called a literal, while each arc is a property of some resource. Figure 8.1 shows an example of an RDF graph. There are two types of nodes, represented by ovals and boxes in the figure. (There is a third type of node called a "blank node," which is regarded as the same as a resource node in this research for simplicity.) An oval represents a resource, referenced by a Uniform Resource Identifier (URI), while a box represents a literal as a string. An arc represents a relationship between two nodes, where the originating node is called the subject and the destination node is called the object. The arc indicates that the subject has a property, also known as a predicate, associated with the object. Note that a literal node cannot have a predicate (outgoing arc). For example, in Figure 8.1, the oval labelled <http://ex.edu/May> stands for a person (resource) called "May." May has a property (predicate) identified by <http://ex.edu/supervisor>. This property is associated with another person called "Sam" represented by resource <http://ex.edu/Sam>. Sam has a title (<http://ex.edu/title>) "Professor" (literal).






"Professor"

Figure 8.1: Example RDF graph The RDF graph in Figure 8.1 can be equivalently encoded in the text format of NTriples[32]. As shown in Listing 8.1, an N-Triples file is a list of statements, each ended with a full-stop. Each statement, also known as triplet, specifies an arc of the corresponding RDF graph in the form: subject predicate object. URIs are angle-bracketed while laterals are double-quoted. There is another RDF format called Notation 3 (N3) as shown in Listing 8.2; N3 shortens lengthy URIs by using namespace prefixes. Yet, both N-Triples and N3 are basically in the same three-column data structure. Listing 8.1: Example RDF in N-Triples . " Professor " . . .

Listing 8.2: Example RDF in Notation 3

@prefix ex: <http://ex.edu/> .
ex:Db ex:teacher ex:Sam .
ex:Sam ex:title "Professor" .
ex:May ex:supervisor ex:Sam .
ex:May ex:enrolls ex:Db .

This section has covered how semantic data can be modelled in RDF and expressed in different notations. In the next section, we will look at how the SPARQL query language is used to query RDF data.



8.2 RDF Query Language

Among different RDF query languages, SPARQL[33], a W3C Recommendation, is emerging as the most widely accepted one. This section will describe the key semantics of SPARQL and illustrate different kinds of SPARQL queries by example. SPARQL is a SQL-like language used to express queries on RDF data. A SPARQL SELECT statement contains a set of triple patterns as well as a set of variables to be included in result sets. A triple pattern is like an RDF triplet except that some of the subject, predicate, and object are variables. The query engine matches the RDF triplets against all triple patterns, and then binds each variable to a subject, predicate, or object value according to each match. Finally, the engine reports the values of the required variables in the result. Listing 8.3 gives a small RDF dataset to aid our discussion of some SPARQL examples; this dataset is also used as the motivating example throughout this part of the thesis. It contains 20 triplets, 8 resources (i.e., ex:Tom, ex:May, ex:Roy, ex:Sam, ex:Kat, ex:Db, ex:Net and ex:Web) and 7 predicates (i.e., ex:degree, ex:enrolls, ex:title, ex:supervisor, ex:interest, ex:teacher and ex:code).

Listing 8.3: Example RDF dataset

@prefix ex: <http://ex.edu/> .
ex:Tom ex:degree "PhD" .
ex:May ex:degree "MPhil" .
ex:Roy ex:degree "BSc" .
ex:May ex:enrolls ex:Db .
ex:Roy ex:enrolls ex:Db .
ex:Roy ex:enrolls ex:Web .
ex:Sam ex:title "Professor" .
ex:Tom ex:title "Lecturer" .
ex:Kat ex:title "Professor" .
ex:Tom ex:supervisor ex:Sam .
ex:May ex:supervisor ex:Sam .
ex:Sam ex:interest "Datamining" .
ex:Sam ex:interest "Database" .
ex:Kat ex:interest "Security" .
ex:Db ex:teacher ex:Sam .
ex:Net ex:teacher ex:Sam .
ex:Web ex:teacher ex:Tom .
ex:Db ex:code "C123" .
ex:Net ex:code "C246" .
ex:Web ex:code "C135" .

Listing 8.4 shows an example SPARQL query. The query finds all students who have enrolled in the course ex:Db, are studying an "MPhil" degree, and have the supervisor ex:Sam.

Listing 8.4: SPARQL example 1

PREFIX ex: <http://ex.edu/>
SELECT ?student ?degree ?supervisor
WHERE {
  ?student ex:enrolls ex:Db .
  ?student ex:degree "MPhil" .
  ?student ex:supervisor ex:Sam .
}

We can see that the following three triplets are matched:

ex:May ex:degree "MPhil" .
ex:May ex:enrolls ex:Db .
ex:May ex:supervisor ex:Sam .

The result set has only one tuple, as follows:

?student | ?degree | ?supervisor
ex:May   | "MPhil" | ex:Sam
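The match-and-bind semantics described above can be illustrated with a naive evaluator. This is a hypothetical sketch, not a real SPARQL engine: it scans all triplets for every pattern instead of using indexes, and terms starting with "?" are treated as variables.

```python
def _unify(term, value, binding):
    """Match one pattern term against one triplet field, extending binding."""
    if term.startswith("?"):          # variable: bind it, or check consistency
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value              # constant: must match exactly

def query(triples, patterns, select):
    """Evaluate triple patterns over (s, p, o) tuples; one result tuple
    of the selected variables per complete solution."""
    bindings = [{}]                   # start with a single empty binding
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for triple in triples:
                b2 = dict(b)          # copy so a failed match leaves b intact
                if all(_unify(t, v, b2) for t, v in zip(pattern, triple)):
                    next_bindings.append(b2)
        bindings = next_bindings
    return [tuple(b[v] for v in select) for b in bindings]

# A subset of the Listing 8.3 dataset and the patterns of Listing 8.4.
triples = [
    ("ex:Tom", "ex:degree", "PhD"),
    ("ex:May", "ex:degree", "MPhil"),
    ("ex:May", "ex:enrolls", "ex:Db"),
    ("ex:Roy", "ex:enrolls", "ex:Db"),
    ("ex:May", "ex:supervisor", "ex:Sam"),
]
patterns = [("?student", "ex:enrolls", "ex:Db"),
            ("?student", "ex:degree", "MPhil"),
            ("?student", "ex:supervisor", "ex:Sam")]
print(query(triples, patterns, ["?student"]))   # → [('ex:May',)]
```

Note how ex:Roy is bound by the first pattern but discarded by the second, exactly the winnowing behaviour the prose describes.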

Suppose we store the triplets in Listing 8.3 in a relational database table called triple_store, as shown in Figure 8.2. This table contains three columns: subject, predicate and object. The SPARQL query in Listing 8.4 can be translated into the equivalent SQL query shown in Listing 8.5 to find the same result set from the table.

subject | predicate     | object
ex:Tom  | ex:degree     | "PhD"
ex:May  | ex:degree     | "MPhil"
ex:Roy  | ex:degree     | "BSc"
ex:May  | ex:enrolls    | ex:Db
ex:Roy  | ex:enrolls    | ex:Db
ex:Roy  | ex:enrolls    | ex:Web
ex:Sam  | ex:title      | "Professor"
ex:Tom  | ex:title      | "Lecturer"
ex:Kat  | ex:title      | "Professor"
ex:Tom  | ex:supervisor | ex:Sam
ex:May  | ex:supervisor | ex:Sam
ex:Sam  | ex:interest   | "Datamining"
ex:Sam  | ex:interest   | "Database"
ex:Kat  | ex:interest   | "Security"
ex:Db   | ex:teacher    | ex:Sam
ex:Net  | ex:teacher    | ex:Sam
ex:Web  | ex:teacher    | ex:Tom
ex:Db   | ex:code       | "C123"
ex:Net  | ex:code       | "C246"
ex:Web  | ex:code       | "C135"

Figure 8.2: Triple store database

Processing this SQL statement involves two self-joins of the triple_store table. We will discuss how different RDF databases are organized to answer SPARQL queries in Chapter 9.

Listing 8.5: SQL query on triple store for SPARQL example 1

SELECT a.subject AS student, b.object AS degree, c.object AS supervisor
FROM   triple_store a, triple_store b, triple_store c
WHERE  a.subject = b.subject AND b.subject = c.subject
AND    a.predicate = 'ex:enrolls' AND a.object = 'ex:Db'
AND    b.predicate = 'ex:degree' AND b.object = 'MPhil'
AND    c.predicate = 'ex:supervisor' AND c.object = 'ex:Sam';
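The translation can be tried directly with an embedded relational database. The sketch below loads a few of the triplets from Listing 8.3 into SQLite (an assumed stand-in for any RDBMS) and runs the two-self-join query of Listing 8.5, with one table alias per triple pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple_store (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triple_store VALUES (?, ?, ?)", [
    ("ex:Tom", "ex:degree", "PhD"),
    ("ex:May", "ex:degree", "MPhil"),
    ("ex:Roy", "ex:degree", "BSc"),
    ("ex:May", "ex:enrolls", "ex:Db"),
    ("ex:Roy", "ex:enrolls", "ex:Db"),
    ("ex:May", "ex:supervisor", "ex:Sam"),
    ("ex:Tom", "ex:supervisor", "ex:Sam"),
])

# Listing 8.5: alias a matches the enrolls pattern, b the degree pattern,
# c the supervisor pattern; the shared-subject equalities realize ?student.
result = conn.execute("""
    SELECT a.subject AS student, b.object AS degree, c.object AS supervisor
    FROM triple_store a, triple_store b, triple_store c
    WHERE a.subject = b.subject AND b.subject = c.subject
      AND a.predicate = 'ex:enrolls' AND a.object = 'ex:Db'
      AND b.predicate = 'ex:degree' AND b.object = 'MPhil'
      AND c.predicate = 'ex:supervisor' AND c.object = 'ex:Sam'
""").fetchall()
print(result)   # → [('ex:May', 'MPhil', 'ex:Sam')]
```

ex:Roy also enrolls in ex:Db but studies a "BSc" degree, so the b-alias condition filters that candidate out, matching the SPARQL result set above.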

Listing 8.6 gives another SPARQL example with three variables. The query selects all students who have enrolled in the course with code "C123" (i.e., ex:Db) and are studying some degree, and reports the matched students, degrees, and courses in the result set. The translated SQL query is given in Listing 8.7; processing it requires three self-joins.

Listing 8.6: SPARQL example 2

PREFIX ex: <http://ex.edu/>
SELECT ?student ?degree ?course2
WHERE {
  ?course1 ex:code "C123" .
  ?student ex:enrolls ?course1 .
  ?student ex:degree ?degree .
  ?student ex:enrolls ?course2 .
}

Listing 8.7: SQL query translated from SPARQL example 2

SELECT b.subject AS student, c.object AS degree, d.object AS course
FROM   triple_store a, triple_store b, triple_store c, triple_store d
WHERE  a.subject = b.object AND a.object = 'C123'
AND    b.subject = c.subject AND c.subject = d.subject
AND    a.predicate = 'ex:code' AND b.predicate = 'ex:enrolls'
AND    c.predicate = 'ex:degree' AND d.predicate = 'ex:enrolls';

The result set of executing the above query consists of three tuples, as follows:

?student | ?degree | ?course
ex:May   | "MPhil" | ex:Db
ex:Roy   | "BSc"   | ex:Db
ex:Roy   | "BSc"   | ex:Web

A SPARQL query may contain an object variable in a triple pattern. An example is shown in Listing 8.8. The equivalent SQL query is shown in Listing 8.9, which also requires two self-joins when processed.

Listing 8.8: SPARQL example 3

PREFIX ex: <http://ex.edu/>
SELECT ?student ?lecturer ?course
WHERE {
  ?student ex:supervisor ?lecturer .
  ?student ex:enrolls ?course .
  ?course ex:teacher ?lecturer .
}

Listing 8.9: SQL query translated from SPARQL example 3

SELECT a.subject AS student, a.object AS lecturer, b.object AS course
FROM   triple_store a, triple_store b, triple_store c
WHERE  a.subject = b.subject
AND    a.predicate = 'ex:supervisor' AND a.object = c.object
AND    b.predicate = 'ex:enrolls' AND b.object = c.subject
AND    c.predicate = 'ex:teacher' AND c.object = a.object;

The result set of executing the above query is:

?student | ?lecturer | ?course
ex:May   | ex:Sam    | ex:Db

Finally, a SPARQL query may also contain a predicate variable in a triple pattern. As an example, the query in Listing 8.10 finds all triplets that share the same predicate and object with some triplet having ex:Tom as its subject. The equivalent SQL query is shown in Listing 8.11, which requires only one table join.

Listing 8.10: SPARQL example 4

PREFIX ex: <http://ex.edu/>
SELECT ?subject ?predicate ?object
WHERE {
  ex:Tom ?predicate ?object .
  ?subject ?predicate ?object .
}

Listing 8.11: SQL query translated from SPARQL example 4

SELECT b.subject AS subject, b.predicate AS predicate, b.object AS object
FROM   triple_store a, triple_store b
WHERE  a.subject = 'ex:Tom'
AND    b.predicate = a.predicate
AND    b.object = a.object;

The result set of executing the above query is as follows:

?subject | ?predicate    | ?object
ex:Tom   | ex:degree     | "PhD"
ex:Tom   | ex:title      | "Lecturer"
ex:Tom   | ex:supervisor | ex:Sam
ex:May   | ex:supervisor | ex:Sam

The above examples have represented different types of SPARQL queries. I will explain in Chapter 9 that different types of SPARQL queries can be favored by different types of RDF databases.

8.3 Research Problems and Contributions on RDF Database Schema Design Automation

In this research, I have surveyed different architectures of RDF data storage and their pros and cons. These architectures include the triple store, the horizontal database, the vertical database, and property tables. In particular, I have investigated the problems of property tables. The key problem is the lack of a technique to automate schema design for property tables: when the given RDF dataset is too large, it is not feasible for humans to design a "good" schema of property tables.

I have formulated a new clustering problem called Attribute Clustering by Table Load (ACTL) to automate schema design for property tables. Intuitively, given a set of predicates, which I call attributes, ACTL aims to cluster them into disjoint clusters of correlated attributes. Each attribute cluster is used to create a property table so as to save table-joins. Recognizing that the ACTL problem is NP-complete, I have developed an agglomerative clustering algorithm to approximate the optimal solution. In addition, I have proposed several pruning techniques to improve the performance of the algorithm. I have also proposed some metrics to measure how "good" a generated schema is at storing RDF data in the property tables in terms of storage and query efficiencies.

I have conducted experiments with huge real-life datasets. The experimental results have demonstrated the following. First, the ACTL algorithm can efficiently generate "good" schemas for these datasets. Second, the performance of running some common queries on the Wikipedia Infobox dataset using the ACTL-designed property tables is generally higher than that using other RDF database architectures.

The main contributions of this research are summarized as follows:

1. Different architectures of RDF databases have been studied and their pros and cons analyzed.

2. A new datamining problem called Attribute Clustering by Table Load (ACTL) has been formulated to automate the design of RDF property tables. This problem has been proved to be NP-complete.



3. An agglomerative algorithm with effective pruning mechanisms has been developed to approximate the optimal solution to the ACTL problem.

4. Some metrics on schema fitness have been introduced to measure the storage and query efficiencies of an ACTL-mined schema in storing a given dataset.

5. Experiments on mining the schemas of huge real datasets with the ACTL algorithm have been conducted. The fitness and query performance of the ACTL-mined schemas have been shown to be satisfactory.

8.4 Organization of Part II

The rest of Part II of this thesis is organized as follows. Chapter 9 will survey four architectures of RDF databases, namely the triple store, the horizontal database, the vertical database, and property tables, and will analyze their pros and cons. In particular, I will elaborate on the problems of using property tables to store RDF data; one problem is the lack of a suitable clustering technique to partition a given set of predicates into property tables. Chapter 10 will formalize the ACTL problem to model this problem, and I will prove that ACTL is NP-complete. Chapter 11 will present an agglomerative clustering algorithm to compute an approximate solution to the ACTL problem. In that chapter, I will also discuss the following. First, I will present three pruning techniques to address the high complexity of the algorithm: fast transaction group equality testing, maximum combined load factor pruning, and iterative clustering. Second, I will introduce the concept of attribute connectivity for preventing uncorrelated attributes from being put in the same cluster. Third, I will formulate two metrics, namely Average Number of Clusters Per Transaction (ACPT) and Aggregate Load Factor (ALF), to measure the fitness of an ACTL-mined schema with respect to a given RDF dataset. Last, I will compare the ACTL technique with other existing clustering techniques for property table schema design.

Chapter 12 will present the results of three experiments conducted on ACTL mining. Experiment 1 performed ACTL mining on two real datasets, namely Wikipedia Infobox and Barton Libraries; I will analyze the clustering performance and the schema fitness. Experiment 2 stored the Wikipedia Infobox data using (1) property tables, (2) a triple store, and (3) a vertical database, and executed the same set of queries on these databases; I will compare the query performance in these cases. Experiment 3 attempted to use the frequent pattern mining approach to find property table schemas for comparison with the ACTL approach. Finally, Chapter 13 will conclude this part with a remark on some potential extensions of this research.


9 RDF Databases

In this chapter, I will survey four types of RDF databases and analyze their pros and cons. The four database schemes are the triple store, the horizontal database, the vertical database, and property tables.

9.1 Triple Store

At the conceptual level, a triple store[34] stores RDF triplets natively in three columns (i.e., subject, predicate and object) of a single table. For example, the RDF dataset in Listing 8.3 is stored in a triple store table as shown in Figure 8.2. Despite the simplicity of the conceptual triple store, there are variations among real-world implementations.

HP Laboratories developed an RDF database called Jena based on the triple store.[35] The first version of Jena, Jena1, maintains three tables, namely Resources, Literals and Statements. Figure 9.1 shows how the triplets in Listing 8.3 are stored in the Jena1 triple store. The Resources table has two columns: Id and URI. Each resource or predicate with a distinct URI is assigned a unique system identifier and stored in exactly one row of the Resources table. Similarly, the Literals table has two columns: Id and Value. Each literal with a distinct value is assigned a unique system identifier and occupies exactly one row of the Literals table. The additional Resources and Literals tables are intended to eliminate duplication of identical URIs and literals in the Statements table. Instead of storing the resource URIs and the literal value of each triplet, the Statements table stores only their system identifiers maintained in the Resources and Literals tables. The Statements table has four columns (instead of three): Subject, Predicate, ObjectURI and ObjectLiteral. Each field is a foreign key to either the Resources table or the Literals table. Each triplet occupies


exactly one row in this table. The Subject and Predicate columns store the identifiers of the URIs in the Resources table that represent the subjects and the predicates respectively. To distinguish whether the object in a triplet is a resource or a literal, two separate fields, ObjectURI and ObjectLiteral, are used. If the object is a resource, the ObjectURI field stores the identifier of the resource in the Resources table while the ObjectLiteral field is null. Conversely, if the object is a literal, the ObjectLiteral field stores the identifier of the literal in the Literals table while the ObjectURI field is null. Separating object resources and literals into two fields allows simpler table-joins when recovering the full RDF statements. Yet, in each row of the Statements table, exactly one of the two fields must store a null value, which lowers the storage efficiency.

Jena2, the second version of Jena, was designed to trade space for speed. The Resources and Literals tables maintain only lengthy URIs and literals; short URIs and literals are stored directly in the Statements table (Jena2 stores shorter prefixed URIs, e.g., ex:Sam, rather than full URIs, e.g., http://ex.edu/Sam). Also, the Statements table does not distinguish object resources from object literals; it stores all object values in one column called Object. A value in each field is encoded with a pre-defined prefix to indicate whether the value is an identifier, a URI, or a literal. This way, table-joins are unnecessary for the triplets whose short URIs and literals are stored directly in the Statements table. However, this approach complicates table-joins when a query result set involves some fields storing identifiers and others storing URI / literal strings.

Another popular triple store implementation, Sesame[36], also uses a table called triples to store the foreign keys to the resources and literals tables. The triples table contains four columns, namely subject, predicate, object and is_derived. The subject and predicate fields reference the resources stored in the resources table by system identifiers, while the object field references either a resource URI or a literal kept in the literals table. The is_derived field indicates whether a triplet is explicitly asserted or inferred from RDF Schema (RDFS) data. In addition to the above three tables, Sesame also maintains other tables to store RDFS data.

The 3store[37] RDF database uses triples, resources and literals tables that resemble those in Sesame. Yet, 3store hashes a resource URI or a literal value into a 64-bit integer and uses it as the identifier of that resource or literal when stored. This approach avoids joining the three tables when searching for a resource or literal. However, hash computation needs to be done and hash collision needs to be detected whenever a triplet is added. Also, 3store uses larger primary and foreign keys (64-bit) to reduce the chance of hash collision, which reduces its storage efficiency.

Statements table              Resources table      Literals table
Subject Predicate ObjectURI ObjectLiteral    Id  URI            Id  Value
1       10        null      1                1   ex:Tom         1   "PhD"
2       10        null      2                2   ex:May         2   "MPhil"
3       10        null      3                3   ex:Roy         3   "BSc"
2       11        7         null             4   ex:Sam         4   "Professor"
3       11        7         null             5   ex:Tom         5   "Lecturer"
3       11        9         null             6   ex:Kat         6   "Datamining"
4       12        null      4                7   ex:Db          7   "Database"
1       12        null      5                8   ex:Net         8   "Security"
6       12        null      4                9   ex:Web         9   "C123"
1       13        4         null             10  ex:degree      10  "C246"
2       13        4         null             11  ex:enrolls     11  "C135"
4       14        null      6                12  ex:title
4       14        null      7                13  ex:supervisor
6       14        null      8                14  ex:interest
7       15        4         null             15  ex:teacher
8       15        4         null             16  ex:code
9       15        1         null
7       16        null      9
8       16        null      10
9       16        null      11

Figure 9.1: Jena1 triple store database
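The join needed to recover readable triplets from the Jena1-style normalized layout can be sketched with SQLite. Table and column names follow Figure 9.1; the data is only a two-triplet subset with hypothetical identifiers, and the COALESCE over the two object columns is my illustration of how the ObjectURI / ObjectLiteral split is merged back together.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Resources (Id INTEGER PRIMARY KEY, URI TEXT);
    CREATE TABLE Literals  (Id INTEGER PRIMARY KEY, Value TEXT);
    CREATE TABLE Statements (Subject INT, Predicate INT,
                             ObjectURI INT, ObjectLiteral INT);
""")
db.executemany("INSERT INTO Resources VALUES (?, ?)",
               [(1, "ex:Sam"), (2, "ex:May"),
                (3, "ex:title"), (4, "ex:supervisor")])
db.executemany("INSERT INTO Literals VALUES (?, ?)", [(1, "Professor")])
# ex:Sam ex:title "Professor" .   and   ex:May ex:supervisor ex:Sam .
db.executemany("INSERT INTO Statements VALUES (?, ?, ?, ?)",
               [(1, 3, None, 1), (2, 4, 1, None)])

# Recovering readable triplets joins Resources twice (subject, predicate)
# and outer-joins both object columns, since exactly one of them is null.
rows = db.execute("""
    SELECT s.URI, p.URI, COALESCE(o.URI, l.Value)
    FROM Statements st
    JOIN Resources s ON s.Id = st.Subject
    JOIN Resources p ON p.Id = st.Predicate
    LEFT JOIN Resources o ON o.Id = st.ObjectURI
    LEFT JOIN Literals  l ON l.Id = st.ObjectLiteral
""").fetchall()
print(rows)   # two readable triplets, e.g. ('ex:Sam', 'ex:title', 'Professor')
```

The four-way join here is exactly the price the normalized layout pays for eliminating duplicated URI and literal strings.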

9.1.1 Self-join Problem of Triple Store

It has been commonly recognized that the triple store approach suffers from the performance problem of expensive multiple self-joins of a potentially huge triplet table.[38, 39, 40, 41, 42, 43, 44] Some simple SPARQL examples given in Section 8.2 already involve two self-joins, while typical queries for realistic applications may require many more; for example, some queries used in one of the experiments require as many as seven joins. In addition, some real-world RDF datasets (e.g., the datasets used in the experiments) can contain millions of triplets, and self-joins on a giant triplet table are very costly.

To mitigate the performance issue caused by expensive self-joins, some research works have developed proprietary database indexing and query optimization techniques for RDF data. Many of these proprietary features are still under academic research and are not yet ready for commercial deployment. To my knowledge, these implementations do not yet support important database features, such as ACID (Atomicity, Consistency, Isolation, Durability) transactions, commonly provided by mainstream relational database management systems (RDBMSs).


index     | covered query patterns
(s p o c) | (? ? ? ?) (s ? ? ?) (s p ? ?) (s p o ?) (s p o c)
(p o c)   | (? p ? ?) (? p o ?) (? p o c)
(o c s)   | (? ? o ?) (? ? o c) (s ? o c)
(c s p)   | (? ? ? c) (s ? ? c) (s p ? c)
(c p)     | (? p ? c)
(o s)     | (s ? o ?)

Table 9.1: Six indexes to cover 16 query patterns in YARS2

Indexing Triple Store

YARS2 was proposed as a distributed database for graph-structured data.[43, 44] It creates sparse indexes[45] on quadruples to cover 16 query patterns. A quadruple is a triplet plus a context entry, i.e., (subject, predicate, object, context) or (s p o c), where the context denotes the URL of the data source from which the triplet originated. The 16 access patterns are all combinations of (s p o c) where each entry is either a constant given in the query or a variable. To save space, YARS2 uses only 6 sparse indexes to cover the 16 query patterns by using partial-key (prefix) look-ups, as shown in Table 9.1.

Another similar index implementation called RDF-3X[40] creates six B+-trees on (s p o), (s o p), (o s p), (o p s), (p s o) and (p o s) respectively. (RDF-3X does not consider the context of a triplet.) To be scalable to huge datasets, RDF-3X uses a new index compression technique on RDF triplets to reduce the index size. The essence of the technique is to store a key (i.e., a triplet) by encoding its difference from the previous key in the index tree instead of encoding its full (s p o) representation.
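The partial-key (prefix) look-up idea behind these indexes can be sketched with one of the six orderings. This is an illustrative sketch, not YARS2 or RDF-3X code: a list of triplets sorted in (s p o) order answers any pattern that fixes a prefix of the key with two binary searches (the "\uffff" padding assumes terms never contain that character).

```python
from bisect import bisect_left, bisect_right

# One of the six orderings: triplets sorted by (s, p, o).  Patterns that
# fix s, or s and p, or all three become range scans on this index.
spo = sorted([
    ("ex:May", "ex:degree", "MPhil"),
    ("ex:May", "ex:enrolls", "ex:Db"),
    ("ex:May", "ex:supervisor", "ex:Sam"),
    ("ex:Roy", "ex:enrolls", "ex:Db"),
])

def prefix_scan(index, prefix):
    """Return all triplets whose first len(prefix) fields equal prefix."""
    lo = bisect_left(index, prefix)
    # Pad the upper bound so every triplet sharing the prefix sorts below it.
    hi = bisect_right(index, prefix + ("\uffff",) * (3 - len(prefix)))
    return index[lo:hi]

print(prefix_scan(spo, ("ex:May",)))               # the (s ? ?) pattern
print(prefix_scan(spo, ("ex:May", "ex:enrolls")))  # the (s p ?) pattern
```

A pattern such as (? p ?) cannot be served by this ordering, which is why a second index sorted with p first is kept; six such orderings together cover all 16 patterns.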

Query Optimization for Triple Store

The primary goal of query optimization is to compute a good join order of the triple patterns so that a query can be answered efficiently. An important input parameter to this computation is the selectivity estimate of each triple pattern. A triple pattern with a smaller estimated selectivity has a higher priority to be processed first, so as to constrain the size of the result set from joining the triplet table. Some works [46, 47, 40] have developed techniques to estimate the selectivity of triple patterns, with or without pre-computed statistics on the RDF data. To evaluate a join order based on the cost model calculated from the selectivity estimates, the projects by Stocker et al. [47] and by Neumann et al. [40] used an undirected graph to


1  ?student ex:enrolls ?course .
2  ?student ex:supervisor ?staff .
3  ?staff ex:teacher ?course .
4  ?staff ex:interest "Database" .

(The middle and right parts of the figure, an undirected graph over patterns 1 to 4 and the directed acyclic graph derived from it for the join order, are not reproduced here.)

Figure 9.2: Graphs to represent triplet patterns and join order

subject | degree  | enrolls       | title       | supervisor | interest                 | teacher | code
ex:Tom  | "PhD"   |               | "Lecturer"  | ex:Sam     |                          |         |
ex:May  | "MPhil" | ex:Db         |             | ex:Sam     |                          |         |
ex:Roy  | "BSc"   | ex:Db, ex:Web |             |            |                          |         |
ex:Sam  |         |               | "Professor" |            | "Database", "Datamining" |         |
ex:Kat  |         |               | "Professor" |            | "Security"               |         |
ex:Db   |         |               |             |            |                          | ex:Sam  | "C123"
ex:Net  |         |               |             |            |                          | ex:Sam  | "C246"
ex:Web  |         |               |             |            |                          | ex:Tom  | "C135"

(Empty cells are nulls.)

Figure 9.3: Horizontal database

model a set of triple patterns. Each triple pattern is represented by a node in the graph. Two nodes are connected when their patterns share some common variable. For example, the four triple patterns on the left of Figure 9.2 are represented by the undirected graph in the middle. Using the cost model, a directed acyclic graph, as shown on the right of Figure 9.2, can be computed from that undirected graph in order to generate the target join order.
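The join-order computation just described can be sketched as follows. The selectivity estimates and the greedy heuristic are invented for illustration; they are not the exact cost models of Stocker et al. or Neumann et al.

```python
# pattern id -> (triple pattern, estimated selectivity); numbers are invented.
patterns = {
    1: (("?student", "ex:enrolls", "?course"), 0.30),
    2: (("?student", "ex:supervisor", "?staff"), 0.20),
    3: (("?staff", "ex:teacher", "?course"), 0.25),
    4: (("?staff", "ex:interest", '"Database"'), 0.05),
}

def variables(tp):
    return {t for t in tp if t.startswith("?")}

# Undirected graph: connect two patterns iff they share a variable.
edges = {(a, b) for a in patterns for b in patterns
         if a < b and variables(patterns[a][0]) & variables(patterns[b][0])}

def join_order(patterns, edges):
    """Greedy sketch: start from the most selective pattern, then repeatedly
    pick the most selective pattern connected to those already joined."""
    remaining = set(patterns)
    order = [min(remaining, key=lambda i: patterns[i][1])]
    remaining -= {order[0]}
    while remaining:
        connected = {i for i in remaining
                     if any(tuple(sorted((i, j))) in edges for j in order)}
        nxt = min(connected or remaining, key=lambda i: patterns[i][1])
        order.append(nxt)
        remaining.discard(nxt)
    return order

print(join_order(patterns, edges))  # [4, 2, 3, 1]
```

With these numbers, the bound-object pattern 4 is processed first, and each later pattern joins on a variable already produced, which is what the directed acyclic graph in Figure 9.2 encodes.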

9.2 Horizontal Database

Some papers [48, 49] have discussed the horizontal database approach for RDF storage. This approach uses a single "universal table", where each row stores all the objects for a distinct subject, and each column (except the first column, which stores the subjects or their identifiers) stores the objects for a distinct predicate. In other words, if there are two triplets (s1 p1 o1) and (s1 p1 o2), both the o1 and o2 values are stored in the cell at the row for s1 and the column for p1. If there is no triplet of the form (s1 p2 ?), the cell at the row for s1 and the column for p2 stores a null value. Figure 9.3 shows a horizontal database which stores the triplets in Listing 8.3.
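A minimal pure-Python sketch of this mapping, with invented data, shows the two properties discussed next (sparsity and multi-valued cells); a real deployment would of course use an RDBMS table with one column per predicate.

```python
# Invented triplets: three subjects, two predicates.
triples = [
    ("ex:Tom", "degree", '"PhD"'), ("ex:May", "degree", '"MPhil"'),
    ("ex:Roy", "enrolls", "ex:Db"), ("ex:Roy", "enrolls", "ex:Web"),
]
subjects = sorted({s for s, _, _ in triples})
predicates = sorted({p for _, p, _ in triples})

# Universal table: cell (s, p) collects every object of (s p ?);
# cells with no matching triplet stay null (None).
table = {s: {p: None for p in predicates} for s in subjects}
for s, p, o in triples:
    cell = table[s][p] or []
    cell.append(o)              # multi-valued cell, e.g. ex:Roy's enrolls
    table[s][p] = cell

nulls = sum(v is None for row in table.values() for v in row.values())
print(table["ex:Roy"]["enrolls"], nulls)  # ['ex:Db', 'ex:Web'] 3
```

Even in this tiny example, half the cells (3 of 6) are null, and one cell holds two values; both effects grow with the number of predicates.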


However, unless the RDF dataset has a fixed structure and a small number of predicates, a horizontal database is not practical to implement on an RDBMS, for the following reasons. First, the table is very sparse. For example, in Figure 9.3, 38 out of 56 property cells are null. Second, a horizontal database does not scale with the number of predicates. For example, the Wikipedia Infobox dataset used in the experiments presented in Chapter 12 would need a table with over 60,000 columns, which is impractical to implement. Third, a horizontal database does not handle data of dynamic structures well. When a triplet with a new predicate is added, the table has to be expanded, which requires costly data restructuring operations. Also, multiple values may have to be stored in a single cell, yet multi-valued fields are not widely supported by RDBMSs. Some publications [50, 49] have suggested that a column-based database, instead of a row-based RDBMS, could be used to simulate a horizontal database; this approach is discussed in Section 9.3. However, many column-based databases are less mature and sophisticated than mainstream RDBMSs, and are not widely tested in commercial deployments.

9.3 Vertical Database

Abadi et al. proposed storing all the objects of each distinct predicate in a separate table [38, 39]. For each predicate p, one two-column table is created. The first column stores only the subjects that have some triplet with predicate p; in each row, the second column stores one object associated with the corresponding subject. Suppose there are two triplets (s p o1) and (s p o2). They are stored as two rows in the table for p: the first field in both rows stores s, while the second field in one row stores o1 and in the other row stores o2. Therefore, there is neither a null cell nor a multi-valued cell in any table. The RDF dataset in Listing 8.3 is stored in 7 tables, as shown in Figure 9.4. The advantages of a vertical database are summarized as follows:

1. It is simple to implement.
2. It does not waste storage on null values.
3. It does not store multiple values in one cell.
4. Some specialized column-based databases (e.g., C-Store [51] and MonetDB [52]) can be used for optimized performance.

However, a vertical database also has some disadvantages, as follows:


degree table          code table            teacher table
subject | object      subject | object      subject | object
ex:Tom  | "PhD"       ex:Db   | "C123"      ex:Db   | ex:Sam
ex:May  | "MPhil"     ex:Net  | "C246"      ex:Net  | ex:Sam
ex:Roy  | "BSc"       ex:Web  | "C135"      ex:Web  | ex:Tom

enrolls table         title table              supervisor table
subject | object      subject | object         subject | object
ex:May  | ex:Db       ex:Tom  | "Lecturer"     ex:Tom  | ex:Sam
ex:Roy  | ex:Db       ex:Sam  | "Professor"    ex:May  | ex:Sam
ex:Roy  | ex:Web      ex:Kat  | "Professor"

interest table
subject | object
ex:Sam  | "Database"
ex:Sam  | "Datamining"
ex:Kat  | "Security"

Figure 9.4: Vertical database

Many table-joins. To process queries that involve multiple predicates, table-joins are always needed. For example, the SPARQL query in Listing 8.4 is translated into the SQL statement given in Listing 9.1, which requires two table-joins. Like a triple store, a vertical database does not consider any data and query patterns, the analysis of which could save table-joins by storing correlated predicates in one table. For example, the course resources ex:Db, ex:Net, and ex:Web all have the predicates ex:teacher and ex:code, so they could be stored together in the same table to avoid unnecessary table-joins.

Replicated subject columns. The subject columns are replicated in every table, which lowers the storage efficiency, although Abadi et al. have suggested storing the identifiers of the subjects rather than their URIs.

Inefficient matching of triple patterns with predicate variables. When a SPARQL query has a triple pattern that involves a predicate variable (e.g., Listing 8.10), the query engine must search every table to match the pattern. The Wikipedia Infobox dataset used in the experiments has over 60,000 distinct predicates. If each predicate were maintained in a separate table, processing this type of SPARQL query would require searching more than 60,000 tables.

Poor scalability with the number of predicates. Sidirourgos et al. [53] pointed out the scalability problem of a vertical database. Their experimental results showed that a triple store outperformed a vertical database running on a row-based RDBMS in answering queries. Although the vertical database performed much faster on a column-based database than on an RDBMS, the query performance dropped rapidly as the number of predicates increased. The triple store began to outperform the vertical database on a column-based database when there were more than around


200 predicates.

Immature column-based databases. Although a vertical database runs faster on a column-based database, it cannot enjoy some robust and powerful features (e.g., transaction support) provided by a mature RDBMS. Column-based databases have not yet evolved into mainstream platforms for commercial deployments.

Listing 9.1: SQL query on the vertical database for the SPARQL example

SELECT a.subject AS student, b.object AS degree, c.object AS supervisor
FROM enrolls a, degree b, supervisor c
WHERE a.subject = b.subject AND b.subject = c.subject
  AND a.object = 'ex:Db' AND b.object = 'MPhil' AND c.object = 'ex:Sam';
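Assuming the per-predicate tables each have (subject, object) columns, a query in the style of Listing 9.1 can be exercised end-to-end as follows. SQLite (via Python) stands in for the column store; the rows come from the Listing 8.3 dataset as laid out in Figure 9.4.

```python
import sqlite3

# Vertical database: one two-column table per predicate.
con = sqlite3.connect(":memory:")
for t in ("enrolls", "degree", "supervisor"):
    con.execute("CREATE TABLE %s (subject TEXT, object TEXT)" % t)
con.executemany("INSERT INTO enrolls VALUES (?, ?)",
                [("ex:May", "ex:Db"), ("ex:Roy", "ex:Db"), ("ex:Roy", "ex:Web")])
con.executemany("INSERT INTO degree VALUES (?, ?)",
                [("ex:Tom", "PhD"), ("ex:May", "MPhil"), ("ex:Roy", "BSc")])
con.executemany("INSERT INTO supervisor VALUES (?, ?)",
                [("ex:Tom", "ex:Sam"), ("ex:May", "ex:Sam")])

# Two table-joins are needed across the three predicate tables.
rows = con.execute("""
    SELECT a.subject AS student, b.object AS degree, c.object AS supervisor
    FROM enrolls a, degree b, supervisor c
    WHERE a.subject = b.subject AND b.subject = c.subject
      AND a.object = 'ex:Db' AND b.object = 'MPhil' AND c.object = 'ex:Sam'
""").fetchall()
print(rows)  # [('ex:May', 'MPhil', 'ex:Sam')]
```

The join count equals the number of distinct predicates in the query minus one, which is the cost the property tables approach of the next section tries to avoid.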

9.4 Property Tables

The Jena project proposed the property tables approach [54, 41], while Oracle introduced a similar scheme called the subject-property matrix [42]. This approach strikes a balance between the horizontal and vertical approaches. Correlated predicates are grouped into a property table, and each subject with any of the predicates of this property table has one row in the table. As shown in Figure 9.5, the 7 predicates are clustered into 3 property tables: student, staff, and course.2 In the course table, all the resources ex:Db, ex:Net, and ex:Web have both the ex:teacher and ex:code predicates. However, in the student table, 2 out of 3 subjects do not have all the predicates maintained by the table. Also note that the subject ex:Tom appears in both the student and staff tables. Clearly, the proportion of null cells (only 3 nulls out of 21 predicate cells) is much smaller than in a horizontal database. Like a horizontal database, a property table may store multiple values in a single cell. In contrast to a vertical database, a property tables database does not require any table-join when a query only involves predicates within the same table. For example, the SPARQL query in Listing 8.4 is translated into the SQL statement in Listing 9.2, which needs no table-join. However, if a query involves predicates from different property

Footnote 2: Jena actually uses a resource table and a literal table to reference the subjects and objects by system IDs in property tables.


student table
subject | degree  | enrolls       | supervisor
ex:Tom  | "PhD"   |               | ex:Sam
ex:May  | "MPhil" | ex:Db         | ex:Sam
ex:Roy  | "BSc"   | ex:Db, ex:Web |

staff table
subject | title       | interest
ex:Tom  | "Lecturer"  |
ex:Sam  | "Professor" | "Database", "Datamining"
ex:Kat  | "Professor" | "Security"

course table
subject | teacher | code
ex:Db   | ex:Sam  | "C123"
ex:Net  | ex:Sam  | "C246"
ex:Web  | ex:Tom  | "C135"

Figure 9.5: Property tables database

tables, table-joins cannot be avoided. For example, the SPARQL query in Listing 8.6 is translated into the SQL statement shown in Listing 9.3. Note that the operator “