Ontology-based Relational Databases

Paea LePendu
Towards Partial Completion of the Comprehensive Area Exam
Department of Computer and Information Science
University of Oregon

Committee: Dr. Dejing Dou, Chair; Dr. Zena M. Ariola; Dr. Christopher Wilson

Fall 2007


Abstract

There remain important gaps, both structurally and semantically, between the conceptual design of a database and its implementation in relational database management systems (RDBMS). To fully realize the idea of data independence that has made databases so popular for application programmers, we should try to bridge this gap. This study explores the use of ontologies, which have become a popular conceptual modeling paradigm for the Semantic Web as well as for scientific domains, together with contemporary RDBMSs to investigate possible solutions for this problem and highlights several difficult hurdles to overcome. The main challenge will be to align the expressiveness of existing ontology languages with the capabilities of RDBMSs. Compared with one of the current approaches, the new methodology proposed here appears no different on the surface. However, a careful look at the underlying logic systems shows that these methodologies result in fundamentally different model-theoretic interpretations if we allow deletions from the database (which is very common in practice). To allow for this functionality, we are required to reexamine the logical foundation of the ontology language. This process of refining theory and practice is reminiscent of decades of work in the area of logic and databases culminating in various knowledge base systems. To distinguish this work, the goal is to strictly limit the investigation from the top using ontologies and from the bottom using an RDBMS, and to discover how closely we can bridge the gap without relying on the traditional inference engine.


1 Introduction

This study explores the use of ontologies as a high-level language for relational database modeling, an idea which blossomed from the author's previous work on ontology-based data integration [34, 35], presented as part of the directed research project (DRP) report. To understand the motivation for this work, it will be helpful to first summarize the general idea for data integration, understand how modeling and databases are relevant to this process, and finally see how this idea stands alone despite applications in integration scenarios. An ontology specifies the vocabulary of concepts and the relationships among them for some domain. Because their semantics are typically grounded upon some variant of first-order logic, ontologies have played a key role for describing information in traditional knowledge engineering, the emerging Semantic Web, and most recently in scientific domains. For integration, the idea is that by using namespaces, we can safely union concepts between two (or more) ontologies using what are called bridging axioms to create a merged ontology. Then, depending on the soundness and completeness of the inference system, we can use an inference engine to "correctly" and/or "completely" translate data and queries between the two ontologies. Figure 1 depicts the general idea for merging two family genealogy ontologies.


Figure 1: Two rough representations of genealogy ontologies (DRC-ged [5] and BBN-ged [3]) with a difference in the way families are defined. One could argue that BBN-ged is a more flexible definition since families can include same-sex partners.

General rules, such as the assumption made in DRC-ged families, "for each DRC-ged family, there is at most one husband and wife," are usually encoded in each individual ontology using some first-order logic-based language. Likewise, when ontologies are merged, we include bridging axioms as new first-order rules in the merged ontology with fully qualified namespaces, such as the spouse-marriage correspondence from Figure 1 ("husbands and wives in DRC are spouses in BBN"):


∀x, y, z. DRC-ged:Family(z) ∧ Male(x) ∧ Female(y) ∧ husband(z, x) ∧ wife(z, y)
    ⇒ BBN-ged:Family(z) ∧ spouseIn(x, z) ∧ spouseIn(y, z)                        (1)

Ontology-based integration therefore reduces to making inferences (“translating”) across bridging axioms. We borrowed ideas and tools from ontology merging and translation work [36] to perform ontology-based database integration by making a simple observation: database schemas are like simple ontologies. It turns out to be a relatively trivial task to “lift” a schema definition, which is mainly structural, into an ontology by following some simple rules of thumb:

Relation → Type (or unary predicate, or Class)
Attribute → Binary Predicate                                                      (2)
Primary Key → Object Reference (instance identifier)
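To make the rules of thumb in (2) concrete, consider a hypothetical Customer table (the table and its columns are illustrative, not drawn from the original databases); a minimal sketch of the lifting:

    -- Hypothetical schema to be "lifted" following the rules of thumb in (2):
    CREATE TABLE Customer (
        customerId VARCHAR PRIMARY KEY,  -- primary key lifts to an object reference
        name       VARCHAR               -- attribute lifts to a binary predicate name(x, y)
    );
    -- The relation itself lifts to a unary predicate (class) Customer(x), so the
    -- row ('c101', 'Al') yields the facts Customer(c101) and name(c101, 'Al').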

Defined in this manner, the "ontology-schema" for one database can be merged with another as described before, so with the help of a wrapper between the ontology query language and SQL, database integration also reduces to ontology translation. Because the lifting process is so trivial (mainly structural), the SQL query wrapper simply mimics the same rules of thumb (in reverse) to access the SQL data whenever necessary. Figure 2 gives a rough overview of how schema correspondences can be represented as ontology mappings, and how the merged ontology is incorporated together with the inference engine in the database integration usage scenario. The key ingredient for reducing database integration to ontology translation was the notion of "lifting" a single schema into a single ontology, which turned out to be simple. But what if we flipped it upside-down? That is to say, starting with an ontology, can we automatically build a database to store its facts? This problem, it turns out, is not so simple and forms the motivation for this new line of research. As will be described shortly, this offers a new paradigm for database modeling, where the conceptual model serves as the primary query interface for application programmers, as opposed to the current standard, which relies on querying at an intermediate level (the relational model) that creates a mental gap the application programmer must bridge himself or herself. To maintain a sense of focus, we limit the scope of investigation from the top using ontologies and from the bottom using relational databases. Still, this area of research requires a broad knowledge of several areas including, but not limited to, relational models, database normalization, knowledge representation, automated reasoning, logic, conceptual modeling, and query optimization. However, this is not structured as a survey paper. Instead, this paper will introduce a new methodology for solving this problem. In the process, background knowledge from relevant fields will be injected with sufficient (but not necessarily exhaustive) detail so that key issues for this problem and the proposed methodology are exposed and unraveled.


Figure 2: 2(a) Schema (table) correspondences for the sample databases Stores7 from Informix and Nwind (Northwind) from Microsoft. 2(b) The merged ontology representation. 2(c) A data integration scenario in which the user [1] issues a query using the source semantics of Nwind which [2] gets translated into the target semantics of Stores7 which is then [3] issued as a SQL query using a special SQL wrapper from which [4] target data is returned and finally [5] translated back into the source semantics for the user to interpret.


At this stage, the goal is less about finding an optimal solution than about gaining a deeper understanding of the problem, so several open questions will be raised. To these ends, the paper is organized as follows. First, some motivation is presented in the form of a problem in Section 2, followed by a brief overview of relational databases and ontologies in Section 3. Section 4 details a new methodology for attacking the motivating problem. This approach is compared with related works in Section 5. Some of the more challenging issues are discussed in detail in Section 6. Preliminary case studies using the new methodology are highlighted in Section 7. Finally, potential areas of future work (Section 8), a list of publication venues in this area (Section 9), and concluding remarks (Section 10) bring the paper to a close.


2 A Motivating Problem

An ontology, similar to a knowledge base, consists of a set of general statements (rules, axioms or formulae) such as, "All Sisters are Siblings," and specific facts (ground literals) such as, "Mary and Jane are Sisters." These are sometimes referred to as intensional (inferred) versus extensional (explicit) knowledge. Unlike knowledge bases, relational databases store and retrieve extensional data well, but they generally do not perform inference (actually, this is a mis-characterization we will elaborate on shortly). For example, given all that we have been told so far about Mary and Jane, a basic knowledge base is expected to answer the query, "Which individuals are Siblings?" by responding with "Mary and Jane are Siblings." A basic relational database system, in contrast, would either answer, "There are no known Siblings," or flatly, "There are no Siblings," neither of which is truly the case. Clearly, intensional knowledge reduces the amount of extensional data storage required. We do not need to explicitly store the fact, "Mary and Jane are Siblings," to know that it is true. But what if we store it anyway? To put it another way, should we perform inferences at ASKing-time or generate the ground literals at TELLing-time? (The well-known ASK-TELL interface for knowledge bases was proposed by Hector Levesque [56, 57, 58] in the 1980s.) Obviously, we are talking about a trade-off between time and space for query answering: the more extensional data we store, the less time it should take to answer queries about them because inference is a time-consuming process. If we trade space (liberally) for time, can we do (significantly) better?
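To preview the trade-off in database terms, here is a minimal SQL sketch (it assumes the table-per-predicate layout introduced later in Section 4):

    -- Only the SisterOf fact is stored; a plain lookup finds no Siblings.
    SELECT subject, object FROM SiblingOf;                -- returns the empty set
    -- Materializing the entailed fact at TELL-time trades space for time:
    INSERT INTO SiblingOf VALUES ('maryDoe', 'janeDoe');
    SELECT subject, object FROM SiblingOf;                -- now answers as a knowledge base would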


3 Background

Before attacking the problem from an ontology and relational database point of view, a brief review of relational databases is presented, followed by an overview of ontologies.

3.1 Relational Databases

In 1970 E.F. Codd developed the relational model [28] to address the problem of data independence, the idea of separating the logical view of data from its underlying physical implementation. This allowed application programmers to store and retrieve data in a declarative rather than procedural manner, giving front-end applications a high degree of adaptability as underlying disk storage mechanisms and access paths were optimized, re-organized or otherwise changed. Before this, the network and hierarchical database models (which will not be covered here) required fine-grained manipulation of low-level data structures (such as trees and lists). The following definitions comprise the building blocks of the relational model:

Definition 1 (Domain) A domain D is a (possibly infinite) set of values, commonly referred to as a data type.

Definition 2 (Relation) A relation R is a finite subset of ordered tuples from the Cartesian product of a finite list of domains: R ⊆ D1 × D2 × … × Dk (or R ⊆ {⟨x1, x2, …, xk⟩ | x1 ∈ D1, x2 ∈ D2, …, xk ∈ Dk}). We say that R has arity k.

Definition 3 (Attribute) For a relation R of arity k, each element xi (1 ≤ i ≤ k) of some tuple t ∈ R can be referenced either by the ordinal value (xi = t[i]), or by some predefined string si called an attribute (xi = t[i] = t[si]). Because elements can be referenced by attribute name in this way, a relation is often called a table.

Definition 4 (Schema) A schema for relation R of arity k is a list of unique attribute names s1, s2, …, sk together with the relation R, which is often written: R(s1, s2, …, sk).

Definition 5 (Relational Database) A relational database DB is a finite set of relations R1, R2, …, Rn. The schemas for R1, R2, …, Rn comprise the database schema for DB.

The idea to use the relational model to address data independence did not really take hold until a decade later, in the 1980s, when IBM undertook an important project called System R [25], the first

influential relational database management system. System R demonstrated that an RDBMS can effectively compete with an experienced programmer, automatically choosing algorithms and data structures to store and retrieve data efficiently at a low (disk and main memory) level. This was the birth of the ubiquitous SQL language covered in several well-known texts [17, 38, 77, 79, 80] which has resulted in many commercial RDBMSs such as Oracle [10], Informix [7], DB2 [4], SQL Server [14], Sybase [15], MySQL [9], and PostgreSQL [12]. E.F. Codd won the ACM Turing award for his important contribution. What makes modern RDBMSs most successful are the features that optimize the physical storage and retrieval of tuples on disk, such as partitions and query optimizers, as well as features that maintain data integrity, such as constraints and triggers. To understand why query optimization is so important, for example, consider the following relational database schema:

Person(personId, name, birthDate)
Address(addressId, street, city, zip)                                             (3)
LivesAt(personId, addressId)

Suppose we want to know the name and birthDate of all people living in the city called “Eugene.” In SQL, we write the query declaratively:

SELECT Person.name, Person.birthDate                                              (4)
FROM Person, LivesAt, Address                                                     (5)
WHERE Person.personId = LivesAt.personId                                          (6)
AND Address.addressId = LivesAt.addressId                                         (7)
AND Address.city = 'Eugene'                                                       (8)

The main idea is that to obtain the data we need, we must cross-reference the data in the three tables. This is referred to as a JOIN (⋈) operation (a cross product constrained by matching values) [lines 5-7]. Also, we want to narrow the results based upon some specific criteria, which is called a SELECT (σ) operation [line 8] (not to be confused with the "SELECT" keyword from SQL – an unfortunate name clash). Finally, we only care to see a portion of the data returned, which is a PROJECT (Π) operation [line 4]. These operations, SELECT-PROJECT-JOIN, form the core of the relational algebra query language defined by Codd, which is more procedural in nature (specifying how to access the data). For example, in the relational algebra, we can ask the same query at least four different ways:


Π_name,birthDate(σ_city="Eugene"(Person ⋈ (LivesAt ⋈ Address)))                   (9)
Π_name,birthDate(σ_city="Eugene"((Person ⋈ LivesAt) ⋈ Address))                   (10)
Π_name,birthDate(Person ⋈ (LivesAt ⋈ σ_city="Eugene"(Address)))                   (11)
Π_name,birthDate((Person ⋈ LivesAt) ⋈ σ_city="Eugene"(Address))                   (12)

Although all these queries are equivalent (they return the same answer), their performance can be orders of magnitude different depending on the distribution of data, the block size allocated on disk, or the presence of indexes on an attribute. A good RDBMS maintains a catalog of information about the database itself (e.g., the size of relations, max and min values, indexes, etc.) to decide which relational algebra expression will cost the least amount of disk access. Of all the features of commercial RDBMSs, a good query optimizer is what you pay for. Integrity maintenance will be mentioned later in Section 4. Some years after Codd's paper, in 1976, Peter Chen introduced a yet higher-level representation called the entity-relationship model (ER Model) [26] as a design tool bridging the user's conceptual model of his or her data with the relational model. (Hull and King [51] provide a nice survey of several semantic data models, which includes the ER Model.) Figure 3 illustrates the idea of separating the conceptual, logical and physical levels of data management.


Figure 3: Separating the conceptual, logical and physical levels of data management. Note: the semantic model shown here uses Hull and King’s Generic Semantic Model (GSM) [51], not the more popular ER Model by Chen [26].

The bridge between the logical and physical model has proven exceedingly successful for automation, but there remain significant gaps between the conceptual and logical levels. This problem goes back to the original motivating question for this study: starting with an ontology, can we automatically

build a database to store its facts? Unlike in the "lifting" process, here, structure alone causes difficulty. To illustrate this, consider the simple Person-Address model represented by Hull and King in Figure 4 using the Generic Semantic Model (GSM) [51].


Figure 4: A Generic Semantic Model (GSM) for Person-Address.

In this model, there are four main pieces of data: PNAME, STREET, CITY, and ZIP. Roughly speaking, PNAME is a person's name and STREET, CITY and ZIP comprise an address. The solid, labeled arrows represent relationships between concepts. A single-headed arrow represents a single-valued relationship, so that we interpret LIVES-AT to mean that a person lives at only one address, whereas an address can be a residence for several people, indicated by a multivalued or double-headed arrow. Finally, the annotation for HAS-NAME means that every person has a unique PNAME. Once we know the rules of the game, the GSM, much like Chen's ER Model, intuitively conveys meaning about a collection of data. At a conceptual level, if we want to know the names of all the people who share the same address, we might compose the following query in a LISP-like syntax (13):

(LIVES-AT ?person1 ?address) and
(HAS-NAME ?person1 ?name1) and
(LIVES-AT ?person2 ?address) and                                                  (13)
(HAS-NAME ?person2 ?name2) and
(notEquals ?person1 ?person2)

A relational database is supposed to capture the meaning conveyed by the conceptual model, but we can implement the Person-Address GSM in several reasonable ways, as shown in Figure 5. Notice how the abstract concept of ADDRESS is nowhere to be found in Figure 5(a). Luckily for us (humans), the names of the attributes ("STREET," "CITY," and "ZIP") provide some clue that we are still talking about an address. But we should not underestimate the cognitive distance a programmer must traverse between the conceptual model and the tables in Figure 5(a). In real-world situations, with hundreds or thousands of concepts in the semantic model, imagine trying to determine where a particular concept might have disappeared to in a vast sea of tables and attributes that might not be so conveniently named. To ask the same query on Figure 5(a) as in (13) requires the following SQL:


PERSON
PNAME          STREET        CITY       ZIP
Phyllis York   123 Abc St.   Portland   97206
Hentry York    123 Abc St.   Portland   97206
Alice Cooper   456 Xyz St.   Eugene     97403

(a) In third normal form (3NF).

PERSON                       ADDRESS
PNAME          LIVES_AT      ADDR_ID   STREET        CITY       ZIP
Phyllis York   addr_1        addr_1    123 Abc St.   Portland   97206
Hentry York    addr_1        addr_2    456 Xyz St.   Eugene     97403
Alice Cooper   addr_2

(b) Another likely implementation.

Figure 5: Two possible relational databases for the Person-Address conceptual model.

SELECT person1.PNAME, person2.PNAME
FROM Person person1, Person person2
WHERE person1.PNAME != person2.PNAME                                              (14)
AND person1.STREET = person2.STREET
AND person1.CITY = person2.CITY
AND person1.ZIP = person2.ZIP

which differs from the corresponding query on Figure 5(b):

SELECT person1.PNAME, person2.PNAME
FROM Person person1, Person person2                                               (15)
WHERE person1.PNAME != person2.PNAME
AND person1.LIVES_AT = person2.LIVES_AT

Aside from being semantically (structurally) different, the two relational models in Figure 5 will result in significantly different query performance on most RDBMSs. Are structural aesthetics less important than query performance? This goes right back to the problem in 1970: data independence has not been realized at the conceptual level (of Peter Chen) like it has for the logical level (of E.F. Codd). It is quite possible that these design decisions have been left to database experts because they have been too difficult to automate reasonably so far. So the question, "Starting with an ontology, can we automatically build a database to store its facts?," becomes, "If key logical design decisions are left to an expert, is it still realistic to hope for a system that achieves data independence at the conceptual level?"


3.2 Ontologies

The word ontology comes from philosophy, where it concerns the nature of being and existence (its counterpart, epistemology, concerns the nature of knowledge). An ontology defines the things that are believed to exist in the world by providing a vocabulary with which to describe them. Although a single definition is still under debate in the information sciences, Gruber [49] and Staab and Studer [78] give a (popular) definition of an ontology as, "A formal explicit specification of a conceptualization for a domain of interest." Brachman and Levesque [21] describe an ontology as a definition of the objects, properties of those objects and relationships among them for a domain of discourse (in the context of agent-based systems). For our purposes, an ontology is a machine-processable language that consists of the following basic constructs from first-order logic:

• Class: a named set of objects, such as Person(x). Also called a type.

• Property: a binary (or possibly n-ary) relationship between concepts, such as MotherOf(x, y). Also called a predicate.

Additionally, more advanced relationships between concepts can be represented using axioms:

• Axiom: any quantified (universally or existentially) first-order formula, such as, "Everybody loves a good book," or ∀x, y. GoodBook(y) → Loves(x, y).

Of course, the more kinds of axioms we allow, the more difficult it becomes to determine entailments (via inference). It is well known, thanks to Turing and Church, that full first-order logic is not decidable [45, 76] because function-generating terms can create an infinite universe – an algorithm that determines whether a sentence is not entailed may not terminate. But, due to the famous theorem by Jacques Herbrand, there does exist a terminating algorithm to determine if a sentence is entailed [50] (a breadth-first search based on the number of nested calls for function-generated terms will terminate in a propositionalized proof [76]). Although not the focus of this study, the complexity of reasoning based on various combinations of features of first-order logic comprises an important area of research for ontology-based systems, most notably in description logic (DL) [18], which focuses on decidable variants of first-order logic for terminological specifications (ontologies). Furthermore, research on logic and databases has resulted in mature work on deductive databases such as Datalog [63, 70, 79, 80]. The kinds of axioms allowed in an ontology language are restricted by specifying the language's features. For example, it is common to allow sub-class, sub-property, and range restrictions on properties:


• Sub-class: one class is wholly contained by another, such as, ∀x. Man(x) → Person(x) or, in DL syntax, Man ⊑ Person.

• Sub-property: one property is wholly contained by another, such as, ∀x, y. SisterOf(x, y) → SiblingOf(x, y) or, in DL, SisterOf ⊑ SiblingOf.

• Range Restriction: constrains the allowed values for a property, such as, ∀x, y. SisterOf(x, y) → Person(y) or, in DL, ∀SisterOf.Person.

Also, we can specify constraints on the number of instances for properties and classes:

• Existential Quantification: a property has values, ∀x, ∃y. hasMother(x, y) or, in DL, ∃hasMother.⊤.

• Cardinality Constraint: a property has at most (or at least) n values, such as, ∀x, y1, y2. MotherOf(x, y1) ∧ MotherOf(x, y2) → y1 = y2, or, in DL, ≤ 1 MotherOf.

The concepts and features mentioned so far constitute the intensional knowledge defined by the ontology, but we have said nothing about the base case, which is the actual data instances. The data, or extensional knowledge, consist of objects and facts (we do not consider term-generating functions in this study):

• Object: a named instance or constant, such as janeDoe.

• Fact: a ground predicate specifying either class or property membership, such as, Female(janeDoe) or MotherOf(janeDoe, bethDoe).

Although we will not limit this presentation to a particular syntax, we should mention that the Web Ontology Language (OWL) [11] is the new standard for defining Semantic Web ontologies which is based on description logic and uses the Resource Description Framework (RDF) [13] syntax. For the author’s DRP (and implementation of the methodology described below), the Web-PDDL [62] language has been used. Because OWL does not yet have a standard query or rule language, and because Web-PDDL is highly specialized for ontology translation, the standard first-order logic syntax will be used to convey the main ideas. That said, the features of first-order logic that we will explore in the following sections mainly come from OWL and Web-PDDL.


4 Methodology

Many of the first-order features common to ontologies, as mentioned in Section 3.2, are also reflected in relational database management systems. The following sections describe a methodology for tackling the problem outlined in Section 2, which is to automatically create a relational database that stores and retrieves data for a given ontology by exploiting these overlapping features. Therefore, the proposed system takes an ontology as input and generates the SQL relational database schema definition. The resulting database, which we call an ontology-database, will store extensional data in a reasonable manner as well as answer positive extensional queries (queries about the data as opposed to the model itself) "correctly." A discussion of correctness is postponed until later, but for now we expect the ontology-database to take intensional knowledge into consideration so that it can answer the example question, "Which individuals are Siblings?," the way a knowledge base would: "Mary and Jane are Siblings."

4.1 Basic Constructs

The first things to consider are the basic constructs of an ontology: classes and properties. In the relational database, it comes down to a question of structure: what are the relations? There are several choices, as summarized nicely by Pan and Heflin [68], which include most notably the horizontal and vertical approaches. (Another good reference for possible structural choices, from an XML point of view, is by Gali et al. [44].) These are worth explaining briefly before presenting our approach.

4.1.1 Horizontal Table

In the horizontal table approach, there is a single (universal) relation (see Table 1):

Universe(objectId, type, property_1, property_2, …, property_n)                   (16)
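For the running Sisters-Siblings example, the universal relation might be declared as in the following sketch (the column set is dictated by the ontology's properties, so this is illustrative only):

    CREATE TABLE Universe (
        objectId  VARCHAR PRIMARY KEY,
        type      VARCHAR,
        SisterOf  VARCHAR,   -- one column per property...
        SiblingOf VARCHAR    -- ...mostly null for any given row
    );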

Issuing an extensional query on such a structure is straightforward, requiring only minor syntactic query manipulation at run time. The main drawback, however, is that the table can be very large (both wide and tall) and will undoubtedly be quite sparse because many of the property values will be null (most objects participate in only one or a few properties). Any changes in the ontology (the number of properties, for example) will require a complete database restructuring, which can be intolerable for large datasets. Furthermore, there is no trivial method for handling n-ary predicates (the structure strictly assumes binary predicates).


Object    Type     SisterOf   SiblingOf
maryDoe   Female   janeDoe    janeDoe
johnDoe   Male     null       maryDoe

Table 1: A horizontal table implementation for the Sisters-Sibling example.

Predicate   Subject   Object
Type        janeDoe   Female
Type        maryDoe   Female
SisterOf    maryDoe   janeDoe
SiblingOf   maryDoe   janeDoe

Table 2: A vertical table implementation for the Sisters-Sibling example.

4.1.2 Vertical Table

The vertical table approach also has a single relation (see Table 2):

Universe(predicate, subject, object)                                              (17)

The advantage over a horizontal table is that sparsity is eliminated – there are no null values. This is the implementation of the popular Semantic Web RDF framework Jena [8]. The disadvantage is that types and properties are slightly confounded in the implementation, so that type-membership queries cannot be easily distinguished from queries on properties, making them unnecessarily expensive. An alternative is to separate type membership into its own table, and possibly each property into its own table. Although still somewhat expensive, class-membership queries become more efficient as the size of each table decreases when they are separated out.
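To see the confounding concretely, both kinds of query below must scan the same large relation (a sketch against the single vertical table in (17)):

    -- class-membership query: who is Female?
    SELECT subject FROM Universe WHERE predicate = 'Type' AND object = 'Female';
    -- property query: whose sister is maryDoe?
    SELECT object FROM Universe WHERE predicate = 'SisterOf' AND subject = 'maryDoe';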

4.1.3 Predicate Normal Form

We propose a different approach, which turns out to be almost identical to the hybrid approach of Pan and Heflin [68], but we call it predicate normal form. The idea is to structure the relations based directly upon the basic logical structures (classes and properties) in the ontology. It is called predicate normal form because each predicate becomes its own relation: unary predicates (classes) become single-column tables, and binary predicates (properties) become two-column tables (subject and object, the first and second arguments, respectively). (Although n-ary predicates are not elaborated here, the idea is easily extended.) The name of each table is the name of the class or property, so that query answering is a simple lookup process requiring no manipulation at runtime. The resulting structure coheres closely to the set-theoretic foundation for interpreting both ontologies and databases: each basic construct is interpreted as a named set and data are elements in the set. So every fact specifying class or property membership is stored as a row of data in the appropriately named table. As an example, see Figure 6.


Female     SisterOf             SiblingOf
ID         Subject   Object     Subject   Object
janeDoe    maryDoe   janeDoe    maryDoe   janeDoe
maryDoe

Figure 6: A predicate normal form implementation for the Sisters-Sibling example.
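In SQL, the predicate normal form schema for Figure 6 could be generated as in this minimal sketch (constraints are deferred to Section 4.2):

    CREATE TABLE Female    (id VARCHAR NOT NULL);      -- unary predicate (class)
    CREATE TABLE SisterOf  (subject VARCHAR NOT NULL,  -- binary predicate (property)
                            object  VARCHAR NOT NULL);
    CREATE TABLE SiblingOf (subject VARCHAR NOT NULL,
                            object  VARCHAR NOT NULL);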

4.2 Axiomatic Features

While class and property features in an ontology define the basic structure of data, the axioms define relationships among the data. In this section, we cover domain and range restrictions, sub-class and sub-property axioms, and cardinality constraints. Some of these features map nicely to the features of relational databases, such as: foreign-key constraints, view definitions, and triggers.

4.2.1 Domain and Range Restrictions

The domain and range features for properties serve to restrict, respectively, the set of individuals to which a property applies and the values it can take. For example, the Sisters property can be restricted to a domain and range of Person in its definition: SisterOf(Person(x), Person(y)).¹ Obviously, SisterOf(Person(x), Person(y)) is true only if (among other things) x and y belong to the class Person. Therefore, because the predicate SisterOf(maryDoe, janeDoe) is true, we are assured that maryDoe and janeDoe both belong to the Person class. The corresponding first-order formulae are:

∀x, y. SisterOf(x, y) → Person(x)                                                 (18)
∀x, y. SisterOf(x, y) → Person(y)                                                 (19)

Given the structural choices made for the database implementation, foreign keys (with the cascading delete option²) enforce these axioms. For every domain restriction on a property, we define a foreign key from the property-relation's subject attribute to the class-relation's id attribute, and similarly for every range restriction. The following SQL definition would describe the domain and range restrictions for the SisterOf property (see Figure 7 for a visual representation):

¹ Technically, SisterOf(Person(x), Person(y)) is improper notation since we do not mean things like SisterOf(true, true). Throughout this paper, we use this relaxed notation for brevity to mean: Person(x) ∧ Person(y) ∧ SisterOf(x, y).
² Section 6.6 explains why the cascading delete option is necessary.



Figure 7: Domain and range restrictions are implemented as foreign-key (f-key) constraints.

CREATE TABLE SisterOf (
    subject VARCHAR NOT NULL,
    object VARCHAR NOT NULL,
    CONSTRAINT fk-Sisters-Subject-Person FOREIGN KEY (subject)
        REFERENCES Person(id) ON DELETE CASCADE,                                  (20)
    CONSTRAINT fk-Sisters-Object-Person FOREIGN KEY (object)
        REFERENCES Person(id) ON DELETE CASCADE                                   (21)
)

Sometimes, we like to distinguish between what are called object-properties and datatype-properties. An object-property restricts a property’s object values to a finite set of instances of a class, whereas a datatype-property refers to a basic datatype such as a string or number (possibly infinite sets of values we can handle in the usual, well-defined ways). In the case of a datatype-property, we do not add a foreign-key, but instead alter the basic datatype of the object field directly.
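For instance, a hypothetical datatype-property hasAge (not part of the running example) might be declared as follows, with the object column given a basic SQL type instead of a foreign key:

    CREATE TABLE hasAge (
        subject VARCHAR NOT NULL,
        object INTEGER NOT NULL,   -- datatype-property: a basic type, no foreign key
        CONSTRAINT fk-hasAge-Subject-Person FOREIGN KEY (subject)
            REFERENCES Person(id) ON DELETE CASCADE
    )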

4.2.2 Sub-class and Sub-property Axioms

The sub-class and sub-property axioms are called inclusion axioms in description logic because they specify that all instances of one class (resp. property) are included in (a subset of) another [19, 66]. Obviously, Female is a subClassOf Person. We can use this knowledge to express the SisterOf property more precisely: SisterOf(Female(x), Person(y)). Furthermore, "All Sisters are Siblings" can be encoded as a subPropertyOf statement. So the first-order intensional representations would be:

∀x, y. SisterOf(x, y) → Female(x)                                                 (22)
∀x, y. SisterOf(x, y) → Person(y)                                                 (23)
∀x. Female(x) → Person(x)                                                         (24)
∀x, y. SisterOf(x, y) → SiblingOf(x, y)                                           (25)

The sub-class feature clearly adds to the expressiveness of the ontology because the previous rules, 18 and 19, can be inferred automatically from 22-24. An inclusion axiom is fundamentally different than a range restriction. We will express this difference more precisely in Section 6, but the main idea is that domain and range restrictions are epistemic in nature (they regard the state of knowledge), whereas an inclusion axiom is "knowledge generating" (it advances the state of knowledge) [19, 64, 75]. For example, if someone asserts that SisterOf(janeDoe, buddyTheFrog), a range restriction allows us to reject the assertion because we know buddyTheFrog is not a Person (it is a Frog). The range restriction axiom does not mean that we should infer that buddyTheFrog is a Person. In other words, what we really mean by the range restriction is the following specific rule:

∀x, y. ¬Person(y) → ¬SisterOf(x, y)                                               (26)

In classical logic, rules 23 and 26 are equivalent. However, rule 23 is not precisely what we mean. On the other hand, for an inclusion axiom, the intention is to propagate knowledge in a forward manner. That is, if we assert SisterOf(maryDoe, janeDoe), we do intend that SiblingOf(maryDoe, janeDoe) is also true. At the same time, if SiblingOf(maryDoe, janeDoe) is not true, then neither can SisterOf(maryDoe, janeDoe) be the case. In other words, in this situation we also intend the contrapositive:

∀x, y. SisterOf(x, y) → SiblingOf(x, y)                                           (27)
∀x, y. ¬SiblingOf(x, y) → ¬SisterOf(x, y)                                         (28)

This refinement is one indication that we should be careful to distinguish between the classical logic assumptions made by ontology-based systems and the more constructive logic foundation of data-based systems. For example, description logic reasoners automatically assume the law of the excluded middle (A ∨ ¬A) [19]. On the other hand, Horn-logic systems, such as Prolog and Datalog, do not require full proof by contradiction (reductio ad absurdum [RAA]) for sound and complete inference [29, 46, 70, 76], but they do rely on domain closure axioms to guarantee termination in the presence of negations [27, 71], which makes this notion of negation as failure similar to database systems under the closed world assumption [72]. (More on this in Section 6.) Just like a restriction, therefore, rule 28 is implemented as a foreign-key constraint (with cascading delete). Rules 27 and 24, on the other hand, are constructive (knowledge generating), so we use database triggers ("assertion triggers") to implement them (triggers 30 and 29, respectively):



Figure 8: Triggers propagate assertions in a forward reasoning manner.

CREATE TRIGGER subClassOf-Female-Person
    SUCH THAT UPON DETECTING EVENT INSERT [x] INTO Female,
    FIRST EXECUTE INSERT [x] INTO Person                                          (29)

CREATE TRIGGER subPropertyOf-SisterOf-SiblingOf
    SUCH THAT UPON DETECTING EVENT INSERT [x,y] INTO SisterOf,
    FIRST EXECUTE INSERT [x,y] INTO SiblingOf                                     (30)

Triggers in relational databases follow the event-driven model familiar in user-interface programming. Listeners are registered with the RDBMS such that whenever the specified event is detected, the specified sequence of instructions is performed. Triggers can cascade, or fire off other triggers, which is an important feature we aim to exploit, since the order of rule application is not important. Furthermore, if cycles are not allowed in our axioms, the procedure is guaranteed to terminate [19]. Databases that respond dynamically to events like this form an area of research called active database systems [69]. We use triggers as a tool for enforcing the semantics defined by the ontology. The trigger and foreign-key constraint actually form a cyclic dependency (not to be confused with a cycle in axioms). The foreign-key check should not happen until after the data is propagated by the trigger. The FIRST keyword in the trigger definition ensures the correct ordering. (Alternatively, we can specify that the check and trigger form an atomic action – which will delay the check until the transaction is completed. The specific implementation will depend on the particular RDBMS used.) To be clear, we have described a mechanism by which inclusion axioms (intensional rules) automatically generate extensional facts. Of course, we necessarily explode the amount of data stored by eagerly propagating copies upwards in the sub-class (sub-property) hierarchy (this is called the subsumption hierarchy in description logic [19, 66]). This is the main point we want to stress about our methodology, so it is worth stating specifically what it achieves: Using triggers in this way to implement inclusion axioms, the RDBMS is guaranteed to solve our motivating problem; that is, it will correctly answer, "Mary is the SiblingOf Jane." Most importantly, it will not answer the query correctly by using inference (as a knowledge base would). Rather, it will simply look up the answer in the SiblingOf relation as usual.
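As a concrete illustration, trigger (30) could be realized in PostgreSQL roughly as follows (a sketch; syntax varies across RDBMSs, the BEFORE timing plays the role of the FIRST keyword, and duplicate handling is omitted):

    CREATE OR REPLACE FUNCTION propagate_sisterof() RETURNS trigger AS $$
    BEGIN
        -- propagate the new SisterOf fact upward into SiblingOf first,
        -- so the foreign-key check on SisterOf will find its witness
        INSERT INTO SiblingOf(subject, object) VALUES (NEW.subject, NEW.object);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER subPropertyOf_SisterOf_SiblingOf
        BEFORE INSERT ON SisterOf
        FOR EACH ROW EXECUTE PROCEDURE propagate_sisterof();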


Figure 9: Altogether, inclusion axioms are implemented as both triggers and foreign-key constraints with cascading delete.

Figure 9 illustrates how the database solves the motivating problem described in Section 2. When SisterOf(maryDoe, janeDoe) is asserted (inserted into the Sisters table), it fires the trigger that asserts SiblingOf(maryDoe, janeDoe). Note that we have assumed Female(maryDoe) and Female(janeDoe), as verified by the foreign keys. By virtue of the meaning (set-theoretic interpretation) of an inclusion axiom, we expect that every tuple appearing in the Sisters table also appears in (is a subset of) the Siblings table. The process stores the inferred knowledge explicitly, trading space for time.
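In SQL terms, the sequence of events sketched in Figure 9 amounts to the following (table names as in Section 4.2):

    -- TELL: assert the base fact; the trigger materializes the entailment first
    INSERT INTO SisterOf(subject, object) VALUES ('maryDoe', 'janeDoe');
    -- ASK: no inference needed, just an ordinary lookup
    SELECT subject, object FROM SiblingOf WHERE subject = 'maryDoe';
    --   returns (maryDoe, janeDoe)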

4.2.3 Cardinality Constraints

A cardinality constraint limits the number of instances that can participate in a class or property. For example, the rule, "People have at most one social security number (SSN)," is a cardinality constraint on the property hasSSN. Here, we only consider cardinality constraints of zero or one. Regarding social security numbers, several intensional rules capture our everyday assumptions:

∀x, y, z. hasSSN(x, z) → Person(x)                                                (31)
∀x, y, z. hasSSN(x, z) → @dataType(z, String ["ddd-dd-dddd"])                     (32)
∀x, y, z. hasSSN(x, z) ∧ hasSSN(y, z) → sameAs(x, y)                              (33)
∀x, y, z. hasSSN(z, x) ∧ hasSSN(z, y) → sameAs(x, y)                              (34)
∀x. Person(x) → ∃y. hasSSN(x, y)                                                  (35)

Rules 31 and 32 are domain and range restrictions. Rule 33 is a maximal cardinality domain-constraint stating, "An SSN is assigned to at most one Person." Rule 34 is the maximal cardinality range-constraint, "A Person gets at most one SSN." The last rule, 35, is a minimal cardinality constraint, "Every Person gets assigned at least one SSN." Relational databases support various forms of cardinality rules (of constraint zero or one) via the use of primary keys, uniqueness constraints, foreign-key constraints, and null (or not null) fields. A key is a unique identifier (i.e., a reference) for a record in a table, so maximal cardinality constraints of one (rules 33 and 34) mean that the subject and object attributes of the hasSSN property-relation each form a key. A maximal cardinality of zero means that a property has no values (the table should be empty). We can either delete the relation altogether, or we can provide a check to ensure the table remains empty. A minimal cardinality of zero adds nothing useful to the semantics as far as we can tell (there is no practical case we can think of in which to use it). A minimal cardinality of one (rule 35) is the interesting case. It requires a foreign-key (e.g., from hasSSN(subject) to Person(id)), as well as an assertion trigger using a null object-value (e.g., insert [x] into Person triggers insert [x,null] into hasSSN). The method of using a null value in the trigger enforces the epistemic interpretation suggested by Reiter [75], which we discuss further in Section 6.
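A minimal SQL sketch of rules 31 and 33-35 combined (the SSN format check of rule 32 is omitted; standard SQL UNIQUE ignores nulls, which conveniently tolerates the null placeholders inserted by the rule-35 trigger):

    CREATE TABLE hasSSN (
        subject VARCHAR NOT NULL,
        object VARCHAR,              -- null means "has an SSN, value unknown" (rule 35)
        UNIQUE (subject),            -- rule 34: a Person gets at most one SSN
        UNIQUE (object),             -- rule 33: an SSN is assigned to at most one Person
        CONSTRAINT fk-hasSSN-Subject-Person FOREIGN KEY (subject)
            REFERENCES Person(id) ON DELETE CASCADE    -- rule 31
    )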

4.3 Correctness

A formula is in Horn Normal Form (HNF) [76] if it is a disjunction with at most one positive literal, as in: ¬p1 ∨ ¬p2 ∨ … ∨ ¬pn ∨ q. We can also re-state the same HNF formulae in terms of implication (referred to as implicative normal form [INF], without disjunctions on the right-hand side, of course): p1 ∧ p2 ∧ … ∧ pn → q.

Furthermore, Generalized Modus Ponens (GMP) [76] is an inference rule based on the well-known modus ponens rule:

    p′1 ∧ p′2 ∧ … ∧ p′n        p1 ∧ p2 ∧ … ∧ pn → q
    ------------------------------------------------ GMP
                      SUBST(θ, q)

GMP allows us to unify several antecedents simultaneously to prove a conclusion. It is well known that GMP is sound and complete for knowledge bases (a set of intensional plus extensional data) in HNF. All of the features we have discussed in Section 4 are immediately reducible to a set of INF clauses and, since there are no disjunctions on the right-hand side, to a set of HNF clauses.
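For instance, applying GMP to rule (25) with the fact from our running example, the unifier θ = {x/maryDoe, y/janeDoe} yields:

    SisterOf(maryDoe, janeDoe)        ∀x, y. SisterOf(x, y) → SiblingOf(x, y)
    -------------------------------------------------------------------------- GMP
                          SiblingOf(maryDoe, janeDoe)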

In our system, a trigger is only implemented for INF formulae, and it is only ever fired when the antecedent is satisfied by extensional data (i.e., unified), thereby proving the desired conclusion and making it an extensional fact. It is easy to see that a trigger is in fact an event-driven implementation of the forward chaining algorithm [76] commonly used in HNF knowledge bases. In other words, triggers are a very direct application of GMP rules, generating new facts from known ones in a cascading style. Clearly, the use of triggers in this way is sound and complete, just as forward chaining is sound and complete for HNF using GMP. What we mean by correctness for an ontology-database, therefore, is one that can answer any extensional query soundly and completely. That is, all found answers to the query are correct and all correct answers can be found. In order for this to be the case, the (closed-world) database must give an interpretation to every ground literal in the universe of discourse (true if it is in the database, false otherwise). Moreover, we require that the interpretation must satisfy all intensional knowledge (INF formulae) in order to be considered correct. In other words, a relational database is typically deemed a Herbrand Interpretation for a set of formulae, but we require that an ontology-database be a Herbrand Model for the set of formulae (see [17, 45, 76] for more details on the Herbrand universe, interpretation and model); that is, it must provide an interpretation for all ground terms in the universe that satisfies all formulae. Herbrand Models are useful tools for determining the unsatisfiability of a set of formulae given a finite interpretation. Clearly, whenever a foreign-key constraint in our database model is violated, we can assume the formulae are unsatisfiable.

5 Related Work

In Semantic Web research, one important open question is how to bridge the gap between ontologies and databases, since most data currently stored on the World Wide Web resides in relational databases. This was one of the ideas set forth years ago by work on the CLASSIC [20] project. Recently, Motik, Horrocks and Sattler [64] showed that integrity constraints (ICs) can be disregarded while answering positive queries against an OWL ontology, if the constraints are satisfied by the database. But it was Reiter who clarified what we really mean by ICs [75]: it requires a kind of modal logic, which corresponds directly to the development of the epistemic operator in description logics [32]. Following Reiter's suggested refinement, Motik et al. proposed an extension to description logic [64]. In DL-Lite, a tractable subset of description logic, Calvanese et al. [24] show how a conventional database management system can manage ABox assertions in a system with query reformulation at the TBox level (similar to answering queries using materialized views). Pan and Heflin [68] describe a description logic database (DLDB) that uses unions of views to implement inclusion axioms. Together, these works show that work in logic and databases goes hand in hand, with one informing the other, so that tasks might be more clearly delineated between reasoning and database systems.


The idea of trading space for time when we couple databases and reasoning mechanisms comes from seminal works by Reiter [71, 73]. Reiter proposed a balanced system that uses conventional databases for handling ground instances, and a deductive counterpart for general formulae. Since no reasoning is performed on ground terms, Reiter argues convincingly that in such a system queries can be answered efficiently while retaining correctness. As illustrated in the motivating problem (Section 2), the notion of "balance" is being questioned: What would happen if space were not an issue? Relational databases have far surpassed the use of deductive databases for most real-world systems, even though the former stems from the latter [63]. However, current research in data integration [35, 52, 54] is causing a sort of revival of logics in databases. Mappings between systems are intensional rules. Data in each system are extensional. We are left with the same problem Reiter addressed by combining conventional databases with deductive engines [71]. Indeed, our own work on ontology-based data integration [35] is really a specialized implementation of Reiter's deductive query answering system applied to a distributed, heterogeneous data store. While our previous work (and Reiter's) is concerned primarily with deductive query answering, this current work is more concerned with semantic modeling using ontologies and the structure, storage and retrieval of data described by them in a conventional database. In our own previous work on database integration, we defined a transformation method for generating an ontology (which we called a database-ontology) from schemas so that our ontology translation system can be applied to find answers across heterogeneous databases [34]. It is not unlike the work of Pérez de Laborda and Conrad [30], who use RDF queries to bring relational database schemas into the Semantic Web. But now we examine the reverse transformation: how to generate a database schema from an ontology (which we call an ontology-database, a specific kind of semantic database [51]). There are two motivations for this investigation. The first is to address the slowness of fact-translation reasoning [34], which matters directly in data exchange scenarios (related to, but different from, query answering [34, 52, 54]). The second is that ontology-based conceptual modeling is becoming more prevalent in scientific research [67, 33]. Tools automating the storage, retrieval and integration of domain-specific knowledge will allow researchers to focus efforts on information curation and analysis rather than database maintenance. The complexity of reasoning, given various features of the logic, has been studied meticulously by the description logic community [31]. The focus in this community has been on decidable subsets of first-order logic and the complexity thereof. Similarly, finite model theory [37, 40, 53] examines the complexity of logic under a finite model (such as a database). A class of languages is decidable if the finite model property holds for it; that is, if a given set of first-order sentences from a particular class of language has a model, then it has a finite model. Understanding the trade-off between the expressiveness of the ontology and the complexity of its implementation in a database will inform how we finally choose to bridge the gap between them: shall we extend the language or the database features?

6 Discussion

In the following discussion areas, we explore the connection between databases and logics with respect to the methodology proposed by examining closely various theoretical foundations. We focus primarily on description logic phenomena, the foundation on which most research in this area is based, but we do so in the context of logic in general (i.e., specific knowledge of description logic is not required). Several difficult questions are left open as future work.

6.1 Triggers versus Subsumption

In description logic, the semantics of rules ("trigger rules"³), such as C ⇒_T D, where C and D are concepts (classes or properties), are believed to differ from the semantics of inclusion axioms (often called subsumption rules), written C ⊑_S D. Trigger rules lack two important things that inclusion axioms take for granted [19]: First, a trigger rule is not equivalent to its contrapositive: C ⇒ D ⊬_T ¬D ⇒ ¬C. Second, trigger rule applications do not presume a case analysis. For example, the trigger rule requires a witness (a proof of) a ∈ C (or a ∉ C) to deduce a ∈ D in the following cases:

    C ⇒ D    C(a)                   ¬C ⇒ D    ¬C(a)
    -------------- ⇒_T E            ---------------- ⇒_T E¬
         D(a)                             D(a)

but the subsumption rule does not require such knowledge in the similar case:

    C ⊑ D    ¬C ⊑ D
    ---------------- ⊑_S E    (for an arbitrary individual a)
          D(a)

³ A "trigger rule" should not be literally confused with a "database trigger" (or "assertion trigger"). Although, clearly, one certainly has to do with the other, as shown by our implementation.

The reason for the discrepancy is that the law of the excluded middle (C ∨ ¬C) is implicitly taken as an axiom for subsumption but not for triggers, which affects ∨-elimination (∨_S E differs from ∨_T E). For ∨_S E, the case analysis comes for free from the excluded-middle axiom (instantiated at a by ∀E):

                              [C(a)](1)   [¬C(a)](1)
    ∀x. C(x) ∨ ¬C(x)              ⋮           ⋮
    ----------------- ∀E
      C(a) ∨ ¬C(a)              D(a)        D(a)
    ------------------------------------------------ ∨_S E (1)
                         D(a)

whereas for ∨_T E, an actual proof of the disjunction is expected:

         ⋮                    [C(a)](1)   [¬C(a)](1)
         ⋮                        ⋮           ⋮
      C(a) ∨ ¬C(a)              D(a)        D(a)
    ------------------------------------------------ ∨_T E (1)
                         D(a)

In the proof of D(a), ∨_T E expects a proof of C ∨ ¬C, but ∨_S E does not. In other words, trigger rules correspond to a constructive philosophy of logic, whereas subsumption is more classical in nature. An excellent discussion of Heyting semantics (proof-theoretic) versus Tarski semantics (model-theoretic) and their relationship to constructive versus classical philosophies of logic can be found in [47]. Ironically, description logic semantics are typically presented using Heyting semantics [19], yet subsumption is clearly Tarskian. The reason for the apparent confounding of these semantics is not yet clear. It may have to do with the tendency to separate what is called T-Box (intensional, Tarski) from A-Box (extensional, Heyting) reasoning in description logics. On the other hand, because of the clear set-theoretic foundation, databases are best understood under Heyting semantics. Reiter provides a nice logical formalization of database theory in [74]. Our methodology of using database triggers to implement rules corresponds precisely to what Baader and Nutt refer to as the procedural extension of a knowledge base [19]. In particular, it can be shown that the implementation is guaranteed to terminate under a finite model (a finite set of data). Furthermore, the procedural extension is independent of the order of rule applications (i.e., it is safe to use an event-driven model such as database triggers).

6.2 Modal Logic

Interestingly, we can use modal logic to account for the semantic discrepancy between triggers and subsumption. Modal logic introduces the □ ("box") operator, which essentially helps to distinguish knowledge from "knowledge about knowledge." In description logic, the epistemic operator K (for "knows") serves the same purpose [32]. For example, KC can be interpreted as, "C is known to be true." With this new operator, we can clearly distinguish between trigger rules and inclusion axioms (subsumption): C ⇒ D ≡ KC ⊑ D.⁴ Trigger rules are thought to be epistemic in nature – they tell us more about the state of knowledge than about the state of the world itself. The semantics of trigger rules correspond nicely with the capabilities of databases. Namely, since they are data-driven, databases tend to be constructive in nature (i.e., we like to have witnesses). Whenever we know that C is true of an individual, then we also know that D is true for that individual:

∀x. KC(x) ⇒ KD(x)

⁴ From here on, we discard the description logic notation ⊑ and use only K and ⇒ with quantifiers.

6.3 Integrity Constraints

The semantics of integrity constraints (foreign-keys) in databases turn out to also be epistemic in nature, requiring modal logic to make their meaning explicit. Reiter gives a very nice example regarding a constraint requiring all employees to have a social security number [75]:

∀x. KEmp(x) ⇒ K∃y. hasSSN(x, y)

This rule explicitly tells us that we know that every employee we know about has a social security number, even if we do not know what that number is – clearly hinting toward the use of null values in a database, as implemented in our methodology. Just as Donini et al. proposed the K operator to distinguish trigger rules from inclusion axioms in description logic, Motik, Horrocks and Sattler [64] used a similar notion, based on Reiter's observation, to distinguish integrity constraints from domain and range restrictions. Motik et al. make an important claim, stating that integrity constraints can be disregarded while answering extensional queries since the database can enforce them. Indeed, this is one of the important goals in bridging the gap between ontologies and databases: by understanding clearly the capabilities and interactions of the two, we can more effectively delegate data management versus reasoning tasks to the appropriate systems.
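In our implementation, this epistemic reading is exactly what the rule-35 trigger from Section 4.2.3 produces (a sketch):

    -- TELL: a new employee is asserted
    INSERT INTO Person(id) VALUES ('newEmp');
    -- the assertion trigger then records that an SSN exists, value unknown:
    INSERT INTO hasSSN(subject, object) VALUES ('newEmp', NULL);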

6.4 The Closed World Assumption

The closed world assumption (CWA) is typically made by database systems. Knowledge bases, such as ontology-based systems, on the other hand, use the open-world assumption (OWA). The basic idea of the CWA is that whatever facts are not explicitly stored in the database (the extensional data) are assumed to be false by default. In this way, we force our knowledge of the world to be complete, even if it is artificially so. Reiter showed that under the CWA query answering is guaranteed to be definite [72]. Furthermore, query answering is reducible to atomic queries with set-theoretic operations such as union, intersection and difference. Finally, for Horn logic systems (logic restricted to at most one positive literal), we can disregard negative clauses without affecting CWA query answering. Prolog and Datalog are examples of Horn-based systems. Balancing the CWA of databases with the OWA of ontologies turns out to be extremely tricky, especially when we are not restricted to Horn logic. The question is closely related to query answering in the presence of database incompleteness [39, 60, 65], a situation pertinent to data integration scenarios. We can describe the CWA idea using modal logic as well [59]:

∀x. ¬KP(x) ⇒ ¬P(x)

Of course, we need second-order logic to do it properly, but the idea is clear.
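In relational terms, the CWA is simply negation as failure: a negative fact holds exactly when the lookup fails. A minimal SQL sketch, assuming a hypothetical domain relation individual and a unary class-relation p:

    -- "Which individuals are NOT P?" is answered by failure to find a row,
    -- i.e., by the closed world assumption, not by stored negative facts.
    SELECT i.id
    FROM individual AS i
    WHERE NOT EXISTS (SELECT 1 FROM p WHERE p.id = i.id);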

6.5 Disjunction (and Negation)

Reasoning by cases (which requires proof by contradiction, or reductio ad absurdum (RAA)), a very powerful feature of description logics, requires either negation or disjunction, neither of which is a natural construct for databases. The closed world assumption causes difficulty with negation, since everything not in the database is considered negated by default. Under an open world assumption, disjunctions introduce indefinite answers to queries [71]. A significant body of research under Jack Minker [41, 42, 61] deals with disjunctions in logic programming and deductive databases.

One might characterize the ontology-database methodology presented above as a static, pre-computed knowledge system (a "materialized" procedural extension). Based upon preliminary attempts to represent disjunctions and negation in our ontology-database implementation, it seems likely that we must provide relations to store explicit negations (versus negation by default). We can continue to use triggers to help maintain consistency by transforming a disjunction into a negated implication:

C ∨ D ⊢_RAA ¬C ⇒ D

It is not yet clear how to statically store disjunctive knowledge per se. It seems likely that query answering under disjunctions will require a run-time implementation (a database query engine extension that mimics resolution). For example, suppose we are given the following knowledge:

∀x. Alien(x) ∨ Human(x)

Since we cannot determine which relation bob should go in at this point (Alien or Human), we can use a trigger:

∀x. K¬Alien(x) ⇒ KHuman(x)

If we discover later that ¬Alien(bob), the answer to the query Human(bob)? becomes readily apparent (look it up in the Human relation; the trigger guarantees it is there). However, if we want to know whether Alien(alice) ∨ Human(alice)?, we have no data either way (we cannot just look it up, since both tables are empty!). But if we first query Alien(alice) and it fails, then we can temporarily postulate ¬Alien(alice) and ask Human(alice)?, which will return true (because of the trigger), at which point we can return true and roll back our postulate (and the trigger's effects). Although this mimics a linear resolution theorem-proving method (the method of Prolog), adding such functionality to the database query processor is not the ideal solution for us (we prefer not to require any modification to current RDBMS query engines). Disjunctions may need to be left to an external inference engine.
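A sketch of how the static half of this example might look in our trigger-based setting, assuming a hypothetical explicit-negation relation not_alien alongside the ordinary class-relations (the run-time postulate-and-rollback step is deliberately not shown, since that is the part that exceeds plain RDBMS machinery):

    CREATE TABLE alien     (id VARCHAR(64) PRIMARY KEY);
    CREATE TABLE human     (id VARCHAR(64) PRIMARY KEY);
    CREATE TABLE not_alien (id VARCHAR(64) PRIMARY KEY);  -- explicit negation

    -- The RAA transformation of Alien(x) OR Human(x):
    -- asserting (not Alien)(x) materializes Human(x).
    CREATE TRIGGER not_alien_implies_human
    AFTER INSERT ON not_alien
    FOR EACH ROW
      INSERT IGNORE INTO human (id) VALUES (NEW.id);

    -- Discovering not-Alien(bob) later makes Human(bob) directly queryable:
    INSERT INTO not_alien VALUES ('bob');
    SELECT id FROM human WHERE id = 'bob';   -- returns 'bob'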

6.6 Deletions

Perhaps the most difficult aspect of truth maintenance is how to deal with deletions. The ability to delete from a database distinguishes databases remarkably from knowledge bases, which are usually strictly monotonic (knowledge always increases and never decreases). Suppose we allowed deletion as in the following two Datalog programs:

DATALOG 1:
  person(X) :- faculty(X);
  faculty(zena);
  delete(faculty(zena));

DATALOG 2:
  person(X) :- faculty(X);
  faculty(zena);
  person(zena);
  delete(faculty(zena));

We would expect the following answers to the given query:

QUERY: person(X)
DATALOG 1: []
DATALOG 2: [zena]

Since a delete is tantamount to a negation (under the CWA), we need to be careful to roll back implicit knowledge (Datalog 1) without destroying explicit knowledge (Datalog 2) if we want the usual knowledge base behavior (even though knowledge bases do not usually allow deletion). It is possible, however, that we may want to keep implicit knowledge around (consider a distributed or incomplete database where source data might come and go). This is the biggest semantic difference between our trigger-based methodology (static, materialized) and the view-based approach in DLDB [68] (dynamic, virtual). We can use modal logic to explicitly clarify the difference between the two approaches:

∀x. Student(x) ⇒ KPerson(x)

In this case, the K can be interpreted to mean "continues to be the case in the future," which adds a temporal aspect to our semantics (temporal logic is an instance of modal logic). So if Student(bob) is true at one point, Person(bob) continues to be true even after bob graduates (for example). (Note: under these semantics, the Deduction Theorem no longer holds [48].) To allow for the usual knowledge base behavior, we need to specifically account for implicitly generated knowledge versus asserted knowledge, that is, the data provenance [22, 23]. Domain and range restrictions (implemented as foreign-key integrity constraints), on the other hand, explicitly require a clean-up of data. For example, if we have as a rule:

∀x. K¬Person(x) ⇒ K¬Student(x)

then deleting Person(bob) from the database should necessarily invalidate Student(bob). The cascading delete option in the foreign-key definition enforces this rule.
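For completeness, a sketch of this clean-up rule as a cascading foreign key (hypothetical table names; MySQL/InnoDB syntax, since InnoDB is needed for foreign-key enforcement):

    CREATE TABLE person (id VARCHAR(64) PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE student (
      id VARCHAR(64) PRIMARY KEY,
      FOREIGN KEY (id) REFERENCES person(id) ON DELETE CASCADE
    ) ENGINE=InnoDB;

    -- Deleting Person(bob) necessarily invalidates Student(bob):
    DELETE FROM person WHERE id = 'bob';   -- the row in student disappears too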

6.7 Normalization

Finally, normalization (and denormalization) is a fundamental aspect of any database design process, comprising the core transformations from a conceptual design to its logical design. The main goal of normalization is to reduce redundancy in the data, which makes enforcing integrity easier. Sometimes database designers purposely denormalize relations for efficiency (trading extra redundancy and consistency checking for speed). These choices are based on domain knowledge and can be highly subjective and seemingly arbitrary. In cases where features of the ontology, such as cardinality constraints, can be used to avoid propagating unnecessary copies of records via triggers, we would prefer to exploit them, since doing so saves on both load-time and space. For example, suppose a database captures the simple rule: all Persons have exactly one SSN. The implementation is depicted in Figure 10 (A). This implementation can be "denormalized" by deleting the Person relation altogether, since all Person information is contained in the hasSSN table (see Figure 10 (B)). Of course, we have recreated the exact problem we hoped to solve by using an ontology-database methodology (recall the disappearance of the Address concept in the Person-Address problem discussed in Section 3.1). The problem concerns name resolution: if we are no longer in "predicate normal form," then a query posed against concepts in the ontology no longer corresponds directly (in name) to the database structure. We therefore need a mechanism for query rewriting in the presence of database restructuring (a sketch follows Figure 10).


Figure 10: Person-hasSSN Schema. (A) A database representing the rule: All Persons have exactly one SSN. There is a trigger from Person to hasSSN, and a foreign-key from hasSSN to Person. (B) The same database after being denormalized.
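A sketch of what such a rewriting amounts to for Figure 10 (hypothetical names): once Person is absorbed into hasSSN, an ontology-level query against the Person concept must be rewritten into a projection over the property-relation.

    -- Ontology-level query: Person(?x)
    -- (A) Predicate normal form: a direct lookup.
    SELECT id FROM person;

    -- (B) Denormalized form: Person survives only as hasSSN.subject,
    --     so the same query must be rewritten as a projection.
    SELECT DISTINCT subject AS id FROM has_ssn;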

7 Case Studies

7.1 Load-time and Query Answering

We implemented the proposed methodology as a tool and applied it to a simple ontology found online, the Beer Ontology [1]. It uses many of the interesting features of OWL-Lite [11] (a description logic based ontology language for the Semantic Web). It models 47 classes arranged in a hierarchy of height 4, and 12 properties employing the subPropertyOf, inverseOf, and maxCardinality features of OWL-Lite. We simplified the model further to 13 classes and 6 properties so that we could test the essential characteristics: subsumption depth and basic OWL-Lite features. From that model our tool generated 21 foreign-keys (12 on class-relations, 9 on property-relations) and 15 triggers (12 for subClassOf, 2 for subPropertyOf, and 1 for inverseOf). Although we expected insertion time to degrade as the database grows large or as subsumption depth increases, performance remains surprisingly linear even up to 1.25 million facts (see Figure 11). Regardless of variations in subsumption depth (we tried some artificial variations on Beer up to a depth of 20, which we consider moderate), every insertion of a new data instance takes roughly 50 milliseconds. (All experiments were performed on an unremarkable personal laptop computer with a 1.8GHz Centrino processor and 1GB of RAM running MySQL 5.0 as the RDBMS.)

Figure 11: Average load-time performance. This chart shows how long it takes to load extensional facts into the database: load-time in hours grows linearly with the number of facts (up to 1.25 million), with each fact taking roughly 50 milliseconds to load regardless of its semantics.


Therefore, after about 18 hours we can ask as many queries as we please over the 1.25 million records now stored. To be absolutely clear, the point of this paper is not how fast we can load or insert data into our database. As long as it scales to some reasonable degree, the speed of load-time is not that important. Our primary interest lies in the kinds of queries we can answer once the data is loaded, and in how quickly and easily those queries can be answered. Consider the following portion of the Beer database, consisting of intensional and extensional knowledge:

Intensional Knowledge:
  subClassOf(Ale,Beer)
  brews(Brewery(x),Beer(y))
  brewedBy(Beer(x),Brewery(y))
  inverseOf(brews,brewedBy)

Extensional Data (stored in some extensional database):
  Ale(WinterWheat)
  Brewery(MillCreek)
  brews(MillCreek,WinterWheat)

Suppose our goal is to answer the query: brewedBy(Beer(?x),MillCreek). In other words, what Beers are brewedBy MillCreek? A deductive system (as in a knowledge-based system) will perform several inferences before it determines the answer:

Knowledge Base with extensional database, K-DB, as in [71]:
1. Ask K-DB: brewedBy(Beer(?x),MillCreek). Result: ∅.
2. Apply inverseOf(brews,brewedBy), ask K-DB: brews(MillCreek,Beer(?x)). Result: ∅. (No result is found because brews(MillCreek,Ale(WinterWheat)) does not unify.)
3. Apply subClassOf(Ale,Beer), ask K-DB: brewedBy(Ale(?x),MillCreek). Result: ∅.
4. Apply inverseOf(brews,brewedBy), ask K-DB: brews(MillCreek,Ale(?x)). Result: {?x/WinterWheat}.
5. No inferences left.
6. Union all results obtained. Return: {?x/WinterWheat}.

But an ontology-database system does not need to perform any inference:

Ontology Database with extensional database, O-DB:
1. Ask O-DB: brewedBy(Beer(?x),MillCreek). Result: {?x/WinterWheat}. Return: {?x/WinterWheat}.

In the ontology-database system, we issue the query predicate as is, and the system answers directly by looking in the corresponding property-relation table. Moreover, the ontology-database we have proposed is easier and faster, and it guarantees the same answers a deductive system would give, using nothing more than a conventional relational database engine.
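A sketch of both halves in SQL (table names follow the subject/object convention used earlier; all names are assumptions): the inverseOf trigger does the work once at load-time, and the query is then a single selection.

    -- Load-time: the inverseOf(brews, brewedBy) trigger materializes the
    -- symmetric fact, and a subClassOf trigger (not shown) copies Ale rows
    -- into Beer.
    CREATE TRIGGER brews_inverseof_brewedby
    AFTER INSERT ON brews
    FOR EACH ROW
      INSERT IGNORE INTO brewedby (subject, object)
      VALUES (NEW.object, NEW.subject);

    -- Query-time: brewedBy(Beer(?x), MillCreek) is a direct lookup.
    SELECT b.subject AS beer
    FROM brewedby AS b
    JOIN beer AS c ON c.id = b.subject   -- restrict ?x to the Beer concept
    WHERE b.object = 'MillCreek';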

7.2 Gene Ontology

The Gene Ontology (GO) provides a standard vocabulary and concept model for molecular functions, biological processes and cellular components in genetic research. The OWL specification of GO is huge, over 40 megabytes in size [67]. While quite large and comprehensive, the model semantics are simple, requiring only OWL-Lite features for the entire specification. For example, a search on the goosecoid gene (a.k.a. "gsc") from the Zebrafish model organism website [16] reveals that this gene belongs to (among other things) the class of molecular functions called DNA binding, the class of biological processes called brain development, and the cellular component nucleus. Each of these ontological notions can be viewed and analyzed from the GO or AmiGO websites [6, 2]. Based on the GO ontology, the ontology-database will explicitly store Organelle(goosecoid), because the Nucleus (GO-Term: 0005634) is a subClassOf Organelle (GO-Term: 0043226) in GO. Applying the proposed methodology to GO would generate 65,010 unary relations (one for every GO term), 4,695 binary relations (one for each part-of property), foreign-keys for the domain and range restrictions on those properties, and 32,082 triggers and foreign-key constraints (one for each subClassOf definition). Although we considered GO in our design, we did not directly apply our tool to GO, given our limited computing resources. The most important observation we can make here is that 70,000 tables is quite excessive; it may be possible to apply the normalizing technique of "promoting a column to a table" to reduce the number of tables necessary.


7.3 NeuroElectroMagnetic Ontologies

The NeuroElectroMagnetic Ontologies (NEMO) [33, 43] working group is creating a set of ontologies that will facilitate the sharing and analysis of electroencephalography (EEG) data, especially event-related potentials (ERP). Unlike GO, which limits its semantics to essentially a vocabulary of terms, NEMO aims to capture significantly more semantics. We have used the developmental versions of the NEMO temporal, spatial, functional and ERP ontologies as a case study in the development of our ontology-database methodology. The details of the NEMO ontology and database developed based on this methodology have been reported in another publication [55].

Figure 12: (A) 128-channel EEG waveplot; positive voltage plotted up. Black, response to words; red, response to nonwords. (B) Time course of P100 factor for the same dataset, extracted using Principal Components Analysis. (C) Topography of P100 factor (negative on top and positive at bottom).

While GO might provide an interesting stress-test for our methodology on huge ontologies, we expect NEMO to test the semantic power of ontologies more than GO can. For example, various patterns that relate to specific brain and cognitive functions can be characterized from ERP data. The "P100 component" is one such pattern, which is reliably seen after a word is presented visually to a human subject. It manifests as a positive-going deflection that peaks at around 100 milliseconds and is maximal over occipital (a posterior region of the brain) electrodes (see Figure 12) [33, 43]. Furthermore, it can be characterized in general by an INF rule with about four predicates in the antecedent, so our methodology using assertion triggers can handle it without difficulty (see the sketch below). Figure 13 shows the NEMO ERP ontology used in our case study, and Figure 14 shows a graphical visualization of the NEMO database that was automatically generated from the ontology.
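A heavily hedged sketch of what such an assertion trigger could look like; the attribute names are borrowed loosely from the NEMO vocabulary in Figure 13, and the table layout, threshold values, and region codes are invented purely for illustration:

    -- Hypothetical factor-measurement relation and P100 class-relation.
    CREATE TABLE factor_measure (
      factor_id   VARCHAR(64) PRIMARY KEY,
      ti_max      INT,            -- peak latency in milliseconds
      in_mean_roi DOUBLE,         -- mean intensity over the region of interest
      sp_max_roi  VARCHAR(16),    -- region of maximal topography
      modality    VARCHAR(16)     -- stimulus modality
    );
    CREATE TABLE p100 (factor_id VARCHAR(64) PRIMARY KEY);

    DELIMITER //
    CREATE TRIGGER classify_p100
    AFTER INSERT ON factor_measure
    FOR EACH ROW
    BEGIN
      -- An INF-style rule with a conjunctive antecedent of four predicates:
      -- a positive deflection peaking near 100 ms, maximal over an occipital
      -- region, for a visual stimulus => P100. (Thresholds are invented.)
      IF NEW.ti_max BETWEEN 70 AND 130
         AND NEW.in_mean_roi > 0
         AND NEW.sp_max_roi IN ('LOCC', 'ROCC')
         AND NEW.modality = 'visual' THEN
        INSERT IGNORE INTO p100 (factor_id) VALUES (NEW.factor_id);
      END IF;
    END//
    DELIMITER ;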

8 Future Work

The semantics of disjunction (and by association, negation) in relational databases is still unclear and requires further investigation. It seems likely that no clear solution will present itself and that disjunction must be left to an external reasoning agent (or an extension of database query processors). This work will likely become a major focus of my dissertation.

The semantics of deletion also needs further investigation. When allowing for non-monotonic knowledge, we must consider the temporal aspect of the data. The order of rule applications may become critical, thereby breaking the trigger-based implementation. To what extent can we allow for deletions and still have a database system that works "correctly"? How do we define correctness in the presence of deletions? Will extensions to current ontology languages be required? There are clear applications in the area of provenance, which will become one area of research focus in the next few months.

Denormalizing predicate normal form for efficiency raises difficult challenges, especially if we truly want to bridge the gap between conceptual and logical models. Predicate normal form forces joins where they may not always be necessary, causing inefficiency. Reasoning on the schema (especially cardinality constraints) can help to denormalize it automatically. However, designers often want to make their own specific choices (which is common in real-world scenarios) based on their expert knowledge of the data and the query behavior of its users. We need a way to formally capture such (de)normalization choices made by the designer so that ontology-based queries can be automatically rewritten to compensate for structural changes to predicate normal form. For example, this transformation process might be considered an ontology (schema) mapping problem for which we can use an ontology translation framework not unlike our prior data integration work.

Figure 13: The NEMO ERP Ontology, depicting classes such as Measurement, Factor, Time_Instance, Topography, Channel_Group, and Pattern, their properties and datatypes, and subClassOf relationships among ERP patterns including P100, N100, N300, and P3.

9 Publication Venues

[to be compiled]


Figure 14: The ER diagram for the NEMO ERP ontology database shows tables (boxes) and foreign-key constraints (arrows). The concepts "pattern," "factor," and "channel group" are the most densely connected (toward the right side of the image), as expected given their relationships to all other concepts in the ontology.

10 Conclusion

We have presented a methodology which takes an ontology as input and generates a relational database that can answer positive queries about ground literals described by the ontology. We argue that the database generated under this methodology is in fact a procedural extension of a given knowledge base, and we call it specifically an ontology-database. While such a database is currently limited to Horn logics, it can help the reasoning process significantly: certain kinds of semantics can be disregarded at query time, namely trigger rules, domain and range restrictions, and cardinality constraints. Although we obviously cannot claim such an implementation is ideal for all circumstances, because of the explosion of space and load-time, it seems well-suited for (1) personal, small-sized knowledge bases using only commonly available relational database technology, and (2) "query-mostly" scenarios common to highly analytical domains such as NEMO and to semantic search on the Internet. We have left several open problems as future work, including the consideration of more expressive language constructs (negation and disjunction), the consideration of non-monotonic logics (deletions), and optimization techniques that consider redundancy and load-time (normalization and denormalization).


11 Acknowledgements

Thanks to the NEMO working group, and in particular Gwen Frishkoff and Jiawei Rong, for their help in understanding the complexities of EEG and ERP knowledge.

References

[1] A Beer Ontology. http://www.dayf.de/2004/owl/beer.owl.
[2] AmiGO: The Gene Ontology Search Engine. http://amigo.geneontology.org/.
[3] BBN-GEN Genealogy Ontology. http://www.daml.org/2001/01/gedcom/gedcom.daml.
[4] DB2. http://www.ibm.com/db2.
[5] DRC-GEN Genealogy Ontology. http://orlando.drc.com/daml/Ontology/Genealogy/3.1/Gentologyont.dam.
[6] GO: The Gene Ontology. http://www.geneontology.org/.
[7] Informix. http://www.ibm.com/informix.
[8] Jena. http://jena.sourceforge.net/.
[9] MySQL. http://www.mysql.com.
[10] Oracle. http://www.oracle.com.
[11] OWL Web Ontology Language. http://www.w3.org/TR/owl-ref/.
[12] PostgreSQL. http://www.postgresql.com.
[13] Resource Description Framework. http://www.w3.org/RDF/.
[14] SQL Server. http://www.microsoft.com/sql.
[15] Sybase. http://www.sybase.com.
[16] ZFIN: The Zebrafish Information Network. http://www.zfin.org.
[17] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[18] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.


[19] F. Baader and W. Nutt. Basic description logics. In Description Logic Handbook, pages 43–95, 2003.
[20] A. Borgida, R. J. Brachman, D. L. McGuinness, and L. A. Resnick. CLASSIC: a structural data model for objects. pages 58–67, 1989.
[21] R. Brachman and H. Levesque. Knowledge Representation and Reasoning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.
[22] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316–330, 2001.
[23] P. Buneman and W. C. Tan. Provenance in databases. In SIGMOD Conference, pages 1171–1173, 2007.
[24] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. DL-Lite: Tractable description logics for ontologies. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI 2005), pages 602–607, 2005.
[25] D. D. Chamberlin, M. M. Astrahan, M. W. Blasgen, J. Gray, W. F. King III, B. G. Lindsay, R. A. Lorie, J. W. Mehl, T. G. Price, G. R. Putzolu, P. G. Selinger, M. Schkolnick, D. R. Slutz, I. L. Traiger, B. W. Wade, and R. A. Yost. A history and evaluation of System R. Commun. ACM, 24(10):632–646, 1981.
[26] P. P. Chen. The entity-relationship model – toward a unified view of data. ACM Trans. Database Syst., 1(1):9–36, 1976.
[27] K. L. Clark. Negation as failure. In Logic and Data Bases, pages 293–322, 1977.
[28] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.
[29] A. Colmerauer and P. Roussel. The birth of Prolog. In HOPL Preprints, pages 37–52, 1993.
[30] C. P. de Laborda and S. Conrad. Database to Semantic Web mapping using RDF query languages. In ER, pages 241–254, 2006.
[31] F. M. Donini. Complexity of reasoning. In Description Logic Handbook, pages 96–136, 2003.
[32] F. M. Donini, M. Lenzerini, D. Nardi, W. Nutt, and A. Schaerf. An epistemic operator for description logics. Artif. Intell., 100(1-2):225–274, 1998.
[33] D. Dou, G. Frishkoff, J. Rong, R. Frank, A. Malony, and D. Tucker. Development of NeuroElectroMagnetic Ontologies (NEMO): A framework for mining brain wave ontologies. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2007. (to appear).


[34] D. Dou and P. LePendu. Ontology-based integration for relational databases. In ACM Symposium on Applied Computing (SAC), pages 461–466, 2006.
[35] D. Dou, P. LePendu, S. Kim, and P. Qi. Integrating databases into the Semantic Web through an ontology-based framework. In International Workshop on Semantic Web and Databases (SWDB), page 54, 2006.
[36] D. Dou, D. V. McDermott, and P. Qi. Ontology translation on the Semantic Web. Journal of Data Semantics, 2:35–57, 2005.
[37] H.-D. Ebbinghaus and J. Flum. Finite Model Theory. Springer.
[38] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems, 2nd Edition. Benjamin/Cummings, 1994.
[39] O. Etzioni, K. Golden, and D. S. Weld. Tractable closed world reasoning with updates. In KR, pages 178–189, 1994.
[40] R. Fagin. Finite-model theory – a personal perspective. Theor. Comput. Sci., 116(1&2):3–31, 1993.
[41] J. A. Fernández and J. Minker. Disjunctive deductive databases. In LPAR, pages 332–356, 1992.
[42] J. A. Fernández and J. Minker. Semantics of disjunctive deductive databases. In ICDT, pages 21–50, 1992.
[43] G. A. Frishkoff, R. M. Frank, J. Rong, D. Dou, J. Dien, and L. K. Halderman. A framework to support automated pattern classification and labeling. Computational Intelligence and Neuroscience (CIN), Special Issue, EEG/MEG Analysis and Signal Processing, 2007. (in revision).
[44] A. Gali, C. X. Chen, K. T. Claypool, and R. Uceda-Sosa. From ontology to relational databases. In ER (Workshops), pages 278–289, 2004.
[45] J. Goubault-Larrecq and I. Mackie. Proof Theory and Automated Deduction, volume 6 of Applied Logic Series, chapter 6, pages 185–231. Kluwer Academic Publishers, May 1997.
[46] J. Goubault-Larrecq and I. Mackie. Proof Theory and Automated Deduction, volume 6 of Applied Logic Series. Kluwer Academic Publishers, May 1997.
[47] J. Goubault-Larrecq and I. Mackie. Proof Theory and Automated Deduction, volume 6 of Applied Logic Series, chapter 3, pages 73–129. Kluwer Academic Publishers, May 1997.
[48] J. Goubault-Larrecq and I. Mackie. Proof Theory and Automated Deduction, volume 6 of Applied Logic Series, chapter 5, pages 157–183. Kluwer Academic Publishers, May 1997.


[49] T. R. Gruber. A translation approach to portable ontology specifications. Knowl. Acquis., 5(2):199–220, 1993.
[50] J. Herbrand. Recherches sur la théorie de la démonstration. PhD thesis, Université de Paris, 1930.
[51] R. Hull and R. King. Semantic database modeling: survey, applications, and research issues. ACM Comput. Surv., 19(3):201–260, 1987.
[52] P. G. Kolaitis. Schema mappings, data exchange, and metadata management. In PODS '05: Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 61–75, New York, NY, USA, 2005. ACM Press.
[53] P. G. Kolaitis. Reflections on finite model theory. In LICS, pages 257–269, 2007.
[54] M. Lenzerini. Data integration: a theoretical perspective. In PODS '02: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 233–246, New York, NY, USA, 2002. ACM Press.
[55] P. LePendu, D. Dou, J. Rong, and G. Frishkoff. Semantic data modeling and query answering for brainwave ontologies. In International Semantic Web Conference (ISWC), 2007. (under review) Accessible version: http://www.cs.uoregon.edu/~paea/research/iswc07insubmission.pdf.
[56] H. J. Levesque. The interaction with incomplete knowledge bases: A formal treatment. In IJCAI, pages 240–245, 1981.
[57] H. J. Levesque. Foundations of a functional approach to knowledge representation. Artif. Intell., 23(2):155–212, 1984.
[58] H. J. Levesque and G. Lakemeyer. The Logic of Knowledge Bases. MIT Press, 2001.
[59] H. J. Levesque and G. Lakemeyer. The Logic of Knowledge Bases, chapter 9, pages 143–162. MIT Press, 2001.
[60] A. Y. Levy. Obtaining complete answers from incomplete databases. In VLDB, pages 402–412, 1996.
[61] J. Lobo, A. Rajasekar, and J. Minker. Semantics of Horn and disjunctive logic programs. Theor. Comput. Sci., 86(1):93–106, 1991.
[62] D. McDermott and D. Dou. Representing disjunction and quantifiers in RDF. In International Semantic Web Conference, 2002.
[63] J. Minker. Logic and databases: A 20 year retrospective. In Logic in Databases, pages 3–57, 1996.


[64] B. Motik, I. Horrocks, and U. Sattler. Bridging the gap between OWL and relational databases. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, 2007.
[65] A. Motro. Integrity = validity + completeness. ACM Trans. Database Syst., 14(4):480–502, 1989.
[66] D. Nardi and R. J. Brachman. An introduction to description logics. In Description Logic Handbook, pages 1–40, 2003.
[67] Gene Ontology Consortium. Creating the Gene Ontology resource: Design and implementation. Genome Research, 11(8):1425–1433, 2001.
[68] Z. Pan and J. Heflin. DLDB: Extending relational databases to support Semantic Web queries. In Practical and Scalable Semantic Systems (PSSS), 2003.
[69] N. W. Paton and O. Díaz. Active database systems. ACM Computing Surveys, 31(1):63–103, 1999.
[70] R. Ramakrishnan and J. D. Ullman. A survey of deductive database systems. J. Log. Program., 23(2):125–149, 1995.
[71] R. Reiter. Deductive question-answering on relational data bases. In Logic and Data Bases, pages 149–177, 1977.
[72] R. Reiter. On closed world data bases. In Logic and Data Bases, pages 55–76, 1977.
[73] R. Reiter. On structuring a first order data base. In Proceedings of the Canadian Society for Computational Studies of Intelligence, 1978.
[74] R. Reiter. Towards a logical reconstruction of relational database theory. In On Conceptual Modelling (Intervale), pages 191–233, 1982.
[75] R. Reiter. What should a database know? J. Log. Program., 14(1&2):127–153, 1992.
[76] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 2003.
[77] A. Silberschatz, H. F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, Inc., New York, NY, USA, 2006.
[78] S. Staab and R. Studer, editors. Handbook on Ontologies. International Handbooks on Information Systems. Springer, 2004.
[79] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press, 1988.
[80] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press, 1989.
