RDFS-based Relational Database Integration - CiteSeerX

RDF/RDFS-based Relational Database Integration

Huajun Chen, Zhaohui Wu, Heng Wang, Yuxin Mao
College of Computer Science, Zhejiang University, Hangzhou, 310027, China
{huajunsir,wzh,paulwang,maoyx}@zju.edu.cn

Abstract

We study the problem of answering queries through an RDF/RDFS ontology, given a set of view-based mappings between one or more relational schemas and this target ontology. In particular, we consider a set of RDFS semantic constraints such as rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range, which are present in the RDF model but in neither the XML nor the relational model. We formally define the query semantics in such an integration scenario and design a novel query rewriting algorithm that implements these semantics. In our approach, we highlight the important role played by the RDF Blank Node in representing the incomplete semantics of relational data. A set of semantic tools supporting relational data integration with RDF is also introduced. The approach has been used to integrate 70 relational databases at the China Academy of Traditional Chinese Medicine.

1 Introduction

The Semantic Web aims to provide a common semantic framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is based on the Resource Description Framework (RDF), a language for representing web information in a minimally constrained, flexible, but meaningful way, so that web data can be exchanged and integrated without loss of semantics. Most existing data, however, is stored in relational databases. Therefore, for the Semantic Web to be really useful and successful, considerable effort is required to provide methods and tools that support the integration of heterogeneous relational databases using the RDF model.

This paper addresses this problem. Specifically, it concerns the problem of answering queries through an RDF ontology, given a set of semantic mappings between one or more relational schemas and the RDF ontology. Essentially, it is the old problem of uniformly querying many disparate data sources through one common virtual interface. A typical approach, called answering queries using views [7][5], is to describe data sources as precomputed views over a mediated schema and to reformulate the user query, posed over the mediated schema, into queries that refer directly to the source schemas by query rewriting.

While most of the preceding work has focused on the relational case [5][8] and, more recently, the XML case [9][10], we consider the case of RDF-based relational data integration. In particular, we consider a set of extra RDFS semantic constraints on the mediated schema, such as rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range, which are present in the RDF model but in neither the XML nor the relational model. These constraints are of great importance in web data integration. As an example, suppose there is a statement in the RDF ontology saying¹: foaf:schoolHomepage rdfs:subPropertyOf foaf:homepage. Given a semantic mapping from a column of a relational table T to the property foaf:schoolHomepage, and a semantic query Q referring to the property foaf:homepage, the rewriting algorithm should automatically infer that T can be used to generate rewritings for Q. Conversely, if a triple such as :aaa foaf:schoolHomepage :bbb is generated as a query result, the system should automatically infer that the triple :aaa foaf:homepage :bbb is also a query result. In other words, rdfs:subPropertyOf places an extra constraint on the mediated schema and enables query rewriting to infer more results.

Another motivation of our work is the big model and small model problem. On the Semantic Web, shared ontologies are normally designed to cover a whole domain and are normally big, complete models. Take the Traditional Chinese Medicine (TCM) domain as an example: the number of RDF classes in the current TCM ontology [14] developed by the China Academy of TCM has reached 100,000. However, legacy relational databases are often designed for local users and are normally small, incomplete models.
In fact, it is often difficult to find a direct mapping from the relational schema (small model) to the RDF ontology (big model) because of the incompleteness of legacy relational data. We take advantage of the RDF Blank Node construct to help define the semantic mappings. Our experience with the TCM application shows that the RDF Blank Node is very useful in representing the incomplete semantics of relational data when it is mapped to the RDF ontology. However, the formal analysis in Section 2.3 shows that introducing Blank Nodes to define semantic mappings makes query computation hard. As our work reveals, there is a tradeoff between mapping flexibility and the computational complexity of query processing.

The main contributions can be summarized as follows.

1. Formal Aspects of Answering Queries using Views. We formally and precisely specify what it means to answer an RDF query, given a set of view-based mappings between the source relational schemas and the RDF ontology. We define a Target RDF Instance that satisfies all the requirements with respect to the given views and RDFS semantic constraints, and take the query semantics to be the result of evaluating the query on this Target RDF Instance. In addition, we highlight the important role played by the RDF Blank Node in representing incomplete or hidden semantics of relational data when defining semantic mappings.

2. RDF Query Rewriting Algorithm. An RDF-inspired query rewriting algorithm is implemented according to the formal query semantics. It rewrites RDF queries into a set of source SQL queries. Evaluating the union of these SQL queries has essentially the same effect as running the RDF query on the Target RDF Instance. This algorithm extends earlier relational and XML techniques for rewriting queries using views, taking the features of the RDF model into consideration.

3. A Set of Semantic Tools and their Application in Traditional Chinese Medicine. A set of semantic tools has been developed. For example, the Visual Semantic Query Tool enables users to visually construct RDF queries, and the Visual Semantic Mapping Tool enables users to speed up the process of defining view-based mappings. The query system has been deployed at the China Academy of Traditional Chinese Medicine to integrate 70 TCM relational databases.

This paper is laid out as follows: Section 2 formally discusses the problem of answering queries using views for RDF/RDFS-based relational data integration. Section 3 presents the RDF rewriting algorithm. Section 4 introduces the implementation of some semantic tools and their application in Traditional Chinese Medicine.

¹ N3 notation is used to represent RDF statements.

Figure 1. Semantic Mapping from Relational Tables to RDF ontology. "?en,?em,?eh,?an,?ah" are variables and represent, respectively, "employee name", "employee email", "employee homepage at school", "account name", "account service homepage". "?y1,?y2" are existential variables.

2 Answering Queries Using Views

2.1 RDF Views

We start with a simple mapping case. Suppose both W3C and Zhejiang University (abbreviated as ZJU) have a legacy employee database, and we want to integrate them using the FOAF ontology², so that we can uniformly query these two databases by formulating RDF queries over the FOAF ontology. The mapping scenario in Fig. 1 illustrates two source relational schemas, a part of the FOAF ontology, and two mappings between them. Graphically, the mappings are described by the arrows that go between the mapped schema elements. The extra RDFS semantic constraints state that both foaf:schoolHomepage and foaf:accountServiceHomepage are subproperties of foaf:homepage, and both foaf:OnlineChatAccount and foaf:OnlineEcommerceAccount are subclasses of foaf:OnlineAccount.

Mappings are often defined as views in conventional data integration systems. In our approach, each relational table in the source is defined as a view over the RDF ontologies. Such views are called RDF Views. For formal discussion, RDF views are expressed in a Datalog-like notation. The upper part of Fig. 2 shows examples of RDF views corresponding to the semantic mappings in Fig. 1. A typical RDF view consists of two parts. The left part is called the view head and is a relational predicate. The right part is called the view body and is a set of RDF triples. In general, the body can be viewed as an RDF query over the RDF ontology, and it defines the semantics of the relational predicate from the perspective of the RDF ontology. As in conventional view definitions expressed in Datalog, there are two kinds of variables in an RDF view. The variables appearing in the view head are called distinguished variables. The variables appearing only in the view body but not in the view head are called existential variables. In our examples, y1, y2, ... denote existential variables.

² The Friend of a Friend (FOAF) project: http://www.foaf-project.org/.

Figure 2. RDF Views examples. Upper part is the set of original views, lower part is the set of views after applying RDFS semantic constraints (see Section 2.3). The newly added triples are italicized.

Definition 1 (RDF View). A typical RDF View has the form $R(\bar{X}) \text{ :- } G(\bar{X}, \bar{Y})$, where:
1. $R(\bar{X})$ is called the head of the view, and $R$ is a relational predicate.
2. $G(\bar{X}, \bar{Y})$ is called the body of the view, and $G$ is a set of RDF triples with some nodes replaced by variable names.
3. $\bar{X}$ and $\bar{Y}$ contain either variables or constants. The variables in $\bar{X}$ are called distinguished variables, and the variables in $\bar{Y}$ are called existential variables.
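To make Definition 1 concrete, the following is a minimal sketch of how an RDF view could be held in memory. The class name `RDFView`, the helper `is_var`, and the example view body are assumptions made for illustration; the actual views V1 and V2 are those of Fig. 2, which may differ.

```python
from dataclasses import dataclass
from typing import Set, Tuple

# A triple pattern is (subject, predicate, object); variables start with "?".
Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    """Return True when a term is a variable such as ?en."""
    return term.startswith("?")


@dataclass(frozen=True)
class RDFView:
    """R(X) :- G(X, Y): a relational predicate defined by triple patterns."""
    name: str                 # relational predicate R
    head: Tuple[str, ...]     # terms of X, in column order
    body: Tuple[Triple, ...]  # triple patterns of G

    def distinguished(self) -> Set[str]:
        """Variables that appear in the view head."""
        return {t for t in self.head if is_var(t)}

    def existential(self) -> Set[str]:
        """Variables that appear only in the view body."""
        body_vars = {t for triple in self.body for t in triple if is_var(t)}
        return body_vars - self.distinguished()


# Hypothetical view for a table zju:emp(?en, ?eh); the body is an assumption.
v = RDFView(
    name="zju:emp",
    head=("?en", "?eh"),
    body=(
        ("?y1", "rdf:type", "foaf:Person"),
        ("?y1", "foaf:name", "?en"),
        ("?y1", "foaf:schoolHomepage", "?eh"),
    ),
)
print(sorted(v.distinguished()))  # ['?eh', '?en']
print(sorted(v.existential()))    # ['?y1']
```

Deriving the existential variables from the head and body, rather than storing them, keeps the representation consistent with the definition: a variable is existential exactly when it is absent from the head.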

2.2 RDF Queries

Next, we turn to the types of RDF queries dealt with in this paper. The following example Q1 is specified in terms of the FOAF ontology.

Q1: SELECT ?en ?em ?eh ?y2 ?an ?ah
    WHERE {
      ?y1 rdf:type foaf:Person .
      ?y1 foaf:name ?en .
      ?y1 foaf:mbox ?em .
      OPTIONAL { ?y1 foaf:homepage ?eh . }
      ?y1 foaf:holdsAccount ?y2 .
      ?y2 rdf:type foaf:OnlineAccount .
      ?y2 foaf:accountName ?an .
      ?y2 foaf:homepage ?ah .
    }

The query is written in the SPARQL³ query language. It asks for the person name (?en), the mail box (?em), the homepage (?eh), his/her online account (?y2), the account name (?an), and the homepage of the account service (?ah). Note that there is an OPTIONAL block in Q1. According to the SPARQL specification, the OPTIONAL predicate specifies that if the optional part does not lead to any solutions, the variables in the optional block can be left unbound. As can be seen in Section 3.2, the OPTIONAL predicate affects the number of valid query rewritings that the algorithm can yield.

³ W3C SPARQL: http://www.w3.org/TR/rdf-sparql-query/

Figure 3. The Source Relational Instances and Target RDF Instances. In the target instance, :bn1, :bn2, and so on, are all newly generated blank node IDs. The italicized triples are generated because of the RDFS semantic constraints. The triples are represented in N3 notation.
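The effect of the OPTIONAL block can be illustrated with a small matcher. This is only a sketch under simplifying assumptions: it handles a single required group and a single optional group over an in-memory list of triples, it is not a SPARQL engine, and the sample triples and function names are hypothetical (the values are loosely modelled on Table 1).

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def match_triple(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend the binding so that the pattern equals the triple, or return None."""
    new = dict(binding)
    for p, value in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, value) != value:
                return None
        elif p != value:
            return None
    return new


def match_all(patterns: List[Triple], graph: List[Triple],
              binding: Optional[Binding] = None) -> List[Binding]:
    """Return every binding that satisfies all patterns (a conjunctive match)."""
    solutions = [dict(binding or {})]
    for pattern in patterns:
        next_solutions = []
        for b in solutions:
            for t in graph:
                extended = match_triple(pattern, t, b)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


def evaluate(required: List[Triple], optional: List[Triple],
             graph: List[Triple]) -> List[Binding]:
    """Keep a required solution even when the optional part has no match."""
    results = []
    for b in match_all(required, graph):
        extended = match_all(optional, graph, b)
        results.extend(extended if extended else [b])
    return results


# Hypothetical target triples.
graph = [
    ("_:bn1", "foaf:name", "Dan Brickley"),
    ("_:bn3", "foaf:name", "Huajun"),
    ("_:bn3", "foaf:homepage", "http://zju.edu.cn/huajun"),
]
required = [("?y1", "foaf:name", "?en")]
optional = [("?y1", "foaf:homepage", "?eh")]
rows = [(b["?en"], b.get("?eh")) for b in evaluate(required, optional, graph)]
print(rows)
# [('Dan Brickley', None), ('Huajun', 'http://zju.edu.cn/huajun')]
```

The unbound ?eh for the first row corresponds to the NULL entry that a relational answer table would show for an optional variable.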

2.3 The Problem

The fundamental problem we want to address is the following: given a set of source relational instances I such as those in Fig. 3, a set of RDF views such as V1 and V2 in Fig. 2, and a set of RDFS semantic constraints such as the rdfs:subClassOf constraints in Fig. 1, what should the answers to a target RDF query such as Q1 be?

One possible approach, which has been extensively studied in the relational literature, is to consider the target instance, obtained by applying the view definitions to the source instances, as an incomplete database [7]. Often a number of possible databases D are consistent with this incomplete database. The query semantics is then the intersection of Q(D) over all such possible D. This intersection is called the set of certain answers [7]. This approach cannot be applied directly to our case, since we need to consider extra semantic constraints on the target schema.

We take a similar but somewhat different approach, which is more RDF-inspired. In general, we define the semantics of target query answering by constructing a Target RDF Instance G based on the view definitions and the RDFS semantic constraints. We then define the result of answering a target RDF query Q1 using the views to be the result of evaluating Q1 directly on G. In detail, the construction process involves two phases:

1. Applying constraints to RDF views. Before constructing G, an extra inference process is first applied to the RDF views. As the example in Fig. 2 illustrates, five extra triples are added to the view definitions by applying the RDFS constraints in Fig. 1. For instance, applying the constraint (foaf:accountServiceHomepage rdfs:subPropertyOf foaf:homepage) to the triple (?y2 foaf:accountServiceHomepage ?ah) yields a new triple (?y2 foaf:homepage ?ah).

2. Applying RDF views to source instances. Next, relational instances are transformed into RDF instances according to the extended RDF views. In other words, for each tuple in the source relational instance, a set of RDF triples is added to the target instance such that the RDF views are satisfied. Fig. 3 shows examples of a relational instance and a target instance. The evaluation of Q1 on the target instance produces the tuples in Table 1.

Table 1. Query answers after evaluating Q1 on the Target RDF Instance in Fig. 3. Note that for Dan Brickley the variable ?eh is left unbound, namely, is nullable, but other variables MUST have a binding.

Person.name  | Person.mail-box   | Person.homepage          | Account | Account.name      | Account.homepage
Dan Brickley | [email protected] | NULL                     | :bn2    | dan@ebay          | http://ebay.com
Huajun       | [email protected] | http://zju.edu.cn/huajun | :bn4    | [email protected] | http://amazon.com
Huajun       | [email protected] | http://zju.edu.cn/huajun | :bn5    | [email protected] | http://msn.com
Huajun       | [email protected] | http://zju.edu.cn/huajun | :bn6    | [email protected] | http://yahoo.com

One important notion is that of the Skolem functions introduced to generate blank node IDs in the target instance. As can be seen in Fig. 3, for each existential variable ?y in $\bar{Y}$ in the view, a new blank node ID is generated in the target instance. For example, :bn1 and :bn2 are both newly generated blank node IDs corresponding to the variables ?y1 and ?y2 in V1. This treatment of existential variables is in accordance with the RDF semantics, since blank nodes can be viewed as existential variables⁴. In general, each RDF class in the target ontology is associated with a unique Skolem function that can generate blank node IDs of that type. For instance, the RDF classes in Fig. 1 are associated with the following Skolem functions, respectively: foaf:Person - SF1(?en), foaf:OnlineAccount - SF2(?an). The choice of function parameters depends on the constraints the user wants to set on the target schema. For example, SF1(?en) sets a constraint that says: "if two instances have the same value for the property foaf:name, then they are equivalent and the same blank node ID is generated for both". This is somewhat similar to a primary key constraint and is useful for merging instances stemming from different sources. Taking the example in Fig. 3 again, for the person name "Huajun", the same blank node ID :bn3 is generated for both the W3C and ZJU sources.

⁴ W3C RDF Semantics: http://www.w3.org/TR/rdf-mt/

Note: RDF Blank Node and Incomplete Semantics. In many legacy databases, the data semantics are not represented explicitly enough. Take the w3c:emp table as an example: it implies the semantics "for each person, there is an online account whose account name is ...". The "there is an ..." semantics is lost in the relational representation. This kind of semantics can be captured well by the RDF Blank Node, since blank nodes are treated as simply indicating the existence of a thing, without identifying that thing. This is indeed a case of incomplete semantics. The incompleteness problem [15][16] has been considered an important issue for view-based data integration systems, and it is more acute for Semantic Web applications because the web is an open-ended system. Indeed, the Target RDF Instance can be viewed as an incomplete database in which the Blank Nodes can be viewed as existential variables. This makes it somewhat similar to the conditional tables [16] introduced in the database literature to model incomplete databases. From this point of view, we argue that the blank node is an important representation construct for data integration on the Semantic Web.

We finally give the formal specification of the query semantics. We adopt this semantics as a formal requirement on answering queries using views for RDF/RDFS-based relational data integration. We will show in the next section how to implement this semantics without materializing the Target RDF Instance, by query rewriting instead. Moreover, this query semantics differs from the certain answers [7] of the relational literature for two practical reasons: (a) the query answer can contain NULL, because of the OPTIONAL predicate used in the RDF query; (b) the query answer can contain newly generated blank node IDs, which can be viewed as existential variables. Theorem 1 concerns the fundamental complexity of the query answering problem. The result and the proof⁵ reveal that although blank nodes offer great flexibility in defining semantic mappings, they also make the query computation harder. Therefore, there is a tradeoff between mapping flexibility and computational complexity.

Definition 2 (Query Semantics). Let Q be an RDF query. The set of query answers of Q with respect to a set of relational source instances I, a set of RDF views V, and a set of RDFS semantic constraints C, denoted by $\mathit{answer}_{V,C}(Q, I)$, is the set of all tuples t such that $t \in Q(G)$, where G is the Target RDF Instance.

Theorem 1. Let V be a set of RDF view definitions, I a view instance, C a set of RDFS semantic constraints, and Q an RDF query. Then the problem of computing the query answer with respect to V, I, C, and Q is NP-complete.
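The two construction phases and the role of the Skolem functions can be sketched as follows. This is a simplified illustration, not the system's implementation: it applies only direct rdfs:subPropertyOf and rdfs:subClassOf constraints (no transitive closure, no rdfs:domain or rdfs:range), and the view bodies, the mailbox value, and all function names are assumptions.

```python
from typing import Dict, List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]

# Constraints of the running example (Section 2.1).
SUB_PROPERTY = {"foaf:schoolHomepage": "foaf:homepage",
                "foaf:accountServiceHomepage": "foaf:homepage"}
SUB_CLASS = {"foaf:OnlineChatAccount": "foaf:OnlineAccount",
             "foaf:OnlineEcommerceAccount": "foaf:OnlineAccount"}


def apply_constraints(body: List[Triple]) -> List[Triple]:
    """Phase 1: add the triples directly entailed by the constraints."""
    extended = list(body)
    for s, p, o in body:
        if p in SUB_PROPERTY:
            inferred = (s, SUB_PROPERTY[p], o)
        elif p == "rdf:type" and o in SUB_CLASS:
            inferred = (s, "rdf:type", SUB_CLASS[o])
        else:
            continue
        if inferred not in extended:
            extended.append(inferred)
    return extended


class Skolem:
    """Return one blank node ID per (function name, key value) pair."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str], str] = {}

    def __call__(self, function: str, key: str) -> str:
        if (function, key) not in self._ids:
            self._ids[(function, key)] = f"_:bn{len(self._ids) + 1}"
        return self._ids[(function, key)]


def materialize(head: Sequence[str], body: List[Triple],
                skolem_of: Dict[str, Tuple[str, str]],
                rows: List[Tuple[str, ...]], skolem: Skolem) -> Set[Triple]:
    """Phase 2: instantiate the extended view body once per source tuple."""
    extended = apply_constraints(body)
    graph: Set[Triple] = set()
    for row in rows:
        env = dict(zip(head, row))
        for var, (function, param) in skolem_of.items():
            env[var] = skolem(function, env[param])
        for triple in extended:
            graph.add(tuple(env.get(term, term) for term in triple))
    return graph


skolem = Skolem()
person = {"?y1": ("SF1", "?en")}  # foaf:Person - SF1(?en)
zju_body = [("?y1", "rdf:type", "foaf:Person"), ("?y1", "foaf:name", "?en"),
            ("?y1", "foaf:schoolHomepage", "?eh")]
w3c_body = [("?y1", "rdf:type", "foaf:Person"), ("?y1", "foaf:name", "?en"),
            ("?y1", "foaf:mbox", "?em")]
G = materialize(("?en", "?eh"), zju_body, person,
                [("Huajun", "http://zju.edu.cn/huajun")], skolem)
G |= materialize(("?en", "?em"), w3c_body, person,
                 [("Huajun", "mailto:huajun@example.org")], skolem)
print(len(G), {s for s, _, _ in G})  # 5 {'_:bn1'}
```

Because SF1 is keyed on the person name, the tuples from both hypothetical sources collapse onto one blank node, which is the merging behaviour the text compares to a primary key constraint.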

3 RDF Query Rewriting

In most cases, full access to the source instances is not permitted, so query rewriting is required. This section presents a query rewriting algorithm that satisfies the query semantics defined in the previous section.

3.1 Preprocessing Views

Before rewriting, the RDF views must be preprocessed. The purpose is twofold. First, the RDFS constraints are applied to the views, so that more types of query can be answered using the extended views. Second, the view definitions are turned into a set of smaller rules called Class Mapping Rules, so that the RDF query expressions can be substituted more directly by relational terms.

⁵ The formal proof is available upon e-mail request.

Applying constraints has been introduced in Section 2.3. This extra inference process is valuable because it enables the rewriting algorithm to answer more types of query. For example, without this process, Q1 cannot be answered by rewriting using the views, because the query terms foaf:OnlineAccount and foaf:homepage do not appear in any view definition at all.

Generating class mapping rules is somewhat more complex. The algorithm is illustrated in the left part of Fig. 5. In general, the algorithm can be divided into four steps.

Figure 5. The Algorithms. We use "q=q[a/b]" to denote replacing all occurrences of "a" in "q" with "b", and use "q.head" and "q.body" to denote the head and body of q.

Figure 4. Examples of Class Mapping Rules.

Figure 6. The query rewriting example. The final rewriting is expressed using Datalog-like notation which can be easily transformed into a SQL query.

1. Grouping Triples. The algorithm starts by looking at the body of the views and groups the triples by subject name, i.e., a separate group is created for each set of triples having the same subject name. For example, three triple groups are created for V1, as illustrated in Fig. 4. In the first group, three triples share the same subject name ?y1, which will be replaced by the Skolem function name SF1(?en) in the next step.

2. Skolemizing Triples. Next, the algorithm replaces all existential variables ?yn in $\bar{Y}$ with the corresponding Skolem function names. As introduced in Section 2.3, we associate each RDF class with a unique Skolem function to generate blank node IDs for that class. For example, ?y1 and ?y2 in V1 are replaced by the Skolem function names SF1(?en) and SF2(?an), respectively.

3. Constructing Class Mapping Rules. Next, for each triple group, a new class mapping rule is created. The rule head is the original relational predicate, and the rule body is the set of triples of that group.

4. Merging Class Mapping Rules. Finally, some mapping rules are merged. There are two cases in which rules need to be merged. One is the case of a redundant rule. For example, rule5 and rule6 are merged as rule5_6 because rule5 is a redundant rule. The other case is the following: if there is a referential constraint between two relational tables within a source, then their rules are merged. For example, rule3 and rule4 are merged into rule3_4 because there is a referential constraint between zju:emp(?en,?eh) and zju:em_account(?en,?an).
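Steps 1 to 3 above can be sketched in a few lines. The view used here is a hypothetical one, not V1 from Fig. 2, and the merging step (step 4) is deliberately left out because it depends on redundancy checks and referential constraints that are only shown in the figures; the function name `class_mapping_rules` is also an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Rule = Tuple[Tuple[str, Tuple[str, ...]], List[Triple]]  # ((predicate, head vars), body)


def class_mapping_rules(name: str, head: Tuple[str, ...], body: List[Triple],
                        skolem_of: Dict[str, str]) -> List[Rule]:
    """Turn one view into one class mapping rule per subject group."""
    # Step 1: group the body triples by subject name.
    groups: Dict[str, List[Triple]] = defaultdict(list)
    for triple in body:
        groups[triple[0]].append(triple)

    rules: List[Rule] = []
    for triples in groups.values():
        # Step 2: replace existential variables by Skolem function names.
        skolemized = [tuple(skolem_of.get(term, term) for term in triple)
                      for triple in triples]
        # Step 3: the rule head is the relational predicate, the body is the group.
        rules.append(((name, head), skolemized))
    return rules


# Hypothetical view body; the Skolem names follow Section 2.3.
body = [
    ("?y1", "rdf:type", "foaf:Person"),
    ("?y1", "foaf:name", "?en"),
    ("?y1", "foaf:mbox", "?em"),
    ("?y1", "foaf:holdsAccount", "?y2"),
    ("?y2", "rdf:type", "foaf:OnlineAccount"),
    ("?y2", "foaf:accountName", "?an"),
    ("?y2", "foaf:homepage", "?ah"),
]
skolem_of = {"?y1": "SF1(?en)", "?y2": "SF2(?an)"}
rules = class_mapping_rules("w3c:emp", ("?en", "?em", "?an", "?ah"), body, skolem_of)
for (predicate, variables), rule_body in rules:
    print(predicate, variables, len(rule_body))
# w3c:emp ('?en', '?em', '?an', '?ah') 4
# w3c:emp ('?en', '?em', '?an', '?ah') 3
```

Note that the Skolem name also replaces the variable in object position, so the foaf:holdsAccount triple links the two groups through SF2(?an).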

3.2 Query Rewriting

In this phase, the algorithm transforms the input query using the newly generated mapping rules and outputs a set of valid rewritings. The algorithm is illustrated in the right part of Fig. 5. As in the generation of class mapping rules, the rewriting algorithm starts by looking at the body of the query and groups the triples by subject name, then replaces all variables ?yn with the corresponding Skolem function names. Next, it looks for rewritings for each triple group by trying to find an applicable mapping rule. If it finds one, it replaces the triple group by the head of the mapping rule and generates a new partial rewriting. After all triple groups have been replaced, a candidate rewriting is obtained. If a triple t in Q1 is OPTIONAL and no triple in the mapping rule is mapped to t, the variable in t is set to NULL as its default value. Fig. 6 illustrates the rewriting process for query Q1. Because of space limitations, only two candidate rewritings are shown.

Definition 3 (Triple Mapping). Given two triples t1 and t2, t1 is said to map to t2 if there is a variable mapping φ from Vars(t1) to Vars(t2) such that t2 = φ(t1), where Vars(t1) denotes the set of variables in t1.

Definition 4 (Applicable Class Mapping Rule). Given a triple group g of a query Q, a mapping rule m is an Applicable Class Mapping Rule with respect to g if there is a triple mapping φ that maps every non-optional triple in g to a triple in m.

Theorem 2 (Soundness). Let V be a set of RDF views. For a query Q over the RDF ontology, the rewriting algorithm generates a set of rewritings R such that, whenever I is a source instance and G is the Target RDF Instance with respect to I and V, R(I) ⊆ Q(G).

Theorem 3 (Completeness). Let V be a set of RDF views. For every query Q over the RDF ontology, let P be a query rewriting such that P(I) ⊆ Q(G), and let R be the rewriting generated by the algorithm. Then, whenever I is a source instance and G is the Target RDF Instance with respect to I and V, P(I) ⊆ R(I).

Theorems 2 and 3 state the correctness of the algorithm. The formal proofs are available in the full version of this paper.

[Figure 7 plots. Panel A, "Chain Mapping Complexity": rewriting time (ms) against chain length / number of views, raw and after smoothing. Panel B, "Star Mapping Complexity": rewriting time (ms) against branching factor / number of views, raw and after smoothing. Panel C: rewriting time (ms) against number of sources, one curve per number of triple groups in Q (1 to 8).]

Figure 7. Experiment Results. A. Chain Scenario, B. Star Scenario, C. Worst Case Analysis

Finally, we analyze the complexity of the algorithm. Let n be the number of triple groups in Q1 and let m be the number of mapping rules. Since each of the n triple groups can be replaced by any of up to m applicable rules, it is not difficult to see that the rewriting can be done in time $O(m^n)$. The worst-case experiment in the next section is consistent with this bound. We note that all rewriting algorithms are limited in cases where the number of resulting rewritings is especially large, since a complete algorithm must produce an exponential number of rewritings. In general, the problem of query rewriting using views is NP-complete [6]. In Section 4, we show that although the computational problem is theoretically hard, the algorithm still works well in most practical cases in our TCM application.
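Definitions 3 and 4 and the $m^n$ bound can be made concrete with a short sketch. It is a simplification under stated assumptions: variables may be mapped to any term of the rule (not only to variables), rule heads are plain strings, the rules shown are hypothetical, and the generation of the final SQL is not covered.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Group = List[Tuple[Triple, bool]]   # (triple, is_optional)
Rule = Tuple[str, List[Triple]]     # (head, body)


def is_var(term: str) -> bool:
    return term.startswith("?")


def unify(t1: Triple, t2: Triple, phi: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend phi so that phi(t1) == t2, or return None (Definition 3)."""
    phi = dict(phi)
    for a, b in zip(t1, t2):
        if is_var(a):
            if phi.setdefault(a, b) != b:
                return None
        elif a != b:
            return None
    return phi


def applicable(group: Group, rule_body: List[Triple],
               phi: Optional[Dict[str, str]] = None) -> Optional[Dict[str, str]]:
    """Return a mapping covering every non-optional triple, else None (Definition 4)."""
    phi = {} if phi is None else phi
    if not group:
        return phi
    (triple, optional), rest = group[0], group[1:]
    for candidate in rule_body:
        extended = unify(triple, candidate, phi)
        if extended is not None:
            result = applicable(rest, rule_body, extended)
            if result is not None:
                return result
    # An unmatched OPTIONAL triple is skipped; its variable stays NULL.
    return applicable(rest, rule_body, phi) if optional else None


def candidate_rewritings(groups: List[Group], rules: List[Rule]) -> List[Tuple[str, ...]]:
    """Pick one applicable rule head per triple group: at most m**n combinations."""
    choices = [[head for head, body in rules if applicable(g, body) is not None]
               for g in groups]
    return list(product(*choices))


# Hypothetical class mapping rules for the Person group.
rules = [
    ("w3c:emp(?en,?em)", [("SF1(?en)", "foaf:name", "?en"),
                          ("SF1(?en)", "foaf:mbox", "?em")]),
    ("zju:emp(?en,?eh)", [("SF1(?en)", "foaf:name", "?en"),
                          ("SF1(?en)", "foaf:homepage", "?eh")]),
]
name = ("SF1(?en)", "foaf:name", "?en")
homepage = ("SF1(?en)", "foaf:homepage", "?eh")
candidates = candidate_rewritings([[(name, False), (homepage, True)]], rules)
print(candidates)  # [('w3c:emp(?en,?em)',), ('zju:emp(?en,?eh)',)]
```

With the homepage triple marked OPTIONAL both rules apply; if it were required, only the second rule would remain, which shows how OPTIONAL changes the number of valid rewritings. The Cartesian product over the per-group choices is also where the exponential growth in the number of triple groups comes from.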

3.3 Experimental Evaluations

The goal of our experiments is to validate that our algorithm can scale up to deal with large mapping complexity. We consider two general classes of relational schema: chain schemas and star schemas. In these two cases, we consider queries and views that have the same shape and size. Moreover, we also consider the worst case, in which two parameters are varied: (1) the number of triple groups in the query, and (2) the number of sources. The whole system is implemented in Java, and all experiments were performed on a PC with a single 1.8GHz P4 CPU and 512MB RAM, running Windows XP (SP2) and JRE 1.4.1.

1. Chain Scenario. In a chain schema, there is a line of relational tables, each joined with the next. The chain scenario simulates the case where multiple interlinked relational tables are mapped to a target RDF ontology with a large number of levels (depth). Panel A of Fig. 7 shows the performance in the chain scenario as the length of the chain, and thus the number of views, increases. The algorithm scales up to 300 views in under 10 seconds.

2. Star Scenario. In a star schema, there is a unique relational table that is joined with every other table, and there are no joins between the other tables. The star scenario simulates the case where source relational tables are mapped to a target RDF graph with a large branching factor. Panel B of Fig. 7 shows the performance in the star scenario as the branching factor of the star, and thus the number of views, increases. The algorithm easily scales up to 300 views in under 1 second. The experiments illustrate that the algorithm works better in the star scenario.

3. Worst Case Analysis. The worst case happens when many class mapping rules are generated for each RDF class and the number of triple groups in the query is also large. In this case, for each triple group of the query, there are many applicable mapping rules. Thus, there are many rewritings, since virtually all combinations produce valid rewritings, and a complete algorithm is forced to form an exponential number of rewritings. In the experiment illustrated in panel C of Fig. 7, we set up 10 sources, and for each source, 8 chained tables are mapped to 8 RDF classes, respectively. The figure shows that the cost of rewriting increases quickly as the number of triple groups and the number of sources increase. As can be seen, in the case of 8 groups, the cost reaches 25 seconds with only 4 sources.

4 Implementation and Application

The DartGrid [11][12] system uses the techniques described in the previous sections to provide a uniform RDF query interface to sets of relational data sources. Normal users interact with DartGrid through a Semantic Browser [13] that enables them to visually construct RDF queries. Fig. 8 illustrates an example from our TCM application, which shows how a user can specify an RDF query step by step. In the first step, the user selects the TCM Prescription class and three of its properties: name, dosage, and preparationMethod. In the second step, the user selects the Disease class and three properties: name, symptom, and pathogeny. Finally, the user inputs a constraint specifying that the name of the Disease is "influenza". Taken together, the query asks for the TCM prescriptions that can cure influenza.

Figure 8. Visually construct a RDF query. Two RDF classes are involved, TCM-Prescription and Disease. For readability, we translate Chinese terms into English ones.

To speed up the process of defining RDF views, a Visual Semantic Mapping tool has been developed. As Fig. 9 displays, users can use the registration panel (the right part of Fig. 9) to view relational schema definitions and use the semantic browsing panel (the left part of Fig. 9) to view RDF ontologies. Users then specify which RDF classes a table should be mapped to and which RDF property a column should be mapped to. Finally, the tool automatically generates an RDF view and submits it to a semantic registry.

The system has been deployed at the China Academy of Traditional Chinese Medicine and currently provides access to over 70 databases, including TCM herbal medicine databases, TCM compound formula databases, clinical symptom databases, a traditional Chinese drug database, a traditional Tibetan drug database, TCM product and enterprise databases, and so on. The current TCM ontology includes 28 RDF classes, 255 RDF properties, 9 rdfs:subClassOf constraints, and 25 rdfs:subPropertyOf constraints. We found that rdfs:subPropertyOf is very useful in practice. Indeed, about 30% of the 70 TCM databases share some similar properties with each other. Blank nodes are commonly used, and we found them very useful in the case of non-normalized tables. Although introducing blank nodes adds complexity to the query processing system, practical evaluation by our users shows that the system works well in most practical use cases. However, practical scalability still needs to be tested as the number of databases grows.

5 Related Work

In the context of Semantic Web research, a lot of work concerns mapping between RDF and the relational model. Some of it deals with the issue of using an RDBMS as RDF triple storage, such as the relational storage components of Jena or Sesame. This issue is not touched upon in this paper. Other work deals with the issue of integrating relational data using RDF, such as D2RMap [4], KAON REVERSE, D2RQ [1], and RDF Gateway⁶. However, none of these systems considers RDFS semantic constraints, and formal aspects such as query semantics and query complexity are not considered. Another issue they do not consider is the incompleteness of legacy databases. For example, both D2RQ and RDF Gateway define a declarative language to describe mappings. However, the mappings, as they define them, are simple, equivalent mappings: they consist of statements asserting that some portion of the relational data is equivalent to some portion of the RDF data. In contrast, the RDF views that we consider involve incomplete mappings, where each statement asserts that a relational source is an incomplete, partial view of the big model. Piazza [2] considers XML-to-XML and XML-to-RDF mappings. Goasdoue [3] considers the problem of answering queries using views for the Semantic Web, but that approach is more description-logic-oriented.

6 Summary and Future Work

This paper studies the problem of answering RDF queries using RDF views over incomplete relational databases under RDFS semantic constraints. We define a Target RDF Instance that satisfies all the requirements with respect to the given views and RDFS semantic constraints such as rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range, which are present in the RDF model but in neither the XML nor the relational model, and take the semantics of query answering to be the result of evaluating the query on this Target RDF Instance. In our approach, we highlight the important role played by RDF Blank Nodes in representing the incomplete semantics of relational data. A set of semantic tools and their application in the TCM domain are also reported. Future work includes extending the approach to more expressive RDF-based languages such as OWL, and investigating how mappings should evolve when the ontology evolves.

⁶ RDF Gateway: http://www.intellidimension.com

Figure 9. Visual Semantic Mapping Tool. The left is RDF ontology, and the right is a source relational table. User can visually specify the mappings from relational schema to RDF ontology.

References

[1] Christian Bizer. D2RQ system. Poster at ISWC 2004.

[2] Alon Y. Halevy et al. Peer Data Management Systems: Infrastructure for the Semantic Web. WWW 2003.

[3] Francois Goasdoue. Answering Queries using Views: a KRDB Perspective for the Semantic Web. ACM Transactions on Internet Technology, June 2003, pp. 1-22.

[4] Chris Bizer, Freie. D2R MAP - A Database to RDF Mapping Language. WWW 2003.

[5] A. Y. Halevy. Answering queries using views: A survey. Journal of VLDB, 2001; 10(4), 75-102.

[6] A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views. In PODS, 1995.

[7] Serge Abiteboul. Complexity of Answering Queries Using Materialized Views. PODS 1998, 254-263.

[8] Rachel Pottinger, Alon Y. Halevy. MiniCon: A Scalable Algorithm for Answering Queries Using Views. Journal of VLDB, 2001; 10(2-3), 182-198.

[9] Cong Yu and Lucian Popa. Constraint-based XML Query Rewriting for Data Integration. SIGMOD 2004, 371-382.

[10] A. Deutsch and V. Tannen. MARS: A system for publishing XML from mixed and redundant storage. VLDB 2003.

[11] Zhaohui Wu, Huajun Chen, et al. DartGrid: Semantic-based Database Grid. Lecture Notes in Computer Science, v3036, pp. 59-66, 2004.

[12] Zhaohui Wu, Huajun Chen, et al. DartGrid II: A Semantic Grid Platform for ITS. IEEE Intelligent Systems, vol. 20, no. 3, Jun. 2005.

[13] Mao Yuxin, Wu Zhaohui, Chen Huajun. Semantic Browser: An intelligent client for Dart-Grid. Lecture Notes in Computer Science 3036: 470-473, 2004.

[14] Zhou Xuezhong, Wu Zhaohui. Ontology development for Unified Traditional Chinese Medical Language System. Journal of AI in Medicine 32(1): 15-27, 2004.

[15] R. van der Meyden. Logical Approaches to Incomplete Information: A Survey. In Logics for Databases and Information Systems, pp. 307-356. Kluwer, 1998.

[16] Imielinski T., W. Lipski Jr. Incomplete Information in Relational Databases. J. ACM 31:4, 1984, pp. 761-791.