Probabilistic Data Integration

Matteo Magnani

Danilo Montesi

Technical Report UBLCS-09-10 March 2009

Department of Computer Science University of Bologna Mura Anteo Zamboni 7 40127 Bologna (Italy)

The University of Bologna Department of Computer Science Research Technical Reports are available in PDF and gzipped PostScript formats via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS.

Recent Titles from the UBLCS Technical Report Series

2008-10 Expressiveness of multiple heads in CHR, Di Giusto, C., Gabbrielli, M., Meo, M.C., April 2008.
2008-11 Programming service oriented applications, Guidi, C., Lucchi, R., June 2008.
2008-12 A Foundational Theory of Contracts for Multi-party Service Composition, Bravetti, M., Zavattaro, G., June 2008.
2008-13 A Theory of Contracts for Strong Service Compliance, Bravetti, M., Zavattaro, G., June 2008.
2008-14 A Uniform Approach for Expressing and Axiomatizing Maximal Progress and Different Kinds of Time in Process Algebra, Bravetti, M., Gorrieri, R., June 2008.
2008-15 On the Expressive Power of Process Interruption and Compensation, Bravetti, M., Zavattaro, G., June 2008.
2008-16 Stochastic Semantics in the Presence of Structural Congruence: Reduction Semantics for Stochastic Pi-Calculus, Bravetti, M., July 2008.
2008-17 Measures of conflict and power in strategic settings, Rossi, G., October 2008.
2008-18 Lebesgue's Dominated Convergence Theorem in Bishop's Style, Sacerdoti Coen, C., Zoli, E., November 2008.
2009-01 A Note on Basic Implication, Guidi, F., January 2009.
2009-02 Algorithms for network design and routing problems (Ph.D. Thesis), Bartolini, E., February 2009.
2009-03 Design and Performance Evaluation of Network on-Chip Communication Protocols and Architectures (Ph.D. Thesis), Concer, N., February 2009.
2009-04 Kernel Methods for Tree Structured Data (Ph.D. Thesis), Da San Martino, G., February 2009.
2009-05 Expressiveness of Concurrent Languages (Ph.D. Thesis), di Giusto, C., February 2009.
2009-06 EXAM-S: an Analysis tool for Multi-Domain Policy Sets (Ph.D. Thesis), Ferrini, R., February 2009.
2009-07 Self-Organizing Mechanisms for Task Allocation in a Knowledge-Based Economy (Ph.D. Thesis), Marcozzi, A., February 2009.
2009-08 3-Dimensional Protein Reconstruction from Contact Maps: Complexity and Experimental Results (Ph.D. Thesis), Medri, F., February 2009.
2009-09 A core calculus for the analysis and implementation of biologically inspired languages (Ph.D. Thesis), Versari, C., February 2009.

Probabilistic Data Integration

Matteo Magnani¹   Danilo Montesi¹

Technical Report UBLCS-09-10
March 2009

Abstract

In this paper we propose and experimentally evaluate a data integration approach where the uncertainty generated during the comparison and merging of the input data sources is included in the resulting mediated schema, and can be used to provide richer answers to the users. We describe a system implementing our method, and use it to empirically study the impact of uncertainty management on the effectiveness and efficiency of the data integration process. In particular, we test our approach on benchmark datasets, showing that by considering uncertainty we may increase the recall of the method, and on real databases, showing that it can be applied to large data sources.

1. Department of Computer Science, University of Bologna, Mura A. Zamboni 7, 40127 Bologna, Italy.

1 Introduction

Data integration is one of the most relevant and studied problems in the field of database management, as well as in related areas like the Semantic Web. It is the process of producing a single (mediated) database from a set of local data sources, to provide homogeneous access to the available data. Data integration activities have been at the basis of very important applications for decades: company database merging, data warehousing, and meta-search engines, to mention a few. At the same time, there have been several research efforts to develop semi-automated and automated data integration methods to support new and more complex scenarios, like peer-to-peer data repositories [20], dataspaces [14], network storage services like Google Base², and biological and scientific repositories³. Automatic data integration consists of creating a mediated database without (almost) any user intervention.

To introduce our approach, consider the two tables in Figure 1. It is quite evident that columns Name and ID match, which means that in the integrated table there will be a single column with all the names contained in the two tables. Column Telephone may contain business telephone numbers, in which case the resulting integrated table would be the one in Figure 2(a), or private telephone numbers, in which case we would obtain the integrated table illustrated in Figure 2(b). A traditional data integration system would generate one of these two tables, often asking a human user to choose the correct result. However, if the system is asked to autonomously make the choice, it may pick the wrong table and produce a wrong result, without any indication of how uncertain it is, and how careful we should be while using this data.

    Name   Business Phone        ID     Telephone
    John   555-4444              Mark   333-2222
    Mark   555-1111              Mary   555-6666

Figure 1. Two simple data sources

    (a)                          (b)
    Name   Business Phone        ID     Business Phone   Telephone
    John   555-4444              John   555-4444         null
    Mark   555-1111              Mark   555-1111         333-2222
    Mark   333-2222              Mary   null             555-6666
    Mary   555-6666

Figure 2. Two alternative integrations of the tables in Figure 1

In this paper we present an uncertain data integration process. As in many other methods, we first try to reduce the uncertainty generated while comparing the input data sources as much as possible, using a pool of software matchers and, if available, some limited user feedback. However, the remaining uncertainty is not thrown away, but is included in the resulting mediated schema. In the previous example, we would not choose one of the two possible integrated tables. On the contrary, we provide a compact representation of both alternatives, through an uncertain mapping between the two input tables. In this way, our system is able to store rich information like: Mark's business telephone numbers are certainly 555-1111, and possibly also 333-2222. This is very important when users have no time to manually check the data, when schemata are very large, and in general when the integration must be completely automated. In addition, the results produced by our system are similar to the kind of results a human being would produce, i.e., with varying degrees of confidence depending on his/her understanding of the data. More specifically, the contributions of this paper are:



2. http://base.google.com
3. http://www.ncbi.nlm.nih.gov/Entrez


1. The definition of a method of data integration that does not remove any of the uncertainty it generates.
2. Its experimental analysis, to highlight the richness of its results and to evaluate its effectiveness and efficiency. The system can be evaluated on-line on the benchmark datasets at the address: http://marcello.cs.unibo.it:8080/prodi/jsp/.

The paper is organized as follows. In the next section we review other works that have focused on the management of uncertainty in schema integration, pointing out the differences from our proposal. Then, in Section 3 we introduce our uncertain data integration process, and some details of the ProDI (Probabilistic Data Integration) system. In Section 4 we present extensive experimental results obtained on both artificial and real data sets. Finally, we summarize the main results of our research.

2 Related work

Automatic schema integration has been studied extensively [3, 25, 8, 22, 19, 26, 27, 2, 7], and is a very complex activity with many relevant sub-problems and applications [15]. In our system we have adapted several matching methods described in the literature to work in a probabilistic framework. The most important difference between these works and our approach is that in other methods uncertainty is manipulated and represented only until a choice is made to keep the most likely information, while we explicitly produce uncertain results. In this section we focus on other works from the database community that explicitly concern the management of uncertainty.

Probability theory has been used for many years in data integration. For example, the system described in [16] assigns probabilities to alternative relationships between pairs of schema objects. However, the characterizing feature of this and other early works using uncertainty theories in data integration is that, after probabilities have been evaluated, a threshold is used to select matching and non-matching objects. Therefore, the uncertainty generated during the integration process is lost. Probability theory has also been used in instance integration (entity reconciliation, or record linkage), and also in this case probabilities are used together with a decision model to choose exact mappings [6]. Another application of probability theory to the field of data integration is described in [11]. In this case probabilities do not characterize the uncertainty in the matching process and in the integrated schema, but are used to rank the local data sources with the aim of improving query processing. Data is not uncertain, and mappings between schema objects are well known.

In [24] the authors present a method of uncertain schema integration with a multi-matcher architecture, and keep the uncertainty modeled during the matching phase up to the merging step. Dempster-Shafer theory, an extension of probability theory, is used as the formalism to represent uncertainty, and the authors define how uncertainty is represented and manipulated at each step of the process. In this work the result of the integration process is not explicitly represented using a model for uncertain data, as in the method presented in the current paper, and the languages used to express local schemata and relationships between their objects are too complex to allow assumptions of probabilistic independence.

A complementary approach to this work is [18], where a data model for uncertain data is provided to represent the result of a data integration process. Here the authors assume to already have a method that performs the integration and evaluates its uncertainty (probability), which is one of the objectives of our paper. In addition, this work concerns the integration of instances, and not schemata, which are assumed to be equal. It can therefore be thought of as part of a method that matches schemata, and subsequently focuses on data mappings.

[28] and its extended version [29] focus on probabilistic schema matching. These papers do not provide a complete method of uncertain data integration, but focus on the implementation of probabilistic classifiers (matchers), which had not been covered in detail in [24, 18]. Also in these papers the probability assignment process is supported by experiments but remains somewhat arbitrary, and the aggregation of probabilities does not consider dependencies, but only the confidence we have in each classifier; it is therefore a way to weight the matchers. Some of the matchers used in our system have been inspired by this work.

Recently, the importance of managing uncertain information in data integration has become well recognized [15, 13, 12]. In [17] another approach to merge uncertain information has been proposed. The content of this work is analogous to [18], but more focused on the formalisms used to represent uncertainty. At the same time, some works have provided a first theoretical study of this problem [4, 9], showing that under some basic assumptions uncertain data integration may be tractable [9].

3 Data integration process

In Figure 3 we have represented the tasks composing our data integration process. This is made of three main tasks, i.e., wrapping, matching and merging, whose details are presented in the following. To give a first flavor of our approach, we can describe it by mentioning the uncertain concepts we generate and manipulate. Local schemata are wrapped into sets of schema objects. Each pair of schema objects is compared by all matchers, producing a probabilistic uncertain relationship for each matcher and pair. Then, the outcomes of the matchers are aggregated, so that we obtain a probabilistic uncertain relationship (pUR) for each pair. All pURs are aggregated to produce a probabilistic uncertain mapping (pUM). Finally, the mapping is used to define an uncertain schema, where instances are evaluated according to the uncertain relationships and to a data mapping.

Figure 3. Uncertain schema integration process

3.1 Data model

First, the two local data sources S1 and S2 are represented as sets of schema objects O(S1) and O(S2). Schema objects provide access to instances and metadata associated

to them. As an example, consider Figure 4, where we have represented in more detail the name attribute of the first table of Figure 1. This data model is very simple, but at the same time it allows the discovery of mappings between (for example) columns of different databases, which is a very useful capability. This representation can be applied to the main existing data models: relational (columns), ER (entities), XML (leaf nodes), and OWL (classes). The current version of the system supports relational databases, XML repositories (if constrained by XML Schemas) and OWL repositories.

    Name          Name: S1.name
    John          Type: varchar
    Mark          MaxLength: 4
                  AverageLength: 4
                  Size: 2

Figure 4. An example of schema object
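As a concrete illustration of this data model, the following Python sketch mirrors the schema object of Figure 4; the class and method names are ours, chosen for illustration, and are not necessarily those used in the ProDI implementation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SchemaObject:
        """A wrapped schema element, e.g., a relational column (cf. Figure 4)."""
        name: str                                # e.g., "S1.name"
        data_type: str                           # e.g., "varchar"
        instances: List[str] = field(default_factory=list)

        # Metadata of the kind shown in Figure 4, derived from the instances.
        def max_length(self) -> int:
            return max((len(i) for i in self.instances), default=0)

        def average_length(self) -> float:
            return (sum(len(i) for i in self.instances) / len(self.instances)
                    if self.instances else 0.0)

        def size(self) -> int:
            return len(self.instances)

    # The schema object of Figure 4:
    s1_name = SchemaObject("S1.name", "varchar", ["John", "Mark"])
    # s1_name.max_length() == 4, s1_name.average_length() == 4.0, s1_name.size() == 2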

3.2 Relationships

Now, each object from the first data source is compared with all the objects of the second (matching), to identify the relationships occurring between them. In the literature many kinds of relationships have been studied. The most complex are one-to-many relationships linking one schema object with two or more other objects, like name = firstName + lastName, or relationships based on expressions, like salary = yearly salary / 12. It has been recently shown that using these relationships does not increase the time complexity of query answering [9]. However, they are very difficult to identify, and only a few works have dealt with their discovery. Simple one-to-one relationships, the object of the majority of data integration studies and of this paper, can be fine-grained, like overlapping and subsumption, or more general, like match and not-match. We adopt the latter, which are the most studied, because they are useful in practice and at the same time computationally less expensive to identify.

In this paper we use a typical definition of the match relationship. Consider a mediated schema object like the one represented on the left-hand side of Figure 5. The two local objects on the right are views over it, i.e., they are incomplete representations of the global schema. When two local schema objects are views over the same mediated object, we say they match. This definition can be used to build a mediated schema a posteriori, starting from the analysis of the local data sources. In fact, if columns Business Phone and Telephone match, this means that they are incomplete, and that there is a global object with the missing information from both local columns. We call this mediated object an extension (ext) of the local columns. Therefore, instead of querying just the local column Business Phone we may query ext(Business Phone), i.e., all the instances of this concept contained in this and the other databases. In this case, we would also get the instances from the Telephone column. To represent the inherent uncertainty of automatic schema matching algorithms we use the concept of probabilistic uncertain relationship (pUR), which specializes our previous definition, first appeared in [24].





Definition 3.1 A probabilistic uncertain relationship (pUR) between two schema objects O and O' is a pair (Ω, P), where Ω is the set {match, not-match} and P is a probability distribution over Ω.

Example 3.2 Consider a pUR on the pair (S1.Business Phone, S2.Telephone): a high value of P(match) indicates a high confidence that these schema objects match, while a pUR on (S1.Business Phone, S2.ID) with a low value of P(match) indicates that these schema objects probably do not match.

    (a mediated object)          (two local objects)
    Business Phone               Business Phone    Telephone
    555-4444                     555-4444          333-2222
    555-1111                     555-1111          555-6666
    333-2222
    555-6666

Figure 5. Local schema objects are views over larger mediated schema objects
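Since Ω has only two elements, a pUR is fully determined by P(match). A minimal Python sketch follows; the class name and the numeric values are illustrative only, as the report does not fix them.

    from dataclasses import dataclass

    @dataclass
    class PUR:
        """Probabilistic uncertain relationship over {match, not-match} (Def. 3.1)."""
        left: str        # e.g., "S1.Business Phone"
        right: str       # e.g., "S2.Telephone"
        p_match: float   # P(match); P(not-match) = 1 - p_match

        def __post_init__(self):
            assert 0.0 <= self.p_match <= 1.0  # P must be a probability distribution

    # Illustrative values (the exact numbers are not fixed by the report):
    likely = PUR("S1.Business Phone", "S2.Telephone", 0.9)  # high confidence of a match
    unlikely = PUR("S1.Business Phone", "S2.ID", 0.1)       # probably not a match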

3.3 Matching schema objects

To perform the matching step, i.e., to assign probabilities to the alternative relationships, we do not use a monolithic matcher, but a pool of matchers (M in Figure 3), each with specific expertise on some properties of the analyzed objects, like type, name or structure. This data integration architecture is described in [7]. For each matcher, the generation of probabilities suffers from the well-known interpretation problems of probability theory, which affect nearly all probabilistic systems modeling cognitive activities: what does it mean to say that the probability of a match equals some specific number? This is an open problem, which we tackle in the traditional way: we make our values satisfy the probability axioms, and as a guideline to design new matchers we have defined a set of numeric constants representing typical degrees of belief, from IMPOSSIBLE to SURE. In this way, matcher designers do not have to deal with counter-intuitive numbers, and each matcher's outcome will be consistent with the output of the other matchers. In the current implementation the following matchers are available (a minimal matcher interface is sketched after the list):

- Cardinality(): compares the cardinality (number of instances) of the two schema objects. It is useful to discriminate between more complex relationships, but currently it is not used with match/not-match.

- NameSemantics(d,n): compares the names of the two schema objects using WordNet, to identify whether they refer to related concepts [10]. The parameters d and n limit the distance between the concepts and the number of traversed paths in the graph of concepts. The possible outcomes are that the first concept is a super- or sub-concept of the other, that they have a common super-concept (hypernym), or a common sub-concept (hyponym). Probabilities are assigned accordingly, and proportionally to the proximity of the identified concepts.

- SimpleName(): compares the names of the two schema objects from a syntactic point of view. In the following tests we have used a simple character-by-character comparison (without considering punctuation). This matcher can be used in association with the more sophisticated token-based matcher.

- TokenBased(): names are tokenized, splitting them on space characters and capital letters, and compared using the Jaccard distance.

- Instance(s): takes a set of instances of size s from both schema objects and checks whether they appear among the instances of the other object. Probabilities are assigned according to the number of common instances.

- Statistics(s): obtains a sample of instances from each schema object and compares their average lengths. Other statistics can be useful to compare numerical values, but have not been implemented at the moment.

- Type: compares the data types of the two schema objects.

- Input: applies user feedback expressed as sets of rules.

- Text: compares labels and comments associated to the schema objects. They are first tokenized on white spaces, then their Jaccard distance is evaluated (in other works we have used more sophisticated versions of this matcher, with TF-IDF distances, but we have not applied them to the following tests [5, 23]).

- Structural: uses a first uncertain mapping produced by the other matchers and adjusts the matching probabilities of adjacent objects; this is still a rudimentary version of the more complex structural matchers presented in the literature [26, 32].

- Composition: combines two or more matchers, modifying their outcomes according to their probabilistic dependencies.
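As an illustration, the sketch below shows what such a matcher interface could look like, together with a simplified token-based matcher using the Jaccard measure described above. The names and the numeric belief constants are our assumptions, not the published ProDI code.

    import re

    # Named degrees of belief, from IMPOSSIBLE to SURE; the numeric values are
    # placeholders, since the report does not publish the actual constants.
    IMPOSSIBLE, UNLIKELY, UNDECIDED, LIKELY, SURE = 0.0, 0.2, 0.5, 0.8, 1.0

    class Matcher:
        def p_match(self, o1, o2):
            """Return P(match) for a pair of schema objects (see earlier sketch);
            P(not-match) is the complement."""
            raise NotImplementedError

    class TokenBasedMatcher(Matcher):
        """Tokenize names on spaces/underscores and capital letters, then map the
        Jaccard similarity of the token sets onto the [UNLIKELY, LIKELY] range."""

        def tokens(self, name):
            parts = re.split(r"[\s_]+|(?<=[a-z])(?=[A-Z])", name)
            return {p.lower() for p in parts if p}

        def p_match(self, o1, o2):
            t1, t2 = self.tokens(o1.name), self.tokens(o2.name)
            jaccard = len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0
            return UNLIKELY + jaccard * (LIKELY - UNLIKELY)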

Example 3.3 Consider again our working example, and in particular the two local schema objects illustrated in Figure 5. The Type matcher will return a moderate probability of match, meaning only that the two data types are compatible: compatibility does not tell us that the two objects match, but it does not exclude a match either. The Statistics matcher will return a higher probability of match, because all instances from both columns have the same length.

3.4 Aggregation of uncertain relationships

After all matchers have compared all the pairs of schema objects, their outcomes are combined to produce a single relationship (matcher aggregation). In the absence of probabilistic dependencies, the aggregation can be performed using the rule described in [30]. However, this cannot be done in general, and we must take care of dependencies between the matchers. Assume we have three matchers: the first checks whether one of the names of the schema objects contains the other (M1), the second extracts from a dictionary all the synonyms of the two names and compares them (M2), and the third compares their instances (M3). If we run M1 on two schema objects with the same name, e.g., lake, it will return a high probability of matching. Now assume we also run M2 and M3 on the same objects, and both return the same probability of matching. When we combine them with the outcome of M1, they must obviously modify it differently: M3 gives us new evidence supporting the match, while M2 does not tell us anything new, because it analyzes the same feature already checked by M1. As another example, imagine running M1 N times. It would check the same thing N times, so we want to consider only the first analysis; or, more precisely, we want to condition the next runs on the first. Therefore, the combination must be computed carefully.

The aggregation of the probabilities generated by the matchers is in fact a critical point of the method and of its implementation. The problem is that we do not know the dependencies between different matchers a priori, and each time we add a new matcher to the pool we must specifically study its dependencies with the other matchers. To support this task, we use specific matchers. A basic CompositionMatcher provided by the system assumes independence, and its output is the independent combination of the outputs of its components, evaluated using the rule described in [30]. When there are dependencies, we must write specific components that run correlated matchers and aggregate their results appropriately. In our pool we have five main kinds of matchers. The first kind analyzes the names of the schema objects (SimpleName, NameSemantics, TokenBased). The second focuses on the instances (Type, Statistics and Instance). The third applies external knowledge (Input). The other two kinds analyze additional textual information (Text) and structural properties, like subclass relationships (Structural). When we run matchers of different kinds we can use the basic IndependentCompositionMatcher, while for other combinations we have implemented specific composers. For example, when the Type matcher finds that two types are incompatible, the Instance and Statistics matchers are not executed. Similar considerations apply to the other classes.

Figure 6. Organization of the matchers used in the tests
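The rule described in [30] is Dempster's combination rule; on the binary frame {match, not-match} with purely probabilistic (Bayesian) masses it reduces to a normalized product. A minimal sketch, assuming independent matchers (the function name is ours):

    def combine_independent(p1, p2):
        """Combine two independent P(match) assessments with Dempster's rule [30],
        specialized to Bayesian masses on {match, not-match}: the conflicting
        mass p1*(1-p2) + (1-p1)*p2 is discarded and the rest renormalized."""
        support = p1 * p2                    # both matchers support match
        against = (1.0 - p1) * (1.0 - p2)    # both matchers support not-match
        total = support + against
        return support / total if total > 0 else 0.5

    # Two mildly positive, independent matchers reinforce each other:
    combine_independent(0.7, 0.7)   # ~0.845
    # Dependent matchers (the same feature checked twice) must NOT be combined
    # this way; a composition matcher would keep only one of the two assessments.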

3.5 Merging probabilistic relationships

The second step of the matching phase is the production of a mapping. A mapping associates each pair of objects to a relationship, i.e., it is a function from the pairs of schema objects to {match, not-match}. In our approach, we keep all the uncertainty generated so far and compute an uncertain mapping:

Definition 3.4 Let O(S1) and O(S2) be two sets of schema objects. A probabilistic uncertain mapping (pUM) is a probability distribution over the set of all mappings between them.



Example 3.5 A possible mapping m1 of the two example tables is:

    (Name, ID) → match
    (Name, Telephone) → not-match
    (BusinessPhone, ID) → not-match
    (BusinessPhone, Telephone) → match

We may also have an alternative mapping m2, where the columns containing telephone numbers do not match:

    (Name, ID) → match
    (Name, Telephone) → not-match
    (BusinessPhone, ID) → not-match
    (BusinessPhone, Telephone) → not-match

An example of uncertain mapping assigns a probability to each alternative mapping, e.g., P(m1) = .6 and P(m2) = .4 (the value .6 for BusinessPhone match Telephone is the one used again in Example 3.6).
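Under the independence assumption discussed next, a pUM can be stored implicitly as the set of pURs, and the probability of any complete mapping is the product of its per-pair probabilities. A sketch with hypothetical function names:

    from itertools import product

    def mapping_probability(assignment, purs):
        """assignment maps each pair (o1, o2) to 'match' or 'not-match';
        purs maps each pair to its P(match). Independent pURs are assumed."""
        p = 1.0
        for pair, rel in assignment.items():
            p *= purs[pair] if rel == "match" else 1.0 - purs[pair]
        return p

    def enumerate_pum(purs):
        """Explicitly enumerate the pUM: every complete mapping with its
        probability (exponential in the number of pairs; illustration only)."""
        pairs = list(purs)
        for rels in product(("match", "not-match"), repeat=len(pairs)):
            assignment = dict(zip(pairs, rels))
            yield assignment, mapping_probability(assignment, purs)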

It is worth noticing that in general a pUM is not just a juxtaposition of pURs. In fact, also in this case we are obtaining one probability distribution from a set of distributions that may not be independent. For example, using an extended set of relationships the matchers may believe that schema objects S1.name and S2.ID are equivalent, and that S1.name and S2.Telephone

are equivalent as well, but that columns ID and Telephone are incompatible; the probability of this mapping would then be 0, and not the product of the probabilities locally assigned to the single relationships. However, when we use match and not-match relationships we may assume the absence of probabilistic dependencies between different pURs: as we have illustrated in Figure 7, all combinations of relationships about common schema objects are possible. This does not mean that there cannot be dependencies, but only that we can assume independence (which we cannot do with more complex schema languages and relationships). Dependencies could still be modeled, but they would make the process more complex. As an example, consider one column Price and two other columns SellingPrice and ShippingPrice. Without probabilistic dependencies, the system will probably return one of the three following mappings:

1. Price match SellingPrice, Price match ShippingPrice (meaning that Price contains both selling and shipping prices).
2. Price match SellingPrice, Price not-match ShippingPrice (meaning that Price contains only selling prices).
3. Price not-match SellingPrice, Price match ShippingPrice (meaning that Price contains only shipping prices).

In this example, a mutual-exclusion assumption (a form of dependency) would constrain the set of valid mappings, and the first mapping would no longer be possible; in practice, we would be assuming that each schema object matches at most one other. From this example, it should be clear that match/not-match relationships may have probabilistic dependencies, but they do not induce them, which means that we can work without dependencies, limiting the expressiveness of the data and mapping representation languages.

Figure 7. The match relationship does not induce probabilistic dependencies: every combination of relationships on common schema objects has a corresponding valid schema.

3.6 Computing an uncertain mediated schema and its probabilities

The discovered uncertain mapping is used to merge the data sources, i.e., to generate the integrated database (merging). Also in this case, the pUM represents the global schema only implicitly. We will introduce this point through an incremental example. Consider Figure 8, where


we assume the existence of a local schema object A. After the uncertain schema integration process, when we query A we also want to be able to retrieve ext(A), i.e., the instances of the objects matching A. If an instance i belongs to A, as in Figure 8(a), it also belongs to ext(A) with probability 1. If i belongs only to B, as in Figure 8(b), i will belong to ext(A) with probability P(A match B)⁴. Things get more difficult when i belongs to many objects matching A. In Figure 8(c) i is shared between schema objects B and C, and both match A. In this case, it is well known that the probability P(A match B or A match C) of i belonging to ext(A) equals P(A match B) + P(A match C) − P(A match B and A match C); this is one of the basic corollaries of probability theory. From this example we may conclude that:

1. With N matching objects O1, ..., ON containing i, the formula to compute its probability generalizes to the inclusion-exclusion principle:

   P(i ∈ ext(A)) = Σ_j P(A match O_j) − Σ_{j<k} P(A match O_j and A match O_k) + ... + (−1)^(N+1) P(A match O_1 and ... and A match O_N)

2. Because of our independence assumption, P(A match B and A match C) = P(A match B) · P(A match C).

3. The complexity of the computation of the probability of an instance is linear in the number of schema objects containing it. In fact, applying our independence assumption, the probability P_N of an instance i belonging to ext(A) given N matching data sources can be computed incrementally, in constant time per source, as:

   P_N = P_{N−1} + P(A match O_N) − P_{N−1} · P(A match O_N)    (1)

   This derives from the inclusion-exclusion principle, and can be proved by induction on the number of matching schema objects.

4. Relaxing the independence assumption, to include constraints not discussed in this paper, it is well known that the exact computation of the inclusion-exclusion principle is not tractable, and thus should be tackled using approximation algorithms.

Figure 8. Some alternative ways for an instance i to belong to ext(A)

In practice, the computation of the probabilities associated to the instances of uncertain mediated schema objects poses two main problems: first, we must be able to find all the schema objects containing each instance, and then we must use their probabilities to compute the inclusion-exclusion formula. Our approach to address this problem is iterative: we get the instances one by one, and each time we get a new one we insert it into a main-memory search structure (a red-black tree). Notice that this works well even for large datasets, like the one we used to test our prototype; for example, the title column of the article table of the DBLP database contains about 400,000 instances. If we assume an average length of 30 characters for each title, this occupies only about 12 MB. Then, when we get the next instance, we check if it is already present in the search structure. If it is not present, we add it to the tree with its current probability. Otherwise we update its probability according to the probability of the source schema object, using Equation (1). Basically, this corresponds to a relational GROUP BY, which can be implemented also with larger databases in secondary memory. Although the number of terms in the inclusion-exclusion formula is exponential in the number of schema objects, the fact that our relationships are probabilistically independent simplifies its evaluation drastically. In fact, for each instance the time complexity to compute its probability is linear in the number of schema objects containing it.

4. The probability that A matches B.
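A sketch of this iterative computation, with a plain dictionary standing in for the red-black tree; the function name is hypothetical, but the update is exactly Equation (1).

    def ext_probabilities(sources):
        """sources: iterable of (p_match, instances) pairs, where p_match is the
        probability that the source column matches A (1.0 for A itself).
        Returns {instance: P(instance in ext(A))}, computed with Equation (1)."""
        probs = {}  # stands in for the red-black tree of the implementation
        for p_match, instances in sources:
            for inst in instances:
                prev = probs.get(inst, 0.0)
                # P_N = P_{N-1} + p - P_{N-1} * p  (independence assumption)
                probs[inst] = prev + p_match - prev * p_match
        return probs

    # Reproduces Example 3.6 below:
    ext_probabilities([
        (1.0, ["555-4444", "555-1111"]),  # BusinessPhone itself
        (0.6, ["333-2222", "555-6666"]),  # Telephone, P(match) = .6
        (0.3, ["333-2222"]),              # the third column, P(match) = .3
    ])
    # -> {'555-4444': 1.0, '555-1111': 1.0, '333-2222': 0.72, '555-6666': 0.6}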



Example 3.6 Let us consider again the relationship BusinessPhone match Telephone, with P(match) = .6. In this case, the instances belonging to ext(BusinessPhone) would be:

    (555-4444, 1.0)
    (555-1111, 1.0)
    (333-2222, .6)
    (555-6666, .6)

Now, assume a third column containing again the telephone number 333-2222, and matching BusinessPhone with probability .3. The fact that 333-2222 is a business telephone number is now supported by both relationships, and its probability will be greater than both .6 and .3. In particular, applying the inclusion-exclusion principle it can be evaluated as .6 + .3 − .6 · .3 = .72, leading to the following uncertain mediated schema object:

    (555-4444, 1.0)
    (555-1111, 1.0)
    (333-2222, .72)
    (555-6666, .6)

4 Experimental results

In the following, we test some important features of the proposed approach. In particular, we answer two fundamental questions. The first question concerns the quality of our results: is it useful to save uncertain information, or, put another way, is the information we would otherwise lose relevant? The second question concerns efficiency: how much does it cost to manage uncertainty?

4.1 Data sets

The first data set we use in the experiments is the benchmark set of ontologies for the evaluation of ontology alignment systems, version 2007⁵. It contains one basic ontology, with a limited number of instances, and many other ontologies, both real and artificially obtained by modifying several features of the original data source. In the first set of experiments we compare our system against one of the best ontology alignment systems currently available, presenting our results on a representative set of tasks; the on-line prototype can be tested on all the other datasets. Then, we use additional datasets from the same benchmark to highlight some specific features of our approach, in particular the impact of uncertainty on the resulting mapping.

The second database is a snapshot⁶ of the well-known DBLP database, containing data about scientific publications in the areas of databases, logic programming, and related fields [21]. The DBLP database has been split into two views: one with a random sample of its data, and one with papers published before the year 2000 and with name changes. The original DBLP database contained about 60 schema objects and 13,000,000 instances (considering duplicates).



 

5. oaei.ontologymatching.org/2007/benchmarks/
6. Extracted in October 2007.

Figure 9. Precision-Recall using the benchmark datasets

4.2 Analysis of the most likely mapping

In Figure 9 we have indicated the precision and recall of the top-1 mapping automatically produced by our Probabilistic Data Integration (ProDI) system, compared with the results of the Falcon-AO system [31], one of the best performing approaches in the last ontology alignment competitions. Labels on the x axis refer to the identifier of the task in the benchmark, in particular:

262 No instances, no properties (attributes), no text comments, no structure, names replaced by random strings.
258 No instances, no text comments, no structure, names replaced by random strings.
249 No instances, no text comments, names replaced by random strings.
250 No properties (attributes), no text comments, names replaced by random strings.
204 Names expressed using different conventions (like PhDThesis and phd thesis).

In these experiments uncertainty is not directly taken into account: we consider only the most likely result of the system, and no other alternatives. In this way, we can have a snapshot of the basic performance of the matchers, and we can motivate the explicit representation of uncertainty. The first four tests clearly show that the two systems have different strengths and weaknesses, depending in our case on the specific matchers currently implemented. For instance, the naive matcher which takes into account the structure of the input datasets cannot discover complex structural patterns, on which Falcon-AO is specialized. Other matchers more specific to relational databases, such as complex name matching and instance comparisons, enable a greater accuracy of our system on different tasks. As a result, the two systems obtain the same results (measured by their precision and recall, as required in the benchmark) for test 262; Falcon-AO works better on test 249 (greater recall with approximately the same precision), while the ProDI system works better on test 250 (again, same precision but higher recall). Test 258 presents two results that are more difficult to compare, because one system is significantly more precise than the other, but has a smaller recall.

    howPublished      first published
    humanCreator      publisher
    Book              Booklet
    Book              Reference
    Report            technical report
    country           state
    MastersThesis     Phd thesis
    name              shortName
    endPage           startPage

Figure 10. Some of the match relationships in the top-1 mapping produced by the system and not considered as correct in the benchmark

Let us now consider task 204, the last in Figure 9. Here the recall is almost one for both systems, meaning that they discover nearly all the correct relationships. However, our system is more imprecise, i.e., it finds relationships not considered as correct in the benchmark. While this can seem like a malfunctioning of the system, looking at the discovered relationships gives us a first motivation to consider uncertainty in the result of the integration process. We have indicated some of these relationships (from this and other tests) in Figure 10, divided into three sets. Relationships like humanCreator match publisher can be considered wrong. However, relationships like country match state seem reasonable, even if they have not been included in the standard alignment. More interestingly, consider relationships like MastersThesis match Phd thesis. One could argue that these schema objects contain disjoint instances, and therefore do not match. Another could argue that they both contain theses, and therefore match. Basically, even for a human user it would not be evident which solution is correct; we may expect that both options are considered as possible alternatives. This fact limits the expressiveness of a crisp alignment, and opens the way to mappings that consider some options as possible, or probable, but not sure, which is exactly the result of our approach: representing this uncertainty makes the result of a data integration system much closer to the ones typically produced by human users. Specifically, previous work has highlighted that different people often produce different mappings for the same datasets, i.e., in real situations it is often difficult to state with certainty which mapping is correct [26]. In the following subsection we therefore consider uncertainty in our analysis.

4.3 Relationship between uncertainty and precision/recall

Basically, an uncertain schema may be seen as a set of possible solutions, or global schemata. In general, accepting only mappings with relationships whose probability is very high will result in a high precision: we accept only those we consider almost sure, as when they have the same name and many common instances. However, the recall will in general be lower. On the contrary, if we also accept relationships of lower probability, we increase the recall, but consider wrong relationships, decreasing precision. In Figure 11 we have represented the variation of recall when we increase the uncertainty, i.e., considering relationships with lower probability. From this test we can appreciate how this uncertainty contributes to the quality of the results: with precision 1 we can get about 24% of the correct relationships, which is the recall of the Falcon-AO system at precision .89. In our analysis, at precision .9 this is increased to almost 28%, and so on, as indicated in the figure. Notice that in our approach, where instances are annotated with their probability, users are aware of the fact that the increased recall may correspond to wrong instances in the mediated schema. We do not only produce more results, but also a measure of the confidence we should have in these results. Obviously, more uncertainty does not always correspond to better results: for example, if the matchers are not equipped to analyze a specific data set, their outcome will be wrong anyway. In the last experiment we have tried to identify negative cases, in which the representation of

Figure 11. Precision-Recall using the benchmark data sets
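For illustration, curves like those of Figure 11 can be obtained by sweeping a probability threshold over the pURs and scoring the accepted match relationships against the reference alignment; this sketch uses hypothetical names and is not the evaluation code actually used in the experiments.

    def precision_recall_curve(purs, reference,
                               thresholds=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
        """purs: {(o1, o2): P(match)}; reference: set of pairs considered correct
        in the benchmark. Lowering the threshold trades precision for recall."""
        curve = []
        for t in thresholds:
            accepted = {pair for pair, p in purs.items() if p >= t}
            tp = len(accepted & reference)          # true positives
            precision = tp / len(accepted) if accepted else 1.0
            recall = tp / len(reference) if reference else 1.0
            curve.append((t, precision, recall))
        return curve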

Figure 12. Precision-Recall varying the degree of uncertainty for two benchmark tests with different behaviors

uncertainty does not help. In Figure 12 we have illustrated two of these cases. Figure 12(a) corresponds to task 224. Also in this case, the management of uncertainty is useful to find some additional correct relationships. However, the recall is always high, meaning that most of the correct relationships are contained in the top mappings; uncertain data may be useful, but the "certain" result is already very good. Finally, Figure 12(b) shows a negative result, corresponding to task 249, which cannot be managed effectively by our matchers. In this case, most of the correct relationships are in the bottom mappings, therefore we cannot find them without decreasing precision too much.

4.4 Experimental execution time

In the previous tests we have shown the role of uncertainty in the integration of ontologies, i.e., semantically rich schemata. However, these datasets are small compared with many real databases: they contain only a few instances. Therefore, the question is whether this approach can also be used in practice on data sources with simpler schemata (like relational tables) but large numbers of instances. From the implementation of the matchers, we know that the complexity of producing probabilities instead of, or from, similarity degrees is constant; it only depends on the fixed size of



Figure 13. Time complexity of the schema matching phase, varying the number of pairs of schema objects: minimum (continuous line) and maximum (dashed line)

the samples. In fact, during the schema analysis phase each expert produces some values from which probabilities are assigned with simple if-then-else statements, like: "if some instances in a sample of schema object O also belong to O', then support match". Moreover, the combination of probabilities produced by different experts is also constant. However, in practice the matching process needs connections to database servers and the evaluation of random samples of instances, therefore we need an experimental evaluation to check the practical feasibility of the approach. In Figure 13 we have illustrated the execution time of our system on increasing subsets of views over the DBLP database; these experiments have been performed on a Linux 2.6 system with a 1 GHz CPU and 512 MB RAM. Because of the stochastic matchers, execution times may vary significantly, and we have thus represented two curves indicating the maximum and minimum results. The graph clearly shows the feasibility of the approach, and its tractable asymptotic complexity; notice that the x axis regards schema objects, whose number is usually far smaller than the number of instances.

4.5 Summary of the main findings

In this experimental analysis of our approach we have shown that 1) managing uncertainty does not significantly affect the time complexity of our method, and 2) when we discard uncertain information, it may contain correct relationships that would otherwise be lost during the schema integration process. Managing uncertainty does not necessarily mean that the results obtained considering more mappings at the same time will be better than using a single mapping: in this way we decrease the precision of the method, i.e., we consider more wrong relationships, and the number of errors may outweigh the increase in recall. In particular, we have seen that complex integration tasks are not improved significantly by the evaluation of probabilistic mappings. However, there are cases where considering the top mappings (the ones with higher probability) allows the discovery of more correct relationships while keeping a high precision. In summary, if the matchers are very good at integrating two data sets, the management of uncertainty is not crucial. If the matchers perform poorly, the management of uncertainty will not be useful. If the matchers perform well, but are undecided on some relationships, managing uncertainty allows us to get more correct relationships, and to produce more human-like results.

5 Final remarks

In this paper we have presented a complete method of schema integration with a new perspective on the management of uncertainty: instead of focusing on its removal, we deal with its representation and manipulation. The result of our approach is an uncertain schema, which can be managed using an uncertain database management system. Through experimental evaluations on real data sets we have shown that managing uncertainty 1) is useful to avoid losing important information, and 2) is computationally tractable under the assumptions presented in the paper. As a last consideration, there is another assumption to relax in the future to define a complete uncertain schema integration method: in the process described so far, local schemata were traditional databases. However, we can also think of integrating uncertain local schemata. This would assure the closure property of the process we are defining, which could then be applied iteratively to data sources obtained as the integration of others, in order to merge more than two input databases. In this way, this process could be used to extend the algebra of models defined in [1].

References

[1] Bernstein, P.A., Halevy, A.Y., Pottinger, R.A.: A vision for management of complex models. SIGMOD Rec. 29(4), 55-63 (2000). DOI http://doi.acm.org/10.1145/369275.369289
[2] Bernstein, P.A., Melnik, S., Churchill, J.E.: Incremental schema matching. In: VLDB Conference, pp. 1167-1170. VLDB Endowment (2006)
[3] Buneman, P., Davidson, S., Kosky, A.: Theoretical aspects of schema merging. In: A. Pirotte, C. Delobel, G. Gottlob (eds.) Proceedings of Advances in Database Technology (EDBT '92), LNCS, vol. 580, pp. 152-167. Springer, Berlin, Germany (1992)
[4] Calì, A., Lukasiewicz, T.: An approach to probabilistic data integration for the semantic web. In: Proceedings of the ISWC-URSW Conference (2006)
[5] Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: IIWeb, pp. 73-78 (2003)
[6] Dey, D., Sarkar, S., De, P.: A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering 14(3), 567-582 (2002)
[7] Do, H.H., Rahm, E.: Matching large schemas: Approaches and evaluation. Information Systems (2007). To appear
[8] Doan, A., Domingos, P., Halevy, A.Y.: Reconciling schemas of disparate data sources: A machine-learning approach. In: ACM SIGMOD Conference (2001)
[9] Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 687-698. ACM (2007)
[10] Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998)
[11] Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. In: VLDB Conference, pp. 216-225 (1997)
[12] Gal, A.: Managing uncertainty in schema matching with top-k schema mappings. Journal on Data Semantics VI, pp. 90-114 (2006)
[13] Gal, A.: Why is schema matching tough and what can we do about it? SIGMOD Rec. 35(4), 2-5 (2006)
[14] Halevy, A.Y., Franklin, M.J., Maier, D.: Principles of dataspace systems. In: Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 1-9 (2006)
[15] Halevy, A.Y., Rajaraman, A., Ordille, J.J.: Data integration: The teenage years. In: VLDB, pp. 9-16 (2006)
[16] Hayne, S., Ram, S.: Multi-user view integration system (MUVIS): An expert system for view integration. In: Proceedings of the Sixth International Conference on Data Engineering, pp. 402-409. IEEE Computer Society, Washington, DC, USA (1990)
[17] Hunter, A., Liu, W.: Fusion rules for merging uncertain information. Information Fusion 7(1) (2006)
[18] van Keulen, M., de Keijzer, A., Alink, W.: A probabilistic XML approach to data integration. In: ICDE (2005)
[19] Lenzerini, M.: Data integration: a theoretical perspective. In: PODS Conference (2002)
[20] Lenzerini, M.: Principles of P2P data integration. In: Third International Workshop on Data Integration over the Web (DIWeb), pp. 7-21 (2004)
[21] Ley, M.: DBLP bibliography server. Tech. rep., University of Trier (2003). URL http://dblp.uni-trier.de/
[22] Madhavan, J., Bernstein, P., Rahm, E.: Generic schema matching with Cupid. In: Proc. 27th VLDB Conference, pp. 49-58 (2001)
[23] Magnani, M., Montesi, D.: Integration of patent and company databases. In: IDEAS '07: Proceedings of the 11th International Database Engineering and Applications Symposium, pp. 163-171. IEEE Computer Society, Washington, DC, USA (2007)
[24] Magnani, M., Rizopoulos, N., McBrien, P., Montesi, D.: Schema integration based on uncertain semantic mappings. In: 24th International Conference on Conceptual Modeling, Lecture Notes in Computer Science, vol. 3716, pp. 31-46. Springer (2005)
[25] McBrien, P., Poulovassilis, A.: A formalisation of semantic schema integration. Information Systems 23(5), 307-334 (1998)
[26] Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: Proceedings of the ICDE Conference (2002)
[27] Melnik, S., Rahm, E., Bernstein, P.A.: Rondo: a programming platform for generic model management. In: Proc. SIGMOD 2003, pp. 193-204. ACM Press (2003)
[28] Nottelmann, H., Straccia, U.: sPLMap: A probabilistic approach to schema matching. In: ECIR, pp. 81-95 (2005)
[29] Nottelmann, H., Straccia, U.: Information retrieval and machine learning for probabilistic schema matching. Inf. Process. Manage. 43(3), 552-576 (2007)
[30] Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976)
[31] Hu, W., Zhao, Y., Li, D., Cheng, G., Wu, H., Qu, Y.: Falcon-AO: Results for OAEI 2007. In: ISWC Workshop on Ontology Matching (2007)
[32] Hu, W., Jian, N., Qu, Y., Wang, Y.: GMO: A graph matching for ontologies. In: K-CAP Workshop on Integrating Ontologies (2005)
