Semantic Data Integration for Knowledge Graph Construction at Query Time

Diego Collarana∗, Mikhail Galkin∗§, Ignacio Traverso-Ribón‡, Christoph Lange∗, Maria-Esther Vidal∗†, Sören Auer∗

∗ University of Bonn and Fraunhofer IAIS, Germany — {collaran,galkin,langec,vidal,auer}@cs.uni-bonn.de
† Universidad Simón Bolívar, Venezuela
‡ FZI Research Center for Information Technology, Germany — [email protected]
§ ITMO University, Russia

Abstract—The evolution of the Web of documents into a Web of services and data has resulted in an increased availability of data from almost any domain. For example, general domain knowledge bases such as DBpedia or Wikidata, as well as domain-specific Web sources like the Oxford Art archive, allow for accessing knowledge about a wide variety of entities, including people, organizations, and art paintings. However, these data sources publish data in different ways, and they may be equipped with different search capabilities, e.g., SPARQL endpoints or REST services, thus requiring data integration techniques that provide a unified view of the published data. We devise a semantic data integration approach named FuhSen that exploits the keyword and structured search capabilities of Web data sources and generates, on demand, knowledge graphs merging data collected from the available Web sources. The resulting knowledge graphs model the semantics, or meaning, of the merged data in terms of entities that satisfy the keyword queries, and relationships among those entities. FuhSen relies on RDF to semantically describe the collected entities, and on semantic similarity measures to decide which entities are related and should be merged. We empirically evaluate FuhSen's data integration techniques on data from the DBpedia knowledge base. The experimental results suggest that FuhSen accurately integrates semantically similar entities into knowledge graphs.

I. INTRODUCTION

The strong support that Web-based technologies have received from researchers, developers, and practitioners has resulted in the publication of data from almost any domain. Additionally, standards and technologies have been defined to query, search, and manage Web-accessible data sources. For example, Web access interfaces or APIs allow for querying and searching sources like DBpedia, Wikidata, or the Oxford Art archive. Web data sources make overlapping as well as complementary data available about entities, e.g., people, organizations, or art paintings. However, these entities may be described in terms of different vocabularies by different Web data sources, and data that correspond to the same real-world entity then need to be integrated in order to obtain a more complete description of that entity. In this paper, we devise semantic data integration techniques that exploit the semantics encoded in the properties of entities collected from Web data sources and provide a unified view of these entities. These techniques have been implemented

in FuhSen [1], a hybrid semantic search engine, to facilitate the integration of data collected from Web sources. FuhSen receives keyword-based queries and produces knowledge graphs on demand at query time; these graphs are composed of data merged using the proposed semantic integration techniques. FuhSen relies on RDF vocabularies to semantically describe the collected data and to identify RDF triples that correspond to properties of the same entity, i.e., sets of RDF triples that share the same subject and comprise an RDF molecule [2], [3]. FuhSen makes use of wrappers around the original sources to generate RDF molecules. Further, FuhSen exploits semantic similarity measures both to determine the relatedness among RDF molecules and to decide whether two RDF molecules should be merged into a single molecule in the on-demand knowledge graph. Similarity measures like the Jaccard coefficient compute the similarity of two RDF molecules based on the number of RDF triples shared by both molecules. In addition, semantics-aware similarity measures such as GADES [4] exploit the semantics encoded in the ontologies used to describe the RDF molecules in order to decide whether these molecules are semantically related. This semantic aggregation of search results allows for a more meaningful integration of the collected data and constitutes the main difference between FuhSen and existing approaches, which rely on triple-based integration (e.g., [5], [6]) to merge Web data. In summary, we make the following contributions to the problem of semantic data integration:



• An integration approach that utilizes semantic similarity measures to integrate RDF molecules that correspond to the same real-world entity. The semantic similarity measures consider both the RDF triples that compose an RDF molecule and the meaning of the predicates in these triples to determine when two RDF molecules are similar and should be integrated.
• An experimental study to evaluate the quality of the FuhSen data integration techniques. First, we compare triple-based approaches with FuhSen; experimental results indicate that FuhSen accurately integrates RDF molecules collected from the integrated sources. Moreover, we compare the impact of different semantic similarity measures, such as the Jaccard coefficient and GADES, on the accuracy of FuhSen. The observed results suggest that considering semantic similarity measures enhances the accuracy of the FuhSen data integration techniques.

The article is structured as follows: We motivate the relevance of considering semantics during data integration tasks in section II. Concepts required to understand the FuhSen data integration techniques are introduced in section III, and the problem we tackle is defined in section IV. Section V presents the RDF molecule based semantic data integration approach implemented in FuhSen. Section VI reports the results of our empirical evaluation, and existing approaches are reviewed in section VII. Finally, conclusions and directions for future work are given in section VIII.

II. MOTIVATING EXAMPLE

In the crime investigation process, collecting and analyzing information from different sources is a key step performed by investigators. Although scene analysis is always required, a crime investigation process can greatly benefit from searching information about people and products on the Web. Consider a case of counterfeit paintings by Eugenio Bonivento: investigators need to gather all the available information about the painter and his work. General domain knowledge bases such as DBpedia or Wikidata contain common information about Eugenio Bonivento, while domain-specific Web sources like the Oxford Art archive contain detailed information about his paintings. Figure 1 illustrates the RDF molecules of Eugenio Bonivento present in these different Web data sources. The DBpedia and Wikidata RDF molecules can be integrated to produce a complete profile of Eugenio Bonivento, while Oxford Art completes the information about his paintings. However, there are heterogeneity problems at the schema and data levels: each data source provides RDF molecules described in its own vocabulary (schema conflicts), and the same fact might be expressed differently (data conflicts), e.g., the dates in Figure 1. Currently, the process of data integration is performed by experts, and it is extremely cumbersome and time-consuming, as it requires accessing a large number of different data sources and setting up a whole integration infrastructure. To facilitate the integration of the data about Eugenio Bonivento, similarity measures able to decide on the relatedness of the corresponding RDF molecules, as well as on equivalent or complementary properties, are required. We present novel semantic data integration techniques that exploit state-of-the-art semantic similarity measures and integrate all the properties of Eugenio Bonivento into a single RDF molecule.

III. PRELIMINARIES

Given a keyword query, FuhSen creates a knowledge graph on demand at query time that represents the entities associated with the keywords in the query, and the relationships between these entities. A knowledge graph is composed of a set of entities, their properties, and relations among these entities. The Semantic Web technology stack provides the pieces required to define and build a knowledge graph. To properly introduce these concepts, we follow the notation of Arenas et al. [7], Pirrò [8], and Fernández et al. [2] to define RDF triples, knowledge graphs, and RDF molecules.

Definition 1 (RDF triple [7]): Let I, B, and L be disjoint infinite sets of URIs, blank nodes, and literals, respectively. A tuple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple, where s is the subject, p the predicate, and o the object.

Definition 2 (Knowledge Graph [8]): Given a set T of RDF triples, a knowledge graph is a pair G = (V, E), where V = {s | (s, p, o) ∈ T} ∪ {o | (s, p, o) ∈ T} and E = {(s, p, o) ∈ T}.

Definition 3 (RDF Subject Molecule [2]): Given an RDF graph G, an RDF subject molecule M ⊆ G is a set of triples t1, t2, ..., tn in which subject(t1) = subject(t2) = ... = subject(tn).

Definition 4 (Individual similarity measure [4]): Given a knowledge graph G = (V, E), two entities e1 and e2 in V, and a resource characteristic RC of e1 and e2 in G, an individual similarity measure Sim_RC(e1, e2) is a similarity function defined in terms of RC for e1 and e2.

Definition 5 (Aggregated similarity measure [4]): Given a knowledge graph G = (V, E) and two entities e1 and e2 in V, an aggregated similarity measure α for e1 and e2 is defined as

α(e1, e2 | T, β, γ) := T(β(e1, e2), γ(e1, e2))

where:
• T is a triangular norm (T-norm) [9];
• β and γ are aggregated or individual similarity measures.

IV. PROBLEM DEFINITION

In this paper, we leverage semantic similarity measures to address the following research problem: given a keyword query Q and a similarity threshold T, build a knowledge graph integrating heterogeneous data about the entities matching Q, merging only those entities whose semantic similarity is no less than T. Figure 1 presents three RDF molecules with data about Eugenio Bonivento collected from DBpedia, Wikidata, and Oxford Art, respectively. Each of the data sources applies its own approach to knowledge serialization; e.g., DBpedia employs human-readable URIs, whereas Wikidata encodes entities with auto-generated identifiers composed of letters and numbers, which are hard to comprehend without prior acquaintance with the Wikidata data model. Evidently, simple string similarity metrics will fail to identify a possible link among those molecules due to the lack of shared string literals. The semantics of the facts encoded in the RDF molecules has to be considered in order to truly grasp their similarity. In other words, a new, higher abstraction layer has to be established: a layer that operates on semantic knowledge instead of the symbols in which the knowledge is represented, and thereby allows for semantic similarity measures. The following section introduces and describes the architecture of FuhSen, a system that is capable of exploiting semantic similarity measures and of solving the semantic data integration problem described in this section.

V. AN RDF MOLECULE INTEGRATION APPROACH

As input, FuhSen receives a keyword query Q, e.g., Eugenio Bonivento, and a similarity threshold value T, e.g., 0.7.
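To make the notion of an RDF subject molecule (Definition 3) concrete before detailing the pipeline, the following minimal Python sketch (our own illustration, not FuhSen's API) groups a set of triples by subject:

```python
from collections import defaultdict

def to_molecules(triples):
    """Group triples by subject: each group is an RDF subject molecule
    (Definition 3), i.e., a set of triples sharing the same subject."""
    molecules = defaultdict(set)
    for s, p, o in triples:
        molecules[s].add((s, p, o))
    return dict(molecules)

triples = [
    ("dbr:Eugenio_Bonivento", "dbp:birthPlace", "dbr:Chioggia"),
    ("dbr:Eugenio_Bonivento", "rdfs:label", "Eugenio Bonivento"),
    ("wd:Q16554625", "wdt:P569", "1880-06-08"),
]
print(to_molecules(triples))  # two molecules: one per subject
```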

Fig. 1: RDF Molecules. Eugenio Bonivento, as represented by RDF molecules in different Web data sources: (a) DBpedia data, (b) Wikidata data, (c) Oxford Art data.

These input values are processed by the Query Rewriting module, which formulates the queries to be sent to the Search Engine module. The Search Engine queries several wrappers and transforms their output into RDF molecules. The preliminary results are enriched with additional knowledge in the RDF Molecules Enrichment module. Finally, the molecules, together with materialized induced facts, are integrated into a knowledge graph by the RDF Molecules Integration module. The integration module consists of three sub-modules, which are responsible for: 1) computing the semantic similarity of the molecules, 2) performing one-to-one perfect matching, and 3) integrating the similar RDF molecules. We describe each module in detail in the sequel.

A. OntoFuhSen vocabulary

The OntoFuhSen vocabulary (https://w3id.org/eis/vocabs/fuhsen#) allows for describing a user's search activities, the data sources, and the entities in the federation. The vocabulary is divided into the following three modules: (1) Search engine metadata: comprises classes modeling a user's search activity (e.g., fs:Search, fs:SearchableEntity). This module takes the provenance of resources into account: to enable provenance tracking, classes of the PROV standard vocabulary (http://www.w3.org/ns/prov) have been extended to model the provenance of the information related to a user's activities during a search process. (2) Data source metadata: contains classes describing Web API services and access points (e.g., fs:Parameter, fs:Operation). They model the data sources from which the data is collected (e.g., Facebook or Twitter). (3) Domain-specific metadata: includes classes for describing the results collected by FuhSen during keyword query processing; for the crime domain, concepts include gr:ProductOrService and org:Organization. The FuhSen vocabulary reuses existing well-known ontologies, e.g., terms from FOAF (http://xmlns.com/foaf/spec/) and Schema.org (http://schema.org/).
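Before detailing each module, the overall flow can be sketched as follows (a minimal illustration with assumed interfaces; rewrite, search, enrich, and integrate are our names, not FuhSen's actual Scala/Java API):

```python
def fuhsen(query: str, threshold: float, wrappers, similarity):
    """Illustrative end-to-end flow: rewrite -> search -> enrich -> integrate.
    `wrappers` and `similarity` are assumed interfaces."""
    molecules = []
    for wrapper in wrappers:                    # Query Rewriting + Search Engine
        source_query = wrapper.rewrite(query)   # e.g., SPARQL or REST request
        molecules.extend(wrapper.search(source_query))
    molecules = [enrich(m) for m in molecules]  # RDF Molecules Enrichment
    return integrate(molecules, similarity, threshold)  # RDF Molecules Integration

def enrich(molecule):
    """Placeholder: annotate literals and link entities (section V-D)."""
    return molecule

def integrate(molecules, similarity, threshold):
    """Placeholder: pairwise similarity, 1-1 matching, and union of matched
    molecules (sections V-F to V-H); sketched separately below."""
    raise NotImplementedError
```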

B. Query Rewriting

This component transforms the initial keyword query into queries that the wrappers understand. Using the data source descriptions in the OntoFuhSen vocabulary, the initial query is transformed into, e.g., a SPARQL query or a REST API request, depending on the source. The final list of queries is sent to the Search Engine component.

C. Search Engine and Wrappers

The Search Engine orchestrates the data extraction process using RDF wrappers and stores the resulting RDF molecules in an in-memory graph. The Search Engine receives the keyword query and, based on the data source descriptions defined in terms of the OntoFuhSen vocabulary, orchestrates the creation of RDF molecules asynchronously. Requests to the RDF wrappers are created based on the Web APIs of the data sources (e.g., https://wrapper-url/ldw/oxford/search?query=Eugenio+Bonivento), whose wrappers are described in terms of OntoFuhSen as explained above. Once a result has been received from a wrapper, a request to aggregate it into the results knowledge graph is sent to the vocabulary-based aggregator component. The aggregator creates an in-memory RDF graph containing the RDF molecules, in which all responses produced by the RDF wrappers are aggregated and described using OntoFuhSen. This vocabulary-based approach keeps the data aggregation task relatively simple.

D. RDF Molecules Enrichment

Once the RDF molecules have been constructed, FuhSen allows for additional quality improvement by enriching them with new facts acquired through the typing process [10]. It is thus possible to attach additional semantic information to the knowledge graph, e.g., location information. For example, the substring "Italy" of a tweet can be recognized and annotated with resources from other knowledge graphs, such as the DBpedia resource for Italy (http://www.dbpedia.org/resource/Italy).
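As a toy illustration of this annotation step, a trivial gazetteer lookup can stand in for a real annotation service such as DBpedia Spotlight (the GAZETTEER table is our assumption, not part of FuhSen):

```python
# Hypothetical gazetteer mapping surface forms to knowledge graph resources.
GAZETTEER = {
    "Italy": "http://www.dbpedia.org/resource/Italy",
    "Milan": "http://dbpedia.org/resource/Milan",
}

def annotate(text: str):
    """Return (surface form, resource URI) pairs recognized in the text."""
    return [(term, uri) for term, uri in GAZETTEER.items() if term in text]

print(annotate("Exhibition of paintings from Italy opens in Milan"))
```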

Fig. 2: FuhSen Architecture. FuhSen receives a keyword query Q (e.g., Q = Eugenio Bonivento) and a threshold T (e.g., T = 0.7), and produces a knowledge graph G populated with the entities associated with the keywords in the query and their relationships. Input queries are rewritten into queries understandable by the available data sources. Wrappers are used to collect the data from the relevant sources and to create RDF molecules. Values of semantic similarity measures are computed pairwise among RDF molecules, and the 1-1 weighted perfect matching is computed to determine the most similar RDF molecules. RDF molecules connected by an edge in the solution of the 1-1 weighted perfect matching are merged into a single RDF molecule in the knowledge graph G.

One of the default, built-in advantages of the on-demand knowledge graphs built by FuhSen is provenance information, which allows tracing the origin of a given fact back to a given source. Additionally, on-demand knowledge graphs can be enriched by mining new facts from the existing ones using graph analysis algorithms. Moreover, such knowledge graphs are able to evolve over time according to the changes appearing in the source datasets; the ingestion and propagation of updates are therefore tasks to be addressed by FuhSen. FuhSen identifies named entities and tries to link them to semantic entities from external knowledge bases in the Linked Data Cloud. DBpedia Spotlight [11], a well-established entity annotation tool, is used for this purpose; it combines named entity recognition and disambiguation based on the DBpedia linked dataset. The second tool employed during enrichment is the Silk framework [6]; besides mapping, it allows for entity linking among several datasets. Given source and target datasets acquired from different wrappers, we check whether they semantically describe the same entity; if they do, we enrich each molecule with the properties of the other and annotate the subjects of the molecules with the owl:sameAs and rdfs:seeAlso properties. In our example linking rule, we compare three properties (foaf:name, foaf:birthday, and foaf:gender) of two different datatypes (xsd:string and xsd:date). A threshold value indicates the minimal similarity value to be taken into consideration by the linking engine, and a weight value represents the degree of importance assigned to each comparison, affecting the final similarity value. When comparing names from the source and target datasets, we leave room for possible differences in spelling by increasing the granularity parameter; the same applies to genders. When comparing birthdays, we check for exact equality of the property values. Finally, we compute a weighted average of the individual similarity values: if the resulting value exceeds a threshold of 0.95, we consider the entities the same; if it is between 0.5 and 0.95, we ask the engine to wait for a human evaluation; if it is less than 0.5, we conclude that the entities are not similar. In addition to this manually defined linking routine, Silk employs genetic algorithms to automatically construct the most effective rules, i.e., those with the highest precision, recall, and F-measure, to interlink two arbitrary datasets.
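The following sketch mirrors the logic of this linking rule in Python (the weights and the string metric are illustrative assumptions; the actual rule is expressed in Silk's own configuration language):

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Stand-in for a string metric such as Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_decision(src: dict, tgt: dict) -> str:
    """Weighted-average rule over foaf:name, foaf:birthday, foaf:gender.
    The weights are illustrative; Silk reads them from the linkage rule."""
    comparisons = [
        (0.5, string_sim(src["foaf:name"], tgt["foaf:name"])),
        (0.3, 1.0 if src["foaf:birthday"] == tgt["foaf:birthday"] else 0.0),
        (0.2, string_sim(src["foaf:gender"], tgt["foaf:gender"])),
    ]
    score = sum(w * s for w, s in comparisons) / sum(w for w, _ in comparisons)
    if score > 0.95:
        return "same"      # annotate with owl:sameAs
    if score >= 0.5:
        return "review"    # defer to human evaluation
    return "different"

src = {"foaf:name": "Eugenio Bonivento", "foaf:birthday": "1880-06-08", "foaf:gender": "male"}
tgt = {"foaf:name": "E. Bonivento", "foaf:birthday": "1880-06-08", "foaf:gender": "male"}
print(link_decision(src, tgt))  # -> 'review' for these values
```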

E. RDF Molecule Integration

This module constructs a knowledge graph out of the enriched molecules. The input of the module is a set of molecules, and the output is an integrated RDF graph. The module consists of three sub-modules, namely the Semantic Similarity sub-module, the Perfect Matching sub-module, and the Integration sub-module. We describe each sub-module in detail below.

F. Computing Similarity of RDF Molecules

Similar molecules should be interlinked in order to create a fused, universal representation of a given entity. In contrast to triple-based linking engines like Silk [6], we employ a molecule-based approach that raises the abstraction level and considers the semantics of molecules. That is, we do not work with independent triples, but rather with the set of triples belonging to a given subject. The molecule-based approach allows for a natural clustering of a knowledge graph, reducing the complexity of the linking algorithm.

Fig. 3: The 1-1 Weighted Perfect Matching Problem: (a) bipartite graph; (b) 1-1 weighted perfect matching. The algorithm to compute the 1-1 weighted perfect matching receives as input a weighted bipartite graph where weights represent the values of a similarity measure between the RDF molecules in the bipartite graph. The output of the algorithm is a maximal matching of the RDF molecules in the bipartite graph, where each RDF molecule is matched to exactly one RDF molecule; the edges in the matching have a maximal value.

1) The Jaccard Similarity Measure: We use the Jaccard coefficient to compute a similarity score for two molecules. Let A be an RDF molecule with a set T1 of n property-value pairs (i.e., |T1| = n), and let B be an RDF molecule with a set T2 of k property-value pairs (i.e., |T2| = k). The Jaccard similarity is then computed as:

Jaccard(A, B) = |T1 ∩ T2| / |T1 ∪ T2|

The intersection contains only those ⟨property, value⟩ pairs that are present in both T1 and T2; the union contains all unique ⟨property, value⟩ pairs.
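A minimal sketch of this measure, representing each molecule as a set of ⟨property, value⟩ pairs:

```python
def jaccard(molecule_a: set, molecule_b: set) -> float:
    """Jaccard similarity over sets of (property, value) pairs."""
    if not molecule_a and not molecule_b:
        return 0.0
    return len(molecule_a & molecule_b) / len(molecule_a | molecule_b)

a = {("rdfs:label", "Eugenio Bonivento"), ("dbp:nationality", "Italian")}
b = {("rdfs:label", "Eugenio Bonivento"), ("wdt:P569", "1880-06-08")}
print(jaccard(a, b))  # 1 shared pair out of 3 unique pairs -> 0.333...
```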

2) Semantic Similarity Measure: GADES [4] is a semantic similarity measure used to compare entities in a knowledge graph. GADES considers three different aspects: the class hierarchy, the neighbors of the entities, and the specificity of the entities. Thus, GADES is defined as a combination of three similarity values, Sim_hier, Sim_neigh, and Sim_spec, which can be combined with different T-norms, such as the product or the average, depending on the domain. In the case of RDF molecules, we define GADES as:

GADES(A, B) = (Sim_hier(A, B) + Sim_neigh(A, B)) / 2

Hierarchical similarity. Given a knowledge graph G, the hierarchy is induced by the set of hierarchical edges, i.e., the subset of knowledge graph edges whose property names refer to a hierarchical relation, e.g., rdf:type or rdfs:subClassOf. In the case of DBpedia, following Lam et al. [12], the Wikipedia Category Hierarchy is used to determine the hierarchical similarity between two entities; the hierarchy is thus induced by the relations skos:broader and dc:subject. Given this hierarchy, d_tax [13] is used by GADES to measure the hierarchical similarity between two entities.

Neighborhood similarity. The neighborhood of an entity e in an RDF molecule M is defined as the set of property-object pairs included in the triples of the molecule:

N(e) = {(p, o) | (s, p, o) ∈ M}

Thus, there are two types of neighbors: URIs, representing entities, and literals, representing attributes. This definition of neighborhood allows for considering the neighbor entity together with the relation type of the edge. GADES uses the knowledge encoded in the relation and class hierarchies of the knowledge graph to compare two pairs n1 = (p1, o1) and n2 = (p2, o2). The similarity between two pairs n1 and n2 is computed as follows:
• If o1 and o2 are URIs, GADES uses a hierarchical similarity measure between the URIs:

Sim_pair(n1, n2) = (Sim_hier(o1, o2) + Sim_hier(p1, p2)) / 2

• If o1 and o2 are literals, GADES uses the Jaro-Winkler similarity measure between the literals:

Sim_pair(n1, n2) = (Sim_JaroWinkler(o1, o2) + Sim_hier(p1, p2)) / 2

In order to maximize the similarity between two neighborhoods, GADES combines the pair comparisons as:

Sim_neigh(e1, e2) =
  ( Σ_{n_i ∈ N(e1)} max_{n_x ∈ N(e2)} Sim_pair(n_i, n_x)
  + Σ_{n_j ∈ N(e2)} max_{n_y ∈ N(e1)} Sim_pair(n_j, n_y) )
  / ( |N(e1)| + |N(e2)| )
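A simplified sketch of this neighborhood similarity (illustrative only: Sim_hier is stubbed to exact matching instead of d_tax, and difflib's ratio stands in for Jaro-Winkler):

```python
from difflib import SequenceMatcher

def sim_hier(a: str, b: str) -> float:
    """Stub for the taxonomic similarity d_tax over the class/relation
    hierarchy; here simply 1.0 iff the two URIs are identical."""
    return 1.0 if a == b else 0.0

def sim_pair(n1, n2) -> float:
    (p1, o1), (p2, o2) = n1, n2
    if o1.startswith("http") and o2.startswith("http"):  # URI neighbors
        obj_sim = sim_hier(o1, o2)
    else:                                                # literal neighbors
        obj_sim = SequenceMatcher(None, o1, o2).ratio()  # ~ Jaro-Winkler
    return (obj_sim + sim_hier(p1, p2)) / 2

def sim_neigh(n_e1: set, n_e2: set) -> float:
    """Best-match average over both neighborhoods, as in the formula above."""
    if not n_e1 or not n_e2:
        return 0.0
    total = sum(max(sim_pair(n, m) for m in n_e2) for n in n_e1)
    total += sum(max(sim_pair(n, m) for m in n_e1) for n in n_e2)
    return total / (len(n_e1) + len(n_e2))

n1 = {("rdfs:label", "Eugenio Bonivento"), ("dbp:birthPlace", "http://dbpedia.org/resource/Chioggia")}
n2 = {("rdfs:label", "Eugenio Bonivento"), ("wdt:P19", "http://www.wikidata.org/entity/Q490")}
print(sim_neigh(n1, n2))
```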

G. The 1-1 Weighted Perfect Matching

Given a weighted bipartite graph BG of RDF molecules, where weights correspond to values of semantic similarity between the RDF molecules in BG, a matching of BG corresponds to a set of edges that do not share an RDF molecule and in which each RDF molecule of BG is incident to exactly one edge of the matching. The 1-1 weighted perfect matching problem of BG consists of finding a matching where the sum of the weights of the edges in the matching is maximal [14]. The Hungarian algorithm [15] computes the 1-1 weighted perfect matching.

Figure 3a illustrates the input of the algorithm, where BG comprises edges between RDF molecules, while Figure 3b represents the final state. RDF molecules with the maximal values of similarity are mapped in pairs in the solution of the 1-1 weighted perfect matching and are considered as RDF molecules to be merged. To determine the minimal value of similarity for RDF molecules that may be considered similar, a threshold T in the range [0, 1] is used; edges with weights less than T are treated as having weight 0.0 by the 1-1 weighted perfect matching algorithm.
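A minimal sketch of this step using SciPy's linear_sum_assignment, which solves the same assignment problem as the Hungarian algorithm; the thresholding follows the description above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_molecules(sim: np.ndarray, threshold: float):
    """1-1 weighted matching over a similarity matrix sim[i][j] between the
    molecules of two sources; weights below the threshold are zeroed out."""
    weights = np.where(sim >= threshold, sim, 0.0)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    # Keep only pairs whose similarity actually passes the threshold.
    return [(i, j) for i, j in zip(rows, cols) if weights[i, j] > 0.0]

sim = np.array([[0.8, 0.7],
                [0.9, 0.9]])
print(match_molecules(sim, threshold=0.7))  # -> [(0, 0), (1, 1)]
```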

TABLE I: Description of datasets (500 molecules).

                 DataSet1   DataSet2   Gold
Size (MB)        2.3        2.3        3.2
RDF Molecules    500        500        500
Triples          14,692     14,705     20,936

TABLE II: Description of datasets (20,000 molecules).

                 DataSet1   DataSet2   Gold
Size (MB)        86.1       85.9       124
RDF Molecules    13,242     13,391     20,000
Triples          553,059    552,425    829,184

H. Integration function

Once similar molecules have been identified under the desired conditions, the last step of the pipeline is to integrate them into an RDF knowledge graph. The resulting knowledge graph contains all the unique facts of the analyzed set of molecules. The integration function implemented in FuhSen is the union, i.e., the logical disjunction, of the molecules identified as similar during the previous steps.

VI. EXPERIMENTAL EVALUATION

We empirically study the effectiveness of FuhSen on the problem of integrating RDF molecules. We assess the following research questions: RQ1) Can the molecule-based integration technique implemented in FuhSen integrate data into a knowledge graph more accurately than triple-based integration techniques? RQ2) Is the accuracy of the molecule-based integration technique implemented in FuhSen impacted by the similarity measures used to determine the relatedness of the integrated molecules? The experimental configuration to evaluate these research questions is as follows:

Gold Standard (GS): The ground truth dataset was extracted from the live version of DBpedia (July 2016). To also evaluate scalability, we created two subsets of the ground truth. The first gold standard contains 500 molecules of type Person (http://dbpedia.org/ontology/Person), i.e., 500 subjects with all available properties and their values; the overall number of triples is 20,936. The second gold standard contains 20,000 molecules of type Person, which results in 829,184 triples. The gold standards are used to compute precision and recall during the evaluation.

Test Datasets (TS): The molecules from the gold standard, with their properties and values, were randomly split between two test datasets. Each triple is randomly assigned to one or both test datasets; the selection process takes two steps: 1) the number of test datasets to copy a triple to is chosen randomly under a uniform distribution; 2) the chosen number is used as the sample size to randomly select the particular test datasets the triple is written to. URIs are generated specifically for each test dataset. Eventually, each test dataset contains a subset of the properties in the gold standard; the subset of properties of each molecule is composed randomly using a uniform distribution.

A small tweak was applied to the first gold standard in order to make both test datasets contain 500 molecules; nevertheless, properties were still assigned randomly to each test dataset. Tables I and II provide additional statistics on the data sources.

Metrics: We measure the behavior of FuhSen in terms of the following metrics: a) Precision is the fraction of the RDF molecules identified and integrated by FuhSen (M) that intersects with the Gold Standard (GS):

Precision = |M ∩ GS| / |M|

b) Recall is the cardinality of the intersection of the integrated molecules and the Gold Standard, divided by the cardinality of the Gold Standard:

Recall = |M ∩ GS| / |GS|

c) F-measure is the harmonic mean of Precision and Recall.

Implementation: Experiments were run on a Windows 8 machine with an Intel i7-4710HQ 2.5 GHz CPU and 16 GB 1333 MHz DDR3 RAM. We implemented FuhSen and the Jaccard similarity measure in Scala and Java; the transformation of the RDF molecules was implemented using Jena in Java 1.8. The FuhSen framework and the test sets evaluated in this experiment are publicly available (https://github.com/LiDaKrA/RDF-Molecules-Experiment).

Discussion: The goal of this experiment is to answer our research questions RQ1 and RQ2. FuhSen is run on the two test sets of different sizes to compute the similarity among molecules with a triple-based approach, implemented by Jaccard, and a molecule-based one, implemented by GADES. Precision, recall, and F-measure are computed against the Gold Standard; Table III reports the values of these metrics for 500 molecules, and Table IV contains the values for 20,000 molecules. A wide variety of results is observed. Jaccard demonstrates lower performance on both datasets, as it relies only on the particular properties of the RDF molecules: Jaccard does not utilize the semantics encoded in the knowledge graph and cannot be used as a 'black box' to compute the similarity between arbitrary sets of molecules without prior knowledge of their data model.
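For illustration, these metrics can be computed over sets of integrated molecules as follows (a simplification of the actual evaluation setup):

```python
def precision_recall_f1(integrated: set, gold: set):
    """Precision, recall, and F-measure as defined above."""
    hits = len(integrated & gold)
    precision = hits / len(integrated) if integrated else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1({"m1", "m2", "m3"}, {"m1", "m2", "m4"}))
# -> (0.666..., 0.666..., 0.666...)
```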

TABLE III: Effectiveness of FuhSen on 500 RDF molecules: Jaccard triple-based integration vs. GADES semantic integration, using different thresholds (T).

Precision
          T0.0   T0.1   T0.2   T0.3   T0.4   T0.5   T0.6   T0.7   T0.8   T0.9
Jaccard   0.77   0.84   0.55   0.43   0.45   0.45   0.62   0.4    0.4    0.4
GADES     0.81   0.86   0.86   0.86   0.86   0.86   0.86   0.87   0.83   0.87

Recall
          T0.0   T0.1   T0.2   T0.3   T0.4   T0.5   T0.6   T0.7   T0.8   T0.9
Jaccard   0.77   0.5    0.1    0.05   0.03   0.03   0.01   0.004  0.004  0.004
GADES     0.81   0.81   0.81   0.81   0.81   0.81   0.77   0.59   0.26   0.07

F-Measure
          T0.0   T0.1   T0.2   T0.3   T0.4   T0.5   T0.6   T0.7   T0.8   T0.9
Jaccard   0.77   0.63   0.17   0.1    0.06   0.06   0.02   0.008  0.008  0.008
GADES     0.81   0.84   0.84   0.84   0.84   0.84   0.81   0.70   0.40   0.13

TABLE IV: Effectiveness of FuhSen on 20,000 RDF molecules: Jaccard triple-based integration vs. GADES semantic integration, using different thresholds (T).

Precision
          T0.0   T0.1   T0.2   T0.3   T0.4   T0.5   T0.6   T0.7   T0.8   T0.9
Jaccard   0.72   0.77   0.44   0.34   0.37   0.36   0.27   0.21   0.21   0.21
GADES     0.76   0.80   0.80   0.79   0.79   0.79   0.79   0.76   0.70   0.65

Recall
          T0.0   T0.1   T0.2   T0.3   T0.4   T0.5   T0.6   T0.7   T0.8   T0.9
Jaccard   0.72   0.42   0.09   0.05   0.02   0.02   0.01   0.01   0.01   0.01
GADES     0.76   0.76   0.76   0.76   0.76   0.76   0.68   0.46   0.22   0.06

F-Measure
          T0.0   T0.1   T0.2   T0.3   T0.4   T0.5   T0.6   T0.7   T0.8   T0.9
Jaccard   0.72   0.54   0.15   0.08   0.04   0.04   0.02   0.02   0.02   0.02
GADES     0.76   0.78   0.78   0.77   0.77   0.77   0.73   0.57   0.33   0.11

On the other hand, GADES might be used as such a 'black box', as it does not require any metadata or knowledge of the schema. Nevertheless, the performance depends on the threshold parameter. As a simple set-based approach, the performance of the Jaccard similarity (precision, recall, and F-measure) decreases quickly with higher thresholds: at low thresholds, only one or two common triples between molecules are sufficient to mark the molecules as similar even though the remaining properties and values differ, while higher thresholds increase the number of common triples necessary to classify molecules as similar. GADES, in contrast, leverages higher semantic abstraction layers involving hierarchies and neighborhoods, and is capable of maintaining stable performance and quality for thresholds up to 0.7, regardless of the size of the datasets. The drop at higher thresholds is explained by an insufficient number of common triples, which serve as the basis for the materialization of the class hierarchy, the property hierarchy, and the neighborhoods. RQ1 is therefore confirmed: the molecule-based technique integrates data more accurately than the triple-based one, whose quality deteriorates quickly across the range of thresholds. The accuracy of the molecule-based integration approach (RQ2) is indeed affected by the similarity measure and its parameters, as shown in Table III and Table IV.

VII. RELATED WORK

Traditional approaches to constructing knowledge graphs, e.g., NOUS [16], DeepDive [17], NELL [18], or Knowledge Vault [19], imply the materialization of the designed graph built from unstructured and semi-structured sources. In comparison, the novelty of the FuhSen approach resides in a non-materialized knowledge graph and the profound use of RDF molecules.

Non-materialization supports efficient on-demand knowledge delivery. Further, FuhSen creates RDF molecules that unify and embed hybrid knowledge from heterogeneous sources in an abstract entity. Since RDF molecules enclose the information associated with the RDF resources of a knowledge graph [3], FuhSen's integration process allows for a more meaningful integration than triple-based approaches [5], [6]. This unification of RDF molecules is one of the most essential contributions of FuhSen.

In the area of hybrid search engines, several approaches combine text search with structured data results. For example, Usbeck et al. [20] present HAWK, a hybrid question answering framework combining entity search over linked data with textual data from the Web; the search input is a question expressed in natural language, which passes through an eight-step pipeline. HAWK is more complex than FuhSen since it pursues a question answering approach; FuhSen's keyword search is simpler, but it incorporates different information sources, ranging from the Social Web, the Deep Web, and the Data Web to internal databases. Bhagdev et al. [21] propose a hybrid search architecture that combines the search of concepts and keywords on documents and their metadata. In contrast to our approach, they focus only on documents, while we define a more generic and abstract notion of "entity". They propose the combination of traditional keyword search engines on documents with semantic search on the documents' metadata; their architecture also indexes document content, whereas we do not index content but aggregate molecules of data on demand.

In the specific application domain of law enforcement, organizations demand increasingly intelligent software to support their work; therefore, both in academia and in industry, efforts are being made to build innovative crime analysis software. The DIG system builds a knowledge graph to combat human trafficking by crawling web sites with escort ads [22]. Huber presents a crime investigation tool focusing only on online social networks [23]. Maltego (https://www.paterva.com/), an open source forensics application, offers information mining as well as visualization tools to determine the relationships between entities such as people, companies, or websites. Finally, Poderopedia (http://www.poderopedia.org/) is an initiative to promote the transparency of power structures in South America: it builds a knowledge graph of people and the power they hold on the continent by registering their relations with organizations and other people; journalists and contributors can manually add entities and relations to the knowledge graph. In contrast to these tools, FuhSen creates a knowledge graph on demand when a keyword query is entered; results are built by integrating data collected from Web sources (e.g., DBpedia or Wikidata) and are enriched with semantic metadata.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an approach to integrate RDF molecules spread across different Web data sources. The proposed techniques have been implemented in FuhSen, a federated hybrid search engine. FuhSen is able to create a knowledge graph on demand by integrating data collected from a federation of heterogeneous data sources using an RDF molecule integration approach. We have explained the creation of RDF molecules using Linked Data wrappers, and we have presented how semantic similarity measures can be used to determine the relatedness of two resources in terms of the relatedness of their RDF molecules. The results of the empirical evaluation suggest that FuhSen is able to effectively integrate pieces of information spread across different data sources, and that the molecule-based integration technique implemented in FuhSen integrates data into a knowledge graph more accurately than existing integration techniques. The RDF molecule integration approach constitutes a novel integration paradigm incorporating elements from linked data and federated search engines. Although our initial use cases address the criminal investigation domain, we deem that there are numerous further use cases, e.g., in e-commerce (e.g., price comparison) or human resources management (e.g., building a complete candidate profile from Web data). In the future, we plan to evaluate the impact of other semantic similarity measures and to define operators to efficiently integrate RDF molecules on the fly.

ACKNOWLEDGMENTS

This work is supported in part by the European Union under the Horizon 2020 Framework Program for the project BigDataEurope (GA 644564), as well as by the German

Ministry of Education and Research with grant no. 13N13627 for the project LiDaKra.

REFERENCES

[1] D. Collarana, M. Galkin, C. Lange, I. Grangel-González, M. Vidal, and S. Auer, "FuhSen: A federated hybrid search engine for building a knowledge graph on-demand (short paper)," in OTM Conferences - ODBASE, 2016, pp. 752-761.
[2] J. D. Fernández, A. Llaves, and Ó. Corcho, "Efficient RDF interchange (ERI) format for RDF data streams," in ISWC. Springer, 2014, pp. 244-259.
[3] L. Ding, T. Finin, Y. Peng, P. P. Da Silva, and D. L. McGuinness, "Tracking RDF graph provenance using RDF molecules," in ISWC (Poster), 2005.
[4] I. Traverso-Ribón, M. Vidal, B. Kämpgen, and Y. Sure-Vetter, "GADES: A graph-based semantic similarity measure," in SEMANTiCS, 2016, pp. 101-104.
[5] A. Schultz, A. Matteini, R. Isele, P. N. Mendes, C. Bizer, and C. Becker, "LDIF - A framework for large-scale linked data integration," in WWW, Developers Track, 2012.
[6] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov, "Silk - A link discovery framework for the web of data," in WWW2009 Workshop on Linked Data on the Web (LDOW), vol. 538. CEUR-WS.org, 2009.
[7] M. Arenas, C. Gutierrez, and J. Pérez, "Foundations of RDF databases," in Reasoning Web. Semantic Technologies for Information Systems. Springer, 2009, pp. 158-204.
[8] G. Pirrò, "Explaining and suggesting relatedness in knowledge graphs," in ISWC. Springer, 2015, pp. 622-639.
[9] E. P. Klement, R. Mesiar, and E. Pap, Triangular Norms. Springer Science & Business Media, 2013, vol. 8.
[10] K. Gunaratna, K. Thirunarayan, A. P. Sheth, and G. Cheng, "Gleaning types for literals in RDF triples with application to entity summarization," in ESWC, 2016, pp. 85-100.
[11] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer, "DBpedia Spotlight: Shedding light on the web of documents," in I-SEMANTICS. ACM, 2011, pp. 1-8.
[12] S. Lam and C. Hayes, "Using the structure of DBpedia for exploratory search," in SIGKDD, 2013.
[13] J. Benik, C. Chang, L. Raschid, M.-E. Vidal, G. Palma, and A. Thor, "Finding cross genome patterns in annotation graphs," in International Conference on Data Integration in the Life Sciences. Springer, 2012, pp. 21-36.
[14] J. Kleinberg and E. Tardos, Algorithm Design. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
[15] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83-97, 1955.
[16] S. Choudhury, K. Agarwal, S. Purohit, B. Zhang, M. Pirrung, W. Smith, and M. Thomas, "NOUS: Construction and querying of dynamic knowledge graphs," arXiv preprint arXiv:1606.02314, 2016.
[17] T. Palomares, Y. Ahres, J. Kangaspunta, and C. Ré, "Wikipedia knowledge graph with DeepDive," in 10th International AAAI Conference on Web and Social Media, 2016.
[18] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell, "Toward an architecture for never-ending language learning," in AAAI, vol. 5, 2010, p. 3.
[19] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge Vault: A web-scale approach to probabilistic knowledge fusion," in SIGKDD. ACM, 2014, pp. 601-610.
[20] R. Usbeck, A. Ngonga Ngomo, L. Bühmann, and C. Unger, "HAWK - Hybrid question answering using linked data," in ESWC, 2015, pp. 353-368.
[21] R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, and D. Petrelli, "Hybrid search: Effectively combining keywords and semantic searches," in ESWC, 2008, pp. 554-568.
[22] P. A. Szekely, C. A. Knoblock, J. S. et al., "Building and using a knowledge graph to combat human trafficking," in ISWC, 2015.
[23] M. Huber, "Social snapshot framework: Crime investigation on online social networks," ERCIM News, vol. 2012, no. 90, 2012.