RDF Graph Partitions: a Brief Survey


Dominik Tomaszuk¹, Łukasz Skonieczny², and David Wood³

¹ Institute of Computer Science, University of Bialystok, Poland
² Institute of Computer Science, Warsaw University of Technology, Poland
³ 3 Round Stones Inc., USA

Abstract. The paper presents justifications and solutions for RDF graph partitioning. It uses an approach from the classical theory of graphs to deal with this problem. We present four ways to transform an RDF graph into a classical graph. We show how to apply solutions from the theory of graphs to RDF graphs. We also perform an experimental evaluation using the gpmetis algorithm (a recognized graph partitioner) on both real and synthetic RDF graphs and demonstrate its practical usability.

1 Introduction

Machines commonly need to exchange machine-readable data. One useful approach to facilitate the efficient exchange of such data is to agree upon a common data model under which to structure, represent and store content. This data model should be generic enough to provide a canonical representation for arbitrary content irrespective of its syntax. The data model should also enable processing of this content. The core data model chosen for use in the Semantic Web and Linked Data environments is the Resource Description Framework (RDF), an edge-labeled directed graph data model [1].

An RDF graph is often too large to be managed efficiently on a single machine. To deal with this problem, many popular RDF graph stores provide various ways of distributing RDF data among the nodes of a computer cluster. Early techniques were adapted from similar solutions in the RDBMS world, e.g. vertical partitioning [2] or horizontal partitioning [3]. These techniques tend to create poor partitions in terms of inter-partition connectivity, which leads to poor performance of queries involving many joins. More advanced techniques, which take into account the graph nature of RDF data, have started to appear only recently [4,5,6,7]. One of the most natural and promising approaches involves graph partitioning, a classical problem from the theory of graphs.

Section 2 presents a formalized syntax and concepts for RDF. Section 3 discusses the graph partition problem in the context of RDF graphs. Section 4 shows the practical relevance of RDF graph partitioning. Finally, Section 5 gives some concluding remarks.


2 Preliminaries

RDF is used as a general method for the conceptual description or modeling of information that is available in web resources. It provides the essential foundation and infrastructure to support the description and management of data. In other words, RDF is a very general data model for describing resources and the relationships between them. The RDF data model is based upon the idea of making statements about web resources in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. An RDF triple consists of a subject, a predicate, and an object. The meaning of subject, predicate and object is explained in [8].

Definition 2.1 (Subject, predicate and object). The subject denotes a resource; the object fills the value of the relation; the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. The predicate denotes a binary relation, also known as a property.

Following [8], we define RDF triples below.

Definition 2.2 (RDF triple). Assume that $I$ is the set of all Internationalized Resource Identifier (IRI) references, $B$ an infinite set of blank nodes, and $L$ the set of RDF literals. An RDF triple $t$ is defined as a triple $t = \langle s, p, o \rangle$, where $s \in I \cup B$ is called the subject, $p \in I$ is called the predicate and $o \in I \cup B \cup L$ is called the object.

Example 2.1. The following RDF triple consists of a subject, a predicate and an object.

    foaf:name "John Smith" .

The elemental constituents of the RDF data model are RDF terms, which can be used to refer to resources: anything with identity. The set of RDF terms is divided into three disjoint subsets: IRIs, literals and blank nodes.

Definition 2.3 (IRIs). IRIs serve as global identifiers that can be used to identify any resource. For example, is used to identify the house in DBpedia⁴.

⁴ http://dbpedia.org/



Note that in RDF 1.0 identifiers were RDF URI References. Identifiers in RDF 1.1 are now IRIs, which are a generalization of URIs that permits a wider range of Unicode characters. Every absolute URI and URL is an IRI, but not every IRI is a URI. When IRIs are used in operations that are only defined for URIs, they must first be converted.

Definition 2.4 (Literals). Literals are a set of lexical values. A literal can be a plain string, such as "Apple", optionally with an associated language tag, such as "Apple"@en. A literal can also comprise a lexical string and a datatype, such as "3.14"^^<http://www.w3.org/2001/XMLSchema#float>. Datatypes are identified by IRIs, and RDF borrows many of the datatypes defined in XML Schema 1.1 [9].

Note that in RDF 1.0 literals with a language tag did not have a datatype URI. In RDF 1.1 literals with language tags have the datatype IRI rdf:langString, so all literals now have datatypes. Implementations might choose to support syntax for simple literals, but only as synonyms for xsd:string literals. Moreover, RDF 1.1 supports the new datatype rdf:HTML. Both rdf:HTML and rdf:XMLLiteral depend on DOM4 (Document Object Model level 4)⁵.
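The RDF 1.1 datatype rules described above can be sketched in a few lines of Python. This is only an illustration: the function name `literal_datatype` and its parameters are hypothetical, and a literal is assumed to be given as a lexical string plus an optional language tag or datatype IRI.

```python
# Datatype IRIs defined by XML Schema and RDF 1.1.
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def literal_datatype(lexical, language=None, datatype=None):
    """Return the datatype IRI that an RDF 1.1 literal carries.

    Hypothetical helper: `lexical` is the lexical string, `language` an
    optional language tag, `datatype` an optional explicit datatype IRI.
    """
    if language is not None:
        # Language-tagged literals always have the datatype rdf:langString.
        if datatype not in (None, RDF_LANGSTRING):
            raise ValueError("a language-tagged literal cannot have another datatype")
        return RDF_LANGSTRING
    # Simple literals are treated as synonyms for xsd:string literals.
    return datatype if datatype is not None else XSD_STRING


XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"
simple = literal_datatype("Apple")
tagged = literal_datatype("Apple", language="en")
typed = literal_datatype("3.14", datatype=XSD_FLOAT)
```

Under these assumptions every literal ends up with exactly one datatype IRI, which is the property the RDF 1.1 note relies on.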

Definition 2.5 (Blank nodes). Blank nodes are defined as existential variables used to denote the existence of some resource for which an IRI or literal is not given. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.

Note that RDF 1.0 makes no reference to any internal structure of blank nodes. Given two blank nodes, it is not possible to determine whether or not they are the same. In RDF 1.1 blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. Blank nodes do not have identifiers in the RDF abstract syntax. In situations where stronger identification is needed, some or all of the blank nodes can be replaced with IRIs. Systems wishing to do this should create a globally unique IRI (called a skolem IRI) for each blank node so replaced. This transformation does not appreciably change the meaning of an RDF graph. It permits the possibility of other graphs subsequently using the skolem IRIs, which is not possible for blank nodes. Systems that want skolem IRIs to be recognizable outside of the system boundaries use a well-known IRI [10] with the registered name genid.

A collection of RDF triples intrinsically represents a labeled directed multigraph. The nodes are the subjects and objects of the triples. RDF is often referred to as graph-structured data, where each triple $\langle s, p, o \rangle$ can be seen as an edge $s \xrightarrow{p} o$.
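The replacement of blank nodes by skolem IRIs mentioned above can be illustrated with a minimal Python sketch. Several things here are assumptions made only for illustration: terms are plain strings, blank nodes are written with the `_:` prefix, the authority `http://example.org` is a placeholder, and the function name `skolemize` is hypothetical. Only the `/.well-known/genid/` path follows the registered name genid.

```python
import uuid


def skolemize(triples, authority="http://example.org"):
    """Replace every blank node by a fresh, globally unique skolem IRI.

    Assumption: terms are strings and blank nodes start with "_:".
    Returns the rewritten triples and the blank-node-to-IRI mapping.
    """
    mapping = {}

    def convert(term):
        if term.startswith("_:"):
            if term not in mapping:
                # One skolem IRI per blank node, reused on every occurrence.
                mapping[term] = f"{authority}/.well-known/genid/{uuid.uuid4().hex}"
            return mapping[term]
        return term

    # Predicates are always IRIs, so only subjects and objects are converted.
    rewritten = [(convert(s), p, convert(o)) for s, p, o in triples]
    return rewritten, mapping


example = [
    ("_:b0", "foaf:name", '"John Smith"'),
    ("_:b0", "foaf:knows", "_:b1"),
]
skolemized, blank_to_iri = skolemize(example)
```

Each blank node is mapped once and the mapping is reused, so the graph structure is unchanged; only the node identities become globally addressable.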

Definition 2.6 (RDF graph). Let $L = L_S \cup L_L \cup L_D$, $O = I \cup B \cup L$ and $S = I \cup B$; then $G \subset S \times I \times O$ is a finite subset of RDF triples, which is called an RDF graph.

⁵ DOM4 is a way to refer to XML or HTML elements as objects, see http://www.w3.org/TR/dom/



Example 2.2. Figure 1 presents an RDF graph of a FOAF [11] profile. This graph includes four RDF triples:

    rdf:type foaf:Person .
    foaf:name "John Smith" .
    foaf:workplaceHomepage .
    rdfs:label "University" .

Fig. 1. An RDF graph

When applying classical graph theory to RDF graphs, the RDF graph is usually treated as a directed labeled graph (see Definition 2.7), in such a way that each RDF triple $\langle s, p, o \rangle$ is transformed into a corresponding edge $s' \xrightarrow{p'} o'$, where $s', o' \in V$ and $p' \in L$.

Definition 2.7 (Directed labeled graph). A directed labeled graph $G$ is a quadruple $G = (V, E, lbl, L)$, where $V$ is a set of vertices, $E \subseteq \{(v_1, v_2) \mid v_1, v_2 \in V\}$ is a set of directed edges, $lbl : E \cup V \to L$ is a labeling function, and $L$ is a set of labels.

The k-way graph partition problem (see Definition 2.8) is in general defined as finding $k$ disjoint subsets of graph vertices. The optimal graph partition is one that optimizes some given criteria, e.g. the number of edges running between separated components (in other words, the size of the edge cut set) is low, and the numbers of vertices in the components are close to each other. Such criteria are especially desirable when distributing RDF graphs, as they yield highly independent, loosely coupled partitions, maximizing the chance that an RDF query is executed on a minimal number of cluster nodes.



Definition 2.8 (k-way graph partition). Given a graph $G = (V, E, lbl, L)$, a k-way graph partitioning, $C$, is a division of $V$ into $k$ partitions $\{P_1, P_2, \ldots, P_k\}$ such that $\bigcup_{1 \le i \le k} P_i = V$ and $P_i \cap P_j = \emptyset$ for any $i \neq j$. The edge cut set $E_c$ is the set of edges whose vertices belong to different partitions.

Fig. 2. Graph partition example

An example of a graph partition is illustrated in Figure 2.
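The two conditions of Definition 2.8 and the edge cut set $E_c$ can be checked directly. The following Python sketch is illustrative only: the function names and the small six-vertex graph are hypothetical and are not the graph drawn in Figure 2.

```python
def is_k_way_partition(vertices, parts):
    """Check that the parts are pairwise disjoint and together cover V."""
    covered = set()
    for part in parts:
        if covered & part:  # P_i and P_j overlap
            return False
        covered |= part
    return covered == set(vertices)


def edge_cut_set(edges, parts):
    """Return E_c: the edges whose endpoints lie in different partitions."""
    owner = {v: i for i, part in enumerate(parts) for v in part}
    return {(u, v) for (u, v) in edges if owner[u] != owner[v]}


# Hypothetical graph: two triangles joined by the single edge (3, 4).
V = {1, 2, 3, 4, 5, 6}
E = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
parts = [{1, 2, 3}, {4, 5, 6}]

valid = is_k_way_partition(V, parts)
cut = edge_cut_set(E, parts)  # only the bridge (3, 4) is cut
```

Splitting the two triangles apart cuts a single edge, whereas any partition that separates vertices of the same triangle cuts more.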

3 RDF Graph Partition

3.1 Classical Graph Partitioning

The optimal graph partition problem is known to be NP-complete [12], but many efficient, suboptimal algorithms have been proposed [13,14,15,16,17]. One of the most recognized is gpmetis, included in the METIS software package for partitioning graphs, partitioning meshes and computing fill-reducing orderings of sparse matrices. gpmetis is based on the multilevel graph partitioning paradigm [18,19], which consists of three phases: graph coarsening, initial partitioning and graph uncoarsening.

The goal of the coarsening phase is to derive a series of successively smaller graphs from the initial input graph by collapsing together a maximal-size set of adjacent pairs of vertices. When the resulting graph is small enough (usually a few hundred vertices), it is partitioned with the Kernighan-Lin algorithm [20]; this is the initial partitioning phase. Then, in the uncoarsening phase, the partitioning is projected onto the successively larger graphs. This is done by successively uncollapsing vertices and assigning vertices that were collapsed together to the same partition. After each projection step, the partitioning is refined by moving vertices between partitions as long as this improves the quality of the partitioning. The uncoarsening phase ends when the graph is uncollapsed to the original input graph.

The Kernighan-Lin algorithm, which is used in the initial partitioning phase, is an $O(n^2 \log n)$ heuristic algorithm for solving the graph partitioning problem. The original algorithm performs 2-way partitioning but can easily be extended to k-way partitioning. The algorithm finds two disjoint, equally sized sets of vertices (namely $A$ and $B$) that minimize the sum $T$ of the weights of the edges between vertices from $A$ and $B$. The weights of the edges come from the number of vertices collapsed together in the coarsening phase. For a vertex $v$ and a set of vertices $X$, let $I_{vX}$ be the sum of the weights of the edges between $v$ and the vertices in $X$. The Kernighan-Lin algorithm starts with random sets $A$ and $B$ and successively interchanges vertices $a \in A$ and $b \in B$ such that the cost reduction $T_{old} - T_{new} = I_{aB} - I_{aA} + I_{bA} - I_{bB} - 2c_{a,b}$ is maximized, where $c_{a,b}$ is the cost of the edge between $a$ and $b$.
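The cost reduction used by the Kernighan-Lin algorithm can be made concrete with a short Python sketch. It covers only the selection of a single best swap, not the full algorithm or the multilevel scheme of gpmetis. The function names, the dictionary representation of the edge weights and the small example graph are assumptions made for illustration.

```python
def weight(w, u, v):
    """Weight of the undirected edge {u, v}; 0 if the edge is absent."""
    return w.get((u, v), w.get((v, u), 0))


def connection(w, v, X):
    """I_vX: sum of the weights of the edges between v and the vertices in X."""
    return sum(weight(w, v, x) for x in X if x != v)


def cut_cost(w, A, B):
    """T: sum of the weights of the edges between A and B."""
    return sum(weight(w, a, b) for a in A for b in B)


def swap_gain(w, a, b, A, B):
    """T_old - T_new for exchanging a (in A) with b (in B)."""
    return (connection(w, a, B) - connection(w, a, A)
            + connection(w, b, A) - connection(w, b, B)
            - 2 * weight(w, a, b))


def best_swap(w, A, B):
    """Return (gain, a, b) for the pair whose exchange reduces T the most."""
    return max(((swap_gain(w, a, b, A, B), a, b) for a in A for b in B),
               key=lambda item: item[0])


# Hypothetical unit-weight graph: two triangles joined by the edge (3, 4).
w = {(1, 2): 1, (1, 3): 1, (2, 3): 1, (3, 4): 1,
     (4, 5): 1, (4, 6): 1, (5, 6): 1}
A, B = {1, 2, 4}, {3, 5, 6}
gain, a, b = best_swap(w, A, B)  # gain 4: the cut drops from 5 edges to 1
```

In this example the best exchange is vertex 4 with vertex 3, which restores the two triangles as partitions. The term $2c_{a,b}$ accounts for the edge between $a$ and $b$ itself, which stays in the cut after the swap.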

3.2 Relationship Between Classical Graphs and RDF Graphs

All graph partitioning algorithms can be applied to RDF graphs if the graphs are transformed into a classical graph representation. A typical triple-to-edge transformation is the simplest one but, as noted in [21], it is ambiguous and not a one-to-one relationship. An RDF triple $t = \langle s, p, o \rangle$, where $s \in I \cup B$, $p \in I$, $o \in I \cup B \cup L$ (see Definition 2.2), is transformed into a directed edge $s' \xrightarrow{p'} o'$, where $s', o' \in V$ and $p' \in L$. Notice that the RDF predicate domain intersects with the subject and object domains, so the predicate $p$ might occur as a subject or object in some other triple. In graph theory, however, the $E$, $V$ and $L$ domains are distinct, i.e. there are no edges coming from or to other edges or labels. There are a few solutions to this problem.

The first is a transformation of $\langle s, p, o \rangle$ into $v_1(s') \xrightarrow{p'} v_2(o')$, where $v_1, v_2 \in V$ and $s', p', o' \in L$, i.e. $s'$ and $o'$ are labels of the vertices. In this solution, predicates may occur either as labels of edges or as labels of vertices.

The second solution is to make use of hypergraphs instead of simple graphs, that is, to allow edges to connect more than two nodes. In this approach all $s$, $p$ and $o$ are transformed into distinct vertices, and each RDF triple is represented as a directed hyperedge connecting $s$, $p$ and $o$ with each other. The drawback of this method is that processing hypergraphs requires specialized algorithms (e.g. [22] in the case of partitioning), which are in general slower than their simple graph counterparts.

The third solution was presented in [21]. It takes a hypergraph representation as a starting point and transforms it into a bipartite graph. In this approach an RDF triple $\langle s, p, o \rangle$ is represented as 4 nodes and 3 edges: $s' \xleftarrow{\text{subject}} t' \xrightarrow{\text{object}} o'$ together with $t' \xrightarrow{\text{predicate}} p'$.

The fourth approach is to transform every RDF triple $t$ into a distinct graph node, and to generate edges between those nodes which share subjects, objects and/or predicates.

The choice of RDF graph representation depends on the application. In the case of graph partitioning the first and fourth approaches seem to be superior, but additional research is required. One can notice that the RDF graph is more general than the classical graph. A directed labeled graph can easily be transformed into an RDF graph, but the reverse transformation is cumbersome. This means that the complexity of every RDF graph problem is no better than the complexity of the corresponding classical graph problem.
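The first and fourth transformations can be sketched in Python as follows. This is one possible reading of the descriptions above, not the authors' implementation: terms are plain strings, the function names are hypothetical, the `ex:` identifiers are placeholders, and in the fourth approach two triples are assumed to be connected whenever they share any term in any position.

```python
def to_labeled_graph(triples):
    """First approach: one vertex per distinct subject/object term,
    one directed edge per triple, labeled with the predicate."""
    vertex_of = {}
    edges = []
    for s, p, o in triples:
        for term in (s, o):
            # Assign the next free vertex id to terms not seen before.
            vertex_of.setdefault(term, len(vertex_of))
        edges.append((vertex_of[s], vertex_of[o], p))
    return vertex_of, edges


def to_triple_graph(triples):
    """Fourth approach: one vertex per triple (its index), with an
    undirected edge between every two triples that share a term."""
    triples_with = {}
    for index, triple in enumerate(triples):
        for term in set(triple):
            triples_with.setdefault(term, []).append(index)
    edges = set()
    for indices in triples_with.values():
        for i in range(len(indices)):
            for j in range(i + 1, len(indices)):
                edges.add((indices[i], indices[j]))
    return edges


# Hypothetical FOAF-like data; "ex:john" and "ex:uni" are placeholder terms.
data = [
    ("ex:john", "rdf:type", "foaf:Person"),
    ("ex:john", "foaf:name", '"John Smith"'),
    ("ex:john", "foaf:workplaceHomepage", "ex:uni"),
    ("ex:uni", "rdfs:label", '"University"'),
]
vertex_of, labeled_edges = to_labeled_graph(data)
triple_edges = to_triple_graph(data)
```

In the fourth approach every term shared by n triples contributes up to n(n-1)/2 edges, so frequently used predicates or subjects make the resulting graph grow quickly.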

4 Experiments

To examine the practical relevance of RDF graph partitioning we performed an experimental evaluation of the gpmetis⁶ algorithm from the grph⁷ library. We chose three RDF data sets: berlin, a synthetic dataset generated by the Berlin SPARQL Benchmark generator [23]; elvis, metadata about Elvis impersonators⁸; and dbpedia-geo, the Geographic Coordinates⁹ dataset from DBpedia [24]. Each dataset was partitioned into k partitions with k ranging from 2 to 10. We collected the execution time of the algorithm and the size of the edge cut set $|E_c|$ (see Definition 2.8). We used the default parameters of the gpmetis algorithm. The results are presented in Table 1.

The elvis* graph was obtained from the elvis dataset with the fourth approach. All other graphs were obtained with the first approach. The fourth approach turned out to be impractical, as it created very large graphs resulting in out-of-memory errors. The gpmetis algorithm generally performed very well, generating partitions in seconds even for the quite large dbpedia graph. The size of the edge cut set depends on the data set: 3%-8% of the total number of edges in the case of the elvis graph, which is a very good result; 10%-20% in the case of the dbpedia graph, which is probably acceptable; and 12%-45% in the case of the berlin graph.
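To make the experimental setup easier to reproduce, the following Python sketch serializes a graph in the plain-text graph file format read by the standalone gpmetis program from METIS 5. It is an illustration under stated assumptions: the paper ran gpmetis through the grph library and does not describe the invocation, the function name `to_metis` is hypothetical, and edge directions, duplicate edges and self-loops are simply dropped because the unweighted METIS format describes an undirected simple graph.

```python
def to_metis(num_vertices, edges):
    """Serialize an undirected, unweighted graph in the METIS graph format.

    `edges` holds pairs (u, v) of 0-based vertex ids. The header line is
    "<#vertices> <#edges>"; line i then lists the 1-based neighbours of
    vertex i.
    """
    adjacency = [set() for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:  # self-loops are not allowed in the format
            adjacency[u].add(v)
            adjacency[v].add(u)
    # Every undirected edge appears in two adjacency sets.
    num_edges = sum(len(neighbours) for neighbours in adjacency) // 2
    lines = [f"{num_vertices} {num_edges}"]
    for neighbours in adjacency:
        lines.append(" ".join(str(n + 1) for n in sorted(neighbours)))
    return "\n".join(lines) + "\n"


# Hypothetical five-vertex graph with four edges.
metis_text = to_metis(5, [(0, 1), (0, 2), (0, 3), (3, 4)])
```

With the text saved to a file, a call such as `gpmetis graph.txt 4` would, in the standalone tool, write the partition number of each vertex to `graph.txt.part.4`, one vertex per line; the file name here is only an example.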

5 Conclusions

In this paper we presented work from the RDF graph partitioning research area. We outlined how the vertices of an RDF graph can be partitioned into disjoint subsets and provided insights into applying classical graph partitioning to RDF graphs. Moreover, we presented the relationships between classical graphs and RDF graphs. Finally, we presented experiments, which showed great potential for the presented approaches.

⁶ http://glaros.dtc.umn.edu/gkhome/views/metis
⁷ http://www.i3s.unice.fr/~hogie/grph/index.php
⁸ http://www.rdfdata.org/
⁹ http://wiki.dbpedia.org/Downloads2014

            elvis            berlin           dbpedia-geo        elvis*
#vertices   742              116728           1637808            774
#edges      774              348106           2104985            41458

k-way partitioning
k           time     |Ec|    time     |Ec|    time      |Ec|     time     |Ec|
2           252 ms   21      2617 ms  44611   20954 ms  226002   1792 ms  1409
3           122 ms   37      2652 ms  79950   14407 ms  330831   224 ms   2172
4           101 ms   40      3139 ms  96827   16357 ms  415257   180 ms   3650
5           52 ms    49      3116 ms  119679  14561 ms  396505   194 ms   11574
6           95 ms    52      3133 ms  131902  17862 ms  473697   92 ms    12341
7           76 ms    52      3304 ms  139853  14299 ms  456204   137 ms   16616
8           79 ms    62      3358 ms  145927  16628 ms  504719   291 ms   15709
9           45 ms    62      3567 ms  155111  13886 ms  515311   87 ms    17955
10          43 ms    63      3624 ms  162344  16382 ms  520028   154 ms   21634

Table 1. Partitioning of elvis, berlin and dbpedia datasets with gpmetis.

References

1. Czajkowski, K., Trela, T.: Semantic web - standard, narzędzia, implementacje. Studia Informatica Vol. 33, nr 2A (2012) 379-393
2. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using Vertical Partitioning. In: Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB '07, VLDB Endowment (2007) 411-422
3. Mulay, K., Kumar, P.S.: SPOVC: A Scalable RDF Store Using Horizontal Partitioning and Column Oriented DBMS. In: Proceedings of the 4th International Workshop on Semantic Web Information Management. SWIM '12, New York, NY, USA, ACM (2012) 8:1-8:8
4. Goczyła, K., Waloszek, A., Waloszek, W.: Techniki modularyzacji ontologii. Bazy danych. Rozwój metod i technologii - Architektura, metody formalne i zaawansowana analiza danych, red.: S. Kozielski, B. Małysiak, P. Kasprowski, D. Mrozek, WKŁ (s 309) (2008) 322
5. Lee, K., Liu, L.: Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning. Proc. VLDB Endow. 6(14) (September 2013) 1894-1905
6. Wang, R., Chiu, K.: A Graph Partitioning Approach to Distributed RDF Stores. In: Proceedings of the 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications. ISPA '12, Washington, DC, USA, IEEE Computer Society (2012) 411-418
7. Yan, Y., Wang, C., Zhou, A., Qian, W., Ma, L., Pan, Y.: Efficient Indices Using Graph Partitioning in RDF Triple Stores. In: Proceedings of the 2009 IEEE International Conference on Data Engineering. ICDE '09, Washington, DC, USA, IEEE Computer Society (2009) 1263-1266
8. Cyganiak, R., Lanthaler, M., Wood, D.: RDF 1.1 Concepts and Abstract Syntax. W3C recommendation, World Wide Web Consortium (February 2014)
9. Sperberg-McQueen, M., Thompson, H., Peterson, D., Malhotra, A., Biron, P.V., Gao, S.: W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C recommendation, World Wide Web Consortium (April 2012)
10. Nottingham, M., Hammer-Lahav, E.: Defining Well-Known Uniform Resource Identifiers (URIs). RFC 5785, Internet Engineering Task Force (April 2010)
11. Brickley, D., Miller, L.: FOAF Vocabulary Specification 0.99. Technical report, FOAF Project (January 2014)
12. Garey, M.R., Johnson, D.S., Stockmeyer, L.: Some simplified NP-complete graph problems. Theoretical Computer Science 1(3) (1976) 237-267
13. Saran, H., Vazirani, V.V.: Finding k-cuts within Twice the Optimal (1995)
14. Goldschmidt, O., Hochbaum, D.: Polynomial algorithm for the k-cut problem. In: Foundations of Computer Science, 1988., 29th Annual Symposium on. (Oct 1988) 444-451
15. Guttmann-Beck, N., Hassin, R.: Approximation Algorithms for Minimum K-Cut. Algorithmica 27(2) (2000) 198-207
16. Arora, S., Karger, D., Karpinski, M.: Polynomial Time Approximation Schemes for Dense Instances of NP-Hard Problems. Journal of Computer and System Sciences 58(1) (1999) 193-210
17. Karypis, G., Kumar, V.: Multilevel Algorithms for Multi-constraint Graph Partitioning. In: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. SC '98, Washington, DC, USA, IEEE Computer Society (1998) 1-13
18. Karypis, G., Kumar, V.: A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20(1) (December 1998) 359-392
19. Karypis, G., Kumar, V.: Multilevel K-way Partitioning Scheme for Irregular Graphs. J. Parallel Distrib. Comput. 48(1) (January 1998) 96-129
20. Kernighan, B.W., Lin, S.: An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal 49(1) (1970) 291-307
21. Hayes, J., Gutierrez, C.: Bipartite Graphs as Intermediate Model for RDF. In McIlraith, S., Plexousakis, D., van Harmelen, F., eds.: The Semantic Web - ISWC 2004. Volume 3298 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2004) 47-61
22. Karypis, G., Kumar, V.: Multilevel K-way Hypergraph Partitioning. In: Proceedings of the 36th Annual ACM/IEEE Design Automation Conference. DAC '99, New York, NY, USA, ACM (1999) 343-348
23. Bizer, C., Schultz, A.: The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems (IJSWIS) 5(2) (2009) 1-24
24. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A Nucleus for a Web of Open Data. In: Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference. ISWC'07/ASWC'07, Berlin, Heidelberg, Springer-Verlag (2007) 722-735