Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Artem Chebotko, John Abraham, Pearl Brazier
Department of Computer Science, University of Texas - Pan American, Edinburg, TX, USA
{chebotkoa, jabraham, brazier}@utpa.edu

Anthony Piazza Piazza Software Consulting Corpus Christi, TX, USA [email protected]

Andrey Kashlev, Shiyong Lu Department of Computer Science Wayne State University Detroit, MI, USA {andrey.kashlev, shiyong}@wayne.edu

Abstract—Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we explore and address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose: (i) novel storage and indexing techniques for RDF data in HBase that are better suited for provenance datasets than for generic RDF graphs and (ii) novel SPARQL query evaluation algorithms that solely rely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples, and eliminate the need for intermediate data transfers over a network. The empirical evaluation of our algorithms using provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.

Keywords-scientific workflow; provenance; big data; HBase; distributed database; SPARQL; RDF; query; scalability

I. INTRODUCTION

In scientific workflow environments, scientific discovery reproducibility, result interpretation, and problem diagnosis primarily depend on the availability of provenance – metadata that records the complete history of an in-silico experiment [1], [2], [3], [4]. Using the terminology of the Open Provenance Model (OPM) [5], provenance of a single workflow execution is a directed graph with three kinds of nodes: artifacts (e.g., data products), processes (e.g., computations or actions), and agents (e.g., catalysts of a process). Nodes are connected via directed edges that represent five types of dependencies: process used artifact, artifact wasGeneratedBy process, process wasControlledBy agent, process wasTriggeredBy process, and artifact wasDerivedFrom artifact. In addition, OPM includes a number of other constructs that are helpful for provenance modelling. A sample provenance graph that uses the OPM notation and encodes provenance of a workflow for creating and populating a relational database is shown in Fig. 1. In this graph, ellipses and rectangles denote artifacts and processes, respectively, and edges denote dependencies whose interpretations can be inferred based on their domains and ranges.

Figure 1. Sample OPM Provenance Graph. (Nodes: the artifacts Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Schema, Dataset, and Instance, and the processes Create Database Schema and Load Data.)

This provenance graph can be serialized using the Resource Description Framework (RDF) and OPM vocabularies, such as Open Provenance Model Vocabulary (OPMV) or Open Provenance Model OWL Ontology (OPMO). As an example, we show a partial serialization of the presented provenance graph using OPMV and Terse RDF Triple Language (Turtle):

utpb:schema    rdf:type             opmv:Artifact .
utpb:instance  rdf:type             opmv:Artifact .
utpb:dataset   rdf:type             opmv:Artifact .
utpb:loadData  rdf:type             opmv:Process .
utpb:loadData  opmv:used            utpb:schema, utpb:dataset .
utpb:instance  opmv:wasGeneratedBy  utpb:loadData .
utpb:instance  opmv:wasDerivedFrom  utpb:schema, utpb:dataset .

The provenance graph is now effectively converted to an RDF graph (or set of triples that encode its edges) and can be further stored and queried using the SPARQL query language. SPARQL and other RDF query languages have been frequently used for provenance querying [6], [7], [8]. A provenance query, such as “Find all artifacts and their values, if any, in a provenance graph with identifier http://cs.panam.edu/utpb#opmGraph”, can be expressed in SPARQL as

SELECT ?artifact ?value
FROM NAMED <http://cs.panam.edu/utpb#opmGraph>
WHERE {
  GRAPH utpb:opmGraph {
    ?artifact rdf:type opmv:Artifact .
    OPTIONAL { ?artifact rdf:label ?value . } .
  }
}
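For readers who wish to try such a provenance query outside the HBase-based system described in this paper, the following is a minimal sketch using Apache Jena, which is our own illustrative assumption rather than the paper's implementation. It parses the Turtle snippet above into an in-memory named graph and evaluates the query against it. The prefix mappings for utpb and opmv, the rdfs:label value "db instance", and the use of rdfs:label (the standard labeling property) instead of the rdf:label printed in the listing above are all illustrative assumptions; FROM NAMED is omitted because the named graph is supplied programmatically.

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

import java.io.StringReader;

public class ProvenanceQueryExample {
    public static void main(String[] args) {
        // Assumed prefixes: utpb is mapped to the benchmark namespace, opmv to the OPMV namespace.
        String turtle =
            "@prefix utpb: <http://cs.panam.edu/utpb#> .\n" +
            "@prefix opmv: <http://purl.org/net/opmv/ns#> .\n" +
            "@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n" +
            "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n" +
            "utpb:schema   rdf:type opmv:Artifact .\n" +
            "utpb:instance rdf:type opmv:Artifact ; rdfs:label \"db instance\" .\n" +
            "utpb:dataset  rdf:type opmv:Artifact .\n" +
            "utpb:loadData rdf:type opmv:Process ; opmv:used utpb:schema, utpb:dataset .\n" +
            "utpb:instance opmv:wasGeneratedBy utpb:loadData ;\n" +
            "              opmv:wasDerivedFrom utpb:schema, utpb:dataset .\n";

        // Parse the provenance graph and register it under its graph identifier.
        Model graph = ModelFactory.createDefaultModel();
        graph.read(new StringReader(turtle), null, "TURTLE");
        Dataset dataset = DatasetFactory.create();
        dataset.addNamedModel("http://cs.panam.edu/utpb#opmGraph", graph);

        // The provenance query: all artifacts and their values, if any.
        String query =
            "PREFIX utpb: <http://cs.panam.edu/utpb#>\n" +
            "PREFIX opmv: <http://purl.org/net/opmv/ns#>\n" +
            "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
            "SELECT ?artifact ?value\n" +
            "WHERE { GRAPH utpb:opmGraph {\n" +
            "  ?artifact rdf:type opmv:Artifact .\n" +
            "  OPTIONAL { ?artifact rdfs:label ?value . }\n" +
            "} }";

        try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution sol = results.next();
                System.out.println(sol.get("artifact") + " -> " + sol.get("value"));
            }
        }
    }
}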

The main focus of our research in this work is on the efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. With the development of user-friendly and powerful tools, such as scientific workflow management systems [9], [10], [11], [12], [13], [14], scientists are able to design and repeatedly execute workflows with different input datasets and varying input parameters with just a few mouse clicks. Each workflow execution generates a provenance graph that will be stored and queried on different occasions. A single provenance graph is readily manageable, as its size is correlated with the workflow size and even workflows with many hundreds of processes produce a relatively small metadata footprint that fits into main memory of a single machine. The challenge arises when hundreds of thousands or even millions of provenance graphs constitute a provenance dataset. Managing large and constantly growing provenance datasets on a single machine eventually fails, and we turn to distributed data management solutions. We design such a solution for large provenance datasets based on Apache HBase [15], an open-source implementation of Google's BigTable [16]. While we deploy and evaluate our solution on a small cluster of commodity machines, HBase is readily available in cloud environments, suggesting virtually unlimited elasticity.

The main contributions of this work are: (i) novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets, and (ii) novel and efficient querying algorithms to evaluate SPARQL queries in HBase that are optimized to make use of bitmap indices and numeric values instead of triples. Our solution enables the evaluation of queries over an individual provenance graph without intermediate data transfers over a network. In addition, we conducted an empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark [17]. Our experiments confirmed that our proposed storage, indexing and querying techniques are efficient and scalable for large provenance datasets.

II. RELATED WORK

Besides HBase, there are multiple projects under the Apache umbrella (http://projects.apache.org) that focus on distributed computing, including Hadoop, Cassandra, Hive, Pig, and CouchDB. Hadoop implements a MapReduce software framework and a distributed file system. Cassandra blends a fully distributed design with a column-oriented storage model. Hive deals with data warehousing on top of Hadoop and provides its own Hive QL query language.

Pig is geared towards analyzing large datasets through use of its high-level Pig Latin language for expressing data analysis programs, which are then turned into MapReduce jobs. CouchDB is a distributed, document-oriented database that supports incremental MapReduce queries written in JavaScript. Along the same lines, other projects in academia and industry include Cheetah (data warehousing on top of MapReduce), Hadoop++ (an improved MapReduce framework based on Hadoop), G-Store (a key-value store with multi-key access functionality), and Hadapt (data warehousing on top of MapReduce). None of the above projects targets RDF data specifically or supports SPARQL.

RDF data management in non-relational (often called NoSQL) databases has only recently been gaining momentum. Due to space limitations, we only briefly introduce the reader to the most relevant works in this area. Techniques for evaluating SPARQL basic graph patterns using MapReduce are presented in [18] and [19]. Efficient approaches to analytical query processing and distributed reasoning on RDF graphs in MapReduce-based systems are proposed in [20] and [21]. The translation of SPARQL queries into Pig Latin queries that can be evaluated using Hadoop is presented in [22]. Efficient RDF querying in distributed RDF-3X is reported in [23]. RDF storage schemes and querying algorithms for HBase and MySQL Cluster are proposed in our own work [24]. Bitmap indices for RDF join processing on a single machine have been previously studied in [25]. While existing works deal with very large graphs that require partitioning, this work deals with very large numbers of relatively small RDF graphs, which enables us to apply unique optimizations in our storing, indexing, and querying techniques.

III. STORING AND INDEXING RDF GRAPHS IN HBASE

In this section, we first formalize definitions of RDF dataset, RDF graph, and SPARQL basic graph pattern. We then propose our indexing and storage schemes for RDF data in HBase.

A. RDF Data and Queries

Definition 3.1 (RDF dataset): An RDF dataset D is a set of RDF graphs {G1, G2, ..., Gn}, where each graph Gi ∈ D is a named graph that has a unique identifier Gi.id and n ≥ 0.

The RDF dataset definition requires each RDF graph to have a unique identifier, which is frequently the case in large collections of RDF graphs to allow easy distinction among graphs. Such an identifier is either a part of the graph description or can be assigned automatically without a loss of generality.

Definition 3.2 (RDF graph): An RDF graph G is a set of RDF triples {t1, t2, ..., tn}, where n = |G| and each triple ti ∈ G is a tuple of the form (s, p, o) with s, p, and o denoting a subject, predicate, and object, respectively.

The RDF graph definition views an RDF graph as a set of triples whose subjects, objects, and predicates correspond to labeled nodes and edges, respectively. Each node in an RDF graph is labeled with a unique identifier (i.e., a Uniform Resource Identifier, or URI) or a value (i.e., a literal). Edges are labeled with identifiers (i.e., URIs), and multiple edges with the same label are common. While the number of distinct labels for nodes, which correspond to subjects and objects, increases with the growth of an RDF graph or RDF dataset as new nodes are added, the number of labels for edges, which correspond to predicates, is usually bound by the number of properties defined in an annotation vocabulary (e.g., OPMV defines 13 properties and reuses a few properties from the RDF and RDFS namespaces; OPMO extends OPMV and supports around 50 properties). Therefore, for any large RDF dataset, it is safe to assume that the number of distinct predicates is substantially smaller (on the order of dozens or hundreds) than the number of distinct subjects or objects. We use P to denote the set of all predicates in an RDF dataset D; formally, P = P1 ∪ P2 ∪ ... ∪ Pn, where Pi = {p | (s, p, o) ∈ Gi}, Gi ∈ D, and D = {G1, G2, ..., Gn}.

In an RDF graph, the order of RDF triples is not semantically important. Since any concrete RDF graph serialization has to store triples in some order, it is convenient to view a set of triples in an RDF graph as an ordered set. We use the function num to denote the position of a triple in an RDF graph G, such that num(ti) returns position i, where ti ∈ G and G = {t1, t2, ..., tn}. Furthermore, the inverse function num^-1(i) returns triple ti found at position i in graph G = {t1, t2, ..., tn}.

To query RDF datasets and individual RDF graphs, the standard RDF query language, SPARQL, is used. SPARQL allows defining various graph patterns that can be matched over RDF graphs to retrieve results. While SPARQL distinguishes basic graph patterns, optional graph patterns, and alternative graph patterns, in this paper we restrict our presentation to basic graph patterns as defined in the following.

Definition 3.3 (Basic graph pattern): A basic graph pattern bgp is a set of triple patterns {tp1, tp2, ..., tpn}, also denoted as tp1 AND tp2 AND ... AND tpn, where n ≥ 1, AND is a binary operator that corresponds to the conjunction in SPARQL, and each tpi ∈ bgp is a triple (sp, pp, op), such that sp, pp, and op are a subject pattern, predicate pattern, and object pattern, respectively.

A basic graph pattern consists of the simplest querying constructs, called triple patterns. A triple pattern can contain variables or URIs as subject, predicate, and object patterns (object patterns can also be represented by literals) that are to be matched over the respective components of individual triples. Unlike a URI or a literal in a triple pattern, which has to match itself in a triple, a variable can match anything.

Multiple occurrences of the same variable in a triple pattern or a basic graph pattern must be bound to the same values.

B. Indexing Scheme

Matching a basic graph pattern over an RDF graph involves matching of constituent triple patterns over a set of RDF triples. Each triple pattern yields an intermediate set of triples, and such intermediate results must be further joined together to find matching subgraphs. To speed up this computation, we define several bitmap indices.

Definition 3.4 (Index Ip): A bitmap index Ip for an RDF graph G ∈ D is a set of tuples {(p1, v1), (p2, v2), ..., (pn, vn)}, where pi ∈ P is a predicate in the set of all predicates in an RDF dataset D, n = |P|, and vi is a bit vector of size |G| that has 1 in the k-th position iff triple tk = num^-1(k), tk ∈ G, and tk.p = pi.

Index Ip helps quickly identify triples (their positions) in an RDF graph that have a particular predicate. Ip(p) denotes the bit vector for predicate p. The size of the bit vector is fixed and equals the number of triples in the graph (i.e., |G|). The number of vectors in the index equals the number of distinct predicates (i.e., |P|), which is relatively small (usually |P| < |G|). Similarly, indices Is and Io to quickly identify triples with a given subject and object can be defined. Intuitively, to find a triple with subject s, predicate p, and object o in an RDF graph, a logical ∧ (AND) of the corresponding vectors can be computed: Is(s) ∧ Ip(p) ∧ Io(o).

While the purpose of indices Is, Ip, and Io is to speed up matching of individual triple patterns, the indices that we define next can be used to join intermediate results obtained via triple pattern matching into subgraphs.

Definition 3.5 (Index Iss): A bitmap index Iss for an RDF graph G ∈ D is a set of tuples {(1, v1), (2, v2), ..., (n, vn)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and vi is a bit vector of size |G| that has 1 in the k-th position iff tk = num^-1(k), tk ∈ G, ti = num^-1(i), ti ∈ G, and tk.s = ti.s.

Definition 3.6 (Index Ioo): A bitmap index Ioo for an RDF graph G ∈ D is a set of tuples {(1, v1), (2, v2), ..., (n, vn)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and vi is a bit vector of size |G| that has 1 in the k-th position iff tk = num^-1(k), tk ∈ G, ti = num^-1(i), ti ∈ G, and tk.o = ti.o.

Definition 3.7 (Indices Iso and Ios): A bitmap index Iso for an RDF graph G ∈ D is a set of tuples {(1, v1), (2, v2), ..., (n, vn)}, where n = |G|, 1, 2, ..., n are the consecutive triple positions in G, and vi is a bit vector of size |G| that has 1 in the k-th position iff tk = num^-1(k), tk ∈ G, ti = num^-1(i), ti ∈ G, and tk.o = ti.s. A bitmap index Ios is the transpose of Iso, such that Ios = Iso^T.
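To make Definitions 3.4 and 3.5 concrete, the following is our own small sketch (not code from the paper) that builds the selection indices Is, Ip, Io and the join index Iss for a tiny in-memory graph using java.util.BitSet; the Triple record and the map-based index layout are illustrative assumptions about how an implementation might represent them, and bit positions are 0-based while the definitions count from 1.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Builds the selection indices Is, Ip, Io and the join index Iss of Definitions 3.4 and 3.5. */
public class BitmapIndexBuilder {

    /** An RDF triple; the position of a triple in the list below plays the role of num(t). */
    public record Triple(String s, String p, String o) {}

    /** Selection index: term -> bit vector over triple positions. */
    public static Map<String, BitSet> selectionIndex(List<Triple> graph, Function<Triple, String> term) {
        Map<String, BitSet> index = new HashMap<>();
        for (int k = 0; k < graph.size(); k++) {
            index.computeIfAbsent(term.apply(graph.get(k)), key -> new BitSet(graph.size())).set(k);
        }
        return index;
    }

    /** Join index Iss: for each triple position i, the positions of triples sharing its subject. */
    public static BitSet[] subjectSubjectIndex(List<Triple> graph) {
        Map<String, BitSet> bySubject = selectionIndex(graph, Triple::s);
        BitSet[] iss = new BitSet[graph.size()];
        for (int i = 0; i < graph.size(); i++) {
            iss[i] = (BitSet) bySubject.get(graph.get(i).s()).clone();
        }
        return iss;
    }

    public static void main(String[] args) {
        List<Triple> g = new ArrayList<>(List.of(
            new Triple("utpb:schema",   "rdf:type",  "opmv:Artifact"),
            new Triple("utpb:loadData", "rdf:type",  "opmv:Process"),
            new Triple("utpb:loadData", "opmv:used", "utpb:schema")));

        Map<String, BitSet> is = selectionIndex(g, Triple::s);  // Is
        Map<String, BitSet> ip = selectionIndex(g, Triple::p);  // Ip
        Map<String, BitSet> io = selectionIndex(g, Triple::o);  // Io (unused below, shown for completeness)

        // Triples with subject utpb:loadData and predicate rdf:type: Is(s) AND Ip(p).
        BitSet v = (BitSet) is.get("utpb:loadData").clone();
        v.and(ip.get("rdf:type"));
        System.out.println("Matching positions: " + v);               // prints {1}

        // Positions of triples sharing the subject of the triple at position 2.
        System.out.println("Iss(2) = " + subjectSubjectIndex(g)[2]);  // prints {1, 2}
    }
}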

Figure 2. HBase Storage Scheme (table TD).

rowid  | data:graph | index:Is | index:Ip | index:Io | index:Iss | index:Ioo | index:Iso
G1.id  | G1         | Is1      | Ip1      | Io1      | Iss1      | Ioo1      | Iso1
G2.id  | G2         | Is2      | Ip2      | Io2      | Iss2      | Ooo2      | Iso2
...    | ...        | ...      | ...      | ...      | ...       | ...       | ...
Gn.id  | Gn         | Isn      | Ipn      | Ion      | Issn      | Ioon      | Ison

Indices Iss, Ioo, Iso, and Ios are all of the same size of |G| × |G| bits for a given graph G. They can be used to quickly match triples from two sets based on the equality of their subjects, objects, subjects-objects, and objects-subjects, respectively. Intuitively, given a position i that corresponds to a triple ti ∈ G such that ti = num^-1(i), other triples (their positions) in G with the same subject as ti.s can be found in the bit vector Iss(i). It should be noted that indices that allow matching triples based on predicates, subjects-predicates, and objects-predicates equalities can also be defined; however, their usability is limited, since graph patterns with variables shared by predicate patterns and other patterns are rarely used. We denote indices Is, Ip, and Io as selection indices and indices Iss, Ioo, Iso, and Ios as join indices. Note that index Ios can be obtained from index Iso and vice versa. Although we introduce both types of indices for theoretical completeness, only one is required in practice.

C. Storage Scheme

HBase stores data in tables that can be described as sparse multidimensional sorted maps and are structurally different from relations found in conventional relational databases. An HBase table (hereafter "table" for short) stores data rows that are sorted based on the row keys. Each row has a unique row key and an arbitrary number of columns, such that columns in two distinct rows do not have to be the same. A full column name (hereafter "column" for short) consists of a column family and a column qualifier (e.g., family:qualifier), where column families are usually specified at the time of table creation and their number does not change, while column qualifiers are dynamically added or deleted as needed. Rows in a table can be distributed over different machines in an HBase cluster and efficiently retrieved based on a given row key and, if available, columns.

To store provenance datasets composed of provenance graphs serialized as RDF graphs, we propose a single-table storage scheme shown in Fig. 2. Each row in the table stores: (1) an RDF graph identifier as a unique row id/key, (2) a complete RDF graph as one aggregate value in the data column family, and (3) precomputed bitmap indices for the respective RDF graph in the index column family. The decision to store each RDF graph as one value, rather than partition it into subgraphs or even individual triples, is motivated by the following observations. First, such storage avoids unnecessary data transfers that may occur if a graph is partitioned and distributed over different machines. Second, as we show in detail in the next section, expensive query processing operations (i.e., joins) can be performed using compact bitmap indices, and an RDF graph only needs to be accessed once to replace triple positions in query results with actual triples. Finally, unlike some applications that require dealing with very large graphs that cannot fit into main memory of a single machine and therefore require partitioning, individual provenance graphs are relatively small in general (yet their number can be very large) and can be stored as one aggregate value. We present query processing over this storage scheme next.
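As an illustration of the storage scheme in Fig. 2, the sketch below writes one provenance graph and its serialized indices into a single row of an HBase table. This is our own sketch, not the paper's code: it assumes a table named TD with column families data and index already exists, uses the modern HBase client API (the paper's experiments used HBase 0.94, whose client API differs slightly), and leaves the actual serialization of the RDF graph and the bitmap indices to placeholder byte arrays.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ProvenanceRowWriter {

    /** Stores one provenance graph and its precomputed indices as a single row keyed by the graph id. */
    public static void storeGraph(Connection connection,
                                  String graphId,
                                  byte[] serializedGraph,   // the RDF graph, e.g., in N-Triples or a binary form
                                  byte[] serializedIs,      // serialized selection index Is
                                  byte[] serializedIp,      // serialized selection index Ip
                                  byte[] serializedIo,      // serialized selection index Io
                                  byte[] serializedIss)     // serialized join index Iss (other join indices analogous)
            throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("TD"))) {
            Put put = new Put(Bytes.toBytes(graphId));        // row key = graph identifier
            put.addColumn(Bytes.toBytes("data"),  Bytes.toBytes("graph"), serializedGraph);
            put.addColumn(Bytes.toBytes("index"), Bytes.toBytes("Is"),    serializedIs);
            put.addColumn(Bytes.toBytes("index"), Bytes.toBytes("Ip"),    serializedIp);
            put.addColumn(Bytes.toBytes("index"), Bytes.toBytes("Io"),    serializedIo);
            put.addColumn(Bytes.toBytes("index"), Bytes.toBytes("Iss"),   serializedIss);
            table.put(put);                                   // one write stores the graph and its indices together
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            storeGraph(connection, "http://cs.panam.edu/utpb#opmGraph",
                       new byte[0], new byte[0], new byte[0], new byte[0], new byte[0]);
        }
    }
}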

Algorithm 1 Applying selection indices
1: function applySelectionIndices
2: input: graph identifier G.id, triple pattern tp = (sp, pp, op), table TD
3: output: bit vector v that has 1 in the k-th position, i.e., v[k] = 1, if triple tk = num^-1(k), tk ∈ G, and tk matches the non-variable components of tp
4: Let Is, Ip, and Io be the respective indices in the row with rowid G.id of table TD
5: Let v be a bit vector that has 1 in every position k, where 1 ≤ k ≤ |G|
6: if tp.sp is not a variable then
7:   v = v ∧ Is(tp.sp)
8: end if
9: if tp.pp is not a variable then
10:   v = v ∧ Ip(tp.pp)
11: end if
12: if tp.op is not a variable then
13:   v = v ∧ Io(tp.op)
14: end if
15: return v
16: end function

IV. RDF QUERY PROCESSING IN HBASE

To be able to evaluate SPARQL queries in HBase, we design four efficient functions that deal with the application of selection indices, the application of join indices, the handling of special cases not supported by the indices, and basic graph pattern evaluation.

Function applySelectionIndices is outlined in Algorithm 1. It takes a graph identifier and a triple pattern and returns a bit vector of triple positions in the graph, where value 1 signifies that the position-corresponding triple matches the URIs and/or literals found in the triple pattern. Selection indices for a particular graph identifier are applied using the conjunction of bit vectors. The resulting bit vector encodes the result of matching one triple pattern over a graph.
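A minimal Java rendering of Algorithm 1 (our own sketch, not the paper's implementation) might look as follows; it assumes the selection indices of the row identified by G.id have already been fetched from HBase and deserialized into in-memory maps of java.util.BitSet, and that a triple pattern component starting with "?" denotes a variable.

import java.util.BitSet;
import java.util.Map;

public class SelectionIndexMatcher {

    /** A triple pattern (sp, pp, op); components starting with "?" are variables. */
    public record TriplePattern(String sp, String pp, String op) {}

    private static boolean isVariable(String term) {
        return term.startsWith("?");
    }

    /**
     * Algorithm 1 (applySelectionIndices): returns a bit vector with 1 in position k
     * iff the k-th triple of the graph matches every non-variable component of tp.
     */
    public static BitSet applySelectionIndices(TriplePattern tp,
                                               int graphSize,
                                               Map<String, BitSet> is,   // Is: subject   -> positions
                                               Map<String, BitSet> ip,   // Ip: predicate -> positions
                                               Map<String, BitSet> io) { // Io: object    -> positions
        BitSet v = new BitSet(graphSize);
        v.set(0, graphSize);                                   // start with all positions set to 1
        if (!isVariable(tp.sp())) {
            v.and(is.getOrDefault(tp.sp(), new BitSet()));     // v = v AND Is(sp)
        }
        if (!isVariable(tp.pp())) {
            v.and(ip.getOrDefault(tp.pp(), new BitSet()));     // v = v AND Ip(pp)
        }
        if (!isVariable(tp.op())) {
            v.and(io.getOrDefault(tp.op(), new BitSet()));     // v = v AND Io(op)
        }
        return v;
    }
}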

Algorithm 2 Applying join indices
1: function applyJoinIndices
2: input: graph identifier G.id, triple pattern with known solution tp, to-be-joined triple pattern tp′, triple position p, i.e., triple t_p = num^-1(p) matches tp, table TD
3: output: bit vector v that has 1 in the k-th position, i.e., v[k] = 1, if triple tk = num^-1(k), tk ∈ G, and tk can join with t_p based on the equality of their subjects, objects, and/or subjects-objects
4: Let Iss, Ioo, and Iso be the respective indices in the row with rowid G.id of table TD
5: Let v be a bit vector that has 1 in every position k, where 1 ≤ k ≤ |G|
6: Let v[p] = 0 /* to avoid joining the triple at position p with itself */
7: if tp.sp and tp′.sp are variables and tp.sp = tp′.sp then
8:   v = v ∧ Iss(p)
9: end if
10: if tp.op and tp′.op are variables and tp.op = tp′.op then
11:   v = v ∧ Ioo(p)
12: end if
13: if tp.sp and tp′.op are variables and tp.sp = tp′.op then
14:   v = v ∧ Iso(p)
15: end if
16: if tp.op and tp′.sp are variables and tp.op = tp′.sp then
17:   v = v ∧ Iso^T(p)
18: end if
19: return v
20: end function

Algorithm 3 Handling special cases
1: function handleSpecialCases
2: input: basic graph pattern bgp, set of solutions S
3: output: set of solutions F ⊆ S
4: Let F = S
5: /* Special case not supported by selection indices - rare in practice */
6: if any tp ∈ bgp contains any two variables with the same name then
7:   for each s ∈ F do
8:     Discard s, i.e., F = F − {s}, if the triple t ∈ s that corresponds to tp has different bindings for such variables
9:   end for
10: end if
11: /* Special case not supported by join indices - rare in practice */
12: if any tp ∈ bgp has a variable at tp.pp that also occurs in some other tp′ ∈ bgp, tp′ ≠ tp then
13:   for each s ∈ F do
14:     Discard s, i.e., F = F − {s}, if the triples t ∈ s and t′ ∈ s that correspond to tp and tp′ have different bindings for such variables
15:   end for
16: end if
17: return F
18: end function

Algorithm 4 Matching a basic graph pattern over a graph
1: function matchBGP
2: input: graph identifier G.id, basic graph pattern bgp = {tp1, tp2, ..., tpn} with n ≥ 1, table TD
3: output: set of subgraph solutions S = {g | g is a (sub)graph of G and g matches bgp}
4: Order the triple patterns in bgp, such that triple patterns that yield a smaller result and triple patterns that have a shared variable with preceding triple patterns are evaluated first
5: Let the ordered bgp = (tp1, tp2, ..., tpn)
6: vsel = applySelectionIndices(G.id, tp1, TD)
7: S = {(k) | vsel[k] = 1} /* solutions for the first triple pattern */
8: if S = ∅ then return S end if
9: for each tpi in (tp2, ..., tpn) do
10:   vsel = applySelectionIndices(G.id, tpi, TD)
11:   Let set Sjoin = ∅ /* solutions for the current join */
12:   Let set TP = {tpj | tpj ∈ (tp1, tp2, ..., tpi−1), j < i, and tpi and tpj have variables with the same name as subject or object patterns}
13:   for each s in S do
14:     vjoin = vsel
15:     for each tpj in (tp1, tp2, ..., tpi−1) do
16:       if tpj in TP then
17:         vjoin = vjoin ∧ applyJoinIndices(G.id, tpj, tpi, s[j], TD) /* s[j] is the solution (triple position) for tpj found in sequence s at position j */
18:       end if
19:     end for
20:     Stpi = {(k) | vjoin[k] = 1} /* solutions for the current triple pattern */
21:     Compute the Cartesian product of {s} and Stpi, i.e., Sjoin = Sjoin ∪ ({s} × Stpi)
22:   end for
23:   S = Sjoin
24:   if S = ∅ then return S end if
25: end for
26: /* Replace triple positions in S with actual triples */
27: for each s in S do
28:   s′ = {num^-1(k) | k ∈ s}
29:   Replace s with s′ in S
30: end for
31: S = handleSpecialCases(bgp, S) /* handle special cases not supported by the selection and join indices */
32: return S
33: end function

Function applyJoinIndices is outlined in Algorithm 2. This function, given a graph identifier, a triple pattern with one known solution represented by a triple position, and a to-be-joined triple pattern, can quickly compute a bit vector that encodes solutions (triple positions) that join with the known solution. A join condition is implicitly encoded by the use of the same variable in the two triple patterns. It can be represented by the equality of subjects, objects, and/or subjects-objects in the two triple patterns. Join indices are also applied using the conjunction of the respective bit vectors.
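For completeness, here is an analogous sketch of the join step (again our own illustration, under the same assumptions as the selection sketch above, with precomputed join indices held as one BitSet per triple position): given the position p of a triple already matched for tp, it intersects the appropriate join-index vectors to find positions that can join with it.

import java.util.BitSet;

public class JoinIndexMatcher {

    /** A triple pattern (sp, pp, op); components starting with "?" are variables. */
    public record TriplePattern(String sp, String pp, String op) {}

    private static boolean sharedVariable(String a, String b) {
        return a.startsWith("?") && b.startsWith("?") && a.equals(b);
    }

    /**
     * Algorithm 2 (applyJoinIndices): tp has a known solution at triple position p;
     * the result has 1 in position k iff the k-th triple can join with it.
     * iss, ioo, iso, ios are the join indices of Definitions 3.5-3.7, one BitSet per position.
     */
    public static BitSet applyJoinIndices(TriplePattern tp, TriplePattern tpPrime, int p,
                                          int graphSize,
                                          BitSet[] iss, BitSet[] ioo, BitSet[] iso, BitSet[] ios) {
        BitSet v = new BitSet(graphSize);
        v.set(0, graphSize);
        v.clear(p);                                                 // do not join the triple at position p with itself
        if (sharedVariable(tp.sp(), tpPrime.sp())) v.and(iss[p]);   // subject-subject join
        if (sharedVariable(tp.op(), tpPrime.op())) v.and(ioo[p]);   // object-object join
        if (sharedVariable(tp.sp(), tpPrime.op())) v.and(iso[p]);   // subject-object join
        if (sharedVariable(tp.op(), tpPrime.sp())) v.and(ios[p]);   // object-subject join (transpose of Iso)
        return v;
    }
}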

Function handleSpecialCases is outlined in Algorithm 3. This function performs post-processing of the final results obtained via basic graph pattern matching. It takes a basic graph pattern and its set of solutions, where each solution is represented by a sequence of actual triples, and deals with special cases not supported by the selection and join indices. In particular, selection indices have no means to verify that, if a triple pattern contains the same variable twice (or even three times), then a matching triple must have identical values bound to the multiple occurrences of this variable. Join indices do not support join conditions based on the equality of a predicate and any other term in a triple. It is possible to add additional indices to handle selection and join operations on predicates; however, such indices would rarely be needed for real-life queries.

Finally, the main function matchBGP is outlined in Algorithm 4. This function matches a SPARQL basic graph pattern bgp that consists of a set of triple patterns tp1, tp2, ..., tpn over an RDF graph with a known identifier that is stored in HBase. The final result is a set of subgraph solutions S. The algorithm starts by ordering the triple patterns in bgp (lines 4 and 5) using two criteria: (1) triple patterns that yield a smaller result should be evaluated first to decrease the number of iterations, and (2) triple patterns that have a shared variable with preceding triple patterns should be given preference over triple patterns with no shared variables to avoid unnecessary Cartesian products.

Next (lines 6-8), the algorithm applies selection indices and obtains a set of solutions for the first triple pattern. Each solution in the set is represented by a sequence with one triple position. An empty set of solutions results in an empty result for the whole basic graph pattern. All subsequent triple patterns are further evaluated and joined with already available results (lines 9-25). For any subsequent triple pattern, selection indices are applied (line 10), an empty set of join solutions is prepared (line 11), and preceding triple patterns that share variables with the current triple pattern are identified (line 12). For each solution that has been obtained for the preceding triple patterns (lines 13-22), join indices are applied (lines 14-19), the bit vector resulting from both selection and join index applications is converted to a set of solutions for the current triple pattern (line 20), and the join result is computed by combining the known solution with the newly computed ones (line 21). The set of known solutions is updated (line 23) and verified to be non-empty (line 24). The process repeats for the next available triple pattern. Once all joins are processed, each triple position in the set of solutions is replaced with an actual triple from the graph using the num^-1 function (lines 26-30). The solutions are then post-processed by function handleSpecialCases to accommodate cases that are not supported by the selection and join indices (line 31). Finally, the resulting set S is returned (line 32).

Some of the advantages of these algorithms include: (1) expensive selection and join computations are performed over indices rather than a graph; (2) computation heavily relies on numeric values that represent triple positions rather than actual triples with lengthy literals and URIs; (3) computation can be fully completed on the same machine where the data resides, eliminating all intermediate data transfers over a network.

V. PERFORMANCE STUDY

This section reports our empirical evaluation of the proposed approach and algorithms.

A. Experimental Setup

Hardware. Our experiments used nine commodity machines with identical hardware. Each machine had a late-model 3.0 GHz 64-bit Pentium 4 processor, 2 GB DDR2-533 RAM, and an 80 GB 7200 rpm Serial ATA hard drive. The machines were networked together via their add-on gigabit Ethernet adapters connected to a Dell PowerConnect 2724 gigabit Ethernet switch and were all running 64-bit Debian Linux 6.0 and Oracle JDK 7.

Hadoop and HBase. Hadoop 1.0.0 and HBase 0.94 were used. Minor changes to the default configuration for stability included setting each block of data to replicate two times and increasing the HBase max heap size to 1.2 GB.

Out of nine identical machines in the cluster, one was designated as the HBase master and the other eight were HBase region servers (slaves).

Our implementation. Our algorithms were implemented in Java, and the experiments were conducted using Bash shell scripts to execute the Java class files and store the results in an automated and repeatable manner.

B. Datasets and Queries

The experiments used the University of Texas Provenance Benchmark (UTPB) [17]. UTPB includes provenance templates defined according to the three vocabularies of the Open Provenance Model (OPM)1, provenance generation software capable of generating provenance for any number of workflow runs based on a provenance template, and provenance test queries in several categories. We used UTPB to generate datasets of varying sizes using the "Database Experiment" template for a successful workflow execution that was serialized based on the Open Provenance Model OWL Ontology (OPMO). Each generated RDF graph in these datasets represented the provenance of a single workflow execution and contained roughly 400 RDF triples. Table I indicates the characteristics of each generated UTPB dataset. The table does not take into account the dictionary file (also an RDF graph) that was generated by UTPB for each dataset and contained all graph identifiers. The number of triples in this graph was the same as the number of RDF graphs in the dataset from Table I (e.g., 100,000 triples for D1).

We used 11 UTPB test queries in the first four categories (Graphs, Dependencies, Artifacts, and Processes) to benchmark the performance of our implementation. The exact queries expressed in SPARQL can be found on the UTPB website2.

When a provenance dataset was stored in our HBase cluster according to the proposed schema, HBase automatically partitioned the table into regions (subsets of rows). Available region servers were assigned to handle certain regions. In other words, the provenance dataset was partitioned into subsets of provenance graphs that were stored on individual machines in the cluster. Every provenance graph (along with its indices) was stored as a whole on one of the machines, with no partitioning. Therefore, any query over an individual provenance graph was processed by the machine that stored the graph, avoiding any expensive data transfers of intermediate results among region servers. The final result of a query was transferred to a client application running on the HBase master.

1 Open Provenance Model, http://openprovenance.org
2 University of Texas Provenance Benchmark, http://faculty.utpa.edu/chebotkoa/utpb/

Table I. Dataset Characteristics.

Dataset | # of RDF graphs (# of workflow runs) | # of RDF triples | Size
D1      | 100,000                              | 40,000,000       | 2.1 GB
D2      | 200,000                              | 80,000,000       | 4.2 GB
D3      | 300,000                              | 120,000,000      | 6.3 GB
D4      | 400,000                              | 160,000,000      | 8.4 GB
D5      | 500,000                              | 200,000,000      | 10.5 GB

Figure 3. Query Performance and Scalability. (Eleven panels, one per query Q1-Q11; X-axis: dataset D1-D5; Y-axis: query execution time, ms.)

C. Query Evaluation Performance and Scalability

Query performance and scalability of our approach are reported in Fig. 3. Queries Q1 and Q2 were in the Graphs category and had basic graph patterns with one triple pattern each. Both Q1 and Q2 yielded larger results than the other queries: Q1 returned all graph identifiers in a dataset (e.g., 100,000 triples for D1 and 500,000 triples for D5) and Q2 returned all triples in a particular provenance graph (around 400 triples for each dataset). Even though these were the simplest of the 11 UTPB test queries, they proved to be more expensive due to their larger result sets, which is especially evident for query Q1, the only query whose execution time was on the order of seconds. In the case of both Q1 and Q2, which involved no joins, the major factor in query performance is the transfer time of the final query results to a client machine, and it is hardly possible to achieve better performance on the given hardware. By contrast, all other queries performed on the order of tens of milliseconds, required joins, and returned subsets of triples in a particular provenance graph (< 400 triples).

Queries Q3-Q7 were in the Dependencies category and dealt with various dependencies among artifacts and processes in provenance graphs. They all had similar complexities: basic graph patterns with three triple patterns each. They also returned comparable result sets in terms of the number of triples (except query Q4, which returned an empty result set for the selected UTPB provenance graph template). As a result, these queries showed very similar query evaluation performance, with Q4 being the fastest as it only required the evaluation of its first triple pattern to compute the final (empty) result.

Queries Q8 and Q9 were in the Artifacts category and dealt with data artifacts in provenance graphs. Q8 contained six triple patterns and two optional clauses. Q9 had 18 triple patterns, two optional clauses, two filter constructs, and one union construct. Q9 is the most complex query of all, yet it was shown to be efficient and scalable with our approach.

The last two queries, Q10 and Q11, were in the Processes category and dealt with processes in provenance graphs. Q10 had two triple patterns and one optional clause. While Q11 is a complex query with 11 triple patterns and one union clause, it yielded an empty query result in our experiments due to the selected provenance template.

In summary, the proposed approach and its implementation proved to be efficient and scalable. Q1 showed linear scalability and took the most time to execute due to a relatively large result set. The other queries showed nearly constant scalability (technically, linear with a small slope). This can be explained by the fact that each query (except Q1) dealt with a single provenance graph of fixed size, with minimal data transfers and fast index-based join processing.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we studied the problem of storing and querying large collections of scientific workflow provenance graphs serialized as RDF graphs in Apache HBase. We designed novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets. Our storage scheme takes advantage of the fact that individual provenance graphs generally fit into the memory of a single machine and require no partitioning.

Our bitmap indices are stored together with graphs and support both selection and join operations for efficient query processing. We also proposed efficient querying algorithms to evaluate SPARQL queries in HBase. Our algorithms rely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples with lengthy literals and URIs, and eliminate the need for intermediate data transfers over a network. Finally, we conducted an empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark. Our experiments confirmed that our proposed storage, indexing and querying techniques are efficient and scalable for large provenance datasets.

In the future, we plan to compare our approach with other SQL and NoSQL solutions in the context of distributed scientific workflow provenance management, as well as experiment with a multi-user workload to measure the query throughput of our system.

REFERENCES

[1] Y. Simmhan, B. Plale, and D. Gannon, "A survey of data provenance in e-science," SIGMOD Record, vol. 34, no. 3, pp. 31-36, 2005.
[2] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M. McPhillips, S. Bowers, M. K. Anand, and J. Freire, "Provenance in scientific workflow systems," IEEE Data Engineering Bulletin, vol. 30, no. 4, pp. 44-50, 2007.
[3] S. B. Davidson and J. Freire, "Provenance and scientific workflows: challenges and opportunities," in Proc. of SIGMOD Conference, 2008, pp. 1345-1350.
[4] V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, "Scientific workflows and provenance: Introduction and research opportunities," Datenbank-Spektrum, vol. 12, no. 3, pp. 193-203, 2012.
[5] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. T. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. G. Stephan, and J. V. den Bussche, "The Open Provenance Model core specification (v1.1)," Future Gen. Comp. Syst., vol. 27, no. 6, pp. 743-756, 2011.
[6] A. Chebotko, S. Lu, X. Fei, and F. Fotouhi, "RDFProv: A relational RDF store for querying and managing scientific workflow provenance," Data Knowl. Eng., vol. 69, no. 8, pp. 836-865, 2010.
[7] J. Zhao, C. A. Goble, R. Stevens, and D. Turi, "Mining Taverna's semantic web of provenance," Concurr. Comput.: Pract. Exper., vol. 20, no. 5, pp. 463-472, 2008.
[8] Third Provenance Challenge, http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge.
[9] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua, "A reference architecture for scientific workflow management systems and the VIEW SOA solution," IEEE Transactions on Services Computing, vol. 2, no. 1, pp. 79-92, 2009.
[10] T. M. Oinn, et al., "Taverna: lessons in creating a workflow environment for the life sciences," Concurr. Comput.: Pract. Exper., vol. 18, no. 10, pp. 1067-1100, 2006.
[11] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. B. Jones, E. A. Lee, J. Tao, and Y. Zhao, "Scientific workflow management and the Kepler system," Concurr. Comput.: Pract. Exper., vol. 18, no. 10, pp. 1039-1065, 2006.
[12] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo, "Managing the evolution of dataflows with VisTrails," in Proc. of ICDE Workshops, 2006, p. 71.
[13] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar, "Provenance trails in the Wings/Pegasus system," Concurr. Comput.: Pract. Exper., vol. 20, no. 5, pp. 587-597, 2008.
[14] Y. Zhao, et al., "Swift: Fast, reliable, loosely coupled parallel computation," in Proc. of SWF, 2007, pp. 199-206.
[15] Apache HBase, http://hbase.apache.org.
[16] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems, vol. 26, no. 2, 2008.
[17] A. Chebotko, E. D. Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly, "UTPB: A benchmark for scientific workflow provenance storage and querying systems," in Proc. of SWF, 2012, pp. 17-24.
[18] M. F. Husain, L. Khan, M. Kantarcioglu, and B. M. Thuraisingham, "Data intensive query processing for large RDF graphs using cloud computing tools," in Proc. of CLOUD, 2010, pp. 1-10.
[19] J. Myung, J. Yeon, and S. Lee, "SPARQL basic graph pattern processing with iterative MapReduce," in Proc. of MDAC, 2010, pp. 6:1-6:6.
[20] P. Ravindra, V. V. Deshpande, and K. Anyanwu, "Towards scalable RDF graph analytics on MapReduce," in Proc. of MDAC, 2010, pp. 5:1-5:6.
[21] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen, "Scalable distributed reasoning using MapReduce," in Proc. of ISWC, 2009, pp. 634-649.
[22] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen, "PigSPARQL: mapping SPARQL to Pig Latin," in Proc. of SWIM, 2011, p. 4.
[23] J. Huang, D. J. Abadi, and K. Ren, "Scalable SPARQL querying of large RDF graphs," PVLDB, vol. 4, no. 11, pp. 1123-1134, 2011.
[24] C. Franke, S. Morin, A. Chebotko, J. Abraham, and P. Brazier, "Distributed semantic web data management in HBase and MySQL Cluster," in Proc. of CLOUD, 2011, pp. 105-112.
[25] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler, "Matrix 'Bit' loaded: a scalable lightweight join query processor for RDF data," in Proc. of WWW, 2010, pp. 41-50.