Similarity-Based Query Caching

Heiner Stuckenschmidt
Vrije Universiteit Amsterdam
de Boelelaan 1081a, 1081HV, Amsterdam
heiner@cs.vu.nl

Abstract. With the success of the semantic web, infrastructures for storing and querying RDF data are gaining importance. A couple of systems are available now that provide basic database functionality for RDF data. Compared to modern database systems, RDF storage technology still lacks sophisticated optimization methods for query processing. Current work in this direction is mainly focussed on index structures for speeding up access at the triple level or for special queries. In this paper, we discuss semantic query caching as a high-level optimization technique for RDF querying to supplement existing work on lower-level techniques. Our approach to semantic caching is based on a notion of similarity of RDF queries determined by the costs of modifying the results of a previous query into the result of the current one. We discuss the problem of subsumption for RDF queries, present a cost model and derive a similarity measure for RDF queries based on the cost model and the notion of graph edit distance.

1 Motivation

With the success of the semantic web, infrastructures for storing and querying RDF data are gaining importance. A couple of systems are available now that provide basic database functionality for RDF data (e.g. [1]). Compared to modern database systems, RDF storage technology still lacks sophisticated optimization methods for query processing. Current work in this direction is mainly focussed on index structures for speeding up access at the triple level or for special queries (e.g. [5]). In this paper, we discuss semantic query caching [6, 4, 7, 8] as a high-level optimization technique for RDF querying to supplement existing work on lower-level techniques. The idea of query caching is to reuse the computed results of previously asked queries in further query processing. In order to illustrate the idea, we assume an RDF repository with information about computer science publications. A possible query asks for all papers about the topic 'databases' written by researchers from the Vrije Universiteit Amsterdam. This query can be processed more efficiently if we already know the results of the query for all papers about 'databases', because we can evaluate the query on that result set rather than on the complete repository, saving the computational costs for selecting database papers from the set of all papers. Similarly, if we know the result of the query for all researchers from the Vrije Universiteit Amsterdam, we can save the costs of computing the join of papers with authors, because the result set is much smaller than the set of all authors.

In [9] we have argued for the benefits of distributed RDF repositories over centralized ones in terms of scalability and flexibility. In such a distributed setting, the benefits of query caching are even more evident: the use of a local copy of previously computed answers of a remote source instead of querying the source itself avoids communication costs, which are the major cost factor in distributed query processing. Another benefit is an increase in the robustness of the system: in cases where a remote source is not available due to network problems, part of the information is still available in the cached results.

The main function of conventional approaches to query caching, which we refer to as syntactic query caching here, is to reduce the costs of network communication. In these approaches, queries to the system are completely evaluated using index structures and proxy representations. The cache is only put to use when the identifiers of the results have been determined. In order to avoid the costs of retrieving the complete information about the objects in the result sets, the system checks whether relevant object information is available in the cache. Access to the cached information is often organized based on the physical address of the data items to be retrieved. Our goal, as compared to syntactic caching, is not only to reduce network communication, but also to reduce the costs of the query processing step itself. More specifically, we do not want to execute the given query completely, but directly use previous results to avoid the need to perform selection and join operations. This is of special importance for RDF query processing, because in the presence of RDF schema information query processing also involves expensive reasoning steps.
Results of previous queries already contain information that has been derived in a reasoning step. In order to be able to directly use previous results, we use the queries that have been used to produce the results in the cache as high-level descriptions. When processing a new query, we first compare it to previously asked queries and determine whether there are useful results in the cache. Only if this is not the case, or if we did not find all requested information in the cache, do we evaluate the query in the common way and retrieve the results. We refer to this strategy as semantic query caching (compare [8]). In this paper, we present an approach for semantic query caching that is based on the similarity of subsuming queries. In the next section we discuss RDF querying and the notion of query subsumption. Afterwards, we define a cost model for RDF querying and show how it can be used as the basis for a similarity measure used in query caching. Finally, we define the similarity of RDF queries relative to given information using the notion of graph edit distance, where the cost of edit operations is based on the cost model. We conclude with a discussion of our approach and some directions for future work.
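The caching strategy described above can be sketched in a few lines; the class and function names below are illustrative assumptions, not taken from an actual implementation.

```python
# Sketch of the semantic caching strategy: look for a cached result whose
# query subsumes the new one; only otherwise evaluate against the repository.
# All names here are illustrative assumptions.

class SemanticCache:
    def __init__(self):
        # maps a previously asked query to its cached result set
        self.entries = {}

    def store(self, query, results):
        self.entries[query] = results

    def candidates(self, query, subsumes):
        """Return cached (query, results) pairs whose query subsumes `query`."""
        return [(q, r) for q, r in self.entries.items() if subsumes(q, query)]


def answer(query, cache, subsumes, evaluate_on, evaluate_full):
    """Answer `query`, reusing a subsuming cached result when one exists."""
    candidates = cache.candidates(query, subsumes)
    if candidates:
        # evaluate the query on the (smaller) cached result set
        _cached_query, cached_results = candidates[0]
        return evaluate_on(query, cached_results)
    # no useful cache entry: evaluate against the full repository and cache
    results = evaluate_full(query)
    cache.store(query, results)
    return results
```

When several cached result sets qualify, a real system would pick the most beneficial one; that choice is exactly what the similarity measure developed later in the paper is for.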

2 RDF Queries and Subsumption

Our work on semantic caching is focussed on the special case of queries over RDF data. The special characteristics of the RDF data model influence the design of the different techniques used in our approach. In this section, we formally define RDF models and queries using graph-theoretic notions. We also define the notion of query subsumption needed to make sure that we only retrieve correct answers from the cache. A characteristic property of an RDF model is that its statements form a labelled, directed graph. The resources mentioned in the statements can be seen as nodes in such a graph (this may include properties as special kinds of resources). Subject and object of each statement are connected by a directed link labelled with the name of the property acting as predicate. RDF graphs of this kind are called ground graphs. Figure 1 shows such an RDF graph.

Fig. 1. Example of an RDF graph describing a publication

Based on the graph-based view of an RDF model, we can characterize some properties of RDF models that are relevant for query answering. The first basic property of RDF is the fact that a graph entails all its subgraphs. The RDF data model described above and its associated semantics provide us with a basis for defining queries on RDF models in a straightforward way. The idea, which has been proposed elsewhere and is adopted here, is to use graphs with unlabelled nodes as queries. The unlabelled elements in a query graph can be seen as the variables of the query. Figure 2 shows an RDF graph that represents a query. The nodes labelled with X and Y are unlabelled nodes in the sense of the RDF model that act as query variables. Assuming that inProceedings is a subclass of Publication and keyword is a subproperty of about, we see that the graph in figure 1 represents a result of the query in figure 2. Using a graph-based interpretation of RDF data and queries provides us with the possibility to use well-established concepts from graph theory as a basis for defining important notions such as query result and query subsumption. As we will see later, representing RDF data as graphs also supports the definition of

Fig. 2. An example query for publications on RDF

a similarity measure for comparing queries. As a basis for our investigation we use the following standard notions of graph and subgraph:

Definition 1 (Graph/Subgraph). A graph is a 4-tuple ⟨V, E, α, β⟩ where V is a finite set of vertices, E ⊆ V × V is a set of edges, α : V → L_V is a function assigning labels to vertices and β : E → L_E is a function assigning labels to edges. A graph g′ is a subgraph of a graph g, denoted as g′ ⊆ g, if V′ ⊆ V, E′ = E ∩ (V′ × V′), α′(v) = α(v) for all v ∈ V′ and β′(e) = β(e) for all e ∈ E′.

We can now define RDF models and queries as graphs with special labels that correspond to resources, properties and literals. In particular, we define an RDF model as a graph where all edge labels describe RDF properties and all node labels describe resources or literals. An RDF query simply is an RDF model graph that contains unlabelled nodes. These unlabelled nodes correspond to query variables:

Definition 2 (RDF Model/Query Graph). An RDF model is a graph g = ⟨V, E, α, β⟩ with L_V = R ∪ L, where R are resource IDs and L are literal values, and L_E = P, where P are property IDs. An RDF query graph is an RDF model graph with L_V = R ∪ L ∪ {∅}. Nodes with α(v) = ∅ correspond to variables in the query.

We can now use established concepts from graph theory for defining relations between RDF queries and models. In particular, we use a modification of the notion of graph and subgraph isomorphism. We define the notion of a relative graph match by replacing the equality conditions between node and edge labels by a specialization relation ⪯. The corresponding definition is the following:

Definition 3 (Relative Graph Match).
Let g and g′ be graphs. A graph match between g and g′ relative to a specialization relation ⪯ is a bijective mapping f : V → V′ such that
– α(v) ⪯ α′(f(v)) for all v ∈ V
– for any edge e = (u, v) ∈ E there exists an edge e′ = (f(u), f(v)) ∈ E′ such that β(e) ⪯ β′(e′), and for any edge e′ = (u′, v′) ∈ E′ there exists an edge e = (f⁻¹(u′), f⁻¹(v′)) ∈ E such that β(e) ⪯ β′(e′).
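Definitions 1 and 2 translate directly into data structures. The following is a minimal sketch under our own naming assumptions, using None to encode the empty label ∅ of a query variable.

```python
from dataclasses import dataclass

@dataclass
class Graph:
    """A labelled directed graph <V, E, alpha, beta> (Definition 1)."""
    vertices: set   # V
    edges: set      # E, a subset of V x V
    alpha: dict     # vertex labels; None encodes the empty label (a variable)
    beta: dict      # edge labels (property IDs)

    def is_subgraph_of(self, g):
        """Definition 1: a subgraph is a vertex subset with the induced,
        label-preserving edge set E' = E ∩ (V' x V')."""
        induced = {(u, v) for (u, v) in g.edges
                   if u in self.vertices and v in self.vertices}
        return (self.vertices <= g.vertices
                and self.edges == induced
                and all(self.alpha[v] == g.alpha[v] for v in self.vertices)
                and all(self.beta[e] == g.beta[e] for e in self.edges))


def is_query_graph(g):
    """Definition 2: a query graph may contain unlabelled (variable) nodes."""
    return any(label is None for label in g.alpha.values())
```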

If f : V → V′ is a relative graph match between graphs g and g′ and g′ ⊆ g″, then f is a relative subgraph match from g to g″. It turns out that the notion of relative graph match is general enough to define conditions for an RDF model to be the answer of a query as well as conditions for an RDF query to be subsumed by another one. For this purpose, we define the specialization relation in such a way that it allows us to model the use of inheritance relations and variable instantiation in querying:

Definition 4 (Query Subsumption and Query Result). An RDF query graph g′ is subsumed by a query graph g if there exists a relative subgraph match between g and g′ such that
– β(e) ⪯ β′(e′) iff β′(f(e)) = β(e) or β′(f(e)) rdfs:subPropertyOf β(e) can be derived,
– α(v) ⪯ α′(f(v)) iff α(v) = ∅, α(v) = α′(f(v)) or α′(f(v)) rdfs:subClassOf α(v) can be derived.
If g′ is an RDF model graph, g′ is called a result of g.

Based on this definition we can now explain why the graph in figure 1 is a result of the query shown in figure 2. First of all, on the structural level, we see that there is a subgraph isomorphism between the graphs. Concerning the correspondence of labels, we have
– ∅ ⪯ fqas2004
– ∅ ⪯ 'H.Stuckenschmidt'
– RDF ⪯ RDF
– author ⪯ author
– rdf:type ⪯ rdf:type
– Publication ⪯ inProceedings because inProceedings rdfs:subClassOf Publication
– about ⪯ keyword because keyword rdfs:subPropertyOf about.

Using similar arguments, we can conclude that the query in figure 3 is subsumed by the one in figure 2. While the known complexity of determining subgraph isomorphisms prohibits the use of the corresponding algorithms for determining query results over large information sources, using these algorithms for determining subsumed queries in a semantic cache is permissible, because the size and number of the graphs representing queries will be rather small compared to the RDF models they represent.
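A brute-force check in the spirit of Definition 4 can be sketched as follows. Graphs are given as (node labels, edge labels) pairs; the class and property hierarchies and all identifiers are illustrative assumptions, and for brevity the sketch checks only the forward edge condition (a homomorphism-style match), which suffices for the running example.

```python
from itertools import combinations, permutations

# Assumed hierarchies (rdfs:subClassOf / rdfs:subPropertyOf closures)
SUBCLASS = {'Publication': {'inProceedings', 'Article'}}
SUBPROPERTY = {'about': {'keyword'}}

def node_spec(a, b):
    """a ⪯ b for node labels: variable (None), equality, or subclass."""
    return a is None or a == b or b in SUBCLASS.get(a, set())

def edge_spec(a, b):
    """a ⪯ b for edge labels: equality or subproperty."""
    return a == b or b in SUBPROPERTY.get(a, set())

def subsumes(query, graph):
    """Does the query graph subsume `graph` under the specialization relation?

    Both arguments are (node_labels, edge_labels) pairs: node id -> label
    (None for a variable) and (u, v) -> property label.
    """
    qnodes, qedges = query
    gnodes, gedges = graph
    ids_q, ids_g = list(qnodes), list(gnodes)
    if len(ids_q) > len(ids_g):
        return False
    # try every injective mapping f from query nodes into graph nodes
    for subset in combinations(ids_g, len(ids_q)):
        for image in permutations(subset):
            f = dict(zip(ids_q, image))
            if all(node_spec(qnodes[u], gnodes[f[u]]) for u in ids_q) and \
               all((f[u], f[v]) in gedges and
                   edge_spec(lbl, gedges[(f[u], f[v])])
                   for (u, v), lbl in qedges.items()):
                return True
    return False

# the query of figure 2 and the model of figure 1
query = ({'X': None, 'Y': None, 'c': 'Publication', 't': 'RDF'},
         {('X', 'c'): 'rdf:type', ('X', 'Y'): 'author', ('X', 't'): 'about'})
model = ({'p': 'fqas2004', 'a': 'H.Stuckenschmidt',
          'c': 'inProceedings', 't': 'RDF'},
         {('p', 'c'): 'rdf:type', ('p', 'a'): 'author', ('p', 't'): 'keyword'})
```

The exponential enumeration of mappings mirrors the complexity argument above: acceptable for the small query graphs held in a cache, but not for matching against large RDF models.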

3 Semantic Caching and Run-Time Complexity

One of the main goals of semantic caching is the reduction of the time needed to process a given query. In order to be able to decide whether the reuse of previous results will indeed reduce the processing time, we need a cost model that enables us to estimate this time. For this purpose, we adapt the cost model proposed in [9]. We also use this cost model to argue that the reuse of partial results will indeed often lead to a reduction of the processing time.

Fig. 3. Query for Articles on RDF by employees of the Vrije Universiteit which is subsumed by the query in figure 2

3.1 A Cost Model for RDF Query Processing

The exact determination of the run-time costs of RDF query processing depends on a number of factors as well as on our assumptions about the general query processing strategy. Concerning the latter, we assume an approach where the query processor has access to a repository that contains RDF data. This data can be accessed on the basis of single properties, returning all tuples with the respective property. Further, we assume that the data in the repository is the deductive closure of the data with respect to the model-theoretic semantics of RDF. These assumptions correspond to the design of the RDF storage and retrieval system Sesame that has been developed in cooperation with the Knowledge Representation and Reasoning Group at the Vrije Universiteit Amsterdam. Details about Sesame can be found in [1]. Under these assumptions, the overall costs of query processing are divided into costs for accessing data in the repository and costs for performing join operations on the retrieved data.

Data Access. The first step of query processing is the retrieval of the relevant data from the repository. The cost of transferring the instances of a property p from the repository to the query processor is modelled as

AC_p = C_init + |p| · ‖Inst‖_p · C

where C_init represents the cost of initiating the data transmission, |p| denotes the cardinality of p, ‖Inst‖_p is the average length of the instances of the property p, and C represents the transmission cost per data unit from the repository to the query processor. In order to answer the example query from figure 2 we would have to retrieve the content of the relations rdf:type, author and about, each time producing the costs indicated above.

Join Processing. After having retrieved the relevant data, we need to perform selection and join operations on the properties in order to compute the results of a query. The processing cost of a nested loop join of two relations p, r is defined as

NJC_{p,r} = |p| · |r| · K

where |x| denotes the cardinality of the relation x and K represents the comparison cost of two objects. The selection of tuples from a relation can be seen as a join with a relation of size 1 and therefore has a cost of |p| · K. For the example query, we would have to select those instances of the type relation that have Publication as an object, causing costs of |rdf:type| · K. Further, we

would have to join the result with the author relation. To compute the cardinality of intermediate results, a join selectivity is used. The join selectivity σ is defined as the ratio between the tuples retained by the join and those created by the Cartesian product: σ = |p ⋈ r| / |p × r|. As a result, the costs for joining the author relation with the result of the selection are

(|rdf:type ⋈ 'Publication'| / |rdf:type|) · |author| · K

Query Plan Costs. The overall cost of a query plan θ consists of the sum of all communication costs and all join processing costs:

QPC_θ = Σ_{i=1}^{n} AC_{p_i} + PC_θ

where PC_θ represents the join processing cost of all joins in the query plan θ; it is computed as a sum of recurrent applications of the formula for computing join costs. This means that the overall costs for computing the result of the query in figure 2 are the sum of the costs for accessing the three relations, plus the costs for performing selections on the type and the about relations, and the costs of joining the results with each other and with the author relation.

3.2 Caching Costs vs. Querying Costs

The cost model introduced above provides us with a method to decide whether result caching provides an advantage with respect to run-time complexity. In principle, we can analyze the benefits for any concrete query asked to the system. Due to the limited space we only discuss some general observations about the relative complexity of semantic caching as compared to traditional query processing. The first aspect that is influenced by semantic caching is the access costs. Depending on the overall system architecture and the implementation of the cache, different parameters of the access costs are influenced. Concerning the initiation costs, we have to notice that these are normally higher for accessing the cache than for accessing the repository, because the costs of determining subsuming queries fall into this category. This drawback will in most cases be outweighed by the savings that can be achieved on the other parameters. Already being the result of a query, the relations in the cache will normally be smaller, because only part of a relation is stored. While the average length of the tuples will be the same in the cache, there are cases where the transmission costs from the cache will be significantly lower than for the repository. This is for example the case when a client-server architecture is used where the repository is located at a different location and content has to be shipped over the network. Even if repository and query processor are on the same machine, the cache, normally being smaller than the complete data set, can be implemented as an in-memory repository that allows faster access.

The semantic caching approach also leads to potentially major savings in the costs for join processing. Because we cache complete query results, the joins relevant for the indexing query are already computed. The fact that we use query subsumption as a criterion for selecting results from the cache makes sure that we do not miss relevant information. All that is left to do is to perform a set of additional joins and selections that would also be part of the normal process of query answering. When we consider the queries in figures 2 and 3, we save the costs of selecting with respect to the topic and the costs for joining the type with the author relation. We still have to select the type relation with respect to Articles. Further, we have to access the keyword and the affiliation relations, compute the selection with respect to the Vrije Universiteit, and join the two with the cached result. The costs of the individual joins are again lower than in the normal case because the size of the cached relations is normally significantly smaller than that of the original ones.
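The trade-off can be made concrete with the formulas of section 3.1. The following sketch plugs invented statistics (all cardinalities, selectivities and unit costs below are placeholders, not measurements) into the cost model to compare a full evaluation with a cache-based one.

```python
def access_cost(card, avg_len, c_init=10.0, c_unit=0.01):
    """AC_p = C_init + |p| * ||Inst||_p * C (data access cost of a property)."""
    return c_init + card * avg_len * c_unit

def join_cost(card_p, card_r, k=0.001):
    """NJC_{p,r} = |p| * |r| * K (nested loop join)."""
    return card_p * card_r * k

def selection_cost(card_p, k=0.001):
    """Selection is a join with a relation of size 1: |p| * K."""
    return card_p * k

# invented statistics for the example query of figure 2
stats = {'rdf:type': 10_000, 'author': 5_000, 'about': 8_000}
avg_len = 20

# full evaluation: access all three relations, select on type and about,
# then join the (already reduced) intermediate results
full = sum(access_cost(card, avg_len) for card in stats.values())
full += selection_cost(stats['rdf:type']) + selection_cost(stats['about'])
sel_type, sel_about = 500, 300          # assumed selection cardinalities
full += join_cost(sel_type, stats['author']) + join_cost(400, sel_about)

# cached evaluation: the 'databases' papers (say 300 results) are already
# selected and joined; only the remaining join with author is needed
cached = access_cost(300, avg_len, c_init=15.0)  # cache init assumed dearer
cached += join_cost(300, stats['author'])

print(f"full = {full:.1f}, cached = {cached:.1f}")
```

With these numbers the cached plan wins despite its higher initiation cost, illustrating the observation above that the smaller cached relations dominate the comparison.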

4 A Cost-Based Similarity Measure for RDF Queries

In the previous section we have argued that reusing the results of subsuming queries can improve the performance of RDF querying. In real-life situations, we will often find more than one relevant result set in the cache. In this case, we need to decide which of these result sets leads to the highest savings. We do this by defining a notion of similarity between RDF queries that reflects the amount of effort needed to compute the result of one query from the result of the other. As a consequence, the result set indexed by the query with the highest degree of similarity to the current query leads to the highest reduction and should be chosen.

4.1 Graph Edit Operations and Graph Edit Distance

There is a long tradition of research on graph matching. Most approaches are based on the notion of subgraph isomorphism. For determining the similarity between two graphs, often the notion of the maximal common subgraph is used, which also relies on the determination of an isomorphism between subgraphs [2]. These approaches are not applicable in our setting, because we do not rely on the notion of isomorphism but allow labels to be specializations of one another. An alternative approach to determining similarity between graphs is the notion of the edit distance between two graphs, which is closely related to the notion of the maximal common subgraph [3] but allows for a more flexible definition of similarity. The basic idea of using the graph edit distance is to find a sequence of edit operations that transforms one graph into the other. The corresponding operations are defined as follows [10].

Definition 5 (Edit Operation). Let ⊥ be a unique symbol different from all labels. An edit operation is written a → b where a, b ∈ L_V ∪ L_E ∪ {⊥}. An edit operation is called

– a node/edge relabelling if a ≠ ⊥ and b ≠ ⊥ are node/edge labels,
– a node/edge deletion if a ≠ ⊥ and b = ⊥,
– a node/edge insertion if a = ⊥ and b ≠ ⊥.
Nodes can only be deleted if no edge connects to them. Edges can only be inserted between existing nodes. An edit operation a → b is undefined if a = ⊥ and b = ⊥, or if a ∈ L_V and b ∈ L_E or vice versa.

The similarity between two graphs can now be determined by assigning editing costs to each operation. These costs should reflect the degree of change implied and therefore the difference between the initial and the edited graph. The edit distance between two graphs is the sum of the costs of all individual edit operations used to transform one graph into the other. In cases where different sequences of operations can be used, the sequence with the minimal costs is used.

Definition 6 (Edit Distance). Let S be a sequence of edit operations s1, · · · , sk. S transforms a graph g into a graph g′ if there is a sequence of graphs g0, · · · , gk such that g0 = g, gk = g′ and gi is the result of applying si to gi−1. We denote this as g ⇒_S g′. Let γ be a cost function that assigns to each edit operation a nonnegative real number γ(si), and let γ(s1, · · · , sk) = γ(s1) + · · · + γ(sk). Then the edit distance between two graphs g and g′ is defined as

dist(g, g′) = min{γ(S) | g ⇒_S g′}

The notion of graph edit distance provides us with a notion of similarity between graphs, because graphs with a low edit distance can be assumed to be more similar than graphs with a high edit distance. In particular, two graphs whose edit distance is zero have a similarity of one and are in fact equivalent.

4.2 Edit Operations on RDF Queries

Based on the general notion of edit operations on graphs, we can now define suitable edit operations on RDF graphs. Edit operations on RDF graphs differ from general graph edit operations insofar as we have to take the underlying data model into account. In particular, the triple data model does not allow us to remove edges without also removing nodes that are unconnected after removing the edge. Further, when considering query graphs, edit operations that split the graph into two unconnected components have to be excluded. In our semantic caching approach, we moreover only consider relabelling operations that follow the specialization relation ⪯. In order to avoid having to deal with such cases that do not make sense for RDF models, we define specialized edit operations on RDF graphs that correspond to the insertion, removal or replacement of triples in the underlying RDF model. In the following we define the corresponding operations as combinations of general graph edit operations that fulfill certain additional constraints:

Definition 7 (Edge Insertion). An edge insertion operation is a pair of graph edit operations g0 →s1 g1 →s2 g2 such that s1 is a node insertion adding a node v to the graph and s2 is an edge insertion that inserts an edge (u, v) where v is the previously inserted node and u is a node in g0.

The edge insertion operation above corresponds to the addition of a tuple pattern to an RDF query. This operation refines the query by adding an additional constraint on the result. As we are only interested in queries that are subsumed by cached queries, we do not have to take the converse operation of edge deletion into account. The use of such a deletion operation in combination with insertion would lead to a query that is no longer subsumed by the original one. As we do not want this to happen, we can reduce the search space by only considering edge insertion in combination with refinement operations on edge and node labels as defined in the following.

Definition 8 (Edge/Node Refinement). An edge (node) refinement operation is a graph edit operation a → b such that a is an edge (node) label in the original graph, b ∈ L_E (L_V) and a ⪯ b.

The application of these special edit operations always leads to a graph that is subsumed by the original one. Based on these operations we can use the notion of edit distance provided in definition 6 to determine subsumption and, given an appropriate cost function for edit operations, compute the distance between queries in the cache and new queries submitted to the system.

4.3 Determining Editing Costs

The crucial aspect of the approach is to find a useful similarity measure for RDF queries. Relying on the edit distance of query graphs, this problem reduces to determining the costs of the edit operations that together define the edit distance. As our main goal is to reduce processing time, the costs of an edit operation should be linked to the processing costs that result from the difference between the two queries described by the operation. In section 3.2 we briefly discussed the costs of adapting cached results and compared them with the costs of processing a query in the standard way. The idea of assigning costs to edit operations is now to use the costs resulting from actually computing the results for the refined query using the results of the more general query in the cache. For the case of an edge insertion we need to access the relation to be inserted, causing costs of AC_e, and to compute its join with the stored results, which adds |e| · |g| · K to the costs. The resulting definition of the costs of an edge insertion is the following:

Definition 9 (Insertion Costs). Let s be an insertion operation that adds an edge e = (u, v) to a query graph g. Then

γ(s) = AC_e + |e| · |g| · K

where |e| denotes the number of instances of the relation L_E(e) and |g| denotes the number of results of g.

For the case of an edge refinement e → e′, the situation is quite similar. We have to access the relation e′ that will replace the more general relation e in the query, with costs AC_e′. Afterwards we also compute the join of this relation with the results in the cache. In contrast to the insertion operation, however, we have to perform comparisons on both sides of the edge, leading to additional costs of |e′| · |g| · 2K. The resulting definition of the costs of an edge refinement is the following:

Definition 10 (Edge Refinement Costs). Let s be an operation that replaces an edge label e by another edge label e′ with e ⪯ e′. Then

γ(s) = AC_e′ + |e′| · |g| · 2K

where |e′| denotes the number of instances of the relation L_E(e′) and |g| denotes the number of results of g.

With respect to node refinement v → v′ we have to distinguish two cases. The first and simpler case is the one where a label is assigned to an unlabelled node. In this case, we only have to compare every result in the cache with the value to be assigned, causing costs of |g| · K. As we only consider replacements that change the label of a node and require v ⪯ v′, all remaining situations are cases where a class name is substituted by the name of a subclass. Consequently, we have to access the type information in the relation rdf:type, causing costs of AC_rdf:type, and compare it with the results in the cache. This adds costs of |rdf:type| · |g| · K. The resulting definition of the costs of a node refinement is the following:

Definition 11 (Node Refinement Costs). Let s be an operation that replaces a node label v by another node label v′ ≠ v with v ⪯ v′. Then

γ(s) = |g| · K                                if v = ∅
γ(s) = AC_rdf:type + |rdf:type| · |g| · K     otherwise

As we see from the definitions, the costs of the different edit operations depend on statistics of the information in the repository and the cache.
This means that if we have up-to-date statistics available, we can guarantee that our approach selects a strategy that is optimal wrt. run time costs by first simulating the query execution on the basis of the cost model before actually performing it.
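The cost functions of Definitions 9 to 11 can be sketched directly in code. The repository statistics and unit-cost parameters below are invented placeholders, and the final edit sequence (refine about to keyword, refine Publication to Article, insert an affiliation edge) is our reading of how the figure 2 query would be transformed into the figure 3 query.

```python
K = 0.001                                  # comparison cost K (assumed)
C_INIT, C_UNIT, AVG_LEN = 10.0, 0.01, 20   # access cost parameters (assumed)

# invented repository statistics: property -> number of instances
card = {'rdf:type': 10_000, 'author': 5_000, 'about': 8_000,
        'keyword': 1_200, 'affiliation': 2_000}

def access_cost(prop):
    """AC_e for a property relation e."""
    return C_INIT + card[prop] * AVG_LEN * C_UNIT

def edge_insertion_cost(prop, cached_results):
    """Definition 9: AC_e + |e| * |g| * K."""
    return access_cost(prop) + card[prop] * cached_results * K

def edge_refinement_cost(refined_prop, cached_results):
    """Definition 10: AC_e' + |e'| * |g| * 2K (comparisons on both ends)."""
    return access_cost(refined_prop) + card[refined_prop] * cached_results * 2 * K

def node_refinement_cost(old_label, cached_results):
    """Definition 11: |g| * K for instantiating a variable, else a
    type-relation lookup and comparison against the cached results."""
    if old_label is None:
        return cached_results * K
    return access_cost('rdf:type') + card['rdf:type'] * cached_results * K

def edit_distance(operation_costs):
    """Definition 6 for a fixed sequence: the sum of the operation costs."""
    return sum(operation_costs)

# transforming the figure 2 query into the figure 3 query (assumed sequence),
# with 300 cached results for the more general query
ops = [edge_refinement_cost('keyword', 300),
       node_refinement_cost('Publication', 300),
       edge_insertion_cost('affiliation', 300)]
print(f"distance = {edit_distance(ops):.1f}")
```

Comparing such distances across all subsuming cached queries selects the entry whose reuse is cheapest, which is exactly the simulation-before-execution strategy described above.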

5 Discussion

In this paper we describe the foundations of an approach for semantic caching of RDF queries. Our approach is flexible in the sense that it takes the actual information into account. In particular, the same queries will have a different similarity based on the underlying information, thereby ensuring that we get an optimal result wrt. the cost model. A question that we did not yet address in this work is whether the combined use of different result sets in the cache can further improve the approach. For a successful application of the approach in an RDF storage and retrieval system, a number of additional topics have to be addressed. We have to develop strategies for building and maintaining the cache. Changes in the underlying information that invalidate parts of the cached results are the main challenge here. We also need to provide indexing and retrieval methods for accessing the cache in order to avoid losing the advantages wrt. run-time costs. A real system will also have to be evaluated experimentally, because the implementation of the cache can significantly influence the run-time behavior.

6 Acknowledgements

I would like to thank Richard Vdovjak for useful work on cost models for RDF query processing and Martin Schaaf for fruitful discussions about the topic of the paper.

References

1. J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. In The Semantic Web - ISWC 2002, 2002.
2. H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3-4):255–259, 1998.
3. H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, pages 689–694, 1997.
4. B. Chidlovskii and U.M. Borghoff. Semantic caching of web queries. VLDB Journal, 9(1):2–17, 2000.
5. V. Christophides, D. Plexousakis, M. Scholl, and S. Tourtounis. On labeling schemes for the semantic web. In Proceedings of the 13th World Wide Web Conference, pages 544–555, 2003.
6. S. Dar, M.J. Franklin, B. Jonsson, D. Srivastava, and M. Tan. Semantic data caching and replacement. In Proceedings of VLDB'96, pages 330–341, 1996.
7. D. Lee and W.W. Chu. Towards intelligent semantic caching for web sources. Journal of Intelligent Information Systems, 17(1):23–45, 2001.
8. Q. Ren, M.H. Dunham, and V. Kumar. Semantic caching and query processing. IEEE Transactions on Knowledge and Data Engineering, 15(1), January/February 2003.
9. H. Stuckenschmidt, R. Vdovjak, J. Broekstra, and G.-J. Houben. Towards distributed RDF querying. International Journal on Web Engineering and Technology, 2004. To appear.
10. J.T.L. Wang, K. Zhang, and G.-W. Chirn. Algorithms for approximate graph matching. Information Sciences, 82:45–74, 1995.