Factorizing Longitudinal Linked Data

Farah Karim, Maria-Esther Vidal, and Sören Auer
Enterprise Information Systems (EIS), University of Bonn, Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Germany
{karim,vidal,auer}@cs.uni-bonn.de

Abstract. Large collections of longitudinal linked data are publicly available as part of the Linking Open Data (LOD) cloud. Longitudinal linked data are RDF descriptions of observations from related sampling frames or sensors at multiple points in time, e.g., patient medical records or climate sensor data. Observations are expressed as measurements whose values can be repeated several times in a sampling frame, resulting in a considerable increase in data volume. In this paper, we devise a factorized compact representation of longitudinal linked data that reduces the repetition of identical measurements, and propose algorithms to generate collections of factorized longitudinal linked data. We empirically study the effectiveness of the proposed factorized representation on linked observation data, and we show that the total data volume can be reduced by more than 30% on average without loss of information. Furthermore, we evaluate the impact of the proposed factorized representation on query execution. The results suggest that query optimizers can be empowered with semantics from factorized representations to generate query plans that effectively reduce query execution time on factorized longitudinal linked data.

Keywords: Longitudinal Linked Data; Data Factorization; Query Optimization and Execution

1 Introduction

With the maturing of semantic technologies and their increasing industrial use, scalability, performance, and robustness are progressively moving into focus. One particularly important dimension of Linked Data is longitudinal linked data, which represents observations over time, i.e., time series. Examples of such data are sensor measurements, medical records, stock prices, logistics, and traffic data. Moreover, to make Linked Data representations more suitable for Big Data scenarios, the velocity and volume dimensions have to be better supported. Representing longitudinal data directly in RDF results in a substantial inflation of the data volume due to repetition. Applying compression techniques is a possible solution, but has the adverse effect that the data cannot be directly processed and queried.

Building on existing results on factorized databases [1,2], we propose factorization techniques tailored for longitudinal linked data that generate compact representations of observations where repeated values are reduced. Furthermore, SPARQL queries can be executed against factorized datasets without affecting answer completeness or query time complexity. Moreover, properties of the factorized representation of observations can be exploited during query optimization to produce query plans that speed up query execution. Thus, the proposed factorization techniques differ from state-of-the-art RDF data compression approaches [3,6,9] which, although efficient, do not produce compact representations of RDF data over which queries can be processed directly.

We conduct an experimental study to analyze the effectiveness of the proposed techniques on collections of longitudinal linked data of different sizes. Experimental outcomes suggest that factorized representations reduce the number of RDF triples by more than 30% without loss of information. Additionally, we study the effects of the proposed factorized representation on query processing and data loading time. The observed results confirm that exploiting knowledge encoded in factorized representations facilitates the generation of efficient query plans that reduce execution time by up to two orders of magnitude.

This paper comprises five additional sections. Section 2 motivates, with a real-world example, the need for factorization techniques for longitudinal linked data. We then define our approach in Section 3, and report the results of our empirical evaluation in Section 4. Existing approaches are reviewed in Section 5, and we conclude and present an outlook on future work in Section 6.

2 Motivating Example

The MesoWest RDF datasets¹ comprise longitudinal linked data describing hurricane and blizzard observations in the United States; observations include measurements of different climate phenomena, e.g., temperature, visibility, precipitation, wind speed, and humidity. Together these datasets contain almost two billion RDF triples comprehensively describing major storms in the United States since 2002. The RDF dataset from the storm season in April 2003 comprises 12,011,466 RDF triples about temperature and 1,193,345 observations. The SPARQL query in Listing 1.1 calculates the frequency distribution of temperature values that are repeated more than 9 times in the temperature observations.

Listing 1.1: SPARQL query against the April 2003 MesoWest dataset with temperature observations. Prefixes are used as in https://www.w3.org/wiki/SRBench

    SELECT ?value (COUNT(?value) AS ?valueCount)
    WHERE {
      ?observation a weather:TemperatureObservation ;
                   om-owl:result ?measurement .
      ?measurement om-owl:floatValue ?value
    }
    GROUP BY ?value
    HAVING (COUNT(?value) > 9)
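The aggregation that the query performs can be illustrated outside a SPARQL engine. The following is a minimal Python sketch, not part of the original study: it counts how often each value occurs and keeps only the values repeated more than 9 times. The function name `repeated_value_counts` and the toy readings are hypothetical and are not taken from the MesoWest data.

```python
from collections import Counter


def repeated_value_counts(values, min_repeats=10):
    """Return {value: count} for values that occur at least min_repeats times.

    min_repeats=10 mirrors the query condition "repeated more than 9 times".
    """
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= min_repeats}


# Hypothetical toy readings (NOT MesoWest data): two values pass the threshold.
readings = [32.0] * 12 + [20.7] * 10 + [45.1] * 3

frequent = repeated_value_counts(readings)
# Number of observations whose value is repeated more than 9 times.
covered_observations = sum(frequent.values())
```

On the toy input, only the readings 32.0 and 20.7 survive the threshold, which corresponds to the GROUP BY and HAVING clauses of the listing.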

Figure 1a shows the results produced by evaluating the query in Listing 1.1. These results reveal that 1,191,218 observations out of 1,193,345 meet this condition. Additionally, temperature values are repeated on average 1,114.45 times, and the temperature value 32 is the mode, i.e., 32 is the most repeated value and is associated with 24,632 observations. Also, observations are described using 11 RDF triples; Figure 1b illustrates the percentage of RDF triples repeated per temperature value. As can be observed, some repeated temperature values are associated with up to 2.5% of the total number of temperature-related RDF triples in the whole dataset. Similar frequency distributions can be observed for other climate phenomena, and corroborate the natural intuition that the number of phenomenon values is much smaller than the number of observations. We exploit these characteristics of longitudinal linked data, and propose a compact representation where RDF triples of repeated values are factorized from the observations and included in the dataset only once. Unlike other compression techniques, queries can be executed against factorized longitudinal linked data. Further, semantics encoded in factorized representations can be used to guide the query optimizer towards query plans that speed up query processing.

Fig. 1: Motivating Example: (a) Frequency distribution of temperature values from the April 2003 MesoWest Linked Observation Data, produced by the query in Listing 1.1; (b) Percentage of RDF triples per repeated temperature value with respect to the linked observations of temperature from the April 2003 MesoWest data.

¹ http://wiki.knoesis.org/index.php/LinkedSensorData

3 Longitudinal Linked Data Factorization

We devise RDF-based factorization techniques tailored to longitudinal linked data in which RDF triples associated with repeated measurement values are grouped into molecules to reduce redundancy. In particular, the proposed factorization techniques rely on RDF transformation rules to identify three types of molecules:

i) Measurement: a factorized measurement is represented by a blank node, and corresponds to the subject of the RDF triples that represent the value and unit of the measurement.
ii) Time Interval: a factorized time interval is modeled as a blank node, and is related using RDF triples with the initial and end timestamps of the interval, as well as with the difference between two equidistant timestamps.
iii) Observation: a set of RDF triples representing the observation properties; a blank node also represents a factorized observation.

Once the original dataset is decomposed into molecules, RDF triples are added to represent relationships among the molecules, thus ensuring a lossless compact representation of the original longitudinal linked data. The framework FLLD implements the proposed factorization techniques, and allows for the reformulation of SPARQL queries against factorized data. FLLD is composed of two main components: the factorized data decomposer and the query re-writer.

3.1 Factorized Data Decomposer

The longitudinal linked data decomposer factorizes RDF molecules while preserving all the information represented in the original dataset. The factorized data decomposer executes two main tasks: molecule creation and molecule bonding. In the molecule creation task, transformations are applied until all the molecules are created, i.e., measurement, time interval, and observation molecules. Similarly, transformation rules are evaluated to relate molecules during the molecule bonding task. Without loss of generality, we assume that longitudinal linked data is described using the Semantic Sensor Network (SSN) Ontology².

Molecule Creation

The following three rules state characteristics of longitudinal linked data that support the creation of molecules.

Rule 1 Measurement Molecule: creates a measurement molecule whenever two or more different measurements have the same value and unit of measurement. A measurement molecule of type om-owl:MeasureData is modeled as a blank node _:m, and is related to a unit of measurement and a value using the predicates om-owl:uom and om-owl:hasValue, respectively.

    (?m1 rdf:type om-owl:MeasureData) ∧ (?m1 om-owl:uom ?uom) ∧ (?m1 om-owl:hasValue ?v) ∧
    (?m2 rdf:type om-owl:MeasureData) ∧ (?m2 om-owl:uom ?uom) ∧ (?m2 om-owl:hasValue ?v)
    ------------------------------------------------------------------------------------ (Rule 1)
    (_:m rdf:type om-owl:MeasureData) ∧ (_:m om-owl:uom ?uom) ∧ (_:m om-owl:hasValue ?v)

Figure 2 illustrates how a measurement molecule is created from two measurements using Rule 1. These measurements are of type om-owl:MeasureData and have the value "20.7"^^xsd:float and the measurement unit weather:fahrenheit, so Rule 1 allows for the creation of a single measurement molecule. This measurement molecule is modeled as a blank node _:m, and has the same predicates and values as the original measurements. Thus, the predicates and their values in the original measurements are combined into a single measurement molecule, and are described only once using these RDF triples instead of being repeated in each individual measurement.

² Prefixes are used as in https://www.w3.org/wiki/SRBench; additionally, we consider the prefix: ld:

[Figure 2: the measurements sens-obs:MeasureData_DewPoint_4UT01_2003_3_31_0_15_00 and sens-obs:MeasureData_DewPoint_4UT01_2003_3_31_0_20_00, both of type om-owl:MeasureData with om-owl:uom weather:fahrenheit and om-owl:hasValue "20.7"^^xsd:float, are combined by the Molecule Creator into the blank node _:m.]

Fig. 2: Rule 1 Measurement Molecule of type om-owl:MeasureData from two measurements with the same measurement unit and value
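The effect of Rule 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the FLLD implementation: triples are plain tuples of strings, the function name `create_measurement_molecules` and the blank node labels `_:m0`, `_:m1`, ... are hypothetical, and the returned `members` mapping is only bookkeeping for the example; it does not represent the molecule bonding rules. The third measurement in the toy input is invented to show a value that is not factorized.

```python
from collections import defaultdict


def create_measurement_molecules(measurements):
    """Group measurements by (unit, value) and emit one molecule per group.

    measurements: dict mapping a measurement IRI to a (uom, value) pair.
    Returns (triples, members): the triples of the created molecules, and a
    mapping from each blank node to the measurement IRIs it replaces.
    """
    groups = defaultdict(list)
    for iri, unit_and_value in measurements.items():
        groups[unit_and_value].append(iri)

    triples, members = [], {}
    for (uom, value), iris in groups.items():
        if len(iris) < 2:
            continue  # Rule 1 fires only for two or more measurements
        node = f"_:m{len(members)}"  # hypothetical blank node label
        triples.extend([
            (node, "rdf:type", "om-owl:MeasureData"),
            (node, "om-owl:uom", uom),
            (node, "om-owl:hasValue", value),
        ])
        members[node] = sorted(iris)
    return triples, members


measurements = {
    "sens-obs:MeasureData_DewPoint_4UT01_2003_3_31_0_15_00":
        ("weather:fahrenheit", '"20.7"^^xsd:float'),
    "sens-obs:MeasureData_DewPoint_4UT01_2003_3_31_0_20_00":
        ("weather:fahrenheit", '"20.7"^^xsd:float'),
    # Hypothetical extra measurement with a unique value; it stays as it is.
    "sens-obs:MeasureData_example":
        ("weather:fahrenheit", '"21.3"^^xsd:float'),
}

molecule_triples, members = create_measurement_molecules(measurements)
```

In this toy run the two measurements from Figure 2 collapse into one molecule described by three triples, whereas the original measurements needed six.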

Rule 2 Time Interval Molecule: creates a time interval molecule when at least two different observations report the same value at consecutive timestamps. A time interval molecule of type ld:Interval is modeled as a blank node _:t, and is related to the start time, end time, and time difference of these observations using the predicates ld:startTime, ld:endTime, and ld:timeDifference, respectively.

    (?t1 rdf:type owl-time:Instant) ∧ (?t1 owl-time:inXSDDateTime ?time1) ∧
    (?t2 rdf:type owl-time:Instant) ∧ (?t2 owl-time:inXSDDateTime ?time2) ∧
    (?o1 om-owl:samplingTime ?t1) ∧ (?o2 om-owl:samplingTime ?t2) ∧
    (?o1 om-owl:result ?m1) ∧ (?o2 om-owl:result ?m2) ∧
    FILTER(func:succ(?time1, ?time2) & func:same(?m1, ?m2))
    ------------------------------------------------------------------------------------ (Rule 2)
    (_:t rdf:type ld:Interval) ∧ (_:t ld:startTime ?time1) ∧ (_:t ld:endTime ?time2) ∧
    BIND(?time2 - ?time1 AS ?timeDifference) ∧ (_:t ld:timeDifference ?timeDifference)

[Figure 3: the time instants sens-obs:Instant_2003_3_31_0_15_00 and sens-obs:Instant_2003_3_31_0_20_00, of type owl-time:Instant and linked via om-owl:samplingTime to the observations sens-obs:Observation_DewPoint_4UT01_2003_3_31_0_15_00 and sens-obs:Observation_DewPoint_4UT01_2003_3_31_0_20_00, are combined by the Molecule Creator into the blank node _:t of type ld:Interval with ld:startTime, ld:endTime, and ld:timeDifference "300.0"^^xsd:float.]

Fig. 3: Rule 2 Time Interval Molecule of type ld:Interval from two time instants with consecutive timestamps; time difference of observations is included
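The conclusion of Rule 2 can be sketched as a small Python function. This is an illustration only: the function name `create_interval_molecule` is hypothetical, time zone offsets are omitted from the timestamps, and the time difference is assumed to be expressed in seconds, which agrees with the value 300.0 shown in Figure 3 for two readings five minutes apart. The checks expressed by func:succ and func:same are assumed to have been made by the caller; the sketch only verifies that the second timestamp is later than the first.

```python
from datetime import datetime


def create_interval_molecule(time1, time2, node="_:t"):
    """Build the triples of a time interval molecule for two timestamps.

    time1, time2: ISO 8601 strings of two consecutive sampling times.
    The difference is stored in seconds (an assumption of this sketch).
    """
    start = datetime.fromisoformat(time1)
    end = datetime.fromisoformat(time2)
    if end <= start:
        raise ValueError("time2 must be later than time1")
    difference = (end - start).total_seconds()
    return [
        (node, "rdf:type", "ld:Interval"),
        (node, "ld:startTime", time1),
        (node, "ld:endTime", time2),
        (node, "ld:timeDifference", difference),
    ]


# Timestamps of the two instants in Figure 3, without the time zone offset.
interval_triples = create_interval_molecule(
    "2003-03-31T00:15:00", "2003-03-31T00:20:00"
)
```

The two owl-time:Instant descriptions are replaced by four triples on a single blank node, and the stored difference lets the individual timestamps be recomputed from the start time.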

Figure 3 illustrates how a time interval molecule is created from two time instants using Rule 2. These time instants are of type owl-time:Instant, describe consecutive timestamps using the predicate owl-time:inXSDDateTime, and are related to two observations having the same measurement results using the predicate om-owl:samplingTime. The new time interval molecule is modeled as a blank node, and has different predicates than the original time instants. These predicates describe the start and end time of the time interval in which these observations are taken, and the difference between two consecutive timestamps.

[Figure 4: the observations sens-obs:Observation_DewPoint_4UT01_2003_3_31_0_15_00 and sens-obs:Observation_DewPoint_4UT01_2003_3_31_0_20_00, with their measurements, time instants, observed property weather:_DewPoint, and procedure sens-obs:System_4UT01, are combined by the Molecule Creator into the blank node _:o.]

Fig. 4: Rule 3 Observation Molecule of type weather:TemperatureObservation from observations with the same values and consecutive timestamps

Rule 3 Observation Molecule: creates an observation molecule if at least two observations, belonging to the same observed phenomenon, have the same measurement value and are observed at consecutive timestamps. An observation molecule of the relevant phenomenon type is modeled as a blank node _:o, and is related to observation data using the predicates om-owl:observedProperty, om-owl:procedure, ld:hasOrder, om-owl:result, and om-owl:samplingTime.

    (?o1 om-owl:observedProperty ?phenomenonProperty) ∧ (?o1 rdf:type ?phenomenon) ∧
    (?o1 om-owl:procedure ?sensor) ∧ (?o1 om-owl:result ?m1) ∧ (?o1 om-owl:samplingTime ?t1) ∧
    (?m1 om-owl:uom ?uom) ∧ (?m1 om-owl:hasValue ?v) ∧ (?t1 owl-time:inXSDDateTime ?time1) ∧
    (?o2 rdf:type ?phenomenon) ∧ (?o2 om-owl:observedProperty ?phenomenonProperty) ∧
    (?o2 om-owl:procedure ?sensor) ∧ (?o2 om-owl:result ?m2) ∧ (?o2 om-owl:samplingTime ?t2) ∧
    (?m2 om-owl:uom ?uom) ∧ (?m2 om-owl:hasValue ?v) ∧ (?t2 owl-time:inXSDDateTime ?time2) ∧
    FILTER(func:succ(?time1, ?time2))
    ------------------------------------------------------------------------------------
    (_:o om-owl:observedProperty ?phenomenonProperty) ∧ (_:o rdf:type ?phenomenon) ∧
    (_:o om-owl:procedure ?sensor) ∧ (_:o ld:hasOrder