Tailoring Linked Data Exploration through inCloud Filtering

Alfio Ferrara
DICo, Università degli Studi di Milano, Via Comelico, 39 - 20135 Milano, Italy

Lorenzo Genta
DICo, Università degli Studi di Milano, Via Comelico, 39 - 20135 Milano, Italy
lorenzo.genta@unimi.it

Stefano Montanelli
DICo, Università degli Studi di Milano, Via Comelico, 39 - 20135 Milano, Italy

ABSTRACT

This paper presents techniques for tailoring linked data exploration by relying on the use of high-level, intuitive organization structures called inClouds (information Clouds). Techniques based on content filtering and relevance filtering are proposed for enabling a user to adjust the inCloud visualization according to her/his own preferences.

Categories and Subject Descriptors

H.5 [Information Systems]: Information Interfaces and Presentation; H.3 [Information Systems]: Information Storage and Retrieval

General Terms

Linked data exploration and filtering

1. INTRODUCTION

Due to the growing success of the linked data paradigm, the problem of effectively exploring and browsing the Linked Data Cloud according to specific selection criteria is becoming increasingly important [1]. On one side, SPARQL endpoints and other frameworks such as SQUIN and iSPARQL are being developed to provide visual and intuitive interfaces for querying a given linked data repository [4, 5]. On the other side, methods and techniques based on matching and aggregation are being proposed in the literature to overcome the inherently flat organization of linked data repositories by providing advanced exploration functionalities [2, 6]. In this paper, we present techniques for tailoring linked data exploration by relying on inClouds (information Clouds). We introduced the notion of inCloud in [3] to provide a high-level, intuitive organization structure capable of representing at a glance a (generally wide) collection of linked data. In the following, we first recall the definition of inCloud and its main properties (Section 2). Then, we present techniques for content-based filtering and relevance-based filtering capable of adjusting the inCloud structure according to the exploration needs specified by a final user (Sections 3 and 4). Techniques for customizing the visualization of the filtering results are also discussed (Section 5). Concluding remarks are finally provided (Section 6).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LWDM 2012, March 30, 2012, Berlin, Germany. Copyright 2012 ACM 978-1-4503-1143-4/12/03 ...$10.00

2. LINKED DATA CLOUDS

An inCloud originates from a set $S$ of linked data extracted from a repository $R$ (e.g., Freebase, DBpedia) through a combination of SPARQL queries. Starting from a seed $s$, that is, a URI of $R$ chosen by the user as the “point of origin”, the extraction of $S$ consists in determining the set of linked data that are most pertinent to $s$. We select as pertinent resources those linked data that are connected to $s$ through a property path of length $\leq d$, where $d$ is a user-defined distance that sets the extent of linked data extraction¹.

Definition of inCloud. Given a seed $s$, an inCloud is defined as a graph $iC_s = (N, E)$, where a node $n_i \in N$ represents a concept with an associated cluster of similar linked data belonging to $S$, and an edge $e(n_i, n_j) \in E$ represents a relation of proximity between $n_i$ and $n_j$, denoting the fact that the two concepts and the respective clusters are somehow related.

A concept $n_i \in N$ is defined as $n_i = (e_i, cl_i)$, where $e_i$ is the essential of $n_i$, namely a concise and convenient summary of the concept, and $cl_i$ is a cluster of linked data of $S$. The essential $e_i$ is defined as a pair $e_i = \langle K_i, T_i \rangle$, where $K_i$ is a set of keywords and $T_i$ is a set of type names. $K_i$ and $T_i$ are defined over the linked data of $cl_i$ by choosing the most frequently occurring terms and type names, respectively, in the specification of the linked data of $cl_i$. To build $cl_i$, techniques for linked data matching are employed to discover all the pairs of similar resources in the set $S$. Then, techniques for linked data aggregation are invoked to determine $cl_i$ by putting together those resources of $S$ that are characterized by a high level of similarity according to the matching results [2].

Each concept $n_i \in N$ is characterized by a prominence $p_i$, which denotes the relative importance of $n_i$ within the overall inCloud. The concept prominence is proportional to the number and the strength of the proximity relations holding between $n_i$ and the other concepts of the inCloud. The prominence affects the visual organization of the corresponding concept/cluster of the inCloud, in that the most prominent concepts are highlighted in the foreground.

¹ More details about linked data extraction are given in [3].
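The extraction step and the construction of an essential can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the paper: linked data are given as hypothetical (subject, property, object) triples held in memory, property paths are followed in both directions, the seed itself is kept in the result, each resource of a cluster is a hypothetical dictionary with `terms` and `types` lists, and the `top` cutoff is illustrative. The actual extraction relies on SPARQL queries [3] and is not reproduced here.

```python
from collections import Counter, deque


def extract(triples, seed, d):
    """Return the resources within a property path of length <= d from seed."""
    # Assumption: paths may traverse a property in either direction.
    neighbours = {}
    for subj, _prop, obj in triples:
        neighbours.setdefault(subj, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subj)

    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == d:
            continue  # do not expand beyond the user-defined distance
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    # Assumption: the seed is part of the extracted set.
    return set(distance)


def essential(cluster, top=3):
    """Build <K_i, T_i> from the most frequent terms and type names."""
    terms = Counter(t for res in cluster for t in res["terms"])
    types = Counter(t for res in cluster for t in res["types"])
    keywords = [k for k, _ in terms.most_common(top)]
    type_names = [t for t, _ in types.most_common(top)]
    return keywords, type_names
```

A breadth-first visit is used because it reaches every resource by a shortest path, so the bound on $d$ can be checked when a node is expanded.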

[Figure 1 (graphic omitted): an inCloud drawn as numbered cluster nodes connected by proximity links. The legend identifies, for each concept, its essential (keywords and types) and its resources cluster. Highlighted concepts: type "region", keyword "lazio" (/en/latium, /en/lazio); type "film", keywords "caro diario", "il postino", ... (/en/caro_diario, /en/il_postino, /en/i_girasoli, ...); type "film", keywords "two strangers in rome", "to live in peace", ... (/en/two_strangers_in_rome, /en/to_live_in_peace); type "tourist attraction", keywords "st peters basilica", "trevi fountain", "appian way", ... (/en/apostolic_palace, /en/appian_way, /en/baths_of_caracalla, ...).]

Figure 1: An example of inCloud extracted from the Freebase repository for the seed $s$ = /en/italy

A proximity relation $e(n_i, n_j)$ represents a similarity-based relationship between the concepts $n_i$ and $n_j$. Proximity relations suggest possible exploration paths across the concepts of the inCloud, thus enabling a user to navigate from one concept to another according to a similarity-based criterion. A proximity relation $e(n_i, n_j)$ is associated with a degree of proximity $x_{ij}$, which denotes the strength of the relationship between the concepts $n_i$ and $n_j$. The proximity degree $x_{ij}$ depends on the number of elements that are similar (i.e., matching) across the clusters $cl_i$ and $cl_j$: the higher the number of similar elements, the higher the proximity degree $x_{ij}$.

Example. An example of inCloud extracted from the Freebase repository for the seed $s$ = /en/italy is shown in Figure 1, providing concepts about Italy and some related countries in Europe (e.g., France, Germany). In the figure, we highlight some concepts containing data about cities, regions, tourist attractions, and films. Keywords $K_i$ and type names $T_i$ of a concept $n_i$ are represented as boxes, while a circle with an ID is used to represent the corresponding cluster $cl_i$. For a concept $n_i$, a portion of the linked data resources contained in the cluster $cl_i$ is also shown. The size of the circle representing a cluster $cl_i$ varies from one cluster to another, and it is proportional to the prominence $p_i$ associated with the concept $n_i$. A proximity relation $e(n_i, n_j)$ is represented as a solid line whose thickness is proportional to the degree of proximity $x_{ij}$ holding between the concepts $n_i$ and $n_j$.
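The paper states only that the proximity degree grows with the number of matching elements across two clusters, and that prominence is proportional to the number and strength of a concept's proximity relations. The following Python sketch shows one possible reading, under assumptions that are ours and not the authors': the proximity degree is the raw count of matching resource pairs spanning the two clusters, and the prominence is the sum of the proximity degrees of a concept's relations, normalized by the largest sum. The function names and the `matches` and `proximity` structures are hypothetical.

```python
def proximity_degree(cluster_i, cluster_j, matches):
    """Count the matching resource pairs that span the two clusters."""
    ci, cj = set(cluster_i), set(cluster_j)
    return sum(
        1
        for a, b in matches
        if (a in ci and b in cj) or (a in cj and b in ci)
    )


def prominence(concept_ids, proximity):
    """Assumed formula: normalized sum of the proximity degrees per concept."""
    raw = {c: 0.0 for c in concept_ids}
    for (i, j), x in proximity.items():
        raw[i] += x
        raw[j] += x
    top = max(raw.values(), default=0.0)
    # Scale to [0, 1] so that thresholds such as 0.5 are meaningful.
    return {c: (v / top if top else 0.0) for c, v in raw.items()}
```

Normalizing to the interval [0, 1] is a design choice made here so that the resulting values are comparable with thresholds such as 0.5; the paper does not state the range of $p_i$ or $x_{ij}$.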

3. FILTERING INCLOUDS BY CONTENT

Content-based filtering techniques allow a user to restrict the exploration of a given inCloud to the subset of concepts that satisfy a given user interest. Such a user interest is expressed through a set of keywords $Q$, where AND/OR operators are employed to denote conjunctive/disjunctive conditions, respectively. In other words, these techniques transform an initial inCloud $iC$ into a resulting inCloud $iC' = (N', E')$ (i.e., $iC'$ is a subgraph of $iC$), where only the concepts/nodes of $iC$ that satisfy the specified user interest are inserted in $N'$, and a proximity link $e(n_i, n_j) \in E$ is inserted in $E'$ iff $n_i, n_j \in N'$. To determine whether a concept $n_i \in N$ of an inCloud satisfies the user interest, each keyword $q_i \in Q$ is matched against the corresponding essential $e_i$. The user can specify which elements of the essential $e_i$ are considered for matching: the set $K_i$, the set $T_i$, or both. Content-based filtering techniques are based on a RETRIEVE instruction defined through the following SQL-like syntax.

    RETRIEVE expr
    FROM     K | T | K, T

where the RETRIEVE clause specifies the conjunctive/disjunctive expression to search, and the FROM clause specifies the elements of the inCloud essentials upon which the retrieval operation has to be performed.

• expr: $q_i$ | (expr OR expr) | (expr AND expr) is the list of keywords with AND/OR operators specified by the user².
• $K = \bigcup_{i=1}^{n} K_i$ is the set of keywords extracted from the concept essentials of the overall inCloud, where $n = |N|$ is the cardinality of $N$.
• $T = \bigcup_{i=1}^{n} T_i$ is the set of type names extracted from the concept essentials of the overall inCloud.
• K, T is the set of keywords and type names of the inCloud essentials, namely $K \cup T$.

² If expr is not specified, the whole inCloud $iC$ is returned as a result. In this case $iC \equiv iC'$.

[Figure 2 (graphic omitted). Panel (a), filtering by content: concepts with type "film" (keywords "seven hills of rome", ...), type "city" (keyword "rome"), type "tourist attraction" (keywords "st peters basilica", "trevi fountain", "appian way", ...), and type "film" (keywords "two strangers in rome", "to live in peace", ...). Panel (b), filtering by relevance: the film and tourist attraction concepts.]

Figure 2: Results of (a) filtering-by-content, and (b) filtering-by-relevance
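The expr grammar listed above can be turned into a small recursive-descent parser. The Python sketch below is illustrative only and makes assumptions that go beyond the paper: keywords are written between straight single quotes, they are lowercased, the outermost parentheses may be omitted (as in the paper's own examples), and an unparenthesized chain of operators is read left to right. The names `tokenize` and `parse_expr` are hypothetical.

```python
import re

# A token is a parenthesis, an operator, or a single-quoted keyword.
TOKEN = re.compile(r"\s*(\(|\)|AND\b|OR\b|'[^']*')")


def tokenize(text):
    """Split an expr string into tokens, rejecting anything unexpected."""
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def _parse(tokens):
    """Parse one operand: a keyword or a parenthesized binary expression."""
    if not tokens:
        raise ValueError("unexpected end of expression")
    head, rest = tokens[0], tokens[1:]
    if head == "(":
        left, rest = _parse(rest)
        if not rest or rest[0] not in ("AND", "OR"):
            raise ValueError("expected AND or OR")
        op = rest[0]
        right, rest = _parse(rest[1:])
        if not rest or rest[0] != ")":
            raise ValueError("expected ')'")
        return (op, left, right), rest[1:]
    if head.startswith("'"):
        return head[1:-1].lower(), rest
    raise ValueError(f"unexpected token {head!r}")


def parse_expr(tokens):
    """Return a keyword string or a nested (op, left, right) tuple."""
    node, rest = _parse(tokens)
    # Assumption: outer parentheses may be omitted; chains associate left.
    while rest and rest[0] in ("AND", "OR"):
        op = rest[0]
        right, rest = _parse(rest[1:])
        node = (op, node, right)
    if rest:
        raise ValueError(f"unexpected token {rest[0]!r}")
    return node
```

The nested tuple form keeps the parsed expression easy to evaluate recursively against the keywords and type names of each essential.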

Example. As an example of filtering-by-content, we consider the inCloud in Figure 1 and we suppose that a user is interested in retrieving contents concerning films located in or depicting the city of Rome. In order to express her/his interest, the user formulates the following RETRIEVE instruction:

    RETRIEVE ‘film’ AND ‘rome’
    FROM     K, T

This RETRIEVE instruction filters the inCloud by selecting those concepts that contain both film and rome in their essentials (i.e., keywords or type names). In particular, the concepts that satisfy the request contain information about films located in Rome (ID=9, ID=12), tourist attractions (ID=120), and other general information about the town (ID=194). The concept about tourist attractions is bigger in size due to its higher prominence with respect to the other concepts in the inCloud.
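A minimal Python sketch of how a RETRIEVE instruction could be evaluated is given below. It is not the authors' implementation. It assumes that an expression is already available as a keyword string or a nested `(op, left, right)` tuple, that keywords and type names are stored as lowercase strings, and that a query keyword matches by exact membership; the paper does not specify the matching function. The `Concept` class, the `retrieve` function, and the `scope` parameter (standing for the FROM clause) are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    cid: int
    keywords: frozenset  # K_i
    types: frozenset     # T_i


def matches(expr, terms):
    """Evaluate a keyword or an (op, left, right) tuple against a term set."""
    if isinstance(expr, str):
        return expr in terms  # assumption: exact, case-sensitive membership
    op, left, right = expr
    if op == "AND":
        return matches(left, terms) and matches(right, terms)
    return matches(left, terms) or matches(right, terms)


def retrieve(expr, concepts, edges, scope=("K", "T")):
    """Return (N', E'): the concepts satisfying expr and the induced links."""
    kept = {}
    for c in concepts:
        terms = set()
        if "K" in scope:
            terms |= c.keywords
        if "T" in scope:
            terms |= c.types
        # A missing expr returns the whole inCloud, as stated in footnote 2.
        if expr is None or matches(expr, terms):
            kept[c.cid] = c
    # A proximity link survives iff both of its endpoints survive.
    induced = {
        pair: x
        for pair, x in edges.items()
        if pair[0] in kept and pair[1] in kept
    }
    return kept, induced
```

Here `edges` maps a pair of concept IDs to its proximity degree, so the returned `induced` dictionary is exactly the edge set of the subgraph induced by the retained concepts.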

4. FILTERING INCLOUDS BY RELEVANCE

Relevance-based filtering techniques allow a user to restrict the exploration of an inCloud to those concepts that are most relevant with respect to prominence and proximity requirements. These techniques perform two kinds of filtering, namely filtering-on-prominence and filtering-on-proximity, which are expressed through appropriate conditions. Filtering-on-prominence selects those concepts $n_i$ that have a prominence $p_i$ higher than a user-specified threshold. Filtering-on-proximity selects those proximity relations $e(n_i, n_j)$ that have a degree of proximity $x_{ij}$ higher than a user-specified threshold. In other words, these techniques transform an incoming inCloud $iC'$ (produced by the content filtering techniques) into a resulting inCloud $iC''$ where only the concepts and proximity relations that satisfy the specified conditions on prominence and/or proximity are represented.

Relevance-based techniques are specified in the RETRIEVE instruction through the WHERE clause as follows:

    RETRIEVE expr
    FROM     K | T | K, T
    [WHERE   [p > value_p [, ]][x > value_x]]

where $value_p$ and $value_x$ are user-specified thresholds that set the minimum level of requested prominence and proximity, respectively. When the condition $p > value_p$ is specified, a concept $n_i$ is included in the resulting $iC''$ iff the associated prominence value $p_i$ is higher than $value_p$. When the condition $x > value_x$ is specified, a proximity relation $e(n_i, n_j)$ is included in the resulting $iC''$ iff the associated proximity value $x_{ij}$ is higher than $value_x$. Moreover, each node $n_i \in iC$ is included in $iC''$ iff a relation $e(n_i, n_j)$ exists where $n_j \in iC'$ and the proximity value $x_{ij}$ is higher than $value_x$.

Example. An example of filtering-by-relevance is given in Figure 2(b), where the following WHERE clause is added to the RETRIEVE instruction of Section 3:

    RETRIEVE ‘film’ AND ‘rome’
    FROM     K, T
    WHERE    p > 0.5, x > 0.5

As a result of this RETRIEVE instruction, the concepts of Figure 2(b) are a subset of those represented in Figure 2(a), and some proximity relations are not reported in the resulting inCloud. In particular, we focus on the concept about the city of Rome (ID=194), which is filtered out from the results due to its low level of prominence. In this case, the concept is pertinent with respect to the keywords $Q$ specified by the user (i.e., film and rome), but it is not sufficiently prominent with respect to the specified $value_p$ ($p_{194} < 0.5$).
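The two WHERE conditions can be sketched in Python as follows. This is a simplified reading offered under assumptions, not the authors' implementation: both thresholds are optional, the comparison is strict, and only the concepts and relations of the incoming inCloud are considered. The additional node-inclusion rule based on proximity, stated in the text above, is deliberately not modeled. The function name and the `prominence` and `edges` dictionaries are hypothetical.

```python
def filter_by_relevance(prominence, edges, value_p=None, value_x=None):
    """Apply the optional conditions p > value_p and x > value_x."""
    # Keep the concepts whose prominence is strictly above the threshold.
    nodes = {
        cid
        for cid, p in prominence.items()
        if value_p is None or p > value_p
    }
    # Keep the relations between retained concepts that are strong enough.
    kept_edges = {
        (i, j): x
        for (i, j), x in edges.items()
        if i in nodes and j in nodes and (value_x is None or x > value_x)
    }
    return nodes, kept_edges
```

With hypothetical prominence values in which only concept 194 is below 0.5, calling the function with `value_p=0.5` drops that concept together with every relation attached to it, which mirrors the behaviour described for Figure 2(b).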

5. DISPLAYING INCLOUDS

The visualization of the inCloud generated through filtering-by-content and filtering-by-relevance can be adjusted by the user according to her/his desired level of detail. The user can choose the elements of the inCloud to consider for display. In particular, the inCloud visualization can be focused on showing only the concept essentials of the result, which is more suitable for users interested in learning general knowledge about a certain topic from linked data. Alternatively, the visualization can be focused on displaying only the content of clusters for more detailed data exploration, which is more suitable for users interested in using inCloud filtering as a query facility for linked data repositories. Finally, the visualization of proximity relations between the concepts can be enabled or disabled to include or exclude connections between concepts in the final display.


The display options are specified through a SHOW clause as follows:

    [SHOW [clusters[, ]][essentials[, ]][relations] ON](
    RETRIEVE expr
    FROM     K | T | K, T
    [WHERE   [p > value_p [, ]][x > value_x]])

The SHOW clause specifies the elements to display, namely clusters, essentials, and relations. As a final remark, we stress that when only the relations option is specified, the visualization of clusters is also enforced, since displaying proximity links alone, without essentials or clusters, does not provide any useful information.

Example. An example of inCloud displaying is provided in Figure 3, where the results of the following SHOW-RETRIEVE instructions are given.

Display essentials (a):

    SHOW essentials, relations ON (
    RETRIEVE ‘film’ AND ‘rome’
    FROM     K, T
    WHERE    p > 0.5, x > 0.5)

Display clusters (b):

    SHOW clusters, relations ON (
    RETRIEVE ‘film’ AND ‘rome’
    FROM     K, T
    WHERE    p > 0.5, x > 0.5)

In particular, Figure 3 shows the main difference between two different display options. In Figure 3(a), the inCloud view is focused on essentials display to summarize the inCloud contents in terms of keywords and type names. In Figure 3(b), the inCloud view is focused on cluster display to provide a direct link to the underlying resources.
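The handling of the SHOW options, including the rule that clusters are enforced when only relations are requested, can be sketched in Python as follows. The sketch is illustrative and rests on assumptions of ours: concepts are hypothetical dictionaries with `essential` and `cluster` entries, relations are a dictionary keyed by pairs of concept IDs, and the output is a plain dictionary rather than a graphical view. The names `resolve_show` and `build_view` are hypothetical.

```python
ALLOWED = ("clusters", "essentials", "relations")


def resolve_show(options):
    """Validate the SHOW options and enforce clusters for relations-only."""
    selected = set(options)
    unknown = selected - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown display option(s): {sorted(unknown)}")
    if selected == {"relations"}:
        # Proximity links alone carry no information, so show clusters too.
        selected.add("clusters")
    return selected


def build_view(selected, concepts, edges):
    """Project the filtered inCloud onto the selected display elements."""
    view = {"concepts": {}, "relations": []}
    for cid, concept in concepts.items():
        entry = {}
        if "essentials" in selected:
            entry["essential"] = concept["essential"]
        if "clusters" in selected:
            entry["cluster"] = concept["cluster"]
        view["concepts"][cid] = entry
    if "relations" in selected:
        view["relations"] = sorted(edges)
    return view
```

Separating option resolution from view construction keeps the enforcement rule in one place, so any renderer built on top of `build_view` receives an already consistent set of display elements.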

6. CONCLUDING REMARKS

In this paper, techniques for linked data exploration through inCloud filtering have been presented together with real examples on the Freebase repository. Ongoing work is devoted to integrating the proposed filtering techniques into the existing Java clouding prototype for inCloud construction and visualization (http://islab.dico.unimi.it/inCloud).

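[Figure 3 (graphic omitted). Panel (a), display essentials: concepts shown through their types and keywords (film: seven hills of rome ...; tourist attraction: st peters basilica, trevi fountain, appian way ...; film: two strangers in rome, to live in peace ...). Panel (b), display clusters: concepts shown through their resources (9: /en/seven_hills_of_rome ...; 120: /en/st_peters_basilica, /en/trevi_fountain ...; 12: /en/two_strangers_in_rome, /en/to_live_in_peace ...).]

Figure 3: Results of (a) essential displaying, and (b) cluster displaying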

7. REFERENCES

[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. Int. Journal on Semantic Web and Information Systems, 5(3), 2009.
[2] S. Castano, A. Ferrara, and S. Montanelli. Structured Data Clouding across Multiple Webs. Information Systems, 2011. To appear.
[3] S. Castano, A. Ferrara, and S. Montanelli. Thematic Exploration of Linked Data. In Proc. of the 1st VLDB Int. Workshop on Searching and Integrating New Web Data Sources (VLDS 2011), Seattle, USA, 2011.
[4] O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL Queries over the Web of Linked Data. In Proc. of the 8th Int. Semantic Web Conference (ISWC 2009), Washington, DC, USA, 2009.
[5] C. Kiefer, A. Bernstein, and M. Stocker. The Fundamentals of iSPARQL: A Virtual Triple Approach for Similarity-Based Semantic Web Tasks. In Proc. of the 6th Int. Semantic Web Conference (ISWC 2007), Busan, Korea, 2007.
[6] R. Mirizzi, A. Ragone, T. Di Noia, and E. Di Sciascio. Semantic Wonder Cloud: Exploratory Search in DBpedia. In Proc. of the ICWE 2nd Int. Workshop on Semantic Web Information Management (SWIM 2010), Vienna, Austria, 2010.