Spotlight

Editor: Munindar P. Singh • singh@ncsu.edu • Shengru Tu • shengru@cs.uno.edu

Exploiting Linked Data to Build Web Applications

Michael Hausenblas • Digital Enterprise Research Institute, Galway

Semantic Web technologies have been around for a while. However, such technologies have had little impact on the development of real-world Web applications to date. With linked data, this situation has changed dramatically in the past few months. This article shows how linked data sets can be exploited to build rich Web applications with little effort.

Many Web developers today use APIs such as those from Google or Facebook to build and enhance their Web applications. Because these APIs are designed in many different ways (proprietary XML formats, JavaScript Object Notation [JSON], and so on), application development based on Web 2.0 mashups is often burdensome and doesn't scale well. In addition to the effort of repeatedly learning new interfaces, the data created through them is locked into its respective platform.

The Web of data — which is a synonym for or part of the Semantic Web, depending on whom you ask — has promised to resolve these issues for a long time. To date, however, only partial solutions to real-world problems exist, many of them addressing "toy" data sets — that is, small, artificial, non-real-world data sets that don't demonstrate scalability for the Web.

The Linking Open Data (LOD) community project is a recent initiative that's changed the situation dramatically: using basic Web of data technologies such as RDF (www.w3.org/TR/rdf-concepts/) and URIs along with a set of so-called "linked data" principles, several data sources, such as Wikipedia, are now available to developers, who clearly benefit from linked data sets based on a common data model.1 Here, I'll show you how to exploit available linked data sets to build rich Web applications with little effort.

Using Linked Data: Examples

Before we tackle linked data's technical challenges, let's look at some exemplary uses of linked data sets. Faviki (www.faviki.com), for example, is a social bookmarking tool that lets users tag Web pages with semantic tags stemming from Wikipedia. In the case of Faviki, Web of data technologies and data provide an unambiguous space for identifying concepts. Figure 1 shows the tool, which uses URIs from DBpedia (the interlinked version of Wikipedia in RDF) for tagging; in the figure, http://dbpedia.org/resource/Internet is used as a tag — anyone interested in this term can dereference this URI and obtain further information about it.
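To make the notion of dereferencing concrete, here is a minimal Python sketch that requests the RDF behind this DBpedia tag via HTTP content negotiation (the "BBC Music: Where's the RDF?" sidebar shows the same technique with curl). The choice of Python and the snippet itself are purely illustrative; any HTTP client will do.

import urllib.request

# Ask for RDF/XML instead of the default HTML page (content negotiation).
req = urllib.request.Request(
    "http://dbpedia.org/resource/Internet",
    headers={"Accept": "application/rdf+xml"},
)
with urllib.request.urlopen(req) as response:
    print(response.geturl())   # the RDF document the server redirected us to
    print(response.read(300))  # the first few hundred bytes of RDF/XML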

DBpedia Mobile,2 which Figure 2 shows, is an interesting application for mobile environments. Basically, it's a location-centric DBpedia client application for mobile devices; based on a device's GPS signal, DBpedia Mobile can render a map indicating nearby locations from the DBpedia data set.

Recently, the BBC announced the release of its new Music beta site (www.bbc.co.uk/music/beta/) built around Musicbrainz (http://musicbrainz.org/) metadata and identifiers. The site pulls music metadata, such as related artists, from Musicbrainz and fetches introductory text for artists' biographies from Wikipedia via DBpedia interlinking. Figure 3 shows an example page for Madonna (www.bbc.co.uk/music/artists/79239441-bfd5-4981-a70c-55c3f15c1287). The sidebar "BBC Music: Where's the RDF?" has background information on the data, interlinking, and vocabularies used in this example.

How have these example projects been realized? What are their design principles? Let's examine linked data sets' rather technical aspects — the so-called linked data principles — and the technologies that enable their implementation.




Linked Data Principles

Tim Berners-Lee first outlined the basic idea of linked data in 2006. In his seminal design note (www.w3.org/DesignIssues/LinkedData.html), he described the four linked data principles as follows:

• All items should be identified using URIs.
• All URIs should be dereferenceable — that is, using HTTP, URIs enable anyone (machine or human) to look up an item identified through the URI.
• Looking up a URI leads to more data, also known as the follow-your-nose principle.
• Links to URIs in other data sets should be included to enable further data discovery.

In contrast to the full-fledged Semantic Web vision, linked data is mainly about publishing structured data in RDF using URIs rather than focusing on the ontological level or inference. This simplification — just as the Web simplified the established academic approaches of hypertext systems — lowers the entry barrier for data providers, fostering widespread adoption.3–5

The LOD project (http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) — an open, collaborative effort initiated by the W3C Semantic Web Education and Outreach Community Projects initiative — aims to bootstrap the Web of data by publishing existing data sets in RDF on the Web and creating numerous links between them. The project began in early 2007 with a relatively modest number of data sets and participants and has since grown in depth, impact, and contributors. Currently, the project includes more than 60 different data sets (see Figure 4) with a couple of billion RDF triples and several million (semantic) links at the time of this writing; it represents a steadily growing, open implementation of linked data principles.

Figure 1. Faviki. This social bookmarking tool uses DBpedia terms for semantic tagging.

When we look more closely at widely deployed vocabularies6,7 in the LOD cloud, we can group the semantic link types into

• person-related link types, such as foaf:knows from the Friend of a Friend (FOAF) vocabulary (http://xmlns.com/foaf/0.1/);
• spatial link types, such as geo:lat from the Basic Geo vocabulary (WGS84 lat/long; www.w3.org/2003/01/geo/);
• temporal link types, such as Dublin Core's dc:created property (http://dublincore.org/documents/dcmi-terms/) or the Event Ontology's event:time property (http://purl.org/NET/c4dm/event.owl);
• link types such as dc:isPartOf for representing structural semantics; and
• other link types, such as scovo:dimension from the Statistical Core Vocabulary (SCOVO; http://purl.org/NET/scovo).
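To make the follow-your-nose principle and these typed links concrete, here is a minimal sketch that dereferences a linked data URI, parses the RDF behind it, and collects outgoing links of a few common types that an agent could visit next. It assumes the freely available rdflib Python library; the starting URI and the selection of link types are illustrative only.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def outgoing_links(uri, predicates=(OWL.sameAs, RDFS.seeAlso, FOAF.knows)):
    # Dereference the URI, parse the returned RDF, and gather linked URIs.
    g = Graph()
    g.parse(uri)  # rdflib content-negotiates for an RDF representation
    links = set()
    for predicate in predicates:
        for obj in g.objects(URIRef(uri), predicate):
            if isinstance(obj, URIRef):
                links.add(str(obj))
    return links

# One follow-your-nose hop, starting from a DBpedia resource:
for target in sorted(outgoing_links("http://dbpedia.org/resource/Internet")):
    print(target)

Each returned URI can be dereferenced in turn, which is exactly the step-wise exploration the third principle describes.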

Figure 2. DBpedia Mobile's map view. This view provides mobile, location-based services based on Wikipedia.

Equipped with the linked data principles, I'll now take you step-by-step through a concrete example that shows how to exploit linked data to build a Web application.

My First Linked Data Application

The linked data principles provide a framework for publishing and consuming data on the Web but don't give implementation details. Several phrases are deliberately rather generic, such as "leads to more data" or "in order to enable the discovery of more data." This fact suggests that additional details are necessary before we can use linked data in a practical setup. Next, I'll describe the steps needed to exploit linked data sets in an exemplary Web application.

Figure 3. BBC Music beta site. The site pulls information on artists (in this case, Madonna) from Musicbrainz and Wikipedia.

Imagine that you're a historically inclined person running the Web site http://example.org/cw/, which deals with the topic "Cold War." Let's further assume the site is powered by popular software such as Wordpress or Drupal. The site has several manually maintained sections on various aspects of the Cold War: politicians, countries, conflicts, and so on. As the site's maintainer, you've heard about linked data and want to use it to enrich your content. What are the necessary steps?

To exploit linked data sets properly, you must first prepare your own data and then select appropriate target data sets for interlinking. (More detail on these steps is available in prior work, in which colleagues and I report on building an interlinked version of the Eurostat statistical data set.8)

Prepare Your Data

Typically, the data you're about to use is available in a non-RDF format, such as relational data or XML; the actual format doesn't matter as long as the data is structured and the schema is known. To make your data Web-of-data-compliant, you must first mint — that is, assign — URIs for your entities; more detailed advice on how to achieve this is available elsewhere.9 For example, somewhat comparable to what DBpedia does, you would identify entities in the URI space http://example.org/cw/resource/ — for example, http://example.org/cw/resource/conflict. An RDF representation, on the other hand, would reside at http://example.org/cw/rdf/, as in http://example.org/cw/rdf/conflict. A human-digestible version would be in the http://example.org/cw/html/ space, such as http://example.org/cw/html/conflict. The ultimate guide on how to publish linked data on the Web (http://linkeddata.org/docs/how-to-publish) explains the entire publishing process in detail, including URI minting, vocabulary selection, and deployment issues.

The next challenge is to pick one or more existing vocabularies and extend them as needed for your own purpose. Based on your data's schema and the selected vocabularies, the "RDFising" step is rather straightforward. Experience tells us that we should reuse existing vocabularies and extend them if needed rather than reinventing the wheel for each application. As maintainer of the Cold War site, you've analyzed the entities and relations occurring in your content and identified the need to represent people, geographical regions, and events in a first iteration. This would mean, for example, using FOAF to describe people or the Event Ontology to state when and where a certain event, such as a conflict, occurred. A finer-grained, domain-specific description (for example, regarding political systems or military aspects) is certainly desirable, but we'll assume you started with a simple modeling and refined it in a second iteration.
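As an illustration of this first modeling iteration, the following sketch uses the rdflib Python library to mint URIs in the http://example.org/cw/resource/ space and to describe a hypothetical politician and conflict with FOAF, the Event Ontology, and Dublin Core terms. The concrete resource names, property choices, and literal values are mine, purely for illustration.

from rdflib import Graph, Literal, Namespace, RDF

# URI spaces and vocabularies used by the imaginary Cold War site.
CW = Namespace("http://example.org/cw/resource/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EVENT = Namespace("http://purl.org/NET/c4dm/event.owl#")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("foaf", FOAF)
g.bind("event", EVENT)
g.bind("dcterms", DCTERMS)

# A hypothetical politician, described with FOAF.
kennedy = CW["john_f_kennedy"]
g.add((kennedy, RDF.type, FOAF.Person))
g.add((kennedy, FOAF.name, Literal("John F. Kennedy")))

# A hypothetical conflict, described as an event with an agent and a date.
crisis = CW["cuban_missile_crisis"]
g.add((crisis, RDF.type, EVENT.Event))
g.add((crisis, DCTERMS.title, Literal("Cuban Missile Crisis")))
g.add((crisis, DCTERMS.date, Literal("1962-10")))
g.add((crisis, EVENT.agent, kennedy))

# Print the Turtle serialization; this is the kind of description
# the /cw/rdf/ documents would serve.
print(g.serialize(format="turtle"))

In a real deployment, such a graph would be generated from the site's database rather than hard-coded, and served from the /cw/rdf/ space described above.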

Figure 4. The Linking Open Data cloud in early 2009. The project has more than 60 real-world data sets adhering to linked data principles. (Source: Richard Cyganiak; redrawn with permission)

The final step in preparing the data is deciding how to expose it (see the "Tools and Libraries for RDF Data Management" sidebar). A range of options is available for deploying RDF data8: RDF/XML stand-alone documents; XHTML+RDFa (www.w3.org/TR/xhtml-rdfa-primer/), which lets you embed an RDF graph in an HTML document using attributes; or SPARQL endpoints, which let agents query an RDF store via the SPARQL language (the RDF equivalent to the relational query language SQL). Because our imaginary Cold War site is based on a content management system, this step is rather straightforward: you typically mint URIs based on system-specific rules with an option of creating more legible URIs (see, for example, Drupal's abilities at http://buytaert.net/rdfa-and-drupal). As the Cold War site operator, you're in a comfortable position: plug-ins are available for the system, letting you expose the data with just a few configuration changes.
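If you serve the data yourself rather than through a plug-in, the publishing recipe behind the /resource/, /rdf/, and /html/ spaces boils down to content negotiation: a request for a generic resource URI is answered with an HTTP 303 redirect to the representation the client asked for (this follows the "Cool URIs" recipe cited earlier9). The following Python sketch of that dispatch logic is illustrative only; the function name and URI layout are assumptions that mirror the example above.

RDF_TYPES = ("application/rdf+xml", "text/turtle")

def representation_for(resource_uri, accept_header):
    # Pick a concrete document for a generic /cw/resource/ URI (303 pattern).
    # Returns an (HTTP status, Location header) pair; a real implementation
    # would parse q-values rather than doing a substring check.
    slug = resource_uri.rsplit("/", 1)[-1]  # e.g. "conflict"
    wants_rdf = any(t in accept_header for t in RDF_TYPES)
    space = "rdf" if wants_rdf else "html"
    return 303, "http://example.org/cw/%s/%s" % (space, slug)

# A machine agent asking for RDF is redirected to the RDF document ...
print(representation_for("http://example.org/cw/resource/conflict",
                         "application/rdf+xml"))
# ... while a browser asking for HTML gets the human-readable page.
print(representation_for("http://example.org/cw/resource/conflict",
                         "text/html,application/xhtml+xml"))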

Linked Data Discovery and Usage

So far, the data is compliant with the Web of data. Let's tackle the question of how to find and select target data sets for interlinking that you can use to enrich your content. Given the current infrastructure, discovering linked data sets on the Web of data can be challenging. In principle, we can learn about a linked data set's content by applying the follow-your-nose principle10 — that is, by inspecting its content step-wise by following URIs. This task is laborious and expensive in terms of time and resources. With semantic indexers, such as Sindice,11 we can now get an idea of what a data set offers without the follow-your-nose step. Furthermore, if a SPARQL endpoint is advertised with the semantic sitemaps extension,12 we can query the data set and learn about its internals. However, in terms of scalability, conciseness, and convenience, follow-your-nose might not be the final word.

Based on semantic sitemaps, colleagues and I recently proposed how to address these discovery issues using the Vocabulary of Interlinked Datasets (voiD).13 In a nutshell, voiD introduces classes and properties to formally describe a data set's content and how it interlinks with other data sets. Regarding interlinking, data set providers describe links' type and quantity7; for example, I can state that there are "120k links of type foaf:depiction from data set A to data set B." However, until voiD descriptions are widely deployed, the exploration process is somewhat limited.

As the Cold War site maintainer, you would likely inspect the LOD cloud (as in Figure 4) using a Web of data browser such as Tabulator10 or the OpenLink Data Explorer (http://linkeddata.uriburner.com/ode/) and a semantic search engine such as Sindice (http://sindice.com) to manually find and select worthwhile target data sets. Let's imagine that for the Cold War site, you've picked two data sets: for people-related data, you use DBpedia, and for geographical data, you use Geonames. On one hand, this decision lets you seamlessly integrate data from the previously mentioned data sets; on the other, it plugs the Cold War site into the LOD cloud, driving new agents (both humans and machines) to it.

Typically, to consume RDF data, we'd use SPARQL.14 The complete setup can now render as follows: the data provider exposes its data through standardized interfaces such as XHTML+RDFa or a SPARQL endpoint, and consumers choose the best-fitting format for their purpose. A human using a vintage Web browser will consume an XHTML representation, whereas a machine agent, such as an indexer or a content syndicator, will likely prefer an RDF serialization such as RDF/XML.
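To give a flavor of the consumer side, here is a minimal Python sketch that sends a SPARQL query to DBpedia's public endpoint over plain HTTP; the SPARQL protocol is essentially a GET request carrying a query parameter. The endpoint URL, the query, and the use of Python's standard library are illustrative choices, not the only way to do this.

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # DBpedia's public SPARQL endpoint
QUERY = """
SELECT ?property ?value
WHERE { <http://dbpedia.org/resource/Cold_War> ?property ?value . }
LIMIT 10
"""

url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
req = urllib.request.Request(
    url, headers={"Accept": "application/sparql-results+json"}
)
with urllib.request.urlopen(req) as resp:
    results = json.loads(resp.read().decode("utf-8"))

# The standard SPARQL JSON results format: head/vars plus results/bindings.
for binding in results["results"]["bindings"]:
    print(binding["property"]["value"], "->", binding["value"]["value"])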


Tools and Libraries for RDF Data Management

For an out-of-the-box solution to expose relational data on the Web as RDF, you should consider using mature frameworks such as the Database to RDF (D2R) server (http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/) or Triplify (http://triplify.org/Overview). These tools allow a close-to-instant deployment based on simple configuration and RDF mappings. In the enterprise realm, two options exist: the Talis platform (www.talis.com/platform/), available as software as a service (SaaS), and OpenLink's Virtuoso (http://virtuoso.openlinksw.com/), a middleware and database engine. In our projects, we often use ARC2 (http://arc.semsol.org/), a freely available PHP library for RDF processing that targets xAMP systems. A comprehensive list of appropriate Web-of-data tools, frameworks, and libraries is available at http://esw.w3.org/topic/SemanticWebTools.

So far, I've outlined the minimal steps needed to enhance a Web application by exploiting available linked data sets. With the approach I've described, a developer — rather than having to learn a multitude of proprietary APIs — learns RDF once (the data model). Besides knowledge of a manageable number of widely deployed vocabularies, such as FOAF, Dublin Core, or Semantically Interlinked Online Communities (SIOC; http://sioc-project.org), the only other thing he or she must be aware of is HTTP. In a sense, linked data defines a simple, read-only Representational State Transfer (REST) API with a high reusability factor.

Regarding the issue of turning the read-only Web of data into a read-write Web of data, colleagues and I have recently launched a community project called pushback (http://esw.w3.org/topic/PushBackDataToLegacySources) that defines an API and a vocabulary for so-called RDForms to update Web 2.0 data sources from the linked data space.

Finally, note that URI management15 and vocabulary creation and selection have some unresolved issues. The Web of data community addresses these issues by holding regular VoCamps (http://vocamp.org) — that is, casual, practical meetings where interested people develop or extend vocabularies. Additionally, more steps might be required for commercial applications as regards handling provenance and trust (www.w3.org/2008/09/msnws/papers/trustprivacy.html), addressing quality of service (reliability of data sources, for example), and tackling performance and scalability issues.

Acknowledgments

I thank the Linking Open Data community, especially Tom Heath, for feedback on an early draft of this article, and Richard Cyganiak for his never-ending support and willingness to discuss and explain issues around linked data.

References

1. T. Heath, "How Will We Interact with the Web of Data?" IEEE Internet Computing, vol. 12, no. 5, 2008, pp. 88–91.
2. C. Becker and C. Bizer, "DBpedia Mobile: A Location-Enabled Linked Data Browser," Proc. World Wide Web 2008 Workshop: Linked Data on the Web (LDOW 08), 2008; http://events.linkeddata.org/ldow2008/.
3. J. Zhao, G. Klyne, and D. Shotton, "Building a Semantic Web Image Repository for Biological Research Images," Proc. 5th European Semantic Web Conf., LNCS 5021, Springer, 2008, pp. 154–169.
4. T. Heath and E. Motta, "Revyu.com: A Reviewing and Rating Site for the Web of Data," Proc. 6th Int'l Semantic Web Conf. and 2nd Asian Semantic Web Conf. (ISWC 07 & ASWC 07), LNCS 4825, 2007, pp. 895–902.
5. D. Ayers, "Evolving the Link," IEEE Internet Computing, vol. 11, no. 3, 2007, pp. 94–96.
6. L. Ding and T. Finin, "Characterizing the Semantic Web on the Web," Proc. 5th Int'l Semantic Web Conf. (ISWC 06), LNCS 4273, 2006, pp. 242–257.
7. M. Hausenblas et al., "What is the Size of the Semantic Web?" Proc. Int'l Conf. Semantic Systems (I-Semantics 2008), J. Universal Computer Science, 2008, pp. 9–16.
8. W. Halb, Y. Raimond, and M. Hausenblas, "Building Linked Data for Both Humans and Machines," Proc. World Wide Web 2008 Workshop: Linked Data on the Web (LDOW 08), 2008; http://events.linkeddata.org/ldow2008/.
9. L. Sauermann and R. Cyganiak, "Cool URIs for the Semantic Web," W3C Semantic Web Education and Outreach Interest Group note, 31 Mar. 2008; www.w3.org/TR/cooluris/.
10. T. Berners-Lee et al., "Tabulator Redux: Browsing and Writing Linked Data," Proc. World Wide Web 2008 Workshop: Linked Data on the Web (LDOW 08), 2008; http://events.linkeddata.org/ldow2008/.
11. E. Oren et al., "Sindice.com: A Document-Oriented Lookup Index for Open Linked Data," Int'l J. Metadata, Semantics and Ontologies, vol. 3, no. 1, 2008, pp. 37–52.
12. R. Cyganiak et al., "Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web," Proc. 5th European Semantic Web Conf., LNCS 5021, Springer, 2008, pp. 690–704.
13. K. Alexander et al., "Describing Linked Datasets — On the Design and Usage of voiD, the Vocabulary of Interlinked Datasets," Proc. World Wide Web 2009 Workshop: Linked Data on the Web (LDOW 09), 2009; http://events.linkeddata.org/ldow2009/.
14. D. Lewis, Building Semantic Web CRUD Operations using PHP, tech. report, IBM developerWorks, 2008.
15. A. Jaffri, H. Glaser, and I. Millard, "Managing URI Synonymity to Enable Consistent Reference on the Semantic Web," Proc. Identity and Reference on the Semantic Web 2008 (IRSW 08), 2008; http://ceur-ws.org/Vol-422.


BBC Music: Where’s the RDF?

The BBC Music beta site (Figure 3 in the main text) is a Web site in HTML that primarily targets human users. However, agents operating on the Web of data consume RDF. In the following, I show how to use the "Swiss army knife" curl to obtain an RDF view on data from this site. Using content negotiation (that is, setting the Accept field in the HTTP header to RDF/XML), such as

curl -H "Accept: application/rdf+xml" http://www.bbc.co.uk/music/artists/79239441-bfd5-4981-a70c-55c3f15c1287

yields the RDF/XML representation of the resource, a description of the artist Madonna.


We can identify various vocabularies in this RDF graph that represent information about the artist Madonna — widely deployed ones, such as Friend of a Friend (foaf:), but also specialized ones, such as the Music Ontology (mo:). Furthermore, we can see interlinking to DBpedia (http://dbpedia.org/resource/Madonna_(singer)). We can now perform structured queries on top of this RDF representation using SPARQL. For example, to obtain fan pages for Madonna, we can use the following query:

PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?artist ?fanpage
FROM <http://www.bbc.co.uk/music/artists/79239441-bfd5-4981-a70c-55c3f15c1287>
WHERE {
  ?artist a mo:SoloMusicArtist ;
          owl:sameAs <http://dbpedia.org/resource/Madonna_(singer)> ;
          mo:fanpage ?fanpage .
}

The query will return the URI for the artist (www.bbc.co.uk/music/artists/79239441-bfd5-4981-a70c-55c3f15c1287#artist) and the respective fan pages.
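For readers who prefer to script this step, here is a minimal sketch that fetches the same RDF with the rdflib Python library and runs the fan-page query locally. It is an illustration only: it assumes the site still serves RDF at this URI, and it drops the FROM clause because the data is loaded directly into an in-memory graph.

from rdflib import Graph

ARTIST = "http://www.bbc.co.uk/music/artists/79239441-bfd5-4981-a70c-55c3f15c1287"

g = Graph()
g.parse(ARTIST)  # rdflib content-negotiates for RDF, like the curl call above

QUERY = """
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?artist ?fanpage
WHERE {
  ?artist a mo:SoloMusicArtist ;
          owl:sameAs <http://dbpedia.org/resource/Madonna_(singer)> ;
          mo:fanpage ?fanpage .
}
"""

for row in g.query(QUERY):
    print(row.artist, row.fanpage)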


Michael Hausenblas is a postdoctoral researcher at the Digital Enterprise Research Institute (DERI), National University of Ireland, Galway. He works in the areas of linked data, data discovery, and multimedia semantics, and is active in several W3C activities. Contact him at [email protected].