An XML-based representational document format for FRBR

Naimdjon Takhirov1, Trond Aalberg1, and Maja Žumer2

1 NTNU, NO-7491 Trondheim, Norway {takhirov,trondaal}@idi.ntnu.no
2 University of Ljubljana, 1000 Ljubljana, Slovenia [email protected]

Abstract. Metadata related to cultural items such as movies, books and music is a valuable resource that currently is exploited in many applications and services based on mashup and linked data. Unfortunately, existing metadata formats do not have the semantics needed for versatile integration and reuse of such information across domains and applications. The conceptual model in the Functional Requirements for Bibliographic Records is a major contribution towards a solution, but the existing large body of legacy data makes a transition to this model difficult. In this paper we present a format for exchange of MARC-based information that makes the entities and relationships of the FRBR model explicit. The main purpose of this format is to enable the exchange of FRBR enriched MARC records while still maintaining compatibility with MARC-based systems.

1 Introduction

Books, music and movies are major points of interest on the Web, and there has been a significant increase in the information pertaining to such products in recent years. Detailed information about artists or authors and listings of the works they have created can be found by searching or browsing numerous sites devoted to genres or specific creators, as well as many of the more general-purpose resources such as Wikipedia or Freebase. Content has become a major merchandise on the Web, and the barriers between digital and non-digital content, as well as between purchase and free access, are diminishing. Bibliographic information is stored and managed in a huge number of different library systems where the exchange of records using the MARC format is a key service to many. Such an environment is inherently resistant to change, and the adoption of new models and formats has to be evolutionary and pragmatic. Adapting to the FRBR model is a complex challenge that on the one hand requires solutions for mining existing bibliographic information to discover the structure of entities and relationships represented by the data [8]. On the other hand, we need solutions for explicit representation of these structures in ways that meet the requirements of the environment where this information is created, maintained and used. Libraries create, manage and exchange bibliographic information as distinct records; the entities that are implicitly described in a record, such as authors, are usually identified by descriptions only. The primary users of the data favour readability and ease of management and exchange, but there is an additional requirement for this data to be available as Semantic Web data.


N. Takhirov, T. Aalberg and M. Žumer

In this paper we present a format for expressing existing MARC-based bibliographic records with the semantics of the FRBR model. Our format, FRBR Core, builds upon the MarcXchange standard for coding MARC records, and introduces additional elements for grouping MARC data fields into typed entity descriptions with support for identification and referencing, combined with different solutions for expressing typed relationships. The format is compatible with RDF/OWL by direct transformation, and we show how the format can be transformed back into native MARC. Although we present our work in the context of the MARC format, the solution is generic and can be implemented on other formats as well.

2 Background

2.1 MARC

The MARC format is a compact and rather simple data structure in which fields identified by three-character tags organize the data. Two different field types are used for the main part of the record. The first type (called control fields) is of fixed length but may consist of sub-elements defined by character position, and is typically used for codes and numbers. The other main type is called data fields and may be of variable length. Data fields have a substructure consisting of subfields identified by a delimiter and a single character code. Each MARC record typically describes a single publication, and each data field reflects a logical grouping of the data elements that together describe a specific aspect of a publication. Records are typically self-contained pieces of information, which means that each record contains all the information that is needed about the cataloged publication without dependencies on other records. MARC is normally stored in a proprietary format that requires specific software to be processed. In order to make MARC records available to a wider range of stakeholders, the Library of Congress has developed the MARCXML format, which can be validated against its XML Schema. This standard is often referred to as lossless, as it enables a round-trip conversion MARC 21-MARCXML-MARC 21 without any loss of information. This is an important feature since it allows all other interested parties to use it in a universal way. A valid MARCXML document is also a valid MarcXchange document, i.e. MarcXchange is a superset of MARCXML. The purpose of MarcXchange is to facilitate the exchange of MARC records in XML as a supplement to the exchange of MARC records in ISO-2709.
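
The field structure described above can be sketched with a few lines of Python using only the standard library. The record content below is made up for illustration; the namespace URI is the one used by the Library of Congress MARCXML schema, and the helper name read_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Illustrative record with made-up values, not taken from a real catalogue.
SAMPLE = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">0613576</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Tolkien, J. R. R.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The two towers</subfield>
  </datafield>
</record>
"""

def read_fields(xml_text):
    """Return (control fields, data fields) as plain dictionaries."""
    record = ET.fromstring(xml_text)
    control = {cf.get("tag"): cf.text
               for cf in record.findall("marc:controlfield", MARC_NS)}
    data = {}
    for df in record.findall("marc:datafield", MARC_NS):
        # Each data field is a list of (subfield code, value) pairs.
        subfields = [(sf.get("code"), sf.text)
                     for sf in df.findall("marc:subfield", MARC_NS)]
        data.setdefault(df.get("tag"), []).append(subfields)
    return control, data

control, data = read_fields(SAMPLE)
print(control["001"])
print(data["245"][0])
```

Data fields are kept in lists because a tag may repeat within one record.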

2.2 FRBR

The FRBR model [3] was a major step towards the modernization of current cataloging practice. It is an ER model that aims to address four user tasks: find entities that correspond to the user's expressed information need, identify entities, select entities and acquire access to entities. The model was published by the International Federation of Library Associations and Institutions (IFLA) in 1998 and has received much attention in the last ten years. It is generally considered to be an important contribution to our understanding of the entities and relationships that are of interest to end users of bibliographic information.

[Figure: Work is realized through Expression, which is embodied in Manifestation, which is exemplified by Item.]

Fig. 1: FRBR Group 1 entities

The FRBR model depicts intellectual products as four interrelated entities: Item, Manifestation, Expression and Work (Figure 1). The Manifestation and Item entities are more or less equivalent to the commonly known concepts of publication and copy, respectively. The intellectual contributions found in publications are modelled in FRBR by the expression and work entities. A manifestation embodies one or more expressions, whereas each expression realizes a single work. An expression is the intellectual product that we recognize as unique content in the shape of text, sound or images, independent of the specific formatting it has been given in different publications. The work entity is the most abstract and is needed because of the way we refer to and reason about intellectual and artistic creations at the most general level. The play by William Shakespeare commonly referred to as "Hamlet" exists in numerous translations, where each translation is considered to be a specific expression that realizes the same work. The main advantage of the work entity is that it enables collocation of intellectually equivalent products and enables the modeling of closely related intellectual products in tree-like structures. The FRBR model additionally includes entities for persons and corporate bodies and the relationships they may have to the different intellectual products. Shakespeare created the work Hamlet, and the person responsible for a specific translation is related to a particular expression by a has realized relationship. Finally, the FRBR model defines the entities that occur as subjects of works, and the model describes the attributes that are needed for each entity and a rich set of relationships that may exist between the entities. As an ER model, FRBR is considerably different from the data structure that is found in MARC records.
FRBR was initially not intended to serve directly as a model for bibliographic databases and records, but there has been significant interest in the use of the model as a foundation for new types of services and user interfaces. As a conceptual model, the main contribution of FRBR is a more knowledge-like representation of bibliographic data that enables applications where users can explore and learn about the entities described in the bibliographic information, in addition to the more traditional way of searching.
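
To make the entity structure concrete, the following is a minimal sketch of the Group 1 entities (without Item) as Python data classes. The class and field names, the translator placeholder and the bilingual edition are assumptions made for illustration; they are not part of the FRBR specification or of the format presented here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Person:
    name: str

@dataclass
class Work:
    title: str
    creators: List[Person] = field(default_factory=list)

@dataclass
class Expression:
    label: str
    realizes: Work  # each expression realizes a single work
    realized_by: List[Person] = field(default_factory=list)

@dataclass
class Manifestation:
    label: str
    embodies: List[Expression] = field(default_factory=list)  # one or more

def works_in(manifestation):
    """Collocate: list the distinct works reachable from a manifestation."""
    titles = []
    for expression in manifestation.embodies:
        if expression.realizes.title not in titles:
            titles.append(expression.realizes.title)
    return titles

shakespeare = Person("William Shakespeare")
hamlet = Work("Hamlet", creators=[shakespeare])
english = Expression("English text", realizes=hamlet)
translation = Expression("Norwegian translation", realizes=hamlet,
                         realized_by=[Person("(hypothetical translator)")])
bilingual = Manifestation("Bilingual edition (hypothetical)",
                          embodies=[english, translation])
print(works_in(bilingual))
```

Both expressions point to the same Work object, which is what allows the two texts to be collocated under "Hamlet".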

2.3 RDF and OWL

The Resource Description Framework (RDF) is a general-purpose model for data on the web and is based on the representation of information as a collection of subject, predicate and object statements (triples). Objects may be the subject of other statements and form complex graphs of nodes interlinked by the predicates (properties). RDF is based on the use of URIs to identify the nodes and the property types, and the type system (vocabulary) can be specified using RDF Schema. OWL is built on top of RDF but is a more extensive language with a stronger syntax for ontologies. RDF and OWL are important technologies for implementing the Semantic Web by enabling information to be exchanged and integrated. The basic principle of the Semantic Web is that information we exchange and make available on the web needs to be machine-interpretable and have explicit meaning expressed as identifiable types. The primary exchange syntax for RDF and OWL is RDF/XML, and even if XML itself is a human-readable format, the resulting serialization of RDF is only designed to be machine readable. Turtle and N3 [2] are alternative non-XML syntaxes with a more compact and human-readable textual form. A MARC-based record can in principle be expressed directly in RDF as demonstrated in [11]. Simply transforming a MARC record to a corresponding RDF/XML representation, however, only makes the data available for tools unable to process native MARC. For MARC-based data in strict MARC 21 or UNIMARC the tags and codes represent a certain level of strict typing, but in practice there are parts of the data that have a contextual meaning in the sense that the interpretation depends on the values in other fields. An "added entry" title in a MARC record may mean different things. It can be an alternative title for the cataloged item, the title of a part or the title of a related publication.
Indicators may specify the interpretation of fields, but in other cases the meaning is only revealed if interpreted in the context of cataloging rules and common patterns in the data. Essentially this means that a direct translation to RDF/XML only makes the data available for other tools, but does not contribute to the machine interpretation of the meaning. As a conceptual framework for bibliographic records, the FRBR model offers a more formal semantic model of the entities and relationships that are of main concern to end users. A mapping between MARC 21 fields and the FRBR attributes is presented in [10].

Issues with RDF/OWL. The use of RDF relies on URI-based vocabularies and node identification. Unfortunately, there is no tradition of either of these in traditional metadata management systems. Nodes (objects/entities) are identified by description only and will turn into blank nodes in an RDF representation. Though libraries have for decades utilized authority files for the description of authors and other actors, they still use descriptions only when referring to persons. RDF and OWL are not very readable (by humans) when written in XML or as RDF triples. One reason for this is that RDF/XML is extremely verbose [9], but the major part of the readability issue is the representation of OWL constructs in RDF/XML or RDF triples. Furthermore, the problem of verbosity brings along another issue, that of size. Once the file is large, it requires a lot of computing resources to process. The RDF version indicates that the format is compatible with this format. The problem, however, is that every resource must have a URIRef, i.e. resources must be identifiable on the Web. In a record-based library world, information is usually organized in MARC records, which makes generating globally visible and unique URIRefs a significant challenge. The RDF solution would require every record to have a global identifier, since the primary purpose of this framework is to describe Web resources.
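
The identification issue can be illustrated with a small sketch in plain Python (no RDF library). It represents statements as subject, predicate, object tuples and prints them in an N-Triples-like form. The example.org URIs and the vocabulary are hypothetical, and literal escaping is deliberately ignored. The author node has no URI and therefore becomes a blank node that is only meaningful inside this one document.

```python
import itertools

_blank_ids = itertools.count(1)

def blank_node():
    """Create a document-local blank node label."""
    return "_:b%d" % next(_blank_ids)

def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples in an N-Triples-like form."""
    lines = []
    for s, p, o in triples:
        parts = []
        for term in (s, p, o):
            if term.startswith("_:"):
                parts.append(term)            # blank node
            elif term.startswith("http"):
                parts.append("<%s>" % term)   # URI reference
            else:
                parts.append('"%s"' % term)   # plain literal (no escaping)
        lines.append(" ".join(parts) + " .")
    return "\n".join(lines)

EX = "http://example.org/vocab/"   # hypothetical vocabulary
author = blank_node()              # entity identified by description only
triples = [
    ("http://example.org/work/1", EX + "title", "The Two Towers"),
    ("http://example.org/work/1", EX + "creator", author),
    (author, EX + "name", "Tolkien, J. R. R."),
]
print(to_ntriples(triples))
```

Two records describing the same author would each produce their own blank node, so nothing in the output states that they refer to the same person.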

3 Simplified semantic information representation in XML

Expressing entities, and especially the relationships between those entities, in a well-defined manner in XML is one of the important tasks in introducing semantics to the data. There are different ways this task can be accomplished. The classic and most logical approach to representing entities in XML is to describe each entity with an element. An entity is described by a "root" element and its attributes are represented as sub- or child-elements. As an example, the person entity is described using a person element, and its attributes such as name, lastname and email are represented as child elements of the person element. The simplest way to express a relationship is to use a "relation" tag in the XML and specify the type of relation and the entities involved. In a DTD we can use the attribute types ID/IDREF to achieve our goal. An ID attribute contains a value that uniquely identifies the element within the context of the document. IDs are much like internal links in plain HTML. IDREF is another special XML attribute type that references an ID value. The HTML elements A and LINK may have a rel attribute which specifies a relationship between Web resources. The most common use of a link is to retrieve another Web resource. However, document authors may specify other types of relationships. There are different mechanisms for representing relationships between entities. One method is dynamic typing. In this method, we either specify the type of relationship inside a "relationship" element or include an attribute type that specifies that information. Another approach would be to have a set of "strongly or statically typed" (predefined) relationship types which would be represented as attributes. Both methods have their strengths and weaknesses. In the static tag method (that is, dynamic typing with a fixed relationship tag), there is more room to introduce additional relationship types, as we only have to change the specified type.
On the other hand, the document may lose readability. Additionally, this method requires a schema language with support for defining acceptable values for XML elements/attributes. For example, XML Schema (W3C) has this support, but restriction of elements works differently from restriction of attributes. The strongly typed method eliminates the readability weakness of the first method, but a problem arises when there is a need to define a new type of relationship. To do that, the schema must be updated each time a change is required. Additionally, existing documents might have to be updated as well. There are a few basic methods of expressing relationships in XML (hierarchical, referencing and hybrid), and these methods are presented below.
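
The difference between the two typing styles can be sketched as follows. The element and attribute names (relation, type, ref, translationOf) and the set of known types are assumptions chosen for illustration. Note that Python's ElementTree treats id as an ordinary attribute, so the uniqueness that a DTD ID would guarantee is not checked here.

```python
import xml.etree.ElementTree as ET

# Dynamic typing: one generic element, the relationship type is a value.
DYNAMIC = """
<record>
  <expression id="e1"/>
  <expression id="e2">
    <relation type="translationOf" ref="e1"/>
  </expression>
</record>
"""

# Static (strong) typing: the relationship type is the attribute name itself.
STATIC = """
<record>
  <expression id="e1"/>
  <expression id="e2" translationOf="e1"/>
</record>
"""

# Relationship types fixed in advance by an (assumed) schema.
KNOWN_TYPES = {"translationOf", "realizationOf"}

def relations_dynamic(xml_text):
    root = ET.fromstring(xml_text)
    return [(e.get("id"), r.get("type"), r.get("ref"))
            for e in root.iter("expression")
            for r in e.findall("relation")]

def relations_static(xml_text):
    root = ET.fromstring(xml_text)
    return [(e.get("id"), name, value)
            for e in root.iter("expression")
            for name, value in sorted(e.attrib.items())
            if name in KNOWN_TYPES]

print(relations_dynamic(DYNAMIC))
print(relations_static(STATIC))
```

Both functions recover the same statement. In the dynamic form a new relationship type only needs a new value, whereas the static form needs a change to KNOWN_TYPES, which stands in for a schema update.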

[Figure: four XML listings, panels (a) to (d), whose markup was lost in extraction. The surviving values are entity identifiers (for example 57d655d21bd478bdf264e1ed737ac02c and e4a694dc75d1dff64fee6e90ad61a06b), the key 4246whitemichael#tolkien# and the name Tolkien.]

Fig. 2: Different approaches to expressing FRBR entities and their relationships.

3.1 Hierarchical method

As the name implies, this method expresses entities and their respective relationships in a hierarchical fashion (also called a parent-child relationship). In this case, if an entity A has a relationship to an entity B, B is included as a child element of A. In the FRBR world, we are able to deduce that a manifestation that includes a child element expression has an "embodies" relationship to that expression (the inverse of the is embodied in relation). The main problem with this approach is that it can potentially lead to a loop. If entity A has a relationship to entity B, which points to entity C with some relationship, and C has a link to A, this creates an endless relationship tree. This approach is used to represent a one-to-one or one-to-many relationship between elements. However, this mechanism is insufficient to represent a many-to-many relationship, as each element may only have a single parent element. Therefore, this method is suitable for expressing entities with a finite set of nested elements (relationships). Another disadvantage of this approach is data duplication. This issue arises when an entity is mentioned several times in the hierarchy. However, aside from the growing size, this is not a big issue for read-only data, since the data does not have to be changed in several places. On the other hand, the compactness, the increased proximity of related entities and the increased readability from the FRBR perspective may justify accepting the aforementioned problems.
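
A minimal sketch of how relationships can be deduced from nesting is given below. The element names follow the FRBR entities, while the identifiers and the table of implied relationships are assumptions for illustration. The example also shows the duplication problem: the same expression and work appear under two manifestations.

```python
import xml.etree.ElementTree as ET

NESTED = """
<record>
  <manifestation id="m1">
    <expression id="e1">
      <work id="w1"/>
    </expression>
  </manifestation>
  <manifestation id="m2">
    <expression id="e1">
      <work id="w1"/>
    </expression>
  </manifestation>
</record>
"""

# Relationship implied by each (parent tag, child tag) pair.
IMPLIED = {
    ("manifestation", "expression"): "embodies",
    ("expression", "work"): "realizes",
}

def implied_relations(xml_text):
    """Walk the tree and deduce relationships from element nesting."""
    found = []

    def visit(parent):
        for child in parent:
            name = IMPLIED.get((parent.tag, child.tag))
            if name:
                found.append((parent.get("id"), name, child.get("id")))
            visit(child)

    visit(ET.fromstring(xml_text))
    return found

relations = implied_relations(NESTED)
print(relations)
```

The statement that e1 realizes w1 is produced twice, once per copy of the expression, which is harmless for read-only data but must be kept in sync if the data changes.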

3.2 Reference method

An XML document can potentially be very large. To avoid duplication, we can employ the reference method instead of the hierarchical method when expressing relationships between FRBR entities. Instead of embedding related entities under the entity being described, we can use a referencing technique. One such technique is the previously discussed ID/IDREF. The basic approach is the following. We express each entity under its own tag. There is no hierarchy, since entities do not include related entities. Instead, the relationship is expressed through an attribute. Since the entities have an ID attribute, i.e. they are unique in our context, we can use IDREF to reference those entities. Related entities are stored in a loosely coupled manner, which means there is no proximity from the XML processing perspective. Each time there is a need to access a related entity, we have to perform a lookup, usually through an XPath expression. This technique reduces the size of the document and avoids duplication. Referencing makes the representation of read-write data easy, since entities are stored in one place and once the data is updated, those updates become globally visible. However, this feature does not come for free. The loose coupling introduces a cost in regard to processing data. A simple lookup may need a certain number of iterations through nodes in order to access the data, since entities are spread across the document and even the collection. The situation gets worse when additional I/O operations are required for documents stored in separate files.
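
The lookup cost mentioned above can be seen in a small sketch. Entities are stored flat and related through attributes; the attribute name embodies and all identifiers are assumptions for illustration. Each lookup scans the document with the limited XPath support in Python's ElementTree, and an index built once is shown as a common way to avoid repeated scans.

```python
import xml.etree.ElementTree as ET

FLAT = """
<record>
  <manifestation id="m1" embodies="e1"/>
  <manifestation id="m2" embodies="e1"/>
  <expression id="e1" realizationOf="w1"/>
  <work id="w1"><title>The Two Towers</title></work>
</record>
"""

root = ET.fromstring(FLAT)

def lookup(entity_id):
    """Resolve a reference by scanning the document (an XPath-style lookup)."""
    return root.find(".//*[@id='%s']" % entity_id)

def work_title(manifestation_id):
    """Follow manifestation -> expression -> work through two lookups."""
    manifestation = lookup(manifestation_id)
    expression = lookup(manifestation.get("embodies"))
    work = lookup(expression.get("realizationOf"))
    return work.findtext("title")

# An index built once avoids repeated scans when many lookups are needed.
index = {element.get("id"): element
         for element in root.iter() if element.get("id")}

print(work_title("m2"))
print(index["e1"].get("realizationOf"))
```

The work is stored once, so changing its title in one place is visible from both manifestations.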

3.3 Hybrid approach

A combination of the methods discussed above is the hybrid approach. As the name suggests, in this approach we employ both methods in conjunction. In an ideal hybrid approach, entities with only one reference would be stored under the related entity, taking advantage of proximity, readability and efficient processing. The same applies to situations where one is dealing with read-only data, since there is no need to track every duplicate record when the data is changed. At the same time, entities in the same document that are referenced several times could be stored as separate entities. The problem with this, however, is coping with changes in the environment. What if there is a change in system requirements and the data is no longer read-only? What if an entity that is referenced only once is reused in the future by another entity? To cope with these problems, one has to look carefully at the trade-offs of each approach.
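
One way to read the rule of thumb above is as a simple placement decision per entity. The sketch below is an illustrative assumption, not a procedure taken from the format: it embeds an entity when it is referenced once or when the data is read-only, and stores it separately otherwise. The function name plan_layout and the identifiers are hypothetical.

```python
from collections import Counter

# (source, relationship, target) statements; identifiers are hypothetical.
RELATIONS = [
    ("m1", "embodies", "e1"),
    ("m2", "embodies", "e1"),
    ("m3", "embodies", "e2"),
]

def plan_layout(relations, read_only=False):
    """Choose 'embed' or 'reference' for every target entity."""
    counts = Counter(target for _, _, target in relations)
    layout = {}
    for target, n in counts.items():
        # Embed entities referenced once; with read-only data duplicates are
        # harmless, so everything can be embedded.
        layout[target] = "embed" if (n == 1 or read_only) else "reference"
    return layout

print(plan_layout(RELATIONS))
print(plan_layout(RELATIONS, read_only=True))
```

The decision depends on the current reference counts and on the read-only flag, so it has to be recomputed when either changes, which is the maintenance cost the section points out.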

3.4 Identification of entities

There is a lack of specifically designated fields for identification in bibliographic data [7]. In fact, there is no standards-based technique or methodology for identifying entities in bibliographic records. Several works have previously explored the identification of entities and proposed algorithms for duplicate detection [4]. However, some types of entities can be identified through an "identity field" such as ISBN for books, ISSN for journals, etc. The simple and common approach is to construct a key based on the descriptive attributes of entities (such as title, sub-title and author) and use this key for comparison. The comparison is performed based on a decision tree/table [6] or a set of rules. The main issue with these techniques is data inconsistency. Two MARC records referring to the same global entity may describe the attributes of the entity in different ways.
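
A minimal sketch of such a descriptive key is shown below. The normalization steps and the "#" separator are assumptions made for illustration and are not the rules used by any particular tool. The third example shows the inconsistency problem: a different form of the author's name yields a different key for the same entity.

```python
import re
import unicodedata

def normalize(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def entity_key(title, author, subtitle=""):
    """Build a comparison key from descriptive attributes."""
    parts = [normalize(title), normalize(subtitle), normalize(author)]
    return "#".join(p.replace(" ", "") for p in parts)

a = entity_key("The Two Towers", "Tolkien, J.R.R.")
b = entity_key("The two towers.", "TOLKIEN, J. R. R.")
c = entity_key("The Two Towers", "Tolkien, John Ronald Reuel")
print(a)
print(a == b, a == c)
```

Differences in case, spacing and punctuation are absorbed by the normalization, but a spelled-out name is not, so exact key comparison alone misses such matches.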

4 Design criteria

MARC formats were designed with communication and exchange of bibliographic data in mind. While we can obtain good results exchanging records using MARC, the problem of expressing hidden semantics is not addressed. Though the design criteria discussed here may seem fairly obvious in defining our format, we outline each criterion that addresses specific aspects pertaining to the representation of semantics found in MARC records. First and foremost, our focus in the new FRBR Core format is the representation of information as FRBR entities, the relationships between them and their attributes. Conceptually, our approach operates at a higher level than existing triple-based frameworks such as RDF and OWL. On the data element side, we find it difficult to process information with ontologies, as the information found, for example, in MARC records often lacks clear structure and semantics. For example, 7xx Added Entry fields in the MARC 21 format bear some kind of title information, but we lack information about the kind of title. Further analysis of additional fields is required in order to make a clear statement of what information that field contains. One of the criteria we have set is that the format should facilitate human readability. By contrast, other semantic markup languages such as OWL have a greater degree of machine readability. Second, the format should have a clear structure. The names of the tags should be familiar, and preferably names conventional in the FRBR model are to be employed. The structure of the XML should resemble atomic entities connecting to the work, expression and manifestation FRBR entities. Third, the format should enable the exchange of records. There are dozens of formats for exchanging information, especially in repositories hosting heterogeneous content. With the new format our intention is to support a two-way transformation of resources in a variety of formats: MARC 21/UNIMARC, MODS, OWL and Dublin Core.

5 Motivating example

The semantic network of entities and the use of FRBR is depicted in Figure 3. An example of what is referred to as a Work in FRBR is the three-book epic by J.R.R. Tolkien, "The Lord of the Rings", encompassing "The Fellowship of the Ring", "The Two Towers" and "The Return of the King". Each of these parts is regarded as a separate work as well. In our example, the second part, "The Two Towers", is shown in more detail. Like any work, "The Two Towers" is available in a number of versions, and each edition of the work forms an FRBR expression. Therefore, the original English version of "The Two Towers" and the Norwegian translation "To tårn" are regarded as separate expressions of the same work. A particular printed version of an expression, or a difference in format, forms a manifestation. Therefore, the paperback versions of 2003 and 2005 are two different manifestations of the expression called "The Two Towers". In this example, we can also see on the left the famous movie "The Lord of the Rings" directed by Peter Jackson, consisting of three parts with names identical to those by Tolkien. Since the Jackson-directed movies are based on the book by Tolkien, the entities in our model are related to each other. The parody "Bored of the Rings" is also based on "The Lord of the Rings". This complex network of entities and relationships is difficult if not impossible to find in MARC records. Record 1 and Record 2 are typical examples of how this information is recorded, and the challenge often is to find and identify entities and draw relationships.

[Figure: network of Creator, Work, Expression and Manifestation nodes for the "Lord of the Rings" works (books by J.R.R. Tolkien, films by P. Jackson, the parody "Bored of the Rings" by H. Beard), linked by created, is realized through and is embodied in relationships. Manifestations shown include paperback editions of "The Two Towers" (September 2003, June 2005), DVD releases (2002, 2005, Widescreen Edition) and the Norwegian "Ringenes Herre - To tårn" (2003). Record 1 and Record 2 mark the source MARC records.]

Fig. 3: Fragment of the network of entities and their relations describing the "Lord of the Rings" related works. The second part, "The Two Towers", is presented in more detail.

6 Structure of the format

The XML Schema (footnote 3) in Figure 4 describes the structure of the FRBR Core format. The schema contains a root element record that can have one or more manifestation elements. The manifestation element comprises various attributes that are specified in FRBR. A particular embodiment of an expression can be represented either as a child element (hierarchical method) or by an attribute (referencing method). The same technique is used to describe work and creator entities. A portion of the transformed output is illustrated in Figure 5b. The bold tags represent FRBR entities. The main entities supported in the format are work, expression, manifestation and person. Traditional library cataloging is normally done at the manifestation level, and therefore the Item entity is not presented here. The schema introduces a few new elements, mainly those of FRBR entities. Other elements are the same as those in MARCXML. The FRBR elements are used to group MARCXML elements that describe a specific aspect of the publication found in a MARC record. For example, the title of a publication is mainly found in the 245 and 240 fields. Thus, these fields are listed under the title element under manifestation. The simplicity of the format enables easy transformation back to MARCXML. In fact, simply dropping the FRBR-specific tags will bring the document back to MARCXML. The main rationale behind this choice is to ensure that the structure of the MARC record is not lost. An example of the MarcXchange format and the corresponding final output is depicted in Figure 2d. The relationships between entities are represented using the strongly typed method (see Section 3). This method has been chosen because a set of predefined relationship types is defined in the FRBR model. The FRBR model has not changed much since it was published. Even minor changes have to go through a fair amount of review before the final publication of a new revision.

3 The complete schema is accessible at http://is.gd/fllMr
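
The claim that dropping the FRBR-specific tags returns a MARCXML record can be sketched as follows. The frbr namespace URI is the target namespace shown in Figure 4 and the marc namespace is the MARCXML one; the sample document, the grouping of field 041 under a language element and the choice to sort fields by tag are assumptions made for illustration, not the behaviour of the authors' stylesheets.

```python
import xml.etree.ElementTree as ET

FRBR = "http://www.idi.ntnu.no/frbr"      # target namespace shown in Figure 4
MARC = "http://www.loc.gov/MARC21/slim"   # MARCXML namespace

# Hypothetical FRBR Core fragment; the real element layout may differ.
SAMPLE = """
<frbr:record xmlns:frbr="http://www.idi.ntnu.no/frbr"
             xmlns:marc="http://www.loc.gov/MARC21/slim">
  <frbr:manifestation>
    <frbr:title>
      <marc:datafield tag="245" ind1="1" ind2="0">
        <marc:subfield code="a">Ringenes herre</marc:subfield>
      </marc:datafield>
    </frbr:title>
    <frbr:expression>
      <frbr:language>
        <marc:datafield tag="041" ind1=" " ind2=" ">
          <marc:subfield code="a">nob</marc:subfield>
        </marc:datafield>
      </frbr:language>
    </frbr:expression>
  </frbr:manifestation>
</frbr:record>
"""

def to_marcxml(frbr_xml):
    """Drop FRBR grouping elements and keep the MARC fields, sorted by tag."""
    source = ET.fromstring(frbr_xml)
    record = ET.Element("{%s}record" % MARC)
    wanted = ("{%s}controlfield" % MARC, "{%s}datafield" % MARC)
    fields = [el for el in source.iter() if el.tag in wanted]
    # Assumption: fields are put back in tag order; the original order
    # inside the source MARC record is not recorded in this sketch.
    for el in sorted(fields, key=lambda f: f.get("tag")):
        record.append(el)
    return record

marc_record = to_marcxml(SAMPLE)
print([f.get("tag") for f in marc_record])
```

Because the MARC fields are carried unchanged inside the FRBR elements, no field-level mapping is needed for the way back.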

[Figure: schema diagram, flattened in extraction. Recoverable content: target namespace http://www.idi.ntnu.no/frbr, importing http://loc.gov/standards/marcxml/schema/ (marcxml.xsd). Element types shown: record (tns:record), manifestation (tns:manifestation), expression (tns:expression), work (tns:work), person (tns:person), identifier (tns:identifier), title (tns:title), published (tns:published), description (tns:description), series (tns:series), language (tns:language), hasSubject (tns:subject), fullName and name, with marcxml:controlfield and marcxml:datafield as leaf content. Attributes listed under expression: realizationOf, abdrigement, adaption, adaptionOf, arrangement, arrangementOf, complement, complementOf, embodied, imitation, imitationOf, successor, successorOf, translation, translationOf, relation. Cardinalities shown: 1...∞ and 0...∞.]

Fig. 4: The XML Schema of FRBR Core format.

[Figure: two XML listings whose markup was lost in extraction. Panel (a), the MarcXchange input, contains the language codes eng and nob, the titles "The Lord of the rings / The return of the king" and "Ringenes herre / Atter en konge", a credits statement naming Fran Walsh, Philippa Boynes and Peter Jackson, and the value video. Panel (b), the FRBR Core output, contains the identifier 0613576, the same titles, the statement "directed by P. Jackson" and the value video.]

Fig. 5: Portion of input MarcXchange (a) and the transformed output-FRBR Core(b).

Namespaces. The frbr namespace is used to qualify element and attribute names pertaining to the FRBR model. For the data fields and subfields we have used the existing marcxml namespace (from the MARCXML XML Schema).

Linking. Relationships between entities are linked in two ways, i.e. we use the hybrid approach discussed in Section 3, which is similar to the technique of describing links between resources in XLink.

7 Transformation to OWL

The output of the final transformation is a set of interrelated FRBR records (based on the initial MARC records) with a clear structure and typed relationships. These records are assigned the same identifiers as those generated by the frbrizer tool discussed earlier. A record may be related to other records in the same collection; these relationships are created using the referencing method discussed in section 3. Our format can be used in conjunction with lower-level representation formats such as RDF/OWL as well as domain-specific formats such as MARC. OWL class representations and their relations result in very complex and unintuitive graphs, which in turn leads to poor performance when reasoners parse and classify ontologies [12].

The transformation process consists of a series of XSLT transformations, which interpret the input and create the FRBR records. To convert MARC records to normalized FRBR records, we have used a tool previously developed at NTNU [1]. This conversion tool performs several transformations and accepts records in the MarcXchange format as input. A pre-defined set of rules was created beforehand in a database and exported to XML. These rules create identifiers for entities, match entities and govern the mappings of entities in the table that contains the variable data for the various occurrences of entities. The final step is simply arranging elements and transforming these XML files into the FRBR Core format. The steps in the transformation process are depicted in Figure 6.

[Figure 6: diagram with the labels FRBR Core XML Schema, MARCXML documents, XSLT, FRBR Core, OWL ontology, XSLT and OWL instances.]

Fig. 6: Transformation from FRBR Core format to OWL.

On the schema level, the XML Schema of the format is used to create the OWL ontology model, which is more or less static. For each XML document (on the instance level) validated by the schema, we generate OWL instances via an XSLT transformation. The FRBR entities work, expression, manifestation, person etc. are declared as owl:Class, and owl:ObjectProperty declarations specify the elements and attributes of the entities.

The transformation to RDF/OWL raised a number of issues. One problem is identifying FRBR entities when the identity of a record is only locally defined: there is no GUID for FRBR work entities that could be employed universally. Recent trends, however, point towards standards of this kind; examples of projects that address this issue are VIAF and FRAD. Another issue is the language of an expression. Records may have multiple languages (e.g. spoken language and subtitle language for records describing movies), so the language element may be repeated several times.
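The split between the static schema level and the per-document instance level can be sketched in Python as follows. This is an illustration under stated assumptions, not the XSLT used in the paper: it swaps XSLT for Python's ElementTree, and the frbr namespace URI, the base URI for individuals, and the input element names (collection, work, expression, language, id) are hypothetical. Only the rdf and owl namespace URIs are the standard W3C ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"
# Hypothetical URIs: neither is given in the text.
FRBR_NS = "http://example.org/frbr#"
BASE = "http://example.org/records/"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("owl", OWL_NS)
ET.register_namespace("frbr", FRBR_NS)

ENTITY_TYPES = ("work", "expression", "manifestation", "person")

# Hypothetical FRBR-Core-like input; the expression repeats its language.
SAMPLE = f"""
<collection xmlns="{FRBR_NS}">
  <work id="w1"/>
  <expression id="e1">
    <language>eng</language>
    <language>nob</language>
  </expression>
</collection>
"""


def to_owl(frbr_xml):
    """Build an RDF/XML tree with class declarations and typed individuals."""
    source = ET.fromstring(frbr_xml.strip())
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")

    # Schema level: one owl:Class per FRBR entity type, independent of input.
    for name in ENTITY_TYPES:
        ET.SubElement(rdf, f"{{{OWL_NS}}}Class",
                      {f"{{{RDF_NS}}}about": FRBR_NS + name})

    # Instance level: one individual per entity found in the document.
    for name in ENTITY_TYPES:
        for entity in source.iter(f"{{{FRBR_NS}}}{name}"):
            # The id is only locally defined, so the resulting URI is only
            # unique within this (hypothetical) base, not globally.
            node = ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
                                 {f"{{{RDF_NS}}}about": BASE + entity.get("id")})
            ET.SubElement(node, f"{{{RDF_NS}}}type",
                          {f"{{{RDF_NS}}}resource": FRBR_NS + name})
            # A repeated language element yields one property per occurrence.
            for lang in entity.findall(f"{{{FRBR_NS}}}language"):
                prop = ET.SubElement(node, f"{{{FRBR_NS}}}language")
                prop.text = lang.text
    return rdf


owl_tree = to_owl(SAMPLE)
owl_xml = ET.tostring(owl_tree, encoding="unicode")
```

Because the class declarations do not depend on the input, they could equally be generated once from the schema and shared by all instance documents, which matches the description of the ontology model as more or less static.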

8

Related work

The CIDOC CRM is a core ontology focused on the semantic integration of cultural heritage information, including that of libraries and archives. The model is rather compact, with 80 classes and 130 relationships. CIDOC has proposed CRM Core [5], a set of metadata elements whose primary purpose is resource discovery. In addition, the format provides a simple schema for summarizing historical facts. CRM Core captures the basic functions of identification, classification, participation, references and similarity. Another model proposed by CIDOC is FRBRoo (an object-oriented version of FRBR). As a formal ontology, it is intended not only to capture and represent the underlying semantics of bibliographic information, but also to facilitate the integration, mediation, and interchange of bibliographic and museum information.

9

Conclusion and further work

In this paper we have presented a framework for enriching existing data with a semantic layer of entities and relationships defined in the FRBR model. The use of existing Semantic Web technologies such as RDF/XML and OWL raises a number of issues: (1) the model becomes too verbose and complex, (2) it is unintuitive and hard for humans to read, and (3) there is no clear structure for entities and their relationships. Experience has shown that the adoption of new Semantic Web technologies is slow, especially in the library community, where traditional cataloging practice is employed and records are still stored in the MARC format. Instead of converting the data into a new format such as RDF/XML and/or OWL to introduce semantics, we have proposed a new format for enriching existing metadata with entities and relationships defined in the FRBR model. It can be used as an intermediary format that is easily transformed to and from MARC, RDF/XML, OWL, MODS and various other formats.

A resource can be described with a variety of metadata attribute sets, such as MARC, Dublin Core, the RDA attribute set, Onix etc. Entities, however, have fixed semantics, while the attributes used to describe each entity can be more "flexible". The important thing is to identify the entity type (e.g. work, person). Once identified, the entity can be described using various attribute sets. Finally, the solution described in this paper is a generalization of what can be achieved with an RDF and OWL solution. We have taken MARC as a case, but the technique can be applied to other formats as well. Further work includes evaluation of the format with respect to the services it can provide and the quality of those services.

References

1. T. Aalberg and M. Žumer. Looking for Entities in Bibliographic Records. In Proceedings of the ICADL 2008, Berlin, 2008. Springer-Verlag.
2. T. Berners-Lee. Getting into RDF & Semantic Web using N3. http://is.gd/fM5GN/, 2005.
3. P. L. Boeuf. FRBR and Further. Cataloging & Classification Quarterly, 32, 2001.
4. California Digital Library. The Melvyl Recommender Project. 2006.
5. CIDOC. CRM Core. http://is.gd/fM6GS/, 2005.
6. T. G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Journal of Machine Learning, 40(2), 2000.
7. N. Freire, J. L. Borbinha, and P. Calado. Identification of FRBR Works Within Bibliographic Databases: An Experiment with UNIMARC and Duplicate Detection Techniques. In Proceedings of the ICADL 2007, LNCS. Springer, 2007.
8. T. B. Hickey, E. T. O’Neill, and J. Toves. Experiments with the IFLA Functional Requirements for Bibliographic Records (FRBR). D-Lib Magazine, 8(9), 2002.
9. I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL: the making of a Web Ontology Language. Web Semantics: Science, Services and Agents on the WWW, 1(1), 12 2003.
10. Library of Congress Network Development and MARC Standards Office. Functional analysis of the MARC 21 bibliographic and holdings formats. http://is.gd/fM5BB/.
11. D. A. Rob Styles and N. Shabir. Semantic MARC, MARC21 and the Semantic Web. In Linked Data on the Web (LDOW2008), Beijing, China, 2008.
12. M. Samwald and K.-H. Cheung. Experiences with the conversion of SenseLab databases to RDF/OWL. http://is.gd/fM5NH/, 2008.