Representing Versions in XML Documents Using Versionstamp

Representing Versions in XML Documents Using Versionstamp Luis Jesús Arévalo Rosado, Antonio Polo Márquez, and Juan Mª Fernández González Departament of Computer Science. University of Extremadura. Spain1 {ljarevalo, polo, jmfergon}@unex.es

Abstract. The problem of managing versions in XML documents can be approached through traditional adapted procedures, based on managing XML operations (deltas) or using timestamped markups to represent the validity of each versioned tag within the document. The first solution entails a high reconstruction cost for any version different from the current one. Whereas the second solution, due to the linear nature of time, implies that these techniques do not to support branched versioning. In this work, the XML data model is extended for the representation of different versions of XML documents that consists of marking the tags with a versionstamp instead of using a timestamp. This technique is based on two ideas: on the one hand storing the ancestral relations of the versions (version tree) produced a new version is generated and on the other hand the version validity of each versioned tag is defined based on this tree (versionstamp). The easy management of multiple versioning, the wide number of queries in XML standard query languages and its implementation only using XML technology, are some of the advantages of the proposed technique.

1 Introduction The standard XML [1] has been consolidated as a tool for information representation and interchange, due to both its flexibility and the management and conversion facilities that the documents offer in this standard. Since its specification a great variety of languages with different functionalities have arisen in research and commercial areas (SOAP, WSDL, XSLT, XML-Schema, etc) [1]. In essence, any XML document can be considered as semi-structured data to store information, so that as it occurs in any other information system, they are susceptible to change. The versions of an XML document can be managed through traditional adapted procedures, based on XML operations change (delta XML) [2, 3] or integrating the different versions of the XML document into a single file using temporal aspects [6, 7]. The first solution prevents the validation and query of the different versions by means of the XML language standard and the second one, due to the linear nature of time causes these techniques not to support branched versioning and for this reason this technique cannot be used in collaborative scenarios [4] where new versions can be generated from any past version. 1

This work has been financed by the Spanish CICYT project "TIN2005-09098-C05-05".

J.F. Roddick et al. (Eds.): ER Workshops 2006, LNCS 4231, pp. 257 – 267, 2006. © Springer-Verlag Berlin Heidelberg 2006

258

L.J.A. Rosado, A.P. Márquez, and J.Mª F. González

In this work, we extend the XML data model for the representation and management of different versions of XML documents that consists of marking the tags with a versionstamp instead of using a timestamp. In order to do this effectively it is necessary to store the versions included in the document and their relations (ancestral relation), thereby the version validity for each mark in the document is defined based on it. The following advantages should be highlighted: the easy management of non lineal versioning, the possibility to query the different versions using XML query languages standards and it runs in whatever XQuery/XSLT engine with id() support[1]. The remainder of this paper is organized as follows: we begin by summarizing the current solutions for the management of XML versions. Then, we show a data model for XML which is extended to a version model and later we show how to map it to XML documents. Next, we describe different queries that can be made on versioned documents. Finally we offer our conclusions and a look at our future works.

2 State of the Art The problem of XML document version management combines the issues of document version management [2, 3, 4, 5] and temporal databases [10]. Delta XML management: Based on traditional change operation procedures adapted to XML [2, 3]. It consists of obtaining and storing the XML differences between two versions (delta XML). The high reconstruction cost for any version different from the current one and the impossibility to validate and query the document using standard XML languages are some of their disadvantages. Multiversion XML: [4, 5] define an indexing technique for branched versioning which they called BT-Tree and BT-ElementList respectively, however they cannot be used in XML Standard Query languages (XQuery or XPath[1]). Temporal XML Representation: Some works has recently focused on how to represent and manage historical information in XML. In [11] a technique for managing temporal web documents is shown using an XML/XSLT infrastructure. A data model is proposed for temporal XML documents [9] where leaf data nodes can have alternative values; however how to support different structures for non-leaf nodes is not discussed. Extensions of XPath data model are exposed in [12,13] to represent and query transactional and valid time respectively by the addition of several temporal axis. In [6] a technique for the management of changes in XML documents using a notion similar to time-interval is described which is called context specifiers; however how to support queries is not discussed. A temporally-grouped data model is shown in [7] that provides a way to represent the evolution of the content of a database using XML timestamps; nevertheless non-lineal versioning is not supported. However these timestamped techniques cannot be used in collaborative scenario [4] due to the fact that in this scenario users can update any version of the document generating a new version either from the current one or discard it and reuse an old version. Our technique represents the validity of each tag by means of a versionstamp, which provides us: 1) an easy support for branched versioning 2) a wide set of queries based on the relation between the included versions and 3) It runs in whatever XQuery/XSLT engine with id() support.

Representing Versions in XML Documents Using Versionstamp

259

3 Representing Versions in XML Documents with Versionstamp In this section, we begin by summarizing an XML data model, which is extended to a version model, and finally a versioned graph is converted to a V-XML document. 3.1 XML Data Model An XML document [1] is a collection of elements delimited by a start-tag and an end-

tag where between the tags of each element might be either its content or other elements, that is, these elements are nested and cannot overlap. An XML document is defined in [16] as: Definition 1. An XML document D is a rooted directed graph D=(V,E,root) where V is a finite set of vertices in the graph, E is a finite set of directed edges between vertices in V and root ∈ V is the root of the directed graph. Each vertex v∈V has two properties: type(v)∈TV and value(v)∈ D (type(v)), where TV is a finite set of types TV ={element, ID, IDREF, IDREFS, string, int, ...} and every type t in TV has a domain of values, denoted as D (t). Vertices are of type element or data value. Edges, which have the form e=(v1,v2) being parent(e)=v1 and child(e)=v2 , have two properties type and name with type(e)∈{E,A,R} that classifies E in three disjoints sets (elements, attributes and references), and name(e) denotes the set of valid names for every type of edge. There is an order relation that represents the total order between edges of a particular class E, A or R and connect parent element to its children. … Seafood Spanish &2 15 id ….. name E A R element

ID

“string”, int, … IDREF or IDREFS

c1

“Arzak”

root doc &1 cook

rest id

r1

&3 name

cooks &4

c1

~data

price menu &5 ~data

&6 ~data

~ref “Seafood”

“Spanish”

15

Fig. 1. An XML document and its graph

Figure 1 shows an XML document with information about cooks and restaurants of a hotel and its graphical representation in this data model. XML documents are not static: inserts, deletes or updates can modify them [14]. Beginning from the initial state of the document (version 1), the new versions are then established by applying a number of changes to the current version that we will call version change transaction and it must guarantee the XML specification for the new versions of the document. In table 1 we briefly describe the updates operations for this data model where move operation is represented by a delete and an insert operation.

260

L.J.A. Rosado, A.P. Márquez, and J.Mª F. González Table 1. Primitives of changes Update Operations (IV)

InsertVertex(t,v,ne,p,o)

(DV)

DeleteVertex(v)

Description

(UDV) UpdateDataVertex(p,ne,nv) (DE)

DeleteEdge(p,c)

It adds a new vertex with the specific type t and value v and also adds a new edge with the name ne from the parent vertex p to the new vertex in the position o. It removes the vertex identified by v. All descendant vertices and edges are deleted. It changes the value of a data vertex with parent p and name of the inedge ne to a new value nv. It removes the edge from p to c. All descendant vertices and edges are deleted.

v5 Sea Spanish 18

UDV(&4, ~data,”Sea”) Seafood Spanish 15 UDV(&6,~data,18) UDV(&5,~data,”Italian”) v1 v2 Seafood Spanish 18 DE(&4,”Seafood”)

UDV(&2,”name”,”Adria”)

DV(&3)

v7

v9

IV(element,&7, ”street”,&3,4) IV(string,”Paris 10”, ~data,&7,1) v3

v4

Seafood Italian 18 v10

UDV(&5,~data,”French”) Spanish UDV(&3,”cook”,c2) 18 v6 v8

Fig. 2. Snapshots of an XML document

When a new version is produced, the new vertices and edges are integrated with the previous versions, being necessary thereby to establish for each versioned item its validity. In the Figure 2 several change operations have been made to the XML document shown in the Figure 1, generating different versions identified by Vi, as shown inside the circle, representing different snapshots of the document and establishing thereby what is known as a version tree in traditional document versioning systems. Edges represent the changes made from a specific version to a specific descendant from table 1 and the order represents the ancestral relation between the defined versions. For example in Figure 2, the data of the vertex &5 has changed in versions V3 and V10 and it has been deleted in V9.


261

3.2 V-XML Document Model In this paper our objective is how to represent different versions of an XML document in a single document, which we called V-XML Document, using versionstamp. Definition 2. (V-XML Document) A V-XML Document Dv=(V,E,v_document) is an XML document where v_document is the root of the Dv document that has only two outedges: ‘version_tree and ‘versioned_doc’. • A version tree Vt=(V’,E’,&V0) is an XML document where &V0 is the root of the version tree and it has as many element vertices as versions stored. A version tree is characterized: 1) We denote each vertex as &Vi 2) Each vertex has an outedge e∈E’A such that name(e)=‘id’ and value(child(e))=Vi∈D (ID) (a version identifier) 3) Each vertex has a set of outedges labeled (‘version’) that represent the ancestral relation between them. 4) There is a special vertex &Vnow that is child of &V0 and has no children. Notice that we could consider adding more information to each edge of the version tree in order to enrich it. • A versioned doc D=(V, Ev, &0) is an XML document where &0 is the root and each edge of the graph has a new property VV, which we called as V-Edge and represents its version validity, that defines in which versions of the version tree the edge is valid. VV will be defined as a version region (definition 3). To represent this validity we can choose either a positive representation defined by means of those versions of the version tree where the edge is valid or by a negative representation where the edge is not valid. In Figure 3 the positive and negative representation for the edge between vertices &5 and ‘Spanish’ is shown VV={V1,V2,V5,V6,V7,V8} where, in both solutions, the solid circles indicate which versions are in VV. Both solutions are acceptable. The first case (positive) allows us to represent the validity by means of a tree (a connected graph) instead of having to handle a forest (several connected graphs) as it happens in the negative representation. To define the version validity we could choose between the following solutions: 1) A list of version identifiers that represent which versions the edge is valid for. Thus, version validity for the edge between vertices ‘&5’ and ‘Spanish’ value is VV={V1,V2,V5,V6,V7,V8} 2) A list of lineal version intervals that consists of a set of pairs [v1-v2] and represents a path in the version tree where its origin is the node with the v1 identifier and its end is the node with the v2 identifier. The version validity for the edge between vertices ‘&5’ and ‘Spanish’ could be VV= {[V1-V7],[V6-V8]}. v5 v1

v2

v3

v7

v5

v9 v1

v4

v2

v3

v10 v6

v9 v4

v10 v6

v8 Positive Representation

v7

v8 Negative Representation

Fig. 3. Positive and negative representation for the edge between vertices &5 and ‘Spanish’

Both representations present the same problem derived from the necessity to update the version validity of all affected edges in the graph when a new version is produced (i.e: If a new version is generated, with version identifier V11, from the version

262


identified by V8, this forces us to change the validity version [V1-V8] to [V1-V11]). Obviously this situation entails a high reconstruction cost for all affected edges, because, although we use a tree representation to describe the included versions, the version validity is defined by a lineal representation (with only one start and one end). For this reason it is necessary to use a non-lineal representation to define the versions associated to an edge. To do this effectively it has used a mixed representation, positive and negative, that allows us to define several non-lineal areas in the version tree which we call version region as in [5]. Before defining it, it is necessary to set the following notations. Given a version tree Vt=(V’,E’,&V0) and given a ‘v’∈V’element: • Given a x∈D(ID), if ∃ v∈V’element / ∃ e=(v,y)∈E'A;(name(e)=‘id’)∧(value(y)=x), then we denote this vertex v as γ(x). If this vertex exists for x, it must be unique. • (v)=descendant(v,Vt) ⊂ V’element as all descendant element vertices of ‘v’. • ≺ (v)=ancestor(v,Vt) ⊂ V’ element as all ancestor element vertices of ‘v’. • V (v)=(descendant(v,Vt) ∪ {v}) ⊂ V’element as all descendant element vertices of v including itself. • T (v)=(ancestor(v,Vt) ∪ {v}) ⊂ V’element as all ancestor element vertices of v including itself. • , ≺ , V , T can also work on a set of version identifiers, i,e {vi∈I}=∪i∈I (vi). Definition 3. (Version region) Given a version tree Vt=(V’,E’,&V0), a version region is a [start-End] pair where the start value is a version identifier that represents the origin node of the valid area in the version tree (positive) and End is a set of version identifiers that indicate when each area has stopped being valid (negative) representing the different non-lineal areas. They must carry out the properties: 1. ∃ γ (start) ∈V’element and ∀vi∈End, ∃ γ (vi)∈V’element. 2. ∀vi∈End, γ (vi ) ∈ γ (start) and ∀vi∈End, γ (vj ) ∉ V γ (vi ), ∀i≠j. Thus a version region stands for a subset of non-lineal areas in the version tree and each edge in the document is formed by a version region indicating its validity. If the set of values of End only contains the special value ‘now’, this indicates that this region is valid for all descendants of versions in the version tree whose area begins in the start identifier. When a new version is produced the version region of the affected edges is modified as it is shown in table 2 for every update operations shown in table 1. As in [13] the version validity of any edge e is a superset of the union of the version validity of all the edges in outedges(child(e)). Notice that a version region defines a connected subset of a version tree since only an origin version identifier is allowed. If a versioned edge occurs intermittently in the version tree it is represented as a different edge with different version region. The next step is to represent the version region as a connected subset of a graph (formed by a list of non-lineal version region intervals) similar to temporal element idea [8].


263

Table 2. Meaning of changes in the version validity property Update Operations InsertVertex DeleteVertex UpdateDataVertex DeleteEdge

Description

Edge.start

Edge.end

It creates a vertex and an edge All affected edges and vertices Current edge All affected edges and vertices

Id. New version No affected No affected No affected

Now Id. New version Id. New version Id. New version

The previous ‘menu’ example of Figure 2 has three V-Edges: The first one VR1= [V1-{V3,V10}] with the ‘Spanish’ value, the second one VR2=[V10-{now}] with ‘French’ and the third VR3=[V3-{V9}] with ‘Italian’. The first region is formed by those versions defined by all descendants of the version identified by V1 except V3 and V9 and all their descendants. The resulted versioned graph is shown in figure 4. {”Rest”,[V1-{V9}]} {id,[V1-{V9}]}

&3 {”Street”,[V4-{now}]}}

r1 {”name”,[V1-{V9}]}

{”cooks”,[V1-{V8,V9}]}

{”Price”,[V1-{V9}]}

c1

{”menu”,[V1-{V9}]} {”cooks”,[V8-{now}]} &4

c2

&5

&7

&6 {~data,[V3-{V9}]}

{~data, [V1-{V3,V10}]}

{~data, [V1-{V5,V6,V9}]} {~data,[V5-{now}]} “Seafood” “Sea”

“Spanish”

{~data, [V4-{now}]}

“Russian”

{~data,[V10-{now}]} “French”

{~data,[V2-{V9}]}

{~data,[V1-{V2}]} 15

18

“Paris 10”

Fig. 4. A versioned graph document with some versions

3.3 Representing a V-XML Graph in a V-XML Document To represent a graph document in XML the root of the graph maps to the root element of the document and for each element vertex, there is an XML element, labeled with the name of the containment incoming edge. If the element vertex has an edge to a value vertex, its value is included in the element. For each attribute or reference edge there will be an attribute where its name will be the edge name and its value will be the data value vertex or the vertex being referenced respectively. We have represented a version tree by means of an XML document where each &Vi is represented as a tag and the nesting of these version tags stands for the relationship that may exist between them. The XML representation of the version tree showed in Figure 2 is shown in Figure 5.a To represent the versioned edges in XML we have used two techniques: versionstamped attributes as in [7,12] and versionstamped child nodes as in [13]. A version region is turn into two attributes, v:start and v:end. The first one is defined as an IDREF datatype attribute which refers to a version identifier and the second one defined as IDREFS

264


datatype allows us to represent a set of version identifiers from the version tree. Depending on the type of the versioned edge the transformation will be: 1. Containment edges incoming to an element vertex, an XML element with the name of the edge is created with its corresponding v:start and v:end attributes. 2. Containment edges incoming to a data value vertex, a new child XML element ‘v:data’ is created with its corresponding v:start and v:end attributes, where its value is included in the element. 3. For reference edges, a new XML element ‘v:isref’ is created with its v:start and v:end attributes where name and value of the reference are as attribute. 4. For attribute edges, a new child XML element ‘v:attrib’ is created with its corresponding v:start and v:end attribute, where the name and value of the attribute are represented as an attribute value. ………….. Seafood Sea Spanish Italian French …….

Fig. 5.a. A version tree. 5.b. A V-XML Document with some versions.

4 Version Retrieval In this section we begin by showing how to retrieve the valid edges that belongs to a version in a versioned graph and later we make some queries in XQuery and XPath [1]. 4.1 Version Retrieval in a V-XML Document Model In this section, we describe how we can verify if a VV belongs to a given version and how to retrieve the valid tags for a given version. To retrieve the valid labels for a given version it will be necessary to analyze which versions are included in a version region and check if the requested version belongs to them. We have considered that this occurs only if 1) the given version is among the descendants in the ‘start’ version identifier in the version tree or even is itself and 2) the given version is not among the descendants or is itself in all version identifiers for ‘end’ attribute. The following proposition can be used to determine if a version region contains a given version.


265

Proposition 1. A version region VR=[start-End] contains an identifier version x∈D(ID) if it satisfies that: (γ(x)∈ V (start)) ∧ (γ(x)∉ V (End ) ) i.e: we want to find out which edge between the vertices &5 and its data value vertices is valid in V4. From the version tree shown in Figure 2 the second version region VR2=[V3,V9] fulfils the previous proposition since γ(V4) ∈ V (V3) and γ(V4) ∉ V (V9).

• Spanish. VR3=[V1 -{V3,V10}], V (V1)={V1, V2 , V3, V4, V5, V6, V7, V8, V9, V10}, V { V3, V10} = ( V ( V3) ∪ V ( V10) ) = { V3, V4, V9, V10} • French. VR1 = [V10-now], V ( V10)={ V10}, V (now)= φ • Italian. VR2 = [V3- V9], V ( V3)={V3,V4,V9}, V (V9)={V9}. 4.2 Version Retrieval in XQuery and XPath in a V-XML Document

Firstly, the ‘menu’ history will be shown and then we will make a snapshot query. Query1: Version Projection query: Retrieve menu's history. 1. for $s in doc("rest.xml")//rest/menu 2. return $s

Query 2: Snapshot version query: retrieve the menu tag on V8 version To do this effectively, we have to obtain which versions are in a version region and check if they belong to the requested version. We use the id() function provided by XPath [1] to obtain the versions by means of dereference the version/s in the version tree which v:start and v:End refer to (they are defined as IDRef and IDRefs datatypes respectively) and thereby we can easily obtain their descendants and check the constraints said before. 1. { 2. for $s in doc("rest_xml_id.xml")//rest/menu//v:data [(not(f:descen_self_end(.,"V8")))and(f:descen_self_start(.,"V8"))] 3. return $s 4. }

We have defined two version operators as user-defined functions (figure 6) (line 1,4) that return the descendants or itself of the v:start and v:End attribute respectively. Using them, we can check in this query (line 2) if the given version belongs only to the v:start attribute and not to the v:End attribute. From the V-XML Document shown in Figure 5, it retrieves the first ‘v:data’ label (Spanish) since the V8 version descends from V3 and V8 does not descend from V9. 1. 2. 3. 4. 5. 6. 7. 8

declare function f:descen_self_end($p,$v) as xs:boolean{ let $desc:=$p/id($p/@v:end)/descendant-or-self::version/@xml:id return ($desc=$v) }; declare function f:descen_self_start($p,$v) as xs:boolean{ let $desc:=$p/id($p/@v:start)/descendant-or-self::version/@xml:id return ($desc=$v) };

Fig. 6. ‘descen_self_start’ and ‘descen_self_end’ user-defined functions

266


The same query (Query 2) in XPath is: //menu/v:data[not(id(./@v:end)/descendant-or-self::version/@xml:id ='V8') and(id(./@v:start)/descendant-or-self::version/@xml:id='V8')]

The special value ‘now’ in the attribute End indicates ‘no changes until now’, it means; the version region is formed by all version descendants from the start attribute. In order not to change the previous queries, a new node ‘Vnow’ has been added in the version tree and is characterized by having neither descendants nor ancestors. All previous queries can be improved if instead of recovering in each query the ancestors and descendants of a referenced version, they will be stored as attribute in the version tree, reducing the time of the query since it does not work out. These queries are executable in any Xquery/XPath/XSLT engine that supports the id() function (tested in Exist, Saxon, XsltProc).

5 Conclusions and Future Work The current solutions for the management of versions for XML documents use timestamp to represent the validity of each versioned tag in the document. Due to the linear nature of time, this solution cannot support branched versioning. In this work a technique for the management of versions of XML documents has been put forward using versionstamp. When a new version is generated the ancestral relations of the versions produced are stored by using a version tree and the version validity of each versioned tag is defined based on it. The following advantages should be highlighted: the easy management of branched versioning, its implementation is carried out with XML technology and the possibility to query the versions using XML standard query languages. The following steps are:

• To represent the version region as a connected subset of a graph • To measure this technique in large documents and compare the results against traditional document versioning systems and XML solutions. • To analyze, study and develop new version queries. • To add the valid/ transaction time for each version included in the version tree. • To implement a versioning system based on XUpdate [14].

References 1. W3C. http://www.w3c.org. 2. Amelie Marian, Serge Abiteboul, Gregory Cobena, and Laurent Mignet. Change-centric management of versions in an xml warehouse. The VLDB Journal. 2001. 3. Chien S., Tsotras V., Zaniolo C. XML Document Versioning. ACM Sigmod. 2001. 4. Zografoula Vagena and Mirella M. Moro and Vassilis J. Tsotras. Supporting Branched Versions on XML Documents. in RIDE 2004. 5. Betty Salzberg, Linan Jiang, David B. Lomet, Manuel Barrena, Jing Shan and Evangelos Kanoulas. A Framework for Access Methods for Versioned Data. In EDBT 2004. 6. Manolis Gergatsoulis and Yannis Stavrakas. Representing changes in XML documents using dimensions. In XSYM 2003. 7. F. Wang and C. Zaniolo. XBiT: An XML-based Bitemporal Data Model", in ER 2004.


267

8. S. K. Gadia. A homogeneous relational model and query languages for temporal databases. ACM Transactions on Database Systems. 1988. 9. Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. A data model for temporal XML documents .In DEXA 2000. 10. Richard T. Snodgrass. The TSQL2 Temporal Query Language. Kluwer, 1995. 11. Fabio Grandi and Federica Mandreoli. The valid web: An XML/XSL infrastructure for temporal management of web documents. In ADVIS. 2000. 12. Curtis. E. Dyreson. Observing transaction-time semantics with TTXPath. In WISE. 2001. 13. Shuohao Zhang and Curtis. E. Dyreson. Adding valid time to XPath. In DNIS. 2002. 14. Igor Tatarinov, Zachary. G. Ives, Alon. Y. Halevy, and Daniel. S. Weld. Updating XML. In ACM Sigmod. 2001. 15. Alberto O. Mendelzon, Flavio Rizzolo, Alejandro A. Vaisman: Indexing Temporal XML Documents. In VLDB 2004: 16. D. Beech, A. Malhotra, and M. Rys. A formal data model and algebra for XML. W3C XML Query Working Group Note, September 1999.