MONDRIAN: Annotating and querying databases

MONDRIAN: Annotating and querying databases through colors and blocks

Henrico Dolfing
[email protected]
University of Konstanz, Winterterm 2005/2006
Seminar Digital Information Curation

Abstract. This paper presents and discusses an annotation-oriented data model for the manipulation and querying of both data and annotations, as proposed by Geerts et al. in the paper "MONDRIAN: Annotating and querying databases through colors and blocks". The model allows annotations to be specified on sets of values and the information on their association to be queried effectively. Annotations are very important in the curation of scientific databases. Despite their importance, existing data formats and schemas are mostly not designed to support and manage the increasing variety of annotations, and database management systems often lack support for storing and querying them. Besides the data model, Mondrian, a prototype implementation of the annotation-oriented data model, is presented.

Keywords. Annotations, Annotation-oriented Data Model, Databases

1 Introduction

Annotations are very important in the curation of scientific databases. Despite their importance, existing data formats and schemas are mostly not designed to support and manage the increasing variety of annotations. Also database management systems often lack support for storing and querying annotations. Most existing annotation systems focus on text and HTML documents and are often specialized to support annotations for a particular kind of data, for example genomic sequences [4]. Bhagwat et al. [2] propose an annotation system for relational databases where annotations are stored in extra annotation attributes. Within their system, only single data values can be annotated. The DBNotes system [3] extends this framework and offers limited support for querying annotations over single values. Annotations provide a solid way of keeping track of provenance, and computing provenance by forwarding annotations along data transformations has been proposed in various forms. The mechanism presented in this paper provides a framework in which both provenance information and other forms of annotations can be managed. Throughout the paper an example relation between three biological databases plays a central role. These three databases are GDB (a human gene database [5]), Swissprot (a protein database [6]) and PIR (a protein sequence database [7]). A short excerpt of them is shown in Table 1.

Table 1. (a) GDB relation

gid     gname  chr
120231  NF1    17
120232  NF2    22
120233  NGFB   1
120234  NGFR   17
120235  NHS    21

Table 1. (b) Swissprot relation

sid     sname          origin
P01138  NGF            Human
P08138  TNR16          Human
P14543  Nidogen        Human
P21359  Neurofibromin  Human
P35240  Merlin         Human

Table 1. (c) PIR relation

pid     pname
A01399  Nerve growth factor
A25218  Tumor necrosis factor
A45770  Merlin
I78852  Neurofibromatosis
Q6T45   Nancy-Horan syndrome

The rest of the paper is organized as follows. Section 2 presents the annotation mechanism for relational databases, which is capable of annotating both single values and the associations between multiple values. Section 3 introduces the proposed algebra to query values and annotations alike. Section 4 presents Mondrian, a prototype implementation of the proposed annotation mechanism on top of a relational database management system. Section 5 presents some experimental results on the performance of the system. Finally, Section 6 gives an overview of the advantages and disadvantages of the annotation-oriented data model and concludes the paper.

2 Annotation Mechanism

As noted before, the aim of the annotation mechanism is to support annotating groups of attribute values. Such a group of attribute values is referred to as a block. For example, Figure 1 contains six different blocks, each with an associated annotation in the form of a name. In the remainder of the paper it is assumed that each annotation is represented by a color. Annotations and annotated blocks are then called colors and color blocks. Analogously, annotated databases are called color databases, and color queries are queries on annotated databases that are written in an algebra that accounts for annotations.

Fig. 1. Colors and Blocks

The data model is defined as follows. Let D be a standard relational database consisting of the relations R1, ..., Rk. For each relation Ri, its set of attributes is denoted by sort(Ri), and ri denotes an instance of Ri. C denotes a set of colors. Let r be an instance of relation R and let t be a tuple in r. The annotation (coloring) of a tuple t is performed through a coloring function χ. The function χ accepts as input a tuple t and a non-empty set of attributes Y ⊆ sort(R) and assigns a set of colors to the values in t[Y]. For a tuple t, the triplet (t, Y, χ(t, Y)) defines a color block, which consists of the attribute values in t[Y] along with their assigned colors. If χ(t, Y) = ∅, then the values in t[Y] are not within a color block. Hereafter, ⟨r, χ⟩ denotes a relation r whose tuples are colored through the function χ.

Example 1. Consider the relation in Figure 1. The coloring of each tuple in the relation is expressed through the following coloring function χ:

χ(t1, {pid, gid}) = {John}
χ(t1, {gid, sid}) = {Mary}
χ(t2, {pid, gid}) = {John, Mary}
χ(t2, {gid, sid}) = {John}
χ(t3, {gid, sid}) = {Mary}
χ(t1, {pid, gid, sid}) = {Peter}

Note that χ(ti, Y) = ∅ for every tuple ti and every other set of attributes Y.

Colors and blocks have the following properties, which are needed to answer queries over a color database and to capture its semantics.

Block overlapping: An attribute value t[A], A ∈ sort(R), is allowed to participate in more than one color block.

Inheritance: A coloring function χ is inheriting if for every color block (t, Y, χ(t, Y)) and every set of attributes Y' ⊆ Y, the following block is implied:

\[ (t, Y', \chi(t, Y')) \text{ is a block with } \chi(t, Y') \supseteq \chi(t, Y) \tag{1} \]

Transitivity: A coloring function χ is transitive if for any two blocks (t, X, χ(t, X)) and (t, Y, χ(t, Y)) with χ(t, X) ∩ χ(t, Y) ≠ ∅, the color block (t, X ∪ Y, χ(t, X) ∩ χ(t, Y)) exists.
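The coloring function of Example 1 and the two closure properties can be sketched in Python. This is only an illustration under assumptions that are not part of the paper: the coloring of a single tuple is encoded as a dict from a frozenset of attribute names to a set of colors, and the helper names `inherited_colors` and `transitive_blocks` are hypothetical.

```python
from itertools import combinations

# Hypothetical encoding (not from the paper): the coloring of one tuple is a
# dict that maps a block, given as a frozenset of attribute names, to the set
# of colors assigned to it. Missing blocks stand for chi(t, Y) = empty set.
chi_t1 = {
    frozenset({"pid", "gid"}): {"John"},
    frozenset({"gid", "sid"}): {"Mary"},
    frozenset({"pid", "gid", "sid"}): {"Peter"},
}
chi_t2 = {
    frozenset({"pid", "gid"}): {"John", "Mary"},
    frozenset({"gid", "sid"}): {"John"},
}


def inherited_colors(chi, attrs):
    """Colors of the block `attrs` once inheritance is applied.

    Every stored block Y that contains `attrs` passes its colors down.
    """
    target = frozenset(attrs)
    result = set()
    for stored, cols in chi.items():
        if target <= stored:
            result |= cols
    return result


def transitive_blocks(chi):
    """Blocks implied by one application of the transitivity rule.

    Two blocks X and Y that share colors imply the block X | Y, which
    carries exactly the shared colors.
    """
    implied = {}
    for (x, cx), (y, cy) in combinations(chi.items(), 2):
        shared = cx & cy
        if shared:
            implied.setdefault(x | y, set()).update(shared)
    return implied


gid_colors_t1 = inherited_colors(chi_t1, {"gid"})
implied_t2 = transitive_blocks(chi_t2)
```

For t1, the gid value inherits all three colors because gid occurs in every block of t1. For t2, the two blocks share the color John, so a block over {pid, gid, sid} with color John is implied.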

3 Algebra

This section presents the so-called color algebra, which is proposed by Geerts et al. to query annotations and values alike. The main operators are introduced and illustrated by means of examples and figures.

Input relation: The operator R accepts as input an instance ⟨r, χ⟩ and returns ⟨r, χ⟩ if r is an instance of R, and the empty relation otherwise.


Unary singleton constant: Creation of (un-)annotated single values is allowed. The operator (A, a) takes as input any instance ⟨r, χ⟩ and returns the instance of sort {A} containing a unique element a which is in a block of color c, that is, χ(a, A) = {c}.

Projection: A projection π_{A1,...,Ak} takes as input any instance ⟨r, χ⟩ of a sort containing {A1, ..., Ak} and returns the instance ⟨r', χ'⟩ of sort {A1, ..., Ak} such that

\[ r' = \{\, t[A_1, \dots, A_k] \mid t \in r \,\} \tag{2} \]

and for any t ∈ r and any Y ⊆ {A1, ..., Ak},

\[ \chi'(t[A_1, \dots, A_k], Y) = \bigcup_{Z} \chi(t, Y \cup Z), \tag{3} \]

where Z ranges over all subsets of sort(R) \ {A1, ..., Ak}. The projection operator treats the coloring function as an inheriting one, since it projects the blocks in each tuple of r onto the projected attributes.

Example 2. Consider the relation ⟨r, χ⟩ shown in Figure 1. π_{pid,gid}(r) returns the relation r' shown in Figure 2. Due to inheritance, tuples t1, t2 and t3 have blocks involving only the {gid} attribute, while tuple t4 has a block involving the {pid, gid} attributes.

Fig. 2. A simple projection
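The projection rule can be sketched in Python as follows. The encoding is an assumption made for illustration and does not come from the paper: a relation is a list of (row, chi) pairs, where row is a dict and chi maps a frozenset of attribute names to a set of colors. The attribute values p1, g1 and s1 are made up, and merging the colors of tuples that coincide after projection is also an assumption.

```python
def project(relation, attrs):
    """Color-aware projection onto `attrs`.

    A stored block B contributes its colors to the block B & attrs of the
    projected tuple, which mirrors the union over Z in the definition
    (Y = B & attrs, Z = B - attrs).
    """
    keep = frozenset(attrs)
    result = {}
    for row, chi in relation:
        key = tuple((a, row[a]) for a in sorted(keep))
        new_chi = result.setdefault(key, {})
        for blk, cols in chi.items():
            y = blk & keep  # the part of the block that survives
            if y:           # Y has to be non-empty
                new_chi.setdefault(y, set()).update(cols)
    return [(dict(key), chi) for key, chi in result.items()]


# Tuple t1 with the coloring of Example 1 (attribute values are hypothetical).
relation = [
    (
        {"pid": "p1", "gid": "g1", "sid": "s1"},
        {
            frozenset({"pid", "gid"}): {"John"},
            frozenset({"gid", "sid"}): {"Mary"},
            frozenset({"pid", "gid", "sid"}): {"Peter"},
        },
    )
]
projected = project(relation, ["pid", "gid"])
```

In this sketch the block {gid, sid} shrinks to {gid} and keeps the color Mary, while the block over all three attributes shrinks to {pid, gid} and adds Peter to the color John that was already there.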

Block projections: The algebra contains two types of block projection, lower and upper block projection, which project blocks based on whether they contain or are contained in a specified set of attributes. The lower block projection operator Π^L_{A1,...,Ak} takes as input an instance ⟨r, χ⟩ of a sort containing {A1, ..., Ak} and returns the instance ⟨r', χ'⟩ of the same sort defined by

\[ r' = \{\, t \mid t \in r \wedge \exists\, (t, Y, \chi(t, Y)) \text{ with } \{A_1, \dots, A_k\} \subseteq Y \,\} \tag{4} \]

and for any t ∈ r' and any set of attributes Y ⊆ sort(R'),

\[ \chi'(t, Y) = \begin{cases} \chi(t, Y) & \text{if } \{A_1, \dots, A_k\} \subseteq Y,\ \chi(t, Y) \neq \emptyset \\ \emptyset & \text{otherwise} \end{cases} \tag{5} \]

The upper block projection operator Π^U_{A1,...,Al} is defined similarly, except that r' = r and that Y ⊆ {A1, ..., Al} must hold in the definition of χ'(t, Y). Here Π^L_∅ = ID, whereas Π^U_∅ returns only the unannotated tuples.

Example 3. Consider the relation ⟨r, χ⟩ in Figure 2. Assume the task of finding all tuples with at least one annotation that involves the protein identifier (pid) attribute. The expression Π^L_{pid}(r) returns the desired result, as shown in Figure 3.

Fig. 3. A Lower Block Projection

Now assume the task of finding all tuples of relation r that might have an annotation on the gid attribute only. The expression Π^U_{gid}(r) returns the desired result, as shown in Figure 4. Note that an unannotated tuple always satisfies the condition of the Π^U operator and always violates the condition of the Π^L operator.

Fig. 4. An Upper Block Projection


By combining Π^L and Π^U, all tuples that have a block on a specific attribute set can be found. The expression Π^L_{gid}(Π^U_{gid}(r)) returns all tuples with a block on gid alone, as shown in Figure 5.

Fig. 5. A Combined Block Projection
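The two block projections and their combination can be sketched in Python. This is a minimal illustration under assumptions that are not part of the paper: a relation is a list of (row, chi) pairs, chi maps a frozenset of attribute names to a set of colors, only explicitly stored blocks are considered (inheritance is not applied), and the three example tuples are made up.

```python
def lower_block_projection(relation, attrs):
    """Keep tuples having a block that contains `attrs`, and only those blocks."""
    need = frozenset(attrs)
    out = []
    for row, chi in relation:
        kept = {blk: set(cols) for blk, cols in chi.items() if cols and need <= blk}
        if kept:  # tuples without a qualifying block are dropped
            out.append((row, kept))
    return out


def upper_block_projection(relation, attrs):
    """Keep every tuple, but only the blocks that lie inside `attrs`."""
    bound = frozenset(attrs)
    return [
        (row, {blk: set(cols) for blk, cols in chi.items() if cols and blk <= bound})
        for row, chi in relation
    ]


# Hypothetical relation over {pid, gid}.
r = [
    ({"pid": "p1", "gid": "g1"}, {frozenset({"gid"}): {"Mary"}}),
    ({"pid": "p2", "gid": "g2"}, {frozenset({"pid", "gid"}): {"John"}}),
    ({"pid": "p3", "gid": "g3"}, {}),  # unannotated tuple
]

with_pid_block = lower_block_projection(r, ["pid"])
gid_only_blocks = upper_block_projection(r, ["gid"])
block_on_gid_alone = lower_block_projection(upper_block_projection(r, ["gid"]), ["gid"])
```

In this sketch the upper block projection keeps all three tuples and strips the {pid, gid} block, while the combined expression keeps only the first tuple, whose block covers gid and nothing else.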

Selection: On input ⟨r, χ⟩, the operator σ_{A=a} returns the instance ⟨r', χ'⟩ of the same sort defined by r' = {t | t ∈ r, t[A] = a}, where χ' is the restriction of χ to r'. On input ⟨r, χ⟩ of a sort containing {A, B}, the operator σ_{A=B} returns the instance ⟨r', χ'⟩ of the same sort defined by r' = {t | t ∈ r, t[A] = t[B]} with

\[ \chi'(t, Y) = \begin{cases} \chi(t, Y) & \text{if } A, B \notin Y \\ \chi(t, Y) \cap \beta(t, A) \cap \beta(t, B) & \text{otherwise} \end{cases} \tag{6} \]

where β(t, A) and β(t, B) are the sets of colors of all blocks in t containing attribute A and B, respectively. Similar to the projection operator, the algebra contains a block selection operator.

Example 4. Consider again the relation ⟨r, χ⟩ shown in Figure 1. The selection σ_{sid=P01138}(r) returns the relation r' shown in Figure 6.

Fig. 6. A simple selection

Block selection: The operator Σ_c, where c ∈ C, takes as input any instance ⟨r, χ⟩ and returns the instance ⟨r', χ'⟩ of the same sort defined by r' = {t | t ∈ r and there exists a block in t with color c}, and for any t ∈ r' and any set of attributes Y ⊆ sort(R), χ'(t, Y) = χ(t, Y) ∩ {c}.


Example 5. Consider again the relation ⟨r, χ⟩ in Figure 1 and assume the task of finding all tuples that have a block annotated by Mary. Then the expression Σ_{Mary}(r) returns the result shown in Figure 7.

Fig. 7. Block Selection
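Value selection σ_{A=a} and block selection Σ_c can be sketched in Python as follows. The encoding is assumed for illustration only: a relation is a list of (row, chi) pairs and chi maps a frozenset of attribute names to a set of colors. The colorings follow t1 and t2 of Example 1, but the pairing of attribute values with these tuples is hypothetical, since the contents of Figure 1 are not reproduced here.

```python
def select_eq(relation, attr, value):
    """sigma_{A=a}: keep tuples with row[attr] == value; their blocks are unchanged."""
    return [(row, chi) for row, chi in relation if row[attr] == value]


def block_select(relation, color):
    """Sigma_c: keep tuples that have a block of `color`, restricted to that color."""
    out = []
    for row, chi in relation:
        kept = {blk: {color} for blk, cols in chi.items() if color in cols}
        if kept:
            out.append((row, kept))
    return out


# Hypothetical relation; the colorings are those of t1 and t2 in Example 1.
r = [
    (
        {"pid": "A01399", "gid": "120233", "sid": "P01138"},
        {
            frozenset({"pid", "gid"}): {"John"},
            frozenset({"gid", "sid"}): {"Mary"},
            frozenset({"pid", "gid", "sid"}): {"Peter"},
        },
    ),
    (
        {"pid": "A25218", "gid": "120234", "sid": "P08138"},
        {
            frozenset({"pid", "gid"}): {"John", "Mary"},
            frozenset({"gid", "sid"}): {"John"},
        },
    ),
]

by_sid = select_eq(r, "sid", "P01138")
by_mary = block_select(r, "Mary")
```

Value selection leaves the blocks of the surviving tuples untouched, whereas block selection also filters the coloring: every retained block carries only the requested color.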

Union: The union operator takes as input any two instances ⟨r, χr⟩ and ⟨s, χs⟩ of the same sort and returns the instance ⟨r', χ'⟩ of the same sort defined by r' = s ∪ r and, for any t ∈ r' and any set of attributes Y ⊆ sort(R'),

\[ \chi'(t, Y) = \chi_r(t, Y) \cup \chi_s(t, Y) \tag{7} \]

Example 6. Assume the task of finding all tuples that are annotated by Mary or concern the protein with sid P01138. The expression Σ_{Mary}(r) ∪ σ_{sid=P01138}(r) returns the desired result. Σ_{Mary}(r) returns the relation shown in Figure 7, σ_{sid=P01138}(r) returns the relation shown in Figure 6, and the union of those two relations is shown in Figure 8.

Fig. 8. Union
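The union operator can be sketched in Python as follows. The encoding is an assumption for illustration: a relation is a list of (row, chi) pairs and chi maps a frozenset of attribute names to a set of colors. The two input relations are made up.

```python
def union(r, s):
    """Union of two colored relations of the same sort.

    Tuples occurring in both inputs appear once; the colors of each block
    are the union of the colors it has in r and in s.
    """
    merged = {}
    for row, chi in list(r) + list(s):
        key = tuple(sorted(row.items()))  # hashable identity of the tuple
        target = merged.setdefault(key, {})
        for blk, cols in chi.items():
            target.setdefault(blk, set()).update(cols)
    return [(dict(key), chi) for key, chi in merged.items()]


# Hypothetical inputs: the first tuple occurs in both relations.
r = [({"gid": "g1", "sid": "s1"}, {frozenset({"gid", "sid"}): {"Mary"}})]
s = [
    ({"gid": "g1", "sid": "s1"}, {frozenset({"gid", "sid"}): {"John"},
                                  frozenset({"gid"}): {"Peter"}}),
    ({"gid": "g2", "sid": "s2"}, {}),
]
result = union(r, s)
```

In this sketch the shared tuple ends up with the colors Mary and John on the block {gid, sid}, plus the block {gid} that only s contributes.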


Product: Given two instances ⟨r, χr⟩ and ⟨s, χs⟩ of disjoint sorts, the product operator × returns the instance ⟨r', χ'⟩ with sort(R') = sort(R) ∪ sort(S) defined by r' = r × s. For any tuple t ∈ r' and Y ⊆ sort(R'),

\[ \chi'(t, Y) = \begin{cases} \chi_r(\pi_{\mathit{sort}(R)}(t), Y) & \text{if } Y \subseteq \mathit{sort}(R) \\ \chi_s(\pi_{\mathit{sort}(S)}(t), Y) & \text{if } Y \subseteq \mathit{sort}(S) \\ \emptyset & \text{otherwise} \end{cases} \tag{8} \]
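The product rule can be sketched in Python as follows. The encoding is assumed for illustration: a relation is a list of (row, chi) pairs and chi maps a frozenset of attribute names to a set of colors. The rows reuse values from Table 1, but the colors attached to them are hypothetical.

```python
def product(r, s):
    """Product of two colored relations with disjoint sorts.

    Each combined tuple keeps the blocks of both inputs; no block spans
    attributes of both relations.
    """
    out = []
    for row_r, chi_r in r:
        for row_s, chi_s in s:
            if row_r.keys() & row_s.keys():
                raise ValueError("the sorts of r and s must be disjoint")
            row = {**row_r, **row_s}
            chi = {blk: set(cols) for blk, cols in chi_r.items()}
            chi.update({blk: set(cols) for blk, cols in chi_s.items()})
            out.append((row, chi))
    return out


# Rows taken from Table 1; the colors are made up for the example.
gdb = [({"gid": "120231", "gname": "NF1"}, {frozenset({"gid", "gname"}): {"John"}})]
swissprot = [({"sid": "P21359", "sname": "Neurofibromin"}, {frozenset({"sid"}): {"Mary"}})]
combined = product(gdb, swissprot)
```

Because the sorts are disjoint, the block keys of the two inputs cannot collide, so a plain dict update is enough here. Blocks across both relations only arise through a later merge.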

Merge: The merge operator µ_{Y,Z}, with Y and Z being sets of attributes such that Y ∩ Z = ∅, takes as input an instance ⟨r, χ⟩ whose sort contains Y ∪ Z and returns the instance ⟨r', χ'⟩ of the same sort defined by r' = r. For any t ∈ r' and any X ⊆ sort(R),

\[ \chi'(t, X) = \chi(t, X_1) \cap \chi(t, X_2), \tag{9} \]

where X = X1 ∪ X2, X1 ⊆ Y, X2 ⊆ Z and χ(t, X) = ∅.
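The merge rule can be sketched in Python as follows. The encoding is assumed for illustration: a relation is a list of (row, chi) pairs and chi maps a frozenset of attribute names to a set of colors. The definition above only covers sets X that have no block yet, so the sketch additionally assumes that all existing blocks are kept unchanged. The example tuple and its colors are hypothetical.

```python
def merge(relation, y_attrs, z_attrs):
    """mu_{Y,Z}: create blocks across Y and Z from blocks that share colors.

    Assumption of this sketch: blocks already present are retained as they are.
    """
    y, z = frozenset(y_attrs), frozenset(z_attrs)
    if y & z:
        raise ValueError("Y and Z must be disjoint")
    out = []
    for row, chi in relation:
        new_chi = {blk: set(cols) for blk, cols in chi.items()}
        for x1, c1 in chi.items():
            for x2, c2 in chi.items():
                if x1 <= y and x2 <= z:
                    x, shared = x1 | x2, c1 & c2
                    # Only fill X = X1 | X2 if it has no colors so far.
                    if shared and not chi.get(x):
                        new_chi[x] = set(shared)
        out.append((row, new_chi))
    return out


# Hypothetical tuple with separate blocks on gid and sid.
r = [
    (
        {"gid": "120231", "sid": "P21359"},
        {frozenset({"gid"}): {"Mary"}, frozenset({"sid"}): {"Mary", "John"}},
    )
]
merged = merge(r, ["gid"], ["sid"])
```

In this sketch the blocks on gid and sid share the color Mary, so the merge creates the block {gid, sid} with color Mary next to the two original blocks.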

Recoloring / Renaming: These two operators are straightforward. Recoloring changes the color of all blocks with a given color to a different color, and renaming gives an attribute a different name. Geerts et al. prove that the set of operators contained in the color algebra is minimal.

4 Mondrian

Mondrian is the prototype implementation of the presented annotation mechanism and color algebra. It is implemented on top of a relational database management system and consists of two modules. The first accepts text-based algebra queries, which the module translates to relational color algebra queries and then to an equivalent SQL query. The resulting SQL query is sent to the DBMS server and executed against the representation of the annotated database. Text-based queries are useful for operators such as selection and projection, because the user is probably already familiar with these operators from interacting with relational databases through SQL queries. Note the extra translation from color algebra queries to relational color algebra queries. This extra step is needed because the color algebra is defined for relations, not for relational databases. The second module accepts graphical queries, which are translated to the corresponding SQL queries in the same way as in the text-based module. Graphical queries are useful for operators like block projection and block selection. For example, the user can select a block of a particular color and request all tuples with the same color. Figure 9 gives an overview of the whole system.


Fig. 9. The Mondrian System
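To give an impression of what translating a color algebra operator to SQL could look like, the following Python sketch stores annotations in a side table and answers a block selection with a join. The table layout (`r`, `r_blocks`), the column names and the function `block_select_sql` are hypothetical and chosen only for this example; they do not describe the schema or the translation that Mondrian actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Data table; the annotation layer leaves it untouched.
    CREATE TABLE r (tid INTEGER PRIMARY KEY, pid TEXT, gid TEXT, sid TEXT);
    -- Hypothetical side table: one row per attribute of a colored block.
    CREATE TABLE r_blocks (tid INTEGER, block_id INTEGER, attr TEXT, color TEXT);
    """
)
conn.executemany(
    "INSERT INTO r VALUES (?, ?, ?, ?)",
    [(1, "p1", "g1", "s1"), (2, "p2", "g2", "s2")],
)
conn.executemany(
    "INSERT INTO r_blocks VALUES (?, ?, ?, ?)",
    [
        (1, 1, "gid", "Mary"), (1, 1, "sid", "Mary"),  # block {gid, sid} of tuple 1
        (2, 2, "pid", "John"), (2, 2, "gid", "John"),  # block {pid, gid} of tuple 2
    ],
)


def block_select_sql(connection, color):
    """Return the tuples that have at least one block of the given color."""
    cursor = connection.execute(
        "SELECT DISTINCT r.tid, r.pid, r.gid, r.sid "
        "FROM r JOIN r_blocks b ON b.tid = r.tid "
        "WHERE b.color = ? ORDER BY r.tid",
        (color,),
    )
    return cursor.fetchall()


mary_rows = block_select_sql(conn, "Mary")
```

The sketch only retrieves the matching data tuples; a full translation would also have to return the restricted coloring, which is omitted here.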

5 Experimental Results

Geerts et al. use real biological data from the Swissprot database [6] in their experiments. The relational representation of the Swissprot data was based on the schema of the UCSC Genome Browser database [4]. Mondrian was used to measure the performance of the annotation model. Figure 10 shows a comparison of response times between normal algebra queries and color algebra queries. The latter clearly take more time. Further results presented by Geerts et al. cover the addition of annotations and relations and are in general comparable with the results in Figure 10: the annotation model is slower than a normal relational database model.

Fig. 10. Color vs. normal algebra [1]


6 Discussion and Conclusion

In this paper a model for data annotations is presented. The model consists of an annotation mechanism and a color algebra. Querying annotations as well as values is supported, and both are considered equally important. Because of the importance of annotations for scientific databases, the model could be very promising. Since the idea is implemented on top of a relational database by only adding new tables, without changing any of the original tables, it can be deployed on top of any existing relational database without changing existing interfaces to the database. The performance tests show that color algebra queries need more time than normal algebra queries, but the additional possibilities for annotating and for querying those annotations are well worth the extra response time.

References

1. Geerts, F., Kementsietsidis, A., Milano, D., "MONDRIAN: Annotating and Querying Databases through Colors and Blocks", Technical report, University of Edinburgh, 2005.
2. Bhagwat, D., Chiticariu, L., Tan, W.C., Vijayvargiya, G., "An Annotation Management System for Relational Databases", in Proc. Int. Conf. on Very Large Databases, 900-911, 2004.
3. Chiticariu, L., Tan, W.C., Vijayvargiya, G., "DBNotes: A Post-it System for Relational Databases Based on Provenance", in Proc. Int. Conf. on the Management of Data, 942-944, 2005.
4. Karolchik, D., Baertsch, R., Diekhans, M., Furey, T., Hinrichs, A., Lu, Y., Roskin, K., Schwartz, M., Sugnet, C., Thomas, D., Weber, R., Haussler, D., Kent, W., "The UCSC Genome Browser Database", Nucl. Acids Res., 31:51-54, 2003.
5. GDB Human Genome Database. URL: http://www.gdb.org
6. The SWISS-PROT Protein Knowledgebase. URL: http://www.ebi.ac.uk/swissprot/
7. PIR Protein Information Resource Database. URL: http://pir.georgetown.edu/