
STORING AND QUERYING XML DATA USING RDBMS

Yesi Novaria Kunang
Computer Science Faculty, Bina Darma University, Palembang, Indonesia

Ahmad Ashari
Faculty of Mathematics and Natural Sciences, Gadjah Mada University, Yogyakarta, Indonesia

Abstract

XML (eXtensible Markup Language) is rapidly becoming a popular data format and an emerging standard for data exchange over the Internet. With a large amount of data represented as XML documents, it becomes necessary to store and query these documents. One approach is to use an RDBMS as the storage medium and SQL to query the XML data. There are two approaches to parsing an XML document into an RDBMS through middleware: SAX parsing and DOM parsing. This research studied both methods and compared their performance. It also studied the performance of several alternative ways of structuring and tagging data from one or more RDBMS tables as a hierarchical XML document. The final result identifies which of these alternatives gives the best performance for storing and querying XML data using an RDBMS.

1. Introduction

XML (eXtensible Markup Language) has become a popular format for representing and exchanging data over the Internet. The flexibility of its structure makes it suitable for data exchange and for modeling applications. However, when a large amount of data is represented as XML documents, storing and querying those documents becomes essential. One approach is to use a native XML database system. This approach has two weaknesses: first, a native XML database system is not adequate for storing data and cannot accommodate the complex queries that a relational database system supports; second, users cannot directly query XML documents together with other data stored in a relational database system. Techniques for storing and querying XML data in a relational database system were developed to overcome these weaknesses. The approach has three steps: first, design the relational tables that will hold the data of an XML document; second, split the XML data into the columns of those tables; and third, process SQL queries to produce the required XML document format from the RDBMS data.


2. Literature Review

Bourret [3] says that XML and its related technologies can be regarded as a simple database, because XML documents can be used in environments with small amounts of data, few users, and modest performance requirements. XML also provides many of the things found in databases: storage (XML documents), schemas (DTDs, XML schema languages), query languages, programming interfaces (SAX, DOM, JDOM), and so on. However, XML lacks many of the things found in real databases: efficient storage, indexes, security, transactions and data integrity, multi-user access, queries across multiple documents, and so on. For these reasons, XML is not suitable for environments that have many users, strict data integrity requirements, and a need for good performance.

Two mappings are commonly used to map an XML document schema to a database schema: the table-based mapping and the object-relational mapping [4]. These mappings model the data in an XML document rather than the document itself, so they are suitable for data-centric documents and not for document-centric ones [6].

Strategies for transferring data from XML to a database depend on the software used and on middleware support. There are two ways to parse an XML document in a middleware application (Java, Perl, PHP, Python, C/C++, Eiffel, Tcl, etc.): using a SAX (Simple API for XML) parser or a DOM (Document Object Model) parser [1].

Shanmugasundaram et al. [5] discuss several alternatives for publishing relational data as an XML document. The alternatives follow from the basic difference between a relational table and an XML document: an XML document is tagged and structured, while a relational table is neither. Therefore, to convert relational tables into an XML document, tagging and structuring must be added during processing. Tagging can be done as the final step of query processing (late tagging) or earlier in the process (early tagging). Similarly, structuring can be done as the final step of query processing (late structuring) or earlier (early structuring). The alternatives also differ in how much work is done inside the relational engine. Inside engine means that tagging and structuring are done completely inside the relational engine, whereas outside engine means that part, though not necessarily all, of that work is done outside the relational engine. The choice depends on the capabilities of the relational database engine in use. Early tagging with late structuring is not a viable alternative, because adding tags to an XML document before its structure exists makes no sense.

2.1. Individual

Tagging and structuring are both done early in query processing. One of the simplest techniques for structuring relational data as an XML document is this early tagging, early structuring method, which is called the Stored Procedure [5] or Individual table technique [2]. The strategy transfers the database hierarchy in document order. It opens a result set on the table for the root element. For each row in that table, it emits a row element and then opens result sets on the child tables in turn.
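The nested traversal described for the Individual table technique can be sketched as follows. This is a minimal illustration, not the implementation used in this research: Python and SQLite stand in for the PHP and MySQL setup, the column names first_name and last_name are assumptions, and the element names (inventory, book, title, author, first, last) are hypothetical because the markup of the sample document is not reproduced here.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr


def build_sample_db():
    """Create a small in-memory database with one parent and one child table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tabbook1 (idbook TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE tabauthor1 (
            idbook TEXT REFERENCES tabbook1 (idbook),
            first_name TEXT,  -- assumed column name
            last_name TEXT    -- assumed column name
        );
        INSERT INTO tabbook1 VALUES ('001', 'Professional PHP Programming');
        INSERT INTO tabauthor1 VALUES ('001', 'Jesus', 'Castagnetto');
        INSERT INTO tabauthor1 VALUES ('001', 'Sascha', 'Schumann');
        """
    )
    return conn


def individual_table_to_xml(conn):
    """Tag and structure while traversing: one child query per parent row."""
    parts = ["<inventory>"]  # hypothetical root element name
    queries = 1  # the query on the root table
    books = conn.execute("SELECT idbook, title FROM tabbook1 ORDER BY idbook")
    for idbook, title in books.fetchall():
        parts.append(f"<book idbook={quoteattr(idbook)}>")
        parts.append(f"<title>{escape(title)}</title>")
        # A separate result set is opened on the child table for every
        # parent row, so the query count grows with the number of rows.
        authors = conn.execute(
            "SELECT first_name, last_name FROM tabauthor1 "
            "WHERE idbook = ? ORDER BY rowid",
            (idbook,),
        )
        queries += 1
        for first_name, last_name in authors:
            parts.append(
                f"<author><first>{escape(first_name)}</first>"
                f"<last>{escape(last_name)}</last></author>"
            )
        parts.append("</book>")
    parts.append("</inventory>")
    return "".join(parts), queries
```

The returned query count makes the cost of this technique visible: a document built from n parent rows issues n + 1 queries for a single level of nesting.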


2.2. Universal I table

In this technique, tagging and structuring are done as the final step in building the XML document. Forming the XML document is divided into two phases: (a) content creation, in which the relational data are produced, and (b) tagging and structuring, in which the relational data are arranged and tagged to produce the XML document. The content is produced as a single result set (the universal table) that contains all the data in the document. It is built by joining all the tables, with join predicates relating each parent to its children. This is known as the Redundant Relation [5], or Universal I [2]. Tagging and structuring the result takes two steps: (a) grouping all sibling elements that belong together (and eliminating the duplicates caused by redundancy), and (b) extracting the information from each tuple and tagging it to produce the XML result.

2.3. Universal II table

The main problem with the late tagging, late structuring technique is memory management while the tags are being formed. To overcome this problem, the relational engine can structure the relational content itself. This strategy uses a Universal table type II [2], or Sorted Outer Union [5], which is built with the UNION statement.
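A sorted outer union for a two-level hierarchy might look like the sketch below. It is an illustration under stated assumptions rather than the query used in this research: SQLite replaces MySQL, the level column and the first_name and last_name columns are invented for the example, the second book row is placeholder data, and the element names are hypothetical. Each branch of the UNION ALL contributes the rows of one table, padded with NULLs, and the ORDER BY places every parent row directly before its children so that the tagger only has to remember whether a book element is currently open.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

# One branch per table; NULL padding gives every branch the same columns.
SORTED_OUTER_UNION = """
SELECT idbook, 0 AS level, title, NULL AS first_name, NULL AS last_name
  FROM tabbook1
UNION ALL
SELECT idbook, 1, NULL, first_name, last_name
  FROM tabauthor1
ORDER BY idbook, level, last_name
"""


def build_sample_db():
    """Create a small in-memory database; the second book is placeholder data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tabbook1 (idbook TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE tabauthor1 (idbook TEXT, first_name TEXT, last_name TEXT);
        INSERT INTO tabbook1 VALUES ('001', 'Professional PHP Programming');
        INSERT INTO tabbook1 VALUES ('002', 'Placeholder Title');
        INSERT INTO tabauthor1 VALUES ('001', 'Sascha', 'Schumann');
        INSERT INTO tabauthor1 VALUES ('001', 'Jesus', 'Castagnetto');
        """
    )
    return conn


def sorted_outer_union_to_xml(conn):
    """Tag a pre-structured result set in a single streaming pass."""
    parts = ["<inventory>"]
    book_open = False
    rows = conn.execute(SORTED_OUTER_UNION)
    for idbook, level, title, first_name, last_name in rows:
        if level == 0:
            # A parent row starts a new book; close the previous one first.
            if book_open:
                parts.append("</book>")
            parts.append(f"<book idbook={quoteattr(idbook)}>")
            parts.append(f"<title>{escape(title)}</title>")
            book_open = True
        else:
            # Child rows always follow their parent because of the sort.
            parts.append(
                f"<author><first>{escape(first_name)}</first>"
                f"<last>{escape(last_name)}</last></author>"
            )
    if book_open:
        parts.append("</book>")
    parts.append("</inventory>")
    return "".join(parts)
```

The design trade-off is that the engine does the sorting, so the middleware needs almost no state, but the result set has one row per element instance rather than one row per joined tuple.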

3. Conducting the Research

This research focuses on strategies for transferring data from an XML document to a relational database and vice versa. The steps are as follows:

3.1. Designing an XML Document

To show how elements, attributes, and sub-elements in an XML document are handled when they are transferred to a database and back, an example XML document containing elements, attributes, and sub-elements is used. The example shown in Figure 1 is used in this research.

Professional PHP Programming paperback 909 $25 Tutorial Wrox Press Ltd.

Figure 1: Inventory Document sample


To test with this XML document, a PHP script generated data files in which the number of nodes ranged from 1 to 100,000. The file sizes range from 583 bytes for the smallest file to 35.135 MB.

3.2. Object Relational - XML Document Mapping

With the object-relational mapping, the example XML document above is mapped to the objects shown in Figure 2.

Object Book {
  - Idbook = 001;
  - Title = Professional PHP Programming;
  - Year = 2000;
  - Binding = paperback;
  - Pages = 909;
  - Price = $25;
  - Genre = Tutorial;
  - Publisher = Wrox Press Ltd.
}

Object Author {
  - First = Jesus
  - Middle =
  - Last = Castagnetto
}

Object Author {
  - First = Sascha
  - Middle =
  - Last = Schumann
}

Object Author {
  - First = Harish
  - Middle =
  - Last = Rawat
}

Etc.

Figure 2: Object based Mapping on Document Inventory

These objects can be mapped into a MySQL database with two tables; the database schema (table relation) is shown in Figure 3.

Figure 3: Table Relation

The idbook attribute in the tabbook1 table is the primary key, and the idbook attribute in the tabauthor1 table is the foreign key.

3.3. Transferring Data from XML to Database

In this research, data were transferred from XML to the database with PHP as middleware, in two ways:
1. Parsing the file using the PHP SAX parser
2. Using PEAR’s XML Tree class
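The first of these two ways, event-based SAX parsing into the two tables, can be sketched as follows. This is an illustration only: Python's xml.sax and sqlite3 modules stand in for the PHP SAX parser and MySQL, the sample markup and its element names are hypothetical, and the first_name and last_name columns are assumptions. The handler keeps only the record currently being read and writes it out when the closing book tag arrives.

```python
import sqlite3
import xml.sax

# Hypothetical markup; only the values come from the sample inventory data.
SAMPLE_XML = """<inventory>
  <book idbook="001">
    <title>Professional PHP Programming</title>
    <author><first>Jesus</first><last>Castagnetto</last></author>
    <author><first>Sascha</first><last>Schumann</last></author>
  </book>
</inventory>"""

SCHEMA = """
CREATE TABLE tabbook1 (idbook TEXT PRIMARY KEY, title TEXT);
CREATE TABLE tabauthor1 (
    idbook TEXT REFERENCES tabbook1 (idbook),
    first_name TEXT,
    last_name TEXT
);
"""


class InventoryHandler(xml.sax.ContentHandler):
    """Insert each book and its authors as soon as the book's end tag is seen."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.book = None    # the book record currently being read
        self.author = None  # the author record currently being read
        self.text = []      # character data of the current element

    def startElement(self, name, attrs):
        self.text = []
        if name == "book":
            self.book = {"idbook": attrs.get("idbook"), "title": None, "authors": []}
        elif name == "author":
            self.author = {}

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        if name == "title" and self.book is not None:
            self.book["title"] = value
        elif name in ("first", "last") and self.author is not None:
            self.author[name] = value
        elif name == "author" and self.book is not None:
            self.book["authors"].append(self.author)
            self.author = None
        elif name == "book":
            idbook = self.book["idbook"]
            self.conn.execute(
                "INSERT INTO tabbook1 VALUES (?, ?)", (idbook, self.book["title"])
            )
            self.conn.executemany(
                "INSERT INTO tabauthor1 VALUES (?, ?, ?)",
                [(idbook, a.get("first"), a.get("last")) for a in self.book["authors"]],
            )
            self.book = None  # release the record before the next one


def load_with_sax(xml_text, conn):
    """Parse the document once and load it into the two tables."""
    xml.sax.parseString(xml_text.encode("utf-8"), InventoryHandler(conn))
    conn.commit()
```

A possible use is to create the schema with conn.executescript(SCHEMA) on a fresh connection and then call load_with_sax(SAMPLE_XML, conn).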


3.4. Transferring Data from Database into XML

The XML document was produced from the relational MySQL database using the alternative tagging and structuring approaches, with PHP scripts as middleware. In every alternative, tagging and structuring were done outside the engine, meaning that part of the process was carried out outside the relational engine. The alternatives for transferring data from the database into XML used in this research are:
1. Early tagging, early structuring: Stored Procedure / Individual table
2. Late tagging, late structuring: Universal Table I / Redundant
3. Late tagging, early structuring: Universal Table II / Sorted Outer Union

3.5. Presenting and Searching XML Data

To compare the performance of an XML document and an RDBMS in terms of loading speed in the browser, the following tasks were carried out:
1. Searching data in the XML document using the DSO binding technique, using a script.
2. Presenting XML data from the RDBMS by searching the XML data that had been stored in the MySQL database, using the redundant method. The query result is saved as an XML document using the DOM Tree method, and the resulting file is read and bound using an XSL file.
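The redundant (Universal Table I) method used in the second task can be sketched as a single join followed by grouping and tagging in the middleware. The code is a minimal illustration under assumptions: Python and SQLite replace PHP and MySQL, the first_name and last_name columns and the element names are hypothetical, and the second book row is placeholder data. The join repeats the book columns once per author, which is the redundancy that the grouping step removes.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

# The universal table: every parent column is repeated for each child row.
UNIVERSAL_TABLE = """
SELECT b.idbook, b.title, a.first_name, a.last_name
  FROM tabbook1 AS b
  LEFT JOIN tabauthor1 AS a ON a.idbook = b.idbook
"""


def build_sample_db():
    """Create a small in-memory database; the second book is placeholder data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tabbook1 (idbook TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE tabauthor1 (idbook TEXT, first_name TEXT, last_name TEXT);
        INSERT INTO tabbook1 VALUES ('001', 'Professional PHP Programming');
        INSERT INTO tabbook1 VALUES ('002', 'Placeholder Title');
        INSERT INTO tabauthor1 VALUES ('001', 'Jesus', 'Castagnetto');
        INSERT INTO tabauthor1 VALUES ('001', 'Sascha', 'Schumann');
        """
    )
    return conn


def universal_table_to_xml(conn):
    """Late structuring (group by parent key), then late tagging."""
    # Structuring: collapse the redundant parent columns into one entry
    # per book. The grouped content is held in memory until tagging.
    books = {}
    for idbook, title, first_name, last_name in conn.execute(UNIVERSAL_TABLE):
        entry = books.setdefault(idbook, {"title": title, "authors": []})
        if first_name is not None or last_name is not None:
            entry["authors"].append((first_name or "", last_name or ""))

    # Tagging: wrap the grouped content in elements.
    parts = ["<inventory>"]
    for idbook, entry in books.items():
        parts.append(f"<book idbook={quoteattr(idbook)}>")
        parts.append(f"<title>{escape(entry['title'] or '')}</title>")
        for first_name, last_name in entry["authors"]:
            parts.append(
                f"<author><first>{escape(first_name)}</first>"
                f"<last>{escape(last_name)}</last></author>"
            )
        parts.append("</book>")
    parts.append("</inventory>")
    return "".join(parts)
```

Only one query is sent to the engine regardless of the number of records, at the price of holding the grouped rows in the middleware until tagging is complete.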

4. Results and Discussion

4.1. Comparison between SAX and DOM

SAX and DOM were compared by inserting data from XML documents with varying numbers of nodes into the database tables. The result is shown in Figure 4.

Figure 4: Comparison between SAX and DOM (time in seconds versus number of nodes, for DOM and SAX)

As the graph shows, parsing an XML document is faster with the SAX (Simple API for XML) method than with the DOM (Document Object Model) method. Two things matter here. First, SAX code uses less memory because its buffer holds only one line at a time, while DOM code buffers the whole document. Second, SAX code is faster because it does not spend time building a DOM tree.
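For contrast with event-based parsing, a tree-based extraction of the same rows might look like the sketch below. It is illustrative only: Python's xml.dom.minidom stands in for the PEAR XML Tree class, and the sample markup and element names are hypothetical. The whole document is parsed into a tree before the first row can be read, which is where the extra time and memory of the DOM approach go.

```python
from xml.dom import minidom

# Hypothetical markup; only the values come from the sample inventory data.
SAMPLE_XML = """<inventory>
  <book idbook="001">
    <title>Professional PHP Programming</title>
    <author><first>Jesus</first><last>Castagnetto</last></author>
    <author><first>Sascha</first><last>Schumann</last></author>
  </book>
</inventory>"""


def text_of(parent, tag):
    """Return the text of the first descendant element named tag, or ''."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def rows_from_dom(xml_text):
    """Build the full tree first, then walk it to collect table rows."""
    dom = minidom.parseString(xml_text)  # the entire document is now in memory
    book_rows, author_rows = [], []
    for book in dom.getElementsByTagName("book"):
        idbook = book.getAttribute("idbook")
        book_rows.append((idbook, text_of(book, "title")))
        for author in book.getElementsByTagName("author"):
            author_rows.append(
                (idbook, text_of(author, "first"), text_of(author, "last"))
            )
    dom.unlink()  # break reference cycles so the tree can be freed promptly
    return book_rows, author_rows
```

The returned row lists could then be inserted into tabbook1 and tabauthor1 with ordinary INSERT statements.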


The most important consequence of the difference in memory use is that SAX can be used for large documents, while DOM consumes a lot of memory. However, DOM is suitable for applications that need the DOM tree, for example to present an XML document with XSLT. Another reason to use DOM is that it exposes the hierarchy of the parsed XML document more completely: tag names, tag attributes, data content, and nested tags. For large XML documents with many nodes, this research suggests dividing the document into smaller parts so that parsing is faster with either the SAX or the DOM method.

4.2. The Comparison Result for Transfer from Database into XML

To compare the strategies for transferring data from the database into an XML document, each method was tested with varying numbers of records, and the transfer time was measured. The comparison result is shown in Figure 5.

Figure 5: Data Transfer Strategies from Database into XML Document (time in seconds versus number of records, for the Individual, Universal I, and Universal II methods)

The graph shows that the Individual table method is the least suitable. It has the worst performance, converting data into an XML document much more slowly than the Universal I and Universal II methods. The main cause is that this method consumes most of the database resources: one or more SQL queries must be issued for every tuple of each table that has nested structure. As a result, forming a larger document requires thousands of queries, which makes the process inefficient or can even lead to deadlock.

The Universal table I / Redundant method performs very well compared to Universal II. Its query is processed efficiently despite the redundancy, whereas the Universal II method must produce its result table in structured form. Tagging is also faster because the result set has fewer rows than that of Universal table II. Universal table I also uses memory better than Universal table II; this is evident at 50,000 records, where the Universal II method uses all of, or even more than, the default memory provided by MySQL.

4.3. The Comparison of Searching Data in an XML Document versus RDBMS

The result of comparing data searches on an XML document file and on the RDBMS is shown in Figure 6.


Figure 6: Comparison of query using RDBMS and using XML Document (time in seconds versus number of nodes/records, for XML and RDBMS)

The graph of searches on data stored as an XML document and in an RDBMS shows that the RDBMS performs better for storing and querying data than keeping the XML data in an XML document. When searching for a specific record (the last of all the records), the RDBMS is more stable, measured from the start of the search until the result is displayed in the web page. For the XML document, the larger the number of nodes, the longer it takes to load the data and reach the last node. This happens because, with DSO, the browser must load (cache) the data from the XML document and walk through each node to find the requested data. So the more data an XML document holds, the longer it takes to cache and search the data. The use of an index in the RDBMS makes data searches faster.

XML document data also needs more storage space than the same data in RDBMS form, as shown in Table 1. This is because an XML document stores the tags along with every data item, which increases its size.

Table 1: Comparison Space XML Document vs. RDBMS Table

Number of nodes/records   XML Document   RDBMS Data   RDBMS Index   RDBMS Total
1                         583 B          224 B        3 KB          3.2 KB
10                        4 KB           1.3 KB       3 KB          4.3 KB
50                        16 KB          6.8 KB       3 KB          9.8 KB
100                       32 KB          13.6 KB      3 KB          19.6 KB
500                       165 KB         73.0 KB      7 KB          80.0 KB
1,000                     330 KB         147.2 KB     11.0 KB       158.2 KB
5,000                     1.7 MB         717.7 KB     43.0 KB       760.7 KB
10,000                    3.4 MB         1.4 MB       82.0 KB       1.5 MB
50,000                    17.5 MB        7.9 MB       403 KB        8.3 MB

5. Conclusion and Future Work

The results indicate that, when storing XML documents in an RDBMS, the SAX parser gives better performance than the DOM tree technique. The Redundant / Universal I technique is the best alternative for querying XML documents in an RDBMS and for transferring data from the RDBMS into XML documents, since the Universal I table uses memory better than the other techniques. For querying XML data, especially large amounts of it, an RDBMS is faster than a flat XML document file as storage. In general, storing XML data in an RDBMS is more efficient, since the RDBMS needs less space to store the data than an XML document does. This is because an XML document stores not only the content of the data but also the tags. Future work will compare the relative performance of relational databases and native XML databases, and will integrate and compare other XML query languages.

6. References

[1] Asaduzzaman, A., Building XML Trees with PEAR’s XML_Tree Class, Devshed article, 2003. http://www.devarticles.com/art/1/443

[2] Bourret, R., Data Transfer Strategies, 2001. http://www.rpbourret.com/xml/DataTransfer.htm

[3] Bourret, R., XML and Databases, 2003. http://www.informatik.tudarmstadt.de/DVS1/staff/bourret/xml/XMLAndDatabases.htm

[4] Florescu, D., Kossmann, D., Storing and Querying XML Data using an RDBMS, Bulletin of the Technical Committee on Data Engineering, 1999. http://www.research.microsoft.com/research/db/debull/99sept/we.ps

[5] Shanmugasundaram, J., Shekita, E., Carey, M., Lindsay, B., Pirahesh, H., Reinwald, B., Efficiently Publishing Relational Data as XML Document, 1999. http://www.acm.org/sigmod/vIdb/conf/2000/P065.pdf

[6] Tatarinov, I., Viglas, S.D., Beyer, K., Shanmugasundaram, J., Shekita, E., Zhang, C., Storing and Querying Ordered XML Using a Relational Database System, ACM SIGMOD, Madison, Wisconsin, USA, 2002.
