An Effective Mechanism for Index Update in Structured Documents

Hyunchul Jang and Youngil Kim
Department of Computer Engineering, Chungnam National University
220 Kung-Dong, Yusong-Gu, Taejon 305764, Republic of Korea
FAX: +82-42-822-4997
E-mail: ~hcianq,[email protected]

Dongwook Shin
National Library of Medicine
8600 Rockville Pike, Bethesda, Maryland 20894 USA
Tel: (301) 435-3257
FAX: (301) 480-3035
E-mail: dwshin@nlm.nih.gov

Recently, several methods have been developed for structured IR. Path encoding [11] is a technique that encodes each path from the root to the destination node in the hierarchy into a compressed form and uses it to identify the node. With an optimization, the index overhead is reported to be around 20% of the original data size. However, the paper does not address how to evaluate queries.

Abstract

Indexing and retrieval of structured documents have been drawing increasing attention since they make it easy to retrieve and access a specific part of a document. So far, several methods have been proposed for settings in which documents rarely change. These can be applied to the books or journals held in libraries, but hardly work for documents in the business domain that change frequently. This paper aims at enabling incremental update of indices whenever parts of documents are changed. For this, it employs the index-organized table that has been developed for full-text retrieval in Oracle. It creates several index-organized tables that are essential for implementing the Bottom Up Scheme, which has been developed for manipulating structured documents efficiently.

Lee et al. [5] represented a document structure as a k-ary complete tree and assigned a unique number, named UID (Unique element IDentifier), to each node. They reduce the index size when an index term appears in all the child nodes: the term is then indexed only in the parent node rather than in each child node. Even though this lessens the index overhead, it does not accommodate the weight information of an index term. Shin et al. [9] proposed BUS (Bottom Up Scheme), which extended the notion of UID and made it possible to include weight information. In BUS, terms are indexed only at the lowest level of the document structure, whereas index information at higher levels is computed at query evaluation time from the index information at the lowest level. BUS also represents a document as a k-ary complete tree, which makes it quick to accumulate index information from the lowest level into a higher level.

According to an experiment, the technique presented here does not add much index overhead to that already incurred by the index-organized table. In addition, indices are updated quickly as soon as parts of documents are changed.

Keywords XML, SGML, indexing, retrieval, incremental update

1. INTRODUCTION

Structured documents are beginning to prevail on the World Wide Web and in digital libraries since XML (eXtensible Markup Language) [1] and SGML (Standard Generalized Markup Language) [4] have emerged as standards for structuring documents. With this, information retrieval of structured documents [6, 7] has also been drawing increasing attention because it makes it easy to retrieve and access a specific part of a document. IR (Information Retrieval) over structured documents poses more difficult problems than traditional IR over unstructured documents. That is, it must keep structural information as well as content, together with the relation between structure and content.

The methods described so far are built on file systems, which hardly cope with content update. Support for content update appears to be important particularly in business applications such as inventory management, since inventory information tends to be updated regularly. This paper aims at supporting content update in structural information retrieval. In particular, it suggests an implementation technique for BUS (Bottom Up Scheme) in a relational database management system so that it facilitates the incremental update of indices. We take BUS as the underlying model since it guarantees less index overhead and quick retrieval time with a rich set of retrieval functions. In order to support the incremental update, we also employ the cooperative indexing technique [2], also called the index-organized table, that has been developed for full-text retrieval in Oracle [8].

This paper creates an index-organized table that implements the posting information in the BUS strategy, and another two tables that keep the structural relation among the nodes and the attributes. In our experiment, the index overhead amounts to 240 percent of the original data size. Considering that the overhead of the index-organized table in traditional retrieval is already about 120 percent, structural IR does not bring much additional overhead. When a user query is issued, we can figure out in advance the element types (meaning elements in the DTD, not element instances in documents) that should participate in the query evaluation. These element types are used as conditions so that only the database records with those types are retrieved (a database record corresponds to a posting in the inverted index). With this, database retrieval can be performed as fast as retrieval from a file when the portion of database records to be retrieved is much smaller than the whole set of records. The retrieval of structural queries is also fast.

The update of indices is carried out in such a way that the indices of the previous content are removed first and the new indices are then inserted. The update is fast, taking around two or three seconds when a 40 Kbyte document is updated. To summarize, with the implementation technique suggested in this paper, the index overhead is not high compared to traditional IR systems using databases, nor does the update of indices owing to content change carry much overhead.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM '99 11/99 Kansas City, MO, USA © 1999 ACM 1-58113-146-1/99/0010...$5.00

(a) Document tree with index terms

2. BOTTOM UP SCHEME

Shin et al. [9] proposed the Bottom Up Scheme, an efficient way of indexing and retrieval for structured documents. The main idea is that indexing information is kept only in the nodes (or elements) at the lowest level, whereas that of intermediate nodes is computed at run time. That is, term frequencies are not indexed at the internal nodes, but only at leaf nodes. If a user wants information at an intermediate level, the term frequencies at the leaf nodes are accumulated into the corresponding nodes at that level automatically, which yields the term frequency at the target level quickly. Figure 1 explains this briefly.

In order to facilitate the accumulation of term frequencies into the corresponding internal nodes, Shin et al. [9] introduced the notion of GID (General element IDentifier), extending that of UID (Unique element IDentifier), and assigned a unique GID to each node. A GID consists of (1) the document number, (2) the UID of the element in the document tree, (3) the level of the element in the document tree, and (4) the element type number in the structure. The first constituent tells which document the element belongs to, and the second gives the position of the element in the document tree. The third and the fourth constituents facilitate the reproduction of term frequencies at the level the user wants. That is, with the third constituent, we can compute the difference between the user level (the level at which the user wants to retrieve elements) and the text level (the level of the elements where the text is included and thus indexing is really performed). The fourth makes it possible to check whether a term frequency should participate in the frequency accumulation step or not.

2.1 Unique element identifier (UID)

Lee et al. [5] proposed an indexing structure that reduces the storage overhead of indexing at all levels of the document structure. They first represented a document as a k-ary complete tree, where k is the largest number of child elements of an element in the structure. The result of the mapping is called the 'document tree'. Secondly, they assigned each element a UID (Unique element IDentifier) according to the order of the level-order tree traversal. In this tree, given a child's UID i, one can compute the parent UID directly by the following expression:

$$\mathrm{parent}(i) = \left\lfloor \frac{i - 2}{k} \right\rfloor + 1$$
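The UID arithmetic of a k-ary complete tree can be sketched in a few lines. This is only an illustration, assuming that UIDs start at 1 for the root, that the root is at level 1, and that the parent of UID i is floor((i - 2) / k) + 1; the function names are hypothetical and do not come from the paper.

```python
def parent_uid(uid: int, k: int) -> int:
    """UID of the parent of `uid` in a k-ary complete tree numbered in level order from 1."""
    if uid <= 1:
        raise ValueError("the root (UID 1) has no parent")
    return (uid - 2) // k + 1


def child_uids(uid: int, k: int) -> range:
    """UIDs reserved for the k child slots (real or virtual nodes) of `uid`."""
    first = k * (uid - 1) + 2
    return range(first, first + k)


def level_of(uid: int, k: int) -> int:
    """Level of `uid`, counting the root as level 1 (an assumption of this sketch)."""
    level = 1
    while uid > 1:
        uid = parent_uid(uid, k)
        level += 1
    return level


if __name__ == "__main__":
    k = 40  # hypothetical fan-out, chosen only for the demonstration
    print(parent_uid(3282, k), child_uids(83, k)[0], level_of(9682, k))
```

Because the parent is obtained by integer arithmetic alone, no parent pointer has to be stored with an element.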

(b) Indexing and retrieval of term frequency

Figure 1. The Principle of Bottom Up Scheme

2.2 Indexing and retrieval with UID

In the Bottom Up Scheme, indexing is carried out in each element at the text level, and terms are extracted with auxiliary information: (1) the frequency of the term in the element and (2) the UID of the element. First, the indexer scans through the document, assigning a UID to each element. Secondly, it extracts terms and calculates their frequencies in each element at the text level.


However, one problem is that the update of indices is not easy since we may have to reformulate all the postings.

(Figure 3 components: SGML/XML documents; indexer, with parser and extractor; content index; structural index)

(a) 3-ary document tree, with real and virtual nodes

(b) Result of assigning UIDs

Figure 2. Document tree and UIDs

The QEP (Query Evaluation Procedure) of the Bottom Up Scheme processes a user query by manipulating accumulators and the UIDs obtained from the posting file. First, it creates a set of accumulators corresponding to the elements at the user level. Secondly, it extracts all the postings relevant to the user query and carries out the term frequency accumulation. With the frequency information, it can compute the real weights of terms, which makes it possible to evaluate the similarity between elements and the user query.
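The accumulation step can be illustrated with a minimal sketch. It assumes the parent expression floor((i - 2) / k) + 1, postings that carry the fields named in the paper (TOK, DID, UID, ETY, LEV, TF), and a policy of skipping postings that lie above the user level. The `Posting` type, the `accumulate` function and that skipping policy are assumptions of this example, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set, Tuple


class Posting(NamedTuple):
    tok: str  # index term
    did: int  # document number
    uid: int  # UID of the text-level element
    ety: int  # element type number
    lev: int  # level of the element (root = 1)
    tf: int   # term frequency in the element


def parent_uid(uid: int, k: int) -> int:
    return (uid - 2) // k + 1


def accumulate(postings: Iterable[Posting], term: str, user_level: int, k: int,
               allowed_types: Optional[Set[int]] = None) -> Dict[Tuple[int, int], int]:
    """Sum term frequencies into accumulators keyed by (DID, UID at the user level)."""
    acc: Dict[Tuple[int, int], int] = defaultdict(int)
    for p in postings:
        if p.tok != term:
            continue
        if allowed_types is not None and p.ety not in allowed_types:
            continue  # filtered out by element type
        if p.lev < user_level:
            continue  # assumption: text above the user level is ignored
        uid = p.uid
        for _ in range(p.lev - user_level):  # climb (text level - user level) steps
            uid = parent_uid(uid, k)
        acc[(p.did, uid)] += p.tf
    return dict(acc)


if __name__ == "__main__":
    sample = [
        Posting("index", 1, 9642, 393, 4, 2),
        Posting("index", 1, 9643, 393, 4, 3),
        Posting("index", 1, 9682, 393, 4, 1),
        Posting("data", 1, 9642, 393, 4, 7),
    ]
    print(accumulate(sample, "index", user_level=3, k=40))
```

The level difference stored in each posting tells how many times the parent expression has to be applied, which is why the level is part of the GID.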

Figure 3. Indexing process

3. INDEXING

Indexing consists of three parts: preprocessing the original documents, extracting postings, and loading the posting information into database tables. The first two extract terms, compose index information, and create postings. The last loads these postings into database tables. Figure 3 shows this indexing process.

(a) content posting (b) structural posting (c) attribute posting. Field labels shown: TOK, DID, UID, ETY, LEV, TF, CHD, ATT, VAL, ELE, EN

Figure 4. Posting structures

3.1 Preprocessing and Posting Extraction

As mentioned previously, these steps separate words from documents and calculate the UID of each element. They also compute the term frequencies in each element and associate a GID with each term. These data are saved in the postings described in Figure 4-(a). Here, TOK means the index word, ETY represents the element type number, LEV means the level of the element, and TF is the term frequency. DID and UID represent the document number and the Unique element IDentifier, respectively. In Figure 4, the attributes enclosed in the bold line form the index key.

Normally, a posting created in the inverted index does not include the index word. Instead, the word is saved in the term tree (often built as a B+ tree) in order to make the posting file as compact as possible, and the term points to a sequence of postings that have the term in common as the index term.

An alternative way to compose a posting is to include the term in the posting itself. It is often adopted when the postings are managed in a database table. Although it carries some space overhead, it makes the update of indices easy. In this setting, all the postings are independent of one another, so an update is localized to the postings whose content really has to change. We take this approach, as in Figure 4-(a), since we attempt to update the indices immediately after the content is changed.

In addition to creating content postings, the indexer generates a structural table that encodes the relations among the elements. Structural information includes the element type number, parent-child and sibling relationships, and so on. But with each element associated with a UID, we need not represent parent-child and sibling relationships explicitly, because we can recognize the relationship among elements by comparing their UIDs. In the structural table, we include the level and the number of children. The level tells where the node is located in the document tree, whereas the number of children tells how many children a node possesses. Figure 4-(b) illustrates how a structural posting is composed. Here, LEV means the level of the node, ETY represents the element type number, and CHD gives the number of children. The key of the structural index table is the element type number (ETY) because structural retrieval is normally queried by element type.

In addition, the indexer extracts the information of each attribute and composes an attribute posting. An attribute consists of an attribute name and an attribute value in an element. For instance, an attribute specification such as <person id=100> in SGML or XML states that the value of the id attribute of the person element is 100. To facilitate the retrieval of attributes queried by the pair of element and attribute name, the attribute posting is composed of DID, UID, level (LEV), element type number (ETY) and attribute value (VAL). Figure 4-(c) shows that the key of the attribute index table is the pair of attribute name (ATT) and element type number (ETY). Note that all attributes in the same element have the same DID, UID, LEV and ETY values.

3.2 Loading index into table

There are two ways of loading postings into Oracle database tables. One is to insert each posting using SQL, and the other is to use a database utility that inserts large amounts of data efficiently. Using SQL is common in most relational database systems. However, it may spend a significant amount of time loading an enormous number of postings, since concurrency and recovery checks are carried out every time a posting is inserted into the database. On the other hand, SQL*Loader (an Oracle utility) [8] makes the loading fast since it is performed in batch mode, where concurrency and recovery checks are never performed during loading. We employ the second approach since it makes indexing fast.

In creating tables, we use Index-Organized tables instead of normal tables. Index-Organized tables have been specially designed to perform better in information retrieval than ordinary database tables [2]. An Index-Organized table is an index that stores auxiliary data in its leaf nodes as well as index terms. That is, no actual table is created. Instead, only the index is created, and it stores all the posting information described in Figure 4. With this, space is saved because duplicated keys are removed. In addition, retrieval can be faster since we need to access only the index, rather than accessing a table through its index. Figure 5 shows a snapshot of the tables after loading postings.

(a) Content Index Table. Column labels: TOK, DID, UID, ETY, LEV, TF. Sample rows as extracted: index 3 2893 53 3 5; index 5 2343 46 3 3; index 6 2987 53 2 2
(b) Structural Index Table
(c) Attribute Index Table. Column labels: ATT, DID, UID, LEV, ETY, VAL. Sample row as extracted: FONT 100 3361202 4 53 korean

Figure 5. A snapshot in the result of loading

4. UPDATE OF INDICES

The update of structured documents varies according to what a user intends to do. A user may update the whole document or some part of it. Occasionally, new elements may be inserted or existing elements may be deleted, which changes the structure. In many cases, a change in structure brings an update of document content at the same time. This section addresses how to update the indices as soon as the content or structure is changed.

4.1 Change in element content

When the contents of elements are changed, the indices made previously should be replaced by new ones. A simple solution is to delete all the indices for the previous content and insert the new ones. If the content of an element is changed, the deletion is first applied to the Content Index Table with the DID and UID key, and the new postings are then inserted.
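The delete-then-insert update of a content index table can be tried out on a small scale. The sketch below is an assumption-laden stand-in, not the paper's Oracle setup: it uses SQLite through Python's `sqlite3` module, a `WITHOUT ROWID` table as a rough analogue of an index-organized table (the rows live in the primary-key B-tree), whitespace tokenization, and hypothetical table, column and function names modelled on the posting fields TOK, DID, UID, ETY, LEV and TF.

```python
import sqlite3
from collections import Counter


def open_index() -> sqlite3.Connection:
    """Create an in-memory content index whose rows are stored in the key index itself."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE content_index (
               tok TEXT NOT NULL, did INTEGER NOT NULL, uid INTEGER NOT NULL,
               ety INTEGER NOT NULL, lev INTEGER NOT NULL, tf INTEGER NOT NULL,
               PRIMARY KEY (tok, did, uid)
           ) WITHOUT ROWID"""
    )
    return conn


def postings_for(text: str, did: int, uid: int, ety: int, lev: int):
    """One posting per distinct token, with its frequency in the element."""
    counts = Counter(text.lower().split())
    return [(tok, did, uid, ety, lev, tf) for tok, tf in counts.items()]


def replace_element(conn: sqlite3.Connection, did: int, uid: int,
                    ety: int, lev: int, new_text: str) -> None:
    """Delete the old postings of one element, then insert the new ones, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM content_index WHERE did = ? AND uid = ?", (did, uid))
        conn.executemany("INSERT INTO content_index VALUES (?, ?, ?, ?, ?, ?)",
                         postings_for(new_text, did, uid, ety, lev))


if __name__ == "__main__":
    conn = open_index()
    replace_element(conn, 1, 9642, 393, 4, "index update index")
    replace_element(conn, 1, 9642, 393, 4, "oracle table")
    print(conn.execute("SELECT tok, tf FROM content_index ORDER BY tok").fetchall())
```

Since every posting carries its own term, replacing one element touches only that element's rows and leaves the postings of all other elements alone.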

4.2 Changes in element structures

When the structure of elements is changed, content information is likely to be updated as well as structural information. For instance, suppose that a new element is inserted at the position of UID 83, as shown in Figure 6. Even though it appears to require only an update in structure, it also entails some content update since all the postings of the following elements must be modified. That is, the UIDs of the following elements whose parent is the same as that of the inserted one are increased by one. Note that the UIDs of the elements whose parents differ from that of the inserted one are not affected unless the insertion changes the value k (the maximum number of sibling elements in a tree). If k is changed, the whole content and structure of the document should be updated.

The deletion of an element leads to the deletion of all the postings derived from the element. As with insertion, it may entail an update of content. If an element followed by another sibling is deleted, the content postings of the sibling should be updated as well as the structural postings. That is, in Figure 6, if the element with UID 83 is deleted, the UIDs of the following sibling nodes are decreased, which requires the update of UIDs in the corresponding postings.
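The renumbering caused by an insertion can be sketched as follows. The sketch assumes level-order UIDs starting at 1 with parent(i) = floor((i - 2) / k) + 1, and it additionally assumes that the descendants of a shifted sibling move with it, since their UIDs are derived from the UID of their ancestor. The helper names are hypothetical. With k = 40 and an insertion at UID 83, this reading maps 3282 and 3283 to 3322 and 3323, which are the numbers that survive from Figure 6.

```python
from typing import List


def uid_to_path(uid: int, k: int) -> List[int]:
    """Child positions (0-based) on the way from the root down to `uid`."""
    path = []
    while uid > 1:
        parent = (uid - 2) // k + 1
        path.append(uid - (k * (parent - 1) + 2))
        uid = parent
    path.reverse()
    return path


def path_to_uid(path: List[int], k: int) -> int:
    """Inverse of uid_to_path."""
    uid = 1
    for pos in path:
        uid = k * (uid - 1) + 2 + pos
    return uid


def remap_after_insert(uid: int, inserted_uid: int, k: int) -> int:
    """New UID of an existing element after a new element takes the slot `inserted_uid`."""
    target = uid_to_path(inserted_uid, k)
    if not target:
        raise ValueError("cannot insert at the root")
    path = uid_to_path(uid, k)
    depth = len(target)
    # Affected: the old occupant of the slot, its following siblings, and their subtrees.
    same_parent = len(path) >= depth and path[:depth - 1] == target[:depth - 1]
    if same_parent and path[depth - 1] >= target[depth - 1]:
        if path[depth - 1] + 1 >= k:
            raise OverflowError("k must grow; the whole document has to be re-indexed")
        path[depth - 1] += 1
    return path_to_uid(path, k)


if __name__ == "__main__":
    for old in (82, 83, 3282, 3283, 122):
        print(old, "->", remap_after_insert(old, 83, 40))
```

The `OverflowError` branch corresponds to the case in which the insertion changes k, where the text calls for updating the whole document instead.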

Figure 6. Insertion and Deletion of elements (labels visible in the figure: k=40; UIDs 3282, 3283, 3322, 3323)

4.3 Update of postings

As we store the postings as tuples in the database tables, postings can be updated simply by insertion, deletion and update operations. If the content of an element is changed, we update the tables by deleting the corresponding postings with the DID and UID key and then inserting the new postings. As stated previously, if an element is inserted and the UIDs of the sibling nodes have to change, the structural index table should be updated with the DID and UID key. At the same time, the content postings whose UIDs are incremented should be modified. Likewise, the deletion of an element requires the update of content postings as well as the update of the structural index table.

5. RETRIEVAL

A query may be evaluated against some of the index tables depending on what the user query intends to search. We classify query retrieval into content retrieval, structural retrieval and attribute retrieval according to which table it accesses.

5.1 Content Retrieval

Given a term and an element name, content retrieval accesses the content index table and retrieves the postings whose terms match the given term. The resulting postings are normally obtained immediately from the table. However, the filtering and accumulating process that sums all the frequencies of the term at the leaf level may take a certain amount of time.

5.2 Structural Retrieval

Structural retrieval attempts to retrieve the elements that satisfy the condition the query imposes. For instance, suppose that a query '<PARGI> child <ABST>' is given (here the syntax follows that in [10]). The query intends to retrieve the abstract elements, each of which has a 'PARGI' element as its child. In this paper, all the element types are assigned unique type numbers (ETY). Hence, with the 'PARGI' type number (here 393) as a key, we can first extract all the 'PARGI' elements. Secondly, we compute the parent UIDs of those elements (here 242) and select the parents as the results. Figure 7 illustrates this procedure graphically.

1. <PARGI> child <ABST>
2. Calculate the ETY of PARGI: 393
3. Select all PARGIs with the ETY of the value 393
4. Compute the parent UID of the PARGI elements
5. Select them as the results

Structural postings shown in the figure, as extracted:
ABST,1,6,2,6,1
TEXT17272 , , , , 7
PAR,1,202,3,25,0
ABST,1,242,3,36,1
PARGI,1,242,4,393,0
PARN,1,243,3,37,1
PARGI,1,9682,4,393,0

Figure 7. Structural Retrieval Processing
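The 'child' query procedure can be sketched over an in-memory structural table. The sketch assumes k = 40, the parent expression floor((i - 2) / k) + 1, and rows laid out as (name, DID, UID, LEV, ETY, CHD). The row with UID 9642 is hypothetical and was added so that an ABST element has a PARGI child. The final check of the parent's element type is this example's reading of the query, and all function names are illustrative.

```python
from typing import List, NamedTuple


class StructPosting(NamedTuple):
    name: str  # element name
    did: int   # document number
    uid: int   # UID in the document tree
    lev: int   # level (root = 1)
    ety: int   # element type number
    chd: int   # number of children


def parent_uid(uid: int, k: int) -> int:
    return (uid - 2) // k + 1


def child_query(table: List[StructPosting], child_ety: int,
                parent_ety: int, k: int) -> List[StructPosting]:
    """Elements of type `parent_ety` that have at least one child of type `child_ety`."""
    by_key = {(p.did, p.uid): p for p in table}
    results: List[StructPosting] = []
    for p in table:
        if p.ety != child_ety:
            continue  # select the child elements by type number
        parent = by_key.get((p.did, parent_uid(p.uid, k)))  # compute the parent UID
        if parent is not None and parent.ety == parent_ety and parent not in results:
            results.append(parent)
    return results


if __name__ == "__main__":
    table = [
        StructPosting("ABST", 1, 242, 3, 36, 1),
        StructPosting("PARGI", 1, 9642, 4, 393, 0),  # hypothetical child of UID 242
        StructPosting("PARN", 1, 243, 3, 37, 1),
        StructPosting("PARGI", 1, 9682, 4, 393, 0),  # its parent is UID 243 (PARN)
    ]
    print(child_query(table, child_ety=393, parent_ety=36, k=40))
```

No join on a parent column is needed: the parent key is computed from the child's UID and looked up directly.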

We implemented all the operators suggested in [10], including the 'child', 'in', positional and 'cardinality' operators, and so on.

5.3 Attribute retrieval

Attribute retrieval is fairly simple if we know the element name, attribute name and value. As a query normally involves all of these and asks for DID and UID pairs, the retrieval can be performed easily by composing the query with the element name, attribute name and value. A set of DID and UID pairs is immediately available as the result.

6. EXPERIMENT AND CORPUS

6.1 Corpus

To measure the performance of indexing and retrieval of structured documents using relational database tables, we use the Patent collection in TIPSTER's TREC (Text REsearch Collection) Vol. 3 [3]. The document structure of Patent is somewhat more complex than that of the other TREC collections. Table 1 shows statistical data for the Patent collection, including the total number of elements.

Table 1. Statistical data for Patent

Document   Size (MB)   Number of documents   Number of elements
                                             Average in document   Total       Maximum in document
PATENT     250         6,711                 208                   1,393,541   2,042

Another thing to note is that PATENT does not have any attributes. As it is hard to find a corpus having attributes, we do not include attributes in this experiment.

6.2 Indexing

In this experiment, we first measured the indexing overhead in terms of time and space. As explained earlier, postings are written to a temporary file and saved in the database tables using SQL*Loader [8] in the course of indexing. In Table 2, the creation time of content postings is the time taken to create the temporary file from the beginning, and the table loading time is the time taken to read that temporary file and load it into the Index-Organized tables.

Table 2. Index time for content and structure: (a) content index time; (b) structural index time. Times surviving in the table: 8 min, 10 min, 18 min.

In Table 2-(b), the structural posting creation time is the time taken to extract element information (its name, GID and the number of its children) and to save it into a temporary file.

In fact, we can run both content and structural indexing at the same time. As a great portion of them can be performed concurrently, the total index time and the space for temporary files are less than the sum of the two.

Secondly, we measured the space taken by the tables. Table 3 shows how much space is required for them. Among the fields, those for tokens and element names have variable length. The space required for the token field amounts to 70 Mbytes, with an average token length of 5.24 bytes. Element names occupy 4 Mbytes in the structural index table.

Table 3. Space overhead

(a) Size of the content index table
Number of tuples     13,533,117
Tuple size (bytes)   token(<40) + DID(4) + UID(20) + ETY(4) + LEV(4) + tf(4)
Size                 70M + 487M

(b) Structural index table size
Number of tuples     1,393,541
Tuple size (bytes)   element name(<40) + DID(4) + UID(20) + ETY(4) + LEV(4) + CHD(3)
Size                 4M + 45M

To summarize, the whole space consumed by the two tables is 604 Mbytes, which amounts to 240% of the source.

6.3 Update

As mentioned previously, an update is normally performed by a deletion followed by an insertion. One important thing to consider here is that if the deletion of postings is carried out with only the DID and UID pair, it takes a significant amount of time to complete. This is because the index of the content index table consists of token, DID and UID, so a retrieval with only DID and UID leads to a search of the whole database. One solution is to keep the tokens of the old elements and delete the postings made from the old content with the full index key, that is, token, DID and UID.

This greatly reduces the update time, but requires the preservation of the tokens in the old elements. Rather than doing this, when the content of an element is changed, we extract the tokens again from the old content and apply the deletion with token, DID and UID. Extracting tokens from the old content may carry an overhead, but in practice it adds little when the size of an element is around several Kbytes. Table 4 shows the results of updates at various levels.

Table 4. Time taken to update

Element   Level   Number of elements   Postings (content + structural)   Time (sec.)
DOC       1       121                  1215 + 121                        4.70
TEXT      2       69                   1145 + 69                         4.37
ABST      3       13                   98 + 13                           0.43
PARGI     4       12                   98 + 12                           0.43
PAR       5       1                    14 + 1                            0.14

Table 4 gives the time taken to update at five levels, from the root to the leaf level. It shows that the update of a PAR element takes little time, whereas that of a whole document (DOC element) takes four to five seconds. Considering that the average size of a document is around fifty Kbytes, the indices for a document can be updated quickly. Although these figures do not include the time to extract terms from the old element, the resulting time does not vary greatly.
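The effect of deleting with the full index key can be observed on a small stand-in database. The following sketch is an assumption, not the Oracle implementation measured above: it uses SQLite via Python's `sqlite3` module, a `WITHOUT ROWID` table keyed on (tok, did, uid) as an analogue of the index-organized content table, whitespace tokenization, and hypothetical function names. The `plan` helper only reports how SQLite would evaluate a condition, so the two access paths can be compared.

```python
import sqlite3
from collections import Counter

SCHEMA = """
CREATE TABLE content_index (
    tok TEXT NOT NULL, did INTEGER NOT NULL, uid INTEGER NOT NULL,
    ety INTEGER NOT NULL, lev INTEGER NOT NULL, tf INTEGER NOT NULL,
    PRIMARY KEY (tok, did, uid)
) WITHOUT ROWID
"""


def tokenize(text: str):
    return text.lower().split()


def insert_element(conn, did, uid, ety, lev, text):
    rows = [(tok, did, uid, ety, lev, tf) for tok, tf in Counter(tokenize(text)).items()]
    conn.executemany("INSERT INTO content_index VALUES (?, ?, ?, ?, ?, ?)", rows)


def update_element(conn, did, uid, ety, lev, old_text, new_text):
    """Re-extract tokens from the old content so each delete uses the full key."""
    old_keys = [(tok, did, uid) for tok in set(tokenize(old_text))]
    with conn:  # one transaction for the delete and the insert
        conn.executemany(
            "DELETE FROM content_index WHERE tok = ? AND did = ? AND uid = ?", old_keys)
        insert_element(conn, did, uid, ety, lev, new_text)


def plan(conn, where: str, params) -> str:
    """Access path SQLite would choose for the given WHERE condition."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM content_index WHERE " + where, params).fetchall()
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    insert_element(conn, 1, 9642, 393, 4, "index update index")
    update_element(conn, 1, 9642, 393, 4, "index update index", "oracle")
    print(plan(conn, "tok = ? AND did = ? AND uid = ?", ("index", 1, 9642)))
    print(plan(conn, "did = ? AND uid = ?", (1, 9642)))
```

A condition on DID and UID alone cannot use the leading key column, so it is answered by scanning the table, whereas the full key is answered by a direct search of the key index.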

6.4 Content Retrieval

In order to measure the retrieval time for content queries, we selected ten of the TREC queries (No. 51-100) and converted them into vector queries. Table 5 shows them with the number of retrieved results.

Table 5. Sample queries

No.   IR queries from the Concept                                                Number of retrieved elements
                                                                                 PAR      DOC
057   MCI Communications Corp.                                                   3289     591
063   batch, interactive, process, user interface                                14650    4012
065   storage, database, data, query                                             13252    2033
067   students, agitators, dissidents                                            283      55
068   fine-diameter fibers, ceramic, mineral-wool, glass, asbestos, cellulose    4883     1157
075   increased efficiency, smaller payroll, work force reduction                16283    5404
077   poaching, illegal hunting, fishing, trapping, equipment                    2946     945
082   genetically engineered product, plant, animal, drug, microorganism,
      vaccine, agricultural product                                              16320    2970
083   Greenhouse effect, global warming, carbon dioxide buildup                  9763     2948
096   diagnosis, scanning, testing                                               5412     1080

Table 6 shows the retrieval time for the ten queries. It compares our system with the system that stores the postings in files [9]. As expected, the Oracle version is normally slower than the file system version, but the difference is not large. One thing to note is that the Oracle version is faster than the file system version for some queries; this happens for most of the queries at the PAR level. In particular, the Oracle version performs better than the file system one when a significant portion of postings is filtered out owing to a mismatch of the element type. In Oracle, this filtering is fast if we give a condition in the retrieval. In the file system approach, on the other hand, each posting has to be checked one by one, which adds a fair amount of time.

Table 6. The comparison of retrieval time between the file system and Oracle version

6.5 Structural Retrieval

We also measured the retrieval time for structural queries. For this, we made several structural queries arbitrarily according to the syntax suggested in [10]. Table 7 shows the structural retrieval time.

Table 7. Retrieval time for structural queries

No.   Query          Number of retrieved elements   Time
1                    6711                           2.13
2     child[]        6194                           2.05
3     child          6194                           2.06
4     <GOVT>child    129                            0.27
5     in<ABST>       220                            0.35
6                    0                              2.01
7     <PARN>         35                             0.89

These queries inquire about parent-child relationships (queries 2, 3 and 4), an ancestor-descendant relationship (query 5), a cardinality relationship (query 6) and a positional relationship (query 7). The retrieval times for these queries are rather shorter than those for content queries. One important reason is that the number of results is normally smaller than that for content queries. For both structural and content queries, retrieval takes longer as the number of retrieved elements increases, because most of the time is spent extracting matched records from the database into memory.

7. CONCLUSION AND FUTURE WORKS

This paper proposes a database technique that enables efficient incremental update of indices for structured documents. Incremental update of indices is very important, especially in the business domain, since out-of-date indices do not help at all and may even lead in the wrong direction. As XML has emerged as one of the standards for representing document structures, it has been gaining growing attention in many business applications such as EDI (Electronic Data Interchange). This paper implements BUS (Bottom Up Scheme), which was developed for efficient indexing and retrieval of structured documents, in Oracle. As BUS was originally implemented using the inverted file, it does not support the update of indices well. We employ the Index-Organized tables in Oracle, which are specially designed to work well in information retrieval, and store the indices in these tables. We carry out several experiments and measure the performance in several aspects. Although the Oracle version performs worse than the file system version in terms of index time and space overhead, it works very well in the update of content or element structures. Moreover, considering the inherent overhead of the Index-Organized tables in full-text search, the Oracle version does not add substantial overhead in supporting both structure and incremental update of indices. And in some cases, retrieval is even faster than in the file system version.

Some work remains to be done. First, an optimization technique for mixed queries of content, structure and attribute deserves to be studied. Retrieval may be much slower than the current figures on very large collections. Secondly, a more accurate analysis of attributes is required in the future. In business applications, attribute values play a very important role in retrieval. Hence how to provide attribute

8. ACKNOWLEDGMENTS

The second author is on sabbatical leave from Chungnam National University. We are grateful to Dr. Alexa McCray and Ms. May Cheh in Lister Hill National Center for Biomedical Communications, National Library of Medicine, for allowing us to continue the work and for providing the necessary support.

REFERENCES

[1] Bray T., Paoli J., and Sperberg-McQueen C.M., Extensible Markup Language 1.0, W3C Recommendation REC-xml-19980210, 1998.
[2] DeFazio S., "Integrating IR and RDBMS Using Cooperative Indexing," Proc. of SIGIR 95, 1995, 84-92.
[3] Harman D., "Overview of the Second Text Retrieval Conference," Proc. of the Second Text Retrieval Conference (TREC-2), 1994, 1-20.
[4] Herwijnen E., Practical SGML, Second Edition, Kluwer Academic Publishers, 1994.
[5] Lee Y.K., Yoo S.J., Yoon K., Berra B., "Index Structures for Structured Documents," Proc. Digital Library '96, 1996, 91-99.
[6] Macleod I.A., "Storage and Retrieval of Structured Documents," Information Processing & Management, Vol. 26, No. 2, 1990, 197-208.
[7] Navarro G., and Baeza-Yates R., "Proximal Nodes: A Model to Query Document Databases by Contents and Structure," ACM transaction on SIGMOD, 1996.
[8] Oracle, Oracle 8 Server Administrator's Guide, 1997.
[9] Shin D.W., Jang H.C., and Jin H-L., "BUS: An Effective Indexing and Retrieval Scheme in Structured Documents," in the Proc. of Digital Libraries 98, 235-243.
[10] Shin D.W., Jang H.C., and Nam H.J., "Structured querying, indexing and retrieval for SGML/XML documents," Proc. of SGML/XML Japan '98, 1998, 199-216.
[11] Thom J.A., Zobel J., and Grima B., "Design of Indexes for Structured Documents," CITRI/TR-958, Department of Computer Science, RMIT, 1995.