Efficient Transformation Scheme for Indexing Continuous Queries on RFID Streaming Data*

Jaekwan Park1, Bonghee Hong1, Chaehoon Ban2
1 Department of Computer Engineering, Pusan National University, {jkpack,bhhong}@pusan.ac.kr
2 Department of Internet Application, Kyungnam College of Information & Technology, [email protected]

Abstract RFID data are usually regarded as streaming data that are large in volume and change frequently, because they are gathered continuously by numerous readers. RFID middleware filters the data acquired from the readers to process queries from applications. To enhance the performance of the middleware, an index must be built to process the queries efficiently. Several approaches that build an index on queries rather than on data records, called query indexes, have been proposed and are widely used to evaluate continuous queries over streaming data. EPCglobal proposed the Event Cycle Specification (ECSpec) model, which is a standard query interface for RFID applications. The problem with using any of the existing query indexes with the ECSpec model is that it takes a long time to build the index and to process queries, because each query consists of a large number of segments. To solve this problem, we propose an aggregate transformation that converts a group of segments into compressed data. We then measure, through experimental evaluation, the performance improvement that the proposed technique brings to the index.

1. Introduction Radio frequency identification (RFID) technology has emerged as a practical solution for automatic object identification and tracking. RFID systems consist of electronic tags attached to physical objects, readers that sense the tags, and middleware that processes the RFID data in response to requests from RFID applications. There are various applications for RFID systems, such as automated identification, asset tracking, and supply chain management. In these

applications, RFID data are streaming data that are gathered continuously by numerous readers [5]. Middleware systems collect and filter the data acquired from the readers to process requests of applications. These requests are called continuous queries because they are executed continuously during tag movement. The middleware systems must build an index to process these continuous queries efficiently. The approach that builds an index on queries rather than data records, called query index, is suited for RFID middleware systems because there are fewer queries than data records and the queries are active during certain periods. Several methods related to the query index approach have been proposed. The CQI [3] and VCR [4] methods build an index on continuous range queries over the location data stream of moving objects. These methods divide the continuous query using fixed-size rectangles and insert the rectangles into the query index. The search in these methods is to find the rectangles containing the positions of the moving objects. Sensor data stream systems, such as NiagaraCQ [6] and TelegraphCQ [7], adopt a query indexing scheme. They use a predicate index that is similar to the IBS-tree [8], which is a binary search tree that is created for each attribute. This method stores the ranges of an attribute value that makes a query predicate true. The search in this method finds continuous queries whose predicates are satisfied by a value sensed on a sensor node. EPCglobal, which is a standard association devoted to RFID systems, has established the ECSpec (Event Cycle Specification) [1], which is a standard query interface for RFID applications. There are predicates as parameters of the interface for setting up various queries. The predicates are composed of filtering

* “This work was supported by the Korea Research Foundation Grant funded by the Korean Government(MOEHRD)” (The Regional Research Universities Program/Research Center for Logistics Information Technology)

Second International Conference on Systems and Networks Communications (ICSNC 2007) 0-7695-2938-0/07 $25.00 © 2007

conditions of RFID readers and tags. As an example of an ECSpec, consider “readerID = 1~3, EPC pattern = 10.[1-2].[3001-4000]”. The first predicate is a range of reader identifiers and the second predicate is an EPC (Electronic Product Code) pattern of tags. Note that the EPC pattern is represented as a set of ranges. To follow the standard, the RFID query index should be built on continuous queries representing the predicates of the ECSpec, and the index should support stabbing queries [9], which occur when an RFID reader identifies each tag individually. In this case, the continuous query is represented as a number of segments in the 2-dimensional space composed of the Reader Identification Domain (RID) and the Tag Identification Domain (TID), because it has a range on RID and a set of ranges on TID. The above example is represented as two segments: {(1, 10·2^60 + 1·2^36 + 3001), (3, 10·2^60 + 1·2^36 + 4000)} and {(1, 10·2^60 + 2·2^36 + 3001), (3, 10·2^60 + 2·2^36 + 4000)}†. In previous studies, the query indexes have treated simple queries as the data, such as a region continuous query or an interval continuous query. However, the data in an RFID query index is represented as a number of segments, because the data is a continuous query from the ECSpec. The problem with using any of the existing query index schemes for such data is that it takes a long time to build the index and to evaluate stabbing queries. Building an index on such data is very time consuming, because a huge number of segments must be inserted into the index to store a single continuous query from the application. Processing stabbing queries that find data in the query index is also inefficient, because the index becomes very large after many insertions. In this paper, we propose an effective technique for indexing RFID continuous queries. The basic idea is to convert a number of segments into compressed data and to store the result as one object.
To do this, we analyze the continuous queries and define their congruence relationship and regular repetition. Then, we propose a transformation technique, called aggregate transformation, which finds a repeated group of segments and converts the group into compressed data. That is, the compressed data is the transformed representation of the repeated segments.

The remainder of this paper is structured as follows. Section 2 identifies the problem of indexing RFID continuous queries. Section 3 suggests an effective technique for indexing RFID continuous queries. Section 4 presents the experimental results and Section 5 concludes the paper.

† We assume that the EPC of a tag is a 96-bit General Identifier (GID-96) as defined in the Tag Data Standard [2]. The EPC is composed of three parts: Industry (28 bits), Product (24 bits) and Serial (36 bits).

2. Problem Definition The important characteristics of RFID streaming data are that they arrive continuously, rapidly, without bound, and in real time. Traditional data indexes on these data suffer from frequent updates as tags are acquired. In addition, queries over RFID streaming data are not one-time queries but continuous queries that run over a period of time. For moving objects and sensor streaming data, previous studies [3, 4, 6, 7] proposed query indexes to avoid updating the index and to improve the processing of continuous queries. In contrast to traditional data indexes, these methods build an index on the continuous queries. This approach is especially suited for evaluating continuous queries over streaming data, because the queries are active for a period of time and the data arrive continuously.

The query index approach can also be applied to processing continuous queries over RFID streaming data. However, the data of the RFID query index differ from those of the previous studies. The data in the query index, which we call query data, are continuous queries from the applications.

Definition 1. Query Data represents a continuous query derived from an ECSpec.

The query data is composed of filtering conditions for readers and tags. The condition on readers is represented as a range on RID. The condition on tags is represented as an EPC pattern, which is a set of ranges on TID. Therefore, the query data is an object that has a range on the RID axis and a set of ranges on the TID axis.

The query data in 2-dimensional space (RID, TID) are complex objects composed of discrete segments. For example, as shown in Fig. 1, if a user searches for information about “in warehouse A, items that are mobile phones from SAMSUNG Electronics made this year”, then the application sends an ECSpec including CQ1: readerID = 1, EPC_Pattern = 10.[1-3].[3001-4000], assuming that the ID of the reader installed in warehouse A is 1.
CQ1 arrives at the RFID middleware and is inserted into the query index. The query data has three discrete segments: the first is {(1, 10·2^60 + 1·2^36 + 3001), (1, 10·2^60 + 1·2^36 + 4000)}, the second is {(1, 10·2^60 + 2·2^36 + 3001), (1, 10·2^60 + 2·2^36 + 4000)} and the third is {(1, 10·2^60 + 3·2^36 + 3001), (1, 10·2^60 + 3·2^36 + 4000)}. The reason is that the EPC pattern specifies discrete ranges on TID.
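The expansion of CQ1 into segments can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `tid` and `expand_query` are hypothetical, and the values (manager 10, products 1 to 3, serials 3001 to 4000, reader 1) are read off the three segments listed above. The shifts follow the GID-96 layout assumed in the paper (24-bit product, 36-bit serial).

```python
from typing import List, Tuple

# A segment is ((min_rid, min_tid), (max_rid, max_tid)) in (RID, TID) space.
Segment = Tuple[Tuple[int, int], Tuple[int, int]]

MANAGER_SHIFT = 60  # product (24 bits) + serial (36 bits) lie below the manager part
PRODUCT_SHIFT = 36  # serial (36 bits) lies below the product part


def tid(manager: int, product: int, serial: int) -> int:
    """Map the three EPC parts to one integer: manager*2^60 + product*2^36 + serial."""
    return (manager << MANAGER_SHIFT) + (product << PRODUCT_SHIFT) + serial


def expand_query(rid_lo: int, rid_hi: int, manager: int,
                 p_lo: int, p_hi: int, s_lo: int, s_hi: int) -> List[Segment]:
    """Return one segment per product value in [p_lo, p_hi]."""
    return [((rid_lo, tid(manager, p, s_lo)), (rid_hi, tid(manager, p, s_hi)))
            for p in range(p_lo, p_hi + 1)]


# CQ1: reader 1, manager 10, products 1..3, serials 3001..4000
cq1 = expand_query(1, 1, 10, 1, 3, 3001, 4000)
for seg in cq1:
    print(seg)
```

Each product value in the range contributes exactly one segment, which is why a wide product range makes the query data expensive to store segment by segment.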


[Figure 1: meta-information tables (Product: value, year, serial; Company: value, e.g. SAMSUNG = 10, LG = 11) and the flow of the user request “in warehouse A, items that are mobile phones from SAMSUNG Electronics made this year” through the application, in ECSpec format, as CQ1 inserted into the query index.]

Figure 1. An example of query data

The query data consists of many segments if the product and serial values of the EPC pattern are ranges. As shown in Fig. 1, the size of a segment depends on the range of serial values, and the number of segments is equal to the number of values in the range of product values. The query data is a complex object composed of at most 2^24 segments, because the length of the product code within the tag ID is 24 bits according to the TDS [2]. For this reason, it is very time consuming to build an index on these data, because many segments must be inserted into the index to store a single continuous query from an application. It is also inefficient to process stabbing queries, because the index becomes very large after many insertions. To avoid many insertions, multi-dimensional indexes, such as the R-tree [11], can be applied to these query data. In this case, the index must store data in a space of four dimensions: RID, manager, product and serial. However, the query performance of such an index deteriorates exponentially as the number of dimensions grows [10]. Thus, traditional indexes can avoid many insertions of data, but they do not process queries efficiently.
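To make the cost argument concrete, the following sketch stores every segment of a query in a flat list and answers a stabbing query by scanning it. It is a deliberately naive baseline under stated assumptions, not one of the indexes evaluated in the paper; `insert_query` and `stab` are hypothetical names.

```python
from typing import List, Tuple

# (min_rid, min_tid, max_rid, max_tid)
Segment = Tuple[int, int, int, int]

# The product code is 24 bits long, so one query can produce up to 2^24 segments.
MAX_SEGMENTS = 1 << 24


def insert_query(index: List[Segment], rid_lo: int, rid_hi: int, manager: int,
                 p_lo: int, p_hi: int, s_lo: int, s_hi: int) -> int:
    """Insert one segment per product value and return how many were inserted."""
    before = len(index)
    for p in range(p_lo, p_hi + 1):
        base = (manager << 60) + (p << 36)
        index.append((rid_lo, base + s_lo, rid_hi, base + s_hi))
    return len(index) - before


def stab(index: List[Segment], rid: int, tid: int) -> List[Segment]:
    """Return every stored segment that contains the point (rid, tid)."""
    return [s for s in index if s[0] <= rid <= s[2] and s[1] <= tid <= s[3]]


index: List[Segment] = []
inserted = insert_query(index, 1, 1, 10, 1, 1000, 3001, 4000)
hits = stab(index, 1, (10 << 60) + (500 << 36) + 3500)
print(inserted, len(hits), MAX_SEGMENTS)
```

A single query with a product range of 1,000 values already costs 1,000 insertions here, and every stabbing query has to consider all of them.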

3. Efficient Transformation Scheme Query data in the query index are dependent upon the EPC pattern. Thus, it is necessary to study all the EPC patterns to discover properties of the query data. We analyzed 27 patterns (3 × 3 × 3), because each of the three parts of the pattern is a constant, a range [low–high] or *.

Table 1. The results of a case study of EPC patterns. Each variable (e.g. a, a1 or a2) is a constant value.

a. x. x Patterns             [a1 – a2] Patterns                   *. x. x Patterns
a. b. c                      [a1 – a2]. b. c                      *. b. c
a. b. *                      [a1 – a2]. b. *                      *. b. *
a. b. [c1 – c2]              [a1 – a2]. b. [c1 – c2]              *. b. [c1 – c2]
a. *. c                      [a1 – a2]. *. c                      *. *. c
a. *. *                      [a1 – a2]. *. *                      *. *. *
a. *. [c1 – c2]              [a1 – a2]. *. [c1 – c2]              *. *. [c1 – c2]
a. [b1 – b2]. c              [a1 – a2]. [b1 – b2]. c              *. [b1 – b2]. c
a. [b1 – b2]. *              [a1 – a2]. [b1 – b2]. *              *. [b1 – b2]. *
a. [b1 – b2]. [c1 – c2]      [a1 – a2]. [b1 – b2]. [c1 – c2]      *. [b1 – b2]. [c1 – c2]
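The 27 syntactic combinations can be enumerated mechanically. The sketch below only reproduces the pattern list of Table 1 in the same order; it does not decide which cases are meaningful, and the variable names are illustrative.

```python
from itertools import product

# Each part of an EPC pattern is a constant, a range, or the wildcard *.
MANAGER_FORMS = ("a", "[a1-a2]", "*")
PRODUCT_FORMS = ("b", "*", "[b1-b2]")
SERIAL_FORMS = ("c", "*", "[c1-c2]")

# product() varies the last part fastest, matching the row order of Table 1.
patterns = [".".join(parts)
            for parts in product(MANAGER_FORMS, PRODUCT_FORMS, SERIAL_FORMS)]

for i in range(0, len(patterns), 9):
    print(patterns[i:i + 9])
```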

The grayed 11 cases are meaningful patterns among the 27 cases of syntactic combination. The first part of the EPC pattern identifies the manager, the second the product and the third the serial. The purpose of the manager is to give a unique global number to each industry, while the product and serial are numbered according to the local rules of each industry. Therefore, cases in which the manager part of the EPC pattern is a range are illegal, with the exception of and . The results of the case study show that the query data consist of single or multiple segments, and that the number of segments in multiple cases is related to the value of the product part.

Query data consists of single or multiple segments in 2-dimensional space (RID, TID). We call query data with only one segment simple query data, and query data that is composed of multiple segments complex query data. Complex query data has two or more segments, up to a maximum of 2^24 segments. A segment, called a segment of query data, is the smallest unit of the complex query data. If we assume that query data is d, then d = {d1, … , dn} where 1 ≤ n ≤ 2^24. A segment of query data is represented as di = {(minrid, mintid), (maxrid, maxtid)} where di ∈ d.

Complex query data has two or more discrete segments, and they are related to each other geometrically. The relationship between the segments of the complex query data is congruence: their shape and size are the same, but they are located at different positions in the space.

Fact 1. The general form of the EPC pattern in complex query data is m1.[p1 – p2].[s1 – s2], the format also used in Table 2. Assume that complex query data is composed of a range (rida, ridb) on the RID axis and an EPC pattern on the TID axis. Assume that the complex query data d is composed of the segments {d0, d1, … , dp2-p1}. Then, a congruence relationship (denoted by ≡) exists between di and dj where 0 ≤ i < j ≤ p2 – p1.
Another property of complex query data is that segments of query data are located with regular gaps along the TID axis. Each segment covers the same RID range, and the distance in TID between the starting point of the i-th segment di and the starting point of the (i+1)-th segment di+1 is always 2^36.

Fact 2. The segments of query data appear repeatedly with a regular gap (2^36) on the TID axis.

It is not practical to insert all the segments of complex query data into the query index because of the problems of insertion time and storage cost. Therefore, we must convert the complex query data into a simplified form. We propose a new transformation technique, called aggregate transformation, using the properties relating segments of query data described in Facts 1 and 2. Complex query data has the properties of congruence and regular repetition between segments of query data. To simplify the complex query data, it is important to find the repeated form of the data. A regular grid structure is a good conceptual tool to extract a regularly repeated shape of the complex query data, because it is composed of fixed cells that are repeated regularly, like the complex query data.
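Both properties can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it builds the segments for manager 10, products 1 to 3 and serials 3001 to 4000 on reader 1, then tests congruence (equal extents on both axes) and the regular gap of 2^36 between consecutive starting points. The helper names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[Tuple[int, int], Tuple[int, int]]  # ((min_rid, min_tid), (max_rid, max_tid))

GAP = 1 << 36  # one step of the product value moves a segment by 2^36 on the TID axis


def make_segments(rid_a: int, rid_b: int, m: int,
                  p1: int, p2: int, s1: int, s2: int) -> List[Segment]:
    """One segment per product value in [p1, p2]."""
    return [((rid_a, (m << 60) + (p << 36) + s1), (rid_b, (m << 60) + (p << 36) + s2))
            for p in range(p1, p2 + 1)]


def extent(seg: Segment) -> Tuple[int, int]:
    """Size of a segment on the RID axis and on the TID axis."""
    (r0, t0), (r1, t1) = seg
    return (r1 - r0, t1 - t0)


def congruent(a: Segment, b: Segment) -> bool:
    """Fact 1: same shape and size, regardless of position."""
    return extent(a) == extent(b)


def regular_gap(segments: List[Segment]) -> bool:
    """Fact 2: consecutive starting points are exactly 2^36 apart on TID."""
    return all(b[0][1] - a[0][1] == GAP for a, b in zip(segments, segments[1:]))


segments = make_segments(1, 1, 10, 1, 3, 3001, 4000)
print(all(congruent(segments[0], s) for s in segments), regular_gap(segments))
```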

The idea is shown in Fig. 2. Figure 2-(a) shows that complex query data d consists of three discrete segments d1, d2 and d3 in 2-dimensional space. If the complex query data is overlaid with a fixed grid with cell size 2^36 on the TID axis and with the maximum length on the RID axis, then d1, d2 and d3 have the same shape in each cell space according to Facts 1 and 2. However, they have different cell IDs in the grid; we assume that the cell IDs are c, c + 1 and c + 2. Now, if we represent d1, d2 and d3 in 3-dimensional space, adding a cell ID dimension to the cell space (RID, TIDcell), then d1, d2 and d3 are identical in the cell space and they have cell IDs from c to c + 2 in sequence, as shown in Fig. 2-(b). Finally, we can generate new rectangular data by aggregating d1, d2 and d3, because they are continuous on the Cell ID axis. We call this object aggregated data; it is representative of the segments of query data, and its representation is {(1, 3001, c), (1, 4000, c + 2)}.

Figure 2. An example of aggregate transformation

Unifying these steps, we obtain the transform formulas, which allow the transformation to be carried out more quickly. We assume that query data is composed of a range (rida, ridb) on the RID axis and an EPC pattern on the TID axis. Let d0, d1, … , and dp2-p1 be the segments of query data. Let Cellsize be the length of the grid cell on the TID axis, let c1 be the ID of the cell containing (rida, d0.mintid), and let c2 be the ID of the cell containing (ridb, dp2-p1.maxtid). Let h1 and h2 be the shortest and longest distance on the TID axis between a segment di and the bottom-left point of the cell containing the bottom-left point of di. The transform formulas are summarized in Table 2. Formulas for other cases can be derived in the same manner.

Table 2. Transform formulas for complex query data with format ReaderID = rida ~ ridb and EPC Pattern = m1.[p1 – p2].[s1 – s2].

Row keys: Cellsize (≤ 2^36 or > 2^36), Size of segment (< Cellsize or ≥ Cellsize), Number of overlapped cells (1 or ≥ 2).
Transform formula column, in table order: AD = Formula1; AD1 = Formula2, AD2 = Formula3; AD = Formula1; AD = Formula4; -.

Formula1 = {(rida, h1, c1), (ridb, h2, c2)}
Formula2 = {(rida, h1, c1), (ridb, Cellsize, c2)}
Formula3 = {(rida, 0, c1 + 1), (ridb, MOD(h2, Cellsize), c2 + 1)}
Formula4 = {(rida, h1 − MOD(p1, 2^(K−36))·2^36, c1), (ridb, h2 + MOD(p1, 2^(K−36))·2^36, c2)}

The aggregate transformation solves the indexing problems of long insertion times and the huge storage needed to store all the segments of complex query data. Our technique establishes two properties of the data in the RFID query index: congruence and repetition. Based on these properties, it transforms the complex query data in 2-dimensional space (RID, TID) into simple data in 3-dimensional space (RID, TIDcell, CID). This is the simplification achieved by our transformation technique.
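The simplest case of the transformation can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that the cell size equals the 2^36 gap and that every segment lies inside a single cell, so that the aggregated data has the form {(rida, h1, c1), (ridb, h2, c2)}. The names `aggregate` and `stab` are hypothetical.

```python
from typing import Tuple

CELL = 1 << 36  # assumed cell size on the TID axis, equal to the segment gap

# Aggregated data: ((rid_a, h1, c1), (rid_b, h2, c2)) in (RID, TIDcell, CID) space.
Aggregated = Tuple[Tuple[int, int, int], Tuple[int, int, int]]


def aggregate(rid_a: int, rid_b: int, manager: int,
              p1: int, p2: int, s1: int, s2: int) -> Aggregated:
    """Collapse the p2 - p1 + 1 congruent segments into one 3-dimensional box."""
    first_min = (manager << 60) + (p1 << 36) + s1  # d0.mintid
    last_max = (manager << 60) + (p2 << 36) + s2   # maxtid of the last segment
    c1, h1 = divmod(first_min, CELL)  # cell ID and offset inside the cell
    c2, h2 = divmod(last_max, CELL)
    return ((rid_a, h1, c1), (rid_b, h2, c2))


def stab(ad: Aggregated, rid: int, tid: int) -> bool:
    """Stabbing query: transform the point the same way, then test containment."""
    (r0, h1, c1), (r1, h2, c2) = ad
    cid, offset = divmod(tid, CELL)
    return r0 <= rid <= r1 and h1 <= offset <= h2 and c1 <= cid <= c2


ad = aggregate(1, 1, 10, 1, 3, 3001, 4000)
print(ad)
print(stab(ad, 1, (10 << 60) + (2 << 36) + 3500))
```

With these inputs the result has the same shape as the example {(1, 3001, c), (1, 4000, c + 2)}, where c is the ID of the cell that holds the first segment. One object is stored instead of three segments, and a stabbing query is answered with a single containment test.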

4. Experimental Results In this section, we present experimental results for the index structures. To apply the aggregate transformation, we extend the insert and search algorithms of the R-tree, which is widely used for multidimensional spaces; the extended R-tree is denoted R-tree(AT). We compare the performance of the R-tree(AT) with the original R-tree [11], CQI [3] and VCR [4] on various sets of data. The R-tree(AT) is implemented as a query index for the aggregate-transformed space (RID, TIDcell, CID). The original R-tree is implemented as a query index for the 4-dimensional space (RID, manager, product, serial). Finally, the CQI and VCR indexes are the existing query indexes in 2-dimensional space (RID, TID). We applied approximations to the data of the CQI and VCR indexes because it is impossible to store all the segments of query data. All indexes are kept in main memory to support real-time processing. There are no well-known and widely accepted RFID datasets for experimental purposes. Therefore, we carried out experiments with real datasets, with uniform and skewed distributions, from the Research Center for Logistics Information Technology (RCLIT)‡. To measure the performance of the indexes, we used 100,000 stabbing queries.

‡ The RCLIT is a Korean national project for developing the next generation of logistics information technology. It focuses on logistics research and practice with IT-based technology such as ubiquitous computing and RFID systems.

[Figure 3: storage (KB) versus number of query data (5,000; 10,000; 50,000; 100,000) for CQI, VCR, R-tree and R-tree(AT); panels: Uniform and Skewed.]

Figure 3. Storage costs

We conducted experiments on storage costs, insertion performance and search performance with uniform and skewed distributions of the dataset. As shown in Fig. 3, the CQI and VCR indexes require high storage costs because they must decompose the query data into cells or virtual constructs and store all the fragments. The R-tree(AT) is better than the others because it stores each query data item once in 3-dimensional space. The storage costs of the original R-tree are slightly higher than those of the R-tree(AT) because it stores 4-dimensional data.

[Figure 4: CPU time (ms) versus number of query data (5,000; 10,000; 50,000; 100,000) for CQI, VCR, R-tree and R-tree(AT); panels: Uniform and Skewed.]

Figure 4. Insertion performance

As shown in Fig. 4, the grid-based CQI index outperforms the others in insertion performance because its insertion process is very simple and intuitive. On the other hand, the VCR index shows lower performance than the others because it generates many partitions of data and stores them individually during insertion. However, the CQI and VCR indexes require a refinement step in query processing because they store approximate data. The R-tree(AT) needs slightly more time for insertion than the original R-tree, because of the cost of transforming 4-dimensional data into 3-dimensional data.

[Figure 5: CPU time (ms) versus number of query data (5,000; 10,000; 50,000; 100,000) for CQI, VCR, R-tree and R-tree(AT); panels: Uniform and Skewed.]

Figure 5. Search performance

In terms of search performance, the CQI and VCR indexes take a long time to process queries because they must sequentially examine every query data item in the cell or virtual construct containing the query point. Thus, the search performance of these indexes deteriorates much more quickly as the number of query data grows. The R-tree(AT) performs better than the original R-tree. The reason is that the original R-tree searches data in 4-dimensional space, whereas the R-tree(AT) searches data in the 3-dimensional transformed space. Overall, the R-tree(AT) performs better than the existing query indexes and the original R-tree. The experimental results indicate that the aggregate transformation technique improves storage costs by 25% and search performance by 21% on average.

5. Conclusion The query index approach is usually suitable for evaluating continuous queries over streaming data. Query indexes in previous studies have treated a simple query as the data, such as a region continuous query or an interval continuous query. However, the data in the RFID query index is represented as a great many segments because it is a continuous query from the ECSpec. With any of the existing methods used as the RFID query index, it is very time consuming to build an index on these data, because a great many segments must be inserted into the index to store a single continuous query. The index size also makes it inefficient to process stabbing queries. In this paper, we proposed a transformation technique for indexing complex continuous queries. We first established the properties of congruence and repetition between the segments of the continuous query. Based on these properties, we suggested a transformation technique, called aggregate transformation, which finds a repeated group among the many segments and converts the group of segments into aggregated data. The aggregated data is representative of the segments. That is, it allows the query index to store one object instead of inserting all the segments. We carried out an extensive evaluation to measure the performance improvement achieved by the proposed technique, and compared it with existing query indexes. The experiments show that the extended R-tree using the proposed technique performs better than the others on various datasets.

References
[1] EPCglobal, "The Application Level Event (ALE) Specification Version 1.0", EPCglobal Standard Specification, 2005.
[2] EPCglobal, "EPC™ Tag Data Standards Version 1.3", EPCglobal Standard Specification, 2005.
[3] D. V. Kalashnikov, S. Prabhakar, W. G. Aref and S. E. Hambrusch, "Efficient evaluation of continuous range queries on moving objects", In Proc. 13th DEXA, pp. 731–740, 2002.
[4] K-L. Wu, S-K. Chen and P. S. Yu, "Processing Continual Range Queries over Moving Objects Using VCR-Based Query Indexes", MobiQuitous 2004, pp. 226–235, 2004.
[5] F. Wang and P. Liu, "Temporal Management of RFID Data", In Proc. 31st VLDB Conf., pp. 1128–1139, 2005.
[6] J. Chen et al, "NiagaraCQ: A Scalable Continuous Query System for Internet Databases", ACM SIGMOD, pp. 379–390, 2000.
[7] S. Chandrasekaran et al, "TelegraphCQ: Continuous Dataflow Processing for an Uncertain World", In Proc. CIDR, pp. 269–280, 2003.
[8] E. N. Hanson et al, "A Predicate Matching Algorithm for Database Rule Systems", ACM SIGMOD Record, pp. 271–280, 1990.
[9] M. de Berg, M. van Kreveld, M. Overmars and O. Schwarzkopf, "Computational Geometry: Algorithms and Applications", 2nd ed., Springer-Verlag, Berlin, 2000.
[10] C. Böhm, S. Berchtold and D. Keim, "Searching in High-Dimensional Spaces-Index Structures for Improving the Performance of Multimedia Databases", ACM Computing Surveys, Vol. 33, pp. 322–373, 2001.
[11] A. Guttman, "R-trees: a dynamic index structure for spatial searching", Proc. ACM SIGMOD International Conference on Management of Data, pp. 47–57, 1984.
