Deriving High Confidence Rules from Spatial Data using Peano Count Trees*

William Perrizo, Qin Ding, Qiang Ding, and Amalendu Roy

Department of Computer Science, North Dakota State University, Fargo, ND 58105-5164, USA

{William_Perrizo, Qin_Ding, Qiang_Ding, Amalendu_Roy}@ndsu.nodak.edu

Abstract. The traditional task of association rule mining is to find all rules with high support and high confidence. In some applications, such as mining spatial datasets for natural resource location, the task is to find high confidence rules even though the support may be low. In still other applications, such as the identification of agricultural pest infestations, the task is to find high confidence rules preferably while the support is still very low. The basic Apriori algorithm cannot be used to solve these problems efficiently since it relies on first identifying all high support itemsets. In this paper, we propose a new model to derive high confidence rules for spatial data regardless of their support level. A new data structure, the Peano Count Tree (P-tree), is used in our model to represent all the information we need. P-trees represent spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. Based on the P-tree, we build a special data cube, the Tuple Count Cube (T-cube), to derive high confidence rules. Our algorithm for deriving confident rules is fast and efficient. In addition, we discuss some strategies for avoiding over-fitting (removing redundant and misleading rules).

1 Introduction

Association rule mining [1,2,3,4,5], proposed by Agrawal, Imielinski and Swami in 1993, is one of the important methods of data mining. The original application of association rule mining was market basket data. A typical example is "customers who purchase one item are very likely to purchase another item at the same time". There are two accuracy measures, support and confidence, for each rule. The problem of association rule mining is to find all the rules with support and confidence exceeding user-specified thresholds. The basic algorithms, such as Apriori [1] and DHP [4], use the downward closure property of support to find frequent itemsets, whose supports are above the threshold. After obtaining all frequent itemsets, which is very time consuming, high confidence rules are derived in a straightforward way. However, in some applications, such as spatial data mining, we are also interested in rules with high confidence that do not necessarily have high support. In still other applications, such as the identification of agricultural pest infestations, the task is to find high confidence rules preferably while the support is still very low.

* This work was partially supported by a U.S. GSA VAST grant. Patents are pending on the P-tree Data Mining Technology.

In these cases, the traditional algorithms are not suitable. One might think that we could simply set the minimal support to a very low value, so that high confidence rules with almost no support limit can be derived. However, this would lead to a huge number of frequent itemsets and is, thus, impractical. In this paper, we propose a new model, including new data structures and algorithms, to derive "confident" rules (high-confidence-only rules), especially for spatial data. We use a data structure, called the Peano Count Tree (P-tree), to store all the information we need. A P-tree is a quadrant-based count tree. From the P-trees, we build a data cube, the Tuple Count Cube or T-cube, which exposes confident rules. We also use attribute precision concept hierarchies and a natural rule ranking to prune the complexity of our data mining algorithm.

The rest of the paper is organized as follows. In Section 2, we provide some background on spatial data formats. In Section 3, we describe the data structures we use for association rule mining, including P-trees and T-cubes. In Section 4, we detail our algorithm for deriving confident rules. Performance analysis and implementation issues are given in Section 5, followed by related work in Section 6. Finally, the conclusion is given.

2 Formats of Spatial Data

There are huge amounts of spatial data on which we can perform data mining to obtain useful information [16]. Spatial data are collected in different ways and are organized in different formats. BSQ, BIL and BIP are three typical formats. An image contains several bands. For example, a TM6 (Thematic Mapper) scene contains six bands, while a TM7 scene contains seven bands (Blue, Green, Red, NIR, MIR, TIR, MIR2), each of which contains reflectance values in the range 0~255. An image can be organized into a relational table in which each pixel is a tuple and each spectral band is an attribute. The primary key can be the latitude and longitude pair, which uniquely identifies each pixel. BSQ (Band Sequential) is a similar format, in which each band is stored as a separate file and raster order is used within each band. TM scenes are in BSQ format. BIL (Band Interleaved by Line) is another format, in which all the bands are organized in one file and the bands are interleaved by row (the first row of all bands is followed by the second row of all bands, and so on). In the BIP (Band Interleaved by Pixel) format, there is also just one file, in which the first pixel-value of the first band is followed by the first pixel-value of the second band, ..., the first pixel-value of the last band, followed by the second pixel-value of the first band, and so on. See Fig. 1 for an example.

In this paper, we propose a new format, called bSQ (bit Sequential), to organize images. The reflectance values of each band range from 0 to 255, represented as 8 bits. We split each band into a separate file for each bit position. Fig. 1 also gives an example of the bSQ format. There are several reasons to use the bSQ format. First, different bits contribute to the value to different degrees; in some applications we do not need all the bits, because the high-order bits alone give enough information. Second, the bSQ format

facilitates the representation of a precision hierarchy. Third, and most importantly, the bSQ format facilitates the creation of an efficient, rich data structure, the P-tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach. We give a very simple illustrative example (Fig. 1) with only 2 data bands for a scene having only 2 rows and 2 columns (both decimal and binary representations are shown).

BAND-1                           BAND-2
254  127                          37  240
(11111110) (01111111)            (00100101) (11110000)
 14  193                         200   19
(00001110) (11000001)            (11001000) (00010011)

bSQ format (16 files; each column is one bit file, each row one pixel in raster order):
B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1

BSQ format (2 files):
Band 1: 254 127 14 193
Band 2: 37 240 200 19

BIL format (1 file): 254 127 37 240 14 193 200 19
BIP format (1 file): 254 37 127 240 14 200 193 19

Fig. 1. Two bands of a 2-row-2-column image and its BSQ, BIP, BIL and bSQ formats
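The conversion from BSQ to bSQ is simple enough to sketch in a few lines. The following Python fragment (our illustration, not the authors' tool; the function name is ours) splits a band's 8-bit values into eight bit sequences and checks the B11 and B18 files of Fig. 1:

def band_to_bsq(band, nbits=8):
    """Split a flat list of 8-bit band values into nbits bit files,
    ordered from the high-order bit (e.g., B11) down to the low-order
    bit (e.g., B18)."""
    return [[(v >> (nbits - 1 - j)) & 1 for v in band] for j in range(nbits)]

# Band 1 of Fig. 1, pixels in raster order: 254, 127, 14, 193.
b1 = band_to_bsq([254, 127, 14, 193])
assert b1[0] == [1, 0, 0, 1]   # file B11, as in the figure
assert b1[7] == [0, 1, 0, 1]   # file B18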

3 Data Structures

We organize each bit file in the bSQ format into a tree structure, called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree. The idea is to recursively divide the entire image into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar in construction to other data structures in the literature (e.g., Quadtrees [10] and HHCodes [14]). For example, given the 8-row-8-column image below, the P-tree is as shown in Fig. 2 (the PM-tree, a variation of the P-tree, will be discussed later).

8x8 image (1 bit per pixel):
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

P-tree (children in Peano order: upper-left, upper-right, lower-left, lower-right;
mixed 2x2 leaves show their bit patterns):
55
+-- 16
+-- 8
|     +-- 3  -> 1110
|     +-- 0
|     +-- 4
|     +-- 1  -> 0010
+-- 15
|     +-- 4
|     +-- 4
|     +-- 3  -> 1101
|     +-- 4
+-- 16

PM-tree (1 = pure-1, 0 = pure-0, m = mixed):
m
+-- 1
+-- m
|     +-- m  -> 1110
|     +-- 0
|     +-- 1
|     +-- m  -> 0010
+-- m
|     +-- 1
|     +-- 1
|     +-- m  -> 1101
|     +-- 1
+-- 1

Fig. 2. 8x8 image and its P-trees (P-tree and PM-tree)
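As an illustration of the construction just described, here is a minimal Python sketch (ours, not the authors' implementation) that recursively computes quadrant counts in Peano order, stopping at pure quadrants, and reproduces the level-0 and level-1 counts of Fig. 2:

def build_ptree(bits, r0, c0, size):
    """Return (count, children); children is None for pure quadrants."""
    count = sum(bits[r][c] for r in range(r0, r0 + size)
                           for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, None)                         # pure quadrant: branch ends
    half = size // 2
    kids = [build_ptree(bits, r0, c0, half),         # quadrant 0: upper-left
            build_ptree(bits, r0, c0 + half, half),  # quadrant 1: upper-right
            build_ptree(bits, r0 + half, c0, half),  # quadrant 2: lower-left
            build_ptree(bits, r0 + half, c0 + half, half)]  # 3: lower-right
    return (count, kids)

# The 8x8 image of Fig. 2: root count 55, level-1 counts 16, 8, 15, 16.
img = [[1,1,1,1,1,1,0,0],
       [1,1,1,1,1,0,0,0],
       [1,1,1,1,1,1,0,0],
       [1,1,1,1,1,1,1,0],
       [1,1,1,1,1,1,1,1],
       [1,1,1,1,1,1,1,1],
       [1,1,1,1,1,1,1,1],
       [0,1,1,1,1,1,1,1]]
root = build_ptree(img, 0, 0, 8)
assert root[0] == 55 and [k[0] for k in root[1]] == [16, 8, 15, 16]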

In this example, 55 is the number of 1's in the entire image. This root level is labeled level 0. The numbers at the next level (level 1), 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are composed entirely of 1-bits (called "pure-1 quadrants"), we do not need subtrees for these two quadrants, so these branches terminate. Similarly, quadrants composed entirely of 0-bits are called "pure-0 quadrants" and also terminate their branches. This pattern continues recursively, using the Peano or Z-ordering of the four sub-quadrants at each new level. Every branch terminates eventually (at the leaf level, each quadrant is a pure quadrant). If we were to expand all subtrees, including those for pure quadrants, the leaf sequence would be just the Peano ordering (or Z-ordering) of the original raster image. Thus, we use the name Peano Count Tree.

We note that the fan-out of the P-tree need not be limited to 4. It can be any power of 4 (effectively skipping that number of levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression for each bSQ file.

For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each bit position. For band B1, we label the basic P-trees P11, P12, ..., P18. Pij is a lossless representation of the jth bits of the values from the ith band. In addition, Pij provides the 1-bit count for every quadrant of every dimension. Finally, we note that these P-trees can be generated quite quickly and can be viewed as a "data mining ready", lossless format for storing spatial data.

The 8 basic P-trees defined above can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to produce P-trees for the original values in a band (at any level of precision: 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in 1-bit, 2-bit, ..., or 8-bit precision. Pb,v is called a value P-tree. Using the full 8-bit precision (all 8 bits) for values, the value P-tree Pb,11010011 can be constructed by ANDing basic P-trees (for each 1-bit) and their complements (for each 0-bit):

Pb,11010011 = Pb1 AND Pb2 AND Pb3' AND Pb4 AND Pb5' AND Pb6' AND Pb7 AND Pb8

where ' indicates the bit-complement (which is simply the P-tree with each count replaced by its count complement in each quadrant). From value P-trees, we can construct tuple P-trees, which are P-trees recording the 1-bit counts for tuples. The tuple P-tree for tuple (v1,v2,...,vn), denoted P(v1,v2,...,vn), is:

P(v1,v2,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn

where n is the total number of bands.

Basic (bit) P-trees (e.g., P11, P12, ..., P21, ..., P88)
      |  AND
      v
Value P-trees (e.g., P1,001)
      |  AND
      v
Tuple P-trees (e.g., P001,010,111,011,001,110,011,101)

Fig. 3. Basic P-trees, Value P-trees (for 3-bit values) and Tuple P-trees
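The value P-tree identity above can be checked at the bit-plane level. The sketch below is ours (a real implementation would AND the compressed trees, not flat bit planes): it ANDs the bit planes of a band, complementing each plane where the target value has a 0-bit, and verifies the root count on the Band-1 data of Fig. 1:

def value_root_count(band, v, nbits=8):
    """Root count of the value P-tree P_{b,v}: AND the band's bit planes,
    complementing plane j wherever bit j of v is 0, then count the 1s."""
    acc = [1] * len(band)
    for j in range(nbits):
        plane = [(p >> (nbits - 1 - j)) & 1 for p in band]   # basic P-tree bits
        want = (v >> (nbits - 1 - j)) & 1
        acc = [a & (b if want else 1 - b) for a, b in zip(acc, plane)]
    return sum(acc)

# Band 1 of Fig. 1: only the pixel 254 has value 11111110.
assert value_root_count([254, 127, 14, 193], 0b11111110) == 1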

The AND operation is simply the pixel-wise AND of the bits. Before going further, we note that the process of converting the BSQ data for a TM satellite image (approximately 60 million pixels) to its basic P-trees can be done in just a few seconds on a high-performance PC. This is a one-time process. We also note that we store the basic P-trees in a form that records only the pure-1 quadrants. Using this data structure, each AND can be completed in a few milliseconds, and the result counts can be accumulated easily once the AND and COMPLEMENT program has completed.

In order to optimize the AND operation, we use a variation of the P-tree, called the PM-tree (Pure Mask tree). In the PM-tree, we use a 3-value logic to represent pure-1, pure-0 and mixed quadrants. To simplify the exposition, we use 1 for pure-1, 0 for pure-0, and m for mixed quadrants. The PM-tree for the previous example is also given in Fig. 2. The PM-tree specifies the location of the pure-1 quadrants of the operands, so that the pure-1 quadrants of the AND result can be identified by the coincidence of pure-1 quadrants in both operands, while pure-0 quadrants of the AND result occur wherever a pure-0 quadrant occurs in at least one of the operands.

For most spatial data mining, the root counts of the tuple P-trees (e.g., P(v1,v2,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn) are the numbers required, since the root counts tell us exactly the number of occurrences of a particular pattern over the space in question. These root counts can be inserted into a data cube, called the Tuple Count cube (T-cube) of the spatial dataset. Each band corresponds to a dimension of the cube, with the band values labeling that dimension. The T-cube cell at location (v1,v2,...,vn) contains the root count of P(v1,v2,...,vn). For example, assuming just 3 bands, the (v1,v2,v3)th cell of the T-cube contains the root count of P(v1,v2,v3) = P1,v1 AND P2,v2 AND P3,v3. The cube can be contracted or expanded by going up or down in the precision hierarchy.

4 Confident Rule Mining Algorithm

We begin this section with a description of the AND algorithm, which is used to compose the value P-trees and to populate the T-cube. The approach is to store only the basic P-trees and then generate value P-tree root counts "on the fly" as needed. In this algorithm we assume the P-tree is coded in its most compact form: a depth-first ordering of the paths to each pure-1 quadrant. Consider the example in Fig. 4. Each path is represented by the sequence of quadrants, in Peano order, beginning just below the root. The depth-first pure-1 path code for the first operand is: 0 100 101 102 12 132 20 21 220 221 223 23 3 (0 indicates that the entire level-1 upper-left quadrant is pure 1s; 100 indicates the level-3 quadrant reached along the branch through node 1 (2nd node) of level 1, node 0 (1st node) of level 2 and node 0 of level 3; etc.). The second operand has the depth-first pure-1 path code: 0 20 21 22 231. Since a quadrant is pure 1s in the result only if it is pure 1s in both operands (or all operands, if there are more than two), the AND is done by scanning the operands and outputting the matching pure-1 paths. This gives the result shown in Fig. 4; a code sketch of the path-matching AND follows the figure.

Operand 1:            Operand 2:            AND Result:
1 1 1 1 1 1 0 0       1 1 1 1 0 0 0 0       1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0       1 1 1 1 0 0 0 0       1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0       1 1 1 1 0 0 0 0       1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 0       1 1 1 1 0 0 0 0       1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1       1 1 1 1 0 0 0 0       1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1       1 1 1 1 0 0 0 0       1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1       1 1 0 1 0 0 0 0       1 1 0 1 0 0 0 0
0 1 1 1 1 1 1 1       1 1 0 0 0 0 0 0       0 1 0 0 0 0 0 0

(notation: count(children in Peano order); mixed 2x2 leaves written count:bits)

Operand 1    PC-tree: 55( 16  8( 3:1110  0  4  1:0010 )  15( 4  4  3:1101  4 )  16 )
             PM-tree:  m( 1   m( m:1110  0  1  m:0010 )   m( 1  1  m:1101  1 )   1 )

Operand 2    PC-tree: 29( 16  0  13( 4  4  4  1:0100 )  0 )
             PM-tree:  m( 1   0   m( 1  1  1  m:0100 )  0 )

AND Result   PC-tree: 28( 16  0  12( 4  4  3:1101  1:0100 )  0 )
             PM-tree:  m( 1   0   m( 1  1  m:1101  m:0100 )  0 )

AND Process (match the depth-first pure-1 path codes; two paths match when
either is a prefix of the other, and the longer of the two survives):

  Operand 1: 0 100 101 102 12 132 20 21 220 221 223 23 3
  Operand 2: 0 20 21 22 231
  Matches:   0&0   20&20   21&21   220&22   221&22   223&22   23&231
  RESULT:    0 20 21 220 221 223 231

Fig. 4. Operand 1, Operand 2, AND Result and AND Process
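A sketch of this path-matching AND (ours, not the authors' code; a production version would scan the two sorted path lists in a single merge pass rather than compare all pairs) reproduces the result of Fig. 4:

def and_pure1_paths(paths1, paths2):
    """AND two P-trees given as depth-first pure-1 path codes. A quadrant
    is pure-1 in the result iff some path in each operand covers it, so
    for every pair where one path is a prefix of the other, the longer
    (more specific) path is pure-1 in the result."""
    out = set()
    for p in paths1:
        for q in paths2:
            if p.startswith(q):      # q's pure-1 quadrant contains p's
                out.add(p)
            elif q.startswith(p):    # p's pure-1 quadrant contains q's
                out.add(q)
    return sorted(out)

op1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
op2 = "0 20 21 22 231".split()
assert and_pure1_paths(op1, op2) == "0 20 21 220 221 223 231".split()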

In the following, a T-cube based method for mining non-redundant, low-support, high-confidence rules is introduced. Such rules will be called confident rules. The main interest is in rules with low support, which are important in many application areas, such as natural resource searches and agricultural pest infestation identification. However, a small positive support threshold is set in order to eliminate rules that result from noise and outliers (similar to [7], [8] and [15]). A high confidence threshold is set in order to find only the most confident rules. To eliminate redundant rules resulting from over-fitting, an algorithm similar to the one introduced in [8] is used. In [8], rules are ranked based on confidence, support, rule-size and data-value ordering, respectively. Rules are compared with their generalizations for redundancy before they are included in the set of confident rules.

In this paper, we use a similar rank definition, except that we do not use support level or data-value ordering. Since the support level is expected to be very low in many spatial applications, and since we set a minimum support only to eliminate rules resulting from noise, it is not used in rule ranking. Rules are declared redundant only if they are outranked by a generalization. We choose not to eliminate a rule which is outranked only by virtue of the specific data values involved. A rule r ranks higher than a rule r' if confidence[r] > confidence[r'], or if confidence[r] = confidence[r'] and the number of attributes in the antecedent of r is less than the number in the antecedent of r'. A rule r generalizes a rule r' if they have the same consequent and the antecedent of r is properly contained in the antecedent of r'. The algorithm is given in Fig. 5.

Build the set of confident rules, C (initially empty), as follows.
  Start with 1-bit values and 2 bands; then 1-bit values and 3 bands; ...
  then 2-bit values and 2 bands; then 2-bit values and 3 bands; ...
  At each stage defined above, do the following:
    Find all confident rules (support at least minimum_support and
      confidence at least minimum_confidence) by rolling up the T-cube
      along each potential consequent set using summation.
    Compare these sums with the support threshold to isolate rule
      support sets with the minimum support.
    Compare the normalized T-cube values (divide by the rolled-up sum)
      with the minimum confidence level to isolate the confident rules.
    Place any new confident rule in C, but only if its rank is higher
      than all of its generalizations already in C.

Fig. 5. Algorithm for mining confident rules
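The following Python sketch (ours, under simplifying assumptions: the T-cube is a plain dictionary and one stage is mined per call) implements the roll-up, support check, confidence check, and generalization-based pruning of Fig. 5, and reproduces the first stage of the worked example below:

def mine_stage(cube, bands, total, min_sup, min_conf, kept):
    """cube: {value-tuple over `bands`: root count}; kept: list of rules
    (antecedent_dict, consequent_band, consequent_value, confidence)."""
    idx = {b: i for i, b in enumerate(bands)}
    for cons in bands:                       # each band as candidate consequent
        ants = [b for b in bands if b != cons]
        sums = {}                            # roll-up: antecedent support counts
        for vals, cnt in cube.items():
            key = tuple(vals[idx[b]] for b in ants)
            sums[key] = sums.get(key, 0) + cnt
        for vals, cnt in cube.items():
            key = tuple(vals[idx[b]] for b in ants)
            if sums[key] < min_sup * total:  # support set too small: skip
                continue
            conf = cnt / sums[key]           # normalized T-cube value
            if conf < min_conf:
                continue
            ant = dict(zip(ants, key))
            cv = vals[idx[cons]]
            # keep only if no generalization already in C outranks this rule
            if any(c == cons and v == cv and set(a.items()) < set(ant.items())
                   and cf >= conf for a, c, v, cf in kept):
                continue
            kept.append((ant, cons, cv, conf))
    return kept

# First stage of the Section 4 example: 1-bit values, bands B1 and B2
# (counts from Fig. 7); finds the single rule B1={0} => B2={0}, c = 83.3%.
rules = mine_stage({(0, 0): 25, (1, 0): 15, (0, 1): 5, (1, 1): 19},
                   ["B1", "B2"], total=64, min_sup=0.10, min_conf=0.80, kept=[])
assert rules == [({"B1": 0}, "B2", 0, 25 / 30)]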

The following example contains 3 bands of 3-bit spatial data in bSQ format: files B11-B13 for Band-1, B21-B23 for Band-2, and B31-B33 for Band-3, each an 8x8 bit array. The data and the PM-trees of the nine bit files are shown in Fig. 6.

Fig. 6. 8x8 image data (bSQ files B11-B13, B21-B23, B31-B33) and its PM-trees

Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The T-cube values (root counts from the P-trees) are given in Fig. 7, while the rolled-up sums and confidence thresholds are given in Fig. 8.

          B1=0   B1=1
  B2=0     25     15
  B2=1      5     19

Fig. 7. T-cube for band 1 and band 2

  Rolled-up sums (80% confidence thresholds in parentheses):
  B1=0: 30 (24)     B1=1: 34 (27.2)
  B2=0: 40 (32)     B2=1: 24 (19.2)

Fig. 8. Rolled-up sums and confidence thresholds

All sums meet the 10% support threshold (6.4). There is one confident rule:

C:  B1={0} => B2={0}    c = 83.3%
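These numbers are easy to verify mechanically; a short check (ours) of the roll-up sums, the thresholds, and the single confident rule:

counts = {(0, 0): 25, (0, 1): 5, (1, 0): 15, (1, 1): 19}       # (B1,B2) -> count
b1_sums = {v: sum(c for (a, b), c in counts.items() if a == v) for v in (0, 1)}
b2_sums = {v: sum(c for (a, b), c in counts.items() if b == v) for v in (0, 1)}
assert b1_sums == {0: 30, 1: 34} and b2_sums == {0: 40, 1: 24}  # Fig. 8 sums
assert min(list(b1_sums.values()) + list(b2_sums.values())) >= 0.10 * 64
assert counts[(0, 0)] / b1_sums[0] >= 0.80   # B1={0} => B2={0}: 25/30 = 83.3%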

Continuing with 1-bit values and the two bands B1 and B3, we get the following T-cube with rolled-up sums and confidence thresholds (Fig. 9). There are no new confident rules. Similarly, the 1-bit T-cube for bands B2 and B3 can be constructed (Fig. 10).

          B1=0   B1=1  |  sum (80% threshold)
  B3=0     14     23   |  37  (29.6)
  B3=1     16     11   |  27  (21.6)
  sum      30     34
  (thr)   (24) (27.2)

Fig. 9. T-cube for band 1 and band 3

          B2=0   B2=1  |  sum (80% threshold)
  B3=0     13     24   |  37  (29.6)
  B3=1     27      0   |  27  (21.6)
  sum      40     24
  (thr)   (32) (19.2)

Fig. 10. T-cube for band 2 and band 3

All sums are at least 10% of 64 (6.4); thus, all rules will have enough support. There are two confident rules: B2={1} => B3={0} with confidence = 100% and B3={1} => B2={0} with confidence = 100%. Thus,

C:  B1={0} => B2={0}    c = 83.3%
    B2={1} => B3={0}    c = 100%
    B3={1} => B2={0}    c = 100%

Next consider 1-bit values and the three bands B1, B2 and B3. The counts, sums and confidence thresholds are given in Fig. 11:

  Counts (B1, B2, B3):
    B3=0:  (0,0,0)=9   (1,0,0)=4   (0,1,0)=5   (1,1,0)=19
    B3=1:  (0,0,1)=16  (1,0,1)=11  (0,1,1)=0   (1,1,1)=0

  Rolled-up pair sums (80% confidence thresholds in parentheses):
    B1,B2:  (0,0)=25 (20)    (1,0)=15 (12)    (0,1)=5 (4)      (1,1)=19 (15.2)
    B1,B3:  (0,0)=14 (11.2)  (1,0)=23 (18.4)  (0,1)=16 (12.8)  (1,1)=11 (8.8)
    B2,B3:  (0,0)=13 (10.4)  (1,0)=24 (19.2)  (0,1)=27 (21.6)  (1,1)=0 (0)

Fig. 11. The counts, sums and confidence thresholds for 1-bit values

The support sets B1={0}^B2={1} and B2={1}^B3={1} lack support. The new confident rules are:

  B1={1} ^ B2={1} => B3={0}    c = 100%
  B1={1} ^ B3={0} => B2={1}    c = 82.6%
  B1={1} ^ B3={1} => B2={0}    c = 100%
  B1={0} ^ B3={1} => B2={0}    c = 100%

B1={1}^B2={1} => B3={0} is not included because it is generalized by B2={1} => B3={0}, which is already in C and has higher rank. Also, B1={1}^B3={1} => B2={0} is not included because it is generalized by B3={1} => B2={0}, which is already in C and has higher rank. B1={0}^B3={1} => B2={0} is not included because it is also generalized by B3={1} => B2={0}, which has higher rank. Thus,

C:  B1={0} => B2={0}             c = 83.3%
    B2={1} => B3={0}             c = 100%
    B3={1} => B2={0}             c = 100%
    B1={1} ^ B3={0} => B2={1}    c = 82.6%

Next, we consider 2-bit data values and proceed in the same way. Depending upon the goal of the data mining task (e.g., mine for classes of rules, individual rules, …), the rules already in C can be used to obviate the need to consider 2-bit refinements of the rules in C. This simplifies the 2-bit stage markedly.

5 Implementation Issues and Performance Analysis

In our model, we build T-cube values from the basic P-trees on the fly, as needed. Once the T-cube is built, we can perform the mining task with different parameters (i.e., different support and confidence thresholds) without rebuilding the cube. Using the roll-up cube operation, we can get the T-cube for n-bit precision from the T-cube for (n+1)-bit precision. This is a good feature of the precision concept hierarchy.

We have enhanced the functionality of our model in two ways. First, we do not require the antecedent attribute to be specified; compared to other approaches for deriving high confidence rules, our model is therefore more general. Second, we remove redundant rules based on the rule ranking.

One important feature of our model is its scalability, in two senses. First, our model is scalable with respect to the data set size. The reason is that the size of the T-cube is independent of the data set size, depending only on the number of bands and the number of bits, and the mining cost depends only on the T-cube size. For example, for an 8192x8192 image with three bands, the T-cube using 2 bits is as simple as that of the example in Section 4. By comparison, in the Apriori algorithm, the larger the data set, the higher the cost of the mining process; therefore, the larger the data set, the more benefit in using our model. Second, our model is scalable with respect to the support threshold. Our task focuses on mining high confidence rules with very small support. As the support threshold is decreased to a very low value, the cost of the Apriori algorithm increases dramatically, due to the resulting huge number of frequent itemsets (a combinatorial explosion). In our model, however, the process is not based on frequent itemset generation, so it works well for low support thresholds.

As we mentioned, there is an additional cost to build the T-cube. The key issue in this cost is the P-tree ANDing. We have implemented an efficient P-tree ANDing on a cluster of computers: an array of 16 dual 266 MHz processor systems with a 400 MHz dual-processor system as the control node. We partition a 2048x2048 image among all the nodes, so that each node contains the data for 512x512 pixels. These data are stored at each node as another variation of the P-tree, called the Peano Vector Tree (PV-tree), which is constructed as follows. First we build a Peano Count Tree using fan-out 64 at each level. Then the tree is saved as bit vectors: for each internal node (except the root), we use two 64-bit vectors, one for pure-1 and the other for pure-0; at the leaf level we use only one vector (for pure-1). From a single TM scene, we get 56 (7x8) Peano Vector Trees, all saved on a single node; using 16 nodes, we cover a scene of size 2048x2048. When we need to perform an ANDing operation on the entire scene, each node calculates the local ANDing result of two Peano Vector Trees and sends the result to the control node, giving us the final result.
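At the node level, the PV-tree mask logic mirrors the PM-tree rules stated earlier: a child of the AND result is pure-1 only where both operands are pure-1, and pure-0 where either operand is pure-0. A minimal sketch (ours; the function name and the representation of bit vectors as Python integers are assumptions):

def and_pv_nodes(pure1_a, pure0_a, pure1_b, pure0_b):
    """One level of PV-tree ANDing. Inputs are 64-bit integers used as
    bit vectors over a node's 64 children. A result child is pure-1 only
    where both operands are pure-1, and pure-0 where either operand is
    pure-0; the remaining children are mixed and must be recursed into."""
    pure1 = pure1_a & pure1_b
    pure0 = pure0_a | pure0_b
    mixed = ~(pure1 | pure0) & ((1 << 64) - 1)   # children still to resolve
    return pure1, pure0, mixed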

We use the Message Passing Interface (MPI) on the cluster to implement the logical operations on Peano Vector Trees. The program uses the Single Program Multiple Data (SPMD) paradigm. Fig. 12 shows the ANDing time for a TM scene: the AND time varies from 6.72 ms to 52.12 ms, depending on the lower bit number of the two P-trees.


Fig. 12. P-tree ANDing time (ms) vs. lower bit number of the two P-trees

With this high-speed ANDing, the T-cube can be built very quickly.

6 Related Work

The work in [6,7,8] deals with mining high confidence rules from non-spatial data and is therefore only marginally related. In [7], rules are found that have extremely high confidence but little or no support, and algorithms are proposed to solve this problem. This work has two disadvantages. One is that only pairs of columns (attributes) are considered: all pairs of columns with similarity exceeding a pre-specified threshold are identified. The second disadvantage is that the similarity measure is bi-directional. In [6], a brute-force technique is used for mining classification rules; association rule mining is used to solve the classification problem, i.e., a special rule set (a classifier) is derived. However, both support and confidence are used in the algorithm, even though only the high confidence rules are targeted. Several pruning techniques are proposed, but there are trade-offs among them. [8] and [15] are similar in that they both apply the association rule mining method to the classification task, turning an arbitrary set of association rules into a classifier. In [8], a confidence-based pruning method is proposed using a property called "existential upward closure", and the method is used for building a decision tree from association rules; the antecedent attribute is specified. Our model is more general than the models cited above and is particularly efficient and useful for spatial data mining.

The P-tree structure is related to Quadtrees [10,11,13] and their variants (such as the point quadtree [13] and the region quadtree [10]), and to HHCodes [14]. P-trees, quadtrees and HHCodes are alike in being quadrant based, but P-trees differ in that they focus on counts. P-trees are beneficial not only for storing data but also for association rule mining, since they contain the count information that rule mining needs.

7 Conclusion

In this paper, we propose a new model to derive high confidence rules from spatial data. Data cube techniques are used in our model. The basic data structure of our model, the P-tree, makes much more information readily available than the original image file while remaining small in size. From the P-trees we build a Tuple Count cube, from which the high confidence rules can be derived. Currently we use a 16-node system to perform the ANDing operations for images of size 2048x2048. In the future we will extend our system to 256 nodes so that we can handle images as large as 8192x8192. In that case, the P-tree ANDing time will be approximately the same as in the 16-node system for a 2048x2048 image, since only the communication cost increases, and that increase is insignificant.

References

1. R. Agrawal, T. Imielinski, A. Swami. Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD 1993.
2. R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994.
3. R. Srikant, R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. ACM SIGMOD 1996.
4. J. S. Park, M. Chen, P. S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. ACM SIGMOD 1995.
5. J. Han, J. Pei, Y. Yin. Mining Frequent Patterns without Candidate Generation. ACM SIGMOD 2000.
6. R. J. Bayardo. Brute-Force Mining of High-Confidence Classification Rules. KDD 1997.
7. E. Cohen, et al. Finding Interesting Associations without Support Pruning. VLDB 2000.
8. K. Wang, S. Zhou, Y. He. Growing Decision Trees on Support-less Association Rules. KDD 2000.
9. V. Gaede, O. Gunther. Multidimensional Access Methods. ACM Computing Surveys, 30(2), 1998.
10. H. Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 16(2), 1984.
11. H. Samet. Applications of Spatial Data Structures. Addison-Wesley, 1990.
12. H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.
13. R. A. Finkel, J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4(1), 1974.
14. HH-code. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm
15. B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. KDD 1998.
16. J. Dong, W. Perrizo, Q. Ding, J. Zhou. The Application of Association Rule Mining on Remotely Sensed Data. ACM Symposium on Applied Computing, 2000.