
Maintenance of Generalized Association Rules under Transaction Update and Taxonomy Evolution

Ming-Cheng Tseng1, Wen-Yang Lin2 and Rong Jeng3

1, 3 Institute of Information Engineering, I-Shou University, Kaohsiung 840, Taiwan
1 [email protected], 3 [email protected]
2 Dept. of Comp. Sci. & Info. Eng., National University of Kaohsiung, Kaohsiung 811, Taiwan
2 [email protected]

Abstract. Mining generalized association rules among items in the presence of taxonomies has been recognized as an important model in data mining. Earlier work on mining generalized association rules ignores the fact that the taxonomies of items cannot be kept static while new transactions are continuously added to the original database. How to effectively update the discovered generalized association rules to reflect database changes under both taxonomy evolution and transaction update is a crucial task. In this paper, we examine this problem and propose a novel algorithm, called IDTE, which can incrementally update the discovered generalized association rules when the taxonomy of items evolves as new transactions are inserted into the database. Empirical evaluations show that our algorithm maintains its performance even with large numbers of incremental transactions and a high degree of taxonomy evolution, and is more than an order of magnitude faster than applying the best generalized association mining algorithms to the whole updated database.

1 Introduction

Mining association rules from a large database of business data, such as transaction records, has been a popular topic within the area of data mining [1, 2]. An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. Such a rule reveals that transactions in the database containing the items in X tend to also contain the items in Y. The probability, measured as the fraction of transactions containing X that also contain Y, is called the confidence of the rule. The support of the rule is the fraction of the transactions that contain all items in both X and Y. For an association rule to be valid, it should satisfy a user-specified minimum support, called ms, and a user-specified minimum confidence, called mc.

In many applications, there are taxonomies (hierarchies), explicit or implicit, over the items. It may be more useful to find associations at different levels of the taxonomies than only at the primitive concept level [8, 14]. For example, consider the taxonomies of items in Fig. 1. It is likely that the association rule Carrot ⇒ Apple (Support = 30%, Confidence = 60%) does not hold when the minimum support is set to 40%, but the following association rule may be valid.

Vegetable ⇒ Fruit

[Figure: taxonomy over the items Vegetable, Non-root Vegetable, Kale, Carrot, Tomato, Fruit, Papaya, Apple, and Pickle]

Fig. 1. An example of taxonomies
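To make the definitions of support and confidence concrete, the following minimal Python sketch evaluates a primitive rule and a generalized rule on a small database. The five transactions, the thresholds, and the placement of Kale and Tomato under Non-root Vegetable are hypothetical choices made for illustration only; each transaction is written with its assumed taxonomy ancestors already added.

```python
# Hypothetical transactions (not from the paper), each already extended
# with the assumed ancestors of its items.
TRANSACTIONS = [
    {"Carrot", "Apple", "Vegetable", "Fruit"},
    {"Kale", "Papaya", "Non-root Vegetable", "Vegetable", "Fruit"},
    {"Carrot", "Vegetable"},
    {"Tomato", "Apple", "Non-root Vegetable", "Vegetable", "Fruit"},
    {"Pickle"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Fraction of transactions containing all items of X and Y."""
    return count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return count(x | y, transactions) / count(x, transactions)


def is_valid(x, y, transactions, ms, mc):
    """A rule X => Y is valid if it meets both ms and mc."""
    return (support(x, y, transactions) >= ms
            and confidence(x, y, transactions) >= mc)


if __name__ == "__main__":
    for x, y in (({"Carrot"}, {"Apple"}), ({"Vegetable"}, {"Fruit"})):
        print(sorted(x), "=>", sorted(y),
              "support", support(x, y, TRANSACTIONS),
              "confidence", confidence(x, y, TRANSACTIONS),
              "valid", is_valid(x, y, TRANSACTIONS, ms=0.4, mc=0.6))
```

On this toy data the primitive rule falls below the minimum support while the generalized rule passes both thresholds, which is the effect the example in the text describes.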

To date, all work on mining generalized association rules, to the best of our knowledge, has assumed the taxonomies of items to be static, ignoring the fact that a taxonomy may change over time while new transactions are continuously added to the original database [7]. For example, items corresponding to new products have to be added to the taxonomy, and their insertion introduces new classifications if they are of newly invented types. Conversely, items and/or their classifications are abandoned when they are no longer produced. All of these changes reshape the taxonomy and in turn may invalidate previously discovered generalized association rules and/or introduce new ones, not to mention the changes caused by the transaction update to the database. Under these circumstances, how to update the discovered generalized association rules effectively becomes a critical task.

In this paper, we examine this problem and propose an algorithm called IDTE (Incremental Database with Taxonomy Evolution), which effectively reduces the number of candidate sets and the amount of database rescanning, and so can update the generalized association rules efficiently. Empirical evaluations show that our algorithm maintains its performance even at relatively low support thresholds, with large numbers of incremental transactions, and under a high degree of taxonomy evolution, and is more than an order of magnitude faster than applying the best generalized association mining algorithms to the whole updated database.

The remainder of this paper is organized as follows. We discuss related work in Section 2 and describe the problem in Section 3. A detailed description of the IDTE algorithm is given in Section 4. In Section 5, we evaluate the performance of the proposed IDTE algorithm. Finally, we conclude the paper in Section 6.

2 Related Work

The problem of mining association rules in the presence of taxonomy information was first introduced in [8] and [14], independently. In [14], the problem aimed at finding associations among items at any level of the taxonomy, while in [8], the objective was to discover associations of items in a progressive, level-by-level fashion along the taxonomy.

The problem of updating association rules incrementally was first addressed by Cheung et al. [4], whose work was later extended to incorporate the situations of deletion and modification [6]. Since then, a number of techniques have been proposed to improve the efficiency of incremental mining algorithms [9, 10, 13, 15], but all of them were confined to mining associations among primitive items. Cheung et al. [5] were the first to consider the problem of maintaining generalized (multi-level) association rules. We then extended the problem model to one that adopts non-uniform minimum support [16]. To our knowledge, no work to date has considered the issue of maintaining generalized associations while the taxonomy evolves along with the transaction update.

3 Problem Statement

In real business applications the database changes over time: new transactions (which may contain new types of items) are continuously added, outdated transactions (which may contain abandoned items) are deleted, and the taxonomy that represents the classification of items also evolves to reflect such changes. This implies that if the updated database is processed afresh, previously discovered associations might become invalid and some previously undiscovered associations might emerge. That is, the discovered association rules must be updated to reflect the new circumstances. As in mining associations, this problem can be reduced to updating the frequent itemsets.

3.1 Problem Description

Consider the task of mining generalized frequent itemsets from a given transaction database DB with the item taxonomy T. Although the methods proposed in the literature differ in their implementation strategies, the main process involves adding to each transaction the generalized items in the taxonomy. For this reason, we can view the task as mining frequent itemsets from the extended database ED, the extended version of DB obtained by adding to each transaction the ancestors of each primitive item in T. We use $L^{ED}$ to denote the set of discovered frequent itemsets.

Now let us consider the situation in which new transactions (db) are added to DB and the taxonomy T is changed into a new one, T'. Following the previous paradigm, we can view the problem as follows. Let ED' and ed' denote the extended versions of the original database DB and the incremental database db, respectively, obtained by adding to each transaction the generalized items in T'. Further, let UE' be the updated extended database containing ED' and ed', i.e., UE' = ED' + ed'. The problem of updating $L^{ED}$ when new transactions db are added to DB and T is changed into T' is equivalent to finding the set of frequent itemsets in UE', denoted as $L^{UE'}$.
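The construction of ED, ED', ed' and UE' can be sketched in a few lines of Python. The transactions below are the ones listed in the tables of Fig. 8; the two parent maps `T_OLD` and `T_NEW` are reconstructed from the generalized items shown in those tables and should be read as an assumption, as should all function and variable names.

```python
def ancestors(item, parent):
    """Generalized items above `item`, following child -> parent links."""
    found = set()
    while parent.get(item) is not None:
        item = parent[item]
        found.add(item)
    return found


def extend(db, parent):
    """Add to every transaction the ancestors of each of its items."""
    extended = {}
    for tid, items in db.items():
        extra = set()
        for item in items:
            extra |= ancestors(item, parent)
        extended[tid] = set(items) | extra
    return extended


# Taxonomies as child -> parent maps (None marks a root); assumed shapes.
T_OLD = {"A": None, "B": "A", "C": "B", "E": "B", "D": "A",
         "F": None, "G": "F", "H": "F", "I": None}
T_NEW = {"A": None, "J": "A", "D": "J", "K": "J", "G": "A",
         "F": None, "E": "F", "H": "F", "I": None}

DB = {1: {"E"}, 2: {"D", "I"}, 3: {"C", "E", "G"},
      4: {"C", "D", "H"}, 5: {"D", "I"}, 6: {"E", "G"}}
db = {7: {"D", "K", "I"}, 8: {"D", "K"}}

ED = extend(DB, T_OLD)          # original extended database
ED_new = extend(DB, T_NEW)      # ED': DB re-extended under T'
ed_new = extend(db, T_NEW)      # ed': db extended under T'
UE_new = {**ED_new, **ed_new}   # UE' = ED' + ed'

if __name__ == "__main__":
    for tid in sorted(UE_new):
        print(tid, sorted(UE_new[tid]))
```

Keeping the primitive transactions separate from the taxonomy makes it cheap to re-extend DB under T', which is the step that distinguishes this problem from plain incremental mining.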
3.2 Situations for Taxonomy Evolution and Frequent Itemsets Update

In this subsection, we describe the different situations for taxonomy evolution and clarify the essence of the frequent itemset update for each type of taxonomy evolution. According to our observation, there are four basic types of item updates that cause taxonomy evolution: item insertion, item deletion, item rename, and item reclassification. Each of them is elaborated in the following. For simplicity, in all figures hereafter item "A" stands for "Vegetable", "B" for "Non-root Vegetable", "C" for "Kale", "D" for "Carrot", "E" for "Tomato", "F" for "Fruit", "G" for "Papaya", "H" for "Apple", "I" for "Pickle", "J" for "Root Vegetable", "K" for "Potato", "B1" for "Non-root Vegetable New", and "G1" for "Papaya New".

Type 1: Item Insertion. The strategy for handling this type of update depends on whether the inserted item is primitive or generalized. When the newly inserted item is primitive, we do not have to process it until an incremental database update containing that item actually occurs. This is because the new item appears neither in the original database nor in the discovered associations. However, if the new item is a generalization, then the insertion affects the discovered associations, since a new generalization often incurs some item reclassification. Fig. 2 shows this type of taxonomy evolution, where a new item "K" is inserted as a primitive item and a new item "B" as a generalized item.

Type 2: Item Deletion. Unlike the case of item insertion, the deletion of a primitive item from the taxonomy can cause an inconsistency problem. If no transaction update deletes the occurrences of that item, then the refined item taxonomy does not conform to the updated database: an outdated item still appears in the transactions of interest. To simplify the discussion, we assume that the evolution of the taxonomy is always consistent with the transaction update to the database. Additionally, the removal of a generalization may also lead to item reclassification, so we always have to deal with the situation caused by item deletion. Fig. 3 shows this type of taxonomy evolution, where a primitive item "C" and a generalized item "B" are deleted, respectively.

[Figure: taxonomy diagrams, panels (a) and (b)]

Fig. 2. Item insertion

[Figure: taxonomy diagrams, panels (a) and (b)]

Fig. 3. Item deletion

Type 3: Item Rename. When items are renamed, we do not have to process the database; we simply replace the names in the frequent itemsets with the new ones, since the process codes of the renamed items are the same. Fig. 4 shows this type of taxonomy evolution, where items "G" and "B" are renamed to "G1" and "B1", respectively.

Type 4: Item Reclassification. Among the four types of taxonomy updates this is the most profound operation. Once an item, primitive or generalized, is reclassified into another category, all of its ancestors (generalized items) in the old and the new taxonomies are affected. For example, in Fig. 5, the two shifted items "E" and "G" affect the support counts of itemsets containing "A", "B", or "F".

[Figure: taxonomy diagrams before and after the rename]

Fig. 4. Item rename

[Figure: taxonomy diagrams before and after the reclassification]

Fig. 5. Item reclassification

4 The Proposed Method

A straightforward method to update the discovered generalized frequent itemsets would be to run any of the algorithms for finding generalized frequent itemsets, such as Cumulate and Stratify [14], on the updated extended database UE'. This simple approach, however, does not utilize the discovered frequent itemsets and ignores the fact that scanning the whole updated database could be avoided. A better approach is to differentiate, within the set of discovered frequent itemsets $L^{ED}$, the itemsets that are unaffected by the taxonomy evolution from the others, and then utilize them to avoid unnecessary computation in the course of the incremental update. To this end, we first identify the unaffected items, whose supports do not change with respect to the taxonomy evolution, and then use them to identify the unaffected itemsets.

We introduce the following notation to facilitate the discussion: I and J denote the set of primitive items and the set of generalized items in T, respectively, and I' and J' represent their counterparts in T'.

Definition 1. An item in T is called an unaffected item if its support does not change with respect to a taxonomy evolution.

Lemma 1. Consider a primitive item a in T. Then (a) $\mathrm{sup}_{ED'}(a) = \mathrm{sup}_{ED}(a)$ if $a \in I \cap I'$, and (b) $\mathrm{sup}_{ED'}(a) = 0$ if $a \in I - I'$, where $\mathrm{sup}_{ED'}(a)$ and $\mathrm{sup}_{ED}(a)$ denote the supports of item a in ED' and ED, respectively.

Lemma 2. Consider a generalized item g in T. Then $\mathrm{sup}_{ED'}(g) = \mathrm{sup}_{ED}(g)$ if $\mathrm{des}_{T'}(g) = \mathrm{des}_{T}(g)$, where $\mathrm{des}_{T'}(g)$ and $\mathrm{des}_{T}(g)$ denote the sets of descendant primitive items of g in T' and T, respectively.

In summary, Lemmas 1 and 2 state that an item is unaffected by the taxonomy evolution if it is a primitive item both before and after the taxonomy evolution, or if it is a generalized item whose set of descendant primitive items remains the same.

Definition 2. An itemset A in ED is called an unaffected itemset if its support does not change with respect to a taxonomy evolution.

Lemma 3. Consider an itemset A in ED. Then (a) $\mathrm{sup}_{ED'}(A) = \mathrm{sup}_{ED}(A)$ if A contains unaffected items only; or (b) $\mathrm{sup}_{ED'}(A) = 0$ if A contains at least one item a with $a \in I - I'$.
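The lemmas translate directly into a check over the two taxonomies. The sketch below is one possible reading, not the authors' implementation: the parent maps are reconstructed from the Fig. 8 example and are an assumption, and a primitive item that is new in T' (such as "K") is treated as unaffected because its support is zero in both ED and ED', which follows how Fig. 8 groups that item.

```python
def ancestors(item, parent):
    """Generalized items above `item` in a child -> parent map."""
    found = set()
    while parent.get(item) is not None:
        item = parent[item]
        found.add(item)
    return found


def primitives(parent):
    """Items that are nobody's parent, i.e. the leaves of the taxonomy."""
    return set(parent) - {p for p in parent.values() if p is not None}


def des(parent, g):
    """Primitive descendants of generalized item g (des_T(g) in Lemma 2)."""
    return {i for i in primitives(parent) if g in ancestors(i, parent)}


def unaffected_items(old, new):
    """Items of T' whose support is the same in ED and ED'."""
    old_prim, new_prim = primitives(old), primitives(new)
    keep = set()
    for item in new:
        if item in new_prim:
            # Lemma 1(a), plus new primitives (support 0 before and after).
            if item in old_prim or item not in old:
                keep.add(item)
        elif item in old and item not in old_prim:
            # Lemma 2: same set of primitive descendants in T and T'.
            if des(old, item) == des(new, item):
                keep.add(item)
    return keep


def is_unaffected_itemset(itemset, unaffected):
    """Lemma 3(a): an itemset made only of unaffected items is unaffected."""
    return set(itemset) <= unaffected


# Assumed taxonomies (None marks a root), reconstructed from Fig. 8.
T_OLD = {"A": None, "B": "A", "C": "B", "E": "B", "D": "A",
         "F": None, "G": "F", "H": "F", "I": None}
T_NEW = {"A": None, "J": "A", "D": "J", "K": "J", "G": "A",
         "F": None, "E": "F", "H": "F", "I": None}

if __name__ == "__main__":
    print(sorted(unaffected_items(T_OLD, T_NEW)))
```

Note that the reclassified items "E" and "G" remain unaffected as primitive items; it is their old and new ancestors "A" and "F" whose descendant sets, and hence supports, change.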
Now that we have clarified how to identify the unaffected itemsets, we further show how to utilize this information to alleviate the overhead of updating the supports of itemsets. Consider a candidate itemset A generated during the mining process. We observe that there are six different cases in arriving at the support count of A in the whole updated database UE'.

(1) If A is an unaffected itemset and is frequent in ed' and ED', then it is also frequent in the updated extended database UE'.
(2) If A is an unaffected itemset and is infrequent in ed' and ED', then it is also infrequent in UE'.
(3) If A is an unaffected itemset and is infrequent in ed' but frequent in ED', then a simple calculation can determine whether A is frequent or not in UE'.
(4) If A is an unaffected itemset and is frequent in ed' but infrequent in ED', then it is an undetermined itemset in UE', i.e., it may be frequent or infrequent.
(5) If A is not unaffected and is frequent in ed', then it is an undetermined itemset in UE'.
(6) If A is not unaffected and is infrequent in ed', then it is an undetermined itemset in UE'.

Note that only Cases 4 to 6 require an additional scan of ED' to determine the support count of A in UE'. For Case 4, after scanning ed' and comparing with ms, if A is frequent in ed', A may become frequent in UE', so we need to rescan ED' to determine the support count of A. For Cases 5 and 6, since A is not an unaffected itemset, its support count may have changed in ED'; therefore, we need a further scan of ED' to decide whether it is frequent or not. For Cases 1 to 3, there is no need to scan ED' again to determine the support count of itemset A: for an unaffected itemset that was frequent in ED, the support count in ED' equals the count already recorded in $L^{ED}$. That is, we utilize the information about unaffected itemsets and discovered frequent itemsets to avoid such a database scan. Furthermore, the identification of itemsets satisfying Case 2 provides another opportunity for candidate pruning.

The IDTE algorithm is shown in Fig. 7. An example illustrating the proposed IDTE algorithm is provided in Fig. 8, where ms = 20%.
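The six cases amount to a small decision table. The following Python sketch encodes it; the function and constant names are hypothetical, and the sample values in the demo are the 1-itemset counts that can be read from the Fig. 8 example (6 original and 2 incremental transactions, ms = 20%).

```python
def classify(unaffected, frequent_in_ED, frequent_in_ed):
    """Return the case number (1-6) for a candidate itemset A.

    unaffected      -- A consists of unaffected items only
    frequent_in_ED  -- A is in the previously discovered set L^ED
    frequent_in_ed  -- A is frequent in the incremental part ed'
    """
    if unaffected:
        if frequent_in_ED:
            return 1 if frequent_in_ed else 3
        return 4 if frequent_in_ed else 2
    return 5 if frequent_in_ed else 6


# Cases whose support in ED' is unknown and must be obtained by a scan.
RESCAN_CASES = {4, 5, 6}


def frequent_without_rescan(count_ED, count_ed, size_UE, ms):
    """Cases 1 and 3: the count in ED' equals the stored count in ED,
    so the support in UE' follows from simple addition."""
    return count_ED + count_ed >= ms * size_UE


if __name__ == "__main__":
    # (unaffected, in L^ED, frequent in ed') for some items of Fig. 8.
    samples = {"D": (True, True, True), "E": (True, True, False),
               "H": (True, False, False), "K": (True, False, True),
               "A": (False, True, True), "F": (False, True, False)}
    for item, flags in samples.items():
        case = classify(*flags)
        print(item, "case", case, "rescan ED'" if case in RESCAN_CASES else "no rescan")
```

For item "E" (Case 3), for instance, the stored count 3 in ED plus the count 0 in ed' is compared with 20% of the 8 transactions in UE', so "E" stays frequent without touching ED'.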

k = 1;
repeat
    if k = 1 then generate C_1 from T';
    else C_k = apriori-gen(L_{k-1}^{UE'});
    Delete any candidate in C_k that consists of an item and its ancestor;
    Load original frequent k-itemsets L_k^{ED};
    Divide C_k into two subsets: C_X and C_Y;
        /* C_X consists of unaffected itemsets in ED, and C_Y = C_k - C_X. */
    Divide C_X into two subsets: C_Xa and C_Xb;
        /* C_Xa consists of frequent itemsets in L_k^{ED}, and C_Xb = C_X - C_Xa. */
    Scan ed' to count sup_{ed'}(A) for each itemset A in C_k;
    L_k^{ed'} = {A | A ∈ C_k and sup_{ed'}(A) ≥ ms};
    Delete any candidate A from C_Xb if A ∉ L_k^{ed'};    /* Case 2 */
    Scan ED' to count sup_{ED'}(A) for each itemset A in C_Xb and C_Y;    /* Cases 4, 5 & 6 */
    Calculate sup_{UE'}(A) for each itemset A in C_k;
    L_k^{UE'} = {A | A ∈ C_k and sup_{UE'}(A) ≥ ms};
    k = k + 1;    /* advance to the next level; implicit in the original listing */
until L_{k-1}^{UE'} = ∅
Result = ∪_k L_k^{UE'};

Fig. 7. Algorithm IDTE
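The listing in Fig. 7 can be turned into a compact runnable sketch. The version below is a simplified illustration under stated assumptions rather than the authors' implementation: it counts each candidate individually instead of batching the scans, it takes the set of unaffected items as a parameter, the helper `mine` is a plain level-wise miner used only to produce $L^{ED}$ and a reference result, and the taxonomies are reconstructed from the Fig. 8 tables. All names are hypothetical.

```python
from itertools import combinations


def ancestors(item, parent):
    found = set()
    while parent.get(item) is not None:
        item = parent[item]
        found.add(item)
    return found


def extend(transactions, parent):
    """Add the ancestors of every item to each transaction."""
    return [frozenset(t) | {a for i in t for a in ancestors(i, parent)}
            for t in transactions]


def count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def candidates(k, prev, parent):
    """C_k: apriori-gen plus removal of item-with-ancestor candidates."""
    if k == 1:
        cands = {frozenset([i]) for i in parent}
    else:
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    return {c for c in cands
            if not any(i in ancestors(j, parent) for i in c for j in c)}


def mine(transactions, parent, ms):
    """Plain level-wise mining of an extended database (reference only)."""
    ext, out, prev, k = extend(transactions, parent), {}, set(), 1
    while True:
        level = {c: count(c, ext) for c in candidates(k, prev, parent)}
        level = {c: n for c, n in level.items() if n >= ms * len(ext)}
        if not level:
            return out
        out.update(level)
        prev, k = set(level), k + 1


def idte(DB, db, new, L_ED, unaffected, ms):
    """Update frequent itemsets; L_ED maps itemsets to their counts in ED."""
    ED2, ed2 = extend(DB, new), extend(db, new)
    size_UE, out, prev, k = len(ED2) + len(ed2), {}, set(), 1
    while True:
        level = {}
        for A in candidates(k, prev, new):
            c_ed = count(A, ed2)                    # scan ed'
            if A <= unaffected and A in L_ED:       # C_Xa: Cases 1 and 3
                c_ED = L_ED[A]                      # reuse the stored count
            elif A <= unaffected and c_ed < ms * len(ed2):
                continue                            # Case 2: pruned
            else:
                c_ED = count(A, ED2)                # Cases 4-6: rescan ED'
            if c_ED + c_ed >= ms * size_UE:
                level[A] = c_ED + c_ed
        if not level:
            return out
        out.update(level)
        prev, k = set(level), k + 1


# Assumed taxonomies and the transactions of the Fig. 8 example.
T_OLD = {"A": None, "B": "A", "C": "B", "E": "B", "D": "A",
         "F": None, "G": "F", "H": "F", "I": None}
T_NEW = {"A": None, "J": "A", "D": "J", "K": "J", "G": "A",
         "F": None, "E": "F", "H": "F", "I": None}
DB = [{"E"}, {"D", "I"}, {"C", "E", "G"}, {"C", "D", "H"}, {"D", "I"}, {"E", "G"}]
db = [{"D", "K", "I"}, {"D", "K"}]

if __name__ == "__main__":
    L_ED = mine(DB, T_OLD, 0.2)
    result = idte(DB, db, T_NEW, L_ED, {"D", "E", "G", "H", "I", "K"}, 0.2)
    for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
        print("".join(sorted(itemset)), result[itemset])
```

Running `mine` on the concatenated transactions under `T_NEW` gives a reference answer against which the incremental result can be compared.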

[Figure: original taxonomy T and updated taxonomy T', the four databases below, and the trace of IDTE for k = 1 and k = 2]

Original Extended Database (ED)
TID   Primitive Items   Generalized Items
1     E                 A, B
2     D, I              A
3     C, E, G           A, B, F
4     C, D, H           A, B, F
5     D, I              A
6     E, G              A, B, F

Updated Extended Database (ED')
TID   Primitive Items   Generalized Items
1     E                 F
2     D, I              A, J
3     C, E, G           A, F
4     C, D, H           A, F, J
5     D, I              A, J
6     E, G              A, F

Original Extended Incremental Database (ed)
TID   Primitive Items   Generalized Items
7     D, K, I           A
8     D, K              A

Updated Extended Incremental Database (ed')
TID   Primitive Items   Generalized Items
7     D, K, I           A, J
8     D, K              A, J

Pass 1:
C_1 = {A, D, E, F, G, H, I, J, K}
Unaffected C_1 in L_1^{ED}: D, E, G, I. Unaffected C_1 not in L_1^{ED}: H, K. Affected C_1: A, F, J.
Counts in ed': D 2, E 0, G 0, I 1, H 0, K 2, A 2, F 0, J 2.
Counts in ED' (after scanning ED'): A 5, F 4, J 3.
L_1^{UE'} = {A, D, E, F, G, I, J, K}

Pass 2:
C_2 = {AE, AF, AI, AJ, DE, DF, DG, DI, DJ, DK, EG, EI, EJ, EK, FG, FI, FJ, FK, GI, GJ, GK, IJ, IK}
Unaffected C_2 in L_2^{ED}: DI, EG. Unaffected C_2 not in L_2^{ED}: DE, DK, EI, DG, EK, GI, GK, IK. Affected C_2: AE, AF, AI, AJ, DF, DJ, EJ, FG, FI, FJ, FK, GJ, IJ.
Counts in ed' (unaffected C_2): DI 1, EG 0, DE 0, DK 2, EI 0, DG 0, EK 0, GI 0, GK 0, IK 1.
Counts in ED' (after scanning ED'): AE 2, AF 3, AI 2, AJ 0, DF 1, DJ 0, EJ 0, FG 2, FI 0, FJ 1, GJ 0, IJ 2.
L_2^{UE'} = {AE, AF, AI, DI, DK, EG, FG, IJ}

Fig. 8. Illustration of algorithm IDTE

5 Experiments

In order to examine the performance of IDTE, we conducted experiments to compare its performance with that of applying generalized association mining algorithms, including Cumulate and Stratify, to the whole updated database. Synthetic datasets generated by the IBM data generator [2] were used in the experiments. The parameter settings for the synthetic data are shown in Table 1. The comparisons were evaluated from different aspects, including minimum support, incremental transaction size, fanout, number of groups, and percent of affected items, i.e., the ratio of affected items to total items. In the implementation of each algorithm, we also adopted two different support counting strategies: one with horizontal counting [1, 2, 3, 11] and the other with vertical intersection counting [12, 17]. For horizontal counting, the algorithms are denoted as Cumulate(H), Stratify(H), and IDTE(H), while for vertical intersection counting, the algorithms are denoted as Cumulate(V), Stratify(V), and IDTE(V). All experiments were performed on an Intel Pentium-IV 2.80GHz with 2GB RAM, running Windows 2000.

Minimum Supports: We first compared the performance of the three algorithms with varying minimum supports at 40,000 incremental transactions and a constant affected item percent. The experimental results are shown in Fig. 8. As shown in the figure, IDTE performs significantly better than Cumulate and Stratify. In addition, the algorithms with the vertical counting strategy are better than their counterparts with the horizontal strategy.

Transaction Sizes: We then compared the three algorithms under varying transaction sizes at ms = 1.0% with a constant affected item percent. As the results in Fig. 9 show, the running time of all algorithms increases in proportion to the incremental size. Furthermore, IDTE(H) significantly outperforms Cumulate(H) and Stratify(H), and similarly, IDTE(V) beats Cumulate(V) and Stratify(V).

Fanout: We changed the fanout from 3 to 11 at ms = 1.0% with a constant affected item percent and 40,000 incremental transactions. The experimental results are shown in Fig. 10. It can be observed that all algorithms run faster as the fanout increases, because the number of generalized items decreases as the fanout increases. Again, IDTE significantly outperforms Cumulate and Stratify, with either the vertical or the horizontal counting strategy.

Number of Groups: We varied the number of groups from 15 to 35 at ms = 1.0% with a constant affected item percent and 40,000 incremental transactions. As Fig. 11 shows, the effect of increasing the number of groups is similar to that of increasing the fanout. The reason is that the number of items within a specific group decreases as the number of groups increases, so the probability of a generalized item decreases.

Affected Item Percent: We finally compared the three algorithms under varying affected item percent at ms = 1.0% and 40,000 incremental transactions. The affected items were randomly chosen and underwent reclassification. The results are depicted in Fig. 12.

Table 1. Parameter settings

Fig. 8. Different ms

Fig. 9. Different transactions

Fig. 10. Varying fanout

Fig. 11. Varying number of groups

Fig. 12. Different affected item percent

In summary, we observe that IDTE(V) performs better than Cumulate(V) and Stratify(V), while IDTE(H) performs better than Cumulate(H) and Stratify(H), in all aspects of the evaluation. In addition, all algorithms with the vertical support counting strategy perform better than their counterparts with the horizontal counting strategy.

6 Conclusions

We have investigated in this paper the problem of updating generalized association rules when new transactions are inserted into the database and the taxonomy of items evolves over time. We have also presented a novel algorithm, IDTE, for updating generalized frequent itemsets. Empirical evaluation on synthetic data showed that the IDTE algorithm is very efficient and outperforms applying the best generalized association mining algorithms to the whole updated database. In the future, we will extend the problem of updating generalized association rules to a more general model that adopts non-uniform minimum support, to address the problem that newly introduced items usually have much lower supports.

References

1. Agrawal R., Imielinski T., Swami A.: Mining Association Rules between Sets of Items in Large Databases. In Proc. 1993 ACM-SIGMOD Intl. Conf. Management of Data (1993) 207-216
2. Agrawal R., Srikant R.: Fast Algorithms for Mining Association Rules. In Proc. 20th Intl. Conf. Very Large Data Bases (1994) 487-499
3. Brin S., Motwani R., Ullman J.D., Tsur S.: Dynamic Itemset Counting and Implication Rules for Market Basket Data. SIGMOD Record, Vol. 26 (1997) 255-264
4. Cheung D.W., Han J., Ng V.T., Wong C.Y.: Maintenance of Discovered Association Rules in Large Databases: An Incremental Update Technique. In Proc. 1996 Int. Conf. Data Engineering (1996) 106-114
5. Cheung D.W., Ng V.T., Tam B.W.: Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (1996) 307-310
6. Cheung D.W., Lee S.D., Kao B.: A General Incremental Technique for Maintaining Discovered Association Rules. In Proc. DASFAA'97 (1997) 185-194
7. Han J., Fu Y.: Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases. In Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94) (1994) 157-168
8. Han J., Fu Y.: Discovery of Multiple-level Association Rules from Large Databases. In Proc. 21st Intl. Conf. Very Large Data Bases, Zurich, Switzerland (1995) 420-431
9. Hong T.P., Wang C.Y., Tao Y.H.: Incremental Data Mining Based on Two Support Thresholds. In Proc. 4th Int. Conf. Knowledge-Based Intelligent Engineering Systems and Allied Technologies (2000) 436-439
10. Ng K.K., Lam W.: Updating of Association Rules Dynamically. In Proc. 1999 Int. Symp. Database Applications in Non-Traditional Environments (2000) 84-91
11. Park J.S., Chen M.S., Yu P.S.: An Effective Hash-based Algorithm for Mining Association Rules. In Proc. 1995 ACM SIGMOD Intl. Conf. on Management of Data, San Jose, CA, USA (1995) 175-186
12. Savasere A., Omiecinski E., Navathe S.: An Efficient Algorithm for Mining Association Rules in Large Databases. In Proc. 21st Intl. Conf. Very Large Data Bases (1995) 432-444
13. Sarda N.L., Srinivas N.V.: An Adaptive Algorithm for Incremental Mining of Association Rules. In Proc. 9th Int. Workshop on Database and Expert Systems Applications (DEXA'98) (1998) 240-245
14. Srikant R., Agrawal R.: Mining Generalized Association Rules. In Proc. 21st Int. Conf. Very Large Data Bases (1995) 407-419
15. Thomas S., Bodagala S., Alsabti K., Ranka S.: An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (1997)
16. Tseng M.C., Lin W.Y.: Maintenance of Generalized Association Rules with Multiple Minimum Supports. Intelligent Data Analysis, Vol. 8 (2004) 417-436
17. Zaki M.J.: Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 2 (2000) 372-390