An Incremental Mining Algorithm for Association Rules based on Minimal Perfect Hashing and Pruning

Chuang-Kai Chiou1, Judy C. R. Tseng2

1 College of Engineering, Chung Hua University, Hsinchu, 300, Taiwan, ROC [email protected]
2 Dept. of Computer Science and Information Engineering, Chung Hua University, Hsinchu, 300, Taiwan, ROC [email protected]

Abstract. In the literature, hash-based association rule mining algorithms are reported to be more efficient than Apriori-based algorithms, since they employ hash functions to generate candidate itemsets efficiently. However, when the dataset is updated, the whole hash table needs to be reconstructed. In this paper, we propose an incremental mining algorithm based on minimal perfect hashing. In our algorithm, each candidate itemset is hashed into a hash table, and its support can be checked against the minimum support directly through the hash function in later mining steps. Even when new items are added, the structure of the proposed hash table does not need to be reconstructed. Experimental results show that, when the database is dynamically updated, the proposed algorithm is more efficient than other hash-based association rule mining algorithms and also more efficient than other Apriori-based incremental mining algorithms for association rules.

Keywords: data mining, association rule, incremental mining

1 Introduction

Association rule mining is an important data mining problem. An association rule represents a relationship among items in a given database. The best-known method for mining association rules is the Apriori algorithm [1], and many other association rule mining algorithms are also Apriori-based [2-4]. Some researchers have looked for more efficient alternatives. For instance, the FP-Tree [5] and CAT tree [6] algorithms employ special tree structures for mining the frequent itemsets. On the other hand, the DHP [2] and MPIP [3] algorithms employ hash structures to reduce the number of database accesses. They are well suited to handling the candidate 2-itemsets (C2), whose generation is the most time-consuming step in association rule mining. Consequently, hash-based association rule mining algorithms are more efficient than plain Apriori-based algorithms.

In addition, traditional mining methods focus on mining a static database (that is, one whose items are seldom changed or updated). In most practical cases, however, the items in the database are added or updated frequently. Incremental mining techniques therefore become essential when applying association rule mining in practice. Several incremental mining techniques have been proposed [7-10]. For example, FUP [7], an Apriori-based algorithm, stores the previous counts of the large itemsets and examines the newly added transactions against these counts. A small number of new candidates is then generated, and the overall counts of these candidates are obtained by scanning the original database. Although FUP handles incremental datasets, its mining efficiency is still poor.

In this paper, we not only employ a minimal perfect hashing structure [11] to improve the hash-based mining algorithm but also employ an incremental mining technique for practical use. The result is the IMPHP (Incremental Minimum Perfect Hashing and Pruning) algorithm. It offers two advantages: (1) each candidate itemset is hashed into the hash table without collisions, and its support can be checked against the minimum support value directly through the hash function in later processing; (2) when new items are added, the arrangement of the proposed hash structure does not need to be reconstructed. We only need to scan the updated parts and append the new items to the end of the original hash table. Hence, the efficiency can be improved significantly.

2 Related Work

To evaluate the improvement over hash-based mining algorithms, two of them are used for comparison: the Direct Hashing & Pruning (DHP) algorithm and the Multi-Phase Indexing and Pruning (MPIP) algorithm. The details of both algorithms are described in the following subsections.

2.1 Direct Hashing & Pruning (DHP)

To address the low performance of Apriori-based algorithms, Park et al. proposed the DHP (Direct Hashing and Pruning) algorithm [2]. DHP employs hash functions to generate candidate itemsets efficiently, and it also employs effective pruning techniques to reduce the size of the database. Itemsets with no potential to become frequent are filtered out at an early stage of candidate generation, so that scanning the database for them is avoided. Since DHP does not scan the whole database every time, the performance is also enhanced. DHP is particularly powerful for finding the frequent itemsets in the early stages. It finds the 1-itemsets and builds a hash table for C2, and then determines L2 based on the hash table generated in the previous stage. However, there is no guarantee that collisions can be avoided. If a small number of buckets is used to hash the itemsets, heavy collisions will occur. In that case, only a few candidate itemsets will be filtered out, and the performance might be worse than that of the Apriori algorithm. Enlarging the number of buckets can solve the problem, but the resulting memory requirement will also reduce the performance.
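The bucket-based filtering described above can be sketched as follows. This is a minimal illustration rather than the implementation from [2]: it assumes items are positive integers, and the function name `dhp_candidate_pairs` and the hash function inside `bucket_of` are hypothetical choices made only for this example.

```python
from collections import Counter
from itertools import combinations


def dhp_candidate_pairs(transactions, min_support, num_buckets):
    """Return candidate 2-itemsets that survive DHP-style bucket pruning."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets

    def bucket_of(pair):
        # Illustrative hash function (an assumption, not the one used in [2]).
        a, b = pair
        return (a * 10 + b) % num_buckets

    # Single pass: count 1-itemsets and hash every 2-itemset into a bucket.
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair)] += 1

    frequent_items = sorted(
        item for item, count in item_counts.items() if count >= min_support
    )

    # A pair stays a candidate only if both items are frequent and its
    # bucket count reaches the threshold. Collisions can inflate a bucket,
    # so infrequent pairs may still pass this filter.
    return [
        pair
        for pair in combinations(frequent_items, 2)
        if bucket_counts[bucket_of(pair)] >= min_support
    ]
```

With the transactions `[[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]` and a minimum support of 2, using 100 buckets leaves only the pairs (1, 2) and (2, 3), whereas a single bucket lets the infrequent pair (1, 3) through as well, which illustrates how heavy collisions weaken the pruning.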

2.2 Multi-Phase Indexing and Pruning (MPIP)

Tseng et al. proposed the MPIP algorithm to improve on the DHP algorithm [3]. MPIP employs a minimum perfect hashing function in place of the hashing function used in DHP. A hashing function is called a minimum perfect hashing function if it determines a unique address for each itemset and hashes all itemsets without wasting space, that is, the number of table entries equals the number of itemsets hashed [11]. The hashing process is shown in Fig. 1.

Fig. 1. Hashing process in MPIP algorithm
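The definition of a minimum perfect hashing function can be made concrete with a small check. The sketch below is an illustration under stated assumptions, not the hashing function used in MPIP: items are assumed to be numbered 1 to n, and `lexicographic_pair_address` and `is_minimal_perfect` are hypothetical names introduced here.

```python
from itertools import combinations


def lexicographic_pair_address(j1, j2, n):
    """Address in 0..C(n,2)-1 of the pair (j1, j2), with 1 <= j1 < j2 <= n."""
    # Pairs are ranked in lexicographic order: (1,2), (1,3), ..., (n-1,n).
    # The first term counts all pairs whose smaller item is below j1.
    return (j1 - 1) * (2 * n - j1) // 2 + (j2 - j1 - 1)


def is_minimal_perfect(hash_fn, keys):
    """True if hash_fn maps keys one-to-one onto 0..len(keys)-1."""
    addresses = sorted(hash_fn(key) for key in keys)
    return addresses == list(range(len(keys)))
```

For n = 5, the ten 2-itemsets receive the addresses 0 to 9 with no gaps and no collisions, so the check succeeds; a plain modulo hash into 7 buckets fails it because at least two itemsets must share a bucket.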

One of the contributions of MPIP is that it resolves the hash collision problem of the DHP algorithm. In the MPIP algorithm, a unique address is assigned to each itemset, which also improves the accuracy of the hash table. Each entry in the hash table is used to determine the support value of the corresponding itemset. With this structure, repeated scanning of the database can be avoided. In addition, the bit vector in the hash table can filter out candidate itemsets and directly indicate the large itemsets. Hence, once the hash table has been constructed, the large itemsets are also obtained. However, neither DHP nor MPIP can handle updated transactions that include new items. In that situation, more collisions may occur in DHP, and in MPIP the hash table has to be reconstructed to include the new items. To simplify the hashing process and avoid reconstructing the hash table, a new algorithm is proposed. In the proposed algorithm, the hashing process does not need to be adjusted. For updated transactions, we only need to scan the updated parts instead of scanning the whole database.
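To illustrate how a hash table can absorb new items without being rebuilt, the sketch below uses colexicographic ranking of 2-itemsets, a standard addressing scheme whose addresses do not depend on the total number of items. It is an assumption made for illustration and is not claimed to be the hashing function proposed in this paper; `colex_pair_address` and `PairSupportTable` are hypothetical names, and items are assumed to be numbered from 1.

```python
from itertools import combinations


def colex_pair_address(j1, j2):
    """Address of the pair (j1, j2), 1 <= j1 < j2, independent of item count."""
    # All pairs whose larger item is below j2 come first: C(j2 - 1, 2) of them.
    return (j2 - 1) * (j2 - 2) // 2 + (j1 - 1)


class PairSupportTable:
    """Support counts for 2-itemsets, stored at colexicographic addresses."""

    def __init__(self):
        self.counts = []

    def add_transactions(self, transactions):
        # Only the newly arrived transactions are scanned.
        for transaction in transactions:
            items = sorted(set(transaction))
            for j1, j2 in combinations(items, 2):
                address = colex_pair_address(j1, j2)
                if address >= len(self.counts):
                    # New items only extend the table at its end;
                    # existing addresses and counts stay where they are.
                    self.counts.extend([0] * (address + 1 - len(self.counts)))
                self.counts[address] += 1

    def bit_vector(self, min_support):
        # 1 marks an address whose itemset reaches the minimum support.
        return [1 if count >= min_support else 0 for count in self.counts]
```

After adding `[[1, 2, 3], [1, 2], [2, 3]]`, the table holds `[2, 1, 2]`. Adding the new transaction `[1, 2, 4]`, which introduces item 4, yields `[3, 1, 2, 1, 1]`: the first three addresses are unchanged and the entries for the new pairs are appended at the end.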

3 Incremental Minimum Perfect Hashing and Pruning (IMPHP)

In this paper, an incremental association rule mining algorithm, IMPHP (Incremental Minimum Perfect Hashing and Pruning), is proposed. In this algorithm, the hashing address is determined by a minimum perfect hashing function, and mining with incremental transactions and items is also supported. In this section, we first analyze the regularity in 2-itemsets and then extend it to the cases of 3-itemsets and k-itemsets. Finally, the minimal perfect hashing function is obtained.

3.1 Minimal perfect hashing function

The notation used in our minimum perfect hashing function is defined as follows. Let a k-itemset be represented as {ij1, ij2, …, ijk}, where k is an integer, j1, j2, …, jk are the serial numbers of the items, and j1 < j2 < j3