2013 3rd International Conference on Computer Science and Network Technology, Dalian, China

An Efficient Incremental Maintenance for Association Rule Mining based on Distributed Databases

Mahmoud Darwish, Rania Elgohery, Nagwa Badr
Information Systems Department, Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt

Hossam Faheem
Computer Systems Department, Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt

An incremental mining algorithm proposed by Cheung et al., called the Fast Update (FUP) algorithm [5], maintains mined association rules after a database is updated. The FUP approach is a modification of the Apriori mining algorithm. It starts by evaluating large itemsets from the inserted transactions and then compares them with the previous large itemsets in the original database. FUP checks whether the original database needs to be rescanned, which decreases the time needed to maintain the association rules. FUP can clearly improve the performance of incremental mining on growing databases, but sometimes it is still necessary to scan the original database. The pre-large concept was proposed by Hong and Wang [11] to reduce the need to rescan original databases. It generates itemsets that are not large but are candidates to become large after records are added to the original database; the pre-large concept thus reduces the direct movement of itemsets from large itemsets to small itemsets.

Abstract—Data mining is an important step in the knowledge discovery process that discovers useful patterns hidden in data, and the associations it uncovers are very important to decision makers. Many organizations have multiple distributed databases, and it is important to discover and maintain their association rules. With a naive approach, any update to the database requires restarting the whole mining process to generate new rules; this requires twice the computation time of a single mining procedure and rescanning the whole database to maintain the rules. In this paper we present a distributed incremental mining approach for maintaining association rules in distributed databases. The proposed approach is an Apriori-based distributed formulation built on the Message Passing Interface. It also uses the pre-large concept to help reduce maintenance costs. Experimental results show that the approach outperforms existing approaches, and we present a comparative analysis between the proposed approach and an existing approach.

Keywords—data mining; association rule mining; incremental mining; Message Passing Interface

I. INTRODUCTION

There is a dramatic increase in the amount of data stored in databases, and very large datasets are now available in many scientific disciplines. The rate at which such datasets are generated far outstrips our ability to analyze them manually for useful information. Mining for knowledge has therefore become an important process to support decision making, and data mining techniques have become a necessity for developing new tools that can analyze massive datasets to discover the knowledge and relationships they contain.

Many large organizations have multiple distributed databases, and it is necessary to incorporate knowledge from all local databases while differentiating between global knowledge and local knowledge. Merging the local databases into a central database would not give us the advantage of discovering both local and global knowledge, and applying serial data mining algorithms over a large central database would take an unreasonable amount of time. The best solution is to apply a distributed data mining algorithm over all local databases.

Most proposed approaches to incremental association rule mining are sequential and cannot handle incremental mining over distributed databases. In this paper a novel algorithm is proposed to maintain incremental mining of association rules in distributed databases. The rest of this paper is organized as follows. Section II presents background knowledge, the terms and concepts used in this paper, and some related work. Our proposed approach is presented in Section III. Section IV presents comparisons and experimental results for our proposed approach. Finally, we present the conclusion.

978-1-4799-0559-1/13/$31.00 ©2013 IEEE

II. RELATED WORK

A. Background

1) Association Rule Mining [1]: Association Rule Mining (ARM) has become one of the most effective data mining tasks and has attracted great interest among decision makers and mining researchers. Mining for association rules and relationships between items in large databases has been identified as a critical area of database research. These association rules can be used to uncover unknown relationships between items, producing results that provide a basis for forecasting and decision making. The original problem addressed by association rule mining was to find hidden relations among sales of different products by analyzing a large set of supermarket sales data.
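As a minimal illustration (not from the paper) of how the support and confidence of a rule are computed over such transaction data, consider the toy baskets below; the item names are hypothetical:

```python
# Toy illustration of support and confidence for the rule {bread} => {butter}.
# Support(X => Y)    = fraction of transactions containing X and Y together.
# Confidence(X => Y) = fraction of transactions containing X that also contain Y.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Support of the combined itemset divided by the support of the antecedent."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"bread", "butter"}, transactions))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # 2 of 3 bread baskets
```

A mining algorithm keeps only rules whose support and confidence exceed user-chosen thresholds.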

ARM algorithms are composed of a two-stage process. The first stage generates a list of frequent itemsets in the database, where any combination of database items is known as an itemset. To avoid generating a huge list of all possible itemsets, a threshold value known as the support is chosen; the support value filters out itemsets that would lead to uninteresting rules. The second stage generates association rules from the frequent itemsets produced in the first stage, and the rules are selected based on their confidence value. Formally, an itemset is {I1, I2, …, Ik}, where each Ii ∈ I. The general form of an association rule is X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The support of X ⇒ Y is the probability that a transaction in the database contains both X and Y, while the confidence of X ⇒ Y is the probability that a transaction containing X also contains Y.

The Apriori algorithm [1, 2, 9, 14] is an effective algorithm for mining association rules. It works level by level: it starts by finding the frequent 1-itemsets L1, which have a support greater than the threshold value s; from these, the frequent 2-itemsets L2 are found, and this repeats until Lk+1 is empty. The set L1 ∪ L2 ∪ … ∪ Lk, denoted Lg, is then the set of globally frequent itemsets. Using Lg, one generates all association rules whose confidence is greater than the user-defined confidence value c.

2) Message Passing Interface [4, 8, 10]: The MPI programming model is based on message passing. In a message-passing system, concurrently executing processes communicate by sending messages to one another over a network. Unlike multithreading, where different threads share the same program state, each MPI process has its own local program state that cannot be modified by any other process except in response to a message. MPI processes can therefore be distributed over a network, with different processes running on different machines or even different architectures.

B. Fast Update Algorithm [5]
The Fast Update (FUP) algorithm updates the discovered association rules in incremental mining. Given an original database and some newly inserted transactions, the following four cases (illustrated in Figure 1) may occur:

Case 1: An itemset is frequent (large) in both the original database and the newly inserted transactions.
Case 2: An itemset is frequent (large) in the original database but not frequent (small) in the newly inserted transactions.
Case 3: An itemset is not frequent (small) in the original database but frequent (large) in the newly inserted transactions.
Case 4: An itemset is not frequent (small) in either the original database or the newly inserted transactions.

Fig. 1. Four cases when new transactions are inserted into an existing database [5].

C. Pre-Large Concept [11, 12]
The pre-large concept is used by Hong and Wang [11] to handle incremental mining under record modification. Although maintenance of rules under record modification could be performed using deletion followed by insertion, that requires twice the computation time of a single procedure. By using pre-large itemsets, they reduce the need to rescan the original database until a specified number of records has been modified. Given an original database and the itemset differences resulting from the modification of some records, the following nine cases (illustrated in Figure 2) may arise when the concept of pre-large itemsets is used.

Fig. 2. Nine cases arising from modifying records in an existing database [12].

D. Synthesizing Global Patterns from Local Patterns in Different Databases [13]
Many companies have multiple databases. Adhikari [13] proposed an approach for generating global rules from the local rules that exist in local databases. The model consists of a set of interfaces and a set of layers, as shown in Figure 3, with four interfaces for synthesizing global rules; each interface applies a set of functions to the data. Interface 2/1 applies a set of operations to each local database at the lowest level to obtain a processed database. Interface 3/2 processes each local database to separate relevant data from outlier data. Interface 4/3 processes each local database to generate local patterns, of which there are two types: local patterns and suggested local patterns. A suggested local pattern is not currently a local pattern but is a candidate to become large after records are added to the database. Interface 5/4 processes the local patterns from the local databases to generate global patterns.

Fig. 3. A model of synthesizing global patterns from local patterns in different databases [13].
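The level-wise Apriori procedure described in this section can be sketched as follows; this is a minimal single-machine illustration with hypothetical data, not the paper's distributed implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets, built level by level (L1, L2, ...)."""
    n = len(transactions)
    # L1: single items whose support meets the threshold.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items
             if sum(1 for t in transactions if i in t) / n >= min_support]
    frequent = set(level)
    k = 2
    while level:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        level = [c for c in candidates
                 if sum(1 for t in transactions if c <= t) / n >= min_support]
        frequent.update(level)
        k += 1
    return frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(tuple(sorted(s)) for s in apriori(baskets, 0.5)))
# [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Rule generation then keeps, for each frequent itemset, the rules X ⇒ Y whose confidence exceeds the user-defined value c.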

E. Notations
The notation used in this paper is defined below.

Di: the original database for site i
Ti: the set of added records for database i
Ui: the updated database for site i
IDk: the k-itemset difference
di: the number of records in Di
ti: the number of records in Ti
Sl: the lower support threshold for pre-large itemsets
Su: the upper support threshold for large itemsets, Su > Sl
LRi: the local rules for database i
PRi: the pre-large local rules for database i
GR: the global rules over all databases
GPR: the global pre-large rules over all databases

III. PROPOSED APPROACH

In this section we present the proposed approach. When new records are added to a database, or records in an existing database are modified, the original association rules may become invalid, or new implicitly valid rules may appear in the resulting updated database. Most organizations have many distributed databases, so a change in any one of them may change both local and global rules. An incremental approach for maintaining and updating the discovered association rules in distributed databases is proposed here. It integrates the FUP concepts [5], the association-rule maintenance algorithm for record modification [12], and the synthesis of heavy association rules from different real data sources [13]. The approach relies on pre-large itemsets to generate new frequent itemsets rather than having to rescan the whole databases, and it uses the Message Passing Interface as the programming model for communication between the different distributed databases. It is based on Apriori, whose parallel nature makes it easy to derive a parallel formulation. Details of the proposed approach are listed below (shown in Figure 4).

Fig. 4. The proposed incremental ARM approach based on a distributed database architecture.

The incremental ARM algorithm is composed of the following modules:

A- DB Processing Module: This module cleans the data and organizes it in a form suitable for the mining algorithm.
B- Configuration Module: This module manages the support values for large and pre-large patterns, the confidence values for large and pre-large rules, and the number of distributed databases.
C- Database Connector Module: This module manages all connections between the application and the database.
D- ARM Module: This module generates large and pre-large frequent patterns for each local database based on the provided support and confidence values and the number of distributed databases. It uses the Message Passing Interface to pass the support and confidence values to the different local servers. Its output is the local frequent rules and pre-large local frequent rules for the local distributed databases.
E- Global Patterns Controller: This module receives the local rules from the local servers as input and generates global patterns for large and pre-large rules as output. It works as follows:
1- Receive the local rules and record counts from the local databases.
2- Sum the record counts to obtain the total number of records across all databases.


3- Calculate the global support and confidence percentages for large and pre-large rules.
4- Generate the global frequent rules and the pre-large global frequent rules.
5- Save the large and pre-large global rules in the database for future use by the Incremental Mining Module.
F- Incremental Mining Module: This module manages the generation of new large and pre-large rules after records are added to the database. It depends on the large and pre-large rules previously generated by the Global Patterns Controller. It works as follows:
1- Retrieve the support and confidence values.
2- Retrieve the old large and pre-large rules.
3- Process the newly added records in each database to generate the new rules' support counts.
4- Compare the new large and pre-large support/confidence counts with the current support/confidence thresholds.
5- Generate the new large and pre-large rules.
6- Save the result in the database.

The incremental ARM algorithm for distributed databases is explained in detail as follows:

INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large and pre-large itemsets for each database i consisting of di records, and a set of modified or newly inserted records.

OUTPUT: A set of final local association rules for each updated database (LR1, …, LRi, …, LRn), a set of final pre-large local rules (PR1, …, PRi, …, PRn), and the set of global rules and pre-large global rules (GR, GPR).

Step 1: Calculate the safety number f for modified records for each database according to Theorem 1: f = (Su − Sl)·di. The number of added records should be less than this safety threshold, or database Di has to be rescanned to obtain the new large and pre-large itemsets.
Step 2: Generate the old local rule counts and old pre-large rule counts after adding the positive or negative changes from the newly added transactions for each Di.
Step 3: Sum the old database record count and the newly added record count to generate the lower and upper support counts for each Di.
Step 4: Apply the following cases in each local process to generate the new local rules and pre-large local rules.

Case 1: If the local count of the itemset is greater than or equal to the upper support count and the rule meets the upper confidence, then this rule is a local rule LRi (large local rule).
Case 2: If the local count of the itemset is less than the local upper support count but greater than the local lower support count, and the rule's confidence is greater than or equal to the lower confidence, then this rule is a pre-large local rule PLRi.
Case 3: If the local count of the itemset is less than the upper support count (and Case 2 does not apply), remove this rule from the large or pre-large local rules if it exists there.

Step 5: Send the local rules and pre-large rules to the master process, which generates the global rules.
Step 6: Sum the local database counts to generate the new global lower support count, global upper support count, and the corresponding lower and upper confidence thresholds.
Step 7: Generate the new global rules and pre-large global rules.

Case 1: If the overall count of the itemset is greater than or equal to the global upper support count and the rule meets the upper confidence, then this rule is a global rule GR.
Case 2: If the overall count of the itemset is less than the global upper support count but greater than the global lower support count, and the rule's confidence is greater than or equal to the lower confidence, then this rule is a pre-large global rule GPR.
Case 3: If the overall count of the itemset is less than the global upper support count (and Case 2 does not apply), remove this rule from the global large rules or global pre-large rules if it exists there.

IV. EXPERIMENTAL RESULTS AND EVALUATION

The primary metric for evaluating the performance of our proposed approach is association accuracy given the provided support and confidence values; the other important metric is execution time. The ideal goal for an ARM algorithm is to produce accurate rules in a short execution time.

We carried out six experiments with the existing approach and the proposed approach using the same support and confidence values, over different numbers of transactions and different database sizes. The results are very encouraging: we observed a considerable reduction in the pattern generation time when adding records with the proposed approach compared to the existing approach. Figures 5 and 6 present our experimental results.

Fig. 5. Execution time comparison between our proposed approach and the Adhikari approach [13]; upper support = 35%, lower support = 15%, upper confidence = 35%, lower confidence = 15%.

Figure 5 compares the processing time of the existing Adhikari approach [13], discussed in Section II.D, with that of the proposed approach when adding records to databases containing different numbers of transactions (30,000, 120,000, and 450,000). The experiment in Figure 5 was executed with an upper support of 35%, a lower support of 15%, an upper confidence of 35%, and a lower confidence of 15%. The experimental


results show that the proposed algorithm's execution time for incremental mining over a database containing 450,000 transactions is 0.72 seconds, while processing the same added records and database size with the Adhikari approach takes 6.0 seconds. Over a database of 120,000 transactions the proposed approach takes 0.67 seconds versus 3 seconds for the existing approach, and over 30,000 transactions it takes 0.62 seconds versus 1.1 seconds. Figure 6 shows the same comparison using different support and confidence values.
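The threshold logic behind the local and global cases of Steps 4 and 7, and the safety number of Step 1, can be sketched as follows; this is a simplified single-process illustration with hypothetical counts, not the paper's MPI implementation:

```python
def classify_rule(count, confidence, upper_sup, lower_sup, upper_conf, lower_conf):
    """Classify a rule per the three cases used at both the local and global
    level: 'large', 'pre-large', or 'remove'."""
    if count >= upper_sup and confidence >= upper_conf:
        return "large"       # Case 1: large rule (local LRi / global GR)
    if lower_sup < count < upper_sup and confidence >= lower_conf:
        return "pre-large"   # Case 2: pre-large rule (PLRi / GPR)
    return "remove"          # Case 3: drop from the large/pre-large sets

def safety_number(su, sl, d):
    """Safety number f = (Su - Sl) * d from Step 1: how many added/modified
    records are tolerated before database Di must be rescanned."""
    return (su - sl) * d

# Hypothetical 1000-record database with Su = 35% and Sl = 15%,
# i.e. support counts of 350 and 150.
print(classify_rule(400, 0.40, 350, 150, 0.35, 0.15))  # large
print(classify_rule(200, 0.20, 350, 150, 0.35, 0.15))  # pre-large
print(classify_rule(100, 0.10, 350, 150, 0.35, 0.15))  # remove
print(round(safety_number(0.35, 0.15, 1000)))          # 200
```

Pre-large rules (Case 2) are what allow the maintenance phase to promote itemsets to large without rescanning, as long as fewer than f records have changed.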

Fig. 6. Execution time comparison between our proposed approach and the Adhikari approach [13]; upper support = 40%, lower support = 20%, upper confidence = 40%, lower confidence = 20%.

V. CONCLUSION

The purpose of this paper is to provide a high-performance approach for incremental association rule mining over distributed databases. We propose an Apriori-based approach that uses the Message Passing Interface to discover and update association rules in distributed environments. The proposed approach relies on the pre-large concept to avoid scanning the whole database whenever new transactions are added. We compared our proposed approach to the distributed Apriori. The results of the different experiments confirm that our proposed approach takes less processing time than existing approaches such as the Adhikari approach, and that it can handle distributed databases to generate both local and global frequent rules.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD International Conference on Management of Data, pp. 207–216, Washington, DC, USA, 1993.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. 20th International Conference on Very Large Data Bases, Chile, 1994.
[3] R. Agrawal and R. Srikant, "Mining association rules with item constraints," Proc. 3rd International Conference on Knowledge Discovery and Data Mining, 1997.
[4] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 1999.
[5] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, "Maintenance of discovered association rules in large databases: an incremental updating technique," Proc. 12th International Conference on Data Engineering, pp. 106–114, 1996.
[6] P. Kumar, B. Ozisikyilmaz, W.-K. Liao, G. Memik, and A. Choudhary, "High performance data mining using R on heterogeneous platforms," IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), Northwestern University, Evanston, IL, USA, 2011.
[7] V. Kumar, M. V. Joshi, E.-H. Han, P.-N. Tan, and M. Steinbach, "High performance data mining," University of Minnesota, USA, 2002.
[8] B. Barney, "Message Passing Interface (MPI)," Lawrence Livermore National Laboratory, 2010.
[9] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," 2007.
[10] B. Barney, "Introduction to parallel computing," Lawrence Livermore National Laboratory, 2010.
[11] T.-P. Hong, C.-Y. Wang, and Y.-H. Tao, "A new incremental data mining algorithm using pre-large itemsets," Intelligent Data Analysis, vol. 5, no. 2, pp. 111–129, 2001.
[12] T.-P. Hong and C.-Y. Wang, "An efficient and effective association-rule maintenance algorithm for record modification," Expert Systems with Applications, vol. 37, pp. 618–626, 2010.
[13] A. Adhikari and P. R. Rao, "Synthesizing heavy association rules from different real data sources," Pattern Recognition Letters, vol. 29, pp. 59–71, 2008.
[14] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., 2011.
[15] C.-W. Lin, G.-C. Lan, and T.-P. Hong, "An incremental mining algorithm for high utility itemsets," Expert Systems with Applications, vol. 39, pp. 7173–7180, 2012.
Tzung-Pei Hong a,b,*, and Ching-Yao Wang, “An efficient and effective association-rule maintenance algorithm for record modification,” Expert Systems with Applications 37, pp. 618–626, 2010. Animesh Adhikari a,*, and P.R. Rao b,”Synthesizing heavy association rules from different real data sources,” Pattern Recognition Letters 29, pp. 59–71 c, 2008. Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining Concepts and Techniques,” 3rd Edition, 2011. Chun-Wei Lin a,b, Guo-Cheng Lan c, and Tzung-Pei Hong, “An incremental mining algorithm for high utility itemsets,” Expert Systems with Applications 39 , pp. 7173–7180, 2012.