Alternative Approach to Mining Association Rules

Rauch, J. - Šimůnek, M.: Alternative Approach to Mining Association Rules. in FDM 2002, The Foundation of Data Mining and Knowledge Discovery, The Proceedings of the Workshop of ICDM02, pp 157-162.

Alternative Approach to Mining Association Rules 1

Jan Rauch (1), Milan Šimůnek (1, 2)
(1) Faculty of Informatics and Statistics, University of Economics Prague, Czech Republic
(2) Institute of Computer Sciences, Czech Academy of Sciences, Czech Republic
[email protected], simunek@{vse.cz, cs.cas.cz}

Abstract

An alternative approach to mining association rules is described. Special techniques and algorithms are used that lead to a much richer syntax of association rules while keeping the computation linear in the number of rows of the analysed data. The free and open system LISp-Miner implements these algorithms and can serve as a demonstration of the techniques used. The same techniques can be used in other kinds of mining, e.g. multi-relation mining and conditional frequency analysis.

1. Introduction

An association rule is commonly understood as an expression of the form X→Y, where X and Y are sets of items. The intuitive meaning is that transactions (e.g. supermarket baskets) containing the set X of items tend to contain the set Y of items. Two measures of the intensity of an association rule are used: confidence and support. An association rule discovery task is the task of finding all association rules of the form X→Y such that the support and confidence of X→Y are above the user-defined thresholds minsup and minconf. The conventional algorithm for association rule discovery proceeds in two steps. All frequent itemsets are found in the first step; a frequent itemset is an itemset that is included in at least minsup transactions. The association rules with confidence of at least minconf are generated in the second step [1].

Particular items can be represented by Boolean attributes, and a Boolean data matrix can represent the whole set of transactions. The algorithm can be modified to deal with attributes with more than two values. Thus association rules of the form e.g. A(a1)∧B(b3)→C(c7) can be mined. We suppose that the attribute A has k particular values a1, …, ak. The expression A(a1) denotes the Boolean attribute that is true if the value of attribute A is a1, etc.

The goal of this paper is to draw attention to an alternative approach to mining association rules, based on representing each possible value of each attribute by a single string of bits. This makes it possible to mine for association rules of the form e.g. A(α) ∧ B(β) → C(δ), where α is a coefficient (a subset of all the possible values) of the attribute A. The expression A(α) denotes the Boolean attribute that is true for a particular row of the data matrix if the value of A in this row belongs to α; similarly for B(β) and C(δ).

The bit string approach also makes it easy to compute all necessary frequencies. We can then mine not only for association rules based on confidence and support but also for rules corresponding to various further relations of Boolean attributes, including relations described by statistical hypothesis tests. It is also possible to mine for conditional association rules and to deal with missing information. The presented form of association rules can be understood as a contribution to the discussion about the notion of interesting patterns.

Several data structures consisting of disjunctions and conjunctions of bit strings representing particular values of attributes are maintained to optimise the generation and verification of association rules. The resulting algorithm is very fast and its running time depends linearly on the number of rows of the analysed data matrix. Time and memory complexity are discussed in section 3.

As a demonstration of the capabilities of the bit string approach we present the procedure 4ft-Miner (see section 2). The 4ft-Miner procedure is a part of the academic data mining system LISp-Miner (see http://lispminer.vse.cz). The bit string approach has proved to be very efficient. Experience with it has led to the development of new mining procedures; an example can be found in section 4. The presented approach was first applied in connection with the development of the GUHA method of mechanized hypothesis formation [2], [3].

2. Procedure 4ft-Miner

The procedure 4ft-Miner mines for association rules of the form ϕ ≈ ψ and for conditional association rules of the form ϕ ≈ ψ / χ. Here ϕ, ψ and χ are conjunctions of Boolean attributes automatically derived from many-valued attributes in various ways. The symbol ≈ is called a 4ft-quantifier. The association rule ϕ ≈ ψ means that the Boolean attributes ϕ and ψ are associated in the sense of the 4ft-quantifier ≈. A conditional association rule ϕ ≈ ψ / χ means that ϕ and ψ are associated (in the sense of ≈) if the condition χ is satisfied.


The left part of an association rule (ϕ) is called the antecedent, the part denoted ψ is called the succedent and χ is the condition. All these parts are referred to as cedents.

This section describes features of the procedure 4ft-Miner to show the advantages of the bit string approach. The first is the richness of the ways in which the set of interesting association rules to be automatically generated and verified can be defined simply, see section 2.1. The second is the possibility of dealing with many types of association rules, see section 2.2. The important features of the output of 4ft-Miner are outlined in section 2.3.

2.1. Sets of Interesting Association Rules

The data analysed by the procedure are stored in a data matrix. Rows of the data matrix correspond to observed objects and columns correspond to attributes, i.e. properties of the observed objects. An example is the data matrix Loans, see Figure 1.

Client   Age   Sex   Salary      District   Quality
1        45    M     very high   Prague     good
2        22    F     very low    Plzen      bad
3        37    F     average     Brno       good
4        53    F     high        Benesov    good
...      ...   ...   ...         ...        ...
6180     54    M     low         Kolin      bad
6181     30    F     high        Brod       good

Figure 1. – Data matrix Loans

Each row of the data matrix Loans describes one loan granted to a client of a bank. There are 6 181 loans. The first row describes a loan received by a 45-year-old man. This man has a very high salary and lives in the district of Prague. The quality of his loan is good.

Each cedent is a conjunction of Boolean attributes called literals. A literal is an expression of the form A(α), where A is an attribute and α is a subset of all possible values (i.e. categories) of the attribute A. The subset α is called the coefficient of the literal A(α). Examples of cedents ϕ, ψ and χ are:
• ϕ = Age […]

Figure 3. – Example of the coefficient of one value

Let us emphasize that each cedent and even each partial cedent is treated as an object and can be copied or moved to another task or cedent.

M     ψ    ¬ψ
ϕ     a    b
¬ϕ    c    d

Figure 4. – Four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M

[…] 0 and 0 < α ≤ 0.5
True iff

$$\sum_{i=a}^{a+b} \frac{(a+b)!}{i!\,(a+b-i)!}\, p^{i} (1-p)^{a+b-i} \le \alpha \;\wedge\; a \ge \mathrm{Base}$$

The association rule ϕ ⇒!p;α;Base ψ corresponds to a test (at the level α) of the null hypothesis H0: P(ψ|ϕ) ≤ p against the alternative hypothesis H1: P(ψ|ϕ) > p. If the association rule ϕ ⇒!p;α;Base ψ is true in the data matrix M, then the alternative hypothesis is accepted.
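The condition of this quantifier is an upper tail of the binomial distribution, so it can be evaluated directly from the frequencies a and b of the four-fold table. The following is a minimal sketch in Python, not the LISp-Miner implementation; the function name and the example frequencies are assumptions made for illustration only.

```python
from math import comb


def lower_critical_implication(a, b, p, alpha, base):
    """Hypothetical helper: evaluate the quantifier from frequencies a and b.

    True iff sum_{i=a}^{a+b} C(a+b, i) * p^i * (1-p)^(a+b-i) <= alpha
    and a >= base.
    """
    n = a + b
    # Probability of observing at least a successes in n trials
    # when the success probability is p.
    tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(a, n + 1))
    return tail <= alpha and a >= base


# Illustrative frequencies (not taken from the data matrix Loans).
strong = lower_critical_implication(a=90, b=10, p=0.7, alpha=0.05, base=20)
weak = lower_critical_implication(a=72, b=28, p=0.7, alpha=0.05, base=20)
too_few = lower_critical_implication(a=10, b=0, p=0.5, alpha=0.05, base=20)
```

With 90 of 100 antecedent rows satisfying the succedent, the tail probability under p = 0.7 is far below 0.05 and the rule holds; with 72 of 100 it does not. The third call fails only because a is below Base.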


• Double founded implication ⇔p;Base
Parameters 0 < p ≤ 1 and Base > 0
True iff $\frac{a}{a+b+c} \ge p \;\wedge\; a \ge \mathrm{Base}$

The association rule ϕ ⇔p;Base ψ can be interpreted as “100p percent of objects satisfying ϕ or ψ satisfy both ϕ and ψ” or “ϕ ∨ ψ implies ϕ ∧ ψ on the level of 100p percent”. All the implemented 4ft-quantifiers are described at http://lispminer.vse.cz\overview\4ft_quantifier.html.

The four-fold table can be computed very quickly, see section 3. Let us remark that pre-computed tables of critical frequencies can be used to verify 4ft-quantifiers based on statistical hypothesis tests [4]. In this way only one test of an inequality is needed instead of the evaluation of a complex formula.

When we deal with missing information we have to compute nine-fold tables or even eighteen-fold tables. The bit string approach is again used for very fast computation of these tables. There are also several ways to reduce these tables back to a four-fold table. For details see e.g. [5].
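The double founded implication defined above depends only on the frequencies a, b and c of the four-fold table. A minimal sketch of its evaluation follows; the function name and the sample frequencies are illustrative assumptions, not part of 4ft-Miner.

```python
def double_founded_implication(a, b, c, p, base):
    """Hypothetical helper: True iff a / (a + b + c) >= p and a >= base."""
    total = a + b + c  # rows satisfying the antecedent or the succedent
    if total == 0:
        return False  # no row satisfies either cedent
    return a / total >= p and a >= base


# Illustrative frequencies: 90 rows satisfy both cedents, 10 satisfy only one.
holds = double_founded_implication(a=90, b=5, c=5, p=0.75, base=20)
fails = double_founded_implication(a=50, b=30, c=20, p=0.75, base=20)
```

Note that the frequency d does not enter the condition: rows satisfying neither ϕ nor ψ have no influence on this quantifier.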

2.3. Output of 4ft-Miner

The output of the procedure consists of all prime association rules. An association rule is prime if it is true in the analysed data matrix and it does not follow immediately from other, simpler association rules already in the output. The question is what it means that the association rule ϕ ≈ ψ immediately follows from a simpler association rule ϕ1 ≈ ψ1. The answer depends on the properties of the 4ft-quantifier used. The definition of a prime association rule for the 4ft-quantifier of founded implication ⇒p;Base must take into account that if the association rule e.g. Sex(M) ⇒p;Base District(Prague) is true, then the association rule Sex(M) ⇒p;Base District(Prague, Plzen) is also always true. Thus the second association rule immediately follows from the first, simpler one. All such followers are automatically omitted from the output. The logical properties of association rules have a theoretical background; for details see section 4 or e.g. [4].

An example of the output of 4ft-Miner is in Figure 5. This output represents the task with the sets of interesting antecedents and succedents defined in Figure 2 and Figure 3 respectively and with the quantifier ⇒0.7;20 of founded implication. The whole solution contains 46 prime association rules.

Figure 5. – Example of the 4ft-Miner output

3. Bit String Approach

The basic principle of the bit string approach is the representation of the analysed data by suitable strings of bits (see section 3.1). This makes it possible to use simple algorithms and data structures to compute the necessary frequencies efficiently (see section 3.2).

3.1. Bit-string Representation of Attributes

Each category of each attribute (i.e. each of its possible values) is represented by one string of bits. This string is called the card of the category [3]. We can use the attribute District as an example. The attribute District has 77 categories: Benesov, Brno, … , Prague, Plzen, … , Znojmo. Its representation is shown in Figure 6.

Client                1        2       3      4         ...   6180    6181
District              Prague   Plzen   Brno   Benesov   ...   Kolin   Brod
Cards of categories
  Brno                0        0       1      0         ...   0       1
  Kolin               0        0       0      0         ...   1       0
  Plzen               0        1       0      0         ...   0       0
  Prague              1        0       0      0         ...   0       0
  …                   …        …       …      …         …     …       …

Figure 6. – Cards of categories

The first row of this table corresponds to the column Client (row number) of the data matrix Loans, see Figure 1. The second row of the table corresponds to the column District. Each of the further rows of Figure 6 is the card of one category. Each bit of the card of a category corresponds to one row of the data matrix Loans: the first bit corresponds to the first row, the second bit to the second row, etc. A particular bit is 1 if the row corresponding to this bit has the given value (i.e. category) in the column District; otherwise the bit is 0. The first bit of the card of the category Benesov is 0 because the value in the first row of the data matrix is not Benesov (but Prague). The third bit of the card of the category Brno is 1 because the value in the third row is Brno, etc.

There are 6 181 rows in the data matrix Loans, therefore 6 181 bits, i.e. 773 bytes, are necessary to represent one category by its card. The attribute District has 77 categories, so 59 521 bytes (i.e. 773 × 77) are necessary to represent this attribute.

3.2. Algorithm and Data Structures

Each antecedent is represented by a structure named the card of the antecedent, denoted Card_[antecedent]. It is a string of bits of the same length as the number of rows in the analysed data matrix. Each bit of the card again corresponds to one row of the analysed data matrix. A particular bit is 1 if the row corresponding to this bit satisfies the antecedent. The card of the antecedent is thus the bit-wise representation of the Boolean attribute antecedent. It is created as the conjunction of the cards of all its literals. The card of a literal is created beforehand as the disjunction of the cards of the categories in the coefficient of the literal. A detailed description is beyond the scope of this article and can be found e.g. in [3].

The number of 1s in the card of the antecedent is the number of rows satisfying the antecedent. We use a low-level bit-string function Count(α) returning the number of 1s in the string α. The number of rows satisfying the antecedent must be greater than or equal to the value of the parameter Base, see section 2.2. For every generated antecedent we test whether Count(Card_[antecedent]) ≥ Base to decide whether this antecedent can be a part of a true association rule at all. This test can be understood as a test of whether the corresponding itemset is frequent [1].
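The construction of cards described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LISp-Miner code: a card is stored as a Python integer whose lowest bit corresponds to the first row, the function names are hypothetical, and the data are just the six rows of the matrix Loans shown in Figure 1.

```python
def build_cards(column):
    """Return {category: card}; bit i of a card is 1 iff row i has the category."""
    cards = {}
    for row, value in enumerate(column):
        cards[value] = cards.get(value, 0) | (1 << row)
    return cards


def count(card):
    """Count(card): number of 1s in the bit string."""
    return bin(card).count("1")


def literal_card(cards, coefficient):
    """Card of the literal A(coefficient): disjunction of the category cards."""
    card = 0
    for category in coefficient:
        card |= cards.get(category, 0)
    return card


def cedent_card(literal_cards, n_rows):
    """Card of a cedent: conjunction of the cards of its literals."""
    card = (1 << n_rows) - 1  # all rows satisfy the empty conjunction
    for lit in literal_cards:
        card &= lit
    return card


# The six rows visible in Figure 1 (rows 1, 2, 3, 4, 6180 and 6181).
district = ["Prague", "Plzen", "Brno", "Benesov", "Kolin", "Brod"]
sex = ["M", "F", "F", "F", "M", "F"]
n_rows = len(district)

lit_district = literal_card(build_cards(district), {"Prague", "Plzen"})
lit_sex = literal_card(build_cards(sex), {"F"})
antecedent = cedent_card([lit_sex, lit_district], n_rows)

base = 1  # illustrative value of the parameter Base
is_candidate = count(antecedent) >= base

# Storage needed for the real matrix: 6 181 bits per card, 77 categories.
bytes_per_card = (6181 + 7) // 8
bytes_for_district = bytes_per_card * 77
```

In this toy data only the second row satisfies Sex(F) ∧ District(Prague, Plzen), so the card of the antecedent has a single bit set. The last two lines reproduce the storage figures quoted in section 3.1.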
Both Card_[antecedent] and Card_[succedent] (analogous to the card of the antecedent) are used to compute the frequencies of the four-fold table of the antecedent and succedent, see Figure 7.

M              Succedent   ¬ Succedent
Antecedent     a           b
¬ Antecedent   c           d

Figure 7. – Four-fold table from cards

The particular frequencies are computed in the following way:
• a = Count(Card_[Antecedent] ∧ Card_[Succedent])
• b = Count(Card_[Antecedent]) – a
• c = Count(Card_[Succedent]) – a
• d = n – a – b – c

Here n is the total number of rows in the data matrix M.

The memory used by strings of bits while running a data mining task is not a significant problem, especially when compared to the significant time savings during generation and verification. Let us remark that e.g. a lot of medical data concerns thousands of patients and tens or hundreds of attributes. The corresponding data mining tasks can be solved without problems on common PCs. Moreover, in many cases we get the solution in several minutes or even in several seconds. Therefore 4ft-Miner is also suitable for teaching purposes.

Here we provide the results of an experiment on a Pentium 400 MHz computer with 98 MB RAM. We solved tasks to find true and prime association rules in the data matrices Loans, Loans_10 and Loans_20. The data matrix Loans_10 has 10 times more rows than the original data matrix Loans. Analogously, the data matrix Loans_20 has 20 times more rows. There are about 7 000 000 relevant association rules that have to be generated and verified according to the task definition. Only about 70 000 association rules were actually verified, thanks to all the optimisations, some of which are described above. The time of solution for the particular data matrices is given in Figure 8.

Data matrix   Rows      Time of solution [sec]
Loans         6 181     26
Loans_10      61 810    232
Loans_20      123 620   481

Figure 8. – Time of solution of various tasks

Let us emphasize that the time of the bit string operations AND, NOT, OR and Count is linearly dependent on the length of the particular cards. The length of each card is equal to the number of rows of the analysed data matrix. Thus the time the procedure 4ft-Miner needs to solve a given task is linearly dependent on the number of rows of the analysed data matrix.
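The computation of the four-fold table from cards described in this section can be sketched as follows. As before, this is a minimal Python illustration rather than the actual implementation: cards are assumed to be integers with the lowest bit corresponding to the first row, the function names are hypothetical, and the two cards are built by hand from the six rows shown in Figure 1.

```python
def count(card):
    """Count(card): number of 1s in the bit string."""
    return bin(card).count("1")


def four_fold_table(card_antecedent, card_succedent, n_rows):
    """Return the frequencies (a, b, c, d) of the four-fold table."""
    a = count(card_antecedent & card_succedent)
    b = count(card_antecedent) - a
    c = count(card_succedent) - a
    d = n_rows - a - b - c
    return a, b, c, d


# Rows of Figure 1: Sex = M, F, F, F, M, F; Quality = good, bad, good, good, bad, good.
card_sex_f = 0b101110         # rows 2, 3, 4 and 6 have Sex = F
card_quality_good = 0b101101  # rows 1, 3, 4 and 6 have Quality = good

table = four_fold_table(card_sex_f, card_quality_good, n_rows=6)
```

Only one bit-wise AND and three calls of Count are needed per rule, and each of them scans the cards once, which is consistent with the linear dependence on the number of rows stated above.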

4. New Data Mining Procedures

The advantages of the bit string approach can be further exploited in new data mining procedures. An example is the procedure Pareto-Miner. Figures 9 and 10 illustrate the motivation for this procedure. Both figures concern the distribution of clients (see the data matrix Loans, Figure 1) among particular regions. The first concerns all clients and the second concerns only the clients with a high salary. The distribution of clients with a high salary differs remarkably from the distribution of all clients. The difference concerns in particular the pair Prague – south Moravia. It can be useful to find all segments of clients that differ in a given way from the segment of all clients in the distribution of clients among particular regions. The Pareto-Miner procedure is intended to solve such tasks. Its input consists of:
• a data matrix with columns corresponding to attributes and rows corresponding to observed objects,
• an analysed attribute A (usually with several values),
• parameters defining a large set of conditions, in the same way as a set of conditions is defined in the 4ft-Miner procedure,
• a criterion of interestingness of a particular condition.

Figure 9. – Distribution of all clients among regions

Figure 10. – Distribution of clients with high salary among regions

The criterion of interestingness describes a distribution of rows of the data matrix among the particular values of the attribute A. Examples of the criteria are:
• a remarkable difference between the distribution when the particular condition is satisfied and the distribution for the whole analysed data matrix. The difference can be measured e.g. by the number of values with a different order.
• a remarkable difference between the distribution when the particular condition is satisfied and the distribution under another given condition.

The evaluation of these criteria requires knowledge of the frequencies of particular values of the attribute A under the condition in question. These frequencies can be computed using the cards of cedents for the conditions and the cards of particular categories. Thus the tools already developed can be used, both for generating particular conditions C and for computing the card Card_[C]. The particular frequencies can be computed as fi,j = Count(Card_[ai] ∧ Card_[sj] ∧ Card_[C]).
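The idea of comparing distributions by means of cards can be illustrated by a small sketch. It is an assumption-laden simplification, not the Pareto-Miner procedure: it uses only the cards of the categories of A and the card of one condition C, the criterion is the number of values with a different order mentioned above, and the region names, the eight-row data and the function names are hypothetical.

```python
def count(card):
    """Count(card): number of 1s in the bit string."""
    return bin(card).count("1")


def distribution(category_cards, condition_card):
    """Frequency of each category of A among the rows satisfying the condition."""
    return {cat: count(card & condition_card) for cat, card in category_cards.items()}


def values_with_different_order(dist_a, dist_b):
    """Number of positions at which the two frequency rankings differ."""
    def ranking(dist):
        # Most frequent category first; ties broken by name.
        return sorted(dist, key=lambda cat: (-dist[cat], cat))
    return sum(1 for x, y in zip(ranking(dist_a), ranking(dist_b)) if x != y)


# Hypothetical data matrix with 8 rows; the lowest bit is the first row.
region_cards = {
    "Prague":        0b01000111,
    "South Moravia": 0b10011000,
    "Other":         0b00100000,
}
all_rows = 0b11111111
card_high_salary = 0b11011000  # card of the condition C

whole = distribution(region_cards, all_rows)
conditional = distribution(region_cards, card_high_salary)
difference = values_with_different_order(whole, conditional)
```

In this toy example Prague is the most frequent region overall, while South Moravia is the most frequent one among the rows satisfying the condition, so two values change their order.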

Literature

[1] Agrawal, R. et al.: Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining (Fayyad, U. M. et al. eds.), AAAI Press / The MIT Press, 1996, pp. 307-328.
[2] Hájek, P. – Havránek, T.: Mechanising Hypothesis Formation – Mathematical Foundations for a General Theory, Springer-Verlag, 1978, pp. 396.
[3] Rauch, J.: Some Remarks on Computer Realisations of GUHA Procedures, International Journal of Man-Machine Studies 10, 1978, pp. 23-28.
[4] Rauch, J.: Classes of Four-Fold Table Quantifiers, Principles of Data Mining and Knowledge Discovery (J. Zytkow, M. Quafafou, eds.), Springer-Verlag, 1998, pp. 203-211.
[5] Rauch, J.: Four-fold Table Calculi and Missing Information, JCIS’98, Association for Intelligent Machinery, Vol. II. (Wang Paul eds.), Durham, Duke University, 1998.
[6] Rauch, J. – Šimůnek, M.: Mining for 4ft Association Rules by 4ft-Miner, INAP 2001, The Proceeding of the International Conference On Applications of Prolog, Prolog Association of Japan, Tokyo, October 2001, pp. 285-294.

This paper has been supported by the grant COST ACTION 274 – TARSKI (Theory and Applications of Relational Structures as Knowledge Instruments).
