To appear in Knowledge and Information Systems

A Collaborative Filtering Framework Based on Fuzzy Association Rules and Multiple-Level Similarity

Cane Wing-ki Leung, Stephen Chi-fai Chan and Fu-lai Chung
Department of Computing, The Hong Kong Polytechnic University

Abstract. The rapid development of Internet technologies in recent decades has imposed a heavy information burden on users. This has led to the popularity of recommender systems, which provide advice to users about items they may like to examine. Collaborative Filtering (CF) is the most promising technique in recommender systems, providing personalized recommendations to users based on their previously expressed preferences and those of other, similar users. This paper introduces a CF framework based on Fuzzy Association Rules And Multiple-level Similarity (FARAMS). FARAMS extends existing techniques by using fuzzy association rule mining, and takes advantage of product similarities in taxonomies to address data sparseness and non-transitive associations. Experimental results show that FARAMS improves prediction quality, as compared to similar approaches.

Keywords: Collaborative Filtering; Recommender Systems; Fuzzy Association Rule Mining; Similarity

Received Feb 28, 2005; Revised Aug 30, 2005; Accepted Dec 15, 2005

1. Introduction

The rapid development of Internet technologies in recent decades has imposed a heavy information burden on users. This has led to the popular use of recommender systems, which receive information from users about items they are interested in, and recommend items that may fit their needs. There are three major types of recommender systems: content-based, knowledge-based and social- or collaborative-filtering- (CF-) based (Burke 2000). Content-based recommender systems establish users' interest profiles by analyzing the features of their preferred items. They compare the features of recommendable items to those of the user's preferred items, and then recommend the more relevant items as measured by feature similarity. Knowledge-based systems make use of knowledge about users and products, applying a reasoning process to determine which products meet a user's requirements. Social-filtering or CF-based recommender systems provide personalized recommendations according to user preferences. They maintain data about target (or active) users' purchasing habits or interests, use this data to identify groups of similar users, and then recommend items liked by other, similar users.

CF systems offer two major advantages over the other two types of recommender systems (Goldberg et al. 1992, Resnick et al. 1994, Maltz and Ehrlich 1995, Shardanand and Maes 1995). First, they do not take content information into account, and second, they are simpler and easier to implement.

Because CF systems do not take content information into account, they can filter items that are not computer-parsable. Further, ignoring content information allows CF systems to generate recommendations based on user tastes rather than the objective properties of the domain items themselves. This means that the system can recommend items very different (content-wise) from those the user has previously shown a preference for, which overcomes a major limitation of content-based recommender systems (Shardanand and Maes 1995).

As noted, CF techniques are also much simpler and easier to implement than knowledge-based recommender systems. In order to build a knowledge base of items, knowledge-based systems require a domain knowledge engineering process, whereas CF systems can be fully automated. Consequently, it is easy to apply CF to domains where a database of user preferences is available. To date, CF has been successfully applied in various domains to recommend, for example, electronic documents (Goldberg et al. 1992), Usenet articles (Resnick et al. 1994, Konstan et al. 1997, Sarwar et al. 1998), movies (GroupLens, Perny and Zucker 1999), jokes (Goldberg et al. 2001), books (Linden et al. 2003) and web pages (Silvestri et al. 2004).

CF algorithms fall into two major types: memory-based and model-based (Breese et al. 1998). Memory-based algorithms, such as neighborhood-based methods, operate over the entire user database and recommend products using statistical methods. Model-based algorithms construct compact models from a user database and then recommend products using probabilistic methods. Since such models can be constructed offline, the online performance of model-based algorithms is usually better. This, along with the recognition of the scalability problem of memory-based algorithms, has meant that most recent algorithms are model-based. Well-known examples include classification, clustering, hybrid algorithms that combine CF and content-based techniques, and the focus of this paper, association rule mining (ARM).

There has been considerable research into CF, yet the issues of scalability, data sparseness (Sarwar et al. 1998) and the non-transitive association problem (Kim et al. 2004) remain open challenges. This paper focuses on the ARM paradigm of CF and introduces a CF framework based on Fuzzy Association Rules And Multiple-level Similarity (FARAMS). The use of ARM-based techniques in CF has a number of desirable outcomes. First, product hierarchies can be integrated into the association rule mining process; since relationships among items are already implicit in these hierarchies, this reduces non-transitive association problems (Leung et al. 2004). Second, ARM techniques can apply higher-level association rules to address data sparseness, improving recall (Kim and Kim 2003). Third, the use of integrated product hierarchies increases the number of items that can be recommended to users for whom only limited preference data is available.
Finally, ARM techniques provide the flexibility to easily mine associations among content-related attributes and item ratings if necessary. For example, a new user of a travel recommender system may specify trip-related attributes, which are then used to find destinations or tourist spots that are strongly associated with those attributes.

The rest of this paper is organized as follows. Section 2 describes popular approaches to CF that are related to the proposed framework, FARAMS, analyzing their underlying technologies and commenting on their strengths and weaknesses. Section 3 details related ARM techniques, including their problem definitions, and describes a variation of ARM known as fuzzy association rule mining (FAR). Section 4 discusses the various steps in FARAMS. Section 5 presents and discusses the results of an evaluation of FARAMS. Section 6 concludes the paper.

2. Related Work

In recent years, CF has attracted a considerable amount of research attention, resulting in the proposal of a large variety of CF approaches. Well-known approaches include the traditional neighborhood-based methods, classification, clustering, association rule mining (ARM), as well as hybrid algorithms that combine CF and content-based techniques. This section describes three of these approaches that are related to the proposed framework: k-nearest neighbor, classification, and ARM techniques.

2.1. K-nearest Neighbor (k-nn)

K-nearest neighbor (k-nn) was commonly used in early CF-based systems. It consists of three major steps: user similarity weighting, neighbor selection, and prediction computation (Resnick et al. 1994, Breese et al. 1998, Herlocker et al. 2002). The similarity weighting step weights all users in the database according to their similarity with the active user. Similarities are reflected in the ratings users have given items; for two users to be comparable, only items that both have rated are counted. The neighbor selection step selects a number of k-nearest neighbors of the active user as item predictors. These selected users have the highest similarity weights, and an item's prediction score is computed from their interests together with the partial information available about the active user.

K-nn is well known for its simplicity and for its prediction accuracy, which improves as the active user rates more items (Breese et al. 1998, Herlocker et al. 2002). Further, because its predictions are ratings-based and content-independent, k-nn can be used even in domains where textual descriptions of products are unavailable, not meaningful, or cannot easily be categorized by any attribute.

K-nn is not without problems, however, one of the most crucial being data sparseness. Data sparseness arises because users cannot rate items they have not observed. In practice, for example where a purchase database is used, the number of items unobserved by each user is very large. This has an adverse effect on prediction quality (Deshpande and Karypis 2004), as the known preferences of users may be insufficient for generating recommendations. Another difficulty with k-nn, identified in Kim et al. (2004), is the non-transitive association problem, which manifests in two forms.


In user-similarity-based methods, if two users have experienced or rated similar but not identical items, their correlation is lost. This is known as the user-based non-transitive association problem. We may attempt to avoid it by using item similarity instead of user similarity, but even then, if two similar items have never been experienced or rated by the same user, we face the item-based non-transitive association problem.

K-nn also has difficulties with scalability, especially in real-time applications, because each prediction must be made from real-time computations performed over the entire database. This leads to severe performance bottlenecks in the similarity weighting step: as the number of users and items in the system grows, performance degrades.

The proposed method deals with data sparseness and non-transitive associations by taking advantage of item similarities that are implicit in their taxonomies. Scalability is ensured by identifying item similarities offline.
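For concreteness, the following is a minimal sketch of the user-based k-nn scheme described in this section, in the style of Resnick et al. (1994): Pearson correlation over co-rated items for similarity weighting, selection of the k most similar raters of the target item, and a weighted deviation-from-mean prediction. The function names and the dictionary-based rating store are illustrative and not part of FARAMS itself.

    import math

    def pearson(ratings_u, ratings_v):
        """Pearson correlation computed over co-rated items only."""
        common = set(ratings_u) & set(ratings_v)
        if len(common) < 2:
            return 0.0
        mean_u = sum(ratings_u[i] for i in common) / len(common)
        mean_v = sum(ratings_v[i] for i in common) / len(common)
        num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
        den = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common) *
                        sum((ratings_v[i] - mean_v) ** 2 for i in common))
        return num / den if den else 0.0

    def predict(active, item, db, k=20):
        """Predict the active user's rating for item as a similarity-weighted
        deviation from each neighbor's mean rating."""
        mean_a = sum(db[active].values()) / len(db[active])
        neighbors = sorted(((pearson(db[active], db[u]), u)
                            for u in db if u != active and item in db[u]),
                           reverse=True)[:k]
        num = sum(w * (db[u][item] - sum(db[u].values()) / len(db[u]))
                  for w, u in neighbors)
        den = sum(abs(w) for w, _ in neighbors)
        if not den:
            return mean_a
        return mean_a + num / den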

2.2. Classification

Instead of predicting the rating of an item, classification approaches predict whether the active user would like or dislike it (Basu et al. 1998, Breese et al. 1998, Miyahara and Pazzani 2000). To do so, user ratings are first discretized into classes such as "Like" and "Dislike". Classification methods, such as Bayesian networks (Breese et al. 1998) and the simple Bayesian classifier (Miyahara and Pazzani 2000), are then used to make predictions for the active user. As classification models are constructed offline, and consequently only the relevant models have to be inspected in real time, classification approaches improve scalability. Furthermore, because their class labels use linguistic terms, classification approaches produce more human-understandable results.

There are two drawbacks to classification approaches: first, the degraded quality of their predictions, a problem arising from their probabilistic nature, and second, the sharp boundary problem, which is caused by Boolean discretization of quantitative ratings data (Srikant and Agrawal 1996, Lee and Hyung 1997, Gyenesei 2000). While prediction quality has been one of the major goals in CF research (Sarwar et al. 1998, Kim and Kim 2003, Deshpande and Karypis 2004), the sharp boundary problem has received little attention, if any.

We use an example to illustrate the sharp boundary problem. In existing work (Miyahara and Pazzani 2000, Lin et al. 2002, Kim and Kim 2003), ratings are discretized using a score threshold (or like threshold): ratings at or above a certain value are transformed into "Like", and into "Dislike" otherwise. Given the Jester dataset (Goldberg et al. 2001), in which ratings were recorded as real numbers ranging from -10 to +10, and a like threshold of 3, the rating 3 is transformed into "Like" while the rating 2.9 is transformed into "Dislike". In other words, a 0.1 difference between the two values causes them to fall into two totally different classes. Furthermore, using this method, both the ratings 10 and 3 are classified as "Like". Clearly, this is problematic, as we intuitively know that the preference implied by a rating of 10 is much stronger.

CF techniques have not addressed the sharp boundary problem. One solution is fuzzy logic, which has been widely adopted in related studies, such as data mining on quantitative data (Lee and Hyung 1997, Kuok et al. 1998, Gyenesei 2000). FARAMS also adopts this approach, as described further in Section 3.3.
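The sharp boundary problem can be made concrete in a few lines. The snippet below applies a like threshold of 3 to Jester-style ratings in [-10, +10]; the function name is ours.

    def sharp_classify(rating, like_threshold=3.0):
        """Boolean discretization: everything at or above the threshold is "Like"."""
        return "Like" if rating >= like_threshold else "Dislike"

    # Ratings 3.0 and 2.9 differ by only 0.1 yet land in opposite classes,
    # while 3.0 and 10.0 are treated as expressing an identical preference.
    print(sharp_classify(3.0), sharp_classify(2.9), sharp_classify(10.0))
    # -> Like Dislike Like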


2.3. Association Rule Mining (ARM)

Association rule mining (ARM) is a well-studied data mining technique whose purpose is to discover interesting relationships between items by finding items that frequently appear together (Agrawal et al. 1993, Agrawal and Srikant 1994, Han and Kamber 2000). When applied to CF, ARM can be used to generate recommendations after interesting rules have been mined (Fu et al. 2000, Lin et al. 2002, Kim and Kim 2003, Huang et al. 2000). For example, if the association rule "A → B" is interesting in a certain system, and the active user has previously liked item A, the user will probably also like item B.

Lin et al. (2002) proposed an Adaptive-Support Association Rule Mining (ASARM) algorithm for CF. ASARM mines rules for one target item at a time, and automatically adjusts the minimum support value to mine a user-specified number of rules. Numeric ratings are discretized into two classes, "Like" and "Dislike", based on a chosen threshold value.

The MAR approach proposed in Kim and Kim (2003) applies association rules between categories in multiple-level product taxonomies to address data sparseness. MAR can be applied to items organized in a hierarchical category structure, such as the classification of goods in a department store (an "is-a" hierarchy). Taking advantage of such a product hierarchy increases the number of recommendable items for users whose known preference data would otherwise be too limited to produce recommendations. In the MAR approach, the rating a user has given an item is classified as "Like" if it is above that user's average rating over all items.

As noted, the sharp boundary problem exists in both ASARM and MAR. The proposed framework addresses this problem by incorporating fuzziness into quantitative ratings.

3. Association Rule Mining

This section first describes the problem definitions of ARM, including the classical definition and the adaptive-support ARM tailored for CF, followed by the fuzzy association rule mining technique (FAR) that is used to address the sharp boundary problem in FARAMS.

3.1. The ARM Problem

The following formal problem definition of ARM is provided in Agrawal et al. (1993): given a database of transactions (T), generate all association rules that satisfy certain syntactic and support constraints. Syntactic constraints involve restrictions on the items that can appear in a rule, while support constraints define the number of transactions in T that must support a rule. An association rule is denoted "X → Ij", where X, known as the antecedent or body of the rule, is a set of items in I, and Ij, known as the consequent or head of the rule, is a single item in I that is not present in X. The rule's statistical significance and strength are measured by two values known as support and confidence. The support of a rule is defined as the percentage of transactions in T that contain both X and Ij, while its confidence is defined as the percentage of transactions containing X that also contain Ij.


For example, the rule "A → B" [20%, 90%] means that 90% of users who purchased item A also purchased item B, and that 20% of all users purchased both A and B. Note that a rule is considered interesting only if its support and confidence values are higher than the user-specified minima.

The ARM problem can be further decomposed into two sub-problems (Agrawal et al. 1993):

1. Find all combinations of items, known as large itemsets or frequent itemsets, with support values above a certain threshold (the minimum support). Note that if a certain itemset Y is frequent, all subsets of Y are also frequent. This is known as the downward closure property of support values.

2. Generate association rules from the frequent itemsets. For a rule "X → Ij", divide the support of the union of X and Ij by that of X to obtain the rule's confidence. If the confidence is above the specified minimum, the rule holds.
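To ground these definitions, the short sketch below computes support and confidence over a toy transaction database; the transactions and the resulting numbers are invented purely for illustration.

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in the itemset."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(body, head, transactions):
        """Fraction of transactions containing the body that also contain the head."""
        return support(set(body) | {head}, transactions) / support(body, transactions)

    transactions = [{"A", "B"}, {"A"}, {"B", "C"}, {"C"}, {"A", "B", "C"}]
    print(support({"A", "B"}, transactions))     # 0.4 -> "A -> B" has 40% support
    print(confidence({"A"}, "B", transactions))  # 0.666...: 2 of 3 A-transactions contain B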

3.2. Adaptive-Support ARM

Lin et al. (2002) pointed out that, for two reasons, the traditional ARM problem definition is inefficient for collaborative recommendation. First, ARM algorithms mine rules for all items in the database, and many of the mined rules will not be relevant for a given user. Second, the minimum support and confidence values have to be specified in advance; due to variations in user tastes and in the popularity of items, this might lead to either too many or too few rules, as rules involving less popular items may be difficult to discover. In view of these problems, Lin et al. (2002) proposed a new ARM problem definition for CF:

Given: A transaction dataset, a target item, a specified minimum confidence (minConfidence) and a desired range for the number of rules [minNumRules, maxNumRules],

Find: A set of association rules S having the target item in the heads, such that the number of rules is in the given range (unless fewer than minNumRules exist for the given minConfidence); the rules satisfy the minConfidence constraint; and the rules have the highest possible support (i.e. no rule outside S with a confidence greater than or equal to minConfidence has a higher support than a rule in S).

Although the rule mining process for all items in the database makes multiple passes over the data, this is done offline and therefore does not affect the response time of the recommendation process. The proposed method adopts the above ARM problem definition and, with some extensions and modifications, the ASARM algorithm.
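One schematic way to satisfy this definition is to search over the minimum support until the rule count falls within the desired range. The sketch below uses simple bisection around a caller-supplied mine_rules function; the actual ASARM algorithm in Lin et al. (2002) adjusts the support differently, so this is only an illustration of the idea under our own assumptions.

    def mine_with_adaptive_support(mine_rules, min_rules, max_rules, min_conf,
                                   lo=0.0, hi=1.0, max_iter=20):
        """Bisect on minSupport until mine_rules(min_support, min_conf) returns
        between min_rules and max_rules rules (or the iteration budget runs out)."""
        rules = []
        for _ in range(max_iter):
            mid = (lo + hi) / 2
            rules = mine_rules(mid, min_conf)
            if len(rules) > max_rules:    # too many rules: raise the support bar
                lo = mid
            elif len(rules) < min_rules:  # too few rules: lower the support bar
                hi = mid
            else:
                break
        return rules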

3.3. Fuzzy Association Rule Mining

Fuzzy association rule mining (FAR) addresses the sharp boundary problem by extending the classical problem definition of ARM (Lee and Hyung 1997, Kuok et al. 1998, Gyenesei 2000). A fuzzy association rule is of the form "⟨A is X⟩ → ⟨B is Y⟩", where X and Y are fuzzy sets that characterize attributes A and B. The fuzzy set concept provides a smooth transition between members and non-members of a set: an attribute can be a member of a fuzzy set to a certain degree in [0, 1].

[Fig. 1. Sample fuzzy sets and membership functions. Three plots of degree of membership against rating for the fuzzy sets L, N and D: (a) MF(A), over ratings 1 to 5; (b) MF(B), over ratings -10 to +10; (c) MF(C), over ratings 0 to 1.]

Table 1. A sample user-item rating matrix.

UserID | M1 | M2 | M3 | M4 | M5
-------+----+----+----+----+----
100    | 1  | 4  |    |    | 2
200    |    |    |    | 5  |
300    | 5  | 2  | 1  |    | 5
400    | 5  | 1  |    |    |

This value is assigned by the membership function (MF) associated with each fuzzy set. Three examples are shown in Fig. 1, in which numeric ratings of items are fuzzified into three classes, Like, Neutral and Dislike, represented by L, N and D respectively. Given the sample ratings in Table 1 and MF(A) in Fig. 1(a), Table 2 shows the transformed ratings of the two movies M1 and M2. Each attribute, which was a movie name, is expanded into a ⟨MovieID, FuzzySet⟩ pair, and each value, which was the user rating of that movie, is transformed into its membership degree with respect to the specified fuzzy set.

A fuzzy support (FS) value reflects not only the number of transactions supporting an itemset but also their degree of support. The FS of an itemset ⟨A, X⟩ is defined as follows (Gyenesei and Teuhola 2001):

$$ FS_{\langle A,X \rangle} = \frac{\sum_{t_i \in T} \prod_{a_j \in A} m_{x_j \in X}(t_i[a_j])}{|T|} \qquad (1) $$

In Equation (1), ⟨A, X⟩ represents an ⟨Itemset, FuzzySet⟩ pair, where A is a set of attributes a_j and X is the set of fuzzy sets x_j that characterizes A. t_i[a_j] represents the value of a_j in the i-th record of the transactional database T, and this value is transformed to a degree of membership between 0 and 1 by the MF of a_j, denoted m_{x_j ∈ X}(t_i[a_j]). The vote of each transaction t_i is calculated by taking the product of the membership degrees of the attributes a_j that it contains. The votes of all t_i ∈ T are then summed and divided by the size of T (|T|) to obtain FS_⟨A,X⟩.

Table 2. Fuzzified ratings.

UserID | ⟨M1,L⟩ | ⟨M1,N⟩ | ⟨M1,D⟩ | ⟨M2,L⟩ | ⟨M2,N⟩ | ⟨M2,D⟩
-------+--------+--------+--------+--------+--------+-------
100    | 0      | 0      | 1      | 0.8    | 0.5    | 0.2
200    | 0      | 0      | 0      | 0      | 0      | 0
300    | 1      | 0      | 0      | 0.2    | 0.5    | 0.8
400    | 1      | 0      | 0      | 0      | 0      | 1

According to Equation (1), however, some movies in Table 2 contribute more than others towards the support counts of itemsets (Gyenesei 2000). An example is the movie M2 in the first transaction, with UserID 100 (t_100). The support counts of the itemsets ⟨M2,L⟩, ⟨M2,N⟩ and ⟨M2,D⟩ are respectively 0.8, 0.5 and 0.2; in other words, t_100 is counted 1.5 times (0.8 + 0.5 + 0.2) for the movie M2. To avoid some movies contributing more than others, ratings are normalized using Equation (2), which ensures that every movie has a total contribution of 1 if rated (Gyenesei 2000):

$$ m'_{a_j}(l, t_i[a_j]) = \frac{m_{a_j}(l, t_i[a_j])}{\sum_{l=1}^{f(a_j)} m_{a_j}(l, t_i[a_j])} \qquad (2) $$

In Equation (2), m'_{a_j} and m_{a_j} respectively represent the normalized and original membership degrees of a_j for the l-th fuzzy set, t_i[a_j] represents the value of the attribute a_j in the i-th record of T, and f(a_j) represents the number of fuzzy sets for a_j. According to this equation, the fuzzified ratings in Table 2 are normalized to 0.8/1.5, 0.5/1.5 and 0.2/1.5 (i.e. 0.53, 0.33 and 0.14), as shown in Table 3.

Table 3. Fuzzified ratings with normalization.

UserID | ⟨M1,L⟩ | ⟨M1,N⟩ | ⟨M1,D⟩ | ⟨M2,L⟩ | ⟨M2,N⟩ | ⟨M2,D⟩
-------+--------+--------+--------+--------+--------+-------
100    | 0      | 0      | 1      | 0.53   | 0.33   | 0.14
200    | 0      | 0      | 0      | 0      | 0      | 0
300    | 1      | 0      | 0      | 0.14   | 0.33   | 0.53
400    | 1      | 0      | 0      | 0      | 0      | 1

FS values are then computed based on the normalized ratings. Let ⟨A,X⟩ = ⟨M1,L⟩ and ⟨B,Y⟩ = ⟨M2,D⟩, and therefore ⟨C,Z⟩ = ⟨{M1,M2}, {L,D}⟩. Given the ratings in Table 3, FS_⟨A,X⟩ = (0 + 0 + 1 + 1)/4 = 0.5, FS_⟨B,Y⟩ = (0.14 + 0 + 0.53 + 1)/4 = 0.4175, and FS_⟨C,Z⟩ = ((0 × 0.14) + (0 × 0) + (1 × 0.53) + (1 × 1))/4 = 0.3825.

The fuzzy confidence (FC) of the rule "⟨A,X⟩ → ⟨B,Y⟩", denoted FC_⟨⟨A,X⟩,⟨B,Y⟩⟩, is shown in Equation (3) (Gyenesei and Teuhola 2001). It is computed by dividing FS_⟨C,Z⟩ by FS_⟨A,X⟩, where A ⊂ C, B = C − A, X ⊂ Z and Y = Z − X:

$$ FC_{\langle\langle A,X\rangle,\langle B,Y\rangle\rangle} = \frac{FS_{\langle C,Z\rangle}}{FS_{\langle A,X\rangle}} = \frac{\sum_{t_i \in T} \prod_{c_j \in C} m_{z_j \in Z}(t_i[c_j])}{\sum_{t_i \in T} \prod_{a_j \in A} m_{x_j \in X}(t_i[a_j])} \qquad (3) $$

Given the ratings in Table 3, FC_⟨⟨A,X⟩,⟨B,Y⟩⟩ = 0.3825 / 0.5 = 0.765.

There is another interestingness measure based on the correlation (CORR) between the body and the head of a fuzzy association rule. CORR between ⟨A,X⟩ and ⟨B,Y⟩, denoted CORR_⟨⟨A,X⟩,⟨B,Y⟩⟩, is defined as follows (Gyenesei and Teuhola 2001):

$$ CORR_{\langle\langle A,X\rangle,\langle B,Y\rangle\rangle} = \frac{Cov_{\langle\langle A,X\rangle,\langle B,Y\rangle\rangle}}{\sqrt{Var_{\langle A,X\rangle} \cdot Var_{\langle B,Y\rangle}}} \qquad (4) $$

where

$$ Cov_{\langle\langle A,X\rangle,\langle B,Y\rangle\rangle} = FS_{\langle C,Z\rangle} - FS_{\langle A,X\rangle} \cdot FS_{\langle B,Y\rangle} \qquad (5) $$

$$ Var_{\langle A,X\rangle} = FS_{\langle A,X\rangle^2} - (FS_{\langle A,X\rangle})^2 \qquad (6) $$

$$ FS_{\langle A,X\rangle^2} = \frac{\sum_{t_i \in T} \left( \prod_{a_j \in A} m_{x_j \in X}(t_i[a_j]) \right)^2}{|T|} \qquad (7) $$

and similarly for ⟨B,Y⟩. The definitions in Equations (4)-(7) are extensions of the basic formulas for variance and covariance in statistics (Gyenesei and Teuhola 2001). The value of the fuzzy correlation ranges from -1 to 1; a positive value indicates that the body and the head of a rule are positively correlated, and the closer the value is to 1, the more related they are. According to these equations, and given the ratings in Table 3, Cov_⟨⟨A,X⟩,⟨B,Y⟩⟩ = 0.3825 − (0.5 × 0.4175) = 0.17375, FS_⟨A,X⟩² = (0² + 0² + 1² + 1²)/4 = 0.5, (FS_⟨A,X⟩)² = (0.5)² = 0.25, Var_⟨A,X⟩ = 0.5 − 0.25 = 0.25, FS_⟨B,Y⟩² = (0.14² + 0² + 0.53² + 1²)/4 ≈ 0.3251, (FS_⟨B,Y⟩)² = (0.4175)² ≈ 0.1743, Var_⟨B,Y⟩ = 0.3251 − 0.1743 = 0.1508, and therefore CORR_⟨⟨A,X⟩,⟨B,Y⟩⟩ = 0.17375 / √(0.25 × 0.1508) ≈ 0.895.
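The worked example can be checked with a few lines of code. The sketch below recomputes FS, FC and CORR for ⟨A,X⟩ = ⟨M1,L⟩ and ⟨B,Y⟩ = ⟨M2,D⟩ from the normalized ratings in Table 3 (the variable names are ours):

    # Normalized membership degrees from Table 3, one entry per user 100..400.
    A = [0.0, 0.0, 1.0, 1.0]    # <A,X> = <M1,L> (body)
    B = [0.14, 0.0, 0.53, 1.0]  # <B,Y> = <M2,D> (head)
    n = len(A)

    FS_A = sum(A) / n                               # 0.5      (Equation 1)
    FS_B = sum(B) / n                               # 0.4175
    FS_C = sum(a * b for a, b in zip(A, B)) / n     # 0.3825
    FC = FS_C / FS_A                                # 0.765    (Equation 3)

    cov = FS_C - FS_A * FS_B                        # 0.17375  (Equation 5)
    var_A = sum(a * a for a in A) / n - FS_A ** 2   # 0.25     (Equations 6-7)
    var_B = sum(b * b for b in B) / n - FS_B ** 2   # ~0.1508
    CORR = cov / (var_A * var_B) ** 0.5             # ~0.895   (Equation 4)
    print(FS_A, FS_B, FS_C, FC, CORR)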

4. The FARAMS Framework

The FARAMS framework adopts existing ARM techniques, including those of Lin et al. (2002) and Kim and Kim (2003), to generate collaborative recommendations. When applied to CF, these techniques are integrated with classification techniques for handling quantitative ratings data; FAR mining, a variation of the classical ARM techniques, is used to address the resulting sharp boundary problem (Gyenesei 2000).

FARAMS is carried out in four major steps. The first step is data pre-processing, in which data are prepared in a format suitable for the subsequent tasks. The second step mines interesting associations among domain items and categories from user preferences; the rules mined in this step are stored in the system and used when users request recommendations. The third step is prediction computation, which determines the rules relevant to a user and assigns predicted preferences to the items recommended by those rules. The fourth step produces recommendations; if the number of recommendable items is smaller than a predefined number, multiple-level similarity among items is used to predict user preferences for items that are not covered by the product-level association rules. The following sections provide further details about the tasks involved in each step.


Table 4. Mapping user preferences for (a) items and (b) categories into transactions.

(a)
TID | Items
100 | M1(1), M2(4), M5(2)
200 | M4(5)
300 | M1(5), M2(2), M3(1), M5(5)
400 | M1(5), M2(1)

(b)
TID | Categories
100 | [C1](4), [C2](2), [C3](1.5), ...
200 | [C5](5), [C9](5), ...
300 | [C1](2), [C2](5), [C3](5), ...
400 | [C1](1), [C3](5), [C4](5), ...

4.1. Data Pre-Processing

This section describes the procedures required in the data pre-processing step: transforming user-item rating matrices into transactions for both items and categories, fuzzifying user preferences for FAR mining, and transforming transactions to allow efficient support counting.

4.1.1. Mapping user-item matrices to transactions

CF ratings data are usually represented as preference matrices, which are transformed into transactional databases for ARM tasks. As shown in Table 4(a), each transaction consists of a transaction identifier (TID), which is the UserID of the user to whom the transaction belongs, together with the IDs and ratings of the items that user has rated.

4.1.2. Computing user preferences for higher-level categories

When the known preferences of a user are so limited that recommendations cannot be produced, FARAMS makes use of multiple-level similarity among items (hereafter referred to as MS). Users' preferences for higher-level categories are not readily available in CF datasets, because users rate product-level items rather than their categories. These preferences therefore have to be computed from the original transactions containing users' item preferences, so that rules involving categories can be mined. A user can have preferences in multiple categories, and ratings of items in the same category can differ, so a user's preference for a category is computed by averaging the ratings given to items in that category. In the sample transactions shown in Table 4(b), a category ID is enclosed in square brackets, followed by the average rating the user has given it.
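A minimal sketch of both transformations follows. The item-to-category taxonomy used here is hypothetical, since the paper does not spell one out for this example; only the shape of the computation mirrors Table 4.

    # Hypothetical taxonomy mapping product-level items to higher-level categories.
    TAXONOMY = {"M1": "C1", "M2": "C2", "M3": "C3", "M4": "C5", "M5": "C3"}

    def to_transactions(rating_matrix):
        """One transaction per user: {TID: {item: rating}}, cf. Table 4(a)."""
        return {uid: dict(row) for uid, row in rating_matrix.items()}

    def category_transactions(transactions, taxonomy=TAXONOMY):
        """Average each user's item ratings per category, cf. Table 4(b)."""
        result = {}
        for uid, items in transactions.items():
            sums, counts = {}, {}
            for item, rating in items.items():
                cat = taxonomy[item]
                sums[cat] = sums.get(cat, 0.0) + rating
                counts[cat] = counts.get(cat, 0) + 1
            result[uid] = {cat: sums[cat] / counts[cat] for cat in sums}
        return result

    matrix = {100: {"M1": 1, "M2": 4, "M5": 2}, 200: {"M4": 5}}
    print(category_transactions(to_transactions(matrix)))
    # -> {100: {'C1': 1.0, 'C2': 4.0, 'C3': 2.0}, 200: {'C5': 5.0}}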

4.1.3. Fuzzifying ratings

The fuzzification of ratings is implemented in four steps, which are the same for both higher-level categories and product-level items. First, the fuzzy sets and membership functions for ratings are determined. Second, items in transactions are expanded into ⟨Item, FuzzySet⟩ or ⟨Category, FuzzySet⟩ pairs. Third, the degree of membership of each rating is determined with respect to each fuzzy set. Finally, the fuzzified ratings are normalized so that each rated item makes the same total contribution of 1. Table 5 extends Table 4(a) to show the resulting fuzzified transactions, given the fuzzy sets and membership functions in Fig. 1(a). For simplicity, the table omits ⟨Item, FuzzySet⟩ pairs with membership degrees of 0.
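The sketch below illustrates the fuzzification and normalization steps for one rating on a 1-5 scale. The triangular membership functions are our own illustrative choice rather than the actual MF(A) of Fig. 1(a), so the resulting degrees differ slightly from those in Table 5.

    # Illustrative triangular membership functions on a 1-5 rating scale.
    MFS = {
        "L": lambda r: max(0.0, min(1.0, (r - 3) / 2)),  # 5 -> 1, 3 -> 0
        "N": lambda r: max(0.0, 1 - abs(r - 3) / 2),     # 3 -> 1, 1 or 5 -> 0
        "D": lambda r: max(0.0, min(1.0, (3 - r) / 2)),  # 1 -> 1, 3 -> 0
    }

    def fuzzify(rating):
        """Expand one rating into normalized <FuzzySet, degree> pairs so that,
        per Equation (2), the rated item contributes exactly 1 in total."""
        degrees = {label: mf(rating) for label, mf in MFS.items()}
        total = sum(degrees.values())
        return {label: d / total for label, d in degrees.items() if d > 0}

    print(fuzzify(5))  # {'L': 1.0}
    print(fuzzify(4))  # {'L': 0.5, 'N': 0.5} with these illustrative MFs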

Table 5. Transactions containing ⟨Item, FuzzySet⟩ pairs and normalized membership degrees.

TID | Items
100 | ⟨M1,D⟩(1), ⟨M2,L⟩(0.5), ⟨M2,N⟩(0.33), ⟨M2,D⟩(0.17), ⟨M5,L⟩(0.17), ⟨M5,N⟩(0.33), ⟨M5,D⟩(0.5)
200 | ⟨M4,L⟩(1)
300 | ⟨M1,L⟩(1), ⟨M2,L⟩(0.17), ⟨M2,N⟩(0.33), ⟨M2,D⟩(0.5), ⟨M3,D⟩(1), ⟨M5,L⟩(1)
400 | ⟨M1,L⟩(1), ⟨M2,D⟩(1)

Table 6. Transformed transactions containing (a) items and (b) categories for efficient support counting.

(a)
⟨Item, FuzzySet⟩ | TIDs (membership degrees)
⟨M1,L⟩ | 300(1), 400(1)
⟨M1,D⟩ | 100(1)
⟨M2,L⟩ | 100(0.5), 300(0.17)
⟨M2,N⟩ | 100(0.33), 300(0.33)
⟨M2,D⟩ | 100(0.17), 300(0.5), 400(1)
...    | ...

(b)
⟨Category, FuzzySet⟩ | TIDs (membership degrees)
⟨[C1],L⟩ | 100(0.5), 300(0.17)
⟨[C1],N⟩ | 100(0.33), 300(0.33)
⟨[C1],D⟩ | 100(0.17), 300(0.5), 400(1)
⟨[C2],L⟩ | 100(0.17), 300(1)
⟨[C2],N⟩ | 100(0.33)
⟨[C2],D⟩ | 100(0.5)
...      | ...

4.1.4. Transforming transactions for efficient support counting

Like the ASARM algorithm (Lin et al. 2002), FARAMS obtains the support counts of itemsets by making multiple passes over the data. One major optimization in our approach is that the transactions described in the previous sections are further transformed into a vertical tid-list format (Zaki 2000), in which each ⟨Item, FuzzySet⟩ pair is associated with the list of transactions that contain the pair and the corresponding membership degrees. As shown in Table 6, the same transformation procedure is applied to both items and categories. This transformation enables efficient support counting of itemsets, as described in Section 4.2.3.
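A sketch of the inversion, assuming fuzzified transactions keyed by TID as in Table 5:

    def to_tid_lists(fuzzy_transactions):
        """Invert {TID: {(item, fuzzy_set): degree}} into the vertical tid-list
        format of Table 6: {(item, fuzzy_set): {TID: degree}} (Zaki 2000)."""
        tid_lists = {}
        for tid, pairs in fuzzy_transactions.items():
            for pair, degree in pairs.items():
                tid_lists.setdefault(pair, {})[tid] = degree
        return tid_lists

    fuzzy_transactions = {
        300: {("M1", "L"): 1.0, ("M2", "L"): 0.17},
        400: {("M1", "L"): 1.0, ("M2", "D"): 1.0},
    }
    print(to_tid_lists(fuzzy_transactions)[("M1", "L")])  # {300: 1.0, 400: 1.0}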

4.2. Mining User Preferences

The overall structure and flow of our algorithms for adjusting the minimum support value and mining user preferences are similar to those described in Lin et al. (2002). This section therefore focuses on the adaptations and extensions made in the association rule miner of FARAMS. Several issues considered when designing the association rule miner are described first, followed by the mining algorithm of FARAMS.

4.2.1. Association mode used

Existing CF frameworks use two major association modes: user association and article (item) association. User association identifies similarities between users and recommends items preferred by users similar to the active user. Mining user associations produces rules of the form "⟨User1, Like⟩ → ⟨ActiveUser, Like⟩". For a test article of the active user, that is, an article that he or she has not rated, this rule fires if the article has been liked by User1 (Lin et al. 2002); the article will therefore be considered for recommendation.

Article (item) association identifies similarities between articles, and is used in Kim and Kim (2003) and Linden et al. (2003). It mines rules of the form "⟨ArticleA, Like⟩ → ⟨TargetArticle, Like⟩". This rule fires if the active user has liked ArticleA but has not previously rated TargetArticle; the TargetArticle in the head of the rule will then be considered for recommendation. This mode can be applied to categories by treating a category as an article. To facilitate the use of product taxonomies in the recommendation process, article association is adopted in FARAMS.

Table 7. Transactions for (a) ⟨M1,L⟩ and (b) {⟨M1,L⟩, ⟨M2,D⟩}.

(a)
⟨Item, FuzzySet⟩ | TIDs (membership degrees)
⟨M1,L⟩ | 300(1), 400(1)

(b)
⟨Item, FuzzySet⟩ | TIDs (membership degrees)
⟨M1,L⟩ | 300(1), 400(1)
⟨M2,D⟩ | 100(0.17), 300(0.5), 400(1)

4.2.2. Defining candidate 1-itemsets

As FARAMS mines rules for one target item (targetItem) at a time, association rules are generated using only itemsets containing the targetItem. Candidate 1-itemsets can therefore be limited to the "related items", defined as the union of all items that appear in transactions containing the targetItem. FARAMS mines rules in an Apriori-like fashion, iteratively generating k-itemsets by joining two (k-1)-itemsets. Once the related items are determined in the first iteration, their associated tid-lists serve as a reduced database for support counting in subsequent iterations.

Considering only "related items" does not affect the results of the rule mining process. By the downward closure property of support values, all subsets of a frequent itemset must be frequent (Agrawal and Srikant 1994). If an item never appears with the targetItem, then, since rules are generated using only itemsets containing the targetItem, any k-itemset with k > 1 containing both that item and the targetItem must be infrequent, and the item can be excluded from consideration in the first place.
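Given tid-lists in the format of Table 6, the related items of a target pair can be found with a simple co-occurrence test; the helper below is illustrative:

    def related_items(tid_lists, target_pair):
        """Candidate 1-itemsets: pairs that co-occur in at least one transaction
        with the target item, per the downward closure argument above."""
        target_tids = set(tid_lists[target_pair])
        return {pair for pair, tids in tid_lists.items()
                if pair != target_pair and target_tids & set(tids)}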

4.2.3. Computing fuzzy support values

In traditional methods, the fuzzy support count of an itemset is obtained by scanning the entire database. With the transformed transactions used in FARAMS, counting is more efficient: only k records need to be inspected, where k is the size of the itemset. As the following examples show, fuzzy support counting can be carried out using simple calculations and join operations (Zaki 2000). Referring to Table 7(a), the fuzzy support count of the 1-itemset ⟨M1,L⟩ can be found efficiently by inspecting only one record, and the result is (1 + 1) = 2. To obtain the fuzzy support count of the 2-itemset {⟨M1,L⟩, ⟨M2,D⟩}, only two records need to be inspected (Table 7(b)). The transactions that contain both items are first determined by a simple join operation; as shown in Table 7(b), 300 and 400 appear in both records. The fuzzy support count of the 2-itemset is then (1 × 0.5) + (1 × 1) = 1.5.
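Both examples can be reproduced directly on the tid-lists of Table 7; the data layout is ours, and math.prod computes the per-transaction products:

    from math import prod

    def fuzzy_support_count(tid_lists, itemset):
        """Join the tid-lists of all pairs in the itemset, then sum the products
        of membership degrees over the shared transactions (cf. Table 7)."""
        common = set.intersection(*(set(tid_lists[p]) for p in itemset))
        return sum(prod(tid_lists[p][tid] for p in itemset) for tid in common)

    tid_lists = {
        ("M1", "L"): {300: 1.0, 400: 1.0},
        ("M2", "D"): {100: 0.17, 300: 0.5, 400: 1.0},
    }
    print(fuzzy_support_count(tid_lists, [("M1", "L")]))               # 2.0
    print(fuzzy_support_count(tid_lists, [("M1", "L"), ("M2", "D")]))  # 1.5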

Algorithm 4.1. The mining algorithm.

Input: Transactional database (T), targetItem, minimum support (minSupport), minimum confidence (minConfidence), minNumRules, maxNumRules, maximum number of items in a rule's body (maxRuleLength).

Output: Set of association rules (Rt), such that each rule in Rt: (1) has targetItem in its head, (2) has no more than maxRuleLength items in its body, and (3) satisfies the minSupport and minConfidence constraints. The number of rules in Rt is at most maxNumRules. If the number of rules in Rt is above maxNumRules (resp. below minNumRules), raise the aboveMaxNumRulesFlag (resp. belowMinNumRulesFlag).

1   (F1, TID) = find_frequent_1_itemsets(T, targetItem);
2   for (k = 2; (k <= maxRuleLength + 1) and (F(k-1) != ∅) and (not Rt.aboveMaxNumRulesFlag); k++) do {
3       Ck = gen_candidate(F(k-1));
4       for each candidate c ∈ Ck do {
5           c.fuzzySupport = compute_fuzzy_support(c, TID);
6           if (c.fuzzySupport >= minSupport) then
7               add c to Fk;
8       }
9       Rk = gen_rules(Fk, targetItem, minConfidence);
10      if (|Rt| + |Rk| > maxNumRules) then
11          set Rt.aboveMaxNumRulesFlag;
12      Rt = maxNumRules rules with highest support from Rt.rules ∪ Rk.rules;
13  }
14  if (|Rt| < minNumRules) then
15      set Rt.belowMinNumRulesFlag;
16  return Rt;