A Comparison of Customer Data Clustering Techniques in an e-Shopping Application

D.N. Sotiropoulos, G.A. Tsihrintzis, A. Savvopoulos, and M. Virvou

University of Piraeus, Department of Computer Science, Piraeus 185 34, Greece
{dsotirop,geoatsi,asavvop,mvirvou}@unipi.gr

Abstract. In the context of the problem of adapting the interaction between an e-shopping application and its users, we examine and compare four different clustering methodologies, namely hierarchical, fuzzy c-means, spectral, and artificial immune network (AIN)-based clustering, applied to 150 customer profile feature vectors. The profile vectors have been collected using Vision.Com, an electronic video store application that we have developed as a test-bed for research purposes. The data collected have been fed into each of the four clustering algorithms and the results have been compared. Each algorithm produced clusters of users’ interests with respect to characteristics of the movies in the video store application.

1 Introduction

Adaptivity provides users with individualised assistance that is dynamically generated. In the context of an e-shopping application, adaptivity can enhance the system by individualising its sales operation. This means that the system may collect information about customers’ preferences and interests, process this information and provide personalised sales recommendations and assistance. This procedure needs a user modelling component. The adaptive responses to users are usually created using the technology of adaptive hypermedia [1]. One common approach to user modelling is the categorisation of users into groups of similar behaviour. In this way, if a user is found to belong to a particular group of users of similar behaviour, then the system may expect this user to have preferences and behaviour similar to those of the other members of the same group. Thus, a system may draw quick inferences about an individual user based on the behaviour of other users. Such a technique has often been used in user stereotypes [12]. User stereotypes depict clusters of users with similar behaviour in the context of a particular application. The construction of user stereotypes is not an easy task. One solution to this problem can be provided by clustering algorithms that may group users dynamically based on their behaviour while they use a system on-line. The main advantage of such an approach is that the categorisation of user behaviour can be conducted automatically.

In the context of an e-commerce application it is very helpful to acquire information about users’ interests and then group users with similar interests in products. Clustering algorithms can undertake the role of grouping users in an efficient way, thus creating the backbone of the user model. There are many web-based recommendation applications that have used clustering algorithms (e.g. [2–7, 10, 11]). There are also many clustering algorithms that could be used, and it is not clear in advance which algorithm will yield better results. Therefore, one solution to the problem of selecting the most appropriate clustering algorithm for a particular application is to compare the prospective algorithms after having applied each of them in a prototype version of the system. In this paper we examine and compare four different clustering algorithms, namely hierarchical, fuzzy c-means, spectral and AIN-based clustering, in the context of an e-shopping application. The e-shopping application is a personalised electronic video store named Vision.Com. To compare the four clustering algorithms and select the most appropriate one for our application, we used a prototype version of the electronic video store to collect data on users’ behaviour. Thus, the prototype version of Vision.Com has been used by 150 users. Their behaviour within the e-shop application has been collected in users’ protocols. These protocols constitute data that have been fed to each clustering algorithm separately. The results of each algorithm have been compared in terms of the clarity of the users’ groups.

Fig. 1. The Vision.Com animated agent.

2 Vision.Com and Experimental Customer Data Description

The system we developed and used as a test-bed is called Vision.Com and is an adaptive e-commerce video store that learns from customers’ preferences. Its aim is to provide help to customers in choosing the best movie for them. In order to help users understand the system more quickly, Vision.Com uses an animated agent (Fig. 1) that provides help whenever it is needed. The web-based system ran on a local network with IIS playing the role of the web server. We used this setup in order to avoid network problems at peak hours.

In Vision.Com every customer can visit a large number of movies by navigating through four movie categories: social, action, thriller and comedy movies. Every customer has a personal shopping cart. A customer who intends to buy a movie simply moves it into his/her cart by pressing the corresponding button. He/she can also remove one or more movies from the cart by choosing to delete them. After deciding which movies to buy, a customer can purchase them by pressing the buy button.

All navigational moves of a customer are recorded by the system in the statistics database. In this way Vision.Com saves statistics concerning the visits to the different categories of movies and to individual movies. The same type of statistics was saved for every customer and every movie that was moved to the buyer’s cart, and likewise for the movies that were eventually bought by every customer. All of these statistical results are scaled to the unit interval [0,1].

In particular, Vision.Com interprets users’ actions in a way that results in the calculation of users’ interests in individual movies and movie categories. Each user action contributes to the individual user profile by showing degrees of interest in one or another movie category or individual movie. For example, a user’s visit to a movie shows interest of this user in the particular movie and its category. If the user puts this movie into the shopping cart, this shows more interest in the particular movie and its category.
If a user buys this movie then this shows even more interest, whereas if the user takes it out of the shopping cart before payment then there is no increase in the interest counter.

Apart from the movie categories already presented, the other movie features taken into consideration are the following: price range, leading actor and director. The price of every movie belongs to one of five price ranges in euro: 20 to 25, 26 to 30, 31 to 35, 36 to 40 and over 41. As a consequence, every customer’s interest in one of the above features is recorded as a percentage of his/her visits to movie pages. For example, the interest of a customer in a particular bought movie is calculated as in Eq. (1):

$$\mathrm{InterestOnBoughtMovie} = \frac{\mathrm{VisitsOnBoughtMovie}}{\mathrm{VisitsOnAllBoughtMovies}} \qquad (1)$$

Vision.Com was used by 150 users who bought movies using this particular system. The system collected data about the users’ behaviour. The data collected consisted of three parts, all of which have the same structure. The first part contains statistical data on the visits that every user made to specific movies. The second part contains data on the cart moves (i.e. which movies the user moved into his/her cart). The last part consists of statistical data concerning the preferences for the movies bought by every user. Every record in every part is a vector of the same 80 features, which were extracted from the movies’ characteristics, and represents the preferences of one user. The 80 features of these vectors are the movie features we described above. Every 80-feature vector consists of the four movie categories, the five price ranges, all the leading actors and all the directors. The value of each feature is the percentage of interest of the individual customer in this particular feature (Eq. (1)).
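Eq. (1) amounts to dividing a per-item counter by the total over all items, which places every value in the unit interval. The following is a minimal Python sketch of that normalisation, assuming a hypothetical log with one label per recorded visit; the names `interest_shares` and `bought_visits` are illustrative and are not part of Vision.Com.

```python
from collections import Counter

def interest_shares(events):
    """Turn a list of labels (one per recorded action) into shares in [0, 1]."""
    counts = Counter(events)
    total = sum(counts.values())
    if total == 0:
        return {}  # no recorded actions, so no interest can be computed
    # Same form as Eq. (1): visits on one item divided by visits on all items.
    return {label: n / total for label, n in counts.items()}

# Hypothetical log: visits to the pages of the movies a customer has bought.
bought_visits = ["movie_a", "movie_a", "movie_b", "movie_a"]
shares = interest_shares(bought_visits)
```

The same function could be applied to category, price range, actor or director labels, which is one plausible way to fill the entries of a profile vector.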

3 Clustering Algorithms

In this paper, attention was focused on the clustering capabilities provided by the novel computational paradigm of artificial immune systems (AIS) through the development of an AIN. Specifically, we tested and compared three widely used clustering techniques [10], namely a) agglomerative hierarchical clustering, b) fuzzy c-means clustering and c) spectral clustering, against AIS-based clustering.

3.1 Agglomerative Hierarchical Clustering

This clustering procedure produces a hierarchy of nested clusterings that are best visualized with the utilization of a dendrogram whose leaves coincide with the initial data points to be clustered. More specifically, agglomerative hierarchical clustering involves an iterative procedure, which begins with a number of clusters which equals the population of the initial data points and terminates at a single cluster containing the complete set of the given data points. The formation of groups is dominated by the definition of suitable proximity measures that estimate firstly the similarity between points and secondly the similarity between points and groups of points. The dendrogram produced by hierarchical clustering applied to the feature vectors of our application is shown in Fig. 2.

[Figure: "Hierarchical Clustering based Dendrogram for the complete data set"]

Fig. 2. Hierarchical clustering-based dendrogram of 150 customer profile feature vectors.
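The bottom-up merging procedure described above can be sketched in a few lines of Python. The sketch assumes Euclidean distance between points and single linkage (minimum point-to-point distance) between groups; the paper does not state which proximity measures were used, so both choices, as well as the toy data, are assumptions made only for illustration.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []  # (members_a, members_b, distance) in merge order

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges

# Toy two-dimensional "profiles"; the real vectors have 80 features.
toy_profiles = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.15, 1.0), (0.0, 0.2)]
clusters, merges = single_linkage(toy_profiles, n_clusters=2)
```

The `merges` list records the order and height of each merge, which is the information a dendrogram displays; running the loop down to `n_clusters=1` would yield the full hierarchy.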

3.2 Fuzzy c-Means Clustering Algorithms

This clustering technique is derived from the optimization of a cost functional which depends on both the original data and an unknown vector of parameters, which can be interpreted as identifiable centers of the initial data points. Moreover, the resulting clusters are fuzzy, in the sense that each initial data vector is not assigned to a unique cluster, but is rather assigned a different degree of membership in every cluster identified by the algorithm.

[Figure: "Fuzzy c means based data points clusters" (two-dimensional scatter plot)]

Fig. 3. Plot of the spatial distribution of the two-dimensional projection of the 150 data points, as clustered by fuzzy c-means clustering. Different clusters are indicated with different marks.
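As an illustration of fuzzy membership, the sketch below alternates the two standard fuzzy c-means updates: memberships computed from relative distances to the centers, and centers computed as membership-weighted means. The fuzzifier `m=2.0`, the fixed iteration count, the initial centers and the toy data are all assumptions for the example; the paper does not report the settings used on the Vision.Com data.

```python
import math

def memberships_for(points, centers, m):
    """Degree of membership of every point in every cluster; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp tiny distances so a point lying on a center does not divide by zero.
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        rows.append([1.0 / sum((d_i / d_j) ** power for d_j in dists)
                     for d_i in dists])
    return rows

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            # New center: mean of the points weighted by membership ** m.
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)))
        centers = new_centers
    return centers, memberships_for(points, centers, m)

toy_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
              (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
centers, memberships = fuzzy_c_means(toy_points, [(0.0, 0.0), (1.0, 1.0)])
```

Each row of `memberships` holds one point's degrees of membership across the clusters, so a point near a center gets a value close to 1 for that cluster and small values elsewhere.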

3.3 Spectral Clustering Algorithms

Spectral clustering constitutes a recent clustering approach which is based on spectral graph theory and interprets the data as graph nodes and their distances as the corresponding connecting weights. In the context of spectral clustering, the formation of data groups involves the utilization of eigenvectors computed from matrices that are derived from the distances between the data points. This clustering method is based on obtaining a data representation in a lower-dimensional space where the original data points can be more easily clustered with the application of traditional clustering techniques, such as the standard c-means algorithm. In Fig. 4, we show a plot of the spatial distribution of 150 customer profile data points obtained by reducing the original 80-dimensional feature vectors to their 2-dimensional projections obtained with a principal component analysis algorithm. In Fig. 4, we also show the dendrogram produced by spectral clustering applied to the feature vectors of our application.
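The reduction of 80-dimensional vectors to 2-dimensional projections can be sketched with a singular value decomposition of the centered data, which is one common way to compute principal components. The sketch assumes NumPy is available and uses random placeholder data rather than the Vision.Com profiles; the function name `pca_project` is illustrative.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal components."""
    x = np.asarray(vectors, dtype=float)
    centered = x - x.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Placeholder data with the same shape as the profile set: 150 vectors, 80 features.
rng = np.random.default_rng(0)
placeholder_profiles = rng.random((150, 80))
projected = pca_project(placeholder_profiles)
```

The two columns of `projected` are the coordinates that a scatter plot of the kind described above would display.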

[Figure: left panel, "Spectral Clustering based data points clusters" (two-dimensional scatter plot); right panel, "Spectral Clustering based Dendrogram for the complete data set"]

Fig. 4. Plot of the spatial distribution of the two-dimensional projection of the 150 data points, as clustered by spectral clustering and corresponding dendrogram. Different clusters are indicated with different marks.
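The use of eigenvectors for grouping can be illustrated with the simplest spectral method, a two-way split by the sign of the second eigenvector (Fiedler vector) of the graph Laplacian. This is a sketch under assumptions that the paper does not state: a Gaussian similarity derived from Euclidean distances, the unnormalized Laplacian, the width `sigma=0.5`, and a sign split standing in for the c-means step on the low-dimensional representation. It requires NumPy.

```python
import numpy as np

def fiedler_split(points, sigma=0.5):
    """Two-way spectral partition using the sign of the Fiedler vector."""
    x = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances between all points.
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    # Gaussian similarity: close points get edge weights near 1, distant ones near 0.
    weights = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)
    laplacian = np.diag(weights.sum(axis=1)) - weights
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]  # column 0 is the constant eigenvector
    return fiedler > 0

toy_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
              (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = fiedler_split(toy_points)
```

For more than two clusters, the usual extension keeps several of the leading eigenvectors as new coordinates and clusters the rows of that matrix with a conventional algorithm.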

3.4 AIS-based Clustering Algorithms

The core idea behind the development of an AIN is to generate a minimal set of representative points that capture the properties of the original data set and can be interpreted as the centers of the initial feature vectors. In the context of AIS [11], this set of representative points in a multidimensional feature space constitutes a set of memory antibodies that recognize, in the sense of Euclidean distance proximity, a given antigenic population consisting of the complete data set. The produced set of memory antibodies provides an alternative, more compact way of representing the given data set while conserving its original spatial distribution. In the present e-shop application, the antigenic population consists of the initial set of customer profile feature vectors. In Fig. 5, we show a plot of the spatial distribution of 22 representative points obtained by reducing the original 80-dimensional antibodies to their 2-dimensional projections obtained with a principal component analysis algorithm. Clearly, the set of representative antibodies in Fig. 5 maintains the spatial structure of the complete data set in Figs. 3 and 4, but, at the same time, forms a minimum representation of 150 feature vectors with only 22 antibodies. This indicates significant data compression, combined with clear revelation and visualization of the intrinsic data classes. In addition, the application of traditional clustering algorithms to the set of memory antibodies takes advantage of the realized redundancy reduction, yielding a plain revelation of the intrinsic data clusters present in the set of customer profiles. This latter fact is also demonstrated in Fig. 5, in which the dendrogram produced by hierarchical clustering applied to the memory antibody feature vectors of our application is shown.

[Figure: left panel, "Memory antibodies projected in 2 dimensions" (two-dimensional scatter plot); right panel, "AIN based Hierarchical Dendrogram"]

Fig. 5. Plot of the spatial distribution of the projection of the 22 AIN-produced memory antibodies onto two dimensions and corresponding hierarchical clustering-based dendrogram.
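The compression idea, many data points represented by a few points that each "recognize" a Euclidean neighbourhood, can be illustrated with a deliberately simplified stand-in. The sketch below is not the AIN algorithm used in this paper: it omits the cloning, mutation and selection that an immune network performs and applies only a hypothetical greedy distance-threshold rule. The function name `suppress`, the threshold value and the toy data are assumptions.

```python
import math

def suppress(candidates, threshold):
    """Keep a candidate only if it is farther than threshold from all kept ones."""
    kept = []
    for c in candidates:
        if all(math.dist(c, k) > threshold for k in kept):
            kept.append(c)
    return kept

def is_recognized(point, representatives, threshold):
    """True if some representative lies within threshold of the point."""
    return any(math.dist(point, r) <= threshold for r in representatives)

# Toy "antigens": two tight groups of three points each.
toy_antigens = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05),
                (1.0, 1.0), (1.05, 1.0), (1.0, 1.05)]
representatives = suppress(toy_antigens, threshold=0.2)
```

In this toy case six points collapse to two representatives, and a conventional clustering algorithm could then be run on the representatives instead of the full data set, in the spirit of the dendrogram in Fig. 5.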

4 Discussion of Results of Clustering Algorithms

We observe that spectral clustering (Fig. 4) does not provide a clearer revelation of the intrinsic similarities in the dataset than hierarchical clustering. On the other hand, the leaves in the AIN-based dendrogram in Fig. 5 are significantly fewer than the leaves in either the hierarchical (Fig. 2) or the spectral (Fig. 4) dendrograms, which stems from the fact that the former corresponds to clustering only 22 representative points in the 80-dimensional feature space, while the latter two correspond to clustering the complete set of 150 data points. Thus, the AIN-based dendrogram demonstrates the intrinsic data point clusters significantly more clearly and compactly than the corresponding hierarchical and spectral dendrograms. Also, we observe that spectral clustering does not result in cluster homogeneity, while fuzzy c-means clustering results in higher cluster homogeneity, but in only four clusters rather than the six required. Specifically, we observed that fuzzy c-means clustering assigned the same degree of cluster membership to all the data points, which implies that certain intrinsic data dissimilarities were not captured by the fuzzy c-means clustering algorithm, and this makes the clustering result less useful. On the contrary, AIN-based clustering returned significantly higher cluster homogeneity. Moreover, the degree of intra-cluster consistency is clearly significantly higher in the AIN-based clusters than in the hierarchical and spectral clusters, which is a direct consequence of the data redundancy reduction achieved by the AIN. A third observation is that the set of representative antibodies in Fig. 5 maintains the spatial structure of the complete dataset in Figs. 3 and 4, but, at the same time, forms a minimum representation of 150 feature vectors with only 22 antibodies. This indicates significant data compression, combined with clear revelation and visualization of the intrinsic data classes.

5 Conclusions and Future Work

In this paper, we have compared four different clustering algorithms for creating groups of users’ interests in the context of an e-shopping application. The comparison has been performed using the protocols of 150 users who were asked to use a prototype version of Vision.Com. The clustering algorithm considered most appropriate was the one that gave clearer grouping results than the rest of the algorithms. Such results were achieved by the AIN-based algorithm, in contrast to hierarchical, fuzzy c-means and spectral clustering. Future work will concentrate on mapping the various customer profile data clusters identified via the clustering algorithms into user classes (“stereotypes”). This and other work is currently in progress and will be reported on a future occasion.

References

1. Brusilovsky, P.: Adaptive Hypermedia. User Modeling and User-Adapted Interaction 11. Kluwer Academic Publishers (2001) 87–110
2. Jin, X., Zhou, Y., Mobasher, B.: Unified Approach to Personalization Based on Probabilistic Latent Semantic Models of Web Usage and Content. Proceedings of the AAAI 2004 Workshop on Semantic Web Personalization (SWP’04). San Jose (2004)
3. Adil, C.S., Banaei, F.F., Faruque, K.J.: INSITE: A Tool for Real-Time Knowledge Discovery from Users Web Navigation. Proceedings of the 26th International Conference on Very Large Databases. Cairo, Egypt (2000)
4. Menczer, F., Monge, A.E., Street, W.N.: Adaptive Assistants for Customized E-Shopping. Journal of IEEE Intelligent Systems (2002) 12–19
5. Jin, X., Zhou, Y., Mobasher, B.: Web usage mining based on probabilistic latent semantic analysis. Proceedings of ACM SIGKDD (KDD’04) (2004) 197–205
6. Ajith, A.: Business Intelligence from Web Usage Mining. Journal of Information & Knowledge Management, Vol. 2, No. 4. iKMS & World Scientific Publishing Co (2004) 375–390
7. Wang, Q., Makaroff, D.J., Edwards, H.K.: Characterizing Customer Groups for an E-commerce Website. Proceedings of ACM Conference on Electronic Commerce (EC’04), New York, USA (2004)
8. Cayzer, S., Aickelin, U.: A Recommender System based on the Immune Network. Proceedings of the 2002 Congress on Evolutionary Computation (2002)
9. Morrison, T., Aickelin, U.: An Artificial Immune System as a Recommender for Web Sites. Proceedings of the 1st Conference on Artificial Immune Systems (ICARIS2002), Canterbury, UK (2002) 161–169
10. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. 3rd edn. Academic Press, San Diego (2006)
11. De Castro, L.N., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach. 1st edn. Springer-Verlag, London Berlin Heidelberg (2002)
12. Rich, E.: Users are individuals: individualizing user models. Int. Journal of Human-Computer Studies 52. Science Direct (1999) 323–338